As an EMT, I have spent many hours waiting in hospitals for a patient to be accepted. That got me thinking about what emergency department wait times look like across the U.S., and whether population size and/or density might have something to do with them. So I started looking around for datasets I could use. If you would like to see where I got my data: Census
The Centers for Medicare & Medicaid Services (CMS)
These two sources provided all the information I needed to complete this analysis.
These are the imports I used:
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt #for plotting
from sklearn.linear_model import LinearRegression #for linear regression
I used this dictionary to simplify converting between state names and state codes (and, reversed, back again).
us_state_abbrev = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}
abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))
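For example, the reversed mapping turns a two-letter code back into the full state name (a minimal demo using a truncated copy of the dictionary):

```python
# Truncated copy of the dictionary above, just for illustration.
us_state_abbrev = {'Alabama': 'AL', 'Alaska': 'AK'}
abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))

print(us_state_abbrev['Alabama'])   # AL
print(abbrev_us_state['AL'])        # Alabama
```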
First things first, I had to read in the data from the websites I was using. For CMS it was simply a matter of reading the CSVs. For the Census Bureau, however, I had to request an API key and use it to query the variables I was interested in. You can take a look at some of the other available variables here. The API returns JSON that I could then convert into a pandas dataframe.
state_care = pd.read_csv("https://data.cms.gov/provider-data/sites/default/files/resources/39df8f4e604a58f85211278815a2a0b2_1603488877/apyc-v239.csv")
national_care = pd.read_csv("https://data.cms.gov/provider-data/sites/default/files/resources/8bf27b1e43c5de68678f4e496c304726_1603488876/isrn-hqyy.csv")
hospital_care = pd.read_csv("https://data.cms.gov/provider-data/sites/default/files/resources/b466805e3dd99ae225e9af4691de2487_1603488876/yv7e-xc69.csv")
API = "https://api.census.gov/data/2019/pep/population"
api_key = "df5dc91b30208fe1edaaf2e2174cbe477cee1cbb"
PANDA_AUTH_PARAMS = {
'api_key': 'df5dc91b30208fe1edaaf2e2174cbe477cee1cbb'
}
year='2019'
dsource='pep'
dname='population'
cols='DATE_CODE,DENSITY,POP,NAME'
base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'
data_url = f'{base_url}?get={cols}&for=county:*&key={api_key}'
response=requests.get(data_url)
r = response.json()
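The response comes back as a list of rows, where the first row is the header. A tiny sketch of that shape and how it becomes a dataframe (the sample values here are made up for illustration; the real `r` above has one row per county estimate):

```python
import pandas as pd

# Illustrative sample of the Census payload shape (values are placeholders).
r = [
    ['DATE_CODE', 'DENSITY', 'POP', 'NAME', 'state', 'county'],
    ['12', '93.985', '55869', 'Autauga County, Alabama', '01', '001'],
]

# First row is the header, the rest are data rows.
df = pd.DataFrame(r[1:], columns=r[0])
```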
Next, I only wanted to look at a subset of these values. The measure ID OP_18b corresponds to the average (median) time patients spent in the emergency department before leaving, where a lower number of minutes is better. I then replaced any "Not Available" values with NaN so they would not interfere with processing. Finally, to help with future merging, I renamed a few columns and dropped others that contained extraneous information.
hospital_ed = hospital_care[hospital_care['Condition'] == "Emergency Department"]
hospital_times = hospital_ed[hospital_ed['Measure ID'] == 'OP_18b']
hospital_times = hospital_times.replace('Not Available', np.nan )
hospital_times.rename(columns={'County Name':'County'},inplace=True)
hospital_times = hospital_times.drop(['Measure Name', 'Phone Number', 'Address', 'City', 'ZIP Code'], axis=1)
hospital_times = hospital_times.replace({'State':abbrev_us_state})
hospital_times
Facility ID | Facility Name | State | County | Condition | Measure ID | Score | Sample | Footnote | Start Date | End Date | |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 010001 | SOUTHEAST ALABAMA MEDICAL CENTER | Alabama | HOUSTON | Emergency Department | OP_18b | 178 | 349 | NaN | 01/01/2019 | 12/31/2019 |
22 | 010005 | MARSHALL MEDICAL CENTERS | Alabama | MARSHALL | Emergency Department | OP_18b | 115 | 1408 | NaN | 01/01/2019 | 12/31/2019 |
41 | 010006 | NORTH ALABAMA MEDICAL CENTER | Alabama | LAUDERDALE | Emergency Department | OP_18b | 152 | 362 | NaN | 01/01/2019 | 12/31/2019 |
60 | 010007 | MIZELL MEMORIAL HOSPITAL | Alabama | COVINGTON | Emergency Department | OP_18b | 108 | 1295 | NaN | 01/01/2019 | 12/31/2019 |
79 | 010008 | CRENSHAW COMMUNITY HOSPITAL | Alabama | CRENSHAW | Emergency Department | OP_18b | 91 | 335 | NaN | 01/01/2019 | 12/31/2019 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89987 | 670128 | BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE | Texas | TRAVIS | Emergency Department | OP_18b | 126 | 286 | 3 | 01/01/2019 | 12/31/2019 |
90006 | 670129 | THE HEIGHTS HOSPITAL | Texas | HARRIS | Emergency Department | OP_18b | NaN | NaN | 19 | 01/01/2019 | 12/31/2019 |
90025 | 670130 | SOUTHCROSS HOSPITAL | Texas | BEXAR | Emergency Department | OP_18b | NaN | NaN | 19 | 01/01/2019 | 12/31/2019 |
90044 | 670131 | BAYLOR SCOTT & WHITE MEDICAL CENTER BUDA | Texas | HAYS | Emergency Department | OP_18b | NaN | NaN | 5 | 01/01/2019 | 12/31/2019 |
90063 | 670132 | METHODIST SOUTHLAKE HOSPITAL | Texas | TARRANT | Emergency Department | OP_18b | NaN | NaN | 19 | 01/01/2019 | 12/31/2019 |
4741 rows × 11 columns
I then had to do the same to the census data. If you look at the hospital times above, they are from 2019, so I decided to use the Bureau's 7/1/2019 population estimate, which has DATE_CODE set to 12. After that, I had to separate the County and State values and rename them for future merging. I also only wanted data from the 50 states and D.C., so I kept state FIPS codes below 57, which excludes the territories.
df19 = pd.DataFrame(r[1:], columns = r[0])
df19 = df19[df19['DATE_CODE'] == '12']
df19[['County','State']] = df19['NAME'].str.split(",",expand=True)
df19['County'] = df19['County'].str.split(" ",expand=True)[0]
df19['state'] = df19['state'].astype(int)
df19 = df19[df19['state'] < 57]
df19['County'] = df19['County'].str.upper()
df19 = df19.drop(['state', 'county', 'NAME'], axis=1)
df19['State'] = df19['State'].str.lstrip()
df19
DATE_CODE | DENSITY | POP | County | State | |
---|---|---|---|---|---|
11 | 12 | 93.98534601500000 | 55869 | AUTAUGA | Alabama |
23 | 12 | 140.41440206000000 | 223234 | BALDWIN | Alabama |
35 | 12 | 27.89353269000000 | 24686 | BARBOUR | Alabama |
47 | 12 | 35.97652981800000 | 22394 | BIBB | Alabama |
59 | 12 | 89.67624657200000 | 57826 | BLOUNT | Alabama |
... | ... | ... | ... | ... | ... |
37655 | 12 | 4.06090720020000 | 42343 | SWEETWATER | Wyoming |
37667 | 12 | 5.87062835570000 | 23464 | TETON | Wyoming |
37679 | 12 | 9.71600056740000 | 20226 | UINTA | Wyoming |
37691 | 12 | 3.48643886830000 | 7805 | WASHAKIE | Wyoming |
37703 | 12 | 2.88865121780000 | 6927 | WESTON | Wyoming |
3142 rows × 5 columns
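One caveat worth noting (my observation, not part of the original analysis): taking only the first word of the county name truncates multi-word counties, so "San Diego County" becomes "SAN", which silently drops those rows from the later merge. A sketch of a more robust cleanup, assuming the goal is just to strip the trailing designator:

```python
import pandas as pd

# Strip the trailing county-type designator instead of keeping only the
# first word, so multi-word county names survive intact.
names = pd.Series(['Autauga County', 'San Diego County',
                   'East Baton Rouge Parish'])
cleaned = (names
           .str.replace(r'\s+(County|Parish|Borough|Census Area|Municipality|city)$',
                        '', regex=True)
           .str.upper())
```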
Here is the last of the data tidying, where we can merge the two dataframes from before. We perform an inner join on county names, then filter out any rows where the state names differ, since those are mismatches. Next we drop several columns to be left with the data we want, namely county, population, and density. We also need to drop any NaN scores, since they will interfere with modeling.
combined = pd.merge(hospital_times, df19, on = ['County'], how='inner')
combined = combined[combined['State_x'] == combined['State_y']]
combined = combined.drop(['Start Date','End Date','DATE_CODE','Condition','Measure ID','State_y', 'Footnote', 'Sample'],axis=1)
combined.sort_values(by=['Facility ID'], inplace=True)
combined = combined.loc[combined['Score'].notnull(),:]
combined['POP'] = combined['POP'].astype(int)
combined['DENSITY'] = combined['DENSITY'].astype(float)
combined['Score'] = combined['Score'].astype(int)
combined
Facility ID | Facility Name | State_x | County | Score | DENSITY | POP | |
---|---|---|---|---|---|---|---|
0 | 010001 | SOUTHEAST ALABAMA MEDICAL CENTER | Alabama | HOUSTON | 178 | 182.600572 | 105882 |
30 | 010005 | MARSHALL MEDICAL CENTERS | Alabama | MARSHALL | 115 | 171.025622 | 96774 |
174 | 010006 | NORTH ALABAMA MEDICAL CENTER | Alabama | LAUDERDALE | 152 | 138.813133 | 92729 |
192 | 010007 | MIZELL MEMORIAL HOSPITAL | Alabama | COVINGTON | 108 | 35.950409 | 37049 |
201 | 010008 | CRENSHAW COMMUNITY HOSPITAL | Alabama | CRENSHAW | 91 | 22.619139 | 13772 |
... | ... | ... | ... | ... | ... | ... | ... |
20076 | 670112 | LEGENT ORTHOPEDIC + SPINE | Texas | BEXAR | 102 | 1615.369745 | 2003554 |
20040 | 670116 | WISE HEALTH SYSTEM | Texas | TARRANT | 102 | 2434.430253 | 2102515 |
20017 | 670118 | FIRST TEXAS HOSPITAL | Texas | HARRIS | 65 | 2762.549219 | 4713325 |
2562 | 670122 | HOUSTON METHODIST THE WOODLANDS HOSPITAL | Texas | MONTGOMERY | 196 | 582.746985 | 607391 |
20063 | 670128 | BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE | Texas | TRAVIS | 126 | 1283.926548 | 1273954 |
3611 rows × 7 columns
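An alternative sketch: merging on both keys at once should be equivalent to the merge-then-filter above, and it avoids the `State_x`/`State_y` suffixes entirely (toy frames here, not the real data):

```python
import pandas as pd

# Toy stand-ins for hospital_times and df19; 'HOUSTON' exists as a county
# in more than one state, which is exactly the case the state check guards.
hospital_times = pd.DataFrame({'County': ['HOUSTON', 'HOUSTON'],
                               'State': ['Alabama', 'Texas'],
                               'Score': [178, 140]})
df19 = pd.DataFrame({'County': ['HOUSTON'],
                     'State': ['Alabama'],
                     'POP': [105882]})

# Joining on County AND State keeps only true county/state matches.
combined = pd.merge(hospital_times, df19, on=['County', 'State'], how='inner')
```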
First I wanted to take a look at the number of hospitals in a state versus the population of that state. One thing to note, however, is that state lines are not impenetrable: someone near a border may be closer to a hospital in a state they do not live in, which will throw this off slightly. I will assume this effect is minimal and barely affects the overall trend. I will also analyze the average score and density to help pinpoint positive and negative correlations between these variables.
hospital_counts = combined['State_x'].value_counts()
pop_hosp_tot = []
for x in combined['State_x'].unique():
pop_hosp_tot.append([x,combined.loc[combined['State_x'] == x, 'POP'].sum()
,hospital_counts[x],
combined.loc[combined['State_x'] == x, 'Score'].sum()/
combined.loc[combined['State_x'] == x, 'Score'].count(),
combined.loc[combined['State_x'] == x, 'DENSITY'].sum()/
combined.loc[combined['State_x'] == x, 'DENSITY'].count()])
pop_hosp = pd.DataFrame(pop_hosp_tot,columns=['State','Pop','Hospitals'
, 'Avg Score', 'Avg Density'])
pop_hosp['Ratio'] = pop_hosp['Hospitals']/pop_hosp['Pop']
pop_hosp.sort_values(by=['Avg Score'], inplace=True, ascending = False)
pop_hosp
State | Pop | Hospitals | Avg Score | Avg Density | Ratio | |
---|---|---|---|---|---|---|
19 | Maryland | 15150042 | 31 | 211.032258 | 1784.414434 | 2.046199e-06 |
7 | Delaware | 883461 | 4 | 206.750000 | 264.772830 | 4.527648e-06 |
2 | Arizona | 129685996 | 58 | 197.017241 | 245.777782 | 4.472341e-07 |
38 | Rhode Island | 4331114 | 10 | 186.500000 | 1190.062635 | 2.308875e-06 |
20 | Massachusetts | 44642419 | 56 | 180.696429 | 3166.017399 | 1.254412e-06 |
31 | New York | 109277024 | 139 | 172.143885 | 6431.587267 | 1.271997e-06 |
44 | Vermont | 541822 | 10 | 171.700000 | 84.024468 | 1.845625e-05 |
29 | New Jersey | 35300524 | 62 | 167.951613 | 3644.686575 | 1.756348e-06 |
4 | California | 152972049 | 126 | 167.801587 | 1007.307594 | 8.236799e-07 |
32 | North Carolina | 22491283 | 97 | 160.226804 | 423.328528 | 4.312782e-06 |
30 | New Mexico | 4311821 | 26 | 155.653846 | 121.600463 | 6.029935e-06 |
6 | Connecticut | 12068420 | 19 | 154.473684 | 966.198446 | 1.574357e-06 |
37 | Pennsylvania | 64913124 | 142 | 154.126761 | 1422.763326 | 2.187539e-06 |
46 | Washington | 52481501 | 76 | 153.552632 | 372.033680 | 1.448129e-06 |
12 | Illinois | 245638526 | 134 | 151.776119 | 2012.769823 | 5.455170e-07 |
39 | South Carolina | 12045759 | 55 | 151.618182 | 277.795249 | 4.565922e-06 |
28 | New Hampshire | 4552344 | 26 | 150.307692 | 218.243273 | 5.711343e-06 |
8 | Florida | 135633233 | 152 | 147.842105 | 952.775603 | 1.120669e-06 |
36 | Oregon | 13011266 | 49 | 146.551020 | 346.169929 | 3.765967e-06 |
21 | Michigan | 47391201 | 114 | 145.912281 | 630.643869 | 2.405510e-06 |
9 | Georgia | 20225482 | 121 | 145.528926 | 448.028701 | 5.982552e-06 |
18 | Maine | 3369009 | 31 | 143.548387 | 101.596881 | 9.201519e-06 |
45 | Virginia | 9629912 | 50 | 142.660000 | 1052.048337 | 5.192155e-06 |
27 | Nevada | 43422334 | 32 | 138.718750 | 176.757875 | 7.369479e-07 |
16 | Kentucky | 7278731 | 81 | 138.580247 | 268.115050 | 1.112831e-05 |
41 | Tennessee | 18174315 | 91 | 138.384615 | 371.911466 | 5.007066e-06 |
47 | West Virginia | 2102632 | 45 | 138.000000 | 129.284741 | 2.140175e-05 |
34 | Ohio | 52559883 | 134 | 135.529851 | 860.635727 | 2.549473e-06 |
5 | Colorado | 14820347 | 61 | 135.508197 | 603.192152 | 4.115963e-06 |
13 | Indiana | 19084340 | 99 | 135.383838 | 458.362314 | 5.187499e-06 |
0 | Alabama | 12444788 | 83 | 133.951807 | 160.628710 | 6.669459e-06 |
42 | Texas | 306484519 | 303 | 133.907591 | 879.917914 | 9.886307e-07 |
24 | Missouri | 10988076 | 81 | 132.925926 | 228.848861 | 7.371627e-06 |
48 | Wisconsin | 16571102 | 107 | 127.850467 | 382.330887 | 6.457024e-06 |
3 | Arkansas | 6983233 | 69 | 126.130435 | 134.219544 | 9.880810e-06 |
17 | Louisiana | 8446680 | 67 | 125.492537 | 352.216582 | 7.932111e-06 |
22 | Minnesota | 16172825 | 102 | 124.980392 | 382.539426 | 6.306876e-06 |
49 | Wyoming | 563115 | 18 | 124.388889 | 7.883207 | 3.196505e-05 |
1 | Alaska | 1310695 | 8 | 122.875000 | 86.461586 | 6.103632e-06 |
11 | Idaho | 2516659 | 35 | 120.200000 | 79.945982 | 1.390733e-05 |
10 | Hawaii | 9892129 | 21 | 118.666667 | 676.588499 | 2.122900e-06 |
43 | Utah | 5622009 | 29 | 115.517241 | 224.833064 | 5.158298e-06 |
14 | Iowa | 4648481 | 82 | 115.109756 | 94.781697 | 1.764017e-05 |
23 | Mississippi | 4069587 | 67 | 114.373134 | 96.115247 | 1.646359e-05 |
35 | Oklahoma | 21169792 | 93 | 113.333333 | 344.256353 | 4.393052e-06 |
25 | Montana | 1498675 | 47 | 109.936170 | 11.623847 | 3.136104e-05 |
15 | Kansas | 7085578 | 119 | 109.613445 | 112.425140 | 1.679468e-05 |
40 | South Dakota | 1137267 | 33 | 108.848485 | 39.408019 | 2.901693e-05 |
26 | Nebraska | 5341370 | 79 | 103.189873 | 178.843382 | 1.479021e-05 |
33 | North Dakota | 929872 | 37 | 96.351351 | 15.241909 | 3.979042e-05 |
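The per-state loop above can also be written as a single groupby aggregation. A sketch, using a toy stand-in for `combined` (column names assumed from the code above; since NaN scores were already dropped, counting rows matches the original hospital count):

```python
import pandas as pd

# Toy stand-in for the `combined` frame built earlier.
combined = pd.DataFrame({
    'State_x': ['Alabama', 'Alabama', 'Texas'],
    'POP':     [105882, 96774, 2003554],
    'Score':   [178, 115, 102],
    'DENSITY': [182.6, 171.0, 1615.4],
})

pop_hosp = (combined
            .groupby('State_x')
            .agg(Pop=('POP', 'sum'),
                 Hospitals=('Score', 'size'),
                 **{'Avg Score': ('Score', 'mean'),
                    'Avg Density': ('DENSITY', 'mean')})
            .reset_index()
            .rename(columns={'State_x': 'State'}))
pop_hosp['Ratio'] = pop_hosp['Hospitals'] / pop_hosp['Pop']
```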
plt.title('Hospital Count vs. Population Size')
plt.xlabel('Population')
plt.ylabel('Hospital Count')
x = pop_hosp['Pop']
y = pop_hosp['Hospitals']
m, b = np.polyfit(x, y, 1)  # fit once, reuse the slope and intercept
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
Based on the linear fit, it is easy to see that a larger population is associated with more hospitals. While this seems like an obvious statement, I had to start somewhere.
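As an aside, the `LinearRegression` imported at the top goes unused; it produces the same least-squares line as `np.polyfit`. A small sketch with toy data (not the real Pop/Hospitals values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# sklearn expects a 2-D feature matrix, hence the reshape.
model = LinearRegression().fit(x.reshape(-1, 1), y)
m, b = np.polyfit(x, y, 1)
# model.coef_[0] matches m, and model.intercept_ matches b.
```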
plt.title('Hospital Ratio vs. Population Size')
plt.xlabel('Population')
plt.ylabel('Hospital Ratio')
x = pop_hosp['Pop']
y = pop_hosp['Ratio']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
Looking at the ratio of hospitals to population versus population, however, yields different results. There looks to be a negative correlation: the number of hospitals per person falls as population increases. There are any number of reasons for this, but one explanation could be that states with larger populations have more cities, and where population is more concentrated you get larger hospitals, but fewer of them.
plt.title('Hospital Ratio vs. Hospital Count')
plt.xlabel('Hospital Count')
plt.ylabel('Hospital Ratio')
x = pop_hosp['Hospitals']
y = pop_hosp['Ratio']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
So based on this graph I can see that as the number of hospitals increases, the number of hospitals per person decreases. This is in line with what was observed in the previous two graphs: a positive correlation between population size and hospital count, combined with a negative correlation between the ratio and population, would indicate a negative correlation between the ratio and hospital count. Nothing surprising here.
plt.title('Avg Score vs. Population Size')
plt.xlabel('Population')
plt.ylabel('Avg Score')
x = pop_hosp['Pop']
y = pop_hosp['Avg Score']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
It does seem that the higher the population, the longer the wait in an emergency room. That starts to make sense, considering there are also fewer hospitals per person to admit people, even if those hospitals are larger.
plt.title('Avg Score vs. Hospitals')
plt.xlabel('Hospitals')
plt.ylabel('Avg Score')
x = pop_hosp['Hospitals']
y = pop_hosp['Avg Score']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
Here we can see that as the number of hospitals increases, the score tends to decrease. That suggests more hospitals is a good thing: more hospitals means more options and less waiting around to get admitted.
plt.title('Avg Score vs. Hospital Ratio')
plt.xlabel('Ratio')
plt.ylabel('Avg Score')
x = pop_hosp['Ratio']
y = pop_hosp['Avg Score']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
This graph shows a stronger slope for score versus ratio. This means that a better hospital-per-person ratio is associated with shorter wait times. This again makes sense: if there are more options per person, then patients can be moved through faster.
plt.title('Hospital Ratio vs. Population Density')
plt.xlabel('Density')
plt.ylabel('Hospital Ratio')
x = pop_hosp['Avg Density']
y = pop_hosp['Ratio']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
This graph shows that higher population density comes with a lower hospital ratio. That is not what you want, and it suggests my earlier assumption that high-density states have larger but fewer hospitals may be correct.
plt.title('Avg Score vs. Population Density')
plt.xlabel('Density')
plt.ylabel('Avg Score')
x = pop_hosp['Avg Density']
y = pop_hosp['Avg Score']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, color='red')
plt.scatter(x, y)
plt.show()
This graph also tells a clear story: higher population density tends to come with a higher average score, which indicates that higher population density is associated with longer wait times.
Based on all of this, we can reasonably say that a greater number of hospitals is beneficial, which suggests that even smaller hospitals help. None of this accounts for severity, but if we assume an equal distribution of high- to low-severity cases, then lower-severity patients could go to the smaller hospitals, freeing up the larger hospitals for the more serious conditions. So my recommendation would be that, in areas of higher population density, increasing the number of smaller hospitals would help reduce wait times.