Friday, December 20, 2013

HCUP and Python Pt 5: Nulls and Pre-Analysis

This is the fifth post in a series on working with Healthcare Cost and Utilization Project (HCUP) datasets using Python and other open-source alternatives to SAS and SPSS, which are the two primary data tools supported by HCUP.

Please note! These instructions build off previous posts. In particular, they assume you are configured with Canopy and PyHCUP, have some basic facility with Python/IPython, and have already read in a small HCUP dataset.

IPython Notebook from this Tutorial

You can download an IPython notebook with code examples from this post.

Now that you have some HCUP data read in, you'll want to do a little housekeeping to get it ready for meaningful analysis. This post will cover two data wrangling tasks you're probably all-too-familiar with: identifying missing values and characterizing/verifying the contents of the data set.

HCUP Sentinel Values

Files from HCUP use several different sentinel values to mark missing, invalid, or otherwise unavailable data. For a full breakdown, see HCUP's documentation on missing values at http://www.hcup-us.ahrq.gov/db/coding.pdf. In most of my work, the details of these codes are not vital; I just need to mark them as null in the data set.
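Conceptually, the replacement is just a pandas .replace() mapping each sentinel code to NaN, pandas' missing-value marker. Here is a minimal sketch with a toy frame; the -88 and -99 codes are illustrative stand-ins, not an exhaustive list of HCUP's sentinels.

```python
import numpy as np
import pandas as pd

# Toy data standing in for an HCUP extract; the sentinel codes
# below are illustrative only.
df = pd.DataFrame({"AGE": [34, -88, 61], "TOTCHG": [9669, 18253, -99]})

# Map each sentinel value to NaN in every column at once.
sentinels = [-88, -99]
cleaned = df.replace(sentinels, np.nan)

print(cleaned)
```

PyHCUP's parser does this sweep for you, with the full set of HCUP codes, as shown next.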

Replacing Sentinel Values

PyHCUP has a built-in function in its parser module to sweep through an imported HCUP data set and mark all the sentinels as missing values. To use it, pass your imported data set to the parser.replace_df_sentinels() function. This may take some time, depending on the size of the data set you've imported and the speed of your computer.

In [4]: nulls_replaced = pyhcup.parser.replace_df_sentinels(df)

Check the AGE column (or whichever column you used .describe() with previously) in the replaced and non-replaced data to see the effects.

In [5]: df['AGE'].describe() # before
Out[5]:
count    10000.000000
mean        43.053200
std         25.857201
min        -88.000000
25%         24.000000
50%         46.000000
75%         63.000000
max        104.000000
dtype: float64

In [6]: nulls_replaced['AGE'].describe() # and after
Out[6]:
count    9999.000000
mean       43.066307
std        25.825253
min         0.000000
25%        24.000000
50%        46.000000
75%        63.000000
max       104.000000
dtype: float64

Notice that, in my data set, the count for this column has dropped by one. When using .describe() on a column, the count reflects the number of non-null values. When I used parser.replace_df_sentinels(), it found and replaced only one sentinel value in the AGE column. Note as well that the minimum value in this column has changed from -88 to 0, which is still young but no longer requires time travel.
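You can also count the replaced values directly rather than eyeballing two .describe() counts. A small sketch, using a made-up Series in place of the tutorial's df['AGE'] and nulls_replaced['AGE']:

```python
import numpy as np
import pandas as pd

# Hypothetical before/after columns; values are illustrative.
before = pd.Series([43, -88, 24, 63])
after = before.replace(-88, np.nan)

# .count() (what .describe() reports) tallies non-null values only,
# while .isnull().sum() counts the values that became null.
print(before.count())       # 4
print(after.count())        # 3
print(after.isnull().sum()) # 1
```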

Characterizing Numeric Data

Although we only looked at one column above for brevity, the parser.replace_df_sentinels() function has been applied to every column in the data set. You can describe additional columns by substituting in the column name for AGE in the code we've been using so far. Here is another, looking at the total charges per inpatient stay.

In [7]: nulls_replaced['TOTCHG'].describe()
Out[7]:
count      10000.00000
mean       32321.22250
std        50510.43668
min          990.00000
25%         9669.00000
50%        18253.50000
75%        35106.25000
max      1162225.00000
dtype: float64

So out of these 10,000 stays, somebody got out for cheap at less than $1,000 and somebody had a rough time with a bill greater than $1.1 million.
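If you want to see the actual records behind those extremes, pandas' boolean indexing pulls out the matching rows. A sketch with a tiny made-up frame standing in for nulls_replaced; the charge values are illustrative:

```python
import pandas as pd

# Toy stand-in for nulls_replaced; charges are illustrative only.
nulls_replaced = pd.DataFrame({"TOTCHG": [990.0, 18253.5, 1162225.0]})

# Boolean indexing keeps only rows where the condition is True.
expensive = nulls_replaced[nulls_replaced["TOTCHG"] > 1000000]
cheap = nulls_replaced[nulls_replaced["TOTCHG"] < 1000]

print(len(expensive), len(cheap))
```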

It is possible to describe multiple columns side by side by first preparing a list of columns, then referring to that instead of passing a single column name. In Python, you can make a list by putting all the desired list items in a shared set of square brackets, with commas between the items. For example, we could get ages, total charges, and length of stay all at once with something like this.

In [8]: col_list = ['AGE', 'TOTCHG', 'LOS']
In [9]: nulls_replaced[col_list].describe()
Out[9]:
               AGE         TOTCHG           LOS
count  9999.000000    10000.00000  10000.000000
mean     43.066307    32321.22250      6.110600
std      25.825253    50510.43668     10.319303
min       0.000000      990.00000      0.000000
25%      24.000000     9669.00000      2.000000
50%      46.000000    18253.50000      3.000000
75%      63.000000    35106.25000      6.000000
max     104.000000  1162225.00000    271.000000

Characterizing Non-Numeric Data

Averages, extreme values, and percentiles are less helpful for describing things like discharge destinations and diagnoses. For non-numeric data, pandas has a built-in .value_counts() method you can use on columns, similar to .describe(). As an example, HCUP SID data typically include a FEMALE column, where a value of 1 indicates the patient was female and 0 indicates the patient was male.

In [10]: nulls_replaced['FEMALE'].value_counts()
Out[10]:
1    5142
0    4857
dtype: int64

Unlike .describe(), you cannot directly use .value_counts() simultaneously with multiple columns. However, you can use Python to put together a loop that generates value counts for columns in a list. You can also "nest" this kind of count by grouping, such as getting value counts of primary diagnosis by race or discharge destination by homeless status. We'll cover both of these techniques in future posts.
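As a preview of those techniques, here is a minimal sketch of both the loop and the grouped ("nested") counts. The frame and its 0/1 values are made up for illustration; DIED is a real HCUP column used here only as a convenient second 0/1 field.

```python
import pandas as pd

# Small made-up frame; values are illustrative only.
nulls_replaced = pd.DataFrame({
    "FEMALE": [1, 0, 1, 1],
    "DIED":   [0, 0, 1, 0],
})

# A plain loop applies .value_counts() to each column in a list.
for col in ["FEMALE", "DIED"]:
    print(col)
    print(nulls_replaced[col].value_counts())

# "Nested" counts via groupby: DIED counts within each FEMALE group.
print(nulls_replaced.groupby("FEMALE")["DIED"].value_counts())
```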

Knowing What Columns Are Available

HCUP provides documentation with detailed descriptions of which columns are available for which states in which years, as well as what the contents of each column mean. But for day-to-day work, we can get a list of all the columns in a DataFrame by invoking DataFrame.columns on any of our data sets. The results should be the same right now whether you use it on df or nulls_replaced, since one is derived from the other.

In [11]:nulls_replaced.columns Out[11]:Index([AGE, AGEDAY, AGEMONTH, AHOUR, AMONTH, ANESTH, ASOURCE, ASOURCEUB92, ASOURCE_X, ATYPE, AWEEKEND, BLOOD, BWT, DHOUR, DIED, DISPUB04, DISPUNIFORM, DISP_X, DQTR, DRG, DRG24, DRGVER, DRG_NoPOA, DSHOSPID, DX1, DX2, DX3, DX4, DX5, DX6, DX7, DX8, DX9, DX10, DX11, DX12, DX13, DX14, DX15, DXCCS1, DXCCS2, DXCCS3, DXCCS4, DXCCS5, DXCCS6, DXCCS7, DXCCS8, DXCCS9, DXCCS10, DXCCS11, DXCCS12, DXCCS13, DXCCS14, DXCCS15, DXPOA1, DXPOA2, DXPOA3, DXPOA4, DXPOA5, DXPOA6, DXPOA7, DXPOA8, DXPOA9, DXPOA10, DXPOA11, DXPOA12, DXPOA13, DXPOA14, DXPOA15, ECODE1, ECODE2, ECODE3, ECODE4, ECODE5, ECODE6, ECODE7, ECODE8, ECODE9, E_CCS1, E_CCS2, E_CCS3, E_CCS4, E_CCS5, E_CCS6, E_CCS7, E_CCS8, E_CCS9, E_POA1, E_POA2, E_POA3, E_POA4, E_POA5, E_POA6, E_POA7, E_POA8, E_POA9, FEMALE, HCUP_ED, HCUP_OS, HISPANIC_X, HOSPBRTH, HOSPST, Homeless, KEY, LOS, LOS_X, MDC, MDC24, MDNUM1_R, MDNUM2_R, MEDINCSTQ, NDX, NECODE, NEOMAT, NPR, PAY1, PAY1_X, PAY2, PAY2_X, PAY3_X, PL_CBSA, PL_MSA1993, PL_NCHS2006, PL_RUCA10_2005, PL_RUCA2005, PL_RUCA4_2005, PL_RUCC2003, PL_UIC2003, PL_UR_CAT4, PR1, PR2, PR3, PR4, PR5, PR6, PR7, PR8, PR9, PR10, PR11, PR12, PR13, PR14, PR15, PRCCS1, PRCCS2, PRCCS3, PRCCS4, PRCCS5, PRCCS6, PRCCS7, PRCCS8, PRCCS9, PRCCS10, PRCCS11, PRCCS12, PRCCS13, PRCCS14, PRCCS15, PRDAY1, PRDAY2, PRDAY3, PRDAY4, PRDAY5, PRDAY6, PRDAY7, PRDAY8, PRDAY9, PRDAY10, PRDAY11, PRDAY12, PRDAY13, PRDAY14, PRDAY15, PROCTYPE, PSTATE, PSTCO, PSTCO2, PointOfOriginUB04, PointOfOrigin_X, RACE, RACE_X, TOTCHG, TOTCHG_X, YEAR, ZIP3, ZIPINC_QRTL, ZIP, AYEAR, DMONTH, BMONTH, BYEAR, PRMONTH1, PRMONTH2, PRMONTH3, PRMONTH4, PRMONTH5, PRMONTH6, PRMONTH7, PRMONTH8, PRMONTH9, PRMONTH10, PRMONTH11, PRMONTH12, PRMONTH13, PRMONTH14, PRMONTH15, PRYEAR1, PRYEAR2, PRYEAR3, PRYEAR4, PRYEAR5, PRYEAR6, PRYEAR7, PRYEAR8, PRYEAR9, PRYEAR10, PRYEAR11, PRYEAR12, PRYEAR13, PRYEAR14, PRYEAR15], dtype=object)
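The .columns attribute is a pandas Index, so it supports membership tests and converts easily to a plain list, which is handy for checking whether a given state/year file carries a column before you try to use it. A sketch with a tiny stand-in frame:

```python
import pandas as pd

# Tiny frame standing in for an HCUP data set.
df = pd.DataFrame({"AGE": [34], "TOTCHG": [9669]})

# Membership test: does this file carry the column at all?
print("AGE" in df.columns)   # True

# Convert to a plain Python list for easier inspection.
print(list(df.columns))
```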
