Tuesday, December 10, 2013

HCUP and Python Pt 4: Reading in Data

This is the fourth post in a series on working with Healthcare Cost and Utilization Project (HCUP) datasets using Python and other open-source alternatives to SAS and SPSS, which are the two primary data tools supported by HCUP.

This post will cover how to actually get an HCUP data set into Python. Please note! These instructions assume you have Canopy up and running and have installed the PyHCUP package using pip. If any of those things sound unfamiliar, please visit the second post for instructions on getting set up.

IPython Notebook from this Tutorial

You can download an IPython notebook with code examples from this post.

Locate Your HCUP Data (and decompress it, if necessary)

HCUP data typically arrive on a DVD, and may or may not be compressed in a ZIP archive. Pick a directory on your computer and copy your data file into it. If the data are in a ZIP archive, extract them at this time. You should end up with a file somewhere around five to fifteen times larger (in MB) than the ZIP archive.

The decompressed file should end in a .asc file extension, though the file extension may be hidden from view depending on your operating system settings. Its name should correspond to the data it contains. For example, if you are working with New York's 2008 State Inpatient Database CORE data, you should end up with a file called something like NY_SID_2008_CORE.asc. Make a note of its location, as we'll need it later.
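If you'd rather extract the archive programmatically, Python's standard zipfile module can handle it. This is a minimal sketch; the archive and directory names here are stand-ins (the example builds a tiny dummy archive so it runs on its own), and you would point it at your real HCUP ZIP file instead.

```python
import os
import zipfile

# Stand-in names; substitute the path to your real HCUP archive
# (e.g. NY_SID_2008_CORE.zip) and your preferred destination
archive_path = 'demo_hcup.zip'
destination = 'hcup_data'

# Create a tiny stand-in archive so this example is self-contained
with zipfile.ZipFile(archive_path, 'w') as archive:
    archive.writestr('NY_SID_2008_CORE.asc', 'stand-in contents')

# List the members to confirm the .asc file is present, then extract
with zipfile.ZipFile(archive_path) as archive:
    print(archive.namelist())
    archive.extractall(destination)
```

After extraction, the .asc file will sit inside the destination directory, and that full path is what you'll note down for later.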

Locate a SAS Load Program for Your HCUP Data

Next, you'll need the appropriate SAS Load Program for your particular HCUP data. Continuing with the above example, we need to find the SAS loading program for the New York SID 2008 CORE data. HCUP provides SAS loading programs in a few different places on its website, depending on which data you are using. Use the links below to locate the appropriate SAS Load Program for your data. If the links don't work, HCUP has probably changed its website; Google should still be able to help you find them.

After you locate and download the appropriate SAS Load Program for your data, make a note of its location as we'll need to know that later.

Start a New IPython Notebook

Open Canopy and start the Editor. Once the Editor is open, use the "File>>New" menu to start a new IPython notebook. You should save your notebook at this time as well, which may require the use of "File>>Save As".

Write Some Code

Type the code shown after the "In" prompt of each cell, then run it with Shift+Enter. You can omit lines beginning with the pound sign (#), which are comments for clarity. Your output (if any) should look similar to the output in the examples.

Import the PyHCUP Package

In [1]: import pyhcup

Specify the Location of Your Files

The formatting for these varies slightly depending on your operating system. Change the path to reflect the correct directory for your files. Also, swap out the filenames at the end for your own!

Paths in Windows look like this.

In [2]:
# for Windows, use this style
data_path = 'C:\\path\\to\\files\\NY_SID_2008_CORE.asc'
load_path = 'C:\\path\\to\\files\\NY_SID_2008_CORE.sas'

Paths in Mac and Linux environments look like this.

In [2]:
# for Mac and Linux, use this style
data_path = '/home/user/dir/NY_SID_2008_CORE.asc'
load_path = '/home/user/dir/NY_SID_2008_CORE.sas'
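A typo in either path will only surface later as a confusing error, so it can save headaches to confirm Python can actually see both files before moving on. This quick check uses only the standard os.path module; the paths shown are stand-ins for your own.

```python
import os

def check_paths(*paths):
    """Return the subset of the given paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Stand-in paths; replace with your own data_path and load_path
data_path = '/home/user/dir/NY_SID_2008_CORE.asc'
load_path = '/home/user/dir/NY_SID_2008_CORE.sas'

missing = check_paths(data_path, load_path)
if missing:
    print('Cannot find:', missing)
```

If anything shows up as missing, double-check the directory and filename before continuing.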

Acquire Metadata

The SAS Load Program files supplied by HCUP are built to tell SAS how to load your data. We will use them instead to tell Python how to load your data.

In [3]:
# build a pandas DataFrame object from metadata
meta_df = pyhcup.sas.meta_from_sas(load_path)

You can verify that the metadata we have parsed out is reasonable by calling the meta_df DataFrame object we've created. This summary output is for the 2008 NY SID CORE, referenced above.

In [4]: meta_df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 222 entries, 0 to 221
Data columns (total 6 columns):
field       222 non-null values
informat    222 non-null values
position    222 non-null values
format      6 non-null values
label       222 non-null values
width       222 non-null values
dtypes: int32(1), object(5)

The number of entries reflects the number of columns specified in the SAS Load Program. There should generally be either more than one hundred or fewer than ten columns, depending on the type of HCUP data. For your own interest, you can see the first few rows in a pandas DataFrame by using its .head() method. If there are too many columns, pandas will abridge this from a table-style output to a summary-style output as seen above.

In [5]: meta_df.head()
Out[5]:
      field informat  position format                                label  width
0       AGE    N3PF.         1   None            Age in years at admission      3
1    AGEDAY    N3PF.         4   None      Age in days (when age < 1 year)      3
2  AGEMONTH    N3PF.         7   None  Age in months (when age < 11 years)      3
3     AHOUR    N4PF.        10    Z4.                       Admission Hour      4
4    AMONTH    N2PF.        14   None                      Admission month      2

Use the Metadata to Parse the Data

Now that we have the metadata, we can parse the actual data using the pyhcup.sas.df_from_sas function, passing our meta_df as the second argument. However, given the size of the data sets being read in, you may wish to consider reading in only a subset of rows. Do this by passing an nrows argument as well.

In [6]:
# grab the first 10,000 rows in a DataFrame
df = pyhcup.sas.df_from_sas(data_path, meta_df, nrows=10000)
df
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 222 entries, AGE to PRYEAR15
dtypes: float64(30), int64(148), object(44)

You now have ten thousand rows from the data file of your choice in a pandas DataFrame object.

The Catch and Next Steps

pandas provides a fantastic set of tools for working with data. One of these is a simple summary statistics method, .describe(), which you can use on either an entire DataFrame or on any of its columns (also known as Series). Try it now with the AGE column if you are working with CORE data, or with a column of your choice for other data sets. Access any column by using the df['COLUMN_NAME'] notation.

In [7]: df['AGE'].describe()
Out[7]:
count    10000.000000
mean        43.053200
std         25.857201
min        -88.000000
25%         24.000000
50%         46.000000
75%         63.000000
max        104.000000
dtype: float64

Something is not quite right about the minimum age in this dataset.

While it is true that we have loaded the data as it was provided by HCUP, we have not done any cleanup work whatsoever. In particular, HCUP has a set of placeholder values (also known as sentinels) used to note when a state has submitted missing data, or data values which are invalid for one reason or another. In the next post, we will examine functions in PyHCUP and pandas for dealing with this missing or invalid data.
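As a rough preview of that cleanup, pandas can swap sentinel values for proper missing values (NaN), which summary statistics then ignore. The sentinel codes below are illustrative stand-ins, not HCUP's official list; the toy AGE column mimics the -88 value seen in the .describe() output above.

```python
import numpy as np
import pandas as pd

# A toy AGE column containing an HCUP-style sentinel value
ages = pd.Series([24, 46, -88, 63, 104])

# Replace illustrative sentinel codes with NaN so that summary
# statistics skip them
cleaned = ages.replace([-88, -99], np.nan)

print(cleaned.min())  # the sentinel no longer drags the minimum down
```

Once the sentinels become NaN, methods like .describe() report statistics over only the genuine values.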
