This is the fourth post in a series on working with Healthcare Cost and Utilization Project (HCUP) datasets using Python and other open-source alternatives to SAS and SPSS, which are the two primary data tools supported by HCUP.
This post will cover how to actually get an HCUP data set into Python. Please note! These instructions assume you have Canopy up and running and have installed the PyHCUP package using pip. If any of those things sound unfamiliar, please visit the second post for instructions on getting set up.
IPython Notebook from this Tutorial
You can download an IPython notebook with code examples from this post.
Locate Your HCUP Data (and decompress it, if necessary)
HCUP data typically arrive on a DVD, and may or may not be compressed in a ZIP archive. Pick a directory on your computer and copy your data file into it. If the data are in a ZIP archive, extract them at this time. You should end up with a file somewhere around five to fifteen times larger (in MB) than the ZIP archive.
The decompressed file should end in a .asc file extension, though the file extension may be hidden from view depending on your operating system settings. Its name should correspond to the data it contains. For example, if you are working with New York's 2008 State Inpatient Database CORE data, you should end up with a file called something like NY_SID_2008_CORE.asc. Make a note of its location as we'll need to know that later.
Locate a SAS Load Program for Your HCUP Data
Next, you'll need the appropriate SAS Load Program for your particular HCUP data. Continuing with the above example, we need to find the SAS loading program for the New York SID 2008 CORE data. HCUP provides SAS loading programs in a few different places on its website, depending on which data you are using. Use the links below to locate the appropriate SAS Load Program for your data. If the links don't work, HCUP has probably changed its website; Google should still be able to help you find them.
- NIS: http://www.hcup-us.ahrq.gov/db/nation/nis/nissasloadprog.jsp
- KID: http://www.hcup-us.ahrq.gov/db/nation/kid/kidsasloadprog.jsp
- NEDS: http://www.hcup-us.ahrq.gov/db/nation/neds/nedssasloadprog.jsp
- SID, SEDD, and SASD: http://www.hcup-us.ahrq.gov/sasload/sasload_search.jsp
After you locate and download the appropriate SAS Load Program for your data, make a note of its location as we'll need to know that later.
Start a New IPython Notebook
Open Canopy and start the Editor. Once the Editor is open, use the "File>>New" menu to start a new IPython notebook. You should save your notebook at this time as well, which may require the use of "File>>Save As".
Write Some Code
Type these cells and run each one with Shift+Enter. You can omit lines beginning with the pound sign (#), which are comments included for clarity. Your output (if any) should look similar to the output in the example.
Import the PyHCUP Package
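The first cell simply imports the package. This assumes PyHCUP was installed with pip, as described in the second post; if the import fails, revisit that setup step.

```python
# Import the PyHCUP package (installed earlier with pip)
import pyhcup
```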
Specify the Location of Your Files
The formatting for these varies slightly depending on your operating system. Change the path to reflect the correct directory for your files. Also, swap out the filenames at the end for your own!
Paths in Windows look like this.
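For example (the directory and filenames here are placeholders; substitute your own):

```python
# Windows: prefix the string with r so backslashes are not
# treated as escape characters
data_path = r'C:\hcup\NY_SID_2008_CORE.asc'
load_path = r'C:\hcup\NY_SID_2008_CORE_LOAD.sas'
```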
Paths in Mac and Linux environments look like this.
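Again, the directory and filenames below are placeholders for your own:

```python
# Mac/Linux: forward slashes, no drive letter
data_path = '/Users/you/hcup/NY_SID_2008_CORE.asc'
load_path = '/Users/you/hcup/NY_SID_2008_CORE_LOAD.sas'
```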
Acquire Metadata
The SAS Load Program files supplied by HCUP are built to tell SAS how to load your data. We will use them instead to tell Python how to load your data.
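A sketch of this step, assuming the metadata parser is pyhcup.sas.meta_from_sas() (check help(pyhcup.sas) if your version of PyHCUP names it differently):

```python
# Parse the SAS Load Program into a pandas DataFrame of column
# metadata: field names, positions, widths, and labels
meta_df = pyhcup.sas.meta_from_sas(load_path)
```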
You can verify that the metadata we have parsed out is reasonable by calling the meta_df DataFrame object we've created. This summary output is for the 2008 NY SID CORE, referenced above.
The number of entries reflects the number of columns specified in the SAS Load Program. There should generally be either more than one hundred columns or fewer than ten, depending on the type of HCUP data. For your own interest, you can see the first n rows in a pandas DataFrame by using its .head() method. If there are too many columns, pandas will abridge this from a table-style output to a summary-style output as seen above.
| | field | informat | position | format | label | width |
|---|---|---|---|---|---|---|
| 0 | AGE | N3PF. | 1 | None | Age in years at admission | 3 |
| 1 | AGEDAY | N3PF. | 4 | None | Age in days (when age < 1 year) | 3 |
| 2 | AGEMONTH | N3PF. | 7 | None | Age in months (when age < 11 years) | 3 |
| 3 | AHOUR | N4PF. | 10 | Z4. | Admission Hour | 4 |
| 4 | AMONTH | N2PF. | 14 | None | Admission month | 2 |
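To see what this metadata is for: each field's position and width describe where its characters sit in every fixed-width line of the .asc file. Here is a toy illustration with a made-up record — this is not PyHCUP code, just the underlying idea:

```python
# A made-up 15-character fixed-width record
record = " 45 -9 -90800 1"

# (field, position, width) taken from the first rows of the metadata
# above; positions are 1-based, as in the SAS Load Program
layout = [
    ("AGE", 1, 3),
    ("AGEDAY", 4, 3),
    ("AGEMONTH", 7, 3),
    ("AHOUR", 10, 4),
    ("AMONTH", 14, 2),
]

# Slice each field out of the record by position and width
parsed = {name: record[pos - 1:pos - 1 + width].strip()
          for name, pos, width in layout}
print(parsed)  # AGE='45', AHOUR='0800', AMONTH='1'
```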
Use the Metadata to Parse the Data
Now that we have the metadata, we can parse the actual data using the pyhcup.sas.df_from_sas function, with our meta_df as the second argument. However, given the size of the data sets being read in, you may wish to consider reading in only a subset of rows. Do this by passing an nrows argument as well.
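Putting that together (data_path and meta_df as defined in the earlier cells; nrows capped at 10,000 here to keep memory use manageable):

```python
# Parse the fixed-width .asc file using the column metadata;
# nrows limits how many records are read in
df = pyhcup.sas.df_from_sas(data_path, meta_df, nrows=10000)
```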
You now have ten thousand rows from the data file of your choice in a pandas DataFrame object.
The Catch and Next Steps
pandas provides a fantastic set of tools for working with data. One of which is, of course, a simple summary statistics method called .describe() which you can use on either an entire DataFrame or on any of its columns (aka Series). Try it now with the AGE column if you are working with CORE data, or a column of your choice for other data sets. Access any of the columns by using the DataFrame['Column Name (Series)'] notation.
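For instance, here is a tiny self-contained sketch of what .describe() reports. The ages are made up, with -88 standing in for an HCUP-style sentinel; on the real data you would call df['AGE'].describe() instead.

```python
import pandas as pd

# Made-up AGE values; -88 plays the role of a sentinel
ages = pd.Series([45, 62, -88, 31, 7], name='AGE')
print(ages.describe())  # note the impossible minimum of -88
```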
Something is not quite right about the minimum age in this dataset.
While it is true that we have loaded the data as it was provided by HCUP, we have not done any cleanup work whatsoever. In particular, HCUP uses a set of placeholder values (aka sentinels) to note when a state has submitted missing data, or data values which are invalid for one reason or another. In the next post, we will examine functions in PyHCUP and pandas for dealing with this missing or invalid data.