This is the third post in a series on working with Healthcare Cost and Utilization Project (HCUP) datasets using Python and other open-source alternatives to SAS and SPSS, which are the two primary data tools supported by HCUP.
This post will cover some basics on how to use Python and navigate in an IPython editor. If you are already familiar with these things, you can probably safely skip to Part 4 on how to actually get an HCUP data set into Python. Please note! These instructions assume you have Canopy up and running and have installed the PyHCUP package using pip. If any of those things sound unfamiliar, please visit the previous post for instructions on getting set up.
The Very Basics
You can use your HCUP data in Python without being a Python expert. However, it would be helpful to invest a small amount of time familiarizing yourself with the fundamentals. Given the huge volume of Python how-to's on the web, it can be paralyzing to get started (at least, it was for me). Here are some resources that I can recommend with some commentary. Consider bookmarking all of them for future reference.
LearnPython.org
The "Learn the Basics" series is highly recommended. If you have experience programming in other languages or with using tools like Stata, R, or MATLAB, you can probably make it through this set of tutorials in less than 30 minutes. Double or triple that if you have no programming experience whatsoever. Also, the site is friendly and concise.
Codecademy's Python track
Well-regarded and intended to be a much more robust training regimen. Allows you to save your work as you go. If you have no prior programming experience and/or you are interested in a long-term relationship with Python, you might give this a try. The first two modules should be enough to get you comfortable with what we'll be doing in the next couple posts.
The Official Python Documentation
This is the official tutorial documentation for the Python programming language. Not the most user-friendly, especially to programming newcomers, but certainly among the most robust. Google searches for Python issues will often end up back at the documentation, and once you've spent some time on the site it will become easier to know what you're looking at.
Learn Python The Hard Way
Much vaunted among Python gurus. Basically, if people learn Python (or any other kind of programming) poorly they end up writing bad code and/or often seeking someone else to make something work, instead of figuring out the right way to do it themselves. Learn Python The Hard Way reads like it was written by someone who's had quite enough of that, thank you, and here's all the paternalistic things someone should have said to you a long, long time ago.
Will you learn Python if you do these tutorials? Probably. Will the author's tone be distracting, even if you have a thick skin? Absolutely. It's a bit odd for any profession, programming included, to think that it is unique in having practitioners who come up through different routes: some poorly trained, some well-trained, some self-trained, some with no training (yet), and some with simply divergent training. As someone who has been both expert and idiot in various contexts and will remain so as long as I am human, I find this approach ineffective.
But I'm an admitted non-expert in Python (same for programming in general), so I'm listing the site anyways since the experts seem to like it.
Start a New IPython Notebook
Open Canopy and start the Editor. Once the Editor is open, use the "File>>New" menu to start a new IPython notebook. You should save your notebook at this time as well, which may require the use of "File>>Save As".
Navigating the PyHCUP Package in IPython
Type the following into the notebook and press Shift+Enter to run the cell. This imports the pyhcup package so the rest of the notebook can use it.
import pyhcup
The pyhcup package, like most Python packages, consists of a set of modules. Each module contains one or more functions, grouped at the stylistic discretion of the author(s). Usually the functions in a module serve a similar purpose.
You can access modules in a package using what is called dot notation. For example, pyhcup.sas will access a module called sas in the pyhcup package. You can use more dot notation to access functions within a module. For example, pyhcup.sas.df_from_sas will access the df_from_sas function in the sas module.
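Dot notation works the same way for every Python package, so you can try it without pyhcup. Here is a minimal sketch using the standard library's urllib package as a stand-in: urllib is the package, parse is a module inside it, and quote is a function inside that module. The string being encoded is just an illustrative value.

```python
# Dot notation with a standard-library package (a stand-in for pyhcup).
import urllib.parse

# package.module.function: urllib -> parse -> quote
encoded = urllib.parse.quote("HCUP data")
print(encoded)  # HCUP%20data

# The same dotted path without parentheses refers to the function itself.
quote_function = urllib.parse.quote
print(quote_function is urllib.parse.quote)  # True
```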
IPython notebooks provide several useful features, one of which is called tab-completion. Tab-completion means you can hit the "Tab" key on your keyboard while typing in a cell and IPython will suggest commands you might be interested in typing. For example, type the code below and hit Tab (without pressing Shift+Enter!).
pyhcup.sas.met
IPython should have automatically looked in the sas module for functions whose names begin with "met". Since there is only one, it will finish writing the name out for you when you press Tab.
You can also use tab-completion immediately after a dot. Try pressing backspace until just the following is in the cell, and then hit Tab again.
pyhcup.sas.
This time, IPython should give you a list of possible options, including these functions contained in the sas module.
- pyhcup.sas.df_from_sas
- pyhcup.sas.file_length
- pyhcup.sas.meta_from_sas
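If you are not in a notebook, a rough equivalent of that listing can be produced in plain Python. The sketch below uses the standard library's textwrap module as a stand-in (an assumption made only so the example runs without pyhcup installed). Swapping in pyhcup.sas should behave similarly, though the names returned will depend on the installed version.

```python
# List the functions defined in a module, similar to what tab-completion shows.
import inspect
import textwrap  # stand-in module; replace with pyhcup.sas if it is installed

module = textwrap

# inspect.getmembers returns (name, object) pairs; keep only plain functions.
function_names = [name for name, obj in inspect.getmembers(module, inspect.isfunction)]
print(function_names)

# Filtering by prefix mimics typing a few letters before pressing Tab.
matches = [name for name in function_names if name.startswith("de")]
print(matches)  # ['dedent']
```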
Try selecting the meta_from_sas function. You can either use the up and down arrows on your keyboard followed by the Enter key, or you can double-click the function. Next, hit Shift+Enter to run the cell. You should get something like this.
So, what is this? This is Python describing the object you pointed to at pyhcup.sas.meta_from_sas, which is a function. In order to actually call the function, we need to add parentheses at the end. We also need to pass along any parameters (aka arguments) the function needs in order to run.
Knowing which functions require which arguments would normally be a matter of referring to the author's documentation (or looking at the source code itself), but IPython has one more trick up its sleeve. You can use the Tab key next to an open parenthesis and IPython will show you a list of arguments for that function and look for any documentation the author put into the source code (aka the docstring). Try typing the following and pressing Tab.
pyhcup.sas.df_from_sas(
A small pop-up window should come up. In the top-right corner of the pop-up will be a bold plus (+) symbol. Click on that to expand the pop-up. Its contents, which you can now scroll through, should look something like this.
pyhcup.sas.df_from_sas(target, meta_df, skiprows=None, nrows=None, chunksize=None)
Parses target SAS datafile. Requires a pandas DataFrame object with meta data.
Returns a pandas DataFrame object containing the parsed data if chunksize is None, otherwise returns a reader generating chunksize chunks with each iteration.
target must be either a full path to a file (including filename) or a file-like Python object.
Can optionally specify rows to skip (skiprows) or limit the number of rows to read (nrows).
Let's pull apart the first line, which is copied straight from the function definition in the source code. df_from_sas() has five arguments, separated by commas. The last three, skiprows, nrows, and chunksize, all have a default value, indicated by the equals sign. Having default values means you can omit these arguments when calling the function, and the function will use the defaults instead. In particular, these three all default to None, which is Python's special value for nothing or null. It is different from False, which concretely says something is false. None says that no value was given at all.
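To see default values in action without needing any HCUP files, here is a small sketch. The read_rows function is hypothetical: it borrows the shape of the signature above but only reports the arguments it received, and the file name and meta value are placeholders.

```python
def read_rows(target, meta, skiprows=None, nrows=None, chunksize=None):
    """Hypothetical stand-in: return the arguments exactly as received."""
    return {
        "target": target,
        "meta": meta,
        "skiprows": skiprows,
        "nrows": nrows,
        "chunksize": chunksize,
    }

# Omitting the optional arguments leaves them at their default of None.
received = read_rows("core.asc", "meta")
print(received["nrows"] is None)   # True

# Supplying one optional argument by name overrides only that default.
limited = read_rows("core.asc", "meta", nrows=100)
print(limited["nrows"])            # 100
print(limited["skiprows"] is None) # True

# None is not the same thing as False.
print(received["nrows"] is False)  # False
```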
The first two arguments, target and meta_df, have no default value. This means you must provide a value for these. Furthermore, you must provide them in the order listed: the first argument must be a non-None target data file for the function, and the second argument must be a non-None set of meta data.
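The sketch below shows what happens when a required argument is left out, again using a hypothetical read_rows function rather than pyhcup itself. It also shows a general Python feature worth knowing: the order matters when arguments are passed by position, but arguments passed by name can appear in any order.

```python
def read_rows(target, meta, skiprows=None, nrows=None, chunksize=None):
    """Hypothetical stand-in: return the arguments as a tuple."""
    return (target, meta, skiprows, nrows, chunksize)

# Positional arguments are matched to parameters in the order listed.
by_position = read_rows("core.asc", "meta table")

# Keyword arguments are matched by name, so their order does not matter.
by_keyword = read_rows(meta="meta table", target="core.asc")
print(by_position == by_keyword)  # True

# Leaving out a required argument raises a TypeError.
try:
    read_rows("core.asc")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```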
You can pull up similar definitions for any function within IPython. The availability of docstrings will vary by module and function, but most functions have one.
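The same information is also available outside the pop-up. As a rough sketch, the standard library's inspect module can return a function's signature and docstring. The read_rows function here is a hypothetical stand-in, so substitute any real function you have imported. In IPython you can also type a question mark after a function name and run the cell, or use the built-in help() function.

```python
import inspect

def read_rows(target, meta, nrows=None):
    """Hypothetical stand-in: return the arguments it was given."""
    return target, meta, nrows

# The signature lists every argument along with any default value.
signature = inspect.signature(read_rows)
print(signature)  # (target, meta, nrows=None)

# Arguments with no default are the ones you are required to supply.
required = [
    name
    for name, parameter in signature.parameters.items()
    if parameter.default is inspect.Parameter.empty
]
print(required)  # ['target', 'meta']

# The docstring is the text the author placed at the top of the function.
docstring = inspect.getdoc(read_rows)
print(docstring)
```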
Now that we've covered the basics, the next post will jump into actually using your HCUP data!