Using Python to tackle the CPS (Part 4)

Last time, we got to where we’d like to have started: one file per month, with each month laid out the same. As a reminder, the CPS interviews households 8 times over the course of 16 months. They’re interviewed for 4 months, take 8 months off, and are interviewed for 4 more months. So if your first interview was in month $m$, you’re also interviewed in months $m + 1, m + 2, m + 3, m + 12, m + 13, m + 14, m + 15$. ...
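To make the rotation concrete, here is a minimal sketch of the schedule described above. It assumes months are counted as plain integers (month 1, month 2, and so on); the function name `interview_months` is made up for illustration and is not from the post's code.

```python
def interview_months(m):
    """Return the 8 months in which a household first interviewed in month m is interviewed."""
    # First stint: 4 consecutive months starting at m.
    first_stint = [m + k for k in range(4)]
    # Second stint: after 8 months off, 4 more consecutive months starting at m + 12.
    second_stint = [m + 12 + k for k in range(4)]
    return first_stint + second_stint


schedule = interview_months(1)
print(schedule)
```

The last interview falls in month $m + 15$, which is where the 16-month span comes from.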

May 19, 2014

Using Python to tackle the CPS (Part 3)

In part 2 of this series, we set the stage to parse the data files themselves. As a reminder, we have a dictionary that looks like

               id  length  start  end
    0      HRHHID      15      1   15
    1     HRMONTH       2     16   17
    2     HRYEAR4       4     18   21
    3    HURESPLI       2     22   23
    4     HUFINAL       3     24   26
    ...       ...     ...    ...  ...

giving the columns of the raw CPS data files. This post (or two) will describe the reading of the actual data files, and the somewhat tricky process of matching individuals across the different files. After that we can (finally) get into analyzing the data. The old joke is that statisticians spend 80% of their time munging their data, and 20% of their time complaining about munging their data. So 4 posts about data cleaning seems reasonable. ...
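As a rough illustration of how a layout like this drives the parsing, here is a small sketch in plain Python. It assumes `start` and `end` are 1-indexed and inclusive, which is consistent with the lengths in the excerpt (HRHHID runs from 1 to 15 and has length 15). The sample record is made up, not real CPS data, and `parse_line` is a hypothetical helper, not the function used in the series.

```python
# Layout rows copied from the dictionary excerpt: (id, start, end),
# assumed to be 1-indexed with an inclusive end.
LAYOUT = [
    ("HRHHID", 1, 15),
    ("HRMONTH", 16, 17),
    ("HRYEAR4", 18, 21),
    ("HURESPLI", 22, 23),
    ("HUFINAL", 24, 26),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of stripped string fields."""
    # Python slices are 0-indexed and half-open, so (start, end) becomes [start - 1:end].
    return {name: line[start - 1:end].strip() for name, start, end in layout}


# Made-up 26-character record for illustration only.
sample = "000000000000001" + " 1" + "1994" + " 1" + "201"
record = parse_line(sample)
print(record)
```

`pandas.read_fwf` takes a `colspecs` list of half-open, zero-indexed `(from, to)` pairs, so the same `(start - 1, end)` conversion should carry over if you read whole files with pandas.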

May 19, 2014

Tidy Data in Action

Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language specific. A tidy dataset must satisfy three criteria (page 4 in Wickham’s paper): each variable forms a column; each observation forms a row; each type of observational unit forms a table. In this StackOverflow post, the asker had some data on NBA games, and wanted to know the number of days since a team last played. Here’s the example data: ...
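To show why the tidy layout helps with the "days since a team last played" question, here is a small sketch. It uses plain Python rather than the pandas solution the post describes, and the games are made up (teams "A", "B", "C"), not the asker's data. The idea is the same: go from one row per game to one row per team per game, after which the rest-days calculation is a simple pass over each team's dates.

```python
from datetime import date

# Made-up games, one row per game (the untidy layout): date, away team, home team.
games = [
    (date(2014, 1, 1), "A", "B"),
    (date(2014, 1, 3), "B", "C"),
    (date(2014, 1, 4), "C", "A"),
]

# Tidy layout: one row per (team, game) observation, sorted by team, then date.
tidy = sorted((team, day) for day, away, home in games for team in (away, home))

# Days since each team's previous game; None marks a team's first game.
rest = []
last_played = {}
for team, day in tidy:
    previous = last_played.get(team)
    rest.append((team, day, None if previous is None else (day - previous).days))
    last_played[team] = day

for row in rest:
    print(row)
```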

March 27, 2014

Organizing Papers

As a graduate student, you read a lot of journal articles… a lot. With the material in the articles being as difficult as it is, I didn’t want to worry about organizing everything as well. That’s why I wrote this script to help (I may have also been procrastinating from studying for my qualifiers). This was one of my earliest little projects, so I’m not claiming that this is the best way to do anything. ...

February 13, 2014

Using Python to tackle the CPS (Part 2)

Last time, we used Python to fetch some data from the Current Population Survey. Today, we’ll work on parsing the files we just downloaded. We downloaded two types of files last time: CPS monthly tables, which are fixed-width format text files with the actual data, and data dictionaries, which are text files describing the layout of the monthly tables. Our goal is to parse the monthly tables. Here are the first two lines from the unzipped January 1994 file: ...

February 4, 2014

Using Python to tackle the CPS

The Current Population Survey is an important source of data for economists. Its modern form took shape in the 1970s, and unfortunately the data format and distribution show their age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven’t done everything yet. In this series I’ll describe methods I used to fetch, parse, and analyze CPS data for my second-year paper. Today I’ll describe fetching the data. Everything is available at the paper’s GitHub Repository. ...

January 27, 2014