Last time, we used Python to fetch some data from the Current Population Survey. Today, we’ll work on parsing the files we just downloaded.
We downloaded two types of files last time:
- CPS monthly tables: a fixed-width format text file with the actual data
- Data Dictionaries: a text file describing the layout of the monthly tables
Our goal is to parse the monthly tables. Here’s the first two lines from the unzipped January 1994 file:
/V/H/U/t/D/C/monthly head -n 2 cpsb9401
881605952390 2 286-1 2201-1 1 1 1-1 1 5-1-1-1 22436991 1 2 1 6 194 2A61 -1 2 2-1-1-1-1 363 1-15240115 3-1 4 0 1-1 2 1-1660 1 2 2 2 6 236 2 8-1 0 1-1 1 1 1 2 1 2 57 57 57 1 0-1 2 5 3-1-1 2-1-1-1-1-1 2-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1 -1-1 169-1-1-1-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 2-1 0 4-1-1-1-1-1-1 -1-1-1 0 1 2-1-1-1-1-1-1-1-1-1 -1 -1-1-1 -1 -1-1-1 0-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0-1-1-1-1-1 -1 -1 -1 0-1-1 0-1-1-1 -1 0-1-1-1-1-1-1-1-1 2-1-1-1-1 22436991 -1 0 22436991 22422317-1 0 0 0 1 0-1 050 0 0 0 011 0 0 0-1-1-1-1 0 0 0-1-1-1-1-1-1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 1 1 1 1 1 1 1 1 1 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 1 1 1-1-1-1
881605952390 2 286-1 2201-1 1 1 1-1 1 5-1-1-1 22436991 1 2 1 6 194 2A61 -1 2 2-1-1-1-1 363 1-15240115 3-1 4 0 1-1 2 3-1580 1 1 1 1 2 239 2 8-1 0 2-1 1 2 1 2 1 2 57 57 57 1 0-1 1 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 2-140-1-1 40-1-1-1-1 2-1 2-140-1 40-1 -1 2 5 5-1 2 3 5 2-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 1-118 1 1 1 4-1-1-1 -1 1-1 1 2-1-1-1-1-1-1-1 4 1242705-1-1-1 -1 3-1-1 1 2 4-1 1 6-1 6-136-1 1 4-110-1 3 1 1 1 0-1-1-1-1 -1-1 -1 -1 0-1-1 0-1-1-1 -10-1-1-1-1-1-1-1-1-1-1-1-1-1 22436991 -1 0 31870604 25650291-1 0 0 0 1 0-1 0 1 0 0 0 0 0 0 0 0-1-1-1-1 0 0-1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0 0 0-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
Clearly, we’ll need to parse the data dictionaries before being able to make sense of that.
Keeping with the CPS’s tradition of consistently being inconsistent, the data dictionaries don’t have a consistent schema across the years. Here’s a typical example for some years (this one is from January 2003):
NAME SIZE DESCRIPTION LOCATION
HRHHID 15 HOUSEHOLD IDENTIFIER (Part 1) (1 - 15)
EDITED UNIVERSE: ALL HHLD's IN SAMPLE
Part 1. See Characters 71-75 for Part 2 of the Household Identifier.
Use Part 1 only for matching backward in time and use in combination
with Part 2 for matching forward in time.
My goal was to extract 4 fields (name
, size
, start
, end
). Name and size could be taken directly (HRHHID
, and 15
). start
and end
would be pulled from the LOCATION
part.
In generic_data_dictionary_parser
, I define a class do this. The main object Parser
, takes
infile
: the path to a data dictionary we downloadedoutfile
: path to an HDF5 filestyle
: A string representing the year of the data dictionary. Different years are formatted differently, so I define a style for each (3 styles in all)regex
: This was mostly for testing. If you don’t pass aregex
it will be inferred from the style.
The heart of the parser is a regex that matches on lines like HRHHID 15 HOUSEHOLD IDENTIFIER (Part 1) (1 - 15)
, but nowhere else. After many hours, failures, and false positives, I came up with something roughly like ur'[\x0c]{0,1}(\w+)[\s\t]*(\d{1,2})[\s\t]*(.*?)[\s\t]*\(*(\d+)\s*-\s*(\d+)\)*$'
Here’s an explanation, but the gist is that
\w+
matches words (likeHRHHID
)- there’s some spaces or tabs
[\s\t]*
(yes the CPS mixes spaces and tabs) between that and… - size
\d{1,2}
which is 1 or two columns digits - the description which we don’t really care about
- the start and end positions
(*(\d+)\s*-\s*(\d+)\)*$
broken into two groups.
Like I said, that’s the heart of the parser. Unfortunately I had to pad the file with some 200+ more lines of code to handle special cases, formatting, and mistakes in the data dictionary itself.
The end result is a nice HDFStore
, with a parsed version of each data dictionary looking like:
id length start end
0 HRHHID 15 1 15
1 HRMONTH 2 16 17
2 HRYEAR4 4 18 21
3 HURESPLI 2 22 23
4 HUFINAL 3 24 26
... ... ... ...
This can be used as an argument pandas’ read_fwf
parser.
Next time I’ll talk about actually parsing the tables and wrangling them into a usable structure. After that, we will finally get to actually analyzing the data.