Hadley Whickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language specific.
A tidy dataset must satisfy three criteria (page 4 in Whickham’s paper):
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
In this StackOverflow post, the asker had some data NBA games, and wanted to know the number of days since a team last played. Here’s the example data:
import datetime
import pandas as pd
df = pd.DataFrame({'HomeTeam': ['HOU', 'CHI', 'DAL', 'HOU'],
'AwayTeam' : ['CHI', 'DAL', 'CHI', 'DAL'],
'HomeGameNum': [1, 2, 2, 2],
'AwayGameNum' : [1, 1, 3, 3],
'Date' : [datetime.date(2014,3,11), datetime.date(2014,3,12),
datetime.date(2014,3,14), datetime.date(2014,3,15)]})
df
AwayGameNum | AwayTeam | Date | HomeGameNum | HomeTeam | |
---|---|---|---|---|---|
0 | 1 | CHI | 2014-03-11 | 1 | HOU |
1 | 1 | DAL | 2014-03-12 | 2 | CHI |
2 | 3 | CHI | 2014-03-14 | 2 | DAL |
3 | 3 | DAL | 2014-03-15 | 2 | HOU |
4 rows × 5 columns
I want to focus on the second of the three criteria: Each observation forms a row. Realize that the structure your dataset should take reflects the question you’re trying to answer. For the SO question, we want to answer “How many days has it been since this team’s last game?” Given this context what is an observation?
We’ll define an observation as a team playing on a day.
Does the original dataset in df
satisfy the criteria for tidy data?
No, it doesn’t since each row contains 2 observations, one for the home team and one for the away team.
Let’s tidy up the dataset.
- I repeat each row (once for each team) and drop the game numbers (I don’t need them for this example)
- Select just the new rows (the one with odd indicies,
%
is the modulo operator in python) - Overwrite the value of
Team
for the new rows, keeping the existing value for the old rows - rename the
HomeTeam
column tois_home
and make it a boolen column (True when the team is home)
s = df[['Date', 'HomeTeam', 'AwayTeam']].reindex_axis(df.index.repeat(2)).reset_index(drop=True)
s = s.rename(columns={'AwayTeam': 'Team'})
new = s[(s.index % 2).astype(bool)]
s.loc[new.index, 'Team'] = new.loc[:, 'HomeTeam']
s = s.rename(columns={'HomeTeam': 'is_home'})
s['is_home'] = s['Team'] == s['is_home']
s
Date | is_home | Team | |
---|---|---|---|
0 | 2014-03-11 | False | CHI |
1 | 2014-03-11 | True | HOU |
2 | 2014-03-12 | False | DAL |
3 | 2014-03-12 | True | CHI |
4 | 2014-03-14 | False | CHI |
5 | 2014-03-14 | True | DAL |
6 | 2014-03-15 | False | DAL |
7 | 2014-03-15 | True | HOU |
8 rows × 3 columns
Now that we have a 1:1 correspondance between rows and observations, answering the question is simple.
We’ll just group by each team and find the difference between each consecutive Date
for that team.
Then subtract one day so that back to back games reflect 0 days of rest.
s['rest'] = s.groupby('Team')['Date'].diff() - datetime.timedelta(1)
s
Date | is_home | Team | rest | |
---|---|---|---|---|
0 | 2014-03-11 | False | CHI | NaT |
1 | 2014-03-11 | True | HOU | NaT |
2 | 2014-03-12 | False | DAL | NaT |
3 | 2014-03-12 | True | CHI | 0 days |
4 | 2014-03-14 | False | CHI | 1 days |
5 | 2014-03-14 | True | DAL | 1 days |
6 | 2014-03-15 | False | DAL | 0 days |
7 | 2014-03-15 | True | HOU | 3 days |
8 rows × 4 columns
I planned on comparing that one line solution to the code needed with the messy data.
But honestly, I’m having trouble writing the messy data version.
You don’t really have anything to group on, so you’d need to keep track of the row where you last saw this team (either in AwayTeam
or HomeTeam
).
And then each row will have two answers, one for each team.
It’s certainly possible to write the necessary code, but the fact that I’m struggling so much to write the messy version is pretty good evidence for the importance of tidy data.