Welcome back. As a reminder:
- In part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store.
- In part 2 we did some cleaning and augmented the cycling data with weather data from http://forecast.io.
You can find the full source code and data at this project’s GitHub repo.
Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium.
%matplotlib inline
import os
import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_hdf(os.path.join('data', 'cycle_store.h5'), key='with_weather')
df.head()
| | time | ride_time_secs | stopped_time_secs | latitude | longitude | elevation_feet | distance_miles | speed_mph | pace_secs | average_speed_mph | average_pace_secs | ascent_feet | descent_feet | calories | ride_id | time_adj | apparentTemperature | cloudCover | dewPoint | humidity | icon | precipIntensity | precipProbability | precipType | pressure | summary | temperature | visibility | windBearing | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-08-01 12:07:10 | 1.1 | 0 | 41.703753 | -91.609892 | 963 | 0.00 | 2.88 | 1251 | 0.00 | 0 | 0 | 0 | 0 | 0 | 2013-08-01 07:07:10 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
1 | 2013-08-01 12:07:17 | 8.2 | 0 | 41.703825 | -91.609835 | 852 | 0.01 | 2.88 | 1251 | 2.56 | 1407 | 0 | 129 | 0 | 0 | 2013-08-01 07:07:17 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
2 | 2013-08-01 12:07:22 | 13.2 | 0 | 41.703858 | -91.609814 | 789 | 0.01 | 2.88 | 1251 | 2.27 | 1587 | 0 | 173 | 0 | 0 | 2013-08-01 07:07:22 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
3 | 2013-08-01 12:07:27 | 18.2 | 0 | 41.703943 | -91.610090 | 787 | 0.02 | 6.60 | 546 | 4.70 | 767 | 0 | 173 | 1 | 0 | 2013-08-01 07:07:27 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
4 | 2013-08-01 12:07:40 | 31.2 | 0 | 41.704381 | -91.610258 | 788 | 0.06 | 9.50 | 379 | 6.37 | 566 | 0 | 173 | 2 | 0 | 2013-08-01 07:07:40 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
Upon further inspection, it looks like some of our rows are duplicated.
df.duplicated().sum()
2
The problem is actually a bit more severe than that. The app I used to collect the data sometimes records multiple observations per second, but only reports timestamps at one-second resolution.
df.time.duplicated().sum()
114
What to do here? We could change the frequency to micro- or nanosecond resolution and add, say, half a second to each duplicated observation.
Since this is just for fun though, I’m going to do the easy thing and throw out the duplicates (in real life you’ll want to make sure this doesn’t affect your analysis).
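Had we kept everything, the timestamp-offset idea might look something like this sketch (made-up timestamps; the half-second spacing is arbitrary):

```python
import pandas as pd

# Two observations share the same second-resolution timestamp.
times = pd.Series(pd.to_datetime([
    '2013-08-01 12:07:10',
    '2013-08-01 12:07:10',
    '2013-08-01 12:07:11',
]))

# Shift the k-th duplicate within each timestamp by k * 500 milliseconds.
offsets = pd.to_timedelta(times.groupby(times).cumcount() * 500, unit='ms')
deduped = times + offsets
```

After the shift the timestamps are unique, so they could still serve as an index.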
Then we can set the time column to be our index, which will make our later analysis a bit simpler.
df = df.drop_duplicates(subset=['time']).set_index('time')
df.index.is_unique
True
Because of a bug in pandas, we lost our timezone information when we filled in the missing values. Until that’s fixed we’ll have to manually add back the timezone info and convert. The values actually stored were UTC (which is good practice whenever you have timezone-aware timestamps); pandas just doesn’t know that they’re UTC.
df = df.tz_localize('UTC').tz_convert('US/Central')
df.head()
| time | ride_time_secs | stopped_time_secs | latitude | longitude | elevation_feet | distance_miles | speed_mph | pace_secs | average_speed_mph | average_pace_secs | ascent_feet | descent_feet | calories | ride_id | time_adj | apparentTemperature | cloudCover | dewPoint | humidity | icon | precipIntensity | precipProbability | precipType | pressure | summary | temperature | visibility | windBearing | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-08-01 07:07:10-05:00 | 1.1 | 0 | 41.703753 | -91.609892 | 963 | 0.00 | 2.88 | 1251 | 0.00 | 0 | 0 | 0 | 0 | 0 | 2013-08-01 07:07:10 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
2013-08-01 07:07:17-05:00 | 8.2 | 0 | 41.703825 | -91.609835 | 852 | 0.01 | 2.88 | 1251 | 2.56 | 1407 | 0 | 129 | 0 | 0 | 2013-08-01 07:07:17 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
2013-08-01 07:07:22-05:00 | 13.2 | 0 | 41.703858 | -91.609814 | 789 | 0.01 | 2.88 | 1251 | 2.27 | 1587 | 0 | 173 | 0 | 0 | 2013-08-01 07:07:22 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
2013-08-01 07:07:27-05:00 | 18.2 | 0 | 41.703943 | -91.610090 | 787 | 0.02 | 6.60 | 546 | 4.70 | 767 | 0 | 173 | 1 | 0 | 2013-08-01 07:07:27 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
2013-08-01 07:07:40-05:00 | 31.2 | 0 | 41.704381 | -91.610258 | 788 | 0.06 | 9.50 | 379 | 6.37 | 566 | 0 | 173 | 2 | 0 | 2013-08-01 07:07:40 | 61.62 | 0 | 58.66 | 0.9 | clear-day | 0 | 0 | NaN | 1017.62 | Clear | 61.62 | 8.89 | 282 | 2.77 |
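As a sanity check on what that localize-then-convert pair does, here’s a minimal sketch with a single made-up timestamp (August falls in CDT, which is UTC-5):

```python
import pandas as pd

# A naive timestamp whose wall time is actually UTC.
idx = pd.DatetimeIndex(['2013-08-01 12:07:10'])

# Attach the UTC zone, then convert to US Central.
central = idx.tz_localize('UTC').tz_convert('US/Central')
# 12:07 UTC becomes 07:07 Central.
```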
Timelines
We’ll store the time part of the DatetimeIndex in a column called time.
df['time'] = df.index.time
With these, let’s plot how far along I was in my ride (distance_miles) at each time of day.
ax = df.plot(x='time', y='distance_miles')
There are a couple of problems. First of all, the data are split into morning and afternoon. Let’s create a new boolean column indicating whether the ride took place in the morning.
df['is_morning'] = df.time < datetime.time(12)
ax = df[df.is_morning].plot(x='time', y='distance_miles')
Better, but this still isn’t quite what we want. When we call .plot(x=..., y=...), the data are sorted by x before being plotted, so observations from one ride get mixed up with observations from another. We’ll need to group by ride, and then plot.
axes = df[df.is_morning].groupby(df.ride_id).plot(x='time',
y='distance_miles',
color='k',
figsize=(12, 5))
Much better. Groupby is one of the most powerful operations in pandas, and it pays to understand it well. Here’s the same thing for the evening.
axes = df[~df.is_morning].groupby(df.ride_id).plot(x='time',
y='distance_miles',
color='k',
figsize=(12, 5))
Fun. The horizontal extent of each line is how long the ride took, and its starting point on the horizontal axis conveys the time I set out. The plot suggests that the morning ride typically took longer, but let’s verify that.
ride_time = df.groupby(['ride_id', 'is_morning'])['ride_time_secs'].agg('max')
mean_time = ride_time.groupby(level=1).mean().rename(
index={True: 'morning', False: 'evening'})
mean_time / 60
is_morning
evening 30.761667
morning 29.362716
Name: ride_time_secs, dtype: float64
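The groupby(level=1) step groups by the second level of the MultiIndex, is_morning. In miniature, with made-up ride times:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, True), (1, False), (2, True), (2, False)],
    names=['ride_id', 'is_morning'])
ride_time = pd.Series([10.0, 8.0, 14.0, 12.0], index=idx)

# Average within each value of the is_morning level.
mean_by_period = ride_time.groupby(level='is_morning').mean()
```

Here level='is_morning' and level=1 are equivalent; the name is just easier to read.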
So the morning ride is typically shorter! But I think I know what’s going on: our earlier plots were misleading because the ranges of their horizontal axes weren’t identical. Always check the axes!
At the risk of raising the ire of Hadley Wickham, we’ll plot these on the same figure, with a secondary x-axis. (I think it’s OK in this case since the second axis is just a transformation, a shift of ten hours or so, of the first.)
We’ll plot the evening rides first, use matplotlib’s twiny method, and plot the morning rides on the second axes.
fig, ax = plt.subplots()
morning_color = sns.xkcd_rgb['amber']
evening_color = sns.xkcd_rgb['dusty purple']
_ = df[~df.is_morning].groupby(df.ride_id).plot(x='time', y='distance_miles',
color=evening_color, figsize=(12, 5),
ax=ax, alpha=.9, grid=False)
ax2 = ax.twiny()
_ = df[df.is_morning].groupby(df.ride_id).plot(x='time', y='distance_miles',
color=morning_color, figsize=(12, 5),
ax=ax2, alpha=.9, grid=False)
# Create fake lines for our custom legend.
morning_legend = plt.Line2D([0], [0], color=morning_color)
evening_legend = plt.Line2D([0], [0], color=evening_color)
ax.legend([morning_legend, evening_legend], ['Morning', 'Evening'])
<matplotlib.legend.Legend at 0x115640198>
There’s a bit of boilerplate at the end: pandas tries to add a legend entry for each ride ID, and it doesn’t know that we only care whether a ride is morning or evening. So we fake it, creating two invisible lines and labeling them appropriately.
Anyway, we’ve accomplished our original goal. The steeper slope on the morning rides shows that they typically took me less time; apparently I was in more of a hurry to get to school than to get home. The joys of being a grad student.
I’m sure I’m not the only one who noticed that long evening ride sticking out from the rest. Let’s note its ride ID and follow up. We need the ride_id, so group by that. It’s the longest ride, so take the max of the distance. And we want the ride_id of the maximum distance, so take the idxmax of that. These last three sentences chain together into a single line that reads like poetry.
long_ride_id = df.groupby('ride_id')['distance_miles'].max().idxmax()
long_ride_id
22
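Here’s the max/idxmax distinction in miniature, with made-up ride distances (in older pandas, Series.argmax was an alias for idxmax, which returns the index label of the maximum):

```python
import pandas as pd

# Hypothetical ride_id -> maximum distance for that ride.
dist = pd.Series({10: 5.2, 22: 9.8, 42: 5.4})

largest = dist.max()          # the maximum value
longest_ride = dist.idxmax()  # the label where that maximum occurs
```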
Cartography
We’ll use Folium to do a bit of map plotting. If you’re using Python 3 (like I am) you’ll need to use this pull request from tbicr, or just clone the master of my fork, where I’ve merged the changes.
Since this is a practical pandas post, and not an intro to folium, I won’t delve into the details here. The basics are that we initialize a Map with some coordinates and tiles, and then add lines to that map. The lines come from the latitude and longitude columns of our DataFrame.
Here’s a small helper function from birdage to inline the map in the notebook, which keeps it viewable (and interactive) on nbviewer.
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data).
    Also this uses the HTML5 srcdoc attribute, which may not be supported
    in all browsers.
    """
    from IPython.display import HTML
    map._build_map()
    srcdoc = map.HTML.replace('"', '&quot;')
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; '
                'height: 510px; border: none"></iframe>'.format(srcdoc=srcdoc))
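That final replace is hand-rolled escaping of double quotes so the map’s HTML can sit inside the srcdoc attribute; the standard library’s html.escape handles the general case:

```python
import html

# A stand-in for the map's generated HTML.
snippet = '<iframe title="map"></iframe>'
escaped = html.escape(snippet, quote=True)
```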
I’ve plotted two rides, a hopefully representative ride (#42) and the long ride from above.
import folium
folium.initialize_notebook()
lat, lon = df[['latitude', 'longitude']].mean()
mp = folium.Map(location=(lat, lon), tiles='OpenStreetMap', zoom_start=13)
mp.line(locations=df.loc[df.ride_id == 42, ['latitude', 'longitude']].values)
mp.line(locations=df.loc[df.ride_id == long_ride_id, ['latitude', 'longitude']].values,
line_color='#800026')
inline_map(mp)
If you pan around a bit, it looks like the GPS receiver on my phone was just going crazy on that long ride. But without visualizing the data as a map, there’d be no way to know that.
For fun, we can plot all the rides.
mp_all = folium.Map(location=(lat, lon), tiles='OpenStreetMap', zoom_start=13)
for ride_id in df.ride_id.unique():
mp_all.line(locations=df.loc[df.ride_id == ride_id, ['latitude', 'longitude']].values,
line_weight=1, line_color='#111', line_opacity=.3)
inline_map(mp_all)
You can barely make out that I changed my path partway through the year to take Old Hospital Road instead of the North Ridge Trail (the north boundary of my path).
Folium is cool; you should check it out (really, just use anything made by Rob).