This is the first post in a series where I’ll show how I use pandas on real-world datasets.
For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those.
Let’s begin in the usual fashion: a bunch of imports and loading our data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython import display
Each day has data recorded in two formats: CSVs and KMLs. For now I’ve just uploaded the CSVs to the `data/` directory. We’ll start with those, and come back to the KMLs later.
!ls data | head -n 5
Cyclemeter-Cycle-20130801-0707.csv
Cyclemeter-Cycle-20130801-0707.kml
Cyclemeter-Cycle-20130801-1720.csv
Cyclemeter-Cycle-20130801-1720.kml
Cyclemeter-Cycle-20130805-0819.csv
Take a look at the first one to see how the file’s laid out.
df = pd.read_csv('data/Cyclemeter-Cycle-20130801-0707.csv')
df.head()
| | Time | Ride Time | Ride Time (secs) | Stopped Time | Stopped Time (secs) | Latitude | Longitude | Elevation (feet) | Distance (miles) | Speed (mph) | Pace | Pace (secs) | Average Speed (mph) | Average Pace | Average Pace (secs) | Ascent (feet) | Descent (feet) | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-08-01 07:07:10 | 0:00:01 | 1.1 | 0:00:00 | 0 | 41.703753 | -91.609892 | 963 | 0.00 | 2.88 | 0:20:51 | 1251 | 0.00 | 0:00:00 | 0 | 0 | 0 | 0 |
| 1 | 2013-08-01 07:07:17 | 0:00:08 | 8.2 | 0:00:00 | 0 | 41.703825 | -91.609835 | 852 | 0.01 | 2.88 | 0:20:51 | 1251 | 2.56 | 0:23:27 | 1407 | 0 | 129 | 0 |
| 2 | 2013-08-01 07:07:22 | 0:00:13 | 13.2 | 0:00:00 | 0 | 41.703858 | -91.609814 | 789 | 0.01 | 2.88 | 0:20:51 | 1251 | 2.27 | 0:26:27 | 1587 | 0 | 173 | 0 |
| 3 | 2013-08-01 07:07:27 | 0:00:18 | 18.2 | 0:00:00 | 0 | 41.703943 | -91.610090 | 787 | 0.02 | 6.60 | 0:09:06 | 546 | 4.70 | 0:12:47 | 767 | 0 | 173 | 1 |
| 4 | 2013-08-01 07:07:40 | 0:00:31 | 31.2 | 0:00:00 | 0 | 41.704381 | -91.610258 | 788 | 0.06 | 9.50 | 0:06:19 | 379 | 6.37 | 0:09:26 | 566 | 0 | 173 | 2 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 252 entries, 0 to 251
Data columns (total 18 columns):
Time 252 non-null object
Ride Time 252 non-null object
Ride Time (secs) 252 non-null float64
Stopped Time 252 non-null object
Stopped Time (secs) 252 non-null float64
Latitude 252 non-null float64
Longitude 252 non-null float64
Elevation (feet) 252 non-null int64
Distance (miles) 252 non-null float64
Speed (mph) 252 non-null float64
Pace 252 non-null object
Pace (secs) 252 non-null int64
Average Speed (mph) 252 non-null float64
Average Pace 252 non-null object
Average Pace (secs) 252 non-null int64
Ascent (feet) 252 non-null int64
Descent (feet) 252 non-null int64
Calories 252 non-null int64
dtypes: float64(7), int64(6), object(5)
Pandas has automatically parsed the headers, but it could use a bit of help on some dtypes. We can see that the `Time` column is a datetime, but it’s been parsed as the `object` dtype. This is pandas’ fallback dtype that can store anything, but operations on it won’t be optimized the way they would be on a `float`, `bool`, or `datetime64`. `read_csv` takes a `parse_dates` parameter, which we’ll give a list of column names.
date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]
df = pd.read_csv("data/Cyclemeter-Cycle-20130801-0707.csv",
parse_dates=date_cols)
display.display_html(df.head())
df.info()
| | Time | Ride Time | Ride Time (secs) | Stopped Time | Stopped Time (secs) | Latitude | Longitude | Elevation (feet) | Distance (miles) | Speed (mph) | Pace | Pace (secs) | Average Speed (mph) | Average Pace | Average Pace (secs) | Ascent (feet) | Descent (feet) | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-08-01 07:07:10 | 2014-08-26 00:00:01 | 1.1 | 2014-08-26 | 0 | 41.703753 | -91.609892 | 963 | 0.00 | 2.88 | 2014-08-26 00:20:51 | 1251 | 0.00 | 2014-08-26 00:00:00 | 0 | 0 | 0 | 0 |
| 1 | 2013-08-01 07:07:17 | 2014-08-26 00:00:08 | 8.2 | 2014-08-26 | 0 | 41.703825 | -91.609835 | 852 | 0.01 | 2.88 | 2014-08-26 00:20:51 | 1251 | 2.56 | 2014-08-26 00:23:27 | 1407 | 0 | 129 | 0 |
| 2 | 2013-08-01 07:07:22 | 2014-08-26 00:00:13 | 13.2 | 2014-08-26 | 0 | 41.703858 | -91.609814 | 789 | 0.01 | 2.88 | 2014-08-26 00:20:51 | 1251 | 2.27 | 2014-08-26 00:26:27 | 1587 | 0 | 173 | 0 |
| 3 | 2013-08-01 07:07:27 | 2014-08-26 00:00:18 | 18.2 | 2014-08-26 | 0 | 41.703943 | -91.610090 | 787 | 0.02 | 6.60 | 2014-08-26 00:09:06 | 546 | 4.70 | 2014-08-26 00:12:47 | 767 | 0 | 173 | 1 |
| 4 | 2013-08-01 07:07:40 | 2014-08-26 00:00:31 | 31.2 | 2014-08-26 | 0 | 41.704381 | -91.610258 | 788 | 0.06 | 9.50 | 2014-08-26 00:06:19 | 379 | 6.37 | 2014-08-26 00:09:26 | 566 | 0 | 173 | 2 |
<class 'pandas.core.frame.DataFrame'>
Int64Index: 252 entries, 0 to 251
Data columns (total 18 columns):
Time 252 non-null datetime64[ns]
Ride Time 252 non-null datetime64[ns]
Ride Time (secs) 252 non-null float64
Stopped Time 252 non-null datetime64[ns]
Stopped Time (secs) 252 non-null float64
Latitude 252 non-null float64
Longitude 252 non-null float64
Elevation (feet) 252 non-null int64
Distance (miles) 252 non-null float64
Speed (mph) 252 non-null float64
Pace 252 non-null datetime64[ns]
Pace (secs) 252 non-null int64
Average Speed (mph) 252 non-null float64
Average Pace 252 non-null datetime64[ns]
Average Pace (secs) 252 non-null int64
Ascent (feet) 252 non-null int64
Descent (feet) 252 non-null int64
Calories 252 non-null int64
dtypes: datetime64[ns](5), float64(7), int64(6)
One minor issue is that some of the dates are parsed as datetimes when they’re really just times; pandas stores everything as `datetime64`. We’ll take care of that later. For now we’ll keep them as datetimes, and remember that they’re really just times.
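As a preview of the eventual fix (just a sketch, and `duration_cols` is simply my name for the duration-like columns above), we could subtract off the spurious date so those columns become true timedeltas:

duration_cols = ["Ride Time", "Stopped Time", "Pace", "Average Pace"]
for col in duration_cols:
    # .dt.normalize() is midnight of the (meaningless) parsed date, so
    # subtracting it converts each column from datetime64 to timedelta64
    df[col] = df[col] - df[col].dt.normalize()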
Now let’s do the same thing, but for all the files. We’ll use a generator expression to filter down to just the CSVs that match the correct naming style. I try to use lazy generators instead of lists wherever possible. In this case the list is so small that it really doesn’t matter, but it’s a good habit.
import os
csvs = (f for f in os.listdir('data')
        if f.startswith('Cyclemeter') and f.endswith('.csv'))
I see a potential problem: we’ll want to concatenate the CSVs together into a single DataFrame, but we’ll also want to retain some idea of which specific ride an observation came from. So let’s create a `ride_id` variable, which will just be an integer ranging from $0$ to $N - 1$, where $N$ is the number of rides. We’ll make a simple helper function to do this, and apply it to each CSV.
def read_ride(path_, i):
    """
    Read in the CSV at `path_`, and assign `i` to the `ride_id` column.
    """
    date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]
    df = pd.read_csv(path_, parse_dates=date_cols)
    df['ride_id'] = i
    return df
dfs = (read_ride(os.path.join('data', csv), i)
for (i, csv) in enumerate(csvs))
Now concatenate them together. The original indices are meaningless, so we’ll ignore them in the concat.
df = pd.concat(dfs, ignore_index=True)
df.head()
| | Time | Ride Time | Ride Time (secs) | Stopped Time | Stopped Time (secs) | Latitude | Longitude | Elevation (feet) | Distance (miles) | Speed (mph) | Pace | Pace (secs) | Average Speed (mph) | Average Pace | Average Pace (secs) | Ascent (feet) | Descent (feet) | Calories | ride_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-08-01 07:07:10 | 2014-08-26 00:00:01 | 1.1 | 2014-08-26 | 0 | 41.703753 | -91.609892 | 963 | 0.00 | 2.88 | 2014-08-26 00:20:51 | 1251 | 0.00 | 2014-08-26 00:00:00 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2013-08-01 07:07:17 | 2014-08-26 00:00:08 | 8.2 | 2014-08-26 | 0 | 41.703825 | -91.609835 | 852 | 0.01 | 2.88 | 2014-08-26 00:20:51 | 1251 | 2.56 | 2014-08-26 00:23:27 | 1407 | 0 | 129 | 0 | 0 |
| 2 | 2013-08-01 07:07:22 | 2014-08-26 00:00:13 | 13.2 | 2014-08-26 | 0 | 41.703858 | -91.609814 | 789 | 0.01 | 2.88 | 2014-08-26 00:20:51 | 1251 | 2.27 | 2014-08-26 00:26:27 | 1587 | 0 | 173 | 0 | 0 |
| 3 | 2013-08-01 07:07:27 | 2014-08-26 00:00:18 | 18.2 | 2014-08-26 | 0 | 41.703943 | -91.610090 | 787 | 0.02 | 6.60 | 2014-08-26 00:09:06 | 546 | 4.70 | 2014-08-26 00:12:47 | 767 | 0 | 173 | 1 | 0 |
| 4 | 2013-08-01 07:07:40 | 2014-08-26 00:00:31 | 31.2 | 2014-08-26 | 0 | 41.704381 | -91.610258 | 788 | 0.06 | 9.50 | 2014-08-26 00:06:19 | 379 | 6.37 | 2014-08-26 00:09:26 | 566 | 0 | 173 | 2 | 0 |
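If we want a quick sanity check that the ride ids survived the concat, something like this would do (these two calls aren’t part of the original output):

df['ride_id'].nunique()              # should equal the number of CSV files
df.groupby('ride_id').size().head()  # observations per ride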
Great! The data itself is clean enough that we didn’t have to do too much munging. Let’s persist the merged `DataFrame`. Writing it out to a CSV would be fine, but I like to use pandas’ HDF5 integration (via PyTables) for personal projects.
df.to_hdf('data/cycle_store.h5', key='merged',
format='table')
I used the `table` format in case we want to do some querying on the `HDFStore` itself, but we’ll save that for next time.
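For a quick taste of what that querying might look like (just a sketch): `read_hdf` accepts a `where` clause against `table`-format stores. One caveat: to filter on a column rather than the index, the column has to be declared as a data column when writing, e.g. by passing `data_columns=['ride_id']` to `to_hdf` above.

# Hypothetical query: pull back just the first ride without reading the
# whole store into memory (assumes ride_id was written as a data column)
first_ride = pd.read_hdf('data/cycle_store.h5', 'merged',
                         where='ride_id == 0')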
That’s it for this post. Next time, we’ll do some exploratory data analysis on the data.