dplyr and pandas

This notebook compares pandas and dplyr. The comparison covers only syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas or the other way around, I hope this guide will help ease the transition. We’ll work through the introductory dplyr vignette to analyze some flight data. I’m working on a better layout to show the two packages side by side, but for now I’m just putting the dplyr code in a comment above each Python call. ...
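To give a feel for that comment-above-the-call layout, here is a minimal sketch. The `flights` frame and its columns are made up for illustration and are not the vignette's dataset, and the pandas calls assume a recent pandas release (`sort_values`, named aggregation) rather than whatever the original notebook used.

```python
import pandas as pd

# Hypothetical stand-in for the flight data; the columns are invented.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL"],
    "month": [1, 2, 1, 1, 2],
    "dep_delay": [5.0, 20.0, -3.0, 12.0, 0.0],
})

# filter(flights, month == 1)
january = flights[flights["month"] == 1]

# select(flights, carrier, dep_delay)
subset = flights[["carrier", "dep_delay"]]

# arrange(flights, desc(dep_delay))
ordered = flights.sort_values("dep_delay", ascending=False)

# flights %>% group_by(carrier) %>% summarise(mean_delay = mean(dep_delay))
mean_delay = (
    flights.groupby("carrier", as_index=False)
    .agg(mean_delay=("dep_delay", "mean"))
)
```

Each dplyr verb maps onto a single pandas expression here, which is roughly the level at which the two syntaxes can be compared.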

October 16, 2014

Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder: in part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store, and in part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code and data at this project’s GitHub repo. Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium. ...

September 16, 2014 · Tom Augspurger

Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps: data acquisition, data tidying, exploratory analysis, model building, production. As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not supposed to happen”. ...

September 4, 2014

Practical Pandas Part 1 - Reading the Data

This is the first post in a series where I’ll show how I use pandas on real-world datasets. For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those. ...

August 26, 2014

Tidy Data in Action

Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language-specific. A tidy dataset must satisfy three criteria (page 4 in Wickham’s paper): each variable forms a column; each observation forms a row; each type of observational unit forms a table. In this StackOverflow post, the asker had some data on NBA games, and wanted to know the number of days since a team last played. Here’s the example data: ...
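As a rough illustration of why the tidy layout helps with a "days since last game" question, here is a small sketch. The `games` frame below is invented for this example and is not the data from the StackOverflow post, and the approach (`melt` followed by a grouped `diff`) is one plausible way to do it, not necessarily the post's solution.

```python
import pandas as pd

# Made-up "wide" data: one row per game, with a column for each team.
games = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-03", "2014-01-04"]),
    "away_team": ["BOS", "LAL", "BOS"],
    "home_team": ["LAL", "CHI", "CHI"],
})

# Tidy it: one row per (game, team) observation.
tidy = games.melt(
    id_vars="date",
    value_vars=["away_team", "home_team"],
    var_name="home_or_away",
    value_name="team",
).sort_values(["team", "date"])

# Within each team, the gap to the previous game in days.
# A team's first game has no previous game, so it gets NaN.
tidy["days_since_last_game"] = tidy.groupby("team")["date"].diff().dt.days
```

Once each team-game is its own row, the per-team calculation becomes a one-line group operation instead of a lookup across two team columns.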

March 27, 2014

Organizing Papers

As a graduate student, you read a lot of journal articles… a lot. With the material in the articles being as difficult as it is, I didn’t want to worry about organizing everything as well. That’s why I wrote this script to help (I may have also been procrastinating from studying for my qualifiers). This was one of my earliest little projects, so I’m not claiming that this is the best way to do anything. ...

February 13, 2014