Modern Pandas (Part 6): Visualization

This is part 6 in my series on writing modern idiomatic pandas. Series contents: Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling. Visualization and Exploratory Analysis: A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren’t that important, but some brief background might be useful so we can transfer the takeaways to Python. The competing systems are “base R”, which is the plotting system built into the language, and ggplot2, Hadley Wickham’s implementation of the grammar of graphics. For those interested in more details, start with ...

April 28, 2016

Modern Pandas (Part 5): Tidy Data

This is part 5 in my series on writing modern idiomatic pandas. Reshaping & Tidy Data. “Structuring datasets to facilitate analysis” (Wickham 2014). So, you’ve sat down to analyze a new dataset. What do you do first? In episode 11 of Not So Standard Deviations, Hilary and Roger discussed their typical approaches. I’m with Hilary on this one: you should make sure your data is tidy before you do any plots, filtering, transformations, summary statistics, regressions… Without a tidy dataset, you’ll be fighting your tools to get the result you need. With a tidy dataset, it’s relatively easy to do all of those. ...

April 22, 2016

Modern Pandas (Part 3): Indexes

This is part 3 in my series on writing modern idiomatic pandas. Indexes can be a difficult concept to grasp at first. I suspect this is partly because they’re somewhat peculiar to pandas. These aren’t like the indexes put on relational database tables for performance optimizations. Rather, they’re more like the row_labels of an R DataFrame, but much more capable. ...
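To make the "row labels" idea concrete, here is a minimal sketch that is not taken from the post itself. The city names and values are invented for illustration. It shows two things labels make possible: selecting by label, and arithmetic that lines up on labels rather than on position.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
temps = pd.Series([21.5, 18.0, 25.3], index=["Austin", "Boston", "Chicago"])
adjustment = pd.Series([1.0, 2.0], index=["Chicago", "Austin"])

# Label-based selection: look a row up by its label, not its position.
boston_temp = temps.loc["Boston"]

# Arithmetic aligns on labels. "Boston" has no match in `adjustment`,
# so its result is NaN instead of being paired with the wrong row.
aligned = temps + adjustment
```

Note that `adjustment` lists its labels in a different order than `temps`; the addition still pairs Austin with Austin and Chicago with Chicago.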

April 11, 2016

Modern Pandas (Part 4): Performance

This is part 4 in my series on writing modern idiomatic pandas. Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas, we all benefit from his and others’ hard work. This post will focus mainly on making efficient use of pandas and NumPy. ...

April 8, 2016

Modern Pandas (Part 2): Method Chaining

This is part 2 in my series on writing modern idiomatic pandas. Method chaining, where you call methods on an object one after another, is in vogue at the moment. It’s always been a style of programming that’s been possible with pandas, and over the past several releases, we’ve added methods that enable even more chaining. ...
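As a rough sketch of what "calling methods one after another" can look like, the example below chains a few standard pandas methods. The `flights` frame and its column names are hypothetical, made up for this illustration rather than taken from the post.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for this sketch.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [5.0, 25.0, -3.0, 12.0, 40.0],
})

# Each method returns a new object, so the steps read top to bottom
# without intermediate variables.
mean_delay_hours = (
    flights
    .query("dep_delay > 0")                                # keep delayed flights only
    .assign(delay_hours=lambda df: df["dep_delay"] / 60)   # derive a new column
    .groupby("carrier")["delay_hours"]
    .mean()                                                # average delay per carrier
    .sort_values(ascending=False)                          # worst carrier first
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps the pipeline easy to read and to comment.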

April 4, 2016

Modern Pandas (Part 1)

This is part 1 in my series on writing modern idiomatic pandas. Effective Pandas. Introduction: This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It’s targeted at an intermediate level: people who have some experience with pandas, but are looking to improve. Prior Art: There are many great resources for learning pandas; this is not one of them. For beginners, I typically recommend Greg Reda’s 3-part introduction, especially if they’re familiar with SQL. Of course, there’s the pandas documentation itself. I gave a talk at PyData Seattle targeted as an introduction if you prefer video form. Wes McKinney’s Python for Data Analysis is still the go-to book (and is also a really good introduction to NumPy). Jake VanderPlas’s Python Data Science Handbook, in early release, is great too. Kevin Markham has a video series for beginners learning pandas. ...

March 21, 2016

dplyr and pandas

This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We’ll work through the introductory dplyr vignette to analyze some flight data. I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each Python call. ...

October 16, 2014

Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder: in part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store; in part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code and data at this project’s GitHub repo. Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium. ...

September 16, 2014 · Tom Augspurger

Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps: data acquisition, data tidying, exploratory analysis, model building, production. As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not supposed to happen”. ...

September 4, 2014

Practical Pandas Part 1 - Reading the Data

This is the first post in a series where I’ll show how I use pandas on real-world datasets. For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those. ...

August 26, 2014