Tom's Blog

Modern Pandas (Part 7): Timeseries

This is part 7 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Timeseries Pandas started out in the financial world, so naturally it has strong timeseries support. The first half of this post will look at pandas’ capabilities for manipulating time series data. The second half will discuss modelling time series data with statsmodels. %matplotlib inline import os import numpy as np import pandas as pd import pandas_datareader.data as web import seaborn as sns import matplotlib.pyplot as plt sns.set(style='ticks', context='talk') if int(os.environ.get("MODERN_PANDAS_EPUB", 0)): import prep # noqa Let’s grab some stock data for Goldman Sachs using the pandas-datareader package, which spun off of pandas: ...

Modern Pandas (Part 6): Visualization

This is part 6 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Visualization and Exploratory Analysis A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren’t that important, but some brief background might be useful so we can transfer the takeaways to Python. The competing systems are “base R”, which is the plotting system built into the language, and ggplot2, Hadley Wickham’s implementation of the grammar of graphics. For those interested in more details, start with ...

Modern Pandas (Part 5): Tidy Data

This is part 5 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Reshaping & Tidy Data Structuring datasets to facilitate analysis (Wickham 2014) So, you’ve sat down to analyze a new dataset. What do you do first? In episode 11 of Not So Standard Deviations, Hilary and Roger discussed their typical approaches. I’m with Hilary on this one, you should make sure your data is tidy. Before you do any plots, filtering, transformations, summary statistics, regressions… Without a tidy dataset, you’ll be fighting your tools to get the result you need. With a tidy dataset, it’s relatively easy to do all of those. ...

Modern Panadas (Part 3): Indexes

This is part 3 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Indexes can be a difficult concept to grasp at first. I suspect this is partly becuase they’re somewhat peculiar to pandas. These aren’t like the indexes put on relational database tables for performance optimizations. Rather, they’re more like the row_labels of an R DataFrame, but much more capable. ...

Modern Pandas (Part 4): Performance

This is part 4 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas, we all benefit from his and others’ hard work. This post will focus mainly on making efficient use of pandas and NumPy. ...

Modern Pandas (Part 2): Method Chaining

This is part 2 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Method Chaining Method chaining, where you call methods on an object one after another, is in vogue at the moment. It’s always been a style of programming that’s been possible with pandas, and over the past several releases, we’ve added methods that enable even more chaining. ...

Modern Pandas (Part 1)

This is part 1 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Effective Pandas Introduction This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It’s targeted at an intermediate level: people who have some experience with pandas, but are looking to improve. Prior Art There are many great resources for learning pandas; this is not one of them. For beginners, I typically recommend Greg Reda’s 3-part introduction, especially if they’re familiar with SQL. Of course, there’s the pandas documentation itself. I gave a talk at PyData Seattle targeted as an introduction if you prefer video form. Wes McKinney’s Python for Data Analysis is still the goto book (and is also a really good introduction to NumPy as well). Jake VanderPlas’s Python Data Science Handbook, in early release, is great too. Kevin Markham has a video series for beginners learning pandas. ...

dplyr and pandas

This notebook compares pandas and dplyr. The comparison is just on syntax (verbage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We’ll work through the introductory dplyr vignette to analyze some flight data. I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each python call. ...

Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder: In part 1 we got dataset with my cycling data from last year merged and stored in an HDF5 store In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code and data at this project’s GitHub repo. Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium. ...

Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from data acqusition data tidying exploratory analysis model building production As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not supposed to happen”. ...