Maintaining Performance

As pandas’ documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes two things: pandas’ current setup for monitoring performance, and my personal debugging strategy for understanding and fixing performance regressions when they occur. I hope that the first topic is useful for library maintainers and the second topic is generally useful for people writing performance-sensitive code. ...

April 1, 2020

pandas + binder

This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button: For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is viewable both as a static site at examples.dask.org and as executable notebooks on mybinder. ...

July 21, 2019

Moral Philosophy for pandas or: What is `.values`?

The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values? The tweet read: “Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values)” (Tom Augspurger, @TomAugspurger, August 6, 2018). I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want: NumPy ndarray; pandas Categorical (or all of the above); an Index or any of its subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc.) (or all of the above); None or all of the above. I was prompted to write this post because a.) this is an (unfortunately) confusing topic and b.) it’s undergoing a lot of change right now (and, c.) I had this awesome title in my head). ...
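To make the poll concrete, here is a minimal sketch (not taken from the post itself) showing two of the possible answers. The variable names are illustrative, and the behavior shown assumes a reasonably recent pandas release.

```python
import numpy as np
import pandas as pd

# A plain numeric Series is backed by a NumPy array.
numeric = pd.Series([1, 2, 3])
numeric_values = numeric.values

# A Series with the "category" dtype returns a pandas Categorical instead.
categorical = pd.Series(["a", "b", "a"], dtype="category")
categorical_values = categorical.values

print(type(numeric_values).__name__)
print(type(categorical_values).__name__)
```

The same attribute returns different container types depending on the dtype, which is the source of the confusion the poll pokes at.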

August 14, 2018

Modern Pandas (Part 8): Scaling

This is part 8 in my series on writing modern idiomatic pandas. The series: Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling. As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in-memory container. You can’t have a DataFrame larger than your machine’s RAM. In practice, your available RAM should be several times the size of your dataset, as you or pandas will have to make intermediate copies as part of the analysis. ...
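One way to see how much RAM a DataFrame occupies is `DataFrame.memory_usage`. The following is a small sketch under assumed data (the column names and sizes are made up for illustration), not an excerpt from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one million rows, two 8-byte numeric columns.
n_rows = 1_000_000
df = pd.DataFrame({
    "x": np.arange(n_rows, dtype="int64"),
    "y": np.zeros(n_rows, dtype="float64"),
})

# deep=True also counts the contents of object columns (none here);
# the result includes the index as well as the columns.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"{bytes_used / 1e6:.1f} MB")
```

Comparing a figure like this against available RAM, with headroom for intermediate copies, gives a rough sense of whether a dataset fits comfortably in memory.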

April 23, 2018

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody. A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas. They didn’t just want to make a NumPy array of IP addresses for a few reasons: ...

February 12, 2018

Modern Pandas (Part 7): Timeseries

This is part 7 in my series on writing modern idiomatic pandas. Timeseries: Pandas started out in the financial world, so naturally it has strong timeseries support. The first half of this post will look at pandas’ capabilities for manipulating time series data. The second half will discuss modelling time series data with statsmodels.

    %matplotlib inline
    import os
    import numpy as np
    import pandas as pd
    import pandas_datareader.data as web
    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.set(style='ticks', context='talk')

    if int(os.environ.get("MODERN_PANDAS_EPUB", 0)):
        import prep # noqa

Let’s grab some stock data for Goldman Sachs using the pandas-datareader package, which spun off of pandas: ...

May 13, 2016

Modern Pandas (Part 6): Visualization

This is part 6 in my series on writing modern idiomatic pandas. Visualization and Exploratory Analysis: A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren’t that important, but some brief background might be useful so we can transfer the takeaways to Python. The competing systems are “base R”, which is the plotting system built into the language, and ggplot2, Hadley Wickham’s implementation of the grammar of graphics. For those interested in more details, start with ...

April 28, 2016

Modern Pandas (Part 5): Tidy Data

This is part 5 in my series on writing modern idiomatic pandas. Reshaping & Tidy Data: “Structuring datasets to facilitate analysis” (Wickham 2014). So, you’ve sat down to analyze a new dataset. What do you do first? In episode 11 of Not So Standard Deviations, Hilary and Roger discussed their typical approaches. I’m with Hilary on this one: you should make sure your data is tidy before you do any plots, filtering, transformations, summary statistics, regressions… Without a tidy dataset, you’ll be fighting your tools to get the result you need. With a tidy dataset, it’s relatively easy to do all of those. ...

April 22, 2016

Modern Pandas (Part 3): Indexes

This is part 3 in my series on writing modern idiomatic pandas. Indexes can be a difficult concept to grasp at first. I suspect this is partly because they’re somewhat peculiar to pandas. These aren’t like the indexes put on relational database tables for performance optimizations. Rather, they’re more like the row_labels of an R DataFrame, but much more capable. ...
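As a rough illustration of indexes acting as row labels, the sketch below uses made-up data (the `prices` and `quantities` names are hypothetical) to show label-based selection and automatic alignment.

```python
import pandas as pd

# Two Series with the same labels stored in different orders.
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
quantities = pd.Series([1, 2, 3], index=["c", "b", "a"])

# Select by label rather than by position.
b_price = prices.loc["b"]

# Arithmetic aligns on the index labels, not on position:
# "a" pairs 10.0 with 3, "b" pairs 20.0 with 2, "c" pairs 30.0 with 1.
totals = prices * quantities
print(totals)
```

Alignment on labels is one of the ways a pandas index goes beyond a database index, which exists mainly to speed up lookups.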

April 11, 2016

Modern Pandas (Part 4): Performance

This is part 4 in my series on writing modern idiomatic pandas. Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas, we all benefit from his and others’ hard work. This post will focus mainly on making efficient use of pandas and NumPy. ...

April 8, 2016