Tom's Blog

What's Next?

Some personal news: Last Friday was my last day at Anaconda. Next week, I’m joining Microsoft’s AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I’m extremely excited about what I’ll be working on at Microsoft.

Reflections: I was inspired to write this section by Jim Crist’s post on a similar topic: https://jcristharif.com/farewell-to-anaconda.html. I’ll highlight some of the projects I worked on while at Anaconda. If you want to skip the navel gazing, skip down to what’s next. ...

Maintaining Performance

As pandas’ documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes

1. pandas’ current setup for monitoring performance
2. My personal debugging strategy for understanding and fixing performance regressions when they occur

I hope that the first topic is useful for library maintainers and the second topic is generally useful for people writing performance-sensitive code. ...

Compatibility Code

Most libraries with dependencies will want to support multiple versions of that dependency. But supporting old versions is a pain: it requires compatibility code, code that is around solely to get the same output from different versions of a library. This post gives some advice on writing compatibility code.

1. Don’t write your own version parser
2. Centralize all version parsing
3. Use consistent version comparisons
4. Use Python’s argument unpacking
5. Clean up unused compatibility code

1. Don’t write your own version parser

It can be tempting to just do something like ...
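As a rough illustration of the centralizing, consistent-comparison, and argument-unpacking advice in the list above (this is not code from the post), here is a minimal sketch. Every name in it is hypothetical: `DEPENDENCY_VERSION`, the `DEP_GE_*` flags, `read_table`, and the `engine` and `strict` keywords are invented for the example. The version tuple is hard-coded so the sketch runs on its own; in real code it would come from an established parser such as `packaging.version.Version` rather than from hand-written string handling.

```python
# compat-module sketch: all names below are hypothetical.

# Assumption: this tuple came from a real version parser applied to the
# dependency's version string. It is hard-coded so the sketch is runnable.
DEPENDENCY_VERSION = (1, 10, 0)

# Centralized flags, always phrased the same way: "dependency is at least X".
DEP_GE_1_9 = DEPENDENCY_VERSION >= (1, 9, 0)
DEP_GE_2_0 = DEPENDENCY_VERSION >= (2, 0, 0)


def read_table(path, **kwargs):
    """Stand-in for a dependency function whose keywords vary by version."""
    return path, kwargs


def load(path):
    # Build the keyword arguments conditionally, then unpack them once,
    # instead of duplicating the whole call in every version branch.
    kwargs = {}
    if DEP_GE_1_9:
        kwargs["engine"] = "fast"  # hypothetical keyword added in 1.9
    if DEP_GE_2_0:
        kwargs["strict"] = True  # hypothetical keyword added in 2.0
    return read_table(path, **kwargs)


# Why hand-rolled string checks go wrong: strings compare character by
# character, so "1.10.0" sorts before "1.9.0".
naive_says_older = "1.10.0" < "1.9.0"
```

Keeping the flags in one module means that dropping support for an old version is a matter of deleting one flag and the branches that reference it.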

Dask Workshop

Dask Summit Recap: Last week was the first Dask Developer Workshop. This brought together many of the core Dask developers and its heavy users to discuss the project. I want to share some of the experience with those who weren’t able to attend. This was a great event. Aside from any technical discussions, it was nice to meet all the people. From new acquaintances to people you’re on weekly calls with, it was great to interact with everyone. ...

pandas + binder

This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button: For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is viewable both as a static site at examples.dask.org and as executable notebooks on mybinder. ...

A Confluence of Extension

This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable an explosion of creativity in the community. Each of the protocols / interfaces we’ll consider deals with extending.

1. NEP-13: NumPy __array_ufunc__
2. NEP-18: NumPy __array_function__
3. Pandas Extension types
4. Custom Dask Collections

First, a bit of brief background on each. NEP-13 and NEP-18 each deal with using the NumPy API on non-NumPy ndarray objects. For example, you might want to apply a ufunc like np.log to a Dask array. ...
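To make the NEP-13 idea concrete, here is a minimal sketch of a non-ndarray object taking part in a ufunc call. `WrappedArray` is a hypothetical class made up for this example; it is not part of NumPy, pandas, or Dask, and a real array library would handle many more cases (the `out` keyword, reductions, mixed types, and so on).

```python
import numpy as np


class WrappedArray:
    """Hypothetical container that is not an ndarray subclass."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle plain calls such as np.log(x); decline reduce,
        # accumulate, and the other ufunc methods.
        if method != "__call__":
            return NotImplemented
        # Unwrap WrappedArray inputs, apply the ufunc, and re-wrap the result.
        unwrapped = [x.data if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(ufunc(*unwrapped, **kwargs))


arr = WrappedArray([1.0, np.e])
result = np.log(arr)  # NumPy hands the call to WrappedArray.__array_ufunc__
```

The caller keeps writing `np.log(...)`, and the object decides what that means for its own data, which is the kind of hook that would let a ufunc work on something like a Dask array.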

Tabular Data in Scikit-Learn and Dask-ML

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We’ll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays and DataFrames.

    import dask
    import dask.array as da
    import dask.dataframe as dd
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import fastparquet
    from distributed import Client
    from distributed.utils import format_bytes

Background: For the most part, Scikit-Learn uses NumPy ndarrays or SciPy sparse matrices for its in-memory data structures. This is great for many reasons, but one major drawback is that you can’t store heterogeneous (AKA tabular) data in these containers. These are datasets where different columns of the table have different data types (some ints, some floats, some strings, etc.). ...

Distributed Auto-ML with TPOT with Dask

This work is supported by Anaconda Inc.

This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose new models to try out in the next generation.

Parallelizing TPOT: In TPOT-730, we made some modifications to TPOT to support distributed training. As a TPOT user, the only changes you need to make to your code are ...

Moral Philosophy for pandas or: What is `.values`?

The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values?

    Pop Quiz! What are the possible results for the following:

    >>> type(pandas.Series.values)

    Tom Augspurger (@TomAugspurger), August 6, 2018

I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want.

1. NumPy ndarray
2. pandas Categorical (or all of the above)
3. An Index or any of its subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc.) (or all of the above)
4. None or all of the above

I was prompted to write this post because a.) this is an (unfortunately) confusing topic and b.) it’s undergoing a lot of change right now (and, c.) I had this awesome title in my head). ...
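Two of the cases are easy to check for yourself. The sketch below is only an illustration and does not settle the whole quiz: it covers an integer Series and a categorical Series, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A plain integer Series and a categorical Series.
ints = pd.Series([1, 2, 3])
cats = pd.Series(["a", "b", "a"], dtype="category")

# .values hands back a different container depending on the dtype.
ints_values_type = type(ints.values)  # a NumPy ndarray
cats_values_type = type(cats.values)  # a pandas Categorical
```

So the answer already depends on the dtype of the Series, which is part of why the question is confusing.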

Modern Pandas (Part 8): Scaling

This is part 8 in my series on writing modern idiomatic pandas.

1. Modern Pandas
2. Method Chaining
3. Indexes
4. Fast Pandas
5. Tidy Data
6. Visualization
7. Time Series
8. Scaling

As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in-memory container. You can’t have a DataFrame larger than your machine’s RAM. In practice, your available RAM should be several times the size of your dataset, as you or pandas will have to make intermediate copies as part of the analysis. ...