My 2022 Year in Books

It’s “Year in X” time, and here’s my 2022 Year in Books on GoodReads. I’ll cover some highlights here. Many of these recommendations came from the Incomparable’s Book Club, part of the main Incomparable podcast, and in particular from episode 600, “The Machine was a Vampire”, which is a roundup of their favorites from the 2010s. Bookended by Murderbot Diaries: I started and ended this year (so far) with a couple of installments in the Murderbot Diaries....

December 21, 2022

Podcast: Revolutions

Mike Duncan is wrapping up his excellent Revolutions podcast. If you’re at all interested in history, then now is a great time to pick it up. He takes the concept of “a revolution” and examines it through a series of revolutions from throughout history. The appendix episodes from the last few weeks have really tied things together, looking at what’s common (and not) across all the revolutions covered in the series....

December 20, 2022

Rebooting

Like some others, I’m getting back into blogging. I’ll be “straying from my lane” and won’t just be writing about Python data libraries (though there will still be some of that). If you too would like to blog more, I’d encourage you to read Simon Willison’s “What to Blog About” and Matt Rocklin’s “Write Short Blogposts”. Because I’m me, I couldn’t just make a new post. I also had to switch static site generators, just because....

December 18, 2022

What's Next?

Some personal news: Last Friday was my last day at Anaconda. Next week, I’m joining Microsoft’s AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I’m extremely excited about what I’ll be working on at Microsoft. Reflections: I was inspired to write this section by Jim Crist’s post on a similar topic: https://jcristharif.com/farewell-to-anaconda.html. I’ll highlight some of the projects I worked on while at Anaconda....

November 11, 2020

Maintaining Performance

As pandas’ documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes two things: pandas’ current setup for monitoring performance, and my personal debugging strategy for understanding and fixing performance regressions when they occur. I hope that the first topic is useful for library maintainers and the second is generally useful for people writing performance-sensitive code....

April 1, 2020

Compatibility Code

Most libraries with dependencies will want to support multiple versions of each dependency. But supporting old versions is a pain: it requires compatibility code, code that exists solely to get the same output from different versions of a library. This post gives some advice on writing compatibility code: don’t write your own version parser; centralize all version parsing; use consistent version comparisons; use Python’s argument unpacking; clean up unused compatibility code. 1....
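As a rough illustration of two of those tips (centralizing version checks and using argument unpacking), here is a minimal sketch. It uses the Python interpreter itself as a stand-in for a library dependency, and the names `PY310`, `ZIP_KWARGS`, and `paired` are hypothetical, not taken from the post.

```python
import sys

# Hypothetical "compat" section: every version check lives in one place,
# so the rest of the code base never compares versions directly.
PY310 = sys.version_info >= (3, 10)

# zip() only accepts the strict= keyword on Python 3.10 and newer.
# Build the keyword arguments once and unpack them at each call site.
ZIP_KWARGS = {"strict": True} if PY310 else {}


def paired(left, right):
    """Pair two sequences, checking lengths where the interpreter supports it."""
    return list(zip(left, right, **ZIP_KWARGS))


print(paired([1, 2], ["a", "b"]))
```

When the old version is dropped, the cleanup is limited to the flag and the dictionary; the call sites do not need a version branch of their own.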

December 12, 2019

Dask Workshop

Dask Summit Recap: Last week was the first Dask Developer Workshop. This brought together many of the core Dask developers and its heavy users to discuss the project. I want to share some of the experience with those who weren’t able to attend. This was a great event. Aside from any technical discussions, it was nice to meet all the people. From new acquaintances to people you’re on weekly calls with, it was great to interact with everyone....

December 12, 2019

pandas + binder

This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button: For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is viewable both as a static site at examples.dask.org and as executable notebooks on mybinder....

July 21, 2019

A Confluence of Extension

This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable an explosion of creativity in the community. Each of the protocols / interfaces we’ll consider deals with extending: NEP-13: NumPy __array_ufunc__; NEP-18: NumPy __array_function__; Pandas Extension types; Custom Dask Collections. First, a bit of brief background on each. NEP-13 and NEP-18 each deal with using the NumPy API on non-NumPy ndarray objects....
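To make the NEP-18 idea concrete, here is a minimal sketch of a container that is not an ndarray but still works with a NumPy function. The `Wrapped` class is hypothetical and handles only the simplest case (its own instances passed as positional arguments); it assumes a NumPy version where `__array_function__` dispatch is enabled by default (1.17 or later).

```python
import numpy as np


class Wrapped:
    """Hypothetical container that opts in to the NumPy API via NEP-18."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap our own objects, call the real NumPy function on the
        # underlying arrays, then wrap the result again.
        unwrapped = tuple(a.data if isinstance(a, Wrapped) else a for a in args)
        return Wrapped(func(*unwrapped, **kwargs))


# np.mean sees that Wrapped defines __array_function__ and defers to it.
result = np.mean(Wrapped([1.0, 2.0, 3.0]))
print(type(result).__name__, float(result.data))
```

The caller keeps writing plain `np.mean(...)`; the container decides what that means for its own type.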

June 18, 2019

Tabular Data in Scikit-Learn and Dask-ML

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We’ll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays and DataFrames.
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import seaborn as sns
import fastparquet
from distributed import Client
from distributed....

September 17, 2018