Dask Workshop

Dask Summit Recap
Last week was the first Dask Developer Workshop. It brought together many of Dask’s core developers and heavy users to discuss the project. I want to share some of the experience with those who weren’t able to attend. This was a great event. Aside from any technical discussions, it was nice to meet all the people. From new acquaintances to people you’re on weekly calls with, it was great to interact with everyone. ...

December 12, 2019

pandas + binder

This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button: [Binder button]
For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is viewable both as a static site at examples.dask.org and as executable notebooks on mybinder. ...

July 21, 2019

A Confluence of Extension

This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable an explosion of creativity in the community. Each of the protocols / interfaces we’ll consider deals with extending.
- NEP-13: NumPy __array_ufunc__
- NEP-18: NumPy __array_function__
- Pandas Extension types
- Custom Dask Collections
First, a bit of background on each. NEP-13 and NEP-18 each deal with using the NumPy API on non-NumPy ndarray objects. For example, you might want to apply a ufunc like np.log to a Dask array. ...

June 18, 2019

Tabular Data in Scikit-Learn and Dask-ML

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We’ll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays and DataFrames.
    import dask
    import dask.array as da
    import dask.dataframe as dd
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import fastparquet
    from distributed import Client
    from distributed.utils import format_bytes
Background
For the most part, Scikit-Learn uses NumPy ndarrays or SciPy sparse matrices for its in-memory data structures. This is great for many reasons, but one major drawback is that you can’t store heterogeneous (AKA tabular) data in these containers. These are datasets where different columns of the table have different data types (some ints, some floats, some strings, etc.). ...

September 17, 2018

Distributed Auto-ML with TPOT with Dask

This work is supported by Anaconda Inc. This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and to choose new models to try out in the next generation.
Parallelizing TPOT
In TPOT-730, we made some modifications to TPOT to support distributed training. As a TPOT user, the only changes you need to make to your code are ...

August 30, 2018

Moral Philosophy for pandas or: What is `.values`?

The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values?
Pop Quiz! What are the possible results for the following:
    >>> type(pandas.Series.values)
Tom Augspurger (@TomAugspurger), August 6, 2018
I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want.
- NumPy ndarray
- pandas Categorical (or all of the above)
- An Index or any of its subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc.) (or all of the above)
- None or all of the above
I was prompted to write this post because a.) this is an (unfortunately) confusing topic, b.) it’s undergoing a lot of change right now, and c.) I had this awesome title in my head. ...

August 14, 2018

Modern Pandas (Part 8): Scaling

This is part 8 in my series on writing modern idiomatic pandas.
1. Modern Pandas
2. Method Chaining
3. Indexes
4. Fast Pandas
5. Tidy Data
6. Visualization
7. Time Series
8. Scaling
As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in-memory container. You can’t have a DataFrame larger than your machine’s RAM. In practice, your available RAM should be several times the size of your dataset, as you or pandas will have to make intermediate copies as part of the analysis. ...

April 23, 2018

dask-ml 0.4.1 Released

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge
    $ conda install -c conda-forge dask-ml
and wheels and the source are available on PyPI
    $ pip install dask-ml
I wanted to highlight one change that touches on a topic I mentioned in my first post on scalable Machine Learning. I discussed how, in my limited experience, a common workflow was to train on a small batch of data and predict for a much larger set of data. The training data easily fits in memory on a single machine, but the full dataset does not. ...

February 13, 2018

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody. A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas. They didn’t just want to make a NumPy array of IP addresses for a few reasons: ...

February 12, 2018

Easy distributed training with Joblib and dask

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and to Anaconda for sending me there. This article describes some improvements we made to training scikit-learn models on a cluster. ...

February 5, 2018