Distributed Auto-ML with TPOT with Dask

This work is supported by Anaconda Inc. This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and to choose new models to try out in the next generation. Parallelizing TPOT: In TPOT-730, we made some modifications to TPOT to support distributed training....

August 30, 2018

Moral Philosophy for pandas or: What is `.values`?

The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values? The tweet read: “Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values)” (Tom Augspurger, @TomAugspurger, August 6, 2018). I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want: NumPy ndarray; pandas Categorical (or all of the above); an Index or any of its subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc....
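The question in this excerpt can be explored directly. The following is a minimal sketch, assuming pandas and NumPy are installed; it only checks two of the listed options and makes no claim about the others. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A Series backed by a plain NumPy dtype.
numeric = pd.Series([1, 2, 3])

# A Series backed by one of pandas' own array types.
categorical = pd.Series(["a", "b", "a"], dtype="category")

numeric_values = numeric.values
categorical_values = categorical.values

# Print the concrete type that .values returns in each case.
print(type(numeric_values).__name__)
print(type(categorical_values).__name__)
```

The point of the comparison is that `.values` does not return the same type for every Series: the result depends on the dtype of the data.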

August 14, 2018

Modern Pandas (Part 8): Scaling

This is part 8 in my series on writing modern idiomatic pandas. The series: Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling. As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in-memory container. You can’t have a DataFrame larger than your machine’s RAM....

April 23, 2018

dask-ml 0.4.1 Released

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge (`conda install -c conda-forge dask-ml`), and wheels and the source are available on PyPI (`pip install dask-ml`). I wanted to highlight one change that touches on a topic I mentioned in my first post on scalable Machine Learning....

February 13, 2018

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody. A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas....

February 12, 2018

Easy distributed training with Joblib and dask

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some improvements we made to training scikit-learn models on a cluster....

February 5, 2018

Rewriting scikit-learn for big data, in under 9 hours.

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. Towards the end of our week, Gael threw out the observation that for many applications, you don’t need to train on the entire dataset; a sample is often sufficient. But it’d be nice if the trained estimator could transform and predict for dask arrays, getting all the nice distributed parallelism and memory management dask brings....
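The idea of fitting on a sample and then predicting over the full dataset block by block can be sketched in plain Python. This is only an illustration under stated assumptions: `MeanThresholdClassifier` and `predict_blockwise` are hypothetical names, not scikit-learn or dask APIs, and a generator of lists stands in for the blocks of a dask array.

```python
from statistics import mean


class MeanThresholdClassifier:
    """Toy model (hypothetical): labels a value 1 if it exceeds the training mean."""

    def fit(self, sample):
        # Training only needs the small in-memory sample.
        self.threshold_ = mean(sample)
        return self

    def predict(self, block):
        return [int(x > self.threshold_) for x in block]


def predict_blockwise(model, blocks):
    # Predict lazily, one block at a time, so only one block is processed at once.
    for block in blocks:
        yield model.predict(block)


full_data = list(range(100))

# "Small fit": train on every tenth value only.
sample = full_data[::10]
model = MeanThresholdClassifier().fit(sample)

# "Big predict": walk over the full dataset in blocks of 25.
blocks = (full_data[i:i + 25] for i in range(0, len(full_data), 25))
predictions = [label for block in predict_blockwise(model, blocks) for label in block]
```

In a real setting the blocks would be processed in parallel by a scheduler rather than in a simple loop; the sketch only shows how training and prediction can operate on different amounts of data.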

January 28, 2018

dask-ml

Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with `pip install dask-ml`. Packages are currently building for conda-forge, and will be up later today (`conda install -c conda-forge dask-ml`). The Goals: dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes to do the same in the machine learning domain....

October 26, 2017

Scalable Machine Learning (Part 3): Parallel

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part three of my series on scalable machine learning. The series: Small Fit, Big Predict; Scikit-Learn Partial Fit; Parallel Machine Learning. You can download a notebook of this post here. In part one, I talked about the types of constraints that push us to parallelize or distribute a machine learning workload. Today, we’ll be talking about the second constraint, “I’m constrained by time, and would like to fit more models at once, by using all the cores of my laptop, or all the machines in my cluster”....

September 16, 2017

Scalable Machine Learning (Part 2): Partial Fit

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part two of my series on scalable machine learning. The series: Small Fit, Big Predict; Scikit-Learn Partial Fit. You can download a notebook of this post here. Scikit-learn supports out-of-core learning (fitting a model on a dataset that doesn’t fit in RAM) through its partial_fit API. See here. The basic idea is that, for certain estimators, learning can be done in batches....
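The batch-wise idea can be illustrated without scikit-learn. The sketch below is a toy, not scikit-learn's implementation: `RunningMeanEstimator` and `batches` are hypothetical names, and the only thing borrowed from the excerpt is the convention of a `partial_fit` method that is called once per batch.

```python
from dataclasses import dataclass


@dataclass
class RunningMeanEstimator:
    """Toy estimator (hypothetical, not part of scikit-learn) that learns a mean in batches."""

    n_seen: int = 0
    mean_: float = 0.0

    def partial_fit(self, batch):
        # Fold one batch into the running mean without revisiting earlier batches.
        for x in batch:
            self.n_seen += 1
            self.mean_ += (x - self.mean_) / self.n_seen
        return self


def batches(values, size):
    # Yield consecutive chunks, standing in for reading a large file piece by piece.
    for start in range(0, len(values), size):
        yield values[start:start + size]


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
model = RunningMeanEstimator()
for batch in batches(data, size=2):
    model.partial_fit(batch)
```

Because each call only needs the current batch and the state accumulated so far, the full dataset never has to be in memory at once, which is what makes this style of learning usable out of core.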

September 15, 2017