Tom's Blog

dask-ml 0.4.1 Released

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge $ conda install -c conda-forge dask-ml and wheels and the source are available on PyPI $ pip install dask-ml I wanted to highlight one change, that touches on a topic I mentioned in my first post on scalable Machine Learning. I discussed how, in my limited experience, a common workflow was to train on a small batch of data and predict for a much larger set of data. The training data easily fits in memory on a single machine, but the full dataset does not. ...

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody. A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas. They didn’t just want to make a NumPy array of IP addresses for a few reasons: ...

Easy distributed training with Joblib and dask

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some improvements we made to improve training scikit-learn models using a cluster. ...

Rewriting scikit-learn for big data, in under 9 hours.

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. Towards the end of our week, Gael threw out the observation that for many applications, you don’t need to train on the entire dataset, a sample is often sufficient. But it’d be nice if the trained estimator would be able to transform and predict for dask arrays, getting all the nice distributed parallelism and memory management dask brings. ...

dask-ml

Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with pip install dask-ml Packages are currently building for conda-forge, and will be up later today. conda install -c conda-forge dask-ml The Goals dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes to do the same in the machine learning domain. ...

Scalable Machine Learning (Part 3): Parallel

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part three of my series on scalable machine learning. Small Fit, Big Predict Scikit-Learn Partial Fit Parallel Machine Learning You can download a notebook of this post [here][notebook]. In part one, I talked about the type of constraints that push us to parallelize or distribute a machine learning workload. Today, we’ll be talking about the second constraint, “I’m constrained by time, and would like to fit more models at once, by using all the cores of my laptop, or all the machines in my cluster”. ...

Scalable Machine Learning (Part 2): Partial Fit

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part two of my series on scalable machine learning. Small Fit, Big Predict Scikit-Learn Partial Fit You can download a notebook of this post here. Scikit-learn supports out-of-core learning (fitting a model on a dataset that doesn’t fit in RAM), through it’s partial_fit API. See here. The basic idea is that, for certain estimators, learning can be done in batches. The estimator will see a batch, and then incrementally update whatever it’s learning (the coefficients, for example). This link has a list of the algorithms that implement partial_fit. ...

Scalable Machine Learning (Part 1)

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. Anaconda is interested in scaling the scientific python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available today, and track the community’s efforts to push the boundaries. You can download a Jupyter notebook demonstrating the analysis here. Constraints I am (or was, anyway) an economist, and economists like to think in terms of constraints. How are we constrained by scale? The two main ones I can think of are ...

Dask Performace Trip

I’m faced with a fairly specific problem: Compute the pairwise distances between two matrices $X$ and $Y$ as quickly as possible. We’ll assume that $Y$ is fairly small, but $X$ may not fit in memory. This post tracks my progress.

Introducing Stitch

Today I released stitch into the wild. If you haven’t yet, check out the examples page to see an example of what stitch does, and the Github repo for how to install. I’m using this post to explain why I wrote stitch, and some issues it tries to solve. Why knitr / knitpy / stitch / RMarkdown? Each of these tools or formats have the same high-level goal: produce reproducible, dynamic (to changes in the data) reports. They take some source document (typically markdown) that’s a mixture of text and code and convert it to a destination output (HTML, PDF, docx, etc.). ...