Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with

pip install dask-ml

Packages are currently building for conda-forge, and will be up later today.

conda install -c conda-forge dask-ml

The Goals

dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes to do the same in the machine learning domain.

Put simply, we want

est = MyEstimator()
est.fit(X, y)

to work well in parallel and potentially distributed across a cluster. dask provides us with the building blocks to do that.
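To make that goal concrete, here is a minimal sketch of an estimator whose fit step is split across chunks of the data. It uses only the Python standard library rather than dask, and the names ChunkedLinearRegression and n_chunks are hypothetical, made up for illustration. It is not how dask-ml implements anything; it only shows the "summarize each chunk, then combine" shape that a parallel fit tends to take.

```python
from concurrent.futures import ThreadPoolExecutor


def _chunk_stats(chunk):
    """Return the sufficient statistics for one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


class ChunkedLinearRegression:
    """Hypothetical one-feature least-squares estimator (illustration only)."""

    def __init__(self, n_chunks=4):
        self.n_chunks = n_chunks

    def fit(self, X, y):
        pairs = list(zip(X, y))
        size = max(1, -(-len(pairs) // self.n_chunks))  # ceiling division
        chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
        # Each chunk is summarized independently, so the work can be spread out.
        with ThreadPoolExecutor() as pool:
            stats = list(pool.map(_chunk_stats, chunks))
        # Combine the per-chunk statistics into global ones.
        n, sx, sy, sxx, sxy = (sum(col) for col in zip(*stats))
        self.coef_ = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        self.intercept_ = (sy - self.coef_ * sx) / n
        return self


X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 2x + 1
est = ChunkedLinearRegression(n_chunks=3).fit(X, y)
```

The thread pool here only demonstrates the structure; pure-Python arithmetic will not speed up under threads. The point is that the calling code is still just est.fit(X, y).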

What’s Been Done

dask-ml collects some efforts that others already built:

  • distributed joblib: Scale out some scikit-learn operations to clusters (from distributed.joblib)
  • hyper-parameter search: Drop-in replacements for scikit-learn’s GridSearchCV and RandomizedSearchCV classes (from dask-searchcv)
  • distributed GLMs: Fit large Generalized Linear Models on your cluster (from dask-glm)
  • dask + xgboost: Peer a dask.distributed cluster with XGBoost running in distributed mode (from dask-xgboost)
  • dask + tensorflow: Peer a dask.distributed cluster with TensorFlow running in distributed mode (from dask-tensorflow)
  • Out-of-core learning in pipelines: Reuse scikit-learn’s out-of-core .partial_fit API in pipelines (from dask.array.learn)
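The out-of-core item in the list above relies on a simple pattern: feed the estimator one block of data at a time through partial_fit, so the full dataset never has to sit in memory. A rough sketch of that loop follows. RunningMean and iter_blocks are hypothetical stand-ins written for this example, not scikit-learn or dask-ml classes; a real pipeline would use a scikit-learn estimator that implements partial_fit.

```python
class RunningMean:
    """Hypothetical stand-in for an estimator exposing partial_fit."""

    def __init__(self):
        self.n_seen_ = 0
        self.mean_ = 0.0

    def partial_fit(self, X, y=None):
        # Update the running mean from one block without revisiting old blocks.
        for x in X:
            self.n_seen_ += 1
            self.mean_ += (x - self.mean_) / self.n_seen_
        return self


def iter_blocks(data, block_size):
    """Yield consecutive blocks, as if each were read from disk on demand."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
est = RunningMean()
for block in iter_blocks(data, block_size=2):
    est.partial_fit(block)
```

Only one block is held at a time, and the fitted state after the last block matches what a single pass over all the data would give.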

In addition to providing a single home for these existing efforts, we’ve implemented some algorithms that are designed to run in parallel and distributed across a cluster.

  • KMeans: Uses the k-means|| algorithm for initialization, and a parallelized Lloyd’s algorithm for the EM step.
  • Preprocessing: These are estimators that can be dropped into scikit-learn Pipelines, but they operate in parallel on dask collections. They’ll work well on datasets in distributed memory, and may be faster for NumPy arrays (depending on the overhead from parallelizing, and whether or not the scikit-learn implementation is already parallel).
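As an illustration of what a parallelized Lloyd’s step can look like, here is a minimal sketch in plain Python. It is an assumption-laden toy, not dask-ml’s KMeans: the function names _partial_sums and lloyd_step are hypothetical, the data is split into chunks by hand, and the starting centers are supplied directly instead of coming from k-means||. Each chunk assigns its own points to the nearest center and reports per-cluster sums and counts, which are then combined to update the centers.

```python
from concurrent.futures import ThreadPoolExecutor


def _partial_sums(chunk, centers):
    """Assign each point in a chunk to its nearest center; return sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
        k = dists.index(min(dists))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def lloyd_step(chunks, centers):
    """One Lloyd iteration: per-chunk assignment, then a combined center update."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda ch: _partial_sums(ch, centers), chunks))
    new_centers = []
    for k, old in enumerate(centers):
        count = sum(counts[k] for _, counts in results)
        if count == 0:
            new_centers.append(list(old))  # keep an empty cluster's center unchanged
            continue
        total = [sum(sums[k][d] for sums, _ in results) for d in range(len(old))]
        new_centers.append([t / count for t in total])
    return new_centers


chunks = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 10.0], [10.0, 12.0]],
]
centers = lloyd_step(chunks, centers=[[1.0, 1.0], [9.0, 9.0]])
```

Only the small per-cluster sums and counts move between workers, never the points themselves, which is what makes this step a good fit for data spread across a cluster.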

Help Contribute!

Scikit-learn is a robust, mature, stable library. dask-ml is… not. Which means there are plenty of places to contribute! Dask makes writing parallel and distributed implementations of algorithms fun. For the most part, you don’t even have to think about “where’s my data? How do I parallelize this?” Dask does that for you.

Have a look at the issues or propose a new one. I’d love to hear about issues that you’ve run into when scaling the “traditional” scientific Python stack out to larger problems.