Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with

```
pip install dask-ml
```

Packages are currently building for conda-forge, and will be up later today.

```
conda install -c conda-forge dask-ml
```
The Goals
dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes to do the same in the machine learning domain.
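As a quick illustration of what “scaling NumPy arrays” means in practice, here is a minimal dask.array sketch; the array shape and chunking are arbitrary, chosen only for the example:

```python
import dask.array as da

# a 100,000 x 1,000 array, stored as a grid of 1,000 x 1,000 NumPy blocks
x = da.random.random((100_000, 1_000), chunks=(1_000, 1_000))

# the familiar NumPy API, evaluated lazily and in parallel, block by block
centered_std = (x - x.mean(axis=0)).std(axis=0)
centered_std.compute()
```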
Put simply, we want

```python
est = MyEstimator()
est.fit(X, y)
```

to work well in parallel and potentially distributed across a cluster. dask provides us with the building blocks to do that.
What’s Been Done
dask-ml collects some efforts that others already built:

- distributed joblib: scaling out some scikit-learn operations to clusters (from distributed.joblib); see the first sketch after this list
- hyper-parameter search: Some drop-in replacements for scikit-learn’s GridSearchCV and RandomizedSearchCV classes (from dask-searchcv); see the second sketch after this list
- distributed GLMs: Fit large Generalized Linear Models on your cluster (from dask-glm)
- dask + xgboost: Peer a dask.distributed cluster with XGBoost running in distributed mode (from dask-xgboost)
- dask + tensorflow: Peer a dask.distributed cluster with TensorFlow running in distributed mode (from dask-tensorflow)
- Out-of-core learning in pipelines: Reuse scikit-learn’s out-of-core .partial_fit API in pipelines (from dask.array.learn)
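For a sense of how the first two pieces are used, here is a hedged sketch of the distributed joblib backend. The estimator and dataset are placeholders, and exactly how the “dask” backend gets registered with joblib has varied across versions of joblib and distributed:

```python
import joblib
from dask.distributed import Client  # importing distributed registers the "dask" joblib backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # connect to (or start) a dask cluster

X, y = make_classification(n_samples=10_000, n_features=20)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# scikit-learn's internal joblib parallelism now runs on the dask cluster
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```

And a sketch of the drop-in grid search, assuming it is exposed under dask_ml.model_selection; the estimator and parameter grid are made up for illustration:

```python
from dask_ml.model_selection import GridSearchCV  # drop-in for sklearn's GridSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, n_features=20)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}

# same API as scikit-learn, but the candidate fits are scheduled by dask
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```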
In addition to providing a single home for these existing efforts, we’ve implemented some algorithms that are designed to run in parallel and distributed across a cluster:

- KMeans: Uses the k-means|| algorithm for initialization, and a parallelized Lloyd’s algorithm for the EM step (see the sketch after this list).
- Preprocessing: These are estimators that can be dropped into scikit-learn Pipelines, but they operate in parallel on dask collections. They’ll work well on datasets in distributed memory, and may be faster for NumPy arrays (depending on the overhead from parallelizing, and whether or not the scikit-learn implementation is already parallel).
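A minimal sketch of the two together, assuming the estimators live in dask_ml.cluster and dask_ml.preprocessing; the array size and chunking are arbitrary:

```python
import dask.array as da
from dask_ml.cluster import KMeans
from dask_ml.preprocessing import StandardScaler

# a large dask array: one million rows split into ten blocks
X = da.random.random((1_000_000, 20), chunks=(100_000, 20))

# scale each feature in parallel, block by block, returning another dask array
X = StandardScaler().fit_transform(X)

# k-means|| initialization followed by a parallelized Lloyd's algorithm
km = KMeans(n_clusters=8, init="k-means||")
km.fit(X)
print(km.cluster_centers_)
```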
Help Contribute!
Scikit-learn is a robust, mature, stable library. dask-ml is… not. Which means there are plenty of places to contribute! Dask makes writing parallel and distributed implementations of algorithms fun. For the most part, you don’t even have to think about “Where’s my data? How do I parallelize this?” Dask does that for you.
Have a look at the issues or propose a new one. I’d love to hear about the issues you’ve run into when scaling the “traditional” scientific python stack out to larger problems.