This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable an explosion of creativity in the community.

Each of the protocols / interfaces we’ll consider deals with extension: letting third-party objects plug into NumPy, pandas, or Dask.

- NEP-13: NumPy `__array_ufunc__`
- NEP-18: NumPy `__array_function__`
- Pandas Extension types
- Custom Dask Collections

First, a bit of background on each.

NEP-13 and NEP-18 each deal with using the NumPy API on non-NumPy ndarray objects. For example, you might want to apply a ufunc like `np.log` to a Dask array:

```
>>> import numpy as np
>>> import dask.array as da
>>> a = da.random.random((10, 10))
>>> np.log(a)
dask.array<log, shape=(10, 10), dtype=float64, chunksize=(10, 10)>
```

Prior to NEP-13, `dask.array` needed its own namespace of ufuncs like `da.log`, since `np.log` would convert the Dask array to an in-memory NumPy array (probably blowing up your machine’s memory). With `__array_ufunc__`, library authors and users can all just use NumPy ufuncs, without worrying about the type of the array object.
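
To give a feel for the hook itself, here’s a minimal sketch of implementing `__array_ufunc__` on a made-up wrapper class (`WrappedArray` is a toy, not part of any library; real implementations like Dask’s also handle `out=` and other ufunc methods):

```
import numpy as np

class WrappedArray:
    """A toy wrapper illustrating the NEP-13 hook."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any WrappedArray inputs, apply the ufunc, re-wrap the result.
        arrays = [x.values if isinstance(x, WrappedArray) else x for x in inputs]
        return type(self)(getattr(ufunc, method)(*arrays, **kwargs))
```

With this, `np.log(WrappedArray([1.0, 2.0]))` returns another `WrappedArray` rather than a plain ndarray.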

While NEP-13 is limited to ufuncs, NEP-18 applies the same idea to most of the NumPy API. With NEP-18, libraries written to deal with NumPy ndarrays may suddenly support any object implementing `__array_function__`.

I highly recommend reading this blog post for more on the motivation for `__array_function__`. Ralf Gommers gave a nice talk on the current state of things at PyData Amsterdam 2019, though this is an active area of development.
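
As a quick illustration (assuming NumPy 1.17+, where `__array_function__` dispatch is enabled by default, and a recent Dask):

```
import numpy as np
import dask.array as da

a = da.ones((4, 4), chunks=2)

# np.sum is not a ufunc; it dispatches through __array_function__
# and returns a lazy Dask array instead of converting to NumPy.
result = np.sum(a)
print(isinstance(result, da.Array))  # True
print(result.compute())              # 16.0
```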

Pandas added extension types to allow third-party libraries to solve domain-specific problems in a way that gels nicely with the rest of pandas. For example, cyberpandas handles network data, while geopandas handles geographic data. When both implement extension arrays, it’s possible to operate on a dataset with a mixture of geographic and network data in the same DataFrame.
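
For instance, an array of IP addresses lives directly inside a Series with its own dtype (using cyberpandas’ `ip_range`, which appears again below):

```
import pandas as pd
import cyberpandas

# The IPArray is stored in the Series as-is, with cyberpandas' 'ip' dtype,
# rather than being coerced to object or string.
ser = pd.Series(cyberpandas.ip_range(4))
print(ser.dtype)  # ip
```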

Finally, Dask defines a Collections Interface so that any object can be a first-class citizen within Dask. This is what ensures XArray’s DataArray and Dataset objects work well with Dask.
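
The heart of that interface, roughly as described in the Dask documentation, is a handful of dunder methods (a skeleton, not a working collection):

```
class MyCollection:
    def __dask_graph__(self):
        """Return the task graph: a mapping of keys to tasks."""

    def __dask_keys__(self):
        """Return the (possibly nested) output keys of the graph."""

    def __dask_postcompute__(self):
        """Return (finalize, extra_args) for assembling computed results."""

    def __dask_postpersist__(self):
        """Return (rebuild, extra_args) for rebuilding after persist."""
```

Collections also declare a default scheduler via `__dask_scheduler__` and optional graph optimizations via `__dask_optimize__`.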

## `Series.__array_ufunc__`

Now, onto the fun stuff: combining these interfaces across objects and
libraries. https://github.com/pandas-dev/pandas/pull/23293 is a pull request
adding `Series.__array_ufunc__`. There are a few subtleties, but the basic idea
is that a ufunc applied to a Series should (as sketched below):

- Unbox the array (ndarray or extension array) from the Series
- Apply the ufunc to that array (honoring the array’s `__array_ufunc__` if needed)
- Rebox the output in a Series (with the original index and name)
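
A simplified sketch of those three steps (illustrative only, not pandas’ actual implementation, which also handles alignment between multiple Series and multi-output ufuncs):

```
import pandas as pd

def series_array_ufunc(ser, ufunc, method, *inputs, **kwargs):
    # 1. Unbox: pull the underlying array out of any Series inputs.
    arrays = [x.array if isinstance(x, pd.Series) else x for x in inputs]
    # 2. Apply: getattr covers ufunc methods like 'reduce'; this defers to
    #    the array's own __array_ufunc__ when the array defines one.
    result = getattr(ufunc, method)(*arrays, **kwargs)
    # 3. Rebox: reattach the original index and name.
    return pd.Series(result, index=ser.index, name=ser.name)
```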

For example, pandas’ `SparseArray` implements `__array_ufunc__`. It works by calling the ufunc twice: once on the sparse values (e.g. the non-zero values), and once on the scalar `fill_value`. The result is a new `SparseArray` with the same memory usage. With that PR, we achieve the same thing when operating on a Series containing an ExtensionArray.
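
In sketch form, `SparseArray`’s strategy looks like this (`sp_values` and `fill_value` are real `SparseArray` attributes; the rebuilding step is elided):

```
import numpy as np
import pandas as pd

arr = pd.SparseArray([-10, 0, 10])  # fill_value defaults to 0

# The ufunc runs twice: once on the stored (non-fill) values...
np.sign(arr.sp_values)   # array([-1,  1])
# ...and once on the scalar fill_value...
np.sign(arr.fill_value)  # 0
# ...then a new SparseArray is rebuilt around those, keeping the same
# sparse index, so memory usage is unchanged.
```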

```
>>> ser = pd.Series(pd.SparseArray([-10, 0, 10] + [0] * 100000))
>>> ser
0        -10
1          0
2         10
3          0
4          0
          ..
99998      0
99999      0
100000     0
100001     0
100002     0
Length: 100003, dtype: Sparse[int64, 0]

>>> np.sign(ser)
0        -1
1         0
2         1
3         0
4         0
         ..
99998     0
99999     0
100000    0
100001    0
100002    0
Length: 100003, dtype: Sparse[int64, 0]
```

Previously, that would have converted the `SparseArray` to a *dense* NumPy array, blowing up your memory, slowing things down, and giving an incorrect result.

## `IPArray.__array_function__`

To demonstrate `__array_function__`, we’ll implement it on cyberpandas’ `IPArray`:

```
def __array_function__(self, func, types, args, kwargs):
    # Only handle calls where every argument type is an IPArray (or subclass).
    cls = type(self)
    if not all(issubclass(t, cls) for t in types):
        return NotImplemented
    # Dispatch to our registered implementation of `func`.
    return HANDLED_FUNCTIONS[func](*args, **kwargs)
```

`IPArray` is pretty domain-specific, so we place ourselves at the bottom of the priority order by returning `NotImplemented` if there are any types we don’t recognize (we might consider handling Python’s stdlib `ipaddress.IPv4Address` and `ipaddress.IPv6Address` objects too).

And then we start implementing the interface. For example, `concatenate`:

```
@implements(np.concatenate)
def concatenate(arrays, axis=0, out=None):
    # Only axis=0 makes sense for a 1-D extension array.
    if axis != 0:
        raise NotImplementedError(f"Axis != 0 is not supported. (Got {axis}).")
    return IPArray(np.concatenate([array.data for array in arrays]))
```
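
Here `implements` is the small registration helper suggested in NEP-18’s documentation, a sketch of which is:

```
HANDLED_FUNCTIONS = {}

def implements(numpy_function):
    """Register an IPArray implementation of a NumPy function."""
    def decorator(func):
        HANDLED_FUNCTIONS[numpy_function] = func
        return func
    return decorator
```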

With this, we can successfully concatenate two `IPArray`s:

```
>>> a = cyberpandas.ip_range(4)
>>> b = cyberpandas.ip_range(10, 14)
>>> np.concatenate([a, b])
IPArray(['0.0.0.0', '0.0.0.1', '0.0.0.2', '0.0.0.3', '0.0.0.10', '0.0.0.11', '0.0.0.12', '0.0.0.13'])
```

## Extending Dask

Finally, we may wish to make `IPArray` work well with `dask.dataframe`, to do normal cyberpandas operations in parallel, possibly distributed on a cluster. This requires a few changes:

- Updating `IPArray` to work on either NumPy or Dask arrays
- Implementing the Dask Collections interface on `IPArray`
- Registering an `ip` accessor with `dask.dataframe`, just like with `pandas` (sketched below)
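
For the accessor piece, `dask.dataframe` mirrors pandas’ registration API. A sketch, assuming the pandas-level `Series.ip.netmask()` accessor that cyberpandas provides (shown in the session below):

```
from dask.dataframe.extensions import register_series_accessor

@register_series_accessor("ip")
class IPAccessor:
    def __init__(self, series):
        self._series = series

    def netmask(self):
        # Apply the pandas-level .ip.netmask() to each partition in parallel.
        return self._series.map_partitions(lambda part: part.ip.netmask())
```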

This is demonstrated in https://github.com/ContinuumIO/cyberpandas/pull/39:

```
In [28]: ddf
Out[28]:
Dask DataFrame Structure:
                 A
npartitions=2
0               ip
6              ...
11             ...
Dask Name: from_pandas, 2 tasks

In [29]: ddf.A.ip.netmask()
Out[29]:
Dask Series Structure:
npartitions=2
0      ip
6     ...
11    ...
Name: A, dtype: ip
Dask Name: from-delayed, 22 tasks

In [30]: ddf.A.ip.netmask().compute()
Out[30]:
0     255.255.255.255
1     255.255.255.255
2     255.255.255.255
3     255.255.255.255
4     255.255.255.255
5     255.255.255.255
6     255.255.255.255
7     255.255.255.255
8     255.255.255.255
9     255.255.255.255
10    255.255.255.255
11    255.255.255.255
dtype: ip
```

## Conclusion

I think that these points of extension enable an explosion of creativity in the scientific Python community. Rather than everything needing to live in NumPy, pandas, or Dask themselves, domain-specific libraries like cyberpandas can plug into the ecosystem and work well with the rest of it.