This is a status update on some enhancements for pandas. The goal of the work
is to store things that are sufficiently array-like in a pandas DataFrame,
even if they aren’t a regular NumPy array. Pandas already does this in a few
places for some blessed types (like Categorical); we’d like to open that up to
anybody.
A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP Address data that they’d like to work with in pandas. They didn’t just want to make a NumPy array of IP addresses for a few reasons:
- IPv6 addresses are 128 bits, so they can’t use a specialized NumPy dtype. It would have to be an object array, which will be slow for their large datasets.
- IP Addresses have special structure. They’d like to use this structure for special methods like is_reserved.
- It’s much better to put the knowledge of types in the library, rather than relying on analysts to know that this column of objects or strings is actually this other special type.
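To make the first two points concrete, here is a small sketch using only Python's standard-library ipaddress module (not pandas or the client's code). It shows that an IPv6 address needs more than 64 bits as an integer, and that address objects carry structure-aware attributes such as is_reserved. The sample addresses are my own picks for illustration.

```python
import ipaddress

v4 = ipaddress.ip_address("192.168.1.1")
v6 = ipaddress.ip_address("2001:db8:85a3::8a2e:370:7334")

# An IPv4 address fits in 32 bits; this IPv6 address needs more than 64,
# so no single fixed-width NumPy integer dtype can hold every address.
v4_bits = int(v4).bit_length()
v6_bits = int(v6).bit_length()
print(v4_bits, v6_bits)

# Structure-aware attributes live on the address objects themselves.
reserved = ipaddress.ip_address("240.0.0.1")  # 240.0.0.0/4 is IETF-reserved
print(v4.is_private, v4.is_reserved, reserved.is_reserved)
```

Storing objects like these in a NumPy object array works, but every operation then goes through Python-level attribute access one element at a time, which is the slowness the client wanted to avoid.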
I wrote up a proposal to gauge interest from the community in adding an IP Address dtype to pandas. The general sentiment was that IP addresses were too specialized for inclusion in pandas (which matched my own feelings). But the community was interested in allowing 3rd-party libraries to define their own types and having pandas “do the right thing” when it encounters them.
Pandas Internals
While not technically true, you could reasonably describe a DataFrame as a
dictionary of NumPy arrays. There are a few complications that invalidate that
caricature, but the one I want to focus on is pandas’ extension dtypes.
Pandas has extended NumPy’s type system in a few cases. For the most part, this
involves tricking pandas.DataFrame and pandas.Series into thinking that
the object passed to it is a single array, when in fact it’s multiple arrays, or
an array plus a bit of extra metadata.
- datetime64[ns] with a timezone: A regular numpy.datetime64[ns] array (which is really just an array of integers) plus some metadata for the timezone.
- Period: An array of integer ordinals and some metadata about the frequency.
- Categorical: Two arrays: one with the unique set of categories and a second array of codes, the positions in categories.
- Interval: Two arrays, one for the left-hand endpoints and one for the right-hand endpoints.
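You can poke at these "several arrays behind one column" layouts from the public API. The snippet below is only an illustration with made-up values; it assumes a reasonably recent pandas where pd.Categorical and pd.arrays.IntervalArray are available.

```python
import pandas as pd

# Categorical: one array of unique categories plus one array of integer codes.
cat = pd.Categorical(
    ["low", "high", "low", "mid"],
    categories=["low", "mid", "high"],
)
categories = list(cat.categories)
codes = cat.codes.tolist()  # positions into `categories`
print(categories, codes)

# Interval: one array of left endpoints and one of right endpoints.
intervals = pd.arrays.IntervalArray.from_arrays([0, 10], [5, 20])
lefts = intervals.left.tolist()
rights = intervals.right.tolist()
print(lefts, rights)

# Either one can sit in a Series and looks like an ordinary column.
ser = pd.Series(cat)
print(ser.dtype)
```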
So our definition of a pandas.DataFrame is now “A dictionary of NumPy arrays,
or one of pandas’ extension types.” Internal to pandas, we have checks for “is
this thing an extension dtype? If so take this special path.” To the user, it
looks like a Categorical is just a regular column, but internally, it’s a bit
messier.
Anyway, the upshot of my proposal was to make changes to pandas' internals to support 3rd-party objects going down that “is this an extension dtype” path.
Pandas’ Array Interface
To support external libraries defining extension array types, we defined an interface.
In pandas-19268 we laid out exactly what pandas considers sufficiently “array-like” for an extension array type. When pandas comes across one of these array-like objects, it avoids the previous behavior of just storing the data in a NumPy array of objects. The interface includes things like
- What type of scalars do you hold?
- How do I convert you to a NumPy array?
- __getitem__
Most things should be pretty straightforward to implement. In the test suite, we
have a 60-line implementation for storing decimal.Decimal objects in a
Series.
It’s important to emphasize that pandas’ ExtensionArray is not another array
implementation. It’s just an agreement between pandas and your library that your
array-like object (which may be a NumPy array, many NumPy arrays, an Arrow
array, a list, anything really) satisfies the proper semantics for storage
inside a Series or DataFrame.
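To give a feel for the size of that agreement, here is a rough sketch of an extension type that stores decimal.Decimal objects. It is not the implementation from the pandas test suite: the names ToyDecimalDtype and ToyDecimalArray are hypothetical, and the set of methods reflects my understanding of the interface as it appears in released pandas (pandas.api.extensions), which may differ in detail from the state of the pull request discussed here.

```python
import decimal

import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype, take


class ToyDecimalDtype(ExtensionDtype):
    """Answers "what type of scalars do you hold?"."""

    name = "toydecimal"
    type = decimal.Decimal
    na_value = decimal.Decimal("NaN")

    @classmethod
    def construct_array_type(cls):
        return ToyDecimalArray


class ToyDecimalArray(ExtensionArray):
    """Hypothetical container backed by a NumPy object array."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    @property
    def dtype(self):
        return ToyDecimalDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        # Scalars come back as Decimal; anything else as a new array.
        if isinstance(item, (int, np.integer)):
            return self._data[item]
        item = pd.api.indexers.check_array_indexer(self, item)
        return type(self)(self._data[item])

    def isna(self):
        return np.array([x.is_nan() for x in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        result = take(
            self._data, indices, allow_fill=allow_fill, fill_value=fill_value
        )
        return type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())


arr = ToyDecimalArray([decimal.Decimal("1.5"), decimal.Decimal("2.25")])
ser = pd.Series(arr)
print(ser.dtype, ser.iloc[1])
```

The backing store here happens to be a NumPy object array, but nothing in the interface requires that; pandas only talks to the methods.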
With those changes, I’ve been able to prototype a small library (named…
cyberpandas) for storing arrays of IP Addresses. It defines IPAddress, an
array-like container for IP Addresses. For this blogpost, the only relevant
implementation detail is that IP Addresses are stored as a NumPy structured
array with two uint64 fields. So we’re making pandas treat this two-field array
as a single array, like how Interval works.
As a taste for what’s possible, here’s a preview of our IP Address library,
cyberpandas.
In [1]: import cyberpandas
In [2]: import pandas as pd
In [3]: ips = cyberpandas.IPAddress([
...: '0.0.0.0',
...: '192.168.1.1',
...: '2001:0db8:85a3:0000:0000:8a2e:0370:7334',
...: ])
In [4]: ips
Out[4]: IPAddress(['0.0.0.0', '192.168.1.1', '2001:db8:85a3::8a2e:370:7334'])
In [5]: ips.data
Out[5]:
array([( 0, 0),
( 0, 3232235777),
(2306139570357600256, 151930230829876)],
dtype=[('hi', '>u8'), ('lo', '>u8')])
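If you're curious how an address ends up as a (hi, lo) pair, here is a minimal sketch of the idea using the standard-library ipaddress module and a NumPy structured dtype matching the one printed above. The pack and unpack helpers are hypothetical names for illustration, not cyberpandas functions.

```python
import ipaddress

import numpy as np

# Same layout as the output above: two big-endian unsigned 64-bit fields.
IP_DTYPE = np.dtype([("hi", ">u8"), ("lo", ">u8")])
MASK64 = (1 << 64) - 1


def pack(addresses):
    """Split each address's integer value into high and low 64-bit halves."""
    out = np.zeros(len(addresses), dtype=IP_DTYPE)
    for i, text in enumerate(addresses):
        value = int(ipaddress.ip_address(text))
        out[i] = (value >> 64, value & MASK64)
    return out


def unpack(record):
    """Rebuild the integer value from one (hi, lo) record."""
    return (int(record["hi"]) << 64) | int(record["lo"])


packed = pack(
    ["0.0.0.0", "192.168.1.1", "2001:0db8:85a3:0000:0000:8a2e:0370:7334"]
)
print(packed)
print(ipaddress.ip_address(unpack(packed[2])))
```

IPv4 addresses simply leave the hi field at zero, which is why the first two rows above start with 0.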
ips satisfies pandas’ ExtensionArray interface, so it can be stored inside
pandas’ containers.
In [6]: ser = pd.Series(ips)
In [7]: ser
Out[7]:
0 0.0.0.0
1 192.168.1.1
2 2001:db8:85a3::8a2e:370:7334
dtype: ip
Note the dtype in that output. That’s a custom dtype (like category) defined
outside pandas.
We register a custom accessor with pandas claiming the .ip namespace (just like
pandas uses .str or .dt or .cat):
In [8]: ser.ip.isna
Out[8]:
0 True
1 False
2 False
dtype: bool
In [9]: ser.ip.is_ipv6
Out[9]:
0 False
1 False
2 True
dtype: bool
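Claiming a namespace like this is something any library can do. The sketch below registers a hypothetical .toyip accessor with pandas.api.extensions.register_series_accessor. Unlike the cyberpandas accessor, it works on a plain Series of strings and parses each one with the standard-library ipaddress module, so treat it as an illustration of the registration mechanism only.

```python
import ipaddress

import pandas as pd


@pd.api.extensions.register_series_accessor("toyip")
class ToyIPAccessor:
    """Hypothetical accessor; assumes the Series holds address strings."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def is_ipv6(self):
        # Parse every string and compare its IP version to 6.
        versions = self._obj.map(lambda s: ipaddress.ip_address(s).version)
        return versions == 6


addresses = pd.Series(["0.0.0.0", "192.168.1.1", "2001:db8:85a3::8a2e:370:7334"])
flags = addresses.toyip.is_ipv6
print(flags.tolist())
```

A real accessor would typically validate the dtype in __init__ and raise AttributeError for Series it can't handle, so that .toyip only appears to work where it makes sense.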
I’m extremely interested in seeing what the community builds on top of this
interface. Joris has already tested out the Cythonized geopandas
extension, which stores a NumPy array of pointers to geometry objects, and
things seem great. I could see someone (perhaps you, dear reader?) building a
JSONArray array type for working with nested data. That, combined with a custom
.json accessor, perhaps with a jq-like query language, should make for
a powerful combination.
I’m also happy to have to say “Closed, out of scope; sorry.” less often. Now it can be “Closed, out of scope; do it outside of pandas.” :)
Open Source Success Story
It’s worth taking a moment to realize that this was a great example of open source at its best.
- A company had a need for a tool. They didn’t have the expertise or desire to build and maintain it internally, so they approached Anaconda (a for-profit company with a great OSS tradition) to do it for them.
- A proposal was made and rejected by the pandas community. You can’t just “buy” features in pandas if they conflict too strongly with the long-term goals for the project.
- A more general solution was found, with minimal changes to pandas itself, allowing anyone to do this type of extension outside of pandas.
- We built cyberpandas, which to users will feel like a first-class array type in pandas.
Thanks for the tireless reviews from the other pandas contributors, especially Jeff Reback, Joris van den Bossche, and Stephan Hoyer. Look forward to these changes in the next major pandas release.