This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We’ll work through the introductory dplyr vignette to analyze some flight data.

I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each python call.

# Some prep work to get the data from R and into pandas
%matplotlib inline
%load_ext rmagic

import numpy as np  # needed for the random sampling examples further down
import pandas as pd
import seaborn as sns

pd.set_option("display.max_rows", 5)

/Users/tom/Envs/py3/lib/python3.4/site-packages/IPython/extensions/rmagic.py:693: UserWarning: The rmagic extension in IPython is deprecated in favour of rpy2.ipython. If available, that will be loaded instead.
http://rpy.sourceforge.net/
  warnings.warn("The rmagic extension in IPython is deprecated in favour of "

%%R
library("nycflights13")
write.csv(flights, "flights.csv")

Data: nycflights13

flights = pd.read_csv("flights.csv", index_col=0)

# dim(flights)   <--- The R code
flights.shape  # <--- The python code

(336776, 16)

# head(flights)
flights.head()

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
3	2013	1	1	542	2	923	33	AA	N619AA	1141	JFK	MIA	160	1089	5	42
4	2013	1	1	544	-1	1004	-18	B6	N804JB	725	JFK	BQN	183	1576	5	44
5	2013	1	1	554	-6	812	-25	DL	N668DN	461	LGA	ATL	116	762	5	54

Single table verbs

dplyr has a small set of nicely defined verbs. I’ve listed their closest pandas verbs.

dplyr	pandas
filter() (and slice())	query() (and loc[], iloc[])
arrange()	sort()
select() (and rename())	__getitem__ (and rename())
distinct()	drop_duplicates()
mutate() (and transmute())	None
summarise()	None
sample_n() and sample_frac()	None

Some verbs look “missing” from pandas only because the same goal is reached in a different way. For example, summarise is spread across mean, std, etc. Others, like sample_n, just haven’t been implemented yet.
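To make the "summarise is spread across mean, std, etc." point concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data is an assumption for illustration, not the real flights table). It shows the individual reductions next to `agg`, which can compute several of them in one call.

```python
import pandas as pd

# Toy data standing in for a few rows of flights (values are made up).
toy = pd.DataFrame({
    "dep_delay": [2.0, 4.0, -1.0, -6.0],
    "arr_delay": [11.0, 20.0, -18.0, -25.0],
})

# One summary at a time, as plain reductions on a column ...
mean_dep = toy["dep_delay"].mean()
std_dep = toy["dep_delay"].std()

# ... or several at once with agg, which returns one row per statistic.
summary = toy.agg(["mean", "std"])
print(summary)
```

Either form works; I tend to reach for the plain reduction when I only need one number.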

Filter rows with filter(), query()

# filter(flights, month == 1, day == 1)
flights.query("month == 1 & day == 1")

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

The more verbose version:

# flights[flights$month == 1 & flights$day == 1, ]
flights[(flights.month == 1) & (flights.day == 1)]

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns
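As a quick check that the two spellings above really select the same rows, here is a small sketch on a made-up frame (the `toy` data is hypothetical and only mimics the month and day columns).

```python
import pandas as pd

# Hypothetical stand-in for flights with just the columns we filter on.
toy = pd.DataFrame({
    "month": [1, 1, 2, 1],
    "day": [1, 2, 1, 1],
    "carrier": ["UA", "AA", "B6", "DL"],
})

# Expression string, evaluated against the column names.
via_query = toy.query("month == 1 & day == 1")

# Boolean mask; the parentheses are required because & binds tighter than ==.
via_mask = toy[(toy["month"] == 1) & (toy["day"] == 1)]

same = via_query.equals(via_mask)
print(same)
```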

# slice(flights, 1:10)
# Note: 1:10 selects ten rows in R; iloc[:9] returns the first nine (iloc[:10] would give ten).
flights.iloc[:9]

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	2013	1	1	557	-3	709	-14	EV	N829AS	5708	LGA	IAD	53	229	5	57
9	2013	1	1	557	-3	838	-8	B6	N593JB	79	JFK	MCO	140	944	5	57

9 rows × 16 columns
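Since R ranges include both ends and Python slices don't, it's easy to be off by one here. A small sketch on a made-up frame (the `toy` data and its 1-based index are assumptions chosen to mimic the index written out by R):

```python
import pandas as pd

# Hypothetical frame with a 1-based integer index, like the one read from flights.csv.
toy = pd.DataFrame({"flight": range(100, 120)}, index=range(1, 21))

first_ten = toy.iloc[:10]   # positions 0..9, ten rows, like slice(df, 1:10)
first_nine = toy.iloc[:9]   # positions 0..8, nine rows
by_label = toy.loc[1:10]    # label slices include both endpoints

print(len(first_ten), len(first_nine), len(by_label))
```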

Arrange rows with arrange(), sort()

# arrange(flights, year, month, day) 
flights.sort(['year', 'month', 'day'])

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
111295	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	219	EWR	ORD	NaN	719	NaN	NaN
111296	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	443	JFK	LAX	NaN	2475	NaN	NaN

336776 rows × 16 columns

# arrange(flights, desc(arr_delay))
flights.sort('arr_delay', ascending=False)

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
7073	2013	1	9	641	1301	1242	1272	HA	N384HA	51	JFK	HNL	640	4983	6	41
235779	2013	6	15	1432	1137	1607	1127	MQ	N504MQ	3535	JFK	CMH	74	483	14	32
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns
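A note for anyone running this on a newer pandas than the one used in this notebook: DataFrame.sort was later replaced by sort_values. The sketch below uses sort_values on a made-up frame (the `toy` data is an assumption), and also shows that missing values land at the end by default, even when sorting in descending order.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of flights with one missing delay.
toy = pd.DataFrame({
    "month": [2, 1, 1],
    "day": [1, 2, 1],
    "arr_delay": [5.0, np.nan, 30.0],
})

# arrange(toy, month, day)
by_date = toy.sort_values(["month", "day"])

# arrange(toy, desc(arr_delay)); NaN goes last by default (na_position="last").
by_delay = toy.sort_values("arr_delay", ascending=False)

print(by_delay)
```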

Select columns with select(), []

# select(flights, year, month, day) 
flights[['year', 'month', 'day']]

	year	month	day
1	2013	1	1
2	2013	1	1
...	...	...	...
336775	2013	9	30
336776	2013	9	30

336776 rows × 3 columns

# select(flights, year:day) 

# No real equivalent here. Although I think this is OK.
# Typically I'll have the columns I want stored in a list
# somewhere, which can be passed right into __getitem__ ([]).

# select(flights, -(year:day)) 

# Again, similar story. I would just use
# flights.drop(cols_to_drop, axis=1)
# or flights[flights.columns.difference(pd.Index(cols_to_drop))]
# Point to dplyr!
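That said, label-based slicing with loc may come reasonably close to year:day, since a label slice on the columns includes both endpoints. This is only a sketch on a one-row, made-up frame (the `toy` data is an assumption), not a claim that it matches select() in every case.

```python
import pandas as pd

# Hypothetical one-row frame with the first few flights columns.
toy = pd.DataFrame(
    [[2013, 1, 1, 517, 2]],
    columns=["year", "month", "day", "dep_time", "dep_delay"],
)

# Roughly select(toy, year:day): every column from "year" through "day".
date_cols = toy.loc[:, "year":"day"]

# Roughly select(toy, -(year:day)): drop that same run of columns.
rest = toy.drop(date_cols.columns, axis=1)

print(list(date_cols.columns), list(rest.columns))
```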

# select(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})['tail_num']

1    N14228
...
336776    N839MQ
Name: tail_num, Length: 336776, dtype: object

But as Hadley mentions, this isn’t that useful since it only returns the one column. dplyr and pandas compare well here.

# rename(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tail_num	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

Pandas is more verbose, but the argument to columns can be any mapping or a function. So it’s often used with a function to perform a common task, say df.rename(columns=lambda x: x.replace('-', '_')) to replace any dashes with underscores. Also, rename (the pandas version) can be applied to the Index.
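Here is a short sketch of both points, renaming with a function and renaming index labels, on a made-up frame (the dashed column names and the `row-1` label are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with dashes in its labels.
toy = pd.DataFrame({"tail-num": ["N14228"], "air-time": [227]}, index=["row-1"])

# A function is applied to every column label.
renamed_cols = toy.rename(columns=lambda name: name.replace("-", "_"))

# index= takes a mapping (or function) too, and can be combined with columns=.
renamed_both = toy.rename(index={"row-1": "row_1"}, columns=str.upper)

print(list(renamed_cols.columns), list(renamed_both.index))
```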

Extract distinct (unique) rows

# distinct(select(flights, tailnum))
flights.tailnum.unique()

array(['N14228', 'N24211', 'N619AA', ..., 'N776SK', 'N785SK', 'N557AS'], dtype=object)

FYI this returns a numpy array instead of a Series.

# distinct(select(flights, origin, dest))
flights[['origin', 'dest']].drop_duplicates()

	origin	dest
1	EWR	IAH
2	LGA	IAH
...	...	...
255456	EWR	ANC
275946	EWR	LGA

224 rows × 2 columns

OK, so dplyr wins there from a consistency point of view. unique is only defined on Series, not DataFrames. The original intention for drop_duplicates is to check for records that were accidentally included twice. Using it to select the distinct combinations feels a bit hacky, but it works!
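If the array return type of unique is a nuisance, drop_duplicates also works on a single column and keeps you in pandas land. A minimal sketch on a made-up frame (the `toy` data is an assumption):

```python
import pandas as pd

# Hypothetical origin/dest pairs with one repeated combination.
toy = pd.DataFrame({
    "origin": ["EWR", "LGA", "EWR", "EWR"],
    "dest": ["IAH", "IAH", "IAH", "ANC"],
})

unique_values = toy["origin"].unique()           # an array, not a Series
unique_series = toy["origin"].drop_duplicates()  # a Series, keeps the first index seen
pairs = toy[["origin", "dest"]].drop_duplicates()

print(len(pairs))
```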

Add new columns with mutate()

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   speed = distance / air_time * 60)

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['speed'] = flights.distance / flights.air_time * 60
flights

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	370.044053
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	374.273128
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
flights

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed	gain_per_hour
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	370.044053	2.378855
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	374.273128	4.229075
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN	NaN

336776 rows × 19 columns

dplyr's approach may be nicer here since you get to refer to the variables in subsequent statements within the mutate(). To achieve this with pandas, you have to add the gain variable as another column in flights. If I don’t want it around I would have to explicitly drop it.
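For readers on a newer pandas than the one used here: DataFrame.assign gets closer to mutate(), because it returns a new frame and leaves the original alone. The sketch below is run on a made-up frame (the `toy` data is an assumption) and uses two chained assign calls so the second can see the column created by the first.

```python
import pandas as pd

# Hypothetical two-row stand-in for flights.
toy = pd.DataFrame({
    "arr_delay": [11.0, 20.0],
    "dep_delay": [2.0, 4.0],
    "air_time": [227.0, 227.0],
})

# Each lambda receives the frame as it exists at that point in the chain.
with_gain = (
    toy.assign(gain=lambda d: d["arr_delay"] - d["dep_delay"])
       .assign(gain_per_hour=lambda d: d["gain"] / (d["air_time"] / 60))
)

print(with_gain)
```

Nothing needs to be dropped afterwards, since toy itself never gains the extra columns.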

# transmute(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
flights[['gain', 'gain_per_hour']]

	gain	gain_per_hour
1	9	2.378855
2	16	4.229075
...	...	...
336775	NaN	NaN
336776	NaN	NaN

336776 rows × 2 columns

Summarise values with summarise()

# summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
flights.dep_delay.mean()

12.639070257304708

Randomly sample rows with sample_n() and sample_frac()

There’s an open PR on GitHub to make this nicer (closer to dplyr). For now you can drop down to numpy.

# sample_n(flights, 10)
flights.loc[np.random.choice(flights.index, 10)]

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed	gain_per_hour
316903	2013	9	9	1539	-6	1650	-43	9E	N918XJ	3459	JFK	BNA	98	765	15	39	-37	468.367347	-22.653061
105369	2013	12	25	905	0	1126	-7	FL	N939AT	275	LGA	ATL	117	762	9	5	-7	390.769231	-3.589744
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
82862	2013	11	30	1627	-8	1750	-35	AA	N4XRAA	343	LGA	ORD	111	733	16	27	-27	396.216216	-14.594595
190653	2013	4	28	748	-7	856	-24	MQ	N520MQ	3737	EWR	ORD	107	719	7	48	-17	403.177570	-9.532710

10 rows × 19 columns

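# sample_frac(flights, 0.01)
# Note: the call below samples 10% (0.1) of the rows, which matches the output shown.
# The size must be an integer, hence the int().
flights.iloc[np.random.randint(0, len(flights),
                               int(.1 * len(flights)))]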

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed	gain_per_hour
188581	2013	4	25	1836	-4	2145	7	DL	N398DA	1629	JFK	LAS	313	2248	18	36	11	430.926518	2.108626
307015	2013	8	29	1258	5	1409	-4	EV	N12957	6054	EWR	IAD	46	212	12	58	-9	276.521739	-11.739130
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
286563	2013	8	7	2126	18	6	7	UA	N822UA	373	EWR	PBI	138	1023	21	26	-11	444.782609	-4.782609
62818	2013	11	8	1300	0	1615	5	VX	N636VA	411	JFK	LAX	349	2475	13	0	5	425.501433	0.859599

33677 rows × 19 columns
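One caveat with the numpy calls above: np.random.choice and np.random.randint draw with replacement by default, so the same row can show up twice. Here is a sketch of sampling without replacement on a made-up frame (the `toy` data and the seed are assumptions). It also uses DataFrame.sample, which exists in newer pandas releases than the one this notebook was written against.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 1000 rows.
toy = pd.DataFrame({"flight": range(1000)})
rng = np.random.RandomState(0)  # seeded only so the sketch is repeatable

# Roughly sample_n(toy, 10): ten distinct index labels.
n_rows = toy.loc[rng.choice(toy.index, size=10, replace=False)]

# Roughly sample_frac(toy, 0.01): shuffle positions, keep the first 1%.
n_keep = int(0.01 * len(toy))
frac_rows = toy.iloc[rng.permutation(len(toy))[:n_keep]]

# Newer pandas: the same two ideas in one method.
sampled_n = toy.sample(n=10, random_state=0)
sampled_frac = toy.sample(frac=0.01, random_state=0)

print(len(n_rows), len(frac_rows), len(sampled_n), len(sampled_frac))
```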

Grouped operations

# planes <- group_by(flights, tailnum)
# delay <- summarise(planes,
#   count = n(),
#   dist = mean(distance, na.rm = TRUE),
#   delay = mean(arr_delay, na.rm = TRUE))
# delay <- filter(delay, count > 20, dist < 2000)

planes = flights.groupby("tailnum")
delay = planes.agg({"year": "count",
                    "distance": "mean",
                    "arr_delay": "mean"})
delay.query("year > 20 & distance < 2000")

	year	arr_delay	distance
tailnum
N0EGMQ	371	9.982955	676.188679
N10156	153	12.717241	757.947712
...	...	...	...
N999DN	61	14.311475	895.459016
N9EAMQ	248	9.235294	674.665323

2961 rows × 3 columns

For me, dplyr’s n() looked a bit strange at first, but it’s already growing on me.

I think pandas is more difficult for this particular example. There isn’t as natural a way to mix column-agnostic aggregations (like count) with column-specific aggregations like the other two. You end up writing code like .agg({'year': 'count'}), which reads “I want the count of year”, even though you don’t care about year specifically. Additionally, assigning names can’t be done as cleanly in pandas; you have to follow it up with a rename like before.
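For what it's worth, later pandas releases (0.25 and up) added "named aggregation", which addresses both complaints: output names are given up front, and "size" counts rows without pretending to care about a particular column's values. A minimal sketch on a made-up frame (the `toy` data and the thresholds are assumptions scaled down from the example above):

```python
import pandas as pd

# Hypothetical miniature of flights; None becomes NaN in arr_delay.
toy = pd.DataFrame({
    "tailnum": ["N1", "N1", "N2", "N2", "N2"],
    "distance": [100.0, 300.0, 500.0, 700.0, 900.0],
    "arr_delay": [10.0, None, -5.0, 5.0, 30.0],
})

# keyword = (column, aggregation); the keyword becomes the output column name.
delay = toy.groupby("tailnum").agg(
    count=("distance", "size"),   # rows per group, like n()
    dist=("distance", "mean"),
    delay=("arr_delay", "mean"),  # mean skips NaN, like na.rm = TRUE
)

# filter(delay, count > 2, dist < 2000)
busy = delay[(delay["count"] > 2) & (delay["dist"] < 2000)]
print(busy)
```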

# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

destinations = flights.groupby('dest')
destinations.agg({
    'tailnum': lambda x: len(x.unique()),
    'year': 'count'
    }).rename(columns={'tailnum': 'planes',
                       'year': 'flights'})

	flights	planes
dest
ABQ	254	108
ACK	265	58
...	...	...
TYS	631	273
XNA	1036	176

105 rows × 2 columns
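An alternative to the lambda, sketched on a made-up frame (the `toy` data is an assumption): nunique counts distinct values per group and size counts rows per group, so the two summaries can be built separately and placed side by side.

```python
import pandas as pd

# Hypothetical destinations and tail numbers.
toy = pd.DataFrame({
    "dest": ["ABQ", "ABQ", "ABQ", "ACK"],
    "tailnum": ["N1", "N1", "N2", "N3"],
})

grouped = toy.groupby("dest")
planes = grouped["tailnum"].nunique()  # like n_distinct(tailnum)
n_flights = grouped.size()             # like n()

# The dict keys become the column names, so no rename is needed.
summary = pd.concat({"planes": planes, "flights": n_flights}, axis=1)
print(summary)
```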

Similar to how dplyr provides optimized C++ versions of most of the summarise functions, pandas uses Cython-optimized versions for most of the agg methods.

# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

daily = flights.groupby(['year', 'month', 'day'])
per_day = daily['distance'].count()
per_day

year  month  day
2013  1      1      842
...
2013  12     31     776
Name: distance, Length: 365, dtype: int64

# (per_month <- summarise(per_day, flights = sum(flights)))
per_month = per_day.groupby(level=['year', 'month']).sum()
per_month

year  month
2013  1        27004
...
2013  12       28135
Name: distance, Length: 12, dtype: int64

# (per_year  <- summarise(per_month, flights = sum(flights)))
per_year = per_month.sum()
per_year

I’m not sure how dplyr is handling the other columns, like year, in the last example. With pandas, it’s clear that we’re grouping by them since they’re included in the groupby. For the last example, we didn’t group by anything, so they aren’t included in the result.
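To show how the grouping keys travel along in pandas, here is the same day, month, year roll-up on a tiny made-up frame (the `toy` data is an assumption). The keys live in the index, and each step names the levels it still wants to keep.

```python
import pandas as pd

# Hypothetical flights on three distinct days.
toy = pd.DataFrame({
    "year": [2013] * 5,
    "month": [1, 1, 1, 2, 2],
    "day": [1, 1, 2, 1, 1],
})

per_day = toy.groupby(["year", "month", "day"]).size()
per_month = per_day.groupby(level=["year", "month"]).sum()  # "day" level is dropped
per_year = per_month.groupby(level="year").sum()            # only "year" is left

print(per_month)
print(per_year)
```

Grouping by the year level in the last step keeps year in the result, whereas a bare .sum() collapses everything to a single number.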

Chaining

Any follower of Hadley’s twitter account will know how much R users love the %>% (pipe) operator. And for good reason!

# flights %>%
#   group_by(year, month, day) %>%
#   select(arr_delay, dep_delay) %>%
#   summarise(
#     arr = mean(arr_delay, na.rm = TRUE),
#     dep = mean(dep_delay, na.rm = TRUE)
#   ) %>%
#   filter(arr > 30 | dep > 30)
(
flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30')
)

			arr_delay	dep_delay
year	month	day
2013	1	16	34.247362	24.612865
	1	31	32.602854	28.658363
	1	...	...	...
	12	17	55.871856	40.705602
	12	23	32.226042	32.254149

49 rows × 2 columns

Other Data Sources

Pandas has tons of IO tools to help you get data in and out, including SQL databases via SQLAlchemy.
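As a small taste of the SQL side, here is a sketch of a round trip through an in-memory SQLite database. It passes a plain sqlite3 connection from the standard library, which pandas accepts directly; other databases would go through a SQLAlchemy engine instead. The table name and the `toy` data are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical three-row stand-in for flights.
toy = pd.DataFrame({"carrier": ["UA", "AA", "UA"], "dep_delay": [2, 4, -1]})

conn = sqlite3.connect(":memory:")

# Write the frame out as a table, then read a filtered subset back.
toy.to_sql("flights", conn, index=False)
ua = pd.read_sql(
    "SELECT carrier, dep_delay FROM flights WHERE carrier = 'UA'", conn
)
conn.close()

print(ua)
```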

Summary

I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming of columns within an operation; they had to be followed up with a call to rename after the operation, which isn’t that burdensome honestly.

More and more it looks like we’re moving towards a future where being a language or package partisan just doesn’t make sense. Not when you can load up a Jupyter (formerly IPython) notebook to call up a library written in R, and hand those results off to python or Julia or whatever for followup, before going back to R to make a cool Shiny web app.

There will always be a place for your “utility belt” package like dplyr or pandas, but it wouldn’t hurt to be familiar with both.

If you want to contribute to pandas, we’re always looking for help at https://github.com/pydata/pandas/. You can get ahold of me directly on twitter.
