pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney
Abstract—In this paper we will discuss pandas, a Python library of rich data structures and tools for working with structured data sets common to statistics, finance, social sciences, and many other fields. The library provides integrated, intuitive routines for performing common data manipulations and analysis on such data sets. It aims to be the foundational layer for the future of statistical computing in Python. It serves as a strong complement to the existing scientific Python stack while implementing and improving upon the kinds of data manipulation tools found in other statistical programming languages such as R. In addition to detailing the design and features of pandas, we will discuss future avenues of work and growth opportunities for statistics and data analysis applications in the Python language.

Corresponding author can be contacted at: wesmckinn@gmail.com. ©2011 Wes McKinney.

Introduction

Python is being used increasingly in scientific applications traditionally dominated by [R], [MATLAB], [Stata], [SAS], and other commercial or open-source research environments. The maturity and stability of the fundamental numerical libraries ([NumPy], [SciPy], and others), the quality of documentation, and the availability of "kitchen-sink" distributions ([EPD], [Pythonxy]) have gone a long way toward making Python accessible and convenient for a broad audience. Additionally, [matplotlib] integrated with [IPython] provides an interactive research and development environment with data visualization suitable for most users. However, adoption of Python for applied statistical modeling has been relatively slow compared with other areas of computational science.

One major issue for would-be statistical Python programmers in the past has been the lack of libraries implementing standard models and a cohesive framework for specifying models. In recent years, however, there have been significant new developments in econometrics ([StaM]), Bayesian statistics ([PyMC]), and machine learning ([SciL]), among other fields. It is still difficult for many statisticians to choose Python over R given the domain-specific nature of the R language and the breadth of well-vetted open-source libraries available to R users ([CRAN]). In spite of this obstacle, we believe that the Python language and the libraries and tools currently available can be leveraged to make Python a superior environment for data analysis and statistical computing.

Another issue preventing many from using Python in the past for data analysis applications has been the lack of rich data structures with integrated handling of metadata. By metadata we mean labeling information about data points. For example, a table or spreadsheet of data will likely have labels for the columns and possibly also the rows. Alternately, some columns in a table might be used for grouping and aggregating data into a pivot or contingency table. In the case of a time series data set, the row labels could be time stamps. It is often necessary to have the labeling information available to allow many kinds of data manipulations, such as merging data sets or performing an aggregation or "group by" operation, to be expressed in an intuitive and concise way. Domain-specific database languages like SQL and statistical languages like R and SAS have a wealth of such tools. Until relatively recently, Python had few tools providing the same level of richness and expressiveness for working with labeled data sets.

The pandas library, under development since 2008, is intended to close the gap in the richness of available data analysis tools between Python, a general purpose systems and scientific computing language, and the numerous domain-specific statistical computing platforms and database languages. We not only aim to provide equivalent functionality but also implement many features, such as automatic data alignment and hierarchical indexing, which are not readily available in such a tightly integrated way in any other libraries or computing environments, to our knowledge. While initially developed for financial data analysis applications, we hope that pandas will enable scientific Python to be a more attractive and practical statistical computing environment for academic and industry practitioners alike. The library's name derives from panel data, a common term for multidimensional data sets encountered in statistics and econometrics.

While we offer a vignette of some of the main features of interest in pandas, this paper is by no means comprehensive. For more, we refer the interested reader to the online documentation at http://pandas.sf.net ([pandas]).

Structured data sets

Structured data sets commonly arrive in tabular format, i.e. as a two-dimensional list of observations and names for the fields of each observation. Usually an observation can be uniquely identified by one or more values or labels. We show an example data set for a pair of stocks over the course of several days. The NumPy ndarray with structured dtype can be used to hold this data:

>>> data
array([('GOOG', '2009-12-28', 622.87, 1697900.0),
       ...
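As a hedged sketch of the same idea (the rows and dtype below are illustrative, not the full data set used in the paper), a structured array of this form can be built directly with NumPy and then lifted into a DataFrame, which turns the named fields into labeled columns:

>>> import numpy as np
>>> from pandas import DataFrame
>>> arr = np.array([('GOOG', '2009-12-28', 622.87, 1697900.0),
...                 ('AAPL', '2009-12-28', 211.61, 23003100.0)],
...                dtype=[('item', 'O'), ('date', 'O'),
...                       ('price', 'f8'), ('volume', 'f8')])
>>> arr['price']            # field access on the structured array
>>> frame = DataFrame(arr)  # the same data with labeled, individually typed columns
>>> frame['price']          # now addressable by column label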
…sequential observation numbers, while the column index contains the column names. The labels are not required to be sorted, though a subclass of Index could be implemented to require sortedness and provide operations optimized for sorted data (e.g. time series data).

The Index object is used for many purposes:
• Performing lookups to select subsets or slices of an object
• Providing fast data alignment routines for aligning one object with another
• Enabling intuitive slicing / selection to form new Index objects
• Forming unions and intersections of Index objects

Here are some examples of how the index is used internally:

>>> index = Index(['a', 'b', 'c', 'd', 'e'])
>>> 'c' in index
True
>>> index.get_loc('d')
3
>>> index.slice_locs('b', 'd')
(1, 4)

# for aligning data
>>> index.get_indexer(['c', 'e', 'f'])
array([ 2, 4, -1], dtype=int32)

The basic Index uses a Python dict internally to map labels to their respective locations and implement these features, though subclasses could take a more specialized and potentially higher-performance approach.
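As a conceptual sketch only (this is not the actual pandas implementation), the dict-backed mapping described above behaves roughly like the following plain-Python dictionary:

>>> labels = ['a', 'b', 'c', 'd', 'e']
>>> mapping = dict((lab, i) for i, lab in enumerate(labels))
>>> 'c' in mapping          # cf. 'c' in index
True
>>> mapping['d']            # cf. index.get_loc('d')
3
>>> [mapping.get(lab, -1) for lab in ['c', 'e', 'f']]   # cf. index.get_indexer
[2, 4, -1]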
Multidimensional objects like DataFrame are not proper subclasses of NumPy's ndarray, nor do they use arrays with structured dtype. In recent releases of pandas there is a new internal data structure known as BlockManager which manipulates a collection of n-dimensional ndarray objects we refer to as blocks. Since DataFrame needs to be able to store mixed-type data in the columns, each of these internal Block objects contains the data for a set of columns all having the same type. In the example from above, we can examine the BlockManager, though most users would never need to do this:

>>> data._data
BlockManager
Items: [item date price volume ind]
Axis 1: [0 1 2 3 4 5 6 7]
FloatBlock: [price volume], 2 x 8, dtype float64
ObjectBlock: [item date], 2 x 8, dtype object
BoolBlock: [ind], 1 x 8, dtype bool

The key importance of BlockManager is that many operations, e.g. anything row-oriented (as opposed to column-oriented), especially in homogeneous DataFrame objects, are significantly faster when the data are all stored in a single ndarray. However, as it is common to insert and delete columns, it would be wasteful to have a reallocate-copy step on each column insertion or deletion. As a result, the BlockManager effectively provides a lazy evaluation scheme wherein newly inserted columns are stored in new Block objects. Later, either explicitly or when certain methods are called in DataFrame, blocks having the same type will be consolidated, i.e. combined together, to form a single homogeneously-typed Block:

>>> data['newcol'] = 1.
>>> data._data
BlockManager
Items: [item date price volume ind newcol]
Axis 1: [0 1 2 3 4 5 6 7]
FloatBlock: [price volume], 2 x 8
ObjectBlock: [item date], 2 x 8
BoolBlock: [ind], 1 x 8
FloatBlock: [newcol], 1 x 8

>>> data.consolidate()._data
BlockManager
Items: [item date price volume ind newcol]
Axis 1: [0 1 2 3 4 5 6 7]
BoolBlock: [ind], 1 x 8
FloatBlock: [price volume newcol], 3 x 8
ObjectBlock: [item date], 2 x 8

The separation between the internal BlockManager object and the external, user-facing DataFrame gives the pandas developers a significant amount of freedom to modify the internal structure to achieve better performance and memory usage.

Label-based data access

While standard []-based indexing (using __getitem__ and __setitem__) is reserved for column access in DataFrame, it is useful to be able to index both axes of a DataFrame in a matrix-like way using labels. We would like to be able to get or set data on any axis using one of the following:
• A list or array of labels or integers
• A slice, either with integers (e.g. 1:5) or labels (e.g. lab1:lab2)
• A boolean vector
• A single label

To avoid excessively overloading the []-related methods, leading to ambiguous indexing semantics in some cases, we have implemented a special label-indexing attribute ix on all of the pandas data structures. Thus, we can pass a tuple of any of the above indexing objects to get or set values.

>>> df
                  A       B       C       D
2000-01-03  -0.2047   1.007 -0.5397 -0.7135
2000-01-04   0.4789  -1.296   0.477 -0.8312
2000-01-05  -0.5194   0.275   3.249   -2.37
2000-01-06  -0.5557  0.2289  -1.021  -1.861
2000-01-07    1.966   1.353 -0.5771 -0.8608

>>> df.ix[:2, ['D', 'C', 'A']]
                  D       C       A
2000-01-03  -0.7135 -0.5397 -0.2047
2000-01-04  -0.8312   0.477  0.4789

>>> df.ix[-2:, 'B':]
                  B       C       D
2000-01-06   0.2289  -1.021  -1.861
2000-01-07    1.353 -0.5771 -0.8608

Setting values also works as expected:

>>> date1, date2 = df.index[[1, 3]]
>>> df.ix[date1:date2, ['A', 'C']] = 0
>>> df
                   A       B       C       D
2000-01-03   -0.6856  0.1362  0.3996   1.585
2000-01-04         0  0.8863       0   1.907
2000-01-05         0  -1.351       0   0.104
2000-01-06         0 -0.8863       0  0.1741
2000-01-07  -0.05927  -1.013  0.9923 -0.4395
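The remaining two cases from the list above, a boolean vector and a single label, are handled by ix in the same way. A brief sketch, reusing the df shown above (note that in current pandas the ix attribute has been superseded by loc and iloc):

>>> df.ix[df['A'] > 0, 'D']    # boolean vector along the rows, single column label
>>> df.ix[date1, 'B']          # a single row label paired with a single column label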
>>> df.groupby(['A', 'B'], as_index=False).mean()
     A      B        C        D
0  bar    one    1.772  -0.7472
1  bar  three  0.04931   0.3939
2  bar    two   -3.202   0.9365
3  foo    one  -0.5205    1.579
4  foo  three   0.1461   -2.655
5  foo    two  -0.5958   0.7762

In a completely general setting, groupby operations are about mapping axis labels to buckets. In the above examples, when we pass column names we are simply establishing a correspondence between the row labels and the group identifiers. There are other ways to do this; the most general is to pass a Python function (for single-key) or list of functions (for multi-key) which will be invoked on each label, producing a group specification:

>>> dat
                  A        B        C        D
2000-01-03   0.6371    0.672   0.9173    1.674
2000-01-04  -0.8178   -1.865    -0.23   0.5411
2000-01-05    0.314   0.2931  -0.6444  -0.9973
2000-01-06    1.913  -0.5867    0.273   0.4631
2000-01-07    1.308    0.426   -1.306  0.04358

>>> mapping
{'A': 'Group 1', 'B': 'Group 2',
 'C': 'Group 1', 'D': 'Group 2'}

>>> for name, group in dat.groupby(mapping.get,
...                                axis=1):
...     print name; print group
Group 1
                 A        C
...

Note that, under the hood, calling describe generates and passes a dynamic function to apply which invokes describe on each group and glues the results together. We transposed the result with .T to make it more readable.
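As a rough sketch of the mechanism being described, the same result can be spelled out explicitly with apply; here some_df and the grouping column 'key' are hypothetical stand-ins rather than objects defined above:

>>> grouped = some_df.groupby('key')
>>> grouped.apply(lambda group: group.describe()).T   # per-group summaries, glued together and transposed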
Easy spreadsheet-style pivot tables

An obvious application combining groupby and reshaping operations is creating pivot tables, a common way of summarizing data in spreadsheet applications such as Microsoft Excel. We'll take a brief look at a tipping data set collected from a restaurant ([Bryant]):

>>> tips.head()
      sex smoker    time  day  size  tip_pct
1  Female     No  Dinner  Sun     2  0.05945
2    Male     No  Dinner  Sun     3   0.1605
3    Male     No  Dinner  Sun     3   0.1666
4    Male     No  Dinner  Sun     2   0.1398
5  Female     No  Dinner  Sun     4   0.1468

The pivot_table function in pandas takes a set of column names to group on the pivot table rows, another set to group on the columns, and optionally an aggregation function for each group (which defaults to mean):

>>> import numpy as np
>>> from pandas import pivot_table
>>> pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
...             cols='smoker')
smoker            No     Yes
time   sex
Dinner Female 0.1568  0.1851
       Male   0.1594  0.1489
Lunch  Female 0.1571  0.1753
       Male   0.1657  0.1667

Conveniently, the returned object is a DataFrame, so it can be further reshaped and manipulated by the user:

>>> table = pivot_table(tips, 'tip_pct',
...                     rows=['sex', 'day'],
...                     cols='smoker', aggfunc=len)
>>> table
smoker       No  Yes
sex    day
Female Fri    2    7
       Sat   13   15
       Sun   14    4
       Thur  25    7
Male   Fri    2    8
       Sat   32   27
       Sun   43   15
       Thur  20   10

>>> table.unstack('sex')
smoker      No        Yes
sex     Female Male Female Male
day
Fri          2    2      7    8
Sat         13   32     15   27
Sun         14   43      4   15
Thur        25   20      7   10

For many users, this will be an attractive alternative to dumping a data set into a spreadsheet for the sole purpose of creating a pivot table.

>>> pivot_table(tips, 'size',
...             rows=['time', 'sex', 'smoker'],
...             cols='day', aggfunc=np.sum,
...             fill_value=0)
day                   Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No        2   30   43     2
              Yes       8   33   10     0
Dinner Male   No        4   85  124     0
              Yes      12   71   39     0
Lunch  Female No        3    0    0    60
              Yes       6    0    0    17
Lunch  Male   No        0    0    0    50
              Yes       5    0    0    23

Combining or joining data sets

Combining, joining, or merging related data sets is a quite common operation. In doing so we are interested in associating observations from one data set with another via a merge key of some kind. For similarly-indexed 2D data, the row labels serve as a natural key for the join function:

>>> df1                      >>> df2
             AAPL   GOOG                  MSFT   YHOO
2009-12-24    209  618.5     2009-12-24     31  16.72
2009-12-28  211.6  622.9     2009-12-28  31.17  16.88
2009-12-29  209.1  619.4     2009-12-29  31.39  16.92
2009-12-30  211.6  622.7     2009-12-30  30.96  16.98
2009-12-31  210.7    620

>>> df1.join(df2)
             AAPL   GOOG   MSFT   YHOO
2009-12-24    209  618.5     31  16.72
2009-12-28  211.6  622.9  31.17  16.88
2009-12-29  209.1  619.4  31.39  16.92
2009-12-30  211.6  622.7  30.96  16.98
2009-12-31  210.7    620    NaN    NaN

One might be interested in joining on something other than the index as well, such as the categorical data we presented in an earlier section:

>>> data.join(cats, on='item')
  country        date industry  item  value
0      US  2009-12-28     TECH  GOOG  622.9
1      US  2009-12-29     TECH  GOOG  619.4
2      US  2009-12-30     TECH  GOOG  622.7
3      US  2009-12-31     TECH  GOOG    620
4      US  2009-12-28     TECH  AAPL  211.6
5      US  2009-12-29     TECH  AAPL  209.1
6      US  2009-12-30     TECH  AAPL  211.6
7      US  2009-12-31     TECH  AAPL  210.7

This is akin to a SQL join operation between two tables or a VLOOKUP operation in a spreadsheet such as Excel. It is possible to join on multiple keys, in which case the table being joined is currently required to have a hierarchical index corresponding to those keys. We will be working on more joining and merging methods in a future release of pandas.
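As a hedged sketch of such a multiple-key join, suppose attrs is a hypothetical lookup table carrying one row of attributes per (item, date) pair; giving it a hierarchical index on those two columns lets it be joined against the corresponding columns of data:

>>> keyed = attrs.set_index(['item', 'date'])   # two-level index on the join keys
>>> data.join(keyed, on=['item', 'date'])       # each row's (item, date) pair is matched against that index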
Performance and use for Large Data Sets

Using DataFrame objects over homogeneous NumPy arrays for computation incurs overhead from a number of factors:
• Computational functions like sum, mean, and std have been overridden to omit missing data
• Most of the axis Index data structures are reliant on the Python dict for performing lookups and data alignment. This also results in a slightly larger memory footprint as the dict containing the label mapping is created once and then stored.
• The internal BlockManager data structure consolidates the data of each type (floating point, integer, boolean, object) into 2-dimensional arrays. However, this is an upfront cost that speeds up row-oriented computations and data alignment later.
• Performing repeated lookups of values by label passes through much more Python code than simple integer-based lookups on ndarray objects.

The savvy user will learn what operations are not very efficient in DataFrame and Series and fall back on working directly with the underlying ndarray objects (accessible via the values attribute) in such cases. What DataFrame sacrifices in performance it makes up for in flexibility and expressiveness.
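As a brief illustration of that trade-off (and of the missing-data handling noted in the first bullet above), a minimal sketch with a small hypothetical frame:

>>> import numpy as np
>>> from pandas import DataFrame
>>> frame = DataFrame({'x': [1., 2., np.nan], 'y': [4., 5., 6.]})
>>> frame.sum()          # label-aware column sums; the missing value is skipped (x -> 3.0, y -> 15.0)
>>> arr = frame.values   # drop down to the underlying 2-D ndarray
>>> arr.sum(axis=0)      # plain NumPy reduction: faster, but NaN propagates into the 'x' total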
With 64-bit integers representing timestamps, pandas in fact provides some of the fastest data alignment routines for differently-indexed time series to be found in open source software. As working with large, irregularly spaced time series requires having a timestamp index, pandas is well-positioned to become the gold standard for high performance open source time series processing.
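As a small hedged example of that alignment (written with current pandas names such as date_range; the API available at the time of writing differed in places), arithmetic between two series indexed by different sets of timestamps automatically aligns on the union of the two date indexes:

>>> from pandas import Series, date_range
>>> idx1 = date_range('2009-12-24', periods=4, freq='B')   # business days
>>> idx2 = date_range('2009-12-28', periods=4, freq='B')
>>> s1 = Series([1., 2., 3., 4.], index=idx1)
>>> s2 = Series([1., 1., 1., 1.], index=idx2)
>>> # dates present in only one of the two series yield NaN in the result
>>> s1 + s2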
With regard to memory usage and large data sets, pandas is currently only designed for use with in-memory data sets. We would like to expand its capability to work with data sets that do not fit into memory, perhaps transparently using the multiprocessing module or a parallel computing backend to orchestrate large scale computations.
pandas for R users

Given the "DataFrame" name and feature overlap with the [R] project and its 3rd party packages, pandas will draw inevitable comparisons with R. pandas brings a robust, full-featured, and integrated data analysis toolset to Python while maintaining a simple and easy-to-use API. As nearly all data manipulations involving data.frame objects in R can be easily expressed using the pandas DataFrame, it is relatively straightforward in most cases to port R functions to Python. It would be useful to provide a migration guide for R users, as we have not copied R's naming conventions or syntax in most places, instead choosing common-sense names and making the syntax and API as "Pythonic" as possible.

R does not provide indexing functionality in nearly such a deeply integrated way as pandas does. For example, operations between data.frame objects will proceed in R without regard to whether the labels match, as long as they are the same length and width. Some R packages, such as zoo and xts, provide indexed data structures with data alignment, but they are largely specialized to ordered time series data. Hierarchical indexing with constant-time subset selection is another significant feature missing from R's data structures.
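As a short hedged sketch of that hierarchical indexing (using the ix attribute of this era; current pandas spells the same operation with loc), selecting every row under one value of the outer index level is a cheap, purely label-based operation:

>>> import numpy as np
>>> from pandas import DataFrame, MultiIndex
>>> idx = MultiIndex.from_tuples([('foo', 1), ('foo', 2),
...                               ('bar', 1), ('bar', 2)])
>>> hdf = DataFrame(np.random.randn(4, 2), index=idx, columns=['A', 'B'])
>>> hdf.ix['foo']   # the ('foo', *) sub-block, with the outer level dropped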
Outside of the scope of this paper is a rigorous performance comparison of R and pandas. In almost all of the benchmarks we have run comparing R and pandas, pandas significantly outperforms R.

Other features of note

There are many other features in pandas worth exploring for the interested user:
• Time series functionality: date range generation, shifting and lagging, frequency conversion and forward/backward filling (see the brief sketch after this list)
• Integration with [matplotlib] to concisely generate plots with metadata
• Moving window statistics (e.g. moving standard deviation, exponentially weighted moving average) and moving window linear and panel regression
• 3-dimensional Panel data structure for manipulating collections of DataFrame objects
• Sparse versions of the data structures
• Robust IO tools for reading and writing pandas objects to flat files (delimited text, CSV, Excel) and HDF5 format
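As a brief hedged sketch of the time series and moving window items above (written against current pandas names such as date_range and rolling; the API at the time of writing differed):

>>> from pandas import Series, date_range
>>> rng = date_range('2000-01-03', periods=5, freq='B')   # date range generation
>>> ts = Series([1., 2., 3., 4., 5.], index=rng)
>>> ts.shift(1)             # shifting / lagging by one period
>>> ts.asfreq('D')          # frequency conversion to calendar days; new gaps become NaN
>>> ts.asfreq('D').ffill()  # forward filling those gaps
>>> ts.rolling(3).mean()    # a simple moving window statistic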
Related packages

A number of other Python packages have some degree of feature overlap with pandas. Among these, la ([Larry]) is the most similar, as it implements a labeled ndarray object intending to closely mimic NumPy arrays. Since ndarray is only applicable to many problems in its homogeneous (non-structured dtype) form, in pandas we have distanced ourselves from ndarray to instead provide a more flexible, (potentially) heterogeneous, size-mutable data structure. The references include some other packages of interest.

pandas will soon become a dependency of statsmodels ([StaM]), the main statistics and econometrics library in Python, to make statistical modeling and data analysis tools in Python more cohesive and integrated. We plan to combine pandas with a formula framework to make specifying statistical models easy and intuitive when working with a DataFrame of data, for example.

Conclusions

We believe that in the coming years there will be great opportunity to attract users in need of statistical data analysis tools to Python who might have previously chosen R, MATLAB, or another research environment. By designing robust, easy-to-use data structures that cohere with the rest of the scientific Python stack, we can make Python a compelling choice for data analysis applications. In our opinion, pandas provides a solid foundation upon which a very powerful data analysis ecosystem can be established.

References

[pandas] W. McKinney, pandas: a python data analysis library, http://pandas.sourceforge.net
[scipy2010] W. McKinney, Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, http://conference.scipy.org/, 2010
[Larry] K. Goodman, la / larry: ndarray with labeled axes, http://larry.sourceforge.net/
[SciTS] M. Knox, P. Gerard-Marchant, scikits.timeseries: python time series analysis, http://pytseries.sourceforge.net/
[StaM] S. Seabold, J. Perktold, J. Taylor, statsmodels: statistical modeling in Python, http://statsmodels.sourceforge.net
[SciL] D. Cournapeau, et al., scikit-learn: machine learning in Python, http://scikit-learn.sourceforge.net
[PyMC] C. Fonnesbeck, A. Patil, D. Huard, PyMC: Markov Chain Monte Carlo for Python, http://code.google.com/p/pymc/
[Tab] D. Yamins, E. Angelino, tabular: tabarray data structure for 2D data, http://parsemydata.com/tabular/
[NumPy] T. Oliphant, http://numpy.scipy.org
[SciPy] E. Jones, T. Oliphant, P. Peterson, http://scipy.org
[matplotlib] J. Hunter, et al., matplotlib: Python plotting, http://matplotlib.sourceforge.net/
[EPD] Enthought, Inc., EPD: Enthought Python Distribution, http://www.enthought.com/products/epd.php
[Pythonxy] P. Raybaut, Python(x,y): Scientific-oriented Python distribution, http://www.pythonxy.com/
[CRAN] The R Project for Statistical Computing, http://cran.r-project.org/
[Cython] G. Ewing, R. W. Bradshaw, S. Behnel, D. S. Seljebotn, et al., The Cython compiler, http://cython.org
[IPython] F. Pérez, B. E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53, http://ipython.org
[Grun] Baltagi, Grunfeld data set, http://www.wiley.com/legacy/wileychi/baltagi/
[nipy] J. Taylor, F. Perez, et al., nipy: Neuroimaging in Python, http://nipy.sourceforge.net
[pydataframe] A. Straw, F. Finkernagel, pydataframe, http://code.google.com/p/pydataframe/
[R] R Development Core Team, 2010, R: A Language and Environment for Statistical Computing, http://www.R-project.org
[MATLAB] The MathWorks Inc., 2010, MATLAB, http://www.mathworks.com
[Stata] StataCorp, 2010, Stata Statistical Software: Release 11, http://www.stata.com
[SAS] SAS Institute Inc., SAS System, http://www.sas.com
[Bryant] P. G. Bryant, M. Smith (1995), Practical Data Analysis: Case Studies in Business Statistics, Homewood, IL: Richard D. Irwin Publishing