pandas: powerful Python data analysis toolkit, Release 1.3.5

In [51]: left_join = df1.merge(df2, on=["key"], how="left")

In [52]: left_join
Out[52]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [53]: right_join = df1.merge(df2, on=["key"], how="right")

In [54]: right_join
Out[54]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

In [55]: outer_join = df1.merge(df2, on=["key"], how="outer")

In [56]: outer_join
Out[56]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

merge has a number of advantages over VLOOKUP (a short sketch follows the list):


• The lookup value doesn’t need to be the first column of the lookup table
• If multiple rows are matched, there will be one row for each match, instead of just the first
• It will include all columns from the lookup table, instead of just a single specified column
• It supports more complex join operations
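
For instance, a minimal sketch of the first and third points, using hypothetical products and orders tables:

products = pd.DataFrame(
    {"price": [9.99, 19.99], "sku": ["A1", "B2"], "stock": [10, 3]}
)
orders = pd.DataFrame({"sku": ["B2", "A1", "B2"]})
# "sku" is not the first column of the lookup table, and all of its other
# columns ("price", "stock") are brought over in a single merge
orders.merge(products, on="sku", how="left")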

Other considerations

Fill Handle

Create a series of numbers following a set pattern in a certain set of cells. In a spreadsheet, this would be done by
shift+drag after entering the first number or by entering the first two or three values and then dragging.
This can be achieved by creating a series and assigning it to the desired cells.

In [57]: df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))})

In [58]: df
Out[58]:
AAA BBB
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7

In [59]: series = list(range(1, 5))

In [60]: series
Out[60]: [1, 2, 3, 4]

In [61]: df.loc[2:5, "AAA"] = series

In [62]: df
Out[62]:
AAA BBB
0 1 0
1 1 1
2 1 2
3 2 3
4 3 4
5 4 5
6 1 6
7 1 7

Drop Duplicates

Excel has built-in functionality for removing duplicate values. This is supported in pandas via drop_duplicates().

In [63]: df = pd.DataFrame(
....: {
....: "class": ["A", "A", "A", "B", "C", "D"],
....: "student_count": [42, 35, 42, 50, 47, 45],
....: "all_pass": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
....: }
....: )
....:

In [64]: df.drop_duplicates()
Out[64]:
class student_count all_pass
0 A 42 Yes
1 A 35 Yes
3 B 50 No
4 C 47 No
5 D 45 Yes

In [65]: df.drop_duplicates(["class", "student_count"])


Out[65]:
class student_count all_pass
0 A 42 Yes
1 A 35 Yes
3 B 50 No
4 C 47 No
5 D 45 Yes

Pivot Tables

PivotTables from spreadsheets can be replicated in pandas through Reshaping and pivot tables. Using the tips dataset
again, let’s find the average gratuity by size of the party and sex of the server.
In Excel, we would use a PivotTable configured with size as the row field, sex as the column field, and average of
tip as the values.

The equivalent in pandas:

In [66]: pd.pivot_table(
....: tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average
....: )
....:
Out[66]:
sex Female Male
size
1 1.276667 1.920000
2 2.528448 2.614184
3 3.250000 3.476667
4 4.021111 4.172143
5 5.140000 3.750000
6 4.600000 5.850000

Adding a row

Assuming we are using a RangeIndex (numbered 0, 1, etc.), we can use DataFrame.append() to add a row to the
bottom of a DataFrame.

In [67]: df
Out[67]:
class student_count all_pass
0 A 42 Yes
1 A 35 Yes
2 A 42 Yes
3 B 50 No
4 C 47 No
5 D 45 Yes

In [68]: new_row = {"class": "E", "student_count": 51, "all_pass": True}

In [69]: df.append(new_row, ignore_index=True)


Out[69]:
class student_count all_pass
0 A 42 Yes
1 A 35 Yes
2 A 42 Yes
3 B 50 No
4 C 47 No
5 D 45 Yes
6 E 51 True

Find and Replace

Excel’s Find dialog takes you to cells that match, one by one. In pandas, this operation is generally done for an entire
column or DataFrame at once through conditional expressions.

In [70]: tips
Out[70]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
.. ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3
156 46.17 5.00 Male No Sun Dinner 6
59 46.27 6.73 Male No Sat Dinner 4
212 46.33 9.00 Male No Sat Dinner 4
170 48.81 10.00 Male Yes Sat Dinner 3

[244 rows x 7 columns]

In [71]: tips == "Sun"


Out[71]:
total_bill tip sex smoker day time size
67 False False False False False False False
92 False False False False False False False
111 False False False False False False False
145 False False False False False False False
135 False False False False False False False
.. ... ... ... ... ... ... ...
182 False False False False True False False
156 False False False False True False False
59 False False False False False False False
212 False False False False False False False
170 False False False False False False False

[244 rows x 7 columns]

In [72]: tips["day"].str.contains("S")
Out[72]:
67 True
92 False
111 True
145 False
135 False
...
182 True
156 True
59 True
212 True
170 True
Name: day, Length: 244, dtype: bool

pandas’ replace() is comparable to Excel’s Replace All.

In [73]: tips.replace("Thur", "Thursday")
Out[73]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thursday Lunch 2
135 6.51 1.25 Female No Thursday Lunch 2
.. ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3
156 46.17 5.00 Male No Sun Dinner 6
59 46.27 6.73 Male No Sat Dinner 4
212 46.33 9.00 Male No Sat Dinner 4
170 48.81 10.00 Male Yes Sat Dinner 3
[244 rows x 7 columns]

Comparison with SAS

For potential users coming from SAS, this page is meant to demonstrate how different SAS operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Data structures

General terminology translation

pandas     SAS
DataFrame  data set
column     variable
row        observation
groupby    BY-group
NaN        .

DataFrame

A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that can
be of different types. As will be shown in this document, almost any operation that can be applied to a data set using
SAS's DATA step can also be accomplished in pandas.

Series

A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data structure
for a single column, but in general, working with a Series is analogous to referencing a column in the DATA step.
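
A minimal sketch, assuming a DataFrame df with a numeric column x:

df["x"]      # selecting one column yields a Series
df["x"] * 2  # operations apply to the whole column at once, as in a DATA step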

Index

Every DataFrame and Series has an Index - labels on the rows of the data. SAS does not have an exactly analogous
concept. A data set's rows are essentially unlabeled, other than an implicit integer index that can be accessed during
the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a
collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
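
A short sketch of the default integer index versus an explicitly assigned one:

df = pd.DataFrame({"x": [1, 3, 5]})
df.index                 # RangeIndex(start=0, stop=3, step=1)
df2 = df.set_index("x")  # use the values of "x" as the row labels instead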

Copies vs. in place operations

Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either
assign to a new variable:

sorted_df = df.sort_values("col1")

or overwrite the original one:

df = df.sort_values("col1")

Note: You will see an inplace=True keyword argument available for some methods:

df.sort_values("col1", inplace=True)

Its use is discouraged. More information.

Data input / output

Constructing a DataFrame from values

A SAS data set can be built from specified values by placing the data after a datalines statement and specifying the
column names.

data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient
to specify it as a Python dictionary, where the keys are the column names and the values are the data.

In [1]: df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

In [2]: df
Out[2]:
x y
0 1 2
1 3 4
2 5 6

Reading external data

Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas
tests (csv), will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.

proc import datafile='tips.csv' dbms=csv out=tips replace;
getnames=yes;
run;

The pandas method is read_csv(), which works similarly.

In [3]: url = (
...: "https://raw.github.com/pandas-dev/"
...: "pandas/master/pandas/tests/io/data/csv/tips.csv"
...: )
...:

In [4]: tips = pd.read_csv(url)

In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[244 rows x 7 columns]

Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed. For example,
if the data was instead tab delimited, and did not have column names, the pandas command would be:

tips = pd.read_csv("tips.csv", sep="\t", header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter


tips = pd.read_table("tips.csv", header=None)

In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These
are all read via a pd.read_* function. See the IO documentation for more details.
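
A sketch with hypothetical file names; some formats require an optional dependency (e.g. openpyxl for Excel):

df_excel = pd.read_excel("data.xlsx")
df_hdf = pd.read_hdf("data.h5", key="df")

# SQL reads go through a database connection, e.g. sqlite3 from the standard library
import sqlite3

conn = sqlite3.connect("data.db")
df_sql = pd.read_sql("SELECT * FROM tips", conn)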

Limiting output

By default, pandas will truncate output of large DataFrames to show the first and last rows. This can be overridden by
changing the pandas options, or using DataFrame.head() or DataFrame.tail().

In [1]: tips.head(5)
Out[1]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

The equivalent in SAS would be:

proc print data=df(obs=5);
run;
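
The tail() counterpart and the option-based alternative mentioned above could look like this (a sketch):

tips.tail(5)                           # last 5 rows
pd.set_option("display.max_rows", 10)  # change how many rows pandas prints by default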

Exporting data

The inverse of PROC IMPORT in SAS is PROC EXPORT

proc export data=tips outfile='tips2.csv' dbms=csv;
run;

Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.

tips.to_csv("tips2.csv")
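
Other writers follow the same naming pattern; a sketch with hypothetical output paths (the Excel and Parquet
writers need optional dependencies):

tips.to_excel("tips2.xlsx", index=False)
tips.to_parquet("tips2.parquet")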

Data operations

Operations on columns

In the DATA step, arbitrary math expressions can be used on new or existing columns.

data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;

pandas provides vectorized operations by specifying the individual Series in the DataFrame. New columns can be
assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

In [1]: tips["total_bill"] = tips["total_bill"] - 2

In [2]: tips["new_bill"] = tips["total_bill"] / 2

In [3]: tips
Out[3]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
.. ... ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3 13.515
240 25.18 2.00 Female Yes Sat Dinner 2 12.590
241 20.67 2.00 Male Yes Sat Dinner 2 10.335
242 15.82 1.75 Male No Sat Dinner 2 7.910
243 16.78 3.00 Female No Thur Dinner 2 8.390

[244 rows x 8 columns]

In [4]: tips = tips.drop("new_bill", axis=1)

Filtering

Filtering in SAS is done with an if or where statement, on one or more columns.

data tips;
set tips;
if total_bill > 10;
run;

data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;

DataFrames can be filtered in multiple ways; the most intuitive of these is boolean indexing.

In [1]: tips[tips["total_bill"] > 10]
Out[1]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[204 rows x 7 columns]

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.

In [1]: is_dinner = tips["time"] == "Dinner"

In [2]: is_dinner
Out[2]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool

In [3]: is_dinner.value_counts()
Out[3]:
True 176
False 68
Name: time, dtype: int64

In [4]: tips[is_dinner]
Out[4]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[176 rows x 7 columns]

If/then logic

In SAS, if/then logic can be used to create new columns.

data tips;
set tips;
format bucket $4.;
if total_bill < 10 then bucket = 'low';
else bucket = 'high';
run;

The same operation in pandas can be accomplished using the where function from numpy.

In [1]: tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")

In [2]: tips
Out[2]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
.. ... ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3 high
240 25.18 2.00 Female Yes Sat Dinner 2 high
241 20.67 2.00 Male Yes Sat Dinner 2 high
242 15.82 1.75 Male No Sat Dinner 2 high
243 16.78 3.00 Female No Thur Dinner 2 high

[244 rows x 8 columns]
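
For more than two buckets, one option is numpy's select(); a sketch that adds a hypothetical bucket3 column:

conditions = [tips["total_bill"] < 10, tips["total_bill"] < 20]
choices = ["low", "mid"]
# conditions are checked in order; rows matching none fall back to the default
tips["bucket3"] = np.select(conditions, choices, default="high")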

Date functionality

SAS provides a variety of functions to do operations on date/datetime columns.

data tips;
set tips;
format date1 date2 date1_plusmonth mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;

The equivalent pandas operations are shown below. In addition to these functions pandas supports other Time Series
features not available in Base SAS (such as resampling and custom offsets) - see the timeseries documentation for more
details.

In [1]: tips["date1"] = pd.Timestamp("2013-01-15")

In [2]: tips["date2"] = pd.Timestamp("2015-02-15")

In [3]: tips["date1_year"] = tips["date1"].dt.year

In [4]: tips["date2_month"] = tips["date2"].dt.month

In [5]: tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()

In [6]: tips["months_between"] = tips["date2"].dt.to_period("M") - tips[


...: "date1"
...: ].dt.to_period("M")
...:

In [7]: tips[
   ...:     ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
   ...: ]
   ...:
Out[7]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
.. ... ... ... ... ... ...
239 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
240 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
241 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
242 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
243 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>

[244 rows x 6 columns]
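
As a sketch of one such feature, resampling aggregates a datetime-indexed Series to a coarser frequency (synthetic
data, not the tips set):

s = pd.Series(range(90), index=pd.date_range("2013-01-01", periods=90, freq="D"))
s.resample("M").sum()  # one row per calendar month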

Selection of columns

SAS provides keywords in the DATA step to select, drop, and rename columns.

data tips;
set tips;
keep sex total_bill tip;
run;

data tips;
set tips;
drop sex;
run;

data tips;
set tips;
rename total_bill=total_bill_2;
run;

The same operations are expressed in pandas below.

Keep certain columns

In [1]: tips[["sex", "total_bill", "tip"]]


Out[1]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
.. ... ... ...
239 Male 27.03 5.92
240 Female 25.18 2.00
241 Male 20.67 2.00
242 Male 15.82 1.75
243 Female 16.78 3.00

[244 rows x 3 columns]

Drop a column

In [2]: tips.drop("sex", axis=1)


Out[2]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
.. ... ... ... ... ... ...
239 27.03 5.92 No Sat Dinner 3
240 25.18 2.00 Yes Sat Dinner 2
241 20.67 2.00 Yes Sat Dinner 2
242 15.82 1.75 No Sat Dinner 2
243 16.78 3.00 No Thur Dinner 2

[244 rows x 6 columns]

Rename a column

In [1]: tips.rename(columns={"total_bill": "total_bill_2"})


Out[1]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[244 rows x 7 columns]

Sorting by values

Sorting in SAS is accomplished via PROC SORT

proc sort data=tips;
by sex total_bill;
run;

pandas has a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [1]: tips = tips.sort_values(["sex", "total_bill"])

In [2]: tips
Out[2]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
.. ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3
156 46.17 5.00 Male No Sun Dinner 6
59 46.27 6.73 Male No Sat Dinner 4
212 46.33 9.00 Male No Sat Dinner 4
170 48.81 10.00 Male Yes Sat Dinner 3

[244 rows x 7 columns]
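
Sort direction can be set per column with the ascending argument (a sketch):

tips.sort_values(["sex", "total_bill"], ascending=[True, False])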

String processing

Finding length of string

SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN excludes
trailing blanks and LENGTHC includes trailing blanks.

data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;

You can find the length of a character string with Series.str.len(). In Python 3, all strings are Unicode strings.
len includes trailing blanks. Use len and rstrip to exclude trailing blanks.

In [1]: tips["time"].str.len()
Out[1]:
67 6
92 6
111 6
145 5
135 5
..
182 6
156 6
59 6
212 6
170 6
Name: time, Length: 244, dtype: int64

In [2]: tips["time"].str.rstrip().str.len()
Out[2]:
67 6
92 6
111 6
145 5
135 5
..
182 6
156 6
59 6
212 6
170 6
Name: time, Length: 244, dtype: int64

Finding position of substring

SAS determines the position of a character in a string with the FINDW function. FINDW takes the string defined by the
first argument and searches for the first position of the substring you supply as the second argument.

data _null_;
set tips;
put(FINDW(sex,'ale'));
run;

You can find the position of a character in a column of strings with the Series.str.find() method. find searches
for the first position of the substring. If the substring is found, the method returns its position. If not found, it returns
-1. Keep in mind that Python indexes are zero-based.

In [1]: tips["sex"].str.find("ale")
Out[1]:
67 3
92 3
111 3
145 3
135 3
..
182 1
156 1
59 1
212 1
170 1
Name: sex, Length: 244, dtype: int64

Extracting substring by position

SAS extracts a substring from a string based on its position with the SUBSTR function.

data _null_;
set tips;
put(substr(sex,1,1));
run;

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.

In [1]: tips["sex"].str[0:1]
Out[1]:
67 F
92 F
111 F
145 F
135 F
..
182 M
156 M
59 M
212 M
170 M
Name: sex, Length: 244, dtype: object

Extracting nth word

The SAS SCAN function returns the nth word from a string. The first argument is the string you want to parse and the
second argument specifies which word you want to extract.

data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;

The simplest way to extract words in pandas is to split the strings by spaces, then reference the word by index. Note
there are more powerful approaches should you need them.

In [1]: firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

In [2]: firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]

In [3]: firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]

In [4]: firstlast
Out[4]:
String First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
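
One of the more powerful approaches mentioned above is a regular expression via Series.str.extract(), sketched here:

# named groups become column names in the result
firstlast["String"].str.extract(r"(?P<First>\w+) (?P<Last>\w+)")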

Changing case

The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.

data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;

The equivalent pandas methods are Series.str.upper(), Series.str.lower(), and Series.str.title().

In [1]: firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})

In [2]: firstlast["upper"] = firstlast["string"].str.upper()

In [3]: firstlast["lower"] = firstlast["string"].str.lower()

In [4]: firstlast["title"] = firstlast["string"].str.title()

In [5]: firstlast
Out[5]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples:


In [1]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [2]: df1
Out[2]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632

In [3]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

In [4]: df2
Out[4]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236

In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the in= dummy
variables to track whether a match was found in one or both input frames.

proc sort data=df1;
by key;
run;

proc sort data=df2;
by key;
run;

data left_join inner_join right_join outer_join;
merge df1(in=a) df2(in=b);
if a and b then output inner_join;
if a then output left_join;
if b then output right_join;
if a or b then output outer_join;
run;

pandas DataFrames have a merge() method, which provides similar functionality. The data does not have to be sorted
ahead of time, and different join types are accomplished via the how keyword.

In [1]: inner_join = df1.merge(df2, on=["key"], how="inner")

In [2]: inner_join
Out[2]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

In [3]: left_join = df1.merge(df2, on=["key"], how="left")

In [4]: left_join
Out[4]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [5]: right_join = df1.merge(df2, on=["key"], how="right")

In [6]: right_join
Out[6]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

In [7]: outer_join = df1.merge(df2, on=["key"], how="outer")

In [8]: outer_join
Out[8]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

Missing data

Both pandas and SAS have a representation for missing data.


pandas represents missing data with the special float value NaN (not a number). Many of the semantics are the same;
for example missing data propagates through numeric operations, and is ignored by default for aggregations.

In [1]: outer_join
Out[1]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

In [2]: outer_join["value_x"] + outer_join["value_y"]


Out[2]:
0 NaN
1 0.929249
2 NaN
3 -1.308847
4 -1.016424
5 NaN
dtype: float64

In [3]: outer_join["value_x"].sum()
Out[3]: -3.5940742896293765
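
To make missing values propagate through an aggregation instead of being skipped, most reductions accept a skipna
flag (a sketch):

outer_join["value_x"].sum(skipna=False)  # nan, because one value is missing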

One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this to
filter missing values.

data outer_join_nulls;
set outer_join;
if value_x = .;
run;

data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;

In pandas, Series.isna() and Series.notna() can be used to filter the rows.

In [1]: outer_join[outer_join["value_x"].isna()]
Out[1]:
key value_x value_y
5 E NaN -1.044236

In [2]: outer_join[outer_join["value_x"].notna()]
Out[2]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

pandas provides a variety of methods to work with missing data. Here are some examples:

Drop rows with missing values

In [3]: outer_join.dropna()
Out[3]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

Forward fill from previous rows

In [4]: outer_join.fillna(method="ffill")
Out[4]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236

Replace missing values with a specified value

Using the mean:

In [1]: outer_join["value_x"].fillna(outer_join["value_x"].mean())
Out[1]:
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 -1.135632
5 -0.718815
Name: value_x, dtype: float64

GroupBy

Aggregation

SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on numeric
columns.

proc summary data=tips nway;
class sex smoker;
var total_bill tip;
output out=tips_summed sum=;
run;

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.

In [1]: tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()

In [2]: tips_summed
Out[2]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
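
Several statistics at once, which PROC SUMMARY would express as multiple output statistics, can be sketched with
agg():

tips.groupby(["sex", "smoker"])["total_bill"].agg(["sum", "mean", "count"])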

Transformation

In SAS, if the group aggregations need to be used with the original frame, it must be merged back together. For example,
to subtract the mean for each observation by smoker group.

proc summary data=tips missing nway;
class smoker;
var total_bill;
output out=smoker_means mean(total_bill)=group_bill;
run;

proc sort data=tips;
by smoker;
run;

data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;

pandas provides a Transformation mechanism that allows these type of operations to be succinctly expressed in one
operation.

In [1]: gb = tips.groupby("smoker")["total_bill"]

In [2]: tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")

In [3]: tips
Out[3]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
.. ... ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3 24.593656
156 46.17 5.00 Male No Sun Dinner 6 28.981722
59 46.27 6.73 Male No Sat Dinner 4 29.081722
212 46.33 9.00 Male No Sat Dinner 4 29.141722
170 48.81 10.00 Male Yes Sat Dinner 3 30.053656

[244 rows x 8 columns]

By group processing

In addition to aggregation, pandas groupby can be used to replicate most other by group processing from SAS. For
example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.

proc sort data=tips;
by sex smoker;
run;

data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;

In pandas this would be written as:

In [4]: tips.groupby(["sex", "smoker"]).first()


Out[4]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344
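
A related sketch that keeps the first row of each group with its original index, rather than aggregating:

tips.groupby(["sex", "smoker"]).head(1)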

Other considerations

Disk vs memory

pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of data able
to be loaded in pandas is limited by your machine's memory, but also that the operations on that data may be faster.
If out-of-core processing is needed, one possibility is the dask.dataframe library (currently in development), which
provides a subset of pandas functionality for an on-disk DataFrame.

Data interop

pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary format.

libname xportout xport 'transport-file.xpt';
data xportout.tips;
set tips(rename=(total_bill=tbill));
* xport variable names limited to 6 characters;
run;

df = pd.read_sas("transport-file.xpt")
df = pd.read_sas("binary-file.sas7bdat")

You can also specify the file format directly. By default, pandas will try to infer the file format based on its extension.

df = pd.read_sas("transport-file.xpt", format="xport")
df = pd.read_sas("binary-file.sas7bdat", format="sas7bdat")

XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas readers. An
alternative way to move data between SAS and pandas is to serialize to csv.

# version 0.17, 10M rows

In [8]: %time df = pd.read_sas('big.xpt')
Wall time: 14.6 s

In [9]: %time df = pd.read_csv('big.csv')
Wall time: 4.86 s

Comparison with Stata

For potential users coming from Stata, this page is meant to demonstrate how different Stata operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Data structures

General terminology translation

pandas     Stata
DataFrame  data set
column     variable
row        observation
groupby    bysort
NaN        .

DataFrame

A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns that can
be of different types. As will be shown in this document, almost any operation that can be applied to a data set in Stata
can also be accomplished in pandas.

Series

A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate data structure
for a single column, but in general, working with a Series is analogous to referencing a column of a data set in Stata.

Index

Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly analogous
concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index that can be accessed
with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a
collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.

Copies vs. in place operations

Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either
assign to a new variable:

sorted_df = df.sort_values("col1")

or overwrite the original one:

df = df.sort_values("col1")

Note: You will see an inplace=True keyword argument available for some methods:

df.sort_values("col1", inplace=True)

Its use is discouraged. More information.

Data input / output

Constructing a DataFrame from values

A Stata data set can be built from specified values by placing the data after an input statement and specifying the
column names.

input x y
1 2
3 4
5 6
end

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient
to specify it as a Python dictionary, where the keys are the column names and the values are the data.

In [3]: df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6

Reading external data

Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within the pandas
tests (csv), will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the current
working directory, we can import it as follows.

import delimited tips.csv

The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the data set
if presented with a url.

In [5]: url = (
...: "https://raw.github.com/pandas-dev"
...: "/pandas/master/pandas/tests/io/data/csv/tips.csv"
...: )
...:

In [6]: tips = pd.read_csv(url)

In [7]: tips
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[244 rows x 7 columns]

Like import delimited, read_csv() can take a number of parameters to specify how the data should be parsed.
For example, if the data were instead tab delimited, did not have column names, and existed in the current working
directory, the pandas command would be:

tips = pd.read_csv("tips.csv", sep="\t", header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter


tips = pd.read_table("tips.csv", header=None)

pandas can also read Stata data sets in .dta format with the read_stata() function.

df = pd.read_stata("data.dta")

In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS, HDF5,
Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more details.

Limiting output

By default, pandas will truncate output of large DataFrames to show the first and last rows. This can be overridden by
changing the pandas options, or using DataFrame.head() or DataFrame.tail().

In [8]: tips.head(5)
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

The equivalent in Stata would be:

list in 1/5

Exporting data

The inverse of import delimited in Stata is export delimited

export delimited tips2.csv

Similarly in pandas, the opposite of read_csv is DataFrame.to_csv().

tips.to_csv("tips2.csv")

pandas can also export to Stata file format with the DataFrame.to_stata() method.

tips.to_stata("tips2.dta")

Data operations

Operations on columns

In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing
columns. The drop command drops the column from the data set.

replace total_bill = total_bill - 2
generate new_bill = total_bill / 2
drop new_bill

pandas provides vectorized operations by specifying the individual Series in the DataFrame. New columns can be
assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

In [9]: tips["total_bill"] = tips["total_bill"] - 2

In [10]: tips["new_bill"] = tips["total_bill"] / 2

In [11]: tips
Out[11]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
.. ... ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3 13.515
240 25.18 2.00 Female Yes Sat Dinner 2 12.590
241 20.67 2.00 Male Yes Sat Dinner 2 10.335
242 15.82 1.75 Male No Sat Dinner 2 7.910
243 16.78 3.00 Female No Thur Dinner 2 8.390

[244 rows x 8 columns]

In [12]: tips = tips.drop("new_bill", axis=1)

Filtering

Filtering in Stata is done with an if clause on one or more columns.

list if total_bill > 10

DataFrames can be filtered in multiple ways; the most intuitive of these is boolean indexing.

In [13]: tips[tips["total_bill"] > 10]


Out[13]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[204 rows x 7 columns]

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.

In [14]: is_dinner = tips["time"] == "Dinner"

In [15]: is_dinner
Out[15]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool

In [16]: is_dinner.value_counts()
Out[16]:
True 176
False 68
Name: time, dtype: int64

In [17]: tips[is_dinner]
Out[17]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[176 rows x 7 columns]

If/then logic

In Stata, an if clause can also be used to create new columns.

generate bucket = "low" if total_bill < 10


replace bucket = "high" if total_bill >= 10

The same operation in pandas can be accomplished using the where function from numpy.

In [18]: tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")

In [19]: tips
Out[19]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
.. ... ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3 high
240 25.18 2.00 Female Yes Sat Dinner 2 high
241 20.67 2.00 Male Yes Sat Dinner 2 high
242 15.82 1.75 Male No Sat Dinner 2 high
243 16.78 3.00 Female No Thur Dinner 2 high

[244 rows x 8 columns]

Date functionality

Stata provides a variety of functions to do operations on date/datetime columns.

generate date1 = mdy(1, 15, 2013)
generate date2 = date("Feb152015", "MDY")
generate date1_year = year(date1)
generate date2_month = month(date2)

* shift date to beginning of next month
generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12
generate months_between = mofd(date2) - mofd(date1)

list date1 date2 date1_year date2_month date1_next months_between

The equivalent pandas operations are shown below. In addition to these functions, pandas supports other Time Series
features not available in Stata (such as time zone handling and custom offsets) – see the timeseries documentation for
more details.

In [20]: tips["date1"] = pd.Timestamp("2013-01-15")

In [21]: tips["date2"] = pd.Timestamp("2015-02-15")

In [22]: tips["date1_year"] = tips["date1"].dt.year

In [23]: tips["date2_month"] = tips["date2"].dt.month

In [24]: tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()

In [25]: tips["months_between"] = tips["date2"].dt.to_period("M") - tips[


....: "date1"
....: ].dt.to_period("M")
....:

In [26]: tips[
....: ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_
˓→between"]

....: ]
....:
Out[26]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
.. ... ... ... ... ... ...
239 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
240 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
241 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
242 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
243 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>

[244 rows x 6 columns]
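
As a sketch of the time zone handling mentioned above:

s = pd.Series(pd.date_range("2013-01-15", periods=3, freq="D"))
s.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")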

Selection of columns

Stata provides keywords to select, drop, and rename columns.


keep sex total_bill tip
drop sex
rename total_bill total_bill_2

The same operations are expressed in pandas below.

Keep certain columns

In [27]: tips[["sex", "total_bill", "tip"]]


Out[27]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
.. ... ... ...
239 Male 27.03 5.92
240 Female 25.18 2.00
241 Male 20.67 2.00
242 Male 15.82 1.75
243 Female 16.78 3.00

[244 rows x 3 columns]

Drop a column

In [28]: tips.drop("sex", axis=1)


Out[28]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
.. ... ... ... ... ... ...
239 27.03 5.92 No Sat Dinner 3
240 25.18 2.00 Yes Sat Dinner 2
241 20.67 2.00 Yes Sat Dinner 2
242 15.82 1.75 No Sat Dinner 2
243 16.78 3.00 No Thur Dinner 2

[244 rows x 6 columns]

Rename a column

In [29]: tips.rename(columns={"total_bill": "total_bill_2"})


Out[29]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 27.03 5.92 Male No Sat Dinner 3
240 25.18 2.00 Female Yes Sat Dinner 2
241 20.67 2.00 Male Yes Sat Dinner 2
242 15.82 1.75 Male No Sat Dinner 2
243 16.78 3.00 Female No Thur Dinner 2

[244 rows x 7 columns]

Sorting by values

Sorting in Stata is accomplished via sort

sort sex total_bill

pandas has a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [30]: tips = tips.sort_values(["sex", "total_bill"])

In [31]: tips
Out[31]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
.. ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3
156 46.17 5.00 Male No Sun Dinner 6
59 46.27 6.73 Male No Sat Dinner 4
212 46.33 9.00 Male No Sat Dinner 4
170 48.81 10.00 Male Yes Sat Dinner 3
[244 rows x 7 columns]

String processing

Finding length of string

Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and Unicode
strings, respectively.

generate strlen_time = strlen(time)
generate ustrlen_time = ustrlen(time)

You can find the length of a character string with Series.str.len(). In Python 3, all strings are Unicode strings.
len includes trailing blanks. Use len and rstrip to exclude trailing blanks.

In [32]: tips["time"].str.len()
Out[32]:
67 6
92 6
111 6
145 5
135 5
..
182 6
156 6
59 6
212 6
170 6
Name: time, Length: 244, dtype: int64

In [33]: tips["time"].str.rstrip().str.len()
Out[33]:
67 6
92 6
111 6
145 5
135 5
..
182 6
156 6
59 6
212 6
170 6
Name: time, Length: 244, dtype: int64

Finding position of substring

Stata determines the position of a character in a string with the strpos() function. This takes the string defined by
the first argument and searches for the first position of the substring you supply as the second argument.

generate str_position = strpos(sex, "ale")

You can find the position of a character in a column of strings with the Series.str.find() method. find searches
for the first position of the substring. If the substring is found, the method returns its position. If not found, it returns
-1. Keep in mind that Python indexes are zero-based.

In [34]: tips["sex"].str.find("ale")
Out[34]:
67 3
92 3
111 3
145 3
135 3
..
182 1
156 1
59 1
212 1
170 1
Name: sex, Length: 244, dtype: int64

Extracting substring by position

Stata extracts a substring from a string based on its position with the substr() function.

generate short_sex = substr(sex, 1, 1)

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.

In [35]: tips["sex"].str[0:1]
Out[35]:
67 F
92 F
111 F
145 F
135 F
..
182 M
156 M
59 M
212 M
170 M
Name: sex, Length: 244, dtype: object

Extracting nth word

The Stata word() function returns the nth word from a string. The first argument is the string you want to parse and
the second argument specifies which word you want to extract.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate first_name = word(string, 1)
generate last_name = word(string, -1)

The simplest way to extract words in pandas is to split the strings by spaces, then reference the word by index. Note
there are more powerful approaches should you need them.

In [36]: firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

In [37]: firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]

In [38]: firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]

In [39]: firstlast
Out[39]:
String First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane

Changing case

The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions
change the case of ASCII and Unicode strings, respectively.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate upper = strupper(string)
generate lower = strlower(string)
generate title = strproper(string)
list

The equivalent pandas methods are Series.str.upper(), Series.str.lower(), and Series.str.title().

In [40]: firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})

In [41]: firstlast["upper"] = firstlast["string"].str.upper()

In [42]: firstlast["lower"] = firstlast["string"].str.lower()


In [43]: firstlast["title"] = firstlast["string"].str.title()

In [44]: firstlast
Out[44]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples:

In [45]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [46]: df1
Out[46]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632

In [47]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

In [48]: df2
Out[48]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236

In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk.
In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after the merge.
One can keep only observations from the initial data set, the merged data set, or the intersection of the two by using the
values created in the _merge variable.

* First create df2 and save to disk
clear
input str1 key
B
D
D
E
end
generate value = rnormal()
save df2.dta

* Now create df1 in memory
clear
input str1 key
A
B
C
D
end
generate value = rnormal()

preserve

* Left join
merge 1:n key using df2.dta
keep if _merge == 1

* Right join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 2

* Inner join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 3

* Outer join
restore
merge 1:n key using df2.dta

pandas DataFrames have a merge() method, which provides similar functionality. The data does not have to be sorted
ahead of time, and different join types are accomplished via the how keyword.

In [49]: inner_join = df1.merge(df2, on=["key"], how="inner")

In [50]: inner_join
Out[50]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

In [51]: left_join = df1.merge(df2, on=["key"], how="left")

In [52]: left_join
Out[52]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [53]: right_join = df1.merge(df2, on=["key"], how="right")

In [54]: right_join
Out[54]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

In [55]: outer_join = df1.merge(df2, on=["key"], how="outer")

In [56]: outer_join
Out[56]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

Missing data

Both pandas and Stata have a representation for missing data.


pandas represents missing data with the special float value NaN (not a number). Many of the semantics are the same;
for example missing data propagates through numeric operations, and is ignored by default for aggregations.

In [57]: outer_join
Out[57]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

In [58]: outer_join["value_x"] + outer_join["value_y"]


Out[58]:
0 NaN
1 0.929249
2 NaN
3 -1.308847
4 -1.016424
5 NaN
dtype: float64

In [59]: outer_join["value_x"].sum()
Out[59]: -3.5940742896293765
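
A quick per-column count of missing values (a sketch):

outer_join.isna().sum()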

One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this to
filter missing values.

* Keep missing values
list if value_x == .
* Keep non-missing values
list if value_x != .

In pandas, Series.isna() and Series.notna() can be used to filter the rows.

In [60]: outer_join[outer_join["value_x"].isna()]
Out[60]:
key value_x value_y
5 E NaN -1.044236

In [61]: outer_join[outer_join["value_x"].notna()]
Out[61]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

pandas provides a variety of methods to work with missing data. Here are some examples:

Drop rows with missing values

In [62]: outer_join.dropna()
Out[62]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

Forward fill from previous rows

In [63]: outer_join.fillna(method="ffill")
Out[63]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236


Replace missing values with a specified value

Using the mean:

In [64]: outer_join["value_x"].fillna(outer_join["value_x"].mean())
Out[64]:
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 -1.135632
5 -0.718815
Name: value_x, dtype: float64

GroupBy

Aggregation

Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric columns.

collapse (sum) total_bill tip, by(sex smoker)

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.

In [65]: tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()

In [66]: tips_summed
Out[66]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07

Transformation

In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort with
egen(). For example, to subtract the mean for each observation by smoker group.

bysort sex smoker: egen group_bill = mean(total_bill)
generate adj_total_bill = total_bill - group_bill

pandas provides a Transformation mechanism that allows these type of operations to be succinctly expressed in one
operation.

In [67]: gb = tips.groupby("smoker")["total_bill"]

In [68]: tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")



In [69]: tips
Out[69]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
.. ... ... ... ... ... ... ... ...
182 43.35 3.50 Male Yes Sun Dinner 3 24.593656
156 46.17 5.00 Male No Sun Dinner 6 28.981722
59 46.27 6.73 Male No Sat Dinner 4 29.081722
212 46.33 9.00 Male No Sat Dinner 4 29.141722
170 48.81 10.00 Male Yes Sat Dinner 3 30.053656

[244 rows x 8 columns]

By group processing

In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from Stata. For
example, the following example lists the first observation in the current sort order by sex/smoker group.

bysort sex smoker: list if _n == 1

In pandas this would be written as:

In [70]: tips.groupby(["sex", "smoker"]).first()


Out[70]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344

Other considerations

Disk vs memory

pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded in pandas is
limited by your machine’s memory. If out of core processing is needed, one possibility is the dask.dataframe library,
which provides a subset of pandas functionality for an on-disk DataFrame.
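A minimal sketch of this approach, assuming dask is installed; the data-*.csv file glob and the key/value column names are hypothetical:

import dask.dataframe as dd

# dask.dataframe mirrors a subset of the pandas API but evaluates lazily,
# streaming the data in chunks instead of loading it all into memory
ddf = dd.read_csv("data-*.csv")

# nothing is read until .compute() is called
result = ddf.groupby("key")["value"].sum().compute()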


1.4.5 Community tutorials

This is a guide to many pandas tutorials by the community, geared mainly for new users.

pandas cookbook by Julia Evans

The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started with pandas.
These are examples with real-world data, and all the bugs and weirdness that entails. For the table of contents, see the
pandas-cookbook GitHub repository.

Learn pandas by Hernan Rojas

A set of lessons for new pandas users: https://bitbucket.org/hrojas/learn-pandas

Practical data analysis with Python

This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting open
dataset. There are four sections covering selected topics such as munging data, aggregating data, visualizing data,
and time series.

Exercises for new users

Practice your skills with real data sets and exercises. For more resources, please visit the main repository.

Modern pandas

Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries

Excel charts with pandas, vincent and xlsxwriter

• Using Pandas and XlsxWriter to create Excel charts


Video tutorials

• Pandas From The Ground Up (2015) (2:24) GitHub repo


• Introduction Into Pandas (2016) (1:28) GitHub repo
• Pandas: .head() to .tail() (2016) (1:26) GitHub repo
• Data analysis in Python with pandas (2016-2018) GitHub repo and Jupyter Notebook
• Best practices with pandas (2018) GitHub repo and Jupyter Notebook

Various tutorials

• Wes McKinney’s (pandas BDFL) blog


• Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson
• Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013
• Financial analysis in Python, by Thomas Wiecki
• Intro to pandas data structures, by Greg Reda
• Pandas and Python: Top 10, by Manish Amde
• Pandas DataFrames Tutorial, by Karlijn Willems
• A concise tutorial with real life examples


CHAPTER TWO: USER GUIDE

The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as “working with
missing data”), and discusses how pandas approaches the problem, with many examples throughout.
Users brand-new to pandas should start with 10min.
For a high level summary of the pandas fundamentals, see Intro to data structures and Essential basic functionality.
Further information on any specific method can be obtained in the API reference.

2.1 10 minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.
Customarily, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

2.1.1 Object creation

See the Data Structure Intro section.


Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:


In [5]: dates = pd.date_range("20130101", periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

In [8]: df
Out[8]:
A B C D
2013-01-01 -0.910967 1.341764 -0.175480 0.422155
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-03 -0.736677 0.385875 0.302966 -0.169031
2013-01-04 -0.934749 0.041383 0.325251 -0.139264
2013-01-05 1.026476 0.713828 -0.662758 -1.316092
2013-01-06 2.332797 0.270436 0.077737 0.767321

Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [9]: df2 = pd.DataFrame(
...: {
...: "A": 1.0,
...: "B": pd.Timestamp("20130102"),
...: "C": pd.Series(1, index=list(range(4)), dtype="float32"),
...: "D": np.array([3] * 4, dtype="int32"),
...: "E": pd.Categorical(["test", "train", "test", "train"]),
...: "F": "foo",
...: }
...: )
...:

In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns of the resulting DataFrame have different dtypes.

In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object


If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s
a subset of the attributes that will be completed:

In [12]: df2.<TAB> # noqa: E225, E999


df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.columns
df2.align df2.copy
df2.all df2.count
df2.any df2.combine
df2.append df2.D
df2.apply df2.describe
df2.applymap df2.diff
df2.B df2.duplicated

As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the
attributes have been truncated for brevity.

2.1.2 Viewing data

See the Basics section.


Here is how to view the top and bottom rows of the frame:

In [13]: df.head()
Out[13]:
A B C D
2013-01-01 -0.910967 1.341764 -0.175480 0.422155
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-03 -0.736677 0.385875 0.302966 -0.169031
2013-01-04 -0.934749 0.041383 0.325251 -0.139264
2013-01-05 1.026476 0.713828 -0.662758 -1.316092

In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 -0.934749 0.041383 0.325251 -0.139264
2013-01-05 1.026476 0.713828 -0.662758 -1.316092
2013-01-06 2.332797 0.270436 0.077737 0.767321

Display the index, columns:

In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive
operation when your DataFrame has columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have
one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all
of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [17]: df.to_numpy()
Out[17]:
array([[-0.91096686, 1.34176437, -0.1754796 , 0.42215465],
[ 0.81140767, 0.52573976, 0.36375711, -1.61805727],
[-0.73667718, 0.38587534, 0.30296561, -0.16903089],
[-0.93474874, 0.041383 , 0.3252512 , -0.13926358],
[ 1.02647646, 0.71382781, -0.66275815, -1.31609243],
[ 2.33279727, 0.27043643, 0.07773689, 0.76732138]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)

Note: DataFrame.to_numpy() does not include the index or column labels in the output.

describe() shows a quick statistic summary of your data:

In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.264715 0.546504 0.038579 -0.342161
std 1.340138 0.451227 0.399262 0.944746
min -0.934749 0.041383 -0.662758 -1.618057
25% -0.867394 0.299296 -0.112175 -1.029327
50% 0.037365 0.455808 0.190351 -0.154147
75% 0.972709 0.666806 0.319680 0.281800
max 2.332797 1.341764 0.363757 0.767321

Transposing your data:

In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -0.910967 0.811408 -0.736677 -0.934749 1.026476 2.332797
B 1.341764 0.525740 0.385875 0.041383 0.713828 0.270436
C -0.175480 0.363757 0.302966 0.325251 -0.662758 0.077737
D 0.422155 -1.618057 -0.169031 -0.139264 -1.316092 0.767321

Sorting by an axis:


In [21]: df.sort_index(axis=1, ascending=False)


Out[21]:
D C B A
2013-01-01 0.422155 -0.175480 1.341764 -0.910967
2013-01-02 -1.618057 0.363757 0.525740 0.811408
2013-01-03 -0.169031 0.302966 0.385875 -0.736677
2013-01-04 -0.139264 0.325251 0.041383 -0.934749
2013-01-05 -1.316092 -0.662758 0.713828 1.026476
2013-01-06 0.767321 0.077737 0.270436 2.332797

Sorting by values:
In [22]: df.sort_values(by="B")
Out[22]:
A B C D
2013-01-04 -0.934749 0.041383 0.325251 -0.139264
2013-01-06 2.332797 0.270436 0.077737 0.767321
2013-01-03 -0.736677 0.385875 0.302966 -0.169031
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-05 1.026476 0.713828 -0.662758 -1.316092
2013-01-01 -0.910967 1.341764 -0.175480 0.422155

2.1.3 Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for
interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and
.iloc.

See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:


In [23]: df["A"]
Out[23]:
2013-01-01 -0.910967
2013-01-02 0.811408
2013-01-03 -0.736677
2013-01-04 -0.934749
2013-01-05 1.026476
2013-01-06 2.332797
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 -0.910967 1.341764 -0.175480 0.422155
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-03 -0.736677 0.385875 0.302966 -0.169031

In [25]: df["20130102":"20130104"]
Out[25]:
A B C D
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-03 -0.736677 0.385875 0.302966 -0.169031
2013-01-04 -0.934749 0.041383 0.325251 -0.139264

Selection by label

See more in Selection by Label.


For getting a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]:
A -0.910967
B 1.341764
C -0.175480
D 0.422155
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [27]: df.loc[:, ["A", "B"]]


Out[27]:
A B
2013-01-01 -0.910967 1.341764
2013-01-02 0.811408 0.525740
2013-01-03 -0.736677 0.385875
2013-01-04 -0.934749 0.041383
2013-01-05 1.026476 0.713828
2013-01-06 2.332797 0.270436

Showing label slicing, both endpoints are included:

In [28]: df.loc["20130102":"20130104", ["A", "B"]]


Out[28]:
A B
2013-01-02 0.811408 0.525740
2013-01-03 -0.736677 0.385875
2013-01-04 -0.934749 0.041383

Reduction in the dimensions of the returned object:

In [29]: df.loc["20130102", ["A", "B"]]


Out[29]:
A 0.811408
B 0.525740
Name: 2013-01-02 00:00:00, dtype: float64


For getting a scalar value:

In [30]: df.loc[dates[0], "A"]


Out[30]: -0.9109668640080292

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], "A"]


Out[31]: -0.9109668640080292

Selection by position

See more in Selection by Position.


Select via the position of the passed integers:

In [32]: df.iloc[3]
Out[32]:
A -0.934749
B 0.041383
C 0.325251
D -0.139264
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

In [33]: df.iloc[3:5, 0:2]


Out[33]:
A B
2013-01-04 -0.934749 0.041383
2013-01-05 1.026476 0.713828

By lists of integer position locations, similar to the NumPy/Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]


Out[34]:
A C
2013-01-02 0.811408 0.363757
2013-01-03 -0.736677 0.302966
2013-01-05 1.026476 -0.662758

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-03 -0.736677 0.385875 0.302966 -0.169031

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]


Out[36]:
B C
2013-01-01 1.341764 -0.175480
2013-01-02 0.525740 0.363757
2013-01-03 0.385875 0.302966
2013-01-04 0.041383 0.325251
2013-01-05 0.713828 -0.662758
2013-01-06 0.270436 0.077737

For getting a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: 0.5257397553870523

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: 0.5257397553870523

Boolean indexing

Using a single column’s values to select data.

In [39]: df[df["A"] > 0]


Out[39]:
A B C D
2013-01-02 0.811408 0.525740 0.363757 -1.618057
2013-01-05 1.026476 0.713828 -0.662758 -1.316092
2013-01-06 2.332797 0.270436 0.077737 0.767321

Selecting values from a DataFrame where a boolean condition is met.

In [40]: df[df > 0]


Out[40]:
A B C D
2013-01-01 NaN 1.341764 NaN 0.422155
2013-01-02 0.811408 0.525740 0.363757 NaN
2013-01-03 NaN 0.385875 0.302966 NaN
2013-01-04 NaN 0.041383 0.325251 NaN
2013-01-05 1.026476 0.713828 NaN NaN
2013-01-06 2.332797 0.270436 0.077737 0.767321

Using the isin() method for filtering:

In [41]: df2 = df.copy()

In [42]: df2["E"] = ["one", "one", "two", "three", "four", "three"]

In [43]: df2
Out[43]:
A B C D E
2013-01-01 -0.910967 1.341764 -0.175480 0.422155 one
2013-01-02 0.811408 0.525740 0.363757 -1.618057 one
2013-01-03 -0.736677 0.385875 0.302966 -0.169031 two
2013-01-04 -0.934749 0.041383 0.325251 -0.139264 three
2013-01-05 1.026476 0.713828 -0.662758 -1.316092 four
2013-01-06 2.332797 0.270436 0.077737 0.767321 three

In [44]: df2[df2["E"].isin(["two", "four"])]


Out[44]:
A B C D E
2013-01-03 -0.736677 0.385875 0.302966 -0.169031 two
2013-01-05 1.026476 0.713828 -0.662758 -1.316092 four

Setting

Setting a new column automatically aligns the data by the indexes.

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64

In [47]: df["F"] = s1

Setting values by label:

In [48]: df.at[dates[0], "A"] = 0

Setting values by position:

In [49]: df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [50]: df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations.

In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -0.175480 5 NaN
2013-01-02 0.811408 0.525740 0.363757 5 1.0
2013-01-03 -0.736677 0.385875 0.302966 5 2.0
2013-01-04 -0.934749 0.041383 0.325251 5 3.0
2013-01-05 1.026476 0.713828 -0.662758 5 4.0
2013-01-06 2.332797 0.270436 0.077737 5 5.0

A where operation with setting.


In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -0.175480 -5 NaN
2013-01-02 -0.811408 -0.525740 -0.363757 -5 -1.0
2013-01-03 -0.736677 -0.385875 -0.302966 -5 -2.0
2013-01-04 -0.934749 -0.041383 -0.325251 -5 -3.0
2013-01-05 -1.026476 -0.713828 -0.662758 -5 -4.0
2013-01-06 -2.332797 -0.270436 -0.077737 -5 -5.0

2.1.4 Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See
the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

In [56]: df1.loc[dates[0] : dates[1], "E"] = 1

In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.175480 5 NaN 1.0
2013-01-02 0.811408 0.525740 0.363757 5 1.0 1.0
2013-01-03 -0.736677 0.385875 0.302966 5 2.0 NaN
2013-01-04 -0.934749 0.041383 0.325251 5 3.0 NaN

To drop any rows that have missing data.

In [58]: df1.dropna(how="any")
Out[58]:
A B C D F E
2013-01-02 0.811408 0.52574 0.363757 5 1.0 1.0

Filling missing data.

In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.175480 5 5.0 1.0
2013-01-02 0.811408 0.525740 0.363757 5 1.0 1.0
2013-01-03 -0.736677 0.385875 0.302966 5 2.0 5.0
2013-01-04 -0.934749 0.041383 0.325251 5 3.0 5.0

To get the boolean mask where values are nan.


In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

2.1.5 Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data.


Performing a descriptive statistic:
In [61]: df.mean()
Out[61]:
A 0.416543
B 0.322877
C 0.038579
D 5.000000
F 3.000000
dtype: float64

Same operation on the other axis:


In [62]: df.mean(1)
Out[62]:
2013-01-01 1.206130
2013-01-02 1.540181
2013-01-03 1.390433
2013-01-04 1.486377
2013-01-05 2.015509
2013-01-06 2.536194
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically
broadcasts along the specified dimension.

In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis="index")


Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.736677 -0.614125 -0.697034 4.0 1.0
2013-01-04 -3.934749 -2.958617 -2.674749 2.0 0.0
2013-01-05 -3.973524 -4.286172 -5.662758 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN

Apply

Applying functions to the data:

In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -0.175480 5 NaN
2013-01-02 0.811408 0.525740 0.188278 10 1.0
2013-01-03 0.074730 0.911615 0.491243 15 3.0
2013-01-04 -0.860018 0.952998 0.816494 20 6.0
2013-01-05 0.166458 1.666826 0.153736 25 10.0
2013-01-06 2.499255 1.937262 0.231473 30 15.0

In [67]: df.apply(lambda x: x.max() - x.min())


Out[67]:
A 3.267546
B 0.713828
C 1.026515
D 0.000000
F 4.000000
dtype: float64

Histogramming

See more at Histogramming and Discretization.

In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]:
0 3
1 1
2 0
3 4
4 2
5 2
6 3
7 5
8 2
9 5
dtype: int64

In [70]: s.value_counts()
Out[70]:
2 3
3 2
5 2
1 1
0 1
4 1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each
element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions
by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
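
To see the regular-expression behavior mentioned above, a minimal sketch using contains() on the same Series (the pattern is illustrative):

# regex anchored at the start of each string; na=False fills the missing entry
s.str.contains("^[ab]", case=False, regex=True, na=False)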

2.1.6 Merge

Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set
logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]:
0 1 2 3
0 -1.003793 -1.461961 -1.416173 -0.454477
1 -0.573005 -0.329866 0.589381 -1.983939
2 -1.152335 1.546638 0.402839 -0.007463
3 -0.655220 -0.294104 0.426159 0.967863
4 0.624757 0.832861 0.667867 -1.313268
5 -0.365050 1.231649 1.893172 -0.505329
6 -0.511227 0.608083 0.649903 -0.925380
7 1.123615 1.844237 2.142172 -1.507573
8 0.912026 0.759643 0.962315 -0.683789
9 0.466999 0.716297 -0.171417 -0.341965

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 -1.003793 -1.461961 -1.416173 -0.454477
1 -0.573005 -0.329866 0.589381 -1.983939
2 -1.152335 1.546638 0.402839 -0.007463
3 -0.655220 -0.294104 0.426159 0.967863
4 0.624757 0.832861 0.667867 -1.313268
5 -0.365050 1.231649 1.893172 -0.505329
6 -0.511227 0.608083 0.649903 -0.925380
7 1.123615 1.844237 2.142172 -1.507573
8 0.912026 0.759643 0.962315 -0.683789
9 0.466999 0.716297 -0.171417 -0.341965

Note: Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be
expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a
DataFrame by iteratively appending records to it. See Appending to dataframe for more.
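
A minimal sketch of the recommended pattern (the column names and values are illustrative):

import pandas as pd

# accumulate plain Python records first ...
records = [{"A": i, "B": i ** 2} for i in range(1000)]

# ... then construct the DataFrame once, instead of appending row by row
df_records = pd.DataFrame(records)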

Join

SQL style merges. See the Database style joining section.

In [77]: left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

In [78]: right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2

In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5

In [81]: pd.merge(left, right, on="key")


Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

Another example that can be given is:

In [82]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

In [83]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2

In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5

In [86]: pd.merge(left, right, on="key")


Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5

2.1.7 Grouping

By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.

In [87]: df = pd.DataFrame(
....: {
....: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
....: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
....: "C": np.random.randn(8),
....: "D": np.random.randn(8),
....: }
....: )
....:

In [88]: df
Out[88]:
A B C D
0 foo one 0.193042 -0.763031
1 bar one 1.169312 -0.392259
2 foo two -0.674569 -0.663556
3 bar three 0.803945 -0.523608
4 foo two -1.020884 0.517510
5 bar two 0.233905 -0.582940
6 foo one 0.595408 -1.770948
7 foo three 0.019054 0.147643

Grouping and then applying the sum() function to the resulting groups.

In [89]: df.groupby("A").sum()
Out[89]:
C D
A
bar 2.207162 -1.498808
foo -0.887948 -2.532383

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

In [90]: df.groupby(["A", "B"]).sum()


Out[90]:
C D
A B
bar one 1.169312 -0.392259
three 0.803945 -0.523608
two 0.233905 -0.582940
foo one 0.788450 -2.533980
three 0.019054 0.147643
two -1.695453 -0.146046


2.1.8 Reshaping

See the sections on Hierarchical Indexing and Reshaping.

Stack

In [91]: tuples = list(
....: zip(
....: *[
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....: )
....: )
....:

In [92]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [93]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

In [94]: df2 = df[:4]

In [95]: df2
Out[95]:
A B
first second
bar one -0.252449 -0.186979
two 0.703702 1.386883
baz one 0.536893 2.096205
two 2.021030 -0.518601

The stack() method “compresses” a level in the DataFrame’s columns.

In [96]: stacked = df2.stack()

In [97]: stacked
Out[97]:
first second
bar one A -0.252449
B -0.186979
two A 0.703702
B 1.386883
baz one A 0.536893
B 2.096205
two A 2.021030
B -0.518601
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is
unstack(), which by default unstacks the last level:
In [98]: stacked.unstack()
Out[98]:
A B
first second
bar one -0.252449 -0.186979
two 0.703702 1.386883
baz one 0.536893 2.096205
two 2.021030 -0.518601

In [99]: stacked.unstack(1)
Out[99]:
second one two
first
bar A -0.252449 0.703702
B -0.186979 1.386883
baz A 0.536893 2.021030
B 2.096205 -0.518601

In [100]: stacked.unstack(0)
Out[100]:
first bar baz
second
one A -0.252449 0.536893
B -0.186979 2.096205
two A 0.703702 2.021030
B 1.386883 -0.518601

Pivot tables

See the section on Pivot Tables.

In [101]: df = pd.DataFrame(
.....: {
.....: "A": ["one", "one", "two", "three"] * 3,
.....: "B": ["A", "B", "C"] * 4,
.....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
.....: "D": np.random.randn(12),
.....: "E": np.random.randn(12),
.....: }
.....: )
.....:

In [102]: df
Out[102]:
A B C D E
0 one A foo 0.405648 -1.059080
1 one B foo -1.116261 0.917092
2 two C foo -0.030795 1.076313
3 three A bar 0.381455 0.284642
4 one B bar -1.649931 -0.390057
5 one C bar -1.938784 -0.103901
6 two A foo -0.290505 0.893362
7 three B foo 1.261707 -0.419065
8 one C foo 0.988838 -0.068604
9 one A bar -0.808601 -1.063978
10 two B bar -0.454453 0.065181
11 three C bar 0.158073 0.619186

We can produce pivot tables from this data very easily:

In [103]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])


Out[103]:
C bar foo
A B
one A -0.808601 0.405648
B -1.649931 -1.116261
C -1.938784 0.988838
three A 0.381455 NaN
B NaN 1.261707
C 0.158073 NaN
two A NaN -0.290505
B -0.454453 NaN
C NaN -0.030795

2.1.9 Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency
conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to,
financial applications. See the Time Series section.

In [104]: rng = pd.date_range("1/1/2012", periods=100, freq="S")

In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [106]: ts.resample("5Min").sum()
Out[106]:
2012-01-01 26186
Freq: 5T, dtype: int64

Time zone representation:

In [107]: rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")

In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [109]: ts
Out[109]:
2012-03-06 -1.832697
2012-03-07 -0.474224
2012-03-08 -0.080857
2012-03-09 1.644925
2012-03-10 -0.889834
Freq: D, dtype: float64

In [110]: ts_utc = ts.tz_localize("UTC")

In [111]: ts_utc
Out[111]:
2012-03-06 00:00:00+00:00 -1.832697
2012-03-07 00:00:00+00:00 -0.474224
2012-03-08 00:00:00+00:00 -0.080857
2012-03-09 00:00:00+00:00 1.644925
2012-03-10 00:00:00+00:00 -0.889834
Freq: D, dtype: float64

Converting to another time zone:

In [112]: ts_utc.tz_convert("US/Eastern")
Out[112]:
2012-03-05 19:00:00-05:00 -1.832697
2012-03-06 19:00:00-05:00 -0.474224
2012-03-07 19:00:00-05:00 -0.080857
2012-03-08 19:00:00-05:00 1.644925
2012-03-09 19:00:00-05:00 -0.889834
Freq: D, dtype: float64

Converting between time span representations:

In [113]: rng = pd.date_range("1/1/2012", periods=5, freq="M")

In [114]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [115]: ts
Out[115]:
2012-01-31 0.183892
2012-02-29 1.067580
2012-03-31 -0.598739
2012-04-30 -1.168435
2012-05-31 0.993823
Freq: M, dtype: float64

In [116]: ps = ts.to_period()

In [117]: ps
Out[117]:
2012-01 0.183892
2012-02 1.067580
2012-03 -0.598739
2012-04 -1.168435
2012-05 0.993823
Freq: M, dtype: float64

In [118]: ps.to_timestamp()
Out[118]:
2012-01-01 0.183892
2012-02-01 1.067580
2012-03-01 -0.598739
2012-04-01 -1.168435
2012-05-01 0.993823
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following
example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the
quarter end:

In [119]: prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")

In [120]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [121]: ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9

In [122]: ts.head()
Out[122]:
1990-03-01 09:00 1.404900
1990-06-01 09:00 -0.977839
1990-09-01 09:00 -0.834893
1990-12-01 09:00 1.492798
1991-03-01 09:00 0.850050
Freq: H, dtype: float64

2.1.10 Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API
documentation.

In [123]: df = pd.DataFrame(
.....: {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
.....: )
.....:

Convert the raw grades to a categorical data type.

In [124]: df["grade"] = df["raw_grade"].astype("category")

In [125]: df["grade"]
Out[125]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!).


In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new
Series by default).

In [127]: df["grade"] = df["grade"].cat.set_categories(


.....: ["very bad", "bad", "medium", "good", "very good"]
.....: )
.....:

In [128]: df["grade"]
Out[128]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order.

In [129]: df.sort_values(by="grade")
Out[129]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

Grouping by a categorical column also shows empty categories.

In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64


2.1.11 Plotting

See the Plotting docs.


We use the standard convention for referencing the matplotlib API:

In [131]: import matplotlib.pyplot as plt

In [132]: plt.close("all")

The close() method is used to close a figure window.

In [133]: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

In [134]: ts = ts.cumsum()

In [135]: ts.plot();

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [136]: df = pd.DataFrame(
.....: np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
.....: )
.....:

In [137]: df = df.cumsum()

In [138]: plt.figure();

In [139]: df.plot();

In [140]: plt.legend(loc='best');

2.1.12 Getting data in/out

CSV

Writing to a csv file.

In [141]: df.to_csv("foo.csv")

Reading from a csv file.


In [142]: pd.read_csv("foo.csv")
Out[142]:
Unnamed: 0 A B C D
0 2000-01-01 0.925749 -0.853542 -1.211573 0.422463
1 2000-01-02 1.402389 0.674441 -0.446843 1.643758
2 2000-01-03 2.228604 0.739665 1.208626 1.855920
3 2000-01-04 0.736588 3.417921 0.108200 0.772436
4 2000-01-05 2.160919 1.976578 -0.625528 0.746988
.. ... ... ... ... ...
995 2002-09-22 -20.451525 7.771058 0.978014 8.029402
996 2002-09-23 -22.241962 7.704241 1.331267 7.343168
997 2002-09-24 -22.098180 8.064968 2.541076 8.391860
998 2002-09-25 -20.586567 9.483943 5.030084 9.670501
999 2002-09-26 -19.663841 10.256692 6.448441 9.121815

[1000 rows x 5 columns]

HDF5

Reading and writing to HDFStores.


Writing to a HDF5 Store.

In [143]: df.to_hdf("foo.h5", "df")

Reading from a HDF5 Store.

In [144]: pd.read_hdf("foo.h5", "df")


Out[144]:
A B C D
2000-01-01 0.925749 -0.853542 -1.211573 0.422463
2000-01-02 1.402389 0.674441 -0.446843 1.643758
2000-01-03 2.228604 0.739665 1.208626 1.855920
2000-01-04 0.736588 3.417921 0.108200 0.772436
2000-01-05 2.160919 1.976578 -0.625528 0.746988
... ... ... ... ...
2002-09-22 -20.451525 7.771058 0.978014 8.029402
2002-09-23 -22.241962 7.704241 1.331267 7.343168
2002-09-24 -22.098180 8.064968 2.541076 8.391860
2002-09-25 -20.586567 9.483943 5.030084 9.670501
2002-09-26 -19.663841 10.256692 6.448441 9.121815

[1000 rows x 4 columns]


Excel

Reading and writing to MS Excel.


Writing to an excel file.

In [145]: df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an excel file.

In [146]: pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])


Out[146]:
Unnamed: 0 A B C D
0 2000-01-01 0.925749 -0.853542 -1.211573 0.422463
1 2000-01-02 1.402389 0.674441 -0.446843 1.643758
2 2000-01-03 2.228604 0.739665 1.208626 1.855920
3 2000-01-04 0.736588 3.417921 0.108200 0.772436
4 2000-01-05 2.160919 1.976578 -0.625528 0.746988
.. ... ... ... ... ...
995 2002-09-22 -20.451525 7.771058 0.978014 8.029402
996 2002-09-23 -22.241962 7.704241 1.331267 7.343168
997 2002-09-24 -22.098180 8.064968 2.541076 8.391860
998 2002-09-25 -20.586567 9.483943 5.030084 9.670501
999 2002-09-26 -19.663841 10.256692 6.448441 9.121815

[1000 rows x 5 columns]

2.1.13 Gotchas

If you are attempting to perform an operation you might see an exception like:

>>> if pd.Series([False, True, False]):


... print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for an explanation and what to do.


See Gotchas as well.

2.2 Intro to data structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started.
The fundamental behavior about data types, indexing, and axis labeling / alignment apply across all of the objects. To
get started, import NumPy and load pandas into your namespace:

In [1]: import numpy as np

In [2]: import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken
unless done so explicitly by you.


We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in
separate sections.

2.2.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers,
Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to
call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:


• a Python dict
• an ndarray
• a scalar value (like 5)
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values
[0, ..., len(data) - 1].

In [3]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [4]: s
Out[4]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.173215
1 0.119209
2 -1.044236
3 -0.861849
4 -2.104569
dtype: float64

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is
attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are
many instances in computations, like parts of GroupBy, where the index is not used).
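
For example, a minimal sketch of the lazy behavior described above (s_dup is a throwaway name):

s_dup = pd.Series(range(3), index=["a", "a", "b"])

s_dup["a"]  # fine: label lookup supports duplicates and returns both 'a' rows

# s_dup.reindex(["a", "b"])  # raises ValueError: cannot reindex from a duplicate axis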

From dict
Series can be instantiated from dicts:


In [7]: d = {"b": 1, "a": 0, "c": 2}

In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64

Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion
order, if you’re using Python version >= 3.6 and pandas version >= 0.23.
If you’re using Python < 3.6 or pandas < 0.23, and an index is not passed, the Series index will be the lexically ordered
list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a pandas version lower than 0.23, the Series
would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [9]: d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64

In [11]: pd.Series(d, index=["b", "c", "d", "a"])


Out[11]:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Note: NaN (not a number) is the standard missing data marker used in pandas.

From scalar value


If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [12]: pd.Series(5.0, index=["a", "b", "c", "d", "e"])


Out[12]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64


Series is ndarray-like

Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However, operations such
as slicing will also slice the index.
In [13]: s[0]
Out[13]: 0.4691122999071863

In [14]: s[:3]
Out[14]:
a 0.469112
b -0.282863
c -1.509059
dtype: float64

In [15]: s[s > s.median()]


Out[15]:
a 0.469112
e 1.212112
dtype: float64

In [16]: s[[4, 3, 1]]


Out[16]:
e 1.212112
d -1.135632
b -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 3.360575
dtype: float64

Note: We will address array-based indexing like s[[4, 3, 1]] in the section on indexing.

Like a NumPy array, a pandas Series has a dtype.


In [18]: s.dtype
Out[18]: dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in
which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable
integer data type. See dtypes for more.
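
For instance, a Series with the nullable integer dtype carries an ExtensionDtype (s_int is a throwaway name):

s_int = pd.Series([1, 2, None], dtype="Int64")

s_int.dtype  # Int64Dtype(), an ExtensionDtype rather than a NumPy dtype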
If you need the actual array backing a Series, use Series.array.

In [19]: s.array
Out[19]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more
concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a
column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

Even if the Series is backed by a ExtensionArray, Series.to_numpy() will return a NumPy ndarray.

Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [21]: s["a"]
Out[21]: 0.4691122999071863

In [22]: s["e"] = 12.0

In [23]: s
Out[23]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 12.000000
dtype: float64

In [24]: "e" in s
Out[24]: True

In [25]: "f" in s
Out[25]: False

If a label is not contained, an exception is raised:

>>> s["f"]
KeyError: 'f'

Using the get method, a missing label will return None or specified default:

In [26]: s.get("f")

In [27]: s.get("f", np.nan)


Out[27]: nan


See also the section on attribute access.

Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true
when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [28]: s + s
Out[28]:
a 0.938225
b -0.565727
c -3.018117
d -2.271265
e 24.000000
dtype: float64

In [29]: s * 2
Out[29]:
a 0.938225
b -0.565727
c -3.018117
d -2.271265
e 24.000000
dtype: float64

In [30]: np.exp(s)
Out[30]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on
label. Thus, you can write computations without giving consideration to whether the Series involved have the same
labels.

In [31]: s[1:] + s[:-1]


Out[31]:
a NaN
b -0.565727
c -3.018117
d -2.271265
e NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found
in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit
data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data
alignment features of the pandas data structures set pandas apart from the majority of related tools for working with
labeled data.


Note: In general, we chose to make the default result of operations between differently indexed objects yield the union
of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically
important information as part of a computation. You of course have the option of dropping labels with missing data
via the dropna function.
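
For instance, dropping the unaligned labels from the s[1:] + s[:-1] result above:

(s[1:] + s[:-1]).dropna()  # keeps only 'b', 'c' and 'd', where both operands were present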

Name attribute

Series can also have a name attribute:

In [32]: s = pd.Series(np.random.randn(5), name="something")

In [33]: s
Out[33]:
0 -0.494929
1 1.071804
2 0.721555
3 -0.706771
4 -1.039575
Name: something, dtype: float64

In [34]: s.name
Out[34]: 'something'

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as
you will see below.
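
As a quick preview, a minimal sketch (frame is a throwaway example):

frame = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

frame["col1"].name  # 'col1' -- the column label becomes the Series name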
You can rename a Series with the pandas.Series.rename() method.

In [35]: s2 = s.rename("different")

In [36]: s2.name
Out[36]: 'different'

Note that s and s2 refer to different objects.

2.2.2 DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it
like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like
Series, DataFrame accepts many different kinds of input:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass
an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of
Series plus a specific index will discard all data not matching up to the passed index.


If axis labels are not passed, they will be constructed from the input data based on common sense rules.

Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s
insertion order, if you are using Python version >= 3.6 and pandas >= 0.23.
If you are using Python < 3.6 or pandas < 0.23, and columns is not specified, the DataFrame columns will be the
lexically ordered list of dict keys.

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first
be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [37]: d = {
....: "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
....: "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
....: }
....:

In [38]: df = pd.DataFrame(d)

In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0

In [40]: pd.DataFrame(d, index=["d", "b", "a"])


Out[40]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0

In [41]: pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])


Out[41]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN

The row and column labels can be accessed respectively by accessing the index and columns attributes:

Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in
the dict.

In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If
no index is passed, the result will be range(n), where n is the array length.

In [44]: d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

In [46]: pd.DataFrame(d, index=["a", "b", "c", "d"])


Out[46]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0

From structured or record array

This case is handled identically to a dict of arrays.

In [47]: data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "a10")])

In [48]: data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'

In [50]: pd.DataFrame(data, index=["first", "second"])


Out[50]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'

In [51]: pd.DataFrame(data, columns=["C", "A", "B"])


Out[51]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0

Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts

In [52]: data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0

In [54]: pd.DataFrame(data2, index=["first", "second"])


Out[54]:
a b c
first 1 2 NaN
second 5 10 20.0

In [55]: pd.DataFrame(data2, columns=["a", "b"])


Out[55]:
a b
0 1 2
1 5 10

From a dict of tuples

You can automatically create a MultiIndexed frame by passing a tuples dictionary.

In [56]: pd.DataFrame(
....: {
....: ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
....: ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
....: ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
....: ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
....: ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
....: }
....: )
....:
Out[56]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0


From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original
name of the Series (only if no other column name provided).
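
A minimal sketch (ser is a throwaway name):

ser = pd.Series([1, 2, 3], index=["a", "b", "c"], name="ser")

pd.DataFrame(ser)  # a one-column DataFrame; the column is named 'ser'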

From a list of namedtuples

The field names of the first namedtuple in the list determine the columns of the DataFrame. The remaining
namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the DataFrame. If any of
those tuples is shorter than the first namedtuple, then the later columns in the corresponding row are marked as
missing values. If any are longer than the first namedtuple, a ValueError is raised.

In [57]: from collections import namedtuple

In [58]: Point = namedtuple("Point", "x y")

In [59]: pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])


Out[59]:
x y
0 0 0
1 0 3
2 2 3

In [60]: Point3D = namedtuple("Point3D", "x y z")

In [61]: pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])


Out[61]:
x y z
0 0 0 0.0
1 0 3 5.0
2 2 3 NaN

From a list of dataclasses

New in version 1.1.0.


Data Classes, as introduced in PEP 557, can be passed into the DataFrame constructor. Passing a list of dataclasses
is equivalent to passing a list of dictionaries.
Please be aware that all values in the list should be dataclasses; mixing types in the list would result in a TypeError.

In [62]: from dataclasses import make_dataclass

In [63]: Point = make_dataclass("Point", [("x", int), ("y", int)])

In [64]: pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])


Out[64]:
x y
0 0 0
1 0 3
2 2 3

Missing data


Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, we use
np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the data argument to the
DataFrame constructor, and its masked entries will be considered missing.
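
A minimal sketch of both approaches (the column names are illustrative):

# np.nan marks the missing entries directly
pd.DataFrame({"one": [1.0, np.nan], "two": [np.nan, 2.0]})

# or pass a MaskedArray; the masked entry becomes missing in the result
masked = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]], mask=[[False, True], [False, False]])
pd.DataFrame(masked, columns=["one", "two"])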

Alternate constructors

DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like
the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to
'index' in order to use the dict keys as row labels.

In [65]: pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))


Out[65]:
A B
0 1 4
1 2 5
2 3 6

If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column names:

In [66]: pd.DataFrame.from_dict(
....: dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
....: orient="index",
....: columns=["one", "two", "three"],
....: )
....:
Out[66]:
one two three
A 1 2 3
B 4 5 6

DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the
normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured
dtype. For example:

In [67]: data
Out[67]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [68]: pd.DataFrame.from_records(data, index="C")


Out[68]:
A B
C
b'Hello' 1 2.0
b'World' 2 3.0


Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns
works with the same syntax as the analogous dict operations:

In [69]: df["one"]
Out[69]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

In [70]: df["three"] = df["one"] * df["two"]

In [71]: df["flag"] = df["one"] > 2

In [72]: df
Out[72]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False

Columns can be deleted or popped like with a dict:

In [73]: del df["two"]

In [74]: three = df.pop("three")

In [75]: df
Out[75]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False

When inserting a scalar value, it will naturally be propagated to fill the column:

In [76]: df["foo"] = "bar"

In [77]: df
Out[77]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s
index:


In [78]: df["one_trunc"] = df["one"][:2]

In [79]: df
Out[79]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN

You can insert raw ndarrays, but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the
columns:

In [80]: df.insert(1, "bar", df["one"])

In [81]: df
Out[81]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN

Assigning new columns in method chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns
that are potentially derived from existing columns.

In [82]: iris = pd.read_csv("data/iris.data")

In [83]: iris.head()
Out[83]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [84]: iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"]).head()


Out[84]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000

In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated
on the DataFrame being assigned to.


In [85]: iris.assign(sepal_ratio=lambda x: (x["SepalWidth"] / x["SepalLength"])).head()


Out[85]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000

assign always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the
DataFrame at hand. This is common when using assign in a chain of operations. For example, we can limit the
DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:

In [86]: (
....: iris.query("SepalLength > 5")
....: .assign(
....: SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
....: PetalRatio=lambda x: x.PetalWidth / x.PetalLength,
....: )
....: .plot(kind="scatter", x="SepalRatio", y="PetalRatio")
....: )
....:
Out[86]: <AxesSubplot:xlabel='SepalRatio', ylabel='PetalRatio'>


Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the
DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the
ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the
values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be
called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where an expres-
sion later in **kwargs can refer to a column created earlier in the same assign().

In [87]: dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [88]: dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])


Out[88]:
A B C D
0 1 4 5 6
1 2 5 7 9
2 3 6 9 12

In the second expression, x['C'] will refer to the newly created column, that’s equal to dfa['A'] + dfa['B'].


Indexing / selection

The basics of indexing are as follows:

Operation                        Syntax         Result

Select column                    df[col]        Series
Select row by label              df.loc[label]  Series
Select row by integer location   df.iloc[loc]   Series
Slice rows                       df[5:10]       DataFrame
Select rows by boolean vector    df[bool_vec]   DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [89]: df.loc["b"]
Out[89]:
one 2.0
bar 2.0
flag False
foo bar
one_trunc 2.0
Name: b, dtype: object

In [90]: df.iloc[2]
Out[90]:
one 3.0
bar 3.0
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
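
The remaining forms from the table work the same way; a quick sketch against the df built above:

df["one"]           # select a column -> Series
df[1:3]             # slice rows positionally -> rows "b" and "c"
df[df["one"] > 1]   # boolean vector -> rows where "one" exceeds 1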

For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing. We will
address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.

Data alignment and arithmetic

Arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels).
Again, the resulting object will have the union of the column and row labels.

In [91]: df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

In [92]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

In [93]: df + df2
Out[93]:
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the
DataFrame columns, thus broadcasting row-wise. For example:

In [94]: df - df.iloc[0]
Out[94]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2 0.253128 0.829678 0.010026 -1.991234
3 -1.311128 0.054325 -1.724913 -1.620544
4 0.573025 1.500742 -0.676070 1.367331
5 -1.741248 0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282 0.000430
7 -0.743894 0.411013 -0.929563 -0.282386
8 -1.194921 1.320690 0.238224 -1.482644
9 2.293786 1.856228 0.773289 -1.446531
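
To broadcast a Series column-wise instead (matching on the index rather than the columns), the flexible binary methods
accept an axis argument; a minimal sketch:

# subtract column "A" from every column, aligning on the row index
df.sub(df["A"], axis=0)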

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Operations with scalars are just as you would expect:

In [95]: df * 5 + 2
Out[95]:
A B C D
0 3.359299 -0.124862 4.835102 3.381160
1 -3.437003 -1.368449 2.568242 -5.392133
2 4.624938 4.023526 4.885230 -6.575010
3 -3.196342 0.146766 -3.789461 -4.721559
4 6.224426 7.378849 1.454750 10.217815
5 -5.346940 3.785103 -1.373001 -6.884519
6 -2.844569 -4.472618 4.068691 3.383309
7 -0.360173 1.930201 0.187285 1.969232
8 -2.615303 6.478587 6.026220 -4.032059
9 14.828230 9.156280 8.701544 -3.851494

In [96]: 1 / df
Out[96]:
A B C D
0 3.678365 -2.353094 1.763605 3.620145
1 -0.919624 -1.484363 8.799067 -0.676395
2 1.904807 2.470934 1.732964 -0.583090
3 -0.962215 -2.697986 -0.863638 -0.743875
4 1.183593 0.929567 -9.170108 0.608434
5 -0.680555 2.800959 -1.482360 -0.562777
6 -1.032084 -0.772485 2.416988 3.614523
7 -2.118489 -71.634509 -2.758294 -162.507295
8 -1.083352 1.116424 1.241860 -0.828904
9 0.389765 0.698687 0.746097 -0.854483

In [97]: df ** 4
Out[97]:
A B C D
0 0.005462 3.261689e-02 0.103370 5.822320e-03
1 1.398165 2.059869e-01 0.000167 4.777482e+00
2 0.075962 2.682596e-02 0.110877 8.650845e+00
3 1.166571 1.887302e-02 1.797515 3.265879e+00
4 0.509555 1.339298e+00 0.000141 7.297019e+00
5 4.661717 1.624699e-02 0.207103 9.969092e+00
6 0.881334 2.808277e+00 0.029302 5.858632e-03
7 0.049647 3.797614e-08 0.017276 1.433866e-09
8 0.725974 6.437005e-01 0.420446 2.118275e+00
9 43.329821 4.196326e+00 3.227153 1.875802e+00

Boolean operators work as well:

In [98]: df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)

In [99]: df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

In [100]: df1 & df2


Out[100]:
a b
0 False False
1 False True
2 True False

In [101]: df1 | df2


Out[101]:
a b
0 True True
1 True True
2 True True

In [102]: df1 ^ df2


Out[102]:
a b
0 True True
1 True False
2 False True

In [103]: -df1
Out[103]:
a b
0 False True
1 True False
2 False False
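
The ~ operator performs the same elementwise negation and is the more common spelling for boolean frames:

~df1  # identical to -df1 above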


Transposing

To transpose, access the T attribute (or call the transpose() method), similar to an ndarray:

# only show the first 5 rows
In [104]: df[:5].T
Out[104]:
0 1 2 3 4
A 0.271860 -1.087401 0.524988 -1.039268 0.844885
B -0.424972 -0.673690 0.404705 -0.370647 1.075770
C 0.567020 0.113648 0.577046 -1.157892 -0.109050
D 0.276232 -1.478427 -1.715002 -1.344312 1.643563
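
T is simply an accessor for the transpose() method; a quick check:

df[:5].T.equals(df[:5].transpose())  # True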

DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used with no issues on
Series and DataFrame, assuming the data within are numeric:

In [105]: np.exp(df)
Out[105]:
A B C D
0 1.312403 0.653788 1.763006 1.318154
1 0.337092 0.509824 1.120358 0.227996
2 1.690438 1.498861 1.780770 0.179963
3 0.353713 0.690288 0.314148 0.260719
4 2.327710 2.932249 0.896686 5.173571
5 0.230066 1.429065 0.509360 0.169161
6 0.379495 0.274028 1.512461 1.318720
7 0.623732 0.986137 0.695904 0.993865
8 0.397301 2.449092 2.237242 0.299269
9 13.009059 4.183951 3.820223 0.310274

In [106]: np.asarray(df)
Out[106]:
array([[ 0.2719, -0.425 , 0.567 , 0.2762],
[-1.0874, -0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 , -1.715 ],
[-1.0393, -0.3706, -1.1579, -1.3443],
[ 0.8449, 1.0758, -0.109 , 1.6436],
[-1.4694, 0.357 , -0.6746, -1.7769],
[-0.9689, -1.2945, 0.4137, 0.2767],
[-0.472 , -0.014 , -0.3625, -0.0062],
[-0.9231, 0.8957, 0.8052, -1.2064],
[ 2.5656, 1.4313, 1.3403, -1.1703]])

DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model are quite
different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
The ufunc is applied to the underlying array in a Series.

In [107]: ser = pd.Series([1, 2, 3, 4])

In [108]: np.exp(ser)
Out[108]:
0 2.718282
1 7.389056
2 20.085537
3 54.598150
dtype: float64

Changed in version 0.25.0: When multiple Series are passed to a ufunc, they are aligned before performing the
operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with multiple inputs. For
example, using numpy.remainder() on two Series with differently ordered labels will align them before computing.

In [109]: ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [110]: ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

In [111]: ser1
Out[111]:
a 1
b 2
c 3
dtype: int64

In [112]: ser2
Out[112]:
b 1
a 3
c 5
dtype: int64

In [113]: np.remainder(ser1, ser2)


Out[113]:
a 1
b 0
c 3
dtype: int64

As usual, the union of the two indices is taken, and non-overlapping values are filled with missing values.

In [114]: ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])

In [115]: ser3
Out[115]:
b 2
c 4
d 6
dtype: int64

In [116]: np.remainder(ser1, ser3)


Out[116]:
a NaN
b 0.0
c 3.0
d NaN
dtype: float64

When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence and a Series is
returned.

In [117]: ser = pd.Series([1, 2, 3])

In [118]: idx = pd.Index([4, 5, 6])

In [119]: np.maximum(ser, idx)


Out[119]:
0 4
1 5
2 6
dtype: int64

NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example arrays.SparseArray (see
Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
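
For instance, applying a ufunc to a Series backed by a SparseArray keeps the result sparse; a minimal sketch:

import numpy as np
import pandas as pd

sparse = pd.Series(pd.arrays.SparseArray([1.0, 0.0, -2.0], fill_value=0.0))
np.abs(sparse)        # still backed by a SparseArray
np.abs(sparse).dtype  # Sparse[float64, 0.0]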

Console display

Very large DataFrames will be truncated when displayed in the console. You can also get a summary using info().
(Here I am reading a CSV version of the baseball dataset from the plyr R package):

In [120]: baseball = pd.read_csv("data/baseball.csv")

In [121]: print(baseball)
       id     player  year  stint team  lg    g   ab   r    h  X2b  X3b  hr   rbi   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL   19   50   6   14    1    0   1   2.0  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL   31    2   0    1    0    0   0   0.0  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ...  ...  ..  ...  ...  ...  ..   ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1  13  49.0  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0   0   0.0  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [122]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100 non-null int64
1 player 100 non-null object
2 year 100 non-null int64
3 stint 100 non-null int64
4 team 100 non-null object
5 lg 100 non-null object
6 g 100 non-null int64
7 ab 100 non-null int64
8 r 100 non-null int64
9 h 100 non-null int64
10 X2b 100 non-null int64
11 X3b 100 non-null int64
12 hr 100 non-null int64
13 rbi 100 non-null float64
14 sb 100 non-null float64
15 cs 100 non-null float64
16 bb 100 non-null int64
17 so 100 non-null float64
18 ibb 100 non-null float64
19 hbp 100 non-null float64
20 sh 100 non-null float64
21 sf 100 non-null float64
22 gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB

However, using to_string will return a string representation of the DataFrame in tabular form, though it won’t always
fit the console width:

In [123]: print(baseball.iloc[-20:, :12].to_string())


id player year stint team lg g ab r h X2b X3b
80 89474 finlest01 2007 1 COL NL 43 94 9 17 3 0
81 89480 embreal01 2007 1 OAK AL 4 0 0 0 0 0
82 89481 edmonji01 2007 1 SLN NL 117 365 39 92 15 2
83 89482 easleda01 2007 1 NYN NL 76 193 24 54 6 0
84 89489 delgaca01 2007 1 NYN NL 139 538 71 139 30 0
85 89493 cormirh01 2007 1 CIN NL 6 0 0 0 0 0
86 89494 coninje01 2007 2 NYN NL 21 41 2 8 2 0
87 89495 coninje01 2007 1 CIN NL 80 215 23 57 11 1
88 89497 clemero02 2007 1 NYA AL 2 2 0 1 0 0
89 89498 claytro01 2007 2 BOS AL 8 6 1 0 0 0
90 89499 claytro01 2007 1 TOR AL 69 189 23 48 14 0
91 89501 cirilje01 2007 2 ARI NL 28 40 6 8 4 0
92 89502 cirilje01 2007 1 MIN AL 50 153 18 40 9 2
93 89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 0
94 89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 3
95 89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0
96 89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0
97 89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0

Wide DataFrames will be printed across multiple rows by default:


In [124]: pd.DataFrame(np.random.randn(3, 12))


Out[124]:
          0         1         2         3         4         5         6         7         8         9        10        11
0 -1.226825  0.769804 -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734  0.959726 -1.110336 -0.619976  0.149748
1 -0.732339  0.687738  0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849 -0.954208  1.462696 -1.743161 -0.826591
2 -0.345352  1.314232  0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602 -0.082240

You can change how much to print on a single row by setting the display.width option:

In [125]: pd.set_option("display.width", 40) # default is 80

In [126]: pd.DataFrame(np.random.randn(3, 12))


Out[126]:
          0         1         2         3         4         5         6         7         8         9        10        11
0 -2.182937  0.380396  0.084844  0.432390  1.519970 -0.493662  0.600178  0.274230  0.132885 -0.023688  2.410179  1.450520
1  0.206053 -0.251905 -2.213588  1.063327  1.266143  0.299368 -0.863838  0.408204 -1.048089 -0.025747 -0.988387  0.094055
2  1.262731  1.289997  0.082423 -0.055758  0.536580 -0.489682  0.369374 -0.034571 -2.484478 -0.281461  0.030711  0.109121

You can adjust the max width of the individual columns by setting display.max_colwidth:

In [127]: datafile = {
.....: "filename": ["filename_01", "filename_02"],
.....: "path": [
.....: "media/user_name/storage/folder_01/filename_01",
.....: "media/user_name/storage/folder_02/filename_02",
.....: ],
.....: }
.....:

In [128]: pd.set_option("display.max_colwidth", 30)

In [129]: pd.DataFrame(datafile)
Out[129]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...

In [130]: pd.set_option("display.max_colwidth", 100)

In [131]: pd.DataFrame(datafile)
Out[131]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02

You can also disable this feature via the expand_frame_repr option; setting it to False will print the table in one block.
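
A quick sketch:

# repr the frame in one block, however wide it is
pd.set_option("display.expand_frame_repr", False)
pd.DataFrame(np.random.randn(3, 12))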


DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

In [132]: df = pd.DataFrame({"foo1": np.random.randn(5), "foo2": np.random.randn(5)})

In [133]: df
Out[133]:
foo1 foo2
0 1.126203 0.781836
1 -0.977349 -1.071357
2 1.474071 0.441153
3 -0.064034 2.353925
4 -1.282782 0.583787

In [134]: df.foo1
Out[134]:
0 1.126203
1 -0.977349
2 1.474071
3 -0.064034
4 -1.282782
Name: foo1, dtype: float64

The columns are also connected to the IPython completion mechanism so they can be tab-completed:

In [5]: df.foo<TAB> # noqa: E225, E999


df.foo1 df.foo2

2.3 Essential basic functionality

Here we discuss a lot of the essential functionality common to the pandas data structures. To begin, let’s create some
example objects like we did in the 10 minutes to pandas section:

In [1]: index = pd.date_range("1/1/2000", periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

2.3.1 Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of
elements to display is five, but you may pass a custom number.

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]:
0 -1.157892