Unit - 1 - Python Pandas
In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for data analysis.
Prior to Pandas, Python was mainly used for data munging and preparation; it
contributed very little to data analysis. Pandas solved this problem. Using
Pandas, we can accomplish five typical steps in the processing and analysis of data,
regardless of the origin of the data — load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields, spanning academic and
commercial domains such as finance, economics, statistics, and analytics.
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
If you install the Anaconda Python package, Pandas will be installed by default. It
works on the following platforms −
Windows
Linux
Package managers of the respective Linux distributions are used to install one or more
packages in the SciPy stack.
Series
DataFrame
Panel
These data structures are built on top of Numpy array, which means they are fast.
Data Structure − Dimensions − Description
Series − 1 − 1D labeled homogeneous array, size immutable.
DataFrame − 2 − General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel − 3 − General 3D labeled, size-mutable array.
Building and handling arrays of two or more dimensions is a tedious task, and a burden is
placed on the user to consider the orientation of the data set when writing functions.
Using Pandas data structures, this mental effort is reduced.
For example, with tabular data (DataFrame) it is more semantically helpful to think
of the index (the rows) and the columns rather than axis 0 and axis 1.
Mutability
All Pandas data structures are value mutable (their values can be changed), and all
except Series are size mutable. Series is size immutable.
Note − DataFrame is widely used and one of the most important data structures.
Panel is used much less.
Series
Series is a one-dimensional array-like structure with homogeneous data. For
example, the following series is a collection of the integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
Key Points
Homogeneous data
Size Immutable
Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,
The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
Heterogeneous data
Size Mutable
Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent the panel in graphical representation. But a panel can be illustrated as a
container of DataFrame.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
pandas.Series
A pandas Series can be created using the following constructor −

pandas.Series(data, index, dtype, copy)

The parameters of the constructor are as follows −
1. data − data takes various forms like ndarray, list, constants.
2. index − Index values must be unique and hashable, same length as data. Default
np.arange(n) if no index is passed.
3. dtype − dtype is for data type. If None, data type will be inferred.
4. copy − Copy data. Default False.
A Series can be created using various inputs like − Array, Dict, Scalar value.
Create a Series from ndarray
Example 1
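A minimal sketch that reproduces the output below (the values are taken from the output) −

import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)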
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0 to
len(data)-1, i.e., 0 to 3.
Example 2
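A sketch for this example, passing a custom index (values taken from the output below) −

import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)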
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed values in
the output.
Create a Series from dict
Example 1
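A sketch consistent with the output below, where the dict keys become the index −

import pandas as pd
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)
print(s)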
a 0.0
b 1.0
c 2.0
dtype: float64
Example 2
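A sketch for this example, passing an index so that values are pulled out of the
dict by label (consistent with the output below) −

import pandas as pd
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)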
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Observe − Index order is persisted and the missing element is filled with NaN (Not
a Number).
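Create a Series from a scalar − if data is a scalar value, an index must be provided,
and the value is repeated to match the length of the index. A sketch that reproduces
the output below −

import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)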
0 5
1 5
2 5
3 5
dtype: int64
Example 1
Retrieve the first element. As we already know, the counting starts from zero for the
array, which means the first element is stored at zeroth position and so on.
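A sketch, assuming the Series used in the following examples (values inferred from
the outputs) −

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Retrieve the first element
print(s[0])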
Example 2
Retrieve the first three elements in the Series. If a : is inserted in front of it, all
items from that index onwards will be extracted. If two parameters (with : between
them) are used, items between the two indexes (not including the stop index) are extracted.
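A sketch producing the output below, using the same Series s −

print(s[:3])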
a 1
b 2
c 3
dtype: int64
Example 3
Retrieve the last three elements.
Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1
Retrieve a single element using the index label value.
Example 2
Retrieve multiple elements using a list of index label values.
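Sketches for these retrievals, using the same Series s −

print(s[-3:])            # Example 3: the last three elements
print(s['a'])            # Example 1: a single element by label (prints 1)
print(s[['a','c','d']])  # Example 2: multiple elements by label (output below)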
a 1
c 3
d 4
dtype: int64
Example 3
If a label is not contained in the index, an exception is raised −
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
…
KeyError: 'f'
Features of DataFrame
Structure
Let us assume that we are creating a data frame with students' data.
You can think of it as an SQL table or a spreadsheet data representation.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −

pandas.DataFrame(data, index, columns, dtype, copy)

The parameters of the constructor are as follows −
1. data − data takes various forms like ndarray, series, map, lists, dict, constants
and also another DataFrame.
2. index − For the row labels, the Index to be used for the resulting frame. Optional.
Default np.arange(n) if no index is passed.
3. columns − For column labels, the optional default is np.arange(n). This is
only true if no column labels are passed.
4. dtype − Data type of each column.
5. copy − Used for copying the data. Default False.
Create DataFrame
A pandas DataFrame can be created using various inputs like −
Lists
dict
Series
Numpy ndarrays
Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame
using these inputs.
Example
Example 1
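A sketch that reproduces the output below, creating a DataFrame from a single list −

import pandas as pd
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
print(df)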
0
0 1
1 2
2 3
3 4
4 5
Example 2
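A sketch for this example, creating a DataFrame from a list of lists with column
labels (values taken from the output below) −

import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)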
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Example 3
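The same sketch with the dtype parameter added (see the note after the output) −

import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'], dtype=float)
print(df)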
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
Note − Observe, the dtype parameter changes the type of Age column to floating
point.
If no index is passed, then by default, index will be range(n), where n is the array
length.
Example 1
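A sketch consistent with the note below (the names and ages are taken from the
Example 2 output) −

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data)
print(df)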
Note − Observe the values 0,1,2,3. They are the default index assigned to each
using the function range(n).
Example 2
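The same sketch with an index of row labels (consistent with the output below) −

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)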
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
Create a DataFrame from List of Dicts
Example 1

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
a b c
0 1 2 NaN
1 5 10 20.0
Example 2
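A sketch for this example, passing row labels (consistent with the output below) −

import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)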
a b c
first 1 2 NaN
second 5 10 20.0
Example 3
The following example shows how to create a DataFrame with a list of dictionaries,
row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
# Column indices same as the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# One column index with a name not present in the dicts
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
#df1 output
a b
first 1 2
second 5 10
#df2 output
a b1
first 1 NaN
second 5 NaN
Note − Observe, df2 DataFrame is created with a column index other than the
dictionary keys; thus, the NaN's are appended in place. Whereas df1 is created with
column indices same as the dictionary keys, so no NaN's are appended.
Example
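The dict d used below is not shown in the source; a definition consistent with the
output is −

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}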
df = pd.DataFrame(d)
print(df)
Its output is as follows −
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note − Observe, for the series one, there is no label 'd' passed, but in the result,
NaN is appended for the d label.
Column Selection
We will understand this by selecting a column from the DataFrame.
Example
df = pd.DataFrame(d)
print(df['one'])
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
Column Addition
We will understand this by adding a new column to an existing data frame.
Example
df = pd.DataFrame(d)
# Adding a new column by passing a Series (values assumed for illustration)
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)
Column Deletion
Columns can be deleted or popped; let us take an example to understand how.
Example
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# Deleting a column using del
del df['one']
print(df)
Selection by Label
Rows can be selected by passing the row label to the loc function.

df = pd.DataFrame(d)
print(df.loc['b'])
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the Name
of the series is the label with which it is retrieved.
Selection by Integer Location
Rows can be selected by passing an integer location to the iloc function.

df = pd.DataFrame(d)
print(df.iloc[2])
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using the ':' operator.

df = pd.DataFrame(d)
print(df[2:4])
one two
c 3.0 3
d NaN 4
Addition of Rows
Add new rows to a DataFrame using the append function. This function will append
the rows at the end.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = df.append(df2)
print(df)
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
Use index label to delete or drop rows from a DataFrame. If label is duplicated, then
multiple rows will be dropped.
If you observe, in the above example, the labels are duplicate. Let us drop a label
and will see how many rows will get dropped.
df = df.append(df2)
# Drop rows with label 0
df = df.drop(0)
print(df)
In the above example, two rows were dropped because those two contain the same
label 0.
The names for the 3 axes are intended to give some semantic meaning to describing
operations involving panel data. They are −
pandas.Panel()
A Panel can be created using the following constructor −
The parameters of the Panel constructor are as follows −
data − Data takes various forms like ndarray, series, map, lists, dict, constants
and also another DataFrame.
items − axis=0
major_axis − axis=1
minor_axis − axis=2
Create Panel
A Panel can be created using multiple ways like −
From ndarrays
From 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2, 4, 5)
p = pd.Panel(data)
print(p)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Create an Empty Panel
An empty panel can be created as shown below −
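A sketch (note that Panel has been removed in recent pandas versions; these
examples require an older release) −

import pandas as pd
p = pd.Panel()
print(p)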
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
Note − Observe the dimensions of the empty panel and the earlier panel; all the
objects are different.
Select the data from the panel using −
Items
Major_axis
Minor_axis
Using Items
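A sketch consistent with the outputs in this section (Item1 holds 4x3 values and
Item2 holds 4x2, so its third column appears as NaN) −

import pandas as pd
import numpy as np
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
   'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])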
0 1 2
0 0.488224 -0.128637 0.930817
1 0.417497 0.896681 0.576657
2 -2.775266 0.571668 0.290082
3 -0.400538 -0.144234 1.110535
We have two items, and we retrieved item1. The result is a DataFrame with 4 rows
and 3 columns, which are the Major_axis and Minor_axis dimensions.
Using major_axis
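Data can be accessed using the method panel.major_xs(index); a sketch consistent
with the output below −

print(p.major_xs(1))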
Item1 Item2
0 0.417497 0.748412
1 0.896681 -0.557322
2 0.576657 NaN
Using minor_axis
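Data can be accessed using the method panel.minor_xs(index); a sketch consistent
with the output below −

print(p.minor_xs(1))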
Item1 Item2
0 -0.128637 -1.047032
1 0.896681 -0.557322
2 0.571668 0.431953
3 -0.144234 1.302466
The attributes and methods of a Series include −
1. axes − Returns a list of the row axis labels.
2. dtype − Returns the dtype of the object.
3. empty − Returns True if the series is empty.
4. ndim − Returns the number of dimensions of the underlying data, by definition 1.
5. size − Returns the number of elements in the underlying data.
6. values − Returns the Series as ndarray.
7. head() − Returns the first n rows.
8. tail() − Returns the last n rows.
Let us now create a Series and see all the above tabulated attributes operation.
Example
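A sketch that creates the Series shown below −

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print(s)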
0 0.967853
1 -0.148368
2 -1.395906
3 -1.758394
dtype: float64
axes
Returns the list of the labels of the series.
The above result is a compact format of the list of values from 0 to 4 (exclusive), i.e., [0,1,2,3].
empty
Returns the Boolean value saying whether the Object is empty or not. True indicates
that the object is empty.
ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data
structure, so it returns 1.
0 0.175898
1 0.166197
2 -0.609712
3 -1.377000
dtype: float64
size
Returns the size (length) of the series.
values
Returns the actual data in the series as an array.
0 1.787373
1 -0.605159
2 0.180477
3 -0.140922
dtype: float64
To view a small sample of a Series or the DataFrame object, use the head() and the
tail() methods.
head() returns the first n rows(observe the index values). The default number of
elements to display is five, but you may pass a custom number.
import pandas as pd
import numpy as np
tail() returns the last n rows(observe the index values). The default number of
elements to display is five, but you may pass a custom number.
The attributes and methods of a DataFrame include −
1. T − Transposes rows and columns.
2. axes − Returns a list with the row axis labels and column axis labels as the only
members.
3. dtypes − Returns the dtypes in this object.
4. empty − True if NDFrame is entirely empty (no items), that is, if any of the axes
are of length 0.
5. ndim − Number of axes / array dimensions.
6. shape − Returns a tuple representing the dimensionality of the DataFrame.
7. size − Number of elements in the NDFrame.
8. values − Numpy representation of NDFrame.
9. head() − Returns the first n rows.
10. tail() − Returns the last n rows.
Let us now create a DataFrame and see how the above mentioned attributes
operate.
Example
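The dict d used in the following examples is not shown in the source; a definition
consistent with the outputs later in this unit is −

import pandas as pd
import numpy as np
d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age': pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}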
#Create a DataFrame
df = pd.DataFrame(d)
print("Our data series is:")
print(df)
Returns the transpose of the DataFrame. The rows and columns will interchange.
# Create a DataFrame
df = pd.DataFrame(d)
print("The transpose of the data series is:")
print(df.T)
axes
Returns the list of row axis labels and column axis labels.
#Create a DataFrame
df = pd.DataFrame(d)
print("Row axis labels and column axis labels are:")
print(df.axes)
dtypes
Returns the data type of each column.

#Create a DataFrame
df = pd.DataFrame(d)
print("The data types of each column are:")
print(df.dtypes)
empty
Returns the Boolean value saying whether the Object is empty or not; True indicates
that the object is empty.
#Create a DataFrame
df = pd.DataFrame(d)
print("Is the object empty?")
print(df.empty)
ndim
Returns the number of dimensions of the object.

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The dimension of the object is:")
print(df.ndim)
shape
Returns a tuple representing the dimensionality of the object.

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The shape of the object is:")
print(df.shape)
size
Returns the number of elements in the object.

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The total number of elements in our object is:")
print(df.size)
values
Returns the actual data in the object as an ndarray.

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The actual data in our data frame is:")
print(df.values)
To view a small sample of a DataFrame object, use the head() and tail() methods.
head() returns the first n rows (observe the index values). The default number of
elements to display is five, but you may pass a custom number.
#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The first two rows of the data frame are:")
print(df.head(2))
tail() returns the last n rows (observe the index values). The default number of
elements to display is five, but you may pass a custom number.
#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The last two rows of the data frame are:")
print(df.tail(2))
Let us create a DataFrame and use this object throughout this chapter for all the
operations.
Example
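The dict d is the same as before; its Age and Rating values are consistent with the
statistics shown in the outputs below −

import pandas as pd
import numpy as np
d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age': pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}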
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
sum()
Returns the sum of the values for the requested axis. By default, axis is index
(axis=0).
#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
axis=1
This syntax will give the output as shown below.
#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum(1))
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Returns the average value.

#Create a DataFrame
df = pd.DataFrame(d)
print(df.mean())
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the standard deviation of the numerical columns.

#Create a DataFrame
df = pd.DataFrame(d)
print(df.std())
Functions like sum() and cumsum() work with both numeric and character (or)
string data elements without any error. Though in practice character
aggregations are rarely used, these functions do not throw any exception.
Functions like abs() and cumprod() throw an exception when the DataFrame
contains character or string data, because such operations cannot be
performed.
Summarizing Data
The describe() function computes a summary of statistics pertaining to the
DataFrame columns.
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function gives the mean, std and IQR values. It excludes the character columns
and gives a summary about the numeric columns. 'include' is the argument used to
pass information about which columns need to be considered for summarizing. It
takes a list of values; by default, 'number'.
object − Summarizes String columns
number − Summarizes Numeric columns
all − Summarizes all columns together (should not be passed as a list value)
Now, use the following statement in the program and check the output −
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))
Name
count 12
unique 12
top Ricky
freq 1
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))
Table-wise Function Application: pipe()
Custom operations can be performed on a whole DataFrame by passing a function
and an appropriate number of parameters to pipe(). For example, consider an adder
function that adds two numeric values as parameters and returns the sum.
def adder(ele1, ele2):
   return ele1 + ele2

We will now use the custom function to conduct an operation on the DataFrame,
passing it to pipe() −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
print(df.pipe(adder, 2))
Row or Column Wise Function Application: apply()
Arbitrary functions can be applied along the axes of a DataFrame using the apply()
method. By default, the operation performs column wise, taking each column as an
array-like.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
print(df.apply(np.mean))
col1 -0.288022
col2 1.044839
col3 -0.187009
dtype: float64
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
df.apply(np.mean, axis=1)
print(df.apply(np.mean))
col1 0.034093
col2 -0.152672
col3 -0.229728
dtype: float64
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
df.apply(lambda x: x.max() - x.min())
print(df.apply(np.mean))
col1 -0.167413
col2 -0.370495
col3 -0.707631
dtype: float64
Element Wise Function Application
Not all functions can be vectorized; the methods applymap() on DataFrame and
map() on Series accept any Python function taking a single value and returning a
single value.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
# Custom function applied element wise on one column
df['col1'].map(lambda x: x*100)
print(df.apply(np.mean))
col1 0.480742
col2 0.454185
col3 0.266563
dtype: float64
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
# Custom function applied element wise on the whole DataFrame
df.applymap(lambda x: x*100)
print(df.apply(np.mean))
col1 0.395263
col2 0.204418
col3 -0.795188
dtype: float64
Reindexing
Reindexing changes the row labels and column labels of a DataFrame. To reindex
means to conform the data to match a given set of labels along a particular axis.
Example
import pandas as pd
import numpy as np
N = 20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
   'x': np.linspace(0, stop=N-1, num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'], N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
# Reindex the DataFrame
df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
print(df_reindexed)
A C B
0 2016-01-01 Low NaN
2 2016-01-03 High NaN
5 2016-01-06 Low NaN
Reindex to Align with Other Objects
You may wish to take an object and reindex its axes to be labeled the same as
another object. The reindex_like() method may be used.
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10,3), columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3), columns=['col1','col2','col3'])
df1 = df1.reindex_like(df2)
print(df1)
Note − Here, the df1 DataFrame is altered and reindexed like df2. The column
names should match, otherwise NaN will be added for the entire column label.
Filling while Reindexing
reindex() takes an optional parameter method, which is a filling method −
pad/ffill (fill values forward), bfill/backfill (fill values backward), nearest (fill from
the nearest index values).
Example

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3), columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3), columns=['col1','col2','col3'])
# Padding NaN's forward
print(df2.reindex_like(df1, method='ffill'))
Limits on Filling while Reindexing
The limit argument provides additional control over filling: it specifies the
maximum count of consecutive fills.
Example

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3), columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3), columns=['col1','col2','col3'])
# Padding NaN's forward, limited to one row
print(df2.reindex_like(df1, method='ffill', limit=1))
Note − Observe, only the row following the last valid one is filled from the
preceding row. The remaining rows are left as they are (NaN).
Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict
or Series) or an arbitrary function.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3), columns=['col1','col2','col3'])
print(df1)
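A sketch of rename() usage (the new labels here are assumed for illustration) −

print("After renaming the rows and columns:")
print(df1.rename(columns={'col1': 'c1', 'col2': 'c2'},
   index={0: 'apple', 1: 'banana', 2: 'durian'}))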
When iterating over a Series, it is regarded as array-like, and basic iteration produces the values.
Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to
understand the same.
import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
'x': np.linspace(0,stop=N-1,num=N),
'y': np.random.rand(N),
'C': np.random.choice(['Low','Medium','High'],N).tolist(),
'D': np.random.normal(100, 10, size=(N)).tolist()
})
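The loop that produces the column names below is presumably −

for col in df:
   print(col)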
A
C
D
x
y
To iterate over the rows of the DataFrame, we can use the following functions −
iteritems()
Iterates over each column as key, value pair with label as key and column value as a
Series object.
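A sketch consistent with the output below (note that iteritems() was renamed
items() in recent pandas versions) −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3), columns=['col1','col2','col3'])
for key, value in df.iteritems():
   print(key, value)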
col1 0 0.802390
1 0.324060
2 0.256811
3 0.839186
Name: col1, dtype: float64
col2 0 1.624313
1 -1.033582
2 1.796663
3 1.856277
Name: col2, dtype: float64
col3 0 -0.022142
1 -0.230820
2 1.160691
3 -0.830279
Name: col3, dtype: float64
iterrows()
iterrows() returns the iterator yielding each index value along with a series
containing the data in each row.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3), columns=['col1','col2','col3'])
for row_index, row in df.iterrows():
   print(row_index, row)
1 col1 -0.944087
col2 1.420919
col3 -0.507895
Name: 1, dtype: float64
2 col1 -0.077287
col2 -0.858556
col3 -0.663385
Name: 2, dtype: float64
3 col1 -1.638578
col2 0.059866
col3 0.493482
Name: 3, dtype: float64
Note − Because iterrows() iterates over the rows, it doesn't preserve the data type
across the row. 0,1,2 are the row indices and col1,col2,col3 are column indices.
itertuples()
itertuples() method will return an iterator yielding a named tuple for each row in the
DataFrame. The first element of the tuple will be the row’s corresponding index
value, while the remaining values are the row values.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3), columns=['col1','col2','col3'])
for row in df.itertuples():
   print(row)
Note − Do not try to modify any object while iterating. Iterating is meant for
reading, and the iterator returns a copy of the original object, thus the changes
will not reflect on the original object.
Sorting
There are two kinds of sorting available in Pandas − by label and by actual value.

import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7],
   columns=['col2','col1'])
print(unsorted_df)
col2 col1
1 -2.063177 0.537527
4 0.142932 -0.684884
6 0.012667 -0.389340
2 -0.548797 1.848743
3 -1.044160 0.837381
5 0.385605 1.300185
9 1.031425 -1.002967
8 -0.407374 -0.435142
0 2.237453 -1.067139
7 -1.445831 -1.701035
In unsorted_df, the labels and the values are unsorted. Let us see how these can
be sorted.
By Label
Using the sort_index() method, by passing the axis arguments and the order of
sorting, DataFrame can be sorted. By default, sorting is done on row labels in
ascending order.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7],
   columns=['col2','col1'])
sorted_df = unsorted_df.sort_index()
print(sorted_df)
col2 col1
0 0.208464 0.627037
1 0.641004 0.331352
2 -0.038067 -0.464730
3 -0.638456 -0.021466
4 0.014646 -0.737438
5 -0.290761 -1.669827
6 -0.797303 -0.018737
7 0.525753 1.628921
8 -0.567031 0.775951
9 0.060724 -0.322425
Order of Sorting
By passing the Boolean value to ascending parameter, the order of the sorting can be
controlled. Let us consider the following example to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7],
   columns=['col2','col1'])
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
col2 col1
9 0.825697 0.374463
8 -1.699509 0.510373
7 -0.581378 0.622958
6 -0.202951 0.954300
5 -1.289321 -1.551250
4 1.302561 0.851385
3 -0.157915 -0.388659
2 -1.222295 0.166609
1 0.584890 -0.291048
0 0.668444 -0.061294
Sort the Columns
By passing the axis argument with a value 0 or 1, the sorting can be done on the
column labels. By default, axis=0, sort by row. Let us consider the following example
to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7],
   columns=['col2','col1'])
sorted_df = unsorted_df.sort_index(axis=1)
print(sorted_df)
col1 col2
1 -0.291048 0.584890
4 0.851385 1.302561
6 0.954300 -0.202951
2 0.166609 -1.222295
3 -0.388659 -0.157915
5 -1.551250 -1.289321
9 0.374463 0.825697
8 0.510373 -1.699509
0 -0.061294 0.668444
7 0.622958 -0.581378
By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a
'by' argument which will use the column name of the DataFrame with which the
values are to be sorted.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1], 'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)
col1 col2
1 1 3
2 1 2
3 1 4
0 2 1
Observe, col1 values are sorted and the respective col2 value and row index will
alter along with col1. Thus, they look unsorted.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1], 'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
print(sorted_df)
col1 col2
2 1 2
1 1 3
3 1 4
0 2 1
Sorting Algorithm
sort_values() provides a provision to choose the algorithm from mergesort,
heapsort and quicksort. Mergesort is the only stable algorithm.
import pandas as pd
unsorted_df = pd.DataFrame({'col1':[2,1,1,1], 'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
print(sorted_df)
col1 col2
1 1 3
2 1 2
3 1 4
0 2 1
Pandas provides a set of string functions which make it easy to operate on string
data. Most importantly, these functions ignore (or exclude) missing/NaN values.
Almost all of these methods work like Python string functions (refer:
https://docs.python.org/3/library/stdtypes.html#string-methods). So, convert the
Series object to a String object and then perform the operation.
1. lower() − Converts strings in the Series/Index to lower case.
2. upper() − Converts strings in the Series/Index to upper case.
3. len() − Computes string length.
4. strip() − Helps strip whitespace (including newline) from each string in the
Series/Index from both the sides.
5. split(' ') − Splits each string with the given pattern.
6. cat(sep=' ') − Concatenates the Series/Index elements with the given separator.
7. get_dummies() − Returns the DataFrame with One-Hot Encoded values.
8. contains(pattern) − Returns a Boolean value True for each element if the
substring is contained in the element, else False.
9. replace(a,b) − Replaces the value a with the value b.
10. repeat(value) − Repeats each element the specified number of times.
11. count(pattern) − Returns the count of appearances of the pattern in each element.
12. startswith(pattern) − Returns True if the element in the Series/Index starts
with the pattern.
13. endswith(pattern) − Returns True if the element in the Series/Index ends with
the pattern.
14. find(pattern) − Returns the first position of the first occurrence of the pattern.
15. findall(pattern) − Returns a list of all occurrences of the pattern.
16. swapcase() − Swaps the case lower/upper.
17. islower() − Checks whether all characters in each string in the Series/Index are
in lower case or not. Returns Boolean.
18. isupper() − Checks whether all characters in each string in the Series/Index are
in upper case or not. Returns Boolean.
19. isnumeric() − Checks whether all characters in each string in the Series/Index
are numeric. Returns Boolean.
Let us now create a Series and see how all the above functions work.
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
print(s)
0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234
6 Steve Smith
dtype: object
lower()
import pandas as pd
import numpy as np
print(s.str.lower())
0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
6 steve smith
dtype: object
upper()
print(s.str.upper())
0 TOM
1 WILLIAM RICK
2 JOHN
3 ALBER@T
4 NaN
5 1234
6 STEVE SMITH
dtype: object
len()
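A sketch using the same Series s −

print(s.str.len())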
0 3.0
1 12.0
2 4.0
3 7.0
4 NaN
5 4.0
6 10.0
dtype: float64
strip()
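This example uses a shorter Series with extra whitespace (assumed from the outputs
below) −

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After Stripping:")
print(s.str.strip())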
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
After Stripping:
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
split(pattern)
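A sketch using the same whitespace-padded Series −

print(s)
print("Split Pattern:")
print(s.str.split(' '))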
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
Split Pattern:
0 [Tom, , , , , , , , , , ]
1 [, , , , , William, Rick]
2 [John]
3 [Alber@t]
dtype: object
cat(sep=pattern)
print(s.str.cat(sep='_'))
get_dummies()
print(s.str.get_dummies())
contains(pattern)
replace(a,b)
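A sketch (the output below appears to show the series before the replacement) −

print(s.str.replace('@', '$'))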
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
repeat(value)
print(s.str.repeat(2))
Its output is as follows −
0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t
dtype: object
count(pattern)
startswith(pattern)
endswith(pattern)
find(pattern)
print(s.str.find('e'))
0 -1
1 -1
2 -1
3 3
dtype: int64
findall(pattern)
print(s.str.findall('e'))
0 []
1 []
2 []
3 [e]
dtype: object
Null list([ ]) indicates that there is no such pattern available in the element.
swapcase()
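A sketch producing the output below −

print(s.str.swapcase())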
0 tOM
1 wILLIAM rICK
2 jOHN
3 aLBER@T
dtype: object
islower()
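A sketch producing the output below −

print(s.str.islower())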
0 False
1 False
2 False
3 False
dtype: bool
isupper()
print(s.str.isupper())
0 False
1 False
2 False
3 False
dtype: bool
isnumeric()
print(s.str.isnumeric())
0 False
1 False
2 False
3 False
dtype: bool
get_option()
set_option()
reset_option()
describe_option()
option_context()
get_option(param)
get_option takes a single parameter and returns the value as given in the output
below −
display.max_rows
Displays the default maximum number of rows. The interpreter reads this value and
displays rows with this value as the upper limit.
import pandas as pd
print(pd.get_option("display.max_rows"))
60
display.max_columns
Displays the default maximum number of columns. The interpreter reads this value
and displays columns with this value as the upper limit.
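A sketch producing the value below −

import pandas as pd
print(pd.get_option("display.max_columns"))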
20
set_option(param,value)
set_option takes two arguments and sets the value to the parameter as shown below
−
display.max_rows
import pandas as pd
pd.set_option("display.max_rows", 80)
print(pd.get_option("display.max_rows"))
display.max_columns
import pandas as pd
pd.set_option("display.max_columns", 30)
print(pd.get_option("display.max_columns"))
30
reset_option(param)
reset_option takes an argument and sets the value back to the default value.
display.max_rows
Using reset_option(), we can change the value back to the default number of rows to
be displayed.
import pandas as pd
pd.reset_option("display.max_rows")
print(pd.get_option("display.max_rows"))
60
describe_option(param)
describe_option prints the description of the argument.
display.max_rows
Using describe_option(), we can print the description of a parameter such as
display.max_rows −
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
'large_repr', objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
option_context()
option_context context manager is used to set the option in with statement
temporarily. Option values are restored automatically when you exit the with block
−
display.max_rows
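A sketch producing the two values below (both prints run inside the with block) −

import pandas as pd
with pd.option_context("display.max_rows", 10):
   print(pd.get_option("display.max_rows"))
   print(pd.get_option("display.max_rows"))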
10
10
Here, both print statements run inside the with block, so both show the temporary
value (10) set by option_context(). Once the with block exits, the option
automatically reverts to the configured value.
The most frequently used display parameters are −
1. display.max_rows − Displays the maximum number of rows to display.
2. display.max_columns − Displays the maximum number of columns to display.
3. display.expand_frame_repr − Allows DataFrame representation to stretch across pages.
4. display.max_colwidth − Displays the maximum column width.
5. display.precision − Displays precision for decimal numbers.
The Python and NumPy indexing operators "[ ]" and attribute operator "." provide
quick and easy access to Pandas data structures across a wide range of use cases.
However, since the type of the data to be accessed isn’t known in advance, directly
using standard operators has some optimization limits. For production code, we
recommend that you take advantage of the optimized pandas data access methods
explained in this chapter.
Pandas now supports three types of Multi-axes indexing; the three types are
mentioned in the following table −
1. .loc() − Label based
2. .iloc() − Integer based
3. .ix() − Both Label and Integer based
.loc()
Pandas provide various methods to have purely label based indexing. When
slicing, the start bound is also included. Integers are valid labels, but they refer to
the label and not the position.
.loc() has multiple access methods like −
A single scalar label
A list of labels
A slice object
A Boolean array
loc takes two single/list/range operator separated by ','. The first one indicates the
row and the second one indicates columns.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select all rows for a specific column
print(df.loc[:,'A'])
a 0.391548
b -0.070649
c -0.317212
d -2.162406
e 2.202797
f 0.613709
g 1.050559
h 1.122680
Name: A, dtype: float64
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select all rows for multiple columns, say a list
print(df.loc[:,['A','C']])
A C
a 0.391548 0.745623
b -0.070649 1.620406
c -0.317212 1.448365
d -2.162406 -0.873557
e 2.202797 0.528067
f 0.613709 0.286414
g 1.050559 0.216526
h 1.122680 -1.621420
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select a few rows for multiple columns
print(df.loc[['a','b','f','h'],['A','C']])
A C
a 0.391548 0.745623
b -0.070649 1.620406
f 0.613709 0.286414
h 1.122680 -1.621420
Example 4
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select a range of rows for all columns
print(df.loc['a':'h'])
Example 5
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Getting values with a boolean array
print(df.loc['a'] > 0)
A False
B True
C False
D False
Name: a, dtype: bool
.iloc()
Pandas provide various methods in order to get purely integer based indexing. Like
python and numpy, these are 0-based indexing.
An Integer
A list of integers
A range of values
Example 1
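A sketch consistent with the output below −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Select the first four rows
print(df.iloc[:4])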
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Integer slicing
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
C D
1 -0.813012 0.631615
2 0.025070 0.230806
3 0.826977 -0.026251
4 1.423332 1.130568
Example 3
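A sketch consistent with the three outputs below −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Slicing through a list of values
print(df.iloc[[1, 3, 5], [1, 3]])
print(df.iloc[1:3, :])
print(df.iloc[:, 1:3])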
B D
1 0.890791 0.631615
3 -1.284314 -0.026251
5 -0.512888 -0.518930
A B C D
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
B C
0 0.256239 -1.270702
1 0.890791 -0.813012
2 -0.531378 0.025070
3 -1.284314 0.826977
4 -0.460729 1.423332
5 -0.512888 0.581409
6 -1.204853 0.098060
7 -0.947857 0.641358
.ix()
Besides pure label based and integer based, Pandas provides a hybrid method for
selections and subsetting the object using the .ix() operator.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Integer slicing
print(df.ix[:4])
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Use of Notations
Getting values from the Pandas object with Multi-axes indexing uses the following
notation −
Series − s.loc[indexer]
DataFrame − df.loc[row_index, col_index]
Panel − p.loc[item_index, major_index, minor_index]
Note − .iloc() and .ix() apply the same indexing options and return values.
Let us now see how each operation can be performed on the DataFrame object. We
will use the basic indexing operator '[ ]' −
Example 1
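A sketch consistent with the output below −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df['A'])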
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
Example 2
print(df[['A','B']])
A B
0 -0.478893 -0.606311
1 0.391931 -0.949025
2 0.336825 0.093717
3 -1.055102 -0.012944
4 -0.165218 1.550310
5 -0.328641 -0.226363
6 0.567721 -0.312585
7 -0.759399 -0.372696
Attribute Access
Columns can be selected using the attribute operator '.' −
Example

print(df.A)
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
Percent_change
Series, DataFrames and Panel all have the function pct_change(). This function
compares every element with its prior element and computes the change
percentage.
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())
0 NaN
1 1.000000
2 0.500000
3 0.333333
4 0.250000
5 -0.200000
dtype: float64
0 1
0 NaN NaN
1 -15.151902 0.174730
2 -0.746374 -1.449088
3 -3.582229 -3.165836
4 15.601150 -1.860434
By default, pct_change() operates on columns; if you want to apply the same
row wise, then use the axis=1 argument.
Covariance
Covariance is applied on series data. The Series object has a method cov to compute
covariance between series objects. NA will be excluded automatically.
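A sketch of Series covariance −

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))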
-0.12978405324
Covariance method when applied on a DataFrame, computes cov between all the
columns.
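A sketch consistent with the outputs below −

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame['a'].cov(frame['b']))
print(frame.cov())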
-0.58312921152741437
a b c d e
a 1.780628 -0.583129 -0.185575 0.003679 -0.136558
b -0.583129 1.297011 0.136530 -0.523719 0.251064
c -0.185575 0.136530 0.915227 -0.053881 -0.058926
d 0.003679 -0.523719 -0.053881 1.521426 -0.487694
e -0.136558 0.251064 -0.058926 -0.487694 0.960761
Note − Observe the cov between a and b column in the first statement and the
same is the value returned by cov on DataFrame.
Correlation
Correlation shows the linear relationship between any two array of values (series).
There are multiple methods to compute the correlation like pearson(default),
spearman and kendall.
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame['a'].corr(frame['b']))
print(frame.corr())
-0.383712785514
a b c d e
a 1.000000 -0.383713 -0.145368 0.002235 -0.104405
b -0.383713 1.000000 0.125311 -0.372821 0.224908
c -0.145368 0.125311 1.000000 -0.045661 -0.062840
d 0.002235 -0.372821 -0.045661 1.000000 -0.403380
e -0.104405 0.224908 -0.062840 -0.403380 1.000000
Data Ranking
Data Ranking produces ranking for each element in the array of elements. In case of
ties, assigns the mean rank.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print(s.rank())
a 1.0
b 3.5
c 2.0
d 3.5
e 5.0
dtype: float64
Rank optionally takes a parameter ascending which by default is true; when false,
data is reverse-ranked, with larger values assigned a smaller rank.
Rank supports different tie-breaking methods, specified with the method parameter
−
We will now learn how each of these can be applied on DataFrame objects.
.rolling() Function
This function can be applied on a series of data. Specify the window=n argument
and apply the appropriate statistical function on top of it.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.rolling(window=3).mean())
Note − Since the window size is 3, for first two elements there are nulls and from
third the value will be the average of the n, n-1 and n-2 elements. Thus we can also
apply various functions as mentioned above.
.expanding() Function
This function can be applied on a series of data. Specify the min_periods=n
argument and apply the appropriate statistical function on top of it.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.expanding(min_periods=3).mean())
A B C D
2000-01-01 NaN NaN NaN NaN
2000-01-02 NaN NaN NaN NaN
2000-01-03 0.434553 -0.667940 -1.051718 -0.826452
2000-01-04 0.743328 -0.198015 -0.852462 -0.262547
2000-01-05 0.614776 -0.205649 -0.583641 -0.303254
2000-01-06 0.538175 -0.005878 -0.687223 -0.199219
2000-01-07 0.505503 -0.108475 -0.790826 -0.081056
2000-01-08 0.454751 -0.223420 -0.671572 -0.230215
2000-01-09 0.586390 -0.206201 -0.517619 -0.267521
2000-01-10 0.560427 -0.037597 -0.399429 -0.376886
.ewm() Function
ewm is applied on a series of data. Specify any of the com, span, halflife argument
and apply the appropriate statistical function on top of it. It assigns the weights
exponentially.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.ewm(com=0.5).mean())
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.865131 -0.453626 -1.137961 0.058747
2000-01-03 -0.132245 -0.807671 -0.308308 -1.491002
2000-01-04 1.084036 0.555444 -0.272119 0.480111
2000-01-05 0.425682 0.025511 0.239162 -0.153290
2000-01-06 0.245094 0.671373 -0.725025 0.163310
2000-01-07 0.288030 -0.259337 -1.183515 0.473191
2000-01-08 0.162317 -0.771884 -0.285564 -0.692001
2000-01-09 1.147156 -0.302900 0.380851 -0.607976
2000-01-10 0.600216 0.885614 0.569808 -1.110113
Window functions are majorly used in finding the trends within the data graphically
by smoothing the curve. If there is lot of variation in the everyday data and a lot of
data points are available, then taking the samples and plotting is one method and
applying the window computations and plotting the graph on the results is another
method. By these methods, we can smooth the curve or the trend.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r)
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169
Rolling [window=3,min_periods=1,center=False,axis=0]
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
2000-01-01 1.088512
2000-01-02 1.879182
2000-01-03 1.303660
2000-01-04 1.884801
2000-01-05 1.194699
2000-01-06 1.925393
2000-01-07 0.565208
2000-01-08 0.564129
2000-01-09 2.048458
2000-01-10 2.065750
Freq: D, Name: A, dtype: float64
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
A B
2000-01-01 1.088512 -0.650942
2000-01-02 1.879182 -1.038796
2000-01-03 1.303660 -2.003821
2000-01-04 1.884801 -0.141119
2000-01-05 1.194699 0.010551
2000-01-06 1.925393 1.968551
2000-01-07 0.565208 0.032738
2000-01-08 0.564129 -0.759118
2000-01-09 2.048458 -1.820537
2000-01-10 2.065750 0.383357
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate([np.sum, np.mean]))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
sum mean
2000-01-01 1.088512 1.088512
2000-01-02 1.879182 0.939591
2000-01-03 1.303660 0.434553
2000-01-04 1.884801 0.628267
2000-01-05 1.194699 0.398233
2000-01-06 1.925393 0.641798
2000-01-07 0.565208 0.188403
2000-01-08 0.564129 0.188043
2000-01-09 2.048458 0.682819
2000-01-10 2.065750 0.688583
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A','B']].aggregate([np.sum, np.mean]))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
A B
sum mean sum mean
2000-01-01 1.088512 1.088512 -0.650942 -0.650942
2000-01-02 1.879182 0.939591 -1.038796 -0.519398
2000-01-03 1.303660 0.434553 -2.003821 -0.667940
2000-01-04 1.884801 0.628267 -0.141119 -0.047040
2000-01-05 1.194699 0.398233 0.010551 0.003517
2000-01-06 1.925393 0.641798 1.968551 0.656184
2000-01-07 0.565208 0.188403 0.032738 0.010913
2000-01-08 0.564129 0.188043 -0.759118 -0.253039
2000-01-09 2.048458 0.682819 -1.820537 -0.606846
2000-01-10 2.065750 0.688583 0.383357 0.127786
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4),
   index = pd.date_range('1/1/2000', periods=3),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate({'A': np.sum, 'B': np.mean}))
A B C D
2000-01-01 -1.575749 -1.018105 0.317797 0.545081
2000-01-02 -0.164917 -1.361068 0.258240 1.113091
2000-01-03 1.258111 1.037941 -0.047487 0.867371
A B
2000-01-01 -1.575749 -1.018105
2000-01-02 -1.740666 -1.189587
2000-01-03 -0.482555 -0.447078
Let us now see how we can handle missing values (say NA or NaN) using Pandas.
Using reindexing, we have created a DataFrame with missing values. In the output,
NaN means Not a Number.
To make detecting missing values easier (and across different array dtypes), Pandas
provides the isnull() and notnull() functions, which are also methods on Series and
DataFrame objects −
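The DataFrame used in these examples is not shown in the source; a construction
consistent with the outputs (labels a through h, with b, d and g introduced as
missing by reindexing) is −

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])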
Example 1
print(df['one'].isnull())
Example 2
print(df['one'].notnull())
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
Calculations with Missing Data
When summing data, NA will be treated as zero. If the data are all NA, then the
result will be NA.
Example 1

print(df['one'].sum())
2.02357685917
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[0,1,2,3,4,5], columns=['one','two'])
print(df['one'].sum())
nan
The following program shows how you can replace "NaN" with "0" using fillna() −

print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Here, we are filling with value zero; instead we can also fill with any other value.
Fill NA Forward and Backward
We will fill the missing values using the following fill methods −
1. pad/ffill − Fill methods Forward
2. bfill/backfill − Fill methods Backward
Example 1
print(df.fillna(method='pad'))
Example 2
print(df.fillna(method='backfill'))
Drop Missing Values
To exclude the missing values, use the dropna() function along with the axis
argument. By default, axis=0 (along rows): if any value within a row is NA, the
whole row is excluded.
Example 1
print(df.dropna()) drops every row that contains a NaN.
Example 2
print(df.dropna(axis=1)) drops every column that contains a NaN; on the frame
above, this leaves an empty DataFrame −

Empty DataFrame
Columns: [ ]
Index: [a, b, c, d, e, f, g, h]
Replace Missing (or) Generic Values
Many times, we have to replace a generic value with some specific value. We can
achieve this by applying the replace method.
Example 1

import pandas as pd
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10, 2000:60}))
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
In many situations, we split the data into sets and we apply some functionality on
each subset. In the apply functionality, we can perform the following operations −
Let us now create a DataFrame object and perform all the operations on it −
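The construction is not shown in the source; a dict consistent with all the group
outputs below is −

import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)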
print(df)
Pandas objects can be split into groups in any of the following ways −
obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key, axis=1)
Let us now see how the grouping objects can be applied to the DataFrame object
Example
print(df.groupby('Team'))
View Groups
print(df.groupby('Team').groups)
Example
print(df.groupby(['Team','Year']).groups)
Iterating through Groups
With the groupby object in hand, we can iterate through the groups −

grouped = df.groupby('Year')
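The iteration that produces the groups below is presumably −

for name, group in grouped:
   print(name)
   print(group)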
2014
Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
2015
Points Rank Team Year
1 789 2 Riders 2015
3 673 3 Devils 2015
5 812 4 kings 2015
10 804 1 Royals 2015
2016
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
2017
Points Rank Team Year
7 788 1 Kings 2017
11 690 2 Riders 2017
By default, the groupby object has the same label name as the group name.
Select a Group
Using the get_group() method, we can select a single group.
grouped = df.groupby('Year')
print(grouped.get_group(2014))
Aggregations
An aggregated function returns a single aggregated value for each group. Once the
group by object is created, several aggregation operations can be performed on the
grouped data.
import numpy as np
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Another way to see the size of each group is by applying the size() function −

import pandas as pd
import numpy as np
grouped = df.groupby('Team')
print(grouped.agg(np.size))
With grouped Series, you can also pass a list or dict of functions to do aggregation
with, and generate DataFrame as output −
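A sketch passing a list of functions −

import numpy as np
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))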
Transformations
Transformation on a group or a column returns an object that is indexed the same
(same size) as the one being grouped. Thus, the transform should return a result
that is the same size as that of a group chunk.
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))
Filtration
Filtration filters the data on a defined criteria and returns the subset of data. The
filter() function is used to filter the data.
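A sketch of filter() usage − return the teams which have participated three or
more times −

print(df.groupby('Team').filter(lambda x: len(x) >= 3))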
Pandas provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
on − Columns (names) to join on. Must be found in both the left and right
DataFrame objects.
left_on − Columns from the left DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame.
right_on − Columns from the right DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame.
left_index − If True, use the index (row labels) from the left DataFrame as
its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the
number of levels must match the number of join keys from the right
DataFrame.
Let us now create two different DataFrames and perform the merging operations on
it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
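Merging the two frames on a key is then, for example −

print(pd.merge(left, right, on='id'))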
The how argument to merge specifies which keys are included in the resulting table −
Left Join − how='left', uses keys from the left object
Right Join − how='right', uses keys from the right object
Outer Join − how='outer', uses the union of keys
Inner Join − how='inner' (default), uses the intersection of keys
Joining will be performed on index. Join operation honors the object on which it is
called. So, a.join(b) is not equal to b.join(a).
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)
Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation
operations along an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two]))
Suppose we wanted to associate specific keys with each of the pieces of the chopped
up DataFrame. We can do this by using the keys argument −
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))
x 1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
y 1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
If the resultant object has to follow its own indexing, set ignore_index to True.
Live Demo
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))
Observe, the index changes completely and the Keys are also overridden.
If two objects need to be added along axis=1, then the new columns will be
appended.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))
A useful shortcut to concat is the append instance method on Series and
DataFrame. These methods actually predated concat. They concatenate along
axis=0, namely the index. (Note: append was deprecated and later removed in
favor of concat in recent pandas versions.)
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one.append(two))
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one.append([two,one,two]))
Time Series
Pandas provides a robust tool for working with time series data, especially in the
financial sector. While working with time series data, we frequently come across the
following −
Pandas provides a relatively compact and self-contained set of tools for performing
the above tasks.
print(pd.datetime.now())
2017-05-11 06:10:13.393147
Create a TimeStamp
Time-stamped data is the most basic type of timeseries data that associates values
with points in time. For pandas objects, it means using the points in time. Let’s take
an example −
print(pd.Timestamp('2017-03-01'))
2017-03-01 00:00:00
It is also possible to convert integer or float epoch times. The default unit for these is
nanoseconds (since these are how Timestamps are stored). However, often epochs
are stored in another unit which can be specified. Let’s take another example
print(pd.Timestamp(1587687255, unit='s'))
2020-04-24 00:14:15
Converting to Timestamps
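The conversion code is not shown in the source; a sketch consistent with the output
below (an unparseable value becomes NaT) is −

import pandas as pd
print(pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None])))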
0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]
bdate_range
bdate_range() stands for business date ranges. Unlike date_range(), it excludes
Saturday and Sunday.
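A sketch (the start date here is assumed so that it matches the note below; March
4-5, 2017 fall on a weekend) −

import pandas as pd
print(pd.bdate_range('3/1/2017', periods=5))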
Observe, after the 3rd of March, the date jumps to the 6th, excluding the 4th and
5th (a weekend). Just check your calendar for the days.
Offset Aliases
A number of string aliases are given to useful common time series frequencies. We
will refer to these aliases as offset aliases.
String
By passing a string literal, we can create a timedelta object.
import pandas as pd
print(pd.Timedelta('2 days 2 hours 15 minutes 30 seconds'))
2 days 02:15:30
Integer
By passing an integer value with the unit, an argument creates a Timedelta object.
print(pd.Timedelta(6, unit='h'))
0 days 06:00:00
Data Offsets
Data offsets such as - weeks, days, hours, minutes, seconds, milliseconds,
microseconds, nanoseconds can also be used in construction.
print(pd.Timedelta(days=2))
2 days 00:00:00
to_timedelta()
Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or series
from a recognized timedelta format/ value into a Timedelta type. It will construct
Series if the input is a Series, a scalar if the input is scalar-like, otherwise will output
a TimedeltaIndex.
print(pd.Timedelta(days=2))
2 days 00:00:00
Operations
You can operate on Series/ DataFrames and construct timedelta64[ns] Series
through subtraction operations on datetime64[ns] Series, or Timestamps.
Let us now create a DataFrame with Timedelta and datetime objects and perform
some arithmetic operations on it −
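The construction is not shown in the source; a sketch consistent with the output
below is −

import pandas as pd
s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame({'A': s, 'B': td})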
print(df)
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
Addition Operations
import pandas as pd
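# Assumed operation matching the output below: adding timedelta column B
# to datetime column A (df as constructed above) yields datetime column C.
df['C'] = df['A'] + df['B']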
print df
A B C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
Subtraction Operation
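Despite the heading, the output shown below is consistent with adding columns C and B (an assumed reconstruction) −
df['D'] = df['C'] + df['B']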
print df
A B C D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-04
2 2012-01-03 2 days 2012-01-05 2012-01-07
Categorical Data
Categoricals are a Pandas data type. Categorical variables can take on only a limited, and usually fixed, number of possible values. Categorical data may have an order, but numerical operations cannot be performed on it.
The lexical order of a variable is not the same as the logical order (“one”,
“two”, “three”). By converting to a categorical and specifying an order on the
categories, sorting and min/max will use the logical order instead of the
lexical order.
Object Creation
Categorical object can be created in multiple ways. The different ways have been
described below −
category
By specifying dtype="category" in the Series creation −
s = pd.Series(["a","b","c","a"], dtype="category")
print s
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
The number of elements passed to the series object is four, but the categories are
only three. Observe the same in the output Categories.
pd.Categorical
Using the standard pandas Categorical constructor, we can create a category object.
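For example (a minimal sketch matching the output below) −
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print cat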
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
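Passing an explicit list of categories as the second argument (a sketch; the value 'd' is not among the categories) −
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'])
print cat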
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument signifies the categories. Thus, any value which is not
present in the categories will be treated as NaN.
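Passing ordered=True produces an ordered categorical (an assumed reconstruction) −
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
print cat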
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, the order means that a is greater than b, and b is greater than c.
Description
Using the .describe() command on the categorical data, we get output similar to that of a Series or DataFrame of type string.
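A construction consistent with the output below (assumed; one missing value per column, so count is 3, with two unique values and top "c") −
import pandas as pd
import numpy as np
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})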
print df.describe()
print df["cat"].describe()
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
Get the Properties of the Category
The cat.ordered attribute reports whether the categorical has an order −
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
print s.cat.ordered
False
It returned False because we have not specified any order.
Renaming Categories
Renaming categories is done by assigning new values to the series.cat.categories property −
s = pd.Series(["a","b","c","a"], dtype="category")
s.cat.categories = ["Group %s" % g for g in s.cat.categories]
print s.cat.categories
Index([u'Group a', u'Group b', u'Group c'], dtype='object')
Appending New Categories
Using the Categorical.add_categories() method, new categories can be appended −
s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.add_categories([4])
print s.cat.categories
Index([u'a', u'b', u'c', 4], dtype='object')
Removing Categories
Using the Categorical.remove_categories() method, unwanted categories can be removed −
s = pd.Series(["a","b","c","a"], dtype="category")
print ("Original object:")
print s
print ("After removal:")
print s.cat.remove_categories("a")
Original object:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
After removal:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (2, object): [b, c]
Comparison of Categorical Data
Comparing categorical data with other objects is possible in two cases −
comparing equality (== and !=) to a list-like object (list, Series, array, ...) of the same length as the categorical data.
all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
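For example (a sketch with assumed data; both Series share the same ordered categories, so > is permitted) −
import pandas as pd
cat = pd.Series(pd.Categorical([1,2,3], categories=[1,2,3], ordered=True))
cat1 = pd.Series(pd.Categorical([2,2,2], categories=[1,2,3], ordered=True))
print cat > cat1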
0 False
1 False
2 True
dtype: bool
Visualization
Basic Plotting: plot
This functionality on Series and DataFrame is just a simple wrapper around the matplotlib library's plot() method −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('1/1/2000',
periods=10), columns=list('ABCD'))
df.plot()
We can plot one column versus another using the x and y keywords.
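For example, with the DataFrame created above (columns 'A' and 'B' assumed from its construction) −
df.plot(x='A', y='B')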
Plotting methods allow a handful of plot styles other than the default line plot. These methods can be provided as the kind keyword argument to plot(). These include −
bar or barh for bar plots
hist for histogram
box for boxplot
area for area plots
scatter for scatter plots
Bar Plot
Let us now see what a Bar Plot is by creating one. A bar plot can be created in the
following way −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
To produce a stacked bar plot, pass stacked=True −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar(stacked=True)
To get horizontal bar plots, use the barh method −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.barh(stacked=True)
Histograms
Histograms can be plotted using the plot.hist() method. We can specify the number of bins −
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c':
np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.plot.hist(bins=20)
To plot different histograms for each column, use the following code −
import pandas as pd
import numpy as np
df=pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c':
np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.hist(bins=20)
Box Plots
Boxplots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Area Plot
An area plot can be created using the Series.plot.area() or DataFrame.plot.area() methods −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area()
Scatter Plot
A scatter plot can be created using the DataFrame.plot.scatter() method −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')
Pie Chart
A pie chart can be created using the DataFrame.plot.pie() method −
import pandas as pd
import numpy as np
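# A sketch (assumed data): a single numeric column; each row becomes a wedge.
df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)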
IO Tools
The two workhorse functions for reading text files (or flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object −
S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
Save this data as temp.csv and conduct operations on it.
read_csv
read_csv reads data from the csv files and creates a DataFrame object −
import pandas as pd
df=pd.read_csv("temp.csv")
print df
custom index
This specifies a column in the csv file to be used as the index, via index_col −
import pandas as pd
df=pd.read_csv("temp.csv",index_col=['S.No'])
print df
Converters
The dtype of a column can be passed as a dict to read_csv, so that values are converted while parsing −
import pandas as pd
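import numpy as np
# Assumed call matching the description below: a dtype dict casts the
# Salary column to float64 while parsing.
df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print df.dtypes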
By default, the dtype of the Salary column is int, but the result shows it as float because we have explicitly cast the type.
header_names
Specify the names of the header using the names argument −
import pandas as pd
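# Assumed call: supply custom column names via the names argument.
df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'])
print df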
a b c d e
0 S.No Name Age City Salary
1 1 Tom 28 Toronto 20000
2 2 Lee 32 HongKong 3000
3 3 Steven 43 Bay Area 8300
4 4 Ram 38 Hyderabad 3900
Observe that the custom names are applied, but the original header row from the file has not been eliminated; it appears as a data row. Now, we use the header argument to remove that.
If the header is in a row other than the first, pass the row number to header. This
will skip the preceding rows.
import pandas as pd
df=pd.read_csv("temp.csv",names=['a','b','c','d','e'],header=0)
print df
   a       b   c          d      e
0  1     Tom  28    Toronto  20000
1  2     Lee  32   HongKong   3000
2  3  Steven  43   Bay Area   8300
3  4     Ram  38  Hyderabad   3900
skiprows
skiprows skips the specified number of rows −
import pandas as pd
df=pd.read_csv("temp.csv", skiprows=2)
print df
The first two lines of the file are skipped, so the third line becomes the header row.
Sparse Data
Sparse objects are "compressed" when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified" −
import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print sts
0 -0.810497
1 -1.419954
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.439240
9 -1.095910
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
Let us now assume you have a large DataFrame that is mostly missing values, and execute the following code −
df = pd.DataFrame(np.random.randn(10000, 4))
df.ix[:9998] = np.nan
sdf = df.to_sparse()
print sdf.density
0.0001
The density is 0.0001, since only one row in 10,000 holds actual data.
Any sparse object can be converted back to the standard dense form by calling
to_dense −
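Continuing with the sparse series sts created above −
print sts.to_dense()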
0 -0.810497
1 -1.419954
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.439240
9 -1.095910
dtype: float64
Sparse Dtypes
Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, the fill_value default changes −
float64 − np.nan
int64 − 0
bool − False
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, np.nan])
print s
s.to_sparse()
print s
Note that to_sparse() returns a new sparse object and leaves s unchanged, which is why the two printouts are identical.
0 1.0
1 NaN
2 NaN
dtype: float64
0 1.0
1 NaN
2 NaN
dtype: float64
Caveats and Gotchas
Using If/Truth Statements with Pandas
Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, and not. To check whether any element is True, use .any() −
if pd.Series([False, True, False]).any():
   print("I am any")
I am any
To evaluate a single-element pandas object in a Boolean context, use the method .bool() −
print pd.Series([True]).bool()
True
Bitwise Boolean
Boolean operators like == and != operate element-wise and return a Boolean Series, which is almost always what is required anyway.
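For example (a minimal sketch producing the output below) −
import pandas as pd
s = pd.Series(range(5))
print s == 4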
0 False
1 False
2 False
3 False
4 True
dtype: bool
isin Operation
This returns a Boolean series showing whether each element in the Series is exactly
contained in the passed sequence of values.
s = pd.Series(list('abc'))
s = s.isin(['a', 'c', 'e'])
print s
0 True
1 False
2 True
dtype: bool
Reindexing vs ix Gotcha
Many users will find themselves using the ix indexing capabilities as a concise
means of selecting data from a Pandas object −
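Assume a DataFrame like the following (a sketch; the original construction is not shown in this excerpt) −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'], index=list('abcdef'))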
print df
print df.ix[['b', 'c', 'e']]
This is, of course, completely equivalent in this case to using the reindex method −
print df
print df.reindex(['b', 'c', 'e'])
Some might conclude from this that ix and reindex are 100% equivalent. This is true except in the case of integer indexing. For example, the above operation can alternatively be expressed as −
print df
print df.ix[[1, 2, 4]]
print df.reindex([1, 2, 4])
With the character index used here, ix falls back to position-based selection, whereas reindex is strict label indexing only: the labels 1, 2, and 4 do not exist in the index, so reindex returns rows filled with NaN.
Comparison with SQL
Since many potential Pandas users have some familiarity with SQL, this section provides examples of how various SQL operations can be performed using Pandas. We will use the tips dataset −
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
print tips.head()
SELECT
In SQL, selection is done using a comma-separated list of columns that you select (or a * to select all columns) −
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
With Pandas, column selection is done by passing a list of column names to your
DataFrame −
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
print tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Calling the DataFrame without the list of column names will display all columns (akin
to SQL’s *).
WHERE
Filtering in SQL is done via a WHERE clause −
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using Boolean indexing −
tips[tips['time'] == 'Dinner'].head(5)
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
print tips[tips['time'] == 'Dinner'].head(5)
GroupBy
This operation fetches the count of records in each group throughout a dataset. For instance, a query fetching us the number of tips left by sex −
SELECT sex, count(*)
FROM tips
GROUP BY sex;
The Pandas equivalent is −
tips.groupby('sex').size()
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
print tips.groupby('sex').size()
sex
Female 87
Male 157
dtype: int64
Top N rows
SQL returns the top n rows using LIMIT −
SELECT *
FROM tips
LIMIT 5;
The Pandas equivalent is the head() method −
tips.head(5)
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
tips = tips[['smoker', 'day', 'time']].head(5)
print tips
These are a few of the basic operations, compared with their SQL equivalents, that we learnt in the previous chapters of the Pandas library.