Lecture 8 - Data Wrangling Using Pandas
1 Data Wrangling
• Hierarchical Indexing
• Combining and Merging datasets
• Reshaping and Pivoting
2 Hierarchical Indexing
[2]: import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
display(data)
display(data.index)
a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
2.1 Indexing of hierarchical indexed series
[14]: display(data.loc['a'])
display(data.loc[('a',2)])
1 -0.907795
2 -0.267606
3 0.496992
dtype: float64
-0.26760593571603775
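The selections above can be reproduced with fixed values, which makes the results easier to check than random data. This is a minimal sketch; the variable names `outer`, `scalar`, `outer_range` and `inner` are only illustrative and do not come from the lecture.

```python
import numpy as np
import pandas as pd

# Same two-level index as in the lecture, but with fixed values 0.0 .. 8.0
data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer = data.loc['b']            # partial indexing: all rows under outer key 'b'
scalar = data.loc[('a', 2)]      # full (outer, inner) key: a single value
outer_range = data.loc['b':'c']  # label slice on the outer level (index is sorted)
inner = data.loc[:, 2]           # select on the inner level across all outer keys

print(outer)
print(scalar)
print(outer_range)
print(inner)
```

Selecting with only the outer key returns a Series indexed by the inner level, while a full tuple returns a scalar.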
a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64
[4]: 1 2 3
a 1.188749 -1.142034 1.159208
b -0.793279 NaN -1.449333
c 1.330207 1.185908 NaN
d NaN -0.130557 -0.300575
[ ]: a 1 2.320428
2 0.791752
3 1.258071
b 1 -0.269220
3 0.524681
c 1 -0.506953
2 -0.955109
d 2 1.336005
3 0.556053
dtype: float64
2.4 Hierarchical index on a dataFrame
[10]: import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
[13]: # Sorting index: sort_index returns a new DataFrame, so keep the result
frame = frame.sort_index(level='key2')  # equivalently, level=1
frame
[15]: a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
frame2
[16]: a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
2.7 reset_index: the hierarchical index levels are moved into the columns
[17]: frame2.reset_index()
[17]: c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
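The round trip between columns and index levels can be sketched as follows. The frame is rebuilt from the tables shown above, and the call to `set_index` is an assumption about how `frame2` was produced, since that cell is missing from this copy of the notes.

```python
import pandas as pd

# Rebuilt from the displayed output of `frame`
frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

frame2 = frame.set_index(['c', 'd'])  # columns c and d become index levels
restored = frame2.reset_index()       # index levels move back into columns

print(frame2)
print(restored)
```

Note that `reset_index` places the former index levels first, so the column order becomes c, d, a, b.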
display(df1)
display(df2)
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2
[11]: display(df1)
display(df2)
pd.merge(df1, df2)
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2
[8]: # Equivalent
pd.merge(df1, df2, on='key')
3.1 If the column names are different in each object, you can specify them separately:
[21]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
display(df3)
display(df4)
lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
rkey data2
0 a 0
1 b 1
2 d 2
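The merge call for these two frames is missing from this copy of the notes. A minimal sketch, assuming `left_on` and `right_on` are used to name the key column on each side, and with `df4` rebuilt from the table displayed above:

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
# df4 is rebuilt from the rkey/data2 table displayed above
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

# Name the key column on each side; the default join type is 'inner'
merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')
print(merged)
```

Because the join is an inner join, the 'c' rows of `df3` and the 'd' row of `df4` drop out, and both key columns are kept in the result.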
display(df1)
display(df2)
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
3.3.1 Many-to-many joins form the Cartesian product of the rows. Since there were
three ‘b’ rows in the left DataFrame and two in the right one, there are six ‘b’
rows in the result.
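The row counts in that statement can be checked directly. The sketch below rebuilds `df1` and `df2` from the tables displayed above and counts the rows per key after an inner merge.

```python
import pandas as pd

# Rebuilt from the displayed tables: 'b' appears 3 times on the left, 2 times on the right
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

merged = pd.merge(df1, df2, on='key', how='inner')

# Each left row is paired with every matching right row
rows_per_key = merged['key'].value_counts()
print(merged)
print(rows_per_key)
```

The 'b' key yields 3 x 2 = 6 rows and the 'a' key yields 2 x 2 = 4 rows, while 'c' and 'd' have no partner and are dropped by the inner join.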
4 Merging on Index
• Keys on the index
display(left)
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
display(right)
pd.merge(left, right, how='outer', left_index=True, right_index=True)
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
[9]: Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
display(arr)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
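The three arrays above look like `arr` followed by its concatenation with itself along each axis. The calls themselves are missing from this copy, so the sketch below assumes `np.concatenate` was used.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

stacked_rows = np.concatenate([arr, arr], axis=0)  # glue along rows: shape (6, 4)
stacked_cols = np.concatenate([arr, arr], axis=1)  # glue along columns: shape (3, 8)

print(arr)
print(stacked_rows)
print(stacked_cols)
```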
5.2 Pandas concatenate
[30]: s1 = pd.Series([0, 1], index=['a', 'b'])
display(s1)
display(s2)
display(s3)
a 0
b 1
dtype: int64
c 2
d 3
e 4
dtype: int64
f 5
g 6
dtype: int64
[30]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
5.3 By default concat works along rows, producing another Series. For horizontal concat, pass axis=1
[31]: pd.concat([s1, s2, s3], axis=1)
[31]: 0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
[37]: s4 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'f', 'g'])
display(s1)
display(s4)
pd.concat([s1, s4], axis=1)
a 0
b 1
dtype: int64
a 0
b 1
f 2
g 3
dtype: int64
[37]: 0 1
a 0.0 0
b 1.0 1
f NaN 2
g NaN 3
[38]: 0 1
a 0 0
b 1 1
result
[39]: s1 a 0
b 1
s2 c 2
d 3
e 4
s3 f 5
g 6
dtype: int64
[40]: result.unstack()
[40]: a b c d e f g
s1 0.0 1.0 NaN NaN NaN NaN NaN
s2 NaN NaN 2.0 3.0 4.0 NaN NaN
s3 NaN NaN NaN NaN NaN 5.0 6.0
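The cells that produced outputs [38] and [39] are missing from this copy. The following sketch shows calls that give results of the same shape, assuming `join='inner'` and the `keys` argument were used; `s2` and `s3` are rebuilt from the displayed output.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])  # rebuilt from the displayed output
s3 = pd.Series([5, 6], index=['f', 'g'])          # rebuilt from the displayed output
s4 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'f', 'g'])

# Keep only the labels shared by both inputs
inner = pd.concat([s1, s4], axis=1, join='inner')

# keys adds an outer index level that records which input each row came from
result = pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])
wide = result.unstack()  # one row per key, one column per original label

print(inner)
print(result)
print(wide)
```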
display(df1)
display(df2)
one two
a 0 1
b 2 3
c 4 5
three four
a 5 6
c 7 8
[45]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
display(a)
display(b)
a.fillna(b)
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
h 3.0
g 4.0
a NaN
dtype: float64
[45]: f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
[47]: a NaN
b 4.5
c 3.5
d 2.0
e 2.5
f 0.0
g 4.0
h 3.0
dtype: float64
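Output [47] is consistent with `a.combine_first(b)`, although the cell itself is missing here. The sketch below contrasts it with `fillna`; `b` is rebuilt from the values displayed above, so treat its construction as an assumption.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
# b rebuilt from the displayed output
b = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
              index=['f', 'e', 'd', 'h', 'g', 'a'])

filled = a.fillna(b)            # keeps a's index; fills a's NaNs where b has the label
combined = a.combine_first(b)   # union of both indexes, preferring a's values

print(filled)
print(combined)
```

`fillna` never adds new labels, whereas `combine_first` also brings in 'g' and 'h' from `b` and returns a sorted index.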
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.], 'b': [np.nan, 3., 4., 6., 8.]})
display(df1)
display(df2)
df1.combine_first(df2)
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
[48]: a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
display(data)
result = data.stack()
result
Colorado one 3
two 4
three 5
dtype: int32
By default the innermost level is unstacked (same with stack). We can unstack a different level by
passing a level number or name:
[6]: result.unstack(0)
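A self-contained sketch of choosing the level to unstack follows. The definition of `data` is not visible in this copy, so the frame below is an assumption that matches the displayed values (Colorado: 3, 4, 5); the level names `state` and `number` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed definition; the level names 'state' and 'number' are hypothetical
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

result = data.stack()               # columns become the inner index level
default = result.unstack()          # innermost level ('number') back to columns
by_state = result.unstack(0)        # outer level ('state') to columns, by position
by_name = result.unstack('state')   # the same, selecting the level by name

print(result)
print(default)
print(by_state)
```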
display(data2)
data2.unstack()
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
[7]: a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
7.4 Stack filters out missing values by default; pass dropna=False to keep them
[18]: data2.unstack().stack(dropna=False)
display(ldata)
pivoted
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
ldata
"8/4/2020": np.random.randint(10,200, size=(1,4))[0],
"8/5/2020": np.random.randint(12,200, size=(1,4))[0],
"8/6/2020": np.random.randint(12,200, size=(1,4))[0],
"8/7/2020": np.random.randint(12,200, size=(1,4))[0]}, orient='columns')
display(df)
melted = df.melt()
display(melted.head())
melted_id = df.melt('Name') # Use name as identifier variable
melted_id.head()
Since pivot builds the index from the column used for the row labels, we may want to use
reset_index to move those labels back into a column:
[72]: reshaped.reset_index()
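The wide-to-long-to-wide round trip can be sketched on a small frame. The names and counts below are made up for illustration, and the column names `date` and `count` are assumptions, not the ones used in the lecture.

```python
import pandas as pd

# Hypothetical data in the same wide layout as the lecture's df (one column per date)
df = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chen'],
    '8/4/2020': [10, 20, 30],
    '8/5/2020': [11, 21, 31],
})

# Wide -> long: one row per (Name, date) pair
melted = df.melt(id_vars='Name', var_name='date', value_name='count')

# Long -> wide: Name becomes the index, one column per date
reshaped = melted.pivot(index='Name', columns='date', values='count')

# Move Name back from the index into an ordinary column
flat = reshaped.reset_index()

print(melted)
print(reshaped)
print(flat)
```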
8 Data Aggregation and Group Operations
9 Group operations
9.1 Split-apply-combine
key1
a -1.383319
b -0.576548
Name: data1, dtype: float64
key1 key2
a one -0.815467
two -2.519022
b one -0.926079
two -0.227016
Name: data1, dtype: float64
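The two outputs above have the shape of a group mean by one key and by two keys. A minimal split-apply-combine sketch follows, using fixed numbers in place of the random data so the means can be verified by hand; the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],       # made-up values
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],  # made-up values
})

# Split data1 by key1, apply mean to each group, combine into one Series
by_key1 = df['data1'].groupby(df['key1']).mean()

# Two keys give a Series with a two-level index
by_both = df['data1'].groupby([df['key1'], df['key2']]).mean()

print(by_key1)
print(by_both)
print(by_both.unstack())  # key2 levels spread across columns
```

Group 'a' holds 1.0, 2.0 and 5.0, so its mean is 8/3; group ('a', 'one') holds 1.0 and 5.0, so its mean is 3.0.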
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
display(people)
display(mapping)
people.groupby(mapping, axis=1).sum()
a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
display(people)
display(mapping)
map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).sum()
a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
display(people)
# Average marks per subject, split by whether each person passed Maths (mark > 40)
x = people.groupby(lambda idx: 'Math-Pass' if people['Maths'].loc[idx] > 40 else 'Math-Fail').mean()
display(x)
10 Data Aggregation
10.1 Any transformation that produces scalar values from arrays
[34]: import pandas as pd
df = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip', low_memory=False)  # low_memory=False avoids the mixed-type DtypeWarning
city_mileage = df.city08
display(df)
display(city_mileage)
barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 \
0 15.695714 0.0 0.0 0.0 19 0.0 0
1 29.964545 0.0 0.0 0.0 9 0.0 0
2 12.207778 0.0 0.0 0.0 23 0.0 0
3 29.964545 0.0 0.0 0.0 10 0.0 0
4 17.347895 0.0 0.0 0.0 17 0.0 0
… … … … … … … …
41139 14.982273 0.0 0.0 0.0 19 0.0 0
41140 14.330870 0.0 0.0 0.0 20 0.0 0
41141 15.695714 0.0 0.0 0.0 18 0.0 0
41142 15.695714 0.0 0.0 0.0 18 0.0 0
41143 18.311667 0.0 0.0 0.0 16 0.0 0
41141 0 0 0
41142 0 0 0
41143 0 0 0
city_mileage.mean()
18.369045304297103
[9]: 18.369045304297103
10.2 Aggregation
[40]: display(city_mileage.agg('mean'))
18.369045304297103
Name: city08, dtype: float64
a -0.865203
a 0.906041
b -0.229707
b 0.122065
a -0.865028
dtype: float64
[35]: -0.1863666066128976
↪randn(5)})
display(df)
df.data1.groupby(df.key1).agg('mean')
key1 key2 data1 data2
0 a one -1.833308 0.208652
1 a two -1.166889 0.221166
2 b one 1.031091 0.039213
3 b two -1.133420 0.288701
4 a one 0.313682 0.534899
[46]: key1
a -0.895505
b -0.051164
Name: data1, dtype: float64
[31]: display(df)
m = df.data1.groupby([df.key1, df.key2]).agg('mean')
display(m)
tips
[244 rows x 6 columns]
[49]: display(tips)
# Exercise: Average tip by smoker and day
day smoker
Fri No 2.812500
Yes 2.714000
Sat No 3.102889
Yes 2.875476
Sun No 3.167895
Yes 3.516842
Thur No 2.673778
Yes 3.030000
Name: tip, dtype: float64
11 Pivot table
• Data Summarization
• Aggregates a table of data by one or more keys, arranging the data in a rectangle with some
of the group keys along the rows and some along the columns.
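The description above can be tried on a tiny table. The sketch below uses a made-up six-row frame called `tips_small` (a hypothetical stand-in for the tips dataset) so that every cell can be checked by hand.

```python
import pandas as pd

# Hypothetical miniature of the tips data
tips_small = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat', 'Sun'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'tip': [2.0, 3.0, 1.0, 4.0, 2.0, 5.0],
})

# day along the rows, smoker along the columns, mean tip in each cell
mean_tip = tips_small.pivot_table(values='tip', index='day',
                                  columns='smoker', aggfunc='mean')

# Same layout with sums, plus 'All' totals for rows and columns
total_tip = tips_small.pivot_table(values='tip', index='day', columns='smoker',
                                   aggfunc='sum', margins=True)

print(mean_tip)
print(total_tip)
```

A combination with no rows, such as (Sun, Yes) here, shows up as NaN unless `fill_value` is passed.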
[51]: # pivot_table
tips.pivot_table(['tip'], index=['day'], columns='smoker')
[51]: tip
smoker No Yes
day
Fri 2.812500 2.714000
Sat 3.102889 2.875476
Sun 3.167895 3.516842
Thur 2.673778 3.030000
[52]: tip
smoker No Yes
day
Fri 11.25 40.71
Sat 139.63 120.77
Sun 180.57 66.82
Thur 120.32 51.51
[54]: # margins=True adds 'All' row and column totals
tips.pivot_table(['tip'], index=['day'], columns='smoker', aggfunc='sum', margins=True)
[54]: tip
smoker No Yes All
day
Fri 11.25 40.71 51.96
Sat 139.63 120.77 260.40
Sun 180.57 66.82 247.39
Thur 120.32 51.51 171.83
All 451.77 279.81 731.58
# With margins
pd.crosstab(tips.day, tips.smoker, margins=True)
smoker No Yes
day
Fri 4 15
Sat 45 42
Sun 57 19
Thur 45 17
12 Categorical Data
[59]: # Categorical Data
values = pd.Series(['apple', 'orange', 'apple','apple'] * 2)
display(values)
display(values.unique())
values.value_counts()
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
array(['apple', 'orange'], dtype=object)
[59]: apple 6
orange 2
dtype: int64
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64
[60]: 0 apple
1 orange
dtype: object
display(df)
fruit_cat = df['fruit'].astype('category')
fruit_cat
[66]: 0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
[69]: c = fruit_cat.values
display(type(c))
# Categories
display(c.categories)
# Codes
display(c.codes)
pandas.core.arrays.categorical.Categorical
Index(['apple', 'orange'], dtype='object')
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
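The relationship between categories and codes can be checked by decoding the codes back into the original labels. This is a small sketch; the name `decoded` is only illustrative.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
fruit_cat = values.astype('category')

c = fruit_cat.values          # the underlying pandas Categorical
codes = c.codes               # one small integer per row
categories = c.categories     # the distinct labels, stored once

# Each code is a position in categories, so take() rebuilds the labels
decoded = categories.take(codes)

print(codes)
print(categories)
print(decoded)
```

Storing one small integer per row plus a single copy of each label is what makes the category dtype compact for columns with few distinct values.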
[73]: # Pandas series with category data type
series = pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="category")
print(series)
0 a
1 b
2 c
3 d
4 e
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
print(dataFrame)
print(dataFrame.dtypes)
I II
0 a b
1 b c
2 c d
3 d e
I category
II category
dtype: object
1 b c
2 c c
3 a d
A B
0 a b
1 b c
2 c c
3 a d
[82]: 0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
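The `<` signs in the output indicate an ordered categorical. The cell that created it is missing from this copy; one way to obtain such a column, assuming `pd.CategoricalDtype` with `ordered=True`, is sketched below.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], name='A')

# Declare the full category set and its order, including the unused 'd'
dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
ordered = s.astype(dtype)

mask = ordered > 'a'      # comparisons follow the declared category order
largest = ordered.max()   # max and min are defined for ordered categoricals

print(ordered)
print(mask)
print(largest)
```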