
Lecture 8 - Data Wrangling using Pandas

August 29, 2024

1 Data Wrangling
• Hierarchical Indexing
• Combining and Merging datasets
• Reshaping and Pivoting

2 Hierarchical Indexing
[2]: import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

display(data)

display(data.index)

a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)

2.1 Indexing a hierarchically indexed Series
[14]: display(data.loc['a'])
display(data.loc[('a',2)])

1 -0.907795
2 -0.267606
3 0.496992
dtype: float64
-0.26760593571603775

2.2 Rearrange a multi-index Series into a DataFrame using unstack


[4]: display(data)
data.unstack()

a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64

[4]: 1 2 3
a 1.188749 -1.142034 1.159208
b -0.793279 NaN -1.449333
c 1.330207 1.185908 NaN
d NaN -0.130557 -0.300575

2.3 The inverse operation of unstack is stack


[ ]: data.unstack().stack()

[ ]: a 1 2.320428
2 0.791752
3 1.258071
b 1 -0.269220
3 0.524681
c 1 -0.506953
2 -0.955109
d 2 1.336005
3 0.556053
dtype: float64

2.4 Hierarchical index on a DataFrame
[10]: import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

frame.index.names = ['key1', 'key2']


frame.columns.names = ['state','color']
frame

[10]: state Ohio Colorado


color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

2.5 Selecting groups of columns


[14]: display(frame['Ohio'])
display(frame[('Ohio', 'Green')])

color Green Red


key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
key1 key2
a 1 0
2 3
b 1 6
2 9
Name: (Ohio, Green), dtype: int32

[11]: # Reordering index


frame.swaplevel('key1', 'key2')

[11]: state Ohio Colorado


color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11

[13]: # Sorting index
frame.sort_index(level='key2')  # equivalently: level=1

[13]: state Ohio Colorado


color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11

2.6 Hierarchical Indexing with a DataFrame’s columns


[15]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                            'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                            'd': [0, 1, 2, 0, 1, 2, 3]})

frame

[15]: a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3

[16]: frame2 = frame.set_index(['c', 'd'])

frame2

[16]: a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1

2.7 reset_index: the hierarchical index levels are moved into the columns
[17]: frame2.reset_index()

[17]: c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

3 Combining and Merging datasets


• Merge or join operations combine datasets by linking rows using one or more keys

[1]: # many-to-one merge


import pandas as pd
import numpy as np
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

display(df1)
display(df2)

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2

[11]: display(df1)
display(df2)
pd.merge(df1, df2)

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2

[11]: key data1 data2


0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

[8]: # Equivalent
pd.merge(df1, df2, on='key')

[8]: key data1 data2


0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

3.1 If the column names are different in each object, you can specify them separately:
[21]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                          'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],'data2': range(3)})

display(df3)
display(df4)

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6

rkey data2
0 a 0
1 b 1
2 d 2

[21]: lkey data1 rkey data2


0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0

3.2 Notice that the ‘c’ and ‘d’ keys are dropped


• By default, merge keeps only the keys found in both DataFrames (the intersection)
• Left merge: keep every row in the left DataFrame.
• Right merge: keep every row in the right DataFrame.
• Outer merge: union of the keys in the left and right DataFrames

Specify the join type using the ‘how’ argument

[18]: pd.merge(df1, df2, how='outer')

[18]: key data1 data2


0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
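The other join types in the bullet list above can be sketched the same way; a minimal example reusing df1 and df2 (the variable names left/right below are mine):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# how='left' keeps every row of df1: 'c' survives, with data2 = NaN
left = pd.merge(df1, df2, how='left')

# how='right' keeps every row of df2: 'd' survives, with data1 = NaN
right = pd.merge(df1, df2, how='right')

print(left)
print(right)
```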

3.3 Merging datasets: Many-to-many


[40]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

display(df1)

display(df2)

pd.merge(df1, df2, on='key')

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4

[40]: key data1 data2


0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2

3.3.1 Many-to-many joins form the Cartesian product of the rows. Since there were
three ‘b’ rows in the left DataFrame and two in the right one, there are six ‘b’
rows in the result.
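The row-count arithmetic stated above can be checked directly (a quick sketch with the same df1/df2):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

merged = pd.merge(df1, df2, on='key')

# 3 'b' rows on the left times 2 on the right -> 6 'b' rows in the result
print((merged['key'] == 'b').sum())  # 6
# 2 'a' rows times 2 -> 4; 'c' and 'd' have no match -> 10 rows total
print(len(merged))  # 10
```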

4 Merging on Index
• Keys on the index

[9]: left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                         index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])

display(left)
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                     index=['b', 'c', 'd', 'e'],
                     columns=['Missouri', 'Alabama'])

display(right)
pd.merge(left, right, how='outer', left_index=True, right_index=True)

Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0

[9]: Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

[10]: # Pandas join for merging


left.join(right, how='outer')

[10]: Ohio Nevada Missouri Alabama


a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

5 Concatenating Along an Axis


• Numpy vstack and hstack (Numpy concatenate)
• Pandas version: concat

5.1 Numpy concatenate


[29]: arr = np.arange(12).reshape((3, 4))

display(arr)

display(np.concatenate([arr, arr], axis=0)) # vstack


display(np.concatenate([arr, arr], axis=1)) # hstack

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])

5.2 Pandas concatenate
[30]: s1 = pd.Series([0, 1], index=['a', 'b'])

s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

s3 = pd.Series([5, 6], index=['f', 'g'])

display(s1)
display(s2)
display(s3)

pd.concat([s1, s2, s3])

a 0
b 1
dtype: int64
c 2
d 3
e 4
dtype: int64
f 5
g 6
dtype: int64

[30]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64

5.3 By default concat works along rows, producing another Series. For horizontal concat, pass axis=1
[31]: pd.concat([s1, s2, s3], axis=1)

[31]: 0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

[37]: s4 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'f', 'g'])
display(s1)
display(s4)
pd.concat([s1, s4], axis=1)

a 0
b 1
dtype: int64
a 0
b 1
f 2
g 3
dtype: int64

[37]: 0 1
a 0.0 0
b 1.0 1
f NaN 2
g NaN 3

5.4 Concatenate along overlapping index


[38]: pd.concat([s1, s4], axis=1, join='inner')

[38]: 0 1
a 0 0
b 1 1

5.5 Identifiability: use a hierarchical index via the keys argument


[39]: result = pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])

result

[39]: s1 a 0
b 1
s2 c 2
d 3
e 4
s3 f 5
g 6
dtype: int64

[40]: result.unstack()

[40]: a b c d e f g
s1 0.0 1.0 NaN NaN NaN NaN NaN
s2 NaN NaN 2.0 3.0 4.0 NaN NaN
s3 NaN NaN NaN NaN NaN 5.0 6.0

5.6 Hierarchical index: DataFrame


[41]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                         columns=['one', 'two'])

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

display(df1)

display(df2)

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

one two
a 0 1
b 2 3
c 4 5
three four
a 5 6
c 7 8

[41]: level1 level2


one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0

6 Filling in data (Combining data with overlap)


Consider NumPy’s where function, which performs the array-oriented equivalent of an if-else expression:

[45]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                    index=['f', 'e', 'd', 'c', 'b', 'a'])

b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'h', 'g', 'a'])

b.iloc[-1] = np.nan

display(a)

display(b)

a.fillna(b)

f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
h 3.0
g 4.0
a NaN
dtype: float64

[45]: f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64

[47]: a.combine_first(b) # fillna with non-overlapping index from b included


# Bonus: index alignment

[47]: a NaN
b 4.5
c 3.5
d 2.0
e 2.5
f 0.0
g 4.0
h 3.0
dtype: float64

6.1 For DataFrames, combine_first fills column by column


[48]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                          'b': [np.nan, 2., np.nan, 6.],
                          'c': range(2, 18, 4)})

df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

display(df1)

display(df2)

df1.combine_first(df2)

a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0

[48]: a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN

7 Reshaping with Hierarchical Indexing


7.1 stack: Converts columns of a dataframe into rows
[2]: import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado']),
                    columns=pd.Index(['one', 'two', 'three']))

display(data)

result = data.stack()

result

one two three


Ohio 0 1 2
Colorado 3 4 5

[2]: Ohio one 0


two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32

7.2 Unstack: hierarchically indexed Series to a DataFrame


[4]: result.unstack()

[4]: one two three


Ohio 0 1 2
Colorado 3 4 5

By default the innermost level is unstacked (same with stack). We can unstack a different level by
passing a level number or name:

[6]: result.unstack(0)

[6]: Ohio Colorado


one 0 3
two 1 4
three 2 5
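Unstacking by name requires named index levels; the example above used unnamed levels, so here is a variant with names added (the names state/number are mine, not from the original cell):

```python
import pandas as pd
import numpy as np

# Same data as above, but with named row and column indexes
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()

# Unstacking by level name is equivalent to unstacking by level number
print(result.unstack(level='state'))
```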

7.3 Stacking and unstacking series


[7]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])

display(data2)

data2.unstack()

one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64

[7]: a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0

7.4 Stacking filters out missing values by default (pass dropna=False to keep them)
[18]: data2.unstack().stack(dropna=False)

[18]: one a 0.0


b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64

7.5 Pivoting “Long” to “Wide” Format


[10]: ldata = pd.DataFrame({'date': ["1959-03-31", "1959-03-31", "1959-03-31",
                                     "1959-06-30", "1959-06-30", "1959-06-30",
                                     "1959-09-30", "1959-09-30", "1959-09-30"],
                            'item': ["realgdp", "infl", "unemp"] * 3,
                            'value': [2710.349, 0.000, 5.800, 2778.801, 2.340,
                                      5.100, 2775.488, 2.740, 5.300]})

display(ldata)

date item value


0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300

7.6 Convert to wide format using pivot()


• index: What should be the rows of the dataframe?
• columns: What should be the columns of the dataframe?
• values: Values for each row and column.

[43]: pivoted = ldata.pivot(index='date', columns='item', values='value')

pivoted

[43]: item infl realgdp unemp


date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3

7.7 Pivot for multiple columns


[44]: ldata['value2'] = np.random.randn(len(ldata))

ldata

[44]: date item value value2


0 1959-03-31 realgdp 2710.349 -0.598456
1 1959-03-31 infl 0.000 -0.570555
2 1959-03-31 unemp 5.800 0.657731
3 1959-06-30 realgdp 2778.801 -1.115266
4 1959-06-30 infl 2.340 0.952629
5 1959-06-30 unemp 5.100 0.630104
6 1959-09-30 realgdp 2775.488 -0.613216
7 1959-09-30 infl 2.740 -0.451026
8 1959-09-30 unemp 5.300 -1.382464

[54]: pivoted = ldata.pivot(index='date', columns='item', values=['value', 'value2'])


pivoted
# Can you do the same thing using set_index and unstack?

[54]: value value2


item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 -0.570555 -0.598456 0.657731
1959-06-30 2.34 2778.801 5.1 0.952629 -1.115266 0.630104
1959-09-30 2.74 2775.488 5.3 -0.451026 -0.613216 -1.382464
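Answering the question posed in the cell above: pivot is essentially set_index on the row/column keys followed by unstack. A sketch rebuilding ldata (with a fresh random value2 column):

```python
import pandas as pd
import numpy as np

ldata = pd.DataFrame({'date': ["1959-03-31", "1959-03-31", "1959-03-31",
                               "1959-06-30", "1959-06-30", "1959-06-30",
                               "1959-09-30", "1959-09-30", "1959-09-30"],
                      'item': ["realgdp", "infl", "unemp"] * 3,
                      'value': [2710.349, 0.000, 5.800, 2778.801, 2.340,
                                5.100, 2775.488, 2.740, 5.300]})
ldata['value2'] = np.random.randn(len(ldata))

by_pivot = ldata.pivot(index='date', columns='item',
                       values=['value', 'value2'])
# set_index moves the keys into a hierarchical row index;
# unstack rotates the 'item' level into the columns
by_unstack = ldata.set_index(['date', 'item']).unstack('item')

print(by_pivot.equals(by_unstack))
```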

7.8 Pivoting “Wide” to “Long” Format


7.8.1 Inverse operation to pivot for DataFrames is pandas.melt.
[20]: df = pd.DataFrame.from_dict(
          {"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
           "8/4/2020": np.random.randint(10, 200, size=(1, 4))[0],
           "8/5/2020": np.random.randint(12, 200, size=(1, 4))[0],
           "8/6/2020": np.random.randint(12, 200, size=(1, 4))[0],
           "8/7/2020": np.random.randint(12, 200, size=(1, 4))[0]},
          orient='columns')

display(df)
melted = df.melt()

display(melted.head())
melted_id = df.melt('Name') # Use name as identifier variable
melted_id.head()

Name 8/4/2020 8/5/2020 8/6/2020 8/7/2020


0 Liho Liho 49 123 98 168
1 Tompkins 167 28 81 89
2 The Square 21 164 110 22
3 Chambers 86 151 98 187
variable value
0 Name Liho Liho
1 Name Tompkins
2 Name The Square
3 Name Chambers
4 8/4/2020 49

[20]: Name variable value


0 Liho Liho 8/4/2020 49
1 Tompkins 8/4/2020 167
2 The Square 8/4/2020 21
3 Chambers 8/4/2020 86
4 Liho Liho 8/5/2020 123

Using pivot, we can reshape back to the original layout:

[71]: reshaped = melted.pivot(index='Name', columns='variable', values='value')


reshaped

[71]: variable 8/4/2020 8/5/2020 8/6/2020 8/7/2020


Name
Chambers 180 78 81 125
Liho Liho 165 51 13 166
The Square 186 123 172 148
Tompkins 97 156 92 44

Since the result of pivot creates an index from the column used as the row labels, we may want to
use reset_index to move the data back into a column:

[72]: reshaped.reset_index()

[72]: variable Name 8/4/2020 8/5/2020 8/6/2020 8/7/2020


0 Chambers 180 78 81 125
1 Liho Liho 165 51 13 166
2 The Square 186 123 172 148
3 Tompkins 97 156 92 44

8 Data Aggregation and Group Operations

9 Group operations
9.1 Split-apply-combine

[29]: df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                         'key2': ['one', 'two', 'one', 'two', 'one'],
                         'data1': np.random.randn(5),
                         'data2': np.random.randn(5)})
display(df)
# Mean of data1 with index key1
display(df['data1'].groupby(df['key1']).mean())
# Mean of data1 with index key1 and key2
display(df['data1'].groupby([df['key1'],df['key2']]).mean())

key1 key2 data1 data2


0 a one -0.822739 -0.413697
1 a two -2.519022 -0.283733
2 b one -0.926079 1.126704
3 b two -0.227016 0.358026
4 a one -0.808195 0.607594

key1
a -1.383319
b -0.576548
Name: data1, dtype: float64
key1 key2
a one -0.815467
two -2.519022
b one -0.926079
two -0.227016
Name: data1, dtype: float64
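The split-apply-combine pipeline can be written out by hand to see what groupby is doing (a sketch of the idea, not how pandas implements it internally):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': np.random.randn(5)})

# Split: one sub-frame per key value
pieces = {key: group for key, group in df.groupby('key1')}
# Apply: reduce each piece to a scalar
means = {key: group['data1'].mean() for key, group in pieces.items()}
# Combine: reassemble into a Series indexed by key
combined = pd.Series(means).sort_index()

# Matches the built-in one-liner
builtin = df.groupby('key1')['data1'].mean()
print(np.allclose(combined, builtin))
```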

[24]: # Grouping with dictionaries


people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red',
           'f': 'orange'}

display(people)
display(mapping)
people.groupby(mapping, axis=1).sum()  # axis=1 here is deprecated in pandas >= 2.1

a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

[24]: blue red


Joe -0.223154 -1.900546
Steve 1.288503 0.303820
Wes 0.187868 -1.909170
Jim 2.059853 0.876024
Travis -0.758102 -1.435106

[25]: # Grouping with series


mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red',
           'f': 'orange'}

display(people)
display(mapping)
map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).sum()

a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

[25]: blue red


Joe -0.223154 -1.900546
Steve 1.288503 0.303820
Wes 0.187868 -1.909170
Jim 2.059853 0.876024
Travis -0.758102 -1.435106

[33]: # Grouping with Python functions


people = pd.DataFrame(np.random.uniform(0, 100, (5, 5)),
                      columns=['Maths', 'Physics', 'Chemistry', 'English',
                               'Computer'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

display(people)
# Find average marks for each subject, grouped by pass/fail in Maths
x = people.groupby(lambda name: 'Math-Pass' if people['Maths'].loc[name] > 40
                   else 'Math-Fail').mean()

display(x)

Maths Physics Chemistry English Computer


Joe 76.600581 69.734027 88.207766 93.656526 54.241596
Steve 49.510219 21.429238 73.585302 15.075714 84.818743
Wes 36.227625 74.108171 75.587162 63.813382 56.984336
Jim 47.317391 30.334947 75.065363 67.022803 43.677969
Travis 12.667923 31.145376 69.089946 71.147333 5.321583
Maths Physics Chemistry English Computer
Math-Fail 24.447774 52.626773 72.338554 67.480357 31.152960
Math-Pass 57.809397 40.499404 78.952811 58.585014 60.912769
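The same grouping can be expressed without a function by passing a Series of labels to groupby — pandas aligns it on the index. A sketch (the pass mark of 40 matches the lambda above; the variable name labels is mine):

```python
import pandas as pd
import numpy as np

people = pd.DataFrame(np.random.uniform(0, 100, (5, 5)),
                      columns=['Maths', 'Physics', 'Chemistry', 'English',
                               'Computer'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Boolean mask -> string labels, then group by the label Series
labels = (people['Maths'] > 40).map({True: 'Math-Pass', False: 'Math-Fail'})
result = people.groupby(labels).mean()
print(result)
```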

10 Data Aggregation
10.1 Any transformation that produces scalar values from arrays
[34]: import pandas as pd
df = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')

city_mileage=df.city08
display(df)
display(city_mileage)

C:\Users\ANUP\AppData\Local\Temp\ipykernel_23528\1763585868.py:2: DtypeWarning:
Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on
import or set low_memory=False.
df = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/veh
icles.csv.zip')

barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 \
0 15.695714 0.0 0.0 0.0 19 0.0 0
1 29.964545 0.0 0.0 0.0 9 0.0 0
2 12.207778 0.0 0.0 0.0 23 0.0 0
3 29.964545 0.0 0.0 0.0 10 0.0 0
4 17.347895 0.0 0.0 0.0 17 0.0 0
… … … … … … … …
41139 14.982273 0.0 0.0 0.0 19 0.0 0
41140 14.330870 0.0 0.0 0.0 20 0.0 0
41141 15.695714 0.0 0.0 0.0 18 0.0 0
41142 15.695714 0.0 0.0 0.0 18 0.0 0
41143 18.311667 0.0 0.0 0.0 16 0.0 0

cityA08U cityCD cityE … mfrCode c240Dscr charge240b c240bDscr \


0 0.0 0.0 0.0 … NaN NaN 0.0 NaN
1 0.0 0.0 0.0 … NaN NaN 0.0 NaN
2 0.0 0.0 0.0 … NaN NaN 0.0 NaN
3 0.0 0.0 0.0 … NaN NaN 0.0 NaN
4 0.0 0.0 0.0 … NaN NaN 0.0 NaN
… … … … … … … … …
41139 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41140 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41141 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41142 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41143 0.0 0.0 0.0 … NaN NaN 0.0 NaN

createdOn modifiedOn startStop \


0 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
1 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
2 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
3 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
4 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
… … … …
41139 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41140 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41141 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41142 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41143 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN

phevCity phevHwy phevComb


0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
… … … …
41139 0 0 0
41140 0 0 0
41141 0 0 0
41142 0 0 0
41143 0 0 0

[41144 rows x 83 columns]


0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64

[9]: # Find the mean of the mileage


import numpy as np
display(np.mean(city_mileage))

city_mileage.mean()

18.369045304297103

[9]: 18.369045304297103

10.2 Aggregation
[40]: display(city_mileage.agg('mean'))

# There are many aggregation functions available


df[['barrels08','city08']].agg('mean')

18.369045304297103

[40]: barrels08 17.283900


city08 18.369045
dtype: float64

[12]: # Aggregate works with a list of functions


city_mileage.agg(['mean', 'median', 'max', 'min'])

[12]: mean 18.369045


median 17.000000
max 150.000000
min 6.000000

Name: city08, dtype: float64

[41]: df[['barrels08','city08']].agg(['mean', 'median'])

[41]: barrels08 city08


mean 17.2839 18.369045
median 16.4805 17.000000

[42]: # Using your own functions


# any function that takes as input an array and returns a number
def max_sub_min(arr): return arr.max() - arr.min()
df[['barrels08','city08']].agg(max_sub_min)

[42]: barrels08 47.027143


city08 144.000000
dtype: float64

[35]: # Aggregating series


import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(5), index=['a', 'a', 'b', 'b', 'a'])
display(s)
s.agg('mean')

a -0.865203
a 0.906041
b -0.229707
b 0.122065
a -0.865028
dtype: float64

[35]: -0.1863666066128976

[38]: # We may want to find the mean by index


s.groupby(s.index).agg(['mean', 'median'])

[38]: mean median


a -0.274730 -0.865028
b -0.053821 -0.053821

10.3 Group aggregation in DataFrame


[46]: df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                         'key2': ['one', 'two', 'one', 'two', 'one'],
                         'data1': np.random.randn(5),
                         'data2': np.random.randn(5)})

display(df)
df.data1.groupby(df.key1).agg('mean')

key1 key2 data1 data2
0 a one -1.833308 0.208652
1 a two -1.166889 0.221166
2 b one 1.031091 0.039213
3 b two -1.133420 0.288701
4 a one 0.313682 0.534899

[46]: key1
a -0.895505
b -0.051164
Name: data1, dtype: float64

[31]: display(df)
m = df.data1.groupby([df.key1, df.key2]).agg('mean')
display(m)

key1 key2 data1 data2


0 a one -0.200839 -0.479937
1 a two -1.806834 -0.654038
2 b one 0.140283 1.164648
3 b two -0.526533 -0.483947
4 a one -1.776828 -1.599953
key1 key2
a one -0.988833
two -1.806834
b one 0.140283
two -0.526533
Name: data1, dtype: float64

10.4 Pivot Table


[48]: tips = pd.read_csv('https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/tips.csv')

tips

[48]: total_bill tip smoker day time size


0 16.99 1.01 No Sun Dinner 2
1 10.34 1.66 No Sun Dinner 3
2 21.01 3.50 No Sun Dinner 3
3 23.68 3.31 No Sun Dinner 2
4 24.59 3.61 No Sun Dinner 4
.. … … … … … …
239 29.03 5.92 No Sat Dinner 3
240 27.18 2.00 Yes Sat Dinner 2
241 22.67 2.00 Yes Sat Dinner 2
242 17.82 1.75 No Sat Dinner 2
243 18.78 3.00 No Thur Dinner 2

[244 rows x 6 columns]

[49]: display(tips)
# Exercise: Average tip by smoker and day

total_bill tip smoker day time size


0 16.99 1.01 No Sun Dinner 2
1 10.34 1.66 No Sun Dinner 3
2 21.01 3.50 No Sun Dinner 3
3 23.68 3.31 No Sun Dinner 2
4 24.59 3.61 No Sun Dinner 4
.. … … … … … …
239 29.03 5.92 No Sat Dinner 3
240 27.18 2.00 Yes Sat Dinner 2
241 22.67 2.00 Yes Sat Dinner 2
242 17.82 1.75 No Sat Dinner 2
243 18.78 3.00 No Thur Dinner 2

[244 rows x 6 columns]

[50]: m = tips.tip.groupby([tips.day, tips.smoker]).agg('mean')


display(m)
m.unstack()

day smoker
Fri No 2.812500
Yes 2.714000
Sat No 3.102889
Yes 2.875476
Sun No 3.167895
Yes 3.516842
Thur No 2.673778
Yes 3.030000
Name: tip, dtype: float64

[50]: smoker No Yes


day
Fri 2.812500 2.714000
Sat 3.102889 2.875476
Sun 3.167895 3.516842
Thur 2.673778 3.030000

11 Pivot table
• Data Summarization
• Aggregates a table of data by one or more keys, arranging the data in a rectangle with some
of the group keys along the rows and some along the columns.

[51]: # pivot_table
tips.pivot_table(['tip'], index=['day'], columns='smoker')

[51]: tip
smoker No Yes
day
Fri 2.812500 2.714000
Sat 3.102889 2.875476
Sun 3.167895 3.516842
Thur 2.673778 3.030000

[52]: # Find total tips


tips.pivot_table(['tip'], index=['day'], columns='smoker', aggfunc='sum')

[52]: tip
smoker No Yes
day
Fri 11.25 40.71
Sat 139.63 120.77
Sun 180.57 66.82
Thur 120.32 51.51

[54]: # margins=True adds row/column subtotals
tips.pivot_table(['tip'], index=['day'], columns='smoker', aggfunc='sum',
                 margins=True)

[54]: tip
smoker No Yes All
day
Fri 11.25 40.71 51.96
Sat 139.63 120.77 260.40
Sun 180.57 66.82 247.39
Thur 120.32 51.51 171.83
All 451.77 279.81 731.58

11.1 Cross-tabulation: Crosstab


[55]: # A special case of a pivot table that computes group frequencies
# Number of smokers/non-smokers on a day
display(pd.crosstab(tips.day, tips.smoker))

# With margins
pd.crosstab(tips.day, tips.smoker, margins=True)

smoker No Yes
day
Fri 4 15
Sat 45 42
Sun 57 19
Thur 45 17

[55]: smoker No Yes All


day
Fri 4 15 19
Sat 45 42 87
Sun 57 19 76
Thur 45 17 62
All 151 93 244

12 Categorical Data
[59]: # Categorical Data
values = pd.Series(['apple', 'orange', 'apple','apple'] * 2)
display(values)
display(values.unique())
values.value_counts()

0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
array(['apple', 'orange'], dtype=object)

[59]: apple 6
orange 2
dtype: int64

[60]: # Categorical or Dictionary encoded representation


values = pd.Series([0, 1, 0, 0] * 2) # category codes
dim = pd.Series(['apple', 'orange']) # categories
display(values)
dim

0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64

[60]: 0 apple
1 orange
dtype: object
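The two Series above form a dictionary-encoded representation; take decodes the codes back into values by indexing into the categories array:

```python
import pandas as pd

values = pd.Series([0, 1, 0, 0] * 2)   # category codes
dim = pd.Series(['apple', 'orange'])   # categories

# Each code selects the category at that position
decoded = dim.take(values)
print(list(decoded))  # ['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
```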

[66]: # Categorical type in pandas


fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

display(df)
fruit_cat = df['fruit'].astype('category')
fruit_cat

basket_id fruit count weight


0 0 apple 3 3.748114
1 1 orange 9 1.403637
2 2 apple 12 3.814795
3 3 apple 12 2.893883
4 4 apple 4 1.093364
5 5 orange 5 2.636002
6 6 apple 4 0.844373
7 7 apple 13 3.923541

[66]: 0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

[69]: c = fruit_cat.values
display(type(c))
# Categories
display(c.categories)
# Codes
display(c.codes)

pandas.core.arrays.categorical.Categorical
Index(['apple', 'orange'], dtype='object')
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

[73]: # Pandas series with category data type
series = pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="category")
print(series)

0 a
1 b
2 c
3 d
4 e
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

[78]: # Creating a DataFrame with categorical columns
dataFrame = pd.DataFrame({"I": list("abcd"), "II": list("bcde")},
                         dtype="category")

print(dataFrame)
print(dataFrame.dtypes)

I II
0 a b
1 b c
2 c d
3 d e
I category
II category
dtype: object

[75]: # There is no ordering inside categorical variables


# Order can be imposed
fruit_cat = df['fruit'].astype('category')
fruit_cat.values.as_ordered()

[75]: ['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']


Categories (2, object): ['apple' < 'orange']

[82]: # Creating new categorical data type


from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
display(cat_type)
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
display(df)
df_cat = df.astype(cat_type)
display(df_cat)
df_cat.A

CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)


A B
0 a b
1 b c
2 c c
3 a d
A B
0 a b
1 b c
2 c c
3 a d

[82]: 0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
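Ordered categoricals support order-aware operations: comparisons, min/max, and sorting follow the declared category order. A short sketch reusing cat_type (the variable name s is mine):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
s = pd.Series(list("abca")).astype(cat_type)

# Element-wise comparison uses the order 'a' < 'b' < 'c' < 'd'
print(list(s > 'b'))     # [False, False, True, False]
print(s.min(), s.max())  # a c
```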
