Lecture 8 - Data Wrangling using Pandas

August 29, 2024

1 Data Wrangling
• Hierarchical Indexing
• Combining and Merging datasets
• Reshaping and Pivoting

2 Hierarchical Indexing
[2]: import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

display(data)

display(data.index)

a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)

2.1 Indexing of hierarchical indexed series
[14]: display(data.loc['a'])
display(data.loc[('a',2)])

1 -0.907795
2 -0.267606
3 0.496992
dtype: float64
-0.26760593571603775
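The selections above can be sketched with fixed values, which stand in for the random data (an assumption made here so the results are reproducible). The last line, `data.loc[:, 2]`, is not in the cell above; it selects on the inner level instead of the outer one.

```python
import pandas as pd

# Fixed values replace the random data used in the lecture (assumption).
data = pd.Series(
    range(9),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer = data.loc['a']        # partial indexing: every row under outer label 'a'
single = data.loc[('a', 2)]  # full key: a single scalar value
inner = data.loc[:, 2]       # every outer label that has inner label 2

print(outer)
print(single)
print(inner)
```

Partial indexing returns a Series indexed by the remaining level, while a full key returns a scalar.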

2.2 Rearrange a multi-index Series into a DataFrame using its unstack method

[4]: display(data)
data.unstack()

a 1 1.188749
2 -1.142034
3 1.159208
b 1 -0.793279
3 -1.449333
c 1 1.330207
2 1.185908
d 2 -0.130557
3 -0.300575
dtype: float64

[4]: 1 2 3
a 1.188749 -1.142034 1.159208
b -0.793279 NaN -1.449333
c 1.330207 1.185908 NaN
d NaN -0.130557 -0.300575

2.3 The inverse operation of unstack is stack

[ ]: data.unstack().stack()

[ ]: a 1 2.320428
2 0.791752
3 1.258071
b 1 -0.269220
3 0.524681
c 1 -0.506953
2 -0.955109
d 2 1.336005
3 0.556053
dtype: float64
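A small check that stack undoes unstack, using made-up values rather than the random data. The `dropna()` call is an addition: it keeps the comparison independent of how a given pandas version treats the NaN cells that unstack introduces.

```python
import pandas as pd

# Made-up values; the (a, 3) and (b, 2) combinations are deliberately missing.
data = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 3]],
)

wide = data.unstack()               # inner level becomes columns, gaps become NaN
round_trip = wide.stack().dropna()  # back to a Series, without the filler NaN cells

print(wide)
print(round_trip)
```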

2.4 Hierarchical index on a DataFrame
[10]: import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

[10]: state Ohio Colorado

color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

2.5 Selecting groups of columns

[14]: display(frame['Ohio'])
display(frame[('Ohio', 'Green')])

color Green Red

key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
key1 key2
a 1 0
2 3
b 1 6
2 9
Name: (Ohio, Green), dtype: int32

[11]: # Reordering index

frame.swaplevel('key1', 'key2')

[11]: state Ohio Colorado

color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11

[13]: # Sorting index
frame.sort_index(level='key2')#level=1

[13]: state Ohio Colorado

color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11

2.6 Building a hierarchical index from a DataFrame’s columns (set_index)

[15]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame

[15]: a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3

[16]: frame2 = frame.set_index(['c', 'd'])

frame2

[16]: a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1

2.7 reset_index: the hierarchical index levels are moved into the columns
[17]: frame2.reset_index()

[17]: c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

3 Combining and Merging datasets

• Merge or join operations combine datasets by linking rows using one or more keys

[1]: # many-to-one merge

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

display(df1)
display(df2)

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2

[11]: display(df1)
display(df2)
pd.merge(df1, df2)

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 d 2

[11]: key data1 data2

0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

[8]: # Equivalent
pd.merge(df1, df2, on='key')

[8]: key data1 data2

0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

3.1 If the column names are different in each object, you can specify them separately:
[21]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],'data2': range(3)})

display(df3)
display(df4)

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6

rkey data2
0 a 0
1 b 1
2 d 2

[21]: lkey data1 rkey data2

0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0

3.2 Notice that c and d ‘keys’ are ignored

• By default, merge performs an inner join: only keys present in both DataFrames are kept.
• Left merge: keep every row of the left DataFrame.
• Right merge: keep every row of the right DataFrame.
• Outer merge: keep the union of the keys in the left and right DataFrames.

Specify the merge type with the ‘how’ argument:

[18]: pd.merge(df1, df2, how='outer')

[18]: key data1 data2

0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
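To compare the four merge types side by side, the sketch below reuses df1 and df2 from the many-to-one example and records the row count for each value of `how`. The `indicator=True` argument is an addition not used in the lecture; it adds a `_merge` column that says which side each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Row count of the result for each merge type.
sizes = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='key', how=how)
    sizes[how] = len(merged)
    print(how, len(merged), sorted(merged['key'].unique()))

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
origin = outer['_merge'].value_counts().to_dict()
print(origin)
```

The inner merge drops both 'c' and 'd', the left merge keeps 'c', the right merge keeps 'd', and the outer merge keeps both.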

3.3 Merging datasets: Many-to-many

[40]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

display(df1)

display(df2)

pd.merge(df1, df2, on='key')

key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4

[40]: key data1 data2

0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2

3.3.1 Many-to-many joins form the Cartesian product of the rows. Since there were
three ‘b’ rows in the left DataFrame and two in the right one, there are six ‘b’
rows in the result.
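The Cartesian-product rule can be checked directly: for each key, the number of rows in the merged result should equal the count on the left times the count on the right. The sketch below uses the same df1 and df2 as the many-to-many cell; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

merged = pd.merge(df1, df2, on='key')

left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()
# Rows per key = (count on the left) * (count on the right).
# Keys missing on one side give NaN here and drop out of an inner merge.
expected = (left_counts * right_counts).dropna().astype(int)
actual = merged['key'].value_counts()

print(expected.sort_index())
print(actual.sort_index())
```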

4 Merging on Index
• Keys on the index

[9]: left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])

display(left)
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                     index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

display(right)
pd.merge(left, right, how='outer', left_index=True, right_index=True)

Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0

[9]: Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

[10]: # Pandas join for merging

left.join(right, how='outer')

[10]: Ohio Nevada Missouri Alabama

a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

5 Concatenating Along an Axis

• Numpy vstack and hstack (Numpy concatenate)
• Pandas version: concat

5.1 Numpy concatenate

[29]: arr = np.arange(12).reshape((3, 4))

display(arr)

display(np.concatenate([arr, arr], axis=0)) # vstack

display(np.concatenate([arr, arr], axis=1)) # hstack

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])

5.2 Pandas concatenate
[30]: s1 = pd.Series([0, 1], index=['a', 'b'])

s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

s3 = pd.Series([5, 6], index=['f', 'g'])

display(s1)
display(s2)
display(s3)

pd.concat([s1, s2, s3])

a 0
b 1
dtype: int64
c 2
d 3
e 4
dtype: int64
f 5
g 6
dtype: int64

[30]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64

5.3 By default concat works along rows, producing another Series. For horizontal concat, pass axis=1
[31]: pd.concat([s1, s2, s3], axis=1)

[31]: 0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

[37]: s4 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'f', 'g'])
display(s1)
display(s4)
pd.concat([s1, s4], axis=1)

a 0
b 1
dtype: int64
a 0
b 1
f 2
g 3
dtype: int64

[37]: 0 1
a 0.0 0
b 1.0 1
f NaN 2
g NaN 3

5.4 Concatenate along overlapping index

[38]: pd.concat([s1, s4], axis=1, join='inner')

[38]: 0 1
a 0 0
b 1 1

5.5 Identifiability: Use hierarchical index through keys

[39]: result = pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])

result

[39]: s1 a 0
b 1
s2 c 2
d 3
e 4
s3 f 5
g 6
dtype: int64

[40]: result.unstack()

[40]: a b c d e f g
s1 0.0 1.0 NaN NaN NaN NaN NaN
s2 NaN NaN 2.0 3.0 4.0 NaN NaN
s3 NaN NaN NaN NaN NaN 5.0 6.0

5.6 Hierarchical index: DataFrame

[41]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

display(df1)

display(df2)

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

one two
a 0 1
b 2 3
c 4 5
three four
a 5 6
c 7 8

[41]: level1 level2

one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0

6 Filling in data (Combining data with overlap)

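Consider NumPy’s where function, which performs the array-oriented equivalent of an if-else expression. The pandas counterpart used below is fillna, which fills the missing values of one Series from another after aligning the two on their index labels: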

[45]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'h', 'g', 'a'])
# Use iloc for positional access; b[-1] is ambiguous with a string index
b.iloc[-1] = np.nan

display(a)
display(b)
a.fillna(b)

f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
h 3.0
g 4.0
a NaN
dtype: float64

[45]: f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
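For comparison, here is a sketch of the NumPy where approach mentioned at the start of this section, next to fillna. The names `by_position` and `by_label` are illustrative. np.where works position by position and returns a plain array, whereas fillna aligns on index labels and returns a Series.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
              index=['f', 'e', 'd', 'h', 'g', 'a'])

# np.where ignores the index: where a is missing take b, otherwise keep a.
by_position = np.where(pd.isna(a), b, a)

# fillna aligns b to a's labels before filling.
by_label = a.fillna(b)

print(by_position)
print(by_label)
```

The two agree here only because the missing entries of a sit at labels f, d and a, which occupy the same positions in b. With differently ordered indexes the positional version would pick up the wrong values.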

[47]: a.combine_first(b) # fillna with non-overlapping index from b included

# Bonus: index alignment

[47]: a NaN
b 4.5
c 3.5
d 2.0
e 2.5
f 0.0
g 4.0
h 3.0
dtype: float64

6.1 For DataFrames, column by column filling

[48]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

display(df1)
display(df2)
df1.combine_first(df2)

a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0

[48]: a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN

7 Reshaping with Hierarchical Indexing

7.1 stack: Converts columns of a dataframe into rows
[2]: import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado']),
                    columns=pd.Index(['one', 'two', 'three']))

display(data)

result = data.stack()

result

one two three

Ohio 0 1 2
Colorado 3 4 5

[2]: Ohio one 0

two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32

7.2 Unstack: Hierarchically indexed Series to a dataframe

[4]: result.unstack()

[4]: one two three

Ohio 0 1 2
Colorado 3 4 5

By default the innermost level is unstacked (same with stack). We can unstack a different level by
passing a level number or name:

[6]: result.unstack(0)

[6]: Ohio Colorado

one 0 3
two 1 4
three 2 5

7.3 Stacking and unstacking series

[7]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])

display(data2)

data2.unstack()

one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64

[7]: a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0

7.4 Stack filters out missing values by default; pass dropna=False to keep them
[18]: data2.unstack().stack(dropna=False)

[18]: one a 0.0

b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64

7.5 Pivoting “Long” to “Wide” Format

[10]: ldata = pd.DataFrame({
    'date': ["1959-03-31", "1959-03-31", "1959-03-31",
             "1959-06-30", "1959-06-30", "1959-06-30",
             "1959-09-30", "1959-09-30", "1959-09-30"],
    'item': ["realgdp", "infl", "unemp"] * 3,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100,
              2775.488, 2.740, 5.300]})

display(ldata)

date item value

0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300

7.6 Convert to wide format using pivot()

• index: What should be the rows of the dataframe?
• columns: What should be the columns of the dataframe?
• values: Values for each row and column.

[43]: pivoted = ldata.pivot(index='date', columns='item', values='value')

pivoted

[43]: item infl realgdp unemp

date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3

7.7 Pivot for multiple columns

[44]: ldata['value2'] = np.random.randn(len(ldata))

ldata

[44]: date item value value2

0 1959-03-31 realgdp 2710.349 -0.598456
1 1959-03-31 infl 0.000 -0.570555
2 1959-03-31 unemp 5.800 0.657731
3 1959-06-30 realgdp 2778.801 -1.115266
4 1959-06-30 infl 2.340 0.952629
5 1959-06-30 unemp 5.100 0.630104
6 1959-09-30 realgdp 2775.488 -0.613216
7 1959-09-30 infl 2.740 -0.451026
8 1959-09-30 unemp 5.300 -1.382464

[54]: pivoted = ldata.pivot(index='date', columns='item', values=['value', 'value2'])

pivoted
# Can you do the same thing using set_index and unstack?

[54]: value value2

item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 -0.570555 -0.598456 0.657731
1959-06-30 2.34 2778.801 5.1 0.952629 -1.115266 0.630104
1959-09-30 2.74 2775.488 5.3 -0.451026 -0.613216 -1.382464
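One possible answer to the question in the cell above, sketched on a shortened copy of ldata with a single value column: move date and item into a two-level index with set_index, then unstack the item level into columns.

```python
import pandas as pd

# Shortened copy of ldata (first two dates only).
ldata = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100],
})

pivoted = ldata.pivot(index='date', columns='item', values='value')

# Same reshaping in two explicit steps.
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted)
print(unstacked)
```

For several value columns, selecting a list such as `[['value', 'value2']]` before calling unstack should give the two-level column layout shown above.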

7.8 Pivoting “Wide” to “Long” Format

7.8.1 Inverse operation to pivot for DataFrames is pandas.melt.
[20]: df = pd.DataFrame.from_dict({
    "Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
    "8/4/2020": np.random.randint(10, 200, size=(1, 4))[0],
    "8/5/2020": np.random.randint(12, 200, size=(1, 4))[0],
    "8/6/2020": np.random.randint(12, 200, size=(1, 4))[0],
    "8/7/2020": np.random.randint(12, 200, size=(1, 4))[0]}, orient='columns')

display(df)
melted = df.melt()
display(melted.head())
melted_id = df.melt('Name') # Use name as identifier variable
melted_id.head()

Name 8/4/2020 8/5/2020 8/6/2020 8/7/2020

0 Liho Liho 49 123 98 168
1 Tompkins 167 28 81 89
2 The Square 21 164 110 22
3 Chambers 86 151 98 187
variable value
0 Name Liho Liho
1 Name Tompkins
2 Name The Square
3 Name Chambers
4 8/4/2020 49

[20]: Name variable value

0 Liho Liho 8/4/2020 49
1 Tompkins 8/4/2020 167
2 The Square 8/4/2020 21
3 Chambers 8/4/2020 86
4 Liho Liho 8/5/2020 123

Using pivot, we can reshape back to the original layout:

[71]: # Pivot the frame melted with 'Name' as identifier (plain melted has no 'Name' column)
reshaped = melted_id.pivot(index='Name', columns='variable', values='value')

reshaped

[71]: variable 8/4/2020 8/5/2020 8/6/2020 8/7/2020

Name
Chambers 180 78 81 125
Liho Liho 165 51 13 166
The Square 186 123 172 148
Tompkins 97 156 92 44

Since the result of pivot creates an index from the column used as the row labels, we may want to
use reset_index to move the data back into a column:

[72]: reshaped.reset_index()

[72]: variable Name 8/4/2020 8/5/2020 8/6/2020 8/7/2020

0 Chambers 180 78 81 125
1 Liho Liho 165 51 13 166
2 The Square 186 123 172 148
3 Tompkins 97 156 92 44
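The full round trip (wide to long with melt, back to wide with pivot, then reset_index) can be sketched with fixed counts, here copied from the first two date columns of the first display above in place of fresh random numbers. Clearing `columns.name` is an extra step not in the lecture; it removes the leftover 'variable' label.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
    '8/4/2020': [49, 167, 21, 86],
    '8/5/2020': [123, 28, 164, 151],
})

melted_id = df.melt(id_vars='Name')   # long format: Name, variable, value
reshaped = melted_id.pivot(index='Name', columns='variable', values='value')

restored = reshaped.reset_index()
restored.columns.name = None          # drop the leftover 'variable' axis label

# pivot sorts the rows by Name, so sort the original the same way to compare.
original_sorted = df.sort_values('Name').reset_index(drop=True)

print(restored)
```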

8 Data Aggregation and Group Operations

9 Group operations
9.1 Split-apply-combine

[29]: df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
display(df)
# Mean of data1 with index key1
display(df['data1'].groupby(df['key1']).mean())
# Mean of data1 with index key1 and key2
display(df['data1'].groupby([df['key1'],df['key2']]).mean())

key1 key2 data1 data2

0 a one -0.822739 -0.413697
1 a two -2.519022 -0.283733
2 b one -0.926079 1.126704
3 b two -0.227016 0.358026
4 a one -0.808195 0.607594

key1
a -1.383319
b -0.576548
Name: data1, dtype: float64
key1 key2
a one -0.815467
two -2.519022
b one -0.926079
two -0.227016
Name: data1, dtype: float64
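To make the three stages of split-apply-combine visible, the sketch below performs them by hand on made-up numbers and compares the result with the one-line groupby call. The names `pieces`, `means` and `manual` are illustrative.

```python
import pandas as pd

# Made-up values in place of the random data.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Split: one sub-Series of data1 per distinct key1 value.
pieces = {key: group['data1'] for key, group in df.groupby('key1')}
# Apply: reduce each piece to a scalar.
means = {key: piece.mean() for key, piece in pieces.items()}
# Combine: assemble the scalars into one labelled result.
manual = pd.Series(means, name='data1')

built_in = df['data1'].groupby(df['key1']).mean()

print(manual)
print(built_in)
```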

[24]: # Grouping with dictionaries

people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red',
           'f': 'orange'}

display(people)
display(mapping)
people.groupby(mapping, axis=1).sum()

a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

[24]: blue red

Joe -0.223154 -1.900546
Steve 1.288503 0.303820
Wes 0.187868 -1.909170
Jim 2.059853 0.876024
Travis -0.758102 -1.435106
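Recent pandas releases (2.1 onwards, to the best of our knowledge) deprecate the axis=1 argument of groupby. A hedged alternative, shown on made-up integers, is to transpose, group the rows, and transpose back. The unused key 'f' in the mapping is simply ignored.

```python
import pandas as pd

# Made-up integers in place of the random data.
people = pd.DataFrame(
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    index=['Joe', 'Steve'],
    columns=['a', 'b', 'c', 'd', 'e'],
)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so column labels become row labels, group those rows
# by the mapping, sum, then transpose back.
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```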

[25]: # Grouping with series

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red',
           'f': 'orange'}

display(people)
display(mapping)
map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).sum()

a b c d e
Joe -0.929928 0.070314 0.654937 -0.878092 -1.040932
Steve 0.343663 -0.354972 0.547707 0.740796 0.315129
Wes -0.420404 -1.471619 1.637827 -1.449958 -0.017148
Jim 0.355240 -0.347607 1.759714 0.300139 0.868392
Travis -0.690952 0.980516 -0.448059 -0.310043 -1.724671
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

[25]: blue red

Joe -0.223154 -1.900546
Steve 1.288503 0.303820
Wes 0.187868 -1.909170
Jim 2.059853 0.876024
Travis -0.758102 -1.435106

[33]: # Grouping with Python functions

people = pd.DataFrame(np.random.uniform(0, 100, (5, 5)),
                      columns=['Maths', 'Physics', 'Chemistry', 'English', 'Computer'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

display(people)
# Average marks per subject, grouped by whether each person passed Maths (marks > 40)
x = people.groupby(lambda x: 'Math-Pass' if people['Maths'].loc[x] > 40 else 'Math-Fail').mean()

display(x)

Maths Physics Chemistry English Computer

Joe 76.600581 69.734027 88.207766 93.656526 54.241596
Steve 49.510219 21.429238 73.585302 15.075714 84.818743
Wes 36.227625 74.108171 75.587162 63.813382 56.984336
Jim 47.317391 30.334947 75.065363 67.022803 43.677969
Travis 12.667923 31.145376 69.089946 71.147333 5.321583
Maths Physics Chemistry English Computer
Math-Fail 24.447774 52.626773 72.338554 67.480357 31.152960
Math-Pass 57.809397 40.499404 78.952811 58.585014 60.912769

10 Data Aggregation
10.1 Any transformation that produces scalar values from arrays
[34]: import pandas as pd
df = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')

city_mileage=df.city08
display(df)
display(city_mileage)

C:\Users\ANUP\AppData\Local\Temp\ipykernel_23528\1763585868.py:2: DtypeWarning:
Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on
import or set low_memory=False.
df = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')

barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 \
0 15.695714 0.0 0.0 0.0 19 0.0 0
1 29.964545 0.0 0.0 0.0 9 0.0 0
2 12.207778 0.0 0.0 0.0 23 0.0 0
3 29.964545 0.0 0.0 0.0 10 0.0 0
4 17.347895 0.0 0.0 0.0 17 0.0 0
… … … … … … … …
41139 14.982273 0.0 0.0 0.0 19 0.0 0
41140 14.330870 0.0 0.0 0.0 20 0.0 0
41141 15.695714 0.0 0.0 0.0 18 0.0 0
41142 15.695714 0.0 0.0 0.0 18 0.0 0
41143 18.311667 0.0 0.0 0.0 16 0.0 0

cityA08U cityCD cityE … mfrCode c240Dscr charge240b c240bDscr \

0 0.0 0.0 0.0 … NaN NaN 0.0 NaN
1 0.0 0.0 0.0 … NaN NaN 0.0 NaN
2 0.0 0.0 0.0 … NaN NaN 0.0 NaN
3 0.0 0.0 0.0 … NaN NaN 0.0 NaN
4 0.0 0.0 0.0 … NaN NaN 0.0 NaN
… … … … … … … … …
41139 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41140 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41141 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41142 0.0 0.0 0.0 … NaN NaN 0.0 NaN
41143 0.0 0.0 0.0 … NaN NaN 0.0 NaN

createdOn modifiedOn startStop \

0 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
1 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
2 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
3 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
4 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
… … … …
41139 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41140 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41141 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41142 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN
41143 Tue Jan 01 00:00:00 EST 2013 Tue Jan 01 00:00:00 EST 2013 NaN

phevCity phevHwy phevComb

0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
… … … …
41139 0 0 0
41140 0 0 0
41141 0 0 0
41142 0 0 0
41143 0 0 0

[41144 rows x 83 columns]

0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64

[9]: # Find the mean of the mileage

import numpy as np
display(np.mean(city_mileage))

city_mileage.mean()

18.369045304297103

[9]: 18.369045304297103

10.2 Aggregation
[40]: display(city_mileage.agg('mean'))

# There are many aggregation functions available

df[['barrels08','city08']].agg('mean')

18.369045304297103

[40]: barrels08 17.283900

city08 18.369045
dtype: float64

[12]: # Aggregate works with a list of functions

city_mileage.agg(['mean', 'median', 'max', 'min'])

[12]: mean 18.369045

median 17.000000
max 150.000000
min 6.000000
Name: city08, dtype: float64

[41]: df[['barrels08','city08']].agg(['mean', 'median'])

[41]: barrels08 city08

mean 17.2839 18.369045
median 16.4805 17.000000

[42]: # Using your own functions

# any function that takes as input an array and returns a number
def max_sub_min(arr): return arr.max() - arr.min()
df[['barrels08','city08']].agg(max_sub_min)

[42]: barrels08 47.027143

city08 144.000000
dtype: float64

[35]: # Aggregating series

import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(5), index=['a', 'a', 'b', 'b', 'a'])
display(s)
s.agg('mean')

a -0.865203
a 0.906041
b -0.229707
b 0.122065
a -0.865028
dtype: float64

[35]: -0.1863666066128976

[38]: # We may want to find the mean by index

s.groupby(s.index).agg(['mean', 'median'])

[38]: mean median

a -0.274730 -0.865028
b -0.053821 -0.053821

10.3 Group aggregation in DataFrame

[46]: df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

display(df)
df.data1.groupby(df.key1).agg('mean')

key1 key2 data1 data2
0 a one -1.833308 0.208652
1 a two -1.166889 0.221166
2 b one 1.031091 0.039213
3 b two -1.133420 0.288701
4 a one 0.313682 0.534899

[46]: key1
a -0.895505
b -0.051164
Name: data1, dtype: float64

[31]: display(df)
m = df.data1.groupby([df.key1, df.key2]).agg('mean')
display(m)

key1 key2 data1 data2

0 a one -0.200839 -0.479937
1 a two -1.806834 -0.654038
2 b one 0.140283 1.164648
3 b two -0.526533 -0.483947
4 a one -1.776828 -1.599953
key1 key2
a one -0.988833
two -1.806834
b one 0.140283
two -0.526533
Name: data1, dtype: float64

10.4 Pivot Table

[48]: tips = pd.read_csv('https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/tips.csv')

tips

[48]: total_bill tip smoker day time size

0 16.99 1.01 No Sun Dinner 2
1 10.34 1.66 No Sun Dinner 3
2 21.01 3.50 No Sun Dinner 3
3 23.68 3.31 No Sun Dinner 2
4 24.59 3.61 No Sun Dinner 4
.. … … … … … …
239 29.03 5.92 No Sat Dinner 3
240 27.18 2.00 Yes Sat Dinner 2
241 22.67 2.00 Yes Sat Dinner 2
242 17.82 1.75 No Sat Dinner 2
243 18.78 3.00 No Thur Dinner 2

[244 rows x 6 columns]

[49]: display(tips)
# Exercise: Average tip by smoker and day

total_bill tip smoker day time size

[244 rows x 6 columns]

[50]: m = tips.tip.groupby([tips.day, tips.smoker]).agg('mean')

display(m)
m.unstack()

day smoker
Fri No 2.812500
Yes 2.714000
Sat No 3.102889
Yes 2.875476
Sun No 3.167895
Yes 3.516842
Thur No 2.673778
Yes 3.030000
Name: tip, dtype: float64

[50]: smoker No Yes

day
Fri 2.812500 2.714000
Sat 3.102889 2.875476
Sun 3.167895 3.516842
Thur 2.673778 3.030000

11 Pivot table
• Data Summarization
• Aggregates a table of data by one or more keys, arranging the data in a rectangle with some
of the group keys along the rows and some along the columns.

[51]: # pivot_table
tips.pivot_table(['tip'], index=['day'], columns='smoker')

[51]: tip
smoker No Yes
day
Fri 2.812500 2.714000
Sat 3.102889 2.875476
Sun 3.167895 3.516842
Thur 2.673778 3.030000
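A pivot table with the default aggregation can be read as a groupby mean followed by unstack. The sketch below checks this on a tiny made-up table that only borrows the column names of tips.csv (the values are an assumption, not the real data).

```python
import pandas as pd

# Tiny made-up table with the same column names as tips.csv.
tips_small = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
    'tip': [3.0, 2.0, 4.0, 1.0, 3.0, 5.0],
})

# pivot_table aggregates with the mean unless aggfunc says otherwise.
table = tips_small.pivot_table('tip', index='day', columns='smoker')

# Same summary: group by both keys, average, spread smoker across the columns.
grouped = tips_small.groupby(['day', 'smoker'])['tip'].mean().unstack()

print(table)
print(grouped)
```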

[52]: # Find total tips

tips.pivot_table(['tip'], index=['day'], columns='smoker', aggfunc='sum')

[52]: tip
smoker No Yes
day
Fri 11.25 40.71
Sat 139.63 120.77
Sun 180.57 66.82
Thur 120.32 51.51

[54]: # margins=True adds an 'All' row and column holding the totals
tips.pivot_table(['tip'], index=['day'], columns='smoker', aggfunc='sum', margins=True)

[54]: tip
smoker No Yes All
day
Fri 11.25 40.71 51.96
Sat 139.63 120.77 260.40
Sun 180.57 66.82 247.39
Thur 120.32 51.51 171.83
All 451.77 279.81 731.58

11.1 Cross-tabulation: Crosstab

[55]: # A special case of a pivot table that computes group frequencies
# Number of smokers/non-smokers on a day
display(pd.crosstab(tips.day, tips.smoker))

# With margins
pd.crosstab(tips.day, tips.smoker, margins=True)

smoker No Yes
day
Fri 4 15
Sat 45 42
Sun 57 19
Thur 45 17

[55]: smoker No Yes All

day
Fri 4 15 19
Sat 45 42 87
Sun 57 19 76
Thur 45 17 62
All 151 93 244
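Since crosstab counts rows per pair of labels, the same frequencies can be obtained with groupby, size and unstack. The sketch below uses a small made-up table with the tips.csv column names (an assumption for illustration).

```python
import pandas as pd

# Small made-up table with the same column names as tips.csv.
tips_small = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
})

counts = pd.crosstab(tips_small['day'], tips_small['smoker'])

# Count rows per (day, smoker) pair, then spread smoker across the columns.
by_groupby = tips_small.groupby(['day', 'smoker']).size().unstack(fill_value=0)

print(counts)
print(by_groupby)
```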

12 Categorical Data
[59]: # Categorical Data
values = pd.Series(['apple', 'orange', 'apple','apple'] * 2)
display(values)
display(values.unique())
values.value_counts()

0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
array(['apple', 'orange'], dtype=object)

[59]: apple 6
orange 2
dtype: int64

[60]: # Categorical or Dictionary encoded representation

values = pd.Series([0, 1, 0, 0] * 2) # category codes
dim = pd.Series(['apple', 'orange']) # categories
display(values)
dim

0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64

[60]: 0 apple
1 orange
dtype: object
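The codes and categories above carry the same information as the original strings. As a sketch, `take` looks each code up in the categories, and `pd.Categorical.from_codes` builds the pandas categorical directly from the two pieces; the names `restored` and `cat` are illustrative.

```python
import pandas as pd

values = pd.Series([0, 1, 0, 0] * 2)   # category codes
dim = pd.Series(['apple', 'orange'])   # categories

# take() looks up each code in the categories, restoring the strings.
restored = dim.take(values.to_numpy()).reset_index(drop=True)

# pandas can build the same structure directly from codes and categories.
cat = pd.Categorical.from_codes(values.to_numpy(), categories=dim.to_numpy())

print(restored)
print(cat)
```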

[66]: # Categorical type in pandas

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

display(df)
fruit_cat = df['fruit'].astype('category')
fruit_cat

basket_id fruit count weight

0 0 apple 3 3.748114
1 1 orange 9 1.403637
2 2 apple 12 3.814795
3 3 apple 12 2.893883
4 4 apple 4 1.093364
5 5 orange 5 2.636002
6 6 apple 4 0.844373
7 7 apple 13 3.923541

[66]: 0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

[69]: c = fruit_cat.values
display(type(c))
# Categories
display(c.categories)
# Codes
display(c.codes)

pandas.core.arrays.categorical.Categorical
Index(['apple', 'orange'], dtype='object')
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

[73]: # Pandas series with category data type
series = pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="category")
print(series)

0 a
1 b
2 c
3 d
4 e
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

[78]: # creating a data frame.

dataFrame = pd.DataFrame({"I": list("abcd"), "II": list("bcde")}, dtype="category")

print(dataFrame)
print(dataFrame.dtypes)

I II
0 a b
1 b c
2 c d
3 d e
I category
II category
dtype: object

[75]: # Categorical variables are unordered by default

# Order can be imposed
fruit_cat = df['fruit'].astype('category')
fruit_cat.values.as_ordered()

[75]: ['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']

Categories (2, object): ['apple' < 'orange']

[82]: # Creating new categorical data type

from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
display(cat_type)
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
display(df)
df_cat = df.astype(cat_type)
display(df_cat)
df_cat.A

CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)

A B
0 a b
1 b c
2 c c
3 a d
A B
0 a b
1 b c
2 c c
3 a d

[82]: 0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
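What the ordering buys in practice can be sketched as follows: once a categorical is ordered, comparisons, min/max and sorting follow the declared category order. The example reuses cat_type from the cell above on a single Series; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
s = pd.Series(list("abca")).astype(cat_type)

below_c = s < 'c'                    # comparison follows a < b < c < d
largest = s.max()                    # largest category present in the data
in_order = s.sort_values().tolist()  # sorted by category order

print(below_c.tolist())
print(largest)
print(in_order)
```

On an unordered categorical, the `<` comparison and `max()` would raise a TypeError instead.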

40-incredible-excel-tricks-Nov-2024
No ratings yet
40-incredible-excel-tricks-Nov-2024
50 pages
Apache Cassandra Administrator Associate - Exam Practice Tests
From Everand
Apache Cassandra Administrator Associate - Exam Practice Tests
Cristian Scutaru
No ratings yet
Cheat Sheet: The Pandas Dataframe Object: Column Index (DF - Columns)
No ratings yet
Cheat Sheet: The Pandas Dataframe Object: Column Index (DF - Columns)
6 pages
Unit 4 DSE
No ratings yet
Unit 4 DSE
9 pages
Merge, Join, and Concatenate: Concatenating Objects
No ratings yet
Merge, Join, and Concatenate: Concatenating Objects
62 pages
UNIT IV Material
No ratings yet
UNIT IV Material
23 pages
IV Unit Fds
No ratings yet
IV Unit Fds
16 pages
99c949c0-5910-425f-9ac5-155882800fa5
No ratings yet
99c949c0-5910-425f-9ac5-155882800fa5
36 pages
Python For DS Unit4
No ratings yet
Python For DS Unit4
11 pages
Data Wrangling and Analysis
100% (1)
Data Wrangling and Analysis
36 pages
Pandas
No ratings yet
Pandas
94 pages
EXP-3
No ratings yet
EXP-3
10 pages
Pandas
No ratings yet
Pandas
44 pages
Ch8 Data Wrangling Join, Combine, and Reshape
No ratings yet
Ch8 Data Wrangling Join, Combine, and Reshape
13 pages
Combining Datasets
No ratings yet
Combining Datasets
36 pages
Chapter 2 Python Pandas - II
No ratings yet
Chapter 2 Python Pandas - II
19 pages
12 Pandas
No ratings yet
12 Pandas
9 pages
Introduction To Pandas in Data Analytics
No ratings yet
Introduction To Pandas in Data Analytics
12 pages
Python CSBS Bhavya Lab Manual
No ratings yet
Python CSBS Bhavya Lab Manual
14 pages
Data Science Data Manipulation With Pandas
No ratings yet
Data Science Data Manipulation With Pandas
77 pages
Pandas Cheat Sheet........
No ratings yet
Pandas Cheat Sheet........
11 pages
Data Wrangling With Python and Pandas
No ratings yet
Data Wrangling With Python and Pandas
7 pages
Notes For Python Part III
No ratings yet
Notes For Python Part III
44 pages
Pandas Data Wrangling Cheatsheet Datacamp PDF
No ratings yet
Pandas Data Wrangling Cheatsheet Datacamp PDF
1 page
pandas_merged
No ratings yet
pandas_merged
2 pages
EDP-3[1]
No ratings yet
EDP-3[1]
16 pages
Pandas
No ratings yet
Pandas
25 pages
Lecture 14
No ratings yet
Lecture 14
33 pages
Merge, Join, and Concatenate - Pandas 0203 Documentation