Descriptive Statistics With Pandas: Data Handling Using Pandas - II

9.4.2020
Unit 1: Data Handling using Pandas and Data Visualization
Data Handling using Pandas - I
Introduction to Python libraries- Pandas, Matplotlib.
Data structures in Pandas - Series and Data Frames.
Series: Creation of Series from – ndarray, dictionary, scalar value; mathematical operations; Head
and Tail functions; Selection, Indexing and Slicing.
Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV files; display;
iteration; Operations on rows and columns: add, select, delete, rename; Head and Tail functions;
Indexing using Labels, Boolean Indexing; Joining, Merging and Concatenation.
Importing/Exporting Data between CSV files and Data Frames.
Data handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median, mode, quartile, Standard deviation, variance.
DataFrame operations: Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting.
Handling missing values – dropping and filling.
Importing/Exporting Data between MySQL database and Pandas.
Data handling using Pandas – II

Descriptive Statistics: max, min, count, sum, mean, median, mode, quartile, Standard deviation,
variance
Descriptive Statistics with Pandas

Ex1.
import pandas as pd
dict1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj','Tanu','Hitesh','Amrit','Kapil','Deepak','Sujatha'],
'Age':[16,17,17,18,16,17,17,16,18,17],
'Marks':[89,94,34,56,67,78,86,94,75,68]}
df1=pd.DataFrame(dict1,index=[1,2,3,4,5,6,7,8,9,10])
print(df1)
Descriptive statistics
#Aggregate functions of df1
print("Minimum marks in df1: ",df1['Marks'].min())
print("Maximum marks in df1: ",df1['Marks'].max())
print("Total marks in df1: ",df1['Marks'].sum())
#count function
print(df1)
print()
print("No of students in df1: ",df1.count())
#or
print("No of students in df1: ",df1['Name'].count())
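#Descriptive statistics: count()
#axis = 0 (default) counts the non-missing values in each column
#axis = 1 counts the non-missing values in each row
print("No of non-missing values in each row of df1: ")
print(df1.count(axis=1))
#mode(): the most frequently occurring value(s) in each column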
print(df1)
print()
print("Column mode of DF1: ")
print()
print(df1.mode(axis=0))
#10.4.2020
Quartile - In statistics, quartiles are the values that divide data into quarters.
A quartile is a type of quantile. The first quartile is defined as the middle number between the smallest number and the median of the data set.
The second quartile is the median of the data. The third quartile is the middle value between the median and the highest value of the data set.
Common Quantiles - Certain types of quantiles are used commonly enough to have specific names.
Below is a list of these:
• The 2-quantile is called the median
• The 3-quantiles are called terciles
• The 4-quantiles are called quartiles
• The 5-quantiles are called quintiles
• The 6-quantiles are called sextiles
• The 7-quantiles are called septiles
• The 8-quantiles are called octiles
• The 10-quantiles are called deciles
• The 12-quantiles are called duodeciles
• The 20-quantiles are called vigintiles
• The 100-quantiles are called percentiles
• The 1000-quantiles are called permilles
Reference for quartiles: https://www.mathsisfun.com/data/quartiles.html
1. Consider a series of numbers from A to B.
2. Now, divide it in the middle at point C.
3. Next, divide it again between A and C, and between C and B.
4. The three dividing points define the quartiles Q1, Q2 and Q3:
Q1 – 1st Quartile (25th Percentile)
Q2 – 2nd Quartile (50th Percentile – also called median)
Q3 – 3rd Quartile (75th Percentile)
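The quartiles described above can be computed with the quantile() function. The sketch below reuses the Marks values from Ex1; the variable names marks, q1, q2, q3 and quartiles are chosen only for illustration. Note that pandas interpolates linearly between neighbouring values by default, so the result may differ slightly from a quartile worked out by hand with the "middle number" method.

```python
import pandas as pd

# Marks column from Ex1
marks = pd.Series([89, 94, 34, 56, 67, 78, 86, 94, 75, 68])

# quantile() takes a fraction between 0 and 1
q1 = marks.quantile(0.25)   # 1st quartile (25th percentile)
q2 = marks.quantile(0.50)   # 2nd quartile (same as the median)
q3 = marks.quantile(0.75)   # 3rd quartile (75th percentile)
print("Q1:", q1)
print("Q2:", q2)
print("Q3:", q3)

# A list of fractions returns all the quartiles at once as a Series
quartiles = marks.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

Here Q2 equals marks.median(), which matches the definition of the second quartile.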
Variance - The var() function is used to calculate the variance of a given set of numbers.
- We can calculate the variance of a whole DataFrame (column by column), of a single column, or of each row.
Ex:
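A minimal sketch of var() and std(), assuming the Age and Marks columns of df1 from Ex1 (the Name column is left out because variance is only defined for numbers). The variable names are illustrative. By default pandas computes the sample variance (dividing by n - 1); passing ddof=0 gives the population variance instead.

```python
import pandas as pd

# Numeric columns of df1 from Ex1
df1 = pd.DataFrame({'Age': [16, 17, 17, 18, 16, 17, 17, 16, 18, 17],
                    'Marks': [89, 94, 34, 56, 67, 78, 86, 94, 75, 68]},
                   index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Variance and standard deviation of a single column
marks_var = df1['Marks'].var()
marks_std = df1['Marks'].std()
print("Variance of Marks:", marks_var)
print("Standard deviation of Marks:", marks_std)

# Population variance (divide by n instead of n - 1)
marks_pop_var = df1['Marks'].var(ddof=0)
print("Population variance of Marks:", marks_pop_var)

# Variance of every column of the DataFrame (axis=0 is the default)
col_var = df1.var()
print(col_var)

# Variance across the columns of each row
row_var = df1.var(axis=1)
print(row_var)
```

The standard deviation is the square root of the variance, so marks_std squared gives marks_var again.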
DataFrame operations:
Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting
Ex: Aggregation
#aggregate
import pandas as pd
d1 = {'Rollno':[101,101,103,102,104], 'Name': ['Pat','Sid','Tom','Kim','Ray'],\
'Physics':[90,40,50,90,65],'Chemistry':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
print('--------Basic aggregate functions min(), max(), sum() and mean()')
print('minimum is:',df['Physics'].min())
print('maximum is:',df['Physics'].max())
print('sum is:',df['Physics'].sum())
print('average is:',df['Physics'].mean())
#Ex2:
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
print(df)
print("\n----------- Calculate Mean -----------\n")
print(df.mean())
print("\n----------- Calculate Median -----------\n")
print(df.median())
print("\n----------- Calculate Mode -----------\n")

print(df.mode())
df.aggregate(['sum', 'min'])
# Applying aggregation across all the columns
# sum and min will be found for each
# numeric type column in df dataframe
Ex:
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
print(df)
df.aggregate(['sum', 'min'])
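aggregate() can also apply different functions to different columns when it is given a dictionary. The sketch below reuses the basket data; the choice of columns and functions is only an illustration. Cells for a function that was not requested for a column are filled with NaN.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

# Dictionary form: column name -> list of functions for that column
result = df.aggregate({'Apple': ['sum', 'min'],
                       'Orange': ['max', 'mean']})
print(result)
```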
Group by: groupby()
By “group by” we are referring to a process involving one or more of the following steps:
● Splitting the data into groups based on some criteria.
● Applying a function to each group independently.
● Combining the results into a data structure.
In many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:
● Aggregation: compute a summary statistic (or statistics) for each group, e.g. mean or sum.
● Transformation: perform some group-specific computations and return a like-indexed object.
● Filtration: discard some groups, according to a group-wise computation that evaluates to True or False.
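The three kinds of apply step listed above can be sketched on a small sample. The DataFrame and the variable names below are hypothetical and exist only for illustration.

```python
import pandas as pd

# Small illustrative sample (not the full data set used later)
df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
                   'Points': [876, 789, 863, 673, 741]})
grouped = df.groupby('Team')

# Aggregation: one summary value per group
mean_points = grouped['Points'].mean()
print(mean_points)

# Transformation: result has the same index as the input;
# here each score is replaced by its difference from the group mean
diff_from_mean = grouped['Points'].transform(lambda s: s - s.mean())
print(diff_from_mean)

# Filtration: keep only the groups that have more than one row
multi_row = grouped.filter(lambda g: len(g) > 1)
print(multi_row)
```

Aggregation returns one row per group, transformation returns one value per original row, and filtration returns a subset of the original rows.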
Splitting an object into groups

pandas objects can be split on any of their axes. There are multiple ways to split an object, such as:
● obj.groupby('key')
● obj.groupby(['key1','key2'])
● obj.groupby(key,axis=1) (the axis argument is deprecated in recent pandas versions)
Ex1:
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
df
The Year column holds four distinct values (2014, 2015, 2016 and 2017), so grouping on it creates four groups.
Let us group the above DataFrame on the column Year and look at one of the groups. We use the function groupby() to create the groups and get_group() to extract one of them.
grpyear = df.groupby('Year')
g1 = grpyear.get_group(2014)
g1
Next we are going to apply an aggregate function to the group displayed above.
g1['Points'].agg('mean')
Now we are going to apply multiple aggregate functions to the created group
g1['Points'].agg(['mean','max','var'])
If we want to see all the groups created based on a given key, we need to iterate over the
grouped dataset:
for key, data in grpyear:
    display(key)
    display(data)
#13.4.2020
Ex: 2
import pandas as pd
import numpy as np
#Create a DataFrame
d = {
    'Name': ['Alisa', 'Bobby', 'Cathrine', 'Alisa', 'Bobby', 'Cathrine',
             'Alisa', 'Bobby', 'Cathrine', 'Alisa', 'Bobby', 'Cathrine'],
    'Exam': ['Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1',
             'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2'],
    'Subject': ['Mathematics', 'Mathematics', 'Mathematics', 'Science', 'Science', 'Science',
                'Mathematics', 'Mathematics', 'Mathematics', 'Science', 'Science', 'Science'],
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81]}
df = pd.DataFrame(d, columns=['Name', 'Exam', 'Subject', 'Score'])
print(df)
Get mean score of a group using groupby function in pandas
# mean score of Students
df['Score'].groupby([df['Name']]).mean()
Get sum of score of a group using groupby function in pandas
# sum of score group by Name and Exam
df['Score'].groupby([df['Name'],df['Exam']]).sum()
Group the entire dataframe by Subject and Exam:
# group the entire dataframe by Subject and Exam
# numeric_only=True restricts the sum to numeric columns such as Score
df.groupby(['Subject', 'Exam']).sum(numeric_only=True)
Descriptive statistics of the group :
# descriptive statistics by group - subject
df['Score'].groupby(df['Subject']).describe()
14.4.2020
Sorting:
Sorting means arranging DataFrame data in ascending or descending order.
A DataFrame can be sorted by the values in its columns or by its index.
It is often necessary to sort tabular data in order to get a particular view of the data. Pandas provides a means to sort data based on the values in one or more columns of the DataFrame. We can also sort a DataFrame based on the row index. Hence the two sorting functions provided by pandas are:
sort_values() - to sort a pandas DataFrame by one or more columns.
sort_index() - to sort a pandas DataFrame by row index.
By default, both functions sort the rows in ascending order.
By value (column) – sort_values()
Syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
# e.g. by=['Name', 'Maths']
by = column name, or list of column names, to sort by
axis = 0 sorts the rows by the values in the given column(s) (default)
axis = 1 sorts the columns by the values in the given row(s)
ascending = True by default
inplace = False by default, i.e. a new DataFrame is returned
By index (row) – sort_index()
Syntax:
DataFrame.sort_index(axis=0, ascending=True, inplace=False)
sort_index() takes no by argument; it always sorts on the index labels
axis = 0 sorts the rows by their row labels (default)
axis = 1 sorts the columns by their column labels
ascending = True by default
inplace = False by default, i.e. a new DataFrame is returned
Both functions accept a few more options and, unless inplace=True is set, return a new DataFrame.
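A minimal sketch of sort_index() on both axes. The small DataFrame and the variable names are made up for illustration, with the row labels deliberately out of order.

```python
import pandas as pd

# Hypothetical sample with unordered row labels
df = pd.DataFrame({'physics': [90, 40, 50],
                   'chem': [75, 80, 60]},
                  index=[103, 101, 102])

# Sort the rows by row label, ascending (the default)
by_row_label = df.sort_index()
print(by_row_label)

# Sort the rows by row label, descending
by_row_label_desc = df.sort_index(ascending=False)
print(by_row_label_desc)

# Sort the columns by column label
by_col_label = df.sort_index(axis=1)
print(by_col_label)
```

Because inplace is left at its default of False, df itself keeps its original order.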
EX1:
import pandas as pd
d1 = {'rollno': [101, 101, 103, 102, 104],
      'name': ['Pat', 'Sid', 'Tom', 'Kim', 'Ray'],
      'physics': [90, 40, 50, 90, 65],
      'chem': [75, 80, 60, 85, 60]}
df = pd.DataFrame(d1)
display(df)
Now we are going to apply the sort_values() function to the DataFrame on the column 'physics'. We will see that the DataFrame is sorted in ascending order of the physics column. The index labels move with their rows, so the index is no longer in order.
pop_asc = df.sort_values('physics')
display(pop_asc)
Next we are going to sort on two columns, physics and chem. These columns must be passed as a list to the sort_values() function.
Note that when sorting by multiple columns, pandas sort_values() sorts on the first column first and uses the second column to break ties. We get a different result if we change the order of the columns in the sort_values() call.
We can see that the DataFrame is sorted first on physics. Pat and Kim both have 90 in physics, and Pat is placed ahead of Kim because of the lower score in chem, since both columns are sorted in ascending order.
df1 = df.sort_values(['physics','chem'])
display(df1)
[Figures: Original DataFrame; Sorting on multiple columns (physics and chem)]

Next we are going to sort the DataFrame in descending order of physics. By
default sorting is done in ascending order and to change this order we need to
set the ascending parameter to False.
df_desc = df.sort_values('physics', ascending=False)
display(df_desc)
Next we are going to sort the DataFrame on physics in ascending order and chem in descending order.
We can see that the DataFrame is sorted first on physics, where Kim is placed ahead of Pat because of the higher score in chem.
df1 = df.sort_values(['physics','chem'], ascending=[True, False])
display(df1)
In the above DataFrame we can see that the index is not in proper order. In
order to rearrange index we may sort the above DataFrame based on index
values by using sort_index() function as shown below.
df_desc_index = df_desc.sort_index()
display(df_desc_index)
Finally, we are going to sort the original DataFrame without creating a new one, by setting the inplace parameter to True when calling the sort_values() function.
df.sort_values('physics', inplace=True)
display(df)
Example 2:
import pandas as pd
import numpy as np
#Create a dictionary of Series
d = {'Name': pd.Series(['Alisa', 'Bobby', 'Cathrine', 'Madonna', 'Rocky', 'Sebastian', 'Jaqluine',
                        'Rahul', 'David', 'Andrew', 'Ajay', 'Teresa']),
     'Age': pd.Series([26, 27, 25, 24, 31, 27, 25, 33, 42, 32, 51, 47]),
     'Score': pd.Series([89, 87, 67, 55, 47, 72, 76, 79, 44, 92, 99, 69])}
#Create a DataFrame
df = pd.DataFrame(d)
df
# sort the pandas dataframe by ascending value of single column
df.sort_values(by='Score')
# sort the pandas dataframe by descending value of single column
df.sort_values(by='Score', ascending=False)
# sort the pandas dataframe by multiple columns: Age ascending, then Score descending
df.sort_values(by=['Age', 'Score'], ascending=[True, False])