Descriptive Statistics With Pandas: Data Handling Using Pandas - II
2020
Unit 1: Data Handling using Pandas and Data Visualization
Data Handling using Pandas – I
Series: Creation of Series from – ndarray, dictionary, scalar value; mathematical operations; Head
and Tail functions; Selection, Indexing and Slicing.
Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV files; display;
iteration; Operations on rows and columns: add, select, delete, rename; Head and Tail functions;
Indexing using Labels, Boolean Indexing; Joining, Merging and Concatenation.
Data Handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median,
mode, quartile, standard deviation, variance.
DataFrame operations: Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting.
Handling missing values – dropping and filling. Importing/Exporting Data between MySQL database
and Pandas.
import pandas as pd
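#NOTE: the 'Name' list is missing from these notes; the names below are placeholders only
dict1={'Name':['S1','S2','S3','S4','S5','S6','S7','S8','S9','S10'],
'Age':[16,17,17,18,16,17,17,16,18,17],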
'Marks':[89,94,34,56,67,78,86,94,75,68]}
df1=pd.DataFrame(dict1,index=[1,2,3,4,5,6,7,8,9,10])
print(df1)
Descriptive statistics
#count function
print(df1)
print()
print("No of students in df1: ",df1.count())
#or count a single column to get a single number
print("No of students in df1: ",df1['Name'].count())
#count(axis=1) gives the number of non-null values in each row
print(df1.count(axis=1))
#mode(): the most frequently occurring value(s) in each column
print(df1)
print()
print(df1.mode(axis=0))
#10.4.2020
Quartile: In statistics, quartiles are the values that divide sorted data into four equal parts (quarters).
Reference for quartiles: https://www.mathsisfun.com/data/quartiles.html
4. Now define the quartiles for the divided regions, i.e. Q1, Q2 and Q3.
Ex:
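A minimal sketch of computing quartiles with pandas, reusing the Marks values from df1 above. It assumes the quantile() function with q = 0.25, 0.5 and 0.75 for Q1, Q2 and Q3; pandas interpolates linearly between neighbouring values by default, so the results may differ slightly from a hand calculation that uses another convention.

```python
import pandas as pd

# Marks column from df1 earlier in these notes
marks = pd.Series([89, 94, 34, 56, 67, 78, 86, 94, 75, 68])

# Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles
quartiles = marks.quantile([0.25, 0.5, 0.75])
print(quartiles)

# Q2 is the same value as the median
print(marks.median())
```

The same call works on a whole DataFrame of numeric columns and then returns one row per requested quantile.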
DataFrame operations:
Ex: Aggregation
#aggregate
import pandas as pd
d1 = {'Rollno':[101,101,103,102,104], 'Name': ['Pat','Sid','Tom','Kim','Ray'],
'Physics':[90,40,50,90,65],'Chemistry':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
print('minimum is:',df['Physics'].min())
print('maximum is:',df['Physics'].max())
print('sum is:',df['Physics'].sum())
print('average is:',df['Physics'].mean())
#Ex2:
import pandas as pd
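df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])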
print(df)
print(df.mean())
print(df.median())
Ex:
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
print(df)
df.aggregate(['sum', 'min'])
Group by ()
By “group by” we are referring to a process involving one or more of the following
steps:
● Splitting the data into groups based on some criteria.
● Applying a function to each group independently.
● Combining the results into a data structure.
In many situations we may wish to split the data set into groups and do something with those
groups. In the apply step, we might wish to do one of the following:
● Aggregation: compute a summary statistic (or statistics) for each group, e.g. mean or sum.
● Transformation: perform some group-specific computations and return a like-indexed
object.
● Filtration: discard some groups, according to a group-wise computation that evaluates
to True or False.
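The three kinds of apply step can be sketched as follows. The data here is hypothetical and used only for illustration; the threshold of 25 points in the filtration step is likewise an arbitrary choice.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'C'],
    'Points': [10, 20, 30, 50, 5],
})
grouped = df.groupby('Team')

# Aggregation: one summary value per group
team_mean = grouped['Points'].mean()

# Transformation: the result has the same index as the original rows
# (here, each row's difference from its team mean)
diff_from_mean = df['Points'] - grouped['Points'].transform('mean')

# Filtration: keep only the groups whose total points exceed 25
big_teams = grouped.filter(lambda g: g['Points'].sum() > 25)

print(team_mean)
print(diff_from_mean)
print(big_teams)
```

Aggregation shrinks the data to one row per group, transformation keeps every original row, and filtration keeps or drops whole groups.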
Ex1:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df
#The Year column has four distinct values, so grouping on it gives four groups: 2014, 2015, 2016, 2017
Let us group the above dataframe on the column year and try to see one of the groups. We are
going to use the functions groupby() to create the groups and get_group() to extract one of the
groups.
grpyear = df.groupby('Year')
g1 = grpyear.get_group(2014)
g1
Next we are going to apply an aggregate function on the group displayed above.
g1['Points'].agg('mean')
Now we are going to apply multiple aggregate functions to the created group
g1['Points'].agg(['mean','max','var'])
If we want to see all the groups created based on a given key, we need to iterate over the
grouped dataset:
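A minimal sketch of that iteration, using only the first four rows of ipl_data to keep the output short. The group_sizes dictionary is a hypothetical helper added here to record how many rows each group holds.

```python
import pandas as pd

# First four rows of the ipl_data example above
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Rank': [1, 2, 2, 3],
    'Year': [2014, 2015, 2014, 2015],
    'Points': [876, 789, 863, 673],
})

group_sizes = {}
# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
for year, group in df.groupby('Year'):
    print(year)
    print(group)
    group_sizes[int(year)] = len(group)
```

Each group is an ordinary DataFrame, so any of the functions shown earlier can be applied to it inside the loop.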
Ex: 2
import pandas as pd
import numpy as np
#Create a DataFrame
d={
'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
'Score':[62,47,55,74,31,77,85,63,42,67,89,81]}
df = pd.DataFrame(d,columns=['Name','Exam','Subject','Score'])
print (df)
Get mean score of a group using groupby function in pandas
# mean score of Students
df['Score'].groupby([df['Name']]).mean()
Get sum of score of a group using groupby function in pandas
# sum of score group by Name and Exam
df['Score'].groupby([df['Name'],df['Exam']]).sum()
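# sum of score grouped by Subject and Exam
# (select 'Score' so that the text column 'Name' is not summed as well)
df.groupby(['Subject', 'Exam'])[['Score']].sum()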
Descriptive statistics of the group :
df['Score'].groupby(df['Subject']).describe()
14.4.2020
Sorting:
Tabular data often needs to be sorted in order to get a particular view of the
data. Pandas provides a means to sort data based on the values in one or
more columns of the DataFrame. We can also sort a DataFrame based on the
row index. Hence the two sorting functions provided by pandas are:
sort_values() : sorts on the values of one or more columns, e.g. by=['Name','Maths']
sort_index() : sorts on the index labels
Syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
DataFrame.sort_index(axis=0, ascending=True, inplace=False)
Both functions accept a few more options and, unless inplace=True is given,
return a new DataFrame.
EX1:
import pandas as pd
d1 = {'rollno':[101,101,103,102,104], 'name': ['Pat','Sid','Tom','Kim','Ray'],
'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
pop_asc = df.sort_values('physics')
display(pop_asc)
Next we are going to sort on two columns, physics and chem. These
columns must be passed as a list to the sort_values() function.
The DataFrame is sorted first on physics, and rows with equal physics
marks are then ordered by chem. Pat is placed ahead of Kim because Pat has
the lower chem score and both columns are sorted in ascending order.
df1 = df.sort_values(['physics','chem'])
display(df1)
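#df_desc is not defined in these notes; assuming it is df sorted on physics in descending order
df_desc = df.sort_values('physics', ascending=False)
display(df_desc)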
Next we are going to sort the DataFrame on physics in ascending order and
chem in descending order.
The DataFrame is sorted first on physics; among rows with equal physics marks,
Kim is placed ahead of Pat because Kim has the higher chem score.
df1 = df.sort_values(['physics','chem'], ascending=[True,False])
display(df1)
In the above DataFrame we can see that the index is not in proper order. In
order to rearrange index we may sort the above DataFrame based on index
values by using sort_index() function as shown below.
df_desc_index = df_desc.sort_index()
display(df_desc_index)
Finally, we are going to sort the original DataFrame itself instead of
creating a new DataFrame. To do this, set the inplace parameter to
True when calling the sort_values() function.
df.sort_values('physics', inplace=True)
display(df)
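A short sketch of the difference between the two ways of calling sort_values(), using three rows from the DataFrame above. The variable names sorted_copy and result are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pat', 'Sid', 'Tom'], 'physics': [90, 40, 50]})

# Without inplace, df is left untouched and a sorted copy is returned
sorted_copy = df.sort_values('physics')

# With inplace=True the call returns None and df itself is reordered
result = df.sort_values('physics', inplace=True)
print(result)
print(df)
```

Because the inplace call returns None, writing df = df.sort_values('physics', inplace=True) would lose the DataFrame; use one style or the other, not both.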
Example 2:
import pandas as pd
import numpy as np
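#Create a DataFrame
#(d is a dictionary containing at least the columns 'Age' and 'Score'; its definition is not shown in these notes)
df = pd.DataFrame(d)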
df
# sort the pandas dataframe by ascending value of single column
df.sort_values(by='Score')
# sort the pandas dataframe by descending value of single column
df.sort_values(by='Score',ascending=False)
# sort the pandas dataframe by multiple columns
df.sort_values(by=['Age', 'Score'],ascending=[True,False])