
Descriptive Statistics With Pandas: Data Handling Using Pandas - II


9.4.2020

Unit 1: Data Handling using Pandas and Data Visualization

Data Handling using Pandas - I

Introduction to Python libraries- Pandas, Matplotlib.

Data structures in Pandas - Series and Data Frames.

Series: Creation of Series from – ndarray, dictionary, scalar value; mathematical operations; Head
and Tail functions; Selection, Indexing and Slicing.

Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV files; display;
iteration; Operations on rows and columns: add, select, delete, rename; Head and Tail functions;
Indexing using Labels, Boolean Indexing; Joining, Merging and Concatenation.

Importing/Exporting Data between CSV files and Data Frames.

Data handling using Pandas – II

Descriptive Statistics: max, min, count, sum, mean, median,
mode, quartile, Standard deviation, variance.

DataFrame operations: Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting.
Handling missing values – dropping and filling. Importing/Exporting Data between MySQL database
and Pandas.

Data handling using Pandas – II


Descriptive Statistics: max, min, count, sum, mean, median, mode, quartile, Standard deviation,
variance

Descriptive Statistics with Pandas


Ex1.

import pandas as pd

dict1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj','Tanu','Hitesh','Amrit','Kapil','Deepak','Sujatha'],

'Age':[16,17,17,18,16,17,17,16,18,17],

'Marks':[89,94,34,56,67,78,86,94,75,68]}

df1=pd.DataFrame(dict1,index=[1,2,3,4,5,6,7,8,9,10])

print(df1)
Descriptive statistics

#Aggregate functions of df1

print("Minimum marks in df1: ",df1['Marks'].min())

print("Maximum marks in df1: ",df1['Marks'].max())

print("Total marks in df1: ",df1['Marks'].sum())

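#count function
print(df1)
print()
# count() on the whole DataFrame gives the number of non-missing values in each column
print("No of values in each column of df1: ")
print(df1.count())
#or count a single column to get the number of students
print("No of students in df1: ",df1['Name'].count())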
#Descriptive statistics count()

#axis = 0 (default) counts down the rows, giving one count per column

#axis = 1 counts across the columns, giving one count per row

print("No of columns in each row of df1: ")

print(df1.count(axis=1))
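To make the effect of axis concrete, here is a minimal sketch. The DataFrame demo and its missing mark are hypothetical and not part of Ex1; they are only there to show that count() skips missing values.

```python
import pandas as pd
import numpy as np

# Hypothetical data: the second student's mark is missing (NaN)
demo = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav'],
                     'Marks': [89, np.nan, 34]},
                    index=[1, 2, 3])

per_column = demo.count()       # axis=0: non-missing values in each column
per_row = demo.count(axis=1)    # axis=1: non-missing values in each row

print(per_column)
print(per_row)
```

Here the Marks column counts one value fewer than the Name column, and the row with the missing mark counts one value fewer than the other rows.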
#mode()....most repeated value

print(df1)

print()

print("Column mode of DF1: ")

print()

print(df1.mode(axis=0))
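A short sketch of how mode() behaves when more than one value is tied for "most repeated". The marks are taken from Ex1; the ages series is a hypothetical shortened list chosen so that two values tie.

```python
import pandas as pd

marks = pd.Series([89, 94, 34, 56, 67, 78, 86, 94, 75, 68])
ages = pd.Series([16, 17, 17, 18, 16])   # hypothetical: 16 and 17 both occur twice

marks_mode = marks.mode()   # a Series, even when there is only one mode
ages_mode = ages.mode()     # every tied value is returned, in sorted order

print(marks_mode.tolist())
print(ages_mode.tolist())
```

Because a column can have several modes, mode() on a DataFrame returns a DataFrame, and columns with fewer modes are padded with NaN.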
#10.4.2020

Quartile - In statistics, quartiles are the values that divide data into quarters.

A quartile is a type of quantile. The first quartile is defined as the middle
number between the smallest number and the median of the data set.
The second quartile is the median of the data. The third quartile is the
middle value between the median and the highest value of the data set.

Common Quantiles: Certain types of quantiles are used commonly
enough to have specific names.

Below is a list of these:

• The only 2-quantile is called the median
• The 3-quantiles are called terciles
• The 4-quantiles are called quartiles
• The 5-quantiles are called quintiles
• The 6-quantiles are called sextiles
• The 7-quantiles are called septiles
• The 8-quantiles are called octiles
• The 10-quantiles are called deciles
• The 12-quantiles are called duodeciles
• The 20-quantiles are called vigintiles
• The 100-quantiles are called percentiles
• The 1000-quantiles are called permilles

Reference for quartile: https://www.mathsisfun.com/data/quartiles.html

1. Consider a series of numbers from A to B.

2. Now, we divide it in the middle at point C.

3. Now, further divide it between A and C, and then between C and B.

4. Now, define the quartiles for the divided regions, i.e. Q1, Q2, Q3:

Q1 – 1st Quartile (25th Percentile)
Q2 – 2nd Quartile (50th Percentile – also called median)
Q3 – 3rd Quartile (75th Percentile)
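The three quartiles can be computed with the quantile() function. The sketch below reuses the Marks values from Ex1; the variable names are illustrative. Note that pandas interpolates linearly between neighbouring values by default, so Q1 and Q3 may differ slightly from a hand calculation that takes the median of each half.

```python
import pandas as pd

marks = pd.Series([89, 94, 34, 56, 67, 78, 86, 94, 75, 68])

q1 = marks.quantile(0.25)   # 1st quartile (25th percentile)
q2 = marks.quantile(0.50)   # 2nd quartile, the same value as marks.median()
q3 = marks.quantile(0.75)   # 3rd quartile (75th percentile)

# A list of fractions returns all the quartiles in one Series
quartiles = marks.quantile([0.25, 0.5, 0.75])

print(q1, q2, q3)
print(quartiles)
```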
Variance - The var() function is used to calculate the variance of a given set of
numbers.
- We can calculate the variance of a whole DataFrame, of a single column, or across rows.

Ex:
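As one possible illustration, the sketch below uses a small hypothetical DataFrame, df_var (the column names and values are assumptions chosen to keep the arithmetic easy). By default pandas computes the sample variance, dividing by n-1; passing ddof=0 gives the population variance instead.

```python
import pandas as pd

# Hypothetical marks for three students
df_var = pd.DataFrame({'Maths': [10, 20, 30],
                       'Science': [20, 40, 60]})

col_var = df_var['Maths'].var()      # variance of one column
all_var = df_var.var()               # variance of every column (axis=0)
row_var = df_var.var(axis=1)         # variance across each row
col_std = df_var['Maths'].std()      # standard deviation = square root of the variance
pop_var = df_var['Maths'].var(ddof=0)  # population variance (divide by n)

print(col_var, col_std, pop_var)
print(all_var)
print(row_var)
```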

DataFrame operations:

Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting

Ex: Aggregation

#aggregate

import pandas as pd
d1 = {'Rollno':[101,101,103,102,104], 'Name': ['Pat','Sid','Tom','Kim','Ray'],\

'Physics':[90,40,50,90,65],'Chemistry':[75,80,60,85,60] }

df = pd.DataFrame(d1)

display(df)

print('--------Basic aggregate functions min(), max(), sum() and mean()')

print('minimum is:',df['Physics'].min())

print('maximum is:',df['Physics'].max())

print('sum is:',df['Physics'].sum())

print('average is:',df['Physics'].mean())
#Ex2:

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],

[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],

columns=['Apple', 'Orange', 'Banana', 'Pear'],

index=['Basket1', 'Basket2', 'Basket3', 'Basket4',

'Basket5', 'Basket6'])

print(df)

print("\n----------- Calculate Mean -----------\n")

print(df.mean())

print("\n----------- Calculate Median -----------\n")

print(df.median())

print("\n----------- Calculate Mode -----------\n")


print(df.mode())
df.aggregate(['sum', 'min'])
# Applying aggregation across all the columns
# sum and min will be found for each
# numeric type column in df dataframe

Ex:
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
print(df)

df.aggregate(['sum', 'min'])
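The result of aggregate() with a list of functions is itself a DataFrame, with one row per function. The sketch below uses a shortened, hypothetical version of the basket data (three baskets, two fruits) and also shows that a dictionary can request different functions for different columns; agg() is a shorter alias for aggregate().

```python
import pandas as pd

# Shortened, hypothetical version of the basket data
fruit = pd.DataFrame({'Apple': [10, 7, 55], 'Pear': [40, 28, 12]},
                     index=['Basket1', 'Basket2', 'Basket3'])

# Same functions for every column: rows 'sum' and 'min'
summary = fruit.aggregate(['sum', 'min'])

# Different functions per column; cells that were not requested are NaN
per_column = fruit.agg({'Apple': ['min', 'max'], 'Pear': ['mean']})

print(summary)
print(per_column)
```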

Group by ()

By “group by” we are referring to a process involving one or more of the following
steps:
● Splitting the data into groups based on some criteria.
● Applying a function to each group independently.
● Combining the results into a data structure.
In many situations we may wish to split the data set into groups and do something with those
groups. In the apply step, we might wish to do one of the following:
● Aggregation: compute a summary statistic (or statistics) for each group (e.g. mean, sum).
● Transformation: perform some group-specific computations and return a like-indexed
object.
● Filtration: discard some groups, according to a group-wise computation that evaluates
True or False.
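The three kinds of apply step can be sketched as follows. The DataFrame scores, its team names and the threshold of 20 points are hypothetical, picked only to make each result easy to check by hand.

```python
import pandas as pd

# Hypothetical points scored by three teams
scores = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'C'],
                       'Points': [10, 20, 30, 50, 5]})

grouped = scores.groupby('Team')['Points']

# Aggregation: one summary value per group
team_mean = grouped.mean()

# Transformation: result has the same index as the original rows
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep only the groups whose total is above 20
big_teams = scores.groupby('Team').filter(lambda g: g['Points'].sum() > 20)

print(team_mean)
print(centered)
print(big_teams)
```

Aggregation shrinks the data to one row per team, transformation keeps all five rows, and filtration drops team C entirely because its total is too low.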

Splitting an object into groups


pandas objects can be split on any of their axes. There are multiple ways to split an object like −
● obj.groupby('key')
● obj.groupby(['key1','key2'])
● obj.groupby(key,axis=1)
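A minimal sketch of the first two forms, using a hypothetical four-row subset of team data. The axis=1 form is left out here because recent pandas versions deprecate it.

```python
import pandas as pd

# Hypothetical subset of team results
teams = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
                      'Year': [2014, 2015, 2014, 2015],
                      'Points': [876, 789, 863, 673]})

# One key: one group per year
by_year = teams.groupby('Year')['Points'].sum()

# Two keys: one group per (team, year) combination
by_team_year = teams.groupby(['Team', 'Year'])['Points'].sum()

print(by_year)
print(by_team_year)
```

With two keys the result is indexed by both of them, so a single value is looked up with a tuple such as by_team_year[('Riders', 2014)].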

Ex1:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year':
[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df
# grouping on 'Year' gives four groups: 2014, 2015, 2016 and 2017

Let us group the above dataframe on the column year and try to see one of the groups. We are
going to use the functions groupby() to create the groups and get_group() to extract one of the
groups.

grpyear = df.groupby('Year')
g1 = grpyear.get_group(2014)
g1
Next we are going to apply an aggregate function on the group displayed above.

g1['Points'].agg('mean')

Now we are going to apply multiple aggregate functions to the created group

g1['Points'].agg(['mean','max','var'])

If we want to see all the groups created based on a given key, we need to iterate over the
grouped dataset:

for key, data in grpyear:
    display(key)
    display(data)
#13.4.2020

Ex: 2

import pandas as pd
import numpy as np

#Create a DataFrame
d={
'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
        'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
        'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
           'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
'Score':[62,47,55,74,31,77,85,63,42,67,89,81]}

df = pd.DataFrame(d,columns=['Name','Exam','Subject','Score'])
print (df)
Get mean score of a group using groupby function in pandas
# mean score of Students

df['Score'].groupby([df['Name']]).mean()
Get sum of score of a group using groupby function in pandas
# sum of score group by Name and Exam

df['Score'].groupby([df['Name'],df['Exam']]).sum()

Group the entire dataframe by Subject and Exam:


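# group the entire dataframe by Subject and Exam and total the Score column;
# numeric_only=True leaves out text columns such as Name

df.groupby(['Subject', 'Exam']).sum(numeric_only=True)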
Descriptive statistics of the group :

# descriptive statistics by group - subject

df['Score'].groupby(df['Subject']).describe()
14.4.2020

Sorting:

Sorting means arranging DataFrame data in ascending or descending order.

A DataFrame can be sorted by its values (columns) or by its index (rows).

It is often necessary to sort tabular data in order to get a particular view of the
data. Pandas provides a means to sort data based on the values in one or
more columns of the DataFrame. We can also sort a DataFrame based on its
row index. Hence the two sorting functions provided by pandas are:

sort_values() - to sort a pandas DataFrame by one or more columns.

sort_index() - to sort a pandas DataFrame by row index.

By default, both functions sort the rows in ascending order.

By value (column) – sort_values()


Syntax:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)

# e.g. by=['Name', 'Maths']

by = name of the column (or list of column names) to sort by
axis = 0 sorts the rows (default)
axis = 1 sorts the columns
ascending = by default it is True
inplace = by default it is False, i.e. a new DataFrame is returned

By index (row) – sort_index()

Syntax:
DataFrame.sort_index(axis=0, ascending=True, inplace=False)

sort_index() takes no by parameter, because it sorts on the index labels rather than on
column values.

axis = 0 sorts the rows by the row index (default)
axis = 1 sorts the columns by the column labels
ascending = by default it is True
inplace = by default it is False, i.e. a new DataFrame is returned

Both functions take a few further options and, unless inplace is True, return a new
DataFrame.
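The parameters above can be tried out on a small DataFrame. The sketch below uses hypothetical data (marks_df, with the letters 'c', 'a', 'b' as row labels) so that sorting by value and sorting by index give visibly different row orders.

```python
import pandas as pd

# Hypothetical marks with deliberately unordered row labels
marks_df = pd.DataFrame({'physics': [40, 90, 65], 'chem': [75, 80, 60]},
                        index=['c', 'a', 'b'])

by_value = marks_df.sort_values(by='physics')                 # rows ordered by physics marks
by_value_desc = marks_df.sort_values(by='physics', ascending=False)
by_index = marks_df.sort_index()                              # rows ordered by row label
by_columns = marks_df.sort_index(axis=1)                      # columns ordered by column name

print(by_value)
print(by_value_desc)
print(by_index)
print(by_columns)
```

Since inplace is left at its default of False, marks_df itself keeps its original order and each call returns a new DataFrame.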

EX1:

import pandas as pd

d1 = {'rollno':[101,101,103,102,104], 'name':
['Pat','Sid','Tom','Kim','Ray'],\
'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }

df = pd.DataFrame(d1)
display(df)

Now we are going to apply the sort_values() function to the DataFrame on
column 'physics'. We will see that the DataFrame is sorted in ascending
order of the physics column, and that the index labels move along with their rows.

pop_asc = df.sort_values('physics')
display(pop_asc)
Next we are going to sort on two columns, physics and chem. These
columns must be passed as a list to the sort_values()
function.

Note that when sorting by multiple columns, pandas sort_values() uses
the first column first and the second column next. We will get a
different result if we change the order of the columns in the sort_values()
function.

We can see that the DataFrame is sorted first on physics. Where two rows tie on physics,
Pat is placed ahead of Kim because of the lower score in chem, since both
the columns are sorted in ascending order.

df1 = df.sort_values(['physics','chem'])
display(df1)

[Figure: the original DataFrame and the DataFrame sorted on multiple columns (physics and chem)]


Next we are going to sort the DataFrame in descending order of physics. By
default sorting is done in ascending order and to change this order we need to
set the ascending parameter to False.

df_desc = df.sort_values('physics', ascending=False)

display(df_desc)

Next we are going to sort the dataframe on physics in ascending order and
chem in descending order.

We can see that the DataFrame is sorted first on physics, where
Kim is placed ahead of Pat because of the higher score in chem.

df1 = df.sort_values(['physics','chem'], ascending=[True, False])

display(df1)
In the above DataFrame we can see that the index is not in proper order. In
order to rearrange index we may sort the above DataFrame based on index
values by using sort_index() function as shown below.

df_desc_index = df_desc.sort_index()
display(df_desc_index)
At the end we are going to sort the original DataFrame without
creating a new DataFrame. This is done by setting the inplace parameter to
True when calling the sort_values() function.

df.sort_values('physics', inplace=True)
display(df)
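One point worth checking when inplace=True is used: the call changes the DataFrame itself and returns None, so its result should not be assigned back. A minimal sketch with a hypothetical three-row DataFrame named students:

```python
import pandas as pd

# Hypothetical three-row DataFrame
students = pd.DataFrame({'name': ['Pat', 'Sid', 'Tom'],
                         'physics': [90, 40, 50]})

result = students.sort_values('physics', inplace=True)

print(result)      # None: nothing is returned when inplace=True
print(students)    # students itself is now sorted by physics
```

The original index labels travel with their rows, which is why the index is no longer in order after the sort.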

Example 2:
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
                       'Rahul','David','Andrew','Ajay','Teresa']),
'Age':pd.Series([26,27,25,24,31,27,25,33,42,32,51,47]),
'Score':pd.Series([89,87,67,55,47,72,76,79,44,92,99,69])}

#Create a DataFrame
df = pd.DataFrame(d)
df
# sort the pandas dataframe by ascending value of single column
df.sort_values(by='Score')
# sort the pandas dataframe by descending value of single column

df.sort_values(by='Score',ascending=False)
# sort the pandas dataframe by multiple columns

df.sort_values(by=['Age', 'Score'],ascending=[True,False])