
Data Science

DAI-101 Spring 2024-25

Dr. Devesh Bhimsaria


Office: F9, Old Building
Department of Biosciences and Bioengineering
Indian Institute of Technology–Roorkee
devesh.bhimsaria@bt.iitr.ac.in
Python with Data Analysis
Data Cleaning: Outliers
⚫ Python program 1 for data cleaning
⚫ Interquartile Range (IQR) is a statistical measure that describes the spread of
the middle 50% of data points in a dataset. It is used to detect variability and
identify outliers in the data. The IQR is calculated as the difference between
the third quartile (Q3) and the first quartile (Q1):
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
⚫ Q1 (First Quartile): The 25th percentile; 25% of the data is smaller than this
value.
⚫ Q3 (Third Quartile): The 75th percentile; 75% of the data is smaller than this
value.
⚫ Median: The 50th percentile of the data, separating it into two halves.
Outlier Thresholds:
⚫ Lower Bound: 𝑄1 − 1.5 ∗ 𝐼𝑄𝑅
⚫ Upper Bound: 𝑄3 + 1.5 ∗ 𝐼𝑄𝑅
⚫ Values outside these thresholds are considered outliers.
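⚫ Example (Age data on the next slide, quartiles computed with pandas' default linear interpolation): Q1 = 26.75 and Q3 = 33.5, so IQR = 6.75, giving a lower bound of 26.75 − 1.5 × 6.75 = 16.625 and an upper bound of 33.5 + 1.5 × 6.75 = 43.625; the ages 120 and 100 fall outside these bounds and are flagged as outliers.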



Data Cleaning: Outliers
Original data                  After removing outliers
index   ID   Age               index   ID   Age
0        1    22               0        1    22
1        2    25               1        2    25
2        3    30               2        3    30
3        4    24               3        4    24
4        5    29               4        5    29
5        6    35               5        6    35
6        7   120               7        8    28
7        8    28               8        9    32
8        9    32               9       10    31
9       10    31               10      11    27
10      11    27               11      12    26
11      12    26               12      13    23
12      13    23               13      14    40
13      14    40               14      15    33
14      15    33               15      16    29
15      16    29               16      17    36
16      17    36               18      19    28
17      18   100               19      20    29
18      19    28
19      20    29

Lower Bound: 16.625
Upper Bound: 43.625
Installing libraries
⚫ In the terminal:
⚫ pip install pandas (General)
⚫ pip3 install pandas (Python3)

⚫ After installation, import them in your code:

import pandas as pd
import matplotlib.pyplot as plt



Data Cleaning: Outliers
⚫ Python code
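⚫ The original listing is not reproduced in this text; below is a minimal pandas sketch of IQR-based outlier removal, using the Age values from the table shown earlier (bounds 16.625 and 43.625).

import pandas as pd

# Example data containing two outliers (ages 120 and 100), as in the table above
df = pd.DataFrame({'ID': range(1, 21),
                   'Age': [22, 25, 30, 24, 29, 35, 120, 28, 32, 31,
                           27, 26, 23, 40, 33, 29, 36, 100, 28, 29]})

# Quartiles and IQR
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Outlier thresholds
lower = Q1 - 1.5 * IQR   # 16.625 for this data
upper = Q3 + 1.5 * IQR   # 43.625 for this data

# Keep only the rows whose Age lies inside the bounds
df_clean = df[(df['Age'] >= lower) & (df['Age'] <= upper)]
print(df_clean)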



Data Reduction: PCA
⚫ Principal Component Analysis (PCA) is fundamentally based on the
mathematics of eigenvalues and eigenvectors.
⚫ Step 1: Compute the covariance matrix. It captures the variance (diagonal
elements) and the covariance between features (off-diagonal elements). If 𝑋 is
the centered dataset (mean 0) with 𝑛 samples and 𝑝 features, Σ is the 𝑝 × 𝑝
symmetric matrix:
Σ = 𝑋ᵀ𝑋 / (𝑛 − 1)
⚫ Step 2: Solve for the eigenvalues 𝜆 and eigenvectors 𝒗:
Σ𝒗 = 𝜆𝒗
⚫ Step 3: Order the principal components in descending order of eigenvalue.
⚫ Step 4: Principal Components: The eigenvectors are the principal axes that
define the new coordinate system. The data can be projected onto these axes
to form the principal components:
𝑍= 𝑋𝑉
⚫ 𝑍 : Transformed data in the reduced dimension space.
⚫ 𝑉 : Matrix of eigenvectors corresponding to the top 𝑘 eigenvalues.
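⚫ A minimal NumPy sketch of these four steps (illustrative; the function and variable names are assumptions, and np.linalg.eigh is used because Σ is symmetric):

import numpy as np

def pca(X, k):
    # Center the data (mean 0 per feature)
    Xc = X - X.mean(axis=0)
    n = Xc.shape[0]

    # Step 1: covariance matrix, Σ = XᵀX / (n − 1)
    cov = Xc.T @ Xc / (n - 1)

    # Step 2: eigenvalues and eigenvectors of the symmetric matrix Σ
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 3: sort components by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: project onto the top k eigenvectors, Z = XV
    V = eigvecs[:, :k]
    Z = Xc @ V
    return Z, eigvals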
Data Reduction: PCA
⚫ Explained Variance Ratio
⚫ The explained variance ratio tells you how much of the total variance in the
original data is captured by each principal component (PC).
⚫ Variance measures how much the data spreads out (varies) along a particular
dimension. PCA tries to find new axes (principal components) that
maximize the variance in the data.
⚫ Math: Let
⚫ 𝑇𝑜𝑡𝑎𝑙 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = ∑ᵢ 𝜆ᵢ (the sum of all eigenvalues of the covariance matrix),
⚫ 𝜆ᵢ : the eigenvalue corresponding to the i-th principal component.
⚫ The explained variance ratio for the i-th component is:
𝐸𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 𝑅𝑎𝑡𝑖𝑜ᵢ = 𝜆ᵢ / 𝑇𝑜𝑡𝑎𝑙 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒
⚫ This represents the proportion of the dataset’s total variance explained by
the i-th component.
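⚫ Continuing the NumPy sketch above (with eigvals the eigenvalues of the covariance matrix, sorted in descending order), the ratio is a one-liner:

import numpy as np

evr = eigvals / eigvals.sum()   # explained variance ratio of each component
cumulative = np.cumsum(evr)     # total variance retained by the first k components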



Data Reduction: Cluster & Sample
⚫ Python code
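⚫ The original listing is not reproduced in this text; the sketch below shows one common reading of cluster-based sampling (cluster the data, then sample a fraction of each cluster) using scikit-learn's KMeans, with the cluster count and sampling fraction as illustrative assumptions.

import pandas as pd
from sklearn.cluster import KMeans

def cluster_sample(df, features, n_clusters=3, frac=0.3, seed=0):
    # Assign every row to a cluster based on the selected feature columns
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out['cluster'] = km.fit_predict(out[features])

    # Sample the same fraction from each cluster, preserving the overall structure
    return (out.groupby('cluster', group_keys=False)
               .apply(lambda g: g.sample(frac=frac, random_state=seed)))

Sampling within clusters keeps every region of the feature space represented in the reduced dataset.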







Wavelet transform example
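⚫ The figure from this slide is not reproduced here; the sketch below illustrates the idea with PyWavelets (pywt), where the signal and wavelet choice are illustrative assumptions.

import numpy as np
import pywt

# Illustrative signal: a noisy sine wave
x = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)

# Single-level discrete wavelet transform: approximation + detail coefficients
cA, cD = pywt.dwt(x, 'db2')

# Data reduction: keep only the approximation coefficients (roughly half the length)
x_reduced = cA

# Approximate reconstruction from the kept coefficients alone
x_approx = pywt.idwt(cA, np.zeros_like(cD), 'db2')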



Data Reduction: PCA
⚫ Python code
⚫ Original Dataset Shape: (150, 4)
⚫ 2 PCA components:
⚫ Reduced Dataset Shape: (150, 2)
⚫ Explained Variance Ratio: [0.72962445 0.22850762]
⚫ Total Variance Retained: 0.9581320720000164
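⚫ The original listing is not shown in this text; the ratios above match PCA applied to the standardized Iris dataset, so the sketch below assumes a StandardScaler step (the print formatting is illustrative).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                              # shape (150, 4)
X_scaled = StandardScaler().fit_transform(X)      # standardize each feature to mean 0, variance 1

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)           # shape (150, 2)

print("Original Dataset Shape:", X.shape)
print("Reduced Dataset Shape:", X_reduced.shape)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Total Variance Retained:", pca.explained_variance_ratio_.sum())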



Data Reduction: PCA
⚫ Python code
⚫ 3 PCA components:
⚫ Reduced Dataset Shape: (150, 3)
⚫ Explained Variance Ratio: [0.72962445 0.22850762 0.03668922]
⚫ Cumulative Variance Retained: [0.72962445 0.95813207 0.99482129]
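⚫ With the same standardized Iris setup as above, only the number of components changes (a sketch):

import numpy as np

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)           # shape (150, 3)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Cumulative Variance Retained:", np.cumsum(pca.explained_variance_ratio_))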



Data Reduction: Linear regression
⚫ Python code
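⚫ The original listing is not reproduced in this text; the sketch below shows regression as parametric data reduction (the y values are summarized by a fitted slope and intercept) using scikit-learn, with synthetic data as an illustrative assumption.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: y depends (noisily) on x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 1.0, size=100)

# Fit the line: the whole y column is now summarized by two parameters
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Reduced representation: store only the parameters and reconstruct y on demand
y_approx = model.predict(x)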







Thank You
• All my slides/notes, excluding third-party material, are licensed by various authors including myself under https://creativecommons.org/licenses/by-nc/4.0/

