2.1 Exploratory Data Analysis Using Python

This document provides an exploratory data analysis of the Iris dataset using various Python techniques. It begins with getting summary statistics of the dataset such as the number of rows and columns, data types, and descriptive statistics. Potential issues like missing values, duplicates, and imbalanced classes are checked for. Relationships between variables are visualized with scatter plots and pair plots. Correlations are calculated and a heatmap is used to visualize correlations. Outliers in the SepalWidth column are detected with a boxplot and then removed from the dataset.

Uploaded by

Kakashi Hatake


Exploratory Data Analysis (EDA) is a technique for analyzing data using visual techniques. With it, we can get a detailed statistical summary of the data. We can also deal with duplicate values and outliers, and spot trends or patterns present in the dataset.
Note: We will be using the Iris dataset.

Getting Information about the Dataset

We will use the shape attribute to get the shape of the dataset.
Shape of Dataframe
 Python3
df.shape

Output:
(150, 6)
We can see that the dataframe contains 6 columns and 150 rows.
Now, let’s also look at the columns and their data types. For this, we will use the info() method.
Information about Dataset
 Python3
df.info()

Output:

information about the dataset

We can see that only one column has categorical data and all the other columns are of the
numeric type with non-Null entries.
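The same facts that shape and info() report can also be checked programmatically. The following is a minimal sketch on a hypothetical three-row stand-in frame (the name toy and its values are made up for illustration, they are not read from Iris.csv):

```python
import pandas as pd

# Hypothetical three-row stand-in with a few Iris-style columns
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalWidthCm": [3.5, 3.0, 3.2],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Columns that are not numeric, i.e. the categorical/text ones
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# True when every cell in the frame is non-null
all_present = bool(toy.notnull().all().all())
print(all_present)
```

On the real dataframe, the same calls should confirm what info() shows: one non-numeric column and no null entries.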
Let’s get a quick statistical summary of the dataset using the describe() method. The describe() function applies basic statistical computations on the dataset, such as extreme values, count of data points, standard deviation, etc. Any missing value or NaN value is automatically skipped. The describe() function gives a good picture of the distribution of the data.
Description of dataset
 Python3
df.describe()

Output:

Description about the dataset

We can see the count of each column along with their mean value, standard deviation,
minimum and maximum values.
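To see how describe() skips missing values, here is a small sketch on made-up data (the frame toy and its columns are hypothetical, not part of the Iris dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from Iris.csv)
toy = pd.DataFrame({
    "width": [3.0, 3.5, np.nan, 2.5],
    "label": ["a", "b", "a", "b"],
})

summary = toy.describe()

# Only the numeric column is summarized by default
print(list(summary.columns))

# NaN is skipped: 3 of the 4 rows are counted
print(summary.loc["count", "width"])

# The mean is taken over the 3 non-missing values
print(summary.loc["mean", "width"])
```

Comparing the count row with the number of rows in the frame is a quick way to notice columns that contain missing values.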

Checking Missing Values

We will check if our data contains any missing values or not. Missing values can occur
when no information is provided for one or more items or for a whole unit. We will use
the isnull() method.
python code for missing value
 Python3
df.isnull().sum()

Output:
Missing values in the dataset

We can see that no column has any missing value.
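Since the Iris data has no missing values, the output above is all zeros. As an illustration of what isnull().sum() reports when values are missing, here is a sketch on a hypothetical frame with a few gaps (all names and values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing cell per column (hypothetical values)
toy = pd.DataFrame({
    "length": [5.1, np.nan, 4.7],
    "width": [3.5, 3.0, np.nan],
    "species": ["setosa", None, "setosa"],
})

# isnull() marks each missing cell True; sum() counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```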

Checking Duplicates

Let’s see if our dataset contains any duplicates or not. The pandas drop_duplicates() method helps in removing duplicates from the data frame. With subset="Species", it keeps only the first row for each distinct species.
Pandas function for dropping duplicates
 Python3
data = df.drop_duplicates(subset="Species")
data

Output:

Dropping duplicate value in the dataset

We can see that there are only three unique species. Let’s see if the dataset is balanced or not, i.e. whether all the species have an equal number of rows. We will use the Series.value_counts() function. This function returns a Series containing counts of unique values.
Python code for value counts in the column
 Python3
df["Species"].value_counts()

Output:
value count in the dataset

We can see that all the species have the same number of rows, so the dataset is balanced and we should not delete any entries.
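The difference between fully duplicated rows and repeated values in a single column is easy to miss. A minimal sketch on hypothetical toy data (not the Iris rows) that contrasts the two, and also shows value_counts() on an unbalanced column:

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first row exactly
toy = pd.DataFrame({
    "width": [3.5, 3.0, 3.5, 2.8],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

# duplicated() flags rows that repeat an earlier row in every column
n_duplicate_rows = int(toy.duplicated().sum())

# drop_duplicates() with no subset removes only those fully repeated rows
deduplicated = toy.drop_duplicates()

# With subset="species", only the first row of each species is kept
one_per_species = toy.drop_duplicates(subset="species")

# value_counts() shows how many rows each class has
counts = toy["species"].value_counts()

print(n_duplicate_rows, len(deduplicated), len(one_per_species))
print(counts)
```

In this toy frame the classes are unbalanced (three rows against one), which is the situation the value_counts() check is meant to reveal.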

Relation between variables

We will see the relationship between the sepal length and sepal width and also between
petal length and petal width.
Comparing Sepal Length and Sepal Width
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df)

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:
Scatter plot using matplotlib library

From the above plot, we can infer that:

• Species Setosa has smaller sepal lengths but larger sepal widths.
• Species Versicolor lies in the middle of the other two species in terms of sepal length and width.
• Species Virginica has larger sepal lengths but smaller sepal widths.
Comparing Petal Length and Petal Width
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',
                hue='Species', data=df)

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:
Scatter plot of petal length and petal width

From the above plot, we can infer that:

• Species Setosa has the smallest petal lengths and widths.
• Species Versicolor lies in the middle of the other two species in terms of petal length and width.
• Species Virginica has the largest petal lengths and widths.
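Observations like the ones read off the scatter plots can be cross-checked numerically with per-species means. This is a sketch only: the six rows below are a hand-typed, illustrative stand-in in the style of the Iris data, not the full Iris.csv, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Illustrative rows only (hand-typed stand-in, not the full Iris.csv)
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 7.1],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.9],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor",
                "Iris-virginica", "Iris-virginica"],
})

# Mean of each measurement within each species
species_means = toy.groupby("Species").mean()
print(species_means)

# Species ordered from smallest to largest mean petal length
order = species_means["PetalLengthCm"].sort_values().index.tolist()
print(order)
```

Running the same groupby on the real dataframe (after dropping the Id column) gives a table that can be compared against the bullet points above.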
Let’s plot the relationships between all the columns using a pairplot. It can be used for multivariate analysis.
Python code for pairplot
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

# Drop the Id column so that only the measurements are compared
sns.pairplot(df.drop(['Id'], axis=1),
             hue='Species', height=2)

plt.show()

Output:
Pairplot for the dataset

We can see many kinds of relationships in this plot. For example, the species Setosa has the smallest petal widths and lengths. It also has the smallest sepal lengths but larger sepal widths. Similar information can be gathered about the other species.

Handling Correlation

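Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded. Older pandas versions silently ignore non-numeric columns; from pandas 2.0 onward, corr() raises an error on them unless the non-numeric columns are dropped first or numeric_only=True is passed.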
Example:
 Python3

# Correlate the numeric columns of the full dataframe
df.select_dtypes(include='number').corr(method='pearson')

Output:

correlation between columns in the dataset
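To make the meaning of the correlation values concrete, here is a sketch on hypothetical toy columns (x, y, z and label are made-up names) where the expected result is known in advance, together with the Pearson coefficient computed by hand for one pair:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # y = 2x, rises exactly with x
    "z": [4.0, 3.0, 2.0, 1.0],   # z falls as x rises
    "label": ["a", "b", "a", "b"],
})

# Keep numeric columns only, then compute the Pearson correlation matrix
corr = toy.select_dtypes(include="number").corr(method="pearson")
print(corr)

# Pearson r by hand for (x, y): sum of co-deviations divided by the
# square root of the product of the summed squared deviations
x, y = toy["x"], toy["y"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print(r_manual)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.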

Heatmaps

A heatmap is a data visualization technique that shows the values of a two-dimensional table as colors. Here, we use it to display the correlation between all the numerical variables in the dataset. In simpler terms, we can plot the correlation matrix found above as a heatmap.
python code for heatmap
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, without the Id column
corr = df.drop(columns=['Id']).select_dtypes(include='number').corr(method='pearson')

sns.heatmap(corr, annot=True)

plt.show()

Output:
Heatmap for correlation in the dataset

From the above graph, we can see that:

• Petal width and petal length are very highly correlated.
• Petal length and sepal length are strongly correlated.
• Petal width and sepal length are strongly correlated.

Handling Outliers

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing one works the same way as removing any other data item from a pandas dataframe.
Let’s consider the iris dataset and let’s plot the boxplot for the SepalWidthCm column.
python code for Boxplot
 Python3
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)

plt.show()

Output:

Boxplot for sepalwidth column

In the above graph, the points drawn beyond the whiskers (roughly above 4 and at about 2 or below) are acting as outliers.

Removing Outliers

To remove an outlier, we drop the corresponding row from the dataset using its exact position, because every detection method ends with a list of the data items that satisfy that method's definition of an outlier.
We will detect the outliers using the IQR (interquartile range) and then remove them. We will also draw the boxplot again to see whether the outliers have been removed.
 Python3
# Importing
import numpy as np
import pandas as pd
import seaborn as sns

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (NumPy >= 1.22 names this argument `method`;
# older versions call it `interpolation`)
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Row positions at or above the upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))

# Row positions at or below the lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the Outliers (positions match the index labels here
# because read_csv assigns a default 0..n-1 index)
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
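The IQR rule can also be written with a boolean mask instead of np.where and drop. The following is a sketch of that alternative on a hypothetical toy column (the frame toy and its values are made up), using pandas' own quantile() with midpoint interpolation rather than np.percentile:

```python
import pandas as pd

# Toy sample (hypothetical values); 9.0 is an obvious outlier
toy = pd.DataFrame({"width": [2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 9.0]})

# Quartiles with midpoint interpolation, as in the code above
q1 = toy["width"].quantile(0.25, interpolation="midpoint")
q3 = toy["width"].quantile(0.75, interpolation="midpoint")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the rows strictly inside the fences
inside = (toy["width"] > lower_fence) & (toy["width"] < upper_fence)
cleaned = toy[inside]

print("Old Shape: ", toy.shape)
print("New Shape: ", cleaned.shape)
```

Filtering with a mask returns a new frame and leaves the original untouched, and it does not rely on the row positions matching the index labels.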
