2.1 Exploratory Data Analysis Using Python

This document provides an exploratory data analysis of the Iris dataset using various Python techniques. It begins with getting summary statistics of the dataset such as the number of rows and columns, data types, and descriptive statistics. Potential issues like missing values, duplicates, and imbalanced classes are checked for. Relationships between variables are visualized with scatter plots and pair plots. Correlations are calculated and a heatmap is used to visualize correlations. Outliers in the SepalWidth column are detected with a boxplot and then removed from the dataset.

Uploaded by

Kakashi Hatake

Exploratory Data Analysis (EDA) is a technique for analyzing data, largely through visual
techniques. With it, we can get detailed information about the statistical summary of the
data. We will also be able to deal with duplicate values and outliers, and see trends or
patterns present in the dataset.
Note: We will be using Iris Dataset.
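The snippets in this document assume the data is already loaded into a pandas DataFrame named df. A minimal sketch of that step follows, assuming the dataset is stored as Iris.csv (the file name used later in this document) with the column names that appear in the plots. The three inline rows and the species labels are only a stand-in so that the example runs without the file.

```python
import io

import pandas as pd

# Stand-in for Iris.csv: three rows, one per species, for illustration only.
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
"""

# With the real file on disk this would be: df = pd.read_csv("Iris.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
```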

Getting Information about the Dataset

We will use the shape attribute to get the shape of the dataset as a (rows, columns) tuple.
Shape of Dataframe 
 Python3
df.shape

Output:
(150, 6)
We can see that the dataframe contains 6 columns and 150 rows.
Now, let’s also look at the columns and their data types. For this, we will use the info() method.
Information about Dataset 
 Python3
df.info()

Output:

information about the dataset 

We can see that only one column has categorical data and all the other columns are of the
numeric type with non-Null entries.
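The same split between numeric and categorical columns can also be obtained programmatically. The sketch below uses select_dtypes() on a small hand-made frame; the three rows are illustrative stand-ins, not the full dataset.

```python
import pandas as pd

# Tiny stand-in frame with two numeric columns and one text column.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Columns with a numeric dtype versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```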
Let’s get a quick statistical summary of the dataset using the describe() method. It
computes basic statistics for each numeric column, such as the count of data points, mean,
standard deviation, quartiles, and extreme values. Any missing (NaN) value is
automatically skipped. describe() gives a good picture of the distribution of the data.
Description of dataset 
 Python3
df.describe()

Output:

Description about the dataset 

We can see the count of each column along with their mean value, standard deviation,
minimum and maximum values.
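To illustrate how describe() skips missing values, here is a minimal sketch on a made-up series with one NaN deliberately inserted; the values are for illustration only.

```python
import numpy as np
import pandas as pd

# Four entries, one of them missing.
widths = pd.Series([3.5, 3.0, np.nan, 3.2], name="SepalWidthCm")

summary = widths.describe()

# count reports only the non-missing entries, and the mean is taken over them.
print(summary["count"])
print(summary["mean"])
```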

Checking Missing Values

We will check if our data contains any missing values or not. Missing values can occur
when no information is provided for one or more items or for a whole unit. We will use
the isnull() method.
python code for missing value
 Python3
df.isnull().sum()

Output:
Missing values in the dataset 

We can see that no column has any missing value.
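Since the Iris data has no gaps, the effect of isnull() is easier to see on a frame that does. The sketch below uses made-up rows with missing entries injected on purpose; it counts missing values per column and pulls out the affected rows.

```python
import numpy as np
import pandas as pd

# Made-up rows with two deliberately missing entries.
df = pd.DataFrame({
    "SepalWidthCm": [3.5, np.nan, 3.3],
    "Species": ["Iris-setosa", "Iris-versicolor", None],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing))
```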

Checking Duplicates

Let’s see if our dataset contains any duplicates. The pandas drop_duplicates() method
removes duplicate rows from a DataFrame. Passing subset="Species" treats rows with the
same species as duplicates, so only the first row of each species is kept.
Pandas function for dropping duplicates
 Python3
data = df.drop_duplicates(subset="Species")
data

Output:

Dropping duplicate value in the dataset 

We can see that there are only three unique species. Let’s check whether the dataset is
balanced, i.e. whether every species has the same number of rows. We will use
the Series.value_counts() function, which returns a Series containing the count of each
unique value.
Python code for value counts in the column
 Python3
df["Species"].value_counts()

Output:
value count in the dataset 

We can see that every species has the same number of rows, so the dataset is balanced and
we should not delete any entries.
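The duplicate and balance checks in this section can also be sketched on a tiny hand-made frame. The rows below are illustrative, with one exact duplicate added on purpose. The Id column is left out here as an assumption of the sketch: with a unique Id in every row, no two rows would ever be fully identical.

```python
import pandas as pd

# Four made-up rows; the second is an exact copy of the first.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.5, 3.2, 3.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(df.duplicated().sum())

# Remove fully duplicated rows, keeping the first occurrence.
deduplicated = df.drop_duplicates()

# Rows per species; the classes are balanced when all counts are equal.
class_counts = deduplicated["Species"].value_counts()
is_balanced = class_counts.nunique() == 1

print(n_duplicates)
print(class_counts)
print(is_balanced)
```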

Relation between variables

We will see the relationship between the sepal length and sepal width and also between
petal length and petal width.
Comparing Sepal Length and Sepal Width
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
 
 
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df)
 
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
 
plt.show()

Output:
Scatter plot using matplotlib library 

From the above plot, we can infer that:
• Species Setosa has smaller sepal lengths but larger sepal widths.
• Species Versicolor lies between the other two species in terms of sepal length and width.
• Species Virginica has larger sepal lengths but smaller sepal widths.
Comparing Petal Length and Petal Width
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
 
 
sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',
                hue='Species', data=df)
 
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
 
plt.show()

Output:
Scatter plot of petal length vs. petal width

From the above plot, we can infer that:
• Species Setosa has the smallest petal lengths and widths.
• Species Versicolor lies between the other two species in terms of petal length and width.
• Species Virginica has the largest petal lengths and widths.
Let’s plot the relationships between all the columns using a pairplot, which is useful for
multivariate analysis.
Python code for pairplot 
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
 
 
sns.pairplot(df.drop(['Id'], axis = 1),
             hue='Species', height=2)

Output:
Pairplot for the dataset 

We can see many types of relationships from this plot. For example, the species Setosa has
the smallest petal widths and lengths. It also has the smallest sepal lengths but larger sepal
widths. Similar information can be gathered about the other species.
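Observations read off a pairplot can be cross-checked numerically with a per-species summary. The following is a minimal sketch using groupby() on a handful of illustrative rows (two per species); on the full dataset the same call would be applied to df.

```python
import pandas as pd

# Two illustrative rows per species, not the full dataset.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Mean of every numeric column within each species.
species_means = df.groupby("Species").mean()

print(species_means)
```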

Handling Correlation

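The pandas DataFrame.corr() method computes the pairwise correlation of all columns in the
DataFrame. Any NA values are automatically excluded. Non-numeric columns such as Species
cannot be correlated: older pandas versions silently ignore them, while pandas 2.0 and
later raise an error unless numeric_only=True is passed.
Example:
 Python3

df.corr(method='pearson', numeric_only=True)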

Output:

correlation between columns in the dataset 
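To make the Pearson coefficient concrete, the sketch below computes it by hand for two columns and compares the result with the pandas Series.corr() call. The six value pairs are illustrative only.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.4, 1.4, 4.7, 4.5, 6.0, 5.1], name="PetalLengthCm")
y = pd.Series([0.2, 0.2, 1.4, 1.5, 2.5, 1.9], name="PetalWidthCm")

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = x.corr(y, method="pearson")

print(r_manual, r_pandas)
```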

Heatmaps

A heatmap is a data visualization technique that represents the values of a two-dimensional
matrix as colors. Here, we use it to plot the correlation matrix found above, so that the
correlation between every pair of numeric variables can be read at a glance.
python code for heatmap 
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
 
 
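# Drop the Id column and correlate only the numeric columns
# (numeric_only requires pandas 1.5 or later)
corr = df.drop(columns=['Id']).corr(method='pearson', numeric_only=True)
sns.heatmap(corr, annot=True)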
 
plt.show()

Output:
Heatmap for correlation in the dataset 

From the above graph, we can see that:
• Petal width and petal length are highly correlated.
• Petal length and sepal length are strongly correlated.
• Petal width and sepal length are strongly correlated.
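Rather than reading the strongest relationships off the colors, the pairs can be ranked directly from the correlation matrix. This is a minimal sketch on six illustrative rows; it keeps only the upper triangle so that each pair appears once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd

# Six illustrative rows, numeric columns only.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

corr = df.corr(method="pearson")

# Boolean mask selecting the cells strictly above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per column pair, sorted from most positive to most negative.
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs)
```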

Handling Outliers

An outlier is a data item that deviates significantly from the rest of the (so-called
normal) objects. Outliers can be caused by measurement or execution errors. The analysis
for outlier detection is referred to as outlier mining. There are many ways to detect
outliers, and removing them works the same way as removing any other data item from a
pandas DataFrame.
Let’s consider the Iris dataset and plot the boxplot for the SepalWidthCm column.
python code for Boxplot 
 Python3
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:

Boxplot for sepalwidth column 

In the above graph, the points drawn beyond the whiskers, roughly above 4 and at about 2, are outliers.
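A common rule for where such cut-offs fall is the 1.5 × IQR rule. The sketch below computes those fences for a small made-up series; iqr_fences is a hypothetical helper name, and it uses the default linear quantile interpolation in pandas, which can give slightly different quartiles than the midpoint interpolation used later in this document.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up widths with one unusually small and one unusually large value.
widths = pd.Series([2.0, 2.8, 3.0, 3.0, 3.2, 3.3, 3.4, 4.4], name="SepalWidthCm")

lower, upper = iqr_fences(widths)

# Values outside the fences are flagged as outliers.
outliers = widths[(widths < lower) | (widths > upper)]

print(lower, upper)
print(outliers.tolist())
```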

Removing Outliers

To remove outliers, we follow the same process as removing any other entry from the
dataset by its position, because every detection method ultimately produces a list of the
data items that satisfy its outlier definition.
We will detect the outliers using the interquartile range (IQR) and then remove them. We
will also draw the boxplot again to check that the outliers are gone.
 Python3
# Importing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (quartiles computed with midpoint interpolation;
# on NumPy older than 1.22 use interpolation='midpoint' instead)
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Positions of rows at or above the upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))

# Positions of rows at or below the lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the outliers (positions equal index labels here
# because read_csv creates a default 0..n-1 index)
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:
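As an alternative to dropping rows by position, the same filtering can be written with a boolean mask, which does not depend on the index labels and leaves the original frame untouched. This is a sketch on made-up values, not the output of the code above; it uses the midpoint interpolation of Series.quantile() to match the quartile definition used there.

```python
import pandas as pd

# Made-up frame with one low and one high outlier in SepalWidthCm.
df = pd.DataFrame({
    "Id": range(1, 9),
    "SepalWidthCm": [2.0, 2.8, 3.0, 3.0, 3.2, 3.3, 3.4, 4.4],
})

q1 = df["SepalWidthCm"].quantile(0.25, interpolation="midpoint")
q3 = df["SepalWidthCm"].quantile(0.75, interpolation="midpoint")
iqr = q3 - q1

# Keep rows strictly inside the fences, so that values at or beyond
# a fence are removed, as with the >= and <= tests used above.
inside = (df["SepalWidthCm"] > q1 - 1.5 * iqr) & (df["SepalWidthCm"] < q3 + 1.5 * iqr)
cleaned = df[inside].reset_index(drop=True)

print("Old Shape: ", df.shape)
print("New Shape: ", cleaned.shape)
```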
