2.1 Exploratory Data Analysis Using Python
We will use the shape attribute to get the dimensions of the dataset.
Shape of Dataframe
Python3
df.shape
Output:
(150, 6)
We can see that the dataframe contains 6 columns and 150 rows.
Now, let’s also look at the columns and their data types. For this, we will use the info() method.
Information about Dataset
Python3
df.info()
Output:
We can see that only one column has categorical data and all the other columns are of the
numeric type with non-Null entries.
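The same split between categorical and numeric columns can also be checked programmatically. The sketch below uses a small hypothetical frame named toy (not the Iris file itself) with the same column names, and assumes pandas is installed.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the Iris column layout
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Columns holding numbers versus everything else (here, the text labels)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real dataframe, the same two calls on df should list Species as the only non-numeric column.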
Let’s get a quick statistical summary of the dataset using the describe() method. It
computes basic statistics for each numeric column, such as the count of data points, the
mean, the standard deviation, and the extreme values. Missing (NaN) values are skipped
automatically, so describe() gives a good first picture of how the data is distributed.
Description of dataset
Python3
df.describe()
Output:
We can see the count of each column along with their mean value, standard deviation,
minimum and maximum values.
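To illustrate how describe() skips missing values, here is a minimal sketch on a hypothetical one-column frame (the values are made up for illustration and are not taken from the Iris file).

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing measurement
widths = pd.DataFrame({"SepalWidthCm": [3.0, 3.5, np.nan, 2.5]})

summary = widths.describe()

# count reports only the three non-missing values,
# and the mean is computed over those three values
print(summary.loc["count", "SepalWidthCm"])
print(summary.loc["mean", "SepalWidthCm"])
```

The count row is therefore a quick way to spot columns with missing entries: it will be lower than the number of rows.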
We will check whether our data contains any missing values. Missing values can occur
when no information is provided for one or more items or for a whole unit. We will use
the isnull() method together with sum() to count the missing values in each column.
python code for missing value
Python3
df.isnull().sum()
Output:
Missing values in the dataset
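As a small illustration of what isnull().sum() reports when values are actually missing, the sketch below uses a hypothetical frame with gaps (the Iris columns themselves are reported above as non-null).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 6.3],
    "Species": ["Iris-setosa", None, "Iris-virginica"],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```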
Checking Duplicates
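A minimal sketch of a duplicate check, using a hypothetical frame rather than the Iris file. It assumes two illustrative goals: counting fully repeated rows with duplicated(), and keeping one row per species with drop_duplicates(subset=...).

```python
import pandas as pd

# Hypothetical frame: the second row repeats the first one exactly
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-virginica"],
})

# Rows that repeat an earlier row in every column
n_duplicate_rows = toy.duplicated().sum()

# Keep the first row seen for each species
one_per_species = toy.drop_duplicates(subset="Species")

print(n_duplicate_rows)
print(one_per_species)
```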
Output:
We can see that there are only three unique species. Let’s see whether the dataset is
balanced, i.e. whether every species has the same number of rows. We will use
the Series.value_counts() function. This function returns a Series containing the counts of
unique values.
Python code for value counts in the column
Python3
df["Species"].value_counts()
Output:
value count in the dataset
We can see that all the species contain an equal number of rows, so the dataset is
balanced and we should not delete any entries.
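The balance check can also be expressed in code. The sketch below builds a hypothetical Species column with 50 labels per class (consistent with 150 rows split equally across three species) instead of reading the file; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical label column: 50 rows for each of the three species
species = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="Species",
)

counts = species.value_counts()                # absolute counts per class
shares = species.value_counts(normalize=True)  # fraction of rows per class

# The data is balanced when every class has the same count
is_balanced = counts.nunique() == 1

print(counts)
print(is_balanced)
```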
We will see the relationship between the sepal length and sepal width and also between
petal length and petal width.
Comparing Sepal Length and Sepal Width
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
Output:
Scatter plot using matplotlib library
Output:
Scatter plot of petal length and petal width
Output:
Pairplot for the dataset
We can see many types of relationships in this plot. For example, the species Setosa has the
smallest petal widths and lengths. It also has the smallest sepal lengths but larger sepal
widths. Similar information can be gathered about the other species.
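Readings taken from a plot like this can be cross-checked numerically with a per-species average. The sketch below is a minimal example on a hypothetical four-row frame with made-up measurements; on the real data, the same groupby call would be applied to df.

```python
import pandas as pd

# Hypothetical measurements, two rows per species, for illustration only
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-virginica", "Iris-virginica"],
    "PetalLengthCm": [1.4, 1.6, 6.0, 5.0],
    "PetalWidthCm": [0.2, 0.4, 2.5, 1.5],
})

# Mean petal size for each species
means = toy.groupby("Species")[["PetalLengthCm", "PetalWidthCm"]].mean()

# Species with the smallest average petal length
smallest_petals = means["PetalLengthCm"].idxmin()

print(means)
print(smallest_petals)
```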
Handling Correlation
# numeric_only=True (pandas 1.5+) skips the text column Species
df.corr(method='pearson', numeric_only=True)
Output:
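To make the meaning of a Pearson coefficient concrete, the sketch below computes it by hand for two columns and compares the result with pandas. The four rows are hypothetical values chosen for illustration, not rows from the Iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical petal measurements, for illustration only
toy = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0, 5.1],
    "PetalWidthCm": [0.2, 1.4, 2.5, 1.9],
})

x = toy["PetalLengthCm"].to_numpy()
y = toy["PetalWidthCm"].to_numpy()

# Deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient as computed by pandas
r_pandas = toy["PetalLengthCm"].corr(toy["PetalWidthCm"], method="pearson")

print(r_manual, r_pandas)
```

Both values should agree up to floating-point error, and each entry of the correlation matrix is one such coefficient for a pair of columns.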
Heatmaps
A heatmap is a data visualization technique that represents the values of a two-dimensional
table as colors. Here we use it to show the correlation between all numerical variables in
the dataset. In simpler terms, we can plot the correlation matrix found above as a heatmap.
python code for heatmap
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# numeric_only=True (pandas 1.5+) skips the text column Species
sns.heatmap(df.corr(method='pearson', numeric_only=True).drop(
    ['Id'], axis=1).drop(['Id'], axis=0),
    annot=True)
plt.show()
Output:
Heatmap for correlation in the dataset
Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called
normal) objects. Outliers can be caused by measurement or execution errors. The analysis for
outlier detection is referred to as outlier mining. There are many ways to detect outliers,
and removing them works the same way as removing any other data item from a pandas
dataframe.
Let’s consider the iris dataset and let’s plot the boxplot for the SepalWidthCm column.
python code for Boxplot
Python3
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Output:
In the above graph, the values above 4 and below 2 are acting as outliers.
Removing Outliers
To remove outliers, we follow the same process as removing any entry from the dataset
by its exact position, because every detection method ends with a list of the data
items that satisfy the outlier definition used by that method.
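A minimal sketch of that removal step, assuming some detection method has already produced a list of row labels. The frame and the outlier_labels list are hypothetical.

```python
import pandas as pd

# Hypothetical column with two extreme values
toy = pd.DataFrame({"SepalWidthCm": [3.0, 4.4, 3.2, 2.0, 3.1]})

# Suppose a detection method flagged these row labels
outlier_labels = [1, 3]

# drop() removes rows by index label and returns a new frame
cleaned = toy.drop(index=outlier_labels)

print(cleaned)
```

Note that drop() works on index labels. With the default integer index, labels and positions coincide, which is why positions can be passed directly.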
We will detect the outliers using IQR and then we will remove them. We will also draw
the boxplot to see if the outliers are removed or not.
Python3
# Importing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('Iris.csv')
# IQR (NumPy 1.22+ uses method=; older versions use interpolation=)
Q1 = np.percentile(df['SepalWidthCm'], 25,
                   method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75,
                   method='midpoint')
IQR = Q3 - Q1
print("Old Shape: ", df.shape)
# Upper bound: positions of values at or above Q3 + 1.5*IQR
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))
# Lower bound: positions of values at or below Q1 - 1.5*IQR
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))
# Removing the Outliers (positions match labels for the default index)
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)
print("New Shape: ", df.shape)
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Output:
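The same IQR rule can also be written with a boolean mask instead of np.where and drop. The sketch below is one possible variant on a hypothetical series of seven values. It uses the default linear interpolation of Series.quantile rather than the midpoint method, so the fences can differ slightly from those computed with np.percentile.

```python
import pandas as pd

# Hypothetical sepal widths with one low and one high extreme value
widths = pd.Series([2.0, 2.9, 3.0, 3.1, 3.2, 3.3, 4.4], name="SepalWidthCm")

q1 = widths.quantile(0.25)
q3 = widths.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only values strictly inside the fences
inside = (widths > lower_fence) & (widths < upper_fence)
kept = widths[inside]

print(kept.tolist())
```

A mask avoids relying on row positions matching index labels, which matters once the index is no longer the default integer range.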