Exploratory Data Analysis (EDA) Using Python
Introduction to EDA
The main objective of this workshop is to cover the steps involved in data pre-processing, feature engineering, and the different stages of Exploratory Data Analysis (EDA), which are essential steps in any research analysis.
Data pre-processing, feature engineering, and EDA are fundamental early steps after data collection. They are not limited to simply visualizing, plotting, and manipulating the data without any assumptions; they also assess the quality of the data before building models.
We will spend much of our time refining the raw data, because data pre-processing and feature engineering play a key role in any data project.
Data pre-processing is the process of cleaning and preparing the raw data to enable feature engineering.
Feature engineering is one of the most crucial tasks and plays a major role in determining the outcome of a model.
The data pre-processing, feature engineering, and EDA steps in this workshop are carried out using Python.
Pandas and NumPy are used for data manipulation and numerical calculations.
Matplotlib and Seaborn are used for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Reading Dataset
The Pandas library offers a wide range of options for loading data into a pandas DataFrame from files such as JSON, .csv, .xlsx, .sql, .pickle, .html, and .txt.
Most data are available as CSV files, a popular and easily accessible tabular format. Using the read_csv() function, the data can be loaded into a pandas DataFrame.
In this workshop, a dataset for predicting used car prices is used as the example. We analyze used car prices and use EDA to identify the factors influencing them. The data is stored in the DataFrame data.
data = pd.read_csv("used_cars.csv")
The main goal of data understanding is to gain general insights about the data: the number of rows and columns, the values in the data, the datatypes, and the missing values in the dataset.
data.head()
data.tail()
info() helps to understand the data types and general information about the data, including the number of records in each column, whether values are null or not, each column's data type, and the memory usage of the dataset.
data.info()
data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have missing values. Numeric variables like Mileage and Power have the datatypes float64 and int64. Categorical variables like Location, Fuel_Type, Transmission, and Owner_Type have the object data type.
nunique() returns the number of unique values in each column. Based on these counts and the data description, we can identify the continuous and categorical columns in the data. Duplicated data can be handled or removed after further analysis.
data.nunique()
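A sketch of the duplicate handling mentioned above, shown on a small hypothetical stand-in frame rather than the workshop's used-car data:

```python
import pandas as pd

# hypothetical stand-in for the workshop's `data` frame
data = pd.DataFrame({'Name': ['Maruti Wagon R', 'Maruti Wagon R', 'Hyundai Creta'],
                     'Price': [2.5, 2.5, 11.0]})

n_dups = data.duplicated().sum()  # count fully duplicated rows
data = data.drop_duplicates()     # keep only the first occurrence of each row
```

On the real dataset, the same two calls apply unchanged to the DataFrame loaded from used_cars.csv.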
isnull() is widely used in all pre-processing steps to identify null values in the data.
data.isnull().sum()
The code below calculates the percentage of missing values in each column.
(data.isnull().sum()/(len(data)))*100
The percentage of missing values for the columns New_Price and Price is ~86% and
~17%, respectively.
Data Reduction
Some columns or variables can be dropped if they do not add value to our analysis. In our dataset, the column S.No contains only ID values; we assume it has no predictive power for the dependent variable.
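A minimal sketch of this reduction step; the demo frame below is hypothetical, and on the real data the same drop applies to the DataFrame loaded from used_cars.csv:

```python
import pandas as pd

# hypothetical stand-in with an ID-like column
data = pd.DataFrame({'S.No': [0, 1, 2],
                     'Name': ['Maruti Wagon R', 'Hyundai Creta', 'Honda Jazz'],
                     'Price': [2.5, 11.0, 4.5]})

# S.No is only an identifier, so it carries no predictive power
data = data.drop(['S.No'], axis=1)
```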
Next, we start our feature engineering, since we need to add some columns required for the analysis.
Feature Engineering
Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive
model using machine learning or statistical modeling. The main goal of Feature
engineering is to create meaningful data from raw data.
Creating Features
We will work with the variables Year and Name in our dataset. In the sample data, the column Year shows the manufacturing year of the car.
It is difficult to work with the car's age when it is stored as a year, and the age of the car is a contributing factor to its price.
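One common way to derive the age is to subtract the manufacturing year from the current year; a sketch, where the column name Car_Age matches the variable used later in the workshop and the demo frame is a hypothetical stand-in:

```python
from datetime import date

import pandas as pd

data = pd.DataFrame({'Year': [2015, 2011]})  # hypothetical manufacturing years

# age of the car relative to the current year
data['Car_Age'] = date.today().year - data['Year']
```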
Car names by themselves will not be great predictors of price in our current data, but we can process this column to extract important information from the brand and model names. Let's split Name and introduce the new variables Brand and Model.
data['Brand'] = data.Name.str.split().str.get(0)
data['Model'] = data.Name.str.split().str.get(1) + data.Name.str.split().str.get(2)
data[['Name','Brand','Model']]
Data Cleaning/Wrangling
Some variable names are not relevant or easy to understand. Some data may contain data entry errors, and some variables may need data type conversion. We need to fix these issues in the data.
In our example, the brand names 'Isuzu' and 'ISUZU' duplicate each other, and 'Mini' and 'Land' look incomplete. These need to be corrected.
print(data.Brand.unique())
print(data.Brand.nunique())
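One way to correct such values is pandas' replace() with a mapping. The exact corrected names below ('Mini Cooper', 'Land Rover') are assumptions for illustration, and the demo frame is a hypothetical sample:

```python
import pandas as pd

# hypothetical sample of the Brand column
data = pd.DataFrame({'Brand': ['Maruti', 'ISUZU', 'Isuzu', 'Mini', 'Land']})

# map the inconsistent or truncated brand names to corrected ones (assumed targets)
data['Brand'] = data['Brand'].replace({'ISUZU': 'Isuzu',
                                       'Mini': 'Mini Cooper',
                                       'Land': 'Land Rover'})
```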
We have completed the fundamental data analysis, feature engineering, and data clean-up. Let's move on to the EDA process.
EDA can be leveraged to check for outliers, patterns, and trends in the given data.
EDA helps to find meaningful patterns in data.
EDA provides in-depth insights into the data to solve our business problems.
EDA gives clues for imputing missing values in the dataset.
Statistics Summary
The statistics summary gives a quick and simple description of the data.
It can include count, mean, standard deviation, median, mode, minimum value, maximum value, and range.
The summary gives a high-level idea of whether the data has outliers or data entry errors, and of the distribution of the data, such as whether it is normally distributed or left/right skewed.
data.describe().T
Year ranges from 1996 to 2019, a wide range showing that the used cars include both the latest models and older models.
The average kilometers driven is ~58k KM. The huge gap between the minimum and the maximum (650,000 KM) is evidence of an outlier; this record can be removed.
The minimum value of Mileage is 0. Cars are not sold with 0 mileage, so this looks like a data entry issue.
Engine and Power appear to have outliers, and the data is right-skewed.
The average number of seats in a car is 5; seat count is an important feature in price contribution.
The maximum price of a used car is 160k, which is unusually high for a used car. This may be an outlier or a data entry issue.
data.describe(include='all').T
Before we do EDA, let's separate the numerical and categorical variables for easier analysis.
cat_cols = data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)
Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots from Pandas and NumPy with only short lines of code.
Univariate analysis can be done for both categorical and numerical variables. Categorical variables can be visualized using a count plot, bar chart, pie plot, etc. Numerical variables can be visualized using a histogram, box plot, density plot, etc.
In our example, we perform univariate analysis using a histogram and box plot for the continuous variables.
In the figure below, a histogram and box plot are used to show the pattern of each variable; some variables show skewness and outliers.
Price and Kilometers_Driven are right-skewed, so they will be transformed, and all outliers will be handled during imputation.
Categorical variables are visualized using a count plot, which shows the pattern of the factors influencing car price.
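A count-plot grid along these lines might look like the following sketch (hypothetical sample data; the axis-label rotation mirrors the workshop figure):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# hypothetical sample of two categorical columns
data = pd.DataFrame({'Fuel_Type': ['Diesel', 'Petrol', 'Diesel', 'CNG', 'Diesel'],
                     'Transmission': ['Manual', 'Manual', 'Automatic', 'Manual', 'Manual']})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(x='Fuel_Type', data=data, ax=axes[0])     # frequency of each fuel type
sns.countplot(x='Transmission', data=data, ax=axes[1])  # frequency of each transmission
axes[0].tick_params(labelrotation=90)  # rotate crowded category labels
axes[1].tick_params(labelrotation=90)
plt.tight_layout()
```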
Mumbai has the highest number of cars available for purchase, followed by
Hyderabad and Coimbatore
~53% of cars have fuel type as Diesel this shows diesel cars provide higher
performance
~72% of cars have manual transmission
~82 % of cars are First owned cars. This shows most of the buyers prefer to
purchase first-owner cars
~20% of cars belong to the brand Maruti followed by 19% of cars belonging to
Hyundai
WagonR ranks first among all models which are available for purchase
Data Transformation
Univariate analysis demonstrated that some variables need to be transformed before we proceed to bivariate analysis.
Price and Kilometers_Driven are highly skewed and on a large scale, so let's apply a log transformation.
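A sketch of the transformation; the new column names Price_log and Kilometers_Driven_log match those used in the plots later, and the demo values are hypothetical:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Price': [1.75, 12.5],
                     'Kilometers_Driven': [41000, 72000]})  # hypothetical values

# the log transform compresses the long right tail of both variables
data['Price_log'] = np.log(data['Price'])
data['Kilometers_Driven_log'] = np.log(data['Kilometers_Driven'])
```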
For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.
A stacked bar chart can be used for categorical variables when the output variable is categorical. Bar plots can be used when the output variable is continuous.
In our example, a pair plot is used to show the relationships between the numerical variables.
plt.figure(figsize=(13,17))
sns.pairplot(data=data.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()
The variable Year has a positive correlation with Price and Mileage.
Year has a negative correlation with Kilometers_Driven.
Mileage is negatively correlated with Power: as power increases, mileage decreases.
Recently made cars are priced higher; as the age of the car increases, the price decreases.
As Engine and Power increase, the price of the car increases.
A bar plot can be used to show the relationship between categorical variables and a continuous variable.
fig, axarr = plt.subplots(4, 2, figsize=(12, 18))  # assumed setup: 4x2 grid of axes (the block's first line was missing from the source)
data.groupby('Location')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][0], fontsize=12)
axarr[0][0].set_title("Location Vs Price", fontsize=18)
data.groupby('Transmission')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][1], fontsize=12)
axarr[0][1].set_title("Transmission Vs Price", fontsize=18)
data.groupby('Fuel_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][0], fontsize=12)
axarr[1][0].set_title("Fuel_Type Vs Price", fontsize=18)
data.groupby('Owner_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][1], fontsize=12)
axarr[1][1].set_title("Owner_Type Vs Price", fontsize=18)
data.groupby('Brand')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][0], fontsize=12)
axarr[2][0].set_title("Brand Vs Price", fontsize=18)
data.groupby('Model')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][1], fontsize=12)
axarr[2][1].set_title("Model Vs Price", fontsize=18)
data.groupby('Seats')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][0], fontsize=12)
axarr[3][0].set_title("Seats Vs Price", fontsize=18)
data.groupby('Car_Age')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][1], fontsize=12)
axarr[3][1].set_title("Car_Age Vs Price", fontsize=18)
plt.subplots_adjust(hspace=1.0)
plt.subplots_adjust(wspace=.5)
sns.despine()
Observations
The price of cars is high in Coimbatore and lower in Kolkata and Jaipur.
Automatic cars are priced higher than manual cars.
Diesel and Electric cars have almost the same price, which is the highest; LPG cars have the lowest price.
First-owner cars are higher in price, followed by second-owner cars.
The price of third-owner cars is lower than that of fourth-and-above-owner cars.
The Lamborghini brand is the highest in price.
The Gallardocoupe model is the highest in price.
2-seaters have the highest price, followed by 7-seaters.
The latest model cars are high in price.
A heat map gives the correlation between the variables, showing whether each correlation is positive or negative.
In our example, the heat map shows the correlations between the numerical variables.
plt.figure(figsize=(12, 7))
sns.heatmap(data.drop(['Kilometers_Driven','Price'],axis=1).corr(), annot = True, vmin = -1, vmax = 1)
plt.show()
We cannot impute the data with a simple mean or median alone; we need business knowledge or common insights about the data. Domain knowledge, where available, adds value to the imputation, and some values can be imputed based on assumptions.
In our dataset, we found missing values in several columns, such as Mileage, Power, and Seats.
We observed earlier that some records have zero Mileage, which looks like a data entry issue. We can fix this by first converting the zeros to null values and then filling the nulls with the mean of Mileage; the mean was chosen because the mean and median of this variable are nearly the same.
data.loc[data["Mileage"]==0.0,'Mileage']=np.nan
data.Mileage.isnull().sum()
data['Mileage'].fillna(value=np.mean(data['Mileage']),inplace=True)
Let's assume that cars of the same brand and model have nearly identical Engine, Mileage, Power, and number of seats. We can then impute those missing values from the existing data:
data.Seats.isnull().sum()
data['Seats'] = data.groupby(['Model','Brand'])['Seats'].apply(lambda x: x.fillna(x.median()))
data['Engine'] = data.groupby(['Brand','Model'])['Engine'].apply(lambda x: x.fillna(x.median()))
data['Power'] = data.groupby(['Brand','Model'])['Power'].apply(lambda x: x.fillna(x.median()))
In general, there are no defined or perfect rules for imputing missing values in a dataset. A method that performs well on one dataset may perform poorly on another; only practice and experimentation reveal what works best.
Conclusion
In this workshop, we analyzed the factors influencing used car prices. Through EDA we gained useful insights; below are the factors influencing the price of a car and a few takeaways.
Most customers prefer 2-seater cars, hence the price of 2-seater cars is higher than that of other cars.
The price of a car decreases as its age increases.
Customers prefer to purchase first-owner cars rather than second- or third-owner cars.
Copyright 2023 © inixindo
This way, we perform EDA on the datasets to explore the data and extract all possible
insights, which can help in model building and better decision making.
Thanks