Exploratory Data Analysis (EDA) Using Python
Introduction to EDA
The main objective of this workshop is to cover the steps involved in data pre-processing, feature engineering, and the different stages of Exploratory Data Analysis (EDA), which are essential steps in any research analysis.
Data pre-processing, feature engineering, and EDA are fundamental early steps after data collection. They are not limited to simply visualizing, plotting, and manipulating the data without any assumptions; they also assess the quality of the data before building models.
We will spend much of our time refining the raw data, because data pre-processing and feature engineering play a key role in any data project.
Data pre-processing is the process of cleaning and preparing the raw data to enable feature engineering.
Feature engineering is one of the most crucial tasks and plays a major role in determining the outcome of a model.
The data pre-processing, feature engineering, and EDA steps in this workshop are carried out using Python.
Pandas and NumPy are used for data manipulation and numerical calculations.
Matplotlib and Seaborn are used for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Reading Dataset
The Pandas library offers a wide range of options for loading data into a pandas DataFrame from files such as JSON, .csv, .xlsx, .sql, .pickle, .html, and .txt.
Most data are available as CSV files, a popular and easily accessible tabular format. Using the read_csv() function, the data can be loaded into a pandas DataFrame.
In this workshop, a dataset for predicting used car prices is used as the example. We analyze used car prices and use EDA to identify the factors influencing them. The data is stored in the DataFrame data.
data = pd.read_csv("used_cars.csv")
The main goal of data understanding is to gain general insights about the data: the number of rows and columns, the values in the data, the datatypes, and the missing values in the dataset.
data.head()
data.tail()
info() helps to understand the data types and general information about the data, including the number of records in each column, whether values are null or not, each column's data type, and the memory usage of the dataset.
data.info()
data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have missing values. Numeric variables like Mileage and Power have the datatypes float64 and int64. Categorical variables like Location, Fuel_Type, Transmission, and Owner_Type have the object data type.
nunique() returns the number of unique values in each column. Based on these counts and the data description, we can identify the continuous and categorical columns in the data. Duplicated data can be handled or removed after further analysis.
data.nunique()
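A sketch of the duplicate handling mentioned above, shown on a small hypothetical stand-in frame rather than the workshop's used-car data:

```python
import pandas as pd

# hypothetical stand-in for the workshop's `data` frame
data = pd.DataFrame({'Name': ['Maruti Wagon R', 'Maruti Wagon R', 'Hyundai Creta'],
                     'Price': [2.5, 2.5, 11.0]})

n_dups = data.duplicated().sum()  # count fully duplicated rows
data = data.drop_duplicates()     # keep only the first occurrence of each row
```

On the real dataset, the same two calls apply unchanged to the DataFrame loaded from used_cars.csv.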
isnull() is widely used in all pre-processing steps to identify null values in the data.
data.isnull().sum()
The code below calculates the percentage of missing values in each column.
(data.isnull().sum()/(len(data)))*100
The percentage of missing values for the columns New_Price and Price is ~86% and
~17%, respectively.
Data Reduction
Some columns or variables can be dropped if they do not add value to our analysis. In our dataset, the column S.No contains only ID values; we assume it has no predictive power for the dependent variable.
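A minimal sketch of this reduction step; the demo frame below is hypothetical, and on the real data the same drop applies to the DataFrame loaded from used_cars.csv:

```python
import pandas as pd

# hypothetical stand-in with an ID-like column
data = pd.DataFrame({'S.No': [0, 1, 2],
                     'Name': ['Maruti Wagon R', 'Hyundai Creta', 'Honda Jazz'],
                     'Price': [2.5, 11.0, 4.5]})

# S.No is only an identifier, so it carries no predictive power
data = data.drop(['S.No'], axis=1)
```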
Next, we start our feature engineering, since we need to add some columns required for the analysis.
Feature Engineering
Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive
model using machine learning or statistical modeling. The main goal of Feature
engineering is to create meaningful data from raw data.
Creating Features
We will work with the variables Year and Name in our dataset. In the sample data, the column Year shows the manufacturing year of the car.
It is difficult to work with the car's age when it is stored as a year, and the age of the car is a contributing factor to its price.
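One common way to derive the age is to subtract the manufacturing year from the current year; a sketch, where the column name Car_Age matches the variable used later in the workshop and the demo frame is a hypothetical stand-in:

```python
from datetime import date

import pandas as pd

data = pd.DataFrame({'Year': [2015, 2011]})  # hypothetical manufacturing years

# age of the car relative to the current year
data['Car_Age'] = date.today().year - data['Year']
```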
Car names by themselves will not be great predictors of price in our current data, but we can process this column to extract important information from the brand and model names. Let's split Name and introduce the new variables Brand and Model.
data['Brand'] = data.Name.str.split().str.get(0)
data['Model'] = data.Name.str.split().str.get(1) + data.Name.str.split().str.get(2)
data[['Name','Brand','Model']]
Data Cleaning/Wrangling
Some variable names are not relevant or easy to understand. Some data may contain data entry errors, and some variables may need data type conversion. We need to fix these issues in the data.
In our example, the brand names 'Isuzu' and 'ISUZU' duplicate each other, and 'Mini' and 'Land' look incomplete. These need to be corrected.
print(data.Brand.unique())
print(data.Brand.nunique())
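One way to correct such values is pandas' replace() with a mapping. The exact corrected names below ('Mini Cooper', 'Land Rover') are assumptions for illustration, and the demo frame is a hypothetical sample:

```python
import pandas as pd

# hypothetical sample of the Brand column
data = pd.DataFrame({'Brand': ['Maruti', 'ISUZU', 'Isuzu', 'Mini', 'Land']})

# map the inconsistent or truncated brand names to corrected ones (assumed targets)
data['Brand'] = data['Brand'].replace({'ISUZU': 'Isuzu',
                                       'Mini': 'Mini Cooper',
                                       'Land': 'Land Rover'})
```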
We have completed the fundamental data analysis, feature engineering, and data clean-up. Let's move on to the EDA process.
EDA can be leveraged to check for outliers, patterns, and trends in the given data.
EDA helps to find meaningful patterns in data.
EDA provides in-depth insights into the data to solve our business problems.
EDA gives clues for imputing missing values in the dataset.
Statistics Summary
The statistics summary gives a quick and simple description of the data.
It can include count, mean, standard deviation, median, mode, minimum value, maximum value, and range.
The summary gives a high-level idea of whether the data has outliers or data entry errors, and of the distribution of the data, such as whether it is normally distributed or left/right skewed.
data.describe().T
Year ranges from 1996 to 2019, a wide range showing that the used cars include both the latest models and older models.
The average kilometers driven is ~58k KM. The huge gap between the minimum and the maximum (650,000 KM) is evidence of an outlier; this record can be removed.
The minimum value of Mileage is 0. Cars are not sold with 0 mileage, so this looks like a data entry issue.
Engine and Power appear to have outliers, and the data is right-skewed.
The average number of seats in a car is 5; seat count is an important feature in price contribution.
The maximum price of a used car is 160k, which is unusually high for a used car. This may be an outlier or a data entry issue.
data.describe(include='all').T
Before we do EDA, let's separate the numerical and categorical variables for easier analysis.
cat_cols = data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)
Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots from Pandas and NumPy with only short lines of code.
Univariate analysis can be done for both categorical and numerical variables. Categorical variables can be visualized using a count plot, bar chart, pie plot, etc. Numerical variables can be visualized using a histogram, box plot, density plot, etc.
In our example, we perform univariate analysis using a histogram and box plot for the continuous variables.
In the figure below, a histogram and box plot are used to show the pattern of each variable; some variables show skewness and outliers.
Price and Kilometers_Driven are right-skewed, so they will be transformed, and all outliers will be handled during imputation.
Categorical variables are visualized using a count plot, which shows the pattern of the factors influencing car price.
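A count-plot grid along these lines might look like the following sketch (hypothetical sample data; the axis-label rotation mirrors the workshop figure):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# hypothetical sample of two categorical columns
data = pd.DataFrame({'Fuel_Type': ['Diesel', 'Petrol', 'Diesel', 'CNG', 'Diesel'],
                     'Transmission': ['Manual', 'Manual', 'Automatic', 'Manual', 'Manual']})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(x='Fuel_Type', data=data, ax=axes[0])     # frequency of each fuel type
sns.countplot(x='Transmission', data=data, ax=axes[1])  # frequency of each transmission
axes[0].tick_params(labelrotation=90)  # rotate crowded category labels
axes[1].tick_params(labelrotation=90)
plt.tight_layout()
```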
Mumbai has the highest number of cars available for purchase, followed by
Hyderabad and Coimbatore
~53% of cars have fuel type as Diesel this shows diesel cars provide higher
performance
~72% of cars have manual transmission
~82 % of cars are First owned cars. This shows most of the buyers prefer to
purchase first-owner cars
~20% of cars belong to the brand Maruti followed by 19% of cars belonging to
Hyundai
WagonR ranks first among all models which are available for purchase
Data Transformation
Univariate analysis demonstrated that some variables need to be transformed before we proceed to bivariate analysis.
Price and Kilometers_Driven are highly skewed and on a large scale, so let's apply a log transformation.
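A sketch of the transformation; the new column names Price_log and Kilometers_Driven_log match those used in the plots later, and the demo values are hypothetical:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Price': [1.75, 12.5],
                     'Kilometers_Driven': [41000, 72000]})  # hypothetical values

# the log transform compresses the long right tail of both variables
data['Price_log'] = np.log(data['Price'])
data['Kilometers_Driven_log'] = np.log(data['Kilometers_Driven'])
```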
For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.
A stacked bar chart can be used for categorical variables when the output variable is categorical. Bar plots can be used when the output variable is continuous.
In our example, a pair plot is used to show the relationships between the numerical variables.
plt.figure(figsize=(13,17))
sns.pairplot(data=data.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()
The variable Year has a positive correlation with Price and Mileage.
Year has a negative correlation with Kilometers_Driven.
Mileage is negatively correlated with Power: as power increases, mileage decreases.
Recently made cars are priced higher; as the age of the car increases, the price decreases.
As Engine and Power increase, the price of the car increases.
A bar plot can be used to show the relationship between categorical variables and a continuous variable.
fig, axarr = plt.subplots(4, 2, figsize=(12, 18))  # assumed setup: 4x2 grid of axes (the block's first line was missing from the source)
data.groupby('Location')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][0], fontsize=12)
axarr[0][0].set_title("Location Vs Price", fontsize=18)
data.groupby('Transmission')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][1], fontsize=12)
axarr[0][1].set_title("Transmission Vs Price", fontsize=18)
data.groupby('Fuel_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][0], fontsize=12)
axarr[1][0].set_title("Fuel_Type Vs Price", fontsize=18)
data.groupby('Owner_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][1], fontsize=12)
axarr[1][1].set_title("Owner_Type Vs Price", fontsize=18)
data.groupby('Brand')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][0], fontsize=12)
axarr[2][0].set_title("Brand Vs Price", fontsize=18)
data.groupby('Model')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][1], fontsize=12)
axarr[2][1].set_title("Model Vs Price", fontsize=18)
data.groupby('Seats')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][0], fontsize=12)
axarr[3][0].set_title("Seats Vs Price", fontsize=18)
data.groupby('Car_Age')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][1], fontsize=12)
axarr[3][1].set_title("Car_Age Vs Price", fontsize=18)
plt.subplots_adjust(hspace=1.0)
plt.subplots_adjust(wspace=.5)
sns.despine()
Observations
The price of cars is high in Coimbatore and lower in Kolkata and Jaipur.
Automatic cars are priced higher than manual cars.
Diesel and Electric cars have almost the same price, which is the highest; LPG cars have the lowest price.
First-owner cars are higher in price, followed by second-owner cars.
The price of third-owner cars is lower than that of fourth-and-above-owner cars.
The Lamborghini brand is the highest in price.
The Gallardocoupe model is the highest in price.
2-seaters have the highest price, followed by 7-seaters.
The latest model cars are high in price.
A heat map gives the correlation between the variables, showing whether each correlation is positive or negative.
In our example, the heat map shows the correlations between the numerical variables.
plt.figure(figsize=(12, 7))
sns.heatmap(data.drop(['Kilometers_Driven','Price'],axis=1).corr(), annot = True, vmin = -1, vmax = 1)
plt.show()
We cannot impute the data with a simple mean or median alone; we need business knowledge or common insights about the data. Domain knowledge, where available, adds value to the imputation, and some values can be imputed based on assumptions.
In our dataset, we found missing values in several columns, such as Mileage, Power, and Seats.
We observed earlier that some records have zero Mileage, which looks like a data entry issue. We can fix this by first converting the zeros to null values and then filling the nulls with the mean of Mileage; the mean was chosen because the mean and median of this variable are nearly the same.
data.loc[data["Mileage"]==0.0,'Mileage']=np.nan
data.Mileage.isnull().sum()
data['Mileage'].fillna(value=np.mean(data['Mileage']),inplace=True)
Let's assume that cars of the same brand and model have nearly identical Engine, Mileage, Power, and number of seats. We can then impute those missing values from the existing data:
data.Seats.isnull().sum()
data['Seats'] = data.groupby(['Model','Brand'])['Seats'].apply(lambda x: x.fillna(x.median()))
data['Engine'] = data.groupby(['Brand','Model'])['Engine'].apply(lambda x: x.fillna(x.median()))
data['Power'] = data.groupby(['Brand','Model'])['Power'].apply(lambda x: x.fillna(x.median()))
In general, there are no defined or perfect rules for imputing missing values in a dataset. A method that performs well on one dataset may perform poorly on another; only practice and experimentation reveal what works best.
Conclusion
In this workshop, we analyzed the factors influencing used car prices. Through EDA we gained useful insights; below are the factors influencing the price of a car and a few takeaways.
Most customers prefer 2-seater cars, hence the price of 2-seater cars is higher than that of other cars.
The price of a car decreases as its age increases.
Customers prefer to purchase first-owner cars rather than second- or third-owner cars.
Copyright 2023 © inixindo
This way, we perform EDA on the datasets to explore the data and extract all possible
insights, which can help in model building and better decision making.
Thanks