4 Data Preprocessing

Machine Learning

Data Preprocessing

• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and a crucial step in creating a machine learning model.
• Before doing any operation on data, it is necessary to clean the data and put it into a consistent format.
Why do we need Data Preprocessing?

• Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
• Data preprocessing cleans the data and makes it suitable for a machine learning model.
• This can also increase the accuracy and efficiency of the model.
Why do we need Data Preprocessing?
Data preprocessing involves the following steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset

• To create a machine learning model, the first thing we require is a dataset, since a machine learning model works entirely on data.
• The data collected for a particular problem, in a proper format, is known as the dataset.
• A dataset may be stored in different formats, such as:
• CSV file
• HTML file
• xlsx file
2) Importing Libraries

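• To perform data preprocessing using Python, we need to import some predefined Python libraries.
• Each library is used for a specific job.
• Numpy: numerical arrays and mathematical operations.
• Matplotlib: plotting charts.
• Pandas: importing and managing datasets.
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd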
3) Importing the Datasets

• Now we need to import the dataset which we have collected for our machine learning project.
• df = pd.read_csv('Dataset.csv')
Extracting dependent and independent variables:

• X = df.iloc[:, :-1].values
• y = df.iloc[:, -1].values
• print(X)
• print(y)
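The import and extraction steps can be tried without a file on disk. The sketch below is only an illustration: the column names (Country, Age, Salary, Purchased) follow the variables mentioned in this deck, but the rows are made-up sample values, and the CSV text is read from memory instead of 'Dataset.csv'.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Dataset.csv (values are made up).
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,,Yes
Germany,,61000,No
India,35,58000,Yes
"""

# read_csv accepts a file-like object as well as a path.
df = pd.read_csv(io.StringIO(csv_text))

# Independent variables: every column except the last one.
X = df.iloc[:, :-1].values
# Dependent variable: the last column only.
y = df.iloc[:, -1].values

print(X.shape)
print(y)
```

The empty fields in the sample are read as NaN, which is what the missing-data step works on.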
4) Handling Missing data

• The next step of data preprocessing is to handle missing data in the dataset.
• If our dataset contains missing data, it may cause serious problems for our machine learning model.
• Hence it is necessary to handle the missing values present in the dataset.
Ways to handle missing data:

• There are mainly two ways to handle missing data:
• By deleting the particular row: This is a common way to deal with null values.
• Here, we simply delete the specific row or column that contains null values.
• But this approach is not very efficient, because removing data may lead to loss of information and a less accurate output.
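A minimal sketch of the deletion approach, assuming the data is held in a pandas DataFrame. The values are made up, and dropna() is one possible way to do this; it is not taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values (NaN).
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 79000.0],
})

# Delete every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Delete every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

print(len(df), len(rows_dropped))
print(cols_dropped.shape)
```

In this small sample, half of the rows are lost and no column survives, which illustrates why deletion can discard a lot of information.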
Ways to handle missing data:

• By calculating the mean: Here, we calculate the mean of the column or row that contains a missing value and put it in place of the missing value.
• This strategy is useful for features that have numeric data, such as age, salary, year, etc. Here, we will use this approach.
4) Handling Missing data:

• To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models.
• Here we will use the SimpleImputer class of the sklearn.impute library.
• Below is the code for it:
4) Handling Missing data:

• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
• imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
• print(X)
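The imputation code above can be checked on a small array. This is a sketch under assumptions: the matrix below is made-up data with Country in column 0 and Age and Salary in columns 1 and 2, and the numeric columns are cast to float explicitly before imputing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: Country, Age, Salary (made-up values).
X = np.array([
    ["India", 38.0, 68000.0],
    ["France", 43.0, np.nan],
    ["Germany", np.nan, 54000.0],
    ["France", 48.0, 79000.0],
], dtype=object)

# Work on the numeric columns only, as floats.
numeric = X[:, 1:3].astype(float)

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# Write the imputed values back into the feature matrix.
X[:, 1:3] = filled
print(X)
```

The missing age becomes the mean of 38, 43 and 48, and the missing salary becomes the mean of the three known salaries.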
5) Encoding Categorical data:
• Categorical data is data that takes one of a fixed set of categories.
• In our dataset, there are two categorical variables: Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so a categorical variable in the dataset may cause trouble while building the model.
• So it is necessary to encode these categorical variables into numbers.
5) Encoding Categorical data

• For Country variable:
• First, we will convert the Country values into numbers.
• To do this, we will use the LabelEncoder() class from the sklearn.preprocessing library.
5) Encoding Categorical data

• # Label-encoding the Country column (column 0 of X)
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• country_codes = le.fit_transform(X[:, 0])
• print(country_codes)
5) Encoding Categorical data
• Explanation:
• In the above code, we have imported the LabelEncoder class of the sklearn library.
• This class has encoded the variables into digits.
• In our case, there are three countries, and as the output above shows, they are encoded as 0, 1, and 2.
• From these values, the machine learning model may assume that there is some order or numeric relationship between the countries, which will produce wrong output.
• To remove this issue, we will use dummy encoding.
5) Encoding Categorical data:

• Dummy Variables:
• Dummy variables are variables that take the value 0 or 1.
• A value of 1 marks the presence of a category in its column, and the remaining columns are 0.
• With dummy encoding, we get a number of columns equal to the number of categories.
5) Encoding Categorical data:

• In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values.
• For dummy encoding, we will use the OneHotEncoder class of the sklearn.preprocessing library.
5) Encoding Categorical data:

• # Encoding the Independent Variable
• from sklearn.compose import ColumnTransformer
• from sklearn.preprocessing import OneHotEncoder
• ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
• X = np.array(ct.fit_transform(X))
• print(X)
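The difference between label encoding and dummy encoding can be seen side by side. The sketch below uses a made-up list of four country values; it is an illustration, not the deck's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical Country column (made-up values).
countries = np.array(["India", "France", "Germany", "France"])

# Label encoding: one integer per category, assigned in sorted order
# (France=0, Germany=1, India=2).
codes = LabelEncoder().fit_transform(countries)
print(codes)  # [2 0 1 0]

# Dummy (one-hot) encoding: one 0/1 column per category.
# OneHotEncoder expects a 2-D input, hence the reshape.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries.reshape(-1, 1)).toarray()
print(encoder.categories_[0])
print(one_hot)
```

Each row of the one-hot result contains a single 1, so no country is numerically larger than another.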
5) Encoding Categorical data:

• For Purchased Variable:

• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
• For the second categorical variable, we will only use a labelencoder object of the LabelEncoder class.
• Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes and no, which are automatically encoded into 0 and 1.
5) Encoding Categorical data
• # Encoding the Dependent Variable
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)
• print(y)
6) Splitting the dataset into the Training set and Test set

• from sklearn.model_selection import train_test_split

• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
• print(X_train)
• print(X_test)
• print(y_train)
• print(y_test)
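A minimal sketch of what the split does, using made-up arrays of 10 samples. With test_size = 0.2, 2 samples go to the test set and 8 to the training set, and each row of X stays paired with its label in y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features; label i belongs to row i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% training, 20% test; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```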
7) Feature Scaling

• It is a technique to standardize the independent variables of the dataset to a specific range.
• In feature scaling, we put our variables on the same scale so that no variable dominates another.
7) Feature Scaling

• For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library.
• Next, we will create an object of the StandardScaler class for the independent variables or features.
• Then we will fit and transform the training dataset.
• For the test dataset, we directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
7) Feature Scaling

• from sklearn.preprocessing import StandardScaler

• sc = StandardScaler()
• X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
• X_test[:, 3:] = sc.transform(X_test[:, 3:])
• print(X_train)
• print(X_test)
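The effect of fitting on the training set only can be checked on a small example. The sketch below uses made-up Age and Salary values and scales every column, whereas the code above scales only the columns from index 3 onward.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary columns (made-up values).
X_train = np.array([[30.0, 40000.0], [40.0, 60000.0], [50.0, 80000.0]])
X_test = np.array([[40.0, 60000.0], [60.0, 100000.0]])

sc = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set.
X_train_scaled = sc.fit_transform(X_train)
# transform reuses the training-set statistics on the test set.
X_test_scaled = sc.transform(X_test)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test rows are measured against the training mean, so a test value equal to the training mean becomes 0.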
