Data Pre-Processing Python For Beginner
The process of dealing with unclean data and transforming it into a form
more appropriate for modeling is called data pre-processing. This step
can be considered mandatory in the machine learning process for several
reasons.
While data pre-processing can differ from case to case, there are
some common tasks that can be used:
• data cleansing
• feature selection
• data scaling
• feature engineering
• dimensionality reduction
Even though most ML algorithms require a complete dataset, not all of
them fail when there is missing data. Some algorithms are robust
to missing values, like KNN and Naive Bayes, while other algorithms can
use missing values as a unique value, like decision trees. Nevertheless,
the scikit-learn implementations of those algorithms are not
robust to missing values.
Four features have missing values. We will work on the features ‘Age’, ‘BuildingArea’, and
‘YearBuilt’.
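As an illustration, here is a minimal imputation sketch using scikit-learn's SimpleImputer. The DataFrame below is toy stand-in data (the values are invented for the example); only the column names come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in data; column names follow the features mentioned above
df = pd.DataFrame({
    "Age": [5.0, np.nan, 12.0, 40.0],
    "BuildingArea": [120.0, 95.0, np.nan, 150.0],
    "YearBuilt": [2015.0, np.nan, 2008.0, 1980.0],
})

cols = ["Age", "BuildingArea", "YearBuilt"]

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

print(df.isna().sum().sum())  # no missing values remain
```

Other strategies such as "median" or "most_frequent" can be passed to SimpleImputer depending on the feature's distribution.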
Internal
Feature Selection
Feature selection is used to:
• reduce complexity
• prevent overfitting
credit: machinelearningmastery.com
When using statistics-based feature selection, it is important to choose
the method based on the data types of the input and output variables.
This decision tree helps decide which statistics-based method is
suitable for our data:
credit: machinelearningmastery.com
We are going to use the RFE method to select the most important features
from our dataset. Recursive Feature Elimination (RFE) is popular due
to its flexibility and ease of use. It reduces model complexity
by removing features one by one until only the selected number of
features remains.
The six most relevant features according to RFE are indicated by “Selected=True”.
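A minimal RFE sketch with scikit-learn, assuming a regression task; the synthetic data below stands in for the housing features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the dataset (10 candidate features)
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=6, random_state=0)

# Recursively drop the weakest feature until six remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6)
rfe.fit(X, y)

for i, selected in enumerate(rfe.support_):
    print(f"Feature {i}: Selected={selected}")
```

The `support_` attribute is a boolean mask over the input columns; `Selected=True` marks the features RFE kept.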
Feature Scaling
All maximum values have been scaled to 1
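The scaling step can be sketched with scikit-learn's MinMaxScaler; the array below is invented toy data with columns on very different ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy numeric features on very different ranges
X = np.array([[1000.0, 1.0],
              [2000.0, 3.0],
              [4000.0, 5.0]])

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.max(axis=0))  # every column's maximum is now 1.0
```

MinMaxScaler subtracts each column's minimum and divides by its range, so every column ends up between 0 and 1.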
Feature Engineering
1. Decomposing a Date-Time
• datetime -> hour_of_day
• hour -> morning, night
• etc.
2. Discretization
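Both steps can be sketched with pandas; the timestamps and bin labels below are illustrative choices, not from the original dataset:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2021-01-01 07:30", "2021-01-01 14:00", "2021-01-01 22:45"])})

# Decompose the date-time into an hour-of-day feature
df["hour_of_day"] = df["timestamp"].dt.hour

# Discretize the hour into coarse time-of-day bins
df["part_of_day"] = pd.cut(df["hour_of_day"],
                           bins=[0, 12, 18, 24],
                           labels=["morning", "afternoon", "night"],
                           right=False)

print(df[["hour_of_day", "part_of_day"]])
```

`pd.cut` assigns each hour to a half-open interval ([0, 12), [12, 18), [18, 24)), turning a continuous feature into a categorical one.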
Next, for handling categorical features, there are several methods called
encoding. These are three common encoding techniques, with samples.
Label Encoding
• Useful for non-linear and tree-based algorithms.
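A label-encoding sketch with scikit-learn; the color values are an invented example:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

# Map each category to an integer label
# (LabelEncoder sorts alphabetically: blue=0, green=1, red=2)
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoded)  # [2 1 0 1]
```

Note the integers impose an arbitrary order on the categories, which is why label encoding suits tree-based models better than linear ones.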
One-Hot Encoding
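One-hot encoding creates one binary column per category, with exactly one 1 per row. A sketch with pandas, using an invented color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})

# One binary indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

print(one_hot)
```

Unlike label encoding, this adds no artificial ordering, at the cost of one new column per category.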
Binary Encoding
• Variables -> numerical labels (label encoding) -> binary
numbers -> split every digit into a different column.
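The steps above can be sketched by hand with pandas (the `category_encoders` package also offers a ready-made BinaryEncoder); the categories and column names here are invented for the example:

```python
import pandas as pd

categories = ["red", "green", "blue", "yellow"]

# Step 1: label-encode the categories
# (alphabetical: blue=0, green=1, red=2, yellow=3)
labels = {cat: i for i, cat in enumerate(sorted(set(categories)))}

# Step 2: write each label in binary and split the digits into columns
n_bits = max(labels.values()).bit_length()  # 2 bits cover labels 0..3
rows = [[(labels[cat] >> bit) & 1 for bit in range(n_bits - 1, -1, -1)]
        for cat in categories]
encoded = pd.DataFrame(rows, columns=[f"bit_{i}" for i in range(n_bits)])

print(encoded)
```

Binary encoding needs only ⌈log2(k)⌉ columns for k categories, a middle ground between label encoding (1 column) and one-hot encoding (k columns).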
Dimensionality Reduction
Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied
machine learning to simplify a classification or regression
dataset in order to better fit a predictive model.
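As one common example, a PCA sketch with scikit-learn, using the built-in iris dataset rather than the housing data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 4 original features

# Project the 4-dimensional data down to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (150, 2)
```

The two components are the directions of greatest variance, so the reduced data keeps as much of the original spread as two columns can.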
Handling Outliers
Many datasets have outliers that can heavily affect model training results.
In Python, outliers can be easily detected using a boxplot visualization.
We can adjust the outliers without any additional library using the
winsorization method: outlier values are replaced by certain values
called the upper and lower bounds.
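A winsorization sketch using only NumPy, with bounds derived from the interquartile range (one common rule; the data values are invented):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # 50.0 is an outlier

# Derive lower/upper bounds from the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip outliers to the bounds instead of dropping the rows
winsorized = np.clip(data, lower, upper)

print(winsorized.max())  # no value exceeds the upper bound
```

Clipping keeps the row count unchanged, which matters when other features in the same rows are still useful.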
Those are several common methods for data preparation. Every project
is unique and may need a different approach to data pre-processing and
cleansing.
References
• https://www.kaggle.com/alexisbcook/missing-values
• https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/
• https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
• https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd