Unit 4: Basics of Feature Engineering
Outline
Feature and Feature Engineering
Feature transformation: construction and extraction
Feature subset selection: issues in high-dimensional data, key drivers, measures, overall process
Feature and Feature Engineering
Features are the inputs to a machine learning model, usually in the form of structured columns.
Algorithms require features with specific characteristics to work properly.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the predictive
models, resulting in improved model accuracy on unseen data.
Goals of Feature Engineering
1. Preparing the proper input dataset, compatible with the machine
learning algorithm requirements.
2. Improving the performance of machine learning models.
Feature Engineering Category
II. Feature Transformation:
It means transforming our original features into functions of the original features.
Ex: Scaling, discretization, binning, and filling in missing data values are the most common forms of data transformation.
To reduce right skewness of the data, we use the log transform.
III. Feature Extraction:
When the data to be processed by an algorithm is too large, much of it is often redundant.
Analysis with a large number of variables uses a lot of computational power and memory, so we should reduce the dimensionality of such data.
Feature extraction is a term for constructing new features as combinations of the original variables.
For tabular data, we can use PCA to reduce the number of features (see the sketch below).
For images, we can use line or edge detection.
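A minimal sketch of PCA-based feature extraction with scikit-learn, assuming a purely numeric feature matrix (the array X below is hypothetical random data used only for illustration):
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(100, 20)            # hypothetical data: 100 samples, 20 original features
pca = PCA(n_components=5)              # keep 5 principal components
X_reduced = pca.fit_transform(X)       # new features are linear combinations of the originals
print(X_reduced.shape)                 # (100, 5)
print(pca.explained_variance_ratio_)   # variance captured by each component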
Feature transformation
Feature transformation is the process of modifying your data while keeping the information it carries.
These modifications make the data easier for machine learning algorithms to work with, which delivers better results.
But why would we transform our features?
• Some data types are not suitable to be fed into a machine learning algorithm, e.g. text or categories.
• Some feature values may cause problems during the learning process, e.g. data represented on different scales.
• We may want to reduce the number of features to plot and visualize data, speed up training, or improve the accuracy of a specific model.
Feature Engineering Techniques
List of Techniques
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Feature Split
8. Scaling
9. Extracting Date
Imputation Using (Mean/Median) Values
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently of the others. It can only be used with numeric data.
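A minimal sketch using scikit-learn's SimpleImputer (the array X below is hypothetical data used only for illustration):
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[7.0, 2.0], [4.0, np.nan], [np.nan, 6.0]])   # hypothetical numeric data with gaps
imputer = SimpleImputer(strategy='mean')                   # or strategy='median'
X_imputed = imputer.fit_transform(X)                       # missing entries replaced column by column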
Pros and Cons
Pros:
• Easy and fast.
• Works well with small numerical datasets.
Cons:
• Doesn't factor in the correlations between features; it only works at the column level.
• Will give poor results on encoded categorical features (do
NOT use it on categorical features).
• Not very accurate.
• Doesn’t account for the uncertainty in the imputations.
Imputation Using (Most Frequent) or
(Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values, and it works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.
Pros:
• Works well with categorical features.
Cons:
• It also doesn't factor in the correlations between features.
• It can introduce bias in the data.
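A minimal sketch, again using scikit-learn's SimpleImputer (the column values below are hypothetical):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
X = pd.DataFrame({'colour': ['red', 'blue', np.nan, 'red']})              # hypothetical categorical column
mode_imputer = SimpleImputer(strategy='most_frequent')                    # fill with the most frequent value
X_mode = mode_imputer.fit_transform(X)
const_imputer = SimpleImputer(strategy='constant', fill_value='missing')  # or fill with a fixed constant
X_const = const_imputer.fit_transform(X)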
Imputation Using k-NN
The k-nearest neighbours (k-NN) algorithm is normally used for simple classification. For imputation, it uses 'feature similarity' to predict missing values from the most similar complete data points.
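A minimal sketch using scikit-learn's KNNImputer (the array X is hypothetical):
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [8.0, 8.0]])   # hypothetical data with one gap
imputer = KNNImputer(n_neighbors=2)                                  # fill each gap from the 2 most similar rows
X_imputed = imputer.fit_transform(X)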
Pros and Cons
Pros:
• Can be much more accurate than the mean, median or
most frequent imputation methods (It depends on the
dataset).
Cons:
• Computationally expensive. KNN works by storing the
whole training dataset in memory.
• K-NN is quite sensitive to outliers in the data (unlike
SVM)
Handling Outliers
Common sources of outliers:
• Incorrect data entry or errors during data processing.
• Missing values in a dataset.
• Data that did not come from the intended sample.
• Errors that occur during experiments.
• Not an error, but a value that is genuinely unusual compared to the rest of the data.
• A more extreme distribution than normal.
Handling Outliers
Univariate method:
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words your data has only one variable.
It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes the data, summarizes it, and finds patterns in it.
Handling outlier with Z score
The Z-score is the signed number of standard deviations by which the value of
an observation or data point is above the mean value of what is being
observed or measured.
Z score is an important concept in statistics. Z score is also called standard
score. This score helps to understand if a data value is greater or smaller than
mean and how far away it is from the mean. More specifically, Z score tells
how many standard deviations away a data point is from the mean.
The intuition behind the Z-score is to describe any data point in terms of its relationship with the mean and standard deviation of the group of data points.
Z-scoring rescales the data so that it has mean 0 and standard deviation 1, i.e. a standard normal distribution (when the original data is normally distributed).
Z score = (x - mean) / std. deviation
If the z score of a data point is more than 3, it indicates that the data point is
quite different from the other data points. Such a data point can be an outlier.
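A minimal sketch of Z-score-based outlier detection with NumPy (the data values are hypothetical):
import numpy as np
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10, scale=1, size=50), 100.0)   # hypothetical data with one extreme value
z_scores = (data - data.mean()) / data.std()                    # z = (x - mean) / std. deviation
outliers = data[np.abs(z_scores) > 3]                           # points more than 3 std. deviations from the mean
print(outliers)                                                 # the extreme value 100.0 is flagged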
Binning
Data binning (also called bucketing) is a data pre-processing method used to minimize the effects of small observation errors.
The original data values are divided into small intervals
known as bins and then they are replaced by a general
value calculated for that bin.
This has a smoothing effect on the input data and may also
reduce the chances of overfitting in case of small datasets.
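A minimal sketch using pandas.cut (the age values and bin edges are hypothetical):
import pandas as pd
ages = pd.Series([5, 17, 23, 35, 46, 67, 80])                        # hypothetical raw values
age_bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                  labels=['child', 'young', 'adult', 'senior'])       # replace raw values with bin labels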
Log Transform
The Log Transform is one of the most popular
Transformation techniques out there.
It is primarily used to convert a skewed distribution to a
normal distribution/less-skewed distribution.
In this transform, we take the log of the values in a
column and use these values as the column instead.
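A minimal sketch of a log transform with NumPy and pandas (the income column is hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({'income': [20000, 35000, 50000, 1200000]})   # hypothetical right-skewed column
df['income_log'] = np.log1p(df['income'])                       # log(1 + x) also handles zero values safely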
Standard Scaler
The Standard Scaler is another popular scaler that is very
easy to understand and implement.
For each feature, the Standard Scaler scales the values so that the mean is 0 and the standard deviation is 1 (and hence the variance is also 1).
x_scaled = (x - mean) / std_dev
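A minimal sketch using scikit-learn's StandardScaler (the array X is hypothetical):
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # hypothetical features on very different scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)                          # each column now has mean 0 and std. deviation 1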
One-Hot Encoding
A one hot encoding allows the representation of
categorical data to be more expressive.
Many machine learning algorithms cannot work with
categorical data directly.
The categories must be converted into numbers.
This is required for both input and output variables that
are categorical.
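A minimal sketch using pandas.get_dummies (the colour column is hypothetical):
import pandas as pd
df = pd.DataFrame({'colour': ['red', 'blue', 'green', 'red']})   # hypothetical categorical column
one_hot = pd.get_dummies(df, columns=['colour'])                 # one binary column per category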
Feature subset selection
Feature Selection is the most critical pre-processing
activity in any machine learning process. It intends to
select a subset of attributes or features that makes the most
meaningful contribution to a machine learning activity.
High dimensional data
High-dimensional data refers to data sets with a very large number of variables, attributes, or features, common in domains such as DNA analysis and geographic information systems (GIS). Such data may have hundreds or thousands of dimensions, which is a problem from the machine learning perspective: handling that many features is a big challenge for any ML algorithm, it requires a large amount of computation and time, and a model built on an extremely high number of features may be very difficult to understand. For these reasons it is often necessary to work with a subset of the features instead of the full set. The objectives of feature selection are therefore:
1. Having a faster and more cost-effective (less need for computational resources) learning model.
2. Having a better understanding of the underlying model that generates the data.
3. Improving the efficacy of the learning model.
Feature subset selection methods
1. Wrapper methods
Wrapper methods compute models with a certain subset of features and evaluate the importance of each feature.
Then they iterate and try a different subset of features until the
optimal subset is reached.
Two drawbacks of this method are the large computation time
for data with many features, and that it tends to overfit the
model when there is not a large amount of data points.
The most notable wrapper methods of feature selection
are forward selection, backward selection, and stepwise
selection.
Feature subset selection methods
1. Wrapper methods
Forward selection starts with zero features, then, for each
individual feature, runs a model and determines the p-value
associated with the t-test or F-test performed. It then selects the
feature with the lowest p-value and adds that to the working
model.
Backward selection starts with all features contained in the dataset. It then runs a model, calculates a p-value associated with the t-test or F-test of the model for each feature, removes the least significant feature, and repeats until only significant features remain.
Stepwise selection is a hybrid of forward and backward selection. It starts with zero features and adds the one feature with the lowest significant p-value as described above; after each addition, it also checks whether any previously added feature can now be dropped. A sketch of greedy forward/backward selection follows below.
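A minimal sketch of greedy forward/backward selection using scikit-learn's SequentialFeatureSelector; note that this implementation scores candidate subsets by cross-validated model score rather than p-values, and the diabetes dataset is used only as an example:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=4,
                                     direction='forward')   # direction='backward' for backward elimination
selector.fit(X, y)
print(selector.get_support())                               # boolean mask marking the selected features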
Feature subset selection methods
2. Filter methods
Filter methods use a measure other than error rate to determine
whether that feature is useful.
Rather than tuning a model (as in wrapper methods), a subset of
the features is selected through ranking them by a useful
descriptive measure.
Benefits of filter methods are that they have a very low
computation time and will not overfit the data.
However, one drawback is that they are blind to any interactions
or correlations between features.
This will need to be taken into account separately, which will be
explained below. Three different filter methods are ANOVA,
Pearson correlation, and variance thresholding.
Feature subset selection methods
2. Filter methods
The ANOVA (Analysis of variance) test looks at the variation within the treatments of a feature and also between the treatments.
The Pearson correlation coefficient is a measure of the
similarity of two features that ranges between -1 and 1. A value
close to 1 or -1 indicates that the two features have a high
correlation and may be related.
The variance of a feature determines how much predictive
power it contains. The lower the variance is, the less
information contained in the feature, and the less value it has in
predicting the response variable.
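A minimal sketch of the three filter measures with scikit-learn and pandas (the iris dataset and the thresholds are illustrative assumptions):
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
X, y = load_iris(return_X_y=True, as_frame=True)
X_anova = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)   # ANOVA F-test: keep the k best features
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)              # drop features with variance below the cutoff
feature_corr = X.corr()                                                # Pearson correlation between feature pairs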
Feature subset selection methods
3. Embedded Methods
Embedded methods perform feature selection as a part of the
model creation process.
This generally leads to a happy medium between the two
methods of feature selection previously explained, as the
selection is done in conjunction with the model tuning process.
Lasso and Ridge regression are the two most common feature selection methods of this type, and decision trees also perform feature selection implicitly as part of model construction.
Feature subset selection methods
3. Embedded Methods
Lasso Regression is another way to penalize the beta coefficients in a
model, and is very similar to Ridge regression. It also adds a penalty
term to the cost function of a model, with a lambda value that must be
tuned.
The fewer features a model has, the lower its complexity.
import numpy as np
from sklearn.linear_model import Lasso
# X_train, X_test, y_train, y_test are assumed to be standardized train/test splits prepared earlier.
lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)   # R^2 on the training data
test_score = lasso.score(X_test, y_test)      # R^2 on the held-out data
coeff_used = np.sum(lasso.coef_ != 0)         # number of features Lasso kept (non-zero coefficients)
An important note for Ridge and Lasso regression is that all of your features must be standardized.
Feature subset selection methods
3. Embedded Methods
Ridge regression reduces model complexity by penalizing the beta coefficients of a model for being too large. Basically, it scales back the strength of the coefficients of variables that may not be as important as others. Ridge regression is done by adding a penalty term (also called a ridge estimator or shrinkage estimator) to the cost function of the regression. The penalty term takes all of the betas and scales them by a term lambda (λ) that must be tuned (usually with cross-validation, which compares the same model with different values of lambda).
from sklearn.linear_model import Ridge
# X_train, y_train are assumed to be a standardized training split; alpha plays the role of lambda.
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
Prof. Monali Suthar (SOCET-CE)