Feature Engineering & Feature Selection
Davis David
Data Scientist at ParrotAI
d.david@parrotai.co.tz
CONTENT:
1. Feature Engineering
2. Missing Data
3. Continuous Features
4. Categorical Features
5. Feature Selection
6. Practical Feature Engineering and Selection
1. Feature Engineering
Feature engineering refers to a process of selecting and transforming
variables/features when creating a predictive model using machine
learning.
Feature engineering has two goals:
● Preparing the proper input dataset, compatible with the machine
learning algorithm requirements.
● Improving the performance of machine learning models.
Data scientists spend 60% of their time on cleaning and organizing data.
57% of data scientists regard cleaning and organizing data as the least
enjoyable part of their work.
“At the end of the day, some machine learning projects succeed
and some fail. What makes the difference? Easily the most
important factor is the features used.”
— Prof. Pedro Domingos from University of Washington
Read his paper here : A few useful things to know about machine learning
2. Missing Data
● Handling missing data is important as many machine learning
algorithms do not support data with missing values.
● Having missing values in the dataset can cause errors and poor
performance with some machine learning algorithms.
2. Missing Data
Common missing values:
● N/A
● null
● Empty
● ?
● none
● empty
● -
● NaN
2. How to handle Missing Values
(a) Variable Deletion
Variable deletion involves dropping variables (columns) with missing values on a
case-by-case basis.
This method makes sense when a variable has many missing values and is of
relatively low importance.
In general, it is only worth deleting a variable when more than 60% of its
observations are missing.
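As a rough sketch of this rule of thumb in pandas (the toy DataFrame and the 60% cutoff here are illustrative, not taken from the slides):

import pandas as pd
import numpy as np

# toy data: column "b" is mostly missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# fraction of missing values per column
missing_ratio = df.isnull().mean()

# drop columns whose missing ratio exceeds 60%
df = df.drop(columns=missing_ratio[missing_ratio > 0.6].index)
print(df.columns.tolist())  # ['a']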
2. How to handle Missing Values
(b) Mean or Median Imputation
A common technique is to use the mean or median of the non-missing
observations.
This strategy can be applied to features with numeric data.
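For example, a minimal median-imputation sketch with pandas (the "salary" column is hypothetical; scikit-learn's SimpleImputer with strategy="median" does the same job inside a pipeline):

import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [3000.0, np.nan, 4500.0, 5200.0, np.nan]})

# fill missing numeric values with the median of the observed values
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df["salary"].tolist())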
2. How to handle Missing Values
(c) Most Common Value
Replacing missing values with the most frequently occurring value in a column/feature is a
good option for handling categorical columns/features.
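A minimal mode-imputation sketch with pandas (the "gender" column is just an example):

import pandas as pd
import numpy as np

df = pd.DataFrame({"gender": ["male", "female", np.nan, "female", np.nan]})

# replace missing categories with the most frequent value (the mode)
most_frequent = df["gender"].mode()[0]
df["gender"] = df["gender"].fillna(most_frequent)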
3. Continuous Features
● Continuous features in the dataset have different ranges of values.
● If you train your model on features with very different ranges of values, the
model may not perform well.
Examples of continuous features: age, salary, price, height
Common methods:
● Min-Max normalization
● Standardization
3. Continuous Features
(a) Min-Max Normalization
For each value in a feature, Min-Max normalization subtracts the minimum
value of the feature and then divides by the range, where the range is the
difference between the original maximum and the original minimum:
x' = (x - min) / (max - min)
It scales all values to a fixed range between 0 and 1.
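A minimal sketch with scikit-learn's MinMaxScaler (the toy "age" values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[20.0], [35.0], [50.0], [65.0]])

# x' = (x - min) / (max - min): every value ends up between 0 and 1
scaler = MinMaxScaler()
ages_scaled = scaler.fit_transform(ages)
print(ages_scaled.ravel())  # [0.  0.333...  0.666...  1.]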
3. Continuous Features
(b) Standardization
Standardization ensures that each feature has a mean of 0 and a variance of 1,
bringing all features to the same magnitude.
If the standard deviations of features differ, their ranges will also differ from
each other.
z = (x - μ) / σ, where x = observation, μ = mean, σ = standard deviation
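A minimal sketch with scikit-learn's StandardScaler (the toy "salary" values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[3000.0], [4500.0], [5200.0], [6100.0]])

# z = (x - mean) / std: the column ends up with mean 0 and variance 1
scaler = StandardScaler()
salaries_scaled = scaler.fit_transform(salaries)
print(salaries_scaled.ravel())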
4. Categorical Features
Categorical features represent types of data which may be divided into groups.
Examples: gender, educational level
Any non-numerical values need to be converted to integers or floats in order to
be used in most machine learning libraries.
Common methods:
● One-hot encoding (dummy variables)
● Label encoding
4. Categorical Features
(a) One-Hot Encoding
By far the most common way to represent categorical variables is one-hot
encoding, or one-out-of-N encoding, also known as dummy variables.
The idea behind dummy variables is to replace a categorical variable
with one or more new features that can have the values 0 and 1.
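A minimal sketch using pandas.get_dummies (the "education" column is hypothetical; scikit-learn's OneHotEncoder is the equivalent inside a pipeline):

import pandas as pd

df = pd.DataFrame({"education": ["primary", "secondary", "university", "secondary"]})

# each category becomes its own 0/1 column (a dummy variable)
dummies = pd.get_dummies(df, columns=["education"])
print(dummies.columns.tolist())
# ['education_primary', 'education_secondary', 'education_university']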
4. Categorical Features
(b) Label Encoding
Label encoding simply converts each categorical value in a column to a
number.
NB: It is recommended to use label encoding for binary variables.
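A minimal sketch with scikit-learn's LabelEncoder on a binary column (the "gender" column is just an example; note that LabelEncoder is primarily intended for target labels):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# map each category to an integer; classes are sorted, so female -> 0, male -> 1
encoder = LabelEncoder()
df["gender_encoded"] = encoder.fit_transform(df["gender"])
print(list(encoder.classes_))  # ['female', 'male']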
5. Feature Selection
● Feature selection is the process where you automatically or manually select
the features that contribute most to the prediction variable or output you
are interested in.
● Having irrelevant features in your data can decrease the accuracy of the
models and make your model learn based on irrelevant features.
5. Feature Selection
Top reasons to use feature selection are:
● It enables the machine learning algorithm to train faster.
● It reduces the complexity of a model and makes it easier to interpret.
● It improves the accuracy of a model if the right subset is chosen.
● It reduces overfitting.
5. Feature Selection
“I prepared a model by selecting all the features and I got an accuracy of around 65%
which is not pretty good for a predictive model and after doing some feature
selection and feature engineering without doing any logical changes in my model
code my accuracy jumped to 81% which is quite impressive”
— Raheel Shaikh
5. Feature Selection
(a) Univariate Selection
● Statistical tests can be used to select the independent features that have
the strongest relationship with the target feature in your dataset,
e.g. the chi-squared test.
● The scikit-learn library provides the SelectKBest class that can be used with a
suite of different statistical tests to select a specific number of features.
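A minimal sketch of SelectKBest with the chi-squared test, using the Iris dataset as a stand-in (chi2 requires non-negative feature values):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# keep the 2 features with the strongest chi-squared relationship to the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # one score per original feature
print(X_selected.shape)   # (150, 2)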
5. Feature Selection
(b) Feature Importance
Feature importance gives you a score for each feature of your data; the higher
the score, the more important or relevant the feature is to your target feature.
Feature importance is built into tree-based classifiers.
Examples:
● Random Forest classifiers
● Extra Trees classifiers
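A minimal sketch with a Random Forest on the Iris dataset (the dataset and hyperparameters are illustrative):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

# tree-based models expose feature_importances_ after fitting
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False))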
5. Feature Selection
(c) Correlation Matrix with Heatmap
● Correlation shows how the features are related to each other or to the target
feature.
● Correlation can be positive (an increase in one feature's value increases the
value of the target variable) or negative (an increase in one feature's value
decreases the value of the target variable).
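A minimal sketch of a correlation heatmap with pandas and seaborn (the Iris dataset stands in for your own data; seaborn and matplotlib are assumed to be installed):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame  # features plus the 'target' column

# pairwise Pearson correlations between all features and the target
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()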