5_Unit 2 - Lecture 2-Data Handling

Uploaded by

sihagmukesh05

Handling datasets for Machine Learning Feature sets

• Handling datasets for machine learning feature sets involves several key steps. The guide below walks through how to manage and prepare your datasets effectively:

•1. Data Collection

• Identify Data Sources: Determine the
sources from where the data will be
collected (databases, APIs, web
scraping, sensors, etc.).
• Gather Data: Collect the data,
ensuring you have enough examples
to train a robust model.

Figure 1: Data Collection


•2. Data Cleaning

• Remove Duplicates: Eliminate
duplicate records to avoid
redundancy.
• Handle Missing Values: Impute
missing values using strategies
like mean/median imputation or
forward/backward fill, or remove
the records/columns with
excessive missing values.
• Correct Errors: Fix any errors in
the data such as incorrect labels,
out-of-range values, etc.
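
The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not a prescribed pipeline: the toy table, the column names, and the assumed valid age range of 0 to 120 are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicate row, two missing values,
# and one out-of-range age (230).
raw = pd.DataFrame({
    "age":    [25, 25, np.nan, 41, 230],
    "income": [30000.0, 30000.0, 52000.0, np.nan, 61000.0],
    "label":  ["yes", "yes", "no", "no", "yes"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Remove exact duplicate records.
    out = df.drop_duplicates().copy()
    # 2. Correct errors: ages outside the assumed range [0, 120]
    #    are treated as missing rather than trusted.
    out.loc[~out["age"].between(0, 120), "age"] = np.nan
    # 3. Handle missing values with median imputation.
    for col in ["age", "income"]:
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Median imputation is used here because it is less sensitive to outliers than the mean; forward/backward fill would be the more natural choice for ordered data such as time series.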

Figure 2: Data Cleaning Cycle

•3. Data Transformation
• Normalization/Standardization: Scale
the features so they have similar ranges.
Common techniques include min-max
normalization and z-score
standardization.
• Encoding Categorical Variables: Convert
categorical variables to numerical using
methods like one-hot encoding, label
encoding, or target encoding.
• Feature Engineering: Create new
features from existing ones to help the
model learn better. This includes
creating interaction terms, polynomial
features, and using domain knowledge
to derive new features.
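
As a rough sketch of encoding and feature engineering, the example below applies one-hot encoding, an explicit label (ordinal) mapping, and two derived features to a small table. The data, the column names, and the size ordering are assumptions made for illustration; target encoding is not shown.

```python
import pandas as pd

# Hypothetical toy data with two categorical and two numeric columns.
df = pd.DataFrame({
    "city":   ["Delhi", "Pune", "Delhi", "Mumbai"],
    "size":   ["S", "L", "M", "S"],
    "length": [2.0, 3.0, 4.0, 5.0],
    "width":  [1.0, 2.0, 2.0, 3.0],
})

# One-hot encoding: one 0/1 column per city, suitable for categories
# that have no natural order.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Label encoding with an explicit mapping, suitable when the
# categories are ordered (assumed here: S < M < L).
size_order = {"S": 0, "M": 1, "L": 2}
encoded["size"] = encoded["size"].map(size_order)

# Feature engineering: an interaction term and a polynomial feature.
encoded["area"] = encoded["length"] * encoded["width"]
encoded["length_sq"] = encoded["length"] ** 2

print(encoded)
```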

Figure 3: Data Transformation Process


Figure 4: Data Transformation Techniques


•4. Data Splitting

• Train-Test Split: Split the data into
training and testing sets to evaluate
the model's performance on unseen
data.
• Validation Set: Further split the
training data into a training set and a
validation set to tune
hyperparameters and avoid
overfitting.
• Cross-Validation: Use k-fold
cross-validation to make the best use of
the data, especially when you have
limited data.

Figure 5: Data Splitting

Figure 6: Cross Validation
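
The splitting strategies in step 4 can be sketched with plain NumPy index arrays. The helper names, the 60/20/20 proportions, and k = 5 are assumptions chosen for illustration; libraries such as scikit-learn provide equivalent utilities.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index pairs; each sample validates exactly once."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val = parts[i]
        train = np.concatenate([parts[j] for j in range(k) if j != i])
        yield train, val

train_idx, val_idx, test_idx = train_val_test_split(100)
cv_folds = list(k_fold_indices(100, k=5))
print(len(train_idx), len(val_idx), len(test_idx), len(cv_folds))
```

Working with indices rather than copies of the data makes it easy to apply the same split to the feature matrix and the labels.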

•5. Handling Imbalanced Data

• Resampling: Use techniques like oversampling the minority class (e.g., SMOTE) or undersampling the majority class.
• Class Weighting: Assign different weights to classes to balance the influence of each class on the model training.
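
A minimal sketch of both ideas follows. Note that it uses plain random oversampling (duplicating minority samples), not SMOTE, which synthesizes new points by interpolation. The function names and the toy data are hypothetical; the weight formula n_samples / (n_classes * class_count) is the common "balanced" heuristic.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing number of samples with replacement.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        keep.append(np.concatenate([cls_idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def balanced_class_weights(y):
    """Weight each class inversely to its frequency."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical imbalanced data: 8 samples of class 0, 2 of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X, y)
weights = balanced_class_weights(y)
print(np.bincount(y_bal), weights)
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution.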

Figure 7: Handling Imbalanced Data


•6. Feature Selection

• Remove Unnecessary Features:
Drop features that do not
contribute to the model
performance.
• Use Algorithms: Employ
algorithms (like LASSO, Decision
Trees) that help in selecting
important features.
• Correlation Analysis: Remove
highly correlated features to
reduce multicollinearity.
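
The correlation analysis step can be sketched as a simple filter. The 0.9 threshold, the function name, and the synthetic columns are assumptions for illustration. The rule is greedy: from each highly correlated pair it drops the column that appears later, which is one reasonable convention among several.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical features: "a_scaled" is almost a linear copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 0.01 * rng.normal(size=200),
    "b": b,
})

reduced, dropped = drop_correlated(features)
print(dropped, list(reduced.columns))
```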

Figure 8: Feature Selection


Figure 9: Benefit of Feature Selection

7. Feature Scaling: Feature scaling is a technique for bringing the independent features in the data into a common range. It is performed during data pre-processing to handle features whose magnitudes or units vary widely. Without feature scaling, many machine learning algorithms (for example, distance-based and gradient-based methods) tend to give more weight to features with larger numeric values and less weight to features with smaller values, regardless of the units involved.
• Normalization: Scale features to a range, typically [0, 1].
• Standardization: Transform features to have zero mean and unit variance.

Figure 10: Data Normalization
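
Both scaling methods reduce to one-line formulas: min-max normalization computes (x - min) / (max - min) and z-score standardization computes (x - mean) / std, each applied per feature column. The sketch below assumes a small hypothetical matrix; the function names and the eps guard against constant columns are illustrative choices.

```python
import numpy as np

def min_max_scale(X, eps=1e-12):
    """Rescale each column to [0, 1] using (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.maximum(hi - lo, eps)

def standardize(X, eps=1e-12):
    """Rescale each column to zero mean and unit variance using (x - mean) / std."""
    X = np.asarray(X, dtype=float)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.maximum(sigma, eps)

# Hypothetical data: the second feature is about 100 times larger than the first.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_mm = min_max_scale(X)
X_z = standardize(X)
print(X_mm)
print(X_z)
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training set only and then reused for the validation and test sets, so that no information leaks from held-out data.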


8. Data Augmentation
• Generate New Data: For image, text, or audio data, create variations of existing data to increase the dataset size.
• In machine learning, data augmentation is a common method for modifying existing data to artificially increase the size of a training dataset. By increasing the variety and variability of the training data, it aims to improve the performance and robustness of machine learning models.
• Data augmentation can be especially beneficial when the original dataset is small, as it enables the model to learn from a larger and more varied set of samples.

Types of Data Augmentation: Data augmentation techniques can be applied to many kinds of data, including time series, text, images, and audio. Here are a few frequently used methods for image data:
• Rotation and flipping: Images can be rotated by different angles and flipped horizontally or vertically to create alternative points of view.

• Random cropping and padding: Applying random cropping or padding to the images simulates different scales and translations.

• Scaling and zooming: Rescaling the images to different sizes, or zooming in and out, helps the model handle objects of various sizes and resolutions.

• Shearing and perspective transform: Changing an image's shape or perspective imitates different viewing angles while also introducing deformations.

• Color jittering: Adjusting the color characteristics of the images, including brightness, contrast, saturation, and hue, makes the model more resilient to variations in illumination.

• Gaussian noise: Adding random Gaussian noise to the images strengthens the model's resistance to noisy inputs.
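
A few of these image augmentations can be sketched with NumPy alone. The example assumes a square H x W x C image with values in [0, 1]; the probabilities, the brightness range, and the noise level are arbitrary illustrative choices, and cropping, zooming, and shearing are left out for brevity.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated, brightness-jittered, noisy copy of an image."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                      # vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180, or 270 degrees
    out = out * rng.uniform(0.8, 1.2)             # simple brightness jitter
    out = out + rng.normal(0.0, 0.02, size=out.shape)  # Gaussian noise
    return np.clip(out, 0.0, 1.0)                 # keep pixel values valid

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # hypothetical stand-in for a real image

# Four different augmented variants of the same source image.
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)
```

Augmentations are normally applied on the fly to training data only; validation and test images are left unchanged.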

Figure 11: Data Augmentation


9. Data Storage
• Save Cleaned Data: Store the cleaned and preprocessed data in an appropriate format (CSV, HDF5, etc.) for future use.
• Document the Process: Keep track of the steps and transformations applied to the data for reproducibility.
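
One simple way to combine both points is to write the cleaned table to CSV and store a small JSON log of the preprocessing steps next to it. The file names, the log layout, and the example steps below are hypothetical; a temporary directory is used only so the sketch leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def save_dataset(df, steps, out_dir):
    """Write the data as CSV plus a JSON log describing how it was produced."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / "cleaned.csv"
    log_path = out_dir / "preprocessing_log.json"
    df.to_csv(data_path, index=False)
    log = {"rows": len(df), "columns": list(df.columns), "steps": steps}
    log_path.write_text(json.dumps(log, indent=2))
    return data_path, log_path

# Hypothetical cleaned data and the steps that produced it.
cleaned = pd.DataFrame({"age": [25.0, 33.0], "income": [30000.0, 52000.0]})
steps = ["drop duplicates", "median imputation", "min-max scaling"]

with tempfile.TemporaryDirectory() as tmp:
    data_path, log_path = save_dataset(cleaned, steps, tmp)
    reloaded = pd.read_csv(data_path)
    log = json.loads(log_path.read_text())

print(log["steps"])
```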
