5_Unit 2 - Lecture 2-Data Handling

Uploaded by

sihagmukesh05
Handling datasets for Machine Learning Feature sets

• Handling datasets for machine learning feature sets involves several key steps. The guide below covers how to manage and prepare datasets effectively:

1. Data Collection
• Identify Data Sources: Determine the sources from which the data will be collected (databases, APIs, web scraping, sensors, etc.).
• Gather Data: Collect the data, ensuring you have enough examples to train a robust model.

Figure 1: Data Collection



2. Data Cleaning
• Remove Duplicates: Eliminate duplicate records to avoid redundancy.
• Handle Missing Values: Impute missing values (mean/median imputation, forward/backward fill), or remove records/columns with excessive missing values.
• Correct Errors: Fix errors in the data, such as incorrect labels and out-of-range values.
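
The cleaning steps above can be sketched with pandas. This is a minimal illustration, not a prescribed pipeline: the toy table, the column names, and the assumed valid age range of 0 to 120 are all hypothetical.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are assumptions.
raw = pd.DataFrame({
    "age":    [25, 25, None, 41, 230],
    "income": [50000.0, 50000.0, 62000.0, None, 58000.0],
    "label":  ["yes", "yes", "no", "no", "yes"],
})

# 1. Remove exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# 2. Correct errors: treat out-of-range ages as missing
#    (assumed valid range: 0 to 120).
df["age"] = df["age"].where(df["age"].between(0, 120))

# 3. Impute the remaining missing numeric values with the column median.
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```

Median imputation is used here only as one of the strategies listed above; mean imputation or forward/backward fill would slot into step 3 in the same way.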

Figure 2: Data Cleaning Cycle


3. Data Transformation
• Normalization/Standardization: Scale the features so they have similar ranges. Common techniques include min-max normalization and z-score standardization.
• Encoding Categorical Variables: Convert categorical variables to numerical values using methods like one-hot encoding, label encoding, or target encoding.
• Feature Engineering: Create new features from existing ones to help the model learn better. This includes creating interaction terms and polynomial features, and using domain knowledge to derive new features.
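
As a rough illustration of two of the transformations above, the sketch below one-hot encodes a categorical column and derives an interaction feature. The helper `one_hot_encode` and the example columns (`colors`, `width`, `height`) are hypothetical names chosen for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a 0/1 list with one slot per category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the slot of this value's category
        rows.append(row)
    return categories, rows


# Encoding a categorical variable (hypothetical column).
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

# Feature engineering: an interaction term built from two existing features.
width = [2.0, 3.0, 4.0]
height = [5.0, 1.0, 2.0]
area = [w * h for w, h in zip(width, height)]

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(area)        # [10.0, 3.0, 8.0]
```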

Figure 3: Data Transformation Process



Figure 4: Data Transformation Techniques



4. Data Splitting
• Train-Test Split: Split the data into training and testing sets to evaluate the model's performance on unseen data.
• Validation Set: Further split the training data into a training set and a validation set to tune hyperparameters and avoid overfitting.
• Cross-Validation: Use k-fold cross-validation to make the best use of the data, especially when you have limited data.

Figure 5: Data Splitting
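
The splitting strategies above can be sketched on sample indices using only the standard library. The function names, the 60/20/20 proportions, and the fixed seed are assumptions made for illustration; in practice a library routine would typically be used instead.

```python
import random


def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices once, then carve out test, validation, and training sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, val) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


train, val, test = train_val_test_split(100)
splits = list(k_fold_indices(10, k=5))
print(len(train), len(val), len(test))  # 60 20 20
print(len(splits))                      # 5
```

Working on indices rather than on the data itself keeps the sketch independent of how the features are stored.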

Figure 6: Cross Validation


5. Handling Imbalanced Data
• Resampling: Use techniques like oversampling the minority class (e.g., SMOTE) or undersampling the majority class.
• Class Weighting: Assign different weights to classes to balance the influence of each class on the model training.
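
A minimal sketch of both ideas follows. Note that the oversampling shown is plain random duplication of minority samples, not SMOTE (which synthesizes new points by interpolation). The weighting formula n_samples / (n_classes × class_count) is one common "balanced" heuristic; the function names and toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, c in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - c)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


# Toy imbalanced dataset: 8 samples of class 0, 2 samples of class 1.
labels = [0] * 8 + [1] * 2
samples = list(range(10))

weights = balanced_class_weights(labels)
x_bal, y_bal = random_oversample(samples, labels)
print(weights)         # {0: 0.625, 1: 2.5}
print(Counter(y_bal))  # both classes now have 8 samples
```

Resampling should normally be applied to the training split only, so that the test set keeps the original class distribution.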

Figure 7: Handling Imbalanced Data


6. Feature Selection
• Remove Unnecessary Features: Drop features that do not contribute to the model performance.
• Use Algorithms: Employ algorithms (like LASSO, Decision Trees) that help in selecting important features.
• Correlation Analysis: Remove highly correlated features to reduce multicollinearity.
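
The correlation-analysis step can be sketched with NumPy as a greedy filter: a feature is kept only if its absolute Pearson correlation with every already-kept feature stays at or below a threshold. The function name `drop_correlated`, the 0.9 threshold, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.9):
    """Greedily keep columns whose |correlation| with all kept columns is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # columns are the variables
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


# Synthetic data: "a_scaled" is almost a linear copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])

X_reduced, kept = drop_correlated(X, ["a", "a_scaled", "b"])
print(kept)  # ['a', 'b']
```

Which member of a correlated pair gets dropped depends on column order here; domain knowledge may suggest a better choice.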

Figure 8: Feature Selection



Figure 9: Benefit of Feature Selection


7. Feature Scaling: Feature scaling brings the independent features in the data onto a comparable scale. It is performed during data pre-processing to handle features whose magnitudes, ranges, or units vary widely. Without it, many algorithms (for example, distance-based or gradient-based ones) tend to give more weight to features with larger numeric values and less to features with smaller values, regardless of their units.
• Normalization: Scale features to a range, typically [0, 1].
• Standardization: Transform features to have zero mean and unit variance.

Figure 10: Data Normalization
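
Both scaling techniques can be written in a few lines of NumPy. This is a sketch under simple assumptions: features are the columns of a 2-D float array, and constant columns are guarded against division by zero. The function names are hypothetical.

```python
import numpy as np


def min_max_scale(X):
    """Normalization: map each column to [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid dividing by zero
    return (X - x_min) / span


def standardize(X):
    """Standardization: give each column zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # leave constant columns at zero
    return (X - mean) / std


# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_norm = min_max_scale(X)
X_std = standardize(X)
print(X_norm)  # each column becomes [0.0, 0.5, 1.0]
```

In a real pipeline the minimum, maximum, mean, and standard deviation would usually be computed on the training set only and then reused for the validation and test sets.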


8. Data Augmentation
• Generate New Data: For image, text, or audio data, create variations of existing data to increase the dataset size.
• In machine learning, data augmentation is a common method for transforming existing data to artificially increase the size of a training dataset. It aims to improve the performance and flexibility of machine learning models by increasing the variety and variability of the training data.
• Data augmentation can be especially beneficial when the original dataset is small, as it enables the model to learn from a larger and more varied group of samples.

Types of Data Augmentation: Data augmentation techniques can be applied to many kinds of data, including time series, text, images, and audio. Here are a few frequently used methods for image data:
• Rotation and flipping: Images can be rotated at different angles and flipped horizontally or vertically to create alternative points of view.
• Random cropping and padding: Applying random cropping or padding to the images simulates various scales and translations.
• Scaling and zooming: Rescaling the images to different sizes, or zooming in and out, helps the model handle objects of various sizes and resolutions.
• Shearing and perspective transform: Changing an image's shape or perspective can imitate various viewing angles while also introducing deformations.
• Color jittering: Adjusting the color characteristics of the images, including their brightness, contrast, saturation, and hue, makes the model more resilient to variations in illumination.
• Gaussian noise: Adding random Gaussian noise to the images strengthens the model's resistance to noisy inputs.
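
A few of the image augmentations listed above can be combined in a short NumPy sketch. It assumes images are float arrays of shape (height, width, channels) with values in [0, 1]; the function name `augment`, the crop size, and the noise level are hypothetical choices. Rotation is restricted to multiples of 90 degrees to avoid interpolation.

```python
import numpy as np


def augment(image, rng, crop=24, noise_std=0.05):
    """Apply a random flip, a 90-degree rotation, a random crop, and Gaussian noise."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                      # horizontal flip
    k = int(rng.integers(0, 4))                    # 0, 1, 2, or 3 quarter turns
    out = np.rot90(out, k=k, axes=(0, 1))
    h, w = out.shape[:2]
    top = int(rng.integers(0, h - crop + 1))       # random crop position
    left = int(rng.integers(0, w - crop + 1))
    out = out[top:top + crop, left:left + crop, :]
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # Gaussian noise
    return np.clip(out, 0.0, 1.0)                  # keep values in [0, 1]


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real 32x32 RGB image
augmented = augment(image, rng)
print(augmented.shape)  # (24, 24, 3)
```

Calling `augment` repeatedly on the same image yields different variants, which is how the training set is effectively enlarged.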

Figure 11: Data Augmentation


9. Data Storage
• Save Cleaned Data: Store the cleaned and preprocessed data in an appropriate format (CSV, HDF5, etc.) for future use.
• Document the Process: Keep track of the steps and transformations applied to the data for reproducibility.
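
One simple way to pair the stored data with a record of how it was produced is sketched below, using only the standard library. The file names (`cleaned.csv`, `preprocessing_log.json`), the function name, and the step labels are hypothetical; a temporary directory stands in for a real project folder.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_dataset(rows, fieldnames, steps, out_dir):
    """Write the cleaned rows to CSV and the applied steps to a JSON log."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / "cleaned.csv"
    log_path = out_dir / "preprocessing_log.json"
    with data_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    log_path.write_text(json.dumps({"steps": steps}, indent=2))
    return data_path, log_path


rows = [{"age": 25, "income": 50000}, {"age": 41, "income": 58000}]
steps = ["drop_duplicates", "median_imputation"]  # order in which steps were applied
out_dir = tempfile.mkdtemp()
data_path, log_path = save_dataset(rows, ["age", "income"], steps, out_dir)
print(data_path.name, log_path.name)
```

Keeping the log next to the data means anyone loading the CSV later can see which transformations it has already been through.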
