5_Unit 2 - Lecture 2-Data Handling
Handling datasets for Machine Learning Feature sets
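• Cross-Validation: Use k-fold cross-validation to make the best use of the data, especially when data is limited. The data is split into k folds, and each fold serves once as the validation set while the remaining k-1 folds are used for training.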
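The splitting step of k-fold cross-validation can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_indices` and the `seed` parameter are assumptions made for this example, not part of the lecture. Libraries such as scikit-learn provide equivalent splitters.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once so that folds do not follow the original row order.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train_idx)} samples, "
          f"validate on {sorted(val_idx)}")
```

Every sample appears in exactly one validation fold, so the model is evaluated on all of the data without ever being validated on a sample it was trained on in that round.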
6. Feature Selection
Remove Unnecessary Features: Drop features that do not contribute to model performance.
Use Algorithms: Employ methods that rank or select important features, such as LASSO (L1 regularization, which can shrink coefficients to exactly zero) or the feature importances of Decision Trees.
Correlation Analysis: For each pair of highly correlated features, remove one of the two to reduce multicollinearity.
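The correlation analysis step can be sketched as below. This is a minimal standard-library illustration, not a definitive procedure: the helper names `pearson` and `drop_correlated`, the 0.95 threshold, and the toy feature values are all assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: height_in carries the same information as height_cm.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [23.0, 51.0, 37.0, 19.0, 44.0],
}
selected = drop_correlated(features, threshold=0.95)
print(selected)  # ['height_cm', 'age']
```

The first feature of each correlated pair is kept, so the result depends on column order; which of the two to keep is a judgment call that this sketch does not address.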
7. Feature Scaling
Normalization: Scale features to a fixed range, typically [0, 1], using x' = (x - min) / (max - min).
Standardization: Transform features to have zero mean and unit variance, using z = (x - mean) / std.
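Both scaling methods can be sketched for a single feature column as follows. This is a minimal standard-library illustration; the function names and the example values are assumptions, and the handling of constant columns (mapping them to 0.0) is one possible convention, not a rule from the lecture.

```python
import statistics


def min_max_normalize(values):
    """Scale values to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Transform values to zero mean and unit variance with z = (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - mean) / std for v in values]


ages = [18.0, 25.0, 40.0, 60.0]  # hypothetical feature column
print(min_max_normalize(ages))
print(standardize(ages))
```

In practice the minimum, maximum, mean, and standard deviation are usually computed on the training data only and then reused to transform validation and test data, so that no information leaks from the held-out sets.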
8. Data Augmentation
Generate New Data: For image, text, or audio data, create variations of existing data to
increase the dataset size.
In machine learning, data augmentation is a common method for transforming existing data to artificially increase the size of a training dataset. By increasing the variety and variability of the training data, it aims to make machine learning models more accurate and more robust.
Data augmentation can be especially beneficial when the original dataset is small, as it enables the model to learn from a larger and more varied set of samples.
Types of Data Augmentation: Data augmentation techniques can be applied to many kinds of data, including time series, text, images, and audio. Here are a few frequently used methods for image data:
Rotation and flipping: Images can be rotated by different angles and flipped horizontally or vertically to create alternative points of view.
Random cropping and padding: Applying random cropping or padding to the images simulates various scales and translations.
Scaling and zooming: Rescaling the images to different sizes, or zooming in and out, helps the model handle objects of various sizes and resolutions.
Shearing and perspective transform: Changing an image's shape or perspective can imitate
various viewing angles while also introducing deformations.
Color jittering: By adjusting the color characteristics of the images, including their
brightness, contrast, saturation, and hue, the model can be made to be more resilient to
variations in illumination.
Gaussian noise: By introducing random Gaussian noise to the images, the model's
resistance to noisy inputs can be strengthened.
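Several of the image methods above can be sketched on a tiny grayscale image stored as a list of pixel rows with values from 0 to 255. This is a minimal standard-library illustration under stated assumptions: all function names are hypothetical, rotation is limited to 90 degrees (arbitrary angles need interpolation), and only brightness is adjusted out of the color jittering options. Scaling, shearing, and perspective transforms are not covered here.

```python
import random


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def rotate_90_clockwise(img):
    """Rotate by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*img[::-1])]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def pad(img, border, fill=0):
    """Surround the image with a constant-valued border."""
    width = len(img[0]) + 2 * border
    top = [[fill] * width for _ in range(border)]
    middle = [[fill] * border + row + [fill] * border for row in img]
    bottom = [[fill] * width for _ in range(border)]
    return top + middle + bottom


def clip(value):
    """Round to an integer and keep it inside the valid pixel range."""
    return min(255, max(0, round(value)))


def adjust_brightness(img, factor):
    """Multiply every pixel by factor (>1 brightens, <1 darkens)."""
    return [[clip(p * factor) for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[clip(p + rng.gauss(0, sigma)) for p in row] for row in img]


rng = random.Random(0)  # fixed seed so the augmentations are reproducible
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
augmented = [
    flip_horizontal(image),
    rotate_90_clockwise(image),
    random_crop(pad(image, border=1), 3, 3, rng),
    adjust_brightness(image, factor=1.2),
    add_gaussian_noise(image, sigma=5.0, rng=rng),
]
print(f"created {len(augmented)} variations from one image")
```

Padding before cropping, as in the third variation, shifts the content inside a window of the original size, which is one common way to simulate small translations.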
9. Data Storage
Save Cleaned Data: Store the cleaned and preprocessed data in an appropriate format
(CSV, HDF5, etc.) for future use.
Document the Process: Keep track of the steps and transformations applied to the data
for reproducibility.
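Saving the cleaned data together with a record of the applied steps can be sketched as follows. This is a minimal standard-library illustration using CSV for the data and JSON for the log; the function name `save_cleaned_data`, the file names, and the example rows and steps are all assumptions made for this example.

```python
import csv
import json
import os
import tempfile


def save_cleaned_data(rows, fieldnames, steps, out_dir):
    """Write the rows to a CSV file and the preprocessing steps to a JSON log."""
    data_path = os.path.join(out_dir, "cleaned.csv")
    log_path = os.path.join(out_dir, "preprocessing_steps.json")
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump({"steps": steps, "n_rows": len(rows)}, f, indent=2)
    return data_path, log_path


# Hypothetical cleaned data and the transformations that produced it.
rows = [{"age": 0.0, "income": -1.2}, {"age": 1.0, "income": 1.2}]
steps = [
    "dropped rows with missing values",
    "min-max normalized age",
    "standardized income",
]

with tempfile.TemporaryDirectory() as out_dir:
    data_path, log_path = save_cleaned_data(rows, ["age", "income"], steps, out_dir)
    with open(data_path, newline="", encoding="utf-8") as f:
        reloaded = list(csv.DictReader(f))
    print(reloaded[0])
```

CSV stores everything as text, so numeric columns come back as strings and must be converted on load; binary formats such as HDF5 preserve data types but need an additional library.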