Machine Learning Unit-2
UNIT - 2
Basics of Data Preprocessing and Feature Engineering
Feature Transformation, Feature scaling, Feature Construction, and Feature subset
selection, Dimensionality reduction, Exploratory data analysis, Hyperparameter
tuning, Introduction to the scikit-learn package
1) Function Transformers:
Function transformers are a feature transformation technique that applies a
specific mathematical function to each data observation so that the data more
closely follows a normal distribution.
There is no strict rule for choosing the right function transformer; the choice
often depends on domain knowledge of the data. However, five common function
transformers are generally effective in making data more normally distributed:
the Log Transform, Square Transform, Square Root Transform, Reciprocal
Transform, and Custom Transform (a combined code sketch follows the five
transforms below).
a) Log Transform:
Log transformation is a simple technique where you apply the logarithm to
each value in the data; because the logarithm is defined only for positive
values, log(1 + x) is often used in practice. The transformed data is then
fed to the machine learning algorithm.
b) Square Transform:
Square transformation is a technique where you apply the square function
to each data value. In simple terms, you replace each value with its
square and use these squared values as the transformed data for your
machine learning algorithms.
c) Square Root Transform:
In square root transformation, you replace each data value x with its
square root. This is often helpful for reducing right skew in
non-negative data.
d) Reciprocal Transform:
In reciprocal transformation, you take the reciprocal (or multiplicative
inverse) of each data value, replacing each value x with 1/x. This method
can be useful for some datasets, as applying it can help achieve a more
normal distribution.
e) Custom Transform:
Log and square root transformations may not work for every dataset,
because each dataset has its own patterns and complexities. Instead, you
can apply custom transformations based on your understanding of the data.
A custom transformation can use any function or operation, such as sine,
cosine, tangent, or cubing, to push the data toward a normal
distribution.
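One way to implement all five transforms is scikit-learn's FunctionTransformer, which wraps an arbitrary function. Here is a minimal sketch; the small right-skewed array is made up purely for illustration.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [4.0], [8.0], [100.0]])  # made-up, right-skewed data

log_tf = FunctionTransformer(np.log1p)        # log transform, using log(1 + x)
square_tf = FunctionTransformer(np.square)    # square transform
sqrt_tf = FunctionTransformer(np.sqrt)        # square root transform
recip_tf = FunctionTransformer(np.reciprocal) # reciprocal transform (1/x)
custom_tf = FunctionTransformer(np.cbrt)      # a custom transform (cube root)

print(log_tf.fit_transform(X).ravel())  # the large value 100 is pulled in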
2) Power Transformers:
Power transformation techniques raise data observations to a mathematical
power to transform the data. There are two main types of power
transformation techniques:
a) Box-Cox Transform:
This technique is used to stabilize variance and make the data more
normally distributed. It requires the data to be strictly positive, and
it finds the best power (lambda) to apply, typically by maximum
likelihood estimation.
b) Yeo-Johnson Transform:
This technique extends the Box-Cox transform to handle zero and negative
values as well. It likewise aims to stabilize variance and make the data
more normally distributed.
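Both transforms are available through scikit-learn's PowerTransformer. A minimal sketch on made-up data; note that Box-Cox only accepts strictly positive values, while Yeo-Johnson also accepts zeros and negatives.

import numpy as np
from sklearn.preprocessing import PowerTransformer

X_pos = np.array([[0.5], [1.0], [2.0], [5.0], [20.0]])   # strictly positive
X_any = np.array([[-3.0], [0.0], [1.0], [4.0], [15.0]])  # mixed signs

boxcox = PowerTransformer(method='box-cox')
yeojohnson = PowerTransformer(method='yeo-johnson')  # the default method

print(boxcox.fit_transform(X_pos).ravel())
print(yeojohnson.fit_transform(X_any).ravel())
print(boxcox.lambdas_)  # the fitted power, chosen by maximum likelihood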
3) Quantile Transformers:
Quantile transformation techniques transform numerical data observations by
mapping them through the data's quantiles. This technique can be implemented
using libraries such as scikit-learn.
In this transformation, the input data is processed so that the output data
follows a chosen distribution, commonly the normal distribution. This makes
the data more suitable for machine learning algorithms.
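A minimal sketch using scikit-learn's QuantileTransformer; the exponentially distributed sample is made up to stand in for skewed real data.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.exponential(scale=2.0, size=(100, 1))  # skewed, made-up input

# Map the input through its quantiles onto a Gaussian shape.
qt = QuantileTransformer(n_quantiles=100, output_distribution='normal',
                         random_state=0)
X_normal = qt.fit_transform(X)
print(X_normal.mean(), X_normal.std())  # roughly 0 and 1 after the transform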
Feature Scaling:
Many datasets contain features measured on very different scales, which can
bias algorithms that rely on distances or gradients. Feature scaling solves
this by bringing all features to a similar scale, ensuring that no single
feature dominates the learning process.
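A minimal scaling sketch with scikit-learn; the two-feature matrix (something like age versus salary) is made up for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30000.0],
              [35, 60000.0],
              [45, 90000.0]])  # features on very different scales

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance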
Feature Subset Selection
Importance:
• Dimensionality Reduction: Many machine learning datasets have a large
number of features, but not all of them are useful. Reducing the number of
features can simplify the model.
1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search
problem: different combinations of features are made, evaluated, and
compared with one another. The algorithm is trained iteratively on each
candidate subset of features.
Based on the model's output, features are added or removed, and the model is
trained again with the updated feature set.
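One common wrapper method is recursive feature elimination (RFE), sketched here with scikit-learn on a synthetic dataset; all sizes and the choice of logistic regression are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Repeatedly fit the model and drop the weakest feature until 4 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the selected features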
2. Filter Methods
In the filter method, features are selected on the basis of statistical
measures. This method does not depend on the learning algorithm; it chooses
features as a pre-processing step.
The filter method removes irrelevant and redundant features from the model
by ranking them with different metrics.
The advantage of filter methods is that they need little computation time
and do not overfit the data.
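A minimal filter-method sketch: scikit-learn's SelectKBest ranks features by the ANOVA F-statistic, with no learning algorithm involved. The dataset is synthetic and the choice of k is an illustrative assumption.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Scores come purely from a statistical test, as a pre-processing step.
selector = SelectKBest(score_func=f_classif, k=4)
X_new = selector.fit_transform(X, y)
print(selector.scores_)  # per-feature F-scores used for the ranking
print(X_new.shape)       # (200, 4)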
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods
by considering interactions between features while keeping the computational
cost low. They are fast like filter methods but more accurate.
These methods are also iterative: the model evaluates each training
iteration and identifies the features that contribute most to that
iteration.
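A typical embedded method is L1-regularized (Lasso) regression, where training itself drives the coefficients of unhelpful features toward zero. A minimal sketch on synthetic data, with the alpha value as an illustrative assumption:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Selection happens as part of model training: keep features whose
# Lasso coefficients remain non-zero.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)
print(selector.get_support())  # mask of features with non-zero coefficients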
Exploratory Data Analysis (EDA)
• Outlier Detection: Identifying unusual values that deviate from other data
points. Outliers can distort statistical analyses and may indicate data
entry errors or genuinely unusual cases (see the sketch after this list).
• Summary Statistics: Calculating key statistics that provide insight into
data trends and nuances.
• Testing Assumptions: Many statistical tests and models assume the data
meet certain conditions (such as normality or homoscedasticity). EDA helps
verify these assumptions.
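A minimal EDA sketch with pandas, covering summary statistics and a simple interquartile-range (IQR) rule for flagging outliers; the small DataFrame is made up for illustration.

import pandas as pd

df = pd.DataFrame({'age': [22, 25, 27, 29, 31, 90],
                   'salary': [30, 32, 35, 36, 38, 40]})

print(df.describe())  # count, mean, std, min, quartiles, max per column

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)]
print(outliers)  # the age of 90 is flagged as an outlier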
Hyperparameter Tuning
• Batch Size: The number of training samples to work through before updating
the model's parameters.
• Tree Depth (in Decision Trees/Random Forests): Limits the depth of the
trees to prevent overfitting (a tuning sketch follows this list).
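A minimal tuning sketch with scikit-learn's GridSearchCV, searching over tree depth for a decision tree; the dataset and the candidate grid are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try each candidate max_depth with 5-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 5, None]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)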
Features
Rather than focusing on loading, manipulating, and summarizing data, the
scikit-learn library focuses on modeling the data. Some of the most popular
groups of models provided by scikit-learn include supervised learning
(classification and regression), unsupervised learning (clustering and
dimensionality reduction), model selection, and preprocessing.
Installation
If you have already installed NumPy and SciPy, the following are the two
easiest ways to install scikit-learn:
Using pip: The following command installs scikit-learn via pip:
pip install -U scikit-learn
Using conda: The following command installs scikit-learn via conda:
conda install scikit-learn
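Once installed, a minimal end-to-end modeling workflow looks like this; it uses the built-in Iris dataset and logistic regression as illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)     # every scikit-learn estimator exposes fit()
y_pred = model.predict(X_test)  # ...and predict() for new data
print(accuracy_score(y_test, y_pred))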
2.9. Exercises:
1. Which of the following is an example of a feature transformation technique?
[ ]
a) Principal Component Analysis (PCA)
b) Min-Max Scaling
c) One-Hot Encoding
d) Removing outliers
4. Feature subset selection is mainly used for which of the following purposes?
[ ]
a) Reducing the size of the feature set by removing irrelevant or
redundant features
b) Transforming categorical features into numerical features
c) Balancing the dataset with respect to class distribution
d) Creating new features based on existing ones
10. Hyperparameter tuning is only necessary for models with high computational
complexity.
True / False
Answer Key:
1. C   2. C   3. C   4. A   5. D   6. B   7. C   8. B   9. True   10. False