Unit - 3 Feature Engineering
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an individual
measurable property or characteristic of a data point that is used as an input for a machine
learning algorithm. Features can be numerical, categorical, or text-based, and they represent different
aspects of the data that are relevant to the problem at hand. For example:
• In a dataset of housing prices, features could include the number of bedrooms, the square footage, and the location.
• In a dataset of customer demographics, features could include age, gender, income level, and
occupation.
• The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of the resulting model.
• In other words, all machine learning algorithms take input data to generate output. The input data is typically in tabular form, consisting of rows (instances or observations) and columns (variables or attributes); these attributes are often known as features.
• For example, an image is an instance in computer vision, but a line in the image could be the feature.
Similarly, in NLP, a document can be an observation, and the word count could be the feature. So, we
can say a feature is an attribute that impacts a problem or is useful for the problem.
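For instance, a minimal Python sketch (with invented column names and values) of how such features look as a table:

```python
import pandas as pd

# A small, made-up housing table: each row is an instance (observation),
# each column is a feature describing that instance.
houses = pd.DataFrame({
    "bedrooms": [3, 2, 4],                        # numerical feature
    "square_feet": [1400, 950, 2100],             # numerical feature
    "neighborhood": ["north", "east", "north"],   # categorical feature
    "price": [250000, 180000, 390000],            # target variable
})

# Everything except the target column can serve as an input feature.
features = houses.drop(columns=["price"])
print(features.columns.tolist())   # ['bedrooms', 'square_feet', 'neighborhood']
```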
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that are suitable for
machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and efficient
machine learning models. It is a machine learning technique that leverages data to create new
variables that aren’t in the training set. It can produce new features for both supervised and
unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy. Feature engineering mainly consists of four processes: Feature Creation, Transformations,
Feature Extraction, and Feature Selection. These processes are described below:
• Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. The new features are created by mixing existing features using operations such as addition, subtraction, and ratios, and these new features offer great flexibility.
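As a minimal sketch of feature creation (the columns and values below are invented for illustration), new variables can be derived from existing ones with simple arithmetic such as ratios:

```python
import pandas as pd

# Invented housing data used only to illustrate feature creation.
df = pd.DataFrame({
    "price": [250000, 180000, 390000],
    "square_feet": [1400, 950, 2100],
    "bedrooms": [3, 2, 4],
})

# New features built by combining existing ones with simple arithmetic.
df["price_per_sqft"] = df["price"] / df["square_feet"]        # ratio of two features
df["sqft_per_bedroom"] = df["square_feet"] / df["bedrooms"]   # another ratio
print(df.head())
```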
• Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model is flexible enough to take a variety of data as input and that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within an acceptable range to avoid computational errors.
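The "same scale" point can be illustrated with two common scaling transformations; the sketch below uses scikit-learn (one possible library choice) on made-up predictor values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two invented predictors on very different scales (square feet vs. bedrooms).
X = np.array([[1400.0, 3.0],
              [950.0, 2.0],
              [2100.0, 4.0]])

# Standardization: each column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is squeezed into the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```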
• Feature Extraction: Feature extraction is an automated feature engineering process that generates new
variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so
that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis,
text analytics, edge detection algorithms, and principal components analysis (PCA).
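As one small example of automated feature extraction, the sketch below applies principal component analysis (PCA) with scikit-learn to the Iris dataset, compressing four raw features into two new variables (the choice of dataset and of two components is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 150 samples, 4 original features

# Extract 2 new variables (principal components) from the 4 raw features,
# reducing the volume of data while keeping most of its variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```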
• Feature Selection: While developing the machine learning model, only a few variables in the dataset are
useful for building the model, and the rest features are either redundant or irrelevant. If we input the dataset
with all these redundant and irrelevant features, it may negatively impact and reduce the overall performance
and accuracy of the model. Hence it is very important to identify and select the most appropriate features from
the data and remove the irrelevant or less important features, which is done with the help of feature selection
in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
Feature Selection is the process of selecting a subset of relevant features from the dataset to be
used in a machine-learning model. It is an important step in the feature engineering process as it
can have a significant impact on the model’s performance.
Selecting the best features helps the model perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and to do this we have a dataset containing the model of the car, the year, the owner's name, and the miles driven. In this dataset, the owner's name does not contribute to the model performance, as it does not decide whether the car should be crushed or not, so we can remove this column and select the rest of the features (columns) for model building.
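A minimal pandas sketch of this car example (all values are invented): the owner's name column is dropped because it cannot influence the crush decision:

```python
import pandas as pd

# Toy version of the car dataset described above (values are invented).
cars = pd.DataFrame({
    "model": ["sedan", "hatchback", "suv"],
    "year": [2005, 2012, 1998],
    "owner_name": ["A. Kumar", "B. Singh", "C. Patel"],
    "miles": [180000, 60000, 250000],
    "crush": [1, 0, 1],   # target: 1 = crush for spare parts
})

# The owner's name carries no information about the target, so it is
# removed; the remaining columns are kept as features.
X = cars.drop(columns=["owner_name", "crush"])
y = cars["crush"]
print(X.columns.tolist())   # ['model', 'year', 'miles']
```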
Benefits of Feature Selection:
1. Reduces Overfitting: By using only the most relevant features, the model can generalize better to new
data.
2. Improves Model Performance: Selecting the right features can improve the accuracy, precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less computation and storage
resources.
4. Improves Interpretability: By reducing the number of features, it is easier to understand and interpret the model.
There are mainly two types of Feature Selection techniques, which are:
• Supervised Feature Selection techniques consider the target variable. These methods are used for labeled data and help identify the relevant features for increasing the efficiency of supervised models such as classification and regression.
• Unsupervised Feature Selection techniques ignore the target variable and can be used for unlabeled datasets.
Feature selection can be performed for various reasons. Roughly, the supervised feature selection techniques can be divided into three parts:
1. Wrapper Method:-
In the wrapper methodology, feature selection is treated as a search problem in which different combinations of features are built, evaluated, and compared with one another. The algorithm is trained iteratively on a subset of features; on the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.
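One widely used wrapper-style technique is Recursive Feature Elimination (RFE); the sketch below (synthetic data, arbitrary parameter choices) repeatedly fits a model and discards the weakest feature until a chosen number of features remains:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of them informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursive Feature Elimination: train the model, drop the weakest
# feature, and repeat until only 3 features remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected
```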
2. Filter Method:-
In the filter method, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step. The filter method filters out irrelevant features and redundant columns by ranking them with different metrics. The advantage of filter methods is that they need low computational time and do not overfit the data.
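A minimal filter-method sketch using scikit-learn's SelectKBest: features are ranked by a univariate statistical score (here the ANOVA F-test; the dataset and k are arbitrary) and the top k are kept before any learning algorithm is trained:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Rank features with a statistical measure (ANOVA F-test) and keep the
# top 2; the selection itself does not train a learning algorithm.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)                     # per-feature scores
print(selector.get_support(indices=True))   # indices of the kept features
```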
3. Embedded Method:-
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast like the filter method but more accurate than it. These methods are also iterative: each training iteration is evaluated, and the features that contribute the most to that iteration's training are retained.
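A small embedded-method sketch (synthetic data, illustrative regularisation strength): an L1-regularised Lasso regression performs selection during training by shrinking the coefficients of unhelpful features toward zero, and SelectFromModel reads off the features that survive:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, only 4 of them informative.
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=4, noise=5.0, random_state=0)

# The L1 penalty drives the coefficients of unhelpful features to zero,
# so selection happens inside the training of the model itself.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

print(selector.estimator_.coef_)   # fitted coefficients
print(selector.get_support())      # features whose coefficients survived
```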
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature selection method will work properly for their model. The better we know the data types of the variables, the easier it is to choose the appropriate statistical measure for feature selection.
To know this, we need to first identify the type of input and output variables. In machine learning, variables are of mainly two types:
◦ Numerical Variables: Variables with continuous values such as integers and floats.
◦ Categorical Variables: Variables with categorical values such as Boolean, ordinal, and nominal variables.
Below are some univariate statistical measures, which can be used for filter-based feature selection:
Numerical input variables with a numerical output are used for predictive regression modelling. The common method to be used for this case is the correlation coefficient, for example Pearson's correlation coefficient (for linear correlation) or Spearman's rank coefficient (for non-linear, monotonic correlation).
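For instance, both coefficients can be computed with SciPy; the numbers below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical numerical input (square feet) and numerical output (price).
square_feet = np.array([950, 1400, 1600, 2100, 2400])
price = np.array([180, 250, 270, 390, 420])   # in thousands

r, _ = pearsonr(square_feet, price)       # linear correlation
rho, _ = spearmanr(square_feet, price)    # rank (monotonic) correlation

print(round(r, 3), round(rho, 3))
```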
A sequential feature selection algorithm adds or removes features from the dataset sequentially. Sometimes it evaluates each feature separately and selects M features out of N on the basis of their individual scores; this approach is called naive sequential feature selection. It rarely works well because it does not account for feature dependence.
• Mathematically, these algorithms are used to reduce the initial N features to M features, where M < N, and the M features are optimized for the performance of the model.
• This searching algorithm adds or removes the feature candidate from the candidate subset while evaluating
the objective function or criterion. Sequential searches follow only one direction: either it increases the
number of features in the subset or reduces the number of features in the candidate feature subset.
On the basis of the direction of movement, we can divide sequential searches into two variants: forward selection and backward elimination.
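Both variants are available, for example, in scikit-learn's SequentialFeatureSelector; the sketch below shows the forward and backward directions on the Iris dataset (the estimator and the number of selected features are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward variant: start from an empty subset and add, at each step, the
# feature that most improves the cross-validated score.
forward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction="forward").fit(X, y)

# Backward variant: start from all features and remove one at a time.
backward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support())
print(backward.get_support())
```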