Unit - 3 Feature Engineering


What is a Feature?

In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as an input for a machine learning algorithm. Features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem at hand. For example:

• In a dataset of housing prices, features could include the number of bedrooms, the square footage, the location, and the age of the property.

• In a dataset of customer demographics, features could include age, gender, income level, and occupation.
• The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of the model.

• In other words, all machine learning algorithms take input data to generate output. The input data is usually in tabular form, consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features.

• For example, an image is an instance in computer vision, but a line in the image could be the feature. Similarly, in NLP, a document can be an observation, and the word count could be the feature. So we can say that a feature is an attribute that impacts a problem or is useful for the problem.
What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that are suitable for machine learning models. In other words, it is the process of selecting, extracting, and transforming the most relevant features from the available data to build more accurate and efficient machine learning models. It is a machine learning technique that leverages data to create new variables that aren’t in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.


Regardless of the data or architecture, a terrible feature will have a direct impact on your
model. Feature engineering, in simple terms, is the act of converting raw observations into
desired features using statistical or machine learning approaches. It helps to represent the
underlying problem to predictive models in a better way, which in turn improves the
accuracy of the model on unseen data. The predictive model contains predictor variables and
an outcome variable, and the feature engineering process selects the most useful predictor
variables for the model.
Feature engineering in ML mainly consists of four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection. These processes are described below, followed by a short combined code sketch.

• Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using operations such as addition, subtraction, and ratios, and these new features offer great flexibility.

• Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model is flexible enough to accept a variety of input data and that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within an acceptable range, avoiding computational errors.
• Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).

• Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building the model; the rest are either redundant or irrelevant. If we feed the model a dataset with all these redundant and irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
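The four processes can be illustrated together on a small tabular dataset. The sketch below is a minimal example using pandas and scikit-learn; the dataset and column names (sqft, rooms, price, and so on) are hypothetical and only show where each step fits.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical housing data (values are made up for illustration)
df = pd.DataFrame({
    "sqft":     [850, 1200, 1500, 2000, 950],
    "rooms":    [2, 3, 3, 4, 2],
    "age":      [30, 12, 8, 3, 25],
    "lot_size": [3000, 4500, 5200, 7000, 3100],
    "price":    [150, 220, 260, 340, 170],
})

# 1. Feature creation: combine existing columns into a new ratio feature
df["sqft_per_room"] = df["sqft"] / df["rooms"]

# 2. Transformation: bring all predictors onto the same scale
X = df.drop(columns="price")
y = df["price"]
X_scaled = StandardScaler().fit_transform(X)

# 3. Feature extraction: compress the scaled predictors into 2 components
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# 4. Feature selection: keep the 3 predictors most related to price
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X_scaled, y)
print(X.columns[selector.get_support()])
```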


Feature Selection

Feature Selection is the process of selecting a subset of relevant features from the dataset to be
used in a machine learning model. It is an important step in the feature engineering process, as it
can have a significant impact on the model's performance.
Selecting the best features helps the model to perform well. For example, suppose we want to create a
model that automatically decides which car should be crushed for spare parts, and to do this we have
a dataset. This dataset contains the Model of the car, Year, Owner's name, and Miles. In this dataset, the
name of the owner does not contribute to the model's performance, as it does not determine whether the car should
be crushed or not, so we can remove this column and select the rest of the features (columns) for
model building, as in the sketch below.
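A minimal sketch of this idea with pandas; the records are made up and only mirror the example above.

```python
import pandas as pd

# Made-up car records matching the example above
cars = pd.DataFrame({
    "Model": ["Alto", "Swift", "City", "Nano"],
    "Year": [2004, 2011, 2009, 2015],
    "Owner's name": ["A. Rao", "B. Singh", "C. Das", "D. Khan"],
    "Miles": [180000, 90000, 120000, 40000],
})

# The owner's name carries no signal for the crush/keep decision,
# so drop it before building the model
features = cars.drop(columns=["Owner's name"])
print(features.columns.tolist())   # ['Model', 'Year', 'Miles']
```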
Benefits of Feature Selection:

1. Reduces Overfitting: By using only the most relevant features, the model can generalize better to new data.

2. Improves Model Performance: Selecting the right features can improve the accuracy, precision, and recall of the model.

3. Decreases Computational Costs: A smaller number of features requires less computation and storage resources.

4. Improves Interpretability: By reducing the number of features, it is easier to understand and interpret the results of the model.


Feature Selection Techniques

There are mainly two types of Feature Selection techniques:

1. Supervised Feature Selection technique

Supervised feature selection techniques consider the target variable. These methods are used for labeled data and identify the relevant features for increasing the efficiency of supervised models, such as classification and regression.

2. Unsupervised Feature Selection technique

Unsupervised feature selection techniques ignore the target variable and can be used for unlabeled datasets.
There can be various reasons to perform feature selection:

• Simplification of the model.

• Less computational time.

• Avoiding the curse of dimensionality.

• Improving the compatibility of data with models.

There are mainly three techniques under supervised feature selection:

1. Wrapper Method:-
In the wrapper methodology, the selection of features is treated as a search problem, in
which different combinations of features are made, evaluated, and compared with other combinations. The
algorithm is trained iteratively using subsets of features. On the basis of the output of the
model, features are added or removed, and the model is trained again with the new feature set.
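As a sketch of the wrapper idea, scikit-learn's RFE (recursive feature elimination) repeatedly fits an estimator and removes the weakest feature until the requested number remains. The dataset here is synthetic and the parameter choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which only 4 are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Wrapper-style search: fit the model, drop the weakest feature, repeat
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected, larger values were eliminated earlier
```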

2. Filter Method:-
In the filter method, features are selected on the basis of statistical measures. This method does not
depend on the learning algorithm and chooses the features as a pre-processing step. The filter method
filters out irrelevant features and redundant columns by ranking them with different metrics.
The advantages of filter methods are that they need little computational time and
do not overfit the data.
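A minimal filter-method sketch using scikit-learn: features are scored with a statistical test (the ANOVA F-test here) before any model is trained, and the top-k are kept. The iris dataset is used only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature against the target with the ANOVA F-test,
# independent of any learning algorithm, and keep the best 2
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature F statistics
print(X_new.shape)        # (150, 2)
```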

3. Embedded Method:-
Embedded methods combine the advantages of both filter and wrapper methods by considering the
interaction of features along with low computational cost. They are fast like the filter
method but more accurate. These methods are also iterative: each iteration is evaluated,
and the features that contribute the most to training in that iteration are identified.
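An embedded-method sketch: L1-regularised (Lasso) regression performs selection as part of model fitting, because the penalty drives the coefficients of uninformative features to zero. The data are synthetic and the alpha value is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic regression data with only 3 truly informative features
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# Selection happens inside the training procedure: the L1 penalty
# pushes the coefficients of irrelevant features to exactly zero
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)

print(selector.estimator_.coef_)   # zero coefficients = dropped features
print(selector.get_support())      # boolean mask of retained features
```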
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature selection method will
work properly for their model. The more we know about the data types of the variables, the easier it is to choose the
appropriate statistical measure for feature selection.
To know this, we first need to identify the types of the input and output variables. In machine learning, variables are of mainly two types:

◦ Numerical Variables: Variables with continuous values, such as integers and floats.

◦ Categorical Variables: Variables with categorical values, such as Boolean, ordinal, and nominal.

Below are some univariate statistical measures which can be used for filter-based feature selection:

1. Numerical Input, Numerical Output:

Numerical input variables with a numerical output correspond to predictive regression modelling. The common measures used for such a case are correlation coefficients:

◦ Pearson's correlation coefficient (for linear correlation).

◦ Spearman's rank coefficient (for non-linear correlation).


2. Numerical Input, Categorical Output:
Numerical input with categorical output is the case for classification predictive modelling problems. In this
case, too, correlation-based techniques should be used, but with a categorical output.
◦ ANOVA correlation coefficient (linear).
◦ Kendall's rank coefficient (non-linear).
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input. It is a less common kind of
regression problem. We can use the same measures as in the case above, but with the roles of input and output reversed.
4. Categorical Input, Categorical Output:
This is a case of classification predictive modelling with categorical input variables.
The commonly used technique for such a case is the Chi-Squared test. We can also use information gain
(mutual information) in this case.
We can summarise the above cases with appropriate measures in the table below:

Input Variable    Output Variable    Feature Selection Technique
Numerical         Numerical          ◦ Pearson's correlation coefficient (linear)  ◦ Spearman's rank coefficient (non-linear)
Numerical         Categorical        ◦ ANOVA correlation coefficient (linear)  ◦ Kendall's rank coefficient (non-linear)
Categorical       Numerical          ◦ ANOVA correlation coefficient (linear)  ◦ Kendall's rank coefficient (non-linear)
Categorical       Categorical        ◦ Chi-Squared test (contingency tables)  ◦ Mutual Information
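These cases map fairly directly onto scikit-learn's filter score functions. The helper below is a hypothetical convenience, not part of any library; it only shows which score function fits which combination of variable types.

```python
from sklearn.feature_selection import (
    SelectKBest, f_regression, f_classif, chi2, mutual_info_classif)
from sklearn.datasets import load_iris

def pick_score_func(input_type, output_type):
    """Map the (input, output) variable types from the table above to a
    scikit-learn score function for filter-based selection."""
    if input_type == "numerical" and output_type == "numerical":
        return f_regression          # correlation-style linear F-test
    if input_type == "numerical" and output_type == "categorical":
        return f_classif             # ANOVA F-test
    if input_type == "categorical" and output_type == "categorical":
        return chi2                  # needs non-negative encoded inputs
    return mutual_info_classif       # fallback: mutual information

# Example: numerical features, categorical target (classification)
X, y = load_iris(return_X_y=True)
selector = SelectKBest(pick_score_func("numerical", "categorical"), k=2)
print(selector.fit_transform(X, y).shape)   # (150, 2)
```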
Sequential Feature Selection Algorithms

• Sequential feature selection algorithms are basically part of the wrapper methods, where features are added to or removed from the dataset sequentially. Sometimes each feature is evaluated separately and M features are selected from the N features on the basis of their individual scores; this method is called naive sequential feature selection. It rarely works well because it does not account for feature dependence.

• Mathematically, these algorithms reduce an initial set of N features to M features, where M < N, and the M features are optimized for the performance of the model.

• The search algorithm adds or removes feature candidates from the candidate subset while evaluating the objective function or criterion. Sequential searches follow only one direction: they either increase the number of features in the subset or reduce the number of features in the candidate feature subset.
On the basis of the direction of movement, we can divide them into the following variants:

Sequential Forward Selection (SFS)
Sequential Backward Selection (SBS)
Bidirectional Feature Selection (BFS)
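A minimal sketch of the forward and backward variants using scikit-learn's SequentialFeatureSelector on synthetic data; the estimator and parameter choices are illustrative only. Bidirectional and floating variants are not built into scikit-learn and are typically found in third-party packages such as mlxtend.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

# SFS: start from an empty set and greedily add the feature that most
# improves cross-validated accuracy, until 3 features are selected
sfs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction="forward").fit(X, y)

# SBS: start from all features and greedily remove the least useful one
sbs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction="backward").fit(X, y)

print(sfs.get_support())   # features chosen by forward selection
print(sbs.get_support())   # features chosen by backward selection
```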
