Feature Engineering
• Feature engineering is the process of using domain knowledge to transform raw data into the features on which prediction is to be done.
• In Data Science, the performance of the model depends on data preprocessing and data handling.
• Suppose we build a model without handling the data and get an accuracy of around 70%.
• By applying feature engineering to the same model, there is a chance to increase the accuracy.
1.Imputation
• Replacing missing values with the most frequent value in a column is a good option for handling categorical columns.
• But if you think the values in the column are distributed uniformly and there is no dominant value, imputing a generic category like “Other” might be more sensible, because in such a case your imputation would otherwise converge to a random selection.
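A minimal pandas sketch of both options, using a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Pune', 'Delhi', None, 'Pune', None]})

# Option 1: impute with the most frequent value (the mode) of the column
df['city_mode'] = df['city'].fillna(df['city'].mode()[0])

# Option 2: impute with a generic category when no single value dominates
df['city_other'] = df['city'].fillna('Other')
```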
2.Handling Outliers
• The best way to detect outliers is to visualize the data.
Outlier Detection with Standard Deviation
• If a value's distance from the average is higher than x * standard deviation, it can be assumed to be an outlier. Then what should x be?
• There is no trivial solution for x, but usually a value between 2 and 4 seems practical, as in the sketch below.
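A minimal sketch of the rule on a hypothetical value column; here x = 2 is enough to flag the obvious outlier in the toy data:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 9, 120]})  # 120 is an obvious outlier

x = 2  # a factor between 2 and 4 is usually practical; 2 suffices for this toy data
mean, std = df['value'].mean(), df['value'].std()

# keep only the rows that lie within x standard deviations of the mean
df_no_outliers = df[(df['value'] > mean - x * std) & (df['value'] < mean + x * std)]
```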
• Another option for handling outliers is to cap them at a limit instead of dropping them.
• This way you keep your data size, and at the end of the day it might be better for the final model performance.
• On the other hand, capping can affect the distribution of the data, so it is better not to exaggerate it.
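A possible way to cap, sketched on the same hypothetical value column, is to clip values to a percentile range; the 1st and 99th percentiles are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 9, 120]})

# cap values outside the 1st-99th percentile range instead of dropping rows
lower_lim = df['value'].quantile(0.01)
upper_lim = df['value'].quantile(0.99)
df['value'] = df['value'].clip(lower=lower_lim, upper=upper_lim)
```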
3.Binning
• The main motivation of binning is to make the model more robust and prevent overfitting; however, it comes at a cost to performance.
• Every time you bin something, you sacrifice information and make your data more regularized.
• The trade-off between performance and overfitting is the key point of the binning process.
• For numerical columns, except for some obvious overfitting cases, binning might be redundant for some kinds of algorithms because of its effect on model performance.
• However, for categorical columns, labels with low frequencies probably affect the robustness of statistical models negatively.
• Thus, assigning a general category to these less frequent values helps to keep the model robust.
• For example, if your data size is 100,000 rows, it might be a good option to unite the labels with a count of less than 100 into a new category like “Other”, as in the sketch below.
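A sketch of both kinds of binning, with hypothetical age and country columns and illustrative bin edges and thresholds:

```python
import pandas as pd

df = pd.DataFrame({'age': [7, 25, 43, 61, 89],
                   'country': ['India', 'India', 'USA', 'Fiji', 'India']})

# numerical binning: fixed value ranges become labelled bins
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 40, 65, 120],
                       labels=['child', 'young', 'mid', 'old'])

# categorical binning: unite rare labels into an "Other" category
counts = df['country'].value_counts()
rare_labels = counts[counts < 2].index      # a threshold of 2 fits this toy data
df['country_bin'] = df['country'].replace(rare_labels, 'Other')
```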
4.Log Transform
• Logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering.
A critical note: the data you apply a log transform to must contain only positive values; otherwise you will receive an error. You can add 1 to your data before transforming it to handle zeros; for negative values, shift the data first, as in the sketch below.
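A sketch on a hypothetical value column that includes negatives; shifting by the minimum plus 1 makes every input positive before the log:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

# shift the data so its minimum becomes 1, then apply the log transform;
# without the shift, np.log would fail on the negative values
df['log'] = np.log(df['value'] - df['value'].min() + 1)
```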
5.One-hot encoding
• One-hot encoding is one of the most common encoding methods.
• This method spreads the values in a column to multiple flag columns and assigns 0 or 1 to them.
• These binary values express the relationship between the grouped and encoded columns.
• This method changes your categorical data, which is challenging for algorithms to understand, into a numerical format, and enables you to group your categorical data without losing any information.
• Why One-Hot?
• If you have N distinct values in the column, it is enough to map them to N-1 binary columns, because the missing category can be deduced from the other columns.
• If all of the N-1 columns are equal to 0, the row must belong to the remaining category, as in the sketch below.
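A sketch using pandas get_dummies on a hypothetical city column; drop_first=True keeps the N-1 columns described above:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Mumbai', 'Pune']})

# one flag column per distinct value; drop_first=True maps N values to N-1 columns,
# since the dropped category is implied when all the others are 0
encoded = pd.get_dummies(df['city'], prefix='city', drop_first=True, dtype=int)
df = pd.concat([df, encoded], axis=1)
```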
6.Grouping Operations
• In most machine learning algorithms, every instance is represented by a row in the training dataset, where every column shows a different feature of the instance.
• Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
• Datasets such as transactions rarely fit the definition of tidy data above, because an instance spans multiple rows.
• In such a case, we group the data by the instances, so that every instance is represented by only one row.
• The key point of group-by operations is to decide the aggregation functions of the features.
• For numerical features, average and sum functions are usually convenient options.
Categorical Column Grouping
• The first option is to select the label with the highest frequency. In other words, this is the max operation for categorical columns, but ordinary max functions generally do not return this value; you need to use a lambda function for this purpose.
• For example, if you want to obtain ratio columns, you can use the average of binary columns.
• In the same example, the sum function can be used to obtain the total count as well; both appear in the sketch below.
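A sketch over a hypothetical transactions table grouped by user; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 2, 2],
                   'amount': [10.0, 20.0, 5.0, 5.0, 8.0],
                   'device': ['mobile', 'mobile', 'web', 'mobile', 'web'],
                   'is_fraud': [0, 1, 0, 0, 1]})

agg = df.groupby('user').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    # ordinary max would return the alphabetically last label,
    # so a lambda over value_counts picks the most frequent one
    top_device=('device', lambda s: s.value_counts().index[0]),
    fraud_ratio=('is_fraud', 'mean'),   # average of a binary column gives a ratio
    fraud_count=('is_fraud', 'sum'),    # sum gives the total count
)
```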
7.Feature Split
• Splitting features is a good way to make them useful in terms of machine learning.
• Most of the time, the dataset contains string columns that violate tidy data principles.
• The split function is a good option; however, there is no single way of splitting features. How to split a column depends on its characteristics, as in the basic example below.
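A basic sketch, splitting a hypothetical name column into first and last names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Luther N. Gonzalez', 'Charles M. Young', 'Terry Lawson']})

# str.split returns a list per row; .str[i] then picks one element of that list
df['first_name'] = df['name'].str.split(' ').str[0]
df['last_name'] = df['name'].str.split(' ').str[-1]
```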
• Another case for the split function is to extract a string part between two characters. The following example shows an implementation of this case by using two split functions in a row.
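A sketch of that idea, pulling the year out of a hypothetical title column by splitting on '(' and then on ')':

```python
import pandas as pd

df = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)']})

# the first split keeps everything after '(', the second drops everything after ')'
df['year'] = df['title'].str.split('(', n=1).str[1].str.split(')', n=1).str[0]
```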
8.Scaling
• In most cases, the numerical features of the dataset do not have a fixed range, and they differ from each other.
• In real life, it is nonsense to expect the age and income columns to have the same range.
• But from a machine learning point of view, how can these two columns be compared?
• The continuous features become identical in terms of range after a scaling process.
• This process is not mandatory for many algorithms, but it might still be nice to apply.
• However, algorithms based on distance calculations, such as k-NN or k-Means, need scaled continuous features as model input, as in the sketch below.
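Two common approaches are min-max normalization, which rescales each column to the [0, 1] range, and standardization (z-score), which gives each column zero mean and unit variance. A sketch with hypothetical age and income columns:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 58], 'income': [18000, 52000, 31000]})

# min-max normalization: (x - min) / (max - min) maps each column to [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# standardization (z-score): (x - mean) / std gives zero mean, unit variance
standardized = (df - df.mean()) / df.std()
```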
Feature Engineering of Time Series Data
Time series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms.
The task before us is to create or invent new input features from our time series dataset.
The goal of feature engineering is to provide strong and ideally simple relationships
between new input features and the output feature for the supervised learning algorithm
to model.
For clarity, we will focus on a univariate (one variable) time series dataset.
This dataset describes the minimum daily temperatures over 10 years (1981-1990) in
Melbourne, Australia.
The units are in degrees Celsius and there are 3,650 observations. The source of the data is
credited as the Australian Bureau of Meteorology.
Date Time Features
The supervised learning problem we are proposing is to predict the daily minimum
temperature given the month and day, as follows:
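A sketch of building those features with pandas, assuming a local CSV of the Melbourne dataset with a date index and a single temperature column (the file name is illustrative):

```python
import pandas as pd

series = pd.read_csv('daily-minimum-temperatures.csv', header=0,
                     index_col=0, parse_dates=True).squeeze('columns')

# month and day become the input features; temperature is the output
features = pd.DataFrame({'month': series.index.month,
                         'day': series.index.day,
                         'temperature': series.values})
```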
Lag Features
The simplest approach is to predict the value at the next time (t+1) given the value at
the previous time (t-1).
The Pandas library provides the shift() function to help create these shifted or lag features
from a time series dataset.
Shifting the dataset by 1 creates the t-1 column, adding a NaN (unknown) value for the first
row. The time series dataset without a shift represents the t+1.
You can see that we would have to discard the first row to use the dataset to train a supervised learning model, as in the sketch below.
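A sketch of the shift, with a few made-up temperature values standing in for the full series:

```python
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])  # stand-in for the real series

# shift(1) pushes every value down one row, creating the t-1 input column
frame = pd.concat([temps.shift(1), temps], axis=1)
frame.columns = ['t-1', 't+1']
print(frame)  # the first row holds NaN in t-1 and must be discarded for training
```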
Lag Features
The addition of lag features is called the sliding window method, in this case with a window width of 1. It is as though we are sliding our focus along the time series for each observation, with an interest in only what is within the window width.
We can expand the window width and include more lagged features, as shown below.
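For example, a window width of 3 gives three lagged input columns; a sketch with the same stand-in series:

```python
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# three shifts create t-3, t-2 and t-1 as inputs for predicting t+1
frame = pd.concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
frame.columns = ['t-3', 't-2', 't-1', 't+1']
```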
Rolling Window Statistics
A step beyond adding raw lagged values is to add a summary of the values at previous time steps.
We can calculate summary statistics across the values in the sliding window and include
these as features in our dataset. Perhaps the most useful is the mean of the previous few
values, also called the rolling mean.
Pandas provides a rolling() function that creates a new data structure with the window of
values at each time step. We can then perform statistical functions on the window of
values collected for each time step, such as calculating the mean.
First, the series must be shifted. Then the rolling dataset can be created and the mean
values calculated on each window of two values.
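A sketch of that two-step recipe with the stand-in series:

```python
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# shift first so the window only sees past values, then average windows of 2
means = temps.shift(1).rolling(window=2).mean()
frame = pd.concat([means, temps], axis=1)
frame.columns = ['mean(t-2,t-1)', 't+1']
```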
Expanding Window Statistics
Another type of window that may be useful includes all previous data in the series.
This is called an expanding window and can help with keeping track of the bounds of
observable data.
Like the rolling() function on DataFrame, Pandas provides an expanding() function that
collects sets of all prior values for each time step.
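A sketch with the stand-in series; expanding() accumulates all prior values at each step, and shift(-1) aligns the next value as the prediction target:

```python
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# min, mean and max over all values seen so far, paired with the next value
window = temps.expanding()
frame = pd.concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
frame.columns = ['min', 'mean', 'max', 't+1']
```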