Module 3
Syllabus:
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention.
Feature Generation (brainstorming, role of domain expertise, and place for
imagination), Feature Selection algorithms. Filters; Wrappers; Decision Trees;
Random Forests. Recommendation Systems: Building a User-Facing Data Product,
Algorithmic ingredients of a Recommendation Engine, Dimensionality Reduction,
Singular Value Decomposition, Principal Component Analysis, Exercise: build your
own recommendation system.
Feature Selection
Feature Selection refers to the process of selecting the most relevant features (or
variables) from the dataset to use in building a predictive model. The goal is to
improve the performance of the model by eliminating irrelevant or redundant features,
which can lead to better accuracy, reduced overfitting, and more efficient computation.
Feature selection offers several benefits:
It simplifies the model (data reduction, less storage, Occam’s razor, better visualization).
It reduces training time.
It avoids over-fitting.
It improves the accuracy of the model.
It avoids the curse of dimensionality.
Methods
Feature selection methods can be grouped into three categories: filter method,
wrapper method and embedded method.
Filter Methods: These involve statistical techniques to evaluate the relevance of each
feature individually based on its relationship with the target variable. Examples
include correlation coefficients, Chi-square tests, and mutual information.
A subset of features is selected based on their relationship to the target variable. The
selection does not depend on any machine learning algorithm; instead, filter methods
measure the “relevance” of the features with the output via statistical tests. The
following table is a common reference for choosing a test, based on whether the feature
and the response variable are continuous or categorical:

Feature \ Response     Continuous                Categorical
Continuous             Pearson’s Correlation     LDA
Categorical            ANOVA                     Chi-Square
Pearson’s Correlation
A statistic that measures the linear correlation between two variables, which are both
continuous. It varies from -1 to +1, where +1 corresponds to positive linear
correlation, 0 to no linear correlation, and −1 to negative linear correlation.
Pearson’s r:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
LDA
Linear Discriminant Analysis is a supervised linear algorithm that projects the data
into a smaller subspace of k dimensions (k ≤ C − 1, where C is the number of classes)
while maximising the separation between the classes. More specifically, the model
finds linear combinations of the features that achieve maximum separability between
the classes and minimum variance within each class.
ANOVA
ANOVA (Analysis of Variance) is similar to LDA except that it operates on one or
more categorical independent features and one continuous dependent feature. It tests
whether the means of several groups are equal, which indicates whether the
categorical feature carries information about the continuous response.
Chi-Square
The Chi-square test checks whether the occurrences of a specific feature and a specific
class are independent, using their frequency distribution. The null hypothesis is that the
two variables are independent; large values of χ² indicate that the null hypothesis
should be rejected. When selecting features, we wish to keep those that are highly
dependent on the output.
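As a quick illustration of a filter method, here is a minimal sketch (not part of the original notes) that scores features with the chi-square test and keeps the two most relevant ones, using scikit-learn’s SelectKBest; the Iris dataset is used purely because its features are non-negative, as the chi-square test requires.

```python
# Filter-method feature selection with a chi-square test (scikit-learn).
# The Iris dataset is only illustrative.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # 4 non-negative features, 3 classes

# Score each feature against the target and keep the 2 most dependent ones.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi-square scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_selected.shape)  # (150, 2)
```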
Wrapper methods
Wrapper methods evaluate subsets of features by actually training a model on them and
using the model’s performance as the selection criterion, so they are tied to a specific
learning algorithm. Two common strategies are described below, with a sketch after
this list.
Forward selection: start with no features and, at each step, add the feature that most
improves model performance, until no further improvement is obtained.
Backward elimination: start with all features and, at each step, remove the least
significant feature, until removing any further feature hurts performance.
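Both wrapper strategies can be sketched with scikit-learn’s SequentialFeatureSelector (available in recent versions); the dataset, the logistic-regression estimator, and the choice of five features are illustrative assumptions, not part of the original notes.

```python
# Wrapper-method selection: forward selection and backward elimination.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)        # put features on a comparable scale
model = LogisticRegression(max_iter=1000)

# Forward selection: start empty, greedily add the feature that improves CV score most.
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)

# Backward elimination: start with all features, greedily drop the least useful one.
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print("forward selection keeps   :", forward.get_support(indices=True))
print("backward elimination keeps:", backward.get_support(indices=True))
```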
Embedded Methods: These involve algorithms that perform feature selection during
the model training process. Regularization methods such as LASSO (Least Absolute
Shrinkage and Selection Operator), which can shrink some coefficients exactly to zero
and thereby drop the corresponding features, are typical examples; Ridge Regression
also shrinks coefficients but does not set them to zero, so it regularizes without
removing features.
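As a sketch of embedded selection, the snippet below fits a LASSO model and keeps the features whose coefficients remain non-zero; the dataset and the alpha value are arbitrary choices for illustration.

```python
# Embedded selection: L1 regularization (LASSO) drives some coefficients to
# exactly zero, so selection happens while the model is trained.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)       # regularization assumes comparable scales

lasso = Lasso(alpha=0.5).fit(X, y)          # larger alpha -> sparser model
kept = np.flatnonzero(lasso.coef_)          # indices of features that survived

print("coefficients:", np.round(lasso.coef_, 2))
print("features kept by LASSO:", kept)
```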
Relevance: The feature should have a significant relationship with the target variable.
Redundancy: The feature should provide unique information that isn’t already
provided by another feature.
Interpretability: Selected features should make sense in the context of the domain
knowledge.
Data Cleaning and Preprocessing: Ensure data quality before selecting features.
Initial Feature Selection: Use domain knowledge and exploratory data analysis
(EDA) to identify potential features.
Model-Based Selection: Apply algorithms and techniques to evaluate and select the
best subset of features.
Validation: Test the selected features on validation data to ensure they generalize
well.
Challenges and Considerations: There are challenges such as dealing with high-
dimensional data, multicollinearity among features, and the trade-off between model
complexity and performance. Iterative experimentation and validation are therefore
essential.
Feature selection is a fundamental step in the data science process that involves
choosing the most informative and relevant variables to include in a model, with the
aim of enhancing model performance and interpretability.
Decision Trees
A decision tree is a hierarchical model used in decision support that depicts decisions
and their potential outcomes, incorporating chance events, resource costs, and utility.
It is a non-parametric, supervised learning method built from conditional control
statements, and it is useful for both classification and regression tasks. The structure
is comprised of a root node, branches, internal nodes, and leaf nodes, forming a
hierarchical, tree-like structure.
It is a tool with applications spanning several different areas. Decision trees can be
used for classification as well as regression problems. The name itself suggests that it
uses a flowchart-like tree structure to show the predictions that result from a series
of feature-based splits. It starts with a root node and ends with a decision made by
the leaves.
In the example below, the tree first asks about the weather: is it sunny, cloudy, or
rainy? Depending on the answer, it moves on to the next feature, such as humidity or
wind. For instance, if it is rainy, the tree checks whether the wind is strong or weak;
if the wind is weak, the person may go out and play.
https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/
https://medium.com/geekculture/step-by-step-decision-tree-id3-algorithm-from-scratch-in-python-no-fancy-library-4822bbfdd88f
Problem Definition:
Build a decision tree using ID3 algorithm for the given training data in the table (Buy
Computer data), and predict the class of the following new example: age<=30,
income=medium, student=yes, credit-rating=fair
Solution:
First, check which attribute provides the highest Information Gain in order to split the
training set on that attribute. We need to calculate the expected information needed to
classify the whole set and the weighted entropy of each attribute.
The training set contains 14 records: 9 yes and 5 no.
Entropy(S) = I(9,5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.940
The information gain of an attribute is this expected information minus the weighted
entropy of the attribute: Gain(A) = Entropy(S) − Σv (|Sv|/|S|) · Entropy(Sv).
For Age, we have three values: age<=30 (2 yes and 3 no), age31..40 (4 yes and 0 no),
and age>40 (3 yes and 2 no).
Entropy(age) = 5/14 · I(2,3) + 4/14 · I(4,0) + 5/14 · I(3,2) = 5/14(0.971) + 0 + 5/14(0.971) = 0.694
Gain(age) = 0.940 − 0.694 = 0.246
For Income, we have three values: incomehigh (2 yes and 2 no), incomemedium (4 yes
and 2 no), and incomelow (3 yes and 1 no).
Entropy(income) = 4/14 · I(2,2) + 6/14 · I(4,2) + 4/14 · I(3,1) = 4/14(1) + 6/14(0.918) + 4/14(0.811) = 0.911
Gain(income) = 0.940 − 0.911 = 0.029
Next, consider the Student attribute. For Student, we have two values: studentyes (6 yes
and 1 no) and studentno (3 yes and 4 no).
Entropy(student) = 7/14 · I(6,1) + 7/14 · I(3,4) = 7/14(0.5916) + 7/14(0.9852) = 0.788
Gain(student) = 0.940 − 0.788 = 0.152
Finally, consider the Credit_Rating attribute. For Credit_Rating, we have two values:
credit_ratingfair (6 yes and 2 no) and credit_ratingexcellent (3 yes and 3 no).
Entropy(credit_rating) = 8/14 · I(6,2) + 6/14 · I(3,3) = 8/14(0.8112) + 6/14(1) = 0.892
Gain(credit_rating) = 0.940 − 0.892 = 0.048
Since Age has the highest Information Gain, we start splitting the dataset using the
age attribute.
The same process of splitting has to happen for the two remaining branches.
Left sub-branch
For branch age<=30, with 2 yes and 3 no, the entropy is Entropy(Sage<=30) = I(2,3) = 0.971.
We still have the attributes income, student, and credit_rating. Which one should be
used to split this partition?
For Income, we have three values: incomehigh (0 yes and 2 no), incomemedium (1
yes and 1 no) and incomelow (1 yes and 0 no).
Entropy(income) = 2/5 · I(0,2) + 2/5 · I(1,1) + 1/5 · I(1,0) = 0 + 2/5(1) + 0 = 0.4
Gain(income) = 0.971 − 0.4 = 0.571
For Student, we have two values: studentyes (2 yes and 0 no) and studentno (0 yes and
3 no). Both partitions are pure, so Entropy(student) = 0 and Gain(student) = 0.971 − 0 = 0.971.
We can then safely split on the attribute student without checking the other attributes,
since its information gain equals the entropy of the branch, the maximum possible value.
Right sub-branch
For branch age>40, with 3 yes and 2 no, the entropy is Entropy(Sage>40) = I(3,2) =
−3/5 log2(3/5) − 2/5 log2(2/5) = 0.97.
For Income, we have two values incomemedium (2 yes and 1 no) and incomelow (1
yes and 1 no)
For Student, we have two values studentyes (2 yes and 1 no) and studentno (1 yes
and 1 no)
For Credit_Rating, we have two values credit_ratingfair (3 yes and 0 no) and
credit_ratingexcellent (0 yes and 2 no)
Entropy(credit_rating) = 0
Gain(credit_rating) = 0.97 – 0 = 0.97
We then split based on credit_rating. These splits give partitions that each contain
records from a single class, so we simply make them leaf nodes with their class label
attached.
Finally, we classify the new example (age<=30, income=medium, student=yes,
credit_rating=fair): it follows the age<=30 branch and then the student=yes leaf, so the
prediction is Buys_computer = yes.
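For readers who want to verify the numbers above, here is a small Python sketch that recomputes the entropies and information gains from the (yes, no) counts in the table; the helper function names are ours, not part of the original notes.

```python
# Reproducing the ID3 calculations for the Buy Computer example.
# The (yes, no) counts per attribute value are taken from the table above.
from math import log2

def entropy(yes, no):
    """Entropy I(yes, no) of a partition with the given class counts."""
    total = yes + no
    e = 0.0
    for c in (yes, no):
        if c:
            p = c / total
            e -= p * log2(p)
    return e

def gain(partitions, parent=(9, 5)):
    """Information gain of splitting the parent set into the given partitions."""
    n = sum(y + no for y, no in partitions)
    remainder = sum((y + no) / n * entropy(y, no) for y, no in partitions)
    return entropy(*parent) - remainder

print("Gain(age)           =", round(gain([(2, 3), (4, 0), (3, 2)]), 3))  # ~0.246
print("Gain(income)        =", round(gain([(2, 2), (4, 2), (3, 1)]), 3))  # ~0.029
print("Gain(student)       =", round(gain([(6, 1), (3, 4)]), 3))          # ~0.152
print("Gain(credit_rating) =", round(gain([(6, 2), (3, 3)]), 3))          # ~0.048
```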
Random Forest
Put simply: random forest builds multiple decision trees and merges them together to
get a more accurate and stable prediction.
Step 1: Select K random data points from the training set (a bootstrap sample of the data).
Step 2: Build the decision tree associated with the selected data points (subset).
Step 3: Choose the number N of decision trees that you want to build.
Step 4: Repeat Steps 1 and 2 N times.
Step 5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority of votes.
A short scikit-learn sketch of this procedure appears at the end of this section.
How Does Random Forest Work?
The Random Forest algorithm works in several steps, which are discussed below:
Random Feature Selection: To ensure that each decision tree in the ensemble brings a
unique perspective, Random Forest employs random feature selection. During the
training of each tree, a random subset of features is chosen. This randomness ensures
that each tree focuses on different aspects of the data, fostering a diverse set of
predictors within the ensemble.
Decision Making and Voting: When it comes to making predictions, each decision
tree in the Random Forest casts its vote. For classification tasks, the final prediction is
determined by the mode (most frequent prediction) across all the trees. In regression
tasks, the average of the individual tree predictions is taken. This internal voting
mechanism ensures a balanced and collective decision-making process.
Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it
like a seasoned explorer with a team of helpers (decision trees). Each helper takes on
a part of the dataset, ensuring that the expedition is not only thorough but also
surprisingly quick.
Variable Importance Assessment: Think of Random Forest as a detective at a crime
scene, figuring out which clues (features) matter the most. It assesses the importance
of each clue in solving the case, helping you focus on the key elements that drive
predictions.
Built-in Cross-Validation: Random Forest is like having a personal coach that keeps
you in check. As it trains each decision tree, it also sets aside a secret group of cases
(out-of-bag) for testing. This built-in validation ensures your model doesn’t just ace
the training but also performs well on new challenges.
Handling Missing Values: Life is full of uncertainties, just like datasets with missing
values. Random Forest is the friend who adapts to the situation, making predictions
using the information available. It doesn’t get flustered by missing pieces; instead, it
focuses on what it can confidently tell us.
Parallelization for Speed: Random Forest is your time-saving buddy. Picture each
decision tree as a worker tackling a piece of a puzzle simultaneously. This parallel
approach taps into the power of modern tech, making the whole process faster and
more efficient for handling large-scale projects.
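A minimal Random Forest sketch with scikit-learn is shown below; the dataset and hyperparameter values are illustrative assumptions, but the options mirror the ideas above (number of trees, random feature subsets, out-of-bag validation, feature importances).

```python
# A minimal random forest sketch with scikit-learn; dataset and hyperparameters
# are illustrative, not part of the original notes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees (the "N" in the steps above)
    max_features="sqrt",    # random subset of features considered at each split
    oob_score=True,         # built-in out-of-bag validation
    random_state=42,
).fit(X_train, y_train)

print("out-of-bag score:", round(forest.oob_score_, 3))
print("test accuracy   :", round(forest.score(X_test, y_test), 3))
print("top 5 features by importance:", forest.feature_importances_.argsort()[::-1][:5])
```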
https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python
Introduction
In today’s world, every customer faces multiple choices, such as finding a book to
read without a specific idea in mind, leading to time-consuming searches and reliance
on recommendations from others. However, a recommendation engine could
streamline this process by suggesting books based on previous reads, saving time and
enhancing the user experience. Recommendation engines, widely used by businesses
like Amazon, Netflix, Google, and Goodreads, leverage machine learning to provide
personalized suggestions. This article explores various recommendation engine
algorithms, the mathematics behind them, and demonstrates creating a
recommendation engine using matrix factorization in Python.
Problem Statement
Many online businesses rely on customer reviews and ratings. Explicit feedback is
especially important in the entertainment and ecommerce industry where all customer
engagements are impacted by these ratings. Netflix relies on such rating data to power
its recommendation engine to provide the best movie and TV series recommendations
that are personalized and most relevant to the user.
This practice problem challenges participants to predict the ratings that users will give
to a set of jokes, given the ratings the same users have provided for another set of
jokes. The dataset is taken from the well-known Jester online joke recommender
system.
Till recently, people generally tended to buy products recommended to them by their
friends or the people they trust. This used to be the primary method of purchase when
there was any doubt about the product. But with the advent of the digital age, that
circle has expanded to include online sites that utilize some sort of recommendation
engine.
A recommendation engine filters the data using different algorithms and recommends
the most relevant items to users. It first captures the past behavior of a customer and
based on that, recommends products which the users might be likely to buy.
If a completely new user visits an e-commerce site, that site will not have any past
history of that user. So how does the site go about recommending products to the user
in such a scenario? One possible solution could be to recommend the best selling
products, i.e. the products which are high in demand. Another possible solution could
be to recommend the products which would bring the maximum profit to the business.
If we can recommend a few items to a customer based on their needs and interests, it
will create a positive impact on the user experience and lead to frequent visits. Hence,
businesses nowadays are building smart and intelligent recommendation engines by
studying the past behavior of their users.
Now that we have an intuition of recommendation engines, let’s now look at how they
work.
Before we deep dive into this topic, first we’ll think of how we can recommend items
to users:
We can recommend items to a user which are most popular among all the users
We can divide the users into multiple segments based on their preferences (user
features) and recommend items to them based on the segment they belong to
Both of the above methods have their drawbacks. In the first case, the most popular
items would be the same for each user so everybody will see the same
recommendations. While in the second case, as the number of users increases, the
number of features will also increase. So classifying the users into various segments
will be a very difficult task.
The main problem here is that we are unable to tailor recommendations based on the
specific interest of the users. It’s like Amazon is recommending you buy a laptop just
because it’s been bought by the majority of the shoppers. But thankfully, Amazon (or
any other big firm) does not recommend products using the above mentioned
approach. They use some personalized methods which help them in recommending
products more accurately.
Let’s now focus on how a recommendation engine works by going through the
following steps.
Data collection is the first and most crucial step in building a recommendation engine.
The data can be collected in two ways: explicitly and implicitly. Explicit data is
information that users provide intentionally, such as movie ratings. Implicit data is
information that is not provided intentionally but is gathered from available data
streams such as search history, clicks, and order history.
There are various algorithms that help us make the filtering process easier. In the next
section, we will go through each algorithm in detail.
Content-based filtering
This algorithm recommends products which are similar to the ones that a user has
liked in the past.
For example, if a person has liked the movie “Inception”, then this algorithm will
recommend movies that fall under the same genre. But how does the algorithm
understand which genre to pick and recommend movies from?
Recommendation engines save all information related to each user in a vector form
known as the profile vector, which contains the user’s past behavior, including liked
or disliked movies and given ratings. Information about movies is stored in another
vector called the item vector, which includes details such as genre, cast, and director.
The content-based filtering algorithm uses cosine similarity to find the cosine of the
angle between the profile vector and the item vector. If A is the profile vector and B is
the item vector, the similarity between them is
sim(A, B) = cos(θ) = (A · B) / (||A|| · ||B||)
Based on the cosine value, which ranges between -1 to 1, the movies are arranged in
descending order and one of the two below approaches is used for recommendations:
Top-n approach: where the top n movies are recommended (Here n can be decided by
the business)
Rating scale approach: Where a threshold is set and all the movies above that
threshold are recommended
Euclidean Distance: Similar items will lie in close proximity to each other if plotted in
n-dimensional space. So, we can calculate the distance between items and based on
that distance, recommend items to the user. The formula for the euclidean distance is
given by:
d(A, B) = √( Σᵢ (Aᵢ − Bᵢ)² )
Pearson’s Correlation: It tells us how strongly two items are correlated; the higher the
correlation, the greater the similarity. Pearson’s correlation can be calculated as
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
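The three similarity measures above can be computed in a few lines; the sketch below uses NumPy/SciPy on two made-up vectors standing in for a profile vector and an item vector.

```python
# Sketch of the three similarity measures mentioned above, using NumPy/SciPy.
# The two vectors are made-up profile/item vectors for illustration.
import numpy as np
from scipy.stats import pearsonr

a = np.array([4.0, 1.0, 0.0, 4.0, 3.0])   # e.g. a user's profile vector
b = np.array([5.0, 2.0, 1.0, 4.0, 2.0])   # e.g. an item vector

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # cos of angle, in [-1, 1]
euclidean = np.linalg.norm(a - b)                          # straight-line distance
pearson, _ = pearsonr(a, b)                                # linear correlation

print(f"cosine similarity : {cosine:.3f}")
print(f"euclidean distance: {euclidean:.3f}")
print(f"pearson r         : {pearson:.3f}")
```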
The algorithm’s main flaw is that its recommendations are narrow: it only suggests
items of the same type as those the user has already liked, and it never draws on what
other users have found useful. To improve on this, an algorithm should also take user
behavior into account when recommending items.
Collaborative filtering
Let us understand this with an example. If person A likes 3 movies, say Interstellar,
Inception and Predestination, and person B likes Inception, Predestination and The
Prestige, then they have almost similar interests. We can say with some certainty that
A should like The Prestige and B should like Interstellar. The collaborative filtering
algorithm uses “User Behavior” for recommending items. This is one of the most
commonly used algorithms in the industry as it is not dependent on any additional
information. There are different types of collaborative filtering techniques and we
shall look at them in detail below.
This algorithm first finds the similarity score between users. Based on this similarity
score, it then picks out the most similar users and recommends products which these
similar users have liked or bought previously.
In terms of our movies example from earlier, this algorithm finds the similarity
between each user based on the ratings they have previously given to different movies.
The prediction of an item i for a user u is calculated by computing the weighted sum of
the ratings given to item i by other users:
Pu,i = Σv (rv,i · su,v) / Σv su,v
Here, Pu,i is the prediction of item i for user u, rv,i is the rating given by user v to item
i, and su,v is the similarity between users u and v.
Now, we have the ratings for users in the profile vectors, and based on these we have to
predict the ratings for other users. The following steps are followed to do so:
For predictions we need the similarity between the user u and v. We can make use of
Pearson correlation.
First we find the items rated by both the users and based on the ratings, correlation
between the users is calculated.
The predictions can be calculated using these similarity values: the algorithm first
computes the similarity between each pair of users and then uses those similarities to
compute the predictions. Users with higher correlation will tend to be more similar.
User/Movie   x1   x2   x3   x4   x5   Mean User Rating
A            4    1    –    4    –    3
B            –    4    –    2    3    3
C            –    1    –    4    4    3
Here we have a user movie rating matrix. To understand this in a more practical
manner, let’s find the similarity between users (A, C) and (B, C) in the above table.
Common movies rated by both A and C are x2 and x4, and by both B and C are movies
x2, x4 and x5.
The correlation between user A and C is more than the correlation between B and C.
Hence users A and C have more similarity and the movies liked by user A will be
recommended to user C and vice versa.
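As a sanity check of this claim, the sketch below recomputes Pearson’s correlation between the user pairs from the rating table, treating the blanks as missing ratings (np.corrcoef is used for Pearson’s r; the helper name is ours).

```python
# Reproducing the user-user similarity check on the rating table above.
# Ratings are taken from the table; np.nan marks a missing rating.
import numpy as np

ratings = {                      #  x1      x2   x3      x4   x5
    "A": np.array([4, 1, np.nan, 4, np.nan], dtype=float),
    "B": np.array([np.nan, 4, np.nan, 2, 3], dtype=float),
    "C": np.array([np.nan, 1, np.nan, 4, 4], dtype=float),
}

def similarity(u, v):
    """Pearson correlation over the movies rated by both users."""
    common = ~np.isnan(ratings[u]) & ~np.isnan(ratings[v])
    return np.corrcoef(ratings[u][common], ratings[v][common])[0, 1]

print("sim(A, C) =", round(similarity("A", "C"), 3))   # higher: very similar users
print("sim(B, C) =", round(similarity("B", "C"), 3))   # lower (negative) correlation
```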
This algorithm is quite time consuming as it involves calculating the similarity for
each user and then calculating prediction for each similarity score. One way of
handling this problem is to select only a few users (neighbors) instead of all to make
predictions, i.e. instead of making predictions for all similarity values, we choose only
few similarity values. There are various ways to select the neighbors:
Select a threshold similarity and choose all the users above that value
Arrange the neighbors in descending order of their similarity value and choose top-N
users
This algorithm is useful when the number of users is small. It is not effective when
there are a large number of users, as it would take a lot of time to compute the
similarity between all user pairs. This leads us to item-item collaborative filtering,
which is effective when the number of users is greater than the number of items being
recommended.
Item-item collaborative filtering aims to find the similarity between pairs of movies
and recommend movies similar to those a user has already rated, analogously to
user-user collaborative filtering. It uses the weighted sum of ratings of
“item-neighbors” instead of “user-neighbors” to compute the prediction.
Now, as we have the similarity between each movie and the ratings, predictions are
made and based on those predictions, similar movies are recommended. Let us
understand it with an example.
User/Movie         x1   x2   x3   x4   x5
A                  4    1    2    4    4
B                  2    4    4    2    1
C                  –    1    –    3    4
Mean Item Rating   3    2    3    3    3
In item-item filtering we use the mean item rating, the average of all ratings given to a
particular item, in place of the mean user rating used in the user-user table, and we
calculate item-item similarity instead of user-user similarity. For example, to compare
movies (x1, x4) and (x1, x5), we look at the users who have rated both items: the
common raters of x1 and x4 are A and B, and the common raters of x1 and x5 are also
A and B.
The similarity between movie x1 and x4 is more than the similarity between movie x1
and x5. So based on these similarity values, if any user searches for movie x1, they
will be recommended movie x4 and vice versa.
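The comparison above can be reproduced with a simple cosine similarity over the common raters; the sketch below is only an illustration and uses plain cosine rather than any particular library’s adjusted similarity.

```python
# Item-item similarity on the second rating table; np.nan marks a missing rating.
# Cosine similarity over the users who rated both items is used as a simple measure.
import numpy as np

#                users:  A  B  C
items = {
    "x1": np.array([4, 2, np.nan], dtype=float),
    "x4": np.array([4, 2, 3], dtype=float),
    "x5": np.array([4, 1, 4], dtype=float),
}

def item_similarity(i, j):
    """Cosine similarity between two items over their common raters."""
    common = ~np.isnan(items[i]) & ~np.isnan(items[j])
    a, b = items[i][common], items[j][common]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("sim(x1, x4) =", round(item_similarity("x1", "x4"), 3))  # higher
print("sim(x1, x5) =", round(item_similarity("x1", "x5"), 3))  # lower
```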
https://www.analyticsvidhya.com/blog/2020/08/recommendation-system-k-nearest-neighbors/
The kNN algorithm provides a reliable and intuitive recommendation approach that
leverages user or item similarity to produce personalized recommendations. kNN-based
recommender systems are used in e-commerce, social media, and healthcare, and
remain an important tool for generating accurate and personalized recommendations.
Some practical considerations are listed below, followed by a short sketch.
Data Quality Issues: Check the quality and consistency of your data. Ensure that
your data is clean, free from outliers, and properly preprocessed (e.g., normalized
or standardized).
Cold Start Problem: Nearest neighbor algorithms may struggle with cold start
problems, where there isn't enough data available for new users or items. Consider
using hybrid approaches or incorporating content-based features to handle this
scenario.
Scalability: For large datasets, computing distances between all pairs of items or
users can be computationally expensive. Look into approximate nearest neighbor
methods or data structures like KD-trees or Ball trees to improve efficiency.
Normalization: Ensure that features used for calculating similarity are properly
normalized to prevent certain features from dominating the distance calculation.
User/item representation: Make sure that your representation of users and items
(feature vectors) appropriately captures the relevant characteristics that define
similarity in your recommendation context.
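Here is a minimal kNN recommender sketch using scikit-learn’s NearestNeighbors; the small item-user matrix is made up for illustration, with items as rows so that “nearest neighbours” are similar items.

```python
# A small kNN item-recommendation sketch using scikit-learn's NearestNeighbors.
# The user-item rating matrix below is made up for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows = items, columns = users; 0 means "not rated".
item_user = np.array([
    [4, 0, 1, 5],
    [5, 1, 0, 4],
    [0, 4, 5, 1],
    [1, 5, 4, 0],
], dtype=float)

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(item_user)

# For item 0, find its 2 nearest neighbour items (the closest match is itself).
distances, indices = knn.kneighbors(item_user[[0]], n_neighbors=3)
print("items most similar to item 0:", indices[0][1:])
print("cosine distances            :", np.round(distances[0][1:], 3))
```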
Dimensionality Reduction
Principal Component Analysis (PCA): PCA is one of the most widely used
dimensionality reduction techniques. It transforms the original variables into a new set
of orthogonal variables (principal components) that capture the maximum variance in
the data.
Autoencoders: These are neural network models that learn efficient representations
of data by encoding input into a lower-dimensional latent space and then
reconstructing the output from this representation.
Benefits:
Efficiency: Reduced dimensionality can lead to faster training times and less memory
usage, especially beneficial for large datasets.
Considerations: some information is inevitably lost when dimensions are removed, and
the transformed features (e.g., principal components or latent codes) can be harder to
interpret than the original variables.
What is SVD?
Singular Value Decomposition factorizes a matrix A into three matrices, A = U Σ Vᵀ,
where U and V have orthonormal columns and Σ is a diagonal matrix of non-negative
singular values sorted in decreasing order.
Dimensionality Reduction:
o By keeping only the largest singular values and their corresponding
vectors, SVD represents the data in a much lower-dimensional latent
space.
Matrix Approximation:
o SVD allows a matrix A to be approximated using only the first k
singular values and vectors. This truncated (rank-k) approximation is
useful for compressing data or denoising (see the sketch after this list).
Implicit Feedback Handling: SVD can handle implicit feedback data (e.g.,
user views or clicks) effectively by capturing underlying patterns in user-item
interactions.
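Here is a small NumPy sketch of a truncated SVD on an illustrative rating matrix: keeping only the first k singular values gives the low-rank approximation described above.

```python
# Truncated SVD of a small rating matrix with NumPy: keep only the first k
# singular values to get a low-rank approximation (the matrix is illustrative).
import numpy as np

A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # number of latent factors kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of A

print("singular values:", np.round(s, 2))
print("rank-2 approximation:\n", np.round(A_k, 2))
```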
What is PCA?
Principal Component Analysis is an unsupervised technique that finds a new set of
orthogonal axes, the principal components, along which the data varies the most, and
re-expresses the data in terms of those axes.
Covariance Matrix:
PCA starts by centering the data and computing the covariance matrix of the dataset,
which captures the pairwise relationships between the different variables.
Eigen decomposition or Singular Value Decomposition (SVD):
The covariance matrix is then decomposed into eigenvectors and eigenvalues
(equivalently, via SVD of the centered data matrix). The eigenvectors define the
principal components, the eigenvalues measure the variance captured by each
component, and the components are ranked so that the top k can be selected.
Finally, the original dataset is transformed into the new space defined by the
selected principal components. This transformation projects the data onto a
lower-dimensional subspace while preserving as much variance as possible.
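The PCA steps above (center, covariance matrix, eigen decomposition, projection) can be sketched directly in NumPy; the random data and the choice of k = 2 components are for illustration only.

```python
# PCA following the steps above: center, covariance matrix, eigen decomposition,
# then projection onto the top k components. Data is randomly generated.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 samples, 5 features
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)        # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]             # rank components by variance
k = 2
components = eigvecs[:, order[:k]]            # top-k principal components

X_reduced = X_centered @ components           # project onto the new subspace
explained = eigvals[order[:k]].sum() / eigvals.sum()

print("reduced shape:", X_reduced.shape)      # (200, 2)
print("variance explained by 2 components:", round(explained, 3))
```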
Applications of PCA:
Dimensionality Reduction:
PCA is primarily used for reducing the number of variables in a dataset while
retaining most of the information. This is beneficial for improving
computational efficiency, reducing noise, and avoiding overfitting in machine
learning models.
Visualization:
By projecting high-dimensional data onto the first two or three principal components,
it can be plotted and inspected visually.
Feature Extraction:
The principal components themselves can serve as new, uncorrelated features for
downstream models.
Noise Reduction:
PCA can effectively filter out noise by emphasizing variations in data that are
significant (captured by principal components with high eigenvalues) and
disregarding variations that are less significant (captured by components with
low eigenvalues).
Advantages of PCA:
Interpretability: Principal components are linear combinations of original
variables, making them interpretable in terms of the contributions of different
features.
Data Compression: PCA allows for data compression by reducing the number
of dimensions while preserving most of the variance, which is useful for
storage and computation.