Module 3

Uploaded by

Shivanand S Likke

DATA SCIENCE AND VISUALIZATION(21CS644)

Module 3- Feature Generation and Feature Selection

Extracting Meaning from Data

Feature Generation and Feature Selection

Module 3 Syllabus: Extracting Meaning from Data: Motivating application: user (customer)
retention. Feature Generation (brainstorming, role of domain expertise, and place for
imagination), Feature Selection algorithms. Filters; Wrappers; Decision Trees; Random
Forests. Recommendation Systems: Building a User-Facing Data Product, Algorithmic
ingredients of a Recommendation Engine, Dimensionality Reduction, Singular Value
Decomposition, Principal Component Analysis, Exercise: build your own
recommendation system.

Handouts for Session 1: Motivating Application: User Retention

3.1 Motivating application: user (customer) retention

 Suppose an app called Chasing Dragons charges a monthly subscription fee, with
revenue increasing with more users.
 However, only 10% of new users return after the first month.
 To boost revenue, there are two options: increase the retention rate of existing
users or acquire new ones.
 Generally, retaining existing customers is cheaper than acquiring new ones.
 Focusing on retention, a model could be built to predict if a new user will return next
month based on their behavior this month.
 This model could help in providing targeted incentives, such as a free month, to users
predicted to need extra encouragement to stay.
 A good crude first model is logistic regression, which gives the probability that a user
returns in their second month conditional on their activities in the first month.
 User behavior is recorded for the first 30 days after sign-up, logging every action with
timestamps: for example, a user clicked "level 6" at 5:22 a.m., slew a dragon at 5:23
a.m., earned 22 points at 5:24 a.m., and was shown an ad at 5:25 a.m. This phase
involves collecting data on every possible user action.
 Each user generates anywhere from a few actions to thousands of them, and all are
stored in timestamped event logs.
 These logs need to be processed into a dataset with rows representing users and columns
representing features. This phase, known as feature generation, involves
brainstorming potential features without being selective.

PREPARED BY DEPARTMENT OF CSE 1

 The data science team, including game designers, software engineers, statisticians, and
marketing experts, collaborates to identify relevant features.
Here are some examples:
 Number of days the user visited in the first month
 Amount of time until second visit
 Number of points on day j for j=1, . . .,30 (this would be 30 separate
features)
 Total number of points in first month (sum of the other features)
 Did user fill out Chasing Dragons profile (binary 1 or 0)
 Age and gender of user
 Screen size of device
 Notice there are redundancies and correlations between these features; that’s OK.
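
The step from event logs to a user-by-feature table can be sketched in Python as follows. This is only an illustration: the log layout (user id, timestamp, action, points), the sample rows, and the feature names are assumptions made for the example, not the actual Chasing Dragons schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log rows: (user_id, ISO timestamp, action, points earned)
events = [
    ("u1", "2024-01-01T05:22:00", "clicked_level_6", 0),
    ("u1", "2024-01-01T05:24:00", "earned_points", 22),
    ("u1", "2024-01-04T19:10:00", "earned_points", 10),
    ("u2", "2024-01-02T08:00:00", "earned_points", 5),
]

def build_features(events):
    """Turn timestamped events into one feature dictionary per user."""
    by_user = defaultdict(list)
    for user_id, stamp, action, points in events:
        by_user[user_id].append((datetime.fromisoformat(stamp), action, points))

    features = {}
    for user_id, rows in by_user.items():
        # Distinct calendar days on which the user was active, in order
        days = sorted({ts.date() for ts, _, _ in rows})
        features[user_id] = {
            "num_days_visited": len(days),
            "total_points": sum(points for _, _, points in rows),
            # None when the user never came back for a second day
            "days_until_second_visit": (days[1] - days[0]).days if len(days) > 1 else None,
        }
    return features

user_features = build_features(events)
```

Each key of `user_features` is one row of the dataset and each inner key is one column; further brainstormed features would be added as extra keys in the same loop.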

 To construct a logistic regression model that predicts whether a user returns, first
get a working model, then refine it. Let $c_i = 1$ if user i returns at any time in the
subsequent month, and $c_i = 0$ otherwise. With $x_i$ the feature vector of user i, the
logistic regression to fit is:

$$\operatorname{logit}\big(P(c_i = 1 \mid x_i)\big) = \alpha + \beta^{T} x_i$$

 Initially, a comprehensive set of features is gathered, encompassing user behavior,
demographics, and platform interactions.
 Following data collection, feature subsets must be refined for optimal predictive
power during model scaling and production.
 Three main methods guide feature subset selection: filters, wrappers, and embedded
methods.
 Filters independently evaluate feature relevance, wrappers use model performance to
assess feature subsets, and embedded methods incorporate feature selection within
model training.
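
As a rough illustration of what the logistic regression model computes, the following Python sketch turns one user's feature vector into a return probability. The intercept `alpha`, the coefficients `beta`, and the feature values are made-up numbers for illustration, not fitted values.

```python
import math

def return_probability(features, alpha, beta):
    """Logistic model: P(return) = 1 / (1 + exp(-(alpha + beta . x)))."""
    z = alpha + sum(b * x for b, x in zip(beta, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical user: 12 days visited, profile filled out (1)
example_features = [12, 1]
# Hypothetical coefficients (in practice these come from fitting the model)
alpha = -3.0
beta = [0.2, 0.5]

p_return = return_probability(example_features, alpha, beta)
```

Users whose predicted probability falls below some chosen cutoff would be the candidates for an incentive such as a free month.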

Questions

1. Define customer retention.

2. What are the relevant features for the Chasing Dragons app?

3. How can the revenue of the Chasing Dragons application be boosted?

Handouts for Session 2: Feature Generation or Feature Extraction

3.2 Feature Generation or Feature Extraction

Feature generation, also known as feature extraction, is the process of transforming
raw data into a structured format where each column represents a specific
characteristic or attribute (feature) of the data, and each row represents an
observation or instance.
 This involves identifying, creating, and selecting meaningful variables from the raw
data that can be used in machine learning models to make predictions or understand
patterns.
 This process is both an art and a science. Having a domain expert involved is beneficial,
but using creativity and imagination is equally important.
 Remember, feature generation is constrained by two factors: the feasibility of
capturing certain information and the awareness to consider capturing it.

Information can be categorized into the following buckets:

 Relevant and useful, but it’s impossible to capture it.

Keep in mind that much user information isn't captured, like free time, other apps,
employment status, or insomnia, which might predict their return. Some captured data
may act as proxies for these factors, such as playing the game at 3 a.m. indicating
insomnia or night shifts.
 Relevant and useful, possible to log it, and you did.
The decision to log this information during the brainstorming session was crucial.
However, mere logging doesn't guarantee understanding its relevance or usefulness.
The feature selection process aims to uncover this information.
 Relevant and useful, possible to log it, but you didn’t.
Human limitations can lead to overlooking crucial information, emphasizing the need
for creative feature selection. Usability studies help identify key user actions for better
feature capture.
 Not relevant or useful, but you don’t know that and log it.
Feature selection aims to address this: you have logged certain information without
knowing whether it is actually needed.
 Not relevant or useful, and you either can’t capture it or it didn’t occur to you.

Questions:

1. Define Feature Generation.

2. Define Feature Extraction.

3. Define Feature Generation. Explain in detail how information can be categorized in
feature generation.

Handouts for Session 3: Feature Selection: Filters and Wrappers

3.3 Feature Selection Algorithms

Feature selection involves identifying the most relevant and informative features from a dataset
for building predictive models.

1. Filters:

 Filters rank features by a specific metric or statistic, such as correlation with
the outcome variable, offering a quick overview of each feature's predictive
power.
 However, they may ignore redundancy and fail to consider feature interactions,
potentially resulting in correlated features and limited insight into complex
relationships.
 By treating the features as independent, a filter does not take into account
possible interactions or correlations between features.
 This matters because, in some cases, two redundant features can be more
powerful when used together, and features that appear useless individually
can be useful in combination.
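
A filter of the kind described above can be sketched in a few lines of Python. The sketch ranks each feature by the absolute Pearson correlation with the outcome; the feature names and values are invented for illustration, and correlation is only one of many possible ranking statistics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def rank_features(columns, outcome):
    """Score every feature independently, then sort from strongest to weakest."""
    scores = {name: abs(pearson(values, outcome)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical feature columns for six users and whether each user returned
columns = {
    "num_days": [1, 2, 8, 10, 15, 20],
    "screen_size": [5, 6, 5, 6, 5, 6],
    "num_dragons": [0, 3, 1, 9, 7, 12],
}
returned = [0, 0, 0, 1, 1, 1]

ranking = rank_features(columns, returned)
```

Because every feature is scored on its own, two strongly correlated features can both land at the top of `ranking`, which is exactly the redundancy problem noted for filters.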

2. Wrappers

 Wrapper feature selection searches over subsets of features of a predetermined
size, seeking combinations that optimize model performance.
 However, the number of potential combinations grows exponentially with the
number of features. The number of possible size-k subsets of n things is the
binomial coefficient:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

 Searching over this many candidate feature subsets can lead to overfitting.

 In wrapper feature selection, two key aspects require consideration:

 first, the choice of an algorithm for feature selection, and
 second, the determination of a selection criterion or filter to find out the
usefulness of the chosen feature set.

A. Selecting an algorithm

Stepwise regression is a category of feature selection methods which involves
systematically adjusting feature sets within regression models, typically through
forward selection, backward elimination, or a combined approach, to
optimize model performance based on predefined selection criteria.

Forward selection:

Forward selection involves systematically adding features to a regression
model one at a time based on their ability to improve model performance
according to a selection criterion. This iterative process continues until further
feature additions no longer enhance the model's performance.

Backward elimination:

Backward elimination begins with a regression model containing all features.
Features are then removed one at a time; at each step, remove the feature
whose removal gives the biggest improvement in the selection criterion. Stop
when removing any further feature would make the selection criterion
worse.

Combined approach:

The combined approach in feature selection blends forward selection and
backward elimination to strike a balance between maximizing relevance and
minimizing redundancy. It iteratively adds and removes features based on
their significance and impact on model fit, resulting in a subset of features
optimized for predictive power.
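
Forward selection can be sketched generically in Python, independent of any particular model. In the sketch below, `score` is a placeholder for whatever selection criterion is chosen (higher is better); the toy scoring function and its numbers are invented purely so the example runs, and a real criterion would refit the model for each candidate subset.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset); stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate the criterion with each remaining feature added in turn
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no single addition improves the criterion
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Hypothetical per-feature usefulness, with a penalty of 0.5 per feature used
toy_value = {"num_days": 3.0, "num_dragons": 2.0, "screen_size": 0.2}

def toy_score(subset):
    return sum(toy_value[f] for f in subset) - 0.5 * len(subset)

selected, best_score = forward_selection(list(toy_value), toy_score)
```

Backward elimination follows the same pattern in reverse: start from the full list and repeatedly drop the feature whose removal most improves the criterion.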

B. Selection criterion

The choice of selection criteria in feature selection methods may seem arbitrary.
To address this, experimenting with various criteria can help assess model
robustness. Different criteria may yield diverse models, necessitating the
prioritization of optimization goals based on the problem context and objectives.

R-squared

R-squared can be interpreted as the proportion of variance explained by your
model.

p-values

In regression analysis, the interpretation of p-values involves assuming a null
hypothesis where the coefficients (βs) are zero. A low p-value means that, if the
null hypothesis were true, observing the data and obtaining the estimated
coefficient would be highly unlikely, which is taken as evidence that the
coefficient is non-zero.

AIC (Akaike Information Criterion)

Given by the formula 2k − 2ln(L), where k is the number of parameters in the
model and ln(L) is the “maximized value of the log likelihood.” The goal is to
minimize AIC.

BIC (Bayesian Information Criterion)

Given by the formula k·ln(n) − 2ln(L), where k is the number of parameters in
the model, n is the number of observations (data points, or users), and ln(L) is
the maximized value of the log likelihood. The goal is to minimize BIC.
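
The two formulas translate directly into Python. In the sketch below the log-likelihood values, parameter counts, and sample size are hypothetical numbers chosen only to show how the criteria trade fit against model size.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical comparison: a 3-parameter model versus a 10-parameter model
# fitted on n = 1000 users (log-likelihoods are made up)
n_users = 1000
small_model = {"k": 3, "log_likelihood": -120.0}
large_model = {"k": 10, "log_likelihood": -118.0}

aic_small = aic(small_model["log_likelihood"], small_model["k"])
aic_large = aic(large_model["log_likelihood"], large_model["k"])
bic_small = bic(small_model["log_likelihood"], small_model["k"], n_users)
bic_large = bic(large_model["log_likelihood"], large_model["k"], n_users)
```

With these made-up numbers the larger model fits slightly better but is penalized for its extra parameters, so both criteria come out lower (better) for the smaller model.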

Entropy

Entropy is a measure of disorder or impurity in the given dataset.

Questions:

1. Define feature selection.

2. Explain the filter method.

3. Explain the wrapper method.

4. Explain how an algorithm is selected in the wrapper method.

5. Explain the different selection criteria used in feature selection.

Handouts for Session 4: Decision tree

3. Embedded Methods: Decision Trees

i)Decision Trees

 Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and
intuitive way to make decisions based on data by modelling the relationships
between different variables.
 A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes
representing final outcomes or predictions.
 Each internal node corresponds to a test on an attribute, each branch corresponds
to the result of the test, and each leaf node corresponds to a class label or a
continuous value. In the context of a data problem, a decision tree is a
classification algorithm.
 For the Chasing Dragons example, you want to classify users as “Yes, going to
come back next month” or “No, not going to come back next month.” This isn’t
really a decision in the colloquial sense, so don’t let that throw you.
 You know that the class of any given user is dependent on many factors
(number of dragons the user slew, their age, how many hours they already
played the game). And you want to break it down based on the data you’ve
collected. But how do you construct decision trees from data and what
mathematical properties can you expect them to have?

 But you want this tree to be based on data and not just what you feel like.
Choosing a feature to pick at each step is like playing the game 20 Questions
really well. You take whatever the most informative thing is first. Let’s
formalize that—we need a notion of “informative.”
 For the sake of this discussion, assume we break compound questions into
multiple yes-or-no questions, and we denote the answers by “0” or “1.” Given
a random variable X, we denote by p(X = 1) and p(X = 0) the probability that X
is true or false, respectively.

Entropy

 Entropy is a measure of disorder or impurity in the given dataset.

 In the decision tree, messy data are split based on values of the feature vector
associated with each data point.

 With each split, the data becomes more homogeneous, which decreases the
entropy. However, some nodes will still contain mixed data, and their entropy
will not be small. The higher the entropy, the harder it is to draw
any conclusion.

 When the tree finally reaches a terminal (leaf) node, maximum purity has
been reached.

 For a dataset S that has C classes, where $p_i$ is the probability of randomly
choosing a data point from class i, the entropy E(S) can be represented as:

$$E(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

 To quantify which feature is the most “informative,” we define the entropy of X,
effectively a measure of how mixed up something is:

$$H(X) = -p(X=1)\log_2 p(X=1) - p(X=0)\log_2 p(X=0)$$

 When p(X = 1) = 0 or p(X = 0) = 0, the entropy vanishes; that is, if either option
has probability zero, the entropy is 0 (pure), taking 0·log2(0) to be 0. Moreover,
because p(X = 1) = 1 − p(X = 0), the entropy is symmetric about 0.5 and
maximized at 0.5, which we can easily confirm using a bit of calculus. The
figure below shows this.

[Figure: entropy H(X) plotted against p(X = 1)]

 Example:

 Entropy is a measurement of how mixed up something is. If X denotes the event
of a baby being born a boy, we expect it to be true or false with probability
close to 1/2, which corresponds to high entropy; i.e., the bag of babies from
which we are selecting a baby is highly mixed.
 If X denotes the event of rainfall in a desert, then it has low entropy. In other
words, the bag of day-long weather events is not highly mixed in deserts.
 X is the target of our model. So, X could be the event that someone buys
something on our site.
 We need to determine which attribute of the user tells us the most about
this event X.
 Information gain, IG(X, a), for a given attribute a, is the entropy we
lose (the reduction in entropy, or uncertainty) if we know the value of that attribute:

$$IG(X, a) = H(X) - H(X \mid a)$$

 H(X|a) can be computed in two steps:
 For any actual value $a_0$ of the attribute a, we can compute the specific
conditional entropy $H(X \mid a = a_0)$ as:

$$H(X \mid a = a_0) = -p(X=1 \mid a = a_0)\log_2 p(X=1 \mid a = a_0) - p(X=0 \mid a = a_0)\log_2 p(X=0 \mid a = a_0)$$

 Then we can put it all together, over all possible values $a_i$ of a, to get the
conditional entropy H(X|a):

$$H(X \mid a) = \sum_{a_i} p(a = a_i)\, H(X \mid a = a_i)$$
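
The entropy and information gain definitions above can be checked numerically with a short Python sketch. The label and attribute lists are toy data; the functions work for any discrete labels, not only 0/1.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p_i log2 p_i over the classes present in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(X, a) = H(X) - sum over values a_i of p(a = a_i) * H(X | a = a_i)."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - conditional

# Toy data: whether four users returned, and two candidate binary attributes
returned = [1, 1, 0, 0]
filled_profile = [1, 1, 0, 0]    # perfectly aligned with the outcome
large_screen = [1, 0, 1, 0]      # unrelated to the outcome

gain_profile = information_gain(returned, filled_profile)
gain_screen = information_gain(returned, large_screen)
```

Here the attribute that separates returners from non-returners removes all the uncertainty, while the unrelated attribute removes none, so a tree would split on the first one.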

Decision Tree Algorithm

 The decision tree is built iteratively.
 It starts with the root. You need an algorithm to decide which attribute to split
on, i.e., which node should be created next.
 The attribute is chosen so as to maximize information gain.
 Keep going until all the points at a leaf are in the same class or you end up
with no features left. In the latter case, the leaf predicts by majority vote.
 The tree can be pruned to avoid overfitting, for example by cutting it off below
a certain depth.
 If you build the entire tree, it’s often less accurate with new data than if you
prune it.
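
The steps listed above can be sketched from scratch in Python for categorical features. This is a simplified illustration in the spirit of the algorithm, not the procedure used by any particular library; the depth limit stands in for pruning, and the feature names and data are invented.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, features, max_depth=3):
    """Return a class label (leaf) or a dict describing a split (internal node)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node, when no features are left, or at the depth limit
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    node = {"feature": best, "default": majority, "children": {}}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining, max_depth - 1)
    return node

def predict(tree, row):
    # Walk down the tree; unseen attribute values fall back to the node's majority class
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical binary features for four users and whether each returned
rows = [
    {"profile": 1, "active": 1},
    {"profile": 0, "active": 1},
    {"profile": 1, "active": 0},
    {"profile": 0, "active": 0},
]
returned = [1, 1, 0, 0]

tree = build_tree(rows, returned, ["profile", "active"])
prediction = predict(tree, {"profile": 0, "active": 1})
```

On this toy data the root split lands on the feature with the larger information gain, and each branch becomes a pure leaf after a single split.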

 Example:

 Suppose you have your Chasing Dragons dataset. Your outcome
variable is Return: a binary variable that captures whether or not the user
returns next month, and you have tons of predictors.

# Load the rpart library, which is used for recursive partitioning and
# regression trees
library(rpart)

# Set the working directory first (adjust the path to your machine),
# then read the CSV file from it
setwd("F:/college/data science-2021 scheme/dscience")
chasingdragons <- read.csv("chasingdragons.csv")

# Grow the classification tree: predict Return using the other variables
# (profile, num_dragons, num_friends_invited, gender, age, num_days).
# method="class" specifies that this is a classification tree.
model1 <- rpart(Return ~ profile + num_dragons + num_friends_invited
                + gender + age + num_days, method="class", data=chasingdragons)

# Display the complexity parameter (CP) table, which helps in understanding
# the performance of the model and in pruning the tree
printcp(model1)

# Plot the cross-validation results against the complexity parameter, showing
# how the model's error changes with tree complexity; this aids in selecting
# the tree size
plotcp(model1)

# Detailed summary of the model, including splits and nodes
summary(model1)

# Plot the classification tree. uniform=TRUE makes the branch lengths
# uniform, and main specifies the title of the plot.
plot(model1, uniform=TRUE, main="Classification Tree for Chasing Dragons")

# Add text labels to the tree plot. use.n=TRUE includes the number of
# observations at each node, all=TRUE labels all nodes, and cex=.8 sets
# the size of the text labels.
text(model1, use.n=TRUE, all=TRUE, cex=.8)

Handling Continuous Variables

 For a continuous variable, a threshold must be chosen so that the variable
can be treated as binary.
 Example: a user’s number of dragon slays can be partitioned into the categories
“less than 10” and “at least 10”. Now we have a binary variable.
 In this case, it takes some extra work to compute the information gain because
it depends on the threshold as well as the feature.
 Instead of a single threshold, bins of values can be created for the attribute,
depending on the situation.
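
One way to choose the threshold, sketched below in Python, is to try every observed value as a candidate cut point and keep the one with the highest information gain. The dragon-slay counts and outcomes are invented, and scanning all observed values is an assumption of this sketch rather than a prescribed method.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the split 'value < t' versus 'value >= t'."""
    n = len(labels)
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    # Skip the smallest value so that both sides of the split are non-empty
    for t in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical number of dragons slain by six users, and whether each returned
num_dragons = [2, 5, 8, 12, 15, 20]
returned = [0, 0, 0, 1, 1, 1]

threshold, gain = best_threshold(num_dragons, returned)
```

The resulting binary feature "at least `threshold` dragons" can then compete with the other features when the tree decides where to split.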

Questions:

1. Define decision tree.

2. Explain the decision tree for the Chasing Dragons problem.

3. Suppose you have your Chasing Dragons dataset. Your outcome variable is Return: a
binary variable that captures whether or not the user returns next month, and you have tons
of predictors. Write an R script using the decision tree algorithm for the above scenario.

4. Write the decision tree algorithm in detail.

Handouts for Session 5: Random forest

ii)Random Forest

 Random forests generalize decision trees with bagging, otherwise known as
bootstrap aggregating.
 Bagging makes the models more accurate and more robust, but at the cost of
interpretability.
 Random forests are nevertheless easy to specify, with two hyperparameters: the
number of trees (N) in the forest and the number of features (F) to randomly
select at each node of a tree.
 A bootstrap sample is a sample with replacement, which means we might
sample the same data point more than once. We usually take the sample size
to be 80% of the size of the entire (training) dataset, but this parameter
can be adjusted depending on circumstances. This is technically a third
hyperparameter of our random forest algorithm.
 To construct a random forest, you construct N decision trees as follows:
 For each tree, take a bootstrap sample of your data, and for each node
you randomly select F features, say 5 out of the 100 total features.
 Then you use your entropy-information-gain engine as described in the
previous section to decide which among those features you will split
your tree on at each stage.
 Note that you could decide beforehand how deep the tree should get, or you
could prune your trees after the fact, but you typically don’t prune the trees in
random forests, because a great feature of random forests is that they can
incorporate idiosyncratic noise.

 Algorithm
 Step 1: Select K data points at random (with replacement) from the training set.
 Step 2: Build a decision tree based on the selected K points.
 Step 3: Choose the number of trees N to build and repeat Steps 1 and 2 for
each tree.
 Step 4: For a new (test) data point, have each tree predict its class, and
assign the class chosen by the majority of trees (for a numeric outcome,
use the average of the predictions).
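
The bagging skeleton of the algorithm can be sketched in Python as below. To keep the example short, each "tree" is a one-split stump trained on a random subset of features; a full implementation would grow deeper trees using information gain at every node. All names, the toy data, and the stump learner are assumptions of this sketch.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, fraction, rng):
    """Sample with replacement; the same data point may be drawn more than once."""
    k = max(1, int(fraction * len(rows)))
    idx = [rng.randrange(len(rows)) for _ in range(k)]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def train_stump(rows, labels, rng, n_features=1):
    """One-split stand-in for a tree, restricted to n_features random features."""
    candidates = rng.sample(sorted(rows[0]), n_features)
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for f in candidates:
        by_value = {}
        for r, l in zip(rows, labels):
            by_value.setdefault(r[f], []).append(l)
        # Predict the majority class for each observed value of feature f
        rule = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
        correct = sum(rule[r[f]] == l for r, l in zip(rows, labels))
        if best is None or correct > best[0]:
            best = (correct, f, rule)
    _, feature, rule = best
    return lambda row: rule.get(row[feature], default)

def train_forest(rows, labels, n_trees=10, fraction=0.8, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, fraction, rng)
        forest.append(train_stump(sample_rows, sample_labels, rng, n_features))
    return forest

def forest_predict(forest, row):
    """Each tree votes; the most common class wins."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Hypothetical binary features for ten users and whether each returned
rows = [{"profile": i % 2, "active": i % 2} for i in range(10)]
returned = [i % 2 for i in range(10)]

forest = train_forest(rows, returned, n_trees=10, fraction=0.8)
vote = forest_predict(forest, {"profile": 1, "active": 1})
```

The three hyperparameters from the text appear as `n_trees` (N), `n_features` (F), and `fraction` (the bootstrap sample size).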

Questions

1. Define random forest.

2. Explain the random forest algorithm.

Handouts for Session 6: User Retention

3.4 User Retention: Interpretability Vs. Predictive Power

 Assume a model predicts well, but can you find the meaning in the model?

 Example: The prediction could be “the more the user plays in the first month, the
more likely the user is to come back next month”. This is obvious and not very
helpful when doing the analysis

 It could also tell you that showing users ads in the first 5 minutes decreases their
chances of coming back, but that it’s OK to show ads after the first hour. This would
suggest not showing ads in the first 5 minutes.

 To study this more, you really would want to do some A/B testing, but this initial
model and feature selection would help you prioritize the types of tests you might
want to run.

 Features that are associated with the user’s behaviour are qualitatively different
from features associated with your own behaviour as the product owner, such as
when ads are shown.

 If getting a high number of points in the first month is correlated with
players returning next month, does that mean that if you just give users a high
number of points this month without them playing at all, they’ll come back? No!

 It’s not the number of points that caused them to come back, it’s that they’re really
into playing the game which correlates with both their coming back and their getting
a high number of points.

 Therefore, do feature selection with all variables, but then focus on the ones you
can do something about conditional on user attributes.

Questions:

1.Explain the User Retention in Detail.

Handouts for Session 7: User Retention, Dimensionality Reduction, SVD

3.5 A Real-World Recommendation Engine

 Recommendation engines are used all the time—what movie would you like,
knowing other movies you liked? What book would you like, keeping in mind past
purchases?
 There are plenty of different ways to go about building such a model, but they have
very similar feels if not implementation. To set up a recommendation engine,
suppose you have users, which form a set U; and you have items to recommend,
which form a set V.
 We can denote this as a bipartite graph (shown again in Figure below) if each user
and each item has a node to represent it—there are lines from a user to an item if
that user has expressed an opinion about that item.
 Note they might not always love that item, so the edges could have weights: they
could be positive, negative, or on a continuous scale (or discontinuous, but many-
valued like a star system).
 The implications of this choice can be heavy, but we won’t delve too deep here; for
us the weights are numeric ratings.
 Next up, you have training data in the form of some preferences—you know some
of the opinions of some of the users on some of the items. From those training data,
you want to predict other preferences for your users. That’s essentially the output
for a recommendation engine.
 You may also have metadata on users (e.g., their gender) or on items
(e.g., the color of the product). For example, users come to your website and set up
accounts, so you may know each user’s gender, age, and preferences for up to three
items.
 You represent a given user as a vector of features, sometimes including only
metadata—sometimes including only preferences (which would lead to a sparse
vector because you don’t know all the user’s opinions) and sometimes including
both, depending on what you’re doing with the vector.
 Also, you can sometimes bundle all the user vectors together to get a big user
matrix, which we call U, through abuse of notation.

3.6 The Dimensionality Problem

 As we increase the dimension (number of features), the accuracy of the system
increases up to a certain limit. Beyond the limit, the accuracy starts to decline.
 Solution: dimensionality reduction. This does not mean removing features, but
rather transforming the data into a different perspective.
 Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations. So we’re saying here
that taste is latent but can be approximated by putting together all the observed
information we do have about the user.

3.6.1 Singular Value Decomposition (SVD)

 Rank of a Matrix: The rank of a matrix A is the maximum number of linearly
independent row vectors or column vectors in the matrix. It is denoted as rank(A). In
the case of SVD, the rank of A is the number of non-zero singular values in the diagonal
matrix S.
 Latent Features: These are the hidden patterns or factors in the data. For a user-item
matrix, they represent underlying user preferences and item characteristics.
 Importance of Latent Features: The singular values in S indicate the importance of these
features.
 A larger singular value means the corresponding latent feature is more significant in
capturing the structure of the data.
 Given an m×n matrix X of rank k, the SVD is:

$$X = U S V^{T}$$

 U: m×k matrix (left singular vectors) that contains user latent features. Each row
corresponds to a user, and each column represents a latent feature. For example, a row
might capture a user's preference for genres like action or comedy.
 S: k×k diagonal matrix of singular values. These values indicate the importance of
each latent feature; larger values correspond to more significant features.
 V: n×k matrix (right singular vectors) that contains item latent features, so that its
transpose $V^{T}$ is k×n. Each row of V corresponds to an item, and each column
represents a latent feature. For example, a row might capture an item's characteristics
like genre or popularity.
 U and V: The columns of U and V are orthogonal, meaning they capture independent
latent features.

Dimensionality Reduction with SVD

To reduce the dimensionality, follow these steps:

1. Compute SVD:

 Perform SVD on the matrix X to get U, S, and V.

2. Select Top d Singular Values:

 Identify the top d largest singular values from S. These singular values correspond
to the most significant components of the data.

3. Construct Reduced Matrices:

 Construct the reduced matrices $U_d$, $S_d$, and $V_d$ by keeping the first d columns
of U, the top-left d×d block of S, and the first d columns of V (with the singular
values sorted in decreasing order).

4. Approximate the Original Matrix:

$$X_d = U_d S_d V_d^{T}$$

 This $X_d$ is the best rank-d approximation of X in the least-squares sense, using
only the top d singular values.

In Singular Value Decomposition (SVD), dimensionality reduction is achieved by
selecting the most significant singular values and their corresponding singular vectors.
This process leverages the properties of the SVD to capture the most important
structures in the data while discarding less important information.
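
The four steps can be sketched with NumPy as follows. The small ratings matrix is invented for illustration, and `np.linalg.svd` returns the singular values already sorted in decreasing order, which is what makes "keep the first d" valid.

```python
import numpy as np

def truncated_svd(X, d):
    """Return the rank-d approximation X_d = U_d S_d V_d^T and its factors."""
    # full_matrices=False gives the compact form: U is m x k, Vt is k x n
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_d = U[:, :d]          # first d left singular vectors (users)
    S_d = np.diag(s[:d])    # top d singular values on the diagonal
    Vt_d = Vt[:d, :]        # first d right singular vectors (items), transposed
    return U_d @ S_d @ Vt_d, U_d, S_d, Vt_d

# Hypothetical 4-user by 3-item ratings matrix
X = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

X_2, U_2, S_2, Vt_2 = truncated_svd(X, 2)
error_rank_2 = np.linalg.norm(X - X_2)
error_rank_1 = np.linalg.norm(X - truncated_svd(X, 1)[0])
```

Comparing `error_rank_1` and `error_rank_2` shows how much structure each additional latent feature recovers.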

Questions:

1. Explain a real-world recommendation engine with a neat diagram.

2. What is the dimensionality problem?

3. Explain SVD in detail.

Handouts for Session 8: PCA,Building Recommendation Engine

3.6.2 Principal Component Analysis

 In this approach, we aim to predict preferences by factorizing the user-item
interaction matrix X into two lower-dimensional matrices, U and V, without the
need for the singular values matrix S. The goal is to find U and V such that:

$$X \approx U \cdot V^{T}$$

 The optimization problem is to minimize the discrepancy between the actual
user-item interaction matrix X and its approximation $\tilde{X} = U \cdot V^{T}$.

 This discrepancy is measured using the squared error:

$$\operatorname*{argmin}_{U,V} \sum_{i,j} (x_{ij} - u_i \cdot v_j)^2$$

 $x_{ij}$ is the actual interaction (e.g., rating) between user i and item j.
 $u_i$ is the i-th row of matrix U, representing the latent features of user i.
 $v_j$ is the j-th row of matrix V, representing the latent features of item j.
 The dot product $u_i \cdot v_j$ is the predicted preference of user i for item j.

Latent Features and Matrix Dimensions

 Number of Latent Features (d): This is a parameter that you choose, representing
the number of latent features you want to use. It controls the dimensions of matrices
U and V.
 Matrix U: Has dimensions m×d, where m is the number of users and d is the
number of latent features. Each row corresponds to a user.

 Matrix V: Has dimensions n×d, where n is the number of items and d is the number
of latent features. Each row corresponds to an item.

3.7 Alternating Least Squares

 Here we are not first minimizing the squared error and then minimizing the size of the
entries of the matrices U and V. We are actually doing both at the same time, by adding
a penalty on the size of the entries to the squared error.
 Alternating Least Squares (ALS) is an algorithm for matrix factorization. ALS is used
to decompose a given user-item interaction matrix into two lower-dimensional matrices
(U and V) that capture latent features of users and items.

 Here’s the algorithm:

Pick a random V.

Optimize U while V is fixed.

Optimize V while U is fixed.

 Keep doing the preceding two steps until you’re not changing very much at all. To be
precise, you choose an ϵ and if your coefficients are each changing by less than ϵ, then
you declare your algorithm “converged.”
 Fix V and update U: this optimization is done user by user. So for user i,
you want to find:

$$\operatorname*{argmin}_{u_i} \sum_{j \in P_i} (x_{ij} - u_i \cdot v_j)^2$$

 where each $v_j$ is fixed and $P_i$ is the set of items for which user i has expressed
a preference. In other words, you just care about this user for now. But this is the
same as linear least squares, and it has a closed-form solution. In other words, set:

$$u_i = (V_{*,i}^{T} V_{*,i})^{-1} V_{*,i}^{T} x_{*,i}$$

 where $V_{*,i}$ is the subset of V for which you have preferences coming from user i,
and $x_{*,i}$ is the vector of those preferences. Taking the inverse is easy because it’s
d×d, which is small. And there aren’t that many preferences per user, so solving this
many times is really not that hard. Overall you’ve got a doable update for U.
 When you fix U and optimize V, it’s analogous: you only ever have to consider the
users that rated that movie, which may be pretty large for popular movies but on average
isn’t; but even so, you’re only ever inverting a d×d matrix.
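
The alternating updates can be sketched with NumPy as below. This is an illustrative sketch, not the code referred to later in the handout: it adds a small ridge term `lam` to each least-squares solve, starts V from positive random values, and runs a fixed number of sweeps instead of testing for convergence with an ϵ. The toy ratings are invented.

```python
import numpy as np

def als(ratings, n_users, n_items, d=2, lam=0.1, n_iters=20, seed=0):
    """ratings: list of (user, item, value) triples for the observed entries only."""
    rng = np.random.default_rng(seed)
    U = np.zeros((n_users, d))
    V = rng.uniform(0.1, 1.0, size=(n_items, d))  # positive random start (sketch assumption)
    by_user = {i: [] for i in range(n_users)}
    by_item = {j: [] for j in range(n_items)}
    for i, j, x in ratings:
        by_user[i].append((j, x))
        by_item[j].append((i, x))
    eye = np.eye(d)
    for _ in range(n_iters):
        # Fix V, update each user's row of U from the items that user rated
        for i, obs in by_user.items():
            if obs:
                Vi = V[[j for j, _ in obs]]
                xi = np.array([x for _, x in obs], dtype=float)
                U[i] = np.linalg.solve(Vi.T @ Vi + lam * eye, Vi.T @ xi)
        # Fix U, update each item's row of V from the users who rated it
        for j, obs in by_item.items():
            if obs:
                Uj = U[[i for i, _ in obs]]
                xj = np.array([x for _, x in obs], dtype=float)
                V[j] = np.linalg.solve(Uj.T @ Uj + lam * eye, Uj.T @ xj)
    return U, V

def rmse(ratings, U, V):
    """Root mean square error over the known ratings only."""
    errors = [(x - U[i] @ V[j]) ** 2 for i, j, x in ratings]
    return float(np.sqrt(np.mean(errors)))

# Toy data: ratings follow x_ij = (i + 1) * (j + 1), with entry (2, 2) left unobserved
ratings = [(i, j, float((i + 1) * (j + 1)))
           for i in range(3) for j in range(3) if (i, j) != (2, 2)]

U, V = als(ratings, n_users=3, n_items=3, d=1, lam=1e-3, n_iters=50)
training_error = rmse(ratings, U, V)
predicted_missing = float(U[2] @ V[2])
```

Each solve involves only a d×d system, which is why the per-user and per-item updates stay cheap even when the full matrix is large. `predicted_missing` is the model's guess for the rating that was held out of the training data.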

3.8 Building Recommendation System using Python

The following code is Matt’s code to illustrate implementing a recommendation system on a relatively small dataset.

Initialize Matrix V:

 V is initialized with random values.

 This matrix represents the latent features of items.

Initialize Matrix U:

 U is initialized with zeros.

 This matrix will represent the latent features of users.

ALS Algorithm

The ALS algorithm alternates between updating the user latent features (U) and the item
latent features (V) to minimize the squared error of the predicted ratings.

For each user:

• Extract the items they have interacted with and their corresponding ratings.

• Create a matrix vo containing the latent features of these items.

• Create a vector pvo of the ratings.

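Solve the regularized least squares problem:

• Update U[i, :] using the formula:

\[ U_{i,:} = \left( V_i^{T} V_i + L I \right)^{-1} V_i^{T} X_i \]

• `V_i` is the submatrix of `V` corresponding to the items user `i` has rated.

• `X_i` is the vector of ratings for these items.

• `L` is the regularization parameter and `I` is the d×d identity matrix.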
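The update for a single user can be tried in isolation. The sketch below reuses the starting matrix V, the ratings of user 0, and the regularization value 0.03 from the code in this section; the helper `objective` and the use of `np.linalg.solve` instead of an explicit inverse are additions made for illustration.

```python
import numpy as np

L = 0.03                                   # regularization parameter
V = np.array([
    [0.15968384, 0.9441198, 0.83651085],
    [0.73573009, 0.24906915, 0.85338239],
    [0.25605814, 0.6990532, 0.50900407],
    [0.2405843, 0.31848888, 0.60233653],
    [0.24237479, 0.15293281, 0.22240255],
    [0.03943766, 0.19287528, 0.95094265],
])
user0 = [(0, 0, 1), (0, 1, 22), (0, 2, 1), (0, 3, 1), (0, 5, 0)]

items = [j for _, j, _ in user0]
V_i = V[items]                             # latent features of the rated items
X_i = np.array([p for _, _, p in user0], dtype=float)

A = V_i.T @ V_i + L * np.eye(3)            # the d x d matrix from the formula
u0 = np.linalg.solve(A, V_i.T @ X_i)       # same result as inv(A) @ V_i.T @ X_i

def objective(u):
    """Squared error on the known ratings plus the size penalty on u."""
    return np.sum((X_i - V_i @ u) ** 2) + L * np.sum(u ** 2)

print(u0, objective(u0))
```

Because the penalty makes the objective strictly convex, any other choice of the user vector gives a larger value of `objective`.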


Error Calculation


 Calculate the root mean square error (RMSE): compute the prediction error over all known user-item interactions.

Final Predicted Matrix

 Predict the entire user-item interaction matrix: multiply U and V.T to get the predicted ratings.
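The two steps can be illustrated on tiny hand-made matrices. The function name `rmse` and the values of U, V and the known ratings below are assumptions picked so the result is easy to verify by hand.

```python
import math
import numpy as np

def rmse(U, V, ratings):
    """Root mean square error over the known (user, item, rating) triples."""
    err = sum((p - U[i] @ V[j]) ** 2 for i, j, p in ratings)
    return math.sqrt(err / len(ratings))

U = np.array([[1.0, 0.0],
              [0.0, 1.0]])                 # 2 users, 2 latent features
V = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])                 # 3 items, 2 latent features

known = [(0, 0, 2.0), (0, 2, 2.0), (1, 1, 3.0), (1, 2, 0.0)]

P_hat = U @ V.T                            # predictions for every user-item pair
print(P_hat)
print(rmse(U, V, known))
```

Here the predictions are [[2, 0, 1], [0, 3, 1]], so the four errors are 0, 1, 0 and -1 and the RMSE is sqrt(2/4). Entries of `P_hat` with no known rating are the candidates for recommendation.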

Code:

import math
import numpy as np

# Define the user-item ratings matrix pu:
# one list per user, each tuple is (user, item, rating)
pu = [
    [(0, 0, 1), (0, 1, 22), (0, 2, 1), (0, 3, 1), (0, 5, 0)],
    [(1, 0, 1), (1, 1, 32), (1, 2, 0), (1, 3, 0), (1, 4, 1), (1, 5, 0)],
    [(2, 0, 0), (2, 1, 18), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 40), (3, 2, 1), (3, 3, 0), (3, 4, 0), (3, 5, 1)],
    [(4, 0, 0), (4, 1, 40), (4, 2, 0), (4, 4, 1), (4, 5, 0)],
    [(5, 0, 0), (5, 1, 25), (5, 2, 1), (5, 3, 1), (5, 4, 1)],
]

# Define the item-user ratings matrix pv:
# the same ratings, one list per item, each tuple is (item, user, rating)
pv = [
    [(0, 0, 1), (0, 1, 1), (0, 2, 0), (0, 3, 1), (0, 4, 0), (0, 5, 0)],
    [(1, 0, 22), (1, 1, 32), (1, 2, 18), (1, 3, 40), (1, 4, 40), (1, 5, 25)],
    [(2, 0, 1), (2, 1, 0), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 0), (3, 2, 1), (3, 3, 0), (3, 5, 1)],
    [(4, 1, 1), (4, 2, 0), (4, 3, 0), (4, 4, 1), (4, 5, 1)],
    [(5, 0, 0), (5, 1, 0), (5, 2, 1), (5, 3, 1), (5, 4, 0)],
]

# Define matrix V (6 items x 3 latent features), filled with random values
V = np.array([
    [0.15968384, 0.9441198, 0.83651085],
    [0.73573009, 0.24906915, 0.85338239],
    [0.25605814, 0.6990532, 0.50900407],
    [0.2405843, 0.31848888, 0.60233653],
    [0.24237479, 0.15293281, 0.22240255],
    [0.03943766, 0.19287528, 0.95094265],
])

# Initialize matrix U (6 users x 3 latent features) with zeros
U = np.zeros((6, 3))

# Regularization parameter
L = 0.03

# Perform matrix factorization using alternating least squares
for it in range(5):
    print("\n----- ITER %s -----" % (it + 1))

    # Update U while V is fixed, one user at a time
    print("U")
    urs = []
    for uset in pu:
        vo = np.array([V[j] for i, j, p in uset])              # features of the rated items
        pvo = np.array([p for i, j, p in uset], dtype=float)   # the corresponding ratings
        ur = np.linalg.inv(vo.T @ vo + L * np.eye(3)) @ vo.T @ pvo
        urs.append(ur)
    U = np.vstack(urs)
    print(U)

    # Update V while U is fixed, one item at a time
    print("V")
    vrs = []
    for vset in pv:
        uo = np.array([U[i] for j, i, p in vset])              # features of the users who rated it
        puo = np.array([p for j, i, p in vset], dtype=float)   # the corresponding ratings
        vr = np.linalg.inv(uo.T @ uo + L * np.eye(3)) @ uo.T @ puo
        vrs.append(vr)
    V = np.vstack(vrs)
    print(V)

    # Calculate RMSE (Root Mean Squared Error) over all known ratings
    err = 0.0
    n = 0
    for uset in pu:
        for i, j, p in uset:
            err += (p - U[i] @ V[j]) ** 2
            n += 1
    rmse = math.sqrt(err / n)
    print("RMSE:", rmse)

# Print final U * V.T
print("\nFinal U * V.T")
print(U @ V.T)

Output:


Questions:

1. Explain PCA in detail.

2. Define Alternating Least Squares.

3. Write a program for a recommendation system using Python.
