Module 3
Suppose an app called Chasing Dragons charges a monthly subscription fee, with
revenue increasing with more users.
However, only 10% of new users return after the first month.
To boost revenue, there are two options: increase the retention rate of existing
users or acquire new ones.
Generally, retaining existing customers is cheaper than acquiring new ones.
Focusing on retention, a model could be built to predict if a new user will return next
month based on their behavior this month.
This model could help in providing targeted incentives, such as a free month, to users
predicted to need extra encouragement to stay.
A good crude model is logistic regression, which gives the probability that the user
returns in their second month, conditional on their activities in the first month.
User behavior is recorded for the first 30 days after sign-up, logging every action with
timestamps: for example, a user clicked "level 6" at 5:22 a.m., slew a dragon at 5:23
a.m., earned 22 points at 5:24 a.m., and was shown an ad at 5:25 a.m. This phase
involves collecting data on every possible user action.
User actions, ranging from thousands to just a few, are stored in timestamped event
logs.
These logs need to be processed into a dataset with rows representing users and columns
representing features. This phase, known as feature generation, involves
brainstorming potential features without being selective.
The data science team, including game designers, software engineers, statisticians, and
marketing experts, collaborates to identify relevant features.
Here are some examples:
Number of days the user visited in the first month
Amount of time until second visit
Number of points on day j for j=1, . . .,30 (this would be 30 separate
features)
Total number of points in first month (sum of the other features)
Did user fill out Chasing Dragons profile (binary 1 or 0)
Age and gender of user
Screen size of device
Notice there are redundancies and correlations between these features; that’s OK.
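As a rough illustration of feature generation, the sketch below turns a small, made-up event
log into one row of features per user. The tuple layout, the action names, and the
`build_features` helper are assumptions chosen for illustration; they are not the app's real
logging format.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, day_since_signup, action, value).
events = [
    ("u1", 1, "points", 22),
    ("u1", 1, "ad_shown", 0),
    ("u1", 4, "points", 10),
    ("u2", 1, "points", 5),
    ("u2", 1, "profile_filled", 0),
]

def build_features(events):
    """Aggregate raw events into one feature dict per user."""
    days = defaultdict(set)
    points = defaultdict(int)
    profile = defaultdict(int)
    for user, day, action, value in events:
        days[user].add(day)
        if action == "points":
            points[user] += value
        elif action == "profile_filled":
            profile[user] = 1
    features = {}
    for user, visited in days.items():
        ordered = sorted(visited)
        features[user] = {
            "days_visited": len(ordered),
            # Days until the second visit; None if the user never came back.
            "days_to_second_visit": ordered[1] - ordered[0] if len(ordered) > 1 else None,
            "total_points": points[user],
            "filled_profile": profile[user],
        }
    return features

features = build_features(events)
```

Each dictionary corresponds to one row of the dataset; redundant columns (such as per-day points
alongside the monthly total) could be added in the same loop.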
To build a logistic regression model that predicts whether a user returns, aim first for a
working model and refine it afterwards. Let c_i = 1 denote that user i returns in the
second month, regardless of what happens in later months. The logistic
regression formula targeted is:
$\operatorname{logit}\big(P(c_i = 1 \mid x_i)\big) = \alpha + \beta^{T} x_i$
where x_i is the feature vector of user i.
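A minimal sketch of how a fitted logistic regression turns a user's features into a return
probability. The intercept `alpha`, the weights `beta`, and the two features are invented
values for illustration, not estimates from real data.

```python
import math

def return_probability(alpha, beta, x):
    """P(c_i = 1 | x_i) under a logistic model with intercept alpha and weights beta."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients for two features: days visited, profile filled (0/1).
alpha = -2.0
beta = [0.3, 0.5]

# A user who visited on 10 days and filled out the profile.
p = return_probability(alpha, beta, [10, 1])
```

Users whose predicted probability falls below some chosen cutoff would be the candidates for a
targeted incentive.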
Questions:
Feature selection involves identifying the most relevant and informative features from a dataset
for building predictive models.
1. Filters:
2. Wrappers
A. Selecting an algorithm
Forward selection:
Backward elimination:
Combined approach:
B. Selection criterion
The choice of selection criteria in feature selection methods may seem arbitrary.
To address this, experimenting with various criteria can help assess model
robustness. Different criteria may yield diverse models, necessitating the
prioritization of optimization goals based on the problem context and objectives.
R-squared
p-values
Entropy
Questions:
i) Decision Trees
Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and
intuitive way to make decisions based on data by modelling the relationships
between different variables.
A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes
representing final outcomes or predictions.
Each internal node corresponds to a test on an attribute, each branch corresponds
to the result of the test, and each leaf node corresponds to a class label or a
continuous value. In the context of a data problem, a decision tree is a
classification algorithm.
For the Chasing Dragons example, you want to classify users as “Yes, going to
come back next month” or “No, not going to come back next month.” This isn’t
really a decision in the colloquial sense, so don’t let that throw you.
You know that the class of any given user is dependent on many factors
(number of dragons the user slew, their age, how many hours they already
played the game). And you want to break it down based on the data you’ve
collected. But how do you construct decision trees from data and what
mathematical properties can you expect them to have?
But you want this tree to be based on data and not just what you feel like.
Choosing a feature to pick at each step is like playing the game 20 Questions
really well. You take whatever the most informative thing is first. Let’s
formalize that—we need a notion of “informative.”
For the sake of this discussion, assume we break compound questions into
multiple yes-or-no questions, and we denote the answers by “0” or “1.” Given
a random variable X, we denote by p(X=1) and p(X=0) the probability that X
is true or false, respectively.
Entropy
In the decision tree, messy data are split based on values of the feature vector
associated with each data point.
With each split, the data become more homogeneous, which decreases the
entropy. Some nodes will still contain a mix of classes, and their
entropy will not be small. The higher the entropy, the harder it is to draw
any conclusion.
When the tree finally reaches a terminal (leaf) node, maximum purity is
reached.
For a dataset S with C classes, let p_i be the probability of randomly choosing
a data point from class i. The entropy E(S) can then be written as
$E(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$
For a binary variable X this reduces to
$H(X) = -p(X=1)\log_2 p(X=1) - p(X=0)\log_2 p(X=0)$
When p(X=1) = 0 or p(X=0) = 0, the entropy vanishes (taking 0 log 0 = 0), i.e.,
if either option has probability zero, the entropy is 0 (pure). Because
p(X=1) = 1 − p(X=0), the entropy is symmetric about 0.5 and maximized at 0.5.
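The behaviour of entropy described above can be checked numerically. This is a small sketch;
the `entropy` helper is a name introduced here for illustration, and it measures entropy in
bits (base-2 logarithm).

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute 0 by convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])   # maximally mixed binary node
h_pure = entropy([1.0, 0.0])   # pure node
h_low = entropy([0.1, 0.9])    # mostly one class
```

Swapping the two probabilities leaves the value unchanged, which is the symmetry about 0.5.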
Example:
Construction starts at the root. You need an algorithm to decide which attribute to split
on, i.e., which node should be created next.
The attribute is chosen in order to maximize information gain.
Keep going until all the points at the end are in the same class or you end up
with no features left. In the latter case, you take the majority vote.
The tree can be pruned to avoid overfitting, for example by cutting it off below a certain
depth.
If you build the entire tree, it’s often less accurate with new data than if you
prune it.
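To make "maximize information gain" concrete, the sketch below computes the gain of each
candidate attribute on a tiny made-up dataset and picks the best one for the root split. The
feature names and the `information_gain` helper are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())
    return entropy_of(labels) - remainder

# Made-up users: binary features and whether the user returned (1) or not (0).
rows = [
    {"filled_profile": 1, "saw_early_ad": 0},
    {"filled_profile": 1, "saw_early_ad": 1},
    {"filled_profile": 0, "saw_early_ad": 0},
    {"filled_profile": 0, "saw_early_ad": 1},
]
returned = [1, 1, 0, 0]

gains = {f: information_gain(rows, returned, f) for f in rows[0]}
best = max(gains, key=gains.get)  # attribute to split on at the root
```

In this toy data `filled_profile` separates the classes perfectly, so it has the larger gain;
the same computation would be repeated inside each child node to grow the tree.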
Example:
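# Load the rpart library, which is used for recursive partitioning and
# regression trees.
library(rpart)
# Grow a classification tree for the binary outcome Return, using all other
# columns as predictors. The data frame name `chasingdragons` is an
# assumption; the source does not show the model-fitting call.
model1 <- rpart(Return ~ ., method = "class", data = chasingdragons)
printcp(model1)   # display the complexity-parameter table
plotcp(model1)    # visualize the cross-validation results
summary(model1)   # detailed summary of the splits
# Plot the tree, then add text labels. The text function annotates the tree
# plot with information about the nodes: use.n=TRUE includes the number of
# observations at each node, all=TRUE ensures all nodes are labeled, and
# cex=.8 sets the size of the text labels.
plot(model1, uniform = TRUE)
text(model1, use.n = TRUE, all = TRUE, cex = .8)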
Questions:
3. Suppose you have your Chasing Dragons dataset. Your outcome variable is Return: a
binary variable that captures whether or not the user returns next month, and you have tons
of predictors. Write an R script using the decision tree algorithm for the above scenario.
ii) Random Forest
A random forest makes the model more accurate and more robust, but at the cost of
interpretability.
It is still easy to specify, with two hyperparameters: the number of trees (N) in the forest
and the number of features (F) to randomly select at each node of each tree.
A bootstrap sample is a sample with replacement, which means we might
sample the same data point more than once. We usually take the sample size
to be 80% of the size of the entire (training) dataset, but this parameter
can be adjusted depending on circumstances. It is technically a third
hyperparameter of the random forest algorithm.
To construct a random forest, you construct N decision trees as follows:
For each tree, take a bootstrap sample of your data, and for each node
you randomly select F features, say 5 out of the 100 total features.
Then you use your entropy-information-gain engine as described in the
previous section to decide which among those features you will split
your tree on at each stage.
Note that you could decide beforehand how deep the tree should get, or you
could prune your trees after the fact, but you typically don’t prune the trees in
random forests, because a great feature of random forests is that they can
incorporate idiosyncratic noise.
Algorithm
1. Select K random data points from the training set (a bootstrap sample).
2. Build a decision tree based on the selected K points.
3. Choose the number of trees N to build and repeat Steps 1 and 2 for each tree.
4. For a new (test) data point, have each tree predict the class of the data point, and
assign the data point the class chosen by the majority of the trees (for a numeric
target, the average of the trees' predictions).
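The steps above can be sketched in a few lines. This is a deliberately tiny illustration, not a
production random forest: each "tree" is a one-split stump chosen by training error rather
than a full entropy-based tree, and the data, the function names, and the seed are assumptions
made up for the example. It does show the three ingredients: bootstrap samples, random feature
subsets, and a majority vote.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """One-split tree: best (feature, threshold) among feature_ids by training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not right:
                continue  # a threshold at the maximum does not split the data
            l_pred, r_pred = majority(left), majority(right)
            errors = (sum(lab != l_pred for lab in left)
                      + sum(lab != r_pred for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_pred, r_pred)
    if best is None:  # no usable split: always predict the majority class
        m = majority(y)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_pred, r_pred = stump
    return l_pred if row[f] <= t else r_pred

def fit_forest(X, y, n_trees, n_features, sample_frac=0.8, seed=0):
    rng = random.Random(seed)
    k = max(1, int(sample_frac * len(X)))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(k)]    # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over the trees."""
    return majority([predict_stump(s, row) for s in forest])

# Made-up data: two numeric features, binary class.
X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, n_features=1)
```

Calling `predict_forest(forest, row)` on a new row returns the class that most of the 25 stumps
vote for.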
Questions
Assume a model predicts well, but can you find the meaning in the model?
Example: The prediction could be “the more the user plays in the first month, the
more likely the user is to come back next month”. This is obvious and not very
helpful when doing the analysis
It could also tell you that showing users ads in the first 5 minutes decreases their chances
of coming back, but that it’s OK to show ads after the first hour. This would suggest
not showing ads in the first 5 minutes.
To study this more, you really would want to do some A/B testing, but this initial
model and feature selection would help you prioritize the types of tests you might
want to run.
Features that describe the user’s behaviour are qualitatively different
from features that describe your own behaviour as the app’s owner, such as when you show ads.
If getting a high number of points in the first month is correlated with
players returning to play next month, does that mean that if you just give users a high
number of points this month without them playing at all, they’ll come back? No!
It’s not the number of points that caused them to come back, it’s that they’re really
into playing the game which correlates with both their coming back and their getting
a high number of points.
Therefore, do feature selection with all variables, but then focus on the ones you
can do something about conditional on user attributes.
Questions:
Recommendation engines are used all the time—what movie would you like,
knowing other movies you liked? What book would you like, keeping in mind past
purchases?
There are plenty of different ways to go about building such a model, but they have
very similar feels if not implementation. To set up a recommendation engine,
suppose you have users, which form a set U; and you have items to recommend,
which form a set V.
We can denote this as a bipartite graph (shown again in Figure below) if each user
and each item has a node to represent it—there are lines from a user to an item if
that user has expressed an opinion about that item.
Note they might not always love that item, so the edges could have weights: they
could be positive, negative, or on a continuous scale (or discontinuous, but many-
valued like a star system).
The implications of this choice can be heavy, but we won’t delve too deep here: for
us, they are numeric ratings.
Next up, you have training data in the form of some preferences—you know some
of the opinions of some of the users on some of the items. From those training data,
you want to predict other preferences for your users. That’s essentially the output
for a recommendation engine.
You may also have metadata on users (e.g., whether they are male or female) or on items
(the color of the product). For example, users come to your website and set up
accounts, so you may know each user’s gender, age, and preferences for up to three
items.
You represent a given user as a vector of features, sometimes including only
metadata—sometimes including only preferences (which would lead to a sparse
vector because you don’t know all the user’s opinions) and sometimes including
both, depending on what you’re doing with the vector.
Also, you can sometimes bundle all the user vectors together to get a big user
matrix, which we call U, through abuse of notation.
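As a small sketch of the preference-only representation, the code below bundles known ratings
into a user-by-item matrix in which unknown opinions are left as NaN. The ratings and the matrix
name `X` are made up for illustration.

```python
import numpy as np

# Made-up known preferences: (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 2, 1.0), (1, 1, 4.0), (2, 0, 2.0), (2, 2, 5.0)]
n_users, n_items = 3, 3

# Unknown opinions stay NaN, which is what makes each user vector sparse.
X = np.full((n_users, n_items), np.nan)
for user, item, rating in ratings:
    X[user, item] = rating

known = ~np.isnan(X)  # mask of the entries we actually observed
```

Row i of `X` is the preference vector of user i; metadata columns such as age could be appended
to the same rows when both kinds of feature are needed.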
Singular value decomposition (SVD) factors the m×n user-item matrix X as X = U S V^T, where:
U: m×k matrix (left singular vectors) that contains user latent features. Each row
corresponds to a user, and each column represents a latent feature. For example, a row
might capture a user's preference for genres like action or comedy.
S: k×k diagonal matrix of singular values. These values indicate the importance of
each latent feature. Larger values correspond to more significant features.
V: n×k matrix (right singular vectors) that contains item latent features, so that V^T is
k×n. Each row corresponds to an item, and each column represents a latent feature. For
example, a row might capture an item's characteristics like genre or popularity.
U and V: The columns of U and of V are pairwise orthogonal, so each latent feature captures
structure that is uncorrelated with the others.
1. Compute the SVD: X = U S V^T.
2. Identify the top d largest singular values from S. These singular values correspond
to the most significant components of the data.
3. Keep only the corresponding d columns of U and V and the d×d block of S, and form
$X_d = U_d S_d V_d^T$
This X_d is the best rank-d approximation of X in the least-squares sense, using only the
top d singular values.
This process leverages the properties of the SVD to capture the most important
structures in the data while discarding less important information.
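The truncation can be tried directly with NumPy. This is a minimal sketch on a made-up, fully
observed 4×3 rating matrix; real rating matrices have missing entries, which plain SVD does not
handle.

```python
import numpy as np

# Made-up 4 users x 3 items rating matrix.
X = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 2  # number of latent features to keep
X_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]

# The Frobenius error of the rank-d approximation equals the norm of the
# singular values that were dropped.
error = np.linalg.norm(X - X_d)
```

Note that `np.linalg.svd` returns V already transposed (`Vt`), so the rows of `Vt` are the item
latent directions.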
Questions:
$\tilde{X} = U \cdot V^T$
$\operatorname{argmin}_{U,V} \sum_{i,j} (x_{ij} - u_i \cdot v_j)^2$
xij is the actual interaction (e.g., rating) between user i and item j.
ui is the i-th row of matrix U, representing the latent features of user i.
vj is the j-th row of matrix V, representing the latent features of item j.
The dot product ui⋅vj is the predicted preference of user i for item j.
Number of Latent Features (d): This is a parameter that you choose, representing
the number of latent features you want to use. It controls the dimensions of matrices
U and V.
Matrix U: Has dimensions m×d, where m is the number of users and d is the
number of latent features. Each row corresponds to a user.
Matrix V: Has dimensions n×d, where n is the number of items and d is the number
of latent features. Each row corresponds to an item.
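When a regularization penalty on the size of the entries of U and V is added to the squared
error, we are not first minimizing the squared error and then minimizing the size of the
entries of the matrices U and V. We are doing both at the same time, in a single objective.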
Alternating Least Squares (ALS) is an algorithm for matrix factorization. ALS is used
to decompose a given user-item interaction matrix into two lower-dimensional matrices
(U and V) that capture latent features of users and items.
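The algorithm proceeds as follows:
1. Pick a random V.
2. Optimize U while V is fixed.
3. Optimize V while U is fixed.
4. Repeat steps 2 and 3 until U and V stop changing very much.
When V is fixed, the problem separates by user. For user i, you solve
$\operatorname{argmin}_{u_i} \sum_{j \in P_i} (x_{ij} - u_i \cdot v_j)^2$
where vj is fixed and P_i is the set of items for which you have a preference from user i.
In other words, you just care about this user for now. This is the same as linear least
squares, and it has a closed-form solution. In other words, set:
$u_i = (V_{*,i}^T V_{*,i})^{-1} V_{*,i}^T x_{*,i}$
where V*,i is the subset of V for which you have preferences coming from user i, and x*,i
is the vector of those preferences.
Taking the inverse is easy because it’s d×d, which is small. And there aren’t that many
preferences per user, so solving this many times is really not that hard. Overall you’ve
got a doable update for U.
When you fix U and optimize V, it’s analogous: you only ever have to consider the
users that rated that movie, which may be a pretty large set for popular movies but on
average isn’t; even so, you’re only ever inverting a d×d matrix.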
Initialize Matrix V:
Initialize Matrix U:
ALS Algorithm
The ALS algorithm alternates between updating the user latent features (U) and the item
latent features (V) to minimize the squared error of the predicted ratings.
• Extract the items they have interacted with and their corresponding ratings.
• `V_i` is the submatrix of `V` corresponding to the items user `i` has rated.
Error Calculation
Calculate the root mean square error (RMSE): compute the prediction error for
all known user-item interactions.
Final Predicted Matrix
Predict the entire user-item interaction matrix:
Multiply U and V.T to get the predicted ratings.
Code:
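import math
import numpy as np

# Known ratings grouped by user: each tuple is (user, item, rating).
pu = [
    [(0, 0, 1), (0, 1, 22), (0, 2, 1), (0, 3, 1), (0, 5, 0)],
    [(1, 0, 1), (1, 1, 32), (1, 2, 0), (1, 3, 0), (1, 4, 1), (1, 5, 0)],
    [(2, 0, 0), (2, 1, 18), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 40), (3, 2, 1), (3, 3, 0), (3, 4, 0), (3, 5, 1)],
    [(4, 0, 0), (4, 1, 40), (4, 2, 0), (4, 4, 1), (4, 5, 0)],
    [(5, 0, 0), (5, 1, 25), (5, 2, 1), (5, 3, 1), (5, 4, 1)],
]
# The same ratings grouped by item: each tuple is (item, user, rating).
pv = [
    [(0, 0, 1), (0, 1, 1), (0, 2, 0), (0, 3, 1), (0, 4, 0), (0, 5, 0)],
    [(1, 0, 22), (1, 1, 32), (1, 2, 18), (1, 3, 40), (1, 4, 40), (1, 5, 25)],
    [(2, 0, 1), (2, 1, 0), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 0), (3, 2, 1), (3, 3, 0), (3, 5, 1)],
    [(4, 1, 1), (4, 2, 0), (4, 3, 0), (4, 4, 1), (4, 5, 1)],
    [(5, 0, 0), (5, 1, 0), (5, 2, 1), (5, 3, 1), (5, 4, 0)],
]

D = 3       # number of latent features
# Regularization parameter
L = 0.03
N_ITER = 5  # assumption: the iteration count is not given in the source

# Define matrix V. The source's initial values are not given, so a seeded
# random 6x3 matrix is assumed here ("pick a random V").
rng = np.random.default_rng(0)
V = rng.random((6, D))
# U is overwritten by the first update, so zeros are fine as a start.
U = np.zeros((6, D))

for it in range(N_ITER):
    print("\n----- ITER %d -----" % (it + 1))

    # Update U: one regularized least-squares solve per user, with V fixed.
    print("U")
    urs = []
    for uset in pu:
        vo = np.array([V[j] for i, j, p in uset])   # rows of V for the rated items
        pvo = np.array([p for i, j, p in uset], dtype=float)
        ur = np.linalg.solve(vo.T @ vo + L * np.eye(D), vo.T @ pvo)
        urs.append(ur)
    U = np.vstack(urs)
    print(U)

    # Update V: one regularized least-squares solve per item, with U fixed.
    print("V")
    vrs = []
    for vset in pv:
        uo = np.array([U[i] for j, i, p in vset])   # rows of U for the rating users
        puo = np.array([p for j, i, p in vset], dtype=float)
        vr = np.linalg.solve(uo.T @ uo + L * np.eye(D), uo.T @ puo)
        vrs.append(vr)
    V = np.vstack(vrs)
    print(V)

    # RMSE over all known user-item ratings.
    err = 0.
    n = 0
    for uset in pu:
        for i, j, p in uset:
            err += (p - U[i] @ V[j]) ** 2
            n += 1
    rmse = math.sqrt(err / n)
    print("RMSE:", rmse)

print("\nFinal U * V.T")
print(U @ V.T)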
Output:
Questions: