DSV Module-3
• To construct a logistic regression model predicting user return behaviour, the initial
focus lies in attaining a functional model before refinement. Irrespective of the
subsequent time frame, classification c_i = 1 designates a returning user. The logistic
regression formula targeted is:
P(c_i = 1 | x_i) = 1 / (1 + e^−(α + β·x_i))
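A minimal sketch of fitting such a model, assuming a hypothetical feature set (dragons slain, hours played, age) and scikit-learn; the data values here are invented for illustration only:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: columns = [dragons_slain, hours_played, age]
X = np.array([[3, 10.5, 21],
              [0,  0.5, 35],
              [7, 25.0, 19],
              [1,  2.0, 42],
              [5, 14.0, 27]])
y = np.array([1, 0, 1, 0, 1])  # c_i = 1 means the user returned

model = LogisticRegression().fit(X, y)
# Estimated probability that a new user returns next month
print(model.predict_proba([[2, 6.0, 30]])[:, 1])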
Questions:
A. Selecting an algorithm
Stepwise regression is a category of feature selection methods which involves
systematically adjusting feature sets within regression models, typically through
forward selection, backward elimination, or a combined approach, to
optimize model performance based on predefined selection criteria.
Forward selection:
Forward selection involves systematically adding features to a regression
model one at a time based on their ability to improve model performance
according to a selection criterion. This iterative process continues until further
feature additions no longer enhance the model's performance.
Backward elimination:
Backward elimination begins with a regression model containing all features. One
feature is then removed at a time: the feature whose removal most improves the
selection criterion. Removal stops when dropping any further feature would make
the selection criterion worse.
Combined approach:
The combined approach in feature selection blends forward selection and
backward elimination to strike a balance between maximizing relevance and
minimizing redundancy. It iteratively adds and removes features based on
their significance and impact on model fit, resulting in a subset of features
optimized for predictive power.
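As a hedged illustration of the forward-selection loop described above, here is a small scikit-learn sketch; the dataset and the choice of three features are illustrative only, and SequentialFeatureSelector stops at a fixed number of features rather than when improvement ceases:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# direction="forward" adds one feature at a time, keeping the addition that most
# improves cross-validated R^2; direction="backward" would start from all features
# and drop them one at a time instead.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", scoring="r2", cv=5
)
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))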
B. Selection criterion
The choice of selection criteria in feature selection methods may seem arbitrary.
To address this, experimenting with various criteria can help assess model
robustness. Different criteria may yield diverse models, necessitating the
prioritization of optimization goals based on the problem context and objectives.
R-squared
R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², which can be interpreted as the proportion of
variance explained by the model; higher is better.
p-values
In regression analysis, the interpretation of p-values involves assuming a null
hypothesis where the coefficients (βs) are zero. A low p-value suggests that
observing the data and obtaining the estimated coefficient under the null
hypothesis is highly unlikely, indicating a high likelihood that the coefficient is
non-zero.
AIC (Akaike Information Criterion)
Given by the formula AIC = 2k − 2 ln(L), where k is the number of parameters in the
model and ln(L) is the “maximized value of the log likelihood.” The goal is to
minimize AIC.
BIC (Bayesian Information Criterion)
Given by the formula BIC = k·ln(n) − 2 ln(L), where k is the number of parameters in
the model, n is the number of observations (data points, or users), and ln(L) is
the maximized value of the log likelihood. The goal is to minimize BIC (a short
comparison sketch follows after this list).
Entropy
Entropy is a measure of disorder or impurity in the given dataset.
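A small sketch of how AIC and BIC penalize model size, using statsmodels on made-up data (the variables and sample size are assumptions for illustration only):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # irrelevant extra feature
y = 2.0 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L); lower is better, and the useless
# extra feature is penalized, so the smaller model typically wins on both criteria here.
print(small.aic, small.bic)
print(large.aic, large.bic)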
Questions:
1. Define Feature Selection.
2. Explain Filter Method.
3. Explain Wrapper Method.
4. Explain selecting an algorithm in wrapper method.
5. Explain the different Selection Criteria in feature selection.
3. Embedded Methods: Decision Trees
i)Decision Trees
✓ Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and intuitive
way to make decisions based on data by modelling the relationships between
different variables.
✓ A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes
representing final outcomes or predictions.
✓ Each internal node corresponds to a test on an attribute, each branch corresponds
to the result of the test, and each leaf node corresponds to a class label or a
continuous value. In the context of a data problem, a decision tree is a
classification algorithm.
✓ For the Chasing Dragons example, you want to classify users as “Yes, going to
come back next month” or “No, not going to come back next month.” This isn’t
really a decision in the colloquial sense, so don’t let that throw you.
✓ You know that the class of any given user is dependent on many factors
(number of dragons the user slew, their age, how many hours they already
played the game). And you want to break it down based on the data you’ve
collected. But how do you construct decision trees from data and what
mathematical properties can you expect them to have?
✓ But you want this tree to be based on data and not just what you feel like.
Choosing a feature to pick at each step is like playing the game 20 Questions
really well. You take whatever the most informative thing is first. Let’s
formalize that—we need a notion of “informative.”
✓ For the sake of this discussion, assume we break compound questions into
multiple yes-or-no questions, and we denote the answers by “0” or “1.” Given
a random variable X, we denote by p(X = 1) and p(X = 0) the probability that X
is true or false, respectively.
Entropy
• Entropy is a measure of disorder or impurity in the given dataset.
• In the decision tree, messy data are split based on values of the feature vector
associated with each data point.
• With each split, the data becomes more homogeneous, which decreases the
entropy. However, the data in some nodes will not be homogeneous, and there the
entropy value will not be small. The higher the entropy, the harder it is to draw
any conclusion.
• When the tree finally reaches a terminal or leaf node, maximum purity is
attained.
• For a dataset S that has C classes, where pᵢ is the probability of randomly choosing
a data point from class i, the entropy E(S) can be mathematically represented as
E(S) = − Σᵢ pᵢ log₂(pᵢ), summed over the classes i = 1, …, C.
• For a single binary variable X this becomes
H(X) = − p(X = 1) log₂ p(X = 1) − p(X = 0) log₂ p(X = 0).
When p(X = 1) = 0 or p(X = 0) = 0, the entropy vanishes, i.e., if either option has
probability zero, the entropy is 0 (pure). Because p(X = 1) = 1 − p(X = 0), the
entropy is symmetric about 0.5 and maximized at 0.5, which we can easily confirm
using a bit of calculus. The figure below shows a plot of entropy as a function of
p(X = 1).
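A minimal numpy sketch of the entropy formula above; the probabilities are made up to show that entropy is 0 for a pure node and maximal at 0.5:
import numpy as np

def entropy(probs):
    # E(S) = -sum_i p_i * log2(p_i), skipping classes with probability 0
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

print(entropy([1.0, 0.0]))   # 0.0 -> pure node
print(entropy([0.5, 0.5]))   # 1.0 -> maximum disorder for two classes
print(entropy([0.9, 0.1]))   # ~0.47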
• Example:
# Load the rpart library, which is used for recursive partitioning and regression trees.
library(rpart)
# Plot the classification tree. The plot function visualizes the tree structure, and the
# uniform=TRUE argument makes the branch lengths uniform. The main argument specifies
# the title of the plot.
plot(model1, uniform=TRUE, main="Classification Tree for Chasing Dragons")
# Add text labels to the tree plot. The text function annotates the tree plot with
# information about the nodes. The use.n=TRUE argument includes the number of
# observations at each node, all=TRUE ensures all nodes are labeled, and cex=.8 sets
# the size of the text labels.
text(model1, use.n=TRUE, all=TRUE, cex=.8)
ii)Random Forest
• Random forests generalize decision trees with bagging, otherwise known as
Bootstrap Aggregating.
• Bagging makes the models more accurate and more robust, but at the cost of
interpretability.
• They are still easy to specify – 2 hyperparameters: the number of trees (N) in the
forest and the number of features (F) to randomly select for each tree.
• A bootstrap sample is a sample with replacement, which means we might
sample the same data point more than once. We usually take the sample size
to be 80% of the size of the entire (training) dataset, but of course this parameter
can be adjusted depending on circumstances. This is technically a third
hyperparameter of our random forest algorithm.
• To construct a random forest, you construct N decision trees as follows:
• For each tree, take a bootstrap sample of your data, and for each node
you randomly select F features, say 5 out of the 100 total features.
• Then you use your entropy-information-gain engine as described in the
previous section to decide which among those features you will split
your tree on at each stage.
• Note that you could decide beforehand how deep the tree should get, or you
could prune your trees after the fact, but you typically don’t prune the trees in
random forests, because a great feature of random forests is that they can
incorporate idiosyncratic noise.
• Algorithm (see the sketch below):
• Select K random data points from the Training Set.
• Build a Decision Tree based on the selected K points.
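A hedged scikit-learn sketch of this procedure; n_estimators plays the role of N, max_features the role of F, and max_samples the 80% bootstrap size. The Chasing Dragons data is not available here, so a synthetic dataset stands in for it:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # N: number of trees in the forest
    max_features=5,     # F: features randomly selected at each split
    max_samples=0.8,    # bootstrap sample size as a fraction of the training set
    bootstrap=True,
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))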
3.4 User Retention: Interpretability Vs. Predictive Power
• Assume a model predicts well, but can you find the meaning in the model?
• Example: The prediction could be “the more the user plays in the first month, the
more likely the user is to come back next month.” This is obvious and not very
helpful when doing the analysis.
• The model could also tell you that showing users ads in the first 5 minutes decreases
their chances of coming back, but that it’s OK to show ads after the first hour. This
would give an insight: don’t show ads in the first 5 minutes.
• To study this more, you really would want to do some A/B testing, but this initial
model and feature selection would help you prioritize the types of tests you might
want to run.
• Features that are associated with the user’s behaviour are qualitatively different
from features associated with your own behaviour, i.e., actions you (the game
designer) can actually take.
• If there’s a correlation of getting a high number of points in the first month with
players returning to play next month, does that mean if you just give users a high
number of points this month without them playing at all, they’ll come back – No!
• It’s not the number of points that caused them to come back, it’s that they’re really
into playing the game which correlates with both their coming back and their getting
a high number of points.
• Therefore, do feature selection with all variables, but then focus on the ones you
can do something about conditional on user attributes.
Questions:
• We can denote this as a bipartite graph (shown again in Figure below) if each user
and each item has a node to represent it—there are lines from a user to an item if
that user has expressed an opinion about that item.
• Note they might not always love that item, so the edges could have weights: they
could be positive, negative, or on a continuous scale (or discontinuous, but many-
valued like a star system).
• The implications of this choice can be heavy, but we won’t delve too deep here; for
us the weights are numeric ratings.
• Next up, you have training data in the form of some preferences—you know some
of the opinions of some of the users on some of the items. From those training data,
you want to predict other preferences for your users. That’s essentially the output
for a recommendation engine.
• You may also have metadata on users (i.e., they are male or female, etc.) or on items
(the color of the product). For example, users come to your website and set up
accounts, so you may know each user’s gender, age, and preferences for up to three
items.
• You represent a given user as a vector of features, sometimes including only
metadata—sometimes including only preferences (which would lead to a sparse
vector because you don’t know all the user’s opinions) and sometimes including
both, depending on what you’re doing with the vector.
• Also, you can sometimes bundle all the user vectors together to get a big user
matrix, which we call U, through abuse of notation.
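A tiny illustrative sketch (invented data) of such a user matrix in the simple case where the user vectors contain only item preferences, with NaN marking opinions we have not observed:
import numpy as np
import pandas as pd

U = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 4.0, np.nan],
     [2.0, 1.0, np.nan]],
    index=["user_1", "user_2", "user_3"],
    columns=["item_A", "item_B", "item_C"],
)
print(U)   # each row is one user's (sparse) preference vector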
In the singular value decomposition (SVD) X = USVᵀ, the singular values are stored in the
diagonal matrix S.
• Latent Features: These are the hidden patterns or factors in the data. For a user-item
matrix, they represent underlying user preferences and item characteristics.
• Importance of Latent Features: The singular values in S indicate the importance of these
features.
• A larger singular value means the corresponding latent feature is more significant in
capturing the structure of the data.
• Given an m×n matrix X of rank k: X = USVᵀ
• U: m×k matrix (Left Singular Vectors) that contains user latent features. Each row
corresponds to a user, and each column represents a latent feature. For example, a row
might capture a user's preference for genres like action or comedy.
• S: k×k diagonal matrix with singular values. These values indicate the importance of
each latent feature. Larger values correspond to more significant features.
• V: n×k matrix (Right Singular Vectors) that contains item latent features. Each row
corresponds to an item, and each column represents a latent feature. For example, a row
might capture an item's characteristics like genre or popularity.
• U and V: The columns of U and V are orthogonal, meaning they capture independent
latent features.
• S: The singular values in S measure the importance of each latent feature. Larger values
indicate more significant features.
1. Compute SVD:
• Perform SVD on the matrix X to get U, S, and V.
2. Select Top d Singular Values:
• Identify the top d largest singular values from S. These singular values correspond
to the most significant components of the data.
3. Construct Reduced Matrices:
• Construct the reduced matrices U_d, S_d, and V_d from the top d singular values and
the corresponding singular vectors.
4. Approximate the Original Matrix:
X_d = U_d S_d V_dᵀ
• This X_d is the best rank-d approximation of X, using only the top d singular values.
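A hedged numpy sketch of these four steps on a small made-up ratings matrix; the values of X and the choice d = 2 are illustrative only:
import numpy as np

X = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)      # step 1: compute the SVD

d = 2                                                 # step 2: keep the top d singular values
U_d, S_d, Vt_d = U[:, :d], np.diag(s[:d]), Vt[:d, :]  # step 3: reduced matrices

X_d = U_d @ S_d @ Vt_d                                # step 4: X_d = U_d S_d V_d^T
print(np.round(X_d, 2))

# A single predicted preference is the dot product of a user row and an item column:
print(U_d[0] @ S_d @ Vt_d[:, 0])                      # user 0's reconstructed score for item 0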
• xij is the actual interaction (e.g., rating) between user i and item j.
• ui is the i-th row of matrix U, representing the latent features of user i.
• vj is the j-th row of matrix V, representing the latent features of item j.
• The dot product ui⋅vj is the predicted preference of user i for item j.
• Number of Latent Features (d): This is a parameter that you choose, representing
the number of latent features you want to use. It controls the dimensions of matrices
U and V.
• Matrix U: Has dimensions m×d, where m is the number of users and d is the number
of latent features. Each row corresponds to a user.
• Matrix V: Has dimensions n×d, where n is the number of items and d is the number
of latent features. Each row corresponds to an item.
PCA
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load dataset
# 'diabetes.csv' is the filename of the dataset. The data is read into a pandas DataFrame.
data = pd.read_csv(r"D:\Files\dsv-dataset\1. DataSets\diabetes.csv")
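# Standardize the features before PCA so each column has zero mean and unit variance.
# (Assumed intermediate step: the label column is taken to be 'Outcome', as in the
# usual Pima Indians diabetes dataset.)
features = data.drop(columns=["Outcome"])
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)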
# Perform PCA
# Create a PCA object with the number of components we want to keep (2 in this case).
# Fit the PCA model to the scaled data and transform the data to the new PCA space.
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_features)
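# Visualize the data in the two-component PCA space (assumed continuation; colouring
# by the assumed 'Outcome' column is illustrative only).
plt.figure(figsize=(8, 6))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=data["Outcome"], cmap="viridis", alpha=0.7)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of the diabetes dataset")
plt.colorbar(label="Outcome")
plt.show()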
• Here we are not first minimizing the squared error and then minimizing the size of the
entries of the matrices U and V. Here we are actually doing both at the same time.
• Alternating Least Squares (ALS) is an algorithm for matrix factorization. ALS is used
to decompose a given user-item interaction matrix into two lower-dimensional matrices,
U (user latent features) and V (item latent features), as follows:
a. Pick a random V.
b. Optimize U while V is fixed.
c. Optimize V while U is fixed.
• Keep doing the preceding two steps until you’re not changing very much at all. To be
precise, you choose an ϵ and if your coefficients are each changing by less than ϵ, then
you declare your algorithm “converged.”
• Fix V and Update U: the way you do this optimization is user by user. So for user i,
you want to find the u_i that minimizes
Σ over the items j rated by user i of (x_ij − u_i · v_j)²,
where the v_j are fixed. In other words, you just care about this user for now. But wait a
minute, this is the same as linear least squares, and has a closed form solution! In other
words, set
u_i = (V*,iᵀ V*,i)⁻¹ V*,iᵀ x*,i,
where V*,i is the subset of V for which you have preferences coming from user i, and
x*,i is the vector of user i's known preferences. Taking the inverse is easy because it's
d×d, which is small. And there aren't that many preferences per user, so solving this
many times is really not that hard. Overall you've got a doable update for U.
• When you fix U and optimize V, it’s analogous—you only ever have to consider the
users that rated that movie, which may be pretty large for popular movies but on average
isn’t; but even so, you’re only ever inverting a d×d matrix.
3.8 Building Recommendation System using Python
Initialize Matrix U:
ALS Algorithm
The ALS algorithm alternates between updating the user latent features (U) and the item
latent features (V) to minimize the squared error of the predicted ratings.
For each user:
• Extract the items they have interacted with and their corresponding ratings.
• Create a matrix vo containing the latent features of these items.
• Create a vector pvo of the ratings.
• `V_i` is the submatrix of `V` corresponding to the items user `i` has rated.
Error Calculation
• Calculate the root mean square error (RMSE):Compute the prediction error for
all known user-item interactions.
• Final Predicted Matrix
• Predict the entire user-item interaction matrix:
• Multiply U and V.T to get the predicted ratings.
Code:
import math
import numpy as np
# Define the user-item ratings matrix pu
pu = [
[(0, 0, 1), (0, 1, 22), (0, 2, 1), (0, 3, 1), (0, 5, 0)],
[(1, 0, 1), (1, 1, 32), (1, 2, 0), (1, 3, 0), (1, 4, 1), (1, 5, 0)],
[(2, 0, 0), (2, 1, 18), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
[(3, 0, 1), (3, 1, 40), (3, 2, 1), (3, 3, 0), (3, 4, 0), (3, 5, 1)],
[(4, 0, 0), (4, 1, 40), (4, 2, 0), (4, 4, 1), (4, 5, 0)],
[(5, 0, 0), (5, 1, 25), (5, 2, 1), (5, 3, 1), (5, 4, 1)]
]
# Define the item-user ratings matrix pv
pv = [
[(0, 0, 1), (0, 1, 1), (0, 2, 0), (0, 3, 1), (0, 4, 0), (0, 5, 0)],
[(1, 0, 22), (1, 1, 32), (1, 2, 18), (1, 3, 40), (1, 4, 40), (1, 5, 25)],
[(2, 0, 1), (2, 1, 0), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
[(3, 0, 1), (3, 1, 0), (3, 2, 1), (3, 3, 0), (3, 5, 1)],
[(4, 1, 1), (4, 2, 0), (4, 3, 0), (4, 4, 1), (4, 5, 1)],
[(5, 0, 0), (5, 1, 0), (5, 2, 1), (5, 3, 1), (5, 4, 0)]
]
# Define matrix V
V = np.mat([
[0.15968384, 0.9441198, 0.83651085],
[0.73573009, 0.24906915, 0.85338239],
[0.25605814, 0.6990532, 0.50900407],
[0.2405843, 0.31848888, 0.60233653],
[0.24237479, 0.15293281, 0.22240255],
[0.03943766, 0.19287528, 0.95094265]
])
# Initialize matrix U with zeros
U = np.mat(np.zeros([6, 3]))
# Regularization parameter
L = 0.03
# Perform matrix factorization using alternating least squares
for iter in range(5):
    # Update U: for each user, solve a regularized least squares problem with V fixed
    urs = []
    for uset in pu:
        vo = []
        pvo = []
        for i, j, p in uset:
            vor = []
            for k in range(3):
                vor.append(V[j, k])
            vo.append(vor)
            pvo.append(p)
        vo = np.mat(vo)
        ur = np.linalg.inv(vo.T * vo + L * np.mat(np.eye(3))) * vo.T * np.mat(pvo).T
        urs.append(ur.T)
    U = np.vstack(urs)
    print(U)

    print("V")
    vrs = []
    # Update V: for each item, solve the analogous problem with U fixed
    for vset in pv:
        uo = []
        puo = []
        for j, i, p in vset:
            uor = []
            for k in range(3):
                uor.append(U[i, k])
            uo.append(uor)
            puo.append(p)
        uo = np.mat(uo)
        vr = np.linalg.inv(uo.T * uo + L * np.mat(np.eye(3))) * uo.T * np.mat(puo).T
        vrs.append(vr.T)
    V = np.vstack(vrs)
    print(V)

# Calculate RMSE (Root Mean Squared Error)
err = 0.
n = 0.
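# Assumed completion of the truncated RMSE step: loop over every known
# (user, item, rating) triple in pu, accumulate the squared prediction error,
# and take the root of the mean.
for uset in pu:
    for i, j, p in uset:
        pred = np.asarray(U[i]).ravel() @ np.asarray(V[j]).ravel()  # u_i . v_j
        err += (p - pred) ** 2
        n += 1
print("RMSE:", math.sqrt(err / n))

# Final predicted user-item matrix: multiply U by V transpose.
print(np.asarray(U) @ np.asarray(V).T)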
Questions: