
DATA SCIENCE AND VISUALIZATION (21CS644)

Module 3- Feature Generation and Feature Selection


Extracting Meaning from Data
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer)
Module 3 retention. Feature Generation (brainstorming, role of domain expertise, and
Syllabus place for imagination), Feature Selection algorithms. Filters; Wrappers;
Decision Trees; Random Forests. Recommendation Systems: Building a
User-Facing Data Product, Algorithmic ingredients of a Recommendation
Engine, Dimensionality Reduction, Singular Value Decomposition,
Principal Component Analysis, Exercise: build your own recommendation
system.

Motivating Application: User Retention


3.1 Motivating application: user (customer) retention
• Suppose an app called Chasing Dragons charges a monthly subscription fee, with
revenue increasing with more users.
• However, only 10% of new users return after the first month.
• To boost revenue, there are two options: increase the retention rate of existing
users or acquire new ones.
• Generally, retaining existing customers is cheaper than acquiring new ones.
• Focusing on retention, a model could be built to predict if a new user will return next
month based on their behavior this month.
• This model could help in providing targeted incentives, such as a free month, to users
predicted to need extra encouragement to stay.
• A good crude model: Logistic Regression – Gives the probability the user returns their
second month conditional on their activities in the first month.
• User behavior is recorded for the first 30 days after sign-up, logging every action with
timestamps: for example, a user clicked "level 6" at 5:22 a.m., slew a dragon at 5:23
a.m., earned 22 points at 5:24 a.m., and was shown an ad at 5:25 a.m. This phase
involves collecting data on every possible user action.
• User actions, ranging from thousands to just a few, are stored in timestamped event
logs.
• These logs need to be processed into a dataset with rows representing users and columns
representing features. This phase, known as feature generation, involves
brainstorming potential features without being selective.
• The data science team, including game designers, software engineers, statisticians, and
marketing experts, collaborates to identify relevant features.
Here are some examples:
✓ Number of days the user visited in the first month
✓ Amount of time until second visit
✓ Number of points on day j, for j = 1, . . ., 30 (this would be 30 separate features)
✓ Total number of points in first month (sum of the other features)


✓ Did user fill out Chasing Dragons profile (binary 1 or 0)


✓ Age and gender of user
✓ Screen size of device
• Notice there are redundancies and correlations between these features; that’s OK.

• To construct a logistic regression model predicting user return behavior, the initial
focus lies in attaining a functional model before refinement. Irrespective of the
subsequent time frame, the classification c_i = 1 designates a returning user. The logistic
regression formula targeted is of the form
P(c_i = 1 | x_i) = logit^(-1)(α + β·x_i) = 1 / (1 + e^(-(α + β·x_i)))
where x_i is the vector of first-month features for user i, and α and β are the parameters to be estimated.
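As a quick, hedged illustration (not part of the original notes), the sketch below fits such a logistic regression with scikit-learn; the file chasing_dragons.csv and the column names num_days_visited, total_points, filled_profile, and returned are hypothetical stand-ins for whatever features you actually generate.

# Hedged sketch: logistic regression for second-month retention.
# File name and columns below are illustrative assumptions, not the notes' own data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("chasing_dragons.csv")                       # hypothetical first-month summary
X = df[["num_days_visited", "total_points", "filled_profile"]]
y = df["returned"]                                            # 1 = user came back in month two

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba gives P(return | first-month behaviour) for each held-out user
print(model.predict_proba(X_test)[:5, 1])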

• Initially, a comprehensive set of features is gathered, encompassing user behavior,
demographics, and platform interactions.
• Following data collection, feature subsets must be refined for optimal predictive
power during model scaling and production.
• Three main methods guide feature subset selection: filters, wrappers, and embedded
methods.
• Filters independently evaluate feature relevance, wrappers use model performance to
assess feature subsets, and embedded methods incorporate feature selection within
model training.
Questions

1. Define customer retention.

2. What are the different relevant features of the Chasing Dragons app?

3. How can the revenue of the Chasing Dragons application be boosted?


Feature Generation or Feature Extraction

3.2 Feature Generation or Feature Extraction

Feature generation, also known as feature extraction, is the process of transforming
raw data into a structured format where each column represents a specific
characteristic or attribute (feature) of the data, and each row represents an
observation or instance.
• This involves identifying, creating, and selecting meaningful variables from the raw
data that can be used in machine learning models to make predictions or understand
patterns.
• This process is both an art and a science. Having a domain expert involved is beneficial,
but using creativity and imagination is equally important.
• Remember, feature generation is constrained by two factors: the feasibility of
capturing certain information and the awareness to consider capturing it.
Information can be categorized into the following buckets:

• Relevant and useful, but it’s impossible to capture it.


Keep in mind that much user information isn't captured, like free time, other apps,
employment status, or insomnia, which might predict their return. Some captured data
may act as proxies for these factors, such as playing the game at 3 a.m. indicating
insomnia or night shifts.
• Relevant and useful, possible to log it, and you did.
The decision to log this information during the brainstorming session was crucial.
However, mere logging doesn't guarantee understanding its relevance or usefulness.
The feature selection process aims to uncover this information.
• Relevant and useful, possible to log it, but you didn’t.
Human limitations can lead to overlooking crucial information, emphasizing the need
for creative feature selection. Usability studies help identify key user actions for better
feature capture.
• Not relevant or useful, but you don’t know that and log it.
Feature selection aims to address this: you have logged certain information
without knowing whether it is actually useful.
• Not relevant or useful, and you either can’t capture it or it didn’t occur to you.

Questions:

1. Define feature generation.

2. Define feature extraction.
3. Define feature generation. Explain how information can be categorized in feature
generation in detail.


Feature Selection: Filters and Wrappers

3.3 Feature Selection Algorithms


Feature selection involves identifying the most relevant and informative features from a dataset
for building predictive models.
1. Filters:
✓ Filters prioritize features based on specific metrics or statistics, such as
correlation with the outcome variable, offering a quick overview of the predictive
power of each individual feature.
✓ However, they may ignore redundancy and fail to consider feature interactions,
potentially resulting in correlated features and limited insight into complex
relationships.
✓ By treating the features as independent, filters do not take into account possible
interactions or correlations between features.
✓ However, in some cases two seemingly redundant features can be more powerful when used
together, and appear useless when considered individually.
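As a rough, hedged sketch (not from the original notes), a simple filter can rank features by the absolute correlation of each column with the outcome; the file and the column name "returned" below are hypothetical.

# Hedged sketch of a filter: rank features by |correlation| with the outcome.
# The file and the 'returned' outcome column are illustrative assumptions.
import pandas as pd

df = pd.read_csv("chasing_dragons.csv")        # hypothetical dataset
outcome = "returned"
feature_cols = [c for c in df.columns if c != outcome]

# Absolute Pearson correlation of each feature with the outcome, sorted descending;
# keep, say, the top-k features. Note this treats each feature independently.
scores = df[feature_cols].corrwith(df[outcome]).abs().sort_values(ascending=False)
print(scores.head(10))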
2. Wrappers
✓ Wrapper feature selection explores subsets of features of a predetermined size,
seeking to identify combinations that optimize model performance.
✓ However, the number of potential combinations grows exponentially with the
number of features: the number of possible size-k subsets of n things is the
binomial coefficient C(n, k) = n! / (k! (n−k)!), read "n choose k".
✓ This exponential growth of possible feature subsets can lead to overfitting.
✓ In wrapper feature selection, two key aspects require consideration:
✓ first, the choice of an algorithm for feature selection, and
✓ second, the determination of a selection criterion or filter to find out the
usefulness of the chosen feature set.

A. Selecting an algorithm
Stepwise regression is a category of feature selection methods which involves
systematically adjusting feature sets within regression models, typically through
forward selection, backward elimination, or a combined approach, to
optimize model performance based on predefined selection criteria.
Forward selection:
Forward selection involves systematically adding features to a regression
model one at a time based on their ability to improve model performance
according to a selection criterion. This iterative process continues until further
feature additions no longer enhance the model's performance.
Backward elimination:
Backward elimination begins with a regression model containing all features.
Subsequently, features are removed one at a time: at each step, the feature whose
removal makes the biggest improvement in the selection criterion is dropped. Stop
removing features when removing any remaining feature makes the selection criterion
get worse.


Combined approach:
The combined approach in feature selection blends forward selection and
backward elimination to strike a balance between maximizing relevance and
minimizing redundancy. It iteratively adds and removes features based on
their significance and impact on model fit, resulting in a subset of features
optimized for predictive power.
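The following is a hedged sketch (my own illustration, not from the notes) of greedy forward selection using a cross-validated score as the selection criterion; X (a DataFrame of candidate features) and y (the outcome) are assumed to exist, e.g., from the retention example.

# Hedged sketch of greedy forward selection (one possible stepwise strategy).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=5):
    selected, remaining = [], list(X.columns)
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score each candidate feature when added to the current subset.
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:       # stop: no candidate improves the criterion
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best
    return selected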
B. Selection criterion
The choice of selection criteria in feature selection methods may seem arbitrary.
To address this, experimenting with various criteria can help assess model
robustness. Different criteria may yield diverse models, necessitating the
prioritization of optimization goals based on the problem context and objectives.
R-squared

R-squared can be interpreted as the proportion of variance explained by your model.

p-values
In regression analysis, the interpretation of p-values involves assuming a null
hypothesis where the coefficients (βs) are zero. A low p-value suggests that
observing the data and obtaining the estimated coefficient under the null
hypothesis is highly unlikely, indicating a high likelihood that the coefficient is
non-zero.
AIC (Akaike Information Criterion)
Given by the formula 2k−2ln(L) , where k is the number of parameters in the
model and ln(L) is the “maximized value of the log likelihood.” The goal is to
minimize AIC.
BIC (Bayesian Information Criterion)
Given by the formula k*ln(n) −2ln(L), where k is the number of parameters in
the model, n is the number of observations (data points, or users), and ln(L) is
the maximized value of the log likelihood. The goal is to minimize BIC.
Entropy
Entropy is a measure of disorder or impurity in the given dataset.
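To make the AIC and BIC formulas above concrete, here is a hedged sketch (my own illustration) that computes them for a fitted logistic regression; model, X, and y are assumed to exist from the earlier sketches, and the log likelihood is approximated from the predicted probabilities.

# Hedged sketch: AIC = 2k - 2ln(L), BIC = k*ln(n) - 2ln(L); both are minimized.
import numpy as np
from sklearn.metrics import log_loss

n = len(y)                                     # number of observations (users)
k = X.shape[1] + 1                             # parameters: one coefficient per feature + intercept
log_likelihood = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)

aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood
print("AIC:", aic, "BIC:", bic)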

Questions:
1. Define feature selection.
2. Explain the filter method.
3. Explain the wrapper method.
4. Explain selecting an algorithm in the wrapper method.
5. Explain the different selection criteria in feature selection.


Decision tree
3. Embedded Methods: Decision Trees
i)Decision Trees
✓ Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and intuitive
way to make decisions based on data by modelling the relationships between
different variables.
✓ A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes
representing final outcomes or predictions.
✓ Each internal node corresponds to a test on an attribute, each branch corresponds
to the result of the test, and each leaf node corresponds to a class label or a
continuous value. In the context of a data problem, a decision tree is a
classification algorithm.
✓ For the Chasing Dragons example, you want to classify users as “Yes, going to
come back next month” or “No, not going to come back next month.” This isn’t
really a decision in the colloquial sense, so don’t let that throw you.
✓ You know that the class of any given user is dependent on many factors
(number of dragons the user slew, their age, how many hours they already
played the game). And you want to break it down based on the data you’ve
collected. But how do you construct decision trees from data and what
mathematical properties can you expect them to have?
✓ But you want this tree to be based on data and not just what you feel like.
Choosing a feature to pick at each step is like playing the game 20 Questions
really well. You take whatever the most informative thing is first. Let’s
formalize that—we need a notion of “informative.”
✓ For the sake of this discussion, assume we break compound questions into
multiple yes-or-no questions, and we denote the answers by “0” or “1.” Given
a random variable X, we denote by p(X=1) and p(X=0) the probability that X
is true or false, respectively.


Entropy
• Entropy is a measure of disorder or impurity in the given dataset.
• In the decision tree, messy data are split based on values of the feature vector
associated with each data point.
• With each split, the data become more homogeneous, which decreases the
entropy. However, the data in some nodes will not be homogeneous, and there the
entropy value will not be small. The higher the entropy, the harder it is to draw
any conclusion.
• When the tree finally reaches a terminal or leaf node, maximum purity is
reached.
• For a dataset that has C classes, where the probability of randomly choosing a data
point from class i is p_i, the entropy E(S) can be mathematically represented as
E(S) = − Σ_{i=1..C} p_i · log2(p_i)
• To quantify what the most “informative” feature is, we define entropy—
effectively a measure of how mixed up something is—for a binary variable X as
H(X) = − p(X=1) · log2 p(X=1) − p(X=0) · log2 p(X=0)
• In particular, if either option has probability zero, the entropy is 0 (pure). Moreover,
because p(X=1) = 1 − p(X=0), the entropy is symmetric about 0.5 and maximized
at 0.5, which we can easily confirm using a bit of calculus. (A plot of entropy
against p(X=1), rising from 0, peaking at 0.5, and falling back to 0, illustrates this.)

• Example:


• Entropy is a measurement of how mixed up something is. If X denotes the event
of a baby being born a boy, we expect it to be true or false with probability
close to 1/2, which corresponds to high entropy; i.e., the bag of babies from
which we are selecting a baby is highly mixed.
• If X denotes the event of rainfall in a desert, then it’s low entropy. In other
words, the bag of day-long weather events is not highly mixed in deserts.
• X is the target of our model. So, X could be the event that someone buys
something on our site.
• We need to determine which attribute of the user will tell us the most
information about this event X.
• Information Gain, IG(X, a), for a given attribute a, is the entropy we
lose (the reduction in entropy, or uncertainty) if we know the value of that attribute:
• IG(X, a) = H(X) − H(X|a).
• H(X|a) can be computed in two steps:
• For any actual value a_0 of the attribute a, we can compute the specific conditional
entropy H(X | a = a_0) as:
H(X | a = a_0) = − p(X=1 | a = a_0) · log2 p(X=1 | a = a_0)
− p(X=0 | a = a_0) · log2 p(X=0 | a = a_0)
And then we can put it all together, over all possible values of a, to get the
conditional entropy H(X|a):
H(X|a) = Σ_{a_i} p(a = a_i) · H(X | a = a_i)
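To make the computation concrete, here is a small hedged sketch (my own illustration, not from the notes) that computes H(X), H(X|a), and IG(X, a) for a toy binary attribute; the arrays returned and profile are made-up data.

# Hedged sketch: entropy and information gain, following the formulas above.
import numpy as np

def entropy(labels):
    # H(X) = -sum_c p_c * log2(p_c) over the classes present in 'labels'
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    # IG(X, a) = H(X) - sum_{a_i} p(a = a_i) * H(X | a = a_i)
    h_conditional = 0.0
    for value in np.unique(attribute):
        mask = (attribute == value)
        h_conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_conditional

# Toy data: did the user return (X), did they fill out a profile (attribute a)?
returned = np.array([1, 1, 0, 0, 1, 0, 1, 0])
profile  = np.array([1, 1, 0, 0, 1, 0, 0, 0])
print(entropy(returned), information_gain(returned, profile))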

Decision Tree Algorithm


• The decision tree is built iteratively.
• It starts with the root. You need an algorithm to decide which attribute to split
on, i.e., which node should be added next.
• The attribute is chosen in order to maximize information gain.
• Keep going until all the points at the end are in the same class or you end up
with no features left. In that case, you take the majority vote.
• The tree can be pruned to avoid overfitting, by cutting it off below a certain
depth.
• If you build the entire tree, it’s often less accurate with new data than if you
prune it.
• Example:
➢ Suppose you have your Chasing Dragons dataset. Your outcome
variable is Return: a binary variable that captures whether or not the user
returns next month, and you have tons of predictors.
# Load necessary library
# Loads the rpart library, which is used for recursive partitioning and regression trees.
library(rpart)

# Set the working directory and read the CSV file
setwd("F:/college/data science-2021 scheme/dscience")
chasingdragons <- read.csv("chasingdragons.csv")

# Grow the classification tree
# Builds a classification tree model to predict the Return variable using the other
# variables (profile, num_dragons, num_friends_invited, gender, age, num_days) as
# predictors. The method="class" argument specifies that this is a classification tree.
model1 <- rpart(Return ~ profile + num_dragons + num_friends_invited
                + gender + age + num_days, method="class", data=chasingdragons)

# Display the results
# Prints the complexity parameter (CP) table, which helps in understanding the
# performance of the model and in pruning the tree.
printcp(model1)

# Visualize cross-validation results
# Plots the cross-validation results for the complexity parameter. The plotcp function
# helps visualize how the model's error changes with the complexity of the tree,
# which aids in selecting the optimal tree size.
plotcp(model1)

# Detailed summary of the model
# The summary function provides comprehensive information about the model,
# including splits and nodes.
summary(model1)
# (sample summary output not shown)

# Plot the classification tree. The plot function visualizes the tree structure, and
# the uniform=TRUE argument makes the branch lengths uniform. The main argument
# specifies the title of the plot.
plot(model1, uniform=TRUE, main="Classification Tree for Chasing Dragons")

# Add text labels to the tree plot. The text function annotates the tree plot with
# information about the nodes. The use.n=TRUE argument includes the number of
# observations at each node, all=TRUE ensures all nodes are labeled, and cex=.8
# sets the size of the text labels.
text(model1, use.n=TRUE, all=TRUE, cex=.8)

Handling Continuous Variables


• In the case of continuous variables, a suitable threshold value needs to be
determined so that the variable can be treated as a binary one.
• Example: A User’s number of Dragon Slays can be partitioned into categories
such as “less than 10” and “at least 10”. Now we have a binary variable case.
• In this case, it takes some extra work to decide on the information gain because
it depends on the threshold as well as the feature.
• Instead of a single threshold, bins of values can be created for the attribute,
depending on the situation.
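As a small hedged illustration (not from the notes), the sketch below turns a continuous feature into a thresholded binary variable and into bins; the column name num_dragons and the toy values are assumptions.

# Hedged sketch: thresholding and binning a continuous feature, as described above.
import pandas as pd

df = pd.DataFrame({"num_dragons": [0, 3, 12, 25, 7, 40]})       # toy data

# Single threshold: "less than 10" vs. "at least 10"
df["many_dragons"] = (df["num_dragons"] >= 10).astype(int)

# Several bins instead of a single threshold
df["dragon_bin"] = pd.cut(df["num_dragons"], bins=[-1, 9, 19, 1000],
                          labels=["0-9", "10-19", "20+"])
print(df)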
Questions:

1. Define Decision tree.


2. Explain the decision tree for the Chasing Dragons problem.
3. Suppose you have your Chasing Dragons dataset. Your outcome variable is Return: a
binary variable that captures whether or not the user returns next month, and you have tons
of predictors. Write an R script using the decision tree algorithm for the above scenario.
4. Write the decision tree algorithm in detail.

Random forest
ii)Random Forest
• Random forests generalize decision trees with bagging, otherwise known as
bootstrap aggregating.
• Makes the models more accurate and more robust, but at the cost of
interpretability
• But easy to specify – 2 Hyperparameters: Number of Trees (N) in the forest and
Number of Features (F) to randomly select for each tree
• A bootstrap sample is a sample with replacement, which means we might
sample the same data point more than once. We usually take the sample size
to be 80% of the size of the entire (training) dataset, but of course this parameter
can be adjusted depending on circumstances. This is technically a third
hyperparameter of our random forest algorithm.
• To construct a random forest, you construct N decision trees as follows:
• For each tree, take a bootstrap sample of your data, and for each node
you randomly select F features, say 5 out of the 100 total features.
• Then you use your entropy-information-gain engine as described in the
previous section to decide which among those features you will split
your tree on at each stage.
• Note that you could decide beforehand how deep the tree should get, or you
could prune your trees after the fact, but you typically don’t prune the trees in
random forests, because a great feature of random forests is that they can
incorporate idiosyncratic noise.
• Algorithm
• Select random K data points from the Training Set
• Build a Decision Tree based on the selected K points


• Select the Number of Trees to build and repeat Steps 1 & 2


• For a new (test) data point, let each of the trees predict the class for that
data point, and assign the new data point the majority vote (or average) across
all the predicted classes.
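As a hedged sketch (my own illustration, not from the notes), scikit-learn's RandomForestClassifier exposes the hyperparameters discussed above; X and y are assumed to be the Chasing Dragons feature matrix (with at least five features) and the Return labels.

# Hedged sketch: a random forest with the hyperparameters discussed above.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # N: number of trees in the forest
    max_features=5,        # F: features randomly considered at each split
    max_samples=0.8,       # bootstrap sample size as a fraction of the training set
    bootstrap=True,
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))       # majority-vote predictions across the trees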
Questions
1. Define random forest.
2. Explain the random forest algorithm.

User Retention
3.4 User Retention: Interpretability Vs. Predictive Power
• Assume a model predicts well, but can you find the meaning in the model?
• Example: The prediction could be “the more the user plays in the first month, the
more likely the user is to come back next month”. This is obvious and not very
helpful when doing the analysis
• It could also tell you that showing users ads in the first 5 minutes decreases their chances
of coming back, but that it’s OK to show ads after the first hour. This would give an
insight: don’t show ads in the first 5 minutes.
• To study this more, you really would want to do some A/B testing, but this initial
model and feature selection would help you prioritize the types of tests you might
want to run.
• Features that are associated with the user’s behaviour are qualitatively different
from features associated with your own (the app’s) behaviour.
• If there’s a correlation of getting a high number of points in the first month with
players returning to play next month, does that mean if you just give users a high
number of points this month without them playing at all, they’ll come back – No!
• It’s not the number of points that caused them to come back, it’s that they’re really
into playing the game which correlates with both their coming back and their getting
a high number of points.
• Therefore, do feature selection with all variables, but then focus on the ones you
can do something about conditional on user attributes.
Questions:

1. Explain user retention in detail.

User Retention, Dimensionality Reduction, SVD


3.5 A Real-World Recommendation Engine
• Recommendation engines are used all the time—what movie would you like,
knowing other movies you liked? What book would you like, keeping in mind past
purchases?
• There are plenty of different ways to go about building such a model, but they have
very similar feels if not implementation. To set up a recommendation engine,
suppose you have users, which form a set U; and you have items to recommend,
which form a set V.


• We can denote this as a bipartite graph (shown again in Figure below) if each user
and each item has a node to represent it—there are lines from a user to an item if
that user has expressed an opinion about that item.
• Note they might not always love that item, so the edges could have weights: they
could be positive, negative, or on a continuous scale (or discontinuous, but many-
valued like a star system).
• The implications of this choice can be heavy, but we won’t delve too deep here;
for us, the edge weights are numeric ratings.
• Next up, you have training data in the form of some preferences—you know some
of the opinions of some of the users on some of the items. From those training data,
you want to predict other preferences for your users. That’s essentially the output
for a recommendation engine.
• You may also have metadata on users (i.e., they are male or female, etc.) or on items
(the color of the product). For example, users come to your website and set up
accounts, so you may know each user’s gender, age, and preferences for up to three
items.
• You represent a given user as a vector of features, sometimes including only
metadata—sometimes including only preferences (which would lead to a sparse
vector because you don’t know all the user’s opinions) and sometimes including
both, depending on what you’re doing with the vector.
• Also, you can sometimes bundle all the user vectors together to get a big user
matrix, which we call U, through abuse of notation.
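As a hedged sketch (not part of the notes), the snippet below assembles such a user-item matrix from (user, item, rating) triples; the users, items, and ratings are toy stand-ins.

# Hedged sketch: building the user-item matrix from (user, item, rating) triples.
import pandas as pd

ratings = [("alice", "movie1", 5), ("alice", "movie3", 2),
           ("bob",   "movie1", 4), ("carol", "movie2", 1)]
df = pd.DataFrame(ratings, columns=["user", "item", "rating"])

# Rows = users (the set U), columns = items (the set V); missing entries are
# preferences we do not know and would like the engine to predict.
user_item = df.pivot(index="user", columns="item", values="rating")
print(user_item)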

3.6 The Dimensionality Problem

• As we increase the dimension (number of features), the accuracy of the system
increases up to a certain limit. Beyond that limit, the accuracy starts to decline.
• Solution - dimensionality reduction: this does not mean simply removing features, but
rather transforming the data into a different, lower-dimensional perspective.
• Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations. So we’re saying here that
taste is latent but can be approximated by putting together all the observed information
we do have about the user.


Some Problems with Nearest Neighbors
So you could use nearest neighbors; it makes some intuitive sense that you’d want to
recommend items to people by finding similar people and using those people’s opinions to
generate ideas and recommendations. But there are a number of problems nearest neighbors
poses.
Curse of dimensionality
There are too many dimensions, so the closest neighbors are too far away from each other to
realistically be considered close.
Overfitting
Overfitting is also a problem. So one guy is closest, but that could be pure noise. How do you
adjust for that? One idea is to use k-NN, with, say, k=5 rather than k=1, which smooths out
the influence of noise.
Correlated features
There are tons of features, moreover, that are highly correlated with each other. For example,
you might imagine that as you get older you become more conservative. But then counting both
age and politics would mean you’re double counting a single feature in some sense. This would
lead to bad performance, because you’re using redundant information and essentially placing
double the weight on some variables. It’s preferable to build in an understanding
of the correlation and project onto smaller dimensional space.
Relative importance of features
Some features are more informative than others. Weighting features may therefore be helpful:
maybe your age has nothing to do with your preference for item 1. You’d probably use
something like covariances to choose your weights.
Sparseness
If your vector (or matrix, if you put together the vectors) is too sparse, or you have lots of
missing data, then most things are unknown, and the Jaccard distance means nothing because
there’s no overlap.
Measurement errors
There’s measurement error (also called reporting error): people may lie.
Computational complexity
There’s a calculation cost—computational complexity.
Sensitivity of distance metrics
Euclidean distance also has a scaling problem: distances in age outweigh distances for other
features if they’re reported as 0 (for don’t like) or 1 (for like). Essentially this means that raw
Euclidean distance doesn’t make much sense. Also, old and young people might think one thing
but middle-aged people something else. We seem to be assuming a linear relationship, but it
may not exist. Should you be binning by age group instead, for example?

3.6.1 Singular Value Decomposition (SVD)

• Rank of a Matrix: The rank of a matrix A is the maximum number of linearly
independent row vectors or column vectors in the matrix. It is denoted as rank(A). In
the case of SVD, the rank of A is the number of non-zero singular values in the diagonal
matrix S.
• Latent Features: These are the hidden patterns or factors in the data. For a user-item
matrix, they represent underlying user preferences and item characteristics.
• Importance of Latent Features: The singular values in S indicate the importance of these
features.
• A larger singular value means the corresponding latent feature is more significant in
capturing the structure of the data.
• Given an m×n matrix X of rank k, the SVD is X = U S V^T.
• U: m×k matrix (left singular vectors) that contains user latent features. Each row
corresponds to a user, and each column represents a latent feature. For example, a row
might capture a user’s preference for genres like action or comedy.
• S: k×k diagonal matrix with singular values. These values indicate the importance of
each latent feature. Larger values correspond to more significant features.
• V: n×k matrix (right singular vectors), so that V^T is k×n. It contains item latent
features. Each row corresponds to an item, and each column represents a latent feature.
For example, a row might capture an item’s characteristics like genre or popularity.
• U and V: The columns of U and V are orthogonal, meaning they capture independent
latent features.
• S: The singular values in S measure the importance of each latent feature. Larger values
indicate more significant features.

Dimensionality Reduction with SVD

To reduce the dimensionality, follow these steps:

1. Compute SVD:
• Perform SVD on the matrix X to get U, S, and V.
2. Select Top d Singular Values:
• Identify the top d largest singular values from S. These singular values correspond
to the most significant components of the data.
3. Construct Reduced Matrices:
• Construct reduced matrices Ud, Sd, and Vd
4. Approximate the Original Matrix:
X_d = U_d S_d V_d^T

• This X_d is the best rank-d approximation of X, using only the top d singular values.

In Singular Value Decomposition (SVD), dimensionality reduction is achieved by
selecting the most significant singular values and their corresponding singular vectors.
This process leverages the properties of the SVD to capture the most important
structures in the data while discarding less important information.
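The following is a hedged sketch (my own illustration) of the truncation steps above using NumPy; the matrix X is random toy data standing in for a user-item matrix.

# Hedged sketch: truncated SVD keeping the top d singular values, per the steps above.
import numpy as np

X = np.random.rand(6, 5)                      # toy user-item matrix (6 users, 5 items)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 2                                         # number of latent features to keep
X_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]   # best rank-d approximation X_d = U_d S_d V_d^T
print(np.linalg.norm(X - X_d))                # reconstruction error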
Questions:


1. Explain a real-world recommendation engine with a neat diagram.

2. What is the dimensionality problem?
3. Explain SVD in detail.

PCA, Building Recommendation Engine

3.6.2 Principal Component Analysis

• In this approach, we aim to predict preferences by factorizing the user-item
interaction matrix X into two lower-dimensional matrices, U and V, without the
need for the singular values matrix S. The goal is to find U and V such that:
X ≈ U·V^T
• The optimization problem is to minimize the discrepancy between the actual
user-item interaction matrix X and its approximation X̃ = U·V^T.

• This discrepancy is measured using the squared error:
argmin_{U,V} Σ_{i,j} (x_ij − u_i·v_j)^2

• xij is the actual interaction (e.g., rating) between user i and item j.
• ui is the i-th row of matrix U, representing the latent features of user i.
• vj is the j-th row of matrix V, representing the latent features of item j.
• The dot product ui⋅vj is the predicted preference of user i for item j.

Latent Features and Matrix Dimensions

• Number of Latent Features (d): This is a parameter that you choose, representing
the number of latent features you want to use. It controls the dimensions of matrices
U and V.
• Matrix U: Has dimensions m×d, where m is the number of users and d is the number
of latent features. Each row corresponds to a user.
• Matrix V: Has dimensions n×d, where n is the number of items and d is the number
of latent features. Each row corresponds to an item.
PCA
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load dataset
# 'diabetes.csv' is the filename of the dataset. The data is read into a pandas DataFrame.
data = pd.read_csv(r"D:\Files\dsv-dataset\1. DataSets\diabetes.csv")

# Separate features and target
# Assuming the last column is the target/outcome column
features = data.iloc[:, :-1]
target = data.iloc[:, -1]

# Standardize the data
# StandardScaler standardizes the features by removing the mean and scaling to unit variance.
# This ensures that each feature contributes equally to the analysis.
scaler = StandardScaler()

# Fit the scaler on the features and transform the features
scaled_features = scaler.fit_transform(features)

# Perform PCA
# Create a PCA object with the number of components we want to keep (2 in this case).
# Fit the PCA model to the scaled data and transform the data to the new PCA space.
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_features)

# Create a DataFrame with the PCA results
# Assign column names 'Principal Component 1' and 'Principal Component 2'
pca_df = pd.DataFrame(data=pca_result,
                      columns=['Principal Component 1', 'Principal Component 2'])

# Display the PCA DataFrame
print(pca_df.head())

# Retrieve the loadings of each component
# The components_ attribute contains the loadings of each principal component
components = pca.components_

# Create a DataFrame for the loadings of each principal component
loadings_df = pd.DataFrame(components, columns=features.columns,
                           index=['Principal Component 1', 'Principal Component 2'])

# Display the loadings DataFrame
print("\nLoadings of each principal component:")
print(loadings_df)

# Plot the PCA result
# Create a new figure with a specified size.
plt.figure(figsize=(8, 6))

# Scatter plot of the first two principal components.
# pca_df['Principal Component 1'] contains the values of the first principal component.
# pca_df['Principal Component 2'] contains the values of the second principal component.
# c='blue' sets the color of the points to blue.
# edgecolor='k' sets the edge color of the points to black.
# s=50 sets the size of the points.
plt.scatter(pca_df['Principal Component 1'], pca_df['Principal Component 2'],
            c='blue', edgecolor='k', s=50)

# Label the x-axis as 'Principal Component 1'.
plt.xlabel('Principal Component 1')

# Label the y-axis as 'Principal Component 2'.
plt.ylabel('Principal Component 2')

# Set the title of the plot to 'PCA Result'.
plt.title('PCA Result')

# Display the plot.
plt.show()
Principal Component 1 Principal Component 2
0 1.068503 1.234895
1 -1.121683 -0.733852
2 -0.396477 1.595876
3 -1.115781 -1.271241
4 2.359334 -2.184819

Loadings of each principal component:


Pregnancies Glucose BloodPressure SkinThickness \
Principal Component 1 0.128432 0.393083 0.360003 0.439824
Principal Component 2 0.593786 0.174029 0.183892 -0.331965

Insulin BMI DiabetesPedigreeFunction Age


Principal Component 1 0.435026 0.451941 0.270611 0.198027
Principal Component 2 -0.250781 -0.100960 -0.122069 0.620589

3.7 Alternating Least Squares

• Here we are not first minimizing the squared error and then minimizing the size of the
entries of the matrices U and V. Here we are actually doing both at the same time.
• Alternating Least Squares (ALS) is an algorithm for matrix factorization. ALS is used
to decompose a given user-item interaction matrix into two lower-dimensional matrices
(U and V) that capture latent features of users and items.

• Here’s the algorithm:

a. Pick a random V.
b. Optimize U while V is fixed.
c. Optimize V while U is fixed.
• Keep doing the preceding two steps until you’re not changing very much at all. To be
precise, you choose an ϵ and if your coefficients are each changing by less than ϵ, then
you declare your algorithm “converged.”
• Fix V and update U. The way you do this optimization is user by user. So for user i,
you want to find:
u_i = argmin_u Σ_j (p_{i,j} − u·v_j)^2
where the v_j are fixed and the sum runs over the items j that user i has rated. In other
words, you just care about this user for now. But wait a minute, this is the same as
linear least squares, and has a closed-form solution! In other words, set:
u_i = (V_{*,i}^T · V_{*,i})^(−1) · V_{*,i}^T · P_{*,i}
where V_{*,i} is the subset of V for which you have preferences coming from user i, and
P_{*,i} is the vector of user i’s known preferences.
Taking the inverse is easy because it’s d×d, which is small. And there aren’t that many
preferences per user, so solving this many times is really not that hard. Overall you’ve got
a doable update for U.
• When you fix U and optimize V, it’s analogous—you only ever have to consider the
users that rated that movie, which may be pretty large for popular movies but on average
isn’t; but even so, you’re only ever inverting a d×d matrix.
3.8 Building Recommendation System using Python

The following code is Matt’s code to illustrate implementing a recommendation system on
a relatively small dataset.
Initialize Matrix V:

• V is initialized with random values.


• This matrix represents the latent features of items.

Initialize Matrix U:

• U is initialized with zeros.


• This matrix will represent the latent features of users.

ALS Algorithm


The ALS algorithm alternates between updating the user latent features (U) and the item
latent features (V) to minimize the squared error of the predicted ratings.
For each user:
• Extract the items they have interacted with and their corresponding ratings.
• Create a matrix vo containing the latent features of these items.
• Create a vector pvo of the ratings.

Solve the regularized least squares problem:

• Update U[i, :] using the formula
U[i, :] = (V_i^T · V_i + λ·I)^(−1) · V_i^T · X_i, where λ is the regularization parameter
(L = 0.03 in the code below).

• `V_i` is the submatrix of `V` corresponding to the items user `i` has rated.

• `X_i` is the vector of ratings for these items.


Error Calculation
• Calculate the root mean square error (RMSE):Compute the prediction error for
all known user-item interactions.
• Final Predicted Matrix
• Predict the entire user-item interaction matrix:
• Multiply U and V.T to get the predicted ratings.

Code:

import math
import numpy as np

# Define the user-item ratings matrix pu (triples of user index, item index, rating)
pu = [
    [(0, 0, 1), (0, 1, 22), (0, 2, 1), (0, 3, 1), (0, 5, 0)],
    [(1, 0, 1), (1, 1, 32), (1, 2, 0), (1, 3, 0), (1, 4, 1), (1, 5, 0)],
    [(2, 0, 0), (2, 1, 18), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 40), (3, 2, 1), (3, 3, 0), (3, 4, 0), (3, 5, 1)],
    [(4, 0, 0), (4, 1, 40), (4, 2, 0), (4, 4, 1), (4, 5, 0)],
    [(5, 0, 0), (5, 1, 25), (5, 2, 1), (5, 3, 1), (5, 4, 1)]
]

# Define the item-user ratings matrix pv (triples of item index, user index, rating)
pv = [
    [(0, 0, 1), (0, 1, 1), (0, 2, 0), (0, 3, 1), (0, 4, 0), (0, 5, 0)],
    [(1, 0, 22), (1, 1, 32), (1, 2, 18), (1, 3, 40), (1, 4, 40), (1, 5, 25)],
    [(2, 0, 1), (2, 1, 0), (2, 2, 1), (2, 3, 1), (2, 4, 0), (2, 5, 1)],
    [(3, 0, 1), (3, 1, 0), (3, 2, 1), (3, 3, 0), (3, 5, 1)],
    [(4, 1, 1), (4, 2, 0), (4, 3, 0), (4, 4, 1), (4, 5, 1)],
    [(5, 0, 0), (5, 1, 0), (5, 2, 1), (5, 3, 1), (5, 4, 0)]
]

# Define matrix V (item latent features, random starting values)
V = np.mat([
    [0.15968384, 0.9441198, 0.83651085],
    [0.73573009, 0.24906915, 0.85338239],
    [0.25605814, 0.6990532, 0.50900407],
    [0.2405843, 0.31848888, 0.60233653],
    [0.24237479, 0.15293281, 0.22240255],
    [0.03943766, 0.19287528, 0.95094265]
])

# Initialize matrix U (user latent features) with zeros
U = np.mat(np.zeros([6, 3]))

# Regularization parameter
L = 0.03

# Perform matrix factorization using alternating least squares
for iter in range(5):
    print("\n----- ITER %s -----" % (iter + 1))
    print("U")
    urs = []

    # Update U: solve a regularized least-squares problem for each user
    for uset in pu:
        vo = []
        pvo = []
        for i, j, p in uset:
            vor = []
            for k in range(3):
                vor.append(V[j, k])
            vo.append(vor)
            pvo.append(p)
        vo = np.mat(vo)
        ur = np.linalg.inv(vo.T * vo + L * np.mat(np.eye(3))) * vo.T * np.mat(pvo).T
        urs.append(ur.T)
    U = np.vstack(urs)
    print(U)

    print("V")
    vrs = []
    # Update V: solve a regularized least-squares problem for each item
    for vset in pv:
        uo = []
        puo = []
        for j, i, p in vset:
            uor = []
            for k in range(3):
                uor.append(U[i, k])
            uo.append(uor)
            puo.append(p)
        uo = np.mat(uo)
        vr = np.linalg.inv(uo.T * uo + L * np.mat(np.eye(3))) * uo.T * np.mat(puo).T
        vrs.append(vr.T)
    V = np.vstack(vrs)
    print(V)

    # Calculate RMSE (Root Mean Squared Error) over the known ratings
    err = 0.
    n = 0.
    for uset in pu:
        for i, j, p in uset:
            err += (p - (U[i] * V[j].T)[0, 0]) ** 2
            n += 1
    rmse = math.sqrt(err / n)
    print("RMSE:", rmse)

# Print the final predicted user-item matrix U * V.T
print("\nFinal U * V.T")
print(U * V.T)
Output: (sample output not reproduced here)


Questions:

1. Explain PCA in detail.


2. Define alternating least squares.
3. Write a program for a recommendation system using Python.
