
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint Latent Dirichlet Allocation Model After All


Florian Wilhelm
inovex GmbH
Cologne, Germany
Florian.Wilhelm@inovex.de
ABSTRACT

Matrix factorization-based methods are among the most popular methods for collaborative filtering tasks with implicit feedback. The most effective of these methods do not apply sign constraints, such as non-negativity, to their factors. Despite their simplicity, the latent factors for users and items lack interpretability, which is becoming an increasingly important requirement. In this work, we provide a theoretical link between unconstrained and the interpretable non-negative matrix factorization in terms of the personalized ranking induced by these methods. We also introduce a novel, latent Dirichlet allocation-inspired model for recommenders and extend our theoretical link to also allow the interpretation of an unconstrained matrix factorization as an adjoint formulation of our new model. Our experiments indicate that this novel approach represents the unknown processes of implicit user-item interactions in the real world much better than unconstrained matrix factorization while being interpretable.

CCS CONCEPTS

• Information systems → Recommender systems; Collaborative filtering; • Computing methodologies → Factorization methods; Latent variable models; Latent Dirichlet allocation; Learning latent representations.

ACM Reference Format:
Florian Wilhelm. 2021. Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint Latent Dirichlet Allocation Model After All. In Fifteenth ACM Conference on Recommender Systems (RecSys '21), September 27-October 1, 2021, Amsterdam, Netherlands. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3460231.3474266

1 INTRODUCTION

Since the Netflix competition from 2006 to 2009, matrix factorization (MF)-based methods have been and still are among the most popular approaches to collaborative filtering problems. Even the emergence of deep learning could not fundamentally change this, as recent publications have shown the effectiveness of MF and even simpler methods over neural network-based approaches [5, 30].

The question of why MF-based methods in particular are so effective in finding a personalized ranking of items based on implicit user feedback remains largely unanswered. What are the assumptions and user model behind a general MF approach? How can the latent factors, i.e., the learned parameters, be interpreted, especially if no sign constraints, e.g., non-negativity, have been imposed? With more and more tasks taken over by algorithms, the call for interpretability is getting louder and louder. While there are approaches such as non-negative matrix factorization (NMF) that offer possibilities for interpretation, they often cannot rival the performance of general MF methods without these constraints on the factors [21]. In a well-received article, Rudin [31] pointed out that interpretable models could potentially exist in many different domains that are just as accurate as their non-interpretable counterparts. What could such an interpretable model look like in the context of collaborative filtering?

In this paper, we focus on the task of creating a user-specific ranking for a set of items based on the implicit feedback of a set of users. Our work thus relies heavily on the definition of the Bayesian pair-wise ranking (BPR) criterion by Rendle et al. [29], which formalizes the task of finding a user-specific item ranking. We begin with a review and discussion of several variants of MF before introducing a novel, latent Dirichlet allocation (LDA)-inspired model for recommendation tasks (LDA4Rec) that is easy to interpret and to reason about. After that, we prove that the factors of unconstrained MF can be transformed into NMF factors as well as into a factorization adjoint to the presented LDA4Rec model while keeping the MF-induced personalized ranking constant. Consequently, the presented transformations allow the interpretation of the results from the presented MF methods in the context of the LDA4Rec model. Although the personalized ranking remains constant under these transformations, the MF-based methods and LDA4Rec may not necessarily find the same optimal solution because the optimization losses are different. We evaluate this by running several experiments on two public datasets comparing the presented MF variants and the novel LDA4Rec model in terms of their results using mean reciprocal rank, precision, and recall as metrics.
2 RELATED WORK

Already in the original work on LDA, Blei et al. [3] evaluate recommender systems as a use case for LDA. Treating documents as users, topics as cohorts of users, and words as items, the LDA model is trained on a set of fully observed users while for each unobserved user all but one of the movies are shown. Then the likelihood of the held-out movie compared to the others is eventually used to derive a user-specific ranking. This direct application of LDA to the domain of recommender systems is also used by Xie et al. [36] in their LDA-inspired probabilistic method.


Our work differs from these in that two additional vectors of parameters are introduced into the original LDA formulation, adding an inductive bias to support the use case. One vector for the popularity of the items regularizes the item preferences over the user cohorts, while another vector weights this regularization for each user, thus indicating the conformity of the user with the item popularity.

MF and LDA are often considered together to derive collaborative methods that also include content information. Wang and Blei [34] propose a collaborative topic regression (CTRlda) model that combines a textual LDA model for the content information and a probabilistic matrix factorization to jointly explain the observed content and user ratings, respectively. A similar approach is proposed by Nikolenko [23], while Rao et al. [28] extend CTRlda using the special words with background (SWB) model [4] instead of LDA. As these methods use additional content information, they differ from LDA4Rec in this aspect while also not including the aforementioned additional parameters.

Zhang et al. [38] emphasize the interpretability of NMF for collaborative filtering and regard the latent user vector as an additive mixture of different user communities, i.e., cohorts. A similar, but more probabilistic NMF approach, rendering the mixture a distribution over cohorts of users, is presented by Hernando et al. [12]. Compared to LDA4Rec, these works also lack the proposed additional parameters, similar to the previously mentioned works that apply LDA directly. The lack of interpretability of MF without non-negativity constraints is also recognized by Datta et al. [6] and addressed with a different approach than NMF. The authors propose a shadow model that learns a mapping from interpretable auxiliary features to the latent factors of MF. Therefore, their approach cannot be considered as pure collaborative filtering like LDA4Rec, since additional content information is used.

3 NOTATION AND TERMINOLOGY

In this section, we formalize the problem and establish a common notation, which is to some extent based on the work of Rendle et al. [29]. Matrices are denoted by capital letters X, transposed matrices by X^t, vectors by bold letters x, sets by calligraphic letters X, and the cardinality of a set by |X|. The scalar product of two vectors x and y is denoted by ⟨x, y⟩ := Σ_{i=1}^n x_i y_i, and the l_1-norm by ∥x∥_1 := Σ_{i=1}^n |x_i|, where n is the dimension of the vector space. The Hadamard element-wise vector multiplication and division are denoted by ⊙ and ⊘, respectively. A concatenation of two vectors x, z is denoted by [x, z], and 1 is the vector of all ones. The i-th row vector of a matrix X is denoted by x_i, whereas the j-th column vector is expressed with the help of the Kleene star as x_{*j}. The symbol R_{≥0} is used for non-negative real numbers.

Let U be the set of all users and I the set of all items. With S ⊂ U × I we denote the set of implicit feedback from users u ∈ U having interacted with items i ∈ I. Following the definition of Rendle et al. [29], the task of personalized ranking is to provide each user u with a personalized total ranking ⩾_u on I. In particular, since we assume that ⩾_u is a total order, we have for i, j ∈ I with i ≠ j that either i >_u j or i <_u j.

4 MATRIX FACTORIZATION

MF-based methods for collaborative filtering share the idea of approximating the sparse matrix of user-item interactions X ∈ R^{|U|×|I|} by the product of two low-rank matrices W ∈ R^{|U|×|K|} and H ∈ R^{|I|×|K|}, i.e.,

X ≈ X̂ := W H^t,

where K = {1, ..., |K|} is the index set of the latent dimensions. Derived from this general form, we will define the personalized score of a user u for an item i as

x̂_ui = ⟨w_u, h_i⟩ + b_i,   (1)

where b_i ∈ R is an item bias. Adding an explicit item bias term has been shown to improve MF-based models in many studies [17, 26] and can be interpreted as the popularity of an item independent of a user's preferences. The personalized scores of a user then induce the personalized ranking ⩾_u by virtue of i ⩾_u j if and only if x̂_ui ≥ x̂_uj for i, j ∈ I. Note that in our implicit feedback scenario, there is no need for a user bias term b_u, as the personalized ranking ⩾_u would not change by definition.

The actual approximation depends on the optimization loss L(X, X̂), and over the years many were derived, most notably SVD++ [16], which was used to win the Netflix prize, WR-MF [14, 24] and PMF [32]. Since we are ultimately interested in an optimal ranking ⩾_u rather than an approximation of the original matrix, the Bayesian Personalized Ranking (BPR) method was proposed by Rendle et al. [29] to directly reflect this task in its loss L; it can be considered a differentiable analogy to Area Under the ROC Curve (AUC) optimization [29].

Despite the simple formulation (1) of MF, the actual interpretation of the latent vectors w_u and h_i is not as easy. The latent elements of an item i, i.e., h_ik, k ∈ K, might quantify the prevalence of some latent feature in an item, while the corresponding element of a user u, i.e., w_uk, quantifies the user's preference for this feature. The problem with this notion becomes apparent when considering negative elements, since, for example, a strong negative prevalence together with a negative preference can lead to a large positive term in the scalar product. This observation motivates the usage of MF methods that demand non-negativity for w_u and h_i.

4.1 Non-Negative Matrix Factorization

The non-negative Matrix Factorization (NMF) was introduced by Lee and Seung [19, 20] as a method to learn parts of objects, which can then be combined again to form a whole. NMF differs from MF in (1) only in that we have w_u ∈ R^{|K|}_{≥0}, h_i ∈ R^{|K|}_{≥0} and b_i ∈ R_{≥0}. Using the example of faces, Lee and Seung [20] showed that NMF is able to learn localized features, e.g., the eye area, which can then be used again to form a whole face by an additive mixture. Although the notion of feature prevalences within items and user preferences for certain features translates well to NMF, in many practical applications the results achieved with NMF, unfortunately, fall short of those of MF [21].

A mathematically more rigorous interpretation is that NMF finds a |K|-clustering of the column vectors of X, i.e., x_{*i} for i ∈ I, in the space of users u. Ding et al. [8] prove that NMF with least squares optimization is mathematically equivalent to the minimization of K-means clustering with some mild relaxation with respect to the orthogonality of H.
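To make the scoring function (1) and the induced ranking ⩾_u concrete, consider the following minimal NumPy sketch (not part of the paper's implementation; all sizes are hypothetical toy values). It also illustrates the sign-cancellation caveat discussed above: a negative prevalence multiplied by a negative preference contributes positively to the score.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy sizes, not taken from the experiments in Section 6.
n_users, n_items, k = 4, 6, 3

W = rng.normal(size=(n_users, k))  # unconstrained user factors w_u
H = rng.normal(size=(n_items, k))  # unconstrained item factors h_i
b = rng.normal(size=n_items)       # item biases b_i

# Personalized scores following Eq. (1): X_hat[u, i] = <w_u, h_i> + b_i
X_hat = W @ H.T + b

# The personalized ranking >=_u of user u is the item order induced
# by sorting the scores in descending order.
print(np.argsort(-X_hat[0]))

# Sign-cancellation caveat: prevalence -2.0 and preference -3.0 still
# contribute a large positive term (+6.0) to the scalar product.
print((-2.0) * (-3.0))
```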


Due to the constraints of NMF, we also have that the approximation matrix X̂ is non-negative. While the non-negativity constraint is naturally fulfilled by the interaction matrix X, this restriction might still be unnecessary in practice. As a middle ground between MF and NMF, a variation of NMF was proposed which requires non-negativity only of the matrix H, so that the restriction on X̂ is lifted.

4.2 Semi Non-negative Matrix Factorization

The Semi Non-negative Matrix Factorization (SNMF) is proposed by Ding et al. [9] to lift the non-negativity condition on X̂ = W H^t by setting the non-negativity constraint only on H. Analogously to NMF, they also show that SNMF is a relaxation of K-means clustering. We can interpret W = (w_{*1}, ..., w_{*|K|}) as the cluster centroids and H = (h_1, ..., h_{|I|}) as the cluster indicators for each data point x_{*i}. To the best knowledge of the author, SNMF finds very little to no application in the field of recommender systems, despite its interpretability.

5 LATENT DIRICHLET ALLOCATION

The latent Dirichlet allocation (LDA) is a generative statistical model, most often used in the context of natural language processing (NLP). It is an instance of a topic model in that it explains the observations by assuming a set of unobserved groups or topics where the observations within an assigned group share some common features. In the context of NLP, LDA postulates that each document is a mixture of a number of latent topics and that the frequency of each word within the document then depends on the frequency of this word within a topic and the mixture of topics.

We adapt the generative process of a smoothed LDA from Blei et al. [3] to our notation and context of users interacting with items. Given a set of items i ∈ I and |K| cohorts of users, each user u ∈ U has S_u = {1, ..., |S_u|} interactions, assuming the following generative process:

1. Choose θ_u ∼ Dirichlet(α) for u ∈ U.
2. Choose φ_k ∼ Dirichlet(β) for k ∈ K.
3. For each user u ∈ U and his or her interactions s ∈ S_u:
   (a) Choose a cohort z_us ∼ Categorical(θ_u).
   (b) Choose an item i_us ∼ p(i_us | φ_{z_us}) := Categorical(φ_{z_us}).

The hyperparameters α ∈ R^{|K|}_{>0} and β ∈ R^{|I|}_{>0} are typically chosen to be sparse, i.e., with components smaller than 1, in order to favor users belonging mainly to a single or only a few cohorts and cohorts with an item distribution that has low entropy, respectively. Also note that the actual number of user interactions S_u is not part of the generation process and is therefore taken as given. The subscript is dropped for ease of notation where it seems reasonable. The graphical model corresponding to the generative process is depicted in Figure 1.

Figure 1: Graphical model representation of the smoothed LDA model adapted to a collaborative filtering problem [3]. The boxes are called plates representing replicates. The upper box represents the latent cohorts, whereas the lower outer plate represents the users, while the inner plate represents the repeated choice of cohorts and items within the interactions of a user.

With respect to interpretability, the presented generative process matches the intuition about cohorts of users, i.e., topics in the original LDA, sharing similar item preferences quite well. Each user is then probabilistically assigned to some cohorts, assuming a high probability for a single cohort. With regard to the modeling of preferences in cohorts, some caveats arise.

First of all, the item preferences within the cohorts are only connected by the allocation of the users and not directly by the items. Taking up the point made for the item bias in Section 4, we argue that the probability of an item within different cohorts should also depend on the item itself. According to intuition, the movie Pulp Fiction, for example, should exhibit within each user cohort k a relatively high probability φ_ki of being interacted with, compared to other movies within the same cohort, due to its popularity. This limitation can be remedied by including an explicit item bias in the generative process that changes the categorical distribution in Step 3b accordingly.

Another caveat becomes clearer after considering the relationship of LDA and NMF. NMF with Kullback-Leibler divergence is an approximation of the evidence lower bound (ELBO) of LDA with symmetric Dirichlet priors [7]. The connection also becomes apparent, in a mathematically somewhat imprecise way, when we consider w_u and h_{*k} of NMF as non-l_1-normalized θ_u and non-l_1-normalized φ_k in LDA, respectively. The fact that w_u in NMF is not normalized allows cohort preferences to be weighted against item popularity depending on the user u. Intuitively, we demand this flexibility from a model, as users differ in whether they are more conformist or more individual in their perception of the popularity of items. To take this into account, we will introduce a user-specific weighting factor for the popularity of items, thus also resolving this caveat.

In total, we have identified two inductive biases that traditional LDA is missing and present a modified LDA model for recommender systems that incorporates these.
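For concreteness, the smoothed LDA generative process of Steps 1-3 above can be simulated in a few lines of Python. The following is a minimal sketch with hypothetical sizes, sparse hyperparameters, and a fixed number of interactions per user; it is an illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and sparse (< 1) hyperparameters, cf. Section 5.
n_users, n_items, n_cohorts = 5, 8, 3
alpha = np.full(n_cohorts, 0.1)
beta = np.full(n_items, 0.1)
n_interactions = 10  # |S_u|, taken as given rather than generated

# Step 2: cohort-specific item distributions phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(beta, size=n_cohorts)

interactions = {}
for u in range(n_users):
    theta_u = rng.dirichlet(alpha)  # Step 1: cohort mixture of user u
    items = []
    for s in range(n_interactions):
        z_us = rng.choice(n_cohorts, p=theta_u)         # Step 3a
        items.append(rng.choice(n_items, p=phi[z_us]))  # Step 3b
    interactions[u] = items
```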


5.1 LDA for Recommender Systems

The generative process of a traditional LDA is modified by incorporating the item popularity δ_i and the user's conformity λ_u. This results in the generative process of an LDA for recommender systems (LDA4Rec):

1. Choose θ_u ∼ Dirichlet(α) and λ_u ∼ LogNormal(µ_λ, σ²_λ) for u ∈ U.
2. Choose δ_i ∼ LogNormal(µ_δ, σ²_δ) for i ∈ I.
3. Choose φ_ki ∼ LogNormal(µ_φ, σ²_φ) for k ∈ K and i ∈ I.
4. For each user u ∈ U and his or her interactions s ∈ S_u:
   (a) Choose a cohort z_us ∼ Categorical(θ_u).
   (b) Choose an item i_us ∼ p(i_us | φ_{z_us}, δ_i, λ_u) := Categorical(∥c∥_1^{-1} c) with c = φ_{z_us} + λ_u · δ.

The hyperparameters µ_*, σ²_* can be used to incorporate prior knowledge about the relations of λ, δ and φ_k. In Step 4b, we see that the probability of a user interacting with an item not only depends on the preference assigned to the item by the cohort, i.e., φ_{z_us}, but also on the popularity δ_i of the item and the conformity λ_u of the user to the general popularity. The graphical model of LDA4Rec is illustrated in Figure 2.

Figure 2: Graphical model representation of the proposed LDA4Rec model that also incorporates the item popularity δ_i as well as the user's conformity λ_u to the general popularity.

We show now that MF, as introduced in Section 4, has an adjoint formulation that corresponds to the parameters φ_k, θ_u, δ_i and λ_u of LDA4Rec. Finally, this allows us to intuitively interpret the latent factors of MF.


5.2 Adjoint LDA4Rec Formulation of Matrix Factorization

We derive the adjoint LDA4Rec formulation of MF in two steps. First, we show that the personalized ranking given by MF can be reformulated as NMF. Assuming an NMF, we then transform the latent vectors of users and items into an l_1-normalized representation, which can be interpreted as parameters of the categorical distributions in Step 4b of the generative process of LDA4Rec.

Lemma. Given personalized ranking scores x̂_ui = ⟨w_u, h_i⟩ + b_i for users u ∈ U and items i ∈ I with w_u ∈ R^{|K|}, h_i ∈ R^{|K|} and b_i ∈ R that induce a total ranking ⩾_u for all users. Then there exists x'_ui = ⟨w'_u, h'_i⟩ + b'_i with w'_u ∈ R^{|K'|}_{≥0}, h'_i ∈ R^{|K'|}_{≥0} and b'_i ∈ R_{≥0} that induces the same total ranking ⩾_u for all users.

Proof. We define w'_u = [w⁺_u, w⁻_u], where

w⁺_uk = w_uk if w_uk ≥ 0 and 0 otherwise,
w⁻_uk = −w_uk if w_uk < 0 and 0 otherwise,

for k ∈ K. Analogously, we define h'_i = [h_i + s, −h_i + s] with s = (s_k)_{k∈K}, s_k = max_{i∈I} |h_ik|, and b'_i = b_i + max_{i∈I} |b_i|. By construction, we have w'_u ∈ R^{|K'|}_{≥0}, h'_i ∈ R^{|K'|}_{≥0} and b'_i ∈ R_{≥0} with K' = {1, ..., 2|K|}. We also have by construction that

⟨w'_u, h'_i⟩ = ⟨w⁺_u, h_i⟩ + ⟨w⁺_u, s⟩ − ⟨w⁻_u, h_i⟩ + ⟨w⁻_u, s⟩
            = ⟨w⁺_u − w⁻_u, h_i⟩ + ⟨w⁺_u + w⁻_u, s⟩
            = ⟨w_u, h_i⟩ + ⟨w⁺_u + w⁻_u, s⟩.

This is now applied to conclude that

x'_ui ≥ x'_uj ⇐⇒ ⟨w'_u, h'_i⟩ + b'_i ≥ ⟨w'_u, h'_j⟩ + b'_j
⇐⇒ ⟨w_u, h_i⟩ + ⟨w⁺_u + w⁻_u, s⟩ + b_i + max_{i∈I} |b_i| ≥ ⟨w_u, h_j⟩ + ⟨w⁺_u + w⁻_u, s⟩ + b_j + max_{i∈I} |b_i|
⇐⇒ ⟨w_u, h_i⟩ + b_i ≥ ⟨w_u, h_j⟩ + b_j
⇐⇒ x̂_ui ≥ x̂_uj.

Subsequently, x'_ui induces the same total ranking ⩾_u as x̂_ui. □

Theorem. Given personalized ranking scores x̂_ui = ⟨w_u, h_i⟩ + b_i for users u ∈ U and items i ∈ I with w_u ∈ R^{|K|}, h_i ∈ R^{|K|} and b_i ∈ R that induce a total ranking ⩾_u for all users. Then there exists x'_ui = ⟨v_u, g_i(u)⟩ with v_u ∈ R^{|K'|}_{≥0} and g_i(u) ∈ R^{|K'|}_{≥0}, where ∥v_u∥_1 = 1 and ∥g_i(u)∥_1 = 1, such that the same total ranking ⩾_u is induced for all users.

Proof. Without loss of generality, we assume w_u ∈ R^{|K'|}_{≥0}, h_i ∈ R^{|K'|}_{≥0} and b_i ∈ R_{≥0} by virtue of the lemma. We define

v_u = c_u^{-1} w_u ⊙ n_u,   g_i(u) = (h_i + t_u^{-1} b_i 1) ⊘ n_u,

where n_u = (n_uk)_{k∈K'} with n_uk = Σ_{i∈I} (h_ik + t_u^{-1} b_i), t_u = ∥w_u∥_1, and c_u = ⟨w_u, n_u⟩. We can neglect the pathological cases, i.e., t_u = 0 and n_uk = 0, as in the former case we have a trivial solution and ⩾_u only depends on b, whereas in the latter case we have Σ_{i∈I} h_ik = 0 and thus the latent vector (h_ik)_{i∈I} could just be removed. By construction, we now have that ∥v_u∥_1 = 1 and ∥g_i(u)∥_1 = 1. We can thus conclude that

x'_ui ≥ x'_uj ⇐⇒ ⟨v_u, g_i(u)⟩ ≥ ⟨v_u, g_j(u)⟩
⇐⇒ ⟨c_u^{-1} w_u ⊙ n_u, (h_i + t_u^{-1} b_i 1) ⊘ n_u⟩ ≥ ⟨c_u^{-1} w_u ⊙ n_u, (h_j + t_u^{-1} b_j 1) ⊘ n_u⟩
⇐⇒ ⟨w_u, h_i + t_u^{-1} b_i 1⟩ ≥ ⟨w_u, h_j + t_u^{-1} b_j 1⟩
⇐⇒ ⟨w_u, h_i⟩ + ⟨w_u, t_u^{-1} b_i 1⟩ ≥ ⟨w_u, h_j⟩ + ⟨w_u, t_u^{-1} b_j 1⟩
⇐⇒ ⟨w_u, h_i⟩ + b_i ≥ ⟨w_u, h_j⟩ + b_j
⇐⇒ x̂_ui ≥ x̂_uj.

Consequently, x'_ui induces the same total ranking ⩾_u as x̂_ui. □

In the light of these constructive proofs, and noting that SNMF was an intermediate step in the proof of the lemma, we can also make a statement about the expressive power of MF, NMF, SNMF, and LDA4Rec. In particular, we have seen that each latent dimension indexed by k of MF is split up into two corresponding dimensions in the NMF representation. Following our previous interpretation, those dimensions stand for cohorts having complementary item preferences.

Corollary. The expressive power, i.e., the number of possible total rankings ⩾_u that can be encoded, of MF is twice as high as in the case of NMF for a given latent vector length |K|. LDA4Rec has the same expressive power as NMF, and the expressive power of MF is equivalent to the expressive power of SNMF.

It is important to note here that we have only proved that the personalized ranking ⩾_u remains constant under some transformations that allow us to express an MF as NMF or as an adjoint LDA4Rec formulation. Since ⩾_u is eventually the result of an optimization problem with some loss function L, e.g., BPR for MF or likelihood for LDA4Rec, we make no statement about maintaining the optimality of some solution under these transformations.

6 EVALUATION

To support our theoretical considerations with empirical results, several experiments were conducted with real-world datasets. The source code of our implementation and the detailed results of all experiments are publicly available¹.

6.1 Datasets & Evaluation Metrics

For our experiments, three different datasets were used. MovieLens-1M encompasses approximately 1 million movie ratings across 6,040 users and 3,706 movies, while MovieLens-100K has roughly 100 thousand interactions across 610 users and 9,724 movies [11]. Goodbooks has approximately 6 million interactions across 53,425 users and 10,001 books [37]. We split these datasets randomly into train, validation, and test sets using 90% of interactions for training, and 5% each for validation and testing. The explicit feedback of these datasets was treated as implicit, i.e., the various user ratings were converted to 1, representing an interaction, while no rating means no interaction. Also, we limited the maximum number of interactions that a single user might have to 200 to avoid results that are skewed towards users with a high number of interactions due to our random split. This reduces the number of interactions in MovieLens-1M to approximately 661 thousand and in MovieLens-100K to 60 thousand, while the number of interactions in Goodbooks is unaffected.

As evaluation metrics, we use the mean reciprocal rank (MRR), precision at 10 (Prec@10), and recall at 10 (Recall@10) to measure the quality of our models. We define Prec@10 as the fraction of known positives in the first 10 positions of the ranked list of results, and Recall@10 as the number of known positives in the first 10 positions divided by the total number of known positives.

6.2 Experiments

6.2.1 Comparison of the Different Variants of Matrix Factorization. Despite the theoretical results from Subsection 5.2, which allow us to transform a personalized ranking solution found through MF into an NMF formulation, this does not necessarily mean that a solution found through direct application of NMF has the same quality. For this reason, we implemented MF, NMF, and SNMF using BPR as loss function and the Adam optimizer [15]. The implementations of NMF and SNMF differ from MF only in that they restrict the corresponding parameters to non-negative values using the sigmoid function. Our implementation heavily relies on Spotlight [18] and PyTorch [25]. In our experiments, various batch sizes and at least 3 different seeds for random initialization are used. In order to provide a baseline, we also implemented a purely popularity-based recommender (Pop).

6.2.2 Transformation of Matrix Factorization to the Adjoint LDA4Rec Formulation. Although we have mathematically proven in Subsection 5.2 that an MF solution can be transformed to NMF and subsequently also to an adjoint LDA formulation, floating-point arithmetic may pose challenges in a practical application. To follow up on this, we implemented the presented transformations as an optional preprocessing step before the evaluation. This allows us to evaluate a solution obtained from MF directly and after the transformation in order to compare the resulting personalized rankings ⩾_u, which should be equivalent in theory.

6.2.3 Comparison of LDA4Rec to Matrix Factorization. We implemented the LDA4Rec model as presented in Subsection 5.1 with the help of the Pyro deep universal probabilistic programming framework [2]. To cope with the high-dimensionality, low-sample-size setting, we decided on a stochastic variational inference (SVI) [13] approach using the Adam optimizer [15]. Due to the presence of discrete latent variables, a trace implementation of ELBO-based SVI [27, 35] with exhaustive enumeration over discrete sample sites was chosen. To predict the personalized ranking scores x̂_ui, we sampled items from the posterior predictive distribution [10] and counted the occurrences to obtain a personalized ranking. Ties were broken by adding a small non-negative random number.

Our implementation and thus the evaluation of LDA4Rec turned out to be several orders of magnitude slower than the MF-based methods. For this reason, the experiments comparing LDA4Rec to MF-based methods were performed on the smaller MovieLens-100K dataset. While the variational inference makes the training process quite fast, the bottleneck of LDA4Rec is the prediction of the personalized rankings, for which we need a high number of samples per user to compute a stable ranking. Thus, for each user, 10,000 items were sampled.
1 https://github.com/FlorianWilhelm/lda4rec 10,000 items were sampled.
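The constructive proofs of Subsection 5.2 translate almost directly into code. The following NumPy sketch (a simplified per-user illustration, not the actual preprocessing implementation from the repository) applies the lemma and theorem transformations and checks that the induced ranking is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3
W = rng.normal(size=(n_users, k))  # unconstrained MF factors and biases
H = rng.normal(size=(n_items, k))
b = rng.normal(size=n_items)

def to_nmf(W, H, b):
    """Lemma: sign-split and shift into non-negative factors of width 2|K|."""
    W2 = np.hstack([np.maximum(W, 0.0), np.maximum(-W, 0.0)])  # [w+, w-]
    s = np.abs(H).max(axis=0)            # s_k = max_i |h_ik|
    H2 = np.hstack([H + s, -H + s])      # [h_i + s, -h_i + s]
    b2 = b + np.abs(b).max()
    return W2, H2, b2

def to_adjoint(W2, H2, b2):
    """Theorem: l1-normalize the non-negative factors per user."""
    m, k2 = W2.shape
    V = np.empty((m, k2))
    G = np.empty((m, H2.shape[0], k2))
    for u in range(m):
        t_u = W2[u].sum()                # t_u = ||w_u||_1
        Hu = H2 + b2[:, None] / t_u      # rows are h_i + t_u^-1 * b_i * 1
        n_u = Hu.sum(axis=0)             # n_uk = sum_i (h_ik + t_u^-1 * b_i)
        V[u] = W2[u] * n_u / (W2[u] * n_u).sum()  # v_u with ||v_u||_1 = 1
        G[u] = Hu / n_u                  # g_i(u), element-wise division by n_u
    return V, G

W2, H2, b2 = to_nmf(W, H, b)
V, G = to_adjoint(W2, H2, b2)

u = 0
rank_mf = np.argsort(-(W @ H.T + b)[u])  # ranking from x_hat_ui
rank_adj = np.argsort(-(G[u] @ V[u]))    # ranking from <v_u, g_i(u)>
# Should print True, up to the floating-point inversions of Section 6.3.2.
print(np.array_equal(rank_mf, rank_adj))
```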

59
RecSys ’21, September 27-October 1, 2021, Amsterdam, Netherlands Florian Wilhelm

Table 1: Comparison of different variants of matrix factorization with varying number of latent parameters |K|.

Goodbooks MovieLens-1M
|K| Model MRR@10 Prec@10 Recall@10 MRR@10 Prec@10 Recall@10
Pop 0.023918 0.027079 0.047867 0.033488 0.033084 0.065908
4 NMF 0.014388 0.022056 0.038420 0.032927 0.033031 0.066281
SNMF 0.036186 0.040912 0.072584 0.046642 0.046211 0.097219
MF 0.038901 0.044045 0.079124 0.050495 0.048702 0.103090
8 NMF 0.015121 0.019261 0.033310 0.033445 0.033191 0.066516
SNMF 0.042435 0.047185 0.085326 0.053639 0.052028 0.108542
MF 0.044683 0.049835 0.090115 0.058240 0.057044 0.119924
16 NMF 0.019945 0.026436 0.046623 0.033461 0.033351 0.066191
SNMF 0.049671 0.055800 0.101652 0.062695 0.061526 0.130733
MF 0.050875 0.057127 0.103747 0.063849 0.062131 0.131973
32 NMF 0.028766 0.028268 0.050012 0.033223 0.033155 0.064652
SNMF 0.055048 0.062215 0.113179 0.068511 0.066453 0.141667
MF 0.056080 0.062841 0.114629 0.064506 0.064888 0.138600
48 NMF 0.032190 0.033548 0.060065 0.032996 0.033084 0.066660
SNMF 0.058595 0.066861 0.122292 0.068369 0.068143 0.146653
MF 0.058730 0.066321 0.121440 0.067427 0.065777 0.143905
64 NMF 0.034171 0.039272 0.070492 0.032925 0.032978 0.066924
SNMF 0.061561 0.070156 0.128119 0.069775 0.069050 0.151497
MF 0.060261 0.068837 0.126254 0.067474 0.066489 0.145744
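For reference, the metrics reported in Tables 1 and 2 can be computed per user as in the following sketch (averaging over users is omitted; we also assume that the reciprocal rank is truncated at position 10, as the label MRR@10 suggests):

```python
def mrr_at_10(ranked, positives):
    # Reciprocal rank of the first known positive within the top 10, else 0.
    for rank, item in enumerate(ranked[:10], start=1):
        if item in positives:
            return 1.0 / rank
    return 0.0

def prec_at_10(ranked, positives):
    # Fraction of the first 10 ranked items that are known positives.
    return sum(item in positives for item in ranked[:10]) / 10.0

def recall_at_10(ranked, positives):
    # Known positives in the top 10 divided by all known positives.
    return sum(item in positives for item in ranked[:10]) / len(positives)

ranked = [3, 7, 1, 0, 9, 4, 2, 8, 6, 5]  # hypothetical ranking of one user
positives = {7, 4, 11}                   # hypothetical held-out positives
print(mrr_at_10(ranked, positives),      # 0.5   (first hit at rank 2)
      prec_at_10(ranked, positives),     # 0.2   (2 of 10 are positives)
      recall_at_10(ranked, positives))   # ~0.67 (2 of 3 positives found)
```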

6.3 Results

6.3.1 Comparison of the Different Variants of Matrix Factorization. Table 1 shows the results of our first experiment comparing various variants of matrix factorization. For each metric, we report the value on the test set corresponding to the best value on the validation set for a model and its set of hyperparameters. Comparing the results with a fixed latent dimensionality of |K|, we see that MF outperforms SNMF and NMF for low |K|. Starting at |K| = 32, SNMF surpasses MF and achieves the overall best results at |K| = 64 by a small margin. Thus, the non-negativity constraints on H of SNMF that make it interpretable appear to cause a positive regularization effect.

For NMF we have a completely different picture; here we see results worse and slightly better than the popularity baseline for low and high values of |K|, respectively. From the theoretical results established in Subsection 5.2, we know that an MF result of latent dimensionality |K| could be represented as an NMF result with twice the latent dimensionality, but our experiments show that a solution of the same quality is not found. This is due to the fact that the optimization landscape exhibits many local minima where the Adam optimizer gets stuck.

6.3.2 Transformation of Matrix Factorization to the Adjoint LDA4Rec Formulation. Our evaluations show that the personalized ranking scores of a given user obtained from MF and from its adjoint LDA4Rec formulation lead to equivalent rankings except for a small number of inversions due to numerical reasons. The rounding errors of floating-point arithmetic cause on some occasions that if x̂_ui > x̂_uj where |x̂_ui − x̂_uj| ≤ ϵ, we have x'_ui < x'_uj with |x'_ui − x'_uj| ≤ ϵ for some small value ϵ, using the notation established in Subsection 5.2. Statistically, across all users, these inversions have no impact on the results, as the overall metrics MRR@10, Prec@10, and Recall@10 are equivalent, in accordance with our theoretical results.

Table 2: Comparison of MF to LDA4Rec with varying number of latent parameters |K| on MovieLens-100K.

Model     |K|   MRR@10     Prec@10    Recall@10
Pop        -    0.037069   0.031148   0.072635
MF         2    0.041071   0.033151   0.075098
           4    0.049304   0.041530   0.099996
           8    0.047378   0.042987   0.110723
          16    0.046867   0.045173   0.111719
          32    0.048462   0.049909   0.116455
LDA4Rec    2    0.052898   0.040984   0.105045
           4    0.066236   0.048816   0.130728
           8    0.053524   0.045902   0.119682
          16    0.058407   0.046995   0.125783
          32    0.058738   0.044991   0.115292

6.3.3 Comparison of LDA4Rec to Matrix Factorization. The benchmark results in Table 2 show that LDA4Rec outperforms MF in all but one metric, requiring a much lower latent dimensionality. Although the expressiveness of MF is twice as high as that of LDA4Rec given the same latent dimensionality, the results of LDA4Rec for |K| = 4 are an indication that the LDA4Rec model better represents reality due to its inductive biases.
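Relating to the inversion analysis of Subsection 6.3.2, near-ties between the original and transformed scores of a user can be counted with a sketch like the following (a quadratic-time illustration, not the evaluation code used in the experiments):

```python
def count_inversions(x, x_prime, eps=1e-9):
    """Count item pairs ordered differently by the two score vectors.

    Pairs closer than eps in either vector are counted as near-ties,
    mirroring the inversion analysis of Subsection 6.3.2.
    """
    genuine, near_ties = 0, 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if (x[i] - x[j]) * (x_prime[i] - x_prime[j]) < 0:
                if abs(x[i] - x[j]) <= eps or abs(x_prime[i] - x_prime[j]) <= eps:
                    near_ties += 1
                else:
                    genuine += 1
    return genuine, near_ties

# Example with a near-tie that flips order after a tiny perturbation:
print(count_inversions([0.5, 0.5 + 1e-12, 0.1], [0.5 + 1e-12, 0.5, 0.1]))
# -> (0, 1)
```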


7 CONCLUSION

From a theoretical point of view, we have discussed several variants of matrix factorization, i.e., MF, SNMF, NMF, and introduced the novel and interpretable LDA4Rec model, which extends the traditional LDA by incorporating parameters for the popularity of items and the conformity of users. We have proven that the personalized ranking induced by MF can be transformed so that the same personalized ranking is induced by NMF as well as by an adjoint formulation corresponding to the parameters of LDA4Rec. The adjoint LDA4Rec formulation of an MF allows easy interpretation of its parameters without sacrificing accuracy.

In several experiments, we have shown that SNMF performs slightly better than MF in some cases and is interpretable at the same time. Our evaluations also show that the result obtained by directly solving LDA4Rec outperforms MF with BPR loss while being more interpretable. Our empirical results, combined with the derivation of LDA4Rec as a mathematical model, suggest that its generative process represents reality well and thus provides means to interpret the results of traditional MF-based methods.

Following on from this, and assuming that the unknown, real-world process behind implicit user feedback is actually well represented by LDA4Rec, some conclusions about the effectiveness of Neural Collaborative Filtering (NCF) can also be drawn. NCF replaces the scalar product of MF with a learned similarity, e.g., using a multi-layer perceptron (MLP). Rendle et al. [30] show in a reproducibility paper that the scalar product outperforms several NCF-based methods and that it should thus be the default choice for combining embeddings, i.e., vectors of the latent factors. Similar results demonstrating the effectiveness of simple factorization-based models are shown by Dacrema et al. [5]. Our work underpins these findings, as the scalar product of embeddings can be interpreted as a mixture of several preferences, thus explaining its effectiveness. Since learning a multiplication and also a scalar product is possible in theory [22] but proves difficult in practice [1, 30, 33] for an MLP, MF-based methods will continue to have an advantage over NCF under this assumption.

REFERENCES
[1] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H. Chi. 2018. Latent Cross: Making Use of Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, Marina Del Rey, CA, USA, 46–54. https://doi.org/10.1145/3159652.3159727
[2] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2018. Pyro: Deep Universal Probabilistic Programming. arXiv:1810.09538 [cs, stat]. http://arxiv.org/abs/1810.09538
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[4] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2007. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. Advances in Neural Information Processing Systems 19 (2007). https://proceedings.neurips.cc/paper/2006/file/ec47a5de1ebd60f559fee4afd739d59b-Paper.pdf
[5] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), 101–109. https://doi.org/10.1145/3298689.3347058
[6] Anupam Datta, Sophia Kovaleva, Piotr Mardziel, and Shayak Sen. 2018. Latent Factor Interpretations for Collaborative Filtering. arXiv:1711.10816 [cs]. http://arxiv.org/abs/1711.10816
[7] Thiago de Paulo Faleiros and Alneu de Andrade Lopes. 2016. On the equivalence between algorithms for Non-negative Matrix Factorization and Latent Dirichlet Allocation. In 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning.
[8] Chris Ding, Xiaofeng He, and Horst D. Simon. 2005. On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 606–610. https://doi.org/10.1137/1.9781611972757.70
[9] C. H. Q. Ding, Tao Li, and M. I. Jordan. 2010. Convex and Semi-Nonnegative Matrix Factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1 (Jan. 2010), 45–55. https://doi.org/10.1109/TPAMI.2008.277
[10] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis (2nd ed.). Chapman and Hall/CRC.
[11] F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5, 4 (Jan. 2016), 1–19. https://doi.org/10.1145/2827872
[12] Antonio Hernando, Jesús Bobadilla, and Fernando Ortega. 2016. A non negative matrix factorization for collaborative filtering recommender systems based on a Bayesian probabilistic model. Knowledge-Based Systems 97 (April 2016), 188–202. https://doi.org/10.1016/j.knosys.2015.12.018
[13] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic Variational Inference. Journal of Machine Learning Research 14, 4 (2013), 1303–1347. http://jmlr.org/papers/v14/hoffman13a.html
[14] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, Pisa, Italy, 263–272. https://doi.org/10.1109/ICDM.2008.22
[15] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
[16] Yehuda Koren. 2009. The BellKor solution to the Netflix grand prize. Netflix prize documentation (2009), 1–10.
[17] Yehuda Koren and Robert Bell. 2015. Advances in Collaborative Filtering. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer US, Boston, MA, 77–118. https://doi.org/10.1007/978-1-4899-7637-6_3
[18] Maciej Kula. 2017. Spotlight. https://github.com/maciejkula/spotlight
[19] Daniel Lee and Hyunjune Seung. 2001. Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems 13 (2001).
[20] Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788–791. https://doi.org/10.1038/44565
[21] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. 2012. A Comparative Study of Collaborative Filtering Algorithms. arXiv:1205.3193 [cs, stat]. http://arxiv.org/abs/1205.3193
[22] Henry W. Lin, Max Tegmark, and David Rolnick. 2017. Why does deep and cheap learning work so well? Journal of Statistical Physics 168, 6 (Sept. 2017), 1223–1247. https://doi.org/10.1007/s10955-017-1836-5
[23] Sergey Nikolenko. 2015. SVD-LDA: Topic Modeling for Full-Text Recommender Systems. In Advances in Artificial Intelligence and Its Applications, Obdulia Pichardo Lagunas, Oscar Herrera Alcántara, and Gustavo Arroyo Figueroa (Eds.). Lecture Notes in Computer Science, Vol. 9414. Springer International Publishing, Cham, 67–79. https://doi.org/10.1007/978-3-319-27101-9_5
[24] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, Pisa, Italy, 502–511. https://doi.org/10.1109/ICDM.2008.16
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs.LG]
[26] Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop 2007, 5–8.
[27] Rajesh Ranganath, Sean Gerrish, and David M. Blei. 2013. Black Box Variational Inference. arXiv:1401.0118 [cs, stat]. http://arxiv.org/abs/1401.0118
[28] Vidyadhar Rao, KV Rosni, and Vineet Padmanabhan. 2017. Divide and Transfer: Understanding Latent Factors for Recommendation Tasks. In RecSysKTL, 1–8.
[29] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI '09), 452–461.
[30] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural Collaborative Filtering vs. Matrix Factorization Revisited. In Fourteenth ACM Conference on Recommender Systems (RecSys '20). Association for Computing Machinery, New York, NY, USA, 240–248. https://doi.org/10.1145/3383313.3412488


[31] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (May 2019), 206–215. https://doi.org/10.1038/s42256-019-0048-x
[32] Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS '07). Curran Associates Inc., Red Hook, NY, USA, 1257–1264.
[33] Andrew Trask, Felix Hill, Scott E. Reed, Jack Rae, Chris Dyer, and Phil Blunsom. 2018. Neural Arithmetic Logic Units. In Advances in Neural Information Processing Systems 31 (2018).
[34] Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11). ACM Press, San Diego, California, USA, 448. https://doi.org/10.1145/2020408.2020480
[35] David Wingate and Theophane Weber. 2013. Automated Variational Inference in Probabilistic Programming. arXiv:1301.1299 [cs, stat]. http://arxiv.org/abs/1301.1299
[36] WenBo Xie, Qiang Dong, and Hui Gao. 2014. A Probabilistic Recommendation Method Inspired by Latent Dirichlet Allocation Model. Mathematical Problems in Engineering 2014 (2014), 1–10. https://doi.org/10.1155/2014/979147
[37] Zygmunt Zajac. 2017. Goodbooks-10k: a new dataset for book recommendations. FastML (2017). http://fastml.com/goodbooks-10k
[38] Sheng Zhang, Weihong Wang, James Ford, and Fillia Makedon. 2006. Learning from Incomplete Ratings Using Non-negative Matrix Factorization. In Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 549–553. https://doi.org/10.1137/1.9781611972764.58
