

Knowledge-grounded Natural Language Recommendation Explanation

Anthony Colas*1, Jun Araki2, Zhengyu Zhou2, Bingqing Wang2, Zhe Feng2
1 University of Florida
2 Bosch Research North America
acolas1@ufl.edu
{jun.araki, zhengyu.zhou2, bingqing.wang, zhe.feng2}@us.bosch.com

*Work performed at Bosch Research.

arXiv:2308.15813v1 [cs.CL] 30 Aug 2023

Abstract

Explanations accompanied by a recommendation can assist users in understanding the decisions made by recommendation systems, which in turn increases a user's confidence and trust in the system. Recently, research has focused on generating natural language explanations in a human-readable format. Thus far, the proposed approaches leverage item reviews written by users, which are often subjective, sparse in language, and unable to account for new items that have not been purchased or reviewed before. Instead, we aim to generate fact-grounded recommendation explanations that are objectively described with item features while implicitly considering a user's preferences, based on the user's purchase history. To achieve this, we propose a knowledge graph (KG) approach to natural language explainable recommendation. Our approach draws on user-item features through a novel collaborative filtering-based KG representation to produce fact-grounded, personalized explanations, while jointly learning user-item representations for recommendation scoring. Experimental results show that our approach consistently outperforms previous state-of-the-art models on natural language explainable recommendation.

1 Introduction

Current approaches to natural language (NL) explainable recommendation focus on generating user reviews (Chen et al., 2018; Wang et al., 2018a; Li et al., 2020, 2021; Yang et al., 2021). Instead of providing a justification for the item recommendation, the models learn to output language that is commonly found in personal reviews. This reliance on reviews poses three problems: 1) the explanations are not objective, because users typically review items based on their sentiment (Wu et al., 2018); 2) reviews are often sparse, because they describe a user's own experience (Asghar, 2016); 3) systems that rely on reviews cannot account for new items which have never been purchased before, nor can they provide justifications for item catalogs which may not have reviews available. Given this, it may be difficult for a user to reason as to why an item was recommended, hindering the user's experience (Tintarev and Masthoff, 2015). The user may then lose trust in such systems, which do not provide objective and accurate explanations.

We propose KnowRec, a KG-grounded approach to natural language explainable recommendation which not only personalizes recommendations and explanations with user information, but also draws on facts about a particular item via its corresponding KG to generate objective, specific, and data-driven explanations for the recommended item. For example, given the movie "Paths of Glory", previous work aims to generate explanations such as "it's not the best military movie" and "good performances all around", which are subjective, not specific to a given movie, and rely on data from pre-existing reviews. Instead, by leveraging an item KG such as <director, Stanley Kubrick>, <conflict, World War 1>, <country, France>, a more objective and precise explanation can be produced, such as: "A World War I French colonel defends three soldiers. Directed by Stanley Kubrick." The item features 'World War I', 'colonel', and 'defends three soldiers' in the explanation objectively describe the movie, while they can implicitly reflect the user's preference for war movies, based on his/her purchase history.

KnowRec is also more advantageous than prior work in terms of scalability to unpurchased items. Previously, KG-based recommendation systems have effectively addressed the cold-start problem by linking users and items through shared attributes (Wang et al., 2019, 2020, 2021). Similarly, there exists a kind of cold-start problem for new items in recommendation explanation systems that rely on reviews.
KnowRec demonstrates that KGs can help solve this problem through existing item-level features by adapting KG-to-text (Koncel-Kedziorski et al., 2019; Ke et al., 2021; Colas et al., 2022) elements into explainable recommendation, producing item-level explanations to justify a purchase. The KG-based approach is particularly important for recommendation scenarios in special domains where personal reviews are not available and review-based approaches are impractical.

Our approach presents several algorithmic novelties. First, inspired by work on KG recommendation (Wang et al., 2020) and KG-to-text (Colas et al., 2022), we devise a novel user-item KG lexical representation, viewing the input through a collaborative filtering lens, where users are graphically represented via their previous purchases and connected to a given item KG. Our representation differs from previous work on explainable NL generation, which relies on ID and sparse keyword features: such work extracts keywords from reviews to represent the user and item, linearizing all such features to encode and produce an NL explanation (Li et al., 2020, 2022). Next, KnowRec adapts a graph attention encoder for the user-item representation via a new masking scheme. Finally, the encoded KG representation is simultaneously decoded into a textual explanation, while we innovatively dissociate the jointly learned user-item representation to compute a user-item similarity for recommendation scoring.

To evaluate our approach, we first devise a method of constructing (KG, Text) pairs from product descriptions as described in Section 5, where we extract entities and relations for the item KGs. We construct two such datasets from publicly available recommendation datasets to evaluate our proposed model on both the explanation and recommendation tasks, and focus on natural language generation (NLG) metrics for the explanation task as in previous work. We adapt and compare previous baseline models for the recommendation explanation task as described in Section 6, where we substantially outperform previous models on explanation while achieving recommendation performance similar to models that rely on user and item ID-based features.

2 Related Work

2.1 Explainable Recommendation

Previous works on NL explainable recommendation focus on generating user-provided reviews, where the output is typically short, subjective, and repetitive (Chen et al., 2018; Hou et al., 2019; Wang et al., 2018b; Yang et al., 2021; Li et al., 2017, 2020, 2021; Hui et al., 2022). Extractive approaches have been proposed to score and select reviews as explanations (Chen et al., 2018; Li et al., 2019). Conversely, generative approaches (Yang et al., 2021; Li et al., 2017, 2020, 2021; Sun et al., 2020; Hui et al., 2022) leverage user/item features to generate new reviews as explanations. Currently, the task is still limited by review data, so these models cannot adequately handle new items. Unlike previous work, we introduce KGs to the explainable recommendation task to provide objective, information-dense, specific explanations. Our approach can then handle new items which have not been reviewed yet.

Inspired by recent advancements in explainable recommendation models such as Li et al. (2021), we enhance BART (Lewis et al., 2020), renowned for graph-to-text tasks, to incorporate user-item knowledge graphs. This adaptation enables us to generate recommendation scores along with natural language explanations.

2.2 Knowledge Graph Recommendation

Leveraging KGs for recommendation systems has gained increasing attention (Wang et al., 2019, 2020, 2021; Xie et al., 2021; Du et al., 2022). In neighborhood-based methods (Hamilton et al., 2017; Welling and Kipf, 2016; Veličković et al., 2018), propagation is performed iteratively over the neighborhood information in a KG to update the user-item representation. While recent work has produced explanations via KGs, these works focus on structural explanations such as knowledge graph paths (Ma et al., 2019; Fu et al., 2020; Xian et al., 2019) and rules (Zhu et al., 2021; Chen et al., 2021; Shi et al., 2020), which are not as intuitive for users to understand. We focus on generating NL explanations, which have been shown to be a preferred type of explanation (Zhang et al., 2020). For a fair comparison, we compare to prior work that produces NL explanations; unlike these works, we generate NL explanations instead of using paths along the KG as explanations.
[Figure 1 omitted: schematic of 1) the user's item graph representation X0, 2) the user-item encoder (Global Attention in parallel with masked User-Item Graph Attention, combined via a gather operation), and 3a) rating prediction via pooled user/item vectors and 3b) explanation generation via an auto-regressive decoder.]

Figure 1: Illustration of KnowRec. 1) The User's Item KG Representation Module. 2) The Global and User-Item Graph Attention Encoder. 3) The Output Module for rating prediction and explanation.

2.3 Knowledge Graph-to-Text Generation

In KG-to-text, pre-trained language models such as GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020) have seen success in generating fluent and accurate verbalizations of KGs (Chen et al., 2020; Ke et al., 2021; Ribeiro et al., 2021; Colas et al., 2022). We devise an encoder for user-item KGs and a decoder for both the generation and recommendation tasks. Specifically, we formulate a novel masking scheme for user-item KGs to structurally encode user and item features, while generating a recommendation score from their latent representations. Thus, our task is two-fold, fusing elements from the graph-to-text generation and KG recommendation domains.

3 Problem Formulation

Following prior work, we denote U as a set of users, I as a set of items, and the user-item interaction matrix as Y \in R^{|U| \times |I|}, where y_{uv} = 1 if user u \in U and item v \in I have interacted. Here, we represent user u as the user's purchase history u = {v_{u_i}}, where v_{u_i} denotes the i-th item purchased by user u in the past. Next, we define a KG as a multi-relational graph G = (V, E), where V is the set of entity vertices and E \subset V \times R \times V is the set of edges connecting entities with a relation from R. Each item v has its own KG, g_v, comprising an entity set V_v and a relation set R_v which contain features of v. We devise a set of item-entity alignments A = {(v, e) | v \in I, e \in V}, where (v, e) indicates that item v is aligned with an entity e.

Given a user u and an item v represented by its KG g_v, the task is to generate an explanation of natural language sentences E_{u,v} as to why item v was recommended for the user u. As in previous multi-task explainable recommendation models, KnowRec calculates a rating score r_{u,v} that measures u's preference for v. By jointly training on recommendation and explanation generation, our model can contextualize the embeddings more adequately with training signals from both tasks.

4 Model

Figure 1 illustrates our model with the user-item graph constructed through collaborative filtering signals, an encoder, and inference functions for explanation generation and rating prediction.

4.1 Input

The input of KnowRec comprises a user u represented by the user's purchase history {v_{u_i}} and an item v represented by its KG g_v, as introduced in Section 3. Let v_c denote the item currently considered by the system. The item v_c is aligned with one of the entities through A and becomes the center node of g_v, as shown in Figure 1.

Because our system leverages a Transformer-based encoder, we first linearize the input into a string. For the user u = {v_{u_i}}, we initialize it by mapping each purchased item v_{u_i} into tokens of the item's name. For the item v represented by g_v, we decompose g_v into a set of tuples {t_{v_j}}, where t_{v_j} = (v_c, r_{v_j}, n_{v_j}), n_{v_j} \in V_v, and r_{v_j} \in R_v. We linearize each tuple t_{v_j} into a sequence of tokens using lexicalized names of the nodes and the relation. We then concatenate all the user tokens and the item tokens to form the full input sequence x. For example, suppose the current item v_c is the book Harry Potter, the KG has a single tuple (Harry Potter, author, J.K. Rowling), and the user previously purchased two books, The Lord of the Rings and The Little Prince. In this case, the input sequence is x = The Lord of the Rings The Little Prince Harry Potter author J.K. Rowling.
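To make the linearization concrete, below is a minimal Python sketch of the input construction described above; the function name and signature are ours for illustration, not from the paper's released code.

    def linearize_input(purchase_names, triples):
        """Concatenate the user's purchase-history names with lexicalized KG tuples.

        purchase_names: names of items previously bought by the user, e.g.
            ["The Lord of the Rings", "The Little Prince"]
        triples: (head, relation, tail) tuples from the item KG g_v, where the
            head is the current item v_c, e.g.
            [("Harry Potter", "author", "J.K. Rowling")]
        """
        user_tokens = " ".join(purchase_names)
        # Each tuple (v_c, r_vj, n_vj) is verbalized with the lexical names of
        # the center item, the relation, and the neighbor node.
        item_tokens = " ".join(f"{h} {r} {t}" for h, r, t in triples)
        return f"{user_tokens} {item_tokens}"

    x = linearize_input(
        ["The Lord of the Rings", "The Little Prince"],
        [("Harry Potter", "author", "J.K. Rowling")],
    )
    # x == "The Lord of the Rings The Little Prince Harry Potter author J.K. Rowling"

(Appendix B.2 additionally inserts [user], [graph], [head], [relation], and [tail] labels between these segments; we omit them here for readability.)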
We map the tokens to randomly initialized vectors or pre-trained word embeddings such as those in BART (Lewis et al., 2020), obtaining X_0 = [...; V_{u_i}; ...; T_{v_j}; ...], where V_{u_i} and T_{v_j} are word vector representations of v_{u_i} and t_{v_j}, respectively. Unlike previous work on KG recommendation (Wang et al., 2020), where users/items are represented via purchase history and propagated KG information, our system infuses KG components to provide a recommendation and its natural language explanation. Our system also differs from prior studies on explainable recommendation in that, while they focus on reviews and thus encode users/items as random vectors with additional review-based sparse token features as auxiliary information (Li et al., 2021), we directly encapsulate KG information into the input representation.

4.2 Encoder

Collaborative KG Representation. Because KnowRec outputs a natural language explanation grounded on KG facts, as well as a recommendation score for the user-item pair, we need to construct a user-item-linked KG to represent an input through its corresponding lexical graph features. To do so, we leverage collaborative signals from Y, combining u with v by linking previously purchased products v_{u_i} to the current item v_c from g_v, forming a novel lexical user-item KG. Additionally, we connect all previously purchased items together in order to graphically model collaborative filtering effects for rating prediction, as illustrated in Figure 1. Note that the relations between previously purchased items and the current item do not require a lexical representation in our model. The resulting graph goes through the Transformer architecture, as described below.

Global Attention. Transformer architectures have recently been adopted for the personalized explainable recommendation task (Li et al., 2021). We similarly leverage Transformer encoder layers (Vaswani et al., 2017), referred to as Global Attention, to encode the input representation with self-attention as:

    X_l = \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
    \quad Q = X_{l-1}W_l^Q, \; K = X_{l-1}W_l^K, \; V = X_{l-1}W_l^V,    (1)

where X_l is the output of the l-th layer in the encoder, d_k is a tunable parameter, and Q, K, and V represent the Query, Key, and Value vectors, respectively, each of which is calculated with the corresponding parameter matrix W in the l-th layer. Note that the Transformer encoder may be initialized via a pre-trained language model.

User-Item Graph Attention. We further propose User-Item Graph Attention encoder layers, which compute graph-aware attention via a mask to capture the user-item graph's topological information, and which run in parallel with the Global Attention encoder layers.

We first extract the mask M_g \in R^{m \times m} from the user-item linked KG, where m is the number of relevant KG components, i.e., nodes and edges that are lexically expressed in the KG (edges between v_{u_i} and v_c not included). In M_g, each row/column refers to a KG component. M_{ij} = 0 if there is a connection between components i and j (e.g., "J.K. Rowling" and "author") and -\infty otherwise. In addition, we assume all item components, i.e., the previous purchases and the current item, are mutually connected when devising M_g.

For each layer (referred to as the l-th layer), we then transfer its input X_{l-1} into a component-wise representation X_g^{l-1} \in R^{m \times d}, where d is the word embedding size. Motivated by Ke et al. (2021), we perform this transfer by employing a pooling layer that averages the vector representations of all the word tokens contained in the corresponding node/edge names per relevant KG component. With the transferred input X_g^{l-1}, we proceed to encode it using User-Item Graph Attention with the graph-topology-sensitive mask as follows:

    \tilde{X}_g^l = \mathrm{Attn}_M(Q', K', V') = \mathrm{softmax}\!\left(\frac{Q'K'^\top}{\sqrt{d_k}} + M_g\right)V',    (2)

where the query Q', key K', and value V' are computed with the transferred input and learnable parameters in the same manner as Equation (1).

Lastly, we combine the outputs of the Global Attention encoder and the User-Item Graph Attention encoder in each layer. As the two outputs have different dimensions, we first expand \tilde{X}_g^l to the same dimension as X_l through a gather operation, i.e., broadcasting each KG component-wise representation in \tilde{X}_g^l to every encompassing word of the corresponding component and connecting those representations. We then add the expanded \tilde{X}_g^l to X_l through element-wise addition, generating the l-th encoding layer's output:

    \tilde{X}_l = \mathrm{gather}(\tilde{X}_g^l) + X_l.    (3)
Note that in this section we illustrate the Global Attention encoder, the User-Item Graph Attention encoder, and their combination with single-head attention. In practice, we implement both encoders with multi-head attention as in Vaswani et al. (2017).

4.3 Rating Prediction

For the rating prediction task, we first separate and isolate user u and item v features via masking. Once isolated, we perform a mean pool on all their respective tokens and linearly project u and v, then take a dot product between the two new vector representations as follows:

    \tilde{x}_u = \mathrm{pool}_{\mathrm{mean}}(\tilde{X}_L + m_u)W_u,
    \quad \tilde{x}_v = \mathrm{pool}_{\mathrm{mean}}(\tilde{X}_L + m_v)W_v,
    \quad \hat{r}_{u,v} = \mathrm{dot}(\tilde{x}_u, \tilde{x}_v),    (4)

where m_u and m_v are the user and item masks that denote which tokens belong to the user and the item, the W's are learnable parameters, and L refers to the last layer of the encoder.
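A small sketch of the rating head, under our reading of Eq. (4) in which the additive masks m_u and m_v simply select the user and item tokens before pooling; names are illustrative.

    import numpy as np

    def rating_score(X_L, user_idx, item_idx, Wu, Wv):
        """X_L: (n_tokens, d) final-layer encoder states; user_idx/item_idx:
        integer indices of the tokens belonging to the user and the item."""
        x_u = X_L[user_idx].mean(axis=0) @ Wu  # pooled, projected user vector
        x_v = X_L[item_idx].mean(axis=0) @ Wv  # pooled, projected item vector
        return float(x_u @ x_v)                # predicted rating r_hat_{u,v}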
4.4 Explanation Generation

Before generating a final output text for our explanation, we pass the representation through a fully connected linear layer as the encoder hidden state and decode the representation into its respective output tokens through an auto-regressive decoder, following previous work (Lewis et al., 2020).

4.5 Joint-learning Objective

As previously noted, our system produces two outputs: a rating prediction score \hat{r}_{u,v} and a natural language explanation E_{u,v}, which justifies the rating by verbalizing the item's corresponding KG. We thus perform multi-task learning to learn both tasks and manually define regularization weights \lambda, as in similar multi-task paradigms, to weight the two tasks. Taking L_r and L_e to represent the recommendation and explanation cost functions, respectively, the multi-task cost L becomes:

    L = \lambda_r L_r + \lambda_e L_e,    (5)

where \lambda_r and \lambda_e denote the rating prediction and explanation regularization weights, respectively.

We define L_r using Mean Square Error (MSE), in line with conventional item recommendation and review-based explainable systems:

    L_r = \frac{1}{|U||I|} \sum_{u \in U \wedge v \in I} (r_{u,v} - \hat{r}_{u,v})^2,    (6)

where r_{u,v} denotes the ground-truth score.

Next, as in other NLG tasks (Lewis et al., 2020; Zhang et al., 2020), we adopt Negative Log-Likelihood (NLL) as the explanation's cost function L_e. Thus, we define L_e as:

    L_e = -\frac{1}{|U||I|} \sum_{u \in U \wedge v \in I} \frac{1}{|E_{u,v}|} \sum_{t=1}^{|E_{u,v}|} \log p_{e_t}^t,    (7)

where p_{e_t}^t is the probability of the decoded token e_t at time step t.
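The joint objective of Eqs. (5)-(7) reduces to a weighted sum of a rating MSE and a token-level NLL; a minimal sketch follows, averaging over the observed training pairs. The default weights mirror the values \lambda_r = 0.01 and \lambda_e = 1 reported in Appendix B.2.

    import numpy as np

    def joint_loss(r_true, r_pred, token_log_probs, lam_r=0.01, lam_e=1.0):
        """r_true, r_pred: arrays of gold and predicted ratings per (u, v) pair.
        token_log_probs: one array per pair holding log p(e_t) for every
        ground-truth explanation token (length |E_{u,v}|)."""
        L_r = np.mean((np.asarray(r_true) - np.asarray(r_pred)) ** 2)  # Eq. (6)
        L_e = -np.mean([lp.mean() for lp in token_log_probs])          # Eq. (7)
        return lam_r * L_r + lam_e * L_e                               # Eq. (5)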
5 Dataset

Although KG-recommendation datasets exist, they do not contain any supervision signals for NL descriptions. Thus, to evaluate our explainable recommendation approach in a KG-aware setting, along with our KnowRec model, we introduce two new datasets based on the Amazon-Book and Amazon-Movie datasets (He and McAuley, 2016): (1) Book KG-Exp and (2) Movie KG-Exp.

Recall that our task requires an input KG along with an NL explanation and recommendation score. Because it is more efficient to extract KGs from text than to manually annotate each KG with text, we take a description-first approach, automatically extracting KG elements from the corresponding text. Given the currently available data, we leverage item descriptions as a proxy for the NL explanations, while constructing a user-item KG from an item's features and the user's purchase history.

We first extract entities from a given item description via DBpedia Spotlight (Mendes et al., 2011), a tool that detects mentions of DBpedia (Auer et al., 2007) entities in NL text. We then query for each entity's most specific type and use those types as relations that connect the item to its corresponding entities. We construct a user KG from the purchase history, e.g., [Purchase_1, Purchase_2, ..., Purchase_n], as a complete graph where each purchase is connected. Finally, we connect all the nodes of the user KG to the item KG, treating each user purchase as a one-hop neighbor of the current item. To ensure the KG-explanation correspondence, we filter out any sentences in the explanation in which no entities were found. To measure objectivity, we calculate the proportion of a given KG's entities that appear in the explanation, called entity coverage (EC), defined in Appendix B.3. We summarize our dataset statistics in Table 1 and present a more comprehensive comparison in Appendix A.2.
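As a rough sketch of this description-first pipeline, the snippet below annotates an item description with the public DBpedia Spotlight REST endpoint and turns each detected entity into an item-centered tuple. The endpoint URL and response fields are our recollection of the Spotlight API, and picking the last listed type as the "most specific" one is a simplification; treat all of this as an assumption rather than the paper's exact procedure.

    import requests

    def extract_item_kg(item_name, description, confidence=0.5):
        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": description, "confidence": confidence},
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        triples = []
        for res in resp.json().get("Resources", []):
            # The paper uses each entity's most specific type as the relation;
            # taking the last comma-separated type is a crude stand-in for that.
            types = [t for t in res.get("@types", "").split(",") if t]
            relation = types[-1] if types else "related"
            triples.append((item_name, relation, res["@surfaceForm"]))
        return triples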
Name          #Users   #Items  #Interactions  KG   #Es      #Rs  #Triples  EC     Desc.  Words/Sample
Book KG-Exp   396,114  95,733  2,318,107      Yes  195,110  392  745,699   71.45  Yes    99.96
Movie KG-Exp  131,375  18,107  788,957        Yes  59,036   363  146,772   71.32  Yes    96.35

Table 1: Statistics of our Book KG-Exp and Movie KG-Exp benchmark datasets. #Es, #Rs, and Desc. denote the number of entities, the number of relations, and whether the dataset contains parallel descriptions.

6 Experiments

6.1 Evaluation Metrics

We assess explainable recommendation following prior work: 1) on recommendation performance and 2) on explanation performance. For the explanation generation task, we employ standard natural language generation (NLG) metrics: BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). We measure the diversity and detail-oriented features of the generated sentences using Unique Sentence Ratio (USR) (Li et al., 2020, 2021), and use EC, instead of the feature coverage ratio, for coverage, due to our non-review-based explanations.

6.2 Implementation

We train our newly proposed KnowRec model in two settings of the Book and Movie KG-Exp datasets: a full training set and a few-shot setting, where 1% of the data is used. Because our method provides item-level explanations based on KGs, we split the datasets based on their labeled description/explanation, and as such, we experiment in a setting where items in the test set can be unseen during training. By doing so, we handle a unique case that has not been considered in previous research relying on item reviews. The train/validation/test sets are split 60/20/20. For KnowRec, we use BART as our pre-trained model, with a Byte-Pair Encoding (BPE) vocabulary (Radford et al., 2019). We compare our approach to available explanation generation baselines, including those that leverage user and item IDs and those which utilize word-level features. We adapt the baselines to use the KG information and detail them in Appendix B.1. For more details regarding our experimental settings, please see Appendix B.2.

7 Results and Analysis

7.1 Explanation Results

In Table 2, we evaluate the models' text reproduction performance using BLEU and ROUGE metrics, while also examining their explainability through USR and EC analysis.

For BLEU and ROUGE, KnowRec consistently outperforms all baselines, achieving a BLEU-4 score of 10.71 and a ROUGE-L F1 score of 27.71 on Movie KG-Exp, and a BLEU-4 score of 12.60 and a ROUGE-L F1 score of 28.29 on Book KG-Exp. This suggests that previous baselines, designed for review-level explanation, are inadequate for producing longer and more objective explanations. Of the baselines, PETER, which utilizes the whole lexical input, adapts best. However, KnowRec makes use of user-item graph encodings, which may lead to better generation of the item KG features mentioned in the ground-truth texts. While PEPLER's (Li et al., 2022) pretrained approach aids in fluent sentence generation, KnowRec excels in generating contextually relevant words around feature-level terms. Unlike PEPLER, which creates concise reviews based on user-item IDs, KnowRec utilizes graph attention to interconnect related components for comprehensive NL text explanations.

In terms of explainability, KnowRec also generates much more diverse sentences (USR), especially compared to models that do not leverage pre-trained models. Note that while PEPLER has a USR score comparable to KnowRec's on the Book KG-Exp dataset, it does not similarly produce high-quality and related sentences according to the NLG metrics. Our results show that while the ground truth is based on item-level features, the generated output is still personalized, as further discussed in Section 7.5. Also note the high discrepancy in EC, where the entity-level features are generated in the output text. As our goal is to generate objective and specific explanations, the EC can help real-world users understand what a certain recommended product is about and how it compares to other products. Therefore, it is crucial that explainable models capture these features when producing justifications for recommendations.
Dataset    Model        BLEU-1  BLEU-4  USR   R2-F   R2-R   R2-P   RL-F   RL-R   RL-P   EC
Movie      Att2Seq      8.86    0.39    0.30  2.08   1.41   8.47   8.07   11.65  9.49   0.44
KG-Exp     NRT          11.76   0.57    0.03  1.50   1.40   3.25   7.20   11.70  8.05   0.98
           Transformer  8.67    0.18    0.33  1.21   0.91   6.55   6.58   9.54   9.69   0.82
           PETER        14.66   3.99    0.55  5.07   4.26   11.66  15.06  16.67  23.03  10.58
           PEPLER       11.68   0.13    0.46  0.56   0.63   0.54   8.90   10.92  9.53   0.78
           KnowRec      37.02   10.71   0.83  15.49  15.12  18.15  27.71  28.71  37.10  67.97
Book       Att2Seq      19.51   1.85    0.43  5.08   3.76   12.15  12.98  16.55  20.89  0.86
KG-Exp     NRT          21.06   2.59    0.10  6.18   4.88   11.44  15.57  18.67  24.36  1.57
           Transformer  16.90   2.01    0.12  5.68   4.23   11.94  13.66  15.57  26.87  2.08
           PETER        27.93   8.39    0.71  11.94  10.36  18.68  21.24  23.30  28.02  17.39
           PEPLER       16.07   1.20    0.90  2.39   2.63   2.26   13.03  16.34  12.24  0.74
           KnowRec      38.53   12.60   0.92  19.78  19.44  23.22  28.29  29.43  35.28  69.50

Table 2: Comparison of neural generation models on the Movie KG-Exp and Book KG-Exp datasets.

Dataset     Model        BLEU-1  BLEU-4  USR   R2-F   R2-R   R2-P   RL-F   RL-R   RL-P   EC
Movie       Att2Seq      2.63    0.00    0.00  0.00   0.00   0.00   2.73   4.25   2.63   0.01
KG-Exp      NRT          8.78    0.32    0.01  1.84   1.08   11.73  7.12   10.17  17.97  0.07
(Few-shot)  Transformer  12.23   0.27    0.16  1.24   1.07   3.54   6.97   9.54   12.00  1.18
            PETER        12.28   0.68    0.36  2.33   1.45   12.49  12.00  13.18  18.03  5.44
            PEPLER       12.58   0.41    0.01  1.26   1.44   1.18   10.73  12.63  10.38  0.11
            KnowRec      33.89   7.53    0.87  13.41  12.60  17.67  24.48  25.63  35.66  63.92
Book        Att2Seq      16.58   1.53    0.22  4.68   3.10   15.58  13.30  15.28  21.32  0.26
KG-Exp      NRT          19.12   2.19    0.01  6.11   4.36   13.99  15.18  20.47  16.78  1.19
(Few-shot)  Transformer  12.69   1.22    0.08  3.60   3.16   8.65   9.77   15.64  10.58  1.57
            PETER        18.38   2.87    0.45  7.12   5.07   17.50  14.74  17.66  17.52  4.23
            PEPLER       7.96    0.26    0.02  0.67   0.63   0.83   7.59   10.07  7.04   0.54
            KnowRec      28.93   7.94    0.93  17.28  16.05  22.45  24.84  25.19  36.60  60.46

Table 3: Comparison of neural generation models on the Movie KG-Exp and Book KG-Exp datasets in the few-shot learning setting (1% of training data).

7.2 Few-shot Explanation Results

Real-world recommendation systems may face low-resource problems, where only a small amount of training data with few item descriptions is available but an item database exists. To reflect this practical situation, we also evaluate a few-shot setting where the training data is 1% of its total size. As in previous experiments, we set the user-item size for KnowRec to 5. We show the results of this few-shot experiment in Table 3. KnowRec consistently and significantly outperforms the other explainable baselines on both the Book and Movie datasets in terms of text quality, sentence diversity (USR), and entity coverage (EC), showing that our approach is effective even in data-scarce scenarios. Like KnowRec, PEPLER also leverages a pre-trained model, namely GPT-2; however, unlike KnowRec, it does not adapt well to generating item-specific explanations. The second-best model, PETER, fully leverages the KG features in its approach; however, it does not produce diverse sentences. Note that those models that completely rely on user and item IDs fail to produce quality explanations, as shown by their respective BLEU and ROUGE scores, demonstrating the task to be more complex than previous explanation tasks relying on repetitive, short, and already existing user reviews.

         Book KG-Exp              Movie KG-Exp
         All          Few         All          Few
Model    R     M      R     M     R     M      R     M
PMF      3.50  3.35   3.50  3.35  3.31  3.08   3.32  3.08
SVD++    1.03  0.80   1.01  0.64  1.20  0.79   1.25  0.98
NRT      0.98  0.74   1.07  0.73  1.17  0.93   1.23  0.97
PETER    1.01  0.79   1.03  0.82  1.24  1.03   1.24  1.00
PEPLER   0.96  0.72   1.07  0.72  1.14  0.91   1.27  0.96
KnowRec  1.04  0.75   1.04  0.72  1.22  0.92   1.21  0.93

Table 4: Performance comparison on the recommendation task with respect to RMSE and MAE, denoted as R and M in the table, respectively.

             BLEU-4↑  USR↑  RL-F↑  RMSE↓  MAE↓
KnowRec      7.94     0.93  24.84  1.04   0.78
- Recomm.    8.32     0.93  24.90  -      -
- UIG Att.   7.75     0.91  24.80  1.03   0.78

Table 5: Ablation study on the Book KG-Exp (Few-Shot) dataset. 'Recomm.' means the joint learning with recommendation scoring, and 'UIG Att.' denotes the user-item graph attention.

7.3 Recommendation Performance

Table 4 shows the recommendation performance on all KG Explanation datasets. We report the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics to evaluate the recommendation task. As shown, all results except PMF are relatively close. PMF significantly underperforms due to the cold-start problem presented by new items. KnowRec achieves performance comparable to the other strong baselines, despite being the only model that uses lexical features for the recommendation task, while the other models learn the task through user/item IDs; thus, KnowRec may need more data to learn these parameters. Additionally, because we learn the recommendation task through lexical features, our model provides an interpretable solution that can be directly compared to the produced NL explanations.

7.4 Ablation Study

We perform ablation studies to analyze the effects of the recommendation and user-item graph components on Book KG-Exp, as shown in Table 5. Due to computational resources, we performed the study in the few-shot setting. We first examine the results of KnowRec without the recommendation module in the second row (- Recomm.). By removing the 'Recomm.' component, the performance on the NLG metrics improves, as the task is now a single-objective generative task instead of a multi-objective one. We next study the effects of the User-Item Graph Attention encoders on KnowRec's explainability and recommendation performance (- UIG Att.). As shown by - UIG Att., even with a smaller training dataset of 1% of the full data, removing this component leads to a slight decrease in the NLG metrics, BLEU and ROUGE, and less diverse sentences (USR). The representation and attention masking on the user-item graph, which connects and encodes the local item information, may therefore give a better representation of the input, which is in turn decoded to produce an explanation. This effect may be more pronounced on larger datasets. Furthermore, from the NLG metric results in Table 5, we can infer that our rating module does not significantly hinder the performance of the generation component of KnowRec.

7.5 Qualitative Analysis

To grasp KnowRec's effectiveness, we analyze explanations from the Movie/Book KG-Exp test sets. These explanations are both grammatically smooth and adept at (1) integrating robust item features for factual insights and (2) tailoring personalized content based on diverse user purchase histories (examples in Appendix C, Table 7).

Consider the first two rows of the table, pertaining to the movie Journey to the Center of the Earth. We can see two different (but syntactically similar) generated explanations for two different users. In one case, the user has bought mystery and fantasy movies such as Stitch in Crime, Columbo, and The Lord of the Rings, and the output integrates related words such as investigates and mysterious to personalize the explanation. The second case mentions classic and novel, possibly because the second user's purchase history involves Disney classics and movies based on novels such as The Hardy Boys and Old Yeller. While the input KG does not explicitly state that Journey to the Center of the Earth is a novel, such information may be inferred from the KG's relations and supported through the user's related purchases. In both cases the output closely matches the ground truth, verbalizing item features from the KG such as Jules Verne and magnetic storm, suggesting that our model is robust in describing the explanation content while still implicitly reflecting the user's purchase history.

8 Conclusion

We propose KnowRec, a knowledge-aware model for generating NL explanations and recommendation scores for user-item pairs. To evaluate KnowRec, we devise and release a semi-supervised, large-scale KG-NL recommendation dataset in the book and movie domains. Extensive experiments on both datasets demonstrate the suitability of our model compared to recently proposed explainable recommendation models. We hope that by proposing this KG-guided task, we will open up avenues for research focused on detailed, objective, and specific explanations which can also scale to new items and users, rather than the current review-focused work. In future work, we plan to incorporate user-specific KGs and other pre-trained language models into our model in order to verbalize both user- and item-level feature explanations.
9 Limitations

While our approach generates objective, descriptive explanations while implicitly capturing personalized aspects of a user's purchase history, our dataset labels are currently limited to item-specific explanations, with the book-related KGs typically containing author-related information and thus being more information-dense than the movie-related KGs. These limitations are due to the currently available datasets, and future work can explore constructing a more personalized user-item KG for explainable recommendation. Furthermore, in our approach, we represent users through their item purchase history. Therefore, while we handle the zero-purchase case for items (items that have not been purchased before), the zero-purchase case for users (users without a purchase history) is outside the scope of our work. In the future, we will extend our approach to user-attributed datasets to handle such cases.

10 Ethics Statement

All our experiments are performed over publicly available datasets. We do not use any identifiable information about crowd workers who provided annotations for these datasets. Neither do we perform any additional annotations or human evaluations in this work. We do not foresee any risks in using KnowRec if the inputs to our model are designed as per our procedure. However, our models may exhibit unwanted biases that are inherent in pre-trained language models. This aspect is beyond the scope of the current work.

References
Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, pages 722–735, Berlin, Heidelberg. Springer-Verlag.

Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, pages 1583–1592.

Hanxiong Chen, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. 2021. Neural collaborative reasoning. In Proceedings of the Web Conference 2021, pages 1516–1527.

Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020. KGPT: Knowledge-grounded pre-training for data-to-text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8635–8648, Online. Association for Computational Linguistics.

Anthony Colas, Mehrdad Alvandipour, and Daisy Zhe Wang. 2022. GAP: A graph-aware language model framework for knowledge graph-to-text generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5755–5769, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 623–632, Valencia, Spain. Association for Computational Linguistics.

Yuntao Du, Xinjun Zhu, Lu Chen, Baihua Zheng, and Yunjun Gao. 2022. HAKG: Hierarchy-aware knowledge gated network for recommendation. arXiv preprint arXiv:2204.04959.

Zuohui Fu, Yikun Xian, Ruoyuan Gao, Jieyu Zhao, Qiaoying Huang, Yingqiang Ge, Shuyuan Xu, Shijie Geng, Chirag Shah, Yongfeng Zhang, et al. 2020. Fairness-aware explainable recommendation over knowledge graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 69–78.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, volume 30.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517.

Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.

Min Hou, Le Wu, Enhong Chen, Zhi Li, Vincent W Zheng, and Qi Liu. 2019. Explainable fashion recommendation: A semantic attribute region guided approach. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4681–4688.

Bei Hui, Lizong Zhang, Xue Zhou, Xiao Wen, and Yuhui Nian. 2022. Personalized recommendation system based on knowledge embedding and historical behavior. Applied Intelligence, 52(1):954–966.

Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. 2021. JointGT: Graph-text joint representation learning for text generation from knowledge graphs. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2526–2538, Online. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text generation from knowledge graphs with graph transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2284–2293, Minneapolis, Minnesota. Association for Computational Linguistics.

Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chenliang Li, Cong Quan, Li Peng, Yunwei Qi, Yuming Deng, and Libing Wu. 2019. A capsule network for recommendation and explaining what you like and dislike. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–284.

Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 755–764.

Lei Li, Yongfeng Zhang, and Li Chen. 2021. Personalized transformer for explainable recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4947–4957, Online. Association for Computational Linguistics.

Lei Li, Yongfeng Zhang, and Li Chen. 2022. Personalized prompt learning for explainable recommendation. arXiv preprint arXiv:2202.07371.

Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 345–354.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Weizhi Ma, Min Zhang, Yue Cao, Woojeong Jin, Chenyang Wang, Yiqun Liu, Shaoping Ma, and Xiang Ren. 2019. Jointly learning explainable rules for recommendation with knowledge graph. In The World Wide Web Conference, pages 1210–1221.

Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8.

Andriy Mnih and Russ R Salakhutdinov. 2007. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8).

Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2021. Investigating pretrained language models for graph-to-text generation. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, pages 211–227, Online. Association for Computational Linguistics.

Shaoyun Shi, Hanxiong Chen, Weizhi Ma, Jiaxin Mao, Min Zhang, and Yongfeng Zhang. 2020. Neural logic reasoning. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1365–1374.

Peijie Sun, Le Wu, Kun Zhang, Yanjie Fu, Richang Hong, and Meng Wang. 2020. Dual learning for explainable recommendation: Towards unifying user preference prediction and review generation. In Proceedings of The Web Conference 2020, pages 837–847, New York, NY, USA. Association for Computing Machinery.

Nava Tintarev and Judith Masthoff. 2015. Explaining recommendations: Design and evaluation. In Recommender Systems Handbook, pages 353–382. Springer.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the International Conference on Learning Representations.

Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018a. Explainable recommendation via multi-task learning in opinionated text data. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 165–174.

Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 950–958.

Xiang Wang, Tinglin Huang, Dingxian Wang, Yancheng Yuan, Zhenguang Liu, Xiangnan He, and Tat-Seng Chua. 2021. Learning intents behind interactions with knowledge graph for recommendation. In Proceedings of the Web Conference 2021, pages 878–887.

Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018b. A reinforcement learning framework for explainable recommendation. In 2018 IEEE International Conference on Data Mining, pages 587–596. IEEE.

Ze Wang, Guangyan Lin, Huobin Tan, Qinghong Chen, and Xiyang Liu. 2020. CKAN: Collaborative knowledge-aware attentive network for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 219–228.

Max Welling and Thomas N Kipf. 2016. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations.

Zhen Wu, Xin-Yu Dai, Cunyan Yin, Shujian Huang, and Jiajun Chen. 2018. Improving review representations with user attention and product attention for sentiment classification. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).

Yikun Xian, Zuohui Fu, Shan Muthukrishnan, Gerard De Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 285–294.

Lijie Xie, Zhaoming Hu, Xingjuan Cai, Wensheng Zhang, and Jinjun Chen. 2021. Explainable recommendation based on knowledge graph and multi-objective optimization. Complex & Intelligent Systems, 7(3):1241–1252.

Aobo Yang, Nan Wang, Hongbo Deng, and Hongning Wang. 2021. Explanation as a defense of recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 1029–1037.

Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, and Meng Jiang. 2022. A survey of knowledge-enhanced text generation. ACM Computing Surveys, 54(11s):1–38.

Yongfeng Zhang, Xu Chen, et al. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval, 14(1):1–101.

Yaxin Zhu, Yikun Xian, Zuohui Fu, Gerard de Melo, and Yongfeng Zhang. 2021. Faithfully explainable recommendation via neural logic reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3083–3090, Online. Association for Computational Linguistics.

A Dataset Details

A.1 Source Data

Amazon product data: The Amazon product dataset is a large-scale, widely used dataset for product recommendation containing product reviews and metadata from Amazon. Data fields include ratings, texts, descriptions, and category information (He and McAuley, 2016). Because the dataset contains item descriptions, we can leverage such data to extract entities and relations to construct a KG that matches the textual description. Thus, these descriptions provide objective, item-distinct explanations as to why a user may have purchased a product. Although a user may not have reviewed an item, the dataset provides an existing description of the item, allowing models to produce explanations for such items. To keep our datasets large-scale, we focus on Amazon Book and Amazon Movie 5-core, the two largest Amazon product datasets.
A.2 Dataset Comparison

Table 6 summarizes existing popular recommendation system datasets utilized for both the explainable recommendation and KG recommendation tasks. We report traditional recommendation features, KG-recommendation features, and explainable recommendation features. Last.FM (Wang et al., 2019), Book-Crossing (Wang et al., 2020), Movie-Lens20M (Wang et al., 2020), and Amazon-book (KG) (Wang et al., 2019) are popular benchmarks for the KG-recommendation task but contain no NL explanation features. Yelp-Restaurant, Amazon Movies & TV, and TripAdvisor-Hotel have recently been experimented with for the explainable recommendation task (Li et al., 2020), but lack KG data and rely on user reviews as proxies for the explanation. In contrast, our datasets, referred to as Book KG-Exp and Movie KG-Exp, contain both KGs and the corresponding parallel item descriptions associated with those KGs as explanations. Compared to Book KG-Exp, the Movie KG-Exp dataset contains fewer unique KG elements, with 59,036 vs. 195,110 unique entities and 146,772 vs. 745,699 triples, while having similarly sized explanations.

A.3 Dataset Statistics

We provide detailed statistics on both the Book KG-Exp and Movie KG-Exp datasets in Figure 2. As seen in Figures 2(a) and 2(b), the distributions of KGs with respect to the number of tuples show similar long-tail shapes in both datasets. We observe from Figures 2(c) and 2(d) that a similar trend of long-tail distributions exists with respect to explanation lengths, where the lengths in the book dataset tend to skew further right than the lengths in the movie dataset.

B Experiment Details

B.1 Baseline Models

We introduce several baselines in explainable recommendation, describing how to adapt the models to the KG setting, as these models were primarily formulated for user review data.

Att2Seq (Dong et al., 2017) was designed for review generation; we adapt it to the item explanation setting. As in Li et al. (2021), we remove the attention module, as it makes the generated content unreadable.

NRT (Li et al., 2017) is a multi-task model for rating prediction and tip generation, based on user and item IDs. As in previous work, we use our explanations as tips and remove the model's L2 regularizer (Li et al., 2020, 2021), which causes the model to generate identical sentences.

Transformer (Vaswani et al., 2017; Li et al., 2021) treats user and item IDs as words. We adapt the model first introduced for review generation by Li et al. (2021), integrating the KG entities and relations instead of the review item features.

PETER (Li et al., 2021) utilizes both user/item IDs and corresponding item features extracted from user reviews to generate a recommendation score, an explanation, and context related to the item features. The model also develops a novel PETER mask between item/user IDs and the corresponding features/generated text. As our task does not take a feature-based approach, for a fair comparison we remove the context prediction module and input the whole KG into the model as the corresponding item features.

PEPLER (Li et al., 2022) is an extension of PETER, where the Transformer is replaced with a pre-trained language model, namely GPT-2, to generate both recommendation scores and explanations. We take the best-performing setting for a fair comparison, namely the MLP setting for recommendation scores.

In addition to NRT, PETER, and PEPLER, as in previous work, we compare with two traditional baselines for recommendation: PMF (Mnih and Salakhutdinov, 2007) and SVD++ (Koren, 2008).

B.2 Hyper-parameters and Settings

As in Li et al. (2021), we adapt the baseline codes to our setting and set the vocabulary size for NRT, Att2Seq, and PETER to 20,000 by keeping the most frequent words. For PETER and PEPLER, we set the number of context words to 128. For all approaches, including KnowRec, we set the length of the explanation to 128, as the mean length is about 94 for both datasets. For KnowRec, we use an embedding size of 512 and a Byte-Pair Encoding (BPE) vocabulary (Radford et al., 2019) of size 50,256, with 2 encoding layers. Following KG generation work (Ribeiro et al., 2021), we split the tokens in the linearized graph with their corresponding labels: [user], [graph], [head], [relation], and [tail].
Name               #Users   #Items   #Interactions  KG   #Es      #Rs  #Triples   Desc.  Words/Sample
Last.FM            23,566   48,123   3,034,796      Yes  58,266   9    464,567    No     -
Book-Crossing      276,271  271,379  1,048,575      Yes  25,787   18   60,787     No     -
Movie-Lens20M      138,159  16,954   13,501,622     Yes  102,569  32   499,474    No     -
Amazon-book (KG)   70,679   24,915   847,733        Yes  88,572   39   2,557,746  No     -
Yelp-Restaurant    27,147   20,266   1,293,247      No   -        -    -          No     12.32
Amazon Movies      7,506    7,360    441,783        No   -        -    -          No     14.14
TripAdvisor-Hotel  9,765    6,280    320,023        No   -        -    -          No     13.01
Book KG-Exp        396,114  95,733   2,318,107      Yes  195,110  392  745,699    Yes    99.96
Movie KG-Exp       131,375  18,107   788,957        Yes  59,036   363  146,772    Yes    96.35

Table 6: Comparison of widely used datasets divided by task: KG-Recommendation (top), Explainable Recommendation (middle), and KG Explainable Recommendation (bottom).

[Figure 2 omitted: four histograms, (a) Book KGs and (b) Movie KGs binned by number of tuples, and (c) Book and (d) Movie explanations binned by number of tokens.]

Figure 2: Distributions for number of tuples (Figures 2(a) and 2(b)) and tokens (Figures 2(c) and 2(d)) per sample.

For both datasets, we set the batch size to 128 and the max user and KG sizes to 64 and 192, respectively. We set the max node and edge length to 60. We experiment with λr and λe and find that 0.01 and 1 give us the best BLEU performance without affecting the recommendation prediction scores, as in Li et al. (2022); see Figure 3 for an analysis on Movie KG-Exp (few-shot). The model's parameters were trained for 20 epochs and optimized via Adam (Kingma and Ba, 2015) with a learning rate of 1e-3 and an Adam ϵ of 1e-08, and the gradients were clipped at 1.0. All other attention-related hyper-parameters were the same as used in previous work (Lewis et al., 2020). We decoded the text via beam search (Hokamp and Liu, 2017) with a beam size of 5. Experiments were performed on NVIDIA RTX 3090 GPUs. We evaluate the model based on the validation set's total loss instead of BLEU score due to computational limitations, saving the top 10 models for testing, because the model with the least loss does not necessarily result in the best NLG metrics.

[Figure 3 omitted: BLEU-4 score (averaged over the top 10 runs) as a function of λr on the Book and Movie KG-Exp few-shot datasets.]

Figure 3: Effect of λr on the BLEU-4 score for the Book and Movie KG-Exp datasets. We average all top 10 runs for a more comprehensive comparison.

Because of computation limitations, for evaluation purposes we randomly sample and evaluate on 1% of the test set, containing 4,491 and 1,456 samples for the Book and Movie datasets, respectively. Note that the size of the test set is comparable to those of other text generation tasks such as KG-to-text (Gardent et al., 2017) and summarization (Yu et al., 2022).

B.3 Entity Coverage

We define entity coverage (EC) as the percentage of unique entities, originating in an item KG, which appear in the recommendation explanation. More formally, for each head and tail entity e in an item KG's set of entities E, we calculate the token overlap with the explanation output for those entities. The EC score ranges in [0, 1], and we report the percentage value in our results. The Book KG-Exp and Movie KG-Exp datasets have EC scores of 71.45% and 71.32%, indicating that a descriptive, objective explanation should have a high EC score. The formula for EC is:

    EC = \frac{\#\text{KG entities found in output}}{\#\text{KG entities}},

i.e., the recall of the entities in a KG.
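A direct transcription of this recall-style metric; the tokenization and matching details are our assumptions.

    def entity_coverage(kg_entities, explanation):
        """Fraction of unique item-KG entities whose tokens all appear in the
        generated explanation (reported as a percentage in the paper)."""
        text_tokens = set(explanation.lower().split())
        entities = {e.lower() for e in kg_entities}
        found = sum(
            1 for e in entities
            if all(tok in text_tokens for tok in e.split())
        )
        return found / len(entities) if entities else 0.0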

C Generated Examples
Table 7 presents some examples generated by KnowRec from the Book and Movie KG-Exp datasets. As discussed in Section 7, we find the examples to be fluent and grammatical, while incorporating both item features and implicit user information based on a user's purchase history. The generated examples closely match the ground truth, while integrating some language derived from the user. Note that our aim here is to illustrate examples that showcase the implicit user preferences, rather than the generated outputs that most closely match the ground-truth descriptions. As with other state-of-the-art NLG models, KnowRec does have a tendency to hallucinate by adding extra information that may not necessarily be accurate. As can be seen from the NLG metrics in Table 2, KnowRec alleviates the hallucination problem by incorporating the user-item KG information. Such limitations may be further mitigated by leveraging denser background KGs to generate from, while also incorporating item features from the user's purchase history.
Example 1 — item: Journey to the Center of the Earth
  User history: Stitch in Crime; Columbo; The Lord of the Rings Trilogy
  Item KG: <writer, Jules Verne>; <disease, magnetic storm>
  Generated explanation: a scientist (jules verne) investigates a magnetic storm that sends a mysterious beam of light from earth to the center of earth.
  Ground truth: jules verne's professor lindenbrook leads a trip through monsters, mushrooms and a magnetic storm.

Example 2 — item: Journey to the Center of the Earth
  User history: The Hardy Boys; Old Yeller; Walt Disney Treasures
  Item KG: <writer, Jules Verne>; <disease, magnetic storm>
  Generated explanation: a group of scientists, inspired by jules verne's classic novel, take a trip to the magnetic storm at the center of the earth.
  Ground truth: jules verne's professor lindenbrook leads a trip through monsters, mushrooms and a magnetic storm.

Example 3 — item: Murder in St. Giles
  User history: Nice Dragons Finish Last; Silent Circle; The Traitor in the Tunnel
  Item KG: <person, Ashley Gardner>; <newspaper, USA Today>; <newspaper, NY Times>; <pseudonym, ...>
  Generated explanation: ashley gardner is a usa today bestselling author. under the pseudonym jennifer ashley, she has collectively written more than 70 mystery and historical novels.
  Ground truth: ny times and usa today bestselling author ashley gardner is a pseudonym for ny times bestselling author jennifer ashley.

Example 4 — item: Batgirl Vol. 1: Silent Knight
  User history: Star Wars Vol. 1: Skywalker Strikes; Black Canary and Zatanna: Bloodspell; Silver Surfer Volume 1: New Dawn
  Item KG: <comicscreator, Kelley Puckett>; <comicscharacter, Batman>; <comicscharacter, Supergirl>; <publisher, DC Comics>
  Generated explanation: kelley puckett is an american comic book writer best known for his work on batman for dc comics. he is the author of numerous books for young readers, including supergirl, the ultimate guide to character development and batgirl, a guide to writing for comics, both published by image.
  Ground truth: kelley puckett has been writing comics for far too long, by general consensus. he has worked on such series as batman adventures, batgirl and kinetic and supergirl for dc comics.

Example 5 — item: Best of the Incredible Dr. Pol
  User history: Jurassic World; The Incredible Dr. Pol - Season 2
  Item KG: <animal, ...>; <pet, dog>
  Generated explanation: your favorite dr. pol vet and his pet dog return for a second season of this hilarious and heartwarming animated adventure.
  Ground truth: from sick goats to sick pet pigs, dr. pol and his colleagues have their hands full with a variety of cases and several animal emergencies.

Example 6 — item: How to Draw and Paint Fairies
  User history: How to Draw and Paint Fairyland, a Step-by-step Guide; Mermaids in Paradise, an Artist's Coloring Book
  Item KG: <person, Linda Ravenscroft>; <automobile, wide range>
  Generated explanation: linda ravenscroft is an award-winning children's book author and illustrator who has illustrated a wide range of books and magazines, including the best-selling how to draw and paint series.
  Ground truth: linda ravenscroft has produced a wide range of images in fairyland motifs, including fine art prints, exclusive giftware, and fantasy art books.

Table 7: Examples generated by KnowRec on the Book/Movie KG-Exp datasets. We follow the format of the user-item KG representation in Figure 1, where the user history corresponds to a user's previously purchased items (red nodes in the original figure) and the item KG to the recommended item's features (blue nodes); for clarity and brevity, we only show the relevant parts of the item graphs. In the generated explanations, item features come directly from the item KG representation, whereas other content words are implicitly captured by KnowRec based on the user's purchase history.
