3.2.1 Graph Encoding.
To accommodate embeddings for users, items, and entities, KR-GCN encodes the heterogeneous graph, which includes user-item interactions and the KG, by utilizing the graph representation model GCN. The embeddings of nodes are computed by performing graph convolution iteratively and aggregating information from local network neighborhoods. In this way, the structural information of the heterogeneous graph can be modeled and the high-order connectivities in the graph can be captured. Moreover, because GCN updates each node’s embedding using the embeddings of adjacent nodes, the problem of node ambiguity can be alleviated effectively. For example, the nodes Dumas Jr. and Dumas Sr. can be distinguished according to their neighborhood nodes The Lady of the Camellias and The Three Musketeers in the graph.
In KR-GCN, we adopt the weighted sum aggregator to capture the features of each given node and its neighborhood nodes, where the neighborhood nodes are aggregated via a mean function. The sum aggregator combines these two representations with a non-linear activation function
\(\sigma\). Specifically, for a given node
\(i\) (i.e., a user, an item, or an entity), the embedding at the 0th layer can be initialized randomly or with original node features pre-trained on external knowledge. At higher layers, the node embeddings are computed via the graph convolution operation. The operation on the node
\(i\) at the (
\(l + 1\))th layer can be abstracted as
where
\(e^{(l+1)}_i\) and
\(e^{(l)}_i\) are the embeddings of the node
\(i\) at the (
\(l\,+\,1\))th layer and
\(l\)th layer, respectively.
\(N_i\) is the set of
\(i\)’s neighborhood nodes and
\(e^{(l)}_j\) is the
\(j\)th neighborhood node of
\(i\) at the
\(l\)th layer.
\(W^{(l)}_{self}\) and
\(W^{(l)}\) are the transformation weight matrices of
\(i\) and
\(i\)’s neighborhood node, respectively. After
\(L\) layers of convolution operations, there are
\(L\) representations for the given node
\(i\). Then, the weighted sum operation is applied to the embeddings calculated at each layer, i.e., from
\(e_i^{(1)}\) to
\(e_i^{(L)}\), so as to encode the given node
\(i\). Let
\(e_i\) denote the final representation of the node
\(i\), and the weighted sum calculation of
\(i\)’s representation is defined as
where
\(\alpha _l\) denotes the weight of the
\(l\)th layer, i.e., the importance of the
\(l\)th layer to the final target node representation. Combining the embeddings that are calculated at each layer can help capture different semantic information and make the representation more comprehensive [
16].
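To make the encoding procedure concrete, the following PyTorch sketch implements one possible version of the layer-wise aggregation and the weighted combination of layer embeddings described above. The class names, the choice of tanh as the activation function \(\sigma\), and the dense adjacency matrix are illustrative assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class GraphEncoderLayer(nn.Module):
    """One graph convolution layer with the weighted sum aggregator: the node's own
    embedding and the mean of its neighbors' embeddings are transformed separately,
    summed, and passed through a non-linear activation (tanh assumed here)."""
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim, bias=False)   # W_self^{(l)}
        self.w_neigh = nn.Linear(dim, dim, bias=False)  # W^{(l)}

    def forward(self, e, adj):
        # e: (num_nodes, dim) embeddings at layer l; adj: (num_nodes, num_nodes) adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = adj @ e / deg                       # mean over the neighborhood N_i
        return torch.tanh(self.w_self(e) + self.w_neigh(neigh_mean))

class GraphEncoder(nn.Module):
    """Stacks L layers and combines the per-layer embeddings with layer weights
    alpha_l to obtain the final node representations e_i."""
    def __init__(self, num_nodes, dim, num_layers):
        super().__init__()
        self.embed = nn.Embedding(num_nodes, dim)        # 0th-layer embeddings
        self.layers = nn.ModuleList([GraphEncoderLayer(dim) for _ in range(num_layers)])
        self.alpha = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, adj):
        e = self.embed.weight
        layer_outputs = []
        for layer in self.layers:                        # e^{(1)} ... e^{(L)}
            e = layer(e, adj)
            layer_outputs.append(e)
        stacked = torch.stack(layer_outputs, dim=0)      # (L, num_nodes, dim)
        return (self.alpha.view(-1, 1, 1) * stacked).sum(dim=0)  # weighted sum over layers
```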
3.2.2 Path Extraction and Selection.
The latent relations among nodes are the high-order connectivities in the graph, which can be explicitly extracted and formed as multi-hop paths between node pairs, such as user-item pairs. Because the high-order connectivities between the given users and the candidate items can reflect the potential interests of users, we explicitly extract multi-hop paths between each user-item pair over the heterogeneous graph to obtain representations of the user’s potential interests. Reasoning on paths suffers from the problem of error propagation, because considering all paths might involve irrelevant ones. Defining meta-paths might alleviate the problem of error propagation, but designing proper meta-paths requires a deep understanding of domain-specific knowledge, which is labor-intensive and almost impossible to generalize. To cope with the error propagation and knowledge dependence issues, we prune irrelevant paths between each user-item pair. For each user-item pair (
\(u\),
\(v\)), where
\(u\in \mathcal {U}\) and
\(v\in \mathcal {V}\), we can efficiently find reasoning paths between
\(u\) and
\(v\) over the graph and form the selected paths as a path set
\(S_{uv}\). Since the number of paths between the user-item pairs grows exponentially with path hops, we extract multi-hop paths with the limitation that the number of hops in every single path does not exceed
\(l\). Following the setting in the work of [
47], we set
\(l\) = 3 in the experiment to gather three-hop paths. The
\(j\)th path in the path set
\(S_{uv}\) can be described as
where
\(i\) denotes a single node (i.e., a user, an item, or an entity) and
\(r\) is the relation that connects two nodes within the path
\(S_{uv}[j]\).
\(i_1 = u\),
\(i_n = v\), and
\(n\) is the number of nodes within
\(S_{uv}[j]\).
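As an illustration of this extraction step, the sketch below enumerates candidate paths between \(u\) and \(v\) with at most three hops, before any pruning. The adjacency-list graph format and the function name are assumptions made for this example.

```python
def extract_paths(graph, u, v, max_hops=3):
    """Depth-first enumeration of paths u -> ... -> v with at most `max_hops` hops.
    `graph` maps each node to a list of (relation, neighbor) pairs."""
    paths = []

    def dfs(node, path):
        if node == v and len(path) > 1:        # reached the candidate item
            paths.append(list(path))
            return
        if (len(path) - 1) // 2 >= max_hops:   # path stores nodes and relations alternately
            return
        for relation, neighbor in graph.get(node, []):
            if neighbor in path:               # avoid cycles
                continue
            path.extend([relation, neighbor])
            dfs(neighbor, path)
            del path[-2:]

    dfs(u, [u])
    return paths
```

Each returned path is an alternating node-relation sequence starting at the user and ending at the candidate item, matching the form of \(S_{uv}[j]\) above.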
Considering that iterating over all paths for each user-item pair is inefficient in real-world large-scale KGs, we use a heuristic path search algorithm for path extraction and selection. Specifically, we design a transition-based method to determine the triple-level scores and utilize nucleus sampling to adaptively select triples within the paths between every user-item pair. This strategy captures the transition features within a path and does not rely on any meta-path patterns. We use \(\Delta _{k-1}\) and \(\Delta _k\) to denote the selected node sets at the (\(k-1\))th hop and the \(k\)th hop of the path search. For each node \(i_{k-1}\) in the node set \(\Delta _{k-1}\), we search its neighbors in the graph as the next-hop nodes of \(i_{k-1}\). For a neighbor node \(i_k\), the score of the corresponding triplet \((i_{k-1},r_{k-1},i_k) \in T_{k-1,k}\) is calculated via the KGE method to measure its quality, where \(T_{k-1,k}\) is the triple set between the (\(k-1\))th and the \(k\)th hops.
In knowledge-aware recommendation, the recommendation data includes user-item interactions and the KG, and there are associations and constraints within the triples of the KG. In this article, the scores of all triplets are calculated via the KGE method TransH [
48]. The KGE methods, such as TransE and TransH, can be used to measure the quality of the paths by calculating the scores of the triplets within the paths, so as to prune irrelevant paths from noisy graphs [
56]. TransH is a transition-based KGE method, which associates each relation with a relation-specific hyperplane and projects entity vectors onto that hyperplane. The main reasons we choose TransH are as follows: (i) many transition-based KGE methods, such as TransH, aim to model the associations and constraints within triples and are simple yet efficient; (ii) TransH can deal with one-to-many, many-to-one, and many-to-many relation patterns in the graph. Considering that these relation patterns widely exist in user-item interactions and the KG (for example, a user clicks on multiple movies, or a singer sings multiple songs), TransH is a proper choice for calculating the scores of triples in the path extraction and selection module. In the experiment, TransH is trained in advance. For simplicity, we use
\((h, r, t)\) to denote the representations of the target triplet, and the score calculation of the triplet is defined as
where
\(f(h,r,t)\) is the score function, and
\(d(h,r,t)\) is the distance function. The vectors
\(h_{\perp }\) and
\(t_{\perp }\) are obtained by projecting
\(h\) and
\(t\) on the hyperplane of the relation
\(r\):
where
\(w_r\) is the normal vector of the relation-specific hyperplane of \(r\), and \(w_r^T\) is its transpose.
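A minimal sketch of this TransH scoring is shown below; the function name and the convention of returning the negated distance (so that larger values indicate more plausible triples, as needed by the softmax normalization later in this section) are assumptions.

```python
import torch
import torch.nn.functional as F

def transh_score(h, r, w_r, t):
    """TransH plausibility score for a triple (h, r, t).
    h, t: entity embeddings; r: the relation translation vector;
    w_r: the normal vector of the relation-specific hyperplane.
    Entities are projected onto the hyperplane before measuring the
    translation distance; the negated distance is returned."""
    w_r = F.normalize(w_r, dim=-1)                          # enforce ||w_r|| = 1
    h_perp = h - (h * w_r).sum(-1, keepdim=True) * w_r      # h_perp = h - (w_r^T h) w_r
    t_perp = t - (t * w_r).sum(-1, keepdim=True) * w_r
    distance = torch.norm(h_perp + r - t_perp, p=2, dim=-1)
    return -distance
```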
For the purpose of comparison, we also choose TransE [
3] and DistMult [
53], two typical KGE methods, as triplet scoring functions. Different from TransH, TransE assumes that the relation
\(r\) is a translation vector connecting the embedded head entity
\(h\) and tail entity
\(t\). The distance function of
\((h,r,t)\) in TransE is defined as
DistMult represents each relation as a diagonal matrix and projects the head entity vector to the tail entity vector via this relation matrix. The distance function of
\((h,r,t)\) in DistMult is defined as
where
\(M_r\) denotes the projection matrix associated with the relation
\(r\), and
\(h^T\) is the transpose of the vector \(h\). In DistMult, the larger \(h^TM_rt\) is, the greater the plausibility of the triplet, which is the opposite of the convention in TransE and TransH. Therefore, we add a negative sign before the distance function of DistMult to keep it consistent with TransE and TransH.
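The two alternative scoring functions can be sketched under the same convention as above (larger values mean more plausible triples), in which case DistMult needs no extra sign flip; the function names and the diagonal storage of \(M_r\) are assumptions.

```python
import torch

def transe_score(h, r, t):
    """TransE: the relation is a translation vector; a smaller ||h + r - t||
    means a more plausible triple, so the negated distance is returned."""
    return -torch.norm(h + r - t, p=2, dim=-1)

def distmult_score(h, m_r_diag, t):
    """DistMult: each relation is a diagonal matrix M_r (stored as its diagonal);
    the bilinear score h^T M_r t is already 'larger = more plausible'."""
    return (h * m_r_diag * t).sum(dim=-1)
```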
After calculating the score of the triplet
\((i_{k-1},r_{k-1},i_k)\), we utilize nucleus sampling [
18] to adaptively select triples within the paths between every user-item pair. TransH and nucleus sampling together perform path ranking and selection, thereby alleviating the problem of error propagation by filtering out low-quality paths. Specifically, the triplets are ranked according to their scores calculated in formulas (5) and (6), so as to prune irrelevant triplets as well as paths between each user-item pair. Nucleus sampling aims to adaptively sample the top-\(p\) portion of the candidate probability distribution. Our goal is to make low-quality paths score lower through formulas (5) and (6) and to filter them out. For example, the path
Oliver Twist\(\xrightarrow {Language}\)English\(\xrightarrow {Language^{-1}}\)Barnaby Rudge in Figure
1 is a low-quality path. Formulas (5) and (6) are mainly used to model the associations and constraints within triples: the stronger the semantic association within a triple (i.e., its confidence), the higher its score, and the greater the probability that the path containing it is selected; that is, triples with higher scores contribute more to path selection. Therefore, TransH and the nucleus sampling strategy mainly filter out the paths with low confidence.
Instead of setting a fixed number of samples and sampling from the triple sets, such as top-
\(k\) sampling [
33], the number of samples in nucleus sampling is determined by the sum of probability values. At each hop, the triples are selected from the smallest possible set of triples whose cumulative probability exceeds a threshold, where the cumulative probability is calculated by summing the triples’ probability scores. In this way, the number of sampled triples can be dynamically increased or decreased based on the probability distribution. To perform nucleus sampling, the triple scores are normalized to compute the probabilities of the triples. The probability score of the triple
\((i_{k-1},r_{k-1},i_k)\) is obtained via softmax function:
where
\(e^{k-1}_i\),
\(e^{k-1}_r\),
\(e^{k}_i\) are the vectors of
\(i_{k-1}\),
\(r_{k-1}\), and
\(i_k\),
\(f(e^{k-1}_i,e^{k-1}_r,e^{k}_i)\) is the score of the triple
\((i_{k-1},r_{k-1},i_k)\) calculated by TransH, and \(({i^{\prime }}_{k-1},{r^{\prime }}_{k-1},{i^{\prime }}_k)\) denotes a triple in the set \(T_{k-1,k}\). Given the probability distribution of all triples between the (\(k-1\))th hop and the \(k\)th hop, the selected triples \(topp(T_{k-1,k}) \subset T_{k-1,k}\) are defined as the smallest set that satisfies the following conditions:
where \(p\) is the probability threshold. Then, the triples in \(topp(T_{k-1,k})\) are selected as the reasoning triples within the reasoning paths. At each hop, the triples are selected in the same way as described above. Finally, the reasoning path set \(S_{uv}\) can be formed to reflect the user \(u\)’s potential interests, so as to alleviate the effect of error propagation.
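A sketch of the per-hop selection follows: the triple scores are normalized with a softmax, the triples are ranked, and the smallest prefix whose cumulative probability exceeds \(p\) is kept. The default threshold value and the function name are illustrative only.

```python
import torch

def nucleus_select(triples, scores, p=0.9):
    """Select the smallest set of candidate triples whose cumulative softmax
    probability exceeds the threshold p (nucleus / top-p sampling).
    `triples` is a list of (head, relation, tail) tuples and `scores` a tensor
    of their KGE scores (larger = more plausible)."""
    probs = torch.softmax(scores, dim=0)             # normalize scores into probabilities
    sorted_probs, order = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=0)
    cutoff = int((cumulative < p).sum().item()) + 1  # smallest prefix exceeding p
    keep = order[:cutoff]
    return [triples[i] for i in keep.tolist()]
```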
3.2.3 Path Encoding.
Inspired by [
65], each user’s historical interaction items are encoded and concatenated with corresponding embedded paths to enhance the representations of the multi-hop paths, so as to reflect the user’s interests. In KR-GCN, the historical interaction set
\(V_u\) serves as an additional input. Although
\(S_{uv}\) already contains the path information between
\(u\) and
\(v\), these paths are mainly for the item
\(v\) and cannot reflect other interests of the user
\(u\). For example, if
\(u\) clicks on a comedy and a documentary,
\(u\)’s interest in the documentary is not considered when exploring the correlation between
\(u\) and the comedy. To explore more of the users’ interests, the mutual effects between the selected paths and the user’s historical interactions are captured by incorporating the user’s historical interactions into the selected paths. Moreover, these mutual effects would be weakened if the two embeddings were combined only after being encoded separately by the path encoder, because the later the combination, the weaker the semantic interaction. Considering that the datasets provide no timestamps for user-item interactions, we combine the historical interaction items with the selected paths without exploiting any chronological order of the historical interaction items. To capture the mutual effects between the selected paths and the user’s historical interactions, we design a method of sequence combination, namely sequence concatenation. The combination of the path sequence
\(S_{uv}[j]\in S_{uv}\) and the historical interaction set
\(V_u\) can be described as
where
\(\oplus\) is the sequence concatenation operation, and
\(T_{uv}[j]\) is the concatenated sequence.
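A minimal sketch of this combination, assuming the historical interaction items are simply prepended to the path sequence (the relative order is immaterial here since no timestamps are available):

```python
def combine_with_history(path, history):
    """Concatenate the user's historical interaction items V_u with a selected
    path sequence S_uv[j] to form the enhanced sequence T_uv[j]."""
    return list(history) + list(path)
```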
KR-GCN utilizes the stack of the
long short-term memory (
LSTM) and the attention network to encode the selected reasoning paths based on the embeddings learned via the graph convolutional network. The LSTM path encoder is utilized to encode the heterogeneous graph with respect to the multi-hop reasoning paths between the user-item pairs. This module takes the outputs of the graph encoding module and the path extraction and selection module as input: the graph encoding module provides node representations, and the path extraction and selection module provides path information. Because the paths contain multi-hop relational information and sequential dependencies between different nodes, this module aims at capturing the multi-hop relational information and encoding the sequential dependencies within every single path. The path sequence
\(S_{uv}[j]\) is initially embedded as
\([e_1,e_2,\ldots ,e_n]\), which is computed via convolution and aggregation operations in graph convolutional networks. For the path sequence
\(S_{uv}[j]\), the hidden state at the time slice
\(t\) in the LSTM can be described as follows:
where
\(h_t\) is the representation of the node
\(i_t\). Then the attention mechanism is applied to choose important features within the path
\(S_{uv}[j]\), which returns a vector to represent the single path sequence
\(S_{uv}[j]\).
where
\(p_{uv}[j]\) is the learned representation of the selected path
\(S_{uv}[j]\) between the user
\(u\) and the item
\(v\), and
\(\alpha _{h_t}\) denotes the importance of the node
\(i_t\) to the path
\(S_{uv}[j]\). Then the multi-hop reasoning paths (or latent relationships)
\(S_{uv}\) between the user
\(u\) and the item
\(v\) are represented by a set of vectors
\(p_{uv}\). These path representations can reflect the propagation of
\(u\)’s potential interests.
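The LSTM-plus-attention path encoder can be sketched as follows; the single-layer LSTM, the linear attention scorer, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode one concatenated path sequence T_uv[j]: an LSTM captures the sequential
    dependencies over the node embeddings, and an attention layer weights the hidden
    states h_t to produce a single path vector p_uv[j]."""
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
        self.attn = nn.Linear(dim, 1)                    # scores each hidden state

    def forward(self, node_embeds):
        # node_embeds: (batch, seq_len, dim) embeddings [e_1, ..., e_n] from the graph encoder
        hidden, _ = self.lstm(node_embeds)               # (batch, seq_len, dim)
        alpha = torch.softmax(self.attn(hidden), dim=1)  # attention weights alpha_{h_t}
        return (alpha * hidden).sum(dim=1)               # path representation p_uv[j]
```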
3.2.4 Preference Prediction.
In recommendation, different paths usually make varying contributions to predicting user preferences. To discriminate the contributions of different paths between each user-item pair on reasoning, a path-level self-attention mechanism is applied in KR-GCN. With path-level self-attention, a path-specific weight is learned for each path, which reflects the importance of that path. All selected multi-hop paths are then aggregated with these weights to represent the user’s preferences. The path-level self-attention mechanism is implemented as follows:
where
\(p_{uv}\) is the representation of the path set
\(S_{uv}\) through the self-attention mechanism and max-pool operation.
\(softmax(\frac{QK^T}{\sqrt {d}})\) normalizes the attention scores into a probability distribution, which reflects the importance of each path on reasoning.
\(maxpool(P_{uv})\) is used to learn vector representation for the path set
\(S_{uv}\) between the user
\(u\) and the candidate item
\(v\).
\(Q\),
\(K\), and
\(V\) denote query, key, and value representations, and are calculated as follows:
where
\(W_Q\),
\(W_K\), and
\(W_V\) denote the weight matrices, and \(E\) is the input matrix of the path set \(S_{uv}\).
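One possible realization of this path-level self-attention with max-pooling is sketched below (single-head attention; the class and parameter names are assumptions).

```python
import torch
import torch.nn as nn

class PathLevelSelfAttention(nn.Module):
    """Aggregate the vectors of all selected paths between a user-item pair:
    self-attention weighs the paths against each other, and max-pooling
    collapses them into a single vector for preference prediction."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.dim = dim

    def forward(self, path_vectors):
        # path_vectors: (num_paths, dim) input matrix E built from the path vectors p_uv
        q, k, v = self.w_q(path_vectors), self.w_k(path_vectors), self.w_v(path_vectors)
        attn = torch.softmax(q @ k.T / self.dim ** 0.5, dim=-1)   # path-level importance
        mixed = attn @ v                                          # (num_paths, dim)
        return mixed.max(dim=0).values                            # max-pool over paths
```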
Finally, we apply the
Multi-Layer Perceptron (
MLP) layers and an activation function to the output
\(P_{uv}\) to compute the preference prediction score between the user
\(u\) and the item
\(v\), i.e., the probability of the user
\(u\) interacting with the candidate item
\(v\).
where
\(\sigma (x)=\frac{1}{1+{\rm exp}(-x)}\) is the sigmoid function, and
\(MLP(\cdot)\) denotes the MLP layers with a single output. The final prediction score
\(\hat{y}_{uv}\) is the interaction probability between the user
\(u\) and the item
\(v\), i.e., the score of user preference prediction.
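The prediction step can be sketched as MLP layers followed by the sigmoid; the hidden size and the number of layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreferencePredictor(nn.Module):
    """Map the aggregated path representation to the interaction probability
    y_hat_uv with MLP layers and a sigmoid activation."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                 # single output score
        )

    def forward(self, p_uv):
        return torch.sigmoid(self.mlp(p_uv)).squeeze(-1)   # y_hat_uv in (0, 1)
```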
After computing the interaction probability of the given user-item pair, the reasoning path with the highest weight is output as the explanation for the recommendation.