
A Text-based Deep Reinforcement Learning Framework Using Self-supervised Graph Representation for Interactive Recommendation

Published: 17 May 2022
    Abstract

    Due to its nature of learning from dynamic interactions and planning for long-run performance, Reinforcement Learning (RL) has attracted much attention in Interactive Recommender Systems (IRSs). However, most of the existing RL-based IRSs usually face a large discrete action space problem, which severely limits their efficiency. Moreover, data sparsity is another problem that most IRSs are confronted with. The utilization of recommendation-related textual knowledge can tackle this problem to some extent, but existing RL-based recommendation methods either neglect to combine textual information or are not suitable for incorporating it. To address these two problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, we leverage textual information to map items and users into the same feature space by a self-supervised embedding method based on the graph convolutional network, which greatly alleviates the data sparsity problem. Moreover, we design an effective method to construct an action candidate set, which reduces the scale of the action space directly. Two types of representative reinforcement learning algorithms have been applied to implement TRGIR. Since the action space of IRS is discrete, it is natural to implement TRGIR with the Deep Q-learning Network (DQN). In the TRGIR implementation with Deep Deterministic Policy Gradient (DDPG), denoted as TRGIR-DDPG, we design a policy vector, which can represent the user’s preferences, to generate discrete actions from the candidate set. Through extensive experiments on three public datasets, we demonstrate that TRGIR-DDPG achieves state-of-the-art performance over several baselines in a time-efficient manner.

    1 Introduction

    In the era of information explosion, recommender systems play a critical role in resisting information overload. Recently, the Interactive Recommender System (IRS) [49], which continuously recommends items to individual users and receives their feedback to refine its recommendation policy, has attracted much attention and plays an important role in personalized services, such as Amazon, Pandora, and YouTube.
    In the past few years, there have been some attempts to address the interactive recommendation problem by modeling the recommendation process as a Multi-Armed Bandit (MAB) problem [22, 35, 49], but these methods are not designed for long-term planning explicitly, which makes their performance unsatisfactory [5]. It is well recognized that Reinforcement Learning (RL) performs excellently in finding policies on interactive long-running tasks, such as playing computer games [25] and solving simulated physics problems [23]. Therefore, it is natural to introduce RL to model the interactive recommendation process. In fact, recently there have been some works on applying RL to address the interactive recommendation problem [5, 6, 11, 16, 32, 44, 45, 46, 47, 50]. However, most of the existing RL-based methods [6, 16, 32, 44, 45, 46, 47, 50] suffer from the problem of making a decision in linear time complexity with respect to the size of the action space, i.e., the number of available items, which makes them inefficient (or unscalable) when the IRS action space size is large.
    To improve efficiency, based on Deep Deterministic Policy Gradient (DDPG), Dulac-Arnold et al. [11] proposed DDPG-kNN, which first learns an action representation (vector) in a continuous hidden space and then finds the valid item by a k nearest neighbor search. However, because DDPG is not designed for a discrete IRS action space, and DDPG-kNN ignores the importance of each dimension in the action vector, the effectiveness of such a method is limited. Moreover, this method still needs to find the k nearest neighbors from the whole action space, which is time-consuming. Recently, Chen et al. [5] proposed a tree-structured policy gradient recommendation framework, within which a balanced hierarchical clustering tree is built over the items. Picking an item is then formulated as seeking a path from the root to a certain leaf in the tree, which dramatically reduces the time complexity. However, this method introduces the burden of building the clustering tree; in particular, when new items appear frequently, the tree must be reconstructed, which can be costly.
    In addition, most of the existing RL-based recommendation methods use only past interaction data, such as ratings, purchase logs, or viewing history, to model user preferences and item features [5, 11, 48]. A major limitation of such methods is that they may suffer serious performance degradation when facing the data sparsity problem, which is very common in real-world recommendation systems. As is well known, textual information such as comments by users and item descriptions provided by suppliers contains more knowledge than interaction data. Nowadays, textual information is readily available in many e-commerce and review websites, such as Amazon and Yelp. Thanks to the invention of word embedding, applying textual information for recommendation is possible, and there have been some successful attempts in conventional recommender systems [3, 8, 51]. But for IRS, existing RL-based methods either neglect to leverage textual information or are not suitable for incorporating textual information due to their unique structures for processing rating sequences.
    To address the aforementioned problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, to leverage textual information, we first embed descriptions and comments with pre-trained word vectors [27]. Then, we build a relation graph that consists of four types of nodes (user, item, description, and comment) and use the description and comment vectors to initialize the embeddings of the corresponding nodes. By a self-supervised embedding method based on the graph convolutional network (GCN) [19], we can learn embedding vectors of users and items with semantics, which alleviates the data sparsity problem to a great extent. Next, based on the user vectors, we classify users into several clusters by the K-means algorithm [1]. Inspired by the idea of collaborative filtering, we construct an action candidate set, which consists of positive, negative, and ordinary items selected based on the user’s historical logs and the clustering results, to reduce the scale of the action space directly. Considering that the action in IRS is discrete and the widely used Deep Q-learning Network (DQN) [25] is designed for dealing with discrete action space problems, we first present an implementation of TRGIR based on DQN, denoted as TRGIR-DQN. However, when facing a large action space, DQN’s efficiency and its exploration ability to decide the proper action degrade significantly. To address this problem, we further propose to utilize DDPG [23] as the RL model and denote this implementation as TRGIR-DDPG. Specifically, we design a policy vector to generate discrete actions from the candidate set. The policy vector, which represents the user’s preference in a feature space, is dynamically learned from the actor network of TRGIR-DDPG. By combining the candidate set with the policy vector, we can enhance the exploration ability and further improve the efficiency.
    Finally, considering that it is too expensive to train and test our model in an online manner, we build an environment simulator to mimic online environments with principles derived from real-world data. Through extensive experiments on several real-world datasets with different settings, we demonstrate that TRGIR-DDPG achieves high efficiency and remarkable performance improvement over several state-of-the-art baselines, especially for large-scale, high-sparsity datasets. To sum up, the main contributions of this work are as follows:
    To reduce the negative influence of rating sparsity in IRSs, we build a relation graph, which is initialized with the description and comment embeddings calculated by textual information and pre-trained word vectors. Through learning the embeddings of users and items by a GCN-based self-supervised embedding method on this graph, we can derive user and item vectors with semantics efficiently.
    Based on the idea of collaborative filtering, we classify users into several clusters and build the candidate set, which reduces the scale of the action space directly. Further, in TRGIR-DDPG, we represent the preferences of users by implicit policy vectors and propose a method based on DDPG to learn the policy vectors dynamically. The policy vector, combined with the candidate set, is used to generate discrete actions, which enhances the exploration ability and improves the efficiency simultaneously.
    Extensive experiments are conducted on three benchmark datasets; the results verify the high efficiency and superior performance of TRGIR-DDPG over state-of-the-art methods for IRSs.
    The remainder of this article is organized as follows: Section 2 discusses related work; Section 3 formally defines the research problem and details the proposed TRGIR framework, as well as the corresponding learning algorithms; Section 4 presents and analyzes the experimental results; Section 5 concludes the article with some remarks.

    2 Related Work

    2.1 RL-based Recommendation Methods

    RL-based recommendation methods usually formulate the recommendation procedure as a Markov Decision Process (MDP). They explicitly model the user’s dynamic status and plan for long-run performance [5, 6, 11, 16, 32, 36, 44, 45, 46, 47, 50]. As mentioned earlier, most existing RL-based methods [6, 16, 32, 36, 44, 45, 46, 47, 50] suffer from the large-scale discrete action space problem.
    To address this problem in IRS, there have been some notable attempts. Dulac-Arnold et al. [11] proposed DDPG-kNN, which first leverages prior information about the actions to embed them in a continuous space and generate a proto-action. Then, via a k nearest neighbor search, this method finds a set of discrete actions closest to the proto-action as candidates in logarithmic time, which improves the efficiency dramatically. However, there are two flaws: (1) DDPG is not designed for a discrete IRS action space; (2) this method ignores the negative influence of the dimensions that users do not care about, which affects the performance of DDPG-kNN. Moreover, the k nearest-neighbor search needs to be conducted on the whole action space, which still incurs a high runtime overhead. Later, Zhao et al. [48] used the actor network of an actor-critic architecture to obtain k weight vectors at once, each of which picks the maximum-score item from the remaining items. However, the relation among these vectors is unclear, so the order of the k items cannot be explained. Most recently, based on Deterministic Policy Gradient (DPG), Chen et al. [5] proposed a Tree-structured Policy Gradient Recommendation (TPGR) framework. In TPGR, a balanced hierarchical clustering tree is built over all the items, and making a decision is formulated as seeking a path from the root to a certain leaf in the clustering tree, which also reduces the time complexity significantly. However, limited by the search method that can only reach one leaf node at a time, this method only supports Top-1 recommendation. Moreover, when new items appear frequently, the clustering tree needs to be reconstructed, which incurs extra costs.

    2.2 Text-related Recommendation Methods

    Most recommendation models (including RL-based ones) that merely exploit the interaction matrix face the data sparsity problem, which can potentially be alleviated by exploiting the large amount of knowledge in textual information [51]. The development of deep learning in natural language processing (NLP) makes it possible to use textual information to enhance recommendation performance [3, 7, 8, 51]. In fact, there are already some works that enhance their models with vectors obtained from textual information (such as descriptions and comments) by sentiment analysis [3], convolutional neural networks [9, 51], or word vectors pre-trained on large corpora [8]. Recently, some researchers have tried to combine Knowledge Graph (KG, a kind of relation graph) embedding models with the textual information of entities. Socher et al. [31] introduced an expressive neural tensor network suitable for reasoning over relations between two entities and found that the performance improves when entities are represented as an average of their constituent word vectors. Xie et al. [41] proposed a representation learning method for KGs that takes advantage of entity descriptions, which are learned by a continuous bag-of-words model and a deep convolutional neural model. Xiao et al. [40] proposed the semantic space projection model, which jointly learns from symbolic triples and textual descriptions, and demonstrated its effectiveness.
    IRS also suffers from the rating sparsity problem, but most of the existing RL-based methods for IRS either neglect to incorporate textual information [11, 16, 46, 47] or have difficulty in utilizing it, since they adopt time-related structures to input the rating sequence [5, 50]. Recently, by combining images with textual information, Zhang et al. [44] proposed a novel constraint-augmented RL framework to efficiently incorporate user preferences over time. Specifically, they leveraged a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing the expected cumulative future reward. Different from our method, which introduces textual information to alleviate the rating sparsity problem, Reference [44] mainly focuses on utilizing constraint-augmented RL to address the problem that recommendations can easily violate the preferences that users express in their past natural-language feedback. Moreover, like most existing RL-based methods [5, 6, 11, 16, 32, 45, 46, 47, 50], Reference [44] also suffers from the large-scale discrete action space problem, which is another limitation of existing methods that we intend to address in this work. It is also noteworthy that in the domain of conversational recommender systems (CRSs), Basile et al. [2] proposed a framework that combines deep learning and reinforcement learning and uses text-based features to provide relevant recommendations and produce meaningful dialogues. But different from CRS, in our RL-based method for IRS, the textual information is utilized to learn users’ implicit long-term preferences, not their proactive immediate needs.

    2.3 Other Relevant Recommendation Methods

    With the development of deep learning, there have been some works that apply deep learning models to recommendation. Sedhain et al. [29] proposed AutoRec to learn embeddings that can reconstruct the ratings of a user from their records. Wu et al. [43] proposed Collaborative Denoising AutoEncoders (CDAE), which utilizes the idea of Denoising Auto-Encoders and incorporates more flexible components for implicit feedback. Using a two-pathway neural network representation learning architecture, Xue et al. proposed deep matrix factorization (DMF) [42] to map the users and items into a common low-dimensional latent space with non-linear projections, and then utilize cosine similarity as the matching function to calculate predictive scores. To learn the complex structure of user interaction data, He et al. [15] replaced the inner-product matching function with a non-linear MLP architecture. Moreover, by fusing the MLP-based neural matching function learning structure with the generalized matrix factorization representation learning structure, NeuMF was proposed to obtain better performance. Further, considering that DNN-based representation learning and matching function learning suffer from two fundamental flaws, i.e., the limited expressiveness of the inner product and the weakness in capturing low-rank relations, respectively, Deng et al. [10] proposed DeepCF, which combines the strengths of neural representation learning and neural matching function learning to overcome these flaws.
    To gain more effective representations, some recent works exploit the structure of the interaction graph by propagating user and item embeddings on it. GC-MC [4] applies a GCN-based auto-encoder framework on the bipartite user-item graph, but it only employs GCN for link prediction between users and items. Inspired by GCN [19], NGCF [37] exploits the collaborative signal in the embedding function and explicitly encodes the signal in the form of high-order connectivity by performing embedding propagation. The embedding propagation rule of NGCF is the same as that of standard GCN, which was originally proposed for node classification on attributed graphs, where each node has rich attributes. By removing the feature transformation and non-linear activation in NGCF, which increase the training difficulty, LightGCN [14] achieves significant accuracy improvements. Recently, Wu et al. [39] applied self-supervised learning on the user-item graph to improve the accuracy and robustness of GCNs for recommendation. Different from bipartite graphs that only contain user-item interactions, the heterograph considered in this work contains other node types. Compared with GCN, the relational graph convolutional network (RGCN) [28] has been shown to be capable of dealing with the highly multi-relational data characteristic of heterographs. In view of this, we apply it to propagate information on our heterograph.

    3 Proposed Method

    3.1 Problem Formulation

    We consider a recommender system with M users \( \mathrm{U}=\lbrace u_1, \ldots , u_M\rbrace \) and N items \( \mathrm{V}=\lbrace v_1, \ldots , v_N\rbrace \), and use \( Y \in \mathbb {R}^{M \times N} \) to denote the rating matrix, where \( y_{i,j} \) is the rating of user \( u_i \) on item \( v_j \). For the textual information, we denote \( \mathrm{D}=\lbrace d_1, \ldots , d_P\rbrace \) and \( \mathrm{C}=\lbrace c_1, \ldots , c_Q\rbrace \) as the sets of descriptions and comments, respectively. The interactive Top-k recommendation process can be modeled as a special Markov Decision Process (MDP), whose key components are defined as follows:
    State. Use \( \mathrm{S} \) to denote the state space. A state \( s \in \mathrm{S} \) is defined as the possible interaction between a user and the recommender system, which can be represented by \( n_s \) item vectors in a certain order.
    Action. Use \( \mathrm{A} \) to denote the action space. An action \( a \in \mathrm{A} \) contains \( n_a \) ordered items, each of which is represented by a vector. For the interactive Top-k recommendation, the scale of \( \mathrm{A} \) is large.
    Reward function. After receiving an action a at state s, our environment simulator returns a reward r, which reflects the user’s feedback to the recommended items. We use \( \mathcal {R}(s, a) \) to denote the reward function.
    Transition. In our model, since the state is a set of item vectors, once the action is determined and the user’s feedback is given, the state transition is also determined.
    Consider an agent that interacts with the environment \( \mathcal {E} \) in discrete timesteps. At each timestep t, the agent receives a state \( s_{t} \) by observing the current environment, then takes an action \( a_{t} \) and gets a reward \( r_{t} \). An agent’s behavior is defined by a policy \( \pi \), which maps states to a probability distribution over the actions, i.e., \( \pi : \mathrm{S} \rightarrow \mathcal {P}(\mathrm{A}) \). Based on the above notations, we can define the instantiated MDP for our recommendation problem, \( \mathcal {M}=\langle \mathrm{S}, \mathrm{A}, \mathcal {R}, \mathcal {P}, T, \gamma \rangle \), where T is the maximal decision step and \( \gamma \) is the discount factor. The objective of this work is to learn a policy \( \pi ^* \) that maximizes the expected discounted cumulative reward.
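    To make the interaction protocol concrete, here is a minimal sketch of the loop defined above. The names `env` and `agent` and their methods are hypothetical stand-ins for the simulator and the RL model, not part of our implementation.
```python
def run_episode(env, agent, T, gamma=0.9):
    """Roll out one episode of at most T decision steps and
    return the discounted cumulative reward."""
    state = env.reset()              # initial state: n_s item vectors
    total, discount = 0.0, 1.0
    for t in range(T):
        action = agent.act(state)    # a sample from pi: S -> P(A)
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        total += discount * reward
        discount *= gamma
        state = next_state
        if done:
            break
    return total
```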

    3.2 Framework Overview

    Figure 1 gives an overview of our framework, which contains two major steps: data preparation and training. In data preparation, we first build a relation graph that contains four types of nodes (user, item, description, and comment) and use vectors of descriptions and comments obtained from pre-trained word vectors [27] to initialize the embeddings of description and comment nodes. Note that the embeddings of user and item nodes are initialized randomly. On the relation graph, the user, item, description, and comment embeddings in the lth propagation layer are denoted as \( {\bf \textrm {U}}^{(l)}=\lbrace {\bf \textrm {u}}^{(l)}_1, \ldots , {\bf \textrm {u}}^{(l)}_M\rbrace \) , \( {\bf \textrm {V}}^{(l)}=\lbrace {\bf \textrm {v}}^{(l)}_1, \ldots , {\bf \textrm {v}}^{(l)}_N\rbrace \) , \( {\bf \textrm {D}}^{(l)}=\lbrace {\bf \textrm {d}}^{(l)}_1, \ldots , {\bf \textrm {d}}^{(l)}_P\rbrace \) , and \( {\bf \textrm {C}}^{(l)}=\lbrace {\bf \textrm {c}}^{(l)}_1, \ldots , {\bf \textrm {c}}^{(l)}_Q\rbrace \) , respectively. After propagating with L layers by utilizing a GCN-based self-supervised embedding method, we can learn the embeddings of user and item \( {\bf \textrm {U}}^{(L)}=\lbrace {\bf \textrm {u}}^{(L)}_1, \ldots , {\bf \textrm {u}}^{(L)}_M\!\rbrace \) and \( {\bf \textrm {V}}^{(L)}=\lbrace {\bf \textrm {v}}^{(L)}_1, \ldots , {\bf \textrm {v}}^{(L)}_N\rbrace \) . Through the learning process on the relation graph, \( {\bf \textrm {U}} \) and \( {\bf \textrm {V}} \) can gain the semantics contained in \( {\bf \textrm {D}} \) and \( {\bf \textrm {C}} \) . Then, based on the user embeddings, we utilize the unsupervised K-means [1] algorithm to classify the users into several clusters, which will later be used for helping construct the action candidate set.
    Fig. 1.
    Fig. 1. Framework overview.
    In the training phase, with the objective of implementing a more personalized recommendation, we train a unique model for each cluster. Take cluster 2 as an example: we randomly select a user \( u_i \) from it. Based on the historical logs of \( u_i \) and the user classification results, we sample positive, negative, and ordinary items for \( u_i \) to construct a candidate set, which is later used in the reinforcement model for action selection. The reinforcement model interacts with the simulator, which is based on historical logs, to learn the inner relations among all possible states and actions. For the specific implementation, we employ DQN [13, 25] (TRGIR-DQN) and DDPG [23] (TRGIR-DDPG) as our reinforcement model, respectively. In particular, by utilizing the policy vector in the DDPG implementation, we can improve the efficiency dramatically. The training phase stops when the model loss stabilizes.

    3.3 GCN-based Self-supervised Embedding

    Descriptions and comments are the most important textual information in recommender systems. The descriptions, which contain items’ advantages, and the comments, which contain users’ attitudes, along with the ratings, can express the preferences of users well. To obtain well expressive embeddings of users and items, we build a relation graph (as shown in the left part of Figure 1) including user nodes, item nodes, description nodes, and comment nodes. By initializing the node embeddings of descriptions and comments with the pre-trained word vectors GloVe [27], we can obtain semantics from the textual information. Note that the original descriptions and comments contain many meaningless words that can affect the quality of the constructed vectors; we remove them in advance according to the Long Stopword List.1 Then, we use the pre-trained word vectors GloVe.6B,2 which have been trained on large corpora (Wikipedia 2014 and Gigaword 5), to calculate \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \). Specifically,
    \( \begin{equation} {\bf \textrm {d}}^{(0)}_p = \frac{1}{n_d}\sum _{i=1}^{n_d}{\bf \textrm {w}}_i, \quad {\bf \textrm {c}}^{(0)}_q = \frac{1}{n_c}\sum _{i=1}^{n_c}{\bf \textrm {w}}_i , \end{equation} \)
    (1)
    where \( {\bf \textrm {w}}_i \) denotes the vector of word \( {w}_{i} \), and \( n_d \) (\( n_c \)) denotes the number of words that \( d_p \) (\( c_q \)) contains after removing the stop words. Note that word vectors with similar semantics are closer in Euclidean distance than word vectors with large semantic differences [27], which ensures that comments (or descriptions) with similar semantics are closer to each other. Different from \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \), the initial embeddings of user and item nodes, \( {\bf \textrm {u}}^{(0)}_m \) and \( {\bf \textrm {v}}^{(0)}_n \), are constructed randomly.
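    As an illustration of Equation (1), the following sketch averages the remaining word vectors of a description or comment; the `glove` dictionary (word to NumPy vector) and the `stopwords` set are hypothetical inputs standing in for GloVe.6B and the Long Stopword List.
```python
import numpy as np

def init_text_embedding(text, glove, stopwords, dim=100):
    """Initialize a description/comment node embedding as the average of the
    GloVe vectors of its non-stopword tokens (Equation (1))."""
    words = [w for w in text.lower().split()
             if w not in stopwords and w in glove]
    if not words:                       # fall back to zeros if nothing remains
        return np.zeros(dim)
    return np.mean([glove[w] for w in words], axis=0)
```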
    Then, we introduce a GCN-based self-supervised embedding method to learn the representations of users and items. Let \( e \in \mathrm{E} \) denote the entities in the relation graph, where \( \mathrm{E}=\mathrm{U}\cup \mathrm{V}\cup \mathrm{D}\cup \mathrm{C} \), and let e denote the representation of e. According to the feature propagation model of RGCN [28], an entity \( e_i \) is capable of receiving the messages propagated from its l-hop neighbors by stacking l embedding propagation layers. In the lth step, the embedding of \( e_i \) is recursively formulated as,
    \( \begin{equation} {\bf \textrm {e}}^{(l)}_i= \sum _{r^e \in \mathrm{R}} \sum _{e_j \in \mathrm{N}^{r^e}_{e_i}} \mathcal {L}_{e_i, e_j} W^{(l-1)}_j {\bf \textrm {e}}^{(l-1)}_j + W^{(l-1)}_{\text{self}} {\bf \textrm {e}}^{(l-1)}_i , \end{equation} \)
    (2)
    where \( r^e \in \mathrm{R} \) denotes one of the relations between different entities, \( {\bf \textrm {e}}^{(l-1)}_i \) and \( {\bf \textrm {e}}^{(l-1)}_j \) denote the embeddings of \( e_i \) and \( e_j \) generated from the previous message propagation steps, \( \mathrm{N}^{r^e}_{e_i} \) denotes the set of neighbor entities that are directly connected to \( e_i \) under relation \( r^e \), \( W^{(l-1)}_j \) and \( W^{(l-1)}_{\text{self}} \) are trainable weights under relation \( r^e \), and \( \mathcal {L}_{e_i, e_j} = 1 / {\scriptstyle \sqrt { |\mathrm{N}^{r^e}_{e_i}||\mathrm{N}^{r^e}_{e_j}|}} \) is the symmetric normalization. As shown in Figure 2(a), we take the message propagation of user \( u_1 \) as an example. There are four types of relations in total (i.e., \( |\mathrm{R}| = 4 \)): user-item, user-comment, item-comment, and item-description. After two propagation steps, the messages from the related nodes under different relations are aggregated into the target node \( u_1 \).
    Fig. 2.
    Fig. 2. (a) Example illustrating message propagation; (b) Example illustrating the calculation of \( U^2 \) and \( V^2 \) .
    Note that the GloVe word vectors exhibit linear substructures [27]. The averaging operation preserves this linear substructure in the initial embeddings \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \). To avoid destroying the linear substructures and to accelerate the training phase, we remove the non-linear activation functions of the standard RGCN, and we find that this operation improves the performance.
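    For clarity, the following PyTorch sketch shows one propagation layer in the spirit of Equation (2): per-relation linear transforms plus a self-transform, with the symmetric normalization folded into the adjacency matrices and no non-linear activation. It is an illustrative approximation under these assumptions, not the exact implementation.
```python
import torch
import torch.nn as nn

class LinearRGCNLayer(nn.Module):
    """One message-passing layer: per-relation linear transforms plus a
    self-transform, without a non-linear activation (cf. Equation (2))."""

    def __init__(self, num_relations, in_dim, out_dim):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)]
        )
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj_per_relation):
        """x: (num_nodes, in_dim) embeddings of all entities.
        adj_per_relation: list of symmetrically normalized adjacency
        matrices of shape (num_nodes, num_nodes), one per relation type."""
        out = self.self_weight(x)
        for adj, lin in zip(adj_per_relation, self.rel_weights):
            out = out + adj @ lin(x)     # aggregate messages for this relation
        return out
```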
    The self-supervised loss function of our embedding method encourages the nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct. Inspired by the contrastive self-supervised method [26] and considering the linear substructures of these representations in Euclidean space, we choose mean square error as the distance metric. Based on the randomly selected batch of users and items \( \mathrm{U}^b \) and \( \mathrm{V}^b \) , the loss function of graph network parameters \( \theta ^G \) is designed as,
    \( \begin{equation} \begin{aligned}L(\theta ^G) =&\sum _{u_m \in \mathrm{U}^b}\left(\sum _{u_i \in \mathcal {N}_{u_m}} (\mathbf {u_m}- \mathbf {u}_{i})^{2} - \sum _{u_j \in \mathcal {F}_{u_m}} (\mathbf {u_m} - \mathbf {u}_{j})^{2}\right) \\ &\quad + \sum _{v_n \in \mathrm{V}^b}\left(\sum _{v_i \in \mathcal {N}_{v_n}} (\mathbf {v_n}- \mathbf {v}_{i})^{2} - \sum _{v_j \in \mathcal {F}_{v_n}} (\mathbf {v_n} - \mathbf {v}_{j})^{2}\right) + \lambda ||\theta ^G||_2 , \end{aligned} \end{equation} \)
    (3)
    where \( \mathcal {N}_{u_m} \) and \( \mathcal {N}_{v_n} \) are the sets of nearby nodes of user \( u_m \) and item \( v_n \), \( \mathcal {F}_{u_m} \) and \( \mathcal {F}_{v_n} \) are the sets of disparate nodes (nodes far from the current node) of user \( u_m \) and item \( v_n \), and \( \lambda \) is a hyper-parameter that controls the strength of the regularizer to avoid model overfitting. Note that we define \( U^2 = H \times H^T \) and \( V^2 = H^T \times H \), where \( H \in \mathbb {R}^{M \times N} \) denotes the user-item adjacency matrix. For any \( h_{m, n} \in H \), if \( u_m \) has interacted with \( v_n \), then \( h_{m, n} = 1 \); otherwise, \( h_{m, n} = 0 \). Moreover, to help understand the calculation of \( U^2 \) and \( V^2 \), Figure 2(b) presents an example based on the relational graph depicted in Figure 1. Note that both \( U^2 \) and \( V^2 \) are symmetric matrices, i.e., \( U^2_{m,i} = U^2_{i,m} \) and \( V^2_{n,i} = V^2_{i,n} \). The diagonal elements in \( U^2 \) (or \( V^2 \)) represent the total number of interacted items (or users) of the corresponding user (or item). If there is at least one item that both users \( u_m \) and \( u_j \) have interacted with, then \( U^2_{m,j} \gt 0 \) (e.g., \( U^2_{1,2} = 1 \)); otherwise, \( U^2_{m,j} = 0 \) (e.g., \( U^2_{1,3} = 0 \)). The same rule goes for \( V^2 \). Hence, \( U^2_{m,j} \gt 0 \) indicates that there exists at least one item preferred by both \( u_m \) and \( u_j \), while \( V^2_{n,i} \gt 0 \) indicates that there exists at least one user who prefers both \( v_n \) and \( v_i \). For any user \( u_x \), if \( U^2_{m,x} \gt 0 \), then \( u_x \) belongs to \( \mathcal {N}_{u_m} \); otherwise, \( u_x \) belongs to \( \mathcal {F}_{u_m} \). Similar definitions go for \( \mathcal {N}_{v_n} \) and \( \mathcal {F}_{v_n} \).
    The training of our GCN-based self-supervised method is independent of the RL model and stops when the loss stabilizes. Moreover, for a specific batch of users and items, when their relations change, we can utilize Equation (3) to locally update the neural network and the embeddings of this batch. After propagating with L layers, we obtain \( {\bf \textrm {u}}^{(L)}_m \) and \( {\bf \textrm {v}}^{(L)}_n \) as the foundation for clustering and the construction of states and actions.
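    The user half of the loss in Equation (3) can be sketched as follows (PyTorch), with `H` the binary user-item interaction matrix used to form \( U^2 \); the item half and the \( \ell_2 \) regularizer are analogous and omitted, and the tensor shapes are assumptions for illustration.
```python
import torch

def user_selfsup_loss(user_emb, H, batch_users):
    """User part of the self-supervised loss in Equation (3), as a sketch.
    user_emb: (M, d) float tensor of user embeddings.
    H: (M, N) binary float tensor of user-item interactions.
    batch_users: iterable of user indices forming the batch U^b."""
    U2 = H @ H.T                          # (M, M): co-interaction counts
    loss = user_emb.new_zeros(())
    for m in batch_users:
        near_mask = U2[m] > 0             # users sharing at least one item
        far_mask = ~near_mask             # users sharing no item
        near_mask[m] = False              # exclude the user itself
        far_mask[m] = False
        near, far = user_emb[near_mask], user_emb[far_mask]
        loss = loss + ((user_emb[m] - near) ** 2).sum()   # pull nearby users closer
        loss = loss - ((user_emb[m] - far) ** 2).sum()    # push disparate users apart
    return loss
```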
    To better illustrate the superiority of integrating textual information, we pick a real user (with ID: A2P49WD75WHAG5) in the Amazon Digital Music dataset (one of the datasets used in the experiments) to illustrate how the text could benefit our recommendation process. As shown in Figure 3, the left scatter plot shows the distribution of the interacted items’ embeddings obtained by Matrix Factorization (MF) without textual information, and the right scatter plot shows the same items’ embeddings obtained by the Self-supervised Graph (SG) representation learning method with textual information (the category information). We utilize the widely used principal component analysis [12] to reduce the above embeddings to two dimensions. We can observe that the relative distance between 2,365 (“Gia”) and 2,607 (“Eighteen Visions”) is much shorter than that between 2,365 and 2,464 (“Fijacion Oral”) in the left scatter plot, while the opposite holds in the right scatter plot. By carefully examining the category information of these three items, we can conclude that 2,365 is more similar to 2,464 than to 2,607, and hence should be closer to 2,464 in the visualized embedding plot. This demonstrates the effectiveness of integrating textual information for better embedding learning.
    Fig. 3.
    Fig. 3. An example to illustrate how the text could benefit the recommendation process.

    3.4 Construction of the Candidate Set

    In our RL-based recommendation, the state is defined as a set of \( n_s \) items. For a model that recommends Top-k items at once, there are a total of \( A_{N-n_s}^k \) (note that \( A_{N-n_s}^k \) here denotes a permutation) possible actions. As the number of items (N) increases, the scale of the action space grows rapidly. Based on the assumption that a user’s preferences can be captured by a set of items the user likes and dislikes, we pick the positive and negative items to build a candidate set c. Additionally, to maintain generalization, we randomly add some ordinary items into c.
    Given a user \( u_i \) , according to the historical logs, if the corresponding rating is greater than a given bound \( y_b \) (e.g., \( y_b = 2 \) in a rating system with the highest rating 5), then the interacted record is regarded as positive; otherwise, it is negative. We use \( \mathcal {V}^p_{u_i} \) and \( \mathcal {V}^n_{u_i} \) to denote the set of items that are in \( u_i \) ’s positive and negative interacted records, respectively. For \( u_i \) , we sample positive items from \( \mathcal {V}^p_{u_i} \) , negative items from \( \mathcal {V}^n_{u_i} \) , and ordinary items by random. Since users usually skip the items they do not like, the negative items in \( \mathcal {V}^n_{u_i} \) are rare [24]. Based on the reverse thought of collaborative filtering, i.e., the more differences between two users, the more possible that the one’s likes are another’s dislikes, we classify users into several clusters by K-means [1] to supplement negative items. Specifically, as shown in Figure 4(a), we denote the set of items that appear in the positive interacted records of users in cluster l as \( \mathcal {V}^p_{cl_l} \) (user \( u_i \) belongs to cluster \( cl_l \) ) and use \( {cl_{l}^f} \) to denote the cluster that has the farthest distance from the current cluster \( cl_l \) . If the negative items in \( \mathcal {V}^n_{u_i} \) are not enough, then the rest negative items will be selected from \( \mathcal {V}_{neg} \leftarrow \mathcal {V}^p_{cl_{l}^f} - (\mathcal {V}^p_{cl_{l}^f} \cap (\mathcal {V}^p_{cl_{l}} \cup \mathcal {V}^n_{u_i})) \) . In this way, we can reduce the scale of the action space from \( M-n_s \) to \( n_c \) , where \( n_c \) is the number of items in the candidate set c.
    Fig. 4.
    Fig. 4. (a) An example illustrates the components of the candidate set; (b) The structure of TRGIR-DQN.
    Algorithm 1 shows the details of the construction of the candidate set, in which the positive items account for at most a fraction \( \alpha \) of \( n_c \) (\( \alpha \) is a hyper-parameter), and the negative and ordinary items each take \( 50\% \) of the remaining part of \( n_c \) (line 7). In the training phase, since constructing a candidate set involves only simple operations, such as random selection and merging, and the size of the candidate set is fixed, the time complexity of Algorithm 1 is constant. A sketch of this construction is given below.
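    The following sketch mirrors the construction described above under our reading of the proportions (at most a fraction \( \alpha \) of the \( n_c \) slots for positive items, the remainder split evenly between negative and ordinary items). The argument names are hypothetical and the sampling details are simplified relative to Algorithm 1.
```python
import random

def build_candidate_set(pos_items, neg_items, cluster_far_pos, all_items,
                        n_c=50, alpha=0.1):
    """Build a candidate set of n_c item ids: positives, negatives
    (supplemented from the farthest cluster's positives, V_neg),
    and randomly chosen ordinary items."""
    n_pos = min(int(alpha * n_c), len(pos_items))
    n_neg = (n_c - n_pos) // 2
    n_ord = n_c - n_pos - n_neg

    candidates = random.sample(pos_items, n_pos)
    # negatives: the user's own negative records first, then V_neg
    supplement = [v for v in cluster_far_pos
                  if v not in pos_items and v not in neg_items]
    neg_pool = neg_items + supplement
    candidates += random.sample(neg_pool, min(n_neg, len(neg_pool)))
    # ordinary items: random items not already chosen
    remaining = [v for v in all_items if v not in candidates]
    candidates += random.sample(remaining, min(n_ord, len(remaining)))
    random.shuffle(candidates)
    return candidates
```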

    3.5 Specific Implementations of TRGIR

    The goal of a typical reinforcement learning model is to learn a policy \( \pi \) that maximizes the discounted future reward, i.e., the Q-value, which is usually estimated by the Q-value function \( Q^{\pi }(\cdot) \). Combined with deep neural networks, there are many algorithms that approximate the Q-value function, which can be roughly categorized into three types: value-based (e.g., DQN [25], Double DQN [13]), policy-based (e.g., DPG [30]), and hybrid algorithms (e.g., DDPG [23]). Considering that value-based DQN [13, 25] is widely utilized in scenarios where the action space is discrete, we first implement TRGIR with DQN (TRGIR-DQN) to show its effectiveness.

    3.5.1 Implementation with DQN.

    The structure of TRGIR-DQN is shown in Figure 4(b), and we utilize the improved DQN (Double DQN) [13] as the RL model. To avoid overestimation and thus improve performance, Double DQN decouples the action selection from the target Q-value calculation.
    To make the action selection more reasonable, we introduce the features of items as input by concatenating the item vector of \( c^t_k \) with \( s_t \) to derive \( \phi _t \),
    \( \begin{equation} \phi _t = \phi \big (s_t, c^t_k\big) = {\bf \textrm {c}}^t_k \oplus s_t , \end{equation} \)
    (4)
    where \( \oplus \) denotes the vector concatenation operation, and \( {\bf \textrm {c}}^t_k \) denotes the vector of the tth item in the kth candidate set \( c_k \). The concatenation of state \( s_t \) and \( {\bf \textrm {c}}^t_k \) determines the action selection.
    In each timestep t, the Q-value network takes \( \phi _t \) as input. By a multi-layer perceptron (MLP) network, we learn two Q-values, which we term the recommendation action and the skip action, respectively. If the Q-value of the recommendation action is greater than that of the skip action, then we recommend item \( c^t_k \); otherwise, we skip it. As shown in Figure 4(b), after receiving \( {a}_{t} \), \( {s}_{t} \), and \( {c}^t_{k} \), the simulator returns the next state \( {s}_{t+1} \). Specifically, according to the Q-values, if we recommend item \( c^t_k \), then we put \( {c}^t_{k} \) at the head of \( {s}_{t} \) and select the top \( n_s \) items as \( {s}_{t+1} \); otherwise, we let \( {s}_{t+1} = {s}_{t} \).
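    A minimal sketch of one TRGIR-DQN decision step, assuming a hypothetical `q_net` that maps the concatenated vector \( \phi_t \) to the two Q-values (recommend and skip); the remaining argument names are illustrative.
```python
import torch

def dqn_step(q_net, state_vec, item_vec, state_items, item_id, n_s):
    """Concatenate the candidate item vector with the state (Equation (4)),
    compare the recommend/skip Q-values, and update the state accordingly."""
    phi = torch.cat([item_vec, state_vec])          # phi_t = c_k^t (+) s_t
    q_recommend, q_skip = q_net(phi)                # two Q-values
    if q_recommend > q_skip:
        next_items = ([item_id] + state_items)[:n_s]   # put the item at the head
        recommended = True
    else:
        next_items = state_items                        # s_{t+1} = s_t
        recommended = False
    return recommended, next_items
```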
    Algorithm 2 shows the training phase of TRGIR-DQN. By maximizing the cumulative discounted reward, the model parameters (\( \theta \) and \( \theta ^{\prime } \)) can be learned. Based on the assumption that similar users have similar preferences, our method classifies users and trains a model for each cluster. In the beginning, we randomly initialize \( \theta \) and the replay buffer \( \mathrm{B} \). After constructing a candidate set \( c_k \) for \( u_i \) in cluster \( cl_l \), the agent selects and executes an action according to an \( \epsilon \)-greedy policy. The TRGIR-DQN algorithm focuses on minimizing the gap between the current Q-value \( Q\left(s, a | \theta _{l}\right) \) and the expected Q-value \( z_j \), which is measured by the following loss:
    \( \begin{equation} L_{j}\left(\theta _{l}\right)=\mathbb {E}_{s_j, a_j \sim \rho (\cdot)}\left[\left(z_{j}-Q\left(\phi _j, a_j | \theta _{l}\right)\right)^{2}\right] , \end{equation} \)
    (5)
    where \( \rho (\cdot) \) is the action distribution, and the expected Q-value \( z_{j} \) can be defined as,
    \( \begin{equation} z_{j}=\left\lbrace \begin{array}{@{}lll} r_{j}+\gamma Q^{\prime } (\phi _{j+1}, \arg \max _{a_{j+1}} Q(\phi _{j+1}, a_{j+1}; \theta _{l}); \theta _{l}^{\prime }), & \text{if non-terminal}; \\ r_{j}, & \text{if terminal}. \end{array} \right. \end{equation} \)
    (6)
    Differentiating the loss function with respect to the weights, we arrive at the gradient as,
    \( \begin{equation} \begin{aligned}\nabla _{\theta _{l}} J = \frac{1}{N_b} \sum _{j} \mathbb {E}_{s_{j}, a_j \sim \rho (\cdot) ; s_{j+1} \sim \mathcal {E}} [ (r_{j}&+\gamma Q^{\prime }(\phi _{j+1}, \arg \max _{a_{j+1}} Q(\phi _{j+1}, a_{j+1}; \theta _{l}); \theta _{l}^{\prime })\\ & - Q(\phi _{j}, a_j | \theta _{l})) \nabla _{\theta _{l}} Q(\phi _{j}, a_j ; \theta _{l}) ]. \end{aligned} \end{equation} \)
    (7)
    Then, we update the target network parameters by soft updates [23] with rate \( \tau \). Finally, the training stops when the loss stabilizes. To obtain k items at once, we order the recommended items by the Q-value of the recommendation action.
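    For reference, a compact sketch of the Double DQN target of Equation (6) and the soft update of the target network (PyTorch, with hypothetical `q_net`/`target_net` modules that output the two Q-values for a given \( \phi \)):
```python
import torch

def double_dqn_target(q_net, target_net, phi_next, reward, done, gamma=0.9):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    a_star = torch.argmax(q_net(phi_next))                  # selection by Q
    return reward + gamma * target_net(phi_next)[a_star]    # evaluation by Q'

def soft_update(target_net, online_net, tau=0.01):
    """Soft update of the target parameters with rate tau [23]."""
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)
```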

    3.5.2 Implementation with DDPG.

    DDPG combines the advantages of DQN and DPG [30] and can concurrently learn a policy and \( Q^{\pi }(s_{t}, a_{t}) \) in high-dimensional, continuous action spaces by using neural function approximation [23]. To further improve performance and efficiency, we implement TRGIR with DDPG (TRGIR-DDPG). However, the action space of IRSs is discrete, which does not directly suit DDPG. To address this problem, inspired by DDPG-kNN [11], we propose the policy vector \( {\bf \textrm {p}}_t \), which represents the user’s preference and is dynamically learned by TRGIR-DDPG. By utilizing \( {\bf \textrm {p}}_t \) to generate discrete actions from the candidate set, we bridge the gap between the discrete actions of IRSs and the continuous actions of DDPG.
    For Top-k recommendation, TRGIR-DQN has to calculate Q-values for each item to decide whether to recommend it or not, which is still inefficient when the scale of the candidate set is large. Moreover, a large-scale action space also limits the exploration ability of TRGIR-DQN, which in turn affects the performance. To solve these problems, we utilize DDPG as the RL model, which can explore a large continuous space efficiently, to enhance the exploration ability and further improve the efficiency.
    The structure of TRGIR-DDPG is shown in Figure 5(a). During the embedding process, we have mapped the discrete actions into the continuous feature space, where each item is represented by a feature vector. Then, by taking the dot product between \( {\bf \textrm {p}}_t \) and the item vectors in \( c_t \), we can select actions from a discrete space. Figure 5(b) gives an example to help understand the policy vector. Suppose a user selects a movie according to preferences that can be expressed as explicit policies such as Prefer Detective Comics, Insensitive to genres, and Like Superman. With our method, a policy vector in the feature space, e.g., \( (0.7, 0.5, 0.1, 0.9) \), can be learned, where the value of each dimension represents how much emphasis this user places on that dimension. By taking the dot product between the policy vector and the item vectors, we finally choose the movie Superman Returns, with the highest score of 2.79, for recommendation (assuming Top-1 recommendation here).
    Fig. 5.
    Fig. 5. (a) The Structure of TRGIR-DDPG; (b) An example for illustrating policy vector for TRGIR-DDPG.
    In each timestep t, the actor network takes a state \( s_t \) as input. By an MLP network, we learn a continuous vector, which we term the policy vector, denoted by \( {\bf \textrm {p}}_t \). The critic network takes state \( s_t \) and policy vector \( {\bf \textrm {p}}_t \) as input. By another MLP network, it learns the current Q-value to evaluate \( {\bf \textrm {p}}_t \). As illustrated in Figure 5(b), \( {\bf \textrm {p}}_t \) represents a user’s preferences in the feature vector space; it is a continuous weight vector that measures the importance of each dimension. Combining it with the candidate set \( c_t \), we can get the \( n_a \) items with the highest scores, where the score of each item \( v_i \) is denoted by \( Score(v_i) \) and,
    \( \begin{equation} Score({{v}_{i}})= {\bf \textrm {p}}_t^T {\bf \textrm {v}}_i . \end{equation} \)
    (8)
    As shown in Figure 5(b), TRGIR-DDPG generates \( {s}_{t+1} \) in a sliding-window manner. Specifically, among the ordered items in \( {a}_{t} \), we keep the order and select the items that are not in \( {s}_{t} \) as \( {a}^{\prime }_{t} \). Then, we put \( {a}^{\prime }_{t} \) at the head of \( {s}_{t} \) and select the top \( n_s \) items as \( {s}_{t+1} \). Moreover, to cover the action space to a large extent, the candidate set is randomly generated at each timestep.
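    The scoring of Equation (8) and the sliding-window state update can be sketched as follows (NumPy; the variable names are illustrative):
```python
import numpy as np

def select_topk(policy_vec, candidate_vecs, candidate_ids, n_a):
    """Score every candidate item with the policy vector (Equation (8)) and
    return the n_a highest-scoring item ids in descending order."""
    scores = candidate_vecs @ policy_vec            # Score(v_i) = p_t^T v_i
    order = np.argsort(-scores)[:n_a]
    return [candidate_ids[i] for i in order]

def next_state(state_items, action_items, n_s):
    """Sliding-window update: keep the order of the recommended items not
    already in the state, put them at the head, and truncate to n_s."""
    fresh = [v for v in action_items if v not in state_items]
    return (fresh + state_items)[:n_s]
```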
    The training phase (as shown in Algorithm 3) learns the model parameters \( \theta ^{Q} \), \( \theta ^{\mu } \), \( \theta ^{Q^{\prime }} \), and \( \theta ^{\mu ^{\prime }} \) by maximizing the cumulative discounted reward of all the decisions. As mentioned before, TRGIR-DDPG also trains a model for each cluster. At the beginning of the training phase, we randomly initialize the network parameters and the replay buffer \( \mathrm{B} \). For better action exploration, we initialize a random process \( \mathcal {N} \), which adds some uncertainty when generating \( {\bf \textrm {p}} \). The critic network focuses on minimizing the gap between the current Q-value \( Q(s_{j}, {\bf \textrm {p}}_j | \theta ^{Q}) \) and the expected Q-value \( z_j \), which is measured by the following loss:
    \( \begin{equation} L(\theta ^Q_{l}) = \frac{1}{N_b} \sum _{j}\big (z_j - Q\big (s_{j}, {\bf \textrm {p}}_j | \theta ^{Q}_{l}\big)\big)^{2} , \end{equation} \)
    (9)
    where \( z_j \) can be expressed in a recursive manner by using the Bellman equation,
    \( \begin{equation} z_j = r_{j}+\gamma Q^{\prime }\big (s_{j+1}, \mu ^{\prime }\big (s_{j+1} | \theta ^{\mu ^{\prime }}_{l}\big) | \theta ^{Q^{\prime }}_{l}\big). \end{equation} \)
    (10)
    The objective of the actor network is to optimize the policy vector \( {\bf \textrm {p}} \) through maximizing the Q-value. The actor network is trained by the sampled policy gradient:
    \( \begin{equation} \nabla _{\theta ^{\mu }_{l}} J \approx \frac{1}{N_b} \sum _{j} \nabla _{{\bf \textrm {p}}} Q\big (s, {\bf \textrm {p}} | \theta ^{Q}_{l}\big)|_{s=s_{j}, {\bf \textrm {p}} =\mu (s_{j})} \nabla _{\theta ^{\mu }_{l}} \mu \big (s | \theta ^{\mu }_{l}\big)|_{s_{j}} . \end{equation} \)
    (11)
    For TRGIR-DDPG, we also update the parameters of the target actor and the target critic network by soft updates [23] with rate \( \tau \). Finally, the training phase stops when the loss stabilizes.
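    A condensed sketch of the critic and actor objectives in Equations (9) to (11), assuming hypothetical `actor`, `critic`, and target networks and a sampled mini-batch of tensors; this is an illustration of the standard DDPG updates rather than the exact implementation.
```python
import torch

def ddpg_losses(actor, critic, target_actor, target_critic, batch, gamma=0.9):
    """Critic loss (Eq. (9)) with the Bellman target (Eq. (10)) and the
    actor objective whose gradient corresponds to Eq. (11).
    `batch` holds tensors s, p, r, s_next sampled from the replay buffer."""
    s, p, r, s_next = batch["s"], batch["p"], batch["r"], batch["s_next"]
    with torch.no_grad():                            # Bellman target
        z = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = ((z - critic(s, p)) ** 2).mean()   # mean squared TD error
    actor_loss = -critic(s, actor(s)).mean()         # ascend the Q-value
    return critic_loss, actor_loss
```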
    Note that in our implementations with DQN and DDPG, to avoid insufficient training, we set a minimum training-step threshold based on the size of the buffer \( \mathrm{B} \); the training stops only when the number of steps exceeds this minimum threshold and the loss stabilizes. Moreover, we also set a maximum training-step threshold based on the size of \( \mathrm{B} \); when the number of steps exceeds this maximum threshold, the training stops.

    3.6 Environment Simulator

    It is expensive and time-consuming to utilize a real interactive environment for training RL models. Following several previous works [5, 38], we build the environment simulator based on historical interactions. For a user \( u_i \) at timestep t, the simulator receives the present state \( {s}_{t} \) and action \( {a}_{t} \), then returns the reward \( {r}_{t} \) and the next state \( {s}_{t+1} \). In this way, the reward function can be written as \( \mathcal {R}\left(s_t, {a}_{t}\right) \), and the definition of \( \mathcal {R}\left(\cdot \right) \) differs for different RL models.
    At each timestep t, TRGIR-DQN only recommends at most one item to user \( u_i \) . The reward \( r_t \) of TRGIR-DQN is defined as,
    \( \begin{equation} r_t = \mathcal {R}\left(s_t, {a}_{t}\right)=\left\lbrace \begin{array}{@{}ll} y^*_{i,j}, & \text{if}~~ recommend~~ item~~ v_j; \\ 0, &\text{otherwise}, \end{array} \right. \end{equation} \)
    (12)
    where \( y^*_{i,j} \) is the adjusted rating of \( u_i \) on \( v_j \) . To give proper rewards for different types of items, \( y^*_{i,j} \) is designed as follows:
    \( \begin{equation} y^*_{i,j}=\left\lbrace \begin{array}{@{}llll} y_{i,j} - y_b, & \text{if}~~ v_j \in \mathcal {V}^p_{u_i}; \\ y_{i,j} - y_b -1, &\text{if}~~ v_j \in \mathcal {V}^n_{u_i}; \\ -0.5, &\text{if}~~ v_j \in \mathcal {V}_{neg} ; \\ 0, & \text{otherwise}. \end{array} \right. \end{equation} \)
    (13)
    Recall here \( y_{i,j} \) is the initial rating of \( u_i \) on \( v_j \) , and \( y_b \) is the rating bound. By this formula, positive items will get positive feedback, and negative items will get negative feedback. Moreover, the supplemented negative items will get half of the minimum negative feedback, i.e., \( -0.5 \) , while the other items will get feedback of 0.
    TRGIR-DDPG recommends \( n_a \) items to user \( u_i \) at each timestep. Its reward function not only guides the model to capture users’ preferences, but also evaluates the rank quality of the recommended items. Specifically, the reward \( r_t \) is determined by two values, \( {w}_{k} \) and \( y^*_{i,j} \),
    \( \begin{equation} r_t = \mathcal {R}\left(s_t, {a}_{t}\right)=\sum _{k=1}^{n_a} {w}_{k} \times y^*_{i,j} , \end{equation} \)
    (14)
    where \( w_{k} \) is the ranking weight of the items in \( a_t \) . Inspired by DCG [17, 38], the ranking weight is calculated by
    \( \begin{equation} {w}_{k} = 1/\log _{2}(k+1). \end{equation} \)
    (15)
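    The simulator rewards of Equations (12) to (15) reduce to a few lines; the sketch below assumes the item type (positive, negative, supplemented negative, or other) has already been determined, and the function names are illustrative.
```python
import math

def adjusted_rating(rating, y_b, item_type):
    """Adjusted rating y*_{i,j} of Equation (13). `item_type` is one of
    'pos', 'neg', 'supp_neg' (supplemented negative), or 'other'."""
    if item_type == "pos":
        return rating - y_b
    if item_type == "neg":
        return rating - y_b - 1
    if item_type == "supp_neg":
        return -0.5
    return 0.0

def ddpg_reward(adjusted_ratings):
    """TRGIR-DDPG reward (Equations (14)-(15)): rank-weighted sum of the
    adjusted ratings of the n_a recommended items, with DCG-style weights."""
    return sum(y / math.log2(k + 1)
               for k, y in enumerate(adjusted_ratings, start=1))
```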
    Note that the methods of generating \( {s}_{t+1} \) are different in TRGIR-DQN and TRGIR-DDPG, as detailed in the above specific implementations sections.

    4 Experiments and Results

    4.1 Experimental Settings

    In this section, to demonstrate the effectiveness of the proposed method, we first introduce the experimental settings and then present and discuss the experimental results from the perspective of both performance and efficiency to answer the following research questions:
    RQ1: How do the methods that implement our TRGIR framework with DQN and DDPG perform as compared with other state-of-the-art methods?
    RQ2: How is the recommendation sparsity problem alleviated by utilizing the textual information in different ways?
    RQ3: How does the efficiency benefit from the candidate action set and the policy vector?
    RQ4: How do the key hyper-parameters (e.g., the dimension of feature space, the number of clusters, the size of candidate set) affect the performance?
    Note that, when analyzing one factor, we keep the others fixed. The default settings are: the input dimension of GCN \( n_{in} \) is 100, the output dimension of GCN \( n_{out} \) is 64, the depth of the GCN propagation layer \( n_{gcn} \) is 2, the number of clusters \( n_{cl} \) is 5, the size of the candidate set \( n_c \) is 50, the rate of positive items \( \alpha \) is 0.1, the size of the state \( n_s \) is 20, and the size of the action \( n_a \) is 10 (but for TRGIR-DQN, \( n_a \) is fixed to 2). We have implemented our framework with DQN and DDPG, and the code can be accessed on GitHub.3

    4.1.1 Datasets.

    Leskovec et al. [21] collected and categorized a variety of Amazon products and built several datasets4 including ratings, descriptions, and comments. We evaluate our models on three publicly available Amazon datasets: Digital Music (Music for short), Beauty, and Clothing Shoes and Jewelry (Clothing for short), each of which has at least five comments per product. Table 1 shows the statistical details of the datasets we used.
| Dataset | #Users | #Items | #Positive Ratings | #Negative Ratings | Sparsity | Size of Des. | Size of Com. |
|---|---|---|---|---|---|---|---|
| Music | 5,541 | 3,568 | 58,905 | 5,801 | 0.9967 | 2,338 KB | 65,758 KB |
| Beauty | 22,363 | 12,101 | 176,520 | 21,982 | 0.9993 | 5,735 KB | 83,251 KB |
| Clothing | 39,387 | 23,033 | 252,022 | 26,655 | 0.9997 | 3,960 KB | 80,208 KB |

    Table 1. Statistics of Datasets

    4.1.2 Baseline Methods.

    We compare TRGIR-DQN and TRGIR-DDPG with eight methods, where ItemPop is a conventional recommendation method, DMF is an MF-based method with neural networks, ANR is a neural recommendation method that leverages textual information, Caser and SASRec are time-related deep learning–based methods, LinearUCB is a Multi-Armed Bandit (MAB)-based method, and D-kNN and TPGR are both RL-based methods.
    ItemPop recommends the most popular items from currently available items to the user. This method is non-personalized and is often used as a benchmark for recommendations.
    DMF [42] is a matrix factorization model using deep neural networks. Specifically, it utilizes two distinct MLPs to map the users and items into a common low-dimensional space.
    ANR [8] uses an attention mechanism to focus on the relevant parts of comments and estimates aspect-level user and item importance in a joint manner.
    Caser [33] embeds a sequence of recent items into an image and learns sequential patterns as local features of the image by using convolutional filters.
    SASRec [18] is a self-attention-based sequential model for next item recommendation. It models the entire user sequence and adaptively considers consumed items for prediction.
    LinearUCB [22] is a contextual-bandit recommendation approach that adopts a linear model to estimate the upper confidence bound for each arm.
    D-kNN [11] addresses the large discrete action space problem by combining DDPG with an approximate kNN method.
    TPGR [5] builds a balanced hierarchical clustering tree and formulates picking an item as seeking a path from the root to a certain leaf of the tree.
    Note that for D-kNN, a larger k (i.e., the number of nearest neighbors) results in better performance but poorer efficiency. For a fair comparison, we set k to \( 0.1M \) and M (M is the number of items), respectively.

    4.1.3 Evaluation Metrics and Methodology.

    Methods that achieve their goals by Top-k recommendation are typically evaluated with metrics such as Hit Ratio (HR) [42], Precision [33, 50], Recall [33], F1 [5], and normalized Discounted Cumulative Gain (nDCG) [18, 38, 48, 50]. To cover as many aspects of Top-k recommendation as possible, we chose HR@k, F1@k, and nDCG@k as the evaluation metrics.
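    For completeness, minimal sketches of two of these metrics under binary relevance (our reading; F1@k follows from precision and recall in the usual way):
```python
import math

def hr_at_k(ranked_items, positives, k):
    """Hit Ratio@k: 1 if at least one positive item appears in the top k."""
    return float(any(v in positives for v in ranked_items[:k]))

def ndcg_at_k(ranked_items, positives, k):
    """nDCG@k with binary relevance: the DCG of the ranked list divided by
    the DCG of an ideal ranking of the positives."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, v in enumerate(ranked_items[:k]) if v in positives)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(positives))))
    return dcg / ideal if ideal > 0 else 0.0
```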
    The test data was constructed during data preparation, and all the evaluated methods were tested on this data. We now describe the test method in detail: for each user, we first classify the user’s history logs into positive and negative ones and sort the items in the positive history logs by timestamp. Then, we choose the last \( 10\% \) of the ordered items in the positive logs as positive items. Finally, the negative items are randomly selected from the cluster that is farthest from the one the current user belongs to. Based on this strategy, the recommendation methods (except TPGR, which only recommends one item in each episode) can generate a ranked Top-k list to evaluate the metrics mentioned above.

    4.2 Comparison and Analysis (RQ1)

    Table 2 shows the summarized results of our experiments on the three datasets in terms of six metrics, including HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20. Note that, since TPGR is not suitable for Top-k recommendation, we did not include it as a competitor when evaluating the recommendation performance. From the results, we have the following key observations:
| Dataset | Metric | ItemPop | DMF | ANR | Caser | SASRec | LinearUCB | D-kNN (k=0.1M) | D-kNN (k=M) | TRGIR-DQN | TRGIR-DDPG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Music | HR@10 | 0.2447 | 0.3318 | 0.4980 | 0.8097 | 0.8897 | 0.3201 | 0.3274 | 0.3436 | 0.8037 | 0.9886 |
| Music | F1@10 | 0.0454 | 0.0621 | 0.1128 | 0.1676 | 0.1910 | 0.0631 | 0.0648 | 0.0692 | 0.1844 | 0.2304 |
| Music | nDCG@10 | 0.1101 | 0.1569 | 0.2756 | 0.5351 | 0.6212 | 0.1462 | 0.1527 | 0.1617 | 0.7719 | 0.9436 |
| Music | HR@20 | 0.4889 | 0.5885 | 0.7097 | 0.9090 | 0.9635 | 0.5747 | 0.5838 | 0.6001 | 0.8048 | 0.9935 |
| Music | F1@20 | 0.0525 | 0.0626 | 0.1084 | 0.1048 | 0.1151 | 0.0638 | 0.0647 | 0.0676 | 0.1003 | 0.1251 |
| Music | nDCG@20 | 0.1716 | 0.2210 | 0.3252 | 0.5542 | 0.6325 | 0.2095 | 0.2171 | 0.2258 | 0.7722 | 0.9446 |
| Beauty | HR@10 | 0.2551 | 0.2734 | 0.4550 | 0.6125 | 0.6823 | 0.3219 | 0.2585 | 0.2772 | 0.7278 | 0.8845 |
| Beauty | F1@10 | 0.0482 | 0.0502 | 0.0990 | 0.1218 | 0.1386 | 0.0614 | 0.0489 | 0.0519 | 0.1463 | 0.1798 |
| Beauty | nDCG@10 | 0.1134 | 0.1249 | 0.2252 | 0.3939 | 0.4569 | 0.1447 | 0.1170 | 0.1258 | 0.6213 | 0.6949 |
| Beauty | HR@20 | 0.5278 | 0.5273 | 0.6993 | 0.7817 | 0.8330 | 0.5911 | 0.5142 | 0.5377 | 0.7349 | 0.9501 |
| Beauty | F1@20 | 0.0543 | 0.0529 | 0.1006 | 0.0826 | 0.0907 | 0.0613 | 0.0529 | 0.0547 | 0.0782 | 0.1024 |
| Beauty | nDCG@20 | 0.1817 | 0.1885 | 0.2850 | 0.4344 | 0.4942 | 0.2122 | 0.1809 | 0.1910 | 0.6230 | 0.7104 |
| Clothing | HR@10 | 0.2265 | 0.2393 | 0.3421 | 0.5060 | 0.5817 | 0.2500 | 0.2541 | 0.2768 | 0.6593 | 0.7544 |
| Clothing | F1@10 | 0.0413 | 0.0437 | 0.0663 | 0.0934 | 0.1084 | 0.0458 | 0.0467 | 0.0510 | 0.1222 | 0.1405 |
| Clothing | nDCG@10 | 0.1033 | 0.1041 | 0.1622 | 0.2900 | 0.3525 | 0.1130 | 0.1131 | 0.1242 | 0.4577 | 0.4865 |
| Clothing | HR@20 | 0.4964 | 0.5044 | 0.6008 | 0.7196 | 0.7655 | 0.5041 | 0.5043 | 0.5293 | 0.7288 | 0.8973 |
| Clothing | F1@20 | 0.0482 | 0.0488 | 0.0659 | 0.0702 | 0.0758 | 0.0489 | 0.0490 | 0.0517 | 0.0711 | 0.0881 |
| Clothing | nDCG@20 | 0.1706 | 0.1704 | 0.2264 | 0.3427 | 0.3968 | 0.1756 | 0.1757 | 0.1874 | 0.4754 | 0.5225 |

    Table 2. Overall Recommendation Performance (ItemPop through D-kNN are compared methods; TRGIR-DQN and TRGIR-DDPG are ours)
    Best performance is in boldface and second best is underlined.
Compared with ItemPop, the methods that utilize deep neural networks are more effective. Moreover, the text-based method ANR consistently outperforms DMF, which uses only interaction information for embedding; this demonstrates the importance of utilizing textual information to alleviate the negative effects of data sparsity.
For IRSs, user preferences are inherently time-dependent. Caser, which learns sequential patterns with convolutional filters, and SASRec, which uses self-attention to model the entire sequence, improve the performance of IRSs dramatically.
The interactive methods, LinearUCB and D-kNN, perform similarly and only outperform ItemPop. LinearUCB is a traditional MAB-based method that does not plan for the long term explicitly. As for D-kNN, considering only the distance while ignoring the importance of each latent dimension to the user causes it to miss suitable items.
Our method TRGIR-DDPG achieves the best performance and obtains remarkable improvements over the state-of-the-art methods, which demonstrates the effectiveness of combining the candidate action set and the policy vector to address the large discrete action space problem. TRGIR-DQN also performs well but is inferior to TRGIR-DDPG; the gap may be due to the stronger exploration ability provided by the policy vector.

    4.3 Utilizing Textual Information (RQ2)

To examine whether this strong performance can be retained when textual information is not leveraged, or is utilized differently, we compare our TRGIR framework with the deep Reinforcement learning framework using Matrix-factorization representation for Interactive Recommendation (RMIR) [20] and the Text-based deep Reinforcement learning framework using Sum-average word representation for Interactive Recommendation (TRSIR) [34]. As with TRGIR, we implement both RMIR and TRSIR with DQN and with DDPG. The results on the three datasets (arranged in increasing order of scale and sparsity) in terms of HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20 are shown in Table 3.
Table 3.

| Dataset | Metric | RMIR-DQN | TRSIR-DQN | TRGIR-DQN | RMIR-DDPG | TRSIR-DDPG | TRGIR-DDPG |
|---|---|---|---|---|---|---|---|
| Music | HR@10 | 0.7164 | 0.7944 | 0.8037 | 0.8717 | 0.9455 | 0.9886 |
| Music | F1@10 | 0.1603 | 0.1805 | 0.1844 | 0.1920 | 0.2124 | 0.2304 |
| Music | NDCG@10 | 0.6843 | 0.7204 | 0.7719 | 0.6478 | 0.7152 | 0.9436 |
| Music | HR@20 | 0.7183 | 0.7965 | 0.8048 | 0.9487 | 0.9760 | 0.9935 |
| Music | F1@20 | 0.0865 | 0.0987 | 0.1003 | 0.1145 | 0.1203 | 0.1251 |
| Music | NDCG@20 | 0.6845 | 0.7207 | 0.7722 | 0.6643 | 0.7219 | 0.9446 |
| Beauty | HR@10 | 0.6553 | 0.6685 | 0.7278 | 0.6258 | 0.7449 | 0.8845 |
| Beauty | F1@10 | 0.1305 | 0.1306 | 0.1463 | 0.1267 | 0.1463 | 0.1798 |
| Beauty | NDCG@10 | 0.5165 | 0.5121 | 0.6213 | 0.4025 | 0.4896 | 0.6949 |
| Beauty | HR@20 | 0.6613 | 0.7050 | 0.7349 | 0.8156 | 0.8909 | 0.9501 |
| Beauty | F1@20 | 0.0698 | 0.0731 | 0.0782 | 0.0874 | 0.0936 | 0.1024 |
| Beauty | NDCG@20 | 0.5178 | 0.5209 | 0.6230 | 0.4486 | 0.5244 | 0.7104 |
| Clothing | HR@10 | 0.3953 | 0.6251 | 0.6593 | 0.3290 | 0.6622 | 0.7544 |
| Clothing | F1@10 | 0.0726 | 0.1157 | 0.1222 | 0.0602 | 0.1226 | 0.1405 |
| Clothing | NDCG@10 | 0.2655 | 0.4545 | 0.4577 | 0.1647 | 0.3917 | 0.4865 |
| Clothing | HR@20 | 0.4680 | 0.6886 | 0.7288 | 0.5805 | 0.8545 | 0.8973 |
| Clothing | F1@20 | 0.0453 | 0.0671 | 0.0711 | 0.0562 | 0.0835 | 0.0881 |
| Clothing | NDCG@20 | 0.2838 | 0.4704 | 0.4754 | 0.2273 | 0.4398 | 0.5225 |

Table 3. The Comparison of Embedding Methods (RMIR, TRSIR, and TRGIR) on Specific Implementations
From Table 3, it is clear that, whether implemented with DQN or DDPG, utilizing textual information alleviates the data sparsity problem. Meanwhile, the performance improvement grows with data scale and sparsity, which justifies the effectiveness of TRSIR and TRGIR, both of which leverage textual information in RL-based recommendation, especially for large-scale, highly sparse datasets. Further, regarding the representations of users and items, the comparison between TRSIR and TRGIR demonstrates that the self-supervised GCN embedding method is much more powerful than the simple sum-average word vector operation.
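For reference, the "simple sum-average word vector operation" can be pictured with a short sketch like the one below; the GloVe file format, tokenization, and stopword handling shown here are our own simplifications rather than TRSIR's exact implementation.

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors into a {word: vector} dict.
    (The 'word v1 v2 ...' one-line-per-word format is an assumption.)"""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sum_average_embedding(text, glove, stopwords, dim=300):
    """TRSIR-style document vector: average the GloVe vectors of the
    non-stopword tokens of a user's or item's associated text."""
    tokens = [w for w in text.lower().split() if w in glove and w not in stopwords]
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([glove[w] for w in tokens], axis=0)
```

Such an averaged vector ignores the relational structure among users, items, and texts, which is exactly what the GCN-based embedding exploits.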
Moreover, to examine the influence of the settings in our GCN-based self-supervised embedding method, we conduct an ablation study on the three datasets for TRGIR-DQN and TRGIR-DDPG in all metrics. Since the performance under different datasets, metrics, and implementations exhibits similar trends, we only present the performance of TRGIR-DDPG with the default setting (red bar) on Music in terms of HR@10 and F1@10, as shown in Figure 6. Note that in the two sub-graphs of Figure 6, the left three blue bars show the performance of TRGIR-DDPG with different relation settings. Specifically, without textual information (W/O Text) only contains user-item relations, while without descriptions (W/O Des.) contains user-item, user-comment, and item-comment relations. Because the comments are related to both users and items, we cannot remove either side separately; thus, without comments (W/O Com.) contains only user-item and item-description relations. We find that the performance degrades greatly when the method only contains user-item relations, and the model with all the relations (Default) obtains the best performance. Moreover, the item-description relations are more important than the comment-related relations for our model. The reason might be that our document-level natural language processing method introduces more noise in comments than in descriptions. The green bar in Figure 6 shows that without self-connection (W/O Self-con.), our model performs worse, demonstrating the effectiveness of self-connections. To keep the linear substructures and simplify our model, as mentioned in Section 3.3, we removed the activation function in our self-supervised embedding method. The orange bar in Figure 6 shows that our model performs better than the variant with an activation function (W/ Active-Fun.), which verifies the rationality of this design choice.
    Fig. 6.
    Fig. 6. Ablation experiments: Performance of TRGIR-DDPG on Music w.r.t. (a) HR@10; (b) F1@10.
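To make the "W/ Active-Fun." comparison above concrete, the toy sketch below contrasts stacked graph propagation with and without a nonlinearity; the normalization, weight shapes, and layer count are placeholders and do not reproduce the exact formulation of Section 3.3.

```python
import numpy as np

def propagate(adj_norm, features, weights, use_activation=False):
    """Stack propagation layers H^{l+1} = A_norm @ H^l @ W^l,
    optionally inserting a ReLU (the 'W/ Active-Fun.' variant).
    adj_norm is assumed to already include self-connections if used."""
    h = features
    for w in weights:
        h = adj_norm @ h @ w
        if use_activation:          # ablated variant only
            h = np.maximum(h, 0.0)  # ReLU nonlinearity
    return h

# Toy example: 4 nodes, 300-d GloVe-style inputs, two 16-d layers (shapes assumed).
rng = np.random.default_rng(0)
adj = rng.random((4, 4))
adj_norm = adj / adj.sum(axis=1, keepdims=True)      # simple row normalization
feats = rng.standard_normal((4, 300))
weights = [rng.standard_normal((300, 16)), rng.standard_normal((16, 16))]
emb_linear = propagate(adj_norm, feats, weights, use_activation=False)  # default
emb_nonlin = propagate(adj_norm, feats, weights, use_activation=True)   # ablation
```

Dropping the activation keeps each layer a linear map, so the stacked propagation preserves the linear substructures mentioned above.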
It is noteworthy that our way of introducing text may also introduce noise or irrelevant information. To alleviate this problem, we first utilize the Long Stopword List to filter out meaningless words (Section 3.3). Moreover, in the pre-trained GloVe [27], since the vector distance between words with similar meanings is much smaller than that between words with different meanings, the influence of noise can be reduced to some extent.

    4.4 Time Comparison (RQ3)

In this section, we compare the efficiency of the RL-based models from two aspects: the time consumed per training step (i.e., one model update) and the time consumed per decision, both measured in milliseconds. Table 4 presents the comparison of the time cost on the Beauty dataset. The time cost orderings on the other datasets exhibit similar trends to those on Beauty and are thus omitted here. The values in Table 4 are averaged measurements. For a fair comparison, both \( n_s \) and \( n_a \) are set to 1 (except that \( n_a \) is set to 2 for TRGIR-DQN), and the other hyper-parameters keep their default values. The experiments are conducted on the same machine with a 6-core CPU (i7-6850K, 3.6 GHz) and 64 GB RAM.
Table 4.

| Time Cost (ms) | TRGIR-DQN (\( n_c=500 \)) | TRGIR-DQN | DDPG-kNN (\( k=0.1M \)) | DDPG-kNN (\( k=M \)) | TPGR | TRGIR-DDPG |
|---|---|---|---|---|---|---|
| Per training step | 6.01 | 6.01 | 5.26 | 5.51 | 4.98 | 3.13 |
| Per decision | 621.82 | 80.65 | 7.86 | 56.42 | 5.06 | 0.9 |

Table 4. Time Comparison for Training and Decision-making
As shown in Table 4, the per-training-step time gap among the RL-based methods is not large. However, for the time cost of decision-making, the large discrete action space makes most RL-based recommendation methods inefficient. The value-based method TRGIR-DQN (\( n_c=500 \)), which has to calculate a Q-value for every candidate item, runs much slower than the other models, let alone at a realistic scale far larger than 500 items. By narrowing the action candidate set to 50 (\( n_c=50 \) is the default setting, which also performs better than \( n_c=500 \) in Figure 8(b)), the decision-making efficiency of TRGIR-DQN improves greatly. D-kNN also runs slowly (especially when \( k= \) M), because discovering nearest neighbors as actions in a large discrete action space has high time complexity. TPGR reduces the decision-making time significantly by constructing a clustering tree, but as mentioned before, it only supports Top-1 recommendation. Compared to the other methods, by using the action candidate set and the policy vector, TRGIR-DDPG achieves a significant improvement in execution efficiency.
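The scaling behavior behind these numbers can be illustrated with a toy measurement of our own (not the benchmark used for Table 4): selecting an action by scoring either the full item set or only an \( n_c \)-item candidate set against a policy vector. The sizes are arbitrary and a plain dot product stands in for the actual actor and critic networks.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_items, n_c, dim = 100_000, 50, 16                    # illustrative sizes only
item_vecs = rng.standard_normal((n_items, dim)).astype(np.float32)
candidate_ids = rng.choice(n_items, size=n_c, replace=False)
policy_vector = rng.standard_normal(dim).astype(np.float32)

def pick(vecs):
    """Select the action whose embedding best matches the policy vector."""
    return int(np.argmax(vecs @ policy_vector))

# Single-run timings, for illustration of the O(M) vs. O(n_c) gap only.
t0 = time.perf_counter(); pick(item_vecs);                t_full = time.perf_counter() - t0
t0 = time.perf_counter(); pick(item_vecs[candidate_ids]); t_cand = time.perf_counter() - t0
print(f"full action space: {t_full*1e3:.3f} ms, candidate set: {t_cand*1e3:.3f} ms")
```

Scoring only \( n_c \) candidates keeps the per-decision cost roughly constant regardless of how large the full item catalog grows.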

    4.5 Hyper-parameter Sensitivity (RQ4)

We select several important hyper-parameters and analyze their effects on the performance of TRGIR-DQN and TRGIR-DDPG. Note that we conducted these experiments on all the datasets with the six metrics mentioned above, and the results show that our approach exhibits similar performance trends on all of them. For simplicity, we only present the results on the Beauty dataset in terms of HR@10 and nDCG@10. When testing one hyper-parameter, we keep the others fixed at their default settings. From Figures 7 to 9, we can see that the two performance metrics for TRGIR-DQN and TRGIR-DDPG exhibit similar trends, so the following analyses apply to both of them:
    Fig. 7.
    Fig. 7. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the input dimension of GCN; (b) the output dimension of GCN; (c) the depth of GCN propagation layer.
The Input Dimension of GCN (\( n_{in} \)). Note that the input dimension of GCN, \( n_{in} \), is equal to the dimension of the pre-trained word vectors. Since the embedding initialization relies on these word vectors, \( n_{in} \) reflects the richness of the textual information. As shown in Figure 7(a), as \( n_{in} \) increases, TRGIR-DQN and TRGIR-DDPG perform better, as expected.
The Output Dimension of GCN (\( n_{out} \)). The output dimension of GCN, \( n_{out} \), is equal to the final vector dimension of users and items. Figure 7(b) shows that, as \( n_{out} \) increases, the performance of our methods remains stable. This is mainly because the useful knowledge can already be captured within 16 dimensions.
The Depth of the GCN Propagation Layer (\( n_{gcn} \)). Figure 7(c) shows that TRGIR-DDPG achieves the best performance when the depth \( n_{gcn} \) is 3, and TRGIR-DQN achieves the best performance when \( n_{gcn} \) is 2. The reason might be that increasing the number of propagation layers aggregates knowledge from more nodes, but too high-order propagation may cause the over-smoothing problem, which in turn degrades the performance.
The Number of Clusters (\( n_{cl} \)). As shown in Figure 8(a), as \( n_{cl} \) increases, the performance first rises and then falls. More clusters mean larger differences between the current cluster and the one that provides negative samples, which improves their quality. However, too many clusters may also cause a shortage of effective samples.
The Size of the Candidate Set (\( n_c \)). Figure 8(b) shows that the performance decreases as \( n_c \) increases. This is mainly because, for training samples, the items from \( \mathcal {V}^p_{u_i} \) are much fewer than the items from \( \mathcal {V}^p_{cl_l^f} \); increasing \( n_c \) therefore causes imbalanced sampling, which leads to worse performance.
    Fig. 8.
    Fig. 8. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the number of clusters; (b) the size of candidate; (c) the rate of positive items.
The Rate of Positive Items (\( \alpha \)). As shown in Figure 8(c), as \( \alpha \) increases, the performance first grows and then remains stable. This is because increasing \( \alpha \) introduces more positive items, which helps capture the user's interests better. However, since \( n_{pos} \le |\mathcal {V}^p_{u_i}| \) (see Algorithm 1), once \( \alpha \) is large enough, further growth no longer affects \( n_{pos} \).
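For illustration, the following hedged sketch shows one way a candidate set of size \( n_c \) could mix positives from the user's own history \( \mathcal {V}^p_{u_i} \) with items from the cluster-level pool \( \mathcal {V}^p_{cl_l^f} \), capping the number of positives by the history size as the constraint \( n_{pos} \le |\mathcal {V}^p_{u_i}| \) suggests; it does not reproduce the exact sampling rules of Algorithm 1, and the function and parameter names are ours.

```python
import random

def build_candidate_set(user_pos_items, cluster_pool_items, n_c=50, alpha=0.4, seed=0):
    """Mix positives from the user's own history with items from the
    cluster-level pool; the number of positives is capped by the history
    size, mirroring n_pos <= |V^p_{u_i}| (Algorithm 1's exact rules differ)."""
    rng = random.Random(seed)
    n_pos = min(int(alpha * n_c), len(user_pos_items))
    positives = rng.sample(user_pos_items, n_pos)
    remaining = [v for v in cluster_pool_items if v not in positives]
    fillers = rng.sample(remaining, min(n_c - n_pos, len(remaining)))
    return positives + fillers
```

Under this reading, once \( \alpha \cdot n_c \) exceeds the size of the user's positive history, \( n_{pos} \) stops growing, which matches the plateau observed in Figure 8(c).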
The Size of the State (\( n_s \)). Figure 9(a) shows that as the state size \( n_s \) increases, the performance stays almost unchanged, which means the size of the state has little impact on the implementations of our framework TRGIR.
The Size of the Action (\( n_a \)). For TRGIR-DDPG, Figure 9(b) shows that when \( n_a \) ranges from 1 to 10, the performance increases as well. However, the performance starts to decrease when \( n_a \) reaches 20. The larger \( n_a \) is, the more frequently the state changes, indicating that keeping a proper updating speed is important. Note that, since \( n_a \) is fixed to 2 for TRGIR-DQN, we do not include it in this set of experiments.
    Fig. 9.
Fig. 9. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the size of state; (b) the size of action.

    5 Conclusion

In this article, we propose TRGIR, a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation. By learning the embeddings of users and items with a GCN-based self-supervised embedding method on a relation graph that contains textual information, we obtain semantically meaningful user and item vectors, which greatly alleviates the data sparsity problem. Moreover, based on the idea of collaborative filtering, we classify users into several clusters and construct an action candidate set, which directly reduces the scale of the action space. Further, combined with the policy vector dynamically learned by DDPG, which represents the user's preferences in the feature space, we select items from the candidate set to generate recommendation actions, which greatly improves the efficiency of decision-making and enhances the exploration ability. Experimental results over a carefully designed simulator on three public datasets demonstrate that, compared with state-of-the-art methods, TRGIR-DDPG achieves remarkable performance improvements in a time-efficient manner.
For future work, we intend to model the textual information at the word level to capture finer-grained semantic factors for better recommendation performance; we would also like to explore incorporating transfer learning into our proposed model.


    References

    [1]
    Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Laura M. Haas and Ashutosh Tiwary (Eds.). 94–105.
    [2]
    Pierpaolo Basile, Claudio Greco, Alessandro Suglia, and Giovanni Semeraro. 2018. Deep learning and hierarchical reinforcement learning for modeling a conversational recommender system. Intelligenza Artificiale 12, 2 (2018), 125–141.
    [3]
    Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 717–725.
    [4]
    Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017).
    [5]
    Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 3312–3320.
    [6]
    Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the 36th International Conference on Machine Learning (ICML), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 1052–1061.
    [7]
    Germán Cheuque, José Guzmán, and Denis Parra. 2019. Recommender systems for online video game platforms: The case of STEAM. In Proceedings of the International Conference on World Wide Web (WWW), Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 763–771.
    [8]
    Jin Yao Chin, Kaiqi Zhao, Shafiq R. Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang (Eds.). ACM, 147–156.
    [9]
    Dong Deng, Liping Jing, Jian Yu, Shaolong Sun, and Haofei Zhou. 2018. Neural Gaussian mixture model for review-based rating prediction. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan (Eds.). ACM, 113–121.
    [10]
    Zhi-Hong Deng, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, and S. Yu Philip. 2019. DeepCF: A unified framework of representation learning and matching function learning in recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 61–68.
    [11]
    Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
    [12]
    George H. Dunteman. 1989. Principal Components Analysis. Number 69. Sage.
    [13]
    Hado van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2094–2100.
    [14]
    Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
    [15]
    Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
    [16]
    Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (SIGKDD). 368–377.
    [17]
    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. DOI:
    [18]
    Wang-Cheng Kang and Julian J. McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society, 197–206.
    [19]
    Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR).
    [20]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
    [21]
    Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. Retrieved from http://snap.stanford.edu/data.
    [22]
    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). 661–670.
    [23]
    Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR Poster).
    [24]
    Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys). 5–12.
    [25]
    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
    [26]
    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
    [27]
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1532–1543.
    [28]
    Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference. Springer, 593–607.
    [29]
    Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web Companion. ACM, 111–112.
    [30]
    David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML). 387–395.
    [31]
    Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 926–934.
    [32]
    Haihui Tan, Ziyu Lu, and Wenjie Li. 2017. Neural network based reinforcement learning for real-time pushing on text stream. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 913–916.
    [33]
    Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek (Eds.). ACM, 565–573.
    [34]
    Chaoyang Wang, Zhiqiang Guo, Jianjun Li, Peng Pan, and Guohui Li. 2020. A text-based deep reinforcement learning framework for interactive recommendation. In Proceedings of the 24th European Conference on Artificial Intelligence. IOS Press, 537–544. DOI:
    [35]
    Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization bandits for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. 2695–2702.
    [36]
    Kai Wang, Zhene Zou, Qilin Deng, Jianrong Tao, Runze Wu, Changjie Fan, Liang Chen, and Peng Cui. 2021. Reinforcement learning with a disentangled universal value function for item recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4427–4435.
    [37]
    Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
    [38]
    Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2017. Reinforcement learning to rank with Markov decision process. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 945–948.
    [39]
    Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-Supervised Graph Learning for Recommendation. Association for Computing Machinery, New York, NY, 726–735. DOI:
    [40]
    Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan Zhu. 2017. SSP: Semantic space projection for knowledge graph embedding with text descriptions. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 3104–3110.
    [41]
    Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
    [42]
    Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 3203–3209.
    [43]
    Wu Yao, Christopher Dubois, Alice X. Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-N recommender systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM). 153–162.
    [44]
    Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, and Changyou Chen. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. Adv. Neural Inf. Process. Syst. 32 (2019), 15214–15224.
    [45]
Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. Deep reinforcement learning for search, recommendation, and online advertising: A survey. SIGWEB Newsl. Spring (July 2019). DOI:
    [46]
    Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys). 95–103.
    [47]
    Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 1040–1048.
    [48]
    Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2018. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2018).
    [49]
    Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1411–1420.
    [50]
    Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the International Conference on World Wide Web (WWW). 167–176.
    [51]
    Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 425–434.


