
A Text-based Deep Reinforcement Learning Framework Using Self-supervised Graph Representation for Interactive Recommendation

Published: 17 May 2022
    Abstract

    Due to its nature of learning from dynamic interactions and planning for long-run performance, Reinforcement Learning (RL) has attracted much attention in Interactive Recommender Systems (IRSs). However, most of the existing RL-based IRSs usually face a large discrete action space problem, which severely limits their efficiency. Moreover, data sparsity is another problem that most IRSs are confronted with. The utilization of recommendation-related textual knowledge can tackle this problem to some extent, but existing RL-based recommendation methods either neglect to combine textual information or are not suitable for incorporating it. To address these two problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, we leverage textual information to map items and users into the same feature space by a self-supervised embedding method based on the graph convolutional network, which greatly alleviates the data sparsity problem. Moreover, we design an effective method to construct an action candidate set, which reduces the scale of the action space directly. Two types of representative reinforcement learning algorithms have been applied to implement TRGIR. Since the action space of IRS is discrete, it is natural to implement TRGIR with the Deep Q-learning Network (DQN). In the TRGIR implementation with Deep Deterministic Policy Gradient (DDPG), denoted as TRGIR-DDPG, we design a policy vector, which can represent the user’s preferences, to generate discrete actions from the candidate set. Through extensive experiments on three public datasets, we demonstrate that TRGIR-DDPG achieves state-of-the-art performance over several baselines in a time-efficient manner.

    1 Introduction

    In the era of information explosion, recommender systems play a critical role in resisting information overload. Recently, the Interactive Recommender System (IRS) [49], which continuously recommends items to individual users and receives their feedback to refine its recommendation policy, has attracted much attention and plays an important role in personalized services, such as Amazon, Pandora, and YouTube.
    In the past few years, there have been some attempts to address the interactive recommendation problem by modeling the recommendation process as a Multi-Armed Bandit (MAB) problem [22, 35, 49], but these methods are not designed for long-term planning explicitly, which makes their performance unsatisfactory [5]. It is well recognized that Reinforcement Learning (RL) performs excellently in finding policies on interactive long-running tasks, such as playing computer games [25] and solving simulated physics problems [23]. Therefore, it is natural to introduce RL to model the interactive recommendation process. In fact, recently there have been some works on applying RL to address the interactive recommendation problem [5, 6, 11, 16, 32, 44, 45, 46, 47, 50]. However, most of the existing RL-based methods [6, 16, 32, 44, 45, 46, 47, 50] suffer from the problem of making a decision in linear time complexity with respect to the size of the action space, i.e., the number of available items, which makes them inefficient (or unscalable) when the IRS action space size is large.
    To improve efficiency, based on Deep Deterministic Policy Gradient (DDPG), Dulac-Arnold et al. [11] proposed DDPG-kNN, which first learns an action representation (vector) in a continuous hidden space and then finds the valid item by a k nearest neighbor search. However, because DDPG is not designed for a discrete IRS action space, and DDPG-kNN ignores the importance of each dimension in the action vector, the effectiveness of such a method is limited. Moreover, this method still needs to find the k nearest neighbors from the whole action space, which is time-consuming. Recently, Chen et al. [5] proposed a tree-structured policy gradient recommendation framework, within which a balanced hierarchical clustering tree is built over the items. Picking an item is then formulated as seeking a path from the root to a certain leaf in the tree, which dramatically reduces the time complexity. However, this method introduces the burden of building the clustering tree; in particular, when new items appear frequently, the tree must be reconstructed, which can be costly.
    In addition, most of the existing RL-based recommendation methods use only past interaction data, such as ratings, purchase logs, or viewing history, to model user preferences and item features [5, 11, 48]. A major limitation of such methods is that they may suffer serious performance degradation when facing the data sparsity problem, which is very common in real-world recommendation systems. As is well known, textual information such as comments by users and item descriptions provided by suppliers contains more knowledge than interaction data. Nowadays, textual information is readily available in many e-commerce and review websites, such as Amazon and Yelp. Thanks to the invention of word embedding, applying textual information for recommendation is possible, and there have been some successful attempts in conventional recommender systems [3, 8, 51]. But for IRS, existing RL-based methods either neglect to leverage textual information or are not suitable for incorporating textual information due to their unique structures for processing rating sequences.
    To address the aforementioned problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, to leverage textual information, we first embed descriptions and comments with pre-trained word vectors [27]. Then, we build a relation graph that consists of four types of nodes (user, item, description, and comment) and use the description and comment vectors to initialize the embeddings of the corresponding nodes. By a self-supervised embedding method based on the graph convolutional network (GCN) [19], we can learn embedding vectors of users and items with semantics, which alleviates the data sparsity problem to a great extent. Next, based on the user vectors, we classify users into several clusters by the K-means algorithm [1]. Inspired by the idea of collaborative filtering, we construct an action candidate set, which consists of positive, negative, and ordinary items selected based on the user’s historical logs and the clustering results, to reduce the scale of the action space directly. Considering that the action in IRS is discrete and the widely used Deep Q-learning Network (DQN) [25] is designed for dealing with discrete action space problems, we first present an implementation of TRGIR based on DQN, denoted as TRGIR-DQN. However, when facing a large action space, DQN’s efficiency and its exploration ability to decide the proper action degrade significantly. To address this problem, we further propose to utilize DDPG [23] as the RL model and denote this implementation as TRGIR-DDPG. Specifically, we design a policy vector to generate discrete actions from the candidate set. The policy vector, which represents the user’s preference in a feature space, is dynamically learned from the actor network of TRGIR-DDPG. By combining the candidate set with the policy vector, we can enhance the exploration ability and further improve the efficiency.
    Finally, considering that it is too expensive to train and test our model in an online manner, we build an environment simulator to mimic online environments with principles derived from real-world data. Through extensive experiments on several real-world datasets with different settings, we demonstrate that TRGIR-DDPG achieves high efficiency and remarkable performance improvement over several state-of-the-art baselines, especially for large-scale, high-sparsity datasets. To sum up, the main contributions of this work are as follows:
    To reduce the negative influence of rating sparsity in IRSs, we build a relation graph, which is initialized with the description and comment embeddings calculated by textual information and pre-trained word vectors. Through learning the embeddings of users and items by a GCN-based self-supervised embedding method on this graph, we can derive user and item vectors with semantics efficiently.
    Based on the idea of collaborative filtering, we classify users into several clusters and build the candidate set, which reduces the scale of the action space directly. Further, in TRGIR-DDPG, we represent the preferences of users by implicit policy vectors and propose a method based on DDPG to learn the policy vectors dynamically. The policy vector, combined with the candidate set, is used to generate discrete actions, which enhances the exploration ability and improves the efficiency simultaneously.
    Extensive experiments are conducted on three benchmark datasets; the results verify the high efficiency and superior performance of TRGIR-DDPG over state-of-the-art methods for IRSs.
    The remainder of this article is organized as follows: Section 2 discusses related work; Section 3 formally defines the research problem and details the proposed TRGIR framework, as well as the corresponding learning algorithms; Section 4 presents and analyzes the experimental results; Section 5 concludes the article with some remarks.

    2 Related Work

    2.1 RL-based Recommendation Methods

    RL-based recommendation methods usually formulate the recommendation procedure as a Markov Decision Process (MDP). They explicitly model the user’s dynamic status and plan for long-run performance [5, 6, 11, 16, 32, 36, 44, 45, 46, 47, 50]. As mentioned earlier, most existing RL-based methods [6, 16, 32, 36, 44, 45, 46, 47, 50] suffer from the large-scale discrete action space problem.
    To address this problem in IRS, there have been some notable attempts. Dulac-Arnold et al. [11] proposed DDPG-kNN, which first leverages prior information about the actions to embed them in a continuous space and generate a proto-action. Then, via a k nearest neighbor search, this method finds a set of discrete actions closest to the proto-action as candidates in logarithmic time, which improves the efficiency dramatically. However, there are two flaws: (1) DDPG is not designed for a discrete IRS action space; (2) this method ignores the negative influence of the dimensions that users do not care about, which affects the performance of DDPG-kNN. Moreover, the k nearest-neighbor search needs to be conducted on the whole action space, which still incurs a high runtime overhead. Later, Zhao et al. [48] used the actor network of an actor-critic architecture to obtain k weight vectors at once, each of which picks the maximum-score item from the remaining items. However, the relation among these vectors is unclear, so the order of the k items cannot be explained. Most recently, based on Deterministic Policy Gradient (DPG), Chen et al. [5] proposed a Tree-structured Policy Gradient Recommendation (TPGR) framework. In TPGR, a balanced hierarchical clustering tree is built over all the items, and making a decision is formulated as seeking a path from the root to a certain leaf in the clustering tree, which also reduces the time complexity significantly. However, limited by the search method that can only reach one leaf node at a time, this method only supports Top-1 recommendation. Moreover, when new items appear frequently, the clustering tree needs to be reconstructed, which incurs extra costs.

    2.2 Text-related Recommendation Methods

    Most recommendation models (including RL-based ones) that merely exploit the interaction matrix face the data sparsity problem, which can potentially be alleviated by exploiting the large amount of knowledge in textual information [51]. The development of deep learning in natural language processing (NLP) makes it possible to use textual information to enhance recommendation performance [3, 7, 8, 51]. In fact, there are already some works that enhance their models with vectors obtained from textual information (such as descriptions and comments) by sentiment analysis [3], convolutional neural networks [9, 51], or word vectors pre-trained on large corpora [8]. Recently, some researchers have tried to combine Knowledge Graph (KG, a kind of relation graph) embedding models with the textual information of entities. Socher et al. [31] introduced an expressive neural tensor network suitable for reasoning over relations between two entities and found that the performance improves when entities are represented as an average of their constituent word vectors. Xie et al. [41] proposed a representation learning method for KGs that takes advantage of entity descriptions, which are learned by a continuous bag-of-words model and a deep convolutional neural model. Xiao et al. [40] proposed the semantic space projection model, which jointly learns from symbolic triples and textual descriptions, and demonstrated its effectiveness.
    IRS also suffers from the rating sparsity problem, but most of the existing RL-based methods for IRS either neglect to incorporate textual information [11, 16, 46, 47] or have difficulty in utilizing it, since they adopt time-related structures to input the rating sequence [5, 50]. Recently, by combining images with textual information, Zhang et al. [44] proposed a novel constraint-augmented RL framework to efficiently incorporate user preferences over time. Specifically, they leveraged a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing the expected cumulative future reward. Different from our method, which introduces textual information to alleviate the rating sparsity problem, Reference [44] mainly focuses on utilizing constraint-augmented RL to address the problem that recommendations can easily violate the preferences that users express in their past natural-language feedback. Moreover, like most existing RL-based methods [5, 6, 11, 16, 32, 45, 46, 47, 50], Reference [44] also suffers from the large-scale discrete action space problem, which is another limitation of existing methods that we intend to address in this work. It is also noteworthy that in the domain of conversational recommender systems (CRSs), Basile et al. [2] proposed a framework that combines deep learning and reinforcement learning and uses text-based features to provide relevant recommendations and produce meaningful dialogues. But different from CRS, in our RL-based method for IRS, the textual information is utilized to learn users’ implicit long-term preferences, not their proactive immediate needs.

    2.3 Other Relevant Recommendation Methods

    With the development of deep learning, there have been some works that apply deep learning models to recommendation. Sedhain et al. [29] proposed AutoRec to learn embeddings that can reconstruct the ratings of a user from their records. Wu et al. [43] proposed Collaborative Denoising AutoEncoders (CDAE), which utilizes the idea of Denoising Auto-Encoders and incorporates more flexible components for implicit feedback. Using a two-pathway neural network representation learning architecture, Xue et al. proposed deep matrix factorization (DMF) [42] to map the users and items into a common low-dimensional latent space with non-linear projections, and then utilize cosine similarity as the matching function to calculate predictive scores. To learn the complex structure of user interaction data, He et al. [15] replaced the inner-product matching function with a non-linear MLP architecture. Moreover, by fusing the MLP-based neural matching function learning structure with the generalized matrix factorization representation learning structure, NeuMF was proposed to obtain better performance. Further, considering that DNN-based representation learning and matching function learning suffer from two fundamental flaws, i.e., the limited expressiveness of the inner product and the weakness in capturing low-rank relations, respectively, Deng et al. [10] proposed DeepCF, which combines the strengths of neural representation learning and neural matching function learning to overcome these flaws.
    To gain more effective representations, some recent works exploit the structure of the interaction graph by propagating user and item embeddings on it. GC-MC [4] applies a GCN-based auto-encoder framework on the bipartite user-item graph, but it only employs GCN for link prediction between users and items. Inspired by GCN [19], NGCF [37] exploits the collaborative signal in the embedding function and explicitly encodes the signal in the form of high-order connectivity by performing embedding propagation. The embedding propagation rule of NGCF is the same as that of standard GCN, which was originally proposed for node classification on attributed graphs, where each node has rich attributes. By removing the feature transformation and non-linear activation in NGCF, which increase the training difficulty, LightGCN [14] achieves significant accuracy improvements. Recently, Wu et al. [39] applied self-supervised learning on the user-item graph to improve the accuracy and robustness of GCNs for recommendation. Different from bipartite graphs that only contain user-item interactions, the heterograph considered in this work contains other node types. Compared with GCN, the relational graph convolutional network (RGCN) [28] has been shown to be capable of dealing with the highly multi-relational data characteristic of heterographs. In view of this, we apply it to propagate information on our heterograph.

    3 Proposed Method

    3.1 Problem Formulation

    We consider a recommender system with M users \( \mathrm{U}=\lbrace u_1, \ldots , u_M\rbrace \) and N items \( \mathrm{V}=\lbrace v_1, \ldots , v_N\rbrace \), and use \( Y \in \mathbb {R}^{M \times N} \) to denote the rating matrix, where \( y_{i,j} \) is the rating of user \( u_i \) on item \( v_j \). For the textual information, we denote \( \mathrm{D}=\lbrace d_1, \ldots , d_P\rbrace \) and \( \mathrm{C}=\lbrace c_1, \ldots , c_Q\rbrace \) as the sets of descriptions and comments, respectively. The interactive Top-k recommendation process can be modeled as a special Markov Decision Process (MDP), whose key components are defined as follows:
    State. Use \( \mathrm{S} \) to denote the state space. A state \( s \in \mathrm{S} \) is defined as the possible interaction between a user and the recommender system, which can be represented by \( n_s \) item vectors in a certain order.
    Action. Use \( \mathrm{A} \) to denote the action space. An action \( a \in \mathrm{A} \) contains \( n_a \) ordered items, each of which is represented by a vector. For the interactive Top-k recommendation, the scale of \( \mathrm{A} \) is large.
    Reward function. After receiving an action a at state s, our environment simulator returns a reward r, which reflects the user’s feedback to the recommended items. We use \( \mathcal {R}(s, a) \) to denote the reward function.
    Transition. In our model, since the state is a set of item vectors, once the action is determined and the user’s feedback is given, the state transition is also determined.
    Consider an agent that interacts with the environment \( \mathcal {E} \) in discrete timesteps. At each timestep t, the agent receives a state \( s_{t} \) by observing the current environment, then takes an action \( a_{t} \) and gets a reward \( r_{t} \). An agent’s behavior is defined by a policy \( \pi \), which maps states to a probability distribution over the actions, i.e., \( \pi : \mathrm{S} \rightarrow \mathcal {P}(\mathrm{A}) \). Based on the above notations, we can define the instantiated MDP for our recommendation problem, \( \mathcal {M}=\langle \mathrm{S}, \mathrm{A}, \mathcal {R}, \mathcal {P}, T, \gamma \rangle \), where T is the maximal decision step and \( \gamma \) is the discount factor. The objective of this work is to learn a policy \( \pi ^* \) that maximizes the expected discounted cumulative reward.
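    To make the interaction protocol concrete, here is a minimal sketch of the loop defined above. The names `env` and `agent` and their methods are hypothetical stand-ins for the simulator and the RL model, not part of our implementation.
```python
def run_episode(env, agent, T, gamma=0.9):
    """Roll out one episode of at most T decision steps and
    return the discounted cumulative reward."""
    state = env.reset()              # initial state: n_s item vectors
    total, discount = 0.0, 1.0
    for t in range(T):
        action = agent.act(state)    # a sample from pi: S -> P(A)
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        total += discount * reward
        discount *= gamma
        state = next_state
        if done:
            break
    return total
```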

    3.2 Framework Overview

    Figure 1 gives an overview of our framework, which contains two major steps: data preparation and training. In data preparation, we first build a relation graph that contains four types of nodes (user, item, description, and comment) and use vectors of descriptions and comments obtained from pre-trained word vectors [27] to initialize the embeddings of description and comment nodes. Note that the embeddings of user and item nodes are initialized randomly. On the relation graph, the user, item, description, and comment embeddings in the lth propagation layer are denoted as \( {\bf \textrm {U}}^{(l)}=\lbrace {\bf \textrm {u}}^{(l)}_1, \ldots , {\bf \textrm {u}}^{(l)}_M\rbrace \) , \( {\bf \textrm {V}}^{(l)}=\lbrace {\bf \textrm {v}}^{(l)}_1, \ldots , {\bf \textrm {v}}^{(l)}_N\rbrace \) , \( {\bf \textrm {D}}^{(l)}=\lbrace {\bf \textrm {d}}^{(l)}_1, \ldots , {\bf \textrm {d}}^{(l)}_P\rbrace \) , and \( {\bf \textrm {C}}^{(l)}=\lbrace {\bf \textrm {c}}^{(l)}_1, \ldots , {\bf \textrm {c}}^{(l)}_Q\rbrace \) , respectively. After propagating with L layers by utilizing a GCN-based self-supervised embedding method, we can learn the embeddings of user and item \( {\bf \textrm {U}}^{(L)}=\lbrace {\bf \textrm {u}}^{(L)}_1, \ldots , {\bf \textrm {u}}^{(L)}_M\!\rbrace \) and \( {\bf \textrm {V}}^{(L)}=\lbrace {\bf \textrm {v}}^{(L)}_1, \ldots , {\bf \textrm {v}}^{(L)}_N\rbrace \) . Through the learning process on the relation graph, \( {\bf \textrm {U}} \) and \( {\bf \textrm {V}} \) can gain the semantics contained in \( {\bf \textrm {D}} \) and \( {\bf \textrm {C}} \) . Then, based on the user embeddings, we utilize the unsupervised K-means [1] algorithm to classify the users into several clusters, which will later be used for helping construct the action candidate set.
    Fig. 1.
    Fig. 1. Framework overview.
    In the training phase, with the objective of implementing a more personalized recommendation, we train a unique model for each cluster. Take cluster 2 as an example: we randomly select a user \( u_i \) from it. Based on the historical logs of \( u_i \) and the user classification results, we sample positive, negative, and ordinary items for \( u_i \) to construct a candidate set, which is later used in the reinforcement model for action selection. The reinforcement model interacts with the simulator, which is based on historical logs, to learn the inner relations among all possible states and actions. For the specific implementation, we employ DQN [13, 25] (TRGIR-DQN) and DDPG [23] (TRGIR-DDPG) as our reinforcement model, respectively. In particular, by utilizing the policy vector in the DDPG implementation, we can improve the efficiency dramatically. The training phase stops when the model loss stabilizes.

    3.3 GCN-based Self-supervised Embedding

    Descriptions and comments are the most important textual information in recommender systems. The descriptions, which contain items’ advantages, and the comments, which contain users’ attitudes, along with the ratings, can express the preferences of users well. To obtain well expressive embeddings of users and items, we build a relation graph (as shown in the left part of Figure 1) including user nodes, item nodes, description nodes, and comment nodes. By initializing the node embeddings of descriptions and comments with the pre-trained word vectors GloVe [27], we can obtain semantics from the textual information. Note that the original descriptions and comments contain many meaningless words that can affect the quality of the constructed vectors; we remove them in advance according to the Long Stopword List.1 Then, we use the pre-trained word vectors GloVe.6B,2 which have been trained on large corpora (Wikipedia 2014 and Gigaword 5), to calculate \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \). Specifically,
    \( \begin{equation} {\bf \textrm {d}}^{(0)}_p = \frac{1}{n_d}\sum _{i=1}^{n_d}{\bf \textrm {w}}_i, \quad {\bf \textrm {c}}^{(0)}_q = \frac{1}{n_c}\sum _{i=1}^{n_c}{\bf \textrm {w}}_i , \end{equation} \)
    (1)
    where \( {\bf \textrm {w}}_i \) denotes the vector of word \( {w}_{i} \), and \( n_d \) (\( n_c \)) denotes the number of words that \( d_p \) (\( c_q \)) contains after removing the stop words. Note that word vectors with similar semantics are closer in Euclidean distance than word vectors with large semantic differences [27], which ensures that comments (or descriptions) with similar semantics are closer to each other. Different from \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \), the initial embeddings of user and item nodes, \( {\bf \textrm {u}}^{(0)}_m \) and \( {\bf \textrm {v}}^{(0)}_n \), are constructed randomly.
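    As an illustration of Equation (1), the following sketch averages the remaining word vectors of a description or comment; the `glove` dictionary (word to NumPy vector) and the `stopwords` set are hypothetical inputs standing in for GloVe.6B and the Long Stopword List.
```python
import numpy as np

def init_text_embedding(text, glove, stopwords, dim=100):
    """Initialize a description/comment node embedding as the average of the
    GloVe vectors of its non-stopword tokens (Equation (1))."""
    words = [w for w in text.lower().split()
             if w not in stopwords and w in glove]
    if not words:                       # fall back to zeros if nothing remains
        return np.zeros(dim)
    return np.mean([glove[w] for w in words], axis=0)
```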
    Then, we introduce a GCN-based self-supervised embedding method to learn the representations of users and items. Let \( e \in \mathrm{E} \) denote the entities in the relation graph, where \( \mathrm{E}=\mathrm{U}\cup \mathrm{V}\cup \mathrm{D}\cup \mathrm{C} \), and let e denote the representation of e. According to the feature propagation model of RGCN [28], an entity \( e_i \) is capable of receiving the messages propagated from its l-hop neighbors by stacking l embedding propagation layers. In the lth step, the embedding of \( e_i \) is recursively formulated as,
    \( \begin{equation} {\bf \textrm {e}}^{(l)}_i= \sum _{r^e \in \mathrm{R}} \sum _{e_j \in \mathrm{N}^{r^e}_{e_i}} \mathcal {L}_{e_i, e_j} W^{(l-1)}_j {\bf \textrm {e}}^{(l-1)}_j + W^{(l-1)}_{\text{self}} {\bf \textrm {e}}^{(l-1)}_i , \end{equation} \)
    (2)
    where \( r^e \in \mathrm{R} \) denotes one of the relations between different entities, \( {\bf \textrm {e}}^{(l-1)}_i \) and \( {\bf \textrm {e}}^{(l-1)}_j \) denote the embeddings of \( e_i \) and \( e_j \) generated from the previous message propagation steps, \( \mathrm{N}^{r^e}_{e_i} \) denotes the set of neighbor entities that are directly connected to \( e_i \) under relation \( r^e \), \( W^{(l-1)}_j \) and \( W^{(l-1)}_{\text{self}} \) are trainable weights under relation \( r^e \), and \( \mathcal {L}_{e_i, e_j} = 1 / {\scriptstyle \sqrt { |\mathrm{N}^{r^e}_{e_i}||\mathrm{N}^{r^e}_{e_j}|}} \) is the symmetric normalization. As shown in Figure 2(a), we take the message propagation of user \( u_1 \) as an example. There are four types of relations in total (i.e., \( |\mathrm{R}| = 4 \)): user-item, user-comment, item-comment, and item-description. After two propagation steps, the messages from the related nodes under different relations are aggregated into the target node \( u_1 \).
    Fig. 2.
    Fig. 2. (a) Example illustrating message propagation; (b) Example illustrating the calculation of \( U^2 \) and \( V^2 \) .
    Note that the GloVe word vectors exhibit linear substructures [27]. The averaging operation preserves this linear substructure in the initial embeddings \( {\bf \textrm {d}}^{(0)}_p \) and \( {\bf \textrm {c}}^{(0)}_q \). To avoid destroying the linear substructures and to accelerate the training phase, we remove the non-linear activation functions of the standard RGCN, and we find that this operation improves the performance.
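    For clarity, the following PyTorch sketch shows one propagation layer in the spirit of Equation (2): per-relation linear transforms plus a self-transform, with the symmetric normalization folded into the adjacency matrices and no non-linear activation. It is an illustrative approximation under these assumptions, not the exact implementation.
```python
import torch
import torch.nn as nn

class LinearRGCNLayer(nn.Module):
    """One message-passing layer: per-relation linear transforms plus a
    self-transform, without a non-linear activation (cf. Equation (2))."""

    def __init__(self, num_relations, in_dim, out_dim):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)]
        )
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj_per_relation):
        """x: (num_nodes, in_dim) embeddings of all entities.
        adj_per_relation: list of symmetrically normalized adjacency
        matrices of shape (num_nodes, num_nodes), one per relation type."""
        out = self.self_weight(x)
        for adj, lin in zip(adj_per_relation, self.rel_weights):
            out = out + adj @ lin(x)     # aggregate messages for this relation
        return out
```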
    The self-supervised loss function of our embedding method encourages the nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct. Inspired by the contrastive self-supervised method [26] and considering the linear substructures of these representations in Euclidean space, we choose mean square error as the distance metric. Based on the randomly selected batch of users and items \( \mathrm{U}^b \) and \( \mathrm{V}^b \) , the loss function of graph network parameters \( \theta ^G \) is designed as,
    \( \begin{equation} \begin{aligned}L(\theta ^G) =&\sum _{u_m \in \mathrm{U}^b}\left(\sum _{u_i \in \mathcal {N}_{u_m}} (\mathbf {u_m}- \mathbf {u}_{i})^{2} - \sum _{u_j \in \mathcal {F}_{u_m}} (\mathbf {u_m} - \mathbf {u}_{j})^{2}\right) \\ &\quad + \sum _{v_n \in \mathrm{V}^b}\left(\sum _{v_i \in \mathcal {N}_{v_n}} (\mathbf {v_n}- \mathbf {v}_{i})^{2} - \sum _{v_j \in \mathcal {F}_{v_n}} (\mathbf {v_n} - \mathbf {v}_{j})^{2}\right) + \lambda ||\theta ^G||_2 , \end{aligned} \end{equation} \)
    (3)
    where \( \mathcal {N}_{u_m} \) and \( \mathcal {N}_{v_n} \) are the sets of nearby nodes of user \( u_m \) and item \( v_n \), \( \mathcal {F}_{u_m} \) and \( \mathcal {F}_{v_n} \) are the sets of disparate nodes (nodes far from the current node) of user \( u_m \) and item \( v_n \), and \( \lambda \) is a hyper-parameter that controls the strength of the regularizer to avoid model overfitting. Note that we define \( U^2 = H \times H^T \) and \( V^2 = H^T \times H \), where \( H \in \mathbb {R}^{M \times N} \) denotes the user-item adjacency matrix. For any \( h_{m, n} \in H \), if \( u_m \) has interacted with \( v_n \), then \( h_{m, n} = 1 \); otherwise, \( h_{m, n} = 0 \). Moreover, to help understand the calculation of \( U^2 \) and \( V^2 \), Figure 2(b) presents an example based on the relational graph depicted in Figure 1. Note that both \( U^2 \) and \( V^2 \) are symmetric matrices, i.e., \( U^2_{m,i} = U^2_{i,m} \) and \( V^2_{n,i} = V^2_{i,n} \). The diagonal elements in \( U^2 \) (or \( V^2 \)) represent the total number of interacted items (or users) of the corresponding user (or item). If there is at least one item that both users \( u_m \) and \( u_j \) have interacted with, then \( U^2_{m,j} \gt 0 \) (e.g., \( U^2_{1,2} = 1 \)); otherwise, \( U^2_{m,j} = 0 \) (e.g., \( U^2_{1,3} = 0 \)). The same rule goes for \( V^2 \). Hence, \( U^2_{m,j} \gt 0 \) indicates that there exists at least one item preferred by both \( u_m \) and \( u_j \), while \( V^2_{n,i} \gt 0 \) indicates that there exists at least one user who prefers both \( v_n \) and \( v_i \). For any user \( u_x \), if \( U^2_{m,x} \gt 0 \), then \( u_x \) belongs to \( \mathcal {N}_{u_m} \); otherwise, \( u_x \) belongs to \( \mathcal {F}_{u_m} \). Similar definitions go for \( \mathcal {N}_{v_n} \) and \( \mathcal {F}_{v_n} \).
    The training of our GCN-based self-supervised method is independent of the RL model and stops when the loss stabilizes. Moreover, for a specific batch of users and items, when their relations change, we can utilize Equation (3) to locally update the neural network and the embeddings of this batch. After propagating with L layers, we obtain \( {\bf \textrm {u}}^{(L)}_m \) and \( {\bf \textrm {v}}^{(L)}_n \) as the foundation for clustering and the construction of states and actions.
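    The user half of the loss in Equation (3) can be sketched as follows (PyTorch), with `H` the binary user-item interaction matrix used to form \( U^2 \); the item half and the \( \ell_2 \) regularizer are analogous and omitted, and the tensor shapes are assumptions for illustration.
```python
import torch

def user_selfsup_loss(user_emb, H, batch_users):
    """User part of the self-supervised loss in Equation (3), as a sketch.
    user_emb: (M, d) float tensor of user embeddings.
    H: (M, N) binary float tensor of user-item interactions.
    batch_users: iterable of user indices forming the batch U^b."""
    U2 = H @ H.T                          # (M, M): co-interaction counts
    loss = user_emb.new_zeros(())
    for m in batch_users:
        near_mask = U2[m] > 0             # users sharing at least one item
        far_mask = ~near_mask             # users sharing no item
        near_mask[m] = False              # exclude the user itself
        far_mask[m] = False
        near, far = user_emb[near_mask], user_emb[far_mask]
        loss = loss + ((user_emb[m] - near) ** 2).sum()   # pull nearby users closer
        loss = loss - ((user_emb[m] - far) ** 2).sum()    # push disparate users apart
    return loss
```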
    To better illustrate the superiority of integrating textual information, we pick a real user (with ID: A2P49WD75WHAG5) in the Amazon Digital Music dataset (one of the datasets used in the experiments) to illustrate how the text could benefit our recommendation process. As shown in Figure 3, the left scatter plot shows the distribution of the interacted items’ embeddings obtained by Matrix Factorization (MF) without textual information, and the right scatter plot shows the same items’ embeddings obtained by the Self-supervised Graph (SG) representation learning method with textual information (the category information). We utilize the widely used principal component analysis [12] to reduce the above embeddings to two dimensions. We can observe that the relative distance between 2,365 (“Gia”) and 2,607 (“Eighteen Visions”) is much shorter than that between 2,365 and 2,464 (“Fijacion Oral”) in the left scatter plot, while the opposite holds in the right scatter plot. By carefully examining the category information of these three items, we can conclude that 2,365 is more similar to 2,464 than to 2,607, and hence should be closer to 2,464 in the visualized embedding plot. This demonstrates the effectiveness of integrating textual information for better embedding learning.
    Fig. 3.
    Fig. 3. An example to illustrate how the text could benefit the recommendation process.

    3.4 Construction of the Candidate Set

    In our RL-based recommendation, the state is defined as a set of \( n_s \) items. For a model that recommends Top-k items at once, there are a total of \( A_{N-n_s}^k \) (note that \( A_{N-n_s}^k \) here denotes a permutation) possible actions. As the number of items (N) increases, the scale of the action space grows rapidly. Based on the assumption that a user’s preferences can be captured by a set of items the user likes and dislikes, we pick the positive and negative items to build a candidate set c. Additionally, to maintain generalization, we randomly add some ordinary items into c.
    Given a user \( u_i \) , according to the historical logs, if the corresponding rating is greater than a given bound \( y_b \) (e.g., \( y_b = 2 \) in a rating system with the highest rating 5), then the interacted record is regarded as positive; otherwise, it is negative. We use \( \mathcal {V}^p_{u_i} \) and \( \mathcal {V}^n_{u_i} \) to denote the set of items that are in \( u_i \) ’s positive and negative interacted records, respectively. For \( u_i \) , we sample positive items from \( \mathcal {V}^p_{u_i} \) , negative items from \( \mathcal {V}^n_{u_i} \) , and ordinary items by random. Since users usually skip the items they do not like, the negative items in \( \mathcal {V}^n_{u_i} \) are rare [24]. Based on the reverse thought of collaborative filtering, i.e., the more differences between two users, the more possible that the one’s likes are another’s dislikes, we classify users into several clusters by K-means [1] to supplement negative items. Specifically, as shown in Figure 4(a), we denote the set of items that appear in the positive interacted records of users in cluster l as \( \mathcal {V}^p_{cl_l} \) (user \( u_i \) belongs to cluster \( cl_l \) ) and use \( {cl_{l}^f} \) to denote the cluster that has the farthest distance from the current cluster \( cl_l \) . If the negative items in \( \mathcal {V}^n_{u_i} \) are not enough, then the rest negative items will be selected from \( \mathcal {V}_{neg} \leftarrow \mathcal {V}^p_{cl_{l}^f} - (\mathcal {V}^p_{cl_{l}^f} \cap (\mathcal {V}^p_{cl_{l}} \cup \mathcal {V}^n_{u_i})) \) . In this way, we can reduce the scale of the action space from \( M-n_s \) to \( n_c \) , where \( n_c \) is the number of items in the candidate set c.
    Fig. 4.
    Fig. 4. (a) An example illustrates the components of the candidate set; (b) The structure of TRGIR-DQN.
    Algorithm 1 shows the details of the construction of the candidate set, in which the positive items account for at most a fraction \( \alpha \) of \( n_c \) (\( \alpha \) is a hyper-parameter), and the negative and ordinary items each take \( 50\% \) of the remaining part of \( n_c \) (line 7). In the training phase, since constructing a candidate set involves only simple operations, such as random selection and merging, and the size of the candidate set is fixed, the time complexity of Algorithm 1 is constant. A sketch of this construction is given below.
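    The following sketch mirrors the construction described above under our reading of the proportions (at most a fraction \( \alpha \) of the \( n_c \) slots for positive items, the remainder split evenly between negative and ordinary items). The argument names are hypothetical and the sampling details are simplified relative to Algorithm 1.
```python
import random

def build_candidate_set(pos_items, neg_items, cluster_far_pos, all_items,
                        n_c=50, alpha=0.1):
    """Build a candidate set of n_c item ids: positives, negatives
    (supplemented from the farthest cluster's positives, V_neg),
    and randomly chosen ordinary items."""
    n_pos = min(int(alpha * n_c), len(pos_items))
    n_neg = (n_c - n_pos) // 2
    n_ord = n_c - n_pos - n_neg

    candidates = random.sample(pos_items, n_pos)
    # negatives: the user's own negative records first, then V_neg
    supplement = [v for v in cluster_far_pos
                  if v not in pos_items and v not in neg_items]
    neg_pool = neg_items + supplement
    candidates += random.sample(neg_pool, min(n_neg, len(neg_pool)))
    # ordinary items: random items not already chosen
    remaining = [v for v in all_items if v not in candidates]
    candidates += random.sample(remaining, min(n_ord, len(remaining)))
    random.shuffle(candidates)
    return candidates
```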

    3.5 Specific Implementations of TRGIR

    The goal of a typical reinforcement learning model is to learn a policy \( \pi \) that maximizes the discounted future reward, i.e., the Q-value, which is usually estimated by the Q-value function \( Q^{\pi }(\cdot) \). Combined with deep neural networks, there are many algorithms that approximate the Q-value function, which can be roughly categorized into three types: value-based (e.g., DQN [25], Double DQN [13]), policy-based (e.g., DPG [30]), and hybrid algorithms (e.g., DDPG [23]). Considering that value-based DQN [13, 25] is widely utilized in scenarios where the action space is discrete, we first implement TRGIR with DQN (TRGIR-DQN) to show its effectiveness.

    3.5.1 Implementation with DQN.

    The structure of TRGIR-DQN is shown in Figure 4(b), and we utilize the improved DQN (Double DQN) [13] as the RL model. To avoid overestimation and thus improve performance, Double DQN decouples the action selection from the target Q-value calculation.
    To make the action selection more reasonable, we introduce the features of items as input by concatenating the item vector of \( c^t_k \) with \( s_t \) to derive \( \phi _t \),
    \( \begin{equation} \phi _t = \phi \big (s_t, c^t_k\big) = {\bf \textrm {c}}^t_k \oplus s_t , \end{equation} \)
    (4)
    where \( \oplus \) denotes the vector concatenation operation, and \( {\bf \textrm {c}}^t_k \) denotes the vector of the tth item in the kth candidate set \( c_k \). The concatenation of state \( s_t \) and \( {\bf \textrm {c}}^t_k \) determines the action selection.
    In each timestep t, the Q-value network takes \( \phi _t \) as input. By a multi-layer perceptron (MLP) network, we learn two Q-values, which we term the recommendation action and the skip action, respectively. If the Q-value of the recommendation action is greater than that of the skip action, then we recommend item \( c^t_k \); otherwise, we skip it. As shown in Figure 4(b), after receiving \( {a}_{t} \), \( {s}_{t} \), and \( {c}^t_{k} \), the simulator returns the next state \( {s}_{t+1} \). Specifically, according to the Q-values, if we recommend item \( c^t_k \), then we put \( {c}^t_{k} \) at the head of \( {s}_{t} \) and select the top \( n_s \) items as \( {s}_{t+1} \); otherwise, we let \( {s}_{t+1} = {s}_{t} \).
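    A minimal sketch of one TRGIR-DQN decision step, assuming a hypothetical `q_net` that maps the concatenated vector \( \phi_t \) to the two Q-values (recommend and skip); the remaining argument names are illustrative.
```python
import torch

def dqn_step(q_net, state_vec, item_vec, state_items, item_id, n_s):
    """Concatenate the candidate item vector with the state (Equation (4)),
    compare the recommend/skip Q-values, and update the state accordingly."""
    phi = torch.cat([item_vec, state_vec])          # phi_t = c_k^t (+) s_t
    q_recommend, q_skip = q_net(phi)                # two Q-values
    if q_recommend > q_skip:
        next_items = ([item_id] + state_items)[:n_s]   # put the item at the head
        recommended = True
    else:
        next_items = state_items                        # s_{t+1} = s_t
        recommended = False
    return recommended, next_items
```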
    Algorithm 2 shows the training phase of TRGIR-DQN. By maximizing the cumulative discounted reward, the model parameters (\( \theta \) and \( \theta ^{\prime } \)) can be learned. Based on the assumption that similar users have similar preferences, our method classifies users and trains a model for each cluster. In the beginning, we randomly initialize \( \theta \) and the replay buffer \( \mathrm{B} \). After constructing a candidate set \( c_k \) for \( u_i \) in cluster \( cl_l \), the agent selects and executes an action according to an \( \epsilon \)-greedy policy. The TRGIR-DQN algorithm focuses on minimizing the gap between the current Q-value \( Q\left(s, a | \theta _{l}\right) \) and the expected Q-value \( z_j \), which is measured by the following loss:
    \( \begin{equation} L_{j}\left(\theta _{l}\right)=\mathbb {E}_{s_j, a_j \sim \rho (\cdot)}\left[\left(z_{j}-Q\left(\phi _j, a_j | \theta _{l}\right)\right)^{2}\right] , \end{equation} \)
    (5)
    where \( \rho (\cdot) \) is the action distribution, and the expected Q-value \( z_{j} \) can be defined as,
    \( \begin{equation} z_{j}=\left\lbrace \begin{array}{@{}lll} r_{j}+\gamma Q^{\prime } (\phi _{j+1}, \arg \max _{a_{j+1}} Q(\phi _{j+1}, a_{j+1}; \theta _{l}); \theta _{l}^{\prime }), & \text{if non-terminal}; \\ r_{j}, & \text{if terminal}. \end{array} \right. \end{equation} \)
    (6)
    Differentiating the loss function with respect to the weights, we arrive at the gradient as,
    \( \begin{equation} \begin{aligned}\nabla _{\theta _{l}} J = \frac{1}{N_b} \sum _{j} \mathbb {E}_{s_{j}, a_j \sim \rho (\cdot) ; s_{j+1} \sim \mathcal {E}} [ (r_{j}&+\gamma Q^{\prime }(\phi _{j+1}, \arg \max _{a_{j+1}} Q(\phi _{j+1}, a_{j+1}; \theta _{l}); \theta _{l}^{\prime })\\ & - Q(\phi _{j}, a_j | \theta _{l})) \nabla _{\theta _{l}} Q(\phi _{j}, a_j ; \theta _{l}) ]. \end{aligned} \end{equation} \)
    (7)
    Then, we update the target network parameters by soft updates [23] with rate \( \tau \). Finally, the training stops when the loss stabilizes. To obtain k items at once, we order the recommended items by the Q-value of the recommendation action.
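    For reference, a compact sketch of the Double DQN target of Equation (6) and the soft update of the target network (PyTorch, with hypothetical `q_net`/`target_net` modules that output the two Q-values for a given \( \phi \)):
```python
import torch

def double_dqn_target(q_net, target_net, phi_next, reward, done, gamma=0.9):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    a_star = torch.argmax(q_net(phi_next))                  # selection by Q
    return reward + gamma * target_net(phi_next)[a_star]    # evaluation by Q'

def soft_update(target_net, online_net, tau=0.01):
    """Soft update of the target parameters with rate tau [23]."""
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)
```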

    3.5.2 Implementation with DDPG.

    DDPG combines the advantages of DQN and DPG [30] and can concurrently learn a policy and \( Q^{\pi }(s_{t}, a_{t}) \) in high-dimensional, continuous action spaces by using neural function approximation [23]. To further improve performance and efficiency, we implement TRGIR with DDPG (TRGIR-DDPG). However, the action space of IRSs is discrete, which does not directly suit DDPG. To address this problem, inspired by DDPG-kNN [11], we propose the policy vector \( {\bf \textrm {p}}_t \), which represents the user’s preference and is dynamically learned by TRGIR-DDPG. By utilizing \( {\bf \textrm {p}}_t \) to generate discrete actions from the candidate set, we bridge the gap between the discrete actions of IRSs and the continuous actions of DDPG.
    For Top-k recommendation, TRGIR-DQN has to calculate Q-values for each item to decide whether to recommend it or not, which is still inefficient when the scale of the candidate set is large. Moreover, a large-scale action space also limits the exploration ability of TRGIR-DQN, which in turn affects the performance. To solve these problems, we utilize DDPG as the RL model, which can explore a large continuous space efficiently, to enhance the exploration ability and further improve the efficiency.
    The structure of TRGIR-DDPG is shown in Figure 5(a). During the embedding process, we have mapped the discrete actions into the continuous feature space, where each item is represented by a feature vector. Then, by taking the dot product between \( {\bf \textrm {p}}_t \) and the item vectors in \( c_t \), we can select actions from a discrete space. Figure 5(b) gives an example to help understand the policy vector. Suppose a user selects a movie according to preferences that can be expressed as explicit policies such as Prefer Detective Comics, Insensitive to genres, and Like Superman. With our method, a policy vector in the feature space, e.g., \( (0.7, 0.5, 0.1, 0.9) \), can be learned, where the value of each dimension represents how much emphasis this user places on that dimension. By taking the dot product between the policy vector and the item vectors, we finally choose the movie Superman Returns, with the highest score of 2.79, for recommendation (assuming Top-1 recommendation here).
    Fig. 5.
    Fig. 5. (a) The Structure of TRGIR-DDPG; (b) An example for illustrating policy vector for TRGIR-DDPG.
    In each timestep t, the actor network takes a state \( s_t \) as input. By an MLP network, we learn a continuous vector, which we term the policy vector, denoted by \( {\bf \textrm {p}}_t \). The critic network takes state \( s_t \) and policy vector \( {\bf \textrm {p}}_t \) as input. By another MLP network, it learns the current Q-value to evaluate \( {\bf \textrm {p}}_t \). As illustrated in Figure 5(b), \( {\bf \textrm {p}}_t \) represents a user’s preferences in the feature vector space; it is a continuous weight vector that measures the importance of each dimension. Combining it with the candidate set \( c_t \), we can get the \( n_a \) items with the highest scores, where the score of each item \( v_i \) is denoted by \( Score(v_i) \) and,
    \( \begin{equation} Score({{v}_{i}})= {\bf \textrm {p}}_t^T {\bf \textrm {v}}_i . \end{equation} \)
    (8)
    As shown in Figure 5(b), TRGIR-DDPG generates \( {s}_{t+1} \) in a sliding-window manner. Specifically, among the ordered items in \( {a}_{t} \), we keep the order and select the items that are not in \( {s}_{t} \) as \( {a}^{\prime }_{t} \). Then, we put \( {a}^{\prime }_{t} \) at the head of \( {s}_{t} \) and select the top \( n_s \) items as \( {s}_{t+1} \). Moreover, to cover the action space to a large extent, the candidate set is randomly generated at each timestep.
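    The scoring of Equation (8) and the sliding-window state update can be sketched as follows (NumPy; the variable names are illustrative):
```python
import numpy as np

def select_topk(policy_vec, candidate_vecs, candidate_ids, n_a):
    """Score every candidate item with the policy vector (Equation (8)) and
    return the n_a highest-scoring item ids in descending order."""
    scores = candidate_vecs @ policy_vec            # Score(v_i) = p_t^T v_i
    order = np.argsort(-scores)[:n_a]
    return [candidate_ids[i] for i in order]

def next_state(state_items, action_items, n_s):
    """Sliding-window update: keep the order of the recommended items not
    already in the state, put them at the head, and truncate to n_s."""
    fresh = [v for v in action_items if v not in state_items]
    return (fresh + state_items)[:n_s]
```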
    The training phase (as shown in Algorithm 3) learns the model parameters \( \theta ^{Q} \), \( \theta ^{\mu } \), \( \theta ^{Q^{\prime }} \), and \( \theta ^{\mu ^{\prime }} \) by maximizing the cumulative discounted reward of all the decisions. As mentioned before, TRGIR-DDPG also trains a model for each cluster. At the beginning of the training phase, we randomly initialize the network parameters and the replay buffer \( \mathrm{B} \). For better action exploration, we initialize a random process \( \mathcal {N} \), which adds some uncertainty when generating \( {\bf \textrm {p}} \). The critic network focuses on minimizing the gap between the current Q-value \( Q(s_{j}, {\bf \textrm {p}}_j | \theta ^{Q}) \) and the expected Q-value \( z_j \), which is measured by the following loss:
    \( \begin{equation} L(\theta ^Q_{l}) = \frac{1}{N_b} \sum _{j}\big (z_j - Q\big (s_{j}, {\bf \textrm {p}}_j | \theta ^{Q}_{l}\big)\big)^{2} , \end{equation} \)
    (9)
    where \( z_j \) can be expressed in a recursive manner by using the Bellman equation,
    \( \begin{equation} z_j = r_{j}+\gamma Q^{\prime }\big (s_{j+1}, \mu ^{\prime }\big (s_{j+1} | \theta ^{\mu ^{\prime }}_{l}\big) | \theta ^{Q^{\prime }}_{l}\big). \end{equation} \)
    (10)
    The objective of the actor network is to optimize the policy vector \( {\bf \textrm {p}} \) through maximizing the Q-value. The actor network is trained by the sampled policy gradient:
    \( \begin{equation} \nabla _{\theta ^{\mu }_{l}} J \approx \frac{1}{N_b} \sum _{j} \nabla _{{\bf \textrm {p}}} Q\big (s, {\bf \textrm {p}} | \theta ^{Q}_{l}\big)|_{s=s_{j}, {\bf \textrm {p}} =\mu (s_{j})} \nabla _{\theta ^{\mu }_{l}} \mu \big (s | \theta ^{\mu }_{l}\big)|_{s_{j}} . \end{equation} \)
    (11)
    For TRGIR-DDPG, we also update the parameters of the target actor and the target critic network by soft updates [23] with rate \( \tau \). Finally, the training phase stops when the loss stabilizes.
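    A condensed sketch of the critic and actor objectives in Equations (9) to (11), assuming hypothetical `actor`, `critic`, and target networks and a sampled mini-batch of tensors; this is an illustration of the standard DDPG updates rather than the exact implementation.
```python
import torch

def ddpg_losses(actor, critic, target_actor, target_critic, batch, gamma=0.9):
    """Critic loss (Eq. (9)) with the Bellman target (Eq. (10)) and the
    actor objective whose gradient corresponds to Eq. (11).
    `batch` holds tensors s, p, r, s_next sampled from the replay buffer."""
    s, p, r, s_next = batch["s"], batch["p"], batch["r"], batch["s_next"]
    with torch.no_grad():                            # Bellman target
        z = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = ((z - critic(s, p)) ** 2).mean()   # mean squared TD error
    actor_loss = -critic(s, actor(s)).mean()         # ascend the Q-value
    return critic_loss, actor_loss
```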
    Note that in our implementations with DQN and DDPG, to avoid insufficient training, we set a minimum training-step threshold based on the size of the buffer \( \mathrm{B} \); the training stops only when the number of steps exceeds this minimum threshold and the loss stabilizes. Moreover, we also set a maximum training-step threshold based on the size of \( \mathrm{B} \); when the number of steps exceeds this maximum threshold, the training stops.

    3.6 Environment Simulator

    It is expensive and time-consuming to utilize a real interactive environment for training RL models. Following several previous works [5, 38], we build the environment simulator based on historical interactions. For a user \( u_i \) at timestep t, the simulator receives the present state \( {s}_{t} \) and action \( {a}_{t} \), then returns the reward \( {r}_{t} \) and the next state \( {s}_{t+1} \). In this way, the reward function can be written as \( \mathcal {R}\left(s_t, {a}_{t}\right) \), and the definition of \( \mathcal {R}\left(\cdot \right) \) differs for different RL models.
    At each timestep t, TRGIR-DQN only recommends at most one item to user \( u_i \) . The reward \( r_t \) of TRGIR-DQN is defined as,
    \( \begin{equation} r_t = \mathcal {R}\left(s_t, {a}_{t}\right)=\left\lbrace \begin{array}{@{}ll} y^*_{i,j}, & \text{if}~~ recommend~~ item~~ v_j; \\ 0, &\text{otherwise}, \end{array} \right. \end{equation} \)
    (12)
    where \( y^*_{i,j} \) is the adjusted rating of \( u_i \) on \( v_j \) . To give proper rewards for different types of items, \( y^*_{i,j} \) is designed as follows:
    \( \begin{equation} y^*_{i,j}=\left\lbrace \begin{array}{@{}llll} y_{i,j} - y_b, & \text{if}~~ v_j \in \mathcal {V}^p_{u_i}; \\ y_{i,j} - y_b -1, &\text{if}~~ v_j \in \mathcal {V}^n_{u_i}; \\ -0.5, &\text{if}~~ v_j \in \mathcal {V}_{neg} ; \\ 0, & \text{otherwise}. \end{array} \right. \end{equation} \)
    (13)
    Recall here \( y_{i,j} \) is the initial rating of \( u_i \) on \( v_j \) , and \( y_b \) is the rating bound. By this formula, positive items will get positive feedback, and negative items will get negative feedback. Moreover, the supplemented negative items will get half of the minimum negative feedback, i.e., \( -0.5 \) , while the other items will get feedback of 0.
    TRGIR-DDPG recommends \( n_a \) items to user \( u_i \) at each timestep. Its reward function not only guides the model to capture users’ preferences, but also evaluates the rank quality of the recommended items. Specifically, the reward \( r_t \) is determined by two values, \( {w}_{k} \) and \( y^*_{i,j} \),
    \( \begin{equation} r_t = \mathcal {R}\left(s_t, {a}_{t}\right)=\sum _{k=1}^{n_a} {w}_{k} \times y^*_{i,j} , \end{equation} \)
    (14)
    where \( w_{k} \) is the ranking weight of the items in \( a_t \) . Inspired by DCG [17, 38], the ranking weight is calculated by
    \( \begin{equation} {w}_{k} = 1/\log _{2}(k+1). \end{equation} \)
    (15)
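    The simulator rewards of Equations (12) to (15) reduce to a few lines; the sketch below assumes the item type (positive, negative, supplemented negative, or other) has already been determined, and the function names are illustrative.
```python
import math

def adjusted_rating(rating, y_b, item_type):
    """Adjusted rating y*_{i,j} of Equation (13). `item_type` is one of
    'pos', 'neg', 'supp_neg' (supplemented negative), or 'other'."""
    if item_type == "pos":
        return rating - y_b
    if item_type == "neg":
        return rating - y_b - 1
    if item_type == "supp_neg":
        return -0.5
    return 0.0

def ddpg_reward(adjusted_ratings):
    """TRGIR-DDPG reward (Equations (14)-(15)): rank-weighted sum of the
    adjusted ratings of the n_a recommended items, with DCG-style weights."""
    return sum(y / math.log2(k + 1)
               for k, y in enumerate(adjusted_ratings, start=1))
```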
    Note that the methods of generating \( {s}_{t+1} \) are different in TRGIR-DQN and TRGIR-DDPG, as detailed in the above specific implementations sections.

    4 Experiments and Results

    4.1 Experimental Settings

    In this section, to demonstrate the effectiveness of the proposed method, we first introduce the experimental settings and then present and discuss the experimental results from the perspective of both performance and efficiency to answer the following research questions:
    RQ1: How do the methods that implement our TRGIR framework with DQN and DDPG perform as compared with other state-of-the-art methods?
    RQ2: How is the recommendation sparsity problem alleviated by utilizing the textual information in different ways?
    RQ3: How does the efficiency benefit from the candidate action set and the policy vector?
    RQ4: How do the key hyper-parameters (e.g., the dimension of feature space, the number of clusters, the size of candidate set) affect the performance?
    Note that, when analyzing one factor, we keep the others fixed. The default settings are: the input dimension of GCN \( n_{in} \) is 100, the output dimension of GCN \( n_{out} \) is 64, the depth of the GCN propagation layer \( n_{gcn} \) is 2, the number of clusters \( n_{cl} \) is 5, the size of the candidate set \( n_c \) is 50, the rate of positive items \( \alpha \) is 0.1, the size of the state \( n_s \) is 20, and the size of the action \( n_a \) is 10 (but for TRGIR-DQN, \( n_a \) is fixed to 2). We have implemented our framework with DQN and DDPG, and the code can be accessed on GitHub.3

    4.1.1 Datasets.

    Leskovec et al. [21] collected and categorized a variety of Amazon products and built several datasets4 including ratings, descriptions, and comments. We evaluate our models on three publicly available Amazon datasets: Digital Music (Music for short), Beauty, and Clothing Shoes and Jewelry (Clothing for short), each of which has at least five comments per product. Table 1 shows the statistical details of the datasets we used.
| Dataset | #Users | #Items | #Positive Ratings | #Negative Ratings | Sparsity | Size of Des. | Size of Com. |
|---|---|---|---|---|---|---|---|
| Music | 5,541 | 3,568 | 58,905 | 5,801 | 0.9967 | 2,338 KB | 65,758 KB |
| Beauty | 22,363 | 12,101 | 176,520 | 21,982 | 0.9993 | 5,735 KB | 83,251 KB |
| Clothing | 39,387 | 23,033 | 252,022 | 26,655 | 0.9997 | 3,960 KB | 80,208 KB |

    Table 1. Statistics of Datasets

    4.1.2 Baseline Methods.

    We compare TRGIR-DQN and TRGIR-DDPG with eight methods, where ItemPop is a conventional recommendation method, DMF is an MF-based method with neural networks, ANR is a neural recommendation method that leverages textual information, Caser and SASRec are time-related deep learning–based methods, LinearUCB is a Multi-Armed Bandit (MAB)-based method, and D-kNN and TPGR are both RL-based methods.
    ItemPop recommends the most popular items from currently available items to the user. This method is non-personalized and is often used as a benchmark for recommendations.
    DMF [42] is a matrix factorization model using deep neural networks. Specifically, it utilizes two distinct MLPs to map the users and items into a common low-dimensional space.
    ANR [8] uses an attention mechanism to focus on the relevant parts of comments and estimates aspect-level user and item importance in a joint manner.
    Caser [33] embeds a sequence of recent items into an image and learns sequential patterns as local features of the image by using convolutional filters.
    SASRec [18] is a self-attention-based sequential model for next item recommendation. It models the entire user sequence and adaptively considers consumed items for prediction.
    LinearUCB [22] is a contextual-bandit recommendation approach that adopts a linear model to estimate the upper confidence bound for each arm.
    D-kNN [11] addresses the large discrete action space problem by combining DDPG with an approximate kNN method.
    TPGR [5] builds a balanced hierarchical clustering tree and formulates picking an item as seeking a path from the root to a certain leaf of the tree.
    Note that for D-kNN, a larger k (i.e., the number of nearest neighbors) results in better performance but poorer efficiency. For a fair comparison, we set k to \( 0.1M \) and M (M is the number of items), respectively.

    4.1.3 Evaluation Metrics and Methodology.

    Methods that achieve their goals by Top-k recommendation are typically evaluated with metrics such as Hit Ratio (HR) [42], Precision [33, 50], Recall [33], F1 [5], and normalized Discounted Cumulative Gain (nDCG) [18, 38, 48, 50]. To cover as many aspects of Top-k recommendation as possible, we chose HR@k, F1@k, and nDCG@k as the evaluation metrics.
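    For completeness, minimal sketches of two of these metrics under binary relevance (our reading; F1@k follows from precision and recall in the usual way):
```python
import math

def hr_at_k(ranked_items, positives, k):
    """Hit Ratio@k: 1 if at least one positive item appears in the top k."""
    return float(any(v in positives for v in ranked_items[:k]))

def ndcg_at_k(ranked_items, positives, k):
    """nDCG@k with binary relevance: the DCG of the ranked list divided by
    the DCG of an ideal ranking of the positives."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, v in enumerate(ranked_items[:k]) if v in positives)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(positives))))
    return dcg / ideal if ideal > 0 else 0.0
```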
    The test data was constructed during data preparation, and all the evaluated methods were tested on this data. We now describe the test method in detail: for each user, we first classify the user’s history logs into positive and negative ones and sort the items in the positive history logs by timestamp. Then, we choose the last \( 10\% \) of the ordered items in the positive logs as positive items. Finally, the negative items are randomly selected from the cluster that is farthest from the one the current user belongs to. Based on this strategy, the recommendation methods (except TPGR, which only recommends one item in each episode) can generate a ranked Top-k list to evaluate the metrics mentioned above.

    4.2 Comparison and Analysis (RQ1)

    Table 2 shows the summarized results of our experiments on the three datasets in terms of six metrics, including HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20. Note that, since TPGR is not suitable for Top-k recommendation, we did not include it as a competitor when evaluating the recommendation performance. From the results, we have the following key observations:
| Dataset | Metric | ItemPop | DMF | ANR | Caser | SASRec | LinearUCB | D-kNN (k=0.1M) | D-kNN (k=M) | TRGIR-DQN | TRGIR-DDPG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Music | HR@10 | 0.2447 | 0.3318 | 0.4980 | 0.8097 | 0.8897 | 0.3201 | 0.3274 | 0.3436 | 0.8037 | 0.9886 |
| Music | F1@10 | 0.0454 | 0.0621 | 0.1128 | 0.1676 | 0.1910 | 0.0631 | 0.0648 | 0.0692 | 0.1844 | 0.2304 |
| Music | nDCG@10 | 0.1101 | 0.1569 | 0.2756 | 0.5351 | 0.6212 | 0.1462 | 0.1527 | 0.1617 | 0.7719 | 0.9436 |
| Music | HR@20 | 0.4889 | 0.5885 | 0.7097 | 0.9090 | 0.9635 | 0.5747 | 0.5838 | 0.6001 | 0.8048 | 0.9935 |
| Music | F1@20 | 0.0525 | 0.0626 | 0.1084 | 0.1048 | 0.1151 | 0.0638 | 0.0647 | 0.0676 | 0.1003 | 0.1251 |
| Music | nDCG@20 | 0.1716 | 0.2210 | 0.3252 | 0.5542 | 0.6325 | 0.2095 | 0.2171 | 0.2258 | 0.7722 | 0.9446 |
| Beauty | HR@10 | 0.2551 | 0.2734 | 0.4550 | 0.6125 | 0.6823 | 0.3219 | 0.2585 | 0.2772 | 0.7278 | 0.8845 |
| Beauty | F1@10 | 0.0482 | 0.0502 | 0.0990 | 0.1218 | 0.1386 | 0.0614 | 0.0489 | 0.0519 | 0.1463 | 0.1798 |
| Beauty | nDCG@10 | 0.1134 | 0.1249 | 0.2252 | 0.3939 | 0.4569 | 0.1447 | 0.1170 | 0.1258 | 0.6213 | 0.6949 |
| Beauty | HR@20 | 0.5278 | 0.5273 | 0.6993 | 0.7817 | 0.8330 | 0.5911 | 0.5142 | 0.5377 | 0.7349 | 0.9501 |
| Beauty | F1@20 | 0.0543 | 0.0529 | 0.1006 | 0.0826 | 0.0907 | 0.0613 | 0.0529 | 0.0547 | 0.0782 | 0.1024 |
| Beauty | nDCG@20 | 0.1817 | 0.1885 | 0.2850 | 0.4344 | 0.4942 | 0.2122 | 0.1809 | 0.1910 | 0.6230 | 0.7104 |
| Clothing | HR@10 | 0.2265 | 0.2393 | 0.3421 | 0.5060 | 0.5817 | 0.2500 | 0.2541 | 0.2768 | 0.6593 | 0.7544 |
| Clothing | F1@10 | 0.0413 | 0.0437 | 0.0663 | 0.0934 | 0.1084 | 0.0458 | 0.0467 | 0.0510 | 0.1222 | 0.1405 |
| Clothing | nDCG@10 | 0.1033 | 0.1041 | 0.1622 | 0.2900 | 0.3525 | 0.1130 | 0.1131 | 0.1242 | 0.4577 | 0.4865 |
| Clothing | HR@20 | 0.4964 | 0.5044 | 0.6008 | 0.7196 | 0.7655 | 0.5041 | 0.5043 | 0.5293 | 0.7288 | 0.8973 |
| Clothing | F1@20 | 0.0482 | 0.0488 | 0.0659 | 0.0702 | 0.0758 | 0.0489 | 0.0490 | 0.0517 | 0.0711 | 0.0881 |
| Clothing | nDCG@20 | 0.1706 | 0.1704 | 0.2264 | 0.3427 | 0.3968 | 0.1756 | 0.1757 | 0.1874 | 0.4754 | 0.5225 |

    Table 2. Overall Recommendation Performance (ItemPop through D-kNN are compared methods; TRGIR-DQN and TRGIR-DDPG are ours)
    Best performance is in boldface and second best is underlined.
Compared with ItemPop, the methods that utilize deep neural networks are more effective. Moreover, the text-based method ANR consistently outperforms DMF, which uses only interaction information for embedding; this demonstrates the importance of utilizing textual information to alleviate the negative effects of data sparsity.
For IRSs, user preferences are inherently time-dependent. Caser, which learns sequential patterns with convolutional filters, and SASRec, which uses self-attention to model the entire sequence, improve the performance of IRSs dramatically.
The interactive methods, LinearUCB and D-kNN, perform similarly and only outperform ItemPop. LinearUCB is a traditional MAB-based method that does not plan for the long term explicitly. As for D-kNN, considering only the distance while ignoring the importance of each latent dimension to the user causes it to miss suitable items.
Our method TRGIR-DDPG achieves the best performance and obtains remarkable improvements over the state-of-the-art methods, which demonstrates the effectiveness of combining the candidate action set and the policy vector to address the large discrete action space problem. TRGIR-DQN also performs well but is inferior to TRGIR-DDPG; the gap may be due to the stronger exploration ability provided by the policy vector.

    4.3 Utilizing Textual Information (RQ2)

To examine whether this strong performance can be retained when textual information is not leveraged, or is utilized differently, we compare our TRGIR framework with the deep Reinforcement learning framework using Matrix-factorization representation for Interactive Recommendation (RMIR) [20] and the Text-based deep Reinforcement learning framework using Sum-average word representation for Interactive Recommendation (TRSIR) [34]. As with TRGIR, we implement both RMIR and TRSIR with DQN and with DDPG. The results on the three datasets (arranged in increasing order of scale and sparsity) in terms of HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20 are shown in Table 3.
Table 3.

| Dataset | Metric | RMIR-DQN | TRSIR-DQN | TRGIR-DQN | RMIR-DDPG | TRSIR-DDPG | TRGIR-DDPG |
|---|---|---|---|---|---|---|---|
| Music | HR@10 | 0.7164 | 0.7944 | 0.8037 | 0.8717 | 0.9455 | 0.9886 |
| Music | F1@10 | 0.1603 | 0.1805 | 0.1844 | 0.1920 | 0.2124 | 0.2304 |
| Music | NDCG@10 | 0.6843 | 0.7204 | 0.7719 | 0.6478 | 0.7152 | 0.9436 |
| Music | HR@20 | 0.7183 | 0.7965 | 0.8048 | 0.9487 | 0.9760 | 0.9935 |
| Music | F1@20 | 0.0865 | 0.0987 | 0.1003 | 0.1145 | 0.1203 | 0.1251 |
| Music | NDCG@20 | 0.6845 | 0.7207 | 0.7722 | 0.6643 | 0.7219 | 0.9446 |
| Beauty | HR@10 | 0.6553 | 0.6685 | 0.7278 | 0.6258 | 0.7449 | 0.8845 |
| Beauty | F1@10 | 0.1305 | 0.1306 | 0.1463 | 0.1267 | 0.1463 | 0.1798 |
| Beauty | NDCG@10 | 0.5165 | 0.5121 | 0.6213 | 0.4025 | 0.4896 | 0.6949 |
| Beauty | HR@20 | 0.6613 | 0.7050 | 0.7349 | 0.8156 | 0.8909 | 0.9501 |
| Beauty | F1@20 | 0.0698 | 0.0731 | 0.0782 | 0.0874 | 0.0936 | 0.1024 |
| Beauty | NDCG@20 | 0.5178 | 0.5209 | 0.6230 | 0.4486 | 0.5244 | 0.7104 |
| Clothing | HR@10 | 0.3953 | 0.6251 | 0.6593 | 0.3290 | 0.6622 | 0.7544 |
| Clothing | F1@10 | 0.0726 | 0.1157 | 0.1222 | 0.0602 | 0.1226 | 0.1405 |
| Clothing | NDCG@10 | 0.2655 | 0.4545 | 0.4577 | 0.1647 | 0.3917 | 0.4865 |
| Clothing | HR@20 | 0.4680 | 0.6886 | 0.7288 | 0.5805 | 0.8545 | 0.8973 |
| Clothing | F1@20 | 0.0453 | 0.0671 | 0.0711 | 0.0562 | 0.0835 | 0.0881 |
| Clothing | NDCG@20 | 0.2838 | 0.4704 | 0.4754 | 0.2273 | 0.4398 | 0.5225 |

Table 3. The Comparison of Embedding Methods (RMIR, TRSIR, and TRGIR) on Specific Implementations
From Table 3, it is clear that, whether implemented with DQN or DDPG, utilizing textual information alleviates the data sparsity problem. Meanwhile, the performance improvement grows with data scale and sparsity, which justifies the effectiveness of TRSIR and TRGIR, both of which leverage textual information in RL-based recommendation, especially for large-scale, highly sparse datasets. Further, regarding the representations of users and items, the comparison between TRSIR and TRGIR demonstrates that the self-supervised GCN embedding method is much more powerful than the simple sum-average word vector operation.
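For reference, the "simple sum-average word vector operation" can be pictured with a short sketch like the one below; the GloVe file format, tokenization, and stopword handling shown here are our own simplifications rather than TRSIR's exact implementation.

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors into a {word: vector} dict.
    (The 'word v1 v2 ...' one-line-per-word format is an assumption.)"""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sum_average_embedding(text, glove, stopwords, dim=300):
    """TRSIR-style document vector: average the GloVe vectors of the
    non-stopword tokens of a user's or item's associated text."""
    tokens = [w for w in text.lower().split() if w in glove and w not in stopwords]
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([glove[w] for w in tokens], axis=0)
```

Such an averaged vector ignores the relational structure among users, items, and texts, which is exactly what the GCN-based embedding exploits.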
Moreover, to examine the influence of the settings in our GCN-based self-supervised embedding method, we conduct an ablation study on the three datasets for TRGIR-DQN and TRGIR-DDPG in all metrics. Since the performance under different datasets, metrics, and implementations exhibits similar trends, we only present the performance of TRGIR-DDPG with the default setting (red bar) on Music in terms of HR@10 and F1@10, as shown in Figure 6. Note that in the two sub-graphs of Figure 6, the left three blue bars show the performance of TRGIR-DDPG with different relation settings. Specifically, without textual information (W/O Text) only contains user-item relations, while without descriptions (W/O Des.) contains user-item, user-comment, and item-comment relations. Because the comments are related to both users and items, we cannot remove either side separately; thus, without comments (W/O Com.) contains only user-item and item-description relations. We find that the performance degrades greatly when the method only contains user-item relations, and the model with all the relations (Default) obtains the best performance. Moreover, the item-description relations are more important than the comment-related relations for our model. The reason might be that our document-level natural language processing method introduces more noise in comments than in descriptions. The green bar in Figure 6 shows that without self-connection (W/O Self-con.), our model performs worse, demonstrating the effectiveness of self-connections. To keep the linear substructures and simplify our model, as mentioned in Section 3.3, we removed the activation function in our self-supervised embedding method. The orange bar in Figure 6 shows that our model performs better than the variant with an activation function (W/ Active-Fun.), which verifies the rationality of this design choice.
    Fig. 6.
    Fig. 6. Ablation experiments: Performance of TRGIR-DDPG on Music w.r.t. (a) HR@10; (b) F1@10.
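To make the "W/ Active-Fun." comparison above concrete, the toy sketch below contrasts stacked graph propagation with and without a nonlinearity; the normalization, weight shapes, and layer count are placeholders and do not reproduce the exact formulation of Section 3.3.

```python
import numpy as np

def propagate(adj_norm, features, weights, use_activation=False):
    """Stack propagation layers H^{l+1} = A_norm @ H^l @ W^l,
    optionally inserting a ReLU (the 'W/ Active-Fun.' variant).
    adj_norm is assumed to already include self-connections if used."""
    h = features
    for w in weights:
        h = adj_norm @ h @ w
        if use_activation:          # ablated variant only
            h = np.maximum(h, 0.0)  # ReLU nonlinearity
    return h

# Toy example: 4 nodes, 300-d GloVe-style inputs, two 16-d layers (shapes assumed).
rng = np.random.default_rng(0)
adj = rng.random((4, 4))
adj_norm = adj / adj.sum(axis=1, keepdims=True)      # simple row normalization
feats = rng.standard_normal((4, 300))
weights = [rng.standard_normal((300, 16)), rng.standard_normal((16, 16))]
emb_linear = propagate(adj_norm, feats, weights, use_activation=False)  # default
emb_nonlin = propagate(adj_norm, feats, weights, use_activation=True)   # ablation
```

Dropping the activation keeps each layer a linear map, so the stacked propagation preserves the linear substructures mentioned above.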
It is noteworthy that our way of introducing text may also introduce noise or irrelevant information. To alleviate this problem, we first utilize the Long Stopword List to filter out meaningless words (Section 3.3). Moreover, in the pre-trained GloVe [27], since the vector distance between words with similar meanings is much smaller than that between words with different meanings, the influence of noise can be reduced to some extent.

    4.4 Time Comparison (RQ3)

In this section, we compare the efficiency of the RL-based models from two aspects: the time consumed per training step (i.e., one model update) and the time consumed per decision, both measured in milliseconds. Table 4 presents the comparison of the time cost on the Beauty dataset. The time cost orderings on the other datasets exhibit similar trends to those on Beauty and are thus omitted here. The values in Table 4 are averaged measurements. For a fair comparison, both \( n_s \) and \( n_a \) are set to 1 (except that \( n_a \) is set to 2 for TRGIR-DQN), and the other hyper-parameters keep their default values. The experiments are conducted on the same machine with a 6-core CPU (i7-6850K, 3.6 GHz) and 64 GB RAM.
Table 4.

| Time Cost (ms) | TRGIR-DQN (\( n_c=500 \)) | TRGIR-DQN | DDPG-kNN (\( k=0.1M \)) | DDPG-kNN (\( k=M \)) | TPGR | TRGIR-DDPG |
|---|---|---|---|---|---|---|
| Per training step | 6.01 | 6.01 | 5.26 | 5.51 | 4.98 | 3.13 |
| Per decision | 621.82 | 80.65 | 7.86 | 56.42 | 5.06 | 0.9 |

Table 4. Time Comparison for Training and Decision-making
As shown in Table 4, the per-training-step time gap among the RL-based methods is not large. However, for the time cost of decision-making, the large discrete action space makes most RL-based recommendation methods inefficient. The value-based method TRGIR-DQN (\( n_c=500 \)), which has to calculate a Q-value for every candidate item, runs much slower than the other models, let alone at a realistic scale far larger than 500 items. By narrowing the action candidate set to 50 (\( n_c=50 \) is the default setting, which also performs better than \( n_c=500 \) in Figure 8(b)), the decision-making efficiency of TRGIR-DQN improves greatly. D-kNN also runs slowly (especially when \( k= \) M), because discovering nearest neighbors as actions in a large discrete action space has high time complexity. TPGR reduces the decision-making time significantly by constructing a clustering tree, but as mentioned before, it only supports Top-1 recommendation. Compared to the other methods, by using the action candidate set and the policy vector, TRGIR-DDPG achieves a significant improvement in execution efficiency.
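The scaling behavior behind these numbers can be illustrated with a toy measurement of our own (not the benchmark used for Table 4): selecting an action by scoring either the full item set or only an \( n_c \)-item candidate set against a policy vector. The sizes are arbitrary and a plain dot product stands in for the actual actor and critic networks.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_items, n_c, dim = 100_000, 50, 16                    # illustrative sizes only
item_vecs = rng.standard_normal((n_items, dim)).astype(np.float32)
candidate_ids = rng.choice(n_items, size=n_c, replace=False)
policy_vector = rng.standard_normal(dim).astype(np.float32)

def pick(vecs):
    """Select the action whose embedding best matches the policy vector."""
    return int(np.argmax(vecs @ policy_vector))

# Single-run timings, for illustration of the O(M) vs. O(n_c) gap only.
t0 = time.perf_counter(); pick(item_vecs);                t_full = time.perf_counter() - t0
t0 = time.perf_counter(); pick(item_vecs[candidate_ids]); t_cand = time.perf_counter() - t0
print(f"full action space: {t_full*1e3:.3f} ms, candidate set: {t_cand*1e3:.3f} ms")
```

Scoring only \( n_c \) candidates keeps the per-decision cost roughly constant regardless of how large the full item catalog grows.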

    4.5 Hyper-parameter Sensitivity (RQ4)

We select several important hyper-parameters and analyze their effects on the performance of TRGIR-DQN and TRGIR-DDPG. Note that we conducted these experiments on all the datasets with the six metrics mentioned above, and the results show that our approach exhibits similar performance trends on all of them. For simplicity, we only present the results on the Beauty dataset in terms of HR@10 and nDCG@10. When testing one hyper-parameter, we keep the others fixed at their default settings. From Figures 7 to 9, we can see that the two performance metrics for TRGIR-DQN and TRGIR-DDPG exhibit similar trends, so the following analyses apply to both of them:
    Fig. 7.
    Fig. 7. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the input dimension of GCN; (b) the output dimension of GCN; (c) the depth of GCN propagation layer.
The Input Dimension of GCN (\( n_{in} \)). Note that the input dimension of GCN, \( n_{in} \), is equal to the dimension of the pre-trained word vectors. Since the embedding initialization relies on these word vectors, \( n_{in} \) reflects the richness of the textual information. As shown in Figure 7(a), as \( n_{in} \) increases, TRGIR-DQN and TRGIR-DDPG perform better, as expected.
The Output Dimension of GCN (\( n_{out} \)). The output dimension of GCN, \( n_{out} \), is equal to the final vector dimension of users and items. Figure 7(b) shows that, as \( n_{out} \) increases, the performance of our methods remains stable. This is mainly because the useful knowledge can already be captured within 16 dimensions.
The Depth of the GCN Propagation Layer (\( n_{gcn} \)). Figure 7(c) shows that TRGIR-DDPG achieves the best performance when the depth \( n_{gcn} \) is 3, and TRGIR-DQN achieves the best performance when \( n_{gcn} \) is 2. The reason might be that increasing the number of propagation layers aggregates knowledge from more nodes, but too high-order propagation may cause the over-smoothing problem, which in turn degrades the performance.
The Number of Clusters (\( n_{cl} \)). As shown in Figure 8(a), as \( n_{cl} \) increases, the performance first rises and then falls. More clusters mean larger differences between the current cluster and the one that provides negative samples, which improves their quality. However, too many clusters may also cause a shortage of effective samples.
The Size of the Candidate Set (\( n_c \)). Figure 8(b) shows that the performance decreases as \( n_c \) increases. This is mainly because, for training samples, the items from \( \mathcal {V}^p_{u_i} \) are much fewer than the items from \( \mathcal {V}^p_{cl_l^f} \); increasing \( n_c \) therefore causes imbalanced sampling, which leads to worse performance.
    Fig. 8.
    Fig. 8. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the number of clusters; (b) the size of candidate; (c) the rate of positive items.
The Rate of Positive Items (\( \alpha \)). As shown in Figure 8(c), as \( \alpha \) increases, the performance first grows and then remains stable. This is because increasing \( \alpha \) introduces more positive items, which helps capture the user's interests better. However, since \( n_{pos} \le |\mathcal {V}^p_{u_i}| \) (see Algorithm 1), once \( \alpha \) is large enough, further growth no longer affects \( n_{pos} \).
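For illustration, the following hedged sketch shows one way a candidate set of size \( n_c \) could mix positives from the user's own history \( \mathcal {V}^p_{u_i} \) with items from the cluster-level pool \( \mathcal {V}^p_{cl_l^f} \), capping the number of positives by the history size as the constraint \( n_{pos} \le |\mathcal {V}^p_{u_i}| \) suggests; it does not reproduce the exact sampling rules of Algorithm 1, and the function and parameter names are ours.

```python
import random

def build_candidate_set(user_pos_items, cluster_pool_items, n_c=50, alpha=0.4, seed=0):
    """Mix positives from the user's own history with items from the
    cluster-level pool; the number of positives is capped by the history
    size, mirroring n_pos <= |V^p_{u_i}| (Algorithm 1's exact rules differ)."""
    rng = random.Random(seed)
    n_pos = min(int(alpha * n_c), len(user_pos_items))
    positives = rng.sample(user_pos_items, n_pos)
    remaining = [v for v in cluster_pool_items if v not in positives]
    fillers = rng.sample(remaining, min(n_c - n_pos, len(remaining)))
    return positives + fillers
```

Under this reading, once \( \alpha \cdot n_c \) exceeds the size of the user's positive history, \( n_{pos} \) stops growing, which matches the plateau observed in Figure 8(c).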
The Size of the State (\( n_s \)). Figure 9(a) shows that as the state size \( n_s \) increases, the performance stays almost unchanged, which means the size of the state has little impact on the implementations of our framework TRGIR.
The Size of the Action (\( n_a \)). For TRGIR-DDPG, Figure 9(b) shows that when \( n_a \) ranges from 1 to 10, the performance increases as well. However, the performance starts to decrease when \( n_a \) reaches 20. The larger \( n_a \) is, the more frequently the state changes, indicating that keeping a proper updating speed is important. Note that, since \( n_a \) is fixed to 2 for TRGIR-DQN, we do not include it in this set of experiments.
    Fig. 9.
Fig. 9. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the size of state; (b) the size of action.

    5 Conclusion

In this article, we propose TRGIR, a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation. By learning the embeddings of users and items with a GCN-based self-supervised embedding method on a relation graph that contains textual information, we obtain semantically meaningful user and item vectors, which greatly alleviates the data sparsity problem. Moreover, based on the idea of collaborative filtering, we classify users into several clusters and construct an action candidate set, which directly reduces the scale of the action space. Further, combined with the policy vector dynamically learned by DDPG, which represents the user's preferences in the feature space, we select items from the candidate set to generate recommendation actions, which greatly improves the efficiency of decision-making and enhances the exploration ability. Experimental results over a carefully designed simulator on three public datasets demonstrate that, compared with state-of-the-art methods, TRGIR-DDPG achieves remarkable performance improvements in a time-efficient manner.
For future work, we intend to model the textual information at the word level to capture finer-grained semantic factors for better recommendation performance; we would also like to explore incorporating transfer learning into our proposed model.


    References

    [1]
    Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Laura M. Haas and Ashutosh Tiwary (Eds.). 94–105.
    [2]
    Pierpaolo Basile, Claudio Greco, Alessandro Suglia, and Giovanni Semeraro. 2018. Deep learning and hierarchical reinforcement learning for modeling a conversational recommender system. Intelligenza Artificiale 12, 2 (2018), 125–141.
    [3]
    Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 717–725.
    [4]
    Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017).
    [5]
    Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 3312–3320.
    [6]
    Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the 36th International Conference on Machine Learning (ICML), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 1052–1061.
    [7]
    Germán Cheuque, José Guzmán, and Denis Parra. 2019. Recommender systems for online video game platforms: The case of STEAM. In Proceedings of the International Conference on World Wide Web (WWW), Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 763–771.
    [8]
    Jin Yao Chin, Kaiqi Zhao, Shafiq R. Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang (Eds.). ACM, 147–156.
    [9]
    Dong Deng, Liping Jing, Jian Yu, Shaolong Sun, and Haofei Zhou. 2018. Neural Gaussian mixture model for review-based rating prediction. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan (Eds.). ACM, 113–121.
    [10]
    Zhi-Hong Deng, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, and S. Yu Philip. 2019. DeepCF: A unified framework of representation learning and matching function learning in recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 61–68.
    [11]
    Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
    [12]
    George H. Dunteman. 1989. Principal Components Analysis. Number 69. Sage.
    [13]
    Hado van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2094–2100.
    [14]
    Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
    [15]
    Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
    [16]
    Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (SIGKDD). 368–377.
    [17]
    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. DOI:
    [18]
    Wang-Cheng Kang and Julian J. McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society, 197–206.
    [19]
    Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR).
    [20]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
    [21]
    Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. Retrieved from http://snap.stanford.edu/data.
    [22]
    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). 661–670.
    [23]
    Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR Poster).
    [24]
    Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys). 5–12.
    [25]
    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
    [26]
    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
    [27]
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1532–1543.
    [28]
    Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference. Springer, 593–607.
    [29]
    Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web Companion. ACM, 111–112.
    [30]
    David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML). 387–395.
    [31]
    Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 926–934.
    [32]
    Haihui Tan, Ziyu Lu, and Wenjie Li. 2017. Neural network based reinforcement learning for real-time pushing on text stream. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 913–916.
    [33]
    Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek (Eds.). ACM, 565–573.
    [34]
    Chaoyang Wang, Zhiqiang Guo, Jianjun Li, Peng Pan, and Guohui Li. 2020. A text-based deep reinforcement learning framework for interactive recommendation. In Proceedings of the 24th European Conference on Artificial Intelligence. IOS Press, 537–544. DOI:
    [35]
    Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization bandits for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. 2695–2702.
    [36]
    Kai Wang, Zhene Zou, Qilin Deng, Jianrong Tao, Runze Wu, Changjie Fan, Liang Chen, and Peng Cui. 2021. Reinforcement learning with a disentangled universal value function for item recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4427–4435.
    [37]
    Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
    [38]
    Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2017. Reinforcement learning to rank with Markov decision process. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 945–948.
    [39]
    Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-Supervised Graph Learning for Recommendation. Association for Computing Machinery, New York, NY, 726–735. DOI:
    [40]
    Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan Zhu. 2017. SSP: Semantic space projection for knowledge graph embedding with text descriptions. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 3104–3110.
    [41]
    Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
    [42]
    Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 3203–3209.
    [43]
    Wu Yao, Christopher Dubois, Alice X. Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-N recommender systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM). 153–162.
    [44]
    Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, and Changyou Chen. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. Adv. Neural Inf. Process. Syst. 32 (2019), 15214–15224.
    [45]
Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. Deep reinforcement learning for search, recommendation, and online advertising: A survey. SIGWEB Newsl. Spring (July 2019). DOI:
    [46]
    Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys). 95–103.
    [47]
    Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 1040–1048.
    [48]
    Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2018. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2018).
    [49]
    Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1411–1420.
    [50]
    Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the International Conference on World Wide Web (WWW). 167–176.
    [51]
    Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 425–434.


