Reinforcement learning (RL) has shown great promise in optimizing long-term user interest in recommender systems. However, existing RL-based recommendation methods need a large number of interactions for each user to learn the recommendation policy. The challenge becomes more critical when recommending to new users who have a limited number of interactions. To that end, in this article, we address the cold-start challenge in RL-based recommender systems by proposing a novel context-aware offline meta-level model-based RL approach for user adaptation. Our proposed approach learns to infer each user's preference with a user context variable that enables the recommender system to better adapt to new users with limited contextual information. To improve adaptation efficiency, our approach learns to recover the user choice function and reward from limited contextual information through an inverse RL method, which is used to assist the training of a meta-level recommendation agent. To avoid the need for online interaction, the proposed method is trained using historically collected offline data. Moreover, to tackle the challenge of offline policy training, we introduce a mutual information constraint between the user model and recommendation agent. Evaluation results show the superiority of our developed offline policy learning method when adapting to new users with limited contextual information. In addition, we provide a theoretical analysis of the recommendation performance bound.
1 Introduction
Recent years have witnessed great interest in developing reinforcement learning (RL)-based recommender systems [2, 8], which can effectively model and optimize users' long-term interests. In RL-based methods, the recommendation policy is learned by leveraging the collected interactions between users and recommender systems. As different users may have different interests, conventional RL-based methods need to learn a separate policy for each user, which requires a large number of interactions from each individual user. However, it is very difficult and expensive to obtain enough user–recommender interactions to train a robust recommendation policy. This challenge becomes even more critical for cold-start users, who have very few interactions yet are prevalent in many recommender systems. Therefore, it is imperative to learn a recommendation policy that can infer users' preferences and quickly adapt to cold-start users with limited information.
When dealing with cold-start users, the available information is often limited to a few user attributes (e.g., gender, age) and perhaps minimal user–item interactions. The challenge lies in learning user preferences from this scant information. To address this issue, meta-learning approaches [4, 12] have been applied to cold-start recommendation [33, 37]. Meta-learning seeks to learn general knowledge across diverse tasks and adapt that knowledge to new tasks using minimal samples. In the context of cold-start recommendation, each user's recommendation can be viewed as a unique task. During the meta-training phase, meta-learning captures users' general preferences, which are then adapted to new users with limited data during the meta-testing phase. However, it is still very challenging to apply meta-learning to RL-based recommender systems for the following reasons. First, inferring user preferences requires a significant number of interactions sampled from user distributions, which is very difficult to achieve for new users. In extreme cold-start scenarios, the only available information might be user attributes such as gender and age, with no user–item interactions, so the meta-learning method cannot adapt to new users by fine-tuning with a few interactions [12]. This complicates the task of inferring user preferences. Second, to meta-train a recommendation policy, traditional model-free RL methods need interactions with users. Recent methods [2, 8] utilize model-based RL [9, 41] approaches to sidestep the sample efficiency challenge by leveraging offline-logged user data to model the environment. However, they still require large amounts of offline data from each user to build the environment model, i.e., the individual user model. The third challenge lies in learning the user model and recommendation policy from offline-logged user data without online interaction with real users. During offline learning, the recommendation policy might suggest recommendations for which the estimated user model cannot provide accurate feedback. This issue represents a fundamental challenge for offline RL due to the distribution shift between the user–item interaction data generated by the policy-user model and the existing offline data.
To address the aforementioned challenges, we propose a novel context-aware offline meta-level model-based RL method, specifically designed for tackling the cold-start problem in RL-based recommender systems. To address the first challenge, inspired by the context-based meta-learning approach [11, 49], we introduce a user context variable to infer the user's preferences from the available user's contextual information, such as user attributes or a limited number of user–item interactions. In response to the second challenge, we employ meta-learning to train a model-based RL model by conditioning both the user model and the recommendation policy on the user context variable. This approach facilitates the learning of meta-level knowledge, i.e., general user preferences. During the meta-training phase, by conditioning on the user context variable, the user model (meta-level user model) is learned to estimate the user choice function and user reward function by using the inverse RL (IRL) approach across a broad spectrum of users. In the model-based RL framework, the recommendation agent conditioned on the user context variable (meta-level recommendation agent) is trained through interaction with the meta-level user model. After the meta-training phase, the meta-level recommendation agent can adapt to a new cold-start user by using the user's contextual information that is embedded in the user context variable. To tackle the third challenge arising from offline RL training of our proposed method, we adhere to the principle of conservatism, a concept extensively adopted in offline RL literature [32, 50]. We propose incorporating a mutual information regularizer between the user model and the recommendation agent. This approach is designed to encourage the recommendation agent to make recommendations within the user model's confidence region, thereby improving the accuracy of feedback.
We conduct intensive experiments to evaluate our developed method. The evaluations include both a simulated online experiment and an offline experiment. The online experiment is carried out in a simulated environment by using an open-source simulator for recommender systems [26], which provides sequential interaction with users. The offline experiment is performed with two widely used datasets. In both online and offline experiments, we evaluate our method against several state-of-the-art baselines using multiple evaluation metrics. All the evaluation results consistently demonstrate the superiority of our developed method.
The main contributions of this work can be summarized as follows:
—
We propose a novel context-aware offline meta-level model-based RL method to address the cold-start problem of RL-based recommender systems, which is trained using logged offline data.
—
Within our framework, we introduce a user context variable to infer user preference that enables adaptation on cold-start users with limited contextual information.
—
Both the user model and the recommendation agent are meta-learned by conditioning on this user context variable within the model-based RL framework. This approach ensures effective adaptation to new users. To tackle the challenge presented by the offline RL training of our proposed method, we introduce a mutual information regularizer between the meta-level user model and the recommendation agent.
—
Evaluation results demonstrate the superiority of our developed method. A theoretical analysis of the recommendation performance bound of the developed method is also provided.
2 Related Work
The related work of this article can be grouped into five categories, as discussed below.
RL-Based Recommender System. There are mainly two kinds of RL methods for recommendation: model-free RL methods [6, 27, 36, 38, 62, 66, 68] and model-based RL methods [2, 8, 67]. Model-free RL methods assume the environment is unknown and perform no user modeling, and they usually need large amounts of interactions for policy optimization. To tackle this sample complexity challenge, model-based RL methods incorporate user modeling, which can predict user behavior and reward. For instance, the generative adversarial user model [8] learns the user behavior model and reward function together in a unified min-max framework; the recommendation policy is then learned with rewards from the trained user model. However, this model requires a large amount of data to estimate a particular user model, which is not feasible in the cold-start recommendation scenario. Besides, the user model and recommendation model are trained separately, which prevents them from benefiting from each other. Bai et al. [2] also proposed to use model-based RL for recommendation. They introduced a discriminator with adversarial training to let the user behavior and recommendation policy imitate the policy in logged offline data, and the reward used to train the recommendation policy is weighted by the discriminator score. Their method can be seen as reward shaping [40, 42], which does not recover the true user reward function. Neither of these two methods properly addresses the cold-start challenge in RL-based recommender systems. Recently, offline RL [34] methods have been used to learn a recommendation policy from offline datasets [7, 39, 56, 61]. For example, Xiao et al. [61] proposed a general offline RL framework for recommendation, where supervised regularization, policy constraints, dual constraints, and reward extrapolation are introduced to minimize the distribution mismatch between the logging policy and the recommendation policy. Wang et al. [56] proposed a causal decision transformer to integrate offline RL and the transformer model. Gao et al. [17] combined causal inference with offline RL to burst filter bubbles in recommender systems. To alleviate the Matthew effect of offline RL in interactive recommendation, a state entropy term is added to relax the pessimism in the model-based offline RL algorithm [16]. In contrast to these methods, our method can recover the true user behavior and reward with a small amount of data by meta-learning the user model and recommendation model with a user context variable in a unified framework, and the mutual information regularization between the user policy and recommendation policy allows the two to benefit from each other when learning from offline-logged user data.
Meta-Learning. Meta-learning aims to learn from a small amount of data and adapt quickly to new tasks [4, 12]. Context-based meta-learning approaches [11, 49] learn to infer task uncertainties by taking task experiences as input. For instance, Rakelly et al. [49] proposed to learn task context variables as probabilistic latent variables inferred from past experiences, and the model-free RL policy is trained conditioned on the task variable to improve sample efficiency. In contrast, our method learns a user context variable to infer user preference within an offline model-based RL framework.
Cold-Start Recommendation. The cold-start problem of recommendation has been studied for a long time in the literature, and various approaches have been developed to address it [33, 35, 45, 57, 59, 71]. Among these, cross-domain recommendation techniques [5, 20, 21, 70] enhance performance in the target domain by leveraging user–item interactions from relevant source domains. However, they often rely on shared users across domains for knowledge transfer [5] and require source domain interaction data for target domain cold-start users [5], limiting their applicability in our setting. One particular line of work, which has been developed recently and is closely related to this study, utilizes the meta-learning technique [10, 33, 64] to tackle the cold-start challenge. Generally speaking, these methods regard users as tasks and items as classes and use the gradient-based model-agnostic meta-learning (MAML) [12] algorithm to enable fast adaptation to new users with few interactions. However, how to tackle the cold-start challenge of RL-based recommendation is still underexplored, which is the research focus of this article.
IRL. IRL is the problem of learning reward functions from demonstrations [1, 43, 46], which avoids the need for reward engineering. For instance, Fu et al. [13] proposed an adversarial IRL (AIRL) framework to recover the true reward function from demonstrations. IRL typically needs a large number of expert demonstrations to infer the true reward function, which are expensive to collect, for example in robotics. Recently, some works [18, 63] attempt to recover the reward function from a limited number of demonstrations with meta-IRL methods that incorporate context-based meta-learning into the AIRL framework. Comparatively, in our solution, we recover the user policy and reward function from offline user behavior data by leveraging the meta-IRL method. To better capture user context information in the policy, we utilize a variational policy network conditioned on the user context variable. Besides, the meta-IRL-learned user model serves as the environment in our meta-level model-based RL framework.
Offline RL. Offline RL [34, 51] aims to learn policies from a static dataset consisting of past interactions with the environment. Most offline RL methods use constraints on the learned policy to prevent the policy from drifting away from offline data support. For instance, Wu et al. [60] use Kullback–Leibler (KL) divergence to regularize the learned policy to be closer to the behavior policy. Recently, Yu et al. [65] and Kidambi et al. [29] utilize uncertainty estimation as a reward penalty to constrain policy learning. Different from existing methods, we not only provide constraints on the policy with a novel mutual information regularization but also encourage policy adaptation with meta-learning.
3 Preliminaries
3.1 RL-Based Recommender Systems
Recent studies have demonstrated the potential of RL-based recommender systems in enhancing long-term user engagement [2, 8]. Such a system relies on an RL-based recommendation agent that regularly interacts with users. During each interaction, the agent presents the user with a list of recommendations. The user then chooses an item from the list and provides feedback to the agent, typically a measure of user utility or satisfaction, which is treated as the reward for the agent.
Formally, the RL-based recommendation problem is modeled as a Markov decision process (MDP) with the following components:
—
Environment: Each user represents an environment that the recommendation agent interacts with. This environment provides feedback to the recommendation agent, encompassing the user's choice and reward.
—
State Space \(\boldsymbol{\mathcal{S}}\): \(\boldsymbol{s}_{t}\in\mathcal{S}\) is the user's historical clicks before time \(t\).
—
Action \(\boldsymbol{\mathcal{A}}\): The action \(\boldsymbol{A}_{t}\in\mathcal{A}\) corresponds to the top-\(k\) recommendation list generated by the recommendation agent at time \(t\).
—
State Transition Probability \(\boldsymbol{\mathcal{P}}\): \(p(\boldsymbol{s}_{t+1}|\boldsymbol{s}_{t},\boldsymbol{A}_{t})\) is the probability of transitioning from state \(\boldsymbol{s}_{t}\) to \(\boldsymbol{s}_{t+1}\) given the recommendation list \(\boldsymbol{A}_{t}\). This also signifies the probability of the user's choice \(x_{t}\) from the recommendation list \(\boldsymbol{A}_{t}\) according to the user's choice or policy function.
—
Reward \(\boldsymbol{\mathcal{R}}\): \(r(\boldsymbol{s}_{t},\boldsymbol{A}_{t},x_{t})\) is the immediate reward for the agent's action \(\boldsymbol{A}_{t}\), which represents the user's utility or satisfaction after choosing \(x_{t}\in\boldsymbol{A}_{t}\) at state \(\boldsymbol{s}_{t}\).
—
Recommendation Policy \(\boldsymbol{\pi}\): \(\boldsymbol{A}_{t}\sim\pi(\boldsymbol{A}_{t}|\boldsymbol{s}_{t})\) is the recommendation agent's recommendation list given state \(\boldsymbol{s}_{t}\).
The recommendation agent's goal is to learn a policy \(\pi\) that maximizes the expected cumulative reward, symbolizing long-term user utility or satisfaction: \(\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r\left(\boldsymbol{s}_{t}, \boldsymbol{A}_{t},x_{t}\right)\right]\), with \(\gamma\) as the discount factor.
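To make the formulation concrete, the following minimal Python sketch shows the interaction loop implied by this MDP and how the discounted return is accumulated. The `policy` and `user_env` interfaces are illustrative placeholders, not part of the paper's implementation.

```python
def rollout_return(policy, user_env, gamma=0.99, max_steps=100):
    """Run one episode of the recommendation MDP and return the discounted return.

    `policy(state)` returns a top-k recommendation list A_t; `user_env` is a
    placeholder environment exposing reset() and step(slate)."""
    state = user_env.reset()                 # s_0: the user's (initially empty) click history
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        slate = policy(state)                # A_t: top-k recommendation list
        # The user picks an item x_t from the slate and emits a reward r_t.
        choice, reward, state, done = user_env.step(slate)
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return
```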
3.2 Problem Statement
Our article primarily concentrates on the cold-start problem within the RL-based recommender system context. The data used to train the recommendation model are user–item interactions. We have a set of users \(\mathcal{U}=\left\{\mathcal{U}^{warm},\mathcal{U}^{cold}\right\}\), where \(\mathcal{U}^{warm}\) and \(\mathcal{U}^{cold}\) represent warm users and cold-start users, respectively; there is no overlap between these two groups. The item set is denoted as \(\mathcal{V}=\left\{v_{1},\ldots,v_{j},\ldots,v_{|\mathcal{V}|}\right\}\). For every user \(u\in\mathcal{U}\), the sequence of interacted items is recorded in chronological order as \(P_{u}=\left\{x_{1},\ldots,x_{t},\ldots,x_{|P_{u}|}\right\}\). Additionally, each user \(u\) is associated with contextual information \(C_{u}\), such as user attributes \(u_{att}=\left\{u^{a}_{1},\ldots,u^{a}_{j},\ldots,u^{a}_{|u_{att}|}\right\}\). With the above setup, the cold-start recommendation problem is to learn, from the offline data of the warm users \(\mathcal{U}^{warm}\), a recommendation policy that can quickly adapt to each cold-start user in \(\mathcal{U}^{cold}\) given only limited contextual information \(C_{u}\), so as to maximize that user's long-term utility.
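As a concrete illustration of this setup, the following sketch shows one possible in-memory representation of a user record; the field names and example values are hypothetical and only mirror the notation above (\(C_u\), \(P_u\)).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserRecord:
    """One user's logged data in the cold-start setting (illustrative structure)."""
    user_id: str
    attributes: Dict[str, str]                        # contextual information C_u
    clicks: List[int] = field(default_factory=list)   # P_u: item ids in chronological order

# Warm users carry interaction histories; cold-start users may have attributes only.
warm_user = UserRecord("u1", {"gender": "F", "age": "25", "occupation": "artist"}, clicks=[12, 7, 33])
cold_user = UserRecord("u2", {"gender": "M", "age": "31", "occupation": "writer"})
```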
4 Method
In this section, we present the proposed context-aware Mutual information regularized Meta-level Model-based RL approach for cold-start Recommendation, denoted as \(\rm M^{3}Rec\). We first introduce the overall framework and then elaborate on the details of the proposed method.
4.1 Overview
Our proposed model's architecture is depicted in Figure 1, comprising four key modules: the user context encoder, the meta-level user model, the meta-level recommendation agent, and the mutual information regularizer. The model adopts a meta-learning perspective to address the cold-start issue inherent in RL-based recommendation. To adapt the user model and recommendation agent to individual users who have limited contextual information, we employ the context-based meta-learning approach [11, 49]. The user context encoder learns to derive a user context variable by inferring the user's preference from each user's contextual information. Since both the state transition probability function \(p(\boldsymbol{s}_{t+1}|\boldsymbol{s}_{t},\boldsymbol{A}_{t})\) and the reward function \(r(\boldsymbol{s}_{t},\boldsymbol{A}_{t},x_{t})\) are unknown, we adopt a model-based RL framework to explicitly model the user behavior dynamics from the offline data. This structure integrates the meta-level user model and the meta-level recommendation agent. Conditioning both the user model and recommendation agent on the user context variable enables these components to be meta-learned, which effectively enhances their adaptability to cold-start users. Specifically, within the model-based RL framework, the meta-level recommendation agent is optimized through interaction with the meta-level user model, which approximates individual user dynamics and provides reward feedback to the agent. Lastly, the model is trained with offline-logged user data and does not involve expensive online interactions with real users. To avoid accessing out-of-distribution data beyond the support of the collected offline user data, we introduce a mutual information regularizer. This component establishes a link between the meta-level user model and the meta-level recommendation agent, adhering to the principle of conservatism in the offline RL literature [32, 50].
Figure 1.
4.2 User Context Encoder
To learn a context variable for adapting to different users, we adopt a user context encoder that summarizes the user contextual information \(C_{u_{i}}\) into a context variable \(c_{u_{i}}\) for user \(u_{i}\). We first transform the raw context information into embedding representations and then employ deep neural networks to obtain the user context variable
\[c_{u_i} = f^{cont}({\textbf{E}}(C_{u_i})),\]
where \(\textbf{E}\) is the embedding layer and \(f^{cont}\) is a multi-layer fully connected neural network with the rectified linear unit (ReLU) activation function.
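A minimal sketch of such a context encoder is given below, assuming categorical attribute fields; the layer sizes and field vocabularies are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class UserContextEncoder(nn.Module):
    """Sketch of c_u = f_cont(E(C_u)) for categorical attribute fields (illustrative sizes)."""
    def __init__(self, field_sizes, emb_dim=16, hidden=128, out_dim=64):
        super().__init__()
        # One embedding table (part of E) per categorical attribute field.
        self.embeddings = nn.ModuleList([nn.Embedding(n, emb_dim) for n in field_sizes])
        self.f_cont = nn.Sequential(
            nn.Linear(emb_dim * len(field_sizes), hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, attr_ids):                 # attr_ids: (batch, num_fields) integer tensor
        embs = [emb(attr_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.f_cont(torch.cat(embs, dim=-1))   # user context variable c_u

encoder = UserContextEncoder(field_sizes=[2, 7, 21])   # e.g., gender / age bucket / occupation
c_u = encoder(torch.tensor([[1, 3, 10]]))
```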
4.3 Meta-Level User Model
As the context-aware user choice and reward function are unknown, we aim to recover both the user choice and reward function from offline data in the meta-level user model. Considering limited context information from the cold-start users, we propose to utilize the context-based meta-learning [11, 49] to estimate the user behavior model from offline data. We first introduce the user choice or policy function \(\pi(x_{t}|\boldsymbol{s}_{t},\boldsymbol{A}_{t},c)\), which characterizes how the user chooses an item from the provided recommendation list. Then, we utilize the IRL method to recover the user choice function and the reward function without manually specifying the reward function.
Given the \(i\)th user's historical clicked item sequence before time \(t\), \(\{x_{i,1},x_{i,2},\ldots,x_{i,t-1}\}\), we first transform the clicked items into item embeddings \(\{\boldsymbol{e}_{i,1},\boldsymbol{e}_{i,2},\ldots,\boldsymbol{e}_{i,t-1}\}\) using the embedding matrix \(\textbf{E}^{u}\). We utilize a recurrent neural network with long short-term memory (LSTM) units [25] to summarize the state information \(\boldsymbol{s}_{i,t}^{u}\) maintained by the user model as follows:
\[\boldsymbol{s}_{i, t}^{u}= LSTM(\boldsymbol{e}_{i, t - 1}, \boldsymbol{s}_{i, t-1}^{u}).\]
In the following notations, we omit the user index \(i\) for simplicity.
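The state summarization can be sketched as follows; the embedding and hidden sizes are illustrative, and the loop simply applies the LSTM update above over the click history.

```python
import torch
import torch.nn as nn

class UserStateTracker(nn.Module):
    """Sketch of the LSTM state update s_t^u = LSTM(e_{t-1}, s_{t-1}^u) (illustrative sizes)."""
    def __init__(self, num_items, emb_dim=50, hidden=128):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)       # embedding matrix E^u
        self.lstm = nn.LSTMCell(emb_dim, hidden)

    def forward(self, clicked_items):            # (batch, t-1) item ids in chronological order
        batch = clicked_items.size(0)
        h = torch.zeros(batch, self.lstm.hidden_size)
        cell = torch.zeros(batch, self.lstm.hidden_size)
        for step in range(clicked_items.size(1)):
            e = self.item_emb(clicked_items[:, step])
            h, cell = self.lstm(e, (h, cell))
        return h                                 # s_t^u: summarized user state

tracker = UserStateTracker(num_items=1000)
s_u = tracker(torch.tensor([[12, 7, 33]]))
```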
To learn a context-aware user choice function \(\pi(x_{t}|\boldsymbol{s}_{t}^{{u}},\boldsymbol{A}_{t},c)\), the salient information of the user context variable \(c_{u_{i}}\) must be encoded into the user policy representation. It is also desirable that the user policy representation can capture the uncertainty of the user distribution for adaptation to different users. Therefore, we adopt the variational inference approach [53] to infer the probabilistic latent user policy variable \(z_{u,t}\) at time \(t\), which is generated from the variational distribution \(q_{inf}(z_{u,t}|\boldsymbol{s}_{t}^{{u}}, c)\). To optimize the parameters for learning \(z_{u,t}\), we maximize the variational lower bound of \(\log p(\boldsymbol{s}_{t}^{{u}}|c)\)
where the first term is optimized to reconstruct current state \(\boldsymbol{s}_{t}^{{u}}\). The second term constrains the latent policy variable with a Gaussian prior.
In practice, following [53], \(q_{inf}\) and \(p_{dec}\) are parameterized by the inference neural network and decoding neural network, respectively. Specifically, the context-conditional variational autoencoder is defined as follows:
where we utilize the reparameterization trick [31] to sample \(z\) from \(\mathcal{N}(\mu,\Sigma)\). \(q_{inf}\) and \(p_{dec}\) are both three-layer multilayer perceptrons (MLPs) with ReLU activation functions. To reconstruct the current state information, we utilize the decoding network to predict the last click \(x_{t-1}\) in the user's historical clicks, which is the input information at time \(t\). For the latent user policy representation, we adopt \(z_{u,t}=\mu_{t}\) for stable training.
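A minimal sketch of this context-conditional variational encoder is shown below. It assumes a Gaussian posterior with the reparameterization trick and a decoder that predicts the last click, as described above; the dimensions and the KL weight are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextCVAE(nn.Module):
    """Sketch of q_inf(z | s_t^u, c) and p_dec reconstructing the last click (illustrative sizes)."""
    def __init__(self, state_dim=128, ctx_dim=64, z_dim=32, num_items=1000, hidden=128):
        super().__init__()
        self.inf = nn.Sequential(                 # q_inf: 3-layer MLP with ReLU
            nn.Linear(state_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),         # outputs (mu, log-variance)
        )
        self.dec = nn.Sequential(                 # p_dec: predicts the last click x_{t-1}
            nn.Linear(z_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_items),
        )

    def forward(self, s_u, c, last_click, beta=0.001):
        mu, logvar = self.inf(torch.cat([s_u, c], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization trick
        recon = self.dec(torch.cat([z, c], -1))
        recon_loss = F.cross_entropy(recon, last_click)                # reconstruction term
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to the Gaussian prior
        return mu, recon_loss + beta * kl        # z_{u,t} = mu is used at inference time

cvae = ContextCVAE()
z_u, loss = cvae(torch.randn(4, 128), torch.randn(4, 64), torch.randint(0, 1000, (4,)))
```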
Given the recommendation list \(\boldsymbol{A}_{t}=\{a_{1},\cdots,a_{k}\}\), the probability of choosing item \(x_{t}\in\boldsymbol{A}_{t}\) is based on the latent user representation \(z_{u,t}\) as follows:
where \(f^{pref}\) and \(f^{rec}\) are one-layer MLPs that encode the user's preference and the representations of the candidate items, respectively. \(\|\) denotes the concatenation operation, and \(f^{cho}\) is a three-layer MLP with ReLU activation functions that models the user's choice from the concatenated preference and candidate item representations.
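The choice head can be sketched as follows, scoring each slate item against the latent user preference and normalizing with a softmax; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class UserChoiceHead(nn.Module):
    """Sketch of the user choice probability over a slate A_t from z_{u,t} (illustrative sizes)."""
    def __init__(self, z_dim=32, item_dim=50, hidden=128):
        super().__init__()
        self.f_pref = nn.Linear(z_dim, hidden)        # user preference encoder f^pref
        self.f_rec = nn.Linear(item_dim, hidden)      # candidate item encoder f^rec
        self.f_cho = nn.Sequential(                   # 3-layer choice MLP f^cho
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, slate_embs):               # z_u: (B, z_dim), slate_embs: (B, k, item_dim)
        pref = self.f_pref(z_u).unsqueeze(1).expand(-1, slate_embs.size(1), -1)
        items = self.f_rec(slate_embs)
        scores = self.f_cho(torch.cat([pref, items], dim=-1)).squeeze(-1)  # (B, k)
        return torch.softmax(scores, dim=-1)          # pi(x_t | s_t^u, A_t, c) over the slate

head = UserChoiceHead()
choice_probs = head(torch.randn(2, 32), torch.randn(2, 5, 50))
```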
The above describes the modeling of the user choice function \(\pi_{\phi}(x_{t}|\boldsymbol{s}_{t}^{u},\boldsymbol{A}_{t},c)\), where \(\phi\) denotes the parameters involved in the user choice modeling. We aim to recover both the actual user policy \(\pi_{\phi}\) and the reward function \(r_{\omega}\). Inspired by AIRL [13], we recover both the context-aware user policy and reward function from offline data by optimizing the following objective:
where we omit the subscript \(t\) and superscript \(u\) for the ease of notation. This objective forms an adversarial game between the user policy function and the discriminator. The discriminator aims to distinguish the user behavior sampled from the learned user policy function and the real user–item interactions, while the user policy is trained to confuse the discriminator. The discriminator function takes the form
where \(g_{\omega}\) contains the reward approximator \(r_{\omega}\) and the reward shaping term \(h_{\varphi}\): \(g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c)=r_{\omega}(\boldsymbol{s},x, \boldsymbol{A},c)+\gamma h_{\varphi}(\boldsymbol{s}^{\prime})-h_{\varphi}(\boldsymbol{s})\), where \(\gamma\) is the discount factor and \(\boldsymbol{s}^{\prime}\) is the next state of state \(\boldsymbol{s}\). The reward shaping term \(h_{\varphi}\) is modeled using a one-layer MLP. The reward function \(r_{\omega}\) is modeled as follows:
where \(x^{max}\) denotes the user's preferred item in the recommendation list. \(f^{d\_pref}\) and \(f^{d\_rec}\) are one-layer MLPs, and \(f^{r}\) is a three-layer MLP with a ReLU activation function. To enable gradient backpropagation through the \(\operatorname{argmax}\) operation, we approximate it with a temperature-controlled softmax.
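The discriminator and reward approximator can be sketched in the AIRL style [13] described above, with \(g = r_{\omega} + \gamma h_{\varphi}(\boldsymbol{s}') - h_{\varphi}(\boldsymbol{s})\) and \(D = \exp(g)/(\exp(g) + \pi_{\phi})\); the soft argmax over the slate stands in for \(x^{max}\). Network sizes and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    """Sketch of an AIRL-style discriminator: D = exp(g) / (exp(g) + pi_phi) with
    g(s, x, A, c) = r_omega(s, x, A, c) + gamma * h(s') - h(s).  Sizes are illustrative."""
    def __init__(self, state_dim=128, item_dim=50, hidden=128, gamma=0.99, temp=0.1):
        super().__init__()
        self.gamma, self.temp = gamma, temp
        self.f_d_pref = nn.Linear(state_dim, hidden)  # f^{d_pref}
        self.f_d_rec = nn.Linear(item_dim, hidden)    # f^{d_rec}
        self.f_r = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.h = nn.Linear(state_dim, 1)              # reward-shaping term h_varphi

    def reward(self, s, slate_embs):
        pref = self.f_d_pref(s)                                        # (B, hidden)
        items = self.f_d_rec(slate_embs)                               # (B, k, hidden)
        # Soft "argmax" over the slate: differentiable surrogate for x^max.
        weights = torch.softmax((items @ pref.unsqueeze(-1)).squeeze(-1) / self.temp, dim=-1)
        x_max = (weights.unsqueeze(-1) * items).sum(dim=1)             # (B, hidden)
        return self.f_r(torch.cat([pref, x_max], dim=-1)).squeeze(-1)  # r_omega

    def forward(self, s, s_next, slate_embs, log_pi):
        # log_pi: the user policy's log-probability of the chosen item.
        g = self.reward(s, slate_embs) + self.gamma * self.h(s_next).squeeze(-1) - self.h(s).squeeze(-1)
        return torch.sigmoid(g - log_pi)              # equals exp(g) / (exp(g) + pi)

disc = AIRLDiscriminator()
d = disc(torch.randn(2, 128), torch.randn(2, 128), torch.randn(2, 5, 50), log_pi=torch.randn(2))
```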
Specifically, based on Equation (3), during training, we can alternately update parameters of user policy \(\pi_{\phi}\) and discriminator \(D_{\omega}\). The objective for training the user policy \(\pi_{\phi}\) is
where \(\tau_{u_{i}}\) is user \(u_{i}\)'s behavior sequence generated from user's policy \(\pi_{\phi}\) interacting with the recommendation agent. We can train the meta-level user policy \(\pi_{\phi}\) using the policy gradient (PG) algorithm [55].
The discriminator \(D_{\omega}\) is trained to distinguish the real user–item interaction sequences from those generated by the user policy \(\pi_{\phi}\), where \(P_{u_{i}}\) is the sampled real user \(u_{i}\)'s behavior sequence. Similar to AIRL [13], when the context-aware user policy and discriminator are trained to optimality, we can recover the true user policy and the true reward function up to a constant, which approximates the real user model. We also utilize the offline data to estimate the meta-level user model by maximizing the likelihood, which stabilizes the training process.
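The alternating updates can be summarized schematically as below; these are generic AIRL-style losses written only for illustration, since the exact objectives follow Equation (3) and the surrounding equations.

```python
import torch

def discriminator_loss(d_real, d_fake):
    # Binary classification: real offline behavior vs. behavior sampled from pi_phi.
    return -(torch.log(d_real + 1e-8).mean() + torch.log(1.0 - d_fake + 1e-8).mean())

def user_policy_reward(d_fake):
    # AIRL-style surrogate reward fed to the policy-gradient update of pi_phi.
    return torch.log(d_fake + 1e-8) - torch.log(1.0 - d_fake + 1e-8)
```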
4.4 Meta-Level Recommendation Agent
With the above estimated user policy function \(\pi_{\phi}\) and reward function \(r_{\omega}\) serving as the user environment model, we can learn the recommendation policy \(\pi_{\theta}\) to maximize the cumulative reward. To adapt to different users with limited context information, we also adopt the context-based meta-learning approach [11, 49] to learn a context-aware recommendation policy. Similar to the meta-level user model, we use a variational recommendation policy conditioned on the user context variable so that the recommendation policy is aware of the user's preference. The latent recommendation policy variable at time \(t\) is denoted as \(z_{rec,t}\), drawn from the variational distribution \(q_{inf}(z_{rec}|\boldsymbol{s}_{t}^{rec},c)\), where \(\boldsymbol{s}_{t}^{rec}\) is the state information maintained by the recommendation agent. We optimize the lower bound of \(\log p(\boldsymbol{s}_{t}^{rec}|c)\).
The details of learning \(z_{rec,t}\) are similar to those of learning \(z_{u,t}\) in Equation (2). Then, based on the latent recommendation policy variable \(z_{rec,t}\), the agent generates a recommendation list of size \(k\). Specifically, we utilize a two-layer MLP with ReLU activation functions and softmax normalization to output a probability vector over the entire item set \(\mathcal{V}\). The items with top-\(k\) probabilities are then selected as the recommendation list \(\boldsymbol{A}_{t}\).
The objective for training the recommendation policy is as follows:
which can be optimized using a PG algorithm [55]. We also utilize the offline data with maximum likelihood training to stabilize the training process.
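A sketch of the recommendation head and a REINFORCE-style update is given below; the two-layer MLP with softmax and top-\(k\) selection follows the description above, while the return computation is a generic discounted REINFORCE estimate with illustrative sizes.

```python
import torch
import torch.nn as nn

class RecommendationHead(nn.Module):
    """Sketch of the meta-level recommendation policy head: a 2-layer MLP over z_{rec,t},
    a softmax over the item set V, and top-k selection (illustrative sizes)."""
    def __init__(self, z_dim=32, hidden=128, num_items=1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_items))

    def forward(self, z_rec, k=5):
        probs = torch.softmax(self.net(z_rec), dim=-1)      # distribution over the item set
        slate = probs.topk(k, dim=-1).indices               # recommendation list A_t
        return probs, slate

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Generic REINFORCE objective over a rollout against the learned user model (schematic)."""
    returns, g = [], 0.0
    for r in reversed(rewards):                             # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    return -(torch.stack(log_probs) * returns).sum()

rec_head = RecommendationHead()
probs, slate = rec_head(torch.randn(1, 32), k=5)
```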
4.5 Mutual Information Regularizer
Given that it is very expensive and difficult to perform online interactions, the user model and recommendation policy are trained with offline data. It is therefore conceivable that the recommendation policy may offer recommendations for which the user model is unable to provide precise feedback. This phenomenon poses a fundamental challenge for offline RL of the proposed model due to the distribution shift between the user–item interaction data produced by the policy and the existing offline data. In accordance with the principle of conservatism widely adopted in the offline RL literature [32, 50], we propose the inclusion of a mutual information regularizer between the user model and the recommendation agent. This regularizer encourages the recommendation agent to propose recommendations within the confidence region of the user model, consequently enhancing the accuracy of feedback. Conversely, it also encourages the user model to provide more precise feedback on the recommendations the agent actually makes. Further theoretical analysis concerning the influence of mutual information regularization on recommendation performance is presented in Section 4.7.
To realize the mutual information constraint between the user model and the recommendation agent, we utilize the latent user policy variable \(z_{u,t}\) and the latent recommendation policy variable \(z_{rec,t}\). This can be expressed by the following equation:
where \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) denotes the joint distribution and \(\mathbb{P}_{z_{u,t}}\) and \(\mathbb{P}_{z_{rec,t}}\) are marginal distributions.
The aim is to maximize \(\mathcal{I}(z_{u,t};z_{rec,t})\) to model the dependency between the user model and the recommendation agent. Nevertheless, estimating mutual information in high-dimensional spaces is nontrivial. Inspired by [3, 44], we instead maximize a lower bound on the Jensen–Shannon mutual information \(\mathcal{I}^{(JSD)}(z_{u,t};z_{rec,t})\), which yields more stable training. This can be formulated as
where \(T_{\psi}:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}\) is a neural network function with parameter \(\psi\) and \(\operatorname{sp}(z)=\log(1+e^{z})\) is the softplus function.
During the training process, we strive to maximize the lower bound denoted as \(\mathcal{L}_{mutual}\) in Equation (9). The corresponding parameters in the meta-level user model and meta-level recommendation agent model are updated alternately. For instance, while updating the respective parameters in the meta-level recommendation agent model using \(\mathcal{L}_{mutual}\), the parameters of the meta-level user model are kept constant. To estimate the joint distribution \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) in Equation (9), we construct samples from the same users at time \(t\) (e.g., \(i\)th user, (\(z_{u,t}^{i},z_{rec,t}^{i}\))). Conversely, the marginal distributions \(\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}\) are estimated by sampling from a different user (e.g., (\(z_{u,t}^{i},z_{rec,t}^{j}\)), where \(j\neq i\)).
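The estimator and the positive/negative pairing described above can be sketched as follows; the architecture of the statistics network \(T_{\psi}\) is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JSDMutualInfo(nn.Module):
    """Sketch of the Jensen-Shannon mutual information lower bound between z_{u,t} and z_{rec,t},
    L_mutual = E_joint[-sp(-T)] - E_marginal[sp(T)], with sp(x) = log(1 + e^x)."""
    def __init__(self, z_dim=32, hidden=128):
        super().__init__()
        self.T = nn.Sequential(nn.Linear(2 * z_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))   # statistics network T_psi (illustrative)

    def forward(self, z_u, z_rec):
        # Joint samples: (z_u^i, z_rec^i) from the same user at the same step.
        joint = self.T(torch.cat([z_u, z_rec], dim=-1))
        # Marginal samples: pair z_u^i with z_rec^j from another user (simple in-batch shuffle).
        z_rec_shuffled = z_rec[torch.randperm(z_rec.size(0))]
        marginal = self.T(torch.cat([z_u, z_rec_shuffled], dim=-1))
        return (-F.softplus(-joint)).mean() - F.softplus(marginal).mean()

mi = JSDMutualInfo()
l_mutual = mi(torch.randn(16, 32), torch.randn(16, 32))   # L_mutual, to be maximized
```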
4.6 Training
In this section, we introduce the training and testing procedures for the proposed \(\rm M^{3}Rec\) model. Training proceeds by alternately updating the parameters of three key components, namely the recommendation policy, the user model, and the discriminator, together with the parameters of the user context encoder. This process is outlined in Algorithm 1. Upon completion of training, we evaluate the recommendation policy by conditioning it on the context information of each test user, as described in Algorithm 2. We employ both simulated online evaluation and offline evaluation to assess the performance of the model.
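The alternating schedule can be summarized as the schematic loop below; this is a hedged reading of the procedure described above (Algorithm 1 itself is not reproduced here), and all step callables (`user`, `disc`, `rec`, `mutual`) are hypothetical placeholders.

```python
def train_m3rec(offline_loader, context_encoder, steps,
                n_user=1, n_disc=1, n_rec=1, epochs=100):
    """Alternating optimization sketch. `steps` is a dict of hypothetical callables:
    'user', 'disc', 'rec', and 'mutual', each performing one gradient update."""
    for _ in range(epochs):
        for batch in offline_loader:
            c = context_encoder(batch["context"])      # infer the user context variable
            for _ in range(n_user):
                steps["user"](batch, c)                # IRL / likelihood update of the user model
            for _ in range(n_disc):
                steps["disc"](batch, c)                # discriminator and reward approximator update
            for _ in range(n_rec):
                steps["rec"](batch, c)                 # PG update of the recommendation agent
            steps["mutual"](batch, c)                  # maximize L_mutual, alternating both sides
```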
4.7 Theoretical Analysis
In this section, we provide a theoretical analysis of the performance bound of the recommender policy when adapting to meta-test users by our meta-level model-based RL framework. Proofs can be found in Appendix A.
To provide our theoretical analysis, let us first introduce some notations. Here, we slightly abuse the notation for simplicity. We denote action at time \(t\) as \(a_{t}\), which corresponds to \(x_{t}\) and \(\boldsymbol{A}_{t}\) defined in Section 3 for user and recommender agent, respectively. We use \(\mu_{u_{i}}^{\pi_{\theta}}=\frac{1}{T}\sum_{t=0}^{T}P(\boldsymbol{s}_{t}= \boldsymbol{s},a_{t}=a)\) to denote the average state action (\(\boldsymbol{s}\), \(a\)) visitation distribution when executing recommendation policy \(\pi_{\theta}\) in user model \(u_{i}\), where \(u_{i}\) can be the approximated user model \(u_{i}^{m}\) in our meta-level model-based RL or the true user model \(u_{i}^{w}\) in the real world. Then for user \(u_{i}\), there is modeling error between \(u_{i}^{m}\) and \(u_{i}^{w}\) under state-action distribution \(\mu\): \(\ell(u_{i}^{m},\mu)=\mathbb{E}_{(\boldsymbol{s},a)\sim\mu}[D_{\mathrm{KL}}(P_{ u_{i}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{i}^{m}}(\cdot|\boldsymbol{s},a))]\), where
\(D_{\mathrm{KL}}\) is the KL divergence, and \(P_{u_{i}^{w}}\) and \(P_{u_{i}^{m}}\) represent the user transition dynamics (i.e., the user policy \(\pi_{\phi}\)) in the true user model \(u_{i}^{w}\) and the approximated meta-level user model \(u_{i}^{m}\), respectively. The performance of recommendation policy \(\pi_{\theta}\) under user model \(u_{i}\) is \(J(\pi_{\theta},u_{i})=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{u_{i},t}]\), where \(\gamma\) is the discount factor. We can then bound the performance of the recommendation policy learned in our meta-level model-based RL framework.
Remark. In this theorem, the gap between the recommender policy performance in our model trained on meta-training users and the optimal policy in the real-world test users comes from two error terms.
The first term is related to the sub-optimality of the meta-policy optimization as well as the generalization error of the meta-level recommendation policy to the meta-test user. It can be reduced by the sufficient training of a meta-level recommendation policy. The mutual information regularization between user policy and recommendation policy can also help reduce error \(\epsilon_{\pi_{\theta}}^{adapt}\).
The second term is related to the user model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) on the new meta-test user recommendation policy \(\pi_{\theta}\) and its optimal recommendation policy \(\pi_{\theta}^{*}\). As our meta-level user model meta-learns from the distribution of users and optimizes its prediction performance on different meta-level recommendation policies, the model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) can be small. Intuitively, the mutual information regularization between the meta-level user model and meta-level recommendation agent can benefit each other. For the meta-level user model, this mutual information can inform the user model to improve its prediction accuracy on the area visited by the recommendation agent. For the meta-level recommendation agent, it is encouraged to visit the area where the user model has high confidence. Hence, this mutual information regularization further helps reduce the model estimation error \(\epsilon_{u_{test}^{m}}^{adapt}\) in the offline setting. Therefore, these two error terms can be reduced by meta-learning and mutual information regularization in our method. The performance of the recommender policy learned on meta-training users in our offline meta-level model-based RL framework can approximate the optimal policy in the real-world meta-test users.
5 Experiment
In this section, we perform a thorough experimental study to validate the effectiveness of our proposed \(\rm M^{3}Rec\) method in handling cold-start recommendation scenarios. We first provide a description of our experimental setup, including implementation details and the baseline methods employed for comparison. Then, we show the simulated online evaluation experiment. After that, we present the offline evaluation results with real-world datasets. Specifically, we seek to answer the following research questions (RQ):
RQ1:
What is the performance of \(\rm M^{3}Rec\) on the cold-start recommendation task?
RQ2:
When the offline trained \(\rm M^{3}Rec\) is utilized as an initialization for subsequent online learning, how does it perform?
RQ3:
How does the model perform on the first few interactions of cold-start test users?
RQ4:
What impacts do the essential components of \(\rm M^{3}Rec\) have on the overall recommendation performance?
RQ5:
How sensitive is the proposed \(\rm M^{3}Rec\) model to variations in the user model rollout length during training?
5.1 Experimental Setup
5.1.1 Implementation Details.
The parameters were selected based on the recommendation performance on the validation set. The number of layers in the user context encoder was tuned in the set \(\{1,2,3\}\). We adjusted the embedding size of the context information in the user context encoder within the range of \(\{8,16,32,64\}\). The item embedding size was set to 50. The hidden size of the MLP neural networks employed in the model was tuned within the range \(\{64,128,256,512\}\). The number of MLP layers is detailed in the model description in Section 4. We tuned the weight \(\beta\) in the variational lower bound within the set \(\{0.00001,0.0001,0.001,0.01,0.1,1\}\). The user model rollout length was tuned in the range \(\{5,10,15,20\}\). For pre-training, we used the Adam optimizer [30] with a learning rate of 0.001. Following pre-training, a small learning rate was tuned within the range \(\{0.0005,0.0001,0.00001,0.000001\}\). Lastly, the training frequencies of \(e\), \(m\), and \(d\) in Algorithm 1 were tuned within the set \(\{1,3,5,10\}\).
5.1.2 Baseline Methods.
We compare our method with several state-of-the-art approaches that are broadly classified into non-RL cold-start recommendation methods and RL-based methods. The former includes Meta-LSTM and MeLU, while the latter consists of Meta-PG, IRecGAN, generative adversarial user model assisted policy gradient optimization (GAN-PG), and batch-constrained deep Q-learning (BCQ).
—
Meta-LSTM: Recurrent neural networks have shown success for session-based recommendation [24, 47]. Here, we utilize LSTM as the recommendation policy network with context-based meta-learning method [11, 49] by incorporating a user context variable to adapt to the cold-start recommendation setting.
—
MeLU [33]: MeLU utilizes the MAML approach to estimate user's preference for cold-start recommendation.
—
Meta-PG: Similar to [49], Meta-PG is a meta-RL method integrating a user context variable to infer user's preference to adapt to new users.
—
IRecGAN [2]: IRecGAN is a model-based RL method for recommendation by adversarial training between the user behavior model and the discriminator, with the latter purposed to evaluate the quality of generated data. The recommendation policy is trained using the reward from the discriminator.
—
GAN-PG [8]: GAN-PG models the user behavior dynamics and recovers the reward function via a closed-form solution using generative adversarial training. When the training of the user model ends, it serves as the environment model for the recommender agent.
—
BCQ [14, 15]: BCQ is a model-free offline RL method that restricts the agent's action space to keep the learned policy close to the behavior policy that produced the offline data. Specifically, we utilize BCQ in the discrete-action setting [14].
For the baseline methods, the item embedding size was set to 50, as in our method. The learning rate was tuned in the range \(\{0.0001,0.0005,0.001\}\), and the remaining parameters were set according to the suggestions in the original articles of the baseline methods, based on performance on the validation set. For a fair comparison, all the RL-based methods, along with our proposed method, utilize the REINFORCE algorithm for policy optimization [58]. The context information utilized in these baseline methods is consistent with our method.
5.2 Simulated Online Evaluation
As it is difficult to have real users interact online with the recommendation policy, we carry out the online evaluation in a simulated environment, following previous works [2, 8].
5.2.1 Simulated Environment.
To simulate the behavior of different users, we utilize the open-source simulator RecSim [26] for recommender systems, which provides sequential interaction with users. We consider the interest evolution environment, where the task is video recommendation. Users' interests evolve over time, and the goal is to maximize long-term user engagement (e.g., video watch time). The environment consists of a user model, a video model, and a user-choice model. The user model samples a set of users from a distribution over configurable user features. For example, we can configure a user's interests by assigning different user-specific interest values to different topics, so it is flexible to configure users with different preferences. For user configuration, the user interest vector encodes a user's interest in the different document (video) topics, where each dimension ranges from \(-1\) to 1 (1 = very interested, \(-1\) = disgusted). Another important parameter is alpha, which influences how the user's time budget changes. Specifically, we configure a \(d\)-dimensional user interest vector, where each dimension is sampled uniformly from \(-1\) to 1 and \(d=20\) is the number of topics. For alpha, we uniformly sample its value from 0 to 1. We use default configurations for the other parameters. The number of document candidates is set to 200.
5.2.2 Experiment Setting.
We first generate offline data for model training. As we aim to train the model in the offline setting without online interaction, we generate three datasets of different quality. Random: roll out a randomly initialized policy in the RecSim environment. Medium: roll out a partially trained Meta-PG policy in the RecSim environment. Expert: roll out a fully trained Meta-PG policy. Each user's time budget is set to 1,000 minutes in the interest evolution environment. Each dataset contains 200 users, and the number of sessions per user is 20. Each record in the dataset contains the recommendation list, the user's click, and the user's reward at each timestep of each session. For the meta-training and meta-test users, we utilize two sets of user configurations without intersection, as shown in Appendix B. Similarly, we construct the validation dataset with 500 users sampled from the meta-training user distribution. For the meta-test, we perform online interaction with 500 meta-test users and configure the time budget as 1,000 minutes for each user. The cumulative video watch time is calculated as the return for each user. We use the averaged user return as the evaluation metric and report the performance for recommendation lists of size \(k=3,5,10\).
In the case of our simulated users, no user profile information exists. Therefore, we adopt a single user session sequence as the context information for a cold-start test user. This unique interaction sequence is obtained by deploying a recommendation policy to interact with the users according to the dataset type. With regard to context-based meta-learning methods, this individual session sequence is leveraged to infer the user context variable. Notably, in the MeLU baseline method, which is based on the MAML approach, the same single-session sequence of the test user is utilized as the support set for fine-tuning on the cold-start test user.
5.2.3 Online Evaluation (RQ1).
To answer RQ1, we present an overall comparison with the baseline methods using simulated online evaluation for the cold-start recommendation task. Table 1 shows the average cumulative rewards of all competing methods, where a higher cumulative reward indicates that the recommender system better satisfies users' evolving interests and leads to longer user engagement time. It can be observed that the proposed method \(\rm{M^{3}Rec}\) outperforms all the baseline methods across datasets of different qualities and slate sizes (i.e., the size of the recommendation list). The baseline methods tend to underperform when dealing with poor-quality offline datasets (e.g., the random or medium dataset types), whereas our \(\rm{M^{3}Rec}\) method performs well even on the low-quality random data. This underscores the capacity of our method to effectively utilize low-quality data for offline RL training without stringent constraints on data quality. Another interesting observation is that our method is the only one achieving a cumulative reward greater than the configured user time budget of 1,000 across all dataset types. This implies that our method successfully maximizes the cold-start user's long-term interest, thereby preventing the user from dropping off the platform.
Table 1. Online Evaluation Results of Averaged User Cumulative Reward with Different Recommendation List Sizes \(k\)

Dataset type | Slate size | Meta-LSTM | Meta-PG | IRecGAN | GAN-PG | MeLU | BCQ | \(\rm{M^{3}Rec}\) (Ours)
Random | \(k\) = 3 | 960.41 | 971.56 | 973.33 | 883.39 | 895.52 | 849.20 | 1,082.27
Random | \(k\) = 5 | 960.80 | 963.35 | 957.30 | 909.72 | 905.14 | 900.19 | 1,116.08
Random | \(k\) = 10 | 959.05 | 957.86 | 948.12 | 928.70 | 916.41 | 953.92 | 1,024.84
Medium | \(k\) = 3 | 999.04 | 960.93 | 1,104.70 | 950.32 | 939.97 | 1,012.41 | 1,268.50
Medium | \(k\) = 5 | 995.89 | 944.87 | 1,077.33 | 987.57 | 947.71 | 977.34 | 1,244.89
Medium | \(k\) = 10 | 989.87 | 983.09 | 1,049.69 | 970.36 | 951.76 | 988.14 | 1,114.62
Expert | \(k\) = 3 | 1,019.22 | 992.38 | 1,334.57 | 1,384.34 | 961.28 | 1,214.4 | 1,473.06
Expert | \(k\) = 5 | 1,012.86 | 981.26 | 1,187.35 | 1,170.67 | 961.43 | 1,067.73 | 1,242.29
Expert | \(k\) = 10 | 1,004.66 | 970.58 | 1,089.42 | 1,079.82 | 960.90 | 970.29 | 1,116.36

Methods are offline trained with datasets of different qualities. The bold numbers indicate the best performance across all methods.
5.2.4 Online Learning (RQ2).
Given the cost of training an RL-based recommender by interacting with users, in the above setting, \(\rm M^{3}Rec\) is firstly offline trained with logged user data and then deployed with simulated cold-start users for online evaluation, which demonstrates superior online evaluation performance. An intriguing question arises: when the offline trained \(\rm M^{3}Rec\) is utilized as initialization for subsequent online learning with these cold-start users, how does it perform? (RQ2) To address this, we execute online learning and assess its performance via interaction with meta-test users. This methodology presents a practical approach for enhancing the performance of the already offline-trained RL model. During each online learning iteration, we interact with 500 meta-test users, with the time budget per user set at 120. We utilize the offline learned model in the expert dataset for online learning (other dataset types show similar results). As depicted in Figure 2, our method consistently improves its performance with increased online interactions. These results demonstrate that our \(\rm M^{3}Rec\) method trained using offline-logged user data offers a valuable initialization, thus boosting subsequent online learning performance with these cold-start users.
Figure 2.
5.3 Offline Evaluation with Real-World Datasets
5.3.1 Datasets.
We further validate the effectiveness of our proposed method with two widely used real-world recommendation datasets. Table 2 provides a statistical overview of these datasets, which we describe briefly below.
—
MovieLens. We use the MovieLens-1M dataset [22], drawn from the domain of movie user–item interactions. Consistent with established practices [23, 54], numeric ratings are transformed into implicit feedback, signifying whether a user has rated a particular item.
—
Last.fm. This dataset is a collection of user–song interactions reflecting users' listening habits up to 5 May 2009. We use the last month of data for the cold-start recommendation task.
Table 2. Statistics of the Real-World Datasets

Datasets | # of users | # of items | # of interactions | Density
MovieLens | 6,040 | 3,706 | 1,000,209 | 4.47%
Last.fm | 613 | 150,826 | 426,203 | 0.46%
For both datasets, we randomly sample 70% users as the warm users in the training set, 10% users in the validation set, and the remaining 20% users as cold-start users in the test set.
5.3.2 Experiment Setting.
In both datasets, we utilize attribute information of users as the context information to infer user context variables. This approach aligns with real-world scenarios, wherein only user profile information is available for cold-start users. Specifically, the attributes used are (gender, age, and occupation) for MovieLens, and (gender, age, and country) for Last.fm datasets.
These real-world datasets do not include user reward information, rendering the Meta-PG and BCQ baseline methods inapplicable. Similarly, the MeLU baseline method is not applicable, as it requires user–item interactions from cold-start users as a support set for fine-tuning. Therefore, we compare against the remaining baseline methods.
During the meta-test stage, sequential recommendations are performed on the test users, and the performance is evaluated using two widely accepted metrics: hit ratio (HR) and normalized discounted cumulative gain (NDCG) of top \(k\) recommended items [52, 69]. We report HR@\(k\) and NDCG@\(k\) for \(k=10,20,50\). HR@\(k\) measures the average proportion of preferred items appearing in the top-\(k\) recommendation list, defined as
where \(\mathcal{J}_{u_{i},j}\) and \(p_{u_{i},j}\) represent the top-\(k\) recommendation list and the preferred item at timestamp \(j\) (relative time order) for cold-start user \(u_{i}\), respectively. NDCG@\(k\) evaluates the recommendation performance from a ranking perspective, defined as
where \(rel_{u_{i},j,m}\) is the relevance of the item at position \(m\) in \(\mathcal{J}_{u_{i},j}\). \(rel_{u_{i},j,m}=1\) if the item at position \(m\) coincides with user \(u_{i}\)'s preferred item at timestamp \(j\); otherwise \(rel_{u_{i},j,m}=0\). IDCG@\(k\) is the ideal DCG value over all possible recommendation lists of length \(k\), serving as a normalizer.
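For the single-preferred-item-per-step setting described above, the two metrics can be computed as in the following sketch; the inputs are illustrative.

```python
import math

def hr_ndcg_at_k(recommendations, preferred, k):
    """Compute HR@k and NDCG@k when each timestamp has a single preferred item.

    `recommendations[j]` is the ranked top-k list at step j and `preferred[j]` is the
    user's preferred item at that step (illustrative inputs)."""
    hits, ndcg = 0.0, 0.0
    for rec_list, target in zip(recommendations, preferred):
        top_k = rec_list[:k]
        if target in top_k:
            hits += 1.0
            rank = top_k.index(target) + 1            # 1-based position m in the list
            # With one relevant item per step, IDCG@k = 1 / log2(1 + 1) = 1.
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(preferred)
    return hits / n, ndcg / n

hr, ndcg = hr_ndcg_at_k([[3, 9, 5], [1, 2, 7]], [9, 4], k=3)   # HR@3 = 0.5 here
```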
5.3.3 Offline Evaluation Results (RQ1).
To address RQ1, besides the simulated online evaluation, we further present an overall comparison with the baseline methods on these two real-world datasets for the cold-start recommendation task. The offline evaluation results for the MovieLens and Last.fm datasets are illustrated in Tables 3 and 4, respectively. An observable trend is the superior performance of our proposed method in comparison to all baseline methods, across all datasets and a variety of evaluation metrics. For example, in the Last.fm dataset, when \(k=10\), our method outperforms the best baseline method by 15.58% and 17.15% on the NDCG and HR evaluation metrics, respectively. When comparing the evaluation results across the datasets, we found that the performance of all the compared methods on the Last.fm dataset is lower than that of the MovieLens dataset. The potential reason is the sparsity of the Last.fm dataset as shown in Table 2. Nevertheless, even with the more challenging Last.fm dataset for cold-start recommendation, our method outperforms the baseline methods by a larger margin compared to its performance on the MovieLens dataset. These results demonstrate the effectiveness of our method for the cold-start recommendation problem.
Table 3. Offline Evaluation Results on MovieLens Dataset

Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0841 | 0.0833 | 0.0856 | 0.0866
HR@10 | 0.1679 | 0.1690 | 0.1710 | 0.1753
NDCG@20 | 0.1091 | 0.1094 | 0.1114 | 0.1134
HR@20 | 0.2677 | 0.2727 | 0.2738 | 0.2816
NDCG@50 | 0.1431 | 0.1440 | 0.1461 | 0.1495
HR@50 | 0.4394 | 0.4470 | 0.4486 | 0.4637

The bold numbers indicate the best performance across all methods.
Table 4. Offline Evaluation Results on Last.fm Dataset

Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0318 | 0.0077 | 0.0207 | 0.0368
HR@10 | 0.0427 | 0.0113 | 0.0354 | 0.0500
NDCG@20 | 0.0326 | 0.0085 | 0.0219 | 0.0375
HR@20 | 0.0458 | 0.0144 | 0.0397 | 0.0531
NDCG@50 | 0.0332 | 0.0094 | 0.0230 | 0.0379
HR@50 | 0.0487 | 0.0191 | 0.0455 | 0.0548

The bold numbers indicate the best performance across all methods.
In subsequent sections, we delve deeper into the analysis of the proposed \(\rm M^{3}Rec\) method, using the Last.fm dataset as a representative example, which is more challenging for the cold-start recommendation task.
5.3.4 Performance on Initial Interactions of Test Users (RQ3).
In real-world cold-start recommendation scenarios, it is desirable for the recommender system to demonstrate promising performance on the initial interactions of cold-start users, to deter user attrition. With this consideration in mind, to answer RQ3, we further report the averaged recommendation performance over the earliest \(l\) interactions of the test users, where \(l\) ranges from 3 to 20. As depicted in Figure 3, our method consistently surpasses the baseline methods across all evaluation metrics. Notably, the margin of improvement is particularly large when \(l\) is small (e.g., \(l=3\)), implying that our method excels at providing superior recommendations during the earliest interactions of cold-start users.
Figure 3.
5.3.5 Ablation Study (RQ4).
To probe RQ4, we conduct an ablation study aimed at assessing the impact of \(\rm M^{3}Rec\)'s essential components on the recommendation performance. Specifically, we remove the meta-learning component (i.e., the user context encoder) and the mutual information regularizer separately. The results of the ablation study are presented in Table 5. Upon the removal of either the meta-learning component or the mutual information regularizer, a performance drop is observed in our \(\rm M^{3}Rec\) model. This ablation study underscores the necessity of the meta-learning method and the mutual information regularizer within our proposed model.
Table 5. Ablation Study on Last.fm Dataset

Metric | \(\rm{M^{3}Rec}\) | Without meta-learning | Without mutual information regularizer
NDCG@10 | 0.0368 | 0.0349 | 0.0354
HR@10 | 0.0500 | 0.0477 | 0.0473
NDCG@20 | 0.0375 | 0.0358 | 0.0361
HR@20 | 0.0531 | 0.0510 | 0.0501
NDCG@50 | 0.0379 | 0.0363 | 0.0366
HR@50 | 0.0548 | 0.0535 | 0.0527
5.3.6 Impact of the User Model Rollout Length (RQ5).
In our model-based RL framework, the recommendation agent interacts with the user model for a certain number of steps during training, referred to as the user model rollout. In response to RQ5, which concerns the sensitivity of our proposed \(\rm M^{3}Rec\) model to the user model rollout length, we varied this length during training. Figure 4 illustrates that both too small and too large rollout lengths lead to inferior performance. Our model achieves the best performance across a range of evaluation metrics when a moderate user model rollout length (i.e., 10) is employed.
Figure 4.
6 Conclusion
In this article, we presented a new approach to addressing the cold-start problem in RL-based recommendations by developing a context-aware offline meta-level model-based RL method. This method incorporates a user context variable designed to infer user preferences for adapting to new users. Within the context of the meta-learned model-based RL framework, we proposed to recover user policy and reward via an IRL approach, which is conditioned on the user context variable. This meta-level user model is employed to aid in training the context-aware recommendation agent, facilitating adaptation to new users who have limited contextual information or user–item interaction records. To address the challenge posed by the offline training of the proposed model, we further introduced a mutual information constraint between the user model and the recommendation agent. Alongside extensive simulated online and offline evaluations, which demonstrate the effectiveness of our approach, we also provided a theoretical analysis of the recommendation performance bound of the developed method.
Appendix A Theoretical Analysis
We first provide some related lemmas used in the proof of Lemma 1.
To characterize the performance difference of the recommendation policy under the approximated user model \(u_{test}^{m}\) in our meta-level model-based RL framework and the true user model \(u_{test}^{w}\), we first introduce some definitions.
The first definition, \(\mu_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a)\), denotes the discounted state-action visitation distribution when executing recommendation policy \(\pi_{\theta}\) in the approximated user model \(u_{test}^{m}\); the corresponding distribution under the true user model \(u_{test}^{w}\) is denoted \(\mu_{u_{test}^{w}}^{\pi_{\theta}}(\boldsymbol{s},a)\). We further define the marginal state-action distribution at time \(t\) when executing recommendation policy \(\pi_{\theta}\) under the true user model \(u_{test}^{w}\), \(\mu_{u_{test}^{w}}^{\pi_{\theta},t}(\boldsymbol{s},a)=\Pr(\boldsymbol{s}_{t}=\boldsymbol{s},a_{t}=a\mid\pi_{\theta},u_{test}^{w})\); \(\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)\) is defined analogously when following policy \(\pi_{\theta}\) in the approximated user model \(u_{test}^{m}\).
Now, we formally introduce Lemma 3, which bounds the performance difference of the recommendation policy \(\pi_{\theta}\) under \(u_{test}^{m}\) and \(u_{test}^{w}\) caused by the approximation error of \(u_{test}^{m}\) with respect to \(u_{test}^{w}\).
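For orientation, a standard bound of this type from the model-based RL literature (cf. the model-based policy optimization analysis of Janner et al., cited in the references) is sketched below. It assumes rewards bounded by \(R_{\max}\), writes \(J_{u}(\pi_{\theta})\) for the expected discounted return of \(\pi_{\theta}\) under user model \(u\), and measures the model error in total variation; it is a generic sketch under these assumptions, not necessarily the exact statement or constants of Lemma 3:
\[
\big|J_{u_{test}^{w}}(\pi_{\theta})-J_{u_{test}^{m}}(\pi_{\theta})\big|\;\leq\;\frac{2\gamma R_{\max}}{(1-\gamma)^{2}}\,\epsilon_{m},
\qquad
\epsilon_{m}=\max_{t}\,\mathbb{E}_{(\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}}\Big[D_{TV}\big(u_{test}^{w}(\cdot\mid\boldsymbol{s},a),\,u_{test}^{m}(\cdot\mid\boldsymbol{s},a)\big)\Big].
\]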
Appendix B Online Evaluation Details
For the online evaluation, we utilize an open-sourced simulator [26] for recommender systems. The simulator's parameters include the user sensitivity, the user-specific memory discount, the noise standard deviation, and the mean and standard deviation of the kale response and of the choc response, all of which affect user preferences. We use the simulator's default values for the standard deviations of the kale and choc responses. The remaining parameters are chosen from sets of values to configure different users; these sets are listed below for meta-training and meta-test, respectively (a short enumeration sketch follows the list).
For meta-training users:
—
user sensitivity: \([0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1]\).
—
noise standard deviation: \([0.035,0.045,0.055,0.065]\).
—
the mean of kale response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
—
the mean of choc response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
Each user is configured with a combination of these parameters.
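To make the configuration procedure concrete, the following minimal Python sketch enumerates candidate meta-training user configurations from the parameter grids above. The dictionary keys are illustrative names only and do not correspond to the simulator's actual API.

# Minimal sketch (assumed key names, not the simulator's actual API):
# enumerate meta-training user configurations from the parameter grids above.
from itertools import product

user_sensitivity = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]
noise_std = [0.035, 0.045, 0.055, 0.065]
kale_mean = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
choc_mean = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]

meta_train_users = [
    {
        "user_sensitivity": s,
        "noise_std": n,
        "kale_mean": k,
        "choc_mean": c,
        # Standard deviations of the kale/choc responses keep the simulator defaults.
    }
    for s, n, k, c in product(user_sensitivity, noise_std, kale_mean, choc_mean)
]

print(len(meta_train_users))  # 10 * 4 * 7 * 7 = 1960 candidate configurations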
References
Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML ’04), 1.
Xueying Bai, Jian Guan, and Hongning Wang. 2019. Model-based reinforcement learning with adversarial training for online recommendation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 10735–10746.
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. 1991. Learning a synaptic learning rule. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks II, Vol. 2, 969.
Jiangxia Cao, Jiawei Sheng, Xin Cong, Tingwen Liu, and Bin Wang. 2022. Cross-domain recommendation to cold-start users via variational information bottleneck. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2209–2223.
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019a. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining, 456–464.
Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, and Ed Chi. 2022. Off-policy actor-critic for recommender systems. In Proceedings of the 16th ACM Conference on Recommender Systems, 338–349.
Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019b. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the International Conference on Machine Learning (ICML), 1052–1061.
Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2 (2013), 1–142.
Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. 2020. MAMO: Memory-augmented meta-optimization for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 688–697.
Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL\({}^{2}\): Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779. Retrieved from https://arxiv.org/abs/1611.02779
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017a. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, 1126–1135. arXiv:1703.03400. Retrieved from https://arxiv.org/abs/1703.03400
Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. 2019a. Benchmarking batch deep reinforcement learning algorithms. arXiv:1910.01708. Retrieved from https://arxiv.org/abs/1910.01708
Scott Fujimoto, David Meger, and Doina Precup. 2019b. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning. PMLR, 2052–2062.
Seyed Kamyar Seyed Ghasemipour, Shixiang Gu, and Richard S. Zemel. 2019a. SMILe: Scalable meta inverse reinforcement learning through context-conditional policies. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Seyed Kamyar Seyed Ghasemipour, Richard S. Zemel, and Shixiang Gu. 2019b. A divergence minimization perspective on imitation learning methods. In Proceedings of the Conference on Robot Learning (CoRL), 1259–1277.
Lei Guo, Li Tang, Tong Chen, Lei Zhu, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2021. DA-GCN: A domain-aware attentive graph convolution network for shared-account cross-domain sequential recommendation. arXiv:2105.03300. Retrieved from https://arxiv.org/abs/2105.03300
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5, 4 (2015), 1–19.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182.
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. arXiv:1511.06939. Retrieved from https://arxiv.org/abs/1511.06939
Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. 2019a. RecSim: A configurable simulation platform for recommender systems. arXiv:1909.04847. Retrieved from https://arxiv.org/abs/1909.04847
Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Deepak Chandra, and Craig Boutilier. 2019b. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2592–2599.
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 12519–12530.
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1179–1191.
Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1073–1082.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643. Retrieved from https://arxiv.org/abs/2005.01643
Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. 2015. DJ-MC: A reinforcement-learning agent for music playlist recommendation. arXiv:1401.1880. Retrieved from https://arxiv.org/abs/1401.1880
Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1563–1573.
Zhongqi Lu and Qiang Yang. 2016. Partially observable Markov decision process for recommender systems. arXiv:1608.07793. Retrieved from https://arxiv.org/abs/1608.07793
Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proceedings of the Web Conference 2020, 463–473.
Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. 2018. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 7559–7566.
Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML), 278–287.
Andrew Y. Ng and Stuart J. Russell. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 663–670.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the Advances in Neural Information Processing Systems, 271–279.
Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3421–3430.
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. 2019. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv:1810.00821. Retrieved from https://arxiv.org/abs/1810.00821
Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the 11th ACM Conference on Recommender Systems, 130–137.
Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. 2020. A game theoretic framework for model based reinforcement learning. In Proceedings of the International Conference on Machine Learning, 7953–7963.
Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. 2019. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the International Conference on Machine Learning (ICML), 5331–5340.
Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. 2021. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 34, 11702–11716.
Martin Riedmiller. 2005. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning. Springer, 317–328.
Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval, Vol. 39. Cambridge University Press, Cambridge.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 3483–3491.
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1441–1450.
Siyu Wang, Xiaocong Chen, Dietmar Jannach, and Lina Yao. 2023. Causal decision transformer for recommender systems via offline reinforcement learning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1599–1608.
Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. 2021. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 5382–5390.
Hanrui Wu, Jinyi Long, Nuosi Li, Dahai Yu, and Michael K. Ng. 2022. Adversarial auto-encoder domain adaptation for cold-start recommendation with positive and negative hypergraphs. ACM Transactions on Information Systems 41, 2 (2022), 1–25.
Teng Xiao and Donglin Wang. 2021. A general offline reinforcement learning framework for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 4512–4520.
Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M. Jose. 2020. Self-supervised reinforcement learning for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 931–940.
Lantao Yu, Tianhe Yu, Chelsea Finn, and Stefano Ermon. 2019. Meta-inverse reinforcement learning with probabilistic context variables. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 11772–11783.
Runsheng Yu, Yu Gong, Xu He, Bo An, Yu Zhu, Qingwen Liu, and Wenwu Ou. 2020a. Personalized adaptive meta learning for cold-start user preference prediction. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 10772–10780.
Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, and Lawrence Carin. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 15214–15224.
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1040–1048.
Vincent W. Zheng, Yu Zheng, Xing Xie, and Qiang Yang. 2010. Collaborative location and activity recommendations with GPS history data. In Proceedings of the 19th International Conference on World Wide Web, 1029–1038.
Feng Zhu, Yan Wang, Chaochao Chen, Jun Zhou, Longfei Li, and Guanfeng Liu. 2021. Cross-domain recommendation: Challenges, progress, and prospects. arXiv:2103.01696. Retrieved from https://arxiv.org/abs/2103.01696
Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-LSTM. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 17, 3602–3608.