
M3Rec: A Context-Aware Offline Meta-Level Model-Based Reinforcement Learning Approach for Cold-Start Recommendation

Published: 19 August 2024

Abstract

Reinforcement learning (RL) has shown great promise in optimizing long-term user interest in recommender systems. However, existing RL-based recommendation methods need a large number of interactions for each user to learn the recommendation policy. The challenge becomes more critical when recommending to new users who have a limited number of interactions. To that end, in this article, we address the cold-start challenge in the RL-based recommender systems by proposing a novel context-aware offline meta-level model-based RL approach for user adaptation. Our proposed approach learns to infer each user's preference with a user context variable that enables recommendation systems to better adapt to new users with limited contextual information. To improve adaptation efficiency, our approach learns to recover the user choice function and reward from limited contextual information through an inverse RL method, which is used to assist the training of a meta-level recommendation agent. To avoid the need for online interaction, the proposed method is trained using historically collected offline data. Moreover, to tackle the challenge of offline policy training, we introduce a mutual information constraint between the user model and recommendation agent. Evaluation results show the superiority of our developed offline policy learning method when adapting to new users with limited contextual information. In addition, we provide a theoretical analysis of the recommendation performance bound.

1 Introduction

Recent years have witnessed great interest in developing reinforcement learning (RL)-based recommender systems [2, 8], which can effectively model and optimize users' long-term interests. In RL-based methods, the recommendation policy is learned by leveraging the collected interactions between users and the recommender system. As different users may have different interests, conventional RL-based methods need to learn a separate policy for each user, which requires a large number of interactions from every individual user. However, it is very difficult and expensive to obtain enough user–recommender interactions to train a robust recommendation policy. The challenge becomes even more critical for cold-start users, who have very few interactions yet are prevalent in many recommender systems. Therefore, it is imperative to learn a recommendation policy that can infer users' preferences and quickly adapt to cold-start users with limited information.
When dealing with cold-start users, the available information is often limited to a few user attributes (e.g., gender, age) and perhaps minimal user–item interactions. The challenge lies in learning user preferences from this scant information. To address this issue, meta-learning approaches [4, 12] have been applied to cold-start recommendation [33, 37]. Meta-learning seeks to learn general knowledge across diverse tasks and adapt that knowledge to new tasks using minimal samples. Within the context of cold-start recommendation, each user's recommendation can be viewed as a unique task. During the meta-training phase, meta-learning facilitates the understanding of users' general preferences and adapts these to new users with limited data during the meta-testing phase. However, it is still very challenging to apply meta-learning to RL-based recommender systems for the following reasons. First, inferring user preferences requires a significant number of interactions sampled from user distributions, which is very difficult to obtain for new users. In extreme cold-start scenarios, the only available information might be user attributes such as gender and age, with no user–item interactions, so the meta-learning method cannot adapt to new users by fine-tuning with a few interactions [12]. This complicates the task of inferring user preferences. Second, to meta-train a recommendation policy, traditional model-free RL methods need interactions with users. Recent methods [2, 8] utilize model-based RL [9, 41] approaches to sidestep the sample efficiency challenge by leveraging offline-logged user data to model the environment. However, they still require large amounts of offline data from each user to build the environment model, i.e., the individual user model. Third, it is challenging to learn the user model and recommendation policy from offline-logged user data without online interaction with real users. During offline learning, the recommendation policy might suggest recommendations for which the estimated user model cannot provide accurate feedback. This issue represents a fundamental challenge for offline RL due to the distribution shift between user–item interaction data generated by the policy-user model and the existing offline data.
To address the aforementioned challenges, we propose a novel context-aware offline meta-level model-based RL method, specifically designed for tackling the cold-start problem in RL-based recommender systems. To address the first challenge, inspired by the context-based meta-learning approach [11, 49], we introduce a user context variable to infer a user's preferences from the available contextual information, such as user attributes or a limited number of user–item interactions. In response to the second challenge, we employ meta-learning to train a model-based RL model by conditioning both the user model and the recommendation policy on the user context variable. This approach facilitates the learning of meta-level knowledge, i.e., general user preferences. During the meta-training phase, by conditioning on the user context variable, the user model (meta-level user model) learns to estimate the user choice function and the user reward function using the inverse RL (IRL) approach across a broad spectrum of users. In the model-based RL framework, the recommendation agent conditioned on the user context variable (meta-level recommendation agent) is trained through interaction with the meta-level user model. After the meta-training phase, the meta-level recommendation agent can adapt to a new cold-start user by using the user's contextual information that is embedded in the user context variable. To tackle the third challenge arising from offline RL training of our proposed method, we adhere to the principle of conservatism, a concept extensively adopted in the offline RL literature [32, 50]. We propose incorporating a mutual information regularizer between the user model and the recommendation agent. This approach is designed to encourage the recommendation agent to make recommendations within the user model's confidence region, thereby improving the accuracy of feedback.
We conduct intensive experiments to evaluate our developed method. The evaluations include both a simulated online experiment and an offline experiment. The online experiment is carried out in a simulated environment using an open-source recommender system simulator [26], which provides sequential interaction with users. The offline experiment is performed with two widely used datasets. In both online and offline experiments, we evaluate our method against several state-of-the-art baselines using multiple evaluation metrics. All the evaluation results consistently demonstrate the superiority of our developed method.
The main contributions of this work can be summarized as follows:
We propose a novel context-aware offline meta-level model-based RL method to address the cold-start problem of RL-based recommender systems, which is trained using logged offline data.
Within our framework, we introduce a user context variable to infer user preference that enables adaptation on cold-start users with limited contextual information.
Both the user model and the recommendation agent are meta-learned by conditioning on this user context variable within the model-based RL framework. This approach ensures effective adaptation to new users. To tackle the challenge presented by the offline RL training of our proposed method, we introduce a mutual information regularizer between the meta-level user model and the recommendation agent.
Evaluation results demonstrate the superiority of our developed method. A theoretical analysis of the recommendation performance bound of the developed method is also provided.

2 Related Work

The related work of this article can be grouped into five categories, as discussed below.
RL-Based Recommender System. There are mainly two kinds of RL methods for recommendation: model-free RL methods [6, 27, 36, 38, 62, 66, 68] and model-based RL methods [2, 8, 67]. Model-free RL methods treat the environment as unknown and do not explicitly model the user. They usually need large amounts of interactions for policy optimization. To tackle this sample complexity challenge, model-based RL methods incorporate a user model, which can predict user behavior and reward. For instance, the generative adversarial user model [8] learns the user behavior model and reward function together in a unified min-max framework; the recommendation policy is then learned with reward from the trained user model. However, this model requires a large amount of data to estimate a particular user model, which is not feasible in the cold-start recommendation scenario. Besides, the user model and recommendation model are trained separately, which prevents them from benefiting from each other. Bai et al. [2] also proposed to use model-based RL for recommendation. They introduced a discriminator with adversarial training to let the user behavior and recommendation policy imitate the policy in logged offline data. The reward used to train the recommendation policy is weighted by the discriminator score. Their method can be seen as reward shaping [40, 42], which does not recover the true user reward function. Neither of the above two methods properly addresses the cold-start challenge in RL-based recommender systems. Recently, offline RL [34] methods have been used to learn a recommendation policy from the offline dataset [7, 39, 56, 61]. For example, Xiao et al. [61] proposed a general offline RL framework for recommendation, where supervised regularization, policy constraints, dual constraints, and reward extrapolation are introduced to minimize the distribution mismatch between the logging policy and the recommendation policy. Wang et al. [56] proposed a causal decision transformer to integrate offline RL and the transformer model. Gao et al. [17] combined causal inference with offline RL to burst filter bubbles in recommender systems. To alleviate the Matthew effect of offline RL in interactive recommendation, a state entropy term is added to relax the pessimism in the model-based offline RL algorithm [16]. In contrast with these methods, our method can recover the true user behavior and reward with a small amount of data by meta-learning the user model and recommendation model with a user context variable in a unified framework, and the mutual information regularization between the user policy and recommendation policy allows the two to benefit each other when learning from offline-logged user data.
Meta-Learning. Meta-learning aims to learn from a small amount of data and adapt quickly to new tasks [4, 12]. Context-based meta-learning approaches [11, 49] learn to infer task uncertainties by taking task experiences as input. For instance, Rakelly et al. [49] proposed to learn task context variables as probabilistic latent variables inferred from past experiences. The model-free RL policy is trained conditioned on the task variable to improve sample efficiency. In contrast, our method learns a user context variable to infer user preference within the offline model-based RL framework.
Cold-Start Recommendation. The cold-start problem of recommendation has been studied for a long time in the literature. Various approaches have been developed to address this problem [33, 35, 45, 57, 59, 71]. Among these, cross-domain recommendation techniques [5, 20, 21, 70] enhance performance in the target domain by leveraging user–item interactions from relevant source domains. However, they often rely on shared users across domains for knowledge transfer [5] and require source domain interaction data for target domain cold-start users [5], limiting their applications in our context. One particular type of method, which has been developed recently and is very related to this study, utilizes the meta-learning technique [10, 33, 64] to tackle the cold-start challenge. Generally speaking, these methods regard users as tasks and items as classes and use gradient-based model-agnostic meta-learning (MAML) [12] algorithm to enable fast adaptation to new users with few interactions. However, how to tackle the cold-start challenge of RL-based recommendations is still underexplored, which is indeed the research focus of this article.
IRL. IRL is the problem of learning reward functions from demonstrations [1, 43, 46], which avoids the need for reward engineering. For instance, Fu et al. [13] proposed an adversarial IRL (AIRL) framework to recover the true reward functions from demonstrations. IRL typically needs a large number of expert demonstrations to infer the true reward function, which is highly expensive in areas such as robotics. Recently, some works [18, 63] attempt to recover the reward function from a limited number of demonstrations with meta-IRL methods that incorporate context-based meta-learning into the AIRL framework. Comparatively, in our solution, we recover the user policy and reward function from offline user behavior data by leveraging the meta-IRL method. To better incorporate the user context information into the policy, we utilize a variational policy network conditioned on the user context variable. Besides, the meta-IRL-learned user model serves as the environment in our meta-level model-based RL framework.
Offline RL. Offline RL [34, 51] aims to learn policies from a static dataset consisting of past interactions with the environment. Most offline RL methods use constraints on the learned policy to prevent the policy from drifting away from offline data support. For instance, Wu et al. [60] use Kullback–Leibler (KL) divergence to regularize the learned policy to be closer to the behavior policy. Recently, Yu et al. [65] and Kidambi et al. [29] utilize uncertainty estimation as a reward penalty to constrain policy learning. Different from existing methods, we not only provide constraints on the policy with a novel mutual information regularization but also encourage policy adaptation with meta-learning.

3 Preliminaries

3.1 RL-Based Recommender Systems

Recent studies have demonstrated the potential of RL-based recommender systems in enhancing long-term user engagement [2, 8]. Such a system relies on an RL-based recommendation agent that regularly interacts with users. During each interaction, the agent presents the user with a list of recommendations. The user then chooses an item from the list and provides feedback to the agent, which is typically a measure of user utility or satisfaction and is treated as the reward for the agent.
Formally, the RL-based recommendation problem is modeled as a Markov decision process (MDP) with the following components:
Environment: Each user represents an environment that the recommendation agent interacts with. This environment provides feedback to the recommendation agent, encompassing the user's choice and reward.
State Space \(\boldsymbol{\mathcal{S}}\): \(\boldsymbol{s}_{t}\in\mathcal{S}\) is the user's historical clicks before time \(t\).
Action \(\boldsymbol{\mathcal{A}}\): The action \(\boldsymbol{A}_{t}\in\mathcal{A}\) corresponds to the top-\(k\) recommendation list generated by the recommendation agent at time \(t\).
State Transition Probability \(\boldsymbol{\mathcal{P}}\): \(p(\boldsymbol{s}_{t+1}|\boldsymbol{s}_{t},\boldsymbol{A}_{t})\) is the probability of transitioning from state \(\boldsymbol{s}_{t}\) to \(\boldsymbol{s}_{t+1}\) based on the recommendation list \(\boldsymbol{A}_{t}\). This also signifies the probability of the user's choice \(x_{t}\) from the recommendation list \(\boldsymbol{A}_{t}\) according to the user's choice or policy function.
Reward \(\boldsymbol{\mathcal{R}}\): \(r(\boldsymbol{s}_{t},\boldsymbol{A}_{t},x_{t})\) is the immediate reward for the agent's action \(\boldsymbol{A}_{t}\), which represents the user's utility or satisfaction after choosing \(x_{t}\in\boldsymbol{A}_{t}\) at state \(\boldsymbol{s}_{t}\).
Recommendation Policy \(\boldsymbol{\pi}\): \(\boldsymbol{A}_{t}\sim\pi(\boldsymbol{A}_{t}|\boldsymbol{s}_{t})\) is the recommendation agent's recommendation list given state \(\boldsymbol{s}_{t}\).
The recommendation agent's goal is to learn a policy \(\pi\) that maximizes the expected cumulative reward, symbolizing long-term user utility or satisfaction: \(\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r\left(\boldsymbol{s}_{t}, \boldsymbol{A}_{t},x_{t}\right)\right]\), with \(\gamma\) as the discount factor.
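As a concrete illustration of this objective, the following sketch (in Python; the `agent` and `user_env` objects and their `recommend`, `reset`, and `step` methods are hypothetical placeholders, not interfaces defined in this article) accumulates the discounted return over one interaction episode.

```python
def discounted_return(agent, user_env, gamma=0.99, max_steps=100):
    """Roll out the recommendation agent against a user environment and
    accumulate the discounted cumulative reward (long-term user utility)."""
    state = user_env.reset()                      # s_0: the user's click history so far
    total = 0.0
    for t in range(max_steps):
        slate = agent.recommend(state)            # A_t ~ pi(A_t | s_t), a top-k list
        click, reward, state, done = user_env.step(slate)  # user picks x_t, returns r_t and s_{t+1}
        total += (gamma ** t) * reward
        if done:                                  # e.g., the user's time budget is exhausted
            break
    return total
```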

3.2 Problem Statement

Our article primarily concentrates on the cold-start problem within the RL-based recommender system context. The data used to train the recommendation model are user–item interactions. We have a set of users \(\mathcal{U}=\left\{\mathcal{U}^{warm},\mathcal{U}^{cold}\right\}\), where \(\mathcal{U}^{warm}\) and \(\mathcal{U}^{cold}\) represent warm users and cold-start users, respectively. Importantly, there is no overlap between these two groups of users. The item set is denoted as \(\mathcal{V}=\left\{v_{1},\ldots,v_{j},\ldots,v_{|\mathcal{V}|}\right\}\). For every user \(u\in\mathcal{U}\), the sequence of interacted items is recorded as \(P_{u}=\left\{x_{1},\ldots,x_{t},\ldots,x_{|P_{u}|}\right\}\), which is maintained in chronological order. Additionally, each user \(u\) is associated with contextual information \(C_{u}\) such as user attributes \(u_{att}=\left\{u^{a}_{1},\ldots,u^{a}_{j},\ldots,u^{a}_{|u_{att}|}\right\}\). With the above setup, we can formally state the cold-start recommendation problem as follows:
Definition 1
(Problem Statement). The cold-start recommendation problem involves training the recommendation policy with offline warm users’ behavior data \(\left\{P_{u}:u\in\mathcal{U}^{warm}\right\}\) and subsequently making recommendations for a cold-start user \(u\in\mathcal{U}^{cold}\) using the user's contextual information \(C_{u}\).

4 Method

In this section, we present the proposed context-aware Mutual information regularized Meta-level Model-based RL approach for cold-start Recommendation, denoted as \(\rm M^{3}Rec\). We first introduce the overall framework and then elaborate on the details of the proposed method.

4.1 Overview

Our proposed model's architecture is depicted in Figure 1, comprising four key modules: the user context encoder, the meta-level user model, the meta-level recommendation agent, and the mutual information regularizer. The model adopts a meta-learning perspective to address the cold-start issue inherent in RL-based recommendations. With an aim to adapt the user model and recommendation agent for individual users who have limited contextual information, we employ the context-based meta-learning approach [11, 49]. The user context encoder learns to derive user context variable by inferring the user's preference from the contextual information of each user. Since both the state transition probability function \(p(s_{t+1}|s_{t},a_{t})\) and the reward function \(r(s_{t},a_{t},x_{t})\) are unknown, we adopt a model-based RL framework to explicitly model the user behavior dynamics from the offline data. This structure integrates the meta-level user model and the meta-level recommendation agent. Conditioning both the user model and recommendation agent on the user context variable enables these components to be meta-learned, which effectively enhances their adaptability to cold-start users. Specifically, within the model-based RL framework, the optimization of the meta-level recommendation agent is facilitated through interaction with the meta-level user model. The meta-level user model serves to approximate individual user dynamics, offering reward feedback to the meta-level recommendation agent. Accordingly, the meta-level recommendation agent is optimized utilizing the reward provided by the meta-level user model. Lastly, the model's training utilizes offline-logged user data and does not involve expensive online interactions with real users. To avoid accessing out-of-distribution data that exceed the support of the collected offline user data, we introduce a mutual information regularizer. This component establishes a link between the meta-level user model and meta-level recommendation agent, adhering to the principle of conservatism in offline RL literature [32, 50].
Figure 1.
Figure 1. Mutual information regularized meta-level model-based RL for cold-start recommendation (\(\rm M^{3}Rec\)) framework.

4.2 User Context Encoder

To learn a context variable for adaptation to different users, we adopt a user context encoder to summarize the user contextual information \(C_{u_{i}}\) into a context variable \(c_{u_{i}}\) for user \(u_{i}\). We first transform the raw context information into an embedding representation and then employ a deep neural network to obtain the user context variable
\[c_{u_i} = f^{cont}({\textbf{E}}(C_{u_i})),\]
where \(\textbf{E}\) is the embedding layer and \(f^{cont}\) is a multi-layer fully connected neural network with rectified linear unit (ReLU) activation functions.
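A minimal sketch of this encoder is shown below (PyTorch); the layer sizes and the mean-pooling of attribute embeddings are illustrative assumptions rather than details specified above.

```python
import torch
import torch.nn as nn

class UserContextEncoder(nn.Module):
    """Embedding layer E followed by a fully connected ReLU network f^cont,
    mapping raw user context information C_u to the context variable c_u."""
    def __init__(self, num_attribute_values, emb_dim=16, hidden_dim=128, ctx_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_attribute_values, emb_dim)   # E(.)
        self.f_cont = nn.Sequential(                               # f^cont(.)
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, ctx_dim),
        )

    def forward(self, attribute_ids):
        # attribute_ids: (batch, num_attributes) integer-coded attributes (e.g., gender, age)
        pooled = self.embed(attribute_ids).mean(dim=1)             # pool attribute embeddings
        return self.f_cont(pooled)                                 # user context variable c_u
```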

4.3 Meta-Level User Model

As the context-aware user choice and reward function are unknown, we aim to recover both the user choice and reward function from offline data in the meta-level user model. Considering limited context information from the cold-start users, we propose to utilize the context-based meta-learning [11, 49] to estimate the user behavior model from offline data. We first introduce the user choice or policy function \(\pi(x_{t}|\boldsymbol{s}_{t},\boldsymbol{A}_{t},c)\), which characterizes how the user chooses an item from the provided recommendation list. Then, we utilize the IRL method to recover the user choice function and the reward function without manually specifying the reward function.
Given the \(i\)th user's historical clicked item sequence before time \(t\), \(\{x_{i,1},x_{i,2},\ldots,x_{i,t-1}\}\), we first transform the clicked items into item embeddings \(\{\boldsymbol{e}_{i,1},\boldsymbol{e}_{i,2},\ldots,\boldsymbol{e}_{i,t-1}\}\) using the embedding matrix \(\textbf{E}^{u}\). We utilize a recurrent neural network with long short-term memory (LSTM) units [25] to summarize the state information maintained by the user model, \(\boldsymbol{s}_{i,t}^{u}\), as follows:
\[\boldsymbol{s}_{i, t}^{u}= LSTM(\boldsymbol{e}_{i, t - 1}, \boldsymbol{s}_{i, t-1}^{u}).\]
In the following notation, we omit the user index \(i\) for simplicity.
To learn a context-aware user choice function \(\pi(x_{t}|\boldsymbol{s}_{t}^{{u}},\boldsymbol{A}_{t},c)\), the user policy representation must encode the salient information of the user context variable \(c_{u_{i}}\). It is also desirable that the user policy representation can reason about the uncertainty of the user distribution so as to adapt to different users. Therefore, we adopt the variational inference approach [53] to infer a probabilistic latent user policy variable \(z_{u{,t}}\) at time \(t\), which is generated from the variational distribution \(q_{inf}(z_{u{,t}}|\boldsymbol{s}_{t}^{{u}}, c)\). To optimize the parameters for learning \(z_{u{,t}}\), we maximize the variational lower bound of \(\log p(\boldsymbol{s}_{t}^{{u}}|c)\)
\begin{align}\log p(\boldsymbol{s}_{t}^{u}|c) & \geq \mathbb{E}_{q_{inf}(z_{u,t}|\boldsymbol{s}_{t }^{u},c)}[\log p_{dec}(\boldsymbol{s}_{t}^{u}|z_{u,t},c)]\\&\quad{} - \beta D_{\mathrm{KL}}(q_{inf}(z_{u,t}|\boldsymbol{s}_{t}^{u},c)\|p(z_{u,t }|c)),\end{align}
(1)
where the first term is optimized to reconstruct current state \(\boldsymbol{s}_{t}^{{u}}\). The second term constrains the latent policy variable with a Gaussian prior.
In practice, following [53], \(q_{inf}\) and \(p_{dec}\) are parameterized by the inference neural network and decoding neural network, respectively. Specifically, the context-conditional variational autoencoder is defined as follows:
\begin{align} &\mu_{t}, \sigma_{t} = q_{inf}(\boldsymbol{s}_t^{u}, c), \\ &z_{u,t} \sim \mathcal{N}(\mu_t, \Sigma_t) ;\operatorname{diag}(\Sigma_t)=\sigma_t, \\ & \hat{\boldsymbol{s}}_t^{u} = p_{dec}(z_{u,t}, c),\end{align}
(2)
where we utilize the reparameterization trick [31] to sample \(z_{u,t}\) from \(\mathcal{N}(\mu_{t},\Sigma_{t})\). \(q_{inf}\) and \(p_{dec}\) are both three-layer multilayer perceptrons (MLPs) with ReLU activation functions. To reconstruct the current state information, we use the decoding network to predict the last click \(x_{t-1}\) in the user's historical clicks, which is the input information at time \(t\). For the latent user policy representation, we adopt \(z_{u,t}=\mu_{t}\) for stable training.
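The sketch below illustrates one way to realize the context-conditional variational autoencoder of Equation (2) together with the negative of the lower bound in Equation (1) (PyTorch); the hidden sizes, the standard Gaussian prior in place of \(p(z_{u,t}|c)\), and the cross-entropy form of the reconstruction term are our simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextCVAE(nn.Module):
    """q_inf infers a Gaussian over the latent user policy variable z_{u,t} from
    (s_t^u, c); p_dec reconstructs the current state, here as logits over the
    item set for the last click x_{t-1}. Hidden sizes are illustrative."""
    def __init__(self, state_dim, ctx_dim, z_dim, num_items, hidden=128):
        super().__init__()
        self.q_inf = nn.Sequential(                      # three-layer inference MLP
            nn.Linear(state_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),                # outputs [mu_t, log sigma_t]
        )
        self.p_dec = nn.Sequential(                      # three-layer decoding MLP
            nn.Linear(z_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_items),                # predicts the last click x_{t-1}
        )

    def forward(self, s_u, c):
        mu, log_sigma = self.q_inf(torch.cat([s_u, c], dim=-1)).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterization trick
        recon_logits = self.p_dec(torch.cat([z, c], dim=-1))
        return recon_logits, mu, log_sigma

def negative_elbo(recon_logits, last_click, mu, log_sigma, beta=1e-3):
    """Negative of the variational lower bound in Equation (1): reconstruction
    term plus a beta-weighted KL to a standard Gaussian prior."""
    recon = F.cross_entropy(recon_logits, last_click)
    kl = -0.5 * (1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp()).sum(-1).mean()
    return recon + beta * kl
```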
Given the recommendation list \(\boldsymbol{A}_{t}=\{a_{1},\cdots,a_{k}\}\), the probability of choosing item \(x_{t}\in\boldsymbol{A}_{t}\) is based on the latent user representation \(z_{u,t}\) as follows:
\begin{align*}p(x_{t}|\boldsymbol{s}_{t}^{u},\boldsymbol{A}_{t},c)=\frac{{\rm exp}(f^{cho}(f^{pref}(z_{u,t}) ||f^{rec}(\boldsymbol{e}_{x_{t}})))}{\sum_{i=1}^{\left|\boldsymbol{A}_{t}\right|}{\rm exp}(f^{ cho}(f^{pref}(z_{u,t})||f^{rec}(\boldsymbol{e}_{a_{i}})))},\end{align*}
where \(f^{pref}\) and \(f^{rec}\) are one-layer MLPs that encode the user's preference and the representation of candidate items, respectively. \(\|\) denotes the concatenation operation, and \(f^{cho}\) is a three-layer MLP with ReLU activation functions that models the user's choice from the concatenated preference and candidate item representations.
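A possible implementation of this choice function is sketched below (PyTorch); tensor shapes and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class UserChoiceHead(nn.Module):
    """Scores each candidate in the slate A_t by concatenating a preference
    encoding of z_{u,t} with the candidate item encoding, then normalizes the
    scores with a softmax to obtain p(x_t | s_t^u, A_t, c)."""
    def __init__(self, z_dim, item_dim, hidden=128):
        super().__init__()
        self.f_pref = nn.Linear(z_dim, hidden)        # one-layer preference encoder f^pref
        self.f_rec = nn.Linear(item_dim, hidden)      # one-layer item encoder f^rec
        self.f_cho = nn.Sequential(                   # three-layer choice scorer f^cho
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, slate_emb):
        # z_u: (batch, z_dim); slate_emb: (batch, k, item_dim) embeddings of A_t
        k = slate_emb.size(1)
        pref = self.f_pref(z_u).unsqueeze(1).expand(-1, k, -1)
        scores = self.f_cho(torch.cat([pref, self.f_rec(slate_emb)], dim=-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)          # choice probabilities over the slate
```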
The above describes the modeling of the user choice function \(\pi_{\phi}(x_{t}|\boldsymbol{s}_{t}^{u},\boldsymbol{A}_{t},c)\), where \(\phi\) denotes the parameters involved in the user choice modeling. We aim to recover both the actual user policy \(\pi_{\phi}\) and the reward function \(r_{\omega}\). Inspired by AIRL [13], we recover the context-aware user policy and reward function from offline data by optimizing the following objective:
\begin{align}\min_{\pi_{\phi}}\max_{D_{\omega}}\mathbb{E}_{p(\boldsymbol{A},c)}[ \mathbb{E}_{\rho^{true}(\boldsymbol{s},x|\boldsymbol{A},c)}[\log D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c) ]+\mathbb{E}_{\rho^{\pi_{\phi}}(\boldsymbol{s},x|\boldsymbol{A},c)}[\log(1-D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))]],\end{align}
(3)
where we omit the subscript \(t\) and superscript \(u\) for the ease of notation. This objective forms an adversarial game between the user policy function and the discriminator. The discriminator aims to distinguish the user behavior sampled from the learned user policy function and the real user–item interactions, while the user policy is trained to confuse the discriminator. The discriminator function takes the form
\begin{align*}D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c)=\frac{\exp(g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))}{(\exp (g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))+\pi_{\phi}(x|\boldsymbol{s},\boldsymbol{A},c))},\end{align*}
where \(g_{\omega}\) contains the reward approximator \(r_{\omega}\) and the reward shaping term \(h_{\varphi}\): \(g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c)=r_{\omega}(\boldsymbol{s},x, \boldsymbol{A},c)+\gamma h_{\varphi}(\boldsymbol{s}^{\prime})-h_{\varphi}(\boldsymbol{s})\), where \(\gamma\) is the discount factor and \(\boldsymbol{s}^{\prime}\) is the next state of state \(\boldsymbol{s}\). The reward shaping term \(h_{\varphi}\) is modeled using a one-layer MLP. The reward function \(r_{\omega}\) is modeled as follows:
\begin{align*}r_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c) & =f^{r}(x^\mathit{max}||x||c), \\x^\mathit{max} & =\operatorname{argmax}_{x\in\boldsymbol{A}}(f^{d\_pref}(\boldsymbol{s}))^{ \top}f^{d\_rec}(\boldsymbol{e}(x)),\end{align*}
where \(x^{max}\) represents the user's preferred item in the recommendation list. \(f^{d\_pref}\) and \(f^{d\_rec}\) are one-layer MLPs, and \(f^{r}\) is a three-layer MLP with ReLU activation functions. To enable gradient backpropagation through the \(\operatorname{argmax}\) operation, we utilize a softmax with temperature.
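The following fragment sketches the discriminator's functional form and a temperature-controlled softmax surrogate for the \(\operatorname{argmax}\) (PyTorch); how the candidate scores and \(g_{\omega}\) are computed upstream is assumed rather than shown.

```python
import torch

def soft_argmax_item(scores, item_embeddings, temperature=0.1):
    """Differentiable surrogate for the argmax inside r_omega: a softmax with
    temperature over candidate scores yields a soft embedding of the user's
    preferred item x^max, so gradients can propagate through the selection."""
    weights = torch.softmax(scores / temperature, dim=-1)        # (batch, k)
    return torch.einsum('bk,bkd->bd', weights, item_embeddings)  # soft x^max embedding

def airl_discriminator(g_omega, pi_prob):
    """AIRL-form discriminator D = exp(g) / (exp(g) + pi_phi(x | s, A, c)), where
    g_omega(s, x, A, c) = r_omega(s, x, A, c) + gamma * h(s') - h(s) is computed
    elsewhere from the reward approximator and the shaping network."""
    return torch.exp(g_omega) / (torch.exp(g_omega) + pi_prob)
```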
Specifically, based on Equation (3), during training, we can alternately update parameters of user policy \(\pi_{\phi}\) and discriminator \(D_{\omega}\). The objective for training the user policy \(\pi_{\phi}\) is
\begin{align} \max _{\phi} &\ \mathbb{E}_{u_{i} \sim p(\mathcal{U}),\, c_{u_i},\, \tau_{u_i} \sim \rho^{\pi_{\phi}}(\tau_{u_i} | \boldsymbol{A}, c_{u_i})} \sum_{t=1}^{T} \log D_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i}) - \log (1-D_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i})) \\ &=\mathbb{E}_{u_i \sim p(\mathcal{U}),\, c_{u_i},\, \tau_{u_i} \sim \rho^{\pi_{\phi}}(\tau_{u_i} | \boldsymbol{A}, c_{u_i})} \sum_{t=1}^{T} g_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i}) - \log \pi_{\phi}(x_t | \boldsymbol{s}_t^{u}, \boldsymbol{A}_t, c_{u_i}), \end{align}
(4)
where \(\tau_{u_{i}}\) is user \(u_{i}\)'s behavior sequence generated from user's policy \(\pi_{\phi}\) interacting with the recommendation agent. We can train the meta-level user policy \(\pi_{\phi}\) using the policy gradient (PG) algorithm [55].
The objective for training the discriminator is
\begin{align} \max_{D_{\omega}}\mathbb{E}_{u_{i}\sim p(\mathcal{U})}\bigg[ & \mathbb{E}_{c_{u_{i}},\tau_{u_{i}}\sim\rho^{\pi_{\phi}}(\tau_{u_{i}}|\boldsymbol{A},c_{u_{i}})}\sum_{t=1}^{T}\log(1-D_{\omega}(\boldsymbol{s}_{t}^{u},x_{t},\boldsymbol{A}_{t},c_{u_{i}})) \\& + \mathbb{E}_{c_{u_{i}},P_{u_{i}}\sim\rho^{true}({u_{i}})}\sum_{t=1}^{T}\log D_{\omega}(\boldsymbol{s}_{t}^{u},x_{t},\boldsymbol{A}_{t},c_{u_{i}})\bigg],\end{align}
(5)
where \(P_{u_{i}}\) is the sampled real behavior sequence of user \(u_{i}\). Similar to AIRL [13], when the context-aware user policy and discriminator are trained to optimality, we can recover the true user policy and the true reward function up to a constant, which approximates the real user model. We also utilize the offline data to estimate the meta-level user model by maximizing the likelihood, which stabilizes the training process.

4.4 Meta-Level Recommendation Agent

With the above estimated user policy function \(\pi_{\phi}\) and reward function \(r_{\omega}\) serving as the user environment model, we can learn the recommendation policy \(\pi_{\theta}\) to maximize the cumulative reward. To adapt to different users with limited context information, we also adopt the context-based meta-learning approach [11, 49] to learn a context-aware recommendation policy. Similar to the meta-level user model, we use a variational recommendation policy conditioned on the user context variable so that the recommendation policy is aware of the user preference. The latent recommendation policy variable at time \(t\), denoted as \(z_{rec,t}\), is induced from the variational distribution \(q_{inf}(z_{rec,t}|\boldsymbol{s}_{t}^{rec},c)\), where \(\boldsymbol{s}_{t}^{rec}\) is the state information maintained by the recommendation agent. We optimize the lower bound of \(\log p(\boldsymbol{s}_{t}^{rec}|c)\):
\begin{align} \log p(\boldsymbol{s}_{t}^{rec}|c)& \geq\mathbb{E}_{q_{inf}(z_{rec,t}|\boldsymbol{s }_{t}^{rec},c)}[\log p_{dec}(\boldsymbol{s}_{t}^{rec}|z_{rec,t},c)] \\&\quad{} -\beta D_{\mathrm{KL}}(q_{inf}(z_{rec,t}|\boldsymbol{s}_{t}^{rec},c)\|p(z _{rec,t}|c)).\end{align}
(6)
The model details for learning \(z_{rec,t}\) are similar to those for learning \(z_{u,t}\) in Equation (2). Then, based on the latent recommendation policy variable \(z_{rec,t}\), the agent generates a recommendation list of size \(k\). Specifically, we utilize a two-layer MLP with ReLU activation functions and softmax normalization to output a probability vector over the entire item set \(\mathcal{V}\). The items with the top-\(k\) probabilities are then selected as the recommendation list \(\boldsymbol{A}_{t}\).
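A minimal sketch of this policy head is given below (PyTorch); the hidden size and the returned slate format are illustrative choices.

```python
import torch
import torch.nn as nn

class RecommendationPolicyHead(nn.Module):
    """Two-layer ReLU MLP with softmax output over the item set V; the top-k
    most probable items form the recommendation list A_t."""
    def __init__(self, z_dim, num_items, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_items),
        )

    def forward(self, z_rec, k=5):
        probs = torch.softmax(self.net(z_rec), dim=-1)   # distribution over all items
        slate = probs.topk(k, dim=-1).indices            # indices of the top-k items: A_t
        return probs, slate
```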
The objective for training the recommendation policy is as follows:
\begin{align}\max_{\theta}\mathbb{E}_{u_{i}\sim p(\mathcal{U}),c_{u_{i}},\tau_{u_{i}}\sim \rho^{\pi_{\theta}}(\tau_{u_{i}}|\boldsymbol{A},c_{u_{i}}\!)}\sum_{t=1}^{T}r_{\omega}(\boldsymbol{s}_{t}^{rec},x_{t},\boldsymbol{A}_{t},c_{u_{i}}\!),\end{align}
(7)
which can be optimized by using a PG algorithm [55]. We also utilize the offline data with maximum likelihood training to stabilize the training process.
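As an illustration of such a PG update, the sketch below applies a plain REINFORCE step to Equation (7), weighting the log probabilities of the rolled-out actions by discounted returns computed from the user-model rewards; the absence of a baseline and the exact batching are simplifications on our part.

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE-style update: maximize the expected cumulative reward
    r_omega provided by the meta-level user model. log_probs is a list of
    per-step log probabilities (with gradients) from the rolled-out policy."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```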

4.5 Mutual Information Regularizer

Given that online interactions are very expensive and difficult to perform, the user model and recommendation policy are trained with offline data. Therefore, it is conceivable that the recommendation policy may offer recommendations for which the user model is unable to provide precise feedback. This phenomenon poses a fundamental challenge for offline RL of the proposed model due to the distribution shift between user–item interaction data produced by the policy and the existing offline data. In accordance with the principle of conservatism widely adopted in the offline RL literature [32, 50], we propose the inclusion of a mutual information regularizer between the user model and the recommendation agent. This regularizer encourages the recommendation agent to propose recommendations within the confidence region of the user model, consequently enhancing the accuracy of feedback, and it also fosters the user model's ability to offer more precise feedback on the recommendation agent's recommendations. A theoretical analysis of the influence of the mutual information regularization on recommendation performance is provided in Section 4.7.
To realize the mutual information constraint between the user model and the recommendation agent, we utilize the latent user policy variable \(z_{u{,t}}\) and recommendation policy variable \(z_{rec{,t}}\). This can be expressed by the following equation:
\begin{align}\mathcal{I}(z_{u,t};z_{rec,t})=D_{\mathrm{KL}}\left(\mathbb{P}_{z_{u,t}z_{rec, t}}\|\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}\right),\end{align}
(8)
where \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) denotes the joint distribution and \(\mathbb{P}_{z_{u,t}}\) and \(\mathbb{P}_{z_{rec,t}}\) are marginal distributions.
The aim is to maximize \(\mathcal{I}(z_{u,t};z_{rec,t})\) to model the dependency between the user model and the recommendation agent. Nevertheless, estimating mutual information in high-dimensional space is nontrivial. Inspired by [3, 44], we opt for maximizing the lower bound of the Jensen–Shannon mutual information \(\mathcal{I}^{(JSD)}(z_{u,t};z_{rec,t})\) based on Jensen-Shannon divergence to maintain stable training. This can be formulated as
\begin{align}\mathcal{I}^{(JSD)}(z_{u,t};z_{rec,t}) & \geq\sup_{\psi\in\Psi} \mathbb{E}_{\mathbb{P}_{z_{u,t}z_{rec,t}}}\left[-\mathit{sp}\left(-T_{\psi}(z_{u,t},z_{ rec,t})\right)\right] \\&\quad{} -\mathbb{E}_{\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}} \left[\mathit{sp}\left(T_{\psi}(z_{u,t},z_{rec,t})\right)\right],\end{align}
(9)
where \(T_{\psi}:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}\) is a neural network function with parameter \(\psi\) and \(\operatorname{sp}(z)=\log(1+e^{z})\) is the softplus function.
During the training process, we strive to maximize the lower bound denoted as \(\mathcal{L}_{mutual}\) in Equation (9). The corresponding parameters in the meta-level user model and meta-level recommendation agent model are updated alternately. For instance, while updating the respective parameters in the meta-level recommendation agent model using \(\mathcal{L}_{mutual}\), the parameters of the meta-level user model are kept constant. To estimate the joint distribution \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) in Equation (9), we construct samples from the same users at time \(t\) (e.g., \(i\)th user, (\(z_{u,t}^{i},z_{rec,t}^{i}\))). Conversely, the marginal distributions \(\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}\) are estimated by sampling from a different user (e.g., (\(z_{u,t}^{i},z_{rec,t}^{j}\)), where \(j\neq i\)).
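A compact sketch of this estimator is shown below (PyTorch); the architecture of the statistics network \(T_{\psi}\) and the in-batch shuffling used to form marginal pairs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JSDMutualInfo(nn.Module):
    """Jensen-Shannon mutual information lower bound of Equation (9): T_psi scores
    (z_u, z_rec) pairs; joint pairs come from the same user at the same time step,
    marginal pairs from shuffling z_rec across users in the batch."""
    def __init__(self, zu_dim, zrec_dim, hidden=128):
        super().__init__()
        self.T = nn.Sequential(
            nn.Linear(zu_dim + zrec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, z_rec):
        joint = self.T(torch.cat([z_u, z_rec], dim=-1))
        perm = torch.randperm(z_rec.size(0), device=z_rec.device)
        marginal = self.T(torch.cat([z_u, z_rec[perm]], dim=-1))
        # L_mutual = E_joint[-softplus(-T)] - E_marginal[softplus(T)]
        return (-F.softplus(-joint)).mean() - F.softplus(marginal).mean()
```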
Algorithm 1 M3Rec Meta-Training
1:  Input: Offline data \(\left\{P_{u}:u\in\mathcal{U}^{warm}\right\}\); Meta-level user policy \(\pi_{\phi}\); Meta-level recommendation policy \(\pi_{\theta}\); Context encoder \(f^{cont}\); Discriminator \(D_{\omega}\).
2:  Initialize simulated user buffers \(\mathcal{B}^{s}_{i}\) and real user buffers \(\mathcal{B}^{r}_{i}\) for each training user.
3:  Pre-train \(\pi_{\phi},f^{cont}\) using maximum likelihood estimation on the offline data.
4:  Pre-train \(\pi_{\theta}\) using maximum likelihood estimation on the offline data.
5:  Sample a batch of users; for each user \(i\), add a batch of simulated sequences to \(\mathcal{B}^{s}_{i}\) and add a batch of true user interaction sequences from offline data to \(\mathcal{B}^{r}_{i}\). Pre-train \(D_{\omega}\) using Equation (5) with \(\mathcal{B}^{s}_{i}\) and \(\mathcal{B}^{r}_{i}\).
6:  repeat
7:     for \(e\) times do
8:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
9:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
10:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s}, c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
11:        Update \(\pi_{\theta},f^{cont}\) by using policy gradient in Equation (7), maximizing the lower bound of mutual information in Equation (9) by fixing the user model, and maximizing the variational lower bound in Equation (6) with samples from \(\mathcal{B}^{s}_{i}\).
12:        Update \(\pi_{\theta},f^{cont}\) using maximum likelihood estimation on the offline data.
13:     end for
14:     for \(m\) times do
15:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
16:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
17:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\phi}(x | \boldsymbol{s}, \boldsymbol{A},c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
18:        Update \(\pi_{\phi},f^{cont}\) by using policy gradient in Equation (4), maximizing the lower bound of mutual information in Equation (9) by fixing the recommendation model, and maximizing the variational lower bound in Equation (1) with samples from \(\mathcal{B}^{s}_{i}\).
19:        Update \(\pi_{\phi},f^{cont}\) using maximum likelihood estimation on the offline data.
20:     end for
21:     for \(d\) times do
22:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
23:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
24:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\phi}(x | \boldsymbol{s}, \boldsymbol{A},c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
25:        Empty \(\mathcal{B}^{r}_{i}\) and sample true user behavior sequence \(P_{u_{i}}\) from offline data added to \(\mathcal{B}^{r}_{i}\).
26:        Update \(D_{\omega},f^{cont}\) using Equation (5) with samples from \(\mathcal{B}^{s}_{i}\) and \(\mathcal{B}^{r}_{i}\).
27:     end for
28:  until Convergence

4.6 Training

In this section, we introduce the training and testing procedures for the proposed \(\rm M^{3}Rec\) model. Training proceeds by alternately updating three key components, namely the recommendation policy, the user model, and the discriminator, together with the parameters of the user context encoder. This process is outlined in Algorithm 1. Upon completion of training, we evaluate the recommendation policy by conditioning it on the context information of the test user, as described in Algorithm 2. We employ both simulated online evaluation and offline evaluation to assess the performance of the model.

4.7 Theoretical Analysis

In this section, we provide a theoretical analysis of the performance bound of the recommender policy when adapting to meta-test users by our meta-level model-based RL framework. Proofs can be found in Appendix A.
To provide our theoretical analysis, let us first introduce some notation; here, we slightly abuse notation for simplicity. We denote the action at time \(t\) as \(a_{t}\), which corresponds to \(x_{t}\) for the user and \(\boldsymbol{A}_{t}\) for the recommender agent as defined in Section 3. We use \(\mu_{u_{i}}^{\pi_{\theta}}=\frac{1}{T}\sum_{t=0}^{T}P(\boldsymbol{s}_{t}= \boldsymbol{s},a_{t}=a)\) to denote the average state-action (\(\boldsymbol{s}\), \(a\)) visitation distribution when executing recommendation policy \(\pi_{\theta}\) in user model \(u_{i}\), where \(u_{i}\) can be the approximated user model \(u_{i}^{m}\) in our meta-level model-based RL or the true user model \(u_{i}^{w}\) in the real world. Then for user \(u_{i}\), the modeling error between \(u_{i}^{m}\) and \(u_{i}^{w}\) under the state-action distribution \(\mu\) is \(\ell(u_{i}^{m},\mu)=\mathbb{E}_{(\boldsymbol{s},a)\sim\mu}[D_{\mathrm{KL}}(P_{ u_{i}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{i}^{m}}(\cdot|\boldsymbol{s},a))]\), where \(D_{\mathrm{KL}}\) is the KL divergence, and \(P_{u_{i}^{w}}\) and \(P_{u_{i}^{m}}\) represent the user transition dynamics (i.e., user policy \(\pi_{\phi}\)) in the true user model \(u_{i}^{w}\) and the approximated meta-level user model \(u_{i}^{m}\), respectively. The performance of recommendation policy \(\pi_{\theta}\) under user model \(u_{i}\) is \(J(\pi_{\theta},u_{i})=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{u_{i},t}]\), where \(\gamma\) is the discount factor. Then, we can derive the performance bound of the recommendation policy learned in our meta-level model-based RL framework.
Algorithm 2 M3Rec Meta-Test
1:  Input: test user \(u_{test}\sim\mathcal{U}^{cold}\); the test user's context information \(C_{u_{test}}\).
2:  Infer the test user context variable \(c_{test}\) with the test user context information \(C_{u_{test}}\) using the user context encoder in Section 4.2.
3:  Test the adapted recommendation policy \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s}, c_{test})\) in the setting of simulated online evaluation or offline evaluation with real-world datasets.
Theorem 1
Suppose the meta-level user model and meta-level recommendation policy are trained to optimality on meta-training users. When adapting to a meta-test user with few behavior sequences to infer the user context variable, the test user's policy is obtained as \(\pi_{\phi}(x|\boldsymbol{s},\boldsymbol{A},c_{test})\) with the corresponding recommendation policy as \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\). Suppose the modeling error of this test user model satisfies \(\ell(u_{test}^{m},\mu_{u_{test}^{w}}^{\pi_{\theta}})\leq\epsilon_{u_{test}^{m} }^{adapt}\). The performance of \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\) under this test user's model satisfies \(J(\pi_{\theta},u_{test}^{m})\geq\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{ \prime},u_{test}^{m})-\epsilon_{\pi_{\theta}}^{adapt}\). Let \(\pi_{\theta}^{*}\) denote the optimal recommendation policy, whose performance is \(J_{u_{test}^{w}}^{*}=\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{\prime},u_{ test}^{w})\). We also suppose the modeling error of this test user model on the optimal recommendation policy \(\pi_{\theta}^{*}\) satisfies \(\ell(u_{test}^{m},\mu_{u_{test}^{w}}^{\pi_{\theta}^{*}})\leq\epsilon_{u_{test} ^{m}}^{adapt}\), as the meta-training process can help the model adapt to different recommendation policies. Then, the performance bound between the learned recommendation policy \(\pi_{\theta}\) and the optimal policy \(\pi_{\theta}^{*}\) on real meta-test users is as follows:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}, u_{test}^{w}\right) & \leq\epsilon_{\pi_{\theta}}^{adapt}+\frac{4\gamma R_{\max}\sqrt{ \epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\end{align*}
Remark. In this theorem, the gap between the recommender policy performance in our model trained on meta-training users and the optimal policy in the real-world test users comes from two error terms.
The first term is related to the sub-optimality of the meta-policy optimization as well as the generalization error of the meta-level recommendation policy on the meta-test user. It can be reduced by sufficient training of the meta-level recommendation policy. The mutual information regularization between the user policy and the recommendation policy can also help reduce the error \(\epsilon_{\pi_{\theta}}^{adapt}\).
The second term is related to the user model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) on the new meta-test user recommendation policy \(\pi_{\theta}\) and its optimal recommendation policy \(\pi_{\theta}^{*}\). As our meta-level user model meta-learns from the distribution of users and optimizes its prediction performance on different meta-level recommendation policies, the model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) can be small. Intuitively, the mutual information regularization between the meta-level user model and meta-level recommendation agent can benefit each other. For the meta-level user model, this mutual information can inform the user model to improve its prediction accuracy on the area visited by the recommendation agent. For the meta-level recommendation agent, it is encouraged to visit the area where the user model has high confidence. Hence, this mutual information regularization further helps reduce the model estimation error \(\epsilon_{u_{test}^{m}}^{adapt}\) in the offline setting. Therefore, these two error terms can be reduced by meta-learning and mutual information regularization in our method. The performance of the recommender policy learned on meta-training users in our offline meta-level model-based RL framework can approximate the optimal policy in the real-world meta-test users.

5 Experiment

In this section, we perform a thorough experimental study to validate the effectiveness of our proposed \(\rm M^{3}Rec\) method in handling cold-start recommendation scenarios. We first provide a description of our experimental setup, including implementation details and the baseline methods employed for comparison. Then, we show the simulated online evaluation experiment. After that, we present the offline evaluation results with real-world datasets. Specifically, we seek to answer the following research questions (RQ):
RQ1:
What is the performance of \(\rm M^{3}Rec\) on the cold-start recommendation task?
RQ2:
When the offline trained \(\rm M^{3}Rec\) is utilized as an initialization for subsequent online learning, how does it perform?
RQ3:
How does the model perform on the first few interactions of cold-start test users?
RQ4:
What impacts do the essential components of \(\rm M^{3}Rec\) have on the overall recommendation performance?
RQ5:
How sensitive is the proposed \(\rm M^{3}Rec\) model to variations in the user model rollout length during training?

5.1 Experimental Setup

5.1.1 Implementation Details.

The parameters were selected based on the recommendation performance on the validation set. The number of layers in the user context encoder was tuned in the set \(\{1,2,3\}\). The embedding size of the context information in the user context encoder was tuned within \(\{8,16,32,64\}\). The item embedding size was set to 50. The hidden size of the MLP neural networks employed in the model was tuned within \(\{64,128,256,512\}\). The number of MLP layers is detailed in the model description in Section 4. The weight \(\beta\) in the variational lower bound was tuned within \(\{0.00001,0.0001,0.001,0.01,0.1,1\}\). The user model rollout length was tuned in the range \(\{5,10,15,20\}\). For pre-training, we used the Adam optimizer [30] with a learning rate of 0.001. Following pre-training, a smaller learning rate was tuned within \(\{0.0005,0.0001,0.00001,0.000001\}\). Lastly, the training frequencies \(e\), \(m\), and \(d\) in Algorithm 1 were tuned within the set \(\{1,3,5,10\}\).
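For quick reference, the reported settings can be collected into a plain configuration dictionary (the key names are ours; the values are those stated above, with lists denoting tuned grids and scalars denoting fixed values).

```python
# Hyperparameter settings reported in Section 5.1.1, gathered in one place.
hyperparameters = {
    'context_encoder_layers': [1, 2, 3],
    'context_embedding_size': [8, 16, 32, 64],
    'item_embedding_size': 50,
    'mlp_hidden_size': [64, 128, 256, 512],
    'beta_variational_lower_bound': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    'user_model_rollout_length': [5, 10, 15, 20],
    'pretrain_optimizer': 'Adam',
    'pretrain_learning_rate': 1e-3,
    'post_pretrain_learning_rate': [5e-4, 1e-4, 1e-5, 1e-6],
    'update_frequencies_e_m_d': [1, 3, 5, 10],
}
```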

5.1.2 Baseline Methods.

We compare our method with several state-of-the-art approaches that are broadly classified into non-RL cold-start recommendation methods and RL-based methods. The former includes Meta-LSTM and MeLU, while the latter consists of Meta-PG, IRecGAN, generative adversarial user model assisted policy gradient optimization (GAN-PG), and BCQ.
Meta-LSTM: Recurrent neural networks have shown success for session-based recommendation [24, 47]. Here, we utilize LSTM as the recommendation policy network with context-based meta-learning method [11, 49] by incorporating a user context variable to adapt to the cold-start recommendation setting.
MeLU [33]: MeLU utilizes the MAML approach to estimate user's preference for cold-start recommendation.
Meta-PG: Similar to [49], Meta-PG is a meta-RL method integrating a user context variable to infer user's preference to adapt to new users.
IRecGAN [2]: IRecGAN is a model-based RL method for recommendation that adversarially trains the user behavior model against a discriminator, which evaluates the quality of the generated data. The recommendation policy is trained using the reward from the discriminator.
GAN-PG [8]: GAN-PG models the user behavior dynamics and recovers the reward function via a closed-form solution using generative adversarial training. When the training of the user model ends, it serves as the environment model for the recommender agent.
Batch-constrained deep Q-learning (BCQ) [14, 15]: BCQ is a model-free offline RL method that restricts the agent's action space to constrain the learned policy to be close to the behavior policy that produced the offline data. Specifically, we utilize BCQ in the discrete-action setting [14].
For the baseline methods, the item embedding size was set to 50, as in our method, and the learning rate was tuned in the range \(\{0.0001,0.0005,0.001\}\). The remaining parameters of the baseline methods were set according to the recommendations in their original articles and tuned based on performance on the validation set. For a fair comparison, all the RL-based methods, along with our proposed method, utilize the REINFORCE algorithm for policy optimization [58]. The context information utilized in these baseline methods is consistent with our method.

5.2 Simulated Online Evaluation

As it is difficult to conduct an online evaluation in which real users interact with the recommendation policy, we carry out the online evaluation in a simulated environment, following previous works [2, 8].

5.2.1 Simulated Environment.

To simulate the behavior of different users, we utilize the open-source simulator RecSim [26], which provides sequential interaction with users. We consider the interest evolution environment, where the task is video recommendation. Users' interests evolve over time, and the goal is to maximize long-term user engagement (e.g., video watch time). The environment consists of a user model, a video model, and a user-choice model. The user model can sample a set of users from a distribution over configurable user features. For example, we can configure the interests of users by changing the user-specific interest value for different topics, so it is flexible to configure users with different preferences. In the user configuration, the user interest vector represents a user's interest in the different document (video) topics, where each dimension ranges from \(-1\) to 1 (1 = very interested, \(-1\) = disgusted). Another important parameter is alpha, which influences the change of the user's time budget. Specifically, we configure a \(d\)-dimensional user interest vector, where each dimension is sampled uniformly from \(-1\) to 1 and \(d=20\) is the number of topics. For alpha, we uniformly sample its value from 0 to 1. We use default configurations for the other parameters. The number of document candidates is set to 200.
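A hedged sketch of instantiating this environment with the settings above is shown below; the configuration keys follow RecSim's public examples, while the per-user interest and alpha sampling described in the text are handled through RecSim's configurable user model and are not reproduced here.

```python
from recsim.environments import interest_evolution

# Interest evolution environment configuration (keys as in RecSim's examples).
env_config = {
    'num_candidates': 200,      # number of candidate documents per step
    'slate_size': 5,            # recommendation list size k
    'resample_documents': True,
    'seed': 0,
}
env = interest_evolution.create_environment(env_config)

observation = env.reset()
slate = [0, 1, 2, 3, 4]                              # placeholder slate of document indices
observation, reward, done, info = env.step(slate)    # reward relates to watch time
```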

5.2.2 Experiment Setting.

We first generate offline data for model training. As we aim to train the model in the offline setting without online interaction, we generate three datasets of different offline data qualities. Random: roll out a randomly initialized policy in the RecSim environment. Medium: partially train a Meta-PG method in the RecSim environment and then let it interact with the simulator. Expert: interact with users using the fully trained Meta-PG method. Each user's time budget is set to 1,000 minutes in the interest evolution environment. Each dataset contains 200 users, and the number of sessions for each user is 20. Each dataset records the recommendation list, the user's click, and the user's reward at each time step of each session. For the meta-training and meta-test users, we utilize two sets of user configurations without intersection, as shown in Appendix B. Similarly, we construct the validation dataset with 500 users sampled from the meta-training user distribution. For the meta-test, we perform online interaction with 500 meta-test users and configure the time budget as 1,000 minutes for each user. The cumulative video watch time is calculated as the return for each user. We use the averaged user return as the evaluation metric and report the performance for recommendation lists of size \(k=3,5,10\).
In the case of our simulated users, no user profile information exists. Therefore, we adopt a single user session sequence as the context information for a cold-start test user. This unique interaction sequence is obtained by deploying a recommendation policy to interact with the users according to the dataset type. With regard to context-based meta-learning methods, this individual session sequence is leveraged to infer the user context variable. Notably, in the MeLU baseline method, which is based on the MAML approach, the same single-session sequence of the test user is utilized as the support set for fine-tuning on the cold-start test user.

5.2.3 Online Evaluation (RQ1).

To answer RQ1, we present an overall comparison with the baseline methods using simulated online evaluation for the cold-start recommendation task. Table 1 shows the average cumulative rewards of all competing methods, where a higher cumulative reward indicates that the recommender system can better satisfy users' evolving interests and lead to longer user engagement time. It can be observed that the proposed \(\rm{M^{3}Rec}\) method outperforms all the baseline methods across the datasets with different qualities and slate sizes (i.e., the size of the recommendation list). The baseline methods tend to underperform when dealing with poor-quality offline datasets (e.g., the random or medium dataset types). However, our \(\rm M^{3}Rec\) method performs quite well even with the low-quality random data. This underscores the capacity of our method to effectively utilize low-quality data for offline RL training without stringent constraints on data quality. Another interesting observation is that our method is the only one that achieved a cumulative reward greater than the configured user time budget of 1,000 across all the dataset types. This implies that our method successfully maximizes the cold-start user's long-term interest, thereby preventing the user from leaving the platform.
Table 1.
| Dataset type | Slate size | Meta-LSTM | Meta-PG | IRecGAN | GAN-PG | MeLU | BCQ | \(\rm{M^{3}Rec}\) (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | \(k\) = 3 | 960.41 | 971.56 | 973.33 | 883.39 | 895.52 | 849.20 | **1,082.27** |
| Random | \(k\) = 5 | 960.80 | 963.35 | 957.30 | 909.72 | 905.14 | 900.19 | **1,116.08** |
| Random | \(k\) = 10 | 959.05 | 957.86 | 948.12 | 928.70 | 916.41 | 953.92 | **1,024.84** |
| Medium | \(k\) = 3 | 999.04 | 960.93 | 1,104.70 | 950.32 | 939.97 | 1,012.41 | **1,268.50** |
| Medium | \(k\) = 5 | 995.89 | 944.87 | 1,077.33 | 987.57 | 947.71 | 977.34 | **1,244.89** |
| Medium | \(k\) = 10 | 989.87 | 983.09 | 1,049.69 | 970.36 | 951.76 | 988.14 | **1,114.62** |
| Expert | \(k\) = 3 | 1,019.22 | 992.38 | 1,334.57 | 1,384.34 | 961.28 | 1,214.4 | **1,473.06** |
| Expert | \(k\) = 5 | 1,012.86 | 981.26 | 1,187.35 | 1,170.67 | 961.43 | 1,067.73 | **1,242.29** |
| Expert | \(k\) = 10 | 1,004.66 | 970.58 | 1,089.42 | 1,079.82 | 960.90 | 970.29 | **1,116.36** |
Table 1. Online Evaluation Results of Averaged User Cumulative Reward with Different Recommendation List Sizes \(k\)
Methods are offline trained with datasets of different qualities. The bold numbers indicate the best performance across all the methods.

5.2.4 Online Learning (RQ2).

Given the cost of training an RL-based recommender by interacting with users, in the above setting \(\rm M^{3}Rec\) is first trained offline with logged user data and then deployed with simulated cold-start users for online evaluation, where it demonstrates superior performance. An intriguing question arises: how does \(\rm M^{3}Rec\) perform when the offline-trained model is used as the initialization for subsequent online learning with these cold-start users (RQ2)? To address this, we perform online learning and assess its performance via interaction with meta-test users. This methodology presents a practical approach for enhancing the performance of an already offline-trained RL model. During each online learning iteration, we interact with 500 meta-test users, with the time budget per user set at 120. We utilize the model offline-trained on the expert dataset for online learning (other dataset types show similar results). As depicted in Figure 2, our method consistently improves its performance with increased online interactions. These results demonstrate that our \(\rm M^{3}Rec\) method trained with offline-logged user data offers a valuable initialization, thus boosting subsequent online learning performance with these cold-start users.
Figure 2. Performance of online learning with compared methods initialized by offline training on the expert dataset.

5.3 Offline Evaluation with Real-World Datasets

5.3.1 Datasets.

We further validate the effectiveness of our proposed method with two widely used real-world recommendation datasets. Table 2 provides a statistical overview of these datasets, which we describe briefly below.
MovieLens. This dataset covers movie user–item interactions; we use the MovieLens-1M dataset2 [22]. Consistent with established practices [23, 54], numeric ratings are transformed into implicit feedback, signifying whether a user has rated a particular item.
Last.fm. This dataset is a collection of user–song interactions that reflects users' listening habits up to 5 May 2009.3 We use the data from the last month for the cold-start recommendation task.
Datasets | # of users | # of items | # of interactions | Density
MovieLens | 6,040 | 3,706 | 1,000,209 | 4.47%
Last.fm | 613 | 150,826 | 426,203 | 0.46%
Table 2. Statistics of the Real-World Datasets
For both datasets, we randomly sample 70% users as the warm users in the training set, 10% users in the validation set, and the remaining 20% users as cold-start users in the test set.

5.3.2 Experiment Setting.

In both datasets, we utilize attribute information of users as the context information to infer user context variables. This approach aligns with real-world scenarios, wherein only user profile information is available for cold-start users. Specifically, the attributes used are gender, age, and occupation for the MovieLens dataset, and gender, age, and country for the Last.fm dataset.
These real-world datasets do not include user reward information, rendering the Meta-PG and BCQ baseline methods inapplicable. Similarly, the MeLU baseline method is not applicable as it requires user–item interactions from cold-start users for support-set fine-tuning. Therefore, we compare against the remaining baseline methods.
During the meta-test stage, sequential recommendations are performed on the test users, and the performance is evaluated using two widely accepted metrics: hit ratio (HR) and normalized discounted cumulative gain (NDCG) of top \(k\) recommended items [52, 69]. We report HR@\(k\) and NDCG@\(k\) for \(k=10,20,50\). HR@\(k\) measures the average proportion of preferred items appearing in the top-\(k\) recommendation list, defined as
\begin{align*}{\rm HR}@k=\frac{1}{|\mathcal{U}^{cold}|}\sum_{i=1}^{|\mathcal{U}^{cold}|} \frac{1}{|p_{u_{i}}|}\sum_{j=1}^{|p_{u_{i}}|}\delta\left(p_{u_{i},j}\in \mathcal{J}_{u_{i},j}\right),\end{align*}
where \(\mathcal{J}_{u_{i},j}\) and \(p_{u_{i},j}\) represent the top-\(k\) recommendation list and the preferred item at timestamp \(j\) (relative time order) for cold-start user \(u_{i}\), respectively. NDCG@\(k\) evaluates the recommendation performance from a ranking perspective, defined as
\begin{align*}{\rm NDCG}@k=\frac{1}{|\mathcal{U}^{cold}|}\sum_{i=1}^{|\mathcal{U}^{cold}|} \frac{1}{|p_{u_{i}}|}\sum_{j=1}^{|p_{u_{i}}|}\frac{{DCG}@k\left(u_{i},j\right)}{ {IDCG}@k\left(u_{i},j\right)},{DCG}@k\left(u_{i},j\right)=\sum_{m=1}^{k}\frac{rel_ {u_{i},j,m}}{\log_{2}(m+1)},\end{align*}
where \(rel_{u_{i},j,m}\) is the relevance of the item at position \(m\) in \(\mathcal{J}_{u_{i},j}\). \(rel_{u_{i},j,m}=1\) if the item at position \(m\) coincides with user \(u_{i}\)'s preferred item at timestamp \(j\); otherwise \(rel_{u_{i},j,m}=0\). IDCG@\(k\) is the ideal DCG value over all possible recommendation lists of length \(k\), serving as a normalizer.
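To make the metric computation concrete, the following Python sketch evaluates HR@\(k\) and NDCG@\(k\) for a single cold-start user exactly as defined above; variable names are illustrative.

```python
import math

def hr_ndcg_at_k(recommendations, preferred):
    """Compute HR@k and NDCG@k for one cold-start user.

    recommendations: list of top-k recommendation lists, one per timestamp j
    preferred: list of the user's preferred item at each timestamp j
    """
    hits, ndcgs = [], []
    for rec_list, p in zip(recommendations, preferred):
        hit = 1.0 if p in rec_list else 0.0
        hits.append(hit)
        # DCG: relevance is 1 only at the position of the preferred item
        dcg = 0.0
        if hit:
            m = rec_list.index(p) + 1          # 1-based position in the slate
            dcg = 1.0 / math.log2(m + 1)
        idcg = 1.0 / math.log2(1 + 1)          # ideal: preferred item at position 1
        ndcgs.append(dcg / idcg)
    return sum(hits) / len(hits), sum(ndcgs) / len(ndcgs)

# Averaging these per-user scores over all cold-start users yields the
# reported HR@k and NDCG@k values.
```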

5.3.3 Offline Evaluation Results (RQ1).

To address RQ1, besides the simulated online evaluation, we further present an overall comparison with the baseline methods on these two real-world datasets for the cold-start recommendation task. The offline evaluation results for the MovieLens and Last.fm datasets are illustrated in Tables 3 and 4, respectively. An observable trend is the superior performance of our proposed method in comparison to all baseline methods, across both datasets and a variety of evaluation metrics. For example, in the Last.fm dataset, when \(k=10\), our method outperforms the best baseline method by 15.58% and 17.15% on the NDCG and HR evaluation metrics, respectively. When comparing the evaluation results across the datasets, we find that the performance of all the compared methods on the Last.fm dataset is lower than on the MovieLens dataset. The potential reason is the sparsity of the Last.fm dataset, as shown in Table 2. Nevertheless, even on the more challenging Last.fm dataset for cold-start recommendation, our method outperforms the baseline methods by a larger margin than on the MovieLens dataset. These results demonstrate the effectiveness of our method for the cold-start recommendation problem.
Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0841 | 0.0833 | 0.0856 | 0.0866
HR@10 | 0.1679 | 0.1690 | 0.1710 | 0.1753
NDCG@20 | 0.1091 | 0.1094 | 0.1114 | 0.1134
HR@20 | 0.2677 | 0.2727 | 0.2738 | 0.2816
NDCG@50 | 0.1431 | 0.1440 | 0.1461 | 0.1495
HR@50 | 0.4394 | 0.4470 | 0.4486 | 0.4637
Table 3. Offline Evaluation Results on MovieLens Dataset
The bold numbers indicate the best performances across all the methods for ease of demonstration.
Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0318 | 0.0077 | 0.0207 | 0.0368
HR@10 | 0.0427 | 0.0113 | 0.0354 | 0.0500
NDCG@20 | 0.0326 | 0.0085 | 0.0219 | 0.0375
HR@20 | 0.0458 | 0.0144 | 0.0397 | 0.0531
NDCG@50 | 0.0332 | 0.0094 | 0.0230 | 0.0379
HR@50 | 0.0487 | 0.0191 | 0.0455 | 0.0548
Table 4. Offline Evaluation Results on Last.fm Dataset
The bold numbers indicate the best performances across all the methods for ease of demonstration.
In subsequent sections, we delve deeper into the analysis of the proposed \(\rm M^{3}Rec\) method, using the Last.fm dataset as a representative example, which is more challenging for the cold-start recommendation task.

5.3.4 Performance on Initial Interactions of Test Users (RQ3).

In real-world cold-start recommendation scenarios, it is desirable for the recommender system to demonstrate promising performance on the initial interactions of cold-start users, to deter user attrition. With this consideration in mind, to answer RQ3, we further report the averaged recommendation performance over the earliest \(l\) interactions of the test users, where \(l\) ranges from 3 to 20. As depicted in Figure 3, our method consistently surpasses the baseline methods across all evaluation metrics. Notably, the margin of improvement is particularly large when \(l\) is small (e.g., \(l=3\)), implying that our method excels at providing superior recommendations during the earliest interactions of cold-start users.
Figure 3. Performance comparison on initial interactions of test users with the Last.fm dataset, where the first few interactions (test user length) for each test user are kept for evaluation.

5.3.5 Ablation Study (RQ4).

To probe RQ4, we conduct an ablation study aimed at assessing the impact of \(\rm M^{3}Rec\)'s essential components on the recommendation performance. Specifically, we remove the meta-learning component (i.e., the user context encoder) and the mutual information regularizer separately. The results of the ablation study are presented in Table 5. Upon the removal of either the meta-learning component or the mutual information regularizer, a performance drop is observed in our \(\rm M^{3}Rec\) model. This ablation study underscores the necessity of the meta-learning method and the mutual information regularizer within our proposed model.
Metric | \(\rm{M^{3}Rec}\) | Without meta-learning | Without mutual information regularizer
NDCG@10 | 0.0368 | 0.0349 | 0.0354
HR@10 | 0.0500 | 0.0477 | 0.0473
NDCG@20 | 0.0375 | 0.0358 | 0.0361
HR@20 | 0.0531 | 0.0510 | 0.0501
NDCG@50 | 0.0379 | 0.0363 | 0.0366
HR@50 | 0.0548 | 0.0535 | 0.0527
Table 5. Ablation Study on Last.fm Dataset

5.3.6 Impact of the User Model Rollout Length (RQ5).

In our model-based RL framework, the recommendation agent interacts with the learned user model during training for a configurable number of steps, referred to as the user model rollout. In response to RQ5, which concerns the sensitivity of \(\rm M^{3}Rec\) to the user model rollout length, we vary this length during training; a generic sketch of such truncated rollouts is given after Figure 4. Figure 4 illustrates that both too small and too large rollout lengths lead to inferior performance. Our model achieves the best performance across the evaluation metrics when a moderate user model rollout length (i.e., 10) is employed.
Figure 4. Influence of user model rollout length on the performance of M\({}^{3}\)Rec on the Last.fm dataset.
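For illustration, the sketch below shows a generic form of truncated user-model rollouts, in the style of branched rollouts in model-based RL; the starting-state sampling and the agent/user-model interfaces are assumptions rather than the exact procedure used in \(\rm M^{3}Rec\).

```python
# A generic sketch (assumptions noted) of truncated user-model rollouts:
# the agent interacts with the learned user model for `rollout_length` steps
# starting from states sampled from the offline data.
import random

def model_rollouts(agent, user_model, offline_states, rollout_length=10,
                   num_rollouts=64):
    """Generate synthetic experience by rolling the policy in the user model.
    `agent.act` and `user_model.step` are hypothetical interfaces."""
    synthetic = []
    for _ in range(num_rollouts):
        state = random.choice(offline_states)          # branch from logged data
        for _ in range(rollout_length):
            slate = agent.act(state)                   # recommend a slate
            click, reward, next_state, done = user_model.step(state, slate)
            synthetic.append((state, slate, click, reward, next_state))
            if done:
                break
            state = next_state
    return synthetic
```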

6 Conclusion

In this article, we presented a new approach to addressing the cold-start problem in RL-based recommendations by developing a context-aware offline meta-level model-based RL method. This method incorporates a user context variable designed to infer user preferences for adapting to new users. Within the context of the meta-learned model-based RL framework, we proposed to recover user policy and reward via an IRL approach, which is conditioned on the user context variable. This meta-level user model is employed to aid in training the context-aware recommendation agent, facilitating adaptation to new users who have limited contextual information or user–item interaction records. To address the challenge posed by the offline training of the proposed model, we further introduced a mutual information constraint between the user model and the recommendation agent. Alongside extensive simulated online and offline evaluations, which demonstrate the effectiveness of our approach, we also provided a theoretical analysis of the recommendation performance bound of the developed method.


Appendices

Appendix A Proofs

A.1. Proof of Theorem 1

Proof.
The proof techniques are similar to those in [48, Theorem 1]. For completeness, we prove Theorem 1 for our meta-level model-based RL framework. The necessary lemmas are provided in Appendix A.3. We first decompose \(J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta},u_{test}^{w}\right)\) into three terms and analyze them separately.
\begin{align*}&J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}, u_{test}^{w}\right) \\&\quad{} =J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}^ {*},u_{test}^{m}\right)+J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi _{\theta},u_{test}^{w}\right) \\&\quad{} =\underbrace{J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}^{*},u_{test}^{m}\right)}_{\text{Term-I}}+\underbrace{J\left(\pi_{ \theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta},u_{test}^{m}\right)}_{ \text{Term-II}}+\underbrace{J\left(\pi_{\theta},u_{test}^{m}\right)-J\left(\pi _{\theta},u_{test}^{w}\right)}_{\text{Term-III}}.\end{align*}
Term-II. We introduce the quantity \(J\left(\pi_{\theta,u_{test}^{m}}^{*},u_{test}^{m}\right)\geq J\left(\pi_{\theta}^{\prime},u_{test}^{m}\right),\forall\pi_{\theta}^{\prime}\), where \(\pi_{\theta,u_{test}^{m}}^{*}\) is the optimal recommendation policy obtained under the approximated model \(u_{test}^{m}\) in our meta-level model-based RL framework. Then, we can further decompose Term-II as follows:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta}, u_{test}^{m}\right) & = \left(J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta, u_{test}^{m}}^{*},u_{test}^{m}\right)+J\left(\pi_{\theta,u_{test}^{m}}^{*},u_{ test}^{m}\right)-J\left(\pi_{\theta},u_{test}^{m}\right)\right) \\& \leq 0+\epsilon_{\pi_{\theta}}^{adapt}.\end{align*}
The first difference term is \(\leq 0\) because \(\pi_{\theta,u_{test}^{m}}^{*}\) is the optimal policy under the approximated model \(u_{test}^{m}\). The second difference term is bounded by \(\epsilon_{\pi_{\theta}}^{adapt}\) due to our assumption \(J(\pi_{\theta},u_{test}^{m})\geq\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{\prime},u_{test}^{m})-\epsilon_{\pi_{\theta}}^{adapt}\), which represents the generalization error of \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\), as it is trained on meta-training users.
Term-III. As our meta-level user model \(u_{test}^{m}\) approximates the true user model \(u_{test}^{w}\) using IRL, it can be seen as distribution matching [19] (i.e., KL divergence minimization) between the state-action visitation distributions of \(u_{test}^{m}\) and \(u_{test}^{w}\). Thus, the model approximation error \(\epsilon_{u_{test}^{m}}^{adapt}\) arises from the error of this KL divergence matching. By applying Pinsker's inequality, which connects \(D_{KL}\) and the total variation (TV) distance \(D_{TV}\), we can connect \(\epsilon_{u_{test}^{m}}^{adapt}\) and \(D_{\mathit{TV}}\) as follows:
\begin{align*}\mathbb{E}_{(\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}}\left[D_{TV}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)\right]\leq\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}.\end{align*}
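To make this step explicit, a brief derivation (assuming \(\epsilon_{u_{test}^{m}}^{adapt}\) upper-bounds the expected KL divergence between the two transition models) combines Pinsker's inequality with Jensen's inequality:
\begin{align*}\mathbb{E}\left[D_{TV}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)\right]\leq\mathbb{E}\left[\sqrt{\tfrac{1}{2}D_{KL}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a)\,\|\,P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)}\right]\leq\sqrt{\tfrac{1}{2}\epsilon_{u_{test}^{m}}^{adapt}}\leq\sqrt{\epsilon_{u_{test}^{m}}^{adapt}},\end{align*}
where the expectations are over \((\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}\) and the second inequality uses the concavity of the square root (Jensen's inequality).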
By using Lemma 2, the bound for Term-III is \(\frac{2\gamma R_{\max}\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\)
Term-I. This term measures the modeling error of the test user model on the states visited by \(\pi_{\theta}^{*}\), which are unseen. It can be bounded as follows:
\begin{align}|J(\pi_{\theta}^{*},u_{test}^{w})-J(\pi_{\theta}^{*},u_{test}^{m})| & =\left|\frac{1}{1-\gamma}\mathbb{E}_{\tilde{\mu}_{u_{test}^{w}}^{ \pi_{\theta}^{*}}}[r_{u_{test}}(\boldsymbol{s},a)]-\frac{1}{1-\gamma}\mathbb{E}_{ \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}}[r_{u_{test}}(\boldsymbol{s},a)]\right| \\ & \leq\frac{2R_{\max}}{1-\gamma}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{ w}}^{\pi_{\theta}^{*}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}\right).\end{align}
(10)
Combining these terms, we obtain the following bound:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta},u_{test}^{w}\right)\leq\frac{2R_{\max}}{1-\gamma}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}^{*}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}\right)+\epsilon_{\pi_{\theta}}^{adapt}+\frac{2\gamma R_{\max}\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\end{align*}
By omitting some constant terms, we conclude the proof of Theorem 1. ◻

A.2. Proof of the Lower Bound of \(\mathcal{I}^{\boldsymbol{(JSD)}}\)

\begin{align*}\mathcal{I}^{(JSD)}(z_{u};z_{rec})\geq\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}\left[-\operatorname{sp}\left(-T_{\psi}(z_{u},z_{rec})\right)\right]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[\operatorname{sp}\left(T_{\psi}(z_{u},z_{rec})\right)\right]\right).\end{align*}
Proof.
\begin{align*}\mathcal{I}^{(JSD)}(z_{u};z_{rec}) & =D_{JSD}\left(\mathbb{P}_{z_{u}z_{rec}}\|\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}\right) \\& \geq\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[V_{\psi}(z_{u},z_{rec})]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[{JSD}^{*}(V_{\psi}(z_{u},z_{rec}))\right]\right) \\& =\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[g_{f}(T_{\psi}(z_{u},z_{rec}))]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[{JSD}^{*}(g_{f}(T_{\psi}(z_{u},z_{rec})))\right]\right) \\& =\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}\left[-\operatorname{sp}\left(-T_{\psi}(z_{u},z_{rec})\right)\right]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[\operatorname{sp}\left(T_{\psi}(z_{u},z_{rec})\right)\right]\right)+\log(4).\end{align*}
The inequality in the second line follows from the variational lower bound of the \(f\)-divergence [44]. In the third line, we parameterize the variational function as \(V_{\psi}(z_{u},z_{rec})=g_{f}(T_{\psi}(z_{u},z_{rec}))\). In the fourth line, we substitute \(g_{f}(T_{\psi}(z_{u},z_{rec}))=\log(2)-\log(1+\exp(-T_{\psi}(z_{u},z_{rec})))\) and the conjugate \({JSD}^{*}(t)=-\log(2-\exp(t))\), as in [44]. ◻
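As a concrete illustration of how this lower bound can be estimated in practice, the following PyTorch sketch computes \(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[-\operatorname{sp}(-T_{\psi})]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}[\operatorname{sp}(T_{\psi})]\) with a simple statistics network; the architecture and the in-batch shuffling used to approximate the product of marginals are assumptions, not the authors' implementation.

```python
# A minimal sketch of estimating the Jensen-Shannon MI lower bound with a
# statistics network T_psi, following the softplus form derived above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatisticsNetwork(nn.Module):
    """T_psi(z_u, z_rec): scores a pair of user/agent representations."""
    def __init__(self, dim_u, dim_rec, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_u + dim_rec, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, z_rec):
        return self.net(torch.cat([z_u, z_rec], dim=-1)).squeeze(-1)

def jsd_mi_lower_bound(T, z_u, z_rec):
    """E_joint[-sp(-T)] - E_marginal[sp(T)], marginals via in-batch shuffling."""
    t_joint = T(z_u, z_rec)                       # samples from the joint
    z_rec_shuffled = z_rec[torch.randperm(z_rec.size(0))]
    t_marginal = T(z_u, z_rec_shuffled)           # product-of-marginals samples
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()

# Maximizing this bound w.r.t. the statistics network's parameters yields the
# mutual information estimate used as a regularization signal.
```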

A.3. Related Lemmas

We provide some related lemmas used in the proof of Theorem 1.
Lemma 1
Let \(P_{1}(\cdot|\boldsymbol{s})\) and \(P_{2}(\cdot|\boldsymbol{s})\) be two Markov chains with the same initial state distribution. Let \(P_{1}^{t}(\boldsymbol{s})\) and \(P_{2}^{t}(\boldsymbol{s})\) be the marginal distributions over states at time \(t\) when following \(P_{1}\) and \(P_{2}\), respectively. Suppose
\begin{align*}\mathbb{E}_{\boldsymbol{s}\sim P_{1}^{t}}\left[D_\mathit{TV}(P_{1}(\cdot|\boldsymbol{s}),P_{2}(\cdot|\boldsymbol{s}))\right]\leq\epsilon\ \ \forall\ t,\end{align*}
then, the marginal distributions are bounded as:
\begin{align*}D_\mathit{TV}(P_{1}^{t},P_{2}^{t})\leq\epsilon t\ \ \forall\ t.\end{align*}
Proof.
Proof has been provided in several previous works such as [48, Lemma 2] and [28, Lemma B.2]. Here, we omit the proof. ◻
To analyze the performance difference of the recommendation policy under the approximated user model \(u_{test}^{m}\) in our meta-level model-based RL framework and under the true user model \(u_{test}^{w}\), we first introduce some definitions.
\begin{align} \mu_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a) & =\frac{1}{T_{\infty}}\sum_{t=0}^{T_{\infty}}P(s_{t}=\boldsymbol{s},a_{t}=a)\end{align}
(11)
\begin{align}\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a) =(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=\boldsymbol{s},a_{t}=a).\end{align}
(12)
The first definition, \(\mu_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a)\), is the state-action visitation distribution when executing the recommendation policy \(\pi_{\theta}\) in the user model \(u_{test}^{m}\); the second is the corresponding discounted state-action visitation distribution. Analogous definitions \(\mu_{u_{test}^{w}}^{\pi_{\theta}}(\boldsymbol{s},a)\) and \(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}}(\boldsymbol{s},a)\) are introduced for the true user model \(u_{test}^{w}\). We further define the marginal distribution at time \(t\) when executing recommendation policy \(\pi_{\theta}\) under the true user model \(u_{test}^{w}\):
\begin{align}\mu_{u_{test}^{w}}^{\pi_{\theta},t}(\boldsymbol{s},a)=P\left(s_{t}=\boldsymbol{s},a_{t}=a\right) \end{align}
(13)
\(\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)\) is similarly defined when following policy \(\pi_{\theta}\) in approximated user model \(u_{test}^{m}\).
Now, we formally introduce Lemma 2, which measures the performance difference of the recommendation policy \(\pi_{\theta}\) under \(u_{test}^{m}\) and \(u_{test}^{w}\) due to the approximation error of \(u_{test}^{m}\) with respect to \(u_{test}^{w}\).
Lemma 2
Let \(u_{test}^{w}\) and \(u_{test}^{m}\) be two different MDPs differing only in their transition dynamics—\(P_{u_{test}^{w}}\) and \(P_{u_{test}^{m}}\). Let the absolute value of rewards be bounded by \(R_{\max}\). Fix a recommendation policy \(\pi_{\theta}\) for both \(u_{test}^{w}\) and \(u_{test}^{m}\), and let \(P_{u_{test}^{w}}^{t}\) and \(P_{u_{test}^{m}}^{t}\) be the resulting marginal state distributions at time \(t\). If the MDPs are such that
\begin{align*}\mathbb{E}_{(\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}}\left[D_\mathit{TV}\big{(}P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\big{)}\right]\leq\epsilon\ \ \forall t,\end{align*}
then, the performance difference is bounded as
\begin{align*}|J(\pi_{\theta},u_{test}^{w})-J\left(\pi_{\theta},u_{test}^{m}\right)|\leq \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^{2}}.\end{align*}
Proof.
The proof is essentially the same as that of [48, Lemma 3]. We provide it here to support the conclusion of Theorem 1 for our meta-level model-based RL framework. For recommendation policy \(\pi_{\theta}\) in the approximated user model \(u_{test}^{m}\), the performance is
\begin{align*}J(\pi_{\theta},u_{test}^{m})=\frac{1}{1-\gamma}\mathbb{E}_{ \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]=\mathbb{E} \left[\sum_{t=0}^{\infty}\gamma^{t}r_{u_{test}}(\boldsymbol{s},a)\right].\end{align*}
A similar expression holds for the true user model \(u_{test}^{w}\). Then, the performance difference can be bounded as follows:
\begin{align}|J(\pi_{\theta},u_{test}^{w})-J(\pi_{\theta},u_{test}^{m})| & =\left|\frac{1}{1-\gamma}\mathbb{E}_{\tilde{\mu}_{u_{test}^{w}}^{ \pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]-\frac{1}{1-\gamma}\mathbb{E}_{\tilde{ \mu}_{u_{test}^{m}}^{\pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]\right|\\ & \leq\frac{2R_{\max}}{1-\gamma}D_{TV}\left(\tilde{\mu}_{u_{test}^{ w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right).\end{align}
(14)
As \(\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)=P\left(s_{t}=\boldsymbol{s},a_{t}=a\right)=P_{u_{test}^{m}}^{t}(\boldsymbol{s})\pi_{\theta}(a|\boldsymbol{s})\),
then, we can bound \(D_{\mathit{TV}}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_ {test}^{m}}^{\pi_{\theta}}\right)\) using Equations (12) and (13) as follows:
\begin{align*}2D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{ \mu}_{u_{test}^{m}}^{\pi_{\theta}}\right) & =\sum_{\boldsymbol{s},a}\left|\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}}- \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right| \\& =(1-\gamma)\sum_{\boldsymbol{s},a}\left|\sum_{t}\gamma^{t}\mu_{u_{test}^{ w}}^{\pi_{\theta},t}(\boldsymbol{s},a)-\gamma^{t}\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)\right| \\& \leq(1-\gamma)\sum_{\boldsymbol{s},a}\sum_{t}\gamma^{t}\left|\mu_{u_{test }^{w}}^{\pi_{\theta},t}(s,a)-\mu_{u_{test}^{m}}^{\pi_{\theta},t}(s,a)\right| \\& =(1-\gamma)\sum_{\boldsymbol{s}}\sum_{t}\gamma^{t}\left|P_{u_{test}^{w}}^ {t}(\boldsymbol{s})-P_{u_{test}^{m}}^{t}(\boldsymbol{s})\right| \\& \leq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}(2t\epsilon).\end{align*}
The last inequality is obtained using Lemma 1. Finally, by summing the infinite series, we get
\begin{align*}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m }}^{\pi_{\theta}}\right)\leq(1-\gamma)\frac{\epsilon\gamma}{(1-\gamma)^{2}} \leq\frac{\epsilon\gamma}{1-\gamma}.\end{align*}
By substituting this inequality into Equation (14), we conclude the proof. ◻
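For completeness, the series evaluation in the last step uses the standard identity \(\sum_{t=0}^{\infty}t\gamma^{t}=\gamma/(1-\gamma)^{2}\):
\begin{align*}\sum_{t=0}^{\infty}\gamma^{t}(2t\epsilon)=2\epsilon\sum_{t=0}^{\infty}t\gamma^{t}=\frac{2\epsilon\gamma}{(1-\gamma)^{2}},\quad\text{so that}\quad 2D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right)\leq(1-\gamma)\cdot\frac{2\epsilon\gamma}{(1-\gamma)^{2}}=\frac{2\epsilon\gamma}{1-\gamma}.\end{align*}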

Appendix B Online Evaluation Details

For online evaluation, we utilize an open-sourced simulator4 [26] for recommender systems. The parameters of this simulator include the user sensitivity, user-specific memory discount, noise standard deviation, and the means and standard deviations of the kale and choc responses, which affect users' preferences. We use the simulator's default values for the standard deviations of the kale and choc responses. The remaining parameters are chosen from the sets of values listed below to configure different users for meta-training and meta-test, respectively.
For meta-training users:
user sensitivity: \([0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1]\).
user-specific memory discount: \([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]\).
noise standard deviation: \([0.03,0.04,0.05,0.06]\).
the mean of kale response: \([2,3,4,5,6,7,8]\).
the mean of choc response: \([2,3,4,5,6,7,8]\).
For meta-test users:
user sensitivity: \([0.015,0.025,0.035,0.045,0.055,0.065,0.075,0.085,0.095,0.105]\).
user-specific memory discount: \([0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95]\).
noise standard deviation: \([0.035,0.045,0.055,0.065]\).
the mean of kale response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
the mean of choc response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
Each user is configured with a combination of these parameters.
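As a minimal sketch (assuming exhaustive enumeration of combinations; the sampling scheme is not specified above), the user configurations can be generated from these parameter lists as follows.

```python
# A minimal sketch (assumed names) of enumerating user configurations as
# combinations of the simulator parameters listed above.
from itertools import product

meta_train_grid = {
    "user_sensitivity": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1],
    "memory_discount": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "noise_std": [0.03, 0.04, 0.05, 0.06],
    "kale_mean": [2, 3, 4, 5, 6, 7, 8],
    "choc_mean": [2, 3, 4, 5, 6, 7, 8],
}

def user_configs(grid):
    """Yield one parameter dictionary per user configuration."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Example: count the distinct meta-training user configurations.
print(sum(1 for _ in user_configs(meta_train_grid)))  # 10*9*4*7*7 = 17,640
```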

References

[1]
Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML ’04), 1.
[2]
Xueying Bai, Jian Guan, and Hongning Wang. 2019. Model-based reinforcement learning with adversarial training for online recommendation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 10735–10746.
[3]
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. 2018. Mine: Mutual information neural estimation. arXiv:1801.04062. Retrieved from https://arxiv.org/abs/1801.04062.
[4]
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. 1991. Learning a synaptic learning rule. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks II, Vol. 2, 969.
[5]
Jiangxia Cao, Jiawei Sheng, Xin Cong, Tingwen Liu, and Bin Wang. 2022. Cross-domain recommendation to cold-start users via variational information bottleneck. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2209–2223.
[6]
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019a. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining, 456–464.
[7]
Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, and Ed Chi. 2022. Off-policy actor-critic for recommender systems. In Proceedings of the 16th ACM Conference on Recommender Systems, 338–349.
[8]
Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019b. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the International Conference on Machine Learning (ICML), 1052–1061.
[9]
Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2 (2013), 1–142.
[10]
Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. 2020. MAMO: Memory-augmented meta-optimization for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 688–697.
[11]
Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL\({}^{2}\): Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779. Retrieved from https://arxiv.org/abs/1611.02779
[12]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017a. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, 1126–1135. arXiv:1703.03400. Retrieved from https://arxiv.org/abs/1703.03400
[13]
Justin Fu, Katie Luo, and Sergey Levine. 2018. Learning robust rewards with adversarial inverse reinforcement learning. arXiv:1710.11248. Retrieved from https://arxiv.org/abs/1710.11248
[14]
Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. 2019a. Benchmarking batch deep reinforcement learning algorithms. arXiv:1910.01708. Retrieved from https://arxiv.org/abs/1910.01708
[15]
Scott Fujimoto, David Meger, and Doina Precup. 2019b. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning. PMLR, 2052–2062.
[16]
Chongming Gao, Kexin Huang, Jiawei Chen, Yuan Zhang, Biao Li, Peng Jiang, Shiqi Wang, Zhong Zhang, and Xiangnan He. 2023a. Alleviating Matthew effect of offline reinforcement learning in interactive recommendation. arXiv:2307.04571. Retrieved from https://arxiv.org/abs/2307.04571
[17]
Chongming Gao, Shiqi Wang, Shijun Li, Jiawei Chen, Xiangnan He, Wenqiang Lei, Biao Li, Yuan Zhang, and Peng Jiang. 2023b. CIRS: Bursting filter bubbles by counterfactual interactive recommender system. ACM Transactions on Information Systems 42, 1 (2023), 1–27.
[18]
Seyed Kamyar Seyed Ghasemipour, Shixiang Gu, and Richard S. Zemel. 2019a. SMILe: Scalable meta inverse reinforcement learning through context-conditional policies. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
[19]
Seyed Kamyar Seyed Ghasemipour, Richard S. Zemel, and Shixiang Gu. 2019b. A divergence minimization perspective on imitation learning methods. In Proceedings of the Conference on Robot Learning (CoRL), 1259–1277.
[20]
Lei Guo, Li Tang, Tong Chen, Lei Zhu, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2021. DA-GCN: A domain-aware attentive graph convolution network for shared-account cross-domain sequential recommendation. arXiv:2105.03300. Retrieved from https://arxiv.org/abs/2105.03300
[21]
Lei Guo, Jinyu Zhang, Tong Chen, Xinhua Wang, and Hongzhi Yin. 2022. Reinforcement learning-enhanced shared-account cross-domain sequential recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 7 (2022), 7397–7411.
[22]
F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5, 4 (2015), 1–19.
[23]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182.
[24]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. arXiv:1511.06939. Retrieved from https://arxiv.org/abs/1511.06939
[25]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (1997), 1735–1780.
[26]
Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jun bo Wang, Rui Wu, and Craig Boutilier. 2019a. RecSim: A configurable simulation platform for recommender systems. arXiv:1909.04847. Retrieved from https://arxiv.org/abs/1909.04847
[27]
Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Deepak Chandra, and Craig Boutilier. 2019b. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2592–2599.
[28]
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 12519–12530.
[29]
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. 2020. MOReL: Model-based offline reinforcement learning. arXiv:2005.05951. Retrieved from https://arxiv.org/abs/2005.05951
[30]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980
[31]
Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
[32]
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1179–1191.
[33]
Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1073–1082.
[34]
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643. Retrieved from https://arxiv.org/abs/2005.01643
[35]
Shijun Li, Wenqiang Lei, Qingyun Wu, Xiangnan He, Peng Jiang, and Tat-Seng Chua. 2021. Seamlessly unifying attributes and items: Conversational recommendation for cold-start users. ACM Transactions on Information Systems (TOIS) 39, 4 (2021), 1–29.
[36]
Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. 2015. DJ-MC: A reinforcement-learning agent for music playlist recommendation. arXiv:1401.1880. Retrieved from https://arxiv.org/abs/1401.1880
[37]
Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1563–1573.
[38]
Zhongqi Lu and Qiang Yang. 2016. Partially observable Markov decision process for recommender systems. arXiv:1608.07793. Retrieved from https://arxiv.org/abs/1608.07793
[39]
Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proceedings of the Web Conference 2020, 463–473.
[40]
Maja J. Mataric. 1994. Reward functions for accelerated learning. In Proceedings of the International Conference on Machine Learning (ICML), 181–189.
[41]
Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. 2018. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 7559–7566.
[42]
Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML), 278–287.
[43]
Andrew Y. Ng and Stuart J. Russell. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 663–670.
[44]
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the Advances in Neural Information Processing Systems, 271–279.
[45]
Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3421–3430.
[46]
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. 2019. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv:1810.00821. Retrieved from https://arxiv.org/abs/1810.00821
[47]
Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the 11th ACM Conference on Recommender Systems, 130–137.
[48]
Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. 2020. A game theoretic framework for model based reinforcement learning. In Proceedings of the International Conference on Machine Learning, 7953–7963.
[49]
Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. 2019. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the International Conference on Machine Learning, (ICML). 5331–5340.
[50]
Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. 2021. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 34, 11702–11716.
[51]
Martin Riedmiller. 2005. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning. Springer, 317–328.
[52]
Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval, Vol. 39. Cambridge University Press, Cambridge.
[53]
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 3483–3491.
[54]
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1441–1450.
[55]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[56]
Siyu Wang, Xiaocong Chen, Dietmar Jannach, and Lina Yao. 2023. Causal decision transformer for recommender systems via offline reinforcement learning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1599–1608.
[57]
Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. 2021. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 5382–5390.
[58]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (1992), 229–256.
[59]
Hanrui Wu, Jinyi Long, Nuosi Li, Dahai Yu, and Michael K. Ng. 2022. Adversarial auto-encoder domain adaptation for cold-start recommendation with positive and negative hypergraphs. ACM Transactions on Information Systems 41, 2 (2022), 1–25.
[60]
Yifan Wu, George Tucker, and Ofir Nachum. 2019. Behavior regularized offline reinforcement learning. arXiv:1911.11361. Retrieved from https://arxiv.org/abs/1911.11361
[61]
Teng Xiao and Donglin Wang. 2021. A general offline reinforcement learning framework for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 4512–4520.
[62]
Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M. Jose. 2020. Self-supervised reinforcement learning for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 931–940.
[63]
Lantao Yu, Tianhe Yu, Chelsea Finn, and Stefano Ermon. 2019. Meta-inverse reinforcement learning with probabilistic context variables. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 11772–11783.
[64]
Runsheng Yu, Yu Gong, Xu He, Bo An, Yu Zhu, Qingwen Liu, and Wenwu Ou. 2020a. Personalized adaptive meta learning for cold-start user preference prediction. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 10772–10780.
[65]
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. 2020b. MOPO: Model-based offline policy optimization. arXiv:2005.13239. Retrieved from https://arxiv.org/abs/2005.13239
[66]
Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, and Lawrence Carin. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 15214–15224.
[67]
Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, and Jiliang Tang. 2019. Model-based reinforcement learning for whole-chain recommendations. arXiv:1902.03987. Retrieved from https://arxiv.org/abs/1902.03987
[68]
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1040–1048.
[69]
Vincent W. Zheng, Yu Zheng, Xing Xie, and Qiang Yang. 2010. Collaborative location and activity recommendations with GPS history data. In Proceedings of the 19th International Conference on World Wide Web, 1029–1038.
[70]
Feng Zhu, Yan Wang, Chaochao Chen, Jun Zhou, Longfei Li, and Guanfeng Liu. 2021. Cross-domain recommendation: Challenges, progress, and prospects. arXiv:2103.01696. Retrieved from https://arxiv.org/abs/2103.01696
[71]
Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-LSTM. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 17, 3602–3608.
