
M3Rec: A Context-Aware Offline Meta-Level Model-Based Reinforcement Learning Approach for Cold-Start Recommendation

Published: 19 August 2024

Abstract

Reinforcement learning (RL) has shown great promise in optimizing long-term user interest in recommender systems. However, existing RL-based recommendation methods need a large number of interactions for each user to learn the recommendation policy. The challenge becomes more critical when recommending to new users who have a limited number of interactions. To that end, in this article, we address the cold-start challenge in the RL-based recommender systems by proposing a novel context-aware offline meta-level model-based RL approach for user adaptation. Our proposed approach learns to infer each user's preference with a user context variable that enables recommendation systems to better adapt to new users with limited contextual information. To improve adaptation efficiency, our approach learns to recover the user choice function and reward from limited contextual information through an inverse RL method, which is used to assist the training of a meta-level recommendation agent. To avoid the need for online interaction, the proposed method is trained using historically collected offline data. Moreover, to tackle the challenge of offline policy training, we introduce a mutual information constraint between the user model and recommendation agent. Evaluation results show the superiority of our developed offline policy learning method when adapting to new users with limited contextual information. In addition, we provide a theoretical analysis of the recommendation performance bound.

1 Introduction

Recent years have witnessed great interest in developing reinforcement learning (RL)-based recommender systems [2, 8], which can effectively model and optimize users' long-term interests. In RL-based methods, the recommendation policy is learned by leveraging the collected interactions between users and the recommender system. As different users may have different interests, conventional RL-based methods need to learn a separate policy for each user, which requires a large number of interactions from every individual user. However, it is very difficult and expensive to obtain enough user–recommender interactions to train a robust recommendation policy. The challenge becomes even more critical for cold-start users, who have very few interactions yet are prevalent in many recommender systems. Therefore, it is imperative to learn a recommendation policy that can infer users' preferences and quickly adapt to cold-start users with limited information.
When dealing with cold-start users, the available information is often limited to a few user attributes (e.g., gender, age) and perhaps minimal user–item interactions. The challenge lies in learning user preferences from this scant information. To address this issue, meta-learning approaches [4, 12] have been applied to cold-start recommendation [33, 37]. Meta-learning seeks to learn general knowledge across diverse tasks and adapt that knowledge to new tasks using minimal samples. Within the context of cold-start recommendation, each user's recommendation can be viewed as a unique task. During the meta-training phase, meta-learning facilitates the understanding of users' general preferences and adapts these to new users with limited data during the meta-testing phase. However, it is still very challenging to apply meta-learning to RL-based recommender systems for the following reasons. First, inferring user preferences requires a significant number of interactions sampled from user distributions, which is very difficult to obtain for new users. In extreme cold-start scenarios, the only available information might be user attributes such as gender and age, with no user–item interactions, so the meta-learning method cannot adapt to new users by fine-tuning with a few interactions [12]. This complicates the task of inferring user preferences. Second, to meta-train a recommendation policy, traditional model-free RL methods need interactions with users. Recent methods [2, 8] utilize model-based RL [9, 41] approaches to sidestep the sample efficiency challenge by leveraging offline-logged user data to model the environment. However, they still require large amounts of offline data from each user to build the environment model, i.e., the individual user model. Third, it is challenging to learn the user model and recommendation policy from offline-logged user data without online interaction with real users. During offline learning, the recommendation policy might suggest recommendations for which the estimated user model cannot provide accurate feedback. This issue represents a fundamental challenge for offline RL due to the distribution shift between user–item interaction data generated by the policy-user model and the existing offline data.
To address the aforementioned challenges, we propose a novel context-aware offline meta-level model-based RL method, specifically designed for tackling the cold-start problem in RL-based recommender systems. To address the first challenge, inspired by the context-based meta-learning approach [11, 49], we introduce a user context variable to infer a user's preferences from the available contextual information, such as user attributes or a limited number of user–item interactions. In response to the second challenge, we employ meta-learning to train a model-based RL model by conditioning both the user model and the recommendation policy on the user context variable. This approach facilitates the learning of meta-level knowledge, i.e., general user preferences. During the meta-training phase, by conditioning on the user context variable, the user model (meta-level user model) learns to estimate the user choice function and the user reward function using the inverse RL (IRL) approach across a broad spectrum of users. In the model-based RL framework, the recommendation agent conditioned on the user context variable (meta-level recommendation agent) is trained through interaction with the meta-level user model. After the meta-training phase, the meta-level recommendation agent can adapt to a new cold-start user by using the user's contextual information that is embedded in the user context variable. To tackle the third challenge arising from offline RL training of our proposed method, we adhere to the principle of conservatism, a concept extensively adopted in the offline RL literature [32, 50]. We propose incorporating a mutual information regularizer between the user model and the recommendation agent. This approach is designed to encourage the recommendation agent to make recommendations within the user model's confidence region, thereby improving the accuracy of feedback.
We conduct intensive experiments to evaluate our developed method. The evaluations include both a simulated online experiment and an offline experiment. The online experiment is carried out in a simulated environment using an open-source recommender system simulator [26], which provides sequential interaction with users. The offline experiment is performed with two widely used datasets. In both online and offline experiments, we evaluate our method against several state-of-the-art baselines using multiple evaluation metrics. All the evaluation results consistently demonstrate the superiority of our developed method.
The main contributions of this work can be summarized as follows:
We propose a novel context-aware offline meta-level model-based RL method to address the cold-start problem of RL-based recommender systems, which is trained using logged offline data.
Within our framework, we introduce a user context variable to infer user preference that enables adaptation on cold-start users with limited contextual information.
Both the user model and the recommendation agent are meta-learned by conditioning on this user context variable within the model-based RL framework. This approach ensures effective adaptation to new users. To tackle the challenge presented by the offline RL training of our proposed method, we introduce a mutual information regularizer between the meta-level user model and the recommendation agent.
Evaluation results demonstrate the superiority of our developed method. A theoretical analysis of the recommendation performance bound of the developed method is also provided.

2 Related Work

The related work of this article can be grouped into five categories, as discussed below.
RL-Based Recommender System. There are mainly two kinds of RL methods for recommendation: model-free RL methods [6, 27, 36, 38, 62, 66, 68] and model-based RL methods [2, 8, 67]. Model-free RL methods treat the environment as unknown and do not explicitly model the user. They usually need large amounts of interactions for policy optimization. To tackle this sample complexity challenge, model-based RL methods incorporate a user model, which can predict user behavior and reward. For instance, the generative adversarial user model [8] learns the user behavior model and reward function together in a unified min-max framework; the recommendation policy is then learned with reward from the trained user model. However, this model requires a large amount of data to estimate a particular user model, which is not feasible in the cold-start recommendation scenario. Besides, the user model and recommendation model are trained separately, which prevents them from benefiting from each other. Bai et al. [2] also proposed to use model-based RL for recommendation. They introduced a discriminator with adversarial training to let the user behavior and recommendation policy imitate the policy in logged offline data. The reward used to train the recommendation policy is weighted by the discriminator score. Their method can be seen as reward shaping [40, 42], which does not recover the true user reward function. Neither of the above two methods properly addresses the cold-start challenge in RL-based recommender systems. Recently, offline RL [34] methods have been used to learn a recommendation policy from the offline dataset [7, 39, 56, 61]. For example, Xiao et al. [61] proposed a general offline RL framework for recommendation, where supervised regularization, policy constraints, dual constraints, and reward extrapolation are introduced to minimize the distribution mismatch between the logging policy and the recommendation policy. Wang et al. [56] proposed a causal decision transformer to integrate offline RL and the transformer model. Gao et al. [17] combined causal inference with offline RL to burst filter bubbles in recommender systems. To alleviate the Matthew effect of offline RL in interactive recommendation, a state entropy term is added to relax the pessimism in the model-based offline RL algorithm [16]. In contrast with these methods, our method can recover the true user behavior and reward with a small amount of data by meta-learning the user model and recommendation model with a user context variable in a unified framework, and the mutual information regularization between the user policy and recommendation policy allows the two to benefit each other when learning from offline-logged user data.
Meta-Learning. Meta-learning aims to learn from a small amount of data and adapt quickly to new tasks [4, 12]. Context-based meta-learning approaches [11, 49] learn to infer task uncertainties by taking task experiences as input. For instance, Rakelly et al. [49] proposed to learn task context variables as probabilistic latent variables inferred from past experiences. The model-free RL policy is trained conditioned on the task variable to improve sample efficiency. In contrast, our method learns a user context variable to infer user preference within the offline model-based RL framework.
Cold-Start Recommendation. The cold-start problem of recommendation has been studied for a long time in the literature. Various approaches have been developed to address this problem [33, 35, 45, 57, 59, 71]. Among these, cross-domain recommendation techniques [5, 20, 21, 70] enhance performance in the target domain by leveraging user–item interactions from relevant source domains. However, they often rely on shared users across domains for knowledge transfer [5] and require source domain interaction data for target domain cold-start users [5], limiting their applications in our context. One particular type of method, which has been developed recently and is very related to this study, utilizes the meta-learning technique [10, 33, 64] to tackle the cold-start challenge. Generally speaking, these methods regard users as tasks and items as classes and use gradient-based model-agnostic meta-learning (MAML) [12] algorithm to enable fast adaptation to new users with few interactions. However, how to tackle the cold-start challenge of RL-based recommendations is still underexplored, which is indeed the research focus of this article.
IRL. IRL is the problem of learning reward functions from demonstrations [1, 43, 46], which avoids the need for reward engineering. For instance, Fu et al. [13] proposed an adversarial IRL (AIRL) framework to recover the true reward functions from demonstrations. IRL typically needs a large number of expert demonstrations to infer the true reward function, which is highly expensive in areas such as robotics. Recently, some works [18, 63] attempt to recover the reward function from a limited number of demonstrations with meta-IRL methods that incorporate context-based meta-learning into the AIRL framework. Comparatively, in our solution, we recover the user policy and reward function from offline user behavior data by leveraging the meta-IRL method. To better incorporate the user context information into the policy, we utilize a variational policy network conditioned on the user context variable. Besides, the meta-IRL-learned user model serves as the environment in our meta-level model-based RL framework.
Offline RL. Offline RL [34, 51] aims to learn policies from a static dataset consisting of past interactions with the environment. Most offline RL methods use constraints on the learned policy to prevent the policy from drifting away from offline data support. For instance, Wu et al. [60] use Kullback–Leibler (KL) divergence to regularize the learned policy to be closer to the behavior policy. Recently, Yu et al. [65] and Kidambi et al. [29] utilize uncertainty estimation as a reward penalty to constrain policy learning. Different from existing methods, we not only provide constraints on the policy with a novel mutual information regularization but also encourage policy adaptation with meta-learning.

3 Preliminaries

3.1 RL-Based Recommender Systems

Recent studies have demonstrated the potential of RL-based recommender systems in enhancing long-term user engagement [2, 8]. Such a system relies on an RL-based recommendation agent that regularly interacts with users. During each interaction, the agent presents the user with a list of recommendations. The user then chooses an item from the list and provides feedback to the agent, which is typically a measure of user utility or satisfaction and is treated as the reward for the agent.
Formally, the RL-based recommendation problem is modeled as a Markov decision process (MDP) with the following components:
Environment: Each user represents an environment that the recommendation agent interacts with. This environment provides feedback to the recommendation agent, encompassing the user's choice and reward.
State Space \(\boldsymbol{\mathcal{S}}\): \(\boldsymbol{s}_{t}\in\mathcal{S}\) is the user's historical clicks before time \(t\).
Action \(\boldsymbol{\mathcal{A}}\): The action \(\boldsymbol{A}_{t}\in\mathcal{A}\) corresponds to the top-\(k\) recommendation list generated by the recommendation agent at time \(t\).
State Transition Probability \(\boldsymbol{\mathcal{P}}\): \(p(\boldsymbol{s}_{t+1}|\boldsymbol{s}_{t},\boldsymbol{A}_{t})\) is the probability of transitioning from state \(\boldsymbol{s}_{t}\) to \(\boldsymbol{s}_{t+1}\) based on the recommendation list \(\boldsymbol{A}_{t}\). This also signifies the probability of the user's choice \(x_{t}\) from the recommendation list \(\boldsymbol{A}_{t}\) according to the user's choice or policy function.
Reward \(\boldsymbol{\mathcal{R}}\): \(r(\boldsymbol{s}_{t},\boldsymbol{A}_{t},x_{t})\) is the immediate reward for the agent's action \(\boldsymbol{A}_{t}\), which represents the user's utility or satisfaction after choosing \(x_{t}\in\boldsymbol{A}_{t}\) at state \(\boldsymbol{s}_{t}\).
Recommendation Policy \(\boldsymbol{\pi}\): \(\boldsymbol{A}_{t}\sim\pi(\boldsymbol{A}_{t}|\boldsymbol{s}_{t})\) is the recommendation agent's recommendation list given state \(\boldsymbol{s}_{t}\).
The recommendation agent's goal is to learn a policy \(\pi\) that maximizes the expected cumulative reward, symbolizing long-term user utility or satisfaction: \(\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r\left(\boldsymbol{s}_{t}, \boldsymbol{A}_{t},x_{t}\right)\right]\), with \(\gamma\) as the discount factor.
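As a concrete illustration of this objective, the following sketch (in Python; the `agent` and `user_env` objects and their `recommend`, `reset`, and `step` methods are hypothetical placeholders, not interfaces defined in this article) accumulates the discounted return over one interaction episode.

```python
def discounted_return(agent, user_env, gamma=0.99, max_steps=100):
    """Roll out the recommendation agent against a user environment and
    accumulate the discounted cumulative reward (long-term user utility)."""
    state = user_env.reset()                      # s_0: the user's click history so far
    total = 0.0
    for t in range(max_steps):
        slate = agent.recommend(state)            # A_t ~ pi(A_t | s_t), a top-k list
        click, reward, state, done = user_env.step(slate)  # user picks x_t, returns r_t and s_{t+1}
        total += (gamma ** t) * reward
        if done:                                  # e.g., the user's time budget is exhausted
            break
    return total
```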

3.2 Problem Statement

Our article primarily concentrates on the cold-start problem within the RL-based recommender system context. The data used to train the recommendation model are user–item interactions. We have a set of users \(\mathcal{U}=\left\{\mathcal{U}^{warm},\mathcal{U}^{cold}\right\}\), where \(\mathcal{U}^{warm}\) and \(\mathcal{U}^{cold}\) represent warm users and cold-start users, respectively. Importantly, there is no overlap between these two groups of users. The item set is denoted as \(\mathcal{V}=\left\{v_{1},\ldots,v_{j},\ldots,v_{|\mathcal{V}|}\right\}\). For every user \(u\in\mathcal{U}\), the sequence of interacted items is recorded as \(P_{u}=\left\{x_{1},\ldots,x_{t},\ldots,x_{|P_{u}|}\right\}\), which is maintained in chronological order. Additionally, each user \(u\) is associated with contextual information \(C_{u}\) such as user attributes \(u_{att}=\left\{u^{a}_{1},\ldots,u^{a}_{j},\ldots,u^{a}_{|u_{att}|}\right\}\). With the above setup, we can formally state the cold-start recommendation problem as follows:
Definition 1
(Problem Statement). The cold-start recommendation problem involves training the recommendation policy with offline warm users’ behavior data \(\left\{P_{u}:u\in\mathcal{U}^{warm}\right\}\) and subsequently making recommendations for a cold-start user \(u\in\mathcal{U}^{cold}\) using the user's contextual information \(C_{u}\).

4 Method

In this section, we present the proposed context-aware Mutual information regularized Meta-level Model-based RL approach for cold-start Recommendation, denoted as \(\rm M^{3}Rec\). We first introduce the overall framework and then elaborate on the details of the proposed method.

4.1 Overview

Our proposed model's architecture is depicted in Figure 1, comprising four key modules: the user context encoder, the meta-level user model, the meta-level recommendation agent, and the mutual information regularizer. The model adopts a meta-learning perspective to address the cold-start issue inherent in RL-based recommendations. With an aim to adapt the user model and recommendation agent for individual users who have limited contextual information, we employ the context-based meta-learning approach [11, 49]. The user context encoder learns to derive user context variable by inferring the user's preference from the contextual information of each user. Since both the state transition probability function \(p(s_{t+1}|s_{t},a_{t})\) and the reward function \(r(s_{t},a_{t},x_{t})\) are unknown, we adopt a model-based RL framework to explicitly model the user behavior dynamics from the offline data. This structure integrates the meta-level user model and the meta-level recommendation agent. Conditioning both the user model and recommendation agent on the user context variable enables these components to be meta-learned, which effectively enhances their adaptability to cold-start users. Specifically, within the model-based RL framework, the optimization of the meta-level recommendation agent is facilitated through interaction with the meta-level user model. The meta-level user model serves to approximate individual user dynamics, offering reward feedback to the meta-level recommendation agent. Accordingly, the meta-level recommendation agent is optimized utilizing the reward provided by the meta-level user model. Lastly, the model's training utilizes offline-logged user data and does not involve expensive online interactions with real users. To avoid accessing out-of-distribution data that exceed the support of the collected offline user data, we introduce a mutual information regularizer. This component establishes a link between the meta-level user model and meta-level recommendation agent, adhering to the principle of conservatism in offline RL literature [32, 50].
Figure 1.
Figure 1. Mutual information regularized meta-level model-based RL for cold-start recommendation (\(\rm M^{3}Rec\)) framework.

4.2 User Context Encoder

To learn a context variable for adaptation to different users, we adopt a user context encoder to summarize the user contextual information \(C_{u_{i}}\) into a context variable \(c_{u_{i}}\) for user \(u_{i}\). We first transform the raw context information into an embedding representation and then employ a deep neural network to obtain the user context variable
\[c_{u_i} = f^{cont}({\textbf{E}}(C_{u_i})),\]
where \(\textbf{E}\) is the embedding layer and \(f^{cont}\) is a multi-layer fully connected neural network with rectified linear unit (ReLU) activation functions.
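A minimal sketch of this encoder is shown below (PyTorch); the layer sizes and the mean-pooling of attribute embeddings are illustrative assumptions rather than details specified above.

```python
import torch
import torch.nn as nn

class UserContextEncoder(nn.Module):
    """Embedding layer E followed by a fully connected ReLU network f^cont,
    mapping raw user context information C_u to the context variable c_u."""
    def __init__(self, num_attribute_values, emb_dim=16, hidden_dim=128, ctx_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_attribute_values, emb_dim)   # E(.)
        self.f_cont = nn.Sequential(                               # f^cont(.)
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, ctx_dim),
        )

    def forward(self, attribute_ids):
        # attribute_ids: (batch, num_attributes) integer-coded attributes (e.g., gender, age)
        pooled = self.embed(attribute_ids).mean(dim=1)             # pool attribute embeddings
        return self.f_cont(pooled)                                 # user context variable c_u
```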

4.3 Meta-Level User Model

As the context-aware user choice and reward function are unknown, we aim to recover both the user choice and reward function from offline data in the meta-level user model. Considering limited context information from the cold-start users, we propose to utilize the context-based meta-learning [11, 49] to estimate the user behavior model from offline data. We first introduce the user choice or policy function \(\pi(x_{t}|\boldsymbol{s}_{t},\boldsymbol{A}_{t},c)\), which characterizes how the user chooses an item from the provided recommendation list. Then, we utilize the IRL method to recover the user choice function and the reward function without manually specifying the reward function.
Given the \(i\)th user's historical clicked item sequence before time \(t\), \(\{x_{i,1},x_{i,2},\ldots,x_{i,t-1}\}\), we first transform the clicked items into item embeddings \(\{\boldsymbol{e}_{i,1},\boldsymbol{e}_{i,2},\ldots,\boldsymbol{e}_{i,t-1}\}\) using the embedding matrix \(\textbf{E}^{u}\). We utilize a recurrent neural network with long short-term memory (LSTM) units [25] to summarize the state information maintained by the user model, \(\boldsymbol{s}_{i,t}^{u}\), as follows:
\[\boldsymbol{s}_{i, t}^{u}= LSTM(\boldsymbol{e}_{i, t - 1}, \boldsymbol{s}_{i, t-1}^{u}).\]
In the following notation, we omit the user index \(i\) for simplicity.
To learn a context-aware user choice function \(\pi(x_{t}|\boldsymbol{s}_{t}^{{u}},\boldsymbol{A}_{t},c)\), the user policy representation must encode the salient information of the user context variable \(c_{u_{i}}\). It is also desirable that the user policy representation can reason about the uncertainty of the user distribution so as to adapt to different users. Therefore, we adopt the variational inference approach [53] to infer a probabilistic latent user policy variable \(z_{u{,t}}\) at time \(t\), which is generated from the variational distribution \(q_{inf}(z_{u{,t}}|\boldsymbol{s}_{t}^{{u}}, c)\). To optimize the parameters for learning \(z_{u{,t}}\), we maximize the variational lower bound of \(\log p(\boldsymbol{s}_{t}^{{u}}|c)\)
\begin{align}\log p(\boldsymbol{s}_{t}^{u}|c) & \geq \mathbb{E}_{q_{inf}(z_{u,t}|\boldsymbol{s}_{t }^{u},c)}[\log p_{dec}(\boldsymbol{s}_{t}^{u}|z_{u,t},c)]\\&\quad{} - \beta D_{\mathrm{KL}}(q_{inf}(z_{u,t}|\boldsymbol{s}_{t}^{u},c)\|p(z_{u,t }|c)),\end{align}
(1)
where the first term is optimized to reconstruct current state \(\boldsymbol{s}_{t}^{{u}}\). The second term constrains the latent policy variable with a Gaussian prior.
In practice, following [53], \(q_{inf}\) and \(p_{dec}\) are parameterized by the inference neural network and decoding neural network, respectively. Specifically, the context-conditional variational autoencoder is defined as follows:
\begin{align} &\mu_{t}, \sigma_{t} = q_{inf}(\boldsymbol{s}_t^{u}, c), \\ &z_{u,t} \sim \mathcal{N}(\mu_t, \Sigma_t) ;\operatorname{diag}(\Sigma_t)=\sigma_t, \\ & \hat{\boldsymbol{s}}_t^{u} = p_{dec}(z_{u,t}, c),\end{align}
(2)
where we utilize the reparameterization trick [31] to sample \(z_{u,t}\) from \(\mathcal{N}(\mu_{t},\Sigma_{t})\). \(q_{inf}\) and \(p_{dec}\) are both three-layer multilayer perceptrons (MLPs) with ReLU activation functions. To reconstruct the current state information, we use the decoding network to predict the last click \(x_{t-1}\) in the user's historical clicks, which is the input information at time \(t\). For the latent user policy representation, we adopt \(z_{u,t}=\mu_{t}\) for stable training.
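The sketch below illustrates one way to realize the context-conditional variational autoencoder of Equation (2) together with the negative of the lower bound in Equation (1) (PyTorch); the hidden sizes, the standard Gaussian prior in place of \(p(z_{u,t}|c)\), and the cross-entropy form of the reconstruction term are our simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextCVAE(nn.Module):
    """q_inf infers a Gaussian over the latent user policy variable z_{u,t} from
    (s_t^u, c); p_dec reconstructs the current state, here as logits over the
    item set for the last click x_{t-1}. Hidden sizes are illustrative."""
    def __init__(self, state_dim, ctx_dim, z_dim, num_items, hidden=128):
        super().__init__()
        self.q_inf = nn.Sequential(                      # three-layer inference MLP
            nn.Linear(state_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),                # outputs [mu_t, log sigma_t]
        )
        self.p_dec = nn.Sequential(                      # three-layer decoding MLP
            nn.Linear(z_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_items),                # predicts the last click x_{t-1}
        )

    def forward(self, s_u, c):
        mu, log_sigma = self.q_inf(torch.cat([s_u, c], dim=-1)).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterization trick
        recon_logits = self.p_dec(torch.cat([z, c], dim=-1))
        return recon_logits, mu, log_sigma

def negative_elbo(recon_logits, last_click, mu, log_sigma, beta=1e-3):
    """Negative of the variational lower bound in Equation (1): reconstruction
    term plus a beta-weighted KL to a standard Gaussian prior."""
    recon = F.cross_entropy(recon_logits, last_click)
    kl = -0.5 * (1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp()).sum(-1).mean()
    return recon + beta * kl
```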
Given the recommendation list \(\boldsymbol{A}_{t}=\{a_{1},\cdots,a_{k}\}\), the probability of choosing item \(x_{t}\in\boldsymbol{A}_{t}\) is based on the latent user representation \(z_{u,t}\) as follows:
\begin{align*}p(x_{t}|\boldsymbol{s}_{t}^{u},\boldsymbol{A}_{t},c)=\frac{{\rm exp}(f^{cho}(f^{pref}(z_{u,t}) ||f^{rec}(\boldsymbol{e}_{x_{t}})))}{\sum_{i=1}^{\left|\boldsymbol{A}_{t}\right|}{\rm exp}(f^{ cho}(f^{pref}(z_{u,t})||f^{rec}(\boldsymbol{e}_{a_{i}})))},\end{align*}
where \(f^{pref}\) and \(f^{rec}\) are one-layer MLPs that encode the user's preference and the representation of candidate items, respectively. \(\|\) denotes the concatenation operation, and \(f^{cho}\) is a three-layer MLP with ReLU activation functions that models the user's choice from the concatenated preference and candidate item representations.
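A possible implementation of this choice function is sketched below (PyTorch); tensor shapes and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class UserChoiceHead(nn.Module):
    """Scores each candidate in the slate A_t by concatenating a preference
    encoding of z_{u,t} with the candidate item encoding, then normalizes the
    scores with a softmax to obtain p(x_t | s_t^u, A_t, c)."""
    def __init__(self, z_dim, item_dim, hidden=128):
        super().__init__()
        self.f_pref = nn.Linear(z_dim, hidden)        # one-layer preference encoder f^pref
        self.f_rec = nn.Linear(item_dim, hidden)      # one-layer item encoder f^rec
        self.f_cho = nn.Sequential(                   # three-layer choice scorer f^cho
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, slate_emb):
        # z_u: (batch, z_dim); slate_emb: (batch, k, item_dim) embeddings of A_t
        k = slate_emb.size(1)
        pref = self.f_pref(z_u).unsqueeze(1).expand(-1, k, -1)
        scores = self.f_cho(torch.cat([pref, self.f_rec(slate_emb)], dim=-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)          # choice probabilities over the slate
```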
The above describes the modeling of the user choice function \(\pi_{\phi}(x_{t}|\boldsymbol{s}_{t}^{u},\boldsymbol{A}_{t},c)\), where \(\phi\) denotes the parameters involved in the user choice modeling. We aim to recover both the actual user policy \(\pi_{\phi}\) and the reward function \(r_{\omega}\). Inspired by AIRL [13], we recover the context-aware user policy and reward function from offline data by optimizing the following objective:
\begin{align}\min_{\pi_{\phi}}\max_{D_{\omega}}\mathbb{E}_{p(\boldsymbol{A},c)}[ \mathbb{E}_{\rho^{true}(\boldsymbol{s},x|\boldsymbol{A},c)}[\log D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c) ]+\mathbb{E}_{\rho^{\pi_{\phi}}(\boldsymbol{s},x|\boldsymbol{A},c)}[\log(1-D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))]],\end{align}
(3)
where we omit the subscript \(t\) and superscript \(u\) for the ease of notation. This objective forms an adversarial game between the user policy function and the discriminator. The discriminator aims to distinguish the user behavior sampled from the learned user policy function and the real user–item interactions, while the user policy is trained to confuse the discriminator. The discriminator function takes the form
\begin{align*}D_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c)=\frac{\exp(g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))}{(\exp (g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c))+\pi_{\phi}(x|\boldsymbol{s},\boldsymbol{A},c))},\end{align*}
where \(g_{\omega}\) contains the reward approximator \(r_{\omega}\) and the reward shaping term \(h_{\varphi}\): \(g_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c)=r_{\omega}(\boldsymbol{s},x, \boldsymbol{A},c)+\gamma h_{\varphi}(\boldsymbol{s}^{\prime})-h_{\varphi}(\boldsymbol{s})\), where \(\gamma\) is the discount factor and \(\boldsymbol{s}^{\prime}\) is the next state of state \(\boldsymbol{s}\). The reward shaping term \(h_{\varphi}\) is modeled using a one-layer MLP. The reward function \(r_{\omega}\) is modeled as follows:
\begin{align*}r_{\omega}(\boldsymbol{s},x,\boldsymbol{A},c) & =f^{r}(x^\mathit{max}||x||c), \\x^\mathit{max} & =\operatorname{argmax}_{x\in\boldsymbol{A}}(f^{d\_pref}(\boldsymbol{s}))^{ \top}f^{d\_rec}(\boldsymbol{e}(x)),\end{align*}
where \(x^{max}\) represents the user's preferred item in the recommendation list. \(f^{d\_pref}\) and \(f^{d\_rec}\) are one-layer MLPs, and \(f^{r}\) is a three-layer MLP with ReLU activation functions. To enable gradient backpropagation through the \(\operatorname{argmax}\) operation, we utilize a softmax with temperature.
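The following fragment sketches the discriminator's functional form and a temperature-controlled softmax surrogate for the \(\operatorname{argmax}\) (PyTorch); how the candidate scores and \(g_{\omega}\) are computed upstream is assumed rather than shown.

```python
import torch

def soft_argmax_item(scores, item_embeddings, temperature=0.1):
    """Differentiable surrogate for the argmax inside r_omega: a softmax with
    temperature over candidate scores yields a soft embedding of the user's
    preferred item x^max, so gradients can propagate through the selection."""
    weights = torch.softmax(scores / temperature, dim=-1)        # (batch, k)
    return torch.einsum('bk,bkd->bd', weights, item_embeddings)  # soft x^max embedding

def airl_discriminator(g_omega, pi_prob):
    """AIRL-form discriminator D = exp(g) / (exp(g) + pi_phi(x | s, A, c)), where
    g_omega(s, x, A, c) = r_omega(s, x, A, c) + gamma * h(s') - h(s) is computed
    elsewhere from the reward approximator and the shaping network."""
    return torch.exp(g_omega) / (torch.exp(g_omega) + pi_prob)
```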
Specifically, based on Equation (3), during training, we can alternately update parameters of user policy \(\pi_{\phi}\) and discriminator \(D_{\omega}\). The objective for training the user policy \(\pi_{\phi}\) is
\begin{align} \max _{\phi} &\ \mathbb{E}_{u_{i} \sim p(\mathcal{U}),\, c_{u_i},\, \tau_{u_i} \sim \rho^{\pi_{\phi}}(\tau_{u_i} | \boldsymbol{A}, c_{u_i})} \sum_{t=1}^{T} \log D_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i}) - \log (1-D_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i})) \\ &=\mathbb{E}_{u_i \sim p(\mathcal{U}),\, c_{u_i},\, \tau_{u_i} \sim \rho^{\pi_{\phi}}(\tau_{u_i} | \boldsymbol{A}, c_{u_i})} \sum_{t=1}^{T} g_{\omega}(\boldsymbol{s}_t^{u}, x_t, \boldsymbol{A}_t, c_{u_i}) - \log \pi_{\phi}(x_t | \boldsymbol{s}_t^{u}, \boldsymbol{A}_t, c_{u_i}), \end{align}
(4)
where \(\tau_{u_{i}}\) is user \(u_{i}\)'s behavior sequence generated from user's policy \(\pi_{\phi}\) interacting with the recommendation agent. We can train the meta-level user policy \(\pi_{\phi}\) using the policy gradient (PG) algorithm [55].
The objective for training the discriminator is
\begin{align} \max_{D_{\omega}}\mathbb{E}_{u_{i}\sim p(\mathcal{U})}\bigg[ & \mathbb{E}_{c_{u_{i}},\tau_{u_{i}}\sim\rho^{\pi_{\phi}}(\tau_{u_{i}}|\boldsymbol{A},c_{u_{i}})}\sum_{t=1}^{T}\log(1-D_{\omega}(\boldsymbol{s}_{t}^{u},x_{t},\boldsymbol{A}_{t},c_{u_{i}})) \\& + \mathbb{E}_{c_{u_{i}},P_{u_{i}}\sim\rho^{true}({u_{i}})}\sum_{t=1}^{T}\log D_{\omega}(\boldsymbol{s}_{t}^{u},x_{t},\boldsymbol{A}_{t},c_{u_{i}})\bigg],\end{align}
(5)
where \(P_{u_{i}}\) is the sampled real behavior sequence of user \(u_{i}\). Similar to AIRL [13], when the context-aware user policy and discriminator are trained to optimality, we can recover the true user policy and the true reward function up to a constant, which approximates the real user model. We also utilize the offline data to estimate the meta-level user model by maximizing the likelihood, which stabilizes the training process.

4.4 Meta-Level Recommendation Agent

With the above estimated user policy function \(\pi_{\phi}\) and reward function \(r_{\omega}\) serving as the user environment model, we can learn the recommendation policy \(\pi_{\theta}\) to maximize the cumulative reward. To adapt to different users with limited context information, we also adopt the context-based meta-learning approach [11, 49] to learn a context-aware recommendation policy. Similar to the meta-level user model, we use a variational recommendation policy conditioned on the user context variable so that the recommendation policy is aware of the user preference. The latent recommendation policy variable at time \(t\), denoted as \(z_{rec,t}\), is induced from the variational distribution \(q_{inf}(z_{rec,t}|\boldsymbol{s}_{t}^{rec},c)\), where \(\boldsymbol{s}_{t}^{rec}\) is the state information maintained by the recommendation agent. We optimize the lower bound of \(\log p(\boldsymbol{s}_{t}^{rec}|c)\):
\begin{align} \log p(\boldsymbol{s}_{t}^{rec}|c)& \geq\mathbb{E}_{q_{inf}(z_{rec,t}|\boldsymbol{s }_{t}^{rec},c)}[\log p_{dec}(\boldsymbol{s}_{t}^{rec}|z_{rec,t},c)] \\&\quad{} -\beta D_{\mathrm{KL}}(q_{inf}(z_{rec,t}|\boldsymbol{s}_{t}^{rec},c)\|p(z _{rec,t}|c)).\end{align}
(6)
The model details for learning \(z_{rec,t}\) are similar to those for learning \(z_{u,t}\) in Equation (2). Then, based on the latent recommendation policy variable \(z_{rec,t}\), the agent generates a recommendation list of size \(k\). Specifically, we utilize a two-layer MLP with ReLU activation functions and softmax normalization to output a probability vector over the entire item set \(\mathcal{V}\). The items with the top-\(k\) probabilities are then selected as the recommendation list \(\boldsymbol{A}_{t}\).
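A minimal sketch of this policy head is given below (PyTorch); the hidden size and the returned slate format are illustrative choices.

```python
import torch
import torch.nn as nn

class RecommendationPolicyHead(nn.Module):
    """Two-layer ReLU MLP with softmax output over the item set V; the top-k
    most probable items form the recommendation list A_t."""
    def __init__(self, z_dim, num_items, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_items),
        )

    def forward(self, z_rec, k=5):
        probs = torch.softmax(self.net(z_rec), dim=-1)   # distribution over all items
        slate = probs.topk(k, dim=-1).indices            # indices of the top-k items: A_t
        return probs, slate
```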
The objective for training the recommendation policy is as follows:
\begin{align}\max_{\theta}\mathbb{E}_{u_{i}\sim p(\mathcal{U}),c_{u_{i}},\tau_{u_{i}}\sim \rho^{\pi_{\theta}}(\tau_{u_{i}}|\boldsymbol{A},c_{u_{i}}\!)}\sum_{t=1}^{T}r_{\omega}(\boldsymbol{s}_{t}^{rec},x_{t},\boldsymbol{A}_{t},c_{u_{i}}\!),\end{align}
(7)
which can be optimized by using a PG algorithm [55]. We also utilize the offline data with maximum likelihood training to stabilize the training process.
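As an illustration of such a PG update, the sketch below applies a plain REINFORCE step to Equation (7), weighting the log probabilities of the rolled-out actions by discounted returns computed from the user-model rewards; the absence of a baseline and the exact batching are simplifications on our part.

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE-style update: maximize the expected cumulative reward
    r_omega provided by the meta-level user model. log_probs is a list of
    per-step log probabilities (with gradients) from the rolled-out policy."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```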

4.5 Mutual Information Regularizer

Given that online interactions are very expensive and difficult to perform, the user model and recommendation policy are trained with offline data. Therefore, it is conceivable that the recommendation policy may offer recommendations for which the user model is unable to provide precise feedback. This phenomenon poses a fundamental challenge for offline RL of the proposed model due to the distribution shift between user–item interaction data produced by the policy and the existing offline data. In accordance with the principle of conservatism widely adopted in the offline RL literature [32, 50], we propose the inclusion of a mutual information regularizer between the user model and the recommendation agent. This regularizer encourages the recommendation agent to propose recommendations within the confidence region of the user model, consequently enhancing the accuracy of feedback, and it also fosters the user model's ability to offer more precise feedback on the recommendation agent's recommendations. A theoretical analysis of the influence of the mutual information regularization on recommendation performance is provided in Section 4.7.
To realize the mutual information constraint between the user model and the recommendation agent, we utilize the latent user policy variable \(z_{u{,t}}\) and recommendation policy variable \(z_{rec{,t}}\). This can be expressed by the following equation:
\begin{align}\mathcal{I}(z_{u,t};z_{rec,t})=D_{\mathrm{KL}}\left(\mathbb{P}_{z_{u,t}z_{rec, t}}\|\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}\right),\end{align}
(8)
where \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) denotes the joint distribution and \(\mathbb{P}_{z_{u,t}}\) and \(\mathbb{P}_{z_{rec,t}}\) are marginal distributions.
The aim is to maximize \(\mathcal{I}(z_{u,t};z_{rec,t})\) to model the dependency between the user model and the recommendation agent. Nevertheless, estimating mutual information in high-dimensional space is nontrivial. Inspired by [3, 44], we opt for maximizing the lower bound of the Jensen–Shannon mutual information \(\mathcal{I}^{(JSD)}(z_{u,t};z_{rec,t})\) based on Jensen-Shannon divergence to maintain stable training. This can be formulated as
\begin{align}\mathcal{I}^{(JSD)}(z_{u,t};z_{rec,t}) & \geq\sup_{\psi\in\Psi} \mathbb{E}_{\mathbb{P}_{z_{u,t}z_{rec,t}}}\left[-\mathit{sp}\left(-T_{\psi}(z_{u,t},z_{ rec,t})\right)\right] \\&\quad{} -\mathbb{E}_{\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}} \left[\mathit{sp}\left(T_{\psi}(z_{u,t},z_{rec,t})\right)\right],\end{align}
(9)
where \(T_{\psi}:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}\) is a neural network function with parameter \(\psi\) and \(\operatorname{sp}(z)=\log(1+e^{z})\) is the softplus function.
During the training process, we strive to maximize the lower bound denoted as \(\mathcal{L}_{mutual}\) in Equation (9). The corresponding parameters in the meta-level user model and meta-level recommendation agent model are updated alternately. For instance, while updating the respective parameters in the meta-level recommendation agent model using \(\mathcal{L}_{mutual}\), the parameters of the meta-level user model are kept constant. To estimate the joint distribution \(\mathbb{P}_{z_{u,t}z_{rec,t}}\) in Equation (9), we construct samples from the same users at time \(t\) (e.g., \(i\)th user, (\(z_{u,t}^{i},z_{rec,t}^{i}\))). Conversely, the marginal distributions \(\mathbb{P}_{z_{u,t}}\otimes\mathbb{P}_{z_{rec,t}}\) are estimated by sampling from a different user (e.g., (\(z_{u,t}^{i},z_{rec,t}^{j}\)), where \(j\neq i\)).
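A compact sketch of this estimator is shown below (PyTorch); the architecture of the statistics network \(T_{\psi}\) and the in-batch shuffling used to form marginal pairs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JSDMutualInfo(nn.Module):
    """Jensen-Shannon mutual information lower bound of Equation (9): T_psi scores
    (z_u, z_rec) pairs; joint pairs come from the same user at the same time step,
    marginal pairs from shuffling z_rec across users in the batch."""
    def __init__(self, zu_dim, zrec_dim, hidden=128):
        super().__init__()
        self.T = nn.Sequential(
            nn.Linear(zu_dim + zrec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, z_rec):
        joint = self.T(torch.cat([z_u, z_rec], dim=-1))
        perm = torch.randperm(z_rec.size(0), device=z_rec.device)
        marginal = self.T(torch.cat([z_u, z_rec[perm]], dim=-1))
        # L_mutual = E_joint[-softplus(-T)] - E_marginal[softplus(T)]
        return (-F.softplus(-joint)).mean() - F.softplus(marginal).mean()
```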
Algorithm 1 M3Rec Meta-Training
1:  Input: Offline data \(\left\{P_{u}:u\in\mathcal{U}^{warm}\right\}\); Meta-level user policy \(\pi_{\phi}\); Meta-level recommendation policy \(\pi_{\theta}\); Context encoder \(f^{cont}\); Discriminator \(D_{\omega}\).
2:  Initialize simulated user buffers \(\mathcal{B}^{s}_{i}\) and real user buffers \(\mathcal{B}^{r}_{i}\) for each training user.
3:  Pre-train \(\pi_{\phi},f^{cont}\) using maximum likelihood estimation on the offline data.
4:  Pre-train \(\pi_{\theta}\) using maximum likelihood estimation on the offline data.
5:  Sample a batch of users; for each user \(i\), add a batch of simulated sequences to \(\mathcal{B}^{s}_{i}\) and add a batch of true user interaction sequences from offline data to \(\mathcal{B}^{r}_{i}\). Pre-train \(D_{\omega}\) using Equation (5) with \(\mathcal{B}^{s}_{i}\) and \(\mathcal{B}^{r}_{i}\).
6:  repeat
7:     for \(e\) times do
8:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
9:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
10:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s}, c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
11:        Update \(\pi_{\theta},f^{cont}\) by using policy gradient in Equation (7), maximizing the lower bound of mutual information in Equation (9) by fixing the user model, and maximizing the variational lower bound in Equation (6) with samples from \(\mathcal{B}^{s}_{i}\).
12:        Update \(\pi_{\theta},f^{cont}\) using maximum likelihood estimation on the offline data.
13:     end for
14:     for \(m\) times do
15:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
16:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
17:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\phi}(x | \boldsymbol{s}, \boldsymbol{A},c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
18:        Update \(\pi_{\phi},f^{cont}\) by using policy gradient in Equation (4), maximizing the lower bound of mutual information in Equation (9) by fixing the recommendation model, and maximizing the variational lower bound in Equation (1) with samples from \(\mathcal{B}^{s}_{i}\).
19:        Update \(\pi_{\phi},f^{cont}\) using maximum likelihood estimation on the offline data.
20:     end for
21:     for \(d\) times do
22:        Sample user \(u_{i}\) uniformly from \(\mathcal{U}^{warm}\).
23:        For user \(u_{i}\), infer the user context variable \(c_{u_{i}}\) with the context information \(C_{u_{i}}\) using the user context encoder in Section 4.2.
24:        For user \(u_{i}\), empty \(\mathcal{B}^{s}_{i}\) and generate interaction sequences from \(\pi_{\phi}(x | \boldsymbol{s}, \boldsymbol{A},c_{u_i})\) with \(c_{u_{i}}\) fixed during each rollout. Add these rollouts to \(\mathcal{B}^{s}_{i}\).
25:        Empty \(\mathcal{B}^{r}_{i}\) and sample true user behavior sequence \(P_{u_{i}}\) from offline data added to \(\mathcal{B}^{r}_{i}\).
26:        Update \(D_{\omega},f^{cont}\) using Equation (5) with samples from \(\mathcal{B}^{s}_{i}\) and \(\mathcal{B}^{r}_{i}\).
27:     end for
28:  until Convergence

4.6 Training

In this section, we introduce the training and testing procedures for the proposed \(\rm M^{3}Rec\) model. Training proceeds by alternately updating three key components, namely the recommendation policy, the user model, and the discriminator, together with the parameters of the user context encoder. This process is outlined in Algorithm 1. Upon completion of training, we evaluate the recommendation policy by conditioning it on the context information of the test user, as described in Algorithm 2. We employ both simulated online evaluation and offline evaluation to assess the performance of the model.

4.7 Theoretical Analysis

In this section, we provide a theoretical analysis of the performance bound of the recommender policy when adapting to meta-test users by our meta-level model-based RL framework. Proofs can be found in Appendix A.
To provide our theoretical analysis, let us first introduce some notation; here, we slightly abuse notation for simplicity. We denote the action at time \(t\) as \(a_{t}\), which corresponds to \(x_{t}\) for the user and \(\boldsymbol{A}_{t}\) for the recommender agent as defined in Section 3. We use \(\mu_{u_{i}}^{\pi_{\theta}}=\frac{1}{T}\sum_{t=0}^{T}P(\boldsymbol{s}_{t}= \boldsymbol{s},a_{t}=a)\) to denote the average state-action (\(\boldsymbol{s}\), \(a\)) visitation distribution when executing recommendation policy \(\pi_{\theta}\) in user model \(u_{i}\), where \(u_{i}\) can be the approximated user model \(u_{i}^{m}\) in our meta-level model-based RL or the true user model \(u_{i}^{w}\) in the real world. Then for user \(u_{i}\), the modeling error between \(u_{i}^{m}\) and \(u_{i}^{w}\) under the state-action distribution \(\mu\) is \(\ell(u_{i}^{m},\mu)=\mathbb{E}_{(\boldsymbol{s},a)\sim\mu}[D_{\mathrm{KL}}(P_{ u_{i}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{i}^{m}}(\cdot|\boldsymbol{s},a))]\), where \(D_{\mathrm{KL}}\) is the KL divergence, and \(P_{u_{i}^{w}}\) and \(P_{u_{i}^{m}}\) represent the user transition dynamics (i.e., user policy \(\pi_{\phi}\)) in the true user model \(u_{i}^{w}\) and the approximated meta-level user model \(u_{i}^{m}\), respectively. The performance of recommendation policy \(\pi_{\theta}\) under user model \(u_{i}\) is \(J(\pi_{\theta},u_{i})=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{u_{i},t}]\), where \(\gamma\) is the discount factor. Then, we can derive the performance bound of the recommendation policy learned in our meta-level model-based RL framework.
Algorithm 2 M3Rec Meta-Test
1:  Input: test user \(u_{test}\sim\mathcal{U}^{cold}\); the test user's context information \(C_{u_{test}}\).
2:  Infer the test user context variable \(c_{test}\) with the test user context information \(C_{u_{test}}\) using the user context encoder in Section 4.2.
3:  Test the adapted recommendation policy \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s}, c_{test})\) in the setting of simulated online evaluation or offline evaluation with real-world datasets.
Theorem 1
Suppose the meta-level user model and meta-level recommendation policy are trained to optimality on meta-training users. When adapting to a meta-test user with few behavior sequences to infer the user context variable, the test user's policy is obtained as \(\pi_{\phi}(x|\boldsymbol{s},\boldsymbol{A},c_{test})\) with the corresponding recommendation policy as \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\). Suppose the modeling error of this test user model satisfies \(\ell(u_{test}^{m},\mu_{u_{test}^{w}}^{\pi_{\theta}})\leq\epsilon_{u_{test}^{m} }^{adapt}\). The performance of \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\) under this test user's model satisfies \(J(\pi_{\theta},u_{test}^{m})\geq\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{ \prime},u_{test}^{m})-\epsilon_{\pi_{\theta}}^{adapt}\). Let \(\pi_{\theta}^{*}\) denote the optimal recommendation policy, whose performance is \(J_{u_{test}^{w}}^{*}=\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{\prime},u_{ test}^{w})\). We also suppose the modeling error of this test user model on the optimal recommendation policy \(\pi_{\theta}^{*}\) satisfies \(\ell(u_{test}^{m},\mu_{u_{test}^{w}}^{\pi_{\theta}^{*}})\leq\epsilon_{u_{test} ^{m}}^{adapt}\), as the meta-training process can help the model adapt to different recommendation policies. Then, the performance bound between the learned recommendation policy \(\pi_{\theta}\) and the optimal policy \(\pi_{\theta}^{*}\) on real meta-test users is as follows:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}, u_{test}^{w}\right) & \leq\epsilon_{\pi_{\theta}}^{adapt}+\frac{4\gamma R_{\max}\sqrt{ \epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\end{align*}
Remark. In this theorem, the gap between the recommender policy performance in our model trained on meta-training users and the optimal policy in the real-world test users comes from two error terms.
The first term is related to the sub-optimality of the meta-policy optimization as well as the generalization error of the meta-level recommendation policy on the meta-test user. It can be reduced by sufficient training of the meta-level recommendation policy. The mutual information regularization between the user policy and the recommendation policy can also help reduce the error \(\epsilon_{\pi_{\theta}}^{adapt}\).
The second term is related to the user model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) on the new meta-test user recommendation policy \(\pi_{\theta}\) and its optimal recommendation policy \(\pi_{\theta}^{*}\). As our meta-level user model meta-learns from the distribution of users and optimizes its prediction performance on different meta-level recommendation policies, the model adaptation error \(\epsilon_{u_{test}^{m}}^{adapt}\) can be small. Intuitively, the mutual information regularization between the meta-level user model and meta-level recommendation agent can benefit each other. For the meta-level user model, this mutual information can inform the user model to improve its prediction accuracy on the area visited by the recommendation agent. For the meta-level recommendation agent, it is encouraged to visit the area where the user model has high confidence. Hence, this mutual information regularization further helps reduce the model estimation error \(\epsilon_{u_{test}^{m}}^{adapt}\) in the offline setting. Therefore, these two error terms can be reduced by meta-learning and mutual information regularization in our method. The performance of the recommender policy learned on meta-training users in our offline meta-level model-based RL framework can approximate the optimal policy in the real-world meta-test users.

5 Experiment

In this section, we perform a thorough experimental study to validate the effectiveness of our proposed \(\rm M^{3}Rec\) method in handling cold-start recommendation scenarios. We first provide a description of our experimental setup, including implementation details and the baseline methods employed for comparison. Then, we show the simulated online evaluation experiment. After that, we present the offline evaluation results with real-world datasets. Specifically, we seek to answer the following research questions (RQ):
RQ1:
What is the performance of \(\rm M^{3}Rec\) on the cold-start recommendation task?
RQ2:
When the offline trained \(\rm M^{3}Rec\) is utilized as an initialization for subsequent online learning, how does it perform?
RQ3:
How does the model perform on the first few interactions of cold-start test users?
RQ4:
What impacts do the essential components of \(\rm M^{3}Rec\) have on the overall recommendation performance?
RQ5:
How sensitive is the proposed \(\rm M^{3}Rec\) model to variations in the user model rollout length during training?

5.1 Experimental Setup

5.1.1 Implementation Details.

The parameters were selected based on the recommendation performance on the validation set. The number of layers in the user context encoder was tuned in the set \(\{1,2,3\}\). The embedding size of the context information in the user context encoder was tuned within \(\{8,16,32,64\}\). The item embedding size was set to 50. The hidden size of the MLP neural networks employed in the model was tuned within \(\{64,128,256,512\}\). The number of MLP layers is detailed in the model description in Section 4. The weight \(\beta\) in the variational lower bound was tuned within \(\{0.00001,0.0001,0.001,0.01,0.1,1\}\). The user model rollout length was tuned in the range \(\{5,10,15,20\}\). For pre-training, we used the Adam optimizer [30] with a learning rate of 0.001. Following pre-training, a smaller learning rate was tuned within \(\{0.0005,0.0001,0.00001,0.000001\}\). Lastly, the training frequencies \(e\), \(m\), and \(d\) in Algorithm 1 were tuned within the set \(\{1,3,5,10\}\).
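For quick reference, the reported settings can be collected into a plain configuration dictionary (the key names are ours; the values are those stated above, with lists denoting tuned grids and scalars denoting fixed values).

```python
# Hyperparameter settings reported in Section 5.1.1, gathered in one place.
hyperparameters = {
    'context_encoder_layers': [1, 2, 3],
    'context_embedding_size': [8, 16, 32, 64],
    'item_embedding_size': 50,
    'mlp_hidden_size': [64, 128, 256, 512],
    'beta_variational_lower_bound': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    'user_model_rollout_length': [5, 10, 15, 20],
    'pretrain_optimizer': 'Adam',
    'pretrain_learning_rate': 1e-3,
    'post_pretrain_learning_rate': [5e-4, 1e-4, 1e-5, 1e-6],
    'update_frequencies_e_m_d': [1, 3, 5, 10],
}
```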

5.1.2 Baseline Methods.

We compare our method with several state-of-the-art approaches that are broadly classified into non-RL cold-start recommendation methods and RL-based methods. The former includes Meta-LSTM and MeLU, while the latter consists of Meta-PG, IRecGAN, generative adversarial user model assisted policy gradient optimization (GAN-PG), and BCQ.
Meta-LSTM: Recurrent neural networks have shown success for session-based recommendation [24, 47]. Here, we utilize LSTM as the recommendation policy network with context-based meta-learning method [11, 49] by incorporating a user context variable to adapt to the cold-start recommendation setting.
MeLU [33]: MeLU utilizes the MAML approach to estimate user's preference for cold-start recommendation.
Meta-PG: Similar to [49], Meta-PG is a meta-RL method integrating a user context variable to infer user's preference to adapt to new users.
IRecGAN [2]: IRecGAN is a model-based RL method for recommendation that adversarially trains the user behavior model against a discriminator, which evaluates the quality of the generated data. The recommendation policy is trained using the reward from the discriminator.
GAN-PG [8]: GAN-PG models the user behavior dynamics and recovers the reward function via a closed-form solution using generative adversarial training. When the training of the user model ends, it serves as the environment model for the recommender agent.
Batch-constrained deep Q-learning (BCQ) [14, 15]: BCQ is a model-free offline RL method that restricts the agent's action space to constrain the learned policy to be close to the behavior policy that produced the offline data. Specifically, we utilize BCQ in the discrete-action setting [14].
For the baseline methods, the item embedding size was set to 50, as in our method, and the learning rate was tuned in the range \(\{0.0001,0.0005,0.001\}\). The remaining parameters of the baseline methods were set according to the recommendations in their original articles and tuned based on performance on the validation set. For a fair comparison, all the RL-based methods, along with our proposed method, utilize the REINFORCE algorithm for policy optimization [58]. The context information utilized in these baseline methods is consistent with our method.

5.2 Simulated Online Evaluation

As it is difficult to conduct an online evaluation in which real users interact with the recommendation policy, we carry out the online evaluation in a simulated environment, following previous works [2, 8].

5.2.1 Simulated Environment.

To simulate the behavior of different users, we utilize the open-source simulator RecSim [26], which provides sequential interaction with users. We consider the interest evolution environment, where the task is video recommendation. Users' interests evolve over time, and the goal is to maximize long-term user engagement (e.g., video watch time). The environment consists of a user model, a video model, and a user-choice model. The user model can sample a set of users from a distribution over configurable user features. For example, we can configure the interests of users by changing the user-specific interest value for different topics, so it is flexible to configure users with different preferences. In the user configuration, the user interest vector represents a user's interest in the different document (video) topics, where each dimension ranges from \(-1\) to 1 (1 = very interested, \(-1\) = disgusted). Another important parameter is alpha, which influences the change of the user's time budget. Specifically, we configure a \(d\)-dimensional user interest vector, where each dimension is sampled uniformly from \(-1\) to 1 and \(d=20\) is the number of topics. For alpha, we uniformly sample its value from 0 to 1. We use default configurations for the other parameters. The number of document candidates is set to 200.
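A hedged sketch of instantiating this environment with the settings above is shown below; the configuration keys follow RecSim's public examples, while the per-user interest and alpha sampling described in the text are handled through RecSim's configurable user model and are not reproduced here.

```python
from recsim.environments import interest_evolution

# Interest evolution environment configuration (keys as in RecSim's examples).
env_config = {
    'num_candidates': 200,      # number of candidate documents per step
    'slate_size': 5,            # recommendation list size k
    'resample_documents': True,
    'seed': 0,
}
env = interest_evolution.create_environment(env_config)

observation = env.reset()
slate = [0, 1, 2, 3, 4]                              # placeholder slate of document indices
observation, reward, done, info = env.step(slate)    # reward relates to watch time
```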

5.2.2 Experiment Setting.

We first generate offline data for model training. As we aim to train the model in the offline setting without online interaction, we generate three datasets of different offline data qualities. Random: roll out a randomly initialized policy in the RecSim environment. Medium: partially train a Meta-PG method in the RecSim environment and then let it interact with the simulator. Expert: interact with users using the fully trained Meta-PG method. Each user's time budget is set to 1,000 minutes in the interest evolution environment. Each dataset contains 200 users, and the number of sessions for each user is 20. Each dataset records the recommendation list, the user's click, and the user's reward at each time step of each session. For the meta-training and meta-test users, we utilize two sets of user configurations without intersection, as shown in Appendix B. Similarly, we construct the validation dataset with 500 users sampled from the meta-training user distribution. For the meta-test, we perform online interaction with 500 meta-test users and configure the time budget as 1,000 minutes for each user. The cumulative video watch time is calculated as the return for each user. We use the averaged user return as the evaluation metric and report the performance for recommendation lists of size \(k=3,5,10\).
In the case of our simulated users, no user profile information exists. Therefore, we adopt a single user session sequence as the context information for a cold-start test user. This unique interaction sequence is obtained by deploying a recommendation policy to interact with the users according to the dataset type. With regard to context-based meta-learning methods, this individual session sequence is leveraged to infer the user context variable. Notably, in the MeLU baseline method, which is based on the MAML approach, the same single-session sequence of the test user is utilized as the support set for fine-tuning on the cold-start test user.

5.2.3 Online Evaluation (RQ1).

To answer RQ1, we present an overall comparison with the baseline methods using simulated online evaluation for the cold-start recommendation task. Table 1 shows the average cumulative rewards of all competing methods, where a higher cumulative reward indicates that the recommender system can better satisfy users' evolving interests and lead to longer user engagement time. It can be observed that the proposed \(\rm{M^{3}Rec}\) method outperforms all the baseline methods across the datasets with different qualities and slate sizes (i.e., the size of the recommendation list). The baseline methods tend to underperform when dealing with poor-quality offline datasets (e.g., the random or medium dataset types). However, our \(\rm M^{3}Rec\) method performs quite well even with the low-quality random data. This underscores the capacity of our method to effectively utilize low-quality data for offline RL training without stringent constraints on data quality. Another interesting observation is that our method is the only one that achieved a cumulative reward greater than the configured user time budget of 1,000 across all the dataset types. This implies that our method successfully maximizes the cold-start user's long-term interest, thereby preventing the user from leaving the platform.
Table 1.
| Dataset type | Slate size | Meta-LSTM | Meta-PG | IRecGAN | GAN-PG | MeLU | BCQ | \(\rm{M^{3}Rec}\) (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | \(k\) = 3 | 960.41 | 971.56 | 973.33 | 883.39 | 895.52 | 849.20 | **1,082.27** |
| Random | \(k\) = 5 | 960.80 | 963.35 | 957.30 | 909.72 | 905.14 | 900.19 | **1,116.08** |
| Random | \(k\) = 10 | 959.05 | 957.86 | 948.12 | 928.70 | 916.41 | 953.92 | **1,024.84** |
| Medium | \(k\) = 3 | 999.04 | 960.93 | 1,104.70 | 950.32 | 939.97 | 1,012.41 | **1,268.50** |
| Medium | \(k\) = 5 | 995.89 | 944.87 | 1,077.33 | 987.57 | 947.71 | 977.34 | **1,244.89** |
| Medium | \(k\) = 10 | 989.87 | 983.09 | 1,049.69 | 970.36 | 951.76 | 988.14 | **1,114.62** |
| Expert | \(k\) = 3 | 1,019.22 | 992.38 | 1,334.57 | 1,384.34 | 961.28 | 1,214.4 | **1,473.06** |
| Expert | \(k\) = 5 | 1,012.86 | 981.26 | 1,187.35 | 1,170.67 | 961.43 | 1,067.73 | **1,242.29** |
| Expert | \(k\) = 10 | 1,004.66 | 970.58 | 1,089.42 | 1,079.82 | 960.90 | 970.29 | **1,116.36** |
Table 1. Online Evaluation Results of Averaged User Cumulative Reward with Different Recommendation List Sizes \(k\)
Methods are offline trained with datasets of different qualities. The bold numbers indicate the best performance across all the methods.

5.2.4 Online Learning (RQ2).

Given the cost of training an RL-based recommender by interacting with users, in the above setting \(\rm M^{3}Rec\) is first trained offline with logged user data and then deployed with simulated cold-start users for online evaluation, where it demonstrates superior performance. An intriguing question arises: how does \(\rm M^{3}Rec\) perform when the offline-trained model is used as the initialization for subsequent online learning with these cold-start users (RQ2)? To address this, we perform online learning and assess its performance via interaction with meta-test users. This methodology presents a practical approach for enhancing the performance of an already offline-trained RL model. During each online learning iteration, we interact with 500 meta-test users, with the time budget per user set at 120. We utilize the model offline-trained on the expert dataset for online learning (other dataset types show similar results). As depicted in Figure 2, our method consistently improves its performance with increased online interactions. These results demonstrate that our \(\rm M^{3}Rec\) method trained with offline-logged user data offers a valuable initialization, thus boosting subsequent online learning performance with these cold-start users.
Figure 2. Performance of online learning with compared methods initialized by offline training on the expert dataset.

5.3 Offline Evaluation with Real-World Datasets

5.3.1 Datasets.

We further validate the effectiveness of our proposed method with two widely used real-world recommendation datasets. Table 2 provides a statistical overview of these datasets, which we describe briefly below.
MovieLens. This dataset covers movie user–item interactions; we use the MovieLens-1M dataset2 [22]. Consistent with established practices [23, 54], numeric ratings are transformed into implicit feedback, signifying whether a user has rated a particular item.
Last.fm. This dataset is a collection of user–song interactions that reflects users' listening habits up to 5 May 2009.3 We use the data from the last month for the cold-start recommendation task.
Datasets | # of users | # of items | # of interactions | Density
MovieLens | 6,040 | 3,706 | 1,000,209 | 4.47%
Last.fm | 613 | 150,826 | 426,203 | 0.46%
Table 2. Statistics of the Real-World Datasets
For both datasets, we randomly sample 70% users as the warm users in the training set, 10% users in the validation set, and the remaining 20% users as cold-start users in the test set.

5.3.2 Experiment Setting.

In both datasets, we utilize attribute information of users as the context information to infer user context variables. This approach aligns with real-world scenarios, wherein only user profile information is available for cold-start users. Specifically, the attributes used are gender, age, and occupation for the MovieLens dataset, and gender, age, and country for the Last.fm dataset.
These real-world datasets do not include user reward information, rendering the Meta-PG and BCQ baseline methods inapplicable. Similarly, the MeLU baseline method is not applicable as it requires user–item interactions from cold-start users for support-set fine-tuning. Therefore, we compare against the remaining baseline methods.
During the meta-test stage, sequential recommendations are performed on the test users, and the performance is evaluated using two widely accepted metrics: hit ratio (HR) and normalized discounted cumulative gain (NDCG) of top \(k\) recommended items [52, 69]. We report HR@\(k\) and NDCG@\(k\) for \(k=10,20,50\). HR@\(k\) measures the average proportion of preferred items appearing in the top-\(k\) recommendation list, defined as
\begin{align*}{\rm HR}@k=\frac{1}{|\mathcal{U}^{cold}|}\sum_{i=1}^{|\mathcal{U}^{cold}|} \frac{1}{|p_{u_{i}}|}\sum_{j=1}^{|p_{u_{i}}|}\delta\left(p_{u_{i},j}\in \mathcal{J}_{u_{i},j}\right),\end{align*}
where \(\mathcal{J}_{u_{i},j}\) and \(p_{u_{i},j}\) represent the top-\(k\) recommendation list and the preferred item at timestamp \(j\) (relative time order) for cold-start user \(u_{i}\), respectively. NDCG@\(k\) evaluates the recommendation performance from a ranking perspective, defined as
\begin{align*}{\rm NDCG}@k=\frac{1}{|\mathcal{U}^{cold}|}\sum_{i=1}^{|\mathcal{U}^{cold}|} \frac{1}{|p_{u_{i}}|}\sum_{j=1}^{|p_{u_{i}}|}\frac{{DCG}@k\left(u_{i},j\right)}{ {IDCG}@k\left(u_{i},j\right)},{DCG}@k\left(u_{i},j\right)=\sum_{m=1}^{k}\frac{rel_ {u_{i},j,m}}{\log_{2}(m+1)},\end{align*}
where \(rel_{u_{i},j,m}\) is the relevance of the item at position \(m\) in \(\mathcal{J}_{u_{i},j}\). \(rel_{u_{i},j,m}=1\) if the item at position \(m\) coincides with user \(u_{i}\)'s preferred item at timestamp \(j\); otherwise \(rel_{u_{i},j,m}=0\). IDCG@\(k\) is the ideal DCG value over all possible recommendation lists of length \(k\), serving as a normalizer.
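To make the metric computation concrete, the following Python sketch evaluates HR@\(k\) and NDCG@\(k\) for a single cold-start user exactly as defined above; variable names are illustrative.

```python
import math

def hr_ndcg_at_k(recommendations, preferred):
    """Compute HR@k and NDCG@k for one cold-start user.

    recommendations: list of top-k recommendation lists, one per timestamp j
    preferred: list of the user's preferred item at each timestamp j
    """
    hits, ndcgs = [], []
    for rec_list, p in zip(recommendations, preferred):
        hit = 1.0 if p in rec_list else 0.0
        hits.append(hit)
        # DCG: relevance is 1 only at the position of the preferred item
        dcg = 0.0
        if hit:
            m = rec_list.index(p) + 1          # 1-based position in the slate
            dcg = 1.0 / math.log2(m + 1)
        idcg = 1.0 / math.log2(1 + 1)          # ideal: preferred item at position 1
        ndcgs.append(dcg / idcg)
    return sum(hits) / len(hits), sum(ndcgs) / len(ndcgs)

# Averaging these per-user scores over all cold-start users yields the
# reported HR@k and NDCG@k values.
```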

5.3.3 Offline Evaluation Results (RQ1).

To address RQ1, besides the simulated online evaluation, we further present an overall comparison with the baseline methods on these two real-world datasets for the cold-start recommendation task. The offline evaluation results for the MovieLens and Last.fm datasets are illustrated in Tables 3 and 4, respectively. An observable trend is the superior performance of our proposed method in comparison to all baseline methods, across both datasets and a variety of evaluation metrics. For example, in the Last.fm dataset, when \(k=10\), our method outperforms the best baseline method by 15.58% and 17.15% on the NDCG and HR evaluation metrics, respectively. When comparing the evaluation results across the datasets, we find that the performance of all the compared methods on the Last.fm dataset is lower than on the MovieLens dataset. The potential reason is the sparsity of the Last.fm dataset, as shown in Table 2. Nevertheless, even on the more challenging Last.fm dataset for cold-start recommendation, our method outperforms the baseline methods by a larger margin than on the MovieLens dataset. These results demonstrate the effectiveness of our method for the cold-start recommendation problem.
Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0841 | 0.0833 | 0.0856 | 0.0866
HR@10 | 0.1679 | 0.1690 | 0.1710 | 0.1753
NDCG@20 | 0.1091 | 0.1094 | 0.1114 | 0.1134
HR@20 | 0.2677 | 0.2727 | 0.2738 | 0.2816
NDCG@50 | 0.1431 | 0.1440 | 0.1461 | 0.1495
HR@50 | 0.4394 | 0.4470 | 0.4486 | 0.4637
Table 3. Offline Evaluation Results on MovieLens Dataset
The bold numbers indicate the best performances across all the methods for ease of demonstration.
Metric | Meta-LSTM | IRecGAN | GAN-PG | M\({}^{3}\)Rec (ours)
NDCG@10 | 0.0318 | 0.0077 | 0.0207 | 0.0368
HR@10 | 0.0427 | 0.0113 | 0.0354 | 0.0500
NDCG@20 | 0.0326 | 0.0085 | 0.0219 | 0.0375
HR@20 | 0.0458 | 0.0144 | 0.0397 | 0.0531
NDCG@50 | 0.0332 | 0.0094 | 0.0230 | 0.0379
HR@50 | 0.0487 | 0.0191 | 0.0455 | 0.0548
Table 4. Offline Evaluation Results on Last.fm Dataset
The bold numbers indicate the best performances across all the methods for ease of demonstration.
In subsequent sections, we delve deeper into the analysis of the proposed \(\rm M^{3}Rec\) method, using the Last.fm dataset as a representative example, which is more challenging for the cold-start recommendation task.

5.3.4 Performance on Initial Interactions of Test Users (RQ3).

In real-world cold-start recommendation scenarios, it is desirable for the recommender system to demonstrate promising performance on the initial interactions of cold-start users, to deter user attrition. With this consideration in mind, to answer RQ3, we further report the averaged recommendation performance over the earliest \(l\) interactions of the test users, where \(l\) ranges from 3 to 20. As depicted in Figure 3, our method consistently surpasses the baseline methods across all evaluation metrics. Notably, the margin of improvement is particularly large when \(l\) is small (e.g., \(l=3\)), implying that our method excels at providing superior recommendations during the earliest interactions of cold-start users.
Figure 3. Performance comparison on initial interactions of test users with the Last.fm dataset, where the first few interactions (test user length) for each test user are kept for evaluation.

5.3.5 Ablation Study (RQ4).

To probe RQ4, we conduct an ablation study aimed at assessing the impact of \(\rm M^{3}Rec\)'s essential components on the recommendation performance. Specifically, we remove the meta-learning component (i.e., the user context encoder) and the mutual information regularizer separately. The results of the ablation study are presented in Table 5. Upon the removal of either the meta-learning component or the mutual information regularizer, a performance drop is observed in our \(\rm M^{3}Rec\) model. This ablation study underscores the necessity of the meta-learning method and the mutual information regularizer within our proposed model.
Metric | \(\rm{M^{3}Rec}\) | Without meta-learning | Without mutual information regularizer
NDCG@10 | 0.0368 | 0.0349 | 0.0354
HR@10 | 0.0500 | 0.0477 | 0.0473
NDCG@20 | 0.0375 | 0.0358 | 0.0361
HR@20 | 0.0531 | 0.0510 | 0.0501
NDCG@50 | 0.0379 | 0.0363 | 0.0366
HR@50 | 0.0548 | 0.0535 | 0.0527
Table 5. Ablation Study on Last.fm Dataset

5.3.6 Impact of the User Model Rollout Length (RQ5).

In our model-based RL framework, the recommendation agent interacts with the learned user model during training for a configurable number of steps, referred to as the user model rollout. In response to RQ5, which concerns the sensitivity of \(\rm M^{3}Rec\) to the user model rollout length, we vary this length during training; a generic sketch of such truncated rollouts is given after Figure 4. Figure 4 illustrates that both too small and too large rollout lengths lead to inferior performance. Our model achieves the best performance across the evaluation metrics when a moderate user model rollout length (i.e., 10) is employed.
Figure 4. Influence of user model rollout length on the performance of M\({}^{3}\)Rec on the Last.fm dataset.
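For illustration, the sketch below shows a generic form of truncated user-model rollouts, in the style of branched rollouts in model-based RL; the starting-state sampling and the agent/user-model interfaces are assumptions rather than the exact procedure used in \(\rm M^{3}Rec\).

```python
# A generic sketch (assumptions noted) of truncated user-model rollouts:
# the agent interacts with the learned user model for `rollout_length` steps
# starting from states sampled from the offline data.
import random

def model_rollouts(agent, user_model, offline_states, rollout_length=10,
                   num_rollouts=64):
    """Generate synthetic experience by rolling the policy in the user model.
    `agent.act` and `user_model.step` are hypothetical interfaces."""
    synthetic = []
    for _ in range(num_rollouts):
        state = random.choice(offline_states)          # branch from logged data
        for _ in range(rollout_length):
            slate = agent.act(state)                   # recommend a slate
            click, reward, next_state, done = user_model.step(state, slate)
            synthetic.append((state, slate, click, reward, next_state))
            if done:
                break
            state = next_state
    return synthetic
```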

6 Conclusion

In this article, we presented a new approach to addressing the cold-start problem in RL-based recommendations by developing a context-aware offline meta-level model-based RL method. This method incorporates a user context variable designed to infer user preferences for adapting to new users. Within the context of the meta-learned model-based RL framework, we proposed to recover user policy and reward via an IRL approach, which is conditioned on the user context variable. This meta-level user model is employed to aid in training the context-aware recommendation agent, facilitating adaptation to new users who have limited contextual information or user–item interaction records. To address the challenge posed by the offline training of the proposed model, we further introduced a mutual information constraint between the user model and the recommendation agent. Alongside extensive simulated online and offline evaluations, which demonstrate the effectiveness of our approach, we also provided a theoretical analysis of the recommendation performance bound of the developed method.


Appendices

Appendix A Proofs

A.1. Proof of Theorem 1

Proof.
The proof techniques are similar to those in [48, Theorem 1]. For completeness, we prove Theorem 1 for our meta-level model-based RL framework. The necessary lemmas are provided in Appendix A.3. We first decompose \(J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta},u_{test}^{w}\right)\) into three terms and analyze them separately.
\begin{align*}&J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}, u_{test}^{w}\right) \\&\quad{} =J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}^ {*},u_{test}^{m}\right)+J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi _{\theta},u_{test}^{w}\right) \\&\quad{} =\underbrace{J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta}^{*},u_{test}^{m}\right)}_{\text{Term-I}}+\underbrace{J\left(\pi_{ \theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta},u_{test}^{m}\right)}_{ \text{Term-II}}+\underbrace{J\left(\pi_{\theta},u_{test}^{m}\right)-J\left(\pi _{\theta},u_{test}^{w}\right)}_{\text{Term-III}}.\end{align*}
Term-II. We introduce the quantity \(J\left(\pi_{\theta,u_{test}^{m}}^{*},u_{test}^{m}\right)\geq J\left(\pi_{\theta}^{\prime},u_{test}^{m}\right),\forall\pi_{\theta}^{\prime}\), where \(\pi_{\theta,u_{test}^{m}}^{*}\) is the optimal recommendation policy obtained under the approximated model \(u_{test}^{m}\) in our meta-level model-based RL framework. Then, we can further decompose Term-II as follows:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta}, u_{test}^{m}\right) & = \left(J\left(\pi_{\theta}^{*},u_{test}^{m}\right)-J\left(\pi_{\theta, u_{test}^{m}}^{*},u_{test}^{m}\right)+J\left(\pi_{\theta,u_{test}^{m}}^{*},u_{ test}^{m}\right)-J\left(\pi_{\theta},u_{test}^{m}\right)\right) \\& \leq 0+\epsilon_{\pi_{\theta}}^{adapt}.\end{align*}
The first difference term is \(\leq 0\) because \(\pi_{\theta,u_{test}^{m}}^{*}\) is the optimal policy under the approximated model \(u_{test}^{m}\). The second difference term is bounded by \(\epsilon_{\pi_{\theta}}^{adapt}\) due to our assumption \(J(\pi_{\theta},u_{test}^{m})\geq\sup_{\pi_{\theta}^{\prime}}J(\pi_{\theta}^{\prime},u_{test}^{m})-\epsilon_{\pi_{\theta}}^{adapt}\), which represents the generalization error of \(\pi_{\theta}(\boldsymbol{A}|\boldsymbol{s},c_{test})\), as it is trained on meta-training users.
Term-III. As our meta-level user model \(u_{test}^{m}\) approximates the true user model \(u_{test}^{w}\) using IRL, it can be seen as distribution matching [19] (i.e., KL divergence minimization) between the state-action visitation distributions of \(u_{test}^{m}\) and \(u_{test}^{w}\). Thus, the model approximation error \(\epsilon_{u_{test}^{m}}^{adapt}\) arises from the error of this KL divergence matching. By applying Pinsker's inequality, which connects \(D_{KL}\) and the total variation (TV) distance \(D_{TV}\), we can connect \(\epsilon_{u_{test}^{m}}^{adapt}\) and \(D_{\mathit{TV}}\) as follows:
\begin{align*}\mathbb{E}_{(\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}}\left[D_{TV}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)\right]\leq\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}.\end{align*}
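To make this step explicit, a brief derivation (assuming \(\epsilon_{u_{test}^{m}}^{adapt}\) upper-bounds the expected KL divergence between the two transition models) combines Pinsker's inequality with Jensen's inequality:
\begin{align*}\mathbb{E}\left[D_{TV}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)\right]\leq\mathbb{E}\left[\sqrt{\tfrac{1}{2}D_{KL}\left(P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a)\,\|\,P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\right)}\right]\leq\sqrt{\tfrac{1}{2}\epsilon_{u_{test}^{m}}^{adapt}}\leq\sqrt{\epsilon_{u_{test}^{m}}^{adapt}},\end{align*}
where the expectations are over \((\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}\) and the second inequality uses the concavity of the square root (Jensen's inequality).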
By using Lemma 2, the bound for Term-III is \(\frac{2\gamma R_{\max}\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\)
Term-I. This term measures the modeling error of the test user model on the states visited by \(\pi_{\theta}^{*}\), which are unseen. It can be bounded as follows:
\begin{align}|J(\pi_{\theta}^{*},u_{test}^{w})-J(\pi_{\theta}^{*},u_{test}^{m})| & =\left|\frac{1}{1-\gamma}\mathbb{E}_{\tilde{\mu}_{u_{test}^{w}}^{ \pi_{\theta}^{*}}}[r_{u_{test}}(\boldsymbol{s},a)]-\frac{1}{1-\gamma}\mathbb{E}_{ \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}}[r_{u_{test}}(\boldsymbol{s},a)]\right| \\ & \leq\frac{2R_{\max}}{1-\gamma}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{ w}}^{\pi_{\theta}^{*}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}\right).\end{align}
(10)
Combining these terms, we obtain the following bound:
\begin{align*}J\left(\pi_{\theta}^{*},u_{test}^{w}\right)-J\left(\pi_{\theta},u_{test}^{w}\right)\leq\frac{2R_{\max}}{1-\gamma}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}^{*}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}^{*}}\right)+\epsilon_{\pi_{\theta}}^{adapt}+\frac{2\gamma R_{\max}\sqrt{\epsilon_{u_{test}^{m}}^{adapt}}}{(1-\gamma)^{2}}.\end{align*}
By omitting some constant terms, we conclude the proof of Theorem 1. ◻

A.2. Proof of the Lower Bound of \(\mathcal{I}^{\boldsymbol{(JSD)}}\)

\begin{align*}\mathcal{I}^{(JSD)}(z_{u};z_{rec})\geq\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}\left[-\operatorname{sp}\left(-T_{\psi}(z_{u},z_{rec})\right)\right]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[\operatorname{sp}\left(T_{\psi}(z_{u},z_{rec})\right)\right]\right).\end{align*}
Proof.
\begin{align*}\mathcal{I}^{(JSD)}(z_{u};z_{rec}) & =D_{JSD}\left(\mathbb{P}_{z_{u}z_{rec}}\|\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}\right) \\& \geq\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[V_{\psi}(z_{u},z_{rec})]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[{JSD}^{*}(V_{\psi}(z_{u},z_{rec}))\right]\right) \\& =\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[g_{f}(T_{\psi}(z_{u},z_{rec}))]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[{JSD}^{*}(g_{f}(T_{\psi}(z_{u},z_{rec})))\right]\right) \\& =\sup_{\psi\in\Psi}\left(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}\left[-\operatorname{sp}\left(-T_{\psi}(z_{u},z_{rec})\right)\right]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}\left[\operatorname{sp}\left(T_{\psi}(z_{u},z_{rec})\right)\right]\right)+\log(4).\end{align*}
The inequality in the second line follows from the variational lower bound of the \(f\)-divergence [44]. In the third line, we parameterize the variational function as \(V_{\psi}(z_{u},z_{rec})=g_{f}(T_{\psi}(z_{u},z_{rec}))\). In the fourth line, we substitute \(g_{f}(T_{\psi}(z_{u},z_{rec}))=\log(2)-\log(1+\exp(-T_{\psi}(z_{u},z_{rec})))\) and the conjugate \({JSD}^{*}(t)=-\log(2-\exp(t))\), as in [44]. ◻
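As a concrete illustration of how this lower bound can be estimated in practice, the following PyTorch sketch computes \(\mathbb{E}_{\mathbb{P}_{z_{u}z_{rec}}}[-\operatorname{sp}(-T_{\psi})]-\mathbb{E}_{\mathbb{P}_{z_{u}}\otimes\mathbb{P}_{z_{rec}}}[\operatorname{sp}(T_{\psi})]\) with a simple statistics network; the architecture and the in-batch shuffling used to approximate the product of marginals are assumptions, not the authors' implementation.

```python
# A minimal sketch of estimating the Jensen-Shannon MI lower bound with a
# statistics network T_psi, following the softplus form derived above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatisticsNetwork(nn.Module):
    """T_psi(z_u, z_rec): scores a pair of user/agent representations."""
    def __init__(self, dim_u, dim_rec, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_u + dim_rec, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_u, z_rec):
        return self.net(torch.cat([z_u, z_rec], dim=-1)).squeeze(-1)

def jsd_mi_lower_bound(T, z_u, z_rec):
    """E_joint[-sp(-T)] - E_marginal[sp(T)], marginals via in-batch shuffling."""
    t_joint = T(z_u, z_rec)                       # samples from the joint
    z_rec_shuffled = z_rec[torch.randperm(z_rec.size(0))]
    t_marginal = T(z_u, z_rec_shuffled)           # product-of-marginals samples
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()

# Maximizing this bound w.r.t. the statistics network's parameters yields the
# mutual information estimate used as a regularization signal.
```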

A.3. Related Lemmas

We provide some related lemmas used in the proof of Theorem 1.
Lemma 1
Let \(P_{1}(\cdot|\boldsymbol{s})\) and \(P_{2}(\cdot|\boldsymbol{s})\) be two Markov chains with the same initial state distribution. Let \(P_{1}^{t}(\boldsymbol{s})\) and \(P_{2}^{t}(\boldsymbol{s})\) be the marginal distributions over states at time \(t\) when following \(P_{1}\) and \(P_{2}\), respectively. Suppose
\begin{align*}\mathbb{E}_{\boldsymbol{s}\sim P_{1}^{t}}\left[D_\mathit{TV}(P_{1}(\cdot|\boldsymbol{s}),P_{2}(\cdot|\boldsymbol{s}))\right]\leq\epsilon\ \ \forall\ t,\end{align*}
then, the marginal distributions are bounded as:
\begin{align*}D_\mathit{TV}(P_{1}^{t},P_{2}^{t})\leq\epsilon t\ \ \forall\ t.\end{align*}
Proof.
Proof has been provided in several previous works such as [48, Lemma 2] and [28, Lemma B.2]. Here, we omit the proof. ◻
To analyze the performance difference of the recommendation policy under the approximated user model \(u_{test}^{m}\) in our meta-level model-based RL framework and under the true user model \(u_{test}^{w}\), we first introduce some definitions.
\begin{align} \mu_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a) & =\frac{1}{T_{\infty}}\sum_{t=0}^{T_{\infty}}P(s_{t}=\boldsymbol{s},a_{t}=a)\end{align}
(11)
\begin{align}\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a) =(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=\boldsymbol{s},a_{t}=a).\end{align}
(12)
The first definition, \(\mu_{u_{test}^{m}}^{\pi_{\theta}}(\boldsymbol{s},a)\), is the state-action visitation distribution when executing the recommendation policy \(\pi_{\theta}\) in the user model \(u_{test}^{m}\); the second is the corresponding discounted state-action visitation distribution. Analogous definitions \(\mu_{u_{test}^{w}}^{\pi_{\theta}}(\boldsymbol{s},a)\) and \(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}}(\boldsymbol{s},a)\) are introduced for the true user model \(u_{test}^{w}\). We further define the marginal distribution at time \(t\) when executing recommendation policy \(\pi_{\theta}\) under the true user model \(u_{test}^{w}\):
\begin{align}\mu_{u_{test}^{w}}^{\pi_{\theta},t}(\boldsymbol{s},a)=P\left(s_{t}=\boldsymbol{s},a_{t}=a\right) \end{align}
(13)
\(\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)\) is similarly defined when following policy \(\pi_{\theta}\) in approximated user model \(u_{test}^{m}\).
Now, we formally introduce Lemma 2, which measures the performance difference of the recommendation policy \(\pi_{\theta}\) under \(u_{test}^{m}\) and \(u_{test}^{w}\) due to the approximation error of \(u_{test}^{m}\) with respect to \(u_{test}^{w}\).
Lemma 2
Let \(u_{test}^{w}\) and \(u_{test}^{m}\) be two different MDPs differing only in their transition dynamics—\(P_{u_{test}^{w}}\) and \(P_{u_{test}^{m}}\). Let the absolute value of rewards be bounded by \(R_{\max}\). Fix a recommendation policy \(\pi_{\theta}\) for both \(u_{test}^{w}\) and \(u_{test}^{m}\), and let \(P_{u_{test}^{w}}^{t}\) and \(P_{u_{test}^{m}}^{t}\) be the resulting marginal state distributions at time \(t\). If the MDPs are such that
\begin{align*}\mathbb{E}_{(\boldsymbol{s},a)\sim\mu_{u_{test}^{w}}^{\pi_{\theta},t}}\left[D_\mathit{TV}\big{(}P_{u_{test}^{w}}(\cdot|\boldsymbol{s},a),P_{u_{test}^{m}}(\cdot|\boldsymbol{s},a)\big{)}\right]\leq\epsilon\ \ \forall t,\end{align*}
then, the performance difference is bounded as
\begin{align*}|J(\pi_{\theta},u_{test}^{w})-J\left(\pi_{\theta},u_{test}^{m}\right)|\leq \frac{2\gamma\epsilon R_{\max}}{(1-\gamma)^{2}}.\end{align*}
Proof.
The proof is essentially the same as that of [48, Lemma 3]. We provide it here to support the conclusion of Theorem 1 for our meta-level model-based RL framework. For recommendation policy \(\pi_{\theta}\) in the approximated user model \(u_{test}^{m}\), the performance is
\begin{align*}J(\pi_{\theta},u_{test}^{m})=\frac{1}{1-\gamma}\mathbb{E}_{ \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]=\mathbb{E} \left[\sum_{t=0}^{\infty}\gamma^{t}r_{u_{test}}(\boldsymbol{s},a)\right].\end{align*}
A similar expression holds for the true user model \(u_{test}^{w}\). Then, the performance difference can be bounded as follows:
\begin{align}|J(\pi_{\theta},u_{test}^{w})-J(\pi_{\theta},u_{test}^{m})| & =\left|\frac{1}{1-\gamma}\mathbb{E}_{\tilde{\mu}_{u_{test}^{w}}^{ \pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]-\frac{1}{1-\gamma}\mathbb{E}_{\tilde{ \mu}_{u_{test}^{m}}^{\pi_{\theta}}}[r_{u_{test}}(\boldsymbol{s},a)]\right|\\ & \leq\frac{2R_{\max}}{1-\gamma}D_{TV}\left(\tilde{\mu}_{u_{test}^{ w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right).\end{align}
(14)
As \(\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)=P\left(s_{t}=\boldsymbol{s},a_{t}=a\right)=P_{u_{test}^{m}}^{t}(\boldsymbol{s})\pi_{\theta}(a|\boldsymbol{s})\),
then, we can bound \(D_{\mathit{TV}}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_ {test}^{m}}^{\pi_{\theta}}\right)\) using Equations (12) and (13) as follows:
\begin{align*}2D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{ \mu}_{u_{test}^{m}}^{\pi_{\theta}}\right) & =\sum_{\boldsymbol{s},a}\left|\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}}- \tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right| \\& =(1-\gamma)\sum_{\boldsymbol{s},a}\left|\sum_{t}\gamma^{t}\mu_{u_{test}^{ w}}^{\pi_{\theta},t}(\boldsymbol{s},a)-\gamma^{t}\mu_{u_{test}^{m}}^{\pi_{\theta},t}(\boldsymbol{s},a)\right| \\& \leq(1-\gamma)\sum_{\boldsymbol{s},a}\sum_{t}\gamma^{t}\left|\mu_{u_{test }^{w}}^{\pi_{\theta},t}(s,a)-\mu_{u_{test}^{m}}^{\pi_{\theta},t}(s,a)\right| \\& =(1-\gamma)\sum_{\boldsymbol{s}}\sum_{t}\gamma^{t}\left|P_{u_{test}^{w}}^ {t}(\boldsymbol{s})-P_{u_{test}^{m}}^{t}(\boldsymbol{s})\right| \\& \leq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}(2t\epsilon).\end{align*}
The last inequality is obtained using Lemma 1. Finally, by summing the infinite series, we get
\begin{align*}D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m }}^{\pi_{\theta}}\right)\leq(1-\gamma)\frac{\epsilon\gamma}{(1-\gamma)^{2}} \leq\frac{\epsilon\gamma}{1-\gamma}.\end{align*}
By substituting this inequality into Equation (14), we conclude the proof. ◻
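For completeness, the series evaluation in the last step uses the standard identity \(\sum_{t=0}^{\infty}t\gamma^{t}=\gamma/(1-\gamma)^{2}\):
\begin{align*}\sum_{t=0}^{\infty}\gamma^{t}(2t\epsilon)=2\epsilon\sum_{t=0}^{\infty}t\gamma^{t}=\frac{2\epsilon\gamma}{(1-\gamma)^{2}},\quad\text{so that}\quad 2D_\mathit{TV}\left(\tilde{\mu}_{u_{test}^{w}}^{\pi_{\theta}},\tilde{\mu}_{u_{test}^{m}}^{\pi_{\theta}}\right)\leq(1-\gamma)\cdot\frac{2\epsilon\gamma}{(1-\gamma)^{2}}=\frac{2\epsilon\gamma}{1-\gamma}.\end{align*}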

Appendix B Online Evaluation Details

For online evaluation, we utilize an open-sourced simulator4 [26] for recommender systems. The parameters of this simulator include the user sensitivity, user-specific memory discount, noise standard deviation, and the means and standard deviations of the kale and choc responses, which affect users' preferences. We use the simulator's default values for the standard deviations of the kale and choc responses. The remaining parameters are chosen from the sets of values listed below to configure different users for meta-training and meta-test, respectively.
For meta-training users:
user sensitivity: \([0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1]\).
user-specific memory discount: \([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]\).
noise standard deviation: \([0.03,0.04,0.05,0.06]\).
the mean of kale response: \([2,3,4,5,6,7,8]\).
the mean of choc response: \([2,3,4,5,6,7,8]\).
For meta-test users:
user sensitivity: \([0.015,0.025,0.035,0.045,0.055,0.065,0.075,0.085,0.095,0.105]\).
user-specific memory discount: \([0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95]\).
noise standard deviation: \([0.035,0.045,0.055,0.065]\).
the mean of kale response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
the mean of choc response: \([2.5,3.5,4.5,5.5,6.5,7.5,8.5]\).
Each user is configured with a combination of these parameters.
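As a minimal sketch (assuming exhaustive enumeration of combinations; the sampling scheme is not specified above), the user configurations can be generated from these parameter lists as follows.

```python
# A minimal sketch (assumed names) of enumerating user configurations as
# combinations of the simulator parameters listed above.
from itertools import product

meta_train_grid = {
    "user_sensitivity": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1],
    "memory_discount": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "noise_std": [0.03, 0.04, 0.05, 0.06],
    "kale_mean": [2, 3, 4, 5, 6, 7, 8],
    "choc_mean": [2, 3, 4, 5, 6, 7, 8],
}

def user_configs(grid):
    """Yield one parameter dictionary per user configuration."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Example: count the distinct meta-training user configurations.
print(sum(1 for _ in user_configs(meta_train_grid)))  # 10*9*4*7*7 = 17,640
```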

References

[1]
Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML ’04), 1.
[2]
Xueying Bai, Jian Guan, and Hongning Wang. 2019. Model-based reinforcement learning with adversarial training for online recommendation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 10735–10746.
[3]
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. 2018. Mine: Mutual information neural estimation. arXiv:1801.04062. Retrieved from https://arxiv.org/abs/1801.04062.
[4]
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. 1991. Learning a synaptic learning rule. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks II, Vol. 2, 969.
[5]
Jiangxia Cao, Jiawei Sheng, Xin Cong, Tingwen Liu, and Bin Wang. 2022. Cross-domain recommendation to cold-start users via variational information bottleneck. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2209–2223.
[6]
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019a. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining, 456–464.
[7]
Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, and Ed Chi. 2022. Off-policy actor-critic for recommender systems. In Proceedings of the 16th ACM Conference on Recommender Systems, 338–349.
[8]
Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019b. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the International Conference on Machine Learning (ICML), 1052–1061.
[9]
Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2 (2013), 1–142.
[10]
Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. 2020. MAMO: Memory-augmented meta-optimization for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 688–697.
[11]
Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL\({}^{2}\): Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779. Retrieved from https://arxiv.org/abs/1611.02779
[12]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017a. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, 1126–1135. arXiv:1703.03400. Retrieved from https://arxiv.org/abs/1703.03400
[13]
Justin Fu, Katie Luo, and Sergey Levine. 2018. Learning robust rewards with adversarial inverse reinforcement learning. arXiv:1710.11248. Retrieved from https://arxiv.org/abs/1710.11248
[14]
Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. 2019a. Benchmarking batch deep reinforcement learning algorithms. arXiv:1910.01708. Retrieved from https://arxiv.org/abs/1910.01708
[15]
Scott Fujimoto, David Meger, and Doina Precup. 2019b. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning. PMLR, 2052–2062.
[16]
Chongming Gao, Kexin Huang, Jiawei Chen, Yuan Zhang, Biao Li, Peng Jiang, Shiqi Wang, Zhong Zhang, and Xiangnan He. 2023a. Alleviating Matthew effect of offline reinforcement learning in interactive recommendation. arXiv:2307.04571. Retrieved from https://arxiv.org/abs/2307.04571
[17]
Chongming Gao, Shiqi Wang, Shijun Li, Jiawei Chen, Xiangnan He, Wenqiang Lei, Biao Li, Yuan Zhang, and Peng Jiang. 2023b. CIRS: Bursting filter bubbles by counterfactual interactive recommender system. ACM Transactions on Information Systems 42, 1 (2023), 1–27.
[18]
Seyed Kamyar Seyed Ghasemipour, Shixiang Gu, and Richard S. Zemel. 2019a. SMILe: Scalable meta inverse reinforcement learning through context-conditional policies. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
[19]
Seyed Kamyar Seyed Ghasemipour, Richard S. Zemel, and Shixiang Gu. 2019b. A divergence minimization perspective on imitation learning methods. In Proceedings of the Conference on Robot Learning (CoRL), 1259–1277.
[20]
Lei Guo, Li Tang, Tong Chen, Lei Zhu, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2021. DA-GCN: A domain-aware attentive graph convolution network for shared-account cross-domain sequential recommendation. arXiv:2105.03300. Retrieved from https://arxiv.org/abs/2105.03300
[21]
Lei Guo, Jinyu Zhang, Tong Chen, Xinhua Wang, and Hongzhi Yin. 2022. Reinforcement learning-enhanced shared-account cross-domain sequential recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 7 (2022), 7397–7411.
[22]
F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5, 4 (2015), 1–19.
[23]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182.
[24]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. arXiv:1511.06939. Retrieved from https://arxiv.org/abs/1511.06939
[25]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (1997), 1735–1780.
[26]
Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jun bo Wang, Rui Wu, and Craig Boutilier. 2019a. RecSim: A configurable simulation platform for recommender systems. arXiv:1909.04847. Retrieved from https://arxiv.org/abs/1909.04847
[27]
Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Deepak Chandra, and Craig Boutilier. 2019b. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2592–2599.
[28]
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 12519–12530.
[29]
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. 2020. MOReL: Model-based offline reinforcement learning. arXiv:2005.05951. Retrieved from https://arxiv.org/abs/2005.05951
[30]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980
[31]
Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
[32]
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1179–1191.
[33]
Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1073–1082.
[34]
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643. Retrieved from https://arxiv.org/abs/2005.01643
[35]
Shijun Li, Wenqiang Lei, Qingyun Wu, Xiangnan He, Peng Jiang, and Tat-Seng Chua. 2021. Seamlessly unifying attributes and items: Conversational recommendation for cold-start users. ACM Transactions on Information Systems (TOIS) 39, 4 (2021), 1–29.
[36]
Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. 2015. DJ-MC: A reinforcement-learning agent for music playlist recommendation. arXiv:1401.1880. Retrieved from https://arxiv.org/abs/1401.1880
[37]
Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1563–1573.
[38]
Zhongqi Lu and Qiang Yang. 2016. Partially observable Markov decision process for recommender systems. arXiv:1608.07793. Retrieved from https://arxiv.org/abs/1608.07793
[39]
Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proceedings of the Web Conference 2020, 463–473.
[40]
Maja J. Mataric. 1994. Reward functions for accelerated learning. In Proceedings of the International Conference on Machine Learning (ICML), 181–189.
[41]
Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. 2018. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 7559–7566.
[42]
Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML), 278–287.
[43]
Andrew Y. Ng and Stuart J. Russell. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 663–670.
[44]
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the Advances in Neural Information Processing Systems, 271–279.
[45]
Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3421–3430.
[46]
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. 2019. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv:1810.00821. Retrieved from https://arxiv.org/abs/1810.00821
[47]
Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the 11th ACM Conference on Recommender Systems, 130–137.
[48]
Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. 2020. A game theoretic framework for model based reinforcement learning. In Proceedings of the International Conference on Machine Learning, 7953–7963.
[49]
Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. 2019. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the International Conference on Machine Learning, (ICML). 5331–5340.
[50]
Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. 2021. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 34, 11702–11716.
[51]
Martin Riedmiller. 2005. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning. Springer, 317–328.
[52]
Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval, Vol. 39. Cambridge University Press, Cambridge.
[53]
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 3483–3491.
[54]
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1441–1450.
[55]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[56]
Siyu Wang, Xiaocong Chen, Dietmar Jannach, and Lina Yao. 2023. Causal decision transformer for recommender systems via offline reinforcement learning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1599–1608.
[57]
Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. 2021. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 5382–5390.
[58]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (1992), 229–256.
[59]
Hanrui Wu, Jinyi Long, Nuosi Li, Dahai Yu, and Michael K. Ng. 2022. Adversarial auto-encoder domain adaptation for cold-start recommendation with positive and negative hypergraphs. ACM Transactions on Information Systems 41, 2 (2022), 1–25.
[60]
Yifan Wu, George Tucker, and Ofir Nachum. 2019. Behavior regularized offline reinforcement learning. arXiv:1911.11361. Retrieved from https://arxiv.org/abs/1911.11361
[61]
Teng Xiao and Donglin Wang. 2021. A general offline reinforcement learning framework for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 4512–4520.
[62]
Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M. Jose. 2020. Self-supervised reinforcement learning for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 931–940.
[63]
Lantao Yu, Tianhe Yu, Chelsea Finn, and Stefano Ermon. 2019. Meta-inverse reinforcement learning with probabilistic context variables. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 11772–11783.
[64]
Runsheng Yu, Yu Gong, Xu He, Bo An, Yu Zhu, Qingwen Liu, and Wenwu Ou. 2020a. Personalized adaptive meta learning for cold-start user preference prediction. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 10772–10780.
[65]
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. 2020b. MOPO: Model-based offline policy optimization. arXiv:2005.13239. Retrieved from https://arxiv.org/abs/2005.13239
[66]
Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, and Lawrence Carin. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 15214–15224.
[67]
Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, and Jiliang Tang. 2019. Model-based reinforcement learning for whole-chain recommendations. arXiv:1902.03987. Retrieved from https://arxiv.org/abs/1902.03987
[68]
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1040–1048.
[69]
Vincent W. Zheng, Yu Zheng, Xing Xie, and Qiang Yang. 2010. Collaborative location and activity recommendations with GPS history data. In Proceedings of the 19th International Conference on World Wide Web, 1029–1038.
[70]
Feng Zhu, Yan Wang, Chaochao Chen, Jun Zhou, Longfei Li, and Guanfeng Liu. 2021. Cross-domain recommendation: Challenges, progress, and prospects. arXiv:2103.01696. Retrieved from https://arxiv.org/abs/2103.01696
[71]
Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-LSTM. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 17, 3602–3608.
