Research article | Open access

Personalised Multi-modal Interactive Recommendation with Hierarchical State Representations

Published: 05 June 2024

Abstract

Multi-modal interactive recommender systems (MMIRS) can effectively guide users towards their desired items through multi-turn interactions by leveraging the users’ real-time feedback (in the form of natural-language critiques) on previously recommended items (such as images of fashion products). In this scenario, the users’ preferences can be expressed by both the users’ past interests from their historical interactions and their current needs from the real-time interactions. However, it is typically challenging to make satisfactory personalised recommendations across multi-turn interactions due to the difficulty in balancing the users’ past interests and the current needs for generating the users’ state (i.e., current preferences) representations over time. Meanwhile, hierarchical reinforcement learning has been successfully applied in various fields by decomposing a complex task into a hierarchy of more easily addressed subtasks. In this journal article, we propose a novel personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning to more effectively incorporate the users’ preferences from both their past and real-time interactions. In particular, PMMIR decomposes the personalised interactive recommendation process into a sequence of two subtasks with hierarchical state representations: a first subtask where a history encoder learns the users’ past interests with the hidden states of history for providing personalised initial recommendations and a second subtask where a state tracker estimates the current needs with the real-time estimated states for updating the subsequent recommendations. The history encoder and the state tracker are jointly optimised with a single objective by maximising the users’ future satisfaction with the recommendations. Following previous work, we train and evaluate our PMMIR model using a user simulator that can generate natural-language critiques about the recommendations as a surrogate for real human users. Experiments conducted on two fashion datasets derived from two well-known public datasets demonstrate that our proposed PMMIR model yields significant improvements in comparison to the existing state-of-the-art baseline models. The datasets and code are publicly available at: https://github.com/yashonwu/pmmir

1 Introduction

Recent advances in multi-modal interactive recommender systems (MMIRSs) enable the users to explore their desired items (such as images of fashion products) through multi-turn interactions by expressing their current needs with real-time feedback (often natural-language critiques) according to the quality of the recommendations [16, 28, 49, 51, 52, 53, 58, 59, 61, 62]. In this multi-modal interactive recommendation (MMIR) scenario, the users’ preferences can be represented by both the users’ past interests from their historical interactions and their current needs from their recent interactions. Figure 1 shows an example of the personalised multi-modal interactive recommendation with visual recommendations and the corresponding natural-language critiques. In particular, Figure 1(a) demonstrates the users’ past interests with the shopping history recorded by the recommender system and their current needs with the next item that they wish to purchase (the next target item). Next, Figure 1(b) illustrates the real-time interactions between a recommender system and a user. The recommender system initiates the conversation by presenting a list of personalised initial recommendations to the user. Subsequently, during each interaction turn, the user provides natural-language critiques regarding the visual recommendation list to obtain items with more preferred features. An effective MMIRS will improve the users’ experience substantially and will save users much effort in finding their target items.
Fig. 1. An example of the personalised multi-modal interactive recommendation.
Despite the recent advances in incorporating the users’ current needs (i.e., the target items) from the informative multi-modal information across the multi-turn interactions, we argue that it is typically challenging to make satisfactory personalised recommendations due to the difficulty in balancing the users’ past interests and the current needs for generating the users’ state (i.e., current preferences) representations over time. Indeed, the existing MMIRSs [16, 49, 51, 52] typically simplify the multi-modal interactive recommendation task by initiating conversations using randomly sampled recommendations irrespective of the users’ interaction histories (i.e., the past interests), thereby only focusing on seeking the target item (i.e., the current needs) across real-time interactions. Although providing next-item recommendations from sequential user-item interaction history is one of the most common use cases in the recommender system domain, the existing sequential and session-aware recommendation models [19, 20, 23, 41] currently only consider the explicit/implicit past user-item interactions (such as purchases and clicks) in the sequence modelling. In addition, these sequential/session-aware recommendation models have shown difficulties in learning sequential patterns over cold-start users (who have very limited historical interactions) compared to warm-start users (who have longer interaction sequences) [46, 64]. An obvious and simple solution for the personalised MMIR task is to construct a pipeline, where a sequential/session-aware recommendation model (such as GRU4Rec [20]) generates the initial personalised recommendations and a multi-modal interactive recommendation model (such as EGE [51]) updates the subsequent recommendations across the multi-turn interactions. However, such pipeline-based recommender systems cannot effectively benefit from proper cooperation between the sequential/session-aware recommendation models and the multi-modal interactive recommendation models when there is a shift between the users’ past interests and their current needs (in particular, with cold-start users), thereby possibly failing to provide satisfactory personalised recommendations over time.
Deep reinforcement learning (DRL) allows a recommender system (i.e., an agent) to actively interact with a user (i.e., the environment) while learning from the user’s real-time feedback to infer the user’s dynamic preferences. A variety of DRL algorithms has been successfully applied in various recommender system domains, such as e-commerce [55], video [7], and music recommendations [27]. In particular, recent research on MMIR has formulated the MMIR task with various DRL algorithms as MDPs [16], POMDPs [51], CMDPs [62], or multi-armed bandits [59]. However, all of these only consider a specific recommendation scenario where the users are all cold-start users, i.e., without using any interaction history. Indeed, the existing DRL-based recommender systems are not able to deal with the personalised multi-modal interactive recommendation task in an end-to-end fashion, given the computational complexity of learning the users’ past interests from the interaction history and estimating the users’ current needs from the real-time interactions. Hierarchical reinforcement learning (HRL) [21, 35] can decompose a complex task into a hierarchy of subtasks as semi-Markov decision processes (SMDPs), which reduces the computational complexity. Such an HRL formulation with a hierarchy of subtasks is particularly suitable for the multi-modal interactive recommendation task, which requires addressing different subtasks over time by either estimating the users’ past interests or tracking the users’ current needs. For instance, the “Options” framework of HRL provides a generic way for task decomposition where options represent closed-loop sub-behaviours that are carried out for multiple timesteps until the termination condition is triggered [21]. However, to the best of our knowledge, no prior work has investigated HRL in the multi-modal interactive recommendation task.
In this article, we present our formulation of the personalised MMIR task as an SMDP by simulating both the past and real-time interactions between a user (i.e., an environment) and a recommender system (i.e., an agent). To this end, we propose a novel personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning to more effectively incorporate the users’ preferences from both their past and real-time interactions. In particular, the proposed PMMIR model uses the Options framework of HRL to decompose the personalised interactive recommendation process into a sequence of two subtasks with hierarchical state representations: a first subtask where a history encoder learns the users’ past interests with the hidden states of history for providing personalised initial recommendations and a second subtask where a state tracker estimates the current needs with the real-time estimated states for updating the subsequent recommendations. The history encoder and the state tracker are jointly optimised using a typical policy gradient approach (i.e., REINFORCE [6]) with a single optimisation objective by maximising the users’ future satisfaction with the recommendations (i.e., the cumulative future rewards). Following previous work [16, 49, 51], our PMMIR model is trained and evaluated by adopting a user simulator, which is capable of producing natural-language critiques regarding the recommendations. This surrogate simulates the behaviour of real human users [16]. By conducting experiments on two fashion datasets derived from two well-known public datasets, we observe that our proposed PMMIR model outperforms existing state-of-the-art baseline models, leading to significant improvements. In short, we summarise the main contributions of this article as follows:
We propose a novel personalised multi-modal interactive recommendation model (PMMIR) that effectively integrates the users’ preferences obtained from both past and real-time interactions by leveraging HRL with the Options framework.
Our proposed PMMIR model decomposes the MMIR task into two subtasks: an initial personalised recommendation with the users’ past interests and several subsequent recommendations with the users’ current needs.
We derive two fashion datasets (i.e., Amazon-Shoes and Amazon-Dresses) for providing the users’ interaction histories from two well-known public datasets, since there is no existing dataset suitable for the personalisation setting of the multi-modal interactive recommendation task.
Through extensive empirical evaluations conducted on the personalised MMIR task, our proposed PMMIR model demonstrates significant improvements over existing state-of-the-art approaches. We also show that both cold-start and warm-start users can benefit from our proposed PMMIR model in terms of recommendation effectiveness.
The paper is structured as follows: Section 2 provides a comprehensive review of the related work and highlights the contributions of our research in relation to the existing literature. In Section 3, we define the problem formulation and introduce our proposed PMMIR model. The experimental setup and results are presented in Sections 4 and 5, respectively. Section 6 summarises our findings.

2 Related Work

Within this section, our primary focus is to introduce the concept of multi-modal interactive recommendation (MMIR). Then, we discuss personalisation in interactive recommendation. Finally, we describe hierarchical reinforcement learning.
Multi-modal Interactive Recommendation. Interactive recommender systems have been shown to be more effective in incorporating the users’ dynamic preferences over time from their explicit and implicit real-time feedback (such as natural-language critiques and clicks) compared to static/traditional recommender systems that predict the users’ preferences by mining the users’ past behaviours offline (such as ratings, clicks, and purchases) [14]. In addition, multi-modal recommender systems can handle information with various modalities either from items (such as images and textual descriptions) or users (such as natural-language feedback) to effectively alleviate the problems of data sparsity and cold start [31, 65]. Therefore, MMIRSs can effectively track/estimate the users’ dynamic preferences from informative information in different modalities across real-time interactions. As an example, Guo et al. [16] were among the first to tackle the MMIR task by introducing a Dialog Manager (DM) model that combined supervised pre-training and model-based policy improvement (MBPI). This approach aimed to effectively capture the users’ preferences across multiple interaction turns by considering both visual recommendations and the corresponding natural-language critiques. Since then, research has focussed upon improving the recommendation performance by either formulating the MMIR task using various reinforcement learning approaches (such as CMDPs [62], multi-armed bandits [58], and POMDPs [51]) or adopting more advanced state tracking components (such as Transformer [49] and RNN-enhanced Transformer [52]). Unlike the uni-modal (text-based) conversational recommendation task [27, 42], which usually leverages attribute-based clarification questions to elicit the users’ preferences, the multi-modal interactive recommendation task addressed in this article takes the critiquing-based task formulation by incorporating the users’ preferences from their natural-language feedback.
Personalisation in Interactive Recommendation. The existing MMIR models described above only focus on incorporating the users’ current needs across the multi-turn real-time interactions but omit their past behaviours by initially presenting users with randomly selected items at the start of the interaction process. Meanwhile, a variety of interactive recommendation models has leveraged the users’ past behaviours for personalised recommendations during the multi-turn interaction processes. For instance, the Estimation-Action-Reflection (EAR) model by Lei et al. [27] (a typical question-based interactive recommendation model [14]) leveraged the factorisation machine (FM) [39] to estimate the users’ preferences with the users’ past behaviours for predicting further preferred items and attributes. The users’ online feedback is incorporated by feeding the accepted attributes back to FM to make new predictions of items and attributes, or by using the rejected items as negative signals for retraining FM. However, such an FM-based method for the question-based interactive recommendation task is infeasible for our multi-modal interactive recommendation task, which leverages natural-language critiquing sentences freely expressed by the users rather than the brief terms of well-categorised attributes. In addition, a simple solution for the personalised multi-modal interactive recommendation task is to combine the sequential recommendation models (such as GRU4Rec [20]) with the multi-modal interactive recommendation models (such as EGE [51]) in a pipeline. For instance, GRU4Rec can be leveraged for generating the initial personalised recommendations, while EGE can be utilised for updating the subsequent recommendations across the multi-turn real-time interactions. However, we argue that such pipeline-based recommender systems are fragile in providing satisfactory personalised recommendations over time when there is a shift between the users’ past interests and current needs, since their components are optimised independently.
Furthermore, session-aware recommendation models [22, 26, 37, 47] decouple the users’ long-term and short-term preferences for making better-personalised recommendations by exploiting the relationship between sessions for each user. For instance, Quadrana et al. [37] proposed a Hierarchical Recurrent Neural Network model (HRNN) for the personalised session-based recommendations. The HRNN model is structured with a hierarchy of two-level Gated Recurrent Units (GRUs): the session-level GRU that makes recommendations by tracking the user interactions within sessions and the user-level GRU that tracks the evolution of the users’ preferences across sessions. When a new session starts, the hidden state of the user-level GRU is used to initialise the session-level GRU, thereby providing personalisation capabilities to the session-level GRU. Such a two-level GRU hierarchy can also be leveraged in the multi-modal interactive recommendation task to make personalised recommendations over time. Therefore, we are inspired by this two-level GRU hierarchy to propose an effective end-to-end multi-modal interactive recommendation model with a dual GRUs/Transformers structure that can make personalised recommendations over time by incorporating both the users’ past behaviours and the informative multi-modal information from real-time interactions. However, the HRNN model with two-level GRUs adopts a supervised learning approach for jointly optimising the user-level and session-level GRUs, which is less effective than DRL approaches for maximising the future rewards [1, 8, 30].
Hierarchical Reinforcement Learning. Deep reinforcement learning (DRL) has been widely adopted in the recommendation field with various DRL algorithms, such as Deep Q-learning Network (DQN) [33], REINFORCE [48], and Actor-Critic [25], for coping with the users’ dynamic preferences over time and maximising their long-term engagements [1, 8, 30]. In particular, the MMIR task has been formulated with various DRL algorithms as MDPs [16], POMDPs [51], CMDPs [62], or multi-armed bandits [59] to simulate the multi-turn interactions between the recommender systems and the users. However, the existing MMIR models (e.g., MBPI [16], EGE [51], and RCR [62]) with DRL can only maximise the cumulative rewards when dealing with real-time requests within the conversational session, while simplifying the MMIR task by omitting the users’ past interests. Indeed, making personalised recommendations across multi-turn interactions considering the users’ past interests and current needs is a complex task. Hierarchical reinforcement learning provides a solution for decomposing a complex task into a hierarchy of easily addressed subtasks as SMDPs with various frameworks, such as Options [44], Hierarchies of Abstract Machines (HAMs) [34], and MAXQ value function decomposition [12]. The existing recommender systems with HRL [15, 29, 54, 63] typically formulate the recommendation task with two levels of hierarchies where a high-level agent (the so-called meta-controller) determines the subtasks and a low-level agent (the so-called controller) addresses the subtasks. For instance, CEI [15] formulates the conversational recommendation task with the Options framework using a meta-controller to select a type of subtasks (chitchat or recommendation) and a controller to provide subtask-specific actions (i.e., response for chitchat or candidate items for recommendation). In addition, recent research on question-based conversational recommendations (such as EAR [27] and FPAN [57]) follows a two-level architecture with a policy network as a meta-controller to decide either to ask for more information or to recommend items and a Factorisation Machine (FM) [39] as a controller to generate a set of recommendations [14]. Different from the standard HRL models, these question-based conversational recommendation models [14, 27, 57] only optimise the meta-controller with RL algorithms (such as REINFORCE [48]) to manage the conversational system, while the controller is separately optimised with supervised learning approaches (such as BPR [40]). However, to the best of our knowledge, no prior work has investigated HRL in the multi-modal interactive recommendation task. In this article, we leverage HRL with the Options framework by proposing a personalised multi-modal interactive recommendation model (PMMIR) to effectively incorporate the users’ past interests and their evolving current needs over time. In particular, the high-level agent for determining the subtasks is fully driven by the users’ natural-language feedback (we will describe this in Section 3). Therefore, we mainly focus on modelling the cooperation of the low-level agents for estimating the users’ past interests and tracking the users’ current needs over time in our proposed PMMIR model.

3 The PMMIR Model

In this section, we begin by formulating the problem of the multi-modal interactive recommendation task using hierarchical reinforcement learning within the framework of partially observable semi-Markov decision processes (PO-SMDP) and we introduce the notations used in our formulation (Section 3.1). Then, in Section 3.2, we propose a novel personalised multi-modal interactive recommendation model (PMMIR) using dual GRUs, as well as dual Transformers, to effectively incorporate the users’ preferences from both past interests through the interaction history and the current needs via the real-time interactions. Finally, we define the rewards and describe the learning algorithm for the multi-modal interactive recommendation scenario (Section 3.3).

3.1 Preliminaries

Our research focuses on investigating the MMIR task within a hierarchical reinforcement learning (HRL) formulation, specifically utilising the Options framework [44] in a partially observable environment. In such an environment, the users’ preferences can only be partially expressed with the natural-language critiques at each turn [51]. Figures 2(b) and (c) illustrate the state transition process with hierarchical state representations for the personalised MMIR task.
Fig. 2. State transition process with hierarchical state representations for the personalised MMIR task.

3.1.1 PO-SMDP for Personalised MMIR.

Figure 2(a) shows the extension of a Markov decision process (MDP) with options (i.e., closed-loop policies for taking action over a period of time [44]) into a semi-Markov decision process (SMDP). In particular, the state trajectory of an MDP is made up of discrete-time transitions. Meanwhile, SMDP is a type of MDP suitable for modelling continuous-time discrete-event systems, therefore its state trajectory consists of continuous-time transitions. Sutton et al. [44] defined a set of options over an MDP as a semi-Markov decision process (SMDP), which enables an MDP trajectory to be analysed in either discrete-time transitions or continuous-time transitions. In this article, we adopt a partially observable semi-Markov decision process (PO-SMDP, as shown in Figure 2(b)) for the personalised MMIR task with two low-level agents for addressing the subtasks: (1) estimating the users’ past interests from their interaction history using a history encoder as an MDP and (2) tracking the users’ current needs from the real-time interactions using a state tracker as a POMDP. In the initial stage, the users’ preferences are fully observed (i.e., as an MDP), since each item the user has interacted with (e.g., purchased fashion products) can be seen as their preferences at that time. However, in the subsequent stage of tracking current needs, the users’ preferences are only partially observed (i.e., as a POMDP), since they can only be expressed partially through their natural-language feedback in relation to the critiqued items. The subtasks for taking actions can be selected in sequence with a fixed high-level agent according to the users’ requests in natural language following the example of the interaction process in Figure 1. The history encoder is initiated as a one-step option for the initial personalised recommendations corresponding to the request for recommending “some shoes for women” in Figure 1. The history encoder is then terminated and the state tracker is initiated when the user requests “shoes that are brown leather with an ankle strap.” Since the high-level agent for determining the subtasks is fully driven by the users’ natural-language feedback, we mainly focus on modelling the cooperation of the low-level agents for addressing the MMIR task.

3.1.2 Notations.

We specifically approach the MMIR process as a PO-SMDP with a tuple consisting of eight elements \((\mathcal {S}, \mathcal {A}, \mathcal {C}, \mathcal {O}, \mathcal {R}, \mathcal {T}, \mathcal {P}, \gamma)\), where:
\(\mathcal {S}\) is a set of states (i.e., the users’ preferences),
\(\mathcal {A}\) is a set of actions (i.e., the items for recommendations),
\(\mathcal {C}\) is a set of observations (i.e., the users’ natural-language critiques),
\(\mathcal {O}\) is a set of options (i.e., options for selecting subtasks, either estimating past interests or tracking current needs),
\(\mathcal {R}\) is the reward function,
\(\mathcal {T}\) is a set of transition probabilities between states,
\(\mathcal {P}\) is a set of transition probabilities between options, and
\(\gamma \in [0,1]\) is the discount factor for future rewards.
The estimated users’ preferences at turn t are denoted by \(s_{t} \in \mathcal {S}\). When the recommender system (i.e., the agent) provides a ranking of K items, \(a_{t} \in \mathcal {A}\) (\(a_{t,\le K}=(a_{t,1},\ldots ,a_{t,K})\)), and receives a natural-language critique \(c_{t} \in \mathcal {C}\) and a reward \(r_{t} \sim \mathcal {R}(s_{t}, a_{t})\), the estimated preferences \(s_{t}\) change in accordance with the transition distribution, \(s_{t+1} \sim \mathcal {T}(s_{t+1}|s_{t},a_{t},c_{t})\). A recommender system acts according to its policy \(\pi (a_{t+1}|a_{\le t},c_{\le t})\) by returning the probability of selecting action \(a_{t+1}\) at turn t+1, where \(a_{\le t}=(a_{0},\ldots ,a_{t})\) and \(c_{\le t}=(c_{0},\ldots ,c_{t})\) are the action and critique histories, respectively. Figure 2(b) shows that the personalised multi-modal interactive recommendation process starts with the past interests \(s_{0}\) estimated from the users’ interaction history \((a_{1}^{p},\ldots ,a_{n}^{p})\) with the past hidden states \((h_{0}^{p}, \ldots ,h_{n}^{p})\), and then continues with the current needs \(s_{t}\) (\(t \ne 0\)) tracked from the users’ real-time interactions (i.e., the sequence of the critiqued items \((a_{0}^{c},\ldots ,a_{t}^{c})\) and the sequence of the corresponding critiques \((c_{0},\ldots ,c_{t})\)) with the current hidden states \((h_{0}^{c},\ldots ,h_{t}^{c})\). Generally, for a PO-SMDP, the recommender system’s goal is to learn policies \(\pi _{\phi }\) (i.e., the history encoder) and \(\pi _{\theta }\) (i.e., the state tracker) that maximise the expected future return over trajectories \(\tau\) = ((\(a_{0,\le K},c_{0}\)), \(\ldots\), (\(a_{T,\le K},c_{T}\))) induced by the policies. Note that we assume that the users seek a single target item based on its visual features, have a single history session for estimating the past interests, and interact with the recommender system within a single interaction session. We leave the handling of more complex situations (such as multiple target items based on both visual & non-visual features (such as brands, prices, and sizes) across multiple interaction sessions) in the multi-modal interactive recommendation task as interesting future work.

3.2 The Model Architecture

We propose PMMIR, which comprises multi-modal encoders, a history encoder, and a state tracker. In particular, GRU and Transformer encoders are two popular neural networks for sequence modelling and state tracking. Therefore, our proposed PMMIR model can adopt either GRU or Transformer as the history encoder and/or state tracker. Here, we consider two versions of PMMIR: PMMIR\(_{GRU}\) with GRUs only and PMMIR\(_{Transformer}\) with Transformers only. Figure 3 shows our proposed end-to-end PMMIR with hierarchical state representations based on GRUs (Figure 3(a) with PMMIR\(_{GRU}\)) and Transformers (Figure 3(b) with PMMIR\(_{Transformer}\)). In the following, we describe the major components of our PMMIR models:
Fig. 3. The proposed personalised multi-modal interactive recommendation (PMMIR) model with hierarchical state representations.
The Multi-modal Encoders. To properly represent the system’s recommendations and the users’ feedback, we leverage visual and textual encoders for encoding the images of the recommendations and the natural-language critiques into embedded vector representations, respectively. In particular, both images of recommendations and natural-language critiques made by users can be encoded with a pre-trained vision-language model, called CLIP [38], as the unified visual and textual representations. There are also other alternatives for the multi-modal encoders [16, 49, 51], for instance, the pre-trained language models (such as GloVe [36] and BERT [11]) for text and the pre-trained vision models (such as ResNet [18] and ViT [13]) for images. Compared to these alternative encoders, CLIP has the capability of providing a single representation vector for each modality with the same dimensionality. CLIP has been shown to be effective in capturing the fine-grained features of fashion products, such as shirts and dresses, in the conditioned and combined image retrieval tasks [2, 3]. We denote the multi-modal encoders for encoding a visual item a as \(a^{^{\prime }}=CLIP^{img}(a)\) and a textual critique c as \(c^{^{\prime }}=CLIP^{txt}(c)\). Note that we directly use a and c to denote their encoded representations (i.e., \(a^{^{\prime }}\) and \(c^{^{\prime }}\)), respectively.
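As an illustration, the following minimal sketch shows how an item image and a natural-language critique could be encoded into the shared 512-dimensional feature space using the open-source CLIP package and the “ViT-B/32” checkpoint reported in Section 4.1; the image path and critique text are hypothetical placeholders.

```python
# A minimal sketch of the multi-modal encoders using the open-source CLIP package
# (https://github.com/openai/CLIP); the image path and critique text are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint used in Section 4.1

# Encode a recommended item's image: a = CLIP^img(a)
image = preprocess(Image.open("item_image.jpg")).unsqueeze(0).to(device)
# Encode the user's natural-language critique: c = CLIP^txt(c)
text = clip.tokenize(["shoes that are brown leather with an ankle strap"]).to(device)

with torch.no_grad():
    a_repr = model.encode_image(image)  # shape: (1, 512)
    c_repr = model.encode_text(text)    # shape: (1, 512)
# Both modalities share the same 512-dimensional feature space.
```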
The History Encoder. The users’ interaction history (i.e., a sequence of the interacted items \(a_{1:n}^{p}=(a_{1}^{p}, \ldots , a_{n}^{p})\)) can be first encoded with the above visual encoder \(CLIP^{img}(\cdot)\). To estimate the users’ past interests, we adopt a gated recurrent unit (GRU) [9] as the history encoder (similar to the GRU4Rec [20] model for sequential recommendations) for encoding the past hidden states as follows:
\begin{equation} ~ h_{n}^{p}= GRU^{past}(h_{n-1}^{p},a_{n}^{p}). \end{equation}
(1)
The last hidden state \(h_{n}^{p}\) of \(GRU^{past}(\cdot)\) is further mapped with a linear layer as the overall-representation of the users’ past interests (i.e., the initial state \(s_{0}= Linear(tanh(h_{n}^{p}))\) for the MMIR task).
Alternatively, we can adopt a Transformer encoder [45] as the history encoder (similar to the SASRec [23] model for sequential recommendations) by directly processing the sequence of the interacted items \(a^{p}_{1:n}\) as the input, while averaging the output embeddings with \(Mean(\cdot)\). Note that we also use \(h_{n}^{p}\) to denote the estimated historical preferences using a Transformer encoder as follows:
\begin{equation} ~ h_{n}^{p}= Mean(Transformer^{past}(a_{1:n}^{p})). \end{equation}
(2)
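To make the history encoder concrete, the following PyTorch sketch (the class name, layer names, and dimensionalities are illustrative assumptions) implements Equation (1) over the CLIP embeddings of the interaction history, together with the linear mapping to the initial state \(s_{0}\):

```python
import torch
import torch.nn as nn

class GRUHistoryEncoder(nn.Module):
    """Sketch of Equation (1): encodes the sequence of CLIP image embeddings of
    past purchases into the initial state s_0 representing the past interests."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gru = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, past_items: torch.Tensor):
        # past_items: (batch, n, dim) -- CLIP embeddings of the interaction history
        _, h_n = self.gru(past_items)        # h_n: (1, batch, dim), the last hidden state
        h_n = h_n.squeeze(0)
        s_0 = self.linear(torch.tanh(h_n))   # s_0 = Linear(tanh(h_n^p))
        return s_0, h_n                      # h_n is reused to initialise the state tracker
```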
The State Tracker. To incorporate the users’ current needs over time from the visual recommendations and the corresponding natural-language feedback, we leverage a simple concatenation operation for the multi-modal feature fusion, as in References [16, 49] and then a state tracker (either based on a GRU [16, 51] or a Transformer encoder [49, 53]) for estimating the users’ interaction states. In particular, both the visual and textual representations are concatenated and then mapped into a low-dimensional space as input to a subsequent GRU-based state tracker to model the user’s current needs at each turn t.
\begin{align} x_{t-1} =&\, Linear([a_{t-1}^{c},c_{t-1}]) \end{align}
(3)
\begin{align} h_{t}^{c} =&\, GRU^{current}(h_{t-1}^{c},x_{t-1}) \end{align}
(4)
We argue that the users usually hold a certain preference state (such as the estimated past preference state \(h^{p}_{n}\)) when they start seeking their current needs in a real-time interaction session. To this end, the initial hidden state \(h_{0}^{c}\) of the state tracker \(GRU^{current}(\cdot)\) can be initialised by the last hidden state \(h_{n}^{p}\) of the history encoder \(GRU^{past}(\cdot)\), that is, \(h_{0}^{c} = h_{n}^{p}\). In addition, the hidden state \(h_{t}^{c}\) at each turn t (\(t \ne 0\)) is further mapped with a linear layer into the estimated users’ current needs (i.e., \(s_{t}=Linear(h_{t}^{c})\)).
Similarly, a Transformer-based state tracker concatenates and encodes all previous visual and textual representations:
\begin{equation} h_{t}^{c} = Mean(Transformer^{current}([h_{0}^{c},a_{0}^{c},c_{0},\ldots ,a_{t-1}^{c},c_{t-1}])). \end{equation}
(5)
The last hidden state \(h_{n}^{p}\) of the history encoder \(Transformer^{past}(\cdot)\) is concatenated as the input of \(Transformer^{current}(\cdot)\), that is, \(h_{0}^{c} = h_{n}^{p}\). In addition, the hidden state \(h_{t}^{c}\) at each turn t (\(t \ne 0\)) is further mapped with a linear layer into the estimated users’ current needs (i.e., \(s_{t}=Linear(tanh(h_{t}^{c}))\)).
Considering the estimated state \(s_{t}\) representing the user’s preferences, we adopt a greedy policy [16, 51] by recommending the top-K candidate items \(a_{t,\le K}=(a_{t,1},\ldots ,a_{t,K})\) for the next action. More specifically, we choose the top-K items that are closest to \(s_{t}\) in the multi-modal (i.e., visual and textual) feature space using the Euclidean distance: \(a_{t,\le K} \sim KNNs(s_{t})\), where \(KNNs(\cdot)\) represents a softmax distribution over the top-K nearest neighbours of \(s_{t}\) and \(a_{t,\le K}=(a_{t,1},\ldots ,a_{t,K})\). Furthermore, we incorporate a post-filtering step to eliminate any candidate item from the ranking list that has already been shown to the user based on the real-time interaction history \(a_{\le t}\), as in Reference [51].
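The sketch below (again with illustrative names and dimensionalities, reusing the CLIP embeddings from above) combines Equations (3) and (4) with the greedy top-K policy and the post-filtering step:

```python
import torch
import torch.nn as nn

class GRUStateTracker(nn.Module):
    """Sketch of Equations (3) and (4): fuse the critiqued item and critique
    embeddings, then update the hidden state representing the current needs."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # x_{t-1} = Linear([a^c_{t-1}, c_{t-1}])
        self.cell = nn.GRUCell(dim, dim)     # h^c_t = GRU(h^c_{t-1}, x_{t-1})
        self.out = nn.Linear(dim, dim)       # s_t = Linear(h^c_t)

    def forward(self, h_prev, item_emb, critique_emb):
        x = self.fuse(torch.cat([item_emb, critique_emb], dim=-1))
        h_t = self.cell(x, h_prev)
        return self.out(h_t), h_t

def top_k_recommend(s_t, item_pool, shown_ids, k=3):
    """Greedy policy: the K items closest to the 1-D state s_t in Euclidean
    distance, post-filtered to remove items already shown to the user."""
    dists = torch.cdist(s_t.unsqueeze(0), item_pool).squeeze(0)  # (num_items,)
    if shown_ids:
        dists[torch.tensor(sorted(shown_ids))] = float("inf")    # post-filtering
    return torch.topk(-dists, k).indices                          # indices of the top-K items
```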

3.3 The Learning Algorithm

To optimise PMMIR, we leverage a two-stage optimisation method following References [16, 51] with a supervised learning (SL) loss for initialising the policies and then a reinforcement learning (RL) loss for further improving the performances.

3.3.1 Supervised Learning.

We initialise PMMIR with a supervised pre-training process to improve the sample efficiency during the RL training process. In particular, we leverage a triplet loss objective \(L(\pi _{\phi }, \pi _{\theta })\) as in References [16, 51] to jointly pre-train the recommendation policies \(\pi _{\phi }\) (for estimating the past interests) and \(\pi _{\theta }\) (for tracking the current needs):
\begin{equation} ~ {\rm min}~L(\pi _{\phi }, \pi _{\theta }) = \sum _{t=0}^{T} {\rm max}~(0, l_{2}(s_{t},a^{+}) - l_{2}(s_{t},a^{-})+\epsilon), \end{equation}
(6)
where \(\phi \in \mathbb {R}\) and \(\theta \in \mathbb {R}\) denote policy parameters. \(l_{2}(\cdot)\) denotes the \(l_{2}\) distance. \(a^{+}\) is the target item and \(a^{-}\) is a randomly sampled item from the candidate pool. \(\epsilon\) is a constant for the margin to keep the negative samples \(a^{-}\) far apart.
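A minimal PyTorch sketch of this triplet objective in Equation (6) follows; the function name and the concrete margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(states, pos_items, neg_items, margin: float = 0.2):
    """Sketch of the supervised pre-training objective (Equation (6)): a hinge loss
    over l2 distances between the estimated states and the target (positive) /
    randomly sampled (negative) item embeddings. The margin value is illustrative."""
    d_pos = F.pairwise_distance(states, pos_items)  # l2(s_t, a+)
    d_neg = F.pairwise_distance(states, neg_items)  # l2(s_t, a-)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).sum()
```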

3.3.2 Reinforcement Learning.

The objective of policy optimisation with RL is to find the target item via the policies \(\pi _{\phi }\) and \(\pi _{\theta }\) that maximise the expectation of the cumulative return:
\begin{equation} {\rm max}~J(\pi _{\phi },\pi _{\theta }) = {\rm max}~\underset{\tau \sim {\pi _{\phi },\pi _{\theta }}}{\mathbb {E}}[R(\tau)], \; where \: R(\tau)=\sum _{t=0}^{T}\gamma ^{t}r(s_{t},a_{t,\le K}), \end{equation}
(7)
where \(R(\tau)\) is the discounted cumulative reward, and T is the maximum turn in the interaction trajectory. The expectation is taken over trajectories \(\tau\) = ((\(a_{0,\le K},c_{0}\)), \(\ldots\), (\(a_{T,\le K},c_{T}\))).
We adopt a policy gradient method (e.g., REINFORCE [48]) for PO-SMDP to further optimise our PMMIR model. Indeed, the policy gradient methods have been shown to be more stable with a small learning rate [6] compared to the value-based methods (such as DQN [33]). Specifically, the gradient of Equation (7) can be computed as follows:
\begin{equation} \nabla J(\pi _{\phi },\pi _{\theta })=\underset{\tau \sim {\pi _{\phi },\pi _{\theta }}}{\mathbb {E}}\left[\sum _{t=0}^{T} \nabla \log \pi (a_{t,\le K}|s_{t})R(\tau)\right]. \end{equation}
(8)
We define \(\log \pi (a_{t,\le K}|s_{t})\) as a softmax cross-entropy objective to identify the positive sample (i.e., the target item \(a^{+}\)) among a set of hard negative samples (i.e., the rejected items \(a^{-}_{j} (j \in [1,J])\)):
\begin{equation} \log \pi (a_{t,\le K}|s_{t})=\log \left(\frac{e^{sim(s_{t},a^{+})}}{e^{sim(s_{t},a^{+})}+\sum _{j=1}^{J}e^{sim(s_{t},a^{-}_{j})}}\right), \end{equation}
(9)
where \(sim(\cdot)\) is a similarity kernel that can be the dot product or the negative \(l_{2}\) distance in our experiments.
We define the reward \(r(s_{t},a_{t,\le K})\) as the sum of the similarities between all the top-K candidates and the target item:
\begin{equation} r(s_{t},a_{t,\le K}) = \sum _{i=1}^{K} sim(a_{t,i},a^{+}). \end{equation}
(10)
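The following sketch combines Equations (8)-(10) into a single policy-gradient loss: the per-turn reward is the summed dot-product similarity between the top-K recommendations and the target, the discounted return weights the softmax cross-entropy log-probability of the target against the hard negatives, and the negated objective can be minimised with a standard optimiser. The tensor shapes and the function name are assumptions; in practice the rewards are returned by the user simulator.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(states, target, hard_negatives, top_k_items, gamma: float = 0.2):
    """Sketch of Equations (8)-(10). states: (T, dim) estimated states over a trajectory;
    target: (dim,) the target item a+; hard_negatives: (T, J, dim) rejected items;
    top_k_items: (T, K, dim) the recommended items at each turn (shapes are assumptions)."""
    T = states.size(0)
    # Reward (Eq. 10): sum of dot-product similarities between the top-K items and the target.
    rewards = (top_k_items @ target).sum(dim=-1)                     # (T,)
    # Discounted cumulative return R(tau) over the trajectory (Eq. 7).
    ret = (gamma ** torch.arange(T, dtype=torch.float32) * rewards).sum()
    # Log-probability (Eq. 9): softmax over the target and the J hard negatives.
    pos_logits = (states * target).sum(dim=-1, keepdim=True)         # (T, 1)
    neg_logits = torch.einsum("td,tjd->tj", states, hard_negatives)  # (T, J)
    logits = torch.cat([pos_logits, neg_logits], dim=-1)             # (T, 1+J)
    log_probs = F.log_softmax(logits, dim=-1)[:, 0]                  # log pi(a+ | s_t)
    # REINFORCE (Eq. 8): maximise E[log pi * R(tau)], i.e., minimise the negative.
    return -(log_probs * ret).sum()
```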

3.3.3 Training Procedure.

We also present the training procedure of our PMMIR model for PO-SMDP with REINFORCE in Algorithm 1. To facilitate the training processes, a user simulator [16, 49] is adopted as a substitute for real human users. Further information regarding the specific user simulator employed is discussed in Section 4.2. As shown in Algorithm 1, the recommender policies \(\pi _{\phi }\) and \(\pi _{\theta }\) aim to maximise the expected rewards by properly cooperating with each other.
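Since Algorithm 1 is not reproduced here, the following compressed sketch outlines one simulated interaction episode; the component names (history_encoder, state_tracker, top_k_recommend) refer to the illustrative sketches above, and simulator.critique is a hypothetical interface to the user simulator rather than the exact implementation.

```python
def run_episode(history_encoder, state_tracker, simulator, item_pool,
                past_items, target_id, max_turns=10, k=3):
    """Illustrative rollout of one interaction session: an initial personalised
    recommendation from the interaction history, followed by critique-driven updates."""
    shown = set()
    trajectory = []                                     # (state, top-K ids) per turn
    s_t, h_t = history_encoder(past_items)              # subtask 1: past interests
    for _ in range(max_turns):
        top_k = top_k_recommend(s_t.squeeze(0), item_pool, shown, k=k)
        trajectory.append((s_t, top_k))
        shown.update(top_k.tolist())
        if target_id in top_k.tolist():                 # target found: episode ends
            break
        # The simulator critiques the recommended item closest to the target
        # (hypothetical interface returning the critiqued item id and critique embedding).
        crit_id, critique_emb = simulator.critique(top_k, target_id)
        s_t, h_t = state_tracker(h_t, item_pool[crit_id].unsqueeze(0),
                                 critique_emb)          # subtask 2: current needs
    return trajectory
```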

4 Experimental Setup

We proceed to evaluate the effectiveness of our proposed PMMIR model, along with its two variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)), in comparison to existing approaches from the literature. In particular, we aim to address the following three research questions:
RQ1: Is there a significant improvement in the performance of our proposed PMMIR model compared to the existing state-of-the-art baseline models in the multi-modal interactive recommendation task?
RQ2: Can both cold-start and warm-start users benefit from our proposed PMMIR model?
RQ3: What are the impacts of the components of the PMMIR model (such as \(h_{0}^{c} = h_{n}^{p}\) and CLIP backbones) and the introduced hyper-parameters (such as \(\gamma\) & K) on the overall performance?

4.1 Datasets & Setup

Datasets. Since there is no existing dataset suitable for the personalisation setting of the multi-modal interactive recommendation task, we derive two datasets (i.e., Amazon-Shoes and Amazon-Dresses) for providing the user-item interaction sequences from two well-known public fashion datasets, i.e., Amazon Review Data (2014)1 and Amazon Review Data (2018)2 with the “Clothing, Shoes and Jewelry” category. In particular, we derive the Amazon-Shoes dataset by including various types of shoes for women (such as “Athletic,” “Boot,” “Clog,” “Flat,” “Heel,” “Pump,” “Sneaker,” “Stiletto,” and “Wedding”) from the “Clothing, Shoes and Jewelry” category of Amazon Review Data (2014). Meanwhile, we also derive the Amazon-Dresses dataset by including the fashion products with the “dress” label for women from the “Clothing, Shoes and Jewelry” category of Amazon Review Data (2018). On both derived datasets, we construct the user-item interaction sequences by concatenating the IDs of a user’s purchased items according to their interaction timestamps. Table 1 summarises the statistics of the Amazon-Shoes and Amazon-Dresses datasets. Both derived datasets are publicly available via the link provided in the abstract. Both datasets provide an image for each fashion product. In addition, for training/testing the user simulators, we use two well-known fashion datasets, namely, the Shoes [4, 16] and Fashion IQ Dresses [49] datasets (discussed further in Section 4.2) for relative captioning with the provided triplets (i.e., \(\langle a_{target}\), \(a_{candidate}\), \(c_{caption}\rangle\)). The relative captions (\(c_{caption}\)) of the image pairs (\(a_{target}\) and \(a_{candidate}\)) describe the attributes of the target item \(a_{target}\) that are missing in the candidate item \(a_{candidate}\) in natural language and have been written by real users via crowd-sourcing. The Shoes dataset contains 10,751 triplets in total, while the Fashion IQ Dresses dataset provides 11,970 and 4,034 triplets for training and testing, respectively. Note that the triplets in the Shoes and Fashion IQ Dresses datasets for training the user simulators do not include any data from our derived Amazon-Shoes and Amazon-Dresses datasets, which are used for training the recommendation models.
Table 1. Datasets’ Statistics

Dataset          Total Items    Train Users    Test Users    Sequence Lengths
Amazon-Shoes     31,940         14,892         3,722         3–9
Amazon-Dresses   18,501         13,657         3,414         4–9
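The sequence construction can be sketched as follows; the file name and field names (reviewerID, asin, unixReviewTime) follow the public Amazon Review Data format but are assumptions here, and the length filter mirrors the ranges reported in Table 1.

```python
import pandas as pd

# A minimal sketch of deriving user-item interaction sequences from the Amazon review
# data; file and field names are illustrative of the public Amazon Review Data format.
reviews = pd.read_json("Clothing_Shoes_and_Jewelry.json.gz", lines=True,
                       compression="gzip")
reviews = reviews[["reviewerID", "asin", "unixReviewTime"]].dropna()

# Sort each user's purchases by timestamp and concatenate the item IDs.
sequences = (reviews.sort_values("unixReviewTime")
                    .groupby("reviewerID")["asin"]
                    .apply(list))

# Keep sequences within the length range of Table 1 (e.g., 3-9 for Amazon-Shoes);
# the last item is the target (current needs), the rest form the interaction history.
sequences = sequences[sequences.str.len().between(3, 9)]
histories = sequences.apply(lambda seq: seq[:-1])
targets = sequences.apply(lambda seq: seq[-1])
```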
Setup. As described in Algorithm 1, we leverage a two-stage training procedure for optimising the PMMIR model following References [16, 51]. In particular, we first pre-train and initialise the PMMIR model with the SL setting using a learning rate \(\eta _{sl}=10^{-3}\) [16] and then further optimise the PMMIR model in the RL setting using a learning rate \(\eta _{rl}=10^{-5}\) [16]. We use Adam [24] with Equation (6) and Equation (8) for optimising the PMMIR model’s parameters, respectively. The pre-trained CLIP image and text encoders are loaded with the “ViT-B/32” checkpoint,3 and the visual and textual embedding dimensionalities of the multi-modal feature space are both set to 512. The initial hidden state \(h_{0}^{p}\) of the history encoder \(GRU^{past}(\cdot)\) in PMMIR\(_{GRU}\) is initialised with zeros. Meanwhile, PMMIR\(_{Transformer}\) does not have such an explicit initial hidden state \(h_{0}^{p}\) of the history encoder \(Transformer^{past}(\cdot)\). Indeed, PMMIR\(_{Transformer}\) directly takes the sequence of the interacted items as the input. The batch size is set to 128 following the setting in Reference [16]. The maximum number of epochs for SL & RL training is set to 20 with early stopping as in Reference [52], while the maximum number of interaction turns is set to 10 as in References [51, 52]. At each interaction turn for both training and testing, the recommender system provides the top-K (i.e., \(K=3\)) items as a recommendation. For the RL stage, the number of hard negative samples (i.e., J) is set to 5, following Reference [51]. The similarity kernel \(sim(\cdot)\) in Equation (9) is set to be the dot product by default with the normalised visual and textual representations. If not mentioned otherwise, then the discount factor \(\gamma\) is set to 0.2. We consider users with the least interactions (3 interactions on Amazon-Shoes and 4 interactions on Amazon-Dresses) as cold-start users, while the other users with longer interaction sequences are considered as warm-start users. For each user-item interaction sequence, we leave the last interaction as the user’s target item (i.e., the current needs) and the previous sequence of interactions as the users’ interaction history (i.e., the past interests).
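For reference, the settings described above can be summarised as a configuration dictionary; the key names are illustrative, while the values are taken from the description above.

```python
# A compact summary of the experimental settings (key names are illustrative).
CONFIG = {
    "clip_checkpoint": "ViT-B/32",   # 512-d visual and textual embeddings
    "lr_sl": 1e-3,                   # supervised pre-training learning rate
    "lr_rl": 1e-5,                   # reinforcement learning learning rate
    "batch_size": 128,
    "max_epochs": 20,                # for both SL and RL, with early stopping
    "max_turns": 10,                 # maximum number of interaction turns
    "top_k": 3,                      # items recommended per turn
    "num_hard_negatives": 5,         # J in Equation (9)
    "similarity": "dot_product",     # sim(.) with normalised representations
    "gamma": 0.2,                    # discount factor
}
```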

4.2 Online Evaluation & Metrics

Online Evaluation. The success of the personalised MMIR task is measured by the number of interaction turns to obtain the target item(s) and the rank of the target item(s) in each interaction turn. We evaluate the effectiveness of our proposed PMMIR model for personalised multi-modal interactive recommendation in comparison to the existing approaches from the literature based on an online evaluation approach. Figure 4 shows an example of online evaluation with top-K (e.g., \(K=3\)) recommendation across multi-turn interactions in the personalised MMIR scenario. In this scenario, the recommender system ranks all the items and shows the top-K items as the recommendations at each turn. Meanwhile, a user browses the exposed top-K items, gives a natural-language critique on the most preferred item, and rejects the others at each turn. In particular, the figure illustrates how a user can find the desired item through multi-turn interactions. Following the methodology in References [51, 52, 62], we measure the effectiveness of the interactive recommendation models at interaction turn M. However, the user may check more items in the ranking list at each turn, down to rank N.
Fig. 4. Online evaluation with top-K recommendations across multi-turn interactions in the personalised MMIR scenario.
User Simulators. In both the optimisation and evaluation processes, user simulators have been employed as substitutes for real human users in the context of relative captioning tasks [16, 49, 53, 62]. A user simulator based on relative captioning can automatically generate descriptions of the prominent visual differences between any pair of target and candidate images as users’ natural-language feedback. This natural-language feedback generation process within the user simulator closely resembles a shopping conversation between a customer and a shopping assistant. The rewards returned by the user simulators are calculated using Equation (10). The user simulator can actively interact with the recommender system to provide various real-time natural-language feedback, thereby allowing satisfactory multi-modal interactive recommender systems to be learned with sufficient training data. In particular, we adopt a user simulator with the Show, Attend, & Tell [56] model trained with triplets from Shoes by using the checkpoint4 [4, 16] provided by Guo et al. [16]. In addition, we adopt the VL-Transformer model introduced in References [49, 52] as a user simulator, specifically trained on triplets extracted from the Fashion IQ Dresses dataset, following the setting5 in References [49, 52] and using the checkpoint provided by Wu et al. [52]. Both user simulators are deployed by using an image captioning tool (called ImageCaptioning.pytorch6 [32]). The user simulators are intensively trained using crowdsourced relative expressions to describe the visual distinctions between pairs of images [16, 49, 52]. Moreover, the pre-trained user simulators have previously been thoroughly assessed through both quantitative evaluations and user studies, making them a reliable substitute for real users in conducting evaluation experiments [16, 49, 51]. Following References [16, 49, 53], we assume that the user simulator only gives a natural-language critique on a single recommended item (the most similar to the target item) at each turn by describing the desired attributes in the target item that are missing in the recommended item. Such a simplification is necessitated by the existing available datasets and the availability of accurate user simulators.
Metrics. We measure the effectiveness of the multi-modal interactive recommendations at interaction turn M in terms of Normalised Discounted Cumulative Gain (NDCG@N truncated at rank \(N=3\)) and Success Rate (SR). To assess the quality of the ranking lists, the NDCG metric emphasises the importance of higher ranks compared to lower ones. Meanwhile, the SR metric measures the percentage of users for whom the target image was successfully retrieved within a specific number of interactions, denoted as M within the range of 1 to 10. For significance testing, we employ both evaluation metrics, namely, NDCG@3 and SR, at the 5th and 10th interaction turns.
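One possible per-user computation of these metrics is sketched below, assuming a single relevant target item per user (as in our setting); the function names and the convention that a missing target is encoded as None are illustrative.

```python
import math
from typing import List, Optional

def ndcg_at_n(target_rank: Optional[int], n: int = 3) -> float:
    """NDCG@N for a single relevant (target) item at a given turn: since the ideal
    DCG is 1, the score is 1/log2(rank+1) if the target appears within the top-N."""
    if target_rank is None or target_rank > n:
        return 0.0
    return 1.0 / math.log2(target_rank + 1)

def success_rate(found_at_turn: List[Optional[int]], m: int) -> float:
    """SR at turn M: the fraction of users whose target item was retrieved within
    the first M interaction turns (turns counted from 1; None = never found)."""
    hits = sum(1 for t in found_at_turn if t is not None and t <= m)
    return hits / len(found_at_turn)
```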

4.3 Baselines

We conduct a comparative analysis between our proposed PMMIR model variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and existing state-of-the-art baseline models, including their extensions, for the MMIR task.
The first group of baseline models are all based on GRUs to compare with PMMIR\(_{GRU}\):
GRU\(_{hist}\): The GRU\(_{hist}\) model is adapted from the GRU4Rec [20] model for sequential recommendations. Unlike the GRU4Rec model, which takes a sequence of item IDs as its input, GRU\(_{hist}\) adopts a GRU to model the user-item interaction history with images.
GRU\(_{img+txt}\): The GRU\(_{img+txt}\) model (or called Dialog Manager (DM) [16]) leverages a single GRU as a state tracker with images of items and natural-language critiques as its inputs for addressing the multi-modal interactive recommendation task.
EGE [51]: Estimator-Generator-Evaluator (EGE) is also a GRU-based model for MMIR. It uses a multi-task learning approach to optimise the model, combining a cross-entropy classification loss for supervised learning and a Q-learning prediction loss for reinforcement learning.
GRU-EGE: To provide strong baseline models for the personalised MMIR task considering both the users’ past interests and the current needs, we integrate the existing sequential recommendation model (i.e., GRU\(_{hist}\)) and the RL-based MMIR model (i.e., EGE) within a pipeline. In particular, the sequential recommendation model estimates the users’ past interests from the interaction history and provides the initial recommendations, while the RL-based MMIR model tracks the users’ current needs from the real-time interactions and updates the subsequent recommendations.
GRU\(_{all}\): We extend a single GRU for both estimating the users’ past interests and tracking the users’ current needs. We optimise the GRU\(_{all}\) model with a triplet loss (i.e., GRU\(_{all}\)-SL) and then extend it with REINFORCE [43] (i.e., GRU\(_{all}\)-RL) to further improve the performance by maximising the long-term rewards.
The next group of baseline models are based on Transformers to compare with PMMIR\(_{Transformer}\):
Transformer\(_{hist}\): The Transformer\(_{hist}\) model is adapted from the SASRec [23] model for sequential recommendations. Unlike the SASRec model, which takes a sequence of item IDs as its input, Transformer\(_{hist}\) adopts a Transformer encoder to model the user-item interaction history with images and predict the target item.
Transformer\(_{img+txt}\) & MMT: The Transformer\(_{img+txt}\) model, also called Multi-modal Interactive Transformer [49, 53], is a state-of-the-art multi-modal interactive recommendation model. It incorporates a Transformer encoder to directly attend to the entire multi-modal real-time interaction sequences, encompassing the users’ textual feedback and the system’s visual recommendations. We optimise the Transformer\(_{img+txt}\) model with a triplet loss and then extend it with REINFORCE (denoted by MMT) to further improve the performance by maximising the long-term rewards.
Transformer-MMT: Similar to GRU-EGE, we also combine the well-trained Transformer\(_{hist}\) and MMT models into a pipeline, making personalised initial recommendations with Transformer\(_{hist}\) and updating the subsequent recommendations during the real-time interactions with MMT.
Transformer\(_{all}\): We also extend a single Transformer encoder for both estimating the users’ past interests and tracking the users’ current needs. We optimise Transformer\(_{all}\) with a triplet loss (Transformer\(_{all}\)-SL) and then extend it with REINFORCE (Transformer\(_{all}\)-RL) to further improve the performance by maximising the long-term rewards.
Although there are a few more attention-based/Transformer-based sequential recommendation models (such as BERT4Rec [41] and Transformers4Rec [10]) and multi-modal interactive recommendation models (such as MMRAN [52] with a RNN-enhanced Transformer structure), they can make the PMMIR model overly complex compared to using a simple GRU-based/Transformer-based history encoder. We leave the integration of these more advanced sequential recommendation models for estimating past interests and multi-modal interactive models for tracking the current needs as future work. In addition to the above baseline models for the MMIR task, we also investigate variants of PMMIR for ablation studies. Such variants can also act as solid baselines:
PMMIR w/o \(\boldsymbol {h^{c}_{0}=h^{p}_{n}}\): The “PMMIR w/o \(h^{c}_{0}=h^{p}_{n}\)” variant initialises the initial hidden state \(h^{c}_{0}\) of the state tracker randomly instead of using \(h^{c}_{0}=h^{p}_{n}\).
PMMIR w/ \(\boldsymbol {Linear^{img/txt}}\): The “PMMIR w/ \(Linear^{img/txt}\)” variant adds both a \(Linear^{img}\) layer in the image encoder and a \(Linear^{txt}\) layer in the textual encoder for fine-tuning the CLIP visual and textual representations. The parameters of both the \(Linear^{img}\) and \(Linear^{txt}\) layers are frozen during the RL training procedure following References [16, 51].
PMMIR w/ “RN101”: The “PMMIR w/ RN101” variant replaces the ViT-based CLIP checkpoint (i.e., “ViT-B/32”) with a ResNet101-based [18] CLIP checkpoint (i.e., “RN101”).
For fair comparisons, all of the tested baseline models and variants use CLIP to encode the text and image as the backbone representations (as described in Section 3.2). However, GRU\(_{hist}\) and Transformer\(_{hist}\) can be considered as sequential recommendation models, since they only take the users’ interaction history into consideration, allowing a comparison with models that do not consider text or image representations. Although there are a few other models with different formulations for the interactive recommendation task (e.g., RCR [62], EAR [27], CRM [42], and SGR [50]), these models are not comparable with our scenario, since they require additional attributes of items for learning [17, 60, 61, 62], require a multi-modal knowledge graph for reasoning [50], or cannot incorporate both the textual and visual modalities during the recommendation process [27, 42].

5 Experimental Results

In this section, we present an analysis of the experimental results in relation to the three research questions outlined in Section 4 to demonstrate the effectiveness of our proposed PMMIR model. Specifically, we address the overall effectiveness of the PMMIR model variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) for multi-modal interactive recommendations (RQ1, discussed in Section 5.1), its performance on both cold-start and warm-start users (RQ2, detailed in Section 5.2), and the impact of various components and hyperparameters (RQ3, covered in Section 5.3). To further consolidate our findings, we provide a use case based on the logged experimental results in Section 5.4.

5.1 PMMIR vs. Baselines (RQ1)

To address RQ1, we investigate the performance of our proposed PMMIR model variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and the baseline models. Figure 5 depicts the recommendation effectiveness of our proposed PMMIR model variants, along with the corresponding baseline models, for top-3 recommendations in terms of Success Rate (SR) on the Amazon-Shoes and Amazon-Dresses datasets. Specifically, Figures 5(a) and (c) represent the results using PMMIR\(_{GRU}\), while Figures 5(b) and (d) correspond to PMMIR\(_{Transformer}\). The x-axis indicates the number of interaction turns. Comparing the results presented in Figure 5, we can observe that our proposed PMMIR model variants consistently outperform the baseline models in terms of SR across different interaction turns (in particular, from 4th to 10th turns). This indicates the superior overall performance of our PMMIR models. As the number of interaction turns increases, the differences in effectiveness between our PMMIR models and the baseline models become more pronounced, as observed from the increasing gaps in SR. This suggests that our PMMIR models demonstrate a stronger performance advantage over the baseline models as the interaction process unfolds. We can also observe the same trends on NDCG@3. We omit their reporting in a figure to reduce redundancy. The better overall performance of PMMIR suggests that our PMMIR model can better incorporate the users’ preferences from both the interaction history and the real-time interactions compared to the baseline models.
Fig. 5. Comparison of the recommendation effectiveness in terms of SR between our proposed PMMIR model variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and the baseline models at various interaction turns with top-3 recommendations on both datasets.
To quantify the improvements achieved by our proposed PMMIR model in comparison to the baseline models, we measure their performances in terms of SR and NDCG@3 at the 5th and 10th interaction turns. This enables us to assess the progress and effectiveness of our PMMIR model at different stages of the interaction process. Table 2 presents the obtained recommendation performances of the PMMIR model variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and their corresponding baseline models. These baseline models include the GRU-based models (GRU\(_{hist}\), GRU\(_{img+txt}\), GRU\(_{all}\)-SL, EGE, GRU-EGE, GRU\(_{all}\)-RL) as well as the Transformer-based models (Transformer\(_{hist}\), Transformer\(_{img+txt}\), Transformer\(_{all}\)-SL, MMT, Transformer-MMT, Transformer\(_{all}\)-RL). The performances are evaluated using the same test datasets from the Amazon-Shoes and Amazon-Dresses datasets at the 5th and 10th interaction turns. The table provides a comprehensive overview of the recommendation performances, allowing for a direct comparison between the PMMIR model and the various baseline models. In Table 2, the best overall performing results across the four groups of columns are highlighted in bold. * indicates a significant difference, determined by a paired t-test with a Holm-Bonferroni multiple comparison correction (\(p\lt 0.05\)), when compared to the PMMIR model within each group. Comparing the results in the table, we observe that our proposed PMMIR\(_{GRU}\) model consistently achieves significantly better performances, with improvements on both metrics ranging from 4%–8% and 2%–4% on the Amazon-Shoes and Amazon-Dresses datasets, respectively, compared to the best GRU-based baseline model. Similarly, the PMMIR\(_{Transformer}\) model also demonstrates similar improvements, with performance gains ranging from 4%–7% and 4%–6% compared to the best Transformer-based baseline model. These findings highlight the effectiveness of our proposed PMMIR models in outperforming the baseline models across both datasets. Furthermore, it is worth noting that the PMMIR\(_{Transformer}\) model, which is based on Transformers, generally outperforms the PMMIR\(_{GRU}\) model, which is based on GRUs, in terms of both metrics on both the Amazon-Shoes and Amazon-Dresses datasets. This observation highlights the superiority of the Transformer-based approach in achieving improved recommendation performances.
Table 2. The Recommendation Effectiveness of Our Proposed PMMIR Model Variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and the Baseline Models at the 5th and 10th Turns on the Amazon-Shoes and Amazon-Dresses Datasets. Each cell reports NDCG@3 / SR. * indicates a significant difference (\(p\lt 0.05\), paired t-test with Holm-Bonferroni correction) w.r.t. PMMIR within each group.

| Models | Learning | Amazon-Shoes, Turn 5 | Amazon-Shoes, Turn 10 | Amazon-Dresses, Turn 5 | Amazon-Dresses, Turn 10 |
|---|---|---|---|---|---|
| GRU-based | | | | | |
| GRU\(_{hist}\) | SL | 0.0131* / 0.0134* | 0.0198* / 0.0201* | 0.0435* / 0.0445* | 0.0638* / 0.0644* |
| GRU\(_{img+txt}\) | SL | 0.1342* / 0.1421* | 0.2635* / 0.2705* | 0.3015* / 0.3145* | 0.4658* / 0.4703* |
| GRU\(_{all}\)-SL | SL | 0.1520* / 0.1606* | 0.2740* / 0.2796* | 0.3204* / 0.3315* | 0.4653* / 0.4703* |
| PMMIR\(_{GRU}\)-SL | SL | 0.1564* / 0.1647* | 0.2925* / 0.2998* | 0.3441* / 0.3552* | 0.4966* / 0.5019* |
| EGE | RL | 0.1970* / 0.2095* | 0.3644* / 0.3712* | 0.3825* / 0.4012* | 0.5885* / 0.5950* |
| GRU-EGE | SL/RL | 0.2160* / 0.2310* | 0.3746* / 0.3809* | 0.4102* / 0.4243* | 0.6114* / 0.6193* |
| GRU\(_{all}\)-RL | RL | 0.2160* / 0.2272* | 0.3821* / 0.3876* | 0.4573* / 0.4712* | 0.6587* / 0.6659* |
| PMMIR\(_{GRU}\) | RL | 0.2299 / 0.2412 | 0.4120 / 0.4196 | 0.4748 / 0.4878 | 0.6766 / 0.6843 |
| % Improvement | | 6.44 / 4.42 | 7.83 / 8.26 | 3.83 / 3.52 | 2.72 / 2.76 |
| Transformer-based | | | | | |
| Transformer\(_{hist}\) | SL | 0.0104* / 0.0107* | 0.0149* / 0.0150* | 0.0213* / 0.0228* | 0.0411* / 0.0422* |
| Transformer\(_{img+txt}\) | SL | 0.1102* / 0.1176* | 0.2235* / 0.2286* | 0.2603* / 0.2735* | 0.4343* / 0.4436* |
| Transformer\(_{all}\)-SL | SL | 0.1122* / 0.1179* | 0.2138* / 0.2192* | 0.2425* / 0.2553* | 0.3927* / 0.3994* |
| PMMIR\(_{Transformer}\)-SL | SL | 0.1245* / 0.1311* | 0.2472* / 0.2536* | 0.2842* / 0.2937* | 0.4419* / 0.4498* |
| MMT | RL | 0.2220* / 0.2302* | 0.3894* / 0.3973* | 0.4721* / 0.4867* | 0.6759* / 0.6826* |
| Transformer-MMT | SL/RL | 0.2258* / 0.2340* | 0.3935* / 0.4013* | 0.4798* / 0.4958* | 0.6789* / 0.6858* |
| Transformer\(_{all}\)-RL | RL | 0.2289* / 0.2412* | 0.3919* / 0.3989* | 0.4950* / 0.5086* | 0.6809* / 0.6876* |
| PMMIR\(_{Transformer}\) | RL | 0.2390 / 0.2517 | 0.4207 / 0.4276 | 0.5261 / 0.5394 | 0.7107 / 0.7171 |
| % Improvement | | 4.41 / 4.35 | 6.91 / 6.55 | 6.28 / 6.06 | 4.38 / 4.29 |
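The significance markers in Table 2 (and in Tables 3 and 4 below) are obtained from paired t-tests with a Holm-Bonferroni multiple comparison correction. As a rough sketch (not the exact evaluation script), such a family of comparisons over per-user scores could be computed with SciPy and statsmodels as follows, where `pmmir_scores` and each baseline's scores are aligned over the same simulated users:

```python
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def compare_to_pmmir(pmmir_scores, baseline_scores_by_model, alpha=0.05):
    """Paired t-test of PMMIR against each baseline on per-user scores (e.g.,
    NDCG@3 at the 10th turn), with a Holm-Bonferroni correction over the family."""
    names, pvalues = [], []
    for name, scores in baseline_scores_by_model.items():
        _, p = ttest_rel(pmmir_scores, scores)  # paired over the same users
        names.append(name)
        pvalues.append(p)
    reject, p_adj, _, _ = multipletests(pvalues, alpha=alpha, method="holm")
    return {n: (p, r) for n, p, r in zip(names, p_adj, reject)}
```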
In response to RQ1, the results obtained clearly demonstrate that our proposed PMMIR model variants exhibit a significant performance advantage over the state-of-the-art baseline models. Therefore, our proposed PMMIR model with hierarchical state representations in PO-SMDP can effectively incorporate the users’ preferences from both the interaction history and the real-time interactions.

5.2 Cold-start vs. Warm-start Users (RQ2)

To address RQ2, we investigate the performance of our proposed PMMIR model on cold-start and warm-start users. We classify users with the minimum number of interactions (3 interactions on Amazon-Shoes and 4 interactions on Amazon-Dresses) as cold-start users, while those with longer interaction sequences are categorised as warm-start users. This investigation aims to understand how effectively our model adapts to different user scenarios. Table 3 presents the performances of our PMMIR model variants, as well as the RL-based and pipeline-based baseline models, in terms of NDCG@3 and SR; the top part of the table covers the GRU-based models, while the bottom part covers the Transformer-based models. Comparing the results in Table 3, we observe that our proposed PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\) models achieve better performances than the corresponding baseline models in terms of both metrics on both cold-start and warm-start users on the two datasets, except for the cold-start users with PMMIR\(_{GRU}\) in terms of NDCG@3 on Amazon-Dresses. These results show that both the cold-start and the warm-start users can generally benefit from our proposed PMMIR model variants with hierarchical state representations. In addition, we observe that the warm-start users generally benefit more from the GRU-based variant than the cold-start users: PMMIR\(_{GRU}\) achieves improvements of 11%–12% (warm-start) vs. 4%–5% (cold-start) on Amazon-Shoes and 4%–5% (warm-start) vs. 0%–1% (cold-start) on Amazon-Dresses in terms of both metrics. Conversely, the cold-start users generally benefit more from the Transformer-based variant than the warm-start users: PMMIR\(_{Transformer}\) achieves improvements of 8%–9% (cold-start) vs. 2%–3% (warm-start) on Amazon-Shoes and 3.8%–4.0% (cold-start) vs. 3.4%–3.5% (warm-start) on Amazon-Dresses in terms of both metrics. We postulate that this difference between PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\) on cold-start and warm-start users can be attributed to the characteristics of the interaction history sequences and the different sequence modelling abilities of GRUs and Transformers. Long purchase sequences (warm-start users) span a greater period of time and can be noisy due to the users’ preferences drifting over time, while short purchase sequences (cold-start users) span a smaller period but can be less informative about the users’ preferences. Meanwhile, GRUs (adopted by PMMIR\(_{GRU}\)) can effectively denoise such sequences through the forgetting behaviour of their gating mechanism (the update and reset gates), while the Transformer encoders (adopted by PMMIR\(_{Transformer}\)) have stronger sequence modelling abilities due to their more complex neural structures but have been shown to be insufficient at addressing noisy items within sequences [5].
Table 3. Personalised Multi-modal Interactive Recommendation Effectiveness of Our Proposed PMMIR Model Variants (PMMIR\(_{GRU}\) and PMMIR\(_{Transformer}\)) and the Baseline Models on the Cold-start and Warm-start Users at the 10th Turn on the Amazon-Shoes and Amazon-Dresses Datasets. Each cell reports Cold / Warm / Overall. \(^{*}\) indicates a significant difference (\(p\lt 0.05\), paired t-test with Holm-Bonferroni correction) w.r.t. PMMIR within each group.

| Models | Amazon-Shoes NDCG@3 (Cold / Warm / Overall) | Amazon-Shoes SR (Cold / Warm / Overall) | Amazon-Dresses NDCG@3 (Cold / Warm / Overall) | Amazon-Dresses SR (Cold / Warm / Overall) |
|---|---|---|---|---|
| GRU-based | | | | |
| EGE | 0.3726\(^{*}\) / 0.3546\(^{*}\) / 0.3644\(^{*}\) | 0.3807 / 0.3600\(^{*}\) / 0.3712\(^{*}\) | 0.5876\(^{*}\) / 0.5892\(^{*}\) / 0.5885\(^{*}\) | 0.5935\(^{*}\) / 0.5963\(^{*}\) / 0.5950\(^{*}\) |
| GRU-EGE | 0.3764 / 0.3724\(^{*}\) / 0.3746\(^{*}\) | 0.3827 / 0.3787\(^{*}\) / 0.3809\(^{*}\) | 0.6120\(^{*}\) / 0.6109\(^{*}\) / 0.6114\(^{*}\) | 0.6210\(^{*}\) / 0.6179\(^{*}\) / 0.6193\(^{*}\) |
| GRU\(_{all}\)-RL | 0.3827 / 0.3814\(^{*}\) / 0.3821\(^{*}\) | 0.3886 / 0.3864\(^{*}\) / 0.3876\(^{*}\) | 0.6575 / 0.6597\(^{*}\) / 0.6587\(^{*}\) | 0.6639 / 0.6676\(^{*}\) / 0.6659\(^{*}\) |
| PMMIR\(_{GRU}\) | 0.4007 / 0.4253 / 0.4120 | 0.4089 / 0.4322 / 0.4196 | 0.6569 / 0.6933 / 0.6766 | 0.6665 / 0.6994 / 0.6843 |
| % Improvement | 4.70 / 11.51 / 7.83 | 5.22 / 11.85 / 8.26 | -0.09 / 5.09 / 2.72 | 0.39 / 4.76 / 2.76 |
| Transformer-based | | | | |
| MMT | 0.3902\(^{*}\) / 0.3885 / 0.3894\(^{*}\) | 0.3980\(^{*}\) / 0.3964 / 0.3973\(^{*}\) | 0.6691\(^{*}\) / 0.6817 / 0.6759\(^{*}\) | 0.6754\(^{*}\) / 0.6886 / 0.6826\(^{*}\) |
| Transformer-MMT | 0.3973\(^{*}\) / 0.3889 / 0.3935\(^{*}\) | 0.4059\(^{*}\) / 0.3958 / 0.4013\(^{*}\) | 0.6894 / 0.6701\(^{*}\) / 0.6789\(^{*}\) | 0.6959 / 0.6773\(^{*}\) / 0.6858\(^{*}\) |
| Transformer\(_{all}\)-RL | 0.3900\(^{*}\) / 0.3941 / 0.3919\(^{*}\) | 0.3970\(^{*}\) / 0.4011 / 0.3989\(^{*}\) | 0.6797\(^{*}\) / 0.6819 / 0.6809\(^{*}\) | 0.6869\(^{*}\) / 0.6881 / 0.6876\(^{*}\) |
| PMMIR\(_{Transformer}\) | 0.4352 / 0.4035 / 0.4207 | 0.4406 / 0.4122 / 0.4276 | 0.7168 / 0.7055 / 0.7107 | 0.7228 / 0.7124 / 0.7171 |
| % Improvement | 9.54 / 2.39 / 6.91 | 8.55 / 2.77 / 6.55 | 3.97 / 3.46 / 4.38 | 3.87 / 3.46 / 4.29 |
In response to RQ2, we find that both cold-start and warm-start users can benefit from our proposed PMMIR model. The warm-start users can generally benefit more with PMMIR\(_{GRU}\), while the cold-start users can generally benefit more with PMMIR\(_{Transformer}\).
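For concreteness, the cold-start/warm-start partition used above amounts to a simple split by interaction-sequence length. A minimal sketch, with illustrative variable names and assuming the minimum history length stated above (3 on Amazon-Shoes, 4 on Amazon-Dresses):

```python
def split_cold_warm(user_histories, min_len):
    """Partition users by history length: cold-start users have exactly the
    dataset's minimum number of interactions (3 on Amazon-Shoes, 4 on
    Amazon-Dresses), warm-start users have longer interaction sequences."""
    cold = {u: h for u, h in user_histories.items() if len(h) == min_len}
    warm = {u: h for u, h in user_histories.items() if len(h) > min_len}
    return cold, warm
```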

5.3 Impact of Components & Hyper-parameters (RQ3)

To address RQ3, we investigate the impact of the components and the hyper-parameters of our proposed PMMIR model.
Impact of Components.
Table 4 reports the performances of our PMMIR model with different applied ablations in terms of NDCG@3 and SR. The original setting is shown in the top part of the table, while the PMMIR ablation variants (i.e., PMMIR w/o \(h^{c}_{0}=h^{p}_{n}\), PMMIR w/ \(Linear^{img/txt}\), and PMMIR w/ “RN101”) are shown in the second part. All the examined PMMIR ablation variants generally perform worse than the corresponding original PMMIR model. The results of PMMIR w/o \(h^{c}_{0}=h^{p}_{n}\) suggest that our PMMIR model benefits from the initialisation of the state tracker with the final hidden state of the history encoder. The results of PMMIR w/ \(Linear^{img/txt}\) and PMMIR w/ “RN101” indicate that the CLIP model with the “ViT-B/32” checkpoint provides better visual and textual representations than the “RN101” checkpoint, and that further fine-tuning of the CLIP embeddings is not necessary for our personalised MMIR task.
Table 4. Ablation Study at the 10th Turn. Each cell reports NDCG@3 / SR. w/o and w/ denote that a component is removed or replaced in PMMIR, respectively; notation as per Table 3. * indicates a significant difference (\(p\lt 0.05\), paired t-test with Holm-Bonferroni correction) w.r.t. PMMIR for each group.

| Models | Amazon-Shoes, GRU | Amazon-Shoes, Transformer | Amazon-Dresses, GRU | Amazon-Dresses, Transformer |
|---|---|---|---|---|
| PMMIR | 0.4120 / 0.4196 | 0.4207 / 0.4276 | 0.6766 / 0.6843 | 0.7107 / 0.7171 |
| 1. w/o \(h^{c}_{0}=h^{p}_{n}\) | 0.4013 / 0.4102 | 0.4074 / 0.4155 | 0.6658 / 0.6714 | 0.6835* / 0.6899* |
| 2. w/ Linear\(^{img/txt}\) | 0.3966 / 0.4048 | 0.3510* / 0.3575* | 0.6462* / 0.6530* | 0.6252* / 0.6322* |
| 3. w/ “RN101” | 0.3891 / 0.3954* | 0.3914* / 0.4024* | 0.6338* / 0.6392* | 0.6913* / 0.6969* |
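To make ablation (1) concrete, the following is a minimal PyTorch-style sketch of the hierarchical state hand-over that this ablation removes, i.e., initialising the state tracker with the final hidden state of the history encoder (\(h^{c}_{0}=h^{p}_{n}\)); the layer sizes and the way the multi-modal inputs are packed are illustrative assumptions, not the exact PMMIR implementation:

```python
import torch.nn as nn

class HierarchicalStateEncoder(nn.Module):
    """History encoder + state tracker with a hierarchical state hand-over:
    the tracker's initial hidden state is the encoder's final hidden state."""

    def __init__(self, emb_dim=512, hidden_dim=256):
        super().__init__()
        self.history_encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.state_tracker = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def encode_history(self, history_emb):
        # history_emb: (batch, n_past_items, emb_dim), e.g. CLIP item embeddings
        _, h_p_n = self.history_encoder(history_emb)
        return h_p_n  # (1, batch, hidden_dim): the user's past interests

    def track_state(self, feedback_emb, h_prev):
        # feedback_emb: (batch, 1, emb_dim), e.g. a fused image+critique embedding
        _, h_c_t = self.state_tracker(feedback_emb, h_prev)
        return h_c_t  # estimated current needs, used to score candidate items

# h = model.encode_history(past_item_embs)     # hand-over: h^c_0 = h^p_n
# h = model.track_state(turn_feedback_emb, h)  # updated at every interaction turn
```

Under this sketch, removing the hand-over would leave the state tracker to start from a default (zero) initial state, which is consistent with the performance drop observed for ablation (1) in Table 4.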
Impact of Hyper-parameters.
Figure 6 depicts the effects of the reward discount factor (\(\gamma \in [0, 1]\)) used when training the PMMIR model and of the number of exposed top-K items (\(K \in [2, 5]\)) in each ranking list, in terms of SR at the 10th turn on both datasets. Specifically, when the discount factor \(\gamma\) is set to 0, the model exclusively considers immediate rewards and does not take future rewards into account; conversely, when \(\gamma\) is set to 1, the model assigns equal importance to all future rewards. From Figure 6(a), we observe a decreasing trend in the performance of PMMIR\(_{GRU}\) on both datasets and a decrease in the effectiveness of PMMIR\(_{Transformer}\) on Amazon-Shoes when the discount factor \(\gamma\) increases from 0.2 to 1.0; we observe the same trend for PMMIR\(_{Transformer}\) on Amazon-Dresses with \(\gamma \in [0.6,1.0]\). This trend shows that both the history encoder and the state tracker in PMMIR are more influenced by the immediate rewards than by the future rewards. In addition, Figure 6(b) shows that the PMMIR model performs better when more items are exposed to the users at each interaction turn, suggesting that increasing the number of items presented to the users during the interaction process leads to improved recommendation performance for PMMIR.
Fig. 6. Comparison of the recommendation effectiveness at the 10th turn with different \(\gamma\) and K values.
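To illustrate the role of \(\gamma\), the sketch below computes discounted returns for a per-turn reward sequence, as used in REINFORCE-style policy-gradient objectives [48]; it is an illustration of the discounting behaviour rather than the exact PMMIR training loop. With \(\gamma=0\) each turn is credited only with its immediate reward, while with \(\gamma=1\) all future rewards are weighted equally:

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=0} gamma^k * r_{t+k} for every turn t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Example: sparse per-turn rewards over a 5-turn interaction
rewards = [0.0, 0.0, 0.2, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.0))  # [0.0, 0.0, 0.2, 0.0, 1.0] - immediate rewards only
print(discounted_returns(rewards, gamma=1.0))  # [1.2, 1.2, 1.2, 1.0, 1.0] - all future rewards count equally
```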
Overall, in response to RQ3, we find that the PMMIR model generally benefits in terms of effectiveness from the hierarchical state representations, from adequate multi-modal CLIP encoders, from low values of the discount factor \(\gamma\), and from a larger number of exposed top-K items.

5.4 Use Case

In this section, we present use cases of the multi-modal interactive recommendation task with and without personalisation on the Amazon-Shoes dataset in Figure 7. In particular, the figure shows a user’s interaction history and the next target item, as well as the interaction process for the top-3 recommendations between the simulated user and the EGE and PMMIR\(_{GRU}\) models, which are both based on GRUs. When the target item appears in the recommendation list, the user simulator gives a comment to end the interaction, such as “The 3rd shoes are my desired shoes” in Figure 7(c). Comparing the recommendations made by EGE and PMMIR\(_{GRU}\) on the Amazon-Shoes dataset, we observe that our proposed PMMIR\(_{GRU}\) model is able to find the target items with fewer interaction turns than EGE—this is expected, given the increased effectiveness of PMMIR\(_{GRU}\) shown in Section 5.1. In addition, our PMMIR\(_{GRU}\) model is more effective at incorporating the users’ preferences from both the users’ interaction history and the real-time interactions. For instance, our PMMIR\(_{GRU}\) model suggests personalised recommendations with different “high-heeled sandals” at the initial interaction turn, and then easily finds the target item with the critique “tan with a higher heel” at the next turn. Meanwhile, the EGE model can only randomly sample items as the initial recommendations; the “high heel” feature is missing from these initial recommendations, which leads to the EGE model’s failure to find the target item at the next turn. We observed similar trends and results in other use cases involving other baseline models compared to the PMMIR variants on the Amazon-Shoes and Amazon-Dresses datasets, and omit their reporting in this article to reduce redundancy.
Fig. 7. Example use cases for the multi-modal interactive recommendation task with EGE (without personalisation) and PMMIR\(_{GRU}\) (with personalisation) on Amazon-Shoes.

6 Conclusions

In this article, we proposed a novel personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning with the Options framework to more effectively incorporate the users’ preferences from both their past and real-time interactions. Specifically, PMMIR decomposes the personalised interactive recommendation process into a sequence of two subtasks with hierarchical state representations: a first subtask where a history encoder learns the users’ past interests with the hidden states of history for providing personalised initial recommendations and a second subtask where a state tracker estimates the current needs with the real-time estimated states for updating the subsequent recommendations. The history encoder and the state tracker are jointly optimised with a single optimisation objective by maximising the users’ future satisfaction. Following previous work [16, 49, 51], we trained and evaluated our PMMIR model using a user simulator that can generate natural-language critiques about the recommendations as a surrogate for real human users. Our experiments on the Amazon-Shoes and Amazon-Dresses datasets demonstrate that our proposed PMMIR model variants achieve significantly better performances than the best baseline models at the 5th and 10th turns—for instance, improvements of 4%–8% on Amazon-Shoes and 2%–4% on Amazon-Dresses with PMMIR\(_{GRU}\), and of 4%–7% and 4%–6%, respectively, with PMMIR\(_{Transformer}\). The reported results show that our proposed PMMIR model benefits from the dual GRUs/Transformers structure and from the initialisation of the state tracker with the final hidden state of the history encoder. In addition, the results show that both cold-start and warm-start users can benefit from our proposed PMMIR model.

References

[1]
M. Mehdi Afsar, Trafford Crump, and Behrouz Far. 2022. Reinforcement learning based recommender systems: A survey. Comput. Surv. 55, 7 (2022), 1–38.
[2]
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Conditioned and composed image retrieval combining and partially fine-tuning clip-based features. In CVPR. 4959–4968.
[3]
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Effective conditioned and composed image retrieval combining clip-based features. In CVPR. 21466–21474.
[4]
Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. 2010. Automatic attribute discovery and characterization from noisy web data. In ECCV. 663–676.
[5]
Huiyuan Chen, Yusan Lin, Menghai Pan, Lan Wang, Chin-Chia Michael Yeh, Xiaoting Li, Yan Zheng, Fei Wang, and Hao Yang. 2022. Denoising self-attentive sequential recommendation. In RecSys. 92–101.
[6]
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019. Top-k off-policy correction for a REINFORCE recommender system. In WSDM. 456–464.
[7]
Minmin Chen, Bo Chang, Can Xu, and Ed H. Chi. 2021. User response models to improve a reinforce recommender system. In WSDM. 121–129.
[8]
Xiaocong Chen, Lina Yao, Julian McAuley, Guanglin Zhou, and Xianzhi Wang. 2021. A survey of deep reinforcement learning in recommender systems: A systematic review and future directions. arXiv:2109.03540 (2021).
[9]
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 (2014).
[10]
Gabriel de Souza Pereira Moreira, Sara Rabhi, Jeong Min Lee, Ronay Ak, and Even Oldridge. 2021. Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation. In RecSys. 143–153.
[11]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[12]
Thomas G. Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. 13 (2000), 227–303.
[13]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16 \(\times\) 16 words: Transformers for image recognition at scale. arXiv:2010.11929 (2020).
[14]
Chongming Gao, Wenqiang Lei, Xiangnan He, Maarten de Rijke, and Tat-Seng Chua. 2021. Advances and challenges in conversational recommender systems: A survey. AI Open 2 (2021), 100–126.
[15]
Claudio Greco, Alessandro Suglia, Pierpaolo Basile, and Giovanni Semeraro. 2017. Converse-Et-Impera: Exploiting deep learning and hierarchical reinforcement learning for conversational recommender systems. In AI*IA. 372–386.
[16]
Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In NeurIPS. 678–688.
[17]
ASM Haque and Hongning Wang. 2022. Rethinking conversational recommendations: Is decision tree all you need? arXiv:2208.14614 (2022).
[18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[19]
Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In CIKM. 843–852.
[20]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. ICLR.
[21]
Matthias Hutsebaut-Buysse, Kevin Mets, and Steven Latré. 2022. Hierarchical reinforcement learning: A survey and open research challenges. Mach. Learn. Knowl. Extract. 4, 1 (2022), 172–221.
[22]
Dietmar Jannach, Massimo Quadrana, and Paolo Cremonesi. 2022. Session-based recommender systems. In Recommender Systems Handbook. Springer, 301–334.
[23]
Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In ICDM. 197–206.
[24]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.
[25]
Vijay R. Konda and John N. Tsitsiklis. 2000. Actor-critic algorithms. In NeurIPS. 1008–1014.
[26]
Sara Latifi, Noemi Mauro, and Dietmar Jannach. 2021. Session-aware recommendation: A surprising quest for the state-of-the-art. Inf. Sci. 573 (2021), 291–315.
[27]
Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020. Estimation-action-reflection: Towards deep interaction between conversational and recommender systems. In WSDM. 304–312.
[28]
Lizi Liao, Le Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. 2021. MMConv: An environment for multimodal conversational search across multiple domains. In SIGIR. 675–684.
[29]
Yuanguo Lin, Fan Lin, Wenhua Zeng, Jianbing Xiahou, Li Li, Pengcheng Wu, Yong Liu, and Chunyan Miao. 2022. Hierarchical reinforcement learning with dynamic recurrent mechanism for course recommendation. Knowl.-based Syst. 244 (2022), 108546.
[30]
Yuanguo Lin, Yong Liu, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, and Chunyan Miao. 2021. A survey on reinforcement learning for recommender systems. arXiv:2109.10665 (2021).
[31]
Qidong Liu, Jiaxi Hu, Yutian Xiao, Jingtong Gao, and Xiangyu Zhao. 2023. Multimodal recommender systems: A survey. arXiv:2302.03883 (2023).
[32]
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. arXiv:1803.04376 (2018).
[33]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv:1312.5602 (2013).
[34]
Ronald Parr and Stuart Russell. 1997. Reinforcement learning with hierarchies of machines. In NeurIPS.
[35]
Shubham Pateria, Budhitama Subagdja, Ah-hwee Tan, and Chai Quek. 2021. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv. 54, 5 (2021), 1–35.
[36]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.
[37]
Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In RecSys. 130–137.
[38]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In ICML. 8748–8763.
[39]
Steffen Rendle. 2010. Factorization machines. In ICDM. IEEE, 995–1000.
[40]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv:1205.2618 (2012).
[41]
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In CIKM. 1441–1450.
[42]
Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In SIGIR. 235–244.
[43]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[44]
Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112, 1-2 (1999), 181–211.
[45]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, Vol. 30.
[46]
Jianling Wang, Kaize Ding, and James Caverlee. 2021. Sequential recommendation for cold-start users with meta transitional learning. In SIGIR. 1783–1787.
[47]
Shoujin Wang, Longbing Cao, Yan Wang, Quan Z. Sheng, Mehmet A. Orgun, and Defu Lian. 2021. A survey on session-based recommender systems. ACM Comput. Surv. 54, 7 (2021), 1–38.
[48]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 3 (1992), 229–256.
[49]
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In CVPR. 11307–11317.
[50]
Yuxia Wu, Lizi Liao, Gangyi Zhang, Wenqiang Lei, Guoshuai Zhao, Xueming Qian, and Tat-Seng Chua. 2022. State graph reasoning for multimodal conversational recommendation. IEEE Trans. Multim. (2022).
[51]
Yaxiong Wu, Craig Macdonald, and Iadh Ounis. 2021. Partially observable reinforcement learning for dialog-based interactive recommendation. In RecSys. 241–251.
[52]
Yaxiong Wu, Craig Macdonald, and Iadh Ounis. 2022. Multi-modal dialog state tracking for interactive fashion recommendation. In RecSys. 124–133.
[53]
Yaxiong Wu, Craig Macdonald, and Iadh Ounis. 2022. Multimodal conversational fashion recommendation with positive and negative natural-language feedback. In CUI. 1–10.
[54]
Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, and Leyu Lin. 2021. Hierarchical reinforcement learning for integrated recommendation. In AAAI, Vol. 35. 4521–4528.
[55]
Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M. Jose. 2020. Self-supervised reinforcement learning for recommender systems. In SIGIR. 931–940.
[56]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. 2048–2057.
[57]
Kerui Xu, Jingxuan Yang, Jun Xu, Sheng Gao, Jun Guo, and Ji-Rong Wen. 2021. Adapting user preference to online feedback in multi-round conversational recommendation. In WSDM. 364–372.
[58]
Tong Yu, Yilin Shen, and Hongxia Jin. 2019. A visual dialog augmented interactive recommender system. In KDD. 157–165.
[59]
Tong Yu, Yilin Shen, and Hongxia Jin. 2020. Towards hands-free visual dialog interactive recommendation. In AAAI, Vol. 34. 1137–1144.
[60]
Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-language recommendation via attribute augmented multimodal reinforcement learning. In MM. 39–47.
[61]
Yifei Yuan and Wai Lam. 2021. Conversational fashion image retrieval via multiturn natural language feedback. In SIGIR. 839–848.
[62]
Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, and Changyou Chen. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. In NeurIPS. 15214–15224.
[63]
Dongyang Zhao, Liang Zhang, Bo Zhang, Lizhou Zheng, Yongjun Bao, and Weipeng Yan. 2020. Mahrl: Multi-goals abstraction based deep hierarchical reinforcement learning for recommendations. In SIGIR. 871–880.
[64]
Yujia Zheng, Siyi Liu, Zekun Li, and Shu Wu. 2021. Cold-start sequential recommendation via meta learner. In AAAI. 4706–4713.
[65]
Hongyu Zhou, Xin Zhou, Zhiwei Zeng, Lingzi Zhang, and Zhiqi Shen. 2023. A comprehensive survey on multimodal recommender systems: Taxonomy, evaluation, and future directions. arXiv:2302.04473 (2023).
