1 Introduction
Recent advances in
multi-modal interactive recommender systems (MMIRSs) enable the users to explore their desired items (such as images of fashion products) through multi-turn interactions by expressing their current needs with real-time feedback (often natural-language critiques) according to the quality of the recommendations [
16,
28,
49,
51,
52,
53,
58,
59,
61,
62]. In this
multi-modal interactive recommendation (MMIR) scenario, the users’ preferences can be represented by both the users’ past interests from their historical interactions and their current needs from their recent interactions. Figure
1 shows an example of the personalised multi-modal interactive recommendation with visual recommendations and the corresponding natural-language critiques. In particular, Figure
1(a) demonstrates the users’ past interests with the shopping history recorded by the recommender system and their current needs with the next item that they wish to purchase (the next target item). Next, Figure
1(b) illustrates the real-time interactions between a recommender system and a user. The recommender system initiates the conversation by presenting a list of personalised initial recommendations to the user. Subsequently, during each interaction turn, the user provides natural-language critiques regarding the visual recommendation list to obtain items with more preferred features. An effective MMIRS can substantially improve the users' experience and save them much effort in finding their target items.
Despite the recent advances in incorporating the users' current needs (i.e., the target items) from the informative multi-modal information across the multi-turn interactions, we argue that it is typically challenging to make satisfactory personalised recommendations due to the difficulty in balancing the users' past interests and current needs when generating the users' state (i.e., current preferences) representations over time. Indeed, the existing MMIRSs [
16,
49,
51,
52] typically simplify the multi-modal interactive recommendation task by initiating conversations using randomly sampled recommendations irrespective of the users’ interaction histories (i.e., the past interests), thereby only focusing on seeking the target item (i.e., the current needs) across real-time interactions. Although providing next-item recommendations from sequential user-item interaction history is one of the most common use cases in the recommender system domain, the existing sequential and session-aware recommendation models [
19,
20,
23,
41] currently only consider the explicit/implicit past user-item interactions (such as purchases and clicks) in the sequence modelling. In addition, these sequential/session-aware recommendation models have shown difficulties in learning sequential patterns for
cold-start users (who have very limited historical interactions) compared to
warm-start users (who have longer interaction sequences) [
46,
64]. An obvious and simple solution for the personalised MMIR task is to construct a pipeline, where a sequential/session-aware recommendation model (such as GRU4Rec [
20]) generates the initial personalised recommendations and a multi-modal interactive recommendation model (such as EGE [
51]) updates the subsequent recommendations across the multi-turn interactions. However, such pipeline-based recommender systems cannot effectively benefit from proper cooperation between the sequential/session-aware recommendation models and the multi-modal interactive recommendation models when there is a shift between the users' past interests and their current needs (in particular, for cold-start users), thereby possibly failing to provide satisfactory personalised recommendations over time.
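For illustration, such a pipeline could be sketched as follows (a minimal sketch in Python; all interfaces are hypothetical stand-ins for a sequential model like GRU4Rec and an interactive model like EGE, which are trained independently):

```python
# A minimal sketch of the pipeline baseline described above. The interfaces
# (recommend/update/critique/is_satisfied) are hypothetical stand-ins; the
# two models are trained independently and share no state.

def pipeline_recommend(user_history, user_simulator, seq_model,
                       interactive_model, max_turns=10, top_k=3):
    # Turn 0: the sequential model alone produces the personalised
    # initial recommendations from the user's past interactions.
    recommendations = seq_model.recommend(user_history, k=top_k)
    for turn in range(max_turns):
        if user_simulator.is_satisfied(recommendations):
            return recommendations, turn  # target item found
        critique = user_simulator.critique(recommendations)
        # Subsequent turns: the interactive model updates the list from the
        # natural-language critique only, so a shift between the past
        # interests and the current needs is hard to recover from.
        recommendations = interactive_model.update(recommendations, critique,
                                                   k=top_k)
    return recommendations, max_turns
```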
Deep reinforcement learning (DRL) allows a recommender system (i.e., an agent) to actively interact with a user (i.e., the environment) while learning from the user's real-time feedback to infer the user's dynamic preferences. A variety of DRL algorithms have been successfully applied in various recommender system domains, such as e-commerce [
55], video [
7], and music recommendations [
27]. In particular, recent research on MMIR has formulated the MMIR task with various DRL algorithms as MDPs [
16], POMDPs [
51], CMDPs [
62], or multi-armed bandits [
59]. However, all of these formulations only consider a specific recommendation scenario where the users are all cold-start users, i.e., without using any interaction history. Indeed, the existing DRL-based recommender systems are unable to deal with the personalised multi-modal interactive recommendation task in an end-to-end fashion, given the computational complexity of learning the users' past interests from the interaction history and estimating the users' current needs from the real-time interactions.
Hierarchical reinforcement learning (HRL) [
21,
35] can decompose a complex task into a hierarchy of subtasks as
semi-Markov decision processes (SMDPs), which reduces the computational complexity. Such an HRL formulation with a hierarchy of subtasks is particularly suitable for the multi-modal interactive recommendation task, which requires addressing different subtasks over time by either estimating the users' past interests or tracking the users' current needs. For instance, the “Options” framework of HRL provides a generic way to decompose a task, where options represent closed-loop sub-behaviours that are carried out for multiple timesteps until a termination condition is triggered [
21]. However, to the best of our knowledge, no prior work has investigated HRL in the multi-modal interactive recommendation task.
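For intuition, the following is a minimal sketch of an option and its closed-loop execution as a single SMDP transition (the environment interface is an illustrative assumption):

```python
from dataclasses import dataclass
from typing import Callable

# A minimal rendering of the Options framework: an option is a triple
# (I, pi, beta) of an initiation set, an intra-option policy, and a
# termination condition, executed as a closed loop until beta fires.

@dataclass
class Option:
    can_initiate: Callable      # I: states in which the option may start
    policy: Callable            # pi: maps a state to a primitive action
    should_terminate: Callable  # beta: state -> True when the option ends

def run_option(option, state, env):
    """Execute one option for multiple timesteps (one SMDP transition)."""
    assert option.can_initiate(state)
    total_reward, steps = 0.0, 0
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)  # hypothetical env interface
        total_reward, steps = total_reward + reward, steps + 1
        if done or option.should_terminate(state):
            return state, total_reward, steps
```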
In this article, we present our formulation of the personalised MMIR task as an SMDP by simulating both the past and real-time interactions between a user (i.e., an environment) and a recommender system (i.e., an agent). To this end, we propose a novel
personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning to more effectively incorporate the users’ preferences from both their past and real-time interactions. In particular, the proposed PMMIR model uses the Options framework of HRL to decompose the personalised interactive recommendation process into a sequence of two subtasks with hierarchical state representations: a first subtask where a
history encoder learns the users’ past interests with the
hidden states of history for providing personalised initial recommendations and a second subtask where a
state tracker estimates the current needs with the
real-time estimated states for updating the subsequent recommendations. The history encoder and the state tracker are jointly optimised using a typical policy gradient approach (i.e., REINFORCE [
6]) with a single optimisation objective by maximising the users’ future satisfaction with the recommendations (i.e., the cumulative future rewards). Following previous work [
16,
49,
51], our PMMIR model is trained and evaluated by adopting a user simulator, which is capable of producing natural-language critiques regarding the recommendations. This surrogate simulates the behaviour of real human users [
16]. By conducting experiments on two fashion datasets derived from two well-known public datasets, we observe that our proposed PMMIR model outperforms existing state-of-the-art baseline models, leading to significant improvements. In short, we summarise the main contributions of this article as follows:
—
We propose a novel personalised multi-modal interactive recommendation model (PMMIR) that effectively integrates the users’ preferences obtained from both past and real-time interactions by leveraging HRL with the Options framework.
—
Our proposed PMMIR model decomposes the MMIR task into two subtasks: an initial personalised recommendation with the users’ past interests and several subsequent recommendations with the users’ current needs.
—
We derive two fashion datasets (i.e., Amazon-Shoes and Amazon-Dresses) that provide the users' interaction histories from two well-known public datasets, since there is no existing dataset suitable for the personalisation setting of the multi-modal interactive recommendation task.
—
Through extensive empirical evaluations conducted on the personalised MMIR task, our proposed PMMIR model demonstrates significant improvements over existing state-of-the-art approaches. We also show that both cold-start and warm-start users can benefit from our proposed PMMIR model in terms of recommendation effectiveness.
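To make the two-subtask decomposition introduced above concrete, the following is a simplified PyTorch sketch of the hierarchy (the module names, dimensions, and loss layout are illustrative assumptions, not the exact PMMIR architecture):

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Subtask 1: encode the purchase history into hidden states of history."""
    def __init__(self, dim=512):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, history_item_embs):    # (B, n, dim)
        _, h_n = self.gru(history_item_embs)
        return h_n                            # h^p_n: (1, B, dim)

class StateTracker(nn.Module):
    """Subtask 2: track the current needs from the multi-modal feedback."""
    def __init__(self, dim=512):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, feedback_embs, h_0):    # critiqued items + critiques
        _, h_t = self.gru(feedback_embs, h_0)
        return h_t                             # h^c_t: (1, B, dim)

# The state tracker is initialised with the history encoder's final hidden
# state (h^c_0 = h^p_n), so the two subtasks share one differentiable path
# and can be optimised jointly with a single REINFORCE objective that
# maximises the cumulative future rewards.
def reinforce_loss(log_probs, rewards, gamma=0.2):
    returns, g = [], 0.0
    for r in reversed(rewards):                # discounted returns G_t
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    return -(torch.stack(log_probs) * returns).sum()
```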
The paper is structured as follows: Section
2 provides a comprehensive review of the related work and highlights the contributions of our research in relation to the existing literature. In Section
3, we define the problem formulation and introduce our proposed PMMIR model. The experimental setup and results are presented in Sections
4 and
5, respectively. Section
6 summarises our findings.
2 Related Work
In this section, we first introduce the concept of multi-modal interactive recommendation (MMIR). Then, we discuss personalisation in interactive recommendation. Finally, we describe hierarchical reinforcement learning.
Multi-modal Interactive Recommendation. Interactive recommender systems have been shown to be more effective in incorporating the users’ dynamic preferences over time from their explicit and implicit real-time feedback (such as natural-language critiques and clicks) compared to static/traditional recommender systems that predict the users’ preferences by mining the users’ past behaviours offline (such as ratings, clicks, and purchases) [
14]. In addition, multi-modal recommender systems can handle information with various modalities either from items (such as images and textual descriptions) or users (such as natural-language feedback) to effectively alleviate the problems of data sparsity and cold start [
31,
65]. Therefore, MMIRSs can effectively track/estimate the users' dynamic preferences from the rich information with different modalities across real-time interactions. As an example, Guo et al. [
16] were among the first to tackle the MMIR task by introducing a
Dialog Manager (DM) model that combined supervised pre-training and
model-based policy improvement (MBPI). This approach aimed to effectively capture the users’ preferences across multiple interaction turns by considering both visual recommendations and the corresponding natural-language critiques. Since then, research has focussed upon improving the recommendation performance by either formulating the MMIR task using various reinforcement learning approaches (such as CMDPs [
62], multi-armed bandits [
58], and POMDPs [
51]) or adopting more advanced state tracking components (such as Transformer [
49] and RNN-enhanced Transformer [
52]). Unlike the uni-modal (text-based) conversational recommendation task [
27,
42], which usually leverages attribute-based clarification questions to elicit the users' preferences, the multi-modal interactive recommendation task addressed in this article adopts a critiquing-based formulation, incorporating the users' preferences from their natural-language feedback.
Personalisation in Interactive Recommendation. The existing MMIR models described above focus only on incorporating the users' current needs across the multi-turn real-time interactions but omit the users' past behaviours, initially presenting users with randomly selected items at the start of the interaction process. Meanwhile, a variety of interactive recommendation models have leveraged the users' past behaviours for personalised recommendations during the multi-turn interaction processes. For instance, the
Estimation-Action-Reflection (EAR) model by Lei et al. [
27] (a typical question-based interactive recommendation model [
14]) leveraged the
factorisation machine (FM) [
39] to estimate the users' preferences from their past behaviours for predicting further preferred items and attributes. The users' online feedback is incorporated either by feeding the accepted attributes back to the FM to generate new predictions of items and attributes or by using the rejected items as negative signals for retraining the FM. However, such an FM-based method for the question-based interactive recommendation task is infeasible for our multi-modal interactive recommendation task, which leverages natural-language critiquing sentences freely expressed by the users rather than the brief terms of well-categorised attributes. In addition, a simple solution for the personalised multi-modal interactive recommendation task is to combine the sequential recommendation models (such as GRU4Rec [
20]) with the multi-modal interactive recommendation models (such as EGE [
51]) in a pipeline. For instance, GRU4Rec can be leveraged for generating the initial personalised recommendations, while EGE can be utilised for updating the subsequent recommendations across the multi-turn real-time interactions. However, we argue that such pipeline-based recommender systems are fragile in providing satisfactory personalised recommendations over time when there is a shift between the users' past interests and current needs, since their components are optimised independently.
Furthermore, session-aware recommendation models [
22,
26,
37,
47] decouple the users’ long-term and short-term preferences for making better-personalised recommendations by exploiting the relationship between sessions for each user. For instance, Quadrana et al. [
37] proposed a
Hierarchical Recurrent Neural Network model (HRNN) for the personalised session-based recommendations. The HRNN model is structured with a hierarchy of two-level
Gated Recurrent Units (GRUs): a session-level GRU that makes recommendations by tracking the user interactions within sessions and a user-level GRU that tracks the evolution of the users' preferences across sessions. When a new session starts, the hidden state of the user-level GRU is used to initialise the session-level GRU, thereby providing personalisation capabilities to the session-level GRU. Such a two-level GRU hierarchy can also be leveraged in the multi-modal interactive recommendation task to make personalised recommendations over time. Therefore, inspired by this hierarchical structure, we propose an effective end-to-end multi-modal interactive recommendation model with a dual GRUs/Transformers structure that can make personalised recommendations over time by incorporating both the users' past behaviours and the informative multi-modal information from real-time interactions. However, the HRNN model with two-level GRUs adopts a supervised learning approach for jointly optimising the user-level and session-level GRUs, which is less effective than DRL approaches at maximising the future rewards [
1,
8,
30].
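A minimal sketch of this two-level initialisation scheme (dimensions and the input layout are illustrative assumptions) could look as follows:

```python
import torch.nn as nn

# Sketch of HRNN's two-level hierarchy: the user-level GRU evolves across
# sessions, and its hidden state initialises the session-level GRU whenever
# a new session starts, which is what provides the personalisation.

class HRNNSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.user_gru = nn.GRUCell(dim, dim)     # across-session dynamics
        self.session_gru = nn.GRUCell(dim, dim)  # within-session dynamics

    def forward(self, sessions, user_state):     # user_state: (1, dim)
        for session in sessions:                 # each session: (len, dim)
            h = user_state                       # personalised initialisation
            for item_emb in session:
                h = self.session_gru(item_emb.unsqueeze(0), h)
            user_state = self.user_gru(h, user_state)  # summarise the session
        return user_state
```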
Hierarchical Reinforcement Learning.
Deep reinforcement learning (DRL) has been widely adopted in the recommendation field with various DRL algorithms, such as
Deep Q-learning Network (DQN) [
33], REINFORCE [
48], and Actor-Critic [
25], for coping with the users' dynamic preferences over time and maximising their long-term engagement [
1,
8,
30]. In particular, the MMIR task has been formulated with various DRL algorithms as MDPs [
16], POMDPs [
51], CMDPs [
62], or multi-armed bandits [
59] to simulate the multi-turn interactions between the recommender systems and the users. However, the existing MMIR models (e.g., MBPI [
16], EGE [
51], and RCR [
62]) with DRL can only maximise the cumulative rewards when dealing with real-time requests within the conversational session, while simplifying the MMIR task by omitting the users’ past interests. Indeed, making personalised recommendations across multi-turn interactions considering the users’ past interests and current needs is a complex task. Hierarchical reinforcement learning provides a solution for decomposing a complex task into a hierarchy of easily addressed subtasks as SMDPs with various frameworks, such as Options [
44],
Hierarchies of Abstract Machines (HAMs) [
34], and MAXQ value function decomposition [
12]. The existing recommender systems with HRL [
15,
29,
54,
63] typically formulate the recommendation task with a two-level hierarchy, where a high-level agent (the so-called meta-controller) determines the subtasks and a low-level agent (the so-called controller) addresses the subtasks. For instance, CEI [
15] formulates the conversational recommendation task with the Options framework, using a meta-controller to select the type of subtask (chitchat or recommendation) and a controller to provide subtask-specific actions (i.e., a response for chitchat or candidate items for recommendation). In addition, recent research on question-based conversational recommendations (such as EAR [
27] and FPAN [
57]) follows a two-level architecture with a policy network as a meta-controller to decide either to ask for more information or to recommend items and a
Factorisation Machine (FM) [
39] as a controller to generate a set of recommendations [
14]. Different from the standard HRL models, these question-based conversational recommendation models [
14,
27,
57] only optimise the meta-controller with RL algorithms (such as REINFORCE [
48]) to manage the conversational system, while the controller is separately optimised with supervised learning approaches (such as BPR [
40]). However, to the best of our knowledge, no prior work has investigated HRL in the multi-modal interactive recommendation task. In this article, we leverage HRL with the Options framework by proposing a personalised multi-modal interactive recommendation model (PMMIR) to effectively incorporate the users’ past interests and their evolving current needs over time. In particular, the high-level agent for determining the subtasks is fully driven by the users’ natural-language feedback (we will describe this in Section
3). Therefore, we mainly focus on modelling the cooperation of the low-level agents for estimating the users’ past interests and tracking the users’ current needs over time in our proposed PMMIR model.
5 Experimental Results
In this section, we present an analysis of the experimental results in relation to the three research questions outlined in Section
4 to demonstrate the effectiveness of our proposed PMMIR model. Specifically, we address the overall effectiveness of the PMMIR model variants (PMMIR
\(_{GRU}\) and PMMIR
\(_{Transformer}\)) for multi-modal interactive recommendations (RQ1, discussed in Section
5.1), its performance on both cold-start and warm-start users (RQ2, detailed in Section
5.2), and the impact of various components and hyperparameters (RQ3, covered in Section
5.3). To further consolidate our findings, we provide a use case based on the logged experimental results in Section
5.4.
5.1 PMMIR vs. Baselines (RQ1)
To address RQ1, we investigate the performance of our proposed PMMIR model variants (PMMIR
\(_{GRU}\) and PMMIR
\(_{Transformer}\)) and the baseline models. Figure
5 depicts the recommendation effectiveness of our proposed PMMIR model variants, along with the corresponding baseline models, for top-3 recommendations in terms of
Success Rate (SR) on the
Amazon-Shoes and
Amazon-Dresses datasets. Specifically, Figures
5(a) and (c) represent the results using PMMIR
\(_{GRU}\), while Figures
5(b) and (d) correspond to PMMIR
\(_{Transformer}\). The x-axis indicates the number of interaction turns. Comparing the results presented in Figure
5, we can observe that our proposed PMMIR model variants consistently outperform the baseline models in terms of SR across different interaction turns (in particular, from the 4th to the 10th turn). This indicates the superior overall performance of our PMMIR models. As the number of interaction turns increases, the differences in effectiveness between our PMMIR models and the baseline models become more pronounced, as observed from the increasing gaps in SR. This suggests that our PMMIR models demonstrate a stronger performance advantage over the baseline models as the interaction process unfolds. We also observe the same trends for NDCG@3; we omit the corresponding figure to reduce redundancy. The better overall performance of PMMIR suggests that our PMMIR model can better incorporate the users' preferences from both the interaction history and the real-time interactions compared to the baseline models.
To quantify the improvements achieved by our proposed PMMIR model in comparison to the baseline models, we measure their performances in terms of SR and NDCG@3 at the 5th and 10th interaction turns. This enables us to assess the progress and effectiveness of our PMMIR model at different stages of the interaction process. Table
2 presents the obtained recommendation performances of the PMMIR model variants (PMMIR
\(_{GRU}\) and PMMIR
\(_{Transformer}\)) and their corresponding baseline models. These baseline models include the GRU-based models (GRU
\(_{hist}\), GRU
\(_{img+txt}\), GRU
\(_{all}\)-SL, EGE, GRU-EGE, GRU
\(_{all}\)-RL) as well as the Transformer-based models (Transformer
\(_{hist}\), Transformer
\(_{img+txt}\), Transformer
\(_{all}\)-SL, MMT, Transformer-MMT, Transformer
\(_{all}\)-RL). The performances are evaluated using the same test datasets from the
Amazon-Shoes and
Amazon-Dresses datasets at the 5th and 10th interaction turns. The table provides a comprehensive overview of the recommendation performances, allowing for a direct comparison between the PMMIR model and the various baseline models. In Table
2, the best overall performing results across the four groups of columns are highlighted in bold. * indicates a significant difference, determined by a paired t-test with a Holm-Bonferroni multiple comparison correction (
\(p\lt 0.05\)), when compared to the PMMIR model within each group. Comparing the results in the table, we observe that our proposed PMMIR
\(_{GRU}\) model consistently achieves significantly better performances, with improvements on both metrics ranging from 4%–8% and 2%–4% on the Amazon-Shoes and Amazon-Dresses datasets, respectively, compared to the best GRU-based baseline model. Similarly, the PMMIR
\(_{Transformer}\) model also demonstrates similar improvements, with performance gains ranging from 4%–7% and 4%–6% compared to the best Transformer-based baseline model. These findings highlight the effectiveness of our proposed PMMIR models in outperforming the baseline models across both datasets. Furthermore, it is worth noting that the PMMIR
\(_{Transformer}\) model, which is based on Transformers, generally outperforms the PMMIR
\(_{GRU}\) model, which is based on GRUs, in terms of both metrics on both the
Amazon-Shoes and
Amazon-Dresses datasets. This observation highlights the superiority of the Transformer-based approach in achieving improved recommendation performances.
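For reference, the testing protocol behind the * markers in Table 2 can be sketched as follows (assuming SciPy's paired t-test; the per-user score arrays are hypothetical):

```python
from scipy import stats

# Paired t-tests of each baseline against PMMIR on per-user metric values
# (e.g., NDCG@3 at the 10th turn), with a Holm-Bonferroni correction.
def holm_bonferroni(pmmir_scores, baseline_scores, alpha=0.05):
    pvals = {name: stats.ttest_rel(pmmir_scores, scores).pvalue
             for name, scores in baseline_scores.items()}
    significant, m = {}, len(pvals)
    # Holm's step-down procedure: compare the i-th smallest p-value
    # (0-indexed) against alpha / (m - i); stop at the first failure.
    for i, (name, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
        if p >= alpha / (m - i):
            break
        significant[name] = p
    return significant  # baselines significantly different from PMMIR
```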
In response to RQ1, the results obtained clearly demonstrate that our proposed PMMIR model variants exhibit a significant performance advantage over the state-of-the-art baseline models. Therefore, our proposed PMMIR model with hierarchical state representations in PO-SMDP can effectively incorporate the users’ preferences from both the interaction history and the real-time interactions.
5.2 Cold-start vs. Warm-start Users (RQ2)
To address RQ2, we investigate the performance of our proposed PMMIR model on cold-start and warm-start users. We classify users with the minimum number of interactions (3 interactions on Amazon-Shoes and 4 interactions on Amazon-Dresses) as cold-start users, while those with longer interaction sequences are categorised as warm-start users. This investigation aims to understand how effectively our model adapts to different user scenarios and to assess its performance in each case. Table
3 presents the performances of our PMMIR model variants, as well as the RL-based and pipeline-based baseline models, in terms of NDCG@3 and SR. The table is divided into two parts: the top part focuses on the GRU-based models, while the bottom part pertains to the Transformer-based models. This division facilitates a comprehensive comparison of the performances across the different model types. Comparing the results in Table
3, we observe that our proposed PMMIR
\(_{GRU}\) and PMMIR
\(_{Transformer}\) models can achieve better performances than the corresponding baseline models in terms of both metrics on both cold-start and warm-start users on the two used datasets, except for the cold-start users with PMMIR
\(_{GRU}\) in terms of NDCG@3 on Amazon-Dresses. The reported results in Table
3 show that both the cold-start and warm-start users can generally benefit from our proposed PMMIR model variants with hierarchical state representations. In addition, we observe that the warm-start users can generally benefit more from the GRU-based variant compared to the cold-start users. In particular, PMMIR
\(_{GRU}\) achieves improvements of 11%–12% (warm-start) vs. 4%–5% (cold-start) on Amazon-Shoes and 4%–5% (warm-start) vs. 0%–1% (cold-start) on Amazon-Dresses in terms of both metrics. Conversely, we observe that cold-start users can generally benefit more from the Transformer-based variant compared to warm-start users. In particular, PMMIR
\(_{Transformer}\) achieves improvements of 8%–9% (cold-start) vs. 2%–3% (warm-start) on Amazon-Shoes and 3.8%–4.0% (cold-start) vs. 3.4%–3.5% (warm-start) on Amazon-Dresses in terms of both metrics. We postulate that this difference in performance on cold-start and warm-start users between PMMIR
\(_{GRU}\) and PMMIR
\(_{Transformer}\) can be attributed to the characteristics of the interaction history sequences and the different sequence modelling abilities of GRUs and Transformers. The long purchase sequences of warm-start users can span a greater time period and can be noisy due to the users' preferences drifting over time, while the short purchase sequences of cold-start users span a relatively shorter period but can be less informative about the users' preferences. Meanwhile, GRUs (adopted by PMMIR\(_{GRU}\)) can effectively denoise the sequences through the forgetting behaviour of their update and reset gates, while the Transformer encoders (adopted by PMMIR\(_{Transformer}\)) have stronger sequence modelling abilities due to their more complex neural structures but have been shown to be insufficient at addressing noisy items within sequences [
5].
In response to RQ2, we find that both cold-start and warm-start users can benefit from our proposed PMMIR model. The warm-start users can generally benefit more with PMMIR\(_{GRU}\), while the cold-start users can generally benefit more with PMMIR\(_{Transformer}\).
5.3 Impact of Components & Hyper-parameters (RQ3)
To address RQ3, we investigate the impact of the components and the hyper-parameters of our proposed PMMIR model.
Impact of Components.
Table
4 reports the performances of our PMMIR model with different applied ablations in terms of NDCG@3 and SR. The original setting is shown in the top part of the table. The PMMIR ablation variants (i.e., PMMIR w/o
\(h^{c}_{0}=h^{p}_{n}\), PMMIR w/
\(Linear^{img/txt}\), and PMMIR w/ “RN101”) are shown in the second part of the table. All the examined PMMIR ablation variants perform generally worse than the corresponding original PMMIR model. The results of PMMIR w/o
\(h^{c}_{0}=h^{p}_{n}\) suggest that our PMMIR model can benefit from the initialisation of the state tracker with the final hidden state of the history encoder. The results of PMMIR w/
\(Linear^{img/txt}\) and PMMIR w/ “RN101” indicate that the CLIP model with the “ViT-B/32” checkpoint can provide better visual and textual representations than the “RN101” checkpoint, and that further fine-tuning the CLIP embeddings is not necessary for our personalised MMIR task.
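As an illustration, frozen CLIP embeddings of this kind can be extracted with the openai CLIP library as follows (the image file and critique text are hypothetical; “ViT-B/32” is the checkpoint retained in the original setting):

```python
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # vs. "RN101"

with torch.no_grad():            # frozen: no fine-tuning, per the ablation
    image = preprocess(Image.open("shoe.jpg")).unsqueeze(0).to(device)
    image_emb = model.encode_image(image)                    # (1, 512)
    text = clip.tokenize(["tan with a higher heel"]).to(device)
    text_emb = model.encode_text(text)                       # (1, 512)
```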
Impact of Hyper-parameters.
Figure
6 depicts the effects of the reward discount factor (
\(\gamma \in [0, 1]\)) when training the PMMIR model on both datasets and the number of exposed top-K items (
\(K \in [2, 5]\)) in each ranking list in terms of SR at the 10th turn, respectively. In our analysis, we primarily compare the performances of our PMMIR model with different values of the discount factor (i.e.,
\(\gamma \in [0, 1]\)) at the 10th interaction turn. Specifically, when the discount factor
\(\gamma\) is set to 0, it indicates that the model exclusively considers immediate rewards and does not take future rewards into account. Conversely, when
\(\gamma\) is set to 1, the model assigns equal importance to all future rewards. From Figure
6(a), we observe that there is a decreasing trend in the performance of PMMIR
\(_{GRU}\) on both datasets and a decrease in the effectiveness of PMMIR
\(_{Transformer}\) on Amazon-Shoes when the discount factor
\(\gamma\) increases from 0.2 to 1.0. We observe the same trend for PMMIR
\(_{Transformer}\) on Amazon-Dresses with
\(\gamma \in [0.6,1.0]\). This trend shows that both the history encoder and the state tracker in PMMIR are more influenced by the immediate rewards than by future rewards.
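As a toy illustration of this effect, the following computes the discounted return \(G_t=\sum _{k\ge 0}\gamma ^{k}r_{t+k+1}\) for made-up per-turn rewards:

```python
# Made-up per-turn rewards; the target item is only found at the last turn.
rewards = [0.1, 0.2, 0.0, 1.0]

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

for gamma in (0.0, 0.2, 1.0):
    # gamma = 0: only the immediate reward counts; gamma = 1: all future
    # rewards count equally, diluting the immediate signal.
    print(gamma, round(discounted_return(rewards, gamma), 3))
```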
Additionally, Figure 6(b) highlights that the PMMIR model exhibits better performance when more items are exposed to users at each interaction turn.
Overall, in response to RQ3, we find that the PMMIR model can generally benefit more in terms of effectiveness from the hierarchical state representations, adequate multi-modal CLIP encoders, using low values for the discount factor \(\gamma\), and from more exposed top-K items.
5.4 Use Case
In this section, we present use cases of the multi-modal interactive recommendation task with/without personalisation on the
Amazon-Shoes dataset in Figure
7. In particular, the figure shows a user's interaction history and the next target item, as well as the interaction process for the top-3 recommendations between the simulated user and the EGE and PMMIR
\(_{GRU}\) models that are both based on GRUs. When the target item is listed in the recommendation list, the user simulator will give a comment to end the interaction, such as “The 3rd shoes are my desired shoes” in Figure
7(c). Comparing the recommendations made by EGE and PMMIR
\(_{GRU}\) on the
Amazon-Shoes dataset, we can observe that our proposed PMMIR
\(_{GRU}\) model is able to find the target items with fewer interaction turns compared to EGE—this is expected, due to the increased effectiveness of PMMIR
\(_{GRU}\) shown in Section
5.1. In addition, our PMMIR
\(_{GRU}\) model is more effective at incorporating the users’ preferences from both the users’ interaction history and the real-time interactions. For instance, our PMMIR
\(_{GRU}\) model suggests personalised recommendations with different “high-heeled sandals” at the initial interaction turn, then easily finds the target items with a critique “tan with a higher heel” at the next turn. Meanwhile, the EGE model can only randomly sample items as the initial recommendations, but the “high heel” feature is missing in the initial recommendation, which leads to the EGE model’s failure in finding the target item at the next turn. We observed similar trends and results in other use cases involving other baseline models compared to the PMMIR variants on the
Amazon-Shoes and
Amazon-Dresses datasets. We omit their reporting in this article to reduce redundancy.
6 Conclusions
In this article, we proposed a novel
personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning with the Options framework to more effectively incorporate the users’ preferences from both their past and real-time interactions. Specifically, PMMIR decomposes the personalised interactive recommendation process into a sequence of two subtasks with hierarchical state representations: a first subtask where a
history encoder learns the users’ past interests with the
hidden states of history for providing personalised initial recommendations and a second subtask where a
state tracker estimates the current needs with the
real-time estimated states for updating the subsequent recommendations. The history encoder and the state tracker are jointly optimised with a single optimisation objective by maximising the users’ future satisfaction. Following previous work [
16,
49,
51], we trained and evaluated our PMMIR model using a user simulator that can generate natural-language critiques about the recommendations as a surrogate for real human users. Our experiments on the
Amazon-Shoes and
Amazon-Dresses datasets demonstrate that our proposed PMMIR model variants achieve significantly better performances compared to the best baseline models—for instance, improvements of 4%–8% and 2%–4% with PMMIR
\(_{GRU}\) and 4%–7% and 4%–6% with PMMIR
\(_{Transformer}\) on the Amazon-Shoes and Amazon-Dresses datasets, respectively, at the 5th and 10th turns. The reported results show that our proposed PMMIR model benefits from the dual GRUs/Transformers structure and from the initialisation of the state tracker with the final hidden state of the history encoder. In addition, the results show that both cold-start and warm-start users can benefit from our proposed PMMIR model.