1 Introduction
Evaluation is a major concern when developing information retrieval (IR) systems, and it can be conducted based on measures of result relevance or of user experience, such as user satisfaction, which focuses on the user's perspective. While relevance metrics such as nDCG or average precision [34] are commonly used, re-usable, and allow for system comparison, they often correlate poorly with the user's actual interaction experience [2, 63]. As a result, in recent years there has been growing interest in user-oriented evaluation approaches that rely on various user interaction signals, in contrast to system-oriented evaluation methodologies, i.e., the Cranfield paradigm [13, 14].
In traditional recommender systems (RSs), user-oriented evaluation strategies often rely on implicit user feedback, such as clicks and mouse scroll events, to assess whether a user finds a recommended item appealing. However, such interaction signals are not available for conversational recommender systems (CRSs), whose main interaction with users is in natural language, either by text or speech [26]. In CRSs, users interact with the system through natural-language utterances such as "I like the movie, I will watch it," expressing their preferences in more detail [54]. This distinction in user interaction poses unique challenges in evaluating CRSs, both in terms of design and deployment, to ensure that these systems effectively cater to the user's needs.
User satisfaction. CRSs are recommender systems designed to provide recommendations that address the specific needs of users. As such, they fall under the category of task-oriented dialogue systems (TDSs). Standard automatic evaluation metrics such as BLEU [52], ROUGE [45], and METEOR [16] have shown poor correlation with human judgment [46], making them unsuitable for the evaluation of TDSs. In recent years, the research community has shown significant interest in developing new automatic evaluation metrics tailored to dialogue systems. These metrics not only exhibit a stronger correlation with human judgment, but also consider various aspects of dialogues, such as relevance, interestingness, and understanding, without relying solely on word overlap [27, 32, 51, 64, 70]. While these metrics perform well during system design, their efficacy during system deployment is still a subject of ongoing investigation.
As a consequence, a significant number of TDSs rely on human evaluation to measure the system's effectiveness [29, 42]. An emerging approach for evaluating TDSs is to estimate a user's overall satisfaction with the system from explicit and implicit user interaction signals [29, 42]. While this approach is valuable and effective, it does not provide insights into the specific aspects or dimensions in which the CRS is performing well. Understanding the reasons behind a user's satisfaction or dissatisfaction is crucial for the CRS to learn from errors and optimize its performance in individual aspects, thereby avoiding complete dissatisfaction during an interaction session.
Understanding user satisfaction in a task-oriented setting. Understanding user satisfaction with CRSs is crucial, mainly for two reasons. Firstly, it allows system designers to understand different user perceptions regarding satisfaction, which in turn leads to better user personalization. Secondly, it helps prevent total dialogue failure by enabling the deployment of adaptive conversational approaches, such as failure recovery or topic switching. By conducting fine-grained evaluations of CRSs, the system can learn an individual user’s interaction preferences, leading to a more successful fulfillment of the user’s goal.
Various metrics, including engagement, relevance, and interestingness, have been investigated to understand fine-grained user satisfaction and their correlation with overall user satisfaction in different scenarios and applications [28, 59, 64]. While recent research has seen a surge in fine-grained evaluation for dialogue systems, most of these studies have focused on open-domain dialogue systems that are non-task-oriented [22, 27, 51]. Conventionally, on the other hand, TDSs such as CRSs are evaluated on the basis of task success (TS) and overall user satisfaction. In CRSs, user satisfaction is modeled as an evaluation metric that measures the ability of the system to achieve a pre-defined goal with high accuracy, that is, to make the most relevant recommendations [55]. In contrast, for non-task-based dialogue systems (i.e., chat-bots), the evaluation focus is primarily on the user experience during interaction (i.e., how engaging or interesting the system is) [43].
Evaluating user satisfaction. Recent studies have examined user satisfaction in dialogue systems, particularly in the context of CRSs. These studies typically estimate user satisfaction by collecting overall turn-level satisfaction ratings from users during system interactions or by leveraging external assessors through platforms like Amazon Mechanical Turk (MTurk). In these evaluations, users are typically asked to provide ratings for each dialogue turn by answering questions such as, "Are you/Is the user satisfied with the system response?" While overall turn-level satisfaction ratings provide a measure of user satisfaction, they may not capture the broader aspects that contribute to a user's satisfaction [60]. When humans are asked to evaluate a dialogue system, they often consider multiple aspects of the system [22]. The satisfaction label therefore aims to summarize the user's opinion into one single measure. Venkatesh et al. [64] argue that user satisfaction is subjective due to its reliance on the user's emotional and intellectual state. They also demonstrate that different dialogue systems exhibit varying performance when evaluated across different dialogue aspects, indicating the absence of a one-size-fits-all metric.
Previous studies have proposed metrics that offer a granular analysis of how various aspects influence user satisfaction in chat-bot systems [28, 64]. However, it is unclear how these aspects specifically influence user satisfaction in the context of TDSs [see, e.g., 41, 71]. With most aspect-based evaluations focusing on chat-bot systems [50, 51], only a few studies have so far investigated the influence of dialogue aspects for TDSs [37, 60]. Jin et al. [37] present a model that explores the relationship between different conversational characteristics (e.g., adaptability and understanding) and the user experience in a CRS. Their findings demonstrate how conversational constructs interact with recommendation constructs to influence the overall user experience of a CRS. However, they do not specifically examine how individual aspects impact a user's satisfaction with the CRS. In our previous work [60], we proposed several dialogue aspects that could influence a user's satisfaction with TDSs. We found that, in terms of turn-level aspects, relevance strongly influenced a user's overall satisfaction rating (Spearman's \(\rho\) of 0.5199). Additionally, we introduced a newly defined aspect, interest arousal, which exhibited a high correlation with overall user satisfaction (Spearman's \(\rho\) of 0.7903). However, we did not establish a direct relationship between turn-level aspects and turn-level user satisfaction in our previous study.
Research questions. In this study, we seek to extend the study we carried out in [60]. Our aim is to understand a user's satisfaction with CRSs by focusing on the dialogue aspects of both the response and the entire dialogue. We intend to establish the relationship between individual dialogue aspects and overall user satisfaction to understand how they relate to satisfactory (Sat) and dissatisfactory (DSat) dialogues. In addition, we aim to evaluate how effective the proposed aspects are in estimating a user's satisfaction at the turn and dialogue levels. To this end, we carry out a crowdsourcing study with workers from MTurk on recommendation dialogue data, viz. the ReDial dataset [44]. The ReDial dataset provides a high-quality resource to investigate how several dialogue aspects affect a user's satisfaction during interaction with a CRS. We ask workers to annotate 600 dialogue turns and 200 dialogues on six dialogue aspects following our previous work [60]: relevance, interestingness, understanding, task completion, interest arousal, and efficiency. The dialogue aspects are grouped into the utility and user experience (UX) dimensions of a TDS. Different from [60], we also ask workers to give their turn-level overall satisfaction rating and use it to establish a relationship between turn-level aspects and turn-level user satisfaction.
Our aim is to answer the following research questions:
(RQ1) How do the proposed dialogue aspects influence overall user satisfaction with a CRS?
(RQ2) Can we estimate user satisfaction at each turn from turn-level aspects?
(RQ3) How effective are the dialogue-level aspects in estimating user satisfaction compared to turn-level satisfaction ratings on CRSs?
Main findings. To address our research questions, we perform an in-depth analysis of the annotated turns and dialogues in order to understand how the proposed dialogue aspects influence a user's overall satisfaction. We note that, for most annotators, the ability of a CRS to make relevant recommendations has a high influence on their turn-level satisfaction rating, with a Spearman's \(\rho\) of 0.6104. In contrast, at the dialogue level, arousing a user's interest in watching a novel recommendation and completing the task are the most influential determinants of the annotators' overall satisfaction ratings, with Spearman's \(\rho\) values of 0.6219 and 0.5987, respectively.
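To make the analysis concrete, the following is a minimal sketch (not the authors' code) of this kind of aspect-level correlation analysis, assuming the annotations are available as a pandas DataFrame with one row per annotated turn and hypothetical column names for the six aspects and the overall satisfaction rating:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical column names; the actual annotation schema may differ.
ASPECTS = ["relevance", "interestingness", "understanding",
           "task_completion", "interest_arousal", "efficiency"]

def aspect_correlations(df: pd.DataFrame, target: str = "satisfaction") -> pd.Series:
    """Spearman's rho between each aspect rating and the satisfaction rating."""
    rhos = {}
    for aspect in ASPECTS:
        rho, _ = spearmanr(df[aspect], df[target])
        rhos[aspect] = rho
    return pd.Series(rhos).sort_values(ascending=False)

# Example usage with turn-level annotations (hypothetical file name):
# turn_df = pd.read_csv("turn_annotations.csv")
# print(aspect_correlations(turn_df))  # e.g., relevance near the top
```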
To evaluate the effectiveness of the proposed dialogue aspects, we experiment with several machine learning models for user satisfaction estimation and compare their performance on the annotated data. On the turn-level user satisfaction estimation task, we achieve a Spearman's \(\rho\) of 0.7337 between a random forest regressor's predictions and the ground-truth ratings. For predicting user satisfaction at the dialogue level, we achieve a correlation score of 0.7956. These results show the efficacy of the proposed dialogue aspects in estimating user satisfaction. They also demonstrate the significance of assessing the performance of a CRS at the aspect level; such assessments can help system designers identify which dialogue quality a CRS is underperforming on and optimize it.
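As an illustration of the estimation setup, the sketch below fits a random forest regressor on the six aspect ratings and scores its predictions with Spearman's \(\rho\); it reuses the hypothetical DataFrame layout from the previous sketch and is not the exact experimental pipeline used in this article:

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

ASPECTS = ["relevance", "interestingness", "understanding",
           "task_completion", "interest_arousal", "efficiency"]

def estimate_satisfaction(df: pd.DataFrame, target: str = "satisfaction") -> float:
    """Train on aspect ratings and report Spearman's rho on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[ASPECTS], df[target], test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    rho, _ = spearmanr(model.predict(X_test), y_test)
    return rho

# turn_df = pd.read_csv("turn_annotations.csv")  # hypothetical file
# print(f"Spearman's rho: {estimate_satisfaction(turn_df):.4f}")
```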
Contributions. Our contributions in this article can be summarized as follows.
(1)
In our previous work [60], we conducted a study on 40 dialogues and 120 responses. To gain more insights, we extend that study with an additional 160 dialogues and 480 responses, bringing our investigation to 200 dialogues and 600 responses in total.
(2)
We ask annotators to assess dialogues on six dialogue aspects and overall user satisfaction. In addition, they provide judgments on turn-level satisfaction. User satisfaction ratings at the turn level allow us to establish the relationship between turn-level aspects and not only overall dialogue satisfaction but also turn-level satisfaction, which we did not experiment with in our previous work.
(3)
We carry out an in-depth feature analysis on individual dialogue aspects and at the class level (i.e., Sat and DSat classes) so as to understand which dialogue aspects correlate highly with each of the classes.
(4)
Leveraging the annotated data, we experiment with several classical machine learning models and compare their performance in estimating user satisfaction at the turn and dialogue levels.
(5)
Our findings indicate that predictive models perform better at estimating user satisfaction based on the proposed dialogue aspects than based on turn-level satisfaction ratings.
To the best of our knowledge, our work is the first attempt to establish a relationship between the proposed dialogue aspects and user satisfaction at both the turn and dialogue levels and to evaluate their effectiveness in estimating user satisfaction with CRSs.
Organization of the paper. The rest of this article is organized as follows. In Section 2, we discuss related work. We describe the dialogue aspects investigated in this study in Section 3. In Section 4, we detail our annotation process and the instructions given to the annotators. In Section 5, we analyse the annotated data to answer RQ1. Section 6 discusses our problem formulation and the predictive models used to estimate turn- and dialogue-level user satisfaction, while Section 7 presents the results of our experiments and answers RQ2 and RQ3. We discuss our results and the limitations of this study in Section 8 and present our conclusions, implications, and future work in Section 9.
9 Conclusion and Future Work
In this article, we have focused on a user-oriented approach to understanding user satisfaction in conversational recommendation. We have conducted a study to assess the influence of multiple dialogue aspects on overall user satisfaction. Through a carefully designed annotation process, we have collected external assessors' feedback ratings on six dialogue aspects (relevance, interestingness, understanding, task completion, interest arousal, and efficiency) and on user satisfaction at the turn and dialogue levels. With this data, we have investigated the relationship between several dialogue aspects and user satisfaction. Furthermore, we have adopted several machine learning methods to predict response quality and overall user satisfaction with different feature combinations.
Combining qualitative and quantitative methods, our results indicate that: (i) relevant recommendations are necessary but not sufficient for high user satisfaction feedback, so several aspects should be considered when estimating a user's overall satisfaction with a CRS; (ii) in the absence of response quality ratings, we can rely on turn-level aspects to estimate the user's rating for each response; and (iii) user satisfaction can be predicted more accurately with combined dialogue aspects as features than with turn-level satisfaction ratings alone.
In addition to understanding how several dialogue aspects influence a user's overall satisfaction with a CRS, our findings also have implications for the design and evaluation of CRSs. Our results show that predicting user satisfaction with aspects representing the utility of a CRS (relevance and task completion) performs poorly compared to predicting with a combination of all aspects. Thus, in order to achieve high user satisfaction, the design of CRSs should be optimized not only toward goal accomplishment but also toward a good user interaction experience.
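This design implication can be illustrated with a small, hedged sketch that compares a utility-only feature set against the full set of aspects, using the same hypothetical DataFrame layout as the earlier sketches; it is meant as an illustration of the comparison, not as the authors' experimental code:

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Hypothetical feature groupings: utility aspects vs. all six aspects.
FEATURE_SETS = {
    "utility_only": ["relevance", "task_completion"],
    "all_aspects": ["relevance", "interestingness", "understanding",
                    "task_completion", "interest_arousal", "efficiency"],
}

def compare_feature_sets(df: pd.DataFrame, target: str = "satisfaction") -> dict:
    """Cross-validated Spearman's rho for each feature set."""
    scores = {}
    for name, cols in FEATURE_SETS.items():
        preds = cross_val_predict(
            RandomForestRegressor(n_estimators=100, random_state=42),
            df[cols], df[target], cv=5)
        rho, _ = spearmanr(preds, df[target])
        scores[name] = rho
    return scores  # expectation: "all_aspects" outperforms "utility_only"
```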
Our experimental results with traditional machine learning methods indicate strong performance. We have not experimented with neural network architectures in this study, as they are not the main focus of our work; we leave this to future work. Furthermore, other dialogue features, such as dialogue context, intent, and system-user actions, could be modeled in a neural architecture, as they have been shown to improve user satisfaction prediction. Since our study involves a small sample dataset, we plan to verify our findings on a larger scale and with diverse data collected from actual users interacting with the system. Collecting a large-scale dataset can be achieved automatically by leveraging existing predictive models to capture key patterns, training them with explicit ratings or in an unsupervised way. Apart from that, techniques such as user simulation can be used to provide annotated user feedback within dialogues, thus increasing the amount of annotated data [5]; this feedback can include explicit ratings on the dialogue aspects, allowing for the collection of ground-truth data for training and automatic evaluation at scale.
Though the focus of our study is to uncover the relationship between various dialogue aspects and user satisfaction, we believe our findings provide insights into the factors that contribute to increased user satisfaction in CRSs and can serve as a basis for future research and system development. We therefore encourage future research to investigate the practical implications of our findings by looking at the impact of improving individual dialogue aspects on user satisfaction through experimental studies or user-centered evaluations, using tools such as CRSLab [73] to compare different CRS methods.
For future work, we are interested in integrating large language models into the annotation process to further enhance the accuracy, richness, and scale of the annotated dataset. We hypothesize that their advanced contextual understanding and semantic analysis capabilities will benefit the annotations. In particular, following [21], we expect the annotations on the recommended items to align more closely with the user preferences and intents expressed in the conversation.