Review

An Overview of the Empirical Evaluation of Explainable AI (XAI): A Comprehensive Guideline for User-Centered Evaluation in XAI

1 Information Systems, University of Siegen, 57072 Siegen, Germany
2 Bikar Metalle GmbH, 57319 Bad Berleburg, Germany
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(23), 11288; https://doi.org/10.3390/app142311288
Submission received: 30 September 2024 / Revised: 18 October 2024 / Accepted: 12 November 2024 / Published: 3 December 2024

Abstract

Recent advances in technology have propelled Artificial Intelligence (AI) into a crucial role in everyday life, enhancing human performance through sophisticated models and algorithms. However, the focus on predictive accuracy has often resulted in opaque black-box models that lack transparency in decision-making. To address this issue, significant efforts have been made to develop explainable AI (XAI) systems that make outcomes comprehensible to users. Various approaches, including new concepts, models, and user interfaces, aim to improve explainability, build user trust, enhance satisfaction, and increase task performance. Evaluation research has emerged to define and measure the quality of these explanations, differentiating between formal evaluation methods and empirical approaches that utilize techniques from psychology and human–computer interaction. Despite the importance of empirical studies, evaluations remain underutilized, with literature reviews indicating a lack of rigorous evaluations from the user perspective. This review aims to guide researchers and practitioners in conducting effective empirical user-centered evaluations by analyzing several studies; categorizing their objectives, scope, and evaluation metrics; and offering an orientation map for research design and metric measurement.

1. Introduction

With recent advances in technology, Artificial Intelligence (AI) has become a dynamic and thriving field of research. AI systems are now being used in many different settings beyond research labs, greatly impacting our daily lives. These systems can amplify, augment, and enhance human performance [1,2,3] by improving predictive performance through complex models and algorithms. However, a primary focus on prediction accuracy has left AI systems with black-box models, whose decision-making is not transparent. To overcome these obstacles, considerable efforts have been made in recent years to implement explainable systems with the aim of making AI systems and their outcomes understandable to humans [4,5].
Various concepts, models, and user interfaces have been explored to improve explainability, predictability, and accountability; build user trust; enhance user satisfaction; increase task performance; and support decision-making [5,6,7,8].
The emerging field of evaluation research addresses the issue of what constitutes a good explanation and how its quality can be measured [3,9,10]. Two evaluation approaches can be distinguished: the formal evaluation approach [11], which uses formal methods, mathematical metrics, and computational simulations [12,13], and the empirical evaluation approach, which has gained popularity in recent years due to its focus on user impact. While formal evaluation demonstrates technical accuracy, it leaves open the question of whether the desired effects are achieved in practice. In contrast, the empirical approach adopts research methods, scales, and probing concepts from psychology and human–computer interaction [8,14,15,16]. Empirical studies evaluating explanations in AI are labor-intensive and require careful planning to ensure that they are rigorous and valid [3]. For example, in their literature review, Adadi et al. [17] noted that only 5% of studies conducted an empirical evaluation. Similarly, the literature review by Anjomshoae et al. [18] found that 32% of studies conducted no evaluation, 59% conducted only an informal user study, and only 9% conducted thorough evaluations with well-defined metrics. In the most recent literature review, Nauta et al. [19] highlighted that 33% of the research was evaluated using anecdotal evidence, 58% applied quantitative evaluation, 22% evaluated human subjects in a user study, and 23% of the research was evaluated using domain experts, i.e., application-grounded evaluation.
The immaturity of this subject area is also reflected in the lack of literature reviews on empirical evaluation studies. While various surveys, such as [5,6,8,10,15,17,18], provide a sound overview of formal evaluations, empirical methodologies and procedures are only marginally discussed. Without a clear, systematic understanding of the different goals, methods, and procedures used in evaluating explanation systems, it is difficult to establish best practices, standards, and benchmarks.
The goal of this exploratory scoping review [20] is to inform and sensitize researchers and practitioners on how to conduct empirical evaluations effectively and rigorously. To achieve this, we analyzed the most relevant papers on XAI evaluation studies from prominent academic databases. The papers we analyzed demonstrate that the scope of explainable AI systems is quite broad, both in terms of domains (ranging from supporting engineers in debugging AI systems to helping consumers understand why certain product recommendations are made) and explanation objectives (ranging from enhancing understanding and improving task performance to fostering trust).
Through this review, we aimed to identify common patterns and recognize the essential elements that need to be considered when planning and conducting empirical evaluations of explanation systems. In our analysis, we categorized the objectives, scope, and evaluation metrics used in the studies. We also categorized the procedures and evaluation methods that were applied. This categorization provides an orientation map and guidance, e.g., showing which evaluation metrics align with specific objectives and which research designs are appropriate for measuring those metrics rigorously.
In this context, our work aims to address the following research questions:
  • What are the common practices in terms of patterns and essential elements in empirical evaluations of AI explanations?
  • What pitfalls, but also best practices, standards, and benchmarks, should be established for empirical evaluations of AI explanations?
The remainder of this article is organized as follows: In Section 2, we first give a brief overview of relevant concepts of XAI evaluation. After that, we present the findings of our literature survey. Section 3 describes three evaluation objectives we identified in the literature. Section 4 details the target domains and target groups addressed in the evaluation studies we analyzed. Section 5 presents the core of the article, summarizing the various measurement constructs and how they were operationalized in the evaluation studies. Finally, we present the procedures used in user-centric XAI evaluation in Section 6. Section 7 discusses the literature survey regarding pitfalls and best practices for conducting evaluation studies rigorously. A conclusion is given in Section 8.

2. Explainable Artificial Intelligence (XAI): Evaluation Theory

An evaluation presents a systematic process of measuring a well-defined quality of the AI system’s explanation and assessing if and how well it meets the set objectives [10,15,21]. In the literature, three distinct evaluation approaches have emerged for the evaluation of explainable AI systems [3,16]:
  • Functionality-grounded evaluations require no humans. Instead, objective evaluations are carried out using algorithmic metrics and formal definitions of interpretability to evaluate the quality of explanations.
  • Application-grounded evaluations measure the quality of explanations by conducting experiments with end-users within an actual application.
  • Human-grounded evaluations involve human subjects with less domain experience (typically lay users) and measure general constructs related to explanations, such as understandability, trust, and usability, on simplified tasks.
The first approach is theoretical in nature, focusing on conceptual frameworks and abstract principles. In contrast, the latter two approaches are empirical, involving study design, implementation, and the interpretation of results. Here, it is essential to adhere to rigorous standards of empirical research, ensuring the reliability, validity, and generalizability of the findings.
In this regard, the measurement theory underscores the fact that rigorous evaluation measures should consider the following three elements [22,23,24]:
  • Evaluation Objective and Scope: Evaluation studies can have different scopes as well as different objectives, such as understanding a general concept or improving a specific application. Hence, the first step in planning an evaluation study should be defining the objective and scope, including a specification of the intended application domain and target group. Such a specification is also essential for assessing instrument validity, referring to the process of ensuring that an evaluation method will measure the constructs accurately, reliably, and consistently. The scope of validity indicates where the instrument has been validated and calibrated and where it will be measured effectively.
  • Measurement Constructs and Metrics: Furthermore, it is important to specify what the measurement constructs of the study are and how they should be evaluated. In principle, measurement constructs could be any object, phenomenon, or property of interest that we seek to quantify. In user studies, they are typically theoretical constructs such as user satisfaction, user trust, or system intelligibility. Some constructs, such as task performance, can be directly measured. However, most constructs need to be operationalized through a set of measurable items. Operationalization includes selecting validated metrics and defining the measurement method. The method should describe the process of assigning a quantitative or qualitative value to a particular entity in a systematic way. A minimal scoring sketch illustrating such an operationalization is given after this list.
  • Implementation and Procedure: Finally, the implementation of the study must be planned. This includes decisions about the study participants (e.g., members of the target group or proxy users) and recruitment methods (such as using a convenience sample, an online panel, or a sample representative of the target group/application domain). Additionally, one must consider whether a working system or a prototype should be evaluated and under which conditions (e.g., laboratory conditions or real-world settings). Furthermore, the data collection method should be specified. Generally, this can be categorized into observation, interviews, and surveys. Each method has its strengths, and the choice of method should align with the research objectives, scope, and nature of the constructs being measured.
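To make the notion of operationalization more concrete, the following minimal sketch (our own illustration, not taken from any of the reviewed studies) treats a hypothetical "perceived trust" construct as four five-point Likert items, aggregates them into a composite score per participant, and checks internal consistency with Cronbach's alpha; the items, data, and scale length are assumptions chosen for brevity.

```python
import numpy as np

def cronbach_alpha(items):
    """Internal-consistency reliability of a multi-item scale.

    items: array of shape (n_participants, n_items), e.g., 5-point Likert ratings.
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five participants to a four-item "perceived trust" scale (1-5).
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
])

composite_trust = responses.mean(axis=1)  # one composite score per participant
print("Composite trust scores:", composite_trust)
print("Cronbach's alpha:", round(cronbach_alpha(responses), 2))
```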

Scoping Review Methodology

In our literature review, we investigate how these elements of rigorous evaluation studies have been implemented to identify best practices, common challenges, and innovative approaches in the field. For this, our review employs a scoping review methodology [20] to explore and synthesize the practices used in conducting empirical, human-centered evaluations of explainable artificial intelligence (XAI) systems. The goal of this scoping review was to encompass a broad and diverse range of studies to capture the heterogeneity of research approaches taken in this emerging and interdisciplinary field.
We conducted comprehensive searches in academic databases, including IEEE Xplore, ACM Digital Library, Scopus, and Google Scholar, using relevant keywords such as Explainable AI, user study, empirical evaluation, and evaluation metrics. Given the rapidly evolving nature of XAI research, we found no consistent taxonomy available for systematic, keyword-based searches across studies in XAI evaluation. In response to this challenge, we adopted an open and iterative approach, identifying relevant studies through cross-referencing papers and searching the literature on topics and terms that emerged during our review process. This approach allowed us to include a heterogeneous sample of studies, covering a variety of research designs, target domains, and evaluation metrics.
However, this type of scoping review has certain limitations. In contrast to systematic literature reviews [20] in well-developed research fields with established terms and taxonomies, the rigor of bibliometric analysis and keyword searches for identifying all relevant studies is limited. As a result, our scoping review may have omitted some studies because they remained undiscovered in the iterative process. Additionally, the open nature of this process may introduce a degree of selection bias, as it relies on the researcher’s judgment to identify and include relevant studies. This is particularly true for studies (such as [25,26,27,28]) in which the authors of this paper were personally involved. We discussed whether to include these studies in the sample and ultimately decided to apply the same selection criteria, including them only if they shed light on existing practices for planning and conducting XAI evaluation studies.
Despite these limitations, our scoping process’s open and iterative nature enabled us to capture a more comprehensive and nuanced picture of the current state of empirical evaluations in XAI. Moreover, our explorative approach shows the diversity and complexity of XAI evaluation practices, especially in terms of research design, evaluation metrics, and methodologies.
By analyzing the papers from this perspective, we could identify various patterns and categories (see Table 1). For instance, regarding research objectives, we saw methodology-driven, concept-driven, and application-driven studies as common research genres. Regarding the evaluation scope, we saw that the studies address various application domains (such as healthcare, law/justice, finance, etc.), where both highly critical real-world and less critical, illustrative scenarios had been addressed. We were also able to identify different target group types, such as end-users/affected persons, regulators/managers, and developers/engineers. Regarding measurement, we uncovered three main areas, namely understandability, usability, and integrity. Regarding implementation and procedures, our analysis reveals that using proxy users recruited via online panels was a common pattern. Lastly, we looked at the data collection methods employed, such as observations, interviews, and surveys, and how they aligned with the study objectives and constructs being measured.
We present our findings in detail in the following sections. This not only enhances our understanding of current methodologies but also contributes to guiding future research efforts for more effective and accurate evaluations in similar domains.

3. Evaluation Objectives

Evaluation studies are primarily defined by their objectives, which serve as guiding principles directing the focus, methodology, and scope of the research.
Regarding the evaluation objectives, our review uncovers three types of research. First, studies about evaluation methodologies focus on methodological issues and developing metrics to effectively measure explainability. Second, concept-driven research focuses on novel concepts, models, and interfaces to improve the systems’ explainability. Third, domain-driven research focuses on practical applications and specific domains to bridge the gap between theory and practice, showcasing how explainability functions in various domains.
Figure 1 shows that the majority of the studies in our sample are concept-driven evaluations (69%), followed by methodological evaluations (24%) and domain-driven evaluations (7%).

3.1. Studies About Evaluation Methodologies

The first category focuses on methodological questions related to evaluating explainable systems. Studies in this category [15,16,38,39,40,44,48,49,50,53,61,63,64,66,72,74,76,77] are dedicated to developing effective, relevant, and reliable methods and approaches to understand, measure, and assess the explainability of such systems regarding well-specified goals.
In the following, we outline these methodologically oriented contributions in more detail. Hoffman et al. [35] outlined how procedures from scale development research and test theory can be used to specify evaluation metrics rigorously. They further investigated evaluation methods for determining the effectiveness of explainable AI (XAI) systems in helping users understand, trust, and work with AI systems. Another good example is the study by Holzinger et al. [44], which introduced the System Causability Scale (SCS) to measure the overall quality of explanations provided by explainable AI systems and illustrated the application of the SCS in the medical domain. Schmidt and Biessmann [53] proposed a quantitative measure to assess the overall interpretability of methods explaining machine learning decision-making. They further proposed a measure to assess the effect on trust as a desired outcome of explanations. Kim et al. [61] argued for standardized metrics and evaluation tasks to enable benchmarking across different explanation approaches and suggested two tasks (referred to as the confirmation task and the distinction task) to assess the utility of visual explanations in AI-assisted decision-making scenarios. Mohseni et al. [16] suggested a human attention benchmark for evaluating model saliency explanations in image and text domains. Naveed et al. [25] emphasized that evaluations must be not only rigorous but also relevant to the particular use context in which explanations are requested; for this reason, domain-agnostic measures should be supplemented with domain-specific metrics grounded in empirical qualitative pre-studies.
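To illustrate how a human attention benchmark of this kind can be operationalized, the sketch below (our own illustration, not code from [16]) compares a model saliency map against an aggregated human attention map using a rank correlation over all pixels and the overlap of the most salient pixels; the toy maps, the 10% threshold, and the choice of metrics are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def saliency_agreement(model_saliency, human_attention, top_fraction=0.1):
    """Compare a model saliency map with a human attention map.

    Both inputs: 2D arrays of the same shape with non-negative importance values.
    Returns the rank correlation over all pixels and the IoU of the top-k most salient pixels.
    """
    m = np.asarray(model_saliency, dtype=float).ravel()
    h = np.asarray(human_attention, dtype=float).ravel()
    rho, _ = spearmanr(m, h)

    k = max(1, int(top_fraction * m.size))
    top_m = set(np.argsort(m)[-k:])
    top_h = set(np.argsort(h)[-k:])
    iou = len(top_m & top_h) / len(top_m | top_h)
    return rho, iou

# Toy 8x8 maps standing in for a model explanation and aggregated human annotations.
rng = np.random.default_rng(0)
model_map = rng.random((8, 8))
human_map = 0.7 * model_map + 0.3 * rng.random((8, 8))  # partially overlapping signal

rho, iou = saliency_agreement(model_map, human_map)
print(f"Spearman rho = {rho:.2f}, top-10% IoU = {iou:.2f}")
```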
Various studies in this category are also based on literature reviews to understand and identify underlying concepts and methods to evaluate the XAI systems. For example, Lopes et al. [78] conducted a literature survey on human- and computer-centered methods to evaluate systems and proposed a new taxonomy for XAI evaluation methods. Similarly, Rong et al. [29] explored user studies in XAI applications and proposed guidelines for designing user studies in XAI. Kong et al. [79] conducted a literature survey to summarize the human-centered demand framework and XAI evaluation measures for validating the effect of XAI. Then, the authors presented a taxonomy of XAI methods for matching diverse human demands with appropriate XAI methods or tools in specific applications. Jin [76] conducted a critical examination of plausibility as a common XAI criterion and emphasized the need for explainability-specific evaluation objectives in XAI. Schoonderwoerd et al. [30] examined a case study on a human-centered design approach for AI-generated explanations in clinical decision support systems and developed design patterns for explanations. Weitz et al. [74] investigated end-user preferences for explanation styles and content for stress monitoring in mobile health apps and created user personas to guide human-centered XAI design.
Overall, methodology-driven research explores various evaluation approaches to understand how well an explainable system makes the behavior of an AI system interpretable and accountable [80]. A central question of this research is how to quantify and objectively measure explainability. This often involves creating metrics and evaluation techniques that allow for the assessment of explanation quality and facilitate comparisons between different explanation models. Researchers in this category also address the challenge of balancing explainability and performance since complex models may achieve better performance but can be less interpretable.

3.2. Concept-Driven Evaluation Studies

Most studies [16,25,26,27,28,29,30,31,32,34,35,36,37,38,39,40,41,42,43,44,46,47,48,49,50,51,52,55,56,57,58,61,62,63,64,65,66,67,68,69,70,71,73,74,75,81,82] in our sample are driven by research on explanation models and their representation. The objective is to understand what constitutes a high-quality explanation that supports human cognition and decision-making.
The research in this category aims to develop a common understanding of the explanation quality of existing XAI frameworks such as LIME and SHAP [63]. Often, novel explanation concepts and approaches are evaluated, including example-based explanations (normative and comparative explanations) [48], consequence-oriented and contrastive explanations [81], question-answering pipelines [36,61], data-centric explanations [36], or argumentative explanations [28]. Also, design principles, such as the implementation of the right to explanation [43], or novel interface concepts, such as interactive explanations [38,51], have been evaluated in this kind of research.
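For orientation, the following minimal sketch shows the kind of feature-attribution output that a framework such as SHAP produces for a tabular model; the synthetic dataset, model choice, and feature names are our own assumptions, and the studies cited above evaluate how users perceive such outputs rather than how they are computed.

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic tabular data standing in for an application dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes per-feature attributions for each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Local explanation for one instance and a simple global importance ranking.
print("Attributions for instance 0:", dict(zip(feature_names, shap_values[0].round(2))))
print("Mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0).round(2))
```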
To improve generalizability, most studies in this category carry out domain-independent evaluations using fictional and illustrative scenarios. The goal is to obtain insights into how various explanation models and features impact the overall functionality and effectiveness of explanation systems. To reduce confounding effects, these studies typically prefer experimental designs conducted under controlled conditions that abstract away from specific real-world contexts. Regarding the tension between rigor and relevance [83], evaluation studies in this category are often methodologically sound yet lack ecological validity, as they frequently rely on fictional, simplified tasks that are disconnected from real-world applications and do not involve real users.

3.3. Domain-Driven Evaluation Studies

The third category, domain-driven, focuses on applying explainability in specific domains or application areas. Studies in this category [34,41,51,59,60] address how to deploy explainable systems in real-world scenarios and practical applications, such as news recommendation systems [47], Facebook’s news feed algorithm [55], deceptive review detection [31], diagnostic decision support systems for both laypeople [49] and experts [77], explainable learning platforms [56], recommendation of scientific publications [58], and even applications concerning autonomous vehicles [67], music, and product recommendation systems [26].
In contrast to concept-oriented research, domain-oriented studies address the specific application context and emphasize the relevance of the evaluated factors within the particular application domain. By considering the intricacies and specifics of the application context, domain-based research aims to provide insights that are not only theoretically valuable but also practically applicable and useful in real-world scenarios. For this purpose, domain-independent explanation concepts such as “what-if” explanations or feature attribution are adapted for the respective use case, or new concepts are designed and implemented for the specific application [26]. Frequently, different explanatory approaches are also combined within a single application. Domain-driven studies also tend to be more holistic than concept-driven studies: users typically do not evaluate isolated explanation concepts; instead, they use applications in which explanatory elements are embedded.
There are also a few ethnographically oriented studies, which focus less on what explainable systems do to users, and more on what users do with explainable systems. For example, Kaur et al. [30] investigated the usage and comprehension of interpretability tools (such as the InterpretML implementation of GAMs and the SHAP Python package) by data scientists, identifying potential issues and misconceptions. Another instance is the study conducted by Alizadeh et al. [29], which examined the experiences of Instagram users affected by the platform’s algorithmic block and their need for explanations regarding the decision-making process. Such studies are often less rigorous but possess the highest grade of ecological validity as they do not assess the use of explainable systems under pre-defined tasks in artificial lab conditions.

4. Evaluation Scope

The evaluation scope is defined as the range within which an evaluation approach and metric are developed, tested, and calibrated. In our literature review, the evaluation scope of the studies in our sample can be mainly defined by the target domain, the target group, and the test scenarios used in the evaluation. Figure 2 summarizes the identified evaluation scope of the studies in our sample.

4.1. Target Domain

The target domain is often categorized by sector-specific boundaries [84]. In the following section, we summarize the various sectors addressed in our sample.

4.1.1. Healthcare

Various studies in our sample [44,49,57,63,74,85] have been conducted in the healthcare domain. This domain is characterized by life-critical decisions that can immediately impact human lives, necessitating the consideration of ethical, legal, and regulatory aspects. Therefore, both doctors and patients must be able to trust AI systems and understand the decisions they make.
Regarding this, Holzinger et al. [44] evaluated medical explanations to enhance clarity and trust for healthcare professionals. Schoonderwoerd et al. [63] evaluated the structuring and presentation of explanations within clinical decision support systems. Tsai et al. [57] evaluated to what extent explanations make the AI diagnostic process more transparent for clinicians and patients. Weitz et al. [74] evaluated what kinds of explanations are preferred by healthcare providers and patients in clinical decision support systems. Van der Waa et al. [49] evaluated how explanations help patients with diabetes better understand and manage their condition. Cabitza et al. [85] evaluated explainable AI in the health domain by examining the usefulness and effectiveness of activation maps in aiding radiologists in detecting vertebral fractures.

4.1.2. Judiciary

Several studies [11,36,52,65,73,86,87] address legal decision-making processes related to granting bail to defendants, calculating reconviction risk, predicting recidivism and re-offending, or assessing the likelihood of violating pretrial release conditions. Since legal decisions involve acts of the state, it is crucial that these decisions are made on a legal basis, are fair and unbiased, and that the decision-making process is transparent and accountable. For this reason, explanations are pivotal for the societal acceptance of the use of AI systems in this field.
Regarding this, Dodge et al. [87] examine how explanations affect fairness judgments in AI models, while Harrison et al. [73] assess the impact of explanations on the perceived fairness of AI decisions. Anik and Bunt [36] focus on the transparency of machine learning systems by explaining the training data used. Alufaisan et al. [65] evaluate the impact of explainable AI in enhancing legal decision-making, while Liu et al. [52] evaluate the role of explanations in addressing the out-of-distribution problem in AI.

4.1.3. Finance Sector

The finance sector is another domain in our sample that has been extensively researched [15,25,50,65,68,70,88]. A characteristic of this domain is that everyone must make financial decisions in their daily lives, yet making investments is quite complex, and one wrong decision could have a significant impact on the financial well-being of a person [25]. As the average financial literacy of ordinary users might be relatively low, AI systems can significantly contribute to this context. However, they must be trustworthy, and the explanations must be understandable to laypersons with low financial expertise.
Regarding this, Chromik et al. [15] investigated how non-technical users interpret additive local explanations in loan application scenarios, while Schoeffer et al. [82] evaluated the role of explanations for automated decision systems (ADSs) for loan approvals. Poursabzi et al. [50] evaluated the impact of explanations in a tool used for predicting apartment selling prices using existing property data. Other studies have assessed the effect of explanations in predicting annual income based on socio-demographic factors [68] or utilized the CENSUS dataset [65]. More recently, Naveed et al. [25] investigated explanations for robo-advisors from a user perspective by identifying and understanding the user’s specific needs for explainability in the context of the financial domain.

4.1.4. E-Commerce

E-commerce was also a prominent sector in our sample, especially in relation to product recommendation systems [27,28,42,67,89,90]. Personalized product recommendation systems are ubiquitous nowadays, helping consumers find the best offers in a vast array of products. A wrong purchase may not have the same far-reaching consequences as those in financial investment, but it is still important for consumers to understand how product recommendation systems work and how trustworthy they are. Research in this domain shows that providing explanations can enhance the success of recommender systems in various ways. For instance, explanations can reveal the reasoning behind a recommendation [91], increase system acceptance by outlining the strengths and limitations of recommendations [92,93], help users make informed decisions, and facilitate advanced communication between the service provider and the user [94].
Regarding this, Naveed et al. evaluated various aspects of personalized recommender systems. Naveed et al. [28] evaluated the perceived quality of and satisfaction with argumentative explanations for product recommendations targeting intuitive thinkers. They also evaluated the transparency of explanations for digital camera recommendations and their impact on richer interaction possibilities. Naveed et al. [27,89,90] implemented an interactive feature-based explanation system and evaluated its impact on the overall system perception. Other studies have focused on different consumer domains to evaluate explanations. Buçinca et al. [42] focused on evaluating explainable AI systems using food- and nutrition-related tasks, where participants predict the fat content of meals based on AI-generated explanations.

4.1.5. Media Sector

Explainable recommender systems have also been studied with regard to media contexts, including news, movies, music, books, gaming, and art [26,31,32,33,41,55,69,71,95,96,97,98]. These systems have much in common with product recommendation systems, yet there is a significant difference. Mass and social media influence public opinion, inform societal norms, and play a crucial role in shaping individuals’ knowledge, attitudes, and values as well as how they perceive and engage with political and social topics. Because of this, recommender systems are at risk of contributing to the formation of filter bubbles and the spread of disinformation and toxic content.
In this context, Rader et al. [55] and Liao et al. [47] delved into the explanation of news feed algorithms. Papenmeier et al. [46] examined tools designed for social media administrators to detect offensive language in tweets. Carton et al. [60] analyzed explanations concerning the AI-supported detection of toxic social media posts. With regard to disinformation, Lai et al. [31] investigated deceptive practices in reviews, while Schmidt and Biessmann [53] and Bansal et al. [60] focused on explanations used to support the sentiment classification of movie reviews. Millecamp et al. [26,41] evaluated explainable music recommendations regarding their impact on perceived usefulness, satisfaction, and engagement of end-users. In contrast, Kulesza et al. [33] evaluated the usefulness of explanations for “debugging” AI-generated playlists. Regarding film recommender systems, Ngo et al. [70] investigated the mental models of end-users, while Kunkel et al. [71] evaluated different explanation styles regarding trust and satisfaction.
Empirical studies in this domain also show that users expressed concerns about the amount of information presented, as excessive details can lead to cognitive overload [99,100]. Moreover, research in gaming and art recommender areas has shown that users prefer prompt hints (explanations) with communicative and user-friendly interfaces [101,102].

4.1.6. Transportation Sector

The transportation sector is characterized by systems that are complex, technical, and safety-critical, where the behavior of AI must be explained in a way that laypeople can understand.
This sector was not very prominent in our sample [29,67]. One of the reviewed studies was by Colley et al. [67], in which the focus was on using explanations in the context of highly automated vehicles. The authors used semantic segmentation visualization to assess the impact on user trust, situation awareness, and cognitive load by displaying the vehicle’s detection capabilities to the user. Alizadeh et al. [29] investigated people’s AI folk concepts to evaluate how individuals interact with AI technologies in mobility-related contexts.

4.1.7. Science and Education

Various studies in our sample addressed the science and education sector [36,38,43,56,58,60,66], where explanations are studied in the context of learning, research, and knowledge assessment. A distinctive feature of this domain is that users are particularly interested in understanding issues and are often eager to learn something new. Therefore, explanations contribute not only to pragmatic goals but also align with the educational interests of the target audience.
Regarding this, Ooge et al. [56] evaluated the role of explanations in math exercise recommendations. Explanations have also been evaluated in entertainment settings such as decision-making games [66] and learning games [38], as well as in scientific literature recommender systems [58].
A more serious issue arises when AI systems are used to assess students’ intelligence and abilities. Since such evaluations can have significant consequences for educational careers, these decisions must be correct, fair, and accountable. For this reason, several studies have evaluated how explanations can influence the decision-making process in student admission [43] and student application recommendation [36] and using explanations for answering the Law School Admission Test (LSAT) [60].

4.1.8. AI Engineering

AI engineering was also a prominent domain in our sample, where several evaluation studies have been conducted [16,30,39,40,48,59,61,72]. The characteristic of AI engineering is that the developed models are often complex and operate as “black boxes”, which makes it difficult for developers to understand AI model behavior, identify errors in models and datasets quickly, and debug and optimize the models. For this reason, explainability elements have become essential tools for developers to debug and understand AI models. Unlike other domains, this field is distinguished by developers’ high technical expertise, allowing them to comprehend intricate and technically detailed explanations.
In this regard, Kaur et al. [30] investigated how data scientists understand and utilize interpretability tools in their daily tasks. Dieber et al. [40] examined which representations benefit data scientists in making tabular models more interpretable. Jeyakumar et al. [72] investigated the AI engineer’s preferences for deep neural network explanation methods.
Numerous studies in our sample have evaluated how explanations contribute to data labeling tasks, such as providing explanations within annotation tools [16] or supporting the classification of handwritten digit images [59], artistic drawings [39], or images [75] and pictures [48] of objects and people. Additionally, the authors of [61] evaluated various visual explanation methods concerning their effectiveness in confirmation and distinction tasks within classification processes.

4.1.9. Domain-Agnostic XAI Research

The survey by Islam et al. [84] shows that many instances of XAI research are domain-agnostic, meaning they are not specifically designed and evaluated for a particular real-world application. This also holds true for the evaluation studies in our sample. For this reason, domain-agnostic studies constitute a distinct category where the authors either do not specify a target domain or focus on domain-independent features [34,51,61,103]. Most domain-agnostic studies are either concept-driven or methodology-driven. For instance, Hoffman et al. [103] did not focus on any specific domain but reflected on evaluation methodologies in general. Various studies focus on universal evaluation concepts. For instance, Sukkerd [34] evaluated consequence-oriented, contrastive explanations. Kim et al. [61] investigated visual explanations for interpreting charts. Narayanan et al. [51] deliberately used alien scenarios in their evaluation study to abstract from a concrete domain.
This overview demonstrates that explanations are used across a wide range of domains, each with very different levels of severity: the potential harm in the domain of product recommender systems is relatively minimal, while using AI for legal decision-making can significantly affect individual liberty. Additionally, the requirements are quite different: explanations for movie recommendations, for instance, can be reviewed at one’s leisure, whereas in the transport sector, explanations need to be understood in real time. The target audience can vary greatly not only between domains but also within the same domain. For instance, in the medical field, the level of domain knowledge significantly affects whether the explanations are intended for doctors or patients. This last point underscores the importance of explicitly defining the target audience in evaluation studies.

4.2. Target Group

A target group is defined as the intended people who will be affected by an AI system or make use of the explanation provided. The target group presents a key issue for the scope of explanation systems because the significance and relevance of explanations are highly dependent on their intended audience [104,105,106].

4.2.1. Expertise

The target group is characterized, among other factors, by their level of expertise. Mohseni et al. [105], for instance, distinguish three levels of expertise: high, medium, and low. High expertise is attributed to individuals with advanced knowledge of AI theory and the technical aspects of machine learning algorithms. Medium expertise describes those who may lack theoretical knowledge but possess an understanding of machine learning concepts sufficient for data exploration and visual analytics. Low expertise refers to individuals with minimal or no knowledge of both theoretical and practical aspects of machine learning [104,105,106].
In our sample, the level of expertise is rarely explicitly considered in evaluation studies. Exceptions to this are the studies by Ngo et al. [32], Anik et al. [36], and Schoeffer et al. [70], which distinctly differentiate between users with high and low levels of technical knowledge and AI literacy. More commonly, the target group is characterized based on their role.

4.2.2. Role

Target groups can also be defined by stakeholders’ roles within the AI system’s lifecycle. Meske et al. [104], for instance, distinguish roles within the lifecycle of designing, operating, and using AI systems.
AI engineers and data scientists play the most prominent role in the design phase. They must understand the data, models, and algorithms affecting the system’s performance. Explanation systems are essential here to improve algorithm performance and facilitate debugging, testing, and model verification [104].
In the using phase, we can distinguish between the end-users of the system and the people affected by the systems’ decision-making. Regarding their expertise, we can further differentiate between professional and lay users. In both cases, explanations can contribute to the users’ satisfaction, trust building, task performance, and system understanding [29,104,107].
A typical scenario is that the end-users of AI systems are professionals such as doctors, judges, or financial advisors, while the individuals indirectly affected by these decisions are laypersons, such as patients, defendants, or bank clients. These affected individuals have a right to explanations to validate decisions, assess their fairness, and provide grounds for objection [29,104,107]. Furthermore, the EU’s AI Act stipulates that these explanations should be conveyed in a language comprehensible to the average person, not solely to technical experts (https://www.ey.com/content/dam/ey-unified-site/ey-com/en-gl/services/ai/documents/ey-eu-ai-act-political-agreement-overview-february-2024.pdf, accessed on 20 March 2024).
Administrators, managers, and regulators are typical stakeholders involved in the operation of AI systems. These stakeholders play a crucial role in ensuring the system functions correctly and adheres to corporate policies, regulations, and legal requirements. Explanations should help these stakeholders monitor, operate, and audit these systems [104]. The general public constitutes another significant stakeholder group, particularly in the context of socially relevant systems such as mass media, the legal process, and the democratic process. This is especially true regarding social values and ethical principles such as fairness, impartiality, and public welfare [108].
Even though specifying the target group is essential for a rigorous evaluation, in our sample, the roles or stakeholders targeted by the system were often not explicitly defined. In such cases, we tried to infer the expertise and roles of the target group from the study context.

4.2.3. Lay Persons

Lay persons are individuals who do not have specialized knowledge or professional expertise in a particular field or subject. In our sample, the target group of lay persons spans a wide range of areas, from mass-market domains such as movie or song recommendations [26,32,33,41,71], news recommendation systems [47,55], product recommendations [28,51,89,90], finance applications [25], and driver assist systems [67], to more specific groups of patients [49,57], students or researchers [56,58], and individuals affected by service bans [29].
Also, the study by Dominguez et al. [39], which focuses on artwork recommendation, the study by Cai et al. [48], which focuses on a drawing scenario, and the study by Buçinca et al. [42], which focuses on a nutrition scenario, all appear to consider the lay user, too. In other cases, identifying the target audience from the context is more challenging, as the evaluation scenarios are primarily illustrative in nature.

4.2.4. Professionals

Professionals are individuals who possess specialized knowledge or expertise in a particular field or subject, often gained through formal education, training, and experience. In our sample, professionals are typically targeted with regard to the healthcare sector (such as doctors, nurses, paramedics, and emergency services providers) [44,77,85] or the engineering sector, including data professionals, data scientists, AI engineers, data annotators, and data categorization specialists [30,40,59,72]. Here, a notable study is the one by Kaur et al. [30], which explicitly addresses data scientists as the target group. In most cases, the target group is only implicitly defined by the context of the studies, such as the one by Dieber et al. [40] focusing on XAI frameworks like LIME; Jeyakumar et al. [72] on the comprehensibility of deep neural networks; Sukkerd [34] on AI-based navigation planning; Ford & Keane [59] on labeling handwritten digits; and Kim et al. [61] on classification tasks.
In many cases, it is not clear from the context whether the target group of a study is laypersons or professionals. For instance, a series of studies have investigated the effects of explanations about data classification, such as sentiment analysis of online reviews [53,60], online reviews [52], toxic posts, and hate speech [46,69]. From the context of these studies, however, it is unclear whether they are addressing professional content moderators/data analysts or lay users affected by online reviews and social media posts.
The same ambiguity is also present in studies concerning the use of explainable artificial intelligence (XAI) in automated decision-making [31,43,65,73,86,87]. Since the target group is not explicitly defined, it is uncertain whether the explanations are intended for individuals affected by these decisions; professionals such as judges, psychiatrists, or jurors who are making them; or other stakeholders such as the general public. This uncertainty also holds true for studies in our sample, which focused on application areas such as processing loan applications [15], making real estate transactions [50], making income predictions [68], interpreting graphs or charts [61], or making university admissions decisions [43]. In these cases, too, it remains unclear from the context whether the explanations are aimed at the decision-makers or at the individuals who are affected by these decisions.
This ambiguity is particularly prevalent in concept-driven studies, where a specific usage scenario is either absent, only briefly described, or very generally defined [5,16,35,36,38,51,61,64,66,75]. In such cases, it is not possible for us to define the target group more precisely. As a result, the ecological validity of the effects measured in these studies remains uncertain.

4.3. Test Scenarios

The evaluation scope also relies on the test scenarios used for the evaluation, their relevancy, and their ecological validity. Concerning this, our sample includes two types of test scenarios: those with significant real-world impact and those that serve an illustrative purpose using toy scenarios [8].

4.3.1. Real-World Scenarios with Critical Impact

This category covers test scenarios drawn from real-world cases in which AI decisions significantly affect individuals or carry a high risk of substantial impact on the lives of individuals, groups, or society. These are domains where the stakes of AI decisions are high, necessitating rigorous and reliable explanation systems. By using an open and iterative scoping review methodology, we ensured the inclusion of diverse evaluation practices and methodologies, capturing the complexity and real-world relevance of explainable AI research across critical domains such as healthcare, justice, and finance.
Many of our selected domain-driven studies, such as [29,30,31,32,43,47,49,55,56,58,67,71,73,77], fall into this category. These studies typically place a high degree of emphasis on the ecological validity of their research. Various studies investigate the actual explanation needs of affected people and/or the usage of explanatory systems in practice [25,29,30,32,77]. There is also a strong emphasis on creating evaluation scenarios that reflect the real-world setting of the domain as closely as possible [31,49,56,58,67,73].
Within this category [26,28,31,33,40,41,46,50,68,70,86,87], there are also various concept-oriented studies. These studies are less focused on specific application domains but rather on generic explanatory concepts and their impact on users. The use of real-world scenarios in these studies, however, helps to demonstrate the research’s relevance and evaluate the concepts by using scenarios that are meaningful for the participants. The same applies to methodological studies [44,50,57], where the focus is on how explainability can be evaluated and what appropriate measurements and procedures are. In these cases, real-world scenarios are also used to illustrate general considerations or to validate the developed measurement methods through specific application cases.
Real-world scenarios frequently focus on the healthcare [44,49,57,85] and judicial sector [36,52,65,73,86,87], where mistakes in decisions can have a significant impact on individuals’ lives. To a lesser extent, this also applies to the financial sector [15,25,50,65,68,70,88], where the denial of a loan or a poor investment in houses, stocks, or other financial products can have significant repercussions. Other real-world scenarios address the denial of access to educational institutions, such as universities [43], and essential digital services, which can have a significant impact on an individual’s life. Other real-world scenarios address news recommendation algorithms, which can impact the spread of fake news and the formation of filter bubbles [55].

4.3.2. Illustrative Scenarios with Less Critical Impact

This category encompasses domains or evaluation scenarios where AI decisions have minor impacts or researchers envision simple scenarios to illustrate an approach and the explanations produced [8].
A common method used in this category is to isolate the explanation mechanism from specific contexts to better understand its fundamental properties and impacts. For example, Kim et al. [61], Fügener et al. [75], and Mohseni et al. [16] utilized generic image classification scenarios to evaluate various explanation methods. Ford and Keane [59] used the labeling of handwritten digits in a decontextualized evaluation scenario. Similarly, Jeyakumar et al. [72] presented various explanation methods for text, images, audio, and sensor data in a non-contextualized manner to determine user preferences for these methods. Kim et al. [61] explored the role of explanations in a decontextualized setting where participants were asked to interpret and respond to questions about charts and tables. There are also cases where no specific domains are addressed or where evaluation scenarios are not well specified. Hoffman et al. [35] focused on theoretical criteria for evaluation studies, not on empirical research.
An additional approach is to use toy examples and fictitious scenarios. To prevent confounding effects and avoid triggering everyday habits, biases, and established preferences, these scenarios are intentionally designed to be distinct from familiar environments and real-world applications. Buçinca et al. [42], for instance, employed proxy, artificial tasks, such as asking participants to predict the AI’s estimate of the percentage of fat content in a meal. Narayanan et al. [51] defined an alien food preference and an alien medicine treatment scenario for their evaluation study. Sukkerd [34] and Paleja et al. [64] designed fictitious robot scenarios for their evaluation study. Schaffer et al. [66] used a scenario based on the Diner’s Dilemma, where several diners eat out at a restaurant and agree to split the bill equally over an unspecified number of days.
All of these studies allow for the examination of explanation methods in a controlled, non-realistic task. Using fictitious application scenarios in evaluations aids in engaging participants and facilitating their understanding of the context. At the same time, detaching the evaluation from real-world scenarios comes with a trade-off that reduces the ecological validity of the results.
Another approach is to adapt familiar contexts to enhance the participants’ understanding and engagement with the abstract concepts being evaluated. The study by Guo et al. [38] is an example of this approach, in which explanation concepts were evaluated with the help of the well-known Tic-Tac-Toe game. In a similar vein, Dominguez et al. [39] used an art recommendation scenario, and Cai et al. [48] used the widely known QuickDraw platform for this purpose. Bansal et al. [60] and Schmidt and Biessmann [53] utilized a sentiment labeling task for online movie reviews as a familiar context to many internet users to evaluate explanation systems. Anik et al. [36] evaluated their data-centric explanatory approach using four decontextualized but familiar scenarios: predictive bail decisions, facial expression recognition, automatic approval decisions, and automatic speech recognition. Similarly, Alufaisan et al. [65] used a repeat-offender scenario for their evaluation, Carton et al. [69] used the toxicity of social media posts, while Chromik et al. [15] used the default risk assessment scenario for credit applications. Naveed et al. [28,51] used a common but fictional online shopping scenario to evaluate explanations for finding appropriate digital cameras.
Overall, the approach of using fictional and toy examples minimizes the complexity inherent in real-world settings and reduces potential confounding variables, thereby facilitating a clearer understanding of the general effects of explanatory systems. However, it leaves unanswered questions about how these systems are utilized in everyday life and what domain- and context-specific effects might occur. This gap highlights the need to complement domain-independent, illustrative evaluation studies with domain-specific real-world research. This research should evaluate the adoption and impact of these systems in the context of everyday life to fully understand the complexities of how people make use of explanations for their specific problems.

5. Evaluation Measures

Evaluation approaches in XAI studies can be broadly divided into two groups. One group is human-grounded evaluation, which involves human subjects and measures constructs such as user satisfaction, trust, and mental models. In contrast, functionality-grounded evaluation requires no human subjects; instead, it uses a formal definition of interpretability as a proxy to evaluate explanation quality [3,16,19,80].
In XAI evaluation studies, measurement constructs are well-defined theoretical concepts or variables that researchers aim to quantify and measure to assess the effectiveness of XAI systems [23,24]. Epistemologically, measurement constructs are defined by both the subject matter and their theoretic concepts, as well as by the intended evaluation goals [8,10,15].
Various taxonomies for human-grounded XAI evaluation measures have been established and researched [78,79,105,109]. According to these taxonomies, evaluation measures are mainly divided into four categories, i.e., trust, usability, understandability, and human–AI task performance. Each category corresponds to specific XAI constructs evaluated from the human perspective, derived from existing studies across several research areas [7,16,103]. However, based on our selected literature sample, we grouped the XAI constructs into the categories shown in Table 2.
In the following subsections, we focus on these qualitative and quantitative measures.

5.1. Understandability

Understandability refers to the quality of explanations being easy to comprehend. It is also usually defined by the user’s mental model of the system and its underlying functionality [78,110]. In the context of XAI, the rationale behind evaluating understandability is to examine whether explanations facilitate the user’s understanding of the system-related aspects [111].
Understandability is a complex theoretical construct encompassing multiple dimensions and is influenced by various factors. Consequently, it can be evaluated from different perspectives and operationalized in different ways. In our literature review, we identified three approaches that are not mutually exclusive: evaluating the user’s perceived understanding, evaluating the user’s mental model, and evaluating the user’s model output prediction.

5.1.1. Mental Model

The goal of XAI is not merely to present text or visualizations on a computer screen but to help the user form a mental model of why and how an AI system reaches its conclusions. Cognitive psychology defines a mental model as a representation of how a person understands certain events, processes, or systems, or as a representation of the user’s mental state in a particular context. In this regard, the design of the explanation’s structure, types, and representation should contribute to user understanding and create more precise mental models [112].
In our literature review, Hoffman et al. [35] mainly dealt with mental models on a theoretical and methodological level. Following this work, a mental model reflects how a person interprets and understands an AI system’s functioning, processes, and decision-making [35]. The authors emphasized that clear and accurate mental models help users comprehend why the system makes certain decisions [35]. Conversely, inadequate or flawed mental models can lead to misunderstandings and incorrect decisions [35].
In addition to these theoretical considerations, Hoffman et al. [35] discussed the methodological challenges in empirically eliciting and analyzing mental models. They underscored that “there is a consensus that mental models can be inferred from empirical evidence” [35], and they outlined various methods to capture and analyze users’ mental models systematically, such as think-aloud protocols, structured interviews, retrospective task reflection, concept mapping, prediction tasks, and glitch detection tasks. These methods aim to uncover users’ mental models qualitatively by reconstructing them from people’s expressions and descriptions of their understanding of the system, either verbally in interviews or visually through concept mapping. In addition, methods like prediction tasks or glitch detection tasks can be used to quantitatively assess how well users’ mental models align with the AI system’s actual functioning and to identify where misunderstandings or misconceptions may exist. In terms of performance, mental models need not be perfectly accurate or entirely correct; it is enough if they are sufficiently robust to inform user behavior and be effective in practice.
Only a few works in our sample explicitly refer to mental models and examine qualitatively how people interpret the system, namely the studies [15,29,32,33] and [36]. For instance, Chromik et al. [15] mentioned that understandability can be evaluated by assessing participants’ mental models of the system. Mohseni et al. [16] asked their participants to review the visualization used to make the system’s classification decision understandable. Similarly, Kaur et al. [30] asked their participants to describe the shown explanations to better understand their mental models. The most elaborate studies were those of Alizadeh et al. [29] and Ngo et al. [32]. In Alizadeh et al.’s [29] study, folk concepts and mental models are understood as individuals’ representations of AI—how they believe AI systems function, what they expect from AI, and how they perceive its role in their daily lives. The study emphasizes that these mental models are shaped by people’s experiences, assumptions, and interactions with AI technologies, which are in turn influenced by their social interactions and the broader cultural context [29]. The authors stress that these models are inherently “messy” and typically inaccurate, yet they guide how users interpret AI’s behavior, make decisions, and form expectations about AI’s capabilities and limitations [29]. The authors adopt a qualitative approach, using thematic analysis to uncover folk concepts from semi-structured, in-depth interviews in which people talked about their experiences, thoughts, and beliefs regarding AI systems [29]. In their study, Ngo et al. [32] refer to mental models as the internal cognitive structures that users develop of a music recommendation system. The authors employ quantitative and qualitative methods to comprehensively understand the structure and soundness of the users’ mental models. To analyze the mental models, they use think-aloud protocols, verbal explanations, and drawings in which users express their understanding of the system’s operation. To analyze mental reasoning processes in the context of an AI-supported online review classification task, Lai et al. [31] also use a qualitative method, asking participants to verbalize their reasoning using the following syntax: “I think the review is [predicted label] because [reason]”.
Overall, our review reveals that, by their very nature, mental models are highly contextualized and specific to the system and domain in question. This makes generalizing and comparing mental models challenging. For this reason, Ngo et al. [32] and Kulesza et al. [69], for instance, used additional measures, such as objective measures of how accurately the mental model describes the system’s behavior. In addition, quantitative subjective measures based on self-reports, such as perceived confidence or perceived understandability, can also be utilized.

5.1.2. Perceived Understandability

In the context of XAI, perceived understandability refers to the user’s understanding of the system’s underlying functionality in the presence of explanations [78]. In our sample, various studies [36,39,40,41,42,43,46,47,48] have used perceived understandability to evaluate the understandability of explanations, operationalizing and measuring it in different ways.
Cheng et al. [43] utilize the definition proposed by Weld and Bansal [113], which suggests that a human user “understands” an algorithm when they can identify the attributes driving the algorithm’s actions and can anticipate how modifications in the situation might result in different algorithmic predictions. They measure this understanding by asking participants to rate their agreement with the statement, “I understand the algorithm”. Regarding explainable recommender systems, Millecamp et al. [41] and Dominguez et al. [39] use questions that directly assess whether users understand why certain recommendations (e.g., songs or art images) were made. Users indicate on a Likert scale to what extent they can comprehend the explanation. Similarly, Bucina et al. [42] also use a self-report measure, asking participants to respond to the statement, “I understand how the AI made this recommendation”.
Evaluating the generic XAI framework LIME, Dieber et al. [40] investigated the interpretability of explanations through both interviews and rating scales. They measured how well users can interpret the results of a prediction model by asking open questions such as “What do you see?” or “Did you know why the model made this prediction?” In addition, they asked the participants to rate on a 10-point item scale how well they could interpret the explanations provided. Cai et al. [48] measured perceived understanding via a single item, asking participants to self-assess by rating the statement “I understand what the system is thinking”. Gao et al. [38] adopted measurement scales from Knijnenburg [114] to assess the participants’ perception of the understandability of the system. Papenmeier et al. [46] and Anik et al. [36] used Likert-scale questions to measure perceived understanding, and Kim et al. [61] let participants self-rate their level of understanding of the explanation method.
In their methodological reflection, Hoffman et al. [35] also reflected on perceived understandability as a key factor in evaluation studies. They outlined a questionnaire with an item where participants self-assess their understanding by responding “From the explanation, I understand how the [software, algorithm, tool] works”. Similarly, the questionnaire proposed by Holzinger et al. [44] includes several items related to perceived understandability. For instance, the questionnaire includes items on general understandability, such as “I understood the explanations within the context of my work”.
Overall, our review reveals a significant overlap in the theoretical understanding of the construct. Perceived understandability is a subjective measure that can be evaluated by assessing how well users comprehend explanations and how these explanations improve their overall understanding of the system’s functionality. Most studies rely on self-report measures, where participants respond to one or more Likert-scale questions to assess their understanding. However, there is no standardized questionnaire specifically for perceived understandability, particularly regarding input–output causality. This lack of standardization complicates cross-study comparisons and highlights the importance of carefully examining how the construct is operationalized in each study when interpreting results.

5.1.3. The Goodness/Soundness of the Understanding

In addition to directly analyzing users’ mental models and perceived understanding, users’ ability to predict a system’s decisions and behavior offers an indirect yet equally insightful measurement method. As Hoffman et al. aptly state, “A measure of performance is simultaneously a measure of the goodness of user mental models”. Similarly, Cheng et al. [75] argue that “a human user understands the algorithm if the human can see what attributes cause the algorithm’s action and can predict how changes in the situation can lead to alternative algorithm predictions”. Also, Schmidt et al. [53] stress that intuitive understanding is expressed by the decision-making performance of the users: “Faster and more accurate decisions indicate intuitive understanding”. In addition, Chromik et al. [15] mention that the goodness of the understanding can be assessed “through prediction tests and generative exercises” [15].
These quotes highlight that a user’s ability to know and predict system behavior serves as an indicator of how well their mental model functions, and, by extension, how well the system’s explanations have been understood. These predictive abilities provide an objective metric for evaluating the comprehensibility of explanations, transcending subjective perception, and reflecting both actual understanding and trust in the system. Regarding this, an explanation is considered understandable if the user can predict or describe the model’s behavior and output in a particular situation or using particular data [78]. Hence, the accuracy of the user’s prediction could serve as a metric to assess understandability.
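To make this concrete, the following minimal sketch (with entirely hypothetical trial data) shows how such a forward-simulation metric could be computed: the share of trials in which a participant correctly anticipates the model’s output.

```python
"""Minimal sketch of an objective understandability metric: the share of
trials in which a participant correctly predicts the model's output
(forward simulation). All data below are hypothetical."""

from dataclasses import dataclass


@dataclass
class Trial:
    predicted_by_user: str   # label the participant expects the model to output
    model_output: str        # label the model actually produced


def simulation_accuracy(trials: list[Trial]) -> float:
    """Proportion of trials in which the user anticipated the model's output."""
    if not trials:
        raise ValueError("No trials recorded")
    hits = sum(t.predicted_by_user == t.model_output for t in trials)
    return hits / len(trials)


# Hypothetical session: a participant predicts the model's decision in 5 trials.
session = [
    Trial("approve", "approve"),
    Trial("reject", "approve"),
    Trial("approve", "approve"),
    Trial("reject", "reject"),
    Trial("approve", "reject"),
]
print(f"Forward-simulation accuracy: {simulation_accuracy(session):.2f}")  # 0.60
```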
Several studies within our sample [37,42,43,49,50,51] utilized such evaluation measures in various ways. For instance, in the study by Van der Waa et al. [49], participants completed multiple trials and, after each trial, were asked to predict the system’s output and to indicate which input factor they thought was responsible for it. In a similar way, Cheng et al. [43] evaluated whether participants could anticipate how changes in the situation might result in different system behaviors and whether they could identify the attributes that influence the algorithm’s actions. Poursabzi et al. [50] focused on laypeople’s ability to simulate a model’s predictions. In a qualitative manner, Liu et al. [52] used a concurrent think-aloud process to analyze input–output understandability, where participants verbalized the factors they considered to be behind a prediction.
Similar measures were also used to assess the soundness of mental models. For instance, Ngo et al. [32] used multiple-choice comprehension questions to assess whether users understood the system’s behavior correctly. In addition, participants rated their overall confidence in understanding the system on a 7-point Likert scale. Similarly, Sukkerd [34] assessed the soundness of the mental model in his user study by evaluating both whether participants correctly determined the system behavior and their confidence in their assessment. Also, Kulesza et al. [69] assessed the soundness of users’ mental models by asking participants multiple-choice questions about system behavior and having them rate their overall confidence in understanding the system on a 7-point scale.
Concerning the goodness of understanding, Schmidt et al. [53] measured the time users needed and the errors they made in an AI-supported classification task. Narayanan et al. [51] measured understandability by determining whether participants correctly identified if the output was consistent with both the input and the provided explanation. Also, Bucina et al. [42] measured how well users could predict the AI’s decisions based on the explanations given. Lastly, Deters [37] used the number of correct responses to indicate that the user understands the explanations provided. Poursabzi et al. [50] also evaluated laypeople’s ability to detect when a model has made a mistake. In some cases, the soundness of the understanding can also be used to measure the effectiveness of explanations concerning task performance.
In summary, we identified the following three methodologies in our sample for evaluating understandability:
  • Qualitative methods involve uncovering users’ mental models through introspection, such as interviews, think-aloud protocols, or drawings made by the users.
  • Subjective–quantitative methods assess perceived understandability through self-report measures.
  • Objective–quantitative methods evaluate how accurately users can predict and explain system behavior based on their mental models.
These three approaches are not mutually exclusive but rather complement each other, providing a comprehensive understanding of how well something is understood.

5.1.4. Perceived Explanation Qualities

In everyday language, the quality of an explanation refers to how effectively it communicates and makes the intended information understandable. In research, it is typically defined by the formal attributes of the explanation’s form, content, and structure or by the formal properties of the method used to generate it. In the case of functionality-grounded evaluations [10,115], for instance, explanation methods are analyzed with regard to their fidelity (how accurately the method approximates the underlying AI model), stability (whether the method generates similar explanations for similar inputs), consistency (whether multiple explanations for the same input are similar), and sparsity (the degree to which the number of features or elements in the generated explanation is minimized to reduce complexity).
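To illustrate how such functionality-grounded properties can be operationalized without human subjects, the sketch below computes two of them, sparsity and stability, for feature-attribution vectors; the vectors, the near-zero threshold, and the use of cosine similarity are illustrative assumptions rather than metrics prescribed by the cited works.

```python
"""Illustrative operationalization of two functionality-grounded metrics for
feature-attribution explanations: sparsity and stability. The attribution
vectors and thresholds are hypothetical."""

import numpy as np


def sparsity(attribution: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of features whose attribution is (near) zero: higher = sparser."""
    return float(np.mean(np.abs(attribution) < eps))


def stability(attr_original: np.ndarray, attr_perturbed: np.ndarray) -> float:
    """Cosine similarity between explanations of an input and a slightly
    perturbed version of it: values near 1 indicate a stable method."""
    num = float(np.dot(attr_original, attr_perturbed))
    denom = float(np.linalg.norm(attr_original) * np.linalg.norm(attr_perturbed))
    return num / denom if denom else 0.0


# Hypothetical attribution vectors produced by some explanation method.
a = np.array([0.70, 0.00, 0.25, 0.00, 0.05])
a_perturbed = np.array([0.65, 0.02, 0.28, 0.00, 0.05])

print(f"Sparsity:  {sparsity(a):.2f}")               # 0.40 (2 of 5 features are ~0)
print(f"Stability: {stability(a, a_perturbed):.3f}")  # close to 1.0
```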
Regarding human-centered evaluation, the quality of an explanation is assessed based on the perception of the target audience. This approach considers how well the explanation resonates with users, considering their cognitive abilities, practical needs, goals, and the context in which they use the explanation. Additionally, explanation qualities, such as complexity, completeness, consistency, and input–output relationships, are not formally assessed but are evaluated based on how they are perceived by the users. In our sample, we found studies that have evaluated both the overall explanation quality and specific explanation qualities from the user’s perspective.
When examining overall explanation quality, the focus is on how users perceive the quality of the explanations provided or of the system as a whole. Evaluating an explanation-driven interactive machine learning (XIML) system, Guo et al. [38], for instance, investigated perceived explanation quality by focusing on the system as a whole and asking participants about their perceptions of the feedback provided by the XIML system. Similarly, in the context of recommender systems, Liao et al. [47] as well as Tsai et al. [57] evaluated perceived explanation quality via the perceived quality of the provided recommendations, asking participants to rate the statements “[The system] can provide more relevant recommendations tailored to my preferences or personal interests” [47] and “The app provides good medical recommendations for me” [57], respectively. In contrast, Naveed et al. [25,26], in their user studies on explainable recommender systems, distinguished between recommendation quality on the one hand and explanation quality on the other. Also, Mohseni et al. [16] focused on explanation quality in their study of an image classification task by directly asking participants to rate how well the AI explained the classification of the image. Guesmi et al. [58] included the item “How good do you think this explanation is?” in their questionnaire and interpreted this item as an indicator of satisfaction.
Several studies in our sample also evaluate specific explanation qualities to delve more comprehensively into the nuances of how concepts or features are explained. By focusing on specific qualities of explanations, researchers aim to uncover how different types of explanatory information contribute to the user’s comprehension. In their survey, Schoonderwoerd et al. [28], for instance, included the questions “This explanation-component is understandable” and “From the explanation-component, I understand how the system works” to obtain more detailed insight into how users understand the explanation interfaces.
Why-understanding is an important explanation quality, referring to the goal of explaining the reasoning and rationale behind decision-making as well as the context and conditions of the decision-making. Rader et al. [55] defined why-explanations as “providing justifications for a system and its outcomes and explaining the motivations behind the system, but not disclosing how the system works” [55]. Correspondingly, they evaluated why-understanding by asking the participants “what they know about the goals and reasons behind what the [system] does” [55]. Regarding the perceived transparency of the reasoning and rationale behind the decision-making process, Tsai et al. [57] used two self-report items: “I understand why the [system’s] recommendations were made to me” and “The app explains why the [system’s] recommendations were made to me”. In a similar manner, Deters [37] used the item “Do you know why the model made this prediction?” in her study.
Input–output causality presents a similar quality, referring to the goal of making AI decision-making understandable. This involves clarifying what specific input, such as data features or variables, leads to particular outputs or decisions made by the AI model. To evaluate the perceived quality of explaining causality, the questionnaire outlined by Holzinger et al. [44] includes items such as “I found the explanations helped me to understand causality”. Additionally, the questionnaire includes items to assess if explanations are self-explanatory and understandable without external assistance.
Information sufficiency is a further quality examined in several studies [35,56,70,89]. This refers to whether explanations offer enough detail or evidence to effectively address users’ questions or tasks. To assess this quality, Hoffman et al. [35] proposed the following items: “The explanation of the [software, algorithm, tool] is sufficiently detailed” and “The explanation of how the [software, algorithm, tool] works is sufficiently complete”. In Schoeffer et al.’s [70] study, information sufficiency was measured using the item “If you feel you did not receive enough information to judge whether the decision-making procedures are fair or unfair, what information is missing?” Similarly, the item “I find that [system] provides enough explanation as to why an exercise has been recommended” in the study of Ooge et al. [56] evaluated information sufficiency concerning explainable recommender systems. Naveed et al. [89] thoroughly discussed measuring this construct. They argued that information sufficiency should be evaluated by asking participants to rate whether the explanations provided by the AI system contained enough relevant and necessary information to support their decision-making process, for instance, using Likert-scaled items originally adapted from a user-centric evaluation framework for recommender systems [89,116], such as “The explanation provided all the information I needed to understand the recommendation” and “The details given in the explanation were sufficient for me to make an informed decision”.
Explanation correctness was also a quality addressed in some evaluation studies. This refers to the quality where explanations accurately reflect the true nature of the system’s decisions or recommendations, ensuring they are not based on errors or misclassifications. Rader et al. [55], for instance, incorporated questions about correctness to assess how well participants believe the system’s outputs match their expectations and whether these outputs are free from errors. Similarly, Ford et al. [59] operationalized perceived correctness via five-point Likert-scale ratings, asking participants if they believe the system is correct.
Regarding domain-specific qualities, Naveed et al. [26] evaluated several domain-specific categories for financial support systems, asking the participants to rate how well the system explains the financial recommendations, gives evidence that it aligns with the user’s understanding, values, and preferences, and explains the domain-specific topics necessary to understand the system’s actions.

5.2. Usability

Explanations should not only be understandable but also usable. This means that explanations must be designed to be not only clear and comprehensible in content but also practically applicable and useful for the user.
In human–computer interaction (HCI), usability refers to effectiveness, efficiency, and user satisfaction. Usability has been extensively studied across various domains and can be measured by factors such as satisfaction, helpfulness, ease of use, workload, and performance. In the context of explainable systems, usability should enhance users’ work performance by providing relevant, easy-to-use, and high-quality explanations. In the following section, we outline how these issues were considered and operationalized in the various evaluation studies used in our sample.

5.2.1. Satisfaction

Satisfaction is a multifaceted theoretical construct in psychology that encompasses both affective and cognitive components. The affective component refers to the positive subjective experience of pleasure, joy, or well-being concerning a specific situation, state, or outcome [117]. The cognitive component refers to evaluating and comparing the individual’s expectations with their actual experience. When the outcome aligns with or exceeds expectations, satisfaction is achieved. Satisfaction also serves as a motivational factor [118], where satisfaction can motivate certain actions, such as adopting a technology, while dissatisfaction tends to inhibit such actions.
In usability engineering, satisfaction is defined as the freedom from discomfort and positive emotional and attitudinal responses toward a product, system, or service. Regarding explainable AI, satisfaction refers to the degree to which users find explanations provided by AI systems comprehensible, convincing, and useful in enhancing their understanding of the system’s decisions or predictions [103].
In our sample, various studies evaluate user satisfaction in the context of explainable systems [26,27,28,33,35,37,38,39,40,41,51,57,58,59]. Most of these studies treated “user satisfaction” as an established concept; the concept was therefore not discussed on a theoretical level, and the studies primarily focused on its operationalization and application.
On a theoretical level, Chromik et al. [15] defined satisfaction as an increase in ease of use or enjoyment, which can be measured by participants’ self-reported satisfaction. Hoffman et al. [35] and Dieber et al. [40] explored the construct of “satisfaction” in the context of XAI in more detail. Dieber et al. [40] stressed that satisfaction results from the use of a system, product, or service and that three key elements are important: (1) positive attitudes, which relate to the general cognitive evaluation of approval or disapproval; (2) positive emotions, expressed through reactions such as joy, happiness, or contentment; and (3) perceived comfort, which refers to how easy and intuitive the system is to use. Dieber et al. [40] emphasized that the affective component of satisfaction can be assessed through self-reports that gauge how users feel about their interaction with the system.
While Dieber et al.’s [40] definition refers to (explanatory) systems, Hoffman et al. [35] focus on the isolated explanation. They understand satisfaction as a cognitive process of a “contextualized, a posteriori judgment of explanations” [35]. According to Hoffman et al. [35], this judgment relates to understandability, where the positive experience emerges when users have achieved an understanding of how the system made a particular decision. From this perspective, they define explanation satisfaction as “the degree to which users feel that they understand the AI system or process being explained to them” [35]. In other words, Hoffman et al. define satisfaction as being subsumed under the broader construct of understandability. This is also evident in their questionnaire design, where satisfaction is measured in relation to the understandability of the explanation, asking participants to rate the statement “The explanation of how the [software, algorithm, tool] works is satisfying”.
In our sample, satisfaction has been evaluated in various contexts, such as recommender systems, data classification tasks, or fictitious explanation tasks. Guesmi et al. [58] focused on the explanation directly, evaluating satisfaction through the item “How good do you think this explanation is?”. Most studies have a broader focus, evaluating whether the user is satisfied with the system as a whole. For instance, Dominguez et al. [39] and Millecamp et al. [41] operationalize satisfaction using the single item “Overall, I am satisfied with the recommender system”. Naveed et al. [26,27,28] adopted the operationalization of Pu et al.’s [116] study, using the item “Overall, I am satisfied with the recommender”. Similarly, Kulesza et al. [33] operationalized the construct with the item “How satisfied are you with the computer’s playlists?”.
Some studies also use multiple questions to measure satisfaction. Dieber et al. [40], for example, use three questions to obtain a more detailed understanding of user satisfaction in their study: (I1) “Do we have a positive or negative attitude towards the tool?”, (I2) “What emotions arise from using it?”, and (I3) “How satisfying is the final result?”. These questions, however, do not form a psychometric scale in the traditional sense, as they have different levels of measurement (e.g., I2 is an open-ended question). Instead, these questions aim to provide a more holistic understanding of the issue. In contrast, Tsai et al. [77] utilize a multi-dimensional scale in the traditional sense. They understand satisfaction as resulting from comfort and enjoyment and measure the construct using three items: (I1) “Overall, I am satisfied with the app”, (I2) “I will use this app again”, and (I3) “I would like to share this app with my friends or colleagues”. Here, I1 measures satisfaction directly, while I2 and I3 measure the construct indirectly, based on the motivational effects of satisfaction, such as the intention to reuse and recommend the tool. However, a psychometric validation of this scale has not been conducted.
Another scale was proposed by Deters [37]. The author conceptualizes satisfaction as a multi-dimensional construct encompassing subjective usefulness, subjective enjoyment, and perceived quality of the system. Accordingly, the author outlines a 12-item questionnaire covering the affective level (e.g., through items such as “Overall, I am satisfied with the system”, “Overall, the system was enjoyable”, and “I would enjoy using the system when explanations like that are given”); the pragmatic level of making use of the explanations (e.g., through items such as “Overall, the system was easy to use” and “The explanations were intuitive to use”); the aesthetic level of representation (e.g., “The explanation is aesthetically pleasing” and “Content layout and order of elements in explanations are satisfying”); as well as further items addressing positive explanatory features (e.g., “The explanation convinces you that the system is fair while doing [action]”). While this literature-based questionnaire is quite comprehensive, it has not been psychometrically validated or empirically tested. For this reason, it is unclear whether all items measure the same theoretical construct.
Another multi-dimensional scale was proposed by Guo et al. [38]. In evaluating an explanation-driven interactive machine learning (XIML) system, they measured users’ satisfaction with eight items. Due to issues with discriminant validity, three items were removed from their System Satisfaction Scale. The remaining five items (“Using the system is a pleasant experience”, “Overall, I am satisfied with the system”, “I like using the system”, “I would recommend the system to others”, and “The system is useful”) capture both the emotional–affective and the motivational–pragmatic aspects of system use and recommendation. As the discriminant validity of these items has been demonstrated, the scale is a promising candidate for a standardized measure of user satisfaction in the context of explanations.
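As an illustration of the kind of psychometric check that such multi-item scales still require, the following sketch computes Cronbach’s alpha as a basic estimate of internal consistency; the items and ratings are invented and do not reproduce any of the scales discussed above.

```python
"""Minimal psychometric check for a multi-item Likert scale: Cronbach's alpha
as an estimate of internal consistency. The response matrix is hypothetical
(rows = participants, columns = items, 1-7 Likert ratings)."""

import numpy as np


def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)


# Hypothetical ratings of 6 participants on a 5-item satisfaction scale.
ratings = np.array([
    [6, 7, 6, 5, 6],
    [4, 4, 5, 4, 4],
    [7, 6, 7, 7, 6],
    [3, 3, 2, 3, 4],
    [5, 5, 6, 5, 5],
    [6, 6, 6, 7, 7],
])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")  # values above ~0.7 are usually deemed acceptable
```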
Overall, our literature review reveals that user satisfaction is a frequently used metric in the sample. It also shows that most studies do not focus on the individual explanation but on the user experience of interacting with the system, typically measuring overall satisfaction with a single-item questionnaire. Although there are emerging efforts to capture the multi-dimensional construct of “satisfaction” through multiple-item questionnaires, a standardized and widely accepted scale for this purpose is still lacking.
In addition to multiple-item scales, qualitative methods such as thinking aloud or interviews are also recommended [119]. This would allow for a profound understanding of context and the extent to which satisfaction resulted from the explanation design or other contextual factors. In addition, conducting expert case studies could be an alternative approach as experts, in contrast to laypersons, possess extensive knowledge of the system’s domain, enabling them to provide more thorough and insightful evaluations [120]. Moreover, additional objective measures, such as eye movement, heart rate variability, skin conductance, and jaw electromyography showing positive emotions, have also been used in research to measure user satisfaction [121,122].

5.2.2. Utility and Suitability

Usability engineering emphasizes the teleological nature of human behavior, where people use systems not only for enjoyment but also as tools to achieve instrumental goals. This instrumental perspective is reflected in the theoretical constructs of pragmatic quality [123] and perceived utility [124]. Pragmatic quality is defined as the perceived ability of a system to assist users in completing tasks and achieving their so-called do-goals, or action-oriented objectives [123]. Related to this, perceived utility describes the user’s subjective assessment of how useful a product, system, or service is in helping them achieve their specific goals. This perception is influenced not only by the actual functions of the product but also by the user’s expectations, needs, goals, individual experiences, and the context of use. Pragmatic quality [123] and perceived utility can be evaluated via different but related constructs, such as the explanation’s helpfulness, usefulness, personal relevance, and actionability.
Helpfulness as a theoretical construct refers to the degree to which explanations provided by AI systems are perceived as valuable, informative, and supportive by users in aiding their decision-making or understanding of the system’s outputs. In the literature, evaluating helpfulness is often based on self-reports, where users rate to what extent explanations are tailored to specific tasks [42,44,48,50,125,126,127,128].
In our sample, various studies [35,42,44,60,61] have evaluated the helpfulness or usefulness of AI systems’ explanations. For instance, Ford et al. [59] used a 5-point Likert scale to measure the perceived helpfulness of explanations for both correct system classifications and misclassifications. Similarly, Bucina et al. [42] evaluated helpfulness by asking participants to rate the statement “This AI helped me assess the percent fat content”, on a 5-point Likert scale. They used this rating as an indicator of the usefulness of the explanations provided. Concerning causality, Holzinger et al. [44] measured helpfulness in the questionnaire via the item “I found the explanations helped me to understand causality”.
Anik et al. [36] evaluated perceived utility by asking participants to rate the usefulness of the explanation element on 5-point Likert scale items. Similarly, Bansal et al. [60] evaluated the perceived usefulness quantitatively in a post-task survey, where participants indicated whether they found the model’s assistance helpful. Kulesza et al. [33] evaluated utility using the cost–benefit ratio, including the item “Do you feel the effort you put into adjusting the computer was worth the result?” in their post-task survey.
In contrast, Kim et al. [61] used a free-form questionnaire in their user study about pipeline-generated visual explanations, where participants were asked to write about their views on usefulness. Studying how data scientists use XAI tools, Kaur et al. [30] also asked the participants whether the explanations were useful and, if so, how they would use them in their typical work. Lai et al. [31] asked the participants of the user study to report their subjective perception of tutorial usefulness as reported in the exit surveys.
Another indicator of usefulness is actionability, which means the ability of users to apply the explanations within their specific context. Hoffman et al. [35] suggest measuring actionability through the item “The explanation is actionable, that is, it helps me know how to use the [software, algorithm, tool]”. This concept emphasizes that a useful explanation should provide information that guides users toward reaching their goals. Closely related to this is the concept of “ease of use”, which also assesses how well the explanation facilitates the process of making use of explanations. This construct thus evaluates not only the possibility of understanding and applying explanations within the particular context but also the effort and workload required of the user in doing so.
In most studies in our sample, the authors take for granted the user’s goals and the purposes for which explanations are needed, such as a general understanding of causal relationships in AI systems [44] or, quite specifically, analyzing a meal’s fat content [59]. However, in real-world settings, users’ goals are often vague and not clearly defined. In such cases, the first step in a user study is to explore which explanations are relevant and suitable for the particular context and the target user. In our sample, we identified two studies that focus on this issue. Dodge et al. [62], for example, investigated how users explain system behavior and what types of questions participants (StarCraft II players) ask when trying to understand the decisions of an allegedly AI-controlled player. The goal was not to measure the soundness of mental models but rather to analyze what types of explanations are relevant to making AI behavior understandable. To uncover explanation needs and visual explanation style preferences, Kim et al. [61] conducted a formative, qualitative study in which the participants wrote natural language questions and provided answers and explanations for their answers. They analyzed the results to determine what explanations the users requested and how these could be visually represented.
Using a mixed-methods approach, Naveed et al. [25] also investigated what kind of explanations would be helpful in a particular context. In their study on financial support systems, they showed that users want explanations to understand input–output causality (e.g., which inputs are used to determine the recommended portfolio), the outcome (e.g., why option A is recommended instead of option B), the procedure (e.g., which decision steps were taken by the system), and the context (e.g., which portfolios are recommended to other users). To understand users’ explanation needs with regard to online symptom checker apps for laypeople, Schoonderwoerd et al. [28] conducted semi-structured interviews and analyzed them with the help of thematic analysis. In addition, they used a questionnaire asking participants to rate the perceived importance of information elements in different use scenarios.
Concerning the heterogeneity of users, domains, and contexts, Deters [37] also argued that explanations should be tailored to the specific circumstances. For this reason, she defines the criterion of suitability, which comprises the following sub-criteria: Suitable for the User, where the system should be adapted to the particular target groups; Suitable for the Goal, where the system should be adapted to the particular task to be performed; and Suitable for the Context, where the system should be adapted to the environmental conditions of the use context. In evaluation studies, this criterion should be addressed via perceived suitability, for instance, by asking participants in a survey “In what use case would you use the explanations?”.
Once the hypothesis about what will be relevant is established, the next step would be to validate this through user studies. This can be achieved, for example, by including the construct of personal relevance in the study design. In our sample, Jin et al. [76] evaluated the explanation need by asking the participating physicians in the user study whether they would use the explanations in their work for cancer diagnosis. Schoonderwoerd et al. [63] included in their post-task survey the item “This explanation-component is important”. Similarly, Ooge et al. [56] include in their questionnaire the item “I find it important to receive explanations for recommendations”. Dominguez et al. [39] evaluate the personal relevancy of explanations in the case of the art recommender system by asking the participants to rate the statement “The art images recommended matched my interests”.
In summary, our review shows that usefulness and actionability are important in evaluation studies, which can be measured quantitatively via self-reporting in surveys. In the case of real-world studies or new areas of application and/or target groups, our overview suggests conducting a qualitative, exploratory study first. In this way, it is possible to determine which explanations are suitable and relevant for the respective context.

5.2.3. Task Performance and Cognitive Workload

Satisfaction is not the only goal in usability engineering; improving task performance is equally important. According to ISO 9241-11, task performance is defined by two main components: effectiveness and efficiency. Similarly, Lim and Dey [129] define performance in the context of XAI as the degree of success with which the human–AI system effectively and successfully conducts the task for which the technology is designed. With regard to human–AI collaboration, performance can refer to different aspects. It can relate to the technical system, such as being measured by the model’s accuracy in making correct decisions or predictions. Additionally, it can also refer to the user’s performance in utilizing the system’s output. However, the performance of the system often cannot be separated from the performance of the user, and vice versa. In such cases, task performance refers to the joint effectiveness of both the AI and the user working together.
Effectiveness refers to the accuracy and completeness with which users perform tasks and achieve their goals when interacting with a system. The specific operationalization of this construct depends on the nature and structure of the task or goal. Most commonly, effectiveness is measured by the success rate in completing tasks or sub-tasks, often quantified by the number of successfully accomplished trials within a specific time period [130]. For example, in game-based applications like chess, task effectiveness can be assessed using metrics such as winning percentage and percentile ranking of player moves [131]. In the context of decision-support systems, effectiveness can be measured by how much the decision-making process is improved [132]. Analogously, effectiveness in recommender systems can be measured by the percentage of cases in which the user finds a suitable item. In cases where the goal is to explain the system behavior, effectiveness can be measured by the percentage of accurate user predictions of the system’s output, typically evaluated through metrics like the number of hits, errors, and false alarms [60,130] (as outlined in previous sections, this approach also serves as an indirect measure of understandability).
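The sketch below illustrates how such objective effectiveness metrics might be computed for an AI-assisted binary decision task; the labels and decisions are hypothetical and merely demonstrate the success-rate, hit, and false-alarm calculations.

```python
"""Sketch of objective effectiveness metrics for an AI-assisted binary
decision task: success rate plus hit and false-alarm rates. Labels are
hypothetical (1 = positive class, 0 = negative class)."""

ground_truth   = [1, 0, 1, 1, 0, 0, 1, 0]
user_decisions = [1, 0, 0, 1, 1, 0, 1, 0]

# Share of trials in which the user's final decision matches the ground truth.
success_rate = sum(u == g for u, g in zip(user_decisions, ground_truth)) / len(ground_truth)

positives = [u for u, g in zip(user_decisions, ground_truth) if g == 1]
negatives = [u for u, g in zip(user_decisions, ground_truth) if g == 0]

hit_rate = sum(u == 1 for u in positives) / len(positives)          # correct detections
false_alarm_rate = sum(u == 1 for u in negatives) / len(negatives)  # wrong "positive" calls

print(f"Success rate: {success_rate:.2f}")        # 0.75
print(f"Hit rate: {hit_rate:.2f}")                # 0.75
print(f"False alarm rate: {false_alarm_rate:.2f}")  # 0.25
```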
In our sample, various studies evaluate effectiveness in the context of explainability. Theoretically, Chromik et al. [15] define the effectiveness of explanations as helping users make good decisions. On an empirical level, various studies explore how well different explanation methods help users reach their goals in different contexts. Van der Waa et al. [49], Tsai et al. [57], Deters [37], Alufaisan et al. [65], Zhang et al. [68], and Carton et al. [69] address effectiveness in the context of decision- and prediction-support systems, but they use different operationalization approaches. Van der Waa et al. [49] employ an objective measure, evaluating effectiveness by counting the number of times a correct decision is made. Similarly, Alufaisan et al. [65], Zhang et al. [68], Carton et al. [69], and Kim et al. [61] measured the effect of explanations on users’ accuracy in AI-assisted prediction/classification tasks. In contrast, Tsai et al. [57], Deters [37], and Schoonderwoerd et al. [63] focus on softer, more subjective aspects of effectiveness, such as whether users make good decisions [57], better decisions [37], or improved decision-making [63]. To assess these factors, they use subjective measures based on self-reports, asking participants to rate statements like “The app helps me make better medical choices” [57], “The explanation provided contains sufficient information to make an informed decision” [37], and “This explanation-component improves my decision-making process” [63]. Millecamp et al. [41] use a similar approach to measure effectiveness in the case of explainable music recommender systems, using the item “The recommender helped me find good songs for my playlist” to operationalize the construct. Evaluating different kinds of explanations supporting humans in classification tasks, Schmidt et al. [53] and Lai et al. [31] quantified effectiveness via the percentage of instances correctly labeled by participants. Jin et al. [76] also addressed a classification task in which physicians diagnosed cancer with the help of AI. They operationalized effectiveness as participants’ task performance accuracy, which was measured under different conditions (with and without explanations) by asking participants “What is your final judgment on the tumor grade?”.
In his work, Sukkerd [34] evaluates the effectiveness of explanations by focusing on understandability. Since the primary design goal was to improve users’ understanding, he uses measures of understandability as a proxy for effectiveness, studying how well the explanation approach enables the users to understand the AI decisions and to assess whether those decisions are appropriate. In a similar approach, Cheng et al. [43] assess the effectiveness of different explanation concepts by comparing how well participants understand the algorithm and how much they trust it, with each group receiving access to a specific explanation interface, along with a control group.
Dieber et al. [40] conceptualize effectiveness as a multi-dimensional construct to evaluate the XAI framework LIME from a user’s perspective. According to Dieber et al. [40], effectiveness is associated with the complete and accurate completion of tasks, how well users achieve their goals using the system, and the effective mitigation of potential negative consequences if the task is not completed correctly.
Efficiency refers to the resources required to achieve specific goals or complete a specific task [64]. It is usually measured by the time a user needs to successfully complete a task, using a timer/stopwatch or a log of time stamps representing when the user starts and finishes the task [110]. This is also reflected in our sample, where efficiency was typically measured by response time [51,59], reaction time [65], annotation time [53], seconds per task completion [68], interaction time [33], time spent using the tool [43], or faster decision-making [15]. An objective measure was also used by Schaffer et al. [66], who evaluated an explanation interface concept using a fictitious multi-round game. They assessed efficiency by the number of moves participants needed to solve the task. Similarly, Kulesza et al. [33] also counted the number of interactions with the system. In contrast, Guesmi et al. [58] used a subjective measure in the context of explainable recommender systems. They evaluate efficiency by asking participants to rate the statement “This explanation helps me determine more quickly how well the recommendations match my interests”.
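As a minimal illustration of this logging approach, the following sketch derives per-task completion times from hypothetical start and finish timestamps recorded by the study software.

```python
"""Sketch of an objective efficiency measure: seconds per completed task,
derived from logged start/finish timestamps. The log entries are hypothetical."""

from datetime import datetime

# Hypothetical log: (task id, start, finish) timestamps recorded by the study software.
task_log = [
    ("T1", "2024-05-02 10:00:05", "2024-05-02 10:01:35"),
    ("T2", "2024-05-02 10:02:10", "2024-05-02 10:03:02"),
    ("T3", "2024-05-02 10:04:00", "2024-05-02 10:06:12"),
]

fmt = "%Y-%m-%d %H:%M:%S"
durations = [
    (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    for _, start, end in task_log
]

print(f"Seconds per task: {durations}")                          # [90.0, 52.0, 132.0]
print(f"Mean completion time: {sum(durations) / len(durations):.1f} s")
```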
Mental or cognitive workload is another important theoretical construct for measuring efficiency in terms of the mental effort required to perform the task and the mental resources, such as attention, memory, and cognitive abilities, demanded to reach the goal [133]. A high cognitive workload is linked to stress, increased mental activity, and information-processing behavior. This is reflected in physiological activities, such as changes in heart rate, brainwave activity, skin conductance, pupil size, and saccadic movements. Hence, measuring these changes using EEG, GSR, or eye-tracking systems is often used as an objective operationalization of the theoretical construct [134,135]. Another approach to measuring users’ cognitive load is to capture the log reading time in memorizing explanations [133]. Another approach is to use subjective measures for operationalization by collecting self-reported data, where individuals assess their own perceived cognitive workload [121]. The NASA Task Load Index (NASA-TLX) is a widely used subjective rating scale for measuring the user’s perceived workload during interaction task performance or in post-task surveys [136]. The NASA-TLX operationalizes workload across six dimensions by asking participants to evaluate their experience in terms of mental, physical, and temporal demands, performance, effort, and frustration.
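For illustration, the raw (unweighted) TLX score is simply the mean of the six subscale ratings; the sketch below uses hypothetical ratings and omits the optional pairwise weighting procedure.

```python
"""Illustration of the raw (unweighted) NASA-TLX score: the mean of the six
subscale ratings (0-100). The ratings below are hypothetical; the full TLX
additionally weights each subscale via 15 pairwise comparisons."""

tlx_ratings = {
    "mental_demand":   70,
    "physical_demand": 10,
    "temporal_demand": 45,
    "performance":     30,   # coded here so that 0 = perfect, 100 = failure
    "effort":          55,
    "frustration":     40,
}

raw_tlx = sum(tlx_ratings.values()) / len(tlx_ratings)
print(f"Raw TLX workload score: {raw_tlx:.1f} / 100")  # 41.7
```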
In our sample, the NASA-TLX was the most used method for evaluating cognitive workload. For instance, Kaur et al. [30], Kulesza et al. [33], and Paleja et al. [64] used this questionnaire to measure task workload. In some cases, other subjective measures were employed in the user studies, such as asking questions about usage effort [27] or evaluating mental demand by asking participants “How mentally demanding was it to understand how this AI makes decisions?” [42]. In contrast, we found only one study in our sample that evaluated cognitive workload using objective measures.
This is likely because these measurements are more challenging to implement in user studies than self-report-based measures. To measure the mental workload, Cai et al. [48] logged the time spent on the explanation interface.
In summary, our review shows a strong consensus on the theoretical construct of performance, often defined in terms of effectiveness, efficiency, and mental workload. However, the operationalization of this construct varies significantly depending on the specific design goals and the nature of the task. Evaluating this construct effectively requires a well-defined understanding of the context and a precise specification of the task, ensuring that the measurement of task performance is both relevant and rigorous.

5.2.4. User Control and Scrutability

Controllability is a critical aspect of usability. It refers to the degree to which users can directly manage and influence the behavior of a system or application to meet their needs and preferences. Various studies have explored user control in different contexts. For instance, Ngo et al. [32] and Rader et al. [55] explore this issue qualitatively by studying how users perceive controllability in commercial social media platforms like Facebook and recommender systems like Netflix. Concerning this, they investigated how explanations might enhance perceived controllability. Guo et al. [38] assessed perceived controllability quantitatively by surveying participants about their sense of control over the system.
In our sample, user control is closely linked to scrutability. This is motivated by the fact that AI systems do not always deliver correct outcomes and are not always aligned with user preferences [37]. In this context, scrutability relates to user control as it not only allows users to inspect the system’s models but also enables them to influence future behavior by providing feedback when the system is incorrect [58].
Chromik et al. [15] define scrutability as the ability of users to inform the system when it has made an error. Deters [37] highlights that explanations contribute to scrutability by enhancing the user’s understanding of when the system is wrong and by providing a mechanism to report such errors. Operationalizing this construct, Guesmi et al. [58] included the survey item “The system allows me to give feedback on how well my preferences have been understood”. Deters [37] addressed the multi-dimensionality by operationalizing the construct with the following three items: “The system would make it difficult for me to correct the reasoning behind the recommendation”, “The response allows me to understand if the system made an error in interpreting my request”, and “I felt in control of telling the system what I want”.

5.3. Integrity Measures

In the studies in our sample, trust, transparency, and perceived fairness were common evaluation measures. We grouped them under the label “integrity”, as these measures address this concept in various ways. Firstly, system integrity is essential for establishing and maintaining the trustworthiness of a system. Integrity ensures that the system consistently behaves in a reliable and predictable manner, adhering to its intended purpose and ethical standards. Secondly, system transparency is integral to maintaining integrity, as it enables users and stakeholders to observe and understand the system’s processes and decisions. Thirdly, fairness is also a key component of integrity, ensuring that the system’s operations are free from bias and that decisions are made equitably and ethically. Assessing these dimensions of integrity—trust, transparency, and fairness—provides a comprehensive understanding of a system’s reliability and ethical performance.

5.3.1. Trust

Trust is a well-established concept in social science. Due to its long history and multidisciplinary nature, trust has been defined in various ways. Still, trust can generally be defined as “the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party” [129]. Concerning interactive systems, trust also refers to “the extent to which a user is confident in and willing to act on the recommendations, actions, and decisions of an artificially intelligent decision aid” [137].
It has also become an important topic in AI research, as trusting systems is essential, particularly in critical decision-making processes, when users do not fully understand how the system arrives at its conclusions or only have limited control over its behavior. This central importance is also reflected in our sample, where many studies investigate how explanations impact trust and confidence [15,27,28,30,34,35,36,37,38,39,40,41,42,43,46,47,48,49,50,52,53,56,57,58,59,60,61,63,64,65,66,67,68,69,70,71,75,76].
Analyzing the studies in our sample in detail reveals that the theoretical understanding of the concept is shaped by various disciplines such as psychology, service science, and technology studies. From a psychological perspective, trust is a multidimensional construct encompassing both cognitive and affective components [137]. The cognitive aspect of trust refers to the user’s intellectual assessment of the AI system’s characteristics, such as accuracy, transparency, and reliability. On the other hand, the affective component deals with emotional responses such as feelings of safety, confidence, and comfort when interacting with the system. Institution-based trust theories complement this understanding by emphasizing that trust presupposes a certain level of confidence in the service provider’s integrity, benevolence, and competence [138]. Competence refers to the system’s ability to fulfill its promises (e.g., delivering a product on time). Integrity is demonstrated by the system’s consistent and reliable actions, and benevolence indicates that the system prioritizes the user’s interests above its own, showing genuine concern for the user’s well-being. A third perspective comes from technology studies, where the conceptualization also addresses specific issues in the human–machine context, such as users perceiving a system as trustworthy when it is reliable, secure, transparent, and understandable and when the system behaves in a predictable and consistent manner [139]. In addition, there is a behavioral view in the literature on trust that emphasizes that trust is not only an internal state of mind but also becomes evident through the user’s behavior—specifically when users actively engage with a system, utilize its functionalities, or follow its decision-making and recommendations [140].
In our literature survey, we observed various perspectives on trust manifesting in different ways of measuring it. Concerning the psychological dimension, various operationalizations in our sample emphasize the affective side of trust. Chromik et al. [15], for instance, stress that trust aims to increase the user’s confidence. Similarly, Alufaisan et al. [65], Sukkerd [34], Bansal et al. [60], and Cheng et al. [43] define trust as the confident expectation that one’s vulnerability will not be exploited. To measure trust, Liao et al. [47] adopt the trust scale developed by Madsen and Gregor [137], which also covers the affective dimension of trust, including items like “I find [system] likable” and “My interaction with [system] is enjoyable”. In the same vein, Paleja et al. [64] apply perceived likability and positive affective traits as indicators of trust in human–machine interactions.
The behavioral perspective on trust is particularly evident in the work of Zhang et al. [68], who argue that subjective self-reported trust may not be a reliable indicator of trusting behaviors. Instead, they measure trust by objectively assessing how often users rely on and agree with the system’s decision-making. Likewise, Carton et al. [69], Schmidt et al. [53], and Liu et al. [52] measure trust behaviorally by examining how often users agree with the system’s decisions. Van der Waa et al. [49] and Liu et al. [52] further stress that a trust metric must capture cases of over-trust and under-trust using objective measures. This involves evaluating the number of instances where humans agree with decisions wrongly made by the system (over-trust) and, conversely, where humans disagree with decisions correctly made by the system (under-trust).
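A minimal sketch of such behavioral trust metrics, using hypothetical trial data, could look as follows: it computes the overall agreement rate together with over-trust (following a wrong system decision) and under-trust (overriding a correct one).

```python
"""Sketch of behavioral trust metrics: overall agreement with the system,
over-trust (agreeing when the system is wrong), and under-trust (overriding
the system when it is right). Trial data are hypothetical."""

from dataclasses import dataclass


@dataclass
class Trial:
    system_decision: str
    user_decision: str
    correct_answer: str


def trust_metrics(trials: list[Trial]) -> dict[str, float]:
    agree = [t for t in trials if t.user_decision == t.system_decision]
    system_wrong = [t for t in trials if t.system_decision != t.correct_answer]
    system_right = [t for t in trials if t.system_decision == t.correct_answer]
    return {
        "agreement_rate": len(agree) / len(trials),
        # over-trust: share of wrong system decisions the user followed anyway
        "over_trust": sum(t.user_decision == t.system_decision for t in system_wrong) / len(system_wrong),
        # under-trust: share of correct system decisions the user overrode
        "under_trust": sum(t.user_decision != t.system_decision for t in system_right) / len(system_right),
    }


# Hypothetical trials from an AI-assisted decision task.
trials = [
    Trial("approve", "approve", "approve"),
    Trial("reject",  "approve", "approve"),  # user appropriately overrides a wrong system decision
    Trial("approve", "approve", "reject"),   # over-trust
    Trial("reject",  "reject",  "reject"),
    Trial("approve", "reject",  "approve"),  # under-trust
    Trial("reject",  "reject",  "approve"),  # over-trust
]
print(trust_metrics(trials))  # agreement 0.67, over-trust 0.67, under-trust 0.33
```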
Schaffer et al. [66] and Hoffman et al. [35] also link trust to user actions but use subjective measures in their operationalization. Hoffman et al. [35], for instance, stress that, at a minimum, a trust scale should include both a perceptual and a behavioral component, operationalized by the items “Do you trust the machine’s outputs?” (trust) and “Would you follow the machine’s advice?” (reliance) [35].
In our sample, some studies adopt trust scales, which come from service science, where trust is crucial for establishing and maintaining long-term relationships between service providers and clients. For instance, Cai et al. [48] adopted the organizational trust scale from Roger et al. [129], while Guesmi et al. [58] and Ooge et al. [56] adopted the trust scale from Wang and Benbasat [141]. Both scales are based on self-reports and are multi-dimensional, addressing integrity, benevolence, and competence in service provision. The studies in our sample adopt these dimensions in relation to the particular context, including items to measure the perceived competence (e.g., “the system seems capable” [48], “the system has the expertise to understand my needs and preferences” [58], “[the system] has the expertise (knowledge) to estimate my [needs]” [48]), the perceived benevolence (e.g., “the system seems benevolent” [48], “the system keeps my interests in mind” [58], “[the system] prioritizes that I improve [myself]” [56]), and the perceived integrity (“the system is honest“ [58], “[the system] makes integrous recommendations]” [48]).
The technology-oriented operationalizations in our sample address additional technical attributes that are relevant for building trust. Adopting established technology-trust scales [137], Fügener et al. [75] and Colley et al. [67], for instance, include self-reporting statements that address the attributes of consistency and predictability. In their questionnaires, Fügener et al. [75] use the positive-polarity item “The system responds the same way under the same conditions at different times”, while Colley et al. [67] include a corresponding negative-polarity item, “The system behaves in an underhanded manner”, to measure trust. In her operationalization, Deters [37] also emphasizes the role of consistency in evaluating trustability, asserting that explanations should be coherent and stable to avoid confusing users and undermining their trust. From this stance, her trustability questionnaire includes the item “The explanation seems consistent” [37]. Fügener et al. [75] also consider perceived understandability an essential indicator of trust, operationalized by items such as “Although I may not know exactly how the system works, I know how to use it to make decisions about the problem” [75]. Similarly, Van der Waa et al. [49] stressed that understanding can serve as a proxy for trust. Accordingly, they did not measure trust directly but instead measured constructs like understanding and persuasion.
In various cases, the studies in our sample did not explicitly refer to a specific trust theory but rather relied on a common sense understanding of the concept. Most often, the self-reporting approach was used, where participants were asked general questions about their perceived trust in the system and its outcomes. For example, Guo et al. [38] simply assessed trust by asking participants to express their level of trust in the studied explanation systems. Dominguez et al. [39] measured trust in the system’s outcome using a single item: “I trusted the recommendations made”. Similarly, Millecamp et al. [41] evaluated perceived trust with the statement “I trust the system to suggest good songs”.
Overall, our survey shows that trust is an essential construct in the empirical evaluation of explanations. Our survey also shows varying interpretations and operationalizations across different studies, utilizing objective measures that observe user behavior and subjective measures that rely on self-reports. The advantage of objective measures is that they allow researchers to uncover when explanations lead to over-trust and under-trust. In contrast, the advantage of subjective measures lies in capturing the nuanced perceptions of trust. Our survey further shows that psychological, institution-based, and technology-oriented trust theories shape the various operationalizations in our sample. These various perspectives are not mutually exclusive but rather complement each other. In conclusion, researchers should combine different perspectives and measures in their evaluation studies to comprehensively understand how explanations affect trust in a particular context.

5.3.2. Transparency

Transparency is a well-discussed topic in research that addresses the challenges posed by AI systems that function as black boxes. By making AI systems’ decision-making processes more visible, transparency allows stakeholders to evaluate whether these processes align with ethical standards and societal values. Moreover, transparency should ensure accountability, predictability, trustworthiness, and user acceptance. Wahlström et al. [142] define the concept by stating that “transparency in AI refers to the degree to which the AI system reveals the details required for a person to accurately anticipate the behavior of the system”.
Transparency is also an essential topic in XAI, where explainability is seen as the key to achieving transparency by providing clear, comprehensible explanations of how AI systems work and how they reach their conclusions. In our sample, several studies also addressed transparency as an explicit evaluation measure [15,32,37,55,56,57,58,61]. Upon closer analysis, it becomes evident that transparency is typically conceptualized and operationalized by the constructs of understandability and controllability.
Regarding conceptualizing the construct, Chromik et al. [15], for instance, mention that transparency aims to “explain how the system works” [15]. Similarly, Deters [37] defines transparency in terms of users’ interest in “understanding what questions are asked, why, and how they affect the outcome” [37]. In the context of newsfeed algorithms, Rader et al. [55] consider transparency as a quality of explanations that helps users “become more aware of how the system works” and enables them to assess whether the system is biased and if they can control what they see. Concerning recommender systems, Ngo et al. [32] also emphasize that “transparency and control are not independent of each other” [32].
The close link to understandability is also evident in the operationalization of the construct. For instance, the transparency questionnaire suggested by Deters [37] includes items such as “I understood why an event happened” and “The response helps me understand what the [result] is based on”. Similarly, Guesmi et al.’s [58] questionnaire includes the item “it helps me to understand what the recommendations are based on”, while Tsai et al. [57] use the item “the system explains how the system works”, and Ooge et al. [56] use the item “I find that [the system] gives enough explanation as to why an exercise has been recommended” for this purpose. These examples demonstrate that transparency is typically measured through self-reported assessments of the system’s understandability.

5.3.3. Fairness

Another topic in AI and XAI research is fairness. This topic has emerged from the growing application of AI systems in critical societal domains such as healthcare, finance, and criminal justice. As these systems increasingly influence decisions that affect individuals’ lives, ensuring fairness is paramount to preventing discrimination and bias; AI systems in critical domains must therefore make unbiased decisions that are perceived as fair by the various stakeholders. Studying perceived fairness has a long history in psychology and organizational science, where it has been conceptualized as a multi-dimensional construct, including issues like distributive fairness, procedural fairness, and interactional fairness. In our sample, the studies of Grgic et al. [86], Harrison et al. [73], and Schoeffer et al. [70] have mainly evaluated perceived fairness in the context of explainability.
The studies of Grgic et al. [86] and Harrison et al. [73] focus on distributive (or group) fairness and procedural fairness. Grgic et al. [86] define distributive fairness as the fairness of AI decision-making outcomes (or ends). Similarly, Harrison et al. [73] interpret group fairness as the model’s comparative treatment of different groups, such as gender, race, or age. Both define procedural fairness in a complementary way as the fairness of the process (or means) by which individuals are judged and by which AI systems generate outcomes. The study of Schoeffer et al. [70] addresses informational fairness concerns in automated decision-making. They define informational fairness as the degree to which individuals feel they have been provided with adequate information and explanations about the decision-making process and its outcomes. Because of these different research interests, the studies operationalize fairness differently.
Harrison et al. [73] conducted an experimental study comparing two ML models for predicting bail denial, each implementing different fairness-related properties. They used a mixed-methods approach in which participants rated the fairness, bias, and utility of each model on a five-point Likert scale. In addition, the survey included free-response questions asking participants to give reasons for their ratings; these responses were thematically coded.
Grgic et al. [86] also studied ML models for predicting bail denial, focusing on how human perceptions of process fairness might conflict with the prediction accuracy of these systems. In the first step, they evaluated participants’ judgments of algorithmic process fairness by asking three yes/no questions: “Do you believe it is fair or unfair to use this information?” [86], “Do you believe it is fair or unfair to use this information if it increases the accuracy of the prediction?” [86], and “Do you believe it is fair or unfair to use this information if it makes one group of people (e.g., African American people) more likely to be falsely predicted?” [86]. In the second step, they assessed how the prediction accuracy of ML models would change if the participants’ fairness judgments were considered.
Schoeffer et al. [70] evaluated how explanations affect people’s perception of informational fairness using a scenario of an automated decision-making process for approving loan eligibility. The experimental design covers the following conditions: (i) a baseline with no explanation, (ii) an explanation of what factors are considered in making the decision, (iii) an explanation of the relative importance of these factors, and (iv) additional counterfactual scenarios. For each condition, participants rated informational fairness and trustworthiness using Likert-scaled items. The study also collected qualitative feedback through open-ended questions to gain deeper insights into what information participants believe is necessary to judge the fairness of the decision-making and into their views on the appropriateness of the explanations provided.
In their study of data-centric explanation concepts to promote transparency in, e.g., algorithmic predictive bail decisions, facial expression recognition, or admission decisions, Anik et al. [36] also included algorithmic fairness as one of their evaluation metrics. They measured fairness using a combination of quantitative and qualitative methods. After interacting with the data-centric explanations, participants were asked to rate their perceptions of system fairness using Likert-scale questions. Additionally, semi-structured interviews were conducted to gather in-depth insights into how and why participants perceived the systems’ fairness based on the provided explanations.
Overall, the studies in our sample demonstrate that fairness is a multi-dimensional construct encompassing distributive, procedural, informational, and system fairness. Furthermore, our review indicates that using a mixed-methods approach is common, combining quantitative ratings with qualitative insights to capture more nuanced reasoning about what explanation elements are helpful and why.

5.4. Miscellaneous

In addition to the more commonly used evaluation metrics, several studies in our sample employed additional measures to assess diverse aspects of AI system performance. These supplementary metrics help evaluate specific concerns and design goals related to XAI. In the following section, we summarize the most important of these measures.

5.4.1. Diversity, Novelty, and Curiosity

Diversity, novelty, and curiosity were themes addressed by some of the studies in our sample [26,32,39,76,130,143]. One reason for evaluating this topic is the risk that AI systems may reinforce filter bubbles and echo chambers, thereby limiting users’ exposure to diverse perspectives and presenting them only with familiar content. In discussing newsfeeds, Rader et al. [143] note that algorithmic curation in social media can contribute to a lack of information diversity, where users are shielded from alternative viewpoints. Similarly, Ngo et al. [32] mention that some participants perceive a risk that personalization might reduce diversity in their movie consumption. In user studies, diversity and novelty are often measured by self-reports. For example, Millecamp et al. [26] address novelty in the context of music recommender systems, incorporating questionnaire items such as “The recommender system helped me discover new songs”. In a similar fashion, Dominguez et al. [39] operationalize diversity in their study on art recommender systems with the item “The art images recommended were diverse”. Whether users appreciate novel content, however, depends on their curiosity and willingness to seek out the new.
Curiosity plays a pivotal role in engaging with different viewpoints and unfamiliar content and in motivating users to explore AI systems. Hoffmann et al. [130], for instance, emphasize that curiosity is central to motivating users to interact with XAI, as it is the main factor behind the desire for explanations. Hence, explanation design should incorporate elements of novelty, surprise, or incongruity to stimulate curiosity. In this context, Hoffmann et al. [130] also outline a questionnaire to measure users’ curiosity about AI systems, with items such as “I want to know what the AI just did” and “I want to know what the AI would have done if something had been different. I was surprised by the AI’s actions and want to understand what I missed”.

5.4.2. Persuasiveness and Plausibility

Persuasiveness and plausibility present another theme explored in various studies [15,37,46,49,65,68,76]. Chromik et al. [15] define persuasiveness as the ability to convince the user to take a specific action. Similarly, Van der Waa et al. [49] describe the persuasive power of an explanation as its capacity to convince the user to follow the provided advice. The authors further emphasize that this definition of persuasive power is independent of the explanation’s correctness. For this reason, persuasiveness must be distinguished from transparency and understandability, as the persuasive power of an explanation may lead to over-trust. In such cases, users may believe the system is correct, even when it is not, without fully understanding how it works [49]. This highlights the dual nature of persuasiveness in XAI: it is both a goal and a potential risk. On the one hand, if an explanation lacks persuasive power—meaning it does not influence user behavior—it suggests that the explanation is irrelevant and provides no added value. On the other hand, as Papenmeier et al. [46] warn, users should not be misled by persuasive yet inaccurate explanations.
In our sample, persuasiveness was evaluated using both objective and subjective measures. Van der Waa et al. [49], for instance, measured the construct using behavioral metrics, assessing how often participants followed the advice given by the AI, regardless of its correctness. Similarly, Zhang et al. [68] measured the switch percentage across different conditions of model information. Fügener et al. [75] also measured persuasiveness objectively, evaluating whether users followed the AI system’s recommendations.
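As a rough sketch of such objective persuasiveness measures, the following Python example computes an advice-following rate and a switch percentage (how often participants revised their initial answer to match the AI’s recommendation); the data structures are illustrative assumptions, not the instruments used in the cited studies.

```python
def advice_following_rate(followed_advice: list[bool]) -> float:
    """Share of trials in which the participant adopted the AI's recommendation."""
    return sum(followed_advice) / len(followed_advice)

def switch_percentage(initial: list[str], final: list[str], ai_advice: list[str]) -> float:
    """Share of eligible trials (initial answer differs from the AI's advice) in which
    the participant switched to the AI's recommendation after seeing it."""
    switches = sum(
        1 for i, f, a in zip(initial, final, ai_advice)
        if i != a and f == a          # started elsewhere, ended on the AI's answer
    )
    eligible = sum(1 for i, a in zip(initial, ai_advice) if i != a)
    return switches / eligible if eligible else 0.0

print(advice_following_rate([True, True, False]))                  # -> 0.666...
print(switch_percentage(initial=["deny", "approve", "deny"],
                        final=["approve", "approve", "deny"],
                        ai_advice=["approve", "approve", "approve"]))  # -> 0.5
```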
Beyond behavioral changes, some studies interpret persuasiveness in terms of changes in belief or attitude. For instance, Deters [37] outlined that “if the user changes their mind after receiving an explanation, it is persuasive”. From this stance, some studies in our sample employed subjective measures based on self-report surveys. In these cases, persuasiveness was most often evaluated based on how convincing users found the explanation. For example, Guesmi et al. [58] measured persuasiveness using the item “this explanation is convincing”. Similarly, Deters’ questionnaire included items like “the explanation is convincing” and “the recommendation or result is convincing”. Jin et al. [76] adopted a similar operationalization, assessing “how convincing AI explanations are to humans”. However, in their case, they used this measure to evaluate the plausibility of XAI explanations.

5.4.3. Intention to Use or Purchase

The intention to use or purchase [26,27,30,37,56,71] is also a theme addressed by some studies in our sample. Deters [37], for instance, understood purchase intention as a subconstruct of persuasiveness, arguing that convincing explanations increase user acceptance and, in turn, enhance purchase intentions. Therefore, her persuasiveness questionnaire includes this subconstruct, asking participants to rate statements such as “I would purchase the product I just chose if given the opportunity” and “I will use the system again if I need a tool like that”. In contrast, Kunkel et al. [71] consider it a subconstruct of trusting intentions, where trust in a recommender system is reflected by the user’s intention to make a purchase, evaluating their willingness to buy something based on the AI’s recommendation. A slightly different approach is found in the study of Kaur et al. [30]: like Kunkel et al. [71], they view the intention to use a system as an indication of genuine trust. During their interviews, they asked participants to rate how much they believed the system was ready for use. However, this question primarily served as a method to encourage participants to reflect seriously on the system; by asking participants to explain their rating, the authors aimed to analyze how users understood the system.
In other studies [26,27,47], the intention to use is considered an independent theoretical construct. This construct refers to an individual’s subjective likelihood or willingness to use a specific technology or system in the future, based on their beliefs and perceptions about the technology, and typically serves as a proximal antecedent to actual system usage. In Liao et al.’s study [47], the construct is operationalized by asking participants whether they would allow the system to access their currently used social media accounts. Millecamp et al. [26] and Ooge et al. [56] also operationalize the construct in terms of future usage, including statements like “I will use this recommender system again” (Millecamp et al. [26]) and “If I want to solve math exercises again, I will choose the system” (Ooge et al. [56]) in their post-task surveys.

5.4.4. Explanation Preferences

Explanation preferences have also been a focus in several evaluation studies, aiming to understand which explanation styles or types users prefer [63,72,73,74,89]. In most studies, various explanation styles or types were compared to determine user preferences. A common approach involved presenting users with one or more outputs, where different explanation concepts were available. Then, participants evaluated these explanations and were asked to express their preferences. With regard to the experimental design, two approaches are common: In a between-subject design, different groups of participants are exposed to different explanation styles. In contrast, in the case of a within-subject design, all participants are exposed to multiple explanation styles simultaneously or in a (random) order.
Regarding preference assessment, two methods are commonly used: ranking-based and rating-based. Ranking-based methods require participants to rank the different explanation concepts, forcing them to prioritize their preferences by directly comparing explanations against one another. This method has higher ecological validity, because in real-life situations, users must often choose between options. However, ranking-based methods require a within-subject experiment, where ordering effects must also be considered. On the other hand, rating-based methods involve asking participants to rate each explanation independently on a Likert scale. This allows users to express their preference level, offering more nuanced insights. It also enables the use of between-subject experiments. In our sample, Naveed et al. [89] employed a rating-based approach and additionally asked open-ended questions about which system participants preferred and why.
In contrast, Jeyakumar et al. [72] used a ranking-based approach, where participants were asked to select which of two available methods offered a better explanation for the provided model prediction. Similarly, Harrison et al. [73] asked participants whether they would choose an AI model or a human, although they used a 5-point Likert scale ranging from “definitely prefer the model” to “definitely prefer the human”.
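To make the difference between the two assessment methods concrete, the sketch below aggregates hypothetical preference data in both ways: pairwise choices from a ranking-based design are summarized as win rates, while Likert ratings from a rating-based design are summarized as per-condition means. The explanation-style labels and numbers are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Ranking-based (within-subject): each participant picks the preferred explanation style.
choices = ["feature-importance", "example-based", "example-based",
           "feature-importance", "example-based"]
win_rates = {style: n / len(choices) for style, n in Counter(choices).items()}
print(win_rates)  # e.g., {'feature-importance': 0.4, 'example-based': 0.6}

# Rating-based (usable between-subject): each participant rates one style on a 1-5 Likert scale.
ratings = {
    "feature-importance": [4, 3, 5, 4],
    "example-based": [5, 4, 4],
}
mean_ratings = {style: mean(vals) for style, vals in ratings.items()}
print(mean_ratings)  # e.g., {'feature-importance': 4, 'example-based': 4.33...}
```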
Various studies also used evaluation measures specific to the respective context, such as debugging support, situational awareness, and learning or education. These measures aim to gain a better understanding of how well explanations address particular needs and goals in a given context.

5.4.5. Debugging Support

This refers to the development context, where developers must understand complex AI systems, identify bugs, and improve models. In this regard, explanations should help users or engineers identify defects, localize system issues, and simplify bug detection and resolution [15,37]. Although debugging is a critical task, the author of [37] points out that currently only vague evaluation metrics exist, such as counting the number of user actions required to identify and fix a problem. In her work, she argues that such measures should be supplemented with self-reports, where a questionnaire covers aspects like “The explanations facilitate identifying errors” or “The explanations facilitate localizing and solving issues”. Kulesza et al. [33] extend the term to the non-professional sector, understanding debugging as the process by which end-users adjust an intelligent agent’s behavior to better align with their personal preferences. In their study, they examine this issue by evaluating how effectively participants can modify the system’s behavior, measured through user satisfaction, cost–benefit perception, interaction time, and other relevant metrics.

5.4.6. Situational Awareness

Situational awareness is mentioned in some studies, mainly with regard to AI-supported decision-making in dynamic, collaborative environments [66]. The concept refers to the understanding of elements in the environment, including their meaning and the ability to predict their future status [144]. However, intelligent systems can threaten situational awareness, as automation may lead users to disengage and become over-trusting [66]. To evaluate the role of explanations in such contexts, Schaffer et al. [66] and Paleja et al. [64], for instance, conducted experiments in which participants worked with an AI system to solve collaborative tasks. In their experimental designs, one group used a system with explanations, while the control group used it without explanations, and situational awareness was measured in both groups. In Schaffer et al.’s [66] study, situational awareness was assessed by measuring how accurately participants could predict the behavior of their co-players in a game. Similarly, Paleja et al. [64] used the Situation Awareness Global Assessment Technique (SAGAT) [144], in which participants were prompted at random points during a task with a series of fact-based questions to assess their knowledge of the current situation.
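A minimal sketch of how SAGAT-style probe data might be scored is given below, assuming each task freeze is stored as a list of correctness flags for its probe questions; this scoring scheme is an illustrative assumption rather than the exact procedure of the cited studies.

```python
def sagat_score(probes: list[list[bool]]) -> float:
    """Situation awareness score for one participant: share of correctly answered
    probe questions across all task freezes."""
    answers = [correct for probe in probes for correct in probe]
    return sum(answers) / len(answers)

# Three freezes with four probe questions each
print(sagat_score([[True, True, False, True],
                   [True, False, False, True],
                   [True, True, True, True]]))  # -> 0.75
```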

5.4.7. Learning and Education

This construct refers to the goal that explanations enable users to generalize and learn [15,57]. This is implicitly addressed by Hoffmann et al. [130], who state that explanations should address the universal human need to educate oneself and learn new things. Learning can also be an essential goal of XAI applications in a specific context. One example is the study by Tsai et al. [57] on an AI-supported online symptom checker, in which explanations were intended not only to help people better understand the system but also to improve their knowledge of the symptoms and the disease. To measure learning effects after using the system, they administered a questionnaire with two items: “The app helps me to better communicate COVID-19 related information with others” and “The app helps me to learn COVID-19 related information”.

6. Evaluation Procedure

Evaluation procedures in XAI typically utilize a mix of both qualitative and quantitative methods, including user studies, surveys, and metrics, to assess the effectiveness of explanations. The idea is to implement controlled experiments by recruiting a representative sample of end-users, domain experts, or proxy users to participate in the evaluation, ensuring varied perspectives and experiences. Figure 3 shows the elements of the evaluation procedure.

6.1. Sampling and Participants

Study design and participant recruitment are key issues for the quality, validity, and transferability of empirical studies. Ideally, the study design reflects a real-world setting, and the sample is a representative cross-section of the target population. If this representativeness is lacking, there is a risk that the findings may not be generalizable, conclusions may be incorrect, and recommendations may be misleading. Ensuring a representative sample is, therefore, fundamental to the reliability and applicability of a study’s results.
In this context, the distinction between proxy-user and real-user studies has become prevalent in the literature [42].
Real User: These participants belong directly to the target group being studied and reflect the needs, motivations, experiences, competencies, and practices of the target group.
Proxy User: These are substitute participants who are not directly part of the target group. They do not fully embody the authentic needs, competencies, motivations, and behaviors of the intended users but act as their representatives.

6.1.1. Real-User Studies

In our sample, only some studies were conducted with participants of the addressed target group [29,30,38,46,47,48,55,61,64,67,70,72,85,103]. Most of these studies were domain-driven research conducted in the consumer sector, the academic/technical field, and the healthcare sector.
The prevalence of real-user studies in the consumer domain is attributed to the ease of recruiting participants from widely used, mass-market systems. For example, in studies focusing on streaming and social media recommendation algorithms, Kunkel et al. [71] recruited users of Amazon Prime, Ngo et al. [32] engaged Netflix users, and Rader et al. [55] recruited Facebook users. Alizadeh et al. [29] focused on Instagram users impacted by a service ban. Similarly, in studies of driver assistance systems and drawing software, respectively, the target groups of Colley et al. [67] and Cai et al. [48] belonged to mass markets. In summary, the high proportion of studies with real users in this domain can be explained by the widespread adoption of these systems, which facilitates the recruitment of real users. A special case arises when the target group is the general public, such as in studies concerning the perception of fair decision-making [35]. In this context, recruiting “real users” is straightforward since everyone belongs to the target population.
In the academic/technical domain, the prominence of real-user studies results from the close proximity between researchers and participants. Although the evaluated systems do not represent a mass market, the target group is part of the researchers’ social environment. For instance, Ooge et al. [56] evaluated e-learning platforms, Guesmi et al. [58] focused on a science literature recommender, and Kaur et al. [30] examined data scientists’ work practices; in each case, convenience sampling presented an easy-to-implement recruiting strategy. This practice is largely due to the proximity of the domain to the researchers’ own field, facilitating access to target groups like students, researchers, and technicians. While these studies show a significant representation of real users, the ease of recruitment is a key factor driving this trend.
The situation in the healthcare domain is distinct. In contrast to the previous cases, gaining access to the target group in this domain is not straightforward. Challenges are common in reaching the desired participants, which include medical professionals such as doctors, nurses, paramedics, and emergency service providers. Examples of this include Holzinger et al. [44] and Cabitza et al. [85], who involved a medical doctor from a hospital. In contrast to studies using proxy users, these real-user studies often have fewer participants. They prioritize addressing the specific needs, skills, and preferences of the target group, unlike large-scale evaluations that rely on proxy users.

6.1.2. Proxy-User Studies

Most of the studies in our sample recruited proxy users [5,8,15,25,30,33,34,37,39,40,42,43,49,50,51,52,53,55,57,58,60,61,62,65,66,68,69,71,73,75,86]. This especially holds true for methodology-driven [32,33,42] and concept-driven research [5,8,25,30,33,34,36,39,40,43,49,51,53,57,58,60,61,62,65,66,71,73,75,86].
For instance, proxy users were recruited in studies that compared the effectiveness of various explanation methods, evaluated novel explanation concepts, or evaluated the general role of explanations within the decision-making process, enhancing human–AI collaboration and improving human performance with AI assistance [34,36,43,51,58,60,65,73,86,94].
Moreover, proxy-user studies are prevalent when general effects on users’ understanding, mental models, perception, preferences, attitudes, and behavior are evaluated. Regarding domain-driven studies, some researchers adopted a proxy-user approach for pragmatic research reasons when they assumed that this would not cause any significant bias [52,55,56]. The goal of such an approach is to simplify the recruiting process, better control environmental conditions, and increase the number of participants for statistical reasons. In some cases [50,67,73], neither the addressed target group nor the applied sampling method was described in detail, which is why we have not classified these studies.
A recurring recruitment method in proxy-user studies was the use of paid online panels and crowd workers. In our sample, 23 studies utilized Amazon Mechanical Turk (MTurk), while 4 studies used Prolific for this purpose. A few proxy-user studies relied on other sampling methods, typically convenience and snowball sampling techniques. These studies recruited their participants through word-of-mouth, personal contacts, or people from the local community [36,68,71], internal university mailing lists [15], posters around university campuses [36], or social media [71].
Compared to real-user studies, proxy-user studies tend to involve a larger number of participants. This is particularly evident in studies using paid crowd-worker samples, where the average number of participants across the 23 studies in our sample was 764 (SD = 1147). This high SD value reflects considerable variation in sample sizes across the studies. In contrast, convenience sampling studies comprised a smaller average number of participants (mean = 56, SD = 44). Some studies, such as [53,57], do not provide further details on their sampling methods or sample sizes. Therefore, the statistics regarding sample sizes represent only a rough estimate and should be treated with caution.

6.2. Evaluation Methods

Measurement methods can be classified by various criteria, such as qualitative/quantitative or objective/subjective measures [15]. In HCI, a widespread categorization distinguishes between interviews, observations, questionnaires, and mixed methods [14]. In the following section, we briefly summarize how these methods have been adopted in XAI evaluation research.

6.2.1. Interviews

Interviews are a common evaluation method in HCI used to gather information by asking people about their knowledge, experiences, opinions, or attitudes [14,145]. Interviews are qualitative in nature, offering a high degree of flexibility ranging from very structured to open-ended [145]. In user research, semi-structured interviews are the most prominent type [14,146]. They are used, for instance, to study people’s mental models and perceptions of the AI system and the explanation provided [29,103]. Interviews are also used for the formative evaluation of prototypically designed explanation systems or the contextual inquiry of the application domain where these systems should be used [30,77]. In evaluation research, interviews are used in mixed methodologies to develop quantitative metrics based on qualitative data or, conversely, to enrich, contextualize, and better interpret quantitative results [25,77,147].
In our sample, interviews were used as a methodology by several studies [25,29,30,32,33,58], usually with a small group of participants. As Table 1 shows, a striking feature is that almost all real-user evaluations are interview studies or open-ended surveys. For instance, Alizadeh et al. [29] interviewed people affected by an action ban, and Kaur et al. [30] interviewed data scientists using XAI tools such as SHAP. By their very nature, interview studies are qualitative, focusing on the evaluation of perceived properties of the explanatory systems rather than quantifying their impact on user behavior. For instance, Kaur et al. [30] used interviews to understand the purpose of XAI tools for data scientists, Alizadeh et al. [29] used them to understand how affected persons perceive the explanations given by service providers, and Naveed et al. [25] interviewed potential Robo-Advisor users about what kind of explanations they would like to glean from such a system.

6.2.2. Observations

Observations are a vital methodological approach, allowing researchers to systematically watch and record behaviors and events in various settings [14,146]. They can be used in qualitative settings such as thinking-aloud sessions. However, observations are primarily employed to measure quantifiable aspects such as task completion times and interaction metrics.
In our sample, for instance, Millecamp et al. [41] recorded interaction logs capturing participants’ mouse clicks on interface components together with their timestamps to measure task performance objectively. In a similar way, Cai et al. [48] logged the time spent on each explanation page to determine how different explanations require varying amounts of time for mental processing. Narayanan et al. [51] recorded a response time metric, measured as the number of seconds from when the task was displayed until the participant hit the submit button on the interface.
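As an illustration of such log-based objective measures, the following sketch derives a response time and a click count from a hypothetical event log; the event structure and field names are assumptions, not the instrumentation used in the cited studies.

```python
from dataclasses import dataclass

@dataclass
class Event:
    participant: str
    kind: str         # e.g., "task_shown", "click", "submit"
    timestamp: float  # seconds since the start of the session

def response_time(events: list[Event]) -> float:
    """Seconds from the task being displayed until the participant hit submit."""
    shown = next(e.timestamp for e in events if e.kind == "task_shown")
    submit = next(e.timestamp for e in events if e.kind == "submit")
    return submit - shown

def click_count(events: list[Event]) -> int:
    """Number of logged interface clicks."""
    return sum(1 for e in events if e.kind == "click")

log = [Event("p01", "task_shown", 12.0), Event("p01", "click", 19.4),
       Event("p01", "click", 25.1), Event("p01", "submit", 41.7)]
print(round(response_time(log), 1), click_count(log))  # -> 29.7 2
```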
Observation studies, although less prone to biases than self-reports (Gorber et al. [148]), can be time-consuming and cannot directly assess subjective experiences, such as thoughts or attitudes (Sikes et al. [149]). Nonetheless, innovative methodologies are emerging that leverage external observations to infer internal states, such as stress and emotional responses, further enriching the understanding of user experiences in human–computer interaction [10,150]. This dual approach enhances the robustness of findings and deepens insights into how users engage with AI systems and their explanations.

6.2.3. Questionnaires

Questionnaires are used to obtain quantitative values about people’s knowledge, experiences, opinions, or attitudes in a comparable way. They are the most used data collection method in positivistic research [151]. This also holds true for our sample, where most studies used a questionnaire as the evaluation method (see Table 1).
Evaluation theory suggests that questionnaires as measurement instruments should be anchored equally in formal theories, such as classical test theory (CTT) and item response theory (IRT) [24], and in the respective substance or object theories, such as explanation theories [5,98] or theories adopted from other disciplines. The constructs given by substance theories must be operationalized to make them measurable, and psychometric research has provided several methods to ensure that the resulting measures are valid, reliable, and objective [24,103]. While using questionnaires has become common practice in XAI evaluation, Section 5 shows that, regarding rigorous questionnaire design, XAI research is less mature than other disciplines in explicating the theoretical constructs used in questionnaires, how they are operationalized, and how they are psychometrically validated. Instead, the use of ad hoc questionnaires remains widespread in our sample.
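One common psychometric check of this kind is an internal-consistency analysis. The sketch below computes Cronbach’s alpha for a multi-item scale from a matrix of Likert responses; Cronbach’s alpha is our illustrative choice of reliability metric here, not one prescribed by the studies in our sample.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency of a scale.
    scores: (n_respondents, n_items) matrix of Likert responses."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item across respondents
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Four respondents answering a three-item trust scale (1-5 Likert)
responses = np.array([[4, 5, 4],
                      [3, 3, 4],
                      [5, 5, 5],
                      [2, 3, 2]])
print(round(cronbach_alpha(responses), 2))  # -> roughly 0.94
```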
As shown in Table 1, all three evaluation methodologies (interviews, observation, and questionnaires) are utilized as a single method or in combination as a mixed-methods approach to gather the needed data for the evaluation.

6.2.4. Mixed-Methods Approach

The mixed-methods evaluation approach has gained traction in recent XAI research due to its ability to provide a comprehensive understanding of both quantitative and qualitative aspects of user interaction with AI systems. This is also evident in our sample, as shown in Table 1, where many of the XAI studies used a mixed-methods approach. For instance, Naveed et al. [25] supplemented qualitative focus group discussions with a quantitative online survey, and Millecamp et al. [41] used both qualitative and quantitative metrics in their study, which included Likert-scale questionnaire items, open-ended questions, and interaction logs.
The idea behind this approach is to combine numerical data analysis, often derived from user performance metrics or surveys, with qualitative insights gathered from interviews or open-ended discussions, enabling researchers to capture the complexities and underlying aspects of user experiences and perceptions [152]. For example, quantitative assessments can evaluate how well XAI systems’ explanations enhance user decision-making, while qualitative insights offer a deeper understanding of user satisfaction and the clarity of AI outputs [21]. By combining these approaches, researchers can tackle the diverse challenges associated with XAI more effectively, resulting in designs that prioritize user needs and foster greater transparency in AI systems [153]. This holistic perspective is essential for advancing the field, as it facilitates the identification of not only the technical performance of XAI systems but also the subjective experiences of the users interacting with them.

7. Discussion: Pitfalls and Guidelines for Planning and Conducting XAI Evaluations

This work provides a comprehensive review of XAI evaluation studies with the aim of deriving guiding principles for planning and conducting XAI evaluations from a user perspective.
For this purpose, we outline key elements and considerations for planning and setting up an extensive XAI evaluation. We argue that establishing clear guiding principles is essential for maintaining the focus on user needs and ensuring that the study aligns with the evaluation goals. However, overly rigid principles can hinder the flexibility needed to adapt to user feedback and evolving user requirements [3]. Figure 4 presents key elements as part of the XAI design guiding principles, which we describe below.
The first step in an XAI evaluation study is to define clear objectives, i.e., to specify what should be measured by distinguishing between concept evaluation (theoretical assessment), domain evaluation (practical implementation or improving a specific application), and methodological evaluation (focusing on evaluation metrics and frameworks). Concept-driven evaluations involve theoretical aspects and introduce innovative models and interfaces to enhance explainability, aiming to bridge the gap between theoretical frameworks and practical implementations. Such evaluations often employ novel concepts like example-based explanations and interactive explanations to enhance the user’s understanding of and satisfaction with the AI system. Meanwhile, domain-driven evaluation applies these methods to specific fields such as healthcare, finance, and e-commerce, demonstrating how tailored and personalized explanations can enhance trust, transparency, and decision-making in practical, real-world scenarios.
Our survey reveals that concept-driven research frequently encounters difficulties with relevancy and ecological validity, especially when trying to ensure that theoretical concepts are applicable to real-world scenarios. In contrast, domain-driven research faces challenges with maintaining rigor and achieving generalizability of results. To obtain deeper insights into specific contexts, domain-driven research often relies on less rigorous qualitative, exploratory methods, typically conducted in practical settings where full control is impossible. Balancing trade-offs between rigor and relevance is a critical challenge for both research approaches.
Guidelines: The guideline is to carefully consider the tension between rigor and relevance from the very beginning when planning an evaluation study, as it influences both the evaluation scope and the methods used.
    • If the goal is to gain a deep understanding of the context, the scope will be narrower, and qualitative methods are typically more appropriate.
    • If the goal is to test a hypothesis about the causal relationship of an explanation concept, using standardized questionnaires and test scenarios under controlled conditions should be the method of choice.
Furthermore, XAI evaluation studies can have different scopes, which include defining the domain, target group, and evaluation context/test scenario. Both HCI and AI research stress the importance of understanding the specific needs and context of the target user group and application so that the explanations provided are contextually relevant and tailored to the users’ needs and requirements [1]. Understanding the specific requirements and characteristics of different user groups is therefore essential for defining the domain and target group.
Pitfalls: A common pitfall in many evaluation studies is not defining the target domain, group, and context explicitly. This lack of explication negatively affects both the planning of the study and the broader scientific community.
During the planning phase, this complicates the formulation of test scenarios, recruitment strategies, and predictions regarding the impact of pragmatic research decisions (e.g., using proxy users instead of real users, evaluating a click dummy instead of a fully functional system, using toy examples instead of real-world scenarios, etc.).
During the publication phase, the missing explication impedes the assessment of the study’s scope of validity and reproducibility. Without clearly articulating the limitations imposed by early decisions—such as the choice of participants, test conditions, or simplified test scenarios—the results may be seen as less robust or generalizable.

Guidelines: The systematic planning of an evaluation study should include a clear and explicit definition of the application domain, target group, and use context of the explanation system. This definition should be as precise as possible. However, an overly narrow scope may restrict the generalizability of the research findings, while a broader scope could dilute the study’s focus, affecting both its systematic implementation and the relevance of the findings [154]. Striking the right balance is essential for ensuring both meaningful insights and the potential applicability of the results across different contexts.
Additionally, methodological evaluations focus on creating and employing reliable and valid metrics to assess the explainability of AI systems. The heterogeneity of the use contexts and explanation demands of application domains rules out a one-size-fits-all evaluation approach that can be applied to all cases [10]. Moreover, explanations can have multiple effects and are often designed for this purpose, such as enhancing understandability, improving task performance, increasing user satisfaction, and building trust. Therefore, it is essential to specify what the measurement objects and metrics of the study are and how they need to be evaluated. This includes any object, phenomenon, concept, or property of interest that the study aims to quantify for evaluating the effectiveness of XAI.
Pitfalls: A common pitfall in many evaluation studies is the use of ad hoc questionnaires instead of standardized ones. This practice has negative consequences for both study planning and the scientific community:
During the planning phase, creating ad hoc questionnaires adds to the cost, particularly when theoretical constructs are rigorously operationalized, including pre-testing the questionnaires and validating them psychometrically.
During the publication phase, using non-standardized questionnaires complicates reproducibility, comparability, and the assessment of the study’s validity.
Guidelines:
The definition of measurement constructs:
    • Consider what should be measured in the evaluation study. Is there a specific goal the explanation design aims to achieve (e.g., increased transparency, improved task performance, etc.)? Is there a particular hypothesis that needs to be tested (e.g., explanations contribute to higher trust)?
    • If a concrete hypothesis cannot be formulated (e.g., due to the novelty of the domain or approach), a qualitative, exploratory study may be appropriate to gain deeper insights and generate hypotheses/theoretic concepts.
The operationalization of measurement constructs:
    • Is the construct directly measurable (e.g., task duration), or is it a theoretical construct (e.g., trust)?
    • Does it present the underlying theory or reference relevant work?
    • If a standardized questionnaire exists for the construct, use it. If necessary, adapt it to the context, but keep in mind that this may limit validity and comparability.
    • If no standardized questionnaire exists, develop a new one according to general guidelines for questionnaire development and test theory.
Finally, defining a study procedure is vital for implementing and conducting the evaluation. This involves determining the sampling strategy (ensuring that the study captures a representative sample of the user population) and the data-gathering methodologies (ensuring that the collected data are reliable and valid) [155]. A detailed and well-defined procedure helps replicate the evaluation to verify the results. However, overly rigid procedures can limit the ability to adapt to new insights and user feedback. According to Hoffman et al. [35], balancing structure with flexibility is key to practical XAI evaluation.
Guidelines: Essentially, there are three types of methods.
    • Interviews: These allow for a high degree of flexibility and provide deeper insights into individual perspectives and experiences.
    • Observations: These enable the collection of objective measures without the distortion of subjective memory.
    • Questionnaires: These ensure high reusability, allow for efficient data collection, and facilitate comparability across different studies.
Interviews are especially suitable for exploratory studies or when rich accounts of user experience need to be gathered. Questionnaires, on the other hand, are ideal when subjective measures, such as trust or satisfaction, need to be collected in a standardized manner for statistical analysis. To evaluate task performance, observations or log file analysis are appropriate methods. Additionally, mixed-methods approaches are valuable for evaluating or triangulating aspects of different natures (e.g., qualitative mental models and the accuracy of those models in predicting system behavior).
The sampling method used is another essential factor to consider. In addition to convenience and snowball sampling, the use of crowd workers has become a popular approach due to its time and cost efficiency. However, using paid online panels often implicitly involves the decision to conduct the study with proxy users, a choice that should be made carefully. While proxy-user studies offer certain advantages, relying on a proxy-user sample can have significant implications for the external validity of the study. Although proxy users are often employed as a practical solution when accessing the actual target group is difficult, their use may limit the generalizability of the findings. Since proxy users are substitutes for the intended participants, their involvement may lead to skewed or unrepresentative results.
This misalignment can be especially pronounced in studies involving complex systems like explainable AI, where nuanced user interactions and perceptions are critical. Conversely, incorporating real users into a study enhances its external validity. This authentic engagement of real users affected by AI systems in real-world settings ensures that the findings genuinely reflect the target audience’s experiences with the system, thus offering more reliable and actionable insights. Still, proxy users are often a necessity in certain research contexts. For example, when evaluating mock-ups or nascent systems for which there is no existing user base, proxy users offer the only feasible means of testing and feedback. Similarly, in studies focused not on specific applications but on general concepts, proxy users can provide valuable, albeit generalized, insights [61,79]. Also, in cases where the target group comprises a limited number of domain experts who are unavailable for the study, proxy users with comparable domain knowledge and expertise can offer a viable alternative. However, it is essential to acknowledge that while they can simulate the role of the target group, their perspectives might not fully align with those of the actual users. Furthermore, logistical factors such as cost, time, and organizational complexity often necessitate the use of proxy users.
Pitfalls: Real- and proxy-user sampling each comes with its own set of advantages and disadvantages. A real-user approach is particularly challenging in niche domains beyond the mass market, especially where AI systems address sensitive topics or affect marginalized or hard-to-reach populations. Key sectors in this regard include healthcare, justice, and finance, where real-user studies typically comprise smaller sample sizes due to the specific conditions of the domain and the unique characteristics of the target group. Conversely, the availability of crowd workers and online panel platforms simplifies the recruitment process for proxy-user studies, enabling larger sample sizes. While recruiting proxy users can be beneficial for achieving a substantial sample size and sometimes essential for gathering valuable insights, researchers must be mindful of the limitations and potential biases this approach introduces. It is crucial to carefully assess how accurately proxy users represent the target audience and to interpret the findings considering these constraints.
Relying on proxy users, rather than real users from the target group, can be viewed as a compromise often driven by practical considerations. However, the decision to use proxy users is often made for pragmatic reasons without considering the implications for the study design and the applicability of the research findings for real-world scenarios.
Guidelines: The sampling method used has a serious impact on the study results. Sometimes, a small sample of real users can lead to more valid results than a large-sample study with proxy users. Therefore, the decision on the sampling method should be made deliberately, balancing the statistically required sample size, contextual relevance, and ecological validity against the practicalities of conducting the study in a time- and cost-efficient manner. In addition, researchers should articulate the rationale behind the sampling decision as well as its implications for the study design and the limitations of the findings.
The final step is the analysis of the gathered data, where guidelines also exist, such as thematic analysis for qualitative data or statistical evaluation for quantitative data. However, these are beyond the scope of this study, as we focused on the methodological aspects of planning and conducting empirical evaluation studies.

8. Conclusions

In this study, we provided an in-depth review of existing empirical methods for evaluating XAI systems, with a particular focus on user-centered approaches. While significant efforts have been made in developing explainability techniques, the empirical evaluation of these systems remains fragmented, which necessitates the development of rigorous, user-specific evaluation frameworks.
This review suggests that a standardized framework is essential for assessing not only the technical fidelity of AI explanations but also their practical usefulness, particularly user understandability, usability, and integrity. Based on our analysis, we provide the following key guidelines and future tasks for further research in XAI evaluation:
  • Balancing Methodological Rigor and Practical Relevance—Future research should maintain a balance between rigorous theoretical evaluations and practical, real-world applications. Researchers must tailor their methods based on the study type ― whether concept-driven, methodological, or domain-driven. Studies need to either prioritize controlled environments for hypothesis testing or adopt more qualitative, context-driven methods for exploring new areas.
  • A Clear Definition of the Evaluation Scope—Future studies need to clearly define their evaluation scope, including the domain, target group, and context of use. Avoiding vague or broad scopes is essential for maintaining focus and ensuring meaningful generalizable results.
  • The Standardization of Measurement Metrics—A major pitfall in XAI research is the inconsistent use of evaluation metrics. Researchers should use standardized questionnaires and tools to enable comparisons across studies. If none are available, newly developed metrics must be rigorously validated before use.
  • The Use of Real Users Over Proxy Users—Although proxy users are sometimes needed, real users should be prioritized in domain-specific evaluations to improve ecological validity and generalizability. Researchers must clearly justify their user group choice and acknowledge the limitations of using proxies.
  • The Exploration of Mixed-Methods Evaluations—To better understand human–AI interactions, mixed-methods approaches combining qualitative and quantitative methods are essential. Triangulating interviews, questionnaires, and observational data will offer deeper insights into how users engage with and perceive AI explanations. Furthermore, future research needs to focus on conducting longitudinal studies to explore how user trust and reliance on AI explanations evolve over time. This is especially important in high-stakes fields like healthcare, where initial trust may differ from long-term trust based on system performance and explanation quality.
  • Domain-Specific Research—Future research must move beyond general-purpose evaluations to focus on domain-specific needs, especially in sensitive areas like healthcare, finance, law, and autonomous systems. Each domain may have unique requirements for explainability, and evaluations must be designed to consider these domain-specific constraints and expectations. For this, we propose the need for developing domain-specific evaluation metrics.
In conclusion, our analysis calls for a more rigorous, structured, and standardized approach to XAI evaluation that addresses both domain-specific and generalizable user needs. It advocates for interdisciplinary collaboration, drawing on human–computer interaction, psychology, and AI to create more reliable and effective evaluation methods that contribute to the broader adoption of and trust in AI systems in real-world applications.

Author Contributions

Conceptualization, S.N., G.S. and D.R.-K.; methodology, S.N. and G.S.; software; validation, S.N. and G.S.; formal analysis, S.N. and G.S.; investigation, S.N. and G.S.; resources, D.R.-K.; data curation, S.N. and G.S.; writing—original draft preparation, S.N. and G.S.; writing—review and editing, S.N.; visualization, S.N.; supervision, G.S.; project administration, S.N. and G.S.; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

Author Dean Robin-Kern was employed by the company Bikar Metalle GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18; Association for Computing Machinery, Montreal, QC, Canada, 21–26 April 2018; pp. 1–18. [Google Scholar] [CrossRef]
  2. Shneiderman, B. Human-Centered Artificial Intelligence: Reliable, Safe & Trustworthy. Int. J. Hum. Comput. Interact. 2020, 36, 495–504. [Google Scholar] [CrossRef]
  3. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. Available online: https://api.semanticscholar.org/CorpusID:11319376 (accessed on 12 April 2024).
  4. Herrmann, T.; Pfeiffer, S. Keeping the organization in the loop: A socio-technical extension of human-centered artificial intelligence. AI Soc. 2023, 38, 1523–1542. [Google Scholar] [CrossRef]
  5. Vilone, G.; Longo, L. Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 2021, 76, 89–106. [Google Scholar] [CrossRef]
  6. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  7. Gunning, D.; Aha, D. DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Mag. 2019, 40, 44–58. [Google Scholar] [CrossRef]
  8. Nunes, I.; Jannach, D. A systematic review and taxonomy of explanations in decision support and recommender systems. User Model. User-Adapt. Interact. 2017, 27, 393–444. [Google Scholar] [CrossRef]
  9. Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar] [CrossRef]
  10. Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics 2021, 10, 593. [Google Scholar] [CrossRef]
  11. Doshi-Velez, F.; Kortz, M.; Budish, R.; Bavitz, C.; Gershman, S.; O’Brien, D.; Scott, K.; Schieber, S.; Waldo, J.; Wood, A.; et al. Accountability of AI under the law: The role of explanation. arXiv 2017, arXiv:1711.01134. [Google Scholar] [CrossRef]
  12. Nguyen, A.; Martínez, M.R. MonoNet: Towards Interpretable Models by Learning Monotonic Features. arXiv 2019, arXiv:1909.13611. [Google Scholar] [CrossRef]
  13. Rosenfeld, A. Better Metrics for Evaluating Explainable Artificial Intelligence. 2021. Available online: https://api.semanticscholar.org/CorpusID:233453690 (accessed on 25 July 2024).
  14. Sharp, H.; Preece, J.; Rogers, Y. Interaction Design: Beyond Human-Computer Interaction; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002. [Google Scholar]
  15. Chromik, M.; Schuessler, M. A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI. In Proceedings of the ExSS-ATEC@IUI, Cagliari, Italy, 17–20 March 2020; Available online: https://api.semanticscholar.org/CorpusID:214730454 (accessed on 29 November 2023).
  16. Mohseni, S.; Block, J.E.; Ragan, E. Quantitative Evaluation of Machine Learning Explanations: A Human-Grounded Benchmark. In Proceedings of the 26th International Conference on Intelligent User Interfaces, in IUI ’21, College Station, TX, USA, 14–17 April 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 22–31. [Google Scholar] [CrossRef]
  17. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
  18. Anjomshoae, S.; Najjar, A.; Calvaresi, D.; Främling, K. Explainable Agents and Robots: Results from a Systematic Literature Review. In Proceedings of the AAMAS ’19: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 1078–1088. Available online: http://www.ifaamas.org/Proceedings/aamas2019/pdfs/p1078.pdf (accessed on 29 November 2023).
  19. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Comput. Surv. 2023, 55, 3583558. [Google Scholar] [CrossRef]
  20. Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32. [Google Scholar] [CrossRef]
  21. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  22. Litwin, M.S. How to Measure Survey Reliability and Validity; Sage: Thousand Oaks, CA, USA, 1995; Volume 7. [Google Scholar]
  23. DeVellis, R.F.; Thorpe, C.T. Scale Development: Theory and Applications; Sage: Thousand Oaks, CA, USA, 2003. [Google Scholar]
  24. Raykov, T.; Marcoulides, G.A. Introduction to Psychometric Theory, 1st ed.; Routledge: London, UK, 2010. [Google Scholar]
  25. Naveed, S.; Kern, D.R.; Stevens, G. Explainable Robo-Advisors: Empirical Investigations to Specify and Evaluate a User-Centric Taxonomy of Explanations in the Financial Domain. In Proceedings of the IntRS@RecSys, Seattle, WA, USA, 18–23 September 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 85–103. [Google Scholar]
  26. Millecamp, M.; Naveed, S.; Verbert, K.; Ziegler, J. To Explain or not to Explain: The Effects of Personal Characteristics when Explaining Feature-based Recommendations in Different Domains. In Proceedings of the IntRS@RecSys, Copenhagen, Denmark, 16–19 September 2019; Association for Computing Machinery: New York, NY, USA, 2019. Available online: https://api.semanticscholar.org/CorpusID:203415984 (accessed on 29 November 2023).
  27. Naveed, S.; Loepp, B.; Ziegler, J. On the Use of Feature-based Collaborative Explanations: An Empirical Comparison of Explanation Styles. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’20 Adjunct, Genoa Italy, 12–18 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 226–232. [Google Scholar] [CrossRef]
  28. Naveed, S.; Donkers, T.; Ziegler, J. Argumentation-Based Explanations in Recommender Systems: Conceptual Framework and Empirical Results. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, in UMAP ’18, Singapore, 8–11 July 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 293–298. [Google Scholar] [CrossRef]
  29. Alizadeh, F.; Stevens, G.; Esau, M. An Empirical Study of Folk Concepts and People’s Expectations of Current and Future Artificial Intelligence. i-com 2021, 20, 3–17. [Google Scholar] [CrossRef]
  30. Kaur, H.; Nori, H.; Jenkins, S.; Caruana, R.; Wallach, H.; Vaughan, J.W. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
  31. Lai, V.; Liu, H.; Tan, C. ‘Why is “Chicago” deceptive? ’ Towards Building Model-Driven Tutorials for Humans. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–13. [Google Scholar] [CrossRef]
  32. Ngo, T.; Kunkel, J.; Ziegler, J. Exploring Mental Models for Transparent and Controllable Recommender Systems: A Qualitative Study. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’20, Genoa Italy, 12–18 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 183–191. [Google Scholar] [CrossRef]
  33. Kulesza, T.; Stumpf, S.; Burnett, M.; Yang, S.; Kwan, I.; Wong, W.-K. Too much, too little, or just right? Ways explanations impact end users’ mental models. In Proceedings of the 2013 IEEE Symposium on Visual Languages and Human Centric Computing, San Jose, CA, USA, 15–19 September 2013; pp. 3–10. [Google Scholar] [CrossRef]
  34. Sukkerd, R. Improving Transparency and Intelligibility of Multi-Objective Probabilistic Planning. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2022. [Google Scholar] [CrossRef]
  35. Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Metrics for Explainable AI: Challenges and Prospects. arXiv 2019, arXiv:1812.04608. [Google Scholar] [CrossRef]
  36. Anik, A.I.; Bunt, A. Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  37. Deters, H. Criteria and Metrics for the Explainability of Software; Gottfried Wilhelm Leibniz Universität: Hannover, Germany, 2022. [Google Scholar]
  38. Guo, L.; Daly, E.M.; Alkan, O.; Mattetti, M.; Cornec, O.; Knijnenburg, B. Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules. In Proceedings of the 27th International Conference on Intelligent User Interfaces, in IUI ’22, Helsinki, Finland, 21–25 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 537–548. [Google Scholar] [CrossRef]
  39. Dominguez, V.; Messina, P.; Donoso-Guzmán, I.; Parra, D. The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 408–416. [Google Scholar] [CrossRef]
  40. Dieber, J.; Kirrane, S. A novel model usability evaluation framework (MUsE) for explainable artificial intelligence. Inf. Fusion 2022, 81, 143–153. [Google Scholar] [CrossRef]
  41. Millecamp, M.; Htun, N.N.; Conati, C.; Verbert, K. To explain or not to explain: The effects of personal characteristics when explaining music recommendations. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 397–407. [Google Scholar] [CrossRef]
  42. Buçinca, Z.; Lin, P.; Gajos, K.Z.; Glassman, E.L. Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, in IUI ’20, Cagliari, Italy, 17–20 March 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 454–464. [Google Scholar] [CrossRef]
  43. Cheng, H.-F.; Wang, R.; Zhang, Z.; O’Connell, F.; Gray, T.; Harper, F.M.; Zhu, H. Explaining Decision-Making Algorithms through UI: Strategies to Help Non-Expert Stakeholders. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–12. [Google Scholar] [CrossRef]
  44. Holzinger, A.; Carrington, A.; Müller, H. Measuring the Quality of Explanations: The System Causability Scale (SCS). KI—Künstliche Intell. 2020, 34, 193–198. [Google Scholar] [CrossRef]
  45. Jin, W.; Hamarneh, G. The XAI alignment problem: Rethinking how should we evaluate human-centered AI explainability techniques. arXiv 2023, arXiv:2303.17707. [Google Scholar] [CrossRef]
  46. Papenmeier, A.; Englebienne, G.; Seifert, C. How model accuracy and explanation fidelity influence user trust. arXiv 2019, arXiv:1907.12652. [Google Scholar] [CrossRef]
  47. Liao, M.; Sundar, S.S. How Should AI Systems Talk to Users when Collecting their Personal Information? Effects of Role Framing and Self-Referencing on Human-AI Interaction. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  48. Cai, C.J.; Jongejan, J.; Holbrook, J. The effects of example-based explanations in a machine learning interface. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Marina del Ray, CA, USA, 17–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 258–262. [Google Scholar] [CrossRef]
  49. van der Waa, J.; Nieuwburg, E.; Cremers, A.; Neerincx, M. Evaluating XAI: A comparison of rule-based and example-based explanations. Artif. Intell. 2021, 291, 103404. [Google Scholar] [CrossRef]
  50. Poursabzi-Sangdeh, F.; Goldstein, D.G.; Hofman, J.M.; Vaughan, J.W.W.; Wallach, H. Manipulating and Measuring Model Interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  51. Narayanan, M.; Chen, E.; He, J.; Kim, B.; Gershman, S.; Doshi-Velez, F. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. arXiv 2018, arXiv:1802.00682. [Google Scholar] [CrossRef]
  52. Liu, H.; Lai, V.; Tan, C. Understanding the Effect of Out-of-distribution Examples and Interactive Explanations on Human-AI Decision Making. Proc. ACM Hum.-Comput. Interact. 2021, 5 (CSCW2), 1–45. [Google Scholar] [CrossRef]
  53. Schmidt, P.; Biessmann, F. Quantifying Interpretability and Trust in Machine Learning Systems. arXiv 2019, arXiv:1901.08558. [Google Scholar]
  54. Kim, S.S.Y.; Meister, N.; Ramaswamy, V.V.; Fong, R. HIVE: Evaluating the Human Interpretability of Visual Explanations. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 280–298. [Google Scholar]
  55. Rader, E.; Cotter, K.; Cho, J. Explanations as Mechanisms for Supporting Algorithmic Transparency. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18, Montreal, Canada, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–13. [Google Scholar] [CrossRef]
  56. Ooge, J.; Kato, S.; Verbert, K. Explaining Recommendations in E-Learning: Effects on Adolescents’ Trust. In Proceedings of the 27th International Conference on Intelligent User Interfaces, in IUI ’22, Helsinki, Finland, 21–25 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 93–105. [Google Scholar] [CrossRef]
  57. Tsai, C.-H.; You, Y.; Gui, X.; Kou, Y.; Carroll, J.M. Exploring and Promoting Diagnostic Transparency and Explainability in Online Symptom Checkers. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  58. Guesmi, M.; Chatti, M.A.; Vorgerd, L.; Ngo, T.; Joarder, S.; Ain, Q.U.; Muslim, A. Explaining User Models with Different Levels of Detail for Transparent Recommendation: A User Study. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’22 Adjunct, Barcelona, Spain, 4–7 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 175–183. [Google Scholar] [CrossRef]
  59. Ford, C.; Keane, M.T. Explaining Classifications to Non-experts: An XAI User Study of Post-Hoc Explanations for a Classifier When People Lack Expertise. In Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges; Rousseau, J.J., Kapralos, B., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 246–260. [Google Scholar]
  60. Bansal, G.; Wu, T.; Zhou, J.; Fok, R.; Kamar, E.; Ribeiro, M.T.; Weld, D. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  61. Kim, D.H.; Hoque, E.; Agrawala, M. Answering Questions about Charts and Generating Visual Explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–13. [Google Scholar] [CrossRef]
  62. Dodge, J.; Penney, S.; Anderson, A.; Burnett, M.M. What Should Be in an XAI Explanation? What IFT Reveals. In Proceedings of the 2018 Joint ACM IUI Workshops Co-Located with the 23rd ACM Conference on Intelligent User Interfaces (ACM IUI 2018), Tokyo, Japan, 11 March 2018; CEUR Workshop Proceedings. Said, A., Komatsu, T., Eds.; Association for Computing Machinery: New York, NY, USA, 2018; Volume 2068. Available online: https://ceur-ws.org/Vol-2068/exss9.pdf (accessed on 11 October 2023).
  63. Schoonderwoerd, T.A.J.; Jorritsma, W.; Neerincx, M.A.; van den Bosch, K. Human-centered XAI: Developing design patterns for explanations of clinical decision support systems. Int. J. Hum. Comput. Stud. 2021, 154, 102684. [Google Scholar] [CrossRef]
  64. Paleja, R.; Ghuy, M.; Arachchige, N.R.; Jensen, R.; Gombolay, M. The Utility of Explainable AI in Ad Hoc Human-Machine Teaming. In Proceedings of the Advances in Neural Information Processing Systems, online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; pp. 610–623. Available online: https://dl.acm.org/doi/10.5555/3540261.3540308 (accessed on 11 October 2023).
  65. Alufaisan, Y.; Marusich, L.R.; Bakdash, J.Z.; Zhou, Y.; Kantarcioglu, M. Does Explainable Artificial Intelligence Improve Human Decision-Making? Proc. AAAI Conf. Artif. Intell. 2021, 35, 6618–6626. [Google Scholar] [CrossRef]
  66. Schaffer, J.; O’Donovan, J.; Michaelis, J.; Raglin, A.; Höllerer, T. I can do better than your AI: Expertise and explanations. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 240–251. [Google Scholar] [CrossRef]
  67. Colley, M.; Eder, B.; Rixen, J.O.; Rukzio, E. Effects of Semantic Segmentation Visualization on Trust, Situation Awareness, and Cognitive Load in Highly Automated Vehicles. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  68. Zhang, Y.; Liao, Q.V.; Bellamy, R.K.E. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, in FAT* ’20, Barcelona, Spain, 27–30 January 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 295–305. [Google Scholar] [CrossRef]
  69. Carton, S.; Mei, Q.; Resnick, P. Feature-Based Explanations Don’t Help People Detect Misclassifications of Online Toxicity. Proc. Int. AAAI Conf. Web Soc. Media 2020, 14, 95–106. [Google Scholar] [CrossRef]
  70. Schoeffer, J.; Kuehl, N.; Machowski, Y. ‘There Is Not Enough Information’: On the Effects of Explanations on Perceptions of Informational Fairness and Trustworthiness in Automated Decision-Making. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, in FAccT ’22, Seoul, Republic of Korea, 21–24 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1616–1628. [Google Scholar] [CrossRef]
  71. Kunkel, J.; Donkers, T.; Michael, L.; Barbu, C.-M.; Ziegler, J. Let Me Explain: Impact of Personal and Impersonal Explanations on Trust in Recommender Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–12. [Google Scholar] [CrossRef]
  72. Jeyakumar, J.V.; Noor, J.; Cheng, Y.-H.; Garcia, L.; Srivastava, M. How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 4211–4222. Available online: https://proceedings.neurips.cc/paper_files/paper/2020/file/2c29d89cc56cdb191c60db2f0bae796b-Paper.pdf (accessed on 25 January 2024).
  73. Harrison, G.; Hanson, J.; Jacinto, C.; Ramirez, J.; Ur, B. An Empirical Study on the Perceived Fairness of Realistic, Imperfect Machine Learning Models. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, in FAT* ’20, Barcelona, Spain, 27–30 January 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 392–402. [Google Scholar] [CrossRef]
  74. Weitz, K.; Alexander, Z.; Elisabeth, A. What do end-users really want? investigation of human-centered xai for mobile health apps. arXiv 2022, arXiv:2210.03506. [Google Scholar] [CrossRef]
  75. Fügener, A.; Grahl, J.; Gupta, A.; Ketter, W. Will Humans-in-The-Loop Become Borgs? Merits and Pitfalls of Working with AI. Manag. Inf. Syst. Q. (MISQ) 2021, 45, 1527–1556. [Google Scholar] [CrossRef]
  76. Jin, W.; Fatehi, M.; Guo, R.; Hamarneh, G. Evaluating the clinical utility of artificial intelligence assistance and its explanation on the glioma grading task. Artif. Intell. Med. 2024, 148, 102751. [Google Scholar] [CrossRef]
  77. Panigutti, C.; Hamon, R.; Hupont, I.; Llorca, D.F.; Yela, D.F.; Junklewitz, H.; Scalzo, S.; Mazzini, G.; Sanchez, I.; Garrido, J.S.; et al. The role of explainable AI in the context of the AI Act. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, in FAccT ’23, Chicago, IL, USA, 12–15 June 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1139–1150. [Google Scholar] [CrossRef]
  78. Lopes, P.; Silva, E.; Braga, C.; Oliveira, T.; Rosado, L. XAI Systems Evaluation: A Review of Human and Computer-Centred Methods. Appl. Sci. 2022, 12, 9423. [Google Scholar] [CrossRef]
  79. Kong, X.; Liu, S.; Zhu, L. Toward Human-centered XAI in Practice: A survey. Mach. Intell. Res. 2024, 21, 740–770. [Google Scholar] [CrossRef]
  80. Doshi-Velez, F.; Kim, B. Considerations for evaluation and generalization in interpretable machine learning. In Explainable and Interpretable Models in Computer Vision and Machine Learning; Springer: Cham, Switzerland, 2018; pp. 3–17. [Google Scholar]
  81. Kulesza, T.; Stumpf, S.; Burnett, M.; Kwan, I. Tell Me More? The Effects of Mental Model Soundness on Personalizing an Intelligent Agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, in CHI ’12, Austin, TX, USA, 5–10 May 2012; Association for Computing Machinery: New York, NY, USA, 2012; pp. 1–10. [Google Scholar] [CrossRef]
  82. Bansal, G.; Nushi, B.; Kamar, E.; Lasecki, W.S.; Weld, D.S.; Horvitz, E. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proc. AAAI Conf. Hum. Comput. Crowdsourc 2019, 7, 2–11. [Google Scholar] [CrossRef]
  83. Davenport, T.H.; Markus, M.L. Rigor vs. Relevance Revisited: Response to Benbasat and Zmud. MIS Q. 1999, 23, 19–23. [Google Scholar] [CrossRef]
  84. Islam, M.R.; Ahmed, M.U.; Barua, S.; Begum, S. A Systematic Review of Explainable Artificial Intelligence in Terms of Different Application Domains and Tasks. Appl. Sci. 2022, 12. [Google Scholar] [CrossRef]
  85. Cabitza, F.; Campagner, A.; Famiglini, L.; Gallazzi, E.; La Maida, G.A. Color Shadows (Part I): Exploratory Usability Evaluation of Activation Maps in Radiological Machine Learning. In Proceedings of the Machine Learning and Knowledge Extraction, Vienna, Austria, 23–26 August 2022; Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E., Eds.; Springer International Publishing: Vienna, Austria, 2022; pp. 31–50. [Google Scholar]
  86. Grgic-Hlaca, N.; Zafar, M.B.; Gummadi, K.P.; Weller, A. Beyond Distributive Fairness in Algorithmic Decision Making: Feature Selection for Procedurally Fair Learning. AAAI Conf. Artif. Intell. 2018, 32. [Google Scholar] [CrossRef]
  87. Dodge, J.; Liao, Q.V.; Zhang, Y.; Bellamy, R.K.E.; Dugan, C. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 275–285. [Google Scholar] [CrossRef]
  88. Kern, D.-R.; Dethier, E.; Alizadeh, F.; Stevens, G.; Naveed, S.; Du, D.; Shajalal, M. Peeking Inside the Schufa Blackbox: Explaining the German Housing Scoring System. arXiv 2023, arXiv:2311.11655. [Google Scholar]
  89. Naveed, S.; Ziegler, J. Featuristic: An interactive hybrid system for generating explainable recommendations—Beyond system accuracy. In Proceedings of the IntRS@RecSys, Rio de Janeiro, Brazil, 22–26 September 2020; Available online: https://api.semanticscholar.org/CorpusID:225063158 (accessed on 3 August 2023).
  90. Naveed, S.; Ziegler, J. Feature-Driven Interactive Recommendations and Explanations with Collaborative Filtering Approach. In Proceedings of the ComplexRec@ RecSys, Copenhagen, Denmark, 20 September 2019; p. 1015. [Google Scholar]
  91. Herlocker, J.L.; Konstan, J.A.; Riedl, J. Explaining Collaborative Filtering Recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, in CSCW ’00, Philadelphia, PA, USA, 2–6 December 2000; Association for Computing Machinery: New York, NY, USA, 2000; pp. 241–250. [Google Scholar] [CrossRef]
  92. Tintarev, N.; Masthoff, J. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook; Springer: New York, NY, USA, 2015. [Google Scholar] [CrossRef]
  93. Tintarev, N.; Masthoff, J. Designing and Evaluating Explanations for Recommender Systems. In Recommender Systems Handbook; Ricci, F., Rokach, L., Shapira, B., Kantor, P., Eds.; Springer: Boston, MA, USA, 2011; pp. 479–510. [Google Scholar] [CrossRef]
  94. Nunes, I.; Taylor, P.; Barakat, L.; Griffiths, N.; Miles, S. Explaining reputation assessments. Int. J. Hum. Comput. Stud. 2019, 123, 1–17. [Google Scholar] [CrossRef]
  95. Kouki, P.; Schaffer, J.; Pujara, J.; O’Donovan, J.; Getoor, L. Personalized explanations for hybrid recommender systems. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 379–390. [Google Scholar] [CrossRef]
  96. Le, N.L.; Abel, M.-H.; Gouspillou, P. Combining Embedding-Based and Semantic-Based Models for Post-Hoc Explanations in Recommender Systems. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oahu, HI, USA, 1–4 October 2023; pp. 4619–4624. [Google Scholar] [CrossRef]
  97. Raza, S.; Ding, C. News recommender system: A review of recent progress, challenges, and opportunities. Artif. Intell. Rev. 2022, 55, 749–800. [Google Scholar] [CrossRef]
  98. Wang, X.; Wang, D.; Xu, C.; He, X.; Cao, Y.; Chua, T.-S. Explainable Reasoning over Knowledge Graphs for Recommendation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5329–5336. [Google Scholar] [CrossRef]
  99. Ehsan, U.; Liao, Q.V.; Muller, M.; Riedl, M.O.; Weisz, J.D. Expanding Explainability: Towards Social Transparency in AI systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  100. Hudon, A.; Demazure, T.; Karran, A.J.; Léger, P.-M.; Sénécal, S. Explainable Artificial Intelligence (XAI): How the Visualization of AI Predictions Affects User Cognitive Load and Confidence. Inf. Syst. Neurosci. 2021, 52, 237–246. [Google Scholar] [CrossRef]
  101. Cramer, H.; Evers, V.; Ramlal, S.; Van Someren, M.; Rutledge, L.; Stash, N.; Aroyo, L.; Wielinga, B. The effects of transparency on trust in and acceptance of a content-based art recommender. User Model. User-Adapt. Interact. 2008, 18, 455–496. [Google Scholar] [CrossRef]
  102. Ehsan, U.; Tambwekar, P.; Chan, L.; Harrison, B.; Riedl, M.O. Automated rationale generation: A technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, Marina del Ray, CA, USA, 16–20 March 2019. [Google Scholar] [CrossRef]
  103. Hoffman, R.R.; Clancey, W.J.; Mueller, S.T. Explaining AI as an exploratory process: The peircean abduction model. arXiv 2020, arXiv:2009.14795. [Google Scholar] [CrossRef]
  104. Meske, C.; Bunde, E.; Schneider, J.; Gersch, M. Explainable Artificial Intelligence: Objectives, Stakeholders, and Future Research Opportunities. Inf. Syst. Manag. 2022, 39, 53–63. [Google Scholar] [CrossRef]
  105. Mohseni, S.; Zarei, N.; Ragan, E.D. A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems. ACM Trans. Interact. Intell. Syst. 2021, 11, 3–4. [Google Scholar] [CrossRef]
  106. Phillips, P.J.; Hahn, C.A.; Fontana, P.C.; Yates, A.N.; Greene, K.; Broniatowski, D.A.; Przybocki, M.A. Four Principles of Explainable Artificial Intelligence; NIST Interagency/Internal Report; NIST: Gaithersburg, MD, USA, 2021. [Google Scholar]
  107. Goodman, B.; Flaxman, S. European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’. AI Mag. 2017, 38, 50–57. [Google Scholar] [CrossRef]
  108. Lund, A.B. A Stakeholder Approach to Media Governance. In Managing Media Firms and Industries: What’s So Special About Media Management; Lowe, G., Brown, C., Eds.; Springer International Publishing: Vienna, Austria, 2016; pp. 103–120. [Google Scholar] [CrossRef]
  109. Rong, Y.; Leemann, T.; Nguyen, T.-T.; Fiedler, L.; Qian, P.; Unhelkar, V.; Seidel, T.; Kasneci, G.; Kasneci, E. Towards Human-centered Explainable AI: User Studies for Model Explanations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 2104–2122. [Google Scholar] [CrossRef] [PubMed]
  110. Rong, Y.; Castner, N.; Bozkir, E.; Kasneci, E. User trust on an explainable ai-based medical diagnosis support system. arXiv 2022, arXiv:2204.12230. [Google Scholar] [CrossRef]
  111. Páez, A. The Pragmatic Turn in Explainable Artificial Intelligence (XAI). Minds Mach. 2019, 29, 441–459. [Google Scholar] [CrossRef]
  112. Lim, B.Y.; Dey, A.K. Assessing demand for intelligibility in context-aware applications. In Proceedings of the 11th International Conference on Ubiquitous Computing, in UbiComp ’09, Orlando, FL, USA, 30 September–3 October 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 195–204. [Google Scholar] [CrossRef]
  113. Weld, D.S.; Bansal, G. The challenge of crafting intelligible intelligence. Commun. ACM 2019, 62, 70–79. [Google Scholar] [CrossRef]
  114. Knijnenburg, B.P.; Willemsen, M.C.; Kobsa, A. A pragmatic procedure to support the user-centric evaluation of recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, in RecSys ’11, Chicago, IL, USA, 23–27 October 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 321–324. [Google Scholar] [CrossRef]
  115. Dai, J.; Upadhyay, S.; Aivodji, U.; Bach, S.H.; Lakkaraju, H. Fairness via Explanation Quality: Evaluating Disparities in the Quality of Post hoc Explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, in AIES ’22, Palo Alto, CA, USA, 21–23 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 203–214. [Google Scholar] [CrossRef]
  116. Pu, P.; Chen, L.; Hu, R. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, in RecSys ’11, Chicago, IL, USA, 23–27 October 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 157–164. [Google Scholar] [CrossRef]
  117. Caro, L.M.; García, J.A.M. Cognitive–affective model of consumer satisfaction. An exploratory study within the framework of a sporting event. J. Bus. Res. 2007, 60, 108–114. [Google Scholar] [CrossRef]
  118. Myers, D.G.; Dewall, C.N. Psychology, 11th ed.; Worth Publishers: New York, NY, USA, 2021. [Google Scholar]
  119. Gedikli, F.; Jannach, D.; Ge, M. How should I explain? A comparison of different explanation types for recommender systems. Int. J. Hum. Comput. Stud. 2014, 72, 367–382. [Google Scholar] [CrossRef]
  120. Kahng, M.; Andrews, P.Y.; Kalro, A.; Chau, D.H. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models. IEEE Trans. Vis. Comput. Graph. 2018, 24, 88–97. [Google Scholar] [CrossRef]
  121. Buettner, R. Cognitive Workload of Humans Using Artificial Intelligence Systems: Towards Objective Measurement Applying Eye-Tracking Technology. In KI 2013: Advances in Artificial Intelligence; Timm, I.J., Thimm, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–48. [Google Scholar]
  122. Wu, Y.; Liu, Y.; Tsai, Y.R.; Yau, S. Investigating the role of eye movements and physiological signals in search satisfaction prediction using geometric analysis. J. Assoc. Inf. Sci. Technol. 2019, 70, 981–999. [Google Scholar] [CrossRef]
  123. Hassenzahl, M.; Kekez, R.; Burmester, M. The Importance of a software’s pragmatic quality depends on usage modes. In Proceedings of the 6th International Conference on Work with Display Units WWDU 2002, ERGONOMIC Institut für Arbeits-und Sozialforschung, Berlin, Germany, 22–25 May 2002; pp. 275–276. [Google Scholar]
  124. Nemeth, A.; Bekmukhambetova, A. Achieving Usability: Looking for Connections between User-Centred Design Practices and Resultant Usability Metrics in Agile Software Development. Period. Polytech. Soc. Manag. Sci. 2023, 31, 135–143. [Google Scholar] [CrossRef]
  125. Zhang, W.; Lim, B.Y. Towards Relatable Explainable AI with the Perceptual Process. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, in CHI ’22, New Orleans, LA, USA, 30 April–May 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  126. Nourani, M.; Kabir, S.; Mohseni, S.; Ragan, E.D. The Effects of Meaningful and Meaningless Explanations on Trust and Perceived System Accuracy in Intelligent Systems. Proc. AAAI Conf. Hum. Comput. Crowdsourc. 2019, 7, 97–105. [Google Scholar] [CrossRef]
  127. Abdul, A.; von der Weth, C.; Kankanhalli, M.; Lim, B.Y. COGAM: Measuring and Moderating Cognitive Load in Machine Learning Model Explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
  128. Lim, B.Y.; Dey, A.K.; Avrahami, D. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, in CHI ’09, Boston, MA, USA, 4–9 April 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 2119–2128. [Google Scholar] [CrossRef]
  129. Mayer, R.C.; Davis, J.H.; Schoorman, F.D. An Integrative Model of Organizational Trust. Acad. Manag. Rev. 1995, 20, 709–734. [Google Scholar] [CrossRef]
  130. Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Front. Comput. Sci. 2023, 5. [Google Scholar] [CrossRef]
  131. Das, D.; Chernova, S. Leveraging rationales to improve human task performance. In Proceedings of the 25th International Conference on Intelligent User Interfaces, in IUI ’20, Cagliari, Italy, 17–20 March 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 510–518. [Google Scholar] [CrossRef]
  132. de Andreis, F. A Theoretical Approach to the Effective Decision-Making Process. Open J. Appl. Sci. 2020, 10, 287–304. [Google Scholar] [CrossRef]
  133. Parasuraman, R.; Sheridan, T.B.; Wickens, C.D. Situation Awareness, Mental Workload, and Trust in Automation: Viable, Empirically Supported Cognitive Engineering Constructs. J. Cogn. Eng. Decis. Mak. 2008, 2, 140–160. [Google Scholar] [CrossRef]
  134. Pomplun, M.; Sunkara, S. Pupil Dilation as an Indicator of Cognitive Workload in Human-Computer Interaction. 2003. Available online: https://api.semanticscholar.org/CorpusID:1052200 (accessed on 7 October 2024).
  135. Cegarra, J.; Chevalier, A. The use of Tholos software for combining measures of mental workload: Toward theoretical and methodological improvements. Behav. Res. Methods 2008, 40, 988–1000. [Google Scholar] [CrossRef]
  136. Hart, S.G.; Staveland, L.E. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload; Hancock, P.A., Meshkati, N., Eds.; Advances in Psychology; Elsevier: Amsterdam, Netherlands, 1988; Volume 52, pp. 139–183. [Google Scholar] [CrossRef]
  137. Madsen, M.; Gregor, S. Measuring Human-Computer Trust. In Proceedings of the 11th Australasian Conference on Information Systems, Brisbane, Australia, 6–8 December 2000; Volume 53, pp. 6–8. [Google Scholar]
  138. Gefen, D. Reflections on the dimensions of trust and trustworthiness among online consumers. SIGMIS Database 2002, 33, 38–53. [Google Scholar] [CrossRef]
  139. Madsen, M.; Gregor, S.D. Measuring Human-Computer Trust. 2000. Available online: https://api.semanticscholar.org/CorpusID:18821611 (accessed on 9 June 2023).
  140. Stevens, G.; Bossauer, P. Who do you trust: Peers or Technology? A conjoint analysis about computational reputation mechanisms. In Proceedings of the 18th European Conference on Computer-Supported Cooperative Work, Siegen, Germany, 17–21 October 2020. [Google Scholar] [CrossRef]
  141. Wang, W.; Benbasat, I. Recommendation Agents for Electronic Commerce: Effects of Explanation Facilities on Trusting Beliefs. J. Manag. Inf. Syst. 2007, 23, 217–246. [Google Scholar] [CrossRef]
  142. Wahlström, M.; Tammentie, B.; Salonen, T.-T.; Karvonen, A. AI and the transformation of industrial work: Hybrid intelligence vs double-black box effect. Appl. Ergon. 2024, 118, 104271. [Google Scholar] [CrossRef]
  143. Rader, E.; Gray, R. Understanding User Beliefs About Algorithmic Curation in the Facebook News Feed. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, in CHI ’15, Seoul, Republic of Korea, 18–23 April 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 173–182. [Google Scholar] [CrossRef]
  144. Endsley, M.R. Situation awareness global assessment technique (SAGAT). In Proceedings of the IEEE 1988 National Aerospace and Electronics Conference, Dayton, OH, USA, 23–27 May 1988; Volume 3, pp. 789–795. [Google Scholar] [CrossRef]
  145. Flick, U. Doing Interview Research: The Essential How to Guide; Sage Publications: Thousand Oaks, CA, USA, 2021. [Google Scholar]
  146. Blandford, A.; Furniss, D.; Makri, S. Qualitative HCI Research: Going Behind the Scenes. In Synthesis Lectures on Human-Centered Informatics; Springer: Berlin/Heidelberg, Germany, 2016; Available online: https://api.semanticscholar.org/CorpusID:38190394 (accessed on 13 April 2022).
  147. Kelle, U. „Mixed Methods” in der Evaluationsforschung—Mit den Möglichkeiten und Beschränkungen quantitativer und qualitativer Methoden arbeiten. Z. Für Eval. 2018, 17, 25–52. Available online: https://www.proquest.com/scholarly-journals/mixed-methods-der-evaluationsforschung-mit-den/docview/2037015610/se-2?accountid=14644 (accessed on 24 July 2024).
  148. Gorber, S.C.; Tremblay, M.S. Self-Report and Direct Measures of Health: Bias and Implications. In The Objective Monitoring of Physical Activity: Contributions of Accelerometry to Epidemiology, Exercise Science and Rehabilitation; Shephard, R., Tudor-Locke, C., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 369–376. [Google Scholar] [CrossRef]
  149. Sikes, L.M.; Dunn, S.M. Subjective Experiences. In Encyclopedia of Personality and Individual Differences; Zeigler-Hill, V., Shackelford, T.K., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 5273–5275. [Google Scholar] [CrossRef]
  150. Mellouk, W.; Handouzi, W. Facial emotion recognition using deep learning: Review and insights. Procedia Comput. Sci. 2020, 175, 689–694. [Google Scholar] [CrossRef]
  151. Hinkin, T.R. A review of scale development practices in the study of organizations. J. Manag. 1995, 21, 967–988. [Google Scholar] [CrossRef]
  152. Creswell, J.W.; Clark, V.L.P. Revisiting mixed methods research designs twenty years later. In The Sage Handbook of Mixed Methods Research Design; Sage Publications: Thousand Oaks, CA, USA, 2023; pp. 21–36. [Google Scholar]
  153. Binns, R. Fairness in Machine Learning: Lessons from Political Philosophy. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; Friedler, S.A., Wilson, C., Eds.; Volume 81, pp. 149–159. Available online: https://proceedings.mlr.press/v81/binns18a.html (accessed on 5 November 2023).
  154. Eiband, M.; Buschek, D.; Kremer, A.; Hussmann, H. The Impact of Placebic Explanations on Trust in Intelligent Systems. In Proceedings of the Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI EA ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
  155. Binns, R.; Van Kleek, M.; Veale, M.; Lyngs, U.; Zhao, J.; Shadbolt, N. ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18, Montreal, QC, Canada, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–14. [Google Scholar] [CrossRef]
Figure 1. Distribution of selected papers across different evaluation objectives.
Figure 2. Evaluation scope of the studies in our sample.
Figure 3. XAI evaluation procedure.
Figure 4. The step-by-step approach of planning and conducting an XAI evaluation study.
Table 1. A summary of the explanatory analysis of XAI evaluation studies with respect to the existing literature. Each column in the table is color-coded to represent different elements of the XAI evaluation studies, with each element discussed in detail in various sections of the paper.
Literature Source | Objective | Scope | Procedure
Scope: Domain; Target Group. Procedure: Sampling and Participants; Method
Objective: Methodology-driven Evaluation; Concept-driven Evaluation; Domain-driven Evaluation. Domain: Domain-Specific; Domain-Agnostic/Not Stated. Target Group: Agnostic/Not Stated; Developers/Engineers; Managers/Regulators; End-Users. Sampling and Participants: Not Stated; Proxy Users; Real Users. Method: Interviews/Think Aloud; Observations; Questionnaires; Mixed-Methods
Chromik et al. [15]O OO Not applicableNot applicable
Alizadeh et al. [29] O OO O OO
Mohseni et al. [16]OO OO O O
Kaur et al. [30] O O OOO O
Lai et al. [31] O O O O O
Ngo et al. [32] O O O OO
Kulesza et al. [33] O O O O O
Sukkerd [34] OOO O O O
Hoffman et al. [35] O OO O O
Anik et al. [36] O O O O O O
Deters [37] O OO O O
Guo et al. [38]OO O O O O
Dominguez et al. [39]OO O O O OO
Dieber et al. [40]OO OO OO
Millecamp et al. [41] O O O O O O
Buçinca et al. [42] O O O O O O
Cheng et al. [43] O O OO O O
Holzinger et al. [44]OO O OO O
Jin [45]O OO O O
Papenmeier et al. [46] O O O O O O
Liao et al. [47] O O O O O
Cai et al. [48]OO O O O OO
Van der Waa et al. [49]OO O O O O
Poursabzi et al. [50]OO O O O O
Narayanan et al. [51] OOO O O OO
Liu et al. [52] O OO O O O
Schmidt et al. [53]O OO O O
Kim et al. [54]OO OO O OO
Rader et al. [55] O O O O O
Ooge et al. [56] O O O O OO
Naveed et al. [25] O O O O O O
Naveed et al. [26] OOO O O O
Naveed et al. [27] O O O O O
Tsai et al. [57] O O O O
Guesmi et al. [58] O O O O O
Naveed et al. [28] O O O O O
Ford et al. [59] OO O O O
Bansal et al. [60] OO OO O O
Kim et al. [61] O OO O OO
Dodge et al. [62] O O O O OO
Schoonderwoerd et al. [63]OO O O O OO O
Paleja et al. [64]OO O O O OO
Alufaisan et al. [65] O OO O O
Schaffer et al. [66]OO O O O OO
Colley et al. [67] O O O O O
Zhang et al. [68] O O O O O
Carton et al. [69] O O O O O
Schoeffer et al. [70] O O O O O
Kunkel et al. [71] O O O O O
Jeyakumar et al. [72]O O O O O
Harrison et al. [73] O O O O O
Weitz et al. [74]OO O O O O
Fügener et al. [75] O OO O O
Table 2. A summary of human-centered XAI evaluation measures.
Literature Source | Understandability | Usability | Integrity | Misc.
Measures: Mental Models; Perceived Understandability; Understanding Goodness/Soundness; Perceived Explanation Qualities; Satisfaction; Utility/Suitability; Performance/Workload; Controllability/Scrutability; Trust/Confidence; Perceived Fairness; Transparency; Other
Chromik et al. [15]O O O O O OPersuasiveness, Education, Debugging
Alizadeh et al. [29]O
Mohseni et al. [16]O O
Kaur et al. [30]O OO O Intention to use/purchase
Lai et al. [31]O OO
Ngo et al. [32]O O O ODiversity
Kulesza et al. [33] O O OOO Debugging
Sukkerd [34]O O O O
Hoffman et al. [35] OOOOOOO O Curiosity
Anik et al. [36]OO OO OO
Deters [37]OO OO O OPersuasiveness, Debugging, Situation Awareness, Learn/Edu.
Guo et al. [38] O OO OO
Dominguez et al. [39] O OO O Diversity
Dieber et al. [40] O O O
Millecamp et al. [41] O O O O Novelty, Intention to use/purchase
Buçinca et al. [42] OO OO O
Cheng et al. [43] OO O O
Holzinger et al. [44] O O
Jin [45] O OO O Plausibility (Plausibility measures how convincing AI explanations are to humans. It is typically measured in terms of quantitative metrics such as feature localization or feature correlation.)
Papenmeier et al. [46] O O Persuasiveness
Liao et al. [47] O OO Intention to use/purchase
Cai et al. [48] O O O
Van der Waa et al. [49] OO O O Persuasiveness
Poursabzi et al. [50] O O
Narayanan et al. [51] O O O
Liu et al. [52] O O O
Schmidt et al. [53] O O O
Kim et al. [54] O O
Rader et al. [55] O OODiversity, Situation Awareness
Ooge et al. [56] O O O OIntention to use/purchase
Naveed et al. [25] O O
Naveed et al. [26] OO
Naveed et al. [27] OO O O O
Tsai et al. [57] OO OOO OSituation Awareness, Learning/Education
Guesmi et al. [58] OO OOO OPersuasiveness
Naveed et al. [28] OOO O Diversity, Use Intentions
Ford et al. [59] OOOO O
Bansal et al. [60] OO O
Kim et al. [61] O O O
Dodge et al. [62] O
Schoonderwoerd et al. [63] O OO O Preferences
Paleja et al. [64] O O Situation Awareness
Alufaisan et al. [65] OOO
Schaffer et al. [66] O O Situation Awareness
Colley et al. [67] O O Situation Awareness
Zhang et al. [68] O O Persuasiveness
Carton et al. [69] O O
Schoeffer et al. [70] O OO
Kunkel et al. [71] O Intention to use/purchase
Jeyakumar et al. [72] Preferences
Harrison et al. [73] O Preferences
Weitz et al. [74] Preferences
Fügener et al. [75] O O Persuasiveness
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
