An Overview of the Empirical Evaluation of Explainable AI (XAI): A Comprehensive Guideline for User-Centered Evaluation in XAI
Abstract
1. Introduction
- What are the common practices in terms of patterns and essential elements in empirical evaluations of AI explanations?
- What pitfalls should be avoided, and what best practices, standards, and benchmarks should be established, for empirical evaluations of AI explanations?
2. Explainable Artificial Intelligence (XAI): Evaluation Theory
- Functionality-grounded evaluations require no human subjects; instead, the quality of explanations is assessed objectively using algorithmic metrics and formal definitions of interpretability (a minimal algorithmic sketch is given after this list of evaluation types).
- Application-grounded evaluations measure the quality of explanations by conducting experiments with end-users within an actual application.
- Human-grounded evaluations involve human subjects with less domain experience and measure general constructs related to explanations, such as understandability, trust, and usability, on a simplified task.
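To make the functionality-grounded category more concrete, the following is a minimal, illustrative sketch in Python of a simple "deletion" check: features are removed in order of their claimed importance and the decay of the model's confidence is recorded, so explanation quality is assessed algorithmically without human subjects. The synthetic data, the random-forest classifier, and the use of its built-in feature importances are assumptions made for this example, not part of the reviewed studies.

```python
# Sketch of a functionality-grounded check (no human subjects involved):
# a "deletion" faithfulness test for a feature-importance explanation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)            # synthetic task
model = RandomForestClassifier(random_state=0).fit(X, y)

def deletion_curve(model, x, importances, baseline=0.0):
    """Zero out features in order of claimed importance and record how the
    predicted probability of the originally predicted class decays."""
    order = np.argsort(importances)[::-1]                 # most "important" first
    cls = int(model.predict(x.reshape(1, -1))[0])
    x_mod = x.copy()
    probs = []
    for f in order:
        x_mod[f] = baseline                               # "delete" the feature
        probs.append(model.predict_proba(x_mod.reshape(1, -1))[0, cls])
    return np.array(probs)                                # steep early drop = faithful explanation

print(deletion_curve(model, X[0], model.feature_importances_))
```

A steeper drop for one explanation method than for another, averaged over many instances, would indicate that the first method's importance rankings track the model's behavior more faithfully.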
- Evaluation Objective and Scope: Evaluation studies can have different scopes as well as different objectives, such as understanding a general concept or improving a specific application. Hence, the first step in planning an evaluation study should be defining the objective and scope, including a specification of the intended application domain and target group. Such a specification is also essential for assessing instrument validity, referring to the process of ensuring that an evaluation method will measure the constructs accurately, reliably, and consistently. The scope of validity indicates the contexts in which the instrument has been validated and calibrated, and hence where it can be expected to measure effectively.
- Measurement Constructs and Metrics: Furthermore, it is important to specify what the measurement constructs of the study are and how they should be evaluated. In principle, measurement constructs could be any object, phenomenon, or property of interest that we seek to quantify. In user studies, they are typically theoretical constructs such as user satisfaction, user trust, or system intelligibility. Some constructs, such as task performance, can be directly measured. However, most constructs need to be operationalized through a set of measurable items. Operationalization includes selecting validated metrics and defining the measurement method. The method should describe the process of assigning a quantitative or qualitative value to a particular entity in a systematic way (a minimal sketch of such an operationalization is given after this list).
- Implementation and Procedure: Finally, the implementation of the study must be planned. This includes decisions about the study participants (e.g., members of the target group or proxy users) and recruitment methods (such as using a convenience sample, an online panel, or a sample representative of the target group/application domain). Additionally, one must consider whether a working system or a prototype should be evaluated and under which conditions (e.g., laboratory conditions or real-world settings). Furthermore, the data collection method should be specified. Generally, this can be categorized into observation, interviews, and surveys. Each method has its strengths, and the choice of method should align with the research objectives, scope, and nature of the constructs being measured.
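As an illustration of how a theoretical construct can be operationalized through measurable items, the following sketch averages several questionnaire items into one construct score per participant and computes Cronbach's alpha as a basic internal-consistency check of the kind used when validating an instrument psychometrically. The construct name, Likert responses, and numbers are hypothetical placeholders, not data from the reviewed studies.

```python
# Sketch: operationalizing a construct (e.g., "perceived understandability")
# as the mean of several Likert items, with Cronbach's alpha as a reliability check.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: participants x items matrix of numeric Likert responses."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of six participants to four items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 4, 3, 3],
])

construct_scores = responses.mean(axis=1)   # one construct score per participant
print("Cronbach's alpha:", round(cronbach_alpha(responses), 2))
print("Construct scores:", construct_scores)
```

A value of roughly 0.7 or higher is commonly read as acceptable internal consistency, although the appropriate threshold depends on the construct and the maturity of the instrument.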
Scoping Review Methodology
3. Evaluation Objectives
3.1. Studies About Evaluation Methodologies
3.2. Concept-Driven Evaluation Studies
3.3. Domain-Driven Evaluation Studies
4. Evaluation Scope
4.1. Target Domain
4.1.1. Healthcare
4.1.2. Judiciary
4.1.3. Finance Sector
4.1.4. E-Commerce
4.1.5. Media Sector
4.1.6. Transportation Sector
4.1.7. Science and Education
4.1.8. AI Engineering
4.1.9. Domain-Agnostic XAI Research
4.2. Target Group
4.2.1. Expertise
4.2.2. Role
4.2.3. Lay Persons
4.2.4. Professionals
4.3. Test Scenarios
4.3.1. Real-World Scenarios with Critical Impact
4.3.2. Illustrative Scenarios with Less Critical Impact
5. Evaluation Measures
5.1. Understandability
5.1.1. Mental Model
5.1.2. Perceived Understandability
5.1.3. The Goodness/Soundness of the Understanding
- Qualitative methods uncover users’ mental models through introspective techniques such as interviews, think-aloud protocols, or drawings made by the users.
- Subjective–quantitative methods assess perceived understandability through self-report measures.
- Objective–quantitative methods evaluate how accurately users can predict and explain system behavior based on their mental models (a minimal scoring sketch follows this list).
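As a minimal illustration of the objective–quantitative approach, the sketch below assumes a simple "forward prediction" task in which participants predict the system's output for unseen cases, and scores a participant's accuracy as a proxy for the quality of their mental model. The task design, labels, and function name are hypothetical and are not the instrument used in the reviewed studies.

```python
# Sketch: scoring an objective "forward prediction" task.
from typing import Sequence

def forward_prediction_accuracy(user_predictions: Sequence[str],
                                system_outputs: Sequence[str]) -> float:
    """Share of cases in which the participant correctly anticipated the system's output."""
    assert len(user_predictions) == len(system_outputs)
    hits = sum(p == s for p, s in zip(user_predictions, system_outputs))
    return hits / len(system_outputs)

# Hypothetical data: one participant's predictions vs. the system's actual decisions.
predicted = ["approve", "reject", "approve", "approve", "reject"]
actual    = ["approve", "reject", "reject",  "approve", "reject"]
print(forward_prediction_accuracy(predicted, actual))   # 0.8
```

Averaging such scores across participants and items yields a simple objective measure that can be compared across explanation conditions.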
5.1.4. Perceived Explanation Qualities
5.2. Usability
5.2.1. Satisfaction
5.2.2. Utility and Suitability
5.2.3. Task Performance and Cognitive Workload
5.2.4. User Control and Scrutability
5.3. Integrity Measures
5.3.1. Trust
5.3.2. Transparency
5.3.3. Fairness
5.4. Miscellaneous
5.4.1. Diversity, Novelty, and Curiosity
5.4.2. Persuasiveness, Plausibility, and Intention to Use
5.4.3. Intention to Use or Purchase
5.4.4. Explanation Preferences
5.4.5. Debugging Support
5.4.6. Situational Awareness
5.4.7. Learning and Education
6. Evaluation Procedure
6.1. Sampling and Participants
6.1.1. Real-User Studies
6.1.2. Proxy-User Studies
6.2. Evaluation Methods
6.2.1. Interviews
6.2.2. Observations
6.2.3. Questionnaires
6.2.4. Mixed-Methods Approach
7. Discussion: Pitfalls and Guidelines for Planning and Conducting XAI Evaluations
Guidelines: Carefully consider the tension between rigor and relevance from the very beginning when planning an evaluation study, as it influences both the evaluation scope and the methods used.
Pitfalls: A common pitfall in many evaluation studies is not explicitly defining the target domain, target group, and context of use. This lack of explication negatively affects both the planning of the study and the broader scientific community. During the planning phase, it complicates the formulation of test scenarios, recruitment strategies, and predictions regarding the impact of pragmatic research decisions (e.g., using proxy users instead of real users, evaluating a click dummy instead of a fully functional system, or using toy examples instead of real-world scenarios). During the publication phase, the missing explication impedes the assessment of the study’s scope of validity and reproducibility. Without clearly articulating the limitations imposed by early decisions (such as the choice of participants, test conditions, or simplified test scenarios), the results may be seen as less robust or generalizable. Guidelines: The systematic planning of an evaluation study should include a clear and explicit definition of the application domain, target group, and use context of the explanation system. This definition should be as precise as possible. However, an overly narrow scope may restrict the generalizability of the research findings, while an overly broad scope can blur the focus of the study, undermining both its systematic implementation and the relevance of its findings [154]. Striking the right balance is essential for ensuring both meaningful insights and the potential applicability of the results across different contexts.
Pitfalls: A common pitfall in many evaluation studies is the use of ad hoc questionnaires instead of standardized ones. This negatively affects both study planning and the scientific community: During the planning phase, creating ad hoc questionnaires adds to the cost, particularly when theoretical constructs are rigorously operationalized, including pre-testing the questionnaires and validating them psychometrically. During the publication phase, using non-standardized questionnaires complicates reproducibility, comparability, and the assessment of the study’s validity. Guidelines: The definition of measurement constructs should rely on standardized, validated questionnaires wherever they exist; where none are available, newly developed instruments should be pre-tested and psychometrically validated before use so that results remain comparable and reproducible across studies.
Guidelines: Essentially, there are three types of methods.
Pitfalls: Real-user and proxy-user sampling each come with their own advantages and disadvantages. A real-user approach is particularly challenging in niche domains beyond the mass market, especially where AI systems address sensitive topics or affect marginalized or hard-to-reach populations. Key sectors in this regard include healthcare, justice, and finance, where real-user studies typically comprise smaller sample sizes due to the specific conditions of the domain and the unique characteristics of the target group. Conversely, the availability of crowd workers and online panel platforms simplifies recruitment for proxy-user studies, enabling larger sample sizes. While recruiting proxy users can be beneficial for achieving a substantial sample size and is sometimes essential for gathering valuable insights, researchers must be mindful of the limitations and potential biases this approach introduces. It is crucial to carefully assess how accurately proxy users represent the target audience and to interpret the findings in light of these constraints. Relying on proxy users rather than real users from the target group can be viewed as a compromise driven by practical considerations. However, the decision to use proxy users is often made for pragmatic reasons without considering the implications for the study design and the applicability of the research findings to real-world scenarios. Guidelines: The sampling method has a serious impact on the study results; sometimes, a small sample of real users can yield more valid results than a large sample of proxy users. Therefore, the decision about the sampling method should be made intentionally, balancing the statistically required sample size, contextual relevance, and ecological validity against the practicalities of conducting the study in a time- and cost-efficient manner. In addition, researchers should articulate the rationale behind the sampling decision, its implications for the study design, and the resulting limitations of the findings.
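As a rough illustration of the "statistically required sample size" side of this trade-off, the sketch below runs an a priori power calculation with statsmodels, assuming a two-group comparison and a medium effect size; the concrete numbers are placeholders for illustration, not recommendations derived from the reviewed studies.

```python
# Sketch: a priori power analysis for an independent-samples comparison,
# to make the real-user vs. proxy-user sample-size trade-off explicit.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Participants required per group for a medium effect (d = 0.5),
# alpha = 0.05 and a target power of 0.80 -> roughly 64 per group.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("required n per group:", round(n_per_group))

# Achieved power if only 20 real users per group can be recruited
# for the same effect size -> noticeably below the 0.80 target.
print("power with n = 20:", round(analysis.power(effect_size=0.5, nobs1=20, alpha=0.05), 2))
```

Such a calculation does not settle the real-user versus proxy-user question by itself, but it makes explicit what a small real-user sample can and cannot detect, which supports the intentional, documented sampling decision called for above.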
8. Conclusions
- Balancing Methodological Rigor and Practical Relevance—Future research should maintain a balance between rigorous theoretical evaluations and practical, real-world applications. Researchers must tailor their methods to the study type (concept-driven, methodological, or domain-driven). Studies need to either prioritize controlled environments for hypothesis testing or adopt more qualitative, context-driven methods when exploring new areas.
- A Clear Definition of the Evaluation Scope—Future studies need to clearly define their evaluation scope, including the domain, target group, and context of use. Avoiding vague or overly broad scopes is essential for maintaining focus and ensuring meaningful, generalizable results.
- The Standardization of Measurement Metrics—A major pitfall in XAI research is the inconsistent use of evaluation metrics. Researchers should use standardized questionnaires and tools to enable comparisons across studies. If none are available, newly developed metrics must be rigorously validated before use.
- The Use of Real Users Over Proxy Users—Although proxy users are sometimes needed, real users should be prioritized in domain-specific evaluations to improve ecological validity and generalizability. Researchers must clearly justify their user group choice and acknowledge the limitations of using proxies.
- The Exploration of Mixed-Methods Evaluations—To better understand human–AI interactions, mixed-methods approaches combining qualitative and quantitative methods are essential. Triangulating interviews, questionnaires, and observational data will offer deeper insights into how users engage with and perceive AI explanations. Furthermore, future research needs to focus on conducting longitudinal studies to explore how user trust and reliance on AI explanations evolve over time. This is especially important in high-stakes fields like healthcare, where initial trust may differ from long-term trust based on system performance and explanation quality.
- Domain-Specific Research—Future research must move beyond general-purpose evaluations to focus on domain-specific needs, especially in sensitive areas like healthcare, finance, law, and autonomous systems. Each domain may have unique requirements for explainability, and evaluations must be designed to consider these domain-specific constraints and expectations. For this, we propose the need for developing domain-specific evaluation metrics.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18, Montreal, QC, Canada, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–18. [Google Scholar] [CrossRef]
- Shneiderman, B. Human-Centered Artificial Intelligence: Reliable, Safe & Trustworthy. Int. J. Hum. Comput. Interact. 2020, 36, 495–504. [Google Scholar] [CrossRef]
- Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. Available online: https://api.semanticscholar.org/CorpusID:11319376 (accessed on 12 April 2024).
- Herrmann, T.; Pfeiffer, S. Keeping the organization in the loop: A socio-technical extension of human-centered artificial intelligence. AI Soc. 2023, 38, 1523–1542. [Google Scholar] [CrossRef]
- Vilone, G.; Longo, L. Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 2021, 76, 89–106. [Google Scholar] [CrossRef]
- Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
- Gunning, D.; Aha, D. DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Mag. 2019, 40, 44–58. [Google Scholar] [CrossRef]
- Nunes, I.; Jannach, D. A systematic review and taxonomy of explanations in decision support and recommender systems. User Model. User-Adapt. Interact. 2017, 27, 393–444. [Google Scholar] [CrossRef]
- Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar] [CrossRef]
- Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics 2021, 10, 593. [Google Scholar] [CrossRef]
- Doshi-Velez, F.; Kortz, M.; Budish, R.; Bavitz, C.; Gershman, S.; O’Brien, D.; Scott, K.; Schieber, S.; Waldo, J.; Wood, A.; et al. Accountability of AI under the law: The role of explanation. arXiv 2017, arXiv:1711.01134. [Google Scholar] [CrossRef]
- Nguyen, A.; Martínez, M.R. MonoNet: Towards Interpretable Models by Learning Monotonic Features. arXiv 2019, arXiv:1909.13611. [Google Scholar] [CrossRef]
- Rosenfeld, A. Better Metrics for Evaluating Explainable Artificial Intelligence. 2021. Available online: https://api.semanticscholar.org/CorpusID:233453690 (accessed on 25 July 2024).
- Sharp, H.; Preece, J.; Rogers, Y. Interaction Design: Beyond Human-Computer Interaction; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002. [Google Scholar]
- Chromik, M.; Schuessler, M. A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI. In Proceedings of the ExSS-ATEC@IUI, Cagliari, Italy, 17–20 March 2020; Available online: https://api.semanticscholar.org/CorpusID:214730454 (accessed on 29 November 2023).
- Mohseni, S.; Block, J.E.; Ragan, E. Quantitative Evaluation of Machine Learning Explanations: A Human-Grounded Benchmark. In Proceedings of the 26th International Conference on Intelligent User Interfaces, in IUI ’21, College Station, TX, USA, 14–17 April 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 22–31. [Google Scholar] [CrossRef]
- Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
- Anjomshoae, S.; Najjar, A.; Calvaresi, D.; Främling, K. Explainable Agents and Robots: Results from a Systematic Literature Review. In Proceedings of the AAMAS ’19: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 1078–1088. Available online: http://www.ifaamas.org/Proceedings/aamas2019/pdfs/p1078.pdf (accessed on 29 November 2023).
- Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Comput. Surv. 2023, 55, 3583558. [Google Scholar] [CrossRef]
- Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32. [Google Scholar] [CrossRef]
- Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
- Litwin, M.S. How to Measure Survey Reliability and Validity; Sage: Thousand Oaks, CA, USA, 1995; Volume 7. [Google Scholar]
- DeVellis, R.F.; Thorpe, C.T. Scale Development: Theory and Applications; Sage: Thousand Oaks, CA, USA, 2003. [Google Scholar]
- Raykov, T.; Marcoulides, G.A. Introduction to Psychometric Theory, 1st ed.; Routledge: London, UK, 2010. [Google Scholar]
- Naveed, S.; Kern, D.R.; Stevens, G. Explainable Robo-Advisors: Empirical Investigations to Specify and Evaluate a User-Centric Taxonomy of Explanations in the Financial Domain. In Proceedings of the IntRS@RecSys, Seattle, WA, USA, 18–23 September 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 85–103. [Google Scholar]
- Millecamp, M.; Naveed, S.; Verbert, K.; Ziegler, J. To Explain or not to Explain: The Effects of Personal Characteristics when Explaining Feature-based Recommendations in Different Domains. In Proceedings of the IntRS@RecSys, Copenhagen, Denmark, 16–19 September 2019; Association for Computing Machinery: New York, NY, USA, 2019. Available online: https://api.semanticscholar.org/CorpusID:203415984 (accessed on 29 November 2023).
- Naveed, S.; Loepp, B.; Ziegler, J. On the Use of Feature-based Collaborative Explanations: An Empirical Comparison of Explanation Styles. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’20 Adjunct, Genoa Italy, 12–18 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 226–232. [Google Scholar] [CrossRef]
- Naveed, S.; Donkers, T.; Ziegler, J. Argumentation-Based Explanations in Recommender Systems: Conceptual Framework and Empirical Results. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, in UMAP ’18, Singapore, 8–11 July 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 293–298. [Google Scholar] [CrossRef]
- Alizadeh, F.; Stevens, G.; Esau, M. An Empirical Study of Folk Concepts and People’s Expectations of Current and Future Artificial Intelligence. i-com 2021, 20, 3–17. [Google Scholar] [CrossRef]
- Kaur, H.; Nori, H.; Jenkins, S.; Caruana, R.; Wallach, H.; Vaughan, J.W. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
- Lai, V.; Liu, H.; Tan, C. ‘Why is “Chicago” deceptive? ’ Towards Building Model-Driven Tutorials for Humans. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–13. [Google Scholar] [CrossRef]
- Ngo, T.; Kunkel, J.; Ziegler, J. Exploring Mental Models for Transparent and Controllable Recommender Systems: A Qualitative Study. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’20, Genoa Italy, 12–18 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 183–191. [Google Scholar] [CrossRef]
- Kulesza, T.; Stumpf, S.; Burnett, M.; Yang, S.; Kwan, I.; Wong, W.-K. Too much, too little, or just right? Ways explanations impact end users’ mental models. In Proceedings of the 2013 IEEE Symposium on Visual Languages and Human Centric Computing, San Jose, CA, USA, 15–19 September 2013; pp. 3–10. [Google Scholar] [CrossRef]
- Sukkerd, R. Improving Transparency and Intelligibility of Multi-Objective Probabilistic Planning. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2022. [Google Scholar] [CrossRef]
- Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Metrics for Explainable AI: Challenges and Prospects. arXiv 2019, arXiv:1812.04608. [Google Scholar] [CrossRef]
- Anik, A.I.; Bunt, A. Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Deters, H. Criteria and Metrics for the Explainability of Software; Gottfried Wilhelm Leibniz Universität: Hannover, Germany, 2022. [Google Scholar]
- Guo, L.; Daly, E.M.; Alkan, O.; Mattetti, M.; Cornec, O.; Knijnenburg, B. Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules. In Proceedings of the 27th International Conference on Intelligent User Interfaces, in IUI ’22, Helsinki, Finland, 21–25 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 537–548. [Google Scholar] [CrossRef]
- Dominguez, V.; Messina, P.; Donoso-Guzmán, I.; Parra, D. The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 408–416. [Google Scholar] [CrossRef]
- Dieber, J.; Kirrane, S. A novel model usability evaluation framework (MUsE) for explainable artificial intelligence. Inf. Fusion 2022, 81, 143–153. [Google Scholar] [CrossRef]
- Millecamp, M.; Htun, N.N.; Conati, C.; Verbert, K. To explain or not to explain: The effects of personal characteristics when explaining music recommendations. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 397–407. [Google Scholar] [CrossRef]
- Buçinca, Z.; Lin, P.; Gajos, K.Z.; Glassman, E.L. Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, in IUI ’20, Cagliari, Italy, 17–20 March 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 454–464. [Google Scholar] [CrossRef]
- Cheng, H.-F.; Wang, R.; Zhang, Z.; O’Connell, F.; Gray, T.; Harper, F.M.; Zhu, H. Explaining Decision-Making Algorithms through UI: Strategies to Help Non-Expert Stakeholders. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–12. [Google Scholar] [CrossRef]
- Holzinger, A.; Carrington, A.; Müller, H. Measuring the Quality of Explanations: The System Causability Scale (SCS). KI—Künstliche Intell. 2020, 34, 193–198. [Google Scholar] [CrossRef]
- Jin, W.; Hamarneh, G. The XAI alignment problem: Rethinking how should we evaluate human-centered AI explainability techniques. arXiv 2023, arXiv:2303.17707. [Google Scholar] [CrossRef]
- Papenmeier, A.; Englebienne, G.; Seifert, C. How model accuracy and explanation fidelity influence user trust. arXiv 2019, arXiv:1907.12652. [Google Scholar] [CrossRef]
- Liao, M.; Sundar, S.S. How Should AI Systems Talk to Users when Collecting their Personal Information? Effects of Role Framing and Self-Referencing on Human-AI Interaction. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Cai, C.J.; Jongejan, J.; Holbrook, J. The effects of example-based explanations in a machine learning interface. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Marina del Ray, CA, USA, 17–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 258–262. [Google Scholar] [CrossRef]
- van der Waa, J.; Nieuwburg, E.; Cremers, A.; Neerincx, M. Evaluating XAI: A comparison of rule-based and example-based explanations. Artif. Intell. 2021, 291, 103404. [Google Scholar] [CrossRef]
- Poursabzi-Sangdeh, F.; Goldstein, D.G.; Hofman, J.M.; Vaughan, J.W.W.; Wallach, H. Manipulating and Measuring Model Interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Narayanan, M.; Chen, E.; He, J.; Kim, B.; Gershman, S.; Doshi-Velez, F. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. arXiv 2018, arXiv:1802.00682. [Google Scholar] [CrossRef]
- Liu, H.; Lai, V.; Tan, C. Understanding the Effect of Out-of-distribution Examples and Interactive Explanations on Human-AI Decision Making. Proc. ACM Hum.-Comput. Interact. 2021, 5 (CSCW2), 1–45. [Google Scholar] [CrossRef]
- Schmidt, P.; Biessmann, F. Quantifying Interpretability and Trust in Machine Learning Systems. arXiv 2019, arXiv:1901.08558. [Google Scholar]
- Kim, S.S.Y.; Meister, N.; Ramaswamy, V.V.; Fong, R. HIVE: Evaluating the Human Interpretability of Visual Explanations. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 280–298. [Google Scholar]
- Rader, E.; Cotter, K.; Cho, J. Explanations as Mechanisms for Supporting Algorithmic Transparency. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18, Montreal, Canada, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–13. [Google Scholar] [CrossRef]
- Ooge, J.; Kato, S.; Verbert, K. Explaining Recommendations in E-Learning: Effects on Adolescents’ Trust. In Proceedings of the 27th International Conference on Intelligent User Interfaces, in IUI ’22, Helsinki, Finland, 21–25 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 93–105. [Google Scholar] [CrossRef]
- Tsai, C.-H.; You, Y.; Gui, X.; Kou, Y.; Carroll, J.M. Exploring and Promoting Diagnostic Transparency and Explainability in Online Symptom Checkers. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Guesmi, M.; Chatti, M.A.; Vorgerd, L.; Ngo, T.; Joarder, S.; Ain, Q.U.; Muslim, A. Explaining User Models with Different Levels of Detail for Transparent Recommendation: A User Study. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, in UMAP ’22 Adjunct, Barcelona, Spain, 4–7 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 175–183. [Google Scholar] [CrossRef]
- Ford, C.; Keane, M.T. Explaining Classifications to Non-experts: An XAI User Study of Post-Hoc Explanations for a Classifier When People Lack Expertise. In Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges; Rousseau, J.J., Kapralos, B., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 246–260. [Google Scholar]
- Bansal, G.; Wu, T.; Zhou, J.; Fok, R.; Kamar, E.; Ribeiro, M.T.; Weld, D. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Kim, D.H.; Hoque, E.; Agrawala, M. Answering Questions about Charts and Generating Visual Explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–13. [Google Scholar] [CrossRef]
- Dodge, J.; Penney, S.; Anderson, A.; Burnett, M.M. What Should Be in an XAI Explanation? What IFT Reveals. In Proceedings of the 2018 Joint ACM IUI Workshops Co-Located with the 23rd ACM Conference on Intelligent User Interfaces (ACM IUI 2018), Tokyo, Japan, 11 March 2018; CEUR Workshop Proceedings. Said, A., Komatsu, T., Eds.; Association for Computing Machinery: New York, NY, USA, 2018; Volume 2068. Available online: https://ceur-ws.org/Vol-2068/exss9.pdf (accessed on 11 October 2023).
- Schoonderwoerd, T.A.J.; Jorritsma, W.; Neerincx, M.A.; van den Bosch, K. Human-centered XAI: Developing design patterns for explanations of clinical decision support systems. Int. J. Hum. Comput. Stud. 2021, 154, 102684. [Google Scholar] [CrossRef]
- Paleja, R.; Ghuy, M.; Arachchige, N.R.; Jensen, R.; Gombolay, M. The Utility of Explainable AI in Ad Hoc Human-Machine Teaming. In Proceedings of the Advances in Neural Information Processing Systems, online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; pp. 610–623. Available online: https://dl.acm.org/doi/10.5555/3540261.3540308 (accessed on 11 October 2023).
- Alufaisan, Y.; Marusich, L.R.; Bakdash, J.Z.; Zhou, Y.; Kantarcioglu, M. Does Explainable Artificial Intelligence Improve Human Decision-Making? Proc. AAAI Conf. Artif. Intell. 2021, 35, 6618–6626. [Google Scholar] [CrossRef]
- Schaffer, J.; O’Donovan, J.; Michaelis, J.; Raglin, A.; Höllerer, T. I can do better than your AI: Expertise and explanations. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 240–251. [Google Scholar] [CrossRef]
- Colley, M.; Eder, B.; Rixen, J.O.; Rukzio, E. Effects of Semantic Segmentation Visualization on Trust, Situation Awareness, and Cognitive Load in Highly Automated Vehicles. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–12 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Zhang, Y.; Liao, Q.V.; Bellamy, R.K.E. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, in FAT* ’20, Barcelona, Spain, 27–30 January 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 295–305. [Google Scholar] [CrossRef]
- Carton, S.; Mei, Q.; Resnick, P. Feature-Based Explanations Don’t Help People Detect Misclassifications of Online Toxicity. Proc. Int. AAAI Conf. Web Soc. Media 2020, 14, 95–106. [Google Scholar] [CrossRef]
- Schoeffer, J.; Kuehl, N.; Machowski, Y. ‘There Is Not Enough Information’: On the Effects of Explanations on Perceptions of Informational Fairness and Trustworthiness in Automated Decision-Making. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, in FAccT ’22, Seoul, Republic of Korea, 21–24 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1616–1628. [Google Scholar] [CrossRef]
- Kunkel, J.; Donkers, T.; Michael, L.; Barbu, C.-M.; Ziegler, J. Let Me Explain: Impact of Personal and Impersonal Explanations on Trust in Recommender Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–12. [Google Scholar] [CrossRef]
- Jeyakumar, J.V.; Noor, J.; Cheng, Y.-H.; Garcia, L.; Srivastava, M. How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 4211–4222. Available online: https://proceedings.neurips.cc/paper_files/paper/2020/file/2c29d89cc56cdb191c60db2f0bae796b-Paper.pdf (accessed on 25 January 2024).
- Harrison, G.; Hanson, J.; Jacinto, C.; Ramirez, J.; Ur, B. An Empirical Study on the Perceived Fairness of Realistic, Imperfect Machine Learning Models. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, in FAT* ’20, Barcelona, Spain, 27–30 January 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 392–402. [Google Scholar] [CrossRef]
- Weitz, K.; Alexander, Z.; Elisabeth, A. What do end-users really want? investigation of human-centered xai for mobile health apps. arXiv 2022, arXiv:2210.03506. [Google Scholar] [CrossRef]
- Fügener, A.; Grahl, J.; Gupta, A.; Ketter, W. Will Humans-in-The-Loop Become Borgs? Merits and Pitfalls of Working with AI. Manag. Inf. Syst. Q. (MISQ) 2021, 45, 1527–1556. [Google Scholar] [CrossRef]
- Jin, W.; Fatehi, M.; Guo, R.; Hamarneh, G. Evaluating the clinical utility of artificial intelligence assistance and its explanation on the glioma grading task. Artif. Intell. Med. 2024, 148, 102751. [Google Scholar] [CrossRef]
- Panigutti, C.; Hamon, R.; Hupont, I.; Llorca, D.F.; Yela, D.F.; Junklewitz, H.; Scalzo, S.; Mazzini, G.; Sanchez, I.; Garrido, J.S.; et al. The role of explainable AI in the context of the AI Act. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, in FAccT ’23, Chicago, IL, USA, 12–15 June 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1139–1150. [Google Scholar] [CrossRef]
- Lopes, P.; Silva, E.; Braga, C.; Oliveira, T.; Rosado, L. XAI Systems Evaluation: A Review of Human and Computer-Centred Methods. Appl. Sci. 2022, 12, 9423. [Google Scholar] [CrossRef]
- Kong, X.; Liu, S.; Zhu, L. Toward Human-centered XAI in Practice: A survey. Mach. Intell. Res. 2024, 21, 740–770. [Google Scholar] [CrossRef]
- Doshi-Velez, F.; Kim, B. Considerations for evaluation and generalization in interpretable machine learning. In Explainable and Interpretable Models in Computer Vision and Machine Learning; Springer: Cham, Switzerland, 2018; pp. 3–17. [Google Scholar]
- Kulesza, T.; Stumpf, S.; Burnett, M.; Kwan, I. Tell Me More? The Effects of Mental Model Soundness on Personalizing an Intelligent Agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, in CHI ’12, Austin, TX, USA, 5–10 May 2012; Association for Computing Machinery: New York, NY, USA, 2012; pp. 1–10. [Google Scholar] [CrossRef]
- Bansal, G.; Nushi, B.; Kamar, E.; Lasecki, W.S.; Weld, D.S.; Horvitz, E. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proc. AAAI Conf. Hum. Comput. Crowdsourc 2019, 7, 2–11. [Google Scholar] [CrossRef]
- Davenport, T.H.; Markus, M.L. Rigor vs. Relevance Revisited: Response to Benbasat and Zmud. MIS Q. 1999, 23, 19–23. [Google Scholar] [CrossRef]
- Islam, M.R.; Ahmed, M.U.; Barua, S.; Begum, S. A Systematic Review of Explainable Artificial Intelligence in Terms of Different Application Domains and Tasks. Appl. Sci. 2022, 12. [Google Scholar] [CrossRef]
- Cabitza, F.; Campagner, A.; Famiglini, L.; Gallazzi, E.; La Maida, G.A. Color Shadows (Part I): Exploratory Usability Evaluation of Activation Maps in Radiological Machine Learning. In Proceedings of the Machine Learning and Knowledge Extraction, Vienna, Austria, 23–26 August 2022; Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E., Eds.; Springer International Publishing: Vienna, Austria, 2022; pp. 31–50. [Google Scholar]
- Grgic-Hlaca, N.; Zafar, M.B.; Gummadi, K.P.; Weller, A. Beyond Distributive Fairness in Algorithmic Decision Making: Feature Selection for Procedurally Fair Learning. AAAI Conf. Artif. Intell. 2018, 32. [Google Scholar] [CrossRef]
- Dodge, J.; Liao, Q.V.; Zhang, Y.; Bellamy, R.K.E.; Dugan, C. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 275–285. [Google Scholar] [CrossRef]
- Kern, D.-R.; Dethier, E.; Alizadeh, F.; Stevens, G.; Naveed, S.; Du, D.; Shajalal, M. Peeking Inside the Schufa Blackbox: Explaining the German Housing Scoring System. arXiv 2023, arXiv:2311.11655. [Google Scholar]
- Naveed, S.; Ziegler, J. Featuristic: An interactive hybrid system for generating explainable recommendations—Beyond system accuracy. In Proceedings of the IntRS@RecSys, Rio de Janeiro, Brazil, 22–26 September 2020; Available online: https://api.semanticscholar.org/CorpusID:225063158 (accessed on 3 August 2023).
- Naveed, S.; Ziegler, J. Feature-Driven Interactive Recommendations and Explanations with Collaborative Filtering Approach. In Proceedings of the ComplexRec@ RecSys, Copenhagen, Denmark, 20 September 2019; p. 1015. [Google Scholar]
- Herlocker, J.L.; Konstan, J.A.; Riedl, J. Explaining Collaborative Filtering Recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, in CSCW ’00, Philadelphia, PA, USA, 2–6 December 2000; Association for Computing Machinery: New York, NY, USA, 2000; pp. 241–250. [Google Scholar] [CrossRef]
- Tintarev, N.; Masthoff, J. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook; Springer: New York, NY, USA, 2015. [Google Scholar] [CrossRef]
- Tintarev, N.; Masthoff, J. Designing and Evaluating Explanations for Recommender Systems. In Recommender Systems Handbook; Ricci, F., Rokach, L., Shapira, B., Kantor, P., Eds.; Springer: Boston, MA, USA, 2011; pp. 479–510. [Google Scholar] [CrossRef]
- Nunes, I.; Taylor, P.; Barakat, L.; Griffiths, N.; Miles, S. Explaining reputation assessments. Int. J. Hum. Comput. Stud. 2019, 123, 1–17. [Google Scholar] [CrossRef]
- Kouki, P.; Schaffer, J.; Pujara, J.; O’Donovan, J.; Getoor, L. Personalized explanations for hybrid recommender systems. In Proceedings of the 24th International Conference on Intelligent User Interfaces, in IUI ’19, Los Angeles, CA, USA, 16–20 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 379–390. [Google Scholar] [CrossRef]
- Le, N.L.; Abel, M.-H.; Gouspillou, P. Combining Embedding-Based and Semantic-Based Models for Post-Hoc Explanations in Recommender Systems. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oahu, HI, USA, 1–4 October 2023; pp. 4619–4624. [Google Scholar] [CrossRef]
- Raza, S.; Ding, C. News recommender system: A review of recent progress, challenges, and opportunities. Artif. Intell. Rev. 2022, 55, 749–800. [Google Scholar] [CrossRef]
- Wang, X.; Wang, D.; Xu, C.; He, X.; Cao, Y.; Chua, T.-S. Explainable Reasoning over Knowledge Graphs for Recommendation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5329–5336. [Google Scholar] [CrossRef]
- Ehsan, U.; Liao, Q.V.; Muller, M.; Riedl, M.O.; Weisz, J.D. Expanding Explainability: Towards Social Transparency in AI systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Hudon, A.; Demazure, T.; Karran, A.J.; Léger, P.-M.; Sénécal, S. Explainable Artificial Intelligence (XAI): How the Visualization of AI Predictions Affects User Cognitive Load and Confidence. Inf. Syst. Neurosci. 2021, 52, 237–246. [Google Scholar] [CrossRef]
- Cramer, H.; Evers, V.; Ramlal, S.; Van Someren, M.; Rutledge, L.; Stash, N.; Aroyo, L.; Wielinga, B. The effects of transparency on trust in and acceptance of a content-based art recommender. User Model. User-Adapt. Interact. 2008, 18, 455–496. [Google Scholar] [CrossRef]
- Ehsan, U.; Tambwekar, P.; Chan, L.; Harrison, B.; Riedl, M.O. Automated rationale generation: A technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, Marina del Ray, CA, USA, 16–20 March 2019. [Google Scholar] [CrossRef]
- Hoffman, R.R.; Clancey, W.J.; Mueller, S.T. Explaining AI as an exploratory process: The peircean abduction model. arXiv 2020, arXiv:2009.14795. [Google Scholar] [CrossRef]
- Meske, C.; Bunde, E.; Schneider, J.; Gersch, M. Explainable Artificial Intelligence: Objectives, Stakeholders, and Future Research Opportunities. Inf. Syst. Manag. 2022, 39, 53–63. [Google Scholar] [CrossRef]
- Mohseni, S.; Zarei, N.; Ragan, E.D. A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems. ACM Trans. Interact. Intell. Syst. 2021, 11, 3–4. [Google Scholar] [CrossRef]
- Phillips, P.J.; Hahn, C.A.; Fontana, P.C.; Yates, A.N.; Greene, K.; Broniatowski, D.A.; Przybocki, M.A. Four Principles of Explainable Artificial Intelligence; NIST Interagency/Internal Report; NIST: Gaithersburg, MD, USA, 2021. [Google Scholar]
- Goodman, B.; Flaxman, S. European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’. AI Mag. 2017, 38, 50–57. [Google Scholar] [CrossRef]
- Lund, A.B. A Stakeholder Approach to Media Governance. In Managing Media Firms and Industries: What’s So Special About Media Management; Lowe, G., Brown, C., Eds.; Springer International Publishing: Vienna, Austria, 2016; pp. 103–120. [Google Scholar] [CrossRef]
- Rong, Y.; Leemann, T.; Nguyen, T.-T.; Fiedler, L.; Qian, P.; Unhelkar, V.; Seidel, T.; Kasneci, G.; Kasneci, E. Towards Human-centered Explainable AI: User Studies for Model Explanations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 2104–2122. [Google Scholar] [CrossRef] [PubMed]
- Rong, Y.; Castner, N.; Bozkir, E.; Kasneci, E. User trust on an explainable ai-based medical diagnosis support system. arXiv 2022, arXiv:2204.12230. [Google Scholar] [CrossRef]
- Páez, A. The Pragmatic Turn in Explainable Artificial Intelligence (XAI). Minds Mach. 2019, 29, 441–459. [Google Scholar] [CrossRef]
- Lim, B.Y.; Dey, A.K. Assessing demand for intelligibility in context-aware applications. In Proceedings of the 11th International Conference on Ubiquitous Computing, in UbiComp ’09, Orlando, FL, USA, 30 September–3 October 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 195–204. [Google Scholar] [CrossRef]
- Weld, D.S.; Bansal, G. The challenge of crafting intelligible intelligence. Commun. ACM 2019, 62, 70–79. [Google Scholar] [CrossRef]
- Knijnenburg, B.P.; Willemsen, M.C.; Kobsa, A. A pragmatic procedure to support the user-centric evaluation of recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, in RecSys ’11, Chicago, IL, USA, 23–27 October 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 321–324. [Google Scholar] [CrossRef]
- Dai, J.; Upadhyay, S.; Aivodji, U.; Bach, S.H.; Lakkaraju, H. Fairness via Explanation Quality: Evaluating Disparities in the Quality of Post hoc Explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, in AIES ’22, Palo Alto, CA, USA, 21–23 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 203–214. [Google Scholar] [CrossRef]
- Pu, P.; Chen, L.; Hu, R. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, in RecSys ’11, Chicago, IL, USA, 23–27 October 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 157–164. [Google Scholar] [CrossRef]
- Caro, L.M.; García, J.A.M. Cognitive–affective model of consumer satisfaction. An exploratory study within the framework of a sporting event. J. Bus. Res. 2007, 60, 108–114. [Google Scholar] [CrossRef]
- Myers, D.G.; Dewall, C.N. Psychology, 11th ed.; Worth Publishers: New York, NY, USA, 2021. [Google Scholar]
- Gedikli, F.; Jannach, D.; Ge, M. How should I explain? A comparison of different explanation types for recommender systems. Int. J. Hum. Comput. Stud. 2014, 72, 367–382. [Google Scholar] [CrossRef]
- Kahng, M.; Andrews, P.Y.; Kalro, A.; Chau, D.H. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models. IEEE Trans. Vis. Comput. Graph. 2018, 24, 88–97. [Google Scholar] [CrossRef]
- Buettner, R. Cognitive Workload of Humans Using Artificial Intelligence Systems: Towards Objective Measurement Applying Eye-Tracking Technology. In KI 2013: Advances in Artificial Intelligence; Timm, I.J., Thimm, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–48. [Google Scholar]
- Wu, Y.; Liu, Y.; Tsai, Y.R.; Yau, S. Investigating the role of eye movements and physiological signals in search satisfaction prediction using geometric analysis. J. Assoc. Inf. Sci. Technol. 2019, 70, 981–999. [Google Scholar] [CrossRef]
- Hassenzahl, M.; Kekez, R.; Burmester, M. The Importance of a software’s pragmatic quality depends on usage modes. In Proceedings of the 6th International Conference on Work with Display Units WWDU 2002, ERGONOMIC Institut für Arbeits-und Sozialforschung, Berlin, Germany, 22–25 May 2002; pp. 275–276. [Google Scholar]
- Nemeth, A.; Bekmukhambetova, A. Achieving Usability: Looking for Connections between User-Centred Design Practices and Resultant Usability Metrics in Agile Software Development. Period. Polytech. Soc. Manag. Sci. 2023, 31, 135–143. [Google Scholar] [CrossRef]
- Zhang, W.; Lim, B.Y. Towards Relatable Explainable AI with the Perceptual Process. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, in CHI ’22, New Orleans, LA, USA, 30 April–May 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
- Nourani, M.; Kabir, S.; Mohseni, S.; Ragan, E.D. The Effects of Meaningful and Meaningless Explanations on Trust and Perceived System Accuracy in Intelligent Systems. Proc. AAAI Conf. Hum. Comput. Crowdsourc. 2019, 7, 97–105. [Google Scholar] [CrossRef]
- Abdul, A.; von der Weth, C.; Kankanhalli, M.; Lim, B.Y. COGAM: Measuring and Moderating Cognitive Load in Machine Learning Model Explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, in CHI ’20, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
- Lim, B.Y.; Dey, A.K.; Avrahami, D. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, in CHI ’09, Boston, MA, USA, 4–9 April 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 2119–2128. [Google Scholar] [CrossRef]
- Mayer, R.C.; Davis, J.H.; Schoorman, F.D. An Integrative Model of Organizational Trust. Acad. Manag. Rev. 1995, 20, 709–734. [Google Scholar] [CrossRef]
- Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Front. Comput. Sci. 2023, 5. [Google Scholar] [CrossRef]
- Das, D.; Chernova, S. Leveraging rationales to improve human task performance. In Proceedings of the 25th International Conference on Intelligent User Interfaces, in IUI ’20, Cagliari, Italy, 17–20 March 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 510–518. [Google Scholar] [CrossRef]
- de Andreis, F. A Theoretical Approach to the Effective Decision-Making Process. Open J. Appl. Sci. 2020, 10, 287–304. [Google Scholar] [CrossRef]
- Parasuraman, R.; Sheridan, T.B.; Wickens, C.D. Situation Awareness, Mental Workload, and Trust in Automation: Viable, Empirically Supported Cognitive Engineering Constructs. J. Cogn. Eng. Decis. Mak. 2008, 2, 140–160. [Google Scholar] [CrossRef]
- Pomplun, M.; Sunkara, S. Pupil Dilation as an Indicator of Cognitive Workload in Human-Computer Interaction. 2003. Available online: https://api.semanticscholar.org/CorpusID:1052200 (accessed on 7 October 2024).
- Cegarra, J.; Chevalier, A. The use of Tholos software for combining measures of mental workload: Toward theoretical and methodological improvements. Behav. Res. Methods 2008, 40, 988–1000. [Google Scholar] [CrossRef]
- Hart, S.G.; Staveland, L.E. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload; Hancock, P.A., Meshkati, N., Eds.; Advances in Psychology; Elsevier: Amsterdam, Netherlands, 1988; Volume 52, pp. 139–183. [Google Scholar] [CrossRef]
- Madsen, M.; Gregor, S. Measuring Human-Computer Trust. In Proceedings of the 11th Australasian Conference on Information Systems, Brisbane, Australia, 6–8 December 2000; Volume 53, pp. 6–8. [Google Scholar]
- Gefen, D. Reflections on the dimensions of trust and trustworthiness among online consumers. SIGMIS Database 2002, 33, 38–53. [Google Scholar] [CrossRef]
- Madsen, M.; Gregor, S.D. Measuring Human-Computer Trust. 2000. Available online: https://api.semanticscholar.org/CorpusID:18821611 (accessed on 9 June 2023).
- Stevens, G.; Bossauer, P. Who do you trust: Peers or Technology? A conjoint analysis about computational reputation mechanisms. In Proceedings of the 18th European Conference on Computer-Supported Cooperative Work, Siegen, Germany, 17–21 October 2020. [Google Scholar] [CrossRef]
- Wang, W.; Benbasat, I. Recommendation Agents for Electronic Commerce: Effects of Explanation Facilities on Trusting Beliefs. J. Manag. Inf. Syst. 2007, 23, 217–246. [Google Scholar] [CrossRef]
- Wahlström, M.; Tammentie, B.; Salonen, T.-T.; Karvonen, A. AI and the transformation of industrial work: Hybrid intelligence vs double-black box effect. Appl. Ergon. 2024, 118, 104271. [Google Scholar] [CrossRef]
- Rader, E.; Gray, R. Understanding User Beliefs About Algorithmic Curation in the Facebook News Feed. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, in CHI ’15, Seoul, Republic of Korea, 18–23 April 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 173–182. [Google Scholar] [CrossRef]
- Endsley, M.R. Situation awareness global assessment technique (SAGAT). In Proceedings of the IEEE 1988 National Aerospace and Electronics Conference, Dayton, OH, USA, 23–27 May 1988; Volume 3, pp. 789–795. [Google Scholar] [CrossRef]
- Flick, U. Doing Interview Research: The Essential How to Guide; Sage Publications: Thousand Oaks, CA, USA, 2021. [Google Scholar]
- Blandford, A.; Furniss, D.; Makri, S. Qualitative HCI Research: Going Behind the Scenes. In Synthesis Lectures on Human-Centered Informatics; Springer: Berlin/Heidelberg, Germany, 2016; Available online: https://api.semanticscholar.org/CorpusID:38190394 (accessed on 13 April 2022).
- Kelle, U. „Mixed Methods” in der Evaluationsforschung—Mit den Möglichkeiten und Beschränkungen quantitativer und qualitativer Methoden arbeiten. Z. Für Eval. 2018, 17, 25–52. Available online: https://www.proquest.com/scholarly-journals/mixed-methods-der-evaluationsforschung-mit-den/docview/2037015610/se-2?accountid=14644 (accessed on 24 July 2024).
- Gorber, S.C.; Tremblay, M.S. Self-Report and Direct Measures of Health: Bias and Implications. In The Objective Monitoring of Physical Activity: Contributions of Accelerometry to Epidemiology, Exercise Science and Rehabilitation; Shephard, R., Tudor-Locke, C., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 369–376. [Google Scholar] [CrossRef]
- Sikes, L.M.; Dunn, S.M. Subjective Experiences. In Encyclopedia of Personality and Individual Differences; Zeigler-Hill, V., Shackelford, T.K., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 5273–5275. [Google Scholar] [CrossRef]
- Mellouk, W.; Handouzi, W. Facial emotion recognition using deep learning: Review and insights. Procedia Comput. Sci. 2020, 175, 689–694. [Google Scholar] [CrossRef]
- Hinkin, T.R. A review of scale development practices in the study of organizations. J. Manag. 1995, 21, 967–988. [Google Scholar] [CrossRef]
- Creswell, J.W.; Clark, V.L.P. Revisiting mixed methods research designs twenty years later. In The Sage Handbook of Mixed Methods Research Design; Sage Publications: Thousand Oaks, CA, USA, 2023; pp. 21–36. [Google Scholar]
- Binns, R. Fairness in Machine Learning: Lessons from Political Philosophy. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; Friedler, S.A., Wilson, C., Eds.; Volume 81, pp. 149–159. Available online: https://proceedings.mlr.press/v81/binns18a.html (accessed on 5 November 2023).
- Eiband, M.; Buschek, D.; Kremer, A.; Hussmann, H. The Impact of Placebic Explanations on Trust in Intelligent Systems. In Proceedings of the Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, in CHI EA ’19, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Binns, R.; Van Kleek, M.; Veale, M.; Lyngs, U.; Zhao, J.; Shadbolt, N. ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, in CHI ’18, Montreal, QC, Canada, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–14. [Google Scholar] [CrossRef]
Literature Source | Objective: Methodology-Driven Evaluation | Objective: Concept-Driven Evaluation | Objective: Domain-Driven Evaluation | Scope (Domain): Domain-Specific | Scope (Domain): Domain-Agnostic/Not Stated | Scope (Target Group): Agnostic/Not Stated | Scope (Target Group): Developers/Engineers | Scope (Target Group): Managers/Regulators | Scope (Target Group): End-Users | Procedure (Sampling): Not Stated | Procedure (Sampling): Proxy Users | Procedure (Sampling): Real Users | Procedure (Method): Interviews/Think-Aloud | Procedure (Method): Observations | Procedure (Method): Questionnaires | Procedure (Method): Mixed-Methods
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Chromik et al. [15] | O | O | O | Not applicable | Not applicable | |||||||||||
Alizadeh et al. [29] | O | O | O | O | O | O | ||||||||||
Mohseni et al. [16] | O | O | O | O | O | O | ||||||||||
Kaur et al. [30] | O | O | O | O | O | O | ||||||||||
Lai et al. [31] | O | O | O | O | O | |||||||||||
Ngo et al. [32] | O | O | O | O | O | |||||||||||
Kulesza et al. [33] | O | O | O | O | O | |||||||||||
Sukkerd [34] | O | O | O | O | O | O | ||||||||||
Hoffman et al. [35] | O | O | O | O | O | |||||||||||
Anik et al. [36] | O | O | O | O | O | O | ||||||||||
Deters [37] | O | O | O | O | O | |||||||||||
Guo et al. [38] | O | O | O | O | O | O | ||||||||||
Dominguez et al. [39] | O | O | O | O | O | O | O | |||||||||
Dieber et al. [40] | O | O | O | O | O | O | ||||||||||
Millecamp et al. [41] | O | O | O | O | O | O | ||||||||||
Buçinca et al. [42] | O | O | O | O | O | O |
Cheng et al. [43] | O | O | O | O | O | O | ||||||||||
Holzinger et al. [44] | O | O | O | O | O | O | ||||||||||
Jin [45] | O | O | O | O | O | |||||||||||
Papenmeier et al. [46] | O | O | O | O | O | O | ||||||||||
Liao et al. [47] | O | O | O | O | O | |||||||||||
Cai et al. [48] | O | O | O | O | O | O | O | |||||||||
Van der Waa et al. [49] | O | O | O | O | O | O |
Poursabzi et al. [50] | O | O | O | O | O | O | ||||||||||
Narayanan et al. [51] | O | O | O | O | O | O | O | |||||||||
Liu et al. [52] | O | O | O | O | O | O | ||||||||||
Schmidt et al. [53] | O | O | O | O | O | |||||||||||
Kim et al. [54] | O | O | O | O | O | O | O | |||||||||
Rader et al. [55] | O | O | O | O | O | |||||||||||
Ooge et al. [56] | O | O | O | O | O | O | ||||||||||
Naveed et al. [25] | O | O | O | O | O | O | ||||||||||
Naveed et al. [26] | O | O | O | O | O | O | ||||||||||
Naveed et al. [27] | O | O | O | O | O | |||||||||||
Tsai et al. [57] | O | O | O | O | ||||||||||||
Guesmi et al. [58] | O | O | O | O | O | |||||||||||
Naveed et al. [28] | O | O | O | O | O | |||||||||||
Ford et al. [59] | O | O | O | O | O | |||||||||||
Bansal et al. [60] | O | O | O | O | O | O | ||||||||||
Kim et al. [61] | O | O | O | O | O | O | ||||||||||
Dodge et al. [62] | O | O | O | O | O | O | ||||||||||
Schoonderwoerd et al. [63] | O | O | O | O | O | O | O | O | ||||||||
Paleja et al. [64] | O | O | O | O | O | O | O | |||||||||
Alufaisan et al. [65] | O | O | O | O | O | |||||||||||
Schaffer et al. [66] | O | O | O | O | O | O | O | |||||||||
Colley et al. [67] | O | O | O | O | O | |||||||||||
Zhang et al. [68] | O | O | O | O | O | |||||||||||
Carton et al. [69] | O | O | O | O | O | |||||||||||
Schoeffer et al. [70] | O | O | O | O | O | |||||||||||
Kunkel et al. [71] | O | O | O | O | O | |||||||||||
Jeyakumar et al. [72] | O | O | O | O | O | |||||||||||
Harrison et al. [73] | O | O | O | O | O | |||||||||||
Weitz et al. [74] | O | O | O | O | O | O | ||||||||||
Fügener et al. [75] | O | O | O | O | O |
Literature Source | Understandability: Mental Models | Understandability: Perceived Understandability | Understandability: Understanding Goodness/Soundness | Understandability: Perceived Explanation Qualities | Usability: Satisfaction | Usability: Utility/Suitability | Usability: Performance/Workload | Usability: Controllability/Scrutability | Integrity: Trust/Confidence | Integrity: Perceived Fairness | Integrity: Transparency | Misc.: Other
---|---|---|---|---|---|---|---|---|---|---|---|---
Chromik et al. [15] | O | O | O | O | O | O | Persuasiveness, Education, Debugging | |||||
Alizadeh et al. [29] | O | |||||||||||
Mohseni et al. [16] | O | O | ||||||||||
Kaur et al. [30] | O | O | O | O | Intention to use/purchase | |||||||
Lai et al. [31] | O | O | O | |||||||||
Ngo et al. [32] | O | O | O | O | Diversity | |||||||
Kulesza et al. [33] | O | O | O | O | O | Debugging | ||||||
Sukkerd [34] | O | O | O | O | ||||||||
Hoffman et al. [35] | O | O | O | O | O | O | O | O | Curiosity | |||
Anik et al. [36] | O | O | O | O | O | O |
Deters [37] | O | O | O | O | O | O | Persuasiveness, Debugging, Situation Awareness, Learn/Edu. | |||||
Guo et al. [38] | O | O | O | O | O | |||||||
Dominguez et al. [39] | O | O | O | O | Diversity | |||||||
Dieber et al. [40] | O | O | O | |||||||||
Millecamp et al. [41] | O | O | O | O | Novelty, Intention to use/purchase | |||||||
Buçinca et al. [42] | O | O | O | O | O |
Cheng et al. [43] | O | O | O | O | ||||||||
Holzinger et al. [44] | O | O | ||||||||||
Jin [45] | O | O | O | O | Plausibility (Plausibility measures how convincing AI explanations are to humans. It is typically measured in terms of quantitative metrics such as feature localization or feature correlation.) |
Papenmeier et al. [46] | O | O | Persuasiveness | |||||||||
Liao et al. [47] | O | O | O | Intention to use/purchase | ||||||||
Cai et al. [48] | O | O | O | |||||||||
Van der Waa et al. [49] | O | O | O | O | Persuasiveness | |||||||
Poursabzi et al. [50] | O | O | ||||||||||
Narayanan et al. [51] | O | O | O | |||||||||
Liu et al. [52] | O | O | O | |||||||||
Schmidt et al. [53] | O | O | O | |||||||||
Kim et al. [54] | O | O |
Rader et al. [55] | O | O | O | Diversity, Situation Awareness | ||||||||
Ooge et al. [56] | O | O | O | O | Intention to use/purchase | |||||||
Naveed et al. [25] | O | O | ||||||||||
Naveed et al. [26] | O | O | ||||||||||
Naveed et al. [27] | O | O | O | O | O | |||||||
Tsai et al. [57] | O | O | O | O | O | O | Situation Awareness, Learning/Education | |||||
Guesmi et al. [58] | O | O | O | O | O | O | Persuasiveness | |||||
Naveed et al. [28] | O | O | O | O | Diversity, Use Intentions | |||||||
Ford et al. [59] | O | O | O | O | O | |||||||
Bansal et al. [60] | O | O | O | |||||||||
Kim et al. [61] | O | O | O | |||||||||
Dodge et al. [62] | O | |||||||||||
Schoonderwoerd et al. [63] | O | O | O | O | Preferences | |||||||
Paleja et al. [64] | O | O | Situation Awareness | |||||||||
Alufaisan et al. [65] | O | O | O | |||||||||
Schaffer et al. [66] | O | O | Situation Awareness | |||||||||
Colley et al. [67] | O | O | Situation Awareness | |||||||||
Zhang et al. [68] | O | O | Persuasiveness | |||||||||
Carton et al. [69] | O | O | ||||||||||
Schoeffer et al. [70] | O | O | O | |||||||||
Kunkel et al. [71] | O | Intention to use/purchase | ||||||||||
Jeyakumar et al. [72] | Preferences | |||||||||||
Harrison et al. [73] | O | Preferences | ||||||||||
Weitz et al. [74] | Preferences | |||||||||||
Fügener et al. [75] | O | O | Persuasiveness |