
How can I choose an explainer?: An Application-grounded Evaluation of Post-hoc Explanations

Published: 01 March 2021
DOI: 10.1145/3442188.3445941

Abstract

Several research works have proposed new Explainable AI (XAI) methods designed to generate model explanations with specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular XAI methods - LIME, SHAP, and TreeInterpreter - on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just the transaction data, without access to the model score or explanations; Data + ML Model Score; and Data + ML Model Score + Explanations. Using strong statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Among the highlights of our conclusions: i) the Data Only variant results in the highest decision accuracy and the slowest decision time among all variants tested; ii) all the explainers improve accuracy over the Data + ML Model Score variant, but still result in lower accuracy than Data Only; iii) LIME was the least preferred by users, probably due to its substantially lower variability of explanations from case to case.
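As a concrete illustration of the explainer side of the experiment, the following is a minimal sketch, not the authors' code, of how the three evaluated methods could be asked to attribute a single transaction's score to its features. It assumes a fitted scikit-learn tree-ensemble fraud model and illustrative names (model, X_train, x, feature_names); the deployed model, data, and feature set used in the study are not reproduced here.

from lime.lime_tabular import LimeTabularExplainer
from treeinterpreter import treeinterpreter as ti
import shap

def explain_transaction(model, X_train, x, feature_names):
    """Per-transaction feature attributions from LIME, SHAP, and TreeInterpreter.

    Assumptions (illustrative only): model is a fitted scikit-learn tree
    ensemble (e.g., RandomForestClassifier), X_train is a 2-D numpy array of
    training transactions, x is a 1-D numpy array for the transaction under
    review, and feature_names lists the column names.
    """
    # LIME: fit a sparse local surrogate around the instance and report its weights.
    lime_explainer = LimeTabularExplainer(
        X_train, feature_names=feature_names,
        class_names=["legitimate", "fraud"], discretize_continuous=True)
    lime_exp = lime_explainer.explain_instance(
        x, model.predict_proba, num_features=len(feature_names))

    # SHAP: Shapley-value attributions from the tree-specific explainer.
    shap_values = shap.TreeExplainer(model).shap_values(x.reshape(1, -1))

    # TreeInterpreter: decompose the prediction into a bias term plus one
    # additive contribution per feature.
    prediction, bias, contributions = ti.predict(model, x.reshape(1, -1))

    return lime_exp.as_list(), shap_values, contributions

Under the protocol described above, attributions like these would be shown to the analysts only in the final stage, Data + ML Model Score + Explanations.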


        Published In

FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency
March 2021, 899 pages
ISBN: 9781450383097
DOI: 10.1145/3442188

Publisher: Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Evaluation
        2. Explainability
        3. LIME
        4. SHAP
        5. User Study
        6. XAI
