DOI: 10.1145/3375627.3375830
research-article
Open access

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Published: 07 February 2020

Abstract

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
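To make the attack concrete, the sketch below illustrates the scaffolding idea in Python under stated assumptions; it is a minimal illustration, not the authors' released implementation. The class and argument names (ScaffoldedClassifier, biased_model, innocuous_model) are hypothetical, and the random-forest out-of-distribution detector is one plausible choice. The essential mechanism follows the abstract's description: a detector is trained to distinguish genuine inputs from the kind of synthetic perturbations LIME and SHAP generate, so the biased model answers only queries that look like real data while off-distribution explanation queries are routed to an innocuous surrogate.

# Minimal sketch of the scaffolding attack described above (illustrative, not the authors' code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ScaffoldedClassifier:
    """Wraps a biased classifier so perturbation-based explainers see an innocuous surrogate."""

    def __init__(self, biased_model, innocuous_model, X_real, X_perturbed):
        self.biased_model = biased_model        # e.g., predicts from a sensitive attribute
        self.innocuous_model = innocuous_model  # e.g., predicts from harmless, uncorrelated features
        # Out-of-distribution detector: trained to separate genuine inputs (label 1)
        # from synthetic perturbations of the sort LIME/SHAP would generate (label 0).
        self.ood_detector = RandomForestClassifier(n_estimators=100)
        X = np.vstack([X_real, X_perturbed])
        y = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_perturbed))])
        self.ood_detector.fit(X, y)

    def predict(self, X):
        X = np.asarray(X)
        in_dist = self.ood_detector.predict(X).astype(bool)
        preds = np.empty(len(X), dtype=int)
        if in_dist.any():      # real-looking inputs: biased predictions on the true data distribution
            preds[in_dist] = self.biased_model.predict(X[in_dist])
        if (~in_dist).any():   # perturbation-like inputs: innocuous predictions seen by the explainer
            preds[~in_dist] = self.innocuous_model.predict(X[~in_dist])
        return preds

Explaining such a wrapper with, say, lime.lime_tabular.LimeTabularExplainer or shap.KernelExplainer would then attribute importance largely to the innocuous model's features, even though predictions on real inputs remain biased.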

References

[1]
Ulrich Aivodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. 2019. Fairwashing: the risk of rationalization. In International Conference on Machine Learning . 161--170.
[2]
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica (2016).
[3]
Arthur Asuncion and David Newman. 2007. UCI Machine Learning Repository (2007).
[4]
C. Blake, E. Keogh, and C. J. Merz. 1999. UCI Repository of Machine Learning Databases. University of California at Irvine (1999).
[5]
Ann-Kathrin Dombrowski, Maximilian Alber, Christopher J Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. 2019. Explanations can be manipulated and geometry is to blame. arXiv preprint arXiv:1906.07983 (2019).
[6]
Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017).
[7]
Radwa Elshawi, Mouaz H Al-Mallah, and Sherif Sakr. 2019. On the interpretability of machine learning-based model for predicting hypertension. BMC medical informatics and decision making, Vol. 19, 1 (2019), 146.
[8]
Amirata Ghorbani, Abubakar Abid, and James Zou. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3681--3688.
[9]
Juyeon Heo, Sunghwan Joo, and Taesup Moon. 2019. Fooling Neural Network Interpretations via Adversarial Model Manipulation. In Advances in Neural Information Processing Systems 32. 2921--2932.
[10]
Mark Ibrahim, Melissa Louie, Ceena Modarres, and John Paisley. 2019. Global Explanations of Neural Networks: Mapping the Landscape of Predictions. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES '19). 279--287.
[11]
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viégas, and Rory Sayres. 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm, Sweden, 2668--2677.
[12]
Joshua A Kroll, Solon Barocas, Edward W Felten, Joel R Reidenberg, David G Robinson, and Harlan Yu. 2016. Accountable algorithms. U. Pa. L. Rev., Vol. 165 (2016), 633.
[13]
Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How we analyzed the COMPAS recidivism algorithm. ProPublica (May 2016).
[14]
Zachary C. Lipton. 2018. The Mythos of Model Interpretability. Queue, Vol. 16, 3, Article 30 (June 2018), 27 pages.
[15]
Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Neural Information Processing Systems (NIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765--4774.
[16]
Brent Mittelstadt, Chris Russell, and Sandra Wachter. 2019. Explaining explanations in AI. In Proceedings of the conference on fairness, accountability, and transparency. ACM, 279--288.
[17]
M Redmond. 2011. Communities and crime unnormalized data set. UCI Machine Learning Repository. (2011).
[18]
Michael Redmond and Alok Baveja. 2002. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, Vol. 141, 3 (2002), 660--678.
[19]
General Data Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ), Vol. 59, 1--88 (2016), 294.
[20]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Knowledge Discovery and Data Mining (KDD).
[21]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.
[22]
Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, Vol. 1, 5 (2019), 206.
[23]
Andrew D Selbst and Solon Barocas. 2018. The intuitive appeal of explainable machines. Fordham L. Rev., Vol. 87 (2018), 1085.
[24]
Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. 2018. Distill-and-compare: auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. ACM, 303--310.
[25]
Leanne S Whitmore, Anthe George, and Corey M Hudson. 2016. Mapping chemical performance on molecular structures using locally interpretable explanations. arXiv preprint arXiv:1611.07443 (2016).


Information & Contributors

Published In

AIES '20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
February 2020
439 pages
ISBN:9781450371100
DOI:10.1145/3375627
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 February 2020


Author Tags

  1. adversarial attacks
  2. bias detection
  3. black box explanations
  4. model interpretability

Qualifiers

  • Research-article

Conference

AIES '20

Acceptance Rates

Overall acceptance rate: 61 of 162 submissions (38%)


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 2,971
  • Downloads (last 6 weeks): 319
Reflects downloads up to 12 Jan 2025

Cited By

  • (2025) Exploring happiness factors with explainable ensemble learning in a global pandemic. PLOS ONE 20:1 (e0313276). DOI: 10.1371/journal.pone.0313276. Online publication date: 2-Jan-2025.
  • (2025) Integrating Model-Informed Drug Development With AI: A Synergistic Approach to Accelerating Pharmaceutical Innovation. Clinical and Translational Science 18:1. DOI: 10.1111/cts.70124. Online publication date: 10-Jan-2025.
  • (2025) Generalized Relevance Learning Grassmann Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 47:1 (502-513). DOI: 10.1109/TPAMI.2024.3466315. Online publication date: Jan-2025.
  • (2025) Enhanced Prediction and Optimization of Oil–Water Emulsion Stability Through Application of Machine Learning and Explainable Artificial Intelligence on TFBG Sensor Data. IEEE Sensors Letters 9:1 (1-4). DOI: 10.1109/LSENS.2024.3503752. Online publication date: Jan-2025.
  • (2025) Answering new urban questions: Using eXplainable AI-driven analysis to identify determinants of Airbnb price in Dublin. Expert Systems with Applications 260 (125360). DOI: 10.1016/j.eswa.2024.125360. Online publication date: Jan-2025.
  • (2025) Demystifying the black box: A survey on explainable artificial intelligence (XAI) in bioinformatics. Computational and Structural Biotechnology Journal 27 (346-359). DOI: 10.1016/j.csbj.2024.12.027. Online publication date: 2025.
  • (2025) Nullius in Explanans: an ethical risk assessment for explainable AI. Ethics and Information Technology 27:1. DOI: 10.1007/s10676-024-09800-7. Online publication date: 1-Mar-2025.
  • (2025) Artificial intelligence-based cardiovascular/stroke risk stratification in women affected by autoimmune disorders: a narrative survey. Rheumatology International 45:1. DOI: 10.1007/s00296-024-05756-5. Online publication date: 2-Jan-2025.
  • (2025) Manipulation Risks in Explainable AI: The Implications of the Disagreement Problem. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (185-200). DOI: 10.1007/978-3-031-74633-8_12. Online publication date: 1-Jan-2025.
  • (2025) Using Part-Based Representations for Explainable Deep Reinforcement Learning. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (420-432). DOI: 10.1007/978-3-031-74627-7_35. Online publication date: 1-Jan-2025.
