Research article
Open access

Eye into AI: Evaluating the Interpretability of Explainable AI Techniques through a Game with a Purpose

Published: 04 October 2023
Abstract

    Recent developments in explainable AI (XAI) aim to improve the transparency of black-box models. However, empirically evaluating the interpretability of these XAI techniques is still an open challenge. The most common evaluation method is algorithmic performance, but such an approach may not accurately represent how interpretable these techniques are to people. A less common but growing evaluation strategy is to ask crowd workers to provide feedback on multiple XAI techniques so that they can be compared. However, these tasks often feel like work and may limit participation. We propose a novel, playful, human-centered method for evaluating XAI techniques: a Game With a Purpose (GWAP), Eye into AI, that allows researchers to collect human evaluations of XAI at scale. We provide an empirical study demonstrating how our GWAP supports evaluating and comparing the agreement between three popular XAI techniques (LIME, Grad-CAM, and Feature Visualization) and humans, as well as evaluating and comparing the interpretability of those three XAI techniques applied to a deep learning model for image classification. The data collected from Eye into AI offers convincing evidence that GWAPs can be used to evaluate and compare XAI techniques. Eye into AI is available to the public: https://dig.cmu.edu/eyeintoai/.
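
    As an illustration of the kind of saliency explanation the game asks players to interpret, the sketch below generates a Grad-CAM heatmap for a single image classification. This is not the authors' code: it assumes the torchvision and pytorch-grad-cam packages and uses a pretrained ResNet-50 as a stand-in classifier, so the model, layer choice, and package versions are assumptions made for illustration only.

        import torch
        from torchvision.models import resnet50, ResNet50_Weights
        from pytorch_grad_cam import GradCAM
        from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

        # Pretrained ImageNet classifier (stand-in for the model studied in the paper).
        weights = ResNet50_Weights.DEFAULT
        model = resnet50(weights=weights).eval()

        # A real use would preprocess an input image with weights.transforms();
        # a random tensor stands in here so the sketch is self-contained.
        input_tensor = torch.randn(1, 3, 224, 224)
        predicted_class = int(model(input_tensor).argmax(dim=1))

        # Grad-CAM over the last convolutional block: a coarse heatmap of the
        # image regions that most increased the predicted class's score.
        cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
        heatmap = cam(input_tensor=input_tensor,
                      targets=[ClassifierOutputTarget(predicted_class)])[0]  # H x W, values in [0, 1]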

    Cited By

    • (2024) "How Good Is Your Explanation?": Towards a Standardised Evaluation Approach for Diverse XAI Methods on Multiple Dimensions of Explainability. Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 513-515. https://doi.org/10.1145/3631700.3664911. Online publication date: 27 June 2024.

        Published In

        Proceedings of the ACM on Human-Computer Interaction, Volume 7, Issue CSCW2 (CSCW)
        October 2023, 4055 pages
        EISSN: 2573-0142
        DOI: 10.1145/3626953
        This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 04 October 2023
        Published in PACMHCI Volume 7, Issue CSCW2

        Author Tags

        1. explainable ai
        2. games with a purpose
        3. interpretability

        Qualifiers

        • Research-article

        Article Metrics

        • Downloads (last 12 months): 399
        • Downloads (last 6 weeks): 56
        Reflects downloads up to 26 Jul 2024
