Research article. DOI: 10.1145/3595916.3626437

Generic Attention-model Explainability by Weighted Relevance Accumulation

Published: 01 January 2024

Abstract

Attention-based Transformer models have achieved remarkable progress in multi-modal tasks, such as visual question answering. The explainability of attention-based methods has recently attracted wide interest, as it can explain the inner changes of attention tokens by accumulating relevancy across attention layers. Current methods simply update relevancy by equally accumulating the token relevancy before and after the attention processes. However, the importance of token values usually differs during relevance accumulation. In this paper, we propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce the distortion caused by equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, which processes Vision-and-Language tasks through a CLIP encoder followed by a mapper. CLIPmapper combines self-attention, cross-attention, single-modality attention, and cross-modality attention, which makes it well suited for evaluating our generic explainability method. Extensive perturbation tests on visual question answering and image captioning tasks validate that our explainability method outperforms existing methods.
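
The difference between equal and weighted accumulation can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes the equal update has the form R <- R + A·R, as in earlier attention-relevance methods, where A is a head-averaged, non-negative attention map. The function names and the scalar weights `w_prev` and `w_new` are hypothetical stand-ins for the importance-based weighting, whose exact form the abstract does not specify.

```python
def identity(n):
    """Initial relevance: each token is relevant only to itself."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def accumulate(relevance, attention, w_prev=1.0, w_new=1.0):
    """One layer of relevance accumulation: R <- w_prev * R + w_new * (A @ R).

    With w_prev == w_new == 1.0 this is the equal accumulation described
    in the abstract. Other weights are a hypothetical illustration of
    weighting the relevance before and after the attention step.
    """
    update = matmul(attention, relevance)
    return [
        [w_prev * r + w_new * u for r, u in zip(r_row, u_row)]
        for r_row, u_row in zip(relevance, update)
    ]


# Toy attention map over two tokens (assumed already averaged over heads).
attention = [[0.5, 0.5], [0.0, 1.0]]

equal = accumulate(identity(2), attention)
weighted = accumulate(identity(2), attention, w_prev=0.5, w_new=1.0)
print(equal)     # [[1.5, 0.5], [0.0, 2.0]]
print(weighted)  # [[1.0, 0.5], [0.0, 1.5]]
```

In this toy case, lowering `w_prev` shrinks the contribution of the relevance carried over from before the attention step, which shifts the resulting map toward what the attention layer itself contributes.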

Supplementary Material

Appendix (mmaasia23-87-supplementary material.pdf)


Cited By

  • (2024) The Explainability of Transformers: Current Status and Directions. Computers 13:4 (92). https://doi.org/10.3390/computers13040092. Online publication date: 4-Apr-2024

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN:9798400702051
DOI:10.1145/3595916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attention-model Explainability
  2. Multimodal model
  3. Weighted Relevancy Accumulation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MMAsia '23: ACM Multimedia Asia
December 6 - 8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 71
  • Downloads (last 6 weeks): 2
Reflects downloads up to 28 Dec 2024.
