DOI: 10.1145/3581783.3612498
research-article
Open access

Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection

Published: 27 October 2023
  Abstract

    Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently, rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
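
    To make the probing-based captioning idea concrete, the following is a minimal sketch that treats an off-the-shelf BLIP-2 checkpoint (loaded via HuggingFace transformers) as the frozen PVLM and queries it in zero-shot VQA style. The checkpoint name, the probing questions, and the pro_cap helper below are illustrative assumptions, not the authors' released implementation.

    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    # Assumed frozen PVLM checkpoint; the paper's exact model and prompts may differ.
    MODEL_NAME = "Salesforce/blip2-opt-2.7b"
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

    processor = Blip2Processor.from_pretrained(MODEL_NAME)
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_NAME).to(DEVICE)
    model.eval()  # the PVLM stays frozen; no fine-tuning or gradient updates

    # Illustrative hateful-content-related probing questions (hypothetical wording).
    PROBING_QUESTIONS = [
        "What is shown in the image?",
        "What is the race of the person in the image?",
        "What is the gender of the person in the image?",
        "What is the religion of the person in the image?",
    ]

    @torch.no_grad()
    def pro_cap(image: Image.Image) -> str:
        """Ask each probing question zero-shot and join the answers into one caption."""
        answers = []
        for question in PROBING_QUESTIONS:
            prompt = f"Question: {question} Answer:"
            inputs = processor(images=image, text=prompt, return_tensors="pt").to(DEVICE)
            output_ids = model.generate(**inputs, max_new_tokens=20)
            answers.append(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
        return " ".join(answers)

    In such a pipeline, the resulting Pro-Cap string would be concatenated with the meme's overlaid text and passed to a text-only classifier (e.g., a BERT-style encoder) to predict whether the meme is hateful.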

    Supplementary Material

    MP4 File (3576-video.mp4)
    Presentation Video




        Information

        Published In

        MM '23: Proceedings of the 31st ACM International Conference on Multimedia
        October 2023
        9913 pages
        ISBN:9798400701085
        DOI:10.1145/3581783
        This work is licensed under a Creative Commons Attribution International 4.0 License.


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 27 October 2023


        Author Tags

        1. memes
        2. multimodal
        3. semantic extraction

        Qualifiers

        • Research-article

        Conference

        MM '23
        MM '23: The 31st ACM International Conference on Multimedia
        October 29 - November 3, 2023
        Ottawa ON, Canada

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%



        Article Metrics

        • Downloads (last 12 months): 345
        • Downloads (last 6 weeks): 46
        Reflects downloads up to 27 Jul 2024


        Cited By

        • (2024) AISG's Online Safety Prize Challenge: Detecting Harmful Social Bias in Multimodal Memes. Companion Proceedings of the ACM on Web Conference 2024, 1884-1891. https://doi.org/10.1145/3589335.3665993. Online publication date: 13-May-2024.
        • (2024) MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation. Proceedings of the ACM on Web Conference 2024, 4642-4652. https://doi.org/10.1145/3589334.3648151. Online publication date: 13-May-2024.
        • (2024) CapAlign: Improving Cross Modal Alignment via Informative Captioning for Harmful Meme Detection. Proceedings of the ACM on Web Conference 2024, 4585-4594. https://doi.org/10.1145/3589334.3648146. Online publication date: 13-May-2024.
        • (2024) Modularized Networks for Few-shot Hateful Meme Detection. Proceedings of the ACM on Web Conference 2024, 4575-4584. https://doi.org/10.1145/3589334.3648145. Online publication date: 13-May-2024.
        • (2024) Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models. Proceedings of the ACM on Web Conference 2024, 2359-2370. https://doi.org/10.1145/3589334.3645381. Online publication date: 13-May-2024.
        • (2024) What do they "meme"? A metaphor-aware multi-modal multi-task framework for fine-grained meme understanding. Knowledge-Based Systems, 294, 111778. https://doi.org/10.1016/j.knosys.2024.111778. Online publication date: Jun-2024.
        • (2024) KERMIT. Information Fusion, 106(C). https://doi.org/10.1016/j.inffus.2024.102269. Online publication date: 25-Jun-2024.
        • (2023) MERMAID: A Dataset and Framework for Multimodal Meme Semantic Understanding. 2023 IEEE International Conference on Big Data (BigData), 433-442. https://doi.org/10.1109/BigData59044.2023.10386279. Online publication date: 15-Dec-2023.
