
Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities

Published: 22 November 2024

Abstract

As social media platforms evolve from text-based forums into multi-modal environments, the nature of misinformation in social media is also transforming accordingly. Taking advantage of the fact that visual modalities such as images and videos are more favorable and attractive to users, and textual content is sometimes skimmed carelessly, misinformation spreaders have recently targeted contextual connections between the modalities, e.g., text and image. Hence, many researchers have developed automatic techniques for detecting possible cross-modal discordance in web-based content. We analyze, categorize, and identify existing approaches in addition to the challenges and shortcomings they face to unearth new research opportunities in the field of multi-modal misinformation detection.

1 Introduction

Nowadays, billions of multi-modal posts containing text, images, videos, soundtracks, and so on are shared throughout the web, mainly via social media platforms such as Facebook, Twitter, Snapchat, Reddit, Instagram, and YouTube. While the combination of modalities allows for more expressive, detailed, and user-friendly content, it brings about new challenges, as it is harder to adapt uni-modal solutions to multi-modal environments.
However, in recent years, driven by the sheer scale of multi-modal platforms, machine learning researchers have introduced many automated techniques for multi-modal tasks, such as Visual Question Answering (VQA) [5, 31, 33, 38, 85], image captioning [18, 34, 51, 104], and, more recently, fake news detection, including hate speech in multi-modal memes [30, 50, 65, 87].
Similar to other multi-modal tasks, detecting fake news on multi-modal platforms is harder and more challenging, as it requires evaluating not only each modality but also the cross-modal connections and the credibility of the combination. This becomes even more challenging when each modality, e.g., text or image, is credible on its own but the combination creates misinformative content. For instance, a COVID-19 anti-vaccination misinformation post can pair text that reads “vaccines do this” with a graphic image of a dead person. In this case, although the image and text are not individually misinformative, taken together they create misinformation.
Over the past decade, several detection models [14, 40, 79, 81] have been developed to detect misinformation. However, the majority of them leverage only a single modality, e.g., text [32, 37, 77, 101] or image [1, 20, 39, 66], thereby missing the important information conveyed by other modalities. There are existing works [2, 3, 35, 48, 76] that leverage ensemble methods, training a separate model for each modality and then combining them to produce improved results. However, in many cases of multi-modal misinformation, loosely combining individual modalities is inadequate for detecting fake news, leading to the failure of the joint model.
Nevertheless, in recent years, machine learning scientists have developed different techniques for cross-modal fake news detection, which combine information from multiple modalities, leveraging cross-modal information such as the consistency and meaningful relationships between different modalities. Studying and analyzing these techniques and identifying existing challenges will give a clearer picture of the state of knowledge on multi-modal misinformation detection and open the door to new opportunities in this field.
Even though there are a number of valuable surveys on fake news detection [24, 52, 79], very few of them focus on multi-modal techniques [6, 74]. Since the number of proposed techniques for multi-modal fake news detection has been increasing immensely, the necessity of a comprehensive survey on existing techniques, datasets, and emerging challenges is felt more than ever. With that said, in this work, we aim to conduct a comprehensive study on fake news detection in multi-modal environments.
To this end, we classify multi-modal misinformation detection study into the following directions:
Multi-modal Data Study: In this direction, the goal is to collect multi-modal fake news data, e.g., image, text, social context, and so on, from different sources of information and use fact-checking resources to evaluate the veracity of the collected data and annotate them accordingly. Comparison and analysis of existing datasets, as well as benchmarking, are other tasks that fall under this category.
Multi-modal Feature Study: The primary goal of this study is to uncover significant links between various data modalities, which are frequently exploited by misinformation spreaders to distort, impersonate, or exaggerate original information. These meaningful connections may be used as clues for detecting misinformation in multi-modal environments such as social media posts. Another goal of this direction is to study and develop strategies for fusing features of different modalities and creating information-rich multi-modal features.
Multi-modal Model Study: The main focus of this direction is on the development of efficient multi-modal machine learning solutions to detect misinformation by leveraging multi-modal features and clues. Proposing new techniques and approaches, in addition to improving the performance, scalability, interpretability, and explicability of machine learning models, are some of the common tasks in this direction.
These three studies form a sequential pipeline in the multi-modal misinformation field, where the output of each study serves as the input for the next. Figure 1 provides a summary of these directions. In this work, we aim to explore each direction in greater depth to identify the challenges and shortcomings of each study and propose new avenues for addressing them.
Fig. 1. An overview of the multi-modal misinformation detection pipeline.
The rest of this survey is organized as follows: In Section 2, we discuss the multi-modal feature study by introducing some widely spread categories of misinformation in multi-modal settings and commonly used cross-modal clues for detecting them. In Section 3, we first discuss different fusion mechanisms for merging the modalities involved in such clues, and then present the multi-modal model study, introducing solutions and categorizing them based on the machine learning techniques they utilize. In Section 4, we describe the multi-modal data study by introducing, analyzing, and comparing existing datasets for multi-modal fake news detection. In Section 5, we discuss existing challenges and shortcomings that each direction is facing. Finally, in Section 6, we propose new avenues to address these shortcomings and advance multi-modal misinformation detection research.
We conducted our literature search across multiple databases, including IEEE Xplore, ACM Digital Library, and Google Scholar, using a combination of keywords related to our research focus. The inclusion criteria for the papers were defined by their relevance to the research question, publication date within the past 10 years to ensure timeliness, and peer-reviewed status to guarantee quality. The selection process involved an initial screening of titles and abstracts, followed by a full-text review to confirm that each paper met our stringent criteria. This methodical approach ensures that the included papers provide a diverse yet focused perspective on the subject, offering readers a succinct and informative summary of current knowledge in the field. We emphasize the importance of transparency in our literature selection process and outline these steps to clarify the criteria and rationale behind our choices.

2 Multi-modal Feature Study

In this section, we discuss the feature-based direction of multi-modal misinformation studies. To better understand the rationale behind multi-modal features and clues, we start with a brief introduction to some of the common categories of misinformation that spread in multi-modal environments. We then discuss some of the commonly used multi-modal features and clues; the fusion mechanisms for combining modality features, along with their pros and cons, are discussed in Section 3.

2.1 Common Categories of Misinformation in Multi-modal Environments

Multi-modal misinformation refers to a package of misleading information that includes multiple modalities such as images, text, videos, and so on. In multi-modal misinformation, not all modalities are necessarily false, but sometimes the connections between the modalities are manipulated to deceive the audience’s perception. In what follows, we briefly discuss some of the common categories of misinformation that are widely spread in multi-modal settings. It is worth mentioning that these categories of misinformation are common in both multi-modal and uni-modal environments. However, we provide examples of each category in multi-modal platforms as well.
Satire or Parody: This category refers to content that conveys true information with a satirical tone or added information that makes it false. One of the well-known publishers of this category is The Onion website, which is a digital media organization that publishes satirical articles on a variety of international, national, and local news. A multi-modal example of this category is an image within a satirical news article that contains absurd or ridiculous content or is manipulated to create humorous critique [25, 56]. In this case, the textual content may not necessarily be false, but when combined with an image, it creates misleading content.
Fabricated Content: This category of information is completely false and is generated to deceive the audience. The intention behind publishing fabricated content is usually to mislead people for political, social, or economic benefits. A multi-modal instance of this category is a news report that uses auxiliary images or videos that are either completely fake or belong to irrelevant events.
Imposter Content: This category of misinformation takes advantage of established news agencies by publishing misleading content under their branding. Since audiences trust established agencies, they are less likely to doubt the validity of the content and consequently pay less attention to subtle clues. Imposter content may damage the reputation of agencies and undermine audience trust. An example of imposter content is a website that mimics the domain features of global news outlets, such as CNN and BBC. To detect this category of misinformation, it is crucial to identify and pay attention to the subtle features of web publishers [1, 2].
Manipulated Content: This category of misinformation is generated by editing valid information, usually in the form of images and videos, to deceive audiences. Deepfake videos are well-known examples of this category. Manipulated videos and images have been widely generated to support fabricated content [4, 72, 92].
False Connection: This is one of the most common types of misinformation in multi-modal environments. In this category, some modalities, such as captions or titles, do not support other modalities, such as text or video. False connections are designed to catch the audience’s attention with clickbait headlines or provocative images [57, 62].
The above categories are used to spread a variety of fake news content, such as “Junk Science,” “Propaganda,” “Conspiracy Theories,” “Hate Speech,” “Rumors,” “Bias,” and so on. In the next section, we introduce some of the cross-modal clues for detecting them in multi-modal settings.

2.2 Multi-modal Features and Clues

As previously indicated, combining features such as text and images has recently been utilized to identify false information in multi-modal contexts. In this section, we provide a non-exhaustive list of frequently used cues that machine learning researchers have used to identify false information. We emphasize that even though there are numerous other multi-modal combinations, they have not yet been fully explored by researchers at the time of writing, and we merely enumerate those that are frequently used in the literature.
Image and text mismatch. The combination of textual content and article images is one of the most widely used sets of features for multi-modal fake news detection. The intuition behind this cue is that some fake news spreaders use tempting images, such as exaggerated, dramatic, or sarcastic graphics that are far from the textual content, to attract users’ attention. Since it is difficult to find images that are both pertinent and pristine to match these fabricated stories, fake news generators sometimes use manipulated images to support non-factual scenarios. Researchers refer to this cue as the similarity relationship between text and image [30, 103, 110], which can be captured with a variety of similarity-measuring techniques, such as the cosine similarity between the title and image-tag embeddings [30, 110] or dedicated similarity-measurement architectures [103]; a minimal sketch of this cue is given after this list.
Mismatch between video and descriptive writing style. On video-based platforms such as YouTube and TikTok, video content is accompanied by descriptive textual information such as video descriptions, titles, users’ comments, and replies. Different users and video producers use distinct writing styles in such textual content; machine learning models can learn these styles and distinguish them from unfamiliar patterns. Meanwhile, the meaningful relationship between the visual content and the descriptive information, such as the video title, is another important clue that can be used for detecting online misbehavior [19]. However, this is a very challenging task, as it is difficult to detect frames that are relevant to the text and discard irrelevant ones, such as advertisements or opening and ending frames. Moreover, encoding all video frames is very inefficient in terms of speed and memory.
Textual content and propagation network. The majority of online fact checkers, such as BS Detector or NewsGuard, provide labels that pertain to domains rather than articles. Despite this disparity, several works [36, 112] show that the weakly supervised approach of training on domain-level labels and subsequently testing on article-level labels yields negligible accuracy loss, owing to the strong correlation between the two. Thus, by recognizing domain features and behaviors, we may be able to classify the articles they publish with admissible accuracy. Among such feature patterns are the propagation network and word usage patterns of a domain, which can be considered a discriminating signature for different domains [78, 83, 84, 111]. It has been empirically shown that news articles from different domains not only have significantly different word usage but also follow different propagation patterns [84].
Textual content and overall look of serving domain. Another domain-level feature that researchers have recently introduced for detecting misinformation is the overall look of the serving webpage [1, 2]. It is shown that, in contrast to credible domains, unreliable web-based news outlets tend to be visually busy and full of events such as advertisements, popups, and so on [1]. Trustworthy webpages often look professional and ordered, as they often request users to agree to sign up or subscribe, have some featured articles, a headline picture, standard writing styles, and so on. However, unreliable domains tend to have an unprofessional blog-post style, negative space, and sometimes hard-to-read font errors. Considering this discriminating clue, researchers have recently proposed to consider the overall look of the webpages in addition to textual content and social context to create a multi-modal model for detecting misinformation [2, 3].
Video and audio mismatch. Due to the ubiquity of camera devices and video-editing applications, video-based frameworks are extremely vulnerable to manipulation, e.g., virtual backgrounds, anime filters, and so on. Such visual manipulations introduce non-trivial noise to the video frames, which may lead to the misclassification of irrelevant information from videos [75]. Moreover, manipulated videos often incorporate content in different modalities such as audio and text, which sometimes are not misinformative when considered individually. However, they mislead the audience when considered jointly with the video content. To detect misleading content that is jointly expressed in video, audio, and text content, researchers have proposed leveraging frame-based information along with audio and text content on video-based platforms like TikTok [75].
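To make the first cue above concrete, the following is a minimal sketch of the title/image-tag similarity idea used in References [30, 110]: the article title and the tags predicted for the image are embedded in a shared text-embedding space and compared via cosine similarity. The sentence-transformers model name, the example inputs, and the decision threshold are illustrative assumptions, not choices made by the cited works.

```python
# Minimal sketch: cosine similarity between a title embedding and the embedding
# of tags describing the attached image (tags would come from an image tagger).
# The model name and threshold below are illustrative, not from the cited papers.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def title_image_tag_similarity(title: str, image_tags: list[str]) -> float:
    title_emb = encoder.encode(title, convert_to_tensor=True)
    tags_emb = encoder.encode(" ".join(image_tags), convert_to_tensor=True)
    return util.cos_sim(title_emb, tags_emb).item()

# A low score signals a possible text-image mismatch.
sim = title_image_tag_similarity(
    "Massive flood submerges downtown", ["concert", "crowd", "stage lights"]
)
is_suspicious = sim < 0.2  # heuristic threshold for flagging
```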

3 Multi-modal Model Study

Extracted features and the way they are fused play an important role in the model architecture. In fact, model-based and feature-based studies are closely related through fusion strategies, which makes the demarcation of these two studies very difficult. Hence, in this section, we first discuss common fusion strategies as the point of connection between the two studies. Furthermore, we categorize existing works based on the machine learning techniques exploited by each work. Specifically, we classify them into two main categories: (1) classic machine learning and (2) deep learning-based solutions. In this section, we discuss each category in detail.

3.1 Fusion Mechanisms

Data fusion is the process of combining information from multiple modalities to take advantage of all different aspects of the data and extract as much information as possible to improve the performance of machine learning models, as opposed to using a single data aspect or modality. Different fusion mechanisms have been used to combine features from different modalities, including those mentioned in the previous section. Fusion mechanisms are often categorized into one of the following groups:
Early fusion. Also known as feature-level fusion, this refers to combining features from different data modalities at an early stage using an operation, often concatenation. This type of fusion is typically performed ahead of classification. If the fusion is done after feature extraction, then it is sometimes referred to as intermediate fusion [12, 54, 60].
Late fusion. Also known as decision-level or kernel-level fusion, this is usually done at the classification stage. This method depends on the results obtained from each data modality individually: the modality-wise classification results are combined using techniques such as sum, max, average, and weighted average. Most late fusion solutions use handcrafted combination rules, which are prone to human bias and may not reflect real-world peculiarities [12, 54, 60].
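The contrast between the two mechanisms can be illustrated with a short sketch; the feature dimensions, linear classifier heads, and averaging weights below are placeholder assumptions rather than settings from any surveyed system.

```python
# Early vs. late fusion on precomputed text and image feature vectors.
import torch
import torch.nn as nn

text_dim, image_dim, n_classes = 768, 512, 2

# Early (feature-level) fusion: combine features first, classify once.
early_classifier = nn.Linear(text_dim + image_dim, n_classes)

def early_fusion(text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
    fused = torch.cat([text_feat, image_feat], dim=-1)  # e.g., concatenation
    return early_classifier(fused)

# Late (decision-level) fusion: classify each modality, then combine decisions.
text_classifier = nn.Linear(text_dim, n_classes)
image_classifier = nn.Linear(image_dim, n_classes)

def late_fusion(text_feat, image_feat, w_text=0.5, w_image=0.5):
    p_text = text_classifier(text_feat).softmax(dim=-1)
    p_image = image_classifier(image_feat).softmax(dim=-1)
    return w_text * p_text + w_image * p_image  # weighted-average rule
```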

3.2 Comparison of Fusion Mechanisms

In most cases, early fusion is a complex operation, whereas late fusion is easier to perform [8]: unlike early fusion, where the features from different modalities (e.g., image and text) may have different representations, the decisions at the semantic level usually have the same representation, so fusing decisions is easier than fusing features. However, the late fusion strategy does not utilize the feature-level correlation among modalities, which may improve classification performance. In fact, it has been shown that in many cases the early fusion of different modalities outperforms late fusion when applying deep learning or classic machine learning classifiers [27, 28]. For instance, early fusion of images and texts using BERT and CNN on the UPMC Food-101 dataset [95] outperforms late fusion of these modalities.
Another advantage of early fusion is that it requires less computation time, because training is performed only once, whereas late fusion requires multiple classifiers for local decisions [8]. To have the best of both worlds, there are also hybrid approaches, which take advantage of both early and late fusion strategies [8]. Figures 2 to 4 illustrate simplified schemes of these fusion mechanisms for multi-modal learning. Traditional and modern approaches for detecting multi-modal misinformation, some of which employ these fusion mechanisms, are covered in the sections that follow.
Fig. 2. Early fusion mechanism.
Fig. 3. Late fusion mechanism.
Fig. 4. A hybrid of early and late fusion mechanisms.

3.3 Classic Machine Learning Solutions

As discussed earlier, a vast majority of misinformation detection methods leverage a single modality, i.e., a single aspect of news articles, e.g., text [32, 37, 77, 101], image [1, 20, 39, 66], user features [80, 82, 102], or temporal properties [52, 78, 89]. Only a few recent works incorporate various aspects of a news article using classic machine learning techniques to create multi-modal article representations.
For instance, a work by Shu et al. [48] proposes individual embedded representations for text, user-user interactions, user-article interactions, and publisher-article interactions, and defines a joint optimization problem leveraging these individual representations. Finally, they apply a “Non-convex Optimization” solution via the Alternating Least Squares (ALS) algorithm to solve the proposed optimization problem.
In another work, Abdali et al. [2] propose an “Algebraic Joint Structure” algorithm called HiJoD, which encodes three different aspects of an article: the article text, the context of social sharing behaviors, and host website/domain features. These aspects are transformed into individual embeddings, and shared structures among these embeddings are extracted using a principled tensor-based framework. By canceling out the unshared structures, the extracted shared structures are then utilized for article classification. The classification performance of the algebraic joint model, HiJoD, is compared with the “Naive Embeddings Concatenation” of embedding representations. The results demonstrate that the tensor-based representation is more effective in capturing the nuanced patterns of the joint structure.
Another study [3] presents the K-Nearest Hyperplanes (KNH) graph, a new type of graph generalization where nodes are higher-order Euclidean subspaces formed by algebraic structures, aimed at multi-aspect modeling of news articles.
More recently, Meel et al. [35] have proposed an ensemble framework that leverages text embeddings, a score calculated from the cosine similarity between the image caption and the news body, and noisy images. Although some modules of this model, e.g., the text embedding generator, leverage deep attention-based architectures, the classification itself is done via a classic ensemble technique, i.e., max voting.
In summary, due to the success of deep learning-based techniques in feature extraction and classification tasks, classic machine learning techniques are not commonly used these days. However, given that deep learning techniques are data-hungry and require substantial effort for training and fine-tuning, classic machine learning techniques are still used, either alone or in conjunction with deep learning techniques, depending on the application.

3.4 Deep Learning Solutions

Due to the impressive success of deep neural networks in feature extraction and classification of text, images, and many other modalities, they have been widely exploited by research scientists over the past few years for a variety of multi-modal tasks, including misinformation detection. We may categorize deep learning-based multi-modal misinformation detection into five categories: concatenation-based, attention-based, generative-based, graph neural network-based, and cross-modality discordance-aware architectures, as demonstrated in Figure 5. In what follows, we summarize and categorize the existing works into the aforementioned categories.
Fig. 5.
Fig. 5. An overview of the multi-modal model study.

3.4.1 Concatenation-based Architectures.

The majority of the existing work on multi-modal misinformation detection embeds each modality, e.g., text or image, into a vector representation and then concatenates them to generate a multi-modal representation that can be utilized for classification tasks. For instance, Singhal et al. propose using pretrained XLNet and VGG-19 models to embed text and image, respectively, and then classify the concatenation of the resulting feature vectors to detect misinformation [87].
In another work [74], Bartolome et al. exploit a Convolutional Neural Network (CNN) that takes as input both the text and the image corresponding to an article, and the outputs are concatenated into a single vector. Qi et al. extract the text, Optical Character Recognition (OCR) content, news-related high-level semantics of images (e.g., celebrities and landmarks), and visual CNN features of the image. Then, in the multi-modal feature fusion stage, text-image correlations, mutual enhancement, and entity inconsistency are merged via a concatenation operation [65].
In another work [71], Rezayi et al. leverage network, textual, and relaying features such as hashtags and URLs and classify articles using the concatenation of the feature embeddings. Works in References [69, 76] are other examples of this category of deep learning-based solutions.
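The generic pattern shared by these works can be sketched as follows; the encoder checkpoints, feature dimensions, and classifier head are illustrative assumptions and do not reproduce any specific cited system.

```python
# Generic concatenation-based detector: embed each modality with a pretrained
# encoder, concatenate the features, and classify the fused vector.
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_encoder = models.resnet50(weights="IMAGENET1K_V2")
image_encoder.fc = nn.Identity()  # expose the 2048-d pooled visual feature

classifier = nn.Sequential(nn.Linear(768 + 2048, 256), nn.ReLU(), nn.Linear(256, 2))

def predict(text: str, image_tensor: torch.Tensor) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    text_feat = text_encoder(**tokens).pooler_output       # [1, 768]
    image_feat = image_encoder(image_tensor.unsqueeze(0))  # [1, 2048]
    fused = torch.cat([text_feat, image_feat], dim=-1)     # simple concatenation
    return classifier(fused)                               # fake/real logits
```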

3.4.2 Attention-based Architectures.

As mentioned above, many architectures simply concatenate vector representations, thereby failing to build effective multi-modal embeddings. Such models fall short in many cases because misinformation may reside in only part of a modality: the entire text of an article need not be false, nor the entire image fabricated, for the article to constitute misinformative content. Thus, some recent works use the attention mechanism to attend to the relevant parts of images, text, and so on. The attention mechanism is a more effective approach for utilizing embeddings, as it produces richer multi-modal representations.
For instance, a work by Sachan et al. [73] proposes Shared Cross Attention Transformer Encoders (SCADE), which exploits CNNs and transformer-based methods to encode image and text information and utilizes cross-modal attention and shared layers for the two modalities. SCADE pays attention to the relevant parts of each modality with reference to the other.
Another example is a work by Kumari et al. [53], where a framework is developed to maximize the correlation between textual and visual information. This framework has four different sub-modules: Attention-Based Stacked Bidirectional Long Short Term Memory (ABS-BiLSTM) for textual feature representation, Attention-Based Multilevel Convolutional Neural Network–Recurrent Neural Network (ABM-CNN–RNN) for visual feature extraction, multi-modal Factorized Bilinear Pooling (MFB) for feature fusion, and, finally, Multi-Layer Perceptron (MLP) for classification.
In another study, Qian et al. [67] introduce the Hierarchical Multi-modal Contextual Attention Network (HMCAN) architecture. This architecture leverages a pre-trained BERT and convolutional ResNet50 to create word and image embeddings. It also employs a multi-modal contextual attention network to investigate multi-modal context information. HMCAN uses various multi-modal contextual attention networks to form a hierarchical encoding network, aiming to explore and capture the rich hierarchical semantics of multi-modal data.
Another example is Reference [45], where Jin et al. fuse features from three modalities, i.e., textual, visual, and social context, using an RNN that utilizes an attention mechanism (att-RNN) for feature alignment. Jing et al. propose TRANSFAKE [47] to connect features of text and images into a series and feed them into a vision-language transformer model to learn the joint representation of multi-modal features. TRANSFAKE adopts a preprocessing method similar to BERT for concatenated text, comments, and images.
In another work [94], Wang et al. apply scaled dot-product attention on top of image and text features as a fine-grained fusion and use the fused feature to classify articles.
Wang et al. propose a deep learning network for biomedical informatics that leverages visual and textual information and a semantic- and task-level attention mechanism to focus on the essential contents of a post that signal anti-vaccine messages [100].
Another example is the study by Lu et al., where they concatenate representations of user interaction, word representations, and propagation features after implementing a dual co-attention mechanism. The goal is to capture the correlations between users’ interactions/propagation and the tweet’s text [59].
Finally, Song et al. [88] propose a multi-modal fake news detection architecture based on Cross-modal Attention Residual (CARN) and Multichannel Convolutional Neural Networks (CARMN). CARN selectively extracts the information related to a target modality from a source modality while maintaining the unique information of the target.
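The common mechanism underlying these attention-based works can be distilled into a short sketch: scaled dot-product cross-attention lets text tokens attend to the image regions most relevant to them. This is a schematic single-head version with no learned projections, not a reimplementation of any cited model.

```python
# Schematic cross-modal attention: text tokens (queries) attend over image
# regions (keys/values), so the fused representation emphasizes relevant parts.
import torch

def cross_modal_attention(text_tokens: torch.Tensor, image_regions: torch.Tensor):
    # text_tokens: [T, d]; image_regions: [R, d]
    d = text_tokens.size(-1)
    scores = text_tokens @ image_regions.T / d ** 0.5  # [T, R] relevance scores
    weights = scores.softmax(dim=-1)                   # attention over regions
    attended = weights @ image_regions                 # [T, d] image context
    return text_tokens + attended                      # residual combination

# e.g., 20 word embeddings attending over 49 CNN feature-map regions
fused = cross_modal_attention(torch.randn(20, 256), torch.randn(49, 256))
```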

3.4.3 Generative Architectures.

In this category of deep learning solutions, the goal is to either apply Generative Networks or use auxiliary networks to learn individual or multi-modal representations, spaces, or parameters to improve the classification performance of the fake news detector.
As an example, Jaiswal et al. propose a BERT-based multi-modal Variational Autoencoder (VAE) [42] that consists of an encoder, a decoder, and a fake news detector. The encoder encodes the shared representations of both the image and the text into a multidimensional latent vector. The decoder decodes the latent vector back into the original image and text, and the fake news detector is a binary classifier that takes the shared representation as input and classifies it as either fake or real.
Similarly, Khattar et al. propose a deep Multi-modal Variational Autoencoder (MVAE) [49], which learns a unified representation of both modalities of a tweet’s content. Similar to the previous work, MVAE has three main components: an encoder, a decoder, and a fake news detector that utilizes the learned shared representation to predict whether a news item is fake or real.
Like the previous work, a work by Zeng et al. [107] proposes to capture the correlations between text and image by a VAE-based multi-modal feature fusion method. In another work, Wang et al. propose Event Adversarial Neural Networks (EANN) [96], an end-to-end framework that can derive event-invariant features and thus benefit the detection of fake news on newly arrived events. It consists of three main components: a multi-modal feature extractor, the fake news detector, and the event discriminator. The multi-modal feature extractor is responsible for extracting the textual and visual features from posts. It cooperates with the fake news detector to learn the discriminating representation of news articles. The role of the event discriminator is to remove the event-specific features and keep shared features among the events.
In another work [97], Wang et al. propose the MetaFEND framework, which is able to detect fake news on emergent events with a few verified posts using an event adaptation strategy. The MetaFEND framework has two stages: event adaptation and detection. In the event adaptation stage, the model adapts to specific events, and then in the detection stage, the event-specific parameter is leveraged to detect fake news on a given event. Although MetaFEND does not apply a generative architecture, it leverages an auxiliary network to learn an event-specific parameter set to improve the efficiency of the fake news detector.
The last example is a work [84] by Silva et al., where they propose a cross-domain framework using text and propagation network. The proposed model consists of two components: an unsupervised domain embedding learning and a supervised domain-agnostic news classification. The unsupervised domain embedding exploits text and propagation network to represent a news domain with a low-dimensional vector. The classification model represents each news record as a vector using the textual content and the propagation network. Then, the model maps this representation into two different subspaces such that one preserves the domain-specific information. Later on, these two components are integrated to identify fake news while exploiting domain-specific and cross-domain knowledge in the news records.
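The encoder-decoder-plus-detector pattern shared by the VAE-based models above can be summarized in a schematic sketch; the layer choices and dimensions are placeholders rather than the architectures of MVAE or its successors.

```python
# Schematic multi-modal VAE with an attached fake news classifier: a shared
# latent vector must both reconstruct the inputs and separate fake from real.
import torch
import torch.nn as nn

class MultimodalVAEDetector(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, latent_dim=64):
        super().__init__()
        self.encoder = nn.Linear(text_dim + image_dim, 2 * latent_dim)  # -> mu, logvar
        self.decoder = nn.Linear(latent_dim, text_dim + image_dim)      # reconstruction
        self.detector = nn.Linear(latent_dim, 2)                        # fake/real head

    def forward(self, text_feat, image_feat):
        mu, logvar = self.encoder(torch.cat([text_feat, image_feat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(z)    # trained with reconstruction + KL-divergence losses
        logits = self.detector(z)  # trained with a classification loss
        return recon, logits, mu, logvar
```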

3.4.4 Graph Neural Network Architectures.

In recent years, Graph Neural Networks (GNNs) have been successfully exploited for fake news detection [9, 23, 89], thereby catching researchers’ attention for multi-modal misinformation detection tasks as well. In this category of deep learning solutions, article content (e.g., text, image) is represented by graphs, and then graph neural networks are used to extract the semantic-level features.
For instance, Wang et al. construct a graph for each social media post based on the point-wise mutual information (PMI) score of pairs of words, extracted objects in visual content, and knowledge concepts through knowledge distillation. They then utilize a Knowledge-driven Multi-modal Graph Convolutional Network (KMGCN), which extracts the multi-modal representation of each post through graph convolutional networks [98].
Another GCN-based model is GAME-ON [26], which represents each news item with uni-modal visual and textual graphs and then projects them into a common space. To capture multi-modal representations, GAME-ON applies a graph attention layer on a multi-modal graph generated out of modality graphs.
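A schematic of this graph-based pattern is sketched below, assuming the torch_geometric library: the nodes of a post graph carry word/object/concept embeddings, edges encode co-occurrence (e.g., PMI above a threshold), and graph convolutions plus pooling yield a post representation. It is an illustration of the approach, not a KMGCN or GAME-ON reimplementation.

```python
# Post-level graph classifier: two GCN layers over the post graph, mean-pooled
# into a single vector, then classified as fake or real.
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

class PostGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=300, hidden_dim=64, n_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, n_classes)

    def forward(self, x, edge_index, batch):
        # x: [N, in_dim] node features (word/object/concept embeddings)
        # edge_index: [2, E] edges, e.g., word pairs with PMI above a threshold
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        post_repr = global_mean_pool(h, batch)  # one vector per post graph
        return self.head(post_repr)
```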

3.4.5 Cross-modal Discordance-aware Architectures.

In the previously discussed categories, deep learning models are employed to merge different modalities to create distinguishing representations. However, in this category, deep learning architectures are tailored to address identified discrepancies between modalities. The idea is that fabricating either modality will cause dissonance between them, leading to misrepresented, misinterpreted, and misleading news. Therefore, subtle cross-modal discordance clues can be identified and learned by customized architectures. Consequently, methods utilizing “contrastive learning” or Contrastive Language-Image Pre-Training (CLIP)-based architectures [21, 44] may fall into this category.
In many cases, fake news propagators use irrelevant modalities (e.g., image, video, audio) for false statements to attract readers’ attention. Thus, the similarity of text to other modalities (e.g., image, audio) is a cue for measuring the credibility of a news article.
With that said, Zhou et al. [110] propose SAFE, a Similarity-Aware Multi-Modal Fake News Detection framework that defines the relevance between news textual and visual information using a modified cosine similarity.
Similarly, Giachanou et al. propose a multi-image system that combines textual, visual, and semantic information [30]. The semantic representation refers to the text-image similarity calculated using the cosine similarity between the title and image tag embeddings.
In another work, Singhal et al. [86] develop an inter-modality discordance-based fake news detector that learns discriminating features and employs a modified version of contrastive loss that explores the inter-modality discordance.
Xue et al. [103] propose a Multi-modal Consistency Neural Network (MCNN) that utilizes a similarity measurement module that measures the similarity of multi-modal data to detect the possible mismatches between the image and text. Last, Biamby et al. [10] leverage the CLIP model [68] to jointly learn image/text representation to detect image-text inconsistencies in tweets. Instead of concatenating vector representations, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.
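The pairing objective behind such CLIP-style training can be sketched as a symmetric contrastive loss over a batch of matched (image, text) pairs; this simplified version illustrates the idea rather than reproducing the exact CLIP training code.

```python
# Symmetric contrastive (InfoNCE-style) loss: the i-th text should match the
# i-th image, so the diagonal of the similarity matrix holds the target pairs.
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: [B, d] L2-normalized embeddings of matched pairs
    logits = text_emb @ image_emb.T / temperature   # [B, B] all pairings
    targets = torch.arange(logits.size(0))          # correct pairs on the diagonal
    loss_t = F.cross_entropy(logits, targets)       # text -> image direction
    loss_i = F.cross_entropy(logits.T, targets)     # image -> text direction
    return (loss_t + loss_i) / 2
```

After such training, a low similarity between a post's image and text embeddings is itself evidence of cross-modal inconsistency, which is the signal exploited by the detectors above.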
On video-based platforms such as YouTube, different producers typically use distinct titles and descriptions, and users and subscribers express their opinions in different writing styles.
With this clue in mind, Choi et al. propose a framework to identify fake content on YouTube [19]. They use domain knowledge and the “hit-likes” of comments to create a comment embedding that is effective in detecting fake news videos. They encode multi-modal features, i.e., image and text, and detect discrepancies between the title, description, or video content and users’ comments.
In another work [75], Shang et al. develop TikTec, a multi-modal misinformation detection framework that explicitly exploits video captions to accurately capture key information from unreliable video content and learns the composed misinformation that is jointly conveyed by the visual and audio content. TikTec consists of four major components: a Caption-guided Visual Representation Learning (CVRL) module that identifies the misinformation-related visual features of each sampled video frame; an Acoustic-aware Speech Representation Learning (ASRL) module that learns the misleading semantic information deeply embedded in unstructured and casual audio tracks; a Visual-speech Co-attentive Information Fusion (VCIF) module that captures the multiview composed information jointly embedded in the heterogeneous visual and audio contents of the video; and a Supervised Misleading Video Detection (SMVD) module that identifies misleading COVID-19 videos.

3.4.6 Foundation Models and Prompt-based Techniques.

A foundation model is a large machine learning model that is trained on large-scale datasets such that it can be adapted to a wide range of downstream tasks. Some examples of multi-modal foundation models are pre-trained GPT-4 [64], DALL-E [70], Florence [105], Flamingo [7], and so on.
In-Context Learning (ICL) is the simplest and one of the most effective ways of using foundation models. ICL is a training-free technique where models learn to learn from limited demonstrations and descriptions and generalize to unseen tasks [90]. The learn-to-learn concept was first introduced in meta-learning, which is a family of machine learning techniques that uses few examples to adapt the model to new tasks. In recent years, meta-learning has been used for different applications, including multi-modal misinformation detection [97, 106]. However, the GPT-3 paper [13] shows that few-shot learning is an emergent capability of Large Language Models (LLMs) and could be taken advantage of using ICL techniques. In fact, a frozen model can be conditioned to perform a variety of tasks through ICL, where a user primes the model for a given task through prompt design, i.e., manually crafting a text prompt with descriptions or examples of the task.
A more effective way to condition frozen models is by using tunable prompts. Unlike model fine-tuning, which modifies the model’s parameters through additional training on new data, prompt-tuning adjusts the parameters of the prompt tokens while keeping the pre-trained model frozen [55].
ICL techniques, including few-shot and zero-shot prompting, as well as prompt tuning, have been widely used to query LLMs for a variety of downstream tasks, including misinformation detection. For example, Jiang et al. [43] study the role of prompt learning in detecting fake news. In another work [29], Gao et al. put forward a prompt-tuning template to extract knowledge from a pretrained LM for detecting misinformation. Another example is a work by Tian et al. [91], where few-shot learning is leveraged for troll detection. In another work by Lin et al. [58], prompt tuning is used for rumor detection using a zero-shot framework. Similarly, Reference [113] presents a continual learning framework that applies prompt tuning for rumor detection.
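As an illustration of how such prompts are assembled, the snippet below builds a few-shot claim-classification prompt; the template wording, demonstration examples, and the `query_llm` interface are hypothetical stand-ins, not artifacts of the cited works.

```python
# Hypothetical few-shot (in-context learning) prompt for misinformation
# detection; the demonstrations and query_llm() API are illustrative only.
DEMONSTRATIONS = [
    ("The claim: '5G towers spread the virus.'", "fake"),
    ("The claim: 'The WHO declared COVID-19 a pandemic in March 2020.'", "real"),
]

def build_prompt(claim: str) -> str:
    lines = ["Classify each claim as 'fake' or 'real'.\n"]
    for text, label in DEMONSTRATIONS:  # in-context demonstrations
        lines.append(f"{text}\nLabel: {label}\n")
    lines.append(f"The claim: '{claim}'\nLabel:")  # the model completes the label
    return "\n".join(lines)

# prediction = query_llm(build_prompt("Garlic cures COVID-19."))  # hypothetical
```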
However, there are few existing works that utilize them for misinformation detection in multi-modal settings. One of the existing works is PromptHate [16], a simple prompt-based multi-modal model that prompts pre-trained language models (PLMs) for hateful meme classification. PromptHate constructs simple prompts and provides a few in-context examples to exploit the implicit knowledge in the pre-trained RoBERTa to classify hateful memes.
In another work [21], a novel propaganda detection model, Antipersuasion Prompt Enhanced Contrastive Learning (APCL), is proposed for detecting propaganda. The prompt is designed with a persuasion prompt template and an anti-persuasion prompt template to build matched text-image and mismatched text-image pairs, respectively. Later on, the distances between the two prompt templates and pairs of text and image are used for detection.
More recently, Cao et al. leverage pre-trained vision-language models (PVLMs) in a zero-shot and fine-tuning-free VQA setting to address the problem of meme detection by generating hateful content-centric image captions [15].
In addition, Jian et al. propose a Similarity-Aware Multimodal Prompt Learning (SAMPLE) framework that incorporates prompt-tuning into multi-modal fake news detection [44]. SAMPLE applies three prompt templates (discrete, continuous, and mixed prompting) to the original input text and employs the pre-trained RoBERTa to extract text features from the prompt. Furthermore, the pre-trained CLIP is used to encode the input texts and images and to obtain their semantic similarities. SAMPLE introduces a similarity-aware multi-modal feature fusing approach that applies standardization and a Sigmoid function to adjust the intensity of the final cross-modal representation and mitigate the noise injected by uncorrelated cross-modal features.
A summary of the aforementioned deep learning-based works is demonstrated in Table 1. It is worth mentioning that many of the state-of-the-art solutions utilize a hybrid of deep learning solutions.
Table 1.
Paper(s) | Primary Focus
[87], [76], [74], [65], [71], [69] | Concatenation
[45], [59], [67], [61], [73], [53], [47], [100], [88], [94], [41] | Attention mechanism
[96], [49], [107], [42], [84], [108], [97] | Generative network
[98], [26] | GNN
[110], [30], [103], [75], [86], [10], [19] | Cross-modal cue
[16], [21], [15], [44] | Prompting
Table 1. A Summary of the Existing Deep Learning-based Solutions

4 Multi-modal Data Study

Data acquisition and preparation are the most important building blocks of a machine learning pipeline. Machine learning models leverage training data to continuously improve themselves over time. Thus, a sufficient amount of good-quality and, in most cases, annotated data is crucial for these models to operate effectively. With that said, in this section, we introduce and compare some of the existing multi-modal datasets for the fake news detection task. Later on, we discuss some of the limitations of these datasets.
Image-Verification-Corpus is an evolving dataset containing 17,806 fake and real posts with images shared on Twitter. It was created as an open corpus of tweets containing images that may be used for assessing online image verification approaches (based on tweet texts and user features) as well as for building classifiers for new content. The fake and real images in this dataset have been annotated by online sources that evaluate the credibility of the images and the events they are associated with [11].
Fakeddit is a dataset collected from Reddit, a social news and discussion website where users can post submissions on various subreddits. Fakeddit consists of over 1 million submissions from 22 different subreddits spanning over a decade, with the earliest submission from 3/19/2008 and the most recent from 10/24/2019. The submissions were posted by over 300,000 users on highly active and popular pages. Fakeddit includes submission titles, images, user comments, and submission metadata such as the score, the username of the author, the subreddit source, the sourced domain, the number of comments, and the up-vote to down-vote ratio. Approximately 64% of the samples have both text and image data [62]. Samples are annotated with 2-way, 3-way, and 6-way labels, the latter comprising true, satire/parody, misleading content, manipulated content, false connection, and imposter content. Examples of the 6-way labels are shown in Figure 6. Additionally, Table 2 compares the performance of various methods on the Fakeddit dataset [62].
Table 2.
Text Encoder | Image Encoder | 2-way Val | 2-way Test | 3-way Val | 3-way Test | 6-way Val | 6-way Test
InferSent | VGG16 | 0.8655 | 0.8658 | 0.8618 | 0.8624 | 0.8130 | 0.8130
InferSent | EfficientNet | 0.8328 | 0.8339 | 0.8259 | 0.8256 | 0.7266 | 0.7280
InferSent | ResNet50 | 0.8888 | 0.8891 | 0.8855 | 0.8863 | 0.8546 | 0.8526
BERT | VGG16 | 0.8694 | 0.8699 | 0.8644 | 0.8655 | 0.8177 | 0.8208
BERT | EfficientNet | 0.8334 | 0.8318 | 0.8265 | 0.8255 | 0.7258 | 0.7272
BERT | ResNet50 | 0.8929 | 0.8909 | 0.8905 | 0.8900 | 0.8600 | 0.8588
Table 2. Evaluation of Classification Accuracy on the Fakeddit Dataset Using Various Image/Text Embedders (All with Text+Image Input), Conducted by Reference [62]
Fig. 6. Examples of different classes in the Fakeddit dataset [62].
NewsBag comprises 200,000 real and 15,000 fake news articles. The real training articles were collected from the Wall Street Journal and the fake ones from The Onion website, which publishes satirical content. The samples of the test set, however, are collected from different websites, i.e., TheRealNews and ThePoke. The rationale behind using different news sources for the training and test sets is to observe how well models generalize to unseen data samples. The NewsBag dataset is highly imbalanced. To tackle this issue, NewsBag++ was also released, an augmented training version of NewsBag that contains 200,000 real and 389,000 fake news articles. Another weakness of the NewsBag dataset is that it lacks any social context information, such as spreader information, sharing trends, and reactions such as user comments and engagements [46].
MM-COVID is a multi-lingual and multi-dimensional COVID-19 fake news data repository. It comprises 3,981 fake news samples and 7,192 trustworthy ones in 6 different languages, i.e., English, Spanish, Portuguese, Hindi, French, and Italian. MM-COVID consists of visual, textual, and social context information, e.g., user and network information [57]. The dataset is annotated via the Snopes and Poynter crowdsourcing domains, where experts and journalists evaluate and fact-check news content and annotate it as either fake or real. While Snopes is an independent publication that mainly contains English content, Poynter runs an International Fact-Checking Network (IFCN) that unites 96 different fact-checking agencies, such as PolitiFact, covering 40 languages.
ReCOVery contains 2,029 news articles that have been shared on social media, most of which (2,017 samples) have both textual and visual information for multi-modal studies. ReCOVery is imbalanced in news class, i.e., the proportion of real to fake articles is around 2:1. The combined number of users who spread real news (78,659) and users sharing fake articles (17,323) exceeds the total number of users included in the dataset (93,761), reflecting the assumption that users can engage in spreading both real and fake news articles. Samples are annotated by two fact-checking resources: NewsGuard and Media Bias/Fact Check (MBFC), a website that evaluates the factual accuracy and political bias of news media. MBFC assigns each news medium one of six factual-accuracy levels based on the fact-checking results of its previously published news articles. Samples of ReCOVery are collected from 60 news domains, of which 22 are sources of reliable news articles (e.g., National Public Radio and Reuters) and the remaining 38 are sources of unreliable news articles (e.g., Human Are Free and Natural News) [109].
CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset that includes fake news from websites and social platforms, along with users’ social engagement with the news. It comprises 5,216 news articles, 296,752 related user engagements, 926 social platform posts about COVID-19, and ground truth labels. The publishing dates of the collected information range from December 1, 2019, to September 1, 2020. In total, 204 fake news articles, 3,565 true news articles, 28 fake claims, and 454 true claims are collected. Real news articles were crawled from 9 media outlets that have been cross-checked as reliable, e.g., the National Institutes of Health (NIH) and the CDC. Fake news was retrieved from several fact-checking websites, such as PolitiFact and Health Feedback [22].
MMCoVaR (Multi-modal COVID-19 Vaccine Focused Data Repository) is a dataset in which articles are annotated using two news website source checking methods and tweets are fact-checked based on a stance detection approach. MMCoVaR comprises 2,593 articles issued by 80 publishers and shared between 02/16/2020 and 05/08/2021, and 24,184 Twitter posts collected between 04/17/2021 and 05/08/2021. Samples are annotated using the Media Bias Chart and Media Bias/Fact Check (MBFC) and classified into two levels of credibility: articles are labeled as either reliable or unreliable, and tweets are annotated as reliable, inconclusive, or unreliable [17]. It is worth mentioning that textual, visual, and social context information is available for the news articles.
N24News is a multi-modal dataset extracted from New York Times articles published from 2010 to 2020. Each news article belongs to one of 24 different categories, e.g., science or arts, and the dataset comprises up to 3,000 samples of real news per category, for a total of 60,000 news articles. Each article sample contains a category tag, headline, abstract, article body, image, and corresponding image caption. The dataset is randomly split into training/validation/testing sets in a ratio of 8:1:1 [99]. The main weakness of this dataset is that it contains no fake samples, and all of the real samples are collected from a single source, i.e., the New York Times.
MuMiN: The Large-Scale Multilingual Multi-modal Fact-Checked Misinformation Social Network Dataset (MuMiN) comprises 21 million tweets belonging to 26K Twitter threads, each of which has been linked to one of 13K fact-checked claims in 41 different languages. MuMiN is available in three versions: large, medium, and small, with the largest consisting of 10,920 articles and 6,573 images. In this dataset, if a claim is “mostly true,” then it is labeled as factual. When a claim is deemed “half true” or “half false,” it is labeled as misinformation, with the justification that a statement containing a significant amount of false information should be considered misleading content. When there is no clear verdict, the claim is labeled as other [63].
A summary and side-by-side comparison of the previously mentioned datasets are shown in Table 3. As illustrated in Figure 7, most of these datasets are small, annotated with binary labels, sourced from limited platforms like Twitter, and contain only a few modalities, namely, text and image.
Table 3.
Dataset | Total Samples | # Classes | Modalities | Source | Details
image-verification-corpus [11] | 17,806 | 2 | image, text | Twitter |
Fakeddit [62] | 1,063,106 | 2, 3, 6 | image, text | Reddit | 682,996 samples are multi-modal.
NewsBag [46] | 215,000 | 2 | image, text | Train: Wall Street Journal & The Onion; Test: TheRealNews & ThePoke | Highly imbalanced; only 15,000 fake samples.
NewsBag++ [46] | 589,000 | 2 | image, text | Train: Wall Street Journal & The Onion; Test: TheRealNews & ThePoke | Same as NewsBag, but fake samples are synthetic, created by augmentation techniques.
MM-COVID [57] | 11,173 | 2 | image, text, social context | Twitter | 3,981 fake samples and 7,192 real samples.
ReCOVery [109] | 2,029 | 2 | image, text | Twitter | Imbalanced, with a roughly 2:1 real-to-fake ratio.
CoAID [22] | 5,216 | 2 | image, text | Twitter | Includes 296,752 user engagements and 926 social platform posts.
MMCoVaR [17] | 2,593 articles & 24,184 tweets | 2 | image, text, social context | Twitter | Tweets are labeled as reliable, inconclusive, or unreliable.
N24News [99] | 60,000 | 24 | image, text | New York Times | All samples are real, from 24 different categories.
MuMiN [63] | 10,920 | 3 | image, text | Twitter | Consists of 10,920 articles and 6,573 images.
Table 3. Statistics of Multi-modal Databases for Fake News Detection
Fig. 7. Number of news articles by dataset.

5 Challenges in Multi-modal Misinformation Detection

Recent studies on multi-modal learning have made significant contributions to the field of multi-modal fake news detection. However, there are still weaknesses and shortcomings, and recognizing them opens the door to new opportunities not only in fake news detection but also in the multi-modal field in general. In this section, we provide non-exhaustive lists of challenges and shortcomings for each direction of multi-modal misinformation research.

5.1 Data Study Challenges

This category refers to the weaknesses of current multi-modal datasets for misinformation detection. We briefly discussed some of these weaknesses in the multi-modal data study section. An itemized list of such limitations and shortcomings is as follows:
Lack of large and comprehensive datasets: As illustrated in Figure 7, most of the existing datasets are small in size and sometimes highly imbalanced in terms of the fake-to-real ratio.
Lack of cross-lingual datasets: Almost all social media platforms are multi-lingual environments where users share information in multiple languages. Although misinformation spreads in multiple languages, a vast majority of the existing datasets are mono-lingual, i.e., they only provide English content. Therefore, there is a serious lack of non-English content and annotations.
Limited modalities: As we discussed earlier, most of the existing multi-modal datasets only provide image and text modalities, thus neglecting useful information conveyed by other modalities such as video, audio, and so on. The necessity of providing more modalities becomes more apparent when we consider popular social media such as YouTube, TikTok, and Clubhouse, which are mainly video- or audio-based platforms.
Bias in event-specific datasets: Many of the existing datasets are created for specific events such as the COVID-19 crisis, thereby not covering a variety of events and topics. As a result, they may not sufficiently train models to detect fake news in other contexts.
Binary and domain-level ground truth: Most of the existing datasets provide binary and domain-level ground truth for well-known outlets such as The Onion or the New York Times. In addition, they often do not provide any information about the reasons for misinformation, e.g., cross-modal discordance, false connection, imposter content.
Subjective annotations and inconsistency of labels: As discussed in the data study section, different datasets use different crowd-sourcing and fact-checking agencies, so articles are annotated subjectively and with different labels across datasets. This makes it very challenging to analyze, compare, and interpret results.

5.2 Feature Study Challenges

This category comprises shortcomings related to cross-modal feature identification and extraction in the multi-modal fake news detection pipeline. Some of the most important weaknesses in current feature-based studies are:
Insufficiency of cross-modal cues: Although researchers have proposed some multi-modal cues, most of the existing models naively fuse image-based features with textual features as a supplement, and fewer works leverage explainable cross-modal cues beyond image and text combinations. Plenty of potentially useful multi-modal cues thus remain neglected by researchers.
Ineffective cross-modal embeddings: As mentioned earlier, the majority of the existing approaches only fuse embeddings with simple operations such as concatenation of the representations, thereby failing to build an effective and non-noisy cross-modal embedding. Such architectures fail in many cases, as the resulting cross-modal embedding consists of useless or irrelevant parts that may result in noisy representations.
Lack of language-independent features: A majority of existing work on misinformation leverages text features that are highly dependent on dataset languages, which are mostly English. Identifying language-independent features is an effective way to cope with mono-lingual datasets.
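As a concrete illustration of the embedding problem above, the following PyTorch sketch contrasts naive concatenation with a cross-attention fusion module that lets the text representation select relevant visual evidence. The dimensions and layer sizes are illustrative assumptions, not values from any surveyed model.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Naive late fusion: concatenate modality embeddings and classify.
    Irrelevant dimensions from either modality pass straight into the
    classifier, which is the noise problem described above."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))  # fake / real

    def forward(self, text_emb, image_emb):
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))

class CrossAttentionFusion(nn.Module):
    """Attention-based fusion: text tokens attend over image regions,
    so only relevant visual evidence enters the joint embedding."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (B, T, dim); image_regions: (B, R, dim)
        fused, _ = self.attn(query=text_tokens, key=image_regions,
                             value=image_regions)
        return self.classifier(fused.mean(dim=1))  # pool over tokens
```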

5.3 Model Study Challenges

This category refers to the shortcomings of current machine learning solutions in detecting misinformation in multi-modal environments. The following is a non-exhaustive list of existing shortcomings:
Lack of explainability in current models: A majority of the existing models provide no explanatory information about regions of interest, common patterns of inconsistency among modalities, or types of misinformation (e.g., manipulation, exaggeration). While some recent works use attention-based techniques to mitigate ineffective multi-modal embeddings and provide some interpretability, most follow a trial-and-error approach, such as masking, to find relevant sections to attend to. Yet interpretable and explainable AI is crucial for building trust and confidence as well as ensuring fairness and transparency, concerns that remain largely neglected.
Non-transferable models to unseen events: Most of the existing models extract and learn event-specific features (e.g., COVID-19, elections). They are therefore likely biased toward those events and, as a result, not transferable to unseen and emerging ones. Building models that learn general features and separate them from non-transferable event-specific features would be extremely useful (see the gradient-reversal sketch after this list).
Poor scalability of current models: Given the expensive and complicated structure of deep networks, and the fact that most existing multi-modal models employ one deep network per modality, these models do not scale as the number of modalities grows. Moreover, many of them demand heavy computing resources, large memory, and substantial processing power. The scalability of proposed models should therefore be taken into account when developing new architectures.
Vulnerability to adversarial attacks: Malicious adversaries continuously try to fool misinformation detection models. This is especially feasible when the underlying model's techniques and cues are revealed to the attacker, e.g., when the attacker can probe the model. As a result, many detection techniques become outdated within a short period of time, so there is a need for detection models that are resistant to manipulation.
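One established remedy for event-specific bias, used by event-adversarial approaches such as EANN [96], is a gradient-reversal layer that pushes a shared feature extractor toward event-invariant representations. The sketch below is a minimal PyTorch rendering of that idea; the layer sizes and the number of events are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward
    pass, so the feature extractor is trained to *fool* the event
    discriminator and thus learns event-invariant features."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class EventAdversarialHead(nn.Module):
    """Shared features feed two heads: a fake/real classifier and an
    event discriminator placed behind the gradient-reversal layer."""
    def __init__(self, feat_dim=512, n_events=10):
        super().__init__()
        self.fake_clf = nn.Linear(feat_dim, 2)
        self.event_clf = nn.Linear(feat_dim, n_events)

    def forward(self, features):
        fake_logits = self.fake_clf(features)
        event_logits = self.event_clf(grad_reverse(features))
        return fake_logits, event_logits
```

Training minimizes both the fake/real loss and the event loss; because the reversed gradients penalize event-predictive features, the shared representation generalizes better to unseen events.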

6 Opportunities in Multi-modal Misinformation Detection

Considering the challenges and shortcomings in multi-modal misinformation detection we discussed above, we propose opportunities for furthering research in this field. In what follows, we discuss these opportunities by each direction of multi-modal misinformation detection study.

6.1 Opportunities in Multi-modal Data Study

Considering the data study challenges we discussed earlier, we propose the following avenues:
Comprehensive multi-modal and multi-lingual datasets: As discussed earlier, an important gap in misinformation detection research is the lack of comprehensive multi-modal datasets. The field needs large, multi-lingual, multi-source datasets that cover a variety of modalities, web resources, and events, and that provide fine-grained ground truth for the samples.
Standardized annotation strategy: Current datasets are annotated by various fact-checking agencies, leading to subjective labels in many cases. Establishing a standardized labeling agreement across all datasets would facilitate easier cross-dataset comparison and analysis.

6.2 Opportunities in Multi-modal Feature Study

Based on the feature study challenges we discussed in the previous section, we propose the following research opportunities to overcome some of the existing challenges in multi-modal feature study:
Identifying cross-modal cues: Currently, cross-modal cues are restricted to a few basic indicators, such as the similarity between text and images (see the similarity sketch after this list). Identifying more subtle and often overlooked cues can aid in developing discordance-aware models and help recognize vulnerabilities in the serving platforms, which is integral to adversarial learning.
Developing efficient fusion mechanisms: Many of the existing solutions leverage naive fusion mechanisms such as concatenation, which may result in inefficient and noisy multi-modal representations. Therefore, another fruitful avenue of research lies in the study and development of more efficient fusion techniques to produce information-rich representations.
Identifying language-independent features to cope with mono-lingual datasets: A majority of existing datasets are mono-lingual and therefore insufficient for training models for non-English tasks. One way to compensate for the lack of multi-lingual datasets is to use language-independent features [93]. Identifying such features, especially in multi-modal environments where more features and aspects are available, would be highly effective in coping with mono-lingual datasets.
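As a minimal example of a cross-modal cue, the sketch below scores text-image consistency with a pre-trained CLIP model [68] via Hugging Face Transformers. The checkpoint name and the decision threshold are illustrative choices, not recommendations from the surveyed works.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One publicly available CLIP checkpoint; any similar model would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_text_similarity(caption: str, image_path: str) -> float:
    """Cosine similarity between caption and image embeddings; a low
    score can flag a possible false connection between the modalities."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    return float((text_emb @ img_emb.T).item())

# The 0.2 threshold is purely illustrative; a real system would
# calibrate it on labeled data.
# if image_text_similarity(post_text, post_image) < 0.2: flag_for_review()
```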

6.3 Opportunities in Multi-modal Model Study

Some unexplored research avenues to tackle existing model-related challenges in multi-modal misinformation detection include:
Utilizing foundation models and prompt-based techniques in multi-modal misinformation detection: The striking effectiveness of foundation models and associated techniques, including in-context learning (ICL) and prompt-tuning, on numerous multi-modal tasks suggests considerable potential for identifying multi-modal misinformation (a minimal zero-shot sketch follows this list). Developing task-specific foundation models for detecting misinformation is another opportunity that would greatly impact the field.
Developing cross-modal discordance-aware architectures: Most of the existing works either blindly merge modalities or take a trial-and-error approach to attend to the relevant modalities. Implementing discordance-aware models not only results in information-rich representations but also may be useful in making attention-based techniques more efficient.
Adversarial learning in multi-modal misinformation detection: Although generative architectures exist, the adversarial study of multi-modal misinformation detection has been mostly neglected. Making detection models adversarially robust requires sustained study and development of generative and adversarial learning techniques.
Interpretability of multi-modal models: Developing explainable frameworks that help users understand and interpret the predictions of multi-modal detection models is another opportunity in multi-modal misinformation detection. Explainability also supports related concerns such as model predictability, fairness and bias, and adversarial learning.
Transferable models to unseen events: As mentioned earlier, except for a few works, most existing models are designed for specific events and are consequently ineffective on emerging ones. Since misinformation spreads during a wide variety of events, developing general and transferable models is crucial.
Development of scalable models: Another opportunity is to develop models that are more efficient in terms of time and resources and do not become intolerably complicated while increasing the number of fused modalities.
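To make the prompt-based direction concrete, the following sketch screens a post's text with an off-the-shelf zero-shot classification pipeline, where the hypothesis template plays the role of a prompt. It is text-only and purely illustrative: the checkpoint and candidate labels are assumptions, and a real system would add the visual modality and calibrated decision thresholds.

```python
from transformers import pipeline

# One publicly available NLI model; any similar checkpoint would do
# for this illustration.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

post = "BREAKING: vaccine X alters human DNA, doctors confirm."
result = classifier(
    post,
    candidate_labels=["misinformation", "reliable news"],
    hypothesis_template="This text is {}.",  # the 'prompt'
)
# The top-ranked label and its score; scores are not calibrated
# probabilities and should not be read as such.
print(result["labels"][0], round(result["scores"][0], 3))
```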

7 Conclusions

In this article, we review the literature on multi-modal misinformation detection, discuss its strengths and weaknesses, and suggest new directions for future research. First, we introduce some of the prominent misinformation categories and often-used cross-modal cues for spotting them. We also discuss different fusion mechanisms for merging the modalities engaged in such cross-modal cues. In addition, we categorize existing solutions into two groups, classic machine learning and deep learning solutions, and further divide each group based on the techniques utilized. Furthermore, we introduce and compare existing datasets for multi-modal misinformation detection and identify some of their weaknesses. By classifying the field's shortcomings into data-, feature-, and model-based categories, we highlight the most prominent problems in multi-modal fake news detection. Finally, we propose new lines of research to address these shortcomings.

Footnotes

1. “Misinformation” is false information that spreads unintentionally, whereas the term “Disinformation” refers to false information that malicious users share intentionally and often strategically to affect other audiences’ behaviors toward social, political, and economic events. In this work, regardless of spreaders’ intention, we refer to all sorts of false news, i.e., misinformation and disinformation, as “Misinformation” or “Fake News” interchangeably.
5. The term “Junk Science” refers to inaccurate information about scientific facts that is used to skew opinions or push a hidden agenda.
6. Refers to biased information that is often generated to promote a political point of view. Propaganda ranges from completely false information to subtle manipulation.
7. Refers to rejecting a widely accepted explanation for an event and offering a secret plot instead.
15. The table is based on the work in Reference [62].

References

[1]
Sara Abdali, Rutuja Gurav, Siddharth Menon, Daniel Fonseca, Negin Entezari, Neil Shah, and Evangelos E. Papalexakis. 2021. Identifying misinformation from website screenshots. Proc. Int. AAAI Conf. Web Soc. Media 15, 1 (May 2021), 2–13. Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/18036
[2]
Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2020. HiJoD: Semi-supervised multi-aspect detection of misinformation using hierarchical joint decomposition. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD’20).
[3]
Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2021. KNH: Multi-view modeling with k-nearest hyperplanes graph for misinformation detection. CoRR abs/2102.07857 (2021).
[4]
Sara Abdali, M. Alex O. Vasilescu, and Evangelos E. Papalexakis. 2021. Deepfake Representation with Multilinear Regression. arxiv:2108.06702 [cs.CV]
[5]
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. 2017. VQA: Visual question answering. Int. J. Comput. Vision 123, 1 (May 2017), 4–31. DOI:
[6]
Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimiter Dimitrov, Giovanni Martino, Shaden Shaar, Hamed Firooz, and Preslav Nakov. 2021. A Survey on Multimodal Disinformation Detection. In Proceedings of the 29th International Conference on Computational Linguistics. Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (Eds.), International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 6625–6643. https://aclanthology.org/2022.coling-1.576
[7]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2024. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1723, 21 pages.
[8]
Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and M. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: A survey. Multim. Syst. 16 (2010), 345–379.
[9]
Adrien Benamira, Benjamin Devillers, Etienne Lesot, Ayush K. Ray, Manal Saadi, and Fragkiskos D. Malliaros. 2019. Semi-supervised learning and graph neural networks for fake news detection. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’19). Association for Computing Machinery, New York, NY, USA. DOI:
[10]
Giscard Biamby, Grace Luo, Trevor Darrell, and Anna Rohrbach. 2022. Twitter-COMMs: Detecting climate, COVID, and military multimodal misinformation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1530–1549.
[11]
Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, and Yiannis Kompatsiaris. 2018. Detection and visualization of misleading content on Twitter. Int. J. Multim. Inf. Retr. 7, 1 (2018), 71–86. DOI:
[12]
Said Boulahia, Abdenour Amamra, Mohamed Madi, and Said Daikh. 2021. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Applic. 32 (Nov. 2021). DOI:
[13]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[14]
Yegui Cai, George O. M. Yee, Yuan Xiang Gu, and Chung-Horng Lung. 2020. Threats to online advertising and countermeasures: A technical survey. Digit. Threats: Res. Pract. 1, 2, Article 11 (May 2020), 27 pages. DOI:
[15]
Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, and Jing Jiang. 2023. Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection. In Proceedings of the 31st ACM International Conference on Multimedia. 5244–5252.
[16]
Rui Cao, Roy Ka-Wei Lee, Wen-Haw Chong, and Jing Jiang. 2023. Prompting for multimodal hateful meme classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Retrieved from https://api.semanticscholar.org/CorpusID:256461095
[17]
Mingxuan Chen, Xinqiao Chu, and K. P. Subbalakshmi. 2021. MMCoVaR: Multimodal COVID-19 vaccine focused data repository for fake news detection and a baseline architecture for classification. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’21). Association for Computing Machinery, New York, NY, USA, 31–38. DOI:
[18]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325 (2015).
[19]
Hyewon Choi and Youngjoong Ko. 2022. Effective fake news video detection using domain knowledge and multimodal data fusion on YouTube. Pattern Recog. Lett. 154 (2022), 44–52. DOI:
[20]
Anshika Choudhary and Anuja Arora. 2021. ImageFake: An ensemble convolution models driven approach for image based fake news detection. In Proceedings of the 7th International Conference on Signal Processing and Communication (ICSC’21). 182–187. DOI:
[21]
Jian Cui, Lin Li, Xin Zhang, and Jingling Yuan. 2023. Multimodal propaganda detection via anti-persuasion prompt enhanced contrastive learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). 1–5. DOI:
[22]
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 Healthcare Misinformation Dataset. arxiv:2006.00885 [cs.SI]
[23]
Limeng Cui, Haeseung Seo, Maryam Tabar, Fenglong Ma, Suhang Wang, and Dongwon Lee. 2020. DETERRENT: Knowledge guided graph attention network for detecting healthcare misinformation. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’20). Association for Computing Machinery, New York, NY, USA. DOI:
[24]
Fernando Cardoso Durier da Silva, Rafael Vieira, and Ana Cristina Bicharra Garcia. 2019. Can machines learn to detect fake news? A survey focused on social media. In Proceedings of the Hawaii International Conference on System Sciences (HICSS’19).
[25]
Dipto Das. 2019. A Multimodal Approach to Sarcasm Detection on Social Media. Master’s thesis. MSU Graduate Theses.
[26]
Mudit Dhawan, Shakshi Sharma, Aditya Kadam, Rajesh Sharma, and Ponnurangam Kumaraguru. 2022. GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection. Social Network Analysis and Mining 14 (2022), 1–13. https://api.semanticscholar.org/CorpusID:247155148
[27]
Yuan Dong, Shan Gao, Kun Tao, Jiqing Liu, and Haila Wang. 2014. Performance evaluation of early and late fusion methods for generic semantics indexing. Pattern Anal. Appl. 17, 1 (Feb. 2014), 37–50. DOI:
[28]
Ignazio Gallo, Gianmarco Ria, Nicola Landro, and Riccardo La Grassa. 2020. Image and text fusion for UPMC Food-101 using BERT and CNNs. In Proceedings of the 35th International Conference on Image and Vision Computing New Zealand (IVCNZ’20). 1–6. DOI:
[29]
Wang Gao, Mingyuan Ni, Hongtao Deng, Xun Zhu, Peng Zeng, and Xi Hu. 2023. Few-shot fake news detection via prompt-based tuning. J. Intell. Fuzzy Syst. 44 (2023), 9933–9942. https://api.semanticscholar.org/CorpusID:257966771
[30]
Anastasia Giachanou, Guobiao Zhang, and Paolo Rosso. 2020. Multimodal multi-image fake news detection. In Proceedings of the IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA’20). 647–654. DOI:
[31]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 6325–6334.
[32]
Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2018. Semi-supervised content-based detection of misinformation via tensor embeddings. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’18). 322–325. DOI:
[33]
Danna Gurari, Qing Li, Abigale Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3608–3617. https://api.semanticscholar.org/CorpusID:3831582
[34]
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. 2020. Captioning Images Taken by People Who Are Blind. In Computer Vision — ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII (Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 417–434.
[35]
Saqib Hakak, Mamoun Alazab, Suleman Khan, Thippa Reddy Gadekallu, Praveen Kumar Reddy Maddikunta, and Wazir Zada Khan. 2021. An ensemble machine learning approach through effective feature extraction to classify fake news. Fut. Gen. Comput. Syst. 117 (2021), 47–58. DOI:
[36]
Stefan Helmstetter and Heiko Paulheim. 2018. Weakly supervised learning for fake news detection on Twitter. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’18). 274–277. DOI:
[37]
Benjamin Horne and Sibel Adali. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11. 759–766.
[38]
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 6693–6702. https://api.semanticscholar.org/CorpusID:152282269
[39]
Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. 2018. Fighting fake news: Image splice detection via learned self-consistency. In Computer Vision — ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XI (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 106–124.
[40]
Md Rafiqul Islam, Shaowu Liu, Wang Xianzhi, and Guandong Xu. 2020. Deep learning for misinformation detection on online social networks: A survey and new perspectives. Soc. Netw. Anal. Min. 10 (Dec. 2020). DOI:
[41]
Vidit Jain, Rohit Kaliyar, Anurag Goswami, Pratik Narang, and Yashvardhan Sharma. 2022. AENeT: An attention-enabled neural architecture for fake news detection using contextual features. Neural Comput. Applic. 34 (Jan. 2022). DOI:
[42]
Ramji Jaiswal, Upendra Pratap Singh, and Krishna Pratap Singh. 2021. Fake news detection using BERT-VGG19 multimodal variational autoencoder. In Proceedings of the IEEE 8th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON’21). 1–5. DOI:
[43]
Gongyao Jiang, Shuang Liu, Yu Zhao, Yueheng Sun, and Meishan Zhang. 2022. Fake news detection via knowledgeable prompt learning. Inf. Process. Manag. 59, 5 (2022), 103029.
[44]
Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, and Diana Maynard. 2023. Similarity-aware multimodal prompt learning for fake news detection. ArXiv abs/2304.04187 (2023).
[45]
Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia (MM’17). Association for Computing Machinery, New York, NY, USA, 795–816. DOI:
[46]
Sarthak Jindal, Raghav Sood, Richa Singh, Mayank Vatsa, and Tanmoy Chakraborty. 2020. NewsBag: A benchmark multimodal dataset for fake news detection. In SafeAI@AAAI. https://api.semanticscholar.org/CorpusID:213179339
[47]
Quanliang Jing, Di Yao, Xinxin Fan, Baoli Wang, Haining Tan, Xiangpeng Bu, and Jingping Bi. 2021. TRANSFAKE: Multi-task transformer for multimodal enhanced fake news detection. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’21). 1–8. DOI:
[48]
K. Shu, A. Sliva, S. Wang, and H. Liu. 2019. Beyond news contents: The role of social context for fake news detection. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). 312–320.
[49]
Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. 2019. MVAE: Multimodal variational autoencoder for fake news detection. In Proceedings of the World Wide Web Conference (WWW’19). Association for Computing Machinery, New York, NY, USA, 2915–2921. DOI:
[50]
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advan. Neural Inf. Process. Syst. 33 (2020), 2611–2624.
[51]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123, 1 (May 2017), 32–73. DOI:
[52]
Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. arXiv preprint arXiv:1804.08559 (2018).
[53]
Rina Kumari and Asif Ekbal. 2021. AMFB: Attention based multimodal factorized bilinear pooling for multimodal fake news detection. Expert Syst. Applic. 184 (2021), 115412. DOI:
[54]
Dana Lahat, T. Adalı, and Christian Jutten. 2015. Multimodal data fusion: An overview of methods, challenges, and prospects. Proc. IEEE 103 (2015), 1449–1477.
[55]
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3045–3059. DOI:
[56]
Lily Li, Or Levi, Pedram Hosseini, and David Broniatowski. 2020. A multi-modal method for satire detection using textual and visual cues. In Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda. 33–38.
[57]
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation. arxiv:2011.04088 [cs.SI]
[58]
Hongzhan Lin, Pengyao Yi, Jing Ma, Haiyun Jiang, Ziyang Luo, Shuming Shi, and Ruifang Liu. 2023. Zero-shot rumor detection with propagation structure via prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 5213–5221.
[59]
Yi-Ju Lu and Cheng-Te Li. 2020. GCAN: Graph-aware co-attention networks for explainable fake news detection on social media. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 505–514. DOI:
[60]
Héctor P. Martínez and Georgios N. Yannakakis. 2014. Deep multimodal fusion: Combining discrete events and continuous signals. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI’14). Association for Computing Machinery, New York, NY, USA, 34–41. DOI:
[61]
Nicola Messina, Fabrizio Falchi, Claudio Gennaro, and Giuseppe Amato. 2021. AIMH at SemEval-2021 task 6: Multimodal classification using an ensemble of transformer models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval’21). Association for Computational Linguistics, 1020–1026. DOI:
[62]
Kai Nakamura, Sharon Levy, and William Yang Wang. 2020. Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 6149–6157. Retrieved from https://aclanthology.org/2020.lrec-1.755
[63]
Dan S. Nielsen and Ryan McConville. 2022. MuMiN: A large-scale multilingual multimodal fact-checked misinformation social network dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3141–3153.
[64]
OpenAI. 2023. GPT-4 Technical Report. arxiv:2303.08774 [cs.CL]
[65]
Peng Qi, Juan Cao, Xirong Li, Huan Liu, Qiang Sheng, Xiaoyue Mi, Qin He, Yongbiao Lv, Chenyang Guo, and Yingchao Yu. 2021. Improving Fake News Detection by Using an Entity-enhanced Framework to Fuse Diverse Multimodal Clues. Association for Computing Machinery, New York, NY, USA. DOI:
[66]
Peng Qi, Juan Cao, Tianyun Yang, Junbo Guo, and Jintao Li. 2019. Exploiting multi-domain visual information for fake news detection. In 2019 IEEE International Conference on Data Mining (ICDM). 518–527. DOI:
[67]
Shengsheng Qian, Jinguang Wang, Jun Hu, Quan Fang, and Changsheng Xu. 2021. Hierarchical Multi-modal Contextual Attention Network for Fake News Detection. Association for Computing Machinery, New York, NY, USA, 153–162. DOI:
[68]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML’21).
[69]
Chahat Raj and Priyanka Meel. 2022. ARCNN framework for multimodal infodemic detection. Neural Netw. 146 (2022), 36–68. DOI:
[70]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning (ICML’21). PMLR, 8821–8831.
[71]
Saed Rezayi, Saber Soleymani, Hamid R. Arabnia, and Sheng Li. 2021. Socially aware multimodal deep neural networks for fake news classification. In Proceedings of the IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR’21). 253–259. DOI:
[72]
Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to detect manipulated facial images. CoRR abs/1901.08971 (2019).
[73]
Tanmay Sachan, Nikhil Pinnaparaju, Manish Gupta, and Vasudeva Varma. 2021. SCATE: Shared cross attention transformer encoders for multimodal fake news detection. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’21). Association for Computing Machinery, New York, NY, USA, 399–406. DOI:
[74]
Isabel Segura-Bedmar and Santiago Alonso-Bartolome. 2022. Multimodal fake news detection. Information 13, 6 (2022), 284.
[75]
Lanyu Shang, Ziyi Kou, Yang Zhang, and Dong Wang. 2021. A multimodal misinformation detector for COVID-19 short videos on TikTok. In 2021 IEEE International Conference on Big Data (Big Data). 899–908. DOI:
[76]
Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. DEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19). Association for Computing Machinery, New York, NY, USA, 395–405. DOI:
[77]
Kai Shu, Ahmed Hassan, Susan Dumais, and Huan Liu. 2020. Detecting fake news with weak social supervision. IEEE Intell. Syst. PP (May 2020), 1–1. DOI:
[78]
Kai Shu, Deepak Mahudeswaran, Suhang Wang, and Huan Liu. 2020. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14. 626–637.
[79]
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newslett. 19 (2017).
[80]
Kai Shu, Suhang Wang, and Huan Liu. 2018. Understanding user profiles on social media for fake news detection. In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR’18). 430–435. DOI:
[81]
Kai Shu, Guoqing Zheng, Yichuan Li, Subhabrata Mukherjee, Ahmed Hassan Awadallah, Scott Ruston, and Huan Liu. 2020. Early detection of fake news with multi-source weak social supervision. Springer-Verlag, Berlin, Heidelberg, 650–666.
[82]
Kai Shu, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu. 2019. The role of user profiles for fake news detection. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’19). Association for Computing Machinery, New York, NY, USA, 436–439. DOI:
[83]
Amila Silva, Yi Han, Ling Luo, Shanika Karunasekera, and Christopher Leckie. 2021. Propagation2Vec: Embedding partial propagation networks for explainable fake news early detection. Inf. Process. Manag. 58, 5 (2021), 102618. DOI:
[84]
Amila Silva, Ling Luo, Shanika Karunasekera, and Christopher Leckie. 2021. Embracing domain differences in fake news: Cross-domain fake news detection using multi-modal data. Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), 557–565.
[85]
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models that Can Read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 8309–8318.
[86]
Shivangi Singhal, Mudit Dhawan, Rajiv Ratn Shah, and Ponnurangam Kumaraguru. 2021. Inter-modality discordance for multimodal fake news detection. In Proceedings of the ACM Multimedia Asia Conference (MMAsia’21). Association for Computing Machinery, New York, NY, USA, Article 33, 7 pages. DOI:
[87]
Shivangi Singhal, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin’ichi Satoh. 2019. SpotFake: A multi-modal framework for fake news detection. In Proceedings of the IEEE 5th International Conference on Multimedia Big Data (BigMM’19). 39–47. DOI:
[88]
Chenguang Song, Nianwen Ning, Yunlei Zhang, and Bin Wu. 2021. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Inf. Process. Manag. 58 (2021), 102437.
[89]
Chenguang Song, Kai Shu, and Bin Wu. 2021. Temporally evolving graph neural network for fake news detection. Inf. Process. Manag. 58, 6 (2021), 102712. DOI:
[90]
Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, and Ziwei Liu. 2023. Link-context learning for multimodal LLMs. ArXiv abs/2308.07891 (2023).
[91]
Lin Tian, Xiuzhen Zhang, and Jey Han Lau. 2023. MetaTroll: Few-shot detection of state-sponsored trolls with transformer adapters. In Proceedings of the ACM Web Conference. 1743–1753.
[92]
Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. Deepfakes and beyond: A survey of face manipulation and fake detection. Inf. Fusion 64 (2020), 131–148. DOI:
[93]
Inna Vogel and Meghana Meghana. 2020. Detecting fake news spreaders on Twitter from a multilingual perspective. In Proceedings of the IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA’20). 599–606.
[94]
Jingzi Wang, Hongyan Mao, and Hongwei Li. 2022. FMFN: Fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 12, 3 (2022). DOI:
[95]
Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frédéric Precioso. 2015. Recipe recognition with large multimodal food dataset. In Proceedings of the IEEE International Conference on Multimedia Expo Workshops (ICMEW’15). 1–6. DOI:
[96]
Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’18). Association for Computing Machinery, New York, NY, USA, 9 pages. DOI:
[97]
Yaqing Wang, Fenglong Ma, Haoyu Wang, Kishlay Jha, and Jing Gao. 2021. Multimodal emergent fake news detection via meta neural process networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[98]
Youze Wang, Shengsheng Qian, Jun Hu, Quan Fang, and Changsheng Xu. 2020. Fake News Detection via Knowledge-Driven Multimodal Graph Convolutional Networks. Association for Computing Machinery, New York, NY, USA, 540–547. DOI:
[99]
Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. N24News: A new dataset for multimodal news classification. In Proceedings of the 13th International Conference on Language Resources and Evaluation Conference (LREC’22). European Language Resources Association (ELRA), 6768–6775.
[100]
Zuhui Wang, Zhaozheng Yin, and Young Anna Argyris. 2021. Detecting medical misinformation on social media using multimodal deep learning. IEEE J. Biomed. Health Inform. 25, 6 (2021), 2193–2203. DOI:
[101]
Liang Wu, Jundong Li, Xia Hu, and Huan Liu. 2017. Gleaning wisdom from the past: Early detection of emerging rumors in social media. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 99–107.
[102]
Liang Wu and Huan Liu. 2018. Tracing fake-news footprints: Characterizing social media messages by how they propagate. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining.
[103]
Junxiao Xue, Yabo Wang, Yichen Tian, Yafei Li, Lei Shi, and Lin Wei. 2021. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 58 (2021), 102610.
[104]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Ling. 2 (2014), 67–78. DOI:
[105]
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A new foundation model for computer vision. arXiv:2111.11432 [cs.CV] (2021). https://arxiv.org/abs/2111.11432
[106]
Zhenrui Yue, Huimin Zeng, Yang Zhang, Lanyu Shang, and Dong Wang. 2023. MetaAdapt: Domain adaptive few-shot misinformation detection via meta learning. arXiv preprint arXiv:2305.12692 (2023).
[107]
Jiangfeng Zeng, Yin Zhang, and Xiao Ma. 2020. Fake news detection for epidemic emergencies via deep correlations between text and images. Sustain. Cities Societ. 66 (Dec. 2020), 102652. DOI:
[108]
Honghao Zhou, Tinghuai Ma, Huan Rong, Yurong Qian, Yuan Tian, and Najla Al-Nabhan. 2022. MDMN: Multi-task and domain adaptation based multi-modal network for early rumor detection. Expert Syst. Applic. 195 (2022), 116517. DOI:
[109]
Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. 2020. ReCOVery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the Conference on Information and Knowledge Management (CIKM’20). Association for Computing Machinery, New York, NY, USA, 3205–3212. DOI:
[110]
Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-aware multi-modal fake news detection. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’20). Springer, 354–367.
[111]
Xinyi Zhou and Reza Zafarani. 2019. Network-based fake news detection: A pattern-driven approach. SIGKDD Explor. Newslett. 21, 2 (Nov. 2019), 48–60. DOI:
[112]
Zhi-Hua Zhou. 2017. A brief introduction to weakly supervised learning. Nat’l Sci. Rev. 5 (Aug. 2017). DOI:
[113]
Yuhui Zuo, Wei Zhu, and Guoyong Cai. 2022. Continually detection, rapidly react: Unseen rumors detection based on continual prompt-tuning. In Proceedings of the 29th International Conference on Computational Linguistics. 3029–3041.
