
Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities

Published: 22 November 2024

Abstract

As social media platforms evolve from text-based forums into multi-modal environments, the nature of misinformation in social media is also transforming accordingly. Taking advantage of the fact that visual modalities such as images and videos are more favorable and attractive to users, and textual content is sometimes skimmed carelessly, misinformation spreaders have recently targeted contextual connections between the modalities, e.g., text and image. Hence, many researchers have developed automatic techniques for detecting possible cross-modal discordance in web-based content. We analyze, categorize, and identify existing approaches in addition to the challenges and shortcomings they face to unearth new research opportunities in the field of multi-modal misinformation detection.

1 Introduction

Nowadays, billions of multi-modal posts containing text, images, videos, soundtracks, and so on are shared throughout the web, mainly via social media platforms such as Facebook, Twitter, Snapchat, Reddit, Instagram, and YouTube. While the combination of modalities allows for more expressive, detailed, and user-friendly content, it brings about new challenges, as it is harder to adapt uni-modal solutions to multi-modal environments.
However, in recent years, driven by the sheer scale of multi-modal platforms, machine learning researchers have introduced many automated techniques for multi-modal tasks, such as Visual Question Answering (VQA) [5, 31, 33, 38, 85], image captioning [18, 34, 51, 104], and, more recently, fake news detection, including hate speech in multi-modal memes [30, 50, 65, 87].
Similar to other multi-modal tasks, detecting fake news on multi-modal platforms is harder and more challenging, as it requires evaluating not only each modality but also the cross-modal connections and the credibility of the combination. This becomes even more challenging when each modality, e.g., text or image, is credible on its own but the combination creates misinformative content. For instance, a COVID-19 anti-vaccination misinformation post can pair text that reads “vaccines do this” with a graphic image of a dead person. In this case, although the image and text are not individually misinformative, taken together they create misinformation.
Over the past decade, several detection models [14, 40, 79, 81] have been developed to detect misinformation. However, the majority of them leverage only a single modality, e.g., text [32, 37, 77, 101] or image [1, 20, 39, 66], thereby missing the important information conveyed by other modalities. There are existing works [2, 3, 35, 48, 76] that leverage ensemble methods, training a separate model for each modality and then combining them to produce improved results. However, in many cases of multi-modal misinformation, loosely combining individual modalities is inadequate for detecting fake news, leading to the failure of the joint model.
Nevertheless, in recent years, machine learning scientists have developed different techniques for cross-modal fake news detection, which combine information from multiple modalities, leveraging cross-modal information such as the consistency and meaningful relationships between different modalities. Studying and analyzing these techniques and identifying existing challenges will give a clearer picture of the state of knowledge on multi-modal misinformation detection and open the door to new opportunities in this field.
Even though there are a number of valuable surveys on fake news detection [24, 52, 79], very few of them focus on multi-modal techniques [6, 74]. Since the number of proposed techniques for multi-modal fake news detection has been increasing immensely, the necessity of a comprehensive survey on existing techniques, datasets, and emerging challenges is felt more than ever. With that said, in this work, we aim to conduct a comprehensive study on fake news detection in multi-modal environments.
To this end, we classify multi-modal misinformation detection study into the following directions:
Multi-modal Data Study: In this direction, the goal is to collect multi-modal fake news data, e.g., image, text, social context, and so on, from different sources of information and use fact-checking resources to evaluate the veracity of the collected data and annotate them accordingly. Comparison and analysis of existing datasets, as well as benchmarking, are other tasks that fall under this category.
Multi-modal Feature Study: The primary goal of this study is to uncover significant links between various data modalities, which are frequently exploited by misinformation spreaders to distort, impersonate, or exaggerate original information. These meaningful connections may be used as clues for detecting misinformation in multi-modal environments such as social media posts. Another goal of this direction is to study and develop strategies for fusing features of different modalities and creating information-rich multi-modal features.
Multi-modal Model Study: The main focus of this direction is on the development of efficient multi-modal machine learning solutions to detect misinformation by leveraging multi-modal features and clues. Proposing new techniques and approaches, in addition to improving the performance, scalability, interpretability, and explicability of machine learning models, are some of the common tasks in this direction.
These three studies form a sequential pipeline in the multi-modal misinformation field, where the output of each study serves as the input for the next. Figure 1 provides a summary of these directions. In this work, we aim to explore each direction in greater depth to identify the challenges and shortcomings of each study and propose new avenues for addressing them.
Fig. 1. An overview of the multi-modal misinformation detection pipeline.
The rest of this survey is organized as follows: In Section 2, we discuss the multi-modal feature study by introducing some widely spread categories of misinformation in multi-modal settings and commonly used cross-modal clues for detecting them. In Section 3, we first discuss different fusion mechanisms for merging the modalities involved in such clues, and then present the multi-modal model study, introducing solutions and categorizing them based on the machine learning techniques they utilize. In Section 4, we describe the multi-modal data study by introducing, analyzing, and comparing existing datasets for multi-modal fake news detection. In Section 5, we discuss existing challenges and shortcomings that each direction is facing. Finally, in Section 6, we propose new avenues to address these shortcomings and advance multi-modal misinformation detection research.
We conducted our literature search across multiple databases, including IEEE Xplore, ACM Digital Library, and Google Scholar, using a combination of keywords related to our research focus. The inclusion criteria for the papers were defined by their relevance to the research question, publication date within the past 10 years to ensure timeliness, and peer-reviewed status to guarantee quality. The selection process involved an initial screening of titles and abstracts, followed by a full-text review to confirm that each paper met our stringent criteria. This methodical approach ensures that the included papers provide a diverse yet focused perspective on the subject, offering readers a succinct and informative summary of current knowledge in the field. We emphasize the importance of transparency in our literature selection process and outline these steps to clarify the criteria and rationale behind our choices.

2 Multi-modal Feature Study

In this section, we discuss the feature-based direction of multi-modal misinformation studies. To better understand the rationale behind multi-modal features and clues, we start with a brief introduction to some of the common categories of misinformation that spread in multi-modal environments. We then discuss some of the commonly used multi-modal features and clues; the fusion mechanisms for combining modality features, along with their pros and cons, are discussed in Section 3.

2.1 Common Categories of Misinformation in Multi-modal Environments

Multi-modal misinformation refers to a package of misleading information that includes multiple modalities such as images, text, videos, and so on. In multi-modal misinformation, not all modalities are necessarily false, but sometimes the connections between the modalities are manipulated to deceive the audience’s perception. In what follows, we briefly discuss some of the common categories of misinformation that are widely spread in multi-modal settings. It is worth mentioning that these categories of misinformation are common in both multi-modal and uni-modal environments. However, we provide examples of each category in multi-modal platforms as well.
Satire or Parody: This category refers to content that conveys true information with a satirical tone or added information that makes it false. One of the well-known publishers of this category is The Onion website, which is a digital media organization that publishes satirical articles on a variety of international, national, and local news. A multi-modal example of this category is an image within a satirical news article that contains absurd or ridiculous content or is manipulated to create humorous critique [25, 56]. In this case, the textual content may not necessarily be false, but when combined with an image, it creates misleading content.
Fabricated Content: This category of information is completely false and is generated to deceive the audience. The intention behind publishing fabricated content is usually to mislead people for political, social, or economic benefits. A multi-modal instance of this category is a news report that uses auxiliary images or videos that are either completely fake or belong to irrelevant events.
Imposter Content: This category of misinformation takes advantage of established news agencies by publishing misleading content under their branding. Since audiences trust established agencies, they are less likely to doubt the validity of the content and consequently pay less attention to subtle clues. Imposter content may damage the reputation of agencies and undermine audience trust. An example of imposter content is a website that mimics the domain features of global news outlets, such as CNN and BBC. To detect this category of misinformation, it is crucial to identify and pay attention to the subtle features of web publishers [1, 2].
Manipulated Content: This category of misinformation is generated by editing valid information, usually in the form of images and videos, to deceive audiences. Deepfake videos are well-known examples of this category. Manipulated videos and images have been widely generated to support fabricated content [4, 72, 92].
False Connection: This is one of the most common types of misinformation in multi-modal environments. In this category, some modalities, such as captions or titles, do not support other modalities, such as text or video. False connections are designed to catch the audience’s attention with clickbait headlines or provocative images [57, 62].
The above categories are used to spread a variety of fake news content, such as “Junk Science,” “Propaganda,” “Conspiracy Theories,” “Hate Speech,” “Rumors,” “Bias,” and so on. In the next section, we introduce some of the cross-modal clues for detecting them in multi-modal settings.

2.2 Multi-modal Features and Clues

As previously indicated, combining features such as text and images has recently been utilized to identify false information in multi-modal contexts. In this section, we provide a non-exhaustive list of frequently used cues that machine learning researchers have used to identify false information. We emphasize that even though there are numerous other multi-modal combinations, they have not yet been fully explored by researchers at the time of writing, and we merely enumerate those that are frequently used in the literature.
Image and text mismatch. The combination of textual content and article images is one of the most widely used sets of features for multi-modal fake news detection. The intuition behind this cue is that some fake news spreaders use tempting images, such as exaggerated, dramatic, or sarcastic graphics that are far from the textual content, to attract users’ attention. Since it is difficult to find images that are both pertinent and pristine to match these fabricated stories, fake news generators sometimes use manipulated images to support non-factual scenarios. Researchers refer to this cue as the similarity relationship between text and image [30, 103, 110], which can be captured with a variety of similarity-measuring techniques, such as the cosine similarity between the title and image-tag embeddings [30, 110] or dedicated similarity-measurement architectures [103]; a minimal sketch of this cue is given after this list.
Mismatch between video and descriptive writing style. On video-based platforms such as YouTube and TikTok, video content is accompanied by descriptive textual information such as video descriptions, titles, users’ comments, and replies. Different users and video producers use distinct writing styles in such textual content; machine learning models can learn these styles and distinguish them from unfamiliar patterns. Meanwhile, the meaningful relationship between the visual content and the descriptive information, such as the video title, is another important clue that can be used for detecting online misbehavior [19]. However, this is a very challenging task, as it is difficult to detect frames that are relevant to the text and discard irrelevant ones, such as advertisements or opening and ending frames. Moreover, encoding all video frames is very inefficient in terms of speed and memory.
Textual content and propagation network. The majority of online fact checkers, such as BS Detector or NewsGuard, provide labels that pertain to domains rather than articles. Despite this disparity, several works [36, 112] show that the weakly supervised approach of training on domain-level labels and subsequently testing on article-level labels yields negligible accuracy loss, owing to the strong correlation between the two. Thus, by recognizing domain features and behaviors, we may be able to classify the articles they publish with admissible accuracy. Among such feature patterns are the propagation network and word usage patterns of a domain, which can be considered a discriminating signature for different domains [78, 83, 84, 111]. It has been empirically shown that news articles from different domains not only have significantly different word usage but also follow different propagation patterns [84].
Textual content and overall look of serving domain. Another domain-level feature that researchers have recently introduced for detecting misinformation is the overall look of the serving webpage [1, 2]. It is shown that, in contrast to credible domains, unreliable web-based news outlets tend to be visually busy and full of events such as advertisements, popups, and so on [1]. Trustworthy webpages often look professional and ordered, as they often request users to agree to sign up or subscribe, have some featured articles, a headline picture, standard writing styles, and so on. However, unreliable domains tend to have an unprofessional blog-post style, negative space, and sometimes hard-to-read font errors. Considering this discriminating clue, researchers have recently proposed to consider the overall look of the webpages in addition to textual content and social context to create a multi-modal model for detecting misinformation [2, 3].
Video and audio mismatch. Due to the ubiquity of camera devices and video-editing applications, video-based frameworks are extremely vulnerable to manipulation, e.g., virtual backgrounds, anime filters, and so on. Such visual manipulations introduce non-trivial noise to the video frames, which may lead to the misclassification of irrelevant information from videos [75]. Moreover, manipulated videos often incorporate content in different modalities such as audio and text, which sometimes are not misinformative when considered individually. However, they mislead the audience when considered jointly with the video content. To detect misleading content that is jointly expressed in video, audio, and text content, researchers have proposed leveraging frame-based information along with audio and text content on video-based platforms like TikTok [75].
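To make the first cue above concrete, the following is a minimal sketch of the title/image-tag similarity idea used in References [30, 110]: the article title and the tags predicted for the image are embedded in a shared text-embedding space and compared via cosine similarity. The sentence-transformers model name, the example inputs, and the decision threshold are illustrative assumptions, not choices made by the cited works.

```python
# Minimal sketch: cosine similarity between a title embedding and the embedding
# of tags describing the attached image (tags would come from an image tagger).
# The model name and threshold below are illustrative, not from the cited papers.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def title_image_tag_similarity(title: str, image_tags: list[str]) -> float:
    title_emb = encoder.encode(title, convert_to_tensor=True)
    tags_emb = encoder.encode(" ".join(image_tags), convert_to_tensor=True)
    return util.cos_sim(title_emb, tags_emb).item()

# A low score signals a possible text-image mismatch.
sim = title_image_tag_similarity(
    "Massive flood submerges downtown", ["concert", "crowd", "stage lights"]
)
is_suspicious = sim < 0.2  # heuristic threshold for flagging
```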

3 Multi-modal Model Study

Extracted features and the way they are fused play an important role in the model architecture. In fact, model-based and feature-based studies are closely related through fusion strategies, which makes the demarcation of these two studies very difficult. Hence, in this section, we first discuss common fusion strategies as the point of connection between the two studies. Furthermore, we categorize existing works based on the machine learning techniques exploited by each work. Specifically, we classify them into two main categories: (1) classic machine learning and (2) deep learning-based solutions. In this section, we discuss each category in detail.

3.1 Fusion Mechanisms

Data fusion is the process of combining information from multiple modalities to take advantage of all different aspects of the data and extract as much information as possible to improve the performance of machine learning models, as opposed to using a single data aspect or modality. Different fusion mechanisms have been used to combine features from different modalities, including those mentioned in the previous section. Fusion mechanisms are often categorized into one of the following groups:
Early fusion. Also known as feature-level fusion, this refers to combining features from different data modalities at an early stage using an operation, often concatenation. This type of fusion is typically performed ahead of classification. If the fusion is done after feature extraction, then it is sometimes referred to as intermediate fusion [12, 54, 60].
Late fusion. Also known as decision-level or kernel-level fusion, this is usually done at the classification stage. This method depends on the results obtained from each data modality individually: the modality-wise classification results are combined using techniques such as sum, max, average, and weighted average. Most late fusion solutions use handcrafted combination rules, which are prone to human bias and may not reflect real-world peculiarities [12, 54, 60].
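The contrast between the two mechanisms can be illustrated with a short sketch; the feature dimensions, linear classifier heads, and averaging weights below are placeholder assumptions rather than settings from any surveyed system.

```python
# Early vs. late fusion on precomputed text and image feature vectors.
import torch
import torch.nn as nn

text_dim, image_dim, n_classes = 768, 512, 2

# Early (feature-level) fusion: combine features first, classify once.
early_classifier = nn.Linear(text_dim + image_dim, n_classes)

def early_fusion(text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
    fused = torch.cat([text_feat, image_feat], dim=-1)  # e.g., concatenation
    return early_classifier(fused)

# Late (decision-level) fusion: classify each modality, then combine decisions.
text_classifier = nn.Linear(text_dim, n_classes)
image_classifier = nn.Linear(image_dim, n_classes)

def late_fusion(text_feat, image_feat, w_text=0.5, w_image=0.5):
    p_text = text_classifier(text_feat).softmax(dim=-1)
    p_image = image_classifier(image_feat).softmax(dim=-1)
    return w_text * p_text + w_image * p_image  # weighted-average rule
```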

3.2 Comparison of Fusion Mechanisms

In most cases, early fusion is a complex operation, whereas late fusion is easier to perform [8]: unlike early fusion, where the features from different modalities (e.g., image and text) may have different representations, the decisions at the semantic level usually have the same representation, so fusing decisions is easier than fusing features. However, the late fusion strategy does not utilize the feature-level correlation among modalities, which may improve classification performance. In fact, it has been shown that in many cases the early fusion of different modalities outperforms late fusion when applying deep learning or classic machine learning classifiers [27, 28]. For instance, early fusion of images and texts using BERT and CNN on the UPMC Food-101 dataset [95] outperforms late fusion of these modalities.
Another advantage of early fusion is that it requires less computation time, because training is performed only once, whereas late fusion requires multiple classifiers for local decisions [8]. To have the best of both worlds, there are also hybrid approaches, which take advantage of both early and late fusion strategies [8]. Figures 2 to 4 illustrate simplified schemes of these fusion mechanisms for multi-modal learning. Traditional and modern approaches for detecting multi-modal misinformation, some of which employ these fusion mechanisms, are covered in the sections that follow.
Fig. 2. Early fusion mechanism.
Fig. 3. Late fusion mechanism.
Fig. 4. A hybrid of early and late fusion mechanisms.

3.3 Classic Machine Learning Solutions

As discussed earlier, a vast majority of misinformation detection methods leverage a single modality, i.e., a single aspect of news articles, e.g., text [32, 37, 77, 101], image [1, 20, 39, 66], user features [80, 82, 102], or temporal properties [52, 78, 89]. Only a few recent works incorporate various aspects of a news article using classic machine learning techniques to create multi-modal article representations.
For instance, a work by Shu et al. [48] proposes individual embedded representations for text, user-user interactions, user-article interactions, and publisher-article interactions, and defines a joint optimization problem leveraging these individual representations. Finally, they apply a “Non-convex Optimization” solution via the Alternating Least Squares (ALS) algorithm to solve the proposed optimization problem.
In another work, Abdali et al. [2] propose an “Algebraic Joint Structure” algorithm called HiJoD, which encodes three different aspects of an article: the article text, the context of social sharing behaviors, and host website/domain features. These aspects are transformed into individual embeddings, and shared structures among these embeddings are extracted using a principled tensor-based framework. By canceling out the unshared structures, the extracted shared structures are then utilized for article classification. The classification performance of the algebraic joint model, HiJoD, is compared with the “Naive Embeddings Concatenation” of embedding representations. The results demonstrate that the tensor-based representation is more effective in capturing the nuanced patterns of the joint structure.
Another study [3] presents the K-Nearest Hyperplanes (KNH) graph, a new type of graph generalization where nodes are higher-order Euclidean subspaces formed by algebraic structures, aimed at multi-aspect modeling of news articles.
More recently, Meel et al. [35] have proposed an ensemble framework that leverages text embeddings, a score calculated from the cosine similarity between the image caption and the news body, and noisy images. Although some modules of this model, e.g., the text embedding generator, leverage deep attention-based architectures, the classification itself is done via a classic ensemble technique, i.e., max voting.
In summary, due to the success of deep learning-based techniques in feature extraction and classification tasks, classic machine learning techniques are not commonly used these days. However, given that deep learning techniques are data-hungry and require substantial effort for training and fine-tuning, classic machine learning techniques are still used, either alone or in conjunction with deep learning techniques, depending on the application.

3.4 Deep Learning Solutions

Due to the impressive success of deep neural networks in feature extraction and classification of text, images, and many other modalities, they have been widely exploited by research scientists over the past few years for a variety of multi-modal tasks, including misinformation detection. We may categorize deep learning-based multi-modal misinformation detection into five categories: concatenation-based, attention-based, generative-based, graph neural network-based, and cross-modality discordance-aware architectures, as demonstrated in Figure 5. In what follows, we summarize and categorize the existing works into the aforementioned categories.
Fig. 5.
Fig. 5. An overview of the multi-modal model study.

3.4.1 Concatenation-based Architectures.

The majority of the existing work on multi-modal misinformation detection embeds each modality, e.g., text or image, into a vector representation and then concatenates them to generate a multi-modal representation that can be utilized for classification tasks. For instance, Singhal et al. propose using pretrained XLNet and VGG-19 models to embed text and image, respectively, and then classify the concatenation of the resulting feature vectors to detect misinformation [87].
In another work [74], Bartolome et al. exploit a Convolutional Neural Network (CNN) that takes as input both the text and the image corresponding to an article, and the outputs are concatenated into a single vector. Qi et al. extract the text, Optical Character Recognition (OCR) content, news-related high-level semantics of images (e.g., celebrities and landmarks), and visual CNN features of the image. Then, in the multi-modal feature fusion stage, text-image correlations, mutual enhancement, and entity inconsistency are merged via a concatenation operation [65].
In another work [71], Rezayi et al. leverage network, textual, and relaying features such as hashtags and URLs and classify articles using the concatenation of the feature embeddings. Works in References [69, 76] are other examples of this category of deep learning-based solutions.
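The generic pattern shared by these works can be sketched as follows; the encoder checkpoints, feature dimensions, and classifier head are illustrative assumptions and do not reproduce any specific cited system.

```python
# Generic concatenation-based detector: embed each modality with a pretrained
# encoder, concatenate the features, and classify the fused vector.
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_encoder = models.resnet50(weights="IMAGENET1K_V2")
image_encoder.fc = nn.Identity()  # expose the 2048-d pooled visual feature

classifier = nn.Sequential(nn.Linear(768 + 2048, 256), nn.ReLU(), nn.Linear(256, 2))

def predict(text: str, image_tensor: torch.Tensor) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    text_feat = text_encoder(**tokens).pooler_output       # [1, 768]
    image_feat = image_encoder(image_tensor.unsqueeze(0))  # [1, 2048]
    fused = torch.cat([text_feat, image_feat], dim=-1)     # simple concatenation
    return classifier(fused)                               # fake/real logits
```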

3.4.2 Attention-based Architectures.

As mentioned above, many architectures simply concatenate vector representations, thereby failing to build effective multi-modal embeddings. Such models fall short in many cases because misinformation may reside in only part of a modality: the entire text of an article need not be false, nor the entire image fabricated, for the article to constitute misinformative content. Thus, some recent works use the attention mechanism to attend to the relevant parts of images, text, and so on. The attention mechanism is a more effective approach for utilizing embeddings, as it produces richer multi-modal representations.
For instance, a work by Sachan et al. [73] proposes Shared Cross Attention Transformer Encoders (SCADE), which exploits CNNs and transformer-based methods to encode image and text information and utilizes cross-modal attention and shared layers for the two modalities. SCADE pays attention to the relevant parts of each modality with reference to the other.
Another example is a work by Kumari et al. [53], where a framework is developed to maximize the correlation between textual and visual information. This framework has four different sub-modules: Attention-Based Stacked Bidirectional Long Short Term Memory (ABS-BiLSTM) for textual feature representation, Attention-Based Multilevel Convolutional Neural Network–Recurrent Neural Network (ABM-CNN–RNN) for visual feature extraction, multi-modal Factorized Bilinear Pooling (MFB) for feature fusion, and, finally, Multi-Layer Perceptron (MLP) for classification.
In another study, Qian et al. [67] introduce the Hierarchical Multi-modal Contextual Attention Network (HMCAN) architecture. This architecture leverages a pre-trained BERT and convolutional ResNet50 to create word and image embeddings. It also employs a multi-modal contextual attention network to investigate multi-modal context information. HMCAN uses various multi-modal contextual attention networks to form a hierarchical encoding network, aiming to explore and capture the rich hierarchical semantics of multi-modal data.
Another example is Reference [45], where Jin et al. fuse features from three modalities, i.e., textual, visual, and social context, using an RNN that utilizes an attention mechanism (att-RNN) for feature alignment. Jing et al. propose TRANSFAKE [47] to connect features of text and images into a series and feed them into a vision-language transformer model to learn the joint representation of multi-modal features. TRANSFAKE adopts a preprocessing method similar to BERT for concatenated text, comments, and images.
In another work [94], Wang et al. apply scaled dot-product attention on top of image and text features as a fine-grained fusion and use the fused feature to classify articles.
Wang et al. propose a deep learning network for biomedical informatics that leverages visual and textual information and a semantic- and task-level attention mechanism to focus on the essential contents of a post that signal anti-vaccine messages [100].
Another example is the study by Lu et al., where they concatenate representations of user interaction, word representations, and propagation features after implementing a dual co-attention mechanism. The goal is to capture the correlations between users’ interactions/propagation and the tweet’s text [59].
Finally, Song et al. [88] propose a multi-modal fake news detection architecture based on Cross-modal Attention Residual (CARN) and Multichannel Convolutional Neural Networks (CARMN). CARN selectively extracts the information related to a target modality from a source modality while maintaining the unique information of the target.
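The common mechanism underlying these attention-based works can be distilled into a short sketch: scaled dot-product cross-attention lets text tokens attend to the image regions most relevant to them. This is a schematic single-head version with no learned projections, not a reimplementation of any cited model.

```python
# Schematic cross-modal attention: text tokens (queries) attend over image
# regions (keys/values), so the fused representation emphasizes relevant parts.
import torch

def cross_modal_attention(text_tokens: torch.Tensor, image_regions: torch.Tensor):
    # text_tokens: [T, d]; image_regions: [R, d]
    d = text_tokens.size(-1)
    scores = text_tokens @ image_regions.T / d ** 0.5  # [T, R] relevance scores
    weights = scores.softmax(dim=-1)                   # attention over regions
    attended = weights @ image_regions                 # [T, d] image context
    return text_tokens + attended                      # residual combination

# e.g., 20 word embeddings attending over 49 CNN feature-map regions
fused = cross_modal_attention(torch.randn(20, 256), torch.randn(49, 256))
```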

3.4.3 Generative Architectures.

In this category of deep learning solutions, the goal is to either apply Generative Networks or use auxiliary networks to learn individual or multi-modal representations, spaces, or parameters to improve the classification performance of the fake news detector.
As an example, Jaiswal et al. propose a BERT-based multi-modal Variational Autoencoder (VAE) [42] that consists of an encoder, a decoder, and a fake news detector. The encoder encodes the shared representations of both the image and the text into a multidimensional latent vector. The decoder decodes the latent vector back into the original image and text, and the fake news detector is a binary classifier that takes the shared representation as input and classifies it as either fake or real.
Similarly, Khattar et al. propose a deep Multi-modal Variational Autoencoder (MVAE) [49], which learns a unified representation of both modalities of a tweet’s content. Similar to the previous work, MVAE has three main components: an encoder, a decoder, and a fake news detector that utilizes the learned shared representation to predict whether a news item is fake or real.
Like the previous work, a work by Zeng et al. [107] proposes to capture the correlations between text and image by a VAE-based multi-modal feature fusion method. In another work, Wang et al. propose Event Adversarial Neural Networks (EANN) [96], an end-to-end framework that can derive event-invariant features and thus benefit the detection of fake news on newly arrived events. It consists of three main components: a multi-modal feature extractor, the fake news detector, and the event discriminator. The multi-modal feature extractor is responsible for extracting the textual and visual features from posts. It cooperates with the fake news detector to learn the discriminating representation of news articles. The role of the event discriminator is to remove the event-specific features and keep shared features among the events.
In another work [97], Wang et al. propose the MetaFEND framework, which is able to detect fake news on emergent events with a few verified posts using an event adaptation strategy. The MetaFEND framework has two stages: event adaptation and detection. In the event adaptation stage, the model adapts to specific events, and then in the detection stage, the event-specific parameter is leveraged to detect fake news on a given event. Although MetaFEND does not apply a generative architecture, it leverages an auxiliary network to learn an event-specific parameter set to improve the efficiency of the fake news detector.
The last example is a work [84] by Silva et al., where they propose a cross-domain framework using text and propagation network. The proposed model consists of two components: an unsupervised domain embedding learning and a supervised domain-agnostic news classification. The unsupervised domain embedding exploits text and propagation network to represent a news domain with a low-dimensional vector. The classification model represents each news record as a vector using the textual content and the propagation network. Then, the model maps this representation into two different subspaces such that one preserves the domain-specific information. Later on, these two components are integrated to identify fake news while exploiting domain-specific and cross-domain knowledge in the news records.
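The encoder-decoder-plus-detector pattern shared by the VAE-based models above can be summarized in a schematic sketch; the layer choices and dimensions are placeholders rather than the architectures of MVAE or its successors.

```python
# Schematic multi-modal VAE with an attached fake news classifier: a shared
# latent vector must both reconstruct the inputs and separate fake from real.
import torch
import torch.nn as nn

class MultimodalVAEDetector(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, latent_dim=64):
        super().__init__()
        self.encoder = nn.Linear(text_dim + image_dim, 2 * latent_dim)  # -> mu, logvar
        self.decoder = nn.Linear(latent_dim, text_dim + image_dim)      # reconstruction
        self.detector = nn.Linear(latent_dim, 2)                        # fake/real head

    def forward(self, text_feat, image_feat):
        mu, logvar = self.encoder(torch.cat([text_feat, image_feat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(z)    # trained with reconstruction + KL-divergence losses
        logits = self.detector(z)  # trained with a classification loss
        return recon, logits, mu, logvar
```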

3.4.4 Graph Neural Network Architectures.

In recent years, Graph Neural Networks (GNNs) have been successfully exploited for fake news detection [9, 23, 89], thereby catching researchers’ attention for multi-modal misinformation detection tasks as well. In this category of deep learning solutions, article content (e.g., text, image) is represented by graphs, and then graph neural networks are used to extract the semantic-level features.
For instance, Wang et al. construct a graph for each social media post based on the point-wise mutual information (PMI) score of pairs of words, extracted objects in visual content, and knowledge concepts through knowledge distillation. They then utilize a Knowledge-driven Multi-modal Graph Convolutional Network (KMGCN), which extracts the multi-modal representation of each post through graph convolutional networks [98].
Another GCN-based model is GAME-ON [26], which represents each news item with uni-modal visual and textual graphs and then projects them into a common space. To capture multi-modal representations, GAME-ON applies a graph attention layer on a multi-modal graph generated out of modality graphs.
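A schematic of this graph-based pattern is sketched below, assuming the torch_geometric library: the nodes of a post graph carry word/object/concept embeddings, edges encode co-occurrence (e.g., PMI above a threshold), and graph convolutions plus pooling yield a post representation. It is an illustration of the approach, not a KMGCN or GAME-ON reimplementation.

```python
# Post-level graph classifier: two GCN layers over the post graph, mean-pooled
# into a single vector, then classified as fake or real.
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

class PostGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=300, hidden_dim=64, n_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, n_classes)

    def forward(self, x, edge_index, batch):
        # x: [N, in_dim] node features (word/object/concept embeddings)
        # edge_index: [2, E] edges, e.g., word pairs with PMI above a threshold
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        post_repr = global_mean_pool(h, batch)  # one vector per post graph
        return self.head(post_repr)
```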

3.4.5 Cross-modal Discordance-aware Architectures.

In the previously discussed categories, deep learning models are employed to merge different modalities to create distinguishing representations. However, in this category, deep learning architectures are tailored to address identified discrepancies between modalities. The idea is that fabricating either modality will cause dissonance between them, leading to misrepresented, misinterpreted, and misleading news. Therefore, subtle cross-modal discordance clues can be identified and learned by customized architectures. Consequently, methods utilizing “contrastive learning” or Contrastive Language-Image Pre-Training (CLIP)-based architectures [21, 44] may fall into this category.
In many cases, fake news propagators use irrelevant modalities (e.g., image, video, audio) for false statements to attract readers’ attention. Thus, the similarity of text to other modalities (e.g., image, audio) is a cue for measuring the credibility of a news article.
With that said, Zhou et al. [110] propose SAFE, a Similarity-Aware Multi-Modal Fake News Detection framework that defines the relevance between news textual and visual information using a modified cosine similarity.
Similarly, Giachanou et al. propose a multi-image system that combines textual, visual, and semantic information [30]. The semantic representation refers to the text-image similarity calculated using the cosine similarity between the title and image tag embeddings.
In another work, Singhal et al. [86] develop an inter-modality discordance-based fake news detector that learns discriminating features and employs a modified version of contrastive loss that explores the inter-modality discordance.
Xue et al. [103] propose a Multi-modal Consistency Neural Network (MCNN) that utilizes a similarity measurement module that measures the similarity of multi-modal data to detect the possible mismatches between the image and text. Last, Biamby et al. [10] leverage the CLIP model [68] to jointly learn image/text representation to detect image-text inconsistencies in tweets. Instead of concatenating vector representations, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.
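The pairing objective behind such CLIP-style training can be sketched as a symmetric contrastive loss over a batch of matched (image, text) pairs; this simplified version illustrates the idea rather than reproducing the exact CLIP training code.

```python
# Symmetric contrastive (InfoNCE-style) loss: the i-th text should match the
# i-th image, so the diagonal of the similarity matrix holds the target pairs.
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: [B, d] L2-normalized embeddings of matched pairs
    logits = text_emb @ image_emb.T / temperature   # [B, B] all pairings
    targets = torch.arange(logits.size(0))          # correct pairs on the diagonal
    loss_t = F.cross_entropy(logits, targets)       # text -> image direction
    loss_i = F.cross_entropy(logits.T, targets)     # image -> text direction
    return (loss_t + loss_i) / 2
```

After such training, a low similarity between a post's image and text embeddings is itself evidence of cross-modal inconsistency, which is the signal exploited by the detectors above.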
On video-based platforms such as YouTube, different producers typically use distinct titles and descriptions, and users and subscribers express their opinions in different writing styles.
With this clue in mind, Choi et al. propose a framework to identify fake content on YouTube [19]. They use domain knowledge and the “hit-likes” of comments to create a comment embedding that is effective in detecting fake news videos. They encode multi-modal features, i.e., image and text, and detect discrepancies between the title, description, or video content and users’ comments.
In another work [75], Shang et al. develop TikTec, a multi-modal misinformation detection framework that explicitly exploits video captions to accurately capture key information from unreliable video content and learns the composed misinformation that is jointly conveyed by the visual and audio content. TikTec consists of four major components: a Caption-guided Visual Representation Learning (CVRL) module that identifies the misinformation-related visual features of each sampled video frame; an Acoustic-aware Speech Representation Learning (ASRL) module that learns the misleading semantic information deeply embedded in unstructured and casual audio tracks; a Visual-speech Co-attentive Information Fusion (VCIF) module that captures the multiview composed information jointly embedded in the heterogeneous visual and audio contents of the video; and a Supervised Misleading Video Detection (SMVD) module that identifies misleading COVID-19 videos.

3.4.6 Foundation Models and Prompt-based Techniques.

A foundation model is a large machine learning model that is trained on large-scale datasets such that it can be adapted to a wide range of downstream tasks. Some examples of multi-modal foundation models are pre-trained GPT-4 [64], DALL-E [70], Florence [105], Flamingo [7], and so on.
In-Context Learning (ICL) is the simplest and one of the most effective ways of using foundation models. ICL is a training-free technique where models learn to learn from limited demonstrations and descriptions and generalize to unseen tasks [90]. The learn-to-learn concept was first introduced in meta-learning, which is a family of machine learning techniques that uses few examples to adapt the model to new tasks. In recent years, meta-learning has been used for different applications, including multi-modal misinformation detection [97, 106]. However, the GPT-3 paper [13] shows that few-shot learning is an emergent capability of Large Language Models (LLMs) and could be taken advantage of using ICL techniques. In fact, a frozen model can be conditioned to perform a variety of tasks through ICL, where a user primes the model for a given task through prompt design, i.e., manually crafting a text prompt with descriptions or examples of the task.
A more effective way to condition frozen models is by using tunable prompts. Unlike model fine-tuning, which modifies the model’s parameters through additional training on new data, prompt-tuning adjusts the parameters of the prompt tokens while keeping the pre-trained model frozen [55].
ICL techniques, including few-shot and zero-shot prompting, as well as prompt tuning, have been widely used to query LLMs for a variety of downstream tasks, including misinformation detection. For example, Jiang et al. [43] study the role of prompt learning in detecting fake news. In another work [29], Gao et al. put forward a prompt-tuning template to extract knowledge from a pretrained LM for detecting misinformation. Another example is a work by Tian et al. [91], where few-shot learning is leveraged for troll detection. In another work by Lin et al. [58], prompt tuning is used for rumor detection using a zero-shot framework. Similarly, Reference [113] presents a continual learning framework that applies prompt tuning for rumor detection.
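As an illustration of how such prompts are assembled, the snippet below builds a few-shot claim-classification prompt; the template wording, demonstration examples, and the `query_llm` interface are hypothetical stand-ins, not artifacts of the cited works.

```python
# Hypothetical few-shot (in-context learning) prompt for misinformation
# detection; the demonstrations and query_llm() API are illustrative only.
DEMONSTRATIONS = [
    ("The claim: '5G towers spread the virus.'", "fake"),
    ("The claim: 'The WHO declared COVID-19 a pandemic in March 2020.'", "real"),
]

def build_prompt(claim: str) -> str:
    lines = ["Classify each claim as 'fake' or 'real'.\n"]
    for text, label in DEMONSTRATIONS:  # in-context demonstrations
        lines.append(f"{text}\nLabel: {label}\n")
    lines.append(f"The claim: '{claim}'\nLabel:")  # the model completes the label
    return "\n".join(lines)

# prediction = query_llm(build_prompt("Garlic cures COVID-19."))  # hypothetical
```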
However, there are few existing works that utilize them for misinformation detection in multi-modal settings. One of the existing works is PromptHate [16], a simple prompt-based multi-modal model that prompts pre-trained language models (PLMs) for hateful meme classification. PromptHate constructs simple prompts and provides a few in-context examples to exploit the implicit knowledge in the pre-trained RoBERTa to classify hateful memes.
In another work [21], a novel propaganda detection model, Antipersuasion Prompt Enhanced Contrastive Learning (APCL), is proposed for detecting propaganda. The prompt is designed with a persuasion prompt template and an anti-persuasion prompt template to build matched text-image and mismatched text-image pairs, respectively. Later on, the distances between the two prompt templates and pairs of text and image are used for detection.
More recently, Cao et al. leverage pre-trained vision-language models (PVLMs) in a zero-shot and fine-tuning-free VQA setting to address the problem of meme detection by generating hateful content-centric image captions [15].
In addition, Jian et al. propose a Similarity-Aware Multimodal Prompt Learning (SAMPLE) framework that incorporates prompt-tuning into multi-modal fake news detection [44]. SAMPLE applies three prompt templates (discrete, continuous, and mixed prompting) to the original input text and employs the pre-trained RoBERTa to extract text features from the prompt. Furthermore, the pre-trained CLIP is used to encode the input texts and images and to obtain their semantic similarities. SAMPLE introduces a similarity-aware multi-modal feature fusing approach that applies standardization and a Sigmoid function to adjust the intensity of the final cross-modal representation and mitigate the noise injected by uncorrelated cross-modal features.
A summary of the aforementioned deep learning-based works is demonstrated in Table 1. It is worth mentioning that many of the state-of-the-art solutions utilize a hybrid of deep learning solutions.
Table 1.
Paper(s) | Primary Focus
[87], [76], [74], [65], [71], [69] | Concatenation
[45], [59], [67], [61], [73], [53], [47], [100], [88], [94], [41] | Attention mechanism
[96], [49], [107], [42], [84], [108], [97] | Generative network
[98], [26] | GNN
[110], [30], [103], [75], [86], [10], [19] | Cross-modal cue
[16], [21], [15], [44] | Prompting
Table 1. A Summary of the Existing Deep Learning-based Solutions

4 Multi-modal Data Study

Data acquisition and preparation are the most important building blocks of a machine learning pipeline. Machine learning models leverage training data to continuously improve themselves over time. Thus, a sufficient amount of good-quality and, in most cases, annotated data is crucial for these models to operate effectively. With that said, in this section, we introduce and compare some of the existing multi-modal datasets for the fake news detection task. Later on, we discuss some of the limitations of these datasets.
Image-Verification-Corpus is an evolving dataset containing 17,806 fake and real posts with images shared on Twitter. It was created as an open corpus of tweets containing images that may be used for assessing online image verification approaches (based on tweet texts and user features) as well as for building classifiers for new content. The fake and real images in this dataset have been annotated by online sources that evaluate the credibility of the images and the events they are associated with [11].
Fakeddit is a dataset collected from Reddit, a social news and discussion website where users can post submissions on various subreddits. Fakeddit consists of over 1 million submissions from 22 different subreddits spanning over a decade, with the earliest submission from 3/19/2008 and the most recent from 10/24/2019. The submissions were posted by over 300,000 users on highly active and popular pages. Fakeddit includes submission titles, images, user comments, and submission metadata such as the score, the username of the author, the subreddit source, the sourced domain, the number of comments, and the up-vote to down-vote ratio. Approximately 64% of the samples have both text and image data [62]. Samples are annotated with 2-way, 3-way, and 6-way labels, the latter comprising true, satire/parody, misleading content, manipulated content, false connection, and imposter content. Examples of the 6-way labels are shown in Figure 6. Additionally, Table 2 compares the performance of various methods on the Fakeddit dataset [62].
Table 2.
Text Encoder | Image Encoder | 2-way Val | 2-way Test | 3-way Val | 3-way Test | 6-way Val | 6-way Test
InferSent | VGG16 | 0.8655 | 0.8658 | 0.8618 | 0.8624 | 0.8130 | 0.8130
InferSent | EfficientNet | 0.8328 | 0.8339 | 0.8259 | 0.8256 | 0.7266 | 0.7280
InferSent | ResNet50 | 0.8888 | 0.8891 | 0.8855 | 0.8863 | 0.8546 | 0.8526
BERT | VGG16 | 0.8694 | 0.8699 | 0.8644 | 0.8655 | 0.8177 | 0.8208
BERT | EfficientNet | 0.8334 | 0.8318 | 0.8265 | 0.8255 | 0.7258 | 0.7272
BERT | ResNet50 | 0.8929 | 0.8909 | 0.8905 | 0.8900 | 0.8600 | 0.8588
Table 2. Evaluation of Classification Accuracy on the Fakeddit Dataset Using Various Image/Text Embedders (All with Text+Image Input), Conducted by Reference [62]
Fig. 6. Examples of different classes in the Fakeddit dataset [62].
NewsBag comprises 200,000 real and 15,000 fake news articles. The real training articles were collected from the Wall Street Journal and the fake ones from The Onion website, which publishes satirical content. The samples of the test set, however, are collected from different websites, i.e., TheRealNews and ThePoke. The rationale behind using different news sources for the training and test sets is to observe how well models generalize to unseen data samples. The NewsBag dataset is highly imbalanced. To tackle this issue, NewsBag++ was also released, an augmented training version of NewsBag that contains 200,000 real and 389,000 fake news articles. Another weakness of the NewsBag dataset is that it lacks any social context information, such as spreader information, sharing trends, and reactions such as user comments and engagements [46].
MM-COVID is a multi-lingual and multi-dimensional COVID-19 fake news data repository. It comprises 3,981 fake news samples and 7,192 trustworthy ones in 6 different languages, i.e., English, Spanish, Portuguese, Hindi, French, and Italian. MM-COVID consists of visual, textual, and social context information, e.g., user and network information [57]. The dataset is annotated via the Snopes and Poynter crowdsourcing domains, where experts and journalists evaluate and fact-check news content and annotate it as either fake or real. While Snopes is an independent publication that mainly contains English content, Poynter runs an International Fact-Checking Network (IFCN) that unites 96 different fact-checking agencies, such as PolitiFact, covering 40 languages.
ReCOVery contains 2,029 news articles that have been shared on social media, most of which (2,017 samples) have both textual and visual information for multi-modal studies. ReCOVery is imbalanced in news class, i.e., the proportion of real to fake articles is around 2:1. The combined number of users who spread real news (78,659) and users sharing fake articles (17,323) exceeds the total number of users included in the dataset (93,761), reflecting the assumption that users can engage in spreading both real and fake news articles. Samples are annotated by two fact-checking resources: NewsGuard and Media Bias/Fact Check (MBFC), a website that evaluates the factual accuracy and political bias of news media. MBFC assigns each news medium one of six factual-accuracy levels based on the fact-checking results of its previously published news articles. Samples of ReCOVery are collected from 60 news domains, of which 22 are sources of reliable news articles (e.g., National Public Radio and Reuters) and the remaining 38 are sources of unreliable news articles (e.g., Human Are Free and Natural News) [109].
CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset that includes fake news from websites and social platforms, along with users’ social engagement with the news. It comprises 5,216 news articles, 296,752 related user engagements, 926 social platform posts about COVID-19, and ground truth labels. The publishing dates of the collected information range from December 1, 2019, to September 1, 2020. In total, 204 fake news articles, 3,565 true news articles, 28 fake claims, and 454 true claims are collected. Real news articles were crawled from 9 media outlets that have been cross-checked as reliable, e.g., the National Institutes of Health (NIH) and the CDC. Fake news was retrieved from several fact-checking websites, such as PolitiFact and Health Feedback [22].
MMCoVaR (Multi-modal COVID-19 Vaccine Focused Data Repository) is a dataset in which articles are annotated using two news website source checking methods and tweets are fact-checked based on a stance detection approach. MMCoVaR comprises 2,593 articles issued by 80 publishers and shared between 02/16/2020 and 05/08/2021, and 24,184 Twitter posts collected between 04/17/2021 and 05/08/2021. Samples are annotated using the Media Bias Chart and Media Bias/Fact Check (MBFC) and classified into two levels of credibility: articles are labeled as either reliable or unreliable, and tweets are annotated as reliable, inconclusive, or unreliable [17]. It is worth mentioning that textual, visual, and social context information is available for the news articles.
N24News is a multi-modal dataset extracted from New York Times articles published from 2010 to 2020. Each news article belongs to one of 24 different categories, e.g., science or arts, and the dataset comprises up to 3,000 samples of real news per category, for a total of 60,000 news articles. Each article sample contains a category tag, headline, abstract, article body, image, and corresponding image caption. The dataset is randomly split into training/validation/testing sets in a ratio of 8:1:1 [99]. The main weakness of this dataset is that it contains no fake samples, and all of the real samples are collected from a single source, i.e., the New York Times.
MuMiN: The Large-Scale Multilingual Multi-modal Fact-Checked Misinformation Social Network Dataset (MuMiN) comprises 21 million tweets belonging to 26K Twitter threads, each of which has been linked to one of 13K fact-checked claims in 41 different languages. MuMiN is available in three versions: large, medium, and small, with the largest consisting of 10,920 articles and 6,573 images. In this dataset, if a claim is “mostly true,” then it is labeled as factual. When a claim is deemed “half true” or “half false,” it is labeled as misinformation, with the justification that a statement containing a significant amount of false information should be considered misleading content. When there is no clear verdict, the claim is labeled as other [63].
A summary and side-by-side comparison of the previously mentioned datasets are shown in Table 3. As illustrated in Figure 7, most of these datasets are small, annotated with binary labels, sourced from limited platforms like Twitter, and contain only a few modalities, namely, text and image.
Table 3.
Dataset | Total Samples | # Classes | Modalities | Source | Details
image-verification-corpus [11] | 17,806 | 2 | image, text | Twitter |
Fakeddit [62] | 1,063,106 | 2, 3, 6 | image, text | Reddit | 682,996 samples are multi-modal.
NewsBag [46] | 215,000 | 2 | image, text | Train: Wall Street Journal & The Onion; Test: TheRealNews & ThePoke | Highly imbalanced; only 15,000 fake samples.
NewsBag++ [46] | 589,000 | 2 | image, text | Train: Wall Street Journal & The Onion; Test: TheRealNews & ThePoke | Same as NewsBag, but fake samples are synthetic, created by augmentation techniques.
MM-COVID [57] | 11,173 | 2 | image, text, social context | Twitter | 3,981 fake samples and 7,192 real samples.
ReCOVery [109] | 2,029 | 2 | image, text | Twitter | Imbalanced, with a roughly 2:1 real-to-fake ratio.
CoAID [22] | 5,216 | 2 | image, text | Twitter | Includes 296,752 user engagements and 926 social platform posts.
MMCoVaR [17] | 2,593 articles & 24,184 tweets | 2 | image, text, social context | Twitter | Tweets are labeled as reliable, inconclusive, or unreliable.
N24News [99] | 60,000 | 24 | image, text | New York Times | All samples are real, from 24 different categories.
MuMiN [63] | 10,920 | 3 | image, text | Twitter | Consists of 10,920 articles and 6,573 images.
Table 3. Statistics of Multi-modal Databases for Fake News Detection
Fig. 7. Number of news articles by dataset.

5 Challenges in Multi-modal Misinformation Detection

Recent studies on multi-modal learning have made significant contributions to the field of multi-modal fake news detection. However, there are still weaknesses and shortcomings, and recognizing them opens the door to new opportunities not only in fake news detection but also in the multi-modal field in general. In this section, we provide non-exhaustive lists of challenges and shortcomings for each direction of multi-modal misinformation research.

5.1 Data Study Challenges

This category refers to the weaknesses of current multi-modal datasets for misinformation detection. We briefly discussed some of these weaknesses in the multi-modal data study section. An itemized list of such limitations and shortcomings is as follows:
Lack of large and comprehensive datasets: As illustrated in Figure 7, most of the existing datasets are small in size and sometimes highly imbalanced in terms of the fake-to-real ratio.
Lack of cross-lingual datasets: Almost all social media platforms are multi-lingual environments where users share information in multiple languages. Although misinformation spreads in multiple languages, a vast majority of the existing datasets are mono-lingual, i.e., they only provide English content. Therefore, there is a serious lack of non-English content and annotations.
Limited modalities: As we discussed earlier, most of the existing multi-modal datasets only provide image and text modalities, thus neglecting useful information conveyed by other modalities such as video, audio, and so on. The necessity of providing more modalities becomes more apparent when we consider popular social media such as YouTube, TikTok, and Clubhouse, which are mainly video- or audio-based platforms.
Bias in event-specific datasets: Many of the existing datasets are created for specific events such as the COVID-19 crisis, thereby not covering a variety of events and topics. As a result, they may not sufficiently train models to detect fake news in other contexts.
Binary and domain-level ground truth: Most of the existing datasets provide binary and domain-level ground truth for well-known outlets such as The Onion or the New York Times. In addition, they often do not provide any information about the reasons for misinformation, e.g., cross-modal discordance, false connection, imposter content.
Subjective annotations and inconsistency of labels: As discussed in the data study section, different datasets use different crowd-sourcing and fact-checking agencies, so articles are annotated subjectively and with different labels across datasets. This makes it very challenging to analyze, compare, and interpret results.

5.2 Feature Study Challenges

This category comprises shortcomings related to cross-modal feature identification and extraction in the multi-modal fake news detection pipeline. Some of the most important weaknesses in current feature-based studies are:
Insufficiency of cross-modal cues: Although researchers have proposed some multi-modal cues, most of the existing models naively fuse image-based features with textual features as a supplement, and fewer works leverage explainable cross-modal cues beyond image and text combinations. Plenty of potentially useful multi-modal cues thus remain neglected by researchers.
Ineffective cross-modal embeddings: As mentioned earlier, the majority of the existing approaches only fuse embeddings with simple operations such as concatenation of the representations, thereby failing to build an effective and non-noisy cross-modal embedding. Such architectures fail in many cases, as the resulting cross-modal embedding consists of useless or irrelevant parts that may result in noisy representations.
Lack of language-independent features: A majority of existing work on misinformation leverages text features that are highly dependent on dataset languages, which are mostly English. Identifying language-independent features is an effective way to cope with mono-lingual datasets.
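As a concrete illustration of the embedding problem above, the following PyTorch sketch contrasts naive concatenation with a cross-attention fusion module that lets the text representation select relevant visual evidence. The dimensions and layer sizes are illustrative assumptions, not values from any surveyed model.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Naive late fusion: concatenate modality embeddings and classify.
    Irrelevant dimensions from either modality pass straight into the
    classifier, which is the noise problem described above."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))  # fake / real

    def forward(self, text_emb, image_emb):
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))

class CrossAttentionFusion(nn.Module):
    """Attention-based fusion: text tokens attend over image regions,
    so only relevant visual evidence enters the joint embedding."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (B, T, dim); image_regions: (B, R, dim)
        fused, _ = self.attn(query=text_tokens, key=image_regions,
                             value=image_regions)
        return self.classifier(fused.mean(dim=1))  # pool over tokens
```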

5.3 Model Study Challenges

This category refers to the shortcomings of current machine learning solutions in detecting misinformation in multi-modal environments. The following is a non-exhaustive list of existing shortcomings:
Lack of explainability in current models: A majority of the existing models provide no explanatory information about regions of interest, common patterns of inconsistency among modalities, or types of misinformation (e.g., manipulation, exaggeration). While some recent works use attention-based techniques to mitigate ineffective multi-modal embeddings and provide some interpretability, most follow a trial-and-error approach, such as masking, to find relevant sections to attend to. Yet interpretable and explainable AI is crucial for building trust and confidence as well as ensuring fairness and transparency, concerns that remain largely neglected.
Non-transferable models to unseen events: Most of the existing models extract and learn event-specific features (e.g., COVID-19, elections). They are therefore likely biased toward those events and, as a result, not transferable to unseen and emerging ones. Building models that learn general features and separate them from non-transferable event-specific features would be extremely useful (see the gradient-reversal sketch after this list).
Poor scalability of current models: Given the expensive and complicated structure of deep networks, and the fact that most existing multi-modal models employ one deep network per modality, these models do not scale as the number of modalities grows. Moreover, many of them demand heavy computing resources, large memory, and substantial processing power. The scalability of proposed models should therefore be taken into account when developing new architectures.
Vulnerability to adversarial attacks: Malicious adversaries continuously try to fool misinformation detection models. This is especially feasible when the underlying model's techniques and cues are revealed to the attacker, e.g., when the attacker can probe the model. As a result, many detection techniques become outdated within a short period of time, so there is a need for detection models that are resistant to manipulation.
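One established remedy for event-specific bias, used by event-adversarial approaches such as EANN [96], is a gradient-reversal layer that pushes a shared feature extractor toward event-invariant representations. The sketch below is a minimal PyTorch rendering of that idea; the layer sizes and the number of events are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward
    pass, so the feature extractor is trained to *fool* the event
    discriminator and thus learns event-invariant features."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class EventAdversarialHead(nn.Module):
    """Shared features feed two heads: a fake/real classifier and an
    event discriminator placed behind the gradient-reversal layer."""
    def __init__(self, feat_dim=512, n_events=10):
        super().__init__()
        self.fake_clf = nn.Linear(feat_dim, 2)
        self.event_clf = nn.Linear(feat_dim, n_events)

    def forward(self, features):
        fake_logits = self.fake_clf(features)
        event_logits = self.event_clf(grad_reverse(features))
        return fake_logits, event_logits
```

Training minimizes both the fake/real loss and the event loss; because the reversed gradients penalize event-predictive features, the shared representation generalizes better to unseen events.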

6 Opportunities in Multi-modal Misinformation Detection

Considering the challenges and shortcomings in multi-modal misinformation detection we discussed above, we propose opportunities for furthering research in this field. In what follows, we discuss these opportunities by each direction of multi-modal misinformation detection study.

6.1 Opportunities in Multi-modal Data Study

Considering the data study challenges we discussed earlier, we propose the following avenues:
Comprehensive multi-modal and multi-lingual datasets: As discussed earlier, an important gap in misinformation detection research is the lack of comprehensive multi-modal datasets. The field needs large, multi-lingual, multi-source datasets that cover a variety of modalities, web resources, and events, and that provide fine-grained ground truth for the samples.
Standardized annotation strategy: Current datasets are annotated by various fact-checking agencies, leading to subjective labels in many cases. Establishing a standardized labeling agreement across all datasets would facilitate easier cross-dataset comparison and analysis.

6.2 Opportunities in Multi-modal Feature Study

Based on the feature study challenges we discussed in the previous section, we propose the following research opportunities to overcome some of the existing challenges in multi-modal feature study:
Identifying cross-modal cues: Currently, cross-modal cues are restricted to a few basic indicators, such as the similarity between text and images (see the similarity sketch after this list). Identifying more subtle and often overlooked cues can aid in developing discordance-aware models and help recognize vulnerabilities in the serving platforms, which is integral to adversarial learning.
Developing efficient fusion mechanisms: Many of the existing solutions leverage naive fusion mechanisms such as concatenation, which may result in inefficient and noisy multi-modal representations. Therefore, another fruitful avenue of research lies in the study and development of more efficient fusion techniques to produce information-rich representations.
Identifying language-independent features to cope with mono-lingual datasets: A majority of existing datasets are mono-lingual and therefore insufficient for training models for non-English tasks. One way to compensate for the lack of multi-lingual datasets is to use language-independent features [93]. Identifying such features, especially in multi-modal environments where more features and aspects are available, would be highly effective in coping with mono-lingual datasets.
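As a minimal example of a cross-modal cue, the sketch below scores text-image consistency with a pre-trained CLIP model [68] via Hugging Face Transformers. The checkpoint name and the decision threshold are illustrative choices, not recommendations from the surveyed works.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One publicly available CLIP checkpoint; any similar model would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_text_similarity(caption: str, image_path: str) -> float:
    """Cosine similarity between caption and image embeddings; a low
    score can flag a possible false connection between the modalities."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    return float((text_emb @ img_emb.T).item())

# The 0.2 threshold is purely illustrative; a real system would
# calibrate it on labeled data.
# if image_text_similarity(post_text, post_image) < 0.2: flag_for_review()
```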

6.3 Opportunities in Multi-modal Model Study

Some unexplored research avenues to tackle existing model-related challenges in multi-modal misinformation detection include:
Utilizing foundation models and prompt-based techniques in multi-modal misinformation detection: The striking effectiveness of foundation models and associated techniques, including in-context learning (ICL) and prompt-tuning, on numerous multi-modal tasks suggests considerable potential for identifying multi-modal misinformation (a minimal zero-shot sketch follows this list). Developing task-specific foundation models for detecting misinformation is another opportunity that would greatly impact the field.
Developing cross-modal discordance-aware architectures: Most of the existing works either blindly merge modalities or take a trial-and-error approach to attend to the relevant modalities. Implementing discordance-aware models not only results in information-rich representations but also may be useful in making attention-based techniques more efficient.
Adversarial learning in multi-modal misinformation detection: Although generative architectures exist, the adversarial study of multi-modal misinformation detection has been mostly neglected. Making detection models adversarially robust requires sustained study and development of generative and adversarial learning techniques.
Interpretability of multi-modal models: Developing explainable frameworks that help users understand and interpret the predictions of multi-modal detection models is another opportunity in multi-modal misinformation detection. Explainability also supports related concerns such as model predictability, fairness and bias, and adversarial learning.
Transferable models to unseen events: As mentioned earlier, except for a few works, most existing models are designed for specific events and are consequently ineffective on emerging ones. Since misinformation spreads during a wide variety of events, developing general and transferable models is crucial.
Development of scalable models: Another opportunity is to develop models that are more efficient in terms of time and resources and do not become intolerably complicated while increasing the number of fused modalities.
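To make the prompt-based direction concrete, the following sketch screens a post's text with an off-the-shelf zero-shot classification pipeline, where the hypothesis template plays the role of a prompt. It is text-only and purely illustrative: the checkpoint and candidate labels are assumptions, and a real system would add the visual modality and calibrated decision thresholds.

```python
from transformers import pipeline

# One publicly available NLI model; any similar checkpoint would do
# for this illustration.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

post = "BREAKING: vaccine X alters human DNA, doctors confirm."
result = classifier(
    post,
    candidate_labels=["misinformation", "reliable news"],
    hypothesis_template="This text is {}.",  # the 'prompt'
)
# The top-ranked label and its score; scores are not calibrated
# probabilities and should not be read as such.
print(result["labels"][0], round(result["scores"][0], 3))
```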

7 Conclusions

In this article, we review the literature on multi-modal misinformation detection, discuss its strengths and weaknesses, and suggest new directions for future research. First, we introduce some of the prominent misinformation categories and often-used cross-modal cues for spotting them. We also discuss different fusion mechanisms for merging the modalities engaged in such cross-modal cues. In addition, we categorize existing solutions into two groups, classic machine learning and deep learning solutions, and further divide each group based on the techniques utilized. Furthermore, we introduce and compare existing datasets for multi-modal misinformation detection and identify some of their weaknesses. By classifying the field's shortcomings into data-, feature-, and model-based categories, we highlight the most prominent problems in multi-modal fake news detection. Finally, we propose new lines of research to address these shortcomings.

Footnotes

1. “Misinformation” is false information that spreads unintentionally, whereas the term “Disinformation” refers to false information that malicious users share intentionally and often strategically to affect other audiences’ behaviors toward social, political, and economic events. In this work, regardless of spreaders’ intention, we refer to all sorts of false news, i.e., misinformation and disinformation, as “Misinformation” or “Fake News” interchangeably.
5. The term “Junk Science” refers to inaccurate information about scientific facts that is used to skew opinions or push a hidden agenda.
6. Refers to biased information that is often generated to promote a political point of view. Propaganda ranges from completely false information to subtle manipulation.
7. Refers to rejecting a widely accepted explanation for an event and offering a secret plot instead.
15. The table is based on the work in Reference [62].

References

[1]
Sara Abdali, Rutuja Gurav, Siddharth Menon, Daniel Fonseca, Negin Entezari, Neil Shah, and Evangelos E. Papalexakis. 2021. Identifying misinformation from website screenshots. Proc. Int. AAAI Conf. Web Soc. Media 15, 1 (May 2021), 2–13. Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/18036
[2]
Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2020. HiJoD: Semi-supervised multi-aspect detection of misinformation using hierarchical joint decomposition. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD’20).
[3]
Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2021. KNH: Multi-view modeling with k-nearest hyperplanes graph for misinformation detection. CoRR abs/2102.07857 (2021).
[4]
Sara Abdali, M. Alex O. Vasilescu, and Evangelos E. Papalexakis. 2021. Deepfake Representation with Multilinear Regression. arxiv:2108.06702 [cs.CV]
[5]
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. 2017. VQA: Visual question answering. Int. J. Comput. Vision 123, 1 (May 2017), 4–31. DOI:
[6]
Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimiter Dimitrov, Giovanni Martino, Shaden Shaar, Hamed Firooz, and Preslav Nakov. 2021. A Survey on Multimodal Disinformation Detection. In Proceedings of the 29th International Conference on Computational Linguistics. Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (Eds.), International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 6625–6643. https://aclanthology.org/2022.coling-1.576
[7]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2024. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1723, 21 pages.
[8]
Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and M. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: A survey. Multim. Syst. 16 (2010), 345–379.
[9]
Adrien Benamira, Benjamin Devillers, Etienne Lesot, Ayush K. Ray, Manal Saadi, and Fragkiskos D. Malliaros. 2019. Semi-supervised learning and graph neural networks for fake news detection. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’19). Association for Computing Machinery, New York, NY, USA. DOI:
[10]
Giscard Biamby, Grace Luo, Trevor Darrell, and Anna Rohrbach. 2022. Twitter-COMMs: Detecting climate, COVID, and military multimodal misinformation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1530–1549.
[11]
Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, and Yiannis Kompatsiaris. 2018. Detection and visualization of misleading content on Twitter. Int. J. Multim. Inf. Retr. 7, 1 (2018), 71–86. DOI:
[12]
Said Boulahia, Abdenour Amamra, Mohamed Madi, and Said Daikh. 2021. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Applic. 32 (Nov. 2021). DOI:
[13]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[14]
Yegui Cai, George O. M. Yee, Yuan Xiang Gu, and Chung-Horng Lung. 2020. Threats to online advertising and countermeasures: A technical survey. Digit. Threats: Res. Pract. 1, 2, Article 11 (May 2020), 27 pages. DOI:
[15]
Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, and Jing Jiang. 2023. Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection. In Proceedings of the 31st ACM International Conference on Multimedia. 5244–5252.
[16]
Rui Cao, Roy Ka-Wei Lee, Wen-Haw Chong, and Jing Jiang. 2023. Prompting for multimodal hateful meme classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Retrieved from https://api.semanticscholar.org/CorpusID:256461095
[17]
Mingxuan Chen, Xinqiao Chu, and K. P. Subbalakshmi. 2021. MMCoVaR: Multimodal COVID-19 vaccine focused data repository for fake news detection and a baseline architecture for classification. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’21). Association for Computing Machinery, New York, NY, USA, 31–38. DOI:
[18]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325 (2015).
[19]
Hyewon Choi and Youngjoong Ko. 2022. Effective fake news video detection using domain knowledge and multimodal data fusion on YouTube. Pattern Recog. Lett. 154 (2022), 44–52. DOI:
[20]
Anshika Choudhary and Anuja Arora. 2021. ImageFake: An ensemble convolution models driven approach for image based fake news detection. In Proceedings of the 7th International Conference on Signal Processing and Communication (ICSC’21). 182–187. DOI:
[21]
Jian Cui, Lin Li, Xin Zhang, and Jingling Yuan. 2023. Multimodal propaganda detection via anti-persuasion prompt enhanced contrastive learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). 1–5. DOI:
[22]
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 Healthcare Misinformation Dataset. arxiv:2006.00885 [cs.SI]
[23]
Limeng Cui, Haeseung Seo, Maryam Tabar, Fenglong Ma, Suhang Wang, and Dongwon Lee. 2020. DETERRENT: Knowledge guided graph attention network for detecting healthcare misinformation. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’20). Association for Computing Machinery, New York, NY, USA. DOI:
[24]
Fernando Cardoso Durier da Silva, Rafael Vieira, and Ana Cristina Bicharra Garcia. 2019. Can machines learn to detect fake news? A survey focused on social media. In Proceedings of the Hawaii International Conference on System Sciences (HICSS’19).
[25]
Dipto Das. 2019. A Multimodal Approach to Sarcasm Detection on Social Media. Master’s thesis. MSU Graduate Theses.
[26]
Mudit Dhawan, Shakshi Sharma, Aditya Kadam, Rajesh Sharma, and Ponnurangam Kumaraguru. 2022. GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection. Social Network Analysis and Mining 14 (2022), 1–13. https://api.semanticscholar.org/CorpusID:247155148
[27]
Yuan Dong, Shan Gao, Kun Tao, Jiqing Liu, and Haila Wang. 2014. Performance evaluation of early and late fusion methods for generic semantics indexing. Pattern Anal. Appl. 17, 1 (Feb. 2014), 37–50. DOI:
[28]
Ignazio Gallo, Gianmarco Ria, Nicola Landro, and Riccardo La Grassa. 2020. Image and text fusion for UPMC Food-101 using BERT and CNNs. In Proceedings of the 35th International Conference on Image and Vision Computing New Zealand (IVCNZ’20). 1–6. DOI:
[29]
Wang Gao, Mingyuan Ni, Hongtao Deng, Xun Zhu, Peng Zeng, and Xi Hu. 2023. Few-shot fake news detection via prompt-based tuning. J. Intell. Fuzzy Syst. 44 (2023), 9933–9942. https://api.semanticscholar.org/CorpusID:257966771
[30]
Anastasia Giachanou, Guobiao Zhang, and Paolo Rosso. 2020. Multimodal multi-image fake news detection. In Proceedings of the IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA’20). 647–654. DOI:
[31]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 6325–6334.
[32]
Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E. Papalexakis. 2018. Semi-supervised content-based detection of misinformation via tensor embeddings. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’18). 322–325. DOI:
[33]
Danna Gurari, Qing Li, Abigale Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3608–3617. https://api.semanticscholar.org/CorpusID:3831582
[34]
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. 2020. Captioning Images Taken by People Who Are Blind. In Computer Vision — ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII (Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 417–434.
[35]
Saqib Hakak, Mamoun Alazab, Suleman Khan, Thippa Reddy Gadekallu, Praveen Kumar Reddy Maddikunta, and Wazir Zada Khan. 2021. An ensemble machine learning approach through effective feature extraction to classify fake news. Fut. Gen. Comput. Syst. 117 (2021), 47–58. DOI:
[36]
Stefan Helmstetter and Heiko Paulheim. 2018. Weakly supervised learning for fake news detection on Twitter. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’18). 274–277. DOI:
[37]
Benjamin Horne and Sibel Adali. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11. 759–766.
[38]
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 6693–6702. https://api.semanticscholar.org/CorpusID:152282269
[39]
Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. 2018. Fighting fake news: Image splice detection via learned self-consistency. In Computer Vision — ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XI (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 106–124.
[40]
Md Rafiqul Islam, Shaowu Liu, Wang Xianzhi, and Guandong Xu. 2020. Deep learning for misinformation detection on online social networks: A survey and new perspectives. Soc. Netw. Anal. Min. 10 (Dec. 2020). DOI:
[41]
Vidit Jain, Rohit Kaliyar, Anurag Goswami, Pratik Narang, and Yashvardhan Sharma. 2022. AENeT: An attention-enabled neural architecture for fake news detection using contextual features. Neural Comput. Applic. 34 (Jan. 2022). DOI:
[42]
Ramji Jaiswal, Upendra Pratap Singh, and Krishna Pratap Singh. 2021. Fake news detection using BERT-VGG19 multimodal variational autoencoder. In Proceedings of the IEEE 8th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON’21). 1–5. DOI:
[43]
Gongyao Jiang, Shuang Liu, Yu Zhao, Yueheng Sun, and Meishan Zhang. 2022. Fake news detection via knowledgeable prompt learning. Inf. Process. Manag. 59, 5 (2022), 103029.
[44]
Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, and Diana Maynard. 2023. Similarity-aware multimodal prompt learning for fake news detection. ArXiv abs/2304.04187 (2023).
[45]
Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia (MM’17). Association for Computing Machinery, New York, NY, USA, 795–816. DOI:
[46]
Sarthak Jindal, Raghav Sood, Richa Singh, Mayank Vatsa, and Tanmoy Chakraborty. 2020. NewsBag: A benchmark multimodal dataset for fake news detection. In SafeAI@AAAI. https://api.semanticscholar.org/CorpusID:213179339
[47]
Quanliang Jing, Di Yao, Xinxin Fan, Baoli Wang, Haining Tan, Xiangpeng Bu, and Jingping Bi. 2021. TRANSFAKE: Multi-task transformer for multimodal enhanced fake news detection. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’21). 1–8. DOI:
[48]
K. Shu, A. Sliva, S. Wang, and H. Liu. 2019. Beyond news contents: The role of social context for fake news detection. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). 312–320.
[49]
Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. 2019. MVAE: Multimodal variational autoencoder for fake news detection. In Proceedings of the World Wide Web Conference (WWW’19). Association for Computing Machinery, New York, NY, USA, 2915–2921. DOI:
[50]
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advan. Neural Inf. Process. Syst. 33 (2020), 2611–2624.
[51]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123, 1 (May 2017), 32–73. DOI:
[52]
Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. arXiv preprint arXiv:1804.08559 (2018).
[53]
Rina Kumari and Asif Ekbal. 2021. AMFB: Attention based multimodal factorized bilinear pooling for multimodal fake news detection. Expert Syst. Applic. 184 (2021), 115412. DOI:
[54]
Dana Lahat, T. Adalı, and Christian Jutten. 2015. Multimodal data fusion: An overview of methods, challenges, and prospects. Proc. IEEE 103 (2015), 1449–1477.
[55]
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3045–3059. DOI:
[56]
Lily Li, Or Levi, Pedram Hosseini, and David Broniatowski. 2020. A multi-modal method for satire detection using textual and visual cues. In Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda. 33–38.
[57]
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation. arxiv:2011.04088 [cs.SI]
[58]
Hongzhan Lin, Pengyao Yi, Jing Ma, Haiyun Jiang, Ziyang Luo, Shuming Shi, and Ruifang Liu. 2023. Zero-shot rumor detection with propagation structure via prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 5213–5221.
[59]
Yi-Ju Lu and Cheng-Te Li. 2020. GCAN: Graph-aware co-attention networks for explainable fake news detection on social media. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 505–514. DOI:
[60]
Héctor P. Martínez and Georgios N. Yannakakis. 2014. Deep multimodal fusion: Combining discrete events and continuous signals. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI’14). Association for Computing Machinery, New York, NY, USA, 34–41. DOI:
[61]
Nicola Messina, Fabrizio Falchi, Claudio Gennaro, and Giuseppe Amato. 2021. AIMH at SemEval-2021 task 6: Multimodal classification using an ensemble of transformer models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval’21). Association for Computational Linguistics, 1020–1026. DOI:
[62]
Kai Nakamura, Sharon Levy, and William Yang Wang. 2020. Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 6149–6157. Retrieved from https://aclanthology.org/2020.lrec-1.755
[63]
Dan S. Nielsen and Ryan McConville. 2022. MuMiN: A large-scale multilingual multimodal fact-checked misinformation social network dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3141–3153.
[64]
OpenAI. 2023. GPT-4 Technical Report. arxiv:2303.08774 [cs.CL]
[65]
Peng Qi, Juan Cao, Xirong Li, Huan Liu, Qiang Sheng, Xiaoyue Mi, Qin He, Yongbiao Lv, Chenyang Guo, and Yingchao Yu. 2021. Improving Fake News Detection by Using an Entity-enhanced Framework to Fuse Diverse Multimodal Clues. Association for Computing Machinery, New York, NY, USA. DOI:
[66]
Peng Qi, Juan Cao, Tianyun Yang, Junbo Guo, and Jintao Li. 2019. Exploiting multi-domain visual information for fake news detection. In 2019 IEEE International Conference on Data Mining (ICDM). 518–527. DOI:
[67]
Shengsheng Qian, Jinguang Wang, Jun Hu, Quan Fang, and Changsheng Xu. 2021. Hierarchical Multi-modal Contextual Attention Network for Fake News Detection. Association for Computing Machinery, New York, NY, USA, 153–162. DOI:
[68]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML’21).
[69]
Chahat Raj and Priyanka Meel. 2022. ARCNN framework for multimodal infodemic detection. Neural Netw. 146 (2022), 36–68. DOI:
[70]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning (ICML’21). PMLR, 8821–8831.
[71]
Saed Rezayi, Saber Soleymani, Hamid R. Arabnia, and Sheng Li. 2021. Socially aware multimodal deep neural networks for fake news classification. In Proceedings of the IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR’21). 253–259. DOI:
[72]
Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to detect manipulated facial images. CoRR abs/1901.08971 (2019).
[73]
Tanmay Sachan, Nikhil Pinnaparaju, Manish Gupta, and Vasudeva Varma. 2021. SCATE: Shared cross attention transformer encoders for multimodal fake news detection. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’21). Association for Computing Machinery, New York, NY, USA, 399–406. DOI:
[74]
Isabel Segura-Bedmar and Santiago Alonso-Bartolome. 2022. Multimodal fake news detection. Information 13, 6 (2022), 284.
[75]
Lanyu Shang, Ziyi Kou, Yang Zhang, and Dong Wang. 2021. A multimodal misinformation detector for COVID-19 short videos on TikTok. In 2021 IEEE International Conference on Big Data (Big Data). 899–908. DOI:
[76]
Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. DEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19). Association for Computing Machinery, New York, NY, USA, 395–405. DOI:
[77]
Kai Shu, Ahmed Hassan, Susan Dumais, and Huan Liu. 2020. Detecting fake news with weak social supervision. IEEE Intell. Syst. PP (May 2020), 1–1. DOI:
[78]
Kai Shu, Deepak Mahudeswaran, Suhang Wang, and Huan Liu. 2020. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14. 626–637.
[79]
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newslett. 19 (2017).
[80]
Kai Shu, Suhang Wang, and Huan Liu. 2018. Understanding user profiles on social media for fake news detection. In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR’18). 430–435. DOI:
[81]
Kai Shu, Guoqing Zheng, Yichuan Li, Subhabrata Mukherjee, Ahmed Hassan Awadallah, Scott Ruston, and Huan Liu. 2020. Early detection of fake news with multi-source weak social supervision. Springer-Verlag, Berlin, Heidelberg, 650–666.
[82]
Kai Shu, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu. 2019. The role of user profiles for fake news detection. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining (ASONAM’19). Association for Computing Machinery, New York, NY, USA, 436–439. DOI:
[83]
Amila Silva, Yi Han, Ling Luo, Shanika Karunasekera, and Christopher Leckie. 2021. Propagation2Vec: Embedding partial propagation networks for explainable fake news early detection. Inf. Process. Manag. 58, 5 (2021), 102618. DOI:
[84]
Amila Silva, Ling Luo, Shanika Karunasekera, and Christopher Leckie. 2021. Embracing domain differences in fake news: Cross-domain fake news detection using multi-modal data. Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), 557–565.
[85]
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models that Can Read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 8309–8318.
[86]
Shivangi Singhal, Mudit Dhawan, Rajiv Ratn Shah, and Ponnurangam Kumaraguru. 2021. Inter-modality discordance for multimodal fake news detection. In Proceedings of the ACM Multimedia Asia Conference (MMAsia’21). Association for Computing Machinery, New York, NY, USA, Article 33, 7 pages. DOI:
[87]
Shivangi Singhal, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin’ichi Satoh. 2019. SpotFake: A multi-modal framework for fake news detection. In Proceedings of the IEEE 5th International Conference on Multimedia Big Data (BigMM’19). 39–47. DOI:
[88]
Chenguang Song, Nianwen Ning, Yunlei Zhang, and Bin Wu. 2021. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Inf. Process. Manag. 58 (2021), 102437.
[89]
Chenguang Song, Kai Shu, and Bin Wu. 2021. Temporally evolving graph neural network for fake news detection. Inf. Process. Manag. 58, 6 (2021), 102712. DOI:
[90]
Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, and Ziwei Liu. 2023. Link-context learning for multimodal LLMs. ArXiv abs/2308.07891 (2023).
[91]
Lin Tian, Xiuzhen Zhang, and Jey Han Lau. 2023. MetaTroll: Few-shot detection of state-sponsored trolls with transformer adapters. In Proceedings of the ACM Web Conference. 1743–1753.
[92]
Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. Deepfakes and beyond: A survey of face manipulation and fake detection. Inf. Fusion 64 (2020), 131–148. DOI:
[93]
Inna Vogel and Meghana Meghana. 2020. Detecting fake news spreaders on Twitter from a multilingual perspective. In Proceedings of the IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA’20). 599–606.
[94]
Jingzi Wang, Hongyan Mao, and Hongwei Li. 2022. FMFN: Fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 12, 3 (2022). DOI:
[95]
Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frédéric Precioso. 2015. Recipe recognition with large multimodal food dataset. In Proceedings of the IEEE International Conference on Multimedia Expo Workshops (ICMEW’15). 1–6. DOI:
[96]
Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’18). Association for Computing Machinery, New York, NY, USA, 9 pages. DOI:
[97]
Yaqing Wang, Fenglong Ma, Haoyu Wang, Kishlay Jha, and Jing Gao. 2021. Multimodal emergent fake news detection via meta neural process networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[98]
Youze Wang, Shengsheng Qian, Jun Hu, Quan Fang, and Changsheng Xu. 2020. Fake News Detection via Knowledge-Driven Multimodal Graph Convolutional Networks. Association for Computing Machinery, New York, NY, USA, 540–547. DOI:
[99]
Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. N24News: A new dataset for multimodal news classification. In Proceedings of the 13th International Conference on Language Resources and Evaluation Conference (LREC’22). European Language Resources Association (ELRA), 6768–6775.
[100]
Zuhui Wang, Zhaozheng Yin, and Young Anna Argyris. 2021. Detecting medical misinformation on social media using multimodal deep learning. IEEE J. Biomed. Health Inform. 25, 6 (2021), 2193–2203. DOI:
[101]
Liang Wu, Jundong Li, Xia Hu, and Huan Liu. 2017. Gleaning wisdom from the past: Early detection of emerging rumors in social media. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 99–107.
[102]
Liang Wu and Huan Liu. 2018. Tracing fake-news footprints: Characterizing social media messages by how they propagate. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining.
[103]
Junxiao Xue, Yabo Wang, Yichen Tian, Yafei Li, Lei Shi, and Lin Wei. 2021. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 58 (2021), 102610.
[104]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Ling. 2 (2014), 67–78. DOI:
[105]
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A new foundation model for computer vision. arXiv:2111.11432 [cs.CV] (2021). https://arxiv.org/abs/2111.11432
[106]
Zhenrui Yue, Huimin Zeng, Yang Zhang, Lanyu Shang, and Dong Wang. 2023. MetaAdapt: Domain adaptive few-shot misinformation detection via meta learning. arXiv preprint arXiv:2305.12692 (2023).
[107]
Jiangfeng Zeng, Yin Zhang, and Xiao Ma. 2020. Fake news detection for epidemic emergencies via deep correlations between text and images. Sustain. Cities Societ. 66 (Dec. 2020), 102652. DOI:
[108]
Honghao Zhou, Tinghuai Ma, Huan Rong, Yurong Qian, Yuan Tian, and Najla Al-Nabhan. 2022. MDMN: Multi-task and domain adaptation based multi-modal network for early rumor detection. Expert Syst. Applic. 195 (2022), 116517. DOI:
[109]
Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. 2020. ReCOVery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the Conference on Information and Knowledge Management (CIKM’20). Association for Computing Machinery, New York, NY, USA, 3205–3212. DOI:
[110]
Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-aware multi-modal fake news detection. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’20). Springer, 354–367.
[111]
Xinyi Zhou and Reza Zafarani. 2019. Network-based fake news detection: A pattern-driven approach. SIGKDD Explor. Newslett. 21, 2 (Nov. 2019), 48–60. DOI:
[112]
Zhi-Hua Zhou. 2017. A brief introduction to weakly supervised learning. Nat’l Sci. Rev. 5 (Aug. 2017). DOI:
[113]
Yuhui Zuo, Wei Zhu, and Guoyong Cai. 2022. Continually detection, rapidly react: Unseen rumors detection based on continual prompt-tuning. In Proceedings of the 29th International Conference on Computational Linguistics. 3029–3041.
