1 Introduction
Nowadays, billions of multi-modal posts containing text, images, videos, soundtracks, and other content are shared throughout the web, mainly via social media platforms such as Facebook, Twitter, Snapchat, Reddit, Instagram, and YouTube. While the combination of modalities allows for more expressive, detailed, and user-friendly content, it also brings new challenges, as uni-modal solutions are hard to adapt to multi-modal environments.
In recent years, however, the pervasive use of multi-modal platforms has led machine learning researchers to introduce many automated techniques for multi-modal tasks, such as Visual Question Answering (VQA) [5, 31, 33, 38, 85], image captioning [18, 34, 51, 104], and, more recently, fake news detection, including hate speech in multi-modal memes [30, 50, 65, 87].
As with other multi-modal tasks, detecting fake news on multi-modal platforms is more challenging, as it requires evaluating not only each modality on its own but also the cross-modal connections and the credibility of their combination. This becomes even harder when each modality, e.g., text or image, is credible on its own but the combination creates misinformative content. For instance, a COVID-19 anti-vaccination misinformation post can have text that reads “vaccines do this” and then attach a graphic image of a dead person. In this case, although the image and text are not individually misinformative, taken together they create misinformation.
Over the past decade, several detection models [14, 40, 79, 81] have been developed to detect misinformation. However, the majority of them leverage only a single modality, e.g., text [32, 37, 77, 101] or image [1, 20, 39, 66], and thus miss important information conveyed by the other modalities. There are existing works [2, 3, 35, 48, 76] that leverage ensemble methods, building a separate model for each modality and combining their outputs to improve results. However, in many cases of multi-modal misinformation, loosely combining individual modalities is inadequate for detecting fake news, leading to the failure of the joint model.
In response, machine learning scientists have developed techniques for cross-modal fake news detection, which combine information from multiple modalities and leverage cross-modal signals such as the consistency of, and meaningful relationships between, different modalities. Studying and analyzing these techniques and identifying existing challenges gives a clearer picture of the state of knowledge on multi-modal misinformation detection and opens the door to new opportunities in this field.
Even though there are a number of valuable surveys on fake news detection [24, 52, 79], very few of them focus on multi-modal techniques [6, 74]. Since the number of proposed techniques for multi-modal fake news detection has been growing rapidly, a comprehensive survey of existing techniques, datasets, and emerging challenges is needed more than ever. With that in mind, in this work we conduct a comprehensive study of fake news detection in multi-modal environments.
To this end, we classify multi-modal misinformation detection research into the following directions:
—
Multi-modal Data Study: In this direction, the goal is to collect multi-modal fake news data, e.g., images, text, and social context, from different sources of information and to use fact-checking resources to evaluate the veracity of the collected data and annotate it accordingly. Comparison and analysis of existing datasets, as well as benchmarking, are other tasks that fall under this category.
—
Multi-modal Feature Study: The primary goal of this study is to uncover significant links between various data modalities, which are frequently exploited by misinformation spreaders to distort, impersonate, or exaggerate original information. These meaningful connections may be used as clues for detecting misinformation in multi-modal environments such as social media posts. Another goal of this direction is to study and develop strategies for fusing features of different modalities into information-rich multi-modal features; a minimal fusion sketch follows this list.
—
Multi-modal Model Study: The main focus of this direction is the development of efficient multi-modal machine learning solutions that detect misinformation by leveraging multi-modal features and clues. Proposing new techniques and approaches, in addition to improving the performance, scalability, interpretability, and explainability of machine learning models, are some of the common tasks in this direction.
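To make the fusion step concrete, the following is a minimal sketch of early (concatenation-based) fusion in PyTorch; the encoder dimensions (768 for text, 2048 for images) and layer sizes are illustrative assumptions rather than the implementation of any surveyed method.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Minimal early-fusion baseline: project each modality's embedding
    into a shared space, concatenate, and classify the joint vector."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=512, n_classes=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([self.text_proj(text_emb),
                           self.image_proj(image_emb)], dim=-1)
        return self.classifier(fused)

# Stand-in encoder outputs (e.g., BERT [CLS] and ResNet pooled features
# in a real pipeline):
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))  # shape (4, 2)
```

Late fusion would instead train one classifier per modality and combine their predictions, while the cross-modal approaches discussed later replace plain concatenation with mechanisms such as attention over the two streams.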
These three studies form a sequential pipeline in the multi-modal misinformation field, where the output of each study serves as the input for the next. Figure 1 provides a summary of these directions. In this work, we aim to explore each direction in greater depth to identify the challenges and shortcomings of each study and propose new avenues for addressing them.
The rest of this survey is organized as follows: In Section 2, we discuss the multi-modal feature study by introducing some widely spread categories of misinformation in multi-modal settings and the cross-modal clues commonly used to detect them. In the following section, we discuss different fusion mechanisms for merging the modalities involved in such clues, and then explain the multi-modal model study by introducing solutions and categorizing them based on the machine learning techniques they utilize. In Section 4, we describe the multi-modal data study by introducing, analyzing, and comparing existing datasets for multi-modal fake news detection. In Section 5, we discuss the challenges and shortcomings that each direction faces. Finally, in Section 6, we propose new avenues to address these shortcomings and advance multi-modal misinformation detection research.
We conducted our literature search across multiple databases, including IEEE Xplore, the ACM Digital Library, and Google Scholar, using combinations of keywords related to our research focus. Papers were included if they were relevant to the research question, published within the past 10 years (to ensure timeliness), and peer-reviewed (to ensure quality). The selection process involved an initial screening of titles and abstracts, followed by a full-text review to confirm that each paper met these criteria. This methodical approach ensures that the included papers provide a diverse yet focused perspective on the subject, offering readers a succinct and informative summary of current knowledge in the field. We outline these steps to make the criteria and rationale behind our selection transparent.
4 Multi-modal Data Study
Data acquisition and preparation are the most important building blocks of a machine learning pipeline. Machine learning models leverage training data to continuously improve themselves over time. Thus, a sufficient amount of good-quality, and in most cases annotated, data is crucial for these models to operate effectively. With that said, in this section we introduce and compare some of the existing multi-modal datasets for the fake news detection task; later on, we discuss some of their limitations.
Image-Verification-Corpus is an evolving dataset containing 17,806 fake and real posts with images shared on Twitter. It was created as an open corpus of tweets containing images that can be used both to assess online image verification approaches (based on tweet texts and user features) and to build classifiers for new content. Fake and real images in this dataset were annotated by online sources that evaluate the credibility of the images and the events they are associated with [11].
Fakeddit is a dataset collected from Reddit, a social news and discussion website where users can post submissions on various subreddits.
Fakeddit consists of over 1 million submissions from 22 different subreddits spanning over a decade, with the earliest submission dating from 3/19/2008 and the most recent from 10/24/2019. These submissions were posted to highly active and popular subreddits by over 300,000 users.
Fakeddit provides submission titles, images, user comments, and submission metadata, including score, author username, subreddit source, sourced domain, number of comments, and up-vote to down-vote ratio. Approximately 64% of the samples have both text and image data [62]. Samples are annotated with 2-way, 3-way, and 6-way labels, the 6-way scheme comprising true, satire/parody, misleading content, manipulated content, false connection, and imposter content. Examples of 6-way labels are shown in Figure 6. Additionally, Table 2 compares the performance of various methods on the Fakeddit dataset [62].
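As an illustration of working with Fakeddit's multi-level labels, the following is a minimal sketch that loads a release and selects a label granularity; the file name, the column names (`6_way_label`, etc.), and the label-to-index order are assumptions based on the public TSV release and should be verified against the dataset documentation.

```python
import pandas as pd

# Hypothetical path to a Fakeddit TSV release; adjust to your copy.
df = pd.read_csv("fakeddit_train.tsv", sep="\t")

# Assumed index-to-name mapping for the 6-way scheme (order is an
# assumption; verify before use).
SIX_WAY = {0: "true", 1: "satire/parody", 2: "misleading content",
           3: "manipulated content", 4: "false connection",
           5: "imposter content"}

# Pick the task granularity: "2_way_label", "3_way_label", or
# "6_way_label" are the assumed column names.
labels = df["6_way_label"].map(SIX_WAY)
print(labels.value_counts())
```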
NewsBag comprises 200,000 real and 15,000 fake news articles. The real training articles were collected from the Wall Street Journal and the fake ones from The Onion, a website that publishes satirical content. The test samples, however, were collected from different websites, namely TheRealNews and ThePoke. The rationale behind using different news sources for the training and test sets is to observe how well models generalize to unseen data samples. The NewsBag dataset is highly imbalanced; to tackle this issue, NewsBag++, an augmented training version of NewsBag containing 200,000 real and 389,000 fake news articles, was also released. Another weakness of NewsBag is that it lacks social context information such as spreader information, sharing trends, and reactions such as user comments and engagements [46].
MM-COVID is a multi-lingual and multi-dimensional COVID-19 fake news data repository. It comprises 3,981 items of fake news and 7,192 items of trustworthy information in six languages: English, Spanish, Portuguese, Hindi, French, and Italian. MM-COVID provides visual, textual, and social context information, e.g., user and network information [57]. The dataset is annotated via Snopes and Poynter, where experts and journalists evaluate, fact-check, and label news content as either fake or real. While Snopes is an independent publication that mainly covers English content, Poynter hosts the International Fact-Checking Network (IFCN), which unites 96 fact-checking agencies, such as PolitiFact, across 40 languages.
ReCOVery contains 2,029 news articles that have been shared on social media, most of which (2,017 samples) have both textual and visual information for multi-modal studies. ReCOVery is imbalanced in news class: the ratio of real to fake articles is roughly 2:1. The combined count of users spreading real news (78,659) and users sharing fake articles (17,323) exceeds the total number of distinct users in the dataset (93,761), reflecting the assumption that users can engage in spreading both real and fake news articles. Samples are annotated by two fact-checking resources: NewsGuard and Media Bias/Fact Check (MBFC), a website that evaluates the factual accuracy and political bias of news media. MBFC assigns each news outlet one of six factual-accuracy levels based on fact-checking results for its previously published articles. Samples of ReCOVery are collected from 60 news domains, of which 22 are sources of reliable news articles (e.g., National Public Radio and Reuters) and the remaining 38 are sources of unreliable news articles (e.g., Humans Are Free and Natural News) [109].
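Because ReCOVery's roughly 2:1 real-to-fake imbalance can bias a naive classifier toward the majority class, a common mitigation is a class-weighted loss. The following is a minimal PyTorch sketch of that idea; the per-class counts are approximations inferred from the reported totals, not values used by any surveyed model.

```python
import torch
import torch.nn as nn

# Approximate ReCOVery class counts (assumption): 1,364 real, 665 fake.
n_real, n_fake = 1364, 665
total = n_real + n_fake

# Inverse-frequency weights: N / (n_classes * n_class_samples).
weights = torch.tensor([total / (2 * n_real), total / (2 * n_fake)])
criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch, 2) model outputs; targets: 0 = real, 1 = fake.
logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
```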
CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset that includes fake news from websites and social platforms, along with users' social engagement with that news. It contains 5,216 news articles, 296,752 related user engagements, 926 social platform posts about COVID-19, and ground truth labels. The publishing dates of the collected information range from December 1, 2019, to September 1, 2020. In total, 204 fake news articles, 3,565 true news articles, 28 fake claims, and 454 true claims were collected. Real news articles were crawled from nine media outlets that have been cross-checked as reliable, e.g., the National Institutes of Health (NIH) and the CDC. Fake news was retrieved from several fact-checking websites, such as PolitiFact and Health Feedback [22].
MMCoVaR, the Multi-modal COVID-19 Vaccine Focused Data Repository, contains articles annotated using two news website source-checking methods and tweets fact-checked via a stance detection approach. MMCoVaR comprises 2,593 articles issued by 80 publishers and shared between 02/16/2020 and 05/08/2021, and 24,184 Twitter posts collected between 04/17/2021 and 05/08/2021. Samples are annotated by the Media Bias Chart and Media Bias/Fact Check (MBFC) and classified into two levels of credibility, reliable and unreliable. Accordingly, articles are labeled as either credible or unreliable, and tweets are annotated as reliable, inconclusive, or unreliable [17]. Notably, textual, visual, and social context information is available for the news articles.
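To give a flavor of how stance-based tweet annotation can work, the following is a hypothetical sketch, not MMCoVaR's actual pipeline: a tweet inherits the credibility label of the article it discusses, moderated by its stance toward that article. The function name and the rule itself are illustrative assumptions.

```python
# Hypothetical stance-based labeling rule, loosely inspired by the
# approach described for MMCoVaR; not the dataset's actual pipeline.

def label_tweet(article_label: str, stance: str) -> str:
    """article_label: 'reliable' or 'unreliable' (from source checking).
    stance: 'agree', 'disagree', or 'neutral' toward the article."""
    if stance == "neutral":
        return "inconclusive"
    if stance == "agree":
        return article_label  # tweet endorses the article's claim
    # Disagreement flips the inherited credibility.
    return "unreliable" if article_label == "reliable" else "reliable"

print(label_tweet("reliable", "disagree"))  # -> unreliable
```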
N24News is a multi-modal dataset extracted from New York Times articles published from 2010 to 2020. Each news article belongs to one of 24 categories, e.g., science or arts. The dataset comprises up to 3,000 samples of real news per category, for a total of 60,000 news articles. Each article sample contains a category tag, headline, abstract, article body, image, and corresponding image caption. The dataset is randomly split into training/validation/testing sets in the ratio 8:1:1 [99]. The main weakness of this dataset is that it contains no fake samples, and all of the real samples are collected from a single source, i.e., the New York Times.
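As a reproducibility note, an 8:1:1 random split like the one N24News reports can be produced with two chained splits; this is a generic sketch, not the dataset authors' released code.

```python
from sklearn.model_selection import train_test_split

def split_8_1_1(samples, seed=42):
    """Generic 8:1:1 train/val/test split. Stratifying by the 24
    category tags could be added via the `stratify` argument."""
    train, rest = train_test_split(samples, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test

train, val, test = split_8_1_1(list(range(60_000)))
print(len(train), len(val), len(test))  # 48000 6000 6000
```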
MuMiN, the Large-Scale Multilingual Multi-modal Fact-Checked Misinformation Social Network Dataset, comprises 21 million tweets belonging to 26K Twitter threads, which have been semantically linked to 13K fact-checked claims in 41 different languages.
MuMiN is available in three versions, large, medium, and small, with the largest consisting of 10,920 articles and 6,573 images. In this dataset, a claim deemed “mostly true” is labeled as factual. A claim deemed “half true” or “half false” is labeled as misinformation, with the justification that a statement containing a significant share of false information should be considered misleading content. When there is no clear verdict, the claim is labeled as other [63].
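The verdict-to-label mapping just described can be written down directly; the following minimal sketch encodes that rule as we read it, with the normalized verdict strings being assumptions about how fact-checker ratings are preprocessed.

```python
# Minimal encoding of MuMiN's reported labeling rule; the exact
# normalized verdict strings are assumptions.

def mumin_label(verdict: str) -> str:
    if verdict == "mostly true":
        return "factual"
    if verdict in {"half true", "half false"}:
        # A significant share of false content is treated as misleading.
        return "misinformation"
    return "other"  # no clear verdict

for v in ["mostly true", "half false", "unproven"]:
    print(v, "->", mumin_label(v))
```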
A summary and side-by-side comparison of the previously mentioned datasets are shown in Table 3. As illustrated in Figure 7, most of these datasets are small, annotated with binary labels, sourced from a limited set of platforms such as Twitter, and contain only a few modalities, namely text and image.