Fake News Detection Using Machine Learning Report PDF
A PROJECT REPORT
BACHELOR OF TECHNOLOGY
In
Alwar
May 2024
BIKANER TECHNICAL UNIVERSITY, RAJASTHAN
CERTIFICATE
Certified that this project report “FAKE NEWS DETECTION” is the original work of
“GAURAV KHANDELWAL, MOHIT BHATIA” students of B. Tech. Final Year VIII
Semester (Artificial Intelligence & Data Science Branch) who carried out the project work
under my supervision.
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
Firstly, we would like to express our gratitude to our advisor for the beneficial
comments and remarks. We express our sincere thanks to Prof. S. K. Sharma
(Director) of the Modern Institute of Technology & Research Centre, Alwar. We pay
our deep sense of gratitude to Prof. J.R Arun Kumar, Head of the Computer Science
& Engineering Department of Modern Institute of Technology & Research Centre,
Alwar for encouraging us to the highest peak and providing us the opportunity to present
the Project.
TABLE OF CONTENTS
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
Introduction
Problem Statement
Objectives
Methodology
Dataset
Flowchart
Algorithm
System Configuration
Data Pre-processing
Design of Project
Sample Code
CONCLUSIONS
Conclusions
Future Scope
REFERENCES
LIST OF FIGURES
Fig. 3: Flowchart
Fig. 8: Fake.csv
Fig. 9: True.csv
Fig. 13: Removing last 10 rows from both datasets for manual testing
Fig. 16: Merging the main fake and true data frame
Fig. 23: Train-Test Split
ABSTRACT
Fake news has become one of the major problems in today's society. It has high potential
to change opinions and distort facts, and it can be a dangerous weapon for influencing
society.
The proposed project uses NLP techniques to detect 'fake news', that is, misleading news
stories that come from non-reputable sources. Fake news can be detected by building
supervised classification models such as Support Vector Machine, Logistic Regression,
Decision Tree, Gradient Boosting and Random Forest classifiers. The data science
community has responded by taking action against the problem. It is difficult to
determine with certainty whether a news item is real or fake. The proposed project
therefore vectorizes the datasets with the count vectorizer method and tests the accuracy
of fake news detection using machine learning algorithms.
In this research, we concentrate on how to spot fake news in online news sources. Our
contribution is twofold. First, we use multiple datasets of real and fake news to
determine the proportion of news that is correctly identified as fake. We provide a
thorough description of how the data were selected, justified, and validated, as well as
a few exploratory analyses of the linguistic differences between fake and legitimate news
content. Second, we run a set of learning experiments to build accurate fake news
detectors. Additionally, we compare the automatic and manual identification of fake news.
The detection of fake news posted on social media is implemented in Python.
CHAPTER-1
INTRODUCTION
1.1) Introduction
Machine learning (ML) is the study of the statistical models and algorithms that computers
use to perform specific tasks without explicit instructions, relying on patterns and
inference instead. It is viewed as a subset of artificial intelligence. Machine learning
algorithms build a mathematical model from sample data, known as "training data," in
order to make predictions or decisions without being explicitly programmed to do so.
Machine learning is closely related to computational statistics, which focuses on making
predictions using computers. The study of mathematical optimisation contributes methods,
theory, and application domains to machine learning.
The word "deep" in "deep learning" refers to the number of layers through which the data
is transformed. More precisely, deep learning systems have a substantial credit
assignment path (CAP) depth. The CAP is the chain of transformations from input to
output, and CAPs describe potentially causal connections between input and output. For a
feed-forward neural network, the CAP depth equals the number of hidden layers plus one,
since the output layer is also parameterized. In recurrent neural networks, where a
signal may pass through a layer more than once, the CAP depth is potentially unlimited.
Fig. 1: Deep Learning vs Machine Learning vs Artificial Intelligence
Natural language processing (NLP) is a branch of linguistics, computer science,
information engineering, and artificial intelligence concerned with the interactions
between computers and human (natural) languages. Its main goal is to program computers to
process and analyse large amounts of natural language data efficiently.
With the rising use of social media platforms, fake news has become a severe problem in
recent years. Detecting fake news is difficult and requires several computational
techniques, such as data mining, machine learning, and natural language processing. This
report discusses the current state of fake news detection, along with its challenges and
potential solutions. Finally, it considers how cutting-edge technologies such as
blockchain and artificial intelligence may be used in the future to improve the
efficiency and precision of fake news detection.
As a result, the need for accurate and reliable techniques to identify fake news is
greater than ever. The field of fake news detection has evolved rapidly, with researchers
and engineers developing a number of techniques to identify and combat misleading
information. These range from manual fact-checking by trained professionals to automated
systems that use machine learning to examine and classify news content.
Research on fake news detection is important, but the problem is challenging and complex.
Recognising fake news requires an understanding of linguistic nuance, social and cultural
context, and the complex network dynamics of online communication. Despite these
challenges, progress has been made towards effective methods for spotting fake news, and
the area continues to develop as new tools and technologies are created.
Consuming news online has both benefits and drawbacks. On the one hand, news is actively
sought out and consumed because it is easily available, inexpensive, and quickly spread.
On the other hand, this makes it possible for "fake news," that is, low-quality news with
intentionally false content, to be widely disseminated.
As a result, research into fake news detection has recently made significant strides.
Identifying fake news from its content alone is challenging and nontrivial, because it is
purposefully written to lead people to accept incorrect information.
1.3) Objective
Our project's primary goal is to determine whether a news item is real or fake by
developing a machine learning model that can recognise false information.
Identifying fake news from its content alone is difficult, since it is intentionally
written to make readers believe false information.
Machine learning offers a range of methods and models that make the detection of fake
news feasible. Additionally, we will apply deep learning-based NLP to examine the
relationship between words.
1.4) Methodology
1.4.1) Dataset
Two datasets are available, and we use a mix of the two. The combined CSV data contain
44898 news stories in total, which is a sizable quantity: the fake dataset has 23481
stories and the true dataset has 21417. This data collection is accessible at:
The following elements are included in a news article:
▪ Id: unique ID of the news article.
▪ Title.
▪ Text.
▪ Subject: describes the topic of the news.
▪ Date: the publication date of the news.
▪ A label indicating that the information might not be trustworthy.
As shown, the dataset is fairly balanced: it contains 21417 true news items and 23481
fake ones. This is a beneficial property of the dataset, since it helps the models make
unbiased judgements.
Fig. 2: Comparison of Fake and Real news
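The class balance described above can be checked with a few lines of arithmetic. The sketch below only uses the two article counts reported in this section; the variable names are illustrative and not taken from the project code.

```python
# Article counts as reported in the Dataset section.
fake_count = 23481
true_count = 21417

total = fake_count + true_count
fake_share = fake_count / total
true_share = true_count / total

print(f"total articles: {total}")
print(f"fake share: {fake_share:.1%}, true share: {true_share:.1%}")
```

A split of roughly 52% to 48% is close enough to even that plain accuracy remains a meaningful first metric.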
Stop words were still present in the dataset at this stage of processing, so the most
common words are "the," "to," "of," "and," etc. The top 20 terms in each file before stop
words were eliminated were as follows:
Fake.csv
True.csv
After stop words are removed, the most frequent terms are "said," "mr," "trump," "new,"
"people," and "year," which can give the models important information.
We also examined the bigrams in the dataset to better understand the subjects of the news
stories. Before stop words are removed, the topics of the news stories are not at all
clear; removing stop words makes the themes easier to see.
The graph below displays the top 20 bigrams in the dataset before stop words are removed.
As one can see, frequent phrases like "of the," "in the," and "to the" do not help one
understand the content of a story.
Graph 3: Frequent bigrams
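A bigram count of this kind can be sketched with the standard library alone. This is an illustrative sketch, not the code used to produce the graph: the function name `top_bigrams` and the two sample sentences are made up for the example.

```python
from collections import Counter

def top_bigrams(texts, n=3):
    """Return the n most frequent adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # zip(words, words[1:]) pairs every word with its successor.
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)

# Hypothetical sample sentences, stop words still included.
sample_texts = [
    "the president of the country spoke to the press",
    "the leader of the party went to the capital",
]
print(top_bigrams(sample_texts, 2))
```

On these samples the most frequent pairs are stop-word bigrams, which mirrors the observation that such pairs say little about the topic.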
1.4.2) Flowchart:
Fig. 3: Flowchart
1.4.3) Algorithm for The Proposed System
Step 1: Pre-processing
• Load the dataset of news items with their labels, whether they are true or false;
• Clean the text by eliminating punctuation and stopwords;
• Divide the dataset into training and testing sets.
▪ Determine each model's accuracy score using the actual and predicted labels.
Step 6: Accuracy
▪ Determine each model's accuracy by comparing its predicted labels to the actual labels.
▪ Accuracy measures the proportion of news stories that were correctly identified as
being true or false.
▪ Compare the accuracy of the various models to find which one is most effective at
spotting fake news.
Provide tools that let users submit their own content for classification and that display
the key terms and phrases used to classify news items, among other capabilities.
CHAPTER-2
LITERATURE SURVEY
The paper by A. S. A. Ahmed, A. Abidin, M. A. Maarof, and R. A. Rashid [1] is a survey
and does not contain any experiments or findings of its own. Instead, it offers a
thorough analysis of the many fake news detection techniques proposed in the literature,
their advantages and disadvantages, and the datasets employed for testing. The authors
examine and contrast the methodologies used in various studies in terms of feature
selection, feature extraction, classification algorithms, and evaluation measures. They
also highlight the difficulties in fake news identification and potential avenues for
further study. The article covers a number of datasets, including those from BuzzFeed,
LIAR, FakeNewsNet, and PolitiFact.
S. Asghar, S. Mahmood, and H. Kamran, "Fake news detection using machine learning: A
survey" [2]: This article also addresses a number of datasets that have been used in
studies on the identification of fake news, including the LIAR dataset, the Fake News
Challenge dataset, and the BuzzFeed News dataset. According to the authors, ensemble
learning-based algorithms achieved the best results on the LIAR dataset, with accuracy
rates of up to 78%. On the BuzzFeed News dataset, on the other hand, deep learning-based
methods perform better, achieving an accuracy of up to 91%.
J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble learning with
context and attention mechanism" [3]: For their experiments, the authors employ two
datasets: the Celebrity dataset and the LIAR dataset. To capture both local and global
aspects of news items, the proposed model combines convolutional neural networks (CNNs)
with recurrent neural networks (RNNs). The experimental findings show that the proposed
model outperforms numerous baseline models, reaching an accuracy of up to 73.7% and
state-of-the-art performance on both the LIAR and Celebrity datasets.
either labelled as true or false and also include extra labels for the degree of falsehood.
The suggested hybrid method uses the Support Vector Machine (SVM), Multinomial Naive
Bayes (MNB), and Random Forest (RF) machine learning techniques. To choose the most
relevant features for each algorithm, the authors employ the Chi-Square feature selection
technique. They then combine the results of the three algorithms and arrive at a final
prediction using a weighted voting system. According to the experimental findings, the
hybrid technique works better than each individual algorithm and a number of baseline
models, obtaining an accuracy of up to 72.28%.
CHAPTER-3
SYSTEM DEVELOPMENT
3.1) System Configuration
This project runs on standard hardware. We used an Intel i5 CPU with 8 GB of RAM, a 2 GB
Nvidia graphics processor, and 2 cores with frequencies of 1.7 GHz and 2.1 GHz,
respectively. The training phase lasts around 10-15 minutes; the test phase that follows
makes predictions and determines accuracy quickly.
Missing values in datasets can be a difficulty for some machine learning techniques.
Therefore, any missing values in each column of the input data must be found and replaced
before we model the prediction problem. This is done by data imputation: the null value
of each attribute is replaced with a space (' '). We use this approach instead of
removing tuples that contain null values.
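The imputation step can be sketched as follows. This is a minimal stdlib-only illustration in which rows are plain dictionaries and `None` stands for a null value; the function name `fill_missing` and the sample rows are assumptions made for the example, not part of the project code.

```python
def fill_missing(rows, placeholder=" "):
    """Replace null (None) fields with a placeholder instead of dropping the row."""
    return [
        {key: (placeholder if value is None else value) for key, value in row.items()}
        for row in rows
    ]

# Hypothetical rows; the second one has a missing text field.
rows = [
    {"title": "Budget passed", "text": "The senate voted on Tuesday."},
    {"title": "Breaking story", "text": None},
]
filled = fill_missing(rows)
print(filled[1])
```

No row is discarded, so the article count stays the same after imputation.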
Stop words such as "if," "the," "is," "a," and "an" are common English words that add
nothing to the novelty or credibility of a story, so a machine learning model should not
give them much weight. Because they occur so often, leaving them in the dataset may
affect the model's predictions.
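Stop-word removal can be illustrated with a small filter. The stop-word set below contains only the five examples named in the text and is an assumption for illustration; a real project would normally use a much larger list from an NLP library.

```python
# Illustrative subset only: the five stop words named in the text.
STOP_WORDS = {"if", "the", "is", "a", "an"}

def remove_stop_words(text):
    """Drop every word whose lowercase form is in the stop-word set."""
    return " ".join(word for word in text.split() if word.lower() not in STOP_WORDS)

print(remove_stop_words("The senate is voting on a bill"))
```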
Special characters in a sentence have no bearing on whether a piece of news is accurate.
We therefore eliminate all punctuation from the dataset using regular expressions. A
custom function was developed to remove special characters, links, extra spaces,
underscores, etc.
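A cleaning function of the kind described above might look like the following sketch. The exact patterns used in the project are not given in the text, so the regular expressions here, the lowercasing step, and the name `clean_text` are assumptions.

```python
import re

def clean_text(text):
    """Strip links, punctuation, special characters and extra whitespace."""
    text = text.lower()                                 # assumption: case is normalised
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # punctuation, underscores, symbols
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace
    return text.strip()

print(clean_text("BREAKING!!! Read_more at https://example.com/x?id=1 ...  now"))
```

Links are removed before punctuation so that the URL is dropped as a whole rather than being split into leftover word fragments.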
Lemmatization
The word "play" is the root of other words, including "playing" and "plays." Replacing
the words in other tenses and participle forms with their root word allows a more
thorough analysis of the term's frequency. We therefore replace every word that derives
from a single root with that root word.
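The idea can be illustrated with a toy lookup table. This is only a sketch: the table `LEMMA_TABLE` is hypothetical and covers a single root, whereas a real project would rely on a dictionary-backed lemmatizer from an NLP library (for example NLTK's WordNetLemmatizer).

```python
# Hypothetical lookup table mapping inflected forms to their root word.
LEMMA_TABLE = {"playing": "play", "plays": "play", "played": "play"}

def lemmatize(words):
    """Replace each known inflected form with its root; leave other words unchanged."""
    return [LEMMA_TABLE.get(word, word) for word in words]

print(lemmatize(["she", "plays", "while", "playing"]))
```

After this step "plays" and "playing" are counted as the same term, which is what makes the frequency analysis more meaningful.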
Count Vectorization
For machine learning algorithms to accept the preprocessed text as input, it must next be
encoded as integers or floating-point values. This step is known as feature extraction
(or vectorization).
Each text is represented by a vector with as many dimensions as there are words in our
vocabulary. If a vocabulary word is present in the text, we add one to the corresponding
dimension of the vector, and one more for each additional occurrence of that word;
dimensions for words that never occur stay at zero.
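The counting scheme described above can be sketched in plain Python. The helper names `build_vocabulary` and `count_vector` and the two sample documents are illustrative assumptions; a library count vectorizer performs the same bookkeeping with additional tokenisation options.

```python
def build_vocabulary(documents):
    """Map every distinct word to a fixed column index (alphabetical order)."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(document, vocabulary):
    """One dimension per vocabulary word; each occurrence adds one."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Hypothetical, already cleaned sample documents.
docs = ["trump said trump", "people said new"]
vocabulary = build_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Here "trump" appears twice in the first document, so its dimension holds 2, while dimensions for words that do not appear remain 0.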
TF-IDF Transformation
We then transform the count-vectorized matrix into a matrix of TF-IDF values for each
feature.
Word frequency alone may be misleading, because some very frequent words carry little
information. We therefore use TF-IDF to balance a word's frequency within a text against
its significance. TF-IDF stands for term frequency-inverse document frequency.
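As a rough illustration, the transformation from counts to TF-IDF weights can be written as below. This sketch assumes the plain textbook definition, tf × log(N / df); library implementations typically add smoothing and normalisation, so their numbers will differ. The function name `tf_idf` and the sample matrix are assumptions for the example.

```python
import math

def tf_idf(count_matrix):
    """Weight each count by term frequency times log(N / document frequency)."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in count_matrix:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([
            (row[j] / total) * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Hypothetical count matrix: 2 documents, 4 vocabulary terms.
counts = [[0, 0, 1, 2], [1, 1, 1, 0]]
weights = tf_idf(counts)
print(weights)
```

The third term occurs in both documents, so its weight is zero everywhere, whereas the fourth term occurs only in the first document and receives a positive weight there.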
Fig. 4: Fake.csv and True.csv
Dataset: The first step is to collect or obtain a dataset of news articles, labeled as "fake"
or "real". This dataset will be used to train and evaluate the performance of different fake
news detection models.
Count Vectorizer (BOW): After preprocessing, the textual data can be transformed into
numerical features using the Bag-of-Words (BOW) format. This entails building a matrix
where each row represents a news item and each column represents a distinct term from the
dataset. The value in each cell indicates how often the term appears in the corresponding
article.
Train-Test Split: Once we have the BOW matrix, we can split the data into training and
testing sets. The training set will be used to train the fake news detection model, while
the testing set will be used to evaluate the model's performance on new, unseen data.
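A shuffle-and-slice split can be sketched as follows. The test fraction and random seed used in the project are not stated in the text, so the values of `test_fraction` and `seed` below are assumptions, as are the sample data.

```python
import random

def train_test_split(samples, labels, test_fraction=0.25, seed=42):
    """Shuffle indices reproducibly, then slice off a test portion."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical data: 8 articles with alternating labels.
samples = [f"article {i}" for i in range(8)]
labels = ["fake" if i % 2 else "real" for i in range(8)]
x_train, x_test, y_train, y_test = train_test_split(samples, labels)
print(len(x_train), len(x_test))
```

Shuffling indices rather than the samples themselves keeps every article paired with its own label.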
Text-to-vectors (TF-IDF): In addition to BOW, we can also express the textual data using
the Term Frequency-Inverse Document Frequency (TF-IDF) representation. This
representation takes into account both the frequency of a term in each article and its
frequency throughout the whole dataset. This helps to downplay terms that are prevalent
across the whole dataset and to emphasise words that are distinctive of a particular
article.
Models: After obtaining the numerical features from the text data, several machine
learning methods such as logistic regression, decision trees, or neural networks can be
employed to train a fake news detection model. The objective of the model is to learn a
function that can accurately classify news stories as either "real" or "fake" based on the
derived attributes from the text.
Accuracy and Confusion Matrix: It's crucial to assess the false news detection model's
performance on the testing set after we've trained it. By assessing its accuracy, precision,
recall, and F1 score, we may do this. To see how many true positives, true negatives,
false positives, and false negatives the model produces, we may also develop a confusion
matrix.
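The metrics named above follow directly from the four confusion-matrix counts. The sketch below treats "fake" as the positive class, which is an assumption, and uses made-up labels; the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def scores(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels for six test articles.
actual = ["fake", "fake", "fake", "real", "real", "real"]
predicted = ["fake", "fake", "real", "real", "real", "fake"]
counts = confusion_counts(actual, predicted)
print(counts, scores(*counts))
```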
Testing: After assessing the model's performance, we can use it to classify new,
previously unseen news pieces as "real" or "fake". This entails applying the same
preprocessing and feature extraction operations to the new data that we applied during
training. We then apply the trained model to the cleaned data to obtain a classification
label.
Result: Python's Streamlit library is used to present the result in a web browser, where
the user inputs a news item and the algorithm reports whether it is "Real" or "Fake".
Fig. 6: Importing Libraries
Fig. 8: Fake.csv
Fig. 9: True.csv
Fig. 11: Describing Fake and True Dataset
Pre-processing of Dataset
Fig. 13: Removing last 10 rows from both datasets for manual testing
Fig. 14: Merging the manual data frame
Graph 5: Frequency of subject of the news
Graph 6: Fake and Real News
Fig. 18: Checking the column
Fig. 20: Randomly Shuffling the data frame
Fig. 22: Pre-processing task of words
Train-Test Split
Fig. 24: Importing for Confusion Matrix
Models
Fig. 25: Logistic Regression
Fig. 27: Decision Tree Classifier
Fig. 29: Random Forest Classifier
Graph 8: Comparison of the accuracies of different models
Testing
Fig. 30: Testing
Sample Input
CHAPTER-4
RESULTS AND EXPERIMENTAL ANALYSIS
4.1) Models Applied And their Results
Below are the Results from applying Support Vector Machine model:
Confusion Matrix:
Logistic regression
▪ Logistic regression is a frequently used approach for binary classification problems,
where the goal is to predict one of two outcomes. It passes the output of a linear model
through a sigmoid function to obtain a probability value between 0 and 1, which is then
compared with a threshold to decide the class.
▪ This simple yet reliable algorithm can be trained efficiently on large datasets and has
applications in many areas, including credit scoring, spam filtering, and medical
diagnosis. However, because it relies on assumptions such as linearity and independence
of the features, it may not work well with highly correlated or nonlinear data.
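The prediction step described above (linear score, sigmoid, threshold) can be sketched in a few lines. The weights, bias, and threshold below are made-up values for illustration; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, class) where class is 1 if probability >= threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical two-feature example with made-up weights.
print(predict([1.0, 2.0], weights=[0.5, -0.25], bias=0.0))
print(predict([3.0, 0.0], weights=[0.5, -0.25], bias=0.0))
```

The first input has a linear score of exactly zero, which the sigmoid maps to a probability of 0.5, right at the decision threshold.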
Confusion Matrix:
• Decision tree classification is a popular machine learning approach for both binary and
multi-class classification tasks. The input data are recursively divided into subgroups
based on the most informative feature.
• Decision trees can handle categorical and numerical data and are simple to understand
and use. Additionally, they are resistant to noise and missing data and are capable of
capturing intricate non-linear relationships between features.
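One common way to decide which feature is "most informative" is to pick the split with the lowest weighted Gini impurity. The text does not say which criterion the project used, so Gini is an assumption here, and the function names and toy data are illustrative. The sketch shows a single split; a full tree repeats it recursively on each subgroup.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature, threshold):
    """Weighted impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels):
    """Try every observed value of every feature; return (impurity, feature, threshold)."""
    candidates = [
        (split_impurity(rows, labels, j, row[j]), j, row[j])
        for row in rows
        for j in range(len(row))
    ]
    return min(candidates)

# Hypothetical numeric features for four articles.
rows = [[1, 5], [2, 1], [3, 4], [4, 2]]
labels = ["fake", "fake", "real", "real"]
print(best_split(rows, labels))
```

In this toy data the first feature separates the two classes perfectly at a threshold of 2, so it is chosen over the second feature.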
Below are the Results from applying the Decision Tree Classification model:
Confusion Matrix:
Below are the Results from applying Gradient boosting classifier model:
Table 4: Classification Report of GBC
Confusion Matrix:
As the name implies, a Random Forest consists of numerous independent decision trees that
work together as an ensemble. Each tree in the Random Forest outputs a class prediction,
and the class that receives the most votes becomes the prediction of our model.
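The voting step can be sketched as follows. Only the aggregation is shown; the per-tree predictions are made-up values and the function name `majority_vote` is illustrative.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one article.
votes = ["fake", "real", "fake", "fake", "real"]
print(majority_vote(votes))
```

Using an odd number of trees avoids ties in a two-class problem.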
Below are the Results from applying Random Forest Classifier model:
Confusion Matrix:
Sample Input:
Fig. 40: Web Browser Output
CONCLUSIONS
5.1) Conclusions
Judging by the accuracy scores we obtained for the various models, all of the models do a
good job of identifying fake news items. The SVM, Decision Tree, and Gradient Boosting
classifiers notably achieved a very high accuracy of 99.5%, while the Random Forest
Classifier performed just slightly lower, at 98.71%.
All things considered, these results suggest that a range of classifiers may be used with
equal success rates and that machine learning techniques may be extremely successful in
spotting bogus news. It's important to keep in mind that accuracy is only one measure
and that the models should be evaluated using multiple metrics including precision,
recall, and F1-score in addition to factors like interpretability, scalability, and processing
requirements. Investigating different feature extraction and selection methods, classifier
types, and ensemble approaches may also be useful to see whether even better results
may be produced.
We used the real and fake datasets, which had 21417 and 23481 entries, respectively. We
converted the text into numerical features using the TF-IDF Vectorizer and applied the
following models:
▪ Support Vector Machine: accuracy of 99.31%
▪ Decision Tree: accuracy of 99.5%
▪ Gradient Boosting Classifier: accuracy of 99.5%
▪ Random Forest Classifier: accuracy of 98.7%
Including more varied and subtle features: Most current methods for detecting fake news
rely on simple text-based features like TF-IDF vectors or bag-of-words. Future research
could concentrate on more complex and diverse features, such as sentiment analysis,
network analysis, or multimedia analysis (for instance, identifying fake images or
videos).
Creating more interpretable models: Existing methods for spotting fake news often rely on
complex machine learning algorithms that can be difficult to interpret. In the future, it
would be beneficial to develop more interpretable models that provide more information on
how their decisions are made.
Combining information from other sources: In addition to social media, news articles,
and videos, fake news is regularly spread through other media channels and platforms.
The development of methods that can incorporate data from several sources may be
crucial in the future to improve false news identification.
Adapting to shifting strategies: It will be crucial for fake news detection technologies
to develop alongside the tactics used by those who create and spread it. For this, the
detection methods might need to be regularly reviewed and improved.
REFERENCES
[1] A. S. A. Ahmed, A. Abidin, M. A. Maarof, and R. A. Rashid, "Fake news
detection: A survey," IEEE Access, vol. 9, pp. 113051-113071, 2021. doi:
10.1109/ACCESS.2021.3104178
[2] S. Asghar, S. Mahmood, and H. Kamran, "Fake news detection using machine
learning: A survey," IEEE Access, vol. 9, pp. 57613-57639, 2021. doi:
10.1109/ACCESS.2021.3075392
[3] J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble
learning with context and attention mechanism," IEEE Access, vol. 9, pp. 27569-27579,
2021. doi: 10.1109/ACCESS.2021.3057736
[6] https://www.google.co.in/imgres?imgurl=https%3A%2F%2Fdata-
flair.training%2Fblogs%2Fwpcontent%2Fuploads%2Fsites%2F2%2F2019%2F07%2Fin
troductiontoSVM.png&tbnid=p7ua2IdzmLsjqM&vet=12ahUKEwjf26Kfru
DAhW6JrcAHdMIAagQMygCegUIARDlAQ..i&imgrefurl=https%3A%2F%2Fdata-
flair.training%2Fblogs%2Fsvm-support-vector-machine-
tutorial%2F&docid=7oy5_irTaN4UfM&w=801&h=420&q=svm&ved=2ahUKE
wjf26KfruD-AhW6JrcAHdMIAagQMygCegUIARDlAQ
[7] https://www.google.co.in/imgres?imgurl=https%3A%2F%2Fstatic.javatpoint.com%2
Ftutorial%2Fmachine-learning%2Fimages%2Flogistic-regression-
inmachinelearning.png&tbnid=LuaHnfur76i8eM&vet=12ahUKEwjFoPGSruDAh
VNnNgFHUjLCl8QMygCegUIARDjAQ..i&imgrefurl=https%3A%2F%2Fwww.j
avatpoint.com%2Flogisticregressioinmachinelearning&docid=makIlDmuc8naW
M&w=500&h=300&itg=1&q=logistic%20regression&ved=2ahUKEwjFoPGSru D-
AhVNnNgFHUjLCl8QMygCegUIARDjAQ
[8]https://www.google.co.in/url?sa=i&url=https%3A%2F%2Fwww.geeksforgeek
s.org%2Fdecision-tree%2F&psig=AOvVaw0sYuRq-TZe0WWhW-
9YQUnl&ust=1683450911500000&source=images&cd=vfe&ved=0CBEQjRxqF
woTCLDwi7qt4P4CFQAAAAAdAAAAABAE