Fake News Detection Using Machine Learning Models
Abstract— With the widespread use of technology, fake news and rumors are spreading as well. People and society are greatly impacted by fake news, which can also be used in phishing attempts and as a way of stealing personal information. In many areas of our lives, Artificial Intelligence (AI) and Machine Learning (ML) have demonstrated their effectiveness. Furthermore, Natural Language Processing (NLP) has shown promising results in text classification applications. In this paper, we present an experimental study for detecting fake news using ML models. The proposed model analyzes the main text of the news using NLP techniques and then classifies the news as fake or real. We used a new dataset that combines multiple fake news datasets. Moreover, we studied the impact of features extraction methods on the performance of the developed models. Eight experiments were performed using Random Forest (RF) and Support Vector Machine (SVM) models, each with a different features extraction technique. The SVM model achieved the best performance, with an accuracy of 98%. This result suggests that the model can be deployed in real-world settings to detect fake news with high reliability.

Keywords—Fake news, Features extraction, Machine Learning, NLP

I. INTRODUCTION
The internet provides a vast number of news articles, which makes it very convenient to reach any piece of information at the click of a mouse. While people have become increasingly aware that some of the news found on the web is invalid, a large amount of news is still falsely trusted and shared. Moreover, fake news is becoming more and more believable, aiming to hold the curiosity of audiences in order to sell information. Even younger generations, who were better at identifying misleading websites, find themselves confused as those websites continue to develop. A study conducted by Common Sense Media reported that 44% of teenagers frequently struggle to detect whether a news article is fake or not. The study also indicated that 31% of children between the ages of 10 and 18 have shared at least one news story online that they later realized was inaccurate. This increases concerns related to digital literacy and the legitimacy of information [1].

Phishing is one of the serious cyber threats to the internet environment and people's daily lives. In this attack, the attacker mimics a trusted entity with the aim of stealing sensitive information [2] [3]. Therefore, fake news detection is an important need in our society.

Moreover, distinguishing between trustworthy and fake news is difficult because fake news is diverse with regard to subjects, styles, and media platforms [4]. Nevertheless, researchers have argued that deceptive news is given away by linguistic cues. If we are able to recognize these indicators, we can develop an intelligent detector that can outperform manual inspection [4]. Effective tools are therefore crucial to distinguish between reliable and fake news in order to facilitate the extraction of unreliable news articles.

Machine learning (ML) techniques have proven to be an effective tool for the automated detection of anomalies in different sectors [5] [6]. With enough training on relevant and useful data, these models have shown their efficiency over time [7] [8]. ML algorithms are most often used for prediction or to detect something that is difficult to identify manually. In this paper, we aim to detect fake news using the Support Vector Machine (SVM) and Random Forest (RF) ML algorithms.

The main contributions of this paper are as follows:

• Utilize a new dataset related to fake news.
• Analyze news articles and build a set of models using different features extraction techniques and different ML classifiers to detect fake news.
• Perform a comparative analysis to evaluate the performance of the set of models.

The remaining part of this paper is organized as follows: Section 2 reviews the previous works of literature. Section 3 explains the methodology followed in this study. Section 4
Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering.
Downloaded on August 26,2023 at 17:27:45 UTC from IEEE Xplore. Restrictions apply.
removed. Lastly, all the letters were converted to lowercase.

2) Stop-words Removal
Stop words are words that are highly repeated in the text but do not affect its meaning [16], for example (‘a’, ‘the’, ‘is’, ‘are’). All these stop words were removed.

3) Tokenization
Tokenization means splitting the sentence into chunks of tokens [16]. For example, the sentence (‘no work today’) was tokenized into (‘no’, ‘work’, ‘today’).

4) Stemming
Stemming is the process of removing portions of a token, such as suffixes and prefixes, to return its base form [16]. For example, the word (‘studying’) was stemmed into (‘study’).

C. Features Extraction
Before training the ML models, we need to extract the important words that give the best performance. This step is also known as features extraction because words are the features used to train the model. Three features extraction techniques were used in this study: the Hash vectorizer, the Term Frequency–Inverse Document Frequency (TF-IDF) vectorizer, and n-grams. After cleaning the data, each features extraction technique was applied to the data.

Where TF(t, d) is the frequency of the term t in the document d, and IDF(t) is the total number of documents (n) over the number of documents that contain the term t.

This technique highlights the words most related to each document by assigning them high values, while words that are repeated in all the documents receive low values even if they are frequently used.

3) N-Grams
N-grams is the process of extracting sets of consecutive words from a sentence based on a specified window size. The n-grams technique takes a minimum number and a maximum number for its range [17]. For example, for the sentence (‘The weather is very hot today’), N-grams = (2, 2) means that only 2-grams are extracted, which gives the following: {‘weather very’, ‘very hot’, ‘hot today’}.

With N-grams = (1, 3), 1-grams, 2-grams, and 3-grams are extracted, which gives the following:

{‘weather’, ‘very’, ‘hot’, ‘today’, ‘weather very’, ‘very hot’, ‘hot today’, ‘weather very hot’, ‘very hot today’}.

The terms ‘The’ and ‘is’ are ignored by this technique.

This method helps in predicting the relation between words based on probability, as some words are more likely to be used after certain sets of words.
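The n-gram example can be reproduced with a minimal sketch that lowercases and tokenizes a sentence, drops stop words, and then slides a window over the remaining tokens. This is an illustration only and not the implementation used in the study: the function names `tokenize` and `extract_ngrams` and the four-word stop-word list are assumptions made for this example, and the sentence used is ‘The weather is very hot today’.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the list used in the study.
STOP_WORDS = {"a", "the", "is", "are"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def extract_ngrams(sentence, ngram_range, stop_words=STOP_WORDS):
    """Return every n-gram with min_n <= n <= max_n, in order of increasing n."""
    min_n, max_n = ngram_range
    # Stop words are removed before the window is applied, so 'The' and 'is'
    # never appear inside an n-gram.
    tokens = [t for t in tokenize(sentence) if t not in stop_words]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


sentence = "The weather is very hot today"
bigrams = extract_ngrams(sentence, (2, 2))
one_to_three = extract_ngrams(sentence, (1, 3))
print(bigrams)
print(one_to_three)
```

With the range (2, 2) the sketch yields the three 2-grams listed above, and with (1, 3) it yields the nine 1-, 2-, and 3-grams. Libraries such as scikit-learn expose the same idea through the `ngram_range` parameter of their text vectorizers.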
Where TP is the number of correctly predicted samples of the positive class, which is divided by the total number of samples of the positive class.

Model   Hyperparameter   Value
SVM     C                10
SVM     Gamma            scale
RF      criterion        gini
RF      max_features     sqrt

V. RESULTS AND DISCUSSION
After training eight different models, their performances were compared, as shown in Tab. 2.

TABLE 2. EXPERIMENTAL RESULTS
Model | Vectorizer | N-grams range | Accuracy | Precision | Recall | F1-score

Fig. 3. Confusion Matrix

Fig. 3 indicates that the model performs well in distinguishing between the two classes, as the number of misclassified samples in each class is considerably low
compared to the size of the testing set. This suggests that the results of the model generalize and that the model can be used on new data.

A. Comparison with Benchmark Study
In this study, the WELFake dataset was used to perform eight experiments using different ML models and features extraction techniques. The dataset was preprocessed and cleaned. Then two vectorization techniques were used, the Hash vectorizer and the TF-IDF vectorizer, with two n-gram ranges, (1, 3) and (2, 2). The proposed model obtained an accuracy of 98% using the SVM model with the Hash vectorizer and the (1, 3) n-gram range. Another study [15] used the same dataset and proposed different models. That study used the SVM model and achieved an accuracy of 96.73%. Tab. 3 shows a detailed comparison.

TABLE 3: COMPARISON AGAINST THE BENCHMARK STUDY

Study        Model   Dataset           Accuracy   F1-score
This study   SVM     WELFake dataset   98%        98%
[15]         SVM     WELFake dataset   96.73%     96.56%

VI. CONCLUSION
There is a demand for a system that detects fake news in order to protect society from scams and from the public disturbance that this type of news may cause. By collecting the needed data and using ML algorithms, models can be built to solve these problems and identify anomalies.

Still, this issue has not been completely resolved. Even though there have been several attempts to address it, existing techniques need modification and improvement. In this work, we sought to address this issue by developing ML models that can detect different types of fake news. We applied three different features extraction techniques to improve our models' performance and developed an SVM model that achieved an accuracy, recall, precision, and F1-score of 98%.

For future work, we aim to implement deep learning techniques to handle fake news. In addition, we will attempt to generate new datasets related to fake news with new features and study the effect of the different features on the model's detection performance.

ACKNOWLEDGMENT
We would like to thank SAUDI ARAMCO Cybersecurity Chair at Imam Abdulrahman bin Faisal University for funding this project.

REFERENCES
[1] Á. Figueira and L. Oliveira, “The current state of fake news: Challenges and opportunities,” in Procedia Computer Science, 2017, vol. 121, doi: 10.1016/j.procs.2017.11.106.
[2] M. Aljabri and S. Mirza, “Phishing Attacks Detection using Machine Learning and Deep Learning Models,” Proc. - 2022 7th Int. Conf. Data Sci. Mach. Learn. Appl. CDMA 2022, pp. 175–180, 2022, doi: 10.1109/CDMA54072.2022.00034.
[3] M. Aljabri et al., “An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models,” Comput. Intell. Neurosci., vol. 2022, pp. 1–14, Aug. 2022, doi: 10.1155/2022/3241216.
[4] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake News Detection on Social Media: A Data Mining Perspective,” ACM SIGKDD Explor. Newsl., vol. 19, no. 1, pp. 22–36, Sep. 2017, doi: 10.1145/3137597.3137600.
[5] M. Aljabri, A. A. Alahmadi, R. M. A. Mohammad, M. Aboulnour, D. M. Alomari, and S. H. Almotiri, “Classification of Firewall Log Data Using Multiclass Machine Learning Models,” Electronics, vol. 11, no. 12, p. 1851, Jun. 2022, doi: 10.3390/ELECTRONICS11121851.
[6] M. Aljabri et al., “Intelligent Techniques for Detecting Network Attacks: Review and Research Directions,” Sensors, vol. 21, no. 21, p. 7070, Oct. 2021, doi: 10.3390/S21217070.
[7] R. M. A. Mohammad, M. Aljabri, M. Aboulnour, S. Mirza, and A. Alshobaiki, “Classifying the Mortality of People with Underlying Health Conditions Affected by COVID-19 Using Machine Learning Techniques,” Appl. Comput. Intell. Soft Comput., vol. 2022, 2022, doi: 10.1155/2022/3783058.
[8] M. Aljabri et al., “Sentiment Analysis of Arabic Tweets Regarding Distance Learning in Saudi Arabia during the COVID-19 Pandemic,” Sensors, vol. 21, no. 16, p. 5431, Aug. 2021, doi: 10.3390/S21165431.
[9] K. Stahl, “Fake news detection in social media.”
[10] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea, “Automatic Detection of Fake News,” Aug. 2017.
[11] “FakeNewsNet Dataset | Papers With Code.”
[12] A. Jain, A. Shakya, H. Khatter, and A. K. Gupta, “A smart System for Fake News Detection Using Machine Learning,” IEEE Int. Conf. Issues Challenges Intell. Comput. Tech. ICICT 2019, Sep. 2019, doi: 10.1109/ICICT46931.2019.8977659.
[13] C. K. Hiramath and G. C. Deshpande, “Fake News Detection Using Deep Learning Techniques,” 1st IEEE Int. Conf. Adv. Inf. Technol. ICAIT 2019 - Proc., pp. 411–415, Jul. 2019, doi: 10.1109/ICAIT47043.2019.8987258.
[14] M. Aldwairi and A. Alwahedi, “Detecting Fake News in Social Media Networks,” Procedia Comput. Sci., vol. 141, pp. 215–222, Jan. 2018, doi: 10.1016/J.PROCS.2018.10.171.
[15] P. K. Verma, P. Agrawal, and R. Prodan, “WELFake dataset for fake news detection in text data,” Feb. 2021, doi: 10.5281/ZENODO.4561253.
[16] “Text Preprocessing NLP | Text Preprocessing in NLP with Python codes.” https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/ (accessed Oct. 27, 2022).
[17] “Vectorization Techniques in NLP [Guide] - neptune.ai.” https://neptune.ai/blog/vectorization-techniques-in-nlp-guide (accessed Oct. 27, 2022).
[18] T. Evgeniou and M. Pontil, “Support vector machines: Theory and applications,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 2049 LNAI, pp. 249–257, 2001, doi: 10.1007/3-540-44673-7_12.
[19] L. Breiman, “Random Forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001, doi: 10.1023/A:1010933404324.
[20] “Evaluation Metrics Definition | DeepAI.” https://deepai.org/machine-learning-glossary-and-terms/evaluation-metrics (accessed Sep. 13, 2022).
[21] “GridSearchCV for Beginners. It is somewhat common knowledge in the… | by Scott Okamura | Towards Data Science.” https://towardsdatascience.com/gridsearchcv-for-beginners-db48a90114ee (accessed Feb. 22, 2022).