Fake News Classification with LSTM or Logistic Regression
Fake news has become a focal point of discussion in the media over the past several years. Therefore, there is a need for an automated way to classify fake and real news accurately.
This post aims to help curb the spread of rumors and fake news by automatically classifying news articles as trustworthy or not, using DataSpell — the latest Data Science IDE from JetBrains.
The Data
The dataset consists of 44,919 news articles, almost equally distributed to the true and fake categories.
The true articles were collected from the Reuters website, and the fake ones from various sources flagged as fake by Wikipedia and by Politifact. The dataset comprises the full body of each article, the title, the date, and the topic. Courtesy of the University of Victoria ISOT Research Lab, the data sets can be downloaded from here. We will compare a state-of-the-art approach, long short-term memory (LSTM) networks, with Logistic Regression.
DataSpell
There are a number of ways to download DataSpell. If you plan to use DataSpell alone, without any other JetBrains tools in the Toolbox, the simplest way is to download it directly from its website.
After launching DataSpell, you will see a screen like this. If this is the first time you have used DataSpell, your environment should be much cleaner than mine. I have created a project called dsProject_LogReg_DL; under the project, I created a data folder and loaded Fake.csv and True.csv. I also created two Jupyter notebooks in the same directory, FakeNew_DL.ipynb and FakeNew_LogReg.ipynb, like so, by right-clicking:
As a data scientist, I am a heavy user of Jupyter notebooks. DataSpell provides an interface similar to JupyterLab, plus several new features such as interactive data frames, smart coding assistance, and so on. You just need to try them out.
Deep Learning
This section presents an overview of the preprocessing techniques and a description of the deep learning model used for classification.
To make a meaningful comparison between the deep learning model and the Logistic Regression model, we record a start time at the beginning of each notebook and an end time at the end, then calculate the time spent running the entire notebook.
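A minimal sketch of how that timing can be done (the variable names here are illustrative, not taken from the original notebooks):

```python
import time

# First cell of the notebook: record the start time
start_time = time.time()

# ... the rest of the notebook runs here ...

# Last cell of the notebook: record the end time and report elapsed wall-clock time
end_time = time.time()
print(f"Total run time: {end_time - start_time:.1f} seconds")
```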
Pre-processing
The following steps demonstrate part of the data pre-processing process.
- We load Fake.csv and True.csv.
- Remove the unneeded columns; we only need title and text.
- Label fake news as 0 and real news as 1.
- Concatenate the two data frames into one (see the sketch after this list).
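A short sketch of these steps, assuming the data folder layout described above:

```python
import pandas as pd

# Load the two CSV files from the project's data folder
fake_df = pd.read_csv("data/Fake.csv")
true_df = pd.read_csv("data/True.csv")

# Keep only the columns we need: title and text
fake_df = fake_df[["title", "text"]]
true_df = true_df[["title", "text"]]

# Label fake news as 0 and real news as 1
fake_df["label"] = 0
true_df["label"] = 1

# Concatenate the two data frames into one
df = pd.concat([fake_df, true_df], ignore_index=True)
```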
You may have noticed that you can view the entire data frame by clicking “Open in new tab”, and using the scroll bar when the data frame is large.
In the following steps, we:
- Combine title and text into one column.
- Apply a standard text cleaning process: lowercasing, removing extra spaces, and removing URL links.
- Split the training and testing data in exactly the same way for the deep learning model and for Logistic Regression.
- Put the parameters at the top, like this, to make them easier to change and edit (a sketch follows this list).
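A sketch of these steps; the test size and random seed values are assumptions for illustration, not the post's exact settings:

```python
import re
from sklearn.model_selection import train_test_split

# Parameters kept at the top so they are easy to change (values are illustrative)
test_size = 0.2
random_state = 42

# Combine title and text into a single column
df["title_text"] = df["title"] + " " + df["text"]

def clean_text(text):
    """Standard cleaning: lowercase, strip URL links, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URL links
    text = re.sub(r"\s+", " ", text).strip()            # remove extra spaces
    return text

df["title_text"] = df["title_text"].apply(clean_text)

# Split once, in the same way, so both models see identical train/test data
X_train, X_test, y_train, y_test = train_test_split(
    df["title_text"], df["label"], test_size=test_size, random_state=random_state
)
```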
Tokenization
- The Tokenizer does all the heavy lifting for us. From the articles (i.e., title + text) that it tokenizes, it keeps the 10,000 most common words.
- oov_tok puts a special value in wherever an unseen word is encountered. This means I want "<OOV>" to be used for words that are not in the word index. fit_on_texts goes through all the text and creates the dictionary.
- After tokenization, the next step is to turn those tokens into lists of sequences.
- When we train neural networks for NLP, we need the sequences to be the same size; that is why we use padding. Our max_length is 256, so we use pad_sequences to make all of our articles (i.e., title + text) the same length of 256.
- In addition, there are a padding type and a truncating type, and we set both of them to "post". For example, if one article is 200 tokens long, we pad it to 256 by adding 56 zeros at the end. A sketch of this step follows the list.
Building the Model
Now we can implement the LSTM. Here is my code: I build a tf.keras.Sequential model and start with an embedding layer.
- An embedding layer stores one vector per word. When called, it converts the sequences of word indices into sequences of vectors. After training, words with similar meanings often have similar vectors.
- Next is how to implement the LSTM in code. The Bidirectional wrapper is used with an LSTM layer; it propagates the input forwards and backwards through the LSTM layer and then concatenates the outputs. This helps the LSTM learn long-term dependencies. We then feed the output into a dense neural network to do the classification.
- In our model summary, we have our embeddings and our Bidirectional wrapper containing the LSTM, followed by two dense layers. The output size of the Bidirectional layer is 128, because it doubles what we put into the LSTM. I also stacked LSTM layers to improve the results (a sketch of the architecture follows this list).
- We use early stopping, which stops training when the validation loss no longer improves.
- We visualize training over time, and the results were good.
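A minimal sketch of such a model. The embedding size, dense layer width, patience, and epoch count are assumptions for illustration; the LSTM units of 64 follow from the 128-wide Bidirectional output described above.

```python
import tensorflow as tf

embedding_dim = 64  # assumed; the post does not state the embedding size

model = tf.keras.Sequential([
    # One vector per word; converts sequences of word indices into sequences of vectors
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Stacked bidirectional LSTMs: each outputs 128 features (double the 64 units)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    # Two dense layers on top for the binary classification
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Stop training when the validation loss no longer improves
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                              restore_best_weights=True)

history = model.fit(train_padded, y_train,
                    validation_data=(test_padded, y_test),
                    epochs=10, callbacks=[early_stop])
```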
Logistic Regression
This time, we are going to create a simple logistic regression model to classify news as either real or fake, using the same data sets, the same text cleaning, and the same train_test_split.
The process is simple. We will clean and pre-process the text data, perform feature extraction using the NLTK library, build a logistic regression classifier with the Scikit-Learn library, and evaluate the model's accuracy at the end.
Pre-processing
- In the following pre-processing, we strip off any HTML tags and punctuation and make the text lower case.
- The following code combines tokenization and stemming and then applies them to "title_text" (a sketch follows this list).
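A sketch of this cleaning step using NLTK's word tokenizer and Porter stemmer; the helper function names are illustrative, not the post's originals:

```python
import re
import string
import nltk
from nltk.stem.porter import PorterStemmer

nltk.download("punkt")  # tokenizer models used by nltk.word_tokenize

stemmer = PorterStemmer()

def strip_and_lower(text):
    """Strip HTML tags and punctuation, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return text.lower()

def tokenize_and_stem(text):
    """Tokenize with NLTK and stem each token with the Porter stemmer."""
    tokens = nltk.word_tokenize(text)
    return " ".join(stemmer.stem(token) for token in tokens)

df["title_text"] = df["title_text"].apply(strip_and_lower).apply(tokenize_and_stem)
```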
TF-IDF
Here we transform the "title_text" feature into TF-IDF vectors.
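A short sketch of that transformation, fitting the vectorizer on the training split and reusing it on the test split (default TfidfVectorizer settings; any extra parameters are a modeling choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the training articles only, then transform both splits
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```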
- Instead of tuning the C parameter manually, we can use an estimator, LogisticRegressionCV.
- We specify the number of cross-validation folds, cv=5, to tune this hyperparameter.
- The model is measured by the accuracy of the classification.
- By setting n_jobs=-1, we dedicate all the CPU cores to solving the problem.
- We raise the maximum number of iterations of the optimization algorithm.
- Evaluate the performance (a sketch follows this list).
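A sketch putting these settings together; the exact max_iter value is an assumption:

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

# LogisticRegressionCV tunes the regularization strength C via cross-validation
log_reg = LogisticRegressionCV(
    cv=5,                # 5 cross-validation folds
    scoring="accuracy",  # measure the model by classification accuracy
    n_jobs=-1,           # use all CPU cores
    max_iter=1000,       # raised iteration limit (exact value is an assumption)
)
log_reg.fit(X_train_tfidf, y_train)

# Evaluate on the held-out test set
y_pred = log_reg.predict(X_test_tfidf)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```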
Summary
We found that the deep learning model and Logistic Regression produced similar results; the only notable difference is that the training time for Logistic Regression was half that of the deep learning model.
The obvious difference in the above process is that the deep neural network estimates many more parameters, and many more combinations of parameters, than logistic regression. Basically, we can think of logistic regression as a one-layer neural network.
To sum it up, I would recommend solving a classification problem with a simple model first (e.g., logistic regression). In our example, this already solves the problem sufficiently well. In other cases, when we are not satisfied with the simple model's performance and we have sufficient training data, we would try to train a deep neural network, which has the advantage of being able to learn more complex, non-linear functions.
The project repository can be found on GitHub. Have a great week!