Classifiers for Fake News Detection

Siddhartha S Shetkar
19 min read · May 13, 2021


Team Members: Siddhartha Shetkar, Albert Wu, Natasha Long, Nic De Valle

Before starting, if you would like to view our demo to interact in real time with our models, please refer to our Github repo linked at the bottom of this article and follow the README to set up the web application.

Abstract

Our project uses two popular fake news datasets, the ISOT Fake News Dataset from the University of Victoria and the LIAR dataset from William Yang Wang, to train our machine learning models and classify fake news. The models we used were a Multinomial Naive Bayes classifier, a BERT classifier, a Catboost gradient boosting classifier, and a combination of Stanford GloVe and an LSTM. We began by preprocessing our datasets using popular feature engineering techniques: we conducted sentiment analysis, generated features for our respective models, and removed punctuation and stopwords from the text. To run our models on text data, we used TFIDF vectorization to transform it into numerical features. We used Stanford GloVe and model ensembling on Naive Bayes and Catboost respectively to further improve our models. We then evaluated the performance of our models and tuned our best models for an optimal fake news detector.

We ran a Naive Bayes model on the ISOT dataset and quickly uncovered numerous glaring issues with it, the biggest being extreme data leakage. For this reason, the rest of the report focuses strictly on the LIAR dataset. Our Catboost models achieved the best classification accuracies, followed by BERT, Stanford GloVe + LSTM, and Naive Bayes respectively. We also attempted stacking and ensembling models together. This project outlines our process and our specific results for each model. We believe the current state of fake news detection can be further explored and improved to create interesting insights and, hopefully, a reliable detector.

Introduction & Background

The rise of social media has brought on a flurry of changes to our everyday lives. Along with the recent presidency, we have seen the impact that social media, or rather, all media has on public opinion and discourse. While information is being proliferated at a faster rate than ever before, our problem today focuses on the bad side of this information — information that has no verifiable facts or sources. In other words, fake news.

Before we get started, a disclaimer: we are not making a political statement by any means. If there's anything to take away from this, it's that fake news is bad and it's present on both sides of the political spectrum.

So first, we wanted to talk about the implications of fake news on society today. It is a hot topic in the Supreme Court, the National Security Agency, the business sector, and many other areas of society because the debate centers on the balance between free speech and censorship in the US government. The main implications of fake news in the US are hyperpolarization between Democrats and Republicans as well as a generally misinformed voter population. The pie chart below summarizes BuzzFeed's fact-checking of biased outlets on both the left and the right. BuzzFeed investigated news from Facebook's three biggest political groups on the far left and the far right and fact-checked it. On the far left, BuzzFeed found 20% of the news to be fake or to include false information; on the far right, the presence of fake news was almost double.

Data Collection/Description

When we first started looking for collections of real and fake news articles as our dataset, a common one we found was a Fake News Detection Dataset made by the ISOT lab of the University of Victoria (commonly known as the ISOT dataset). This was used in many published articles and Kaggle competitions. The ISOT dataset had a total of 40,000 data points split across two files (True.csv and False.csv). The 4 features of the dataset were:

  • Title — title of the article
  • Text — entire contents of the article
  • Subject — subject of the article
  • Date — date the article was published

However, we quickly found that the ISOT dataset had fundamental issues, mainly in the form of massive data leakage. Some examples of this are:

  • Reuters: The word ‘Reuters’ appears almost exclusively in Real news articles, making that one word an almost surefire predictor. We later found that all the Real news articles were taken from Reuters, meaning that any defining characteristics of Reuters articles, as opposed to other news outlets, could be misinterpreted by our model as markers of real news.
  • Subject column: The Real news data points had subjects of ‘politics news’ or ‘world news’, while all the Fake news data points had subjects of ‘politics’, ‘left news’, ‘government news’, ‘US news’, or ‘Middle east’. Since there is no intersection between these sets, the subject column is a complete giveaway for the label, which would not be the case in any realistic setting.
  • Date: The articles were not sampled with the date in mind, making further data leakage apparent. For example, there were no Real articles from before 2016, which at times made that feature alone a dead giveaway for the label.

Since there were such pervasive issues with this dataset, we had to find another one. After searching through many other collections, we eventually settled on the LIAR dataset, created by William Yang Wang from the University of California, Santa Barbara. This dataset was generated using Politifact.com, a reliable source whose multiple editors fact-check statements and verify that proper context is provided. In particular, this dataset focuses on statements by US government officials. Some other notable benefits of this dataset are that a finer scale of labels is used to account for context (true, mostly-true, half-true, barely-true, false, pants-fire) and that only short statements are evaluated, so as not to paint an entire article as just true or just false. We elaborate on the given features below:

  • ID — ID of the article assigned by Politifact
  • Label — the final rating that the Politifact team of fact-checkers gives to the statement
  • Statement — the text of the statement being evaluated
  • Subjects — list of subjects the statement might fall into
  • Speaker — which US government official made the statement
  • Speaker job title — official job title of the speaker
  • State — which state the statement was made in
  • Party Affiliation — which party the speaker is affiliated with
  • Barely True — the number of previous statements made by the speaker that were labeled ‘barely true’
  • False — the number of previous statements made by the speaker that were labeled ‘false’
  • Half true — the number of previous statements made by the speaker that were labeled ‘half true’
  • Mostly true — the number of previous statements made by the speaker that were labeled ‘mostly true’
  • Pants on fire — the number of previous statements made by the speaker that were labeled ‘pants on fire’
  • Context — a short sentence (3–7 words) describing the context of the statement
5 elements of the training dataset

First, we visualized the spread of our labels to check for data balance.

Histogram of labels on the entire LIAR dataset

We see that most of the label counts are similar, aside from pants-on-fire, which we expected to be lower: the pants-on-fire label requires a statement to be outrageously false, something that should occur less often than the other categories.

Next, we visualized the different types of statements using word clouds.

Finally, we explored the different spreads of the label across the features of the dataset.

This graphic shows a few of the most frequent speakers in the LIAR dataset, along with both the number of times each speaker has made a statement in a given category and the relative proportion of categories that each speaker's statements fall into. The LIAR dataset takes many statements from well-known US representatives and speakers, hence the high frequency of statements from people such as Barack Obama and Donald Trump.
It is important to observe that most speakers fall primarily into three party affiliations: Republican, Democrat, or no party. It is also worth noting that for smaller parties such as the Tea Party, which many would group with the Republicans, a larger dataset would potentially benefit from merging such parties into the main categories, leading to fewer outliers.

Data Pre-processing & Exploration

We tried many forms of feature selection/engineering but found that most methods did not yield much performance improvement.

In terms of feature selection, the first thing we did was drop the ID column, as it was just an artifact of how the data was collected and did not provide any useful information about the article itself. Next, we plotted a histogram showing the number of NaN values in each feature, since we needed to handle them before running any models. We found that the only features with a severe lack of data were the speaker job title, state info, and context features. As shown by the graphic below,

both speaker job title and state info had around a quarter of their values missing (NaN). There was no clear way to mitigate this without going through thousands of data points manually, so we tried two approaches: dropping the features entirely, and adding an indicator column representing whether the feature was NaN or not.

We found that dropping the features yielded better results, which we attribute to a combination of the features not being very important and too many data points otherwise being lost. The feature importance plot below shows how much Catboost valued speaker job title and state info.

Catboost Feature Importances (original features)

For the context feature, we decided to fill NaN values with empty strings so our model wouldn't try to extrapolate incorrect information from those data points, since we treated it as a text feature rather than a categorical one. For all other features containing NaN values, we simply dropped those data points, as fewer than 10 rows were affected in total. Overall, dropping speaker job title and state info yielded around a 2% increase in test accuracy, from 44% to 46%. A minimal sketch of this NaN handling is shown below.
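The snippet below is a minimal sketch of the NaN handling described above, assuming the LIAR training split has been loaded into a pandas DataFrame; the column names are illustrative rather than the exact ones in our code.

```python
import pandas as pd

# df = pd.read_csv("train.tsv", sep="\t", names=COLUMN_NAMES)  # hypothetical load of the LIAR split

df = df.drop(columns=["id", "speaker_job_title", "state_info"])  # drop the ID and the two sparse features
df["context"] = df["context"].fillna("")  # context is free text, so use an empty string instead of NaN
df = df.dropna()  # fewer than 10 rows with NaNs remain, so we simply drop them
```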

Before continuing to our data preprocessing and feature engineering attempts, we would like to discuss a decision we had to make: whether to treat this as a multiclass or a binary classification problem. Considering that the original labels were made by Politifact editors cross-referencing other articles not present in the dataset, we felt the labels might be too specific for a model to learn without that information. We were able to achieve a test accuracy of 46% but had a lot of difficulty pushing past this barrier. Thus, we decided to reduce the problem to binary classification to account for the fact that the original labels were determined using much more information than what was given to us in our dataset. We created our mapping as follows:

  • true → true
  • mostly-true → true
  • half-true → false
  • barely-true → false
  • false → false
  • pants on fire → false

The reason we chose these mappings is that Politifact describes every label at least as strict as half-true as having some type of important context left out. We decided that if a statement was not made in a completely clear context, it should not be considered true. This did affect class balance, making false labels more prevalent, but not enough for us to consider the dataset imbalanced. Lastly, we also combined the previous-speaker-count features for barely true, false, half true, mostly true, and pants on fire statements according to the above mapping, leaving just two features for a speaker's previous true and false statement counts. Moving forward, we only worked with binary classification. A rough sketch of this relabeling is shown below.
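Here is a rough sketch of the relabeling and the merged speaker-history counts; the column names are illustrative. Note that the dataset has no count column for fully 'true' statements, so only the 'mostly true' counts end up on the true side.

```python
# Map the six Politifact labels down to binary labels
# (label spellings may need adjusting to match the raw TSV).
label_map = {
    "true": "true", "mostly-true": "true",
    "half-true": "false", "barely-true": "false",
    "false": "false", "pants-fire": "false",
}
df["binary_label"] = df["label"].map(label_map)

# Merge the per-speaker history counts according to the same mapping.
df["prev_true_count"] = df["mostly_true_counts"]
df["prev_false_count"] = (
    df["half_true_counts"] + df["barely_true_counts"]
    + df["false_counts"] + df["pants_on_fire_counts"]
)
```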

Next, we explored data preprocessing techniques. One thing we should note before continuing is that Catboost natively supports text features. It preprocesses them by first tokenizing the text, then creating a dictionary to map tokens to numbers, and finally converting all the text using this dictionary. Catboost supports a few options for this conversion, but we found the most success with the default option, which uses a combination of Bag of Words and Multinomial Naive Bayes to estimate numerical features from the text. The main preprocessing techniques we considered were lowercasing all text, punctuation elimination, stop word elimination, and word stemming. The idea behind these techniques is to simplify the sentences so the model won't get caught up in minor intricacies of the text, and to standardize the conversion to numerical features. For example, we don't want the word ‘Political’ to be interpreted differently from the word ‘political’. Similarly, punctuation and stop words (like ‘the’ or ‘a’) do not add meaning to the sentence. Finally, stemming removes suffixes to reduce a word to its root (like ‘waited’ → ‘wait’). However, after applying these preprocessing techniques, we still did not see any noticeable performance improvement. A sketch of this cleaning step is shown below.
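The sketch below illustrates this kind of NLTK-based cleaning (lowercasing, punctuation and stopword removal, Porter stemming); it is representative of the steps we tried rather than our exact pipeline, and it assumes the DataFrame from the earlier snippets.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    text = text.lower()  # 'Political' and 'political' become the same token
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = [stemmer.stem(t) for t in text.split() if t not in STOPWORDS]  # drop stopwords, stem the rest
    return " ".join(tokens)

df["statement_clean"] = df["statement"].apply(clean_text)
```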

Lastly, we explored various feature engineering techniques. The first option we attempted was sentiment analysis with NLTK. We added positive, negative, and compound scores as features to see if the sentiment of the text would help predict its validity. We thought this might be useful because we expected more expressive/polarized rhetoric in statements that were not sufficiently backed up with data and thus might be considered false. However, we found that these features ended up not being important, as shown by the graphic below (a sketch of how these scores can be generated follows the figure).

Catboost Feature Importances (Sentiment Analysis)
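For reference, sentiment scores like these can be generated with NLTK's VADER analyzer roughly as follows; the DataFrame and column names are illustrative.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' keys.
scores = df["statement"].apply(sia.polarity_scores)
df["sent_pos"] = scores.apply(lambda s: s["pos"])
df["sent_neg"] = scores.apply(lambda s: s["neg"])
df["sent_compound"] = scores.apply(lambda s: s["compound"])
```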

Next, we attempted using term frequency-inverse document frequency (TFIDF) to add features that evaluate how relevant a word in a statement is relative to all the statements. We tried this both alongside and instead of the default Catboost text preprocessing. In both cases, however, these additional features did not significantly affect performance. A minimal example of the TFIDF step is shown below.
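Here is a minimal example of the TFIDF step with scikit-learn, under the assumption that the train/test splits live in DataFrames with a statement column; capping the vocabulary keeps the number of added features manageable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=2000, stop_words="english")
X_train_tfidf = tfidf.fit_transform(df_train["statement"])  # fit only on the training statements
X_test_tfidf = tfidf.transform(df_test["statement"])        # sparse TFIDF matrices
```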

Learning/Modeling

Catboost:

Catboost is one of the many popular gradient boosting libraries. Gradient boosting is a technique that builds a strong model by ensembling many weak models, typically decision trees.

The main reasons we decided to start out with Catboost are:

  • Gradient boosting libraries are very popular on Kaggle, which gives us many resources on how to use them and indicates they are versatile
  • Our dataset has many categorical features, which Catboost specializes in and optimizes for
  • It has extensive built-in support for text features, which applies to our statement and context features (a minimal setup sketch is shown below)
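The sketch below shows a minimal Catboost setup of this kind; the column names and hyperparameters are illustrative, not our exact configuration.

```python
from catboost import CatBoostClassifier, Pool

cat_cols = ["subjects", "speaker", "party_affiliation"]  # categorical features
text_cols = ["statement", "context"]                     # free-text features

train_pool = Pool(X_train, y_train, cat_features=cat_cols, text_features=text_cols)
test_pool = Pool(X_test, y_test, cat_features=cat_cols, text_features=text_cols)

model = CatBoostClassifier(iterations=500, eval_metric="Accuracy", verbose=100)
model.fit(train_pool, eval_set=test_pool)

print(model.get_feature_importance(prettified=True))  # per-feature importances, as in the plots above
```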

Multinomial Naive Bayes:

Multinomial Naive Bayes models are generally used with numerical features, which we obtain using TFIDF on text features and one-hot encoding on categorical features. The Naive Bayes model itself is very popular for NLP problems. Multinomial Naive Bayes calculates the likelihood of each class based on a frequency table and then uses probabilistic rules to classify data points. Bayes' theorem is outlined below:

Bayes Theorem

Bayes' theorem allows us to use past information to make future decisions: the probability of an event A occurring given that an event B has occurred, P(A|B), equals P(B|A)P(A)/P(B), as outlined above. This fundamental theorem yields interesting and useful results when applied to a large dataset for classification. A sketch of the kind of pipeline we used is shown below.
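This is a sketch of the kind of scikit-learn pipeline described above, with TFIDF on the statement text, one-hot encoding on the categorical columns, and Multinomial Naive Bayes on top; the column names are assumptions for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english"), "statement"),           # text feature
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     ["subjects", "speaker", "party_affiliation"]),                          # categorical features
])
nb_pipeline = Pipeline([("features", features), ("clf", MultinomialNB())])

nb_pipeline.fit(df_train, y_train)
print("Test accuracy:", nb_pipeline.score(df_test, y_test))
```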

BERT:

BERT, or Bidirectional Encoder Representations from Transformers, is a technique developed by Google for NLP tasks. The transformer is an architecture that relies on a self-attention mechanism to find contextual relations between words and their positions in the text. The key development with BERT is that it reads the entire sequence of text at once to learn context from all of the surrounding text, as opposed to previous representations that only processed text from left to right or right to left.

Since the text of the statements was previously shown to be the most important feature for our Catboost models, we thought that focusing on the text might give us more insight. BERT, being one of the state-of-the-art language models, was a prime candidate for further analysis of the text.

Google, the creator of BERT, also provides pre-trained models which we then fine-tuned on our dataset for predictions.
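Loading such a pre-trained model and tokenizing the LIAR statements looks roughly like the sketch below, assuming the Hugging Face transformers library; this is illustrative rather than our exact fine-tuning code.

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)  # binary labels

# Tokenize the short LIAR statements with padding/truncation to a common length.
encodings = tokenizer(
    list(df_train["statement"]),
    truncation=True,
    padding=True,
    return_tensors="pt",
)
```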

Stanford GloVe + LSTM:

Stanford GloVe is an unsupervised learning algorithm that obtains vector representations for the words it is trained on. Each word vector can be used to rank how correlated every other word is with it, which is useful for building a word vector space for a model to use. We believed Stanford GloVe would provide useful features for our text classification, as it is a popular word embedding technique.

We used the Stanford GloVe 6B 50-dimensional vectors to obtain word embeddings. Word embeddings represent words as vectors such that words with similar meanings end up close together in the vector space. For example:

Example Stanford GloVe Embedding

In the above example, we determine the 5 words most correlated with the word "king". We take indices 1 through 5 (slice [1:6]) because the word at index 0 will always be the word itself. This provides useful information to train our model on. A sketch of this lookup is shown below.
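Assuming glove.6B.50d.txt has been downloaded, the lookup can be sketched as follows: we rank every vocabulary word by cosine similarity to the query word and drop index 0, which is always the query word itself.

```python
import numpy as np

# Load the 50-dimensional GloVe vectors into a dict of word -> vector.
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

def most_correlated(word, k=5):
    target = embeddings[word]
    sims = {
        w: float(np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target)))
        for w, v in embeddings.items()
    }
    ranked = sorted(sims, key=sims.get, reverse=True)
    return ranked[1 : k + 1]  # index 0 is the query word itself

print(most_correlated("king"))
```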

For the LSTM, we convert the vectorized words into sequences of integer indices, which are then truncated and padded to a fixed length. The LSTM layer and the binary cross-entropy loss function we added both come from the Keras library. We then trained the model on the LIAR dataset.

We build up the layers using the Stanford GloVe embedding followed by an LSTM layer. We set the activation to ‘relu’ and the loss function to ‘binary_crossentropy’, which is designed for binary classification (a sketch of this kind of model follows the figure below).

LSTM after Stanford GloVe Preprocessing
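A sketch of this kind of Keras model is shown below; it is not necessarily our exact architecture, and it assumes an embedding_matrix built from the GloVe vectors plus a vocab_size and max_len from the tokenization step.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = Sequential([
    Embedding(vocab_size, 50, weights=[embedding_matrix],  # 50-d GloVe vectors, kept frozen
              input_length=max_len, trainable=False),
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),  # single true/false output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)  # truncate/pad the integer sequences
# model.fit(X_train_pad, y_train, validation_split=0.1, epochs=5)
```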

Stacked Catboost:

Stacking is a technique that uses the predictions made by ‘base’ models as features for ‘meta’ models, which then train on those predictions together with the original features to try to produce a better model. Here, we used BERT as our base model, since it performed slightly worse than our current best-performing Catboost model. However, we found that the stacked model did not perform any better than the original Catboost model. A rough sketch of the idea is shown below.
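The idea can be sketched roughly as follows: BERT's predicted probabilities are appended as an extra feature column before training a Catboost meta-model. The bert_train_probs / bert_test_probs arrays are assumed to be precomputed (ideally out-of-fold on the training split, to avoid leakage), and cat_cols / text_cols are the same illustrative column lists as before.

```python
from catboost import CatBoostClassifier, Pool

X_train_stacked = X_train.copy()
X_train_stacked["bert_prob_true"] = bert_train_probs  # base-model predictions as a new feature
X_test_stacked = X_test.copy()
X_test_stacked["bert_prob_true"] = bert_test_probs

meta_model = CatBoostClassifier(iterations=500, verbose=False)
meta_model.fit(Pool(X_train_stacked, y_train, cat_features=cat_cols, text_features=text_cols))

test_pool = Pool(X_test_stacked, y_test, cat_features=cat_cols, text_features=text_cols)
stacked_preds = meta_model.predict(test_pool)
```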

Weighted Ensembled Catboost:

We tried to ensemble models by taking the weighted sum of their predicted probabilities for each class and then taking the class with the higher probability as our prediction. In our case, we took two Catboost models: one trained just on the BERT predictions as features, and our current best-performing Catboost model trained on all the original features from the dataset. Since the Catboost model with just the BERT predictions performed worse, we weighted it lower than the current best-performing Catboost model. In the end, this became our best-performing model, with a weight of 0.12 on the Catboost with just BERT predictions and a weight of 1 on the current best-performing Catboost model. A sketch of this weighting is shown below.
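The weighting itself is a simple soft vote over predicted probabilities; in the sketch below, the 0.12 weight is the value reported above, while the model and variable names are illustrative.

```python
import numpy as np

p_full = catboost_full.predict_proba(full_test_pool)  # Catboost on all original features
p_bert = catboost_bert.predict_proba(bert_test_pool)  # Catboost on BERT predictions only

weighted = 1.0 * p_full + 0.12 * p_bert  # weighted sum of class probabilities
ensemble_pred = weighted.argmax(axis=1)  # pick the class with the higher weighted probability
```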

Results

Below, we show the test accuracy and confusion matrix for the best model in each category.

Multiclass Catboost model (46.7% test accuracy):

Multiclass Catboost Confusion Matrix on test dataset

There were some issues with overfitting in the multiclass Catboost model: the training accuracy was much higher at 65.9%. Since we decided to focus only on binary classification, we did not spend much effort mitigating this, though we note that Catboost also has built-in parameters to prevent overfitting.

Binary Catboost model (76.6% test accuracy):

Binary Catboost Confusion Matrix on test dataset

We did not see any issues with overfitting or underfitting as the accuracy is relatively high and the training accuracy is similar at 82.7%.

Multinomial Naive Bayes (66.5% test accuracy):

Multinomial Naive Bayes Confusion Matrix on test dataset

We did not see any issues with overfitting or underfitting as the accuracy is relatively high and the training accuracy is similar at 70.6%.

BERT (65.6% test accuracy):

BERT Confusion Matrix on test dataset

During training, we did see some issues with overfitting as shown by the graphic below.

Loss chart while fine tuning pre-trained BERT model

However, we set the pre-trained BERT model's trainer object to load the best model from these checkpoints during training, as shown in the gist below.
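For reference, a configuration of this kind, assuming the Hugging Face Trainer API, looks roughly like the sketch below (this is illustrative, not the exact contents of the gist).

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_liar",
    num_train_epochs=3,
    evaluation_strategy="epoch",   # evaluate at the end of every epoch...
    save_strategy="epoch",         # ...and save a checkpoint at the same time
    load_best_model_at_end=True,   # reload the best checkpoint once training finishes
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,                   # the BertForSequenceClassification from earlier
    args=training_args,
    train_dataset=train_dataset,   # assumed tokenized datasets
    eval_dataset=eval_dataset,
)
trainer.train()
```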

Thus, the final model obtained did not suffer much from overfitting.

Stanford GloVe + LSTM (64.3% test accuracy):

Stanford GloVe + LSTM Confusion Matrix on test dataset

We did not see any issues with overfitting or underfitting as the accuracy is relatively high and the training accuracy is similar at 66.8%.

Stacked Catboost (74.6% test accuracy):

Stacked Catboost Confusion Matrix on test dataset

There were some issues with overfitting in the stacked Catboost models. The training accuracy was much higher at 85.8%. Looking forwards, we think Catboost’s built-in parameters for preventing overfitting might help mitigate this issue.

Weighted Ensembled Catboost (77.4% test accuracy):

Weighted Ensembled Catboost Confusion Matrix on test dataset

There were some issues with overfitting in the weighted ensembled Catboost models. The training accuracy was much higher at 85.6%. Looking forwards, we think Catboost’s built-in parameters for preventing overfitting might help mitigate this issue.

Looking at the results we see that in general all of our models had issues with false negatives. The models that focus on text (BERT & Stanford GloVe + LSTM) seemed to additionally have an issue of just predicting false too often. Interestingly, ensembling the Catboost models based on the BERT predictions and the original features seemed to be able to counteract these effects and yielded our highest test accuracy at 77.4%.

Some of our key findings were:

  • NLTK/NLP preprocessing of the text (stopword removal, stemming, sentiment analysis) did not improve performance
  • The small amount of text per statement might be holding back the BERT/Stanford GloVe models from extracting enough information, and might be what is hurting their performance
  • The statement is the most important feature (as expected), but beyond that, the counts of a speaker's past true/false statements are also very important features
  • Job title/state info did not seem very important, which could be because of the many NaN values; and for titles like President, the job title alone does not seem to matter much, since the people who hold that position can vary drastically
  • Catboost has very good default handling for categorical and text features (we could not beat it on our own with techniques like TFIDF)
  • The gap between binary and multiclass classification performance is huge, which makes sense since information not present in our dataset was used to create the finer-grained multiclass labels

Conclusion

Overall, we attempted a variety of models that tackle the problem of fake news classification from different angles. We focused on the text and the categorical features using models such as Multinomial Naive Bayes, Catboost, BERT, and a combination of Stanford GloVe with an LSTM. Afterward, we attempted to combine our models using techniques such as stacking and weighted ensembling. This eventually led us to our best model, which was a weighted ensemble of two Catboost models: one trained only on pre-trained BERT model predictions and one trained on the original features. This model achieved a test accuracy of 77.4%, which we consider quite good given the difficult nature of the problem at hand.

Though our model has not been trained on an enormous amount of information, we believe it could be useful as a tool for the average citizen to quickly get a sense of whether they should be wary of a statement being made. Considering how fast fake news travels, an easy-to-use tool that provides immediate feedback could help lower its effectiveness. This would also be useful for companies that need to mitigate the spread of fake news, such as social media outlets. They could use the model to produce quick heuristics and hopefully filter out statements that are clearly fake/false. Lastly, our model could also serve as a sanity check for US government officials to see whether their statement's rhetoric might make it more prone to being viewed as false. Our model does not check against a database of facts to verify the truth, but the patterns it picks up in true/false statements could be useful for identifying rhetoric a government official might want to avoid so their statement can be more easily discerned as true.

Some important lessons we learned while working on this project were:

  • Data leakage is a huge concern. You cannot just trust the source (even if it came from reputable sources like a university!). Always be wary of your results if they seem too good to be true.
  • Sometimes it is good to rely on premade libraries for preprocessing. We attempted many types of text preprocessing techniques but none surpassed what Catboost was able to do out of the box.
  • NLP techniques are not always guaranteed to help in tasks that involve text. We tried a variety of preprocessing techniques from NLTK but none of them helped our model performance.

Moving forward, some approaches we would like to consider are:

  • Experiment with some of Catboost’s built-in overfitting detection parameters to help with overfitting in the stacked/ensembled Catboost models.
  • Web scraping/using the Politifact API to access the full content of the statements instead of just short snippets. We believe that the lack of text content might have been what was holding back some of our more sophisticated models (pre-trained BERT & Stanford GloVe + LSTM) from being more useful.
  • Politifact also releases articles in which their editors explained how they evaluated each of the statements. These articles are almost always the first link when searching the statement so it is reasonable to be able to web scrape all of these articles and use them as additional context for our models. This also might make it viable to shift back to multiclass classification since we have the context the editors used to make the finer-grained labels.
  • Experiment with recurrent neural networks (RNNs) to see if we could generate fake/real news as an exploratory tool.

References

Github

https://github.com/sshetkar3858/EE461P_Final_Project_Fake_News_Classifier
