In an era of 280-character Twitter posts from hundreds of news outlets and 15-second TikTok videos from celebrities, we are constantly presented with a stream of articles, content, and opinions on every topic. We are often shown contradicting viewpoints simultaneously, with no clarification of the methods used to derive the statistics behind them. This not only leads to information overload but also prevents the audience from truly learning about the issue. Moreover, the “misinformation effect” is apparent in these times of widespread data and information being constantly released without an adequate filtering system. This effect occurs when an individual recalls memories or events inaccurately as a result of information encountered after the event occurs, and it is one of the forces behind what we see all over the news as “fake” news. In a recent study of rumors spreading on Twitter between 2016 and 2017, falsehoods were 70 percent more likely to be retweeted than the truth, even when controlling for account age, activity level, and number of followers. As data and quality experts, we must lead our community in differentiating between fact- and opinion-based news content to prevent the spread of misinformation.
Although numerous combinations of analytical techniques can be used to clean and extract data, certain methods help minimize bias and lead to an objective presentation of the data. Data mining and Lean Six Sigma (LSS) techniques allowed our team to filter opinion-based articles out of our news stream. Concepts such as classification, clustering, regression, decision trees, and value stream maps help build an environment where individuals can interact with data in a structured manner and develop their own conclusions.
To show how these techniques and strategies work, we developed a data mining tool – Justification and Analysis of Real vs. Incorrect Sources (JARVIS). (Not to be confused with Tony Stark's J.A.R.V.I.S. from the Marvel Universe.) The following content highlights how each technique was used within the Define, Measure, Analyze, Improve, and Control (DMAIC) framework to provide users with a definitive conclusion on whether an entered news article is based on fact or opinion.
Media consumption has increased over the past ten (10) years as digital news has become more accessible and instantaneous. As a result, the spread of fake news has become a greater issue, since false stories are more likely to be shared than real ones. We wanted to determine what makes an article true or fake in order to help prevent the spread of misinformation. Our goal was to develop a localized toolset that performs data mining to predict whether an article is based on fact or fiction/opinion with over 95% accuracy.
As you will see when you use the tool, JARVIS takes a link to an article, virtually visits that website, and scans in all the text from the article to decide whether it is real or fake. To train JARVIS on this capability, we performed a random sampling of nearly 45,000 articles from around the world, spanning several topics to make sure the model was not biased toward a certain type of content. The data contained approximately equal numbers of true and false articles. Outside of the true/false label, the initial data also contained the article headline, body text, topic, and date posted. JARVIS was built in Python, and the model itself prioritized accuracy given that the sample data was balanced approximately 50/50 between true and false articles. We limited the model to only three (3) factors in its assessment to reduce bias and prevent any other outside information from driving the decision:
Words within the headline (Feature/Input)
Words in the body text (Feature/Input)
Whether the article is true or false (Label/Output)
By considering these factors, the end goal was to develop a model that had sufficiently learned the relationship between the features and label above, such that it could take in a new article's headline and text and accurately identify its objectivity.
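As a sketch of this setup, each article reduces to two features and one label. The record fields, the miniature sample, and the 80/20 holdout split below are our illustrative assumptions, not JARVIS's actual pipeline:

```python
import random

# Hypothetical miniature stand-in for the ~45,000-article dataset:
# each record carries the two features (headline, body) and one label.
articles = [
    {"headline": "Senate passes budget bill", "body": "The vote was 62-38 after debate.", "label": "true"},
    {"headline": "Aliens endorse candidate", "body": "Unnamed sources say it happened.", "label": "false"},
]

def split_features_labels(records, train_fraction=0.8, seed=42):
    """Shuffle the records, then split into training and holdout sets
    so the model can later be evaluated on articles it never saw."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    features = [(r["headline"], r["body"]) for r in shuffled]
    labels = [r["label"] for r in shuffled]
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])

(train_X, train_y), (test_X, test_y) = split_features_labels(articles)
```

Holding out a test set is what lets the 95% accuracy goal be measured honestly, on articles the model did not train on.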
With the problem defined, the next step was to clean and transform the sampled data before analysis. We applied several data mining techniques to filter through and clean the data:
Special characters are characters that do not add value to the text, such as commas, double quotes, and single quotes. These characters were removed in the data cleaning process.
Stopwords are words that do not add value to the text, such as “the”, “an”, and “as”. These words were removed in the data cleaning process.
Lemmatization works to group inflected forms of words together as a single item. For instance, walking and walked would be categorized as walk. These words were lemmatized, or transformed, in the data cleaning process.
Bag of words defines a dictionary of words and counts the number of times each of those words appears within the article. We used our initial dataset to learn what that dictionary of words is, then applied that “bag” to the data to transform it into a format that can be used with a machine learning (ML) model.
These techniques transformed the data from unstructured text to numerical representations, allowing for further analysis and predictive modeling.
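A minimal sketch of that cleaning chain is below. The tiny stopword list and the suffix-stripping “lemmatizer” are simplified stand-ins for illustration; a production pipeline would use a full stopword list and a dictionary-based lemmatizer:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "as", "to", "was", "and", "in"}  # tiny illustrative list

def lemmatize(word):
    """Toy lemmatizer: strip common inflection suffixes so that
    'walking' and 'walked' both collapse to 'walk'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())             # drop special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]  # drop stopwords
    return [lemmatize(t) for t in tokens]                     # group inflected forms

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word appears, yielding the
    numerical vector an ML model can consume."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

tokens = clean('The senator was "walking" to the chamber, walked in, and voted.')
vocab = sorted(set(tokens))        # in practice, learned from the full training set
vector = bag_of_words(tokens, vocab)
```

Note that “walking” and “walked” both collapse to “walk” and so contribute to a single count, which is exactly what lemmatization buys the downstream model.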
Once the information was transformed into an ML-ready format, our team could begin data classification to build the predictive capability. We stacked five (5) different base data mining classification models through a final-layer meta model to determine whether the article in question was real/objective or fake/biased. The base models included ML techniques such as Logistic Regression, Random Forest, and Naïve Bayes variants, while the final-layer model featured an L2-regularized Logistic Regression model. Each of these is heavily rooted in statistical theory and is similar to LSS evaluation techniques. For example, the Random Forest classification model is comparable to the Decision Trees performed in root cause analysis (RCA). Random Forest evaluates which words or groupings of words lead to a true/false result by constructing a multitude of decision trees – hence the name forest! See below for a brief summary of these models' assumptions and how they learn:
Logistic Regression (linear assumption): Given the bag of words data transformation, logistic regression will attempt to assign a positive/negative weight to each word which translates to how much an instance of that word increases/decreases the likelihood of fake/biased news. Under the hood, this algorithm attempts to learn a hyperplane that can separate the classes in feature space by minimizing the negative log loss function.
Random Forest (linear/nonlinear assumption): Although harder to interpret statistically, through the bag of words data transformation Random Forests attempt to identify combinations of word usage choices that lead to real/objective news vs. fake/biased news. This technique is an ensembled extension of Decision Trees. Each tree's data is based on a bootstrap sample (sampling with replacement) of size n from the original data set. The tree is constructed by recursive binary splitting, which partitions feature space by aiming to minimize Gini impurity at each partition. A forest of trees rather than a single tree minimizes the risk of overfitting, ultimately allowing the model to generalize well to new data.
Naïve Bayes (linear/nonlinear assumption): Given the bag of words data transformation, Naïve Bayes attempts to learn “given the news article is false, what is the likelihood a specific word occurs X number of times”. This algorithm attempts to learn the probability of a label given the data by assuming the data is conditionally independent on the label. This is a naïve assumption, hence the name Naïve Bayes. The difference among Naïve Bayes variants (e.g., Gaussian, Multinomial, Complement, Categorical) is that they assume different conditional distributions of the data.
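The Gini impurity criterion mentioned in the Random Forest description above can be made concrete with a short, generic sketch (this is the standard formula, not JARVIS's internals):

```python
def gini_impurity(labels):
    """Gini impurity of a node's labels: the probability that two labels
    drawn at random (with replacement) disagree. 0.0 means the node is
    pure; 0.5 is the worst case for two balanced classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        p = labels.count(cls) / n
        impurity -= p * p
    return impurity

# A node holding a 50/50 mix of true/fake articles is maximally impure...
mixed = gini_impurity(["true", "fake", "true", "fake"])  # 0.5
# ...while a node a split has fully isolated is pure.
pure = gini_impurity(["fake", "fake", "fake"])           # 0.0
```

At each split, the tree picks the word-count threshold that drives its child nodes' impurity as low as possible, which is how individual words end up acting as true/fake signals.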
Using a stacked ML framework allows us to avoid relying on a single assumption about the shape of the data. For example, a single Logistic Regression model would assume the data is linearly separable, which is typically not the case. JARVIS's base modeling layer features models that make various assumptions about the shape of the data, both linear and nonlinear. JARVIS's final Logistic Regression layer then learns the best way to combine those models' outputs to make the most accurate predictions. In other words, by learning weights for each of the base models, JARVIS's final layer learns how to best balance the linear and nonlinear assumptions of those models to achieve the highest possible accuracy.
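A minimal sketch of such a stack using scikit-learn is below. The toy corpus, the specific hyperparameters, and the three base models shown are illustrative placeholders; JARVIS's actual five-model configuration differs:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned article text (placeholder data),
# repeated so each class has enough samples for cross-validated stacking.
texts = [
    "senate passes budget after long debate",
    "officials confirm vote totals in report",
    "shocking secret they refuse to tell you",
    "miracle cure doctors hate revealed now",
] * 5
labels = ["true", "true", "fake", "fake"] * 5

# Bag-of-words transform feeding base classifiers whose predictions are
# combined by an L2-regularized Logistic Regression meta model.
model = make_pipeline(
    CountVectorizer(),
    StackingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("nbayes", MultinomialNB()),
        ],
        final_estimator=LogisticRegression(penalty="l2", max_iter=1000),
    ),
)
model.fit(texts, labels)
prediction = model.predict(["budget vote passes senate"])[0]
```

`StackingClassifier` trains the meta model on cross-validated predictions from the base models, which is what lets the final layer learn how much to trust each base model rather than simply averaging them.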
In terms of controlling the improvements identified through these data mining efforts, we will need to continue performing text analysis and ensure regular feedback loops within the system to surface errors and challenges, and to continue gathering data that may improve JARVIS' performance and accuracy over time. Because news is so broad, for JARVIS to be effective in a production setting its training data would need to be an accurate and holistic representation of the news landscape. This would ensure that JARVIS can generalize well to new articles; the challenge becomes developing a training set that is representative of the larger market.
JARVIS ultimately provides a quick-turn evaluation of an article's objectivity before you read it. If this application were ingrained within a news distributor's platform so that every article were assigned an objectivity score, readers could use that score as a metric for prioritizing which articles to read and which to scroll past. It is our responsibility to ensure proper data mining and analytical methodologies are applied to the data presented through information platforms so that members of our community can comprehend potentially biased data in an unbiased manner. We hope this article has empowered you with the tools, techniques, and mindset to filter real from fake data.
Stay tuned to Boulevard Insights: JARVIS will be made available for public use as an open source tool and interactive demo on this website.
Amit Gattadahalli is an Associate at Boulevard with a focus in data science. He recently graduated from the University of Maryland College Park with a B.S. in Mathematics and Statistics. Since 2018, his work within the consulting industry has largely concentrated on machine learning, custom data science-centric algorithm development, data visualization, and general software development.
Shrey Tarpara is an Associate at Boulevard with a background in data analytics and change management. As a Lean Six Sigma Black Belt, he helps organizations make sense of their data and implement strategic initiatives to improve operations. He holds a B.S. in Economics from the University of Maryland and holds the patent for an illuminating wire designed to improve hardware replacement and troubleshooting processes within healthcare and commercial settings.
Kashni Sharma is a Senior Associate at Boulevard with a focus in project management, process improvement, and risk management. She graduated from the University of Maryland with a bachelor’s degree in economics and a master’s in applied economics.