Utilizing Natural Language Processing Methods With Supervised Learning on Reddit Data
One of the largest areas of interest in the vast world of data science is machine learning. Machine learning is a field of study focused on having a computer make predictions from data as accurately as possible. Today, there are two main types of machine learning in use: supervised and unsupervised learning. In supervised learning, a programmer has access to data that depicts an outcome arising from a certain pattern of features. That “outcome data” can then be used as the prediction target when training a machine learning model. Supervised classification unlocks scoring metrics such as accuracy and precision for estimating how well a model performs, whereas regression problems typically rely on metrics such as mean absolute error and mean squared error to accomplish the same objective of measuring an algorithm’s effectiveness. Many times, however, a programmer will not have access to “outcome data” (prediction targets) and will have to resort to unsupervised learning. In unsupervised learning, different approaches are taken to help a data scientist extract useful information from data, often by analyzing the relationships between features. Methods such as “clustering” and “principal component analysis” are used to group observations and determine the significance of features for other models. Ultimately, both types of machine learning are a subset of AI development and have their roots in computational statistics. Each type is powerful and employs unique methodologies for uncovering patterns in data.
Today, we are going to focus primarily on supervised machine learning models and their use in natural language processing. While data science may intuitively seem like a field for analyzing only numerical data (which is correct in a literal sense), a big focus in industry is placed on interpreting human language data programmatically. Businesses thrive on learning what their users are saying about them (particularly through sentiment analysis), and machine learning is an excellent way to tackle the challenge of “how does one understand what people are saying from a large dataset of human language?” Let’s explore that by working through a classification process involving Reddit data…
Consider This Problem Statement…
Reddit, the popular, “community run” social media website, has a dilemma in which the content of two subreddits has been mixed together into a single subreddit. You, as a data scientist, are tasked with creating a machine learning model that classifies which content comes from which subreddit, helping the Reddit developers and subreddit moderators clean up this blend of content.
This form of problem statement can be translated to different situations that occur across various domains — imagine a scenario where a hacker maliciously blends categorical content for pleasure, or where a developer wants to build a classification application for users to navigate with. It also transfers to situations where sensitive and valuable information is accidentally merged with data it does not belong with.
The key takeaway here is that this is a binary classification problem, where an outcome can be determined by asking whether the data belongs to one class or the other. For this particular problem statement, we are going to analyze whether submission content belongs to the subreddit “r/aww” or “r/natureismetal.”
What are These Subreddits and Why is Researching Them Important for Our Problem?
The two subreddits differ vastly in the type of content they showcase. This is useful to understand when considering how well our model will perform: similar content may be difficult for a computer (or a person) to classify, whereas polarized content is more intuitive to tell apart. In “r/aww”, textual submission content accompanies posts that nurture feelings of positive sentiment; content is not graphic nor intimidating to experience. In “r/natureismetal”, posts lean more toward negative sentiment and NSFW (“not safe for work”) content is allowed; content is often more graphic and more intimidating to witness.
In any given data science problem, a good portion of one’s time should be spent researching the subject involved. A lack of subject matter knowledge will more often than not yield a poor result and may ultimately spread misinformation to anyone who chooses to study your work. Furthermore, research on relevant topics may help one discover different and more efficient approaches to solving a problem. In this case, it was discovered through the subreddits’ rules and guidelines that each subreddit projects a different image of what it chooses to popularize — and this affects how post titles are written. For instance, r/natureismetal requires that submission titles be descriptive, while r/aww has no such requirement. Furthermore, r/aww prohibits any “sad” content, whereas r/natureismetal endorses showcasing graphic animal content or grand acts of nature. It was also noticed that the two subreddits differ vastly in userbase size: r/aww showed about 24.4 million users while r/natureismetal showed about 1.4 million users.
A portion of our time was also spent analyzing Reddit’s platform in general. It was discovered that subreddits undergo a judicial system of sorts when it comes to posting content. Each subreddit has moderators and auto-moderators (robotic moderators) which patrol the subreddit for content not appropriate for the community’s image. Many times, ill-suited posts are removed from a subreddit, but some posts manage to squeeze past this judicial system. The next filter is the userbase’s reaction: more popular submissions receive more upvotes, comments, and virtual awards, which increases a post’s popularity — which in turn yields even more upvotes, comments, and virtual awards, as if it were a positive feedback loop. Such applauded submissions serve as better examples for a user trying to discover what image a subreddit expresses.
All of the above regarding the subreddits and Reddit as a platform must be taken into consideration when analyzing results. For now, we will continue with analyzing the data.
What Data are We Analyzing?
Luckily, we have access to 2,500 submission texts from each subreddit before their fictional merge, where each post in our dataframe has an associated label showing which subreddit the post came from. The data was scraped prior to the merge and may be enough for us to build a working model. With access to such data, we can very easily integrate a supervised machine learning model.
Let’s Explore What We are Working With…
While it may be difficult to notice at first, the histograms depicted above in both figures show right skewness in the number of comments across the posts from each subreddit. This means that much of our data is unpopular, with most submissions having fewer than 100 comments. Under closer observation, it was discovered that none of the submissions associated with r/aww received any virtual awards, and the highest upvote score in r/aww was only 7 upvotes — by comparison, r/natureismetal had a post with a maximum score of 40,830 upvotes. Such observations allow us to state that most of the data we collected is “new” and did not face much scrutiny from the moderators or userbase. The possibility therefore exists that our data is not appropriately labeled. This kind of observation must be noted when analyzing error in any supervised learning model.
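For reference, here is a minimal sketch of how such comment-count histograms could be drawn with pandas and matplotlib; df_aww, df_metal, and the num_comments column are assumed names standing in for the scraped dataframes:

```python
import matplotlib.pyplot as plt

# df_aww and df_metal stand in for the two scraped subreddit dataframes,
# each assumed to carry a num_comments column per submission.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

df_aww["num_comments"].hist(ax=axes[0], bins=50)
axes[0].set_title("r/aww: comments per post")
axes[0].set_xlabel("number of comments")

df_metal["num_comments"].hist(ax=axes[1], bins=50)
axes[1].set_title("r/natureismetal: comments per post")
axes[1].set_xlabel("number of comments")

plt.show()
```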
Introducing the Count Vectorizer
A computer does not understand textual language data (at least not readily); it only understands numerical data — binary at its most fundamental level. This means that any language data observed must be converted into numerical data. Here, this is achieved through a form of one-hot encoding, using what is known as a Count Vectorizer from scikit-learn.
A Count Vectorizer converts our textual data into arrays expressing the state of a word element. For this case, the range of words analyzed per element in our vectorizer will be a single word (an n-gram range of (1,1)). For each instance a unique word is observed across a corpus of language data, our Count Vectorizer adds a “1” to the associated cell in an array. This “1” represents the Boolean state of the variable referencing that word’s presence in a subset of the corpus. For instance, in the picture above, the left table shows a comma separated text corpus of “Red, Red, Yellow, Green, Yellow.” The Count Vectorizer recognizes that only three unique words are expressed in this corpus: “Red”, “Yellow”, and “Green”. On the right side, we see a table of these three words with arrays spanning the length of the corpus (NOTE: the right table is lacking a final row; this final row should read, from left to right: [0, 1, 0]). Each time the Count Vectorizer observes a unique word in the corpus, that word’s Boolean state returns true — all else is false. These vectorizers are a very simple way to quickly discover trends within textual data. Let’s observe what can be understood from one-hot encoded text data through the figure below.
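Before that, here is a minimal sketch of the color example above using scikit-learn’s CountVectorizer, with binary=True so each cell records presence rather than a raw count; each color is treated as its own tiny document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus mirroring the "Red, Red, Yellow, Green, Yellow" example,
# with each element treated as its own document.
corpus = ["Red", "Red", "Yellow", "Green", "Yellow"]

# binary=True mirrors the Boolean (present / not present) description above;
# the default behavior counts occurrences instead.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['green' 'red' 'yellow']
print(X.toarray())
# [[0 1 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 0 1]]
```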
The above figure is a visual construction of the sums of Booleans from the Count Vectorizer sparse matrix built from the r/natureismetal data. This is a great way for a data scientist to visualize the frequency of a certain word (or expression, with larger n-gram ranges) across a text corpus. The most frequent word in r/natureismetal is “the.” Intuitively, one should recognize that this information is meaningless to our study. We will not be able to accurately differentiate content between two subreddits by trying to highlight such a common word as an anomaly — almost every paragraph written in any language will make use of the definite article “the.” Words such as “the”, “and”, “in”, and many more are commonly considered to be stop words in vectorization.
Our next approach will be to instantiate a new Count Vectorizer that incorporates a corpus of common English stop words and refit our data to it. This will filter out most of the meaningless noise observed in our data.
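A hedged sketch of what that refit might look like, passing nltk’s English stop words into the Count Vectorizer and summing the columns to get word frequencies (the titles list is a stand-in for the real submission titles):

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")  # one-time download of nltk's stop word lists

# Stand-in for the Series of submission titles from one subreddit.
titles = ["The lion is eating its prey", "A cheetah chasing down a gazelle"]

cv = CountVectorizer(stop_words=stopwords.words("english"))
counts = cv.fit_transform(titles)

# Summing each column gives the corpus-wide frequency of every remaining word,
# which is the kind of count the frequency charts are built from.
frequencies = (
    pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())
    .sum()
    .sort_values(ascending=False)
)
print(frequencies.head(10))
```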
Figure 6 and Figure 7 both show the adjusted Count Vectorized datasets incorporating stop words from an English corpus package known as nltk (the Natural Language Toolkit). Here we begin to uncover the significant difference in language across the two subreddits. From r/aww, words such as “cat”, “dog”, and “cute” serve as common indicators of the subreddit’s image. For r/natureismetal, words such as “eating”, “lion”, and “eagle” serve as common indicators of its image.
Now to Train…
In supervised learning, we have the ability to compare our predicted values against a set of true values to help us gauge our modeling approaches. From the 5,000 total data points, we are going to perform what is known as a train-test split to create a training set and a validation set of data points (where our validation set will serve as our testing set).
Referencing the above figure, the idea is simple. We take our entire available dataset and split it into two sections. One section is used to train our model, while the other section is used to validate/test our model. Virtually all supervised learning workflows rely on this concept.
In this problem we split our data into four sets with scikit-learn: a training feature set, a training label set, a testing feature set, and a testing label set. Now, we did specify that the train-test split breaks our data into “two” datasets, not “four.” Technically, this still holds true, as we now have an amalgamated “training” set and an amalgamated “testing” set — two complete datasets composed of four related pieces. The split we chose was a 75/25 split, where 75% of our data was used for training and 25% for testing; the data was sampled at random. We also included a condition in scikit-learn’s train_test_split method which stratifies our sampling to make sure the labels are evenly distributed across both the training and testing sets, as sketched below. Doing so helps us eliminate class imbalance across our modeling process.
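A minimal sketch of that split, assuming X and y stand in for the submission titles and subreddit labels from our scraped data:

```python
from sklearn.model_selection import train_test_split

# X stands in for the 5,000 submission titles and y for the matching
# subreddit labels; both are placeholders for the real dataframe columns.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,    # the 75/25 split described above
    stratify=y,        # keep both subreddit labels evenly represented in each set
    random_state=42,   # illustrative seed so the split is reproducible
)
```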
What are Our Models?
With our training set, we can now begin creating a machine learning model that will help us analyze predictive patterns in our features. In this study, we chose to experiment with three different types of models: Logistic Regression, Gaussian Naïve Bayes, and Multinomial Naïve Bayes. In combination with these three models, we also utilized two forms of vectorization: the Count Vectorizer and the TFIDF Vectorizer.
There is a lot to uncover here across all models and forms of vectorization. Recall why vectorization of our text data is necessary before the modeling process: a computer can only understand numbers. Our models will only work with numerical data, which means we must always convert our text data into a numerical representation.
What is the TFIDF Vectorizer?
We learned what a Count Vectorizer does, so why are we also utilizing a TFIDF Vectorizer? The TFIDF Vectorizer works in a similar way to the Count Vectorizer, converting a text corpus into a sparse matrix that represents the same data. The difference is that the TFIDF Vectorizer applies a weight to each word in the sparse matrix. It does this by analyzing the “term frequency”: how often a given n-gram appears within a single document of the corpus. Sometimes, we will discover noise in the data which can be filtered by incorporating stop words. However, a TFIDF Vectorizer will also make some corrections of its own during vectorization by analyzing the “inverse document frequency”; this downscales the weight of words that appear too frequently across the different documents.
Suppose the word “the” is left in our TFIDF Vectorizer analysis for our study. In a single submission post on a subreddit, the word “the” may appear twice as often as all other unique words in the title. Its term frequency is therefore larger and may carry more weight. However, if “the” also appears that frequently across the entire dataset of r/aww and r/natureismetal titles, we cannot justify increasing its weight across our dataset. The TFIDF Vectorizer will therefore consider downsizing the weight of the word “the,” as its meaningful importance is negligible.
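A small runnable sketch of the inverse-document-frequency effect with scikit-learn’s TfidfVectorizer; the two “documents” below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy "documents" standing in for submission titles.
docs = [
    "the lion is eating the gazelle",
    "the cutest puppy in the world",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

# idf_ holds the learned inverse-document-frequency weights. "the" appears in
# every document, so it receives the smallest possible IDF and is downweighted
# relative to words like "lion" or "puppy" that appear in only one document.
for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{word}: {idf:.3f}")
```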
How Do These Models Work?
Logistic Regression is a regression model which can be used to analyze the probability of an event in a binary classification problem. Logistic Regression operates best when the data does not exhibit collinearity between features and does not contain outliers. We can use this method to classify whether a submission post belongs to one subreddit or not (if not, it clearly belongs to the other subreddit in our problem).
Naïve Bayes models hold the assumption that each of our features is independent of one another. For our study, this may not seem like a great model to incorporate, as our features may have some relevance to one another. Consider the expression “not good”: it showcases negative sentiment, but it is a description that relies on the word “not” to contradict the single-word expression “good.” More often than not the independence assumption is violated, yet this does not disqualify Naïve Bayes from being a useful classification model; in practice it remains effective, and Laplacian smoothing can be applied so that unseen n-grams do not zero out the class probabilities in the underlying Bayes formula. In Gaussian Naïve Bayes, feature values are assumed to be normally distributed within the dataset, and the mean feature value is utilized to help classify how likely a feature is to be related to a label. In Multinomial Naïve Bayes, a multinomial distribution is assumed for the features, which is suitable for a discrete domain. Often, a Multinomial Naïve Bayes model will outperform other Naïve Bayes models for text classification. The one practical condition to remember when using a Gaussian Naïve Bayes model with converted text data (expressed in a sparse matrix) is to convert that sparse matrix into a dense matrix first, as sketched below.
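A brief sketch of that practical difference, using a made-up handful of titles and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Tiny stand-in corpus and labels (0 = r/aww, 1 = r/natureismetal) for illustration.
titles = [
    "cute puppy does a trick",
    "lion eating a gazelle",
    "kitten falls asleep mid play",
    "eagle snatches fish from the river",
]
labels = [0, 1, 0, 1]

cv = CountVectorizer()
X = cv.fit_transform(titles)   # a sparse matrix

# MultinomialNB accepts the sparse count matrix directly.
mnb = MultinomialNB().fit(X, labels)

# GaussianNB cannot consume a sparse matrix, so it must be densified first.
gnb = GaussianNB().fit(X.toarray(), labels)

print(mnb.predict(cv.transform(["wolf eating an elk"])))  # -> [1]
```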
Being Efficient with Pipeline and Grid-search Cross-validation
So, we can effectively state that we have six full models to work with by combining the vectorizers with the preset model algorithms. But what if we wanted to check a model with various input parameters? Stop words vs. no stop words? Single-word expressions vs. multi-word expressions? What kind of solver can we use? What kind of penalties can we add? And how can we keep track of all our model performances in a programmatic way?
The process is simple! (what do you know…machine learning can be easy haha)
Let’s introduce scikit-learn’s Pipelines and GridSearchCV (grid-search cross-validation).
A Pipeline is a class provided by scikit-learn which gives us a shortcut for our modeling process. Through a pipeline, we are able to provide step-by-step instructions for how data is transformed. Think of our data as a liquid flowing through a literal pipe. One instruction we could provide would be to change the diameter of the pipe from 1 inch to 2 inches as the liquid travels 10 feet through the pipe. Another instruction would be to change the diameter from 2 inches to 3 inches another 10 feet later. Once we instantiate our pipeline model (actually build out our pipe), the liquid (our data) flows through the pipe, its surface area changing at each diameter change along the way (each pipeline step and its parameters). The liquid flowing out of the pipe (the data output from the pipeline) emerges with a 3-inch diameter (the transformed data).
The next useful tool, working in conjunction with Pipelines, is a cross-validated grid search. A grid search over a Pipeline allows us to try various combinations of model and vectorizer parameters and ultimately helps us discover the best-performing model. An example of this written in Python is shown below:
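(The parameter grid below is illustrative; the step names tvec and mnb and the searched values are assumptions rather than the exact configuration used in the study.)

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One of the six vectorizer/model combinations: a TFIDF Vectorizer feeding a
# Multinomial Naïve Bayes classifier.
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("mnb", MultinomialNB()),
])

# Each key is "<step name>__<parameter>".
params = {
    "tvec__stop_words": [None, stopwords.words("english")],  # no stop words vs. nltk stop words
    "tvec__ngram_range": [(1, 1), (1, 2)],                    # single words vs. up to two-word expressions
    "mnb__alpha": [0.5, 1.0],                                  # Laplacian smoothing strength
}

# GridSearchCV fits every parameter combination with 5-fold cross-validation
# and keeps the combination with the best mean validation accuracy.
gs = GridSearchCV(pipe, param_grid=params, cv=5)
gs.fit(X_train, y_train)   # X_train / y_train come from the earlier train-test split

print(gs.best_params_)
print(gs.best_score_)
```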
What Results Do We See From Our Machine Learning?
Well, we have successfully implemented models on text data from Reddit, but the ultimate question is: “Did we solve our problem?”
Good practice when solving a data science problem is to base your supervised model on the quality of one metric. Many times, multiple metrics may be used to help uncover the overall characteristics of a model, but it is more appropriate to have one ultimate metric on which to judge your model’s success.
In this case, we are using “accuracy” as the scoring metric for our binary classification problem. Accuracy is the number of true positives plus the number of true negatives, divided by the total number of predictions made. Our baseline score (which is nothing more than guessing one class throughout the entirety of the dataset) would be accurate 50% of the time, so our model must outperform that baseline to be any more useful than a baseline model. In this case, it was found that a TFIDF Vectorizer in conjunction with a Multinomial Naïve Bayes model gave a training accuracy score of ~89% and a testing accuracy score of ~84%. That’s not bad! But it’s not great either!
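A hedged sketch of how those training and testing accuracy scores might be computed from the fitted grid search (gs, X_train, X_test, y_train, and y_test come from the earlier sketches):

```python
from sklearn.metrics import accuracy_score

# Accuracy = (true positives + true negatives) / total predictions.
# Scoring the fitted grid search on both splits exposes the train/test gap
# discussed below; the exact ~89% / ~84% figures depend on the data and seed.
train_accuracy = accuracy_score(y_train, gs.predict(X_train))
test_accuracy = accuracy_score(y_test, gs.predict(X_test))

print(f"Training accuracy: {train_accuracy:.2%}")
print(f"Testing accuracy:  {test_accuracy:.2%}")
```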
We can only say that our problem was partially solved. In a world of social media where millions of submissions may pass through Reddit on a daily basis, our text classification model was only 84% successful at differentiating a small set of text data with clear polarity. Had the merged content been more similar, we could expect our model to perform even worse. Our model is also showing some overfitting and had a misclassification rate of 16%.
So What Do We Learn From This?
Machine learning with text data is difficult to accomplish successfully for a variety of reasons. To start, recall the popularity of the submissions in our dataset. Most of our data was unpopular and possibly did not face the scrutiny needed to confirm its association with a subreddit. One could argue that we could also use the numeric features of comments, upvote score, and awards, but this causes problems when we come across equally popular submissions. Data is inherently “dirty,” and we will not always have access to the most effective data (especially when our text data includes misspellings).
Machine learning has its limitations. Today we analyzed how a few models with various parameters can try to classify labeled content. Often, text data will not come with labels; this is where unsupervised learning is required. For future work on this dataset, we may consider scraping more data to analyze, experimenting with different classification models, or incorporating more features. To see the full study, be sure to check out the GitHub Enterprise repository where the exact project was studied in its entirety (https://git.generalassemb.ly/chriskuz/project_3).
Relevant Sources: