Predicting Stock Prices with Reddit Comments

Author: Katie Kemp

Motivation

In January 2021, the subreddit r/wallstreetbets became very famous for its influence on the stock market. Reddit is a social media website where users can semi-anonymously post content and comment on it. It allows users with similar interests to engage on topics, and it filters information based on positive engagement. The stock market has long been controlled by the rich, as the more money a person has, the more influence they have on the market. By communicating in real time, Reddit users in r/wallstreetbets were able to work together to influence the market, most notably causing the stock for Gamestop, GME, to rise from about $40 on January 20 to about $450 on January 28. They did this to spite the hedge funds who had bet against Gamestop, essentially gambling money on the expectation that GME would not rise past a certain price. To understand this better, see this article: https://www.cnn.com/2021/12/19/investing/stocks-week-ahead-reddit-wallstreetbets-gamestop/index.html.

Clearly there are big gains to be earned through understanding the stock market and communicating with others on r/wallstreetbets, but it takes a long time to sift through comments and know when to buy and when to sell. Perhaps there is a way to aggregate the information on r/wallstreetbets to inform a trader when to buy and sell without reading the actual prose.

The simplest indicator of “big gains” might be that everyone is talking about a stock. By setting up a bot to read r/wallstreetbets comments and count how often a stock is mentioned, we can measure how much people are talking about it. We can then get data about the stock price, see whether it is possible to predict the change in price a certain amount of time in advance, and examine how accurate the prediction is. Even if we cannot predict the exact stock price very well, we may still be able to give a good idea of when to sell in order to minimize losses, or at least begin to understand the correlation between people talking about a stock and its price.

Data Collection

Obtaining Data from Reddit

There are several ways to set up a Reddit bot in order to obtain information about comment and post content. After experimenting with the raw Reddit API, PRAW, PSAW, and PMAW, I ended up going with PMAW. PRAW is a Python wrapper for the official Reddit API, while PSAW and PMAW are Python wrappers for the Pushshift API, a third-party archive of Reddit data. Each adds a layer of abstraction that makes the underlying API easier to use. The Reddit API alone does not let you request comments or posts within a given time range. You can work your way back to the time period you wish to examine by getting a certain number of posts before a given post, but this can be complicated and time consuming. The Pushshift libraries allow you to do this and more, and PMAW is specialized for large datasets, so that is what I went with. To learn more about using the Reddit API, check out this tutorial: https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c, and for a quickstart on PMAW, check out https://medium.com/swlh/how-to-scrape-large-amounts-of-reddit-data-using-pushshift-1d33bde9286.

It’s important to note that Pushshift has an aggregate functionality, which can do in a single line a lot of the data analysis that I will have to do manually. You use it by passing the aggs parameter and specifying the frequency at which to aggregate information. Then you can get data about how frequently a word is used each hour. From my research, the aggs parameter appears to be disabled, but if it is available you can read about it here: https://github.com/pushshift/api.
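If the aggs parameter is unavailable, similar counts can be computed locally with pandas. The following is only a minimal sketch on made-up comments: the column names 'body' and 'created_utc' match the fields collected in this tutorial, while the sample rows and the helper name mentions_per_period are hypothetical.

```python
import pandas as pd

def mentions_per_period(df, word, freq="D"):
    """Count comments whose body contains `word`, grouped by time period."""
    # Convert epoch seconds to timestamps so pandas can bucket them.
    times = pd.to_datetime(df["created_utc"], unit="s")
    # Case-insensitive, literal (non-regex) substring match.
    hits = df["body"].str.contains(word, case=False, regex=False)
    # Round each timestamp down to the start of its period and sum the matches.
    return hits.groupby(times.dt.floor(freq)).sum()

# Made-up comments: two on 12/15/21 and two on 12/16/21 (UTC).
sample = pd.DataFrame({
    "body": ["TSLA to the moon", "I like Tesla", "buying tsla calls", "GME forever"],
    "created_utc": [1639569600, 1639573200, 1639656000, 1639659600],
})
daily = mentions_per_period(sample, "tsla", freq="D")
print(daily)
```

Passing a different frequency string buckets the same data at a different granularity, for example by hour instead of by day.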

It took a lot of experimenting with the capabilities of the different libraries, but once I decided on what information I wanted, the process boiled down to something pretty simple. First, install and import pmaw.

!pip3 install pmaw
import datetime as dt  # used to build the timestamps below
import pmaw
import pandas as pd

Next, set up the first time period you want to collect data for. For me, it was the 24 hours prior to 12/15/21 at noon.

seconds_per_day = 60 * 60 * 24
before = int(dt.datetime(2021,12,15,12,0).timestamp())
after = before - seconds_per_day
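One thing to be aware of is that calling timestamp() on a datetime without a timezone interprets it in the local timezone of the machine running the code. The sketch below shows one way to make the windows explicit and reproducible. Whether the original noon was meant as UTC or local time is not stated, so using UTC here is an assumption, and the helper name daily_windows is hypothetical.

```python
import datetime as dt

SECONDS_PER_DAY = 60 * 60 * 24

def daily_windows(end, days):
    """Return (after, before) epoch-second pairs for `days` consecutive
    24-hour windows, walking backwards in time from `end`."""
    before = int(end.timestamp())
    windows = []
    for _ in range(days):
        windows.append((before - SECONDS_PER_DAY, before))
        before -= SECONDS_PER_DAY
    return windows

# An explicit UTC timezone makes the result independent of the local clock.
end = dt.datetime(2021, 12, 15, 12, 0, tzinfo=dt.timezone.utc)
windows = daily_windows(end, 3)
for after, before in windows:
    print(after, before)
```

Each pair can be passed as the after and before arguments of a request, and consecutive windows share an endpoint so no time is skipped.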

If desired, set a specific subreddit to target and the number of requests you want to make at a time.

subreddit = 'wallstreetbets'
limit = 100

Initialize the 'PushshiftAPI' object and a 'pandas' dataframe.

api = pmaw.PushshiftAPI()
df = pd.DataFrame()

As I already said, I decided to collect data from noon to noon on the next day, and here I set up a loop to do this for the 90 consecutive days before 12/15.

rows = []  # collect one dict per comment; DataFrame.append was removed in pandas 2.0
for i in range(90):
    # Request up to `limit` comments posted in the current 24-hour window.
    comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
    for comment in comments:
        rows.append({'author': comment['author'], 'body': comment['body'], 'created_utc': comment['created_utc'], 'score': comment['score'], 'awards': comment['total_awards_received']})
    # Slide the window back by one day.
    before -= seconds_per_day
    after = before - seconds_per_day
df = pd.DataFrame(rows)

You can use the 'search_comments()' function to search Reddit comments, or the 'search_submissions()' function to search posts. Using the API documentation (https://reddit-api.readthedocs.io/en/latest/) I was able to set the right parameters for what I needed. Next, I added each of the 100 comments to the dataframe, extracting only the fields that I thought would be potentially useful.

'author', 'body', and 'awards' are self-explanatory, but 'created_utc' is the time the comment was created, measured in seconds since the epoch, and the 'score' is given by Reddit based on positive engagement (mainly upvotes vs. downvotes). There are more interesting fields to be taken from a submission than a comment, but I found scanning comments was more useful for what I wanted to measure, so I did not extract posts. To see all the information that pmaw provides from a request, just print it out. You will see a dictionary where most keys are self-explanatory, and any that are not can be researched further in pmaw or the Reddit API itself.

Lastly, I saved the dataframe as a csv, as this data took about 30 minutes to retrieve. I made 90 requests of 100 records each, and each request takes a significant amount of time. The Reddit API returns a maximum of 100 records per request, at a limit of 60 requests per minute. If you set limit higher than 100, pmaw knows to make multiple requests. In practice, we do not get anywhere close to the maximum request rate, so be careful when making a large number of requests without saving in between, as it would be easy to lose a large amount of data.
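One way to save in between requests is to append each batch to a csv on disk as soon as it arrives. This is just a sketch of that idea and not what I ran for this tutorial: the helper name append_batch, the file name, and the sample rows are all made up for illustration.

```python
import os
import tempfile
import pandas as pd

def append_batch(rows, path):
    """Append a batch of comment dicts to a csv, writing the header only once."""
    batch = pd.DataFrame(rows)
    write_header = not os.path.exists(path)
    batch.to_csv(path, mode="a", header=write_header, index=False)

# Demonstration with made-up rows; in a real loop, each call would
# receive the comments returned for one 24-hour window.
checkpoint = os.path.join(tempfile.mkdtemp(), "comments_checkpoint.csv")
append_batch([{"author": "a", "body": "TSLA", "score": 1}], checkpoint)
append_batch([{"author": "b", "body": "GME", "score": 2}], checkpoint)

saved = pd.read_csv(checkpoint)
print(saved)
```

If the run is interrupted, everything written before the interruption is still on disk.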

To save the csv, make a zip file with the desired title and use 'to_csv()' to put the pandas dataframe inside.

compression_opts = dict(method='zip', archive_name='comments.csv')
df.to_csv('comments.zip', index=False, compression=compression_opts)

The rest of the tutorial will use the csv I saved here, in order to protect against data loss.
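Reading the zipped csv back in is a single call, since pandas infers the compression from the '.zip' extension when the archive holds one file. The round trip below is a small sketch using a made-up one-row dataframe and a temporary directory in place of the real data.

```python
import os
import tempfile
import pandas as pd

# A made-up stand-in for the collected comments.
df = pd.DataFrame({"author": ["a"], "body": ["TSLA"], "created_utc": [1639569600]})

path = os.path.join(tempfile.mkdtemp(), "comments.zip")
compression_opts = dict(method="zip", archive_name="comments.csv")
df.to_csv(path, index=False, compression=compression_opts)

# The compression is inferred from the file extension.
loaded = pd.read_csv(path)
print(loaded)
```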

Obtaining Stock Market Data

Unlike Reddit content, stock market data is readily available in many locations. I used Yahoo Finance, where I searched the ticker for each stock of interest. I chose Tesla (TSLA), Amazon (AMZN), Pfizer (PFE), and Gamestop (GME) because these are stocks that I have investments in and/or I thought would be interesting to look at. You can then click “historical data” for the stock, request data for a specific time period at a specific frequency, and download it as a csv.

I uploaded the csvs to this folder so that they could be imported as pandas dataframes. For a quickstart on Yahoo Finance, check out https://help.yahoo.com/kb/SLN2311.html.

Data Processing

The first step to data processing is importing it into a dataframe, as I did for each stock, as well as the comments. I also imported all of the libraries that I will use throughout the tutorial.

For a lot of this tutorial, code is repetitive for each stock, so I will only add detailed comments and explanation for Tesla.

The main data processing necessary for the stock market data involved deciding on what price metric to use. I decided to take an average of the open and close price each day because that seems like a fair method that wouldn't make the stock look more or less expensive than it actually is. I also had to format the date properly so that it would be easy to compare with the date that I will retrieve from the Reddit comments, and make sense to a human looking at a graph.
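As a rough illustration of this step, the sketch below averages the open and close price and builds a date-to-price dictionary. The rows are made up, laid out in the column format of a Yahoo Finance historical-data export, and the names tsla and price_by_date are only for illustration.

```python
import io
import pandas as pd

# Made-up rows in the column layout of a Yahoo Finance export.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2021-12-13,1001.0,1005.0,950.0,967.0,967.0,100
2021-12-14,945.0,967.0,930.0,959.0,959.0,200
"""
tsla = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Use the average of the open and close price as the daily price.
tsla["price"] = (tsla["Open"] + tsla["Close"]) / 2

# Format dates as YYYY-MM-DD strings so they are easy to read and to
# compare with dates derived from the Reddit comments.
tsla["date"] = tsla["Date"].dt.strftime("%Y-%m-%d")
price_by_date = dict(zip(tsla["date"], tsla["price"]))
print(price_by_date)
```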

Next, I processed the Reddit comments. I used regexes to find instances of the stock being mentioned in the body of a comment. I looked for two things: the name of the company, like 'Tesla', or the ticker, like 'TSLA', in any combination of upper and lower case. An improvement that could definitely be made would be to include other relevant terms, like 'Elon' or 'Bezos'. I also converted the date from epoch time. Lastly, I looked up the price for each stock in the corresponding dictionaries I made in the last step, and added that to the dataframe. Since Yahoo Finance doesn't list the stock price on days the market is not open, I just carried over the price from Friday on Saturdays and Sundays, so that valuable comment information didn't go to waste. All that is left to do is update the dataframe.
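These steps could look something like the sketch below. The comments, the Friday price, and the exact regex are assumptions made for illustration. The forward fill relies on the rows being sorted by time and on the first row having a known price.

```python
import pandas as pd

# Made-up comments from a Friday, Saturday, and Sunday (UTC).
comments = pd.DataFrame({
    "body": ["TSLA calls printing", "tesla is overvalued", "holding GME"],
    "created_utc": [1639137600, 1639224000, 1639310400],
})
comments = comments.sort_values("created_utc").reset_index(drop=True)

# Match the company name or the ticker as a whole word, in any case.
comments["mentions_tesla"] = comments["body"].str.contains(
    r"\b(?:tesla|tsla)\b", case=False, regex=True
)

# Convert epoch seconds to a YYYY-MM-DD date string.
comments["date"] = pd.to_datetime(comments["created_utc"], unit="s").dt.strftime("%Y-%m-%d")

# Hypothetical price dictionary that only has an entry for Friday.
price_by_date = {"2021-12-10": 984.0}

# Days missing from the dictionary become NaN, then take the last known price.
comments["price"] = comments["date"].map(price_by_date).ffill()
print(comments[["date", "mentions_tesla", "price"]])
```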

Exploratory Analysis and Data Visualization

After making the necessary changes to the data, it is time to take a first look at what we have. I graphed the price over time along with the comments over time, and saw something like what I expected. The number of comments mentioning the stock peaks a bit before the price peaks. I looked up the dates of the corresponding maxima and found what I call the "lag time", or the time by which the stock's response lags behind the comment volume.
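The lag time can also be computed directly instead of being read off the graph. Here is a minimal sketch with made-up daily numbers, where the column names mentions and price are hypothetical:

```python
import pandas as pd

# Made-up daily mention counts and prices.
dates = pd.date_range("2021-11-01", periods=6, freq="D")
daily = pd.DataFrame(
    {
        "mentions": [2, 9, 4, 3, 1, 2],
        "price": [100.0, 101.0, 103.0, 110.0, 104.0, 102.0],
    },
    index=dates,
)

# Lag time = date of the price peak minus date of the mention peak.
lag = daily["price"].idxmax() - daily["mentions"].idxmax()
lag_days = lag.days
print(lag_days)
```

Comparing only the two maxima is a crude estimate. It ignores every other day in the series, so a single outlier can change the result.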

Tesla

Analysis, Hypothesis Testing, and Machine Learning

From the data visualization we can now make a hypothesis that an increase in comment volume is correlated with an increase in stock price.

My null hypothesis is that the change in stock price is not correlated with the change in comment volume.

I did a linear regression in order to examine this hypothesis.
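For readers who want to try this themselves, one common way to run a simple linear regression and get a p-value is scipy.stats.linregress. The sketch below uses made-up numbers and is not necessarily the exact setup used for the results in this tutorial.

```python
from scipy import stats

# Made-up daily mention counts and prices.
mentions = [2, 9, 4, 3, 1, 2, 7, 5]
price = [100.0, 108.0, 103.0, 102.0, 99.0, 101.0, 106.0, 104.0]

# Regress price on mentions. The p-value tests the null hypothesis
# that the slope is zero, i.e. that there is no linear relationship.
result = stats.linregress(mentions, price)
print("slope:", result.slope)
print("p-value:", result.pvalue)
```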

Tesla

A p-value of 0.05 or less is considered significant, so this linear regression passes the test. Comment volume does in fact seem to be correlated with price. However, I wasn't satisfied with this because, as I stated before, I noticed that the price response lags behind the comments. I decided to try shifting the price data back by the lag time, doing another linear regression, getting the predictions, and shifting them back to the actual date.
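The shifting idea can be sketched with the pandas shift() method. The data below is made up so that the price follows the mentions by two days, and the names daily, lag_days, and predicted are only for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up data in which the price reacts to mentions two days later.
daily = pd.DataFrame({
    "mentions": [1, 2, 9, 4, 3, 1, 2, 7, 5, 2],
    "price": [100.0, 100.0, 101.0, 102.0, 109.5, 104.0, 102.5, 101.0, 102.0, 107.5],
})
lag_days = 2

# Pair each day's mentions with the price `lag_days` later.
daily["future_price"] = daily["price"].shift(-lag_days)
paired = daily.dropna(subset=["future_price"])
fit = stats.linregress(paired["mentions"], paired["future_price"])

# A prediction made from day t's mentions applies to day t + lag_days,
# so shift the predictions forward to line them up with the actual dates.
predicted = (fit.intercept + fit.slope * daily["mentions"]).shift(lag_days)
print("p-value:", fit.pvalue)
print(predicted)
```

The first lag_days predictions are empty because there are no earlier mentions to base them on.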

The p-value is indeed much lower, and the predictions much better. The predictions are still not great when the price is lower, but the model does tell us when the max will be, which tells us when to sell the stock.

Next I will use the same approach on the other three stocks, without unnecessary re-explanation. I skipped straight to the linear regression adjusted for lag time.

Amazon

The p-value here is significant, although not great. Amazon has way fewer mentions overall, so the model is not nearly as good. A larger sample size would be helpful.

Pfizer

Since the lag for Pfizer is so long, it wasn't really possible to build a model and the p-value is not significant. More data is needed to predict Pfizer prices.

Gamestop

Similarly, Gamestop was not mentioned enough in this period to build a model.

Insights

I was able to build a statistically significant model for Tesla and Amazon with just 100 comments per day for 90 days. This volume of data only scratches the surface of how much is out there. Obtaining more comments at more precise intervals would likely improve the models. Other factors could be taken into account, like score, in order to give more weight to more important mentions. Posts could also be incorporated, as they have more metrics. As mentioned earlier, other words besides the company name or ticker could be used as predictors, and could be given different weights depending on how closely related they are. An entirely separate analysis could be done on which terms are most closely related. Additionally, rather than predicting the actual price, we could predict the change in price. Using the predicted slope values, we could then take the antiderivative and obtain price predictions.
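With daily data, taking the antiderivative amounts to a running sum of the predicted changes, offset by a known starting price. A tiny sketch with made-up numbers:

```python
import numpy as np

start_price = 100.0
# Hypothetical predicted day-over-day price changes.
predicted_changes = np.array([1.5, -0.5, 2.0, -1.0])

# Running sum of the changes, offset by the known starting price.
predicted_prices = start_price + np.cumsum(predicted_changes)
print(predicted_prices)
```

One drawback of this approach is that errors in the predicted changes accumulate, so the price predictions can drift further from reality the longer the horizon.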

It was difficult to obtain enough records in a short amount of time, but if the data were being downloaded in real time each day, we could get a much better prediction. This technique could even be expanded beyond the stock market, to predict when other events might occur, such as those related to the pandemic.

Hopefully this tutorial has been useful in understanding how to scrape data from Reddit and Yahoo Finance, and how we can apply the data science pipeline in order to obtain meaningful results.

