This post is the first in what will hopefully become a series of ‘field notes’ on capturing social media data for research. Data capture is one of the main features of most of the tools we work on in CAT4SMR. In the process of building and maintaining these tools, we often need to engage with platforms at a relatively low level to figure out how we can robustly capture their content for analysis. By sharing these notes we hope to make life a little easier for researchers and developers working towards similar goals. This is not a series of tutorials; rather, the goal is to share the various pitfalls encountered while building capture software, in the hope that this can help others avoid them.
One of the platforms we make available through our tool 4CAT is Reddit. Reddit is, essentially, a platform where anyone can create a forum (subreddit), in which people can then post links or bits of text that others can respond to. People discuss an extremely wide variety of topics on the platform, and it has hosted various subreddits of interest to researchers, including politically extreme communities, such as the Donald Trump fan subreddit r/The_Donald (which was eventually banned).
Reddit data for research is often retrieved via Pushshift, a service hosted by Jason Baumgartner that seeks to collect every single post on Reddit and then make it available via an API that offers, among other things, full text search capabilities. This is very useful if you want to, for example, capture all posts on Reddit that mention the keyword ‘Trump’ – which may sound trivial but is not something that Reddit’s own API offers! We have used this feature of Pushshift for our own research as well, for example in this article on the spread of the QAnon conspiracy theory on various online platforms including Reddit.
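To give an impression of what such a keyword query looks like in practice, the sketch below (in Python, using the requests library) asks Pushshift's search endpoint for recent submissions mentioning ‘trump’. The endpoint and parameters follow Pushshift's informal documentation and may change without notice, so treat this as an illustration rather than a stable contract.

```python
import requests

# Sketch: ask Pushshift for recent submissions mentioning 'trump'.
# Endpoint and parameters reflect Pushshift's informally documented API
# and may change without notice.
response = requests.get(
    "https://api.pushshift.io/reddit/search/submission/",
    params={"q": "trump", "size": 100, "sort": "desc"},
    timeout=30,
)
response.raise_for_status()

for submission in response.json()["data"]:
    print(submission["id"], submission.get("title", ""))
```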
While Pushshift is very useful and remains available for free, it is somewhat lacking in documentation. Though Baumgartner co-authored a paper on the Reddit dataset, there is no central documentation on how its data is captured, and the source code of its data collection pipeline is not available. Furthermore, occasional tests show that its coverage, especially of recent data, can be spotty. For this reason, it is worth exploring how Pushshift collects its data and whether replicating it would be feasible. What follows is a collection of notes that outline how one might capture a complete Reddit dataset. Implementing this is left as an exercise for the reader, but should be relatively straightforward if you bear the following in mind 🙂
Interfacing with Reddit
Reddit offers a fairly solid official API through which one can retrieve data. This is in essence the same API that Reddit’s own website and apps use, and as such it is well supported. However, it lacks various features important for research, such as robust keyword search and a way to retrieve a full list of subreddits. It is therefore not very useful for discovery, but should be the first choice for retrieval of data. In other words, if we can figure out through some other means which exact posts or threads we want to capture, we can use the official API to retrieve the data itself, for example using the popular Python library PRAW.
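As a minimal sketch of what such retrieval looks like with PRAW: the credentials below are placeholders, which you would obtain by registering an application in Reddit’s preferences.

```python
import praw

# Sketch: fetch a single known thread through the official API with PRAW.
# client_id, client_secret and user_agent are placeholders; real values
# come from registering an app with Reddit.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-capture-field-notes (by u/your_username)",
)

submission = reddit.submission(id="ezz3td")
print(submission.title, submission.num_comments)
```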
In terms of discovery, our goal is simple and the same as Pushshift’s: to capture every single post and thread made on Reddit. As mentioned, the official API has no way to retrieve a complete list of posts or threads. Since we want to capture everything, and we can use the official API to capture posts once we know their ID, we need an alternative method to create a list of all valid IDs, which can then be requested from the API.
This turns out to be simple enough: Reddit uses globally sequential numeric IDs for its items, usually represented in base 36 in URLs and the Reddit interface: the ID of this thread is ostensibly ‘ezz3td’, but ‘906950929’ in the more familiar base 10. This means that if we simply retrieve threads by ID sequentially, starting at an arbitrary number that is continuously increased by one, we will eventually capture every single thread made since that initial thread. We can then wait a while and resume capturing from the last captured ID, and continuously repeat this to maintain a complete dataset.
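Converting between the two representations requires nothing beyond Python’s standard library, as the sketch below illustrates; the helper names are our own.

```python
import string

def base36_decode(item_id: str) -> int:
    """Convert a Reddit base 36 ID such as 'ezz3td' to an integer."""
    return int(item_id, 36)

def base36_encode(number: int) -> str:
    """Convert an integer back to Reddit's base 36 representation."""
    digits = string.digits + string.ascii_lowercase
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 36)
        encoded = digits[remainder] + encoded
    return encoded or "0"

start = base36_decode("ezz3td")  # 906950929
# The next 100 thread IDs, in the order Reddit assigns them:
next_ids = [base36_encode(i) for i in range(start + 1, start + 101)]
```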
One potential problem is the rate limiting of Reddit’s API. Can we capture data quickly enough to keep up with the rate at which new items are created? The API allows 60 requests per minute per API key, with up to 100 items per call when requesting items by their ID, which works out to a maximum throughput of 100 items per second per key. If more than 100 items are created per second, a single API key would not be sufficient, as activity would outpace the rate at which new data can be requested.
After running some benchmarks, it seems that as of mid-2021 threads are posted at a rate of (roughly!) 40 per second, which is well within the constraints set by the API’s rate limits. It should therefore be possible to comfortably capture every single thread created on Reddit on an ongoing basis with a single API key.
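One rough way to estimate this rate yourself is to sample the newest thread ID on r/all twice and divide the difference by the elapsed time; whether this matches how the figure above was measured is not stated, so take it purely as an illustration. The sketch assumes the PRAW `reddit` client from the earlier example.

```python
import time

def newest_thread_id(reddit) -> int:
    """Return the newest submission ID on r/all as an integer."""
    newest = next(reddit.subreddit("all").new(limit=1))
    return int(newest.id, 36)

first_sample = newest_thread_id(reddit)
time.sleep(60)
second_sample = newest_thread_id(reddit)
print("threads per second:", (second_sample - first_sample) / 60)
```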
Comments, however, are created at a rate of (roughly!) 100 per second (and this may increase during events of interest such as elections). If the goal is to capture both threads and comments, you would therefore need to use multiple API keys – four or five should suffice, and allow for some margin in case of sudden spikes of activity. While this is technically not allowed by the Reddit API terms of service (which forbid one to “circumvent or exceed limitations on calls and use of the Reddit APIs”), Pushshift has been running for years and most likely uses this same strategy, so it would be possible to simply register multiple Reddit accounts and request separate API keys for each of them.
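A sketch of how batched retrieval across several API keys might be structured is given below. The client credentials are placeholders, and PRAW’s info() call is used to request up to 100 items per request by their ‘fullname’ (a type prefix plus the base 36 ID); the helper function and its names are ours.

```python
from itertools import cycle

import praw

# Sketch: rotate over several PRAW clients (one per registered API key) and
# request items in batches of up to 100 via their fullnames ('t3_' for
# threads, 't1_' for comments). The credentials are placeholders.
clients = cycle([
    praw.Reddit(client_id=cid, client_secret=secret, user_agent="capture-sketch")
    for cid, secret in [("KEY_1", "SECRET_1"), ("KEY_2", "SECRET_2")]
])

def fetch_batch(item_ids, kind="t3_"):
    """Fetch up to 100 threads (t3_) or comments (t1_) by base 36 ID."""
    reddit = next(clients)
    return list(reddit.info(fullnames=[kind + item_id for item_id in item_ids]))

threads = fetch_batch(["ezz3td", "ezz3te", "ezz3tf"])
```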
A final obstacle is that IDs are sometimes skipped for unclear reasons – perhaps due to maintenance on Reddit’s end. If you naively stop capturing when no results are returned for a range of IDs, assuming you have reached the most recent post or thread, you may end up waiting indefinitely while new posts are added with higher IDs. One way to avoid this is to request a recent thread or comment ID from an active subreddit, using the r/all/new API endpoint. If this returns an ID that is higher than the latest captured, you can be sure that IDs have resumed somewhere outside the range you are currently trying to download, and shift the range accordingly.
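A minimal version of that check might look as follows; the function and variable names are ours, and a configured PRAW client is assumed.

```python
def ids_have_jumped(reddit, current_range_end: int) -> bool:
    """Compare the newest thread ID on r/all against the end of the ID
    range currently being captured; if it is higher, IDs have resumed
    beyond that range and capture should skip ahead."""
    newest = next(reddit.subreddit("all").new(limit=1))
    return int(newest.id, 36) > current_range_end
```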
Continuous Reddit data capture is feasible
In summary, capturing all new contributions to Reddit on an ongoing basis is quite possible, provided one uses several (but not that many) API keys. As long as Reddit keeps using sequential IDs, this strategy should remain feasible. The main potential obstacle is the disk space required to store this data: if one were to store full JSON objects for each captured item, data would accumulate at a rate of about 8.5 GB/day (based on a quick sample of one hour of posts and threads). In practice, however, it would not be difficult to store the data more efficiently, and a couple of terabytes should last you for a while. If services like Reddit’s own API or Pushshift are not sufficient for your needs, then capturing Reddit data yourself is a realistic alternative.