Blog

Call for Participation: Internet Research with Foundation Models?

CAT4SMR is convening its closing workshop on the exploration of generative methods in internet research on September 3, 2024 in Amsterdam. Situated within the Digital Methods Initiative, CAT4SMR has been working on tools for capturing and analyzing social media data, such as 4CAT and the YouTube Data Tools, but also on how to develop and implement new and innovative research methods. 

As the incorporation of large pre-trained models (sometimes referred to as “foundation models”, including Large Language Models, Large Vision Models, etc.) into social sciences and humanities (SSH) research moves from novelty to standard, the need to exchange, evaluate, and critique is becoming pressing. This one-day workshop will serve as a forum for investigating how these advanced computational techniques are being integrated into digital research, what opportunities, limitations, and pitfalls they imply, and what this means for academic research.

Our conversations will center on several themes:

  • The evolution of digital methods in light of foundation models, exploring how tools like LLMs and LVMs augment and transform established methodologies and introduce new ones into digital research, from thematic textual and visual analysis to the study of misinformation;
  • The critical examination of best practices in the deployment of LLMs or LVMs, from the art of effective prompting to the nuances of model selection, fine-tuning, and evaluation, all within the context of internet research.

We are particularly interested in fostering discussions about new epistemologies, ethical implications, and practical issues related to the use of these models in research. This includes the careful examination of models’ black-boxed nature and their (potential) imprint on research, as well as the broader ethical landscape that researchers navigate.

We welcome short proposals (about 500 words) for brief presentations on empirical research projects, methodologies, tools, or critiques aligned with our workshop themes, especially those adopting experimental, explorative, or speculative approaches. Potential topics include, but are not limited to:

  • interesting new methods advancing visual and textual analysis;
  • effective strategies for employing pre-trained models in internet research;
  • taming black boxes: observing, testing, and deploying pre-trained models for digital research;
  • the evolution of interpretive practices with the integration of LLMs;
  • the role of pre-trained models as surrogate expert readers;
  • the delegation of interpretative agency to pre-trained models;
  • reflections on the role of pre-trained models in, or next to, qualitative and quantitative research;
  • challenges and opportunities in the ‘traceability’ of digital actors and research procedures;
  • researching platformed models;
  • platform research with pre-trained models and their (mis)uses for medium-sensitive studies.

This will be a one-day meeting with presentations and ample room for discussion. We hope to make this an inspiring event by bringing together scholars and practitioners from diverse disciplines, resulting in fruitful collaborations. We have some ability to fund transport and accommodation, and will provide lunch and dinner.

Deadline for submissions: June 7, 2024

Submit proposals to: cat4smr [diddalidoo] list.uva.nl

Notice of acceptance: June 21, 2024

Workshop date: September 3, 2024

Location: Amsterdam City Centre

Organizers: Erik Borra, Sal Hagen, Stijn Peeters, Bernhard Rieder

Invitation – Capture and Analysis Tools for Social Media Research Workshop (May 8, 2023)

When: May 8th, 13:00-16:00

Where: On site in Amsterdam, City Center Campus (location to follow by email)

Interested in learning how to work with social media data? Join us for the third workshop of CAT4SMR, the initiative that builds and maintains Capture and Analysis Tools for Social Media Research! 

On May 8th in Amsterdam, we will introduce interested scholars to two powerful tools (4CAT and the YouTube Data Tools) for the data-driven analysis of online platforms such as 4chan, Instagram, Reddit, Telegram, Twitter, TikTok, and YouTube. The goal is to help students (Master’s, PhD) and researchers (all levels) integrate social media analysis in their research and teaching.

The first part (13:00-15:00) will be dedicated to the presentation and discussion of 4CAT and the YouTube Data Tools: how and why to use them, how to get up and running, and best practices and pitfalls to watch out for.

During the second part (15:00-16:00), we open the floor to discuss specific research questions and projects, and how to approach them in terms of methodology, logistics, ethics, and so forth. This part of the workshop is optional.

Entry is free, but registration is required. Please sign up here before April 24 to reserve a spot and help us plan the workshop. Space is limited.

The workshop is organized and facilitated by the CAT4SMR project team: Erik Borra, Stijn Peeters, and Bernhard Rieder.

Links:

Signup

Invitation – Capture and Analysis Tools for Social Media Research Workshop (November 22)

(update: due to limited capacity, sign-ups for the workshop are now closed)

When: Nov 22nd, 13:30-17:00 (CET)

Where: Online on Zoom

Interested in learning how to work with social media data? Join us for the second workshop of CAT4SMR, the initiative that builds and maintains Capture and Analysis Tools for Social Media Research! 

On Nov 22nd on Zoom, we will introduce interested scholars to two powerful tools (4CAT and the YouTube Data Tools) for the data-driven analysis of online platforms such as 4chan, Instagram, Reddit, Telegram, Twitter, TikTok, and YouTube. The goal is to help students (Master’s, PhD) and researchers (all levels) integrate social media analysis in their research and teaching.

The first part (13:30-15:30 CET) will be dedicated to the presentation and discussion of 4CAT and the YouTube Data Tools: how and why to use them, how to get up and running, and best practices and pitfalls to watch out for.

In the second part (15:45-17:00 CET), we split into small groups and open the floor to discuss specific research questions and projects, and how to approach them in terms of methodology, logistics, ethics, and so forth. This part of the workshop is optional, but we ask those participating for a short project description to help us prepare.

Entry is free, but registration is required. Please sign up here before November 8th to reserve a spot and help us plan the workshop. Space is limited.

The workshop is organized and facilitated by the CAT4SMR project team: Erik Borra, Stijn Peeters, and Bernhard Rieder.

Links:

https://tinyurl.com/cat4smr-workshop-november

Location:

Zoom (link will be provided beforehand)            

Invitation – Capture and Analysis Tools for Social Media Research Workshop (May 23)

(update: due to limited capacity, sign-ups for the workshop are now closed)

When: May 23rd, 10:00-16:00

Where: On site in Amsterdam

Interested in learning how to work with social media data? Join us for the first workshop of CAT4SMR, the initiative that builds and maintains Capture and Analysis Tools for Social Media Research! 

On May 23rd in Amsterdam, we will introduce interested scholars to two powerful tools (4CAT and the YouTube Data Tools) for the data-driven analysis of online platforms such as 4chan, Instagram, Reddit, Telegram, Twitter, TikTok, and YouTube. The goal is to help students (Master’s, PhD) and researchers (all levels) integrate social media analysis in their research and teaching.

The morning (10:00-12:30) will be dedicated to the presentation and discussion of 4CAT and the YouTube Data Tools: how and why to use them, how to get up and running, and best practices and pitfalls to watch out for.

In the afternoon (13:30-16:00), we split into small groups and open the floor to discuss specific research questions and projects, and how to approach them in terms of methodology, logistics, ethics, and so forth. This part of the workshop is optional.

Entry is free, but registration is required. A catered lunch is included. Please sign up here before May 16th to reserve a spot and help us plan the workshop. Space is limited.

The workshop is organized and facilitated by the CAT4SMR project team: Erik Borra, Stijn Peeters, and Bernhard Rieder.

Links:

https://www.buzzhouse.co

https://tinyurl.com/cat4smr-workshop-signup

Location:

BuzzHouse (BG5)

Oudezijds Achterburgwal 233-237

1012 DL Amsterdam                               

Data Capture Field Notes: Reddit

This post is the first in what will hopefully become a series of ‘field notes’ on capturing social media data for research. Data capture is one of the main features of most of the tools we work on in CAT4SMR. In the process of building and maintaining these tools, we often need to engage with platforms on a relatively low level to figure out how we can robustly capture their content for analysis. By sharing these notes we hope to make life a little easier for researchers and developers working toward similar goals. This is not a series of tutorials; rather, the goal is to share the various pitfalls encountered while building capture software, in the hope that this can help others avoid them.

One of the platforms we make available through our tool 4CAT is Reddit. Reddit is, essentially, a platform where anyone can create a forum (subreddit), in which people can then post links or bits of text that others can respond to. People discuss an extremely wide variety of topics on the platform, and it has hosted various subreddits of interest to researchers, including politically extreme communities, such as the Donald Trump fan subreddit r/The_Donald (which was eventually banned).

Reddit data for research is often retrieved via Pushshift, a service hosted by Jason Baumgartner that seeks to collect every single post on Reddit and then make it available via an API that offers, among other things, full text search capabilities. This is very useful if you want to, for example, capture all posts on Reddit that mention the keyword ‘Trump’ – which may sound trivial but is not something that Reddit’s own API offers! We have used this feature of Pushshift for our own research as well, for example in this article on the spread of the QAnon conspiracy theory on various online platforms including Reddit.

While Pushshift is very useful and remains available for free, it is somewhat lacking in documentation. Though Baumgartner co-authored a paper on the Reddit dataset, there is no central documentation on how its data is captured, and the source code of its data collection pipeline is not available. Furthermore, occasional tests show that its coverage, especially of recent data, may be spotty. For this reason, it is worth exploring how Pushshift collects its data and whether replicating it would be feasible. What follows is a collection of notes outlining how one might capture a complete Reddit dataset. Implementing this is left as an exercise for the reader, but should be relatively straightforward bearing the following in mind 🙂

Interfacing with Reddit

Reddit offers a fairly solid official API through which one can retrieve data. This is in essence the same API used by Reddit’s own website and apps, and as such it is well supported. However, it lacks various features important to research, such as robust keyword search and a way to retrieve a full list of subreddits. It is therefore not very useful for discovery, but it should be the first choice for retrieval: if we can figure out through some other means which exact posts or threads we want to capture, we can use the official API to retrieve the data itself, for example using the popular Python library PRAW.
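
As a minimal sketch of this retrieval step, assuming placeholder API credentials (register a script app with Reddit to obtain real ones), one might fetch items by their ‘fullname’ – Reddit’s type prefix plus base-36 ID – with PRAW:

```python
# Minimal retrieval sketch using PRAW. The credentials are
# placeholders; register a "script" app with Reddit to obtain real
# ones. A "fullname" is a type prefix plus a base-36 ID:
# t3_ for threads (submissions), t1_ for comments.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="capture-notes/0.1 (research use)",
)

for item in reddit.info(fullnames=["t3_ezz3td"]):
    # Submissions carry a title, comments carry a body.
    print(item.id, getattr(item, "title", None) or getattr(item, "body", ""))
```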

In terms of discovery, our goal is simple and the same as Pushshift’s: to capture every single post and thread made on Reddit. As mentioned, the official API has no way to retrieve a complete list of posts or threads. Since we want to capture everything, and we can use the official API to capture posts once we know their ID, we need an alternative method to create a list of all valid IDs, which can then be requested from the API.

This turns out to be simple enough: Reddit uses globally sequential numeric IDs for its items, usually represented in base 36 in URLs and the Reddit interface. The ID of this thread, for example, is ostensibly ‘ezz3td’, but ‘906950929’ in the more familiar base 10. This means that if we simply retrieve threads by ID sequentially, starting at an arbitrary number that is continuously increased by one, we will eventually capture every single thread made since that initial thread. We can then wait a while, resume capturing from the last captured ID, and repeat this continuously to maintain a complete dataset.
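
To illustrate: Python parses base 36 natively, and a small helper (our own, not part of any library) handles the encoding back, so the sequential scan reduces to plain integer arithmetic:

```python
# Base-36 IDs map directly onto integers, so the sequential scan is
# plain integer arithmetic. Python's int() parses base 36 natively;
# encoding back requires a small helper.
import string

ALPHABET = string.digits + string.ascii_lowercase  # 0-9, then a-z

def to_base36(n: int) -> str:
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)) or "0"

start = int("ezz3td", 36)    # 906950929
print(to_base36(start + 1))  # "ezz3te": the thread created right after
```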

One potential problem is the rate limiting of Reddit’s API: can we capture data quickly enough to keep up with the rate at which new items are created? The API allows 60 requests per minute per API key, with up to 100 items per call, requested by their ID; that is at most 6,000 items per minute, or an average of 100 items per second. If more than 100 items are created per second, a single API key would not be sufficient, as activity would outpace the rate at which new data can be requested.

After running some benchmarks, it seems that as of mid-2021 threads are posted at a rate of (roughly!) 40 per second, which is well within the constraints set by the API’s rate limits. It should therefore be possible to comfortably capture every single thread created on Reddit on an ongoing basis with a single API key.

Comments however are created at a rate of (roughly!) 100 per second (and this may increase during events of interest such as elections). If the goal is to capture both threads and comments, you would therefore need to use multiple API keys – four or five should suffice, and allow for some margin in case of sudden spikes of activity. While this is technically not allowed by the Reddit API terms of service (which forbid one to “circumvent or exceed limitations on calls and use of the Reddit APIs”), Pushshift has been running for years and most likely uses this same strategy, so it would be possible to simply register multiple Reddit accounts and request separate API keys for all of them.
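
As a rough sketch of such a setup – the credential pairs below are placeholders, and round-robin rotation is one strategy among several – each PRAW client tracks its own rate limit, so cycling through n clients yields roughly n times the single-key throughput:

```python
# Hypothetical sketch: spread request batches over several API keys.
# Each PRAW client tracks its own rate limit, so cycling through n
# clients yields roughly n times the single-key throughput.
from itertools import cycle
import praw

CREDENTIALS = [  # placeholder (client_id, client_secret) pairs
    ("ID_1", "SECRET_1"),
    ("ID_2", "SECRET_2"),
    ("ID_3", "SECRET_3"),
    ("ID_4", "SECRET_4"),
]

clients = cycle([
    praw.Reddit(client_id=cid, client_secret=secret,
                user_agent="capture-notes/0.1 (research use)")
    for cid, secret in CREDENTIALS
])

def fetch_batch(fullnames):
    """Fetch up to 100 items on the next client in the rotation."""
    return list(next(clients).info(fullnames=fullnames))
```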

A final obstacle is that IDs are sometimes skipped for unclear reasons – perhaps due to maintenance on Reddit’s end. If you naively stop capturing when no results are returned for a range of IDs, assuming you have reached the most recent post or thread, you may end up waiting indefinitely while new posts are added with higher IDs. One way to avoid this is to request a recent thread or comment ID from an active subreddit, using the r/all/new API endpoint. If this returns an ID that is higher than the latest captured, you can be sure that IDs have resumed somewhere beyond the range you are currently trying to download, and you can shift the range accordingly.
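
A sketch of that check, with a hypothetical window parameter controlling how far ahead the newest ID must be before a range is treated as skipped:

```python
# Sketch of the skip check: compare the newest thread ID on r/all
# with the current scan position. The `window` margin is a
# hypothetical tuning parameter for how far ahead the newest ID
# must be before a range is treated as skipped.
def newest_thread_id(reddit) -> int:
    newest = next(reddit.subreddit("all").new(limit=1))
    return int(newest.id, 36)  # Submission.id is a base-36 string

def next_scan_position(reddit, current: int, window: int = 10_000) -> int:
    frontier = newest_thread_id(reddit)
    if frontier > current + window:
        return frontier  # IDs resumed past a skipped range: jump ahead
    return current       # no gap detected, keep scanning sequentially
```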

Continuous Reddit data capture is feasible

In summary, capturing all new contributions to Reddit on an ongoing basis is quite possible, provided one uses several (but not that many) API keys. As long as Reddit keeps using sequential IDs, this strategy should remain feasible. The main potential obstacle is the disk space required to store the data: if one were to store full JSON objects for each comment, data would accumulate at a rate of about 8.5 GB/day (based on a quick sample of one hour of posts and threads). In practice, however, it would not be difficult to store the data more efficiently, and a couple of terabytes should last a while. If services like Reddit’s own API or Pushshift are not sufficient for your needs, capturing Reddit data yourself is thus a realistic alternative.

Starting up the project

After years of working on research software development mostly in our spare time, we are very happy that CAT4SMR (Capture and Analysis Tools for Social Media Research) was funded in May 2020 by the Dutch PDI-SSH (Platform Digital Infrastructure Social Science and Humanities).

We (Erik Borra, Stijn Peeters, and Bernhard Rieder) will be able to use this opportunity to improve and stabilize the social media analysis tools we have been working on over the years: DMI-TCAT, 4CAT, YouTube Data Tools, and others. Who knows, we may even be able to revive Netvizz, which had to close down after changes in Facebook’s API governance.

The project started work in September 2020, beginning with the usual practicalities and some initial improvements to code quality and documentation. 2021 will see much more activity on all fronts. Stay tuned.