twarc2

This post was originally published on Medium but I spent time writing it so I wanted to have it here too.

TL;DR twarc has been redesigned from the ground up to work with the new Twitter v2 API and their Academic Research track. Many thanks for the code and design contributions of Betsy Alpert, Igor Brigadir, Sam Hames, Jeff Sauer, and Daniel Verdeer that have made twarc2 possible, as well as early feedback from Dan Kerchner, Shane Lin, Miles McCain, 李荣蓬, David Thiel, Melanie Walsh and Laura Wrubel. Extra special thanks to the Institute for Future Environments at Queensland University of Technology for supporting Betsy and Sam in their work, and for the continued support of the Mellon Foundation.

Back in August of last year Twitter announced early access to their new v2 API, and their plans to sunset the v1.1 API that has been active for almost the last 10 years. Over the lifetime of their v1.1 API Twitter has become deeply embedded in the media landscape. As magazines, newspapers and television have moved onto the web they have increasingly adopted tweets as a mechanism for citing politicians, celebrities and organizations, while also using them to document current events, generate leads and gather feedback for evolving stories. As a result Twitter has also become a popular object of study for humanities and social science researchers looking to understand the world as reflected, refracted and distorted by/in social media.

On the surface the v2 API update seems pretty insignificant since the shape of a tweet, its parts, properties and affordances, aren’t changing at all. Tweets with 280 characters of text, images and video will continue to be posted, retweeted and quoted. However behind the scenes the representation of a tweet as data, and the quotas that control the rates at which this data can flow between apps and other third party services will be greatly transformed.

Needless to say, v2 represents a big change for the Documenting the Now project. Along with community members we’ve developed and maintained open source tools like twarc that talk directly to the Twitter API to help users to search for and collect live tweets that match criteria like hashtags, names and geographic locations. Today we’re excited to announce the release of twarc v2 which has been designed from the ground up to work with the v2 API and Twitter’s new Academic Research track.

Clearly it’s extremely problematic having a multi-national corporation act as a gatekeeper for who counts as an academic researcher, and what constitutes academic research. We need look no further than the recent experiences of Timnit Gebru and Margaret Mitchell at Google for an example of what happens when research questions run up against the business objectives of capital. We only know their stories because Gebru and Mitchell’s bravely took a principled approach, where many researchers would have knowingly or unknowingly shaped their research to better fit the needs of the company.

So it is important for us that twarc still be usable by people with and without access to the Academic Research Track. But we have heard from many users that the Academic Research Track presents new opportunities for Twitter data collection that are essential for researchers interested in the observability of social media platforms. Twitter is making a good faith effort to work with the academic research community, and we thought twarc should support it, even if big challenges lie ahead.

So why are people interested in the Academic Research Track? Once your application has been approved you are able to collect data from the full history of Tweets, at no cost. This is a massive improvement over the v1.1 access which was limited to a one week window and researchers had to pay for access. Access to the full archive means it’s now possible to study events that have happened in the past back to the beginning of Twitter in 2006. If you do create any historical datasets we’d love for you to share the tweet identifier datasets in The Catalog.

However this opening up of access on the one hand comes with a simultaneous contraction in terms of how much data can be collected at one time. The remainder of this post describes some of the details and the design decisions we have made with twarc2 to address them. If you would prefer to watch a quick introduction to using twarc v2 please check out this short video:

Installation

If you are familiar with installing twarc nothing is changed. You still install (or upgrade) with pip as you did before:

$ pip install --upgrade twarc

In fact you will still have full access to the v1.1 API just as you did before. So the old commands will continue to work as they did ¹

$ twarc search blacklivesmatter > tweets.jsonl

twarc2 was designed to let you to continue to use Twitter’s v1.1 API undisturbed until it is finally turned off by Twitter, at which point the functionality will be removed from twarc. All the support for the v2 API is mediated by a new command line utility twarc2. For example to search for blacklivesmatter tweets and write them to a file tweets.jsonl:

$ twarc2 search blacklivesmatter > tweets.jsonl

All the usual twarc functionality such as searching for tweets, collecting live tweets from the streaming API endpoint, requesting user timelines and user metadata are all still there, twarc2 --help gives you the details. But while the interface looks the same there’s quite a bit different going on behind the scenes.

Representation

Truth be told, there is no shortage of open source libraries and tools for interacting with the Twitter API. In the past twarc has made a bit of a name for itself by catering to a niche group of users who want a reliable, programmable way to collect the canonical JSON representation of a tweet. JavaScript Object Notation (JSON) is the language of Web APIs, and Twitter has kept its JSON representation of a tweet relatively stable over the years. Rather than making lots of decisions about the many ways you might want to collect, model and analyze tweets twarc has tried to do one thing and do it well (data collection) and get out of the way so that you can use (or create) the tools for putting this data to use.

But the JSON representation of a tweet in the Twitter v2 API is completely burst apart. The v2 base representation of a tweet is extremely lean and minimal, and just includes the text of the tweet its identifier and a handful of other things. All the details about the user who created the tweet, embedded media, and more are not included. Fortunately this information is still available, but the user needs to craft their API request to request tweets using a set of expansions that tell the Twitter API what additional entities to include. In addition for each expansion there are a set of field options to include that control what of these expansions is returned.

So rather than there being a single JSON representation of a tweet API users now have the ability to shape the data based on what they need, much like how GraphQL APIs work. This kind of makes you wonder why Twitter didn’t make their GraphQL API available. For specific use cases this customizability is very useful, but the mutability of the representation of a tweet presents challenges when collecting data for future use. If you didn’t request the right expansions or fields when collecting the data then you won’t be able to analyze that data later when doing your research.

To solve for this twarc2 has been designed to collect the richest possible representation for a tweet, by requesting all possible expansions and field combinations for tweets. See the expansions module for the details if you are interested. This takes a significant burden off of users to digest the API documentation, and craft the correct API requests themselves. In addition the twarc community will be monitoring the Twitter API documentation going forward to incorporate new expansions and fields as they will inevitably be added in the future.

Flattening

This is diving into the weeds a little bit, but it’s worth noting here that Twitter’s introduction of expansions allows data that was once duplicated across multiple tweets (such as user information, media, retweets, etc) to be included once per response from the API. This means that instead of seeing information about the user who created a tweet in the context of their tweet the user will be referenced using an identifier, and this identifier will map to user metadata in the outer envelope of the response.

It makes sense why Twitter have introduced expansions since it means in a set of 100 tweets from a given user the user information will just be included once rather than repeated 100 times, which means less data, less network traffic and less money. It’s even more significant when consider the large number of possible expansions. However this pass by-reference rather than by-value presents some challenges for stream based processing which expects each tweet to be self-contained.

For this reason we’ve introduce the idea of flattening the response data when persisting the JSON to disk. This means that tools and data pipelines that expect to operate on a stream of tweets can continue to do so. Since the representation of a tweet is so dependent on how data is requested we’ve taken the opportunity to introduce a small stanza of twarc specific metadata using the __twarc prefix.

This metadata records what API endpoint the data was requested from, and when. This information is critically important when interpreting the data, because some information about a tweet like its retweet and quote counts are constantly changing.

Data Flows

As mentioned above you can still collect tweets from the search and streaming API endpoints in a way that seems quite similar to the v1 API. The big changes however are the quotas associated with these endpoints which govern how much can be collected. These quotas control how many requests can be sent to Twitter in 15 minute intervals.

In fact these quotas are not much changed, but what’s new are app wide quotas that constrain how many tweets a given application (app) can collect every month. An app in this context is a piece of software (e.g. your twarc software) identified by unique API keys set up in the Twitter Developer Portal. The standard API access sets a 500,000 tweet per month limit. This is a huge change considering there were no monthly app limits before. If you get approved for the Academic Research track your app quota is increased to 10 million per month. This is markedly better but the achievable data volume is still nothing like the v1.1 API, as these graphs attempt to illustrate:

twarc2 will still observe the same rate limits, but once you’ve collected your portion for the month there’s not much that can be done, for that app at least.

Apart from the quotas Twitter’s streaming endpoint in v2 is substantially changed which impacts how users interact with twarc. Previously twarc users would be able to create up to to two connections to the filter stream API. This could be done by simply:

twarc filter obama > obama.jsonl

However in the Twitter v2 API only apps can connect to the filter stream, and they can only connect once. At first this seems like a major limitation but rather than creating a connection per query the v2 API allows you to build a set of rules for tweets to match, which in turns controls what tweets are included in the stream. This means you can collect for multiple types of queries at the same time, and the tweets will come back with a piece of metadata indicating what rule caused its inclusion.

This translates into a markedly different set of interactions at the command line for collecting from the stream where you first need to set your stream rules and then open a connection to fetch it.

twarc2 stream-rules add blacklivesmatter  
twarc2 stream > tweets.jsonl

One useful side effect of this is that you can update the stream (add and remove rules) while the stream is in motion:

twarc2 stream-rules add blm

While you are limited by the API quota in terms of how many tweets you can collect, tweets are not “dropped on the floor” when the volume gets too high. Once upon a time the v1.1 filter stream was rumored to be rate limited when your stream exceeds 1% of the total volume of new tweets.

Plugins

In addition to twarc helping you collect tweets the GitHub repository has also been a place to collect a set of utilities for working with the data. For example there are scripts for extracting and unshortening urls, identifying suspended/deleted content, extracting videos, buiding wordclouds, putting tweets on maps, displaying network graph visualizations, counting hashtags, and more. These utilities all work like Unix filters where the input is a stream of tweets and the output varies depending on what the utility is doing, e.g. a Gephi file for a network visualization, or a folder of mp4 files for video extraction.

While this has worked well in general the kitchen sink approach has been difficult to manage from a configuration management perspective. Users have to download these scripts manually from GitHub or by cloning the repository. For some users this is fine, but it’s a bit of a barrier to entry for users who have just installed twarc with pip.

Furthermore these plugins often have their own dependencies which twarc itself does not. This lets twarc can stay pretty lean, and things like youtube_dl, NetworkX or Pandas can be installed by people that want to use utilities that need them. But since there is no way to install the utilities there isn’t a way to ensure that the dependencies are installed, which can lead to users needing to diagnose missing libraries themselves.

Finally the plugins have typically lacked their own tests. twarc’s test suite has really helped us track changes to the Twitter API and to make sure that it continues to operate properly as new functionality has been added. But nothing like this has existed for the utilities. We’ve noticed that over time some of them need updating. Also their command line arguments have drifted over time which can lead to some inconsistencies in how they are used.

So with twarc2 we’ve introduced the idea of plugins which extend the functionality of the twarc2 command, are distributed on PyPI separately from twarc, and exist in their own GitHub repositories where they can be developed and tested independently of twarc itself. This is all achieved through twarc2’s use of the click library and specifically click-plugins. So now if you would like to convert your collected tweets to CSV you can install the twarc-csv:

$ pip install twarc-csv  
$ twarc2 search covid19 > covid19.jsonl  
$ twarc2 csv covid19.jsonl > covid19.csv

Or if you want to extract embedded and referenced videos from tweets you can install twarc-videos which will write all the videos to a directory:

$ pip install twarc-videos  
$ twarc2 videos covid19.jsonl --download-dir covid19-videos

You can write these plugins yourself and release them as needed. Check out the plugin reference implementation tweet-ids for a simple example to adapt. We’re still in the process of porting some of the most useful utilities over and would love to see ideas for new plugins. Check out the current list of twarc2 plugins and use the twarc issue tracker on GitHub to join the discussion.

You may notice from the list of plugins that twarc now (finally) has documentation on ReadTheDocs external from the documentation that was previously only available on GitHub. We got by with GitHub’s rendering of Markdown documents for a while, but GitHub’s boilerplate designed for developers can prove to be quite confusing for users who aren’t used to selectively ignoring it. ReadTheDocs allows us to manage the command line and API documentation for twarc, and to showcase the work that has gone into the Spanish, Japanese, Portuguese, Swedish, Swahili and Chinese translations.

Feedback

Thanks for reading this far! We hope you will give twarc2 a try. Let us know what you think either in comments here, in the DocNow Slack or over on GitHub.

✨ ✨ Happy twarcing! ✨ ✨ ✨

Windows users will want to indicate the output file using a second argument rather than redirecting output with >. See this page for details.↩︎