I ran across an interesting anomaly in the Twitter API recently and thought it might be worth jotting down here.

The Twitter v2 API (with Academic Research product track enabled) allows you to search the full archive of tweets going back to 2006. While doing some work with Alejandra Josiowicz to collect tweets related to the Brazilian human rights activist Marielle Franco I noticed that the data collection was going a lot slower than I expected. This is something that twarc’s new progress bar makes much more visible. It seemed like twarc was stuck at a particular point in time collecting very little new tweets. What gives?

Taking a look at the twarc log file I could see twarc was sometimes only collecting a single tweet instead of the 100 that was being requested. Basically a line like this:

2021-08-17 03:42:53,464 INFO archived 1267885644099002374

instead of a line like this with lots of tweet ids on it:

2021-08-13 08:27:43,806 INFO archived 1426158197384044549,1426155626011430913,1426155528552697862,1426155056567574534,1426154992554086402,1426153260696678401,1426152718083702790,1426151338291957762,1426151303286247425,1426150785428164613,1426150593207447559,1426149917010145282,1426149795245268996,1426149628739825667,1426147897373044738,1426147402990379008,1426147073087492100,1426146906456174594,1426145724115341315,1426145360729317376,1426145137298714624,1426143294539583492,1426142733123694599,1426140920962076673,1426138951081992192,1426138676954902531,1426138248812941318,1426137687703052288,1426137658904911872,1426137184143351810,1426134708031827970,1426133475204255749,1426133385886552068,1426129300428840967,1426129292895850496,1426126292450217984,1426125942863409155,1426124810464149506,1426123335998574592,1426123142787997703,1426122015346466821,1426121994257543168,1426121464307138560,1426120311213379585,1426116765705281536,1426116734973579266,1426116720377442309,1426116553301536769,1426115831876968448,1426115603912396801,1426115563542171660,1426113168485535746,1426112289191649280,1426112010199109632,1426111436565123074,1426111213445033984,1426110935253594115,1426110931663302658,1426110666172211201,1426110534181658628,1426108955382669313,1426100653303255043,1426097593856274432,1426055213115392003,1426031551146565634,1426015633611784200,1426002518149156864,1426000997881679876,1425998435862331394,1425997289521393667,1425996699366051849,1425995750534782986,1425994923631841280,1425994819357249536,1425990300699344903,1425986576635600901,1425985946726637571,1425984891339030529,1425982908494401544,1425982737857622016,1425980825317875717,1425978278586761218,1425976980017754119,1425976685296422915,1425975810616332291,1425975040974196738,1425974916004950022,1425974391515623432,1425974134438309891,1425970945496387587,1425970625517084674,1425970245550870532,1425969423026950149,1425967499133259788,1425966405200056323,1425966374950744068,1425965431614025736,1425965245101707268,1425964978532753415,1425964921041428485

To isolate the problem I used data from the log file to write a very simple Python program that issues a single request to fetch 100 tweets that match “mariella franco”, that were sent before the tweet id 1267885644099002374.

Note: if you want to run this yourself you’ll need to set BEARER_TOKEN in your environment to access the Twitter API.

import os
import json
import requests

token = os.environ.get("BEARER_TOKEN")

url = "https://api.twitter.com/2/tweets/search/all"

params = {
    "query": '"marielle franco"',
    "until_id": "1267885644099002374",
    "max_results": 100
}

response = requests.get(
    url,
    params=params,
    headers={"Authorization": f"Bearer {token}"}
)

print(
    json.dumps(
        response.json(), 
        indent=2
    )
)

When you run this program it outputs the following JSON:

{
  "data": [
    {
      "id": "1267885518005690370",
      "text": "RT @lab_fantasma: \u00c9 por \u00c1gatha Felix, Douglas Martins Rodrigues e Jo\u00e3o Pedro. Mas tamb\u00e9m por George Floyd, por Claudia Ferreira e Marielle\u2026"
    }
  ],
  "meta": {
    "newest_id": "1267885518005690370",
    "oldest_id": "1267885518005690370",
    "result_count": 1,
    "next_token": "b26v89c19zqg8o3fo7dfv5axa5q2kdlt21pvomk80wqyl"
  }
}

The strange thing here is only one tweet is being returned. But there are definitely more tweets that match that query. At the moment, according to Twitter’s new Counts API, there are 2,389,336 tweets more to be exact.

(In case you are curious)

twarc2 counts --csv --archive --until-id 1267885518005690370 '"marielle
franco"' --granularity day counts.csv

Looking at my twarc log file I could see that for long stretches of time the search response only contains a handful of tweets.

So I spent a little time in a Jupyter notebook parsing the log messages, and using the tweet created time that is hidden in a tweet’s id to visualize how many tweets are missing from responses over time.

It’s not entirely clear why there is such a spike on June 2, 2020. It’s also not clear what these missing tweets actually signify. Over in DocNow Slack Sam Hames speculated:

My guess is that a page of tweet ids is retrieved for the search, then hydrated to handle deletions/protected accounts/get current data. It’s especially notable when you have a topic that only one user is tweeting about, then that account is deleted or goes protected.

This sounds right to me, but of course we don’t really know what’s happening inside Twitter’s infrastructure – only they do. Sam did point out that he observed this behavior with the Twiter v1.1 API search endpoint as well. I think it’s possibly more noticeable when using the full archive search because you are much more likely to run across stretches of deleted/protected tweets. Also, twarc didn’t log tweets per response like it does now with the v2 API.

If true I think this suggests that when you delete a tweet or an account on Twitter it doesn’t really get deleted. Instead it is flagged as deleted, and no longer distributed to the public. Almost like a disk block that has had its inode deleted – except Twitter know exactly where the data is. Jane Friedhoff suggested the same things some time ago using a different (but related) API signal. Perhaps an experiment could be set up to test if it’s true? Delete a few things from your timeline and then try to retrieve all your tweets and see if the little gaps appear?

I’m not expecting to get much insight into what’s really going on here, but I did ask over in the Twitter Community Forum, and will update this post if I learn anything. If it seems potentially useful to have a twarc plugin for analyzing the log files in this way please get in touch.

The situation is actually kind of tedious for researchers collecting data from Twitter because these empty responses count as responses–you can only issue 300 every 15 minutes. So these stretches of deletes can negatively effect how much data you can collect in the alloted time. Much ado about nothing.