Overflow

If you happen to be using jq to provide tweet identifier datasets for others to hydrate, you may be interested in a significant difference between jq .id and jq .id_str.

Yesterday I saw the announcement that GWU had made a dataset of 280 million tweets available. This is a really significant contribution for researchers of all stripes who want to study the 2016 US presidential election. So I wanted to replicate the dataset myself for potential researchers at UMD by hydrating the tweet identifiers to get the original JSON data for tweets that have not been deleted yet.

Since this hydration job was going to take a long time (weeks), I started up a new EC2 instance with what I guessed was enough disk space. As an aside, GWU has provided a great deal of useful detail about how this dataset was built, but I didn’t see the dataset size in gigabytes anywhere. That is a pretty useful detail for people who are going to be hydrating it.

I downloaded the dataset, unzipped it, and put it in a directory called ids. I then installed twarc and wrote a small shell script to hydrate each tweet ids file in turn, writing the compressed JSON data to a similarly named directory tweets. I started a screen session so I could disconnect from the server and reconnect later to look in the log to see how it was doing.

One thing I noticed when checking back on its progress was that twarc was only able to hydrate a small fraction of the tweet identifiers. I’ve seen up to 25% deletion rates in datasets like our Ferguson dataset, but here I was seeing something like 97%, which seemed really odd. I worried that perhaps there was a bug in twarc’s hydration logic.

So I did a test to pull down some tweets and try to hydrate them:

twarc stream obama > tweets1.json
# let this run for a few minutes

# create a sorted list of the tweet ids
jq .id tweets1.json | sort -n > ids1.json

# hydrate them into a new tweets file
twarc hydrate ids1.json > tweets2.json

Sure enough, only 4 or 5 of the tweets came back hydrated. I wanted to see the identifiers that were no longer available, so I created a new tweet ids file from the hydrated data and diffed the two:

jq .id tweets2.json | sort -n > ids2.json
diff ids1.json ids2.json
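The same comparison can be sketched in Python, for cases where the id files are unsorted or too awkward to diff. This is only a rough equivalent of the diff above, not part of the original workflow; the function names read_ids and missing_ids are made up for illustration. Note that the ids are kept as strings throughout.

```python
def read_ids(path):
    """Read one tweet id per line into a set of strings."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def missing_ids(before, after):
    """Return the ids present in `before` but absent from `after`.

    Both arguments are sets of id strings. The result is sorted
    numerically, using Python's arbitrary-precision integers only
    as a sort key so the strings themselves are never altered.
    """
    return sorted(before - after, key=int)
```

Calling missing_ids(read_ids("ids1.json"), read_ids("ids2.json")) would list the ids that did not come back from hydration.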

Another aside: if you have a tweet identifier you can easily resolve it using a little URL trick for tweets. Just use any twitter username (real or fake) and construct a URL like https://twitter.com/myFakeUsername/status/{id} and it will redirect to the correct location. For example https://twitter.com/myFakeUsername/status/803901951276707841.

Sure enough, the tweets appeared to have been deleted, at least the few I spot-checked in the diff. I was curious to see what they were, so I used jq to pull out the id and the text of each tweet and grepped for one of the deleted tweet IDs:

jq '.id + " " + .text' tweets1.json | grep 803902036756533200

This threw a bunch of errors because I was trying to concatenate a number and a string. So I used the id_str property that is also available in the tweet data:

jq '.id_str + " " + .text' tweets1.json | grep 803902036756533200

Switching to id_str turned out to be crucial: I could have used jq to cast the number to a string instead, but then I wouldn’t have found my error. When my grep on the .id_str output came up empty, I couldn’t figure out why my tweet wasn’t found. I tried to get just the one tweet using grep without jq and it didn’t work either!

grep 803902036756533200 tweets1.json

At that point, after much squinting and chin scratching, I finally remembered why Twitter introduced both id and id_str in the first place:

Some programming languages such as JavaScript cannot support numbers with >53 bits.
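To make that limit concrete, here is a small Python sketch. JavaScript numbers are IEEE 754 64-bit floats, and Python's float is the same format, so forcing an integer through float shows where whole numbers stop being exact. The helper name survives_float is made up for illustration.

```python
# Up to 2**53, every whole number has an exact 64-bit float representation.
LIMIT = 2 ** 53  # 9007199254740992


def survives_float(n: int) -> bool:
    """Return True if n is unchanged after a round trip through a float."""
    return int(float(n)) == n


print(survives_float(LIMIT))      # True
print(survives_float(LIMIT + 1))  # False: rounds back down to 2**53
```

Current tweet ids are far above this limit, so they cannot be relied on to survive a trip through a floating-point number.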

So I pulled out one tweet and compared the .id to .id_str:

head -1 tweets1.json | jq .id
803901951276707800

head -1 tweets1.json | jq .id_str
"803901951276707841"

Notice how the numbers are different? The integer tweet identifier overflowed the 53 bits of precision that jq has for numbers, and the string didn’t! So the big lesson here is: don’t use jq .id to create tweet id datasets, use jq -r .id_str instead. The -r is for raw output, which will remove those quotation marks.
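For anyone doing this step in Python rather than jq, a minimal sketch of the same idea follows. It assumes line-oriented JSON (one tweet per line), which is what the head -1 commands above imply; extract_id_strs is a made-up name. The parse_int=float argument is only there to imitate a parser that reads every number as a float, as jq did here.

```python
import json


def extract_id_strs(lines):
    """Yield the id_str of each tweet in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["id_str"]


# A cut-down tweet containing only the two id fields.
sample = '{"id": 803901951276707841, "id_str": "803901951276707841"}'

# Imitate a float-only JSON parser: the numeric id gets rounded.
lossy = json.loads(sample, parse_int=float)["id"]
print(int(lossy))                       # 803901951276707840

# The string field comes through untouched.
print(list(extract_id_strs([sample])))  # ['803901951276707841']
```

Python prints the rounded float in full (…840) while jq printed …800. As far as I can tell both are the same underlying float, with jq choosing a shorter decimal for it. To use this on a real file, pass an open file handle, for example extract_id_strs(open("tweets1.json")).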

I don’t know if jq is in GWU’s pipeline, but I know it has been in mine for a few months for creating tweet id datasets. I had been using .id_str until fairly recently, when I saw .id being used. Luckily I had just a handful of datasets created this way.
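To check whether an existing ids file might have been created with jq .id, one rough heuristic is to test whether each id is exactly what a float round trip would print. This is a sketch under a loud assumption: that jq printed the shortest decimal that maps back to the same 64-bit float, which is consistent with the output shown above but not something verified against jq itself. The function names are made up, and a genuine id can match by coincidence, so treat a hit as a hint rather than proof.

```python
from decimal import Decimal

FLOAT_EXACT_LIMIT = 2 ** 53  # above this, floats cannot hold every integer


def as_float_would_print(tweet_id: int) -> str:
    """Shortest decimal string that round-trips through a 64-bit float.

    ASSUMPTION: this mirrors how jq printed large numbers in this post.
    """
    return str(int(Decimal(repr(float(tweet_id)))))


def looks_lossy(id_text: str) -> bool:
    """Heuristic: True if the id looks like the product of a float round trip."""
    n = int(id_text)
    return n > FLOAT_EXACT_LIMIT and id_text == as_float_would_print(n)


print(looks_lossy("803901951276707800"))  # True: matches the jq .id output
print(looks_lossy("803901951276707841"))  # False: the real id_str
```

If nearly every id in a file is flagged, it was probably generated from .id; an occasional flagged id in an otherwise clean file is more likely a coincidence.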