Overflow
If you happen to be using jq to provide tweet identifier
datasets for others to hydrate you may be interested in this significant
difference between jq .id
and jq .id_str
.
Yesterday I saw the announcement that GWU had made a dataset of 280 million tweets available. This is a really significant contribution for researchers of all stripes who want to study the 2016 US presidential election. So I wanted to replicate the dataset myself for potential researchers at UMD by hydrating the tweet identifiers to get the original JSON data for tweets that have not been deleted yet.
Since this hydration job was going to take a long time (weeks) I started up a new ec2 instance with what I guessed was enough disk space. Just as an aside, GWU has provided a great deal of useful detail about how this dataset was built, but I didn’t actually see the dataset size in gigabytes anywhere. This is a pretty useful detail for people that are going to be hydrating it.
I downloaded the dataset,
unzipped it, and put it in a directory called ids
. I then
installed twarc and wrote
a small
shell script to hydrate each tweet ids file in turn, writing the
compressed JSON data to a similarly named directory tweets
.
I started a screen
session so I could disconnect from the server and reconnect later to
look in the log to see how it was doing.
One thing I noticed when checking back on its progress is that twarc was only able to hydrate a small fraction of the tweet identifiers. I’ve seen up to 25% deletion rates in datasets like our Ferguson dataset. But I was seeing like 97% which seemed really odd. I worried that perhaps there was a bug in twarc’s hydration logic.
So I did a test to pull down some tweets and try to hydrate them:
twarc stream obama > tweets1.json
# let this run for a few minutes
# create a sorted list of the weet ids
jq .id tweets1.json | sort -n > ids1.json
# hydrate them into a new tweets file
twarc hydrate ids1.json > tweets2.json
Sure enough, only like 4 or 5 of the tweets came back hydrated. I wanted to see the identifiers that were no longer available so I created a new tweet ids based on the hydrated data and diffed them.
jq .id tweets2.json | sort -n > ids2.json
diff ids1.json ids2.json
Another aside: if you have a tweet identifier you can easily resolve
it using a little URL trick for tweets. Just use any twitter username
(real or fake) and construct a URL like
https://twitter.com/myFakeUsername/status/{id}
and it will
redirect to the correct location. For example https://twitter.com/myFakeUsername/status/803901951276707841.
Sure enough I could see that the tweets appeared to have been deleted. At least the few I spot checked in the diff. I was curious to see what they were so I used jq to pull out the id and the text of the tweet and to grep for the deleted tweet ID:
jq '.id + " " + .text' tweets1.json | grep 803902036756533200
This threw a bunch of errors because I was trying to concatenate a
number and a string. So I used the id_str
property that is
available in also available the tweet data:
jq '.id_str + " " + .text' tweets1.json | grep 803902036756533200
This step was crucial because I could have used jq to cast the number
into a string, but I wouldn’t have found my error that way. So when my
grep on the .id_str
failed I couldn’t figure out why my
tweet wasn’t found. I tried to get just the one tweet using grep without
jq and it didn’t work either!
grep 803902036756533200 tweets1.json
At that point, after much squinting and chin scratching I finally
remembered why Twitter introduced
the both id
and id_str
in the first place:
Some programming languages such as Javascript cannot support numbers with 53-bits.
So I pulled out one tweet and compared the .id to .id_str:
head -1 tweets1.json | jq .id
803901951276707800
head -1 tweets1.json | jq .id_str
"803901951276707841"
Notice how the numbers are different? The integer tweet identifier overflowed
and the string didn’t! So the big lesson here is: don’t use
jq .id
to create tweet id datasets, use
jq -r .id_str
instead. The -r
is for raw
output, which will remove those quotation marks.
I don’t know if jq is in GWU’s pipeline but I know it has been in
mine for a few months for creating the tweet id datasets. I had been
using .id_str
until fairly recently when I saw
.id
being used. Luckily I had just a handful of datasets
created this way.