UTR

I’ve always intended to use this blog as more of a place for rough working notes as well as somewhat more fully formed writing. So in that spirit here are some rough notes for some digging into a collection of tweets that used the #unitetheright hashtag. Specifically I’ll describe a way of determining what tweets have been deleted.

Note: be careful with deleted Twitter data. Specifically be careful about how you publish it on the web. Users delete content for lots of reasons. Republishing deleted data on the web could be seen as a form of Doxing. I’m documenting this procedure for identifying deleted tweets because it can provide insight into how particularly toxic information is traveling on the web. Please use discretion in how you put this data to use.

So I started by building a dataset of #unitetheright data using twarc:

twarc search '#unitetheright' > tweets.json

I waited two days and then was able to gather some information about the tweets that were deleted. I was also interested in what content and websites people were linking to in their tweets because of the implications this has for web archives. Here are some basic stats about the dataset:

number of tweets: 200,113

collected at: 2017-08-13 11:46:05 EDT

date range: 2017-08-04 11:44:12 - 2017-08-13 15:45:39 UTC

tweets deleted: 16,492 (8.2%)

Top 10 Domains in Tweeted URls

Domain	Count
www.youtube.com	518
www.facebook.com	91
www.pscp.tv	83
jwww.instagram.com	47
www.reddit.com	32
www.dailystormer.com	22
gab.ai	17
restoringthehonor.blogspot.com	16
www.nbc29.com	15
paper.li	15

Top 25 Tweeted URLs (after unshortening)

URL	Count
https://medium.com/@RVAwonk/unite-the-right-rally-reflects-a-growing-threat-of-extremism-in-america-e94f57b61980	1460
http://www.dailyprogress.com/gallery/unite-the-right-torch-rally-at-university-of-virginia/collection_5c6e5f36-7f0a-11e7-ab60-ffcbb63bd035.html	929
https://www.pscp.tv/TheMadDimension1/1ZkKzOvVomLGv	613
https://www.pscp.tv/FaithGoldy/1djGXLwPwoPGZ	384
https://www.pscp.tv/FaithGoldy/1BRJjyjBPyNGw	351
https://www.pscp.tv/aletweetsnews/1lPKqwzkaWwJb	338
https://www.pscp.tv/AmRenaissance/1eaKbmynnYexX	244
https://www.youtube.com/watch?v=bfIfywQkxOk	242
https://www.pscp.tv/TheMadDimension1/1yoJMplRNDOGQ	223
https://www.pscp.tv/MemeAlertNews/1mrxmmzRaQwxy	208
https://www.pscp.tv/pnehlen/1DXGyOZjlBkxM	202
https://www.youtube.com/watch?v=098QwsPVHrM	189
http://www.richmond.com/news/virginia/feds-open-civil-rights-investigation-state-police-arrest-three-men/article_a3cf7b72-e437-5472-8025-43147acf3d34.html	187
https://redice.tv/live	184
https://www.pscp.tv/occdissent/1ypJdlnwNlqJW	167
https://www.pscp.tv/forecaster25/1nAKEeqLOnaKL	143
https://www.vdare.com/articles/the-system-revealed-antifa-virginia-politicians-and-police-work-together-to-shut-down-unitetheright	127
http://www.thetrumptimes.com/2017/08/06/will-uniting-right-ever-possible-america/	123
https://www.pscp.tv/KidRockSenator/1LyGBElqYyOKN	107
https://www.youtube.com/watch?v=lAGsPmDaQPA	100
https://www.pscp.tv/KidRockSenator/1DXxyOZvwndGM	99
http://bigleaguepolitics.com/airbnb-banning-right-wing-users-planning-attend-unite-right-rally/	90
https://www.youtube.com/watch?v=ttM3_bqjHMo	87
https://www.pscp.tv/RandLawl/1kvKpjZAdlbJE	81
http://www.unicornriot.ninja/?p=18055	80

Deletes

So how do you get a sense of what has been deleted from your data? While it might make sense to write a program to do this eventually, I find it can be useful to work in a more a more exploratory way on the command line first and then when I’ve got a good workflow I can put that into a program. I guess if I were a real data scientist I would be doing this in R or a Jupyter notebook at least. But I still enjoy working at the command line, so here are the steps I took to identify tweets that had been deleted from the original dataset:

First I extracted and sorted the tweet identifiers into a separate file using jq:

jq -r '.id_str' tweets.json | sort -n > ids.csv

Then I hydrated those ids with twarc. If the tweet has been deleted since it was first collected it cannot be hydrated:

twarc hydrate ids.csv > hydrated.json

I extracted these hydrated ids:

jq -r .id_str hydrated.json | sort -n > ids-hydrated.csv

Then I used diff to compare the pre and post hydration ids, and used a little bit of Perl to strip of the diff formatting, which results in a file of tweet ids that have been deleted.

diff ids.csv ids-hydrated.csv | perl -ne 'if (/< (\d+)/) {print "$1\n";}' > ids-deleted.csv

Since we have the data that was deleted we can now build a file of just deleted tweets. Maybe there’s a fancy way to do this on the command line but I found it easiest to write a little bit of Python to do it:

After you run it you should have a file delete.json. You might want to convert it to CSV with something like twarc’s json2csv.py utility to inspect in a spreadsheet program.

Calling these tweets deleted is a bit of a misnomer. A user could have deleted their tweet, deleted their account, protected their account or Twitter could have decided to suspend the users account. Also, the user could have done none of these things and simply retweeted a user who had done one of these things. Untangling what happened is left for another blog post. To be continued…