I’ve always intended to use this blog as a place for rough working notes as well as somewhat more fully formed writing. So in that spirit here are some rough notes from digging into a collection of tweets that used the #unitetheright hashtag. Specifically I’ll describe a way of determining which tweets have been deleted.

Note: be careful with deleted Twitter data. Specifically be careful about how you publish it on the web. Users delete content for lots of reasons. Republishing deleted data on the web could be seen as a form of doxing. I’m documenting this procedure for identifying deleted tweets because it can provide insight into how particularly toxic information is traveling on the web. Please use discretion in how you put this data to use.

So I started by building a dataset of #unitetheright data using twarc:

twarc search '#unitetheright' > tweets.json

I waited two days and then was able to gather some information about the tweets that were deleted. I was also interested in what content and websites people were linking to in their tweets because of the implications this has for web archives. Here are some basic stats about the dataset:

number of tweets: 200,113

collected at: 2017-08-13 11:46:05 EDT

date range: 2017-08-04 11:44:12 - 2017-08-13 15:45:39 UTC

tweets deleted: 16,492 (8.2%)

Top 10 Domains in Tweeted URLs

Domain Count
www.youtube.com 518
www.facebook.com 91
www.pscp.tv 83
www.instagram.com 47
www.reddit.com 32
www.dailystormer.com 22
gab.ai 17
restoringthehonor.blogspot.com 16
www.nbc29.com 15
paper.li 15

Top 25 Tweeted URLs (after unshortening)

URL Count
https://medium.com/@RVAwonk/unite-the-right-rally-reflects-a-growing-threat-of-extremism-in-america-e94f57b61980 1460
http://www.dailyprogress.com/gallery/unite-the-right-torch-rally-at-university-of-virginia/collection_5c6e5f36-7f0a-11e7-ab60-ffcbb63bd035.html 929
https://www.pscp.tv/TheMadDimension1/1ZkKzOvVomLGv 613
https://www.pscp.tv/FaithGoldy/1djGXLwPwoPGZ 384
https://www.pscp.tv/FaithGoldy/1BRJjyjBPyNGw 351
https://www.pscp.tv/aletweetsnews/1lPKqwzkaWwJb 338
https://www.pscp.tv/AmRenaissance/1eaKbmynnYexX 244
https://www.youtube.com/watch?v=bfIfywQkxOk 242
https://www.pscp.tv/TheMadDimension1/1yoJMplRNDOGQ 223
https://www.pscp.tv/MemeAlertNews/1mrxmmzRaQwxy 208
https://www.pscp.tv/pnehlen/1DXGyOZjlBkxM 202
https://www.youtube.com/watch?v=098QwsPVHrM 189
http://www.richmond.com/news/virginia/feds-open-civil-rights-investigation-state-police-arrest-three-men/article_a3cf7b72-e437-5472-8025-43147acf3d34.html 187
https://redice.tv/live 184
https://www.pscp.tv/occdissent/1ypJdlnwNlqJW 167
https://www.pscp.tv/forecaster25/1nAKEeqLOnaKL 143
https://www.vdare.com/articles/the-system-revealed-antifa-virginia-politicians-and-police-work-together-to-shut-down-unitetheright 127
http://www.thetrumptimes.com/2017/08/06/will-uniting-right-ever-possible-america/ 123
https://www.pscp.tv/KidRockSenator/1LyGBElqYyOKN 107
https://www.youtube.com/watch?v=lAGsPmDaQPA 100
https://www.pscp.tv/KidRockSenator/1DXxyOZvwndGM 99
http://bigleaguepolitics.com/airbnb-banning-right-wing-users-planning-attend-unite-right-rally/ 90
https://www.youtube.com/watch?v=ttM3_bqjHMo 87
https://www.pscp.tv/RandLawl/1kvKpjZAdlbJE 81
http://www.unicornriot.ninja/?p=18055 80
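
Counts like the ones in the domain table can be produced with a few lines of Python. This is only a minimal sketch and not necessarily how the numbers above were generated. It assumes tweets.json is line-oriented JSON (one tweet per line, as twarc writes it) and that URLs sit in entities.urls[].expanded_url, as in the Twitter API v1.1 tweet layout. It does not unshorten URLs, and the function names are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def count_domains(lines):
    """Count hostnames of the URLs found in line-oriented tweet JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumption: URLs are in entities.urls[].expanded_url (API v1.1 layout).
        for url in tweet.get("entities", {}).get("urls", []):
            expanded = url.get("expanded_url")
            if not expanded:
                continue
            host = urlparse(expanded).netloc.lower()
            if host:
                counts[host] += 1
    return counts


def top_domains(path, n=10):
    """Return the n most common hostnames in the tweet file at path."""
    with open(path) as f:
        return count_domains(f).most_common(n)
```

Calling top_domains("tweets.json") returns a list of (domain, count) pairs. Counting full URLs instead of hostnames would only need the Counter keyed on the expanded URL itself.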


So how do you get a sense of what has been deleted from your data? While it might make sense to write a program to do this eventually, I find it can be useful to work in a more exploratory way on the command line first, and then when I’ve got a good workflow I can put that into a program. I guess if I were a real data scientist I would be doing this in R or a Jupyter notebook at least. But I still enjoy working at the command line, so here are the steps I took to identify tweets that had been deleted from the original dataset:

First I extracted and sorted the tweet identifiers into a separate file using jq:

jq -r '.id_str' tweets.json | sort -n > ids.csv

Then I hydrated those ids with twarc. If the tweet has been deleted since it was first collected it cannot be hydrated:

twarc hydrate ids.csv > hydrated.json

I extracted these hydrated ids:

jq -r .id_str hydrated.json | sort -n > ids-hydrated.csv

Then I used diff to compare the pre- and post-hydration ids, and used a little bit of Perl to strip off the diff formatting, which results in a file of tweet ids that have been deleted.

diff ids.csv ids-hydrated.csv | perl -ne 'if (/< (\d+)/) {print "$1\n";}' > ids-deleted.csv
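
The same comparison can be expressed as a set difference. The following is a rough Python sketch of that idea, not a replacement for the diff pipeline above. It assumes both files hold one tweet id per line, and the function names are made up for illustration.

```python
def read_ids(path):
    """Read one tweet id per line into a set, skipping blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def deleted_ids(original_ids, hydrated_ids):
    """Return ids present before hydration but missing afterwards, in numeric order."""
    return sorted(set(original_ids) - set(hydrated_ids), key=int)
```

For example, deleted_ids(read_ids("ids.csv"), read_ids("ids-hydrated.csv")) gives the ids that could not be hydrated. Unlike diff, this does not depend on the two files being sorted the same way.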

Since we have the data that was deleted we can now build a file of just deleted tweets. Maybe there’s a fancy way to do this on the command line but I found it easiest to write a little bit of Python to do it:
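
A minimal sketch of what such a script might look like, under a few assumptions: tweets.json is line-oriented JSON with one tweet per line, ids-deleted.csv has one id per line, and the output goes to delete.json. The function names are hypothetical, and this is an illustration of the approach rather than the exact script I ran.

```python
import json


def load_ids(lines):
    """Return the set of tweet ids, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_deleted(tweet_lines, deleted):
    """Yield the original JSON lines whose id_str is in the deleted set."""
    for line in tweet_lines:
        line = line.strip()
        if not line:
            continue
        if json.loads(line)["id_str"] in deleted:
            yield line


def write_deleted(tweets_path="tweets.json",
                  ids_path="ids-deleted.csv",
                  out_path="delete.json"):
    """Write the tweets whose ids are listed in ids_path to out_path."""
    with open(ids_path) as f:
        deleted = load_ids(f)
    with open(tweets_path) as tweets, open(out_path, "w") as out:
        for line in filter_deleted(tweets, deleted):
            out.write(line + "\n")
```

Calling write_deleted() with the default file names reads the two input files from the current directory. Only the set of deleted ids is held in memory; the tweets themselves are streamed one line at a time.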

After you run it you should have a file delete.json. You might want to convert it to CSV with something like twarc’s json2csv.py utility to inspect it in a spreadsheet program.

Calling these tweets deleted is a bit of a misnomer. A user could have deleted their tweet, deleted their account, or protected their account, or Twitter could have decided to suspend the user’s account. Also, the user could have done none of these things and simply retweeted a user who had done one of them. Untangling what happened is left for another blog post. To be continued…