Yesterday I got into conversation with Ben Nimmo and Brian Krebs who were the subjects of an intense botnet attack on Twitter. They were experiencing a large number of followers in a short period of time, and a selection of their tweets were getting artificially boosted by retweets of up to 80,000 times. You can read Brian's detailed writeup here.

At first it seemed completely counter-intuitive to me that someone would direct their botnet (which in all likelihood they are paying for) to boost the followers and messages of people they disagree with. But I wasn't thinking deviously enough. As Brian points out in his post, Twitter appear to have stepped up suspending botnet accounts and the beneficiaries of the botnet traffic. So boosting a user you don't like could get them suspended. In addition, as Ben wrote about a few days ago, it is also an intimidation tactic that disrupts the target's use of Twitter.

Our specific conversation was about how to analyze the retweets since there are tens of thousands and Twitter's statuses/retweets API endpoint is limited to fetching the last 100 retweets. However, it is possible to use the search/tweets endpoint to search for the retweets using the text of the tweet, as long as the retweets have been sent in the last 7 days, which is furthest back Twitter allow you to search for tweets in. So there is a brief window in which you can fetch the retweets.

If you, like Ben and Brian find yourself needing to collect retweets I thought I would document the process a little bit here. The basic approach should work with different Twitter clients if you prefer to work in another language---I used twarc because I'm familiar with it, and it handles rate limiting easily. I also worked from the command line to explain the process at a higher level. You could certainly write a small program to do this.


So, I wanted to get the retweets for a tweet from Brian that generated a great deal of rapid retweet traffic that appeared to him to be bot driven:

First you'll need to install twarc for interacting with the Twitter API from the command line. If you don't have Python yet you'll need to go get that first.

pip install twarc

Now you are ready to tell twarc about your Twitter API keys. Go over to apps.twitter.com, create an app and note the keys down so you can tell twarc about them with:

twarc configure

With twarc and your twitter keys in hand you are ready to collect the tweets using twarc's search command. To run a search you need a query. In this case we're going to use some identifying text from the tweet in question. The results are line-oriented-json, where every line is a complete JSON document for a tweet. The JSON is exactly what is returned from the Twitter API for a tweet.

twarc search 'Bring on the bots and sock puppet accounts Amazing how a tweet about Putin' > briankrebs.jsonl

This command could run for a while depending on how many retweets you can get. Twitter only allow you to get 17,000 ever 15 minutes. twarc will handle waiting until it can go get more. You can see a file twarc.log which contains information about what it is doing.

Once that finishes you probably want to be absolutely sure the file only includes retweets of that specific tweet. It's possible that your search generated some false positives if the words happened to be used in tweets that were not retweets of your subject tweet. One handy way of doing this is to use jq to filter them using the tweet id of the original tweet:

jq -cr 'select(.retweeted_status.id_str = "902545914304319491")' briankrebs.jsonl > briankrebs-filtered.jsonl

Now that you have the JSON for the retweets you can do analysis of the users by creating a CSV file of information about them. For example I was interested in looking at the followers, friends and tweets counts, as well as when the account was created and the user's preferred language. Yes, this user profile information can be found in the information you get for each tweet, or in this case, retweet. For the full details checkout the Tweet Field Guide from Twitter. jq is also pretty good at extracting bits of the json and writing it as CSV:

jq -r '[.user.screen_name, .user.followers_count, .user.friends_count, .user.status_count, .user.created_at, .user.lang, .user.status_count] | @csv' briankrebs-filtered.jsonl > briankrebs.csv

Here is the file I generated. You should be able to open that in your spreadsheet software of choice and look for patterns.


One other thing I was interested in doing was seeing what connections there might be between the retweeters of Brian and the retweeters of this tweet by Ben, which he thought got artificially boosted as well:

So I went through the exact same process to generate a file of user information for the retweeters of that tweet. With that file in hand I just needed to see what users were present in both. One nice little trick for doing the join is to use csvkit's csvjoin.

csvsql --query "SELECT briankrebs.* FROM briankrebs, benimmo WHERE briankrebs.screen_name = benimmo.screen_name" briankrebs.csv benimmo.csv > briankrebs_benimmo.csv

You can see the resulting file here.


I realize this post was a bit of an esoteric post, but I wanted to write up the process in case you find yourself wanting to analyze retweets. One thing I don't know the answer to is why the number of retweets returned isn't exactly the same as the number of retweets displayed in the tweet. One explanation for this is that the search index is imperfect, and there is some hidden limitation apart from the ~7 day window that it will return results in. Another more likely explanation is that some of the retweets were from accounts that have been suspended or deleted, but the retweet count has not been adjusted to account for that. I guess only Twitter know the answer to that one.