twarc-hashtags

One of the nice things that we did with twarc2 is to design it so you can add plugins relatively easily. These plugins extend twarc’s basic functionality to do different things with collected Twitter data. This is just a quick post about a new plugin twarc-hashtags.

twarc-hashtags was born of necessity. As I mentioned in the last post I’ve been doing some work with Alejandra Josiowicz to examine tweets about Brazilian activist Marielle Franco. Thanks to the Academic Research product track we were able to collect all the tweets matching the phrase Marielle Franco.

twarc2 search --archive '"marielle franco"' tweets.jsonl

But we wanted to discover what hashtags were used in this initial dataset in order to broaden the search using relevant hashtags and then run it again.

Once you pip install twarc-hashtags you get a new command hashtags which you can use to generate a CSV dataset that represents the hashtags present in your data:

twarc2 hashtags tweets.jsonl hashtags.csv

The generated CSV is pretty simple. It has two columns: hashtag and tweets. While your data is being read a little SQLite database is populated which has three columns: tweet id, created, and hashtag. This allows for easy counting (using SQL) but also for a bit more manipulation.

For example, if you would like to see the hashtags grouped by month you can:

twarc2 hashtags tweets.jsonl hashtags.csv --group month

The generated dataset will have an additional column for time containing a year-month value, e.g. 2020-03. You can do the same for day, week, and year if you want to group differently.

In addition you can limit the number of hashtags to display. So if you just wanted to see the top 20 hashtags per month you could:

twarc2 hashtags tweets.jsonl hashtags.csv --group month --limit 20

And finally, since loading the SQLite database can take some time (for example if you are looking through 3.8 million tweets like I was) you can load it the first time and then use --no-import afterwards to skip the import step and use the existing SQLite database. This will allow you to try grouping by something different, or using a new limit, without needing to parse all that data again.

twarc2 hashtags tweets.jsonl hashtags.csv --group week --limit 10 --no-import

Maybe some fancier output than CSV will get added over time (ideas are welcome). Having the output in CSV means you can pretty easily drop it into another tool, like D3, GoogleSheets, Tableau, DataWrapper, etc.

I’ve been meaning to try RawGraphs after seeing it come up in a thread that Anne Helmond kicked off about some of her data visualization work some time ago. Here’s what the top 25 hashtags look like over time as a BumpGraph. If you want you can use the rawgraphs file (it’s JSON) to upload it yourself and tweak it.

If you click on the image it might be a bit easier to read (you can hover on the streams to see what they are)–but clearly it needs a bit of work still. One thing it shows pretty clearly though is the emergence of #QuemMandouMatarMarielle (Who Ordered Marielle to be Killed), the yellow band, that started the year after her murder, and has continued since.

If you get a chance to try the twarc-hashtags plugin please let me know!