After seeing Gina's tweet, I was curious whether there was any difference by gender in the tweets directed at @adriarichards during the recent controversy at PyCon. I wasn't confident I would find anything. It was more a feeble attempt to use Python to make sense of something senseless that happened at PyCon; or, to paraphrase "Physician, heal thyself", for Python to heal itself.
I used twarc to collect 13,472 tweets that mentioned @adriarichards from the search API. I then added a utility filter that uses genderator to filter the line-oriented JSON based on a guess at the author's gender (Twitter doesn't track it). genderator identified 2,433 (18%) tweets from women, 5,268 (39%) from men, and 5,771 (43%) of unknown gender. I then added another utility that reads a stream of tweets and generates a tag cloud as a standalone HTML file using d3-cloud.
I put them all together on the command line like this:
% twarc.py @adriarichards
% cat @adriarichards-20130321200320.json | utils/gender.py --gender male | utils/wordcloud.py > male.html
% cat @adriarichards-20130321200320.json | utils/gender.py --gender female | utils/wordcloud.py > female.html
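For anyone curious what the gender filtering step in that pipeline might look like, here is a minimal sketch. It is not the actual utils/gender.py: the `guess_gender` function and its tiny `SAMPLE_NAMES` table are hypothetical stand-ins for a name-based guesser like genderator, and I'm assuming each line of input is one tweet as JSON with the author's display name at `user.name`.

```python
import json

# Hypothetical stand-in for a name-based gender guesser such as genderator;
# a real guesser would consult a much larger list of first names.
SAMPLE_NAMES = {"alice": "female", "gina": "female", "bob": "male", "ed": "male"}


def guess_gender(full_name):
    """Guess a gender from the first word of a display name."""
    parts = full_name.strip().split()
    if not parts:
        return "unknown"
    return SAMPLE_NAMES.get(parts[0].lower(), "unknown")


def filter_by_gender(lines, wanted, guess=guess_gender):
    """Yield the JSON lines whose author's guessed gender equals `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        tweet = json.loads(line)
        # Assumption: the author's display name lives at tweet["user"]["name"].
        name = tweet.get("user", {}).get("name", "")
        if guess(name) == wanted:
            yield line
```

Wired up as a command line filter, this would amount to looping over `filter_by_gender(sys.stdin, "male")` and printing each line, so the output stays line-oriented JSON that the next tool in the pipe can read.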
I realize word clouds probably aren't the greatest way to visualize the differences in these messages. If you have better ideas, let me know. I made the tweet JSON available if you want to try your own visualization.
Looking at these didn't yield much insight. So instead of visualizing all the words that each gender used, I wondered what the clouds would look like if I limited them to words that were used by only one gender. In other words, what words did males use in their tweets that were not used by females, and vice versa? There were 1,333 (11%) uniquely female words, and 4,767 (39%) uniquely male words, with a shared vocabulary of 5,988 (50%) words.
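The unique and shared vocabularies come down to set differences and an intersection. Here is a rough sketch of that idea, assuming a simple lowercase, regex-based tokenizer (the tokenization I actually used may differ, and the function names here are just for illustration). The percentages above are shares of the combined vocabulary, 1,333 + 4,767 + 5,988 = 12,088 distinct words.

```python
import re


def vocabulary(texts):
    """Return the set of lowercased word tokens found in `texts`."""
    words = set()
    for text in texts:
        # Assumed tokenizer: runs of letters, digits and apostrophes.
        words.update(re.findall(r"[a-z0-9']+", text.lower()))
    return words


def split_vocabulary(female_texts, male_texts):
    """Partition the combined vocabulary into unique and shared words."""
    female = vocabulary(female_texts)
    male = vocabulary(male_texts)
    return {
        "female_only": female - male,  # words only women used
        "male_only": male - female,    # words only men used
        "shared": female & male,       # words both groups used
    }


def shares(parts):
    """Return each part's size as a rounded percentage of all distinct words."""
    total = sum(len(words) for words in parts.values())
    if total == 0:
        return {}
    return {name: round(100 * len(words) / total) for name, words in parts.items()}
```

Feeding only the `female_only` or `male_only` words to the word cloud step gives the restricted clouds.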
I'm not sure there is much more insight here either. I guess there is some weak comfort in the knowledge that half of the words used in these tweets were shared by both sexes.