Languages on Twitter.

There have been some interesting visualizations of languages in use on Twitter, like this one done by Gnip and published in the New York Times. Recently I’ve been involved in some research on a particular topical collection of tweets. One angle that’s been particularly relevant for this dataset is language. When perusing some of the tweet data we retrieved from Twitter’s API we noticed that there were two lang properties in the JSON. One was attached to the embedded user profile stanza, and the other was a top-level property of the tweet itself.

We presumed that the user profile language was the language the user (who submitted the tweet) had selected, and that the second language on the tweet was the language of the tweet itself. The first is what Gnip used in its visualization. Interestingly, Twitter’s own documentation for the /get/statuses/:id API call only shows the user profile language.
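To make the distinction concrete, here is a minimal sketch of reading both values from a decoded tweet. It assumes only the layout described above (a top-level lang plus a lang inside the embedded user object); the tweet_languages helper and the trimmed sample record are my own illustration, not part of any Twitter library.

```python
import json


def tweet_languages(tweet):
    """Return (profile_lang, tweet_lang) for a decoded tweet.

    Either value may be None if the property is absent.
    """
    # The user stanza can be missing or null, so fall back to an empty dict.
    profile_lang = (tweet.get("user") or {}).get("lang")
    tweet_lang = tweet.get("lang")
    return profile_lang, tweet_lang


# A trimmed, made-up record standing in for a real API response.
sample = json.loads('{"text": "bonjour", "lang": "fr", "user": {"lang": "en"}}')

profile_lang, tweet_lang = tweet_languages(sample)
print("profile:", profile_lang, "tweet:", tweet_lang)
```

Keeping the two values separate matters for any analysis: counting by profile language and counting by tweet language can give quite different pictures of the same collection.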

When you send a tweet you don’t indicate what language it is in. For example you might indicate in your profile that you speak primarily English, but send some tweets in French. I can only imagine that detecting language for each tweet isn’t a cheap operation for the scale that Twitter operates at. Milliseconds count when you are sending 500 million tweets a day, in real time. So at the time I was skeptical that we were right…but I added a mental note to do a little experiment.

This morning I noticed my friend Dan had posted a tweet in Hebrew, and figured now was as good a time as any.

I downloaded the JSON for the tweet from the Twitter API and, sure enough, the user profile had language en and the tweet itself had language iw, which is the deprecated ISO 639-1 code for Hebrew (the current code is he). Here’s the raw JSON for the tweet; search for lang:

{
  "contributors": null,
  "truncated": false,
  "text": "\u05d0\u05e0\u05d7\u05e0\u05d5 \u05e0\u05ea\u05d2\u05d1\u05e8",
  "in_reply_to_status_id": null,
  "id": 540623422469185537,
  "favorite_count": 2,
  "source": "<a href=\"\" rel=\"nofollow\">Tweetbot for Mac</a>",
  "retweeted": false,
  "coordinates": null,
  "entities": {
    "symbols": [],
    "user_mentions": [],
    "hashtags": [],
    "urls": []
  },
  "in_reply_to_screen_name": null,
  "id_str": "540623422469185537",
  "retweet_count": 0,
  "in_reply_to_user_id": null,
  "favorited": true,
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "profile_text_color": "333333",
    "default_profile_image": false,
    "id": 17981917,
    "profile_background_image_url_https": "",
    "verified": false,
    "profile_location": null,
    "profile_image_url_https": "",
    "profile_sidebar_fill_color": "DDFFCC",
    "entities": {
      "description": {
        "urls": []
      }
    },
    "followers_count": 1841,
    "profile_sidebar_border_color": "BDDCAD",
    "id_str": "17981917",
    "profile_background_color": "9AE4E8",
    "listed_count": 179,
    "is_translation_enabled": false,
    "utc_offset": -18000,
    "statuses_count": 14852,
    "description": "",
    "friends_count": 670,
    "location": "Washington DC",
    "profile_link_color": "0084B4",
    "profile_image_url": "",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "",
    "profile_background_image_url": "",
    "name": "Dan Chudnov",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 1212,
    "screen_name": "dchud",
    "notifications": false,
    "url": null,
    "created_at": "Tue Dec 09 02:56:15 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "geo": null,
  "in_reply_to_user_id_str": null,
  "lang": "iw",
  "created_at": "Thu Dec 04 21:47:22 +0000 2014",
  "in_reply_to_status_id_str": null,
  "place": null
}
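Because the tweet-level value here is the retired code iw rather than he, anything that groups tweets by language would probably want to fold the old codes into the current ones first. A minimal sketch follows; the normalize_lang helper is hypothetical, and the table covers only the three ISO 639-1 codes I know to have been replaced (iw, in, ji). I have only observed iw coming back from Twitter, so the other two are included as an assumption.

```python
# Retired ISO 639-1 codes mapped to their current replacements.
# Only "iw" was actually seen in the tweet above; the others are included
# on the assumption that they could show up the same way.
DEPRECATED_ISO_639_1 = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_lang(code):
    """Lowercase a language code and replace it if it is a retired one."""
    if code is None:
        return None
    code = code.lower()
    return DEPRECATED_ISO_639_1.get(code, code)


print(normalize_lang("iw"))
print(normalize_lang("en"))
```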

Although tweets are short they certainly can contain multiple languages. I was curious what would happen if I tweeted two words, one in English and one in French.

When I fetched the JSON data for this tweet the language of the tweet was indicated to be pt, or Portuguese! As far as I know neither testing nor essai is a Portuguese word.

This made me think perhaps the tweet was a bit short so I tried something a bit longer, with the number of words in each language being equal.

This one came across with lang fr. So having the text be a bit longer helped in this case. Admittedly this isn’t a very sound experiment, but it seems interesting and useful to see that Twitter is detecting language in tweets. It isn’t perfect, but that shouldn’t be surprising at all given the nature of human language. It might be useful to try a more exhaustive test using a more complete list of languages to see how it fares. I’m adding another mental note…
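If I ever act on that mental note, the bookkeeping could look something like the sketch below. It assumes the test tweets have already been posted and fetched, leaving a list of (expected, detected) language pairs; the summarize function is hypothetical and the pairs at the bottom are invented placeholders, not real measurements.

```python
from collections import Counter, defaultdict


def summarize(results):
    """Tally detection results per expected language.

    results is an iterable of (expected, detected) language-code pairs.
    """
    tallies = defaultdict(Counter)
    for expected, detected in results:
        tallies[expected][detected] += 1

    summary = {}
    for expected, counts in tallies.items():
        total = sum(counts.values())
        correct = counts[expected]  # a Counter yields 0 for a missing key
        summary[expected] = {
            "total": total,
            "correct": correct,
            "accuracy": correct / total,
            # Which wrong languages were reported, and how often.
            "confusions": {d: n for d, n in counts.items() if d != expected},
        }
    return summary


# Invented placeholder pairs, only to show the shape of the input.
pairs = [("fr", "fr"), ("fr", "pt"), ("en", "en")]
for lang, stats in sorted(summarize(pairs).items()):
    print(lang, stats)
```

Recording the confusions alongside the accuracy seems worthwhile, since a result like pt for a two-word English and French tweet says more than a bare miss would.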

An Invitation to Study

These are some brief remarks I prepared for a 5 minute lightning talk at the Ferguson Town Hall meeting at UMD on December 3, 2014.

Thank you for the opportunity to speak here this evening. It is a real privilege. I’d like to tell you a little bit about an archive of 13 million Ferguson-related tweets we’ve assembled here at the University of Maryland. You can see a random sampling of some of them up on the screen here. The 140 characters of text in a tweet make up only about 2% of the data for each tweet. The other 98% includes metadata such as who sent it, their profile, how many followers they have, what tweet they are replying to, who they are retweeting, when the tweet was sent, (sometimes) where the tweet was sent from, and embedded images and video. I’m hoping that I can interest some of you in studying this data.

I intentionally used the word privilege in my opening sentence to recognize that my ethnicity and my gender enabled me to be here speaking to you this evening. I’d actually like to talk very quickly about a different set of privileges I have: those of my profession, and those of a member of the UMD academic community that we are all a part of.

In a Democracy Now interview back in August, hip-hop artist and activist Talib Kweli characterized the Ferguson related activity on Twitter in a deeply insightful way:

I remember a world before the Internet. And I remember what it really takes to have movement on the ground. Someone tweeted me back and said, “Well, you know, back in the day they didn’t have Twitter, but they had letters, and they wrote letters to each other, so…” I said, “Yeah, but ain’t nobody saying that the letters started the revolution.”

When I look at the Green Revolution, when I look what happened to Egypt, when I look at what happened to Occupy Wall Street, yeah, the tweets helped—they helped a lot—but without those bodies in the street, without the people actually being there, ain’t nothing to tweet about. If Twitter worked like that, Joseph Kony would be locked down in a jail right now.

Of course, Talib is right. It’s why we are all here this evening. In some sense it doesn’t matter what happens on Twitter. What matters is what happened in Ferguson, what is happening in Ferguson, and in meetings and demonstrations like this one all around the country.

Talib’s comparison of a tweet to a letter struck me as particularly insightful. I work as an archivist and software developer in the Maryland Institute for Technology in the Humanities here at UMD. Humanities scholars have traditionally studied a particular set of historical materials, of which letters are one. These materials form the heart of what we call the archive. What gets collected in archives and studied is inevitably what forms our cultural canon. It is a site of controversy, for as George Orwell wrote in 1984:

Who controls the past controls the future. Who controls the present controls the past.

Would we be here tonight if it wasn’t for Twitter? Would the President be talking about Ferguson if it wasn’t for the groundswell of activity on Twitter? Without Twitter what would the mainstream media have reported about Mike Brown, John Crawford, Eric Garner, Renisha McBride, Trayvon Martin, and Tamir Rice? As you know this list of injustices is long…it is vast and overwhelming. It extends back to the beginnings of this state and this country. But what trace do we have of these injustices and these struggles in our archives?

The famed historian and social activist Howard Zinn said this when addressing a group of archivists in 1970:

the existence, preservation and availability of archives, documents, records in our society are very much determined by the distribution of wealth and power. That is, the most powerful, the richest elements in society have the greatest capacity to find documents, preserve them, and decide what is or is not available to the public. This means government, business and the military are dominant.

This is where social media and the Web present such a profoundly new opportunity for us, as we struggle to understand what happened in Ferguson…as we struggle to understand how best to act in the present. We need to work to make sure the voices of Ferguson are available for study–and not just in the future, but now. Let’s put our privilege as members of this academic community to work. Is there something to learn in these 13 million tweets, these letters from ordinary people, and in the thousands of videos, photographs, and links to stories? I think there is. I’m hopeful that these digital traces provide us with new insight into an old problem…insight that can guide our actions here in the present.

If you have questions you’d like to ask of the data please get in touch with either me or Neil Fraistat (Director of MITH) here tonight, or via email or Twitter.