Languages on Twitter.

There have been some interesting visualizations of languages in use on Twitter, like this one done by Gnip and published in the New York Times. Recently I’ve been involved in some research on particular a topical collection of tweets. One angle that’s been particularly relevant for this dataset is language. When perusing some of the tweet data we retrieved from Twitter’s API we noticed that there were two lang properties in the JSON. One was attached to the embedded user profile stanza, and the other was a top level property of the tweet itself.

We presumed that the user profile language was the language the user (who submitted the tweet) had selected, and that the second language on the tweet was the language of the tweet itself. The first is what Gnip used in its visualization. Interestingly, Twitter’s own documentation for the /get/statuses/:id API call only shows the user profile language.

When you send a tweet you don’t indicate what language it is in. For example you might indicate in your profile that you speak primarily English, but send some tweets in French. I can only imagine that detecting language for each tweet isn’t a cheap operation for the scale that Twitter operates at. Milliseconds count when you are sending 500 million tweets a day, in real time. So at the time I was skeptical that we were right…but I added a mental note to do a little experiment.

This morning I noticed my friend Dan had posted a tweet in Hebrew, and figured now was as a good a time as any.

I downloaded the JSON for the Tweet from the Twitter API and sure enough, the user profile had language en and the tweet itself had language iw which is the deprecated ISO 639-1 code for Hebrew (current is he. Here’s the raw JSON for the tweet, search for lang:

{
  "contributors": null,
  "truncated": false,
  "text": "\u05d0\u05e0\u05d7\u05e0\u05d5 \u05e0\u05ea\u05d2\u05d1\u05e8",
  "in_reply_to_status_id": null,
  "id": 540623422469185537,
  "favorite_count": 2,
  "source": "<a href=\"http://tapbots.com/software/tweetbot/mac\" rel=\"nofollow\">Tweetbot for Mac</a>",
  "retweeted": false,
  "coordinates": null,
  "entities": {
    "symbols": [],
    "user_mentions": [],
    "hashtags": [],
    "urls": []
  },
  "in_reply_to_screen_name": null,
  "id_str": "540623422469185537",
  "retweet_count": 0,
  "in_reply_to_user_id": null,
  "favorited": true,
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "profile_text_color": "333333",
    "default_profile_image": false,
    "id": 17981917,
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "verified": false,
    "profile_location": null,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "profile_sidebar_fill_color": "DDFFCC",
    "entities": {
      "description": {
        "urls": []
      }
    },
    "followers_count": 1841,
    "profile_sidebar_border_color": "BDDCAD",
    "id_str": "17981917",
    "profile_background_color": "9AE4E8",
    "listed_count": 179,
    "is_translation_enabled": false,
    "utc_offset": -18000,
    "statuses_count": 14852,
    "description": "",
    "friends_count": 670,
    "location": "Washington DC",
    "profile_link_color": "0084B4",
    "profile_image_url": "http://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/17981917/1354047961",
    "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "name": "Dan Chudnov",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 1212,
    "screen_name": "dchud",
    "notifications": false,
    "url": null,
    "created_at": "Tue Dec 09 02:56:15 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "geo": null,
  "in_reply_to_user_id_str": null,
  "lang": "iw",
  "created_at": "Thu Dec 04 21:47:22 +0000 2014",
  "in_reply_to_status_id_str": null,
  "place": null
}

Although tweets are short they certainly can contain multiple languages. I was curious what would happen if I tweeted two words, one in English and one in French.

When I fetched the JSON data for this tweet the language of the tweet was indicated to be pt or Portuguese! As far as I know neither testing nor essai are Portuguese.

This made me think perhaps the tweet was a bit short so I tried something a bit longer, with the number of words in each language being equal.

This one came across with lang fr. So having the text be a bit longer helped in this case. Admittedly this isn’t a very sound experiment, but it seems interesting and useful to see that Twitter is detecting language in tweets. It isn’t perfect, but that shouldn’t be surprising at all given the nature of human language. It might be useful to try a more exhaustive test using a more complete list of languages to see how it fairs. I’m adding another mental note…

Inter-face


Image from page 315 of “The elements of astronomy; a textbook” (1919)

Every document, every moment in every document, conceals (or reveals) an indeterminate set of interfaces that open into alternate spaces and temporal relations.

Traditional criticism will engage this kind of radiant textuality more as a problem of context than a problem of text, and we have no reason to fault that way of seeing the matter. But as the word itself suggests, “context” is a cognate of text, and not in any abstract Barthesian sense. We construct the poem’s context, for example, by searching out the meanings marked in the physical witnesses that bring the poem to us. We read those witnesses with scrupulous attention, that is to say, we make our detailed way through the looking glass of the book and thence to the endless reaches of the Library of Babel, where every text is catalogued and multiple cross-referenced. In making the journey we are driven far out into the deep space, as we say these days, occupied by our orbiting texts. There objects pivot about many different points and poles, the objects themselves shapeshift continually and the pivots move, drift, shiver, and even dissolve away. Those transformations occur because “the text” is always a negotiated text, half perceived and half created by those who engage with it.

Radiant Textuality by Jerome McGann

Social Machines and the Archive

Yesterday MIT announced that Twitter made a 5 million dollar investment to help them create a Laboratory for Social Machines (LSM) as part of the MIT Media Lab proper:

It seems like an important move for MIT to formally recognize that social media is a new medium that deserves its own research focus, and investment in infrastructure. The language on the homepage gives a nice flavor for the type of work they plan to be doing. I was particularly struck by their frank assessment of how our governance systems are failing us, and social media’s potential role in understanding and helping solve the problems we face:

In a time of growing political polarization and institutional distrust, social networks have the potential to remake the public sphere as a realm where institutions and individuals can come together to understand, debate and act on societal problems. To date, large-scale, decentralized digital networks have been better at disrupting old hierarchies than constructing new, sustainable systems to replace them. Existing tools and practices for understanding and harnessing this emerging media ecosystem are being outstripped by its rapid evolution and complexity.

Their notion of “social machines” as “networked human-machine collaboratives” reminds me a lot of my somewhat stumbling work on @congressedits and archiving Ferguson Twitter data. As Nick Diakopoulos has pointed out we really need a theoretical framework for thinking about what sorts of interactions these automated social media agents can participate in, formulating their objectives, and for measuring their effects. Full disclosure: I work with Nick at the University of Maryland, but he wrote that post mentioning me before we met here, which was kind of awesome to discover after the fact.

Some of the news stories about the Twitter/MIT announcement have included this quote from Deb Roy from MIT who will lead the LSM:

The Laboratory for Social Machines will experiment in areas of public communication and social organization where humans and machines collaborate on problems that can’t be solved manually or through automation alone.

What a lovely encapsulation of the situation we find ourselves in today, where the problems we face are localized and yet global. Where algorithms and automation are indispensable for analysis and data gathering, but people and collaborative processes are all the more important. The ethical dimensions to algorithms and our understanding of them is also of growing importance, as the stories we read are mediated more and more by automated agents. It is super that Twitter has decided to help build this space at MIT where people can answer these questions, and have the infrastructure to support asking them.

When I read the quote I was immediately reminded of the problem that some of us were discussing at the last Society of American Archivists meeting in DC: how do we document the protests going on in Ferguson?

Much of the primary source material was being distributed through Twitter. Internet Archive were looking for nominations of URLs to use in their web crawl. But weren’t all the people tweeting about Ferguson including URLs for stories, audio and video that were of value? If people are talking about something can we infer its value in an archive? Or rather, is it a valuable place to start inferring from?

I ended up archiving 13 million of the tweets that mention “ferguson” for the 2 week period after the killing of Michael Brown. I then went through the URLs in these tweets, and unshortened them and came up with a list of 417,972 unshortened URLs. You can see the top 50 of them here, and the top 50 for August 10th (the day after Michael Brown was killed) here.

I did a lot of this work in prototyping mode, writing quick one off scripts to do this and that. One nice unintended side effect was unshrtn which is a microservice for unshortening URLs, which John Kunze gave me the idea for years ago. It gets a bit harder when you are unshortening millions of URLs.

But what would a tool look like that let us analyze events in social media, and helped us (archivists) collect information that needs to be preserved for future use? These tools are no doubt being created by those in positions of power, but we need them for the archive as well. We also desperately need to explore what it means to explore these archives: how do we provide access to them, and share them? It feels like there could be a project here along the lines of what George Washington University University are doing with their Social Feed Manager. Full disclosure again: I’ve done some contracting work with the fine folks at GW on a new interface to their library catalog.

The 5 million dollars aside, an important contribution that Twitter is making here (that’s probably worth a whole lot more) is firehose access to the Tweets that are happening now, as well as the historic data. I suspect Deb Roy’s role at MIT as a professor and as Chief Media Scientist at Twitter helped make that happen. Since MIT has such strong history of supporting open research, it will be interesting to see how the LSM chooses to share data that supports its research.