Languages on Twitter.
There have been some interesting visualizations of languages in use on
Twitter, like this one done by Gnip and published in the New York Times.
Recently I’ve been involved in some research on a particular topical
collection of tweets. One angle that’s been particularly relevant for this
dataset is language. When perusing some of the tweet data we retrieved from
Twitter’s API we noticed that there were two lang properties in the JSON.
One was attached to the embedded user profile stanza, and the other was a
top-level property of the tweet itself.
We presumed that the user profile language was the language the user
(who submitted the tweet) had selected, and that the second language on
the tweet was the language of the tweet itself. The first is what Gnip
used in its visualization. Interestingly, Twitter’s own documentation for
the /get/statuses/:id API call only shows the user profile language.
When you send a tweet you don’t indicate what language it is in. For example you might indicate in your profile that you speak primarily English, but send some tweets in French. I can only imagine that detecting language for each tweet isn’t a cheap operation for the scale that Twitter operates at. Milliseconds count when you are sending 500 million tweets a day, in real time. So at the time I was skeptical that we were right…but I added a mental note to do a little experiment.
This morning I noticed my friend Dan had posted a tweet in Hebrew, and figured now was as good a time as any.
אנחנו נתגבר
— Dan Chudnov (@dchud) December 4, 2014
I downloaded the JSON for the tweet from the Twitter API and sure enough,
the user profile had language en and the tweet itself had language iw,
which is the deprecated ISO 639-1 code for Hebrew (the current code is he).
Here’s the raw JSON for the tweet; search for lang:
{
  "contributors": null,
  "truncated": false,
  "text": "\u05d0\u05e0\u05d7\u05e0\u05d5 \u05e0\u05ea\u05d2\u05d1\u05e8",
  "in_reply_to_status_id": null,
  "id": 540623422469185537,
  "favorite_count": 2,
  "source": "Tweetbot for Mac",
  "retweeted": false,
  "coordinates": null,
  "entities": {
    "symbols": [],
    "user_mentions": [],
    "hashtags": [],
    "urls": []
  },
  "in_reply_to_screen_name": null,
  "id_str": "540623422469185537",
  "retweet_count": 0,
  "in_reply_to_user_id": null,
  "favorited": true,
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "profile_text_color": "333333",
    "default_profile_image": false,
    "id": 17981917,
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "verified": false,
    "profile_location": null,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "profile_sidebar_fill_color": "DDFFCC",
    "entities": {
      "description": {
        "urls": []
      }
    },
    "followers_count": 1841,
    "profile_sidebar_border_color": "BDDCAD",
    "id_str": "17981917",
    "profile_background_color": "9AE4E8",
    "listed_count": 179,
    "is_translation_enabled": false,
    "utc_offset": -18000,
    "statuses_count": 14852,
    "description": "",
    "friends_count": 670,
    "location": "Washington DC",
    "profile_link_color": "0084B4",
    "profile_image_url": "http://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/17981917/1354047961",
    "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "name": "Dan Chudnov",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 1212,
    "screen_name": "dchud",
    "notifications": false,
    "url": null,
    "created_at": "Tue Dec 09 02:56:15 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "geo": null,
  "in_reply_to_user_id_str": null,
  "lang": "iw",
  "created_at": "Thu Dec 04 21:47:22 +0000 2014",
  "in_reply_to_status_id_str": null,
  "place": null
}
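To make the two lang values easier to compare, here is a minimal Python sketch that pulls both out of a parsed tweet like the one above. The helper names tweet_languages and normalize are my own, not part of any Twitter library, and the small mapping of deprecated ISO 639-1 codes is an assumption about which codes are worth normalizing.

```python
import json

# Deprecated ISO 639-1 codes and their current replacements.
# Only "iw" shows up in the tweet above; the others are included for illustration.
DEPRECATED_CODES = {"iw": "he", "in": "id", "ji": "yi"}


def tweet_languages(tweet):
    """Return (profile_lang, tweet_lang) for a parsed tweet dict."""
    profile_lang = tweet.get("user", {}).get("lang")
    tweet_lang = tweet.get("lang")
    return profile_lang, tweet_lang


def normalize(code):
    """Map a deprecated language code to its current form, if there is one."""
    return DEPRECATED_CODES.get(code, code)


# A trimmed-down stand-in for the full JSON document above.
sample = json.loads('{"lang": "iw", "user": {"lang": "en"}}')
profile_lang, tweet_lang = tweet_languages(sample)
print(profile_lang, normalize(tweet_lang))
```

Using .get() keeps the sketch from failing on tweets that lack either property, which seems prudent given that the documentation only showed one of them.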
Although tweets are short they certainly can contain multiple languages. I was curious what would happen if I tweeted two words, one in English and one in French.
testing, essai
— Ed Summers (@edsu) December 15, 2014
When I fetched the JSON data for this tweet the language of the tweet was
indicated to be pt, or Portuguese! As far as I know neither testing nor
essai is Portuguese.
This made me think perhaps the tweet was a bit short, so I tried something a bit longer, with the number of words in each language being equal.
Désolé for le noise, je suis just seeing how détection de la language works.
— Ed Summers (@edsu) December 15, 2014
This one came across with lang fr. So having the text be a bit longer
helped in this case. Admittedly this isn’t a very sound experiment, but it
seems interesting and useful to see that Twitter is detecting language in
tweets. It isn’t perfect, but that shouldn’t be surprising at all given the
nature of human language. It might be useful to try a more exhaustive test
using a more complete list of languages to see how it fares. I’m adding
another mental note…
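A more exhaustive test could start from something like the following Python sketch, which scores detected languages against the language each test tweet was actually written in. The function name score_detection and the sample pairs are hypothetical placeholders, not real measurements. Collecting the pairs by posting tweets and reading back their lang property is left out.

```python
from collections import Counter

# Deprecated ISO 639-1 codes mapped to their current equivalents,
# so that a detected "iw" still counts as a match for "he".
DEPRECATED = {"iw": "he", "in": "id", "ji": "yi"}


def score_detection(pairs):
    """Return per-language accuracy for (expected, detected) code pairs."""
    total, correct = Counter(), Counter()
    for expected, detected in pairs:
        total[expected] += 1
        if DEPRECATED.get(detected, detected) == expected:
            correct[expected] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Made-up placeholder pairs, not results from Twitter.
pairs = [("he", "iw"), ("fr", "fr"), ("fr", "pt"), ("en", "en")]
scores = score_detection(pairs)
for lang, accuracy in sorted(scores.items()):
    print(lang, accuracy)
```

Mixed-language tweets like the ones above don’t have a single expected language, so a real test would need to decide how to label them before scoring.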