archiving wikitweets

Earlier this year I created a little toy webapp called wikitweets that uses the Twitter streaming API to identify tweets that reference Wikipedia, which it then displays in real time in your browser. It was basically a fun experiment to kick the tires on NodeJS and SocketIO using a free, single-process Heroku instance.

At the time I announced the app on the wiki-research-l discussion list to see if anyone was interested in it. Among the responses I received were ones from Emilio Rodríguez-Posada and Taha Yasseri, who asked whether the tweets are archived as they stream by. This struck a chord with me, since I’m a software developer working in the field of “digital preservation”. You know that feeling when you suddenly see one of your huge gaping blind spots? Yeah.

Anyway, some 6 months or so later I finally got around to adding an archive function to wikitweets, and I thought it might be worth writing about very quickly. Wikitweets uses the S3 API at Internet Archive to store the tweets in batches: every time 1000 tweets have streamed by, they are written out as a single file. So you can visit this page at Internet Archive and download the tweets. Now I don’t know how long Internet Archive is going to be around, but I bet it will be longer than inkdroid.org, so it seemed like a logical (and free) safe harbor for the data.
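
The batching idea is simple enough to sketch. The following is an illustration in Python rather than the actual NodeJS code in wikitweets; the TweetArchiver class and the store callable are hypothetical names, and the UTC timestamp used for the object name is an assumption based on the filename pattern shown later in this post.

```python
import json
import time


class TweetArchiver:
    """Buffer tweets and hand off one JSON batch each time the buffer fills."""

    def __init__(self, store, batch_size=1000):
        # store is a hypothetical callable, store(name, body), standing in
        # for whatever actually uploads the object to Internet Archive
        self.store = store
        self.batch_size = batch_size
        self.buffer = []

    def add(self, tweet):
        """Queue one tweet and flush once a full batch has accumulated."""
        self.buffer.append(tweet)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write the buffered tweets out as one object and reset the buffer."""
        if not self.buffer:
            return
        # assumed naming scheme: a UTC timestamp such as 20120919030946.json
        name = time.strftime("%Y%m%d%H%M%S", time.gmtime()) + ".json"
        self.store(name, json.dumps(self.buffer))
        self.buffer = []
```

One consequence of batching like this is that up to 999 tweets sit in memory at any moment, so a process restart would lose them unless flush is also called on shutdown.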

In addition to hosting the files, Internet Archive also makes a BitTorrent seed available, so the data can easily be distributed around the Internet. For example, you could open wikitweets_archive.torrent in your BitTorrent client and download a copy of the entire dataset, while providing a redundant copy. I don’t really expect this to happen much with the wikitweets collection, but it seems to be a practical offering in the Lots of Copies Keeps Stuff Safe category.

I tried to coerce several of the seemingly excellent S3 libraries for NodeJS to talk to Internet Archive, but ended up writing my own very small library that works specifically with Internet Archive. ia.js is bundled as part of wikitweets, but I guess I could put it on npm if anyone is really interested. It gets used by wikitweets like this:

  var c  = ia.createClient({
    accessKey: config.ia_access_key,
    secretKey: config.ia_secret_key,
    bucket: config.ia_bucket
  });

  // name must be defined outside the call so the callback can log it
  var name = "20120919030946.json";
  c.addObject({name: name, value: tweets}, function() {
    console.log("archived " + name);
  });

The nice thing is that you can use S3 libraries that have support for Internet Archive, like boto, to programmatically pull down the data. For example, here is a Python program that goes through each file and prints out the title of the Wikipedia article that each tweet references:

  import json
  import boto

  ia = boto.connect_ia()
  bucket = ia.get_bucket("wikitweets")

  # each key in the bucket is one JSON file holding a batch of tweets
  for keyfile in bucket:
      content = keyfile.get_contents_as_string()
      for tweet in json.loads(content):
          print(tweet['article']['title'])

The archiving has only been running for the last 24 hours or so, so I imagine there will be tweaks that need to be made. I’m considering compressing the tweets as one of them. It might also be nice to put the files in subdirectories, but it seemed that Internet Archive’s API wanted to URL encode object names that have slashes in them.
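
To give a sense of what the compression tweak might look like, here is a minimal sketch in Python using the standard library's gzip module. It is not what wikitweets does today; pack_tweets and unpack_tweets are hypothetical helper names, and gzip is just one reasonable choice of format.

```python
import gzip
import json


def pack_tweets(tweets):
    """Serialize a batch of tweets to gzip-compressed JSON bytes."""
    return gzip.compress(json.dumps(tweets).encode("utf-8"))


def unpack_tweets(blob):
    """Reverse pack_tweets: decompress the bytes and parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Since every tweet in a batch repeats the same JSON keys, the compressed files should come out noticeably smaller. The cost is that anyone reading the archive has to decompress each file before parsing it.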

If you have any suggestions I’d love to hear them.