and then the web happened

Here is the text of my talk I’m giving at Wikimania today, and the slides.

Let me begin by saying thank you to the conference organizers for accepting my talk proposal. I am excited to be here at my first Wikimania conference, and hope that it will be the first of many. Like millions of other people around the world, I use Wikipedia every day at work and at home. In the last three years I’ve transitioned from being a consumer to a producer, by making modest edits to articles about libraries, archives, and occasionally music. I had heard horror stories of people having their work reverted and deleted, so I was careful to cite material in my edits. I was pleasantly surprised when editors swooped in not to delete my work, but to improve it. So, I also want to say thanks to all of you for creating such an improbably open and alive community. I know there is room for improvement, but it’s a pretty amazing thing you all have built.

And really, that is what my talk about Wikistream is all about. Wikistream was born out of a desire to share just how amazing the Wikipedia community is with people who didn’t know it already. I know, I’m preaching to the choir. I also know that I’m speaking in the Technology and Infrastructure track, and I promise to get to some details about how Wikistream works. But really, there’s nothing that radically new in Wikistream–and perhaps what I’m going to say would be more appropriate for the GLAM track, or a performance art track, if there were one. If you are a multi-tasker and want to listen to me with one ear, please pull up http://wikistream.inkdroid.org in your browser, and try to make sense of it as I talk. Let’s see what breaks first, the application or the wi-fi connection–hopefully neither.

Wikipedia and the BBC

A couple years ago I was attending the dev8d conference in London and dropped into the 2nd Linked Data Meetup that happened to be going on nearby. Part of the program included presentations from Tom Scott, Silver Oliver and Georgi Kobilarov about some work they did at the BBC. They demo’d two web applications, the BBC Wildlife Finder and BBC Music, that used Wikipedia as a content management platform ¹, ².

If I’m remembering right it was Tom who demonstrated how an edit to a Wikipedia article resulted in the content being immediately updated at the BBC. It seemed like magic. More than that it struck me as mind-blowingly radical for an organization like the BBC to tap into the Wikipedia platform and community like this.

After a few questions I learned from Georgi that part of the magic of this update mechanism was a bot that the BBC created which sits in the #en.wikipedia IRC chatroom, where edits are announced ⁴. I logged into the chatroom and was astonished by the number of edits flying by:

And remember, this was just the English language Wikipedia channel. There are more than 730 other Wikimedia-related channels where updates are announced. The BBC’s use of Wikipedia really resonated with me, but to explain why I need to back up a little bit more.

Crowdsourcing in the Library

I work as a software developer at the Library of Congress. In developing web applications there I often end up using data about books, people and topics that have been curated for hundreds of years, and which began to be made available in electronic form in the early 1970s. The library community has had a longstanding obsession with collaboration, or (dare I say) crowdsourcing, to maintain its information about the bibliographic universe. Librarians would most likely call it cooperative cataloging instead of crowdsourcing, but the idea is roughly the same.

As early as 1850, Charles Jewett proposed that the Smithsonian be established as the national library of the United States, which would (among other things) collect the catalogs of libraries all around the country ³. The Smithsonian wasn’t as sure as Jewett, so it wasn’t until the 1890s that we saw his ideas take hold when the Library of Congress assumed the role of the national library, and home to the Copyright Office. To this day, copyright registration results in a copy of a registered book being deposited at the Library of Congress. In 1901 the Library of Congress established its printed card service which made its catalog cards available to libraries around the United States and the world.

This meant that a book could be cataloged once by one of the growing army of catalogers at the Library of Congress, instead of the same work being done over and over by all the libraries all over the country. But the real innovation happened in 1971 when Fred Kilgour’s dream of an online shared cataloging database was powered up at OCLC. This allowed a book to be cataloged by any library, and instantly shared with other libraries around the country. It was at this point that the cataloging became truly cooperative, because catalogers could be anywhere, at any member institution, and weren’t required to be in an office at the Library of Congress.

This worked for a bit, but then the Web happened. As the Web began to spread in the mid to late 1990s the library community got it into their head that they would catalog it, with efforts like the Cooperative Online Resource Catalog. But the Web was growing too fast, there just weren’t enough catalogers who cared, and the tools weren’t up to the task, so the project died.

So when I saw Tom, Silver and Georgi present on the use of Wikipedia as a curated content platform at the BBC, and saw how active the community was, I had a bit of a light bulb moment. It wasn’t an if-you-can’t-beat-em-join-em moment in which libraries and other cultural heritage organizations (like the BBC) fade into the background and become irrelevant, but one in which Wikipedia helps libraries do their job better…and maybe libraries can help make Wikipedia better. It just so happened that this was right as the Galleries, Libraries, Archives and Museums (GLAM) effort was kicking off at Wikipedia. I really wanted to be able to help show librarians and others not likely to drop into an IRC chat how active the Wikipedia community was, and that’s how Wikistream came to be.

How

So now that you understand the why of Wikistream I’ll tell you briefly about the how. When I released Wikistream I got this really nice email from Ward Cunningham, who is a personal hero of mine, and I imagine a lot of you too:

To: wiki-research-l@lists.wikimedia.org
From: Ward Cunningham <ward@c2.com>
Subject: Re: wikistream: displays wikipedia updates in realtime
Date: Jun 16, 2011 7:43:11 am

I've written this app several times using technology from text-to-speech 
to quartz-composer. I have to tip my hat to Ed for doing a better job 
than I ever did and doing it in a way that he makes look effortless. 
Kudos to Ed for sharing both the page and the software that produces 
it. You made my morning. -- Ward

Sure enough, my idea wasn’t really new at all. But at least I was in good company. I was lucky to stumble across the idea for Wikistream when a Google search for streaming to the browser pulled up SocketIO. If you haven’t seen it before SocketIO is a JavaScript library that allows you to easily stream data to the browser without needing to care about the transport mechanisms that the browser supports: WebSocket, Adobe FlashSocket, AJAX long polling, AJAX multipart-streaming, Forever iframe, JSONP Polling. It autodetects the capabilities of the browser and the server, and gives you a simple callback API for publishing and consuming events. For example here is the code that runs in your browser to connect to the server and start getting updates:

$(document).ready(function() {
  var socket = io.connect();
  socket.on('message', function(msg) {
    addUpdate(msg);
  });
});

There’s a bit more to it, like loading the SocketIO library, and the details of adding the information about the change stored in the msg JavaScript object (more on that below) to the DOM, but SocketIO makes the hard part of streaming data from the server to the client easy.
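The addUpdate function is not shown in this post. As a rough sketch only, the rendering step could be a small function that turns one update into a line of HTML. The helper names formatUpdate and escapeHtml, the markup, and the #updates element are all hypothetical; the fields used (page, url, user, delta) are the ones that appear in the update object shown later in this post.

```javascript
// Hypothetical helper: escape text before placing it in HTML,
// since page titles and user names come from the outside world.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build one line of HTML for an update.
// The markup is an assumption, not what Wikistream actually emits.
function formatUpdate(msg) {
  var sign = msg.delta > 0 ? '+' : '';
  return '<div class="update">' +
    '<a href="' + escapeHtml(msg.url) + '">' + escapeHtml(msg.page) + '</a>' +
    ' <span class="delta">(' + sign + msg.delta + ')</span>' +
    ' ' + escapeHtml(msg.user) +
    '</div>';
}

// In the browser, addUpdate could then be as small as:
//   function addUpdate(msg) { $('#updates').prepend(formatUpdate(msg)); }
```

Keeping the string building separate from the DOM call makes the formatting easy to try out in Node without a browser.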

Of course you need a server to send the updates, and that’s where things get a bit more interesting. SocketIO is designed to run in a NodeJS environment with the Express web framework. Once you have your webapp set up, you can add SocketIO to it:

var express = require("express");
var sio = require("socket.io");

var app = express.createServer();
// set up standard app routes/views
var io = sio.listen(app);

Then the last bit is to do the work of listening to the IRC chatrooms and pushing the updates out to the clients that want to be updated. To make this a bit easier I created a reusable library called wikichanges that abstracts away the business of connecting to the IRC channels and parsing the status updates into a JavaScript object, and lets you pass in a callback function that will be given updates as they occur.

var wikichanges = require('wikichanges');

var w = new wikichanges.WikiChanges();
w.listen(function(msg) {
  io.sockets.emit('message', msg);
});

This results in updates being delivered as JSON objects to the client code we started with, where each update looks something like:

{ 
  channel: '#en.wikipedia',
  wikipedia: 'English Wikipedia',
  page: 'Persuasion (novel)',
  pageUrl: 'http://en.wikipedia.org/wiki/Persuasion_(novel)',
  url: 'http://en.wikipedia.org/w/index.php?diff=498770193&oldid=497895763',
  delta: -13,
  comment: '/* Main characters */',
  wikipediaUrl: 'http://en.wikipedia.org',
  user: '108.49.244.224',
  userUrl: 'http://en.wikipedia.org/wiki/User:108.49.244.224',
  unpatrolled: false,
  newPage: false,
  robot: false,
  anonymous: true,
  namespace: 'Article',
  flag: ''
}
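Because every update carries the same fields, it is easy to filter or summarize the stream before it reaches the browser. Here is a small sketch of that idea. The function names and the filtering policy are hypothetical and are not part of Wikistream or wikichanges; only the field names (robot, namespace, wikipedia) come from the object above.

```javascript
// Hypothetical policy: keep only article edits that were not made by a bot.
function isHumanArticleEdit(msg) {
  return !msg.robot && msg.namespace === 'Article';
}

// Hypothetical summary: count how many updates came from each wiki.
function tallyByWiki(updates) {
  var counts = {};
  updates.forEach(function (msg) {
    counts[msg.wikipedia] = (counts[msg.wikipedia] || 0) + 1;
  });
  return counts;
}
```

On the server, a filter like this could sit inside the listen callback, so that io.sockets.emit('message', msg) is only called when isHumanArticleEdit(msg) is true.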

As I already mentioned I extracted the interesting bit of connecting to the IRC chatrooms, and parsing the IRC colored text into a JavaScript object as a NodeJS library called wikichanges. Working with the stream of edits is surprisingly addictive, and I found myself wanting to create some other similar applications:

wikipulse: the rate of change of Wikipedias shown as a set of accelerator displays
wikitweets: a visualization of how Wikipedia is cited on Twitter
wikibeat: a musical exploration of how Wikipedia is changing, created by Dan Chudnov and Chris Burns
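To give a feel for the parsing step that wikichanges takes care of, here is a minimal sketch of turning one colored IRC line into an object. This is not the wikichanges source. The layout of the sample line is an assumption about what the announcement bot sends, and the names stripIrcFormatting, parseChange and sample are hypothetical.

```javascript
// Remove mIRC color codes (\x03 followed by up to two digits, optionally
// ",NN" for a background color) and bold/reset/reverse/underline bytes.
function stripIrcFormatting(line) {
  return line.replace(/\x03(\d{1,2}(,\d{1,2})?)?|[\x02\x0f\x16\x1f]/g, '');
}

// Parse a line of the assumed form:
//   [[Page title]] FLAGS URL * USER * (DELTA) COMMENT
// Returns null when the line does not look like an edit announcement.
function parseChange(line) {
  var plain = stripIrcFormatting(line);
  var m = plain.match(
    /^\[\[(.+?)\]\] (\S*) (\S+) \* (.+?) \* \(([+-]?\d+)\) ?(.*)$/
  );
  if (!m) return null;
  return {
    page: m[1],
    flag: m[2],
    url: m[3],
    user: m[4],
    delta: parseInt(m[5], 10),
    comment: m[6]
  };
}

// An assumed example line, built to match the update object shown earlier.
var sample = '\x0314[[\x0307Persuasion (novel)\x0314]]\x034 \x0310 ' +
  '\x0302http://en.wikipedia.org/w/index.php?diff=498770193&oldid=497895763\x03 ' +
  '\x035*\x03 \x0303108.49.244.224\x03 \x035*\x03 (-13) ' +
  '\x0310/* Main characters */\x03';

console.log(parseChange(sample));
```

Stripping the formatting first keeps the pattern readable. A real parser also has to cope with log entries, page moves and other lines that do not follow this shape, which is why the sketch returns null rather than guessing.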

So wikichanges is there to make it easier to bootstrap applications that want to do things with the Wikipedia update stream. Here is a brief demo of getting wikichanges working on a stock Ubuntu ec2 instance:

What’s Next?

So this was a bit of a wild ride, I hope you were able to follow along. I could have spent some time explaining why Node was a good fit for wikistream. Perhaps we can talk about that in the Q&A if there is any time for that. Let’s just say I actually reach for Python first when working on a new project, but the particular nature of this application, and tool availability, made Node a natural fit. Did we crash it yet?

The combination of the GLAM effort and Wikidata is poised to really transform the way cultural heritage organizations contribute to and use Wikipedia. I hope wikistream might help you make the case for Wikipedia in your organization as you make presentations. If you have ideas on how to use the wikichanges library to do something with the update stream I would love to hear about them.

1. Case Study: Use of Semantic Web Technologies on the BBC Web Sites by Yves Raimond, et al.
2. The Web as a CMS by Tom Scott.
3. Catalog It Once And For All: A History of Cooperative Cataloging in the United States Prior to 1967 (Before MARC) by Barbara Tillett, in Cooperative Cataloging: Past, Present, and Future. Psychology Press, 1993, page 5.
4. After hitting publish on this post I learned that the BBC’s bot was written by Patrick Sinclair.