future archives

It’s hard to read Yves Raimond and Tristan Ferne’s paper The BBC World Service Archive Prototype and not imagine a possible future for radio archives, archives on the Web, and archival description in general.

Actually, it’s not just the future, it’s also the present, as embodied in the BBC World Service Archive prototype itself, where you can search and listen to 45 years of radio, and pitch in by helping describe it if you want.

As their paper describes, Raimond and Ferne came up with some automated techniques to connect up text about the programs (derived directly from the audio, or indirectly through supplied metadata) to Wikipedia and DBpedia. This resulted in some 20 million RDF assertions that form the database the (very polished) web application sits on top of. Registered users can then help augment and correct these assertions. I can only hope that some of these users are actually BBC archivists, who can also help monitor and tune the descriptions provided by the general public.

Their story is full of win, so it’s understandable why the paper won the 2013 Semantic Web Challenge:

They used WikipediaMiner to take a first pass at entity extraction of the text they were able to collect for each program. The MapHub project uses WikipediaMiner for the same purpose of adding structure to otherwise unstructured text.
They used Amazon Web Services (aka the cloud) to do what would have taken them 4 years in the space of 2 weeks, for a fixed, one time cost.
They use ElasticSearch for search, instead of trying to squeeze that functionality and scalability out of a triple store.
They wanted to encourage curation of the content, so they put an emphasis on usability and design that is often absent from Linked Data prototypes.
They have written in more detail about the algorithms that they used to connect up their text to Wikipedia/DBpedia.
Their GitHub account reflects the nuts and bolts of how they did this work. Specifically, their rdfsim Python project, which vectorizes a SKOS hierarchy to determine the distance between concepts, seems like a really useful approach to disambiguating terms in text.
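
To make the idea of vectorizing a hierarchy a little more concrete, here is a minimal sketch in plain Python. It is not rdfsim's actual API or algorithm: the toy BROADER mapping, the concept names, the decay parameter, and the functions concept_vector and cosine are all made up for illustration. Each concept becomes a sparse vector over itself and its broader concepts, with weights that shrink the further up the hierarchy you go, and cosine similarity then gives a rough measure of how close two concepts are.

```python
import math
from collections import deque

# Hypothetical toy hierarchy: concept -> its broader concepts (think skos:broader).
# These names are illustrative only and are not taken from the BBC data.
BROADER = {
    "Brian_Eno": ["Ambient_musicians", "English_record_producers"],
    "David_Bowie": ["English_rock_singers", "English_record_producers"],
    "Ambient_musicians": ["Musicians"],
    "English_rock_singers": ["Musicians"],
    "English_record_producers": ["Record_producers"],
    "Musicians": ["Music_industry"],
    "Record_producers": ["Music_industry"],
    "Mercury_(planet)": ["Planets"],
    "Planets": ["Astronomy"],
}


def concept_vector(concept, broader=BROADER, decay=0.5):
    """Return a sparse vector {ancestor: weight} for a concept.

    The concept itself gets weight 1.0, and each broader concept gets
    decay ** depth, where depth is its shortest distance up the hierarchy.
    """
    vector = {}
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in vector:
            continue  # already reached by a path that was at least as short
        vector[node] = decay ** depth
        for parent in broader.get(node, []):
            queue.append((parent, depth + 1))
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


eno = concept_vector("Brian_Eno")
bowie = concept_vector("David_Bowie")
mercury = concept_vector("Mercury_(planet)")

# Concepts that share broader categories score higher than unrelated ones.
print(cosine(eno, bowie) > cosine(eno, mercury))
```

For disambiguation, the intuition is that if a program is already tagged with music-related concepts, a candidate like the planet Mercury should score lower against them than a musician of the same name would.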

But it is the (implied) role of the archivist, as the professional responsible for working with developers to tune these algorithms, evaluating/gauging user contributions, and helping describe the content themselves, that excites me the most about this work. The future role of the archive is at stake too. In another paper Raimond, Smethurst, McParland and Lowis describe how having this archival data allows them to augment live BBC News subtitles with links to the audio archive, where people can follow their nose (or ears in this case) to explore the context around news stories.

The fact that it’s RDF and Linked Data isn’t terribly important in all this. But the importance of using world-curated, openly licensed entities derived from Wikipedia cannot be overstated. It’s the conceptual glue that allows connections to be made. As Wikidata grows in importance at Wikipedia it will be interesting to see if it supplants the role that DBpedia has been playing to date.

And of course, it’s exciting because it’s not just anyone doing this, it’s the BBC.

My only nit is that it would be nice to see some of the structured data they’ve collected expressed more in their HTML. For example they have minted a URI for Brian Eno which lists radio programs that are related to him. Why not display his bio, and perhaps a picture? Why not link to other radio programs for people he is associated with, like David Byrne or David Bowie? Why not express some of this semantic metadata in microdata or RDFa in the page, to enable search engine optimization and reuse?

Luckily, it sounds like they have invested in the platform and data they would need to add these sorts of features.

PS. Apologies to the Mighty Boosh for the title of this post. “The future’s dead … Everyone’s looking back, not forwards.”