Tag Archives: dbpedia

future archives

It’s hard to read Yves Raimond and Tristan Ferne’s paper The BBC World Service Archive Prototype and not imagine a possible future for radio archives, archives on the Web, and archival description in general.

Actually, it’s not just the future, it’s also the present, as embodied in the BBC World Service Archive prototype itself, where you can search and listen to 45 years of radio, and pitch in by helping describe it if you want.

As their paper describes, Raimond and Ferne came up with some automated techniques to connect text about the programs (derived directly from the audio, or indirectly through supplied metadata) to Wikipedia and DBpedia. This resulted in some 20 million RDF assertions, which form the database that the (very polished) web application sits on top of. Registered users can then help augment and correct these assertions. I can only hope that some of these users are actually BBC archivists, who can also help monitor and tune the descriptions provided by the general public.
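At its simplest, that kind of entity linking boils down to spotting known surface forms in programme text and mapping them to DBpedia URIs. Here’s a toy sketch of the idea (the lookup table and names are invented; the real pipeline uses WikipediaMiner and is far more sophisticated):

```python
# Toy entity linking: map surface forms found in programme text to
# DBpedia URIs via a (hypothetical) lookup table of known labels.
SURFACE_FORMS = {
    "brian eno": "http://dbpedia.org/resource/Brian_Eno",
    "david bowie": "http://dbpedia.org/resource/David_Bowie",
}

def link_entities(text):
    """Return (surface form, DBpedia URI) pairs found in the text."""
    found = []
    lowered = text.lower()
    for form, uri in SURFACE_FORMS.items():
        if form in lowered:
            found.append((form, uri))
    return found

links = link_entities("An interview with Brian Eno about ambient music.")
# links -> [("brian eno", "http://dbpedia.org/resource/Brian_Eno")]
```

The real system has to cope with ambiguity (which "Life" do you mean?), which is where the disambiguation machinery discussed below comes in.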

Their story is full of win, so it’s easy to see why the paper won the 2013 Semantic Web Challenge:

  • They used WikipediaMiner to take a first pass at entity extraction of the text they were able to collect for each program. The MapHub project uses WikipediaMiner for the same purpose of adding structure to otherwise unstructured text.
  • They used Amazon Web Services (aka the cloud) to do in the space of 2 weeks what would otherwise have taken them 4 years, for a fixed, one-time cost.
  • They use ElasticSearch for search, instead of trying to squeeze that functionality and scalability out of a triple store.
  • They wanted to encourage curation of the content, so they put an emphasis on usability and design that is often absent from Linked Data prototypes.
  • They have written in more detail about the algorithms that they used to connect up their text to Wikipedia/DBpedia.
  • Their github account reflects the nuts and bolts of how they did this work. Specifically their rdfsim Python project that vectorizes a SKOS hierarchy, for determining the distance between concepts, seems like a really useful approach to disambiguating terms in text.
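To give a flavour of the rdfsim approach (this is a toy sketch of the idea, not its actual API): each concept gets a vector built from its ancestors in the hierarchy, with weights decaying as you move up, so that cosine similarity approximates how related two concepts are:

```python
import math

# Toy SKOS-ish hierarchy: concept -> broader concept (invented for illustration)
BROADER = {
    "Ambient_music": "Electronic_music",
    "Techno": "Electronic_music",
    "Electronic_music": "Music",
    "Baroque_music": "Classical_music",
    "Classical_music": "Music",
}

def vectorize(concept, decay=0.5):
    """Weight a concept and its ancestors, halving the weight at each step up."""
    vec, weight = {}, 1.0
    while concept:
        vec[concept] = weight
        weight *= decay
        concept = BROADER.get(concept)
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Siblings share more ancestry than distant cousins, so they score higher:
# cosine(vectorize("Ambient_music"), vectorize("Techno")) beats
# cosine(vectorize("Ambient_music"), vectorize("Baroque_music"))
```

That distance is then useful for disambiguation: if the other entities in a document cluster near one candidate concept and far from another, pick the near one.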

But it is the (implied) role of the archivist, as the professional responsible for working with developers to tune these algorithms, evaluating user contributions, and helping describe the content themselves, that excites me the most about this work. The future role of the archive is at stake too. In another paper Raimond, Smethurst, McParland and Lowis describe how having this archival data allows them to augment live BBC News subtitles with links to the audio archive, where people can follow their nose (or ears, in this case) to explore the context around news stories.

The fact that it’s RDF and Linked Data isn’t terribly important in all this. But the importance of using world-curated, openly licensed entities derived from Wikipedia cannot be overstated. It’s the conceptual glue that allows connections to be made. As Wikidata grows in importance at Wikipedia it will be interesting to see if it supplants the role that DBpedia has been playing to date.

And of course, it’s exciting because it’s not just anyone doing this, it’s the BBC.

My only nit is that it would be nice to see some of the structured data they’ve collected expressed more in their HTML. For example, they have minted a URI for Brian Eno which lists radio programs that are related to him. Why not display his bio, and perhaps a picture? Why not put in links to other radio programs for people he is associated with, like David Byrne or David Bowie? Why not express some of this semantic metadata as microdata or RDFa in the page, to enable search engine optimization and reuse?

Luckily, it sounds like they have invested in the platform and data they would need to add these sorts of features.

PS. Apologies to the Mighty Boosh for the title of this post. “The future’s dead … Everyone’s looking back, not forwards.”

a middle way for linked data at the bbc

I got the chance to attend the 2nd London Linked Data Meetup that was co-located with dev8d last week, which turned out to be a whole lot of fun. I figured if I waited long enough other people would save me from having to write a good summary/discussion of the event…and they have: thanks Pete Johnston, Ben Summers, Sheila Macneill, Martin Belam and Frankie Roberto.

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder, which provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), and then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content on a particular topic.
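The mechanics of that join can be sketched in a few lines. The page paths and data here are invented; the point is simply that two applications tagging resources with shared DBpedia URIs get cross-site links for free:

```python
# Two separate "sites", each tagging its resources with DBpedia URIs.
PINE = "http://dbpedia.org/resource/Pinus_longaeva"

wildlife_pages = {
    "/nature/species/Pinus_longaeva": [PINE],
}
programmes = {
    "/programmes/p005fs5p": [PINE],  # the "Ancient bristlecones" clip
}

def links_for(page, topics=wildlife_pages, other=programmes):
    """Find pages on the other site that share a topic with this page."""
    shared = set(topics[page])
    return [p for p, t in other.items() if shared & set(t)]

# links_for("/nature/species/Pinus_longaeva") -> ["/programmes/p005fs5p"]
```

No bespoke integration between the two applications, just an intersection on shared identifiers.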

For a practical example, consider this page for the Great Basin Bristlecone Pine:

If you scroll down on the page you’ll see a link to a video clip from David Attenborough’s documentary Life, hosted on the Programmes portion of the website. Now take a step back and consider how these are two separate applications in the BBC enterprise that are able to build a rich network of links between each other. It’s the shared controlled vocabulary (in this case dbpedia, derived from wikipedia) which allows them to do this.

If you take a peek at the HTML you’ll see the resource has an alternate RDF version:

<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />
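That autodiscovery pattern is easy to take advantage of. Here’s a small stdlib Python sketch that pulls alternate representations out of a page’s head (the markup is inlined rather than fetched, to keep the example self-contained):

```python
from html.parser import HTMLParser

class AlternateFinder(HTMLParser):
    """Collect the hrefs of <link rel="alternate"> elements, keyed by MIME type."""
    def __init__(self):
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates[a.get("type")] = a.get("href")

html = ('<link rel="alternate" type="application/rdf+xml" '
        'href="/nature/species/Pinus_longaeva.rdf" />')
finder = AlternateFinder()
finder.feed(html)
rdf_url = finder.alternates["application/rdf+xml"]
# rdf_url -> "/nature/species/Pinus_longaeva.rdf"
```

A crawler that does this, resolving the relative href against the page URL, can hop from any HTML page straight to its machine-readable description.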

The Resource Description Framework (RDF) is really just the best data model we have for describing stuff that’s on the Web, and the types of links between resources that are on (and off) the Web. Personally, I prefer to look at RDF as Turtle, which is pretty easily done with Dave Beckett’s handy rapper utility (aptitude install raptor-utils if you are following along at home).

rapper -o turtle http://www.bbc.co.uk/nature/species/Pinus_longaeva

The key bits of the RDF are the description of the Great Basin bristlecone pine:

<http://www.bbc.co.uk/nature/species/Pinus_longaeva>
    rdfs:seeAlso <http://www.bbc.co.uk/nature/species> ;
    foaf:primaryTopic <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> .

<http://www.bbc.co.uk/nature/species/Pinus_longaeva#species>
    dc:description "Great Basin bristlecone pines are restricted to the mountain ranges of California, Nevada and Utah and have a remarkable ability to survive in this extremely harsh and challenging environment. They grow extremely slowly, and are some of the oldest living organisms in the world. With some aged at almost 5,000 years these amazing trees can reveal information about Earth's climate variations. Amazingly, the leaves, or needles, can remain green for over 45 years." ;
    wo:class <http://www.bbc.co.uk/nature/class/Pinopsida#class> ;
    wo:family <http://www.bbc.co.uk/nature/family/Pinaceae#family> ;
    wo:genus <http://www.bbc.co.uk/nature/genus/Pinus#genus> ;
    wo:growsIn <http://www.bbc.co.uk/nature/habitats/Mountain#habitat>, <http://www.bbc.co.uk/nature/habitats/Temperate_coniferous_forest#habitat> ;
    wo:kingdom <http://www.bbc.co.uk/nature/kingdom/Plant#kingdom> ;
    wo:name <http://www.bbc.co.uk/nature/species/Pinus_longaeva#name> ;
    wo:order <http://www.bbc.co.uk/nature/order/Pinales#order> ;
    wo:phylum <http://www.bbc.co.uk/nature/phylum/Pinophyta#phylum> ;
    a wo:Species ;
    rdfs:label "Great Basin bristlecone pine" ;
    owl:sameAs <http://dbpedia.org/resource/Pinus_longaeva> ;
    foaf:depiction <http://open.live.bbc.co.uk/dynamic_images/naturelibrary_640_credits/downloads.bbc.co.uk/earth/naturelibrary/assets/p/pi/pinus_longaeva/pinus_longaeva_1.jpg> .

And then the description of the clip that is related to the topic of Great Basin bristlecone pine:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:subject <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> ;
    a po:Clip .

And we can follow our nose and fetch a description of the Ancient bristlecones clip:

rapper -o turtle http://www.bbc.co.uk/programmes/p005fs5p

Which tells us lots of stuff: that it’s a documentary in the science and nature genre, gives us a synopsis, and even links the clip to the episode and series it is a part of:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:format <http://www.bbc.co.uk/programmes/formats/documentaries#format> ;
    po:genre <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature#genre>, <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature/natureandenvironment#genre> ;
    po:long_synopsis """Bristlecone pines live at the limit of life, above 3,000m in the mountains of  western America. Almost continuous freezing temperatures and savage winds make life so tough, that these bristlecones only grow for six weeks of the year.

Everything is about conserving energy. They hardly ever shed their needles which can last more than 30 years. After centuries of being blasted by storms a full grown tree still survives with only a strip of bark a few inches wide.

These trees live life at such a slow pace they can reach a great age. Some are over 5,000 years old. It has been said of the bristlecones that to live here is to take a very long time to die.""" ;
    po:medium_synopsis "Living above 3,000 metres, North America's bristlecones cope with freezing temperatures and battering winds by only growing for six weeks of the year. But seeing as they may live for more than 5,000 years, that's still a fair bit of growing in a single lifetime. Slowly but surely does it..." ;
    po:short_synopsis "The world's oldest trees have survived 5,000 years of harsh conditions." ;
    po:version <http://www.bbc.co.uk/programmes/p005fs5r#programme> ;
    a po:Clip .

<http://www.bbc.co.uk/programmes/b00lbpcy#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Series .

<http://www.bbc.co.uk/programmes/b00p90d6#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Episode .

Conspicuously missing from this description is something like:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dcterms:subject <http://dbpedia.org/resource/Pinus_longaeva> .

But presumably it’s hiding underneath the covers in the Programmes database, and that’s what lets them link stuff up?

Also very interesting was Georgi Kobilarov’s description of Uberblic. Since Georgi helped create dbpedia and is now consulting with the BBC, it seems like Uberblic is positioning itself to provide a platform for the BBC to have its own local cache of the world of Linked Data. Having a local curated view of the world of linked data is something Dan Chudnov identified as a real need at the first Linked Data workshop at code4lib 2009, for caching and proxying linked data…so it is really cool to see solutions starting to appear in this space, and for them to be adopted by institutions like the BBC.

Georgi demo’d how an edit on wikipedia would be immediately reflected in the structured data available from uberblic. It was a real time update, and extremely impressive. It looks like part of the uberblic strategy is to crawl BBC’s web site and other pockets of Linked Data to enable the sort of linking across web properties that Tom described. I’d also surmise given the realtime nature of this that Georgi is bypassing dbpedia dumps and using the Wikipedia changes atom feed in conjunction with extractors that were built as part of the dbpedia project. But I’d love to know more of the mechanics of the update. It also would be interesting to know if uberblic has a notion of versions.

The really powerful message that the BBC is helping promote is this idea that good websites are APIs. Tom mentioned Paul Downey’s notion that Web APIs Are Just Web Sites. It’s a subtle but extremely important point that I learned primarily by working closely with Dan Krech for a year or so. It’s an unfortunate side effect of lots of market-driven talk about web2.0, web3.0 and Linked Data in general that this simple REST message gets lost. We took it seriously in the design of the “API” at the Library of Congress’ Chronicling America. It’s also something I tried to talk about later in the week at dev8d when I had to quickly put a presentation together:

The slides probably won’t make much sense on their own, but the basic message was that we often hear about Linked Data in terms of pushing all your data to some triple store so you can start querying it with SPARQL and doing inferencing, and suddenly you’re going to be sitting pretty, totally jacked up on the Semantic Web.

If you are like me, you’ve already got databases where things are modeled, and you’ve created little web apps that extract information from the databases and put it on the web as HTML docs for people around the world to read (cue some mid-1990s grunge music). Expecting people to chuck away the applications and technology stacks they have simply to say they do Linked Data is wishful thinking. What’s missing is a simple migration strategy that would allow web publishers to easily recognize the value in publishing the contents of their database as Linked Data, and how it complements the HTML (and XML, JSON) publishing they are currently doing. My advice to folks at dev8d boiled down to:

  • Keep modelling your stuff how you like
  • Identify your stuff with Cool URIs in your webapps
  • Link your stuff together in HTML
  • Link to machine friendly formats (RSS, Atom, JSON, etc)
  • Use RDF to make your database available on the web using vocabularies other people understand
  • Start thinking about technologies like SPARQL that will let you query pools and aggregated views of your data
  • Consider subscribing to the public-lod discussion list and joining the conversation
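A concrete piece of the “good websites are APIs” idea is content negotiation: the same Cool URI can serve HTML to a browser and RDF to a tool like rapper. A minimal dispatch function might look like this (deliberately simplified and hypothetical; real q-value handling is more involved):

```python
# Map an HTTP Accept header to the best representation we can serve.
# Supported types and precedence here are invented for illustration.
SUPPORTED = ["text/html", "application/rdf+xml", "text/turtle"]

def negotiate(accept_header, supported=SUPPORTED):
    """Return the first supported MIME type the client says it accepts."""
    for part in accept_header.split(","):
        mime = part.split(";")[0].strip()  # drop any q= parameter
        if mime in supported:
            return mime
        if mime == "*/*":
            return supported[0]
    return supported[0]  # fall back to HTML

# A browser gets HTML, a request for RDF gets RDF:
# negotiate("text/html,application/xhtml+xml;q=0.9") -> "text/html"
# negotiate("application/rdf+xml") -> "application/rdf+xml"
```

The nice thing is that nothing about your existing HTML publishing has to change; the RDF is just one more representation hanging off the same URIs.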

I got some nice comments afterwards from Graham Klyne, Hugh Glaser, Adrian Stevenson and Mark Phillips so I felt pretty happy…granted most of the hard line Linked Data folks had already left a couple days earlier.

So some really exciting stuff is going on at the BBC. They are using Linked Data in a practical way that benefits their enterprise. I’m crossing my fingers and hoping that the value of this work is recognized, and that the various cuts underway won’t affect any of the fine work they are doing on improving the Web.

For more information check out the Semantic Web Case Study the folks at the BBC wrote for the W3C, summarizing their approach.