a middle way for linked data at the bbc

I got the chance to attend the 2nd London Linked Data Meetup that was co-located with dev8d last week, which turned out to be a whole lot of fun. I figured if I waited long enough other people would save me from having to write a good summary/discussion of the event…and they have: thanks Pete Johnston, Ben Summers, Sheila Macneill, Martin Belam and Frankie Roberto.

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder that provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), and that then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content that is on a particular topic.

For a practical example, consider this page for the Great Basin Bristlecone Pine:

If you scroll down on the page you’ll see a link to a video clip from David Attenborough’s documentary Life on the Programmes portion of the website. Now take a step back and consider how these are two separate applications in the BBC enterprise that are able to build a rich network of links between each other. It’s the shared controlled vocabulary (in this case dbpedia derived from wikipedia) which allows them to do this.

If you take a peek in the html you’ll see the resource has an alternate RDF version:

<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />
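If you wanted a program to discover that alternate version the same way, a minimal sketch using only the Python standard library might look like the following. The class and function names are mine, and the sample HTML is a cut-down stand-in for the real page rather than a copy of it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> elements of one media type."""

    def __init__(self, media_type):
        super().__init__()
        self.media_type = media_type
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here.
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "alternate" in rels and attributes.get("type") == self.media_type:
            href = attributes.get("href")
            if href:
                self.hrefs.append(href)


def find_rdf_alternates(html, base_url):
    """Return absolute URLs of the RDF/XML alternates advertised in the HTML."""
    finder = AlternateLinkFinder("application/rdf+xml")
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.hrefs]


# Cut-down stand-in for the species page (an assumption, not the real markup).
sample_html = """<html><head>
<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />
</head><body></body></html>"""

rdf_urls = find_rdf_alternates(
    sample_html, "http://www.bbc.co.uk/nature/species/Pinus_longaeva"
)
print(rdf_urls)
```

The relative href is resolved against the page URL, so the result is something you could hand straight to an HTTP client or to rapper.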

The Resource Description Framework (RDF) is really just the best data model we have for describing stuff that’s on the Web, and the type of links between resources that are on (and off) the Web. Personally, I prefer to look at RDF as Turtle which is pretty easily done with Dave Beckett’s handy rapper utility (aptitude install raptor-utils if you are following from home).

rapper -o turtle http://www.bbc.co.uk/nature/species/Pinus_longaeva

The key bits of the RDF are the description of the Great Basin bristlecone pine:

 
<http://www.bbc.co.uk/nature/species/Pinus_longaeva>
    rdfs:seeAlso <http://www.bbc.co.uk/nature/species> ;
    foaf:primaryTopic <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> .

<http://www.bbc.co.uk/nature/species/Pinus_longaeva#species>
    dc:description "Great Basin bristlecone pines are restricted to the mountain ranges of California, Nevada and Utah and have a remarkable ability to survive in this extremely harsh and challenging environment. They grow extremely slowly, and are some of the oldest living organisms in the world. With some aged at almost 5,000 years these amazing trees can reveal information about Earth's climate variations. Amazingly, the leaves, or needles, can remain green for over 45 years." ;
    wo:class <http://www.bbc.co.uk/nature/class/Pinopsida#class> ;
    wo:family <http://www.bbc.co.uk/nature/family/Pinaceae#family> ;
    wo:genus <http://www.bbc.co.uk/nature/genus/Pinus#genus> ;
    wo:growsIn <http://www.bbc.co.uk/nature/habitats/Mountain#habitat>, <http://www.bbc.co.uk/nature/habitats/Temperate_coniferous_forest#habitat> ;
    wo:kingdom <http://www.bbc.co.uk/nature/kingdom/Plant#kingdom> ;
    wo:name <http://www.bbc.co.uk/nature/species/Pinus_longaeva#name> ;
    wo:order <http://www.bbc.co.uk/nature/order/Pinales#order> ;
    wo:phylum <http://www.bbc.co.uk/nature/phylum/Pinophyta#phylum> ;
    a wo:Species ;
    rdfs:label "Great Basin bristlecone pine" ;
    owl:sameAs <http://dbpedia.org/resource/Pinus_longaeva> ;
    foaf:depiction <http://open.live.bbc.co.uk/dynamic_images/naturelibrary_640_credits/downloads.bbc.co.uk/earth/naturelibrary/assets/p/pi/pinus_longaeva/pinus_longaeva_1.jpg> .

And then the description of the clip that is related to the topic of Great Basin bristlecone pine:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:subject <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> ;
    a po:Clip .
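To make the "follow your nose" step concrete, here is a rough sketch that treats a few of the statements above as plain (subject, predicate, object) tuples, finds the clips about a topic, and strips the fragment to get a URL that can actually be fetched. The prefixed names are kept as plain strings for readability, and the function names are my own invention; a real application would more likely use an RDF library.

```python
from urllib.parse import urldefrag

SPECIES = "http://www.bbc.co.uk/nature/species/Pinus_longaeva#species"
CLIP = "http://www.bbc.co.uk/programmes/p005fs5p#programme"

# A handful of the statements from the Turtle above, as simple tuples.
# Predicates and classes are left as prefixed strings (a simplification).
triples = [
    (SPECIES, "rdf:type", "wo:Species"),
    (SPECIES, "rdfs:label", "Great Basin bristlecone pine"),
    (CLIP, "dc:title", "Ancient bristlecones"),
    (CLIP, "po:subject", SPECIES),
    (CLIP, "rdf:type", "po:Clip"),
]


def clips_about(triples, topic):
    """Return the URIs of po:Clip resources whose po:subject is the topic."""
    about_topic = {s for s, p, o in triples if p == "po:subject" and o == topic}
    clips = {s for s, p, o in triples if p == "rdf:type" and o == "po:Clip"}
    return sorted(about_topic & clips)


def document_url(uri):
    """Drop the fragment identifier to get the URL of the describing document."""
    return urldefrag(uri).url


for clip in clips_about(triples, SPECIES):
    print(clip, "is described at", document_url(clip))
```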

And we can follow our nose and fetch a description of the Ancient bristlecones clip:

rapper -o turtle http://www.bbc.co.uk/programmes/p005fs5p

Which tells us lots of stuff, like that it’s a documentary part of the science and nature genre, gives us a synopsis, and even links the clip to the episode and series it is a part of:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:format <http://www.bbc.co.uk/programmes/formats/documentaries#format> ;
    po:genre <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature#genre>, <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature/natureandenvironment#genre> ;
    po:long_synopsis """Bristlecone pines live at the limit of life, above 3,000m in the mountains of western America. Almost continuous freezing temperatures and savage winds make life so tough, that these bristlecones only grow for six weeks of the year.

Everything is about conserving energy.They hardly ever shed their needles which can last more than 30 years. After centuries of being blasted by storms a full grown tree still survives with only a strip of bark a few inches wide.

These trees live life at such a slow pace they can reach a great age. Some are over 5,000 years old. It has been said of the bristlecones that to live here is to take a very long time to die.""" ;
    po:medium_synopsis "Living above 3,000 metres, North America's bristlecones cope with freezing temperatures and battering winds by only growing for six weeks of the year. But seeing as they may live for more than 5,000 years, that's still a fair bit of growing in a single lifetime. Slowly but surely does it..." ;
    po:short_synopsis "The world's oldest trees have survived 5,000 years of harsh conditions." ;
    po:version <http://www.bbc.co.uk/programmes/p005fs5r#programme> ;
    a po:Clip .

<http://www.bbc.co.uk/programmes/b00lbpcy#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Series .

<http://www.bbc.co.uk/programmes/b00p90d6#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Episode .

Conspicuously missing from this description is something like:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dcterms:subject <http://dbpedia.org/resource/Pinus_longaeva> .

But presumably it’s hiding underneath the covers in the Programmes database, and that’s what lets them link stuff up?
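Purely as a guess at what that might look like, here is a small sketch of joining the two sites on a shared dbpedia URI. The owl:sameAs mapping comes from the species RDF above; the clip-to-dbpedia mapping is the hypothetical bit, standing in for whatever the Programmes database really stores. I have no knowledge of how the BBC actually implements this.

```python
from collections import defaultdict

DBPEDIA_PINE = "http://dbpedia.org/resource/Pinus_longaeva"
SPECIES = "http://www.bbc.co.uk/nature/species/Pinus_longaeva#species"
CLIP = "http://www.bbc.co.uk/programmes/p005fs5p#programme"

# Wildlife Finder side: species -> dbpedia URI (from owl:sameAs above).
wildlife_same_as = {SPECIES: DBPEDIA_PINE}

# Programmes side: clip -> dbpedia topics. Hypothetical: this is the
# dcterms:subject statement that is missing from the published RDF.
programme_subjects = {CLIP: [DBPEDIA_PINE]}


def link_species_to_clips(same_as, subjects):
    """Join the two datasets on the dbpedia URI they both point at."""
    clips_by_topic = defaultdict(list)
    for clip, topics in subjects.items():
        for topic in topics:
            clips_by_topic[topic].append(clip)
    return {
        species: sorted(clips_by_topic.get(dbpedia_uri, []))
        for species, dbpedia_uri in same_as.items()
    }


links = link_species_to_clips(wildlife_same_as, programme_subjects)
print(links)
```

Neither side needs to know about the other's URIs; agreeing on the dbpedia identifier is enough to generate the links in both directions.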

Also very interesting was Georgi Kobilarov’s description of Uberblic. Since Georgi helped create dbpedia and is now consulting with the BBC, it seems like uberblic is positioning itself to provide a platform for the BBC to have its own local cache of the world of Linked Data. Having a local curated view of the world of linked data is something Dan Chudnov identified as a real need at the first Linked Data workshop at code4lib 2009 for caching and proxying linked data…so it is really cool to see solutions starting to appear in this space…and for them to be adopted by institutions like the BBC.

Georgi demo’d how an edit on wikipedia would be immediately reflected in the structured data available from uberblic. It was a real time update, and extremely impressive. It looks like part of the uberblic strategy is to crawl BBC’s web site and other pockets of Linked Data to enable the sort of linking across web properties that Tom described. I’d also surmise given the realtime nature of this that Georgi is bypassing dbpedia dumps and using the Wikipedia changes atom feed in conjunction with extractors that were built as part of the dbpedia project. But I’d love to know more of the mechanics of the update. It also would be interesting to know if uberblic has a notion of versions.

The really powerful message that the BBC is helping promote is this idea that good websites are APIs. Tom mentioned Paul Downey’s notion that Web APIs Are Just Web Sites. It’s a subtle but extremely important point that I learned primarily working closely with Dan Krech for a year or so. It’s an unfortunate side effect of lots of market-driven talk about web2.0, web3.0 and Linked Data in general that this simple REST message gets lost. We took it seriously in the design of the “API” at the Library of Congress’ Chronicling America. It’s also something I tried to talk about later in the week at dev8d when I had to quickly put a presentation together:

The slides probably won’t make much sense on their own, but the basic message was that we often hear about Linked Data in terms of pushing all your data to some triple store so you can start querying it with SPARQL and doing inferencing, and suddenly you’re going to be sitting pretty, totally jacked up on the Semantic Web.

If you are like me, you’ve already got databases where things are modeled, and you’ve created little web apps that have extracted information from the databases and put them on the web as HTML docs for people around the world to read (cue some mid 1990s grunge music). Expecting people to chuck away the applications and technology stacks they have simply to say they do Linked Data is wishful thinking. What’s missing is a simple migration strategy that would allow web publishers to easily recognize the value in publishing the contents of their database as Linked Data, and how it complements the HTML (and XML, JSON) publishing they are currently doing. My advice to folks at dev8d boiled down to:

Keep modelling your stuff how you like
Identify your stuff with Cool URIs in your webapps
Link your stuff together in HTML
Link to machine friendly formats (RSS, Atom, JSON, etc)
Use RDF to make your database available on the web using vocabularies other people understand.
Start thinking about technologies like SPARQL that will let you query pools and aggregated views of your data.
Consider joining the public-lod discussion list and joining the conversation.
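As a rough illustration of the first few steps, here is a sketch of one database record published at a Cool URI as HTML, with a link to a Turtle version of the same record that reuses vocabularies other people understand. Everything here is made up for the example: the example.org base URL, the record, the URI pattern and the function names. It isn't how the BBC or Chronicling America does it, just the shape of the idea.

```python
from html import escape

BASE = "http://example.org"  # hypothetical site

# Hypothetical row from an existing database, modelled however you like.
record = {
    "id": "42",
    "title": "Ancient bristlecones",
    "topic": "http://dbpedia.org/resource/Pinus_longaeva",
}


def cool_uri(record):
    """A stable, readable URL for the record (the pattern is an assumption)."""
    return f"{BASE}/clips/{record['id']}"


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping the characters that need it."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def as_html(record):
    """The page for people, which also advertises the machine-friendly version."""
    uri = escape(cool_uri(record))
    title = escape(record["title"])
    topic = escape(record["topic"])
    return (
        "<html><head>"
        f"<title>{title}</title>"
        f'<link rel="alternate" type="text/turtle" href="{uri}.ttl" />'
        "</head><body>"
        f"<h1>{title}</h1>"
        f'<p>About: <a href="{topic}">{topic}</a></p>'
        "</body></html>"
    )


def as_turtle(record):
    """The same record as RDF, using Dublin Core terms and a dbpedia subject."""
    return (
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
        f"<{cool_uri(record)}#clip>\n"
        f"    dcterms:title {turtle_literal(record['title'])} ;\n"
        f"    dcterms:subject <{record['topic']}> .\n"
    )


print(as_html(record))
print(as_turtle(record))
```

Both views come from the same record, so the existing database and web app stay as they are; the RDF is just one more representation alongside the HTML.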

I got some nice comments afterwards from Graham Klyne, Hugh Glaser, Adrian Stevenson and Mark Phillips so I felt pretty happy…granted most of the hard line Linked Data folks had already left a couple days earlier.

So some really exciting stuff is going on at the BBC. They are using Linked Data in a practical way that benefits their enterprise in real ways. I’m crossing my fingers and hoping that the value of what is going on here is recognized, and the various cuts that are going on won’t affect any of the fine work they are doing on improving the Web.

For more information check out the Semantic Web Case Study the folks at the BBC wrote summarizing their approach for the W3C.