The recent public release of the UK Government’s site got picked up by the press last week in articles at The Guardian, Prospect Magazine and elsewhere. These have been supplemented by some more technical discussions at ReadWriteWeb, Open Knowledge Foundation, Talis, Jeni Tennison’s blog, and some helpful emails from Leigh Dodds (Talis) and Jonathan Gray (Open Knowledge Foundation) on the W3C eGovernment discussion list.

One thing that I haven’t seen mentioned so far in public (which I just discovered today) is that the site is using RDFa to expose metadata about the datasets in a machine-readable way. What this means is that in the HTML page for a dataset there are some extra HTML attributes, like about, property and rel, that have been thoughtfully used to express structured metadata about the dataset. That metadata can be extracted from the HTML and expressed as, say, Turtle:

<> dct:coverage "Great Britain (England, Scotland, Wales)"@en ;
     dct:created "2009-12-04"@en ;
     dct:creator "Department for Environment, Food and Rural Affairs"@en ;
     dct:isReferencedBy <> ; 
     dct:license "Crown Copyright"@en ;
     dct:source <>, <> ;
         <> .
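
To make the extraction step a little more concrete, here is a rough sketch in Python using only the standard library. It is not a conforming RDFa processor and it is not the tool used for this post: it ignores prefix resolution, datatypes and subject scoping (a subject stays in force until the next about attribute), and it will pick up any rel/href pair it sees. The class name RDFaSketch and the sample markup with example.org URLs are made up for illustration.

```python
from html.parser import HTMLParser


class RDFaSketch(HTMLParser):
    """Collect rough (subject, predicate, object) tuples from
    about/property/rel attributes. A simplification, not full RDFa."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = ""     # most recent `about` value seen
        self._pending = None   # property still waiting for its text value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("about"):
            self._subject = a["about"]
        if a.get("rel") and a.get("href"):
            # rel + href: the object is a link
            self.triples.append((self._subject, a["rel"], a["href"]))
        if a.get("property"):
            if a.get("content") is not None:
                # explicit literal in the content attribute
                self.triples.append((self._subject, a["property"], a["content"]))
            else:
                # literal is the element's text, picked up in handle_data
                self._pending = a["property"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((self._subject, self._pending, data.strip()))
            self._pending = None


# Hypothetical markup, loosely modelled on the Turtle above.
sample = """
<div about="http://example.org/dataset/1">
  <span property="dct:creator">Department for Environment, Food and Rural Affairs</span>
  <span property="dct:license">Crown Copyright</span>
  <a rel="dct:source" href="http://example.org/data.csv">data</a>
</div>
"""

parser = RDFaSketch()
parser.feed(sample)
parser.close()
for triple in parser.triples:
    print(triple)
```

For real work a proper RDFa distiller or parser library is the better choice, since it handles the prefix, datatype and scoping rules that this sketch skips.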

In fact, since the site has a nice paging mechanism that lists all the datasets, it’s not hard to write a little script that scrapes all the metadata for the datasets (35,478 triples) right out of the web pages.
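
A sketch of what walking the paged listing might look like, again in Python. The "/dataset/" link prefix, the zero-based page numbering and the "stop when a page yields nothing new" rule are all assumptions on my part, not details taken from the site. The fetch_listing argument is a placeholder for whatever function downloads listing page n and returns its HTML.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(self.prefix):
                self.links.append(href)


def dataset_links(fetch_listing, prefix="/dataset/", max_pages=1000):
    """Yield each distinct dataset link found on successive listing pages.

    fetch_listing(n) must return the HTML of listing page n (hypothetical).
    Stops at the first page that contributes no new links, or at max_pages.
    """
    seen = set()
    for page in range(max_pages):
        collector = LinkCollector(prefix)
        collector.feed(fetch_listing(page))
        collector.close()
        found_new = False
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                found_new = True
                yield link
        if not found_new:
            break
```

Each link that comes back can then be fetched and run through an RDFa extractor to accumulate the triples.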

I also noticed via Stéphane Corlosquet today that the site is using the Drupal open-source content management system. To what extent Drupal 7’s new RDFa features are being used to layer in this RDFa isn’t clear to me. But it is an exciting development, because the site is a great example of how to bubble up data that’s typically locked away in databases of some kind into the HTML that’s out on the web for people to interact with, and for crawlers to crawl and repurpose.

For example, I can now write a utility to check the status of the external dataset links, to make sure they are there (200 OK). The complete results by URL can be summarized by rolling up by status code:

Status Code                                                Number of Datasets
200                                                        2977
404                                                        106
502                                                        23
503                                                        14
[Errno socket error] [Errno -2] Name or service not known  8
500                                                        3
nonnumeric port: ’’                                        1
[Errno socket error] [Errno 110] Connection timed out      1
400                                                        1
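
For the curious, a link checker along these lines can be quite short. This is a Python 3 sketch rather than the script that produced the numbers above; the function names check and rollup are mine, and using HEAD requests is an assumption (some servers answer HEAD differently from GET, so the counts could differ).

```python
import urllib.error
import urllib.request
from collections import Counter


def check(url, timeout=10):
    """Return the HTTP status code for url, or the error text if the
    request could not be completed (DNS failure, timeout, bad URL...)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 503.
        return e.code
    except Exception as e:
        # No usable answer at all; keep the message as the "status".
        return str(e)


def rollup(urls, checker=check):
    """Tally outcomes per status, most frequent first."""
    return Counter(checker(u) for u in urls).most_common()
```

Calling rollup(urls) on the list of external dataset links returns (status, count) pairs in descending order of count. The checker parameter exists only so that the tallying can be tried out without touching the network.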

Or I can generate a list of dataset subjects (even though it’s already available, I guess). Here’s the top 25:

Subject                              Number of Datasets
health                               645
care                                 427
child                                398
population                           341
children                             295
school                               273
health-and-social-care               271
health-well-being-and-care           205
economy                              202
economics-and-finance                189
census                               188
education                            176
communities                          154
benefit                              153
road                                 144
children-education-and-skills        121
people-and-places                    111
government-receipts-and-expenditure  110
education-and-skills                 110
housing                              108
environment                          107
tax                                  107
life-in-the-community                106
employment                           105
tax-credit                           96

I realize it’s early days, but here are a few things it would be fun to see on the site:

  • add some RDFa and SKOS or CommonTag to tag pages like education: this would allow things to be hooked up a bit more explicitly, allow tags to be given nice labels, and encourage the reuse of the tagging vocabulary within and outside the site
  • link the dataset descriptions to the dataset resources themselves (the PDFs, Excel spreadsheets, etc.) that are online, using a vocabulary like Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and/or POWDER. This would allow for the harvesting and aggregation not only of the metadata, but of the datasets as well.

I imagine much of this sort of hacking around can be enabled by querying the SPARQL endpoint. But it hasn’t been very clear to me exactly what data is behind there. And there is something comforting about being able to crawl the open web to find the information that’s there in the open to view.