data.gov.uk and rdfa

The recent public release of the UK Government’s data.gov.uk site got picked up by the press last week in articles at The Guardian, Prospect Magazine and elsewhere. These have been supplemented by some more technical discussions at ReadWriteWeb, Open Knowledge Foundation, Talis, Jeni Tennison’s blog, and some helpful emails from Leigh Dodds (Talis) and Jonathan Gray (Open Knowledge Foundation) on the W3C eGovernment discussion list.

One thing that I haven’t seen mentioned so far in public (which I just discovered today) is that data.gov.uk is using RDFa to expose metadata about the datasets in a machine-readable way. What this means is that in an HTML page for a dataset like this one there are some extra HTML attributes like about, property and rel that have been thoughtfully used to express some structured metadata about the dataset, which can be extracted from the HTML and expressed, say, as Turtle:

@prefix dct: <http://purl.org/dc/terms/> .

<http://data.gov.uk/id/dataset/agricultural_market_reports> dct:coverage "Great Britain (England, Scotland, Wales)"@en ;
     dct:created "2009-12-04"@en ;
     dct:creator "Department for Environment, Food and Rural Affairs"@en ;
     dct:isReferencedBy <http://data.gov.uk/wiki/index.php/Package:agricultural_market_reports> ; 
     dct:license "Crown Copyright"@en ;
     dct:source <http://statistics.defra.gov.uk/esg/publications/amr/default.asp>, <https://statistics.defra.gov.uk/esg/publications/amr/default.asp> ;
     dct:subject
         <http://data.gov.uk/data/tag/agriculture>,
         <http://data.gov.uk/data/tag/agriculture-and-environment>,
         <http://data.gov.uk/data/tag/environment>,
         <http://data.gov.uk/data/tag/farm-business>,
         <http://data.gov.uk/data/tag/farm-businesses>,
         <http://data.gov.uk/data/tag/farming> .

In fact, since data.gov.uk has a nice paging mechanism that lists all the datasets, it’s not hard to write a little script that scrapes all the metadata for the datasets (35,478 triples) right out of the web pages.
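To give a rough idea of what such a script involves, here is a minimal sketch that pulls triples out of the about, property and rel attributes using only the Python standard library. It is not a conforming RDFa processor and not necessarily how the numbers above were produced: the class name SimpleRDFaExtractor and the function extract_triples are made up for illustration, CURIE prefixes are left unexpanded, nested property elements are not handled, and fetching and paging through the dataset listings is left out entirely.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not push onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class SimpleRDFaExtractor(HTMLParser):
    """Collect (subject, predicate, object) tuples from a small RDFa subset."""

    def __init__(self, base_url):
        super().__init__()
        self.subjects = [base_url]  # stack of subjects, one entry per open element
        self.pending = None         # (subject, predicates, depth) awaiting text
        self.text = []
        self.triples = []

    def emit(self, subject, predicates, obj):
        # Keep only CURIE-style predicates such as dct:creator (a simplification).
        for predicate in predicates.split():
            if ":" in predicate:
                self.triples.append((subject, predicate, obj))

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # An about attribute sets a new subject; otherwise inherit the parent's.
        subject = a.get("about") or self.subjects[-1]
        if tag not in VOID_TAGS:
            self.subjects.append(subject)
        target = a.get("resource") or a.get("href")
        if a.get("rel") and target:
            self.emit(subject, a["rel"], target)
        if a.get("property"):
            if a.get("content") is not None:
                self.emit(subject, a["property"], a["content"])
            elif tag not in VOID_TAGS:
                # The literal is the element's text; collect it until the end tag.
                self.pending = (subject, a["property"], len(self.subjects))
                self.text = []

    def handle_data(self, data):
        if self.pending:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.pending and self.pending[2] == len(self.subjects):
            subject, predicates, _ = self.pending
            self.emit(subject, predicates, "".join(self.text).strip())
            self.pending = None
        if len(self.subjects) > 1:
            self.subjects.pop()


def extract_triples(html, base_url):
    """Parse one HTML page and return the list of triples found in it."""
    parser = SimpleRDFaExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.triples
```

Calling extract_triples(page_html, page_url) on each dataset page and concatenating the results would give one big list of triples. For anything beyond a quick experiment a real RDFa parser is the safer choice, since it also handles prefixes, datatypes and language tags.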

I also noticed via Stéphane Corlosquet today that data.gov.uk is using the Drupal open-source content management system. To what extent Drupal 7’s new RDFa features are being used to layer in this RDFa isn’t clear to me. But it is an exciting development. It’s exciting because data.gov.uk is a great example of how to bubble up data that’s typically locked away in databases of some kind into the HTML that’s out on the web for people to interact with, and for crawlers to crawl and re-purpose.

For example, I can now write a utility to check the status of the external dataset links, to make sure they are there (200 OK). The complete results by URL can be summarized by rolling up by status code:

Status Code	Number of Datasets
200	2977
404	106
502	23
503	14
[Errno socket error] [Errno -2] Name or service not known	8
500	3
nonnumeric port: ’’	1
[Errno socket error] [Errno 110] Connection timed out	1
400	1
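A checker that produces a roll-up like the one above could look roughly like this. This is a sketch under a few assumptions: the function names fetch_status and summarize are hypothetical, it sends HEAD requests (some servers answer those differently from GET), and any failure that is not an HTTP response is recorded as the error's text rather than a status code.

```python
from collections import Counter
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or the error text if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError) as err:
        # DNS failures, timeouts, malformed URLs and similar problems.
        return str(err)


def summarize(urls, fetch=fetch_status):
    """Check every URL and return (status per URL, count per status)."""
    statuses = {url: fetch(url) for url in urls}
    return statuses, Counter(statuses.values())
```

The fetch parameter is there so the roll-up can be tried out with a stand-in function instead of hitting the network; counts.most_common() then lists the statuses from most to least frequent.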

Or I can generate a list of dataset subjects (even though it’s already available, I guess). Here’s the top 25:

Subject	Number of Datasets
health	645
care	427
child	398
population	341
children	295
school	273
health-and-social-care	271
health-well-being-and-care	205
economy	202
economics-and-finance	189
census	188
education	176
communities	154
benefit	153
road	144
children-education-and-skills	121
people-and-places	111
government-receipts-and-expenditure	110
education-and-skills	110
housing	108
environment	107
tax	107
life-in-the-community	106
employment	105
tax-credit	96
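A tally like this takes only a few lines once the metadata is in hand. The sketch below assumes the triples are plain (subject, predicate, object) tuples with unexpanded dct:subject predicates and tag URLs shaped like http://data.gov.uk/data/tag/farming; the function name count_subjects is made up for illustration.

```python
from collections import Counter


def count_subjects(triples, predicate="dct:subject"):
    """Count how many distinct datasets carry each subject tag."""
    seen = set()
    for dataset, pred, obj in triples:
        if pred == predicate:
            # Use the last path segment of the tag URL as the subject label.
            tag = obj.rstrip("/").rsplit("/", 1)[-1]
            seen.add((dataset, tag))
    return Counter(tag for _, tag in seen)
```

Calling count_subjects(triples).most_common(25) then gives the 25 most used subjects with their dataset counts.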

I realize it’s early days but here are a few things it would be fun to see at data.gov.uk:

add some RDFa and SKOS or CommonTag in tag pages like education: this would allow things to be hooked up a bit more explicitly, tags to be given nice labels, and encourage the reuse of the tagging vocabulary within and outside data.gov.uk.
link the dataset descriptions to the dataset resources themselves (the PDFs, Excel spreadsheets, etc.) that are online using a vocabulary like the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and/or POWDER. This would allow for the harvesting and aggregation not only of the metadata, but the datasets as well.

I imagine much of this sort of hacking around can be enabled by querying the data.gov.uk SPARQL endpoint. But it hasn’t been very clear to me exactly what data is behind there. And there is something comforting about being able to crawl the open web to find the information that’s there in the open to view.