The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they embed in their dataset web pages. As an example, here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described, using the dcat:distribution property.

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10007/DOC>,
         <http://www.dft.gov.uk/data/download/10008/ZIP>,
         <http://www.dft.gov.uk/data/download/10009/GZ>,
         <http://www.dft.gov.uk/data/download/10010/GZ>,
         <http://www.dft.gov.uk/data/download/10011/GZ>,
         <http://www.dft.gov.uk/data/download/10012/GZ>,
         <http://www.dft.gov.uk/data/download/10013/GZ>,
         <http://www.dft.gov.uk/data/download/10014/GZ>,
         <http://www.dft.gov.uk/data/download/10015/GZ>,
         <http://www.dft.gov.uk/data/download/10016/GZ>,
         <http://www.dft.gov.uk/data/download/10017/GZ>,
         <http://www.dft.gov.uk/data/download/10018/GZ>,
         <http://www.dft.gov.uk/data/download/10019/GZ>,
         <http://www.dft.gov.uk/data/download/10020/GZ>,
         <http://www.dft.gov.uk/data/download/10021/GZ>,
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little Python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). There are now 92,550 triples (up from 35,478 triples almost a year ago).
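
The extraction step itself is tiny. Here’s a minimal sketch, assuming an rdflib installation whose RDFa parser plugin is available (support has moved around between rdflib versions):

    import rdflib

    # parse the RDFa embedded in a dataset page straight off the web
    graph = rdflib.Graph()
    graph.parse("http://data.gov.uk/dataset/anonymised_mot_test", format="rdfa")

    print(len(graph))  # triples extracted from this one page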

So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in getting not only the metadata for these datasets, but also the datasets themselves. It’s important to enable third-party, automated access to datasets for a variety of reasons, but the biggest one for me can be summed up by the common-sense maxim: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer, but copies are what digital preservation depends on when the unfortunate happens. The subtlety lies in knowing where the copies of a particular dataset are, in the enterprise and in a distributed system like the Web, and in the mechanics for relating them to one another. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations in other research to the actual datasets they are based on. And lastly, aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.

Consider this practical example: as someone interested in digital preservation, I’d like to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the data being described, I was able to write a small Python program that crawls the distribution links and logs the media type and HTTP status code along the way (a sketch of it follows the table), generating results like:

    datasets  media type
        5898  text/html
        1266  application/octet-stream
         874  application/vnd.ms-excel
         234  text/plain
         220  text/csv
         167  application/pdf
          81  text/xml
          51  text/comma-separated-values
          36  application/x-zip-compressed
          33  application/vnd.ms-powerpoint
          31  application/zip
          28  application/x-msexcel
          21  application/excel
          18  application/xml
          14  text/x-comma-separated-values
          13  application/x-gzip
          12  application/x-bittorrent
          12  application/octet_stream
          10  application/msword
          10  application/force-download
           9  application/x-vnd.oasis.opendocument.presentation
           9  application/x-octet-stream
           9  application/vnd.excel
           6  application/x-unknown-content-type
           6  application/xhtml+xml
           5  application/vnd.msexcel
           5  application/vnd.google-earth.kml+xml kml
           4  application/octetstream
           3  application/csv
           2  vnd.openxmlformats-officedocument.spreadsheetml.sheet
           2  application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
           2  application/octet-string
           1  image/jpeg
           1  image/gif
           1  application/x-mspowerpoint
           1  application/vnd.google-earth.kml+xml
           1  application/powerpoint
           1  application/msexcel
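
In essence the crawler is just a loop over the dcat:distribution links doing HEAD requests. Here is a minimal sketch, assuming the harvested triples are already sitting in an rdflib graph, and assuming the W3C dcat namespace (the namespace actually deployed may differ):

    import collections
    import urllib.error
    import urllib.request

    import rdflib

    DCAT = rdflib.Namespace("http://www.w3.org/ns/dcat#")  # assumed namespace

    def tally_media_types(graph):
        """HEAD each dcat:distribution URL, logging status and media type."""
        counts = collections.Counter()
        for _, _, url in graph.triples((None, DCAT.distribution, None)):
            request = urllib.request.Request(str(url), method="HEAD")
            try:
                with urllib.request.urlopen(request, timeout=30) as response:
                    # get_content_type() drops parameters like "; charset=utf-8"
                    media_type = response.headers.get_content_type()
                    print(response.status, media_type, url)
                    counts[media_type] += 1
            except urllib.error.URLError:
                counts["(unreachable)"] += 1
        return counts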

Granted, some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the object of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.

The application/octet-stream media type (and its variants) is also kind of useless for these purposes, since it basically says only that the dataset is made of bits. It would be more helpful if the servers in these cases could send something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream, to figure out just what this data is, whether it is valid, etc.
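
As a toy version of that idea, here is a sketch using the python-magic bindings to libmagic; a real service would swap in JHOVE or DROID for proper identification and validation, and the filename here is hypothetical:

    import magic  # python-magic, a binding to libmagic

    def sniff(path):
        """Guess a media type from the first couple of kilobytes of a file."""
        with open(path, "rb") as f:
            return magic.from_buffer(f.read(2048), mime=True)

    # e.g. "application/zip", even when the server only said
    # application/octet-stream
    print(sniff("10008.zip"))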

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format-analysis application can be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information about datasets and their whereabouts, while a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Micro-services being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regularly scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL of the HTML view for the resource. For example:

http://data.gov.uk/id/dataset/anonymised_mot_test

instead of:

http://data.gov.uk/dataset/anonymised_mot_test

This is understandable given some of the dictums about Linked Data and the separation of the Information Resource from the Non-Information Resource. But it would be nice if the identifier URI resolved via a 303 redirect to the HTML, as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset, it’s important that it resolves, so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.
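
A follow-your-nose agent needs nothing more than a look at the redirect. Something like this sketch (using the requests library) is enough to check how the identifier currently answers:

    import requests

    response = requests.head("http://data.gov.uk/id/dataset/anonymised_mot_test",
                             allow_redirects=False)

    # a Cool URI would answer 303 See Other, with the Location header
    # pointing at the HTML view
    print(response.status_code)
    print(response.headers.get("location"))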