dcat:distribution considered helpful
The other day I happened to notice that the folks at
data.gov.uk have started using the
Data Catalog Vocabulary in the
RDFa they have
embedded in their dataset webpages. As an example,
here is the RDF you can
pull out of the HTML for the
Anonymised MOT
tests and results dataset. Of particular
interest
to me is that the dataset description now includes an explicit link to
the actual data being described using the
dcat:distribution
property.
```turtle
<http://data.gov.uk/id/dataset/anonymised_mot_test>
    dcat:distribution
        <http://www.dft.gov.uk/data/download/10007/DOC>,
        <http://www.dft.gov.uk/data/download/10008/ZIP>,
        <http://www.dft.gov.uk/data/download/10009/GZ>,
        <http://www.dft.gov.uk/data/download/10010/GZ>,
        <http://www.dft.gov.uk/data/download/10011/GZ>,
        <http://www.dft.gov.uk/data/download/10012/GZ>,
        <http://www.dft.gov.uk/data/download/10013/GZ>,
        <http://www.dft.gov.uk/data/download/10014/GZ>,
        <http://www.dft.gov.uk/data/download/10015/GZ>,
        <http://www.dft.gov.uk/data/download/10016/GZ>,
        <http://www.dft.gov.uk/data/download/10017/GZ>,
        <http://www.dft.gov.uk/data/download/10018/GZ>,
        <http://www.dft.gov.uk/data/download/10019/GZ>,
        <http://www.dft.gov.uk/data/download/10020/GZ>,
        <http://www.dft.gov.uk/data/download/10021/GZ>,
        <http://www.dft.gov.uk/data/download/10022/GZ> .
```
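A consumer can pull those distribution links back out again. As a minimal sketch, here's a toy extractor that scrapes the object URIs from an abbreviated copy of the statement above with a regular expression; a real harvester would parse the RDFa into a graph with a library like rdflib and query for dcat:distribution instead of pattern matching.

```python
import re

# An abbreviated copy of the dcat:distribution statement above, trimmed
# to three of the sixteen download URLs (the full statement works the same)
turtle = """
<http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
    <http://www.dft.gov.uk/data/download/10007/DOC>,
    <http://www.dft.gov.uk/data/download/10008/ZIP>,
    <http://www.dft.gov.uk/data/download/10009/GZ> .
"""

# Quick-and-dirty: grab the object URIs (the dft.gov.uk download links)
distributions = re.findall(r"<(http://www\.dft\.gov\.uk[^>]+)>", turtle)
print(distributions)  # three download URLs
```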
Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little Python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).
So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable 3rd party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.
It’s kind of a no-brainer: copies are important for digital preservation, for when the unfortunate happens. The subtlety is being able to know where the copies of a particular dataset are, whether in the enterprise or in a distributed system like the Web, and having the mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations in other research back to the actual dataset it is based upon. And lastly, aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.
Consider this practical example: as someone interested in digital
preservation I’d like to be able to know what format types are used
within the data.gov.uk collection. Since they have used the
dcat:distribution
property to point at the referenced
dataset, I was able to write a small
Python
program to crawl the datasets and log the media type and HTTP status
code along the way, to generate some results like:
media type | datasets |
---|---|
text/html | 5898 |
application/octet-stream | 1266 |
application/vnd.ms-excel | 874 |
text/plain | 234 |
text/csv | 220 |
application/pdf | 167 |
text/xml | 81 |
text/comma-separated-values | 51 |
application/x-zip-compressed | 36 |
application/vnd.ms-powerpoint | 33 |
application/zip | 31 |
application/x-msexcel | 28 |
application/excel | 21 |
application/xml | 18 |
text/x-comma-separated-values | 14 |
application/x-gzip | 13 |
application/x-bittorrent | 12 |
application/octet_stream | 12 |
application/msword | 10 |
application/force-download | 10 |
application/x-vnd.oasis.opendocument.presentation | 9 |
application/x-octet-stream | 9 |
application/vnd.excel | 9 |
application/x-unknown-content-type | 6 |
application/xhtml+xml | 6 |
application/vnd.msexcel | 5 |
application/vnd.google-earth.kml+xml kml | 5 |
application/octetstream | 4 |
application/csv | 3 |
vnd.openxmlformats-officedocument.spreadsheetml.sheet | 2 |
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 2 |
application/octet-string | 2 |
image/jpeg | 1 |
image/gif | 1 |
application/x-mspowerpoint | 1 |
application/vnd.google-earth.kml+xml | 1 |
application/powerpoint | 1 |
application/msexcel | 1 |
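A survey like this only needs one request per distribution URL. Here is a hedged sketch of how such a tally could be produced with the standard library; the helper names and the choice of HEAD requests are my assumptions, not details of the original program.

```python
import urllib.request
from collections import Counter

def media_type(content_type):
    """Strip parameters like '; charset=utf-8' from a Content-Type header."""
    return content_type.split(";")[0].strip().lower()

def survey(urls):
    """HEAD each distribution URL and tally the media types served."""
    tally = Counter()
    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                tally[media_type(resp.headers.get("Content-Type", "unknown"))] += 1
        except Exception:
            tally["error"] += 1  # log the failure alongside the successes
    return tally

# e.g. survey(["http://www.dft.gov.uk/data/download/10008/ZIP", ...])
```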
Granted some of these aren’t too interesting. The predominance of
text/html
is largely an artifact of using
dcat:distribution
to link to the splash page for the
dataset, not to the dataset itself. This is allowed by the dcat
vocabulary … but dcat’s approach kind of assumes that the object of the
assertion is suitably typed as a dcat:Download,
dcat:Feed
or dcat:WebService. I personally
think that dcat has some
issues
that make it a bit more difficult to use than I’d like. But it’s
extremely useful that data.gov.uk are kicking the tires on the
vocabulary, so that kinks like this can be worked out.
The application/octet-stream
media type (and its variants)
is also kind of useless for these purposes, since it basically says the
dataset is made of bits. It would be more helpful if the servers in
these cases could send something more specific. But it ought to be
possible to use something like
JHOVE
or
DROID
to do some post-hoc analysis of the bitstream to figure out just what
this data is, whether it is valid, etc.
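Tools like JHOVE and DROID work from databases of format signatures. Just to illustrate the idea, here is a toy sniffer over a few well-known magic numbers; the signature list below is illustrative, not taken from DROID’s registry.

```python
def sniff(data: bytes) -> str:
    """Guess a format from leading 'magic' bytes, in the spirit of
    signature-based identification (a tiny, illustrative signature set)."""
    signatures = [
        (b"\x1f\x8b", "gzip"),
        (b"PK\x03\x04", "zip"),
        (b"%PDF-", "pdf"),
        (b"\xd0\xcf\x11\xe0", "ms-office (OLE2, e.g. .xls/.doc)"),
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return "unknown"

print(sniff(b"\x1f\x8b\x08"))  # gzip
```

A real service would also validate the bitstream against the format’s specification, which is where JHOVE earns its keep.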
The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format analysis application could be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information and whereabouts of datasets, but a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Microservices being promoted by the California Digital Library:
Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.
One last thing before you are returned to your regularly scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL for the HTML view of the resource. For example:
instead of:
This is understandable given some of the dictums about Linked Data and trying to separate the Information Resource from the Non-Information Resource. But it would be nice if the URI resolved via a 303 redirect to the HTML, as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset, it’s important that it resolves, so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.
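To make that follow-your-nose behaviour concrete, here is a sketch using only the standard library: a toy server answers the dataset URI with a 303 See Other pointing at an HTML description, and a client arrives at the description automatically. The paths are made up for the example.

```python
import http.server
import threading
import urllib.request

class SeeOtherHandler(http.server.BaseHTTPRequestHandler):
    """Toy server: the dataset URI 303-redirects to its HTML description."""
    def do_GET(self):
        if self.path == "/id/dataset/example":
            self.send_response(303)
            self.send_header("Location", "/doc/dataset/example")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>dataset description page</html>")
    def log_message(self, *args):
        pass  # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), SeeOtherHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# urllib follows the 303 automatically, landing on the HTML view
resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/id/dataset/example")
print(resp.url)     # ends with /doc/dataset/example
print(resp.status)  # 200
server.shutdown()
```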