Category Archives: government

dcat:distribution considered helpful

The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they have embedded in their dataset webpages. As an example, here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described, using the dcat:distribution property.

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10007/DOC>,
         <http://www.dft.gov.uk/data/download/10008/ZIP>,
         <http://www.dft.gov.uk/data/download/10009/GZ>,
         <http://www.dft.gov.uk/data/download/10010/GZ>,
         <http://www.dft.gov.uk/data/download/10011/GZ>,
         <http://www.dft.gov.uk/data/download/10012/GZ>,
         <http://www.dft.gov.uk/data/download/10013/GZ>,
         <http://www.dft.gov.uk/data/download/10014/GZ>,
         <http://www.dft.gov.uk/data/download/10015/GZ>,
         <http://www.dft.gov.uk/data/download/10016/GZ>,
         <http://www.dft.gov.uk/data/download/10017/GZ>,
         <http://www.dft.gov.uk/data/download/10018/GZ>,
         <http://www.dft.gov.uk/data/download/10019/GZ>,
         <http://www.dft.gov.uk/data/download/10020/GZ>,
         <http://www.dft.gov.uk/data/download/10021/GZ>,
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little Python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).

So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable 3rd party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer, but copies are important for digital preservation, when the unfortunate happens. The subtlety is being able to know where the copies of a particular dataset are in the enterprise, in a distributed system like the Web, and the mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations of other research to the actual dataset it is based upon. And lastly, aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.

Consider this practical example: as someone interested in digital preservation I’d like to be able to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the referenced dataset, I was able to write a small Python program to crawl the datasets and log the media type and HTTP status code along the way, to generate some results like:

media type datasets
text/html 5898
application/octet-stream 1266
application/vnd.ms-excel 874
text/plain 234
text/csv 220
application/pdf 167
text/xml 81
text/comma-separated-values 51
application/x-zip-compressed 36
application/vnd.ms-powerpoint 33
application/zip 31
application/x-msexcel 28
application/excel 21
application/xml 18
text/x-comma-separated-values 14
application/x-gzip 13
application/x-bittorrent 12
application/octet_stream 12
application/msword 10
application/force-download 10
application/x-vnd.oasis.opendocument.presentation 9
application/x-octet-stream 9
application/vnd.excel 9
application/x-unknown-content-type 6
application/xhtml+xml 6
application/vnd.msexcel 5
application/vnd.google-earth.kml+xml kml 5
application/octetstream 4
application/csv 3
vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/octet-string 2
image/jpeg 1
image/gif 1
application/x-mspowerpoint 1
application/vnd.google-earth.kml+xml 1
application/powerpoint 1
application/msexcel 1
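
A tally like the one above could be produced along the following lines. This is only a sketch, not the program used for this post: the function names are made up for illustration, it assumes the dcat:distribution URLs have already been pulled out of the RDFa, and counting only responses with a 200 status is a choice made here rather than something the post specifies.

```python
from collections import Counter
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def normalize_media_type(content_type):
    """Drop parameters such as '; charset=utf-8' and lowercase the rest."""
    if not content_type:
        return "unknown"
    return content_type.split(";", 1)[0].strip().lower()


def probe(url, timeout=30):
    """Return (status, media_type) for one distribution URL via HEAD."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type")
            return response.status, normalize_media_type(content_type)
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code, normalize_media_type(error.headers.get("Content-Type"))
    except OSError:
        # No usable HTTP response (DNS failure, refused connection, timeout).
        return None, "unknown"


def tally(results):
    """Count media types over (status, media_type) pairs with status 200."""
    counts = Counter()
    for status, media_type in results:
        if status == 200:
            counts[media_type] += 1
    return counts


def report(urls):
    """Probe every URL and print media types, most common first."""
    counts = tally(probe(url) for url in urls)
    for media_type, count in counts.most_common():
        print(media_type, count)
```

Calling report() with a list of distribution URLs prints one line per media type; nothing touches the network until it is called.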

Granted some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the object of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.

The application/octet-stream media type (and its variants) is also kind of useless for these purposes, since it basically says the dataset is made of bits. It would be more helpful if the servers in these cases could send something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream to figure out just what this data is, whether it is valid, etc.
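
To give a feel for what post-hoc identification involves, here is a deliberately tiny stand-in for tools like JHOVE or DROID: it looks at the first few bytes of a file and compares them to a handful of well-known signatures. The signature list, the format labels and the function name are choices made for this sketch. Real identification tools use far larger signature registries and can also validate the file, which this does not attempt.

```python
# Leading byte signatures for a few formats that show up in the table.
# Illustrative only, and far from exhaustive.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (also .xlsx, .docx, .odp)"),
    (b"\x1f\x8b", "gzip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
     "OLE2 compound file (legacy .xls, .doc, .ppt)"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def sniff(head):
    """Guess a format from the first bytes of a file, or return None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


def sniff_file(path):
    """Read just enough of a downloaded file to compare signatures."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))
```

A None result just means the bytes did not match this short list; plain-text formats like CSV have no signature at all and need a different kind of check.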

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format analysis application could be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information and whereabouts of datasets, but a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Microservices being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regularly scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL for the HTML view for the resource. For example:

http://data.gov.uk/id/dataset/anonymised_mot_test

instead of:

http://data.gov.uk/dataset/anonymised_mot_test

This is understandable given some of the dictums about Linked Data and trying to separate the Information Resource from the Non-Information Resource. But it would be nice if the URL resolved via a 303 redirect to the HTML as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset it’s important that it resolves so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.
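
A quick way to see how an identifier URI behaves is to make a single request without following redirects, then look at the status code and the Location header. The sketch below uses only Python’s standard library. The function names are illustrative, and nothing in it reflects how data.gov.uk is actually configured.

```python
import http.client
from urllib.parse import urlsplit


def first_response(url, timeout=30):
    """Make one GET without following redirects.

    Returns (status, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        connection.request("GET", path, headers={"Accept": "text/html"})
        response = connection.getresponse()
        return response.status, response.getheader("Location")
    finally:
        connection.close()


def describe(status, location):
    """Summarize whether a response follows the 303 pattern."""
    if status == 303 and location:
        return "303 See Other -> " + location
    if status == 200:
        return "200 OK: the URI returns a representation directly"
    return "%s: no 303 redirect to a document" % status
```

Passing the result of first_response() for the /id/ URI to describe() shows at a glance whether an agent following its nose would end up at the HTML page.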

iogdc ramblings

Yesterday I was at the first day of the International Open Government Data Conference in Washington DC. It was an exciting day, with a great deal of enthusiasm being expressed by luminaries like Tim Berners-Lee, Jim Hendler, Beth Noveck, and Vivek Kundra for enabling participatory democracy by opening up access to government data. Efforts like data.gov, data.gov.uk, data.govt.nz, data.australia.gov.au to aggregate egov datasets from their jurisdictions were well represented, although it would’ve been great to hear more from places like Spain and Sweden, as well as groups like the Sunlight Foundation and Open Knowledge Foundation … but there are two more days to go. Here are my reflections so far from the first day:

Licensing

New Zealand is embracing the use of Creative Commons licenses to release their datasets onto the web. Their NZGOAL project got cabinet approval for using CC licenses in June of this year. They are now doing outreach within government agencies, and building tools to help data owners put these licenses into play, so that data can go out on the web. Where I work at the Library of Congress, the general understanding is that our data is public domain (in the US) … except when it’s not. For example some of the high resolution images in the Prints and Photographs Catalog aren’t available outside the physical buildings of the Library of Congress, due to licensing concerns. So I’m totally envious of New Zealand’s coordinated efforts to iron out these licensing issues.

Centralization/Decentralization

Vivek Kundra and Alan Mallie of data.gov touted the number of datasets that they are federating access to. But it remains unclear exactly how content is federated, and how datasets flow from agencies into data.gov itself. Perhaps some of these details are included in the v1.0 release of the data.gov Concept of Operations (which Kundra announced). An excellent question posed to Berners-Lee and Kundra concerned what role centralized and distributed approaches play in publishing data. While there is value in one-stop-shopping where you can find data aggregated in one place, Berners-Lee really stressed that the web grew because it was distributed. Aggregated collections of datasets like data.gov need to be able to efficiently pull data from places where it is collected. We need to use the web effectively to enable this.

Legacy Data

There are tons of datasets waiting to be put on the web. Steve Young of the EPA described a few datasets such as the Toxics Release Inventory, which has the goal to:

provide communities with information about toxic chemical releases and waste management activities and to support informed decision making at all levels by industry, government, non-governmental organizations, and the public.

This data has been collected for 22 years, following the Emergency Planning and Community Right-to-Know Act. Young emphasized how important it is that this data be used in applications, and combined with other datasets. The data is available for download directly from the EPA, and is also available on data.gov. It would’ve been interesting to learn more about the mechanics of how the EPA gets data onto data.gov, and how updates can flow.

But a really important question came from Young’s colleague at the EPA (sorry I didn’t note her name). She asked about how the data in their relational databases could be made available on the web. Should they simply dump the database? Or is there something else they could do? Young said that it’s early days, but he hoped that Linked Data might have some answers.

The issue came up again later in the day at the Is the Semantic Web Ready Yet panel. There was a question about how to make Linked Data relevant to folks whose focus is Enterprise data. In my opinion Linked Data advocates overemphasize the importance of using RDF and SPARQL (standards), and of converting all the data over, without completely understanding how invasive these solutions are. Not enough is done to show enterprise data folks, who typically think in terms of relational databases, what they can do to put their lovingly crafted and hugged data on the web. Consider a primary key in a database: what does it identify, what relations does that thing have with other things? Why not use that key in constructing a URL for that thing, and link things together using the URLs? Then other people could use your URLs as well in their own data. I think the drumbeat to use SPARQL and triple stores often misses explaining this fundamental baby step that data owners could take. As Derek Willis said (on the 2nd day, when I’m writing this), people want to use your data, but not your database…people want to browse your data using their web browser. Assigning URLs to the important stuff in your databases is the first important step to make with Linked Data.
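
The baby step of turning primary keys into URLs can be sketched in a few lines. Everything concrete here is invented for illustration: the example.org base URL, the table names, and the two rows stand in for whatever a real database holds. The point is only that a foreign key becomes a link once both ends have URLs.

```python
from urllib.parse import quote

# Hypothetical base URL for the site that would publish the records.
BASE = "http://example.org"


def url_for(table, key):
    """Mint a URL for a row from its table name and primary key."""
    return "%s/%s/%s" % (BASE, quote(table, safe=""), quote(str(key), safe=""))


# Made-up rows standing in for two related tables.
facilities = [{"id": 17, "name": "Example Facility"}]
releases = [{"id": 501, "facility_id": 17, "year": 2009}]


def describe_release(row):
    """Turn a release row into a record whose foreign key is a link."""
    return {
        "url": url_for("release", row["id"]),
        "year": row["year"],
        # The foreign key is published as the URL of the related thing.
        "facility": url_for("facility", row["facility_id"]),
    }


for row in releases:
    print(describe_release(row))
```

Nothing here requires a triple store: the same URLs can be served as ordinary web pages first, and other people can start pointing at them from their own data.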

Community

Robert Schaefer of the Applied Physics Lab at Johns Hopkins University pointed out that enabling virtual communities around our data is an essential part of making data available and usable. In my opinion this is the true potential of platform and data aggregator sites like data.gov…they can allow users of government datasets to share what they have done, and learn from each other. Efforts like Civic Commons also promise to be places where this collaboration can take place. The communities may be born inside or outside of government, but they inevitably must include both. The W3C Egov effort might also be a good place to collaborate on standards.

federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936 in the depths of the Great Depression as a way to communicate in one place all that the agencies were doing to try to jump start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes browsing the new site and comparing it to the old one make it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one-off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down the silo walls, and that every well designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one-off’ dedicated web applications.

As a software developer working in the federal government there were a few things about the Federal Register 2.0 that I found really exciting:

  • Fruitful collaboration between federal employees and citizen activist/geeks initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, ical feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for search related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can be discovered via links, see below about permalinks. They also have made a sitemap available for crawlers to efficiently discover URLs for the content.
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leap frog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register in social media sites like Facebook, Twitter and out on the open web in blogs, and in collaborative applications like Wikipedia.
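
The robots.txt point in the list above is easy to picture with Python’s standard urllib.robotparser module. The rules below are hypothetical, written to mirror the pattern described (search results excluded, everything else open). They are not a copy of the Federal Register’s actual robots.txt, and the example.org URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block dynamic search results, allow everything else.
RULES = """\
User-agent: *
Disallow: /search
"""


def build_parser(text):
    """Parse robots.txt rules from a string rather than fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# Placeholder URLs: an article permalink and a search results page.
article = "http://example.org/articles/2010/11/18/some-entry"
search = "http://example.org/search?q=epa"

print(article, parser.can_fetch("*", article))
print(search, parser.can_fetch("*", search))
```

A polite crawler would make the same can_fetch() check against the live robots.txt before requesting each URL, and would use the sitemap rather than search pages to discover content.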

I would like to see more bulk access to XML data made available, for re-purposing on other websites–although I guess it might be possible to walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations. Perhaps even following the lead of the London Gazette and trying to make some of the structured metadata available in the HTML using RDFa. It also looks like content is only available for 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.

But the great thing about what these folks have done is now I can fork the project on github, see how easy it is to add the changes, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts–so hats off to everyone who helped make this happen.