I finally got around to reading Web Services for Recovery.gov by Erik Wilde, Eric Kansa and Raymond Yee. The authors wrote the report with funding from the Sunlight Foundation, who are deeply engaged in improving the way the US Federal Government provides transparent access to its data assets.

I highly recommend giving it a read if you are interested in web services, REST, Linked Data, and simple things you can do to open up access to data. The practicality of the advice clearly comes from an actual implementation over at recovery.berkeley.edu, where they kick the tires on their ideas.

Erik's blog has a succinct summary of the paper's findings, which for me boils down to:

any data source that is frequently updated must have simple ways for synchronizing data

Web syndication is a widely deployed mechanism for presenting a list of updated web resources. The authors make a pretty strong case for Atom because of its pervasive use of identifiers for content, extensibility, rich linking semantics, paging, the potential for write-enabled services, install base, and generally just good old Resource Oriented Architecture a.k.a. REST.
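To make that synchronization point concrete, here is a minimal Python sketch (my own, not from the report) of what harvesting such a feed could look like, using the feedparser library. The feed URL is hypothetical, and the paging follows Atom's rel="next" link convention from RFC 5005.

    # A minimal sketch of incremental harvesting from an Atom feed.
    # The feed URL is hypothetical; requires the feedparser package.
    import feedparser

    FEED_URL = "http://example.gov/recovery/feed.atom"  # hypothetical

    def entries(url):
        """Yield (id, updated, title) for every entry, across paged feeds."""
        while url:
            feed = feedparser.parse(url)
            for e in feed.entries:
                # Atom gives every entry a stable id and an updated timestamp,
                # which is exactly what simple synchronization needs.
                yield e.id, e.updated, e.title
            # Follow the rel="next" paging link (RFC 5005), if any.
            url = next((link.href for link in feed.feed.get("links", [])
                        if link.get("rel") == "next"), None)

    for entry_id, updated, title in entries(FEED_URL):
        print(updated, entry_id, title)

A downstream client only has to remember the ids and updated timestamps it has already seen in order to fetch just what changed.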

Because of my interest in Linked Data the paragraph that discusses why RDF/XML wasn't chosen as a data format is particularly interesting:

The approach described in this report, driven by a desire for openness and accessibility, uses the most widely established technologies and data formats to ensure that access to reporting data is as easy as possible. Recently, the idea of openly accessible data has been promoted under the term of "linked data", with recent recommendations being centered around a very specific choice of technologies and data models (all centered around Semantic Web approaches focusing on RDF for data representation and centralized data storage). While it is possible to use these approaches for building Web applications, our recommendation is to use better established and more widely supported technologies, thereby lowering the barrier-to-entry and choosing a simpler toolset for achieving the same goals as with the more sophisticated technologies envisioned for the Semantic Web.

It could be argued that the growing amount of RDF/XML on the Linked Data web makes it a contender for Atom's install base, especially when you consider RSS 1.0. However, I think the main point the authors are making is that the tools for working with XML documents far outnumber the tools available for processing RDF/XML graphs. Furthermore, most programmers I know are more familiar with the processing model and standards associated with XML documents (DOM, XSLT, XPath, XQuery) than with RDF graphs (triples, directed graphs, GRDDL, SPARQL). Maybe this says more about the people I know … and if I were to jump into the biomedical field I'd feel differently. But perhaps the most subtle point is that whether or not developers know it, Atom expresses a graph model just like RDF/XML … but it does so in a much more straightforward, familiar document-centric way.
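To illustrate the tooling gap, here is a small, contrived Python comparison (the data is invented for illustration): pulling titles out of an Atom-style XML document takes a few lines of the standard library's path expressions, while asking the equivalent question of an RDF graph goes through the third-party rdflib package and a SPARQL query.

    # Contrived comparison: the same question asked of an XML document
    # and of an RDF graph. The sample data below is invented.
    import xml.etree.ElementTree as ET
    from rdflib import Graph

    ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
      <entry><id>urn:example:1</id><title>Award 1</title></entry>
    </feed>"""

    RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                          xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="urn:example:1">
        <dc:title>Award 1</dc:title>
      </rdf:Description>
    </rdf:RDF>"""

    # Document-centric: standard library, XPath-style path expressions.
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for title in ET.fromstring(ATOM).findall("atom:entry/atom:title", ns):
        print(title.text)

    # Graph-centric: third-party rdflib, triples and a SPARQL query.
    g = Graph()
    g.parse(data=RDF_XML, format="xml")
    q = """PREFIX dc: <http://purl.org/dc/elements/1.1/>
           SELECT ?title WHERE { ?doc dc:title ?title }"""
    for row in g.query(q):
        print(row.title)

Neither approach is hard in isolation; the difference is in how many developers already know the first idiom versus the second.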

Of course the debate about whether RDF needed to be a part of Linked Data or not rippled through the semantic web community a few months ago, and there's little chance of resolving any of those issues here. In the pure-land of RDF model theory the choice between Atom and RDF/XML is a bit of a false dilemma, since RDF/XML is minimally processable with, well, XML tools … and idioms like GRDDL allow Atom to be explicitly represented as an RDF graph. And in fact, REST and content negotiation would allow both serializations to co-exist nicely in the context of a single web application. However, I'd argue that this point isn't a particularly easy thing to explain, and it certainly isn't terrain that you would want to navigate in documentation on the recovery.gov website. The choice of whether RDF belongs in Linked Data or not has technical and business considerations, but I'm increasingly seeing it as a cultural issue that perhaps doesn't really even need resolving.
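As a sketch of that coexistence (my example, not the report's): a single resource URL can serve either serialization depending on the client's Accept header. The URL and the server's behavior here are hypothetical.

    # Hypothetical illustration of content negotiation: one URL, two
    # serializations, selected by the Accept header. The URL is made up,
    # and the server is assumed to honor both media types.
    import requests

    URL = "http://example.gov/recovery/reports/12345"  # hypothetical

    atom = requests.get(URL, headers={"Accept": "application/atom+xml"})
    rdf = requests.get(URL, headers={"Accept": "application/rdf+xml"})

    print(atom.headers.get("Content-Type"))  # e.g. application/atom+xml
    print(rdf.headers.get("Content-Type"))   # e.g. application/rdf+xml

The mechanics are simple enough, but explaining them to a general audience in recovery.gov documentation is another matter.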

Even Tim Berners-Lee recognizes that there are quite large hurdles to modeling all government data on the Linked Data web in RDF and querying it with SPARQL. It's a bit unrealistic to expect the Federal Government to start modeling and storing its enterprise data in a fundamentally new and somewhat experimental way in order to support what amounts to arbitrary database queries from anyone on the web. If that's what the Linked Data brand is, I'm not buying it. That being said, I see a great deal of value in the RDF data model (the giant global graph), especially as a tool for seeing how your data fits the contours of the web.

The important message of Erik, Eric and Raymond's paper is that the Federal Government should focus on putting data out on the web in familiar ways, using the sound web architecture practices that have allowed the web to grow and evolve into the wonderful environment it is today. Atom is a flexible, simple, commonly supported, well understood XML format for letting downstream applications know about newly published web resources. If the Federal Government is serious about the long-term sustainability of efforts like recovery.gov and data.gov, it should focus on enabling an ecosystem of visualization applications created by third parties, rather than trying to produce those applications itself. I hope the data.gov folks also run across this important work. Thanks to the Sunlight Foundation for funding the folks at Berkeley.