linking things and common sense

Tom Scott’s recent Linking Things post got me jotting down what I’ve been thinking lately about URIs, Linked Data and the Web. First go read Tom’s post if you haven’t already. He does a really nice job of setting the stage for why people care about using distinct URIs (web identifiers) for identifying web documents (aka information resources) and real world things (aka non-information resources). Tom’s opinions are grounded in the experience of really putting these ideas into practice at the BBC. His key point, which he attributes to Michael Smethurst, is that:

Some people will tell you that the whole non-information resource thing isn’t necessary – we have a web of documents and we just don’t need to worry about URIs for non-information resources; others will claim that everything is a thing and so every URL is, in effect, a non-information resource.

Michael, however, recently made a very good point (as usual): all the interesting assertions are about real world things not documents. The only metadata, the only assertions people talk about when it comes to documents are relatively boring: author, publication date, copyright details etc.

If this is the case then perhaps we should focus on using RDF to describe real world things, and not the documents about those things.

I think this is an important observation, but I don’t really agree with the conclusion. I would conclude instead that the distinction between real world and document URIs is a non-issue. We should be able to tell if the thing being described is a document or a real world thing based on the vocabulary terms that are being used.

For example, if I assert:

<http://en.wikipedia.org/wiki/William_Shakespeare> a foaf:Person ; foaf:name "William Shakespeare" .

Isn’t it reasonable to assume http://en.wikipedia.org/wiki/William_Shakespeare identifies a person whose name is William Shakespeare? I don’t have to try to resolve the URL and see if I get a 303 or 200 response code do I?

And if I also assert,

<http://en.wikipedia.org/wiki/William_Shakespeares> dcterms:modified "2010-06-28T17:02:41-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>

can’t I can assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that was modified on 2010-06-28T17:02:41? Does it really make sense to think that the person William Shakespeare was modified then? Not really…

Similarly if I said,

<http://en.wikipedia.org/wiki/William_Shakespeare> cc:license <http://creativecommons.org/licenses/by-sa/3.0/> .

Isn’t it reasonable to assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that is licensed with the Attribution-ShareAlike 3.0 Unported license? It doesn’t really make sense to say that the person William Shakespeare is licensed with Attribution-ShareAlike 3.0 Unported does it? Not really…

Why does the Linked Data community lean on using identifiers to do this common sense work? Well, largely because people argued about it for three years and this is the resolution the W3C came to. In general I like the REST approach of saying a URL identifies a Resource, and that when you resolve one you get back a Representation (a document of some kind, html, rdf/xml, whatever). Why does it have to be more complicated than that?

If it’s not clear if an assertion is about a document or a thing, why isn’t that a problem with the vocabulary in use being underspecified and vague? I believe this is essentially the point that Xiaoshu Wang made three years ago in his paper URI Identity and Web Architecture Revisited.

To get back to Tom’s point, I agree that the really interesting assertions in Linked Data are about things, and their relations, or as Richard Rorty said a bit more expansively:

There is nothing to be known about anything except an initially large, and forever expandable, web of relations to other things. Everything that can serve as a term of relation can be dissolved into another set of relations, and so on for ever. There are, so to speak, relations all the way down, all the way up, and all the way out in every direction: you never reach something which is not just one more nexus of relations.

Philosophy and Social Hope, pp 53-54.

But assertions about a document, albeit being a bit more on the dry side, are also useful and important, such as: who created the web document, when they created it, a license associated with the document, its relation to previous versions, etc. As a software developer working in a library I’m actually really interested in this sort of administrivia. In fact the Open Archives Initiative Object Reuse and Exchange vocabulary, and the Memento efforts are largely about relating web documents together in meaningful and useful ways: to be able to harvest compound objects out of the web, and to navigate between versions of web documents. Heck, the Dublin Core vocabulary started out as an effort to describe networked resources (essentially documents), and the gist of the Dublin Core Metadata Terms retain much of this flavor. So I think RDF is also important for describing documents on the web, or (more accurately) representations.

So, in short:

URLs identify resources.
A resource can be anything.
When you resolve a URL you get a representation of that resource.
If a representation is some sort of flavor of RDF, the semantics of an RDF vocabulary should make it clear what is being described.
If it’s not clear, maybe the vocabulary sucks.

I think this is basically the point that Harry Halpin and Pat Hayes were making in their paper In Defense of Ambiguity. A URL has a dual role: it identifies resources, and it allows us to access representations of resources. This ambiguity is the source of its great utility, expressiveness and power. It’s why we see URLs on the sides of buses and buildings. It’s why a QR Code slapped on some real world thing has a URL embedded in it.

In an ideal world (where people agreed with Xiaoshu, Harry and Pat) I don’t think this would mean we would have to redo all the Linked Data that we have already. I think it just means that publishers who want the granularity of distinguishing between real world things and documents at the identifier level can have it. It would also mean that the Linked Data space can accommodate RESTafarians, and other mere mortals who don’t want to ponder whether their resources are information resources or not. And, of course, it would mean we could use a URL like http://en.wikipedia.org/wiki/William_Shakespeare to identify William Shakespeare in our RDF data …

Wouldn’t that be nice?