linking things and common sense

Tom Scott’s recent Linking Things post got me jotting down what I’ve been thinking lately about URIs, Linked Data and the Web. First go read Tom’s post if you haven’t already. He does a really nice job of setting the stage for why people care about using distinct URIs (web identifiers) for identifying web documents (aka information resources) and real world things (aka non-information resources). Tom’s opinions are grounded in the experience of really putting these ideas into practice at the BBC. His key point, which he attributes to Michael Smethurst, is that:

Some people will tell you that the whole non-information resource thing isn’t necessary – we have a web of documents and we just don’t need to worry about URIs for non-information resources; others will claim that everything is a thing and so every URL is, in effect, a non-information resource.

Michael, however, recently made a very good point (as usual): all the interesting assertions are about real world things not documents. The only metadata, the only assertions people talk about when it comes to documents are relatively boring: author, publication date, copyright details etc.

If this is the case then perhaps we should focus on using RDF to describe real world things, and not the documents about those things.

I think this is an important observation, but I don’t really agree with the conclusion. I would conclude instead that the distinction between real world and document URIs is a non-issue. We should be able to tell if the thing being described is a document or a real world thing based on the vocabulary terms that are being used.

For example, if I assert:

<http://en.wikipedia.org/wiki/William_Shakespeare> a foaf:Person ; foaf:name "William Shakespeare" .

Isn’t it reasonable to assume http://en.wikipedia.org/wiki/William_Shakespeare identifies a person whose name is William Shakespeare? I don’t have to try to resolve the URL and see if I get a 303 or 200 response code do I?

And if I also assert,

<http://en.wikipedia.org/wiki/William_Shakespeare> dcterms:modified "2010-06-28T17:02:41-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

can’t I assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that was modified on 2010-06-28T17:02:41? Does it really make sense to think that the person William Shakespeare was modified then? Not really…

Similarly if I said,

<http://en.wikipedia.org/wiki/William_Shakespeare> cc:license <http://creativecommons.org/licenses/by-sa/3.0/> .

Isn’t it reasonable to assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that is licensed with the Attribution-ShareAlike 3.0 Unported license? It doesn’t really make sense to say that the person William Shakespeare is licensed with Attribution-ShareAlike 3.0 Unported does it? Not really…

Why does the Linked Data community lean on using identifiers to do this common sense work? Well, largely because people argued about it for three years and this is the resolution the W3C came to. In general I like the REST approach of saying a URL identifies a Resource, and that when you resolve one you get back a Representation (a document of some kind, html, rdf/xml, whatever). Why does it have to be more complicated than that?
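A minimal sketch of that resolution, as it is commonly summarized from the W3C TAG's httpRange-14 decision, might look like the following. The function name and the returned strings are illustrative assumptions, not part of any specification or library, and no network request is made:

```python
def implied_by_status(status: int) -> str:
    """Rough reading of what a GET response status implies about
    the resource a URI identifies, per the httpRange-14 resolution.

    The return strings are hypothetical labels chosen for this sketch.
    """
    if 200 <= status < 300:
        # A 2xx response: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # A 303 See Other: the URI could identify any resource at all.
        return "any resource"
    if 400 <= status < 500:
        # A 4xx response: the nature of the resource is unknown.
        return "unknown"
    # Other status codes are not addressed by the resolution.
    return "not covered"


for code in (200, 303, 404):
    print(code, "->", implied_by_status(code))
```

The point of the sketch is only that the distinction is carried by the HTTP interaction and the identifier, not by anything said in the RDF itself.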

If it’s not clear if an assertion is about a document or a thing, why isn’t that a problem with the vocabulary in use being underspecified and vague? I believe this is essentially the point that Xiaoshu Wang made three years ago in his paper URI Identity and Web Architecture Revisited.

To get back to Tom’s point, I agree that the really interesting assertions in Linked Data are about things, and their relations, or as Richard Rorty said a bit more expansively:

There is nothing to be known about anything except an initially large, and forever expandable, web of relations to other things. Everything that can serve as a term of relation can be dissolved into another set of relations, and so on for ever. There are, so to speak, relations all the way down, all the way up, and all the way out in every direction: you never reach something which is not just one more nexus of relations.

Philosophy and Social Hope, pp 53-54.

But assertions about a document, albeit a bit on the dry side, are also useful and important, such as: who created the web document, when they created it, a license associated with the document, its relation to previous versions, etc. As a software developer working in a library I’m actually really interested in this sort of administrivia. In fact the Open Archives Initiative Object Reuse and Exchange vocabulary, and the Memento efforts are largely about relating web documents together in meaningful and useful ways: to be able to harvest compound objects out of the web, and to navigate between versions of web documents. Heck, the Dublin Core vocabulary started out as an effort to describe networked resources (essentially documents), and the gist of the Dublin Core Metadata Terms retains much of this flavor. So I think RDF is also important for describing documents on the web, or (more accurately) representations.

So, in short:

  1. URLs identify resources.
  2. A resource can be anything.
  3. When you resolve a URL you get a representation of that resource.
  4. If a representation is some sort of flavor of RDF, the semantics of an RDF vocabulary should make it clear what is being described.
  5. If it’s not clear, maybe the vocabulary sucks.
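Points 4 and 5 can be sketched roughly in Python, using the Shakespeare statements from earlier in the post. Everything here is an assumption made for illustration: the lookup tables are a hand-written common-sense reading and are not drawn from the vocabularies' formal domain declarations, the helper name reading is made up, and the dcterms:publisher triple with its "Example Publisher" value is a hypothetical addition to show the unclear case.

```python
# Hand-written, common-sense tables (assumptions for this sketch only).
DOCUMENT_PREDICATES = {"dcterms:modified", "cc:license"}
THING_PREDICATES = {"foaf:name"}
THING_TYPES = {"foaf:Person"}


def reading(predicate: str, obj: str) -> str:
    """Guess whether a statement is about a document or a real world thing."""
    if predicate == "rdf:type":
        return "thing" if obj in THING_TYPES else "unclear"
    if predicate in DOCUMENT_PREDICATES:
        return "document"
    if predicate in THING_PREDICATES:
        return "thing"
    # Point 5: if the vocabulary term does not settle it, flag it.
    return "unclear"


URI = "http://en.wikipedia.org/wiki/William_Shakespeare"

triples = [
    (URI, "rdf:type", "foaf:Person"),
    (URI, "foaf:name", "William Shakespeare"),
    (URI, "dcterms:modified", "2010-06-28T17:02:41-04:00"),
    (URI, "cc:license", "http://creativecommons.org/licenses/by-sa/3.0/"),
    # Hypothetical triple, not from the post, to exercise the unclear case.
    (URI, "dcterms:publisher", "Example Publisher"),
]

readings = [(p, reading(p, o)) for _, p, o in triples]
for predicate, guess in readings:
    print(f"{predicate}: {guess}")
```

The same URL ends up with both document and thing readings, and the one term that fits neither table is flagged rather than guessed at.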

I think this is basically the point that Harry Halpin and Pat Hayes were making in their paper In Defense of Ambiguity. A URL has a dual role: it identifies resources, and it allows us to access representations of resources. This ambiguity is the source of its great utility, expressiveness and power. It’s why we see URLs on the sides of buses and buildings. It’s why a QR Code slapped on some real world thing has a URL embedded in it.

In an ideal world (where people agreed with Xiaoshu, Harry and Pat) I don’t think this would mean we would have to redo all the Linked Data that we have already. I think it just means that publishers who want the granularity of distinguishing between real world things and documents at the identifier level can have it. It would also mean that the Linked Data space can accommodate RESTafarians, and other mere mortals who don’t want to ponder whether their resources are information resources or not. And, of course, it would mean we could use a URL like http://en.wikipedia.org/wiki/William_Shakespeare to identify William Shakespeare in our RDF data …

Wouldn’t that be nice?

linking things and common sense by Ed Summers, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 4.0 International License.

13 thoughts on “linking things and common sense”

  1. Ed, I am in complete agreement with you, but I will play devil’s advocate. I think you have chosen to compare two things that are quite distinct: web pages and people. Clearly you can distinguish the context of statements made for those types of things because there is not much of an overlap. However, to choose something close to your heart, can I freely use this URI http://chroniclingamerica.loc.gov/lccn/sn99021999/1899-10-22/ed-1/seq-25/ to refer to the first page of issue 19 of the Omaha Illustrated Bee? Is the publisher of the resource identified by that URI the LoC or Edward Rosewater?

  2. Great post, Ed. Not only would it be nice, as you say, I think it is the key to making Linked Data work. As you say: “This ambiguity is the source of its great utility, expressiveness and power.” The Web itself has that sort of ambiguity built in, and it’s what makes it possible to bend w/o breaking. Good vocabularies are key, as you say, but every vocabulary has a context — some vocabularies will be clearer w/in their intended domain/context. Language & meaning simply cannot be forced into universal unambiguity. Being unambiguous w/in a particular domain or context *is* a worthy goal. I’d contend that Ian’s example can be answered by either “you better have a good vocabulary to make that distinction clear” OR “depends on the context…”

  3. Peter, I guess one vocabulary oriented solution to my problem is to have a documentPublisher (or representationPublisher) property distinct from the usual publisher property.

    Following on from Michael Smethurst’s comment that there are only a limited number of interesting things to say about web documents, perhaps it’s possible to create a good enough vocabulary to cover them. i.e. documentLicense, documentPublishDate etc.

  4. This argument reminds me of a conversation I had when I was in China, when I asked a (Chinese) friend how they could distinguish the verb “shuo” which means “to say” in terms of past, present, and future. There is no verb conjugation (at least not like we have in English) in Chinese.

    The person, Dajie, looked at me like I was stupid, and told me that when someone said “Kong zi shuo” (loosely, “Confucius say”), that they meant it in the past, because Confucius is dead, and it had to be in the past.

    The idea that there is ambiguity when using a URI is kind of similar. If I’m talking about a person when I say http://davidbrunton.com, I probably mean David Brunton. If I’m talking about an article that succinctly explains the whole universe in a word, it’s probably that first post.

    Anyway, nice post, Ed. Love the Rorty quote, too :)

  5. @iand trust you to ask the hard questions :-) I should have blamed^w credited you for putting the idea in my head with your various posts over the years, and your recent “what would it break” twitter convo with cygri.

    I think you nailed the answer in your later comment: there are probably a handful of things that we’d want to be able to assert about the representation, and if something like dcterms:publisher was too ambiguous we’d probably want to define a new term that was by definition about the representation.

    As you know the URI http://chroniclingamerica.loc.gov/lccn/sn99021999/1899-10-22/ed-1/seq-25/ is served up by the host chroniclingamerica.loc.gov ; which DNS can tell us is owned by the Library of Congress. So I think it’s safe to say that representations being generated by that hostname are effectively published by the Library of Congress. So I’d expect dc:publisher to be used to say something unique about the resource, in this case that a particular page of a newspaper was published by Edward Rosewater. As @pkeane pointed out, the context (named graph, surface?) of related assertions, such as one that said:

    <http://chroniclingamerica.loc.gov/lccn/sn99021999/1899-10-22/ed-1/seq-25/> a <http://purl.org/ontology/bibo/Newspaper> .
    

    could provide some guidance about what dc:publisher is meant to refer to.

  6. While I think it is an elegant solution to shift the discrimination of information and non-information resources to vocabulary level by specifying the predicates, I see a problem arising with this approach which is similar to the existing problem:

    Basically, I’m afraid people, instead of asking “Why do I need an extra URI for the information resource?”, will just ask “Why do I have to use a different predicate for the real newspaper than for the same scanned newspaper when those predicates actually mean the same?” In effect, actual use might become inconsistent and the data’s value would diminish.

    So, I believe this approach wouldn’t solve the practical problems Tom writes about, because it too requires people to distinguish information and non-information resources, just as minting different URIs for information and non-information resources does.

    I think the underlying problem is: people don’t see any sense in distinguishing retrievable and non-retrievable resources, or aren’t even able to do so. So this is rather the task: making clear why this distinction is useful and desirable.

    Adrian

  7. As well as documents that refer to documents such as the site @iand points to or URIBurner, in principle one might want to talk about the document meta-information itself – maybe we want to identify who added the meta-information such as dc:creator. The reflective stack is unbounded!

    In ordinary language we cope with this sort of thing in the way that Ed suggests for URIs: I can write, “Alan is writing this comment” or “Alan starts with the letter A” without even bothering to put quotes around ‘Alan’. Of course this can cause confusion or be used in verbal puns, so is not unambiguous, but works most of the time.

    In some cases, like computing or philosophy, these things start to become more complex and ONLY then do we resort to more precise languages: “The first ASCII code in the representation of ‘Alan’ is 65” or “The use of ‘Alan’ as an example name and example string in the preceding paragraph.”

    Similar precise language is certainly needed for web resources to be able to distinguish the multiple levels, but maybe the solution would be to adopt Ed’s more relaxed style with a set of well known disambiguation rules, and then only be more precise when the disambiguation would fail.

    maybe in such cases:

    mymeta:document _:theDoc.
    _:theDoc dc:creator “Library of Congress”.
    _:theDoc dc:title “Omaha daily bee. (Omaha [Neb.]) 187?-1922, October 22, 1899, Image 25 – Chronicling America – The Library of Congress:.

    mymeta:refers_to _:thePaper.
    _:thePaper dc:title “Omaha daily bee”.

    … and then of course:

    mymeta:refers_to
    .

    just for fun I looked at:

    http://linkeddata.uriburner.com/about/html/http://linkeddata.uriburner.com/about/html/http://ontologi.es/rail/void.xhtml

    ;-)

  8. you said: “In general I think the REST approach of saying a URL identifies a Resource, and that when you resolve one you get back a Representation (a document of some kind, html, rdf/xml, whatever).”

    The sentence is not very clear, or I have a bit of difficulty parsing it. Did you mean:

    “In general, I agree with the REST approach i.e. A URL identifies a Resource and when you resolve the URL, you get back a Representation (a document of some kind, html, rdf/xml, whatever).”
