web documents and axioms for linked data

A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web.

I ended up having a very helpful off-list email exchange with Richard Cyganiak (one of the architects of the Linked Data pattern) about some trouble I’ve had understanding what Information Resources and Documents are in the context of Web Architecture. The trouble I had was in determining whether a collection of physical newspaper pages I was helping put on the web were Information Resources. I needed to know because I wanted to identify the newspaper pages with URIs and describe them as Linked Data…and the resolvability of these URIs was largely dependent on how I chose to answer the question.

Richard ended up offering some advice that I’ve since found very useful, and I thought I would write some of it down here in case you find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may really only be useful for people who are in the digital library domain, but perhaps it’s useful elsewhere.

On the subject of what a Document is, Richard offered this way of looking at Web Documents:

The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are *not* the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.

I would avoid the term “document” when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a “byte stream with a media type (and possibly other meta data)”. When I use the term “HTML document”, I mean a resource, identified by a URI, that has (only) HTML representations.

Richard encouraged me to think in terms of Web Documents and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents…which of course they are in the general sense. But once Web Documents were characterized this way, it became clear that the Newspaper Pages are not Web Documents. This view of Web Documents is supported in Cool URIs for the Semantic Web, which he co-authored.

Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:

I’m using a few rules that I think should be considered axioms of web architecture:

First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)

Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).

I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don’t).
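
As a rough illustration of the two axioms, here is a minimal Python sketch. It assumes the 303-redirect pattern described in Cool URIs for the Semantic Web as the way to connect the two resources. The example.org URIs, the class names and the `dereference` function are all hypothetical and were not part of the email exchange.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Representation:
    """A byte stream with a media type: the thing that goes over the wire."""
    media_type: str
    body: bytes


@dataclass
class Resource:
    uri: str
    is_web_document: bool
    # Second axiom: only Web Documents may carry representations.
    representations: list = field(default_factory=list)
    # For things that exist outside the Web: the Web Document describing them.
    described_by: Optional[str] = None

    def __post_init__(self):
        if self.representations and not self.is_web_document:
            raise ValueError("only Web Documents can have representations")


def dereference(resources, uri):
    """Return (status, headers, body) for a GET on the given URI."""
    resource = resources.get(uri)
    if resource is None:
        return 404, {}, b""
    if resource.is_web_document:
        if not resource.representations:
            return 404, {}, b""
        rep = resource.representations[0]
        return 200, {"Content-Type": rep.media_type}, rep.body
    # First axiom: the newspaper page exists independently of the Web, so it
    # gets its own URI and a redirect to the document that describes it.
    return 303, {"Location": resource.described_by}, b""


# Hypothetical URIs: one for the physical page, one for the web page about it.
PAGE = "http://example.org/id/page/1"
DOC = "http://example.org/doc/page/1"

RESOURCES = {
    PAGE: Resource(uri=PAGE, is_web_document=False, described_by=DOC),
    DOC: Resource(
        uri=DOC,
        is_web_document=True,
        representations=[Representation("text/html", b"<html>...</html>")],
    ),
}
```

Calling `dereference(RESOURCES, PAGE)` yields a 303 pointing at `DOC`, and `dereference(RESOURCES, DOC)` yields a 200 with the HTML representation. Trying to attach a representation directly to the newspaper page raises an error, which is the second axiom enforced in code.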

The pragmatist in me particularly liked how these aren’t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better. Finally Richard offered some advice on how to reconcile the REST and Linked Data views on identity:

I make sense of the REST worldview like this: In typical REST, all the URIs *always* identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is *about* that thing. The thing itself doesn’t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn’t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing *web pages* rather than *newspaper pages*.
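
The trouble described at the end of that quote can be shown with a small Python sketch. Statements are modelled as plain (subject, property, value) tuples rather than with a real RDF library. The URIs, property names and dates are made up for illustration, and treating a second value for the same property as a conflict is a deliberate simplification.

```python
from collections import defaultdict


def conflicts(statements):
    """Return the (subject, property) pairs that were given more than one value."""
    values = defaultdict(set)
    for subject, prop, value in statements:
        values[(subject, prop)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# One URI standing for both the web page and the newspaper page it shows.
ONE_URI = "http://example.org/page/1"
conflated = [
    (ONE_URI, "created", "1905-03-12"),  # when the newspaper page was printed
    (ONE_URI, "created", "2010-02-22"),  # when the web page was published
    (ONE_URI, "format", "newsprint"),
    (ONE_URI, "format", "text/html"),
]

# Two URIs: one for the newspaper page, one for the web page about it.
THING = "http://example.org/id/page/1"
DOC = "http://example.org/doc/page/1"
separated = [
    (THING, "created", "1905-03-12"),
    (THING, "format", "newsprint"),
    (DOC, "created", "2010-02-22"),
    (DOC, "format", "text/html"),
    (DOC, "describes", THING),
]

print(conflicts(conflated))   # two clashing properties on the single URI
print(conflicts(separated))   # empty: each statement has an unambiguous subject
```

With a single URI there is no way to tell which statement is about the paper and which is about the web page. With two URIs the same four facts sit side by side without clashing.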

I hope I haven’t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard’s points (which perhaps he has made elsewhere) so that I could revisit them, without having to dig through my email archive … Comments welcome!

4 Comments

  1. I work a lot with TEI documents, though I don’t consider myself an expert or a zealot in that arena. Your mention of the idea of web documents vs. non-web documents crystallizes a lot of tensions that have been floating around as I try to grok working with TEI in an RDF/OWL/LinkedData environment.

    My impression is that discussions about XML docs & related technologies are running along a parallel, but rarely converging, track with discussions about “web” documents. Philosophically, there is a big difference between the purity of the RDF/Linked Data world and the comparatively procedural XML world. But I keep finding that I wish I had a better strategy for reconciling the two.

    So it raises the question: how do “web documents” express, or at least point to, the kind of semantic nuance that we can express in a single non-web document. What sort of mechanism resolves a semantic concept expressed in TEI (for example) to a referenceable resource? I’m not aware of any XML-native technologies (e.g. Xpointer) that are really suited to this. It seems we’re stuck with creating RDF representations of semantic encoding within documents, but that level of abstraction is invariably going to introduce more noise into the already-noisy practice of text-encoding. I’m just curious what ideas are out there for reconciling these technologies – or have all the Linked Data community given up on XML?

    Monday, February 22, 2010 at 8:10 pm | Permalink
  2. Good stuff, Ed. I am struck yet again, though, at the discordance between the Linked Data and REST worldviews. I think a REST-based rejoinder to the last quote would state that building a system that made the distinction too finely would be brittle (cf. http://roy.gbiv.com/untangled/2008/resource-resource-wherefore-art-thou-resource ).

    My own take is that we are talking about two different kinds of architecture/approaches: Linked Data is concerned w/ compile time whereas REST is concerned w/ runtime. Put another way, linked data is strongly and statically typed and REST is weakly and dynamically typed. Not sure how useful that metaphor is, but it seems to me to help distinguish systems that will fit well into the Linked Data approach (e.g. , LCSH) and those that might not.

    I’m also struck that the two worldviews will continue to be at odds (there are essential differences), but that both will likely figure into the future of the web.

    Monday, February 22, 2010 at 10:16 pm | Permalink
  3. This makes a hell of a lot of sense.

    –jrochkind

    Monday, February 22, 2010 at 11:42 pm | Permalink
  4. jakoblog.de/ wrote:

    Thanks for the summary! I read the whole thread and also Tim Berners-Lee’s historical explanation how the term Resource slipped into the specifications. However “Web Document” is not much better than “Information Resource” and I disagree that only any of both can have representations. You can create byte streams to represent anything, a newspaper, a cat, or an HTML document. The question is only whether the representation is appropriate in a given context. But this question cannot be answered by technical architectures or axioms only. It always depends on. Funny how Semantic Web believers seem to think that you only need more standards and levels of abstraction to finally get rid of this fuzzy nasty human common sense that has less problems to handle with uncertain and contradicting information :-)

    Tuesday, February 23, 2010 at 12:13 am | Permalink
