web documents and axioms for linked data

A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web.

I ended up having a very helpful off-list email exchange with Richard Cyganiak (one of the architects of the Linked Data pattern) about some trouble I’ve had understanding what Information Resources and Documents are in the context of Web Architecture. The trouble was in determining whether the physical newspaper pages I was helping put on the web were Information Resources. I needed to know because I wanted to identify the newspaper pages with URIs and describe them as Linked Data, and the resolvability of these URIs depended largely on how I chose to answer the question.

Richard offered some advice that I’ve since found very useful, and I thought I would transcribe some of it here in case you find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may really only be useful for people who are in the digital library domain, but perhaps it’s useful elsewhere.

On the subject of what a Document is, Richard offered this way of looking at Web Documents:

The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are not the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.

…

I would avoid the term “document” when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a “byte stream with a media type (and possibly other meta data)”. When I use the term “HTML document”, I mean a resource, identified by a URI, that has (only) HTML representations.

Richard encouraged me to think in terms of Web Documents and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents, which of course they are in the general sense. But once Web Documents were characterized this way, it became clear that the Newspaper Pages are not Web Documents. This view of Web Documents is supported by Cool URIs for the Semantic Web, which he co-authored.

Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:

I’m using a few rules that I think should be considered axioms of web architecture:

First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)

Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).

I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don’t).
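To make the two axioms a bit more concrete for myself, here is a minimal Python sketch of how a server might apply them. It assumes the 303 redirect pattern described in Cool URIs for the Semantic Web; the example.org URIs, the Resource class and the respond function are all names I made up for illustration, not anything Richard proposed.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Resource:
    """A hypothetical record for something identified by a URI."""
    uri: str
    is_web_document: bool
    # For things that exist independently of the Web: the URI of the
    # Web Document that describes them.
    described_by: Optional[str] = None
    # Only Web Documents carry a representation (bytes plus a media type).
    media_type: Optional[str] = None
    body: Optional[bytes] = None


# Hypothetical URIs: one for the physical newspaper page, one for the web page.
PAGE_URI = "http://example.org/id/page/1"
DOC_URI = "http://example.org/doc/page/1"

RESOURCES: Dict[str, Resource] = {
    PAGE_URI: Resource(PAGE_URI, is_web_document=False, described_by=DOC_URI),
    DOC_URI: Resource(
        DOC_URI,
        is_web_document=True,
        media_type="text/html",
        body=b"<html><body>About newspaper page 1</body></html>",
    ),
}


def respond(uri: str, resources: Dict[str, Resource]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET on the given URI."""
    resource = resources.get(uri)
    if resource is None:
        return 404, {}, b""
    if resource.is_web_document:
        # Second axiom: only Web Documents have representations.
        return 200, {"Content-Type": resource.media_type or ""}, resource.body or b""
    # First axiom: the newspaper page exists independently of the Web, so it
    # is not a Web Document. Point the client at the document describing it.
    return 303, {"Location": resource.described_by or ""}, b""
```

A GET on the newspaper page URI never returns bytes directly; it redirects to the web page, which is the only resource here with a representation.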

The pragmatist in me particularly liked how these aren’t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better. Finally Richard offered some advice on how to reconcile the REST and Linked Data views on identity:

I make sense of the REST worldview like this: In typical REST, all the URIs always identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is about that thing. The thing itself doesn’t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn’t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing web pages rather than newspaper pages.
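Here is a small Python sketch of the kind of trouble I take Richard to mean, using plain tuples as triples rather than a real RDF library. The URIs and dates are hypothetical, invented only for this example. With a single URI, a creation date for the newspaper page and a creation date for the web page land on the same subject; with two URIs they stay apart.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Real vocabulary terms (Dublin Core and FOAF).
DCTERMS_CREATED = "http://purl.org/dc/terms/created"
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"

# Hypothetical URIs and dates, made up for illustration.
PAGE = "http://example.org/id/page/1"   # the physical newspaper page
DOC = "http://example.org/doc/page/1"   # the web page about it

# One URI for everything: both creation dates attach to the same subject.
one_uri: List[Triple] = [
    (DOC, DCTERMS_CREATED, "1905-03-14"),  # when the newspaper page was printed
    (DOC, DCTERMS_CREATED, "2009-06-01"),  # when the web page was published
]

# Two URIs: each statement is about exactly one thing, and the web page
# says which thing it is about.
two_uris: List[Triple] = [
    (PAGE, DCTERMS_CREATED, "1905-03-14"),
    (DOC, DCTERMS_CREATED, "2009-06-01"),
    (DOC, FOAF_PRIMARY_TOPIC, PAGE),
]


def ambiguous_subjects(triples: Iterable[Triple], predicate: str) -> List[str]:
    """Return subjects that have more than one distinct value for predicate."""
    values = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return sorted(s for s, objs in values.items() if len(objs) > 1)
```

Calling ambiguous_subjects on one_uri flags the shared URI, because a consumer cannot tell which date belongs to which thing, while two_uris comes back clean.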

I hope I haven’t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard’s points (which perhaps he has made elsewhere) so that I could revisit them, without having to dig through my email archive … Comments welcome!