Imagine you were minting close to a million URIs for historic newspaper pages such as:

http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/

The web page at that URI lets you zoom in quite close and see lots of detail in the page.

Now let's say I want to describe this Newspaper Page in RDF. I need to decide which subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether I can hang my description of the page off the above URI, for example:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .

Or whether I need to mint a new URI for the page as a real world thing:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page> 
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
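
One thing worth noticing about the hash variant: a client drops the fragment before it makes an HTTP request, so the #page URI never goes over the wire by itself. Here is a minimal sketch using only Python's standard library; the variable names are mine, purely for illustration.

```python
from urllib.parse import urldefrag

# The hash URI from the example above, naming the page as a real world thing.
page_uri = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"

# urldefrag splits a URI into the part a client would request and the fragment.
document_uri, fragment = urldefrag(page_uri)

print(document_uri)  # what actually gets requested over HTTP
print(fragment)      # the part that stays on the client side
```

So the hash URI gives me a second name for the page without requiring a second retrievable document.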

AWWW 1 provides some guidance:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

Can all of the essential characteristics of this newspaper page be sent down the wire as a message to a client? The text of the page is pretty legible after zooming in and you can see pictures, headlines, etc. You can’t feel the texture of the page itself, but you can’t feel that in the microfilm that the page images were generated from either. So I’m inclined to say yes.

Cool URIs for the Semantic Web also has some advice:

It is important to understand that using URIs, it is possible to identify both a thing (which may exist outside of the Web) and a Web document describing the thing. For example the person Alice is described on her homepage. Bob may not like the look of the homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for the homepage or a RDF document describing Alice. The question is where to draw the line between the case where either is possible and the case where only descriptions are available.

According to W3C guidelines ([AWWW], section 2.2.), we have a Web document (there called information resource) if all its essential characteristics can be conveyed in a message. Examples are a Web page, an image or a product catalog.

In HTTP, because a 200 response code should be sent when a Web document has been accessed, but a different setup is needed when publishing URIs that are meant to identify entities which are not Web documents.
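
As I read it, the "different setup" for things that are not Web documents is usually a 303 See Other redirect to a document that describes the thing. Here is a rough sketch of that decision as a plain Python function. The /id/ path and the DESCRIBED_BY table are hypothetical, made up for illustration; they are not real Chronicling America URLs.

```python
# Hypothetical mapping from a URI path for the page-as-thing to the path of the
# document describing it. The "/id/" prefix is an assumption for illustration.
DESCRIBED_BY = {
    "/id/lccn/sn85066387/1898-01-01/ed-1/seq-1": "/lccn/sn85066387/1898-01-01/ed-1/seq-1/",
}

def respond(path):
    """Return a (status, headers) pair for a request path."""
    if path in DESCRIBED_BY:
        # Not a Web document: point the client at a description with 303 See Other.
        return 303, {"Location": DESCRIBED_BY[path]}
    if path in DESCRIBED_BY.values():
        # A Web document: answer directly with 200.
        return 200, {"Content-Type": "text/html"}
    return 404, {}

print(respond("/id/lccn/sn85066387/1898-01-01/ed-1/seq-1"))
print(respond("/lccn/sn85066387/1898-01-01/ed-1/seq-1/"))
```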

This makes me think that I will need distinct identifiers for the abstract notion of the Newspaper Page and for the HTML document itself, if it is important to describe them separately. Say, for example, that I wanted to state that the publisher of the web page was the Library of Congress, but the publisher of the Newspaper Page was Charles M. Shortridge. If I don’t have distinct identifiers I will have to say:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  dc:publisher <http://loc.gov>, 
  <http://www.joincalifornia.com/candidate/12338> 
  .
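
To make the ambiguity concrete, here is a small sketch that treats the description as plain (subject, predicate, object) tuples in Python, no RDF library involved. The constant and function names are mine, and the #page URI is the hash variant from earlier in this post.

```python
DOC = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/"
PAGE = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"
LOC = "http://loc.gov"
# The URI used above for Charles M. Shortridge.
SHORTRIDGE = "http://www.joincalifornia.com/candidate/12338"

def publishers(triples, subject):
    """Return the sorted dc:publisher values asserted about one subject."""
    return sorted(o for s, p, o in triples if s == subject and p == "dc:publisher")

# One URI for both the web page and the Newspaper Page.
conflated = [(DOC, "dc:publisher", LOC), (DOC, "dc:publisher", SHORTRIDGE)]

# Distinct URIs for the document and the thing it describes.
separate = [(DOC, "dc:publisher", LOC), (PAGE, "dc:publisher", SHORTRIDGE)]

print(publishers(conflated, DOC))   # both publishers, no way to tell which is which
print(publishers(separate, DOC))    # only the publisher of the web page
print(publishers(separate, PAGE))   # only the publisher of the Newspaper Page
```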

Pondering this Information Resource Sniff-Test got me re-reading Xiaoshu Wang’s paper URI Identity and Web Architecture Revisited. And I’ve come away more convinced that maybe he’s right: that the real issue lies in my vocabulary usage (dc:publisher in this example), and not in whether my URI identifies an Information Resource or not. So maybe new vocabulary is needed in order to describe the representation?

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  web:repPublisher <http://loc.gov> ;
  dcterms:publisher <http://www.joincalifornia.com/candidate/12338> 
  .

But there isn’t a community of practice behind Xiaoshu’s position, at least not one like the Linked Data community, unless perhaps his position is closer to that of the REST community, which is going strong at the moment, especially in AtomPub circles. Members of the linked-data/semweb community would most likely say that there need to be either hash or 303’ing URIs for the Newspaper Page, distinct from the URIs for the document describing the Newspaper Page. As a latecomer to the httpRange-14 debate I don’t think I ever internalized how REST and the Semantic Web are slightly out of tune with each other regarding resources on the web.

So. Should I have two different URIs: one for the real-world Newspaper Page, and one for the HTML document that describes that page? Is the Newspaper Page an Information Resource? Am I muddling up something here? Am I thinking too much? Should I just let sleeping dogs lie? Your opinion, advice, therapy would be greatly appreciated.