Imagine you were minting close to a million URIs for historic newspaper pages such as:
http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/
The web page allows you to zoom in quite close and see lots of detail in the page.
Now let's say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether or not I can hang my description of the page off the above URI, for example:
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
Or if I need to mint a new URI for the page as a real world thing:
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page> dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
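One practical difference between the two candidate subjects is that a client never sends the fragment to the server, so the #page URI and the document URI it hangs off are fetched with the same HTTP request. A minimal sketch of that, using only urldefrag from Python's standard library (the variable names are mine):

```python
from urllib.parse import urldefrag

# The hash URI from the example above, proposed for the page as a real world thing.
page_uri = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"

# urldefrag splits a URI into the part a client requests over HTTP
# and the fragment, which stays on the client side.
document_uri, fragment = urldefrag(page_uri)

print(document_uri)  # http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1
print(fragment)      # page
```

So a hash URI gives the page its own name in RDF without requiring any change to how the server answers requests.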
AWWW 1 (Architecture of the World Wide Web, Volume One) provides some guidance:
By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”
This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.
Can all of the essential characteristics of this newspaper page be sent down the wire as a message to a client? The text of the page is pretty legible after zooming in and you can see pictures, headlines, etc. You can’t feel the texture of the page itself, but you can’t in the microfilm that the page images were generated from. So I’m inclined to say yes.
Cool URIs for the Semantic Web also has some advice:
It is important to understand that using URIs, it is possible to identify both a thing (which may exist outside of the Web) and a Web document describing the thing. For example the person Alice is described on her homepage. Bob may not like the look of the homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for the homepage or a RDF document describing Alice. The question is where to draw the line between the case where either is possible and the case where only descriptions are available.
According to W3C guidelines ([AWWW], section 2.2.), we have a Web document (there called information resource) if all its essential characteristics can be conveyed in a message. Examples are a Web page, an image or a product catalog.
In HTTP, because a 200 response code should be sent when a Web document has been accessed, but a different setup is needed when publishing URIs that are meant to identify entities which are not Web documents.
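The "different setup" mentioned in that last passage is usually a 303 redirect from the URI of the thing to a document about it. Here is a rough sketch of the distinction as a plain Python function. The "/thing" suffix is a hypothetical URI layout I made up for illustration; it is not something Chronicling America uses:

```python
# Hypothetical URI layout, for illustration only: paths ending in "/thing"
# identify non-document entities, everything else is a Web document.
def respond(path):
    """Return (status, headers) for a GET request on `path`."""
    if path.endswith("/thing"):
        # Not a Web document: point the client at a document describing it.
        document_path = path[: -len("/thing")] + "/"
        return 303, {"Location": document_path}
    # A Web document: serve a representation directly.
    return 200, {"Content-Type": "text/html"}


print(respond("/lccn/sn85066387/1898-01-01/ed-1/seq-1/thing"))
print(respond("/lccn/sn85066387/1898-01-01/ed-1/seq-1/"))
```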
This makes me think that I will need distinct identifiers for the abstract notion of the Newspaper Page, and the HTML document itself, if it is important to describe them separately. Say for example if I wanted to say the publisher of the web page was the Library of Congress, but the publisher of the Newspaper Page was Charles M. Shortridge. If I don’t have distinct identifiers I will have to say:
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> dc:publisher <http://loc.gov>, <http://www.joincalifornia.com/candidate/12338> .
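To make the problem concrete, here is a small sketch that models the triples as plain Python tuples (no RDF library, and the helper name publishers is my own). With one shared identifier a consumer gets both publishers back and cannot tell which one published what. With the hash URI from earlier as a second identifier, the two statements separate cleanly:

```python
PAGE_DOC = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/"
PAGE_THING = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"
LOC = "http://loc.gov"
SHORTRIDGE = "http://www.joincalifornia.com/candidate/12338"

# One identifier for both the HTML document and the Newspaper Page.
shared = {
    (PAGE_DOC, "dc:publisher", LOC),
    (PAGE_DOC, "dc:publisher", SHORTRIDGE),
}

# Distinct identifiers: the document URI and the hash URI for the page.
distinct = {
    (PAGE_DOC, "dc:publisher", LOC),
    (PAGE_THING, "dc:publisher", SHORTRIDGE),
}


def publishers(triples, subject):
    """Return the sorted dc:publisher objects for a subject."""
    return sorted(o for s, p, o in triples if s == subject and p == "dc:publisher")


print(publishers(shared, PAGE_DOC))      # both publishers, indistinguishable
print(publishers(distinct, PAGE_DOC))    # only the Library of Congress
print(publishers(distinct, PAGE_THING))  # only Charles M. Shortridge
```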
Pondering this Information Resource Sniff-Test got me re-reading Xiaoshu Wang’s paper URI Identity and Web Architecture Revisited. And I’ve come away more convinced that maybe he’s right: that the real issue lies in my vocabulary usage (dc:publisher in this example), and not with whether my URI identifies an Information Resource or not. So maybe new vocabulary is needed in order to describe the representation?
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> web:repPublisher <http://loc.gov> ; dcterms:publisher <http://www.joincalifornia.com/candidate/12338> .
But there isn’t a community of practice behind Xiaoshu’s position, at least not one like the Linked Data community. Unless perhaps his position is closer to the REST community which is going strong at the moment, especially in AtomPub circles. Members of the linked-data/semweb community would most likely say that there needs to be either hash or 303'ing URIs for the Newspaper Page, distinct from the URIs for the document describing the Newspaper Page. As a latecomer to the httpRange-14 debate I don’t think I ever internalized how REST and the Semantic Web are slightly out of tune w/ each other regarding resources on the web.
So. Should I have two different URIs: one for the real-world Newspaper Page, and one for the HTML document that describes that page? Is the Newspaper Page an Information Resource? Am I muddling up something here? Am I thinking too much? Should I just let sleeping dogs lie? Your opinion, advice, therapy would be greatly appreciated.

15 Comments
I have to say, and I guess I’ve said before, that I’ve always found httpRange-14 and the “real world” vs “information” thing to be somewhat confusing, and a pretty hacky “solution” that seems to make everything consistent but really just confuses as much as it answers.
I think you should do whatever seems to be simplest and easiest while still supporting all the use cases you can think of fairly well. It seems like you have several options that support most of your use cases fairly well; in which case to me it comes down to whatever is simplest, easiest to implement, easiest to understand. Rather than whatever is most abstractly theoretically sound according to httpRange-14.
We’re all working this stuff out in practice, right? So you’ll try something, and people will learn from it. Now, it would still be unfortunate if you tried something without understanding what was going on and what other people were trying, but I really believe this isn’t a done-deal, answered question; smart, intelligent, reasonable people can disagree on the best way to do it. I’ve never completely bought into the httpRange-14 stuff. Now, perhaps that’s because I still don’t completely understand it, but as we frequently say when criticizing overly complex library standards, if someone who’s motivated and reasonably clever can’t understand something even after spending some time trying, that doesn’t say good things about its eventual widespread adoption.
Note that this example indicates there are 2 text/html (one supposedly plain text but not really), 1 PDF, and 1 JP2 representation available. Even though these are separate resources with different URIs, they are still representations of some one thing. This something is the Real World Object and it deserves a URI of its own. Ideally, in this case, the RWO URI will do a 303 redirect to a Generic Document from which clients can negotiate for the representation of their choice. Some of the triples you refer to such as the publisher of the newspaper need to use this RWO URI as the subject which may be different from the publisher of the representation.
I am still confused by all of the server side machinations for using URIs to describe parts of XML documents of any sort. I was under the impression that anchors and ids within a document would allow for finding a portion of a document. XLink and especially XPointer provide fine grained exposure of an XML document. Then extra features, either server or client side, based on those standards can be built, allowing for gentle degrading to the link to the whole document. Fewer HTTP error messages, I assume, easier caching especially when supported with static documents that do not depend on server side processing and …
And the result is still a URI (e.g. http://example.com/document1#part1_subsec5 )
Daniel Bennett
Jeff, yeah you have a point. If there were a URI for the page it should content-negotiate to all the representations. I still think it’s debatable based on the language in the AWWW whether this is a RWO or not though. Probably not worthy of much debate in the long run though.
On the subject of the text/plain representation … why do you say http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/ocr.txt isn’t text/plain?
inkdroid, don’t worry, be happy; you work for the Library of Congress, not the US Mint.
When I make an assertion about something digital you make available, I am the one who chooses the URI to use. I can be guided by your example- but I am free to reject your guidance. If I want to make an assertion about the representation you deliver distinct from the RWO, it’s up to me to be clear what my subject is, and it’s not your job to anticipate all my identifier needs.
gluejar, that makes me rest a bit easier…
But still: if I am publishing RDF assertions about the things I publish then I do need to think about the identifiers I use eh?
Topic Maps gets this right: it distinguishes *in the links* between the use of URIs as subject indicators and their use as information resources.
I can tell you what we did in a similar situation with the London Gazette. We gave the notices (and issues, and editions) identifier URLs which 303 to an abstract document URL, which content negotiate to a number of different representation URLs.
I think that there is a difference between “page 1 of edition 1 of The Call dated 1st Jan 1898” and “a web page that provides information about page 1 of edition 1 of The Call dated 1st Jan 1898”. The two items have different publishers and creation dates, for example. Therefore I’d give them separate URIs. If someone requests “page 1 of edition 1 of The Call dated 1st Jan 1898” you redirect them to the “web page about …”.
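The chain described in this comment (identifier URL, 303 to an abstract document URL, content negotiation to a representation URL) can be sketched roughly as follows. The /id/ and /doc/ prefixes, the media-type table and the naive Accept parsing are all assumptions for illustration; this is not how the London Gazette is actually implemented:

```python
# Hypothetical representation table: media type -> URL suffix.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/pdf": ".pdf",
}


def resolve(path, accept="text/html"):
    """Return (status, headers) for one step of the chain."""
    if path.startswith("/id/"):
        # Identifier URL for the thing itself: 303 to the abstract document.
        return 303, {"Location": "/doc/" + path[len("/id/"):]}
    if path.startswith("/doc/"):
        # Abstract document: pick the first listed media type we can serve.
        # Naive parsing: q-values and wildcards are ignored.
        for item in accept.split(","):
            media_type = item.split(";")[0].strip()
            if media_type in REPRESENTATIONS:
                headers = {
                    "Content-Location": path + REPRESENTATIONS[media_type],
                    "Content-Type": media_type,
                    "Vary": "Accept",
                }
                return 200, headers
        return 406, {}
    return 404, {}


print(resolve("/id/notice/1"))
print(resolve("/doc/notice/1", accept="application/rdf+xml, text/html;q=0.5"))
```

A real server would honour q-values and wildcards in the Accept header; the sketch only shows where the redirect and the negotiation each happen.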
Ed,
Look closer at the “View: Text” link at http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/. It leads to a text/html representation.
If you plan to buy into the Linked Data principles, you should be aware that Topic Maps seem to have a competing philosophy. This blog entry may help you decide which URIs are appropriate for use in the RDF you produce: http://q6.oclc.org/2009/05/linked_datahttp.html.
Jeff
Jeff, yes … but I thought you were talking about the link rel="alternate" type="text/plain" in the html … which really does lead you to a text/plain representation.
I am already aware of Topic Maps yes, but thanks for the link to your blog post. I am interested in exploring how linked data works in practice primarily at the moment.
John: interesting, so would it be appropriate to say Topic Maps are more in line with Xiaoshu’s point about the confusion being a vocabulary issue rather than an identity issue? Will need to review Topic Maps again.
Ed: Regarding the vocab issue, you might be interested in this paper that was presented at LDOW2009 (where I presented on ORE):
An Ontology of Resources for Linked Data (Harry Halpin, Valentina Presutti) – http://events.linkeddata.org/ldow2009/papers/ldow2009_paper19.pdf
@hvdsomp thanks so much, I had seen your ldow2009 ore paper, but not this one on resources!
I’ve been really pleased to see you and the rest of the OAI-ORE folks cross-fertilizing the digital-library/repository and linked-data/semweb crowds.
@ed thanks for your kind words.
regarding your above question, I was wondering whether you had actually considered an ORE approach? I see the following resources:
(*) Splash page that gives access to scanned image of newspaper – document resource
(*) Scanned image – one or more document resources depending on whether you give each format the same (conneg) or different (non conneg) URI
(*) Analog newspaper – non-document resource
The above 3 could be aggregated in an ORE Aggregation, itself a non-document resource. Now the URI of the ORE Aggregation is the one to ship around ;-)
Anyhow, this is interesting because at a workshop a few months ago at the National Library of Sweden a related issue came up: the library there has the need to glue together the analog newspaper (issue of a day) and all its related digital products, such as the newspaper’s website pages, the blog, the blog’s comments, the videos, etc. In that context, ORE was also mentioned as a possible solution.
@herbert yes, in fact we were playing around w/ ore. I just sent a message to the oai-ore discussion list about the use of the oai-ore vocabulary in the linked data views at chronicling america.