Archive for May, 2007

Angela’s dilemma

Thursday, May 31st, 2007

If you are interested in practical ways to garden in the emerging web-of-data, take a look at this draft finding that folks in the W3C Technical Architecture Group are considering. Or, for a different expression of the same idea, look at Cool URIs for the Semantic Web.

These two documents describe a simple use of HTTP and URLs to identify resources that are outside of the information space of the web. Yes, you read that right: resources that are outside the information space of the web. Why would I want to use URLs to address resources that aren’t on the web!? The finding illustrates this subtlety using Angela’s dilemma:

Angela is creating an OWL ontology that defines specific characteristics of devices used to access the Web. Some of these characteristics represent physical properties of the device, such as its length, width and weight. As a result, the ontology includes concepts such as unit of measure, and specific instances, such as meter and kilogram. Angela uses URIs to identify these concepts.

Having chosen a URI for the concept of the meter, Angela faces the question of what should be returned if that URI is ever dereferenced. There is general advice that owners of URIs should provide representations [AWWW] and Angela is keen to comply. However, the choices of possible representations appear legion. Given that the URI is being used in the context of an OWL ontology, Angela first considers a representation that consists of some RDF triples that allow suitable computer systems to discover more information about the meter. She then worries that these might be less useful to a human user, who might prefer the appropriate Wikipedia entry. Perhaps, she reasons, a better approach would be to create a representation which itself contains a set of URIs to a range of resources that provide related representations. Perhaps content negotiation can help? She could return different representations based on the content type specified in the request.

Angela’s dilemma is, of course, based on the fact that none of the representations she is considering are actually representations of the units of measure themselves. Even if the Web could deliver a platinum-iridium bar with two marks a meter apart at zero degrees celsius, or 1,650,763.73 wavelengths of the orange-red emission line in the electromagnetic spectrum of the krypton-86 atom in a vacuum, or even two marks, a meter apart on a screen, such representations are probably less than completely useful in the context of an information space. The representations that Angela is considering are not representations of the meter itself. Instead, they are representations of information resources related to the meter.

It is not appropriate for any of the individual representations that Angela is considering to be returned by dereferencing the URI that identifies the concept of the meter. Not only do the representations she is considering fail to represent the concept of the meter, they each have a different essence and so they should each have their own URI. As a consequence, it would also be inappropriate to use content negotiation as a way to provide them as alternate representations when the URI for the concept of the meter is dereferenced.

So, assuming we are agreed about the problem, what’s the solution? Basically, you can use content negotiation and a 303 See Other HTTP status code to redirect to the appropriate resource. For an example of the basic idea in action, fire up curl and take a look at how this instance of the SemanticMediaWiki responds to a GET request:

%  curl --head http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:03:12 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Ruby
Content-Type: text/html; charset=UTF-8

Nothing too surprising there: we just got redirected to another URL that serves up some friendly HTML describing the Ruby programming language. But send along an extra Accept header:

% curl --head --header 'Accept: application/rdf+xml' http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:04:36 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Special:ExportRDF/Ruby
Content-Type: text/html; charset=UTF-8

Notice how this time you are redirected to a different URL, one that sends RDF/XML describing Ruby down the pipe? Ruby on Rails and other frameworks have good REST support built in for doing content negotiation to provide multiple representations of a single information resource. But the use of the 303 See Other here is a new, subtle twist to accommodate the fact that the resource in question isn’t really a canonical set of bits on disk somewhere. The good news is that your browser will display the human-readable resource when you visit http://ontoworld.org/wiki/Special:URIResolver/Ruby
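To make the two transcripts above a bit more concrete, here is a rough Python sketch of the decision a server might make when the concept URI is dereferenced. This is only my guess at the logic, not how SemanticMediaWiki actually implements it: the names parse_accept and see_other are made up for illustration, and the two target URLs are simply copied from the Location headers above.

```python
# Redirect targets copied from the Location headers in the curl transcripts.
HTML_URL = "http://ontoworld.org/wiki/Ruby"
RDF_URL = "http://ontoworld.org/wiki/Special:ExportRDF/Ruby"


def parse_accept(header):
    """Return the acceptable media types in an Accept header, best first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            # Sort by descending quality, then by order of appearance.
            entries.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def see_other(accept_header):
    """Return (status, headers) for a request on the concept URI.

    The concept itself has no representation, so the answer is always a
    303 See Other pointing at an information resource about the concept.
    """
    location = HTML_URL  # assumed default for browsers and vague clients
    for media_type in parse_accept(accept_header or ""):
        if media_type == "application/rdf+xml":
            location = RDF_URL
            break
        if media_type in ("text/html", "application/xhtml+xml"):
            break
    return 303, {"Location": location}


if __name__ == "__main__":
    print(see_other(None))
    print(see_other("application/rdf+xml"))
```

The point to notice is that the status code never varies. Content negotiation only picks which information resource the 303 points at, and each of those resources keeps its own URL.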

Some folks would argue that resources that are outside the web don’t deserve URLs and should instead be identified with URIs like info-uris that are not required to resolve. My personal feeling is that info-uris do have a great deal of use in the enterprise (where they are most likely resolvable). But in situations like Angela’s, where she is creating a public RDF document that needs to refer to concepts like “length” and “meter”, I think it makes sense that these concepts should resolve to appropriate representations that will guide appropriate usage. Or as the Architecture of the World Wide Web puts it:

A URI owner may supply zero or more authoritative representations of the resource identified by that URI. There is a benefit to the community in providing representations. A URI owner SHOULD provide representations of the resource it identifies

It’ll be interesting to see how these issues shake out as more and more structured data is made available on the web.

the weight of legacy data

Sunday, May 20th, 2007

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8 and have used MARC::Charset >= v0.8 to do it, you may very well have null characters (0x00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.
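For the curious, here is a small Python sketch of the idea of falling back to the alternate value when a mapping entry has no UCS value, plus a helper for spotting stray nulls in data you have already converted. It is an illustration only, not the MARC::Charset code (which is Perl). The MAPPING table, its field names, and the function names are hypothetical, and the MARC-8 bytes and code points shown are my reading of the mapping file, so check them against the file before relying on them.

```python
# Hypothetical entries shaped like rows of the Library of Congress mapping
# file: MARC-8 byte -> optional UCS value and optional alternate value.
# The byte values and code points below are assumptions for illustration.
MAPPING = {
    0xEB: {"ucs": "FE20", "alt": None},  # ligature, first half
    0xEC: {"ucs": None, "alt": "FE21"},  # ligature, second half
    0xFA: {"ucs": "FE22", "alt": None},  # double tilde, first half
    0xFB: {"ucs": None, "alt": "FE23"},  # double tilde, second half
}


def to_unicode(marc8_byte):
    """Map one MARC-8 byte to a character, preferring UCS over alternate."""
    entry = MAPPING[marc8_byte]
    hex_value = entry["ucs"] or entry["alt"]
    if hex_value is None:
        # Fail loudly rather than emit a placeholder such as a null.
        raise ValueError("no mapping for MARC-8 byte 0x%02X" % marc8_byte)
    return chr(int(hex_value, 16))


def find_nulls(text):
    """Return the positions of null characters in already-converted text."""
    return [index for index, char in enumerate(text) if char == "\x00"]


if __name__ == "__main__":
    print(hex(ord(to_unicode(0xFB))))
    print(find_nulls("Tsa\x00r"))
```

Running something like find_nulls over existing UTF-8 records is a cheap way to see whether any of them picked up nulls during an earlier conversion.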

v0.97 now uses the alternate value when the UCS value is not available, which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies, and hear my plea for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news but it feels good to be saying it. If you’re not convinced, read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried/true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data-sharing libraries, but it’s time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.

miscellaneous talk

Thursday, May 17th, 2007

If, like me, you are reading Everything is Miscellaneous, then you might be interested in watching a talk David Weinberger did a few days ago at Google.

I only wish I had more time to ingest all the good content that comes in through the Google Tech Talks feed.

rda/frbr and the semantic web

Thursday, May 3rd, 2007

I was interested to learn from Alistair Miles that folks in the library community are starting to look at expressing models such as RDA and FRBR using the semantic web technology stack, including things like the Dublin Core Abstract Model.

It’s exciting and timely to see the luminaries in the field get together to talk about this sort of convergence. I must admit I’m still a little bit hazy about why we need DCAM when we already have RDF, RDFS and OWL … but I think only good things can come from this interaction.

It’s particularly heartening that the library community is exploring what RDA and FRBR look like when the rubber meets the road of data representation. Although Ian Davis, Richard Newman and Bruce D’Arcus have arguably already done this for FRBR.

Update: official announcement from the British Library.