purl2

It’s great to see that OCLC is going to work with Zepheira on a new version of the PURL service, and that it’s going to be released under an Apache license. Beyond addressing scalability issues, it sounds like Zepheira is going to build in support for resources that are outside of the information space of the web:

The new PURL software will also be updated to reflect the current understanding of Web architecture as defined by the World Wide Web Consortium (W3C). This new software will provide the ability to permanently identify networked information resources, such as Web documents, as well as non-networked resources such as people, organizations, concepts and scientific data. This capability will represent an important step forward in the adoption of a machine-processable “Web of data” enabled by the Semantic Web.

Since Eric Miller helped start up Zepheira, it’s not surprising that purl2 will take this on. As part of some experiments I’ve been doing with SKOS and serving up concepts over HTTP, it has become clear that even a minimal bit of infrastructure for managing these identifiers would be useful. I can definitely see the need for a general solution that helps manage identifiers for people, organizations, concepts, etc., and that also fits into how HTTP should/could serve up the resources associated with them.

via Thom Hickey


ruby-zoom v0.3.0

Thanks to some prodding from William Denton and Jason Ronallo, and the kindness of Laurent Sansonetti, I’ve been added as a developer to the ruby-zoom project, which provides a Ruby wrapper for the yaz Z39.50 library. I essentially wanted to remove some unused code from the project that was interfering with the ruby-marc gem … and I also wanted to create a gem for ruby-zoom. This was the first time I’ve tried packaging up a C wrapper as a gem, and it was remarkably smooth. I also added a test suite and a Rakefile. So, assuming you have yaz installed, you can install ruby-zoom with:

% gem install zoom

I’ll admit I’m no huge fan of Z39.50, but the fact remains that it’s pretty much the most widely deployed machine API for getting at bibliographic data locked up in online catalogs. It’s really nice to see forward-thinking systems like Talis, Evergreen and Koha, which have (or at least have experimented with) OpenSearch implementations.


Angela's dilemma

If you are interested in practical ways to garden in the emerging web-of-data, take a look at this draft finding that folks in the W3C Technical Architecture Group are considering. Or, for a different expression of the same idea, look at Cool URIs for the Semantic Web.

These two documents describe a simple use of HTTP and URLs to identify resources that are outside of the information space of the web. Yes, you read that right: resources that are outside the information space of the web. Why would I want to use URLs to address resources that aren’t on the web!? The finding illustrates this subtlety using Angela’s dilemma:

Angela is creating an OWL ontology that defines specific characteristics of devices used to access the Web. Some of these characteristics represent physical properties of the device, such as its length, width and weight. As a result, the ontology includes concepts such as unit of measure, and specific instances, such as meter and kilogram. Angela uses URIs to identify these concepts.

Having chosen a URI for the concept of the meter, Angela faces the question of what should be returned if that URI is ever dereferenced. There is general advice that owners of URIs should provide representations [AWWW] and Angela is keen to comply. However, the choices of possible representations appear legion. Given that the URI is being used in the context of an OWL ontology, Angela first considers a representation that consists of some RDF triples that allow suitable computer systems to discover more information about the meter. She then worries that these might be less useful to a human user, who might prefer the appropriate Wikipedia entry. Perhaps, she reasons, a better approach would be to create a representation which itself contains a set of URIs to a range of resources that provide related representations. Perhaps content negotiation can help? She could return different representations based on the content type specified in the request.

Angela’s dilemma is, of course, based on the fact that none of the representations she is considering are actually representations of the units of measure themselves. Even if the Web could deliver a platinum-iridium bar with two marks a meter apart at zero degrees celsius, or 1,650,763.73 wavelengths of the orange-red emission line in the electromagnetic spectrum of the krypton-86 atom in a vacuum, or even two marks, a meter apart on a screen, such representations are probably less than completely useful in the context of an information space. The representations that Angela is considering are not representations of the meter itself. Instead, they are representations of information resources related to the meter.

It is not appropriate for any of the individual representations that Angela is considering to be returned by dereferencing the URI that identifies the concept of the meter. Not only do the representations she is considering fail to represent the concept of the meter, they each have a different essence and so they should each have their own URI. As a consequence, it would also be inappropriate to use content negotiation as a way to provide them as alternate representations when the URI for the concept of the meter is dereferenced.

So, assuming we’re agreed about the problem, what’s the solution? Basically, you can use content negotiation and a 303 See Other HTTP status code to redirect to the appropriate resource. For an example of the basic idea in action, fire up curl and take a look at how this Semantic MediaWiki instance responds to a GET request:

%  curl --head http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:03:12 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Ruby
Content-Type: text/html; charset=UTF-8

Nothing too surprising there: we just got redirected to another URL that serves up some friendly HTML describing the Ruby programming language. But send along an extra Accept header:

% curl --head --header 'Accept: application/rdf+xml' \
    http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:04:36 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Special:ExportRDF/Ruby
Content-Type: text/html; charset=UTF-8

Notice how you are redirected to another URL, one that sends RDF/XML describing Ruby down the pipe? Ruby on Rails and other frameworks have good REST support built in for doing content negotiation to provide multiple representations of a single information resource. But the use of the 303 See Other here is a subtle new twist to accommodate the fact that the resource in question isn’t really a canonical set of bits on disk somewhere. The good news is that your browser will still display the human-readable resource when you visit http://ontoworld.org/wiki/Special:URIResolver/Ruby
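
To see what this looks like from the server side, here is a minimal sketch in Python. To be clear, this is not Semantic MediaWiki’s actual implementation, and the /id/ and /doc/ paths and the port are made up. Dereferencing the URI for a non-information resource gets you a 303 See Other pointing at either a human-readable or a machine-readable description, depending on the Accept header:

from http.server import BaseHTTPRequestHandler, HTTPServer

class URIResolver(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/id/meter':
            # content negotiation: pick an appropriate description document
            if 'application/rdf+xml' in self.headers.get('Accept', ''):
                location = '/doc/meter.rdf'   # RDF for machines
            else:
                location = '/doc/meter.html'  # friendly HTML for humans
            self.send_response(303)           # See Other
            self.send_header('Location', location)
            self.end_headers()
        else:
            self.send_error(404)

    # curl --head sends a HEAD request, so answer it the same way
    do_HEAD = do_GET

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), URIResolver).serve_forever()

With that running, curl --head http://localhost:8000/id/meter with and without the Accept header should get you the same pair of 303 responses shown above.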

Some folks would argue that resources outside the web don’t deserve URLs, and should instead be identified with URIs, like info URIs, that are not required to resolve. My personal feeling is that info URIs do have a great deal of use in the enterprise (where they are most likely resolvable). But in situations like Angela’s, where she is creating a public RDF document that needs to refer to concepts like “length” and “meter”, I think it makes sense for those concepts to resolve to appropriate representations that guide their use. Or, as the Architecture of the World Wide Web puts it:

A URI owner may supply zero or more authoritative representations of the resource identified by that URI. There is a benefit to the community in providing representations. A URI owner SHOULD provide representations of the resource it identifies.

It’ll be interesting to see how these issues shake out as more and more structured data is made available on the web.


the weight of legacy data

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8, and have used MARC::Charset >= v0.8 to do it, you may very well have null characters (0x00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.
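
If you want to check whether a batch of records you converted is affected, a quick scan for null bytes will tell you. Here’s a sketch in Python (the filename is hypothetical); a valid file of UTF-8 MARC records should contain no 0x00 bytes at all:

# count stray null bytes left behind by the mapping bug
with open('converted.mrc', 'rb') as fh:
    nulls = fh.read().count(b'\x00')
print('%d null byte(s) found' % nulls)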

v0.97 now uses the alternate value when the UCS value is not available…which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies, and hear my plea for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news, but it feels good to be saying it. If you’re not convinced, read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried-and-true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data-sharing libraries, but it’s time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.



rda/frbr and the semantic web

I was interested to learn from Alistair Miles that folks in the library community are starting to look at expressing models such as RDA and FRBR using the semantic web technology stack, including things like the Dublin Core Abstract Model (DCAM).

It’s exciting and timely to see the luminaries in the field get together to talk about this sort of convergence. I must admit I’m still a little bit hazy about why we need DCAM when we already have RDF, RDFS and OWL … but I think only good things can come from this interaction.

It’s particularly heartening that the library community is exploring what RDA and FRBR look like when the rubber meets the road of data representation, although Ian Davis, Richard Newman and Bruce D’Arcus have arguably already done this for FRBR.

Update: official announcement from the British Library.


late easter present

I finally took the time to make pymarc setuptools friendly. This basically means that if you’ve got easy_install handy, you can:

sudo easy_install pymarc
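
Once it’s installed, a quick smoke test is to read a batch of records and print a field from each. Here’s a minimal sketch (the filename is hypothetical):

from pymarc import MARCReader

# read a file of MARC transmission format records and print the
# title statement (field 245) from each one
with open('marc.dat', 'rb') as fh:
    for record in MARCReader(fh):
        print(record['245'])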

If you haven’t looked at eggs yet, they are pretty much the de facto standard for distributing Python code. PyPI (the Python Package Index, aka the Python Cheese Shop) allows easy_install to locate and download packages, which are then unpacked and installed.
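
For the curious, making a package setuptools friendly doesn’t amount to much more than writing a setup.py along these lines (a sketch, not pymarc’s actual file; the version and description are made up):

# a bare-bones setuptools setup.py; running "python setup.py bdist_egg"
# builds an egg from it
from setuptools import setup, find_packages

setup(
    name='pymarc',
    version='1.0',
    description='read, write and modify MARC bibliographic data',
    packages=find_packages(),
)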

pymarc was basically an experiment to make sure I understood how eggs worked with PyPI. Next up: Rob Sanderson has sent me some code he and a colleague wrote for parsing Library of Congress Classification numbers, which I’m going to bundle up as an egg as well. Stay tuned.


nekkid

Yeah, today is CSS Naked Day. I just hope I remember to re-enable CSS tomorrow :-)


theory

The second book I checked out of the Library of Congress with my shiny new borrowing card was Alistair Cockburn’s Agile Software Development: The Cooperative Game (which happened to just win this year’s Jolt Award). Early on, Cockburn recommends jumping to an appendix to read Peter Naur’s article “Programming as Theory Building” (thanks ksclarke).

This is my second time reading the article, but this time it is really resonating with me: the idea of writing programs as building theories. Partly I think this is because I read it while attending a recent Haskell tutorial by coworker Adam Turoff here in DC (which I will write about shortly).

On the ride to work this morning a particular quote stood out, and I’m just writing it here so I don’t forget it:

… the problems of program modification arise from acting on the assumption that programming consists of program text production, instead of recognizing programming as an activity of theory building.

It seems obvious at first, I guess. But it’s a powerful statement about what the activity of software development ought to be, instead of a string of hacks that eventually brings a piece of software to its knees.


US open access petition

As announced on the jisc-repositories list, there is now a US counterpart to the EU Petition calling for Open Access.

We, the undersigned, believe that broad dissemination of research results is fundamental to the advancement of knowledge. For America’s taxpayers to obtain an optimal return on their investment in science, publicly funded research must be shared as broadly as possible. Yet too often, research results are not available to researchers, scientists, or the members of the public. Today, the Internet and digital technologies give us a powerful means of addressing this problem by removing access barriers and enabling new, expanded, and accelerated uses of research findings.

The petition was put together by the Alliance for Taxpayer Access in response to the 28,000-odd enlightened folks who signed the EU petition. I was encouraged to see prominent sponsor icons for the American Library Association and the Association of College & Research Libraries on the US petition.

I haven’t been tracking the Open Access movement as well as I should have–but I did take a few seconds while drinking coffee at the breakfast table this morning to sign the petition. The movement seems to be really making a lot of progress recently.

Via a bit of synchronicity, Caroline Arms sent a message around at $work about the recent Emerging Libraries conference at Rice. Apparently Brewster Kahle and Paul Ginsparg had a meeting of like minds. I guess it’s not surprising, considering their roles in bringing libraries and archives into the computing age with The Internet Archive and arXiv. What is surprising is that it took this long. These two projects are wildly successful, living and breathing examples of Open Access.

The audio for all the conference presentations is available from Rice…including the very listenable Universal Access to Human Knowledge (Kahle) and Read as We May (Ginsparg).