info-uris and opening up library data

I had a few moments to read the info-uri spec during a short flight from DC to Chicago this past weekend. info-uri, aka RFC 4452, is a spec that lets you create URIs for identifiers in public namespaces.

So what does this mean in practice and why would you want to use one?

If you have a database of stuff you make available on the web, and you have ids for the stuff (say, a primary_key on a Stuff table), you essentially have an identifier in a public namespace. Go register the namespace!
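An info-uri is just the fixed info: scheme, a registered namespace, and an identifier within that namespace. So if you registered a (hypothetical) namespace called mystuff, row 1234 of your Stuff table would get a URI like:

info:mystuff/1234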

So, the LoC assigns identifiers called Library of Congress Control Numbers (LCCN) to each of its metadata records. Here’s the personal-name authority record (expressed as MADS) that allows works by Tim Berners-Lee to be grouped together:

<?xml version='1.0' encoding='UTF-8'?>
<madsCollection
    xmlns:xlink='http://www.w3.org/1999/xlink'
    xmlns='http://www.loc.gov/mods/v3'
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
    xsi:schemaLocation='http://www.loc.gov/mads
      http://www.loc.gov/standards/mads/mads.xsd'>
  <mads version='beta'>
    <authority>
      <name type='personal' authority='naf'>
        <namePart>Berners-Lee, Tim</namePart>
      </name>
      <titleInfo authority='naf'>
        <title/>
      </titleInfo>
    </authority>
    <variant type='other'>
      <name type='personal'>
        <namePart>Lee, Tim Berners-</namePart>
      </name>
      <titleInfo>
        <title/>
      </titleInfo>
    </variant>
    <variant type='other'>
      <name type='personal'>
        <namePart>Berners-Lee, Timothy J</namePart>
      </name>
      <titleInfo>
        <title/>
      </titleInfo>
    </variant>
    <note type='source'>The WWW virtual library Web site,
      Feb. 15, 1999 about the virtual library (Tim Berners-Lee; creator
      of the Web)</note>
    <note type='source'>OCLC, Feb. 12, 1999 (hdg.: Berners-Lee,
      Tim; usage: Tim Berners-Lee)</note>
    <note type='source'>Gaines, A. Tim Berners-Lee and the
      development of the World Wide Web, 2001: CIP galley
      (Timothy J. Berners-Lee; b. London, England, June 8, 1955)</note>
    <recordInfo>
      <identifier>no 99010609 </identifier>
      <recordContentSource authority='marcorg'>NBL</recordContentSource>
      <recordCreationDate encoding='marc'>990216</recordCreationDate>
      <recordChangeDate encoding='iso8601'>20010716094452.0</recordChangeDate>
      <recordIdentifier>1851704</recordIdentifier>
      <languageOfCataloging>
        <languageTerm authority='iso639-2b' type='code'>eng</languageTerm>
      </languageOfCataloging>
    </recordInfo>
  </mads>
</madsCollection>

In the mads/recordInfo/identifier element you can find the LCCN:

no 99010609

Which, with the LCCN normalized (the whitespace gets squeezed out), can be represented as an info-uri:

info:lccn/no99010609

Now why would you ever want to express an LCCN as an info-uri? The LoC has spent a lot of time and effort establishing these personal name and subject authorities. You might want to use a URI like info:lccn/no99010609 to identify Tim Berners-Lee as an individual in your data, so that other people will know who you are talking about and be able to interoperate with you. For example, you can now unambiguously say that Tim Berners-Lee created Weaving the Web:

<info:lccn/no99010609> <http://purl.org/dc/elements/1.1/creator>
<info:lccn/99027665> .

That was for you ksclarke :-) Pretty nifty, eh? Now what’s really cool is that while info-uris aren’t necessarily resolvable (by design), OCLC does have the Linked Authority File, which allows you to look up these records. So tbl’s record can be found here:

http://errol.oclc.org/laf/no99-10609.html

I imagine that this is part of the joint OCLC/LoC/Die Deutsche Bibliothek project to build a Virtual International Authority File…but I’m not totally sure. At any rate, there’s currently no way to drop an lccn info-uri in there and have it resolve to the XML–but that looks like an easy thing to add.

It feels like there is a real opportunity for libraries and archives to offer up their data to the larger web community. How can we make it easy for non-library folks to find and repurpose this data we’ve so assiduously collected over the years?

tbl is encouraging people to give themselves a URI…I wonder if he knew that he (and millions of others) already has one!

Addendum:

If you are interested, section 6 of the RFC details the subtle rationale behind why the authors chose to create a new URI scheme rather than:

  1. using an existing URI scheme
  2. creating a new URN namespace

In essence, they didn’t want to use an existing URI scheme because existing schemes all assume that you should be able to dereference the URI. You can see dereferencing in action when you click a link like http://www.yahoo.com: the magic of DNS lets your browser find Yahoo’s web server and talk to it on port 80 in a predictable way. info-uris are designed to be agnostic about whether or not the identifier can be dereferenced through a resolver of some kind.

Using URNs was thrown out since URNs are intended to persistently identify information resources, while info-uris are designed to identify persistent namespaces, not the resources themselves. Also, the process of establishing a URN namespace isn’t for the faint of heart, as evidenced by the short list of them. info-uris, by contrast, have a registrar who will expedite the process of registering a namespace, along with a framework for publishing validation/normalization rules. The registry is currently run by OCLCRLGBORG^w OCLC on behalf of NISO. So basically you don’t have to write an RFC to register your namespace.


building and ingesting

I prefer using an XML-generating mini-language (ElementTree, XML::Writer, REXML, Stan, etc.) to actually writing raw XML. It’s just too easy for me to forget or mistype an end tag, or to forget to encode strings properly–and I find all those inline strings (or even here-docs) make a mess of an otherwise pretty program.

Recently I wanted some code to write FOXML for ingesting digital objects into my Fedora test instance. I’m working in Ruby, so REXML seemed like the best place to start…but after I finished I ran across Builder. The Builder code turned out to be somewhat shorter, much more expressive, and consequently a bit easier to read (to my eyes). Here’s a quick example of how Builder’s API improves on REXML when writing this little chunk of XML:

<dc xmlns='http://purl.org/dc/elements/1.1/'>
  <title>Communication in the Presence of Noise</title>
</dc>

So here’s the REXML code:

dc = REXML::Element.new 'dc'
dc.add_attributes 'xmlns' => 'http://purl.org/dc/elements/1.1/'
title = REXML::Element.new 'title', dc
title.text = 'Communication in the Presence of Noise'

and the Builder code:

x = Builder::XmlMarkup.new
x.dc 'xmlns' => 'http://purl.org/dc/elements/1.1/' do
  x.title 'Communication in the Presence of Noise'
end

So both are four lines, but notice how the Builder::XmlMarkup object infers the name of the element from the message that is passed to it. Element attributes and content can be set when the element is created–something I wasn’t able to do with REXML. My favorite bit, though, is Builder’s use of blocks, so that the hierarchical structure of the code directly mirrors that of the XML content!
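Incidentally, getting the serialized XML back out is easy in both cases: REXML elements have to_s, and Builder has been accumulating the markup in a string all along, which you can retrieve with target!:

puts dc.to_s      # REXML
puts x.target!    # Builder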

So anyway, if you read this far you might actually like to see how a FOXML document can be built and ingested into Fedora–so here goes building the document:

x = Builder::XmlMarkup.new :indent => 2

x.digitalObject 'xmlns' => 'info:fedora/fedora-system:def/foxml#' do
  x.objectProperties do
    x.property 'NAME' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
      'VALUE' => 'FedoraObject'
    x.property 'NAME' => 'info:fedora/fedora-system:def/model#state',
      'VALUE' => 'A'
  end
end
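From there, actually ingesting the document requires an API-M client (see the Fedora/SOAP and Ruby post below). Here’s a rough sketch of what that could look like–the driver class/file names and the ‘foxml1.0’ format string are my assumptions about what wsdl2ruby generates and what your Fedora version expects, so check them against your setup:

# hypothetical: driver generated by wsdl2ruby from Fedora's API-M WSDL
require 'fedoraAPIMDriver'

url = 'http://localhost:8080/fedora/services/management'
apim = FedoraAPIM.new url

# API-M wants basic authentication (handled by http-access2)
apim.options['protocol.http.basic_auth'] << [url, 'fedoraAdmin', 'fedoraAdmin']

# ingest takes the serialized FOXML, a format identifier and a log message
pid = apim.ingest x.target!, 'foxml1.0', 'ingested from ruby'
puts "ingested as #{pid}"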


Cataloging at the BBC with RubyOnRails

It’s nice to see that the BBC Programme Catalogue (built with RubyOnRails and MySQL) has gone live. Here is some historical background from the about page:

The BBC has been cataloguing and indexing its programmes since the 1920s. The development of the programme catalogue has reflected the changes in the BBC and in broadcasting over the last seventy five years. For example, in the early days of broadcasting, for both Radio and TV, the majority of programmes were broadcast live and were never recorded. There was therefore little point at the time to do extensive cataloguing and indexing of material that did not exist. As you will see, the number of catalogue entries for a day in the 1990s, far exceeds the entries for a day from the 1950s.

As recording technology developed in both mediums, the requirement to keep material for re-use also grew. If material was going to be re-used, it had to be catalogued and indexed. The original records of radio programmes were handwritten into books; over time, card catalogues were developed, and from the mid-1980s onwards there have been computer based catalogues.

This experimental catalogue database holds over 900,000 entries. It is a sub-set of the data from the internal BBC database created and maintained by the BBC’s Information and Archives department. This public version is updated daily as new records are added and updated in the main catalogue. This figure is so high because, for example, each TV news story now has an individual entry in the catalogue.

Talk about sexy retrospective conversion, eh? Hats off to Matt Biddulph and his colleagues. I wish I were going to RailsConf to hear more of the technical details. Actually, if you haven’t already, take a look at the RailsConf program–it looks like it’s going to be a great event.


Fedora/SOAP and Ruby

I’ve been playing around getting Ruby to talk to the Fedora framework for building digital repositories. Fedora makes its API available as different sets of SOAP services, defined in WSDL files. What follows is a brief howto on getting Ruby to talk to the API-A and API-M.

To get basic API-A and API-M clients working you’ll need the following:

  • A modern ruby: probably >= 1.8.2
  • The latest soap4r: the one that comes standard in 1.8.4 may work but emits some warnings when processing the fedora wsdl files.
  • The latest http-access2 if you plan on doing API-M with basic authentication.
  • A tarball of ruby classes I generated with wsdl2ruby using the wsdl files in the latest fedora distribution.

So, assuming you’ve unpacked ruby-fedora.tar.gz, you should be able to go in there and write the following program, which will attempt to connect to a Fedora server at localhost:8080, retrieve the PDF datastream for an object with PID ‘biblio:2’, and write it out to disk. To get it working you’ll want to change the datastream label and PID to something relevant in your repository.

#!/usr/bin/env ruby
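# what follows is just a sketch: the require and class name below are my
# guesses at what wsdl2ruby generates -- adjust them to match the classes
# in ruby-fedora.tar.gz

require 'fedoraAPIADriver'    # hypothetical generated API-A driver

apia = FedoraAPIA.new 'http://localhost:8080/fedora/services/access'

# getDatastreamDissemination(pid, dsID, asOfDateTime) returns a
# MIMETypedStream; its stream attribute holds the raw bytes
stream = apia.getDatastreamDissemination 'biblio:2', 'PDF', nil

File.open('biblio-2.pdf', 'wb') { |f| f.write stream.stream }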


oai/sru and ruby

biblio:~/Projects/ruby-oai ed$ ruby test.rb
Loaded suite test
Started
..........
Finished in 171.247595 seconds.
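That’s the ruby-oai test suite passing. In case you’re wondering what the client side looks like, here’s a quick sketch–the interface shown is approximate, so check the library itself for the real method names:

require 'oai'

client = OAI::Client.new 'http://memory.loc.gov/cgi-bin/oai2_0'
client.list_records(:metadata_prefix => 'oai_dc').each do |record|
  puts record.header.identifier
end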


hGoogle

So it’s been noted elsewhere that the latest ajaxy application out of Google Labs (Google Calendar) lacks support for the hCalendar microformat.

Perhaps it’s an oversight–but with all the high-profile exposure microformats have been getting lately, that’s kind of hard to imagine. But people have deadlines, and some things just can’t make it into the first release–even at Google. The main thing, as Mark Pilgrim says, is:

Sniping from the sidelines makes us look petty and insular. Instead
of making assumptions about big bad evil Google ignoring open
standards and locking users in, have we tried opening a dialogue?

I don’t know anyone at Google, so I feel like I’m doing my part by just blogging about how awesome it would be if they marked up their calendar data using hCalendar. As a full-featured calendaring application on the web, Google Calendar could really enable downstream applications like Live Clipboard if they simply added some class attributes and spans to the data they are already displaying.
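To make that concrete, here’s a sketch (using Builder again, and a made-up event) of the kind of hCalendar markup I mean–plain old spans with the microformat’s class names hung on them:

x = Builder::XmlMarkup.new :indent => 2
x.div :class => 'vevent' do
  x.span 'Staff meeting', :class => 'summary'
  x.abbr 'May 1, 10am', :class => 'dtstart', :title => '2006-05-01T10:00:00'
  x.span 'Room 101', :class => 'location'
end
puts x.target!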

In the long run I imagine it’s in Google’s best interest to promote microformats, since their infrastructure would allow them to take best advantage of a system of distributed metadata. Here’s hoping that it’ll be layered in sometime soon. In the meantime Scott and Mark have the right idea!

By the way, being able to enter a quick event in free text and have the time/location/description parsed as opposed to tabbing around in a complicated form is very nice.


Graham Patrick Summers

Graham Patrick Summers
Born: April 2, 2006 in McHenry, IL at 11:51 AM
Weight: 8lb 8oz
Length: 22 inches
Mother and Baby Healthy and Happy :-)

Details to follow!


Translation and a Citation Microformat

I can think of only one company that has the resources to embed translation links into the world’s existing body of printed material. What’s more, while they’re at it they are going to mark up the title page with a citation microformat…and get this…the microformat is based on an OpenURL XMDP profile, so it’ll interoperate with existing citation resolvers in use in libraries around the world…niiiice.


Standard

When I find the time I’m enjoying reading The Algebraist by Iain M. Banks, which (so far) details a many-species galactic civilization in 4034 AD. The milieu includes an amorphous ancient species known as the Dwellers, who live for millions of years on gas giant planets (like Jupiter), have very, very long memories…and keep the best archives, which other beings occasionally ‘delve’ into. It’s the usual Banksian genius. Last night, on pages 100-101, I couldn’t help laughing at this segment that discusses standards bodies in the future. Apologies to Mr Banks for the extended quote…

The official was speaking the human version of Standard, the galaxy’s lingua franca. Standard had been chosen as an inter-species, pan-galactic language over eight billion years ago. Dwellers had been the main vector in its spread, though they made a point of emphasising that it was not theirs originally. They had one very ancient, informal vernacular and another even more ancient formal language of their own, plus lots that had survived somehow from earlier times or been made up in the meantime. These latter came and went in popularity as such things tended to.

‘Oh no, there was a competition,’ the Dweller guide/mentor Y’sul had explained to Fassin on his first delve, hundreds of years ago. ‘Usual thing; lots of competing so-called universal standards. There was a proper full-scale war after one linguistic disagreement – a grumous and a p’Liner species, if memory serves – and after that came the usual response: inquiries, missions, meetings, reports, conferences, summits.’

‘What we now know as Standard was chosen after centuries of research, study and argument by a vast and unwieldy committee composed of representatives of thousands of species, at least two of which became effectively extinct during the course of the deliberations. It was chosen, astonishingly, on its merits, because it was an almost perfect language: flexible, descriptive, uncoloured (whatever that means, but apparently it’s important), precise but malleable, highly, elegantly complete yet primed for external-term-adoption and with an unusually free but logical link between the written form and the pronounced which could easily and plausibly embrace almost any set of phonemes, scints, glyphs or pictals and still make translatable sense.’

‘Best of all, it didn’t belong to anybody, the species which had invented it having safely extincted themselves millions of years earlier without leaving either any proven inheritors or significant mark on the greater galaxy, save this sole linguistic gem. Even more amazingly, the subsequent conference to endorse the decision of the mega-committee went smoothly and agreed all the relevant recommendations. Take-up and acceptance were swift and widespread. Standard became the first and so far only true universal language within just a few Quick-mean generations. Set a standard for pan-species cooperation that everybody’s been trying to live up to ever since.’

Too funny. I love how the ‘perfect’ language was created by a race that extincted themselves. Just goes to show that perfection ain’t everything…


reading 2.0

Reading 2.0 slipped under my radar, but I guess that was the idea: to let people from O’Reilly, Los Alamos National Labs, OCLC, the Internet Archive, Adobe, Yahoo, Harvard and Elsevier hobnob away from prying eyes. I haven’t seen any audio/video for the event, but Tim O’Reilly has a nice fly-on-the-wall summary of what went on.

It’s refreshing to see library technologies/concepts such as OpenURL, COinS, OAI-PMH, FRBR, METS and Dublin Core starting to be talked about in the context of a larger information environment. For example, I had no idea that Yahoo is harvesting data from the Internet Archive using OAI-PMH. And I didn’t know Yahoo is starting to leverage microformats, though I should’ve guessed considering the recent news about Flickr starting to use hCard.

All in all, these are exciting “lowercase semantic web” times we’re living in. And it’s interesting to watch some of the things people you know have worked on starting to catch on. Hopefully Reading 2.0 was just the start of an ongoing collaboration. Case in point: I just heard Robert Sanderson say in #code4lib that he’s visiting the a9 folks to talk about opensearch and sru. This is just the sort of cross-fertilization we need going on in library land.