set your data free ... with unapi

Dan, Jeremy, Peter, Michael, Mike, Ross and I wrote an article in the latest Ariadne introducing the lightweight web protocol unAPI. Essentially unAPI is an easy way to include references to digital objects in your HTML which can then be predictably retrieved by a machine…yes ‘machine’ includes JavaScript running in a browser :-) Dan and a really nice cross section of developers around the world have been working on this spec for over a year now and I think it could be poised to play an important role in the emerging open data movement.

Imagine you have a citation database which is searchable via the web, and whose search results present lists of citations. Wouldn’t it be nice to align your human viewable results with machine readable representations so that people could write browser hacks and the like to remix your application data?

As far as I can tell there are a few options available to help you do this (apart from doing something ad-hoc).

  1. use a citation microformat and mark up your HTML predictably so that it can be recognized and parsed
  2. use GRDDL to map your HTML to RDF via an XSLT profile
  3. embed RDF in your HTML, essentially using an RDF microformat
  4. use OpenURL and/or COinS to link in-page IDs to OpenURL servers
  5. use unAPI: include a unAPI server URL (familiar autodiscovery, like RSS/Atom) and identifiers (simple element attributes), and write a simple server-side script that emits XML for a given identifier
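
To make option 5 concrete, here is a rough sketch of my own (the identifier and the oai_dc format name are just placeholders, and this is not copied from the spec) of the two responses a unAPI script hands back: a formats list when only an identifier is given, and the object itself when a format is requested too.

```ruby
require 'rexml/document'

# A sketch of a unAPI responder. OBJECTS stands in for your citation
# database; in a real app this logic would sit behind the advertised
# unAPI URL as a CGI/Rails action reading id and format parameters.
OBJECTS = {
  'info:lccn/12345' => { 'oai_dc' => '<dc><title>Example</title></dc>' }
}

def unapi_response(id, format = nil)
  if format.nil?
    # introspection: list the formats this identifier is available in
    doc = REXML::Document.new
    formats = doc.add_element('formats', 'id' => id)
    OBJECTS.fetch(id, {}).keys.each do |name|
      formats.add_element('format', 'name' => name, 'type' => 'application/xml')
    end
    out = String.new
    doc.write(out)
    out
  else
    # hand back the object itself in the requested format
    OBJECTS[id][format]
  end
end
```

A browser hack can then hit the unAPI URL with just an id to discover formats, and with an id plus a format to pull the record.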

I like microformats a lot and I think a citation format will eventually get done. But it’s been a long time coming and there’s no indication it’s going to get done any time soon. What’s more unAPI is bigger than just citation data–and it allows you to publish all kinds of rich data objects without waiting for a community to ratify a particular representation in HTML.

Options 2 and 3 use RDF, which I actually like quite a bit as well. GRDDL implies a GRDDL-aware browser, which would be cool but is a bit heavyweight. XSLT will require clean XHTML–or pipelines to clean it. Embedding RDF in HTML using microformat techniques is compelling because you can theoretically process the RDF data similarly–whereas unAPI doesn’t require any particular kind of machine readable format (apart from HTML). Actually there’s nothing stopping you from using unAPI to link human viewable objects with RDF representations. The advantage unAPI has here is that you can learn RDF if you want to, but you don’t have to learn RDF to get going with unAPI today.

Option 4 leverages work done in the library community on citation linking. OpenURL routers are widely deployed in libraries around the world, and COinS is a quasi-microformat for putting OpenURL context objects into your HTML so that they can be extracted and fired off at an OpenURL server. OpenURL is a relatively complex and subtle standard which can do a lot more than just citation linking. Compared to OpenURL/COinS, unAPI allows for ease of implementation in languages like JavaScript and provides a simple introspection mechanism for discovering what formats a particular resource is available in. AFAIK this can’t be done simply using OpenURL/COinS, but if I’m wrong, comments are open. I would argue that the sheer power and flexibility of OpenURL paradoxically make it hard to understand…and that unAPI, with Dan’s adherence to a one-page spec, is more limited and simple. Less is more…

So if this piques your interest read the article. It does a much better job of describing the origins of the work, where it’s headed, has examples and links out to sites/tools that use unAPI today. I must admit I wrote very little of the article, and mostly contributed text snippets and screenshots of the unAPI validator I wrote, which uses my unapi ruby gem.


Amidst the flurry of commit messages and the like on the simile development discussion list I happened to see that the Simile Project includes an RDFizer project which has a component called oai2rdf.

oai2rdf is a command line program that happens to use Jeff Young’s OAIHarvester2 and some XSLT magic to harvest an entire oai-pmh archive and convert it to rdf.

  % oai2rdf.sh http://cogprints.org/perl/oai2 cogprints

This will harvest the entire cogprints eprint archive and convert it on the fly to rdf which is saved in a directory called cogprints. Just in case you are wondering–yes it handles resumption tokens. In fact you can also give it date ranges to harvest, and tell it to only harvest particular metadata formats. By default it actually grabs all possible metadata formats.

As part of my day job I’ve been looking at some rdf technologies like Jena, and while there are lots of chunks of rdf around on the web to play with, oai2rdf suddenly opens up the possibilities quite a bit.

Getting oai2rdf up and running is pretty easy. First get the oai2rdf code:

  svn co http://simile.mit.edu/repository/RDFizers/oai2rdf/trunk oai2rdf

Next make sure you have maven. If you don’t have it, maven is very easy to install: just download, unpack, and make sure the maven/bin directory is in your path. Then you can:

  mvn package

The magic of maven will pull down dependencies and compile the code. Then you should be able to run oai2rdf. Art Rhyno has been talking about the work the Simile folks are doing for quite a while now, and only recently have I started to see what a rich set of tools they are developing.

gems...on ice

When developing and deploying RubyOnRails applications you’ve often got to think about the gem dependencies your project might have. It’s particularly useful to freeze a version of rails in your vendor directory so that your app uses that version of rails rather than a globally installed (or not installed) one. It’s easy to do this by simply invoking:

  rake freeze_gems

Which will unpack all the rails gems into vendor, and your application will magically use these instead of the globally installed rails gems.

The cool thing is that with a little bit of plugin help you can freeze your other gems in vendor as well. Simply install Rick Olson’s elegantly simple gem plugin into vendor/plugins. Then, assuming you are using (let’s say) my oai-pmh gem, you can simply:

  rake gems:freeze GEM=oai

and the gem will be unpacked in vendor, and the $LOAD_PATH for your application will automatically include the library path for the new gem. Very useful, thanks Rick!
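
I haven’t read the plugin’s source closely, so treat this as a guess at the effect rather than its actual code: conceptually it just prepends every vendor/gems/*/lib directory to Ruby’s $LOAD_PATH, so frozen gems win out over globally installed ones. A standalone sketch (the oai-0.0.4 directory name is made up):

```ruby
require 'fileutils'
require 'tmpdir'

# Simulate a Rails root containing one frozen gem...
rails_root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(rails_root, 'vendor', 'gems', 'oai-0.0.4', 'lib'))

# ...then do what the gems plugin effectively does: push each
# vendor/gems/*/lib directory onto the front of the load path so that
# require finds the frozen copy before any system-wide gem.
Dir[File.join(rails_root, 'vendor', 'gems', '*', 'lib')].each do |lib|
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
end
```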

the librarian's store

While working at Follett I always thought it was just a matter of time till Amazon turned its eye on the library market. Much of the web development that went on at Follett was done with an eye towards what Amazon was doing…while tailoring the experience for librarians and library book ordering/processing. The management I expressed this idea to seemed to think that Amazon wouldn’t be interested in Follett’s business. It was my opinion at the time that it would be better to have Amazon as a partner than a competitor. This is really just common sense right? No leap of intuition there.

…time passes…

Now it looks like Follett has some company (thanks eby). When a web savvy company like Amazon notices your niche in the ecosystem it’s definitely important to pay attention. Amazon has decided to partner with TLC and Marcive for MARC data and with OCLC to automatically update holdings. This is big news.

Somewhat related, and in some ways even more interesting: rsinger and eby report in #code4lib that they’ve seen Library of Congress Subject Headings and Dewey Decimal Classification numbers in Amazon Web Service responses. For an example splice your Amazon token in here:

scan for:

Ruby (Computer program language)

more on web identifiers

I monitor the www-tag discussion list, but more than half of it goes right over my head–so I was pleased when a colleague forwarded URNs, Namespaces and Registries to me. Don’t let the 2001 in the URL fool you, it has been updated quite recently. This finding provides an interesting counterpoint to rfc 4452 which I wrote about earlier.

Essentially the authors go about examining the reasons why folks want to have URNs (persistence) and info-uris (non-dereferenceability), and show how plain http URIs actually satisfy the requirements of these two communities.

I have to admit, it sure would be nice if (for example) LCCNs and OCLCNUMs resolved using the existing infrastructure of http and dns. Let’s say I run across an info-uri in an XML document identifying tbl as info:lccn/no9910609. What does that really tell me? Wouldn’t it be nice if it was an http URI instead, and I could use my net/http library of choice to fetch tbl’s MADS record? Amusingly, Henry Thompson (one of the authors of the finding) is holding a couple of the relevant domain names for ransom :-)
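
None of this resolution machinery actually exists, which is sort of the point, but the mapping itself would be trivial. A hypothetical sketch (the example.org hosts are invented for illustration):

```ruby
# Hypothetical: map an info-uri onto an http URI that ordinary DNS and
# HTTP could resolve. The host names below are made up for this sketch;
# nothing like this registry is deployed anywhere.
HOSTS = {
  'lccn'    => 'http://lccn.example.org/',
  'oclcnum' => 'http://oclcnum.example.org/'
}

def http_uri_for(info_uri)
  namespace, identifier = info_uri.sub(/\Ainfo:/, '').split('/', 2)
  base = HOSTS[namespace] or raise "unregistered info namespace: #{namespace}"
  base + identifier
end

puts http_uri_for('info:lccn/no9910609')  # => http://lccn.example.org/no9910609
```

With something like this in place, net/http could fetch the MADS record directly rather than leaving the identifier opaque.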

Instead, in the case of info-uri, OCLC is tasked with building a registry of these namespaces, and even when this is built the identifiers won’t necessarily be resolvable in any fashion. This is the intent behind info-uris of course–that they need not be resolvable or persistent. But this finding raises some practical issues that are worth taking a look at, which seem to point to the ultimate power of the web-we-already-have.


I saw this in yesterday’s Washington Post and just learned it won the Best of Photo Journalism Award for 2006. The picture says it all, but the story is just as harrowing. What a sad mess.

repositories and domain-specific-languages

At work I’ve been doing some experiments with the Fedora repository software. One of the strengths of Fedora is that it is fundamentally designed as a set of extensible web services. At first I set about becoming familiar with the set of web services and decided that Ruby would be a useful and lightweight language to do this from. Sure enough, Ruby was plenty capable of putting stuff into Fedora and getting stuff back out again.

As time went on it became clear that what was really needed was a layer of abstraction around this Fedora web services API that would allow it (or another repository framework) to be used in a programmatic way without having to make SOAP calls and build FOXML all over the place. Typically in software pattern lingo this is referred to as a facade.

So I worked on creating a facade, and ended up with something I half-jokingly called ‘bitbucket’ which looks something like this:

  require 'bitbucket'
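
To give a flavor of what such a facade buys you, here is a toy of my own making (not the actual bitbucket API): repository operations become plain method calls, and an in-memory stand-in takes the place of the real Fedora client.

```ruby
# InMemoryRepository stands in for a real Fedora API-M client; the
# BitBucket facade hides the ingest/datastream plumbing behind two
# methods, so callers never touch SOAP or FOXML directly.
class InMemoryRepository
  def initialize
    @objects = {}
  end

  def exists?(pid)
    @objects.key?(pid)
  end

  def ingest(pid)
    @objects[pid] = {}
  end

  def add_datastream(pid, dsid, content)
    @objects[pid][dsid] = content
  end

  def datastream(pid, dsid)
    @objects[pid][dsid]
  end
end

class BitBucket
  def initialize(repo = InMemoryRepository.new)
    @repo = repo
  end

  # store content as a datastream, creating the object if necessary
  def store(pid, dsid, content)
    @repo.ingest(pid) unless @repo.exists?(pid)
    @repo.add_datastream(pid, dsid, content)
  end

  def fetch(pid, dsid)
    @repo.datastream(pid, dsid)
  end
end
```

Swapping the in-memory repository for a SOAP-backed one would leave calling code untouched, which is the whole point of the facade.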

code4libcon 2007

Here’s to making sure that code4libcon 2007 is a watershed moment for women library technologists.

code4libcon 2006 in Corvallis wasn’t all male, but it was largely so…and I can only remember two women speaking to the audience. To a large extent code4libcon was modeled after technology conferences like yapc, pycon, oscon, barcamp, etc–which have much the same sort of ratio. But libraries are different because the majority of people who work in libraries are women. So it was a bit surprising that more women didn’t end up at code4libcon 2006.

2006 did get organized practically overnight by a very small (male) clique in an irc room (one that’s not always well behaved, but means well; hey, it’s IRC). When people actually started signing up and sending in papers to the more formal discussion list I think we were all kind of surprised. I seriously thought we were just going to be hanging out in some random space with free wifi, and it turned into this really successful event.

Some folks like Dan Chudnov, Art Rhyno, Jeremy Frumkin and Roy Tennant started thinking and talking early about making the conference appeal to women library technologists. But it seems that the voting (open to all, but all men for some reason) somehow subconsciously counteracted this.

AFAIK the keynote voting is still going on, and I imagine you can still suggest speakers. There will only be more voting to do as we get into selecting presenters. If you’d like to participate just email Brad LaJeunesse and he’ll hook you up with a backpack login. Also, sign up for the code4lib and code4libcon discussion lists. Luckily Dorothea Salo is involved and vocal and I’m hoping that other women technologists will get involved too. This is a grassroots thing after all, not some sort of LITA top-tech trends panel. It’ll become whatever we want it to be.

nice work atlanta

It’s nice to see that Dr. King’s papers found a home at his alma mater–and won’t be locked away in somebody’s safe.

oai-pmh revisited

So the big news for me at JCDL was Carl Lagoze’s presentation on the state of the NSDL in his paper Metadata Aggregation and “Automated Digital Libraries”: A Retrospective on the NSDL Experience. It ended up winning the Vannevar Bush Best Paper Award so I guess it was of interest to some other folks. I highly recommend printing out the paper for a train ride home if you are at all interested in digital libraries and metadata.

The paper is essentially a review of a three year effort to build a distributed digital library using the OAI-PMH protocol. Lagoze’s pain in relating some of the findings was audible and understandable given his involvement in OAI-PMH…and the folks seated before him.

The findings boil down to a few key points:

The OAI-PMH isn’t low barrier enough

Overall success rate for OAI-PMH harvesting has been 64%. I took the liberty of pulling out this graphic that Carl put up on the big screen at JCDL:

The reason for harvest failure varied but centered around XML encoding problems (character sets, well-formedness, schema validation) as well as improper use of datestamps and resumption tokens. These errors are often unique to a particular harvest request–so that a repository that has passed “validation” can often turn around and randomly emit bad XML. So perhaps more than OAI-PMH not being low-barrier enough we are really talking about XML not being low barrier enough eh?
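
A harvester can at least defend itself cheaply here: check each harvested record for well-formedness and quarantine the bad ones rather than letting a single record kill the whole run. A sketch using Ruby’s stdlib REXML:

```ruby
require 'rexml/document'

# Return true if the string parses as well-formed XML. A harvester can
# use this to log and skip bad records instead of aborting the harvest.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end
```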

This issue actually came up in the Trusted Repositories workshop after JCDL (which I’ll write about eventually) when Simeon Warner stated directly (in Mackenzie Smith’s direction) that off the shelf repositories like DSpace should be engineered in such a way that they are unable to emit invalid/ill-formed XML. Mackenzie responded by saying that as far as she knew the issues had been resolved, but that she is unable to force people to upgrade their software. This situation sounds similar to the bind that M$ finds themselves in–but at least with Windows you are prompted to upgrade your software…I wonder if DSpace has anything like that.

So anyway–bad XML. Perhaps on the flipside brittle XML tools are part of the problem…and the fact that folks are generating XML by hand instead of using tools such as Builder.
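
Builder is a gem, so just to keep this sketch self-contained here is the same idea with Ruby’s stdlib REXML: let a library worry about escaping and balanced tags instead of interpolating strings by hand. Note how the ampersand comes out escaped for free.

```ruby
require 'rexml/document'

# Build a fragment of an oai_dc record with a library instead of string
# concatenation: the '&' in the title is escaped to '&amp;' on output,
# and every element is guaranteed a matching close tag.
doc = REXML::Document.new
dc = doc.add_element('oai_dc:dc',
  'xmlns:oai_dc' => 'http://www.openarchives.org/OAI/2.0/oai_dc/',
  'xmlns:dc'     => 'http://purl.org/dc/elements/1.1/')
title = dc.add_element('dc:title')
title.text = 'Cats & Dogs'

out = String.new
doc.write(out)
puts out
```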

The other rubs were datestamps and resumption tokens which I got to hear about extensively in Simeon’s OAI-PMH tutorial. The point being that there are these little wrinkles to creating a data provider which don’t sound like much; but when multiplied out to many nodes can result in an explosion of administrative emails (170 messages per provider, per year) for the central hub. This amounts to a lot of (wo)man hours lost for both the central hub and the data providers. It makes one wonder at how lucky/brilliant tbl was in creating protocols which scaled massively to the web we know today…with a little help from his friends.

Good metadata is hard

…or at least harder than expected. Good metadata requires domain expertise, metadata expertise and technical expertise–and unfortunately the NSDL data providers typically lacked people or a team with these skills (aka library technologists).

Essentially the NSDL was banking that the successful Union Catalog model (WorldCat) could be re-expressed using OAI-PMH and a minimalist metadata standard (oai_dc)…but WorldCat has professionally trained librarians and NSDL did not. oai_dc was typically too low-resolution to be useful, and only 50% of the collections used the recommended nsdl_dc qualified DC…which made it less than useful at the aggregator level. Furthermore only 10% of data providers even provided another type of richer metadata.

The NSDL team expended quite a bit of effort building software for scrubbing and transforming the oai_dc but:

In the end, all of these transforms don’t enhance the richness of the information in the metadata. Minimally descriptive metadata, like Dublin Core, is still minimally descriptive even after multiple quality repairs. We suggest that the time spent on such format-specific transforms might be better spent on analysis of the resource itself–the source of all manner of rich information.

which brings us to…

Resource-centric (rather than metadata-centric) DL systems are the future

OAI-PMH by definition is essentially a protocol for sharing metadata. The system that the NSDL built is centered around a Metadata Repository which is essentially an RDBMS with a Dublin Core metadata record at its core. Various services are built up around the MR, including a service for Search and an Archiving service.

However as the NSDL has started to build out other digital library services they’ve discovered problems such as multiple records for the same resource. But more importantly they want to build services that provide context to resources, and the current MR model puts metadata at the center rather than the actual resource.

In the future, we also want to express the relationships between resources and other information, such as annotations and standards alignments…we wish to inter-relate resources themselves, such as their co-existence within a lesson plan or curriculum.

So it seems to me the NSDL folks have decided that the Union Catalog approach just isn’t working out, and that the resource itself needs to be moved into the center of the picture. At the end of the day, the same is true of libraries–where the most important thing is the content that is stored there–not the organization of it. More information about this sea change can be found in An Information Network Overlay Architecture for the NSDL, and in the work on Pathways, which was discussed and I hope to summarize here in the coming days.


I think it’s important to keep these findings in the context of the NSDL enterprise. For example the folks behind the aDORe framework are using OAI-PMH heavily. Control of data providers is implicit in the aDORe framework–so XML and metadata quality problems are mitigated. Furthermore aDORe is resource centric since MPEG21 surrogates for the resources themselves are being sent over OAI-PMH. But it does seem somehow odd that a metadata protocol is being repurposed to transfer objects…but that begs a philosophical question about metadata which I’m not even going to entertain right now.

I think it’s useful to compare the success of oai-pmh with the success of the www. Consider this thought experiment…Imagine that early web browsers (lynx/mosaic/netscrape/etc) required valid HTML in order to display a page. If the page was invalid you would get nothing but a blank page. If you really needed to view the page you’d have to email the document author and tell them to fix an unclosed tag, or a particular character encoding. Do you think that the web would’ve still propagated at the speed that it did?

fault_tolerant_xml_tools + using_xml_generator_libraries == happiness