oai-pmh tut

I couldn’t pass up the opportunity to hear Simeon Warner talk about oai-pmh in the second group of tutorials. I’ve implemented data and service providers before–so I consider myself fairly knowledgeable about the protocol. But Simeon works with this stuff constantly at Cornell since arXiv so I was certain there would be things to learn from him…he did not disappoint.

Using some recent work at NSDL, and his experience with the protocol Simeon provided some really useful advice on sing oai-pmh. Here are the things I picked up on:

avoid using sets–especially overloading sets to do searches. There is an interesting edge case in the protocol when a record moves from a particular set A to set B, which makes harvesters who are harvesting set A totally miss the update.
pay attention to datestamps. Make sure datestamps are assigned to records when they actually change in the repository or else harvesters can miss updates. The protocol essentially is a way of exposing updates, so getting the datestamps right is crucial.
resumption tokens need to be idempotent. This means a harvester should be able to use the resumption token more than once and get the same result (barring updates to the repository). This is essential so that harvesters engaged in a lengthy harvest can recover from network failure and other exceptions.
pay attention to character encoding. Use a parser that decodes character entities in XML and store the utf8. This will make your live simpler as you layer new services over harvested data. Make sure that HTML entities aren’t used in oai-pmh responses. utf8conditioner is Simeon’s command line app for debugging utf8 data.
be aware of the two great myths of oai-pmh: the myth that oai-php only allows exposure of DC records, and that oai_pmh only allows a single metadata format to be exposed.

There are lots more recommendations at NSDL, but it was useful to have this overview and have the chance to ask Simeon questions. For example even though oai-pmh requires records to have an XML schema, it would be possible to create a wrapping schema for freeform data like RDF.

The main reason I was interested in this tutorial was to hear more about using oai-pmh to distribute not only resource metadata records, but also the resources themselves. There were a couple of initial problems with using the protocol to provide access to the actual resources.

The first is that identifiers such as URLs in metadata records which point to the resource for capture had too much ambiguity. Some of the URLs point at splash screens where someone could download a particular flavor of the resource, others went directly to a PDF, etc. This made machine harvesting data-provider specific. In addition there is a problem with the datestamp semantics when a remote resource changes–when the resource is updated but the metadata stays the same the datestamp is not required to change. This makes it impossible for harvesters to know when it needs to download the resource again.

Fortunately there was a solution that is detailed more fully in a paper written by Simeon and the usual suspects. It boils down to actually distributing the resource as a metadata format. This plays a little bit with what metadata itself is…but it makes the two previously mentioned problems disappear. Simeon gave a brief overview of MPEG21 DIDL but was keen to point out METS and other packaging formats can work the same. Using oai-pmh in this way is really interesting to me since it enables respositories to share actual objects with each other–with OAI-PMH working almost like an ingestion protocol.

I asked about mechanisms to autodiscover oai-pmh metadata in HTML, like unAPI. Simeon pointed out that the usual suspects are actually extending/refining this idea a bit in some recent work done w/ the Andrew Mellon Foundation on interoperability. Apparently they’ve experimented with the LiveClipboard idea in support of some of this work. More on this later.