Posts Tagged ‘atom’

resource maps and site maps

Friday, August 1st, 2008

Andy reminds me that a relatively simple idea (I think it was David’s at RepoCamp) for the OAI-ORE Challenge would be to create a tool that transformed OAI-ORE resource maps expressed as Atom into Google Site Maps. This would allow “repositories” that exposed their “objects” as resource maps, to easily be crawled by Google and others.

It would also be useful to demonstrate what value-add OAI-ORE resource maps give you: to answer the question of why not just generate the site map and be done with it. I think there definitely are advantages, such as being able to identify compound objects or aggregations of web resources, and then make assertions about them (a.k.a. attach metadata to them).

RepoCamp recap

Monday, July 28th, 2008

So RepoCamp was a lot of fun. The goal was to discuss repository interoperability–and at the very least repository practitioners got to interoperate, and have a few beers afterwards. Hats off to David Flanders who clearly has got running these events down to a fine art.

I finally got to meet Ben O’Steen after bantering with him on #code4lib and #talis … and also got to chat with Jim Downing (Cambridge Univ) about SWORD stuff, and Stephan Drescher (Los Alamos National Lab) about validating OAI-ORE.

Stephan and I had a varied and wide ranging discussion about the web in general, which was a lot of fun. I really dug his metaphor of the web as an aquatic ecosystem, with interdependent organisms and shared environments. It reminded me a bit of how shocked I was to discover how rich and varied the ecosystem is around a “simple” service like twitter. If I ever return to school it will be to study something along the lines of web science.

It was also interesting to hear that other people saw a parallel between OAI-ORE Resource Maps and BagIt’s fetch.txt. The parallel being that both resource maps and bags are aggregations of web resources. Of course bags can also just be files on disk, it’s when the fetch.txt is present in the bag that the package is made up of web resources. It would be interesting to see what vocabularies are available for expressing fixity information (md5 checksums and the like), and if they could be layered into the resource map atom serialization. Perhaps PREMIS v2.0? It might be fun to code up what a simple OAI-ORE resource map harvester would look like, that checked fixity values — using LC’s existing BagIt parallelretriever.py as a starting point. God I wish I could just hyperlink to that :-(

At any rate, I now need to investigate OAuth because Jim thinks it fits really nicely with AtomPub and SWORD in particular. And if it’s good enough for Google it’s probably worth checking out. Jim also said that there is a possibility that the SWORD 2.0 might take shape as an IETF RFC, which would be good to see.

Thanks to all that made it happen, and for all of you that traveled long distances to join us at the Library of Congress.

oai-ore post baltimore thoughts

Thursday, March 13th, 2008

The recent OAI-ORE meeting was just up the road in Baltimore, so it was easy for a bunch of us from the Library of Congress to attend. I work on a team at LC that is specifically looking at the role that repositories play at the library; I’ve implemented OAI-PMH data providers and harvesters, and in the past couple of years I’ve gotten increasingly interested in semantic web technologies — so OAI-ORE is of particular interest to me. I’ve commented a bit about OAI-ORE on here before, but I figure it can’t hurt to follow in my coworker’s footsteps and summarize my thoughts after the meeting.

(BTW, above is an image of some constellations I nabbed off of wikipedia. I included it here because the repeated analogy (during the meeting) of OAI-ORE resource maps as constellations was really compelling — and quite poetic.)

The Vocabulary

It seems to me that the real innovation of the OAI-ORE effort is that it provides a lightweight RDF vocabulary for talking about aggregated resources on the web. Unfortunately I think that this kernel gets a little bit lost in the 6 specification documents that were released en masse a few months ago.

The ORE vocabulary essentially consists of three new resource types: ore:ResourceMap, ore:Aggregation, ore:AggregatedResource ; and 5 new properties to use with those types: ore:describes, ore:isDescribedBy, ore:aggregates, ore:isAggregatedBy, ore:analogousTo. In addition, the Vocabulary document
provides guidance on how to use a few terms from the DublinCore vocabulary: dc:creator, dc:rights, dcterms:modified, dcterms:created.

The vocabulary is small, so if I were them I would publish the vocabulary elements using hash URIs, instead of slash URIs. The reason for this is that you don’t have to jigger the web server to do a httpRange-14 style 303 correctly:

  • http://www.openarchives.org/ore/0.2/terms#Aggregation
  • http://www.openarchives.org/ore/0.2/terms#AggregatedResource
  • http://www.openarchives.org/ore/0.2/terms#ResourceMap
  • http://www.openarchives.org/ore/0.2/terms#describes
  • http://www.openarchives.org/ore/0.2/terms#isDescribedBy
  • http://www.openarchives.org/ore/0.2/terms#aggregates
  • http://www.openarchives.org/ore/0.2/terms#isAggregatedBy
  • http://www.openarchives.org/ore/0.2/terms#analogousTo

Also, I think ore:AggregatedResource is currently missing from the rdf/xml vocabulary, so it should be added. Also ore:isDescribedBy seems to be commented out.

There is a lot of redundancy between the Abstract Data Model and the Vocabulary documents–so I would recommend collapsing them down into a single, succinct document. This is in keeping with the DRY principle and will have the added benefit of making it easier for newbies to hit the ground running (not having to wade through multiple docs and mentally reconcile them). I could understand having a separate Abstract Data Model document if it were totally divorced from the web and semantic web technologies like RDF, but it’s not.

The Graph

The OAI-ORE effort seemed to be mostly driven by a desire to take harvesting agents the last mile to the actual repository resources themselves–enabling digital library objects (in addition to their metadata) to be harvested from repositories (using HTTP) ; and to be referenced from other contexts (say objects in other repositories). This desire was born out of real, hard won experience with harvesting metadata records, and marked a shift from metadata-centric harvesting to resource-centric harvesting.

In addition OAI-ORE marks a departure from predictable and mind-numbing arguments about SIP formats (METS, DIDL, FOXML, IEEE LOM, XFDU, etc). Yet as soon as we have our shiny new OAI-ORE vocabulary we have to learn yet-another-packaging-format, this time one built on top of Atom.

First, let me just say I’m a big fan of RFC 4287, in particular how it is used in the RESTful Atom Publishing Protocol (RFC 5023). I also think it makes sense to have an Atom serialization for OAI-ORE resource maps — assuming there is a GRDDL transform for turning it into RDF. But the workshop in Baltimore seemed to stress that the Atom serialization was the only way to do OAI-ORE, and didn’t emphasize that there are in fact lots of ways of representing RDF graphs on the web. For example GRDDL allows you to associate arbitrary XML with an XSLT transform to extract a RDF graph. And you could encode your RDF graph directly with RDFa, N3, Turtle, ntriples, or RDF/XML.

Perhaps there is a feeling that stressing the RDF graph too much will alienate some people who are more familiar with XML technologies. Or perhaps all these graph serialization choices could be perceived as being too overwhelming. But I think the opposite extreme of making it look like you can only use an overloaded Atom document as a means to publishing ORE resource maps is misguided, and will ultimately slow adoption. Why not encourage people to publish GRDDL transforms for METS, DIDL or mark up their “splash pages” with RDFa? This would bring the true value of the OAI-ORE work home–it’s not about yet-another-packaging format, it’s about what the various packaging formats have in common on the web.

Release Early, Release Often

In hindsight I think it would’ve been helpful for the OAI-ORE group to privately build consensus about the core OAI-ORE vocabulary (if necessary), then release that into the world wild web for discussion. Then once the kinks were worked out, and there was general understanding, moving on to issues such as discovery and serialization. As it stands the various documents were all dumped at the same time, and seem somewhat fragmented, and in places redundant. Clearly a lot of conversations have gone on that aren’t happening on the public discussion list.

I expressed interest in being part of the OAI-ORE and was politely turned down. I’m actually kind of glad really because I also don’t want to be part of some cabal of digital library practitioners. Maybe I should’ve titled this post “Sour Grapes” :-) Seriously though, the digital library needs good practical solutions and communities of users that encourage widespread adoption and tool support. We don’t need research-ware. Having secret discussions and occasional public events that feel more like lectures than meetings isn’t a good way to encourage adoption.

Anyhow, I hope that this isn’t all seen as being too harsh. Everyone’s a critic eh? All in all there is a lot in OAI-ORE to be proud of. The effort to integrate Web Architecture into Digital Library practices is most welcome indeed. Keep up the good work y’all.