Posts Tagged ‘standards’

We’ve got five years, my brain hurts a lot

Thursday, August 7th, 2008

Recently there’s been a few discussions about persistent identifiers on the web: in particular one about the persistence of XRIs, and another about the use of HTTP URIs in semantic web applications like dbpedia.

As you probably know already, the w3c publicly recommended against the use of Extensible Resource Identifiers (XRI). The net effect of this was to derail the standardization of XRIs within OAISIS itself. Part of the process that Ray Denenberg (my colleague at the Library of Congress) helped kick off was a further discussion between XRI people and the w3c-tag about what XRI specifically provides that HTTP URIs do not. Recently that discussion hit a key point by Stuart Williams:

… the point that I’m trying to make is that the issue is with the social and administrative policies associated with the DNS system - and the solution is to establish a separate namespace outside the DNS system that has different social/adminsitrative policies (particularly wrt persistent name segments) that better suits the requirements of the XRI community. There is the question as to whether that alternate social/administrative system will endure into the long term such the the persistence intended guarantees endure… or not - however that will largely be determined by market forces (adoption) and ‘crudely’ the funding regime that enables the administrative structure of XRI to persist - and probably includes the use of IPRs to prevent duplicate/alternate root problems which we have seen in the DNS world.

It’ll be interesting to see the response. I basically have the same issue with DOIs and the Handle System that they depend on. Over at CrossTech Tony Hammond suggests that the Handle System would make RDF assertions such as those that involve DBPedia more persistent. But just how isn’t entirely clear to me. It seems that Handles like URLs are only persistent to the degree that they are maintained.

I’d love to see a use case from Tony that describes just how DOIs and the Handle System would provide more persistence than HTTP URLs in the context of RDF assertions involving dbpedia. As Stuart said eloquently in his email:

Again just seeking to understand - not to take a particular position

PS. Sorry if the blog post title is too cryptic, it’s Bowie’s “Five Years” which Tony’s post (perhaps intentionally) reminded me of :-)

BagIt

Friday, June 6th, 2008

One little bit of goodness that has percolated out from my group at $work in collaboration with the California Digital Library is the BagIt spec (more readable version). BagIt is an IETF RFC for bundling up files for transfer over the network, or for shipping on physical media. Just yesterday a little article about BagIt surfaced on the LC digital preservation website, so I figure now is a good time to mention it.

The goodness of BagIt is in its simplicity and utility. A Bag is essentially: a set of files in a particular directory named data, a manifest file which states what files ought to be in the data directory, and a bagit.txt file that states the version of BagIt. For example here’s a sample (abbreviated) directory structure for a bag of digitized newspapers via the National Digital Newspaper Program:

mybag
|-- bagit.txt
|-- data
|   `-- batch_lc_20070821_jamaica
|       |-- batch.xml
|       |-- batch_1.xml
|       `-- sn83030214
|           |-- 00175041217
|           |   |-- 00175041217.xml
|           |   |-- 1905010401
|           |   |   |-- 1905010401.xml
|           |   |   `-- 1905010401_1.xml
|           |   |-- 1905010601
|           |   |   |-- 1905010601.xml
|           |   |   `-- 1905010601_1.xml

The manifest itself is just the relative file path, and a fixity value:

ea9dee53c2c2dd4027984a2b59f58d1f  data/batch_lc_20070821_jamaica/batch.xml
72134329a82f32dd44d59b509928b6cd  data/batch_lc_20070821_jamaica/batch_1.xml
dc5740d295521fcc692bb58603ce8d1a  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010601/1905010601_1.xml
e16e74988ca927afc10ee2544728bd14  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010601/1905010601.xml
fd480b2c4bcb6537c3bc4c9e7c8d7c21  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010401/1905010401.xml
e0e4a981ddefb574fa1df98a8a55b7a4  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010401/1905010401_1.xml
c8dffa3cdb7c13383151e0cd8263d082  data/batch_lc_20070821_jamaica/sn83030214/00175041217/00175041217.xml

The manifest format happens to be the same format understood and generated by the common unix (and windows) utility md5deep. So it’s pretty easy to generate and validate the manifests.

The context for this work has largely been NDIIPP partners (like CDL) transferring data generated by funded projects back to LC. Although it’s likely to get used in some other places as well internally. It’s funny to see the spec in its current state, after Justin Littman rattled off the LC Manifest wiki page in a few minutes after a meeting where Andy Boyko initially brought up the issue. Andy has just left LC to work for a record company in Cupertino. I don’t think I fully understood simplicity in software development until I worked with Andy. He has a real talent for boiling down solutions to their most simple expression, often leveraging existing tools to the point where very little software actually needs to be written. I think Andy and John found a natural affinity for striving for simplicity, and it shows in BagIt. Andy will be sorely missed, but that record store is lucky to get him on their team.

There are some additional cool features to BagIt, including the ability to include a fetch.txt file which contains http and/or rsync URIs to fill in parts of the bag from the network. We’ve come to refer to bags with a fetch.txt as “holey bags” because they have holes in them that need to be filled in. This allows very large bags to be assembled quickly in parallel (using a 100 line python script Andy Boyko wrote, or whatever variant of wget, curl, rsync makes you happy). Also you can include a package-info.txt which includes some basic metadata as key/value pairs … designed primarily for humans.

Dan Krech and I are in the process of creating a prototype deposit web application that will essentially allow bags to be submitted via a SWORD (profile of AtomPub for Repositories) service. The SWORD part should be pretty easy, but getting the retrieval of “holey bags” kicked off and monitored propertly will be the more challenging part. Hopefully I’ll be able to report more here as things develop.

Feedback on the BagIt RFC is most welcome.

metadata hackers

Monday, December 31st, 2007

I opened the paper this morning to read a story of another person involved in the creation of MARC who has just died. I hadn’t realized before reading Henrietta Avram and Samuel Snyder’s obituaries that there was a bit of an NSA LC connection when MARC was being created.

From 1964 to 1966, [Samuel Snyder] was coordinator of the Library of Congress’s information systems office. He was among the creators of the library’s Machine Readable Cataloging system that replaced the handwritten card with an electronic searchable database system that became the standard worldwide.

I imagine NSA folks had a lot to do with early automation efforts in the federal government…but it’s still an interesting connection. One of my coworkers is reading up on this early history of MARC so this is for him in the unlikely event that he missed it…email would probably have worked better I guess, but I also wanted to pay tribute. Libraries wouldn’t be what they are today without this influential early work.