Archive for July, 2007

pymarc, marc8 and nothingness

Friday, July 20th, 2007

pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world it is inevitable that the MARC-8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for most programming languages do not support it, so you need to know to reach for a specialized tool like marc4j, yaz, or MARC::Charset to convert MARC-8 into something useful like Unicode.

The MARC-8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.

The exercise was instructive to me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stores the mapping in memory. Since Python caches compiled bytecode on disk, there were some performance gains to be had over Perl, which would compile the big mapping hash every time. But the main thing is that Aaron chose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only MARC-8 to Unicode but Unicode back to MARC-8. In hindsight this was a mistake, because going back to MARC-8 gets more insane with each passing day.
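To make the "just keep the mapping in memory" idea concrete, here is a minimal sketch in Python. It is not Aaron's code or pymarc's implementation: the function name toy_marc8_to_unicode and the five-entry table are my own illustration, covering only a handful of ANSEL combining marks and plain ASCII, and ignoring the escape sequences that real MARC-8 uses to switch character sets. The one wrinkle it does show is that MARC-8 puts a combining mark before its base character, while Unicode puts it after.

```python
import unicodedata

# Tiny illustrative subset of the MARC-8 (ANSEL) combining marks.
# The real mapping is far larger and spans several character sets.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}


def toy_marc8_to_unicode(data: bytes) -> str:
    """Convert MARC-8 bytes to Unicode using only the toy table above."""
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in MARC8_COMBINING:
            # MARC-8 writes the mark first; hold it until the base arrives.
            pending.append(MARC8_COMBINING[byte])
        elif byte < 0x80:
            # ASCII base character: emit it, then any marks that preceded it.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not in the toy table")
    if pending:
        raise ValueError("combining mark with no base character")
    # Compose base + mark pairs into single code points where possible.
    return unicodedata.normalize("NFC", "".join(out))


print(toy_marc8_to_unicode(b"caf\xe2e"))
```

A plain dict like this is loaded once and every lookup is a hash hit, which is about as simple as a converter can get.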

Aaron’s code as a result is much cleaner and easier to understand because, well, there’s less of it. I’m reading Beautiful Code at the moment and was just reading Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote” — which really crystallized things. Definitely check out Beautiful Code if you have a chance. Maybe the quiet books4code could revive to read it as a group?

app and repositories

Monday, July 16th, 2007

Pete Johnston blogged recently about a very nice use of the Atom Publishing Protocol (APP) to provide digital library repository functionality. The project is supported by UKOLN at the University of Bath and is called Simple Web-service Offering Repository Deposit (SWORD).

If you are interested in digital repositories and web services take a look at their APP profile. It’s a great example of how APP encourages the use of the Atom XML format and RESTful practices, which can then be extended to suit the particular needs of a community of practice.

To understand APP you really only need to grok a handful of concepts from the data model and REST. The data model is basically made up of a service document, which describes a set of collections, each of which aggregates member entries, which can in turn point to a media entry. All of these types of resources are identified with URLs. Since they are URLs you can interact with the objects with plain old HTTP–just like your web browser. For example you can list the entries in a collection by issuing a GET to the collection URL. Or you can create a member resource by doing a POST to the collection URL. Similarly you can delete a member entry by issuing a DELETE to the member entry URL. The full details are available in the latest draft of the RFC–and also in a wide variety of articles including this one.

So to perform a SWORD deposit a program would have to:

  1. get the service document for the repository (GET http://www.myrepository.ac.uk/app/servicedocument)
  2. see what collections it can add objects to
  3. create some IMS, METS or DIDL metadata to describe your repository object and ZIP it up with any of the object’s datastreams
  4. POST the zip file to the appropriate collection URL with the appropriate X-Format-Namespace to identify the format of the submitted object
  5. check that you got a 201 Created status code and record the Location of the newly created resource
  6. profit!
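Steps 3 through 5 can be sketched with nothing but the Python standard library. This is a rough sketch under assumptions, not a reference SWORD client: the function names, the metadata.xml file name inside the ZIP, and the collection URL in the usage lines are all made up for illustration, and authentication is left out entirely. The code builds the request without sending it, so it runs without a repository to talk to.

```python
import io
import urllib.request
import zipfile


def build_package(metadata_xml: bytes, datastreams: dict) -> bytes:
    """Step 3: zip the metadata together with the object's datastreams."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.xml", metadata_xml)  # hypothetical file name
        for name, content in datastreams.items():
            zf.writestr(name, content)
    return buf.getvalue()


def build_deposit_request(collection_url, package, format_namespace,
                          on_behalf_of=None):
    """Step 4: prepare a POST of the ZIP to the collection URL."""
    headers = {
        "Content-Type": "application/zip",
        "X-Format-Namespace": format_namespace,
    }
    if on_behalf_of:
        # Only meaningful for collections that allow mediated deposit.
        headers["X-On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(
        collection_url, data=package, headers=headers, method="POST"
    )


def deposit_location(status, headers):
    """Step 5: insist on 201 Created and return the new resource's URL."""
    if status != 201:
        raise RuntimeError(f"deposit failed with HTTP status {status}")
    return headers["Location"]


# Usage with made-up values; nothing is sent over the network here.
package = build_package(b"<mets/>", {"paper.pdf": b"%PDF-1.4"})
request = build_deposit_request(
    "http://www.myrepository.ac.uk/app/collection",  # hypothetical URL
    package,
    "http://www.loc.gov/METS/",
)
print(request.get_method(), request.full_url)
```

Actually sending it would be a matter of passing the request to urllib.request.urlopen() and handing the response's status and headers to deposit_location().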

Steps 1 and 2 are perhaps not even necessary if the URL for the target collection is already known. Some notable things about the SWORD profile of APP:

  • two levels of conformance (one really minimalistic one)
  • the idea that collections imply particular treatments or workflows associated with how the object is ingested
  • service documents dynamically change to describe only the collections that a particular user can see
  • no ability to edit resources
  • no ability to delete resources
  • no ability to list collections
  • repository objects are POSTed as ZIP files to collections
  • HTTP Basic Authentication + TLS for security
  • the use of DublinCore to describe collections and their respective policies.
  • collections can support mediated deposit which means deposits can include the X-On-Behalf-Of HTTP header to identify the user to create the resource for.
  • the use of X-Format-Namespace HTTP header to explicitly identify the format of the submission package that is zipped up: for example IMS, METS or DIDL.

While I understand why update and delete would be disabled for deposited packages I don’t really understand why the listing of collections would be disabled. An atom feed for a collection would essentially enable harvesting of a repository, much like ListRecords in OAI-PMH.
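As a rough illustration of what harvesting from a collection feed could look like, here is a small Python sketch that pulls the id and updated timestamp out of each entry in an Atom feed. The sample feed, its URLs, and the list_entries function are invented for the example; a real harvester would fetch the feed from the collection URL with a GET and follow any paging links.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample of what a collection feed might contain.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example collection</title>
  <entry>
    <id>http://example.org/app/collection/1</id>
    <updated>2007-07-16T10:00:00Z</updated>
  </entry>
  <entry>
    <id>http://example.org/app/collection/2</id>
    <updated>2007-07-16T11:30:00Z</updated>
  </entry>
</feed>
"""


def list_entries(feed_xml: str):
    """Return (id, updated) pairs for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml.strip())
    return [
        (entry.findtext(ATOM + "id"), entry.findtext(ATOM + "updated"))
        for entry in root.findall(ATOM + "entry")
    ]


for entry_id, updated in list_entries(SAMPLE_FEED):
    print(updated, entry_id)
```

Filtering on the updated value would give roughly the same incremental harvesting you get from a date-ranged ListRecords request.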

I’m not quite sure I completely understand X-On-Behalf-Of and sword:mediation either. I could understand X-On-Behalf-Of in an environment where there is no authentication. But if a user is authenticated couldn’t their username be used to identify who is doing the deposit? Perhaps there are cases (as the doc suggests) where a deposit is done for another user?

All in all this is really wonderful work. Of particular value for me was seeing the list of SWORD extensions and also the use of HTTP status codes. If I have the time I’d like to throw together a sample repository server and client to see just how easy it is to implement SWORD. I did try some experiments along these lines for my presentation back in February…but they never got as well defined as SWORD.

How do 26 Nobel Laureates change a light bulb?

Friday, July 13th, 2007

I don’t know … but it sure is nice to see that 26 Nobel Laureates at least understand the direction libraries ought to be headed:

As scientists and Nobel laureates, we are writing to express our strong support for the House and Senate Appropriations Committees’ recent directives to the NIH to enact a mandatory policy that allows public access to published reports of work supported by the agency. We believe that the time is now for Congress to enact this enlightened policy to ensure that the results of research conducted by NIH can be more readily accessed, shared and built upon - to maximize the return on our collective investment in science and to further the public good.

The public at large also has a significant stake in seeing that this research (research funded by the National Institutes of Health) is made more widely available. When a woman goes online to find what treatment options are available to battle breast cancer, she will find many opinions, but peer-reviewed research of the highest quality often remains behind a high-fee barrier. Families seeking clinical trial updates for a loved one with Huntington’s disease search in vain because they do not have a journal subscription. Librarians, physicians, health care workers, students, journalists, and investigators at thousands of academic institutions and companies are currently hindered by unnecessary costs and delays in gaining access to publicly funded research results.

Exciting times for libraries and the medical profession! I just hope they can convince Congress.

purl2

Thursday, July 12th, 2007

It’s great to see that OCLC is going to work with Zepheira on a new version of the PURL service and that it’s going to have an Apache license. Besides addressing scalability issues, it sounds like Zepheira is going to build in support for resources that are outside of the information space of the web:

The new PURL software will also be updated to reflect the current understanding of Web architecture as defined by the World Wide Web Consortium (W3C). This new software will provide the ability to permanently identify networked information resources, such as Web documents, as well as non-networked resources such as people, organizations, concepts and scientific data. This capability will represent an important step forward in the adoption of a machine-processable “Web of data” enabled by the Semantic Web.

Since Eric Miller helped start up Zepheira it’s not surprising that purl2 will take this on. As part of some experiments I’ve been doing with SKOS and serving up Concepts over HTTP, it has become clear that a minimal bit of tooling for managing these identifiers would be useful. I can definitely see the need for a general solution that helps manage identifiers for people, organizations, concepts, etc. and that also fits into how HTTP should/could serve up the resources associated with them.

via Thom Hickey

ruby-zoom v0.3.0

Tuesday, July 10th, 2007

Thanks to some prodding from William Denton and Jason Ronallo and the kindness of Laurent Sansonetti I’ve been added as a developer to the ruby-zoom project, which provides a Ruby wrapper to the yaz Z39.50 library. I essentially wanted to remove some unused code from the project that was interfering with the ruby-marc gem … and I also wanted to create a gem for ruby-zoom. This was the first time I’ve tried packaging up a C wrapper as a gem and it was remarkably smooth. I also added a test suite and a Rakefile. So assuming you have yaz installed you can install ruby-zoom with:

% gem install zoom

I’ll admit, I’m no huge fan of Z39.50 but the fact remains that it’s pretty much the most widely deployed machine API for getting at bibliographic data locked up in online catalogs. It’s really nice to see forward-thinking systems like Talis, Evergreen and Koha that have (or have at least experimented with) OpenSearch implementations.