ruby-oai v0.0.3

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.
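
If you want to kick the tires, selecting the new backend looks roughly like this (a minimal sketch from memory; the :parser option and method names are how I remember them, so double check the gem's README if it doesn't behave):

  # minimal sketch: pass :parser => 'libxml' (an assumption from memory) to
  # have OAI::Client use the libxml backend instead of the default rexml
  require 'oai'

  client = OAI::Client.new('http://cogprints.ecs.soton.ac.uk/perl/oai2',
    :parser => 'libxml')

  client.list_records.each do |record|
    puts record.header.identifier
  end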

After committing the branch he was working on, I ran some performance tests of my own: a vanilla ListRecords request against dspace, eprints and american memory oai-pmh servers, using both the rexml (default) and libxml backend parsers. Here are the results:

server           parser  real           user           sys
dspace           rexml   0m3.632s       0m2.008s       0m0.044s
                 libxml  0m1.900s       0m0.212s       0m0.032s
                 saved   1.732s (+48%)  1.796s (+89%)  0.012s (+27%)

eprints          rexml   0m19.807s      0m1.984s       0m0.036s
                 libxml  0m19.344s      0m0.236s       0m0.024s
                 saved   0.463s (+2%)   1.748s (+88%)  0.012s (+33%)

american-memory  rexml   0m12.991s      0m5.424s       0m0.052s
                 libxml  0m7.420s       0m0.324s       0m0.032s
                 saved   5.571s (+43%)  5.104s (+94%)  0.02s (+38%)

Those percentage values are speed improvements. Thanks Terry :-)


the importance of making packages

If you are interested in such things, Ian Bicking has a nice posting about why breaking up a project into smaller packages of functionality is important. His key point is that the boundaries between packages actually help in establishing and maintaining decoupled modules within your application.

…when someone claims their framework is all spiffy and decoupled, but they just don’t care to package it as separate pieces… I become quite suspicious. Packaging doesn’t fix everything. And it can introduce real problems if you split your packages the wrong way. But doing it right is a real sign of a framework that wants to become a library, and that’s a sign of Something I’d Like To Use.

So why is decoupling important? Creating distinct modules of code with prescribed interfaces helps ensure that a change inside one module doesn't have a huge ripple effect across an entire project codebase. In addition to using packaging to create boundaries between components, the Law of Demeter is a pretty handy technique for reducing coupling in object oriented systems. It amounts to ensuring that a given method only invokes methods on objects that are: the object itself, its parameters, objects it creates, or its component objects (there's a tiny Ruby illustration a bit further down). The LoD seems to be a good practice at the local level, but packaging helps at a macro/design level.

One of the most powerful and fun parts of packaging is coming up with good names and metaphors for your packages and components. Having fun and meaningful names for packages provides coherence to a project, and allows developers to talk about an application. Eric Evans has some nice chapters in his Domain Driven Design about coming up with what he calls a domain language:

To create a supple, knowledge-rich design calls for a versatile, shared team language, and a lively experimentation with language that seldom happens on software projects.

It’s important…and naming distinct packages well helps build a good domain language.

I suppose it’s implicit in making something a code library–but one of the other major benefits of splitting a larger project up into smaller packages is that you encourage reuse. The bit of functionality that you decided to bundle up separately can be used as a dependency in a different project–perhaps even by a different person or organization. This seems to me to be a hallmark of good open source software.

Most popular languages these days have established ways of making packages available, downloadable and installable while expressing the dependencies between them. Perl has CPAN, PHP has PEAR, Ruby has gems and RubyForge, Python has eggs and EasyInstall, Java has maven, Lisp has asdf. Even some applications like Trac, RubyOnRails and Drupal encourage the creation of smaller packages (modules or plugins) by having a well defined api for adding extensions. And that’s not even getting into the various ways operating systems make packages available…

The truly hard part about packaging for me isn't really technical. Most packaging solutions allow you to manage dependencies, versioning, installation and removal. As Ian says, it's the decision of where to draw the lines between packages that is hard. It's hard because you have to guess before you start coding–and often during the process of coding you realize that the dividing lines between packages begin to blur. This is why having distinct packages is so important: you are forced to stare at the blurriness and are encouraged to fix it…instead of creating the infamous big ball of mud.

An interesting counterpoint to trying to figure out the dividing lines beforehand is to design from the outside in, and extract reusable components from the result. The very successful RubyOnRails web framework was extracted from a working application (Basecamp). In a lot of ways I think Test Driven Design encourages this sort of outside-in thinking as well. Extracting usable components from a ball of mud is nigh impossible though…at least for me. I would be interested to know how many of the Rails components were anticipated by the designers as they were creating Basecamp. It takes a great deal of discipline and jazz-like anticipation to be able to improvise a good design. That, or you have to build in time to prototype something with an eye to taking what you've learned to do it right.


open standards

Folks who are interested in libraries and technology are often drawn to the issue of open standards. Using open standards is very important to libraries for a variety of reasons that Ed Corrado summarizes nicely.

This week my podcast reader picked up an excellent interview with Danese Cooper of the Open Source Initiative where she talks about the Open Standard Requirement which was introduced a few months ago. It provides a new perspective on the same issue from outside of the library community.

Essentially the OSR amounts to 5 guidelines for identifying a truly open standard. What makes these guidelines different is that they focus on what makes a standard open for an implementor; whether the standard was created by an open process or not is really out of scope. The important thing is how easy it is for a software developer to write software that uses the standard. A nice feature of the OSR is that the guidelines would fit on an index card. Here's my regurgitation of them:

  1. The spec can’t omit details needed for implementation
  2. The standard needs to be freely/publicly available
  3. All patents involved in the spec need to be royalty free
  4. Clicking through a license agreement is not necessary
  5. The spec can’t be dependent on a standard that is not open as well

Danese was quick to point out that these are simply guidelines and not rules. For example, Unicode fails on guideline 2 since you have to pay for a copy of the spec. But in this case printing the standard is a publishing feat, given how many glyphs it has to cover, so it's not unusual that the book would cost money. This guideline could be waived if the OSI folks agreed.

Rather than the OSI going through and applying these rules to all known standards, the idea is that standards bodies can claim self-compliance, and as developers implement the standard that compliance will be ascertained in practice.

The guidelines themselves are still in the process of being fine tuned and hammered on–and they are looking for volunteers…



set your data free ... with unapi

Dan, Jeremy, Peter, Michael, Mike, Ross and I wrote an article in the latest Ariadne introducing the lightweight web protocol unAPI. Essentially unAPI is an easy way to include references to digital objects in your HTML which can then be predictably retrieved by a machine…yes ‘machine’ includes JavaScript running in a browser :-) Dan and a really nice cross section of developers around the world have been working on this spec for over a year now and I think it could be poised to play an important role in the emerging open data movement.

Imagine you have a citation database that is searchable via the web, and searches return pages of hits. Wouldn't it be nice to align your human viewable results with machine readable representations so that people could write browser hacks and the like to remix your application data?

As far as I can tell there are a few options available to help you do this (apart from doing something ad-hoc).

  1. use a citation microformat and mark up your HTML predictably so that it can be recognized and parsed
  2. use GRDDL to map your HTML to RDF via an XSLT profile.
  3. embed RDF in your HTML essentially using an RDF microformat.
  4. use OpenURL and/or COinS to link in-page IDs to OpenURL servers.
  5. use unAPI: include a unAPI server URL (familiar autodiscovery, like RSS/Atom) and identifiers (simple element attributes) in your HTML, and write a simple server side script that emits XML for a given identifier (a rough sketch of such a script follows this list).
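
To give a feel for how little server side code option 5 actually asks of you, here's a rough Ruby sketch of that kind of script. The id/format parameters, the formats response and the 406 status are my reading of the unAPI spec, and lookup_record is a hypothetical stand-in for your own database lookup, so treat the details as assumptions rather than gospel:

  # rough sketch of an unAPI-ish endpoint, not a reference implementation
  require 'webrick'

  FORMATS = { 'oai_dc' => 'application/xml' }   # format name => mime type

  # hypothetical stand-in for pulling real metadata out of your citation database
  def lookup_record(id, format)
    %Q{<record id="#{id}" format="#{format}"/>}
  end

  server = WEBrick::HTTPServer.new(:Port => 4000)
  server.mount_proc('/unapi') do |req, res|
    id, format = req.query['id'], req.query['format']
    if format.nil?
      # no format asked for: advertise what we can emit (for this id, or globally)
      formats_xml = FORMATS.map { |name, type|
        %Q{<format name="#{name}" type="#{type}"/>}
      }.join
      id_attr = id ? %Q{ id="#{id}"} : ''
      res['Content-Type'] = 'application/xml'
      res.body = "<formats#{id_attr}>#{formats_xml}</formats>"
    elsif FORMATS.key?(format)
      res['Content-Type'] = FORMATS[format]
      res.body = lookup_record(id, format)
    else
      res.status = 406   # a format we don't know how to emit
    end
  end
  trap('INT') { server.shutdown }
  server.start

The HTML side is just the identifiers plus an autodiscovery pointer at a script like this, which is what makes the JavaScript-in-the-browser remixing scenario above plausible.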

I like microformats a lot and I think a citation format will eventually get done. But it's been a long time coming and there's no indication it's going to get done any time soon. What's more, unAPI is bigger than just citation data–and it allows you to publish all kinds of rich data objects without waiting for a community to ratify a particular representation in HTML.

Options 2 and 3 use RDF, which I actually like quite a bit as well. GRDDL implies a GRDDL-aware browser, which would be cool but is a bit heavyweight. XSLT will require clean XHTML–or pipelines to clean it. Embedding RDF in HTML using microformat techniques is compelling because you can theoretically process the RDF data similarly–whereas unAPI doesn't require any particular kind of machine readable format (apart from HTML). Actually there's nothing stopping you from using unAPI to link human viewable objects with RDF representations. The advantage unAPI has here is that you can learn RDF if you want to, but you don't have to learn RDF to get going with unAPI today.

Option 4 leverages work done in the library community on citation linking. OpenURL routers are widely deployed in libraries around the world, and COinS is a quasi-microformat for putting OpenURL context objects into your HTML so that they can be extracted and fired off at an OpenURL server. OpenURL is a relatively complex and subtle standard which can do a lot more than just citation linking. Compared to OpenURL/COinS, unAPI is easy to implement in languages like JavaScript and provides a simple introspection mechanism for discovering what formats a particular resource is available in. AFAIK this can't be done simply using OpenURL/COinS–if I'm wrong, comments are open. I would argue that the sheer power and flexibility of OpenURL paradoxically make it hard to understand, and that unAPI, with Dan's adherence to a one-page spec, is deliberately more limited and simple. Less is more…

So if this piques your interest, read the article. It does a much better job of describing the origins of the work and where it's headed, and it has examples and links out to sites/tools that use unAPI today. I must admit I wrote very little of the article, and mostly contributed text snippets and screenshots of the unAPI validator I wrote, which uses my unapi ruby gem.


oai2rdf

Amidst the flurry of commit messages and the like on the simile development discussion list I happened to see that the Simile Project includes an RDFizer project, which has a component called oai2rdf.

oai2rdf is a command line program that happens to use Jeff Young's OAIHarvester2 and some XSLT magic to harvest an entire oai-pmh archive and convert it to rdf.

  % oai2rdf.sh http://cogprints.ecs.soton.ac.uk/perl/oai2 cogprints

This will harvest the entire cogprints eprint archive and convert it on the fly to rdf, which is saved in a directory called cogprints. Just in case you are wondering–yes, it handles resumption tokens. In fact you can also give it date ranges to harvest, and tell it to only harvest particular metadata formats. By default it actually grabs all possible metadata formats.

As part of my day job I've been looking at some rdf technologies like jena, and while there are lots of chunks of rdf around on the web to play with, oai2rdf suddenly opens up the possibilities quite a bit.

Getting oai2rdf up and running is pretty easy. First get the oai2rdf code:

  svn co http://simile.mit.edu/repository/RDFizers/oai2rdf/ oai2rdf

Next make sure you have maven. If you don't have it, maven is very easy to install: just download, unpack, and make sure the maven/bin directory is in your path. Then you can:

  mvn package

The magic of maven will pull down dependencies and compile the code. Then you should be able to run oai2rdf. Art Rhyno has been talking about the work the Simile folks are doing for quite a while now, and only recently have I started to see what a rich set of tools they are developing.


gems...on ice


When developing and deploying RubyOnRails applications you’ve often got to think about the gem dependencies your project might have. It’s particularly useful to freeze a version of rails in your vendor directory so that your app uses that version of rails rather than a globally installed (or not installed) one. It’s easy to do this by simply invoking:

  rake freeze_gems

Which will unpack all the rails gems into vendor, and your application will magically use these instead of the globally installed rails gems.

The cool thing is that with a little bit of plugin help you can freeze your other gems in vendor as well. Simply install Rick Olson's elegantly simple gem plugin into vendor/plugins. Then, assuming you are using (say) my oai-pmh gem, you can simply:

  rake gems:freeze GEM=oai

and the gem will be unpacked in vendor, and the $LOAD_PATH for your application will automatically include the library path for the new gem. Very useful, thanks Rick!
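
If you are curious what that buys you mechanically, the effect is roughly the following (an illustrative sketch of the idea, not the plugin's actual code, and vendor/gems is an assumed layout):

  # roughly the idea: put each frozen gem's lib directory on the load path
  # so a plain require finds the frozen copy before any system gem
  Dir[File.join(RAILS_ROOT, 'vendor', 'gems', '*', 'lib')].each do |lib|
    $LOAD_PATH.unshift(lib)
  end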


the librarian's store

While working at Follett I always thought it was just a matter of time till Amazon turned its eye on the library market. Much of the web development that went on at Follett was done with an eye towards what Amazon was doing…while tailoring the experience for librarians and library book ordering/processing. The management I expressed this idea to seemed to think that Amazon wouldn't be interested in Follett's business. It was my opinion at the time that it would be better to have Amazon as a partner than as a competitor. This is really just common sense, right? No leap of intuition there.

…time passes…

Now it looks like (thanks eby) Follett has some company. When a web-savvy company like Amazon notices your niche in the ecosystem, it's definitely important to pay attention. Amazon has decided to partner with TLC and Marcive for MARC data, and with OCLC to automatically update holdings. This is big news.

Somewhat related, and in some ways even more interesting: rsinger and eby report in #code4lib that they've seen Library of Congress Subject Headings and Dewey Decimal Classification numbers in Amazon Web Services responses. For an example, splice your Amazon token in here:

http://webservices.amazon.com/onca/xml
?Service=AWSECommerceService
&Version=2006-06-28
&Operation=ItemLookup
&ContentType=text%2Fxml
&SubscriptionId=YOUR_TOKEN_HERE
&ItemId=097669400X
&IdType=ASIN
&ResponseGroup=ItemAttributes,Large,Subjects

scan for:

Ruby (Computer program language)
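
If you'd rather not eyeball the XML by hand, here's a quick net/http + rexml sketch that pulls the subject headings out. The Subject element name is going by what rsinger and eby reported seeing, so consider it an assumption:

  # fetch the ItemLookup response and print whatever <Subject> elements show up
  require 'net/http'
  require 'uri'
  require 'rexml/document'
  require 'cgi'

  params = {
    'Service'        => 'AWSECommerceService',
    'Version'        => '2006-06-28',
    'Operation'      => 'ItemLookup',
    'SubscriptionId' => 'YOUR_TOKEN_HERE',
    'ItemId'         => '097669400X',
    'IdType'         => 'ASIN',
    'ResponseGroup'  => 'ItemAttributes,Large,Subjects'
  }
  query = params.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join('&')
  xml = Net::HTTP.get(URI.parse("http://webservices.amazon.com/onca/xml?#{query}"))

  doc = REXML::Document.new(xml)
  doc.elements.each('//Subject') { |subject| puts subject.text }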


more on web identifiers

I monitor the www-tag discussion list, but more than half of it goes right over my head–so I was pleased when a colleague forwarded URNs, Namespaces and Registries to me. Don't let the 2001 in the URL fool you: it has been updated quite recently. This finding provides an interesting counterpoint to RFC 4452, which I wrote about earlier.

Essentially the authors examine the reasons why folks want to have URNs (persistence) and info-uris (non-dereferenceability) and show how URIs actually satisfy the requirements of both communities.

I have to admit, it sure would be nice if (for example) LCCNs and OCLCNUMs resolved using the existing infrastructure of http and dns. Let's say I run across an info-uri in an XML document identifying tbl as info:lccn/no9910609. What does that really tell me? Wouldn't it be nice if instead it was http://lccn.info/no9910609 and I could use my net/http library of choice to fetch tbl's MADS record? Amusingly, Henry Thompson (one of the authors of the finding) is holding http://lccn.info and http://oclcnum.info for ransom :-)
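
Purely hypothetically (lccn.info does not actually resolve anything today), that would be all of:

  # entirely hypothetical: if lccn.info resolved LCCNs over plain http,
  # grabbing tbl's MADS record would be this boring
  require 'net/http'
  require 'uri'

  puts Net::HTTP.get(URI.parse('http://lccn.info/no9910609'))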

Instead, in the case of info-uri, OCLC is tasked with building a registry of these namespaces, and even when this is built the identifiers won’t necessarily be resolvable in any fashion. This is the intent behind info-uris of course–that they need not be resolvable or persistent. But this finding raises some practical issues that are worth taking a look at, which seem to point to the ultimate power of the web-we-already-have.


iraq

I saw this in yesterday’s Washington Post and just learned it won the Best of Photo Journalism Award for 2006. The picture says it all, but the story is just as harrowing. What a sad mess.