John Price-Wilkin Interview

In case you missed it in your overstuffed RSS reader, Jon Udell recently interviewed John Price-Wilkin, who is coordinating the University of Michigan’s joint digitization project with Google.

The interview covers interesting bits of history about the University of Michigan Digital Library, Making of America, JSTOR (I didn’t realize there was a book about its history), and of course the project with Google.

The shocker for me was that while the UMDL has been able to digitize 3000 books per year, Google is doing approximately that number a day. Wilkin wasn’t able to go into specifics about just how Google is doing this, but he does talk about things like the resolutions used, destructive vs. non-destructive digitization, and how federations of libraries could work with this data.

Wilkin has been at the center of digital library efforts for as long as I’ve been working with libraries and technology, so it was really fun to hear this interview.


got data?

Just saw this float by on simile-general

… thanks to Ben, we now have permission to publish the barton RDF dump (consisting of 50 million juicy RDF statements from the MIT library catalogue). They are now available at

http://simile.mit.edu/rdf-test-data/

Juicy indeed…it would be nice to see more libraries do this sort of thing.


js>

So I've been dabbling with that four letter word at $work to create a hierarchical journal/volume/issue/article browser. Le rails and scriptaculous make it pretty easy indeed.

I figured I'd be a good developer and try to understand what's actually going on behind the scenes, so I picked up a copy of Ajax in Action and am working through it.

There is so much hype surrounding Ajax that I had pretty low expectations--but the book is actually very well written and a joy to read. I noticed before diving in that there was an appendix on object-oriented JavaScript. I've been around the block enough times to know that JavaScript is actually quite a nice functional language; but apart from DHTML I haven't really had the opportunity to dabble in it much. This appendix made it clear just how elegant JavaScript can be, and for someone who has done object-oriented programming in Perl the idioms for doing OOP in JavaScript didn't seem that bad.

Anyhow, I quickly wanted to start fiddling around with the language in a JavaScript interpreter, so I downloaded Rhino and discovered that you can:

    frizz:~/Projects/rhino1_6R4 edsu$ java -jar js.jar
    Rhino 1.6 release 4 2006 09 09
    js> print("hello world");
    hello world
    js>

Pretty sweet :-)



rsinger++

So Ross beat out 11 other projects to win the OCLC Research Software Contest for his next-generation OpenURL resolver, umlaut. Second place went to Jesse Andrews’ BookBurro, so the competition was fierce this year, much more so than last year when there were 4 contestants.

Those of us who hang out in #code4lib got to hear about this project when it was just a glimmer in his eye…and had front row seats as the development progressed. Essentially umlaut is an OpenURL router that’s able to consult online catalogs (via SRU), other OpenURL resolvers (such as SFX), Amazon, Google, Yahoo, Connotea, CiteULike and OAI-PMH repositories. It’s all written in Ruby and RubyOnRails.

I feel particularly proud because Ross is enough of a mad genius to have found a use for some Ruby gems I wrote for doing SRU, OAI-PMH and querying OCLC’s xISBN service.

Speaking of which, we’ve been collaborating recently on a little Ruby gem for querying OCLC’s OpenURL Resolver Registry. The registry makes it easy to determine the appropriate OpenURL resolver for a given IP address, so you could theoretically rewrite your fulltext URLs so that they were geospatially aware. For example:

require 'resolver_registry'
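The rest of the example might go something like this (a rough sketch only; the client class and accessor names are my assumptions about the gem’s interface, and the IP address is just a placeholder):

    # hypothetical interface -- check the gem's own docs for the real names
    client = ResolverRegistry::Client.new
    institution = client.lookup('141.211.1.1')

    # use the institution's resolver as the base for rewriting fulltext links
    puts institution.resolver.base_url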


a funny way to make a living

Gabe discovered that the code4lib.org Drupal instance was littered with comment spam. Someone had actually registered for an account and proceeded to add comments to virtually every story.

Since there was an email address associated with the account I figured I’d send an email letting them know their account was going to be zapped.

From: edsu
To: evgeniy1985@breezein.net
Subject: code4lib.org spam


ruby-oai v0.0.3

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.

After committing the branch he was working on, I ran some performance tests of my own: a vanilla ListRecords request against DSpace, EPrints and American Memory OAI-PMH servers, using both the REXML (default) and libxml backend parsers. Here are the results:

    server            parser    real            user            sys
    dspace            rexml     0m3.632s        0m2.008s        0m0.044s
                      libxml    0m1.900s        0m0.212s        0m0.032s
                      (saved)   1.732s (+48%)   1.796s (+89%)   0.012s (+27%)

    eprints           rexml     0m19.807s       0m1.984s        0m0.036s
                      libxml    0m19.344s       0m0.236s        0m0.024s
                      (saved)   0.463s (+2%)    1.748s (+88%)   0.012s (+33%)

    american-memory   rexml     0m12.991s       0m5.424s        0m0.052s
                      libxml    0m7.420s        0m0.324s        0m0.032s
                      (saved)   5.571s (+43%)   5.104s (+94%)   0.02s (+38%)

Those percentage values are the time savings from using libxml. Thanks Terry :-)
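For what it’s worth, the requests in the timings above were nothing fancier than a plain ListRecords call. A rough sketch of that with ruby-oai might look like the following; the base URL is a placeholder and the :parser option is my assumption about how the libxml backend gets selected, so check the gem’s docs for the exact names:

    require 'oai'

    # plain ListRecords harvest against a placeholder OAI-PMH server;
    # the :parser option is assumed to switch on the libxml backend
    client = OAI::Client.new('http://example.org/oai', :parser => 'libxml')

    client.list_records(:metadata_prefix => 'oai_dc').each do |record|
      puts record.header.identifier
    end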


the importance of making packages

If you are interested in such things, Ian Bicking has a nice posting about why breaking a project up into smaller packages of functionality is important. His key point is that the boundaries between packages actually help in establishing and maintaining decoupled modules within your application.

…when someone claims their framework is all spiffy and decoupled, but they just don’t care to package it as separate pieces… I become quite suspicious. Packaging doesn’t fix everything. And it can introduce real problems if you split your packages the wrong way. But doing it right is a real sign of a framework that wants to become a library, and that’s a sign of Something I’d Like To Use.

So why is decoupling important? Creating distinct modules of code with prescribed interfaces helps ensure that a change inside one module doesn’t have a huge ripple effect across the entire project codebase.

In addition to using packaging to create boundaries between components, the Law of Demeter is a pretty handy technique for reducing coupling in object oriented systems. It amounts to ensuring that a given method only invokes methods on objects that are: the object itself, its parameters, objects it creates, or its component objects (there’s a small Ruby sketch of this after the quote below). The LoD seems to be a good practice at the local level, while packaging helps at the macro/design level.

One of the most powerful and fun parts of packaging is coming up with good names and metaphors for your packages and components. Having fun and meaningful names for packages provides coherence to a project, and gives developers a way to talk about an application. Eric Evans has some nice chapters in Domain Driven Design about coming up with what he calls a domain language:

To create a supple, knowledge-rich design calls for a versatile, shared team language, and a lively experimentation with language that seldom happens on software projects.

It’s important…and naming distinct packages well helps build a good domain language.
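Back to the Law of Demeter for a moment, here’s the sketch promised above: a minimal, made-up Ruby example (none of these classes come from any project mentioned here) contrasting a method that reaches through another object’s internals with one that only talks to its immediate collaborators.

    class Address
      attr_reader :zip
      def initialize(zip)
        @zip = zip
      end
    end

    class Customer
      def initialize(address)
        @address = address
      end

      # expose just what callers need instead of handing out internals
      def zip
        @address.zip
      end
    end

    class Invoice
      def initialize(customer)
        @customer = customer
      end

      # a Demeter violation would reach through the customer:
      #   @customer.address.zip
      # this version only talks to its own component object:
      def shipping_zip
        @customer.zip
      end
    end

    invoice = Invoice.new(Customer.new(Address.new('48109')))
    puts invoice.shipping_zip    # => 48109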

I suppose it’s implicit in making something a code library–but one of the other major benefits of splitting a larger project up into smaller packages is that you encourage reuse. The bit of functionality that you decided to bundle up separately can be used as a dependency in a different project–perhaps even by a different person or organization. This seems to me to be a hallmark of good open source software.

Most popular languages these days have established ways of making packages available, downloadable and installable while expressing the dependencies between them. Perl has CPAN, PHP has PEAR, Ruby has gems and RubyForge, Python has eggs and EasyInstall, Java has Maven, Lisp has ASDF. Even some applications like Trac, RubyOnRails and Drupal encourage the creation of smaller packages (modules or plugins) by having a well-defined API for adding extensions. And that’s not even getting into the various ways operating systems make packages available…
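In Ruby’s case those dependencies get spelled out in a gemspec. Here is a minimal, entirely hypothetical one (the gem name, version and files are made up) just to show where the dependency graph lives:

    # hypothetical gemspec, only to illustrate declaring a dependency
    Gem::Specification.new do |spec|
      spec.name    = 'journal-browser'
      spec.version = '0.1.0'
      spec.summary = 'example of expressing a dependency between packages'
      spec.authors = ['example']
      spec.files   = Dir['lib/**/*.rb']

      # installers read this metadata and resolve the dependency automatically
      spec.add_dependency 'oai', '>= 0.0.3'
    end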

The truly hard part about packaging, for me, isn’t really technical. Most packaging solutions allow you to manage dependencies, versioning, installation and removal. As Ian says, it’s the decision of where to draw the lines between packages that is hard. It’s hard because you have to guess before you start coding, and often during the process of coding you realize that the dividing lines between packages begin to blur. This is why having distinct packages is so important: you are forced to stare at the blurriness and are encouraged to fix it…instead of creating the infamous big ball of mud.

An interesting counterpoint to trying to figure out the dividing lines beforehand is to design from the outside in and extract reusable components from the result. The very successful RubyOnRails web framework was extracted from a working application (Basecamp). In a lot of ways I think Test Driven Design encourages this sort of outside-in thinking as well. Extracting usable components from a ball of mud is nigh impossible though…at least for me. I would be interested to know how many of the Rails components were anticipated by the designers as they were creating Basecamp. It takes a great deal of discipline and jazz-like anticipation to be able to improvise a good design. That, or you have to build in time to prototype something with an eye to taking what you’ve learned and doing it right.


open standards

Folks who are interested in libraries and technology are often drawn to the issue of open standards. Using open standards is very important to libraries for a variety of reasons that Ed Corrado summarizes nicely.

This week my podcast reader picked up an excellent interview with Danese Cooper of the Open Source Initiative, where she talks about the Open Standard Requirement that was introduced a few months ago. It provides a new perspective on the same issue from outside the library community.

Essentially the OSR amounts to 5 guidelines for identifying a truly open standard. These guidelines are different, though, because they focus on what makes a standard open for an implementor. Whether or not the standard was created by an open process is really out of scope; the important thing is how easy it is for a software developer to write software that uses the standard. A nice feature of the OSR is that the guidelines would fit on an index card. Here’s my regurgitation of them:

  1. The spec can’t omit details needed for implementation
  2. The standard needs to be freely/publicly available
  3. All patents involved in the spec need to be royalty free
  4. Clicking through a license agreement is not necessary
  5. The spec can’t be dependent on a standard that is not open as well

Danese was quick to point out that these are simply guidelines and not rules. For example, Unicode fails on 2, since you have to pay for a copy of the spec. But in this case printing the standard is a publishing feat, given the sheer number of glyphs involved, so it’s not unusual that the book would cost money. This guideline could be waived if the OSI folks agreed.

Rather than the OSI going out and applying these rules to all known standards, the idea is that standards bodies could claim self-compliance, and as developers implement the standard that compliance would be verified.

The guidelines themselves and the process are still being fine-tuned and hammered on, and they are looking for volunteers…