Last night I took the train into ThoughtWorks to check out the Chicago Area Ruby Group meeting. There wasn’t a planned talk, so I wasn’t sure what to expect (apart from getting a chance to chat with Jason and Chris). One thing I definitely didn’t expect was seeing close to 30 people there.

The room we met in was kind of an atrium space. Everyone arranged themselves into a circle, and at the center of the room there was a smaller table with 6 chairs. After everyone went round and introduced themselves, the function of the central table was revealed by Joe O’Brien, who got everyone to play a discussion game called fishbowl. Basically there are 6 seats, and any 5 people can sit in them at any time (always leaving one chair open). People start talking about stuff, and if at any point someone wants to join the discussion they sit down in the empty seat, and someone who no longer wants to talk can get up and leave. At all times 5 of the 6 seats need to stay filled.

This fishbowl actually worked out really well. The conversation ranged from ruby’s performance to comparisons with java, rails, the community, the “Ruby Way”, joy, and practical examples of Ruby in the workplace. All in all it was a very pleasant meeting, and it was really interesting to see a good cross section of Chicago technology in an informal environment. Afterwards Jason, Chris and I went and had a few drinks at a nearby bar with Sam Stephenson and Marcel Molina of 37Signals. Sam and Marcel are both core developers on the Rails project, and recently moved to Chicago to join 37Signals. Sam is the developer behind prototype, which I’ve been meaning to learn more about. Hopefully Sam can be coerced into doing a prototype talk at some point.

It’s been interesting watching the local perl, python, ruby and java groups and how they regard each other. Chris mentioned that it’s unfortunate when discussion borders on digging at the other guy, and that the real thing that unites these groups is that everyone enjoys programming, and works on stuff on their own time. If the focus could be brought to that level I think there could very well be occasional cross-language meetings. I mentioned to John Long that perhaps a meeting about javascript could bring folks from other languages together. He had a great idea: have someone like Sam talk about javascript, and then break off into smaller groups to talk about integration with various languages. Anyhow, it was well worth the train ride in. Thanks for putting up with me being away for an evening Kesa :-)

a citation microformat - when worlds collide

Tim White has taken the time to prod the microformats list about the citation microformat that’s been floating around for a few months. It’s really encouraging that a developer at Gale is thinking of using a citation microformat. While I also work in the industry I’ve been coming at the citation microformat from a slightly different angle. For the past few months I’ve been monitoring activity in microformat land while watching another group of library technologists. Recently, Bill Burcham’s Baby Steps to Synergistic Web Apps and Half a Baby Step confirmed a nagging feeling I was having that the two communities were converging.

The “other” group are library/programmer friends of mine in #code4lib. These guys have been brainstorming about adapting the widely used OpenURL for use in HTML. OpenURL is used extensively in the academic library environment to enable linking to licensed content from online indexes. OpenURL essentially provides guidelines for encoding citation metadata in URLs, which has given birth to an ecosystem of vendors/developers who provide resolver and content services. Context Object in Spans (COinS) provides a microformatty way to put OpenURLs (without reference to an OpenURL resolver) into HTML. I’m not doing this work justice, so if you’re curious to see how COinS got started there’s lots of content in the gather-create-share discussion list. COinS exists in the wild at citeulike, hubmed, and Current Law Journal Content.
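To make the trick concrete, here’s a little ruby sketch of building a COinS span as I understand the spec: the citation is encoded as OpenURL key/value pairs and stuffed into the title attribute of an empty span with class Z3988. The citation values below are just for illustration.

```ruby
require 'cgi'

# Build a COinS span: the citation metadata is packed into the title
# attribute as an OpenURL ContextObject query string, invisible to a
# human reader of the page.
def coins_span(metadata)
  pairs = { 'ctx_ver'     => 'Z39.88-2004',
            'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal' }.merge(metadata)
  # URL-escape each key and value; join with &amp; so the attribute
  # stays valid HTML.
  kev = pairs.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join('&amp;')
  %Q{<span class="Z3988" title="#{kev}"></span>}
end

# Illustrative citation values only.
span = coins_span('rft.atitle' => 'Baby Steps to Synergistic Web Apps',
                  'rft.au'     => 'Bill Burcham')
puts span
```

The human-readable citation still has to appear elsewhere on the page, which is exactly the point of contention below.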

After reading up on microformats, posting to the discussion list, and talking to Brian Suda, it became clear that COinS as it stands isn’t really usable as a microformat. Microformats center around marking up human readable data with semantic HTML, whereas COinS hides citation data encoded as a query string in HTML. However, it is possible to encode OpenURLs as XML, so there’s still hope I suppose. I want to sketch out what this could look like for the microformats wiki.

Before Tim’s post I’d never even heard of the Standard Format for Downloading Bibliographic Records, Z39.80. While it’s only a draft, it’s used by Gale for providing downloadable citations, and can be imported by RefWorks and most likely others. It bears a lot of resemblance to other citation formats I’ve come across, but is obviously pre-XML. The microformats brainstorming that Brian has done has centered around DublinCore, BibTeX, and MODS. At the moment I’m thinking BibTeX, Z39.80 and OpenURL stand the best chance of working. Honestly I think we could debate formats till the cows come home (and have left their cow paths ;-), but what microformats needs is a workable solution, like semantic HTML for OpenURL or Z39.80, with some examples out there and people using it while there’s momentum. It feels like there’s a swell here and a wave to ride.

quite a patch

Since starting to use lucene heavily at work about a year ago I’ve been watching the lucene list out of the corner of my eye for tips and tricks. Today I saw an email go by that referenced a recent patch that lazily creates SegmentMergeInfo.docMap objects. The point isn’t so much what the object is, but that the simple change to creating it lazily yielded some pretty impressive performance gains:

Performance Results: A simple single field index with 555,555 documents and 1,000 random deletions was queried 1,000 times with a PrefixQuery matching a single document.

Before patch: indexing time = 121,656 ms; querying time = 58,812 ms
After patch: indexing time = 121,000 ms; querying time = 598 ms

A 100 fold increase in query performance!

Umm, 100 fold increase in performance. That’s quite a patch!
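The patch itself is java deep inside lucene, but the lazy-creation idea is easy to sketch. Here’s what it looks like in ruby; the class below is my own toy illustration, not lucene’s actual code. docMap remaps document numbers around deletions, and building it for a big segment is expensive, so you only build it the first time someone asks:

```ruby
# Toy illustration of lazy initialization. The doc map (analogous to
# SegmentMergeInfo.docMap) renumbers documents so that deleted ones
# leave no gaps; queries that never need it never pay for building it.
class SegmentInfo
  def initialize(doc_count, deleted)
    @doc_count = doc_count
    @deleted   = deleted
    @doc_map   = nil          # not built yet
  end

  # ||= memoizes: the map is built on first access and reused after.
  def doc_map
    @doc_map ||= build_doc_map
  end

  private

  def build_doc_map
    map = []
    new_num = 0
    @doc_count.times do |i|
      if @deleted.include?(i)
        map << nil            # deleted docs get no new number
      else
        map << new_num
        new_num += 1
      end
    end
    map
  end
end

info = SegmentInfo.new(5, [1, 3])
p info.doc_map   # => [0, nil, 1, nil, 2]
```

The before/after numbers in the email suggest most queries were paying for a map they never used; deferring construction makes the common case nearly free.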


If you ever need to do z39.50 from ruby and have successfully built and installed ruby-zoom only to see:

biblio:~ ed$ irb
irb(main):001:0> require 'zoom'
dyld: NSLinkModule() error
dyld: Symbol not found: _ZOOM_connection_search
  Referenced from: /usr/lib/ruby/site_ruby/1.8/powerpc-darwin8.0/zoom.bundle
  Expected in: flat namespace

or a similar error about missing symbols…never fear! The YAZ toolkit doesn’t build a shared library by default. It’s confusing because ruby-zoom builds fine against just the YAZ header files. When building YAZ you’ll need to:

biblio:/usr/src/yaz-2.1.8 ed$ ./configure --enable-shared

Submitted here to help similar users who are flailing wildly in Google.

geocoder and rdf

While fielding a question on a local Perl list this weekend I ran across some more RDF alive and kicking in the very useful geocoder service. It offers a nice RESTful interface, which allows you to drop an address or intersection into a URL like:

and get back the longitude and latitude in a chunk of RDF like:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <geo:Point rdf:nodeID="aid87293465">
    <dc:description>899 Ridgeview Dr, McHenry IL 60050</dc:description>
    <geo:lat>42.314936</geo:lat>
    <geo:long>-88.291658</geo:long>
  </geo:Point>
</rdf:RDF>


Of course this data could be encoded as comma-separated values; in fact they have a similar RESTful service that does just that:

which returns:

42.314936,-88.291658,899 Ridgeview Dr,McHenry,IL,60050
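Unpacking that CSV response takes a couple lines of ruby (field order taken from the example above):

```ruby
# Geocoder's CSV response: latitude, longitude, street, city, state, zip.
response = "42.314936,-88.291658,899 Ridgeview Dr,McHenry,IL,60050"

# split with a limit of 6 so we always get exactly six fields in the
# order shown above.
lat, long, street, city, state, zip = response.split(',', 6)

puts lat.to_f    # 42.314936
puts street      # 899 Ridgeview Dr
```

Which illustrates the tradeoff nicely: the CSV is trivially parsed by position, while the RDF tells you explicitly what each value means.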

Does this mean RDF isn’t necessary? For someone who is just querying directly and knows what the output is I guess the RDF doesn’t really add that much value. My coworker Bill likes to talk about being explicit in code whenever possible, and the RDF in this case is more explicit. Until there are programs that follow lines of inference using this data it’s largely a matter of taste. It’s nice that geocoder supports both world views.

And hats off to geocoder: they give away their software and how they built the service to anyone who wants it. They provide expertise in using the data, and also offer commercial access to their web services which have the 10 second or so pause between requests disabled. What an interesting model for a company. Heck, wouldn’t it be nice if OCLC operated this way?

On Lateral Thinking

I recently checked out Zen and the Art of Motorcycle Maintenance after reading Kevin’s piece about how the book informed his practice of library cataloging. I am enjoying it a lot more this time around, and have found it really informs my practice of computer programming as well. Unfortunately I only made it halfway through before it needed to be returned to the library, and the local superbookstores oddly enough don’t seem to carry it…So, I’ve got a copy on order from a used bookstore I found through Amazon. Anyhow, here’s one nice quote I jotted down before I had to return the book:

At first the truths Phaedrus began to pursue were lateral truths; no longer the frontal truths of science, those toward which the discipline pointed, but the kind of truth you see laterally, out of the corner of your eye. In a laboratory situation, when your whole procedure goes haywire, when everything goes wrong or is indeterminate or is so screwed up by unexpected results you can’t make head or tail out of anything, you start looking laterally. That’s a word he later used to describe a growth of knowledge that doesn’t move forward like an arrow in flight, but expands sideways, like an arrow enlarging in flight, or like the archer, discovering that although he has hit the bull’s eye and won the prize, his head is on a pillow and the sun is coming in the window. Lateral knowledge is knowledge that’s from a wholly unexpected direction, from a direction that’s not even understood as a direction until the knowledge forces itself upon one. Lateral truths point to the falseness of axioms and postulates underlying one’s existing system of getting at truth.

I’m not entirely sure why this resonated with me. I think the idea of “lateral thinking” reminds me of how IRC and web surfing often informs my craft of writing software. While many universities offer computer “science” programs, I’ve found a large component of writing software is more artistic than scientific. Of course I’m hardly the first person to comment on this…but Zen and the Art of Motorcycle Maintenance is full of good advice for writing and tuning your programs. Hopefully I’ll get to write more about them in here when I get my copy in the mail.

trackbacks at arXiv

I just read (thanks jeff) about how arXiv has implemented experimental trackback support. Essentially this allows researchers who maintain online journals to simply reference an abstract like File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files (a great article by the way) and arXiv will receive a trackback ping that lets them know someone referenced the abstract. If you’ve followed this so far you might be wondering how blogging software (wordpress, movabletype, blosxom, etc) figures out where to ping. Take a look in the source code for the arXiv abstract and you’ll see a chunk of RDF:
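I won’t reproduce arXiv’s RDF here, but trackback autodiscovery conventionally embeds an rdf:Description whose trackback:ping attribute carries the ping URL, and blogging tools usually just regex it out of the page rather than parse the RDF properly. A quick ruby sketch, with made-up URLs:

```ruby
# A hypothetical example of the kind of trackback autodiscovery RDF
# embedded in a page (the URLs here are invented for illustration).
html = <<-EOF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <rdf:Description rdf:about="http://example.org/abs/cs.DL/0503016"
    trackback:ping="http://example.org/trackback/cs.DL/0503016"/>
</rdf:RDF>
EOF

# Pull the ping URL out with a regex, the way most blogging software does.
ping = html[/trackback:ping="([^"]+)"/, 1]
puts ping   # http://example.org/trackback/cs.DL/0503016
```

Once the software has the ping URL it just POSTs the title, excerpt, and permalink of the referencing entry to it.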

delicious json

I just noticed over on the delicious blog that my data is available as JavaScript Object Notation (JSON) by following a simple URL like:

Essentially you just load the URL as javascript source in your HTML:

<script type="text/javascript" src=""></script>

and voila, you’ve magically got a new javascript array variable, Delicious.posts, each element of which is a hash describing one of your links on delicious. It’s a very elegant (and simple) technique…much more elegant than the approach taken in the XML::RSS::JavaScript module, which I helped create. It’s so elegant in fact that I got it working off to the side of this page in 2 minutes. I downloaded the python and ruby extensions for working with JSON just to take a look. The python version is a pleasant read, especially the unit tests! The ruby version is a lesson in minimalism:

jsonobj = eval(json.gsub(/(["'])\s*:\s*(['"0-9tfn\[{])/){"#{$1}=>#{$2}"})

Now, if I were to use this I’d probably put a wrapper around it :-) Although it’s less minimalistic I think I prefer the explicitness of the python code. I’ve been digging into Ruby a bit more lately as I work on ruby-marc, and while I’m really enjoying the language I tend to shy away from one line regex hacks like this…which more often than not turn out to be a pain to extend and maintain.
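For what it’s worth, the wrapper I have in mind would just hide the regex trick behind a single method, so there’s one place to fix things when the one-liner breaks. The module and method names here are mine, not from the ruby json extension:

```ruby
# A thin wrapper around the one-line eval hack above: same regex,
# hidden behind a readable entry point.
# WARNING: eval'ing data fetched over the network is risky; this is
# just a sketch.
module SimpleJSON
  # Rewrite JSON's  "key": value  pairs into ruby's  "key" => value
  # and eval the result into arrays/hashes.
  def self.parse(json)
    ruby = json.gsub(/(["'])\s*:\s*(['"0-9tfn\[{])/) { "#{$1}=>#{$2}" }
    eval(ruby)
  end
end

posts = SimpleJSON.parse('[{"u": "http://example.org/", "d": "a link"}]')
puts posts.first['u']   # http://example.org/
```

Same minimalism, but now a caller never has to know (or repeat) the regex.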

I first heard of JSON from Mike Rylander of the open-ils project, which is using JSON heavily in the open source library catalog it is developing for the state of Georgia. It is nice to see library technologists ahead of the curve.

Lockheed Martin and NARA

After 7 years of consultation, Lockheed Martin has been selected to build the Electronic Records Archives for the National Archives and Records Administration for 38 million dollars.

ERA will provide NARA with the capability to authentically preserve and provide access to any kind of electronic record free from dependence on any specific hardware or software.

There are some aging exploratory papers on the NARA site, along with what appears to be a copy of the RFP…but I can’t seem to find any specific information on how Lockheed Martin is planning to do this. I wonder what sort of track record L-M has in building electronic archiving software. Do they have an existing system which they are going to modify for NARA, or are they going to be building a new system from scratch? It sure would be interesting to hear some more details.

Open Documents

Have you ever had trouble importing one type of word processing document into your current word processor? Perhaps you’re using the same word processor, but are trying to import a document you created with an earlier version. Imagine for a moment what this will mean for a historian who is trying to research some correspondence fifty years from now. How about five hundred years from now? Are historians going to have to be computer hackers who have superhuman reverse engineering talents? Will there be mystical emulators that let you convert your modern computer into an insanely slow Pentium CPU running Windows 95 and Word 7? How will you even know what format a document is in?

There have been some inspiring developments in Massachusetts, which has decided to use the OpenDocument Format instead of Microsoft’s OpenXML. David Wheeler does a really nice job of summarizing what this means for open source development, and how Microsoft can choose to recover. I had no idea (but was not surprised to learn) that the royalty-free license Microsoft is using to distribute its “open” document format is incompatible with the popular GNU open source license. Ironically this seems to have been a calculated move by Microsoft to exclude open source developers from working with the open formats. Isn’t the whole point of an open document format to be open?

Thank goodness the folks in Massachusetts are on the ball, asking the right questions, and not simply following the money, power and status quo. At the same time they’re not exclusively endorsing the GPL, but have wisely decided to accommodate as many development environments as possible. This sounds like the best way to make a truly archivable document format that will be good for “long haul institutions” like libraries and archives. Hopefully other public organizations will consider taking a similar approach. Thanks Bruce for writing about this.