code4lib 2006

Some #code4lib regulars (who also help put on Access up north) have managed to get some space at Oregon State University in February for code4lib 2006:

code4lib 2006 is a loosely structured conference for library technologists to commune, gather/create/share ideas and software, be inspired, and forge collaborations. It is also an outgrowth of the Access HackFest, wrapped into a conference-ish format. It is the event for technologists building digital libraries and digital information systems, tools, and software.

A call for proposals is out. The nice thing about this conference is that there will be different levels of involvement: from keynote speakers, to shorter presentations, to lightning talks, and with space/time to actually hack at stuff/brainstorm with colleagues.

We’re hoping to attract both library professionals who use computers, and computer professionals who have an interest in libraries. The registration is now open as well at a discounted price. If you are interested in computers and libraries please submit a proposal or register to attend!

BBC Catalogue's Search

I did end up hearing back from Matt Biddulph about the search technology that he’s using with RubyOnRails to build the BBC Programme Catalogue.

The core of the search is nothing more than mysql 4.1’s fulltext indexer. I used to think very poorly of it until I discovered how to turn off its automatic stoplist and minimum indexable word length, and started using its boolean mode. Having the database manage the indexing without having to keep a separate index in sync is very valuable, and of course it’s portable to any client language.

The nice thing with a dataset the size and quality of the BBC’s is that you’re not solely dependent on the quality of the freetext indexer. I’ve done a little statistical analysis on the data to help with scoring the results. For example, programme contributors can be ranked according to how many shows they’ve contributed to, and commonly co-occurring contributors can be easily calculated with a bit of overnight batch processing. This kind of stuff contributes to a pretty good set of search results.
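To make the contributor statistics a bit more concrete, here is a minimal Python sketch of the two counts Matt describes: how many programmes each contributor appears in, and how often pairs of contributors appear together. The sample data and function names are made up for illustration; I have no idea what the BBC batch jobs actually look like.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: programme title -> credited contributors.
programmes = {
    "Programme A": ["Alice", "Bob"],
    "Programme B": ["Alice", "Bob", "Carol"],
    "Programme C": ["Alice"],
}


def contributor_counts(programmes):
    """Count how many programmes each contributor appears in."""
    counts = Counter()
    for people in programmes.values():
        counts.update(set(people))  # set() so a repeated credit counts once
    return counts


def co_occurrence_counts(programmes):
    """Count how often each pair of contributors shares a programme."""
    pairs = Counter()
    for people in programmes.values():
        # sorted() gives every pair a stable (a, b) ordering
        pairs.update(combinations(sorted(set(people)), 2))
    return pairs


ranked = contributor_counts(programmes).most_common()
pairs = co_occurrence_counts(programmes)
```

Counts like these could then be stored next to each record and folded into the score at query time, which is presumably where the overnight batch processing comes in.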

Given the visibility of the BBC Catalogue and that it has nearly a million records this says good things to me about the scalability of MySQL’s fulltext search. I’ll definitely consider it along with Ferret for Rails experiments that need search functionality.
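For my own notes, here is a rough Python sketch of what a boolean mode query looks like. The table and column names are hypothetical, and I'm assuming a driver that uses %s placeholders (as MySQLdb does). The stoplist and minimum word length that Matt mentions are, as far as I can tell, the ft_stopword_file and ft_min_word_len server settings, which live in the server config rather than in the query.

```python
def boolean_fulltext_query(table, columns, required=(), excluded=()):
    """Build a parameterized MySQL boolean-mode fulltext query.

    Returns (sql, params). The table and column names are interpolated
    directly, so they must come from trusted code, never from user input.
    """
    # In boolean mode a leading + means "must contain", - means "must not".
    terms = ["+" + term for term in required] + ["-" + term for term in excluded]
    expression = " ".join(terms)
    cols = ", ".join(columns)
    sql = (
        f"SELECT *, MATCH ({cols}) AGAINST (%s IN BOOLEAN MODE) AS score "
        f"FROM {table} "
        f"WHERE MATCH ({cols}) AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC"
    )
    # The expression appears twice: once for scoring, once for filtering.
    return sql, (expression, expression)


# Hypothetical table and columns, purely for illustration.
sql, params = boolean_fulltext_query(
    "programmes", ["title", "synopsis"], required=["radio"], excluded=["repeat"]
)
```

The resulting sql and params would be handed to a cursor's execute() call, and the columns need a FULLTEXT index covering them for MATCH to work.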

buddhism and spimes

The Dalai Lama has an op-ed on science and faith in yesterday’s New York Times. There are some delightful descriptions of his encounters with science as a child, which I imagine are excerpts from his recent book. I also like how he intertwingles religion and science–not making one higher up in a hierarchy.

If science proves some belief of Buddhism wrong, then Buddhism will have to change. In my view, science and Buddhism share a search for the truth and for understanding reality. By learning from science about aspects of reality where its understanding may be more advanced, I believe that Buddhism enriches its own worldview.

And the converse:

Just as the world of business has been paying renewed attention to ethics, the world of science would benefit from more deeply considering the implications of its own work. Scientists should be more than merely technically adept; they should be mindful of their own motivation and the larger goal of what they do: the betterment of humanity.

The impact of science and our way of life on our environment is something I’ve been reading about in Bruce Sterling’s Shaping Things. I haven’t finished it yet but the essential message so far is that we need to design objects in our environment so that they can reveal information about how they fit into the environment. This information amounts to links to databases that can track the history of the object, how to get customer support, history of ownership, manufacturing origins, internal components, details on customizing and interfacing, etc. Sterling calls these objects spimes and if you are interested his speech at SIGGRAPH has more details.

I’m not entirely sure why I’m mentioning Spimes, Buddhism and Ted Nelson in the same breath. I suppose all three focus the attention on just how deeply interconnected we all are with each other and with the world around us. Sometimes these interconnections can be overwhelming. Meditating on this interconnectedness, and building tools to manage the connections responsibly, are two worthwhile things I’d like to work on.

jython niceties

While playing around with the Java JDOM library, I found myself resorting to jython to experiment with the API. It’s just so much easier this way for me:

#!/usr/bin/env jython

search @ delicious and the bbc

I just noticed that delicious now has full, fast search across all content (not just your own bookmarks). This is something that Dan’s unalog has had on delicious for a while (apart from the delightful content). Dan uses pylucene as his search engine, which still has some interesting features. It’s pretty wild being able to search across all the delicious content, given their volume.

When delicious was really ramping up I saw the occasional mason error page, so I know that they are (or were) using Perl. This makes me really curious to know what search technology they are using…but I couldn’t find any details in the announcement.

Likewise, the news about the BBC Programme Catalogue being built with RubyOnRails has me curious. I’ve really come to appreciate Lucene and PyLucene and am in search of similar search tools for Ruby. I’ve got an email out to Matt Biddulph to see if he can provide any details about the BBC effort.

access2005 presentations

Unfortunately I wasn’t able to make it to Access this year where lots of library developer types I respect and learn from were presenting and hacking. Fortunately the audio and slides are now available. Combined with the collected blogging and snippets in irc I almost feel like I was there…but I imagine the real brain storming and fun happened outside of these artifacts. Inspiring stuff, and highly recommended if you’re into writing software for libraries/archives.


Last night I took the train into ThoughtWorks to check out the Chicago Area Ruby Group meeting. There wasn’t a planned talk, so I wasn’t sure what to expect (apart from getting a chance to chat with Jason and Chris). One thing I definitely didn’t expect was seeing close to 30 people there.

The room we met in was kind of an atrium type of space. Everyone arranged themselves into a circle, and at the center of the room there was a smaller table with 6 chairs. After everyone went round and introduced themselves the function of the central table was revealed by Joe O’Brien, who got everyone to play this discussion game called fishbowl. Basically there are 6 seats, and any 5 people can sit in them at any time (always leaving one chair open). People start talking about stuff, and if at any point someone wants to join the discussion then they sit down in the empty seat, and someone who no longer wants to talk can get up and leave. At all times the 5 seats needed to stay filled.

This fishbowl actually worked out really well. The conversation ranged from ruby’s performance, to comparisons with java, rails, the community, the “Ruby Way”, joy, and practical examples of Ruby in the workplace. All in all it was a very pleasant meeting, and it was really interesting to see a good cross section of Chicago technology in an informal environment. Afterwards Jason, Chris and I went and had a few drinks at a nearby bar with Sam Stephenson and Marcel Molina of 37Signals. Sam and Marcel are both core developers for the Rails project, and recently moved to Chicago to join 37Signals. Sam is the developer behind prototype which I’ve been meaning to learn more about. Hopefully Sam can be coerced into doing a prototype talk at some point.

It’s been interesting watching local perl, python, ruby and java groups and how they regard each other. Chris mentioned that it’s unfortunate when discussion borders on digging at the other guy, and that the real thing that unites these groups is that everyone enjoys programming, and works on stuff on their own time. If the focus could be brought to that level I think there could very well be occasional cross language meetings. I mentioned to John Long that perhaps a meeting about javascript could bring folks from other languages together. He had a great idea of having someone like Sam talk about javascript, and then break off into smaller groups that talked about integration with various languages. Anyhow it was well worth the train ride in. Thanks for putting up with me being away for an evening Kesa :-)

a citation microformat - when worlds collide

Tim White has taken the time to prod the microformats list about the citation microformat that’s been floating around for a few months. It’s really encouraging that a developer at Gale is thinking of using a citation microformat. While I also work in the industry I’ve been coming at the citation microformat from a slightly different angle. For the past few months I’ve been monitoring activity in microformat land while watching another group of library technologists. Recently, Bill Burcham’s Baby Steps to Synergistic Web Apps and Half a Baby Step confirmed a nagging feeling I was having that the two communities were converging.

The “other” group are library/programmer friends of mine in #code4lib. These guys have been brainstorming about adapting the widely used OpenURL for use in HTML. OpenURL is used extensively in the academic library environment to enable linking to licensed content from online indexes. OpenURL essentially provides guidelines for encoding citation metadata in URLs, which has given birth to an ecosystem of vendors/developers who can provide resolver and content services. Context Object in Spans (COinS) provides a microformatty way to put openurls (without reference to an openurl resolver) into HTML. I’m not doing this work justice, so if you’re curious to see how COinS got started there’s lots of content in the gather-create-share discussion list. COinS exists in the wild at citeulike, hubmed, Current Law Journal Content.
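To give a flavor of what a COinS looks like, here is a small Python sketch that packs citation metadata into a query string and hides it in the title attribute of an empty span with the class Z3988. The citation below is invented, and the helper name is my own; the key names follow the OpenURL journal format as I understand it, so treat this as an illustration rather than a reference implementation.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(metadata):
    """Return a COinS-style span for a journal article citation.

    `metadata` maps journal-format keys (atitle, jtitle, ...) to values.
    """
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    # Referent metadata keys are prefixed with "rft."
    fields += [("rft." + key, value) for key, value in metadata.items()]
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # ampersands are safe inside an attribute value.
    title = escape(urlencode(fields), quote=True)
    return f'<span class="Z3988" title="{title}"></span>'


# Invented citation, purely for illustration.
span = coins_span({
    "atitle": "An Example Article",
    "jtitle": "Journal of Examples",
    "date": "2005",
})
```

Note that there is no resolver address anywhere in the span: a browser extension or script on the reader's side supplies that, which is what lets the same markup work for readers at different institutions.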

Now after reading up about microformats, posting to the discussion list, and talking to Brian Suda, it became clear that COinS as it stands now isn’t really usable as a microformat. Microformats center around marking up human readable data with semantic HTML, whereas COinS hides citation data encoded as a query string in HTML. However it is possible to encode OpenURLs as XML, so there’s still hope I suppose. I want to sketch out what this could look like for the microformat wiki.

Before Tim’s post I’d never even heard of the Standard Format for Downloading Bibliographic Records z39.80. While it’s only a draft it’s used by Gale for providing downloadable citations, and can be imported by RefWorks and most likely others. It bears a lot of resemblance to other citation formats that I’ve come across, but is obviously pre XML. The microformats brainstorming that Brian has done has centered around DublinCore, BibTeX and MODS. At the moment I’m thinking BibTeX, Z39.80 and OpenURL stand the best chance of working. Honestly I think we could debate formats till the cows come home (and have left their cow paths ;-), but what microformats needs is some workable solution, like semantic HTML for OpenURL or Z39.80, with some examples out there and people using it while there’s momentum. It feels like there’s a swell here and a wave to ride.

quite a patch

Since starting to use lucene heavily at work about a year ago I’ve been watching the lucene list out of the corner of my eye for tips and tricks. Today I saw an email go by that referenced a recent patch that lazily creates SegmentMergeInfo.docMap objects. I guess the point isn’t so much what the object is, but the mere change in lazily creating the object yielded some pretty impressive performance gains:

Performance Results: A simple single field index with 555,555 documents, and 1000 random deletions was queried 1000 times with a PrefixQuery matching a single document.

Performance Before Patch:
indexing time = 121,656 ms
querying time = 58,812 ms

Performance After Patch:
indexing time = 121,000 ms
querying time = 598 ms

A 100 fold increase in query performance!

Umm, 100 fold increase in performance. That’s quite a patch!
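I haven't dug into the patch itself, so the following is only a Python sketch of the general idea of lazy creation, not Lucene's actual code. The class and attribute names are invented, and I'm guessing that the map in question renumbers documents around deleted ones. The point is just that an expensive per-segment structure gets built on first use rather than up front.

```python
from functools import cached_property


class SegmentInfo:
    """Hypothetical stand-in for a per-segment object (not Lucene's class)."""

    builds = 0  # how many times a doc map has actually been built

    def __init__(self, deleted_flags):
        self.deleted_flags = deleted_flags

    @cached_property
    def doc_map(self):
        """Map old document numbers to new ones, -1 for deleted documents.

        Built on first access only; later accesses reuse the cached list.
        """
        SegmentInfo.builds += 1
        mapping, next_id = [], 0
        for deleted in self.deleted_flags:
            if deleted:
                mapping.append(-1)
            else:
                mapping.append(next_id)
                next_id += 1
        return mapping


# Creating the object does no mapping work at all.
segment = SegmentInfo([False, True, False])
```

A query that never touches doc_map never pays for it, which would fit the numbers above: indexing time barely moves while querying time drops dramatically.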


If you ever need to do z39.50 from ruby and have successfully built and installed ruby-zoom only to see:

   biblio:~ ed$ irb
   irb(main):001:0> require 'zoom'
   dyld: NSLinkModule() error
   dyld: Symbol not found: _ZOOM_connection_search
     Referenced from: /usr/lib/ruby/site_ruby/1.8/powerpc-darwin8.0/zoom.bundle
     Expected in: flat namespace

or a similar error about missing symbols…never fear! The YAZ toolkit doesn’t build a shared library by default. It’s confusing because the ruby-zoom package builds fine against the header files alone. When building YAZ you’ll need to:

biblio:/usr/src/yaz-2.1.8 ed$ ./configure --enable-shared

Submitted here to help similar users who are flailing wildly in Google.