pybibutils

The #code4lib sprint is coming up soon and (alas) we still don’t really have a firm grasp on what we will be sprinting on. After PyCon, dchud had some ideas for a metadata wrangling framework for Python. Around the same time I was working on a SWIG wrapper for the bibutils library. So one idea we had was to create a Python utility that would enable converting between many of the popular metadata/citation formats.

Emerging details are available on the wiki. If you have any ideas for the sprint please note them on the wiki.


MARC::Record v2.0 RC1

Thanks to the support of Anne Highsmith at Texas A&M, MARC::Record v2.0 RC1 was released today to SourceForge. This new version of MARC::Record addresses the use of Unicode in MARC records. There has been a long-standing bug in MARC::Record which caused it to calculate record directories incorrectly when the records contained Unicode. This isn’t hitting CPAN yet, so that the people who want Unicode handling can take it for a test drive first. As noted previously, this Perl/Unicode stuff is pretty tricky since most of the time the encoding of a scalar variable is sort of hidden from view. I’d much prefer to be in a situation like Java’s, where all strings are Unicode (UTF-16 internally).


one billion

Thom Hickey mentioned a new page at OCLC which lists some real time stats for worldcat: total holdings, last record added, etc. Perhaps this is in honor of the total holdings getting very close to crossing the 1 billion mark.

So of course I had to add a plugin for panizzi to scrape the page. Rather than writing yet another state machine for parsing HTML, I decided to try out Fredrik Lundh’s ElementTree Tidy HTML Tree Builder, which works out very well when you want to walk a data structure representing possibly invalid HTML.

    from urllib import urlopen
    from elementtidy import TidyHTMLTreeBuilder
    url = "http://www.oclc.org/worldcat/grow.htm"
    tree = TidyHTMLTreeBuilder.parse(urlopen(url))

That’s all there is to getting a nice ElementTree object for a page of HTML, which you can then dig into.
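Digging into the tree might look something like the sketch below. It is only an illustration under a few assumptions: it uses the standard library’s ElementTree on a small made-up XHTML snippet (the real layout of the stats page isn’t shown here), and the `table_rows` helper is a hypothetical name. Tidied pages come back as XHTML, so the element tags carry the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Elements in a tidied (XHTML) document live in this namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Made-up stand-in for a tidied page; not the real page markup.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><td>Total holdings</td><td>123</td></tr>
      <tr><td>Last record added</td><td>Example title</td></tr>
    </table>
  </body>
</html>"""


def table_rows(root):
    """Return the text of every table row as a list of cell strings."""
    rows = []
    for tr in root.iter(XHTML + "tr"):
        cells = [(td.text or "").strip() for td in tr.findall(XHTML + "td")]
        rows.append(cells)
    return rows


root = ET.fromstring(SAMPLE)
# Each row here has exactly two cells, so the rows convert to a dict.
stats = dict(table_rows(root))
print(stats)
```

No state machine needed: once the page is a tree, pulling out the cells is a loop and a list comprehension.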

So, predictably:

10:53 < edsu> @worldcat
10:53 < panizzi> edsu: [May 16, 2005 11:49 AM EDT #981,277,234] 
                      El senor de los anillos. Tolkien, J. R. R. ... 
                      uploaded by OEL - EUGENE PUB LIBR


code4lib sprint

A bunch of #code4lib folks will be converging on Chicago this summer for the annual American Library Association conference. Several of us thought it would be fun to get together for a sprint on a project that has yet to be decided. A potential project is building a framework for metadata translation similar to bibutils or perhaps Cheshire. I worked on creating a bibutils wrapper for Python a few months ago, and decided it would be better to have a pure python framework instead. The wrapper worked ok, but only on particular platforms, and the API felt kludgy in that bibutils is oriented towards command line tools. There’s also some interest in having a discussion and possibly some hacking on mirroring OPACs that Art Rhyno and Ross Singer have been working on.

I called Chicago Public Library to reserve some of their rooms but they’re already all booked up. Fortunately the Lincoln Park Branch has a nice room (with wifi) which chipy used for their pypi sprint a few months ago…and I just reserved the space for the entire day of Friday June 24th, 2005. My friend Brian Ray from chipy kindly offered to stop by his local branch to fill out the paperwork to make it official.



Communication

At my day job I’ve spent the better part of a month working on a nasty performance tuning problem in some software that I didn’t actually write. Without going into much detail, we have a distributed application that provides cover images (a la Amazon) to the websites and other applications at various divisions within Follett. There are multiple caching layers, and heavy use of third-party software such as Lucene and Tomcat. The problem was that the image query service would occasionally take 10 times as long (or more) to service a request.

Initially I used a tool called jrat to profile the application in question to see where it was spending its time. jrat is a neat little application that uses the Byte Code Engineering Library to instrument Java class files so that they write timing information to a log file. jrat then has a visualization tool that lets you open the log and view timings for the various methods. After doing this it became clear that a large amount of time was being spent in searching the Lucene index.

So I isolated the searching component of the code and replicated the timeout behavior outside of the web container. Once I could replicate the behavior at will I was able to start turning knobs and flipping switches to try to get better performance. One of the first obvious things I tried was to create one IndexSearcher object and share it across the threads. This helped a great deal and I was happy. Thinking that it was the creation of the searchers which slowed things down, I created a pool of IndexSearchers which the application drew from, and a worker thread that kept the pool full. This change also worked well outside of Tomcat; however, once it ran under Tomcat I saw the same delays. The test outside of Tomcat pushed the searching much harder than our web traffic ever did…so extrapolating from one to the other wasn’t appropriate. I had fixed a problem but not the problem.

This is when depression set in…

After I had started to think clearly again I happened to have lunch with Mike who asked if JVM garbage collection could have anything to do with it. I practically slapped myself on the forehead. This is what all those articles warned me about when discussing Java and embedded software! I went back, turned on garbage collection logging and sure enough, every 10-20 seconds the JVM was spending sometimes around 2 seconds collecting a huge amount of memory. I had a little log analysis tool that told me when the response times were exceeding 2 seconds, and sure enough these popped up while the full GC was running. What objects were chewing up that amount of space?
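The cross-check between slow responses and GC pauses can be sketched roughly as follows. This is not the log analysis tool I actually used; it is a minimal illustration in Python that assumes the two logs have already been parsed into `(start_time, duration)` pairs in seconds. The function name and the sample numbers are made up for the example.

```python
def slow_requests_during_gc(requests, gc_pauses, threshold=2.0):
    """Flag slow requests and note whether each one overlapped a GC pause.

    requests  -- list of (start_time, duration) pairs, in seconds
    gc_pauses -- list of (start_time, duration) pairs, in seconds
    Returns a list of (start_time, duration, overlapped_gc) for every
    request whose duration is at least `threshold` seconds.
    """
    flagged = []
    for req_start, req_duration in requests:
        if req_duration < threshold:
            continue
        req_end = req_start + req_duration
        # Two intervals overlap when each one starts before the other ends.
        overlapped = any(
            req_start < gc_start + gc_duration and gc_start < req_end
            for gc_start, gc_duration in gc_pauses
        )
        flagged.append((req_start, req_duration, overlapped))
    return flagged


# Made-up sample data, purely for illustration.
sample_requests = [(0.0, 0.2), (10.1, 2.3), (15.0, 0.3), (24.9, 2.6), (40.0, 2.5)]
sample_gc_pauses = [(10.0, 2.1), (25.0, 1.9)]

for start, duration, overlapped in slow_requests_during_gc(sample_requests, sample_gc_pauses):
    print(start, duration, "during GC" if overlapped else "no GC running")
```

If nearly every flagged request overlaps a pause, as happened in my case, the collector is the prime suspect.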

This is when Bob suggested giving the commercial Java tuning app YourKit a try. They have a fully functional 14-day demo which I got running under RH Fedora Core 2 in fairly short order. YourKit can talk to a host of J2EE servers including Tomcat. On startup it asks what type of server you want it to attach to. After you select Tomcat it creates a new Tomcat startup script based on the existing one. After restarting Tomcat, YourKit is able to selectively log a ton of data from the running JVM, including memory usage.

YourKit’s memory view alone showed that a large chunk of memory was being used up by all the IndexSearcher objects that were being created. So I had been right to focus on the IndexSearcher after all. The problem wasn’t that they were expensive to create, but that they used a great deal of memory, which caused the JVM to stall while garbage collection was being done. I confirmed this by hacking the app to keep one IndexSearcher around and stress testing again, which performed nicely.

While I don’t have a solution in code yet, this whole exercise has made it clear to me how important communication is in programming. I always seem to get better results when talking to people I work with. It’s so easy to get stuck in one way of looking at a problem, and discussion has a way of dislocating my perspective, challenging my assumptions, and bringing humor into a problem. In addition, good tools are worth their weight in gold. I spent far too much time guessing and testing when I could have used something like YourKit from the start. One thing that has impressed me a lot about Java is the high-quality development tools that are available.


MARC, Perl and Unicode

I’ve been doing some work for Texas A&M who need a MARC::Record module that is Unicode safe. Many ILS vendors are moving away from MARC-8 encoded records towards Unicode. No doubt this move is being spurred on by big players like OCLC who are moving (or have moved) their mammoth WorldCat database to Unicode.

At any rate, Texas A&M have workflows that use MARC::Record for transforming records in their catalog, and they need the Unicode support for their new Voyager system. Technically there were very few places where MARC::Record needed to be adjusted. The problem is that the antiquated transmission format for MARC records uses byte lengths in the so-called directory, as offsets into the record. MARC::Record uses length() and substr() to create and work with the directory…which works fine when 1 character equals 1 byte. However, a Unicode character encoded as UTF-8 can take up multiple bytes…so the character-oriented length() will create faulty record directories, and substr() will extract data from the rest of the record incorrectly.
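The mismatch is easy to see in a few lines of Python. This is a simplified sketch, not MARC::Record’s code: it assumes the usual 12-character directory entry (3-character tag, 4-digit field length, 5-digit starting position), leaves out indicators and subfield delimiters, and the `build_directory` helper and the sample fields are made up for illustration.

```python
FIELD_TERMINATOR = "\x1e"  # ends every field in the transmission format


def char_length(text):
    """Length in characters, like a character-oriented length()."""
    return len(text)


def byte_length(text):
    """Length in bytes once the text is encoded as UTF-8."""
    return len(text.encode("utf-8"))


def build_directory(fields, measure):
    """Build a directory string: tag + 4-digit length + 5-digit offset per field."""
    entries = []
    offset = 0
    for tag, data in fields:
        length = measure(data + FIELD_TERMINATOR)
        entries.append(f"{tag}{length:04d}{offset:05d}")
        offset += length
    return "".join(entries)


# Illustrative fields; \u00f1 is the n-with-tilde, 2 bytes in UTF-8.
fields = [
    ("245", "El se\u00f1or de los anillos"),
    ("700", "Tolkien, J. R. R."),
]

print(build_directory(fields, char_length))  # lengths counted in characters
print(build_directory(fields, byte_length))  # lengths counted in bytes
```

One two-byte character is enough to make the first field’s length and the second field’s offset each off by one in the character-counted directory, and the error accumulates with every further non-ASCII character in the record.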

Fortunately there is the bytes pragma, which alters the behavior of various character-oriented Perl functions. Unfortunately these functions were added to Perl relatively recently, so this new version of MARC::Record will require Perl >= v5.8.2. Technically it could run on 5.8.1, but I found that the 5.8.1 that ships with OS X 10.3 lacks bytes::substr(). Not only that, but if you try to call a nonexistent function in the bytes namespace you’ll go into an infinite loop. This is still the case with Perl 5.8.6.

All in all I really have come to dislike Perl’s Unicode support. The magical utf8 flag on scalars has a tendency to pop on and off for obscure reasons. And I’ve found the behavior of bytes::length() to be a bit unpredictable. Surely this is because I don’t fully understand the mechanics involved, but judging from the traffic on perl-unicode I’m not the only one who has struggled with it. My experience using Unicode in Java and Python has been much more pleasant, and really confirms my decision to move towards doing new work in these languages. Perl has served me well, and there are some things I really love about the language, but these nasty corners are a bit scary.


name authority fun

As a joke dchud suggested that panizzi (the friendly neighborhood bot in #code4lib) should have a plugin for querying the Library of Congress Name Authority File that OCLC provides. The Name Authority File allows librarians the world over to use the same established names when cataloging books, etc. It would serve no purpose in irc, but it could be a good conversation piece…

I had goofed around writing a command line app about half a year ago so I figured it couldn’t be that hard to hack this into the infobot source code. However I guessed wrong…granted I only tried for 30 minutes or so.

Fortunately, Python’s supybot was a different story. It’s more modern, has command line programs for configuring a supybot, has built-in support for plugins, and has documentation. There is even a command line program, supybot-newplugin, that will ask a few questions and then autogenerate a template plugin module. All you have to do after that is add a method (with a particular signature given in the docs) which will then do the work and respond.


HTML/HTTP

RFC 2397 has been around since August 1998 and I’m just learning about the data URL scheme today. Perhaps browser support for it is new? Basically data URLs allow you to embed data, like images directly in an HTML page. Data URLs remind me of Fred Lindberg’s old idea (circa 2001) of “mailing the web” by freezing web pages as email with MIME attachments.
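To make that concrete, here is a minimal sketch of building a base64 data URL in Python, following the `data:<mediatype>;base64,<data>` form from RFC 2397. The `make_data_url` helper is a made-up name for this example, and the payload is a short text string rather than a real image.

```python
import base64


def make_data_url(data, media_type="image/png"):
    """Return a base64-encoded data URL for the given bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


# A tiny text payload keeps the result readable; image bytes work the same way.
url = make_data_url(b"hello", "text/plain")
print(url)

# The URL can be dropped straight into markup in place of a normal link.
print(f'<img src="{make_data_url(b"placeholder image bytes")}" alt="embedded">')
```

The trade-off is size: base64 inflates the data by about a third, and the embedded data travels with every copy of the page instead of being cached as a separate resource.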

It’s fun to be learning new things about HTML/HTTP: technologies that I thought I was familiar with already. Perhaps I’ve been out of web development for long enough to fall behind. The other day I learned about iframes from my friend and sometime coworker Jason and was similarly blown away by something new under the sun. iframes are essentially the same thing as regular frames, except that the browser user doesn’t see separate panes. They’re useful for scrolling panels inside of pages, and other things I’m sure.

I guess this is all part of the web renaissance that is going on now, spurred on by Google’s forays into and investment in JavaScript and XML. It’s really interesting to see how a big player like Google can redefine what is acceptable technology to rely on in web applications. For years I’ve avoided doing too much in JavaScript since it was a headache to get it working across different browsers, at least for this programmer. Now JavaScript is on my list of things to learn more about.


Hello WordPress

Hello WordPress, bye bye custom blog code written in Perl. Well, the old code is still running, but I’ve wanted to install WordPress for the past few months and finally got around to it this weekend. I had a little bit of trouble getting PHP installed, only because I decided to use the older PHP 4 with the latest MySQL, and PHP 4 didn’t seem to want to configure itself against the latest MySQL. Fortunately PHP 5 was a different story and WordPress was a breeze to install.

My reasons for switching from my homegrown code to WordPress are several.

  • there was really no way of commenting on stories, only adding them.
  • the old code didn’t really archive or categorize stories the way I wanted to.
  • links to stories didn’t work, and I wanted to join dan’s Planet #code4lib.
  • I didn’t use the RSS aggregation features I wrote since I started using Bloglines.
  • I’ve been coding more in Python these days and don’t feel particularly tied to my Perl code base any longer. WordPress is PHP, which I’m not a huge fan of, but I think this has more to do with the PHP that I was exposed to than with the language itself. Installing WordPress and the various plugins like the audioscrobbler one you see to the right was very pleasant.
  • the WordPress community is extremely rich. I spent some time with Kesa looking at different themes, but in the end decided to stay with the default for now. There are tons of neat plugins to look at.

So what you can expect here is more of the same. I’m going to try to write more about my work as a programmer, mainly as a journal for myself to keep track of what I’m working on, where I’ve been, and where I’d like to go. Perhaps you are thinking spare me the details, where are the pictures of Chloe?! If this is the case you should see a link to the photos over on the right.