tiger

Well, I took the plunge and installed the latest version of OS X. I’m actually posting this blog entry with a WordPress dashboard plugin. I backed up my mail, address book and calendar and did a clean install. I was a bit nervous that I’d forgotten to back up everything I needed…but it was also kind of refreshing starting with a clean slate. I’ve got the latest versions of Perl and Python building in the background now, and everything so far seems pretty smooth. I hope to take a closer look at dashboard widgets sometime soon.


lightning strikes

Chris has a nice writeup about last night’s ChiPy lightning talks. There were tons of interesting people there with very interesting projects. Apart from the announcement that we might be hosting PyCon next year in Chicago, the highlight of the evening for me was hearing about the amazing data hack that is ChicagoCrime. Adrian is a journalist/programmer who managed to glue together GoogleMaps with publicly available data from the Chicago Police Department. The main (perhaps unintended) things I took from his enthusiastic and humorous talk were:

  • screen scraping is fragile, but it’s an important lever for fostering more elegant/robust information sharing.
  • screen scraping is fragile, but it’s important for building new public applications that aren’t run behind closed doors at the Department of Homeland Security.

I really, really want to get going on the GovTrack data scraping now.


govtrack

GovTrack has done some awesome work generating publicly available machine-readable data for US government information. After the last election I decided that I really wanted to get involved in some sort of volunteer technology/political activity, so I started googling and found GovTrack pretty much just starting up. Now there is a loose affiliation of similar sites (including GovTrack) called Ogdex, which is attempting to foster the collection of publicly available government information. In particular there has been some talk on the govtrack discussion list about local efforts to add state data to the collection of federal data…and even bounties for getting state data collection going. I’m going to take a stab at writing some scraping utilities for gathering together Illinois data and will report back on how it goes. If you are interested in helping out, details are available.
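
Just to make the idea concrete, here’s the rough shape of what I have in mind. The URL and link pattern below are purely hypothetical placeholders until I’ve actually looked at how the Illinois data is published:

    # Hypothetical sketch of an Illinois scraping utility. The index URL and
    # the link pattern are placeholders, not the real site structure.
    import re
    from urllib import urlopen

    BILL_INDEX = "http://www.example.org/ilga/bills.html"   # hypothetical URL

    def bill_links():
        """Yield (bill number, path) pairs scraped from a hypothetical index page."""
        html = urlopen(BILL_INDEX).read()
        for m in re.finditer(r'<a href="(/ilga/bill/[^"]+)">((?:HB|SB)\d+)</a>', html):
            yield m.group(2), m.group(1)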

Update: Joshua just set up a new Drupal site for govtrack development.


pylucene

I’m going to be doing a lightning talk tonight at the Chicago Python Group about pylucene. pylucene essentially lets you use the popular Lucene indexing library (Java) from Python. No time limit has been set for the lightning talks (and mjd won’t be there with his gong), but I hope to cover how to index an mbox with pylucene in 5 minutes. There are slides, which are there mainly as cue cards.
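
Something along these lines is the basic idea: a minimal sketch assuming the Lucene 1.4-style API that pylucene exposes (the Field.Keyword/Field.Text factory methods), with placeholder field names and index directory:

    # Minimal sketch: index each message in an mbox as a Lucene Document.
    # Assumes the Lucene 1.4-style API exposed by pylucene; the field names
    # and index directory are placeholders.
    import email, mailbox
    from PyLucene import IndexWriter, StandardAnalyzer, Document, Field

    def index_mbox(mbox_path, index_dir):
        writer = IndexWriter(index_dir, StandardAnalyzer(), True)   # True = create a new index
        for msg in mailbox.PortableUnixMailbox(open(mbox_path), email.message_from_file):
            doc = Document()
            doc.add(Field.Keyword("subject", msg.get("Subject", "")))
            doc.add(Field.Keyword("from", msg.get("From", "")))
            if not msg.is_multipart():
                doc.add(Field.Text("body", msg.get_payload()))
            writer.addDocument(doc)
        writer.optimize()
        writer.close()

    index_mbox("inbox.mbox", "index")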


pybibutils

The #code4lib sprint is coming up soon and (alas) we still don’t really have a firm grasp on what we will be sprinting on. After PyCon dchud had some ideas for a metadata wrangling framework for Python. Around the same time I was working on a SWIG wrapper for the bibutils library. So one idea we had was to create a Python utility that would enable converting between many of the popular metadata/citation formats.
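
The SWIG wrapper talks to the C library directly, but just to illustrate the conversion idea: the bibutils command-line tools all use MODS XML as their common intermediate format, so a quick-and-dirty version could simply pipe data through them. This is a sketch, not what the wrapper actually does:

    # Illustration only: convert citation data by piping it through the
    # bibutils command line tools (ris2xml, xml2bib, etc.), which all use
    # MODS XML as the common intermediate format.
    from subprocess import Popen, PIPE

    def convert(data, from_format, to_format):
        """e.g. convert(ris_text, 'ris', 'bib') to go from RIS to BibTeX."""
        mods = Popen(["%s2xml" % from_format], stdin=PIPE, stdout=PIPE).communicate(data)[0]
        return Popen(["xml2%s" % to_format], stdin=PIPE, stdout=PIPE).communicate(mods)[0]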

Emerging details are available on the wiki. If you have any ideas for the sprint please note them on the wiki.


MARC::Record v2.0 RC1

Thanks to the support of Anne Highsmith at Texas A&M, MARC::Record v2.0 RC1 was released today to SourceForge. This new version of MARC::Record addresses the use of Unicode in MARC records. There has been a long-standing bug in MARC::Record which caused it to calculate record directories incorrectly when the records contained Unicode. It isn’t hitting CPAN just yet, so that people who want Unicode handling can take it for a test drive first. As noted previously this Perl/Unicode stuff is pretty tricky, since most of the time the encoding of a scalar variable is sort of hidden from view. I’d much prefer to be in a situation like in Java where all strings are Unicode.
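
The underlying gotcha is that the MARC directory wants byte counts, and it’s easy to end up with character counts instead. Here’s the same trap illustrated in Python rather than Perl:

    # Character counts and byte counts diverge as soon as multibyte UTF-8
    # characters show up -- and the MARC directory needs byte counts.
    title = u"El se\u00f1or de los anillos"
    print len(title)                     # 23 characters
    print len(title.encode("utf-8"))     # 24 bytes -- what the directory needs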


one billion

Thom Hickey mentioned a new page at OCLC which lists some real-time stats for WorldCat: total holdings, last record added, etc. Perhaps this is in honor of the total holdings getting very close to crossing the 1 billion mark.

So of course I had to add a plugin for panizzi to scrape the page. Rather than writing yet another state machine for parsing HTML I decided to try out Fredrik Lundh’s ElementTree Tidy HTML Tree Builder, which works out very well when you want to walk a data structure representing possibly invalid HTML.

    url = "http://www.oclc.org/worldcat/grow.htm"
    tree = TidyHTMLTreeBuilder.parse( urlopen( self.url ) )

That’s all there is to getting a nice ElementTree object that you can dig into for a page of HTML.
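
From there you can walk the tree. elementtidy puts everything in the XHTML namespace, so tags need the namespace prefix; exactly which elements the stats live in is a guess here:

    # Walk the parsed page and dump text from the table cells; which cells
    # actually hold the WorldCat stats is a guess -- adjust after looking
    # at the real page.
    XHTML = "{http://www.w3.org/1999/xhtml}"
    for cell in tree.getiterator(XHTML + "td"):
        if cell.text and cell.text.strip():
            print cell.text.strip()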

So, predictably:

10:53 < edsu> @worldcat
10:53 < panizzi> edsu: [May 16, 2005 11:49 AM EDT #981,277,234] 
                      El senor de los anillos. Tolkien, J. R. R. ... 
                      uploaded by OEL - EUGENE PUB LIBR


code4lib sprint

A bunch of #code4lib folks will be converging on Chicago this summer for the annual American Library Association conference. Several of us thought it would be fun to get together for a sprint on a project that has yet to be decided. A potential project is building a framework for metadata translation similar to bibutils or perhaps Cheshire. I worked on creating a bibutils wrapper for Python a few months ago, and decided it would be better to have a pure Python framework instead. The wrapper worked OK, but only on particular platforms, and the API felt kludgy in that bibutils is oriented towards command-line tools. There’s also some interest in having a discussion, and possibly doing some hacking, on the OPAC mirroring that Art Rhyno and Ross Singer have been working on.

I called Chicago Public Library to reserve some of their rooms but they’re already all booked up. Fortunately the Lincoln Park Branch has a nice room (with wifi) which ChiPy used for their PyPI sprint a few months ago…and I just reserved the space for the entire day of Friday, June 24th, 2005. My friend Brian Ray from ChiPy kindly offered to stop by his local branch to fill out the paperwork to make it official.



Communication

At my day job I’ve spent the better part of a month working on a nasty performance tuning problem in some software that I didn’t actually write. Without going into much detail, we have a distributed application that provides cover images (a la Amazon) to the websites and other applications at various divisions within Follett. There are multiple caching layers, and heavy use of 3rd party software such as Lucene and Tomcat. The problem was that the image query service would occasionally take 10 times as long (or more) to service a request.

Initially I used a tool called jrat to profile the application in question to see where it was spending its time. jrat is a neat little application that uses the Byte Code Engineering Library to instrument Java class files so that they write timing information to a log file. jrat then has a visualization tool that lets you open the log and view timings for the various methods. After doing this it became clear that a large amount of time was being spent searching the Lucene index.

So I isolated the searching component of the code and replicated the timeout behavior outside of the web container. Once I could replicate the behavior at will I was able to start turning knobs and flipping switches to try to get better performance. One of the first obvious things I tried was to create one IndexSearcher object and share it across the threads. This helped a great deal and I was happy. Thinking that it was the creation of the Searchers which slowed things down, I created a pool of IndexSearchers which the application drew from, and a worker thread that kept the pool full. This change also worked well outside of Tomcat; however, once it ran under Tomcat I saw the same delays. The test outside of Tomcat pushed the searching much harder than our web traffic ever did…so extrapolating from one to the other wasn’t appropriate. I had fixed a problem but not the problem.
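
(For what it’s worth, the “share one searcher” idea looks roughly like the sketch below, written in PyLucene terms rather than the Java the service is actually written in, with a made-up index path and field name.)

    # Illustration only, in PyLucene terms (the real service is Java):
    # reuse one long-lived IndexSearcher rather than creating one per request.
    # The index path and the "isbn" field are made-up placeholders.
    from PyLucene import IndexSearcher, StandardAnalyzer, QueryParser

    searcher = IndexSearcher("/path/to/cover-index")    # created once, shared by all threads
    analyzer = StandardAnalyzer()

    def find_cover(isbn):
        query = QueryParser("isbn", analyzer).parse(isbn)
        return searcher.search(query)                   # search against the shared searcher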

This is when depression set in…

After I had started to think clearly again I happened to have lunch with Mike, who asked if JVM garbage collection could have anything to do with it. I practically slapped myself on the forehead. This is what all those articles warned me about when discussing Java and embedded software! I went back, turned on garbage collection logging and sure enough, every 10-20 seconds the JVM was spending sometimes around 2 seconds collecting a huge amount of memory. I had a little log analysis tool that told me when the response times were exceeding 2 seconds, and those slow responses popped up right while the full GC was running. What objects were chewing up that amount of space?

This is when Bob suggested giving the commercial Java tuning app YourKit a try. They have a fully functional 14-day demo which I got running under RH Fedora Core 2 in fairly short order. YourKit can talk to a host of J2EE servers including Tomcat. On startup it asks what type of server you want it to attach to. After selecting Tomcat it goes and creates a new Tomcat startup script based on the existing one. After restarting Tomcat, YourKit is able to selectively log a ton of data from the running JVM, including memory usage.

The memory profiling screen alone showed that a large chunk of memory was being used up by all the IndexSearcher objects that were being created. So I had been right to focus on the IndexSearchers after all: it wasn’t that they were expensive to create, but that they resulted in a great deal of memory being used, which caused the JVM to stall out while garbage collection was being done. I confirmed this by hacking the app to keep one IndexSearcher around and stress testing again, which performed nicely.

While I don’t have a solution in code yet, this whole exercise has made it clear to me how important communication is in programming. I always seem to get better results when talking to people I work with. It’s so easy to get stuck in one way of looking at a problem, and discussion has a way of dislocating my perspective, challenging my assumptions, and bringing humor into a problem. In addition, good tools are worth their weight in gold. I spent far too much time guessing and testing when I could have used something like YourKit from the start. One thing that has impressed me a lot about Java is the high quality of the development tools that are available.