Archive for the ‘java’ Category

oai2rdf

Thursday, August 24th, 2006

Amid the flurry of commit messages and the like on the Simile development discussion list, I happened to see that the Simile Project includes an RDFizer project which has a component called oai2rdf.

oai2rdf is a command line program that uses Jeff Young’s OAIHarvester2 and some XSLT magic to harvest an entire oai-pmh archive and convert it to rdf.

  % oai2rdf.sh http://cogprints.ecs.soton.ac.uk/perl/oai2 cogprints

This will harvest the entire cogprints eprint archive and convert it on the fly to rdf which is saved in a directory called cogprints. Just in case you are wondering–yes it handles resumption tokens. In fact you can also give it date ranges to harvest, and tell it to only harvest particular metadata formats. By default it actually grabs all possible metadata formats.
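For anyone wondering what handling resumption tokens involves, here is a minimal Python sketch of the paging loop. It is not oai2rdf's or OAIHarvester2's code. The `fetch` callable, the canned `PAGES`, and the `oai:example:*` identifiers are made up for illustration; only the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` parameters, and the namespace come from the OAI-PMH protocol itself.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def harvest(fetch, metadata_prefix="oai_dc"):
    """Yield every <record>, following resumption tokens to the last page.

    fetch is a hypothetical callable: it takes a dict of query parameters
    and returns the XML response body as a string.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        # A missing or empty token means this was the last page.
        if token is None or not (token.text or "").strip():
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Two canned pages stand in for a real repository (illustrative data only).
PAGE = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        '<record><header><identifier>%s</identifier></header></record>'
        '<resumptionToken>%s</resumptionToken></ListRecords></OAI-PMH>')
PAGES = {None: PAGE % ("oai:example:1", "page2"),
         "page2": PAGE % ("oai:example:2", "")}


def fake_fetch(params):
    # Look up the page by its token; the first request has no token.
    return PAGES[params.get("resumptionToken")]


identifiers = [r.find(".//" + OAI + "identifier").text
               for r in harvest(fake_fetch)]
print(identifiers)
```

Swapping `fake_fetch` for a function that issues real HTTP requests would turn this into a tiny harvester. Date ranges would go in as the protocol's `from` and `until` parameters on the first request.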

As part of my day job I’ve been looking at some rdf technologies like jena, and while there are lots of chunks of rdf around on the web to play with, oai2rdf suddenly opens up the possibilities quite a bit.

Getting oai2rdf up and running is pretty easy. First get the oai2rdf code:

  svn co http://simile.mit.edu/repository/RDFizers/oai2rdf/ oai2rdf

Next make sure you have maven. If you don’t have it, maven is very easy to install. Just download, unpack, and make sure the maven/bin directory is in your path. Then you can:

  mvn package

The magic of maven will pull down dependencies and compile the code. Then you should be able to run oai2rdf. Art Rhyno has been talking about the work the Simile folks are doing for quite a while now, and only recently have I started to see what a rich set of tools they are developing.

(py)?lucene 1.9

Thursday, March 2nd, 2006

So on March 1st lucene v1.9 was released and the *next day* pylucene v1.9 was released. Nice work!

I guess there are a bunch of methods that are deprecated in 1.9 which will disappear entirely in v2.0. Now would be a good time to update usage…

jython niceties

Friday, November 11th, 2005

While playing around with the Java JDOM library, I found myself resorting to jython to experiment with the API. It’s just so much easier this way for me:

#!/usr/bin/env jython
 
from java.io import StringReader
 
from org.jdom import Document
from org.jdom.input import SAXBuilder
from org.jdom.xpath import XPath
 
xml = '<foo><bar>foobar</bar></foo>'
 
builder = SAXBuilder()
document = builder.build(StringReader(xml))
xpath = XPath.newInstance('//foo/bar')
node = xpath.selectSingleNode(document)
print node.getText()

In case it’s of interest I’ve got a little jython startup script which automatically makes .jar files I drop in a particular directory available to the interpreter. So when testing jdom all I had to do was drop jdom.jar in my /usr/local/jython/jars directory and it was immediately available the next time I started up jython.

#!/bin/bash
 
JYTHON_HOME=/usr/local/jython
 
for jar in $JYTHON_HOME/jars/*.jar
do
    jars="$jars:$jar"
done
 
CLASSPATH="$JYTHON_HOME/dist/jython.jar$jars"
 
java -cp $CLASSPATH \
	$JYTHON_JAVA_ARGS -Dpython.home=$JYTHON_HOME \
	org.python.util.jython "$@"

Pretty handy, especially for interactive sessions.

quite a patch

Thursday, October 13th, 2005

Since starting to use lucene heavily at work about a year ago I’ve been watching the lucene list out of the corner of my eye for tips and tricks. Today I saw an email go by that referenced a recent patch that lazily creates SegmentMergeInfo.docMap objects. I guess the point isn’t so much what the object is, but that the mere change to creating it lazily yielded some pretty impressive performance gains:

  Performance Results: A simple single field index with 555,555 documents, and 1000 random deletions was queried 1000 times with a PrefixQuery matching a single document.

  Performance Before Patch:
    indexing time = 121,656 ms
    querying time = 58,812 ms

  Performance After Patch:
    indexing time = 121,000 ms
    querying time = 598 ms

  A 100 fold increase in query performance!

Umm, 100 fold increase in performance. That’s quite a patch!
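To make the idea of lazy creation concrete, here is a small Python sketch. It is only loosely inspired by the patch and is not Lucene's code. The `Segment` class, its `deleted_flags` list, and the `builds` counter are invented for illustration, and the renumbering map is an assumption about what such a structure might hold. The point is just that the expensive map is built on first access rather than in the constructor, so code paths that never touch it never pay for it.

```python
class Segment:
    """Toy stand-in for an index segment (hypothetical, not a Lucene class)."""

    def __init__(self, deleted_flags):
        self.deleted_flags = deleted_flags
        self._doc_map = None  # nothing is computed at construction time
        self.builds = 0       # counts how many times the map was built

    @property
    def doc_map(self):
        """Map old document numbers to new ones, skipping deleted documents."""
        if self._doc_map is None:
            self.builds += 1
            mapping, next_id = [], 0
            for deleted in self.deleted_flags:
                if deleted:
                    mapping.append(-1)  # deleted documents get no new number
                else:
                    mapping.append(next_id)
                    next_id += 1
            self._doc_map = mapping
        return self._doc_map


segment = Segment([False, True, False])
untouched = segment.builds    # still 0: no work has been done yet
first = segment.doc_map       # the map is built here, on first use
second = segment.doc_map      # later accesses reuse the cached map
print(untouched, first, segment.builds)
```

For what it's worth, the quoted query timings work out to 58,812 / 598, or roughly 98 times faster.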

pylucene

Thursday, June 9th, 2005

I’m going to be doing a lightning talk tonight at the Chicago Python Group about pylucene. pylucene essentially lets you use the popular Lucene indexing library (Java) in Python. No time limit has been set for the lightning talks (and mjd won’t be there with his gong) but I hope to quickly cover how to index an mbox with pylucene in 5 minutes. There are slides, which are there mainly as cue cards.

Communication

Friday, May 6th, 2005

At my day job I’ve spent the better part of a month working on a nasty performance tuning problem in some software that I didn’t actually write. Without going into much detail, we have a distributed application that provides cover images (a la Amazon) to the websites and other applications at various divisions within Follett. There are multiple caching layers, and heavy use of 3rd party software such as lucene and tomcat. The problem was that the image query service would occasionally take 10 times as long (or more) to service a request.

Initially I used a tool called jrat to profile the application in question to see where it was spending its time. jrat is a neat little application that uses the Byte Code Engineering Library to instrument Java class files so that they write timing information to a log file. jrat then has a visualization tool that lets you open the log and view timings for the various methods. After doing this it became clear that a large amount of time was being spent in searching the Lucene index.
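As a rough analogy for what that kind of instrumentation does, here is a Python sketch that wraps a function so every call records its elapsed time. jrat works on Java bytecode and writes to a log file, so this is an illustration of the idea and not how jrat is implemented. The `timed` decorator, the `timings` list, and the `search` function are all hypothetical names.

```python
import functools
import time

timings = []  # (function name, elapsed seconds), one entry per call


def timed(func):
    """Wrap func so that each call appends its duration to timings."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the timing even if the wrapped call raises.
            timings.append((func.__name__, time.perf_counter() - start))
    return wrapper


@timed
def search(term):
    # Hypothetical stand-in for the index search being profiled.
    return [term.upper()]


result = search("cover")
name, elapsed = timings[-1]
print(name, "took", elapsed, "seconds")
```

Summing the recorded durations per function name gives the same kind of "where is the time going" view that pointed at the Lucene search here.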

So I isolated the searching component of the code and replicated the timeout behavior outside of the web container. Once I could replicate the behavior at will I was able to start turning knobs and switching switches to try to get better performance. One of the first obvious things I tried was to create one IndexSearcher object and share it across the threads. This helped a great deal and I was happy. Thinking that it was the creation of the Searchers which slowed things down I created a pool of IndexSearchers which the application drew from, and a worker thread that kept the pool full. This change also worked well outside of Tomcat; however once it ran under Tomcat I saw the same delays. The test outside of Tomcat pushed the searching much harder than our web traffic ever did…so extrapolating from one to the other wasn’t appropriate. I had fixed *a* problem but not *the* problem.

This is when depression set in…

After I had started to think clearly again I happened to have lunch with Mike, who asked if JVM garbage collection could have anything to do with it. I practically slapped myself on the forehead. This is what all those articles warned me about when discussing Java and embedded software! I went back, turned on garbage collection logging and sure enough, every 10-20 seconds the JVM was sometimes spending around 2 seconds collecting a huge amount of memory. I had a little log analysis tool that told me when the response times were exceeding 2 seconds, and these slow responses popped up while the full GC was running. What objects were chewing up that amount of space?
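The correlation step can be sketched in a few lines of Python. This is not the log analysis tool mentioned above. It assumes the request log and the GC log have already been parsed into `(start_seconds, duration_seconds)` tuples, and the function name and sample numbers are invented for illustration.

```python
def slow_requests_during_gc(requests, gc_pauses, threshold=2.0):
    """Return the slow requests whose time window overlaps a GC pause.

    requests and gc_pauses are lists of (start_seconds, duration_seconds).
    A request counts as slow when its duration exceeds threshold.
    """
    hits = []
    for req_start, req_len in requests:
        if req_len <= threshold:
            continue
        req_end = req_start + req_len
        for gc_start, gc_len in gc_pauses:
            gc_end = gc_start + gc_len
            # Two intervals overlap when each one starts before the other ends.
            if req_start < gc_end and gc_start < req_end:
                hits.append((req_start, req_len))
                break
    return hits


# Made-up sample data: one fast request and two slow ones.
requests = [(0.0, 0.1), (10.0, 2.5), (30.0, 3.0)]
gc_pauses = [(10.5, 2.0)]
overlapping = slow_requests_during_gc(requests, gc_pauses)
print(overlapping)
```

A slow request that does not overlap any pause, like the one at 30 seconds in the sample data, would suggest a second cause worth chasing.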

This is when Bob suggested giving the commercial Java tuning app YourKit a try. They have a fully functional 14 day demo which I got to run under RH Fedora Core 2 in fairly short order. YourKit can talk to a host of J2EE servers including Tomcat. On startup it asks what type of server you want it to attach to. After selecting Tomcat it goes and creates a new Tomcat startup script based on the existing one. After restarting Tomcat, YourKit is able to selectively log a ton of data from the running JVM, including memory usage.

One YourKit screen alone showed that a large chunk of memory was being used up by all the IndexSearcher objects that were being created. So I had been right to focus on the IndexSearcher after all: the problem wasn’t that they were expensive to create, but that they used a great deal of memory, which caused the JVM to stall while garbage collection was being done. I confirmed this by hacking the app to keep one IndexSearcher around and stress testing again, which performed nicely.
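The "keep one around" approach might look something like the following, sketched in Python rather than Java. It is not the hack described above and it uses no Lucene calls. `SharedSearcher`, `open_searcher`, and the plain `object()` standing in for an IndexSearcher are all hypothetical. The sketch assumes the shared object is safe to use from several threads at once.

```python
import threading


class SharedSearcher:
    """Create one expensive object on first use and hand it to every thread."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical callable that opens a searcher
        self._lock = threading.Lock()
        self._searcher = None

    def get(self):
        # The lock ensures two threads cannot both run the factory.
        with self._lock:
            if self._searcher is None:
                self._searcher = self._factory()
            return self._searcher


created = []  # every object the factory has produced


def open_searcher():
    # Stand-in for opening a real (memory-hungry) searcher.
    created.append(object())
    return created[-1]


shared = SharedSearcher(open_searcher)
seen = []
threads = [threading.Thread(target=lambda: seen.append(shared.get()))
           for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(created), "searcher created for", len(seen), "threads")
```

With only one long-lived instance there is far less short-lived garbage for the collector to chase. A real version would also need some way to reopen the searcher when the index changes.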

While I don’t have a solution in code yet, this whole exercise has made it clear to me how important communication is in programming. I always seem to get better results when talking to people I work with. It’s so easy to get stuck in one way of looking at a problem, and discussion has a way of dislocating my perspective, challenging my assumptions, and bringing humor into a problem. In addition, good tools are worth their weight in gold. I spent far too much time guessing and testing when I could have used something like YourKit from the start. One thing that has impressed me a lot about Java is the high quality of the development tools that are available.