Thom Hickey mentioned a new page at OCLC which lists some real time stats for worldcat: total holdings, last record added, etc. Perhaps this is in honor of the total holdings getting very close to crossing the 1 billion mark.

So of course I had to add a plugin for panizzi to scrape the page. Rather than writing yet another state machine for parsing html I decided to try out Frederik Lundh’s ElementTree Tidy HTML Tree Builder, which works out very well when you want to walk a datastructure representing possibly invalid HTML.

    url = "http://www.oclc.org/worldcat/grow.htm"
    tree = TidyHTMLTreeBuilder.parse( urlopen( self.url ) )

That’s all there is to getting nice elementtree object which you can dig into for a page of HTML.

So, predictably:

10:53 < edsu> @worldcat
10:53 < panizzi> edsu: [May 16, 2005 11:49 AM EDT #981,277,234] 
                      El senor de los anillos. Tolkien, J. R. R. ... 
                      uploaded by OEL - EUGENE PUB LIBR