Thom Hickey mentioned a new page at OCLC that lists some real-time stats for WorldCat: total holdings, last record added, etc. Perhaps this is in honor of the total holdings getting very close to crossing the 1 billion mark.
So of course I had to add a plugin for panizzi to scrape the page. Rather than writing yet another state machine for parsing HTML, I decided to try out Fredrik Lundh’s ElementTree Tidy HTML Tree Builder, which works out very well when you want to walk a data structure representing possibly invalid HTML.
url = "http://www.oclc.org/worldcat/grow.htm"
tree = TidyHTMLTreeBuilder.parse( urlopen( self.url ) )
That’s all there is to getting a nice ElementTree object for a page of HTML, which you can then dig into.
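For a rough idea of what "digging in" looks like, here is a minimal sketch that walks a tree and collects label/value pairs from table rows. It uses the standard library's ElementTree on a well-formed sample so it runs without the Tidy builder installed. The sample markup, the row labels, and the table_rows helper are all assumptions for illustration; the actual layout of the OCLC page isn't shown here. The Tidy builder returns elements in the XHTML namespace, so the sample is namespaced the same way.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts on XHTML tag names.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical markup: the real page structure is an assumption.
# The values are borrowed from the bot output in this post.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr><td>Total holdings</td><td>981,277,234</td></tr>
<tr><td>Last record added</td><td>El senor de los anillos</td></tr>
</table>
</body>
</html>"""


def table_rows(root):
    """Map first-cell text to second-cell text for every two-cell row."""
    stats = {}
    for row in root.iter(XHTML + "tr"):
        cells = [(cell.text or "").strip() for cell in row.findall(XHTML + "td")]
        if len(cells) == 2:
            stats[cells[0]] = cells[1]
    return stats


root = ET.fromstring(SAMPLE)
stats = table_rows(root)
print(stats["Total holdings"])
```

With a tree from the Tidy builder you would call tree.getroot() first and hand that to the same kind of loop.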
10:53 < edsu> @worldcat
10:53 < panizzi> edsu: [May 16, 2005 11:49 AM EDT #981,277,234] El senor de los anillos. Tolkien, J. R. R. ... uploaded by OEL - EUGENE PUB LIBR