File under m for megalomania

Google Announces Plan To Destroy All Information It Can’t Index.

Although Google executives are keeping many details about Google Purge under wraps, some analysts speculate that the categories of information Google will eventually index or destroy include handwritten correspondence, buried fossils, and private thoughts and feelings.

Seriously, many a truth is said in jest. With the news that Google is going to be selling another 4 billion dollars' worth of shares, it makes sense that they would be thinking of a purging program to balance out their binging. What are they going to do with 4 billion dollars? I can't even begin to imagine. It is, frankly, a bit frightening, and seems like behavior one might read about in the DSM-IV.

A Tale of Two Searches

There's been some interesting discussion about SRW/U vs OpenSearch on some library email lists, blogs, and over in #code4lib. I worked on the SRU and CQL::Parser modules for the Ockham Initiative, and have watched the comparisons to A9's RSS-based OpenSearch with great interest. It's amazing how similar the goals of these two search technologies are, and yet how different the implementations and developer communities are.

At their most basic, both SRW/U and OpenSearch aim to make it easy to conduct searches over the web; both want to bring distributed searching to the masses. SRW/U grew up before OpenSearch at the Library of Congress, mainly on a small implementors list. It allows you to use SOAP as a transport layer, or simple XML over HTTP via a RESTful interface. The results can be almost any type of XML, and there is no lingua franca like Dublin Core in OAI-PMH. SRW/U comes with an attendant query specification known as the Common Query Language (CQL). So there are a fair number of moving pieces in building even a barebones SRW/U server.
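
To give a feel for the RESTful SRU side, here's a sketch in modern Python that builds a searchRetrieve URL. The endpoint is made up for illustration; the operation, version, query, and maximumRecords parameters are standard SRU 1.1:

```python
from urllib.parse import urlencode

def sru_url(base, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        'operation': 'searchRetrieve',
        'version': '1.1',
        'query': cql_query,           # a CQL query, e.g. dc.title = "pascal"
        'maximumRecords': max_records,
    }
    return base + '?' + urlencode(params)

# hypothetical endpoint, for illustration only
print(sru_url('http://example.org/sru', 'dc.title = "pascal"'))
```

The server answers with an XML searchRetrieveResponse, which is where the "any type of XML" flexibility (and complexity) comes in.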

OpenSearch, on the other hand, is relatively new, and was developed by A9 (a subsidiary of Amazon, who know a thing or two about building robust, easy-to-use web services). Really it's just a RESTful interface for obtaining search results as RSS 2.0. There is talk that v1.1 might add some extensions to support more refined queries and XML namespaces for bundling different types of XML results…but at the moment there's no need to parse queries, or to handle any flavor of XML other than RSS 2.0.
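
Since an OpenSearch response is just RSS 2.0, consuming results takes nothing more than a standard XML parser. A sketch in modern Python (the feed below is fabricated for illustration):

```python
import xml.etree.ElementTree as ET

# a fabricated RSS 2.0 response, the kind an OpenSearch server returns
rss = """<rss version="2.0">
  <channel>
    <title>Search results for "code4lib"</title>
    <item>
      <title>A Tale of Two Searches</title>
      <link>http://example.org/post/1</link>
    </item>
    <item>
      <title>Update to SRU and CQL::Parser</title>
      <link>http://example.org/post/2</link>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(rss).find('channel')
results = [(item.findtext('title'), item.findtext('link'))
           for item in channel.findall('item')]
for title, link in results:
    print(title, link)
```

That's the whole client side: no query parsing, no schema negotiation, and any RSS library you already have will do.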

When comparing the two sites one thing becomes clear: the SRW/U site is a shambles; the specification itself is fragmented, and as a result there's information all over the place. The OpenSearch page is neatly laid out with examples and rationale, and even has a developers' blog. The key, I think, is that OpenSearch started simple and is slowly adding functionality as it's needed. SRW/U started out trying to simplify an existing standard and is slowly trying to make itself simpler (there have even been suggestions to drop the SOAPy SRW altogether and focus on the RESTful SRU). They're moving in opposite directions. I don't really have any doubts about which standard will see the wider deployment. The virtues of keeping things simple have been noted (very eloquently) by Adam Bosworth.

There is hope for library technologists though. OCLC is doing some really good work like their Open WorldCat program which allows you to link directly to local holdings for a book with a URL like:

Yeah, that’s an ISBN and a ZIP code. Oh, and I installed Chris Fairbanks’ nice OpenSearch/WordPress plugin in like 5 minutes. Here’s an example:

Drop it in an RSS reader and you can see whenever I write about code4lib. Not that you would really want to do that. Hmm, but maybe it would be useful with, say, an online catalog or bibliographic database!

Intelligent Design

I was so pleased to read recently that there are others who find it appalling that the Kansas School Board isn’t considering teaching the solid scientific evidence for the Flying Spaghetti Monster. If the statistics on pirates and global warming aren’t enough to convince you, I have a little story to relate. One night I was driving along the road, and I saw some strange lights in the clouds. At first I thought it might be an airplane, but when I pulled over and got out I caught the aroma of spaghetti sauce and cooked pasta. I looked at my shirt, and didn’t see any spaghetti stains so I immediately thought that it had been the Flying Spaghetti Monster in the clouds. The next day I decided to hire an artist to draw what I imagined the Flying Spaghetti Monster to look like, lurking in the clouds. The result is above, and it’s exactly like the one I have seen reported elsewhere! Since I paid for this artist to do the picture, it is obviously scientifically accurate. I hope that this leaves no doubt in your mind, that FSM is not only a reality, but a very cool one indeed.


The FCC is mandating that Internet providers and network appliance manufacturers build in backdoors so that the spooks can monitor electronic communication between, well you know, terrorists and stuff. What a profoundly bad idea…do they really think that the secret access mechanism will stay a secret? And when it leaks out, what sort of access will the crooks have, and will they ever know it was leaked?

At least the effort to tap is on the table, unlike the Data Encryption Standard, which was supposedly introduced with a deliberately small key size to ease decryption by the NSA.

Thanks to the ever vigilant EFF for reporting on this. Which reminds me, my membership is up for renewal.

pypi over xmlrpc

It's great to see that our ChiPy sprint bore some fruit for the PyPI service. There's now decent XML-RPC support in PyPI for querying the packages. This will hopefully open the door for the kinds of utilities that abound in the Perl/CPAN world…like this very simple client for listing packages:

#!/usr/bin/env python

# list all the packages registered with PyPI via its XML-RPC interface
import xmlrpclib

server = xmlrpclib.ServerProxy('http://www.python.org/pypi')
for package in server.list_packages():
    print package

pascal's triangle in python

I mentioned Pascal's Triangle in the previous post, and after typing in the Oz code decided to make a Pascal's Triangle pretty-printer in Python.

from sys import argv

def pascal(n):
    if n == 1:
        return [ [1] ]
    # each new row is the pairwise sum of the previous row padded with zeros
    result = pascal(n-1)
    lastRow = result[-1]
    result.append( [ (a+b) for a,b in zip([0]+lastRow, lastRow+[0]) ] )
    return result

def pretty(tree):
    if len(tree) == 0: return ''
    line = '  ' * len(tree)
    for cell in tree[0]:
        line += '  %2i' % cell
    return line + "\n" + pretty(tree[1:])

if __name__ == '__main__':
    print pretty( pascal( int(argv[1]) ) ) 

Which, when run with an argument of 9, can generate something like this:

biblio:~/Projects/bookclub ed$ python 9
                     1
                   1   1
                 1   2   1
               1   3   3   1
             1   4   6   4   1
           1   5  10  10   5   1
         1   6  15  20  15   6   1
       1   7  21  35  35  21   7   1
     1   8  28  56  70  56  28   8   1 

It's been fun reading up on the uses for Pascal's Triangle, although I imagine this is old hat for people more familiar with math than I am. Still, I think getting through this tome will be time well spent in the long run.

chipy bookclub

So the Chicago Python group started up a bookclub about a month ago. The first book we're reading as a group is Concepts, Techniques, and Models of Computer Programming, which is fortunately available online for free. The aim of the bookclub (as with many bookclubs) is to work through a text together, and hopefully hear different perspectives during discussion, which will happen online and after our monthly meetings. Also, a bit of peer pressure can help in making it through certain types of books…

And this first book is a doozy at 939 pages. It covers all sorts of territory from computer science using a multi-paradigm language called Oz. I've made it through the preface and into Chapter 1, which starts out teaching some fundamental concepts behind functional programming. The jury is still out, but so far I'm finding the content refreshingly clear and stimulating. I like the fact that mathematical notation (so far) is explained and not taken for granted. Calculating factorial with recursion is a bit predictable, but chapter 1 quickly moves on to an algorithm that calculates a given row of Pascal's Triangle.
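
The book presents these early examples in Oz, but the ideas translate directly. Here's the factorial example in Python for comparison (my own translation, not code from the book):

```python
def fact(n):
    # the classic recursive definition: 0! = 1, n! = n * (n-1)!
    if n == 0:
        return 1
    return n * fact(n - 1)

print(fact(10))   # → 3628800
```

The Pascal's Triangle row algorithm in the same chapter is a short step up from this: the recursion returns a list instead of a number.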

The sheer magnitude of the book is a bit intimidating, but reading it on my iBook makes it easy to ignore that. I'm thinking of it as a sort of thematic encyclopedia of computer programming, with handy illustrations. Hopefully I'll find the time to drop my thoughts here as I work my way through each chapter. Please feel free to join us (whether you're from Chicago or not) if you're interested.

Update to SRU and CQL::Parser

If you are tracking it, you might be interested to know that Brian Cassidy added a Catalyst plugin to the SRU CPAN module. Catalyst is an MVC framework that is getting quite a bit of mindshare in the Perl community (at least the small subset I hang out with in #code4lib). And if that wasn't enough, Brian also committed some changes to CQL::Parser that provide toLucene() functionality for converting CQL queries into queries that can be passed off to Lucene. Thanks Brian!
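
To give a flavor of what that conversion involves, here's a toy sketch in Python (emphatically not the Perl module's actual implementation): a flat conjunction of index/term pairs in CQL maps fairly directly onto Lucene's field:term syntax.

```python
# toy CQL -> Lucene converter: handles only flat queries like
#   title = dog and creator = smith
# the real CQL (and CQL::Parser's toLucene) handles nesting, relations,
# prefix maps, and much more
def cql_to_lucene(cql):
    clauses = []
    for clause in cql.split(' and '):
        index, _, term = clause.partition('=')
        clauses.append('%s:%s' % (index.strip(), term.strip()))
    return ' AND '.join(clauses)

print(cql_to_lucene('title = dog and creator = smith'))
# → title:dog AND creator:smith
```

The hard part, which CQL::Parser does for you, is parsing arbitrary CQL into a tree first; the Lucene serialization falls out of walking that tree.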

Net::OAI::Harvester v1.0

I got an email from Thorsten Schwander at LANL about a bug in Net::OAI::Harvester when using a custom metadata handler with the auto-resumption token handling code. This was the first I'd heard of anyone using the custom metadata handling feature in N:O:H, so I was pleased to learn it's getting some use. Thorsten was kind enough to send a patch, so a new version is on its way around the CPAN mirrors. While it's hardly a major change, it bumps the version from 0.991 to 1.0. It's been over 2 years since N:O:H was first released, and it's been pretty stable for the past year.
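
For those unfamiliar with OAI-PMH, the auto-resumption logic at issue is the loop that keeps issuing ListRecords requests until the server stops handing back a resumptionToken. A sketch of that loop in Python (the fetch function is stubbed out here; a real harvester like N:O:H issues HTTP requests with verb=ListRecords and a metadataPrefix, and runs each response through the metadata handler):

```python
def harvest(fetch):
    """Collect all records, following OAI-PMH resumption tokens.

    fetch(token) stands in for an HTTP ListRecords request; it returns
    (records, next_token), where next_token is None on the final chunk.
    """
    records = []
    token = None
    while True:
        chunk, token = fetch(token)
        records.extend(chunk)
        if token is None:
            return records

# a stub server that hands out its results in two chunks
def fake_fetch(token):
    if token is None:
        return ['rec1', 'rec2'], 'page-2'
    return ['rec3'], None

print(harvest(fake_fetch))   # all three records, gathered across two requests
```

The bug class Thorsten hit lives exactly at this seam: state that a custom handler holds has to survive across those repeated fetches.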

google's map api

Adrian pointed out at the last ChiPy meeting that a formal API for Google Maps was in the works…but I had no idea it was this close.

After you've got an authentication key for your site's directory, all you need to do to embed a map in your page is include a JavaScript library source URL directly from Google, create a <div> tag with an id (say "map"), and add some JavaScript to your page.

    var map = new GMap( document.getElementById("map") );
    map.addControl( new GSmallMapControl() );
    map.centerAndZoom( new GPoint(-88.316385,42.247090), 4);

This took literally 2 minutes to do, if that. It's a bit tedious that the key is only good on a per-directory basis, but I guess this is because of hosted blogging sites where different users have different directories under the same hostname.

update: I guess I’m not the only one who finds the per-directory limit to be kind of a hassle.