baby steps at linking library data

May 5th, 2008

Alistair wanted to have some data to demonstrate the potential of linked library data, so I quickly converted 10K MARC records (using a slightly modified version of MARC21slim2RDFDC.xsl) and rewrote the subjects as lcsh.info URIs using a few lines of python…all a bit hackish, but it got this particular job done quickly.

The rewriting of subjects is basically a transformation of:

<http://lccn.loc.gov/00009010#manifestation>
  dc:creator "Rollo, David.";
  dc:date "c2000." ;
  dc:description "Includes bibliographical references (p. 173-223) and index." ;
  dc:identifier
     "URN:ISBN:0816635463 (alk. paper)",
     "URN:ISBN:0816635471 (pbk. : alk. paper)",
     "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
  dc:language "eng" ;
  dc:publisher "Minneapolis : University of Minnesota Press," ;
  dc:subject
    "Anglo-Norman literature",
    "Benoi?t, de Sainte-More, 12th cent.",
    "Latin prose literature, Medieval and modern",
    "Literacy",
    "Literature and history",
    "Magic in literature." ;
  dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
  dc:type "text" .

to:

<http://lccn.loc.gov/00009010#manifestation>
    dc:creator "Rollo, David." ;
    dc:date "c2000." ;
    dc:description "Includes bibliographical references (p. 173-223) and index." ;
    dc:identifier "URN:ISBN:0816635463 (alk. paper)", "URN:ISBN:0816635471 (pbk. : alk. paper)", "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
    dc:language "eng" ;
    dc:publisher "Minneapolis : University of Minnesota Press," ;
    dc:subject <http://lcsh.info/sh85005082#concept>,
      <http://lcsh.info/sh85077482#concept>,
      <http://lcsh.info/sh85077565#concept>,
      <http://lcsh.info/sh85079624#concept>,
      <http://lcsh.info/sh86008161#concept>,
      "Benoi?t, de Sainte-More, 12th cent." ;
    dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
    dc:type "text" .

Clearly there are lots of ways to improve even this simplified description: URIs for entries in the Name Authority File, referencing identifiers as resources rather than string literals (an artifact of the XSLT transform), removing ISBD punctuation, unicode normalization (&cough;), etc.
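For anyone curious what the subject rewriting amounts to, here is a minimal sketch of the idea, not the script actually used. It assumes a lookup table from heading labels to lcsh.info URIs already exists; the table, the function names and the identifiers in it are placeholders for illustration, not real LCSH identifiers. Headings that don't match (like the personal name above) are left as literals.

```python
# Hypothetical lookup table: heading label -> lcsh.info concept URI.
# The identifiers below are placeholders, not real LCSH identifiers.
LCSH_URIS = {
    "anglo-norman literature": "http://lcsh.info/sh00000001#concept",
    "literacy": "http://lcsh.info/sh00000002#concept",
    "magic in literature": "http://lcsh.info/sh00000003#concept",
}


def normalize(label):
    """Lowercase the label and drop trailing ISBD-style punctuation."""
    return label.strip().rstrip(".").strip().lower()


def rewrite_subjects(subjects, lookup=LCSH_URIS):
    """Return Turtle object terms: <uri> on a match, else the quoted literal."""
    terms = []
    for label in subjects:
        uri = lookup.get(normalize(label))
        terms.append("<%s>" % uri if uri else '"%s"' % label)
    return terms


subjects = [
    "Anglo-Norman literature",
    "Magic in literature.",
    "Benoi?t, de Sainte-More, 12th cent.",
]
for term in rewrite_subjects(subjects):
    print(term)
```

The normalization step matters because the MARC-derived labels carry trailing punctuation ("Magic in literature.") that the authority labels presumably don't.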

You may notice I kind of fudged the URI for the book itself using the LCCN service at LC: http://lccn.loc.gov/00009010#manifestation (which does resolve, but doesn’t serve up RDF yet). I’m no FRBR expert so I’m not sure if the use of “manifestation” in this hash URI makes sense. I just wanted to distinguish between the URI for the description, and the URI for the thing being described. I think it’s high time for me to understand FRBR a lot more.

If you prefer diagrams to turtle here is a graph visualization from the w3c rdf validator for the record.

SKOS in the Context of Semantic Web Deployment

April 30th, 2008

If you happen to be in the DC area on May 8th and are interested in linked data and the practical application of semantic web technologies like RDF and OWL please join us at the Library of Congress for a presentation by Alistair Miles, key developer of SKOS, and semantic web practitioner at the University of Oxford.

Below is the announcement; I hope you can make it. Oh, and if you are really interested in this stuff we’re having some brown bag sessions later in the afternoon that you are welcome to attend, just email me at ehs [at] pobox [dot] com.

The Simple Knowledge Organization System (SKOS), in the Context of Semantic Web Deployment, Alistair Miles, University of Oxford May 8th 10am6th 11:30am, 2008, Montpelier Room, Madison Building, Library of Congress (map).

Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?

This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.

Alistair’s background is in the development of Web technologies for scientific applications. He was a research associate in the e-Science department of the Rutherford Appleton Laboratory, where he was introduced to Semantic Web technologies and first developed SKOS. He has recently moved to the University of Oxford to work on linking fruit fly genomics research data, and he hopes everything he knows about the Semantic Web will turn out to be useful after all.

MIME types and library metadata

April 23rd, 2008

While thinking about library metadata and RESTful web services I got to wondering how many application/*+xml MIME types have actually been registered. It turns out to be 120, out of the 633 other application/* MIME types.

Does it seem like a generally useful thing to be able to identify metadata representations with MIME types? Rebecca Guenther registered application/marc back in 1997. Maybe we could have application/marc+xml, application/mods+xml, application/dc+xml?

MIME types for established library metadata formats would be useful to use in applications like AtomPub implementations, or say OAI-ORE resource maps that want to identify the format of a particular resource. In general it would be useful to have in RESTful environments where content-negotiation for resources is encouraged.
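To make the content-negotiation point concrete, here is a rough sketch of how a server might pick a metadata representation from an Accept header. It assumes the suggested +xml types existed (they are not registered, they are just the suggestions above), and the parsing is deliberately simplified: it handles q-values and two wildcards, nothing more. The function names are made up for this example.

```python
# Representations a hypothetical catalog service could offer. Only
# application/marc is registered; the +xml types are the unregistered
# suggestions from this post, used here purely for illustration.
AVAILABLE = ["application/marc", "application/marc+xml", "application/mods+xml"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        if media_type:
            prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(header, available=AVAILABLE):
    """Return the first available type the client accepts, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in ("*/*", "application/*"):
            return available[0]
        if media_type in available:
            return media_type
    return None


print(negotiate("application/mods+xml;q=0.9, application/marc+xml"))
```

A real service would more likely lean on its web framework's negotiation support; the point is only that distinct MIME types are what give the client something to ask for.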

If you are curious, here is a current (as of Apr 23, 2008) list of registered MIME types that are in the application/*+xml space.

application/atom+xml
application/atomcat+xml
application/atomsvc+xml
application/auth-policy+xml
application/beep+xml
application/ccxml+xml
application/cellml+xml
application/cnrp+xml
application/conference-info+xml
application/cpl+xml
application/csta+xml
application/CSTAdata+xml
application/davmount+xml
application/dialog-info+xml
application/epp+xml
application/im-iscomposing+xml
application/kpml-request+xml
application/kpml-response+xml
application/mbms-associated-procedure-description+xml
application/mbms-deregister+xml
application/mbms-envelope+xml
application/mbms-msk-response+xml
application/mbms-msk+xml
application/mbms-protection-description+xml
application/mbms-reception-report+xml
application/mbms-register-response+xml
application/mbms-register+xml
application/mbms-user-service-description+xml
application/media_control+xml
application/mediaservercontrol+xml
application/oebps-package+xml
application/pidf+xml
application/pls+xml
application/poc-settings+xml
application/rdf+xml
application/reginfo+xml
application/resource-lists+xml
application/rlmi+xml
application/rls-services+xml
application/samlassertion+xml
application/samlmetadata+xml
application/sbml+xml
application/shf+xml
application/simple-filter+xml
application/smil+xml
application/soap+xml
application/sparql-results+xml
application/spirits-event+xml
application/srgs+xml
application/ssml+xml
application/vnd.3gpp.bsf+xml
application/vnd.3gpp2.bcmcsinfo+xml
application/vnd.adobe.xdp+xml
application/vnd.apple.installer+xml
application/vnd.avistar+xml
application/vnd.chemdraw+xml
application/vnd.criticaltools.wbs+xml
application/vnd.ctct.ws+xml
application/vnd.eszigno3+xml
application/vnd.google-earth.kml+xml
application/vnd.HandHeld-Entertainment+xml
application/vnd.informedcontrol.rms+xml
application/vnd.irepository.package+xml
application/vnd.liberty-request+xml
application/vnd.llamagraphics.life-balance.exchange+xml
application/vnd.marlin.drm.actiontoken+xml
application/vnd.marlin.drm.conftoken+xml
application/vnd.mozilla.xul+xml
application/vnd.ms-playready.initiator+xml
application/vnd.nokia.conml+xml
application/vnd.nokia.iptv.config+xml
application/vnd.nokia.landmark+xml
application/vnd.nokia.landmarkcollection+xml
application/vnd.nokia.n-gage.ac+xml
application/vnd.nokia.pcd+xml
application/vnd.oma.bcast.associated-procedure-parameter+xml
application/vnd.oma.bcast.drm-trigger+xml
application/vnd.oma.bcast.imd+xml
application/vnd.oma.bcast.notification+xml
application/vnd.oma.bcast.sgdd+xml
application/vnd.oma.bcast.smartcard-trigger+xml
application/vnd.oma.bcast.sprov+xml
application/vnd.oma.dd2+xml
application/vnd.oma.drm.risd+xml
application/vnd.oma.group-usage-list+xml
application/vnd.oma.poc.detailed-progress-report+xml
application/vnd.oma.poc.final-report+xml
application/vnd.oma.poc.groups+xml
application/vnd.oma.poc.invocation-descriptor+xml
application/vnd.oma.poc.optimized-progress-report+xml
application/vnd.oma.xcap-directory+xml
application/vnd.omads-email+xml
application/vnd.omads-file+xml
application/vnd.omads-folder+xml
application/vnd.otps.ct-kip+xml
application/vnd.poc.group-advertisement+xml
application/vnd.pwg-xhtml-print+xml
application/vnd.recordare.musicxml+xml
application/vnd.solent.sdkm+xml
application/vnd.sun.wadl+xml
application/vnd.syncml.dm+xml
application/vnd.syncml+xml
application/vnd.uoml+xml
application/vnd.wv.csp+xml
application/vnd.wv.ssp+xml
application/vnd.zzazz.deck+xml
application/voicexml+xml
application/watcherinfo+xml
application/wsdl+xml
application/wspolicy+xml
application/xcap-att+xml
application/xcap-caps+xml
application/xcap-el+xml
application/xcap-error+xml
application/xcap-ns+xml
application/xenc+xml
application/xhtml+xml
application/xmpp+xml
application/xop+xml
application/xv+xml

literals and resources

March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seem to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data
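A tiny sketch of why the literal/resource choice matters in practice. The triples are plain Python tuples, the book URIs and the second form of the name are made up for illustration, and the URI class is just a marker invented for this example (it is not from any RDF library).

```python
class URI(str):
    """Marks a string as a resource identifier rather than a literal."""


AUTHOR = URI("http://lccn.loc.gov/n79021164")

# The same two (hypothetical) books, described once with literals and once
# with a resource. The literal forms differ slightly, as legacy data often does.
literal_triples = [
    ("http://example.org/book/1", "dc:creator", "Twain, Mark, 1835-1910"),
    ("http://example.org/book/2", "dc:creator", "Twain, Mark"),
]
resource_triples = [
    ("http://example.org/book/1", "dc:creator", AUTHOR),
    ("http://example.org/book/2", "dc:creator", AUTHOR),
]


def books_by(creator, triples):
    """Return subjects whose dc:creator object equals the given value."""
    return [s for s, p, o in triples if p == "dc:creator" and o == creator]


print(books_by("Twain, Mark, 1835-1910", literal_triples))
print(books_by(AUTHOR, resource_triples))
```

With literals the match depends on every record spelling the heading identically; with a resource the two descriptions join on the identifier.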

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PPS, no that lccn url doesn’t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))

tabulator and google reader notifier oddness

March 17th, 2008

If you’ve ever tried installing the Tabulator (Tim Berners-Lee’s experimental linked-data browser) and not seen it work you may have run into the same problem as me.

On a hunch I guessed that there might be some weird interaction with another Firefox plugin — so I went through all 15 of them, disabling each one and restarting Firefox to see if Tabulator would start working. Sure enough, after I disabled Google Reader Notifier the Tabulator worked fine.

I dropped a message to public-semweb-ui, but figured it couldn’t hurt to add this here for other linked-data nerds casting about in google with the same problem.

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080207 Ubuntu/7.10 (gutsy) Firefox/2.0.0.12
Tabulator v0.8.2
Google Reader Notifier v0.4.5

Cyganiak on linked data, microformats and the semweb

March 14th, 2008

In case you missed it, Danny Ayers has a fun interview with Richard Cyganiak, who is one of the prime movers behind the Linking Open Data Project of the Semantic Web Education and Outreach Group at the W3C, and an author of Cool URIs for the Semantic Web and How to Publish Linked Data on the Web. Among other things you’ll learn some details about sindice (the semantic web search engine at DERI) which indexes (using Solr!) structured data like rdf/xml, microformats (I never noticed last.fm had microformat content) and (soon) rdfa from the world wild web. More details about Sindice can be found in an earlier podcast Paul Miller did with Eyal Oren (also at DERI).

Richard’s perspective on the past and future of the semantic web is particularly refreshing. Rather than hard selling SPARQL or even RDF his attitude seems to be to try what works now, while recognizing that the technologies that make the semantic web work may very well be different in a few years. Also there’s an interesting discussion of microformats and RDF, highlighting the strengths and weaknesses of both. Plus there is a fun side story to the LOD diagram that shows the links between various open data sets.

If you’ve ever wanted to hear more about linked-data from someone in the know now is your chance. Nice questions danja!

oai-ore post baltimore thoughts

March 13th, 2008

The recent OAI-ORE meeting was just up the road in Baltimore, so it was easy for a bunch of us from the Library of Congress to attend. I work on a team at LC that is specifically looking at the role that repositories play at the library; I’ve implemented OAI-PMH data providers and harvesters, and in the past couple of years I’ve gotten increasingly interested in semantic web technologies — so OAI-ORE is of particular interest to me. I’ve commented a bit about OAI-ORE on here before, but I figure it can’t hurt to follow in my coworker’s footsteps and summarize my thoughts after the meeting.

(BTW, above is an image of some constellations I nabbed off of wikipedia. I included it here because the repeated analogy (during the meeting) of OAI-ORE resource maps as constellations was really compelling — and quite poetic.)

The Vocabulary

It seems to me that the real innovation of the OAI-ORE effort is that it provides a lightweight RDF vocabulary for talking about aggregated resources on the web. Unfortunately I think that this kernel gets a little bit lost in the 6 specification documents that were released en masse a few months ago.

The ORE vocabulary essentially consists of three new resource types: ore:ResourceMap, ore:Aggregation, ore:AggregatedResource; and 5 new properties to use with those types: ore:describes, ore:isDescribedBy, ore:aggregates, ore:isAggregatedBy, ore:analogousTo. In addition, the Vocabulary document provides guidance on how to use a few terms from the DublinCore vocabulary: dc:creator, dc:rights, dcterms:modified, dcterms:created.

The vocabulary is small, so if I were them I would publish the vocabulary elements using hash URIs, instead of slash URIs. The reason for this is that you don’t have to jigger the web server to do a httpRange-14 style 303 correctly:

  • http://www.openarchives.org/ore/0.2/terms#Aggregation
  • http://www.openarchives.org/ore/0.2/terms#AggregatedResource
  • http://www.openarchives.org/ore/0.2/terms#ResourceMap
  • http://www.openarchives.org/ore/0.2/terms#describes
  • http://www.openarchives.org/ore/0.2/terms#isDescribedBy
  • http://www.openarchives.org/ore/0.2/terms#aggregates
  • http://www.openarchives.org/ore/0.2/terms#isAggregatedBy
  • http://www.openarchives.org/ore/0.2/terms#analogousTo
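The practical upshot of the hash URIs listed above can be shown in a few lines. This is only an illustration of how clients treat fragments, using the URIs proposed in this post rather than anything the ORE group has published.

```python
from urllib.parse import urldefrag

BASE = "http://www.openarchives.org/ore/0.2/terms"
TERMS = ["Aggregation", "AggregatedResource", "ResourceMap", "describes",
         "isDescribedBy", "aggregates", "isAggregatedBy", "analogousTo"]

hash_uris = ["%s#%s" % (BASE, term) for term in TERMS]

# A client strips the fragment before making the HTTP request, so every
# hash URI above dereferences to the same document.
documents = {urldefrag(uri).url for uri in hash_uris}
print(documents)
```

Eight term URIs, one document to serve, and no 303 redirects to configure. With slash URIs each term would be its own request that the server has to redirect.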

Also, I think ore:AggregatedResource is currently missing from the rdf/xml vocabulary, so it should be added. Also ore:isDescribedBy seems to be commented out.

There is a lot of redundancy between the Abstract Data Model and the Vocabulary documents–so I would recommend collapsing them down into a single, succinct document. This is in keeping with the DRY principle and will have the added benefit of making it easier for newbies to hit the ground running (not having to wade through multiple docs and mentally reconcile them). I could understand having a separate Abstract Data Model document if it were totally divorced from the web and semantic web technologies like RDF, but it’s not.

The Graph

The OAI-ORE effort seemed to be mostly driven by a desire to take harvesting agents the last mile to the actual repository resources themselves–enabling digital library objects (in addition to their metadata) to be harvested from repositories (using HTTP); and to be referenced from other contexts (say objects in other repositories). This desire was born out of real, hard won experience with harvesting metadata records, and marked a shift from metadata-centric harvesting to resource-centric harvesting.

In addition OAI-ORE marks a departure from predictable and mind-numbing arguments about SIP formats (METS, DIDL, FOXML, IEEE LOM, XFDU, etc). Yet as soon as we have our shiny new OAI-ORE vocabulary we have to learn yet-another-packaging-format, this time one built on top of Atom.

First, let me just say I’m a big fan of RFC 4287, in particular how it is used in the RESTful Atom Publishing Protocol (RFC 5023). I also think it makes sense to have an Atom serialization for OAI-ORE resource maps — assuming there is a GRDDL transform for turning it into RDF. But the workshop in Baltimore seemed to stress that the Atom serialization was the only way to do OAI-ORE, and didn’t emphasize that there are in fact lots of ways of representing RDF graphs on the web. For example GRDDL allows you to associate arbitrary XML with an XSLT transform to extract a RDF graph. And you could encode your RDF graph directly with RDFa, N3, Turtle, ntriples, or RDF/XML.
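To show how little is needed once you think of a resource map as a graph, here is a sketch that writes one out as N-Triples, about the simplest of the serializations mentioned above. The resource map and aggregation URIs are hypothetical, the ORE namespace is the hash form suggested earlier in this post, and the serializer only handles URI nodes (no literals or blank nodes).

```python
ORE = "http://www.openarchives.org/ore/0.2/terms#"  # hash form suggested in this post

# Hypothetical resource map describing an aggregation of two files.
REM = "http://example.org/rem/1"
AGG = "http://example.org/aggregation/1"
triples = [
    (REM, ORE + "describes", AGG),
    (AGG, ORE + "aggregates", "http://example.org/paper.pdf"),
    (AGG, ORE + "aggregates", "http://example.org/paper.ps"),
]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


print(to_ntriples(triples))
```

The same three statements could just as well come out of an Atom document via GRDDL or out of a splash page via RDFa; the graph is the constant, the packaging is not.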

Perhaps there is a feeling that stressing the RDF graph too much will alienate some people who are more familiar with XML technologies. Or perhaps all these graph serialization choices could be perceived as being too overwhelming. But I think the opposite extreme of making it look like you can only use an overloaded Atom document as a means to publishing ORE resource maps is misguided, and will ultimately slow adoption. Why not encourage people to publish GRDDL transforms for METS, DIDL or mark up their “splash pages” with RDFa? This would bring the true value of the OAI-ORE work home–it’s not about yet-another-packaging format, it’s about what the various packaging formats have in common on the web.

Release Early, Release Often

In hindsight I think it would’ve been helpful for the OAI-ORE group to privately build consensus about the core OAI-ORE vocabulary (if necessary), then release that into the world wild web for discussion. Then, once the kinks were worked out and there was general understanding, the group could have moved on to issues such as discovery and serialization. As it stands the various documents were all dumped at the same time, and seem somewhat fragmented, and in places redundant. Clearly a lot of conversations have gone on that aren’t happening on the public discussion list.

I expressed interest in being part of the OAI-ORE group and was politely turned down. I’m actually kind of glad really because I also don’t want to be part of some cabal of digital library practitioners. Maybe I should’ve titled this post “Sour Grapes” :-) Seriously though, the digital library needs good practical solutions and communities of users that encourage widespread adoption and tool support. We don’t need research-ware. Having secret discussions and occasional public events that feel more like lectures than meetings isn’t a good way to encourage adoption.

Anyhow, I hope that this isn’t all seen as being too harsh. Everyone’s a critic eh? All in all there is a lot in OAI-ORE to be proud of. The effort to integrate Web Architecture into Digital Library practices is most welcome indeed. Keep up the good work y’all.

pymarc PEP-8 cleanup

February 28th, 2008

pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant.

If you are a current user of pymarc your code will most likely break, since methods like: addField() will now look like add_field(). This is a small price to pay for pythonistas who typically prefer clean, consistent and more coherent code (how’s that for alliteration?). It had to be done and I’m very grateful to gsf for taking the time to do it.
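If you have a pile of old method names to update, the renaming pattern is regular enough to sketch. This helper is only an illustration of the camelCase to lower_case_with_underscores convention, not something shipped with pymarc, and it would split runs of capitals letter by letter, so check the results against the new API before trusting them.

```python
import re


def to_pep8(name):
    """Convert a camelCase method name to lower_case_with_underscores."""
    # Insert an underscore before every capital that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(to_pep8("addField"))  # add_field
```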

Another big thing is that we’ve switched from using subversion to bzr for revision control. Initially it seemed like a lightweight way for gsf and me to collaborate without monkeying with svn authentication (again)…and to learn the zen of distributed revision control. We both liked it so much that Gabe is hosting the pymarc repository at http://fruct.us/bzr/pymarc.

So if you like the latest/greatest/shiniest, and/or want to contribute some of your own changes to pymarc:

  % bzr branch http://fruct.us/bzr/pymarc
  % # hack, hack, hack, hackety, hack
  % bzr commit
  % bzr send --mail-to gsf@fruct.us --message "Gabe, I added a jammies method to the record object!"
  % # or publish your own repo and point us at it :-)

oai-ore and the shadow web

February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seems to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?”. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oai-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

calais and ocr newspaper data

February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan’s rdflib:

  from calais import calais_graph
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script.

  from calais import calais_graph
  from sys import argv

  # read the OCR text named on the command line
  filename = argv[1]
  content = open(filename).read()

  # send the text to calais and get an rdflib graph back
  g = calais_graph(content)

  # select the name of every entity that calais typed as a person
  sparql = """
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
          PREFIX cp: <http://s.opencalais.com/1/pred/>
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """

  # each result row holds a single binding, the name
  for row in g.query(sparql):
      print(row[0])

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO
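The ranked-list idea mentioned earlier is easy to sketch with the cities output above. This just counts the names on this one page; folding case before counting is my own assumption about a sensible first normalization, not something the sample scripts do.

```python
from collections import Counter

# The city names exactly as they came back for this page.
cities = [
    "Valencia", "San Jose", "Seattle", "Newport", "Santa Clara", "St. Louis",
    "New York", "Haifa", "Venice", "Rochester", "Fremont", "San Francisco",
    "San Francisco", "Chicago", "Oakland", "Los Angeles", "Fresno",
    "Watsonville", "Philadelphia", "Washington", "CHICAGO",
]

# Fold case so that "Chicago" and "CHICAGO" are counted together.
counts = Counter(city.casefold() for city in cities)

# Most frequent first; one-off OCR oddities end up at the bottom.
for name, n in counts.most_common():
    print(n, name)
```

Run over a million pages instead of one, the same counter would push the frequently recognized names to the top and leave the misreadings out on the long tail.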

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
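The extract-and-unescape step looks roughly like the sketch below. The envelope element name here is an assumption for illustration (it isn't taken from the Calais documentation), and calais.py may well do it differently; the point is that an XML parser does the unescaping for you when it reads the text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical envelope: the element name is an assumption, not taken from
# the Calais documentation. The payload is RDF/XML escaped as text.
response = (
    '<string>&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    '&gt;&lt;/rdf:RDF&gt;</string>'
)


def extract_rdf(envelope):
    """Return the RDF/XML carried as escaped text inside the envelope."""
    # The XML parser decodes &lt; and &gt; while reading the text node.
    return ET.fromstring(envelope).text


rdf_xml = extract_rdf(response)
print(rdf_xml)
```

The resulting string is ordinary RDF/XML again and can be handed to an RDF parser as usual.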