calais and ocr newspaper data

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan’s rdflib:

  import calais
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script. note, the angly brackets are missing from the sparql prefixes intentionally, since they don’t render properly (yet) in wordpress.

  from calais import calais_graph
  from sys import argv

  filename = argv[1]
  content = file(filename).read()
  g = calais_graph(content)

  sparql = """
          PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
          PREFIX ct: http://s.opencalais.com/1/type/em/e/
          PREFIX cp: http://s.opencalais.com/1/pred/
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """

  for row in g.query(sparql):
      print row[0]

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348 
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348 
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://web.archive.org/web/20101217003936/http://inkdroid.org/bzr/calais/

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.