calais and ocr newspaper data

Like you, I’ve been reading about the new Reuters Calais Web Service. The basic gist is that you can send the service text and get back machine-readable data about recognized entities (personal names, state/province names, city names, etc.). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document-centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same: an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for NLP and Calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns an RDF graph, courtesy of Dan’s rdflib:

  from calais import calais_graph
  graph = calais_graph(content)

This is dependent on you getting a Calais license key and stashing it away in ~/.calais. I wrote a couple of sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example, here’s the people script:

  from calais import calais_graph
  from sys import argv

  # read the OCR text for the page named on the command line
  filename = argv[1]
  content = open(filename).read()

  # send the text to Calais and get an rdflib graph back
  g = calais_graph(content)

  # select the name of every entity Calais typed as ct:People
  sparql = """
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
          PREFIX cp: <http://s.opencalais.com/1/pred/>
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """

  # each result row has one column, the name
  for row in g.query(sparql):
      print(row[0])

Notice how the content is sent to Calais, the graph comes back, and then a SPARQL query is executed on it. Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is):

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors (OCR garble like “McKmley” and “Pptter Palmer”, and non-people like “Santa Rosa” and “Siegerfs Angostura Bitters”), but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais
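
The ranked-list idea from above could look something like this. It’s only a rough sketch in plain Python, assuming you’ve already collected the names found on each page into lists. The `pages` data and the `rank_names` helper are made up for illustration; they aren’t part of calais.py. It counts each name once per page and folds case, so that CHICAGO and Chicago would land in the same bucket.

```python
from collections import Counter


def rank_names(pages):
    """Rank names by the number of pages they appear on, most common first.

    `pages` is an iterable of lists of names, one list per page
    (a hypothetical input shape, not something calais.py produces directly).
    """
    counts = Counter()
    display = {}  # normalized key -> first spelling we saw
    for names in pages:
        seen = set()
        for name in names:
            # collapse whitespace and fold case so variants share one key
            key = " ".join(name.split()).casefold()
            display.setdefault(key, name)
            seen.add(key)
        # a name counts once per page, however often it repeats there
        counts.update(seen)
    return [(display[key], n) for key, n in counts.most_common()]


# made-up per-page results standing in for the output of the people script
pages = [
    ["Herbert Putnam", "McKmley", "Bryan"],
    ["HERBERT PUTNAM", "Bryan"],
    ["Herbert Putnam", "Pptter Palmer"],
]

ranked = rank_names(pages)
for name, n in ranked:
    print(n, name)
```

With enough pages the one-off OCR mistakes should sink to the bottom of a list like this, which is the long tail I was talking about.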

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
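
For the curious, here’s roughly what that extract-and-unescape step amounts to. This is a sketch using the standard library, not the actual code in calais.py, and the envelope element name and namespace below are invented since I’m only illustrating the shape of the problem. The point is that an XML parser unescapes the text content for you, so pulling the text out of the envelope gives you RDF/XML you can parse again.

```python
import xml.etree.ElementTree as ET


def extract_rdf(envelope_xml):
    """Return the RDF/XML carried as escaped text inside an XML envelope."""
    root = ET.fromstring(envelope_xml)
    # the parser has already turned &lt; &gt; &amp; back into < > &,
    # so the envelope's text content is the RDF document itself
    return "".join(root.itertext()).strip()


# hypothetical envelope: the element name and namespace are assumptions
envelope = (
    '<string xmlns="http://example.org/envelope">'
    '&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;'
    '&lt;/rdf:RDF&gt;'
    '</string>'
)

rdf_xml = extract_rdf(envelope)
print(rdf_xml)
```

The string that comes back is what you would then hand to rdflib to build the graph.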


7 Responses to “calais and ocr newspaper data”

  1. Dave Says:

    This is real cool Ed. Mmmm, Angostura Bitters - I think I’m going to make a Manhattan.

  2. Pete Says:

    Tried the example, but seem to be having a problem with rdflib… any ideas?

    $ ./calais.py data/ndnp:1396148
     
    Traceback (most recent call last):
      File "./calais.py", line 127, in ?
        print g.serialize(format='n3')
      File "/opt/local/lib/python2.4/site-packages/rdflib/Graph.py", line 414, in serialize
        return serializer.serialize(destination, base=base, encoding=encoding)
      File "/opt/local/lib/python2.4/site-packages/rdflib/syntax/serializer.py", line 28, in serialize
        self.serializer.serialize(stream, base=base, encoding=encoding)
      File "/opt/local/lib/python2.4/site-packages/rdflib/syntax/serializers/N3Serializer.py", line 17, in serialize
        self._ser(self.store, stream)
      File "/opt/local/lib/python2.4/site-packages/rdflib/syntax/serializers/N3Serializer.py", line 22, in _ser
        for s, p, o in store:
    ValueError: need more than 2 values to unpack

  3. Pete Says:

    Another one:

    $ python people.py ndnp:774348        
     
    Traceback (most recent call last):
      File "people.py", line 22, in ?
        for row in g.query(sparql):
    AttributeError: 'ConjunctiveGraph' object has no attribute 'query'

  4. ed Says:

    What version of rdflib are you running?

    uqbar:~/bzr/calais ed$ python
    Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04) 
    [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import rdflib
    >>> print rdflib.__version__
    2.4.0

  5. Pete Says:

    Thanks, looks like I just have some python path cleaning to do…

    >>> import rdflib
    >>> print rdflib.__version__
    2.3.1

  6. ed Says:

    Yeah, you’ll need 2.4.0 (eikeon is here sitting next to me telling me).

    easy_install -U rdflib==2.4.0

  7. Open Libraries – Mining for Meaning Says:

    [...] text, but it also works for extracting meaning from the most recent weblog posting to historic newspapers newly scanned into text via Optical Character Recognition (OCR). Since human-created metadata and [...]
