calais and ocr newspaper data
Like you, I’ve been reading about the new Reuters Calais Web Service.
The basic gist is you can send the service text and get back machine
readable data about recognized entities (personal names, state/province
names, city names, etc). The response format is kind of interesting
because it’s RDF that uses a bunch of homespun vocabularies.
At work Dan, Brian and I have been working on ways to map document-centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation and suggested we might want to take a look at it in the context of this work. It’s a very different approach, in that Calais is doing natural language processing and we are instead looking for patterns in the structure of XML. But the end result is the same: an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for NLP and Calais…
To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns an RDF graph, courtesy of Dan’s rdflib:
from calais import calais_graph
graph = calais_graph(content)
This depends on you getting a Calais license key and stashing it away in ~/.calais. I wrote a couple of sample scripts that use calais.py to do things like output all the personal names found in the text. For example, here’s the people script:
from calais import calais_graph
from sys import argv

# read the OCR text from the file named on the command line
filename = argv[1]
content = open(filename).read()

# send the text to Calais and get back an rdflib graph
g = calais_graph(content)

# select the name of every entity typed as ct:People
sparql = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
PREFIX cp: <http://s.opencalais.com/1/pred/>
SELECT ?name
WHERE {
    ?subject rdf:type ct:People .
    ?subject cp:name ?name .
}
"""

for row in g.query(sparql):
    print(row[0])
Notice how the content is sent to Calais, the graph comes back, and then a SPARQL query is executed on it. Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is):
ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
McKmley Edwin W. Joy A. Musto JOHN D. SPRECKELS George Dlxoh Le Roy Bryan Charles P. Braslan Siegerfs Angostura Bitters James Stafford Herbert Putnam H. G. Pond Charles F. Joy Santa Rosa Allen S. Qlmsted Pptter Palmer
Clearly there are some errors, but you could imagine a ranked list of these names as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of the cities script:
ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
Valencia San Jose Seattle Newport Santa Clara St. Louis New York Haifa Venice Rochester Fremont San Francisco San Francisco Chicago Oakland Los Angeles Fresno Watsonville Philadelphia Washington CHICAGO
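The ranked list idea mentioned above could be prototyped with a simple tally. This is only a sketch: rank_names and the sample pages are hypothetical, and in practice each page’s list would come from running a query like the people one over that page’s graph.

```python
from collections import Counter


def rank_names(pages):
    """Tally how many pages each extracted name appears on.

    `pages` is an iterable of name lists, one list per OCR page
    (hypothetical input; not part of calais.py).
    """
    counts = Counter()
    for names in pages:
        # count each name once per page, so a name repeated on one
        # page does not outweigh names spread across many pages
        counts.update(set(names))
    # most frequent names first; one-off OCR noise sinks to the tail
    return counts.most_common()


# made-up sample: three pages, with OCR noise that appears only once
pages = [
    ["Herbert Putnam", "McKmley"],
    ["Herbert Putnam", "James Stafford"],
    ["Herbert Putnam", "James Stafford", "Pptter Palmer"],
]
for name, count in rank_names(pages):
    print(name, count)
```

Counting per page rather than per mention is a design choice here; either would push the one-off misreadings toward the tail.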
The city results are not too shabby. If you want to try this out, install rdflib, and then grab calais.py, the sample scripts, and the OCR samples from my bzr repo:
bzr branch http://web.archive.org/web/20101217003936/http://inkdroid.org/bzr/calais/
If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
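The two bits of housekeeping mentioned in this post, reading the license key from ~/.calais and unwrapping the escaped RDF, can be sketched roughly as follows. This is not the actual calais.py code: the function names, the `<string>` envelope element, and the sample payload are assumptions made for illustration, and the real response may be shaped differently.

```python
import os
import xml.etree.ElementTree as ET


def read_license_key(path="~/.calais"):
    """Return the license key stored in a file, stripped of whitespace."""
    with open(os.path.expanduser(path)) as f:
        return f.read().strip()


def extract_rdf(envelope):
    """Pull the escaped RDF document out of an XML envelope.

    Assumes the envelope is a single element whose text content is
    the RDF/XML with its markup escaped (&lt; and &gt;). The XML
    parser unescapes those entities while reading the text node.
    """
    root = ET.fromstring(envelope)
    return root.text


# hypothetical envelope, made up for this example
envelope = (
    '<string>'
    '&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;'
    '&lt;/rdf:RDF&gt;'
    '</string>'
)
rdf = extract_rdf(envelope)
print(rdf)
```

The unescaped string could then be handed to rdflib, for example with Graph().parse(data=rdf, format="xml").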