I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple; it’s just a GET:

GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}
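If you want to build that URL from Python, here is a minimal sketch. The function name and the `encode` option are my own inventions for illustration; the pattern above doesn't say whether the target URL needs percent-encoding, so the sketch leaves it untouched by default.

```python
from urllib.parse import quote

# Base of the GET pattern shown above.
SERVICE = "http://service.semanticproxy.com/processurl"


def semanticproxy_url(key, url, fmt="rdf", encode=False):
    """Build the request URL for the given API key and target URL.

    `encode` is an assumption: it percent-encodes the target URL in case
    the service expects that. By default the URL is appended as-is.
    """
    target = quote(url, safe="") if encode else url
    return f"{SERVICE}/{key}/{fmt}/{target}"


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder, not a real API key.
    print(semanticproxy_url("YOUR_KEY", "http://onebiglibrary.net"))
```

The resulting string can be handed to whatever HTTP client you like, or straight to an RDF parser that accepts URLs.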

Here’s an example of some Turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace: http://s.opencalais.com/1/type/em/e/

For example:

  • http://s.opencalais.com/1/type/em/e/Person
  • http://s.opencalais.com/1/type/em/e/Country
  • http://s.opencalais.com/1/type/em/e/Company

And most of these resources have a property which seems to assign a literal string label to the resource:


It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.

At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) which you feed a URL and it’ll munge through the RDF and print out the recognized entities:

ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
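The gist of a script like that can be sketched in a few lines. This is not the actual entities.py, just an illustration of the two facts above: look for resources whose rdf:type sits in the Calais entity namespace, then look up their label. The function names are made up, and the label property is passed in as a parameter rather than hard-coded.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENTITY_NS = "http://s.opencalais.com/1/type/em/e/"


def entities(triples, label_predicate):
    """Return (type name, label) pairs for recognized entities.

    `triples` is any iterable of (subject, predicate, object) items.
    Every term is converted with str() so plain strings and RDF library
    term objects are compared the same way.
    `label_predicate` is the URI of the property carrying the label;
    it is a parameter because this sketch does not assume its value.
    """
    triples = [(str(s), str(p), str(o)) for s, p, o in triples]
    # Map each subject to its label (last one wins if there are several).
    labels = {s: o for s, p, o in triples if p == str(label_predicate)}
    found = []
    for s, p, o in triples:
        if p == RDF_TYPE and o.startswith(ENTITY_NS) and s in labels:
            # Strip the namespace so only e.g. "Person" remains.
            found.append((o[len(ENTITY_NS):], labels[s]))
    return found


def describe(found):
    """Format pairs the way the output above reads."""
    return [f"a {kind} named {name}" for kind, name in found]
```

Since iterating an rdflib Graph yields (subject, predicate, object) triples, a Graph that has parsed the service's RDF/XML can be passed in directly as `triples`.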

Quite easy and impressive IMHO. One thing missing from this output is the set of URIs that identify the recognized resources, like Dan’s:


Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash) … but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.

I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.

If you want to take this entities.py for a spin and can’t be bothered to download it, just drop into #code4lib and ask #zoia for entities:

14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR 
              Review Group, a City York, a EmailAddress wtd@pobox.com, a Person 
              Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a 
              Person William Denton, a Person Barbara Tillett, a Organization 
              Congress, a Organization Open Content Alliance, a Organization 
              York \nUniversity'