I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it'll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple, it's just a GET:

GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}
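As a rough sketch, here is how that request URL could be put together in Python. The function name is made up for illustration, and the target URL is dropped into the template literally, as in the pattern above; whether the service expects it percent-encoded is something I'm assuming, not something I've confirmed.

```python
# Template copied from the GET pattern above.
TEMPLATE = "http://service.semanticproxy.com/processurl/{key}/rdf/{url}"


def semanticproxy_url(key, url):
    """Fill the SemanticProxy template with an API key and a target URL.

    Hypothetical helper: the target URL is inserted as-is (assumption:
    no percent-encoding is required by the service).
    """
    return TEMPLATE.format(key=key, url=url)


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real key.
    print(semanticproxy_url("YOUR_API_KEY", "http://onebiglibrary.net"))
```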

Here's an example of some Turtle you can get for my friend Dan's blog. Obviously there's a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn't take long to notice that most of the resource types are in the namespace: http://s.opencalais.com/1/type/em/e/

For example:

  • http://s.opencalais.com/1/type/em/e/Person
  • http://s.opencalais.com/1/type/em/e/Country
  • http://s.opencalais.com/1/type/em/e/Company

And most of these resources have a property which seems to assign a literal string label to the resource:

http://s.opencalais.com/1/pred/name

It's kind of a bummer that these vocabulary terms don't resolve, because it would be sweet to get a bigger picture look at their vocabulary.

At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) which you feed a URL, and it'll munge through the RDF and print out the recognized entities:

ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
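The gist of that approach can be sketched in a few lines. This is not the actual entities.py, just a minimal reconstruction from the two facts above: keep resources whose rdf:type sits in the entity namespace, and print the literal attached via the name predicate. The function names are hypothetical. The filtering works on any iterable of (subject, predicate, object) triples, which is what iterating over an rdflib Graph yields.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENTITY_NS = "http://s.opencalais.com/1/type/em/e/"
NAME = "http://s.opencalais.com/1/pred/name"


def entities(triples):
    """Return sorted (type, name) pairs for resources typed in ENTITY_NS."""
    triples = list(triples)
    # First pass: remember one name literal per subject.
    names = {}
    for s, p, o in triples:
        if str(p) == NAME:
            names.setdefault(str(s), str(o))
    # Second pass: keep subjects whose rdf:type is in the entity namespace.
    found = set()
    for s, p, o in triples:
        if str(p) == RDF_TYPE and str(o).startswith(ENTITY_NS) and str(s) in names:
            found.add((str(o)[len(ENTITY_NS):], names[str(s)]))
    return sorted(found)


def load_graph(rdf_url):
    """Parse RDF/XML from a URL with rdflib (third-party, imported lazily)."""
    import rdflib
    graph = rdflib.Graph()
    graph.parse(rdf_url, format="xml")
    return graph


def report(triples):
    """Print entities in the same shape as the output above."""
    for kind, name in entities(triples):
        print(f"a {kind} named {name}")
```

Calling `report(load_graph(rdf_url))` with a SemanticProxy RDF URL would be the equivalent of running the script, assuming the service hands back RDF/XML at that address.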

Quite easy and impressive IMHO. One thing that is missing from this output is the URIs that identify the various resources that are recognized, like Dan's:


http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f

Like the vocabulary URIs, it doesn't resolve (at least outside the Reuters media empire). Sure would be nice if it did. It's got the fact that it's a person cooked into it (pershash)... but otherwise seems to be just a simple hashing algorithm applied to the string "Dan Chudnov".

I didn't actually spend any time looking at the licensing issues around using the service. I've heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn't exactly encouraging ... but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.

If you want to take this entities.py for a spin and can't be bothered to download it, just drop into #code4lib and ask #zoia for entities:

14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR 
              Review Group, a City York, a EmailAddress wtd@pobox.com, a Person 
              Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a 
              Person William Denton, a Person Barbara Tillett, a Organization 
              Congress, a Organization Open Content Alliance, a Organization 
              York \nUniversity'