SemanticProxy
I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it'll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple, it's just a GET:
GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}
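As a rough sketch, the GET above could be issued from Python's standard library like this. The function names, the `fmt` parameter, and the timeout are made up for illustration; only the URL pattern comes from the line above, and the sketch assumes the target URL is appended as-is, without percent-encoding.

```python
from urllib.request import urlopen

# Base of the pattern shown above: /processurl/{key}/rdf/{url}
SERVICE = "http://service.semanticproxy.com/processurl"


def semanticproxy_url(api_key, page_url, fmt="rdf"):
    """Build the request URL (hypothetical helper).

    Only "rdf" appears in the pattern above; any other fmt value is a guess.
    The page URL is appended unencoded, which is an assumption.
    """
    return f"{SERVICE}/{api_key}/{fmt}/{page_url}"


def fetch(api_key, page_url, fmt="rdf", timeout=30):
    """Issue the GET and return the raw response body as bytes."""
    with urlopen(semanticproxy_url(api_key, page_url, fmt), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch("YOUR-KEY", "http://onebiglibrary.net")` would then return the RDF/XML for that page, assuming the service is reachable and the key is valid.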
Here's an example of some turtle you can get for my friend Dan's blog. Obviously there's a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn't take long to notice that most of the resource types are in the namespace:
http://s.opencalais.com/1/type/em/e/
For example:
- http://s.opencalais.com/1/type/em/e/Person
- http://s.opencalais.com/1/type/em/e/Country
- http://s.opencalais.com/1/type/em/e/Company
And most of these resources have a property which seems to assign a literal string label to the resource:
http://s.opencalais.com/1/pred/name
It's kind of a bummer that these vocabulary terms don't resolve, because it would be sweet to get a bigger-picture look at their vocabulary.
At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) which you feed a URL and it'll munge through the RDF and print out the recognized entities:
ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
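The script itself isn't reproduced here, so what follows is only a minimal sketch of how the two facts above (the type namespace and the name predicate) could be combined; it is not the actual entities.py. The function names are hypothetical. The core loop works on any iterable of (subject, predicate, object) triples, which is what iterating an rdflib `Graph` yields, and `load_triples` assumes rdflib is installed and that the input is RDF/XML.

```python
import sys

TYPE_NS = "http://s.opencalais.com/1/type/em/e/"
NAME = "http://s.opencalais.com/1/pred/name"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def entities(triples):
    """Yield (type, name) pairs for resources typed in the Calais entity namespace."""
    triples = list(triples)
    types = {}
    # First pass: remember the short type name of each entity resource.
    for s, p, o in triples:
        if str(p) == RDF_TYPE and str(o).startswith(TYPE_NS):
            types[s] = str(o)[len(TYPE_NS):]
    # Second pass: pair each name literal with the type of its subject.
    for s, p, o in triples:
        if str(p) == NAME and s in types:
            yield types[s], str(o)


def load_triples(source):
    """Parse RDF/XML from a URL or file path with rdflib (third-party)."""
    from rdflib import Graph
    graph = Graph()
    graph.parse(source, format="xml")
    return graph


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: the argument is a SemanticProxy RDF URL or a saved RDF/XML file.
    for kind, name in entities(load_triples(sys.argv[1])):
        print(f"a {kind} named {name}")
```

Keeping the filtering separate from the parsing means the same function can be tried on a handful of hand-written tuples before pointing it at real data.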
Quite easy and impressive IMHO. One thing missing from this output is the URIs that identify the various recognized resources, like Dan's:
http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f
Like the vocabulary URIs, it doesn't resolve (at least outside the Reuters media empire). Sure would be nice if it did. It's got the fact that it's a person cooked into it (pershash)… but otherwise seems to be just a simple hashing algorithm applied to the string "Dan Chudnov".
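One small thing that can be checked with Python's standard library: the identifier in that URI parses as a version 3 UUID, which is the name-based, MD5-hashed kind, so the hashing guess looks plausible. The sketch below only inspects the identifier. The namespace and exact input string Calais uses are unknown, so the `uuid3` call uses an arbitrary namespace purely to illustrate how name-based UUIDs behave and is not expected to reproduce the identifier above. The helper name is hypothetical.

```python
import uuid
from urllib.parse import urlparse


def describe_calais_uri(uri):
    """Split the URI path into its prefix (e.g. pershash-1) and the UUID version."""
    prefix, _, ident = urlparse(uri).path.strip("/").partition("/")
    return prefix, uuid.UUID(ident).version


dan = "http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f"
print(describe_calais_uri(dan))  # ('pershash-1', 3)

# Illustration only: NAMESPACE_URL is an arbitrary choice, not what Calais uses.
guess = uuid.uuid3(uuid.NAMESPACE_URL, "Dan Chudnov")
# Name-based UUIDs are deterministic: same namespace and name, same UUID.
print(guess == uuid.uuid3(uuid.NAMESPACE_URL, "Dan Chudnov"))
```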
I didn't actually spend any time looking at the licensing issues around using the service. I've heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn't exactly encouraging… but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.
If you want to take this entities.py for a spin and can't be bothered to download it, just drop into #code4lib and ask #zoia for entities:
14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR Review Group, a City York, a EmailAddress wtd@pobox.com, a Person Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a Person William Denton, a Person Barbara Tillett, a Organization Congress, a Organization Open Content Alliance, a Organization York \nUniversity'