I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and give you back some HTML or RDF/XML. The call is pretty simple, it’s just a GET:
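Something like this little Python sketch, say. The host name and the processurl/key/format/url path pattern here are from memory, so treat them as a guess and check them against the Calais docs:

```python
from urllib.request import urlopen  # for the actual GET

API_KEY = "your-calais-api-key"  # placeholder

def semanticproxy_url(url, fmt="rdf"):
    """Build the SemanticProxy GET URL for a page you want analyzed.
    The path pattern is from memory, not the official docs."""
    return "http://service.semanticproxy.com/processurl/%s/%s/%s" % (
        API_KEY, fmt, url)

# the call itself is then just:
#   rdfxml = urlopen(semanticproxy_url("http://onebiglibrary.net")).read()
print(semanticproxy_url("http://onebiglibrary.net"))
```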
Here’s an example of some turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace:
And most of these resources have a property which seems to assign a literal string label to the resource:
It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.
At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) that you feed a URL, and it’ll munge through the RDF and print out the recognized entities:
ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
Quite easy, and impressive IMHO. One thing that is missing from this output is the URIs that identify the various resources that are recognized, like Dan’s:
Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash)…but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.
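Just to illustrate what “a simple hashing algorithm” could look like, here’s a guess; there’s no reason to think Calais actually uses MD5, or this exact input string:

```python
import hashlib

def pershash(name):
    # a made-up stand-in for whatever Calais actually hashes;
    # the real input probably mixes in the entity type too
    return hashlib.md5(name.encode("utf-8")).hexdigest()

print(pershash("Dan Chudnov"))
```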
I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.
If you want to take entities.py for a spin and can’t be bothered to download it, just drop into #code4lib and ask zoia for entities:
14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR Review Group, a City York, a EmailAddress firstname.lastname@example.org, a Person Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a Person William Denton, a Person Barbara Tillett, a Organization Congress, a Organization Open Content Alliance, a Organization York \nUniversity'