I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple; it’s just a GET:
GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}
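As a rough illustration, the request pattern above could be wrapped like this in Python. This is only a sketch: the helper names `semanticproxy_url` and `fetch` are made up for the example, and it assumes the target URL is appended verbatim, since the post doesn't say whether it needs to be percent-encoded.

```python
from urllib.request import urlopen

# Base of the request pattern quoted in the post.
BASE = "http://service.semanticproxy.com/processurl"


def semanticproxy_url(key, url):
    """Build the RDF request URL: {BASE}/{key}/rdf/{url}.

    Assumption: the target URL is appended as-is, without encoding.
    """
    return f"{BASE}/{key}/rdf/{url}"


def fetch_rdf(key, url, timeout=30):
    """Issue the GET and return the response body as bytes."""
    with urlopen(semanticproxy_url(key, url), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_rdf("YOUR-KEY", "http://onebiglibrary.net")` would issue the GET; nothing here touches the network until that function is called.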
Here’s an example of some turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace: http://s.opencalais.com/1/type/em/e/
For example:
http://s.opencalais.com/1/type/em/e/Person
http://s.opencalais.com/1/type/em/e/Country
http://s.opencalais.com/1/type/em/e/Company
And most of these resources have a property which seems to assign a literal string label to the resource:
http://s.opencalais.com/1/pred/name
It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.
At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) that takes a URL, munges through the RDF, and prints out the recognized entities:
ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
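For readers who want the gist without the download, the two facts above (the type namespace and the name predicate) are enough to sketch the extraction. What follows is not the actual entities.py, just a minimal approximation: the function names `entities` and `load_graph` are hypothetical, and it assumes each entity carries an `rdf:type` in the Calais namespace plus a `name` literal on the same resource.

```python
from collections import defaultdict

# Namespace and predicate quoted in the post.
ENTITY_NS = "http://s.opencalais.com/1/type/em/e/"
NAME = "http://s.opencalais.com/1/pred/name"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def entities(triples):
    """Return sorted (type, name) pairs for resources typed in ENTITY_NS.

    `triples` is any iterable of (subject, predicate, object). An
    rdflib.Graph qualifies: iterating it yields triples, and its
    URIRef/Literal terms are str subclasses.
    """
    types = defaultdict(set)
    names = defaultdict(set)
    for s, p, o in triples:
        p = str(p)
        if p == RDF_TYPE and str(o).startswith(ENTITY_NS):
            # Keep only the local part, e.g. "Person".
            types[s].add(str(o)[len(ENTITY_NS):])
        elif p == NAME:
            names[s].add(str(o))
    return sorted(
        (t, n) for s, ts in types.items() for t in ts for n in names.get(s, ())
    )


def load_graph(source):
    """Parse RDF/XML from a URL or file path (requires rdflib)."""
    import rdflib  # imported lazily so entities() works without it

    graph = rdflib.Graph()
    graph.parse(source, format="xml")
    return graph
```

With rdflib installed, something like `for t, n in entities(load_graph(rdf_url)): print(f"a {t} named {n}")` would print lines in the same shape as the output above, where `rdf_url` stands for the SemanticProxy request URL.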
Quite easy and impressive IMHO. One thing missing from this output is the URIs that identify the various recognized resources, like Dan’s:
http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f
Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash), but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.
I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.
If you want to take this entities.py for a spin and can’t be bothered to download it, just drop into #code4lib and ask zoia for entities:
14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR
Review Group, a City York, a EmailAddress wtd@pobox.com, a Person
Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a
Person William Denton, a Person Barbara Tillett, a Organization
Congress, a Organization Open Content Alliance, a Organization
York \nUniversity'

2 Comments
From Alf Eaton via email (sorry I had registration turned off Alf, it’s back on)
Ed:
Tom Tague from Calais here.
Just a short note on a couple of things.
First – yes the URIs will be resolvable at the end of the year. We’re going to focus on populating interesting endpoint information and links to other linked data assets for company, geography and a few other types for that release – we’ll continue to expand the endpoints over time.
Second – yes, we really are going to publish the ontology at the end of the year. This required a little more work than we expected around naming standardization and other boring stuff – but it’s coming. We have to figure out the exact language – but we plan to make it open and usable by all.
Last – Terms of Service. We’ll be revisiting these in the next few weeks with goals of 1) improving privacy safeguards, and 2) removing ambiguity where possible. Ambiguity in TOS’s is scary – and we want to eradicate it wherever possible.
Thanks for creating and making the sample code available – we appreciate anything that can jump start people’s usage of Calais.
I’d also encourage you to jump into events and facts. While entity extraction is cool – the real power of Calais starts to become more apparent when you begin playing with the relationships contained in the source material.
Regards,