<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>inkdroid &#187; reuters</title>
	<atom:link href="http://inkdroid.org/journal/tag/reuters/feed/" rel="self" type="application/rss+xml" />
	<link>http://inkdroid.org/journal</link>
	<description>$pithy_personal_mission_statement</description>
	<lastBuildDate>Wed, 28 Jul 2010 13:48:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>SemanticProxy</title>
		<link>http://inkdroid.org/journal/2008/10/27/semanticproxy/</link>
		<comments>http://inkdroid.org/journal/2008/10/27/semanticproxy/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 18:54:31 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[linguistics]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[calais]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[reuters]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=472</guid>
		<description><![CDATA[I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it&#8217;ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple, it&#8217;s just a GET: GET [...]]]></description>
			<content:encoded><![CDATA[<p>I spent half an hour goofing around with the new (to me) <a href="http://semanticproxy.opencalais.com/">SemanticProxy</a> service from Calais. You give the service a URL along with your API key, and it&#8217;ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple, it&#8217;s just a GET:</p>
<pre>
GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}
</pre>
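<p>In Python the request might look something like the sketch below. It only fills in the URL template above. <code>semanticproxy_url</code> and <code>fetch_rdf</code> are hypothetical names used for illustration, and appending the target URL as-is (rather than percent-encoding it) is an assumption, not something confirmed by the Calais documentation.</p>
<pre>
from urllib.request import urlopen

SERVICE = "http://service.semanticproxy.com/processurl"


def semanticproxy_url(key, url):
    """Fill in the template: {service}/{key}/rdf/{url}."""
    # Assumption: the target URL is appended unchanged, with no escaping.
    return "/".join([SERVICE, key, "rdf", url])


def fetch_rdf(key, url):
    """GET the RDF/XML for a page (network call, needs a real API key)."""
    with urlopen(semanticproxy_url(key, url)) as response:
        return response.read()
</pre>
<p>Only <code>fetch_rdf</code> touches the network; <code>semanticproxy_url</code> just builds the string, so it is easy to check by eye.</p>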
<p>Here&#8217;s an example of some <a href="http://inkdroid.org/data/obl.txt">turtle</a> you can get for my friend Dan&#8217;s <a href="http://onebiglibrary.net">blog</a>. Obviously there&#8217;s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn&#8217;t take long to notice that most of the resource types are in the namespace: <code>http://s.opencalais.com/1/type/em/e/</code></p>
<p>For example:</p>
<ul>
<li><code>http://s.opencalais.com/1/type/em/e/Person</code></li>
<li><code>http://s.opencalais.com/1/type/em/e/Country</code></li>
<li><code>http://s.opencalais.com/1/type/em/e/Company</code></li>
</ul>
<p>And most of these resources have a property which seems to assign a literal string label to the resource:</p>
<p>  <code>http://s.opencalais.com/1/pred/name</code> </p>
<p>It&#8217;s kind of a bummer that these vocabulary terms don&#8217;t resolve, because it would be sweet to get a bigger picture look at their vocabulary.</p>
<p>At any rate, with these two little facts gleaned from looking at the RDF for a few moments I wrote a little <a href="http://inkdroid.org/bzr/calais/entities.py">script</a> (using <a href="http://rdflib.net">rdflib</a>) which you feed a URL and it&#8217;ll munge through the RDF and print out the recognized entities:</p>
<pre>
ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
</pre>
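<p>For anyone who doesn&#8217;t want to open the script, here is a minimal sketch of how those two facts can be turned into that listing with rdflib. It is not the contents of entities.py, just one way to do the same thing; the function name <code>entities</code> is made up for illustration, and the graph is assumed to have been parsed already from the RDF/XML the service returns.</p>
<pre>
from rdflib import Graph, URIRef, RDF

# The two facts gleaned from the RDF: the entity type namespace
# and the property that carries the label.
ENTITY_NS = "http://s.opencalais.com/1/type/em/e/"
NAME = URIRef("http://s.opencalais.com/1/pred/name")


def entities(graph):
    """Yield (type, name) pairs for resources typed in the entity namespace."""
    for subject, rdf_type in graph.subject_objects(RDF.type):
        if not str(rdf_type).startswith(ENTITY_NS):
            continue  # skip types outside the entity namespace
        kind = str(rdf_type)[len(ENTITY_NS):]
        for name in graph.objects(subject, NAME):
            yield kind, str(name)


def report(rdf_xml):
    """Parse RDF/XML text and print one line per recognized entity."""
    graph = Graph()
    graph.parse(data=rdf_xml, format="xml")
    for kind, name in entities(graph):
        print("a %s named %s" % (kind, name))
</pre>
<p>Keeping <code>entities</code> separate from the parsing means it works on any rdflib graph, whether it was loaded from RDF/XML or from Turtle.</p>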
<p>Quite easy and impressive IMHO. One thing missing from this output is the URIs that identify the recognized resources, like Dan&#8217;s:</p>
<p><code>http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f</code></p>
<p>Like the vocabulary URIs, it doesn&#8217;t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It&#8217;s got the fact that it&#8217;s a person cooked into it (pershash)&#8230;but otherwise it seems to be just the output of a simple hashing algorithm applied to the string &#8220;Dan Chudnov&#8221;.</p>
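<p>One quick way to poke at the hashing guess is to parse the last path segment as a UUID with Python&#8217;s standard <code>uuid</code> module. The sketch below only inspects the identifier from the URI above; what exactly gets hashed (the name alone, or the name plus some namespace) is not something it can tell you, and the variable names are just for illustration.</p>
<pre>
import uuid

uri = "http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f"

# Split off the last two path segments: the hash type and the identifier.
hash_type, identifier = uri.rsplit("/", 2)[1:]

parsed = uuid.UUID(identifier)

# Version 3 UUIDs are name-based: they are derived from an MD5 hash.
print(hash_type, parsed.version, parsed.variant == uuid.RFC_4122)
</pre>
<p>The version field comes back as 3, the name-based MD5 flavor of UUID, which fits the idea that the identifier is a hash of a string rather than a random value. Guesses about the input could be tested with <code>uuid.uuid3</code>, but no particular namespace is known to reproduce this identifier.</p>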
<p>I didn&#8217;t actually spend any time looking at the licensing issues around using the service. I&#8217;ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about <a href="http://www.nature.com/nature/journal/v455/n7214/full/455708a.html">Reuters and Zotero</a> isn&#8217;t exactly encouraging &#8230; but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.</p>
<p>If you want to take entities.py for a spin and can&#8217;t be bothered to download it, just drop into <a href="irc://chat.freenode.net/code4lib">#code4lib</a> and ask zoia for entities:</p>
<pre>
14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR
              Review Group, a City York, a EmailAddress wtd@pobox.com, a Person
              Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a
              Person William Denton, a Person Barbara Tillett, a Organization
              Congress, a Organization Open Content Alliance, a Organization
              York \nUniversity'
</pre>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/10/27/semanticproxy/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
