<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>inkdroid &#187; semanticweb</title>
	<atom:link href="http://inkdroid.org/journal/tag/semanticweb/feed/" rel="self" type="application/rss+xml" />
	<link>http://inkdroid.org/journal</link>
	<description>$pithy_personal_mission_statement</description>
	<lastBuildDate>Wed, 03 Mar 2010 13:44:03 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>web documents and axioms for linked data</title>
		<link>http://inkdroid.org/journal/2010/02/22/web-documents-and-axioms-for-linked-data/</link>
		<comments>http://inkdroid.org/journal/2010/02/22/web-documents-and-axioms-for-linked-data/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 02:06:38 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[linkeddata]]></category>
		<category><![CDATA[rest]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[webarch]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1676</guid>
		<description><![CDATA[A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web. 
I ended up having a very helpful off-list email exchange [...]]]></description>
			<content:encoded><![CDATA[<p>A few months ago I took part in a <a href="http://groups.google.com/group/pedantic-web/browse_thread/thread/eb65cce9df40abd4">discussion</a> on the <a href="http://pedantic-web.org/">pedantic-web</a> list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web. </p>
<p>I ended up having a very helpful off-list email exchange with <a href="http://richard.cyganiak.de/">Richard Cyganiak</a> (one of the architects of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/">Linked Data</a> pattern) about <a href="http://inkdroid.org/journal/2009/05/14/rest-the-semantic-web-and-my-feeble-brain/">some</a> <a href="http://inkdroid.org/journal/2009/09/10/documents/">trouble</a> I&#8217;ve had understanding what <em>Information Resources</em> and <em>Documents</em> are in the context of <a href="http://www.w3.org/TR/webarch/#id-resources">Web Architecture</a>. The trouble I had was in determining whether or not a collection of physical newspaper pages I was helping <a href="http://chroniclingamerica.loc.gov">put on the web</a> were <em>Information Resources</em> or not. I needed to know because I wanted to identify the newspaper pages with URIs, and describe them as Linked Data&#8230;and the resolvability of these URIs was largely <a href="http://www.w3.org/TR/cooluris/#semweb">dependent</a> on how I chose to answer the question.</p>
<p>Richard offered some advice that I&#8217;ve since found very useful, and I thought I would transcribe some of it here in case you find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may only be useful for people in the digital library domain, but perhaps it&#8217;s useful elsewhere.</p>
<p>On the subject of what a <em>Document</em> is, Richard offered this way of looking at <em>Web Documents</em>:</p>
<blockquote><p>
The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are *not* the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.</p>
<p>&#8230;</p>
<p>I would avoid the term &#8220;document&#8221; when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a &#8220;byte streams with a media type (and possibly other meta data)&#8221;. When I use the term &#8220;HTML document&#8221;, I mean a resource, identified by a URI, that has (only) HTML representations.
</p></blockquote>
<p>Richard encouraged me to think in terms of <em>Web Documents</em> and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents&#8230;which of course they are in the general sense. But characterized this way, it is clear that the Newspaper Pages are not Web Documents. This view of Web Documents is supported in <a href="http://www.w3.org/TR/cooluris/">Cool URIs for the Semantic Web</a>, which he co-authored. </p>
<p>Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:</p>
<blockquote><p>
I&#8217;m using a few rules that I think should be considered axioms of web architecture:</p>
<p>First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)</p>
<p>Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).</p>
<p>I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don&#8217;t).
</p></blockquote>
<p>The pragmatist in me particularly liked how these aren&#8217;t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better.  Finally Richard offered some advice on how to reconcile the REST and Linked Data views on identity:</p>
<blockquote><p>
I make sense of the REST worldview like this: In typical REST, all the URIs *always* identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is *about* that thing. The thing itself doesn&#8217;t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn&#8217;t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing *web pages* rather than *newspaper pages*.
</p></blockquote>
<p>I hope I haven&#8217;t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard&#8217;s points (which perhaps he has made elsewhere) so that I could revisit them without having to dig through my email archive &#8230; Comments welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2010/02/22/web-documents-and-axioms-for-linked-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>open to view</title>
		<link>http://inkdroid.org/journal/2009/08/13/open-to-view/</link>
		<comments>http://inkdroid.org/journal/2009/08/13/open-to-view/#comments</comments>
		<pubDate>Thu, 13 Aug 2009 15:54:41 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[philosophy]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[atom]]></category>
		<category><![CDATA[books]]></category>
		<category><![CDATA[hathitrust]]></category>
		<category><![CDATA[linkeddata]]></category>
		<category><![CDATA[opensearch]]></category>
		<category><![CDATA[rest]]></category>
		<category><![CDATA[semanticweb]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1103</guid>
		<description><![CDATA[I spent an hour checking out the HathiTrust API docs this morning; mainly to see what the similarities and differences are with the as-of-yet undocumented API for Chronicling America. There are quite a few similarities in the general RESTful approach, and the use of Atom, METS and PREMIS in the metadata that is made available. [...]]]></description>
			<content:encoded><![CDATA[<p>I spent an hour checking out the <a href="http://www.hathitrust.org/data_api">HathiTrust API docs</a> this morning; mainly to see what the similarities and differences are with the as-of-yet undocumented API for <a href="http://chroniclingamerica.loc.gov">Chronicling America</a>. There are quite a few similarities in the general RESTful approach, and the use of Atom, METS and PREMIS in the metadata that is made available. </p>
<p>Everyone&#8217;s a critic right? Nevertheless, I&#8217;m just going to jot down a few thoughts about the API, mainly for my friend over in <a href="irc://chat.freenode.net/code4lib">#code4lib</a> <a href="http://billdueber.com/">Bill Dueber</a> who works on the project. Let me just say at the outset that I think it&#8217;s awesome that HathiTrust are providing this API, especially given some of the licensing constraints around some of the content. The API is a good example of putting library data on the web using both general and special purpose standards. But there are a few minor things that could be tweaked I think, to make the API fit into the web and the repository space a bit better.</p>
<p>It would be nice if the <a href="http://opensearch.org">OpenSearch</a> description document referenced in the <a href="http://catalog.hathitrust.org">HTML</a> at </p>
<blockquote><p>
<a href="http://catalog.hathitrust.org/Search/OpenSearch?method=describe">http://catalog.hathitrust.org/Search/OpenSearch?method=describe</a>
</p></blockquote>
<p>worked. It should be pretty easy and non-invasive to add a basic description file for the HTML response since the search is already GET driven. Ideally it would be nice to see the responses also available as Atom and/or JSON with <a href="http://tools.ietf.org/html/rfc5005">Atom Feed Paging</a>. </p>
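<p>For illustration, here is a minimal sketch of what generating such a description document could look like, using only the Python standard library. The element names and namespace come from the OpenSearch 1.1 specification; the search URL template points at example.org and is purely hypothetical, since I am not assuming anything about how the HathiTrust catalog search is actually parameterized.</p>

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OpenSearch 1.1 specification.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, template):
    """Return a minimal OpenSearch description document as a string."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element("{%s}OpenSearchDescription" % OPENSEARCH_NS)
    ET.SubElement(root, "{%s}ShortName" % OPENSEARCH_NS).text = short_name
    ET.SubElement(root, "{%s}Description" % OPENSEARCH_NS).text = description
    # One Url element per response format; here only the HTML results page.
    ET.SubElement(
        root,
        "{%s}Url" % OPENSEARCH_NS,
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical template: the real catalog's query parameters are not assumed.
doc = build_description(
    "Example Catalog",
    "Search the example catalog",
    "http://catalog.example.org/Search?q={searchTerms}",
)
print(doc)
```

<p>Atom or JSON responses would just be additional Url elements with a different type attribute, which is part of why this seems like a low-cost change for a search that is already GET driven.</p>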
<p>Another thing that would be nice to see is the API being integrated more closely into the human-usable webapp. The best way to explain this is with an example. Consider the HTML page for this 1914 edition of Walt Whitman&#8217;s <a href="http://catalog.hathitrust.org/Record/000206297">Leaves of Grass</a>, available with this clean URI:</p>
<blockquote><p>
<a href="http://catalog.hathitrust.org/Record/000206297">http://catalog.hathitrust.org/Record/000206297</a>
</p></blockquote>
<p>Now, you can get a <a href="http://services.hathitrust.org/api/htd/meta/mdp.39015056032132">few</a> <a href="http://services.hathitrust.org/api/htd/structure/mdp.39015056032132">flavors</a> of metadata for this book, and an aggregated <a href="https://services.hathitrust.org/api/htd/aggregate/mdp.39015056032132">zip file</a> of all the page images and OCR if you are a HathiTrust member. Why not make these alternate representations discoverable right from the item display? It could be as simple as adding some &lt;link&gt; elements to the HTML, that use the link relations they&#8217;ve already established for their Atom:</p>

<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;head&gt;
&lt;link rel=&quot;http://schemas.hathitrust.org/htd/2009#meta&quot; 
    type=&quot;application/atom+xml&quot; 
    href=&quot;http://services.hathitrust.org/api/htd/meta/mdp.39015056032132&quot; /&gt;
&lt;link rel=&quot;http://schemas.hathitrust.org/htd/2009#structure&quot; 
    type=&quot;application/atom+xml&quot; 
    href=&quot;http://services.hathitrust.org/api/htd/structure/mdp.39015056032132&quot; /&gt;
&lt;link rel=&quot;http://schemas.hathitrust.org/htd/2009#aggregate&quot; 
    type=&quot;application/zip&quot; 
    href=&quot;https://services.hathitrust.org/api/htd/aggregate/mdp.39015056032132&quot; /&gt;
&lt;/head&gt;</pre></div></div>

<p>If you wanted to get fancy you could also put human readable links into the &lt;body&gt; and annotate them w/ <a href="http://www.w3.org/TR/xhtml-rdfa-primer/">RDFa</a>. But this would just be icing on the cake. There are a few reasons for doing at least the bare minimum. The big one is to let in-browser applications (like <a href="http://zotero.org">Zotero</a>, etc) learn more about a given resource in a relatively straightforward and commonplace way. The other big one is to let automated agents like <a href="http://www.google.com/bot.html">GoogleBot</a>, <a href="http://help.yahoo.com/help/us/ysearch/slurp">YahooSlurp</a>, Internet Archive&#8217;s <a href="http://crawler.archive.org/">Heritrix</a>, etc. discover the <a href="http://en.wikipedia.org/wiki/Deep_Web">deep web</a> data that&#8217;s held behind your API. Another nice side effect is that it helps people who might ordinarily scrape your site discover the API automatically.</p>
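<p>To give a sense of how little a client would need in order to use those &lt;link&gt; elements, here is a rough sketch with Python&#8217;s standard html.parser module. The LinkFinder class is my own hypothetical helper, and the sample page is just two of the link elements suggested above, not anything HathiTrust actually serves.</p>

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (rel, type, href) for every link element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags are also routed here by the default
        # handle_startendtag implementation.
        if tag == "link":
            attributes = dict(attrs)
            self.links.append(
                (attributes.get("rel"), attributes.get("type"), attributes.get("href"))
            )


# Sample markup based on the suggestion above (an assumption, not a real response).
page = """
<head>
<link rel="http://schemas.hathitrust.org/htd/2009#meta"
    type="application/atom+xml"
    href="http://services.hathitrust.org/api/htd/meta/mdp.39015056032132" />
<link rel="http://schemas.hathitrust.org/htd/2009#aggregate"
    type="application/zip"
    href="https://services.hathitrust.org/api/htd/aggregate/mdp.39015056032132" />
</head>
"""

finder = LinkFinder()
finder.feed(page)

# A browser plugin or crawler could now pick the representation it understands.
atom_links = [href for rel, mtype, href in finder.links if mtype == "application/atom+xml"]
print(atom_links)
```

<p>A scraper that starts from the catalog page never has to be told where the API lives; it finds out from the page itself.</p>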
<p>Lastly, I was curious to know if HathiTrust considered adjusting their Atom response to use the <a href="http://www.openarchives.org/ore/1.0/atom.html">Atom pattern</a> recommended by the OAI-ORE folks. They are pretty close already, and in fact seem to have modeled their own aggregation vocabulary on OAI-ORE. It would be interesting to hear why they diverged if it was intentional, and if it might be possible to use a bit of oai-ore in there so we can bootstrap an oai-ore harvesting ecosystem.</p>
<p>I&#8217;m <a href="http://iandavis.com/blog/2009/07/the-linked-data-brand">not sure</a> that I can call this approach to integrating web2.0 APIs into web1.x applications <em>Linked Data</em> anymore, since it doesn&#8217;t really involve RDF directly. It does involve thinking in a RESTful way about the resources you are publishing on the web, and how they can be linked together to form a graph. My colleague <a href="http://onebiglibrary.net">Dan</a> has been writing in Computers in Libraries recently about how thinking in terms of &#8220;building a better web&#8221; may be a more accurate way of describing this activity. </p>
<p>For reasons I don&#8217;t fully understand I&#8217;ve been reading a lot of Wittgenstein (well mainly books about Wittgenstein honestly) lately during the non-bike commute. The trajectory of his thought over his life is really interesting to me. He had this zen-like, controversial idea that </p>
<blockquote><p>
Philosophy simply puts everything before us, and neither explains nor deduces anything. &#8212; Since everything lies open to view there is nothing to explain. For what is hidden, for example, is of no interest to us. <a href="http://books.google.com/books?id=ici7FXQZsFIC&#038;lpg=PP1&#038;dq=philosophical%20investigations&#038;pg=PA43-IA1#v=onepage&#038;q=126&#038;f=false">(PI 126)</a>
</p></blockquote>
<p>I really like this idea that our data APIs on the web could be &#8220;open to view&#8221; by checking out the HTML, following your nose, and writing scrapers, bots and browser plugins to use what you find. I think it&#8217;s unfortunate that the recent changes to the <a href="http://www.w3.org/DesignIssues/LinkedData.html">Linked Data Design Issues</a>, and the ensuing <a href="http://cloudofdata.com/2009/07/does-linked-data-need-rdf/">discussion</a>, seemed to draw a dividing line around the use of RDF and SPARQL. I had always hoped (and continue to hope) that the Linked Data effort is bigger than a particular brand, or reformulation of the semantic web effort &#8230; for me it&#8217;s a pattern for building a better web. I think RDF is very well suited to expressing the core nature of the web, the <a href="http://dig.csail.mit.edu/breadcrumbs/node/215">Giant Global Graph</a>. I&#8217;ve served up RDF representations in applications I&#8217;ve worked on just for this reason. But I think the Linked Data pattern will thrive most if it is thought of as an inclusive continuum of efforts, similar to what <a href="http://webofdata.wordpress.com/2009/07/20/what-else/#comment-132">Dan Brickley</a> has suggested. We technology people strive for explicitness, it&#8217;s an occupational hazard &#8212; but there&#8217;s sometimes quite a bit of strength in ambiguity.</p>
<p>Anyhow, my little review of the HathiTrust API turned into a bit of a soapbox for me to stand on and shout like a lunatic. I guess I&#8217;ve been wanting to write about what I think Linked Data is for a few weeks now, and it just kinda bubbled up when I least expected it. Sorry Bill!</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2009/08/13/open-to-view/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>rest, the semantic web and my feeble brain</title>
		<link>http://inkdroid.org/journal/2009/05/14/rest-the-semantic-web-and-my-feeble-brain/</link>
		<comments>http://inkdroid.org/journal/2009/05/14/rest-the-semantic-web-and-my-feeble-brain/#comments</comments>
		<pubDate>Fri, 15 May 2009 03:29:40 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[loc]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[rest]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[uris]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=974</guid>
		<description><![CDATA[Imagine you were minting close to a million URIs for historic newspaper pages such as:

http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/

for pages like:

The web page allows you to zoom in quite close and see lots of detail in the page:

Now let&#8217;s say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine you were minting close to a million URIs for historic newspaper pages such as:</p>
<blockquote><p>
<a href="http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/">http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/</a>
</p></blockquote>
<p>for pages like:</p>
<p><a href="http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/"><img src="http://inkdroid.org/images/sn85066387_1898-01-01_ed-1_seq-1.png" width="450" border="0" /></a></p>
<p>The web page allows you to zoom in quite close and see lots of detail in the page:</p>
<p><a href="http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/"><img src="http://inkdroid.org/images/sn85066387_1898-01-01_ed-1_seq-1-b.png" width="450" border="0" /></a></p>
<p>Now let&#8217;s say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether or not I can hang my description of the page off the above URI, for example:</p>
<pre>
&lt;http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/&gt;
  dcterms:issued "1898-01-01"^^&lt;http://www.w3.org/2001/XMLSchema#date&gt; .
</pre>
<p>Or if I need to mint a new URI for the page as a real world thing:</p>
<pre>
&lt;http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page&gt;
  dcterms:issued "1898-01-01"^^&lt;http://www.w3.org/2001/XMLSchema#date&gt; .
</pre>
<p><a href="http://www.w3.org/TR/webarch/#id-resources">AWWW 1</a> provides some guidance:</p>
<blockquote><p>
By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term &#8220;resource&#8221; is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”</p>
<p>This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.
</p></blockquote>
<p>Can all of the <em>essential characteristics</em> of this newspaper page be sent down the wire as a message to a client? The text of the page is pretty legible after zooming in and you can see pictures, headlines, etc. You can&#8217;t feel the texture of the page itself, but then you can&#8217;t feel it in the microfilm that the page images were generated from either. So I&#8217;m inclined to say yes.</p>
<p><a href="http://www.w3.org/TR/cooluris/#distinguishing">Cool URIs for the Semantic Web</a> also has some advice:</p>
<blockquote><p>
It is important to understand that using URIs, it is possible to identify both a thing (which may exist outside of the Web) and a Web document describing the thing. For example the person Alice is described on her homepage. Bob may not like the look of the homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for the homepage or a RDF document describing Alice. The question is where to draw the line between the case where either is possible and the case where only descriptions are available.</p>
<p>According to W3C guidelines ([AWWW], section 2.2.), we have a Web document (there called information resource) if all its essential characteristics can be conveyed in a message. Examples are a Web page, an image or a product catalog.</p>
<p>In HTTP, because a 200 response code should be sent when a Web document has been accessed, but a different setup is needed when publishing URIs that are meant to identify entities which are not Web documents.
</p></blockquote>
<p>This makes me think that I will need distinct identifiers for the abstract notion of the Newspaper Page, and the HTML document itself, if it is important to describe them separately. Say for example if I wanted to say the publisher of the web page was the Library of Congress, but the publisher of the Newspaper Page was Charles M. Shortridge. If I don&#8217;t have distinct identifiers I will have to say:</p>
<pre>&lt;http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/&gt;
  dc:publisher &lt;http://loc.gov&gt;,
  &lt;http://www.joincalifornia.com/candidate/12338&gt;
  .
</pre>
<p>Pondering this <em>Information Resource Sniff-Test</em> got me re-reading Xiaoshu Wang&#8217;s paper <a href="http://dfdf.inesc-id.pt/tr/web-arch">URI Identity and Web Architecture Revisited</a>. And I&#8217;ve come away more convinced that maybe he&#8217;s right: that the real issue lies in my vocabulary usage (dc:publisher in this example), and not with whether my URI identifies an Information Resource or not. So maybe new vocabulary is needed in order to describe the representation?</p>
<pre>&lt;http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/&gt;
  web:repPublisher &lt;http://loc.gov&gt; ;
  dcterms:publisher &lt;http://www.joincalifornia.com/candidate/12338&gt;
  .
</pre>
<p>But there isn&#8217;t a community of practice behind Xiaoshu&#8217;s position, at least not one like the Linked Data community. Unless perhaps his position is closer to that of the REST community, which is going strong at the moment, especially in AtomPub circles. Members of the linked-data/semweb community would most likely say that there need to be either hash or 303&#8217;ing URIs for the Newspaper Page, distinct from the URIs for the document describing the Newspaper Page. As a latecomer to the httpRange-14 debate I don&#8217;t think I ever internalized how REST and the Semantic Web are slightly out of tune w/ each other regarding resources on the web.</p>
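<p>To make the two options concrete for myself, here is a small sketch of how hash URIs and 303 URIs behave, using only the Python standard library. The /id/page/1 path and the respond function are hypothetical names invented for the example; nothing here describes how Chronicling America is actually set up.</p>

```python
from urllib.parse import urldefrag

# Hash URI: the fragment never goes over the wire, so a client that looks up
# the "thing" URI ends up requesting the document URI.
thing_uri = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"
requested_uri, fragment = urldefrag(thing_uri)
print(requested_uri, fragment)

# 303 URI: the thing gets its own path, and the server redirects to a
# document that describes it instead of answering 200 directly.
DOCUMENT_PATH = "/lccn/sn85066387/1898-01-01/ed-1/seq-1/"
THINGS = {"/id/page/1": DOCUMENT_PATH}  # hypothetical thing path
DOCUMENTS = {DOCUMENT_PATH}


def respond(path):
    """Return (status, headers) for a GET on the given path."""
    if path in THINGS:
        # Not a Web Document: point at a description of it.
        return 303, {"Location": THINGS[path]}
    if path in DOCUMENTS:
        # A Web Document: it can have representations.
        return 200, {"Content-Type": "text/html"}
    return 404, {}


print(respond("/id/page/1"))
print(respond(DOCUMENT_PATH))
```

<p>Either way there are two identifiers, one for the Newspaper Page and one for the document about it; the difference is only whether the client or the server does the work of getting from the first to the second.</p>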
<p>So. Should I have two different URIs: one for the real-world Newspaper Page, and one for the HTML document that describes that page? Is the Newspaper Page an Information Resource? Am I muddling up something here? Am I thinking too much? Should I just let sleeping dogs lie? Your opinion, advice, therapy would be greatly appreciated.</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2009/05/14/rest-the-semantic-web-and-my-feeble-brain/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>freebase and linked-data</title>
		<link>http://inkdroid.org/journal/2008/10/29/freebase-and-linked-data/</link>
		<comments>http://inkdroid.org/journal/2008/10/29/freebase-and-linked-data/#comments</comments>
		<pubDate>Wed, 29 Oct 2008 14:32:08 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[freebase]]></category>
		<category><![CDATA[linkeddata]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[turtle]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=479</guid>
		<description><![CDATA[Ok, this is pretty big news for linked data folks, and for semweb-heads in general. Freebase is now a linked-data target. This is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs. 
The web2.0-meets-semweb space is also [...]]]></description>
			<content:encoded><![CDATA[<p>Ok, this is pretty big <a href="http://lists.w3.org/Archives/Public/public-lod/2008Oct/0047.html">news</a> for linked data folks, and for semweb-heads in general. <a href="http://freebase.com">Freebase</a> is <a href="http://blog.freebase.com/2008/10/30/introducing_the_rdf_service/">now</a> a <a href="http://rdf.freebase.com">linked-data target</a>. This is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs. </p>
<p>The web2.0-meets-semweb space is also being explored by folks like <a href="http://www.talis.com/platform/">Talis</a>. It&#8217;ll be interesting to see how this plays out&#8211;particularly in light of SPARQL adoption, which I remain kind of neutral about for some undefined, wary, spooky reason. I get the idea of web resources having data views. It seems like a logical &#8220;one small step for a web agent, one giant leap for the web&#8221;. But queryability with SPARQL sounds like something to push off, particularly if you&#8217;ve already got a <a href="http://freebase.com/opensearch.xml">search api</a> that could be hooked up to the data views.</p>
<p>At any rate, what this announcement means is that you can get machine readable data back from freebase using a URI. The descriptions then use more URIs, which you can then follow-your-nose to, and get more machine readable data. So if you are on a page like:</p>
<blockquote><p>
<a href="http://www.freebase.com/view/en/tim_berners-lee">http://www.freebase.com/view/en/tim_berners-lee</a>
</p></blockquote>
<p>you can construct a URL for Tim Berners-Lee like this:</p>
<blockquote><p>
  <a href="http://rdf.freebase.com/ns/en.tim_berners-lee">http://rdf.freebase.com/ns/en.tim_berners-lee</a>
</p></blockquote>
<p>Then you resolve that URL asking for <code>application/turtle</code> (you could ask for <code>application/rdf+xml</code> but I find the turtle more readable).</p>
<pre>
curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.tim_berners-lee
</pre>
<p>And you&#8217;ll get back a description like <a href="http://inkdroid.org/data/tbl-freebase.txt">this</a>. There&#8217;s a lot of useful data there, but the interesting part for me is the follow-your-nose effect where you can see an assertion like:</p>
<pre>
 &lt;http://rdf.freebase.com/ns/en.tim_berners-lee&gt;
     &lt;http://rdf.freebase.com/ns/influence.influence_node.influenced_by&gt;
     &lt;http://rdf.freebase.com/ns/en.ted_nelson&gt; .
</pre>
<p>And you can then go look up Ted Nelson using that URI:</p>
<pre>
  curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.ted_nelson
</pre>
<p>And get another chunk of <a href="http://inkdroid.org/data/tednelson-freebase.txt">data</a> which includes this assertion: </p>
<pre>
 &lt;http://rdf.freebase.com/ns/en.ted_nelson&gt;
     &lt;http://rdf.freebase.com/ns/influence.influence_node.influenced_by&gt;
     &lt;http://rdf.freebase.com/ns/en.vannevar_bush&gt; .
</pre>
<p>And you can then continue following your nose to:</p>
<blockquote><p>
<a href="http://rdf.freebase.com/ns/en.vannevar_bush">http://rdf.freebase.com/ns/en.vannevar_bush</a>
</p></blockquote>
<p>Lather, rinse, repeat.</p>
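<p>The whole loop is small enough to sketch in a few lines of Python. To keep the example self-contained, the fetch function below is a stand-in that reads from an in-memory list holding just the two assertions shown above, rather than dereferencing anything over HTTP. The rule in rdf_uri for turning a view URL into an RDF URI is my own guess generalized from the single example in this post, and all of the function names are hypothetical.</p>

```python
from urllib.request import Request

NS = "http://rdf.freebase.com/ns/"
INFLUENCED_BY = NS + "influence.influence_node.influenced_by"


def rdf_uri(view_url):
    """Guess the RDF URI for a view URL: /view/en/x becomes ns/en.x."""
    path = view_url.split("/view/", 1)[1]
    return NS + path.replace("/", ".")


def turtle_request(uri):
    """Build (but do not send) the equivalent of the curl command above."""
    return Request(uri, headers={"Accept": "application/turtle"})


# Stand-in for the web: only the two assertions quoted in this post.
TRIPLES = [
    (NS + "en.tim_berners-lee", INFLUENCED_BY, NS + "en.ted_nelson"),
    (NS + "en.ted_nelson", INFLUENCED_BY, NS + "en.vannevar_bush"),
]


def fetch(uri):
    """Pretend to dereference a URI: return the triples that describe it."""
    return [triple for triple in TRIPLES if triple[0] == uri]


def follow_your_nose(start, predicate):
    """Visit every URI reachable from start by following one predicate."""
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        for subject, pred, obj in fetch(uri):
            if pred == predicate:
                queue.append(obj)
    return seen


start = rdf_uri("http://www.freebase.com/view/en/tim_berners-lee")
for uri in follow_your_nose(start, INFLUENCED_BY):
    print(uri)
```

<p>Swapping fetch for something that sends turtle_request(uri) and parses the response is where a real crawler would begin.</p>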
<p>So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed for agents to be able to crawl the web of documents and build indexes of the data to allow people to find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It&#8217;s early days still, but seeing an organization like Freebase get it is pretty exciting.</p>
<p>Oh, there are a few little <a href="http://lists.freebase.com/pipermail/developers/2008-October/002210.html">rough spots</a> which probably should be ironed out &#8230; but when is that ever not the case eh? Inspiring stuff.</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/10/29/freebase-and-linked-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>iswc2009, DC and vocamp</title>
		<link>http://inkdroid.org/journal/2008/09/22/iswc2009-dc-and-vocamp/</link>
		<comments>http://inkdroid.org/journal/2008/09/22/iswc2009-dc-and-vocamp/#comments</comments>
		<pubDate>Mon, 22 Sep 2008 15:03:46 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[semweb]]></category>
		<category><![CDATA[barcamp]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[dc]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[vocamp]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=378</guid>
		<description><![CDATA[I just learned from Tom Heath that The International Semantic Web Conference is coming to Washington DC next year. This is pretty cool news to me, since traveling to conferences isn&#8217;t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a VoCamp around the conference, to provide an [...]]]></description>
			<content:encoded><![CDATA[<p>I just learned from <a href="http://tomheath.com">Tom Heath</a> that <a href="http://iswc.semanticweb.org/">The International Semantic Web Conference</a> is coming to Washington DC <a href="http://iswc2009.semanticweb.org/">next year.</a> This is pretty cool news to me, since traveling to conferences isn&#8217;t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a <a href="http://vocamp.org/wiki/Main_Page">VoCamp</a> around the conference, to provide an informal collaboration space for vocabulary demos, development, q/a, etc. If you want to help out please join the <a href="http://vocamp.org/wiki/Main_Page#VoCamp_Mailing_List">mailing list</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/09/22/iswc2009-dc-and-vocamp/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>justify my links</title>
		<link>http://inkdroid.org/journal/2008/05/29/justify-my-links/</link>
		<comments>http://inkdroid.org/journal/2008/05/29/justify-my-links/#comments</comments>
		<pubDate>Thu, 29 May 2008 12:58:07 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[publishing]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[lc]]></category>
		<category><![CDATA[nyc]]></category>
		<category><![CDATA[semanticweb]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=201</guid>
		<description><![CDATA[Thanks to a tip from Ian, I&#8217;m looking forward to (hopefully) attending the Linked Data Planet conference in New York City as a volunteer. The idea is that I just have to pay for my hotel, and the cost of admission is waived. It seems my travel money is a bit limited at the moment [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks to a tip from <a href="http://iandavis.com/">Ian</a>, I&#8217;m looking forward to (hopefully) attending the <a href="http://www.linkeddataplanet.com/">Linked Data Planet</a> conference in New York City as a volunteer. The idea is that I just have to pay for my hotel, and the cost of admission is waived. It seems my travel money is a bit limited at the moment (sometimes it&#8217;s there, sometimes it isn&#8217;t), so I figured minimizing costs would be appreciated. But today I got a request to &#8220;justify&#8221; my attendance at the conference. It was actually kind of a good exercise to sit down and write why I think the conference and <a href="http://linkeddata.org/">Linked Data</a> in general is important to the <a href="http://loc.gov">Library of Congress</a>.</p>
<blockquote><p>One of the challenges of Digital Repository work is modeling the context for digital objects. The context for a digital object includes the set of relationships a particular digital object has with other objects in the repository. 30 years of relational database research and development have allowed us to do this modeling pretty effectively within the scope of a particular application.</p>
<p>Very often, particularly in institutions the size of the Library of Congress, the context for a digital object includes digital objects found elsewhere in the enterprise&#8211;in other applications, with their own databases. In addition some institutions (like LC) also need to make their digital resources available publicly for other organizations to reference. The challenge here is in making the objects found in silos or islands of application data (typically housed in databases) reference-able and resolvable, so that other applications inside and outside the enterprise can use them.</p>
<p>As a practical example, a picture of Dizzy Gillespie found in the American Memory collection </p>
<div style="text-align: center;">
<a href=" http://lcweb2.loc.gov/cgi-bin/query/i?ammem/van:@field(NUMBER+@band(van+5a52027)):displayType=1:m856sd=van:m856sf=5a52027 "><img src="http://memory.loc.gov/pnp/van/5a52000/5a52000/5a52027r.jpg" /></a>
</div>
<p>is related to the book:</p>
<p><em><br />
  To be, or not&#8211;to bop: memoirs / Dizzy Gillespie, with Al Fraser.<br />
</em></p>
<p>which we have described in our <a href="http://lccn.loc.gov/84029213">online catalog</a>. The person Dizzy Gillespie is also represented in LC&#8217;s name authority file with the <a href="http://www.loc.gov/marc/lccn.html">Library of Congress Control Number</a> n50033872, and the <a href="http://orlabs.oclc.org/viaf/LC|n50033872 ">Linked Authority File at OCLC</a>. And perhaps this picture of Dizzy Gillespie in American Memory will find its way into the <a href="http://memory.loc.gov/pnp/van/5a52000/5a52000/5a52027r.jpg">World Digital Library</a> application that is currently being built. How can we practically and explicitly identify and then represent the relationships between these resources? Is it even possible?</p>
<p>The Linked Data Planet conference is a two day workshop describing how to use traditional web technologies in conjunction with semantic web technologies (RDF, OWL, SPARQL, RDFa and GRDDL) to enable this sort of linking of resources inside particular applications, within the enterprise and around the world. My hope is that the conference will provide guidance on simple things LC can do with web technologies that have been in use for 20 years, to model the relationships between digital resources at the Library of Congress.
</p></blockquote>
<p>Hopefully that will convince them :-)</p>
<p><em>Apologies to <a href="http://en.wikipedia.org/wiki/Justify_My_Love">Madonna</a> for the blog post title&#8230;</em></p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/05/29/justify-my-links/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>literals and resources</title>
		<link>http://inkdroid.org/journal/2008/03/26/literals-and-resources/</link>
		<comments>http://inkdroid.org/journal/2008/03/26/literals-and-resources/#comments</comments>
		<pubDate>Wed, 26 Mar 2008 13:51:32 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[marc]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[rda]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semanticweb]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/2008/03/26/literals-and-resources/</guid>
		<description><![CDATA[There&#8217;s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:


http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that&#8217;s the [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://inkdroid.org/images/rda.jpg" style="margin-left: 10px; float: left;"/>There&#8217;s a fascinating modeling discussion going on over on the <a href="http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0803&#038;L=dc-rda&#038;T=0&#038;F=&#038;S=&#038;P=2694">DC-RDA</a> list about whether <a href="http://docs.google.com/View?docid=dhpg2gtj_54fgnz8rfs">RDA properties</a> should reference literals or resources in descriptions. For example when describing an author you could use a literal:</p>
<pre>
Twain, Mark, 1835-1910
</pre>
<p>or a resource:</p>
<pre>

http://lccn.loc.gov/n79021164
</pre>
<p>There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that&#8217;s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seem to be two competing interests:</p>
<ol>
<li>perceived ease of migrating legacy data (MARC -> RDA)</li>
<li>perceived benefits to explicitly modeling the relationships found in bibliographic data</li>
</ol>
<p>More information can also be found in the blogs of <a href="http://kcoyle.blogspot.com/2008/01/more-on-rda-and-literals.html">Karen Coyle</a> and <a href="http://jonphipps.wordpress.com/2008/03/16/simple-dc-and-rda/">Jon Phipps</a>.</p>
<p>My personal opinion is that RDA should take the high road on this one and really drive home the <a href="http://en.wikipedia.org/wiki/Value_proposition">value proposition</a> for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it&#8217;s worth limping along a bit longer with MARC and waiting for RDA to actually &#8220;do the right thing&#8221;.</p>
<p>How to do this effectively is another matter, and is really what the discussion is about. It&#8217;s really nice to see people talking openly about these issues.</p>
<p><em>(PS, using an author isn&#8217;t a particularly good example because I don&#8217;t see it in the current <a href="http://docs.google.com/View?docid=dhpg2gtj_54fgnz8rfs">list</a> of RDA properties&#8230;)</em></p>
<p><em>(PPS, no that lccn url doesn&#8217;t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))</em></p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/03/26/literals-and-resources/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>oai-ore and the shadow web</title>
		<link>http://inkdroid.org/journal/2008/02/22/oai-ore-and-the-shadow-web/</link>
		<comments>http://inkdroid.org/journal/2008/02/22/oai-ore-and-the-shadow-web/#comments</comments>
		<pubDate>Fri, 22 Feb 2008 20:09:41 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[erdf]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[oai-ore]]></category>
		<category><![CDATA[oai-pmh]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[rdfa]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[xhtml]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/2008/02/22/oai-ore-and-the-shadow-web/</guid>
		<description><![CDATA[The OAI-ORE meeting is coming up, and in general I&#8217;ve been really impressed with the alpha specs that have come out. It&#8217;s not clear that there&#8217;s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.
One thing I didn&#8217;t quite understand, [...]]]></description>
			<content:encoded><![CDATA[<p>The OAI-ORE <a href="http://www.openarchives.org/ore/documents/ore-hopkins-press-release.pdf">meeting</a> is coming up, and in general I&#8217;ve been really impressed with the alpha <a href="http://www.openarchives.org/ore/0.1/toc">specs</a> that have come out. It&#8217;s not clear that there&#8217;s an established vocabulary for talking about aggregated resources on the web, so the <a href="http://www.openarchives.org/ore/0.1/datamodel">Data Model</a> and <a href="http://www.openarchives.org/ore/0.1/vocabulary">Vocabulary</a> documents were of particular interest to me.</p>
<p>One thing I didn&#8217;t quite understand, and which I think may have some significance for implementors, is some language in the <a href="http://www.openarchives.org/ore/0.1/discovery#URIConflation">Discovery</a> document on the subject of URI conflation:</p>
<blockquote><p>The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable &#8220;splash page&#8221;, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a &#8220;splash page&#8221; for an object:</p></blockquote>
<p>If I&#8217;m understanding right this would prohibit using technologies like <a href="http://microformats.org">microformats</a>, <a href="http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml">eRDF</a>, <a href="http://rdfa.info/">RDFa</a> and <a href="http://www.w3.org/2001/sw/grddl-wg/">GRDDL</a> in a &#8220;splash page&#8221; to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML. </p>
<p>To illustrate what this might look like I took a splash page off of <a href="http://arxiv.org/abs/0711.1533v1">arXiv</a> (hope that was ok!) and marked it up with oai-ore RDFa. </p>
<p><a href="http://inkdroid.org/data/0711.1533"><img src="/images/arxiv-screenshot.png" border="0" /></a></p>
<p>Take a <a href="http://inkdroid.org/data/0711.1533">look</a>. So all I did is modify the existing XHTML at arxiv.org, and I&#8217;ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available. </p>
<p>RDFa just entered <a href="http://www.w3.org/News/2008#item26">Last Call</a>, but there are already multiple implementations. Try out the <a href="http://www.w3.org/2006/07/SWD/RDFa/impl/js/">GetN3</a> bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following <a href="/images/oai_ore_graph.png">graph</a> (kinda too big to include here inline).</p>
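<p>As a rough illustration of why RDFa in a splash page is machine readable, here is a minimal Python sketch that pulls link relations out of a page. The markup, the example.org URLs and the use of <code>ore:aggregates</code> are assumptions made up for this example (they are not copied from the arXiv page), prefix declarations are left out, and the parser ignores most of the real RDFa processing rules.</p>

```python
from html.parser import HTMLParser

# Hypothetical splash-page fragment. The URLs and the choice of
# ore:aggregates are illustrative assumptions, not the real arXiv markup.
SPLASH = """
<div about="http://example.org/rem/0711.1533">
  <h1>An example paper</h1>
  <a rel="ore:aggregates" href="http://example.org/pdf/0711.1533">PDF</a>
  <a rel="ore:aggregates" href="http://example.org/ps/0711.1533">PostScript</a>
  <a href="http://example.org/help">Help</a>
</div>
"""


class RelCollector(HTMLParser):
    """Collect (subject, rel, href) for every link that carries a rel attribute.

    Simplification: the most recent "about" attribute is used as the subject,
    which is not how RDFa scoping really works.
    """

    def __init__(self):
        super().__init__()
        self.subject = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if tag == "a" and "rel" in attrs and "href" in attrs:
            self.links.append((self.subject, attrs["rel"], attrs["href"]))


def extract_links(markup):
    """Return the list of (subject, rel, href) tuples found in the markup."""
    parser = RelCollector()
    parser.feed(markup)
    return parser.links


links = extract_links(SPLASH)
for subject, rel, href in links:
    print(subject, rel, href)
```

<p>The plain &#8220;Help&#8221; link has no <code>rel</code> so it is skipped, which is the whole point: the same page serves people and machines at once.</p>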
<p>This kind of issue seems to be at the heart of what Ian Davis refers to when he asks &#8220;<a href="http://iandavis.com/blog/2007/11/is-the-semantic-web-destined-to-be-a-shadow">Is the Semantic Web Destined to be a Shadow?</a>&#8221;. <a href="http://efoundations.typepad.com/efoundations/2008/02/repositories-th.html">Andy Powell</a> and <a href="http://efoundations.typepad.com/efoundations/2008/02/linked-data-and.html">Pete Johnston</a> have also been strong voices for integrating digital library repositories and the web&#8211;and they are also involved with the oai-ore effort. It feels like some of the oai-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more. </p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/02/22/oai-ore-and-the-shadow-web/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>calais and ocr newspaper data</title>
		<link>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/</link>
		<comments>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/#comments</comments>
		<pubDate>Wed, 13 Feb 2008 20:30:16 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[sparql]]></category>
		<category><![CDATA[webservices]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/</guid>
		<description><![CDATA[ Like you I&#8217;ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it&#8217;s RDF that uses a bunch of homespun [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://opencalais.com/"><img src="/images/calais.gif" style="margin-right: 10px; float: left;"/></a> Like you I&#8217;ve been <a href="http://radar.oreilly.com/archives/2008/02/reuters_semantic_web_moneytech.html">reading</a> <a href="http://ebiquity.umbc.edu/blogger/2008/02/02/reuters-calais-offers-free-text-extraction-services-producing-rdf/">about</a> the new <a href="http://opencalais.com/">Reuters Calais Web Service</a>. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it&#8217;s RDF that uses a bunch of homespun vocabularies. </p>
<p>At work <a href="http://eikeon.com">Dan</a>, <a href="http://ardvaark.net">Brian</a> and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It&#8217;s a very different approach in that Calais is doing <a href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> and we instead are looking for patterns in the structure of XML. But the end result is the same&#8211;an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have <a href="http://loc.gov/chroniclingamerica">large amounts</a> of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais&#8230;</p>
<p>To aid in the process I wrote a helper utility (<a href="http://inkdroid.org/bzr/calais/calais.py">calais.py</a>) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan&#8217;s <a href="http://rdflib.net">rdflib</a>:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">  <span style="color: #ff7700;font-weight:bold;">from</span> calais <span style="color: #ff7700;font-weight:bold;">import</span> calais_graph
  graph = calais_graph<span style="color: black;">&#40;</span>content<span style="color: black;">&#41;</span></pre></div></div>

<p>This is dependent on you getting a calais <a href="http://developer.opencalais.com/member/register">license key</a> and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here&#8217;s the <a href="http://inkdroid.org/bzr/calais/people">people</a> script. <em>Note, the angle brackets are missing from the sparql prefixes intentionally, since they don&#8217;t render properly (yet) in wordpress</em>.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">  <span style="color: #ff7700;font-weight:bold;">from</span> calais <span style="color: #ff7700;font-weight:bold;">import</span> calais_graph
  <span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">sys</span> <span style="color: #ff7700;font-weight:bold;">import</span> argv
&nbsp;
  filename = argv<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
  content = <span style="color: #008000;">file</span><span style="color: black;">&#40;</span>filename<span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  g = calais_graph<span style="color: black;">&#40;</span>content<span style="color: black;">&#41;</span>
&nbsp;
  sparql = <span style="color: #483d8b;">&quot;&quot;&quot;
          PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
          PREFIX ct: http://s.opencalais.com/1/type/em/e/
          PREFIX cp: http://s.opencalais.com/1/pred/
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          &quot;&quot;&quot;</span>
&nbsp;
  <span style="color: #ff7700;font-weight:bold;">for</span> row <span style="color: #ff7700;font-weight:bold;">in</span> g.<span style="color: black;">query</span><span style="color: black;">&#40;</span>sparql<span style="color: black;">&#41;</span>:
      <span style="color: #ff7700;font-weight:bold;">print</span> row<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here&#8217;s what we get when I run <a href="http://inkdroid.org/bzr/calais/data/ndnp:774348">this</a> OCR data through (take a <a href="http://inkdroid.org/bzr/calais/data/ndnp:774348">look</a> at the linked OCR to see just how irregular this data is).</p>
<pre>
  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer
</pre>
<p>Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here&#8217;s the output of <a href="http://inkdroid.org/bzr/calais/cities">cities</a>. </p>
<pre>
  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO
</pre>
<p>Not too shabby. If you want to try this out, install <a href="http://rdflib.net">rdflib</a>, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:</p>
<pre>
  bzr branch http://inkdroid.org/bzr/calais
</pre>
<p>If you do dive into calais.py you&#8217;ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF. </p>
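<p>The ranked list idea mentioned earlier is easy to sketch in Python. This is only an illustration under my own assumptions: <code>rank_entities</code> is a made-up helper, the two sample &#8220;pages&#8221; are a handful of names borrowed from the cities output rather than real per-page results, and folding case is just one cheap way to merge variants like Chicago and CHICAGO.</p>

```python
from collections import Counter


def rank_entities(pages):
    """Count entity names across pages and return them most frequent first.

    `pages` is an iterable of lists of names, one list per OCR page.
    Names are case-folded so "CHICAGO" and "Chicago" count as one entity.
    """
    counts = Counter()
    for names in pages:
        counts.update(name.casefold() for name in names)
    return counts.most_common()


# Made-up sample: two short "pages" using names from the cities output.
sample_pages = [
    ["San Francisco", "Chicago", "Oakland", "San Francisco"],
    ["CHICAGO", "San Francisco", "Fresno"],
]

ranking = rank_entities(sample_pages)
for name, count in ranking:
    print(count, name)
```

<p>Across enough pages the OCR noise (George Dlxoh, Pptter Palmer) should end up with counts of one or two, while real names pile up at the top.</p>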
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>following your nose to the web of data</title>
		<link>http://inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of-data/</link>
		<comments>http://inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of-data/#comments</comments>
		<pubDate>Fri, 04 Jan 2008 15:57:33 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[semweb]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[uri]]></category>
		<category><![CDATA[url]]></category>

		<guid isPermaLink="false">http://www.inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of-data/</guid>
		<description><![CDATA[This is a draft of a column that&#8217;s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a draft of a column that&#8217;s slated to be published some time in <a href="http://www.niso.org/standards/std_resources.html">Information Standards Quarterly</a>. <a href="http://www.bookism.org/open/">Jay</a> was kind enough to let me post it here in this form before it goes to press. It seems <a href="http://onebiglibrary.net/story/will-i-need-to-understand-the-semantic-web-in-2008">timely</a> to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.</em></p>
<hr />
<p><a href="http://en.wikipedia.org/wiki/Image:WWWlogo.png"><img src="http://upload.wikimedia.org/wikipedia/commons/2/25/WWWlogo.png" style="float: left; border: none; margin-right: 10px;"/></a>
</p>
<p>It&#8217;s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was <a href="http://www.w3.org/History.html">set up</a> at Stanford. The <a href="http://ksi.cpsc.ucalgary.ca/archives/WWW-TALK/www-talk-1991.index.html">archives</a> of the www-talk discussion list bear witness to the grassroots community effort that grew the early web&#8211;one document and one server at a time.</p>
<p>Fast forward to 2007 when 24.7 billion web pages are <a href="http://www.worldwidewebsize.com/">estimated</a> to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee&#8217;s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so called <a href="http://en.wikipedia.org/wiki/Linked_Data">Web of Data</a>.</p>
<p>The Web of Data has its beginnings in the vision of a Semantic Web <a href="http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21">articulated</a> by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up <a href="http://www.sciam.com/article.cfm?id=the-semantic-web-in-action">article</a> covers the &#8220;layer cake&#8221; of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundaries on the World Wide Web.</p>
<p>The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF).  Tim Berners-Lee has called this technique Linked Data, and <a href="http://www.w3.org/DesignIssues/LinkedData.html">summarized</a> it as a short set of guidelines for publishing data on the web:</p>
<ol>
<li>Use URIs as names for things.</li>
<li>Use HTTP URIs so that people can look up those things.</li>
<li>When someone looks up a URI, provide useful information.</li>
<li>Include links to other URIs, so that they can discover more things.</li>
</ol>
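<p>In HTTP terms, &#8220;looking up&#8221; a URI for data usually means asking for an RDF representation with an Accept header. Here is a minimal Python sketch of such a request. It only builds the request and does not send it; the <code>application/rdf+xml</code> media type is one common choice, and whether a given server honors it is an assumption on my part.</p>

```python
from urllib.request import Request


def linked_data_request(uri, accept="application/rdf+xml"):
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    return Request(uri, headers={"Accept": accept}, method="GET")


req = linked_data_request(
    "http://dbpedia.org/resource/National_Information_Standards_Organization"
)
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

<p>Passing the request to <code>urllib.request.urlopen</code> would perform the actual lookup; a browser hitting the same URI would normally get HTML instead.</p>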
<p>
The <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData">Linking Open Data</a> community project of the <a href="http://www.w3.org/2001/sw/sweo/">W3C Semantic Web Education and Outreach Group</a> has published two additional documents <a href="http://www.dfki.uni-kl.de/~sauermann/2006/11/cooluris/">Cool URIs for the Semantic Web</a> and <a href="http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/">How to Publish Linked Data on the Web</a> that help IT professionals understand what it means to publish their assets as linked data.  The goal of the Linking Open Data Project is to</p>
<blockquote><p>
extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.
</p></blockquote>
<p>Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the &#8220;triple&#8221; which is a statement about a resource in three parts: a subject, predicate and object. The RDF triple provides a way of modeling statements about resources and it can have multiple serialization formats including XML and some more human readable formats such as <a href="http://www.w3.org/DesignIssues/Notation3">notation3</a>. For example to represent a statement that the website at http://niso.org has the title &#8220;NISO &#8211; National Information Standards Organization&#8221; one can create the following triple:</p>
<pre>
<code>
&lt;http://niso.org&gt; &lt;http://purl.org/dc/elements/1.1/title&gt; "NISO - National Information Standards Organization" .
</code>
</pre>
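<p>For readers who think better in code, a simple triple like the one above can be pulled apart into its three parts with a few lines of Python. This is only a sketch: the regular expression handles URI subjects and predicates and URI or plain-literal objects, nothing like the full grammar, and <code>parse_triple</code> is a made-up helper name.</p>

```python
import re

# One simple triple per line: <subject> <predicate> <object-or-"literal"> .
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.$')


def parse_triple(line):
    """Split a simple triple into (subject, predicate, (kind, value)).

    kind is "uri" when the object is wrapped in angle brackets and
    "literal" when it is wrapped in double quotes.
    """
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError("not a simple triple: %r" % line)
    subject, predicate, obj = match.groups()
    kind = "uri" if obj.startswith("<") else "literal"
    return subject, predicate, (kind, obj[1:-1])


line = ('<http://niso.org> <http://purl.org/dc/elements/1.1/title> '
        '"NISO - National Information Standards Organization" .')
subject, predicate, obj = parse_triple(line)
print(subject)
print(predicate)
print(obj)
```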
<p>
The subject is the URL for the website, the predicate is &#8220;has title&#8221; represented as a URI from the Dublin Core vocabulary, and the object is the literal &#8220;NISO &#8211; National Information Standards Organization&#8221;. The Linked Data movement encourages the extensive interlinking of your data with other people&#8217;s data: so for example by creating another triple such as:
</p>
<pre>
<code>
&lt;http://niso.org&gt; &lt;http://purl.org/dc/elements/1.1/creator&gt; &lt;http://dbpedia.org/resource/National_Information_Standards_Organization&gt; .
</code>
</pre>
<p>
This indicates that the website was created by NISO, which is identified using a URI from dbpedia (a Linked Data version of Wikipedia). One of the benefits of linking data in this way is the &#8220;follow your nose&#8221; effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO
</p>
<pre>
<code>

http://dbpedia.org/resource/National_Information_Standards_Organization

</code>
</pre>
<p>
24 additional RDF triples are returned including one like:
</p>
<pre>
<code>
&lt;http://dbpedia.org/resource/National_Information_Standards_Organization&gt; &lt;http://www.w3.org/2004/02/skos/core#subject&gt; &lt;http://dbpedia.org/resource/Category:Standards_organizations&gt; .
</code>
</pre>
<p>
This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:
</p>
<pre>
<code>

http://dbpedia.org/resource/Category:Standards_organizations

</code>
</pre>
<p>
and retrieve 156 triples describing other standards organizations, such as:
</p>
<pre>
<code>
&lt;http://dbpedia.org/resource/World_Wide_Web_Consortium&gt; &lt;http://www.w3.org/2004/02/skos/core#subject&gt; &lt;http://dbpedia.org/resource/Category:Standards_organizations&gt; .
</code>
</pre>
<p>
And so on. This ability for humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical web2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.
</p>
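<p>The follow-your-nose walk described above can be sketched as a small breadth-first crawl in Python. To keep it runnable offline, the web is faked with a dictionary holding only the triples quoted in this column; a real crawler would do an HTTP GET for each URI instead. The names <code>WEB</code> and <code>follow_your_nose</code> are made up for the sketch.</p>

```python
from collections import deque

DC = "http://purl.org/dc/elements/1.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
DBP = "http://dbpedia.org/resource/"

# Stand-in for the web: URI -> triples "returned" when it is dereferenced.
WEB = {
    "http://niso.org": [
        ("http://niso.org", DC + "creator",
         DBP + "National_Information_Standards_Organization"),
    ],
    DBP + "National_Information_Standards_Organization": [
        (DBP + "National_Information_Standards_Organization",
         SKOS + "subject", DBP + "Category:Standards_organizations"),
    ],
    DBP + "Category:Standards_organizations": [
        (DBP + "World_Wide_Web_Consortium",
         SKOS + "subject", DBP + "Category:Standards_organizations"),
    ],
}


def follow_your_nose(start, dereference, max_hops=3):
    """Dereference a URI, then every new URI found in the returned triples.

    Returns the URIs in the order they were discovered. Predicates are not
    followed, and anything that does not look like an http URI is skipped.
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        uri, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for subject, _predicate, obj in dereference(uri):
            for linked in (subject, obj):
                if linked.startswith("http://") and linked not in seen:
                    seen.append(linked)
                    queue.append((linked, hops + 1))
    return seen


visited = follow_your_nose("http://niso.org", lambda uri: WEB.get(uri, []))
for uri in visited:
    print(uri)
```

<p>Notice that the crawler needs no site-specific API documentation: the only thing it knows is how to look up a URI and read triples.</p>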
<p>
As with the initial growth of the web over 10 years ago, the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open <a href="http://simile.mit.edu/mailman/listinfo/linking-open-data">discussion list</a> at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet, OpenCyc. The data and relationships between the data are by definition distributed around the web and harvestable by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.
</p>
<p>
Various services aggregate Linked Data and provide services on top of it, such as <a href="http://dbpedia.org">dbpedia</a>, which has an estimated 3 million RDF links and over 2 billion RDF triples. It&#8217;s quite possible that the emerging set of Linked Data will serve as a data test bed for initiatives like the <a href="http://www.mindswap.org/blog/2007/12/05/announcing-the-open-web-billion-triple-challenge-iswc-08/">Billion Triple Challenge</a>, which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google&#8217;s PageRank algorithm, or the improbable success of Wikipedia&#8217;s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.
</p>
<p><a href="http://richard.cyganiak.de/2007/10/lod/lod-datasets_2007-11-10.png"><img src="http://richard.cyganiak.de/2007/10/lod/lod-datasets_2007-11-10.png" border="0" /></a></p>
<p>
So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee&#8217;s <a href="http://www.w3.org/DesignIssues/LinkedData.html">Linked Data</a> document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug, join the <a href="http://simile.mit.edu/mailman/listinfo/linking-open-data">discussion list</a> and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen&#8211;you might just help build a new kind of web, and rest assured you&#8217;ll definitely have some fun.
</p>
<p>Thanks to <a href="http://f00die.com/">Jay Luker</a>, <a href="http://paulmiller.typepad.com/">Paul Miller</a>, <a href="http://dannyayers.com">Danny Ayers</a> and <a href="http://onebiglibrary.net">Dan Chudnov</a> for their contributions and suggestions.</p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
