Documents

I've struggled in the past with what constitutes an Information Resource in the context of Web Architecture, Linked Data and practical digital library applications such as the National Digital Newspaper Program I work on at the Library of Congress. So it was reassuring to see the issue come up a few months ago during a review of the effort to revise the HTTP specification (RFC 2616). It would be a major effort to summarize the entire conversation here. However an interesting sub-discussion circled around the idea of normalizing the language in the Architecture of the World Wide Web and RFC 2616 with respect to Resources. Well into the multi-month thread Tim Berners-Lee offered up a very helpful, historical recap of the "what is a resource" issue, in which he said:

I would like to see what the documents [AWWW and RFC 2616] all look like if edited to use the words Document and Thing, and eliminate Resource.

A Short History of "Resource"
Which, somewhat predictably, started a discussion of what a Document is. However this conversation seemed more tangible and earthy, and culminated in Larry Masinter recommending David M. Levy's book Scrolling Forward:

... since much of the thought behind it informs a lot of my own thinking about the nature of "Document", "representation", "Resource" and the like.

www-tag email message
Now Larry is a scientist at Adobe, a company that knows a thing or two about electronic documents. He also works closely with the W3C and IETF on web architectural issues. So when he suggested reading a book to learn what he means by Document my ears perked up. The interjection of a book reference into this rapid-fire email exchange was like a magic spell that made me pause, and consider that a working definition of Document was nuanced enough to be the subject matter of an entire book. I've come to expect references to Michael Buckland's classic What is a Document? in discussions of documents. I hadn't run across David Levy's name before, so Larry's recommendation was enough for me to request it from the stacks, and give it a read. I wasn't disappointed. Scrolling Forward is an ode to documents of all shapes and sizes, from all time periods. It's a joyful, mind-expanding work that explores the entire landscape of our documents: from cash register receipts, the multi-editioned Leaves of Grass, email messages, letters, books, photographs, papyrus scrolls, greeting cards and web pages. Since this takes place in 212 pages, it is not surprising that the analysis


New York Times Topics as SKOS

Serves 23,376 SKOS Concepts

INGREDIENTS

DIRECTIONS

  1. Open a new file using your favorite text editor.
  2. Instantiate an RDF graph with a dash of rdflib.
  3. Use python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
  4. Parse HTML into a fine, queryable data structure using BeautifulSoup.
  5. Locate topic names and their associated URLs, and gently add them to the graph with a pinch of SKOS.
  6. Go back to step 3 to fetch the next batch of topics, until you’ve finished Z.
  7. Bake the RDF graph as an rdf/xml file.

NOTES

If you don’t feel like cooking up the rdf/xml yourself you can download it from here (might want to right-click to download, some browsers might have trouble rendering the xml), or download the 68 line implementation and run it yourself.
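For the curious, the recipe boils down to something like this (a sketch rather than the actual 68 line script; the Times Topics index URL pattern and the HTML structure assumed below are guesses):

import string
import urllib.request

from bs4 import BeautifulSoup
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

graph = Graph()
graph.bind("skos", SKOS)

for letter in string.ascii_lowercase:
    # hypothetical URL pattern for the Times Topics index page for a letter
    url = "http://topics.nytimes.com/top/reference/timestopics/all/%s/" % letter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # assumes each topic shows up as a link whose href is the topic page
    for a in soup.find_all("a", href=True):
        if "/timestopics/" not in a["href"]:
            continue
        concept = URIRef(a["href"])
        graph.add((concept, RDF.type, SKOS.Concept))
        graph.add((concept, SKOS.prefLabel, Literal(a.get_text().strip())))

# bake the graph as rdf/xml
graph.serialize("nyt_topics.rdf", format="xml")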

The point of this exercise was mainly to show how thinking of the New York Times Topics as a controlled vocabulary that can be serialized as a file, while still being present on the Web, could be useful. Perhaps it would be useful to someone writing an application that needs to integrate with the New York Times, who wants to be able to tag content using the same controlled vocabulary. Or perhaps someone wants to be able to link their own content with similar content at the New York Times. These are all use cases for expressing the Topics as SKOS, and being able to ship it around with resolvable identifiers for the concepts.

Of course there is one slight wrinkle. Take a look at this Turtle snippet for the concept of Ray Bradbury:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .


open to view

I spent an hour checking out the HathiTrust API docs this morning; mainly to see what the similarities and differences are with the as-yet undocumented API for Chronicling America. There are quite a few similarities in the general RESTful approach, and the use of Atom, METS and PREMIS in the metadata that is made available.

Everyone’s a critic, right? Nevertheless, I’m just going to jot down a few thoughts about the API, mainly for my friend over in #code4lib, Bill Dueber, who works on the project. Let me just say at the outset that I think it’s awesome that HathiTrust is providing this API, especially given some of the licensing constraints around some of the content. The API is a good example of putting library data on the web using both general and special purpose standards. But there are a few minor things that could be tweaked, I think, to make the API fit into the web and the repository space a bit better.

First, it would be nice if the OpenSearch description document referenced in the HTML at

http://catalog.hathitrust.org/Search/OpenSearch?method=describe

worked. It should be pretty easy and non-invasive to add a basic description file for the HTML response since the search is already GET driven. Ideally it would be nice to see the responses also available as Atom and/or JSON with Atom Feed Paging.
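A bare-bones description document wouldn’t need to be much more than this (just a sketch; the search URL template and parameter name here are guesses, not what the catalog actually uses):

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>HathiTrust</ShortName>
  <Description>Search the HathiTrust catalog</Description>
  <!-- hypothetical template; the real GET parameters would come from the
       existing HTML search form -->
  <Url type="text/html"
       template="http://catalog.hathitrust.org/Search/Home?lookfor={searchTerms}"/>
</OpenSearchDescription>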

Another thing that would be nice to see is the API being merged more into the human usable webapp. The best way to explain this is with an example. Consider the HTML page for this 1914 edition of Walt Whitman’s Leaves of Grass, available with this clean URI:

http://catalog.hathitrust.org/Record/000206297

Now, you can get a few flavors of metadata for this book, and an aggregated zip file of all the page images and OCR if you are a HathiTrust member. Why not make these alternate representations discoverable right from the item display? It could be as simple as adding some <link> elements to the HTML that use the link relations they’ve already established for their Atom.
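For example, something along these lines (the rel values, media types and URLs in this sketch are guesses for illustration, not HathiTrust’s actual markup):

<!-- hypothetical <link> elements for the record page; the actual rel
     values would be the ones already used in their Atom -->
<link rel="alternate" type="application/atom+xml"
      href="http://catalog.hathitrust.org/Record/000206297.atom" />
<link rel="alternate" type="application/marc"
      href="http://catalog.hathitrust.org/Record/000206297.marc" />
<link rel="alternate" type="application/json"
      href="http://catalog.hathitrust.org/Record/000206297.json" />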

If you wanted to get fancy you could also put human readable links into the <body> and annotate them w/ RDFa. But this would just be icing on the cake. There are a few reasons for doing at least the bare minimum. The big one is to enable in-browser applications (like Zotero, etc.) to learn more about a given resource in a relatively straightforward and commonplace way. The other is to let automated agents like GoogleBot and YahooSlurp and Internet Archive’s Heritrix, etc. discover the deep web data that’s held behind your API. Another nice side effect is that it helps people who might ordinarily scrape your site automatically discover the API in a straightforward way.

Lastly, I was curious to know if HathiTrust considered adjusting their Atom response to use the Atom pattern recommended by the OAI-ORE folks. They are pretty close already, and in fact seem to have modeled their own aggregation vocabulary on OAI-ORE. It would be interesting to hear why they diverged if it was intentional, and if it might be possible to use a bit of oai-ore in there so we can bootstrap an oai-ore harvesting ecosystem.

I’m not sure that I can still call this approach to integrating web2.0 APIs into web1.x applications Linked Data anymore, since it doesn’t really involve RDF directly. It does involve thinking in a RESTful way about the resources you are publishing on the web, and how they can be linked together to form a graph. My colleague Dan has been writing in Computers in Libraries recently about how perhaps thinking in terms of “building a better web” may be a more accurate way of describing this activity.

For reasons I don’t fully understand I’ve been reading a lot of Wittgenstein (well mainly books about Wittgenstein honestly) lately during the non-bike commute. The trajectory of his thought over his life is really interesting to me. He had this zen-like, controversial idea that

Philosophy simply puts everything before us, and neither explains nor deduces anything. — Since everything lies open to view there is nothing to explain. For what is hidden, for example, is of no interest to us. (PI 126)

I really like this idea that our data APIs on the web could be “open to view” by checking out the HTML, following your nose, and writing scrapers, bots and browser plugins to use what you find. I think it’s unfortunate that the recent changes to the Linked Data Design Issues, and the ensuing discussion, seemed to create this dividing line about the use of RDF and SPARQL. I had always hoped (and continue to hope) that the Linked Data effort is bigger than a particular brand, or reformulation of the semantic web effort … for me it’s a pattern for building a better web. I think RDF is very well suited to expressing the core nature of the web, the Giant Global Graph. I’ve served up RDF representations in applications I’ve worked on just for this reason. But I think the Linked Data pattern will thrive most if it is thought of as an inclusive continuum of efforts, similar to what Dan Brickley has suggested. We technology people strive for explicitness, it’s an occupational hazard – but there’s sometimes quite a bit of strength in ambiguity.

Anyhow, my little review of the HathiTrust API turned into a bit of a soapbox for me to stand on and shout like a lunatic. I guess I’ve been wanting to write about what I think Linked Data is for a few weeks now, and it just kinda bubbled up when I least expected it. Sorry Bill!


flickr, digital curation and the web

The Library of Congress has started to put selected content from Chronicling America into Flickr as part of the Illustrated Newspaper Supplements set. More details on the rationale and process involved can be found in a FAQ on the LC Newspapers and Current Periodical Reading Room website.

So for example, this newspaper page on Chronicling America is also available on Flickr.

I haven’t written about it here yet, but Chronicling America is just a regular vanilla Django/MySQL/Apache webapp which exposes machine readable metadata for the newspaper content using the Linked Data pattern. It just so happens that Dave was able to use these linked data views to determine the metadata to use when uploading content to Flickr. For example, if a curator wants to have this newspaper page uploaded to Flickr, Dave’s Flickr uploading program is able to use the associated metadata (referenced in a link element in the HTML) to get the newspaper title, issue, date, and other related metadata. The beauty of this was that Dave was able to do the technical work on his own, and it didn’t require any formal project coordination.
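The “follow your nose” part is pretty mundane: fetch the page HTML, look for the alternate link elements, and go from there. A minimal sketch of that step (the specific rel values on the Chronicling America pages may differ from what this assumes):

import urllib.request
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect <link rel="alternate"> elements from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            if a.get("rel") == "alternate":
                self.links.append((a.get("type"), a.get("href")))

url = "http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-03/ed-1/seq-17/"
finder = LinkFinder()
finder.feed(urllib.request.urlopen(url).read().decode("utf-8", "replace"))
for media_type, href in finder.links:
    # each of these is a machine readable view of the same page
    print(media_type, href)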

A few weeks ago, on learning of the Flickr / Chronicling America integration, Rob Sanderson asked if we could possibly reference the Flickr content in our own data views. Rob became interested in Chronicling America because of its use of the Open Archives Initiative Object Reuse and Exchange Vocabulary in the linked data views. Rob has written a rather nice oai-ore Greasemonkey tool called foresite-explorer, which can visualize oai-ore links to Flickr. It also makes sense from a curatorial perspective to want to capture these bi-directional links between Chronicling America and Flickr in the Chronicling America database.

After agreeing with Rob I’ve had it on my plate to get the list of Flickr URLs and their associated Chronicling America URLs from Dave, for loading into Chronicling America so that the links can be served up in the data views, and perhaps even in the HTML. But yesterday morning I had the realization that I didn’t really need to ask (and keep asking every month) Dave for the list. Since Dave created a Flickr set for these pages, and has embedded the URI for the Chronicling America page as a machine tag, I can get it right from Flickr. So I hacked together a quick script, and now I have the list too.
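The script boils down to asking the Flickr API for the photos in the set along with their machine tags, and pulling the Chronicling America URI out of each tag. Roughly (this is a sketch rather than the actual script; the API key, the photoset id and the exact shape of the machine tag are all assumptions):

import json
import urllib.parse
import urllib.request

API_KEY = "your-flickr-api-key"          # you'd need your own key
SET_ID = "72157600000000000"             # hypothetical photoset id

def flickr(method, **params):
    """Call a Flickr REST API method and return the parsed JSON response."""
    params.update(method=method, api_key=API_KEY,
                  format="json", nojsoncallback=1)
    url = "https://api.flickr.com/services/rest/?" + urllib.parse.urlencode(params)
    return json.loads(urllib.request.urlopen(url).read())

resp = flickr("flickr.photosets.getPhotos", photoset_id=SET_ID,
              extras="machine_tags")

for photo in resp["photoset"]["photo"]:
    flickr_url = "http://www.flickr.com/photos/library_of_congress/%s" % photo["id"]
    # assumes the machine tag value is the Chronicling America page URI
    for tag in photo.get("machine_tags", "").split():
        if "chroniclingamerica.loc.gov" in tag:
            print((flickr_url, tag.split("=", 1)[1]))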

ed@rorty:~$ ./flickr_chronam.py
(u'http://www.flickr.com/photos/library_of_congress/3608399458', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-03/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608400834', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-10/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608402104', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-17/ed-1/seq-16/')
(u'http://www.flickr.com/photos/library_of_congress/3608403362', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-24/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607588861', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-01-31/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608405718', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-02-07/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608407068', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-02-14/ed-1/seq-16/')
(u'http://www.flickr.com/photos/library_of_congress/3608408274', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-02-21/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607593693', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-02-28/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608410606', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-03-07/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607596267', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-03-14/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607597927', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-03-21/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608414374', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-03-28/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608415708', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-04-04/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607601559', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-04-11/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608418042', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-04-18/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608419060', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-04-25/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607604705', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-02/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608421240', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-09/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608422694', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-16/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607608459', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-23/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608425436', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-30/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607611709', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-06-06/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607637819', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-06-13/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607638897', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-06-20/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608455948', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-06-27/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607641409', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-07-04/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607642551', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-07-11/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607889205', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-07-18/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608709982', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-07-25/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607894517', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-08-01/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607896027', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-08-08/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608713826', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-08-15/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608715804', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-08-22/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3607900561', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-08-29/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608718394', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-09-05/ed-1/seq-15/')
(u'http://www.flickr.com/photos/library_of_congress/3608719874', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-09-12/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608721302', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-09-19/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607906387', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-09-26/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608724542', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-10-03/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607909093', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-10-10/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607910739', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-10-17/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608728408', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-10-24/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607913989', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-10-31/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607915481', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-11-07/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607916497', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-11-14/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608734354', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-11-21/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3607919583', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-11-28/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608737124', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-12-05/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608738658', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-12-12/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608740242', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-12-19/ed-1/seq-17/')
(u'http://www.flickr.com/photos/library_of_congress/3608741612', u'http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-12-26/ed-1/seq-13/')

The point of this blog post is that Dave, Rob and I were able to independently work on tools for pushing the same repository content around without having to talk much at all. The World Wide Web, and URIs in particular, enables this sort of integration. I guess some people would argue that this is a Web2.0 Mashup, and I guess they would be right in a way. But really it is using the Web the way it was designed to be used–as a decentralized information space. Flickr can tell me about the Chronicling America content that’s on Flickr; and Chronicling America can tell other people about the repository objects themselves.

Now I just need to make those links available for Rob in the data views :-)


American Memory is (almost) 20

Through an internal discussion list at the Library of Congress I learned that this year will mark the 20th Anniversary of American Memory. The exact date of the anniversary depends on how you want to mark it: either the beginning of FY90 on October 1st, 1989 (thanks David) when work officially began, or earlier in the year when the President signed the bill that included the Legislative Branch appropriations for that year (exact date yet to be determined).

Via the discussion list I was able to learn that Shirley Liang (with the help of Nancy Eichacker) was able to locate a transcript of the hearings, which includes the details of Carl Fleischhauer’s demo of a HyperCard / laser videodisc based system before the House and later the Senate. Yes, HyperCard. LoC was making a pitch for American Memory before Congress just a few months after Tim Berners-Lee made his proposal to build a “web of notes with links” at CERN. Incidentally, I learned recently in Andrew Lih’s Wikipedia Revolution that Ward Cunningham’s first implementation of Wiki was written using HyperCard.

I digress…and I want to digress more.

As a Library School student in the mid 90s I became a big fan of American Memory. It seemed like an audacious and exciting experiment right on the cutting edge of what the World Wide Web made (and continues to make) possible. The work that Caroline Arms and Carl Fleischhauer did to expose metadata about American Memory collections (with the technical expertise of Dave Woodward) deepened my interest in what LoC was doing. In hindsight, I think seeing this work from afar is what got me interested in trying to find a job at the Library of Congress.

Seeing that American Memory is turning 20 this year made me fess up to a crazy idea of writing a history of the project. In conversation with those much more knowledgeable than me I think I’ve convinced myself that a good place to start would be compiling a bibliography of things that have been written about the project. It seems a relatively simple and logical place to start.


canonical question

As the last post indicated I’m part of a team at loc.gov working on an application that serves up page views like this for historic newspapers–almost a million of them in fact. For each page view there is another URL for a view of the OCR text gleaned from that image, such as this. Yeah, kind of yuckster at the moment, but we’re working on that.

Perhaps it’s obvious, but the goal of making the OCR html view available is so that search engine crawlers can come and index it. Then when someone is searching for someone’s name, say Dr. Herbert D. Burnham, in Google they’ll come to page 3 in the 08/25/1901 issue of the New York Tribune. And this can happen without the searcher needing to know anything about the Chronicling America project beforehand. Classic SEO.

The OCR view is quite confusing at the moment, so we wanted to tell Google that when they link to the page in their search results they should use the page zoom view instead. We reached for Google’s (and now other major search engines’) rel=“canonical”, since it seemed like a perfect fit.

… we now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that’s accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.
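Concretely, this just means the OCR view carries a link element in its head pointing back at the page view it duplicates. A sketch of the markup (the surrounding HTML in Chronicling America may of course differ):

<!-- in the <head> of /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ -->
<link rel="canonical"
      href="http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/" />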

From our logs we can see that Google has indeed come and fetched both the page viewer and the ocr view for this particular page, and also the text/plain and application/xml views.

66.249.71.166 - - [05/May/2009:23:31:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ HTTP/1.1" 200 15566 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.227 - - [06/May/2009:02:02:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15.pdf HTTP/1.1" 200 3119248 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.165 - - [06/May/2009:02:03:46 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ HTTP/1.1" 200 47075 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.172 - - [06/May/2009:04:34:02 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.txt HTTP/1.1" 200 40300 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.202 - - [06/May/2009:04:36:07 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.xml HTTP/1.1" 200 1447056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"

But it doesn’t look like the ocr html view is being indexed, at least based on the results for this query I just ran. We can see the .txt file is showing up, which was harvested just after the OCR html view … so it really ought to be in the search results.

A bit of text in a recent www2009 paper Sitemaps: Above and Beyond the Crawl of Duty by Uri Schonfeld and Narayanan Shivakumar made me think …

Amazon.com also suffers from URL canonicalization issues, multiple URLs reference identical or similar content. For example, our (Google’s) Discovery crawl crawls both

  • http://…/B0000FEFEFW?showViewpoints=1
  • http://…/B0000FEFEFW?filterBy=addFiveStar

The two URLs return identical content and offer little value since these pages offer two “different” views on an empty customer review list. Simple crawlers cannot detect these type of duplicate URLs without downloading all duplicate URLs first, processing their content, and wasting resources in the process.

So. Could it be that Google crawls and indexes http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/, where it discovers http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/, which it crawls and sees the canonical URI, which it knows it has already indexed, so it doesn’t waste any resources re-indexing?

It seems that a somewhat non-obvious (to me) side effect of asserting a canonical relationship with another URI is that Google will not index the document at the alternate URI. I guess I’m just learning to only use canonical when a site has “identical or vastly similar content that’s accessible through multiple URLs” … Does this seem about right to you?

(thanks Erik Wilde for the pointer to the Schonfeld and Shivakumar paper)


rest, the semantic web and my feeble brain

Imagine you were minting close to a million URIs for historic newspaper pages such as:

http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/

for pages like this one.

The web page allows you to zoom in quite close and see lots of detail in the page.

Now let’s say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether or not I can hang my description of the page off the above URI, for example:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .

Or if I need to mint a new URI for the page as a real world thing:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page> 
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .

AWWW 1 provides some guidance:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

Can all of the essential characteristics of this newspaper page be sent down the wire as a message to a client? The text of the page is pretty legible after zooming in, and you can see pictures, headlines, etc. You can’t feel the texture of the page itself, but then you can’t with the microfilm that the page images were generated from either. So I’m inclined to say yes.

Cool URIs for the Semantic Web also has some advice:

It is important to understand that using URIs, it is possible to identify both a thing (which may exist outside of the Web) and a Web document describing the thing. For example the person Alice is described on her homepage. Bob may not like the look of the homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for the homepage or a RDF document describing Alice. The question is where to draw the line between the case where either is possible and the case where only descriptions are available.

According to W3C guidelines ([AWWW], section 2.2.), we have a Web document (there called information resource) if all its essential characteristics can be conveyed in a message. Examples are a Web page, an image or a product catalog.

This becomes important in HTTP, because a 200 response code should be sent when a Web document has been accessed, but a different setup is needed when publishing URIs that are meant to identify entities which are not Web documents.

This makes me think that I will need distinct identifiers for the abstract notion of the Newspaper Page, and the HTML document itself, if it is important to describe them separately. Say for example if I wanted to say the publisher of the web page was the Library of Congress, but the publisher of the Newspaper Page was Charles M. Shortridge. If I don’t have distinct identifiers I will have to say:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  dc:publisher <http://loc.gov>, 
  <http://www.joincalifornia.com/candidate/12338> 
  .

Pondering this Information Resource Sniff-Test got me re-reading Xiaoshu Wang’s paper URI Identity and Web Architecture Revisited again. And I’ve come away more convinced that maybe he’s right: that the real issue lies in my vocabulary usage (dc:publisher in this example), and not with whether my URI identifies an Information Resource or not. So maybe new vocabulary is needed in order to describe the representation?

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> 
  web:repPublisher <http://loc.gov> ;
  dcterms:publisher <http://www.joincalifornia.com/candidate/12338> 
  .

But there isn’t a community of practice behind Xiaoshu’s position, at least not one like the Linked Data community. Unless perhaps his position is closer to the REST community, which is going strong at the moment, especially in AtomPub circles. Members of the linked-data/semweb community would most likely say that there needs to be either hash or 303’ing URIs for the Newspaper Page, distinct from the URIs for the document describing the Newspaper Page. As a latecomer to the httpRange-14 debate I don’t think I ever internalized how REST and the Semantic Web are slightly out of tune w/ each other regarding resources on the web.
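For what it’s worth, the 303 option wouldn’t be much code in the Django app. The sketch below mints a hypothetical thing URI alongside each page URI and answers it with a See Other (the URL layout for the thing URI is made up here):

from django.http import HttpResponse

def newspaper_page_thing(request, lccn, date, edition, sequence):
    """Hypothetical view bound to a non-document URI for the page itself,
    e.g. /lccn/sn85066387/1898-01-01/ed-1/seq-1/page (made up layout).
    It 303s over to the document that describes the page."""
    response = HttpResponse(status=303)
    response["Location"] = "/lccn/%s/%s/ed-%s/seq-%s/" % (
        lccn, date, edition, sequence)
    return response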

So. Should I have two different URIs: one for the real-world Newspaper Page, and one for the HTML document that describes that page? Is the Newspaper Page an Information Resource? Am I muddling up something here? Am I thinking too much? Should I just let sleeping dogs lie? Your opinion, advice, therapy would be greatly appreciated.


VocabularySoup (1)

It’s been great to see RDFa being picked up by web2.0 publishers like Digg and MySpace. You can use the RDFa Distiller to extract the RDFa from a given web page u by constructing a URI like:

http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=u

Which translates kind of nicely into a command line utility to add to your ~/bin:

#!/bin/sh
curl "http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=$1"

So with that little shell script in hand I can now look at the RDFa in something like Yo La Tengo’s page on MySpace:

ed@rorty:~$ rdfa http://www.myspace.com/yolatengo


LibraryThing Ubuntu Screen Saver

I read about the LibraryThing Mac Screensaver and of course wanted the same thing for my Ubuntu workstation at $work. Naturally, I’m supposed to be working on some high-priority tickets on a tight deadline…so I started to work right away on how to do this. Your tax dollars at work, etc…

I’m sure that there’s a much more elegant way of doing this, but I basically created a simple python program extract-images that will pull image urls out of arbitrary text, suck down the images, and dump them to a directory. This can be combined with cron and the standard GLSlideshow screensaver, which displays a slideshow of images in a particular directory.
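If you’re curious, the whole thing amounts to something like this (a sketch of the idea rather than the script itself; the image URL pattern it matches is about as naive as it looks):

#!/usr/bin/env python
# pull image URLs out of whatever text lives at a URL and save the
# images into a directory, for a screensaver to find later
import os
import re
import sys
import urllib.request

def extract_images(url, directory):
    text = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    os.makedirs(directory, exist_ok=True)
    for image_url in re.findall(r'https?://\S+?\.(?:jpg|jpeg|png|gif)', text):
        filename = os.path.join(directory, os.path.basename(image_url))
        urllib.request.urlretrieve(image_url, filename)

if __name__ == "__main__":
    extract_images(sys.argv[1], sys.argv[2])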

So you just download extract-images, put it in your path, add a crontab entry like this (substituting your LibraryThing username for edsu):

00 14 * * * extract-images http://www.librarything.com/labs-screensaver.php?userid=edsu /home/ed/Pictures/covers

And then tell GLSlideshow where your images are by adding this to your ~/.xscreensaver

imageDirectory:   /home/ed/Pictures/covers
chooseRandomImages:   True

Dear $manager, it really didn’t take me that long to do this. Honest!


APIs Suck

With TransparencyCamp last weekend, news of the mandated use of feed syndication by Federal Agencies receiving funds from the Recovery Act, recent blog posts by Tim O’Reilly and the Special Libraries Association, an article in Newsweek, news of Carl Malamud’s bid to become the Public Printer of the United States (aka head of the GPO), and the W3C eGov meeting coming up next week, it looks like issues related to public access to government data (specifically Library of Congress bibliographic and legislative data) are hitting the mainstream media, and getting political mind-share. Exciting times.

One thing that bubbled up at code4lib2009 last week was the notion that APIs Suck. Not that web2.0 APIs are wrong or bad…they’re actually great, especially when compared to a world where no machine access to the data existed before. The point is that sometimes just having access to the raw data in the ‘lowest level format’ is the ideal. Rather than service providers trying to guess what you are trying to do with their data, and absorbing the computational responsibility of delivering it, why not make the data readily available using a protocol like HTTP? Put the data in a directory, turn on Indexes, do some sensible caching and maybe gzip compression, and let people grab it and robots crawl it. Or maybe use something like Amazon Public Datasets. It seems like a relatively easy first step, one that involves very little custom software development, and one with the ability to make a huge impact.
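To make that concrete, the whole setup could be little more than a handful of Apache directives (a sketch; the paths and media types here are made up):

# serve the raw data files out of a directory, with indexes turned on
Alias /data /var/www/data
<Directory /var/www/data>
    # let people (and robots) browse the directory listings
    Options +Indexes
</Directory>

# mod_deflate: gzip the text-ish formats on the way out
AddOutputFilterByType DEFLATE text/plain text/xml application/xml

# mod_expires: some sensible caching
ExpiresActive On
ExpiresDefault "access plus 1 day"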

I’m a federal employee, so I really can’t come out and formally advocate directly for political appointments. But I have to say it would be great to see someone like Malamud at the helm of the GPO, since he’s been doing just this kind of work for 20 years. Exciting times.