As the last post indicated I’m part of a team at loc.gov working on an application that serves up page views like this for historic newspapers–almost a million of them in fact. For each page view there is another URL for a view of the OCR text gleaned from that image, such as this. Yeah, kind of yuckster at the moment, but we’re working on that.

Perhaps it’s obvious, but the goal of making the OCR html view available is so that search engine crawlers can come and index it. Then when someone is searching for someone’s name, say Dr. Herbert D. Burnham in Google they’ll come to page 3 in the 08/25/1901 issue of the New York Tribune. And this can happen without the searcher needing to know anything about the Chronicling America project beforehand. Classic SEO

The current OCR view at the moment is quite confusing, so we wanted to tell Google that when they link to the page in their search results they use the page zoom view instead. We reached for Google’s (and now other major search engine’s) rel=“canonical”, since it seemed like a perfect fit.

… we now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that’s accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.

From our logs we can see that Google has indeed come and fetched both the page viewer and the ocr view for this particular page, and also the text/plain and application/xml views.

66.249.71.166 - - [05/May/2009:23:31:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ HTTP/1.1" 200 15566 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.227 - - [06/May/2009:02:02:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15.pdf HTTP/1.1" 200 3119248 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.165 - - [06/May/2009:02:03:46 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ HTTP/1.1" 200 47075 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.172 - - [06/May/2009:04:34:02 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.txt HTTP/1.1" 200 40300 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.202 - - [06/May/2009:04:36:07 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.xml HTTP/1.1" 200 1447056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"

But it doesn’t look like the ocr html view is being indexed, at least based on the results for this query I just ran. We can see the .txt file is showing up, which was harvested just after the OCR html view … so it really ought to be in the search results.

A bit of text in a recent www2009 paper Sitemaps: Above and Beyond the Crawl of Duty by Uri Schonfeld and Narayanan Shivakumar made me think …

Amazon.com also suffers from URL canonicalization issues, multiple URLs reference identical or similar content. For example, our (Google’s) Discovery crawl crawls both

  • http://…/B0000FEFEFW?showViewpoints=1
  • http://…/B0000FEFEFW?filterBy=addFiveStar

The two URLs return identical content and offer little value since these pages off two “different” views on an empty customer review list. Simple crawlers cannot detect these type of duplicate URLs without downloading all duplicate URLs first, processing their content, and wasting resources in the process.

So. Could it be that google crawls and indexes http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/, where it discovers http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ which it crawls and sees the canonical URI, which it knows it has already indexed, so it doesn’t waste any resources re-indexing?

It seems like a somewhat non-obvious (to me) side effect of asserting a canonical relationship with another URI is that Google will not index the document at the alternate URI. I guess I’m just learning to only use canonical when a site has “identical or vastly similar content that’s accessible through multiple URLs” … Does this seem about right to you?

(thanks Erik Wilde for the pointer to the Schonfeld and Shivakuma paper)