It’s comforting to know that the California Digital Library is selectively serving up full-text content in HTML from its institutional repository for search engines to chew on. For example, compare the output of:

curl http://escholarship.org/uc/item/2896686x

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://escholarship.org/uc/item/2896686x

You should see full-text content for the article in the latter and not in the former:

...

qt2896686x repo "Wholly Visionary": the American Library Association, the Library of Congress, and the Card Distribution Program wholly visionary the american library association the library of congress and the card distribution program 2009 2009 2009 2009-04-01 2009-04-01 20090401 yee yy::Yee, Martha M Yee, Martha M American Library Association American Library Association Library of Congress Library of Congress card distribution program card distribution program shared cataloging shared cataloging cooperative cataloging cooperative cataloging national bibliography national bibliography cataloging rules and standards cataloging rules and standards library history united states library history united states This paper offers a historical review of the events and institutional influences in the nineteenth century that led to the ...
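If you would rather not eyeball two curl dumps, the comparison can be scripted. What follows is only a rough sketch using the Python standard library: the helper names fetch and extra_words are my own, and whether the server still answers the two user agents differently is an assumption you would need to check.

```python
import re
import urllib.request

# The same User-Agent string used in the curl example above.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def fetch(url, user_agent=None):
    """Fetch url and return the body as text.

    If user_agent is given it is sent as the User-Agent header;
    otherwise urllib's default User-Agent is used.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def extra_words(default_body, crawler_body):
    """Return, sorted, the words that appear only in the crawler response."""
    def tokenize(text):
        # Crude tokenizer: lowercase runs of ASCII letters, markup included.
        return set(re.findall(r"[a-z]+", text.lower()))

    return sorted(tokenize(crawler_body) - tokenize(default_body))
```

Calling extra_words(fetch(url), fetch(url, GOOGLEBOT_UA)) with the eScholarship URL above should list the words that only the crawler gets to see, provided the server still varies its response by user agent.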

The advantage of doing this became clear when I was searching for a quote from Title 2, Chapter 5, Section 150 of the US Code:

The Librarian of Congress is authorized to furnish to such institutions or individuals as may desire to buy them

I found Martha Yee’s paper “Wholly Visionary”: the American Library Association, the Library of Congress, and the Card Distribution Program as the 5th hit in the search results.

We do this at the Library of Congress as well, in Chronicling America, to make the OCR text of historic newspaper pages available to search engines without burdening the UI with all the (much noisier) textual content. Compare:

curl http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/
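On the server side, this kind of behaviour boils down to a check on the User-Agent header. The snippet below is a hypothetical sketch of that idea, not the actual Chronicling America code: the token list, the function names, and the markup are all assumptions made for illustration.

```python
import html

# Assumed, illustrative list of crawler tokens; a real site would maintain its own.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def is_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def render_page(user_agent, title, ocr_text):
    """Build the page body, adding the OCR text only for crawlers."""
    parts = ["<h1>{}</h1>".format(html.escape(title))]
    if is_crawler(user_agent):
        # Only search engine crawlers receive the (noisy) OCR text.
        parts.append('<div id="ocr">{}</div>'.format(html.escape(ocr_text)))
    return "\n".join(parts)
```

A browser or plain curl request would get just the heading, while a request identifying itself as Googlebot would also get the OCR text.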

However, we’ve got a ticket in our tracking system to revisit this in light of Google themselves frowning on the practice of ‘cloaking’:

Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

We were thinking of returning the OCR text in all the responses and putting it in a background pane of some kind that can be selected. But this will most likely increase the size of the HTTP response, and may significantly increase the load time. As more and more full-text content moves online, it would be nice to have a pattern digital libraries could follow for minting URIs for books, articles, etc., while still making the full-text content available to user agents that can effectively use it.
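To make the trade-off concrete, here is a rough sketch of the "same response for everyone" option, with the OCR text tucked into a collapsed pane. The function names and the use of a details element are assumptions for illustration only, not a design we have settled on.

```python
import html


def render_uniform(title, ocr_text):
    """Return the same HTML for every client, with the OCR text in a collapsed pane."""
    return (
        "<html><head><title>{t}</title></head><body>"
        "<h1>{t}</h1>"
        '<details id="ocr"><summary>Page text (OCR)</summary>'
        "<pre>{o}</pre></details>"
        "</body></html>"
    ).format(t=html.escape(title), o=html.escape(ocr_text))


def added_bytes(title, ocr_text):
    """Return how many bytes the OCR text adds to the UTF-8 encoded response."""
    with_text = render_uniform(title, ocr_text).encode("utf-8")
    without_text = render_uniform(title, "").encode("utf-8")
    return len(with_text) - len(without_text)
```

Running added_bytes over a sample of real pages would give a feel for how much heavier each response becomes before committing to the change.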

Google hasn’t dropped Chronicling America’s pages from its index yet, which is a good sign. After running across a similar pattern at CDL, I’m wondering if it’s OK to continue doing what we are doing. What do you think?

Update: Leigh Dodds let me know on Twitter that much of the content gets into Google Scholar via cloaking.