
cloaking and fulltext

It’s comforting to know that California Digital Library are selectively serving up fulltext content in HTML from their institutional repository for search engines to chew on. For example, compare the output of:

curl http://escholarship.org/uc/item/2896686x

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://escholarship.org/uc/item/2896686x

You should see full-text content for the article in the latter and not in the former:

...
qt2896686x repo "Wholly Visionary": the American Library Association, the Library of Congress, and the Card Distribution Program wholly visionary the american library association the library of congress and the card distribution program 2009 2009 2009 2009-04-01 2009-04-01 20090401 yee yy::Yee, Martha M Yee, Martha M American Library Association American Library Association Library of Congress Library of Congress card distribution program card distribution program shared cataloging shared cataloging cooperative cataloging cooperative cataloging national bibliography national bibliography cataloging rules and standards cataloging rules and standards library history united states library history united states This paper offers a historical review of the events and institutional influences in the nineteenth century that led to the ...
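
If you would rather script the comparison than eyeball two curl outputs, something like the following Python sketch could work. The helper names (fetch, words, bot_only_words) are my own, the tag stripping is deliberately crude, and there is no guarantee the server still responds the way it did when I ran the curl commands above.

```python
import re
import urllib.request

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def fetch(url, user_agent=None):
    """Fetch a URL, optionally sending a specific User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def words(html):
    """Crudely strip tags and return the set of lowercased words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bot_only_words(default_html, bot_html):
    """Words present in the bot response but absent from the default one."""
    return words(bot_html) - words(default_html)


if __name__ == "__main__":
    url = "http://escholarship.org/uc/item/2896686x"
    extra = bot_only_words(fetch(url), fetch(url, GOOGLEBOT_UA))
    print(len(extra), "words appear only in the Googlebot response")
```

Note that when no User-Agent is passed, urllib sends its own default (Python-urllib/...) rather than curl's, so the "default" response here is only an approximation of what a browser or curl would get.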

The advantage of serving fulltext to search engines this way is that when I was searching for a quote from Title 2, Chapter 5, Section 150 of the US Code:

The Librarian of Congress is authorized to furnish to such institutions or individuals as may desire to buy them

I found Martha Yee’s paper “Wholly Visionary”: the American Library Association, the Library of Congress, and the Card Distribution Program as the 5th hit in the search results.

We do this at the Library of Congress as well, in Chronicling America, to make the OCR text of historic newspaper pages available to search engines while not burdening the UI search interface with all the (much noisier) textual content. Compare:

curl http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/
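
On the server side, this kind of branching can be as simple as inspecting the User-Agent header. The following is a minimal sketch of the idea and not the actual Chronicling America code: the crawler token list, the function names and the div id are all made up for illustration.

```python
import html

# Hypothetical list of crawler tokens; a real deployment would maintain its own.
BOT_TOKENS = ("googlebot", "bingbot", "slurp")


def is_search_bot(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def render_page(page_html, ocr_text, user_agent):
    """Append the escaped OCR text to the page only for recognised crawlers."""
    if is_search_bot(user_agent):
        return page_html + '\n<div id="ocr">' + html.escape(ocr_text) + "</div>"
    return page_html
```

Substring matching like this is trivially spoofable, which is exactly why the curl commands above work.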

However, we’ve got a ticket in our tracking system to revisit this practice in light of Google themselves frowning on the practice of ‘cloaking’:

Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

We were thinking of returning the OCR text in all the responses and putting it in a background pane of some kind that can be selected. But this will most likely increase the size of the HTTP response, and may significantly impact the load time. As more and more fulltext content moves online it would be nice to have a pattern digital libraries could follow for minting URIs for books, articles, etc. while still making the fulltext content available to user agents that can effectively use it.
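
To get a rough feel for the size cost before committing to it, one could measure how much the OCR text adds once gzipped. This is only a sketch: the function name and the idea of expressing the cost as a fraction of the existing page weight are my own assumptions, and real load time depends on far more than byte counts.

```python
import gzip


def size_overhead(page_bytes, ocr_text):
    """Return (raw, gzipped, fraction) for adding ocr_text to a page.

    page_bytes is the total size of the page view without the OCR text.
    fraction is the gzipped OCR size relative to page_bytes.
    """
    encoded = ocr_text.encode("utf-8")
    raw = len(encoded)
    gzipped = len(gzip.compress(encoded))
    return raw, gzipped, gzipped / page_bytes
```

Calling size_overhead with the current page weight in bytes and the OCR text for a page gives the uncompressed size, the gzipped size, and the relative increase if the response were served compressed.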

Google hasn’t dropped Chronicling America’s pages from its index yet, which is a good sign. After running across a similar pattern at CDL I’m wondering if it’s OK to continue doing what we are doing. What do you think?

Update: Leigh Dodds let me know on Twitter that much of the content gets into Google Scholar via cloaking.

6 Comments

  1. How significantly will it impact page loads? Just on my poking around, with a couple of test pages, I’m showing a text-heavy page with 41K bytes in the OCR text. If GZIP encoding were enabled, that would knock it down to 17K. Viewing that image with all caching working correctly is 158K, so you’re talking about a 10% increase in size, although most of that is in the JPG and not the HTML.

    This is one of those nasty situations for which there is no good answer. The reality here is that, like a Major League Umpire, “correct” is whatever Google says it is on any given day.

    Tuesday, November 10, 2009 at 10:31 am | Permalink
  2. gluejar wrote:

    What CA is doing is not cloaking; it’s more akin to content-type negotiation. The one thing I’d worry about is that the OCR is poor enough for your example page that a human reviewing the text could get confused and think that it’s not the OCR of the image.

    Cloaking is when you give a spider juicy content and then give spam to a human.

    Tuesday, November 10, 2009 at 10:38 am | Permalink
  3. ed wrote:

    @gluejar I agree in principle, but Google’s docs don’t really say that. The worry at LC was that if Google are trying to identify cloaked content at scale on the web they may inadvertently flag Chronicling America content as cloaked, since determining juicy-ness could be infeasible.

    Tuesday, November 10, 2009 at 11:31 am | Permalink
  4. Martin Haye wrote:

    @aardvark: Trouble is some of our (CDL’s) items are entire monographs. Even compressed these would add quite a bit of overhead to each page view. Admittedly we’re not totally optimized for tiny downloads, but currently we *could* optimize for size. Serving up the entire OCR text would foil that.

    This is a subtle point and we debated it quite a bit. In the end we went with what’s practical, and put our hopes in Google recognizing that we’re not cloaking — we’re giving them what they need. Search engines need text, people need the whole page experience.

    Tuesday, November 10, 2009 at 5:14 pm | Permalink
  5. But what about other non-Google, non-browser agents? Would you want them to masquerade as the Googlebot, or arrange with you for the same special treatment (which you’d readily provide, I don’t mean to imply otherwise at all)? Either seems to break the web a bit, yes? Tricky situation.

    Wednesday, November 11, 2009 at 6:37 am | Permalink
  6. ed wrote:

    @sgillies in Chronicling America we do the same for all the big search engine bots. But, I agree: it does seem like the out-of-band coordination breaks the web a bit. Sometimes I try to rationalize it as a variant of content-negotiation, similar to what happens in practice on the mobile web…but it doesn’t work for very long. I’m definitely open to other solutions. I wish that rel="canonical" could help here, but I don’t think it does. It would be nice if there were some rel="fulltext" or something that bots could follow. Perhaps @ardvaark is right and we should just bite the bullet now. But what does that do for @martin’s problem? /me shrugs

    Wednesday, November 11, 2009 at 7:17 am | Permalink
