<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: cloaking and fulltext</title>
	<atom:link href="http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/feed/" rel="self" type="application/rss+xml" />
	<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/</link>
	<description>$pithy_personal_mission_statement</description>
	<lastBuildDate>Wed, 10 Mar 2010 02:04:23 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: ed</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81563</link>
		<dc:creator>ed</dc:creator>
		<pubDate>Wed, 11 Nov 2009 14:17:15 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81563</guid>
		<description>@sgillies in Chronicling America we do the same for all the big search engine bots. But, I agree: it does seem like the out-of-band coordination breaks the web a bit. Sometimes I try to rationalize it as a variant of content-negotiation, similar to what happens in practice on the mobile web...but it doesn&#039;t work for very long. I&#039;m definitely open to other solutions. I wish that rel=&#039;canonical&#039; could help here, but &lt;a href=&quot;http://inkdroid.org/journal/2009/05/15/canonical-question/&quot; rel=&quot;nofollow&quot;&gt;I don&#039;t think it does&lt;/a&gt;. It would be nice if there were some rel=&quot;fulltext&quot; or something that bots could follow. Perhaps @ardvaark is right and we should just bite the bullet now. But what does that do for @martin&#039;s problem? /me shrugs</description>
		<content:encoded><![CDATA[<p>@sgillies in Chronicling America we do the same for all the big search engine bots. But, I agree: it does seem like the out-of-band coordination breaks the web a bit. Sometimes I try to rationalize it as a variant of content-negotiation, similar to what happens in practice on the mobile web&#8230;but it doesn&#8217;t work for very long. I&#8217;m definitely open to other solutions. I wish that rel=&#8217;canonical&#8217; could help here, but <a href="http://inkdroid.org/journal/2009/05/15/canonical-question/" rel="nofollow">I don&#8217;t think it does</a>. It would be nice if there were some rel=&#8221;fulltext&#8221; or something that bots could follow. Perhaps @ardvaark is right and we should just bite the bullet now. But what does that do for @martin&#8217;s problem? /me shrugs</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sgillies.net/</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81562</link>
		<dc:creator>sgillies.net/</dc:creator>
		<pubDate>Wed, 11 Nov 2009 13:37:24 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81562</guid>
		<description>But what about other non-Google, non-browser agents? Would you want them to masquerade as the Googlebot, or arrange with you for the same special treatment (which you&#039;d readily provide, I don&#039;t mean to imply otherwise at all)? Either seems to break the web a bit, yes? Tricky situation.</description>
		<content:encoded><![CDATA[<p>But what about other non-Google, non-browser agents? Would you want them to masquerade as the Googlebot, or arrange with you for the same special treatment (which you&#8217;d readily provide, I don&#8217;t mean to imply otherwise at all)? Either seems to break the web a bit, yes? Tricky situation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Haye</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81561</link>
		<dc:creator>Martin Haye</dc:creator>
		<pubDate>Wed, 11 Nov 2009 00:14:49 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81561</guid>
		<description>@ardvaark: Trouble is, some of our (CDL&#039;s) items are entire monographs. Even compressed, these would add quite a bit of overhead to each page view. Admittedly we&#039;re not totally optimized for tiny downloads, but currently we *could* optimize for size. Serving up the entire OCR text would foil that.

This is a subtle point and we debated it quite a bit. In the end we went with what&#039;s practical, and put our hopes in Google recognizing that we&#039;re not cloaking -- we&#039;re giving them what they need. Search engines need text, people need the whole page experience.</description>
		<content:encoded><![CDATA[<p>@ardvaark: Trouble is, some of our (CDL&#8217;s) items are entire monographs. Even compressed, these would add quite a bit of overhead to each page view. Admittedly we&#8217;re not totally optimized for tiny downloads, but currently we *could* optimize for size. Serving up the entire OCR text would foil that.</p>
<p>This is a subtle point and we debated it quite a bit. In the end we went with what&#8217;s practical, and put our hopes in Google recognizing that we&#8217;re not cloaking &#8212; we&#8217;re giving them what they need. Search engines need text, people need the whole page experience.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ed</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81558</link>
		<dc:creator>ed</dc:creator>
		<pubDate>Tue, 10 Nov 2009 18:31:32 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81558</guid>
		<description>@gluejar I agree in principle -- but Google&#039;s docs don&#039;t really say that. The worry at LC was that if Google are trying to identify cloaked content at scale on the web they may inadvertently flag Chronicling America content as cloaked -- since determining juicy-ness could be infeasible.</description>
		<content:encoded><![CDATA[<p>@gluejar I agree in principle &#8212; but Google&#8217;s docs don&#8217;t really say that. The worry at LC was that if Google are trying to identify cloaked content at scale on the web they may inadvertently flag Chronicling America content as cloaked &#8212; since determining juicy-ness could be infeasible.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gluejar</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81557</link>
		<dc:creator>gluejar</dc:creator>
		<pubDate>Tue, 10 Nov 2009 17:38:01 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81557</guid>
		<description>What CA is doing is not cloaking; it&#039;s more akin to content-type negotiation. The one thing I&#039;d worry about is that the OCR is poor enough for your example page that a human reviewing the text could get confused and think that it&#039;s not the OCR of the image.

Cloaking is when you give a spider juicy content and then give spam to a human.</description>
		<content:encoded><![CDATA[<p>What CA is doing is not cloaking; it&#8217;s more akin to content-type negotiation. The one thing I&#8217;d worry about is that the OCR is poor enough for your example page that a human reviewing the text could get confused and think that it&#8217;s not the OCR of the image.</p>
<p>Cloaking is when you give a spider juicy content and then give spam to a human.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ardvaark.net/identity/br&#8230;</title>
		<link>http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/comment-page-1/#comment-81556</link>
		<dc:creator>ardvaark.net/identity/br&#8230;</dc:creator>
		<pubDate>Tue, 10 Nov 2009 17:31:25 +0000</pubDate>
		<guid isPermaLink="false">http://inkdroid.org/journal/?p=1432#comment-81556</guid>
		<description>How significantly will it impact page loads?  Just on my poking around, with a couple of test pages, I&#039;m showing a text-heavy page with 41K bytes in the OCR text.  If GZIP encoding were enabled, that would knock it down to 17K.  Viewing that image with all caching working correctly is 158K, so you&#039;re talking about a 10% increase in size, although most of that is in the JPG and not the HTML.

This is one of those nasty situations for which there is no good answer.  The reality here is that, like a Major League Umpire, &quot;correct&quot; is whatever Google says it is on any given day.</description>
		<content:encoded><![CDATA[<p>How significantly will it impact page loads?  Just on my poking around, with a couple of test pages, I&#8217;m showing a text-heavy page with 41K bytes in the OCR text.  If GZIP encoding were enabled, that would knock it down to 17K.  Viewing that image with all caching working correctly is 158K, so you&#8217;re talking about a 10% increase in size, although most of that is in the JPG and not the HTML.</p>
<p>This is one of those nasty situations for which there is no good answer.  The reality here is that, like a Major League Umpire, &#8220;correct&#8221; is whatever Google says it is on any given day.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
