Thanks for taking an interest in the historic content on a website I help run. We want to see the NDNP newspaper content get crawled, indexed and re-purposed in as many places as possible. So we appreciate the time and effort you are spending on getting the OCR XML and JPEG2000 files into Footnote. I am a big fan of Footnote and what you are doing to help historical/genealogical researchers who subscribe to your product.
But since I have your ear, it would be nice if you identified yourself as a bot. Right now you are pretending to be Internet Explorer:
18.104.22.168 - - [22/Apr/2010:18:38:39 -0400] "GET /lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2 HTTP/1.1" 200 3170304 "-" "Internet Explorer 6 (MSIE 6; Windows XP)" "*/*" "-" "No-Cache"
Oh, and could you stop sending the Pragma: No-Cache header with every HTTP request? We have a reverse proxy in front of our dynamic content so that we don’t waste CPU cycles regenerating pages that haven’t changed. It’s what allows us to make our content available to well-behaved web crawlers. But every request you send bypasses our cache and makes our site do extra work.
It’s true, we can ignore your request to bypass our cache. In fact, that’s what we’re doing now. This means we can’t shift-reload in our browser to force the content to refresh–but we’ll manage. Maybe you could be a good citizen of the Web and send an If-Modified-Since header–or perhaps just don’t send Pragma: No-Cache?
Identifying yourself with a User-Agent string like “footbot/0.1 +(http://footnote.com/footbot)” would be neighborly too :-)
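For what it's worth, here is a rough sketch in Python of the kind of request I have in mind. It is only an illustration, not anything Footnote actually runs: the function name build_polite_request and the example.org host are made up, and the User-Agent value is just the suggestion above. It sets an identifying User-Agent, adds If-Modified-Since when the crawler already has a copy, and sends no Pragma header at all.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical bot identifier, copied from the suggestion in this post.
USER_AGENT = "footbot/0.1 +(http://footnote.com/footbot)"


def build_polite_request(url, last_fetched=None):
    """Build a GET request that identifies the crawler and allows caching.

    last_fetched should be a timezone-aware datetime recording when the
    crawler last downloaded this URL, or None if it never has.
    """
    headers = {"User-Agent": USER_AGENT}
    if last_fetched is not None:
        # HTTP dates are expressed in GMT, e.g. "Thu, 22 Apr 2010 22:38:39 GMT".
        stamp = format_datetime(last_fetched.astimezone(timezone.utc), usegmt=True)
        headers["If-Modified-Since"] = stamp
    # Deliberately no "Pragma: No-Cache": let the reverse proxy answer.
    return Request(url, headers=headers)


# Placeholder host; the path is the one from the log line above.
request = build_polite_request(
    "http://example.org/lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2",
    last_fetched=datetime(2010, 4, 22, 22, 38, 39, tzinfo=timezone.utc),
)
```

Passing the request to urllib.request.urlopen would perform the fetch. As far as I know, urlopen reports a 304 Not Modified response by raising urllib.error.HTTPError with code 304, so a crawler would catch that and keep its existing copy.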
ed@curry:~$ whois 22.214.171.124
...
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com 126.96.36.199
network:ID:NET4-2665950018
network:Network-Name:NET4-2665950018
network:IP-Network:188.8.131.52/24
network:Postal-Code:84042
network:State:UT
network:City:Linden
network:Street-Address:355 South 520 West
network:Org-Name:iArchives Inc dba Footnote
network:Tech-Contact:ZC108-ARIN
network:Updated:2008-05-21 13:05:26
network:Updated-by:Gus Reese