After just over 25 years of use the Robots Exclusion Standard, otherwise known as robots.txt is being standardized at the IETF. This isn’t really news, as the group at Google that is working on it announced the work over a year ago. The effort continues apace, with the latest draft having been submitted back in the middle of pandemic summer.
But it is notable I think because of the length of gestation time this particular standard took. It made me briefly think about what it would be like if standards always worked this way–by documenting established practices, desire lines if you will, rather than being quiet ways to shape markets (Russell, 2014). But then again maybe that hands off approach is fraught in other ways. Standardization processes offer the opportunity for consensus, and a framework for gathering input from multiple parties.
It seems like a good time to write down some tricks of the robots.txt trade (e.g the stop reading after 500kb rule, which I didn’t know about). What would Google look like today if it wasn’t for some of the early conventions that developed around web crawling? Would early search engines have existed at all if a convention for telling them what to crawl and what not to crawl didn’t come into existence? Even though it has been in use for 25 years it will be important to watch the diffs with the existing de-facto standard, to see what new functionality gets added and what (if anything) is removed.
I also wonder if this might be an opportunity for the digital preservation community to grapple with documenting some of its own practices around robots.txt. Much web archiving crawling software has options for observing robots.txt, or explicitly ignoring it. There are clearly legitimate reasons for a crawler to ignore robots.txt, as in cases where CSS files or images are accidentally blocked by a robots.txt and which prevent the rendering of an otherwise unblocked page. I think ethical arguments can also be made for ignoring an exclusion. But ethics are best decided by people not machines– even though some think the behavior of crawling bots can be measured and evaluated (Giles, Sun, & Councill, 2010 ; Thelwall & Stuart, 2006).
Web archives use robots.txt in another significant way too. Ever since the Oakland Archive Policy the web archiving community has used the robots.txt in playback of archived data. Software like the Wayback Machine has basicaly become the reading room of the archived web. The Oakland Archive Policy made it possible for website owners to tell web archives about content on their site that they would like the web archive not to “play back”, even if they had the content. Here is what they said back then:
- Archivists should provide a ‘self-service’ approach site owners can use to remove their materials based on the use of the robots.txt standard.
- Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site.
- This allows archivists to ensure that material will no longer be gathered or made available.
- These requests will not be made public; however, archivists should retain copies of all removal requests.
This convention allows web publishers to use their robots.txt to tell the Internet Archive (and potentially other web archives) not to provide access to archived content from their website. It also is not really news at all. The Internet Archive’s Mark Graham wrote in 2017 about how robots.txt haven’t really been working out for them lately, and how they now ignore them for playback of .gov and .mil domains. There was a popular article about this use of robots.txt written by David Bixenspan at Gizmodo, When the Internet Archive Forgets, and a follow up from David Rosenthal Selective Amnesia.
Perhaps the collective wisdom now is that the use of robots.txt to control playback in web archives is fundamentally flawed and shouldn’t be written down in a standard. But lacking a better way to request that something be removed from the Internet Archive I’m not sure if that is feasible. Some, like Rosenthal, suggest that it’s too easy for these take down notices to be issued. Consent on the web is difficult once you are operating at the scale that the Internet Archive does in its crawls. But if there were a time to write it down in a standard I guess that time would be now.
Giles, C. L., Sun, Y., & Councill, I. G. (2010). Measuring the web crawler ethics. In WWW 2010. Retrieved from https://clgiles.ist.psu.edu/pubs/WWW2010-web-crawler-ethics.pdf
Russell, A. L. (2014). Open standards and the digital age. Cambridge University Press.
Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy, and denial of service. Journal of the American Society for Information Science and Technology. https://doi.org/10.1002/asi.20388