The Internet Archive does some amazing work in the Sisyphean task of archiving the web. Of course the web is just too big and changes too often for them to archive it all. But the Internet Archive’s crawling of the web, its serving of that content from the Wayback Machine, and its collaboration with librarians and archivists around the world make it a true public service if there ever was one.

Recently they announced that they are making (or thinking of making) a significant change to the way they archive the web:

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

The robots.txt file was developed to establish a conversation between web publishers and the crawlers, a.k.a. bots, that come to gather and index content. It allows web publishers to give automated agents from companies like Google guidance about what parts of a site to index, and to point to a sitemap that lets them do their job more efficiently. It also allows the publisher to ask a crawler to slow down, using the Crawl-delay directive, if their infrastructure doesn’t support rapid crawling.
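For illustration, here is what such a file might look like. The domain and paths are hypothetical, but the directives are the standard ones just described:

```
# robots.txt for example.org (hypothetical)
User-agent: *
Disallow: /drafts/        # keep unfinished work out of indexes and archives
Crawl-delay: 10           # ask crawlers to wait 10 seconds between requests
Sitemap: https://example.org/sitemap.xml
```

None of this is enforced by the server; it is a published request that well-behaved crawlers choose to honor, which is exactly why I describe it as a conversation.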

Up until now the Internet Archive has used robots.txt in two ways:

  • its ia_archiver web crawler consults a publisher’s robots.txt to determine what parts of a website to archive and how often (a minimal sketch of this kind of check follows the list)
  • the Wayback Machine (the view of the archive) consults the robots.txt to determine what to allow people to view from the archived content it has collected.
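To make the first of these concrete, here is a sketch of how a crawler might consult a publisher’s robots.txt before fetching a page. It uses Python’s standard urllib.robotparser module and a hypothetical URL; it illustrates the general mechanism, not the Internet Archive’s actual crawler code:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; a real crawler would do this for each host it visits.
robots_url = "https://example.org/robots.txt"
page_url = "https://example.org/reports/2017/annual.html"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetch and parse the publisher's robots.txt

# A well-behaved crawler identifies itself and honors the answer.
if parser.can_fetch("ia_archiver", page_url):
    delay = parser.crawl_delay("ia_archiver")  # None if no Crawl-delay is set
    print(f"OK to archive {page_url}; requested delay: {delay}")
else:
    print(f"Publisher asks that {page_url} not be crawled")
```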

If the Internet Archive’s blog post is read at face value, it seems like they are going to stop doing these things altogether, not just for government websites but for the entire web. While the conversation on Twitter makes it seem like this is a great idea whose time has come, I think it would be a step backwards for the web and for its preeminent archive, and I hope they will reconsider, or at least take this as an opportunity for a wider discussion.

I think it’s crucial to look at robots.txt as an imperfect, but much needed, part of a conversation between web publishers and archives of the web. The idea of a perfect archive that contains all the things is a noble goal, but it has always been a fantasy. Like all archives, the Internet Archive represents only a sliver of a sliver of the thing we call the web. They make all kinds of decisions about what to archive and when, decisions that are black-boxed and difficult to communicate. While some people view robots.txt as nothing better than a suicide note that poorly optimized websites rely on, it is really just a small toehold of transparency about what gets archived from the web.

If a website really wants to block the Internet Archive it can still do so by limiting access by IP address or by refusing requests from any client identifying itself as ia_archiver. If the Internet Archive starts to ignore robots.txt, it pushes decisions about who and what to archive down into the unseen parts of web infrastructures. It introduces more uncertainty and reduces transparency. It starts an arms race between the archive and the sites that do not want their content archived. It treats the web as one big public information space, and ignores the more complicated reality that there is a continuum between public and private. The idea that the Internet Archive is simply a public good also obscures the fact that ia_archiver is run by a subsidiary of Amazon, which sells the data and makes it available to the Internet Archive through a special arrangement. This is a complicated situation, not a simple technical fix.
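As a rough sketch of what that kind of blocking looks like, a site operator could refuse requests from a particular crawler at the application layer. The snippet below is a hypothetical WSGI middleware that returns 403 to any client whose User-Agent header contains ia_archiver; a real deployment would more likely do this in the web server or firewall configuration, but the effect is the same:

```python
def block_ia_archiver(app):
    """WSGI middleware that refuses requests from the ia_archiver user agent."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if "ia_archiver" in user_agent:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Archiving of this site is not permitted.\n"]
        return app(environ, start_response)
    return middleware
```

Unlike a robots.txt entry, a block like this leaves no public record of the publisher’s intent, which is exactly the loss of transparency I’m worried about.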

The work and craft of archives is one that respects the rights of content creators and involves them in the process of preservation. Saving things for the long term is an important task that shapes what we know of the past and by extension our culture, history and future. While this process has historically privileged the powerful in society, the web has lowered the barrier to publishing information, and offers us a real opportunity to transform whose voices are present in the archive. While it makes sense to hold our government to a particular standard, not all web publishers are so powerful. It is important that the Internet Archive not abandon the idea of a contract between web publishers and the archive.

We don’t know what the fate of the Internet Archive will be. Perhaps some day it will decide to sell its trove of content to a private company and close its doors. That’s why it’s important that we not throw the rights of content creators under the bus, and that we hold the Internet Archive accountable as well. We need web archives that are partners with web publishers. We need web publishers who take their own archives seriously. We need more nuance, understanding and craft in the way we talk about and enact archiving the web. I think archivists and Archive-It subscribers need to step up and talk more about this. Props to the Internet Archive for starting the conversation.