In case you missed it, an interesting study by Jonathan Zittrain and Kendra Albert was written up in the New York Times with the provocative title In Supreme Court Opinions, Web Links to Nowhere. In addition to the article, the study itself is worth reading for its compact review of the study of link rot on the Web, and its stunning finding that 49% of the links in US Supreme Court Opinions are broken.
This 49% is in contrast with a similar, recent study by Raizel Liebler and June Liebert of the same links, which found a much lower rate of 29%. The primary reason for this discrepancy was that Zittrain and Albert looked at reference rot in addition to link rot.
The term reference rot was coined by Rob Sanderson, Mark Phillips and Herbert Van de Sompel in their paper Analyzing the Persistence of Referenced Web Resources with Memento. The distinction is subtle but important. Link rot typically refers to when a URL returns an HTTP error of some kind that prevents a browser from rendering the referenced content. This error can be the result of the page disappearing, or the webserver being offline. Reference rot refers to when the URL itself seems to work (returning either a 200 OK or redirect of some kind), but the content that comes back is no longer the content that was being referenced.
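To make the distinction concrete, here is a minimal sketch of how a link checker might tell the two apart. It assumes something the studies above may not have had: a fingerprint of the page taken at the moment of citation. The names `fingerprint` and `classify` are mine, purely for illustration, and an exact hash comparison is a crude stand-in for "is this still the content that was referenced", since any cosmetic change to the page would trip it.

```python
import hashlib
from typing import Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of a page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def classify(status: Optional[int], body: bytes, cited_fingerprint: str) -> str:
    """Classify a cited URL as 'ok', 'link rot' or 'reference rot'.

    status is the final HTTP status after following any redirects, or
    None if no response came back at all (e.g. the webserver is offline).
    cited_fingerprint is assumed to have been recorded at citation time.
    """
    # No response, or an HTTP error: the browser can't render the content.
    if status is None or status >= 400:
        return "link rot"
    # The URL "works", but what comes back is not what was cited.
    if fingerprint(body) != cited_fingerprint:
        return "reference rot"
    return "ok"
```

A checker that only looks at status codes would stop after the first test, which is roughly why a study that counts only link rot reports a lower number.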
The New York Times article includes a great example of reference rot: the website http://ssnat.com/, which was referenced in a Supreme Court opinion by Justice Alito.
The domain registration expired, and was picked up by someone who knew its significance and turned it into an opportunity to educate people about links in legal documents. The NY Times article calls this nameless person a “prankster”, but it is a wonderful hack.
One thing the NY Times article didn’t mention is that the website has been captured 140 times by the Internet Archive, and the original as referenced by Justice Alito is still available. It seemed like a missed opportunity to highlight the incredibly important work that Brewster Kahle and his merry band of Web archivists are doing. It would be interesting to see how many of the 555 extracted links are available in the Internet Archive, but I couldn’t find the list in the article or linked from it.
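If the list of links did turn up, checking it against the Internet Archive could be scripted with the Wayback Machine availability API. What follows is a rough sketch, not a tested harvester: the function names are my own, and the JSON shape (`archived_snapshots` containing a `closest` entry with `available` and `url`) is how I understand the API to respond, so treat it as an assumption.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(cited_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query for a cited URL.

    timestamp is an optional YYYYMMDD string (e.g. the citation date);
    the API is asked for the capture closest to it.
    """
    params = {"url": cited_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of a decoded response, if any.

    The response shape used here is an assumption (see the text above).
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(cited_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Ask the Internet Archive for a capture of cited_url (network call)."""
    query = availability_url(cited_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("http://ssnat.com/")` should return a Wayback URL if a capture exists, or `None` otherwise. Passing the date of the opinion as `timestamp` would get closer to what the author actually saw, although a capture existing is no guarantee that it matches the cited content.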
Zittrain and Albert on the other hand do mention the Internet Archive’s work in the context of perma.cc which is their proposed solution to the problem of broken links.
… the Internet Archive is dedicated to comprehensively archiving web content, and thus only passes through a given corner of the Internet occasionally, meaning there is no guarantee that a given page or set of content would be archived to reflect what an author or editor saw at the moment of citation. Moreover, the IA is only one organization, and there are long-term concerns around placing all of the Internet archiving eggs into one basket. A system of distributed, redundant storage and ownership might be a better long-term solution.
This seems like a legitimate concern: there should be some way to archive a website at a particular point in time. There are 27 founding members of perma.cc. There is a strong legal flavor to some of the participants, but perma.cc doesn’t appear to be only for legal authors. The website states:
perma.cc helps authors and journals create permanent archived citations in their published work. perma.cc will be free and open to all soon.
It’s good to see the Internet Archive as one of the founding members. It remains to be seen what perma.cc’s approach to distributed, redundant storage will be. For the system to actually be distributed there has to be more to it than listing 27 organizations that agree that it’s a good idea. And the Internet Archive doesn’t exactly operate on its own: it works closely with the International Internet Preservation Consortium, which has 44 organizational members, many of them national libraries. I didn’t see the IIPC on the list of founding members for perma.cc.
If perma.cc were to take off, I wonder what it would mean for web analytics. If lots of authors and journals start putting perma.cc URLs in their publications, what happens to the analytics of the sites being referenced? Would those publishers be able to see how often their content is being viewed on perma.cc, along with a rough approximation of who the readers are, what browsers they are using, etc.?
Nit-picking aside, it’s awesome to see another player in the Web archiving space, especially from Web veterans who understand how it works, and its significance for society.
Update: Leigh Dodds has an excellent post about perma.cc’s terms of service.