For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive’s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.
Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API for deterimining when a given URL is available in a Web Archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think its core simplicity may be obscured from software developers who need to fully digest the spec in order to say they “do” Memento.
Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:
which returns a chunk of XML like:
< ?xml version="1.0" encoding="UTF-8"?>
19960101000000 4425 urlquery 20120503151837 4425 0 nytimes.com/ 40000 resultstypecapture 68043717 text/html IA-001766.arc.gz - nytimes.com/ GY3 200 http://www.nytimes.com:80/ 19961112181513 ... 8107 text/html BK-000007.arc.gz - nytimes.com/ GY3 200 http://www.nytimes.com:80/ 19961121230155
Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:
Where you will see:
< ?xml version="1.0" encoding="UTF-8"?>
19910806145620 201 urlquery 20120503152750 201 0 bbc.co.uk/ 10000 resultstypecapture 75367408 text/html BL-196764-0.warc.gz - bbc.co.uk/ sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3 200 http://www.bbc.co.uk/ 20080410162445 ... 92484146 text/html BL-7307314-46.warc.gz - bbc.co.uk/ sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f 200 http://www.bbc.co.uk/ 20080527231330
It turns out British Library are using this structured data to access data from Hadoop where their web archives live on HDFS as WARC files–which is pretty slick. Actually WARCs on spinning disk is pretty awesome by itself, no matter how you are doing it.
Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:
- Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
- How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
- Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heretrix and Wayback could also share what content they have collected in the same community.
This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?
Ok, enough for now. I have a bottle to heat up :-)