way, way back

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive’s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API for deterimining when a given URL is available in a Web Archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think its core simplicity may be obscured from software developers who need to fully digest the spec in order to say they “do” Memento.

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

< ?xml version="1.0" encoding="UTF-8"?>

  
    19960101000000
    4425
    urlquery
    20120503151837
    4425
    0
    nytimes.com/
    40000
    resultstypecapture
  
  
    
      68043717
      text/html
      IA-001766.arc.gz
      -
      nytimes.com/
      GY3
      200
      http://www.nytimes.com:80/
      19961112181513
    
    
      8107
      text/html
      BK-000007.arc.gz
      -
      nytimes.com/
      GY3
      200
      http://www.nytimes.com:80/
      19961121230155
    
    ...

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

< ?xml version="1.0" encoding="UTF-8"?>

  
    19910806145620
    201
    urlquery
    20120503152750
    201
    0
    bbc.co.uk/
    10000
    resultstypecapture
  
  
    
      75367408
      text/html
      BL-196764-0.warc.gz
      -
      bbc.co.uk/
      sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3
      200
      http://www.bbc.co.uk/
      20080410162445
    
    
      92484146
      text/html
      BL-7307314-46.warc.gz
      -
      bbc.co.uk/
      sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f
      200
      http://www.bbc.co.uk/
      20080527231330
    
    ...

It turns out British Library are using this structured data to access data from Hadoop where their web archives live on HDFS as WARC files–which is pretty slick. Actually WARCs on spinning disk is pretty awesome by itself, no matter how you are doing it.

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heretrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)