way, way back

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive‘s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API for deterimining when a given URL is available in a Web Archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think its core simplicity may be obscured from software developers who need to fully digest the spec in order to say they “do” Memento.

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>4425</numreturned>
    <type>urlquery</type>
    <enddate>20120503151837</enddate>
    <numresults>4425</numresults>
    <firstreturned>0</firstreturned>
    <url>nytimes.com/</url>
    <resultsrequested>40000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>68043717</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>IA-001766.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961112181513</capturedate>
    </result>
    <result>
      <compressedoffset>8107</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BK-000007.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961121230155</capturedate>
    </result>
    ...
  </results>
</wayback>

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19910806145620</startdate>
    <numreturned>201</numreturned>
    <type>urlquery</type>
    <enddate>20120503152750</enddate>
    <numresults>201</numresults>
    <firstreturned>0</firstreturned>
    <url>bbc.co.uk/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>75367408</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-196764-0.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080410162445</capturedate>
    </result>
    <result>
      <compressedoffset>92484146</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-7307314-46.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080527231330</capturedate>
    </result>
    ...
  </results>
</wayback>

It turns out British Library are using this structured data to access data from Hadoop where their web archives live on HDFS as WARC files–which is pretty slick. Actually WARCs on spinning disk is pretty awesome by itself, no matter how you are doing it.

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

  1. Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
  2. How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
  3. Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heretrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)

xhtml, wayback

The Internet Archive gave the Wayback Machine a facelift back in January. It actually looks really nice, but I noticed something kinda odd. I was looking for old archived versions of the lcsh.info site. Things work fine for the latest archived copies:

But during part of lcsh.info’s brief lifetime the site was serving up XHTML with the application/xhtml+xml media type. Now Wayback rightly (I think) remembers the media type, and serves it up that way:

ed@curry:~$ curl -I http://replay.waybackmachine.org/20081216020433/http://lcsh.info/
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-Connection: close
X-Archive-Orig-Content-Length: 6497
X-Archive-Orig-Content-Type: application/xhtml+xml; charset=UTF-8
X-Archive-Orig-Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.4.6 PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_wsgi/1.3 Python/2.5.2
X-Archive-Orig-Date: Tue, 16 Dec 2008 02:04:31 GMT
Content-Type: application/xhtml+xml;charset=utf-8
X-Varnish: 1458812435 1458503935
Via: 1.1 varnish
Date: Wed, 09 Mar 2011 23:09:47 GMT
X-Varnish: 903390921
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS

But to add navigation controls and branding, Wayback also splices in its own HTML into the display, which unfortunately is not valid XML. And since the media type and doctype trigger standards mode in browsers, the pages render in Firefox like this:

And in Chrome like this:

Now I don’t quite know what the solution should be here. Perhaps the HTML that is spliced in should be valid XML. Or maybe Wayback should just serve up the HTML as text/html. Or maybe this is a good use case for frames (gasp). But I imagine it will similarly afflict any other XHTML that was served up as application/xhtml+xml when Heretrix crawled it.

Sigh. I sure am glad that HTML5 is arriving on the scene and XHTML is riding off into the sunset. Although it’s kind of the Long Goodbye given Internet Archive has archived it.

Update: just a couple hours later I got an email that a fix for this was deployed. And sure enough now it works. I quickly eyeballed the response and didn’t see what the change was. Thanks very much Internet Archive!