way, way back

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive‘s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API for deterimining when a given URL is available in a Web Archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think its core simplicity may be obscured from software developers who need to fully digest the spec in order to say they “do” Memento.

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>4425</numreturned>
    <type>urlquery</type>
    <enddate>20120503151837</enddate>
    <numresults>4425</numresults>
    <firstreturned>0</firstreturned>
    <url>nytimes.com/</url>
    <resultsrequested>40000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>68043717</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>IA-001766.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961112181513</capturedate>
    </result>
    <result>
      <compressedoffset>8107</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BK-000007.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961121230155</capturedate>
    </result>
    ...
  </results>
</wayback>

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19910806145620</startdate>
    <numreturned>201</numreturned>
    <type>urlquery</type>
    <enddate>20120503152750</enddate>
    <numresults>201</numresults>
    <firstreturned>0</firstreturned>
    <url>bbc.co.uk/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>75367408</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-196764-0.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080410162445</capturedate>
    </result>
    <result>
      <compressedoffset>92484146</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-7307314-46.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080527231330</capturedate>
    </result>
    ...
  </results>
</wayback>

It turns out British Library are using this structured data to access data from Hadoop where their web archives live on HDFS as WARC files–which is pretty slick. Actually WARCs on spinning disk is pretty awesome by itself, no matter how you are doing it.

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

  1. Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
  2. How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
  3. Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heretrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)

Creative Commons License
way, way back by Ed Summers, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 4.0 International License.

8 thoughts on “way, way back

  1. for question (2) i searched wayback configuration files with no success, and i asked developer list too a while ago (no response).
    we’ve a private wayback archive in Italy at Biblioteca Centrale di Firenze, we should be happy to learn how to enable this api and eventually contribute to memento

    http://sourceforge.net/mailarchive/forum.php?forum_name=archive-access-discuss&max_rows=25&style=ultimate&viewmonth=201202

    ciao
    /raffaele (@atomotic)

  2. The Archive is really focused on collecting the data, and preserving it. Much less focused on the dissemination of that information! They do make the various datasets available to folks doing research, but they aren’t interested in collecting data just to make it easier for the rest of the world to mine it for any old purpose.

  3. @atomotic thanks for the info regarding the setup of that XML response. It’s surprising that there’s no information about it. Perhaps I’ll ask someone at the British Library if they found a knob to turn, or did it custom themselves.

    @eric I’m not really sure I agree with your distinguishing between preservation and dissemination. It seems like a pretty rudimentary task to be able to say if a given URL is in the archive, and they currently do this. Are you working for the Internet Archive and are able to speak on their behalf?

  4. Hi!

    The behavior of a Wayback installation is controlled via the (Spring 2.5 XML) configuration that sets up its ‘access points’ and ‘renderers’ from among the available choices. If you start from the included example ‘wayback.xml’…

    https://github.com/internetarchive/wayback-machine/blob/master/wayback/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml#L183

    …that “org.archive.wayback.query.Renderer” should offer the sort of XML results described above. That is, unless you’ve done something else to remove/change the expected/default XML-related JSPs, as mentioned at…

    http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Query_UI

    …and reviewable in the source as the last two files in the subdirectory…

    https://github.com/internetarchive/wayback-machine/blob/master/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/

    So if you’re working from a recent ‘wayback’ and haven’t intentionally (or inadvertently) disabled this, or perhaps filtered out access (by URL patterns?) in some intervening layer, this style of access should be available from your Wayback install.

    – Gordon (oftentimes-contributor to web archive.org)

    1. thank you Gordon,
      but is still not so clear, could you point me at a working examples?

      after add/modify the bean org.archive.wayback.query.Renderer, where i define the route of this new renderer? web.xml ?
      sorry, but java is an obscure beast to me.
      ciao

      /raffaele

  5. Although we’ve been looking at serving from HDFS for a while, we’re only just now moving this into production (http://britishlibrary.typepad.co.uk/webarchive/2012/11/upgrading-the-wayback-machine.html).

    Please note that this will also change the way we expose the XML API, making things more consistent with other Wayback deployments. Specifically, the API calls will look like this once we are live:

    http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.bl.uk/

Leave a Reply