Based on Calder Sculpture 2/4 by Bolumena

One of the topics we’ve been discussing while thinking about the design of the WACZ format and how it is used in tools like ArchiveWeb.page and ReplayWeb.page is the issue of trust. This post isn’t about all the angles on this issue, because it’s a huge topic that will be addressed more fully as the work progresses. It’s just a quick experiment to illustrate how distributed web archives could be (will be? are?) used as a tool in disinformation and deep(ish) fakes, and to encourage some creative thinking on how to mitigate some of these potential harms, or design with them in mind.

ArchiveWeb.page makes it easy to record a web page (or pages), export as a WACZ, and embed that WACZ anywhere on the web using the ReplayWeb.page web component (just a bit of HTML, JavaScript and the WACZ file). This is amazing when you consider the amount of infrastructure that is needed to do this with a classical web archive setup (e.g. Heritrix + Wayback Machine).

But because the traditional web archive stack is so expensive to run and keep online it is typically paid for by organizations that naturally have some level of trust associated with them (Internet Archive, Library of Congress, British Library, Bibliothèque nationale de France, etc).

When web archives are easy to create and distribute as a media type on the web outside of traditional “web archives” (e.g on wikis, in cms systems, social media, etc) they theoretically become what Bruno Latour calls an immutable mobile: a snapshot of what some region of the web looked like at a particular points in time. Latour details lots of characteristics of immutable mobiles, but his basic definition is that they are:

A general term that refers to all the types of transformations through which an entity becomes materialized into a sign, an archive, a docu­ment, a piece of paper, a trace. Usually but not always inscriptions are two­ dimensional,superimposable, and combinable. They are always mobile, that is they allow for new translations and articulations while keeping some types of relations intact.(Latour, 1999)

Maybe this is a bit circular saying that WACZ files are immutable mobiles because they are archives, but I think the idea is useful because it gets us thinking about how WACZs can move around while retaining particular intrinsic properties.

But the question is, how immutable are these WACZ archives? One approach that is under discussion for verifying the authenticity of a WACZ is to generate a content hash of each HTTP response, record it in the WARC file, and then create a content hash for all the WARC files, and sign it. This way you can know as you are replaying content from an archive that it hasn’t been interfered with, as long as you trust the WACZ creator. This information would then be surfaced in the player in some fashion, perhaps not unlike how you trust a website by examining the lock to the left of the URL location in your browser.

Trust in the creator of a WACZ (not only the publisher) is an essential component, because with a little bit of effort it’s possible to create a web archive that looks authentic, but was actually recorded in such a way that has transformed the original content. For example:

You can also see this WACZ embedded in a larger page here which allows some additional the UI elements to appear. The WACZ file being viewed here demonstrates a potential attack vector for distributed web archives, where archived content has been altered as it was being recorded.

In this experiment a man in the middle proxy (mitmproxy) was set up on the recording users’s computer, and their browser has been instructed to use the proxy and trust the certificate it generated. mitmproxy was given a small script to rewrite the content of the page:

# rewrite.py

def response(flow):
    flow.response.text = flow.response.text.replace(
        'Members of extremist Oath Keepers group planned attack', 
        'Members of extremist Oath Keepers group planned to prevent the attack'
    )

Which is used by simply telling mitmproxy to use it when starting up:

$ mitmproxy -s rewrite.py

Is there anything that can be done at the interface level to surface any information in the underlying data (e.g HTTP trace data) that could provide a clue that some breach of trust has occurred? Or perhaps the only viable way forward is for the ReplayWeb.page interface to negotiate trust between the viewer and the creator of the archive: to allow the viewer to see who created the archive, and potentially who else trusts the creator?

Trust on the web is a bit of a fraught problem space, and the techniques for signing WACZs are by no means finalized, but if you have thoughts or ideas about a WACZ viwere could speak to issues of trust please send them via the WACZ issue tracker or to info [at] webrecorder.net.

References

Latour, B. (1999). Pandora’s hope: Essays on the reality of science studies. In. Harvard University Press.