Web Archives in Repositories

I’m fortunate to be back at code4lib again this year. It gives me hope to see this conference working in the same spirit it started out with, albeit with much better honed mechanics. It is also refreshing to be talking about someone else’s work, in this case the work that the Webrecorder project has been doing to shift archival practices for web content.

Rather than archived web content being placed into repositories, web archives are typically repository infrastructures in and of themselves. I call this pattern Web Archive As Repository, as opposed to Web Archive In Repository. It has been hard to notice the difference before because the latter has been pretty much unthinkable.

The classic shape of web archives has been (perhaps necessarily) designed to be monolithic, and has evolved into a fairly expensive infrastructure that only a few institutions are able to sustain over time. Needless to say, maintaining digital content over time is the whole point of a digital repository. The higher the maintenance costs, the fewer web archives there are. The fewer web archives there are, the more their collecting practices affect the shape of what types of content are collected.

Classic Web Archive Architecture (Monolithic)

One of Webrecorder’s primary contributions over the past few years has been the creation of a data format known as Web Archive Collection Zipped, or WACZ, which allows this monolithic architecture to be teased apart by separating the process of creating an archive from the process of publishing it. I think this sets the stage for a more diverse set of actors to perform each of these steps, and also raises some important questions about what social and political impacts this could have.

Disaggregated Web Archives Architecture (WACZ)
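
To make the separation of creation and publishing a little more concrete: a WACZ is an ordinary ZIP file, so any tool that can read ZIP can look inside one. The following is a minimal sketch, not Webrecorder's own tooling. It assumes the layout described in the WACZ specification (a `datapackage.json` manifest at the root, WARC data under `archive/`, indexes under `indexes/`, and a page list at `pages/pages.jsonl`). The function names `summarize_wacz` and `build_toy_wacz` are hypothetical, and the toy file holds empty placeholders rather than real captured content.

```python
import io
import json
import zipfile


def summarize_wacz(source):
    """Group the members of a WACZ (a ZIP file) by top-level directory.

    `source` can be a path or a file-like object. Files at the root of
    the ZIP are grouped under the key "(root)".
    """
    summary = {}
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            key = top if sep else "(root)"
            summary.setdefault(key, []).append(name)
    return summary


def build_toy_wacz():
    """Build an in-memory ZIP that mimics the assumed WACZ layout.

    The members are empty placeholders, not a valid web archive.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
        zf.writestr("archive/data.warc.gz", b"")
        zf.writestr("indexes/index.cdx", "")
        zf.writestr("pages/pages.jsonl", "")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for group, names in summarize_wacz(build_toy_wacz()).items():
        print(group, names)
```

Because everything needed for replay travels inside one file, a WACZ can be created in one place and then deposited, copied, or published somewhere else like any other file, which is what makes Web Archive In Repository thinkable.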

I recorded a practice version of the talk to make sure I could squeeze all the content into the 15-minute slot. Here it is if you are interested: