Design for the Rubble

If you are interested in issues of preservation and technology, you might get a kick out of this chat between inventor Danny Hillis and musician/artist Brian Eno. It’s the latest seminar in long-term thinking from the Long Now Foundation. The backdrop to the conversation is their work over the last 10 years designing the 10,000 year clock. It’s a subtle and (appropriately) long talk that was a real pleasure to listen to over the course of three evening dog walks.

A theme they explored was how difficult (and rewarding) it is to think over really long time frames. The further they tried to look into the future, the more they found themselves looking into the past, and getting in touch with the present. They discussed the therapeutic value of recognizing the small parts we play in the sweep of history, the need to surrender to it, to embrace complexity, and to work with the muddle of everything, rather than trying to bend it to our will and our plans.

Plus there are lots of humorous moments along the way about Microsoft Excel’s Y10K bug, white on white paintings, the singularity that already happened, and more.

Paris Review Interviews and Wikipedia

I was recently reading an amusing piece by David Dobbs about William Faulkner being a tough interview. Dobbs has been working through the Paris Review archive of interviews which are available on the Web. The list of authors is really astonishing, and the interviews are great examples of longform writing on the Web.

The 1965 interview with William S. Burroughs really blew me away. So much so that I got to wondering how many Wikipedia articles reference these interviews.

A few years ago, I experimented with a site called Linkypedia for visualizing how a particular website is referenced on Wikipedia. It’s actually pretty easy to write a script to see what Wikipedia articles point at a Website, and I’ve done it enough times that it was convenient to wrap it up in a little Python module.

from wplinks import extlinks
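The wplinks module itself isn’t shown here, but something like it is easy to sketch against the MediaWiki API’s `exturlusage` list, which returns articles that link out to a given domain. This is my guess at an implementation, not the actual `wplinks` code:

```python
# A sketch of what a module like wplinks might do: ask the MediaWiki API
# (list=exturlusage) which Wikipedia articles link out to a given domain.
# This is an illustration, not the actual wplinks implementation.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def extlinks_query(domain, limit=50, offset=None):
    """Build an API query URL for articles with external links to domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    if offset is not None:
        params["euoffset"] = offset
    return API + "?" + urlencode(params)

def extlinks(domain):
    """Yield (article_title, external_url) pairs; hits the network."""
    offset = None
    while True:
        with urlopen(extlinks_query(domain, offset=offset)) as resp:
            data = json.load(resp)
        for hit in data.get("query", {}).get("exturlusage", []):
            yield hit["title"], hit["url"]
        # continuation: the API hands back an offset when there are more results
        offset = data.get("continue", {}).get("euoffset")
        if offset is None:
            break
```

So counting the Wikipedia articles that cite the Paris Review interviews is roughly a one-liner: `len(set(title for title, url in extlinks("theparisreview.org")))`.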

Ignoring Our Visual Heritage

I recently ran across Digital Storage: Losing Our Virtual Heritage over on the (surprisingly interesting) SAA Archives & Archivists discussion list. Strangely, the editorial struck me as both emblematic of a problem in the archival community, and a guidepost for how archives need to move forward.

The key point Bromberg makes is that archives will no longer be able to function if our collections (specifically photographic collections) become digital:

Once I do basic work to care for my collections, I can put them on the shelf and pretty much not have to put any more money into their care. You cannot keep a digital file, however, without continually having to put money into it for the constant migration to new forms.

You have to buy new software and equipment, and pay for the labour to migrate them to be able to continue to get access to your images. Right now, I can just walk to the shelf and open a box to get access to my photographs.

This high cost of caring for digital files means that archives and museums which hold much of the world’s recorded history will most likely not be able to afford to care for them. We already have small budgets to care for our materials and that is unlikely to change.

Fear of format obsolescence is real and justified. But as David Rosenthal has been pointing out for a while, the shared information space of the Web, and its open-source viewers (browsers), have mitigated some of these concerns. We have yet to see evidence that prospective format migration actually helps preserve content. But our continued obsession with format migration, and with describing resources so they can be migrated, is making the task of archiving digital content (like photographs) cost prohibitive, especially for smaller archives. Do we really think that billions of JPEGs are going to become unreadable overnight?

Bromberg’s piece contains a useful example:

I can get the Smith family photographs that Grandmother Smith put into a shoebox 50 years ago and forgot about it until her family cleaned out the house. I have packed up photo collections from families, businesses and organisations that contained images well over 100 years old that are perfectly fine. But if Grandmother Smith sticks some photo disks in her shoebox, by the time an archive gets them, they will be long gone.

Is it really useful for us to put our collective head in the sand and say that digital photography is going away? Or would we be better off helping photographers take care of their digital collections, so that when it comes time to donate them, they have a digital equivalent of a box of photographs to hand over? I’m reminded of the rich literature about the post-custodial archive where there is an emphasis on helping content owners manage their content, which in turn makes it easier to eventually transfer to an archive if desired. Personal Digital Archiving Day and DPLA’s Community program are good examples of this sort of effort.

I’m not suggesting that there isn’t work to do on this front. Bromberg is right: in our quest for the holy grail of digital preservation, our content management systems have raised the bar way too high for everyday people and small libraries and archives to continue to do for digital content what they have done for physical content, like photographs. To succeed, I think archives and libraries need simple solutions that let them easily collect digital content, manage it, and let it feed into larger collections like DPLA.

By simple solutions I mean mostly a process that content owners and archivists can keep in their heads, one that involves very little software and mostly represents an investment in digital storage and backup systems, in the same way that they have invested in physical space, containers, etc. I suspect many individuals and small archives already have storage solutions in operation for their business data, so this won’t be as big a leap as they imagine. But as long as we keep promulgating things like Fedora, DSpace, etc. as prerequisites for doing real digital preservation, Bromberg will be right.

To put it another way, we need a digital equivalent to the More Product, Less Process manifesto. This is the spirit that BagIt was created in at the Library of Congress. We needed to start processing an influx of digital content from NDIIPP partners, and we didn’t have the time, resources, or collective will to describe everything with METS, PREMIS and MODS and put it into Fedora or iRODS, or whatever.

You can think of a Bag as a digital analog for a physical container. It’s just a directory with files in it, that includes a manifest, and some (optional) high level, human readable metadata. Certainly PREMIS, METS, etc. can be layered on top of this, and we’ve done just that at LC with some of our internal systems … but BagIt helps with the absolute basics of bundling up data so that it can be moved through space and time.
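Concretely, a minimal bag is just a `bagit.txt` declaration, a `data/` directory of payload files, and a checksum manifest. Here is a sketch of writing one by hand with nothing but the standard library (real work should use a maintained BagIt library; the file layout below follows the spec’s conventions):

```python
# A minimal BagIt bag written by hand: a bagit.txt declaration, a data/
# payload directory, and a manifest-md5.txt of checksums. Just a sketch
# to show how little structure a bag actually requires.
import hashlib
import os

def make_bag(bag_dir, files):
    """Create a minimal bag in bag_dir; files maps names to byte content."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    # the bag declaration
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")

    # the payload, and one manifest line per payload file
    manifest_lines = []
    for name, content in files.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(manifest_lines) + "\n")
```

That’s the whole trick: anyone who can copy a directory can move a bag around, and anyone with a checksum tool can verify it arrived intact.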

So I’m not suggesting that people should start using BagIt to manage their digital photographs. I actually think BagIt could be simplified even more (look for more on that later). My point is rather that we need super simple solutions, like BagIt, that involve very little working software, and that everyday people and archives can use. We need to educate, and move forward. A new army of archivist computer scientists isn’t going to solve this problem. In a lot of ways, the computer savvy (but not Luddite) archivists we have are perfect for the job of educating artists, business people, and Grandma Smith since the solutions we give them need to fit inside their heads too. We are doing a disservice to them if we just say we can’t handle the new medium, and pine for the good old days. Or as my colleague Bill Lefurgy said in response to this post:

… archivists at smaller institutions also need to push beyond digital fear and build capacity, even if slowly

natus digital

A few days ago I asked folks on Twitter if they had any born-digital archival collections that were Internet accessible (web, ftp, etc). I’m testing a little prototype application called fondz (which I will hopefully write more about later if it proves to be headed in a useful direction) and wanted a collection I could actually talk about in a blog post (as opposed to the content I’m testing with and cannot). I specifically wanted born-digital archival content with well defined provenance, because fondz assumes the content isn’t just a random assortment of things, but forms a thematic unit of some kind: e.g. content that is donated to an archive as part of a personal or organizational collection of some kind.

Call me crazy, but if you squint right, the Web looks like it’s full of born-digital archives: they are called websites! But fondz is oriented around files that have been collected offline: collections of word processing or image files that may have accumulated on your hard drive, perhaps got backed up to a disk of some kind, and then were ultimately gifted to an archive. So, like a lot of archival content, they also have access rights associated with them.

I received several helpful responses, and thought I would jot them down here in case you go looking for born-digital collections too. If you have one you would like me to add, just tweet it at me and I’ll add it.

  • Alberto Accomazzi suggested arXiv, which contains lots of scientific material in PDF and LaTeX. Depending on the scope you could get content from a particular author, organization or discipline. If you squint right, I guess any digital repository that has a strong sense of an author identity and/or the subject of the content could work as a source of born-digital archival content. There are lots of so-called “institutional repositories” on the Web. But pre-print repositories are particularly interesting because they often represent work in progress, not the finished, polished thing that people often think of as “published”. Pre-prints are more like the documents you have lying around on your computer, that you happen to have pushed out to let people know what you are working on, and to share research that is still underway.
  • Mark Matienzo pointed me at the Richard Rorty born digital files at the University of California at Irvine. In order to download the files you need to apply for an account in their UCISpace application. I filled out the form, and was pleasantly surprised when I received an email the next day granting me access. Way to go, UCI! Are there many other examples of this sort of Web enabled interaction with researchers? Or have I been asleep for a while and this is the new normal for archives on the Web? I particularly enjoyed Mark’s suggestion because I’m a big fan of Rorty’s work, so it will be fun to look at the content. Subsequently Mark followed up on Twitter to let me know that UCISpace has other born digital collections available. Aaron Brenner also pointed me at some slides about UCI’s virtual reading room.
  • Erin O’Meara pointed me to Jill Sexton and Meg Tuomala at the University of North Carolina at Chapel Hill, who have some born-digital collections. Erin then pointed me to the Carolina Digital Repository, which seems to have a fair bit of born digital material: for example this Word document from a folder named Dad’s laptop in the John Chapman collection. This content is available without having to log in, which is nice. I haven’t poked around to see how much more is available yet.
  • Mark Jordan referred me to Nick Ruest of York University. Nick has been kind enough to bag up some content from the Allan Fleming collection and put it on the Web for me to download. I noticed that the system at York provided a way to login, so maybe someday they could offer a similar service to UC Irvine.
  • Trevor Owens suggested I get in touch with the Maryland Institute for Technology in the Humanities to see if they might have some of their born digital content online. MITH is local for me, so I could conceivably head over there with a thumb drive.

If you have other ideas I’d still be interested to hear about them, and will add them here if you comment here, or tweet them at me. I’m especially interested in collections that fit the UC Irvine model of making born digital collections available on the Web via a researcher request step, or where they are simply publicly available. It’s great to see traditional archives moving onto the Web this way to make born digital collections available.

As an aside it’s interesting to me how the category of content we call born-digital is beginning to be coterminous with Web content at a particular point in time – especially as more and more of our born digital content lives on the Web in cloud services of some ilk.

public email archives

I noticed on the Archives and Archivists discussion list today that the Library of Virginia has made 66,422 of the approximately 1.3 million emails (167 GB) of Governor Tim Kaine’s 2006-2010 administration available on the Web. Even though this is only 5% of the entire collection, it still represents a significant step forward for open access to government information.

Thankfully, the Library of Virginia took the extra step of describing how they went about processing the email collection. Along the way they tried a series of proprietary tools (some of which crashed regularly) to convert PST to PDF and CSV, which they then ingested into their digital asset management system, DigiTool from Ex Libris (which apparently couldn’t load more than 3,000 emails without keeling over).

What’s simply astounding is that archivists looked at every email to determine whether it contained restricted material. The result of this sifting was that only 5% of the emails were made available.

I was drawn to the announcement initially because I wondered if they would simply make mailbox data available on the Web, similar to the Enron email dataset. But quickly I noticed that while metadata about the email was readable, I wasn’t able to read the contents of the messages: the so-called email body. Instead I was presented with a PDF icon with a lock over it. At first I suspected the content was only available on site at the Library of Virginia, but after some more reading elsewhere I discovered that I needed to log in to see the PDFs.

I was surprised to find that the username and password were simply listed on the login page – you don’t get your own login, everyone uses the same one. This login form is accompanied by the following text:

While great care has been taken during the processing of this collection to locate, identify, and restrict access to privacy protected information within this collection, some relevant materials may have been missed.
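Had the messages been released as raw mailbox data, as with the Enron corpus, access would be trivial to script. A sketch using Python’s standard library (the sample here is the open mbox format rather than the Outlook PST files the Library of Virginia actually received):

```python
# A sketch of why open mailbox formats are attractive for access: Python's
# standard library reads mbox data directly, no proprietary tools needed.
import mailbox

def summarize(mbox_path):
    """Yield (date, sender, subject) for each message in an mbox file."""
    for msg in mailbox.mbox(mbox_path):
        yield msg.get("Date"), msg.get("From"), msg.get("Subject")
```

From there, filtering for restricted material, building indexes, or generating finding aids is ordinary data wrangling rather than a fight with a digital asset management system.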

The Web as a Preservation Medium

This is the text of a talk I gave at the (wonderful) National Digital Forum in Wellington, New Zealand on November 27th, 2013. You can also find my slides here, and the video here. If you do happen to watch the video, you’ll probably notice I spent more time thinking about the text than I did practicing my talk.

Hi there. Thanks for inviting me to NDF 2013, it is a real treat and honor to be here. I’d like to dedicate this talk to Aaron Swartz. Aaron cared deeply about the Web. In a heartbreaking way I think he may have cared more than he was able to. I’m not going to talk much about Aaron specifically, but his work and spirit underlie pretty much everything I’m going to talk about today. If there is one message that I would like you to get from my talk today it’s that we need to work together as professionals to care for the Web in the same way Aaron cared for it.

Next year it will be 25 years since Tim Berners-Lee wrote his proposal to build the World Wide Web. I’ve spent almost half of my life working with the technology of the Web. The Web has been good to me. I imagine it has been good to you as well. I highly doubt I would be standing here talking to you today if it wasn’t for the Web. Perhaps the National Digital Forum would not exist, if it was not for the Web. Sometimes I wonder if we need the Web to continue to survive as a species. It’s certainly hard for my kids to imagine a world without the Web. In a way it’s even hard for me to remember it. This is the way of media, slipping into the very fabric of experience. Today I’d like to talk to you about what it means to think about the Web as a preservation medium.

Medium and preservation are some pretty fuzzy, heavy words, and I’m not going to try to pin them down too much. We know from Marshall McLuhan that the medium is the message. I like this definition because it disorients more than it defines. McLuhan reminds us of how we are shaped by our media, just as we shape new forms of media. In her book Always Already New, Lisa Gitelman offers up a definition of media that gives us a bit more to chew on:

I define media as socially realized structures of communication, where structures include both technological forms and their associated protocols, and where communication is a cultural practice, a ritualized collocation of different people on the same mental map, sharing or engaged with popular ontologies of representation.

I like Gitelman’s definition because it emphasizes how important the social dimension is to our understanding of media. The affordances of media, how media are used by people to do things, and how media do things to us, are just as important as the technical qualities of media. In the spirit of Latour she casts media as a fully fledged actor, not as some innocent bystander or tool to be used by the real and only actors, namely people.

When Matthew Oliver wrote to invite me to speak here he said that in recent years NDF had focused on the museum, and that there was some revival of interest in libraries. The spread of the Web has unified the cultural heritage sector, showing how much libraries, archives and museums have in common, despite their use of subtly different words to describe what they do. I think preservation is a similar unifying concept. We all share an interest in keeping the stuff (paintings, sculptures, books, manuscripts, etc) around for another day, so that someone will be able to see it, use it, cite it, re-interpret it.

Unlike the traditional media we care for, the Web confounds us all equally. We’ve traditionally thought of preservation and access as different activities, that often were at odds with each other. Matthew Kirschenbaum dispels this notion:

… the preservation of digital objects is logically inseparable from the act of their creation – the lag between creation and preservation collapses completely, since a digital object may only ever be said to be preserved if it is accessible, and each individual access creates the object anew.

The .txtual Condition

Or, as my colleague David Brunton has said, in a McLuhan-esque way:

Digital preservation is access…in the future.

The underlying implication here is that if you are not providing meaningful access in the present to digital content, then you are not preserving it.

In light of these loose definitions I’m going to spend the rest of the time exploring what the Web means as a preservation medium by telling some stories. I’m hoping that they will help illuminate what preservation means in the context of the Web. By the end I hope to convince you of two things: the Web needs us to care for it, and more importantly, we need the Web to do our jobs effectively. For those of you who don’t need convincing about either of these points, I hope to give you a slightly different lens for looking at preservation and the Web. It’s a hopeful and humanistic lens that is informed by thinking about the Web as an archive. But more on that later.

Everything is Broken

Even the casual user of the Web has run across the problem of the 404 Not Found. In a recent survey of Web citations found in Thomson Reuters’ Web of Science, Hennessey and Ge found that only 69% of the URLs were still available, and that the median lifetime for a URL was 9.3 years. The Internet Archive had archived 62% of these URLs. In a similar study of URLs found in recent papers in the popular arXiv pre-print repository, Sanderson, Phillips and Van de Sompel found that of the 144,087 unique URLs referenced in papers, only 70% were still available, and of these, 45% were not archived in the Internet Archive, WebCite, the Library of Congress or the UK National Archives.
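Surveys like these are conceptually simple to reproduce. A rough sketch of a link rot checker follows; a real study would also need to catch content drift, where the URL answers 200 but the cited content is gone:

```python
# A quick sketch of a link rot checker: anything that fails to resolve,
# or answers with an HTTP error status, counts as rotten.
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def check_url(url, timeout=10):
    """Return the HTTP status code for url, or None if it didn't resolve."""
    req = Request(url, method="HEAD", headers={"User-Agent": "linkcheck"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as e:
        return e.code          # the server answered, but with an error
    except OSError:            # DNS failure, refused connection, timeout...
        return None

def is_rotten(status):
    """A dead link either never resolved or came back as an HTTP error."""
    return status is None or status >= 400

def survey(urls):
    """Partition urls into (alive, dead) lists."""
    alive, dead = [], []
    for url in urls:
        (dead if is_rotten(check_url(url)) else alive).append(url)
    return alive, dead
```

Run over a batch of citation URLs, `survey` gives you the same kind of availability percentage these papers report.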

Bear in mind, this isn’t the World Wild Web of dot com bubbles, failed business plans, and pivots we’re talking about. These URLs were found in a small pocket of the Web for academic research, a body of literature that is built on a foundation of citation, and written by practitioners whose very livelihood is dependent on how they are cited by others.

A few months ago the 404 made mainstream news in the US when Adam Liptak’s story In Supreme Court Opinions, Web Links to Nowhere broke in the New York Times. Liptak’s story spotlighted a recent study by Zittrain and Albert which found that 50% of links in United States Supreme Court opinions were broken. As its name suggests, the Supreme Court is the highest federal court in the United States…it is the final interpreter of our Constitution. These opinions in turn document the decisions of the Supreme Court, and have increasingly referenced content on the Web for context, which becomes important later for interpretation. 50% of the URLs found in the opinions suffered from what the authors call reference rot. Reference rot includes situations of link rot (404 Not Found and other HTTP level errors), but it also includes cases where the URL appears to technically work, but the content that was cited is no longer available. The point was dramatically and humorously illustrated by the New York Times, since someone had bought one of the lapsed domains and put up a message for Justice Alito.

Zittrain and Albert propose a new web archiving project called Perma.cc, which relies on libraries to select web pages and websites that need to be archived. As proposed, Perma.cc would be similar in principle to WebCite, which is built around submission of URLs by scholars. But WebCite’s future is uncertain: it is currently running a fund drive to raise the money needed to support its operation. Perma.cc also has the potential to offer a governance structure similar to the way cultural heritage organizations support the Internet Archive in their crawls of the Web.

Internet Archive was started by Brewster Kahle in 1996. It now contains 366 billion web pages or captures (not unique URLs). In 2008 Google Engineers reported that their index contained 1 trillion unique URLs. That’s 5 years ago now. If we assume it hasn’t grown since then, and overlook the fact that there are often multiple captures of a given URL over time, Internet Archive contains about 37% of the Web. This is overly generous since the Web has almost certainly grown in the past 5 years, and we’re comparing apples and oranges, web captures to unique URLs.

Of course, it’s not really fair (or prudent) to put the weight of preserving the Web on one institution. So thankfully, the Internet Archive isn’t alone. The International Internet Preservation Consortium is a member organization made up of national libraries, universities, and other organizations that do Web archiving. The National Library of New Zealand is a member, and has its own Web archive. According to the Wikipedia article listing Web archiving initiatives, the archive comprises 346 million URLs. Perhaps someone in the audience has a rough idea of how big this is relative to the size of the Kiwi Web. It’s a bit of a technical problem even to identify national boundaries on the Web. Since the National Library of New Zealand Act of 2003, the National Library has been authorized to crawl the New Zealand portion of the Web. In this regard, New Zealand is light years ahead of the United States, which is still required by law to ask for permission to collect selected, non-governmental websites.

Protocols and tools for sharing the size and makeup of these IIPC collections are still lacking, but the Memento project spurred on some possible approaches out of necessity. For the Memento prototype to work they needed to collect the URL/timestamp combinations for all archived webpages. This turned out to be difficult both for the archives to share and to aggregate efficiently in one place; the moment it was done it was already out of date. David Rosenthal has some interesting ideas for aggregators to collect summary data from web archives, which could instead provide hints about where a given URL may be archived. Hopefully we’ll see some development in this area, as it’s increasingly important that Web archives coordinate collection development more closely, to encourage diversity of approaches, and to ensure that no one archive is a single point of failure.
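For reference, the Memento protocol itself (later standardized as RFC 7089) is lightweight: you ask a TimeGate for a URL, saying when you want it with an Accept-Datetime header, and it redirects you to the closest archived copy. A sketch against the Internet Archive’s TimeGate:

```python
# The Memento protocol in miniature: ask a TimeGate for a URL, using an
# Accept-Datetime header to say *when* you want it. The TimeGate address
# here is the Internet Archive's.
from urllib.request import Request

TIMEGATE = "http://web.archive.org/web/"

def memento_request(url, accept_datetime):
    """Build a TimeGate request for an archived copy of url near a date."""
    req = Request(TIMEGATE + url)
    req.add_header("Accept-Datetime", accept_datetime)
    return req

# e.g. memento_request("http://www.bbc.co.uk/",
#                      "Thu, 31 May 2007 20:35:00 GMT")
# Opening the request gets a redirect to the closest memento the archive has.
```

An aggregator, in these terms, is just a service that knows which TimeGates are worth asking for a given URL.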

Even when you consider the work of the International Internet Preservation Consortium, which adds roughly 75 billion URLs (also not unique), we are still only seeing 44% of the Web being archived. And of course this is a very generous guesstimate, since the 366 billion Internet Archive captures are not unique URLs: e.g. a given URL like the BBC homepage has been fetched 13,863 times between December 21, 1996 and November 14, 2013. And there is almost certainly overlap between the various IIPC web archives and the Internet Archive.
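The arithmetic behind these guesstimates is strictly back-of-the-envelope, comparing captures to unique URLs as noted above:

```python
# Back-of-the-envelope coverage estimates from the figures above.
ia_captures = 366e9    # Internet Archive captures (not unique URLs)
iipc_captures = 75e9   # rough IIPC total (also not unique)
web_urls = 1e12        # Google's 2008 count of unique URLs

ia_coverage = ia_captures / web_urls
combined_coverage = (ia_captures + iipc_captures) / web_urls
print(f"{ia_coverage:.0%}")        # 37%
print(f"{combined_coverage:.0%}")  # 44%
```

Both numbers overstate coverage, since the numerators count repeat captures and the denominator is five years stale.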

The Archival Sliver

I am citing these statistics not to say the job of archiving the Web is impossible, or a waste of resources. Much to the contrary. I raise it here to introduce one of the archival lenses I want to encourage you to look at Web preservation through: Verne Harris’ notion of the archival sliver. Harris is a South African archivist, writer and director of the Archive at the Nelson Mandela Centre of Memory. He participated in the transformation of South Africa’s apartheid public records system, and got to see up close how the contents of archives are shaped by the power structures in which they are embedded. Harris’ ideas have a distinctly post-modern flavor, and contrast with positivist theories of the archive that assert that the archive’s goal is to reflect reality.

Even if archivists in a particular country were to preserve every record generated throughout the land, they would still have only a sliver of a window into that country’s experience. But of course in practice, this record universum is substantially reduced through deliberate and inadvertent destruction by records creators and managers, leaving a sliver of a sliver from which archivists select what they will preserve. And they do not preserve much.

I like Harris’ notion of the archival sliver, because he doesn’t see it as a cause for despair, but rather as a reason to celebrate the role that this archival sliver has in the process of social memory, and the archivist who tends to it.

The archival record … is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connections to “reality,” but by its open-ended layerings of construction and reconstruction. Far from constituting the solid structure around which imagination can play, it is itself the stuff of imagination.

The First URL

So instead of considering the preservation of billions of URLs, let’s change tack a bit and take a look at the preservation of one, namely the first URL.

On April 30th, 1993 CERN made the (in hindsight) momentous decision to freely release the Web technology software that Tim Berners-Lee, Ari Luotonen and Henrik Nielsen created for making the first website. But 20 years later, that website was no longer available. To celebrate the 20th anniversary of the software release Dan Noyes from CERN led a project to bring the original website back online, at the same address, using a completely different software stack. The original content was collected from a variety of places: some from the W3C, some from a 1999 backup of Tim Berners-Lee’s NeXT. While the content is how it looked then, the resurrected website isn’t running the original Web server software; it’s running a modern version of Apache.

CERN also hosted a group of 11 volunteer developers to spend 2 days coding at CERN (expenses paid) to recreate the experience of using the line mode browser (LMB). The LMB allowed users with an Internet connection to use the Web without having to install any software: they could simply telnet in and start browsing the emerging Web from their terminal. These developers created a NodeJS JavaScript application that simulates the experience of using the early Web. You can even use it to navigate to other pages, for example the current World Wide Web Consortium page.

In a lot of ways I think this work illustrates James Governor’s adage:

Applications are like fish, data is like wine. Only one improves with age.

As any old school LISP programmer will tell you, sometimes code is data and data is code. But it is remarkable that this 20 year old HTML still renders just fine in a modern Web browser. This is no accident, but is the result of thoughtful, just-in-time design that encouraged the evolvability, extensibility and customizability of the Web. I think we as a community still have lots to learn from the Web’s example, and lots more to import into our practices. More about HTML in a bit.


Now obviously this sort of attention can’t be paid to every broken URL on the Web. But it seems like an interesting example of how an archival sliver of the Web was cared for, respected and valued. Despite popular opinion, care for URLs is not something foreign to the Web. For example, let’s take a look at the idea of the permalink, which was popularized by the blogging community. As you know, a blog is typically a stream of content. In 2000 Paul Bausch at Blogger came up with a way to assign URLs to individual posts in the stream. This practice is so ubiquitous now that it’s difficult to see what an innovation it was at the time. As its name implies, a permalink is meant to be stable over time, so that the content can be persistently referenced. Apart from longevity, permalinks have beneficial SEO characteristics: the more people link to a page over time, the higher its PageRank, and the more people will find it in search results.

A couple of years before the blogging community started talking about permalinks, Tim Berners-Lee wrote a short W3C design note entitled Cool URIs Don’t Change. In it he provides some (often humorously snarky) advice for people to think about their URLs and namespaces with an eye to their future. One of Berners-Lee’s great insights was to allow any HTML document to link to any other HTML document, without permission. This decision allowed the Web to grow in a decentralized fashion. It also means that links can break when pages drift apart, move to new locations, or disappear. But just because a link can break doesn’t mean that it must break.

The idea of Cool URIs saw new life in 2006 when Leo Sauermann and Richard Cyganiak began work on Cool URIs for the Semantic Web, which became a seminal document for the Linked Data movement. Their key insight was that identity (URLs) matters on the Web, especially when you are trying to create a distributed database like the Semantic Web.

Call them permalinks or Cool URIs, the idea is the same. Well managed websites will be rewarded with more links to their content, improved SEO, and ultimately more users. But most of all they will be rewarded with a better understanding of what they are putting on the Web. Organizations, particularly cultural heritage organizations, should take note, especially their “architects”. Libraries, archives and museums need to become regions of stability on the Web, where URLs don’t capriciously fail because some exhibit is over, or some content management system is swapped out for another. This doesn’t mean content can’t change, move or even be deleted. It just means we need to know when we are doing it, and say where something has moved, or say that what was once there is now gone. If we can’t do it, the websites that do will become the new libraries and archives of the Web.
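In HTTP terms this is not much work: a site that knows its own history answers 301 Moved Permanently for things that moved, and 410 Gone for things that are deliberately gone, instead of a shrugging 404. A sketch using the standard library (the paths here are made up for illustration):

```python
# A sketch of being a "region of stability": when content moves, answer
# 301 with the new location; when it is truly gone, answer 410 Gone
# rather than a bare 404. Paths are hypothetical examples.
from http.server import BaseHTTPRequestHandler, HTTPServer

MOVED = {"/exhibits/2013/": "/archive/exhibits/2013/"}  # old path -> new path
GONE = {"/beta/"}  # things we deleted on purpose, and admit to

class StableHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in MOVED:
            self.send_response(301)
            self.send_header("Location", MOVED[self.path])
            self.end_headers()
        elif self.path in GONE:
            self.send_response(410)  # it existed, and we know it's gone
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()
```

The substance isn’t the handler, it’s the two little tables: keeping a record of where things went is the curatorial work.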


Clearly there is a space between large scale projects to archive the entire Web, and efforts to curate a particular website. Consider the work of ArchiveTeam, a volunteer organization formed in 2009 that keeps an eye on websites that are in danger of closing their doors, or actually are shutting down. Using their wiki, IRC chatrooms, and software tools they have built up a community of practice around archiving websites, which has covered some 60 sites, such as Geocities and Friendster. They maintain a page called the Death Watch where they list sites that are dying (pining for the fjords), or in danger of dying (pre-emptive alarm bells). These activist archivists run something called the Warrior, a virtual appliance you can install on a workstation, which gets instructions from the ArchiveTeam tracker about which URLs to download, and coordinates the collection. The tracker then collects statistics that allow participants to see how much they have contributed relative to others. The collected data is then packed up as WARC files and delivered to the Internet Archive, where it is reviewed by an administrator and added to their Web collection.
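WARC (ISO 28500) is itself a pleasingly simple format: each record is a block of named headers, a blank line, the captured content, and a two-newline trailer. In practice you would use a library like warcio rather than rolling your own, but a rough sketch of a single response record shows the shape of the data the Warriors ship to the Internet Archive:

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri, http_payload):
    """Build one minimal WARC/1.0 'response' record: headers,
    a blank line, the captured HTTP bytes, and a CRLF CRLF trailer."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % (k, v) for k, v in headers)
    return head.encode("utf-8") + b"\r\n" + http_payload + b"\r\n\r\n"

# A captured HTTP response (status line, headers, body) for one page.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("http://example.com/", payload)
```

Real crawls concatenate many such records (usually gzipped) into .warc.gz files, and also store request records, metadata, and deduplication information; libraries like warcio handle those details.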

ArchiveTeam is a labor of love for its creator Jason Scott Sadofsky (aka Jason Scott), who is himself an accomplished documenter of computing history, with films such as BBS: The Documentary (early bulletin board systems), Get Lamp (interactive fiction) and DEFCON: The Documentary. Apart from mobilizing action, his regular talks have raised awareness about the impermanence of the Web, and have connected with other like-minded Web archivists in a way that traditional digital preservation projects have struggled to. I suspect that this self-described "collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage" is the shape of things to come for the profession. ArchiveTeam is not the only group of activists archiving parts of the Web; let's take a look at a few more examples.

Earlier this year Google announced that they were pulling Google Reader offline. This caused much grief and anger to be vented from the blogging community…but it spurred one person into action. Mihai Parparita was an engineer who helped create Google Reader at Google, but he no longer worked there, and wanted to do something to help people retain both their data and the experience of Google Reader. Because he felt that the snapshots of data Google provided weren't complete, he quickly put together a project called ReaderIsDead, which is available on GitHub. ReaderIsDead is actually a collection of tools: one for pulling down your personal Reader data from Google while the Google Reader servers were still alive, and a simple web application called ZombieReader that serves that data up for when the Google Reader servers actually went dead. Mihai used his knowledge of how Google Reader talked to its backend Web service APIs to build ZombieReader.

Ordinarily fat client interfaces like Google Reader pose problems for traditional Web archiving tools like Internet Archive's Heritrix. Fat Web clients are applications that largely run in your browser, using JavaScript. These applications typically talk back to Web service APIs to fetch more data (often JSON) based on interaction in the browser. Web archiving crawlers don't typically execute the JavaScript they crawl from the Web, and have a hard (if not impossible) time simulating the user behavior that triggers the calls back to the Web service. And of course the Web service is what goes dead as well, so even if the Web archive has a snapshot of the requested data, the JavaScript would need to be changed to fetch it. This means the Web archive is left with a largely useless shell.

But in the case of ZombieReader, the fat client provided a data abstraction that proved to be an excellent way to preserve both the personal data and the user experience of using Google Reader. Mihai was able to use the same HTML, CSS and JavaScript from Google Reader, but instead of the application talking back to Google's API he had it talk back to a local Web service that sat on top of the archived data. Individual users could continue to use their personal archives. ZombieReader became a read-only snapshot of what they were reading on the Web, and their interactions with it. Their sliver of a sliver of a sliver.
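The general pattern, keeping the original client-side code but pointing it at a small local, read-only service sitting on top of archived JSON, can be sketched with the standard library. The paths and payloads below are invented stand-ins, not Reader's actual API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Invented stand-ins for archived API responses, keyed by the paths the
# original fat client requests. In ZombieReader this data came from the
# user's downloaded Reader archive, not a hardcoded dictionary.
ARCHIVE = {
    "/api/subscriptions": {"subscriptions": [{"title": "inkdroid"}]},
    "/api/unread-count": {"unread": 0},
}

class ArchivedAPIHandler(BaseHTTPRequestHandler):
    """Serve canned JSON so the unmodified client-side JavaScript keeps
    working, read-only, with no live backend."""
    def do_GET(self):
        data = ARCHIVE.get(self.path)
        if data is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(data).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("localhost", 8001), ArchivedAPIHandler).serve_forever()
```

The design choice is the interesting part: because the client already spoke JSON over HTTP to its backend, replacing the backend with a static snapshot preserved the whole experience.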


Of course .com failures aren’t the only reason why content disappears from the Web. People intentionally remove content from the Web all the time for a variety of reasons. Let’s consider the strange, yet fascinating cases of Mark Pilgrim and Jonathan Gillette. Both were highly prolific software developers, bloggers, authors and well known spokespeople for open source and the Web commons.

Among other things, Mark Pilgrim was very active in the area of feed syndication technology (RSS, Atom). He wrote the feed validator and the Universal Feed Parser, which makes working with syndicated feeds much easier. He also pushed the boundaries of technical writing with Dive Into HTML5 and Dive Into Python 3, which were published traditionally as books, but also made available on the Web with a CC-BY license. Pilgrim also worked at Google, where he helped promote and evolve the Web through his involvement in the Web Hypertext Application Technology Working Group (WHATWG).

Jonathan Gillette, also known as Why the Lucky Stiff or _why, was a prolific writer, cartoonist, artist, and computer programmer who helped popularize the Ruby programming language. His online book Why's (Poignant) Guide to Ruby introduced people of all ages to the practice of programming, with wit and humor that will literally make you laugh out loud as you learn. His projects such as Try Ruby and Hackety Hack! lowered the barriers to getting a working software development environment set up, by moving it to the Web. He also wrote a great deal of software, such as hpricot for parsing HTML, and the minimalist Web framework Camping.

Apart from all these similarities Mark Pilgrim and Jonathan Gillette share something else in common: on October 4, 2011 and August 19, 2009 respectively, they both decided to completely delete their online presence from the Web. They committed info-suicide. Their online books, blogs, social media accounts, and GitHub projects were simply removed. No explanations were made; they just blinked out of existence. They are still alive here in the physical world, but they aren't participating online as they were previously…or at least not using the same personas. I like to think Pilgrim and _why were doing performance art to illustrate the fundamental nature of the Web: its nowness, its fragility, its impermanence. As Dan Connolly once said:

The point of the Web arch[itecture] is that it builds the illusion of a shared information space.

If someone decides to turn off a server or delete a website it's gone for the entire world, and the illusion dissolves. Maybe it lives on buried in a Web archive, but its previous life out on the Web is over. Or is it?

It’s interesting to see what happened after the info-suicides. Why’s (Poignant) Guide to Ruby was rescued by Mislav Marohnic a software developer living in Croatia. He was able to piece the book back together based on content in the Internet Archive, and put it back online at a new URL, as if nothing had happened. In addition he has continued to curate it: updating code samples to work with the latest version of Ruby, enabling syntax highlighting, converting it to use Markdown, and more.

Similarly Mark Pilgrim’s Dive Into HTML 5 and Dive Into Python 3 were assembled from copies and re-deployed to the Web. Prior to his departure Pilgrim used Github to manage the content for his books. Github is a distributed revision control system, where everyone working with the code has a full copy of it local on their machine. So rather than needing to get content out of the Internet Archive, developers created the diveintomark organization account on Github, and pushed their clones of the original repositories there.

Much of _why's and Pilgrim's code was also developed on GitHub. So even though the master repositories were deleted, many people had clones, and were able to work together to establish a new master. Philip Cromer created whymirror on GitHub, which collected _why's code. Jeremy Ruten created _why's Estate, a hypertext archive that collects pointers to the various software archives and writings that have been preserved in the Internet Archive and elsewhere.

So, it turns out that the supposedly brittle medium of the Web, where a link can easily break, and a whole website can be capriciously turned off, is a bit more persistent than we think. These events remind me of Matthew Kirschenbaum's book Mechanisms, which deconstructs notions of electronic media being fleeting or impermanent to show how electronic media (especially that which is stored on hard drives) is actually quite resilient and resistant to change. Mechanisms contains a fascinating study of how William Gibson's poem Agrippa (which was engineered to encrypt itself and become unreadable after a single reading) saw new life on the Internet, as it was copied around on FTP, USENET, email listservs, and ultimately the Web:

Agrippa owes its transmission and continuing availability to a complex network of individuals, communities, ideologies, markets, technologies, and motives … from its example we can see that the preservation of digital media has a profound social dimension that is at least as important as purely technical considerations. Hacking Agrippa

Small Data

In the forensic spirit of Mechanisms, let's take a closer look at Web technology, specifically HTML. Remember the first URL and how CERN was able to revive it? When you think about it, it's kind of amazing that you can still look at that HTML in your modern browser, right? Do you think you could view your 20 year old word processing documents today? Jeff Rothenberg cynically observed:

digital information lasts forever—or five years, whichever comes first

Maybe if we focus on the archival sliver instead of the impossibility of archiving everything, we're not doing so badly.

As we saw in the cases of Pilgrim and _why, the Internet Archive and other web archiving projects play an important role in snapshotting Web pages. But we are also starting to see social media companies building tools that allow their users to easily extract or "archive" their content. These tools use HTML in an interesting new way that is worth taking a closer look at.

How many Facebook users are there here? How many of you have requested your archive? If you navigate to the right place in your settings you can "Download a copy of your Facebook data." When you click on the button you set in motion a process that gathers together your profile, contact information, wall, photos, synced photos, videos, friends, messages, pokes, events, settings, security and (ahem) ads. This takes Facebook a bit of time (it took a day the last time I tried it), and you get an email when it's finished which contains a link to download a zip file. The zip file contains HTML, JPEG and MP4 files which you can open in your browser. You don't need to be connected to the Internet; everything is available locally.

Similarly, Twitter allows you to request an archive periodically, which triggers an email when it is ready for you to pick up. Much like the Facebook archive it is delivered as a zip file, which contains an easily browsable HTML package. The Twitter archive is actually more like a dynamic application, since it includes a JavaScript application called Grailbird. Grailbird lets you search your tweets, and examine tweets from different time periods. Just like the Facebook archive, everything Grailbird needs is available locally, and the application will work when you are disconnected from the Internet (although users' avatar thumbnail images are still loaded directly from the Web). All your tweet data is available as JavaScript and CSV. The application depends on some popular JavaScript libraries like jQuery and Underscore, but those are also bundled right with the archive. It would be nice to see Twitter release Grailbird as a project on GitHub, as many of their other software projects are. Thinking of Grailbird as a visualization framework for tweets would allow interested parties to add new visualizations (e.g. tweets on a map, network graphs, etc). You could also imagine tools for reaching out into the network of an individual's tweets to fetch tweets that they were replying to, and persisting them back locally to the package.
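Having the tweet data locally as CSV means the archive is not just browsable but computable. For example, a few lines of Python can summarize tweets by year; the column names here ("timestamp", "text") are what my export contained, so treat them as assumptions about yours:

```python
import csv

def tweets_by_year(path="tweets.csv"):
    """Count tweets per year from a Twitter archive CSV, assuming a
    'timestamp' column that begins with a YYYY-MM-DD date."""
    counts = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            year = row["timestamp"][:4]
            counts[year] = counts.get(year, 0) + 1
    return counts
```

This is exactly the kind of small analysis the Grailbird visualizations perform in the browser, done instead at the command line against the same package.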

Some of you may remember that the Data Liberation Front (led by Brian Fitzpatrick at Google) and the eventual product offering, Google Takeout, were early innovators in this area. Google Takeout allows you to download data from 14 of their products as a zip file. The service isn't without criticism, because it doesn't include things like your Gmail archive or your search history. The contents of the archive are also somewhat more difficult to work with, compared to the Facebook and Twitter equivalents. For example, each Google+ update is represented as a single HTML file, and there isn't a notion of a minimal, static application that you can use to browse them. The HTML also references content out on the Web, and isn't as self-contained as Twitter and Facebook's archives. But having snapshots of your Youtube videos, and contents of your Google Drive is extremely handy. As Brian Fitzpatrick wrote in 2010, Google Takeout is kind of a remarkable achievement, or realization, for a big publicly traded behemoth to make:

Locking your users in, of course, has the advantage of making it harder for them to leave you for a competitor. Likewise, if your competitors lock their users in, it is harder for those users to move to your product. Nonetheless, it is far preferable to spend your engineering effort on innovation than it is to build bigger walls and stronger doors that prevent users from leaving. Making it easier for users to experiment today greatly increases their trust in you, and they are more likely to return to your product line tomorrow.

I mention Facebook, Twitter and Google here because I think these archiving services are important for memory institutions like museums, libraries and archives. They allow individuals to download their data from the huge corpus that is available–a sliver of a sliver of a sliver. When a writer or politician donates their papers, what if we accessioned their Facebook or Twitter archive? Dave Winer for example has started collecting Twitter archives that have been donated to him and meet certain criteria, and making them public. If we have decided to add someone's papers to a collection, why not acquire their social media archives and store them along with their other born digital and traditional content? Yes, Twitter (as a whole) is being archived by the Library of Congress, as so-called big data. But why don't we consider these personal archives as small data, where context and original order are preserved with other relevant material in a highly usable way? As Rufus Pollock of the Open Knowledge Foundation said:

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

The other interesting thing about these services is their use of HTML as a packaging format. My coworker Chris Adams once remarked that the one format he expects to be able to read in 100 years is HTML. Of course we can’t predict the future. But I suspect he may be right. We need best practices, or really just patterns for creating HTML packages of archival content. We need to make sure our work sits on top of common tools for the Web. We need to support the Web browser, particularly open source ones. We need to track and participate in Web standardization efforts such as the W3C and the WHATWG. We must keep the usability of the archive in mind: is it easy to open up in your browser and wander around in the archive as with the Twitter and Facebook examples? And most importantly, as Johan van der Knijff of the National Library of the Netherlands discusses in his study of EPUB, it is important that all resources are local to the package. Loading images, JavaScript, etc from a remote location makes the archive vulnerable, since it becomes dependent on some part of the Web staying alive. Perhaps we also need tools like ArchiveReady for inspecting local HTML packages (in addition to websites) and reporting on their archivability?
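A first pass at van der Knijff's locality test doesn't require much: walk the HTML in the package and flag any resource that is loaded from the live Web. A rough sketch with Python's standard library, checking only a few common tag/attribute pairs, might look like this:

```python
from html.parser import HTMLParser

class RemoteResourceFinder(HTMLParser):
    """Flag resources an HTML package loads from the live Web; a
    self-contained archive should reference only local files."""
    # A small, non-exhaustive set of tag/attribute pairs to inspect.
    ATTRS = {("img", "src"), ("script", "src"), ("link", "href")}

    def __init__(self):
        super().__init__()
        self.remote = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.ATTRS and value and \
                    value.startswith(("http://", "https://", "//")):
                self.remote.append(value)

def check_archivability(html_text):
    """Return the list of remote URLs an HTML document depends on."""
    finder = RemoteResourceFinder()
    finder.feed(html_text)
    return finder.remote
```

A real checker would also look at CSS url() references, iframes, audio/video sources, and URLs constructed in JavaScript, which is where tools like ArchiveReady earn their keep.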


So how to wrap up this strange, fragmented, incomplete tour through Web preservation? I feel like I should say something profound, but I was hoping these stories of the Web would do that for me. I can only say for myself that I want to give back to the Web the way it has given to me. With 25 years behind us the Web needs us more than ever to help care for the archival slivers it contains. I think libraries, museums and archives that realize that they are custodians of the Web, and align their mission with the grain of the Web, will be the ones that survive, and prosper. Brian Fitzpatrick, Jason Scott, Brewster Kahle, Mislav Marohnic, Philip Cromer, Jeremy Ruten and Aaron Swartz demonstrated their willingness to work with the Web as a medium in need of preservation, as well as a medium for doing the preservation. We need more of them. We need to provide spaces for them to do their work. They are the new faces of our profession.


Neil Gaiman wrote recently in the Guardian about reading, writing and libraries. I suspect you’ve seen it already, but it sat in my to-read pile (on my Kindle) till now. I think I’m going to read it to my kids tonight.

One of the nice things about reading on my Kindle is I’m much more likely to highlight and annotate. Really, I would’ve highlighted the whole thing if I could have. But anyway, here is what I highlighted:

The Web

There were noises made briefly, a few years ago, about the idea that we were living in a post-literate world, in which the ability to make sense out of written words was somehow redundant, but those days are gone: words are more important than they ever were: we navigate the world with words, and as the world slips onto the web, we need to follow, to communicate and to comprehend what we are reading. People who cannot understand each other cannot exchange ideas, cannot communicate, and translation programs only go so far.

The Book

I do not believe that all books will or should migrate onto screens: as Douglas Adams once pointed out to me, more than 20 years before the Kindle turned up, a physical book is like a shark. Sharks are old: there were sharks in the ocean before the dinosaurs. And the reason there are still sharks around is that sharks are better at being sharks than anything else is. Physical books are tough, hard to destroy, bath-resistant, solar-operated, feel good in your hand: they are good at being books, and there will always be a place for them. They belong in libraries, just as libraries have already become places you can go to get access to ebooks, and audiobooks and DVDs and web content.

The (Public) Library

A library is a place that is a repository of information and gives every citizen equal access to it. That includes health information. And mental health information. It’s a community space. It’s a place of safety, a haven from the world. It’s a place with librarians in it. What the libraries of the future will be like is something we should be imagining now … If you do not value libraries then you do not value information or culture or wisdom. You are silencing the voices of the past and you are damaging the future.

future archives

It’s hard to read Yves Raimond and Tristan Ferne’s paper The BBC World Service Archive Prototype and not imagine a possible future for radio archives, archives on the Web, and archival description in general.

Actually, it’s not just the future, it’s also the present, as embodied in the BBC World Service Archive prototype itself, where you can search and listen to 45 years of radio, and pitch in by helping describe it if you want.

As their paper describes, Raimond and Ferne came up with some automated techniques to connect text about the programs (derived directly from the audio, or indirectly through supplied metadata) to Wikipedia and DBpedia. This resulted in some 20 million RDF assertions, which form the database that the (very polished) web application sits on top of. Registered users can then help augment and correct these assertions. I can only hope that some of these users are actually BBC archivists, who can also help monitor and tune the descriptions provided by the general public.

Their story is full of win, so it’s understandable why the paper won the 2013 Semantic Web Challenge:

  • They used WikipediaMiner to take a first pass at entity extraction of the text they were able to collect for each program. The MapHub project uses WikipediaMiner for the same purpose of adding structure to otherwise unstructured text.
  • They used Amazon Web Services (aka the cloud) to do what would have taken them 4 years in the space of 2 weeks, for a fixed, one time cost.
  • They use ElasticSearch for search, instead of trying to squeeze that functionality and scalability out of a triple store.
  • They wanted to encourage curation of the content, so they put an emphasis on usability and design that is often absent from Linked Data prototypes.
  • They have written in more detail about the algorithms that they used to connect up their text to Wikipedia/DBpedia.
  • Their GitHub account reflects the nuts and bolts of how they did this work. Specifically their rdfsim Python project, which vectorizes a SKOS hierarchy for determining the distance between concepts, seems like a really useful approach to disambiguating terms in text.
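rdfsim itself represents each concept as a vector over the hierarchy, but the underlying intuition, that two concepts are close when they share a nearby common ancestor, can be sketched with a toy broader/narrower tree. The vocabulary below is made up for illustration:

```python
# A toy broader/narrower hierarchy standing in for a SKOS vocabulary:
# each concept maps to its skos:broader parent.
BROADER = {
    "ambient music": "electronic music",
    "electronic music": "music",
    "rock music": "music",
    "music": "arts",
    "painting": "arts",
}

def path_to_root(concept):
    """Follow broader links from a concept up to the top of the tree."""
    path = [concept]
    while path[-1] in BROADER:
        path.append(BROADER[path[-1]])
    return path

def distance(a, b):
    """Count edges from a up to the closest shared ancestor, then down
    to b; None if the two concepts share no ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = {c: i for i, c in enumerate(pa)}
    for j, c in enumerate(pb):
        if c in ancestors:
            return ancestors[c] + j
    return None
```

So "ambient music" sits closer to "rock music" (shared ancestor "music") than to "painting" (shared ancestor "arts"), which is the kind of signal that helps pick the right entity when a transcript mentions an ambiguous term.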

But it is the (implied) role of the archivist, as the professional responsible for working with developers to tune these algorithms, evaluating/gauging user contributions, and helping describe the content themselves, that excites me the most about this work. It's also the future role of the archive that is at stake. In another paper, Raimond, Smethurst, McParland and Lowis describe how having this archival data allows them to augment live BBC News subtitles with links to the audio archive, where people can follow their nose (or ears in this case) to explore the context around news stories.

The fact that it’s RDF and Linked Data isn’t terribly important in all this. But the importance of using world curated, openly licensed entities derived from Wikipedia cannot be understated. It’s the conceptual glue that allows connections to be made. As Wikidata grows in importance at Wikipedia it will be interesting to see if it supplants the role that DBpedia has been playing to date.

And of course, it’s exciting because it’s not just anyone doing this, it’s the BBC.

My only nit is that it would be nice to see some of the structured data they've collected expressed more in their HTML. For example they have minted a URI for Brian Eno which lists radio programs that are related to him. Why not display his bio, and perhaps a picture? Why not put in links to other radio programs for people he is associated with, like David Byrne or David Bowie? Why not express some of this semantic metadata as microdata or RDFa in the page, to enable search engine optimization and reuse?

Luckily, it sounds like they have invested in the platform and data they would need to add these sorts of features.

PS. Apologies to the Mighty Boosh for the title of this post. “The future’s dead … Everyone’s looking back, not forwards.”

broken wordpress links

Internet Archive recently announced their new Availability API for checking whether a representation of a given URL is in their archive, with a simple HTTP call. In addition to the API they highlighted a few IA-related projects, including a WordPress plugin called Broken Link Checker, which will check the links in your WordPress site and offer to fix any broken ones using an Internet Archive URL, if one is available based on a call to the Availability API.
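The Availability API itself is a simple GET that returns JSON describing the closest archived snapshot, and it accepts an optional timestamp parameter. A small sketch of building the request and parsing the response:

```python
import json
from urllib.parse import urlencode

API = "http://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build an Availability API request URL; the optional timestamp
    (YYYYMMDDhhmmss) asks for the snapshot closest to that moment."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)

def closest_snapshot(response_text):
    """Pull the closest available snapshot URL out of the API's JSON
    response, or return None if there is no archived copy."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# Live usage (network required):
#   from urllib.request import urlopen
#   closest_snapshot(urlopen(availability_query("example.com")).read())
```

Passing a timestamp near the date a page originally linked out is how a tool could prefer a contemporaneous snapshot over the most recent one.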

I installed the plugin here and let it run for a bit. It detected 3898 unique URLs in 4910 links, of which 482 are broken. This amounts to 12% link rot … but there were also 1038 redirects that resulted in a 200 OK, so there may be a fair bit of reference rot lurking there. The plugin itself doesn't provide a summary of HTTP status codes for the "broken" URLs, but they are listed one by one in the broken link report. Since I could see the HTTP status codes in the table, I figured out you can easily log into your WordPress database and run a query like this to get a summary:

mysql> select http_code, count(*) from wp_blc_links where broken is true group by http_code;
+-----------+----------+
| http_code | count(*) |
+-----------+----------+
|         0 |      113 |
|       400 |        9 |
|       403 |       15 |
|       404 |      333 |
|       410 |        5 |
|       416 |        1 |
|       500 |        3 |
|       502 |        1 |
|       503 |        2 |
+-----------+----------+

I’m assuming the 113 (23% of the broken links) are DNS lookup failure, or connection failures. Once the broken links are identified, you have to manually curate each link to decide whether you want to link out to the Internet Archive based on whether it’s possible, and whether the most recent link is appropriate or not. This can take some time, but it is useful given I uncovered a number of fat-fingered URLs that looked like they never worked, which I was able to fix. Of the legitimately broken URLs, 136 URLs (~28%) were available at the Internet Archive. Once you’ve decided to use an IA URL the plugin can quickly rewrite the original content without requiring you to do in and tweak the content yourself.

One thing that would be nice would be for the API to be queried for a representation of the resource based on when the post was authored. For example my Snakes and Rubies post had a broken link to and the plugin correctly found that it was available at the Internet Archive with an API query like:

% curl --silent '' | python -mjson.tool
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "status": "200",
            "timestamp": "20130131030609",
            "url": ""
        }
    }
}
When requesting that URL you get this hit in the archive: but that's an archive of a cybersquatted version of the page:

Snakes, Rubies and Cybersquatter

If the timestamp of the blogpost were used perhaps a better representation like this could’ve been found automatically, or at least it could have been offered first?

Snakes and Rubies

Based on the svn log, the plugin appears to have been first released on 2007-10-08, and it has been downloaded 2,099,072 times since then. When people gripe about the Web being broken by design I think it's good to remember that tools like this exist to help make it better, one website and link at a time.

NYPL's Building Inspector

You probably already saw the news about NYPL’s Building Inspector that was released yesterday. If you haven’t, definitely check it out…it’s a beautiful app. I hope Building Inspector represents the shape of things to come for engagement with the Web by cultural heritage institutions.

I think you’ll find that it is strangely addictive. This is partly because you get to zoom in on random closeups of historic NYC maps: which is like candy if you are a map junkie, or have spent any time in the city. More importantly you get the feeling that you are helping NYPL build and enrich a dataset for further use. I guess you could say it’s gamification, but it feels more substantial than that.

Building Inspector hits a sweet spot for a few reasons:

  1. It has a great name. Building Inspector describes the work you will be doing, and contextualizes the activity with a profession you may already be familiar with.
  2. It opens with some playful yet thoughtfully composed instructions that describe how to do the inspection. The instructions aren’t optional, but can easily be dismissed. They are fun while still communicating essential facts about what you are going to be doing.
  3. There is an easy way to review the work you’ve done so far by clicking on the View Progress link. You use your Twitter, Facebook or Google account to log in. It would be cool to be able to see the progress view from a global view: everyone’s edits, in realtime perhaps.
  4. The app is very responsive, displaying new parts of the map with sub-second response times.
  5. The webapp looks and works great as a mobile app. I’d love to hear more about how they did this, since they don’t appear to be using anything like Twitter Bootstrap to help. The mobile experience might be improved a little bit if you could zoom and pan with touch gestures.
  6. It uses LeafletJS. I’ve done some very simplistic work with Leaflet in the past, so it is good to see that it can be customized this much.
  7. NYPL is embracing the cloud. Building Inspector is deployed on Heroku, with map tiles on Amazon’s CloudFront. This isn’t a big deal for lots of .com properties, but for libraries (even big research libraries like NYPL) I reckon it is a bigger deal than you might suspect.
  8. The truly hard part of recognizing the outlines of buildings with OpenCV and other tools has been made available by NYPL on Github for other people to play around with.

Another really fun thing about the way this app was put together was its release, with the (apparent) coordination with an article at Wired, and subsequent follow up on the nypl_labs Twitter account.

6:35 AM

7:22 AM

10:12 AM

6:43 PM

Or in other words:


Quite a first day! It would be interesting to know what portion of the work this represents. Also, I’d be curious to see if NYPL is able to sustain this level of engagement to get the work done.

Day 2 Update

2:22 PM

4:07 PM

If I’m doing the math right (double check me if you really care), between those two data points there were 6,499 inspections over 63,000 seconds – so an average of 1.03 inspections/second. Whereas between points 3 and 4 of yesterday it looks like they had an average of 1.91 inspections/second.

Days 1-2

Day 3 Update

Days 1-3