linkrot: use your illusion

Mike Giarlo wrote a bit last week about the issues of citing datasets on the Web with Digital Object Identifiers (DOI). It’s a really nice, concise characterization of why libraries and publishers have promoted and used the DOI, and indirect identifiers more generally. Mike defines indirect identifiers as

… identifiers that point at and resolve to other identifiers.

I might be reading between the lines a bit, but I think Mike is specifically talking about any identifier that has some documented or ad-hoc mechanism for turning it into a Web identifier, or URL. A quick look at the Wikipedia identifier category yields lots of these, many of which (but not all) can be expressed as a URI.
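
To make "documented mechanism" concrete, here is a minimal sketch of turning a DOI into a URL by prefixing it with the public doi.org resolver. The function name `doi_to_url` and the handling of a leading `doi:` label are assumptions made for illustration, not something Mike's post specifies.

```python
from urllib.parse import quote

# Public DOI resolver; a DOI appended to it becomes an ordinary Web URL.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn an indirect identifier (a DOI) into a URL that can be dereferenced.

    Hypothetical helper: strips an optional "doi:" label and percent-encodes
    characters that are not safe in a URL path.
    """
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # 10.1000/182 is used here purely as an example DOI.
    print(doi_to_url("doi:10.1000/182"))
```

The URL this produces is itself only a pointer: the resolver redirects to wherever the publisher currently says the resource lives, so someone still has to keep that target up to date.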

The reason why I liked Mike’s post so much is that he was able to neatly summarize the psychology that drives the use of indirect identifier technologies:

… cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers

Mike goes on to get to the heart of the problem, which is that indirect identifier technologies don’t solve the problem of broken links on the Web; they just push it elsewhere. The real work of keeping the indirect identifier up to date when the actual URL changes becomes someone else’s problem. Out of sight, out of mind … except it’s not really out of sight, right? Unless you don’t really care about the content you are putting online.

We all know that linkrot on the Web is a real thing. I would be putting my head in the sand if I were to say it wasn’t. But I would also be putting my head in the sand if I said that things don’t go missing from our brick and mortar libraries. Still, we should be able to do better than half the URLs in arXiv going dead, right? I make a living as a web developer, I’m an occasional advocate for linked data, and I’m a big fan of the work Henry Thompson and David Orchard did for the W3C analyzing the use of alternate identifier schemes on the Web…so, admittedly, I’m a bit of a zealot when it comes to promoting URLs as identifiers, and taking the Web seriously as an information space.

Mike’s post actually kicked off what I thought was a useful Twitter conversation (yes, they can happen), which left me contemplating the future of libraries and archives on (or in) the Web. Specifically, it got me thinking that perhaps libraries and archives of the not-too-distant future will be places that take special care in how they put content on the Web, so that it can be accessed over time, just like a traditional physical library or archive. The places where links and the content they reference are less likely to go dead will be the new libraries and archives. These may not be the same institutions we call libraries today. Just like today’s libraries, these new libraries may not necessarily be free to access. You may need to be part of some community to access them, or to pay some sort of subscription fee. But some of them, and I hope most, will be public assets.

So how to make this happen? What will it look like? Rather than advocating a particular identifier technology I think these new libraries need to think seriously about providing Terms of Service documents for their content services. I think these library ToS documents will do a few things.

They will require the library to think seriously about the service they are providing. This will involve meetings, more meetings, power lunches, and likely lawyers. The outcome will be an organizational understanding of what the library is putting on the Web, and the commitment they are entering into with their users. It won’t simply be a matter of a web development team deciding to put up some new website…or take one down. This will likely be hard, but I think it’s getting easier all the time, as the importance of the Web as a publishing platform becomes more and more accepted, even in conservative organizations like libraries and archives.
The ToS will address the institution’s commitment to continued access to the content. This will involve a clear understanding of the URL namespaces that the library manages, and a statement about how they will be maintained over time. The Web has built-in mechanisms for content moving from place to place (HTTP 301), and for when resources are removed (HTTP 410), so URLs don’t need to be written in stone. But the library needs to commit to how resources will redirect permanently to new locations, and for how long, and to how they will be removed.
The ToS will explicitly state the licensing associated with the content, preferably with Creative Commons licenses (hey I’m daydreaming here) so that it can be confidently used.
Libraries and archives will develop a shared palette of ToS documents. Each institution won’t have its own special snowflake ToS that nobody reads. There will be some normative patterns for different types of libraries. They will be shared across consortia, and among peer institutions. Maybe they will be incorporated into, or reflect shared principles found in documents like ALA’s Library Bill of Rights or SAA’s Code of Ethics.
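
The URL commitment described above (301 for things that moved, 410 for things that were deliberately removed) can be sketched as a small lookup a library's web server might consult. This is only an illustration under assumptions: the paths, the three tables, and the `resolve` function are all hypothetical, and a real service would more likely express the same policy in its web server or framework configuration.

```python
from http import HTTPStatus

# Hypothetical URL namespace policy; every path here is invented for illustration.
MOVED = {"/collections/maps/42": "/items/42"}   # old path -> permanent new home
REMOVED = {"/exhibits/2009/temporary"}          # deliberately withdrawn
LIVE = {"/items/42"}                            # currently served


def resolve(path):
    """Return (status, location) for a requested path under the policy above."""
    if path in MOVED:
        # 301: the resource lives on, and clients should update their links.
        return HTTPStatus.MOVED_PERMANENTLY, MOVED[path]
    if path in REMOVED:
        # 410: the resource is gone on purpose, which says more than a bare 404.
        return HTTPStatus.GONE, None
    if path in LIVE:
        return HTTPStatus.OK, None
    # Anything the policy has never heard of.
    return HTTPStatus.NOT_FOUND, None


if __name__ == "__main__":
    for p in ("/collections/maps/42", "/exhibits/2009/temporary", "/items/42", "/nope"):
        status, location = resolve(p)
        print(p, int(status), location)
```

Keeping the moved and removed tables as explicit, versioned data is one way a ToS promise like "redirects will be maintained for N years" could actually be audited.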

I guess some of this might be a bit reminiscent of the work that has gone into what makes a trusted repository. But I think a Terms of Service between a library/archive and its researcher is something a bit different. It’s more outward looking, less interested in certification and compliance and more interested in entering into and upholding a contract with the user of a collection.

As I was writing this post, Dan Brickley tweeted about a talk that Tony Ageh (head of the archive development team at the BBC) gave at the recent Economies of the Commons conference. He spoke about his ideas for a future Digital Public Space, and the role that archives and organizations like the BBC play in helping create it.

Things no longer ‘need’ to disappear after a certain period of time. Material that once would have flourished only briefly before languishing under lock and key or even being thrown away — can now be made available forever. And our Licence Fee Payers increasingly expect this to be the way of things. We will soon need to have a very, very good reason for why anything at all disappears from view or is not permanently accessible in some way or other.

That is why the Digital Public Space has placed the continuing and permanent availability of all publicly-funded media, and its associated information, as the default and founding principle.

I think Tony and Mike are right. Cultural heritage organizations need to think more seriously, and with a longer-term view, about the content they are putting on the Web. They need to put this thought into clear and succinct contracts with their users. The organizations that do will be what we call libraries and archives tomorrow. I guess I need to start by getting my own house in order, eh?