linkrot: use your illusion

Mike Giarlo wrote a bit last week about the issues of citing datasets on the Web with Digital Object Identifiers (DOI). It’s a really nice, concise characterization of why libraries and publishers have promoted and used the DOI, and indirect identifiers more generally. Mike defines indirect identifiers as

… identifiers that point at and resolve to other identifiers.

I might be reading between the lines a bit, but I think Mike is specifically talking about any identifier that has some documented or ad-hoc mechanism for turning it into a Web identifier, or URL. A quick look at the Wikipedia identifier category yields lots of these, many of which (but not all) can be expressed as a URI.

The reason why I liked Mike’s post so much is that he was able to neatly summarize the psychology that drives the use of indirect identifier technologies:

… cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers

Mike goes on to get to the heart of the problem, which is that indirect identifier technologies don’t solve the problem of broken links on the Web; they just push it elsewhere. The real problem of maintaining the indirect identifier when the actual URL changes becomes someone else’s problem. Out of sight, out of mind … except it’s not really out of sight, right? Unless you don’t really care about the content you are putting online.

We all know that linkrot on the Web is a real thing. I would be putting my head in the sand if I were to say it wasn’t. But I would also be putting my head in the sand if I said that things don’t go missing from our brick and mortar libraries. Still, we should be able to do better than half the URLs in arXiv going dead, right? I make a living as a web developer, I’m an occasional advocate for linked data, and I’m a big fan of the work Henry Thompson and David Orchard did for the W3C analyzing the use of alternate identifier schemes on the Web. So, admittedly, I’m a bit of a zealot when it comes to promoting URLs as identifiers, and taking the Web seriously as an information space.

Mike’s post actually kicked off what I thought was a useful Twitter conversation (yes they can happen), which left me contemplating the future of libraries and archives on (or in) the Web. Specifically, it got me thinking that perhaps libraries and archives of the not too distant future will be places that take special care in how they put content on the Web, so that it can be accessed over time, just like a traditional physical library or archive. The places where links and the content they reference are less likely to go dead will be the new libraries and archives. These may not be the same institutions we call libraries today. Just like today’s libraries, these new libraries may not necessarily be free to access. You may need to be part of some community to access them, or to pay some sort of subscription fee. But some of them, and I hope most, will be public assets.

So how to make this happen? What will it look like? Rather than advocating a particular identifier technology, I think these new libraries need to think seriously about providing Terms of Service documents for their content services. I think these library ToS documents will do a few things.

  • They will require the library to think seriously about the service they are providing. This will involve meetings, more meetings, power lunches, and likely lawyers. The outcome will be an organizational understanding of what the library is putting on the Web, and the commitment they are entering into with their users. It won’t simply be a matter of a web development team deciding to put up some new website…or take one down. This will likely be hard, but I think it’s getting easier all the time, as the importance of the Web as a publishing platform becomes more and more accepted, even in conservative organizations like libraries and archives.
  • The ToS will address the institution’s commitment to continued access to the content. This will involve a clear understanding of the URL namespaces that the library manages, and a statement about how they will be maintained over time. The Web has built-in mechanisms for content moving from place to place (HTTP 301) and for resources being removed (HTTP 410), so URLs don’t need to be written in stone; a sketch of what this looks like in practice follows this list. But the library needs to commit to how resources will redirect permanently to new locations, for how long, and how they will be removed.
  • The ToS will explicitly state the licensing associated with the content, preferably with Creative Commons licenses (hey I’m daydreaming here) so that it can be confidently used.
  • Libraries and archives will develop a shared palette of ToS documents. Each institution won’t have its own special snowflake ToS that nobody reads. There will be some normative patterns for different types of libraries. They will be shared across consortia, and among peer institutions. Maybe they will be incorporated into, or reflect shared principles found in, documents like ALA’s Library Bill of Rights or SAA’s Code of Ethics.
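
To make the HTTP 301/410 point concrete, here’s a minimal sketch of what honoring that kind of commitment could look like in a web application. The framework (Flask), the routes, and the moved/withdrawn entries are all hypothetical; the point is just that the mechanics are ordinary HTTP, and the real work is keeping the table of moves and removals under the same institutional care as the ToS itself.

    # A minimal sketch (hypothetical names throughout) of a library honoring its
    # ToS commitments for a managed URL namespace: moved resources answer with a
    # permanent redirect (301), withdrawn ones with 410 Gone rather than a 404.

    from flask import Flask, abort, redirect

    app = Flask(__name__)

    # Record of namespace changes, kept under version control alongside the ToS.
    MOVED = {
        "/collections/maps/1906-quake": "/collections/cartographic/1906-quake",
    }
    WITHDRAWN = {
        "/collections/maps/duplicate-scan-42",  # removed per retention policy
    }

    @app.route("/collections/<path:slug>")
    def collection_item(slug):
        path = f"/collections/{slug}"
        if path in MOVED:
            return redirect(MOVED[path], code=301)  # moved permanently
        if path in WITHDRAWN:
            abort(410)  # gone, and the server says so explicitly
        return f"Resource page for {path}"  # normal delivery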

I guess some of this might be a bit reminiscent of the work that has gone into what makes a trusted repository. But I think a Terms of Service between a library/archive and its researchers is something a bit different. It’s more outward looking: less interested in certification and compliance, and more interested in entering into and upholding a contract with the user of a collection.

As I was writing this post, Dan Brickley tweeted about a talk Tony Ageh (head of the archive development team at the BBC) gave at the recent Economies of the Commons conference. He spoke about his ideas for a future Digital Public Space, and the role that archives and organizations like the BBC play in helping create it.

Things no longer ‘need’ to disappear after a certain period of time. Material that once would have flourished only briefly before languishing under lock and key or even being thrown away — can now be made available forever. And our Licence Fee Payers increasingly expect this to be the way of things. We will soon need to have a very, *very* good reason for why anything at all disappears from view or is not permanently accessible in some way or other.

That is why the Digital Public Space has placed the continuing and permanent availability of all publicly-funded media, and its associated information, as the default and founding principle.

I think Tony and Mike are right. Cultural heritage organizations need to think more seriously, and more long-term, about the content they are putting on the Web. They need to put this thought into clear and succinct contracts with their users. The organizations that do will be what we call libraries and archives tomorrow. I guess I need to start by getting my own house in order, eh?

geeks bearing gifts

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

  • Create a persistent identifier: DOI or ARK
  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata
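
In practice those core functions boil down to a handful of HTTP requests. Here’s a rough sketch of minting a test ARK; the endpoint, test shoulder, ANVL metadata conventions, and credentials below are my assumptions from a reading of EZID’s API documentation, and should be checked against the current docs:

    # Rough sketch of minting an identifier through a hosted service like EZID.
    # The URL, test shoulder, metadata names, and credentials are assumptions;
    # consult the current EZID API documentation for the real details.

    import requests

    MINT_URL = "https://ezid.cdlib.org/shoulder/ark:/99999/fk4"  # assumed test shoulder

    # Metadata is sent as ANVL: one "name: value" pair per line.
    anvl = "\n".join([
        "_target: http://example.org/reports/2011/annual",
        "erc.who: Example Library",
        "erc.what: Annual report",
        "erc.when: 2011",
    ])

    resp = requests.post(
        MINT_URL,
        data=anvl.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=UTF-8"},
        auth=("username", "password"),  # placeholder credentials
    )
    print(resp.status_code, resp.text)  # e.g. "success: ark:/99999/fk4..." on success

There’s nothing exotic going on here: it’s plain HTTP underneath, plus an administrative layer that keeps the identifier-to-URL bindings current.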

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depends upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into your web browser (browsers being ubiquitous these days) and view a document about the identified resource.
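
That ubiquity means even the boring chores are cheap. For instance, here’s a small sketch of checking whether a batch of URLs still resolve, and how; the URLs are placeholders, and the requests library is just one of many HTTP clients that would do:

    # A small sketch of checking a batch of URLs for linkrot with ordinary HTTP
    # machinery. The URLs below are placeholders, not real resources.

    import requests

    urls = [
        "http://example.org/items/some-stable-identifier",
        "http://example.org/reports/2011/annual",
    ]

    for url in urls:
        try:
            # HEAD keeps the check cheap; allow_redirects=False surfaces 301s directly
            resp = requests.head(url, allow_redirects=False, timeout=10)
            if resp.status_code in (301, 302):
                print(f"{url} -> moved to {resp.headers.get('Location')}")
            elif resp.status_code == 410:
                print(f"{url} -> gone (deliberately withdrawn)")
            else:
                print(f"{url} -> {resp.status_code}")
        except requests.RequestException as err:
            print(f"{url} -> failed: {err}")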

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs: what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging communities call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forward. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks, namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.
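
As a small illustration of the difference (the routes and names here are hypothetical, and the framework choice is incidental): a cool URI is keyed on something the institution controls and intends to keep stable, not on whatever the current software happens to expose.

    # A sketch of keeping a URL namespace "cool": route on a stable identifier
    # the institution controls, not on implementation details (database ids,
    # file extensions, session tokens) that change when the software does.
    # All names below are hypothetical.

    from flask import Flask

    app = Flask(__name__)

    # Brittle (ties the link to today's software):
    #   /cgi-bin/display.php?db=photos&rec=8871
    #
    # Cooler (ties the link to the resource itself):
    #   /photos/1910/panama-canal-construction

    @app.route("/photos/<int:year>/<slug>")
    def photo(year, slug):
        # Look the record up by its stable slug; the storage behind this view
        # can be swapped out later without breaking the URL.
        return f"Photo record {year}/{slug}"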

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As Ted Nelson, another crank (I mean this with all due respect), put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.

We’ve got five years, my brain hurts a lot

Recently there have been a few discussions about persistent identifiers on the web: in particular, one about the persistence of XRIs, and another about the use of HTTP URIs in semantic web applications like dbpedia.

As you probably know already, the w3c publicly recommended against the use of Extensible Resource Identifiers (XRI). The net effect of this was to derail the standardization of XRIs within OASIS itself. Part of the process that Ray Denenberg (my colleague at the Library of Congress) helped kick off was a further discussion between XRI people and the w3c-tag about what XRI specifically provides that HTTP URIs do not. Recently that discussion hit a key point, raised by Stuart Williams:

… the point that I’m trying to make is that the issue is with the social and administrative policies associated with the DNS system – and the solution is to establish a separate namespace outside the DNS system that has different social/adminsitrative policies (particularly wrt persistent name segments) that better suits the requirements of the XRI community. There is the question as to whether that alternate social/administrative system will endure into the long term such the the persistence intended guarantees endure… or not – however that will largely be determined by market forces (adoption) and ‘crudely’ the funding regime that enables the administrative structure of XRI to persist – and probably includes the use of IPRs to prevent duplicate/alternate root problems which we have seen in the DNS world.

It’ll be interesting to see the response. I basically have the same issue with DOIs and the Handle System that they depend on. Over at CrossTech, Tony Hammond suggests that the Handle System would make RDF assertions such as those that involve dbpedia more persistent. But just how isn’t entirely clear to me. It seems that Handles, like URLs, are only persistent to the degree that they are maintained.
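
To make that concrete, here’s a sketch of what dereferencing a Handle-based identifier like a DOI actually looks like from the client side: an HTTP GET against the public doi.org proxy, followed by whatever chain of redirects the registrant currently maintains (the DOI below is the DOI Handbook’s, used purely as an example; any DOI shows the same pattern):

    # Sketch: resolving a DOI on the Web is an HTTP request to the doi.org proxy
    # plus whatever redirects the registrant keeps up to date.

    import requests

    doi = "10.1000/182"  # the DOI Handbook, used here only as an example
    resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)

    for hop in resp.history:
        print(hop.status_code, hop.url)  # each redirect someone has to maintain
    print(resp.status_code, resp.url)    # wherever the content lives this week

The persistence of that last hop is exactly the maintenance commitment a plain URL requires; the Handle layer relocates the work, it doesn’t remove it.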

I’d love to see a use case from Tony that describes just how DOIs and the Handle System would provide more persistence than HTTP URLs in the context of RDF assertions involving dbpedia. As Stuart said eloquently in his email:

Again just seeking to understand – not to take a particular position

PS. Sorry if the blog post title is too cryptic; it’s Bowie’s “Five Years”, which Tony’s post (perhaps intentionally) reminded me of :-)