If you work in the digital preservation field and haven’t seen Dorothea Salo’s Retooling Libraries for the Data Challenge in the latest issue of Ariadne, definitely give it a read. Dorothea takes an unflinching look at the scope and characteristics of data assets currently being generated by scholarly research, and how well equipped traditional digital library efforts are to deal with them. I haven’t seen so many of the issues I’ve had to deal with (largely unconsciously) as part of my daily work so neatly summarized before. Having them laid out in such a lucid, insightful and succinct way is really refreshing–and inspiring.

In the section on Taylorist Production Processes, Dorothea makes a really good point that libraries have tended to optimize workflows for particular types of materials (e.g. photographs, books, journal articles, maps). When materials require a great deal of specialized knowledge to deal with, and the tools that are used to manage and provide access to the content are similarly specialized, it’s not hard to understand why this balkanization has happened. On occasion I’ve heard folks (/me points finger at self) bemoan the launch of a new website as the creation of “yet another silo”. But the technical silos that we perceive are largely artifacts of the social silos we work in: archives, special collections, rare books, maps, etc. The collections we break up our libraries into…the projects that mirror those collections. We need to work better together before we can build common digital preservation tools. To paraphrase something David Brunton has said to me before: we need to think of our collections more as sets of things that can be rearranged at will, with ease and even impunity. In fact the architecture of the Web (and each website on it) is all about doing that.

Even though it can be tough (particularly in large organizations) I think we can in fact achieve some levels of common tooling (in areas like storage and auditing); but we must admit (to ourselves at least) that some levels of access will likely always be specialized in terms of technical infrastructure and user interface requirements:

Some, though not all, data can be shoehorned into a digital library not optimised for them, but only at the cost of the affordances surrounding them. Consider data over which a complex Web-based interaction environment has been built. The data can be removed from the environment for preservation, but only at the cost of loss of the specialised interactions that make the data valuable to begin with. If the dataset can be browsed via the Web interface, a static Web snapshot becomes possible, but it too will lack sophisticated interaction. If the digital library takes on the not inconsiderable job of recreating the entire environment, it is committing to rewriting interaction code over and over again indefinitely as computing environments change.

Dorothea’s statement about committing to rewriting interaction code over and over again is important. I’m a software developer, and a web developer to boot – so there’s nothing I like more than yanking the data out of one encrusted old app, and recreating it afresh using the web-framework-du-jour. But in my heart of hearts I know that while this may work for large collections of homogeneous data, it doesn’t scale very well for a vast sea of heterogeneous data. However, all is not lost. As the JISC are fond of saying:

The coolest thing to do with your data will be thought of by someone else.

So why don’t we data archivers get out of the business of building the “interaction code”? Maybe our primary service should be to act as data wholesalers who collect it, and make it available in bulk to those who do want to build access layers on top of it. Let’s make our data easy for other people to use (with clear licensing) and reference (with web identifiers) so that they can annotate it, and we can pull back those annotations and views. In a way this is kind of hearkening back to the idea of Data Providers and Service Providers that was talked about a lot in the context of OAI-PMH. But in this case we’d be making the objects available as well as the metadata that describes them, similar to the use cases around OAI-ORE. I got a chance to chat with Kate Zwaard of the GPO at CurateCamp a few weeks ago, and learned how the new Federal Register is a presentation application for raw XML data being made available by the GPO. Part of the challenge is making these flows of data public, and giving credit where credit is due – not only to the creators of the shiny site you see, but to the folks behind the scenes who make it possible.
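
To make the wholesaler idea a little more concrete, here is a minimal sketch in Python of what a bulk manifest might look like: one entry per object, each with a web identifier and a checksum, plus a single license statement for the whole release. Everything in it is hypothetical and only for illustration: the function name build_manifest, the field names, the example.org URLs, and the choice of SHA-256 are assumptions, not a description of how the GPO, OAI-ORE, or anyone else actually does it.

```python
import hashlib
from typing import Dict, List


def build_manifest(files: Dict[str, bytes], base_url: str, license_url: str) -> dict:
    """Describe a bulk release: a license plus one entry per file.

    `files` maps a file name to its raw bytes. The structure returned here
    is a made-up format for illustration, not an existing standard.
    """
    items: List[dict] = []
    for name in sorted(files):  # sorted so the manifest is stable between runs
        payload = files[name]
        items.append({
            "id": f"{base_url}/{name}",  # web identifier others can cite
            "bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),  # lets consumers verify a copy
        })
    return {"license": license_url, "items": items}
```

Someone building an access layer could then fetch each id, recompute the checksum, and know they are working with the same bytes the archive holds, without the archive having to build any interaction code at all.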

Another part of Dorothea’s essay that stuck out a bit for me was the advice to split ingest, storage and access systems.

Ingest, storage, and end-user interfaces should be as loosely coupled as possible. Ideally, the same storage pool should be available to as many ingest mechanisms as researchers and their technology staff can dream up, and the items within should be usable within as many reuse, remix, and re-evaluation environments as the Web can produce.

This is something we (myself and other folks at LC) did as part of the tooling to support the National Digital Newspaper Program. Our initial stab at the software architecture was to use Fedora to manage the full life cycle (from ingest, to storage, to access) of the newspaper content we receive from program awardees around the US. The only trouble was that we wanted the access system to support heavy use by researchers and also robots (Google, Yahoo, Microsoft, etc) building their own views on the content. Unfortunately the way we had put the pieces together we couldn’t support that. Increasingly we found ourselves working around Fedora as much as possible to squeeze a bit more performance out of the system.

So in the end we (and by we I mean David) decided to bite the bullet and split off the inventory systems keeping track of where received content lives (what storage systems, etc) from the access systems that delivered content on the Web. Ultimately this meant we could leverage industry-proven web development tools to deliver the newspaper content…which was a huge win. Now that’s not saying that Fedora can’t be used to provide access to content. I think the problems we experienced may well have been the result of our use of Fedora, rather than Fedora itself. Having to do multiple, large XSLT transforms on source XML files to render a page is painful. While it’s a bit of a truism, a good software developer tries to pick the right tool for the job. Half the battle there is deciding on the right granularity for the job … the single job we were trying to solve with Fedora (preservation and access) was too big for us to do either part of it right.
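
The shape of that split can be sketched in a few lines of Python. This is only an illustration of loose coupling under my own assumptions, not the actual NDNP code: the class names Inventory and AccessApp, the Locator interface, and the example paths and URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


class Locator(Protocol):
    """The only thing the access side needs: where does an item live?"""

    def locate(self, item_id: str) -> str: ...


@dataclass
class Inventory:
    """Inventory side: records which storage location holds each received item."""

    locations: Dict[str, str] = field(default_factory=dict)

    def record(self, item_id: str, location: str) -> None:
        self.locations[item_id] = location

    def locate(self, item_id: str) -> str:
        # Raises KeyError for items that were never received.
        return self.locations[item_id]


@dataclass
class AccessApp:
    """Access side: depends on the Locator interface, not on a particular inventory."""

    locator: Locator
    base_url: str = "https://example.org/content"

    def source_path(self, item_id: str) -> str:
        # Where the web application should read the item from.
        return self.locator.locate(item_id)

    def public_url(self, item_id: str) -> str:
        # The public address is built without exposing the storage layout.
        self.locator.locate(item_id)  # fail early if the item is unknown
        return f"{self.base_url}/{item_id}"
```

Because AccessApp only asks for something with a locate method, the web side can be rebuilt with whatever tools are handy without touching the inventory, and the inventory can move content between storage systems without the web side noticing.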

Having a system that’s decomposable, like the approach that CDL is taking with Microservices, is essential for long-term thinking about software in the context of digital preservation. I guess you could say “there’s no there there” with Microservices, since there’s not really a system to download–but in a way that’s kind of the point.

I guess this is just a long way of saying, Thanks Dorothea! :-)