ipres, iipc, pasig roundup/braindump

I spent last week in San Francisco attending 3 back-to-back conferences: the International Conference on Preservation of Digital Objects (iPRES), the International Internet Preservation Consortium (IIPC), and the Sun Preservation and Archiving Special Interest Group (PASIG)…thanks to the Library of Congress and to Kesa Summers for letting me go. Also, thanks to the 3 conferences for deciding to co-locate in San Francisco at the same time, which made this sort of tag-team-digital-preservation-event-week possible. I hadn’t been to any of iPRES, IIPC or PASIG before, so it was a lot of fun being able to take them all in at once…especially since, given the nature of my group at the Library of Congress, these are my kind of people.

Each event had a different flavor, but the topic under discussion at each was digital preservation. iPRES focused generally on digital preservation, particularly from a research angle. IIPC also had a bit of a research flavor, but focused more specifically on the practicalities of archiving web content. And PASIG was less research oriented, and much more oriented around building/maintaining large scale storage systems. There was so much good content at these events that it’s kind of impossible to summarize it here. But I thought I would at least attempt to blurrily characterize some of the ideas from the three events that I’m taking back with me.

Forever

Long term digital preservation has many hard problems–so many that I think it is rational to feel somewhat overwhelmed and to some extent even paralyzed. It was important to see other people recognize the big problems of emulation, format characterization/migration, compression – but continue working on pragmatic solutions, for today. Martha Anderson made the case several times for thinking of digital preservation in terms of 5-10 year windows, instead of forever. The phrase “to get to forever you have to get to 5 years first” got mentioned a few times, but I don’t know who said it first. John Kunze brought up the notion of preservation as a “relay”, where bits are passed along at short intervals–and how digital curation needs to enable these handoffs to happen easily. It came to my attention later that this relay idea is something that Chris Rusbridge wrote about back in 2006.

Access

On a similar note, Martha Anderson indicated that making bits useful today is a key factor that the National Digital Information Infrastructure and Preservation Program (NDIIPP) weighs when making funding decisions. Brewster Kahle in his keynote for IIPC struck a similar note that “preservation is driven by access”. Gary Wright gave an interesting presentation about how the Church of Jesus Christ of Latter-day Saints had to adjust the Reference Model for an Open Archival Information System (OAIS) to give thousands of concurrent users access to their archive of 3.1 billion genealogical image records. Jennifer Waxman was kind enough to give me a pointer to some work Paul Conway has done on this topic of access driven preservation. The topic of access in digital preservation is important to me, because I work in a digital preservation group at the Library of Congress, working primarily on access applications. We’ve had a series of pretty intense debates about the role of access in digital preservation … so it was good to hear the topic come up in San Francisco. In a world where Lots of Copies Keeps Stuff Safe, access to copies is pretty important.

Less is More (More or Less)

Over the week I got several opportunities to hear details from John Kunze, Stephen Abrams, and Margaret Low about the California Digital Library’s notion of curation micro-services, and how they enable digital preservation efforts at CDL. Several folks in my group at LC have been taking a close look at the CDL specifications recently, so getting to hear about the specs, and even see some implementation demos from Margaret was really quite awesome. The specs are interesting to me because they seem to be oriented around the fact that our digital objects ultimately reside on some sort of hierarchical file-system. Filesystem APIs are fairly ubiquitous. In fact, as David Rosenthal has pointed out, some file systems are even designed to resist change. As Kunze said at PASIG in his talk “Permanent Objects, Evolving Services, and Disposable Systems: An Emergent Approach to Digital Curation Infrastructure”:

What is the thinnest smear of functionality that we can add to the filesystem so that it can act as an object storage system?

Approaches to building digital repository software thus far have been primarily aimed at software stacks (dspace, fedora, eprints) which offer particular services, or service frameworks. But the reality is that these systems come and go, and we are left with the bits. Why don’t we try to get the bits in shape so that they can be handed off easily in the relay from application to application, filesystem to filesystem? What is nice about the micro-services approach is that:

The services are compose-able, allowing digital curation activities to be emergent, rather than imposed by a pre-defined software architecture. Since I’ve been on a bit of a functional programming kick lately, I see compose-ability as a pretty big win.
The services are defined by short specifications, not software–so they are ideas instead of implementations. The specifications are clearly guided by ease of implementation, but ultimately they could be implemented in a variety of languages, and tools. Having a 2-3 page spec that defines a piece of functionality, and can be read by a variety of people, and implemented by different groups seems to be an ideal situation to strive for.
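
To make the compose-ability point a bit more concrete, here is a minimal Python sketch of two tiny services that work directly on a directory and can be chained together. To be clear, this is my own illustration and not an implementation of any CDL specification: the manifest file name, its line format, and the function names (write_manifest, verify_manifest, compose) are all made up for the example.

```python
import hashlib
from functools import reduce
from pathlib import Path

# Hypothetical manifest name and format; not taken from any CDL spec.
MANIFEST = "manifest-sha256.txt"


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_files(obj_dir: Path) -> list:
    """List every file under the object directory except manifest files."""
    return sorted(
        p for p in obj_dir.rglob("*") if p.is_file() and p.name != MANIFEST
    )


def write_manifest(obj_dir: Path) -> Path:
    """Service 1: record one '<digest>  <relative path>' line per file."""
    lines = (
        f"{checksum(p)}  {p.relative_to(obj_dir).as_posix()}\n"
        for p in payload_files(obj_dir)
    )
    (obj_dir / MANIFEST).write_text("".join(lines), encoding="utf-8")
    return obj_dir


def verify_manifest(obj_dir: Path) -> Path:
    """Service 2: raise ValueError if a file no longer matches its digest."""
    text = (obj_dir / MANIFEST).read_text(encoding="utf-8")
    for line in text.splitlines():
        expected, rel = line.split("  ", 1)
        if checksum(obj_dir / rel) != expected:
            raise ValueError(f"fixity failure: {rel}")
    return obj_dir


def compose(*services):
    """Chain services left to right; each takes and returns a directory."""
    return lambda obj_dir: reduce(lambda d, svc: svc(d), services, obj_dir)
```

Something like compose(write_manifest, verify_manifest)(Path("my-object")) would then run both steps over a directory (the path here is just a placeholder). The only state is files on disk, so any other tool, in any language, could pick up the same directory later in the relay.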

Everything Else Is Miscellaneous

Like I said, there was a ton of good content over the week…and it seems somewhat foolhardy to try to summarize it all in a single blog post. I tried to summarize the main themes I took home with me on the plane back to DC…but there were also lots of nuggets of ideas that came up in conversation, and in presentations that I want to at least jot down:

While archival storage may not be best served by HDFS, jobs like virus scanning huge web crawls are well suited to distributed computing environments like Hadoop. We need to be able to operate at this scale at loc.gov.
In Cliff Lynch’s summary wrap up for PASIG he indicated that people don’t talk so much about what we do when the inevitable happens, and bits are lost. The digital preservation community needs to share more statistics on bit loss, system failure modes, and software design patterns that let us build more sustainable storage systems.
Dave Tarrant’s presentation on “Where the Semantic Web and Web 2.0 meet format risk management: P2 registry” was a welcome revelation about the intersection of my interest in linked data and digital preservation. His presentation of the PRONOM format registry as linked data, and Kevin De Vorsey’s talk about “Obsolescence, Risk Management, and Preservation Planning at the National Library of New Zealand” made me think that it might be interesting to explore how the LC’s Digital Formats website could be delivered as linked data, and linked to something like PRONOM. David Pearson also suggested that collaborative wiki-spaces could be used by digital format specialists to collect information…which got me thinking of how a Semantic MediaWiki instance could be used in conjunction with Tarrant’s ideas. How easy would it be to use the web to build a distributed network of preservation information, as opposed to some p2p solution?
I want to learn more about the (w)arc data format, and perhaps contribute to some of the existing code bases for working w/ (w)arc. I’m particularly interested in using harvesting tools and WARC to preserve linked data…which I believe some of the Sindice folks have worked on for their bot.
It’s long past time I understood how LOCKSS works as a technology. It was mentioned as the backbone of several projects during the week. I even overheard some talk about establishing rogue LOCKSS networks, which of course piqued my interest even more.
It would be fun to put a jython or jruby web front end on DROID for format identification, but it seems that Carol Chou of the Florida Center for Library Automation has already done something similar. Still, it would be neat to at least try it out, and perhaps have it conneg to Dave’s P2 registry or PRONOM.

Ok, braindump complete. Thanks for reading this far!