the digital repository marketplace

The University of Southern California recently announced its Digital Repository (USCDR), a joint venture between the Shoah Foundation Institute and the university. The site is quite an impressive brochure describing the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating its own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go to the effort (and cost) of building one out itself. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think it is going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging, this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill-advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out one technology for another. If the particular approach to cataloging built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available (what Greg Janée calls relay-supporting archives)? I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need to build sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee-based curation services to the University of California (which USC is not part of).

Incidentally, the service costs of the USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed at $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem to indicate the raw storage is a bit less than Merritt, at $840 per TB per year. But what the preservation services are, and how the 20-year cost would be applied over a growing collection of content, is unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside the USCDR CDN. In that case the $70/TB measures up quite nicely against a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does the USCDR really think it can compete in the cloud storage space?
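If you want to follow my back-of-the-envelope math, here it is spelled out, using only the published figures quoted above (the even spreading of the 20-year preservation fee is my own naive assumption):

# rough cost comparison, using the published figures quoted above

merritt_per_tb_year = 1040.00           # Merritt: $1040/TB/year

uscdr_disk_per_tb_month = 70.00         # USCDR disk-based file-server access
uscdr_disk_per_tb_year = 12 * uscdr_disk_per_tb_month
print("USCDR disk storage: $%.2f/TB/year vs Merritt $%.2f/TB/year" %
      (uscdr_disk_per_tb_year, merritt_per_tb_year))          # 840.00 vs 1040.00

# the naive reading of the preservation fee: spread evenly over 20 years
uscdr_preservation_per_tb_year = 1000.00 / 20
print("USCDR preservation: $%.2f/TB/year" % uscdr_preservation_per_tb_year)  # 50.00

# if $70/TB/month is really a transfer charge, compare it to S3 transfer out
s3_transfer_out_per_tb_month = 120.51
print("transfer out: USCDR $%.2f/TB vs S3 $%.2f/TB per month" %
      (uscdr_disk_per_tb_month, s3_transfer_out_per_tb_month))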

Based on the current pricing models, where there are no access-driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously, I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in and not much comes back out for access.

simplicity and digital preservation, sorta

Over on the Digital Curation discussion list Erik Hetzner of the California Digital Library raised the topic of simplicity as it relates to digital preservation, and specifically to CDL’s notion of Curation Microservices. He referenced a recent bit of writing by Martin Odersky (the creator of Scala) with the title Simple or Complicated. In one of the responses Brian Tingle (also of CDL) suggested that simplicity for an end user and simplicity for the programmer are often inversely related. My friend Kevin Clarke prodded me in #code4lib into turning my response to the discussion list into a blog post, so here it is (slightly edited).

For me, the Odersky piece is a really nice essay on why simplicity is often in the eye of the beholder. Often the key to simplicity is working with people who see things in roughly the same way: people who have similar needs that are met by particular approaches and tools. Basically, a shared and healthy culture makes emergent complexity palatable.

Brian made the point about simplicity for programmers having an inversely proportional relationship to simplicity for end users, or in his own words:

I think that the simpler we make it for the programmers, usually the more complicated it becomes for the end users, and visa versa.

I think the only thing to keep in mind is that the distinction between programmers and end users isn’t always clear.

As a software developer I’m constantly using, or inheriting, someone else’s code: be it a third-party library that I depend on, or a piece of software written once upon a time by somebody who has since moved on. In both cases I’m effectively an end user of a program that somebody else designed and implemented. The interfaces and abstractions that this developer has chosen are the things I (as an end user) need to be able to understand and work with. Ultimately, I think it’s easier to keep software usable for end users (of whatever flavor) by keeping the software design itself simple.

Simplicity makes the software easier to refactor over time when the inevitable happens, and someone wants some new or altered behavior. Simplicity also should make it clear when a suggested change to a piece of software doesn’t fit the design of the software in question, and is best done elsewhere. One of the best rules of thumb I’ve encountered over the years to help get to this place is the Unix Philosophy:

Write programs that do one thing and do it well. Write programs to work together.

As has been noted elsewhere, composability is one of the guiding principles of the Microservices approach–and it’s why I’m a big fan (in principle). Another aspect to the Unix philosophy that Microservices seems to embody is:

Data dominates.

The software can (and will) come and go, but we are left with the data. That’s the reality of digital preservation. It could be argued that the programs themselves are data, which gets us into sci-fi virtualization scenarios. Maybe someday, but I personally don’t think we’re there yet.

Another approach I’ve found that works well to help ensure code simplicity is unit testing. Admittedly it’s a bit of a religion, but at the end of the day, writing tests for your code encourages you to use the APIs, interfaces, and abstractions you are creating, so you notice sooner when things don’t make sense. And of course the tests let you refactor with a safety net when the inevitable changes rear their heads.
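Here’s a toy illustration of what I mean (the md5sum function is just something I made up for the example, the kind of utility a manifest-writing tool needs). The point is that the test makes you call your own function the way a user would, which is exactly where an awkward interface shows itself early:

import hashlib
import os
import tempfile
import unittest

def md5sum(path, chunk_size=8192):
    """Return the hex md5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

class Md5SumTest(unittest.TestCase):

    def test_known_digest(self):
        # exercise the function the way a caller would
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(b"hello")
        self.addCleanup(os.remove, f.name)
        self.assertEqual(md5sum(f.name),
                         "5d41402abc4b2a76b9719d911017c592")

if __name__ == "__main__":
    unittest.main()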

And, another slightly more humorous way to help ensure simplicity:

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.

Which leads me to a jedi mind trick my former colleague Keyser Söze Andy Boyko tried to teach me (I think): it’s useful to know when you don’t have to write any code at all. Sometimes existing code can be used in a new context. And sometimes the perceived problem can be recast, or examined from a new perspective, in a way that makes the problem go away.

I’m not sure what all this has to do with digital preservation. The great thing about what CDL is doing with microservices is that they are trying to focus on the *what*, and not the *how*, of digital preservation. Whatever ends up happening with the implementation of Merritt itself, I think they are discovering what the useful patterns of digital preservation are, trying them out, and documenting them. It’s incredibly important work that I don’t really see happening much elsewhere.

version control and digital curation

For some time now I have been meaning to write about some of the issues related to version control in repositories, as they relate to some projects going on at $work. Most repository systems have a requirement to maintain original data as submitted. But as we all know, this content often changes over time, sometimes immediately. Change is in the very nature of digital preservation, as archaic formats are migrated to fresher, more usable ones and the wheels of movage keep turning. At the same time it’s essential that when content is used and cited, the specific version in time is what gets cited, or as Cliff Lynch is quoted as saying in the Attributes of a Trusted Repository:

It is very easy to replace an electronic dataset with an updated copy, and … the replacement can have wide-reaching effects. The processes of authorship … produce different versions which in an electronic environment can easily go into broad circulation; if each draft is not carefully labeled and dated it is difficult to tell which draft one is looking at or whether one has the “final” version of a work.

For example, at $work we have a pilot project to process journal content submitted by publishers. We don’t have the luxury of telling them exactly what to submit (packaging formats, XML schemas, file formats, etc.), but on receipt we want to normalize the content for downstream applications by bagging it up, and then extracting some of the content into a standard location.

So a publisher makes an issue of a journal available for pickup via FTP:

adi_v2_i2
|-- adi_015.pdf
|-- adi_015.xml
|-- adi_018.pdf
|-- adi_018.xml
|-- adi_019.pdf
|-- adi_019.xml
|-- adi_v2_i2_ofc.pdf
|-- adi_v2_i2_toc.pdf
`-- adi_v2_i2_toc.xml

The first thing we do is bag up the content, to capture what was retrieved (manifest + checksums) and stash away some metadata about where it came from.

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   `-- adi_v2_i2_toc.xml
`-- manifest-md5.txt
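This step is easy to automate. A minimal sketch using the bagit Python library looks something like the following (the metadata values are made up, and the exact arguments depend a bit on which version of the library you have):

import bagit

# bag the directory in place: the files are moved into data/ and
# bagit.txt, bag-info.txt and a manifest are written alongside
bag = bagit.make_bag("adi_v2_i2", {
    # made-up metadata values, just to show where provenance info goes
    "Source-Organization": "Example Publisher",
    "External-Description": "journal issue adi_v2_i2, retrieved via FTP",
}, checksums=["md5"])   # recent versions default to sha256/sha512

# sanity check the checksums after the fact
print(bag.is_valid())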

Next we run some software that knows about the particularities of this publisher’s content, and persists it into the bag in a predictable, normalized way:

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   |-- adi_v2_i2_toc.xml
|   `-- eco
|       `-- articles
|           |-- 1
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           |-- 2
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           `-- 3
|               |-- article.pdf
|               |-- meta.json
|               `-- ocr.xml
`-- manifest-md5.txt
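The actual software is specific to this publisher, so the following is only a hypothetical sketch of its shape: the eco/articles layout is the real target shown above, but the file-pairing logic and the meta.json contents are invented for illustration.

import glob
import json
import os
import shutil

def normalize(bag_dir):
    """Copy each article's pdf/xml pair into a predictable location
    under data/eco/articles/<n>/ inside an existing bag."""
    data = os.path.join(bag_dir, "data")
    articles = os.path.join(data, "eco", "articles")
    # the cover and table-of-contents files aren't articles
    skip = ("_toc", "_ofc")
    pdfs = sorted(p for p in glob.glob(os.path.join(data, "*.pdf"))
                  if not any(s in p for s in skip))
    for n, pdf in enumerate(pdfs, 1):
        target = os.path.join(articles, str(n))
        os.makedirs(target)
        shutil.copy(pdf, os.path.join(target, "article.pdf"))
        shutil.copy(pdf.replace(".pdf", ".xml"),
                    os.path.join(target, "ocr.xml"))
        # metadata extraction is publisher-specific; just record provenance here
        meta = {"source_pdf": os.path.basename(pdf)}
        with open(os.path.join(target, "meta.json"), "w") as f:
            json.dump(meta, f, indent=2)

normalize("adi_v2_i2")

In practice the bag manifest also needs to be regenerated afterwards, so that the checksums cover the new files.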

The point of this post isn’t the particular way these changes were layered into the filesystem; it could be done in a multitude of other ways (METS, OAI-ORE, MPEG-21 DIDL, FOXML, etc.). The point is rather that the data has undergone two transformations very soon after it was obtained. And if we are successful, it will undergo many more as it is preserved over time. The advantage of layering content into the filesystem like this is that it makes some attempt to preserve the data by disentangling it from the code that operates on it, code which will hopefully be swapped out at some point. At the same time we want to be able to get back to the original data, and also to roll back changes that inadvertently corrupted or destroyed portions of the data. Shit happens.

So the way I see it we have three options:

  • Ignore the problem and hope it will go away
  • Create full copies of the content, and link them together
  • Version the content

Option 1 is tongue in cheek, but it is sometimes reality. Perhaps your repository content is such that it never changes. Or if it does, the changes happen in systems that are downstream from your repository. The overhead of managing versions isn’t something that you need to worry about…yet. I’ve been told that DSpace doesn’t currently handle versioning, and that they are looking to the upcoming Fedora integration to provide versioning options.

In option 2, version control is achieved by copying the files that make up a repository object, making modifications to the relevant files, and then linking the new copy to the older copy. We currently do this in our internal handling of content in the National Digital Newspaper Program, where we work at what we call the batch level: essentially a directory of content shipped to the Library of Congress, containing a bunch of JP2, XML, PDF, and TIFF files. When we have accepted a batch and later discover a problem, we version the batch by creating a copy of the directory. When service copies of the batches are 50G or so, these full copies can add up. Sometimes, just to make a few small modifications to some of the XML files, we end up with what are largely redundant copies of multi-gigabyte directories. Disk is cheap, as they say, but it makes me feel a bit dirty.

Option 3 is to leverage some sort of version control system. One of the criteria I have for a version control system for data is that the versioned content should not be dependent on a remote server. The reason is that I want to be able to push the versioned content around to other systems, tape backup, etc., and not have it become stale because some server configuration happened to change at some point. So in my mind, subversion and cvs are out. Revision control systems like rcs, git, mercurial, and bazaar, on the other hand, are possibilities. Some of the nice things about using git or mercurial:

  • they are very active projects, so there are lots of eyes on the code fixing bugs and making enhancements
  • the content repositories are distributed, so they allow content to migrate into other contexts
  • developers who come to inherit content will be familiar with using them
  • you get audit logs for free
  • they include functionality to publish content on the web, and to copy or clone repository content

But there are trade-offs with using git, mercurial, etc., since these are quite complex pieces of software. Changes in them sometimes trigger backwards-incompatible changes in repository formats, which require upgrading the repository. Fortunately there are tools for these upgrades, but one must stay on top of them. These systems also have a tendency to see periods of active use, followed by an exodus to a newer, shinier version control system. But there are usually tools to convert your repositories from the old to the new.

To anyone familiar with digital preservation this situation should seem pretty familiar: format migration, or more generally data migration.

The California Digital Library has recognized the need for version control, but has taken a slightly different approach. CDL created a specification for a simplified revision control system called Reverse Directory Deltas (ReDD), and built an implementation into its Storage Service. The advantage ReDD has is that the repository format is quite a bit more transparent, and less code-dependent, than repository formats like git’s or mercurial’s. It is also a specification that CDL manages and can change at will, rather than being at the mercy of an external open source project. As an exercise about a year ago I wanted to see how easy ReDD was to use by implementing it. I ended up calling the tool dflat, since ReDD fits in with CDL’s D-flat filesystem convention.

In the pilot project I mentioned above I started out wanting to use ReDD. I didn’t really want to use all of D-flat, because it did a bit more than I needed, and I thought I could refactor the ReDD bits of my dflat code into a ReDD library and then use it to version the journal content. But in looking at the code a year later I found myself wondering about the trade-offs again. Mostly I wondered about the bugs that might be lurking in code that nobody else was really using. And I thought about new developers who might work on the project, and who wouldn’t know what the heck ReDD was or why I wasn’t using something more common.

So I decided to be lazy, and use mercurial instead.

I figured this was a pilot project, so what better way to get a handle on the issues of using distributed revision control for managing data? The project was already using the Python programming language extensively, and Mercurial is written in Python, so it seemed like a logical choice. I also began to think of the problem of upgrading software and repository formats as another facet of the format migration problem…which we wanted to have a story for anyway.
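In practice, versioning a bag with mercurial amounts to very little code. Here is a rough sketch of the pattern I had in mind, just shelling out to hg rather than using mercurial’s internal APIs (it assumes hg is installed and a username is configured, and it puts the repository at the top of the bag directory):

import subprocess

def hg(repo, *args):
    """Run an hg command against the given repository directory."""
    subprocess.check_call(["hg", "-R", repo] + list(args))

bag = "adi_v2_i2"

# put the bag under version control and record the initial state
subprocess.check_call(["hg", "init", bag])
hg(bag, "addremove")
hg(bag, "commit", "-m", "original content as retrieved from publisher")

# ... normalization runs, adding data/eco/articles/ ...

hg(bag, "addremove")
hg(bag, "commit", "-m", "normalized article content")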

I wanted to start a conversation about some of these issues at Curate Camp, so I wrote a pretty simplistic tool that compares git, mercurial, and dflat/redd in terms of the time it takes to initialize a repository and commit changes to an arbitrary set of content in a directory called ‘data’. For example, I ran it against a gigabyte of mp3 content, and it generated the following output:

ed@curry:~/Projects$ git clone git://github.com/edsu/versioning-metrics.git
ed@curry:~/Projects/versioning-metrics$ cp ~/Music data
ed@curry:~/Projects/versioning-metrics$ ./compare.py 

create repository times
git        0:02:16.842117
hg         0:02:48.397038
dflat      0:01:21.196579

repository size
git        1233106981
hg         1287443209
dflat      1347108177

checkout sizes
git        2580197304
hg         2634533532
dflat      2694216030

commit one byte change to all files
git        0:01:03.694969
hg         0:00:48.456455
dflat      0:00:22.926949

repository sizes after commit
git        1233112423
hg         1287450930
dflat      1347166431
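The actual tool is on GitHub, as the transcript above shows, but the core of the measurement is nothing more than timing a few shell commands and adding up file sizes. Roughly something like this (a simplified sketch, not the real compare.py):

import datetime
import os
import subprocess

def tree_size(path):
    """Total size in bytes of all files under a directory."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def timed(*cmd):
    """Run a command and return how long it took as a timedelta."""
    start = datetime.datetime.now()
    subprocess.check_call(list(cmd))
    return datetime.datetime.now() - start

# e.g. how long mercurial takes to put a directory of content under version control
print("create repository time: %s" % timed("hg", "init", "data"))
print("initial commit time: %s" % timed("hg", "-R", "data", "commit", "-A",
                                        "-m", "initial import"))
print("repository size: %s" % tree_size("data/.hg"))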

Anyhow this is just the start of some analysis. I’d like to see how it performs on text content where there are probably some compression gains to be had in the repository size. I think this also demonstrates how it might not be feasible to version large amounts of content as a synchronous operation in a web application. Hopefully this will be a useful seed for a discussion at Curate Camp. I don’t want to pretend to have all the answers here, and want to have more conversations about approaches within the digital curation community.