Tag Archives: preservation

the digital repository marketplace

The University of Southern California recently announced its Digital Repository (USCDR), a joint venture between the Shoah Foundation Institute and the University of Southern California. The site is quite an impressive brochure that describes the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating their own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go through the effort (and cost) of building out one themselves. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think they are going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out various technologies for other ones. If the particular approach to cataloging built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available (what Greg Janée calls relay-supporting archives)? I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need for building sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee-based curation services to the University of California (which USC is not part of).

Incidentally, the service costs of USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed as $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem to indicate that the raw storage is a bit cheaper than Merritt, at $840.00 per TB per year ($70 × 12). But what the preservation services are, and how the 20 year cost would be applied over a growing collection of content, seems unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside their USCDR CDN. In that case the $70/TB compares quite favorably with a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does USCDR really think it can compete in the cloud storage space?
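To make the comparison a little easier to follow, here is a minimal sketch that puts the figures quoted above on a per-terabyte, per-year footing. The prices come straight from the paragraph; the helper names are made up for illustration, and spreading the $1000 preservation fee evenly over 20 years is purely an assumption, since it is not clear how that fee is actually applied.

```python
# Figures quoted in the post, in US dollars per terabyte.
MERRITT_PER_TB_YEAR = 1040.00
USCDR_DISK_PER_TB_MONTH = 70.00
USCDR_PRESERVATION_PER_TB_20_YEARS = 1000.00
S3_TRANSFER_OUT_PER_TB_MONTH = 120.51


def annualize_monthly(rate_per_month):
    """Convert a monthly per-TB rate into a yearly per-TB rate."""
    return rate_per_month * 12


def annualize_flat_fee(fee, years):
    """ASSUMPTION: spread a one-time fee evenly across its term."""
    return fee / years


uscdr_disk_per_year = annualize_monthly(USCDR_DISK_PER_TB_MONTH)
uscdr_preservation_per_year = annualize_flat_fee(
    USCDR_PRESERVATION_PER_TB_20_YEARS, 20
)

print(f"Merritt:                  ${MERRITT_PER_TB_YEAR:.2f} per TB per year")
print(f"USCDR disk access:        ${uscdr_disk_per_year:.2f} per TB per year")
print(f"USCDR preservation (avg): ${uscdr_preservation_per_year:.2f} per TB per year")
print(f"Merritt minus USCDR disk: ${MERRITT_PER_TB_YEAR - uscdr_disk_per_year:.2f}")
```

None of this settles whether the two services are really comparable, since neither price appears to vary with how often the content is accessed.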

Based on the current pricing models, where there are no access-driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously, I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in, and not much data comes back out for access.

geeks bearing gifts

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

  • Create a persistent identifier: DOI or ARK
  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depends upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into a web browser (browsers are now ubiquitous) and view a document that is about the identified resource.

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging community call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forwards. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks; namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As another crank (I mean this with all due respect) Ted Nelson put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.
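To give a rough idea of what managing a URL namespace could look like in practice, here is a minimal sketch in which a stable public path maps to wherever the object currently lives, and the same path can also return metadata about the resource. Everything in it (the paths, the example.org URLs, the `resolve` function, and the metadata fields) is hypothetical and only meant for illustration; it is not how EZID or any particular repository works.

```python
import json

# Hypothetical namespace table: stable public path -> current record.
# When storage moves, only "location" changes; the public URL stays the same.
NAMESPACE = {
    "/items/123": {
        "location": "https://storage.example.org/2011/batch-7/123.tif",
        "metadata": {"title": "Example photograph", "creator": "Unknown"},
    },
}


def resolve(path, accept="text/html"):
    """Return (status, headers, body) for a request to a persistent URL."""
    record = NAMESPACE.get(path)
    if record is None:
        return 404, {}, ""
    if accept == "application/json":
        # Serve metadata about the resource at the same URL.
        body = json.dumps(record["metadata"], sort_keys=True)
        return 200, {"Content-Type": "application/json"}, body
    # Otherwise redirect to the object's current location.
    return 302, {"Location": record["location"]}, ""
```

A real deployment would sit behind an ordinary web server or framework. The point is only that the mapping is something the institution itself owns and can carry forward when the underlying systems change.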

digital-curation

Some folks at LC and CDL are trying to kick-start a new public discussion list for talking about digital curation in its many guises: repositories, tools, standards, techniques, practices, etc. The intuition is that there is a social component to the problems of digital preservation and repository interoperability.

Of course NDIIPP (the arena for the CDL/LC collaboration) has always been about building and strengthening a network of partners. But as Priscilla Caplan points out in her survey of the digital preservation landscape Ten Years After, organizations in Europe like the JISC and NESTOR seem to have understood that there is an educational component to digital preservation as well. Yet even the JISC and NESTOR have tended to focus more on the preservation of scholarly output, whereas digital preservation really extends beyond that realm of materials.

Continually sharing good ideas and hard-won knowledge about digital curation, and building a network of colleagues and experts that extends past the normal project- and institution-specific boundaries, is just as important as building the collections and the technologies themselves.

So I guess this is a rather highfalutin goal … here’s some text stolen from the digital-curation home page to give you more of a flavor:

The digital preservation and repositories domain is fortunate to have a diverse set of institutional and consortial efforts, software projects, and standardization initiatives. Many discussion lists have been created for these individual efforts. The digital-curation discussion list is intended to be a public forum that encourages cross-pollination across these project and institutional boundaries in order to foster wider awareness of project- and institution-specific work and encourage further collaboration.

Topics of conversation can include (but are not limited to):

  • digital repository software (Fedora, DSpace, EPrints, etc.)
  • management of digital formats (JHOVE, djatoka, etc.)
  • use and development of standards (OAIS, OAI-PMH/ORE, MPEG21, METS, BagIt, etc.)
  • issues related to identifiers, packaging, and data transfer
  • best practices and case studies around curation and preservation of digital content
  • repository interoperability
  • conference, workshop, tutorial announcements
  • recent papers
  • job announcements
  • general chit chat about problems, solutions, itches to be scratched
  • humor and fun

We’ll see how it goes. If you are at all interested please sign up.