
fido test suite

I work in a digital preservation group at the Library of Congress where we do a significant amount of work in Python. Lately, I’ve been spending some time with OpenPlanets’ FIDO utility, mainly to see if I could refactor it so that it’s a bit easier to use as a module from other Python applications. At the moment FIDO is designed to be used from the command line. This work involved more than a little bit of refactoring, and the more I looked at the code, the more it became clear that a test suite would be useful to have as a safety net.

Conveniently, I also happened to have been reading a recent report from the National Library of Australia on File Characterization Tools, which in addition to talking about FIDO, pointed me at the govdocs1 dataset. Govdocs1 is a dataset of 1 million files harvested from the .gov domain by the NSF-funded Digital Corpora project. The data was collected to serve as a public domain corpus for forensics tools to use as a test bed. I thought it might be useful to survey the filenames in the dataset, and cherry-pick examples of particular formats for use in my FIDO test suite.

So I wrote a little script that crawled all the filenames, and kept track of file extensions used. Here are the results:

extension count
pdf 232791
html 191409
jpg 109281
txt 84091
doc 80648
xls 66599
ppt 50257
xml 41994
gif 36301
ps 22129
csv 18396
gz 13870
log 10241
eps 5465
png 4125
swf 3691
pps 1629
kml 995
kmz 949
hlp 660
sql 632
dwf 474
java 323
pptx 219
tmp 196
docx 169
ttf 104
js 92
pub 76
bmp 75
xbm 51
xlsx 46
jar 34
zip 27
wp 17
sys 8
dll 7
exported 5
exe 5
tif 3
chp 2
pst 1
squeak 1
data 1
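
A survey like the one above can be sketched in a few lines of Python. This is not the script I actually ran, just a minimal illustration: the function names are made up for this example, and it assumes the dataset has been unpacked into a directory tree on local disk.

```python
import os
from collections import Counter


def count_extensions(root):
    """Walk a directory tree and tally the file extensions found in it."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext("report.PDF") -> ("report", ".PDF"); normalize to "pdf"
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            # Files without any extension are grouped under a placeholder key
            counts[ext or "(none)"] += 1
    return counts


def print_report(counts):
    """Print extensions from most to least common, one per line."""
    print("extension count")
    for ext, n in counts.most_common():
        print(ext, n)
```

Calling `print_report(count_extensions("/path/to/govdocs1"))` (the path is a placeholder) would produce a listing in the same shape as the one above.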

With this list in hand, I downloaded an example of each file extension, ran it through the current release of FIDO, and used the output to generate a test suite for my new refactored version. Interestingly, two tests fail:

Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 244, in test_pst
    self.assertEqual(i.puid, "x-fmt/249")
AssertionError: 'x-fmt/248' != 'x-fmt/249'

======================================================================
FAIL: test_pub (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 260, in test_pub
    self.assertEqual(i.puid, "x-fmt/257")
AssertionError: 'x-fmt/252' != 'x-fmt/257'

I’ll need to dig in to see what could be different between the two versions that would confuse x-fmt/248 with x-fmt/249 and x-fmt/252 with x-fmt/257. Perhaps it is related to Dave Tarrant’s recent post about how FIDO’s identification patterns have flip-flopped in the past.
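
The general approach (record what the current release reports for each sample file, then assert that the refactored code agrees) can be sketched as follows. This is only an illustration under stated assumptions: `identify` is a hypothetical stand-in for whatever function returns an identification result, not FIDO's actual API, and the only thing assumed about the result is that it has a `puid` attribute, as in the assertions shown in the tracebacks above.

```python
import io
import unittest


def make_test(identify, path, expected_puid):
    """Build one test method that checks a single sample file."""
    def test(self):
        i = identify(path)  # hypothetical identification call
        self.assertEqual(i.puid, expected_puid)
    return test


def build_test_case(identify, baseline):
    """Create a TestCase class with one test_<extension> method per sample.

    baseline maps an extension to (sample_path, puid_from_current_release).
    """
    methods = {}
    for ext, (path, puid) in baseline.items():
        methods["test_" + ext] = make_test(identify, path, puid)
    return type("FidoTests", (unittest.TestCase,), methods)


def run(case):
    """Run the generated tests quietly and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite)
```

Any identification that differs from the recorded baseline then shows up in `result.failures`, which is how discrepancies like the two above surface.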

You may have noticed that I’m linking the PUIDs to Andy Jackson’s PRONOM Prototype Registry (built in 6 days with Drupal) instead of the official PRONOM registry. I did this because a Google search for the PRONOM identifier (PUID) pulled up a nice detail page for the format in Andy’s prototype, and it doesn’t seem possible (at least in the 5 minutes I tried) to link directly to a file format record in the official PRONOM registry. I briefly tried the Linked Data prototype, but it proved difficult to search for a given PUID (server errors, the unforgiving glare of SPARQL query textareas, etc.).

I hope OpenPlanets and/or the National Archives give Andy’s Drupal experiment a fair shake. Getting a functional PRONOM registry running in 6 days with an open source toolkit like Drupal definitely seems more future-proof than spending years with a contractor only to get closed source code. The Linked Data prototype looks promising, but as the recent final report on the Unified Digital Format Registry project highlights, choosing to build on a semantic web stack has its risks compared with more mainstream web publishing frameworks or content management systems like Drupal. PRONOM just needs an easy way for digital preservation practitioners to collaboratively update the registry, and for each format to have a unique URL that uses the PUID. My only complaint is that Andy’s prototype seemed to advertise RDF/XML in the HTML, but the advertised URL seemed to return an empty RDF document. For example, the HTML at http://beta.domd.info/pronom/x-fmt/248 has a <link> that points at http://beta.domd.info/node/1303/rdf.

I admit I am a fan of linked data, or being able to get machine readable data back (RDFa, Microdata, JSON, RDF/XML, XML, etc.) from Cool URLs. But triplestores and SPARQL don’t seem to be terribly important things for PRONOM to have at this point. And if they are there under the covers, there’s no need to confront the digital preservation practitioner with them. My guess is that practitioners want an application that lets them work with their peers to document file formats, not a new query or ontology language to learn. Perhaps Jason Scott’s Just Solve the Problem effort in October will be a good kick in the pants to mobilize grassroots community work around digital formats.

Meanwhile, I’ve finished up the FIDO API changes and the test suite enough to have submitted a pull request to OpenPlanets. My fork of the OpenPlanets repository is also on GitHub. I’m not really holding my breath waiting for it to be accepted, as it represents a significant change, and they have their own published roadmap of work to do. But I am hopeful that they will recognize the value in having a test suite as a safety net as they change and refactor FIDO going forward. Otherwise I guess it could be the beginnings of a fido2, but I would like to avoid that particular future.

Update: after posting this Ross Spencer tweeted me some instructions for linking to PRONOM

Maybe I missed it, but PRONOM could use a page that describes this.

day of digital archives psa

Today is the Day of Digital Archives and I had this semi-thoughtful post written up about BagIt and how it’s a brain-dead simple format to use to package up your files so that you’ll know if you still have them 5 minutes, 5 hours, 5 days, 5 years, maybe even 5 decades from now–if the notion of directories and files persists that long.

But I deleted that…you’re welcome…

I was also going to write about how, in a fit of web performance art, Mark Pilgrim recently deleted his online presence, including various extremely useful open source tools and several popular online books, only to see them re-materialize on the Web at new locations.

But I deleted most of that too…you’re welcome again!

Here’s a public service announcement instead. If you happen to use Franco Lazzarino’s Ruby BagIt Library to create bags that contain largish files (> 500MB), you might have accidentally created bad SHA1 manifests. I added a test, and fixed the bug with help from Mark Matienzo and Michael Klein, and sent a pull request. It hasn’t been applied yet, so here’s to hoping it will.

At $mpow we’ve been getting terabytes of data from this social media company that has been bagging their data using this Ruby library. Many of the files are multi-gigabyte, gzip-compressed files. And many of the bags now have bad SHA1 manifests. The social media company wasn’t sure what the problem was, and told us just to ignore the SHA1 manifests. Which is easy enough to do.
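
If you receive bags and would rather check the manifests than ignore them, here is a minimal Python sketch of verifying a `manifest-sha1.txt`. It says nothing about what went wrong in the Ruby library. It only shows one way to hash large files in fixed-size chunks so that file size doesn't matter, and the function names are made up for this example. It assumes the usual manifest layout of one `checksum filepath` pair per line, with paths relative to the bag directory.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute a SHA1 hex digest by reading the file one chunk at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, manifest_name="manifest-sha1.txt"):
    """Return (path, expected, actual) for every entry that does not match."""
    mismatches = []
    manifest_path = os.path.join(bag_dir, manifest_name)
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # Split once on whitespace so paths containing spaces survive
            expected, relpath = line.split(None, 1)
            actual = sha1_of_file(os.path.join(bag_dir, relpath))
            if actual != expected.lower():
                mismatches.append((relpath, expected, actual))
    return mismatches
```

An empty list back from `verify_manifest` means every listed file matched. A file that is listed but missing will raise `FileNotFoundError`, which this sketch makes no attempt to handle.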

It seems like no matter how simple the spec, it’s easy to create bugs. If you create bags, throw Bag-Software-Agent into your bag-info.txt…you never know who might find it useful.

the digital repository marketplace

The University of Southern California recently announced its Digital Repository (USCDR) which is a joint venture between the Shoah Foundation Institute and the University of Southern California. The site is quite an impressive brochure that describes the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating their own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go through the effort (and cost) of building out one themselves. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think they are going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill-advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out various technologies for other ones. If the particular approach to cataloging built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available (what Greg Janée calls relay supporting archives)? I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need for building sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee-based curation services to the University of California (which USC is not part of).

Incidentally the service costs of USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed as $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem to indicate the raw storage is a bit less than Merritt, at $840.00 per TB per year ($70 × 12 months). But what the preservation services are, and how the 20 year cost would be applied over a growing collection of content, seems unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside their USCDR CDN. In that case the $70/TB compares quite favorably with a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does USCDR really think it can compete in the cloud storage space?
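
For what it’s worth, here is the arithmetic behind that comparison as a small Python sketch. The dollar figures are the ones quoted above. Spreading the $1000 preservation fee evenly over the 20 years is purely my assumption for illustration, since (as I said) it isn’t clear how that fee is actually applied.

```python
# Prices quoted above, in US dollars per terabyte
MERRITT_PER_TB_YEAR = 1040.00
USCDR_DISK_PER_TB_MONTH = 70.00
USCDR_PRESERVATION_PER_TB_20_YEARS = 1000.00


def uscdr_disk_per_year():
    """Annualize the monthly disk-based file-server access price."""
    return USCDR_DISK_PER_TB_MONTH * 12


def uscdr_total_per_year(years=20):
    """Disk price plus the preservation fee spread evenly over the term.

    ASSUMPTION: the preservation fee is a flat per-TB charge amortized
    evenly across the period; the actual pricing terms are not known.
    """
    return uscdr_disk_per_year() + USCDR_PRESERVATION_PER_TB_20_YEARS / years


if __name__ == "__main__":
    print("USCDR disk only, per TB per year:", uscdr_disk_per_year())
    print("USCDR with amortized preservation:", uscdr_total_per_year())
    print("Merritt, per TB per year:", MERRITT_PER_TB_YEAR)
```

Under that assumption the USCDR still comes in below Merritt per terabyte per year, but the result depends entirely on how the 20 year fee is charged.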

Based on the current pricing models, where there are no access-driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously, I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in, and not much data comes back out for access.