fido test suite

I work in a digital preservation group at the Library of Congress where we do a significant amount of work in Python. Lately, I’ve been spending some time with OpenPlanet’s FIDO utility, mainly to see if I could refactor it so that it’s a bit easier to use as a Python module, for use in other Python applications. At the moment FIDO is designed to be used from the command line. This work involved more than a little bit of refactoring, and the more I looked at the code, the more it became clear that a test suite would be useful to have as a safety net.

Conveniently, I also happened to have been reading a recent report from the National Library of Australia on File Characterization Tools, which in addition to talking about FIDO, pointed me at the govdocs1 dataset. Govdocs1 is a dataset of 1 million files harvested from the .gov domain by the NSF funded Digital Corpora project. The data was collected to serve as a public domain corpus for forensics tools to use as a test bed. I thought it might be useful to survey the filenames in the dataset, and cherry pick out formats of particular types for use in my FIDO test suite.

So I wrote a little script that crawled all the filenames, and kept track of file extensions used. Here are the results:

extension	count
pdf	232791
html	191409
jpg	109281
txt	84091
doc	80648
xls	66599
ppt	50257
xml	41994
gif	36301
ps	22129
csv	18396
gz	13870
log	10241
eps	5465
png	4125
swf	3691
pps	1629
kml	995
kmz	949
hlp	660
sql	632
dwf	474
java	323
pptx	219
tmp	196
docx	169
ttf	104
js	92
pub	76
bmp	75
xbm	51
xlsx	46
jar	34
zip	27
wp	17
sys	8
dll	7
exported	5
exe	5
tif	3
chp	2
pst	1
squeak	1
data	1

With this list in hand, I downloaded an example of each file extension, ran it through the current release of FIDO, and used the output to generate a test suite for my new refactored version. Interestingly, two tests fail:

Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 244, in test_pst
    self.assertEqual(i.puid, "x-fmt/249")
AssertionError: 'x-fmt/248' != 'x-fmt/249'

======================================================================
FAIL: test_pub (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 260, in test_pub
    self.assertEqual(i.puid, "x-fmt/257")
AssertionError: 'x-fmt/252' != 'x-fmt/257'

I’ll need to dig in to see what could be different between the two versions that would confuse x-fmt/248 with x-fmt/249 and x-fmt/252 with x-fmt/257. Perhaps it is related to Dave Tarrant’s recent post about how FIDO’s identification patterns have flip flopped in the past.

You may have noticed that I’m linking the PUIDs to Andy Jackson’s PRONOM Prototype Registry (built in 6 days with Drupal) instead of the official PRONOM registry. I did this because a Google search for the PRONOM identifier (PUID) pulled up a nice detail page for the format in Andy’s prototype, and it doesn’t seem possible (at least in the 5 minutes I tried) to link directly to a file format record in the official PRONOM registry. I briefly tried the Linked Data prototype, but it proved difficult to search for a given PUID (server errors, the unforgiving glare of SPARQL query textareas, etc).

I hope OpenPlanets and/or the National Archives give Andy’s Drupal experiment a fair shake. Getting a functional PRONOM registry running in 6 days with an opensource toolkit like Drupal definitely seems more future proof than spending years with a contractor only to get closed source code. The Linked Data prototype looks promising, but as the recent final report on the Unified Digital Format Registry project highlights, choosing to build on a semantic web stack has its risks compared with more mainstream web publishing frameworks or content management systems like Drupal. PRONOM just needs an easy way for digital preservation practitioners to be able to collaboratively update the registry, and for each format to have a unique URL that uses the PUID. My only complaint is that Andy’s prototype seemed to advertise RDF/XML in the HTML, but it seemed to return an empty RDF document, for example the HTML at http://beta.domd.info/pronom/x-fmt/248 has a <link> that points at http://beta.domd.info/node/1303/rdf.

I admit I am a fan of linked data, or being able to get machine readable data back (RDFa, Microdata, JSON, RDF/XML, XML, etc) from Cool URLs. But using triplestores, and SPARQL don’t seem to be terribly important things for PRONOM to have at this point. And if they are there under the covers, there’s no need to confront the digital preservation practitioner with them. My guess is that they want to have an application that lets them work with their peers to document file formats, not learn a new query or ontology language. Perhaps Jason Scott’s Just Solve the Problem effort in October, will be a good kick in the pants to mobilize grassroots community work around digital formats.

Meanwhile, I’ve finished up the FIDO API changes and the test suite enough to have submitted a pull request to OpenPlanets. My fork of the OpenPlanets repository is similarly on Github. I’m not really holding my breath waiting for it to be accepted, as it represents a significant change, and they have their own published roadmap of work to do. But I am hopeful that they will recognize the value in having a test suite as a safety net as they change and refactor FIDO going forward. Otherwise I guess it could be the beginnings of a fido2, but I would like to avoid that particular future.

Update: after posting this Ross Spencer tweeted me some instructions for linking to PRONOM

https://twitter.com/beet_keeper/status/242515266146287616

Maybe I missed it, but PRONOM could use a page that describes this.