Category Archives: python

fido test suite

I work in a digital preservation group at the Library of Congress, where we do a significant amount of work in Python. Lately I’ve been spending some time with OpenPlanets’ FIDO utility, mainly to see if I could refactor it so that it’s easier to use as a Python module from other Python applications; at the moment FIDO is designed to be run from the command line. The work involved more than a little refactoring, and the more I looked at the code, the more it became clear that a test suite would be a useful safety net.

Conveniently, I also happened to have been reading a recent report from the National Library of Australia on File Characterization Tools, which, in addition to discussing FIDO, pointed me at the govdocs1 dataset. Govdocs1 is a corpus of 1 million files harvested from the .gov domain by the NSF-funded Digital Corpora project, collected to serve as a public-domain test bed for forensics tools. I thought it might be useful to survey the filenames in the dataset and cherry-pick formats of particular types for use in my FIDO test suite.

So I wrote a little script that crawled all the filenames and kept track of the file extensions used (there’s a sketch of the idea after the table). Here are the results:

extension count
pdf 232791
html 191409
jpg 109281
txt 84091
doc 80648
xls 66599
ppt 50257
xml 41994
gif 36301
ps 22129
csv 18396
gz 13870
log 10241
eps 5465
png 4125
swf 3691
pps 1629
kml 995
kmz 949
hlp 660
sql 632
dwf 474
java 323
pptx 219
tmp 196
docx 169
ttf 104
js 92
pub 76
bmp 75
xbm 51
xlsx 46
jar 34
zip 27
wp 17
sys 8
dll 7
exported 5
exe 5
tif 3
chp 2
pst 1
squeak 1
data 1
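
The extension tally itself is simple. Here’s a minimal sketch of that kind of script, assuming you have a local copy of the govdocs1 files to walk (my actual script crawled the filenames remotely, so the collection step differs):

import os
import collections

def extension_counts(directory):
    # walk a directory tree and tally file extensions (lowercased)
    counts = collections.Counter()
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            ext = os.path.splitext(filename)[1].lstrip(".").lower()
            if ext:
                counts[ext] += 1
    return counts

if __name__ == "__main__":
    # "govdocs1" is a placeholder path for wherever the corpus files live
    for ext, count in extension_counts("govdocs1").most_common():
        print("%s %s" % (ext, count))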

With this list in hand, I downloaded an example of each file extension, ran it through the current release of FIDO, and used the output to generate a test suite for my new refactored version. Interestingly, two tests fail:

======================================================================
FAIL: test_pst (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 244, in test_pst
    self.assertEqual(i.puid, "x-fmt/249")
AssertionError: 'x-fmt/248' != 'x-fmt/249'

======================================================================
FAIL: test_pub (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 260, in test_pub
    self.assertEqual(i.puid, "x-fmt/257")
AssertionError: 'x-fmt/252' != 'x-fmt/257'

I’ll need to dig in to see what could be different between the two versions that would confuse x-fmt/248 with x-fmt/249, and x-fmt/252 with x-fmt/257. Perhaps it is related to Dave Tarrant’s recent post about how FIDO’s identification patterns have flip-flopped in the past.
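
For context, the generated tests are just unittest assertions comparing FIDO’s identification of a sample file against an expected PUID. A hand-written sketch of the shape of one test follows; the fido.fido import, the Fido class, and the identify_file() call are stand-ins for whatever API the refactored module ends up exposing, and the file path and expected PUID are illustrative:

import unittest

# Fido and identify_file() are hypothetical stand-ins for the identification
# API in the refactored module; the released FIDO is driven from the
# command line instead.
from fido.fido import Fido

class FidoTests(unittest.TestCase):

    def setUp(self):
        self.fido = Fido()

    def test_pdf(self):
        # files/000009.pdf is a placeholder for a sample PDF pulled from govdocs1
        i = self.fido.identify_file("files/000009.pdf")
        # the expected PUID is whatever the current FIDO release reports
        self.assertEqual(i.puid, "fmt/14")

if __name__ == "__main__":
    unittest.main()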

You may have noticed that I’m linking the PUIDs to Andy Jackson‘s PRONOM Prototype Registry (built in 6 days with Drupal) instead of the official PRONOM registry. I did this because a Google search for a PRONOM identifier (PUID) pulls up a nice detail page for the format in Andy’s prototype, whereas it doesn’t seem possible (at least in the 5 minutes I tried) to link directly to a file format record in the official PRONOM registry. I briefly tried the Linked Data prototype, but it proved difficult to search for a given PUID (server errors, the unforgiving glare of SPARQL query textareas, etc.).

I hope OpenPlanets and/or the National Archives give Andy’s Drupal experiment a fair shake. Getting a functional PRONOM registry running in 6 days with an open-source toolkit like Drupal definitely seems more future-proof than spending years with a contractor only to get closed source code. The Linked Data prototype looks promising, but as the recent final report on the Unified Digital Format Registry project highlights, choosing to build on a semantic web stack has its risks compared with more mainstream web publishing frameworks or content management systems like Drupal. PRONOM just needs an easy way for digital preservation practitioners to collaboratively update the registry, and for each format to have a unique URL that uses the PUID. My only complaint is that Andy’s prototype advertises RDF/XML in the HTML but appears to return an empty RDF document; for example, the HTML at http://beta.domd.info/pronom/x-fmt/248 has a <link> that points at http://beta.domd.info/node/1303/rdf.

I admit I am a fan of linked data, or rather of being able to get machine-readable data back (RDFa, Microdata, JSON, RDF/XML, XML, etc.) from Cool URLs. But triplestores and SPARQL don’t seem to be terribly important things for PRONOM to have at this point. And if they are there under the covers, there’s no need to confront the digital preservation practitioner with them. My guess is that practitioners want an application that lets them work with their peers to document file formats, not learn a new query or ontology language. Perhaps Jason Scott’s Just Solve the Problem effort in October will be a good kick in the pants to mobilize grassroots community work around digital formats.

Meanwhile, I’ve finished up the FIDO API changes and the test suite enough to have submitted a pull request to OpenPlanets. My fork of the OpenPlanets repository is also on GitHub. I’m not really holding my breath waiting for it to be accepted, since it represents a significant change and they have their own published roadmap of work to do. But I am hopeful that they will recognize the value in having a test suite as a safety net as they change and refactor FIDO going forward. Otherwise I guess it could be the beginnings of a fido2, but I would like to avoid that particular future.

Update: after posting this, Ross Spencer tweeted me some instructions for linking to PRONOM.

Maybe I missed it, but PRONOM could use a page that describes this.

Hacking O’Reilly RDFa

I recently learned from Ivan Herman‘s blog that O’Reilly has begun publishing RDFa in their online catalog of books. So if you install the RDFa Highlight bookmarklet, then visit a page like this and click the bookmarklet, you’ll see something like:



Those red boxes are graphical depictions of where metadata can be found interleaved in the HTML. In my screenshot you can (just barely) see an assertion involving the title being displayed:

<urn:x-domain:oreilly.com:product:9780596516499.IP> dc:title "Natural Language Processing with Python"

But there is actually quite a lot of metadata hiding in the page, which can be found by running the page through the RDFa Distiller (quickly skim over this if your eyes glaze over when you see Turtle):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix frbr: <http://vocab.org/frbr/core#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:x-domain:oreilly.com:product:9780596516499.IP> a frbr:Expression ;
     dc:creator <urn:x-domain:oreilly.com:agent:pdb:3343>, <urn:x-domain:oreilly.com:agent:pdb:3501>, <urn:x-domain:oreilly.com:agent:pdb:3502> ;
     dc:issued "2009-06-12"^^xsd:dateTime ;
     dc:publisher "O'Reilly Media"@en ;
     dc:title "Natural Language Processing with Python"@en ;
     frbr:embodiment <urn:x-domain:oreilly.com:product:9780596516499.BOOK>, <urn:x-domain:oreilly.com:product:9780596803346.SAF>, <urn:x-domain:oreilly.com:product:9780596803391.EBOOK> . 

<http://customer.wileyeurope.com/CGI-BIN/lansaweb?procfun+shopcart+shcfn01+funcparms+parmisbn(a0130):9780596516499+parmqty(p0050):1+parmurl(l0560):http://oreilly.com/store/> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "GBP"@en ;
                     gr:hasCurrencyValue "34.50"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<http://my.safaribooksonline.com/9780596803346> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803346.SAF>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?p=BUNDLE&prod=9780596516499.BOOK&prod=9780596803391.EBOOK&bundle=1&retUrl=http%3A%252F%252Foreilly.com%252Fstore%252F> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:includesObject
                 [ a gr:TypeAndQuantityNode ;
                     gr:ammountOfThisGood "1"^^xsd:float ;
                     gr:hasPriceSpecification
                         [ a gr:UnitPriceSpecification ;
                             gr:hasCurrency "None"@en ;
                             gr:hasCurrencyValue "49.49"^^xsd:float
                         ] ;
                     gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803391.EBOOK>
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?prod=9780596516499.BOOK> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "USD"@en ;
                     gr:hasCurrencyValue "44.99"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?prod=9780596803391.EBOOK> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "USD"@en ;
                     gr:hasCurrencyValue "35.99"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803391.EBOOK>
         ] . 

<urn:x-domain:oreilly.com:agent:pdb:3343> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3614> ;
     foaf:name "Steven Bird"@en . 

<urn:x-domain:oreilly.com:agent:pdb:3501> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3615> ;
     foaf:name "Ewan Klein"@en . 

<urn:x-domain:oreilly.com:agent:pdb:3502> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3616> ;
     foaf:name "Edward Loper"@en . 

<urn:x-domain:oreilly.com:product:9780596803346.SAF> a frbr:Manifestation ;
     dc:type <http://purl.oreilly.com/product-types/SAF> . 

<urn:x-domain:oreilly.com:product:9780596803391.EBOOK> a frbr:Manifestation ;
     dc:identifier <urn:isbn:9780596803391> ;
     dc:issued "2009-06-12"^^xsd:dateTime ;
     dc:type <http://purl.oreilly.com/product-types/EBOOK> . 

<urn:x-domain:oreilly.com:product:9780596516499.BOOK> a frbr:Manifestation ;
     dc:extent """512"""@en ;
     dc:identifier <urn:isbn:9780596516499> ;
     dc:issued "2009-06-19"^^xsd:dateTime ;
     dc:type <http://purl.oreilly.com/product-types/BOOK> . 

So that’s a lot of data. The nice thing about RDF is that you can look at the vocabularies being used to get an idea of the rough shape of the underlying data. Just looking at the namespace prefixes we can see that O’Reilly has chosen to use the following vocabularies:

Dublin Core Terms (http://purl.org/dc/terms/)
FOAF (http://xmlns.com/foaf/0.1/)
FRBR Core (http://vocab.org/frbr/core#)
GoodRelations (http://purl.org/goodrelations/v1#)

I was curious, so I wrote a little crawler (41 lines of Python+rdflib) to collect all the metadata from the O’Reilly Catalog pages. Yes, all the pages! It ended up pulling down 92,101 triples.
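
The crawler is roughly this shape; a sketch, assuming an rdflib install that has the RDFa parser plugin and the Sleepycat (BerkeleyDB) store available, and some list of catalog page URLs to feed it:

import rdflib

def crawl(urls, store_path="store"):
    # Sleepycat is rdflib's BerkeleyDB-backed store
    graph = rdflib.Graph("Sleepycat")
    graph.open(store_path, create=True)
    for url in urls:
        try:
            graph.parse(url, format="rdfa")
        except Exception as e:
            print("failed to parse %s: %s" % (url, e))
    graph.serialize("catalog.nt", format="nt")
    graph.close()

if __name__ == "__main__":
    # catalog_urls.txt is a placeholder: one catalog page URL per line,
    # gathered however you like (a sitemap is one option)
    crawl([line.strip() for line in open("catalog_urls.txt")])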

A nice side effect of having the data as a big ntriples file is that you can use unix pipe tricks with sort, cut, and uniq to get some ballpark numbers on what types of resources are in the RDF graph:

ed@curry:~/Projects/oreilly-crawler$ rdfsum catalog.nt
   6803 <http://purl.org/goodrelations/v1#TypeAndQuantityNode>
   5861 <http://purl.org/goodrelations/v1#Offering>
   4564 <http://purl.org/goodrelations/v1#UnitPriceSpecification>
   4065 <http://vocab.org/frbr/core#Manifestation>
   2100 <http://vocab.org/frbr/core#Expression>
   2023 <http://xmlns.com/foaf/0.1/Person>
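
If you don’t have a handy wrapper like rdfsum lying around, the same rdf:type tally is only a few lines of Python over the ntriples dump (a sketch, assuming plain ntriples with one statement per line):

import collections
import sys

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def type_counts(ntriples_file):
    # count rdf:type statements by object
    counts = collections.Counter()
    for line in open(ntriples_file):
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[1] == RDF_TYPE:
            counts[parts[2].rstrip(" .\n")] += 1
    return counts

if __name__ == "__main__":
    for rdf_type, count in type_counts(sys.argv[1]).most_common():
        print("%7i %s" % (count, rdf_type))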

Another nice thing about pulling the RDFa down with rdflib is that you end up with a little BerkeleyDB triple store, which you can query with SPARQL, say to pull out all the authors and titles:

    SELECT ?title ?author
    WHERE { 
      ?title_uri dct:title ?title .
      ?title_uri dct:creator ?author_uri .
      ?author_uri foaf:name ?author .
    }
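
To actually run that against the store you reopen it and supply the dct and foaf prefixes; something like this, give or take your rdflib version’s query API:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=False)

query = """
    SELECT ?title ?author
    WHERE {
      ?title_uri dct:title ?title .
      ?title_uri dct:creator ?author_uri .
      ?author_uri foaf:name ?author .
    }
"""

# initNs maps the dct: and foaf: prefixes used in the query
results = graph.query(query, initNs={
    "dct": rdflib.Namespace("http://purl.org/dc/terms/"),
    "foaf": rdflib.Namespace("http://xmlns.com/foaf/0.1/"),
})

for title, author in results:
    print("%s - %s" % (title, author))

graph.close()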

And with a little bit of networkx judo you can get an xmas-friendly graph of authors (the green dots are books and the red ones are authors; I limited author labels to authors who had written more than 2 books).
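
The networkx judo amounts to roughly this; a sketch that reads the author/book edges back out of the store (the store path, colors, and layout defaults are illustrative):

import collections

import networkx
import rdflib
from matplotlib import pyplot

DCT = rdflib.Namespace("http://purl.org/dc/terms/")
FOAF = rdflib.Namespace("http://xmlns.com/foaf/0.1/")

store = rdflib.Graph("Sleepycat")
store.open("store", create=False)

g = networkx.Graph()
books_per_author = collections.Counter()
names = {}

# one edge per dct:creator statement: book <-> author
for book, author in store.subject_objects(DCT["creator"]):
    g.add_edge(book, author)
    books_per_author[author] += 1

for author, name in store.subject_objects(FOAF["name"]):
    names[author] = str(name)

# authors red, books green; only label authors with more than 2 books
colors = ["red" if node in books_per_author else "green" for node in g]
labels = dict((a, names.get(a, "")) for a, n in books_per_author.items() if n > 2)

networkx.draw(g, node_color=colors, labels=labels, node_size=20)
pyplot.savefig("authors.png")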

Admittedly the resulting picture is not very readable, but I imagine someone with more network visualization skillz could do something nicer in short order. There’s a lot that could be done with the data. This exercise was mainly just to demonstrate how layering some new stuff into your HTML can really open up doors for how people use your website. Clearly O’Reilly did some deep thinking about what data they had and what vocabularies they wanted to model it with. But once they’d done that, they probably just had to add 50 lines to an HTML template somewhere, and it was published (props to David Brunton for this turn of phrase). It’s a really good sign that a tech publisher with the stature of O’Reilly is giving this method of data publishing a try.

My only suggestion (for anyone at O’Reilly who might be reading) would be that it would be nice if they used HTTP URLs instead of URNs for People, Works and Expressions. I understand why they did it: using URNs eases deployment somewhat, since you don’t have to worry about httpRange-14 stuff. But I think they could easily use a hash URI instead of a URN, and make it easy for people to link to their stuff in other data. Cool URIs for the Semantic Web has some other patterns they might want to consider, but simply adding a hash to their existing page URIs should do the trick. So for example, consider if OpenLibrary wanted to link their notion of a book to O’Reilly’s notion of a book with owl:sameAs. If they used the URN they’d have:

<http://openlibrary.org/b/OL23978297M> owl:sameAs <urn:x-domain:oreilly.com:product:9780596516499.IP> .

but if O’Reilly identified their expressions with a URL they would enable something like:

<http://openlibrary.org/b/OL23978297M> owl:sameAs <http://oreilly.com/catalog/9780596516499#expression> .

This may seem like a minor point, but it’s really important to be able to follow your nose on the web, particularly in Linked Data. If a piece of software ran across the O’Reilly URL in a chunk of OpenLibrary RDF, the program could HTTP GET it and learn more about the book. But if it got the URN it wouldn’t really know how to fetch a representation for that resource without some special-case logic that mapped the URN to a URL. There is a reason why Tim Berners-Lee included the following as the second of his design principles for Linked Data:

Use HTTP URIs so that people can look up those names.

Anyhow, hats off to O’Reilly for putting RDFa into practice. I hope the rest of the publishing (and library) world takes note. If you are looking to learn more about RDFa, Ben Adida and Mark Birbeck‘s RDFa Primer: Bridging the Human and Data Webs is a really nice intro.

oai-pmh and xmpp

As an experiment to learn more about xmpp, I created a little utility that will poll an oai-pmh server and send new records as chunks of xml over xmpp. The idea wasn’t necessarily to see all the xml coming into my jabber client (although you can do that). I wanted to enable downstream applications to have records pushed to them, instead of having to constantly poll for updates. So you could write a client that archives away metadata (and potentially articles) as they are found, or a current awareness tool that listens for articles matching a particular user’s research profile, etc.
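
As a sketch of what the downstream side could look like, here is a tiny archiving client using xmpppy; the account, password, and records/ directory are placeholders, and the handler just writes each incoming record to disk:

import time

import xmpp

JID = "archiver@example.org"   # placeholder account for the listener
PASSWORD = "secret"            # placeholder password

def on_message(session, message):
    # each message body is a chunk of oai-pmh record xml sent by oai2xmpp
    body = message.getBody()
    if body:
        filename = "records/%d.xml" % int(time.time() * 1000)
        open(filename, "wb").write(body.encode("utf-8"))

jid = xmpp.protocol.JID(JID)
client = xmpp.Client(jid.getDomain(), debug=[])
client.connect()
client.auth(jid.getNode(), PASSWORD)
client.RegisterHandler("message", on_message)
client.sendInitPresence()

while True:
    client.Process(1)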

Here’s how you start it up:

oai2xmpp.py http://www.doaj.org/oai.article from@example.com to@example.org

which would poll the Directory of Open Access Journals for new articles every 10 minutes and send them via xmpp to to@example.org. You can adjust the poll interval, and limit to records within a particular set, with the --pollinterval and --set options, e.g.:

oai2xmpp.py http://export.arxiv.org/oai2 currents@jabber.org ehs@jabber.org --set cs --pollinterval 86400

It’s a one-file Python hack in the spirit of Thom Hickey’s 2PageOAI, with a few dependencies documented in the file (lxml, xmpppy, httplib2). I’ve run it for about a week against DOAJ and arxiv.org without incident (it does respect 503 HTTP status codes telling it to slow down). You can find it here.

If you try it out and have any improvements or ideas, let me know.