Some folks at LC and CDL are trying to kick-start a new public discussion list for talking about digital curation in its many guises: repositories, tools, standards, techniques, practices, etc. The intuition is that there is a social component to the problems of digital preservation and repository interoperability.

Of course NDIIPP (the arena for the CDL/LC collaboration) has always been about building and strengthening a network of partners. But as Priscilla Caplan points out in her survey of the digital preservation landscape Ten Years After, organizations in Europe like the JISC and NESTOR seem to have understood that there is an educational component to digital preservation as well. Yet even the JISC and NESTOR have tended to focus more on the preservation of scholarly output, whereas digital preservation really extends beyond that realm of materials.

The continual need to share good ideas and hard-won knowledge about digital curation, and to build a network of colleagues and experts that extends past the usual project- and institution-specific boundaries, is just as important as building the collections and the technologies themselves.

So I guess this is a rather highfalutin goal … here’s some text stolen from the digital-curation home page to give you more of a flavor:

The digital preservation and repositories domain is fortunate to have a diverse set of institutional and consortial efforts, software projects, and standardization initiatives. Many discussion lists have been created for these individual efforts. The digital-curation discussion list is intended to be a public forum that encourages cross-pollination across these project and institutional boundaries in order to foster wider awareness of project- and institution-specific work and encourage further collaboration.

Topics of conversation can include (but are not limited to):

  • digital repository software (Fedora, DSpace, EPrints, etc.)
  • management of digital formats (JHOVE, djatoka, etc.)
  • use and development of standards (OAIS, OAI-PMH/ORE, MPEG21, METS, BagIt, etc.)
  • issues related to identifiers, packaging, and data transfer
  • best practices and case studies around curation and preservation of digital content
  • repository interoperability
  • conference, workshop, tutorial announcements
  • recent papers
  • job announcements
  • general chit chat about problems, solutions, itches to be scratched
  • humor and fun

We’ll see how it goes. If you are at all interested please sign up.


So we have a few bookshelves in our house–one of which is in our kitchen. Only one or two of the shelves in this bookshelf actually house books, most of which are food-stained cookbooks. The rest of the 4 or 5 shelves are given over to photographs, albums, pamphlets from schools, framed pictures, compact discs, pencils, letters, screwdrivers, coins, candles, bills, artwork, crayons–basically the knickknacks and detritus of daily living. We spend a lot of time in the kitchen, so it’s convenient and handy to just stash stuff there.

The only problem is IT DRIVES ME INSANE!

The randomness and perceived messiness of the bookshelf drives me crazy. I look at it and I see chaos, complexity and disorder. I know I have a problem, but that knowledge doesn’t seem to help. I am constantly shuffling things around, grouping things, moving things, throwing things out while more and more things are quietly added. I’d almost prefer the bookshelf to be somewhere out of sight, but then we’d probably use something else in the kitchen.

This morning, on my way to work, I got a call from Kesa asking where two flower petals were that needed to be ironed on to Chloe’s Girl Scouts uniform. They were in the bookshelf at one point. Did I throw them away? I can’t remember; it’s all a blur. I admit that I probably did. I can hear Chloe crying in the background. I feel bad…and resentful about having to keep this bookshelf organized.

Why am I writing here about this? Well, mostly because it wouldn’t fit within a 140 character limit. But srsly – I guess I just feel like this bookshelf is a living emblem of my professional life as a software developer at a library. I strive to create software that is simple in its expression, that does one thing and does it well, and that is hopefully easy for more people than just me to maintain. I relish working at an institution that values the preservation of objects and knowledge.

But I threw away the flower decal …

It’s important to remember that real life is complicated, and that the messiness is something to be relished as well. The useful bookshelf, bag of bits, chunk of JSON, or half-remembered Perl script in someone’s homedir is valuable for its organic resilience. Or as Einstein famously said:

Things should be made as simple as possible, but not simpler.

I’m sorry Chloe.

bagit and .deb

I’m just now (OK I’m slow) marveling at how similar BagIt turned out to be to the Debian Package Format. Given some of the folks involved, this synchronicity isn’t too surprising.

Both .deb and BagIt use a directory ‘data’ for bundling the files in the package (well .deb has it as a compressed file data.tar.gz). Both have md5sum-style checksum files for stating the fixity values of said files. Both have simple rfc2822-style text files for expressing metadata. Both have files that contain the version number of the packaging format. One nice thing that deb has which BagIt intentionally eschewed was a serialization format. But no matter.
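
That shared md5sum-style manifest is simple enough to sketch in a few lines of Python. This is just an illustration using the stdlib (not the LC library mentioned below); it follows the BagIt convention of a manifest-md5.txt sitting beside the data/ directory:

```python
import hashlib
import os
import tempfile

def write_md5_manifest(bag_dir):
    """Walk bag_dir/data and write manifest-md5.txt in BagIt's
    'checksum  relative/path' layout -- the same shape as .deb's md5sums."""
    lines = []
    for root, _, files in os.walk(os.path.join(bag_dir, "data")):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            lines.append("%s  %s" % (digest, os.path.relpath(path, bag_dir)))
    manifest_path = os.path.join(bag_dir, "manifest-md5.txt")
    with open(manifest_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return manifest_path

# Build a throwaway bag with one payload file and generate its manifest.
bag = tempfile.mkdtemp()
os.makedirs(os.path.join(bag, "data"))
with open(os.path.join(bag, "data", "hello.txt"), "wb") as f:
    f.write(b"hello bagit")
manifest = open(write_md5_manifest(bag)).read()
```

Each manifest line pairs a fixity value with a path relative to the bag root, which is exactly the shape of the .deb md5sums file shown below.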

At LC we (a.k.a. coding machine Justin Littman) are working on a software library for creating and validating bags, as well as a shiny GUI that’ll sit on top of it to assist in bag creation for people who like shiny things.

As a counterpoint to this process of creating BagIt tools, it’s interesting to look at how a .deb can be downloaded and inspected. Here’s a sampling of a shell session where I downloaded and extracted the parts of the .deb for python-rdflib.

ed@curry:~/tmp$ aptitude download python-rdflib
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information       
Initializing package states... Done
Building tag database... Done      
Get:1 hardy/universe python-rdflib 2.4.0-4 [276kB]
Fetched 276kB in 0s (346kB/s) 

ed@curry:~/tmp$ ar -xv python-rdflib_2.4.0-4_i386.deb 
x - debian-binary
x - control.tar.gz
x - data.tar.gz

ed@curry:~/tmp$ tar xvfz control.tar.gz 

ed@curry:~/tmp$ cat control
Package: python-rdflib
Source: rdflib
Version: 2.4.0-4
Architecture: i386
Maintainer: Ubuntu MOTU Developers 
Original-Maintainer: Nacho Barrientos Arias 
Installed-Size: 1608
Depends: libc6 (>= 2.5-5), python-support (>= 0.3.4), python (<< 2.6), python (>= 2.4), python-setuptools
Provides: python2.4-rdflib, python2.5-rdflib
Section: python
Priority: optional
Description: RDF library containing an RDF triple store and RDF/XML parser/serializer
 RDFLib is a Python library for working with RDF, a simple yet
 powerful language for representing information. The library
 contains an RDF/XML parser/serializer that conforms to the
 RDF/XML Syntax Specification and both in-memory and persistent
 Graph backend.
 This package also provides a serialization format converter
 called rdfpipe in order to deal with the different formats
 RDFLib works with.

ed@curry:~/tmp$ cat md5sums 
75af966e839159902537614e5815c415  usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/
a33eb3985c6de5589cb723d03d2caeb1  usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/
d1b5578dd1d64432684d86bbb816fafc  usr/bin/rdfpipe
0191b561e3efe1ceea7992e2c865949b  usr/share/doc/python-rdflib/changelog.gz
98a861211f3effe1e69d6148c1e31ab2  usr/share/doc/python-rdflib/copyright
d75c2ab05f3a4239963d8765c0e9e7c5  usr/share/doc/python-rdflib/examples/
17b61c23d0600e6ce17471dc7216d3fa  usr/share/doc/python-rdflib/examples/
3894fa16d075cf0eee1c36e6bcc043d8  usr/share/doc/python-rdflib/changelog.Debian.gz
15653f75f35120b16b1d8115e6b5a179  usr/share/man/man1/rdfpipe.1.gz
405cb531a83fd90356ef5c7113ecd774  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
41e28217ddd2eb394017cd8f12b1dfd5  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
ec9ae5147463ed551d70947c2824bc82  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
6e018a69ca242acb613effe420c2cdc7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
7e72a08f29abc91faddb85e91f17e87c  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
648384e5980ef39278466be38572523a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
494386730a6edf5c6caf7972ed0bf4ba  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
4513b2fdc116dc9ff02895222a81421d  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
a800bdac023ae0c02767ab623dffe67b  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
6c31647f2b3be724bdfcc35f631162b1  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
c158b3fb8fd66858f598180084f481c4  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
bff095caa2db064cc2b1827c4b90a9e7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
2db0c4925d17b49f5bb355d7860150c2  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
10e02ecf896d07c0546b791a450da633  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
eee29bb22b05b16da2a5e6552044bf22  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
a29a508631228f6674e11bb077c24afc  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
479a4702ebee35f464055a554ebf5324  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
d2fe75aa4394ec7d9106a1e02bb3015a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/
da186350e65c8e062887724b1758ef80  usr/share/python-support/python-rdflib/rdflib/sparql/
0130de0f5d28087d7c841e36d89714c4  usr/share/python-support/python-rdflib/rdflib/sparql/
826ffe4c6b3f59a9635524f0746299fe  usr/share/python-support/python-rdflib/rdflib/sparql/

ed@curry:~/tmp$ tar xvfz data.tar.gz 

Here are some more useful notes on the structure of .deb files and how to create them. If you are interested in trying out the nascent-alpha BagIt tools give me a holler (ehs at pobox dot com) or just add a comment here…


Terry’s analysis of the proposed changes to OCLC’s record policy is essential reading. I’m really concerned that these 996 fields will slip somewhat unnoticed into data that I use.

996 $aOCLCWCRUP $iUse and transfer of this record is governed by the OCLC® Policy for Use and Transfer of WorldCat® Records. $u

This appears to be an engineered, legal virus for our bibliographic ecosystems. I’m not a lawyer, so I can’t fully determine the significance of these legal terms…mostly because there isn’t a policy at the end of that PURL right now. There’s a FAQ full of ominous references to “the Policy”, and a glossy, feel-good overview, but the policy itself is empty at the moment. So the precise nature of the virus is so far unknown…or am I wrong?

At any rate, I think libraries need to be careful about letting these 996 fields creep into their data–especially data that they create. I wonder: are there other examples of legalese that have slipped into MARC data over the years?

Update 2008-11-03: it appears that “the Policy” was removed sometime Sunday evening? Perhaps it’s best not to jump to conclusions, eh? But that image of the virus is too cool, and I needed an excuse to post it on my blog.

Update 2008-11-07: check out Terry’s re-analysis of “the Policy” when a new version was brought back online by OCLC.

freebase and linked-data

Ok, this is pretty big news for linked data folks, and for semweb-heads in general. Freebase is now a linked-data target. This is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs.

The web2.0-meets-semweb space is also being explored by folks like Talis. It’ll be interesting to see how this plays out–particularly in light of SPARQL adoption, which I remain kind of neutral about for some undefined, wary, spooky reason. I get the idea of web resources having data views. It seems like a logical “one small step for a web agent, one giant leap for the web”. But queryability with SPARQL sounds like something to push off, particularly if you’ve already got a search API that could be hooked up to the data views.

At any rate, what this announcement means is that you can get machine readable data back from freebase using a URI. The descriptions then use more URIs, which you can then follow-your-nose to, and get more machine readable data. So if you are on a page like:

you can construct a URL for Tim Berners-Lee like this:

Then you resolve that URL asking for application/turtle (you could ask for application/rdf+xml but I find the turtle more readable).

curl --location --header "Accept: application/turtle"

And you’ll get back a description like this. There’s a lot of useful data there, but the interesting part for me is the follow-your-nose effect where you can see an assertion like:

     <> .

And you can then go look up Ted Nelson using that URI:

  curl --location --header "Accept: application/turtle"

And get another chunk of data which includes this assertion:

     <> .

And you can then continue following your nose to:

Lather, rinse, repeat.

So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed for agents to be able to crawl the web of documents and build indexes of the data to allow people to find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.
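
The whole follow-your-nose loop boils down to two operations: content-negotiate for Turtle, then harvest URIs to visit next. Here’s a minimal stdlib sketch (the example.org URIs are stand-ins, since the actual Freebase URLs didn’t survive the formatting above):

```python
import re
import urllib.request

def turtle_request(uri):
    # Content negotiation: ask the server for the Turtle representation.
    return urllib.request.Request(uri, headers={"Accept": "application/turtle"})

def object_uris(turtle_text):
    # Naive URI extraction: every <http://...> is a candidate link to follow.
    return re.findall(r"<(http[^>]*)>", turtle_text)

# A crawler would fetch turtle_request(uri), then queue any object_uris()
# from the response body that it hasn't visited yet. Lather, rinse, repeat.
sample = '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .'
```

A real crawler would use an RDF parser instead of a regex, but the shape of the loop is the same.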

Oh, there are a few little rough spots which probably should be ironed out … but when is that ever not the case eh? Inspiring stuff.


I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple; it’s just a GET:


Here’s an example of some turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace:

For example:


And most of these resources have a property which seems to assign a literal string label to the resource:

It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.

At any rate, with these two little facts gleaned from looking at the RDF for a few moments, I wrote a little script (using rdflib) that you feed a URL; it’ll munge through the RDF and print out the recognized entities:

ed@curry:~/bzr/calais$ ./
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
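
The script itself isn’t shown above, but the two facts it relies on (typed resources plus a label property) can be sketched even without rdflib, using a naive regex over the Turtle. The namespace and property URIs here are placeholders (example.com), since the real Calais vocabulary URIs were elided above:

```python
import re

# Placeholder URIs -- stand-ins for the actual Calais vocabulary terms.
TYPE_NS = "http://example.com/calais/type/"
NAME_PRED = "http://example.com/calais/name"

def entities(turtle_text):
    """Pair each resource's type (the last path segment of its rdf:type URI)
    with its literal name, mimicking the 'a Person named ...' output."""
    types = dict(re.findall(r"<([^>]+)> a <%s(\w+)>" % re.escape(TYPE_NS),
                            turtle_text))
    names = dict(re.findall(r'<([^>]+)> <%s> "([^"]+)"' % re.escape(NAME_PRED),
                            turtle_text))
    return ["a %s named %s" % (types[uri], names[uri])
            for uri in types if uri in names]

sample = """
<http://example.com/e/1> a <http://example.com/calais/type/Person> .
<http://example.com/e/1> <http://example.com/calais/name> "Dan Chudnov" .
"""
```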

Quite easy and impressive IMHO. One thing that is missing from this output is the URIs that identify the various resources that are recognized, like Dan’s:

Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash)…but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.

I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.

If you want to take this for a spin and can’t be bothered to download it, just drop into #code4lib and ask #zoia for entities:

14:45 < edsu> @entities
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR 
              Review Group, a City York, a EmailAddress, a Person 
              Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a 
              Person William Denton, a Person Barbara Tillett, a Organization 
              Congress, a Organization Open Content Alliance, a Organization 
              York \nUniversity'

json vs pickle

In Python, JSON is faster, smaller and more portable than pickle …

At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a django app, and the db schema currently looks something like:

Anyhow, if you look at the schema you’ll notice that we have a Page model, and attached to that is an OCR model. If you haven’t heard of it before, OCR is an acronym for optical character recognition. For each newspaper page, we have a TIFF image of the original page, and we have rectangle coordinates for the position of every word on the page. Basically it’s XML that looks something like this (warning: your browser may choke on this, you might want to right-click-download).

So there are roughly 2500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page…and we’re looking to model 1,000,000 pages soon…so if we got really prissy with normalization we could soon be looking at (worst case) 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don’t need fine-grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.

So let me get to the interesting part already. To avoid having to think about databases with billions of rows, I radically denormalized the data and stored the word coordinates as a blob of JSON in the database. So we just have a word_coordinates_json column in the OCR table, and when we need to look up the coordinates for a page we just load up the JSON dictionary and we’re good to go. JSON is nice with django, since django’s ORM doesn’t seem to support storing blobs in the database, and JSON is just text. This worked just fine on single page views, but we also do hit highlighting on views where 10 pages are displayed at the same time. So we started noticing large lags on these page views, because it was taking a while to load the JSON (sometimes 10 × 327K of JSON).
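
The denormalization itself is nothing fancy: the word coordinates for a page live as one JSON text blob instead of billions of rows. A sketch of the idea (the data shape here is illustrative, not the actual schema):

```python
import json

# One page's coordinates: word -> list of [x, y, width, height] boxes.
coords = {
    "liberty": [[102, 340, 58, 12], [870, 1211, 60, 12]],
    "press":   [[415, 98, 40, 12]],
}

# What gets stored in the word_coordinates_json text column ...
word_coordinates_json = json.dumps(coords)

# ... and at page-view time one loads() call recovers the boxes
# needed for hit highlighting.
boxes = json.loads(word_coordinates_json)["liberty"]
```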

As I mentioned, we’re using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed slowdowns I decided to compare django.utils.simplejson to the latest simplejson and python-cjson. And just for grins I figured it couldn’t hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON file and a 507K pickle file 100 times using each technique. Here are the results:

method              total seconds   avg seconds
django-simplejson      140.606723      1.406067
simplejson               2.260988      0.022610
pickle                  45.032428      0.450324
cPickle                  4.569351      0.045694
cPickle1                 2.829307      0.028293
cPickle2                 3.042940      0.030429
python-cjson             1.852755      0.018528

Yeah, that’s right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than even cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it’ll take 10 * 0.033183 = 0.3 seconds to parse all the JSON instead of the totally unacceptable 10 * 0.976193 = 9.7 seconds. I guess in some circles 0.3 seconds might be unacceptable; we’ll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser w/ AJAX. If you want, please try out my benchmarks yourself on your own platform. I’d be curious if you see the same ranking.
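
The benchmark script isn’t reproduced here, but its shape is easy to sketch with the stdlib’s timeit (the payload and iteration counts below are made up, and absolute timings will vary by machine and library version, so no numbers are claimed):

```python
import json
import pickle
import timeit

# A stand-in payload shaped roughly like the word-coordinates data.
data = {"word%d" % i: [[i, i, 10, 12]] * 3 for i in range(500)}
json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data, protocol=2)

def bench(load, blob, number=20):
    # Time `number` deserializations of the same blob, like the original script.
    return timeit.timeit(lambda: load(blob), number=number)

json_secs = bench(json.loads, json_blob)
pickle_secs = bench(pickle.loads, pickle_blob)
```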

Here are the versions for various bits I used:

  • python v2.5.2
  • django trunk: r9231 2008-10-13 15:38:18 -0400
  • simplejson 2.0.3

So in summary for Pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats, in that you can only store the simple datatypes that JSON allows, not full-fledged Python objects. But in my use case JSON’s data types were just fine. Makes me that much happier that simplejson (a.k.a. json) is now cooked into the Python 2.6 standard library.
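
That caveat is easy to demonstrate: pickle will round-trip an arbitrary object, while json refuses anything outside its simple types:

```python
import json
import pickle

class Page:
    def __init__(self, number):
        self.number = number

page = Page(1)

# pickle happily round-trips a full Python object ...
assert pickle.loads(pickle.dumps(page)).number == 1

# ... but json only knows dicts, lists, strings, numbers, booleans and None,
# so serializing a custom object raises TypeError.
try:
    json.dumps(page)
    json_serializable = True
except TypeError:
    json_serializable = False
```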

Note: if you aren’t seeing simplejson performing better than cPickle you may need to have python development libraries installed:

  aptitude install python-dev # or the equivalent for your system

You can verify if the optimizations are available in simplejson by:

ed@hammer:~/bzr/jsonickle$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simplejson
>>> simplejson._speedups
<module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/'>

Thanks to eby, mjgiarlo, BenO and Kapil for their pointers and ideas.

logs

If you are curious how is being used I’ve made the apache server logs available, including the ones for the sparql service. I’ve been meaning to do some analysis of the logs but haven’t had the time yet. You’ll notice that among the data that’s collected is the Accept header sent by agents, since it’s so important to what representation is served up. Thanks to danbri for the idea to simply make them available.

iswc2009, DC and vocamp

I just learned from Tom Heath that The International Semantic Web Conference is coming to Washington DC next year. This is pretty cool news to me, since traveling to conferences isn’t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a VoCamp around the conference, to provide an informal collaboration space for vocabulary demos, development, q/a, etc. If you want to help out please join the mailing list.

to liberate Code of Federal Regulations

good news via the govtrack mailing list

Carl Malamud of, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.

More here:

And Carl’s letter to the GPO:

It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.