notes on retooling libraries

If you work in the digital preservation field and haven’t seen Dorothea Salo’s Retooling Libraries for the Data Challenge in the latest issue of Ariadne, definitely give it a read. Dorothea takes an unflinching look at the scope and characteristics of the data assets currently being generated by scholarly research, and at how well equipped traditional digital library efforts are to deal with them. I haven’t seen so many of the issues I’ve had to deal with (largely unconsciously) as part of my daily work so neatly summarized before. Having them laid out in such a lucid, insightful and succinct way is really refreshing, and inspiring.

In the section on Taylorist Production Processes, Dorothea makes a really good point that libraries have tended to optimize workflows for particular types of materials (e.g. photographs, books, journal articles, maps). When materials require a great deal of specialized knowledge to deal with, and the tools that are used to manage and provide access to the content are similarly specialized, it’s not hard to understand why this balkanization has happened. On occasion I’ve heard folks (/me points finger at self) bemoan the launch of a new website as the creation of “yet another silo”. But the technical silos that we perceive are largely artifacts of the social silos we work in: archives, special collections, rare books, maps, etc. The collections we break up our libraries into…the projects that mirror those collections. We need to work better together before we can build common digital preservation tools. To paraphrase something David Brunton has said to me before: we need to think of our collections more as sets of things that can be rearranged at will, with ease and even impunity. In fact the architecture of the Web (and each website on it) is all about doing that.

Even though it can be tough (particularly in large organizations) I think we can in fact achieve some levels of common tooling (in areas like storage and auditing); but we must admit (to ourselves at least) that some levels of access will likely always be specialized in terms of technical infrastructure and user interface requirements:

Some, though not all, data can be shoehorned into a digital library not optimised for them, but only at the cost of the affordances surrounding them. Consider data over which a complex Web-based interaction environment has been built. The data can be removed from the environment for preservation, but only at the cost of loss of the specialised interactions that make the data valuable to begin with. If the dataset can be browsed via the Web interface, a static Web snapshot becomes possible, but it too will lack sophisticated interaction. If the digital library takes on the not inconsiderable job of recreating the entire environment, it is committing to rewriting interaction code over and over again indefinitely as computing environments change.

Dorothea’s statement about committing to rewriting interaction code over and over again is important. I’m a software developer, and a web developer to boot — so there’s nothing I like more than yanking the data out of one encrusted old app, and creating it a-fresh using the web-framework-du-jour. But in my heart of hearts I know that while this may work for large collections of homogeneous data, it doesn’t scale very well for a vast sea of heterogeneous data. However, all is not lost. As the JISC are fond of saying:

The coolest thing to do with your data will be thought of by someone else.

So why don’t we data archivers get out of the business of building the “interaction code”? Maybe our primary service should be to act as data wholesalers who collect data and make it available in bulk to those who do want to build access layers on top of it. Let’s make our data easy for other people to use (with clear licensing) and reference (with web identifiers) so that they can annotate it, and we can pull back those annotations and views. In a way this hearkens back to the idea of Data Providers and Service Providers that was talked about a lot in the context of OAI-PMH. But in this case we’d be making the objects available as well as the metadata that describes them, similar to the use cases around OAI-ORE. I got a chance to chat with Kate Zwaard of the GPO at CurateCamp a few weeks ago, and learned how the new Federal Register is a presentation application for raw XML data being made available by the GPO. Part of the challenge is making these flows of data public, and giving credit where credit is due: not only to the creators of the shiny site you see, but to the folks behind the scenes who make it possible.

Another part of Dorothea’s essay that stuck out for me was the advice to split ingest, storage and access systems.

Ingest, storage, and end-user interfaces should be as loosely coupled as possible. Ideally, the same storage pool should be available to as many ingest mechanisms as researchers and their technology staff can dream up, and the items within should be usable within as many reuse, remix, and re-evaluation environments as the Web can produce.

This is something we (myself and other folks at LC) did as part of the tooling to support the National Digital Newspaper Program. Our initial stab at the software architecture was to use Fedora to manage the full life cycle (from ingest, to storage, to access) of the newspaper content we receive from program awardees around the US. The only trouble was that we wanted the access system to support heavy use by researchers and also robots (Google, Yahoo, Microsoft, etc) building their own views on the content. Unfortunately the way we had put the pieces together we couldn’t support that. Increasingly we found ourselves working around Fedora as much as possible to squeeze a bit more performance out of the system.

So in the end we (and by we I mean David) decided to bite the bullet and split off the inventory systems keeping track of where received content lives (what storage systems, etc) from the access systems that delivered content on the Web. Ultimately this meant we could leverage industry proven web development tools to deliver the newspaper content…which was a huge win. Now that’s not saying that Fedora can’t be used to provide access to content. I think the problems we experienced may well have been the result of our use of Fedora, rather than Fedora itself. Having to do multiple, large XSLT transforms to source XML files to render a page is painful. While it’s a bit of a truism, a good software developer tries to pick the right tool for the job. Half the battle there is deciding on the right granularity for the job … the single job we were trying to solve with Fedora (preservation and access) was too big for us to do either right.
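To make that split a bit more concrete, here is a rough sketch of the shape of it. This is not the code we run: the class names, the batch identifier and the storage path are all made up for illustration. The idea is only that the inventory knows where received content lives, the access side keeps its own derived copy for the web, and the two meet at a single lookup.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Inventory:
    """Hypothetical inventory: records where each received batch is stored."""

    locations: dict[str, str] = field(default_factory=dict)

    def record(self, batch_id: str, storage_path: str) -> None:
        self.locations[batch_id] = storage_path

    def locate(self, batch_id: str) -> str:
        return self.locations[batch_id]


@dataclass
class AccessIndex:
    """Hypothetical access layer: serves pages from its own derived copy."""

    pages: dict[str, str] = field(default_factory=dict)

    def load_batch(self, batch_id: str, inventory: Inventory) -> None:
        # The only coupling point: ask the inventory where the batch lives.
        path = inventory.locate(batch_id)
        self.pages[f"/batches/{batch_id}/"] = f"rendered from {path}"

    def get(self, url_path: str) -> str:
        # Serving a request never touches the inventory or the storage pool.
        return self.pages[url_path]


inventory = Inventory()
inventory.record("batch_example_001", "/storage/pool-a/batch_example_001")

access = AccessIndex()
access.load_batch("batch_example_001", inventory)
print(access.get("/batches/batch_example_001/"))
```

Because the access layer only reads from the inventory at load time, it can be rebuilt with whatever web tooling suits the job without disturbing the record of what was received and where it is kept.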

Having a system that’s decomposable, like the approach that CDL is taking with Microservices, is essential for long-term thinking about software in the context of digital preservation. I guess you could say “there’s no there there” with Microservices, since there’s not really a system to download, but in a way that’s kind of the point.

I guess this is just a long way of saying, Thanks Dorothea! :-)

federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936 in the depths of the Great Depression as a way to communicate in one place all that the agencies were doing to try to jump start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes browsing the new site and comparing it to the old one make it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down the silo walls, and that every well designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one off’ dedicated web applications.

As a software developer working in the federal government there were a few things about the Federal Register 2.0 that I found really exciting:

  • Fruitful collaboration between federal employees and citizen activist/geeks initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, ical feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for search related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can be discovered via links, see below about permalinks. They also have made a sitemap available for crawlers to efficiently discover URLs for the content.
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up-front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leapfrog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register in social media sites like Facebook, Twitter and out on the open web in blogs, and in collaborative applications like Wikipedia.
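To make the robots.txt point above a little more concrete, here is a minimal sketch using Python’s standard urllib.robotparser. The rules and the example.gov URLs are invented for illustration; they are not copied from the Federal Register’s actual robots.txt. The sketch just shows the pattern described: search results are off limits to crawlers, while permalinked articles stay crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the real file from the Federal Register site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: a permalinked article and a dynamic search result.
article_ok = parser.can_fetch(
    "ExampleBot", "https://example.gov/articles/2010/08/05/example-entry"
)
search_ok = parser.can_fetch("ExampleBot", "https://example.gov/search?q=water")

print("article crawlable:", article_ok)
print("search crawlable:", search_ok)
```

A well-behaved crawler would make this kind of check before every request, which is why keeping expensive search pages out while leaving permalinks open works so well.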

I would like to see more bulk access to XML data made available, for re-purposing on other websites, although I guess it might be possible to walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations. It might even be worth following the lead of the London Gazette and making some of the structured metadata available in the HTML using RDFa. It also looks like content is only available for 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.
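Walking from a feed to the detailed XML might look something like the sketch below. Everything specific in it is an assumption on my part: the sample feed is made up, the example.gov URLs are placeholders, and the idea that an entry’s XML lives at its permalink plus “.xml” is purely hypothetical. I haven’t checked how the site actually exposes it.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Made-up RSS feed standing in for one of the site's syndicated feeds.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example rule</title>
      <link>https://example.gov/articles/2010/08/05/example-rule</link>
    </item>
    <item>
      <title>Example notice</title>
      <link>https://example.gov/articles/2010/08/05/example-notice</link>
    </item>
  </channel>
</rss>
"""


def entry_links(feed_xml: str) -> list[str]:
    """Return the permalink of every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        link.text.strip()
        for link in root.findall("./channel/item/link")
        if link.text
    ]


def detailed_xml_url(entry_url: str) -> str:
    """Hypothetical convention: the entry's XML is at its permalink + '.xml'."""
    return entry_url + ".xml"


links = entry_links(SAMPLE_FEED)
xml_urls = [detailed_xml_url(url) for url in links]
for url in xml_urls:
    print(url)
```

A real harvester would then fetch each of those URLs politely, but a proper bulk download would still be kinder to everyone’s servers.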

But the great thing about what these folks have done is now I can fork the project on github, see how easy it is to add the changes, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts–so hats off to everyone who helped make this happen.

to liberate Code of Federal Regulations

good news via the govtrack mailing list

Carl Malamud of, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.

More here:

And Carl’s letter to the GPO:

It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.