day of digital archives psa

Today is the Day of Digital Archives, and I had this semi-thoughtful post written up about BagIt and how it’s a brain-dead simple format for packaging up your files so that you’ll know whether you still have them 5 minutes, 5 hours, 5 days, 5 years, maybe even 5 decades from now, if the notion of directories and files persists that long.

But I deleted that…you’re welcome…

I was also going to write about how, in a fit of web performance art, Mark Pilgrim recently deleted his online presence, including various extremely useful opensource tools and several popular online books, only to see them re-materialize on the Web at new locations.

But I deleted most of that too…you’re welcome again!

Here’s a public service announcement instead. If you happen to use Franco Lazzarino’s Ruby BagIt Library to create bags that contain largish files (> 500MB), you might have accidentally created bad SHA1 manifests. I added a test, fixed the bug with help from Mark Matienzo and Michael Klein, and sent a pull request. It hasn’t been applied yet, so here’s hoping it will be.
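If you have bags from this library sitting around, a few lines of Ruby are enough to spot the bad manifests. This isn’t part of the library itself, just a standalone sketch (the script name and usage are my own invention) that recomputes the SHA1 of each payload file and compares it against what manifest-sha1.txt claims:

    #!/usr/bin/env ruby
    # check_manifest.rb: recompute SHA1 digests for a bag's payload files and
    # compare them against the values recorded in manifest-sha1.txt.
    # Digest::SHA1.file reads the file in chunks, so large (> 500MB) files
    # are handled without loading them into memory.
    require 'digest'

    bag_dir = ARGV[0] or abort "usage: check_manifest.rb /path/to/bag"
    manifest = File.join(bag_dir, 'manifest-sha1.txt')

    bad = []
    File.readlines(manifest).each do |line|
      next if line.strip.empty?
      checksum, path = line.chomp.split(/\s+/, 2)
      actual = Digest::SHA1.file(File.join(bag_dir, path)).hexdigest
      bad << path unless actual == checksum
    end

    if bad.empty?
      puts "manifest-sha1.txt looks good"
    else
      puts "bad checksums for:"
      puts bad
    end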

At $mpow we’ve been getting terabytes of data from this social media company that has been bagging their data using this Ruby library. Many of the files are multiple gigabytes, gzip compressed, and many of the bags now have bad SHA1 manifests. The social media company wasn’t sure what the problem was, and told us just to ignore the SHA1 manifests, which is easy enough to do.

It seems that no matter how simple the spec, it’s easy to create bugs. If you create bags, throw a Bag-Software-Agent line into your bag-info.txt…you never know who might find it useful.
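In case it helps, bag-info.txt is just a list of labeled values, so adding the field is a one-liner. Something like this, where all of the values are made up for illustration:

    Source-Organization: Example University Libraries
    Bagging-Date: 2011-10-06
    Bag-Software-Agent: my-bagging-script v1.2 <http://example.com/my-bagging-script>

That way, when a bad manifest turns up years later, whoever is looking at the bag at least knows what software wrote it.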

federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936 in the depths of the Great Depression as a way to communicate in one place all that the agencies were doing to try to jump start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes of browsing the new site and comparing it to the old one makes it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed-down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one-off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down silo walls, and that every well-designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one-off’ dedicated web applications.

As a software developer working in the federal government, I found a few things about Federal Register 2.0 really exciting:

  • Fruitful collaboration between federal employees and citizen activists/geeks, initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, and Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, iCal feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for search-related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can still be discovered via links (see the point about permalinks below); a sketch of what such a robots.txt looks like follows this list. They have also made a sitemap available for crawlers to efficiently discover URLs for the content.
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up-front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leapfrog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register on social media sites like Facebook and Twitter, out on the open web in blogs, and in collaborative applications like Wikipedia.
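To make the robots.txt point above a bit more concrete, the pattern is roughly the following; the paths and sitemap location here are illustrative guesses rather than a copy of the live file:

    User-agent: *
    Disallow: /articles/search
    Sitemap: http://www.federalregister.gov/sitemap.xml

Everything else stays crawlable, so an indexer that follows the permalinks can still pick up the article content without hammering the search engine.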

I would like to see more bulk access to the XML data made available for re-purposing on other websites, although I suppose one could walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich that it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations; a rough sketch of what that could look like is below. It might even be worth following the lead of the London Gazette and making some of the structured metadata available in the HTML using RDFa. It also looks like content is only available from 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.
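To sketch what I mean, an OpenSearch description is just a small XML document that advertises the search endpoint and its result formats. The URL templates and the term parameter below are guesses on my part, not the site’s actual interface:

    <?xml version="1.0" encoding="UTF-8"?>
    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
      <ShortName>Federal Register</ShortName>
      <Description>Search entries in the Federal Register</Description>
      <Url type="text/html"
           template="http://www.federalregister.gov/articles/search?term={searchTerms}"/>
      <Url type="application/rss+xml"
           template="http://www.federalregister.gov/articles/search.rss?term={searchTerms}"/>
    </OpenSearchDescription>

Browsers and other clients could then discover the search service automatically, and a JSON or XML Url element could be added as those representations become available.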

But the great thing about what these folks have done is that now I can fork the project on GitHub, see how easy it is to make changes, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts, so hats off to everyone who helped make this happen.
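For anyone who wants to try that round trip, it is just the standard GitHub workflow; the repository address below is a placeholder, not the project’s actual location:

    git clone git@github.com:yourname/federal-register.git
    cd federal-register
    git checkout -b my-fix
    # edit, test and commit as usual ...
    git push origin my-fix
    # then open a pull request on GitHub so the developers can review the change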