cc0 and git for data

In case you missed it, the Cooper-Hewitt National Design Museum at the Smithsonian Institution made a pretty important announcement almost a month ago: they have released their collection metadata on GitHub using the CC0 Creative Commons license. The choice to use GitHub is interesting (more about that below) but the big news is the choice to license the data with CC0. John Wilbanks wrote a nice piece about why the use of CC0 is important. Rather than paraphrase I’ll just quote his main point:

… the fact that the Smithsonian has gone to CC0 is actually a great step. It means that data owners inside the USG have the latitude to use tools that put USG works into a legal status outside the US that is interoperable with their public domain status inside the US, and that’s an unalloyed Good Thing in my view.

While I helped prototype and bring the first version of id.loc.gov online, the licensing of the data was a persistent question that I heard from people who wanted to use the data. The about page at id.loc.gov currently says:

The Library of Congress has prepared this vocabulary terminology system and is making it available as a public domain data set. While it has attempted to minimize inaccuracies and defects in the data and services furnished, THE LIBRARY OF CONGRESS IS PROVIDING THIS DATA AND THESE SERVICES “AS IS” AND DISCLAIMS ANY AND ALL WARRANTIES, WHETHER EXPRESS OR IMPLIED.

But as John detailed in his post, this isn’t really enough for folks outside the US. I think even for parties inside the US a CC0 license would add more confidence to using the data set in different contexts, and would help align the Library of Congress more generally with the Web. Last Friday Eric Miller of Zepheira spoke about Linked Data at the Library of Congress (eventually the talk should be made available). Near the end of his talk he focused on things that need to be worked on, and I was glad to hear him stress that work needed to be done on licensing. The issue is really nothing new, and it really transcends the Linked Data space. I’m not saying it’s easy, but I agree with everyone who is saying it is important to focus on…and it’s great to see the advances that others in the federal government are making.

Git and GitHub

The other fun part of the Smithsonian announcement was the use of GitHub as a platform for publishing the data. To do this Cooper-Hewitt established an organizational account on GitHub, which you might think is easy, but is actually no small achievement by itself for an organization in the US federal government. With the account in hand the collection project was created and the collection metadata was released as two CSV files (media.csv and objects.csv) by Micah Walter. The repository was then forked by Aaron Straup Cope. Aaron added some Python scripts for converting the CSV files into record based JSON files. In the comments to the Cooper-Hewitt Labs blog post Aaron commented on why he chose to break up the CSV into JSON. The beautiful thing about using Git and GitHub this way for data is that you have a history view like this:

For digital preservation folks this view of what changed, when, and by whom is extremely important for establishing provenance. The fact that you get this for free by using the open source Git version control system and pushing your repository to GitHub is very compelling.
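To make the CSV-to-JSON step mentioned above a little more concrete, here is a minimal sketch in Node.js. It is not Aaron's code (his scripts are in Python), and the column names and the idea of naming each file after an `id` column are assumptions made purely for illustration. It only shows the general shape of turning one big CSV into one small JSON document per record, which is the kind of layout where a Git history can show changes record by record.

```javascript
// Minimal CSV parser: handles quoted fields, doubled quotes ("") inside
// quoted fields, and commas or newlines inside quoted fields.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') { inQuotes = false; }
      else { field += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field); field = '';
    } else if (ch === '\n') {
      row.push(field); rows.push(row); row = []; field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last record if the text does not end with a newline.
  if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

// Turn CSV text into one { path, json } pair per record.
// idColumn is a hypothetical column used to name each output file.
function csvToRecords(text, idColumn) {
  const [header, ...rows] = parseCsv(text);
  return rows.map(function (cells) {
    const record = {};
    header.forEach(function (name, i) {
      record[name] = cells[i] === undefined ? null : cells[i];
    });
    return {
      path: record[idColumn] + '.json',
      json: JSON.stringify(record, null, 2)
    };
  });
}

// Made-up sample data, not taken from objects.csv.
const sample = 'id,title\n1,"Chair, side"\n2,Vase\n';
const records = csvToRecords(sample, 'id');
console.log(records.map(function (r) { return r.path; }));
```

Writing each `json` string to its `path` with `fs.writeFileSync` would complete the conversion; that step is left out here so the sketch stays free of side effects.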

Over the past couple of years there has been quite a bit of informal discussion in the digital curation community about using Git for versioning data. Just a couple weeks before the Smithsonian announcement Stephanie Collett and Martin Haye from the California Digital Library reported on the use of Git and Mercurial to version data at Code4lib 2012.

But as Alf Eaton observed:

In this case we’re talking 205,137 files. If you doubt Alf, try cloning the repository. Here’s what I see:

ed@rorty:~$ time git clone https://github.com/cooperhewitt/collection.git cooperhewitt-collection
Cloning into cooperhewitt-collection...
remote: Counting objects: 230004, done.
remote: Compressing objects: 100% (19507/19507), done.
remote: Total 230004 (delta 102489), reused 223775 (delta 96260)
Receiving objects: 100% (230004/230004), 27.84 MiB | 3.96 MiB/s, done.
Resolving deltas: 100% (102489/102489), done.

real    8m49.408s
user    0m16.477s
sys     0m17.073s

Yes, that was close to 9 minutes to clone the repository, during which my workstation was pretty much unusable. I suspect that the majority of the time was spent in I/O but more research would be required to know for sure. There are also challenges to using Git for large binary files, which are very common in the digital preservation space. Git and GitHub were designed for versioning code. As any experienced programmer will tell you: at a certain level of abstraction all code is data, and all data is code. So it’s not illogical to expect Git and GitHub to be used this way.

But practically speaking code and data can be laid out on disk quite differently, and the tools for managing code aren’t necessarily optimized for managing data out of the box. There was also an interesting discussion two years ago over on the Sunlight Foundation blog questioning the merits of using Git and GitHub for managing data. Be that as it may, I don’t think the digital preservation community can question the importance of tracking where data came from, and the transformations it has undergone. However, the granularity at which to record these details is still an open issue. Are some notes written in English in a README-like file enough, or do they need to be machine actionable? Does PREMIS provide some guidance here? Or maybe OAIS? Perhaps there’s no firm rule here, and there are only pragmatic answers? I guess I should feel embarrassed that I don’t already know the answers to these questions…I imagine I’m not the only one.

By a strange coincidence at roughly the same time I happened to notice in a tweet from Bess Sadler that Stanford University recently published Digital Object Storage and Versioning in the Stanford Digital Repository. Maybe there is something to be learned in the findings in there. I’ll let you know when I have read it. I suspect Alf and Bess are right that there are other tools like boar, extensions to Git such as git-annex, or new services like figshare that will have an important role to play in versioning data.

Meanwhile

The concerns about GitHub and Git for versioning data aside, the Cooper-Hewitt collection is inspiring, because it is emblematic of trying to use mainstream tools for digital preservation work. It also seems to emphasize the important (and under-recognized) role of access in doing digital preservation at all. I actually only discovered the announcement after getting into a conversation with Seb Chan who responded to me when I grumptweeted about how the archival finding aids application at the Library of Congress has a robots.txt file that prevents anyone from crawling and indexing it:

The distressing thing is that so much work has gone into describing these archival collections, and getting them on the Web…but two lines in a robots.txt file mean they are invisible to anyone searching for related material in Google, Bing, Yahoo, etc. The date on the robots.txt from 2003 made it look like perhaps the exclusion was from some previous version of the software. I decided to be “that guy” and get in touch with some of the people who helped put the content online, so there may be a chance of letting crawlers in.
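For anyone who wants to check their own site for the same problem, here is a rough sketch of a test for a blanket exclusion. I am assuming the two lines in question are the usual `User-agent: *` followed by `Disallow: /`; the function name `blocksAllCrawlers` is made up for this example, and this is a deliberate simplification rather than a complete robots.txt parser (it ignores `Allow` rules and path patterns, for instance).

```javascript
// Returns true if the robots.txt text contains a group that applies to
// every crawler (User-agent: *) and disallows the whole site (Disallow: /).
// Simplified on purpose: Allow rules and partial paths are not considered.
function blocksAllCrawlers(robotsTxt) {
  let appliesToAll = false;   // does the current group include "*"?
  let lastWasAgent = false;   // consecutive User-agent lines share a group
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    if (line === '') continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') {
      if (!lastWasAgent) appliesToAll = false; // a new group starts here
      if (value === '*') appliesToAll = true;
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      if (key === 'disallow' && value === '/' && appliesToAll) return true;
    }
  }
  return false;
}

console.log(blocksAllCrawlers('User-agent: *\nDisallow: /'));
```

An empty `Disallow:` value means nothing is excluded, which is why the check looks for exactly `/`.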

I happened to know that the Encoded Archival Description XML files for the finding aids are available online. These XML files are the source data for the HTML view that you see in your browser. I’ve been meaning to try out nodejs and jQuery for ~~scraping~~ harvesting for some time, so I put together a short program that pulls down the XML, which I pushed up to GitHub as lc-findingaids. There aren’t enough XML files for performance to be any kind of problem. The main thing I discovered is that node + jquery using jsdom is a really nice environment for scraping. Here’s an example of using jsdom to print out the title for the Library of Congress homepage:

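var jsdom = require('jsdom');

// jsdom.env is the callback-style API from older jsdom releases.
// It fetches the page, injects the listed scripts (jQuery here),
// and then hands back a window object to work with.
jsdom.env('http://www.loc.gov', ['http://code.jquery.com/jquery-1.5.min.js'], function (errors, window) {
  // window.$ is the injected jQuery; print the text of the <title> element
  console.log(window.$('title').text());
});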

The beauty of using jsdom and jquery here is that it works on the HTML that is most commonly found on the Web: broken HTML or tag soup. Yes, there are other tools for this, but if you are already using jQuery to interact with the DOM it’s particularly friendly. Yes, I guess I ignored the robots.txt file, but I was gentle and only retrieved a page at a time, and the server responded quite happily…which bodes well for removing the robots exclusion, or at least relaxing it.

As the README.md states, these XML files are in the public domain…which may or may not help you. If you are outside the US you might decide not to test the legal waters. If you are in the US maybe you will feel like it’s ok to use them to build some web app that helps visualize the data in some new way, like what the SNAC folks have done. But can you make your app available to the world on the World Wide Web? I’ve been slow to realize the importance of this issue to GLAM institutions. Here’s to hoping we can get some clarity on the licensing issues in the near future, and for more success stories like Cooper-Hewitt’s.