The Archive as Data Platform

Yesterday Wikileaks announced the availability of a new collection, the Carter Cables, which are a new addition to the Public Library of US Diplomacy (PlusD). One thing in particular in the announcement caught my attention:

The Carter Cables were obtained by WikiLeaks through the process described here after formal declassification by the US National Archives and Records Administration earlier this year.

If you follow the link you can see that this content was obtained in a similar manner to the Kissinger Files, which were released just over a year ago. Perhaps this has already been noted, but I hadn’t noticed before that the Kissinger Files (the largest Wikileaks release to date) were not leaked to Wikileaks, but were legitimately obtained directly from NARA’s website:

Most of the records were reviewed by the United States Department of State’s systematic 25-year declassification process. At review, the records were assessed and either declassified or kept classified with some or all of the metadata records declassified. Both sets of records were then subject to an additional review by the National Archives and Records Administration (NARA). Once believed to be releasable, they were placed as individual PDFs at the National Archives as part of their Central Foreign Policy Files collection.

The Central Foreign Policy Files are a series from the General Records of the Department of State record group. Anyone with a web browser can view these documents on NARA’s Access to Archival Databases website. If you try to access them you’ll notice that the series is broken up into 15 separate files. Each file is a set of documents that can be searched individually. There’s no way to browse the contents of a file, series or the entire group: you must do a search and click through each of the results (more on this in a moment).

The form in which these documents were held at NARA was as 1.7 million individual PDFs. To prepare these documents for integration into the PlusD collection, WikiLeaks obtained and reverse-engineered all 1.7 million PDFs and performed a detailed analysis of individual fields, developed sophisticated technical systems to deal with the complex and voluminous data and corrected a great many errors introduced by NARA, the State Department or its diplomats, for example harmonizing the many different ways in which departments, capitals and people’s names were spelt.

It would be super to hear more details about their process for doing this work. I think archives could potentially learn a lot about how to enhance their own workflows for doing this kind of work at scale.

And yet I think there is another lesson here in this story. It’s actually important to look at this PlusD work as a success story for NARA…and one that can potentially be improved upon. I mentioned above that it doesn’t appear to be possible to browse a list of documents and that you must do a search. If you do a search and click on one of the documents you’ll notice you get a URL like this:

http://aad.archives.gov/aad/createpdf?rid=99311&dt=2472&dl=1345

And if you browse to another you’ll see something like:

http://aad.archives.gov/aad/createpdf?rid=841&dt=2472&dl=1345

Do you see the pattern? Yup, the rid appears to be a record number: an integer that starts at 1 and simply counts up until you’ve got to the last one for that file, in this case 155278.

It turns out the other dt and dl parameters change for each file, but they are easily determined by looking at the overview page for the series. Here they are if you are curious:

http://aad.archives.gov/aad/createpdf?rid=&dt=2472&dl=1345
http://aad.archives.gov/aad/createpdf?rid=&dt=2473&dl=1348
http://aad.archives.gov/aad/createpdf?rid=&dt=2474&dl=1345
http://aad.archives.gov/aad/createpdf?rid=&dt=2475&dl=1348
http://aad.archives.gov/aad/createpdf?rid=&dt=2492&dl=1346
http://aad.archives.gov/aad/createpdf?rid=&dt=2493&dl=1347
http://aad.archives.gov/aad/createpdf?rid=&dt=2476&dl=1345
http://aad.archives.gov/aad/createpdf?rid=&dt=2477&dl=1348
http://aad.archives.gov/aad/createpdf?rid=&dt=2494&dl=1346
http://aad.archives.gov/aad/createpdf?rid=&dt=2495&dl=1347
http://aad.archives.gov/aad/createpdf?rid=&dt=2082&dl=1345
http://aad.archives.gov/aad/createpdf?rid=&dt=2083&dl=1348
http://aad.archives.gov/aad/createpdf?rid=&dt=2084&dl=1346
http://aad.archives.gov/aad/createpdf?rid=&dt=2085&dl=1347
http://aad.archives.gov/aad/createpdf?rid=&dt=2532&dl=1629
http://aad.archives.gov/aad/createpdf?rid=&dt=2533&dl=1630
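
To make the URL pattern concrete, here is a minimal sketch in Python that pulls the rid, dt and dl parameters out of one of these URLs and builds the URL for any record number. The function names are my own, and I'm assuming the query string looks exactly as it does in the URLs above; nothing here comes from NARA documentation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint as it appears in the URLs listed above.
BASE = "http://aad.archives.gov/aad/createpdf"


def parse_record_url(url):
    """Split a createpdf URL into its rid, dt and dl query parameters."""
    # keep_blank_values lets the template URLs with an empty rid parse too.
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return {key: query[key][0] for key in ("rid", "dt", "dl")}


def record_url(rid, dt, dl):
    """Build the URL for record number rid in the file identified by dt/dl."""
    return BASE + "?" + urlencode({"rid": rid, "dt": dt, "dl": dl})
```

For example, record_url(841, 2472, 1345) gives back the second URL shown earlier.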

Of course it would be trivial to write a harvesting script to pull down the ~380 gigabytes of PDFs by creating a loop with a counter and using one of the many, many HTTP libraries, maybe even with a bit of sleeping in between requests to be nice to the NARA website. I suspect that this is how Wikileaks were able to obtain the documents.
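
Here is roughly what I have in mind, as a sketch in Python using only the standard library. To be clear, this is my guess at the general shape of such a script, not how Wikileaks actually did it. The function names, the output file naming and the one second delay are all my own assumptions, and there is no error handling for failed requests or gaps in the record numbers.

```python
import time
from pathlib import Path
from urllib.request import urlopen

# URL pattern taken from the examples above; rid is the record counter.
URL_TEMPLATE = "http://aad.archives.gov/aad/createpdf?rid={rid}&dt={dt}&dl={dl}"


def fetch(url):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def harvest(dt, dl, last_rid, out_dir, delay=1.0, fetch=fetch, sleep=time.sleep):
    """Save records 1..last_rid of one file as PDFs, pausing between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for rid in range(1, last_rid + 1):
        target = out / f"{dt}-{rid}.pdf"  # hypothetical naming scheme
        if target.exists():
            continue  # already downloaded, so a rerun picks up where it stopped
        target.write_bytes(fetch(URL_TEMPLATE.format(rid=rid, dt=dt, dl=dl)))
        saved += 1
        sleep(delay)  # be nice to the NARA website
    return saved
```

Calling harvest(2472, 1345, 155278, "pdfs") would walk the first file from record 1 to the last one. The fetch and sleep arguments are only there so the loop can be tried out without touching the network.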

But, in an ideal world, this sort of URL inspection shouldn’t be necessary, right? Also, perhaps it could be done in such a way that the burden of distributing the data doesn’t fall on NARA alone? It feels like a bit of an accident that it’s possible to download the data in bulk from NARA’s website this way. But it’s an accident that’s good for access.

What if, instead of trying to build the ultimate user experience for archival content, archives focused first and foremost on providing simple access to the underlying data? I’m thinking of the sort of work Carl Malamud has been doing for years at public.resource.org. With a solid data foundation like that, and simple mechanisms for monitoring the archive for new accessions, it would then be possible to layer other applications on top, both within the enterprise and (hopefully) at places external to the archive, that provide views into the holdings.

I imagine this might sound like ceding the responsibility of the archive to some. It may also sound a bit dangerous to those that are concerned about connecting up public data that is currently unconnected. I’m certainly not suggesting that user experience and privacy aren’t important. But I think Cassie is right:

@edsu Yes it is; I think more participation from independent actors in archival access is a great thing we need more of

Cassie Findlay (@CassPF), April 25, 2014

I imagine there are some that feel that associating this idea of the archive as data platform with the Wikileaks project might be counterproductive to an otherwise good idea. I certainly paused before hitting publish on this blog post, given the continued sensitivity around the issue of Wikileaks. But as other archivists have noted, there is a great deal to be learned from the phenomenon that is Wikileaks. Open and respectful conversations about what is happening are important, right?

Most of all I think it’s important that we don’t look at this bulk access and distant reading of archival material as a threat to the archive. Researchers should feel that downloading data from the archive is a legitimate activity. Where possible they should be given easy and efficient ways to do it. Archives need environments like OpenGov NSW (thanks Cassie) and the Government Printing Office’s Bulk Data website (see this press release about the Federal Register) where this activity can take place, and where a dialogue can happen around it.

Update: May 8, 2014

Alexa O’Brien’s interview on May 6th with Sarah Harrison of Wikileaks at re:publica14 touched on lots of issues related to Wikileaks the archive. In particular, the discussion of redaction, accessibility and Wikileaks’ role in publishing declassified information for others (including journalists) was quite relevant to the topic of this blog post.