On Snowden and Archival Ethics

Much like you I’ve been watching the evolving NSA surveillance story following the whistle-blowing by former government contractor Edward Snowden. Watching isn’t really the right word…I’ve been glued to it. I don’t have a particularly unique opinion or observation to make about the leak, or the ensuing dialogue — but I suppose calling it “whistle-blowing” best summarizes where I stand. I just wanted to share a thought I had on the train to work, after reading Ethan Zuckerman’s excellent Me and My Metadata – Thoughts on Online Surveillance. I tried to fit it into 140 characters, but it didn’t quite work.

Zuckerman’s post is basically about the value of metadata in research. He opened up his Gmail archive to his students, and they created Immersion, which lets you visualize his network of correspondence using only email metadata (From, To, Cc and Date). Zuckerman goes on to demonstrate what this visualization says about him. The first comment on the post, by Jonathan O’Donnell, has a nice list of related research on the importance of metadata to discovery. Zuckerman’s work immediately reminded me of Sudheendra Hangal’s work on MUSE at Stanford, which he and his team have written about extensively. MUSE is a tool that enables scholarly research using email archives. It was then that I realized why I’ve been so fascinated with the Snowden/NSA story.

Over the past few years there has been increasing awareness in the archival community about the role of forensics tools in digital preservation, curation and research. Matt Kirschenbaum‘s Mechanisms had a big role in documenting, and spreading the word about how forensics tools can be (and are) used in the digital humanities. The CLIR report Digital Forensics and Born-Digital Content in Cultural Heritage Collections (co-authored by Kirschenbaum) brought the topic directly to cultural heritage organizations, as did the AIMS report. If you’re not convinced, a search in Google Scholar shows just how prevalent and timely the topic is. The introduction to the CLIR report has a nice summary of why forensics tools are of interest to archives that are dealing with born digital content:

The same forensics software that indexes a criminal suspect’s hard drive allows the archivist to prepare a comprehensive manifest of the electronic files a donor has turned over for accession; the same hardware that allows the forensics investigator to create an algorithmically authenticated “image” of a file system allows the archivist to ensure the integrity of digital content once captured from its source media; the same data-recovery procedures that allow the specialist to discover, recover, and present as trial evidence an “erased” file may allow a scholar to reconstruct a lost or inadvertently deleted version of an electronic manuscript—and do so with enough confidence to stake reputation and career.

Digital forensics therefore offers archivists, as well as an archive’s patrons, new tools, new methodologies, and new capabilities. Yet as even this brief description must suggest, its methods and outcomes raise important legal, ethical, and hermeneutical questions about the nature of the cultural record, the boundaries between public and private knowledge, and the roles and responsibilities of donor, archivist, and the public in a new technological era.

When collections are donated to an archive, there is usually a gift agreement between the donor and the archival organization, which documents how the collection of material can be used. For example, it is fairly common for there to be a period where portions (or all) of the archive are kept dark. Less often, gift agreements stipulate that the collection must be made open on the Web, and sometimes money changes hands. Born digital content in archives is new enough that cultural heritage organizations are still grappling with the best way to talk to their donors about donating it.

There has been a bit of attention to sharing best practices about born digital content between organizations, and rising awareness about the sorts of issues that need to be considered. As a software developer tasked with building applications that work across these archival collections, I’ve found the special-snowflake nature of these gift agreements to be a bit of an annoyance. If every collection of born digital content has slightly different stipulations about what, when and how content can be used, building access applications becomes difficult. The situation is compounded somewhat because the gift agreements themselves aren’t shared publicly (at least at my place of work), so you don’t even know what you can and can’t do. I’ve observed that this has a tendency to derail conversations about access to born digital content. And access is an essential ingredient in ensuring the long term preservation of digital content. It’s not like you can put a digital file on a server, come back in 25 (or even 5) years, and expect to open and use it.

So, what does this have to do with Zuckerman’s post, and the intrinsic value of metadata to the NSA? When Zuckerman provided his students with access to his email archive he did it in the context of a particular trust scenario. A gift agreement in an archive serves the same purpose, by documenting a trust scenario between the donor and the institution that is receiving the gift. The NSA allegedly has been collecting information from Verizon, Facebook, Google, et al. outside of the trust scenario provided by the Fourth Amendment to the Constitution. Looked at this way, the special-snowflakism of gift agreements doesn’t seem so annoying any more. It is through these agreements that cultural heritage organizations establish their authenticity and trust, and it is by them that they become a desirable place to deposit born digital content. If they have to be unique per donor, and this hampers unified access to born digital collections, that seems like a price worth paying. Ideally there would be a standard set of considerations to use when putting a gift agreement together. But if we can’t fit everyone into the same framework, maybe that’s not such a bad thing.

The other commonplace thing that strikes me is that the same technology that can be used for good, say digital humanities research or forensic discovery, can also be used for ill. Having a strong sense of ethics, as a professional, as a citizen, and as a human being, is extremely important to establishing the context in which technology is used. Negotiating among the three can sometimes require finesse, and in the case of Snowden, courage.

It’s your data. It’s your life.

I wrote briefly about the Open Science Champions of Change event last week, but almost a week later the impassioned message that Kathy Giusti delivered is still with me. Giusti is the Founder and Chief Executive Officer of the Multiple Myeloma Research Foundation (MMRF), and is herself battling the fatal disease. In her introduction, and later during the panel discussion, she made a strong case for patients being able to opt in to open access data sharing. I thought I’d point to these two moments in the two-hour video stream, and transcribe what she said:

Patients don’t know that data is not shared. They don’t know … If patients knew how long it took to publish, if they knew, it’s your tissue, it’s your data, it’s your life. Believe me, patients would be the first force to start really changing the culture and backing everybody around open access.

Q: A lot of people when they hear about the sharing of clinical data talk about concerns of privacy. How do we start to handle those concerns, and how do we actually encourage patients to contribute their data in meaningful ways to research, so that we can actually continue to drive the successes that we are seeing here?

Giusti: When you’re a patient, and you’re living with a fatal disease, you don’t lie awake and wonder what happens with my data. If patients understand the role they can play in personalizing their own risk taking abilities … We all do this when we work with our banks. There’s certain information that we’re always giving out when we go online, and there’s certain information that we always keep private. And in a future world that’s what patients are going to do. So when you start talking with the patients, and you ask them: “Would you be willing to share your information?” It just depends on the patient, and it depends on how much they would be willing to give. For someone like me, I’m an identical twin, the disease of Myeloma skews in my family, my grandfather had it, my identical twin does not, I would be crazy not to be published … and I’ve done it, and so has my twin … biopsies, whatever we need. Put it in the public domain. I know everybody isn’t going to be like me, but even if you get us half your information we’re making progress, and we can start to match you with the right types of researchers and clinicians that care.

dcat:distribution considered helpful

The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they have embedded in their dataset webpages. As an example here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described using the dcat:distribution property.

     @prefix dcat: <http://www.w3.org/ns/dcat#> .

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little Python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).
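For the curious, the extraction step can be approximated without a full RDFa processor. This stripped-down sketch (not my actual rdflib-based crawler, and a real harvester should use a proper RDFa parser) just scans the HTML for rel/property attributes that mention dcat:distribution:

```python
from html.parser import HTMLParser

class DistributionExtractor(HTMLParser):
    """Very simplified RDFa scrape: collect href/resource targets on
    elements whose rel or property attribute names dcat:distribution."""
    def __init__(self):
        super().__init__()
        self.distributions = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        terms = (a.get("rel", "") + " " + a.get("property", "")).split()
        if "dcat:distribution" in terms:
            target = a.get("href") or a.get("resource")
            if target:
                self.distributions.append(target)

html = '<a rel="dcat:distribution" href="http://www.dft.gov.uk/data/download/10022/GZ">download</a>'
parser = DistributionExtractor()
parser.feed(html)
print(parser.distributions)
```

The payoff is that any third party can discover the download links mechanically, without screen-scraping heuristics tuned to data.gov.uk’s page layout.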

So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the Web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable third-party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer: copies are important for digital preservation, for when the unfortunate happens. The subtlety lies in knowing where the copies of a particular dataset are, whether in the enterprise or in a distributed system like the Web, and in the mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations in other research to the actual datasets they are based upon. And lastly, aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.

Consider this practical example: as someone interested in digital preservation I’d like to be able to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the referenced dataset, I was able to write a small Python program to crawl the datasets and log the media type and HTTP status code along the way, to generate some results like:
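That crawl is little more than a HEAD request per distribution URL and a tally of the Content-Type headers that come back. Here is a sketch along those lines (function names are mine, and the demo uses canned responses rather than live data.gov.uk URLs):

```python
import urllib.request
from collections import Counter

def head(url, timeout=30):
    """Issue an HTTP HEAD request; return (status, media type).
    Real crawling needs error handling and politeness delays."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.headers.get_content_type()

def tally(results):
    """Count media types from (status, media type) pairs, ignoring
    anything that didn't come back with a 200 OK."""
    return Counter(ctype for status, ctype in results if status == 200)

# offline demo with canned results instead of live requests
sample = [(200, "text/html"), (200, "application/vnd.ms-excel"),
          (404, "text/html"), (200, "text/html")]
for mtype, count in tally(sample).most_common():
    print(mtype, count)
```

Run over the full collection of dcat:distribution targets, this is what produced the table below.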

media type / number of datasets
text/html 5898
application/octet-stream 1266
application/vnd.ms-excel 874
text/plain 234
text/csv 220
application/pdf 167
text/xml 81
text/comma-separated-values 51
application/x-zip-compressed 36
application/vnd.ms-powerpoint 33
application/zip 31
application/x-msexcel 28
application/excel 21
application/xml 18
text/x-comma-separated-values 14
application/x-gzip 13
application/x-bittorrent 12
application/octet_stream 12
application/msword 10
application/force-download 10
application/x-vnd.oasis.opendocument.presentation 9
application/x-octet-stream 9
application/vnd.excel 9
application/x-unknown-content-type 6
application/xhtml+xml 6
application/vnd.msexcel 5
application/vnd.google-earth.kml+xml kml 5
application/octetstream 4
application/csv 3
vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/octet-string 2
image/jpeg 1
image/gif 1
application/x-mspowerpoint 1
application/vnd.google-earth.kml+xml 1
application/powerpoint 1
application/msexcel 1

Granted, some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the object of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.
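For example, under the draft vocabulary a file-level distribution would be distinguished from a splash page by typing it explicitly; something like the following (a sketch of what the typing could look like — the dcat:Download triple is my illustration, not something data.gov.uk currently publishes):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<http://data.gov.uk/id/dataset/anonymised_mot_test>
    dcat:distribution <http://www.dft.gov.uk/data/download/10022/GZ> .

<http://www.dft.gov.uk/data/download/10022/GZ> a dcat:Download .
```

With that type asserted, a consumer could skip splash pages and fetch only the resources actually claimed to be downloadable data.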

The application/octet-stream media type (and its variants) is also kind of useless for these purposes, since it basically says only that the dataset is made of bits. It would be more helpful if the servers in these cases sent something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream, to figure out just what this data is, whether it is valid, etc.
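To make the idea concrete, here is a toy signature check in Python, a far cry from what JHOVE or DROID actually do, but enough to show that a few leading “magic” bytes often identify a bitstream better than the server’s Content-Type header:

```python
# well-known magic numbers; real identification tools match against
# registries of hundreds of signatures (e.g. PRONOM, used by DROID)
SIGNATURES = [
    (b"%PDF",             "application/pdf"),
    (b"PK\x03\x04",       "application/zip"),
    (b"\x1f\x8b",         "application/gzip"),
    (b"\xd0\xcf\x11\xe0", "OLE2 container (e.g. legacy MS Office)"),
]

def sniff(data: bytes) -> str:
    """Identify a bitstream by its leading magic bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"

print(sniff(b"%PDF-1.4 ..."))    # application/pdf
print(sniff(b"PK\x03\x04..."))   # application/zip
```

A service running checks like this over harvested distributions could turn many of those application/octet-stream rows into something meaningful.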

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format-analysis application can be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information about the whereabouts of datasets, while a format verification service can be built as an orthogonal application. I think this basically fits the RESTful style of Curation Micro-services being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regularly scheduled programming. You may have noticed that the URI for the dataset being described in the RDF, http://data.gov.uk/id/dataset/anonymised_mot_test, is different from the URL of the HTML page that describes it.
This is understandable given some of the dictums about Linked Data and separating the Information Resource from the Non-Information Resource. But it would be nice if the URI resolved via a 303 redirect to the HTML, as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset, it’s important that it resolves, so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.
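The follow-your-nose mechanics are simple enough to sketch. Here the Web is mocked as a dictionary, and the HTML URL in the demo is my guess at the pattern data.gov.uk might use, not something I’ve verified resolves:

```python
def resolve(uri, responses, max_hops=5):
    """Follow-your-nose resolution: chase 303 See Other redirects from
    a Linked Data URI until a 200 OK document is reached. `responses`
    stands in for the Web, mapping URI -> (status, location or body)."""
    for _ in range(max_hops):
        status, value = responses[uri]
        if status == 303:
            uri = value      # Location header points at a description
        elif status == 200:
            return uri       # the document describing the resource
        else:
            raise ValueError(f"unexpected status {status} for {uri}")
    raise ValueError("too many redirects")

# hypothetical wiring: the dataset URI 303s to an HTML description
web = {
    "http://data.gov.uk/id/dataset/anonymised_mot_test":
        (303, "http://data.gov.uk/dataset/anonymised_mot_test"),
    "http://data.gov.uk/dataset/anonymised_mot_test":
        (200, "<html>...</html>"),
}
print(resolve("http://data.gov.uk/id/dataset/anonymised_mot_test", web))
```

Today the first lookup fails before the chain even starts, which is exactly the problem: an identifier that doesn’t resolve is a dead end for both people and crawlers.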