Moving On to MITH

I’m very excited to say that I am going to be joining the Maryland Institute for Technology in the Humanities to work as their Lead Developer. Apart from the super impressive folks at MITH whom I will be joining, I am really looking forward to helping them continue to explore how digital media can be used in interdisciplinary scholarship and learning…and to build and sustain tools that make and remake the work of digital humanities research. MITH’s role in the digital humanities field, and its connections on campus and to institutions and projects around the world, were just too good an opportunity to pass up.

I’m extremely grateful to my colleagues at the Library of Congress who have made working there over the last 8 years such a transformative experience. It’s a bit of a cliché that there are challenges to working in the .gov space. But at the end of the day, I am so proud to have been part of the mission to “further the progress of knowledge and creativity for the benefit of the American people”. It’s a tall order that took me in a few directions that ended up bearing fruit, and some others that didn’t. But that’s always the way of things, right? Thanks to all of you (you know who you are) who made it so rewarding…and so much fun.

When I wrote recently about the importance of broken world thinking I had no idea I would be leaving LC to join MITH in a few weeks. I feel incredibly lucky to be able to continue to work in the still evolving space of digital curation and preservation, with a renewed focus on putting data to use, and helping make the world a better place.

Oh, and I still plan on writing here … so, till the next episode.


One Big Archive

Several months ago Hillel Arnold asked me to participate in a panel at the Society of American Archivists, A Trickle Becomes a Flood: Agency, Ethics and Information. The description is probably worth repeating here:

From the Internet activism of Aaron Swartz to Wikileaks’ release of confidential U.S. diplomatic cables, numerous events in recent years have challenged the scope and implications of privacy, confidentiality, and access for archives and archivists. With them comes the opportunity, and perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems. How are we to engage with these questions as archivists and citizens, and what are their implications for user access?

It is a broad and important topic. I had 20 minutes to speak. Here’s what I said.

Thanks to Dan Chudnov for his One Big Library idea, which I’ve adapted/appropriated.


Thanks for inviting me to participate in this panel today. I’ve been a Society of American Archivists member for just two years, and this is my first SAA. So go easy on me, I’m a bit of a newb. I went to “Library School” almost 20 years ago, and was either blind to it or just totally missed out on the rich literature of the archive. As I began to work in the area of digital preservation 8 years ago, several friends and mentors encouraged me to explore how thinking about the archive can inform digital repository systems. So here I am today. It’s nice to be here with you.

One thing from the panel description I’d like to focus on is:

the opportunity, perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems

I’d like to examine a specific, recent controversy, in order to explore how the dynamics of power and ethics impact the archive. I’m going to stretch our notion of what the archive is, perhaps to its breaking point, but I promise to bring it back to its normal shape at the end. I hope to highlight just how important archival thinking and community are to the landscape of the Web.

Ethics

Perhaps I’ve just been perseverating about what to talk about today, but it seems like the news has been full of stories about the role of ethics in information systems lately. One that is still fresh in most people’s minds is the recent debate about Facebook’s emotional contagion study, in which users’ news feeds were directly manipulated to test theories about how emotions are transferred in social media.

As Tim Carmody pointed out, this is significant not only for the individuals that had their news feeds tampered with, but also for the individuals who posted content to Facebook, only to have it manipulated. He asks:

What if you were one of the people whose posts were filtered because your keywords were too happy, too angry, or too sad?

However, much of the debate centered on the Terms of Service that Facebook users agree to when they use that service. Did the ToS allow Facebook and Cornell to conduct this experiment?

At the moment there appears to be language in the Facebook Terms of Service that allows user-contributed data to be used for research purposes. But that language was not present in the 2012 version of the ToS that was in effect when the experiment was conducted. And just because the fine print of a ToS document allows for something doesn’t necessarily mean it is ethical. Are Terms of Service documents the right way to be talking about the ethical concerns and decisions of an organization? Aren’t they just an artifact of the ethical decisions and conversations that have already happened?

Also, it appears that the experiment itself was conducted before Cornell University’s Institutional Review Board had approved the study. Are IRBs functioning the way we want?

Power

The Facebook controversy got an added dimension when dots were connected between one of the Cornell researchers involved in the study and the Department of Defense funded Minerva Initiative, which,

aims to determine “the critical mass (tipping point)” of social contagions by studying their “digital traces” in the cases of “the 2011 Egyptian revolution, the 2011 Russian Duma elections, the 2012 Nigerian fuel subsidy crisis and the 2013 Gazi park protests in Turkey.”

I’m currently reading Betty Medsger’s excellent book about how a group of ordinary peace activists in 1971 came to steal files from a small FBI office in Media, Pennsylvania, files that provided evidence of the FBI’s Counterintelligence Program (COINTELPRO). COINTELPRO was an illegal program for “surveilling, infiltrating, discrediting, and disrupting domestic political organizations”. Medsger was the Washington Post journalist who received the documents. It’s hard not to see the parallels to today, where prominent Muslim-American academics and politicians are having their email monitored by the FBI and the NSA, where the NSA is collecting millions of phone records from Verizon every day, and where the NSA’s XKeyScore allows analysts to search through the emails, online chats and browsing histories of millions of people.

Both Facebook and Cornell University later denied that this particular study was funded through the Defense Department’s Minerva Initiative, but did not deny that Professor Hancock had previously received funding from it. I don’t think you need to be an information scientist (although many of you are) to see how one person’s study of social contagion might inform another study of social contagion by that very same person. But information scientists have bills to pay just like anyone else. I imagine if many of us traced our sources of income back far enough we would find the Defense Department as a source. Still, the degrees of separation matter, don’t they?

Big Data

In her excellent post What does the Facebook Experiment Teach Us, social media researcher Danah Boyd explores how the debate about Facebook’s research practices surfaces a general unease about the so called era of big data:

The more I read people’s reactions to this study, the more I’ve started to think the outrage has nothing to do with the study at all. There is a growing amount of negative sentiment towards Facebook and other companies that collect and use data about people. In short, there’s anger at the practice of big data. This paper provided ammunition for people’s anger because it’s so hard to talk about harm in the abstract.

Certainly part of this anxiety is also the result of what we have learned our own government is doing in willing (and unwilling) collaboration with these companies, based on documents leaked by whistleblower Edward Snowden, and the subsequent journalism from the Guardian and the Washington Post that won them the Pulitzer Prize for Public Service. But as Maciej Ceglowski argues in his (awesome) The Internet With a Human Face:

You could argue (and I do) that this data is actually safer in government hands. In the US, at least, there’s no restriction on what companies can do with our private information, while there are stringent limits (in theory) on our government. And the NSA’s servers are certainly less likely to get hacked by some kid in Minsk.

But I understand that the government has guns and police and the power to put us in jail. It is legitimately frightening to have the government spying on all its citizens. What I take issue with is the idea that you can have different kinds of mass surveillance.

If these vast databases are valuable enough, it doesn’t matter who they belong to. The government will always find a way to query them. Who pays for the servers is just an implementation detail.

Ceglowski makes the case that we need more regulation around the collection of behavioural data, specifically what is collected and how long it is kept. This should be starting to sound familiar.

Boyd takes a slightly more pragmatic tack by pointing out that social media companies need to establish ethics boards that allow users and scholars to enter into conversation with employees that have insight into how things work (policy, algorithms, etc). We need a bit more nuance than “Don’t be evil”. We need to know how the companies that run the websites where we put our content are going to behave, or at least how they would like to behave, and what their moral compass is. I think we’ve seen this to some degree in the response of Google to the Right to be Forgotten law in the EU (a topic for another panel). But I think Boyd is right, that much more coordinated and collaborative work could be done in this area.

The Archive

So, what is the relationship between big data and the archive? Or put a bit more concretely: given the right context, would you consider Facebook content to be archival records?

If you are wearing the right colored glasses, or perhaps Hugh Taylor’s spectacles, I think you will likely say yes, or at least maybe.

Quoting from SAA’s What is an Archive:

An archive is a place where people go to find information. But rather than gathering information from books as you would in a library, people who do research in archives often gather firsthand facts, data, and evidence from letters, reports, notes, memos, photographs, audio and video recordings.

Perhaps the content associated with Facebook’s user, business and community group accounts is a form of provenance or fonds?

Does anyone here use Facebook? Have you downloaded your Facebook data?

If an individual downloads their Facebook content and donates it to an archive along with other material, does this Facebook data become part of the archival record? How can researchers then access these records? What constraints are in place that govern who can see them, and when? These are familiar questions for the archivist, aren’t they?

Some may argue that the traditional archive doesn’t scale in the same way that big data behemoths like Facebook or Google do. I think in one sense they are right, but this is a feature, not a bug.

If an archive accessions an individual’s Facebook data there is an opportunity to talk with the donor about how they would like their records to be used, by whom, and when. Think of an archival fonds as small data. When you add up enough of these fonds and put them on the Web, you get big data. But the scope and contents of each fonds fit in the brain of a person who can reason ethically about them. That is a good thing: it is at the heart of our profession, and we must not lose it.

When you consider theories of the postcustodial archive the scope of the archive enlarges greatly. Is there a role for the archivist and archival thinking and practices at Facebook itself? I think there is. Could archivists help balance the needs of researchers and records creators, and foster communication around the ethical use of these records? I think so. Could we help donors think about how they want their content to fit into the Web (when the time comes) by encouraging the use of Creative Commons licenses? Could we help organizations think about how they allow people to download and/or delete their data, and how to package it up so it stands a chance of being readable?

Is it useful to look at the Web as One Big Archive, where assets are moved from one archive to another, where researcher rights are balanced with the rights of record creators? Where long term access is taken more seriously in some pockets than in others? I hope so. I think it’s what we do.

I haven’t mentioned Aaron, who worked so hard for these things. I could’ve spent the entire time talking about his work helping to build the Web, and interrogating power. Maybe I should have…but I guess I kind of did. There isn’t any time left to help Aaron. But we still have time to help make the Web a better place.


Paper Work

The connective quality of written traces is still more visible in the most despised of all ethnographic objects: the file or the record. The “rationalization” granted to bureaucracy since Hegel and Weber has been attributed by mistake to the “mind” of (Prussian) bureaucrats. It is all in the files themselves.

A bureau is, in many ways, and more and more every year, a small laboratory in which many elements can be connected together just because their scale and nature has been averaged out: legal texts, specifications, standards, payrolls, maps, surveys (ever since the Norman conquest, as shown by Clanchy, 1979). Economics, politics, sociology, hard sciences, do not come into contact through the grandiose entrance of “interdisciplinarity” but through the back door of the file.

The “cracy” of bureaucracy is mysterious and hard to study, but the “bureau” is something that can be empirically studied, and which explains, because of its structure, why some power is given to an average mind just by looking at files: domains which are far apart become literally inches apart; domains which are convoluted and hidden, become flat; thousands of occurrences can be looked at synoptically.

More importantly, once files start being gathered everywhere to insure some two-way circulation of immutable mobiles, they can be arranged in cascade: files of files can be generated and this process can be continued until a few men consider millions as if they were in the palms of their hands. Common sense ironically makes fun of these “gratte papiers” and “paper shufflers”, and often wonders what all this “red tape” is for; but the same question should be asked of the rest of science and technology. In our cultures “paper shuffling” is the source of an essential power, that constantly escapes attention since its materiality is ignored.

from Visualization and Cognition: Drawing Things Together by Bruno Latour


why @congressedits?

Note: as with all the content on this blog, this post reflects my own thoughts about a personal project, and not the opinions or activities of my employer.

Two days ago a retweet from my friend Ian Davis scrolled past in my Twitter stream:

The simplicity of combining Wikipedia and Twitter in this way immediately struck me as a potentially useful transparency tool. So using my experience on a previous side project I quickly put together a short program that listens to all major language Wikipedias for anonymous edits from Congressional IP address ranges (thanks Josh) … and tweets them.
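For the curious, the core of the bot is simple enough to sketch. What follows is a rough, illustrative Python version of the idea, not the actual code: it reads Wikimedia’s EventStreams recent-changes feed (a newer interface than the one available when the bot was written), checks whether an anonymous editor’s IP address falls inside a configured range, and stubs out the tweeting step. The IP range shown is only a placeholder.

```python
import json
import ipaddress

import requests

# Placeholder range; the real bot reads the Congressional IP ranges
# from a configuration file that others helped assemble.
CONGRESS_NETWORKS = [ipaddress.ip_network("143.231.0.0/16")]


def is_congressional(user):
    """Anonymous edits report an IP address in the user field."""
    try:
        addr = ipaddress.ip_address(user)
    except ValueError:
        return False  # a registered username, not an anonymous edit
    return any(addr in net for net in CONGRESS_NETWORKS)


def watch():
    # One server-sent event per change, across all Wikimedia wikis;
    # each "data:" line carries a JSON recent-change record.
    resp = requests.get(
        "https://stream.wikimedia.org/v2/stream/recentchange",
        stream=True,
        headers={"Accept": "text/event-stream"},
    )
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        change = json.loads(line[len(b"data: "):])
        if change.get("type") == "edit" and is_congressional(change.get("user", "")):
            # The real bot composes and posts a tweet here.
            print(change["server_name"], change["title"], change["user"])


if __name__ == "__main__":
    watch()
```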

In less than 48 hours the @congressedits Twitter account had more than 3,000 followers. My friend Nick set up gccaedits for Canada using the same software … and @wikiAssemblee (France) and @RiksdagWikiEdit (Sweden) were quick to follow.


Watching the followers rise, and the flood of tweets from them, brought home something that I believed intellectually but hadn’t felt quite so viscerally before. There is an incredible yearning in this country and around the world to use technology to provide more transparency about our democracies.

Sure, there were tweets and media stories that belittled the few edits that had been found so far. But by and large people on Twitter have been encouraging, supportive and above all interested in what their elected representatives are doing. Despite historically low approval ratings for Congress, people still care deeply about our democracies, our principles, and the dream of a government of the people, by the people and for the people.

We desperately want to be part of a more informed citizenry, that engages with our local communities, sees the world as our stage, and the World Wide Web as our medium.


Consider this thought experiment. Imagine if our elected representatives and their staffers logged in to Wikipedia, identified themselves much as Dominic McDevitt-Parks (a federal employee at the National Archives) does, and used their knowledge of the issues and local history to help make Wikipedia better? Perhaps in the process they enter into conversation on an article’s talk page with a constituent or a political opponent, and learn something from them, or perhaps compromise? The version history becomes a history of the debate and discussion around a topic. Certainly there are issues of conflict of interest to consider, but we always edit topics we are interested in and knowledgeable about, don’t we?

I think there is often fear that increased transparency can lead to increased criticism of our elected officials. It’s not surprising given the way our political party system and media operate: always looking for scandal, and the salacious story that will push public opinion a point in one direction, to someone’s advantage. This fear encourages us to clamp down, to decrease or obfuscate the transparency we have. We all kinda lose, irrespective of our political leanings, because we are ultimately less informed.


I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

I created @congressedits because I hoped it could engender more, better ideas and tools like it. More thought experiments. More care for our communities and peoples. More understanding, and willingness to talk to each other. More humor. More human.

I’m pretty sure zarkinfrood meant @congressedits figuratively, not literally. As if perhaps @congressedits was emblematic, in its very small way, of something a lot bigger and more important. Let’s not forget that when we see the inevitable mockery and bickering in the media. Don’t forget the big picture. We need transparency in our government more than ever, so we can have healthy debates about the issues that matter. We need to protect and enrich our Internet, and our Web … and to do that we need to positively engage in debate, not tear each other down.

Educate and inform the whole mass of the people. Enable them to see that it is their interest to preserve peace and order, and they will preserve them. And it requires no very high degree of education to convince them of this. They are the only sure reliance for the preservation of our liberty. – Thomas Jefferson

Who knew TJ was a Wikipedian…


MayDay - We Can Fix This Thing

We won’t get our democracy back until we change the way campaigns are funded.

TL;DR: if you were thinking of supporting MayDay and have been putting it off, please act by July 4th. Every contribution helps, and you will only be charged if they hit their 5 million dollar target (2 million to go right now as I write this). Plus, and this is a big plus, you will be able to tell yourself and maybe your grandkids that you helped make real political reform happen in the United States. Oh, and it’s July 4th weekend: what better way to celebrate the independence we have left!

If you are reading my blog you are most likely a fan of Lawrence Lessig and Aaron Swartz’s work on Creative Commons to help create a content ecosystem for the Web that works for its users … you and me.

Before he left us, Aaron convinced Lessig that we need to get to the root of the problem, how political decisions are made in Congress, in order to address macro problems like copyright reform. For a really personal interview with Lessig that covers this evolution in his thinking, how it led to the Granny D inspired midwinter march across New Hampshire, and the recent MayDay effort, check out last week’s episode of The Good Fight podcast.

Or if you haven’t seen it, definitely watch Lessig’s 13 minute TED Talk:

If you are part of the 90% of Americans who think that our government is broken because of the money in politics, please check out MayDay’s effort to crowdsource enough money to fund political campaigns that are committed to legislation that will change it.

It doesn’t matter if you are on the left or the right, if you live in a red or blue state, or honestly whether you live in the United States or not. This is an issue that impacts all of us, and generations to come. We can fix this thing. But we have to try to fix it; we can’t just sit back and expect someone else to fix it for us. Lessig has a plan, and he’s raised 4 million dollars so far (if you include the previous 1 million dollar campaign) from people like you who think he’s on to something. Let’s push MayDay over the 5 million dollar mark and see what happens next!

Afterword

And as my friend Mike reminded me, even if you aren’t sure about the politics, donate in memory of Aaron. He was such an advocate and innovator for the Web, libraries and archives … and he continues to be sorely missed. Catch the recently released documentary about Aaron in movie theaters now, or for free on the Internet Archive since it is Creative Commons licensed:


No Place to Hide

No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State by Glenn Greenwald
My rating: 4 of 5 stars

I think Greenwald’s book is a must read if you have any interest in the Snowden story, and the role of investigative journalism and its relationship to political power and the media. Greenwald is clearly a professional writer: his narrative is both lucid and compelling, and focuses on three areas that roughly correlate to sections of the book.

The first (and most exciting) section of the book goes behind the scenes to look at how Greenwald first came into contact with Snowden, and how he worked to publish his Guardian articles about the NSA wiretapping program. It is a riveting story that provides a lot of insight into what motivated Snowden to do what he did. Snowden comes off as a very ethical, courageous and intelligent individual. Particularly striking were Snowden’s efforts to make sure that the documents were not simply dumped on the Internet, but that journalists had an opportunity to interpret and contextualize them to encourage constructive discussion and debate.

In sixteen hours of barely interrupted reading, I managed to get through only a small fraction of the archive. But as the plane landed in Hong Kong, I knew two things for certain. First, the source was highly sophisticated and politically astute, evident in his recognition of the significance of most of the documents. He was also highly rational. The way he chose, analyzed, and described the thousands of documents I now had in my possession proved that. Second, it would be very difficult to deny his status as a classic whistle-blower. If disclosing proof that top-level national security officials lied outright to Congress about domestic spying programs doesn’t make one indisputably a whistle-blower, then what does?

This section is followed by a quite detailed overview of what the documents revealed about the NSA wiretapping program, and of their significance. If you are like me, and haven’t read all the articles that have been published in the last year, you’ll enjoy this section.

And lastly the book analyzes the relationship between journalism and power in our media organizations, and the role of the independent journalist. The Guardian comes off as quite a progressive and courageous organization. Other media outlets like the New York Times and the Washington Post don’t fare so well. I recently unsubscribed from the Washington Post after vague feelings of uneasiness about their coverage, so it was good to read Greenwald’s pointed critique. Having just spent some time reading Archives Power, I was also struck by the parallels between positivist theories of the archive and of journalism, and by how important it is to recognize how power shapes and influences what we write, or archive.

Every news article is the product of all sorts of highly subjective cultural, nationalistic, and political assumptions. And all journalism serves one faction’s interest or another’s. The relevant distinction is not between journalists who have opinions and those who have none, a category that does not exist. It is between journalists who candidly reveal their opinions and those who conceal them, pretending they have none.

The only reason I withheld the fifth star from my rating is that it would’ve been interesting to know more about Snowden’s departure from Hong Kong, his negotiations to seek asylum, and his relationship with Wikileaks and Sarah Harrison. Maybe that information wasn’t known to Greenwald, but it would’ve been nice to have a summary of what was publicly known.

One thing that No Place to Hide really did for me was underscore the importance of privacy and cryptography on the Web and the Internet. This is particularly relevant today, exactly one year after Greenwald’s first Guardian article was published, and as many people celebrate the anniversary by joining the Reset the Net campaign. I haven’t invested in a signed SSL certificate for inkdroid.org yet, but I’m committing to doing that now. I’ve also recently started using GPGTools with Mail on my Mac. If you are curious about steps you can take, check out the Reset the Net Privacy Pack. In No Place to Hide Greenwald talks quite frankly about how he found cryptography tools difficult to use and understand, how he got help in using them, and how essential these tools are to his work.



RealAudio, AAC and Archivy

A few months ago I happened to read a Pitchfork interview with David Grubbs about his book Records Ruin the Landscape. In the interview Grubbs mentioned how his book was influenced by a 2004 Kenny Goldsmith interview with Henry Flynt…and Pitchfork usefully linked to the interview in the WFMU archive.

You know, books linking to interviews linking to interviews linking to archives, the wondrous beauty and utility of hypertext.

I started listening to the interview on my Mac with Chrome and the latest RealAudio plugin, but after a few minutes it descended into some kind of feedback loop, full of echoes, and became completely unlistenable. This is WFMU, so I thought maybe this was part of the show, but it went on for a while, which seemed a little odd. I tried reloading, thinking it might be some artifact of the stream, but the exact same thing happened again. I noticed a prominent Get Help link right next to the link for listening to the content. I clicked on it and filled out a brief form, not really expecting to hear back.

The WFMU archive view for the interview is sparse but eminently useful.

Unexpectedly, just a few hours later I received an email from Jeff Moore, who wrote that playback problems with RealAudio had been reported before on some items in the archive, and that they were in the process of migrating them to AAC. My report had pushed this particular episode up in the queue, so I could now reload the page and listen to an AAC stream via their Flash player. I guess now that it’s AAC there is probably something that could be done with the HTML audio element to avoid the Flash bit. But I could finally listen to the interview (which, incidentally, is awesome), so I was happy.

I asked Jeff how they were converting the RealAudio, because we have a fair bit of RealAudio lying around at my place of work. He wrote back with some useful notes that I thought I would publish on the Web for others googling how to do this at this particular point in time. I’d be curious to know if you regard RealAudio as a preservation risk, and a good example of a format we ought to be migrating. The playback options seem quite limited, and precarious, but perhaps that’s just my own limited experience.

The whole interaction with WFMU, from discovery, to access, to preservation, to interaction seemed like such a perfect illustration of what the Web can do for archives, and vice-versa.

Jeff’s Notes

The text below is from Jeff’s email to me. Jeff, if you are reading this and don’t really want me quoting you this way, just let me know.

I’m still fine-tuning the process, which is why the whole bulk transcode isn’t done yet. I’m trying to find the sweet spot where I use enough space / bandwidth for the resulting files so that I don’t hear any obvious degradation from the (actually pretty terrible-sounding) Real files, but don’t just burn extra resources with nothing gained.

Our Real files are mostly mono, sampled at 22.05kHz, using a codec current decoders often identify as “Cook”.

I’ve found that ffmpeg does a good job of extracting a WAV file from the Real originals - oh, and since there are two warring projects which each provide a program called ffmpeg, I mean this one:

http://ffmpeg.org/

We’ve been doing our AAC encoding with the Linux version of the Nero AAC Encoder released a few years ago:

http://www.nero.com/enu/company/about-nero/nero-aac-codec.php

…although I’m still investigating alternatives.

One interesting thing I’ve encountered is that a straight AAC re-encoding from the Real file (mono, 22.05k) plays fine as a file on disk, but hasn’t played correctly for me (in the same VLC version) when streamed from Amazon S3. If I convert the mono archive to stereo and AAC-encode that with the Nero encoder, it’s been streaming fine.

Oh, and if you want to transfer tags from the old Real files to any new files, and your transcoding pipeline doesn’t automatically copy tags, note that ffprobe (also from the ffmpeg package) can extract tags from Real files, which you can then stuff back in (with neroAacTag or the tagger of your choice).
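To make that recipe a little more concrete, here is a rough sketch (mine, not Jeff’s) of the pipeline as I understand it, wrapped in Python. The ffmpeg options are standard; the Nero encoder flags are from memory and worth checking against its documentation; the filename is made up.

```python
import subprocess
from pathlib import Path


def transcode(real_file):
    """Real -> WAV -> AAC, roughly as Jeff describes it."""
    src = Path(real_file)
    wav = src.with_suffix(".wav")
    aac = src.with_suffix(".m4a")

    # ffmpeg decodes the Real/Cook audio to an intermediate WAV;
    # -ac 2 upmixes the mono original to stereo, which Jeff found
    # necessary for the AAC to stream reliably from S3.
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ac", "2", str(wav)], check=True)

    # Encode with the Nero AAC encoder; the -if/-of flags are my
    # recollection of its command line, so verify before relying on them.
    subprocess.run(["neroAacEnc", "-if", str(wav), "-of", str(aac)], check=True)

    wav.unlink()  # drop the intermediate WAV


if __name__ == "__main__":
    transcode("flynt-interview.rm")
```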

Afterword

Here is Googlebot coming to get the content a few minutes after I published this post.

54.241.82.166 - - [23/May/2014:10:36:22 +0000] "GET http://inkdroid.org/journal/2014/05/23/realaudio-aac-and-archivy/ HTTP/1.1" 200 20752 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

So someone searching for how to convert RealAudio to AAC might stumble across it. This decentralized Web thing is kinda neat. We need to take care of it.


Broken World

You know that tingly feeling you get when you read something, look at a picture, or hear a song that subtly and effortlessly changes the way you think?

I don’t know about you, but for me thoughts, ideas and emotions can often feel like puzzles that stubbornly demand a solution, until something or someone helps make the problem evaporate or dissolve. Suddenly I can zoom in, out or around the problem, and it is utterly transformed. As that philosophical trickster Ludwig Wittgenstein wrote:

It is not surprising that the deepest problems are in fact not problems at all.

A few months ago, a tweet from Matt Kirschenbaum had this effect on me.

It wasn’t the tweet itself, so much as what the tweet led to: Steven Jackson’s Rethinking Repair, which recently appeared in the heady sounding Media Technologies: Essays on Communication, Materiality, and Society.

I’ve since read the paper three or four times, taken notes, underlined stuff to follow up on, etc. I’ve been meaning to write about it here, but I couldn’t start…I suppose for reasons that are (or will become) self-evident. The paper is like a rhizome that brings together many strands of thought and practice that are directly relevant to my personal and professional life.

I’ve spent the last decade or so working as a software developer in the field of digital preservation and archives. On good days this seems like a surprisingly difficult thing, and on the bad days it seems more like an oxymoron…or an elaborate joke.

At home I’ve spent the last year helping my wife start a business to change our culture one set of hands at a time, while trying to raise children in a society terminally obsessed with waste, violence and greed…and a city addicted to power, or at least the illusion of power.

In short, how do you live and move forward amidst such profound brokenness? Today Quinn Norton’s impassioned Everything is Broken reminded me of Jackson’s broken world thinking, and what a useful hack (literally) it is…especially if you are working with information technology.

Writing software, especially for the Web, is still fun, even after almost 20 years. It keeps changing, spreading into unexpected places, and the tools just keep evolving, getting better, and more varied. But this same dizzying rate of change and immediacy poses some real problems if you are concerned about stuff staying around so people can look at it tomorrow.

When I was invited to the National Digital Forum I secretly fretted for months, trying to figure out if I had anything of substance to say to that unique blend of folks interested in the intersection of the Web and cultural heritage. The thing I eventually landed on was taking a look at the Web as a preservation medium, or rather the Web as a process that has a preservation component to it. In the wrapup I learned that the topic of “web preservation” had already been covered a few years earlier, so there wasn’t much new there … but there was some evidence that the talk connected with a few folks.

If I could do it all again I would totally (as Aaron would say) look at the Web and preservation through Jackson’s prism of broken world thinking.

The bit where I talked about how Mark Pilgrim’s and Why’s online presences were brought back from virtual suicide using Git repositories, the Internet Archive and a lot of TLC was totally broken world thinking. Verne Harris’ notion that the archive is always just a sliver of a sliver of a sliver of a window into process, and that as such it is extremely, extremely valuable, is broken world thinking. And the evolution of permalinks and cool URIs in the face of swathes of linkrot is, at its heart, broken world thinking.

The key idea in Jackson’s article (for me) is that there are very good reasons to remain hopeful and constructive while at the same time being very conscious of the problems we find ourselves in today. The ethics of care that he outlines, with roots in feminist theory, is a deeply transformative idea. I’ve got lots of lines of future reading to follow, in particular in the area of sustainability studies, which seems very relevant to the work of digital preservation.

But most of all there is Jackson’s insight that innovation doesn’t happen in lightbulb moments (the mythology of the Silicon Valley origin story) or in the latest tech trend, but in the recognition of brokenness, and in the willingness to work together with others to repair and fix it. He positions repair as an ongoing process that fuels innovation:

… broken world thinking asserts that breakdown, dissolution, and change, rather than innovation, development, or design as conventionally practiced and thought about are the key themes and problems facing new media.

I should probably stop there. I know I will return to this topic again, because I feel like a lot of my previous writing here has centered on the importance of repair, without me knowing it. I just wanted to stop for a moment, and give a shout out to some thinking that I’m suspecting will guide me for the next twenty years.


linking spoken quotes of quotes

An ancient buddha said, “If you do not wish to incur the cause for Unceasing Hell, do not slander the true dharma wheel of the Tathagata. You should carve these words on your skin, flesh, bones and marrow; on your body, mind and environment; on emptiness and on form. They are already carved on trees and rocks, on fields and villages.”

From Gary Snyder’s reading of The Teachings of Zen Master Dogen (about 1:26:00 in).

His delivery is just a delight to listen to. The puzzling strangeness of the text is made whole in the precision, earthiness and humor of his words.


The Archive as Data Platform

Yesterday Wikileaks announced the availability of a new collection, the Carter Cables, which are a new addition to the Public Library of US Diplomacy (PlusD). One thing in particular in the announcement caught my attention:

The Carter Cables were obtained by WikiLeaks through the process described here after formal declassification by the US National Archives and Records Administration earlier this year.

If you follow the link you can see that this content was obtained in a similar manner as the Kissinger Files, that were released just over a year ago. Perhaps this has already been noted, but I didn’t notice before that the Kissinger Files (the largest Wikileaks release to date) were not leaked to Wikileaks, but were legitimately obtained directly from NARA’s website:

Most of the records were reviewed by the United States Department of State’s systematic 25-year declassification process. At review, the records were assessed and either declassified or kept classified with some or all of the metadata records declassified. Both sets of records were then subject to an additional review by the National Archives and Records Administration (NARA). Once believed to be releasable, they were placed as individual PDFs at the National Archives as part of their Central Foreign Policy Files collection.

The Central Foreign Policy Files are a series from the General Records of the Department of State record group. Anyone with a web browser can view these documents on NARA’s Access to Archival Databases website. If you try to access them you’ll notice that the series is broken up into 15 separate files. Each file is a set of documents that can be searched individually. There’s no way to browse the contents of a file, series or the entire group: you must do a search and click through each of the results (more on this in a moment).

The form in which these documents were held at NARA was as 1.7 million individual PDFs. To prepare these documents for integration into the PlusD collection, WikiLeaks obtained and reverse-engineered all 1.7 million PDFs and performed a detailed analysis of individual fields, developed sophisticated technical systems to deal with the complex and voluminous data and corrected a great many errors introduced by NARA, the State Department or its diplomats, for example harmonizing the many different ways in which departments, capitals and people’s names were spelt.

It would be super to hear more details about their process for doing this work. I think archives could potentially learn a lot about how to enhance their own workflows for doing this kind of work at scale.

And yet I think there is another lesson here in this story. It’s actually important to look at this PlusD work as a success story for NARA…and one that can potentially be improved upon. I mentioned above that it doesn’t appear to be possible to browse a list of documents and that you must do a search. If you do a search and click on one of the documents you’ll notice you get a URL like this:

http://aad.archives.gov/aad/createpdf?rid=99311&dt=2472&dl=1345

And if you browse to another you’ll see something like:

http://aad.archives.gov/aad/createpdf?rid=841&dt=2472&dl=1345

Do you see the pattern? Yup, the rid appears to be a record number, and it’s an integer that you can simply start at 1 and keep going until you’ve got to the last one for that file, in this case 155278.

It turns out the other dt and dl parameters change for each file, but they are easily determined by looking at the overview page for the series. Here they are if you are curious:

  • http://aad.archives.gov/aad/createpdf?rid=&dt=2472&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2473&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2474&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2475&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2492&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2493&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2476&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2477&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2494&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2495&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2082&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2083&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2084&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2085&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2532&dl=1629
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2533&dl=1630

Of course it would be trivial to write a harvesting script to pull down the ~380 gigabytes of PDFs by creating a loop with a counter and using one of the many HTTP libraries out there, maybe even with a bit of sleeping in between requests to be nice to the NARA website. I suspect this is how Wikileaks was able to obtain the documents.
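Purely as an illustration (I have no idea what Wikileaks actually ran), a minimal harvesting script along those lines might look something like the Python below. The rid, dt and dl values come from the URLs above; the output filenames and the one-second pause are arbitrary choices.

```python
import time

import requests

BASE = "http://aad.archives.gov/aad/createpdf"
DT, DL = 2472, 1345   # parameters for one of the files listed above
LAST_RID = 155278     # the highest record id mentioned above


def harvest():
    for rid in range(1, LAST_RID + 1):
        resp = requests.get(BASE, params={"rid": rid, "dt": DT, "dl": DL})
        if resp.status_code == 200:
            # save each record as its own PDF, named by its parameters
            with open(f"cfpf-{DT}-{DL}-{rid}.pdf", "wb") as fh:
                fh.write(resp.content)
        time.sleep(1)  # be nice to the NARA web servers


if __name__ == "__main__":
    harvest()
```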

But in an ideal world this sort of URL inspection shouldn’t be necessary, right? Also, perhaps it could be done in such a way that the burden of distributing the data doesn’t fall on NARA alone? It feels like a bit of an accident that it’s possible to download the data in bulk from NARA’s website this way. But it’s an accident that’s good for access.

What if, instead of trying to build the ultimate user experience for archival content, archives focused first and foremost on providing simple access to the underlying data? I’m thinking of the sort of work Carl Malamud has been doing for years at public.resource.org. With a solid data foundation like that, and simple mechanisms for monitoring the archive for new accessions, it would then be possible to layer other applications on top, within the enterprise and (hopefully) at places external to the archive, that provide views into the holdings.

I imagine this might sound to some like ceding the responsibility of the archive. It may also sound a bit dangerous to those who are concerned about connecting up public data that is currently unconnected. I’m certainly not suggesting that user experience and privacy aren’t important. But I think Cassie is right:

I imagine there are some who feel that associating this idea of the archive as data platform with the Wikileaks project might be counterproductive to an otherwise good idea. I certainly paused before hitting publish on this blog post, given the continued sensitivity around Wikileaks. But as other archivists have noted, there is a great deal to be learned from the phenomenon that is Wikileaks. Open and respectful conversations about what is happening are important, right?

Most of all I think it’s important that we don’t look at this bulk access and distant reading of archival material as a threat to the archive. Researchers should feel that downloading data from the archive is a legitimate activity. Where possible they should be given easy and efficient ways to do it. Archives need environments like OpenGov NSW (thanks Cassie) and the Government Printing Office’s Bulk Data website (see this press release about the Federal Register) where this activity can take place, and where a dialogue can happen around it.

Update: May 8, 2014

Alexa O’Brien’s interview on May 6th with Sarah Harrison of Wikileaks at re:publica14 touched on lots of issues related to Wikileaks as an archive. In particular, the discussion of redaction, accessibility and Wikileaks’ role in publishing declassified information for others (including journalists) was quite relevant to the topic of this blog post.