The Carter Cables were obtained by WikiLeaks through the process described here after formal declassification by the US National Archives and Records Administration earlier this year.
If you follow the link you can see that this content was obtained in a similar manner to the Kissinger Files, which were released just over a year ago. Perhaps this has already been noted, but I hadn’t noticed before that the Kissinger Files (the largest Wikileaks release to date) were not leaked to Wikileaks, but were legitimately obtained directly from NARA’s website:
Most of the records were reviewed by the United States Department of State’s systematic 25-year declassification process. At review, the records were assessed and either declassified or kept classified with some or all of the metadata records declassified. Both sets of records were then subject to an additional review by the National Archives and Records Administration (NARA). Once believed to be releasable, they were placed as individual PDFs at the National Archives as part of their Central Foreign Policy Files collection.
The Central Foreign Policy Files are a series from the General Records of the Department of State record group. Anyone with a web browser can view these documents on NARA’s Access to Archival Databases website. If you try to access them you’ll notice that the series is broken up into 15 separate files. Each file is a set of documents that can be searched individually. There’s no way to browse the contents of a file, series or the entire group: you must do a search and click through each of the results (more on this in a moment).
The form in which these documents were held at NARA was as 1.7 million individual PDFs. To prepare these documents for integration into the PlusD collection, WikiLeaks obtained and reverse-engineered all 1.7 million PDFs and performed a detailed analysis of individual fields, developed sophisticated technical systems to deal with the complex and voluminous data and corrected a great many errors introduced by NARA, the State Department or its diplomats, for example harmonizing the many different ways in which departments, capitals and people’s names were spelt.
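To make the harmonization step they describe a little more concrete, here is a toy sketch of the kind of normalization involved. The variant spellings and canonical forms below are invented for illustration; they are not WikiLeaks’ actual mapping tables.

```python
# Toy normalizer for variant spellings of the kind described above.
# The variants and canonical forms here are invented examples.
CANONICAL = {
    "dept of state": "Department of State",
    "dept. of state": "Department of State",
    "state dept": "Department of State",
    "kissinger, henry a": "Kissinger, Henry A.",
    "kissinger h a": "Kissinger, Henry A.",
}

def harmonize(value):
    """Return the canonical form of a field value, or the original
    (stripped) value if no mapping is known."""
    key = " ".join(value.lower().split())  # lowercase, collapse whitespace
    return CANONICAL.get(key, value.strip())
```

At 1.7 million documents even a simple lookup table like this only gets you so far, which is why I’d love to hear how they actually did it.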
It would be super to hear more details about their process for doing this work. I think archives could potentially learn a lot about how to enhance their own workflows for doing this kind of work at scale.
And yet I think there is another lesson here in this story. It’s actually important to look at this PlusD work as a success story for NARA…and one that can potentially be improved upon. I mentioned above that it doesn’t appear to be possible to browse a list of documents and that you must do a search. If you do a search and click on one of the documents you’ll notice you get a URL like this:
Do you see the pattern? Yup, the rid appears to be a record number, and it’s an integer that you can simply start at 1 and keep going until you’ve got to the last one for that file, in this case 155278.
It turns out the other dt and dl parameters change for each file, but they are easily determined by looking at the overview page for the series. Here they are if you are curious:
Of course it would be trivial to write a harvesting script to pull down the ~380 gigabytes of PDFs by creating a loop with a counter and using one of the many, many HTTP libraries. Maybe even with a bit of sleeping in between requests to be nice to the NARA website. I suspect that this is how Wikileaks were able to obtain the documents.
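Such a harvester might look something like this sketch. The rid, dt and dl parameter names come from the URLs above; the exact host and path here are assumptions for illustration.

```python
import time
import urllib.request

# Hypothetical base URL, modeled on the rid/dt/dl parameters seen in
# NARA's search result URLs; the actual host and path may differ.
BASE = "https://aad.archives.gov/aad/createpdf"

def record_url(rid, dt, dl):
    """Build the URL for a single record's PDF."""
    return f"{BASE}?rid={rid}&dt={dt}&dl={dl}"

def harvest(dt, dl, last_rid, pause=1.0):
    """Yield (rid, pdf_bytes) for every record in a file, sleeping
    between requests to be nice to the server."""
    for rid in range(1, last_rid + 1):
        with urllib.request.urlopen(record_url(rid, dt, dl)) as resp:
            yield rid, resp.read()
        time.sleep(pause)
```

With the dt/dl pairs from the series overview page, looping rid from 1 up to the last record (155278 in the example above) would walk the whole file.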
But, in an ideal world, this sort of URL inspection shouldn’t be necessary right? Also, perhaps it could be done in such a way that the burden of distributing the data doesn’t fall on NARA alone? It feels like a bit of an accident that it’s possible to download the data in bulk from NARA’s website this way. But it’s an accident that’s good for access.
What if, instead of trying to build the ultimate user experience for archival content, archives focused first and foremost on providing simple access to the underlying data? I’m thinking of the sort of work Carl Malamud has been doing for years at public.resource.org. With a solid data foundation like that, and simple mechanisms for monitoring the archive for new accessions, it would then be possible to layer other applications on top, within the enterprise and (hopefully) at places external to the archive, that provide views into the holdings.
I imagine this might sound like ceding the responsibility of the archive to some. It may also sound a bit dangerous to those that are concerned about connecting up public data that is currently unconnected. I’m certainly not suggesting that user experience and privacy aren’t important. But I think Cassie is right:
@edsu Yes it is; I think more participation from independent actors in archival access is a great thing we need more of
I imagine there are some that feel that associating this idea of the archive as data platform with the Wikileaks project might be counterproductive to an otherwise good idea. I certainly paused before hitting publish on this blog post, given the continued sensitivity around the issue of Wikileaks. But as other archivists have noted there is a great deal to be learned from the phenomenon that is Wikileaks. Open and respectful conversations about what is happening are important, right?
Most of all I think it’s important that we don’t look at this bulk access and distant reading of archival material as a threat to the archive. Researchers should feel that downloading data from the archive is a legitimate activity. Where possible they should be given easy and efficient ways to do it. Archives need environments like OpenGov NSW (thanks Cassie) and the Government Printing Office’s Bulk Data website (see this press release about the Federal Register) where this activity can take place, and where a dialogue can happen around it.
Update: May 8, 2014
Alexa O’Brien‘s interview on May 6th with Sarah Harrison of Wikileaks at re:publica14 touched on lots of issues related to Wikileaks the archive. In particular the discussion of redaction, accessibility and Wikileaks’ role in publishing declassified information for others (including journalists) was quite relevant to the topic of this blog post.
Peter Brantley tells a sad tale about where public library leadership is at, as we plunge headlong into the ebook future that has been talked about for what seems like forever, and which is now upon us. It’s not pretty.
The general consensus among participants was that public libraries have two, maybe three years to establish their relevance in the digital realm, or risk fading from the central place they have long occupied in the world’s literary culture.
The fact that a bunch of big-wigs invited by IFLA were seemingly unable to find inspiration and reason to hope that public libraries will continue to exist is not surprising in the least I guess. I’m not sure that libraries were ever the center of the world’s literary culture. But for the sake of argument let’s assume they were, and that now they’re increasingly not. Let us also assume that the economic landscape around ebooks is in incredible turmoil, and that there will continue to be sea changes in technologies, and people’s use of them in this area for the foreseeable future.
What can libraries do to stay relevant? I think part of the answer is: stop being libraries…well, sorta.
The most serious threat facing libraries does not come from publishers, we argued, but from e-book and digital media retailers like Amazon, Apple, and Google. While some IFLA staff protested that libraries are not in the business of competing with such companies, the library representatives stressed that they are. If public libraries can’t be better than Google or Amazon at something, then libraries will lose their relevance.
In my mind the thing that libraries have to offer, which these big corporations cannot, is authentic, local context for information about a community’s past, present and future. But in the past century or so libraries have focused on collecting mass produced objects, and sharing data about said objects. The mission of collecting hyper-local information has typically been a side task, that has fallen to special collections and archives. If I were invited to that IFLA meeting I would’ve said that libraries need to shift their orientation to caring more about the practices of archives and manuscript collections, by collecting unique, valued, at risk local materials, and adapting collection development and descriptive practices to the realities of more and more of this information being available as data.
As Mark Matienzo indicated (somewhat indirectly in Twitter) after I published this blog post, a lot of this work involves focusing less on hoarding items like books, and focusing more on the functions, services, and actions that public libraries want to document and engage with in their communities. Traditionally this orientation has been a strength area for archivists in their practice and theory of appraisal where:
… considerations … include how to meet the record-granting body’s organizational needs, how to uphold requirements of organizational accountability (be they legal, institutional, or determined by archival ethics), and how to meet the expectations of the record-using community. Wikipedia
I think this represents a pretty significant cognitive shift for library professionals, and would in fact take some doing. But perhaps that’s just because my exposure to archival theory in “library school” was pretty pathetic. Be that as it may here are some practical examples of growth areas for public libraries that I wish came up at the IFLA meeting.
The Internet Archive and national libraries that are part of the International Internet Preservation Consortium don’t have the time, resources and often the mandate to collect web content that is of interest at the local level. What if the tooling and expertise existed for public libraries to perform some of this work, and to have the results fed into larger aggregations of web archives?
Municipality Reports and Data
Increasing amounts of data are being collected as part of the daily workings of our local governments. What if your public library had the resources to be a repository for this data? Yeah, I said the R word. But I’m not suggesting that public libraries get the expertise to set up Fedora instances with Hydra heads, or something. I’m thinking about approaches to allowing data to easily flow into an organization, where it is backed up, and made available in a clearinghouse manner similar to public.resource.org on the Web, for search engines to pick up. Perhaps even services like LibraryBox offer another lens to look at the opportunities that lie in this area.
Born Digital Manuscript Collections
Public libraries should be aggressively collecting the “papers” of local people who have had significant contributions to their communities. Increasingly, these aren’t paper at all, but are born digital content. For example: email correspondence, document archives, digital photograph collections. I think that librarians and archivists know, in theory, that this born digital content is out there, but the reality is it’s not flowing into the public library/archive. How can we change this? Efforts such as Personal Digital Archiving are important for two reasons: they help set up the right conditions for born digital collections to be donated, and they also make professionals think about how they would like to receive materials so that they are easier to process. Think more things like AIMS, training and tooling for both professionals and citizens.
It’s not unusual for archives and special collections to have all sorts of donor gift agreements that place restrictions on how their donated materials can be used. To some extent needing to visit the collection, request it, and not being able to leave the room with it, has mitigated some of this special-snowflakism. But when things are online things change a bit. We need to normalize these agreements so that content can flow online, and be used online in clearer ways. What if we got donors to think about Creative Commons licenses when they donated materials? How can we make sure donated material can become a usable part of the Web?
We all know that things come and go on the Web. But it doesn’t need to be that way for everything on the Web. Libraries and archives have an opportunity to show how focusing on being a clearinghouse for data assets can allow for things to live persistently on the Web. Thinking about our URLs as identifiers for things we are taking care of is important. Practical strategies for achieving that are possible, and repeatable. What if public libraries were safe harbors for local content on the World Wide Web? This might sound hard to do, but I think it’s not as hard as people think.
As libraries/archives make more local content available publicly on the Web it becomes important to track how this content is accessed and used online. Quick wins like Web analytics tools (e.g. Google Analytics) make it easy to see what is being accessed and from where. Seeing how content is cited in social media applications like Facebook, Twitter, Pinterest and Wikipedia is important for reporting on the value of online collections. But encouraging professionals to use this information to become part of the conversations is equally important. Good metrics are also essential for collection development purposes: seeing what content is of interest, and what is not.
Inside Out Libraries
So, no I don’t think public libraries need a new open source Overdrive. The ebook market will likely continue to take care of itself. I also am not really convinced we need some overarching organization like the Digital Public Library of America to serve as a single point of failure when the funding runs dry. We need distributed strategies for documenting our local communities, so that this information can take its rightful place on the Web, and be picked up by Google so that people can find it when they are on the other side of the world. Things will definitely keep changing, but I think libraries and archives need to invest in the Web as an enduring delivery platform for information.
I’ve never been before but I was so excited to read the call for the European Library Automation Group (ELAG) this year.
The theme of this year’s conference is ‘The INSIDE-OUT Library’. This theme was chosen at last year’s conference, because we concluded:
Libraries have been focusing on bringing the world to their users. Now information is globally available.
Libraries have been producing metadata for the same publications in parallel. Now they are faced with deduplicating redundancy.
Libraries have been selecting things for their users. Now the users select things themselves.
Libraries have been supporting users by indexing things locally. Now everything is being indexed in global, shared indexes.
Instead of being an OUTSIDE-IN library, libraries should try and stay relevant by shifting their paradigm 180 degrees. Instead of only helping users to find what is available globally, they should also focus on making local collections and production available to the world. Instead of doing the same thing everywhere, libraries should focus on making unique information accessible. Instead of focusing on information trapped in publications, libraries should try and give the world new views on knowledge.
This blog post is really just a somewhat shabby rephrasing of that call. Maybe IFLA could use some of the folks on the ELAG program committee at their next meeting about the future of public libraries? Hopefully 2013 will be a year I can make it to ELAG.
I expect public libraries will continue to exist, but there isn’t going to be some magical technical solution to their problems. Their future will be forged by each local relationship they make, which leads to them better documenting their place on the Web. We may not call these places public libraries at first, but that’s what they will be.
I recently attended the Wikimania conference here in Washington, DC. I really can’t express how amazing it was to be a Metro ride away from more than 1,400 people from 87 countries who were passionate about creating a world in which every single human being can freely share in the sum of all knowledge. It was my first Wikimania, and I had pretty high expectations, but I was blown away by the level of enthusiasm and creativity of the attendees. Since my employer supported me by allowing me to spend the week there, I thought I would jot down some notes about the things that I took from the conference, from the perspective of someone working in the cultural heritage sector.
Of course the big news from Wikimania for folks like me who work in libraries and archives was the plenary speech by the Archivist of the United States, David Ferriero. Ferriero did an excellent job of connecting NARA’s mission to that of the Wikipedia community. In particular he stressed that NARA cannot solve difficult problems like the preservation of electronic records without the help of open government, transparency and citizen engagement to shape its policies and activities. As a library software developer I’m as interested as the next person in technical innovations in the digital preservation space: be they new repository software, flavors of metadata and digital object packaging, web services and protocols, etc. But over the past few years I’ve been increasingly convinced that access to the content that is being preserved is an absolutely vital ingredient to its preservation. If open access (as in the case of NARA) isn’t possible due to licensing concerns, then it is still essential to let access by some user community drive and ground efforts to collect and preserve digital content. Seeing high level leadership in the cultural heritage space (and from the federal government no less) address this issue was really inspiring.
At the Archives our concepts of openness and access are embedded in our mission. The work we do every day is rooted in the belief that citizens have the right to see, examine, and learn from the records that guarantee citizens rights, document government actions, and tell the story of our nation.
My biggest challenge is visibility: not everyone knows who we are, what we do, or more importantly, the amazing resources we collect and house. The lesson I learned in my time in New York is that it isn’t good enough to create great digital collections, and sit back and expect people to find you. You need to be where the people are.
The astounding thing is that it’s not just talk–Ferriero went on to describe several efforts of how the Archives is executing on collaboration with the Wikipedia community, which is also documented at a high level in NARA’s Open Government Plan. One example that stood out for me was NARA’s Today’s Document website which highlights documents from its collections. On June 1st, 2011 they featured a photograph of Howard P. Perry, the first African American to enlist in the US Marine Corps after it was desegregated on June 1st, 1942. NARA’s Wikipedian in Residence Dominic McDevitt-Parks’ efforts to bring archival content to the attention of Wikipedians resulted in a new article Desegregation in the United States Marine Corps being created that same day…and the photograph on NARA’s website was viewed more than 4 million times in 8 hours. What proportion of the web traffic was driven by Wikipedia specifically rather than other social networking sites wasn’t exactly clear, but the point is that this is what happens when you get your content where the users are. If my blog post is venturing into tl;dr territory, please be sure to at least watch his speech; it’ll only take 20 minutes.
The arrival of Wikipedians in Residence is a welcome sea change in the Wikipedia community, where historically there had been some uncertainty about the best way for cultural heritage organizations to highlight their original content in Wikipedia articles. As Sara pointed out in her talk, it helps both sides (the institutional side, and the Wikipedia side) to have an actual, experienced Wikipedian on site to help the organization understand how they want to engage the community. Having direct contact with archivists, curators and librarians that know their collections backwards and forwards also helps the resident in knowing how to direct their work, and the work of other Wikipedians. The Library of Congress made an announcement at the Wikimania reception that the World Digital Library is seeking a Wikipedian in Residence. I don’t work directly on the project anymore, but I know people who do, so let me know if you are interested and I can try to connect the dots.
I think in a lot of ways the residency program is an excellent start, but really it’s just that–a start. The task at hand of connecting the Wikipedia community and article content with the collections of galleries, libraries, archives and museums is a huge one. One person, especially a temporary volunteer, can only do so much. As you probably know, Wikipedia editors can often be found embedded in cultural heritage organizations. It’s one of the reasons why we started having informal Wikipedia lunches at the Library of Congress: to see what can be done at the grass roots level by staff to integrate Wikipedia into our work. When we started to meet I learned about an earlier, 4 year old effort to create a policy that provides guidance to staff about how to interact with the Wikipedia community as editors. Establishing a residency program is an excellent way to signal a change in institutional culture, and to bootstrap and focus the work. But I think the residencies also highlight the need to empower staff throughout the organization to participate as well, so that after the resident leaves the work goes on. In addition to establishing a WDL Wikipedian in Residence I would love to see the Library of Congress put the finishing touches on its Wikipedia policy, which would empower staff to use and contribute to Wikipedia as part of their work, without lingering doubt about whether it was correct or not. It would probably be helpful for organizations with such policies to publish them as examples for others wanting to do the same.
Wikipedia as a Platform
Getting back to Wikimania, I wanted to highlight a few other GLAM related projects that use Wikipedia as a platform.
Daniel Mietchen spoke about work he was doing around the Open Access Media Importer (OAMI). The OAMI is a tool that harvests media files (images, movies, etc) from open access materials and uploads them to Wikimedia Commons for use in article content. Efforts to date have focused primarily on PubMed from the National Institutes of Health. As someone working in the digital preservation field one of the interesting outcomes of the work so far was a table that illustrated the media formats present in PubMed:
Since Daniel and other OAMI collaborators are scientists they have been focused primarily on science related media…so they are naturally interested in working with arXiv. arXiv is a heavily trafficked, volunteer supported, pre-print server that is normally a poster child for open repositories. But one odd thing about arXiv that Daniel pointed out is that while arXiv collects licensing information from authors as part of deposit, it does not indicate in the user interface which license has been used. This makes it particularly difficult for the OAMI to determine which content can be uploaded to the Wikimedia Commons. I learned from Simeon Warner shortly afterwards that while the licensing information doesn’t show up in the UI currently, and isn’t present in all the metadata formats that their OAI-PMH service provides, it can be found squirreled away in the arXivRaw format. So it should be theoretically possible to modify the OAMI to use arXivRaw.
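For instance, pulling the license out of an arXivRaw response could look roughly like this. The GetRecord URL follows standard OAI-PMH verb syntax; the exact element layout of arXivRaw is an assumption here, so the parser just looks for a license element anywhere in the response.

```python
import xml.etree.ElementTree as ET

def oai_request_url(arxiv_id):
    """Build an OAI-PMH GetRecord URL asking for the arXivRaw format."""
    return ("http://export.arxiv.org/oai2?verb=GetRecord"
            f"&identifier=oai:arXiv.org:{arxiv_id}"
            "&metadataPrefix=arXivRaw")

def extract_license(xml_text):
    """Return the text of the first <license> element found anywhere
    in the response (ignoring XML namespaces), or None."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "license":
            return (el.text or "").strip() or None
    return None
```

A tool like the OAMI could then skip anything whose license doesn’t permit upload to the Commons.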
Another challenge the OAMI faces is the extraction of metadata. For example, media files often don’t share all the subject keywords that are appropriate for the entire article, so knowing which ones to apply can be difficult. In addition, metadata extraction from Wikimedia Commons was reported to not be optimal, since it involves parsing MediaWiki templates, which limits the downstream use of the content added to the Commons. I don’t know if the Public Library of Science is on the radar for harvesting, but if it isn’t it should be. The OAMI work also seems loosely related to the issue of research data storage and citation, which seems to be on the front burner for those interested in digital repositories. Jimmy Wales has reportedly been advising the UK government on how to make funded research available to the public. I’m not sure if datasets fit the purview of the Wikimedia Commons, but since Excel is #3 in the table above perhaps they do. It might be interesting to think more about Wikimedia Commons as a platform for publishing (and citing) datasets.
I learned about another interesting use of the Wikimedia Commons from Maarten Dammers and Dan Entous during their talk about the GLAMwiki Toolset. The project is a partnership between Wikipedia Netherlands and Europeana. If you aren’t already familiar with Europeana it is an EU funded effort to enhance access to European cultural heritage material on the Web. The project is just getting kicked off now, and is aiming to:
…develop a scalable, maintainable, ease to use system for mass uploading open content from galleries, libraries, archives and museums to Wikimedia Commons and to create GLAM-specific requirements for usage statistics.
Wikimedia Commons can be difficult to work with in an automated, batch oriented way for a variety of reasons. One that was mentioned above is metadata. The GLAMwiki Toolset will provide some mappings from commonly held metadata formats (starting with Dublin Core) to Commons templates, and will provide a framework for adapting the tool to custom formats. There is also a perceived need for tools to manage batch imports as well as exports from the Commons. The other big need is usable analytics tools that let you see how content is used and referenced on the Commons once it has been uploaded. Maarten indicated that they are seeking participation in the project from other GLAM organizations. I imagine that there are other organizations that would like to use Wikimedia Commons as a content platform, to enable collaboration across institutional boundaries. Wikipedia is one of the most popular destinations on the Web, so they have been forced to scale their technical platform to support this demand. Even the largest cultural heritage organizations can often find themselves bound to somewhat archaic legacy systems, which can make it difficult to similarly scale their infrastructure. I think services like Wikimedia Commons and WikiSource have a lot to offer cash strapped organizations that want to do more to provide access to their organization’s unique materials on the Web, but are not in a position to make the technical investments to make it happen. I’m hoping that efforts like the GLAMwiki Toolset will make this easier to achieve, and it’s something I personally would like to get involved in.
Incidentally, one of the more interesting technical track talks I attended was a talk by Ben Hartshorne from the Wikimedia Foundation Operations Team, about their transition from NFS to OpenStack Swift for media storage. I had some detailed notes about this talk, but proceeded to lose them. I seem to remember that in total, the various Wikimedia properties amount to 40T of media storage (images, videos, etc), and they want to be able to grow this to 200T this year. Ben included lots of juicy details about the hardware and deployment of Swift in their infrastructure, so I’ve got an email out to him to see if he can share his slides (update: he just shared them, thanks Ben!). The placement of various caches (Swift is an HTTP REST API), as well as the hooks into MediaWiki, were really interesting to me. The importance of URL addressable object storage for bitstreams in an enterprise that is made up of many different web applications can’t be overstated. It was also fun to hear about the impact that projects like Wiki Loves Monuments and the NARA digitization work mentioned above are having on the backend infrastructure. It’s great to hear that Wikipedia is planning for growth in the area of media storage, and can scale horizontally to meet it, without paying large sums of money for expensive, proprietary, vendor supplied NAS solutions. What wasn’t entirely clear from the presentation is whether there is a generic tipping point where investing in staff and infrastructure to support something like Swift becomes more cost-effective than using a storage solution like Amazon S3. Ben did indicate that their use of Swift and the abstractions they built into MediaWiki would allow for using other storage APIs like S3.
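To give a flavor of what URL addressable object storage means in practice, here’s a rough sketch of talking to Swift’s REST API, which uses a plain HTTP PUT with an X-Auth-Token header. The storage URL, container and token values here are made up for illustration.

```python
import urllib.request

def object_url(storage_url, container, name):
    """Every Swift object lives at a predictable, cacheable URL."""
    return f"{storage_url}/{container}/{name}"

def swift_put(storage_url, token, container, name, data):
    """Upload an object to a Swift container with a single
    authenticated HTTP PUT; it is then retrievable at object_url()."""
    req = urllib.request.Request(
        object_url(storage_url, container, name),
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )
    return urllib.request.urlopen(req)
```

Because every bitstream has a stable URL, any web application in the enterprise (MediaWiki, caches, thumbnailers) can reference it without caring where it physically lives.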
Before I finish this post, there were a couple other Wikipedia related topics that I didn’t happen to see discussed at Wikimania (it’s a multi-track event so I may have just missed them). One is the topic of image citation on Wikipedia. Helena Zinkham (Chief of the Prints and Photographs Division at the Library of Congress) recently floated a project proposal at LC’s Wikipedia Lunch to more prominently place the source of an image in Wikipedia articles. For an example of what Helena is talking about take a look at the article for Walt Whitman: notice how the caption doesn’t include information about where the image came from? If you click on the image you get a detail page that does indicate that the photograph is from LC’s Prints & Photographs collection, with a link back to the Prints & Photographs Online Catalog. I agree with Helena that more prominent information about the source of photographs and other media in Wikipedia could encourage more participation from the GLAM community. The best way to proceed with the idea is still an open question; I’m new to the way projects get started and how RFCs work there. Hopefully we will continue to work on this in the context of the grassroots Wikipedia work at LC. If you are interested please drop me an email.
Another Wikipedia project directly related to my $work is the Digital Preservation WikiProject that the National Digital Stewardship Alliance is trying to kickstart. One of the challenges of digital preservation is the identification of file formats, and their preservation characteristics. English Wikipedia currently has 325 articles about Computer File Formats, and one of the goals of the Digital Preservation project is to enhance these with predictable infoboxes that usefully describe the format. External data sources such as PRONOM and UDFR also contain information about data formats. It’s possible that some of them could be used to improve Wikipedia articles, to more widely disseminate digital preservation information. Also, as Ferriero noted, it’s important for cultural heritage organizations to get their information out to where the people are. Jason Scott of ArchiveTeam has been talking about a similar project to aggregate information about file formats to build better tools for format identification. While I can understand the desire to build a new wiki to support this work, and there are challenges to working with the Wikipedia community, I think Linus’ Law points the way to using Wikipedia.
So, I could keep going, but in the interests of time (yours and mine) I have to wrap this Wikimania post up (for now). Thanks for reading this far through my library colored glasses. Oddly I didn’t even get to mention the most exciting and high profile Wikidata and Visual Editor projects that are under development, and are poised to change what it means to use and contribute to Wikipedia for everyone, not just GLAM organizations. Wikidata is of particular interest to me because, if successful, it will bring many of the ideas of Linked Data to bear on an eminently practical problem that Wikipedia faces. In some ways the Wikidata project is following in the footsteps of the successful DBpedia and Google Freebase projects. But there is a reason why Freebase and DBpedia have spent time engineering their Wikipedia updates: because it’s where the users are creating content. Hopefully I’ll be able to attend Wikimania next year to see how they are doing. And I hope that my first Wikimania marks the beginning of a more active engagement in what Wikipedia is doing to transform the Web and the World.