I noticed on the Archives and Archivists discussion list today that the Library of Virginia has made 66,422 of the approximately 1.3 million emails (167 GB) of Governor Tim Kaine's 2006-2010 administration available on the Web. Even though this is only 5% of the entire collection, it still represents a significant step forwards for open access to government information.
Thankfully, the Library of Virginia took the extra step of describing how they went about processing the email collection. Along the way they tried a series of proprietary tools, some of which crashed regularly, to convert PST files to PDF and CSV, which they then ingested into their digital asset management system, DigiTool from ExLibris (which apparently couldn't load more than 3,000 emails without keeling over).
What's simply astounding is that archivists looked at every email to determine whether it contained restricted material. The result of this sifting was that only 5% of the emails were made available.
I was drawn to the announcement initially because I wondered if they would simply make mailbox data available on the Web, similar to the Enron email dataset. But I quickly noticed that while metadata about each email was readable, I wasn't able to read the contents of the message--the so-called email body. Instead I was presented with a PDF icon, which had a lock over it. At first I suspected the content was only available on site at the Library of Virginia, but after some more reading elsewhere I discovered that I needed to log in to see the PDFs.
I was surprised to find that the username and password were simply listed on the login page -- you don't get your own login, everyone uses the same one. This login form is accompanied by the following text:
While great care has been taken during the processing of this collection to locate, identify, and restrict access to privacy protected information within this collection, some relevant materials may have been missed. By logging in and accessing this collection, the user agrees:

* That if privacy-protected information is discovered during use of this material to make no notes or other recordation of the confidential information.
* Not to publish, publicize, or disclose any confidential material to any other party for any purpose.
* That no direct or indirect contact will be made with the individuals to whom the confidential information relates.
* To contact the Library of Virginia at firstname.lastname@example.org to report any confidential information found.

Improper disclosure of privacy-protected information is a breach of confidentiality that could result in the loss of access to the archival collections housed and maintained by Library of Virginia, and could result in legal penalties (Code of Virginia, §18.2-186.3).

Name: LVA
Password: LVA

Login is required only once during each session.
In all honesty I'm surprised that this collection has been made available at all, given how recent the material is (relative to how long email has been around), and the rush to make it available in time for Kaine's run for the US Senate. It's hard not to imagine politics going on there behind the scenes.
But the thing that gave me pause in this agreement is that the researcher needs to know what privacy-protected or confidential information is. I wonder if many archivists are clear on what these words mean in this context. Without a clear understanding of what these terms mean, this language seems to prevent researchers from actually using any of the material that they discover.
Something that might not be apparent to folks outside the archives world is that it is still a huge step forwards for the Library of Virginia to start making collections like this available on the Web. If it were me, I would probably have started out with a less politicized collection, and used open source Web tools to convert the content and make it available. ElasticSearch or Solr over static HTML documents comes to mind. Making the emails indexable by Google, and not hiding them behind a public username and password that the bots can't figure out, would be a priority for me as well, so that people who are searching for the content can find it. This would take the burden off of the Library of Virginia's search tool as well. Terms of service that make sense to researchers, and don't scare them out of doing their work, seem pretty important too.
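To make the static HTML idea a bit more concrete, here is a rough sketch of what the conversion step might look like using only Python's standard library. It assumes each email has already been exported as a plain RFC 5322 message; the function name `message_to_html` and the page layout are my own inventions for illustration, not anything the Library of Virginia uses, and it does nothing about reviewing or redacting restricted material.

```python
from email import message_from_string, policy
from html import escape


def message_to_html(raw_message: str) -> str:
    """Render one email message as a small standalone HTML page.

    Hypothetical helper: the name and the markup are illustrative only.
    """
    # policy.default gives us the modern EmailMessage API (get_body, get_content).
    msg = message_from_string(raw_message, policy=policy.default)

    # Prefer the plain-text body; fall back to an empty string if there is none.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Only show a handful of headers, and only the ones actually present.
    headers = "".join(
        f"<dt>{escape(name)}</dt><dd>{escape(str(msg[name]))}</dd>"
        for name in ("From", "To", "Date", "Subject")
        if msg[name] is not None
    )
    title = escape(str(msg["Subject"] or "(no subject)"))

    # Everything from the message is HTML-escaped before it goes into the page.
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><dl>{headers}</dl><pre>{escape(body)}</pre></body></html>\n"
    )
```

Pages like this could be written to disk one file per message, which is all a crawler needs, and the same header and body text could presumably be handed to Solr or ElasticSearch for a local search interface.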
Still, it looks like progress. Keep on keeping on, Library of Virginia. 5% is still a sliver of a sliver of a sliver that the Web didn't have before.