From Polders to Postmodernism

From Polders to Postmodernism: A Concise History of Archival Theory by John Ridener
My rating: 3 of 5 stars

This was a nice little find for my continuing self-education in archives. As its title suggests, it's a short survey (less than 200 pages) that traces a series of paradigm shifts in archival theory, starting in the nineteenth-century Netherlands and leading up to the present. Ridener focuses on the approaches to subjectivity and objectivity in archival theory in order to show how the theories have changed and built on each other over the last 200 years. He does a nice job of sketching the context for the theories and the changes in society and technology that drove them, as well as providing some interesting biographical material about individuals such as Jenkinson and Schellenberg. After having just read Controlling the Past I felt like I had some exposure to contemporary thinking about archives, but was lacking some of the historical background, so this book was very helpful. I think I might have to read Schellenberg's Modern Archives now, especially because of the NARA connection. But that might get sidelined to read more of Terry Cook's work on macro-appraisal. My only small complaint is that I noticed quite a few typos in the first half of the book, which got a little distracting at times.


Wikimania Revisited

I recently attended the Wikimania conference here in Washington, DC. I really can’t express how amazing it was to be a Metro ride away from more than 1,400 people from 87 countries who were passionate about creating a world in which every single human being can freely share in the sum of all knowledge. It was my first Wikimania, and I had pretty high expectations, but I was blown away by the level of enthusiasm and creativity of the attendees. Since my employer supported me by allowing me to spend the week there, I thought I would jot down some notes about the things that I took from the conference, from the perspective of someone working in the cultural heritage sector.

Archivy

Of course the big news from Wikimania for folks like me who work in libraries and archives was the plenary speech by the Archivist of the United States, David Ferriero. Ferriero did an excellent job of connecting NARA’s mission to that of the Wikipedia community. In particular he stressed that NARA cannot solve difficult problems like the preservation of electronic records without the help of open government, transparency and citizen engagement to shape its policies and activities. As a library software developer I’m as interested as the next person in technical innovations in the digital preservation space: be they new repository software, flavors of metadata and digital object packaging, web services and protocols, etc. But over the past few years I’ve been increasingly convinced that access to the content that is being preserved is an absolutely vital ingredient to its preservation. If open access (as in the case of NARA) isn’t possible due to licensing concerns, then it is still essential to let access by some user community drive and ground efforts to collect and preserve digital content. Seeing high level leadership in the cultural heritage space (and from the federal government no less) address this issue was really inspiring.

At the Archives our concepts of openness and access are embedded in our mission. The work we do every day is rooted in the belief that citizens have the right to see, examine, and learn from the records that guarantee citizens rights, document government actions, and tell the story of our nation.

My biggest challenge is visibility: not everyone knows who we are, what we do, or more importantly, the amazing resources we collect and house. The lesson I learned in my time in New York is that it isn’t good enough to create great digital collections, and sit back and expect people to find you. You need to be where the people are.

The astounding thing is that it's not just talk–Ferriero went on to describe several ways the Archives is executing on collaboration with the Wikipedia community, which are also documented at a high level in NARA's Open Government Plan. One example that stood out for me was NARA's Today's Document website, which highlights documents from its collections. On June 1st, 2011 it featured a photograph of Howard P. Perry, the first African American to enlist in the US Marine Corps after it was desegregated on June 1st, 1942. NARA's Wikipedian in Residence Dominic McDevitt-Parks' efforts to bring archival content to the attention of Wikipedians resulted in a new article, Desegregation in the United States Marine Corps, being created that same day…and the photograph on NARA's website was viewed more than 4 million times in 8 hours. What proportion of the web traffic was driven by Wikipedia specifically rather than by other social networking sites wasn't exactly clear, but the point is that this is what happens when you get your content where the users are. If my blog post is venturing into tl;dr territory, please be sure to at least watch his speech; it'll only take 20 minutes.

Resident Wikipedians

In a similar vein Sara Snyder made a strong case for the use of archival materials on Wikipedia in her talk 5 Reasons Why Archives are an Untapped Goldmine for Wikimedians. She talked about the work that Sarah Stierch did as the Wikipedian in Residence at the Smithsonian's Archives of American Art. The partnership resulted in ~300 WPA images being uploaded to Wikimedia Commons, 37 new Wikipedia articles, and new connections with a community of volunteers who participated in edit-a-thons to improve Wikipedia and learn more about the AAA collections. She also pointed out that since 2010 Wikipedia has driven more traffic to the Archives of American Art website than all other social media combined.

In the same session Dominic McDevitt-Parks spoke about his activities as the Wikipedian in Residence at the US National Archives. Dominic focused much of his presentation on NARA's digitization work, largely done by volunteers, the use of Wikimedia Commons as a content platform for the images, and ultimately WikiSource as a platform for transcribing the documents. The finished documents are then linked to from NARA's Online Catalog, as in this example: Appeal for a Sixteenth Amendment from the National Woman Suffrage Association. NARA also prominently links out to the content waiting to be transcribed at WikiSource on its Citizen Archivist Dashboard. If you are interested in learning more, Dominic has written a bit about the work with WikiSource on the NARA blog. Both Dominic and Sara will be speaking next month at the Society of American Archivists Annual Meeting, making the case for Wikipedia to the archival community. Their talk is called 80,000 Volunteers Can't Be Wrong: The Case for Greater Collaboration with Wikipedia, and I encourage you to attend if you will be at SAA.

The arrival of Wikipedians in Residence is a welcome sea change in the Wikipedia community, where historically there had been some uncertainty about the best way for cultural heritage organizations to highlight their original content in Wikipedia articles. As Sara pointed out in her talk, it helps both sides (the institutional side and the Wikipedia side) to have an actual, experienced Wikipedian on site to help the organization understand how it wants to engage the community. Having direct contact with archivists, curators and librarians who know their collections backwards and forwards also helps the resident know how to direct their work, and the work of other Wikipedians. The Library of Congress announced at the Wikimania reception that the World Digital Library is seeking a Wikipedian in Residence. I don't work directly on the project anymore, but I know people who do, so let me know if you are interested and I can try to connect the dots.

I think in a lot of ways the residency program is an excellent start, but really it's just that–a start. The task at hand of connecting the Wikipedia community and article content with the collections of galleries, libraries, archives and museums is a huge one. One person, especially a temporary volunteer, can only do so much. As you probably know, Wikipedia editors can often be found embedded in cultural heritage organizations. It's one of the reasons why we started having informal Wikipedia lunches at the Library of Congress: to see what can be done at the grassroots level by staff to integrate Wikipedia into our work. When we started to meet I learned about an earlier, four-year-old effort to create a policy that provides guidance to staff about how to interact with the Wikipedia community as editors. Establishing a residency program is an excellent way to signal a change in institutional culture, and to bootstrap and focus the work. But I think the residencies also highlight the need to empower staff throughout the organization to participate as well, so that after the resident leaves the work goes on. In addition to establishing a WDL Wikipedian in Residence I would love to see the Library of Congress put the finishing touches on its Wikipedia policy, which would empower staff to use and contribute to Wikipedia as part of their work without lingering doubt about whether it was correct or not. It would also be helpful for organizations that already have such policies to publish them as examples for others wanting to do the same.

Wikipedia as a Platform

Getting back to Wikimania, I wanted to highlight a few other GLAM related projects that use Wikipedia as a platform.

Daniel Mietchen spoke about work he was doing around the Open Access Media Importer (OAMI). The OAMI is a tool that harvests media files (images, movies, etc.) from open access materials and uploads them to Wikimedia Commons for use in article content. Efforts to date have focused primarily on PubMed from the National Institutes of Health. As someone working in the digital preservation field, I found one of the more interesting outcomes of the work so far to be a table illustrating the media formats present in PubMed:

Since Daniel and other OAMI collaborators are scientists they have focused primarily on science-related media…so they are naturally interested in working with arXiv. arXiv is a heavily trafficked, volunteer-supported pre-print server that is normally a poster child for open repositories. But one odd thing about arXiv that Daniel pointed out is that while arXiv collects licensing information from authors as part of deposit, it does not indicate in the user interface which license has been used. This makes it particularly difficult for the OAMI to determine which content can be uploaded to Wikimedia Commons. I learned from Simeon Warner shortly afterwards that while the licensing information doesn't show up in the UI currently, and isn't present in all the metadata formats that their OAI-PMH service provides, it can be found squirreled away in the arXivRaw format. So it should be theoretically possible to modify the OAMI to use arXivRaw.
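Just to make the arXivRaw point concrete, here is a rough sketch (not part of the OAMI) of what checking a single record for license information might look like. The OAI-PMH verbs and parameters are standard, but the endpoint URL, the example identifier, and the license element name are assumptions on my part, so check arXiv's OAI-PMH documentation before relying on them:

// Ask arXiv's OAI-PMH service for one record in the arXivRaw format and do a
// crude check for license information in the returned XML.
var https = require('https');

var url = 'https://export.arxiv.org/oai2' +          // assumed endpoint
          '?verb=GetRecord&metadataPrefix=arXivRaw' +
          '&identifier=oai:arXiv.org:0704.0001';     // example identifier

https.get(url, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() {
    // look for a <license> element (assumed name) in the raw metadata
    var match = body.match(/<license[^>]*>([^<]*)<\/license>/);
    console.log(match ? match[1] : 'no license element found');
  });
});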

Another challenge the OAMI faces is metadata extraction. For example, media files often don't warrant all of the subject keywords that apply to the article as a whole, so knowing which ones to apply can be difficult. In addition, metadata extraction from Wikimedia Commons was reported to be less than optimal, since it involves parsing MediaWiki templates, which limits the downstream use of the content added to the Commons. I don't know if the Public Library of Science is on the radar for harvesting, but if it isn't it should be. The OAMI work also seems loosely related to the issue of research data storage and citation, which seems to be on the front burner for those interested in digital repositories. Jimmy Wales has reportedly been advising the UK government on how to make funded research available to the public. I'm not sure if datasets fit the purview of the Wikimedia Commons, but since Excel is #3 in the graph above perhaps they do. It might be interesting to think more about Wikimedia Commons as a platform for publishing (and citing) datasets.

I learned about another interesting use of the Wikimedia Commons from Maarten Dammers and Dan Entous during their talk about the GLAMwiki Toolset. The project is a partnership between Wikimedia Netherlands and Europeana. If you aren't already familiar with Europeana, it is an EU-funded effort to enhance access to European cultural heritage material on the Web. The project is just getting kicked off now, and is aiming to:

…develop a scalable, maintainable, easy to use system for mass uploading open content from galleries, libraries, archives and museums to Wikimedia Commons and to create GLAM-specific requirements for usage statistics.

Wikimedia Commons can be difficult to work with in an automated, batch-oriented way for a variety of reasons. One that was mentioned above is metadata. The GLAMwiki Toolset will provide some mappings from commonly held metadata formats (starting with Dublin Core) to Commons templates (see the sketch below), and will provide a framework for adapting the tool to custom formats. Also there is a perceived need for tools to manage batch imports as well as exports from the Commons. The other big need is usable analytics tools that let you see how content is used and referenced on the Commons once it has been uploaded. Maarten indicated that they are seeking participation in the project from other GLAM organizations. I imagine that there are other organizations that would like to use the Wikimedia Commons as a content platform, to enable collaboration across institutional boundaries. Wikipedia is one of the most popular destinations on the Web, so they have been forced to scale their technical platform to support this demand. Even the largest cultural heritage organizations can often find themselves bound to somewhat archaic legacy systems, which can make it difficult to similarly scale their infrastructure. I think services like Wikimedia Commons and WikiSource have a lot to offer cash-strapped organizations that want to do more to provide access to their organizations' unique materials on the Web, but are not in a position to make the technical investments to make it happen. I'm hoping that efforts like the GLAMwiki Toolset will make this easier to achieve, and it's something I personally would like to get involved in.
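To give a feel for what a metadata-to-template mapping involves, here is a toy sketch (not the GLAMwiki Toolset itself). The Commons {{Information}} template and field names like description, date, source, author and permission are standard on Commons, but the mapping choices and the example record are purely illustrative:

// A toy mapping from a simple Dublin Core-ish record to the text of a
// Commons {{Information}} template. Real mappings need to handle dates,
// languages, licensing and institution-specific fields.
function dcToInformationTemplate(dc) {
  return '{{Information\n' +
         '|description=' + (dc.description || dc.title || '') + '\n' +
         '|date=' + (dc.date || '') + '\n' +
         '|source=' + (dc.source || '') + '\n' +
         '|author=' + (dc.creator || '') + '\n' +
         '|permission=' + (dc.rights || '') + '\n' +
         '}}';
}

// example record (made up for illustration)
console.log(dcToInformationTemplate({
  title: 'Scanned letter from an archival collection',
  creator: 'Unknown',
  date: '1876',
  source: 'Example Archives',
  rights: 'Public domain'
}));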

Incidentally, one of the more interesting technical track talks I attended was by Ben Hartshorne from the Wikimedia Foundation Operations Team, about their transition from NFS to OpenStack Swift for media storage. I had some detailed notes about this talk, but proceeded to lose them. I seem to remember that in total the various Wikimedia properties amount to 40T of media storage (images, videos, etc.), and they want to be able to grow this to 200T this year. Ben included lots of juicy details about the hardware and deployment of Swift in their infrastructure, so I've got an email out to him to see if he can share his slides (update: he just shared them, thanks Ben!). The placement of various caches (Swift is an HTTP REST API), as well as the hooks into MediaWiki, were really interesting to me. The importance of URL-addressable object storage for bitstreams in an enterprise that is made up of many different web applications can't be overstated. It was also fun to hear about the impact that projects like Wiki Loves Monuments and the NARA work mentioned above are having on the backend infrastructure. It's great to hear that Wikipedia is planning for growth in the area of media storage, and can scale horizontally to meet it, without paying large sums of money for expensive, proprietary, vendor-supplied NAS solutions. What wasn't entirely clear from the presentation is whether there is a generic tipping point where investing in staff and infrastructure to support something like Swift becomes more cost-effective than using a storage solution like Amazon S3. Ben did indicate that their use of Swift and the abstractions they built into MediaWiki would allow for using storage APIs like S3.

Before I finish this post, there were a couple of other Wikipedia-related topics that I didn't happen to see discussed at Wikimania (it's a multi-track event so I may have just missed them). One is the topic of image citation on Wikipedia. Helena Zinkham (Chief of the Prints and Photographs Division at the Library of Congress) recently floated a project proposal at LC's Wikipedia Lunch to more prominently place the source of an image in Wikipedia articles. For an example of what Helena is talking about take a look at the article for Walt Whitman: notice how the caption doesn't include information about where the image came from? If you click on the image you get a detail page that does indicate that the photograph is from LC's Prints & Photographs collection, with a link back to the Prints & Photographs Online Catalog. I agree with Helena that more prominent information about the source of photographs and other media in Wikipedia could encourage more participation from the GLAM community. The best way to proceed with the idea is still in question, and I'm new to the way projects get started and how RFCs work there. Hopefully we will continue to work on this in the context of the grassroots Wikipedia work at LC. If you are interested please drop me an email.

Another Wikipedia project directly related to my $work is the Digital Preservation WikiProject that the National Digital Stewardship Alliance is trying to kickstart. One of the challenges of digital preservation is the identification of file formats, and their preservation characteristics. English Wikipedia currently has 325 articles about Computer File Formats, and one of the goals of the Digital Preservation project is to enhance these with predictable infoboxes that usefully describe the format. External data sources such as PRONOM and UDFR also contain information about data formats. It’s possible that some of them could be used to improve Wikipedia articles, to more widely disseminate digital preservation information. Also, as Ferriero noted, it’s important for cultural heritage organizations to get their information out to where the people are. Jason Scott of ArchiveTeam has been talking about a similar project to aggregate information about file formats to build better tools for format identification. While I can understand the desire to build a new wiki to support this work, and there are challenges to working with the Wikipedia community, I think Linus’ Law points the way to using Wikipedia.

Beginning

So, I could keep going, but in the interests of time (yours and mine) I have to wrap this Wikimania post up (for now). Thanks for reading this far through my library-colored glasses. Oddly I didn't even get to mention the most exciting and high-profile Wikidata and VisualEditor projects that are under development, and are poised to change what it means to use and contribute to Wikipedia for everyone, not just GLAM organizations. Wikidata is of particular interest to me because, if successful, it will bring many of the ideas of Linked Data to bear on an eminently practical problem that Wikipedia faces. In some ways the Wikidata project is following in the footsteps of the successful DBpedia and Google Freebase projects. But there is a reason why Freebase and DBpedia have spent time engineering their Wikipedia updates–because it's where the users are creating content. Hopefully I'll be able to attend Wikimania next year to see how they are doing. And I hope that my first Wikimania marks the beginning of a more active engagement in what Wikipedia is doing to transform the Web and the World.


and then the web happened

Here is the text of my talk I’m giving at Wikimania today, and the slides.

Let me begin by saying thank you to the conference organizers for accepting my talk proposal. I am excited to be here at my first Wikimania conference, and hope that it will be the first of many. Like millions of other people around the world, I use Wikipedia every day at work and at home. In the last three years I've transitioned from being a consumer to a producer, by making modest edits to articles about libraries, archives, and occasionally music. I had heard horror stories of people having their work reverted and deleted, so I was careful to cite material in my edits. I was pleasantly surprised when editors swooped in not to delete my work, but to improve it. So, I also want to say thanks to all of you for creating such an improbably open and alive community. I know there is room for improvement, but it's a pretty amazing thing you all have built.

And really, that's all my talk about Wikistream is about. Wikistream was born out of a desire to share just how amazing the Wikipedia community is with people who didn't already know it. I know, I'm preaching to the choir. I also know that I'm speaking in the Technology and Infrastructure track, and I promise to get to some details about how Wikistream works. But really, there's nothing radically new in Wikistream–and perhaps what I'm going to say would be more appropriate for the GLAM track, or a performance art track, if there was one. If you are a multi-tasker and want to listen to me with one ear, please pull up http://wikistream.inkdroid.org in your browser, and try to make sense of it as I talk. Let's see what breaks first, the application or the wi-fi connection–hopefully neither.

Wikipedia and the BBC

A couple of years ago I was attending the dev8d conference in London and dropped into the 2nd Linked Data Meetup that happened to be going on nearby. Part of the program included presentations from Tom Scott, Silver Oliver and Georgi Kobilarov about some work they did at the BBC. They demo'd two web applications, the BBC Wildlife Finder and BBC Music, that used Wikipedia as a content management platform 1, 2.

If I’m remembering right it was Tom who demonstrated how an edit to a Wikipedia article resulted in the content being immediately updated at the BBC. It seemed like magic. More than that it struck me as mind-blowingly radical for an organization like the BBC to tap into the Wikipedia platform and community like this.

After a few questions I learned from Georgi that part of the magic of this update mechanism was a bot that the BBC created which sits in the #en.wikipedia IRC chatroom, where edits are announced 4. I logged into the chatroom and was astonished by the number of edits flying by:

And remember this was just the English language Wikipedia channel. There are more than 730 other Wikimedia related channels where updates are announced. The BBC’s use of Wikipedia really resonated with me, but to explain why I need to back up a little bit more.

Crowdsourcing in the Library

I work as a software developer at the Library of Congress. In developing web applications there I often end up using data about books, people and topics that have been curated for hundreds of years, and which began to be made available in electronic form in the early 1970s. The library community has had a longstanding obsession with collaboration, or (dare I say) crowdsourcing, to maintain its information about the bibliographic universe. Librarians would most likely call it cooperative cataloging instead of crowdsourcing, but the idea is roughly the same.

As early as 1850, Charles Jewett proposed that the Smithsonian be established as the national library of the United States, which would (among other things) collect the catalogs of libraries all around the country 3. The Smithsonian wasn't as sure as Jewett, so it wasn't until the 1890s that we saw his ideas take hold, when the Library of Congress assumed the role of the national library and home to the Copyright Office. To this day, copyright registration results in a copy of a registered book being deposited at the Library of Congress. In 1901 the Library of Congress established its printed card service, which made its catalog cards available to libraries around the United States and the world.

This meant that a book could be cataloged once by one of the growing army of catalogers at the Library of Congress, instead of the same work being done over and over by all the libraries all over the country. But the real innovation happened in 1971 when Fred Kilgour’s dream of an online shared cataloging database was powered up at OCLC. This allowed a book to be cataloged by any library, and instantly shared with other libraries around the country. It was at this point that the cataloging became truly cooperative, because catalogers could be anywhere, at any member institution, and weren’t required to be in an office at the Library of Congress.

This worked for a bit, but then the Web happened. As the Web began to spread in the mid to late 1990s the library community got it into their head that they would catalog it, with efforts like the Cooperative Online Resource Catalog. But the Web was growing too fast, there just weren’t enough catalogers who cared, and the tools weren’t up to the task, so the project died.

So when I saw Tom, Silver and Georgi present on the use of Wikipedia as a curated content platform at the BBC, and saw how active the community was, I had a bit of a light bulb moment. It wasn't an if-you-can't-beat-em-join-em moment in which libraries and other cultural heritage organizations (like the BBC) fade into the background and become irrelevant, but one in which Wikipedia helps libraries do their job better…and maybe libraries can help make Wikipedia better. It just so happened that this was right as the Galleries, Libraries, Archives and Museums (GLAM) effort was kicking off at Wikipedia. I really wanted to be able to show librarians and others not likely to drop into an IRC chat how active the Wikipedia community is, and that's how Wikistream came to be.

How

So now that you understand the why of Wikistream I’ll tell you briefly about the how. When I released Wikistream I got this really nice email from Ward Cunningham, who is a personal hero of mine, and I imagine a lot of you too:

To: wiki-research-l@lists.wikimedia.org
From: Ward Cunningham <ward@c2.com>
Subject: Re: wikistream: displays wikipedia updates in realtime
Date: Jun 16, 2011 7:43:11 am

I've written this app several times using technology from text-to-speech 
to quartz-composer. I have to tip my hat to Ed for doing a better job 
than I ever did and doing it in a way that he makes look effortless. 
Kudos to Ed for sharing both the page and the software that produces 
it. You made my morning. -- Ward

Sure enough, my idea wasn’t really new at all. But at least I was in good company. I was lucky to stumble across the idea for Wikistream when a Google search for streaming to the browser pulled up SocketIO. If you haven’t seen it before SocketIO is a JavaScript library that allows you to easily stream data to the browser without needing to care about the transport mechanisms that the browser supports: WebSocket, Adobe FlashSocket, AJAX long polling, AJAX multipart-streaming, Forever iframe, JSONP Polling. It autodetects the capabilities of the browser and the server, and gives you a simple callback API for publishing and consuming events. For example here is the code that runs in your browser to connect to the server and start getting updates:

$(document).ready(function() {
  // connect back to the server that delivered the page
  var socket = io.connect();
  // each 'message' event is one Wikipedia edit; add it to the page
  socket.on('message', function(msg) {
    addUpdate(msg);
  });
});

There’s a bit more to it, like loading the SocketIO library, and the details of adding the information about the change stored in the msg JavaScript object (more on that below) to the DOM, but SocketIO makes the hard part of streaming data from the server to the client easy.
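To give a flavor of that DOM work, here is a hypothetical addUpdate function; the real Wikistream code does more (filtering, styling, trimming the list), and the #updates element and markup are made up for illustration. It uses fields from the msg object described below:

// a hypothetical addUpdate: prepend a line describing the edit to a
// container element with id "updates" (assumes jQuery is loaded)
function addUpdate(msg) {
  var line = $('<div class="update"></div>')
    .append($('<a></a>').attr('href', msg.url).text(msg.page))
    .append(' edited by ')
    .append($('<a></a>').attr('href', msg.userUrl).text(msg.user))
    .append(' (' + msg.delta + ')');
  $('#updates').prepend(line);
}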

Of course you need a server to send the updates, and that’s where things get a bit more interesting. SocketIO is designed to run in a NodeJS environment with the Express web framework. Once you have your webapp set up, you can add SocketIO to it:

// create an Express (2.x style) web application
var express = require("express");
var sio = require("socket.io");

var app = express.createServer();
// set up standard app routes/views
// then attach SocketIO to the same HTTP server
var io = sio.listen(app);

Then the last bit is to do the work of listening to the IRC chatrooms and pushing the updates out to the clients that want to be updated. To make this a bit easier I created a reusable library called wikichanges that abstracts away the business of connecting to the IRC channels and parsing the status updates into a JavaScript object, and lets you pass in a callback function that will be given updates as they occur.

var wikichanges = require('wikichanges');

// listen to the Wikimedia IRC channels and relay each parsed edit
// to all connected browsers
var w = wikichanges.WikiChanges();
w.listen(function(msg) {
  io.sockets.emit('message', msg);
});

This results in updates being delivered as JSON objects to the client code we started with, where each update looks something like:

{ 
  channel: '#en.wikipedia',
  wikipedia: 'English Wikipedia',
  page: 'Persuasion (novel)',
  pageUrl: 'http://en.wikipedia.org/wiki/Persuasion_(novel)',
  url: 'http://en.wikipedia.org/w/index.php?diff=498770193&oldid=497895763',
  delta: -13,
  comment: '/* Main characters */',
  wikipediaUrl: 'http://en.wikipedia.org',
  user: '108.49.244.224',
  userUrl: 'http://en.wikipedia.org/wiki/User:108.49.244.224',
  unpatrolled: false,
  newPage: false,
  robot: false,
  anonymous: true,
  namespace: 'Article',
  flag: ''
}

As I already mentioned, I extracted the interesting bits (connecting to the IRC chatrooms and parsing the IRC-colored text into a JavaScript object) into a NodeJS library called wikichanges. Working with the stream of edits is surprisingly addictive, and I found myself wanting to create some other similar applications:

  • wikipulse: displays the rate of change of the wikipedias as a set of accelerator displays
  • wikitweets: a visualization of how Wikipedia is cited on Twitter
  • wikibeat: a musical exploration of how Wikipedia is changing, created by Dan Chudnov and Chris Burns.

So wikichanges is there to make it easier to bootstrap applications that want to do things with the Wikipedia update stream. Here is a brief demo of getting wikichanges working on a stock Ubuntu EC2 instance:
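The gist of the demo is installing Node and the wikichanges package and then running a short script along these lines (a sketch based on the wikichanges API shown above):

var wikichanges = require('wikichanges');

// print a one-line summary of every edit as it happens
var w = wikichanges.WikiChanges();
w.listen(function(msg) {
  console.log(msg.wikipedia + ': ' + msg.page + ' (' + msg.user + ')');
});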

What’s Next?

So this was a bit of a wild ride, and I hope you were able to follow along. I could have spent some time explaining why Node was a good fit for wikistream; perhaps we can talk about that in the Q/A if there is any time. Let's just say I actually reach for Python first when working on a new project, but the particular nature of this application, and tool availability, made Node a natural fit. Did we crash it yet?

The combination of the GLAM effort with Wikidata is poised to really transform the way cultural heritage organizations contribute to and use Wikipedia. I hope wikistream might help you make the case for Wikipedia in your organization as you make presentations. If you have ideas on how to use the wikichanges library to do something with the update stream I would love to hear about them.

  1. Case Study: Use of Semantic Web Technologies on the BBC Web Sites by Yves Raimond, et al.
  2. The Web as a CMS by Tom Scott.
  3. Catalog It Once And For All: A History of Cooperative Cataloging in the United States Prior to 1967 (Before MARC) by Barbara Tillett, in Cooperative Cataloging: Past, Present, and Future. Psychology Press, 1993, page 5.
  4. After hitting publish on this post I learned that the BBC’s bot was written by Patrick Sinclair.


straw

By now I imagine you’ve heard the announcement that OCLC has started to make WorldCat bibliographic data available as openly licensed Linked Data. The availability of microdata and RDFa metadata in WorldCat pages coupled with the ODC-BY license and the availability of sitemaps for crawlers is a huge win for the library community. Similar announcements about Dewey Decimal Classification and the Virtual International Authority File are further evidence that there is a big paradigm shift going on at OCLC.

A few weeks ago Richard Wallis (formerly of Talis, and now at OCLC) asked me to take a look at the strawman library microdata vocabulary that OCLC put together for the WorldCat release: http://purl.org/library. Richard stressed that the library vocabulary was a prototype to focus and gather interest from the cultural heritage sector outside of OCLC, and the metadata community in general. Combined with the prototype microdata at WorldCat I think it represents an excellent first step. At this point I should re-iterate that these remarks about schema.org are mine and not those of my employer.

The vocabulary is currently expressed in OWL, and visiting that URL will redirect you to an application that lets you read the OWL file as documentation. Rather than write up a few paragraphs and send my comments to Richard in email, I figured I would jot them down here, in case anyone else has feedback.

Examining the classes that the library vocabulary defines tells the majority of the story. They break down into:

  • ArchiveMaterial
  • Carrier
  • Computer File
  • Game
  • Image
  • Interactive Multimedia
  • Kit
  • Musical Score
  • Newspaper
  • Periodical
  • Thesis
  • Toy
  • Video
  • VideoGame
  • Visual Material
  • Web Site

These classes should seem familiar to catalogers who have worked in MARC, since there is a lot of similarity with the types of data that are encoded into the 008 field. However, some are missing, such as maps, dictionaries, encyclopedias, etc. It's kind of amusing that Book isn't mentioned. I'm not sure what the rationale was for selecting these classes; perhaps some sort of ranking based on use in WorldCat? Examining the OWL shows that OCLC has made an effort to express mappings between the library vocabulary and schema.org:

library → schema.org
http://purl.org/library/ArchiveMaterial → http://schema.org/CreativeWork/ArchiveMaterial
http://purl.org/library/ComputerFile → http://schema.org/CreativeWork/ComputerFile
http://purl.org/library/Game → http://schema.org/CreativeWork/Game
http://purl.org/library/Image → http://schema.org/CreativeWork/Image
http://purl.org/library/InteractiveMultimedia → http://schema.org/CreativeWork/InteractiveMultimedia
http://purl.org/library/Kit → http://schema.org/CreativeWork/Kit
http://purl.org/library/MusicalScore → http://schema.org/CreativeWork/MusicalScore
http://purl.org/library/Newspaper → http://schema.org/CreativeWork/Newspaper
http://purl.org/library/Periodical → http://schema.org/CreativeWork/Periodical
http://purl.org/library/Thesis → http://schema.org/CreativeWork/Book/Thesis
http://purl.org/library/Toy → http://schema.org/CreativeWork/Toy
http://purl.org/library/Video → http://schema.org/CreativeWork/Video
http://purl.org/library/VideoGame → http://schema.org/CreativeWork/VideoGame
http://purl.org/library/VisualMaterial → http://schema.org/CreativeWork/VisualMaterial
http://purl.org/library/WebSite → http://schema.org/CreativeWork/WebSite

However these schema.org URLs do not resolve, and the classes are not actually present as specializations of schema.org's CreativeWork. Perhaps the presence of these mappings in the library vocabulary is evidence of a desire to create these classes at schema.org. But then there are cases like library:Image, which seems to bear a lot of resemblance to schema.org's existing ImageObject.

Examining the OWL also yields a set of library:Carrier instances.

  • BlurayDisk
  • CassetteTape
  • CD
  • DVD
  • FilmReel
  • LP
  • Microform
  • VHSTape
  • Volume
  • WWW

Again, there are more carriers than this in the MARC world, and why these were selected is a bit of a mystery. What library:WWW has to do with library:WebSite (if anything) isn't clear, etc.

So even in this prototype library vocabulary there is a lot to examine and unpack. I imagine some phone calls or face to face meetings would be required to get at what went into their production.

Be that as it may, I think it could prove more useful to look at the WorldCat microdata and see what library vocabulary was used. For example here is the microdata extracted from the WorldCat page for Tim Berners-Lee’s Weaving the Web expressed as JSON:

{
  "type": "http://schema.org/Book", 
  "properties": {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
      "http://schema.org/Book"
    ], 
    "http://purl.org/library/placeOfPublication": [
      {
        "type": "http://schema.org/Place", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Place"
          ], 
          "http://schema.org/name": [
            "San Francisco :"
          ]
        }
      }
    ], 
    "http://schema.org/bookEdition": [
      "1st ed."
    ], 
    "http://schema.org/publisher": [
      {
        "type": "http://schema.org/Organization", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Organization"
          ], 
          "http://schema.org/name": [
            "HarperSanFrancisco"
          ]
        }
      }
    ], 
    "http://schema.org/genre": [
      "History"
    ], 
    "http://schema.org/name": [
      "Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor"
    ], 
    "http://schema.org/numberOfPages": [
      "226"
    ], 
    "http://purl.org/library/holdingsCount": [
      "2096"
    ], 
    "http://schema.org/about": [
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Erfindung."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "WWW."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "prospective informatique."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.w3.org/2004/02/skos/core#inScheme": [
            "http://dewey.info/scheme/e21/"
          ]
        }, 
        "id": "http://dewey.info/class/025/e21/"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World wide web."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World Wide Web."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World Wide Web--History."
          ]
        }
      }, 
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/no99010609"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim."
          ]
        }, 
        "id": "http://viaf.org/viaf/85312226"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Web--Histoire."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "World Wide Web"
          ]
        }, 
        "id": "http://id.worldcat.org/fast/1181326"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "historique informatique."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Web--Histoire."
          ]
        }
      }, 
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim"
          ]
        }
      }
    ], 
    "http://schema.org/description": [
      "Enquire within upon everything -- Tangles, links, and webs -- info.cern.ch -- Protocols: simple rules for global systems -- Going global -- Browsing -- Changes -- Consortium -- Competition and consensus -- Web of people -- Privacy -- Mind to mind -- Machines and the Web -- Weaving the Web."
    ], 
    "http://purl.org/library/oclcnum": [
      "41238513"
    ], 
    "http://schema.org/copyrightYear": [
      "1999"
    ], 
    "http://schema.org/contributor": [
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/n97003262"
          ], 
          "http://schema.org/name": [
            "Fischetti, Mark."
          ]
        }, 
        "id": "http://viaf.org/viaf/874883"
      }
    ], 
    "http://schema.org/isbn": [
      "9780062515872", 
      "006251587X", 
      "0062515861", 
      "9780062515865"
    ], 
    "http://schema.org/inLanguage": [
      "en"
    ], 
    "http://schema.org/reviews": [
      {
        "type": "http://schema.org/Review", 
        "properties": {
          "http://schema.org/reviewBody": [
            "Tim Berners-Lee, the inventor of the World Wide Web, has been hailed by Time magazine as one of the 100 greatest minds of this century. His creation has already changed the way people do business, entertain themselves, exchange ideas, and socialize with one another.\" \"Berners-Lee offers insights to help readers understand the true nature of the Web, enabling them to use it to their fullest advantage. He shares his views on such critical issues as censorship, privacy, the increasing power of software companies in the online world, and the need to find the ideal balance between the commercial and social forces on the Web. His criticism of the Web's current state makes clear that there is still much work to be done. Finally, Berners-Lee presents his own plan for the Web's future, one that calls for the active support and participation of programmers, computer manufacturers, and social organizations to make it happen.\"--Jacket."
          ], 
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Review"
          ], 
          "http://schema.org/itemReviewed": [
            "http://www.worldcat.org/oclc/41238513"
          ]
        }
      }
    ], 
    "http://schema.org/author": [
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/no99010609"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim."
          ]
        }, 
        "id": "http://viaf.org/viaf/85312226"
      }
    ]
  }, 
  "id": "http://www.worldcat.org/oclc/41238513"
}

Yes, that’s a lot of data. But interestingly only three library vocabulary elements were used:

  • placeOfPublication
  • holdingsCount
  • oclcnum

One could argue that rather than creating library:placeOfPublication they could have used schema:publisher with a nested Organization item having a schema:location. Similarly, library:oclcnum could have been expressed using itemid with a value of info:oclc/41238513, using the info-uri namespace that OCLC maintains the registry for. This leaves library:holdingsCount, which does seem to be missing from schema.org, but which also raises the question of whose holdings?
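Just to illustrate the first alternative, here is a rough sketch (not actual WorldCat output) of what publisher-plus-location might look like in the same JSON representation used above:

{
  "type": "http://schema.org/Book",
  "properties": {
    "http://schema.org/publisher": [
      {
        "type": "http://schema.org/Organization",
        "properties": {
          "http://schema.org/name": [
            "HarperSanFrancisco"
          ],
          "http://schema.org/location": [
            {
              "type": "http://schema.org/Place",
              "properties": {
                "http://schema.org/name": [
                  "San Francisco"
                ]
              }
            }
          ]
        }
      }
    ]
  },
  "id": "http://www.worldcat.org/oclc/41238513"
}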

As Tom Gruber famously said:

Every ontology is a treaty – a social agreement – among people with some common motive in sharing.

So the question for me is what is the library vocabulary trying to do, and for whom? Is it trying to make it easy to share MARC data as microdata on the Web? Is it trying to communicate something to search engines so that they can have enhanced displays? Who are the people that want to share and consume this data? I think having rough consensus about the answers to these questions is really important before diving into modeling exercises…even prototypes. And when the modeling begins I think it's really important to follow the lead of the WorldCat developers in using the bits of schema.org vocabulary they could, and beginning to mint vocabulary terms for things that are missing. I don't think it's going to be fruitful to start from the position of modeling the bibliographic universe completely. I'd rather see real implementations (both publishers and consumers) drive the discovery of what is missing or awkward in schema.org, and how it can be fixed. Ideally, schema.org implementors like GoodReads would be at the table, along with members of the academic community like Jason Ronallo, Jonathan Rochkind and Ed Chamberlain (among others) who care about these issues. In addition my employer is actively engaged in an effort to rethink bibliographic data on the Web. It seems imperative that these efforts at schema.org and Zepheira's work be combined somehow–especially since OCLC and Zepheira are hardly strangers.

I was of course flattered to be asked my opinion about the library vocabulary. I hope that my remarks haven’t accidentally set this strawman vocabulary on fire, because I think the work that OCLC has begun in this area is incredibly important. My experience watching the designers of SKOS has made me mindful of minimizing ontological commitments when designing a vocabulary, and wary of trying to exhaustively model a domain. In some ways I guess I’m a bit of a schema.org skeptic given its encyclopedic coverage. schema.org should take a page from the HTML 5 book and stay hyper-focused on letting implementations drive standardization. A bit of Seymour Lubetzky’s attention to simplification and user friendliness would be welcome as well.


viaf ntriples

I had a few requests for the Virtual International Authority File ntriples file I wrote about earlier. Having the various flavors of VIAF data available is great, but if an RDF dump is going to be made available I think ntriples kinda makes more sense than line-oriented rdf/xml. I say this only because most RDF libraries and tools have support for bulk loading ntriples, and none (to my knowledge) support loading line-oriented rdf/xml files.
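The appeal of N-Triples for bulk loading is that each line is one complete, self-contained statement, so a dump can be split, streamed and loaded in parallel without any XML parsing. Purely for illustration, a couple of lines might look like this (the VIAF URI is the Berners-Lee cluster from the WorldCat example above, but the exact predicates in the VIAF dump will differ):

<http://viaf.org/viaf/85312226> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://viaf.org/viaf/85312226> <http://xmlns.com/foaf/0.1/name> "Tim Berners-Lee" .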

I’ve made the 1.9G bzipped ntriples file available on Amazon S3 if you are interested in having at it:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2

Incidentally you can torrent it as well, which would help spread the download of the file (and would spare me some cost on S3), by pointing your BitTorrent client at:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2?torrent

As with the original VIAF dataset, this ntriples VIAF download contains information from VIAF (Virtual International Authority File) which is made available under the ODC Attribution License (ODC-By). Similarly, I am making the ntriples VIAF download available using the ODC-By License as well, because I think I have to, given the viral nature of ODC-By. At least that's my unprofessional (I am not a lawyer) reading of the license. I'm not really complaining either; I'm all for openness going viral :-)


On a side note, I upgraded my laptop after the 4 days it took to initially create the ntriples file. In the process I accidentally deleted the ntriples file when I reformatted my hard drive. So the second time around I spent some time seeing if I could generate it more quickly on Elastic MapReduce, by splitting the file across multiple workers that would generate the ntriples from the rdf/xml and merge it back together. The conversion of the rdf/xml to ntriples using rdflib was largely CPU bound on my laptop, so I figured Hadoop Streaming would help me run my little Python script on as many worker nodes as I needed.

EMR made setting up and running a job quite easy, but I ran into my own ignorance pretty quickly. It turns out that Hadoop was not able to split the gzipped VIAF data, which meant data was only ever sent to one worker, no matter how many I ran. I then ran across some advice to use LZO compression, which is supposedly splittable on EMR, but after lots of experimentation I couldn’t get it to split either. I thought about uncompressing the original gzipped file on S3, but felt kind of depressed about doing that for some reason.

I time-boxed only a few days to try to get EMR working, so I backpedaled to rewriting the conversion script with Python’s multiprocessing library. I thought multiprocessing would let me take advantage of a multi-core EC2 machine. But oddly the conversion ran slower using multiprocessing’s Pool than it did as a single process. I chalked this up to the overhead of pickling large strings of rdf/xml and ntriples to send them around during inter-process-communication…but I didn’t investigate further.

So, in the end, I just re-ran the script for 4 days, but this time up at EC2 so that I could use my laptop during that time. &sigh;


On Discovery

There’s an interesting story over at The Atlantic which discusses the important role that cataloging and archival description play in historical research. The example is a recently discovered report to the Surgeon General from Charles Leale about his treatment of Abraham Lincoln after he was shot. A few weeks ago a researcher named Helen Papaioannou discovered the report while combing a collection of correspondence to the Surgeon General looking for materials related to Lincoln for a project at the Abraham Lincoln Presidential Library and Museum. The Atlantic piece boldly declares in its title:

If You ‘Discover’ Something in an Archive, It’s Not a Discovery.

Then it goes on to heap accolades on the silent archivists toiling away for centuries who made the report possible to find. I've done my fair share of cataloging, and put in enough time working with EAD finding aids, to enjoy the pat on the back. But something about the piece struck me as odd, and it took a bit of reading of the announcement of the discovery, and listening to an NPR interview with Papaioannou, to put my finger on it.

It’s very possible, of course, with the volume of material that archives hold, for a particular professional to not know exactly what the repository holds. This is because archivists catalogue not at “item level,” a description of every piece of paper, which would take millennia, but at “collection level,” a description of the shape of the collection, who owned it, and what kinds of things it contains. With the volume of materials, some collections may be undescribed or even described wrongly. But if anyone thought that a report to the Surgeon General from a physician who saw Lincoln post-assassination existed, they might have looked through these correspondence files – which is exactly what the researcher, Helen Papaioannou, did. The exciting part about the Leale report is not that it was rescued from a “dusty archives” (an abhorrent turn of phrase!) but that since it’s now catalogued, everyone who wants to find it can.

Papaioannou’s own account is a bit more nuanced though:

Well, the record group I was currently searching was the records of the Office of the Surgeon General. And I was looking through his letters received, and I was in the L’s. And I was going through 1865, so I - since Lincoln died in 1865. I was almost finished with L and there it was, sitting right in the middle of a box.

This account makes it sound more like she was combing various record groups looking for correspondence from Lincoln, and accidentally ran across a letter from Leale that was filed nearby…and she happened to notice that it was about Lincoln, and subsequently that the document's existence was not known. So Papaioannou didn't suspect that the report to the Surgeon General existed and go searching for it. She was instead examining various record groups for any correspondence from Lincoln, and was alert enough to notice something as she was moving through the collection. And most importantly she recognized that the document was not known to the historical community: the all-important context, which is not completely knowable by any individual cataloger or archivist. At least that's how I'm reading it.

Saying that there is no discovery in libraries and archives, because all the discovery has been pre-coordinated by librarians and archivists, is putting the case for the work we do too strongly. It doesn't give enough credit to the acts of discovery and creativity that library users like Papaioannou perform, and which our institutions depend on. I'm not an expert, but it seems to me that the lines that divide the historian and the archivist are semi-permeable, especially since historical research itself gets archived, and archivists end up doing their own flavor of historical research when documenting the provenance of a collection. If we care about the future of libraries and archives we need to not only pat ourselves on the back for the work we do, but also recognize and appreciate the real work that goes on inside our buildings and on our websites.

And yes it’s great that the letter is now cataloged for re-discovery. But even better (for me) was that I was able to read the Atlantic piece, do some searches, and then go and listen to an interview with Papaioannou, and read the announcement from the Lincoln Library which includes a transcription of the actual letter.

…and then go and update the Wikipedia entry for Charles Leale to include information about the (very real) discovery of the letter.

Hopefully it won’t get reverted :-)



Wikimania Justification

Due to fiscal constraints we (understandably) have to write justifications for travel requests at $work, to make it clear how the conference/meeting fits in with the goals of the institution. I am planning on going to Wikimania for the first time this year; it is happening a short Metro ride away at George Washington University. The cost for the full event is $50, which is an amazing value, and makes it a bit of a no-brainer on the cost-benefit scale. But I still need to justify it, mainly because of the time away from work. If you work in a cultural heritage organization and ever find yourself wanting to go to Wikimania, maybe the justification I wrote will be of interest. I imagine you could easily substitute in your own organization's Web publishing projects appropriately …

Wikimania is the annual conference of the Wikipedia community. It is attended by thousands of people from around the world, and is the premier event for discussions and research about the continued development of Wikipedia–and it is being held in Washington, DC this year. Wikipedia comprises 22 million articles in 285 languages, and it has become the largest and most popular general reference work on the Internet, ranking sixth globally among all websites.

Wikimania is of particular interest to cultural heritage institutions, and specifically the Library of Congress, because of the important role that collections like American Memory, Chronicling America, the Prints and Photographs Online Catalog and the World Digital Library (among others) play for Wikipedia editors. Primary resource material on the Web is extremely important to editors for verifiability of article content–so much so that the Wikipedia community is specifically conducting outreach with its Galleries, Libraries, Archives and Museums (GLAM) project. Several of our peer institutions are involved in the GLAM effort, including the National Archives, the Smithsonian, and OCLC. Wikipedia remains one of the top referrers of web traffic to the Library of Congress web properties. LC’s multi-decade effort to put its unique collections online for the American people naturally aligns it with the mission of Wikipedia, and Wikimania is an excellent place to learn more about this collaboration that is going on with cultural heritage organizations.

I will be presenting on the value of open access to underlying datasets when conducting a real-time visualization of Wikipedia edits. There is a track of presentations for the cultural heritage community which I plan on attending. There is also a workshop on the Wikidata project, which has particular relevance for LC’s historic involvement in subject and name authority control files. In addition there is a Wikipedia Loves Libraries workshop being sponsored by OCLC to explore the ways in which libraries and Wikipedia can support each other in enriching discoverability and access to research material.


diving into VIAF

Last week saw a big (well, big for library data nerds) announcement from OCLC that they are making the data for the Virtual International Authority File (VIAF) available for download under the terms of the Open Data Commons Attribution (ODC-BY) license. If you’re not already familiar with VIAF here’s a brief description from OCLC Research:

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF’s goal is to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF seeks to include authoritative names from many libraries into a global service that is available via the Web. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities

More specifically, the VIAF service links national and regional-level authority records, creating clusters of related records, and expands the concept of universal bibliographic control by:

  • allowing national and regional variations in authorized form to coexist
  • supporting needs for variations in preferred language, script and spelling
  • playing a role in the emerging Semantic Web

If you go and look at the OCLC Research page you’ll notice that last month the VIAF project moved to OCLC. This is evidence of a growing commitment on OCLC’s part to make VIAF part of the library information landscape. It currently includes data about people, places and organizations from 22 different national libraries and other organizations.

Already there has been some great writing about what the release of VIAF data means for the cultural heritage sector. In particular Thom Hickey’s blog Outgoing is a trove of information about the project, providing a behind-the-scenes look at the various services it offers.

Rather than paraphrase what others have said already I thought I would download some of the data and report on what it looks like. Specifically I’m interested in the RDF data (as opposed to the custom XML and MARC variants) since I believe it to have the most explicit structure and relations. The shared semantics in the RDF vocabularies that are used also make it the most interesting from a Linked Data perspective.

Diving In

The primary data structure of interest in the data dumps that OCLC has made available is what they call the cluster. A cluster is essentially a hub-and-spoke model, with a resource for the person, place or organization in the middle, attached via the spokes to conceptual resources at the participating VIAF institutions. As an example, here is an illustration of the VIAF cluster for the Canadian archivist Hugh Taylor:

Here you can see a FOAF Person resource (yellow) in the middle that is linked to from SKOS Concepts (blue) for the Bibliothèque nationale de France, Library and Archives Canada, the Deutsche Nationalbibliothek, BIBSYS (Norway) and the Library of Congress. Each of the SKOS Concepts has its own preferred label, which you can see varies across institutions. This high-level view obscures quite a bit of data, which is probably best viewed in Turtle if you want to see it:

<http://viaf.org/viaf/14894854>
    rdaGr2:dateOfBirth "1920-01-22" ;
    rdaGr2:dateOfDeath "2005-09-11" ;
    a rdaEnt:Person, foaf:Person ;
    owl:sameAs <http://d-nb.info/gnd/109337093> ;
    foaf:name "Taylor, Hugh A.",
        "Taylor, Hugh A. (Hugh Alexander), 1920-",
        "Taylor, Hugh Alexander 1920-2005" .
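
If you want to poke at a cluster programmatically, here’s a minimal sketch using rdflib in Python. It assumes you’ve saved a cluster’s RDF (with the usual prefix declarations) to a local Turtle file; the file name taylor.ttl is a hypothetical stand-in for wherever the data ends up on your disk.

import rdflib
from rdflib.namespace import FOAF, OWL

# Load a single VIAF cluster that has been saved locally as Turtle.
# The file name is hypothetical; point it at your own copy of the data.
g = rdflib.Graph()
g.parse("taylor.ttl", format="turtle")

person = rdflib.URIRef("http://viaf.org/viaf/14894854")

# The name variants asserted for the person at the hub of the cluster.
for name in g.objects(person, FOAF.name):
    print("name:", name)

# Links out to the same person in other authority files.
for same in g.objects(person, OWL.sameAs):
    print("sameAs:", same)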


way, way back

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive’s collection; but what perhaps isn’t as widely known is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API, for determining when a given URL is available in a web archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks the core idea in Memento is pretty simple; but I think that simplicity may be obscured for software developers who need to fully digest the spec in order to say they “do” Memento.
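
To make that core idea concrete, here’s a minimal sketch of the client side of the protocol in Python using the requests library. The particular TimeGate URL is my assumption, not something from the spec or from this post: the Internet Archive’s Wayback Machine answers Memento requests at web.archive.org/web/, and any Memento-compliant TimeGate should behave the same way.

import requests

# Ask a TimeGate for the archived copy of a URL closest to a given datetime.
# The TimeGate below (the Wayback Machine's) is an assumption for the sake
# of the example; swap in whatever TimeGate you actually have access to.
target = "http://www.nytimes.com/"
timegate = "http://web.archive.org/web/" + target

response = requests.get(
    timegate,
    headers={"Accept-Datetime": "Thu, 01 Apr 2010 00:00:00 GMT"},
    allow_redirects=False,
)

# A TimeGate answers with a redirect to the closest memento it knows about,
# plus Link headers pointing at the original resource and a TimeMap.
print(response.status_code)
print(response.headers.get("Location"))
print(response.headers.get("Link"))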

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>4425</numreturned>
    <type>urlquery</type>
    <enddate>20120503151837</enddate>
    <numresults>4425</numresults>
    <firstreturned>0</firstreturned>
    <url>nytimes.com/</url>
    <resultsrequested>40000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>68043717</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>IA-001766.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961112181513</capturedate>
    </result>
    <result>
      <compressedoffset>8107</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BK-000007.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961121230155</capturedate>
    </result>
    ...
  </results>
</wayback>

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19910806145620</startdate>
    <numreturned>201</numreturned>
    <type>urlquery</type>
    <enddate>20120503152750</enddate>
    <numresults>201</numresults>
    <firstreturned>0</firstreturned>
    <url>bbc.co.uk/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>75367408</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-196764-0.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080410162445</capturedate>
    </result>
    <result>
      <compressedoffset>92484146</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-7307314-46.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080527231330</capturedate>
    </result>
    ...
  </results>
</wayback>

It turns out the British Library is using this structured data to access data from Hadoop, where their web archives live on HDFS as WARC files–which is pretty slick. Actually, WARCs on spinning disk are pretty awesome by themselves, no matter how you are doing it.
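
If you want to play with this yourself, here’s a minimal sketch in Python that hits the Internet Archive endpoint mentioned above and dumps out whatever the response contains. It deliberately doesn’t assume anything about the element names, since those may vary between Wayback installations; it just walks the XML tree and prints the leaf values (requests and the standard library’s ElementTree are the only dependencies).

import requests
import xml.etree.ElementTree as ET

# Fetch the XML capture list for a URL from a Wayback Machine instance.
endpoint = "http://wayback.archive.org/web/xmlquery"
response = requests.get(endpoint, params={"url": "http://www.nytimes.com"})

doc = ET.fromstring(response.content)

# Walk the tree and print every leaf element, so the capture records
# (dates, HTTP status codes, ARC/WARC file names) are easy to eyeball.
for element in doc.iter():
    if len(element) == 0:
        print(element.tag, (element.text or "").strip())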

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

  1. Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
  2. How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
  3. Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heritrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)