A few weeks ago I was in sunny (mostly) Palo Alto, California at a Linked Data Workshop hosted by Stanford University and funded by the Council on Library and Information Resources. It was an invite-only event, with very little footprint on the public Web, and an attendee list that was distributed via email with “Confidential PLEASE do not re-distribute” across the top. So it feels more than a little bit foolhardy to be writing about it here, even in a limited way. But I was going to the meeting as an employee of the Library of Congress, so I feel a bit responsible for making some kind of public notes/comments about what happened. I suspect this might impact future invitations to similar events, but I can live with that :-)

A week is a long time to talk about anything…and Linked Data is certainly no exception. The conversation was buoyed by the fact that it was a pretty small group of around 30 folks from a wide variety of institutions including: Stanford University, Bibliotheca Alexandrina, Bibliothèque nationale de France, Aalto University, Emory University, Google, National Institute of Informatics, University of Virginia, University of Michigan, Deutschen Nationalbibliothek, British Library, Det Kongelige Bibliotek, California Digital Library and Seme4. So the focus was definitely on cultural heritage institutions, and specifically what Linked Data can offer them. There was a good mix of people in attendance, ranging from some who were relatively new to Linked Data and RDF to others who had been involved in the space before the term Linked Data, RDF or the Web existed…and there were people like me who were somewhere in between.

A few of us took collaborative notes in PiratePad, which sort of tapered off as the week progressed and more discussion happened. After some initial lightning-talk-style presentations from attendees on Linked Data projects they were involved in, we spent most of the rest of the week breaking up into 4 groups to discuss various issues, and then joining back up with the larger group for discussion. If things go as planned you can expect to see a report from Stanford that covers the opportunities and challenges that Linked Data offers the cultural heritage sector, which were raised during these sessions. I think it’ll be a nice complement to the report that the W3C Library Linked Data Incubator Group is preparing, which recently became available as a draft.

One thing that has stuck with me a few weeks later is the continued need in the cultural heritage Linked Data sector for reconciliation services that help people connect up their resources with appropriate resources that other folks have published. If you work for a large organization, there is often even a need for reconciliation services within the enterprise. For example the British Library reported that it has some 300 distinct data systems within the organization that sometimes need to be connected together. Linking is the essential ingredient, the sine qua non of Linked Data. Linking is what makes Linked Data and the RDF data model different. It helps you express the work you may have done in joining up your data with other people’s data. It’s the 4th design pattern in Tim Berners-Lee’s Linked Data Design Issues:

Include links to other URIs. so that they can discover more things.

But, expressing the links is the easy part…creating them where they do not currently exist is harder.
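To make the "easy part" concrete, here is a minimal sketch of what expressing a link can look like: a single N-Triples statement that uses owl:sameAs to say two URIs identify the same thing. The helper name and the example.org/example.com URIs are placeholders I made up for illustration; they are not identifiers from any of the projects mentioned here.

```python
# The owl:sameAs property from the W3C OWL vocabulary.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(local_uri, remote_uri):
    """Return one N-Triples line asserting that two URIs identify the same thing.

    Both arguments are assumed to be absolute URIs with no characters that
    would need escaping in N-Triples.
    """
    return "<{}> <{}> <{}> .".format(local_uri, OWL_SAME_AS, remote_uri)


# Hypothetical URIs: a record in your own system and somebody else's record.
print(same_as_triple("http://example.org/authors/42",
                     "http://example.com/people/jane-doe"))
```

Writing the statement is a one-liner; deciding which remote URI really belongs on the right-hand side is where the work is.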

Fortunately, Hugh Glaser was on hand to talk about the role that sameas.org plays in the Linked Data space, and how the RKBexplorer managed to reconcile authors’ names across institutional repositories. He also described some work with the British Museum linking up two different data silos about museum objects, to provide unified web views for those objects. Hugh, if you are reading this, can you comment with a link to this work you did, and how it surfaces in the British Museum website?

Similarly Stefano Mazzocchi talked about how Google’s tools like Freebase Suggest and their WebID app can make it easy to integrate Freebase’s identity system into your applications. If you are building a cataloging tool, take a serious look at what using something like Freebase Suggest (a jQuery plugin) can offer your application. In addition, as part of the Google Refine data cleanup tool, Google has created an API for data reconciliation services, which other service providers could supply. Stefano indicated that Google was considering releasing the code behind this reconciliation service, and stressed that it is useful for the community to make more of these reconciliation services available, to help others link their data with other people’s data. It seems obvious I guess, but I was interested to hear that Google themselves are encouraging the use of Freebase IDs to join up data within their enterprise.

Almost a year ago Leigh Dodds created a similar API layer for data that is stored in the Talis Platform. Now that the British National Bibliography is being made available in a Talis Store, it might be possible to use Leigh’s code to put a reconciliation service on top of that data. The caveat is that not all of the BNB is currently available there. By the way, hats off to the British Library for iteratively making that data available, and getting feedback early, instead of waiting for it all to be “done”…which of course it never will be, if they are successful at integrating Linked Data into their existing data workflows.

If you squint right, I think it’s also possible to look at the VIAF AutoSuggest service as a type of reconciliation service. It would be useful to have a similar service over the Library of Congress Subject Headings at id.loc.gov. Having similar APIs for these services could be a handy thing as we begin to build new descriptive cataloging tools that take advantage of these pools of data. But I don’t think it’s necessarily essential, as the APIs could be orchestrated in a more ad hoc, web2.0 mashup style. I imagine I’m not alone in thinking we’re now at the stage when we can start building new cataloging tools that take advantage of these data services. Along those lines Rachel Frick had an excellent idea to try to enhance collection building applications like Omeka and Archives Space to take advantage of reconciliation services under the covers. Adding a bit of suggest-like functionality to these tools could smooth out the description of resources that libraries, museums and archives are putting online. I think the Omeka-LCSH plugin is a good example of steps in this direction.
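For a rough feel of what "suggest-like functionality" means at its simplest, here is a toy sketch: take the string a cataloger typed and rank candidate authority labels by string similarity. To be clear, this is not the Google Refine reconciliation API, VIAF AutoSuggest, or anything else discussed above; it is just Python's standard difflib run over a tiny in-memory table. The function name, the threshold, and the example.org URIs are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Illustrative authority table: label -> URI.
# The example.org URIs are placeholders, not real authority identifiers.
AUTHORITY = {
    "Twain, Mark, 1835-1910": "http://example.org/authority/1",
    "Woolf, Virginia, 1882-1941": "http://example.org/authority/2",
    "Wolfe, Thomas, 1900-1938": "http://example.org/authority/3",
}


def reconcile(query, authority=AUTHORITY, limit=3, threshold=0.3):
    """Return up to `limit` candidates as (score, label, uri), best match first.

    The score is difflib's similarity ratio (0.0 to 1.0) between the
    lowercased query and the lowercased label; candidates scoring below
    `threshold` are dropped.
    """
    scored = []
    for label, uri in authority.items():
        score = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        if score >= threshold:
            scored.append((round(score, 2), label, uri))
    return sorted(scored, reverse=True)[:limit]


# Show the ranked candidates a cataloging form could offer for this input.
for score, label, uri in reconcile("Woolf, Virginia"):
    print(score, label, uri)
```

A real service would sit behind HTTP and draw on far better matching evidence (dates, titles, existing links), but the shape is the same: a string goes in, ranked URIs come out, and a person or a threshold picks one.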

One other thing that stuck with me from the workshop is that the new (dare I say buzzwordy) focus on Library Linked Data is somewhat frightening to library data professionals. There is a lot of new terminology, and issues to work out (as the Stanford report will hopefully highlight). Some of this scariness is bound up with the Resource Description and Access sea change that is underway. But crufty as they are, data systems built around MARC have served the library community well over the past 30 years. Some of the motivations for Linked Data are specifically for Linked Open Data, where the linking isn’t as important as the openness. The LOD-LAM summit captured some of this spirit in the 4 Star Classification Scheme for Linked Open Cultural Metadata, which focuses on licensing issues. There was a strong undercurrent at the Stanford meeting about licensing issues. The group recognized that explicit licensing is important, but it was intentionally kept out of scope of most of the discussion. Still I think you can expect to see some of the heavy hitters from this group exert some influence in this arena to help bring clarity to licensing issues around our data. I think that some of the ideas of opening up the data, and disrupting existing business workflows around the data can seem quite scary to those who have (against a lot of odds) gotten them working. I’m thinking of the various cooperative cataloging efforts that allow work to get done in libraries today.

Truth be told, I may have inspired some of the “fear” around Linked Data by suggesting that the Stanford group work on a manifesto to rally around, much like what the Agile Manifesto did for the Agile software development movement. I don’t think we had come to enough consensus to really get a manifesto together, but on the last day the sub-group I was in came up with a straw man (near the bottom of the PiratePad notes) to throw darts at. Later on (on my own) I kind of wordsmithed it into a briefer list. I’ll conclude this blog post by including the “manifesto” here not as some official manifesto of the workshop (it certainly is not), but more as a personal manifesto, that I’d like to think has been guiding some of the work I have been involved in at the Library of Congress over the past few years:

Manifesto for Linked Libraries

We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:

Publishing data on the Web for discovery over preserving it in dark archives.
Continuous improvement of data over waiting to publish perfect data.
Semantically structured data over flat unstructured data.
Collaboration over working alone.
Web standards over domain-specific standards.
Use of open, commonly understood licenses over closed, local licenses.

That is, while there is value in the items on the right, we value the items on the left more.

The manifesto is also on a publicly editable Google Doc, so if you feel the need to add or comment please have a go. I was looking for an alternative to “Linked Libraries” since it was not inclusive of archives and museums … but I didn’t spend much time obsessing on it. One of the selfish motivations for publishing the manifesto here was to capture it at a particular point in time where I was happy with it :-)