meeting notes and a manifesto

A few weeks ago I was in sunny (mostly) Palo Alto, California at a Linked Data Workshop hosted by Stanford University, and funded by the Council on Library and Information Resources. It was an invite-only event, with very little footprint on the public Web, and an attendee list that was distributed via email with “Confidential PLEASE do not re-distribute” across the top. So it feels more than a little bit foolhardy to be writing about it here, even in a limited way. But I was going to the meeting as an employee of the Library of Congress, so I feel a bit responsible for making some kind of public notes/comments about what happened. I suspect this might impact future invitations to similar events, but I can live with that :-)

A week is a long time to talk about anything…and Linked Data is certainly no exception. The conversation was buoyed by the fact that it was a pretty small group of around 30 folks from a wide variety of institutions including: Stanford University, Bibliotheca Alexandrina, Bibliothèque nationale de France, Aalto University, Emory University, Google, National Institute of Informatics, University of Virginia, University of Michigan, Deutsche Nationalbibliothek, British Library, Det Kongelige Bibliotek, California Digital Library and Seme4. So the focus was definitely on cultural heritage institutions, and specifically what Linked Data can offer them. There was a good mix of people in attendance: from some who were relatively new to Linked Data and RDF, to others who had been involved in the space before the term Linked Data, RDF or the Web existed…and there were people like me who were somewhere in between.

A few of us took collaborative notes in PiratePad, which sort of tapered off as the week progressed and more discussion happened. After some initial lightning-talk-style presentations from attendees on Linked Data projects they were involved in, we spent most of the rest of the week breaking up into 4 groups to discuss various issues, and then joining back up with the larger group for discussion. If things go as planned you can expect to see a report from Stanford that covers the opportunities and challenges that Linked Data offers the cultural heritage sector, which were raised during these sessions. I think it’ll be a nice complement to the report that the W3C Library Linked Data Incubator Group is preparing, which recently became available as a draft.

One thing that has stuck with me a few weeks later is the continued need in the cultural heritage Linked Data sector for reconciliation services that help people connect up their resources with appropriate resources that other folks have published. If you work for a large organization, there is often even a need for reconciliation services within the enterprise. For example the British Library reported that it has some 300 distinct data systems within the organization that sometimes need to be connected together. Linking is the essential ingredient, the sine qua non of Linked Data. Linking is what makes Linked Data and the RDF data model different. It helps you express the work you may have done in joining up your data with other people’s data. It’s the 4th design pattern in Tim Berners-Lee’s Linked Data Design Issues:

Include links to other URIs. so that they can discover more things.

But, expressing the links is the easy part…creating them where they do not currently exist is harder.

Fortunately, Hugh Glaser was on hand to talk about the role that plays in the Linked Data space, and how the RKBexplorer managed to reconcile authors’ names across institutional repositories. He also described some work with the British Museum linking up two different data silos about museum objects, to provide unified web views for those objects. Hugh, if you are reading this, can you comment with a link to this work you did, and how it surfaces in the British Museum website?

Similarly Stefano Mazzocchi talked about how Google’s tools like Freebase Suggest and their WebID app can make it easy to integrate Freebase’s identity system into your applications. If you are building a cataloging tool, take a serious look at what using something like Freebase Suggest (a jQuery plugin) can offer your application. In addition, as part of the Google Refine data cleanup tool, Google has created an API for data reconciliation services, which other service providers could supply. Stefano indicated that Google was considering releasing the code behind this reconciliation service, and stressed that it is useful for the community to make more of these reconciliation services available, to help others link their data with other people’s data. It seems obvious I guess, but I was interested to hear that Google themselves are encouraging the use of Freebase IDs to join up data within their enterprise.
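To make the idea of a reconciliation service a bit more concrete, here is a minimal sketch in Python. It is not Google's code or API: the tiny authority list, the `ex:` identifiers, the `reconcile` function and the 0.9 match threshold are all made up for illustration, and the shape of the response (a list of candidates with an id, name, score and match flag) is only loosely modeled on the kind of result a reconciliation service might return.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory authority file: identifier -> preferred label.
# A real service would sit on top of a much larger dataset.
AUTHORITY = {
    "ex:1": "Twain, Mark",
    "ex:2": "Shelley, Mary Wollstonecraft",
    "ex:3": "Shelley, Percy Bysshe",
}


def reconcile(query, limit=3, match_threshold=0.9):
    """Return candidate matches for a name string, best score first."""
    candidates = []
    for ident, label in AUTHORITY.items():
        # Simple case-insensitive string similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        candidates.append({
            "id": ident,
            "name": label,
            "score": round(score, 3),
            # Assumed rule: only very close strings count as a confident match.
            "match": score >= match_threshold,
        })
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return {"result": candidates[:limit]}


if __name__ == "__main__":
    for candidate in reconcile("twain, mark")["result"]:
        print(candidate)
```

A cataloging tool could call something like this as the cataloger types, and store the chosen identifier rather than just the string. Plain string similarity is a crude stand-in here; a real service would likely use more context (dates, types, related works) to rank candidates.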

Almost a year ago Leigh Dodds created a similar API layer for data that is stored in the Talis Platform. Now that the British National Bibliography is being made available in a Talis Store, it might be possible to use Leigh’s code to put a reconciliation service on top of that data. The caveat being that not all of the BNB is currently available there. By the way, hats off to the British Library for iteratively making that data available, and getting feedback early, instead of waiting for it all to be “done”…which of course it never will be, if they are successful at integrating Linked Data into their existing data workflows.

If you squint right, I think it’s also possible to look at the VIAF AutoSuggest service as a type of reconciliation service. It would be useful to have a similar service over the Library of Congress Subject Headings. Having similar APIs for these services could be a handy thing as we begin to build new descriptive cataloging tools that take advantage of these pools of data. But I don’t think it’s necessarily essential, as the APIs could be orchestrated in a more ad hoc, web2.0 mashup style. I imagine I’m not alone in thinking we’re now at the stage when we can start building new cataloging tools that take advantage of these data services. Along those lines Rachel Frick had an excellent idea to try to enhance collection building applications like Omeka and Archives Space to take advantage of reconciliation services under the covers. Adding a bit of suggest-like functionality to these tools could smooth out the description of resources that libraries, museums and archives are putting online. I think the Omeka-LCSH plugin is a good example of steps in this direction.

One other thing that stuck with me from the workshop is that the new (dare I say buzzwordy) focus on Library Linked Data is somewhat frightening to library data professionals. There is a lot of new terminology, and issues to work out (as the Stanford report will hopefully highlight). Some of this scariness is bound up with the Resource Description and Access sea change that is underway. But crufty as they are, data systems built around MARC have served the library community well over the past 30 years. Some of the motivations for Linked Data are specifically for Linked Open Data, where the linking isn’t as important as the openness. The LOD-LAM summit captured some of this spirit in the 4 Star Classification Scheme for Linked Open Cultural Metadata, which focuses on licensing issues. There was a strong undercurrent at the Stanford meeting about licensing issues. The group recognized that explicit licensing is important, but it was intentionally kept out of scope of most of the discussion. Still I think you can expect to see some of the heavy hitters from this group exert some influence in this arena to help bring clarity to licensing issues around our data. I think that some of the ideas of opening up the data, and disrupting existing business workflows around the data can seem quite scary to those who have (against a lot of odds) gotten them working. I’m thinking of the various cooperative cataloging efforts that allow work to get done in libraries today.

Truth be told, I may have inspired some of the “fear” around Linked Data by suggesting that the Stanford group work on a manifesto to rally around, much like what the Agile Manifesto did for the Agile software development movement. I don’t think we had come to enough consensus to really get a manifesto together, but on the last day the sub-group I was in came up with a straw man (near the bottom of the piratepad notes) to throw darts at. Later on (on my own) I kind of wordsmithed them into a briefer list. I’ll conclude this blog post by including the “manifesto” here not as some official manifesto of the workshop (it certainly is not), but more as a personal manifesto, that I’d like to think has been guiding some of the work I have been involved in at the Library of Congress over the past few years:

Manifesto for Linked Libraries

We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:

  • Publishing data on the Web for discovery over preserving it in dark archives.
  • Continuous improvement of data over waiting to publish perfect data.
  • Semantically structured data over flat unstructured data.
  • Collaboration over working alone.
  • Web standards over domain-specific standards.
  • Use of open, commonly understood licenses over closed, local licenses.

That is, while there is value in the items on the right, we value the items on the left more.

The manifesto is also on a publicly editable Google Doc; so if you feel the need to add or comment please have a go. I was looking for an alternative to “Linked Libraries” since it was not inclusive of archives and museums … but I didn’t spend much time obsessing on it. One of the selfish motivations for publishing the manifesto here was to capture it at a particular point in time where I was happy with it :-)

the dpla as a generative platform

Last week I had the opportunity to attend a meeting of the Digital Public Library of America (DPLA) in Amsterdam. Several people have asked me why an American project was meeting in Amsterdam. In large part it was an opportunity for the DPLA to reach out to, and learn from European projects such as Europeana, LOD2 and Wikimedia Germany–or as the agenda describes:

The purpose of the May 16 and 17 expert working group meeting, convened with generous support from the Open Society Foundations, is to begin to identify the characteristics of a technical infrastructure for the proposed DPLA. This infrastructure ideally will be interoperable with international efforts underway, support global learning, and act as a generative platform for undefined future uses. This workshop will examine interoperability of discovery, use, and deep research in existing global digital library infrastructure to ensure that the DPLA adopts best practices in these areas. It will also serve to share information and foster exchange among peers, to look for opportunities for closer collaboration or harmonization of existing efforts, and to surface topics for deeper inquiry as we examine the role linked data might play in a DPLA.

Prior to the meeting I read the DPLA Concept Note and watched the discussion list and wiki activity, but the DPLA still seemed somewhat hard to grasp to me. The thing I learned at the meeting in Amsterdam is that this nebulousness is by design–not by accident. The DPLA steering committee aren’t really pushing a particular solution that they have in mind. In fact, there doesn’t seem to be a clear consensus about what problem they are trying to solve. Instead the steering committee seem to be making a concerted effort to keep an open, beginner’s mind about what a Digital Public Library of America might be. They are trying to create conversations around several broad topic areas or work-streams: content and scope, financial/business models, governance, legal issues, and technical aspects. The recent meeting in Amsterdam focused on the technical aspects work-stream–in particular, best practices for data interoperability on the Web. The thought being that perhaps the DPLA could exist in some kind of distributed relationship with existing digital library efforts in the United States–and possibly abroad. Keeping an open mind in situations like this takes quite a bit of effort. There is often an irresistible urge to jump to particular use cases, scenarios or technical solutions, for fear of seeming ill informed or rudderless. I think the DPLA should be commended for creating conversations at this formative stage, instead of solutions in search of a problem.

I hadn’t noticed the phrase “generative platform” in the meeting announcement above until I began this blog post…but in hindsight it seems very evocative of the potential of the DPLA. At their best, digital libraries currently put content on the Web, so that researchers can discover it via search engines like Google, Bing, Baidu, etc. Researchers discover a pool of digital library content while performing a query in a search engine. Once they’ve landed in the digital library webapp they can wander outwards to related resources, and perhaps do a more nuanced search within the scoped context of the collection. But in practice this doesn’t happen all that often. I imagine many institutions digitize content that actually never makes it onto the Web at all. And when it does make it onto the Web it is often deep-web content hiding behind a web form, un-discoverable by crawlers. Or worse, the content might be actively made invisible by using a robots.txt to prevent search engines from crawling it. Sadly this is often done for performance reasons, not out of any real desire to keep the content from being found–because all too often library webapps are not designed to support crawling.

I was invited to talk very briefly (10 minutes) about Linked Data at the Amsterdam meeting. I think most everyone recognizes that a successful DPLA would exist in a space where there have been years of digitization efforts in the US, with big projects like the HathiTrust and countless others going on. I wanted to talk about how the Web could be used to integrate these collections. Rather than digging into a particular Linked Data solution to the problem of synchronization, I thought I would try to highlight how libraries could learn to do web1.0 a bit better. In particular I wanted to showcase how Google Scholar abandoned OAI-PMH (a traditional library standard for integrating collections) in favor of using sitemaps and metadata embedded in HTML. I wanted to show how thoughtful use of sitemaps, a sensible robots.txt, and perhaps some Atom to publish updates and deletes a bit more methodically can offer just the same functionality as OAI-PMH, but in a way that is aligned with the Web, and the services that are built on top of it. Digital library initiatives often go off and create their own specialized way of looking at the Web, and ignore broader trends. The nature of grant funding, and research papers often serve as an incentive for this behavior. I’ve heard rumors that there is even some NISO working group being formed to look into standardizing some sort of feed based approach to metadata harvesting. Personally I think it’s probably more important for us to use some of the standards and patterns that are already available instead of trying to define another one.
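As a rough illustration of the sitemap idea, here is a small Python sketch (not something I showed at the meeting). It writes a sitemap with a `lastmod` date for each item, and then plays the part of a harvester asking "what changed since my last visit?", which is roughly the job OAI-PMH's date-based selective harvesting does. The function names and the example.org URLs are hypothetical, and comparing dates as strings assumes every `lastmod` uses the same YYYY-MM-DD form.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(items):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is assumed to be a date string like '2011-05-20'.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url, last_modified in items:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


def changed_since(sitemap_xml, since):
    """Harvester side: list URLs whose lastmod is on or after `since`."""
    ns = {"sm": SITEMAP_NS}
    root = ET.fromstring(sitemap_xml)
    return [
        entry.findtext("sm:loc", namespaces=ns)
        for entry in root.findall("sm:url", ns)
        # String comparison works only because the dates share one format.
        if entry.findtext("sm:lastmod", namespaces=ns) >= since
    ]


if __name__ == "__main__":
    sitemap = build_sitemap([
        ("http://example.org/item/1", "2011-05-01"),
        ("http://example.org/item/2", "2011-05-20"),
    ])
    print(changed_since(sitemap, "2011-05-10"))
```

Deletes are the part a plain sitemap does not cover, which is where something like an Atom feed of changes could come in.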

So you could say I pulled a bit of a bait and switch: instead of talking about Linked Data I really ended up talking about Search Engine Optimization. I didn’t mention RDF or SPARQL once. If anyone noticed they didn’t seem to mind too much.

I learned a lot of very useful information during the presentations–too much to really note here. But there was one conversation that really stands out after a week has passed.

Greg Crane of the Perseus Digital Library spoke about Deep Research, and how students and researchers participate in the creation of online knowledge. At one point Greg had a slide that contained a map of the entire world, and spoke about how the scope of the DPLA can’t really be confined to the United States alone–since American society is largely made up of immigrant communities (some by choice, some not) the scope of the DPLA is in fact the entire world. I couldn’t help but think how completely audacious it was to say that the Digital Public Library of America would somehow encompass the world, similar to how brick and mortar library and museum collections can often mirror the imperialistic interests of the states that they belong to.

Original WWW graphic by Robert Cailliau

So I was relieved when Stefan Gradmann asked how Greg thought the DPLA would fit in with projects like Europeana, which are already aggregating content from Europe. I can’t exactly recall Greg’s response (update: Greg filled in some of the blanks via email), but this prompted Dan Brickley to point out that in fact it’s pretty hard to draw lines around Europe too … and more importantly the Web is a space that can unite these projects. At this point Josh Greenberg jumped in and suggested that perhaps some thoughtful linking between sites like a DPLA and Europeana could help bring them together. This all probably happened in the span of 3 or 4 minutes, but the exchange really crystallized for me that the cultural heritage community could do a whole lot better at deep linking with each other. Josh’s suggestion is particularly good, because researchers could see and act on contextual links. It wouldn’t be something hidden in a data layer that nobody ever looks at. But to do this sort of linking right we would need to share our data better with each other, and it would most likely need to be Linked Data — machine readable data with URLs at its core. I guess it’s a no-brainer that for it to succeed the DPLA needs to be aligned with the ultimate generative platform of our era: the World Wide Web. Name things with URLs, create typed links between them, and other people’s stuff.
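For anyone who hasn't seen what "typed links" look like in practice, here is a tiny Python sketch that writes a couple of them out as N-Triples. The two predicates (Dublin Core `creator` and OWL `sameAs`) are real vocabulary terms, but the item, person and agent URLs are invented placeholders, not real DPLA or Europeana identifiers, and `to_ntriples` is just a helper made up for this example.

```python
# Real, widely used link types (predicates).
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def to_ntriples(links):
    """Serialize (subject, predicate, object) URL triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in links)


# Hypothetical URLs: one site names its own things, then links to another site.
links = [
    ("http://dpla.example.org/item/42", DCTERMS_CREATOR,
     "http://dpla.example.org/person/7"),
    ("http://dpla.example.org/person/7", OWL_SAME_AS,
     "http://europeana.example.org/agent/99"),
]

if __name__ == "__main__":
    print(to_ntriples(links))
```

The second triple is the interesting one: it is a link that crosses from one project's namespace into another's, which is exactly the kind of connection a researcher (or a crawler) could follow.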

Another thing that struck me was how Europeana really gets the linking part. Europeana is essentially a portal site, or directory of digital objects for more than 15 million items provided by hundreds of providers across Europe. You can find these objects in Europeana, but if you drill down far enough you eventually find yourself on the original site that made the object available. I agree with folks that think that perhaps the user experience of the site would be improved if the user never left Europeana to view the digital object in full. This would necessarily require harvesting a richer version of the digital object, which would be more difficult, but not impossible. There would also be an opportunity to serve as a second copy for the object, which is potentially very desirable to originating institutions for preservation purposes…lots of copies keeps stuff safe.

But even in this hypothetical scenario where the object is available in full on Europeana, I think it would still be important to link out to the originating institution that digitized the object. Linking makes the provenance of the item explicit, which will continue to be important to researchers on the Web. But perhaps more importantly it gives institutions a reason to participate in the project as a whole. Participants will see increased awareness and use of their web properties, as users wander over from Europeana. Perhaps they could even link back to Europeana, which ought to increase Europeana’s density in the graph of the web, which also should boost its relevancy ranking in search engines like Google.

Another good lesson of Europeana is that it’s not just about libraries, but also includes many archives, museums and galleries. One of my personal disappointments about the Amsterdam meeting was that Mathias Schindler of Wikimedia-Germany had to pull out at the last minute. I’ve never met him, but Mathias has had a lot to do with trying to bring the Wikipedia and Library communities together. Efforts to promote the use of Wikipedia as a platform in the Galleries, Libraries, Archives and Museums (GLAM) sector are intensifying. The pivotal role that Wikipedia has had in the Linked Data community in the form of is also very significant. Earlier this year there was a meeting of various Wikipedia, dbpedia and Freebase folks at a Data Summit, where people talked about the potential for an inner hub for the various language Wikipedias to share inter-wiki links, and extracted structured metadata. I haven’t heard whether this is actually leading anywhere currently, but at the very least it’s a recognition that Wikipedia is itself turning into a key part of information infrastructure on the web.

So I’ve rambled on a bit at this point. Thanks for reading this far. My take-away from the Amsterdam meeting was that the DPLA needs to think about how it wants to align itself with the Web, and work with its grain … not against it. This is easier said than done. The DPLA needs to think about incentives that would give existing digital library projects practical reasons to want to be involved. This also is easier said than done. And hopefully these incentives won’t just involve getting grant money. Keeping an open mind, taking a REST here and there, and continuing to have these very useful conversations (and contests) should help shape the DPLA as a generative platform.