the dpla as a generative platform

Last week I had the opportunity to attend a meeting of the Digital Public Library of America (DPLA) in Amsterdam. Several people have asked me why an American project was meeting in Amsterdam. In large part it was an opportunity for the DPLA to reach out to, and learn from European projects such as Europeana, LOD2 and Wikimedia Germany–or as the agenda describes:

The purpose of the May 16 and 17 expert working group meeting, convened with generous support from the Open Society Foundations, is to begin to identify the characteristics of a technical infrastructure for the proposed DPLA. This infrastructure ideally will be interoperable with international efforts underway, support global learning, and act as a generative platform for undefined future uses. This workshop will examine interoperability of discovery, use, and deep research in existing global digital library infrastructure to ensure that the DPLA adopts best practices in these areas. It will also serve to share information and foster exchange among peers, to look for opportunities for closer collaboration or harmonization of existing efforts, and to surface topics for deeper inquiry as we examine the role linked data might play in a DPLA.

Prior to the meeting I read the DPLA Concept Note, watched the discussion list and wiki activity – but the DPLA still seemed somewhat hard to grasp to me. The thing I learned at the meeting in Amsterdam is that this nebulousness is by design–not by accident. The DPLA steering committee aren’t really pushing a particular solution that they have in mind. In fact, there doesn’t seem to be a clear consensus about what problem they are trying to solve. Instead the steering committee seem to be making a concerted effort to keep an open, beginners-mind about what a Digital Public Library of America might be. They are trying to create conversations around several broad topic areas or work-streams: content and scope, financial/business models, governance, legal issues, and technical aspects. The recent meeting in Amsterdam focused on the technical aspects work-stream–in particular, best practices for data interoperability on the Web. The thought being that perhaps the DPLA could exist in some kind of distributed relationship with existing digital library efforts in the United States–and possibly abroad. Keeping an open mind in situations like this takes quite a bit of effort. There is often an irresistable urge to jump to particular use cases, scenarios or technical solutions, for fear of seeming ill informed or rudderless. I think the DPLA should be commended for creating conversations at this formative stage, instead of solutions in search of a problem.

I hadn’t noticed the phrase “generative platform” in the meeting announcement above until I began this blog post…but in hindsight it seems very evocative of the potential of the DPLA. At their best, digital libraries currently put content on the Web, so that researchers can discover it via search engines like Google, Bing, Baidu, etc. Researchers discover a pool of digital library content while performing a query in a search engine. Once they’ve landed in the digital library webapp they can wander outwards to related resources, and perhaps do a more nuanced search within the scoped context of the collection. But in practice this doesn’t happen all that often. I imagine many institutions digitize content that actually never makes it onto the Web at all. And when it does make it onto the Web it is often deep-web content hiding behind a web form, un-discoverable by crawlers. Or worse, the content might be actively made invisible by using a robots.txt to prevent search engines from crawling it. Sadly this is often done for performance reasons, not out of any real desire to keep the content from being found–because all too often library webapps are not designed to support crawling.

I was invited to talk very briefly (10 minutes) about Linked Data at the Amsterdam meeting. I think most everyone recognizes that a successful DPLA would exist in a space where there has been years of digitization efforts in the US, with big projects like the HathiTrust and countless others going on. I wanted to talk about how the Web could be used to integrate these collections. Rather than digging into a particular Linked Data solution to the problem of synchronization, I thought I would try to highlight how libraries could learn to do web1.0 a bit better. In particular I wanted to showcase how Google Scholar abandoned OAI-PMH (a traditional library standard for integrating collections) in favor of using sitemaps and metadata embedded in HTML. I wanted to show how thoughtful use of sitemaps, a sensible robots.txt, and perhaps some Atom to publish updates, and deletes a bit more methodically can offer just the same functionality as OAI-PMH, but in a way that is aligned with the Web, and the services that are built on top of it. Digital library initiatives often go off and create their own specialized way of looking at the Web, and ignore broader trends. The nature of grant funding, and research papers often serve as an incentive for this behavior. I’ve heard rumors that there is even some NISO working group being formed to look into standardizing some sort of feed based approach to metadata harvesting. Personally I think it’s probably more important for us to use some of the standards and patterns that are already available instead of trying to define another one.

So you could say I pulled a bit of a bait and switch: instead of talking about Linked Data I really ended up talking about Search Engine Optimization. I didn’t mention RDF or SPARQL once. If anyone noticed they didn’t seem to mind too much.

I learned a lot of very useful information during the presentations–too much to really note here. But there was one conversation that really stands out after a week has passed.

Greg Crane of the Perseus Digital Library spoke about about Deep Research, and how students and researchers participate in the creation of online knowledge. At one point Greg had a slide that contained a map of the entire world, and spoke about how the scope of the DPLA can’t really be confined to the United States alone–since American society is largely made up of immigrant communities (some by choice, some not) the scope of the DPLA is in fact the entire world. I couldn’t help but think how completely audacious it was to say that the Digital Public Library of America would somehow encompass the world – similar to how brick and mortar library and museum collections can often mirror the imperialistic interests of the states that they belong to.

Original WWW graphic by Robert Cailliau

So I was relieved when Stefan Gradmann asked how Greg thought the DPLA would fit in with projects like Europeana, which are already aggregating content from Europe. I can’t exactly recall Greg’s response (update: Greg filled in some of the blanks via email), but this prompted Dan Brickley to point out that in fact it’s pretty hard to draw lines around Europe too … and more importantly the Web is a space that can unite these projects. At this point Josh Greenberg jumped in and suggested that perhaps some thoughtful linking between sites like a DPLA and Europeana could help bring them together. This all probably happened in the span of 3 or 4 minutes, but the exchange really crystallized for me that the cultural heritage community could do a whole lot better at deep linking with each other. Josh’s suggestion is particularly good, because researchers could see and act on contextual links. It wouldn’t be something hidden in a data layer that nobody ever looks at. But to do this sort of linking right we would need to share our data better with each other, and it would most likely need to be Linked Data – machine readable data with URLs at its core. I guess it’s a no-brainer that for it to succeed the DPLA needs to be aligned with the ultimate generative platform of our era: the World Wide Web. Name things with URLs, create typed links between them, and other people’s stuff.

Another thing that struck me was how Europeana really gets the linking part. Europeana is essentially a portal site, or directory of digital objects for more than 15 million items provided by hundreds of providers across Europe. You can find these objects in Europeana, but if you drill down far enough you eventually find yourself on the original site that made the object available. I agree with folks that think that perhaps the user experience of the site would be improved if the user never left Europeana to view the digital object in full. This would necessarily require harvesting a richer version of the digital object, which would be more difficult, but not impossible. There would also be an opportunity to serve as a second copy for the object, which is potentially very desirable to originating institutions for preservation purposes…lots of copies keeps stuff safe.

But even in this hypothetical scenario where the object is available in full on Europeana, I think it would still be important to link out to the originating institution that digitized the object. Linking makes the provenance of the item explicit, which will continue to be important to researchers on the Web. But perhaps more importantly it gives institutions a reason to participate in the project as a whole. Participants will see increased awareness and use of their web properties, as users wander over from Europeana. Perhaps they could even link back to Europeana, which ought to increase Europeana’s density in the graph of the web, which also should boost its relevancy ranking in search engines like Google.

Another good lesson of Europeana is that it’s not just about libraries, but also includes many archives, museums and galleries. One of my personal disappointments about the Amsterdam meeting was that Mathias Schindler of Wikimedia-Germany had to pull out at the last minute. I’ve never met him, but Mathias has had a lot to do with trying to bring the Wikipedia and Library communities together. Efforts to promote the use of Wikipedia as a platform in the Galleries, Libraries, Archives and Museums (GLAM) sector are intensifying. The pivotal role that Wikipedia has had in the Linked Data community in the form of dbpedia.org is also very significant. Earlier this year there was a meeting of various Wikipedia, dbpedia and Freebase folks at a Data Summit, where people talked about the potential for an inner hub for the various languages wikipedias to share inter-wiki links, and extracted structured metadata. I haven’t heard whether this is actually leading anywhere currently, but at the very least its a recognition that Wikipedia is itself turning into a key part of information infrastructure on the web.

So I’ve rambled on a bit at this point. Thanks for reading this far. My take-away from the Amsterdam meeting was that the DPLA needs to think about how it wants to align itself with the Web, and work with its grain … not against it. This is easier said than done. The DPLA needs to think about incentives that would give existing digital library projects practical reasons to want to be involved. This also is easier said than done. And hopefully these incentives won’t just involve getting grant money. Keeping an open mind, taking a REST here and there, and continuing to have these very useful conversations (and contests) should help shape the DPLA as a generative platform.