the digital repository marketplace

The University of Southern California recently announced its Digital Repository (USCDR) which is a joint venture between the Shoah Foundation Institute and the University of Southern California. The site is quite an impressive brochure that describes the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating their own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go through the effort (and cost) of building out one themselves. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think they are going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill-advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out various technologies for other ones. If the particular approach to cataloging built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available (what Greg Janée calls relay-supporting archives)? I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need for building sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee-based curation services to the University of California (which USC is not part of).

Incidentally the service costs of USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed as $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem to indicate the raw storage is a bit less than Merritt, at $840.00 per TB per year ($70 times 12 months). But what the preservation services are, and how the 20 year cost would be applied over a growing collection of content, is unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside their USCDR CDN. In that case the $70/TB measures up quite nicely with a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does USCDR really think it can compete in the cloud storage space?
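To make the comparison a little more concrete, here is a small sketch of the arithmetic using only the prices quoted above. Spreading the 20 year preservation fee evenly across 20 years is my own assumption; the price lists don’t say how that fee is actually applied.

```python
# Prices quoted in the post, in US dollars per terabyte.
MERRITT_PER_TB_YEAR = 1040.00
USCDR_DISK_PER_TB_MONTH = 70.00
USCDR_PRESERVATION_PER_TB_20_YEARS = 1000.00


def annual_cost(per_tb_month, terabytes=1):
    """Convert a monthly per-terabyte price into a yearly cost."""
    return per_tb_month * 12 * terabytes


uscdr_disk_year = annual_cost(USCDR_DISK_PER_TB_MONTH)

# Assumption: the 20 year preservation fee is spread evenly over 20 years.
uscdr_preservation_year = USCDR_PRESERVATION_PER_TB_20_YEARS / 20

print("Merritt:            $%.2f per TB per year" % MERRITT_PER_TB_YEAR)
print("USCDR disk access:  $%.2f per TB per year" % uscdr_disk_year)
print("USCDR preservation: $%.2f per TB per year" % uscdr_preservation_year)
```

Under that assumption the two USCDR line items together come to $890 per TB per year, still under the Merritt figure, but the comparison is only as good as the guess about what each price covers.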

Based on the current pricing models, where there are no access driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in, and not much data comes back out for access.


Via Ivan Herman I found out about this excellent, year-old presentation from Hans Rosling at the World Bank about the importance of sharing our data.

Database hugging disorder (DbHd) is certainly a problem. But it strikes me that, paradoxically, it’s the love and care that people put into their datasets that makes them so valuable to share.

If nobody was hugging their data, then nobody would care about setting it free on the Web.

Spotify, Rdio and Albums of the Year

I’ve recently started listening to streamed music a bit more on Rdio right around when Spotify launched in the US. I noticed that some albums that I might want to listen to weren’t available for streaming on Rdio, so I got it into my head that I’d like to compare the coverage of Rdio and Spotify–but I wasn’t sure what albums to check. Earlier this week I remembered that since 2007 Alf Eaton has put together Albums of the Year (AOTY) lists, where he has compiled the top albums of the year from a variety of sources. I think Alf’s tastes tend a bit toward independent music, which suits me fine because that’s the kind of stuff I listen to. So …

The AOTY HTML is eminently readable (thanks Alf), so I wrote a scraper to pull down the albums and store them as a JSON file. With this in hand I was able to use the Rdio and Spotify web services to look up each album, and record whether it was found, and whether it was streamable in the United States, which I saved off as another JSON file. So, of the 7,406 unique albums on AOTY, 32% were available on Spotify, and 46% on Rdio (an earlier version of this post said 60% for Spotify, because of a bug that Alf spotted in the Spotify lookup code). I put the data in a public Fusion Table if you want to look at the results. If you notice anomalies please let me know. And speaking of anomalies…
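The tallying step is simple enough to sketch. The record layout below (one dictionary per album with a true/false flag per service) and the sample albums are made up for illustration; they are not the actual contents of the JSON file.

```python
import json

# Hypothetical lookup results: one record per album, one flag per service.
results_json = """
[
  {"artist": "Artist A", "album": "Album 1", "spotify": true,  "rdio": true},
  {"artist": "Artist B", "album": "Album 2", "spotify": false, "rdio": true},
  {"artist": "Artist C", "album": "Album 3", "spotify": false, "rdio": false},
  {"artist": "Artist D", "album": "Album 4", "spotify": false, "rdio": false}
]
"""


def coverage(records, service):
    """Return the percentage of records flagged as streamable on a service."""
    if not records:
        return 0.0
    found = sum(1 for record in records if record.get(service))
    return 100.0 * found / len(records)


records = json.loads(results_json)
for service in ("spotify", "rdio"):
    print("%s: %.0f%%" % (service, coverage(records, service)))
```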

Caveat Lector!

I was kind of surprised that Teen Dream by Beach House (which was mentioned on 27 AOTY lists in 2010) wasn’t showing up as being streamable on Spotify. So I asked on Twitter and Google+ if people in the US saw it as streamable. The results were kind of surprising. People from Michigan, Illinois, Texas, New York and the District of Columbia confirmed what the Web Service told me, that the album was not streamable. But then there were people in Massachusetts and California who reported it as streamable. What’s more, premium membership didn’t seem to affect the availability: the Massachusetts subscriber had a free account, and the Californian had a premium account, and both could stream it. So take the numbers above with a boulder sized grain of salt. It’s not clear what’s going on.

The Spotify search API does not require authentication, and it returns nice results that include all the territories where the content is available. Rdio’s search API does require authentication, which apparently is used to tie your account to a geographic region, which in turn affects whether the API will say the album is streamable or not.
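When a search result carries its own list of territories, the streamability check can be done locally. The sketch below assumes each album record has an "availability" object holding a space-separated string of country codes; that layout is an assumption for illustration, so check the service’s documentation for the real response format.

```python
def streamable_in(album, country="US"):
    """Return True if the album record lists the given country code.

    Assumes a hypothetical record shape:
    {"availability": {"territories": "GB SE US"}}
    """
    territories = album.get("availability", {}).get("territories", "")
    return country in territories.split()


# Made-up records, just to exercise the function.
albums = [
    {"name": "Album 1", "availability": {"territories": "GB SE US"}},
    {"name": "Album 2", "availability": {"territories": "GB SE"}},
]

for album in albums:
    print(album["name"], streamable_in(album))
```

A service like Rdio that decides availability from the authenticated account’s region can’t be checked this way, because the region never shows up in the response.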

So anyway, it was interesting to play around with the APIs a bit. It didn’t seem like the service agreements for the various APIs prevented this sort of exploration. I like the fact that Rdio is web based (go Django), and doesn’t require a proprietary client to use. But it looks like the coverage in Spotify is better. I’m not sure I will make any changes. If anyone has any information about whether streamability of content can vary within the United States I would be interested to hear it. This rights stuff is hard. Given the complexity of managing the rights to this content I’m kind of amazed that Rdio and Spotify exist at all…and I’m very glad that they do.

Update: it turns out that the folks who saw Teen Dream as available had it in their personal collection (Spotify is smart like that), which is why Spotify said it was available to them. So, no crazy state-by-state rights issues need to be entertained :-)

GoodReads microdata

I’m not sure how long it has been there, but I just happened to notice that GoodReads (the popular social networking site for book lovers to share what they are reading and have read) has implemented HTML5 Microdata to make metadata for books available in the HTML of their pages. GoodReads has chosen to use the Book type from the schema.org vocabulary, most likely because the big three search engines (Google, Bing and Yahoo) announced that they will use the metadata to enhance their search results. So web publishers are motivated to publish metadata in their pages, not because it’s the “right” thing to do, but because they want to drive traffic to their websites.

If you are new to HTML5 Microdata, and what it means for books, check out Eric Hellman’s recent post Spoonfeeding Library Data to Search Engines. And if you are generally curious about HTML5 Microdata, the chapter in Mark Pilgrim’s Dive into HTML5 is really quite good.

But Microdata and schema.org are not just good for the search engines, they are actually good for the Web ecosystem, and for hackers like you and me. For example, go to the page for Alice in Wonderland:

If you view source on the page, and search for itemtype or itemprop you’ll see the extra Microdata markup. The latest HTML5 specification has a section on how to serialize microdata as JSON, and the processing model is straightforward enough for me to write a parser on top of html5lib in less than 200 lines of Python. So this means you can:

from urllib.request import urlopen

import microdata

# URL of the GoodReads book page
url = ""
items = microdata.get_items(urlopen(url))

# print the first item found on the page as JSON
print(items[0].json())

and you’ll see:

{
  "numberOfPages": [...],
  "isbn": [...],
  "name": [
    "Alice's Adventures in Wonderland and Through the Looking-Glass"
  ],
  "author": [
    {
      "url": [...],
      "type": ""
    }
  ],
  "image": [...],
  "inLanguage": [...],
  "ratingValue": [...],
  "ratingCount": [
    "64,628 ratings"
  ],
  "bookFormatType": [...],
  "type": ""
}

If you have spent a bit of time writing screenscrapers in the past, this should make your jaw drop a bit. What’s more they’ve also added Microdata to the search results page, so you can see metadata for all the books in the results, for example using Google’s Rich Snippets Testing Tool.
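To give a feel for how little machinery is involved, here is a much-simplified sketch of pulling itemprop values out of a page using only Python’s standard library. It is not the html5lib-based parser mentioned above: it flattens everything into one dictionary, and ignores nested items, itemref, itemid and most of the rules in the specification. The sample HTML is made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"meta", "img", "link", "br", "hr", "input"}
# Elements whose property value comes from an attribute, not their text.
VALUE_ATTR = {"a": "href", "link": "href", "img": "src", "meta": "content"}


class FlatMicrodata(HTMLParser):
    """Collect itemprop name -> list of values, ignoring item nesting."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._stack = []  # (tag, itemprop name or None, text parts)

    def _add(self, name, value):
        self.props.setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name and tag in VALUE_ATTR:
            self._add(name, attrs.get(VALUE_ATTR[tag]) or "")
            name = None  # value already taken from the attribute
        if tag not in VOID:
            self._stack.append((tag, name, []))

    def handle_data(self, data):
        # Text belongs to every open element that declares an itemprop.
        for _, name, parts in self._stack:
            if name:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in [t for t, _, _ in self._stack]:
            return
        # Pop back to the matching open tag, tolerating unclosed elements.
        while self._stack:
            open_tag, name, parts = self._stack.pop()
            if name:
                self._add(name, "".join(parts).strip())
            if open_tag == tag:
                break


sample = """
<div itemscope itemtype="http://schema.org/Book">
  <h1 itemprop="name">Alice's Adventures in Wonderland and Through the Looking-Glass</h1>
  <span itemprop="ratingCount">64,628 ratings</span>
  <meta itemprop="inLanguage" content="English">
</div>
"""

parser = FlatMicrodata()
parser.feed(sample)
parser.close()
print(parser.props)
```

A real parser also has to build a tree of items rather than one flat dictionary, which is where most of the remaining lines go.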

Funnily enough, while I was writing this blog post, over in the #code4lib IRC chat room Chris Beer brought up the fact that some Blacklight developers were concerned that <link rel="unapi-server"> wasn’t valid HTML5. Chris was wondering if anyone was interested in trying to register “unapi-server” with the WHATWG…


Issues of “valid” HTML5 aside, this discussion highlighted for me just how far the world of metadata on the Web has advanced since a small group of library hackers worked on unAPI. The use of HTML5 Microdata and schema.org by Google, Bing and Yahoo, and the use of RDFa by Facebook are great examples of some mainstream solutions to what some of us were trying to achieve with unAPI. Seeing sites like GoodReads implement Microdata, and announcements like Opera support for Microdata are good reminders that the library software development community is best served by paying attention to mainstream solutions, as they become available, even if they eclipse homegrown stopgap solutions.

It is somewhat problematic that Facebook has aligned with RDFa and the Open Graph Protocol, while Google, Bing and Yahoo have aligned with HTML5 Microdata and schema.org. GoodReads has implemented both, incidentally. I heard a rumor that Facebook was invited to the table and declined. I have no idea if that is actually true. I also have heard a rumor that Ian Hickson of Google wrote up the Microdata spec in a weekend because he hates RDFa. I don’t know if that’s actually true either. The company and personality rivalries aside, if you are having trouble deciding which one to more fully support, try writing a program to parse RDFa and Microdata. It will probably help clarify some things…

meeting notes and a manifesto

A few weeks ago I was in sunny (mostly) Palo Alto, California at a Linked Data Workshop hosted by Stanford University, and funded by the Council on Library and Information Resources. It was an invite-only event, with very little footprint on the public Web, and an attendee list that was distributed via email with “Confidential PLEASE do not re-distribute” across the top. So it feels more than a little bit foolhardy to be writing about it here, even in a limited way. But I was going to the meeting as an employee of the Library of Congress, so I feel a bit responsible for making some kind of public notes/comments about what happened. I suspect this might impact future invitations to similar events, but I can live with that :-)

A week is a long time to talk about anything…and Linked Data is certainly no exception. The conversation was buoyed by the fact that it was a pretty small group of around 30 folks from a wide variety of institutions including: Stanford University, Bibliotheca Alexandrina, Bibliothèque nationale de France, Aalto University, Emory University, Google, National Institute of Informatics, University of Virginia, University of Michigan, Deutschen Nationalbibliothek, British Library, Det Kongelige Bibliotek, California Digital Library and Seme4. So the focus was definitely on cultural heritage institutions, and specifically what Linked Data can offer them. There was a good mix of people in attendance: from some who were relatively new to Linked Data and RDF, to others who had been involved in the space before the terms Linked Data, RDF or the Web existed…and there were people like me who were somewhere in between.

A few of us took collaborative notes in PiratePad, which sort of tapered off as the week progressed and more discussion happened. After some initial lightning-talk-style presentations from attendees on Linked Data projects they were involved in, we spent most of the rest of the week breaking up into 4 groups, to discuss various issues, and then joining back up with the larger group for discussion. If things go as planned you can expect to see a report from Stanford that covers the opportunities and challenges that Linked Data offers the cultural heritage sector, which were raised during these sessions. I think it’ll be a nice complement to the report that the W3C Library Linked Data Incubator Group is preparing, which recently became available as a draft.

One thing that has stuck with me a few weeks later, is the continued need in the cultural heritage Linked Data sector for reconciliation services, that help people connect up their resources with appropriate resources that other folks have published. If you work for a large organization, there is often even a need for reconciliation services within the enterprise. For example the British Library reported that it has some 300 distinct data systems within the organization, that sometimes need to be connected together. Linking is the essential ingredient, the sine qua non of Linked Data. Linking is what makes Linked Data and the RDF data model different. It helps you express the work you may have done in joining up your data with other people’s data. It’s the 4th design pattern in Tim Berners-Lee’s Linked Data Design Issues:

Include links to other URIs. so that they can discover more things.

But, expressing the links is the easy part…creating them where they do not currently exist is harder.

Fortunately, Hugh Glaser was on hand to talk about the role that sameas.org plays in the Linked Data space, and how the RKBexplorer managed to reconcile authors’ names across institutional repositories. He also described some work with the British Museum linking up two different data silos about museum objects, to provide unified web views for those objects. Hugh, if you are reading this, can you comment with a link to this work you did, and how it surfaces on the British Museum website?

Similarly Stefano Mazzocchi talked about how Google’s tools like Freebase Suggest and their WebID app can make it easy to integrate Freebase’s identity system into your applications. If you are building a cataloging tool, take a serious look at what using something like Freebase Suggest (a jQuery plugin) can offer your application. In addition, as part of the Google Refine data cleanup tool, Google has created an API for data reconciliation services, which other service providers could supply. Stefano indicated that Google was considering releasing the code behind this reconciliation service, and stressed that it is useful for the community to make more of these reconciliation services available, to help others link their data with other people’s data. It seems obvious I guess, but I was interested to hear that Google themselves are encouraging the use of Freebase IDs to join up data within their enterprise.
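At its core a reconciliation service takes a name you have and proposes identifiers somebody else has published for it. The sketch below shows that idea with fuzzy string matching from Python’s standard library. It is not the Google Refine reconciliation API or how Freebase does it; the authority list and its example.org URIs are invented for illustration.

```python
import difflib

# Hypothetical authority file: preferred label -> URI.
AUTHORITY = {
    "Lewis Carroll": "http://example.org/authority/1",
    "Charles Dickens": "http://example.org/authority/2",
    "Jane Austen": "http://example.org/authority/3",
}


def reconcile(name, authority, limit=3, cutoff=0.6):
    """Return (label, uri, score) candidates for a name, closest match first."""
    query = name.lower()
    by_lowercase = {label.lower(): label for label in authority}
    matches = difflib.get_close_matches(
        query, list(by_lowercase), n=limit, cutoff=cutoff
    )
    candidates = []
    for match in matches:
        label = by_lowercase[match]
        # Same comparison direction that get_close_matches uses internally.
        score = difflib.SequenceMatcher(None, match, query).ratio()
        candidates.append((label, authority[label], round(score, 2)))
    return candidates


# A misspelled name still finds its candidate; a cataloger confirms the link.
for label, uri, score in reconcile("Lewis Carrol", AUTHORITY):
    print(label, uri, score)
```

Real services match on much more than the string (dates, types, related entities), but the shape of the exchange is the same: a query goes in, ranked candidates with identifiers come out, and a person or a threshold decides which link to keep.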

Almost a year ago Leigh Dodds created a similar API layer for data that is stored in the Talis Platform. Now that the British National Bibliography is being made available in a Talis Store, it might be possible to use Leigh’s code to put a reconciliation service on top of that data. The caveat is that not all of the BNB is currently available there. By the way, hats off to the British Library for iteratively making that data available, and getting feedback early, instead of waiting for it all to be “done”…which of course it never will be, if they are successful at integrating Linked Data into their existing data work flows.

If you squint right, I think it’s also possible to look at the VIAF AutoSuggest service as a type of reconciliation service. It would be useful to have a similar service over the Library of Congress Subject Headings at id.loc.gov. Having similar APIs for these services could be a handy thing as we begin to build new descriptive cataloging tools that take advantage of these pools of data. But I don’t think it’s necessarily essential, as the APIs could be orchestrated in a more ad hoc, web2.0 mashup style. I imagine I’m not alone in thinking we’re now at the stage when we can start building new cataloging tools that take advantage of these data services. Along those lines Rachel Frick had an excellent idea to try to enhance collection building applications like Omeka and Archives Space to take advantage of reconciliation services under the covers. Adding a bit of suggest-like functionality to these tools could smooth out the description of resources that libraries, museums and archives are putting online. I think the Omeka-LCSH plugin is a good example of steps in this direction.

One other thing that stuck with me from the workshop is that the new (dare I say buzzwordy) focus on Library Linked Data is somewhat frightening to library data professionals. There is a lot of new terminology, and issues to work out (as the Stanford report will hopefully highlight). Some of this scariness is bound up with the Resource Description and Access sea change that is underway. But crufty as they are, data systems built around MARC have served the library community well over the past 30 years. Some of the motivations for Linked Data are specifically for Linked Open Data, where the linking isn’t as important as the openness. The LOD-LAM summit captured some of this spirit in the 4 Star Classification Scheme for Linked Open Cultural Metadata, which focuses on licensing issues. There was a strong undercurrent at the Stanford meeting about licensing issues. The group recognized that explicit licensing is important, but it was intentionally kept out of scope of most of the discussion. Still I think you can expect to see some of the heavy hitters from this group exert some influence in this arena to help bring clarity to licensing issues around our data. I think that some of the ideas of opening up the data, and disrupting existing business workflows around the data can seem quite scary to those who have (against a lot of odds) gotten them working. I’m thinking of the various cooperative cataloging efforts that allow work to get done in libraries today.

Truth be told, I may have inspired some of the “fear” around Linked Data by suggesting that the Stanford group work on a manifesto to rally around, much like what the Agile Manifesto did for the Agile software development movement. I don’t think we had come to enough consensus to really get a manifesto together, but on the last day the sub-group I was in came up with a straw man (near the bottom of the piratepad notes) to throw darts at. Later on (on my own) I kind of wordsmithed them into a briefer list. I’ll conclude this blog post by including the “manifesto” here not as some official manifesto of the workshop (it certainly is not), but more as a personal manifesto, that I’d like to think has been guiding some of the work I have been involved in at the Library of Congress over the past few years:

Manifesto for Linked Libraries

We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:

  • Publishing data on the Web for discovery over preserving it in dark archives.
  • Continuous improvement of data over waiting to publish perfect data.
  • Semantically structured data over flat unstructured data.
  • Collaboration over working alone.
  • Web standards over domain-specific standards.
  • Use of open, commonly understood licenses over closed, local licenses.

That is, while there is value in the items on the right, we value the items on the left more.

The manifesto is also on a publicly editable Google Doc; so if you feel the need to add or comment please have a go. I was looking for an alternative to “Linked Libraries” since it was not inclusive of archives and museums … but I didn’t spend much time obsessing on it. One of the selfish motivations for publishing the manifesto here was to capture it at a particular point in time where I was happy with it :-)

the dpla as a generative platform

Last week I had the opportunity to attend a meeting of the Digital Public Library of America (DPLA) in Amsterdam. Several people have asked me why an American project was meeting in Amsterdam. In large part it was an opportunity for the DPLA to reach out to, and learn from European projects such as Europeana, LOD2 and Wikimedia Germany–or as the agenda describes:

The purpose of the May 16 and 17 expert working group meeting, convened with generous support from the Open Society Foundations, is to begin to identify the characteristics of a technical infrastructure for the proposed DPLA. This infrastructure ideally will be interoperable with international efforts underway, support global learning, and act as a generative platform for undefined future uses. This workshop will examine interoperability of discovery, use, and deep research in existing global digital library infrastructure to ensure that the DPLA adopts best practices in these areas. It will also serve to share information and foster exchange among peers, to look for opportunities for closer collaboration or harmonization of existing efforts, and to surface topics for deeper inquiry as we examine the role linked data might play in a DPLA.

Prior to the meeting I read the DPLA Concept Note, watched the discussion list and wiki activity – but the DPLA still seemed somewhat hard to grasp to me. The thing I learned at the meeting in Amsterdam is that this nebulousness is by design–not by accident. The DPLA steering committee aren’t really pushing a particular solution that they have in mind. In fact, there doesn’t seem to be a clear consensus about what problem they are trying to solve. Instead the steering committee seem to be making a concerted effort to keep an open, beginner’s mind about what a Digital Public Library of America might be. They are trying to create conversations around several broad topic areas or work-streams: content and scope, financial/business models, governance, legal issues, and technical aspects. The recent meeting in Amsterdam focused on the technical aspects work-stream–in particular, best practices for data interoperability on the Web. The thought being that perhaps the DPLA could exist in some kind of distributed relationship with existing digital library efforts in the United States–and possibly abroad. Keeping an open mind in situations like this takes quite a bit of effort. There is often an irresistible urge to jump to particular use cases, scenarios or technical solutions, for fear of seeming ill informed or rudderless. I think the DPLA should be commended for creating conversations at this formative stage, instead of solutions in search of a problem.

I hadn’t noticed the phrase “generative platform” in the meeting announcement above until I began this blog post…but in hindsight it seems very evocative of the potential of the DPLA. At their best, digital libraries currently put content on the Web, so that researchers can discover it via search engines like Google, Bing, Baidu, etc. Researchers discover a pool of digital library content while performing a query in a search engine. Once they’ve landed in the digital library webapp they can wander outwards to related resources, and perhaps do a more nuanced search within the scoped context of the collection. But in practice this doesn’t happen all that often. I imagine many institutions digitize content that actually never makes it onto the Web at all. And when it does make it onto the Web it is often deep-web content hiding behind a web form, un-discoverable by crawlers. Or worse, the content might be actively made invisible by using a robots.txt to prevent search engines from crawling it. Sadly this is often done for performance reasons, not out of any real desire to keep the content from being found–because all too often library webapps are not designed to support crawling.
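Blocking crawlers wholesale isn’t the only option when performance is the worry. As a sketch, the robots.txt below (made up for a hypothetical example.org digital library) keeps crawlers out of the expensive search form and asks them to slow down, while leaving the item pages crawlable. Python’s standard library can check what a crawler would be allowed to fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers away from the costly search
# endpoint and ask them to pace themselves, but leave item pages open.
robots_txt = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

item_ok = parser.can_fetch("Googlebot", "http://example.org/item/123")
search_ok = parser.can_fetch("Googlebot", "http://example.org/search?q=maps")

print("item page crawlable:", item_ok)
print("search form crawlable:", search_ok)
```

Crawl-delay is not honored by every crawler, so it is a polite request rather than a guarantee. The point is only that robots.txt can be a scalpel rather than a wall.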

I was invited to talk very briefly (10 minutes) about Linked Data at the Amsterdam meeting. I think most everyone recognizes that a successful DPLA would exist in a space where there have been years of digitization efforts in the US, with big projects like the HathiTrust and countless others going on. I wanted to talk about how the Web could be used to integrate these collections. Rather than digging into a particular Linked Data solution to the problem of synchronization, I thought I would try to highlight how libraries could learn to do web1.0 a bit better. In particular I wanted to showcase how Google Scholar abandoned OAI-PMH (a traditional library standard for integrating collections) in favor of using sitemaps and metadata embedded in HTML. I wanted to show how thoughtful use of sitemaps, a sensible robots.txt, and perhaps some Atom to publish updates and deletes a bit more methodically can offer just the same functionality as OAI-PMH, but in a way that is aligned with the Web, and the services that are built on top of it. Digital library initiatives often go off and create their own specialized way of looking at the Web, and ignore broader trends. The nature of grant funding, and research papers often serve as an incentive for this behavior. I’ve heard rumors that there is even some NISO working group being formed to look into standardizing some sort of feed based approach to metadata harvesting. Personally I think it’s probably more important for us to use some of the standards and patterns that are already available instead of trying to define another one.
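A sitemap with last-modified dates covers much of what an incremental harvest needs: a harvester fetches it, compares the dates with what it saw last time, and re-fetches only the pages that changed. Here is a minimal sketch of generating one with Python’s standard library. The example.org URLs and dates are made up, and a real collection would page through its own database rather than a hard-coded list.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Build sitemap XML from (url, last_modified) pairs."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in records:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical item pages and the dates they last changed.
records = [
    ("http://example.org/item/1", "2011-05-20"),
    ("http://example.org/item/2", "2011-05-23"),
]

sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```

Deletes are the part a plain sitemap doesn’t express well, which is where an Atom feed of changes could fill the gap.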

So you could say I pulled a bit of a bait and switch: instead of talking about Linked Data I really ended up talking about Search Engine Optimization. I didn’t mention RDF or SPARQL once. If anyone noticed they didn’t seem to mind too much.

I learned a lot of very useful information during the presentations–too much to really note here. But there was one conversation that really stands out after a week has passed.

Greg Crane of the Perseus Digital Library spoke about Deep Research, and how students and researchers participate in the creation of online knowledge. At one point Greg had a slide that contained a map of the entire world, and spoke about how the scope of the DPLA can’t really be confined to the United States alone–since American society is largely made up of immigrant communities (some by choice, some not) the scope of the DPLA is in fact the entire world. I couldn’t help but think how completely audacious it was to say that the Digital Public Library of America would somehow encompass the world – similar to how brick and mortar library and museum collections can often mirror the imperialistic interests of the states that they belong to.

Original WWW graphic by Robert Cailliau

So I was relieved when Stefan Gradmann asked how Greg thought the DPLA would fit in with projects like Europeana, which are already aggregating content from Europe. I can’t exactly recall Greg’s response (update: Greg filled in some of the blanks via email), but this prompted Dan Brickley to point out that in fact it’s pretty hard to draw lines around Europe too … and more importantly the Web is a space that can unite these projects. At this point Josh Greenberg jumped in and suggested that perhaps some thoughtful linking between sites like a DPLA and Europeana could help bring them together. This all probably happened in the span of 3 or 4 minutes, but the exchange really crystallized for me that the cultural heritage community could do a whole lot better at deep linking with each other. Josh’s suggestion is particularly good, because researchers could see and act on contextual links. It wouldn’t be something hidden in a data layer that nobody ever looks at. But to do this sort of linking right we would need to share our data better with each other, and it would most likely need to be Linked Data – machine readable data with URLs at its core. I guess it’s a no-brainer that for it to succeed the DPLA needs to be aligned with the ultimate generative platform of our era: the World Wide Web. Name things with URLs, create typed links between them and other people’s stuff.
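In data terms a typed link is just three URLs: the thing, the kind of relationship, and the other thing. As a small sketch, the code below writes a couple of such links out as N-Triples. The item and person URLs are invented example.org placeholders; only the two predicate URLs (owl:sameAs and the Dublin Core creator term) are real vocabulary terms.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical links from one site's item to other people's stuff.
links = [
    ("http://example.org/dpla/item/1", OWL_SAME_AS,
     "http://example.org/europeana/item/9"),
    ("http://example.org/dpla/item/1", DCTERMS_CREATOR,
     "http://example.org/people/42"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URL triples as N-Triples."""
    return "\n".join("<%s> <%s> <%s> ." % triple for triple in triples)


print(to_ntriples(links))
```

The same links can also be surfaced as ordinary hyperlinks in the HTML, which is what makes them something a researcher can see and follow rather than something buried in a data layer.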

Another thing that struck me was how Europeana really gets the linking part. Europeana is essentially a portal site, or directory of digital objects for more than 15 million items provided by hundreds of providers across Europe. You can find these objects in Europeana, but if you drill down far enough you eventually find yourself on the original site that made the object available. I agree with folks that think that perhaps the user experience of the site would be improved if the user never left Europeana to view the digital object in full. This would necessarily require harvesting a richer version of the digital object, which would be more difficult, but not impossible. There would also be an opportunity to serve as a second copy for the object, which is potentially very desirable to originating institutions for preservation purposes…lots of copies keeps stuff safe.

But even in this hypothetical scenario where the object is available in full on Europeana, I think it would still be important to link out to the originating institution that digitized the object. Linking makes the provenance of the item explicit, which will continue to be important to researchers on the Web. But perhaps more importantly it gives institutions a reason to participate in the project as a whole. Participants will see increased awareness and use of their web properties, as users wander over from Europeana. Perhaps they could even link back to Europeana, which ought to increase Europeana’s density in the graph of the web, which also should boost its relevancy ranking in search engines like Google.

Another good lesson of Europeana is that it’s not just about libraries, but also includes many archives, museums and galleries. One of my personal disappointments about the Amsterdam meeting was that Mathias Schindler of Wikimedia-Germany had to pull out at the last minute. I’ve never met him, but Mathias has had a lot to do with trying to bring the Wikipedia and Library communities together. Efforts to promote the use of Wikipedia as a platform in the Galleries, Libraries, Archives and Museums (GLAM) sector are intensifying. The pivotal role that Wikipedia has had in the Linked Data community in the form of dbpedia is also very significant. Earlier this year there was a meeting of various Wikipedia, dbpedia and Freebase folks at a Data Summit, where people talked about the potential for an inner hub for the various language Wikipedias to share inter-wiki links and extracted structured metadata. I haven’t heard whether this is actually leading anywhere currently, but at the very least it’s a recognition that Wikipedia is itself turning into a key part of information infrastructure on the web.

So I’ve rambled on a bit at this point. Thanks for reading this far. My take-away from the Amsterdam meeting was that the DPLA needs to think about how it wants to align itself with the Web, and work with its grain … not against it. This is easier said than done. The DPLA needs to think about incentives that would give existing digital library projects practical reasons to want to be involved. This also is easier said than done. And hopefully these incentives won’t just involve getting grant money. Keeping an open mind, taking a REST here and there, and continuing to have these very useful conversations (and contests) should help shape the DPLA as a generative platform.

a bit about scruffiness

Terry Winograd at CHI 2006 (by boltron)

I just finished reading Understanding Computers and Cognition by Terry Winograd and Fernando Flores and have to jot down some quick notes & quotes before I jump in and start reading it again … yeah, it’s that good.

Having gone on Rorty and Wittgenstein kicks recently, I was really happy to find this book while browsing the Neats vs Scruffies Wikipedia article a few months ago. It seems to combine this somewhat odd interest I have in pragmatism and writing software. While it was first published in 1986, it’s still very relevant today, especially in light of what the more Semantic Web heavy Linked Data crowd are trying to do with ontologies and rules. Plus it’s written in clear and accessible language, which is perfect for the arm-chair compsci/philosopher-type … so it’s ideal for a dilettante like me.

While only 207 pages long, the breadth of the book is kind of astounding. The philosophy of Heidegger figures prominently…in particular his ideas about thrownness, breakdowns and readiness-to-hand, which emphasize the importance of concernful activity over rationalist representations of knowledge.

Heidegger insists that it is meaningless to talk about the existence of objects and their properties in the absence of concernful activity, with its potential for breaking down.

The work of the biologist Humberto Maturana forms the second part of the theoretical foundation of the book. The authors draw upon Maturana’s ideas about structural coupling to emphasize the point that:

The most successful designs are not those that try to fully model the domain in which they operate, but those that are ‘in alignment’ with the fundamental structure of that domain, and that allow for modification and evolution to generate new structure coupling.

And the third leg in the chair is John Searle’s notion of speech acts which emphasizes the role of commitment and action, or the social role of language in dealing with meaning.

Words correspond to our intuition about “reality” because our purposes in using them are closely aligned with our physical existence in a world and our actions within it. But the coincidence is the result of our use of language within a tradition … our structure coupling within a consensual domain. Language and cognition are fundamentally social … our ability to think and to give meaning to language is rooted in our participation in a society and a tradition.

Fernando Flores (by Sebastián Piñera)

So the really wonderful thing that this book does here is take this theoretical framework (Heidegger, Maturana & Searle) and apply it to the design of computer software. As the preface makes clear, the authors largely wrote this book to dismantle popular (at the time) notions that computers would “think” like humans. While much of this seems anachronistic today, we still see similar thinking in some of the ways that the Semantic Web is described, where machines will understand the semantics of data, using ontologies that model the “real world”.

There is still a fair bit of talk about getting the ontologies just right so that they model the world properly, and then running rule driven inference engines over the instance data, to “learn” more things. But what is often missing is a firm idea of what actual tools will use this new data. How will these tools be used by people acting in a particular domain? Like The Modeler, practitioners in the Linked Data and Semantic Web community often jump to modeling a domain, and trying to get it to match “reality” before understanding what the field of activity we want to support is…what we are trying to have the computer help us do … what new conversations we want the computer to enable with other people.

In creating tools we are designing new conversations and connections. When a change is made, the most significant innovation is the modification of the conversation structure, not the mechanical means by which the conversation is carried out. In making such changes we alter the overall pattern of conversation, introducing new possibilities or better anticipating breakdowns in the previously existing ones … When we are aware of the real impact of design we can more consciously design conversation structures that work.

It’s important to note here that these are conversations between people, who are acting in some domain, and using computers as tools. It’s the social activity that grounds the computer software, and not some correspondence that the software shares with reality or truth. I guess this is a subtle point, and I’m not doing a terrific job of elucidating it here, but if your interest is piqued definitely pick up a copy of the book. Over the past 5 years I’ve been lucky to work with several people who intuitively understand how important the social setting and alignment are to successful software development–but it’s nice to have the theoretical tools as ballast when the weather gets rough.

Another really surprising part of the book (given that it was written in 1986) is the foreshadowing of the agile school of programming:

… the development of any computer-based system will have to proceed in a cycle from design to experience and back again. It is impossible to anticipate all of the relevant breakdowns and their domains. They emerge gradually in practice. System development methodologies need to take this as a fundamental condition of generating the relevant domains, and to facilitate it through techniques such as building prototypes early in the design process and applying them in situations as close as possible to those in which they will eventually be used.

Compare that with the notion of iterative development that’s now prevalent in software development circles. I guess it shouldn’t be that surprising since the roots of iterative development extend back quite a ways. But still, it was pretty eerie seeing how on target Winograd and Flores still are, particularly in a field like computing, which has changed so rapidly in the last 25 years.

update: Kendall Clark has an interesting post that addresses some of the concerns about semantic web technologies.
update: Ryan Shaw recommended some more reading material in this vein.

DOIs as Linked Data

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically what this means is that in all the places in the scholarly publishing ecosystem where DOIs are present (caveat below), it’s now possible to use the Web to retrieve metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:

doi:10.1038/171737a0

you can use the DOI to construct a URL:

http://dx.doi.org/10.1038/171737a0

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

curl --location --header "Accept: text/turtle" http://dx.doi.org/10.1038/171737a0

At which point you will get back some Turtle flavored RDF that looks like:

<> a <> ; <> “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid” ; <> <>, <> ; <> “10.1038/171737a0” ;
<> “738” ; <> “737” ; <> “171” ; <> “1953-04-25Z”^^<> ; <> “10.1038/171737a0” ; <> <> ; <> “Nature Publishing Group” ; <> “10.1038/171737a0” ; <> “738” ; <> “737” ; <> “171” ; <> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .
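
For anyone who would rather do this from a script than from curl, here is a minimal sketch of the same content negotiation in Python, using only the standard library. The resolver address (http://dx.doi.org/) and the function names are my assumptions for illustration, and the DOI is the one that shows up in the Turtle above.

```python
from urllib.request import Request, urlopen

# Assumption: DOIs are resolved by appending them to this base URL.
RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi, resolver=RESOLVER):
    """Turn a DOI such as 'doi:10.1038/171737a0' into a resolvable URL."""
    if doi.lower().startswith("doi:"):
        doi = doi[4:]  # drop the optional 'doi:' scheme prefix
    return resolver + doi


def build_request(doi, media_type="text/turtle"):
    """Build an HTTP GET request that asks for RDF instead of HTML."""
    return Request(doi_to_url(doi), headers={"Accept": media_type})


def fetch_metadata(doi, media_type="text/turtle"):
    """Perform the request; urlopen follows redirects like curl --location."""
    with urlopen(build_request(doi, media_type)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # This needs network access, so it only runs when called as a script.
    print(fetch_metadata("doi:10.1038/171737a0"))
```

The only real trick is the Accept header: the same URL can hand a browser an HTML landing page and hand this script Turtle.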

snac hacks

A few months ago Brian Tingle posted some exciting news that the Social Networks and Archival Context (SNAC) project was releasing the data that sits behind their initial prototype:

As a part of our work on the Social Networks and Archival Context Project, the SNAC team is please to release more early results of our ongoing research.

A property graph of correspondedWith and associatedWith relationships between corporate, personal, and family identities is made available under the Open Data Commons Attribution License in the form of a graphML file. The graph expresses 245,367 relationships between 124,152 named entities.

The graphML file, as well as the scripts to create and load a graph database from EAC or graphML, are available on google code [5]

We are still researching how to map from the property graph model to RDF, but this graph processing stack will likely power the interactive visualization of the historical social networks we are developing.

The SNAC project has aggregated archival finding aid data for manuscript collections at the Library of Congress, Northwest Digital Archives, Online Archive of California and Virginia Heritage. They then used authority control data from NACO/LCNAF, the Getty Union List of Artist Names Online (ULAN) and VIAF to knit these archival finding aids together using the Encoded Archival Context – Corporate bodies, Persons, and Families (EAC-CPF) standard.

I wrote about SNAC here about 9 months ago, and how much potential there is in the idea of visualizing archival collections across institutions, along the axis of identity. I had also privately encouraged Brian to look into releasing some portion of the data that is driving their prototype. So when Brian delivered I felt some obligation to look at the data and try to do something with it. Since Brian indicated that the project was interested in an RDF serialization, and Mark had pointed me at Aaron Rubenstein’s arch vocabulary, I decided to take a stab at converting the GraphML data to some Arch flavored RDF.

So I forked Brian’s mercurial repository, and wrote a script that parses the GraphML XML that Brian provided, and writes RDF (using arch:correspondedWith, arch:primaryProvenanceOf, arch:appearsWith) to a local triple store using rdflib. Since RDF has URLs cooked in pretty deep, part of this conversion involved reverse-engineering the SNAC URLs in the prototype, which wasn’t terribly clean, but it seemed good enough for demonstration purposes.
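
To give a rough idea of what that conversion involves, here is a minimal sketch, not the actual script. It sticks to the Python standard library and emits N-Triples strings instead of loading an rdflib store. The GraphML key names (name, label), the example.org base URL, the second person in the sample and the arch namespace URI are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}
ARCH = "http://purl.org/archival/vocab/arch#"  # assumed namespace for arch
BASE = "http://example.org/snac/"  # hypothetical stand-in for the SNAC URLs

# Tiny made-up sample; the real file's key ids may well differ.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <key id="label" for="edge" attr.name="label" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="1"><data key="name">Bush, Vannevar, 1890-1974.</data></node>
    <node id="2"><data key="name">Doe, Jane, 1900-1980.</data></node>
    <edge source="1" target="2"><data key="label">correspondedWith</data></edge>
  </graph>
</graphml>"""


def name_to_url(name):
    """Bake the heading into a URL, roughly as the prototype seems to."""
    return BASE + quote(name)


def graphml_to_ntriples(xml_text):
    """Turn every GraphML edge into one N-Triples statement."""
    root = ET.fromstring(xml_text)
    names = {}
    for node in root.iterfind(".//g:node", GRAPHML_NS):
        names[node.get("id")] = node.find("g:data[@key='name']", GRAPHML_NS).text
    triples = []
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        label = edge.find("g:data[@key='label']", GRAPHML_NS).text
        subject = name_to_url(names[edge.get("source")])
        obj = name_to_url(names[edge.get("target")])
        triples.append(f"<{subject}> <{ARCH}{label}> <{obj}> .")
    return triples


if __name__ == "__main__":
    for line in graphml_to_ntriples(SAMPLE):
        print(line)
```

With rdflib the loop body would add each triple to a graph instead of formatting a string, but the shape of the work is the same: nodes become URLs, edge labels become predicates.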

Once I had those triples (877,595 of them) I learned from Cory Harper that the SNAC folks had matched up the archival identities with entries in the Virtual International Authority File. The VIAF URLs aren’t present in their GraphML data (GraphML is not as expressive as RDF), but they are available in the prototype HTML, which I had URLs for. So, again, in the name of demonstration and not purity, I wrote a little scraper that would use the reverse-engineered SNAC URL to pull down the VIAF id. I tried to be respectful and not do this scraping in parallel, and to sleep a bit between requests. A few days of running and I had 40,237 owl:sameAs assertions that linked the SNAC URLs with the VIAF URLs.
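
The scraper itself is nothing fancy. A minimal sketch of the idea follows, assuming the prototype pages contain a link of the form http://viaf.org/viaf/<digits>. The function names are hypothetical, and the page fetcher is passed in as a parameter so the polite, one-at-a-time loop can be tried without touching anybody's server.

```python
import re
import time

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumption: the VIAF link appears somewhere in the page HTML in this form.
VIAF_LINK = re.compile(r"https?://viaf\.org/viaf/(\d+)")


def extract_viaf_url(html):
    """Return the first VIAF URL found in the page, or None."""
    match = VIAF_LINK.search(html)
    return f"http://viaf.org/viaf/{match.group(1)}" if match else None


def scrape_same_as(snac_urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch pages sequentially, pausing between requests.

    fetch is any callable that maps a URL to its HTML as a string.
    Returns a list of (snac_url, owl:sameAs, viaf_url) tuples.
    """
    assertions = []
    for i, url in enumerate(snac_urls):
        if i:
            sleep(delay)  # be respectful: no parallelism, rest between hits
        viaf_url = extract_viaf_url(fetch(url))
        if viaf_url:
            assertions.append((url, OWL_SAME_AS, viaf_url))
    return assertions
```

In real use fetch would wrap something like urllib.request.urlopen, and the delay would be tuned to whatever the remote site can comfortably tolerate.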

With the VIAF URLs in hand I thought it would be useful to have a graph of only the VIAF related resources. It seemed like a VIAF centered graph of archival information could demonstrate something we’ve been talking about some in the Library Linked Data W3C Incubator Group: that Linked Data actually provides a technology that lets the archival and bibliographic description communities cross-pollinate and share. This is the real insight of the SNAC effort, that these communities have a lot in common, in that they both deal with people, places, organizations, etc. So I wrote another little script that created a named graph within the larger triple store, and used the owl:sameAs assertions to do some brute force inferencing, to generate triples relating VIAF resources with Arch.
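
The brute force inferencing amounts to substitution. Here is a minimal sketch over plain tuples, with no triple store or named graph involved. It assumes a triple is kept whenever its subject has a VIAF equivalent, and that objects are rewritten too when they have one; that policy, the function name and the URLs in the usage example are my assumptions.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_viaf_graph(triples):
    """Rewrite SNAC-centered triples so that they hang off VIAF URLs.

    triples is an iterable of (subject, predicate, object) string tuples
    that includes the owl:sameAs assertions linking SNAC URLs to VIAF URLs.
    """
    triples = list(triples)
    # Map each SNAC URL to its VIAF equivalent.
    same_as = {s: o for s, p, o in triples if p == OWL_SAME_AS}
    viaf_graph = set()
    for s, p, o in triples:
        if p == OWL_SAME_AS or s not in same_as:
            continue  # skip the mapping itself and anything without a VIAF id
        # Swap in the VIAF URL for the subject, and for the object if known.
        viaf_graph.add((same_as[s], p, same_as.get(o, o)))
    return viaf_graph


if __name__ == "__main__":
    arch = "http://purl.org/archival/vocab/arch#"  # assumed namespace
    sample = [
        ("http://example.org/snac/a", OWL_SAME_AS, "http://viaf.org/viaf/1"),
        ("http://example.org/snac/b", OWL_SAME_AS, "http://viaf.org/viaf/2"),
        ("http://example.org/snac/a", arch + "correspondedWith",
         "http://example.org/snac/b"),
    ]
    for triple in sorted(build_viaf_graph(sample)):
        print(triple)
```

A proper reasoner would do far more with owl:sameAs than this, but for carving a VIAF centered view out of the data, simple substitution goes a long way.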

I realize that Turtle isn’t probably the most compelling example of the result, but in the absence of an app (maybe more on that forthcoming) that uses it, it’ll have to do for now. So here are the assertions for Vannevar Bush, for the Linked Data fetishists out there:

@prefix foaf <> .
@prefix arch <> .

    a foaf:Person ;
    foaf:name "Bush, Vannevar, 1890-1974." ;
    arch:appearsWith <>, 
        <> ;
    arch:correspondedWith <>,
        <> ;
    arch:primaryProvenanceOf <>, 
        <> ;
    owl:sameAs <> .

I’ve made a full dump of the data I created available if you are interested in taking a look. The nice thing is that the URIs are already published on the web, so I didn’t need to mint any identifiers myself to publish this Linked Data. Although I kind of played fast and loose with the SNAC URIs for people since they don’t do the httpRange-14 dance. It’s interesting that it doesn’t seem to have immediately broken anything. It would be nice if the SNAC Prototype URIs were a bit cleaner I guess. Perhaps they could use some kind of identifier instead of baking the heading into the URL?

So maybe I’ll have some time to build a simple app on top of this data. But hopefully I’ve at least communicated how useful it could be for the cultural heritage community to share web identifiers for people, and use them in their data. RDF also proved to be a nice malleable data model for expressing the relationships, and serializing them so that others could download them. Here’s to the emerging (hopefully) Giant Global GLAM Graph!

geeks bearing gifts

Trojan Horse in Stuttgart by Stefan Kühn

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

Create a persistent identifier: DOI or ARK

  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depends upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into your web browser (browsers are now ubiquitous), and view a document that is about the identified resource.

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging community call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forwards. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks; namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As another crank (I mean this with all due respect) Ted Nelson put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.