Martin Malmsten and linked library data

I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating view inside a library, and the mind of a developer that are publishing bibliographic resources as linked data.

Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data, and store the bibliographic resources encountered. I ran it over the weekend (it had a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources (kind of a fun unix hack here to get the types of resources in a ntriples rdf dump):

ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
  | cut -f 3 -d " " \
  | sort \ 
  | uniq -c \
  | sort -rn
  18445 <http://purl.org/ontology/bibo/Book>.
   1686 <http://purl.org/ontology/bibo/Article>.
    258 <http://www.w3.org/2004/02/skos/core#Concept>.
    245 <http://purl.org/ontology/bibo/Film>.
    237 <http://xmlns.com/foaf/0.1/Organization>.
    219 <http://xmlns.com/foaf/0.1/Person>.
     58 <http://purl.org/ontology/bibo/Periodical>.
      4 <http://purl.org/ontology/bibo/Map>.
      4 <http://purl.org/ontology/bibo/Manuscript>.
      1 <http://purl.org/ontology/bibo/Collection>.

As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess–much to the contrary, it was to share how the nature of linked-data being on the web we know and love makes it natural to work with.

One of the many gems in the interview, was Martin’s response to Richard’s question about whether the “semantic web” that we talk about today is subtly different than the semantic web that was introduced in 2001.

People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).

I’m not an expert on the history of the web and libraries, but this seems to be spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine readable data is the real promise of linked-data for libraries. It feels like we’re at an inflexion point like the one where libraries realized their catalogs could be made available on the web. The web-opac allowed there to be links between say bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web, to walk the links between resources. Imagine if we could allow people to do the same with our data? To gather say a union catalog of Sweden by crawling it’s member libraries catalogs, and periodically updating them with HTTP GET for that resource?

Martin’s main point is that a lot of valuable effort has gone into vocabulary development like DublinCore, MODS etc, and even some on the distribution of descriptions using these vocabularies using OAI-PMH. But the real exciting part IMHO is giving these resources URLs, and linking them together…much as the web of documents is linked together. I agree with Martin, this is new territory, that really combines what librarians and web-technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.