I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating look inside a library that is publishing bibliographic resources as linked data, and into the mind of one of the developers doing it.
Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data and store the bibliographic resources it encounters. I ran it over the weekend (with a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources (kind of a fun unix hack here to get the types of resources in an ntriples rdf dump):
ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
| cut -f 3 -d " " \
| sort \
| uniq -c \
| sort -rn
18445 <http://purl.org/ontology/bibo/Book>.
1686 <http://purl.org/ontology/bibo/Article>.
258 <http://www.w3.org/2004/02/skos/core#Concept>.
245 <http://purl.org/ontology/bibo/Film>.
237 <http://xmlns.com/foaf/0.1/Organization>.
219 <http://xmlns.com/foaf/0.1/Person>.
58 <http://purl.org/ontology/bibo/Periodical>.
4 <http://purl.org/ontology/bibo/Map>.
4 <http://purl.org/ontology/bibo/Manuscript>.
1 <http://purl.org/ontology/bibo/Collection>.
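The same tally can be sketched in Python for anyone without the unix tools at hand. This is only a rough equivalent, not what I actually ran: the `count_types` name is made up for illustration, it assumes one triple per line as in an ntriples dump, and it only counts lines where rdf:type is the predicate (the grep above matches the URI anywhere on the line).

```python
from collections import Counter

# The rdf:type predicate exactly as it appears in an ntriples line.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def count_types(lines):
    """Count the objects of rdf:type statements in an iterable of ntriples lines.

    Hypothetical helper: assumes each line holds one
    'subject predicate object .' statement.
    """
    counts = Counter()
    for line in lines:
        # Split into subject, predicate and "the rest" (object plus final dot).
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == RDF_TYPE:
            obj = parts[2].strip()
            # Drop the statement terminator, whether or not a space precedes it.
            if obj.endswith("."):
                obj = obj[:-1].strip()
            counts[obj] += 1
    return counts
```

Passing an open file for the dump to `count_types` and then calling `.most_common()` on the result gives the types in descending order, like the `sort -rn` at the end of the pipeline.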
As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess. Quite the contrary: it was to show that because linked-data lives on the web we know and love, it is natural to work with.
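To give a feel for how little is involved, here is a minimal sketch of a crawler along the lines described above. It is not the 42 lines I ran: the names `crawl`, `fetch_with_rdflib` and `max_docs` are invented for this example, and it assumes rdflib is installed and that the server returns RDF when rdflib’s `Graph.parse` is pointed at a resource URL. It stays on the seed’s host and sleeps between requests.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urlparse


def fetch_with_rdflib(url):
    """Fetch one document and return its triples as tuples of strings.

    Assumes rdflib is installed; it is imported here so the crawl loop
    below can be exercised with a stand-in fetcher.
    """
    from rdflib import Graph

    graph = Graph()
    graph.parse(url)  # retrieves the URL and parses the RDF that comes back
    return [(str(s), str(p), str(o)) for s, p, o in graph]


def crawl(seed, fetch=fetch_with_rdflib, delay=3.0, max_docs=100):
    """Breadth-first walk of same-host links, collecting every triple seen."""
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    triples = []
    fetched = 0
    while queue and fetched < max_docs:
        url = queue.popleft()
        if fetched:
            time.sleep(delay)  # be polite: pause between requests
        fetched += 1
        try:
            found = fetch(url)
        except Exception as err:  # a failed fetch should not stop the crawl
            print(f"skipping {url}: {err}")
            continue
        for s, p, o in found:
            triples.append((s, p, o))
            # Simplification: any subject or object that looks like an
            # http(s) URL on the seed's host is treated as a link to follow.
            for term in (s, o):
                link = urldefrag(term)[0]  # drop any #fragment
                parts = urlparse(link)
                if (parts.scheme in ("http", "https")
                        and parts.netloc == host
                        and link not in seen):
                    seen.add(link)
                    queue.append(link)
    return triples
```

Calling `crawl` with the URL of any resource in the dataset as the seed returns a list of triples that can then be written out as ntriples. Passing in the fetch function is just a convenience so the loop can be tried against canned data before it is pointed at a real server.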
One of the many gems in the interview was Martin’s response to Richard’s question about whether the “semantic web” that we talk about today is subtly different from the semantic web that was introduced in 2001.
People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).
I’m not an expert on the history of the web and libraries, but this seems spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine readable data is the real promise of linked-data for libraries. It feels like we’re at an inflexion point like the one where libraries realized their catalogs could be made available on the web. The web-opac allowed there to be links between, say, bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this, ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web to walk the links between resources. Imagine if we could allow people to do the same with our data: to gather, say, a union catalog of Sweden by crawling its member libraries’ catalogs, and to keep it current by periodically issuing an HTTP GET for each resource.
Martin’s main point is that a lot of valuable effort has gone into developing vocabularies like Dublin Core, MODS, etc., and even some into distributing descriptions that use these vocabularies via OAI-PMH. But the really exciting part IMHO is giving these resources URLs and linking them together…much as the web of documents is linked together. I agree with Martin: this is new territory that really combines what librarians and web-technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.

3 Comments
Not to be a monkey on your back (he says, as he clambers on for another ride), but I don’t think you’ve yet met my challenge. That is, what problem does this solve and how does it work better than anything previous? As far as I can tell you’ve solved a problem (aggregating data) that was solved back when you could pack a VW bus with mag tapes. What is so useful about this that will have people sit up and say “Yeah! I WANT me one of those”?
@royt ok, I honestly didn’t think you’d think I had met your challenge :-) I’d need another blog post to talk about why I think aggregating resources in this way (web harvesting) is superior to mag tapes in a VW bus. I would’ve thought it was self-evident, but there you go… A better comparison, I think, would be with OAI-PMH. Do you think OAI-PMH is useful?
The point I’m trying to make about your specific demonstration can be equally made with OAI-PMH. Both that protocol and your technique have the same problem: they are a good way to aggregate _unique_ records, but they will always be flawed when aggregating records for commonly held items. This is because variations in those records are much easier to deal with in a batch mode over time than dynamically. Ask anyone who has ever had to create and maintain a union catalog. This is why I don’t find this a compelling demonstration of linked data, as fun as it may have been to be asleep while the catalog was being sucked down, which as you point out could have been done via OAI-PMH as well. Solve a problem that many of us have in a more effective way than before and you may have something. Without that, why should we care?