terminology services sneak peek

I just saw Diane Vizine-Goetz demo OCLC’s Terminology Services at the CENDI/NKOS meeting and was excited to see various things out on the public web. For example, the LCSH concept “World Wide Web” is over here:

http://tspilot.oclc.org/lcsh/sh2008114004

At the moment it’s not the friendliest human-readable display, but that’s just an XSLT stylesheet away (assuming TS follows the patterns of other OCLC Services). I’m not quite sure what the default namespace urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84 is, but the response looks like it indicates what resources are related to a given conceptual resource:

  1. http://tspilot.oclc.org/lcsh/sh2008114004.html
  2. http://tspilot.oclc.org/lcsh/sh2008114004.json
  3. http://tspilot.oclc.org/lcsh/sh2008114004.marcxml
  4. http://tspilot.oclc.org/lcsh/sh2008114004.meta
  5. http://tspilot.oclc.org/lcsh/sh2008114004.skos
  6. http://tspilot.oclc.org/lcsh/sh2008114004.stats
  7. http://tspilot.oclc.org/lcsh/sh2008114004.zthes

And LCSH is just one of the vocabularies available through the pilot service: if you examine the XML you’ll see references to FAST, TGM and MeSH, plus SRU services for each.

I think this is way cool, and a step in the right direction…particularly because they are going to make vocabularies available for free as long as the original publisher has no problem with it. My only complaint is that the URIs for the concepts don’t appear to do content-negotiation for application/rdf+xml. It looks like text/html and application/javascript (isn’t it application/json?) work just fine though. Try them out:

curl --header "Accept: application/javascript" http://tspilot.oclc.org/lcsh/sh2008114004
curl --header "Accept: text/html" http://tspilot.oclc.org/lcsh/sh2008114004

But not application/rdf+xml:

curl --header "Accept: application/rdf+xml" http://tspilot.oclc.org/lcsh/sh2008114004

It seems like it would be a pretty easy fix, and pretty important for being able to follow your nose on the semantic web.
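
If you want to poke at the content-negotiation from a script instead of curl, here’s a quick Python sketch (nothing official, just the standard library) that asks for each media type and reports what the server actually sends back:

import urllib.error
import urllib.request

CONCEPT = "http://tspilot.oclc.org/lcsh/sh2008114004"

# try each media type and report the status and Content-Type we get back
for accept in ("text/html", "application/javascript", "application/rdf+xml"):
    req = urllib.request.Request(CONCEPT, headers={"Accept": accept})
    try:
        with urllib.request.urlopen(req) as resp:
            print(accept, "->", resp.status, resp.headers.get("Content-Type"))
    except urllib.error.HTTPError as e:
        # e.g. a 406 Not Acceptable if the media type can't be negotiated
        print(accept, "->", e.code)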


nkos/cendi

Jon Phipps and I are speaking about SKOS at the World Bank today for a joint meeting of the CENDI and NKOS groups. The talk is entitled “SKOS: New Directions in Interoperability” … which is kind of ironic since SKOS has been a long-running topic at NKOS meetings. The idea is to describe SKOS (for those who don’t know it), cover the recent changes to SKOS (for those who do), and describe an implementation of SKOS (lcsh.info). A tall order for 30 minutes!

One new direction that I hope I’ll be able to get to is the notion of linked-data. I created some simple graph visualizations of the Royal Library of Sweden’s linked bibliographic data implementation. I really wanted to emphasize how linked data can model data across enterprise boundaries. By the way, this example really exists; it’s not library-science-fiction.

Wish us luck! There are going to be some other interesting talks during the day, on OCLC’s Terminology Services, Semantic MediaWiki for vocabulary development at the Mayo Clinic, mapping agriculture vocabularies, the intersection of folksonomy and taxonomy, and more.

PS. Roy I haven’t forgotten your follow-up comment :-)


w3c semweb use cases and lcsh

Via Ivan Herman I learned that the Semantic Web Use Cases use concepts from lcsh.info. For example, look at the RDFa in this case study for the Digital Music Archive for the Norwegian National Broadcaster. You can also look at the Document metadata in a linked data browser like OpenLink. Click on the “Document” and then on the various subject “concepts”, and you’ll see the linked data browser go out and fetch the triples from lcsh.info for “Semantic Web” and “Broadcasting”.

One of the downsides to linked-data browsers (for me) is that they hide a bit of what’s going on. Of course this is by design. For a more rdf-centric view of the data, take a look at this output of rapper:

ed@curry:~$ rapper -o turtle http://www.w3.org/2001/sw/sweo/public/UseCases/NRK/
rapper: Serializing with serializer turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
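
If you want to do the following-your-nose yourself rather than watch a browser do it, here’s a rough rdflib sketch. It assumes you’ve saved the rapper output to a file (nrk.ttl is my made-up name) and that the document points at its subjects with dcterms:subject; it then dereferences each concept URI and prints whatever skos:prefLabel it finds:

from rdflib import Graph
from rdflib.namespace import DCTERMS, SKOS

g = Graph()
g.parse("nrk.ttl", format="turtle")    # hypothetical file: the rapper output saved locally

# dereference each subject concept, just like the linked data browser does
for concept in set(g.objects(None, DCTERMS.subject)):
    g.parse(concept)                   # rdflib fetches and parses the RDF for us
    for label in g.objects(concept, SKOS.prefLabel):
        print(concept, label)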


Martin Malmsten and linked library data

I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating view inside a library (and the mind of a developer) that is publishing bibliographic resources as linked data.

Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data, and store the bibliographic resources encountered. I ran it over the weekend (it had a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources (kind of a fun unix hack here to get the types of resources in an ntriples rdf dump):

ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
  | cut -f 3 -d " " \
  | sort \
  | uniq -c \
  | sort -rn
  18445 <http://purl.org/ontology/bibo/Book>.
   1686 <http://purl.org/ontology/bibo/Article>.
    258 <http://www.w3.org/2004/02/skos/core#Concept>.
    245 <http://purl.org/ontology/bibo/Film>.
    237 <http://xmlns.com/foaf/0.1/Organization>.
    219 <http://xmlns.com/foaf/0.1/Person>.
     58 <http://purl.org/ontology/bibo/Periodical>.
      4 <http://purl.org/ontology/bibo/Map>.
      4 <http://purl.org/ontology/bibo/Manuscript>.
      1 <http://purl.org/ontology/bibo/Collection>.
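
For the curious, the crawler really is nothing special. Here’s a stripped-down sketch of the same idea (not the original 42 lines, and the seed is a placeholder you would swap for a real LIBRIS resource URI):

import time
from rdflib import Graph, URIRef

seed = URIRef("http://libris.kb.se/resource/bib/XXXXXXX")   # placeholder, not a real record
graph = Graph()
queue, seen = [seed], set()

while queue:
    uri = queue.pop(0)
    if uri in seen:
        continue
    seen.add(uri)
    try:
        graph.parse(uri)                       # content negotiation hands us RDF
    except Exception as e:
        print("skipping", uri, e)
        continue
    # queue up any libris URIs we just learned about
    for obj in graph.objects(None, None):
        if isinstance(obj, URIRef) and obj.startswith("http://libris.kb.se/") and obj not in seen:
            queue.append(obj)
    time.sleep(3)                              # stay on the Royal Library's good side

graph.serialize("libris.kb.se.n3", format="nt")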

As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess–much to the contrary, it was to share how linked data, by living on the web we know and love, is natural to work with.

One of the many gems in the interview was Martin’s response to Richard’s question about whether the “semantic web” we talk about today is subtly different from the semantic web that was introduced in 2001.

People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).

I’m not an expert on the history of the web and libraries, but this seems spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine-readable data is the real promise of linked-data for libraries. It feels like we’re at an inflexion point like the one where libraries realized their catalogs could be made available on the web. The web-opac allowed there to be links between, say, bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine-readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this, ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web to walk the links between resources. Imagine if we allowed people to do the same with our data: to gather, say, a union catalog of Sweden by crawling its member libraries’ catalogs, and periodically updating the records with an HTTP GET for each resource.

Martin’s main point is that a lot of valuable effort has gone into vocabulary development like Dublin Core, MODS, etc., and even some into distributing descriptions that use these vocabularies via OAI-PMH. But the really exciting part IMHO is giving these resources URLs, and linking them together…much as the web of documents is linked together. I agree with Martin: this is new territory that really combines what librarians and web-technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.



lingvoj

I’m just now running across lingvoj.org, a linked-data application for languages created by Bernard Vatant. lingvoj basically mints URIs for languages (using the ISO 639-1 code), and when they are resolved (yay HTTP) nice human- and machine-readable descriptions of the language come back. So, for example, the URI for Chinese is:

http://www.lingvoj.org/lang/zh

If you click on that link, your browser will display some HTML that describes the Chinese language, and if a client wants “application/rdf+xml” it’ll get back a nice chunk of rdf – all via a 303 redirect as it should be.
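
You can watch the 303 happen yourself with a few lines of Python (a quick sketch using the standard library, assuming the service still responds the way it did when I tried it):

import http.client

# ask for RDF but don't follow the redirect, so we can see the 303 itself
conn = http.client.HTTPConnection("www.lingvoj.org")
conn.request("GET", "/lang/zh", headers={"Accept": "application/rdf+xml"})
resp = conn.getresponse()
print(resp.status, resp.reason)        # expecting: 303 See Other
print(resp.getheader("Location"))      # the document that describes Chinese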

lingvoj is interesting for a few reasons:

  • I work at the Library of Congress, which is the maintainer of ISO 639-2, and I know someone who is experimenting with a linked-data application for delivering it.
  • I know software developers at LC and elsewhere who need access to this data in a predictable and explicit machine-readable format, one that lends itself to being updated (by re-harvesting the language URIs).
  • lingvoj follows the “303 URIs forwarding to One Generic Document” pattern, which is nice to see in practice. I also learned about the use of rdfs:isDefinedBy to assert (in this case) that a language is defined by the HTML representation for that language. Not sure how I missed that in the Cool URIs document before.
  • There are owl:sameAs links between lingvoj and dbpedia and opencyc, which in turn are linked data, and allow an agent to walk outwards and discover more about a language. Maybe one day lingvoj could link to our ISO 639-2 codelist at LC?
  • lingvoj defines a vocabulary that includes a new OWL class, Lingvo, for languages, which happens to extend dcterms:LinguisticSystem.

It’s a lot o’ fun discovering this emerging, rich data-universe on the web. If you are the least bit curious, take a look for yourself:

  curl --location --header "Accept: application/rdf+xml" http://www.lingvoj.org/lang/zh

Or better yet:

  rapper -o turtle http://lingvoj.org/lang/zh

Or if you are really adventurous, grab the whole data set and put it into your triple-store-du-jour.
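
And if you’d rather follow your nose in code, here’s a rough rdflib sketch that pulls down the description of Chinese and prints where it points (rdfs:label is just my guess at where the labels live, so adjust as needed):

from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDFS

zh = URIRef("http://www.lingvoj.org/lang/zh")
g = Graph()
g.parse(zh)                            # the 303 and content negotiation are handled for us

for label in g.objects(zh, RDFS.label):
    print("label:", label)
for same in g.objects(zh, OWL.sameAs):
    print("sameAs:", same)             # e.g. dbpedia, opencyc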


We've got five years, my brain hurts a lot

Recently there have been a few discussions about persistent identifiers on the web: in particular, one about the persistence of XRIs, and another about the use of HTTP URIs in semantic web applications like dbpedia.

As you probably know already, the w3c publicly recommended against the use of Extensible Resource Identifiers (XRI). The net effect of this was to derail the standardization of XRIs within OASIS itself. Part of the process that Ray Denenberg (my colleague at the Library of Congress) helped kick off was a further discussion between the XRI people and the w3c TAG about what XRI specifically provides that HTTP URIs do not. Recently that discussion hit a key point, made by Stuart Williams:

… the point that I’m trying to make is that the issue is with the social and administrative policies associated with the DNS system - and the solution is to establish a separate namespace outside the DNS system that has different social/adminsitrative policies (particularly wrt persistent name segments) that better suits the requirements of the XRI community. There is the question as to whether that alternate social/administrative system will endure into the long term such the the persistence intended guarantees endure… or not - however that will largely be determined by market forces (adoption) and ‘crudely’ the funding regime that enables the administrative structure of XRI to persist - and probably includes the use of IPRs to prevent duplicate/alternate root problems which we have seen in the DNS world.

It’ll be interesting to see the response. I basically have the same issue with DOIs and the Handle System that they depend on. Over at CrossTech Tony Hammond suggests that the Handle System would make RDF assertions such as those that involve DBPedia more persistent. But just how isn’t entirely clear to me. It seems that Handles, like URLs, are only persistent to the degree that they are maintained.

I’d love to see a use case from Tony that describes just how DOIs and the Handle System would provide more persistence than HTTP URLs in the context of RDF assertions involving dbpedia. As Stuart said eloquently in his email:

Again just seeking to understand - not to take a particular position

PS. Sorry if the blog post title is too cryptic, it’s Bowie’s “Five Years” which Tony’s post (perhaps intentionally) reminded me of :-)


resource maps and site maps

Andy reminds me that a relatively simple idea (I think it was David’s at RepoCamp) for the OAI-ORE Challenge would be to create a tool that transformed OAI-ORE resource maps expressed as Atom into Google Site Maps. This would allow “repositories” that expose their “objects” as resource maps to be easily crawled by Google and others.

It would also be useful to demonstrate what value-add OAI-ORE resource maps give you: to answer the question of why not just generate the site map and be done with it. I think there definitely are advantages, such as being able to identify compound objects or aggregations of web resources, and then make assertions about them (a.k.a. attach metadata to them).
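
To make the idea concrete, here’s a rough sketch of what the transform could look like in Python (standard library only). I’m assuming the Atom serialization lists aggregated resources as atom:link elements with the ore “aggregates” rel, so check it against the real ORE spec before trusting it:

import sys
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

doc = ET.parse(sys.argv[1])                        # the resource map, serialized as Atom
urlset = ET.Element("urlset", xmlns=SITEMAP_NS)

# every aggregated resource becomes a <url><loc>...</loc></url> entry
for link in doc.iter(ATOM + "link"):
    if link.get("rel") == AGGREGATES:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = link.get("href")

print(ET.tostring(urlset, encoding="unicode"))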


RepoCamp recap

So RepoCamp was a lot of fun. The goal was to discuss repository interoperability–and at the very least repository practitioners got to interoperate, and have a few beers afterwards. Hats off to David Flanders, who clearly has running these events down to a fine art.

I finally got to meet Ben O’Steen after bantering with him on #code4lib and #talis … and also got to chat with Jim Downing (Cambridge Univ) about SWORD stuff, and Stephan Drescher (Los Alamos National Lab) about validating OAI-ORE.

Stephan and I had a varied and wide-ranging discussion about the web in general, which was a lot of fun. I really dug his metaphor of the web as an aquatic ecosystem, with interdependent organisms and shared environments. It reminded me a bit of how shocked I was to discover how rich and varied the ecosystem is around a “simple” service like twitter. If I ever return to school it will be to study something along the lines of web science.

It was also interesting to hear that other people saw a parallel between OAI-ORE Resource Maps and BagIt’s fetch.txt: both resource maps and bags are aggregations of web resources. Of course bags can also just be files on disk; it’s when a fetch.txt is present in the bag that the package is made up of web resources. It would be interesting to see what vocabularies are available for expressing fixity information (md5 checksums and the like), and whether they could be layered into the resource map Atom serialization. Perhaps PREMIS v2.0? It might be fun to code up what a simple OAI-ORE resource map harvester would look like, one that checked fixity values – using LC’s existing BagIt parallelretriever.py as a starting point. God I wish I could just hyperlink to that :-(
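
Just to sketch how small the fixity-checking part of such a harvester would be, here’s a bit of Python (standard library only). The url/checksum pairs are placeholders; a real tool would pull them out of a fetch.txt or whatever fixity metadata ends up in the resource map:

import hashlib
import urllib.request

# placeholders: a real harvester would read these from a fetch.txt or
# from fixity metadata carried in the resource map
resources = [
    ("http://example.org/objects/file1.pdf", "9e107d9d372bb6826bd81d3542a419d6"),
]

for url, expected in resources:
    md5 = hashlib.md5()
    with urllib.request.urlopen(url) as resp:
        for chunk in iter(lambda: resp.read(8192), b""):
            md5.update(chunk)
    status = "OK" if md5.hexdigest() == expected else "FIXITY MISMATCH"
    print(status, url)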

At any rate, I now need to investigate OAuth because Jim thinks it fits really nicely with AtomPub, and SWORD in particular. And if it’s good enough for Google, it’s probably worth checking out. Jim also said there is a possibility that SWORD 2.0 might take shape as an IETF RFC, which would be good to see.

Thanks to all that made it happen, and for all of you that traveled long distances to join us at the Library of Congress.


premis v2.0 and schema munging

In an effort to get a better understanding of PREMIS after reading about the v2.0 release, I dug around for 5 minutes looking for a way to convert an XML Schema to RelaxNG, the theory being that the compact syntax of RelaxNG would be easier to read than the XSD.

I ended up with a little hack suggested here to chain together rngconv from the Multi-Schema Validator and James Clark’s Trang, which oddly can’t read an XSD as input.

#!/bin/bash