Archive for September, 2008

lcsh.info logs

Wednesday, September 24th, 2008

If you are curious how lcsh.info is being used I’ve made the apache server logs available, including the ones for the sparql service. I’ve been meaning to do some analysis of the logs but haven’t had the time yet. You’ll notice that among the data that’s collected is the Accept header sent by agents, since it’s so important to what representation is served up. Thanks to danbri for the idea to simply make them available.

iswc2009, DC and vocamp

Monday, September 22nd, 2008

I just learned from Tom Heath that The International Semantic Web Conference is coming to Washington DC next year. This is pretty cool news to me, since traveling to conferences isn’t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a VoCamp around the conference, to provide an informal collaboration space for vocabulary demos, development, q/a, etc. If you want to help out please join the mailing list.

public.resource.org to liberate Code of Federal Regulations

Wednesday, September 17th, 2008

good news via the govtrack mailing list

Carl Malamud of public.resource.org, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.

More here: http://public.resource.org/gpo.gov/index.html

And Carl’s letter to the GPO:
http://public.resource.org/gpo.gov/the_honorable.html

It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.

terminology services sneak peek

Thursday, September 11th, 2008

I just saw Diane Vizine-Goetz demo OCLC’s Terminology Services at the CENDI/SKOS meeting and was excited to see various things out on the public web. For example, the LCSH concept “World Wide Web” is over here:

http://tspilot.oclc.org/lcsh/sh2008114004

At the moment it’s not the most friendly human readable display, but that’s just an XSLT stylesheet away (assuming TS follows the patterns of other OCLC Services). I’m not quite sure what the default namespace urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84 is. But the response looks like it indicates what resources are related to a given conceptual resource.

  1. http://tspilot.oclc.org/lcsh/sh2008114004.html
  2. http://tspilot.oclc.org/lcsh/sh2008114004.json
  3. http://tspilot.oclc.org/lcsh/sh2008114004.marcxml
  4. http://tspilot.oclc.org/lcsh/sh2008114004.meta
  5. http://tspilot.oclc.org/lcsh/sh2008114004.skos
  6. http://tspilot.oclc.org/lcsh/sh2008114004.stats
  7. http://tspilot.oclc.org/lcsh/sh2008114004.zthes

And LCSH is just one of the vocabularies available through the pilot service. If you examine the XML you’ll see references to FAST, TGM and MESH, plus SRU services for each.

I think this is way cool, and a step in the right direction…particularly because they are going to make vocabularies available for free as long as the original publisher has no problem with it. My only complaint is that the URIs for the concepts don’t appear to do content-negotiation for application/rdf+xml. It looks like text/html and application/javascript (isn’t it application/json?) work just fine though. Try them out:

curl --header "Accept: application/javascript" http://tspilot.oclc.org/lcsh/sh2008114004
curl --header "Accept: text/html" http://tspilot.oclc.org/lcsh/sh2008114004

But not application/rdf+xml:

curl --header "Accept: application/rdf+xml" http://tspilot.oclc.org/lcsh/sh2008114004

It seems like it would be a pretty easy fix, and pretty important for being able to follow your nose on the semantic web.
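
To give a sense of how small that fix could be, here is a minimal sketch of server-side content negotiation in Python. It is not how OCLC's service works: the function names and the media-type table are made up for illustration, and mapping application/rdf+xml to the .skos document is purely an assumption based on the list of suffixed URLs above.

```python
# Sketch of server-side content negotiation. The media-type-to-suffix table is
# an assumption for illustration, not OCLC's actual configuration.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/json": ".json",
    "application/javascript": ".json",
    "application/rdf+xml": ".skos",  # assumed: serve the SKOS document as RDF/XML
}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(concept_uri, accept_header, default="text/html"):
    """Pick the URL of the representation to redirect to (e.g. with a 303)."""
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in REPRESENTATIONS:
            return concept_uri + REPRESENTATIONS[media_type]
    # Wildcards and unknown types fall back to the default representation
    return concept_uri + REPRESENTATIONS[default]
```

With something like this in front of the concept URIs, a request with "Accept: application/rdf+xml" would be pointed at the RDF document instead of falling through to HTML.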

nkos/cendi

Thursday, September 11th, 2008

Jon Phipps and I are speaking about SKOS at the World Bank today for a joint meeting of the CENDI and NKOS groups. The talk is entitled “SKOS: New Directions in Interoperability” … which is kind of ironic since SKOS has been a long running topic at NKOS meetings. The idea is to describe SKOS (for those who don’t know it), cover the recent changes to SKOS (for those that do), and describe an implementation of SKOS (lcsh.info). A tall order for 30 minutes!

One new direction that I hope I’ll be able to get to is the notion of linked-data. I created some simple graph visualizations of the Royal Library of Sweden’s linked bibliographic data implementation. I really wanted to emphasize how linked data can model data across enterprise boundaries. By the way, this example really exists; it’s not library-science-fiction.

Wish us luck! There are going to be some other interesting talks during the day, on OCLC’s Terminology Services, Semantic MediaWiki for vocabulary development at the Mayo Clinic, mapping agriculture vocabularies, the intersection of folksonomy and taxonomy, and more.

PS. Roy I haven’t forgotten your follow-up comment :-)

w3c semweb use cases and lcsh

Friday, September 5th, 2008

Via Ivan Herman I learned that the Semantic Web Use Cases use concepts from lcsh.info. For example look at the RDFa in this case study for the Digital Music Archive for the Norwegian National Broadcaster. You can also look at the Document metadata in a linked data browser like OpenLink. Click on the “Document” and then on the various subject “concepts” and you’ll see the linked data browser go out and fetch the triples from lcsh.info for “Semantic Web” and “Broadcasting”.

One of the downsides to linked-data browsers (for me) is that they hide a bit of what’s going on. Of course this is by-design. For a more rdf centric view on the data take a look at this output of rapper.

ed@curry:~$ rapper -o turtle http://www.w3.org/2001/sw/sweo/public/UseCases/NRK/
rapper: Serializing with serializer turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.w3.org/2001/sw/sweo/public/UseCases/NRK/>
    a bibo:Document;
    dc:rights "\u00A9 Copyright 2007, ESIS, NRK."@en-us;
    dc:subject <http://lcsh.info/sh85017004#concept>, <http://lcsh.info/sh2002000569#concept>;
    dc:date "2007-09"^^xsd:dateTime;
    dc:title "Case Study: A Digital Music Archive (DMA) for the Norwegian National Broadcaster (NRK) using Semantic Web techniques"@en-us;
    bibo:distributor <http://www.w3.org/>;
    bibo:authorList (
        [
            a foaf:Person;
            foaf:workplaceHomepage <http://www.esis.no>;
            foaf:name "Dr. Robert H.P. Engels"@nl
        ]
        [
            a foaf:Person;
            foaf:workplaceHomepage <http://www.nrk.no>;
            foaf:name "Jon Roar T\u00F8nnesen"@no
        ]
    ) .

<http://www.w3.org/2001/sw/sweo/public/UseCases/NRK/Overview.html>
    xhv:stylesheet <http://www.w3.org/2001/sw/sweo/public/UseCases/style/ucstyle.css> .

You can see Ivan’s using the dc, foaf, skos and bibo vocabularies, and the links out to lcsh Concepts. Fun stuff.

Martin Malmsten and linked library data

Tuesday, September 2nd, 2008

I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating view inside a library that is publishing bibliographic resources as linked data, and inside the mind of one of the developers doing it.

Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data, and store the bibliographic resources encountered. I ran it over the weekend (it had a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources (kind of a fun unix hack here to get the types of resources in a ntriples rdf dump):

ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
  | cut -f 3 -d " " \
  | sort \
  | uniq -c \
  | sort -rn
  18445 <http://purl.org/ontology/bibo/Book>.
   1686 <http://purl.org/ontology/bibo/Article>.
    258 <http://www.w3.org/2004/02/skos/core#Concept>.
    245 <http://purl.org/ontology/bibo/Film>.
    237 <http://xmlns.com/foaf/0.1/Organization>.
    219 <http://xmlns.com/foaf/0.1/Person>.
     58 <http://purl.org/ontology/bibo/Periodical>.
      4 <http://purl.org/ontology/bibo/Map>.
      4 <http://purl.org/ontology/bibo/Manuscript>.
      1 <http://purl.org/ontology/bibo/Collection>.
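
The same tally can be sketched in Python for anyone who would rather not chain unix tools. This is only an illustration, and count_types is a made-up name. Unlike the grep above, it checks that rdf:type is actually in the predicate position, and it strips the trailing period from the object.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def count_types(lines):
    """Count the objects of rdf:type statements in N-Triples lines."""
    counts = Counter()
    for line in lines:
        # subject, predicate, then everything else (the object and final ".")
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[1] == RDF_TYPE:
            counts[parts[2].rstrip(" .")] += 1
    return counts
```

Passing it an open file handle for the dump (libris.kb.se.n3 in the transcript above) and calling most_common() on the result should give the same ordering as the sort -rn at the end of the pipeline.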

As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess–much to the contrary, it was to share how the nature of linked-data being on the web we know and love makes it natural to work with.
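
For a sense of how little is involved, here is a rough sketch of the general shape of such a crawler. It is not the 42 lines I actually ran: the names are made up, and fetch_triples is a hypothetical callable that takes the place of the rdflib parsing so the walk itself can be read (and tried) on its own.

```python
import time
from collections import deque
from urllib.parse import urldefrag


def crawl(start_uri, fetch_triples, host_prefix, max_resources=100, delay=3.0):
    """Breadth-first walk of linked data starting at start_uri.

    fetch_triples is a hypothetical callable: given a URI it returns an
    iterable of (subject, predicate, object) string triples. With rdflib it
    could wrap Graph().parse(uri); it is injected here so the walk stays
    library-independent.
    """
    seen = {start_uri}
    queue = deque([start_uri])
    triples = []
    fetched = 0
    while queue and fetched < max_resources:
        uri = queue.popleft()
        if fetched:
            time.sleep(delay)  # be polite between requests
        fetched += 1
        for s, p, o in fetch_triples(uri):
            triples.append((s, p, o))
            # Drop any #fragment so each document is fetched only once
            target = urldefrag(o)[0]
            # Follow only links that stay on the site being crawled
            if target.startswith(host_prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return triples
```

The 3 second default for delay matches the sleep mentioned above. The max_resources cap and the host_prefix check are additions for the sketch, there so that a first run stays small and doesn't wander off the site.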

One of the many gems in the interview was Martin’s response to Richard’s question about whether the “semantic web” that we talk about today is subtly different than the semantic web that was introduced in 2001.

People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).

I’m not an expert on the history of the web and libraries, but this seems to be spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine readable data is the real promise of linked-data for libraries. It feels like we’re at an inflexion point like the one where libraries realized their catalogs could be made available on the web. The web-opac allowed there to be links between say bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this, ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web to walk the links between resources. Imagine if we could allow people to do the same with our data? To gather say a union catalog of Sweden by crawling its member libraries’ catalogs, and periodically updating each record with an HTTP GET for that resource?

Martin’s main point is that a lot of valuable effort has gone into vocabulary development like Dublin Core, MODS etc, and even some into distributing descriptions that use these vocabularies via OAI-PMH. But the real exciting part IMHO is giving these resources URLs, and linking them together…much as the web of documents is linked together. I agree with Martin: this is new territory that really combines what librarians and web-technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.