lingvoj

I’m just now running across lingvoj.org, a linked-data application for languages created by Bernard Vatant. lingvoj basically mints URIs for languages (using their ISO 639-1 codes), and when a URI is resolved (yay HTTP) a nice human and machine readable description of the language comes back. So for example the URI for Chinese is:

http://www.lingvoj.org/lang/zh

If you click on that link, your browser will display some HTML that describes the Chinese language, and if a client asks for “application/rdf+xml” it’ll get back a nice chunk of RDF – all via a 303 redirect, as it should be.

lingvoj is interesting for a few reasons:

  • I work at the Library of Congress, which maintains ISO 639-2, and I know someone experimenting with a linked-data application for delivering it.
  • I know software developers at LC and elsewhere who need access to this data in a predictable and explicit machine readable format, one that lends itself to being kept up to date (by re-harvesting the language URIs).
  • lingvoj follows the 303 URIs forwarding to One Generic Document pattern, which is nice to see in practice. I also learned about the use of rdfs:isDefinedBy to assert (in this case) that a language is defined by the HTML representation for the language. Not sure how I missed that in the Cool URIs document before.
  • There are owl:sameAs links between lingvoj and dbpedia and opencyc, which are in turn linked data, and allow an agent to walk outwards and discover more about a language (there’s a sketch of that just after this list). Maybe one day lingvoj could link to our ISO 639-2 code list at LC?
  • lingvoj defines a vocabulary which includes a new OWL class, Lingvo, for languages, which happens to extend dcterms:LinguisticSystem.
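
Walking those links programmatically is pretty easy too. Here’s a rough sketch of my own (nothing lingvoj itself provides, and it leans on rdflib) that pulls down the description of Chinese and then follows its owl:sameAs links outwards:

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

start = URIRef("http://www.lingvoj.org/lang/zh")

g = Graph()
g.parse(start)  # conneg plus the 303 redirect hand back RDF

for same in g.objects(start, OWL.sameAs):
    print("also described at %s" % same)
    try:
        g.parse(same)  # pull in the dbpedia/opencyc descriptions too
    except Exception as e:
        print("could not fetch %s: %s" % (same, e))

print("%s triples after walking the sameAs links" % len(g))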

It’s a lot o’ fun discovering this emerging, rich data-universe on the web. If you are the least bit curious take a look for yourself:

  curl --location --header "Accept: application/rdf+xml" http://www.lingvoj.org/lang/zh

Or better yet:

  rapper -o turtle http://lingvoj.org/lang/zh

Or if you are really adventurous, grab the whole data set and put it into your triple-store-du-jour.


We've got five years, my brain hurts a lot

Recently there have been a few discussions about persistent identifiers on the web: in particular one about the persistence of XRIs, and another about the use of HTTP URIs in semantic web applications like dbpedia.

As you probably know already, the w3c publicly recommended against the use of Extensible Resource Identifiers (XRI). The net effect of this was to derail the standardization of XRIs within OASIS itself. Part of the process that Ray Denenberg (my colleague at the Library of Congress) helped kick off was a further discussion between the XRI people and the w3c-tag about what XRI specifically provides that HTTP URIs do not. Recently that discussion hit a key point in a message from Stuart Williams:

… the point that I’m trying to make is that the issue is with the social and administrative policies associated with the DNS system - and the solution is to establish a separate namespace outside the DNS system that has different social/administrative policies (particularly wrt persistent name segments) that better suits the requirements of the XRI community. There is the question as to whether that alternate social/administrative system will endure into the long term such that the intended persistence guarantees endure… or not - however that will largely be determined by market forces (adoption) and ‘crudely’ the funding regime that enables the administrative structure of XRI to persist - and probably includes the use of IPRs to prevent duplicate/alternate root problems which we have seen in the DNS world.

It’ll be interesting to see the response. I basically have the same issue with DOIs and the Handle System that they depend on. Over at CrossTech Tony Hammond suggests that the Handle System would make RDF assertions, such as those involving dbpedia, more persistent. But just how isn’t entirely clear to me. It seems that Handles, like URLs, are only persistent to the degree that they are maintained.

I’d love to see a use case from Tony that describes just how DOIs and the Handle System would provide more persistence than HTTP URLs in the context of RDF assertions involving dbpedia. As Stuart said eloquently in his email:

Again just seeking to understand - not to take a particular position

PS. Sorry if the blog post title is too cryptic, it’s a line from Bowie’s “Five Years”, which Tony’s post (perhaps intentionally) reminded me of :-)


resource maps and site maps

Andy reminds me that a relatively simple idea (I think it was David’s at RepoCamp) for the OAI-ORE Challenge would be to create a tool that transforms OAI-ORE resource maps expressed as Atom into Google Sitemaps. This would allow “repositories” that expose their “objects” as resource maps to be easily crawled by Google and others.
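
Something like this little sketch is more or less what I have in mind. It’s just my own guess at the shape of it (not an actual challenge entry), and it assumes the aggregated resources show up as atom:link elements on each entry:

import sys
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def resource_map_to_sitemap(rem_url):
    """Fetch an Atom serialized OAI-ORE resource map and return a Google
    Sitemap (urlset) document listing the linked resources."""
    feed = ET.parse(urllib.request.urlopen(rem_url)).getroot()
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for entry in feed.findall(ATOM + "entry"):
        # a real version would filter on the @rel used for ore:aggregates
        for link in entry.findall(ATOM + "link"):
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = link.get("href")
    return ET.tostring(urlset, encoding="unicode")

if __name__ == "__main__":
    # pass the location of a resource map on the command line
    print(resource_map_to_sitemap(sys.argv[1]))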

It would also be useful to demonstrate what value-add OAI-ORE resource maps give you: to answer the question of why you wouldn’t just generate the sitemap and be done with it. I think there definitely are advantages, such as being able to identify compound objects or aggregations of web resources, and then make assertions about them (a.k.a. attach metadata to them).


RepoCamp recap

So RepoCamp was a lot of fun. The goal was to discuss repository interoperability–and at the very least repository practitioners got to interoperate, and have a few beers afterwards. Hats off to David Flanders, who clearly has got running these events down to a fine art.

I finally got to meet Ben O’Steen after bantering with him on #code4lib and #talis … and also got to chat with Jim Downing (Cambridge Univ) about SWORD stuff, and Stephan Drescher (Los Alamos National Lab) about validating OAI-ORE.

Stephan and I had a varied and wide ranging discussion about the web in general, which was a lot of fun. I really dug his metaphor of the web as an aquatic ecosystem, with interdependent organisms and shared environments. It reminded me a bit of how shocked I was to discover how rich and varied the ecosystem is around a “simple” service like twitter. If I ever return to school it will be to study something along the lines of web science.

It was also interesting to hear that other people saw a parallel between OAI-ORE Resource Maps and BagIt’s fetch.txt – the parallel being that both resource maps and bags are aggregations of web resources. Of course bags can also just be files on disk; it’s when a fetch.txt is present in the bag that the package is made up of web resources. It would be interesting to see what vocabularies are available for expressing fixity information (md5 checksums and the like), and whether they could be layered into the resource map Atom serialization. Perhaps PREMIS v2.0? It might be fun to code up what a simple OAI-ORE resource map harvester that checked fixity values would look like (a rough sketch follows below), using LC’s existing BagIt parallelretriever.py as a starting point. God I wish I could just hyperlink to that :-(
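
Here’s roughly the shape of it. This is just a sketch (nothing to do with parallelretriever.py itself), and the idea that the harvester ends up with a list of (url, md5) pairs is my own assumption about how the fixity info might be carried in a resource map:

import hashlib
import urllib.request

def check_fixity(resources):
    """resources is an iterable of (url, expected_md5) pairs pulled out
    of a resource map that carries fixity info."""
    for url, expected in resources:
        md5 = hashlib.md5()
        with urllib.request.urlopen(url) as response:
            for chunk in iter(lambda: response.read(8192), b""):
                md5.update(chunk)
        if md5.hexdigest() == expected:
            print("ok       %s" % url)
        else:
            print("MISMATCH %s" % url)

if __name__ == "__main__":
    check_fixity([
        # purely hypothetical example entry
        ("http://example.org/objects/1/thesis.pdf",
         "d41d8cd98f00b204e9800998ecf8427e"),
    ])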

At any rate, I now need to investigate OAuth because Jim thinks it fits really nicely with AtomPub, and with SWORD in particular. And if it’s good enough for Google it’s probably worth checking out. Jim also said that there is a possibility that SWORD 2.0 might take shape as an IETF RFC, which would be good to see.

Thanks to all who made it happen, and to all of you who traveled long distances to join us at the Library of Congress.


premis v2.0 and schema munging

In an effort to get a better understanding of PREMIS after reading about the v2.0 release, I dug around for five minutes looking for a way to convert an XML Schema to RelaxNG, the theory being that the compact syntax of RelaxNG would be easier to read than the XSD.

I ended up with a little hack suggested here to chain together rngconv from the Multi-Schema Validator and James Clark’s Trang, which oddly can’t read an XSD as input.

#!/bin/bash
# roughly what the hack boils down to (the jar and schema filenames here
# are just placeholders): rngconv turns the XSD into RelaxNG, and trang
# turns that into the compact syntax
java -jar rngconv.jar premis.xsd > premis.rng
java -jar trang.jar premis.rng premis.rnc


identi.ca and linked data

If you’ve already caught the micro-blogging bug, identi.ca is an interesting twitter clone for a variety of reasons…not the least of which is that it’s an open source project, and has been designed to run in a decentralized way. The thing I was pleasantly surprised to see was FOAF exports like this for user networks, and HTTP URIs for foaf:Person resources:

ed@hammer:~$ curl -I http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 12:58:56 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html

It looks like there’s a slight bug in the way the HTTP status is being returned (the status line says 302 Found even though the Status header says 303 See Other), but clearly the intent was to do the right thing by httpRange-14. If I have time I’ll get laconi.ca running locally so I can confirm the bug, and attempt a fix.

It’s also cool to see that Evan Prodromou (the lead developer, and creator of identi.ca and laconi.ca) has opened a couple of tickets for adding RDFa to various pages. If I have the time this would be a fun hack as well. I’d also like to take a stab at doing conneg on the foaf:Person URIs to enable this sorta thing:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 303 See Other
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Location: http://identi.ca/edsu/foaf

instead of what happens currently:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html
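
Making that happen seems like a fairly small change on the server side: look at the Accept header on a request for the /user/NNN URI and 303 over to the FOAF document when RDF is wanted. Here’s the gist of the logic, sketched in Python purely for illustration (laconi.ca itself is PHP, and the nickname lookup is hand-waved):

from wsgiref.simple_server import make_server

def app(environ, start_response):
    # in the real thing the nickname would be looked up from the user id
    # in environ["PATH_INFO"]; it is hardcoded here
    nickname = "edsu"
    accept = environ.get("HTTP_ACCEPT", "")
    if "application/rdf+xml" in accept:
        location = "http://identi.ca/%s/foaf" % nickname
    else:
        location = "http://identi.ca/%s" % nickname
    start_response("303 See Other", [("Location", location)])
    return [b""]

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()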

I guess this is also just a complicated way of saying I’m edsu on identi.ca–and that the opportunity to learn more about OAuth and XMPP is a compelling enough reason alone for me to make the switch.


lcsh.info SPARQL endpoint

disclaimer: lcsh.info was a prototype, and is no longer available, see id.loc.gov for the service from the Library of Congress

I’ve set up a SPARQL endpoint for lcsh.info at sparql.lcsh.info. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in a variety of formats. If you are a regular expression and/or SQL junkie, and like data, then SPARQL is definitely worth taking a look at.

If you are new to SPARQL and/or LCSH as SKOS you can try the default query and you’ll get back the first 10 triples in the triple store:

SELECT ?s ?p ?o
WHERE {?s ?p ?o}
LIMIT 10

As a first tweak try increasing the limit to 100. If you are feeling more adventurous perhaps you’d like to look up all the triples for a concept like Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?p ?o
WHERE {
  ?s skos:prefLabel "Buddhism"@en .
  ?s ?p ?o .
}
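
The endpoint speaks the plain SPARQL protocol, so you can also hit it from code – something along these lines will do it (a sketch; adjust the endpoint URL and the Accept header to taste):

import urllib.parse
import urllib.request

ENDPOINT = "http://sparql.lcsh.info/"

def sparql_select(query):
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url)
    # ask for the SPARQL XML results format
    request.add_header("Accept", "application/sparql-results+xml")
    return urllib.request.urlopen(request).read()

if __name__ == "__main__":
    print(sparql_select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"))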


Content-MD5 considered helpful

Kind of an interesting thread going on in the Amazon Web Services forum about data corruption on S3. It highlights how important it is for clients to send something like the Content-MD5 HTTP header to checksum the HTTP payload, and for the server to check it before saying 200 OK back…at least for data storage REST applications:

When Amazon S3 receives a PUT request with the Content-MD5 header, Amazon S3 computes the MD5 of the object received and returns a 400 error if it doesn’t match the MD5 sent in the header. Looking at our service logs from the period between 6/20 11:54pm PDT and 6/22 5:12am PDT, we do see a modest increase in the number of 400 errors. This may indicate that there were elevated network transmission errors somewhere between the customer and Amazon S3.

Some customers are claiming that the MD5 checksums coming back from S3 are different from the ones for the content that was originally sent there. Perhaps the clients ignored the 400? Or maybe there is data corruption elsewhere. It’ll be interesting to follow the thread.
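
For what it’s worth, sending Content-MD5 from a client is cheap to do: it’s just the base64 encoded MD5 digest of the request body. A rough sketch (the URL is made up, and a real S3 request would need auth headers on top of this):

import base64
import hashlib
import urllib.request

def put_with_content_md5(url, body):
    digest = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-MD5", digest)
    # the server should recompute the MD5 and answer 400 on a mismatch
    return urllib.request.urlopen(request)

if __name__ == "__main__":
    response = put_with_content_md5("http://example.org/bucket/key",
                                    b"hello world")
    print(response.status, response.reason)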


provide and enable

I got a chance to meet Jennifer Rigby of the National Archives (UK) at the LinkedDataPlanet Conference in New York City (thanks Ian). Jennifer is the Head of IT Strategy, and told me lots of interesting stuff about a profound shift they’ve had in their online strategy to:

Provide and Enable

So rather than pouring all their energy into building applications to visualize archival resources, the National Archives have recognized that making machine readable resources available to the public (in formats like RDF and RDFa) is really important to their core mission. In addition to providing services and data, they are trying to enable an ecosystem of innovation around their assets–or in their words:

• We will allow others to harness the power of our information, leading to a far wider range of products and services than we could provide ourselves.
• We will continue to work with commercial partners to provide online access to millions of records.

Jennifer said we can look forward to an announcement around OpenTech2008 (July 5th) about a set of important publications that are going to be made available by the Archives as RDF and RDFa. In addition I heard about how they work with website data harvested by the Internet Archive to create a resolver service for transient publications on the web.

Hearing how a big organization like the National Archives can come to this realization of “Provide and Enable”, and then start to execute on it, was really encouraging–and inspiring. It is also refreshing to see people recognize, in writing, the importance of semantic web technologies:

We have started exploring new ideas and technologies, including using RDFa for publishing the Gazettes. The way we now publish legislation has a key role to play in the further development of the semantic web.