Archive for July, 2008

RepoCamp recap

Monday, July 28th, 2008

So RepoCamp was a lot of fun. The goal was to discuss repository interoperability–and at the very least repository practitioners got to interoperate, and have a few beers afterwards. Hats off to David Flanders, who clearly has running these events down to a fine art.

I finally got to meet Ben O’Steen after bantering with him on #code4lib and #talis … and also got to chat with Jim Downing (Cambridge Univ) about SWORD stuff, and Stephan Drescher (Los Alamos National Lab) about validating OAI-ORE.

Stephan and I had a varied and wide-ranging discussion about the web in general, which was a lot of fun. I really dug his metaphor of the web as an aquatic ecosystem, with interdependent organisms and shared environments. It reminded me a bit of how shocked I was to discover how rich and varied the ecosystem is around a “simple” service like twitter. If I ever return to school it will be to study something along the lines of web science.

It was also interesting to hear that other people saw a parallel between OAI-ORE Resource Maps and BagIt’s fetch.txt. The parallel being that both resource maps and bags are aggregations of web resources. Of course bags can also just be files on disk; it’s only when fetch.txt is present in the bag that the package is made up of web resources. It would be interesting to see what vocabularies are available for expressing fixity information (md5 checksums and the like), and if they could be layered into the resource map Atom serialization. Perhaps PREMIS v2.0? It might be fun to code up what a simple OAI-ORE resource map harvester that checked fixity values would look like, using LC’s existing BagIt parallelretriever.py as a starting point. God I wish I could just hyperlink to that :-(
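
To make the fixity idea a bit more concrete, here is a minimal Python sketch of the BagIt side only: reading fetch.txt entries (URL, length, filename), reading an md5 manifest, and checking retrieved bytes against the expected checksum. It is not LC's parallelretriever.py, and it does not parse resource maps or download anything; the function names are made up for illustration.

```python
import hashlib


def parse_fetch(text):
    """Parse BagIt fetch.txt content into (url, length, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is: URL LENGTH FILENAME (length is "-" when unknown).
        url, length, path = line.split(None, 2)
        entries.append((url, None if length == "-" else int(length), path))
    return entries


def parse_manifest(text):
    """Parse manifest-md5.txt style content into {path: checksum}."""
    checksums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is: CHECKSUM FILENAME
        checksum, path = line.split(None, 1)
        checksums[path] = checksum.lower()
    return checksums


def check_fixity(data, expected_md5):
    """Return True when the MD5 of the retrieved bytes matches the manifest."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()
```

A harvester would fetch each URL from parse_fetch, look up the matching path in the manifest, and call check_fixity on the bytes it got back. A resource map version would need the same checksum to come from somewhere in the Atom serialization, which is the open question above.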

At any rate, I now need to investigate OAuth because Jim thinks it fits really nicely with AtomPub and SWORD in particular. And if it’s good enough for Google it’s probably worth checking out. Jim also said that there is a possibility that SWORD 2.0 might take shape as an IETF RFC, which would be good to see.

Thanks to all who made it happen, and to all of you who traveled long distances to join us at the Library of Congress.

premis v2.0 and schema munging

Monday, July 21st, 2008

In an effort to get a better understanding of PREMIS after reading about the v2.0 release, I dug around for 5 minutes looking for a way to convert an XML Schema to RelaxNG. The theory being that the compact syntax of RelaxNG would be easier to read than the XSD.

I ended up with a little hack suggested here to chain together the rngconv from the Multi-Schema Validator and James Clark’s Trang, which oddly can’t read an XSD as input.

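#!/bin/bash

# Convert each XSD given on the command line to RelaxNG compact syntax.
for i in "$@"
do
  BN=$(basename "$i" .xsd)
  # XSD -> RelaxNG XML syntax (rngconv from the Multi-Schema Validator)
  java -jar /opt/rngconv/rngconv.jar "$i" > "/tmp/${BN}.rng"
  # RelaxNG XML syntax -> RelaxNG compact syntax (Trang)
  java -jar /opt/trang/trang.jar -I rng -O rnc "/tmp/${BN}.rng" "${BN}.rnc"
done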

The resulting RelaxNG can be seen here. As you can see I’m not sure it helps much…but it’s a start I guess. I’m interested in looking at what it might take to sublimate a PREMIS RDF vocabulary (hopefully just RDFS?) out of the XSD, mainly because I *think* parts of the vocabulary could prove useful in OAI-ORE resource maps.

identi.ca and linked data

Friday, July 11th, 2008

If you’ve already caught the micro-blogging bug, identi.ca is an interesting twitter clone for a variety of reasons…not the least of which is that it’s an open source project, and has been designed to run in a decentralized way. The thing I was pleasantly surprised to see was FOAF exports like this for user networks, and HTTP URIs for foaf:Person resources:

ed@hammer:~$ curl -I http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 12:58:56 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html

It looks like there’s a slight bug in the way the HTTP status is being returned (the status line says 302 Found, while a separate Status header carries the intended 303 See Other), but clearly the intent was to do the right thing by httpRange-14. If I have time I’ll get laconi.ca running locally so I can confirm the bug, and attempt a fix.
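
For what it’s worth, the discrepancy is easy to spot mechanically. Here is a rough Python sketch that takes raw response headers like the ones above and reports when the status line and a CGI-style Status header disagree. The function name is made up, and it only looks at header text; it says nothing about where in laconi.ca the problem actually lives.

```python
import re


def status_mismatch(raw_headers):
    """Compare the HTTP status line with a CGI-style Status header.

    Returns (status_line_code, status_header_code) when they differ,
    or None when they agree or no Status header is present.
    """
    lines = raw_headers.strip().splitlines()
    match = re.match(r"HTTP/\d\.\d\s+(\d{3})", lines[0])
    if not match:
        raise ValueError("not an HTTP response: %r" % lines[0])
    line_code = int(match.group(1))

    for line in lines[1:]:
        # Split "Name: value" on the first colon only.
        name, _, value = line.partition(":")
        if name.strip().lower() == "status":
            header_code = int(value.split()[0])
            if header_code != line_code:
                return (line_code, header_code)
    return None
```

Fed the response above, this returns (302, 303): the server answered with 302 even though the application asked for 303.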

It’s also cool to see that Evan Prodromou (the lead developer, and creator of identi.ca and laconi.ca) has opened a couple tickets for adding RDFa to various pages. If I have the time this would be a fun hack as well. I’d also like to take a stab at doing conneg at foaf:Person URIs to enable this sorta thing:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 303 See Other
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Location: http://identi.ca/edsu/foaf

instead of what happens currently:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html
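
Conneg is normally driven by the request’s Accept header, so the server-side decision boils down to ranking the media types the client asked for and picking a redirect target. A minimal Python sketch of that decision follows; laconi.ca itself is PHP, so this is only an illustration of the logic, the function names are hypothetical, and the two target URLs are the ones shown above.

```python
def parse_accept(header):
    """Return media types from an Accept header, best q-value first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            ranked.append((-q, position, media_type))
    # Sort by descending q, keeping the header order for ties.
    return [media_type for _, _, media_type in sorted(ranked)]


def redirect_target(accept_header, nickname, base="http://identi.ca"):
    """Pick the 303 Location for a foaf:Person URI based on Accept."""
    for media_type in parse_accept(accept_header or ""):
        if media_type == "application/rdf+xml":
            return "%s/%s/foaf" % (base, nickname)
        if media_type in ("text/html", "application/xhtml+xml", "*/*"):
            return "%s/%s" % (base, nickname)
    # No recognised type: fall back to the HTML profile page.
    return "%s/%s" % (base, nickname)
```

So redirect_target("application/rdf+xml", "edsu") gives http://identi.ca/edsu/foaf, while a browser-style Accept header still lands on the HTML profile page.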

I guess this is also just a complicated way of saying I’m edsu on identi.ca–and that the opportunity to learn more about OAuth and XMPP is a compelling enough reason alone for me to make the switch.

lcsh.info SPARQL endpoint

Monday, July 7th, 2008

I’ve set up a SPARQL endpoint for lcsh.info at sparql.lcsh.info. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in a variety of formats. If you are a regular expression and/or SQL junkie, and like data, then SPARQL is definitely worth taking a look at.

If you are new to SPARQL and/or LCSH as SKOS you can try the default query and you’ll get back the first 10 triples in the triple store:

SELECT ?s ?p ?o
WHERE {?s ?p ?o}
LIMIT 10

As a first tweak try increasing the limit to 100. If you are feeling more adventurous perhaps you’d like to look up all the triples for a concept like Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  ?s skos:prefLabel "Buddhism"@en .
}

Or, perhaps you are interested in seeing what narrower terms there are for Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label
WHERE {
  <http://lcsh.info/sh85017454#concept> skos:narrower ?uri .
  ?uri skos:prefLabel ?label
}

Or maybe you don’t know the skos:prefLabel (aka authorized heading), so look for all the LCSH headings that start with Independence:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?label
WHERE {
  ?s skos:prefLabel ?label.
  FILTER regex(?label, '^independence', 'i')
}

Feel free to use the service however you want. I’m interested in seeing what its limitations are.
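
If you’d rather poke at the service from a script than from the web form, something like the following Python sketch should work with any endpoint that speaks the SPARQL protocol: the query goes in a query parameter on a GET request, and the Accept header asks for the XML results format. The endpoint URL is my assumption based on the hostname above, and the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://sparql.lcsh.info/"  # assumed URL for the endpoint

NARROWER_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label
WHERE {
  <http://lcsh.info/sh85017454#concept> skos:narrower ?uri .
  ?uri skos:prefLabel ?label
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL protocol GET request asking for XML results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the raw response body as bytes."""
    with urlopen(build_request(query, endpoint), timeout=60) as response:
        return response.read()
```

Calling run_query(NARROWER_QUERY) would return the raw results document for the narrower terms of Buddhism, assuming the endpoint is reachable at that address.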

Benjamin Nowack’s ARC made it extremely easy to load up the 2,441,494 LCSH triples in a few hours with a script like:

include_once('arc/ARC2.php');
 
$config = array(
    'db_name'               => 'arc',
    'db_user'               => 'arc',
    'db_pwd'                => 'notapassword',
    'store_name'            => 'lcsh',
    'store_log_inserts'     => 1,
);
 
$store = ARC2::getStore($config);
 
if (!$store->isSetup()) {
    $store->setUp();
}
 
$store->reset();
$rs = $store->query('LOAD <http://lcsh.info/static/lcsh.nt>');
 
print_r($rs);

Then it’s just a simple matter of putting up a php script like:

/* ARC2 static class inclusion */
include_once('arc/ARC2.php');
 
/* MySQL and endpoint configuration */
$config = array(
  /* db */
  'db_host' => 'localhost', /* optional, default is localhost */
  'db_name' => 'arc',
  'db_user' => 'arc',
  'db_pwd' => 'fakepassword',
 
  /* store name */
  'store_name' => 'lcsh',
 
 
  /* endpoint */
  'endpoint_features' => array(
    'select', 'construct', 'ask', 'describe'
  ),
  'endpoint_timeout' => 60, /* not implemented in ARC2 preview */
  'endpoint_read_key' => '', /* optional */
  'endpoint_write_key' => 'fakekey', /* optional */
  'endpoint_max_limit' => 1000, /* optional */
);
 
/* instantiation */
$ep = ARC2::getStoreEndpoint($config);
 
/* request handling */
$ep->go();

Ideally I would’ve been able to quickly bring up a SPARQL endpoint on top of the rdflib Sleepycat triple store that is being used to serve up the linked data at lcsh.info. But rather than pursuing elegance (this is kinda side work after all), I wanted to quickly put the SPARQL service out there for experimentation, and this was the quickest way for me to do that. If the service proves useful I’ll look more at what it takes to create an rdflib SPARQL service, or porting over the little python code I have to php (gasp).