and linked data

If you’ve already caught the micro-blogging bug, here’s an interesting Twitter clone for a variety of reasons…not the least of which is that it’s an open source project, and has been designed to run in a decentralized way. The thing I was pleasantly surprised to see was FOAF exports for user networks, and HTTP URIs for foaf:Person resources:

ed@hammer:~$ curl -I
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 12:58:56 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Content-Type: text/html

It looks like there’s a slight bug in the way the HTTP status is being returned (the status line says 302 Found while the Status header says 303 See Other), but clearly the intent was to do the right thing by httpRange-14. If I have time I’ll get it running locally so I can confirm the bug, and attempt a fix.

It’s also cool to see that Evan Prodromou (the lead developer) has opened a couple of tickets for adding RDFa to various pages. If I have the time this would be a fun hack as well. I’d also like to take a stab at doing conneg on foaf:Person URIs to enable this sorta thing:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml"
HTTP/1.1 303 See Other
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1

instead of what happens currently:

ed@hammer:~$ curl -I --header "Accept: application/rdf+xml"
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Content-Type: text/html
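
The decision the server needs to make here is pretty small: look at the Accept header on a foaf:Person URI and 303 to the right document. Here’s a minimal sketch of that conneg logic in Python (the URLs and function are hypothetical placeholders, not the actual implementation):

```python
def redirect_for(accept_header):
    """Pick the 303 target for a foaf:Person URI based on the Accept header.
    The URLs below are hypothetical placeholders."""
    if "application/rdf+xml" in accept_header:
        # an RDF-aware client gets sent to the FOAF document
        location = "http://example.org/user/1/foaf"
    else:
        # everyone else gets the human-readable profile page
        location = "http://example.org/user/1"
    return "303 See Other", {"Location": location}

status, headers = redirect_for("application/rdf+xml")
print(status, headers["Location"])  # 303 See Other http://example.org/user/1/foaf
```

Either way the person URI itself 303s to a document, which is what httpRange-14 recommends for URIs that name non-information resources like people.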

I guess this is also just a complicated way of saying I’m edsu there–and that the opportunity to learn more about OAuth and XMPP is a compelling enough reason alone for me to make the switch.

SPARQL endpoint

disclaimer: this was a prototype, and is no longer available; see the service from the Library of Congress

I’ve set up a SPARQL endpoint for it. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in a variety of formats. If you are a regular expression and/or SQL junkie, and like data, then SPARQL is definitely worth taking a look at.
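
Under the hood a query is just an HTTP GET with the query text passed as a parameter. A quick sketch (the endpoint URL is a placeholder, and the name of the results-format parameter varies between servers):

```python
from urllib.parse import urlencode

def sparql_request_url(endpoint, query, output="json"):
    """Build the GET URL for a SPARQL protocol query.
    The 'output' parameter name is server-specific."""
    return endpoint + "?" + urlencode({"query": query, "output": output})

url = sparql_request_url("http://example.org/sparql",
                         "SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 10")
print(url)
# fetching the results is then just: urllib.request.urlopen(url).read()
```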

If you are new to SPARQL and/or LCSH as SKOS you can try the default query and you’ll get back the first 10 triples in the triple store:

SELECT ?s ?p ?o
WHERE {?s ?p ?o}
LIMIT 10

As a first tweak try increasing the limit to 100. If you are feeling more adventurous perhaps you’d like to look up all the triples for a concept like Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?p ?o
WHERE {
  ?s skos:prefLabel "Buddhism"@en .
  ?s ?p ?o .
}

Content-MD5 considered helpful

There’s kind of an interesting thread going on in the Amazon Web Services forum about data corruption on S3. It highlights how important it is for clients to send something like the Content-MD5 HTTP header to checksum the HTTP payload, and for the server to check it before saying 200 OK back…at least for data storage REST applications:

When Amazon S3 receives a PUT request with the Content-MD5 header, Amazon S3 computes the MD5 of the object received and returns a 400 error if it doesn’t match the MD5 sent in the header. Looking at our service logs from the period between 6/20 11:54pm PDT and 6/22 5:12am PDT, we do see a modest increase in the number of 400 errors. This may indicate that there were elevated network transmission errors somewhere between the customer and Amazon S3.

Some customers are claiming that the md5 checksums coming back from s3 are different than the ones for the content that was originally sent there. Perhaps the clients ignored the 400? Or maybe there is data corruption elsewhere. It’ll be interesting to follow the thread.
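
Per RFC 1864 the Content-MD5 value is just the base64-encoded MD5 digest of the body, so both sides of the check are a couple of lines each. A sketch (not S3’s actual code):

```python
import base64
import hashlib

def content_md5(payload: bytes) -> str:
    """Base64-encoded MD5 digest of the payload, per RFC 1864."""
    return base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")

def put_check(payload: bytes, header: str) -> int:
    """Server side of the bargain: 400 if the digests disagree, else 200."""
    return 200 if content_md5(payload) == header else 400

body = b"hello"
header = content_md5(body)
print(header)                           # XUFAKrxLKna5cZ2REBfFkg==
print(put_check(body, header))          # 200
print(put_check(b"c0rrupted", header))  # 400
```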

provide and enable

I got a chance to meet Jennifer Rigby of the National Archives UK at the LinkedDataPlanet Conference in New York City (thanks Ian). Jennifer is the Head of IT Strategy, and told me lots of interesting stuff related to a profound shift they’ve had in their online strategies to:

Provide and Enable

So rather than pouring all their energy into making applications to visualize archival resources, the National Archives have recognized that making machine readable resources available to the public (in formats like RDF and RDFa) is really important to their core mission. In addition to providing services and data, they are trying to enable an ecosystem of innovation around their assets–or in their words:

• We will allow others to harness the power of our information, leading to a far wider range of products and services than we could provide ourselves.
• We will continue to work with commercial partners to provide online access to millions of records.

Jennifer said we can look forward to an announcement around OpenTech2008 (July 5th) about a set of important publications that are going to be made available by the Archives as RDF and RDFa. In addition I heard about how they work with website data harvested by the Internet Archive to create a resolver service for transient publications on the web.

Hearing how a big organization like the National Archives can come to this realization of “Provide and Enable”, and then start to execute on it, was really encouraging–and inspiring. It is also refreshing to see people recognize, in writing, the importance of semantic web technologies:

We have started exploring new ideas and technologies, including using RDFa for publishing the Gazettes. The way we now publish legislation has a key role to play in the further development of the semantic web.


One little bit of goodness that has percolated out from my group at $work in collaboration with the California Digital Library is the BagIt spec (more readable version). BagIt is an IETF RFC for bundling up files for transfer over the network, or for shipping on physical media. Just yesterday a little article about BagIt surfaced on the LC digital preservation website, so I figure now is a good time to mention it.

The goodness of BagIt is in its simplicity and utility. A Bag is essentially: a set of files in a particular directory named data, a manifest file which states what files ought to be in the data directory, and a bagit.txt file that states the version of BagIt. For example here’s a sample (abbreviated) directory structure for a bag of digitized newspapers via the National Digital Newspaper Program:

|-- bagit.txt
|-- data
|   `-- batch_lc_20070821_jamaica
|       |-- batch.xml
|       |-- batch_1.xml
|       `-- sn83030214
|           |-- 00175041217
|           |   |-- 00175041217.xml
|           |   |-- 1905010401
|           |   |   |-- 1905010401.xml
|           |   |   `-- 1905010401_1.xml
|           |   |-- 1905010601
|           |   |   |-- 1905010601.xml
|           |   |   `-- 1905010601_1.xml

Each line of the manifest is just a fixity value and the relative file path:

ea9dee53c2c2dd4027984a2b59f58d1f  data/batch_lc_20070821_jamaica/batch.xml
72134329a82f32dd44d59b509928b6cd  data/batch_lc_20070821_jamaica/batch_1.xml
dc5740d295521fcc692bb58603ce8d1a  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010601/1905010601_1.xml
e16e74988ca927afc10ee2544728bd14  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010601/1905010601.xml
fd480b2c4bcb6537c3bc4c9e7c8d7c21  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010401/1905010401.xml
e0e4a981ddefb574fa1df98a8a55b7a4  data/batch_lc_20070821_jamaica/sn83030214/00175041217/1905010401/1905010401_1.xml
c8dffa3cdb7c13383151e0cd8263d082  data/batch_lc_20070821_jamaica/sn83030214/00175041217/00175041217.xml

The manifest format happens to be the same format understood and generated by the common unix (and windows) utility md5deep. So it’s pretty easy to generate and validate the manifests.
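
Since each manifest line is just a checksum and a path, validating a bag takes only a few lines of python. A sketch (the manifest filename manifest-md5.txt follows the spec; the toy one-file bag built at the bottom is made up to exercise the function):

```python
import hashlib
import os
import tempfile

def validate_bag(bag_dir):
    """Return the list of files whose MD5 doesn't match the manifest."""
    errors = []
    with open(os.path.join(bag_dir, "manifest-md5.txt")) as manifest:
        for line in manifest:
            expected, path = line.split(None, 1)
            path = path.strip()
            with open(os.path.join(bag_dir, path), "rb") as f:
                actual = hashlib.md5(f.read()).hexdigest()
            if actual != expected:
                errors.append(path)
    return errors

# build a toy one-file bag in a temp directory
bag = tempfile.mkdtemp()
os.makedirs(os.path.join(bag, "data"))
with open(os.path.join(bag, "data", "batch.xml"), "wb") as f:
    f.write(b"<batch/>")
with open(os.path.join(bag, "manifest-md5.txt"), "w") as f:
    f.write(hashlib.md5(b"<batch/>").hexdigest() + "  data/batch.xml\n")

print(validate_bag(bag))  # []
```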

The context for this work has largely been NDIIPP partners (like CDL) transferring data generated by funded projects back to LC, although it’s likely to get used in some other places internally as well. It’s funny to see the spec in its current state, after Justin Littman rattled off the LC Manifest wiki page in a few minutes after a meeting where Andy Boyko initially brought up the issue. Andy has just left LC to work for a record company in Cupertino. I don’t think I fully understood simplicity in software development until I worked with Andy. He has a real talent for boiling down solutions to their simplest expression, often leveraging existing tools to the point where very little software actually needs to be written. I think Andy and John found a natural affinity for striving for simplicity, and it shows in BagIt. Andy will be sorely missed, but that record store is lucky to get him on their team.

There are some additional cool features to BagIt, including the ability to include a fetch.txt file which contains http and/or rsync URIs to fill in parts of the bag from the network. We’ve come to refer to bags with a fetch.txt as “holey bags” because they have holes in them that need to be filled in. This allows very large bags to be assembled quickly in parallel (using a 100 line python script Andy Boyko wrote, or whatever variant of wget, curl, rsync makes you happy). Also you can include a package-info.txt which includes some basic metadata as key/value pairs … designed primarily for humans.
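
Filling in a holey bag mostly reduces to parsing fetch.txt lines and retrieving each one. A sketch, assuming the three-column (URL, length, path) line format, with the fetching done serially rather than in parallel as the real script did:

```python
import os
import urllib.request

def parse_fetch_line(line):
    """Split a fetch.txt line into (url, length, path); length may be '-' for unknown."""
    url, length, path = line.split(None, 2)
    return url, None if length == "-" else int(length), path.strip()

def fill_holes(bag_dir):
    """Fetch every entry listed in fetch.txt into its place in the bag."""
    with open(os.path.join(bag_dir, "fetch.txt")) as fetch:
        for line in fetch:
            url, _length, path = parse_fetch_line(line)
            dest = os.path.join(bag_dir, path)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            urllib.request.urlretrieve(url, dest)

print(parse_fetch_line("http://example.org/batch.xml 1024 data/batch.xml"))
# ('http://example.org/batch.xml', 1024, 'data/batch.xml')
```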

Dan Krech and I are in the process of creating a prototype deposit web application that will essentially allow bags to be submitted via a SWORD (profile of AtomPub for repositories) service. The SWORD part should be pretty easy, but getting the retrieval of “holey bags” kicked off and monitored properly will be the more challenging part. Hopefully I’ll be able to report more here as things develop.

Feedback on the BagIt RFC is most welcome.

SKOS displays w/ SPARQL

I’m just in the process of getting my head around SPARQL a bit more. At $work, Clay and I ran up against a situation where we wanted a query that would return a subgraph from an entire SKOS concept scheme for any assertions involving a particular concept URI as the subject. Easy enough right?


The thing is, for human readable displays we don’t want to display the URIs for related concepts (skos:broader, skos:narrower or skos:related) … we want to display the nice skos:prefLabel for them. Something akin to:

So how can we get a subgraph for a concept as well as any concept that might be directly related to it, in a single query? We came up with the following but I’d be interested in more elegant solutions:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {<> ?p1 ?o1. ?s2 ?p2 ?o2}
WHERE {
    {<> ?p1 ?o1.}
    UNION
    {<> skos:narrower ?s2.
     ?s2 ?p2 ?o2.}
    UNION
    {<> skos:broader ?s2.
     ?s2 ?p2 ?o2.}
    UNION
    {<> skos:related ?s2.
     ?s2 ?p2 ?o2.}
}

The above ran quite nicely in my Arc playground. Any suggestions or ideas on how to boil this down would be appreciated. I also wanted to jot down this query in the likely event that I forget how I did it.

justify my links

Thanks to a tip from Ian, I’m looking forward to (hopefully) attending the Linked Data Planet conference in New York City as a volunteer. The idea is that I just have to pay for my hotel, and the cost of admission is waived. It seems my travel money is a bit limited at the moment (sometimes it’s there, sometimes it isn’t), so I figured minimizing costs would be appreciated. But today I got a request to “justify” my attendance at the conference. It was actually kind of a good exercise to sit down and write why I think the conference and Linked Data in general is important to the Library of Congress.

One of the challenges of Digital Repository work is modeling the context for digital objects. The context for a digital object includes the set of relationships a particular digital object has with other objects in the repository. 30 years of relational database research and development have allowed us to do this modeling pretty effectively within the scope of a particular application.

Very often, particularly in institutions the size of the Library of Congress, the context for a digital object includes digital objects found elsewhere in the enterprise–in other applications, with their own databases. In addition some institutions (like LC) also need to make their digital resources available publicly for other organizations to reference. The challenge here is in making the objects found in silos or islands of application data (typically housed in databases) reference-able and resolvable, so that other applications inside and outside the enterprise can use them.

As a practical example, a picture of Dizzy Gillespie found in the American Memory collection

is related to the book:

To be, or not–to bop: memoirs / Dizzy Gillespie, with Al Fraser.

which we have described in our online catalog. The person Dizzy Gillespie is also represented in LC’s name authority file with the Library of Congress Control Number n50033872, and the Linked Authority File at OCLC. And perhaps this picture of Dizzy Gillespie in American Memory will find its way into the World Digital Library application that is currently being built. How can we practically and explicitly identify and then represent the relationships between these resources? Is it even possible?

The Linked Data Planet conference is a two-day workshop describing how to use traditional web technologies in conjunction with semantic web technologies (RDF, OWL, SPARQL, RDFa and GRDDL) to enable this sort of linking of resources inside particular applications, within the enterprise and around the world. My hope is that the conference will provide guidance on simple things LC can do with web technologies that have been in use for 20 years, to model the relationships between digital resources at the Library of Congress.

Hopefully that will convince them :-)

Apologies to Madonna for the blog post title…

baby steps at linking library data

Alistair wanted to have some data to demonstrate the potential of linked library data, so I quickly converted 10K MARC records (using a slightly modified version of MARC21slim2RDFDC.xsl) and rewrote the subjects as URIs using a few lines of python…all a bit hackish, but it got this particular job done quickly.

The rewriting of subjects is basically a transformation of:

  dc:creator "Rollo, David.";
  dc:date "c2000." ;
  dc:description "Includes bibliographical references (p. 173-223) and index." ;
  dc:identifier "URN:ISBN:0816635463 (alk. paper)", 
     "URN:ISBN:0816635471 (pbk. : alk. paper)", 
     "" ;
  dc:language "eng" ;
  dc:publisher "Minneapolis : University of Minnesota Press," ;
  dc:subject "Anglo-Norman literature", 
    "Benoi?t, de Sainte-More, 12th cent.", 
    "Latin prose literature, Medieval and modern", 
    "Literature and history", 
    "Magic in literature." ;
  dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
  dc:type "text" .

into:

    dc:creator "Rollo, David." ;
    dc:date "c2000." ;
    dc:description "Includes bibliographical references (p. 173-223) and
index." ;
    dc:identifier "URN:ISBN:0816635463 (alk. paper)", "URN:ISBN:0816635471 (pbk. : alk. paper)", "" ;
    dc:language "eng" ;
    dc:publisher "Minneapolis : University of Minnesota Press," ;
    dc:subject <>,
      "Benoi?t, de Sainte-More, 12th cent." ;
    dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages
/" ;
    dc:type "text" .

Clearly there are lots of ways to improve even this simplified description: URIs for entries in the Name Authority File, referencing identifiers as resources rather than string literals (an artifact of the XSLT transform), removing ISBD punctuation, unicode normalization (&cough;), etc.
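
For what it’s worth, the “few lines of python” part is little more than string munging over the subject literals. A hedged sketch (the slug rules and base URI below are made-up placeholders, not the actual URI scheme):

```python
def subject_uri(heading, base="http://example.org/subjects/"):
    """Map a heading literal like 'Magic in literature.' to a URI.
    The slug scheme here is hypothetical."""
    slug = heading.rstrip(".").lower().replace(" ", "_")
    return base + slug

print(subject_uri("Magic in literature."))
# http://example.org/subjects/magic_in_literature
```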

You may notice I kind of fudged the URI for the book itself using the LCCN service at LC: (which does resolve, but doesn’t serve up RDF yet). I’m no FRBR expert so I’m not sure if the use of “manifestation” in this hash URI makes sense. I just wanted to distinguish between the URI for the description, and the URI for the thing being described. I think it’s high time for me to understand FRBR a lot more.

If you prefer diagrams to turtle here is a graph visualization from the w3c rdf validator for the record.

SKOS in the Context of Semantic Web Deployment

If you happen to be in the DC area on May 8th and are interested in linked data and the practical application of semantic web technologies like RDF and OWL please join us at the Library of Congress for a presentation by Alistair Miles, key developer of SKOS, and semantic web practitioner at the University of Oxford.

Below is the announcement, I hope you can make it. Oh, and if you are really interested in this stuff we’re having some brown bag sessions later in the afternoon that you are welcome to attend, just email me at ehs [at] pobox [dot] com.

The Simple Knowledge Organization System (SKOS), in the Context of Semantic Web Deployment, Alistair Miles, University of Oxford. May 8th, 10am–11:30am, 2008, Montpelier Room, Madison Building, Library of Congress (map).

Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?

This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.

Alistair’s background is in the development of Web technologies for scientific applications. He was a research associate in the e-Science department of the Rutherford Appleton Laboratory, where he was introduced to Semantic Web technologies and first developed SKOS. He has recently moved to the University of Oxford to work on linking fruit fly genomics research data, and he hopes everything he knows about the Semantic Web will turn out to be useful after all.