Posts Tagged ‘rdf’

provide and enable

Wednesday, June 18th, 2008

I got a chance to meet Jennifer Rigby of the National Archives UK at the LinkedDataPlanet Conference in New York City (thanks Ian). Jennifer is the Head of IT Strategy, and told me lots of interesting stuff related to a profound shift they’ve had in their online strategies to:

Provide and Enable

So rather than pouring all their energy into making applications to visualize archival resources, the National Archives have recognized that making machine readable resources available to the public (in formats like RDF and RDFa) is really important to their core mission. In addition to providing services and data, they are trying to enable an ecosystem of innovation around their assets–or in their words:

• We will allow others to harness the power of our information, leading to a far wider range of products and services than we could provide ourselves.
• We will continue to work with commercial partners to provide online access to millions of records.

Jennifer said we can look forward to an announcement around OpenTech2008 (July 5th) about a set of important publications that are going to be made available by the Archives as RDF and RDFa. In addition, I heard about how they work with website data harvested by the Internet Archive to create a resolver service for transient publications on the web.

Hearing how a big organization like the National Archives can come to this realization of “Provide and Enable”, and then start to execute on it, was really encouraging–and inspiring. It is also refreshing to see people recognize, in writing, the importance of semantic web technologies:

We have started exploring new ideas and technologies, including using RDFa for publishing the Gazettes. The way we now publish legislation has a key role to play in the further development of the semantic web.

SKOS displays w/ SPARQL

Wednesday, June 4th, 2008

I’m just in the process of getting my head around SPARQL a bit more. At $work, Clay and I ran up against a situation where we wanted a query that would return a subgraph from an entire SKOS concept scheme for any assertions involving a particular concept URI as the subject. Easy enough right?

  DESCRIBE <http://lcsh.info/sh96010624#concept>

The thing is, for human readable displays we don’t want to display the URIs for related concepts (skos:broader, skos:narrower or skos:related) … we want to display the nice skos:prefLabel for them. Something akin to:

So how can we get a subgraph for a concept as well as any concept that might be directly related to it, in a single query? We came up with the following but I’d be interested in more elegant solutions:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {<http://lcsh.info/sh96010624#concept> ?p1 ?o1. ?s2 ?p2 ?o2}
WHERE
{
    {<http://lcsh.info/sh96010624#concept> ?p1 ?o1.}
    UNION
    {
        {<http://lcsh.info/sh96010624#concept> skos:narrower ?s2.}
        {?s2 ?p2 ?o2.}
    }
    UNION
    {
        {<http://lcsh.info/sh96010624#concept> skos:broader ?s2.}
        {?s2 ?p2 ?o2.}
    }
    UNION
    {
        {<http://lcsh.info/sh96010624#concept> skos:related ?s2.}
        {?s2 ?p2 ?o2.}
    }
}

The above ran quite nicely in my Arc playground. Any suggestions or ideas on how to boil this down would be appreciated. I also wanted to jot this query in the likely event that I forget how I did it.
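
For readers who find Python easier to follow than SPARQL, the logic of the CONSTRUCT query can be sketched over a plain list of triples. This is only an illustration of what the query selects, not how Arc (or any SPARQL engine) evaluates it. The sample triples and the example.org URIs are hypothetical; only the concept URI comes from the query above.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
LINKS = {SKOS + "narrower", SKOS + "broader", SKOS + "related"}
CONCEPT = "http://lcsh.info/sh96010624#concept"

# Hypothetical sample data: only CONCEPT is taken from the post.
triples = [
    (CONCEPT, SKOS + "prefLabel", "Example concept"),
    (CONCEPT, SKOS + "narrower", "http://example.org/narrower-1"),
    ("http://example.org/narrower-1", SKOS + "prefLabel", "Narrower example"),
    ("http://example.org/unrelated", SKOS + "prefLabel", "Unrelated example"),
]


def concept_subgraph(triples, concept):
    """Return triples about `concept` plus triples about its direct neighbours."""
    # First UNION branch: every assertion with the concept as subject.
    about_concept = {t for t in triples if t[0] == concept}
    # Concepts reachable through skos:narrower, skos:broader or skos:related.
    neighbours = {o for (_, p, o) in about_concept if p in LINKS}
    # Remaining UNION branches: every assertion about those neighbours.
    about_neighbours = {t for t in triples if t[0] in neighbours}
    return about_concept | about_neighbours


subgraph = concept_subgraph(triples, CONCEPT)
for triple in sorted(subgraph):
    print(triple)
```

The neighbour's skos:prefLabel ends up in the result, which is what a human readable display needs, while the unrelated concept is left out.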

baby steps at linking library data

Monday, May 5th, 2008

Alistair wanted to have some data to demonstrate the potential of linked library data, so I quickly converted 10K MARC records (using a slightly modified version of MARC21slim2RDFDC.xsl) and rewrote the subjects as lcsh.info URIs using a few lines of Python…all a bit hackish, but it got this particular job done quickly.

The rewriting of subjects is basically a transformation of:

<http://lccn.loc.gov/00009010#manifestation>
  dc:creator "Rollo, David.";
  dc:date "c2000." ;
  dc:description "Includes bibliographical references (p. 173-223) and index." ;
  dc:identifier
     "URN:ISBN:0816635463 (alk. paper)",
     "URN:ISBN:0816635471 (pbk. : alk. paper)",
     "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
  dc:language "eng" ;
  dc:publisher "Minneapolis : University of Minnesota Press," ;
  dc:subject
    "Anglo-Norman literature",
    "Benoi?t, de Sainte-More, 12th cent.",
    "Latin prose literature, Medieval and modern",
    "Literacy",
    "Literature and history",
    "Magic in literature." ;
  dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
  dc:type "text" .

to:

<http://lccn.loc.gov/00009010#manifestation>
    dc:creator "Rollo, David." ;
    dc:date "c2000." ;
    dc:description "Includes bibliographical references (p. 173-223) and index." ;
    dc:identifier "URN:ISBN:0816635463 (alk. paper)", "URN:ISBN:0816635471 (pbk. : alk. paper)", "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
    dc:language "eng" ;
    dc:publisher "Minneapolis : University of Minnesota Press," ;
    dc:subject <http://lcsh.info/sh85005082#concept>,
      <http://lcsh.info/sh85077482#concept>,
      <http://lcsh.info/sh85077565#concept>,
      <http://lcsh.info/sh85079624#concept>,
      <http://lcsh.info/sh86008161#concept>,
      "Benoi?t, de Sainte-More, 12th cent." ;
    dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
    dc:type "text" .
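
The subject rewriting step might look roughly like the sketch below. It is not the original script, which isn't shown here. The lookup table and its example.org URIs are placeholders (a real table would be built from the LCSH data itself), and `rewrite_subject` is a hypothetical helper name.

```python
# Hypothetical lookup table from heading text to concept URI.
# The URIs are placeholders, not real lcsh.info identifiers.
HEADING_TO_URI = {
    "Literacy": "http://example.org/concepts/literacy#concept",
    "Magic in literature": "http://example.org/concepts/magic-in-literature#concept",
}


def rewrite_subject(heading, lookup=HEADING_TO_URI):
    """Return ("uri", value) when the heading is known, else ("literal", heading)."""
    # Headings from the XSLT output can carry a trailing period.
    key = heading.strip().rstrip(".")
    uri = lookup.get(key)
    if uri is not None:
        return ("uri", uri)
    # Unknown headings (personal names, for instance) stay as string literals.
    return ("literal", heading)


subjects = ["Literacy", "Magic in literature.", "Benoi?t, de Sainte-More, 12th cent."]
for subject in subjects:
    print(rewrite_subject(subject))
```

Falling back to the literal is what leaves the personal name untouched in the rewritten description above.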

Clearly there are lots of ways to improve even this simplified description: URIs for entries in the Name Authority File, referencing identifiers as resources rather than string literals (an artifact of the XSLT transform), removing ISBD punctuation, unicode normalization (&cough;), etc.

You may notice I kind of fudged the URI for the book itself using the LCCN service at LC: http://lccn.loc.gov/00009010#manifestation (which does resolve, but doesn’t serve up RDF yet). I’m no FRBR expert so I’m not sure if the use of “manifestation” in this hash URI makes sense. I just wanted to distinguish between the URI for the description, and the URI for the thing being described. I think it’s high time for me to understand FRBR a lot more.

If you prefer diagrams to turtle here is a graph visualization from the w3c rdf validator for the record.

SKOS in the Context of Semantic Web Deployment

Wednesday, April 30th, 2008

If you happen to be in the DC area on May 8th and are interested in linked data and the practical application of semantic web technologies like RDF and OWL please join us at the Library of Congress for a presentation by Alistair Miles, key developer of SKOS, and semantic web practitioner at the University of Oxford.

Below is the announcement. I hope you can make it. Oh, and if you are really interested in this stuff we’re having some brown bag sessions later in the afternoon that you are welcome to attend, just email me at ehs [at] pobox [dot] com.

The Simple Knowledge Organization System (SKOS), in the Context of Semantic Web Deployment, Alistair Miles, University of Oxford. May 8th, 2008, 10am to 11:30am, Montpelier Room, Madison Building, Library of Congress.

Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?

This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.

Alistair’s background is in the development of Web technologies for scientific applications. He was a research associate in the e-Science department of the Rutherford Appleton Laboratory, where he was introduced to Semantic Web technologies and first developed SKOS. He has recently moved to the University of Oxford to work on linking fruit fly genomics research data, and he hopes everything he knows about the Semantic Web will turn out to be useful after all.

literals and resources

Wednesday, March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seem to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data
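
To make the literal versus resource distinction a bit more concrete, here is a small sketch using plain tuples as triples. The book URIs are hypothetical, and dc:creator is used only for illustration; it is an assumption, not a property taken from the RDA list. Both literals and URIs are ordinary Python strings here, which glosses over the typing a real RDF library would provide.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
TWAIN = "http://lccn.loc.gov/n79021164"

# Literal style: the author is just a string, so variant forms do not match.
literal_style = [
    ("http://example.org/book/1", DC_CREATOR, "Twain, Mark, 1835-1910"),
    ("http://example.org/book/2", DC_CREATOR, "Twain, Mark"),
]

# Resource style: both descriptions point at the same identifier.
resource_style = [
    ("http://example.org/book/1", DC_CREATOR, TWAIN),
    ("http://example.org/book/2", DC_CREATOR, TWAIN),
]


def books_by(triples, author):
    """Return the subjects whose dc:creator is exactly `author`."""
    return sorted(s for (s, p, o) in triples if p == DC_CREATOR and o == author)


literal_hits = books_by(literal_style, "Twain, Mark, 1835-1910")
resource_hits = books_by(resource_style, TWAIN)
print(literal_hits)
print(resource_hits)
```

With literals, finding every book by the same author depends on the strings matching exactly; with a shared resource the relationship is explicit.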

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PPS, no, that lccn url doesn’t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))

Cyganiak on linked data, microformats and the semweb

Friday, March 14th, 2008

In case you missed it, Danny Ayers has a fun interview with Richard Cyganiak, who is one of the prime movers behind the Linking Open Data Project of the Semantic Web Education and Outreach Group at the W3C, and an author of Cool URIs for the Semantic Web and How to Publish Linked Data on the Web. Among other things you’ll learn some details about Sindice (the semantic web search engine at DERI) which indexes (using Solr!) structured data like rdf/xml, microformats (I never noticed last.fm had microformat content) and (soon) rdfa from the world wild web. More details about Sindice can be found in an earlier podcast Paul Miller did with Eyal Oren (also at DERI).

Richard’s perspective on the past and future of the semantic web is particularly refreshing. Rather than hard selling SPARQL or even RDF his attitude seems to be to try what works now, while recognizing that the technologies that make the semantic web work may very well be different in a few years. Also there’s an interesting discussion of microformats and RDF, highlighting the strengths and weaknesses of both. Plus there is a fun side story to the LOD diagram that shows the links between various open data sets.

If you’ve ever wanted to hear more about linked-data from someone in the know now is your chance. Nice questions danja!

oai-ore post baltimore thoughts

Thursday, March 13th, 2008

The recent OAI-ORE meeting was just up the road in Baltimore, so it was easy for a bunch of us from the Library of Congress to attend. I work on a team at LC that is specifically looking at the role that repositories play at the library; I’ve implemented OAI-PMH data providers and harvesters, and in the past couple of years I’ve gotten increasingly interested in semantic web technologies — so OAI-ORE is of particular interest to me. I’ve commented a bit about OAI-ORE on here before, but I figure it can’t hurt to follow in my coworker’s footsteps and summarize my thoughts after the meeting.

(BTW, above is an image of some constellations I nabbed off of wikipedia. I included it here because the repeated analogy (during the meeting) of OAI-ORE resource maps as constellations was really compelling — and quite poetic.)

The Vocabulary

It seems to me that the real innovation of the OAI-ORE effort is that it provides a lightweight RDF vocabulary for talking about aggregated resources on the web. Unfortunately I think that this kernel gets a little bit lost in the 6 specification documents that were released en masse a few months ago.

The ORE vocabulary essentially consists of three new resource types: ore:ResourceMap, ore:Aggregation, ore:AggregatedResource; and 5 new properties to use with those types: ore:describes, ore:isDescribedBy, ore:aggregates, ore:isAggregatedBy, ore:analogousTo. In addition, the Vocabulary document provides guidance on how to use a few terms from the DublinCore vocabulary: dc:creator, dc:rights, dcterms:modified, dcterms:created.

The vocabulary is small, so if I were them I would publish the vocabulary elements using hash URIs, instead of slash URIs. The reason for this is that you don’t have to jigger the web server to do a httpRange-14 style 303 correctly:

  • http://www.openarchives.org/ore/0.2/terms#Aggregation
  • http://www.openarchives.org/ore/0.2/terms#AggregatedResource
  • http://www.openarchives.org/ore/0.2/terms#ResourceMap
  • http://www.openarchives.org/ore/0.2/terms#describes
  • http://www.openarchives.org/ore/0.2/terms#isDescribedBy
  • http://www.openarchives.org/ore/0.2/terms#aggregates
  • http://www.openarchives.org/ore/0.2/terms#isAggregatedBy
  • http://www.openarchives.org/ore/0.2/terms#analogousTo
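
The reason hash URIs need no special server setup is that a client drops the fragment before it makes the HTTP request, so every term resolves to the same document. A minimal sketch with Python's standard library, using a few of the hash URIs proposed above (they are the suggestion made in this post, not necessarily the published namespace):

```python
from urllib.parse import urldefrag

terms = [
    "http://www.openarchives.org/ore/0.2/terms#Aggregation",
    "http://www.openarchives.org/ore/0.2/terms#ResourceMap",
    "http://www.openarchives.org/ore/0.2/terms#aggregates",
]

# The part before the "#" is what actually gets requested over HTTP.
documents = {urldefrag(term).url for term in terms}
print(documents)

# The fragment is kept on the client side to pick the term out of the document.
fragments = [urldefrag(term).fragment for term in terms]
print(fragments)
```

All three terms map to one retrievable vocabulary document, so a plain 200 response is enough and no 303 redirect is needed.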

Also, I think ore:AggregatedResource is currently missing from the rdf/xml vocabulary, so it should be added, and ore:isDescribedBy seems to be commented out.

There is a lot of redundancy between the Abstract Data Model and the Vocabulary documents–so I would recommend collapsing them down into a single, succinct document. This is in keeping with the DRY principle and will have the added benefit of making it easier for newbies to hit the ground running (not having to wade through multiple docs and mentally reconcile them). I could understand having a separate Abstract Data Model document if it were totally divorced from the web and semantic web technologies like RDF, but it’s not.

The Graph

The OAI-ORE effort seemed to be mostly driven by a desire to take harvesting agents the last mile to the actual repository resources themselves–enabling digital library objects (in addition to their metadata) to be harvested from repositories (using HTTP) ; and to be referenced from other contexts (say objects in other repositories). This desire was born out of real, hard won experience with harvesting metadata records, and marked a shift from metadata-centric harvesting to resource-centric harvesting.

In addition OAI-ORE marks a departure from predictable and mind-numbing arguments about SIP formats (METS, DIDL, FOXML, IEEE LOM, XFDU, etc). Yet as soon as we have our shiny new OAI-ORE vocabulary we have to learn yet-another-packaging-format, this time one built on top of Atom.

First, let me just say I’m a big fan of RFC 4287, in particular how it is used in the RESTful Atom Publishing Protocol (RFC 5023). I also think it makes sense to have an Atom serialization for OAI-ORE resource maps — assuming there is a GRDDL transform for turning it into RDF. But the workshop in Baltimore seemed to stress that the Atom serialization was the only way to do OAI-ORE, and didn’t emphasize that there are in fact lots of ways of representing RDF graphs on the web. For example GRDDL allows you to associate arbitrary XML with an XSLT transform to extract a RDF graph. And you could encode your RDF graph directly with RDFa, N3, Turtle, ntriples, or RDF/XML.

Perhaps there is a feeling that stressing the RDF graph too much will alienate some people who are more familiar with XML technologies. Or perhaps all these graph serialization choices could be perceived as being too overwhelming. But I think the opposite extreme of making it look like you can only use an overloaded Atom document as a means to publishing ORE resource maps is misguided, and will ultimately slow adoption. Why not encourage people to publish GRDDL transforms for METS, DIDL or mark up their “splash pages” with RDFa? This would bring the true value of the OAI-ORE work home–it’s not about yet-another-packaging format, it’s about what the various packaging formats have in common on the web.
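
One way to picture the "extract a graph from whatever XML you already have" idea is the sketch below. GRDDL itself uses XSLT; this is a Python stand-in for the same kind of transform. The mapping (feed id as the aggregation, each entry link as an aggregated resource) is an assumption made for illustration and is not the ORE Atom serialization. The Atom document is trimmed and hypothetical, and the ore:aggregates URI is the hash form suggested earlier in this post.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ORE_AGGREGATES = "http://www.openarchives.org/ore/0.2/terms#aggregates"

# Hypothetical, trimmed-down Atom document.
atom_doc = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/aggregation/1</id>
  <title>Example aggregation</title>
  <entry>
    <id>http://example.org/entry/1</id>
    <link href="http://example.org/paper.pdf"/>
  </entry>
  <entry>
    <id>http://example.org/entry/2</id>
    <link href="http://example.org/data.csv"/>
  </entry>
</feed>"""


def atom_to_triples(xml_text):
    """Turn a feed into (aggregation, ore:aggregates, resource) triples."""
    feed = ET.fromstring(xml_text)
    aggregation = feed.findtext(ATOM + "id")
    triples = []
    for entry in feed.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        if link is not None and link.get("href"):
            triples.append((aggregation, ORE_AGGREGATES, link.get("href")))
    return triples


ore_triples = atom_to_triples(atom_doc)
for triple in ore_triples:
    print(triple)
```

The same triples could come out of a METS or DIDL document through a different transform, which is the sense in which the graph, not the packaging format, is the common ground.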

Release Early, Release Often

In hindsight I think it would’ve been helpful for the OAI-ORE group to privately build consensus about the core OAI-ORE vocabulary (if necessary), then release that into the world wild web for discussion. Then, once the kinks were worked out and there was general understanding, the group could have moved on to issues such as discovery and serialization. As it stands the various documents were all dumped at the same time, and seem somewhat fragmented, and in places redundant. Clearly a lot of conversations have gone on that aren’t happening on the public discussion list.

I expressed interest in being part of the OAI-ORE and was politely turned down. I’m actually kind of glad really because I also don’t want to be part of some cabal of digital library practitioners. Maybe I should’ve titled this post “Sour Grapes” :-) Seriously though, the digital library needs good practical solutions and communities of users that encourage widespread adoption and tool support. We don’t need research-ware. Having secret discussions and occasional public events that feel more like lectures than meetings isn’t a good way to encourage adoption.

Anyhow, I hope that this isn’t all seen as being too harsh. Everyone’s a critic eh? All in all there is a lot in OAI-ORE to be proud of. The effort to integrate Web Architecture into Digital Library practices is most welcome indeed. Keep up the good work y’all.

oai-ore and the shadow web

Friday, February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seems to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?”. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oai-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

calais and ocr newspaper data

Wednesday, February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan’s rdflib:

  from calais import calais_graph
  # content is the text to analyze; the result is an rdflib graph
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script:

  import sys

  from calais import calais_graph

  # read the OCR text from the file named on the command line
  filename = sys.argv[1]
  with open(filename) as f:
      content = f.read()

  # send the text to calais and get an rdflib graph back
  g = calais_graph(content)

  sparql = """
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
          PREFIX cp: <http://s.opencalais.com/1/pred/>
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """

  # print each personal name that calais recognized
  for row in g.query(sparql):
      print(row[0])

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais
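
Once the scripts have been run over many pages, the ranked list of names mentioned above could be built with a simple frequency count. This is a minimal sketch: the page identifiers and which names appear on which page are made up for illustration, though the names themselves come from the people output.

```python
from collections import Counter

# Hypothetical per-page results from the people script.
names_by_page = {
    "page-1": ["Herbert Putnam", "Bryan", "McKmley"],
    "page-2": ["Herbert Putnam", "Bryan"],
    "page-3": ["Herbert Putnam", "Pptter Palmer"],
}

# Count how often each name occurs across all pages.
counts = Counter(name for names in names_by_page.values() for name in names)

# Most frequent first; one-off OCR errors sink toward the end of the list.
ranked = counts.most_common()
for name, count in ranked:
    print(count, name)
```

A garbled name like "Pptter Palmer" is unlikely to be garbled the same way twice, so it stays at a count of one while real, recurring names rise to the top.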

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
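
The extract-and-unescape step can be sketched with the standard library alone. The envelope below is hypothetical (the element name `string` and the tiny RDF payload are assumptions, not a real Calais response), and this is not the code in calais.py. The point is only that parsing the outer XML already unescapes the inner text.

```python
import xml.etree.ElementTree as ET

# Hypothetical envelope: the RDF/XML arrives as escaped text inside one element.
envelope = (
    '<string>&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;'
    '&lt;/rdf:RDF&gt;</string>'
)

# Parsing the envelope turns &lt; and &gt; back into angle brackets.
rdf_xml = ET.fromstring(envelope).text
print(rdf_xml)

# The unescaped text is itself well-formed XML and can be parsed again.
root = ET.fromstring(rdf_xml)
print(root.tag)
```

From there the unescaped string could be handed to an RDF parser such as rdflib.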

tripleshot

Friday, January 11th, 2008

Recently there was a bit of interesting news around a MARBI Discussion Paper 2008-DP04 regarding semweb technologies at LC.

Related to this work are RDF/OWL representations and models for MODS and MARC, which we are also developing. Several representations of MODS in RDF/OWL, such as the one from the SIMILE project, have been made available as part of various projects and we have found them useful for our analysis and to inform our design process. We want to bring them together into one easily downloaded and maintained RDF/OWL file for use in community experimentation with RDF applications. Our time line is to have the MODS RDF ready for community comment by June.