Posts Tagged ‘semanticweb’

justify my links

Thursday, May 29th, 2008

Thanks to a tip from Ian, I’m looking forward to (hopefully) attending the Linked Data Planet conference in New York City as a volunteer. The idea is that I just have to pay for my hotel, and the cost of admission is waived. It seems my travel money is a bit limited at the moment (sometimes it’s there, sometimes it isn’t), so I figured minimizing costs would be appreciated. But today I got a request to “justify” my attendance at the conference. It was actually kind of a good exercise to sit down and write why I think the conference and Linked Data in general is important to the Library of Congress.

One of the challenges of Digital Repository work is modeling the context for digital objects. The context for a digital object includes the set of relationships a particular digital object has with other objects in the repository. 30 years of relational database research and development have allowed us to do this modeling pretty effectively within the scope of a particular application.

Very often, particularly in institutions the size of the Library of Congress, the context for a digital object includes digital objects found elsewhere in the enterprise–in other applications, with their own databases. In addition some institutions (like LC) also need to make their digital resources available publicly for other organizations to reference. The challenge here is in making the objects found in silos or islands of application data (typically housed in databases) reference-able and resolvable, so that other applications inside and outside the enterprise can use them.

As a practical example, a picture of Dizzie Gilliespie found in the America Memory collection

is related to the book:


To be, or not–to bop: memoirs / Dizzy Gillespie, with Al Fraser.

which we have described in our online catalog. The person Dizzy Gillespie is also represented in LC’s name authority file with the Library of Congress Control Number n50033872, and the Linked Authority File at OCLC. And perhaps this picture of Dizzie Gillespie in American Memory will find it’s way into the World Digital Library application that is currently being built. How can we practically and explicitly identify and then represent the relationships between these resources? Is it even possible?

The Linked Data Planet conference is a two day workshop describing how to use traditional web technologies in conjunction with semantic web technologies (RDF, OWL, SPARQL, RDFa and GRDDL) to enable this sort of linking of resources inside particular applications, within the enterprise and around the world. My hope is that the conference will provide guidance on simple things LC can do with web technologies that have been in use for 20 years, to model the relationships between digital resources at the Library of Congress.

Hopefully that will convince them :-)

Apologies to Madonna for the blog post title…

literals and resources

Wednesday, March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seems to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PSS, no that lccn url doesn’t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))

oai-ore and the shadow web

Friday, February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seem to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?“. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oia-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

calais and ocr newspaper data

Wednesday, February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan’s rdflib:

  import calais
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script. note, the angly brackets are missing from the sparql prefixes intentionally, since they don’t render properly (yet) in wordpress.

  from calais import calais_graph
  from sys import argv
 
  filename = argv[1]
  content = file(filename).read()
  g = calais_graph(content)
 
  sparql = """
          PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
          PREFIX ct: http://s.opencalais.com/1/type/em/e/
          PREFIX cp: http://s.opencalais.com/1/pred/
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """
 
  for row in g.query(sparql):
      print row[0]

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.

following your nose to the web of data

Friday, January 4th, 2008

This is a draft of a column that’s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.


It’s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was set up at Stanford. The archives of the www-talk discussion list bear witness to the grassroots community effort that grew the early web–one document and one server at a time.

Fast forward to 2007 when 24.7 billion web pages are estimated to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee’s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so called Web of Data.

The Web of Data has its beginnings in the vision of a Semantic Web articulated by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up article covers the the “layer cake” of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundries on the World Wide Web.

The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF). Tim Berners-Lee has called this technique Linked Data, and summarized it as a short set of guidelines for publishing data on the web:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those things.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.

The Linking Open Data community project of the W3C Semantic Web Education and Outreach Group has published two additional documents Cool URIs for the Semantic Web and How to Publish Linked Data on the Web that help IT professionals understand what it means to publish their assets as linked data. The goal of the Linking Open Data Project is to

extend the the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.

Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the “triple” which is a statement about a resource in three parts: a subject, predicate and object. The RDF triple provides a way of modeling statements about resources and it can have multiple serialization formats including XML and some more human readable formats such as notation3. For example to represent a statement that the website at http://niso.org has the title “NISO - National Information Standards Organization” one can create the following triple:


<http://niso.org> <http://purl.org/dc/elements/1.1/title> "NISO - National Information Standards Organization" .

The subject is the URL for the website, the predicate is “has title” represented as a URI from the Dublin Core vocabulary, and the object is the literal “NISO - National Information Standards Organization”. The Linked Data movement encourages the extensive interlinking of your data with other people’s data: so for example by creating another triple such as:


<http://niso.org> <http://purl.org/dc/elements/1.1/creator> <http://dbpedia.org/resource/National_Information_Standards_Organization> .

This indicates that the website was created by NISO which is identified using URI from the dbpedia (a Linked Data version of the Wikipedia). One of the benefits of linking data in this way is the “follow your nose” effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO


http://dbpedia.org/resource/National_Information_Standards_Organization

24 additional RDF triples are returned including one like:


<http://dbpedia.org/resource/National_Information_Standards_Organization> <http://www.w3.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:


http://dbpedia.org/resource/Category:Standards_organizations

and retrieve 156 triples describing other standards organizations are returned such as:


<http://dbpedia.org/resource/World_Wide_Web_Consortium> <http://www.w4.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

And so on. This ability for humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical web2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.

As with the initial growth of the web over 10 years ago the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open discussion list at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet, OpenCyc. The data and relationships between the data are by definition distributed around the web and harvestable by anyone by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.

Various services aggregate Linked Data and provide services on top of it such as dbpedia which has an estimated 3 million RDF links, and over 2 billion RDF triples. It’s quite possible that the emerging set of Linked Data will serve as a data test bed for intiatives like the Billion Triple Challenge which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google’s PageRank algorithm, or the improbable success of Wikipedia’s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.

So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee’s Linked Data document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug join the discussion list and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen–you might just help build a new kind of web, and rest assured you’ll definitely have some fun.

Thanks to Jay Luker, Paul Miller, Danny Ayers and Dan Chudnov for their contributions and suggestions.

good ore

Friday, November 2nd, 2007

In case you missed it the Object-Reuse-and-Exchange (ORE) folks are having a get together at Johns Hopkins University (Baltimore, MD) on March 3, 2008. It’s free to register, but space is limited. The Compound information objects whitepaper, May 2007 Technical Committee notes and the more recent Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication provide a good taste of what the beta ORE specs are likely to look like.

The ORE group isn’t small, and includes individuals from quite different organizations. So any consensus that can be garnered I think will be quite powerful. Personally I’ve been really pleased to see how much the ORE work is leaning on web architecture: notably resolvable HTTP URIs, content-negotiation, linked-data and named graphs. Also interesting in the recent announcement is that the initial specs will use RFC 4287 for encoding the data model. Who knows, perhaps the spec will rely on archive feeds as discussed recently on the code4lib discussion list.

I’m particularly interested to see what flavor of URIs are used to identify the compound objects:

The protocol-based URI of the Resource Map identifies an aggregation of resources (components of a compound object) and their boundary-type inter-relationships. While this URI is clearly not the identifier of the compound object itself, it does provide an access point to the Resource Map and its representations that list all the resources of the compound object. For many practical purposes, this protocol-based URI may be a handy mechanism to reference the compound object because of the tight dependency of the visibility of the compound object in web space on the Resource Map (i.e., in ORE terms, a compound object exists in web space if and only if there exists a Resource Map describing it).

We note, however, two subtle points regarding the use of the URI of the Resource Map to reference the compound object. First, doing so is inconsistent with the web architecture and URI guidelines that are explicit in their suggestion that a URI should identify a single resource. Strictly interpreted, then, the use of the URI of the Resource Map to identify both the Resource Map and the compound object that it describes is incorrect. Second, some existing information systems already use dedicated URIs for the identification of compound information objects “as a whole.” For example, many scholarly publishers use DOIs whereas the Fedora and aDORe repositories have adopted identifiers of the info URI scheme. These identifiers are explicitly distinct from the URI of the Resource Map. from: Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication

I understand the ORE group is intentionally not aligning themselves too closely with the semantic web community. However I think they need to consider whether compound information objects are WWW information resources or not:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” (from Architecture of the World Wide Web vol. 1).

I’m not totally convinced that the resource map can’t serve as a suitable representation for the compound information object–however for the sake of argument lets say I am. It seems to me that the URI for the compound information object identifies the concept of a particular compound information object, which lies in various pieces on the network. However this doesn’t preclude the use of HTTP URLs to identify the compound objects. Indeed the What HTTP URIs identify and Cool URIs for the Semantic Web provide specific guidance on how to serve up these non-information resources. Of course philosophical arguments around httpRange-14 have raged for a while. But the Linking Open Data project is using the hash URI and 303 redirect very effectively. There has even been some work on a sitemap extension to enable crawling. As a practical matter using URLs to identify compound information objects will encourage their use because they will naturally find their ways into publications, blogs, other compound objects. Using non-resolvable or quasi-resolvable info-uris or dois will mean people just won’t create the links–and when they do they will create links that can’t be easily verified and evolved over time with standard web tools. The OAI-ORE effort represents a giant leap forward for the digital library community into the web. Here’s to hoping they land safely–we need this stuff.