Archive for the ‘semweb’ Category

calais and ocr newspaper data

Wednesday, February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document-centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for NLP and Calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns an RDF graph, courtesy of Dan’s rdflib:

  from calais import calais_graph
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple of sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script:

  from calais import calais_graph
  from sys import argv

  # read the OCR text named on the command line and send it to calais
  filename = argv[1]
  content = open(filename).read()
  g = calais_graph(content)

  # select the name of every entity that calais typed as ct:People
  sparql = """
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
          PREFIX cp: <http://s.opencalais.com/1/pred/>
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """

  # each result row holds one binding, the ?name literal
  for row in g.query(sparql):
      print(row[0])

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO
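The ranked-list idea is easy to sketch. What follows is only an illustration, not part of calais.py: the `rank_entities` function and the per-page lists are made up for the example, and the case folding is one assumed way of getting variants like "CHICAGO" and "Chicago" to count together.

```python
from collections import Counter

def rank_entities(pages):
    """Count entity names over many pages, most common first.

    `pages` is an iterable of lists of names, one list per page.
    """
    counts = Counter()
    for names in pages:
        # Collapse whitespace and fold case so "CHICAGO" and "Chicago" merge.
        counts.update(" ".join(name.split()).title() for name in names)
    return counts.most_common()

# Hypothetical per-page results, reusing a few names from the output above.
pages = [
    ["San Francisco", "San Francisco", "Chicago", "CHICAGO"],
    ["Chicago", "Oakland", "Pptter Palmer"],
    ["San Francisco", "Chicago"],
]

ranked = rank_entities(pages)
for name, count in ranked:
    print(count, name)
```

With enough pages, one-off OCR misreadings such as "Pptter Palmer" would sit at the bottom of a list like this with a count of one or two.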

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
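For the curious, here is a rough sketch of what unwrapping an escaped payload can look like. It is not the code in calais.py, and the `string` wrapper element is an assumption made for the example rather than the documented Calais response format. The point is just that an XML parser already undoes the escaping when it hands back element text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def extract_rdf(envelope):
    """Return the RDF/XML carried as escaped text inside an XML wrapper."""
    root = ET.fromstring(envelope)
    # The parser turns escaped angle brackets and ampersands back into
    # literal characters, so the wrapper's text is the original RDF/XML.
    return "".join(root.itertext()).strip()

# Build a hypothetical envelope by escaping a tiny RDF/XML document.
original = '<rdf:RDF xmlns:rdf="%s"></rdf:RDF>' % RDF_NS
envelope = "<string>%s</string>" % escape(original)

rdf_xml = extract_rdf(envelope)
print(rdf_xml == original)
```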

lcsh, thesauri and skos

Wednesday, January 23rd, 2008

Simon Spero has an interesting post on why LCSH cannot be considered a thesaurus. At $work I’ve been working on mapping LCSH/MARC to SKOS, so Simon’s efforts in both collecting and analyzing LCSH authority data have been extremely valuable. In particular Simon and Leonard Will’s involvement with SKOS alerted me relatively early on to some of the problems that lie in store when thinking of LCSH in terms of a thesaurus.

The problem stems from very specific (standardized) notions of what thesauri are. Z39.19-2005 defines broader relationships in thesauri as being transitive. So if a has the broader term b, and b has the broader term c, then you can infer a has the broader term c.

Now consider the broader relationships (BT, for those of you with the red books handy or who care to browse authorities.loc.gov from the comfort of your chair) from the heading “Non-alcoholic cocktails”:

If broader relationships are to be considered transitive one is obliged to treat Alcoholic beverages as a broader term for Non-alcoholic cocktails. But clearly it’s nonsense to consider a non-alcoholic cocktail a specialization of an alcoholic beverage. As Simon pointed out, the problem was recognized by Mary Dykstra soon after LCSH adopted terminology from the thesauri world (BT, NT, RT) in 1986. Her article, LC Subject Headings Disguised as a Thesaurus, describes the many difficulties of treating LCSH as a thesaurus. In the example above from LCSH the broader (BT) relationship is used for both hierarchical (IS-A) relationships and part/whole (HAS-A) relationships. According to thesauri folks this is a no-no.
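To make the inference concrete, here is a small sketch that follows BT links transitively. The links are the ones for this heading (the same edges that show up in the dot file later in this post); the `all_broader` function is just an illustration, not something taken from LCSH tooling.

```python
# Broader-term (BT) links from the "Non-alcoholic cocktails" subgraph.
BROADER = {
    "Non-alcoholic cocktails": ["Cocktails", "Non-alcoholic beverages"],
    "Cocktails": ["Alcoholic beverages"],
    "Alcoholic beverages": ["Beverages"],
    "Non-alcoholic beverages": ["Beverages"],
    "Non-alcoholic beer": ["Non-alcoholic beverages"],
}

def all_broader(term, broader=BROADER):
    """Every term reachable from `term` by following BT links transitively."""
    seen = set()
    stack = list(broader.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen

print(sorted(all_broader("Non-alcoholic cocktails")))
```

Treating BT as transitive puts "Alcoholic beverages" in the result for "Non-alcoholic cocktails", which is exactly the nonsense described above.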

LCSH aside, the semantics of broader/narrower have been an issue for SKOS for a fair amount of time. Guus Schreiber proposed a resolution, which was just accepted at yesterday’s SWD telecon. SKOS is trying to straddle several different worlds, enabling the representation of a range of knowledge organization systems from thesauri and taxonomies to subject heading lists, folksonomies and other controlled vocabularies. To remain flexible in this way, while still appealing to the thesaurus world, a compromise was reached where the skos:broader and skos:narrower semantic relations were declared to be sub-properties of two new properties: skos:broaderTransitive and skos:narrowerTransitive (respectively). Since transitivity is not inherited by sub-properties, SKOS can still be used by people who want to represent loose broader relationships (LCSH, and others). At the same time SKOS will allow vocabulary owners to infer transitive broader/narrower relationships across concepts. Incidentally the SKOS Reference was just approved yesterday as a W3C Working Draft, which is its first step along the way to hopefully becoming a Recommendation.
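A toy illustration of the compromise, in plain Python rather than a real RDF reasoner: the function name and the pair-of-labels representation are assumptions for the sketch. Asserted skos:broader pairs are copied up to skos:broaderTransitive (the sub-property rule), and only that super-property is closed under transitivity.

```python
def entail_broader_transitive(broader):
    """Derive skos:broaderTransitive pairs from asserted skos:broader pairs."""
    # Sub-property rule: every broader pair is also a broaderTransitive pair.
    closure = set(broader)
    # Transitivity applies to broaderTransitive only; repeat until nothing new.
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

broader = {
    ("Non-alcoholic cocktails", "Cocktails"),
    ("Cocktails", "Alcoholic beverages"),
}
broader_transitive = entail_broader_transitive(broader)

pair = ("Non-alcoholic cocktails", "Alcoholic beverages")
print(pair in broader)             # False
print(pair in broader_transitive)  # True
```

The asserted broader statements stay exactly as the vocabulary owner wrote them; the inferred pair only ever appears under broaderTransitive.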

My pottering about with LCSH and SKOS has also illustrated the value in making links between concepts explicit. Modeling LCSH as a graph data structure (SKOS), where each concept has a unique identifier has been a simple and yet powerful step in working with the data. For example to generate the image above, I simply wrote a script that transformed the subgraph related to “Non-alcoholic cocktails” to a graphviz dot file:


digraph G {
  rankdir = "BT"
  "Non-alcoholic cocktails" -> "Cocktails";
  "Alcoholic beverages" -> "Beverages";
  "Non-alcoholic beverages" -> "Beverages";
  "Cocktails" -> "Alcoholic beverages";
  "Non-alcoholic cocktails" -> "Non-alcoholic beverages";
  "Non-alcoholic beer" -> "Non-alcoholic beverages";
}

And then ran that through the graphviz dot utility:


% dot -T png non-alcoholic-cocktails.dot > non-alcoholic-cocktails.png

to generate the PNG file you see. It’s my hope that making a richly linked graph like LCSH/SKOS available will enable not only enhanced use of the vocabulary, but also aid in creative, collaborative refactoring of the graph. I know that these issues are not new to LC, however tools that enable refactoring along the lines of what Margherita Sini proposed for the cocktail problem above will only be possible in a world where the graph can easily be manipulated, and downstream applications (library catalogs, etc) can easily adapt to the changing concept scheme.
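The script that produced the dot file isn't shown here, so the following is only a guess at its general shape. It assumes the subgraph has already been pulled out as (narrower, broader) label pairs; `to_dot` is a made-up name.

```python
def to_dot(edges):
    """Render (narrower, broader) label pairs as a graphviz digraph."""
    def quote(label):
        # Escape embedded double quotes so each label stays one dot string.
        return '"%s"' % label.replace('"', '\\"')

    lines = ["digraph G {", '  rankdir = "BT"']
    for narrower, broader in edges:
        lines.append("  %s -> %s;" % (quote(narrower), quote(broader)))
    lines.append("}")
    return "\n".join(lines)

edges = [
    ("Non-alcoholic cocktails", "Cocktails"),
    ("Cocktails", "Alcoholic beverages"),
]
print(to_dot(edges))
```

The printed text can be saved to a .dot file and passed to the dot utility as above.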

tripleshot

Friday, January 11th, 2008

Recently there was a bit of interesting news around a MARBI Discussion Paper 2008-DP04 regarding semweb technologies at LC.

Related to this work are RDF/OWL representations and models for MODS and MARC, which we are also developing. Several representations of MODS in RDF/OWL, such as the one from the SIMILE project, have been made available as part of various projects and we have found them useful for our analysis and to inform our design process. We want to bring them together into one easily downloaded and maintained RDF/OWL file for use in community experimentation with RDF applications. Our time line is to have the MODS RDF ready for community comment by June.

following your nose to the web of data

Friday, January 4th, 2008

This is a draft of a column that’s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.


It’s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was set up at Stanford. The archives of the www-talk discussion list bear witness to the grassroots community effort that grew the early web–one document and one server at a time.

Fast forward to 2007 when 24.7 billion web pages are estimated to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee’s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so-called Web of Data.

The Web of Data has its beginnings in the vision of a Semantic Web articulated by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up article covers the “layer cake” of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundaries on the World Wide Web.

The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF). Tim Berners-Lee has called this technique Linked Data, and summarized it as a short set of guidelines for publishing data on the web:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those things.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.
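In practice, looking up a URI (guidelines 2 and 3) is an ordinary HTTP GET, typically with an Accept header asking for RDF instead of HTML. Here is a minimal sketch using Python's standard library; the function names are illustrative, and whether a given server honors the Accept header is an assumption.

```python
import urllib.request

def rdf_request(uri):
    """Build an HTTP GET request that asks for RDF/XML rather than HTML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

def dereference(uri):
    """Look up a URI and return the response body as bytes (needs network)."""
    with urllib.request.urlopen(rdf_request(uri)) as response:
        return response.read()

# Only builds the request; nothing is sent until dereference() is called.
request = rdf_request(
    "http://dbpedia.org/resource/National_Information_Standards_Organization"
)
print(request.get_method(), request.full_url)
```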

The Linking Open Data community project of the W3C Semantic Web Education and Outreach Group has published two additional documents, Cool URIs for the Semantic Web and How to Publish Linked Data on the Web, that help IT professionals understand what it means to publish their assets as linked data. The goal of the Linking Open Data Project is to

extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.

Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the “triple” which is a statement about a resource in three parts: a subject, predicate and object. The RDF triple provides a way of modeling statements about resources and it can have multiple serialization formats including XML and some more human readable formats such as notation3. For example to represent a statement that the website at http://niso.org has the title “NISO - National Information Standards Organization” one can create the following triple:


<http://niso.org> <http://purl.org/dc/elements/1.1/title> "NISO - National Information Standards Organization" .

The subject is the URL for the website, the predicate is “has title” represented as a URI from the Dublin Core vocabulary, and the object is the literal “NISO - National Information Standards Organization”. The Linked Data movement encourages the extensive interlinking of your data with other people’s data: so for example by creating another triple such as:


<http://niso.org> <http://purl.org/dc/elements/1.1/creator> <http://dbpedia.org/resource/National_Information_Standards_Organization> .

This indicates that the website was created by NISO, which is identified using a URI from dbpedia (a Linked Data version of Wikipedia). One of the benefits of linking data in this way is the “follow your nose” effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO


http://dbpedia.org/resource/National_Information_Standards_Organization

24 additional RDF triples are returned including one like:


<http://dbpedia.org/resource/National_Information_Standards_Organization> <http://www.w3.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:


http://dbpedia.org/resource/Category:Standards_organizations

and retrieve 156 triples describing other standards organizations, such as:


<http://dbpedia.org/resource/World_Wide_Web_Consortium> <http://www.w3.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

And so on. This ability for humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical Web 2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.
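The heuristic fits in a few lines of code. The sketch below swaps the real web for a small in-memory dictionary so it runs without a network: the `WEB` table and the `follow_your_nose` function are illustrative only, each entry holds just the one triple quoted above instead of the full response, and treating any value that starts with "http" as a URI is a simplification.

```python
from collections import deque

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
SKOS_SUBJECT = "http://www.w3.org/2004/02/skos/core#subject"
NISO_SITE = "http://niso.org"
NISO = "http://dbpedia.org/resource/National_Information_Standards_Organization"
STANDARDS_ORGS = "http://dbpedia.org/resource/Category:Standards_organizations"
W3C = "http://dbpedia.org/resource/World_Wide_Web_Consortium"

# Stand-in for the web: the triples each URI "returns" when dereferenced.
WEB = {
    NISO_SITE: [(NISO_SITE, DC_CREATOR, NISO)],
    NISO: [(NISO, SKOS_SUBJECT, STANDARDS_ORGS)],
    STANDARDS_ORGS: [(W3C, SKOS_SUBJECT, STANDARDS_ORGS)],
}

def follow_your_nose(start, lookup, limit=10):
    """Breadth-first crawl: dereference a URI, then every URI it mentions."""
    queue, visited, triples = deque([start]), set(), []
    while queue and len(visited) < limit:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        for s, p, o in lookup(uri):
            triples.append((s, p, o))
            # Queue the subject and object URIs for a later lookup.
            queue.extend(
                u for u in (s, o) if u.startswith("http") and u not in visited
            )
    return triples

triples = follow_your_nose(NISO_SITE, lambda uri: WEB.get(uri, []))
for triple in triples:
    print(" ".join(triple))
```

Starting from nothing but http://niso.org, the crawl ends up knowing about the W3C without any API documentation being involved.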

As with the initial growth of the web over 10 years ago the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open discussion list at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet and OpenCyc. The data and relationships between the data are by definition distributed around the web and harvestable by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.

Various services aggregate Linked Data and provide services on top of it such as dbpedia which has an estimated 3 million RDF links, and over 2 billion RDF triples. It’s quite possible that the emerging set of Linked Data will serve as a data test bed for initiatives like the Billion Triple Challenge which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google’s PageRank algorithm, or the improbable success of Wikipedia’s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.

So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee’s Linked Data document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug join the discussion list and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen–you might just help build a new kind of web, and rest assured you’ll definitely have some fun.

Thanks to Jay Luker, Paul Miller, Danny Ayers and Dan Chudnov for their contributions and suggestions.