freebase and linked-data

Ok, this is pretty big news for linked data folks, and for semweb-heads in general: Freebase is now a linked-data target. This is important because Freebase is an active community of content creators, building rich data-centric descriptions with a wiki-style interface, fancy data loaders, and useful machine APIs.

The web2.0-meets-semweb space is also being explored by folks like Talis. It’ll be interesting to see how this plays out–particularly in light of SPARQL adoption, which I remain kind of neutral about for some undefined, wary, spooky reason. I get the idea of web resources having data views. It seems like a logical “one small step for a web agent, one giant leap for the web”. But queryability with SPARQL sounds like something to push off, particularly if you’ve already got a search API that could be hooked up to the data views.

At any rate, what this announcement means is that you can get machine readable data back from Freebase using a URI. The descriptions use more URIs, which you can follow your nose to and get yet more machine readable data. So if you are on a page like:

http://www.freebase.com/view/en/tim_berners-lee

you can construct a URL for Tim Berners-Lee like this:

http://rdf.freebase.com/ns/en.tim_berners-lee

Then you resolve that URL asking for application/turtle (you could ask for application/rdf+xml, but I find turtle more readable):

curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.tim_berners-lee

And you’ll get back a description like this. There’s a lot of useful data there, but the interesting part for me is the follow-your-nose effect where you can see an assertion like:

 <http://rdf.freebase.com/ns/en.tim_berners-lee>   
     <http://rdf.freebase.com/ns/influence.influence_node.influenced_by>
     <http://rdf.freebase.com/ns/en.ted_nelson> .

And you can then go look up Ted Nelson using that URI:

  curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.ted_nelson

And get another chunk of data which includes this assertion:

 <http://rdf.freebase.com/ns/en.ted_nelson>
     <http://rdf.freebase.com/ns/influence.influence_node.influenced_by>
     <http://rdf.freebase.com/ns/en.vannevar_bush> .

And you can then continue following your nose to:

http://rdf.freebase.com/ns/en.vannevar_bush

Lather, rinse, repeat.
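
If you’d rather script that than curl it, here’s a minimal follow-your-nose sketch using rdflib (a sketch only: it’s written against a recent rdflib, assumes rdf.freebase.com content-negotiates as described above, and the influenced_by predicate URI is just copied from the triples shown):

import rdflib

FB = rdflib.Namespace("http://rdf.freebase.com/ns/")
INFLUENCED_BY = FB["influence.influence_node.influenced_by"]

def follow(uri, depth=2):
    """Print who influenced whom, following links depth hops out."""
    if depth == 0:
        return
    graph = rdflib.Graph()
    graph.parse(uri)  # rdflib content-negotiates for an RDF representation
    for influence in graph.objects(rdflib.URIRef(uri), INFLUENCED_BY):
        print("%s was influenced by %s" % (uri, influence))
        follow(str(influence), depth - 1)

follow("http://rdf.freebase.com/ns/en.tim_berners-lee")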

So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed agents to crawl the web of documents and build indexes of the data, so that people could (hopefully) find what they want. Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.

Oh, there are a few little rough spots which probably should be ironed out … but when is that ever not the case eh? Inspiring stuff.


SemanticProxy

I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple; it’s just a GET:

GET http://service.semanticproxy.com/processurl/{key}/rdf/{url}

Here’s an example of some turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace: http://s.opencalais.com/1/type/em/e/

For example:

  • http://s.opencalais.com/1/type/em/e/Person
  • http://s.opencalais.com/1/type/em/e/Country
  • http://s.opencalais.com/1/type/em/e/Company

And most of these resources have a property which seems to assign a literal string label to the resource:

http://s.opencalais.com/1/pred/name

It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.

At any rate, with these two little facts gleaned from a few moments of looking at the RDF, I wrote a little script (using rdflib) that you feed a URL, and it’ll munge through the RDF and print out the recognized entities:

ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
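
For the curious, the heart of such a script is tiny. Here’s roughly how it goes (a sketch, not the actual entities.py: the CALAIS_KEY placeholder is hypothetical, and it assumes the service returns RDF/XML at the processurl endpoint shown earlier, using the two OpenCalais namespaces observed above):

import sys
import rdflib

CALAIS_KEY = "your-api-key"  # hypothetical placeholder -- use your own key
TYPE_NS = "http://s.opencalais.com/1/type/em/e/"
NAME = rdflib.URIRef("http://s.opencalais.com/1/pred/name")

def entities(url):
    proxy = "http://service.semanticproxy.com/processurl/%s/rdf/%s" % (CALAIS_KEY, url)
    graph = rdflib.Graph()
    graph.parse(proxy, format="xml")  # the service hands back RDF/XML
    for subject, entity_type in graph.subject_objects(rdflib.RDF.type):
        if not entity_type.startswith(TYPE_NS):
            continue
        for name in graph.objects(subject, NAME):
            print("a %s named %s" % (entity_type[len(TYPE_NS):], name))

if __name__ == "__main__":
    entities(sys.argv[1])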

Quite easy and impressive IMHO. One thing that is missing from this output is the URIs that identify the various recognized resources, like Dan’s:


http://d.opencalais.com/pershash-1/f7383d60-c27b-309c-889a-4e34d0938a0f

Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash) … but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.

I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.

If you want to take this entities.py for a spin and can’t be bothered to download it, just drop into #code4lib and ask #zoia for entities:

14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR 
              Review Group, a City York, a EmailAddress wtd@pobox.com, a Person 
              Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a 
              Person William Denton, a Person Barbara Tillett, a Organization 
              Congress, a Organization Open Content Alliance, a Organization 
              York \nUniversity'


json vs pickle

In Python, JSON is faster, smaller and more portable than pickle …

At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a Django app, and the db schema currently looks something like this:

[schema diagram]

Anyhow, if you look at the schema you’ll notice that we have a Page model, and attached to that is an OCR model. If you haven’t heard of it before, OCR is an acronym for optical character recognition. For each newspaper page we have a TIFF image of the original page, and we have rectangle coordinates for the position of every word on the page. Basically it’s XML that looks something like this (warning: your browser may choke on this, you might want to right-click-download).

So there are roughly 2,500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page … and we’re looking to model 1,000,000 pages soon … so if we got really prissy with normalization we could soon be looking at (worst case) 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don’t need fine-grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.

So let me get to the interesting part already. To avoid having to think about databases with billions of rows, I radically denormalized the data and stored the word coordinates as a blob of JSON in the database. So we just have a word_coordinates_json column in the OCR table, and when we need to look up the coordinates for a page we just load up the JSON dictionary and we’re good to go. JSON is nice with Django, since Django’s ORM doesn’t seem to support storing blobs in the database, and JSON is just text. This worked just fine on single page views, but we also do hit highlighting on search result pages where 10 newspaper pages are being viewed at the same time. So we started noticing large lags on those views, because it was taking a while to load the JSON (sometimes 10 * 327K of JSON).
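
To make that concrete, the denormalized model boils down to something like this (a minimal sketch against the Django we were running at the time; apart from Page, OCR and word_coordinates_json the names are made up):

from django.db import models
from django.utils import simplejson

class OCR(models.Model):
    page = models.ForeignKey("Page")
    # every word on the page mapped to its rectangles, serialized as one JSON blob,
    # e.g. {"liberty": [[x, y, width, height], ...], ...}
    word_coordinates_json = models.TextField()

    @property
    def word_coordinates(self):
        return simplejson.loads(self.word_coordinates_json)

Highlighting the hits for a word on a page then amounts to something like ocr.word_coordinates.get("liberty", []).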

As I mentioned we’re using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed the slowdowns I decided to compare django.utils.simplejson to the latest simplejson and python-cjson. And just for grins I figured it couldn’t hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON file and a 507K pickle file 100 times using each technique. Here are the results:

method               total seconds   avg seconds
django-simplejson       140.606723      1.406067
simplejson                2.260988      0.022610
pickle                   45.032428      0.450324
cPickle                   4.569351      0.045694
cPickle1                  2.829307      0.028293
cPickle2                  3.042940      0.030429
python-cjson              1.852755      0.018528

Yeah, that’s right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than even cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it’ll take 10 * 0.033183 = 0.3 seconds to parse all the JSON instead of the totally unacceptable 10 * 0.976193 = 9.7 seconds. I guess in some circles 0.3 seconds might be unacceptable; we’ll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser w/ AJAX. If you want, please try out my benchmarks yourself on your own platform. I’d be curious if you see the same ranking.
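
The benchmark itself boils down to something like this (a simplified sketch of what my script does, for Python 2; the file names are placeholders, and the cPickle1/cPickle2 rows above just load dumps written with protocols 1 and 2):

import timeit

ITERATIONS = 100

def bench(name, setup, stmt):
    total = timeit.Timer(stmt, setup).timeit(number=ITERATIONS)
    print("%-18s %10.6f total %10.6f avg" % (name, total, total / ITERATIONS))

json_file = "word_coordinates.json"      # the 327K JSON dump
pickle_file = "word_coordinates.pickle"  # the 507K pickle dump of the same data

bench("simplejson",
      "import simplejson; data = open(%r).read()" % json_file,
      "simplejson.loads(data)")
bench("python-cjson",
      "import cjson; data = open(%r).read()" % json_file,
      "cjson.decode(data)")
bench("pickle",
      "import pickle; data = open(%r, 'rb').read()" % pickle_file,
      "pickle.loads(data)")
bench("cPickle",
      "import cPickle; data = open(%r, 'rb').read()" % pickle_file,
      "cPickle.loads(data)")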

Here are the versions for various bits I used:

  • python v2.5.2
  • django trunk: r9231 2008-10-13 15:38:18 -0400
  • simplejson 2.0.3

So in summary for pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats, in that you can only store the simple datatypes that JSON allows, not full-fledged Python objects. But in my use case JSON’s data types were just fine. Makes me that much happier that simplejson (aka json) is now cooked into the Python 2.6 standard library.

Note: if you aren’t seeing simplejson performing better than cPickle, you may need to have the Python development libraries installed:

  aptitude install python-dev # or the equivalent for your system

You can verify that the optimizations are available in simplejson like this:

ed@hammer:~/bzr/jsonickle$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simplejson
>>> simplejson._speedups
<module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/_speedups.so'>

Thanks eby, mjgiarlo, BenO and Kapil for their pointers and ideas.


lcsh.info logs

If you are curious how lcsh.info is being used, I’ve made the Apache server logs available, including the ones for the SPARQL service. I’ve been meaning to do some analysis of the logs but haven’t had the time yet. You’ll notice that among the data collected is the Accept header sent by agents, since it’s so important to which representation gets served up. Thanks to danbri for the idea to simply make them available.
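
If anyone wants a head start on that analysis, a tally of the Accept headers is only a few lines of Python. This is just a sketch, and it assumes the Accept header was logged as the last quoted field on each line, which may not match the actual log format:

import re
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    quoted = re.findall(r'"([^"]*)"', line)
    if quoted:
        counts[quoted[-1]] += 1  # assumed to be the Accept header

for accept, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print("%7d %s" % (n, accept))

You’d run it with something like zcat access.log.*.gz | python accept_report.py (accept_report.py being whatever you call the script).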


iswc2009, DC and vocamp

I just learned from Tom Heath that The International Semantic Web Conference is coming to Washington DC next year. This is pretty cool news to me, since traveling to conferences isn’t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a VoCamp around the conference, to provide an informal collaboration space for vocabulary demos, development, q/a, etc. If you want to help out please join the mailing list.


public.resource.org to liberate Code of Federal Regulations

good news via the govtrack mailing list

Carl Malamud of public.resource.org, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.

More here: http://public.resource.org/gpo.gov/index.html

And Carl’s letter to the GPO:
http://public.resource.org/gpo.gov/the_honorable.html

It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.


terminology services sneak peek

I just saw Diane Vizine-Goetz demo OCLC’s Terminology Services at the CENDI/SKOS meeting and was excited to see various things out on the public web. For example, the LCSH concept “World Wide Web” is over here:

http://tspilot.oclc.org/lcsh/sh2008114004

At the moment it’s not the most friendly human-readable display, but that’s just an XSLT stylesheet away (assuming TS follows the patterns of other OCLC services). I’m not quite sure what the default namespace urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84 is. But the response looks like it indicates what resources are related to a given conceptual resource.

  1. http://tspilot.oclc.org/lcsh/sh2008114004.html
  2. http://tspilot.oclc.org/lcsh/sh2008114004.json
  3. http://tspilot.oclc.org/lcsh/sh2008114004.marcxml
  4. http://tspilot.oclc.org/lcsh/sh2008114004.meta
  5. http://tspilot.oclc.org/lcsh/sh2008114004.skos
  6. http://tspilot.oclc.org/lcsh/sh2008114004.stats
  7. http://tspilot.oclc.org/lcsh/sh2008114004.zthes

And LCSH is just one of the vocabularies available through the pilot service; if you examine the XML you’ll see references to FAST, TGM and MeSH, plus SRU services for each.

I think this is way cool, and a step in the right direction … particularly because they are going to make vocabularies available for free as long as the original publisher has no problem with it. My only complaint is that the URIs for the concepts don’t appear to do content negotiation for application/rdf+xml. It looks like text/html and application/javascript (isn’t it application/json?) work just fine though. Try them out:

curl --header "Accept: application/javascript" http://tspilot.oclc.org/lcsh/sh2008114004
curl --header "Accept: text/html" http://tspilot.oclc.org/lcsh/sh2008114004

But not application/rdf+xml:

curl --header "Accept: application/rdf+xml" http://tspilot.oclc.org/lcsh/sh2008114004

It seems like it would be a pretty easy fix, and pretty important for being able to follow your nose on the semantic web.
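
A quick way to poke at the content negotiation is a loop like this (just a sketch using Python 2’s httplib; it only reports the status code and Content-Type that come back for each media type mentioned above):

import httplib

PATH = "/lcsh/sh2008114004"

for accept in ("text/html", "application/javascript", "application/rdf+xml"):
    conn = httplib.HTTPConnection("tspilot.oclc.org")
    conn.request("GET", PATH, headers={"Accept": accept})
    response = conn.getresponse()
    print("%-24s %s %s" % (accept, response.status,
                           response.getheader("content-type")))
    conn.close()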


nkos/cendi

Jon Phipps and I are speaking about SKOS at the World Bank today for a joint meeting of the CENDI and NKOS groups. The talk is entitled “SKOS: New Directions in Interoperability” … which is kind of ironic since SKOS has been a long running topic at NKOS meetings. The idea is to describe SKOS (for those who don’t know it), cover the recent changes to SKOS (for those that do), and describe an implementation of SKOS (lcsh.info). A tall order for 30 minutes!

One new direction that I hope I’ll be able to get to is the notion of linked-data. I created some simple graph visualizations of the Royal Library of Sweden’s linked bibliographic data implementation. I really wanted to emphasize how linked data can model data across enterprise boundaries. By the way, this example really exists; it’s not library-science-fiction.

Wish us luck! There are going to be some other interesting talks during the day: on OCLC’s Terminology Services, Semantic MediaWiki for vocabulary development at the Mayo Clinic, mapping agriculture vocabularies, the intersection of folksonomy and taxonomy, and more.

PS. Roy I haven’t forgotten your follow-up comment :-)


w3c semweb use cases and lcsh

Via Ivan Herman I learned that the Semantic Web Use Cases use concepts from lcsh.info. For example look at the RDFa in this case study for the Digital Music Archive for the Norwegian National Broadcaster. You can also look at the Document metadata in a linked data browser like OpenLink. Click on the “Document” and then on the various subject “concepts” and you’ll see the linked data browser go out and fetch the triples from lcsh.info for “Semantic Web” and “Broadcasting”.

One of the downsides to linked-data browsers (for me) is that they hide a bit of what’s going on. Of course this is by design. For a more RDF-centric view of the data, take a look at this output of rapper:

ed@curry:~$ rapper -o turtle http://www.w3.org/2001/sw/sweo/public/UseCases/NRK/
rapper: Serializing with serializer turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


Martin Malmsten and linked library data

I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating view inside a library (and the mind of a developer) that is publishing bibliographic resources as linked data.

Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data and store the bibliographic resources it encounters (there’s a rough sketch of what such a crawler looks like after the listing below). I ran it over the weekend (it had a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources. Kind of a fun unix hack here to get the types of resources in an ntriples RDF dump:

ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
  | cut -f 3 -d " " \
  | sort \
  | uniq -c \
  | sort -rn
  18445 <http://purl.org/ontology/bibo/Book>.
   1686 <http://purl.org/ontology/bibo/Article>.
    258 <http://www.w3.org/2004/02/skos/core#Concept>.
    245 <http://purl.org/ontology/bibo/Film>.
    237 <http://xmlns.com/foaf/0.1/Organization>.
    219 <http://xmlns.com/foaf/0.1/Person>.
     58 <http://purl.org/ontology/bibo/Periodical>.
      4 <http://purl.org/ontology/bibo/Map>.
      4 <http://purl.org/ontology/bibo/Manuscript>.
      1 <http://purl.org/ontology/bibo/Collection>.
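
The crawler itself needn’t be much more than this (a rough sketch rather than the actual 42 lines: it’s written against a recent rdflib, the restriction to libris.kb.se URIs is an assumption, and the real script was smarter about what it had already fetched):

import sys
import time
import rdflib

def crawl(seed, limit=1000):
    """Breadth-first walk of linked data, accumulating everything into one graph."""
    store = rdflib.Graph()
    queue, seen = [seed], set([seed])
    while queue and len(seen) < limit:
        uri = queue.pop(0)
        graph = rdflib.Graph()
        try:
            graph.parse(uri)  # content-negotiate for an RDF representation
        except Exception:
            continue          # skip URIs that don't resolve to parseable RDF
        store += graph
        for obj in graph.objects():
            url = str(obj)
            # follow links, but stay within the Royal Library's data
            if (isinstance(obj, rdflib.URIRef)
                    and url.startswith("http://libris.kb.se/")
                    and url not in seen):
                seen.add(url)
                queue.append(url)
        time.sleep(3)  # be polite
    store.serialize("libris.kb.se.n3", format="nt")

if __name__ == "__main__":
    crawl(sys.argv[1])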

As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess–much to the contrary, it was to share how the fact that linked-data lives on the web we know and love makes it natural to work with.

One of the many gems in the interview was Martin’s response to Richard’s question about whether the “semantic web” that we talk about today is subtly different from the semantic web that was introduced in 2001.

People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).

I’m not an expert on the history of the web and libraries, but this seems to be spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine readable data is the real promise of linked-data for libraries. It feels like we’re at an inflection point like the one where libraries realized their catalogs could be made available on the web. The web OPAC allowed there to be links between, say, bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this, ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web to walk the links between resources. Imagine if we could allow people to do the same with our data? To gather, say, a union catalog of Sweden by crawling its member libraries’ catalogs, and periodically updating them with HTTP GETs of each resource?

Martin’s main point is that a lot of valuable effort has gone into vocabulary development like Dublin Core, MODS, etc., and even some into distributing descriptions that use these vocabularies via OAI-PMH. But the really exciting part IMHO is giving these resources URLs and linking them together … much as the web of documents is linked together. I agree with Martin: this is new territory that really combines what librarians and web technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.