Posts Tagged ‘linkeddata’

freebase and linked-data

Wednesday, October 29th, 2008

Ok, this is pretty big news for linked data folks, and for semweb-heads in general. Freebase is now a linked-data target. This is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki-style interface, fancy data loaders, and useful machine APIs.

The web2.0-meets-semweb space is also being explored by folks like Talis. It’ll be interesting to see how this plays out, particularly in light of SPARQL adoption, which I remain kind of neutral about for some undefined, wary, spooky reason. I get the idea of web resources having data views. It seems like a logical “one small step for a web agent, one giant leap for the web”. But queryability with SPARQL sounds like something to push off, particularly if you’ve already got a search API that could be hooked up to the data views.

At any rate, what this announcement means is that you can get machine readable data back from freebase using a URI. The descriptions then use more URIs, which you can then follow-your-nose to, and get more machine readable data. So if you are on a page like:

http://www.freebase.com/view/en/tim_berners-lee

you can construct a URL for Tim Berners-Lee like this:

http://rdf.freebase.com/ns/en.tim_berners-lee
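Here is a minimal Python sketch of that URL construction. The rule it applies (drop the /view/ prefix and join the remaining path segments with dots) is only inferred from the one example above, so treat it as an assumption rather than a documented Freebase rule; the function name view_to_rdf_url is made up for illustration.

```python
from urllib.parse import urlparse

RDF_BASE = "http://rdf.freebase.com/ns/"


def view_to_rdf_url(view_url: str) -> str:
    """Map a Freebase /view/ page URL to an rdf.freebase.com URI.

    Assumption: the topic id after /view/ keeps its parts, joined with "."
    instead of "/" (inferred from the single example in the post).
    """
    path = urlparse(view_url).path  # e.g. "/view/en/tim_berners-lee"
    prefix = "/view/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Freebase view URL: {view_url}")
    topic_id = path[len(prefix):].strip("/")
    return RDF_BASE + topic_id.replace("/", ".")


print(view_to_rdf_url("http://www.freebase.com/view/en/tim_berners-lee"))
```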

Then you resolve that URL asking for application/turtle (you could ask for application/rdf+xml but I find the turtle more readable).

curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.tim_berners-lee

And you’ll get back a description like this. There’s a lot of useful data there, but the interesting part for me is the follow-your-nose effect where you can see an assertion like:

 <http://rdf.freebase.com/ns/en.tim_berners-lee>
     <http://rdf.freebase.com/ns/influence.influence_node.influenced_by>
     <http://rdf.freebase.com/ns/en.ted_nelson> .

And you can then go look up Ted Nelson using that URI:

  curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.ted_nelson

And get another chunk of data which includes this assertion:

 <http://rdf.freebase.com/ns/en.ted_nelson>
     <http://rdf.freebase.com/ns/influence.influence_node.influenced_by>
     <http://rdf.freebase.com/ns/en.vannevar_bush> .

And you can then continue following your nose to:

http://rdf.freebase.com/ns/en.vannevar_bush

Lather, rinse, repeat.
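The loop above can be sketched in a few lines of Python. This is a rough illustration, not a real crawler: the fetch step is passed in as a function, and the demo uses canned text copied from the assertions above instead of hitting the network. The regex only understands plain `<s> <p> <o> .` statements, which is far less than full Turtle, and the names follow_your_nose, influenced_by and fetch_canned are hypothetical.

```python
import re
from typing import Callable, List

NS = "http://rdf.freebase.com/ns/"
INFLUENCED_BY = NS + "influence.influence_node.influenced_by"

# Matches only simple <subject> <predicate> <object> . statements.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")


def influenced_by(uri: str, text: str) -> List[str]:
    """Return the objects of influenced_by statements whose subject is uri."""
    return [o for s, p, o in TRIPLE.findall(text) if s == uri and p == INFLUENCED_BY]


def follow_your_nose(start: str, fetch: Callable[[str], str], max_hops: int = 10) -> List[str]:
    """Follow the first unseen influenced_by link from each description."""
    chain, seen, current = [start], {start}, start
    for _ in range(max_hops):
        targets = [t for t in influenced_by(current, fetch(current)) if t not in seen]
        if not targets:
            break
        current = targets[0]
        seen.add(current)
        chain.append(current)
    return chain


# Canned descriptions standing in for HTTP responses (assumption: offline demo).
CANNED = {
    NS + "en.tim_berners-lee": f"<{NS}en.tim_berners-lee>\n    <{INFLUENCED_BY}>\n    <{NS}en.ted_nelson> .",
    NS + "en.ted_nelson": f"<{NS}en.ted_nelson>\n    <{INFLUENCED_BY}>\n    <{NS}en.vannevar_bush> .",
    NS + "en.vannevar_bush": "",
}


def fetch_canned(uri: str) -> str:
    return CANNED.get(uri, "")


for uri in follow_your_nose(NS + "en.tim_berners-lee", fetch_canned):
    print(uri)
```

Swapping fetch_canned for a function that does an HTTP GET with the right Accept header would turn this into a (very naive) live walker; a proper RDF parser would be the sensible replacement for the regex.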

So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed agents to crawl the web of documents and build indexes that let people find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.

Oh, there are a few little rough spots which probably should be ironed out … but when is that ever not the case eh? Inspiring stuff.

Martin Malmsten and linked library data

Tuesday, September 2nd, 2008

I’m currently listening to Richard Wallis’ interview w/ Martin Malmsten of the Royal Library of Sweden. It’s a really fascinating view inside a library that is publishing bibliographic resources as linked data, and into the mind of one of its developers.

Partly as a dare from Roy Tennant to do something useful with linked-data, I spent 30 minutes w/ rdflib creating a very simplistic (42 lines of code) crawler that can walk the links in the Royal Library’s linked data, and store the bibliographic resources encountered. I ran it over the weekend (it had a 3 second sleep between requests, so as not to arouse the ire of the Royal Library of Sweden), and it ended up pulling down 919,190 triples describing a variety of resources (kind of a fun unix hack here to get the types of resources in an ntriples rdf dump):

ed@hammer:~/bzr/linked-data-crawler$ grep 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' libris.kb.se.n3 \
  | cut -f 3 -d " " \
  | sort \
  | uniq -c \
  | sort -rn
  18445 <http://purl.org/ontology/bibo/Book>.
   1686 <http://purl.org/ontology/bibo/Article>.
    258 <http://www.w3.org/2004/02/skos/core#Concept>.
    245 <http://purl.org/ontology/bibo/Film>.
    237 <http://xmlns.com/foaf/0.1/Organization>.
    219 <http://xmlns.com/foaf/0.1/Person>.
     58 <http://purl.org/ontology/bibo/Periodical>.
      4 <http://purl.org/ontology/bibo/Map>.
      4 <http://purl.org/ontology/bibo/Manuscript>.
      1 <http://purl.org/ontology/bibo/Collection>.
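For anyone who would rather not squint at the pipeline, here is a rough Python equivalent using collections.Counter. It mirrors the grep and cut steps literally (third space-separated field), so it is only as robust as they are. The sample lines and their example.org subjects are invented for illustration, and count_types is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def count_types(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count the third space-separated field of every rdf:type line."""
    counts = Counter()
    for line in lines:
        if RDF_TYPE not in line:                # grep
            continue
        fields = line.rstrip("\n").split(" ")   # cut -d " "
        if len(fields) >= 3:                    # lines with fewer fields are skipped
            counts[fields[2]] += 1              # cut -f 3
    # Roughly sort | uniq -c | sort -rn; ties keep first-seen order here.
    return counts.most_common()


# Invented sample data, shaped like the dump in the post.
sample = [
    f"<http://example.org/1> <{RDF_TYPE}> <http://purl.org/ontology/bibo/Book>.",
    f"<http://example.org/2> <{RDF_TYPE}> <http://purl.org/ontology/bibo/Book>.",
    f"<http://example.org/3> <{RDF_TYPE}> <http://xmlns.com/foaf/0.1/Person>.",
    "<http://example.org/1> <http://purl.org/dc/terms/title> \"Untitled\".",
]

for rdf_class, n in count_types(sample):
    print(f"{n:7d} {rdf_class}")
```

Passing an open file handle for the dump in place of sample should give counts in the same shape as the listing above.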

As I pointed out on ngc4lib, the purpose of this wasn’t to display any technical prowess. Quite the contrary: it was to share how linked-data, being on the web we know and love, is natural to work with.

One of the many gems in the interview was Martin’s response to Richard’s question about whether the “semantic web” that we talk about today is subtly different from the semantic web that was introduced in 2001.

People saw the words “semantic web” and then they sort of forgot the web part, and started to work on the semantic part (vocabularies)–and that can become arbitrarily complex. If you forget the web part then it is just metadata, and then people can ask “ok, you have this semantics thing and we have marc21, it’s not really that different” and they’d be right. But now linked data is starting to feed the semantic web, and it’s the web part that makes it special. (about 34:00 into the interview).

I’m not an expert on the history of the web and libraries, but this seems to be spot on to me. The notion that traditional library assets (bibliographic resources like catalog records, name/subject authority records, holdings records, etc.) can be made available directly on the web as machine readable data is the real promise of linked-data for libraries. It feels like we’re at an inflexion point like the one where libraries realized their catalogs could be made available on the web. The web-opac allowed there to be links between, say, bibliographic records and subject headings, which could be expressed in HTML for people to traverse. But now we can express these links explicitly in a machine readable way, for automated agents to traverse. If you (like Roy Tennant) are skeptical of the value in this, ask yourself how companies like Google were able to build up their most valuable asset, their index of the web. They used the open architecture of the web to walk the links between resources. Imagine if we could allow people to do the same with our data? To gather, say, a union catalog of Sweden by crawling its member libraries’ catalogs, and periodically updating each resource with an HTTP GET?

Martin’s main point is that a lot of valuable effort has gone into vocabulary development like Dublin Core, MODS etc, and even some into distributing descriptions that use these vocabularies via OAI-PMH. But the real exciting part IMHO is giving these resources URLs, and linking them together…much as the web of documents is linked together. I agree with Martin: this is new territory that really combines what librarians and web-technologists do best. I’m looking forward to meeting Martin at DC2008, where hopefully we can do a linked-data BOF or something.

mmalmsten++

Thursday, August 21st, 2008

Holy crap … now I need to listen to this. It’s so nice to know you’re not alone, and off on another planet.

lingvoj

Wednesday, August 13th, 2008

I’m just now running across lingvoj.org, a linked-data application for languages created by Bernard Vatant. lingvoj basically mints URIs for languages (using the ISO 639-1 code) and when resolved (yay HTTP) nice human and machine readable descriptions about the language are returned. So for example the URI for Chinese is:

http://www.lingvoj.org/lang/zh

If you click on that link, your browser will display some HTML that describes the Chinese language, and if a client wants “application/rdf+xml” it’ll get back a nice chunk of rdf — all via a 303 redirect as it should be.
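A client-side sketch of that content negotiation in Python, using only the standard library. urllib follows 303 redirects on its own, so the interesting part is just the Accept header. The helper names rdf_request and fetch_description are made up, and the sketch assumes nothing about what the server sends back beyond what is described above; only the request is built when the script runs, so no network access is needed to try it.

```python
import urllib.request
from typing import Optional, Tuple


def rdf_request(uri: str, accept: str = "application/rdf+xml") -> urllib.request.Request:
    """Build a GET request that asks for a machine readable representation."""
    return urllib.request.Request(uri, headers={"Accept": accept})


def fetch_description(uri: str, accept: str = "application/rdf+xml") -> Tuple[str, Optional[str], bytes]:
    """Resolve uri and return (final URL, Content-Type, body).

    urlopen follows the 303 redirect automatically, so the final URL is
    whichever document the server forwarded us to.
    """
    with urllib.request.urlopen(rdf_request(uri, accept)) as response:
        return response.geturl(), response.headers.get("Content-Type"), response.read()


# Only build the request here; call fetch_description(...) to go over the wire.
req = rdf_request("http://www.lingvoj.org/lang/zh")
print(req.full_url)
print(req.get_header("Accept"))
```

Comparing the final URL returned by fetch_description with the URI you asked for is a quick way to see the redirect in action.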

lingvoj is interesting for a few reasons:

  • I work at the Library of Congress, who are the maintainers of ISO 639-2, and I know someone experimenting with a linked-data application for delivering it.
  • I know software developers at LC and elsewhere who need access to this data in a predictable and explicit machine readable format, which lends itself to being updated (re-harvesting language URIs).
  • lingvoj follows the 303 URIs forwarding to One Generic Document pattern, which is nice to see in practice. I also learned about the use of rdfs:isDefinedBy to assert (in this case) that a language is defined by the HTML representation for the language. Not sure how I missed that in the Cool URIs document before.
  • There are owl:sameAs links between lingvoj and dbpedia and opencyc, which in turn are linked data, and allow an agent to walk outwards and discover more about a language. Maybe one day lingvoj could link to our ISO 639-2 codelist at LC?
  • lingvoj defines a vocabulary which includes a new OWL class Lingvo for languages, that happens to extend dcterms:LinguisticSystem.

It’s a lot o’ fun discovering this emerging, rich data-universe on the web. If you are the least bit curious take a look for yourself:

  curl --location --header "Accept: application/rdf+xml" http://www.lingvoj.org/lang/zh

Or better yet:

  rapper -o turtle http://lingvoj.org/lang/zh

Or if you are really adventurous grab the whole data set and put it into your triple-store-du-jour.