VocabularySoup (1)

It’s been great to see RDFa being picked up by web2.0 publishers like Digg and MySpace. You can use the RDFa Distiller to extract the RDFa from a given web page u by constructing a URI like:

http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=u

Which translates kind of nicely into a command line utility to add to your ~/bin:

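#!/bin/sh
# Ask the RDFa Distiller for the RDFa in the page at $1, as Turtle.
# -G puts the -d/--data-urlencode pairs in the query string; the page URI is
# URL-encoded so that any ?, & or = inside it reaches the distiller intact.
curl -G "http://www.w3.org/2007/08/pyRdfa/extract" -d "format=turtle" --data-urlencode "uri=$1"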

So with that little shell script in hand I can now look at the RDFa in something like Yo La Tengo’s page on MySpace:

ed@rorty:~$ rdfa http://www.myspace.com/yolatengo

@prefix myspace: <http://x.myspacecdn.com/modules/sitesearch/static/rdf/profileschema.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.myspace.com/YO LA TENGO> a myspace:MusicProfile ;
     myspace:profileType "Music" .

<http://www.myspace.com/yolatengo> xhv:stylesheet
         <http://x.myspacecdn.com/modules/common/static/css/global_j03fjftp.css>,
         <http://x.myspacecdn.com/modules/common/static/css/header/profileheader008.css>,
         <http://x.myspacecdn.com/modules/common/static/css/myspace_jvtnwmp4.css>,
         <http://x.myspacecdn.com/modules/common/static/css/profile_adl4r-y8.css>,
         <http://x.myspacecdn.com/modules/profiles/static/css/musicv2_wo4zzzd-.css> ;
     myspace:addToFriends <http://friends.myspace.com/index.cfm?fuseaction=invite.addfriend_verify&friendID=91362837> ;
     myspace:friendCount "33993" ;
     myspace:headline "\"<b>YO LA TENGO IS MURDERING THE CLASSICS</b>\""^^rdf:XMLLiteral ;
     myspace:photo <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> ;
     myspace:sendMessage <http://messaging.myspace.com/index.cfm?fuseaction=mail.message&friendID=91362837&MyToken=62964687-f06b-4b8b-8227-ba97f133a029> ;
     myspace:viewPictures <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> .

Today I learned that “the world’s largest community for sharing presentations” SlideShare is now using RDFa as well. For example here is the metadata SlideShare makes available for Tom Scott’s recent presentation at CERN for the 20th birthday of the web:

ed@rorty:~$ rdfa http://www.slideshare.net/derivadow/www20-what-does-the-history-of-the-web-tell-us-about-its-future

@prefix dc: <http://purl.org/dc/terms/> .
@prefix hx: <http://purl.org/NET/hinclude> .
@prefix media: <http://search.yahoo.com/searchmonkey/media/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.slideshare.net/derivadow/www20-what-does-the-history-of-the-web-tell-us-about-its-future> dc:creator "Tom Scott"@en ;
     dc:description "Following my invitation to speak at the WWW@20 celebrations - this is my attempt to squash the interesting bits into a s"@en ;
     media:height "355"@en ;
     media:presentation <http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=www20departmentatalpresentation-090325122157-phpapp02&stripped_title=www20-what-does-the-history-of-the-web-tell-us-about-its-future> ;
     media:thumbnail <http://cdn.slidesharecdn.com/www20departmentatalpresentation-090325122157-phpapp02-thumbnail?1238020296> ;
     media:title "www@20 what does the history of the web tell us about its future?"@en ;
     media:width "425"@en ;
     xhv:alternate <http://www.slideshare.net/rss/latest> ;
     xhv:icon <http://www.slideshare.net/favicon.ico> ;
     xhv:stylesheet <http://public.slidesharecdn.com/v3/styles/slideview.css?1238021672> .

I guess it’s nerdy, but I find it really interesting to look at the vocabulary usage. You can see SlideShare is using Yahoo’s media vocabulary as well as Dublin Core, while MySpace has opted to create its own vocabulary. The really wonderful thing about RDF is that it lets you reuse parts of someone else’s vocabulary, create your own, or do both. As a technology RDF encourages this, as do documents like How to Publish Linked Data on the Web and the Semantic Web FAQ.
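A quick way to eyeball vocabulary usage from the command line is to count how often each prefix shows up in the Turtle that comes back. Here is a rough sketch of that idea. The function name vocab_count is made up for illustration, and it works on the text with grep and sed rather than parsing Turtle properly, so treat the numbers as approximate.

```shell
#!/bin/sh
# vocab_count (hypothetical helper): read Turtle on stdin and print
# "count prefix" for every namespace prefix used in the statements.
# This is a text-level heuristic, not a real Turtle parser.
vocab_count() {
    grep -v '^@prefix' |                        # ignore the prefix declarations
    sed -e 's/"[^"]*"//g' -e 's/<[^>]*>//g' |   # strip literals and full URIs
    grep -oE '[A-Za-z][A-Za-z0-9]*:[A-Za-z_][A-Za-z0-9_]*' |  # prefix:term tokens
    cut -d: -f1 |                               # keep only the prefix
    sort | uniq -c |
    awk '{ print $1, $2 }' |                    # normalize the padding uniq adds
    sort -k1,1nr -k2,2                          # most used prefix first
}
```

Assuming the function is loaded in the same shell as the rdfa script, piping rdfa's output into vocab_count gives one line per prefix. Fed the SlideShare Turtle listed above, it should report media 5 times, xhv 3 times and dc twice.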

A common perception of the 14-year Dublin Core effort is that it has largely been about coming to consensus about a set of vocabulary terms to use when describing web resources. I think it’s important to recognize that the Dublin Core community has also been a role model for how to create and share your vocabulary on the web so it can be assembled, discovered, understood, used, and remixed. More recently the Microformats community has done something similar, but by targeting web developers (who are actually coding up the HTML) rather than library/infosci professionals. The real message of the Dublin Core and Microformats efforts isn’t that there ought to be one vocabulary to describe information resources, but that we can use the web to collaboratively build and deploy the vocabularies we need.

As we see more and more metadata making it online as RDFa, Linked Data and Microformats, the community really needs tool support for visualizing vocabulary use. These tools will help data publishers choose which vocabularies they could use in their descriptions. They will also help consumers and harvesters of the web work out which vocabularies are important to understand (a.k.a. write code for). How can we make this easier?

I guess the simplest visualization is the ‘view source’ feature that was built into early web browsers and enabled the propagation of HTML. That is what my command line shell script approximates, and what plugins like Operator and Fuzz make much more friendly. Another approach is to throw a query at an index like Sindice, which indexes large swathes of Linked Data, RDFa and Microformats, and then click through to the “Ontologies” view for a search result, which lists the vocabularies used. Jay Luker covered some of these approaches in his Vocabularies for Linked Data: Finding, Selecting, Creating presentation at code4lib last month.

But it would be really interesting to see more tools that detail vocabulary usage in a more aggregated way, kind of like what Google did in 2005 for HTML in their Web Authoring Statistics. Are some people already doing this? I hope you know of something I don’t.
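To make the aggregated idea a bit more concrete, here is a small sketch that takes a handful of saved Turtle files (say, one per page run through the distiller) and reports how many of them use each namespace. Everything in it is an assumption for illustration: the name vocab_survey is hypothetical, it expects the @prefix lines to come before the statements as they do in the output above, and it matches text with awk instead of really parsing Turtle.

```shell
#!/bin/sh
# vocab_survey (hypothetical helper): given saved Turtle files, print
# "pages namespace", i.e. how many files use at least one term from each
# declared namespace. Heuristic text processing, not a real Turtle parser.
vocab_survey() {
    for f in "$@"; do
        awk '
            # Remember each declared prefix and the namespace URI it stands for.
            /^@prefix/ { sub(/:$/, "", $2); gsub(/[<>]/, "", $3); ns[$2] = $3; next }
            {
                gsub(/"[^"]*"/, "")     # drop literals
                gsub(/<[^>]*>/, "")     # drop full URIs
                for (i = 1; i <= NF; i++)
                    if (split($i, part, ":") == 2 && (part[1] in ns))
                        used[part[1]] = 1
            }
            # Emit each namespace once per file, however often it was used.
            END { for (p in used) print ns[p] }
        ' "$f"
    done | sort | uniq -c | awk '{ print $1, $2 }' | sort -k1,1nr -k2,2
}
```

Counting each namespace once per file, rather than once per triple, keeps a single page with a long list of stylesheets from drowning out everything else. Declared-but-unused prefixes such as xsd in the listings above are left out.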

Up next in part 2 (if I ever get the nerve to publish it): my insane ramblings about why I think XML Schema is nice, but not really web friendly enough to encourage metadata vocabulary use/reuse on the web.