Posts Tagged ‘http’

identi.ca and linked data

Friday, July 11th, 2008

If you’ve already caught the micro-blogging bug identi.ca is an interesting twitter clone for a variety of reasons…not the least of which is that it’s an open source project, and has been designed to run in a decentralized way. The thing I was pleasantly surprised to see was FOAF exports like this for user networks, and HTTP URIs for foaf:Person resources:

ed@hammer:~$ curl -I http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 12:58:56 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html

It looks like there’s a slight bug in the way the HTTP status is being returned, but clearly the intent was to do the right thing by httpRange-14. If I have time I’ll get laconi.ca running locally so I can confirm the bug, and attempt a fix.

It’s also cool to see that Evan Prodromou (the lead developer, and creator of identi.ca and laconi.ca) has opened a couple tickets for adding RDFa to various pages. If I have the time this would be a fun hack as well. I’d also like to take a stab at doing conneg at foaf:Person URIs to enable this sorta thing:

ed@hammer:~$ curl -I --header "Content-type: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 303 See Other
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Location: http://identi.ca/edsu/foaf

instead of what happens currently:

ed@hammer:~$ curl -I --header "Content-type: application/rdf+xml" http://identi.ca/user/6104
HTTP/1.1 302 Found
Date: Fri, 11 Jul 2008 13:08:42 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.1
Status: 303 See Other
Location: http://identi.ca/edsu
Content-Type: text/html

I guess this is also just a complicated way of saying I’m edsu on identi.ca–and that the opportunity to learn more about OAuth and XMPP is a compelling enough reason alone for me to make the switch.

MIME types and library metadata

Wednesday, April 23rd, 2008

While thinking about library metadata and RESTful web services I got to wondering how many application/*+xml MIME types have actually been registered. It turns out that 120 out of the 633 other application/* MIME types.

Does it seem like a generally useful thing to be able to identify metadata representations with MIME types? Rebecca Guenther registered application/marc back in 1997. Maybe we could have application/marc+xml, application/mods+xml, application/dc+xml?

MIME types for established library metadata formats would be useful to use in applications like AtomPub implementations, or say OAI-ORE resource maps that want to identify the format of a particular resource. In general it would be useful to have in RESTful environments where content-negotiation for resources is encouraged.

If you are curious, here is a current (as of Apr 23, 2008) list of registered MIME types that are in the application/*+xml space.

application/atom+xml
application/atomcat+xml
application/atomsvc+xml
application/auth-policy+xml
application/beep+xml
application/ccxml+xml
application/cellml+xml
application/cnrp+xml
application/conference-info+xml
application/cpl+xml
application/csta+xml
application/CSTAdata+xml
application/davmount+xml
application/dialog-info+xml
application/epp+xml
application/im-iscomposing+xml
application/kpml-request+xml
application/kpml-response+xml
application/mbms-associated-procedure-description+xml
application/mbms-deregister+xml
application/mbms-envelope+xml
application/mbms-msk-response+xml
application/mbms-msk+xml
application/mbms-protection-description+xml
application/mbms-reception-report+xml
application/mbms-register-response+xml
application/mbms-register+xml
application/mbms-user-service-description+xml
application/media_control+xml
application/mediaservercontrol+xml
application/oebps-package+xml
application/pidf+xml
application/pls+xml
application/poc-settings+xml
application/rdf+xml
application/reginfo+xml
application/resource-lists+xml
application/rlmi+xml
application/rls-services+xml
application/samlassertion+xml
application/samlmetadata+xml
application/sbml+xml
application/shf+xml
application/simple-filter+xml
application/smil+xml
application/soap+xml
application/sparql-results+xml
application/spirits-event+xml
application/srgs+xml
application/ssml+xml
application/vnd.3gpp.bsf+xml
application/vnd.3gpp2.bcmcsinfo+xml
application/vnd.adobe.xdp+xml
application/vnd.apple.installer+xml
application/vnd.avistar+xml
application/vnd.chemdraw+xml
application/vnd.criticaltools.wbs+xml
application/vnd.ctct.ws+xml
application/vnd.eszigno3+xml
application/vnd.google-earth.kml+xml
application/vnd.HandHeld-Entertainment+xml
application/vnd.informedcontrol.rms+xml
application/vnd.irepository.package+xml
application/vnd.liberty-request+xml
application/vnd.llamagraphics.life-balance.exchange+xml
application/vnd.marlin.drm.actiontoken+xml
application/vnd.marlin.drm.conftoken+xml
application/vnd.mozilla.xul+xml
application/vnd.ms-playready.initiator+xml
application/vnd.nokia.conml+xml
application/vnd.nokia.iptv.config+xml
application/vnd.nokia.landmark+xml
application/vnd.nokia.landmarkcollection+xml
application/vnd.nokia.n-gage.ac+xml
application/vnd.nokia.pcd+xml
application/vnd.oma.bcast.associated-procedure-parameter+xml
application/vnd.oma.bcast.drm-trigger+xml
application/vnd.oma.bcast.imd+xml
application/vnd.oma.bcast.notification+xml
application/vnd.oma.bcast.sgdd+xml
application/vnd.oma.bcast.smartcard-trigger+xml
application/vnd.oma.bcast.sprov+xml
application/vnd.oma.dd2+xml
application/vnd.oma.drm.risd+xml
application/vnd.oma.group-usage-list+xml
application/vnd.oma.poc.detailed-progress-report+xml
application/vnd.oma.poc.final-report+xml
application/vnd.oma.poc.groups+xml
application/vnd.oma.poc.invocation-descriptor+xml
application/vnd.oma.poc.optimized-progress-report+xml
application/vnd.oma.xcap-directory+xml
application/vnd.omads-email+xml
application/vnd.omads-file+xml
application/vnd.omads-folder+xml
application/vnd.otps.ct-kip+xml
application/vnd.poc.group-advertisement+xml
application/vnd.pwg-xhtml-print+xml
application/vnd.recordare.musicxml+xml
application/vnd.solent.sdkm+xml
application/vnd.sun.wadl+xml
application/vnd.syncml.dm+xml
application/vnd.syncml+xml
application/vnd.uoml+xml
application/vnd.wv.csp+xml
application/vnd.wv.ssp+xml
application/vnd.zzazz.deck+xml
application/voicexml+xml
application/watcherinfo+xml
application/wsdl+xml
application/wspolicy+xml
application/xcap-att+xml
application/xcap-caps+xml
application/xcap-el+xml
application/xcap-error+xml
application/xcap-ns+xml
application/xenc+xml
application/xhtml+xml
application/xmpp+xml
application/xop+xml
application/xv+xml

following your nose to the web of data

Friday, January 4th, 2008

This is a draft of a column that’s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.


It’s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was set up at Stanford. The archives of the www-talk discussion list bear witness to the grassroots community effort that grew the early web–one document and one server at a time.

Fast forward to 2007 when 24.7 billion web pages are estimated to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee’s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so called Web of Data.

The Web of Data has its beginnings in the vision of a Semantic Web articulated by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up article covers the the “layer cake” of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundries on the World Wide Web.

The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF). Tim Berners-Lee has called this technique Linked Data, and summarized it as a short set of guidelines for publishing data on the web:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those things.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.

The Linking Open Data community project of the W3C Semantic Web Education and Outreach Group has published two additional documents Cool URIs for the Semantic Web and How to Publish Linked Data on the Web that help IT professionals understand what it means to publish their assets as linked data. The goal of the Linking Open Data Project is to

extend the the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.

Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the “triple” which is a statement about a resource in three parts: a subject, predicate and object. The RDF triple provides a way of modeling statements about resources and it can have multiple serialization formats including XML and some more human readable formats such as notation3. For example to represent a statement that the website at http://niso.org has the title “NISO - National Information Standards Organization” one can create the following triple:


<http://niso.org> <http://purl.org/dc/elements/1.1/title> "NISO - National Information Standards Organization" .

The subject is the URL for the website, the predicate is “has title” represented as a URI from the Dublin Core vocabulary, and the object is the literal “NISO - National Information Standards Organization”. The Linked Data movement encourages the extensive interlinking of your data with other people’s data: so for example by creating another triple such as:


<http://niso.org> <http://purl.org/dc/elements/1.1/creator> <http://dbpedia.org/resource/National_Information_Standards_Organization> .

This indicates that the website was created by NISO which is identified using URI from the dbpedia (a Linked Data version of the Wikipedia). One of the benefits of linking data in this way is the “follow your nose” effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO


http://dbpedia.org/resource/National_Information_Standards_Organization

24 additional RDF triples are returned including one like:


<http://dbpedia.org/resource/National_Information_Standards_Organization> <http://www.w3.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:


http://dbpedia.org/resource/Category:Standards_organizations

and retrieve 156 triples describing other standards organizations are returned such as:


<http://dbpedia.org/resource/World_Wide_Web_Consortium> <http://www.w4.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

And so on. This ability for humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical web2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.

As with the initial growth of the web over 10 years ago the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open discussion list at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet, OpenCyc. The data and relationships between the data are by definition distributed around the web and harvestable by anyone by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.

Various services aggregate Linked Data and provide services on top of it such as dbpedia which has an estimated 3 million RDF links, and over 2 billion RDF triples. It’s quite possible that the emerging set of Linked Data will serve as a data test bed for intiatives like the Billion Triple Challenge which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google’s PageRank algorithm, or the improbable success of Wikipedia’s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.

So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee’s Linked Data document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug join the discussion list and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen–you might just help build a new kind of web, and rest assured you’ll definitely have some fun.

Thanks to Jay Luker, Paul Miller, Danny Ayers and Dan Chudnov for their contributions and suggestions.

good ore

Friday, November 2nd, 2007

In case you missed it the Object-Reuse-and-Exchange (ORE) folks are having a get together at Johns Hopkins University (Baltimore, MD) on March 3, 2008. It’s free to register, but space is limited. The Compound information objects whitepaper, May 2007 Technical Committee notes and the more recent Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication provide a good taste of what the beta ORE specs are likely to look like.

The ORE group isn’t small, and includes individuals from quite different organizations. So any consensus that can be garnered I think will be quite powerful. Personally I’ve been really pleased to see how much the ORE work is leaning on web architecture: notably resolvable HTTP URIs, content-negotiation, linked-data and named graphs. Also interesting in the recent announcement is that the initial specs will use RFC 4287 for encoding the data model. Who knows, perhaps the spec will rely on archive feeds as discussed recently on the code4lib discussion list.

I’m particularly interested to see what flavor of URIs are used to identify the compound objects:

The protocol-based URI of the Resource Map identifies an aggregation of resources (components of a compound object) and their boundary-type inter-relationships. While this URI is clearly not the identifier of the compound object itself, it does provide an access point to the Resource Map and its representations that list all the resources of the compound object. For many practical purposes, this protocol-based URI may be a handy mechanism to reference the compound object because of the tight dependency of the visibility of the compound object in web space on the Resource Map (i.e., in ORE terms, a compound object exists in web space if and only if there exists a Resource Map describing it).

We note, however, two subtle points regarding the use of the URI of the Resource Map to reference the compound object. First, doing so is inconsistent with the web architecture and URI guidelines that are explicit in their suggestion that a URI should identify a single resource. Strictly interpreted, then, the use of the URI of the Resource Map to identify both the Resource Map and the compound object that it describes is incorrect. Second, some existing information systems already use dedicated URIs for the identification of compound information objects “as a whole.” For example, many scholarly publishers use DOIs whereas the Fedora and aDORe repositories have adopted identifiers of the info URI scheme. These identifiers are explicitly distinct from the URI of the Resource Map. from: Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication

I understand the ORE group is intentionally not aligning themselves too closely with the semantic web community. However I think they need to consider whether compound information objects are WWW information resources or not:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” (from Architecture of the World Wide Web vol. 1).

I’m not totally convinced that the resource map can’t serve as a suitable representation for the compound information object–however for the sake of argument lets say I am. It seems to me that the URI for the compound information object identifies the concept of a particular compound information object, which lies in various pieces on the network. However this doesn’t preclude the use of HTTP URLs to identify the compound objects. Indeed the What HTTP URIs identify and Cool URIs for the Semantic Web provide specific guidance on how to serve up these non-information resources. Of course philosophical arguments around httpRange-14 have raged for a while. But the Linking Open Data project is using the hash URI and 303 redirect very effectively. There has even been some work on a sitemap extension to enable crawling. As a practical matter using URLs to identify compound information objects will encourage their use because they will naturally find their ways into publications, blogs, other compound objects. Using non-resolvable or quasi-resolvable info-uris or dois will mean people just won’t create the links–and when they do they will create links that can’t be easily verified and evolved over time with standard web tools. The OAI-ORE effort represents a giant leap forward for the digital library community into the web. Here’s to hoping they land safely–we need this stuff.