Fielding notes

I’ve been doing a bit of research into the design of the Web for a paper I’m trying to write. In my travels I ran across Jon Udell’s 2006 interview with Roy Fielding. The interview is particularly interesting because of Roy’s telling of how (as a graduate student) he found himself working on libwww-perl which helped him discover the architecture of the Web that was largely documented by Tim Berners-Lee’s libwww HTTP library for Objective-C.

For the purposes of note taking, and giving some web spiders some text to index, here are a few moments that stood out:

Udell: A little later on [in Roy’s dissertation] you talk about how systems based on what you call control messages are in a very different category from systems where the decisions that get made are being made by human beings, and that that’s, in a sense, the ultimate rationale for designing data driven systems that are web-like, because people need to interact with them in lots of ways that you can’t declaratively define.

Fielding: Yeah, it’s a little bit easier to say that people need to reuse them, in various unanticipated ways. A lot of people think that when they are building an application that they are building something that’s going to last forever, and almost always that’s false. Usually when they are building an application the only thing that lasts forever is the data, at least if you’re lucky. If you’re lucky the data retains some semblance of archivability, or reusability over time.

Udell: There is a meme out there to the effect that what we now call REST architectural style was in a sense discovered post facto, as opposed to having been anticipated from the beginning. Do you agree with that or not?

Fielding: No, it’s a little bit of everything, in the sense that there are core principles involved that Berners-Lee was aware of when he was working on it. I first talked to Tim about what I was calling the HTTP Object Model at the time, which is a terrible name for it, but we talked when I was at the W3C in the summer of 95, about the software engineering principles. Being a graduate student of software engineering, that was my focus, and my interest originally. Of course all the stuff I was doing for the Web that was just for fun. At the time that was not considered research.

Udell: But did you at the time think of what you then called the HTTP object model as being in contrast to more API like and procedural approaches?

Fielding: Oh definitely. The reason for that was that the first thing I did for the Web was statistical analysis software, which turned out to be very effective at helping people understand the value of communicating over the Web. The second thing was a program called MOMSpider. It was one of the first Web spiders, a mechanism for testing all the links that were on the Web.

Udell: And that was when you also worked on libwww-perl?

Fielding: Right, and … at the time it was only the second protocol library available for the Web. It was a combination of pieces from various sources, as well as a lot of my own work, in terms of filling out the details, and providing an overall view of what a Web client should do with an HTTP library. And as a result of that design process I realized some of the things Tim Berners-Lee had designed into the system. And I also found a whole bunch of cases where the design didn’t make any sense, or the way it had been particularly implemented over at NCSA, or one of the other clients, or various history of the Web had turned out to be not-fitting with the rest of the design. So that led to a lot of discussions with the other early protocol developers particularly people like Rob McCool, Tony Sanders and Ari Luotonen–people who were building their own systems and understood both what they were doing with the Web, and also what complaints they were getting from their users. And from that I distilled a model of basically what was the core of HTTP. Because if you look back in the 93/94 time frame, the HTTP specification did not look all that similar to what it does now. It had a whole range of methods that were never used, and a lot of talk about various aspects of object orientation which never really applied to HTTP. And all of that came out of Tim’s original implementation of libwww, which was an Objective-C implementation that was trying to be as portable as possible. It had a lot of the good principles of interface separation and genericity inside the library, and really the same principles that I ended up using in the Perl library, although they were completely independently developed. It was just one of those things where that kind of interaction has a way of leading to a more extensible design.

Udell: So was focusing down on a smaller set of verbs partly driven by the experience of having people starting to use the Web, and starting to experience what URLs could be in a human context as well as in a programmatic context?

Fielding: Well, that was really a combination of things. One that’s a fairly common paradigm: if you are trying to inter-operate with people you’ve never met, try to keep it as simple as possible. There’s also just inherent in the notion of using URIs to identify everything, which is of course really the basis of what the Web is, provides you with that frame of mind where you have a common resource, and you want to have a common resource interface.

We’ve got five years, my brain hurts a lot

Recently there have been a few discussions about persistent identifiers on the web: in particular one about the persistence of XRIs, and another about the use of HTTP URIs in semantic web applications like dbpedia.

As you probably know already, the W3C publicly recommended against the use of Extensible Resource Identifiers (XRI). The net effect of this was to derail the standardization of XRIs within OASIS itself. Part of the process that Ray Denenberg (my colleague at the Library of Congress) helped kick off was a further discussion between XRI people and the W3C TAG about what XRI specifically provides that HTTP URIs do not. Recently that discussion reached a key point, made by Stuart Williams:

… the point that I’m trying to make is that the issue is with the social and administrative policies associated with the DNS system – and the solution is to establish a separate namespace outside the DNS system that has different social/administrative policies (particularly wrt persistent name segments) that better suits the requirements of the XRI community. There is the question as to whether that alternate social/administrative system will endure into the long term such that the persistence intended guarantees endure… or not – however that will largely be determined by market forces (adoption) and ‘crudely’ the funding regime that enables the administrative structure of XRI to persist – and probably includes the use of IPRs to prevent duplicate/alternate root problems which we have seen in the DNS world.

It’ll be interesting to see the response. I basically have the same issue with DOIs and the Handle System that they depend on. Over at CrossTech Tony Hammond suggests that the Handle System would make RDF assertions such as those that involve dbpedia more persistent. But just how isn’t entirely clear to me. It seems that Handles, like URLs, are only persistent to the degree that they are maintained.

I’d love to see a use case from Tony that describes just how DOIs and the Handle System would provide more persistence than HTTP URLs in the context of RDF assertions involving dbpedia. As Stuart said eloquently in his email:

Again just seeking to understand – not to take a particular position

PS. Sorry if the blog post title is too cryptic, it’s Bowie’s “Five Years” which Tony’s post (perhaps intentionally) reminded me of :-)

following your nose to the web of data

This is a draft of a column that’s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.

It’s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was set up at Stanford. The archives of the www-talk discussion list bear witness to the grassroots community effort that grew the early web–one document and one server at a time.

Fast forward to 2007 when 24.7 billion web pages are estimated to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee’s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so called Web of Data.

The Web of Data has its beginnings in the vision of a Semantic Web articulated by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up article covers the “layer cake” of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundaries on the World Wide Web.

The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF). Tim Berners-Lee has called this technique Linked Data, and summarized it as a short set of guidelines for publishing data on the web:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those things.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.
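As a rough sketch of guidelines 2 and 3, a client can look up an HTTP URI and use the Accept header to ask for RDF rather than HTML. The URI below is a hypothetical placeholder and `build_lookup_request` is a name invented for this illustration; the request is only constructed here, not sent.

```python
from urllib.request import Request

# Hypothetical URI naming a "thing"; not taken from the column.
THING_URI = "http://example.org/id/niso"


def build_lookup_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET that asks for RDF."""
    # Prefer RDF/XML, but accept HTML as a lower-priority fallback.
    accept = "application/rdf+xml, text/html;q=0.5"
    return Request(uri, headers={"Accept": accept})


request = build_lookup_request(THING_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; what comes back depends entirely on what the server chooses to publish.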

The Linking Open Data community project of the W3C Semantic Web Education and Outreach Group has published two additional documents, Cool URIs for the Semantic Web and How to Publish Linked Data on the Web, that help IT professionals understand what it means to publish their assets as linked data. The goal of the Linking Open Data Project is to

extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.

Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the “triple”, a statement about a resource in three parts: a subject, a predicate and an object. The triple provides a way of modeling statements about resources, and it can be written in multiple serialization formats, including XML and more human-readable formats such as Notation3. For example, to represent a statement that the NISO website has the title “NISO – National Information Standards Organization” one can create the following triple:

<> <> "NISO - National Information Standards Organization" .
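A minimal sketch of how such a triple might be modelled in code and written out as a single N-Triples-style line. The URIs are assumptions added for illustration: the subject is assumed to be the NISO home page and the predicate is assumed to be the Dublin Core `title` element. `Triple` and `to_ntriples` are hypothetical names, not part of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property
    obj: str        # a URI or a plain literal
    obj_is_literal: bool = False


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples-style line."""
    if t.obj_is_literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


title = Triple(
    "http://www.niso.org/",                   # assumed subject URI
    "http://purl.org/dc/elements/1.1/title",  # assumed Dublin Core title URI
    "NISO - National Information Standards Organization",
    obj_is_literal=True,
)
print(to_ntriples(title))
```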

The subject is the URL for the website, the predicate is “has title” represented as a URI from the Dublin Core vocabulary, and the object is the literal “NISO – National Information Standards Organization”. The Linked Data movement encourages the extensive interlinking of your data with other people’s data: so for example by creating another triple such as:

<> <> <> .

This indicates that the website was created by NISO, which is identified using a URI from dbpedia (a Linked Data version of Wikipedia). One of the benefits of linking data in this way is the “follow your nose” effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO

24 additional RDF triples are returned including one like:

<> <> <> .

This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:

and retrieve 156 triples describing other standards organizations, such as:

<> <> <> .

And so on. The ability of humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical Web 2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data, which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.
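The follow-your-nose heuristic can be sketched as a breadth-first walk in which every URI found in a triple becomes the next thing to look up. To keep the sketch runnable offline, the web is faked with a small dictionary. All of the `example.org` URIs, the `FAKE_WEB` data, and the `dereference` and `follow_your_nose` names are invented for illustration; a real agent would issue HTTP requests and parse whatever RDF comes back.

```python
from collections import deque

# A stand-in for the web: URI -> triples "returned" when it is dereferenced.
# All URIs and data here are made up for illustration.
EX = "http://example.org/"
FAKE_WEB = {
    EX + "site": [(EX + "site", EX + "creator", EX + "niso")],
    EX + "niso": [(EX + "niso", EX + "type", EX + "StandardsOrganization")],
    EX + "StandardsOrganization": [
        (EX + "other-org", EX + "type", EX + "StandardsOrganization"),
    ],
}


def dereference(uri):
    """Pretend to fetch a URI; unknown URIs yield no triples."""
    return FAKE_WEB.get(uri, [])


def follow_your_nose(start, max_lookups=10):
    """Breadth-first discovery: look up a URI, then every URI it mentions."""
    seen, queue, found = {start}, deque([start]), []
    lookups = 0
    while queue and lookups < max_lookups:
        uri = queue.popleft()
        lookups += 1
        for triple in dereference(uri):
            if triple not in found:
                found.append(triple)
            # Subjects and objects are candidates for further lookups;
            # predicates are skipped here only to keep the sketch small.
            for node in (triple[0], triple[2]):
                if node.startswith("http://") and node not in seen:
                    seen.add(node)
                    queue.append(node)
    return found


for s, p, o in follow_your_nose(EX + "site"):
    print(s, p, o)
```

The `max_lookups` cap is a design choice for the sketch: on the real web the walk never has a natural end, so any crawler needs some budget for how far it follows its nose.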

As with the initial growth of the web over 10 years ago, the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open discussion list at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published, including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet and OpenCyc. The data and the relationships between the data are by definition distributed around the web and harvestable by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.

Various services aggregate Linked Data and provide services on top of it, such as dbpedia, which has an estimated 3 million RDF links and over 2 billion RDF triples. It’s quite possible that the emerging set of Linked Data will serve as a data test bed for initiatives like the Billion Triple Challenge, which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google’s PageRank algorithm, or the improbable success of Wikipedia’s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.

So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee’s Linked Data document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug join the discussion list and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen–you might just help build a new kind of web, and rest assured you’ll definitely have some fun.

Thanks to Jay Luker, Paul Miller, Danny Ayers and Dan Chudnov for their contributions and suggestions.