Archive for the ‘web’ Category

oai-ore and the shadow web

Friday, February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seem to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?“. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oia-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

following your nose to the web of data

Friday, January 4th, 2008

This is a draft of a column that’s slated to be published some time in Information Standards Quarterly. Jay was kind enough to let me post it here in this form before it goes to press. It seems timely to put it out there. Please feel free to leave comments to point out inaccuracies, errors, tips, suggestions, etc.


It’s hard to imagine today that in 1991 the entire World Wide Web existed on a single server at CERN in Switzerland. By the end of that year the first web server outside of Europe was set up at Stanford. The archives of the www-talk discussion list bear witness to the grassroots community effort that grew the early web–one document and one server at a time.

Fast forward to 2007 when 24.7 billion web pages are estimated to exist. The rapid and continued growth of the Web of Documents can partly be attributed to the elegant simplicity of the hypertext link enabled by two of Tim Berners-Lee’s creations: the HyperText Markup Language (HTML) and the Uniform Resource Locator (URL). There is a similar movement afoot today to build a new kind of web using this same linking technology, the so called Web of Data.

The Web of Data has its beginnings in the vision of a Semantic Web articulated by Tim Berners-Lee in 2001. The basic idea of the Semantic Web is to enable intelligent machine agents by augmenting the web of HTML documents with a web of machine processable information. A recent follow up article covers the the “layer cake” of standards that have been created since, and how they are being successfully used today to enable data integration in research, government, and business. However the repositories of data associated with these success stories are largely found behind closed doors. As a result there is little large scale integration happening across organizational boundries on the World Wide Web.

The Web of Data represents a distillation and simplification of the Semantic Web vision. It de-emphasizes the automated reasoning aspects of Semantic Web research and focuses instead on the actual linking of data across organizational boundaries. To make things even simpler the linking mechanism relies on already deployed web technologies: the HyperText Transfer Protocol (HTTP), Uniform Resource Identifiers (URI), and Resource Description Framework (RDF). Tim Berners-Lee has called this technique Linked Data, and summarized it as a short set of guidelines for publishing data on the web:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those things.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.

The Linking Open Data community project of the W3C Semantic Web Education and Outreach Group has published two additional documents Cool URIs for the Semantic Web and How to Publish Linked Data on the Web that help IT professionals understand what it means to publish their assets as linked data. The goal of the Linking Open Data Project is to

extend the the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different sources.

Central to the Linked Data concept is the publication of RDF on the World Wide Web. The essence of RDF is the “triple” which is a statement about a resource in three parts: a subject, predicate and object. The RDF triple provides a way of modeling statements about resources and it can have multiple serialization formats including XML and some more human readable formats such as notation3. For example to represent a statement that the website at http://niso.org has the title “NISO - National Information Standards Organization” one can create the following triple:


<http://niso.org> <http://purl.org/dc/elements/1.1/title> "NISO - National Information Standards Organization" .

The subject is the URL for the website, the predicate is “has title” represented as a URI from the Dublin Core vocabulary, and the object is the literal “NISO - National Information Standards Organization”. The Linked Data movement encourages the extensive interlinking of your data with other people’s data: so for example by creating another triple such as:


<http://niso.org> <http://purl.org/dc/elements/1.1/creator> <http://dbpedia.org/resource/National_Information_Standards_Organization> .

This indicates that the website was created by NISO which is identified using URI from the dbpedia (a Linked Data version of the Wikipedia). One of the benefits of linking data in this way is the “follow your nose” effect. When a person in their browser or an automated agent runs across the creator in the above triple they are able to dereference the URL and retrieve more information about this creator. For example when a software agent dereferences a URL for NISO


http://dbpedia.org/resource/National_Information_Standards_Organization

24 additional RDF triples are returned including one like:


<http://dbpedia.org/resource/National_Information_Standards_Organization> <http://www.w3.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

This triple says that NISO belongs to a class of resources that are standards organizations. A human or agent can follow their nose to the dbpedia URL for standards organizations:


http://dbpedia.org/resource/Category:Standards_organizations

and retrieve 156 triples describing other standards organizations are returned such as:


<http://dbpedia.org/resource/World_Wide_Web_Consortium> <http://www.w4.org/2004/02/skos/core#subject> <http://dbpedia.org/resource/Category:Standards_organizations> .

And so on. This ability for humans and automated crawlers to follow their noses in this way makes for a powerfully simple data discovery heuristic. The philosophy is quite different from other data discovery methods, such as the typical web2.0 APIs of Flickr, Amazon, YouTube, Facebook, Google, etc., which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.

As with the initial growth of the web over 10 years ago the creation of the Web of Data is happening at a grassroots level by individuals around the world. Much of the work takes place on an open discussion list at MIT where people share their experiences of making data sets available, discuss technical problems/solutions, and announce the availability of resources. At this time some 27 different data sets have been published including Wikipedia, the US Census, the CIA World Fact Book, Geonames, MusicBrainz, WordNet, OpenCyc. The data and relationships between the data are by definition distributed around the web and harvestable by anyone by anyone with a web browser or HTTP client. Contrast this openness with the relationships that Google extracts from the Web of Documents and locks up on their own private network.

Various services aggregate Linked Data and provide services on top of it such as dbpedia which has an estimated 3 million RDF links, and over 2 billion RDF triples. It’s quite possible that the emerging set of Linked Data will serve as a data test bed for intiatives like the Billion Triple Challenge which aims to foster creative approaches to data mining and Semantic Web research by making large sets of real data available. In much the same way that Tim Berners-Lee could not have predicted the impact of Google’s PageRank algorithm, or the improbable success of Wikipedia’s collaborative editing while creating the Web of Documents, it may be that simply building links between data sets on the Web of Data will bootstrap a new class of technologies we cannot begin to imagine today.

So if you are in the business of making data available on the web and have a bit more time to spare, have a look at Tim Berners-Lee’s Linked Data document and familiarize yourself with the simple web publishing techniques behind the Web of Data: HTTP, URI and RDF. If you catch the Linked Data bug join the discussion list and the conversation, and try publishing some of your data as a pilot project using the tutorials. Who knows what might happen–you might just help build a new kind of web, and rest assured you’ll definitely have some fun.

Thanks to Jay Luker, Paul Miller, Danny Ayers and Dan Chudnov for their contributions and suggestions.

permalinks reloaded

Monday, December 17th, 2007

The recently announced Zotero / InternetArchive partnership is exciting on a bunch of levels. The one that immediately struck me was the use of the Internet Archive URI. As you may have noticed before all the content in Internet Archive Wayback Machine can be referenced with a URL that looks something like:

  • http://web.archive.org/web/{yyyymmddhhmmss}/{url}

Where url is the document URL you want to look up in the archive at the given time. So for example:

is a URL for what http://google.com looked like on December 02, 1998 at 23:04:10. Perhaps this is documented somewhere prominent or is common knowledge, but it looks like you can play with the timestamp, and archive.org will adjust as needed, redirecting you to the closest snapshot it can find:

and even:

which redirects to the most recent content for a given URL. It’s just a good old 302 at work:

ed@curry:~$ curl -I http://web.archive.org/web/199812/http://www.google.com/
HTTP/1.1 302 Found
Date: Mon, 17 Dec 2007 21:11:12 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.2 mod_ssl/2.0.54 OpenSSL/0.9.7g mod_perl/2.0.1 Perl/v5.8.7
Location: http://web.archive.org/web/19981202230410/www.google.com/
Content-Type: text/html; charset=iso-8859-1

So anyhow, pretty cool use of URIs and HTTP right? The addition of zotero to the mix will mean that scholars can cite the web as it appeared at a particular point in time:

… as scholars begin to use not only traditional primary sources that have been digitized but also “born digital” materials on the web (blogs, online essays, documents transcribed into HTML), the possibility arises for Zotero users to leverage the resources of IA to ensure a more reliable form of scholarly communication. One of the Internet Archive’s great strengths is that it has not only archived the web but also given each page a permanent URI that includes a time and date stamp in addition to the URL.

Currently when a scholar using Zotero wishes to save a web page for their research they simply store a local copy. For some, perhaps many, purposes this is fine. But for web documents that a scholar believes will be important to share, cite, or collaboratively annotate (e.g., among a group of coauthors of an article or book) we will provide a second option in the Zotero web save function to grab a permanent copy and URI from IA’s web archive. A scholar who shares this item in their library can then be sure that all others who choose to use it will be referring to the exact same document.

This is pretty fundamental to scholarship on the web. Of course when generating a time anchored permalink with zotero one can well expect that archive.org will on occasion not have a snapshot of said content, resulting in a 404. It would be great if archive.org could leverage these requests for snapshots as requests to go out and archive the page. One could imagine a blocking and nonblocking request: the former which would spawn a request to fetch a particular URI, stash content away, and return the permalink; and the latter which would just quickly return the best match its already got (which may be a 404).

Anyhow, it’s really good to see these two outfits working together. Nice work!

ps. dear lazyweb is there a documented archive.org api available?

good ore

Friday, November 2nd, 2007

In case you missed it the Object-Reuse-and-Exchange (ORE) folks are having a get together at Johns Hopkins University (Baltimore, MD) on March 3, 2008. It’s free to register, but space is limited. The Compound information objects whitepaper, May 2007 Technical Committee notes and the more recent Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication provide a good taste of what the beta ORE specs are likely to look like.

The ORE group isn’t small, and includes individuals from quite different organizations. So any consensus that can be garnered I think will be quite powerful. Personally I’ve been really pleased to see how much the ORE work is leaning on web architecture: notably resolvable HTTP URIs, content-negotiation, linked-data and named graphs. Also interesting in the recent announcement is that the initial specs will use RFC 4287 for encoding the data model. Who knows, perhaps the spec will rely on archive feeds as discussed recently on the code4lib discussion list.

I’m particularly interested to see what flavor of URIs are used to identify the compound objects:

The protocol-based URI of the Resource Map identifies an aggregation of resources (components of a compound object) and their boundary-type inter-relationships. While this URI is clearly not the identifier of the compound object itself, it does provide an access point to the Resource Map and its representations that list all the resources of the compound object. For many practical purposes, this protocol-based URI may be a handy mechanism to reference the compound object because of the tight dependency of the visibility of the compound object in web space on the Resource Map (i.e., in ORE terms, a compound object exists in web space if and only if there exists a Resource Map describing it).

We note, however, two subtle points regarding the use of the URI of the Resource Map to reference the compound object. First, doing so is inconsistent with the web architecture and URI guidelines that are explicit in their suggestion that a URI should identify a single resource. Strictly interpreted, then, the use of the URI of the Resource Map to identify both the Resource Map and the compound object that it describes is incorrect. Second, some existing information systems already use dedicated URIs for the identification of compound information objects “as a whole.” For example, many scholarly publishers use DOIs whereas the Fedora and aDORe repositories have adopted identifiers of the info URI scheme. These identifiers are explicitly distinct from the URI of the Resource Map. from: Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication

I understand the ORE group is intentionally not aligning themselves too closely with the semantic web community. However I think they need to consider whether compound information objects are WWW information resources or not:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” (from Architecture of the World Wide Web vol. 1).

I’m not totally convinced that the resource map can’t serve as a suitable representation for the compound information object–however for the sake of argument lets say I am. It seems to me that the URI for the compound information object identifies the concept of a particular compound information object, which lies in various pieces on the network. However this doesn’t preclude the use of HTTP URLs to identify the compound objects. Indeed the What HTTP URIs identify and Cool URIs for the Semantic Web provide specific guidance on how to serve up these non-information resources. Of course philosophical arguments around httpRange-14 have raged for a while. But the Linking Open Data project is using the hash URI and 303 redirect very effectively. There has even been some work on a sitemap extension to enable crawling. As a practical matter using URLs to identify compound information objects will encourage their use because they will naturally find their ways into publications, blogs, other compound objects. Using non-resolvable or quasi-resolvable info-uris or dois will mean people just won’t create the links–and when they do they will create links that can’t be easily verified and evolved over time with standard web tools. The OAI-ORE effort represents a giant leap forward for the digital library community into the web. Here’s to hoping they land safely–we need this stuff.

groupthink

Wednesday, September 26th, 2007

This little hack came up in channel after Bruce posted some XSLT to transform OCLC Identities XML into FOAF.

xsltproc \
  http://inkdroid.org/data/identity-foaf.xsl \
  http://orlabs.oclc.org/Identities/lccn-no99-10609 \
  | xmllint --format -

!!!

XSLT has its place to be sure.

OCLC deserves some REST

Wednesday, September 26th, 2007

Hey Worldcat Identities you are doing awesome work–you deserve some REST. Why not use content-negotiation to serve up your HTML and XML representations? So:

  curl --header "Accept: text/html" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return HTML and

  curl --header "Accept: application/xml" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return XML. This would allow you to:

  • not be limited to XSLT driven user views (doesn’t that get tedious?)
  • allow you to scale to other sorts of output (application/rdf+xml, etc)

At least from the outside I’d have to disagree w/ Roy — it appears that institutions can and do innovate. But I won’t say it is easy …

linking open data

Monday, August 27th, 2007


If it isn’t already, put the Linking Open Data project on your radar. It’s a grassroots effort to make large data sets available on the web. These aren’t just tarballs sitting in an FTP directory either–they’re URL addressable information resources available in machine readable format. A few weeks ago Joshua Tauberer announced the availability of the US Census as close to 1 billion triples. If you like data and the web the discussion list is a wonderful place to watch these data sets getting released and linked together.

rockin’ the plastic

Tuesday, August 14th, 2007

High 5, more dead than alive
Rockin’ the plastic like a man from a casket

Yeah, the blog is back after getting routed by the LinuXploit Crew. The whole episode was really rewarding actually. I learned what projects I work on that need to be hosted elsewhere at more stable locations–that are likely to outlive my pathetic musings. I (re)learned how important good friends are (ksclarke, dchud, wtd, jaf, gabe, rsinger) in a pinch. And I watched in awe as the Wordpress 2.2 upgrade actually worked on my pathologically old instance (which I suspect to have been the front door). Oh, and it was a good excuse to ditch gentoo for ubuntu.

Spring cleaning came a bit late this year I guess. Thanks LinuXploit Crew!

app and repositories

Monday, July 16th, 2007

Pete Johnston blogged recently about a very nice use of the Atom Publishing Protocol (APP) to provide digital library repository functionality. The project is supported by UKOLN at the University of Bath and is called Simple Web-service Offering Repository Deposit (SWORD).

If you are interested in digital repositories and web services take a look at their APP profile. It’s a great example of how APP encourages the use of the Atom XML format and RESTful practices, which can then be extended to suit the particular needs of a community of practice.

To understand APP you really only need to grok a handful of concepts from the data model and REST. The data model is basically made up of a service document, which describes a set of collections, which aggregates member entries, which can in turn point to a media entry. All of these types of resources are identified with URLs. Since they are URLs you can interact with the objects with plain old HTTP–just like your web browser. For example you can list the entries in a collection by issuing a GET to the collection URL. Or you can create a member resource by doing a POST to the collection URL. Similarly you can delete a member entry by issuing a DELETE to the member entry. The full details are available in the latest draft of the RFC–and also in a wide variety of articles including this one.

So to perform a SWORD deposit a program would have to:

  1. get the service document for the repository (GET http://www.myrepository.ac.uk/app/servicedocument)
  2. see what collections it can add objects to
  3. create some IMS, METS or DIDL metadata to describe your repository object and ZIP it up with any of the objects datastreams
  4. POST the zip file to the appropriate collection URL with the appropriate X-Format-Namespace to identify the format of the submitted object
  5. check that you got a 201 Created status code and record the Location of the newly created resource
  6. profit!

1 and 2 are perhaps not even necessary if the URL for the target collection is already known. Some notable things about the SWORD profile of APP:

  • two levels of conformance (one really minimalistic one)
  • the idea that collections imply particular treatments or workflows associated with how the object is ingested
  • service documents dynamically change to describe only the collections that a particular user can see
  • no ability to edit resources
  • no ability to delete resources
  • no ability to list collections
  • repository objects are POSTed as ZIP files to collections
  • HTTP Basic Authentication + TLS for security
  • the use of DublinCore to describe collections and their respective policies.
  • collections can support mediated deposit which means deposits can include the X-On-Behalf-Of HTTP header to identify the user to create the resource for.
  • the use of X-Format-Namespace HTTP header to explicitly identify the format of the submission package that is zipped up: for example IMS, METS or DIDL.

While I understand why update and delete would be disabled for deposited packages I don’t really understand why the listing of collections would be disabled. An atom feed for a collection would essentially enable harvesting of a repository, much like ListRecords in OAI-PMH.

I’m not quite sure I completely understand X-On-Behalf-Of and sword:mediation either. I could understand X-On-Behalf-Of in an environment where there is no authentication. But if a user is authenticated couldn’t their username be used to identify who is doing the deposit? Perhaps there are cases (as the doc suggests) where a deposit is done for another user?

All in all this is really wonderful work. Of particular value for me was seeing the list of SWORD extensions and also the use of HTTP status codes. If I have the time I’d like to throw together a sample repository server and client to see just how easy it is to implement SWORD. I did try some experiments along these lines for my presentation back in February…but they never got as well defined as SWORD.

How do 26 Nobel Laureates change a light bulb?

Friday, July 13th, 2007

I don’t know … but it sure is nice to see that 26 Nobel Laureates at least understand the direction libraries ought to be headed:

As scientists and Nobel laureates, we are writing to express our strong support for the House and Senate Appropriations Committees’ recent directives to the NIH to enact a mandatory policy that allows public access to published reports of work supported by the agency. We believe that the time is now for Congress to enact this enlightened policy to ensure that the results of research conducted by NIH can be more readily accessed, shared and built upon ­ to maximize the return on our collective investment in science and to further the public good.

The public at large also has a significant stake in seeing that this research (researched funded by the National Institute of Health) is made more widely available. When a woman goes online to find what treatment options are available to battle breast cancer, she will find many opinions, but peer-reviewed research of the highest quality often remains behind a high-fee barrier. Families seeking clinical trial updates for a loved one with Huntington’s disease search in vain because they do not have a journal subscription. Librarians, physicians, health care workers, students, journalists, and investigators at thousands of academic institutions and companies are currently hindered by unnecessary costs and delays in gaining access to publicly funded research results.

Exciting times for libraries and the medical profession! I just hope they can convince Congress.