I spent an hour checking out the HathiTrust API docs this morning, mainly to see what the similarities and differences are with the as-yet undocumented API for Chronicling America. There are quite a few similarities in the general RESTful approach, and in the use of Atom, METS and PREMIS in the metadata that is made available.

Everyone’s a critic, right? Nevertheless, I’m just going to jot down a few thoughts about the API, mainly for my friend over in #code4lib, Bill Dueber, who works on the project. Let me just say at the outset that I think it’s awesome that HathiTrust are providing this API, especially given some of the licensing constraints around some of the content. The API is a good example of putting library data on the web using both general and special purpose standards. But I think there are a few minor things that could be tweaked to make the API fit into the web and the repository space a bit better.

It would be nice if the OpenSearch description document referenced in the HTML at


worked. It should be pretty easy and non-invasive to add a basic description file for the HTML response since the search is already GET driven. Ideally it would be nice to see the responses also available as Atom and/or JSON with Atom Feed Paging.
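
To give a sense of how small such a description file can be, here is a rough Python sketch that builds a minimal OpenSearch 1.1 description for an HTML, GET-driven search. The short name, description and the example.org URL template are placeholders I made up for illustration; they are not HathiTrust's actual search endpoint or parameters.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OpenSearch 1.1 specification.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, html_template):
    """Return a minimal OpenSearch description document as a string.

    html_template is a URL template in which {searchTerms} marks where
    the user's query goes. All values passed in below are hypothetical.
    """
    # Serialize with OpenSearch as the default namespace (no prefix).
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    # One Url element per response format; here only the HTML results page.
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": html_template},
    )
    return ET.tostring(root, encoding="unicode")


doc = build_description(
    "Example Search",
    "Full-text search (hypothetical example)",
    "https://example.org/search?q={searchTerms}",
)
print(doc)
```

Adding Atom or JSON responses later would just mean adding more Url elements with different type values, which is part of why this seems like a low-cost first step.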

Another thing that would be nice to see is the API being merged more into the human usable webapp. The best way to explain this is with an example. Consider the HTML page for this 1914 edition of Walt Whitman’s Leaves of Grass, available with this clean URI:


Now, you can get a few flavors of metadata for this book, and an aggregated zip file of all the page images and OCR if you are a HathiTrust member. Why not make these alternate representations discoverable right from the item display? It could be as simple as adding some <link> elements to the HTML, that use the link relations they’ve already established for their Atom:

If you wanted to get fancy you could also put human readable links into the <body> and annotate them with RDFa. But this would just be icing on the cake. There are a few reasons for doing at least the bare minimum. The big one is to enable in-browser applications (like Zotero) to learn more about a given resource in a relatively straightforward and commonplace way. The other big one is to let automated agents like Googlebot, Yahoo! Slurp and Internet Archive’s Heritrix discover the deep web data that’s held behind your API. Another nice side effect is that it helps people who might ordinarily scrape your site discover the API in a straightforward way.
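
Here is a rough sketch of what that discovery could look like from the client side, using only the Python standard library. The sample HTML, the example.org URLs and the generic "alternate" link relation are all assumptions made for illustration; a real item page would presumably use whatever relations the API has already established for its Atom.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (media type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated relation names.
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append((attr.get("type"), attr["href"]))


# Hypothetical item page; the URLs are placeholders, not real endpoints.
SAMPLE_HTML = """
<html><head>
<title>Example item page</title>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/atom+xml" href="https://example.org/item/123.atom">
<link rel="alternate" type="application/json" href="https://example.org/item/123.json">
</head><body><p>Item display</p></body></html>
"""


def find_alternates(html):
    """Return the alternate representations advertised in an HTML page."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.alternates


alternates = find_alternates(SAMPLE_HTML)
for media_type, href in alternates:
    print(media_type, href)
```

A browser plugin or crawler only needs something about this size to get from the human-facing page to the machine-readable data, with no out-of-band knowledge of the API.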

Lastly, I was curious to know if HathiTrust considered adjusting their Atom response to use the Atom pattern recommended by the OAI-ORE folks. They are pretty close already, and in fact seem to have modeled their own aggregation vocabulary on OAI-ORE. It would be interesting to hear why they diverged, if it was intentional, and whether it might be possible to use a bit of OAI-ORE in there so we can bootstrap an OAI-ORE harvesting ecosystem.

I’m not sure that I can still call this approach to integrating web2.0 APIs into web1.x applications Linked Data anymore, since it doesn’t really involve RDF directly. It does involve thinking in a RESTful way about the resources you are publishing on the web, and how they can be linked together to form a graph. My colleague Dan has been writing in Computers in Libraries recently about how perhaps thinking in terms of “building a better web” may be a more accurate way of describing this activity.

For reasons I don’t fully understand I’ve been reading a lot of Wittgenstein (well mainly books about Wittgenstein honestly) lately during the non-bike commute. The trajectory of his thought over his life is really interesting to me. He had this zen-like, controversial idea that

Philosophy simply puts everything before us, and neither explains nor deduces anything. — Since everything lies open to view there is nothing to explain. For what is hidden, for example, is of no interest to us. (PI 126)

I really like this idea that our data APIs on the web could be “open to view” by checking out the HTML, following your nose, and writing scrapers, bots and browser plugins to use what you find. I think it’s unfortunate that the recent changes to the Linked Data Design Issues, and the ensuing discussion, seemed to create this dividing line about the use of RDF and SPARQL. I had always hoped (and continue to hope) that the Linked Data effort is bigger than a particular brand, or reformulation of the semantic web effort … for me it’s a pattern for building a better web. I think RDF is very well suited to expressing the core nature of the web, the Giant Global Graph. I’ve served up RDF representations in applications I’ve worked on for just this reason. But I think the Linked Data pattern will thrive most if it is thought of as an inclusive continuum of efforts, similar to what Dan Brickley has suggested. We technology people strive for explicitness, it’s an occupational hazard – but there’s sometimes quite a bit of strength in ambiguity.

Anyhow, my little review of the HathiTrust API turned into a bit of a soapbox for me to stand on and shout like a lunatic. I guess I’ve been wanting to write about what I think Linked Data is for a few weeks now, and it just kinda bubbled up when I least expected it. Sorry Bill!