genealogy of a typo

I got a Kindle Touch today for Christmas (thanks Kesa!). Admittedly I’m pretty late to this party. As I made ready to purchase my first ebook I hopped over to my GoodReads to-read list to pick something out. I scanned the list quickly, and my eye came to rest on Stephen Ramsay’s recent book Reading Machines. But I got hung up on something irrelevant: the subtitle was Toward and Algorithmic Criticism instead of Toward an Algorithmic Criticism, the latter of which is clearly correct based on the cover image.

Having recently looked at API services for book data I got curious about how the title appeared on other popular web properties, such as Amazon:


Barnes & Noble:

and LibraryThing

I wasn’t terribly surprised not to find it on OpenLibrary. But it does seem interesting that the exact same typo is present on all these book websites as well, while the title appears correct on the publisher’s website:

and at OCLC:

It’s hard to tell for sure, but my guess is that Amazon, Barnes & Noble, and GoogleBooks got the error from Bowker Link (the Books in Print data service), that LibraryThing then picked up the data from Amazon, and that GoodReads similarly picked it up from GoogleBooks. LibraryThing can pull data from a variety of sources, including Amazon; and I’m not entirely sure where GoodReads gets their data from, but it seems likely that it comes from the GoogleBooks API given their other tie-ins with Google.

If you know more about the lineage of data in these services I would be interested to hear it. Specifically, if you have a subscription to Bowker Link it would be great if you could check the title there. It would be nice to live in a world where these sorts of data provenance issues were easier to trace.

polling and pushing with the Times Newswire API

I’ve been continuing to play around with Node.js some. Not because it’s the only game in town, but mainly because of the renaissance that’s going on in the JavaScript community, which is kind of fun and slightly addictive. Ok, I guess that makes me a fan-boy, whatever…

So the latest in my experiments is nytimestream, which is a visualization (ok, it’s just a list) of New York Times headlines using the Times Newswire API. When I saw Derek Willis recently put some work into a Ruby library for the API I got to thinking what it might be like to use Node.js and Socket.IO to provide a push stream of updates. It didn’t take too long. I actually highly doubt anyone is going to use nytimestream much. So you might be wondering why I bothered to create it at all. I guess it was more of an academic exercise than anything, to reinforce some things that Node.js has been teaching me.

Normally if you wanted a web page to dynamically update based on events elsewhere you’d have some code running in the browser routinely poll a webservice for updates. In this scenario our clients (c1, c2 and c3) poll the Times Newswire directly:

But what happens if lots of people start using your application? Yup, you get lots of requests going to the web service…which may not be a good thing, particularly if you are limited to a certain number of requests per day.

So a logical next step is to create a proxy for the webservice, which will reduce hits on the Times Newswire API.
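To put some rough numbers on it (my own illustration, not figures from the Times), the difference between every client polling the API directly and a single proxy polling on everyone’s behalf is stark:

```javascript
// Back-of-the-envelope arithmetic: requests per day hitting the
// Times Newswire API for a given number of pollers.
function requestsPerDay(clients, pollSeconds) {
  return clients * (24 * 60 * 60 / pollSeconds);
}

var direct = requestsPerDay(1000, 30);  // every client polls the API itself
var proxied = requestsPerDay(1, 30);    // only the proxy polls the API
```

With 1,000 clients each polling every 30 seconds that’s 2,880,000 API hits a day; the proxy alone makes just 2,880 of them, no matter how many clients connect to it.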

But still, the client code needs to poll for updates. This can result in the proxy web service needing to field lots of requests as the number of clients increases. You can poll less, but that will diminish the real time nature of your app. If you are interested in having the real time updates in your app in the first place this probably won’t seem like a great solution.

So what if you could have the proxy web service push updates to the clients when it discovers an update?

This is basically what an event-driven webservice application allows you to do (labelled NodeJS in the diagram above). Socket.IO, a library for Node, provides a really nice abstraction for streaming updates to the browser. If you view source on nytimestream you’ll see a bit of code like this:

var socket = io.connect();
socket.on('connect', function() {
  socket.on('story', function(story) {
    // add the new story to the top of the list
    // ...
  });
});
story is a JavaScript object that comes directly from the proxy webservice as a chunk of JSON. I’ve got the app running on Heroku, which currently recommends Socket.IO be configured to only do long polling (xhr-polling). Socket.IO actually supports a bunch of other transports suitable for streaming, including web sockets. xhr-polling basically means the browser keeps a connection open to the server until an update comes down, after which it quickly reconnects to wait for the next update. This is still preferable to constant polling, especially for the NYTimes API which often sees 15 minutes or more go by without an update. Keeping connections open like this can be expensive in more typical web stacks where each connection translates into a thread or process. But this is what Node’s non-blocking IO programming environment fixes up for you.

Just because I could, I added a little easter egg view in nytimestream, which allows you to see new stories come across the wire as JSON when nytimestream discovers them. It’s similar to Twitter’s stream API in that you can call it with curl. It’s different in that, well, there’s hardly the same amount of updates. Try it out with:


The occasional newlines are there to prevent the connection from timing out.

DOIs as Linked Data

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically this means that wherever DOIs are present in the scholarly publishing ecosystem (caveat below), it’s now possible to use the Web to retrieve metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:

    10.1038/171737a0

you can use the DOI to construct a URL:

    http://dx.doi.org/10.1038/171737a0

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

    curl --location --header "Accept: text/turtle" http://dx.doi.org/10.1038/171737a0
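To do the same from Node you’d start by building the request; a sketch (the helper name is my own, and real code would also need to follow the redirect that dx.doi.org returns, just as curl’s --location flag does):

```javascript
// Build the options for an HTTP GET that asks the DOI resolver for
// Turtle rather than the publisher's landing page.
function doiRequest(doi) {
  return {
    host: 'dx.doi.org',
    path: '/' + doi,
    headers: {'Accept': 'text/turtle'}
  };
}

var options = doiRequest('10.1038/171737a0');
// http.get(options, ...) would take it from here
```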

At which point you will get back some Turtle flavored RDF that looks like:

    <http://dx.doi.org/10.1038/171737a0>
    a bibo:Article ;
    dct:title "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    dct:creator <>, <> ;
    bibo:doi "10.1038/171737a0" ;
    bibo:pageEnd "738" ;
    bibo:pageStart "737" ;
    bibo:volume "171" ;
    dct:date "1953-04-25Z"^^xsd:date ;
    dct:identifier "10.1038/171737a0" ;
    dct:isPartOf <> ;
    dct:publisher "Nature Publishing Group" ;
    prism:doi "10.1038/171737a0" ;
    prism:endingPage "738" ;
    prism:startingPage "737" ;
    prism:volume "171" ;
    owl:sameAs <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .

    <>
    a foaf:Person ;
    foaf:familyName "CRICK" ;
    foaf:givenName "F. H. C." ;
    foaf:name "F. H. C. CRICK" .

    <>
    a foaf:Person ;
    foaf:familyName "WATSON" ;
    foaf:givenName "J. D." ;
    foaf:name "J. D. WATSON" .

There’s a lot in there, but the RDF includes some useful information, such as:

  • the document is an Article
  • it has the title “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”
  • the article was published on April 25th, 1953
  • the article was published in the journal Nature
  • the article was written by two people: J. D. Watson and F. H. C. Crick
  • it can be found in volume 171, on pages 737-738

It’s also interesting that both the Bibliographic Ontology and the Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies are being used. RDF lets you mix in different vocabularies like this. Some people might see this description as partly redundant, but it allows a data publisher to play the field a bit in its descriptions, while still committing to a particular URL for the resource.

Anyhow, the whole point of Linked Data is that you (or your software) can follow your nose by noticing links to related resources of interest in the data. If you are familiar with Turtle and RDF (a more visual diagram is below) you’ll see that the article “Molecular Structure of Nucleic Acids” is “part of” another resource:

If we follow our nose to this URL we get another bit of RDF:

    <>
    <> <> ;
    owl:sameAs <>, <urn:issn:0028-0836> ;
    dct:title "Nature" ;
    a bibo:Journal .

Which tells us that the article is part of the journal Nature, which has a “same as” link to a resource in the Linked Periodicals Data at the Data Incubator. When we resolve that URL we eventually get some more RDF:

    <>
    dc:identifier <info:pmid/0410462>, <info:pmid/0410463> ;
    dc:subject "BIOLOGY", "Biologie", "CIENCIA", "NATURAL HISTORY", "Natuurwetenschappen", "Physique", "SCIENCE", "Science", "Sciences" ;
    dct:publisher <> ;
    dct:subject <>, <>, <>, <>, <> ;
    dct:title "Nature" ;
    bibo:eissn "1476-4687" ;
    bibo:issn "0028-0836", "0090-0028" ;
    bibo:shortTitle "Nat New Biol", "Nature", "Nature New Biol." ;
    a bibo:Journal ;
    owl:sameAs <>, <>, <> ;
    foaf:isPrimaryTopicOf <,1&Search_Arg=0410462&Search_Code=0359&CNT=20&SID=1>, <,1&Search_Arg=0410463&Search_Code=0359&CNT=20&SID=1>, <>, <> .

Which (among other things) tells us that the journal Nature publishes content with the topic of “Biology” from the Library of Congress Subject Headings:

    <>
    skos:prefLabel "Biology"@en ;
    dcterms:created "1986-02-11T00:00:00-04:00"^^xsd:dateTime ;
    dcterms:modified "1990-10-09T11:20:35-04:00"^^xsd:dateTime ;
    a skos:Concept ;
    owl:sameAs <info:lc/authorities/sh85014203> ;
    skos:broader <> ;
    skos:closeMatch <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>;
    skos:related <>, <>, <> .

Here we can see the topic of Biology as it relates to other concepts in the Library of Congress Subject Headings, as well as a close match to a similar concept, Biologie générale, from RAMEAU, the subject headings maintained by the Bibliothèque nationale de France.

    <>
    a skos:Concept ;
    skos:altLabel "Biologie générale"@fr ;
    skos:broader <> ;
    skos:closeMatch <>, <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <> ;
    skos:note "Domaine : 570"@fr ;
    skos:prefLabel "Biologie"@fr, "FRBNF119440833"@x-notation .
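That whole walk (article → journal → LCSH → RAMEAU) is what “following your nose” means in practice: at each stop you look for predicates that point somewhere worth dereferencing next. A sketch of that selection step, assuming the Turtle has already been parsed into [subject, predicate, object] arrays:

```javascript
// Predicates followed in the walk above: the "part of", "same as" and
// "close match" links.
var FOLLOW = [
  'http://purl.org/dc/terms/isPartOf',
  'http://www.w3.org/2002/07/owl#sameAs',
  'http://www.w3.org/2004/02/skos/core#closeMatch'
];

// Return the URLs a crawler should fetch next for a given subject.
function linksToFollow(triples, subject) {
  return triples.filter(function(t) {
    return t[0] === subject && FOLLOW.indexOf(t[1]) !== -1;
  }).map(function(t) {
    return t[2];
  });
}

var triples = [
  ['http://dx.doi.org/10.1038/171737a0',
   'http://purl.org/dc/terms/title',
   'Molecular Structure of Nucleic Acids'],
  ['http://dx.doi.org/10.1038/171737a0',
   'http://purl.org/dc/terms/isPartOf',
   'urn:issn:0028-0836']
];

var next = linksToFollow(triples, 'http://dx.doi.org/10.1038/171737a0');
// next is ['urn:issn:0028-0836'] -- the journal is the next stop
```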

So at this point maybe you’ll admit that it’s kind of cool to wander around in the data like this. But if you haven’t drunk the Kool-Aid recently (unlikely if you’ve read this far) you might be wondering: what’s the point? Who cares?

I think you should care about this example because it shows:

  1. how an existing organization can leverage its pre-existing identifiers on the Web to enable data publishing (Linked Data)
  2. how important it is for publishers to consider who they link to in their data, and how they do it
  3. how essential the RDF data model is for using the Web to join up these pools (or some may call them silos) of data

The raw Turtle RDF above may have made your eyes glaze over, so it’s worth restating that this new DOI service allows those with DOIs in their databases to use the machinery of the Web to aggregate and join up data from four different organizations: CrossRef, Data Incubator, the Library of Congress, and the Bibliothèque nationale de France:

And it’s not just the traditional scholarly publishing community that will potentially benefit from this new Linked Data. As I discovered last August when rooting around in the external links dumps from English Wikipedia, there were 323,805 links from Wikipedia articles to For example, the article for Molecular Structure of Nucleic Acids has a citation that includes an external link to the DOI URL included above.

CrossRef’s new Linked Data service could allow someone to write a bot to crawl and verify the citations on Wikipedia. Or perhaps there could be a template on Wikipedia that would allow an editor to add a citation to an article by simply using the DOI, which would then fill in the other bits of article metadata needed for display. There are lots of possibilities.

As I commented over on the CrossTech blog (not approved yet), it would be handy if the service were able to parse and act on non-simple Accept headers during content negotiation, since it’s fairly common for RDF tools like jena, rdflib, arc and redland to send Accept headers with q-values in them. It might also be nice to see support for some simple JSON views, for people who are easily scared off by RDF. But those are minor quibbles in comparison to the outstanding work CrossRef has done in getting this service going. Hopefully we’ll see more publishing organizations like DataCite helping to build out this data publishing community as well.
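For the curious, the q-value issue looks like this: a library might send `Accept: text/html;q=0.5, text/turtle`, meaning it prefers Turtle (an unqualified type has q=1.0). A sketch of the parsing a service needs to do to honor that (not CrossRef’s actual code):

```javascript
// Parse an Accept header into {type, q} pairs, most preferred first.
function parseAccept(header) {
  return header.split(',').map(function(part) {
    var pieces = part.trim().split(';');
    var q = 1.0;  // default quality when no q parameter is given
    pieces.slice(1).forEach(function(param) {
      var m = param.trim().match(/^q=(\d+(\.\d+)?)$/);
      if (m) q = parseFloat(m[1]);
    });
    return {type: pieces[0].trim(), q: q};
  }).sort(function(a, b) { return b.q - a.q; });
}

var prefs = parseAccept('text/html;q=0.5, text/turtle');
// prefs[0].type is 'text/turtle' -- the client prefers RDF
```

A server that only checks whether the header equals `text/turtle` would miss that preference entirely, which is the behavior I bumped into.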

Update: if this topic interests you, and you want to read more about it, definitely check out John Erickson‘s blog post DOIs, URIs and Cool Resolution.