Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOIs) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically, what this means is that in all the places in the scholarly publishing ecosystem where DOIs are present (caveat below), it’s now possible to use the Web to retrieve metadata about the electronic document a DOI identifies. Say you’ve got a DOI in the database backing your institutional repository:


you can use the DOI to construct a URL:

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

    curl --location --header "Accept: text/turtle"

At which point you will get back some Turtle flavored RDF that looks like:

    a <> ;
    <> "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    <> <>, <> ;
    <> "10.1038/171737a0" ;  
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> "1953-04-25Z"^^<> ;
    <> "10.1038/171737a0" ;
    <> <> ;
    <> "Nature Publishing Group" ;
    <> "10.1038/171737a0" ;
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .

    a <> ;
    <> "CRICK" ;
    <> "F. H. C." ;
    <> "F. H. C. CRICK" .

    a <> ;
    <> "WATSON" ;
    <> "J. D." ;
    <> "J. D. WATSON" .

Well without all the funky colors…I put them there to help illustrate how the RDF includes some useful information, such as:

  • the document is an Article
  • it has the title “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”
  • the article was published on April 25th, 1953
  • the article was published in the journal Nature
  • the article was written by two people: J. D. Watson and F. H. C. Crick
  • it can be found in volume 171, on pages 737-738
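
The DOI-to-URL step and the HTTP GET described earlier can be sketched in Python with only the standard library. This is a minimal sketch, not CrossRef's own client: the resolver base `https://doi.org/` and the helper name `doi_request` are assumptions, since the exact URL used in this post did not survive. The DOI itself is the one from the example data.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: the public DOI resolver base URL (not taken from this post).
RESOLVER = "https://doi.org/"


def doi_request(doi: str, media_type: str = "text/turtle") -> Request:
    """Build (but do not send) a GET request asking for RDF about a DOI."""
    # Leave "/" unescaped: it separates the DOI prefix from the suffix.
    url = RESOLVER + quote(doi, safe="/")
    # The Accept header is what drives content negotiation on the server.
    return Request(url, headers={"Accept": media_type})


req = doi_request("10.1038/171737a0")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would actually send the request; `urlopen` follows redirects by default, which plays the same role as curl's `--location` flag.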

It’s also interesting that both the Bibliographic Ontology and the Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies are being used. RDF lets you mix in different vocabularies like this. Some people might see this description as partly redundant, but it allows a data publisher to play the field a bit in its descriptions, while still committing to a particular URL for the resource.

Anyhow, the whole point of Linked Data is that you (or your software) can follow your nose by noticing links to related resources of interest in the data. If you are familiar with Turtle and RDF (a more visual diagram is below) you’ll see that the article “Molecular Structure of Nucleic Acids” is “part of” another resource:

If we follow our nose to this URL we get another bit of RDF:

    <> <> ;
    <> <>, <urn:issn:0028-0836> ;
    <> "Nature" ;
    a "" .

Which tells us that the article is part of the journal Nature, which in turn has a “same as” link to a resource in Linked Periodicals Data at the Data Incubator. When we resolve that URL we eventually get some more RDF:

    dc:identifier <info:pmid/0410462>, <info:pmid/0410463> ;
    dc:subject "BIOLOGY", "Biologie", "CIENCIA", "NATURAL HISTORY", "Natuurwetenschappen", "Physique", "SCIENCE", "Science", "Sciences" ;
    dct:publisher <> ;
    dct:subject <>, <>, <>, <>, <> ;
    dct:title "Nature" ;
    bibo:eissn "1476-4687" ;
    bibo:issn "0028-0836", "0090-0028" ;
    bibo:shortTitle "Nat New Biol", "Nature", "Nature New Biol." ;
    a bibo:Journal ;
    owl:sameAs <>, <>, <> ;
    foaf:isPrimaryTopicOf <,1&Search_Arg=0410462&Search_Code=0359&CNT=20&SID=1>, <,1&Search_Arg=0410463&Search_Code=0359&CNT=20&SID=1>, <>, <> .

Which (among other things) tells us that the journal Nature publishes content with the topic of “Biology” from the Library of Congress Subject Headings:

    skos:prefLabel "Biology"@en ;
    dcterms:created "1986-02-11T00:00:00-04:00"^^<> ;
    dcterms:modified "1990-10-09T11:20:35-04:00"^^<> ;
    a skos:Concept ;
    owl:sameAs <info:lc/authorities/sh85014203> ;
    skos:broader <> ;
    skos:closeMatch <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>;
    skos:related <>, <>, <> .

Here we can see the topic of Biology as it relates to other concepts in the Library of Congress Subject Headings, as well as a close match to a similar concept, Biologie générale, in RAMEAU, the subject headings from the Bibliothèque nationale de France.

    a skos:Concept ;
    skos:altLabel "Biologie générale"@fr ;
    skos:broader <> ;
    skos:closeMatch <>, <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <> ;
    skos:note "Domaine : 570"@fr ;
    skos:prefLabel "Biologie"@fr, "FRBNF119440833"@x-notation ;
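
The walk just taken (article, journal, periodical record, LCSH concept, RAMEAU concept) is a breadth-first traversal over a handful of link predicates. Here is a rough sketch of that idea. The four predicate URIs are the standard ones for "part of", "same as", subject, and "close match"; everything under `example.org`, the `FAKE_WEB` table, and the `fetch` callback are hypothetical stand-ins so the sketch runs without touching the network. A real version would fetch and parse RDF for each URI instead.

```python
from collections import deque

# Link predicates worth following, mirroring the hops described above.
LINK_PREDICATES = {
    "http://purl.org/dc/terms/isPartOf",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://purl.org/dc/terms/subject",
    "http://www.w3.org/2004/02/skos/core#closeMatch",
}


def follow_your_nose(start, fetch, max_hops=4):
    """Breadth-first walk over link predicates; returns URIs in visit order."""
    seen, order = {start}, [start]
    queue = deque([(start, 0)])
    while queue:
        uri, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand resources at the hop limit
        for subject, predicate, obj in fetch(uri):
            if subject == uri and predicate in LINK_PREDICATES and obj not in seen:
                seen.add(obj)
                order.append(obj)
                queue.append((obj, hops + 1))
    return order


# Hypothetical data: each URI maps to the triples its description would contain.
EX = "http://example.org/"
FAKE_WEB = {
    EX + "article": [(EX + "article", "http://purl.org/dc/terms/isPartOf", EX + "journal")],
    EX + "journal": [
        (EX + "journal", "http://purl.org/dc/terms/title", "Nature"),
        (EX + "journal", "http://www.w3.org/2002/07/owl#sameAs", EX + "periodical"),
    ],
    EX + "periodical": [(EX + "periodical", "http://purl.org/dc/terms/subject", EX + "lcsh-biology")],
    EX + "lcsh-biology": [
        (EX + "lcsh-biology", "http://www.w3.org/2004/02/skos/core#closeMatch", EX + "rameau-biologie")
    ],
}


def fetch(uri):
    """Stand-in for an HTTP GET plus RDF parse; unknown URIs yield no triples."""
    return FAKE_WEB.get(uri, [])


path = follow_your_nose(EX + "article", fetch)
print(path)
```

The `max_hops` limit and the `seen` set matter in practice: `owl:sameAs` and `skos:closeMatch` links often point back at each other, so an unbounded walk could loop or wander far from the starting resource.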

So at this point maybe you’ll admit that it’s kind of cool to wander around in the data like this. But if you haven’t drunk the Kool-Aid recently (unlikely if you’ve read this far) you might be wondering: what’s the point? Who cares?

I think you should care about this example because it shows:

  1. how an existing organization can leverage its pre-existing identifiers on the Web to enable data publishing (Linked Data)
  2. how important it is for publishers to consider who they link to in their data, and how they do it
  3. how essential the RDF data model is for using the Web to join up these pools (or some may call them silos) of data

The raw Turtle RDF above may have made your eyes glaze over, so it’s worth restating that this new DOI service allows those with DOIs in their databases to use the machinery of the Web to aggregate and join up data from four different organizations: CrossRef, Data Incubator, Library of Congress, and the Bibliothèque nationale de France:

And it’s not just the traditional scholarly publishing community that will potentially benefit from this new Linked Data. As I discovered last August when rooting around in the external links dumps from English Wikipedia, there were 323,805 links from Wikipedia articles to DOI URLs. For example, the article for Molecular Structure of Nucleic Acids has a citation that includes an external link to the DOI URL included above.

CrossRef’s new Linked Data service could allow someone to write a bot to crawl and verify the citations on Wikipedia. Or perhaps there could be a template on Wikipedia that would allow an editor to add a citation to an article by simply using the DOI, which would then fill in the other bits of article metadata needed for display. There are lots of possibilities.
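
To make the template idea a little more concrete, here is a rough sketch of the last step: turning already-retrieved metadata into citation wikitext. It assumes the RDF has been parsed into a plain dictionary (the shape of `meta` and the function name `cite_journal` are hypothetical), and it uses parameter names as I understand English Wikipedia's `cite journal` template to accept them, which should be checked against the template documentation. The values are the ones from the Watson and Crick example.

```python
def cite_journal(meta: dict) -> str:
    """Render a metadata dict as a {{cite journal}} wikitext citation."""
    fields = []
    # Authors become numbered last/first parameter pairs.
    for i, (last, first) in enumerate(meta.get("authors", []), start=1):
        fields += [f"last{i}={last}", f"first{i}={first}"]
    # Emit the remaining fields in a fixed order, skipping any that are missing.
    for key in ("title", "journal", "volume", "pages", "date", "doi"):
        if meta.get(key):
            fields.append(f"{key}={meta[key]}")
    return "{{cite journal |" + " |".join(fields) + "}}"


meta = {
    "authors": [("Watson", "J. D."), ("Crick", "F. H. C.")],
    "title": "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid",
    "journal": "Nature",
    "volume": "171",
    "pages": "737-738",
    "date": "1953-04-25",
    "doi": "10.1038/171737a0",
}
citation = cite_journal(meta)
print(citation)
```

An editor would then only need to supply the DOI; everything else in the citation comes from the publisher's own data.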

As I commented over on the CrossTech blog (not approved yet), it would be handy if the service were able to parse and act on non-simple Accept headers during content negotiation, since it’s fairly common for RDF tools like jena, rdflib, arc, redland to send Accept headers with q-values in them. It might also be nice to see support for some simple JSON views, which could be handy for people who get scared off RDF easily. But those are minor quibbles in comparison to the outstanding work that CrossRef have done in getting this service going. Hopefully we’ll see more publishing organizations like DataCite helping to build this data publishing community as well.
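
For anyone unfamiliar with q-values, here is a simplified sketch of what handling them on the server side involves. It is not how CrossRef's service works, and the function names are my own. It also skips parts of the HTTP specification: more specific media ranges do not take precedence over wildcards, and a `q=0` exclusion does not override a matching wildcard.

```python
def parse_accept(header: str):
    """Return (media_type, q) pairs sorted from most to least preferred."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their order in the header.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(header: str, supported: list):
    """Pick the first supported type matching the client's best preference."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        for candidate in supported:
            if media_type in (candidate, "*/*"):
                return candidate
            # Handle ranges such as "application/*".
            if media_type.endswith("/*") and candidate.startswith(media_type[:-1]):
                return candidate
    return None


supported = ["application/rdf+xml", "text/turtle"]
header = "application/rdf+xml;q=0.8, text/turtle;q=0.9, */*;q=0.1"
print(negotiate(header, supported))
```

A service that only compares the Accept header against a fixed string would miss the Turtle preference in a header like this one, which is the situation described above.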

Update: if this topic interests you, and you want to read more about it, definitely check out John Erickson’s blog post DOIs, URIs and Cool Resolution.