DOIs as Linked Data

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically, what this means is that in all the places in the scholarly publishing ecosystem where DOIs are present (caveat below), it’s now possible to use the Web to retrieve metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:

    10.1038/171737a0

you can use the DOI to construct a URL:

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

    curl --location --header "Accept: text/turtle"
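
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not CrossRef's documented client: the resolver base `https://doi.org/` and the helper names `build_request` and `fetch_metadata` are assumptions made for illustration, and the DOI is the one that appears in the RDF in this post.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the DOI resolver lives at this base URL.
RESOLVER = "https://doi.org/"


def build_request(doi, accept="text/turtle"):
    """Build (but do not send) a GET request asking for `accept`."""
    # Keep "/" unescaped: it separates the DOI prefix from the suffix.
    url = RESOLVER + quote(doi, safe="/")
    return Request(url, headers={"Accept": accept})


def fetch_metadata(doi, accept="text/turtle"):
    """Send the request; urlopen follows redirects, like curl --location."""
    with urlopen(build_request(doi, accept)) as response:
        return response.read().decode("utf-8")


request = build_request("10.1038/171737a0")
print(request.full_url)
print(request.get_header("Accept"))
```

Nothing touches the network until `fetch_metadata` is called, so the request can be inspected first.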

At which point you will get back some Turtle flavored RDF that looks like:

    a <> ;
    <> "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    <> <>, <> ;
    <> "10.1038/171737a0" ;  
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> "1953-04-25Z"^^<> ;
    <> "10.1038/171737a0" ;
    <> <> ;
    <> "Nature Publishing Group" ;
    <> "10.1038/171737a0" ;
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .

    a <> ;
    <> "CRICK" ;
    <> "F. H. C." ;
    <> "F. H. C. CRICK" .

    a <> ;
    <> "WATSON" ;
    <> "J. D." ;
    <> "J. D. WATSON" .

Well without all the funky colors…I put them there to help illustrate how the RDF includes some useful information, such as:

  • the document is an Article
  • it has the title “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”
  • the article was published on April 25th, 1953
  • the article was published in the journal Nature
  • the article was written by two people: J. D. Watson and F. H. C. Crick
  • it can be found in volume 171, on pages 737-738

It’s also interesting that both the Bibliographic Ontology and the Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies are being used. RDF lets you mix in different vocabularies like this. Some people might see this description as partly redundant, but it allows a data publisher to play the field a bit in its descriptions, while still committing to a particular URL for the resource.

Anyhow, the whole point of Linked Data is that you (or your software) can follow your nose by noticing links to related resources of interest in the data. If you are familiar with Turtle and RDF (a more visual diagram is below) you’ll see that the article “Molecular Structure of Nucleic Acids” is “part of” another resource:

If we follow our nose to this URL we get another bit of RDF:

    <> <> ;
    <> <>, <urn:issn:0028-0836> ;
    <> "Nature" ;
    a "" .

Which tells us that the article is part of the journal Nature, which in turn has a “same as” link to a resource in Linked Periodicals Data at the Data Incubator. When we resolve that URL we eventually get some more RDF:

    dc:identifier <info:pmid/0410462>, <info:pmid/0410463> ;
    dc:subject "BIOLOGY", "Biologie", "CIENCIA", "NATURAL HISTORY", "Natuurwetenschappen", "Physique", "SCIENCE", "Science", "Sciences" ;
    dct:publisher <> ;
    dct:subject <>, <>, <>, <>, <> ;
    dct:title "Nature" ;
    bibo:eissn "1476-4687" ;
    bibo:issn "0028-0836", "0090-0028" ;
    bibo:shortTitle "Nat New Biol", "Nature", "Nature New Biol." ;
    a bibo:Journal ;
    owl:sameAs <>, <>, <> ;
    foaf:isPrimaryTopicOf <,1&Search_Arg=0410462&Search_Code=0359&CNT=20&SID=1>, <,1&Search_Arg=0410463&Search_Code=0359&CNT=20&SID=1>, <>, <> .

Which (among other things) tells us that the journal Nature publishes content with the topic of “Biology” from the Library of Congress Subject Headings:

    skos:prefLabel "Biology"@en ;
    dcterms:created "1986-02-11T00:00:00-04:00"^^<> ;
    dcterms:modified "1990-10-09T11:20:35-04:00"^^<> ;
    a skos:Concept ;
    owl:sameAs <info:lc/authorities/sh85014203> ;
    skos:broader <> ;
    skos:closeMatch <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>;
    skos:related <>, <>, <> .

Here we can see the topic of Biology as it relates to other concepts in the Library of Congress Subject Headings, as well as to a similar concept, Biologie générale, from RAMEAU, the subject headings from the Bibliothèque nationale de France.

    a skos:Concept ;
    skos:altLabel "Biologie générale"@fr ;
    skos:broader <> ;
    skos:closeMatch <>, <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <> ;
    skos:note "Domaine : 570"@fr ;
    skos:prefLabel "Biologie"@fr, "FRBNF119440833"@x-notation ;
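
The walk above, from article to journal to subject heading, is just repeated link extraction. Here is a minimal sketch of one step, assuming the RDF has been converted to N-Triples (one triple per line). The `example.org` URIs, the sample data, and the function name `linked_resources` are hypothetical; only the ISSN URN and the journal title come from the data shown earlier. A real crawler would use an RDF library rather than a regular expression.

```python
import re

# Matches one simple N-Triples line: <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def linked_resources(ntriples, subject):
    """Return the URIs that `subject` points at, in document order."""
    links = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip blank lines, comments, blank-node subjects
        s, _predicate, obj = match.groups()
        # Only URI objects (written <...>) can be followed; literals cannot.
        if s == subject and obj.startswith("<") and obj.endswith(">"):
            links.append(obj[1:-1])
    return links


# Hypothetical data shaped like the journal description above.
SAMPLE = """
<http://example.org/journal> <http://example.org/vocab/title> "Nature" .
<http://example.org/journal> <http://example.org/vocab/sameAs> <urn:issn:0028-0836> .
<http://example.org/journal> <http://example.org/vocab/sameAs> <http://example.org/periodical> .
<http://example.org/article> <http://example.org/vocab/isPartOf> <http://example.org/journal> .
"""

next_hops = linked_resources(SAMPLE, "http://example.org/journal")
print(next_hops)
```

Each URI returned is a candidate for the next HTTP GET, which is all "following your nose" amounts to in software.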

So at this point maybe you’ll admit that it’s kind of cool to wander around in the data like this. But if you haven’t drunk the Kool-Aid recently (unlikely if you’ve read this far) you might be wondering: what’s the point? Who cares?

I think you should care about this example because it shows:

  1. how an existing organization can leverage its pre-existing identifiers on the Web to enable data publishing (Linked Data)
  2. how important it is for publishers to consider who they link to in their data, and how they do it
  3. how essential the RDF data model is for using the Web to join up these pools (or some may call them silos) of data

The raw Turtle RDF above may have made your eyes glaze over, so it’s worth restating that this new DOI service allows those with DOIs in their databases to use the machinery of the Web to aggregate and join up data from four different organizations: CrossRef, Data Incubator, Library of Congress, and the Bibliothèque nationale de France.

And it’s not just the traditional scholarly publishing community that will potentially benefit from this new Linked Data. As I discovered last August when rooting around in the external links dumps from English Wikipedia, there were 323,805 links from Wikipedia articles to DOI URLs. For example, the article for Molecular Structure of Nucleic Acids has a citation that includes an external link to the DOI URL included above.

CrossRef’s new Linked Data service could allow someone to write a bot to crawl and verify the citations on Wikipedia. Or perhaps there could be a template on Wikipedia that would allow an editor to add a citation to an article by simply using the DOI, which would then fill in the other bits of article metadata needed for display. There are lots of possibilities.
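
As a rough sketch of the template idea, here is how the metadata recovered above could be turned into a display citation. The dictionary keys and the `format_citation` function are hypothetical, not part of CrossRef's service or any Wikipedia template; the values are the ones listed for the Watson and Crick article earlier in this post.

```python
def format_citation(meta):
    """Render a one-line citation from a flat metadata dictionary."""
    authors = " and ".join(meta["authors"])
    year = meta["date"][:4]  # dates are ISO-style, e.g. 1953-04-25
    return (
        f'{authors} ({year}). "{meta["title"]}". {meta["journal"]} '
        f'{meta["volume"]}: {meta["first_page"]}-{meta["last_page"]}. '
        f'doi:{meta["doi"]}'
    )


# Values taken from the RDF description shown earlier.
watson_crick = {
    "authors": ["J. D. Watson", "F. H. C. Crick"],
    "date": "1953-04-25",
    "title": ("Molecular Structure of Nucleic Acids: "
              "A Structure for Deoxyribose Nucleic Acid"),
    "journal": "Nature",
    "volume": "171",
    "first_page": "737",
    "last_page": "738",
    "doi": "10.1038/171737a0",
}

citation = format_citation(watson_crick)
print(citation)
```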

As I commented over on the CrossTech blog (not approved yet), it would be handy if the service were able to parse and act on non-simple Accept headers during content negotiation, since it’s fairly common for RDF tools like Jena, rdflib, ARC, and Redland to send Accept headers with q-values in them. It might also be nice to see support for some simple JSON views, which could be handy for people who get scared off RDF easily. But those are minor quibbles in comparison to the outstanding work that CrossRef have done in getting this service going. Hopefully we’ll see more publishing organizations like DataCite helping to build this data publishing community as well.
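
To make the content negotiation point concrete, here is a simplified sketch of how a server might rank the media types in an Accept header that carries q-values. It is not CrossRef's implementation, and it ignores wildcards such as `*/*` and other media type parameters; the function name `rank_accept` and the sample header are illustrative assumptions.

```python
def rank_accept(header):
    """Return media types from an Accept header, most preferred first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        quality = 1.0  # q defaults to 1 when it is absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as unacceptable
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, media_type))
    # Sort by descending quality, keeping header order for ties.
    return [media_type for _, _, media_type in sorted(ranked)]


header = "application/rdf+xml;q=0.9, text/turtle, text/html;q=0.5"
preferred = rank_accept(header)
print(preferred)
```

A service that only matches the Accept header as a literal string would miss `text/turtle` here, even though the client prefers it.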

Update: if this topic interests you, and you want to read more about it, definitely check out John Erickson’s blog post DOIs, URIs and Cool Resolution.

16 thoughts on “DOIs as Linked Data”

  1. So what I’m thinking though, is that realistically I’m not going to be able to write software to actually USE this unless I know what vocabularies to expect.

    It’s already odd to me that entirely different vocabularies are being used in the Atom version (PRISM), the Turtle version (mostly DC, but DC used in particular ways; there’s no reason for software to expect an ISSN in dcterms:isPartOf unless it knows to, is there? And why are they using their own local ISSN URIs?), and xml-rdf (I don’t think this is even using the same vocab as the turtle, is it?)

    Am I mis-reading, or is each serialization using an entirely different set of mixed and matched vocabularies?

    And it looks like while you can content-negotiate, what actual vocabularies you get back in a given content-negotiated format is completely up in the air: it’s up to the registrar, and could be different for every DOI or change at any time.

    It strikes me that knowing that something is “atom” or “turtle” or “rdf+xml” is NOT enough to write software that can consume it and do something with it. Those really end up being more ‘serialization’ formats than actual useful metadata formats — whether software is going to be able to do something with it is all about the vocabularies (expressed as namespaces in atom).

    So, while a human can look at these to ‘follow their nose’… I’d really love my software (such as Umlaut) to be able to make use of it too, but unless DOI does a bit of standardization here and documents what vocabularies one can expect… it’s not clear to me how to do that. A promising first step, but it seems to me a second step is needed to make it more than a toy/proof of concept. No?

    (And yeah, I posted a comment to this effect on their blog too. I’m thinking that our comments are lost to the aether never to be approved).

  2. @jonathan I agree that it would be nice to see a page that describes the new CrossRef Linked Data service, and which documents the RDF vocabularies that are used. I did double-check that the application/rdf+xml and text/turtle serializations are saying the same thing (a sorted ntriples view + diff is handy for this btw).

    Personally, I wouldn’t worry too much about the vocabulary changing significantly. At this point Dublin Core is kind of the lingua franca of metadata on the Web, at least in the Linked Data space. It is a bit worrisome that the PRISM vocabulary doesn’t seem to be defined at the URIs that they are using, so if you are looking to be cautious I’d probably shy away from burning any of those into your software.

    Another thing to think about: is the potentially variable use of RDF vocabulary really anything new? For example an API provider could change the structure of the JSON being delivered at any time. It’s arguable that JSON tends to be a lot simpler, so there’s less brittleness. Also JSON driven APIs tend to be documented and versioned more explicitly.

    Your comments got me thinking about how important the documentation around these data APIs is. Apart from providing much needed information about the reason why the API is there, and how to use it, they provide context which really forms a foundation for trust around the service, which in turn emboldens people to start to depend on the service in other services like the Umlaut.

  3. Ah, Ed, but I’ve GOT to use the PRISM elements to do what I want to do, just DC is not nearly sufficient for my use cases.

    It’s true that people providing APIs that change all the time without notice or documentation is nothing new — but we know those as ‘crappy APIs’.

    I think some people are misunderstanding the practical utility of RDF, thinking “Oh, see, it’s in RDF, that automatically makes it machine useable.” But without documentation and a commitment to standardization of the vocabularies in certain ways, I can’t write code to use it, at least not with any way to predict what proportion of identifiers (DOIs in this case) it will actually work with (do they all have PRISM or just some? Who knows), or any way to predict if it will keep working in the future.

    As near as I can tell from reading their announcement, exactly what metadata to provide is _completely_ up to the “registration agency”, such as CrossRef or DataCite. An individual registration agency doesn’t have to provide the same vocabularies (or even the same top-level serialization formats) for every identifier; it doesn’t need to keep them the same over time; the agencies don’t need to be consistent amongst themselves (CrossRef vs DataCite); and they don’t need to TELL us what they’re planning on providing or how often it will change.

    From my perspective, that makes it unworkable for me to invest software development resources in it. It needs the next step, which is some documentation/standardization of some kind. And I think DOI needs to take that next step and require that of the registration agencies. And I think they need to hear that from people like us, or rather people like you who understand this stuff way better than me. Everyone just tells me I’m missing the true power (of having no idea what data to expect, heh).

  4. @jonathan I think you are reading between the lines with the CrossRef linked data offerings quite a bit…which is understandable given the amount of writing that has been done about it. I would see what you can use in the data, and try to trust that it won’t change much before assuming that it will. If I’m going to read a bit between your lines I would hazard a guess that this might be the first time you’ve thought about trying to consume some RDF in an application…and it’s normal (and wise) to be a bit nervous about that…

  5. Ed, where do you think I’m reading between the lines? I’m confused.

    They seem to be saying that the registration agencies themselves will be responsible for those alternate representations, not DOI central. No? “It also means that, as registration agency members (CrossRef publishers, for instance) start providing more complete and richer representations of their content, we can simply redirect content-negotiated requests directly to them.”

    And they make no mention of any standardization of vocabularies or serializations.

  6. @jonathan yeah I see your point … they could change the service to redirect to representations hosted by publishers in the future. For some reason I don’t see that happening particularly soon. It would definitely make the situation more complicated if that happened without some vocabulary consolidation. But I don’t think it’s necessarily worthwhile to worry too much about a hypothetical situation right now.
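
The "sorted ntriples view + diff" check mentioned in the second comment can be sketched with the standard library alone. This assumes both serializations have already been converted to N-Triples by an RDF tool, and that blank node labels (if any) match between the two; the sample triples and the function name `diff_ntriples` are hypothetical.

```python
import difflib


def diff_ntriples(left, right):
    """Diff two N-Triples documents, ignoring triple order and duplicates."""
    def normalize(text):
        # One triple per line, so sorting the unique lines gives a stable view.
        return sorted({line.strip() for line in text.splitlines() if line.strip()})

    return list(difflib.unified_diff(
        normalize(left), normalize(right),
        fromfile="rdf+xml", tofile="turtle", lineterm="",
    ))


# Hypothetical triples: same statements, different order.
FROM_RDFXML = """
<http://example.org/article> <http://example.org/vocab/volume> "171" .
<http://example.org/journal> <http://example.org/vocab/title> "Nature" .
"""
FROM_TURTLE = """
<http://example.org/journal> <http://example.org/vocab/title> "Nature" .
<http://example.org/article> <http://example.org/vocab/volume> "171" .
"""

differences = diff_ntriples(FROM_RDFXML, FROM_TURTLE)
print(differences or "serializations agree")
```

An empty result means the two serializations carry the same set of triples; any `+` or `-` lines point at statements present in only one of them.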
