DOIs as Linked Data

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically what this means is that anywhere in the scholarly publishing ecosystem where a DOI is present (caveat below), it’s now possible to use the Web to retrieve metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:


you can use the DOI to construct a URL:

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

curl --location --header "Accept: text/turtle"

At which point you will get back some Turtle flavored RDF that looks like:

    a <> ;
    <> "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    <> <>, <> ;
    <> "10.1038/171737a0" ;  
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> "1953-04-25Z"^^<> ;
    <> "10.1038/171737a0" ;
    <> <> ;
    <> "Nature Publishing Group" ;
    <> "10.1038/171737a0" ;
    <> "738" ;
    <> "737" ;
    <> "171" ;
    <> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .

    a <> ;
    <> "CRICK" ;
    <> "F. H. C." ;
    <> "F. H. C. CRICK" .

    a <> ;
    <> "WATSON" ;
    <> "J. D." ;
    <> "J. D. WATSON" .

Well without all the funky colors…I put them there to help illustrate how the RDF includes some useful information, such as:

  • the document is an Article
  • it has the title “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”
  • the article was published on April 25th, 1953
  • the article was published in the journal Nature
  • the article was written by two people: J. D. Watson and F. H. C. Crick
  • it can be found in volume 171, on pages 737-738

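For folks who want to do this in code rather than with curl, here’s a minimal Python sketch of the same content-negotiated request. It assumes the dx.doi.org resolver and reuses the Watson & Crick DOI from above; the actual network fetch is left commented out:

```python
from urllib.request import Request, urlopen

def doi_request(doi):
    """Build a content-negotiated request for a DOI's RDF metadata.

    Assumes the dx.doi.org resolver; the --location flag in the curl
    example corresponds to urllib following redirects for us."""
    url = "http://dx.doi.org/" + doi
    return Request(url, headers={"Accept": "text/turtle"})

req = doi_request("10.1038/171737a0")
print(req.full_url)              # http://dx.doi.org/10.1038/171737a0
print(req.get_header("Accept"))  # text/turtle

# to actually retrieve the Turtle (needs network access):
# turtle = urlopen(req).read().decode("utf-8")
```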
It’s also interesting that both the Bibliographic Ontology and the Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies are being used. RDF lets you mix in different vocabularies like this. Some people might see this description as partly redundant, but it allows a data publisher to play the field a bit in its descriptions, while still committing to a particular URL for the resource.

Anyhow, the whole point of Linked Data is that you (or your software) can follow your nose by noticing links to related resources of interest in the data. If you are familiar with Turtle and RDF (a more visual diagram is below) you’ll see that the article “Molecular Structure of Nucleic Acids” is “part of” another resource:

If we follow our nose to this URL we get another bit of RDF:

    <> <> ;
    <> <>, <urn:issn:0028-0836> ;
    <> "Nature" ;
    a "" .

Which tells us that the article is part of the journal Nature, which is “same as” a resource in the Linked Periodicals Data at the Data Incubator. When we resolve that URL we eventually get some more RDF:

    dc:identifier <info:pmid/0410462>, <info:pmid/0410463> ;
    dc:subject "BIOLOGY", "Biologie", "CIENCIA", "NATURAL HISTORY", "Natuurwetenschappen", "Physique", "SCIENCE", "Science", "Sciences" ;
    dct:publisher <> ;
    dct:subject <>, <>, <>, <>, <> ;
    dct:title "Nature" ;
    bibo:eissn "1476-4687" ;
    bibo:issn "0028-0836", "0090-0028" ;
    bibo:shortTitle "Nat New Biol", "Nature", "Nature New Biol." ;
    a bibo:Journal ;
    owl:sameAs <>, <>, <> ;
    foaf:isPrimaryTopicOf <,1&Search_Arg=0410462&Search_Code=0359&CNT=20&SID=1>, <,1&Search_Arg=0410463&Search_Code=0359&CNT=20&SID=1>, <>, <> .

Which (among other things) tells us that the journal Nature publishes content with the topic of “Biology” from the Library of Congress Subject Headings:

    skos:prefLabel "Biology"@en ;
    dcterms:created "1986-02-11T00:00:00-04:00"^^<> ;
    dcterms:modified "1990-10-09T11:20:35-04:00"^^<> ;
    a skos:Concept ;
    owl:sameAs <info:lc/authorities/sh85014203> ;
    skos:broader <> ;
    skos:closeMatch <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>;
    skos:related <>, <>, <> .

Here we can see the topic of Biology as it relates to other concepts in the Library of Congress Subject Headings, as well as a similar concept, Biologie générale, from RAMEAU, the subject headings of the Bibliothèque nationale de France.

    a skos:Concept ;
    skos:altLabel "Biologie générale"@fr ;
    skos:broader <> ;
    skos:closeMatch <>, <> ;
    skos:inScheme <>, <> ;
    skos:narrower <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <> ;
    skos:note "Domaine : 570"@fr ;
    skos:prefLabel "Biologie"@fr, "FRBNF119440833"@x-notation .

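All of this nose-following is mechanical enough to automate. Here’s a toy Python sketch of the first step: pulling candidate URIs out of a Turtle description so a crawler can decide which to fetch next. The sample data is hand-written for illustration, and a real application would parse the RDF properly with something like rdflib instead of pattern matching:

```python
import re

def extract_links(turtle):
    """Naively pull the URIs out of a Turtle description so we can
    decide which ones to follow next. Real code should use a proper
    RDF parser (e.g. rdflib) rather than a regular expression."""
    return re.findall(r"<(https?://[^>]+)>", turtle)

# a tiny hand-written description standing in for a CrossRef response
sample = """
<http://dx.doi.org/10.1038/171737a0>
    dcterms:isPartOf <http://periodicals.dataincubator.org/journal/nature> .
"""
print(extract_links(sample))
```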
So at this point maybe you’ll admit that it’s kind of cool to wander around in the data like this. But if you haven’t drunk the Kool-Aid recently (unlikely if you’ve read this far) you might be wondering: what’s the point? Who cares?

I think you should care about this example because it shows:

  1. how an existing organization can leverage its pre-existing identifiers on the Web to enable data publishing (Linked Data)
  2. how important it is for publishers to consider who they link to in their data, and how they do it
  3. how essential the RDF data model is for using the Web to join up these pools (or some may call them silos) of data

The raw Turtle RDF above may have made your eyes glaze over, so it’s worth restating that this new DOI service allows those with DOIs in their databases to use the machinery of the Web to aggregate and join up data from 4 different organizations: CrossRef, Data Incubator, Library of Congress, and the Bibliothèque nationale de France.

And it’s not just the traditional scholarly publishing community that will potentially benefit from this new Linked Data. As I discovered last August when rooting around in the external links dumps from English Wikipedia, there were 323,805 links from Wikipedia articles to DOI URLs. For example, the article for Molecular Structure of Nucleic Acids has a citation that includes an external link to the DOI URL included above.

CrossRef’s new Linked Data service could allow someone to write a bot to crawl and verify the citations on Wikipedia. Or perhaps there could be a template on Wikipedia that would allow an editor to add a citation to an article by simply using the DOI, which would then fill in the other bits of article metadata needed for display. There are lots of possibilities.
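A citation-checking bot like the one imagined here would start by pulling DOIs out of article text. A hedged sketch, assuming a simplified DOI pattern (the real DOI syntax is looser than this regular expression):

```python
import re

# simplified: real DOI suffixes allow a wider range of characters
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>|]+')

def find_dois(wikitext):
    """Return the DOI-like strings found in a chunk of article text."""
    return DOI_PATTERN.findall(wikitext)

citation = '{{cite journal | doi = 10.1038/171737a0 | title = Molecular Structure of Nucleic Acids}}'
print(find_dois(citation))  # ['10.1038/171737a0']
```

Each DOI found this way could then be resolved against the CrossRef service to check that the rest of the citation matches the publisher’s metadata.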

As I commented over on the CrossTech blog (not approved yet), it would be handy if the service were able to parse and act on non-simple Accept headers during content negotiation, since it’s fairly common for RDF tools like jena, rdflib, arc and redland to send Accept headers with q-values in them. It might also be nice to see support for some simple JSON views, which could be handy for people who are easily scared off by RDF. But those are minor quibbles in comparison to the outstanding work that CrossRef has done in getting this service going. Hopefully we’ll see more publishing organizations like DataCite helping to build up this data publishing community as well.
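To give a sense of what honoring those richer Accept headers involves, here’s a simplified sketch of server-side negotiation that picks the best supported media type from a q-valued header. It ignores wildcards and media-type parameters other than q, which a production implementation would need to handle:

```python
def best_match(accept_header, supported):
    """Pick the supported media type the client prefers, honoring
    q-values. Simplified: no wildcards, no non-q parameters."""
    prefs = []
    for part in accept_header.split(","):
        pieces = part.strip().split(";")
        mtype = pieces[0].strip()
        q = 1.0  # RFC default when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs.append((q, mtype))
    for q, mtype in sorted(prefs, reverse=True):
        if q > 0 and mtype in supported:
            return mtype
    return None

# the kind of header an RDF library might send
header = "application/rdf+xml;q=0.9, text/turtle, text/html;q=0.3"
print(best_match(header, ["text/turtle", "text/html"]))  # text/turtle
```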

Update: if this topic interests you, and you want to read more about it, definitely check out John Erickson’s blog post DOIs, URIs and Cool Resolution.

geeks bearing gifts

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

  • Create a persistent identifier: DOI or ARK
  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depends upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into your web browser (browsers are now ubiquitous) and view a document that is about the identified resource.

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging community call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forwards. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks; namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As another crank (I mean this with all due respect) Ted Nelson put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.

routers, webcams and thermometers

If you have a local wi-fi network at home you probably use something like a Linksys wireless router to let your laptop and other devices connect to the Internet. When you bought it and plugged it in you probably followed the instructions, typed the router’s address into your web browser, and visited a page to configure the router: setting its name, admin password, etc.

Would you agree that this router sitting on top of your TV, or wherever it is, is a real world thing? It’s not some abstract concept of a router: you can pick it up, turn it off and on, take it apart and try to put it back together again. And the router is identified with a URL. When your web browser resolves the URL for your router it gets back some HTML that lets you see the router’s current state, and make modifications to it. You don’t get the router itself. That would be silly right?

In terms of REST, the router is a Resource that has a URL Identifier, which when resolved returns an HTML Representation of the Resource. But you don’t really have to think about it much at all, because it’s intuitively part of how you use the web every day.

In fact the Internet is strewn with online devices that have embedded web servers in them. A 5 year old BoingBoing article More Googleable Unsecured Webcams shows how you can drop a web search for inurl:”view/index.shtml” into Google, and get back thousands of webcams from around the world. You can zoom and pan these cameras using your web browser. These are URLs for real world cameras. When you put the URL in your browser you don’t get the camera itself, that’s crazy talk; instead you get some HTML describing the camera’s current state, and some form controls for changing its position. Again all is well in the REST world, where the camera is the Resource identified with a URL, and your browser receives a Representation of the Resource.

If you are an Arduino hacker you might follow some instructions to build an online thermometer. You wire up the temperature sensor, and configure the Arduino to listen for HTTP requests at a particular IP address. You can then visit a URL in your web browser, and the server returns a Representation of the current temperature. It doesn’t return the Arduino board, the thermometer, or the thermodynamic state of its environment…that’s crazy talk. It returns a Representation of the temperature.
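The same division of labor is easy to see in code. This Python sketch (with a made-up sensor reading standing in for real hardware) keeps the thermometer’s state separate from the Representation the embedded server returns:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_temperature():
    """Stand-in for reading the actual sensor; returns degrees Celsius.
    On real hardware this would talk to the temperature probe."""
    return 21.5  # hypothetical value

def representation(celsius):
    """The Representation of the temperature that the server returns --
    not the thermometer itself."""
    return "current temperature: %.1f C" % celsius

class Thermometer(BaseHTTPRequestHandler):
    def do_GET(self):
        body = representation(read_temperature()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# on the device you would run something like:
# HTTPServer(("", 8000), Thermometer).serve_forever()
```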

So imagine I want to give myself a URL. Is this so different than the camera, the router and the thermometer? Sure, I don’t have a web server embedded in me. But even if I did nobody would expect it to return me would they? Just as in the other cases, people would expect a Representation of me to be returned. Heck, there are millions of OpenID URLs deployed for people already. But this argument is used time and time again in the Semantic Web and Linked Data community to justify the need for elaborate, byzantine, hard to explain HTTP behavior when making RDF descriptions of real world things available. The pattern has been best described in the Cool URIs for the Semantic Web W3C Note. I understand it. But if you’ve ever had to explain it to a web developer not already brainwashed^w familiar with the pattern you will understand that it is hard to explain convincingly. It’s even harder to implement correctly, since you are constantly asking yourself nonsensical questions like “is this an Information Resource?” when you are building your application.

I was pleased to see Ian Davis’ recent well-articulated posts about whether the complicated HTTP behavior is essential for deploying Linked Data. I know I am biased because I was introduced to much of the Semantic Web and Linked Data world when Ian Davis and Sam Tunnicliffe visited the Library of Congress three years ago. I agree with Ian’s position: the current situation with the 303 redirect is potentially wasteful, error prone and bordering on the absurd…and the Linked Data community could do a lot to make it easier to deploy Linked Data. At its core, Ian’s advice in Guide to Publishing Linked Data Without Redirects does a nice job of making Linked Data publishing seem familiar to folks who have used HTTP’s content-negotiation features to enable internationalization, or who have built RESTful web services. A URL for a resource that has a more constrained set of representations allows for Agent Driven Negotiation in situations where custom tuning the Accept header in the client isn’t convenient or practical. Providing a pattern for linking these resources together with something like wdrs:describedby and/or the describedby relation that’s now available in RFC 5988 is helpful for people building REST APIs and Linked Data applications.
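The describedby pattern boils down to one extra link from the thing to its description. A small sketch of emitting it as an RFC 5988 Link header (the URL is a placeholder):

```python
def describedby_header(description_url):
    """Build an RFC 5988 Link header value pointing at a resource's
    description, for use in an HTTP response about the resource."""
    return '<%s>; rel="describedby"' % description_url

# e.g. a response about a real-world thing could carry:
print(describedby_header("http://example.org/about/router"))
# <http://example.org/about/router>; rel="describedby"
```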

At the end of the day, it would be useful if the W3C could de-emphasize httpRange-14, simplify the Architecture of the World Wide Web (by removing the notion of Information Resources), and pave the cowpaths we are already seeing for Real World Objects on the Web. It would be great to have a W3C document that guided people on how to put URIs for things on the web, that fit with how people are already doing it, and made intuitive sense. We’re already used to things like our routers, cameras and thermometers being on the web, and my guess is we’re going to see much, much more of it in the coming years. I don’t think a move like this would invalidate documents like Cool URIs for the Semantic Web, or make the existing Linked Data that is out there somehow wrong. It would simply lower the bar for people who want to publish Linked Data, and who don’t necessarily want to go through the process of using URIs to distinguish non-Information Resources from Information Resources.

If the W3C doesn’t have the stomach for it, I imagine we’ll see the IETF lead the way, or innovation will happen elsewhere, as it did with HTML5.