Linked Library Data at the Deutschen Nationalbibliothek

Just last week Lars Svensson from the Deutschen Nationalbibliothek (German National Library aka DNB) made a big announcement that they have released their authority data as Linked Data for the world to use. What this means is that there are now unique URLs (and machine readable data at the other end of them) for:

1.8 million authors from the Personennamendatei (PND)
1.3 million corporate bodies from the Gemeinsame Körperschaftsdatei (GKD)
187,000 subject headings from the Schlagwortnormdatei (SWD)
51,000 Dewey Decimal Classification categories

The full dataset that the DNB has made available for download amounts to 38,849,113 individual statements (aka triples). Linked Data enthusiasts who are used to thinking in terms of billions of triples might not even blink when seeing these numbers. But it is important to remember that these data assets have been curated by a network of German, Austrian and Swiss libraries for close to a hundred years, as they documented (and continue to document) all known German-language publications.

The simple act of making each of these authority records URL addressable means that they can now meaningfully participate in the global information space some call the Web of Data. It’s true, the records were available as part of the DNB’s Online Catalog before they were released as Linked Data. What’s new is that the DNB has committed to using persistent URLs to identify these records, using a new host name d-nb.info in combination with their own record identifiers. This means that people can persistently link to these DNB resources in their own web applications and data. Another subtle thing, and really the heart of what the Linked Data pattern offers, is the ability to use the same URL to retrieve the record as structured metadata. The important thing about having machine readable data is that it allows other applications to easily re-purpose the information, much like libraries have done traditionally by shipping around batches of Machine Readable Cataloging (MARC) records. Here’s a practical example:

The URL http://d-nb.info/gnd/119053071 identifies the author Herta Müller, who won the Nobel Prize for Literature in 2009. If you load that URL in your web browser by clicking on it, you should see a web page (HTML) for the authority record describing Herta Müller. But if a web client requests that same URL asking for RDF it will (via a redirect) get the same authority record as RDF. RDF is more a data model than a particular file format, so it has a variety of serializations. The server at d-nb.info returns RDF/XML, and they have made their data dumps available in N-Triples, but I’m rather fond of the Turtle serialization, which is kind of JSON-ish and makes the RDF a bit more readable. Here is the RDF (as Turtle) for Herta Müller that the DNB makes available:

@prefix gnd: <http://d-nb.info/gnd/> .
@prefix rdaGr2: <http://RDVocab.info/ElementsGr2/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://d-nb.info/gnd/119053071>
    rdaGr2:biographicalInformation "Rumän.-dt. Schriftstellerin und Essayistin, lebt seit 1987 in Deutschland, Literaturnobelpreisträgerin 2009"@de ;
    rdaGr2:dateOfBirth "1953" ;
    rdaGr2:identifierForThePerson "(DE-588)119053071", "(DE-588c)4293331-6", "(DLC)n  86833524" ;
    rdaGr2:placeOfBirth "Nitzkydorf (Banat)"@de ;
    rdaGr2:placeOfResidence "Berlin"@de ;
    rdaGr2:professionOrOccupation <http://d-nb.info/gnd/4053311-6> ;
    gnd:countryCodeForThePerson "XA-RO" ;
    gnd:preferredNameForThePerson [
        gnd:foreName "Herta" ;
        gnd:surname "Müller" ;
        gnd:usedRules "RAK-WB"
    ], "Müller, Herta" ;
    gnd:studyPathsOfThePerson "Germanistik, Romanistik"@de ;
    gnd:variantNameForThePerson [
        gnd:foreName "Cherta" ;
        gnd:surname "Myller" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:personalName "Heta-Mulei" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:foreName "Heta" ;
        gnd:surname "Mulei" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:foreName "Herta" ;
        gnd:surname "Müller" ;
        gnd:usedRules "AACR"
    ], [
        gnd:foreName "Heruta" ;
        gnd:surname "Myur?" ;
        gnd:usedRules "RAK-WB"
    ], "Heta-Mulei", "Mulei, Heta", "Müller, Herta", "Myller, Cherta", "Myur?, Heruta" ;
    owl:sameAs <http://dbpedia.org/resource/Herta_M%C3%BCller>, <http://viaf.org/viaf/12324250> ;
    foaf:page <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .
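
The "asking for RDF" part of the request boils down to an HTTP Accept header. Here is a minimal Python sketch of it, using only the standard library. The URL is the one from this post; the function names are made up for illustration, and fetch_rdf assumes network access and a server that still behaves as described above, so it is defined but not called here.

```python
import urllib.request

# URL taken from the post; application/rdf+xml is the RDF/XML media type.
AUTHORITY_URL = "http://d-nb.info/gnd/119053071"


def build_rdf_request(url, media_type="application/rdf+xml"):
    """Build a GET request that asks the server for RDF instead of HTML."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_rdf(url):
    """Fetch the RDF for a record; urlopen follows the redirect for us.

    Assumption: needs network access, a server that still honours the
    Accept header, and a UTF-8 encoded response.
    """
    with urllib.request.urlopen(build_rdf_request(url)) as response:
        return response.read().decode("utf-8")


# Inspect the request without touching the network.
request = build_rdf_request(AUTHORITY_URL)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

A browser sends an Accept header that prefers HTML, which is why the same URL shows a web page there.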

A few interesting things to note in this example are the use of the RDA Group 2 Entities vocabulary and the GND vocabulary to describe Herta Müller. RDF vocabularies are explicit ways of describing resources like people, places, topics, etc. When different things are described using the same vocabulary (or the vocabularies themselves are related together in a particular way) it becomes possible to merge the descriptions, and build software on top of them. So the DNB’s choice of RDA and GND is quite significant. Normally the URL for an RDF schema will return a description of that schema known as a Namespace Document. Namespace Documents are handy for understanding what exactly the vocabulary means, and how it might relate to other RDF vocabularies on the web. This is the case for the RDA vocabulary, but the GND vocabulary namespace doesn’t appear to be resolving to anything that describes the GND vocabulary.

Another really interesting thing to note about this RDF for Herta Müller is the set of links to Wikipedia (http://de.wikipedia.org/wiki/Herta_M%C3%BCller), VIAF (http://viaf.org/viaf/12324250) and dbpedia (http://dbpedia.org/resource/Herta_M%C3%BCller). These are important because they contextualize the DNB record for Herta Müller by relating it to other records for her, thus allowing it to be disambiguated from records describing other people named Herta Müller. Another beneficial side effect of linking your own records to others out on the Web of Data is that you enrich your own data in the process. For example, if a machine agent resolves the dbpedia URI it will get back RDF that includes 114 new assertions, some of which you can see below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Herta_M%C3%BCller>
    dbpedia-owl:birthDate "1953-08-17"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Ni%C5%A3chidorf> ;
    dbpedia-owl:spouse <http://dbpedia.org/resource/Richard_Wagner_%28novelist%29> ;
    dbpedia-owl:thumbnail <http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Herta_M%C3%BCller_2007.JPG/200px-Herta_M%C3%BCller_2007.JPG> ;
    rdfs:label "Herta Müller"@de, "Herta Müller"@en, "Herta Müller"@es, "Herta Müller"@fi, "Herta Müller"@fr, "Herta Müller"@it, "Herta Müller"@nl, "Herta Müller"@nn, "Herta Müller"@pl, "Herta Müller"@pt, "Herta Müller"@sv, "Мюллер, Герта"@ru, "ヘルタ・ミュラー"@ja, "赫塔·米勒"@zh ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000dc69bb>, <http://umbel.org/umbel/ne/wikipedia/Herta_M%C3%BCller> ;
    foaf:depiction <http://upload.wikimedia.org/wikipedia/commons/2/2c/Herta_M%C3%BCller_2007.JPG> .

So now we’ve enriched the DNB authority record with:

a thumbnail picture of Herta Müller
her name in Japanese, Chinese and Russian
her full date of birth
her place of birth
a link to a similar record for her spouse Richard Wagner
links to records for Herta Müller at Freebase (recently purchased by Google)

And that’s just a sampling of the sorts of data that dbpedia returns. Another interesting one to look at is the Virtual International Authority File (VIAF), which links together the authority records for 18 National Libraries around the world. If you resolve the VIAF URL that DNB have linked to, you will get machine readable information for authority records from the Library of Congress, NII (Japan), Biblioteca Nacional de Portugal, National Library of Sweden, Biblioteca Nacional de España, Bibliothèque nationale de France, National Library of the Czech Republic, and of course the Deutsche Nationalbibliothek. The entries for the DNB and Sweden are particularly important because they in turn link back to the records at the originating institution: http://d-nb.info/gnd/119053071 and http://libris.kb.se/auth/218085. It might be worthwhile for the DNB to consider linking directly to their own record in VIAF http://viaf.org/viaf/12324250/#DNB%7C119053071 instead of http://viaf.org/viaf/12324250, but that’s largely a technical matter. We’ve connected up the DNB’s notion of Herta Müller with the Royal Library of Sweden’s, just by following our nose on the World Wide Web. And this is an activity that computer software can perform as well.
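
To make the "computer software can perform as well" point a bit more concrete, here is a rough Python sketch of the first step of following your nose: pulling the owl:sameAs targets for a record out of N-Triples. The sample triples are the ones from the Herta Müller record above, rewritten as N-Triples. The regular expression is a deliberate simplification that only recognises triples made of three IRIs, so treat it as an illustration and not a real N-Triples parser.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Matches only triples whose subject, predicate and object are all IRIs,
# which is enough for link-following; literals and blank nodes are skipped.
IRI_TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

SAMPLE = """\
<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Herta_M%C3%BCller> .
<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/12324250> .
<http://d-nb.info/gnd/119053071> <http://xmlns.com/foaf/0.1/page> <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .
<http://d-nb.info/gnd/119053071> <http://RDVocab.info/ElementsGr2/dateOfBirth> "1953" .
"""


def same_as_links(ntriples, subject):
    """Return the owl:sameAs targets asserted for subject, in document order."""
    links = []
    for line in ntriples.splitlines():
        match = IRI_TRIPLE.match(line)
        if match and match.group(1) == subject and match.group(2) == OWL_SAME_AS:
            links.append(match.group(3))
    return links


for link in same_as_links(SAMPLE, "http://d-nb.info/gnd/119053071"):
    print(link)
```

A crawler would then request each of those URLs asking for RDF and repeat the process on whatever comes back.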

So, it’s clear there’s a whole lot of library linking going on. I did some quick and dirty analysis of the full data dump from the DNB and found 3,569,402 links to VIAF and 40,136 links to dbpedia (the Linked Data version of Wikipedia). What remains to be done, to some extent, is leveraging this contextual information around our data in library applications: cataloging and metadata enrichment applications as well as end-user-facing discovery applications.

One challenge to building applications that use this Web of Library Data is the choice of vocabularies. I did some more rudimentary analysis on the full DNB data dump and came up with this count of property usage:

RDF Property	Number of Assertions
http://www.w3.org/2002/07/owl#sameAs	3,609,878
http://d-nb.info/gnd/preferredNameForThePerson	3,609,753
http://d-nb.info/gnd/usedRules	3,476,879
http://d-nb.info/gnd/variantNameForThePerson	3,327,005
http://d-nb.info/gnd/surname	3,218,840
http://d-nb.info/gnd/foreName	3,218,125
http://RDVocab.info/ElementsGr2/identifierForTheCorporateBody	2,642,185
http://RDVocab.info/ElementsGr2/identifierForThePerson	2,163,258
http://d-nb.info/gnd/preferredNameForTheCorporateBody	1,320,711
http://d-nb.info/gnd/variantNameForTheCorporateBody	1,293,751
http://RDVocab.info/ElementsGr2/biographicalInformation	1,084,183
http://RDVocab.info/ElementsGr2/professionOrOccupation	1,059,570
http://d-nb.info/gnd/publicationOfThePerson	986,418
http://RDVocab.info/ElementsGr2/dateOfBirth	971,993
http://d-nb.info/gnd/countryCodeForThePerson	823,100
http://d-nb.info/gnd/countryCodeForTheCorporateBody	759,088
http://RDVocab.info/ElementsGr2/periodOfActivityOfThePerson	539,230
http://RDVocab.info/ElementsGr2/gender	404,247
http://RDVocab.info/ElementsGr2/dateOfDeath	381,888
http://purl.org/dc/terms/identifier	337,230
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/hierarchicalSuperior	277,484
http://d-nb.info/gnd/personalName	258,214
http://d-nb.info/gnd/prefixName	233,481
http://d-nb.info/gnd/functionOfThePerson	211,045
http://d-nb.info/gnd/invalidIdentifierForThePerson	208,267
http://RDVocab.info/ElementsGr2/placeOfBirth	192,563
http://d-nb.info/gnd/qualifierName	169,284
http://www.w3.org/1999/02/22-rdf-syntax-ns#type	168,615
http://www.w3.org/2004/02/skos/core#prefLabel	163,854
http://www.w3.org/2004/02/skos/core#altLabel	143,254
http://xmlns.com/foaf/0.1/page	123,569
http://d-nb.info/gnd/invalidIdentifierForTheCorporateBody	122,999
http://www.w3.org/2004/02/skos/core#broader	118,696
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/predecessor	110,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/successor	109,819
http://www.w3.org/2004/02/skos/core#narrower	102,850
http://d-nb.info/gnd/preferredNameAcronymForTheCorporateBody	102,517
http://RDVocab.info/ElementsGr2/dateOfEstablishment	88,470
http://d-nb.info/gnd/academicTitleOfThePerson	77,763
http://RDVocab.info/ElementsGr2/placeOfResidence	70,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedCorporateBodyPerson	65,319
http://www.w3.org/2004/02/skos/core#closeMatch	60,893
http://xmlns.com/foaf/0.1/homepage	59,065
http://RDVocab.info/ElementsGr2/dateOfTermination	38,997
http://www.w3.org/2004/02/skos/core#definition	37,086
http://RDVocab.info/ElementsGr2/placeOfDeath	35,266
http://d-nb.info/gnd/locQualifier	35,220
http://d-nb.info/gnd/studyPathsOfThePerson	33,307
http://www.w3.org/2004/02/skos/core#related	26,971
http://RDVocab.info/ElementsGr2/nameOfTheCorporateBody	20,009
http://RDVocab.info/ElementsGr2/languageOfThePerson	13,318
http://d-nb.info/gnd/variantNameAcronymForTheCorporateBody	12,786
http://www.w3.org/2004/02/skos/core#scopeNote	11,000
http://d-nb.info/gnd/useInsteadSWD	9,572
http://d-nb.info/gnd/useInsteadNoteSWD	9,522
http://d-nb.info/gnd/countryCodeForTheSubject	7,179
http://RDVocab.info/ElementsGr2/titleOfThePerson	6,798
http://purl.org/vocab/relationship/childOf	6,554
http://purl.org/vocab/relationship/parentOf	5,895
http://purl.org/vocab/relationship/spouseOf	5,613
http://d-nb.info/gnd/successorWithoutPredecessor	5,574
http://www.w3.org/2000/01/rdf-schema#label	4,761
http://d-nb.info/gnd/useConceptsInsteadSWD	4,761
http://d-nb.info/gnd/invalidIdentifierForTheSubject	4,635
http://purl.org/vocab/relationship/siblingOf	3,891
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedPersonPerson	2,764
http://d-nb.info/gnd/predecessorWithoutSuccessor	1,501
http://purl.org/vocab/relationship/grandchildOf	493
http://www.w3.org/2000/01/rdf-schema#seeAlso	484
http://purl.org/vocab/relationship/grandparentOf	416
http://purl.org/dc/terms/language	266
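
A tally like the one above can be produced in a few lines of Python, since N-Triples puts one statement on each line. This is only a sketch of one way to do it and not necessarily how the numbers in the table were generated. The sample lines are taken from the Herta Müller record, and the function name is made up.

```python
from collections import Counter


def count_predicates(lines):
    """Tally how often each predicate IRI occurs in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in N-Triples,
        # so two splits are enough; the rest of the line is the object.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("<") and parts[1].endswith(">"):
            counts[parts[1][1:-1]] += 1
    return counts


SAMPLE = [
    '<http://d-nb.info/gnd/119053071> <http://RDVocab.info/ElementsGr2/dateOfBirth> "1953" .',
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/12324250> .',
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Herta_M%C3%BCller> .',
    '',
]

for predicate, total in count_predicates(SAMPLE).most_common():
    print(f"{predicate}\t{total:,}")
```

Because the function accepts any iterable of lines, an open file handle for a downloaded dump can be passed in directly, so the whole file never has to be held in memory.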

So we see heavy usage of the http://d-nb.info/gnd/ vocabulary, but we don’t know precisely how this vocabulary connects up with other vocabularies in use on the Web. We also see the new RDA vocabulary http://RDVocab.info/ElementsGr2 heavily used, whereas the trailblazing Royal Library of Sweden chose to lean more on the Friend of a Friend vocabulary. It’s very important that we see some convergence in vocabulary use, so that our distributed data is interoperable and mashable. This will undoubtedly lead to changes in what vocabularies are used, and growing pains in any applications that are dependent on the data. But I think it is worth it. I have high hopes that some of this convergence may come about as a result of meetings later this week at the Dublin Core Metadata Initiative 2010 meeting in Pittsburgh. But if it’s going to scale, we need to see this convergence going on all the time in online forums like the Linked Library Data discussion list, and via tools that allow library data managers to view the emerging web of library data.

Another niggling little problem is the need to synchronize these data sets. For example, how am I to know when DNB has created, updated or deleted one of their authority records? I could wait for a database dump, and blow away what I knew before. But ideally there would be a mechanism to keep my own view of the DNB data synchronized. Of course there is the tried and true OAI-PMH, which VIAF is using to collect MARC records, but it is showing its age and doesn’t really fit the Linked Data pattern very well. There is the successor to OAI-PMH, OAI-ORE, which better fits more recent notions of Web Architecture and Linked Data. But there are some issues to do with very large resource maps that still need ironing out. The Dataset Dynamics group has been doing some interesting work identifying the various mechanisms for performing synchronization, with an emphasis on using Atom. Atom is a standard XML document format for describing sets of web resources. In fact OAI-ORE leverages Atom as one of the serialization formats for resource maps. But I’m personally hoping we’ll see some streamlined guidelines for publishing feeds for Linked Data that leverage Atom’s Feed Paging/Archiving for making large lists of resources available. Maybe Semantic Sitemaps (a Linked Data extension to the traditional sitemaps that the big web search engines use to stay on top of things) could play a part too. I imagine we’ll see a combination of these approaches, but I think it’s important to see some convergence amongst Library Linked Data publishers to help the ecosystem flourish.
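
To give a feel for what Atom-based synchronization might look like from the consumer side, here is a small Python sketch. It is entirely hypothetical: as far as this post knows the DNB does not publish such a feed, so the feed content, its example.org id, the titles and the timestamps below are all invented for illustration (only the two d-nb.info record URLs come from the record shown earlier). The idea is simply to remember when you last synchronized and pick out the entries updated since then.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: the record URLs appear in the post, everything else is invented.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hypothetical authority record changes</title>
  <id>http://example.org/changes</id>
  <updated>2010-10-20T12:00:00Z</updated>
  <entry>
    <id>http://d-nb.info/gnd/119053071</id>
    <title>Example entry, recently updated</title>
    <updated>2010-10-20T12:00:00Z</updated>
  </entry>
  <entry>
    <id>http://d-nb.info/gnd/4053311-6</id>
    <title>Example entry, updated earlier</title>
    <updated>2010-09-01T08:30:00Z</updated>
  </entry>
</feed>
"""


def parse_timestamp(text):
    """Parse the UTC 'Z' form of an RFC 3339 timestamp (other forms are not handled)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def changed_since(feed_xml, last_sync):
    """Return the ids of entries whose <updated> time is later than last_sync."""
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.findall(ATOM + "entry"):
        updated = parse_timestamp(entry.findtext(ATOM + "updated"))
        if updated > last_sync:
            changed.append(entry.findtext(ATOM + "id"))
    return changed


last_sync = datetime(2010, 10, 1, tzinfo=timezone.utc)
for record_url in changed_since(SAMPLE_FEED, last_sync):
    print(record_url)
```

A real consumer would also need to walk paged or archived feed documents and deal with deletions, which this sketch leaves out.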

Update: I shared some more pedantic thoughts about the d-nb.info URLs in another forum. I didn’t want these particular technical details/questions to detract from saying how important I think the DNB Linked Data release is.