diving into VIAF

Last week saw a big (well big for library data nerds) announcement from OCLC that they are making the data for the Virtual International Authority File (VIAF) available for download under the terms of the Open Data Commons Attribution (ODC-BY) license. If you’re not already familiar with VIAF here’s a brief description from OCLC Research:

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF’s goal is to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF seeks to include authoritative names from many libraries into a global service that is available via the Web. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

More specifically, the VIAF service: links national and regional-level authority records, creating clusters of related records and expands the concept of universal bibliographic control by:

  • allowing national and regional variations in authorized form to coexist
  • supporting needs for variations in preferred language, script and spelling
  • playing a role in the emerging Semantic Web

If you went and looked at the OCLC Research page you’ll notice that last month the VIAF project moved to OCLC. This is evidence of a growing commitment on OCLC’s part to make VIAF part of the library information landscape. It currently includes data about people, places and organizations from 22 different national libraries and other organizations.

Already there has been some great writing about what the release of VIAF data means for the cultural heritage sector. In particular, Thom Hickey’s Outgoing is a trove of information about the project and provides a behind-the-scenes look at the various services it offers.

Rather than paraphrase what others have said already I thought I would download some of the data and report on what it looks like. Specifically I’m interested in the RDF data (as opposed to the custom XML, and MARC variants) since I believe it to have the most explicit structure and relations. The shared semantics in the RDF vocabularies that are used also make it the most interesting from a Linked Data perspective.

Diving In

The primary data structure of interest in the data dumps that OCLC has made available is what they call the cluster. A cluster is essentially a hub-and-spoke model with a resource for the person, place or organization in the middle that is attached via the spokes to conceptual resources at the participating VIAF institutions. As an example, here is an illustration of the VIAF cluster for the Canadian archivist Hugh Taylor:

Here you can see a FOAF Person resource (yellow) in the middle that is linked to from SKOS Concepts (blue) for Bibliothèque nationale de France, The Libraries and Archives of Canada, Deutschen Nationalbibliothek, BIBSYS (Norway) and the Library of Congress. Each of the SKOS Concepts has its own preferred label, which you can see varies across institutions. This high level view obscures quite a bit of data, which is probably best viewed in Turtle if you want to see it:

<http://viaf.org/viaf/14894854>
    rdaGr2:dateOfBirth "1920-01-22" ;
    rdaGr2:dateOfDeath "2005-09-11" ;
    a rdaEnt:Person, foaf:Person ;
    owl:sameAs <http://d-nb.info/gnd/109337093> ;
    foaf:name "Taylor, Hugh A.", "Taylor, Hugh A. (Hugh Alexander), 1920-", "Taylor, Hugh Alexander 1920-2005" .

<http://viaf.org/viaf/sourceID/BIBSYS%7Cx90575046#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BIBSYS> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/BNF%7C12688277#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BNF> ;
    skos:prefLabel "Taylor, Hugh Alexander 1920-2005" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/DNB%7C109337093#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/DNB> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LAC%7C0013G3497#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/LAC> ;
    skos:prefLabel "Taylor, Hugh A. (Hugh Alexander), 1920-" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LC%7Cn++82148845#skos:Concept>
    a skos:Concept ;
    skos:exactMatch <http://id.loc.gov/authorities/names/n82148845> ;
    skos:inScheme <http://viaf.org/authorityScheme/LC> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .
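To make the hub-and-spoke shape concrete, here is a minimal sketch that walks from the person (the hub) back to the concepts (the spokes) that point at it. It uses plain Python tuples rather than an RDF library, the triples are a hand-copied subset of the Turtle above with prefixed names left unexpanded, and the function name labels_by_scheme is purely illustrative.

```python
# Hypothetical, hand-copied subset of the cluster shown above, stored as
# (subject, predicate, object) tuples with prefixed names left unexpanded.
PERSON = "http://viaf.org/viaf/14894854"
BIBSYS = "http://viaf.org/viaf/sourceID/BIBSYS%7Cx90575046#skos:Concept"
BNF = "http://viaf.org/viaf/sourceID/BNF%7C12688277#skos:Concept"
LAC = "http://viaf.org/viaf/sourceID/LAC%7C0013G3497#skos:Concept"

TRIPLES = [
    (BIBSYS, "foaf:focus", PERSON),
    (BIBSYS, "skos:inScheme", "http://viaf.org/authorityScheme/BIBSYS"),
    (BIBSYS, "skos:prefLabel", "Taylor, Hugh A."),
    (BNF, "foaf:focus", PERSON),
    (BNF, "skos:inScheme", "http://viaf.org/authorityScheme/BNF"),
    (BNF, "skos:prefLabel", "Taylor, Hugh Alexander 1920-2005"),
    (LAC, "foaf:focus", PERSON),
    (LAC, "skos:inScheme", "http://viaf.org/authorityScheme/LAC"),
    (LAC, "skos:prefLabel", "Taylor, Hugh A. (Hugh Alexander), 1920-"),
]


def labels_by_scheme(triples, hub):
    """Map each authority scheme name to the preferred label used for hub."""
    # The spokes are the concepts whose foaf:focus is the hub resource.
    spokes = {s for s, p, o in triples if p == "foaf:focus" and o == hub}
    schemes = {s: o for s, p, o in triples if p == "skos:inScheme" and s in spokes}
    labels = {s: o for s, p, o in triples if p == "skos:prefLabel" and s in spokes}
    # Use the last path segment of the scheme URL as a short name.
    return {
        schemes[s].rsplit("/", 1)[-1]: labels[s]
        for s in sorted(spokes)
        if s in schemes and s in labels
    }


for scheme, label in labels_by_scheme(TRIPLES, PERSON).items():
    print(scheme, "->", label)
```

The same traversal would work on the full cluster; only the list of triples would grow.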

The Numbers

The RDF Cluster Dataset http://viaf.org/viaf/data/viaf-20120422-clusters.xml.gz is 2.1G gzip compressed RDF data. Rather than it being one complete RDF/XML file, each line has a complete RDF/XML document on it, which represents a single cluster. All in all there are 20,379,541 clusters in the file.
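Because each line is its own document, the file can be streamed without ever holding it all in memory. As a rough sketch (not necessarily how the count above was produced), counting clusters only needs the standard library. The function name count_clusters is made up for illustration, and the code assumes the file is UTF-8 text.

```python
import gzip


def count_clusters(path):
    """Count the non-empty lines in a gzip file, one RDF/XML document per line."""
    count = 0
    # "rt" opens the compressed file as text and decompresses it lazily.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                count += 1
    return count
```

Calling count_clusters("viaf-20120422-clusters.xml.gz") on a local copy of the dump would then give the number of clusters.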

I quickly hacked together an rdflib filter that reads the uncompressed line-oriented RDF/XML and writes the RDF as ntriples:

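import sys

import rdflib

# Each line on stdin is a complete RDF/XML document for one cluster.
for line in sys.stdin:
    if not line.strip():
        continue
    g = rdflib.Graph()
    g.parse(data=line, format="xml")
    # Assumes rdflib 6 or later, where serialize() returns a str.
    sys.stdout.write(g.serialize(format="nt"))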

This took 4 days to run on my (admittedly old) laptop. If you are interested in seeing the ntriples let me know and I can see about making it available somewhere. It is 2.8G gzip compressed. An ntriples dump might be a useful version of the RDF data for OCLC to make available, since it would be easier to load into triplestores, and otherwise muck around with (more on that below) than the line oriented RDF/XML. I don’t know much about the backend that drives VIAF (has anyone seen it written up?)…but I would understand if someone said it was too expensive to generate, and was intentionally left as an exercise for the downloader.

Given its line-oriented nature, ntriples is very handy for doing analysis from the Unix command line with cut, sort, uniq, etc. From the ntriples file I learned that the VIAF RDF dump is made up of 377,194,224 assertions or RDF triples. Here’s the breakdown on the types of resources present in the data:

Resource Type        Number of Resources
skos:Concept         26,745,286
foaf:Document        20,379,541
foaf:Person          15,043,112
rda:Person           15,043,112
foaf:Organization    3,722,318
foaf:CorporateBody   3,722,318
dbpedia:Place        195,472

Here’s a breakdown of predicates (RDF properties) that are used:

RDF Property        Number of Assertions
rdf:type            84,851,159
foaf:focus          45,510,716
foaf:name           44,729,247
rdfs:comment        41,253,178
owl:sameAs          32,741,138
skos:prefLabel      26,745,286
skos:inScheme       26,745,286
foaf:primaryTopic   20,379,541
void:inDataset      20,379,541
skos:altLabel       16,702,081
skos:exactMatch     8,487,197
rda:dateOfBirth     5,215,150
rda:dateOfDeath     1,364,355
owl:differentFrom   1,045,172
rdfs:seeAlso        1,045,172
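The exact cut, sort and uniq commands aren't shown here, but the same kind of tally is easy to sketch in Python. This is an illustration under the assumption that the input is well-formed N-Triples, where the subject and predicate never contain whitespace. The name count_predicates and the sample lines are made up for the example.

```python
from collections import Counter


def count_predicates(lines):
    """Tally the predicate (second term) of each N-Triples line."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # N-Triples comment line
        # Subject and predicate contain no whitespace, so two splits suffice.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


# A few hand-written sample lines standing in for the real dump.
SAMPLE = [
    '<http://viaf.org/viaf/14894854> <http://xmlns.com/foaf/0.1/name> "Taylor, Hugh A." .',
    '<http://viaf.org/viaf/14894854> <http://xmlns.com/foaf/0.1/name> "Taylor, Hugh Alexander 1920-2005" .',
    "<http://viaf.org/viaf/14894854> <http://www.w3.org/2002/07/owl#sameAs> <http://d-nb.info/gnd/109337093> .",
]

for predicate, total in count_predicates(SAMPLE).most_common():
    print(total, predicate)
```

Since the function only needs an iterable of lines, it could be handed an open file object for the full ntriples file.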

I’m expecting these statistics to be useful in helping target some future work I want to do with the VIAF RDF dataset (to explore what an idiomatic JSON representation for the dataset would be, shhh). In addition to the RDF, OCLC also makes a dump of link data available. It is a smaller file (239M gzip compressed) of tab delimited data, which looks like:

...
http://viaf.org/viaf/10014828   SELIBR:219751
http://viaf.org/viaf/10014828   SUDOC:052584895
http://viaf.org/viaf/10014828   NKC:xx0015094
http://viaf.org/viaf/10014828   BIBSYS:x98003783
http://viaf.org/viaf/10014828   LC:24893
http://viaf.org/viaf/10014828   NUKAT:vtls000425208
http://viaf.org/viaf/10014828   BNE:XX917469
http://viaf.org/viaf/10014828   DNB:121888096
http://viaf.org/viaf/10014828   BNF:http://catalogue.bnf.fr/ark:/12148/cb13566121c
http://viaf.org/viaf/10014828   http://en.wikipedia.org/wiki/Liza_Marklund
...

There are 27,046,631 links in total. With a little more Unix commandline-fu I was able to get some stats on the number of links by institution:

Institution                                       Number of Links
LC NACO (United States)                           8,325,352
Deutschen Nationalbibliothek (Germany)            7,732,546
SUDOC (France)                                    2,031,452
BIBSYS (Norway)                                   1,822,681
Bibliothèque nationale de France                  1,643,068
National Library of Australia                     977,141
NUKAT Center (Poland)                             894,981
Libraries and Archives of Canada                  674,088
National Library of the Czech Republic            598,848
Biblioteca Nacional de España                     519,511
National Library of Israel                        327,455
Biblioteca Nacional de Portugal                   321,064
English Wikipedia                                 301,345
Vatican Library                                   247,574
Getty Union List of Artist Names                  202,711
National Library of Sweden                        161,845
RERO (Switzerland)                                119,366
Istituto Centrale per il Catalogo Unico (Italy)   45,208
Swiss National Library                            33,866
National Széchényi Library (Hungary)              33,727
Bibliotheca Alexandrina (Egypt)                   26,877
Flemish Public Libraries                          4,819
Russian State Library                             997
Extended VIAF Authority                           109
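A per-institution tally like the one above can also be sketched in a few lines of Python. This is an illustration rather than the commands actually used. It assumes two tab-separated columns per line and treats identifiers that are bare URLs (as the Wikipedia links are) as belonging to their host name. The function names are hypothetical.

```python
from collections import Counter


def source_of(identifier):
    """Return the source code before the first colon, or the host of a bare URL."""
    if identifier.startswith(("http://", "https://")):
        return identifier.split("/")[2]
    return identifier.split(":", 1)[0]


def count_links(lines):
    """Tally links per source from tab-delimited 'VIAF URL, identifier' lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            counts[source_of(fields[1])] += 1
    return counts


# Sample rows taken from the excerpt above, with the tab made explicit.
SAMPLE = [
    "http://viaf.org/viaf/10014828\tSELIBR:219751",
    "http://viaf.org/viaf/10014828\tDNB:121888096",
    "http://viaf.org/viaf/10014828\tBNF:http://catalogue.bnf.fr/ark:/12148/cb13566121c",
    "http://viaf.org/viaf/10014828\thttp://en.wikipedia.org/wiki/Liza_Marklund",
]

for source, total in sorted(count_links(SAMPLE).items()):
    print(source, total)
```

Mapping the short codes (SELIBR, DNB, BNF and so on) to institution names would need a lookup table that isn't part of the dump excerpt.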

The 301,345 links to Wikipedia are really great to see. It might be a fun project to see how many of these links are actually present in Wikipedia, and if they can be automatically added with a bot if they are missing. I think it’s useful to have the HTTP identifier in the link dump file, as is the case for the BNF identifiers. I’m not sure why the DNB, Sweden, and LC identifiers aren’t expressed as URLs as well.

One other parting observation (I’m sure I’ll blog more about this) is that it would be nice if more of the data that you see in the HTML presentation were available in the RDF dumps. Specifically, it would be useful to have the Wikipedia links expressed in the RDF data, as well as linked works (uniform titles).

Anyway, a big thanks to OCLC for making the VIAF dataset available! It really feels like a major sea change in the cultural heritage data ecosystem.

Linked Library Data at the Deutschen Nationalbibliothek

Just last week Lars Svensson from the Deutschen Nationalbibliothek (German National Library aka DNB) made a big announcement that they have released their authority data as Linked Data for the world to use. What this means is that there are now unique URLs (and machine readable data at the other end of them) for:

The full dataset that the DNB has made available for download amounts to 38,849,113 individual statements (aka triples). Linked Data enthusiasts that are used to thinking in terms of billions of triples might not even blink when seeing these numbers. But it is important to remember that these data assets have been curated by a network of German, Austrian and Swiss libraries, for close to a hundred years, as they documented (and continue to document) all known German-language publications.

The simple act of making each of these authority records URL addressable means that they can now meaningfully participate in the global information space some call the Web of Data. It’s true, the records were available as part of the DNB’s Online Catalog before they were released as Linked Data. What’s new is that the DNB has committed to using persistent URLs to identify these records, using a new host name d-nb.info in combination with their own record identifiers. This means that people can persistently link to these DNB resources in their own web applications and data. Another subtle thing, and really the heart of what the Linked Data pattern offers, is the ability to use the same URL to retrieve the record as structured metadata. The important thing about having machine readable data is that it allows other applications to easily re-purpose the information, much like libraries have done traditionally by shipping around batches of Machine Readable Cataloging (MARC) records. Here’s a practical example:

The URL http://d-nb.info/gnd/119053071 identifies the author Herta Müller, who won the Nobel Prize for Literature in 2009. If you load that URL in your web browser by clicking on it, you should see a web page (HTML) for the authority record describing Herta Müller. But if a web client requests that same URL asking for RDF it will (via a redirect) get the same authority record as RDF. RDF is more a data model than a particular file format, so it has a variety of serializations. The server at d-nb.info returns RDF/XML, and they have made their data dumps available in N-Triples, but I’m fond of the Turtle serialization, which is kind of JSON-ish and makes the RDF a bit more readable. Here is the RDF (as Turtle) for Herta Müller that the DNB makes available:

@prefix gnd: <http://d-nb.info/gnd/> .
@prefix rdaGr2: <http://RDVocab.info/ElementsGr2/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://d-nb.info/gnd/119053071>
    rdaGr2:biographicalInformation "Rumän.-dt. Schriftstellerin und Essayistin, lebt seit 1987 in Deutschland, Literaturnobelpreisträgerin 2009"@de ;
    rdaGr2:dateOfBirth "1953" ;
    rdaGr2:identifierForThePerson "(DE-588)119053071", "(DE-588c)4293331-6", "(DLC)n  86833524" ;
    rdaGr2:placeOfBirth "Nitzkydorf (Banat)"@de ;
    rdaGr2:placeOfResidence "Berlin"@de ;
    rdaGr2:professionOrOccupation <http://d-nb.info/gnd/4053311-6> ;
    gnd:countryCodeForThePerson "XA-RO" ;
    gnd:preferredNameForThePerson [
        gnd:foreName "Herta" ;
        gnd:surname "Müller" ;
        gnd:usedRules "RAK-WB"
    ], "Müller, Herta" ;
    gnd:studyPathsOfThePerson "Germanistik, Romanistik"@de ;
    gnd:variantNameForThePerson [
        gnd:foreName "Cherta" ;
        gnd:surname "Myller" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:personalName "Heta-Mulei" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:foreName "Heta" ;
        gnd:surname "Mulei" ;
        gnd:usedRules "RAK-WB"
    ], [
        gnd:foreName "Herta" ;
        gnd:surname "Müller" ;
        gnd:usedRules "AACR"
    ], [
        gnd:foreName "Heruta" ;
        gnd:surname "Myur?" ;
        gnd:usedRules "RAK-WB"
    ], "Heta-Mulei", "Mulei, Heta", "Müller, Herta", "Myller, Cherta", "Myur?, Heruta" ;
    owl:sameAs <http://dbpedia.org/resource/Herta_M%C3%BCller>, <http://viaf.org/viaf/12324250> ;
    foaf:page <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .
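The "asking for RDF" part is plain HTTP content negotiation: the client sends an Accept header naming the media type it wants. Here is a minimal sketch using only the Python standard library. The function names are illustrative, application/rdf+xml is the registered media type for RDF/XML, and whether the server still answers the way described here is something to verify rather than assume.

```python
import urllib.request


def rdf_request(url, media_type="application/rdf+xml"):
    """Build a GET request that asks for RDF instead of an HTML page."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_rdf(url):
    """Fetch url as RDF/XML and return (final URL after redirects, body bytes)."""
    # urlopen follows redirects on its own, so the redirect to the RDF
    # document needs no special handling here.
    with urllib.request.urlopen(rdf_request(url)) as response:
        return response.geturl(), response.read()


request = rdf_request("http://d-nb.info/gnd/119053071")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling fetch_rdf("http://d-nb.info/gnd/119053071") would perform the actual request. The sketch stops short of that so it can run without network access.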


A few interesting things to note in this example are the use of the RDA Group 2 Entities vocabulary and the GND vocabulary to describe Herta Müller. RDF vocabularies are explicit ways of describing resources like people, places, topics, etc. When different things are described using the same vocabulary (or the vocabularies themselves are related together in a particular way) it becomes possible to merge the descriptions, and build software on top of them. So the DNB’s choice of RDA and GND is quite significant. Normally the URL for an RDF schema will return a description of that schema known as a Namespace Document. Namespace Documents are handy for understanding what exactly the vocabulary means, and how it might relate to other RDF vocabularies on the web. This is the case for the RDA vocabulary, but the GND vocabulary namespace doesn’t appear to be resolving to anything that describes the GND vocabulary.

Another really interesting thing to note about this RDF for Herta Müller are the links to Wikipedia (http://de.wikipedia.org/wiki/Herta_M%C3%BCller), VIAF (http://viaf.org/viaf/12324250) and dbpedia (http://dbpedia.org/resource/Herta_M%C3%BCller). These are important because they contextualize the DNB record for Herta Müller by relating it to other records for her, thus allowing it to be disambiguated from records describing other people named Herta Müller. Another beneficial side effect of linking your own records to others out on the Web of Data is that you enrich your own data in the process. For example if a machine agent resolves the dbpedia URI it will get back RDF that includes 114 new assertions, some of which you can see below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Herta_M%C3%BCller>
    dbpedia-owl:birthDate "1953-08-17"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Ni%C5%A3chidorf> ;
    dbpedia-owl:spouse <http://dbpedia.org/resource/Richard_Wagner_%28novelist%29> ;
    dbpedia-owl:thumbnail <http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Herta_M%C3%BCller_2007.JPG/200px-Herta_M%C3%BCller_2007.JPG> ;
    rdfs:label "Herta Müller"@de, "Herta Müller"@en, "Herta Müller"@es, "Herta Müller"@fi, "Herta Müller"@fr, "Herta Müller"@it, "Herta Müller"@nl, "Herta Müller"@nn, "Herta Müller"@pl, "Herta Müller"@pt, "Herta Müller"@sv, "Мюллер, Герта"@ru, "ヘルタ・ミュラー"@ja, "赫塔·米勒"@zh ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000dc69bb>, <http://umbel.org/umbel/ne/wikipedia/Herta_M%C3%BCller> ;
    foaf:depiction <http://upload.wikimedia.org/wikipedia/commons/2/2c/Herta_M%C3%BCller_2007.JPG> .

So now we’ve enriched the DNB authority record with:

  • a thumbnail picture of Herta Müller
  • her name in Japanese, Chinese and Russian
  • her birth day
  • her place of birth
  • a link to a similar record for her spouse Richard Wagner
  • links to records for Herta Müller at Freebase (recently purchased by Google)

And that’s just a sampling of the sorts of data that dbpedia returns. Another interesting one to look at is the Virtual International Authority File (VIAF), which links together the authority records for 18 National Libraries around the world. If you resolve the VIAF URL that DNB have linked to, you will get machine readable information for authority records from the Library of Congress, NII (Japan), Biblioteca Nacional de Portugal, National Library of Sweden, Biblioteca Nacional de España, Bibliothèque nationale de France, National Library of the Czech Republic, and of course the Deutsche Nationalbibliothek. The information for the DNB and Sweden is particularly important because it in turn links back to the records at the originating institution: http://d-nb.info/gnd/119053071 and http://libris.kb.se/auth/218085. It might be worthwhile for the DNB to consider linking directly to their own record in VIAF http://viaf.org/viaf/12324250/#DNB%7C119053071 instead of http://viaf.org/viaf/12324250, but that’s largely a technical matter. We’ve connected up the DNB’s notion of Herta Müller with the Royal Library of Sweden’s, just by following our nose on the World Wide Web. And this is an activity that computer software can perform as well.
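Here is a rough sketch of what that follow-your-nose behaviour might look like in software. It is a breadth-first walk over owl:sameAs links, with the actual fetching and parsing hidden behind a fetch function so the example can run against a small in-memory stand-in for the Web. The names follow_same_as and FAKE_WEB are made up, and the fake data contains only links that appear in the Turtle examples in this post.

```python
from collections import deque

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def follow_same_as(start, fetch, max_fetches=10):
    """Return the URIs reachable from start via owl:sameAs, in discovery order.

    fetch(uri) must return an iterable of (subject, predicate, object) tuples.
    """
    seen = {start}
    order = [start]
    queue = deque([start])
    fetches = 0
    while queue and fetches < max_fetches:
        uri = queue.popleft()
        fetches += 1
        for s, p, o in fetch(uri):
            if s == uri and p == SAME_AS and o not in seen:
                seen.add(o)
                order.append(o)
                queue.append(o)
    return order


DNB = "http://d-nb.info/gnd/119053071"
DBPEDIA = "http://dbpedia.org/resource/Herta_M%C3%BCller"
VIAF = "http://viaf.org/viaf/12324250"
FREEBASE = "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000dc69bb"
UMBEL = "http://umbel.org/umbel/ne/wikipedia/Herta_M%C3%BCller"

# In-memory stand-in for the Web; a real fetch would dereference each URI.
FAKE_WEB = {
    DNB: [(DNB, SAME_AS, DBPEDIA), (DNB, SAME_AS, VIAF)],
    DBPEDIA: [(DBPEDIA, SAME_AS, FREEBASE), (DBPEDIA, SAME_AS, UMBEL)],
}

for uri in follow_same_as(DNB, lambda u: FAKE_WEB.get(u, [])):
    print(uri)
```

The max_fetches cap matters in practice: sameAs links can fan out quickly, and a polite crawler should bound how much it requests.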

So, it’s clear there’s a whole lot of library linking going on. I did some quick and dirty analysis of the full data dump from the DNB and found 3,569,402 links to VIAF and 40,136 links to dbpedia (the Linked Data version of Wikipedia). What remains to be done, to some extent, is leveraging this contextual information around our data in library applications: cataloging tools, metadata enrichment applications and end-user-facing discovery applications.
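A link count of that sort can be approximated by grouping the objects of owl:sameAs statements by host name. The following is only a sketch of one way to do it, not the analysis actually run. It assumes N-Triples input, and the function name same_as_hosts and the sample lines are illustrative.

```python
from collections import Counter
from urllib.parse import urlparse

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"


def same_as_hosts(lines):
    """Count owl:sameAs targets per host name in N-Triples lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == SAME_AS:
            # Drop the trailing " ." that terminates every N-Triples statement.
            target = parts[2].rstrip().rstrip(".").strip()
            if target.startswith("<") and target.endswith(">"):
                counts[urlparse(target[1:-1]).netloc] += 1
    return counts


# Sample statements built from the Herta Müller record shown earlier.
SAMPLE = [
    "<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/12324250> .",
    "<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Herta_M%C3%BCller> .",
    "<http://d-nb.info/gnd/119053071> <http://xmlns.com/foaf/0.1/page> <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .",
]

print(same_as_hosts(SAMPLE))
```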

One challenge to building applications that use this Web of Library Data are the vocabularies that are used. I did some more rudimentary analysis on the full DNB data dump and came up with this count of property usage:

RDF Property Number of Assertions
http://www.w3.org/2002/07/owl#sameAs 3,609,878
http://d-nb.info/gnd/preferredNameForThePerson 3,609,753
http://d-nb.info/gnd/usedRules 3,476,879
http://d-nb.info/gnd/variantNameForThePerson 3,327,005
http://d-nb.info/gnd/surname 3,218,840
http://d-nb.info/gnd/foreName 3,218,125
http://RDVocab.info/ElementsGr2/identifierForTheCorporateBody 2,642,185
http://RDVocab.info/ElementsGr2/identifierForThePerson 2,163,258
http://d-nb.info/gnd/preferredNameForTheCorporateBody 1,320,711
http://d-nb.info/gnd/variantNameForTheCorporateBody 1,293,751
http://RDVocab.info/ElementsGr2/biographicalInformation 1,084,183
http://RDVocab.info/ElementsGr2/professionOrOccupation 1,059,570
http://d-nb.info/gnd/publicationOfThePerson 986,418
http://RDVocab.info/ElementsGr2/dateOfBirth 971,993
http://d-nb.info/gnd/countryCodeForThePerson 823,100
http://d-nb.info/gnd/countryCodeForTheCorporateBody 759,088
http://RDVocab.info/ElementsGr2/periodOfActivityOfThePerson 539,230
http://RDVocab.info/ElementsGr2/gender 404,247
http://RDVocab.info/ElementsGr2/dateOfDeath 381,888
http://purl.org/dc/terms/identifier 337,230
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/hierarchicalSuperior 277,484
http://d-nb.info/gnd/personalName 258,214
http://d-nb.info/gnd/prefixName 233,481
http://d-nb.info/gnd/functionOfThePerson 211,045
http://d-nb.info/gnd/invalidIdentifierForThePerson 208,267
http://RDVocab.info/ElementsGr2/placeOfBirth 192,563
http://d-nb.info/gnd/qualifierName 169,284
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 168,615
http://www.w3.org/2004/02/skos/core#prefLabel 163,854
http://www.w3.org/2004/02/skos/core#altLabel 143,254
http://xmlns.com/foaf/0.1/page 123,569
http://d-nb.info/gnd/invalidIdentifierForTheCorporateBody 122,999
http://www.w3.org/2004/02/skos/core#broader 118,696
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/predecessor 110,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/successor 109,819
http://www.w3.org/2004/02/skos/core#narrower 102,850
http://d-nb.info/gnd/preferredNameAcronymForTheCorporateBody 102,517
http://RDVocab.info/ElementsGr2/dateOfEstablishment 88,470
http://d-nb.info/gnd/academicTitleOfThePerson 77,763
http://RDVocab.info/ElementsGr2/placeOfResidence 70,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedCorporateBodyPerson 65,319
http://www.w3.org/2004/02/skos/core#closeMatch 60,893
http://xmlns.com/foaf/0.1/homepage 59,065
http://RDVocab.info/ElementsGr2/dateOfTermination 38,997
http://www.w3.org/2004/02/skos/core#definition 37,086
http://RDVocab.info/ElementsGr2/placeOfDeath 35,266
http://d-nb.info/gnd/locQualifier 35,220
http://d-nb.info/gnd/studyPathsOfThePerson 33,307
http://www.w3.org/2004/02/skos/core#related 26,971
http://RDVocab.info/ElementsGr2/nameOfTheCorporateBody 20,009
http://RDVocab.info/ElementsGr2/languageOfThePerson 13,318
http://d-nb.info/gnd/variantNameAcronymForTheCorporateBody 12,786
http://www.w3.org/2004/02/skos/core#scopeNote 11,000
http://d-nb.info/gnd/useInsteadSWD 9,572
http://d-nb.info/gnd/useInsteadNoteSWD 9,522
http://d-nb.info/gnd/countryCodeForTheSubject 7,179
http://RDVocab.info/ElementsGr2/titleOfThePerson 6,798
http://purl.org/vocab/relationship/childOf 6,554
http://purl.org/vocab/relationship/parentOf 5,895
http://purl.org/vocab/relationship/spouseOf 5,613
http://d-nb.info/gnd/successorWithoutPredecessor 5,574
http://www.w3.org/2000/01/rdf-schema#label 4,761
http://d-nb.info/gnd/useConceptsInsteadSWD 4,761
http://d-nb.info/gnd/invalidIdentifierForTheSubject 4,635
http://purl.org/vocab/relationship/siblingOf 3,891
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedPersonPerson 2,764
http://d-nb.info/gnd/predecessorWithoutSuccessor 1,501
http://purl.org/vocab/relationship/grandchildOf 493
http://www.w3.org/2000/01/rdf-schema#seeAlso 484
http://purl.org/vocab/relationship/grandparentOf 416
http://purl.org/dc/terms/language 266

So we see heavy usage of the http://d-nb.info/gnd/ vocabulary, but we don’t know precisely how this vocabulary connects up with other vocabularies in use on the Web. We also see the new RDA vocabulary http://RDVocab.info/ElementsGr2 heavily used, whereas the trailblazing Royal Library of Sweden chose to leverage the Friend of a Friend vocabulary more. It’s very important that we see some convergence in vocabulary use, so that our distributed data is interoperable, and mashable. This will undoubtedly lead to changes in what vocabularies are used, and growing pains in any applications that are dependent on the data. But I think it is worth it. I have high hopes that some of this convergence may come about as a result of meetings later this week at the Dublin Core Metadata Initiative 2010 meeting in Pittsburgh. But if it’s going to scale, we need to see this convergence going on all the time in online forums like the Linked Library Data discussion list, and via tools that allow library data managers to view the emerging web of library data.

Another niggling little problem is the need to synchronize these data sets. For example, how am I to know when DNB has created, updated or deleted one of their authority records? I could wait for a database dump, and blow away what I knew before. But ideally there would be a mechanism to keep my own view of the DNB data synchronized. Of course there is the tried and true OAI-PMH, which VIAF is using to collect MARC records, but it is showing its age and doesn’t really fit the Linked Data pattern very well. There is the successor to OAI-PMH, OAI-ORE, which better fits more recent notions of Web Architecture and Linked Data. But there are some issues to do with very large resource maps which kind of need ironing out. The Dataset Dynamics group has been doing some interesting work identifying the various mechanisms for performing synchronization with an emphasis on using Atom. Atom is a standard XML document format for describing sets of web resources. In fact OAI-ORE leverages Atom as one of the serialization formats for resource maps. But I’m personally hoping we’ll see some streamlined guidelines for publishing feeds for Linked Data that leverage Atom’s Feed Paging/Archiving for making large lists of resources available. Semantic Sitemaps (a Linked Data extension to the traditional sitemaps that the big web search engines use to stay on top of things) are another possibility. I imagine we’ll see a combination of these approaches, but I think it’s important to see some convergence amongst Library Linked Data publishers to help the ecosystem flourish.
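To give a feel for the Atom archiving idea, here is a hypothetical sketch of a client catching up on changes. It reads one feed page, collects the entries updated since the last sync, and follows the prev-archive link (the link relation that Feed Paging and Archiving, RFC 5005, defines for archived feeds) to older pages. Nothing here describes a feed the DNB actually publishes. The example.org URLs, the function names, the newest-first ordering and the use of uniform UTC timestamps that compare correctly as strings are all assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def read_feed_page(xml_text):
    """Return ([(entry id, updated)], prev-archive URL or None) for one page."""
    feed = ET.fromstring(xml_text)
    entries = [
        (entry.findtext(ATOM + "id"), entry.findtext(ATOM + "updated"))
        for entry in feed.findall(ATOM + "entry")
    ]
    prev_archive = None
    for link in feed.findall(ATOM + "link"):
        if link.get("rel") == "prev-archive":
            prev_archive = link.get("href")
    return entries, prev_archive


def updated_since(xml_text, fetch, since):
    """Collect ids of entries updated after since, walking back through archives.

    Assumes pages list entries newest first and timestamps share one format.
    """
    changed = []
    while xml_text is not None:
        entries, prev_url = read_feed_page(xml_text)
        newer = [uri for uri, updated in entries if updated > since]
        changed.extend(newer)
        if len(newer) < len(entries):
            break  # reached entries that were already seen
        xml_text = fetch(prev_url) if prev_url else None
    return changed


# Two hypothetical pages: the current feed and one archive document.
CURRENT = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="prev-archive" href="http://example.org/feed/archive/1"/>
  <entry><id>http://example.org/record/a</id><updated>2010-10-20T00:00:00Z</updated></entry>
</feed>"""

ARCHIVE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://example.org/record/b</id><updated>2010-10-18T00:00:00Z</updated></entry>
  <entry><id>http://example.org/record/c</id><updated>2010-10-10T00:00:00Z</updated></entry>
</feed>"""

PAGES = {"http://example.org/feed/archive/1": ARCHIVE}

print(updated_since(CURRENT, PAGES.get, "2010-10-15T00:00:00Z"))
```

A feed like this tells a client what changed but not what was deleted, so a real design would still need a convention for signalling removed records.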

Update: I shared some more pedantic thoughts about the d-nb.info URLs in another forum. I didn’t want these particular technical details/questions to detract from saying how important I think the DNB Linked Data release is.