preserving linked data

Earlier this morning Martin Malmsten of the National Library of Sweden asked an interesting question on Twitter:

Do you need help hosting your LOD somewhere else? Could be a valuable excercise in LOD stability http://t.co/FlOT5KqZXn @3windmills @edsu

Martin Malmsten (@geckomarma), September 30, 2013

Martin was asking about the Linked Open Data that the Library of Congress publishes, and how the potential shutdown of the US Federal Government could result in this data being unavailable. If you are interested, click through to the tweet and take a minute to scan the entire discussion.

Truth be told, I’m sure that many could live without the Library of Congress Subject Headings or Name Authority File for a day or two…or honestly even a month or two. It’s not like this data’s currency is essential to the functioning of society, like financial, weather or space data, etc. But Martin’s point is that it raises an interesting general question about the longevity of Linked Open Data, and how it could be made more persistent.

In case you are new to it, a key feature of Linked Data is that it uses the URL to allow a distributed database to grow organically on the Web. So, in practice, if you are building a database about books, and you need to describe the novel Moby Dick, your description doesn’t need to include everything about Herman Melville. Instead it can assert that the book was authored by an entity identified by the URL

http://id.loc.gov/authorities/names/n79006936

When you resolve that URL you can get back data about Herman Melville. For pragmatic reasons you may want to store some of that data locally in your database. But you don’t need to store all of it. If you suspect it has been updated, or need to pull down more data, you simply fetch the URL again. But what if the website that minted that URL is no longer available? Or what if the website is still available but the original DNS registration expired, and someone is cybersquatting on it?
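
As a rough illustration of that fetch-again pattern, the sketch below keeps only a link to the author in a local book record and prepares a request for the author's URL when more data is needed. The `book` dictionary, the `build_request` helper, and the choice of `application/rdf+xml` for the `Accept` header are assumptions made for the example; whether a given server honors that header is up to the publisher.

```python
from urllib.request import Request

# URL from the text identifying Herman Melville
MELVILLE = "http://id.loc.gov/authorities/names/n79006936"

# Hypothetical local record: it links to the author instead of copying his data
book = {"title": "Moby Dick", "author": MELVILLE}


def build_request(uri, media_type="application/rdf+xml"):
    """Prepare (without sending) a GET request that asks for RDF.

    The media type is an assumption; use whatever the publisher supports.
    """
    return Request(uri, headers={"Accept": media_type})


req = build_request(book["author"])
print(req.get_method(), req.full_url)

# Passing req to urllib.request.urlopen() would perform the actual fetch.
# That network step is the one that breaks when the site is unavailable.
```

Nothing here touches the network, which is the point: the local record stays small, and everything else depends on that URL continuing to resolve.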

Admittedly some work has happened at the Dublin Core Metadata Initiative around the preservation of Linked Data vocabularies. The DCMI is taking a largely social approach to this problem, where vocabulary owners and memory institutions interact within the context of a particular trust framework centered on DNS. But the preservation of vocabularies (which are also identified with URLs) is really just a subset of the much larger problem of Web preservation more generally. Does web preservation have anything to offer for the preservation of Linked Data?

When reading the conversation Martin started I was reminded of a demo my colleague Chris Adams gave that showed how the World Digital Library item metadata can be retrieved from the Internet Archive. WDL embed item metadata as microdata in their HTML, and since the Internet Archive archives that HTML, you can get the metadata back from the Internet Archive.

So take this page from WDL:

http://www.wdl.org/en/item/2/

It turns out this particular page has been archived by the Internet Archive 27 times. So with a few lines of Python you can use Internet Archive as a metadata service:

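import urllib.request

import microdata

# an Internet Archive snapshot of the WDL item page
url = "http://web.archive.org/web/20130605205647/http://www.wdl.org/en/item/2/"

# fetch the archived HTML and pull out the microdata items embedded in it
resp = urllib.request.urlopen(url)
items = microdata.get_items(resp)

# print the first item as JSON
print(items[0].json())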

which yields:

{
  "name": [
    "Chola Woman, Full-Length Portrait, Standing, Facing Right, La Paz, Bolivia"
  ], 
  "creator": [
    {
      "additionalType": [
        "http://viaf.org/viaf/"
      ], 
      "type": "http://schema.org/Person", 
      "name": [
        "Vargas, Max T., 1874\u20131959"
      ]
    }
  ], 
  "url": [
    "http://hdl.loc.gov/loc.wdl/dlc.2"
  ], 
  "image": [
    "/web/20130605205647im_/http://content.wdl.org/2/thumbnail/308x255.jpg", 
    "/web/20130605205647/http://www.wdl.org/media/2.png"
  ], 
  "dateCreated": [
    "1911"
  ], 
  "provider": [
    {
      "type": "http://schema.org/Organization", 
      "name": [
        "Library of Congress"
      ]
    }
  ], 
  "keywords": [
    "Social sciences", 
    "Customs, etiquette & folklore", 
    "Costume & personal appearance", 
    "Portrait photographs", 
    "Women"
  ], 
  "type": "http://schema.org/ItemPage", 
  "contentLocation": [
    {
      "type": "http://schema.org/Place", 
      "address": [
        {
          "addressCountry": [
            "Bolivia"
          ], 
          "addressLocality": [
            "La Paz"
          ], 
          "type": "http://schema.org/PostalAddress"
        }
      ]
    }
  ], 
  "description": [
    "This photograph of a Bolivian woman is from the Frank and Frances Carpenter Collection at the Library of Congress. Frank G. Carpenter (1855-1924) was an American writer of books on travel and world geography whose works helped to popularize cultural anthropology and geography in the United States in the early years of the 20th century. Consisting of photographs taken and gathered by Carpenter and his daughter Frances (1890-1972) to illustrate his writings, the collection includes an estimated 16,800 photographs and 7,000 glass and film negatives. Max T. Vargas, a noted Peruvian photographer and postcard publisher who worked in La Paz, Bolivia, in the early part of the 20th century, took the photograph."
  ]
}
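
The archived URL used above follows a visible pattern: the archive's prefix, a 14-digit capture timestamp, then the original URL. A small helper, sketched here with the hypothetical names `wayback_url` and `capture_time`, makes that pattern explicit. It only builds and parses strings, and it assumes the `web.archive.org/web/` layout seen in this post.

```python
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(original_url, timestamp):
    """Build a snapshot URL from a 14-digit timestamp and the original URL."""
    return WAYBACK_PREFIX + timestamp + "/" + original_url


def capture_time(timestamp):
    """Read a 14-digit timestamp as year, month, day, hour, minute, second."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


snapshot = wayback_url("http://www.wdl.org/en/item/2/", "20130605205647")
print(snapshot)
print(capture_time("20130605205647").isoformat())
```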

Similarly you can get the LC Name Authority record for Herman Melville from the Internet Archive using the RDFa that is embedded in the page:

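import rdflib

# an Internet Archive snapshot of the LC Name Authority page for Herman Melville
url = "http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/n79006936.html"

# parse the RDFa embedded in the archived HTML; this assumes an rdflib
# release that includes the "rdfa" parser plugin
g = rdflib.Graph()
g.parse(url, format="rdfa")

# older rdflib releases return bytes from serialize(), so decode if needed
out = g.serialize(format="turtle")
print(out.decode("utf-8") if isinstance(out, bytes) else out)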

which yields:

@prefix cs: <http://www.w3.org/2003/06/sw-vocab-status/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ri: <http://id.loc.gov/ontologies/RecordInfo#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://id.loc.gov/authorities/names/n79006936> dcterms:created "1979-01-31T00:00:00" ;
    dcterms:modified "2012-09-13T07:47:36" ;
    dcterms:source "His Typee ... 1846. ",
        "His Pierre, or, The ambiguities ... c1984: t.p. (Herman Meville)"@en,
        "Jeney, Z. Monody, c1983: t.p. (Hermann Melville)"@en,
        "Mobi Diḳ, 1951 or 1951: t.p. (Herman Melṿil)"@en,
        "Pai ching, 1990 (1992 printing): t.p. (Ho-erh-man Mai-erh-wei-erh, of U.S.) t.p. verso, etc. (Herman Melville [in rom.]; b. 1819, NY; d. 9-27-1891, NY)"@en,
        "The enchanted isles, 2002: p. vii (first published in 1854 The encantadas, or, Enchanted isles by Salvator R. Tarnmoor)"@en,
        "Theory of the American novel, 1970: p.71 (Hawthorne and his mosses, by a Virginian spending July in Vermont [i.e. Herman Melville])"@en,
        "To Kampanario, c1994: t.p. (Cherman Melvil)"@en,
        "Wikipedia, 1 September 2010 (Herman Melville occupations: novelist, short story writer, teacher, sailor, lecturer, poet, customs inspector)"@en ;
    madsrdf:adminMetadata [ a ri:RecordInfo,
                cs:ChangeSet ;
            ri:recordChangeDate "2012-09-13T07:47:36" ;
            ri:recordStatus "revised" ;
            cs:changeReason "revised" ;
            cs:createdDate "2012-09-13T07:47:36" ],
        [ a ri:RecordInfo,
                cs:ChangeSet ;
            ri:recordChangeDate "1979-01-31T00:00:00" ;
            ri:recordStatus "new" ;
            cs:changeReason "new" ;
            cs:createdDate "1979-01-31T00:00:00" ] ;
    madsrdf:authoritativeLabel "Melville, Herman, 1819-1891"@en ;
    madsrdf:classification "PS2380-PS2388" ;
    madsrdf:editorialNote "[Machine-derived non-Latin script reference project.]",
        "[Non-Latin script references not evaluated.]" ;
    madsrdf:hasSource [ a madsrdf:Source ;
            madsrdf:citation-note "t.p. (Herman Melṿil)"@en ;
            madsrdf:citation-source "Mobi Diḳ, 1951 or 1951:" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "(Herman Melville occupations: novelist, short story writer, teacher, sailor, lecturer, poet, customs inspector)"@en ;
            madsrdf:citation-source "Wikipedia, 1 September 2010" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "t.p. (Herman Meville)"@en ;
            madsrdf:citation-source "His Pierre, or, The ambiguities ... c1984:" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "p.71 (Hawthorne and his mosses, by a Virginian spending July in Vermont [i.e. Herman Melville])"@en ;
            madsrdf:citation-source "Theory of the American novel, 1970:" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "t.p. (Ho-erh-man Mai-erh-wei-erh, of U.S.) t.p. verso, etc. (Herman Melville [in rom.]; b. 1819, NY; d. 9-27-1891, NY)"@en ;
            madsrdf:citation-source "Pai ching, 1990 (1992 printing):" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "" ;
            madsrdf:citation-source "His Typee ... 1846." ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "t.p. (Cherman Melvil)"@en ;
            madsrdf:citation-source "To Kampanario, c1994:" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "t.p. (Hermann Melville)"@en ;
            madsrdf:citation-source "Jeney, Z. Monody, c1983:" ;
            madsrdf:citation-status "found" ],
        [ a madsrdf:Source ;
            madsrdf:citation-note "p. vii (first published in 1854 The encantadas, or, Enchanted isles by Salvator R. Tarnmoor)"@en ;
            madsrdf:citation-source "The enchanted isles, 2002:" ;
            madsrdf:citation-status "found" ] ;
    madsrdf:hasVariant _:N07ab04d5af3544cda62856462e53835c,
        _:N08816667a0ec4188aec135323a811da4,
        _:N1101086203694aeab08270c635e2a511,
        _:N3054c346a2754b47bcb4efe873bae4fd,
        _:N36d03253a6e14b49b6b947744ba2b709,
        _:N5b32566a3cb54cea8a085acec3760c19,
        _:N5da6ce34caea470893af1f8253e5512d,
        _:N633919e4f6cb47ebbc7188ed1975280b,
        _:Nab5721a27261493dadb009ebc00a7237,
        _:Nbc0c93b4224e4baba93062ff34dcb970,
        _:Nd5b72f6d9eec49028fbbe024165951cc,
        _:Ne179c605953e4f97a32512b52f3272dc,
        _:Ne46cdc28f7ac4cfc8eda8d37ac47b130 ;
    madsrdf:isMemberOfMADSCollection <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/collection_LCNAF>,
        <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/collection_NamesAuthorizedHeadings> ;
    madsrdf:isMemberOfMADSScheme <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names> ;
    skos:editorial "[Machine-derived non-Latin script reference project.]",
        "[Non-Latin script references not evaluated.]" ;
    skos:inScheme <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names> ;
    skos:prefLabel "Melville, Herman, 1819-1891"@en ;
    skosxl:altLabel _:N07ab04d5af3544cda62856462e53835c,
        _:N08816667a0ec4188aec135323a811da4,
        _:N1101086203694aeab08270c635e2a511,
        _:N3054c346a2754b47bcb4efe873bae4fd,
        _:N36d03253a6e14b49b6b947744ba2b709,
        _:N5b32566a3cb54cea8a085acec3760c19,
        _:N5da6ce34caea470893af1f8253e5512d,
        _:N633919e4f6cb47ebbc7188ed1975280b,
        _:Nab5721a27261493dadb009ebc00a7237,
        _:Nbc0c93b4224e4baba93062ff34dcb970,
        _:Nd5b72f6d9eec49028fbbe024165951cc,
        _:Ne179c605953e4f97a32512b52f3272dc,
        _:Ne46cdc28f7ac4cfc8eda8d37ac47b130 .

<http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/n79006936.html> madsrdf:isMemberOfMADSScheme <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names> ;
    xhv:alternate <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/n79006936.json>,
        <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/n79006936.nt>,
        <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names/n79006936.rdf> ;
    xhv:stylesheet <http://web.archive.org/web/20130614192231cs_/http://id.loc.gov/static/css/2012/loc_print_v2.css>,
        <http://web.archive.org/web/20130614192231cs_/http://id.loc.gov/static/css/2012/styles.css> ;
    skos:inScheme <http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names> .

<http://web.archive.org/web/20130614192231/http://viaf.org/viaf/sourceID/LC%7Cn+79006936#skos:Concept> madsrdf:hasExactExternalAuthority "http://viaf.org/viaf/sourceID/LC%7Cn+79006936#skos:Concept" ;
    skos:exactMatch "http://viaf.org/viaf/sourceID/LC%7Cn+79006936#skos:Concept" .

<http://web.archive.org/web/20130614192231/http://www.loc.gov/mads/rdf/v1#Authority> a "MADS/RDF Authority" .

<http://web.archive.org/web/20130614192231/http://www.loc.gov/mads/rdf/v1#PersonalName> a "MADS/RDF PersonalName" .

<http://web.archive.org/web/20130614192231/http://www.w3.org/2004/02/skos/core#Concept> a "SKOS Concept" .

_:N07ab04d5af3544cda62856462e53835c a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "מלוויל, הרמן, 1819־1891" ;
    skosxl:literalForm "מלוויל, הרמן, 1819־1891" .

_:N08816667a0ec4188aec135323a811da4 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "麥爾維爾, 1819-1891" ;
    skosxl:literalForm "麥爾維爾, 1819-1891" .

_:N1101086203694aeab08270c635e2a511 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "ميلڤيل، هرمن" ;
    skosxl:literalForm "ميلڤيل، هرمن" .

_:N3054c346a2754b47bcb4efe873bae4fd a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Tarnmoor, Salvator R., 1819-1891" ;
    skosxl:literalForm "Tarnmoor, Salvator R., 1819-1891" .

_:N36d03253a6e14b49b6b947744ba2b709 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "מלוויל, הרמן" ;
    skosxl:literalForm "מלוויל, הרמן" .

_:N5b32566a3cb54cea8a085acec3760c19 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Virginian spending July in Vermont, 1819-1891" ;
    skosxl:literalForm "Virginian spending July in Vermont, 1819-1891" .

_:N5da6ce34caea470893af1f8253e5512d a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Melville, Hermann, 1819-1891" ;
    skosxl:literalForm "Melville, Hermann, 1819-1891" .

_:N633919e4f6cb47ebbc7188ed1975280b a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Melvil, Cherman, 1819-1891" ;
    skosxl:literalForm "Melvil, Cherman, 1819-1891" .

_:Nab5721a27261493dadb009ebc00a7237 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Melvill, German, 1819-1891" ;
    skosxl:literalForm "Melvill, German, 1819-1891" .

_:Nbc0c93b4224e4baba93062ff34dcb970 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Mai-erh-wei-erh, Ho-erh-man, 1819-1891" ;
    skosxl:literalForm "Mai-erh-wei-erh, Ho-erh-man, 1819-1891" .

_:Nd5b72f6d9eec49028fbbe024165951cc a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Melṿil, Herman, 1819-1891" ;
    skosxl:literalForm "Melṿil, Herman, 1819-1891" .

_:Ne179c605953e4f97a32512b52f3272dc a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "מלויל, הרמן" ;
    skosxl:literalForm "מלויל, הרמן" .

_:Ne46cdc28f7ac4cfc8eda8d37ac47b130 a madsrdf:PersonalName,
        madsrdf:Variant,
        skosxl:Label ;
    madsrdf:variantLabel "Meville, Herman, 1819-1891" ;
    skosxl:literalForm "Meville, Herman, 1819-1891" .
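
One thing to watch for in both outputs above: the archive rewrites links in the pages it serves, so some values now point at web.archive.org rather than at the original site (the image paths in the WDL record, the scheme and collection URIs in the Melville record). If you want the original identifiers back, a sketch like the following could strip the prefix. The `original_url` helper and its regular expression are assumptions based only on the URL shapes visible in this post.

```python
import re

# Optional archive host, "/web/", a 14-digit timestamp, and an optional
# two-letter modifier such as "im_" or "cs_" (shapes seen in the output above)
WAYBACK_RE = re.compile(
    r"^(?:https?://web\.archive\.org)?/web/\d{14}(?:[a-z]{2}_)?/(.+)$"
)


def original_url(url):
    """Return the URL with any archive prefix removed, else unchanged."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


examples = [
    "/web/20130605205647im_/http://content.wdl.org/2/thumbnail/308x255.jpg",
    "http://web.archive.org/web/20130614192231/http://id.loc.gov/authorities/names",
    "http://id.loc.gov/authorities/names/n79006936",
]
for example in examples:
    print(original_url(example))
```

URLs that carry no archive prefix pass through untouched, so the helper can be applied to every value in a record without first checking which ones were rewritten.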

Since it is linked to directly from the HTML page, Internet Archive have also archived the RDF XML itself, and they actually even have the MARC XML, if that’s the sorta thing you are into.

But, as my previous post about perma.cc touched on, a solution to archiving something as important as the Web can’t realistically rely on a single point of failure (the Internet Archive). We can’t simply decide not to worry about archiving the Web because Brewster Kahle is taking care of it. We need lots of copies of linked data to keep stuff safe.

Fortunately, web archiving is going on at a variety of institutions. But if you have a URL for a webpage, how do you know what web archives have a copy? Do you have to go and ask each one? How do you even know which ones to ask? How do you ask?

…

The Memento project worked on aggregating the holdings of web archives in order to provide a single place to look up a URL for their Firefox plugin. They also ended up proxying some sources like Wikipedia that they couldn’t convince to support the Memento protocol. From what little I’ve heard about this process it was done in an ad-hoc fashion, leaning on personal relationships in the IIPC, and was fairly resource intensive, to the point where it was more efficient to use the sneakernet to get the data. If I’m misremembering that I trust someone will let me know.
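
For readers who haven't seen it, the core of the Memento protocol is a request to a "TimeGate" for a URL, with an `Accept-Datetime` header saying which point in time you are interested in. The sketch below only prepares such a request. The TimeGate address `http://timegate.example.org/timegate/` is a placeholder rather than a real service, the `timegate_request` helper is hypothetical, and appending the original URL to the TimeGate address is a common convention that I am assuming here, not something every TimeGate is required to follow.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Placeholder TimeGate address (an assumption, not a real service)
TIMEGATE = "http://timegate.example.org/timegate/"


def timegate_request(url, when, timegate=TIMEGATE):
    """Prepare (without sending) a request for a capture of url around when."""
    # Accept-Datetime takes an HTTP date in GMT, so when must be UTC-aware
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return Request(timegate + url, headers=headers)


when = datetime(2013, 6, 5, 20, 56, 47, tzinfo=timezone.utc)
req = timegate_request("http://id.loc.gov/authorities/names/n79006936", when)

print(req.full_url)
for name, value in req.header_items():
    print(name + ": " + value)
```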

Earlier this year, David Rosenthal posted some interesting ideas on how to publish the holdings of web archives so that it is not so expensive to aggregate them. His idea is basically for web archives to publish the hostnames of websites they archive instead of complete lists of URLs and all the times they have been fetched. An aggregator could collect these lists, and then provide hints to clients about what web archive a given URL might be found in. This would push the work of polling archives for a particular URL onto the client, which would receive a hint about what web archives to look in. It would also mean that there would be space for more than one aggregator, since it wouldn’t be so resource intensive.
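
To make that division of labor concrete, here is a toy sketch of the hint lookup as described above: archives publish hostnames, the aggregator inverts those lists, and a client asks which archives are worth polling for a URL. The archive names, the hostname lists, and the functions `build_index` and `hints_for` are all made up for illustration and are not taken from Rosenthal's post.

```python
from urllib.parse import urlsplit

# Hypothetical host lists, as each archive might publish them
published = {
    "archive-a.example": ["id.loc.gov", "www.wdl.org"],
    "archive-b.example": ["id.loc.gov"],
    "archive-c.example": ["viaf.org"],
}


def build_index(host_lists):
    """Invert {archive: [hostnames]} into {hostname: set of archives}."""
    index = {}
    for archive, hosts in host_lists.items():
        for host in hosts:
            index.setdefault(host.lower(), set()).add(archive)
    return index


def hints_for(url, index):
    """Return, sorted, the archives that list the URL's hostname."""
    host = (urlsplit(url).hostname or "").lower()
    return sorted(index.get(host, ()))


index = build_index(published)
print(hints_for("http://id.loc.gov/authorities/names/n79006936", index))
print(hints_for("http://example.org/nothing", index))
```

A hint is only a hint: a matching hostname doesn't guarantee that the archive captured this particular URL, so the client still has to poll each suggested archive. That is the trade being made, a much smaller index in exchange for some wasted lookups.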

I really like Rosenthal’s idea, and hope that we will see a simple pattern for publishing the holdings of web archives that doesn’t turn running an aggregator service into an expensive problem. At the same time, it’s important that the solution stays simple enough that publishing holdings doesn’t become an onerous process that web archives never get around to doing. It would be nice to see the bar lowered so that smaller institutions and even individuals could get in the game, not just national libraries. I also hope we can find a simple place to build a list of where these host inventories live, similar to the Name Assigning Authority Number (NAAN) registry that is used (and mirrored) as part of the ARK identifier scheme.