Visualizing FRBR Worksets


The model behind the Functional Requirements for Bibliographic Records (FRBR) was published over 10 years ago, and has been simmering in library land ever since. Bit by bit, FRBR has been finding its way into library systems and software, sometimes in a slightly modified form. But it has been slow going because FRBR offers a more nuanced view of bibliographic data than what is available in our legacy MARC data. So the FRBR relationships we want largely have to be teased out of the data we have.


One of the primary things that FRBR offers is the notion of a Work that groups together Expressions and Manifestations. For example, William Gibson wrote a book, Neuromancer, which has been translated into many languages and is available from multiple publishers. Collectors are sometimes interested in specific editions of a book, say a first edition printing; but readers are often interested in any edition of a work, because they don’t particularly care what’s on the cover, or what pagination or typeface is used. FRBR provides a conceptual model for working with books in this way. For the software developer FRBR also holds out the promise of a normalized view of book data, where some things, such as the author and subject of the book, can be expressed in one place (as attributes of the Work) rather than repeated for all the Expressions and Manifestations.

If you are a bibliographic data aficionado, you are probably already familiar with FRBR-ization Web services like xISBN and ThingISBN that make it possible to determine other related editions, or the workset, for a given ISBN. So to look up the 1995 Ace Books printing of Neuromancer (0441569595) at xISBN you can GET a URL like http://xisbn.worldcat.org/webservices/xid/isbn/0441569595?method=getEditions&format=xml and get back some XML like:

<?xml version="1.0" encoding="UTF-8"?>
<rsp xmlns="http://worldcat.org/xid/isbn/" stat="ok">
  <isbn>0441569595</isbn>
  <isbn>0441569579</isbn>
  <isbn>0441012035</isbn>
  <isbn>0006480411</isbn>
  <isbn>1570420599</isbn>
  <isbn>0007119585</isbn>
  <isbn>0736638369</isbn>
  <isbn>0441569587</isbn>
  <isbn>1570421560</isbn>
  <isbn>9029042478</isbn>
  <isbn>229000619X</isbn>
  <isbn>415010672X</isbn>
  <isbn>0307969940</isbn>
  <isbn>0441569560</isbn>
  <isbn>569700124X</isbn>
  <isbn>5792101205</isbn>
  <isbn>2707115622</isbn>
  <isbn>7542818732</isbn>
  <isbn>229030820X</isbn>
  <isbn>2744139157</isbn>
  <isbn>0932096417</isbn>
  <isbn>3453313895</isbn>
  <isbn>1616577843</isbn>
  <isbn>9607002504</isbn>
  <isbn>8445072897</isbn>
  <isbn>0002252325</isbn>
  <isbn>8842907464</isbn>
  <isbn>9029049367</isbn>
  <isbn>8445075950</isbn>
  <isbn>9029050748</isbn>
  <isbn>8071930482</isbn>
  <isbn>0586066454</isbn>
  <isbn>7542824139</isbn>
  <isbn>9119027818</isbn>
  <isbn>8085601273</isbn>
  <isbn>0441000681</isbn>
  <isbn>8445070843</isbn>
  <isbn>8385784012</isbn>
  <isbn>8982738851</isbn>
  <isbn>3893111387</isbn>
  <isbn>807193318X</isbn>
  <isbn>5170198892</isbn>
  <isbn>8371500432</isbn>
  <isbn>8467426373</isbn>
  <isbn>0441007465</isbn>
  <isbn>057503470X</isbn>
  <isbn>8585887907</isbn>
  <isbn>3893111379</isbn>
  <isbn>911300347X</isbn>
  <isbn>8422672596</isbn>
  <isbn>9118721826</isbn>
  <isbn>3453056655</isbn>
  <isbn>3807703098</isbn>
  <isbn>8390021439</isbn>
  <isbn>8203203329</isbn>
  <isbn>8789586735</isbn>
  <isbn>8485752414</isbn>
  <isbn>9612310203</isbn>
  <isbn>8445074059</isbn>
  <isbn>8445076620</isbn>
  <isbn>8974271419</isbn>
  <isbn>3453403851</isbn>
  <isbn>9510172049</isbn>
  <isbn>8758804110</isbn>
  <isbn>9510193062</isbn>
  <isbn>2277223255</isbn>
  <isbn>9637632050</isbn>
  <isbn>9755760326</isbn>
  <isbn>3898132595</isbn>
  <isbn>8790136292</isbn>
  <isbn>8804516445</isbn>
  <isbn>8842910686</isbn>
</rsp>
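
Scripting this is about as simple as Web services get: an HTTP GET and a little XML parsing. Here’s a minimal sketch (not the actual code I used) that pulls back the editions for an ISBN as a Python set; the namespace URI is the one you can see in the response above:

import urllib.request
import xml.etree.ElementTree as ET

XISBN_NS = "{http://worldcat.org/xid/isbn/}"

def xisbn_editions(isbn):
    """Return the set of related ISBNs that xISBN knows about."""
    url = ("http://xisbn.worldcat.org/webservices/xid/isbn/%s"
           "?method=getEditions&format=xml" % isbn)
    doc = ET.parse(urllib.request.urlopen(url))
    return set(e.text for e in doc.findall(XISBN_NS + "isbn"))

print(xisbn_editions("0441569595"))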

LibraryThing has a similar API call that allows you to splice the ISBN into a URL like http://www.librarything.com/api/thingISBN/0441569595 and get back:

<?xml version="1.0" encoding="utf-8"?>
<idlist>
  <isbn>0441569595</isbn>
  <isbn>0441012035</isbn>
  <isbn>0006480411</isbn>
  <isbn>0586066454</isbn>
  <isbn>0441007465</isbn>
  <isbn>0441000681</isbn>
  <isbn>8585887907</isbn>
  <isbn>0002252325</isbn>
  <isbn>0441569560</isbn>
  <isbn>3453056655</isbn>
  <isbn>0441569579</isbn>
  <isbn>0932096417</isbn>
  <isbn>0441569587</isbn>
  <isbn>057503470X</isbn>
  <isbn>229030820X</isbn>
  <isbn>8445070843</isbn>
  <isbn>2277223255</isbn>
  <isbn>3453313895</isbn>
  <isbn>8804516445</isbn>
  <isbn>9510193062</isbn>
  <isbn>0007119585</isbn>
  <isbn>8445075950</isbn>
  <isbn>9119027818</isbn>
  <isbn>9510172049</isbn>
  <isbn>8842907464</isbn>
  <isbn>1570420599</isbn>
  <isbn>9637632050</isbn>
  <isbn>9029042478</isbn>
  <isbn>415010672X</isbn>
  <isbn>9634970982</isbn>
  <isbn>8085601273</isbn>
  <isbn>0613922514</isbn>
  <isbn>2707115622</isbn>
  <isbn>8445074059</isbn>
  <isbn>8842913529</isbn>
  <isbn>1569564116</isbn>
  <isbn>9118721826</isbn>
  <isbn>8842910686</isbn>
  <isbn>3898132595</isbn>
  <isbn>1570421560</isbn>
  <isbn>229000619X</isbn>
  <isbn>3893111387</isbn>
  <isbn>8071930482</isbn>
  <isbn>2744139157</isbn>
  <isbn>8445072897</isbn>
  <isbn>8371500432</isbn>
  <isbn>8576570491</isbn>
  <isbn>8789586735</isbn>
  <isbn>9639238023</isbn>
  <isbn>3453074203</isbn>
  <isbn>3893111379</isbn>
  <isbn>0307969940</isbn>
  <isbn>8203203329</isbn>
  <isbn>8842906808</isbn>
  <isbn>9752103677</isbn>
  <isbn>0736638369</isbn>
  <isbn>8324577750</isbn>
  <isbn>8790136292</isbn>
  <isbn>8778803438</isbn>
  <isbn>807193318X</isbn>
</idlist>
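
The thingISBN response parses the same way; only the URL changes and there’s no namespace to worry about. A similarly hedged sketch:

import urllib.request
import xml.etree.ElementTree as ET

def thingisbn_editions(isbn):
    """Return the set of related ISBNs from LibraryThing's thingISBN."""
    url = "http://www.librarything.com/api/thingISBN/%s" % isbn
    doc = ET.parse(urllib.request.urlopen(url))
    return set(e.text for e in doc.findall("isbn"))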

I don’t actually know the mechanics of ThingISBN and xISBN in detail, but it’s my understanding that xISBN uses an algorithm to unify works, whereas LibraryThing relies on people to connect things up.

A newer player in this space is the OpenLibrary API. Instead of providing an ISBN -> ISBNs function, OpenLibrary makes the editions for a given Work available at a URL like http://openlibrary.org/works/OL27258W/editions.json?limit=50&offset=0. This requires you to know the OpenLibrary Work identifier (e.g. OL27258W). Fortunately you can look that up with another REST call using the ISBN: http://openlibrary.org/api/books?bibkeys=ISBN:0441569595&jscmd=details&format=json. The OpenLibrary response includes a lot more information than the LibraryThing or xISBN results, which is why you are required to page through the results with the API rather than getting them all back at once:

{
  "size": 19, 
  "links": {
    "self": "/works/OL27258W/editions.json?limit=50&offset=0", 
    "work": "/works/OL27258W"
  }, 
  "entries": [
    {
      "number_of_pages": 322, 
      "subtitle": "roman", 
      "series": [
        "Cyberspace trilogien", 
        "Gibsons Cyberspace trilogi -- 1"
      ], 
      "latest_revision": 3, 
      "edition_name": "2. udg./1. opl.", 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:1618994437:773"
      ], 
      "title": "Neuromantiker", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/dan"
        }
      ], 
      "publish_country": "dk ", 
      "by_statement": "William Gibson ; p\u00e5 dansk ved Arne Herl\u00f8v Petersen.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 3, 
      "publishers": [
        "Per Kof"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T09:06:00.229423"
      }, 
      "key": "/books/OL17987798M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "Copenhagen"
      ], 
      "pagination": "322 p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-10-08T22:54:50.763681"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "identifiers": {
        "librarything": [
          "609"
        ]
      }, 
      "isbn_10": [
        "8790136292"
      ], 
      "publish_date": "1995", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "14771"
        ]
      }, 
      "subject_place": [
        "Japan"
      ], 
      "lc_classifications": [
        "PR9199.3.G514 N4x 1986"
      ], 
      "latest_revision": 4, 
      "edition_name": "1st Phantasia Press ed.", 
      "genres": [
        "Fiction."
      ], 
      "source_records": [
        "marc:marc_records_scriblio_net/part20.dat:107059645:825"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "subjects": [
        "Computer hackers -- Fiction", 
        "Business intelligence -- Fiction", 
        "Information superhighway -- Fiction", 
        "Nervous system -- Wounds and injuries -- Fiction", 
        "Conspiracies -- Fiction", 
        "Japan -- Fiction"
      ], 
      "publish_country": "miu", 
      "by_statement": "William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "Phantasia Press"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-07-31T08:19:43.878905"
      }, 
      "key": "/books/OL2154100M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "West Bloomfield, Mich"
      ], 
      "pagination": "vi, 231 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-04-01T03:28:50.625462"
      }, 
      "lccn": [
        "88672297"
      ], 
      "number_of_pages": 231, 
      "isbn_10": [
        "0932096417"
      ], 
      "publish_date": "1986", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "826097"
        ]
      }, 
      "latest_revision": 4, 
      "source_records": [
        "marc:talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1128979384:559"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "publish_country": "xxk", 
      "by_statement": "William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "Voyager"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-19T09:28:46.010665"
      }, 
      "key": "/books/OL22822383M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "London"
      ], 
      "pagination": "317p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2009-01-04T10:04:32.718474"
      }, 
      "dewey_decimal_class": [
        "813.54"
      ], 
      "notes": {
        "type": "/type/text", 
        "value": "Originally published: [London]: Gollancz; 1984."
      }, 
      "number_of_pages": 317, 
      "isbn_10": [
        "0006480411"
      ], 
      "publish_date": "1995", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "number_of_pages": 316, 
      "latest_revision": 4, 
      "edition_name": "1a ed. en bolsillo.", 
      "source_records": [
        "marc:SanFranPL10/SanFranPL10.out:61656066:1111"
      ], 
      "title": "Neuromante", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/spa"
        }
      ], 
      "subjects": [
        "Ciencia-ficci\u00f3n"
      ], 
      "publish_country": "sp ", 
      "by_statement": "William Gibson ; [traducci\u00f3n de Jos\u00e9 Arconada Rodr\u00edguez y Javier Ferreira Ramos].", 
      "oclc_numbers": [
        "50083763"
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "Minotauro"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-19T10:44:18.483562"
      }, 
      "key": "/books/OL23054075M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "Barcelona"
      ], 
      "pagination": "316 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2009-02-18T07:02:41.481991"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer.\n\nPremio Hugo.\n\nPremio Nebula.\n\nPremio Philip K. Dick."
      }, 
      "identifiers": {
        "librarything": [
          "609"
        ]
      }, 
      "isbn_10": [
        "8445072897"
      ], 
      "publish_date": "1997", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "1163291"
        ]
      }, 
      "latest_revision": 4, 
      "source_records": [
        "marc:talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1449506617:614"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "publish_country": "enk", 
      "by_statement": "William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "HarperCollins"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-19T09:38:30.187012"
      }, 
      "key": "/books/OL22849249M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "London"
      ], 
      "pagination": "277p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2009-01-07T20:05:13.391858"
      }, 
      "dewey_decimal_class": [
        "813.54"
      ], 
      "notes": {
        "type": "/type/text", 
        "value": "Originally published in Great Britain by Gollancz, 1984."
      }, 
      "number_of_pages": 277, 
      "isbn_10": [
        "0002252325"
      ], 
      "publish_date": "1994", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "publishers": [
        "Harper Collins"
      ], 
      "pagination": "317p. ;", 
      "source_records": [
        "marc:talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:2979159053:556"
      ], 
      "title": "Neuromancer", 
      "dewey_decimal_class": [
        "813/.54"
      ], 
      "notes": {
        "type": "/type/text", 
        "value": "Originally published, London , Gollancz, 1984."
      }, 
      "number_of_pages": 317, 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-10-25T02:27:53.587823"
      }, 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-10-15T15:26:45.512262"
      }, 
      "latest_revision": 3, 
      "publish_country": "xxk", 
      "key": "/books/OL19969875M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_date": "1993", 
      "publish_places": [
        "London"
      ], 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "by_statement": "William Gibson.", 
      "revision": 3
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "2292560"
        ]
      }, 
      "subtitle": "Science Fiction Roman", 
      "series": [
        "Heyne science fiction & fantasy -- Bd. 06/4400"
      ], 
      "latest_revision": 4, 
      "edition_name": "3. Aufl.", 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:716896905:827"
      ], 
      "title": "Neuromancer", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/ger"
        }
      ], 
      "publish_country": "gw ", 
      "by_statement": "William Gibson ; Deutsche \u00dcbersetzund von Reinhard Heinz.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "W. Heyne"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T03:53:57.235299"
      }, 
      "key": "/books/OL16064340M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "M\u00fcnchen"
      ], 
      "pagination": "363 p. :", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-09-22T02:36:53.194997"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "\"Deutsche Erstver\u00f6ffentlichung.\"\n\nTranslation of: Neuromancer."
      }, 
      "number_of_pages": 363, 
      "isbn_10": [
        "3453313895"
      ], 
      "publish_date": "1989", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "number_of_pages": 371, 
      "subject_place": [
        "Japan"
      ], 
      "covers": [
        284192
      ], 
      "lc_classifications": [
        "PS3557.I2264 N48 2004"
      ], 
      "latest_revision": 6, 
      "edition_name": "20th anniversary ed.", 
      "genres": [
        "Fiction."
      ], 
      "source_records": [
        "marc:marc_records_scriblio_net/part15.dat:26112823:924", 
        "marc:marc_loc_updates/v35.i20.records.utf8:16403653:1145"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "subjects": [
        "Computer hackers -- Fiction", 
        "Business intelligence -- Fiction", 
        "Information superhighway -- Fiction", 
        "Nervous system -- Wounds and injuries -- Fiction", 
        "Conspiracies -- Fiction", 
        "Japan -- Fiction"
      ], 
      "publish_country": "nyu", 
      "by_statement": "William Gibson ; with a new introduction by the author ; with an afterword by Jack Womack.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 6, 
      "publishers": [
        "Ace Books"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-07-31T14:51:42.931650"
      }, 
      "key": "/books/OL3305354M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "New York"
      ], 
      "pagination": "xi, 371 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-04-01T03:28:50.625462"
      }, 
      "dewey_decimal_class": [
        "813/.54"
      ], 
      "identifiers": {
        "goodreads": [
          "14770"
        ], 
        "librarything": [
          "609"
        ]
      }, 
      "lccn": [
        "2004048718"
      ], 
      "isbn_10": [
        "0441012035"
      ], 
      "publish_date": "2004", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "888628"
        ]
      }, 
      "subject_place": [
        "Japan"
      ], 
      "covers": [
        283860
      ], 
      "lc_classifications": [
        "PS3557.I2264 N48 2000"
      ], 
      "latest_revision": 5, 
      "edition_name": "Ace trade ed.", 
      "genres": [
        "Fiction."
      ], 
      "source_records": [
        "marc:marc_records_scriblio_net/part13.dat:153635745:885"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "subjects": [
        "Computer hackers -- Fiction", 
        "Business intelligence -- Fiction", 
        "Information superhighway -- Fiction", 
        "Nervous system -- Wounds and injuries -- Fiction", 
        "Conspiracies -- Fiction", 
        "Japan -- Fiction"
      ], 
      "publish_country": "nyu", 
      "series": [
        "Ace science fiction"
      ], 
      "by_statement": "William Gibson ; with an afterword by Jack Womack.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 5, 
      "publishers": [
        "Ace Books"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-03T20:25:35.114363"
      }, 
      "key": "/books/OL3963678M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "New York"
      ], 
      "pagination": "276 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-04-01T03:28:50.625462"
      }, 
      "dewey_decimal_class": [
        "813/.54"
      ], 
      "number_of_pages": 276, 
      "lccn": [
        "2001268016"
      ], 
      "isbn_10": [
        "0441007465"
      ], 
      "publish_date": "2000", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "122395"
        ]
      }, 
      "latest_revision": 4, 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:219836701:673"
      ], 
      "title": "Neuromancien", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/fre"
        }
      ], 
      "publish_country": "fr ", 
      "by_statement": "William Gibson ; traduit de l'am\u00e9ricain par Jean Bonnefoy.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "\u00c9ditions J'ai lu"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T21:33:39.583788"
      }, 
      "key": "/books/OL21395048M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "Paris"
      ], 
      "pagination": "318 p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-11-02T11:15:35.318748"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "number_of_pages": 318, 
      "isbn_10": [
        "2277223255"
      ], 
      "publish_date": "1988", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "subtitle": "en sp\u00e6ndingsroman", 
      "latest_revision": 3, 
      "contributions": [
        "Mortensen, Hans Palle"
      ], 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:848159064:705"
      ], 
      "title": "Neuromantiker", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/dan"
        }
      ], 
      "publish_country": "de ", 
      "by_statement": "William Gibson ; p\u00e5 dansk ved Hans Palle Mortensen.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 3, 
      "publishers": [
        "Vega"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-10-15T15:26:45.512262"
      }, 
      "key": "/books/OL16541408M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "[K\u00f8obenhavn]"
      ], 
      "pagination": "329 p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-09-24T15:45:30.569311"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "number_of_pages": 329, 
      "isbn_10": [
        "8758804110"
      ], 
      "publish_date": "1989", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "1163292"
        ]
      }, 
      "lc_classifications": [
        "PS3513.I2824"
      ], 
      "latest_revision": 4, 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:2715222992:644"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "publish_country": "enk", 
      "by_statement": "by William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "Gollancz"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T12:15:40.027146"
      }, 
      "key": "/books/OL19160947M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "London"
      ], 
      "pagination": "251 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-10-21T06:38:01.937259"
      }, 
      "dewey_decimal_class": [
        "823/.914"
      ], 
      "number_of_pages": 251, 
      "isbn_10": [
        "057503470X"
      ], 
      "publish_date": "1984", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "number_of_pages": 273, 
      "latest_revision": 6, 
      "contributions": [
        "Cuijpers, Peter"
      ], 
      "edition_name": "1. druk.", 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:848175223:692"
      ], 
      "title": "Zenumagi\u00ebr", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/dut"
        }
      ], 
      "publish_country": "ne ", 
      "by_statement": "William Gibson ; vertaling Peter Cuijpers.", 
      "oclc_numbers": [
        "64599048"
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 6, 
      "publishers": [
        "Meulenhoff"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2011-04-28T07:26:35.438655"
      }, 
      "key": "/books/OL16541422M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "Amsterdam"
      ], 
      "pagination": "273 p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-09-24T15:45:41.892954"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "identifiers": {
        "librarything": [
          "609"
        ]
      }, 
      "isbn_10": [
        "9029042478"
      ], 
      "publish_date": "1989", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "number_of_pages": 295, 
      "latest_revision": 6, 
      "contributions": [
        "Eggen, Torgrim, 1958-"
      ], 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:4064625723:790"
      ], 
      "title": "Nevromantiker", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/nor"
        }
      ], 
      "publish_country": "no ", 
      "by_statement": "William Gibson ; oversatt av og med etterord av Torgrim Eggen.", 
      "oclc_numbers": [
        "224937105"
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 6, 
      "publishers": [
        "Aschehoug"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2011-04-25T21:45:39.581918"
      }, 
      "key": "/books/OL19726291M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "Oslo"
      ], 
      "pagination": "295 p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-10-23T17:52:44.936450"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "identifiers": {
        "librarything": [
          "609"
        ]
      }, 
      "isbn_10": [
        "8203203329"
      ], 
      "publish_date": "1999", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "publishers": [
        "Editrice Nord"
      ], 
      "pagination": "iii, 260 p.", 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:1419376120:645"
      ], 
      "title": "Neuromante", 
      "work_titles": [
        "Neuromancer."
      ], 
      "series": [
        "Cosmo -- 80"
      ], 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "number_of_pages": 260, 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-09-28T17:38:21.398006"
      }, 
      "languages": [
        {
          "key": "/languages/ita"
        }
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-10-15T15:26:45.512262"
      }, 
      "latest_revision": 3, 
      "publish_country": "it ", 
      "key": "/books/OL17407456M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_date": "1986", 
      "publish_places": [
        "Milano"
      ], 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "by_statement": "William Gibson.", 
      "revision": 3
    }, 
    {
      "number_of_pages": 271, 
      "subject_place": [
        "Japan"
      ], 
      "covers": [
        284574
      ], 
      "lc_classifications": [
        "PS3557.I2264 N48 1984"
      ], 
      "latest_revision": 11, 
      "ocaid": "neuromancer00gibs", 
      "genres": [
        "Fiction."
      ], 
      "source_records": [
        "marc:marc_records_scriblio_net/part22.dat:84028207:784", 
        "marc:CollingswoodLibraryMarcDump10-27-2008/Collingswood.out:7879172:1418", 
        "marc:marc_cca/b10621386.out:20298617:552", 
        "ia:neuromancer00gibs"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "subjects": [
        "Computer hackers -- Fiction", 
        "Business intelligence -- Fiction", 
        "Information superhighway -- Fiction", 
        "Nervous system -- Wounds and injuries -- Fiction", 
        "Conspiracies -- Fiction", 
        "Japan -- Fiction"
      ], 
      "publish_country": "nyu", 
      "by_statement": "William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 11, 
      "publishers": [
        "Ace Books"
      ], 
      "ia_box_id": [
        "IA111402"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2011-08-12T04:31:24.064755"
      }, 
      "key": "/books/OL1627167M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "New York"
      ], 
      "pagination": "271 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-04-01T03:28:50.625462"
      }, 
      "dewey_decimal_class": [
        "813/.54"
      ], 
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "22328"
        ]
      }, 
      "lccn": [
        "91174394"
      ], 
      "isbn_10": [
        "0441569595"
      ], 
      "publish_date": "1984", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "313982"
        ]
      }, 
      "subject_place": [
        "Japan"
      ], 
      "covers": [
        283491
      ], 
      "lc_classifications": [
        "PS3557.I2264 N48 1994"
      ], 
      "latest_revision": 5, 
      "edition_name": "1st Ace hardcover ed.", 
      "genres": [
        "Fiction."
      ], 
      "source_records": [
        "marc:marc_records_scriblio_net/part24.dat:178109658:845"
      ], 
      "title": "Neuromancer", 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "subjects": [
        "Computer hackers -- Fiction", 
        "Business intelligence -- Fiction", 
        "Information superhighway -- Fiction", 
        "Nervous system -- Wounds and injuries -- Fiction", 
        "Conspiracies -- Fiction", 
        "Japan -- Fiction"
      ], 
      "publish_country": "nyu", 
      "by_statement": "William Gibson.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 5, 
      "publishers": [
        "Ace Books"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-07-31T01:58:09.386680"
      }, 
      "key": "/books/OL1234381M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "New York"
      ], 
      "pagination": "278 p. ;", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-04-01T03:28:50.625462"
      }, 
      "dewey_decimal_class": [
        "813/.54"
      ], 
      "number_of_pages": 278, 
      "lccn": [
        "94237181"
      ], 
      "isbn_10": [
        "0441000681"
      ], 
      "publish_date": "1994", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "122395"
        ]
      }, 
      "latest_revision": 4, 
      "source_records": [
        "marc:talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:718603618:565"
      ], 
      "title": "Neuromancien", 
      "work_titles": [
        "Neuromancer."
      ], 
      "languages": [
        {
          "key": "/languages/fre"
        }
      ], 
      "publish_country": "fr ", 
      "by_statement": "traduit de l'ame\u0301ricain par Jean Bonnefoy.", 
      "type": {
        "key": "/type/edition"
      }, 
      "revision": 4, 
      "publishers": [
        "J'ai Lu"
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T23:31:09.118145"
      }, 
      "key": "/books/OL21795410M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ], 
      "publish_places": [
        "[Paris]"
      ], 
      "pagination": "319p.", 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-11-04T00:30:01.234536"
      }, 
      "notes": {
        "type": "/type/text", 
        "value": "Translation of: Neuromancer."
      }, 
      "number_of_pages": 319, 
      "isbn_10": [
        "2277223255"
      ], 
      "publish_date": "1985", 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ]
    }, 
    {
      "publishers": [
        "Voyager"
      ], 
      "pagination": "317 p.", 
      "identifiers": {
        "librarything": [
          "609"
        ], 
        "goodreads": [
          "953070"
        ]
      }, 
      "revision": 4, 
      "source_records": [
        "marc:marc_university_of_toronto/uoft.marc:3986224271:616"
      ], 
      "title": "Neuromancer", 
      "isbn_10": [
        "0586066454"
      ], 
      "number_of_pages": 317, 
      "created": {
        "type": "/type/datetime", 
        "value": "2008-10-30T08:07:12.492696"
      }, 
      "languages": [
        {
          "key": "/languages/eng"
        }
      ], 
      "last_modified": {
        "type": "/type/datetime", 
        "value": "2010-08-18T17:29:49.077199"
      }, 
      "latest_revision": 4, 
      "edition_name": "Pbk. ed.", 
      "key": "/books/OL20872554M", 
      "authors": [
        {
          "key": "/authors/OL26283A"
        }
      ],
      "publish_date": "2000", 
      "publish_places": [
        "London"
      ], 
      "works": [
        {
          "key": "/works/OL27258W"
        }
      ], 
      "type": {
        "key": "/type/edition"
      }, 
      "by_statement": "William Gibson.", 
      "publish_country": "enk"
    }
  ]
}
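
Stringing the two calls together, getting from an ISBN to the full set of ISBNs that OpenLibrary associates with a Work means one lookup for the Work key and then paging through the editions. Here’s a rough sketch; the path I take into the books API response (ISBN:… → details → works) is based on my reading of the JSON it returns, so treat it as illustrative rather than definitive:

import json
import urllib.request

def get_json(url):
    return json.load(urllib.request.urlopen(url))

def openlibrary_editions(isbn):
    """Return the set of ISBNs attached to the OpenLibrary Work for an ISBN."""
    # step 1: look up the edition and pull out its Work key, e.g. /works/OL27258W
    books = get_json("http://openlibrary.org/api/books?bibkeys=ISBN:%s"
                     "&jscmd=details&format=json" % isbn)
    work_key = books["ISBN:" + isbn]["details"]["works"][0]["key"]

    # step 2: page through the Work's editions, collecting any ISBNs we find
    isbns, offset = set(), 0
    while True:
        page = get_json("http://openlibrary.org%s/editions.json?limit=50&offset=%s"
                        % (work_key, offset))
        for edition in page["entries"]:
            isbns.update(edition.get("isbn_10", []))
            isbns.update(edition.get("isbn_13", []))
        offset += 50
        if offset >= page["size"]:
            break
    return isbns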

Because of some work I’ve been doing helping out at Gluejar, I became curious about the coverage of these three FRBR workset APIs. What sort of overlap is there between them? I wrote a little script, worksvenn.py, that takes one or more ISBNs as input, looks them up in the OpenLibrary, LibraryThing and OCLC APIs, and then outputs the resulting data along with a Venn diagram drawn using the Google Chart API.
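
The guts of worksvenn.py aren’t much more than set arithmetic on the three result sets, plus a chart URL. Roughly (the lookup functions are the kind of thing sketched above, and the Google Chart parameters here are from memory, so double check them before relying on this):

from urllib.parse import urlencode

def venn_url(oclc, librarything, openlibrary):
    """Build a Google Chart API URL for a three circle Venn diagram."""
    a, b, c = oclc, librarything, openlibrary
    sizes = [len(a), len(b), len(c),
             len(a & b), len(a & c), len(b & c),
             len(a & b & c)]
    params = {
        "cht": "v",                               # venn diagram chart type
        "chs": "400x200",                         # chart size in pixels
        "chd": "t:" + ",".join(str(n) for n in sizes),
        "chdl": "oclc|librarything|openlibrary",  # legend labels
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)

The “Differences” lists in the output below are just the same sets subtracted from one another (a - b, a - c, and so on).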

It’s interesting to see that each service has unique results. You can see these when you run worksvenn.py on the command line:

Workset Results:

oclc: 7542818732,2707115622,7542824139,9029042478,8085601273,0441012035,
0441569579,229000619X,3893111387,2744139157,9607002504,8071930482,
9637632050,8585887907,8485752414,8758804110,8445076620,9118721826,
8203203329,0441569587,8804516445,8422672596,8789586735,0932096417,
3893111379,1570420599,8445072897,5792101205,9755760326,569700124X,
9510172049,0441007465,0736638369,9510193062,8390021439,911300347X,
8445075950,0002252325,0441569595,0441000681,5170198892,3807703098,
0007119585,415010672X,807193318X,3453056655,8974271419,8842910686,
9029050748,3898132595,3453313895,057503470X,1616577843,0307969940,
8385784012,2277223255,0006480411,9029049367,0586066454,1570421560,
8371500432,229030820X,8842907464,0441569560,9119027818,8445070843,
8467426373,9612310203,8790136292,8982738851,3453403851,8445074059

librarything: 0441569595,2707115622,0441000681,9634970982,9118721826,
9029042478,8085601273,3453056655,0006480411,8842906808,0441569579,
229000619X,415010672X,3893111387,0441012035,9639238023,3453074203,
9510193062,9637632050,8585887907,8842910686,0441007465,3898132595,
8203203329,1569564116,8371500432,3453313895,0736638369,057503470X,
8789586735,0932096417,9752103677,8445075950,8778803438,2277223255,
8576570491,8804516445,0613922514,0586066454,1570421560,3893111379,
229030820X,807193318X,8071930482,8842913529,0441569560,9119027818,
8445070843,0007119585,9510172049,2744139157,8324577750,8790136292,
0307969940,0441569587,8842907464,1570420599,8445072897,8445074059,
0002252325

openlibrary: 8758804110,0441569595,8203203329,3453313895,057503470X,
0932096417,9029042478,2277223255,0441000681,0006480411,0441012035,
0586066454,0002252325,8445072897,0441007465,8790136292

Differences:

oclc \ librarything: 7542818732,7542824139,5170198892,569700124X,
8974271419,9607002504,8485752414,9029050748,8758804110,8445076620,
8422672596,9612310203,1616577843,8385784012,9029049367,3453403851,
5792101205,3807703098,9755760326,8467426373,8982738851,8390021439,
911300347X

oclc \ openlibrary: 7542818732,2707115622,9118721826,5170198892,
0007119585,8085601273,8445070843,3453056655,0441569579,229000619X,
415010672X,3893111387,2744139157,8467426373,8974271419,9607002504,
8071930482,9637632050,8585887907,8485752414,8371500432,9029050748,
3898132595,8445076620,7542824139,0441569587,8982738851,8804516445,
8422672596,8789586735,9612310203,1616577843,0307969940,8385784012,
8842907464,9029049367,8842910686,1570421560,3893111379,229030820X,
807193318X,911300347X,0441569560,5792101205,9119027818,3807703098,
9755760326,569700124X,9510172049,8445074059,0736638369,9510193062,
8390021439,1570420599,8445075950,3453403851

librarything \ oclc:  8842906808,9634970982,8842913529,9639238023,
9752103677,1569564116,8778803438,8576570491,8324577750,0613922514,
3453074203

librarything \ openlibrary:  2707115622,9634970982,9118721826,
8085601273,3453056655,8842906808,0441569579,229000619X,415010672X,
3893111387,2744139157,9639238023,9510193062,9637632050,807193318X,
8585887907,8842910686,3898132595,8324577750,3893111379,8804516445,
1570420599,8789586735,9752103677,8778803438,8576570491,0613922514,
1570421560,8371500432,229030820X,3453074203,8071930482,8842913529,
0441569560,9119027818,8445070843,0007119585,9510172049,1569564116,
0736638369,0307969940,0441569587,8842907464,8445075950,8445074059

openlibrary \ oclc:  

openlibrary \ librarything:  8758804110

This suggests that the workset data in these services actually reinforce each other, and a lot could be gained by sharing. For comparison here are the diagrams for a few more books:


As I mentioned earlier, you can pass worksvenn.py a list of ISBNs and it will pool them all together. At Gluejar we have a list of 53 books that are examples of potential books for ungluing, so I ran these through and came up with this diagram.

Although looking at things on a piecemeal basis can be interesting, it would be fun to see a Venn diagram generated from a larger pool of seed ISBNs. Perhaps worksvenn.py will give you some ideas. If it does please let me know!


an ode to node

When I made my first edit to Wikipedia a few years ago I can remember watching the recent changes page to see my contribution pop up. I was shocked to see just how quickly my edit was swept up in the torrent of edits that are going on all the time. I think everyone who googles for topical information is familiar with the experience of having Wikipedia articles routinely appear near the top of their search results. In hindsight it should’ve been obvious, but the level of participation in the curation of content at Wikipedia struck me as significant…and somehow different. It was wonderful to see living evidence of so many people caring to collaboratively document our world.

The Obsession

I work as a software developer in the cultural heritage sector, and often find myself building editing environments for users to collaboratively create and edit content. These systems typically get used here and there, but they in no way compare to the sheer volume of edit activity that Wikipedia sees from around the world, every single day. I guess I’d read about crowdsourcing, but had never been provided with a window into it like this before. My wife encourages her 5th grade students to think critically about Wikipedia as an information source. One way she has done this in the past was by having them author an article for their school, which didn’t have one previously. I wanted to help her and her students see how they were part of a large community of Wikipedia editors, and to give them a tactile sense of the number of people who are actively engaged in making Wikipedia better.

A few months later Georgi Kobilarov let me know about the many IRC channels where various bits of metadata about recent changes in Wikipedia are announced. Georgi told me about a bot that the BBC runs to track changes to Wikipedia, so that relevant article content can be pulled back to the BBC. I guess a light bulb turned on. Could I use these channels to show people how much Wikipedia is actively curated, without requiring them to reload the recent changes page, connect to some cryptic IRC channels, or dig around in some (wonderfully) detailed statistics? More importantly, could it be done in a playful way?

The Apps

Some more time passed and I came across some new tools (more about these below) that made it easy to throw together a Web visualization of the Wikipedia update stream. The tools proved to be so much fun that I ended up making two.

wikistream displays the edits to 38 language wikipedias as a river of flowing text. The content moves by so quickly that I had to add a pause button (the letter p) in order to test things like clicking on the update to see the change that was made. The little icons to the left indicate whether the edit was made by a registered Wikipedia user, an anonymous user, or a bot (there are lots of them). After getting some good feedback on the wikitech-l discussion list I added some knobs to limit updates to specific languages and types of user, or size of the edit. I also added a periodically updating background image based on uploads to the Wikimedia Commons.

The second visualization app is called wikipulse. Dario Taraborelli of the Wikimedia Foundation emailed me with the idea to use the same update data stream I used in wikistream to fuel a higher level view of the edit activity using the gauge widget in Google’s Chart API. To the left is one of these gauges which displays the edits per minute on 36 wikipedia properties. If you visit wikipulse you will also see individual gauges for each language wikipedia. It’s a bit overkill seeing all the gauges on the screen, but it’s also kind of fun to see them update automatically every second relative to each other, based on the live edit activity.

The Tools


For both of these apps I needed to log into the wikimedia IRC server, listen on ~30 different channels, push all the updates through some code that helped visualize the data in some way, and then get this data out to the browser. I had heard good things about node for high-concurrency network programming from several people. I ran across a node library called socket.io that promised to make it easy to stream updates from the server to the client, in a browser-independent way, using a variety of transport protocols. Instinctively it felt like the pub/sub model would also be handy for connecting up the IRC updates with the webapp. I had been wanting to play around with the pub/sub features in redis for some time, and since there is a nice redis library for node I decided to give it a try.

Like many web developers I am used to writing JavaScript for the browser. Tools like jQuery and underscore.js successfully raised the bar to the point that I’m able to write JavaScript and still look at myself in the mirror in the morning. But I was still a bit skeptical about JavaScript running on the server side. The thing I didn’t count on was how well node’s event-driven model, the library support (socket.io, redis, express), and the functional programming style fit the domain of making the Wikipedia update stream available on the Web.

For example, here’s the code to connect to the ~30 IRC chatrooms stored in the channels variable, and send all the messages to a function called processMessage:

// the irc client library, config and the channels list are set up earlier
var client = new irc({server: 'irc.wikimedia.org', nick: config.ircNick});

client.connect(function () {
  // once connected, join every recent-changes channel and route
  // each message to processMessage
  client.join(channels);
  client.on('privmsg', processMessage);
});

The processMessage function then parses the IRC message into a JavaScript dictionary and publishes it to a ‘wikipedia’ channel in redis:

function processMessage (msg) {
  // parse the raw IRC message into an object and publish it on the
  // 'wikipedia' channel in redis
  var m = parse_msg(msg.params);
  redis.publish('wikipedia', m);
}

Then over in my wikistream web application I set up socket.io so that when a browser goes to my webapp it negotiates for the best way to get updates from the server. Once a connection is established the server subscribes to the wikipedia channel and sends any updates it receives out to the browser. When the browser disconnects, the connection to redis is closed.

// attach socket.io to the web server
var io = sio.listen(app);

io.sockets.on('connection', function(socket) {
  // each browser connection gets its own redis client, subscribed
  // to the 'wikipedia' channel
  var updates = redis.createClient();
  updates.subscribe('wikipedia');
  updates.on("message", function (channel, message) {
    // relay every update straight out to the browser
    socket.send(message);
  });
  socket.on('disconnect', function() {
    updates.quit();
  });
});

Each update is represented as a JavaScript dictionary, which socket.io and node’s redis client transparently serialize and deserialize. In order to understand the socket.io protocol a bit more, I wrote a little python script that connects to wikistream.inkdroid.org, negotiates for the xhr-polling transport, and prints out the stream of JSON to the console. It’s a demonstration of how a socket.io instance like wikistream can be used as an API for creating a firehose-like service. Although I guess the example might’ve been a bit cleaner if it had negotiated for a websocket instead.

{
  'anonymous': False,
  'comment': '/* Anatomy */  changed statement that orbit was the eye to saying that the orbit was the eye socket for accuracy',
  'delta': 7,
  'flag': '',
  'namespace': 'article',
  'newPage': False,
  'page': 'Optic nerve',
  'pageUrl': 'http://en.wikipedia.org/wiki/Optic_nerve',
  'robot': False,
  'unpatrolled': False,
  'url': 'http://en.wikipedia.org/w/index.php?diff=449570600&oldid=447889877',
  'user': 'Moearly',
  'userUrl': 'http://en.wikipedia.org/wiki/User:Moearly',
  'wikipedia': '#en.wikipedia',
  'wikipediaLong': 'English Wikipedia',
  'wikipediaShort': 'en',
  'wikipediaUrl': 'http://en.wikipedia.org'
}

This felt so easy, it really made me re-evaluate everything I thought I knew about JavaScript. Plus it all became worth it when Ward Cunningham (the creator of the Wiki idea) wrote on the wiki-research list:

I’ve written this app several times using technology from text-to-speech to quartz-composer. I have to tip my hat to Ed for doing a better job than I ever did and doing it in a way that he makes look effortless. Kudos to Ed for sharing both the page and the software that produces it. You made my morning.

Ward is a personal hero of mine, so making his morning pretty much made my professional career.

I guess this is all a long way of saying what many of you probably already know…the tooling around JavaScript (and especially node) has changed so much that it really does offer a radically new programming environment that is worth checking out, especially for network programming. The event-driven model that is baked into node, and the fact that v8 runs blisteringly fast, make it possible to write apps that do a whole lot in one low-memory process. This is handy when deploying an app to an EC2 micro instance or Heroku, which is where wikipulse is running…for free.

Of course it helped that my wife and kids got a kick out of wikistream and wikipulse. I suspect that they think I’m a bit obsessed with Wikipedia, but that’s ok … because I kinda am.


day of digital archives psa

Today is Day of Digital Archives, and I had this semi-thoughtful post written up about BagIt and how it’s a brain-dead simple format to use to package up your files so that you’ll know if you still have them 5 minutes, 5 hours, 5 days, 5 years, maybe even 5 decades from now–if the notion of directories and files persists that long.

But I deleted that…you’re welcome…

I was also going to write about how, in a fit of web performance art, Mark Pilgrim recently deleted his online presence, including various extremely useful open source tools and several popular online books, only to see them re-materialize on the Web at new locations.

But I deleted most of that too…you’re welcome again!

Here’s a public service announcement instead. If you happen to use Franco Lazzarino’s Ruby BagIt Library to create bags that contain largish files (> 500MB), you might have accidentally created bad SHA1 manifests. I added a test, fixed the bug with help from Mark Matienzo and Michael Klein, and sent a pull request. It hasn’t been applied yet, so here’s hoping it will be.

At $mpow we’ve been getting terabytes of data from a social media company that has been bagging their data using this Ruby library. Many of the files are multiple gigabytes in size, gzip compressed. And many of the bags now have bad SHA1 manifests. The social media company wasn’t sure what the problem was, and told us just to ignore the SHA1 manifests. Which is easy enough to do.

It seems like no matter how simple the spec, it’s easy to create bugs. If you create bags, throw Bag-Software-Agent into your bag-info.txt…you never know who might find it useful.
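
For example, a bag-info.txt might end up looking something like this (the values are made up), which makes it a lot easier to figure out later which tool, and which version of it, wrote a suspect manifest:

Bag-Software-Agent: bagit.rb v0.1.0
Bagging-Date: 2011-10-06
Contact-Name: Jane Archivist
Payload-Oxum: 1531493827.47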


stepping backwards

Jonathan Rochkind recently wrote a good blog post about using HTML5 Microdata to help citation managers like Mendeley and Zotero discover citation metadata that is available in formats such as RIS. It’s an excellent and detailed complement to Eric Hellman’s piece on the same subject.

I contributed to the unAPI effort 5 years ago, which aimed to fix the same problem: making citation metadata available to browsers. I wrote the unAPI validator, which helped implementors confirm they were doing things right; articles were written; and we saw implementations in software such as the open source integrated library system Evergreen and the popular citation manager Zotero, which at one point looked first for unAPI metadata in pages…perhaps it still does.

As Jonathan points out, there are some issues with unAPI, such as the accessibility problems around Microformats in general, which unAPI was partly modeled on. HTML5 Microdata and RDFa weren’t around when we were working on unAPI, so I think Jonathan is right that it definitely makes sense to think about using these technologies nowadays instead of unAPI when making structured metadata available in HTML. I personally think the same thing goes for COinS, where OpenURL key/value pairs are used to express the metadata. Companies like Google, Microsoft, Yahoo and Facebook actively scrape HTML5 Microdata and RDFa, and there are vocabularies for describing books and articles. And because these technologies are deployed more widely than the small niche that libraries occupy, they fit the Web better.

But there is a fair bit of turmoil in the structured-data-on-the-Web space. Today’s F8 product announcements seemed to indicate that Facebook is deepening its use of the Open Graph Protocol, which is their rebranding of RDFa. We’re seeing the International Press Telecommunications Council standardizing rNews as an RDFa vocabulary for expressing online news metadata. And meanwhile Google, Microsoft and Yahoo are continuing to work on schema.org Microdata vocabularies. The recent Schema.org Workshop seems to anticipate significant changes in that space in the near future, particularly regarding the output of the W3C Web Schema and HTML Data task forces.

At LODLAM-DC we had a good conversation about RDFa, Microdata, Microformats and JSON publishing options for the cultural heritage sector. Perhaps I was just projecting, but it seemed like there was a fair bit of uncertainty about which to use. At the end of the day it seems like making your decisions based on things you want to enable is a good way forward. Are you trying to get your content to show up nicely on Facebook or Google–or both?

…or are you trying to do something else, like advertise some RIS citation metadata that is related to an HTML page so a citation manager can pick it up?

Even before the pixels had dried on the first version of the unAPI spec I was left with the nagging feeling that it had missed the point. I felt like we hadn’t really used the mechanics of the Web that were already there, and had sort of inadvertently succumbed to how standards development would be lampooned later by XKCD:

Specifically, I felt like we could have documented an even simpler pattern, namely using <link> or <a> elements in conjunction with the rel and type attributes. So if you have a search result that is available as RIS, why not advertise it with a simple link in your <head> element?
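
Something along these lines would do the trick; the href here is made up, and application/x-research-info-systems is the media type conventionally used for RIS:

<link rel="alternate"
      type="application/x-research-info-systems"
      href="/search?q=neuromancer&format=ris">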


My IRC conversation with Jonathan about his blog post was rolling around in my head when this Kurt Vonnegut quote went by in my Twitter stream:

It seemed oddly appropriate given the uncertainty in the structured-data-on-the-web marketplace, and some missteps with unAPI. If all we want to do is replace unAPI with something easier and more web-friendly, then why not fall back on basic functionality that has been in HTML for years?

If you want to make structured metadata available directly in HTML, sure HTML5 Microdata and RDFa are important technologies to use. But if all you want to do is link to an external metadata file I personally think the scholarly community would be better served by a simpler and less controversial approach.


the digital repository marketplace

The University of Southern California recently announced its Digital Repository (USCDR), which is a joint venture between the Shoah Foundation Institute and the University of Southern California. The site is quite an impressive brochure that describes the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating their own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go through the effort (and cost) of building out one themselves. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think they are going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill-advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out one technology for another. If the particular cataloging technique built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available–what Greg Janée calls relay-supporting archives? I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need to build sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee-based curation services to the University of California (which USC is not part of).

Incidentally the service costs of the USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed as $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem to indicate the raw storage is a bit less than Merritt, at $840.00 per TB per year. But what the preservation services are, and how the 20-year cost would be applied over a growing collection of content, is unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside the USCDR CDN. In that case the $70/TB measures up quite nicely against a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does USCDR really think it can compete in the cloud storage space?
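If you want to check my math, here is the back-of-the-envelope comparison as a tiny bit of Python (the prices are just the ones quoted above, and could well be out of date):

# rough comparison of the storage prices quoted above (all figures in US dollars)
uscdr_disk_per_tb_month = 70.0       # USCDR "Disk-based File-Server Access"
uscdr_disk_per_tb_year = uscdr_disk_per_tb_month * 12
merritt_per_tb_year = 1040.0         # from Merritt's Excel Calculator
s3_transfer_per_tb_month = 120.51    # recent Amazon S3 quote for data transferred out

print "USCDR disk storage: $%.2f per TB per year" % uscdr_disk_per_tb_year
print "Merritt:            $%.2f per TB per year" % merritt_per_tb_year
print "S3 transfer out:    $%.2f per TB per month" % s3_transfer_per_tb_month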

Based on the current pricing models, where there are no access-driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously, I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in, and not much data comes back out for access.


DbHd

Via Ivan Herman I found out about this excellent, year-old presentation from Hans Rosling at the World Bank about the importance of sharing our data.

DbHd (Database Hugging Disorder) is certainly a problem. But it strikes me that, paradoxically, it’s the love and care that people put into their datasets that makes them so valuable to share.

If nobody was hugging their data, then nobody would care about setting it free on the Web.


Spotify, Rdio and Albums of the Year

I recently started listening to streamed music a bit more on Rdio, right around the time Spotify launched in the US. I noticed that some albums I might want to listen to weren’t available for streaming on Rdio, so I got it into my head that I’d like to compare the coverage of Rdio and Spotify–but I wasn’t sure what albums to check. Earlier this week I remembered that since 2007 Alf Eaton has put together Albums of the Year (AOTY) lists, where he compiles the top albums of the year from a variety of sources. I think Alf’s tastes tend a bit toward independent music, which suits me fine because that’s the kind of stuff I listen to. So …

The AOTY HTML is eminently readable (thanks Alf), so I wrote a scraper to pull down the albums and store them as a JSON file. With this in hand I was able to use the Rdio and Spotify web services to look up each album, and record whether it was found and whether it was streamable in the United States, which I saved off as another JSON file. So, of the 7,406 unique albums on AOTY, 32% of them were available on Spotify, and 46% on Rdio (I originally reported 60% for Spotify, but struck that figure out after Alf spotted a bug in the Spotify lookup code). I put the data in a public Fusion Table if you want to look at the results. If you notice anomalies please let me know. And speaking of anomalies…

Caveat Lector!

I was kind of surprised that Teen Dream by Beach House (which was mentioned on 27 AOTY lists in 2010) wasn’t showing up as being streamable on Spotify. So I asked on Twitter and Google+ if people in the US saw it as streamable. The results were kind of surprising. People from Michigan, Illinois, Texas, New York and the District of Columbia confirmed what the Web service told me, that the album was not streamable. But then there were people in Massachusetts and California who reported it as streamable. What’s more, premium membership didn’t seem to affect the availability: the Massachusetts subscriber had a free account, the Californian had a premium account, and both could stream it. So take the numbers above with a boulder-sized grain of salt. It’s not clear what’s going on.

The Spotify search API does not require authentication, and it returns nice results that include all of the territories where the content is available. Rdio’s search API does require authentication, which apparently is used to tie your account to a geographic region, which in turn affects whether the API will say the album is streamable or not.
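In case you are curious, the Spotify half of the lookup was not much more complicated than this sketch (written from memory: it uses the unauthenticated ws.spotify.com Metadata API as it existed at the time, and the exact shape of the availability data may differ a bit from what I show here):

import json
import urllib

def spotify_streamable(album_title, country="US"):
    # search the unauthenticated Spotify Metadata API for the album title
    query = urllib.urlencode({"q": album_title})
    response = json.load(urllib.urlopen("http://ws.spotify.com/search/1/album.json?" + query))
    for album in response.get("albums", []):
        # each result carries a space delimited list of territories where it can be streamed
        territories = album.get("availability", {}).get("territories", "")
        if country in territories.split():
            return True
    return False

print spotify_streamable("Teen Dream")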

So anyway, it was interesting to play around with the APIs a bit. It didn’t seem like the service agreements for the various APIs prevented this sort of exploration. I like the fact that Rdio is web based (go Django), and doesn’t require a proprietary client to use. But it looks like the coverage in Spotify is better. I’m not sure I will make any changes. If anyone has any information about whether streamability of content can vary within the United States I would be interested to hear it. This rights stuff is hard. Given the complexity of managing the rights to this content I’m kind of amazed that Rdio and Spotify exist at all…and I’m very glad that they do.

Update: it turns out that the folks who saw Teen Dream as available had it in their personal collection (Spotify is smart like that), which is why Spotify said it was available to them. So, no crazy state-by-state rights issues need to be entertained :-)


GoodReads microdata

I’m not sure how long it has been there, but I just happened to notice that GoodReads (the popular social networking site for book lovers to share what they are reading and have read) has implemented HTML5 Microdata to make metadata for books available in the HTML of their pages. GoodReads has chosen to use the Book type from the schema.org vocabulary, most likely because the big three search engines (Google, Bing and Yahoo) announced that they will use the metadata to enhance their search results. So web publishers are motivated to publish metadata in their pages, not because it’s the “right” thing to do, but because they want to drive traffic to their websites.

If you are new to HTML5 Microdata, schema.org and what it means for books, check out Eric Hellman’s recent post Spoonfeeding Library Data to Search Engines. And if you are generally curious about HTML5 Microdata, the chapter in Mark Pilgrim’s Dive into HTML5 is really quite good.

But Microdata and schema.org are not just good for the search engines, they are actually good for the Web ecosystem, and for hackers like you and me. For example, go to the page for Alice in Wonderland:
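http://www.goodreads.com/book/show/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass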

If you view source on the page, and search for itemtype or itemprop you’ll see the extra Microdata markup. The latest HTML5 specification has a section on how to serialize microdata as JSON, and the processing model is straightforward enough for me to write a parser on top of html5lib in less than 200 lines of Python. So this means you can:

import urllib
import microdata  # the ~200 line Microdata parser mentioned above, built on html5lib

# fetch the GoodReads page and extract the Microdata items embedded in it
url = "http://www.goodreads.com/book/show/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass"
items = microdata.get_items(urllib.urlopen(url))

# the first item is the schema.org/Book; serialize it as JSON
print items[0].json()

and you’ll see:

{
  "numberOfPages": [
    "400"
  ],
  "isbn": [
    "9780141439761"
  ],
  "name": [
    "Alice's Adventures in Wonderland and Through the Looking-Glass"
  ],
  "author": [
    {
      "url": [
        "http://www.goodreads.com/author/show/8164.Lewis_Carroll",
        "http://www.goodreads.com/author/show/495248.Hugh_Haughton",
        "http://www.goodreads.com/author/show/180439.John_Tenniel"
      ],
      "type": "http://schema.org/Person"
    }
  ],
  "image": [
    "/book/photo/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass"
  ],
  "inLanguage": [
    "English"
  ],
  "ratingValue": [
    "4.03"
  ],
  "ratingCount": [
    "64,628 ratings"
  ],
  "bookFormatType": [
    "Paperback"
  ],
  "type": "http://schema.org/Book"
}

If you have spent a bit of time writing screenscrapers in the past, this should make your jaw drop a bit. What’s more, they’ve also added Microdata to the search results page, so you can see metadata for all the books in the results, for example using Google’s Rich Snippets Testing Tool.

Funnily enough, while I was writing this blog post, over in the #code4lib IRC chat room Chris Beer brought up the fact that some Blacklight developers were concerned that <link rel="unapi-server"> wasn’t valid HTML5. Chris was wondering if anyone was interested in trying to register “unapi-server” with the WHATWG…

    &crickets;

Issues of “valid” HTML5 aside, this discussion highlighted for me just how far the world of metadata on the Web has advanced since a small group of library hackers worked on unAPI. The use of HTML5 Microdata and schema.org by Google, Bing and Yahoo, and the use of RDFa by Facebook are great examples of some mainstream solutions to what some of us were trying to achieve with unAPI. Seeing sites like GoodReads implement Microdata, and announcements like Opera support for Microdata are good reminders that the library software development community is best served by paying attention to mainstream solutions, as they become available, even if they eclipse homegrown stopgap solutions.

It is somewhat problematic that Facebook has aligned with RDFa and the Open Graph Protocol, and Google, Bing and Yahoo have aligned with HTML5 Microdata and schema.org. GoodReads has implemented both, incidentally. I heard a rumor that Facebook was invited to the schema.org table and declined. I have no idea if that is actually true. I have also heard a rumor that Ian Hickson of Google wrote up the Microdata spec in a weekend because he hates RDFa. I don’t know if that’s actually true either. Company and personality rivalries aside, if you are having trouble deciding which one to more fully support, try writing a program to parse RDFa and Microdata. It will probably help clarify some things…


meeting notes and a manifesto

A few weeks ago I was in sunny (mostly) Palo Alto, California at a Linked Data Workshop hosted by Stanford University, and funded by the Council on Library and Information Resources. It was an invite-only event, with very little footprint on the public Web, and an attendee list that was distributed via email with “Confidential PLEASE do not re-distribute” across the top. So it feels more than a little bit foolhardy to be writing about it here, even in a limited way. But I was going to the meeting as an employee of the Library of Congress, so I feel a bit responsible for making some kind of public notes/comments about what happened. I suspect this might impact future invitations to similar events, but I can live with that :-)

A week is a long time to talk about anything…and Linked Data is certainly no exception. The conversation was buoyed by the fact that it was a pretty small group of around 30 folks from a wide variety of institutions including: Stanford University, Bibliotheca Alexandrina, Bibliothèque nationale de France, Aalto University, Emory University, Google, National Institute of Informatics, University of Virginia, University of Michigan, Deutsche Nationalbibliothek, British Library, Det Kongelige Bibliotek, California Digital Library and Seme4. So the focus was definitely on cultural heritage institutions, and specifically what Linked Data can offer them. There was a good mix of people in attendance, ranging from some who were relatively new to Linked Data and RDF, to others who had been involved in the space before the terms Linked Data and RDF, or the Web itself, existed…and there were people like me who were somewhere in between.

A few of us took collaborative notes in PiratePad, which sort of tapered off as the week progressed and more discussion happened. After some initial lightning-talk-style presentations from attendees on Linked Data projects they were involved in, we spent most of the rest of the week breaking up into 4 groups to discuss various issues, and then joining back up with the larger group for discussion. If things go as planned you can expect to see a report from Stanford that covers the opportunities and challenges that Linked Data offers the cultural heritage sector, which were raised during these sessions. I think it’ll be a nice complement to the report that the W3C Library Linked Data Incubator Group is preparing, which recently became available as a draft.

One thing that has stuck with me a few weeks later is the continued need in the cultural heritage Linked Data sector for reconciliation services that help people connect up their resources with appropriate resources that other folks have published. If you work for a large organization, there is often even a need for reconciliation services within the enterprise. For example the British Library reported that it has some 300 distinct data systems within the organization that sometimes need to be connected together. Linking is the essential ingredient, the sine qua non of Linked Data. Linking is what makes Linked Data and the RDF data model different. It helps you express the work you may have done in joining up your data with other people’s data. It’s the 4th design pattern in Tim Berners-Lee’s Linked Data Design Issues:

Include links to other URIs. so that they can discover more things.

But, expressing the links is the easy part…creating them where they do not currently exist is harder.

Fortunately, Hugh Glaser was on hand to talk about the role that sameas.org plays in the Linked Data space, and how RKBexplorer managed to reconcile author names across institutional repositories. He has also described some work with the British Museum linking up two different data silos about museum objects, to provide a unified web view for those objects. Hugh, if you are reading this, can you comment with a link to this work you did, and how it surfaces in the British Museum website?

Similarly Stefano Mazzocchi talked about how Google’s tools like Freebase Suggest and their WebID app can make it easy to integrate Freebase’s identity system into your applications. If you are building a cataloging tool, take a serious look at what using something like Freebase Suggest (a jQuery plugin) can offer your application. In addition, as part of the Google Refine data cleanup tool, Google has created an API for data reconciliation services, which other service providers could supply. Stefano indicated that Google was considering releasing the code behind this reconciliation service, and stressed that it is useful for the community to make more of these reconciliation services available, to help others link their data with other people’s data. It seems obvious I guess, but I was interested to hear that Google themselves are encouraging the use of Freebase IDs to join up data within their enterprise.
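To make that a bit more concrete, here is a minimal sketch of what querying a Refine-style reconciliation service could look like from Python. The endpoint URL is made up; the request and response follow the general shape of the Refine reconciliation API (a batch of keyed queries in, a list of candidate matches per key out):

import json
import urllib

# a hypothetical endpoint that implements the Refine reconciliation API
RECONCILE_URL = "http://example.org/reconcile"

def reconcile(name, limit=3):
    # the API accepts a batch of keyed queries as a JSON string...
    queries = json.dumps({"q0": {"query": name, "limit": limit}})
    response = urllib.urlopen(RECONCILE_URL, urllib.urlencode({"queries": queries}))
    # ...and returns candidate matches for each key, each with an id, name and score
    return json.load(response)["q0"]["result"]

for candidate in reconcile("Lewis Carroll"):
    print candidate["id"], candidate["name"], candidate["score"]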

Almost a year ago Leigh Dodds created a similar API layer for data that is stored in the Talis Platform. Now that the British National Bibliography is being made available in a Talis store, it might be possible to use Leigh’s code to put a reconciliation service on top of that data. The caveat being that not all of the BNB is currently available there. By the way, hats off to the British Library for iteratively making that data available, and getting feedback early, instead of waiting for it all to be “done”…which of course it never will be, if they are successful at integrating Linked Data into their existing data workflows.

If you squint right, I think it’s also possible to look at the VIAF AutoSuggest service as a type of reconciliation service. It would be useful to have a similar service over the Library of Congress Subject Headings at id.loc.gov. Having similar APIs for these services could be a handy thing as we begin to build new descriptive cataloging tools that take advantage of these pools of data. But I don’t think it’s necessarily essential, as the APIs could be orchestrated in a more ad hoc, web2.0 mashup style. I imagine I’m not alone in thinking we’re now at the stage where we can start building new cataloging tools that take advantage of these data services. Along those lines Rachel Frick had an excellent idea to try to enhance collection-building applications like Omeka and ArchivesSpace to take advantage of reconciliation services under the covers. Adding a bit of suggest-like functionality to these tools could smooth out the description of resources that libraries, museums and archives are putting online. I think the Omeka-LCSH plugin is a good example of steps in this direction.

One other thing that stuck with me from the workshop is that the new (dare I say buzzwordy) focus on Library Linked Data is somewhat frightening to library data professionals. There is a lot of new terminology, and issues to work out (as the Stanford report will hopefully highlight). Some of this scariness is bound up with the Resource Description and Access sea change that is underway. But crufty as they are, data systems built around MARC have served the library community well over the past 30 years. Some of the motivations for Linked Data are specifically for Linked Open Data, where the linking isn’t as important as the openness. The LOD-LAM summit captured some of this spirit in the 4 Star Classification Scheme for Linked Open Cultural Metadata, which focuses on licensing issues. There was a strong undercurrent at the Stanford meeting about licensing issues. The group recognized that explicit licensing is important, but it was intentionally kept out of scope of most of the discussion. Still I think you can expect to see some of the heavy hitters from this group exert some influence in this arena to help bring clarity to licensing issues around our data. I think that some of the ideas of opening up the data, and disrupting existing business workflows around the data can seem quite scary to those who have (against a lot of odds) gotten them working. I’m thinking of the various cooperative cataloging efforts that allow work to get done in libraries today.

Truth be told, I may have inspired some of the “fear” around Linked Data by suggesting that the Stanford group work on a manifesto to rally around, much like what the Agile Manifesto did for the Agile software development movement. I don’t think we had come to enough consensus to really get a manifesto together, but on the last day the sub-group I was in came up with a straw man (near the bottom of the piratepad notes) to throw darts at. Later on (on my own) I kind of wordsmithed them into a briefer list. I’ll conclude this blog post by including the “manifesto” here not as some official manifesto of the workshop (it certainly is not), but more as a personal manifesto, that I’d like to think has been guiding some of the work I have been involved in at the Library of Congress over the past few years:

Manifesto for Linked Libraries

We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:

  • Publishing data on the Web for discovery over preserving it in dark archives.
  • Continuous improvement of data over waiting to publish perfect data.
  • Semantically structured data over flat unstructured data.
  • Collaboration over working alone.
  • Web standards over domain-specific standards.
  • Use of open, commonly understood licenses over closed, local licenses.

That is, while there is value in the items on the right, we value the items on the left more.

The manifesto is also in a publicly editable Google Doc; so if you feel the need to add or comment please have a go. I was looking for an alternative to “Linked Libraries” since it isn’t inclusive of archives and museums … but I didn’t spend much time obsessing over it. One of the selfish motivations for publishing the manifesto here was to capture it at a particular point in time when I was happy with it :-)


the dpla as a generative platform

Last week I had the opportunity to attend a meeting of the Digital Public Library of America (DPLA) in Amsterdam. Several people have asked me why an American project was meeting in Amsterdam. In large part it was an opportunity for the DPLA to reach out to, and learn from European projects such as Europeana, LOD2 and Wikimedia Germany–or as the agenda describes:

The purpose of the May 16 and 17 expert working group meeting, convened with generous support from the Open Society Foundations, is to begin to identify the characteristics of a technical infrastructure for the proposed DPLA. This infrastructure ideally will be interoperable with international efforts underway, support global learning, and act as a generative platform for undefined future uses. This workshop will examine interoperability of discovery, use, and deep research in existing global digital library infrastructure to ensure that the DPLA adopts best practices in these areas. It will also serve to share information and foster exchange among peers, to look for opportunities for closer collaboration or harmonization of existing efforts, and to surface topics for deeper inquiry as we examine the role linked data might play in a DPLA.

Prior to the meeting I read the DPLA Concept Note, and watched the discussion list and wiki activity – but the DPLA still seemed somewhat hard for me to grasp. The thing I learned at the meeting in Amsterdam is that this nebulousness is by design–not by accident. The DPLA steering committee aren’t really pushing a particular solution that they have in mind. In fact, there doesn’t seem to be a clear consensus about what problem they are trying to solve. Instead the steering committee seem to be making a concerted effort to keep an open, beginner’s mind about what a Digital Public Library of America might be. They are trying to create conversations around several broad topic areas or work-streams: content and scope, financial/business models, governance, legal issues, and technical aspects. The recent meeting in Amsterdam focused on the technical aspects work-stream–in particular, best practices for data interoperability on the Web. The thought being that perhaps the DPLA could exist in some kind of distributed relationship with existing digital library efforts in the United States–and possibly abroad. Keeping an open mind in situations like this takes quite a bit of effort. There is often an irresistible urge to jump to particular use cases, scenarios or technical solutions, for fear of seeming ill-informed or rudderless. I think the DPLA should be commended for creating conversations at this formative stage, instead of solutions in search of a problem.

I hadn’t noticed the phrase “generative platform” in the meeting announcement above until I began this blog post…but in hindsight it seems very evocative of the potential of the DPLA. At their best, digital libraries currently put content on the Web, so that researchers can discover it via search engines like Google, Bing, Baidu, etc. Once researchers have landed in the digital library webapp they can wander outwards to related resources, and perhaps do a more nuanced search within the scoped context of the collection. But in practice this doesn’t happen all that often. I imagine many institutions digitize content that actually never makes it onto the Web at all. And when it does make it onto the Web it is often deep-web content hiding behind a web form, un-discoverable by crawlers. Or worse, the content might be actively made invisible by using a robots.txt to prevent search engines from crawling it. Sadly this is often done for performance reasons, not out of any real desire to keep the content from being found–because all too often library webapps are not designed to support crawling.

I was invited to talk very briefly (10 minutes) about Linked Data at the Amsterdam meeting. I think most everyone recognizes that a successful DPLA would exist in a space where there have already been years of digitization efforts in the US, with big projects like the HathiTrust and countless others going on. I wanted to talk about how the Web could be used to integrate these collections. Rather than digging into a particular Linked Data solution to the problem of synchronization, I thought I would try to highlight how libraries could learn to do web1.0 a bit better. In particular I wanted to showcase how Google Scholar abandoned OAI-PMH (a traditional library standard for integrating collections) in favor of using sitemaps and metadata embedded in HTML. I wanted to show how thoughtful use of sitemaps, a sensible robots.txt, and perhaps some Atom to publish updates and deletes a bit more methodically, can offer much the same functionality as OAI-PMH, but in a way that is aligned with the Web and the services that are built on top of it. Digital library initiatives often go off and create their own specialized way of looking at the Web, and ignore broader trends. The nature of grant funding and research papers often serves as an incentive for this behavior. I’ve heard rumors that there is even some NISO working group being formed to look into standardizing some sort of feed-based approach to metadata harvesting. Personally I think it’s probably more important for us to use some of the standards and patterns that are already available instead of trying to define another one.
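To give a flavor of what I mean, the crawler-friendly version of a repository really can be this boring (the URLs here are placeholders): a robots.txt that welcomes crawlers and points at a sitemap, and a sitemap that lists each item along with when it last changed.

robots.txt:

User-agent: *
Disallow:
Sitemap: http://example.edu/sitemap.xml

sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.edu/items/1234</loc>
    <lastmod>2011-06-01</lastmod>
  </url>
</urlset>

Add an Atom feed of recent additions, changes and deletes and you have most of what OAI-PMH harvesting provides, built out of pieces that crawlers and feed readers already understand.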

So you could say I pulled a bit of a bait and switch: instead of talking about Linked Data I really ended up talking about Search Engine Optimization. I didn’t mention RDF or SPARQL once. If anyone noticed they didn’t seem to mind too much.

I learned a lot of very useful information during the presentations–too much to really note here. But there was one conversation that really stands out after a week has passed.

Greg Crane of the Perseus Digital Library spoke about Deep Research, and how students and researchers participate in the creation of online knowledge. At one point Greg had a slide that contained a map of the entire world, and spoke about how the scope of the DPLA can’t really be confined to the United States alone–since American society is largely made up of immigrant communities (some by choice, some not), the scope of the DPLA is in fact the entire world. I couldn’t help but think how completely audacious it was to say that the Digital Public Library of America would somehow encompass the world–similar to how brick-and-mortar library and museum collections can often mirror the imperialistic interests of the states that they belong to.


Original WWW graphic by Robert Cailliau

So I was relieved when Stefan Gradmann asked how Greg thought the DPLA would fit in with projects like Europeana, which are already aggregating content from Europe. I can’t exactly recall Greg’s response (update: Greg filled in some of the blanks via email), but this prompted Dan Brickley to point out that in fact it’s pretty hard to draw lines around Europe too … and more importantly that the Web is a space that can unite these projects. At this point Josh Greenberg jumped in and suggested that perhaps some thoughtful linking between sites like a DPLA and Europeana could help bring them together. This all probably happened in the span of 3 or 4 minutes, but the exchange really crystallized for me that the cultural heritage community could do a whole lot better at deep linking with each other. Josh’s suggestion is particularly good, because researchers could see and act on contextual links. It wouldn’t be something hidden in a data layer that nobody ever looks at. But to do this sort of linking right we would need to share our data better with each other, and it would most likely need to be Linked Data–machine-readable data with URLs at its core. I guess it’s a no-brainer that for it to succeed the DPLA needs to be aligned with the ultimate generative platform of our era: the World Wide Web. Name things with URLs, and create typed links between them and other people’s stuff.

Another thing that struck me was how Europeana really gets the linking part. Europeana is essentially a portal site, or directory of digital objects for more than 15 million items provided by hundreds of providers across Europe. You can find these objects in Europeana, but if you drill down far enough you eventually find yourself on the original site that made the object available. I agree with folks that think that perhaps the user experience of the site would be improved if the user never left Europeana to view the digital object in full. This would necessarily require harvesting a richer version of the digital object, which would be more difficult, but not impossible. There would also be an opportunity to serve as a second copy for the object, which is potentially very desirable to originating institutions for preservation purposes…lots of copies keeps stuff safe.

But even in this hypothetical scenario where the object is available in full on Europeana, I think it would still be important to link out to the originating institution that digitized the object. Linking makes the provenance of the item explicit, which will continue to be important to researchers on the Web. But perhaps more importantly it gives institutions a reason to participate in the project as a whole. Participants will see increased awareness and use of their web properties, as users wander over from Europeana. Perhaps they could even link back to Europeana, which ought to increase Europeana’s density in the graph of the web, which also should boost its relevancy ranking in search engines like Google.

Another good lesson of Europeana is that it’s not just about libraries, but also includes many archives, museums and galleries. One of my personal disappointments about the Amsterdam meeting was that Mathias Schindler of Wikimedia Germany had to pull out at the last minute. I’ve never met him, but Mathias has had a lot to do with trying to bring the Wikipedia and library communities together. Efforts to promote the use of Wikipedia as a platform in the Galleries, Libraries, Archives and Museums (GLAM) sector are intensifying. The pivotal role that Wikipedia has had in the Linked Data community, in the form of dbpedia.org, is also very significant. Earlier this year there was a meeting of various Wikipedia, dbpedia and Freebase folks at a Data Summit, where people talked about the potential for an inner hub for the various language Wikipedias to share inter-wiki links and extracted structured metadata. I haven’t heard whether this is actually leading anywhere currently, but at the very least it’s a recognition that Wikipedia is itself turning into a key part of the information infrastructure of the Web.

So I’ve rambled on a bit at this point. Thanks for reading this far. My take-away from the Amsterdam meeting was that the DPLA needs to think about how it wants to align itself with the Web, and work with its grain … not against it. This is easier said than done. The DPLA needs to think about incentives that would give existing digital library projects practical reasons to want to be involved. This also is easier said than done. And hopefully these incentives won’t just involve getting grant money. Keeping an open mind, taking a REST here and there, and continuing to have these very useful conversations (and contests) should help shape the DPLA as a generative platform.