a look at who makes the LCNAF

As a follow up to my last post about visualizing Library of Congress Name Authority File (LCNAF) records created by year, I decided to dig a little bit deeper to see how easy it would be to visualize how participating Name Authority Cooperative institutions have contributed to the LCNAF over time. This idea was mostly born out of spending the latter part of last week participating in a conversation about the need for a National Archival Authority Cooperative (NAAC) hosted at NARA. This blog post is one part nerdy technical notes on how I worked with the LCNAF Linked Data, and one part line charts showing who creates and modifies LCNAF records. It might’ve made more sense to start with the pretty charts and then show you how I did it…but if the tech details don’t interest you, you can jump to the second half.

The Work

After a very helpful Twitter conversation with Kevin Ford I discovered that the Linked Data MADSRDF representation of the LCNAF includes assertions about the institution responsible for creating or revising a record. Here’s a snippet of Turtle RDF that describes who created and modified the LCNAF record for J. K. Rowling (if your eyes glaze over when you see RDF, don’t worry, keep reading; it’s not essential that you understand this):

@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
@prefix ri: <http://id.loc.gov/ontologies/RecordInfo#> .

<http://id.loc.gov/authorities/names/n97108433>
    madsrdf:adminMetadata [
        ri:recordChangeDate "1997-10-28T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ],
    [
        ri:recordChangeDate "2011-08-25T06:29:06"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ] .

So I picked up an EC2 m1.large spot instance (7.5G of RAM, 2 virtual cores, 850G of storage) for a miserly $0.026/hour, installed 4store (which is a triplestore I’d heard good things about), and loaded the data.

% wget http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
% gunzip authoritiesnames.nt.madsrdf.gz
% sudo apt-get install 4store
% sudo mkdir /mnt/4store
% sudo chown fourstore:fourstore /mnt/4store
% sudo ln -s /mnt/4store /var/lib/4store
% sudo -u fourstore 4s-backend-setup lcnaf --segments 4
% sudo -u fourstore 4s-backend lcnaf
% sudo -u fourstore 4s-import --verbose lcnaf authoritiesnames.nt.madsrdf

I used 4 segments as a best guess to match the 4 EC2 compute units available to an m1.large. The only trouble was that after loading 90M of the 226M assertions the import began to slow to a crawl as memory was nearly used up.

I thought briefly about upgrading to a larger instance…but it occurred to me that I actually didn’t need all the triples. I just needed the ones related to the record changes and the organizations that made them. So I filtered out just the assertions I needed. By the way, this is a really nice property of the ntriples data format, which is very easy to munge with line oriented Unix utilities and scripting tools:

zcat authoritiesnames.nt.madsrdf.gz | egrep '(recordChangeDate)|(recordContentSource)|(recordStatus)'  > updates.nt

This left me with 50,313,810 triples which loaded in about 20 minutes! With the database populated I was then able to execute the following query to fetch all the create dates with their institution code using 4s-query:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

This returned a tab delimited file that looked something like:

"1991-08-16T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/dlc>
"1995-01-07T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/djbf>
"2004-03-04T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/nic>

I then wrote a simplistic python program to read in the TSV file and output a table of data where each row represented a year and the columns were the institution codes.
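
In case it helps anyone reproduce this, here is a minimal sketch of what such a script might look like; the real stats.py may differ, and the new.tsv and years.csv filenames are just assumptions:

  import csv
  import re
  from collections import defaultdict

  # tally[year][org] = number of records created by that organization in that year
  tally = defaultdict(lambda: defaultdict(int))
  orgs = set()

  with open("new.tsv") as tsv:
      for row in csv.reader(tsv, delimiter="\t"):
          if len(row) < 2:
              continue
          year = re.search(r'"(\d{4})-', row[0])
          org = re.search(r'organizations/(\w+)>', row[1])
          if year and org:
              tally[year.group(1)][org.group(1)] += 1
              orgs.add(org.group(1))

  # write one row per year, with one column per institution code
  with open("years.csv", "w") as out:
      writer = csv.writer(out)
      cols = sorted(orgs)
      writer.writerow(["year"] + cols)
      for year in sorted(tally):
          writer.writerow([year] + [tally[year][c] for c in cols])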

The Result

If you’d like to see the table you can check it out as a Google Fusion Table, and you should be able to easily pull the data out into your own table, modify it, and visualize it. Google Fusion Tables can be rendered in a variety of ways, including a line graph, which I’ve embedded here, displaying just the top 25 contributors:

While I didn’t quite expect to see LC tapering off the way it is, I did expect it to dominate the graph. Removing LC from the mix makes the graph a little bit more interesting. For example you can see the steady climb of the British Library, and the strong role that Princeton University plays:

Out of curiosity I then executed a SPARQL query for record updates (or revisions), repeated the step with stats.py, uploaded to Google Fusion Tables, and removed LC to better see trends in who is updating records:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

I definitely never understood what Twin Peaks was about, and I similarly don’t really know what the twin peaks in this graph signify (2000 and 2008). I guess these were years where there were a lot of coordinated edits? Perhaps some NACO folks who have been around for a few years may know the answer. You can also see in this graph that Princeton University plays a strong role in updating records as well as creating them.

So I’m not sure I understand the how/when/why of an NAAC any better, but I did learn:

  • EC2 is a big win for quick data munging projects like this. I spent $0.98 with the instance up and running for 3 days.
  • Filtering ntriples files down to what you actually need prior to loading them into a triplestore can save time and money.
  • Working with ntriples is still pretty esoteric, and the options out there for processing a dump of ntriples (or rdf/xml) of LCNAF’s size are truly slim. If I’m wrong about this I would like to be corrected.
  • Google Fusion Tables are a nice way to share data and charts.
  • It seems like while more LCNAF records are being created per year, they are being created by a broader base of institutions instead of just LC (whose record creation appears to be in decline). I think this is a good sign for NAAC.
  • Open Data and Open Data Curators (thanks Kevin) are essential to open, collaborative enterprises.

Now I could’ve made some hideous mistakes here, so in the unlikely event you have the time and inclination, I would be interested to hear whether you can reproduce these results. If they confirm or contradict other views of LCNAF participation I would be interested to see that too.


lcnaf unix hack

I was in a meeting today listening to a presentation about the Library of Congress Name Authority File and I got it into my head to see if I could quickly graph record creation by year. Part of this might’ve been prompted by sitting next to Kevin Ford, who appeared to be multi-tasking by loading some MARC data into id.loc.gov. I imagine this isn’t perfect, but I thought it was a fun little hack that demonstrates what you can get away with on the command line with some open data:

  curl http://id.loc.gov/static/data/authoritiesnames.nt.skos.gz \
    | zcat - \
    | perl -ne '/terms\/created> "(\d{4})-\d{2}-\d{2}/; print "$1\n" if $1;' \
    | sort \
    | uniq -c \
    | perl -ne 'chomp; @cols = split / +/; print "$cols[2]\t$cols[1]\n";' \
    > lcnaf-years.tsv

This yields a tab delimited file where column 1 is the year and column 2 is the number of records created in that year. The key part is the perl one-liner on line 3, which looks for assertions like this in the ntriples RDF and pulls out the year:

<http://id.loc.gov/authorities/names/n90608287> <http://purl.org/dc/terms/created> "1990-02-05T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

The use of sort and uniq -c together is a handy trick my old boss Fred Lindberg taught me, for quickly generating aggregate counts from a stream of values. It works surprisingly well with quite large sets of values, because of all the work that has gone into making sort efficient.
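
If shell pipelines aren’t your thing, the same aggregation is easy to do in Python; here is a rough equivalent of the perl/sort/uniq steps, reading the ntriples on stdin:

  import re
  import sys
  from collections import Counter

  # count LCNAF records per creation year, reading ntriples on stdin
  years = Counter()
  for line in sys.stdin:
      m = re.search(r'terms/created> "(\d{4})-\d{2}-\d{2}', line)
      if m:
          years[m.group(1)] += 1

  for year in sorted(years):
      print("%s\t%s" % (year, years[year]))

You could run it with something like zcat authoritiesnames.nt.skos.gz | python years.py > lcnaf-years.tsv (years.py being whatever you name the script).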

With the tsv in hand I trimmed the pre-1980 values, since I think there are lots of records attributed to 1980 because that’s when the OPAC came online, and I wasn’t sure what the dribs and drabs prior to 1980 represented. Then I dropped the data into ye olde chart maker (in this case Google Docs) and voilà:

It would be more interesting to see the results broken out by contributing NACO institution, but I don’t think that data is in the various RDF representations. I don’t even know if the records contributed by other NACO institutions are included in the LCNAF. I imagine a similar graph is available somewhere else, but it was neat that the availability of the LCNAF data meant I could get a rough answer to this passing question fairly quickly.

The numbers add up to ~7.8 million, which seems within the realm of possible correctness. But if you notice something profoundly wrong with this display please let me know!


data dumps

As usual, the following comments are the reflections of a software developer working at the Library of Congress and are not an official statement of my employer.

One of the challenges that we’ve had at the National Digital Newspaper Program’s website Chronicling America has been access to data. At the surface level Chronicling America is a conventional web application that provides access to millions of pages of historic newspapers. Here “access” means a researcher’s ability to browse to each newspaper, issue and page, as well as search across the OCR text for each page.

Digging a bit deeper, “access” also means programmatic access via a Web API. Chronicling America’s API enables custom software to issue queries using the popular OpenSearch protocol, and it also makes URL addressable data available using principles of Linked Data. In addition the website also makes the so called “batch” data that each NDNP awardee sends to the Library of Congress available on the Web. The advantage of making the batch data available is that 3rd parties are then able to build their own custom search indexes on top of the data, so their own products and services don’t have a runtime dependency on our Web API. Also researchers can choose to index things differently, perform text mining operations, or conduct other experiments. Each batch contains JPEG 2000, PDF, OCR XML and METS XML data for all the newspaper content; it is in fact the very same data that the Chronicling America web application ingests. The batch data views make it possible for interested parties to crawl the content using wget or some similar tool that talks HTTP, and fetch a lot of newspaper data.

But partly because of NDNP’s participation in the NEH’s Digging Into Data program, as well as interest from other individuals and organizations, we’ve recently started making data dumps of the OCR content available. This same OCR data is available as part of the batch data mentioned above, but the dumps provide two new things:

  1. The ability to download a small set of large compressed files with checksums to verify their transfer, as opposed to having to issue HTTP GETs for millions of uncompressed files with no verification.
  2. The ability to easily map each of the OCR files to their corresponding URL on the web. While it is theoretically possible to extract the right bits from the METS XML in the batch data, the best expression of how to do this is encapsulated in the Chronicling America ingest code, and is non-trivial.

So when you download, decompress and untar one of the files you will end up with a directory structure like this:

sn86063381/
|-- 1908
|   |-- 01
|   |   |-- 01
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml
|   |   |-- 02
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml

...

The pattern here is:

{lccn}/{year}/{month}/{day}/{edition}/{sequence}/

If you don’t work in a library, an lccn is a Library of Congress Control Number, a unique ID for each newspaper title. Each archive file is laid out the same way, so as you process each .tar.bz2 file you will end up building a complete snapshot of the OCR data on your filesystem. The pattern maps pretty easily to URLs of the format:

http://chroniclingamerica.loc.gov/lccn/{lccn}/{year}-{month}-{day}/{edition}/{sequence}/

This is an obvious use case for a pattern like PairTree, but there was some perceived elegance to using paths that were a bit more human readable, and easier on the filesystem, which stands a good chance of not being ZFS.
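
Since the directory layout mirrors the URL structure, the mapping is easy to script. Here is a minimal sketch in Python; the function name is mine, not part of any NDNP tooling:

  import re

  def ocr_path_to_url(path):
      # e.g. "sn86063381/1908/01/01/ed-1/seq-1/ocr.txt"
      m = re.match(r"([^/]+)/(\d{4})/(\d{2})/(\d{2})/(ed-\d+)/(seq-\d+)/", path)
      if not m:
          return None
      lccn, year, month, day, edition, sequence = m.groups()
      return "http://chroniclingamerica.loc.gov/lccn/%s/%s-%s-%s/%s/%s/" % (
          lccn, year, month, day, edition, sequence)

  print(ocr_path_to_url("sn86063381/1908/01/01/ed-1/seq-1/ocr.txt"))
  # http://chroniclingamerica.loc.gov/lccn/sn86063381/1908-01-01/ed-1/seq-1/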

Another side effect of having a discrete set of files to download is that each dump file can be referenced in an Atom feed, so that you can keep your snapshot up to date with a little bit of automation. Here’s a snippet of the feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Chronicling America OCR Data Feed</title>
    <link rel="self" href="..."/>
    <id>info:lc/ndnp/ocr</id>
    <author>
        <name>Library of Congress</name>
        <uri>http://loc.gov</uri>
    </author>
    <updated>2012-09-20T10:34:02-04:00</updated>

    <entry>
        <title>part-000292.tar.bz2</title>
        <link rel="enclosure" length="..." hash="sha1:..."
              href="http://chroniclingamerica.loc.gov/data/ocr/part-000292.tar.bz2"/>
        <id>info:lc/ndnp/dump/ocr/part-000292.tar.bz2</id>
        <updated>2012-09-20T10:34:02-04:00</updated>
        <summary>OCR dump file part-000292.tar.bz2 with size 620.1 MB generated Sept. 20, 2012, 10:34 a.m.</summary>
    </entry>
    ...
</feed>

As you can see it’s a pretty vanilla Atom feed that should play nicely with whatever feed reader or library you are using. You may notice the <link> element has some attributes that you might not be used to seeing. The enclosure and length attributes are directly from RFC 4287 for giving clients an idea that the referenced resource might be on the large side. The hash attribute is a generally useful attribute from James Snell’s Atom Link Extensions IETF draft.

If parsing XML is against your religion, there’s also a JSON flavored feed that looks like:

{
  "ocr": [
    {
      "url": "http://chroniclingamerica.loc.gov/data/ocr/part-000337.tar.bz2",
      "sha1": "fd73d8e1df33015e06739c897bd9c08a48294f82",
      "size": 283454353,
      "name": "part-000337.tar.bz2",
      "created": "2012-09-21T06:56:35-04:00"
    },
    ...
  ]
}
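
With stable names and checksums in the feed, keeping a local mirror up to date takes only a few lines. Here is a rough Python 2 sketch, assuming you have already fetched the JSON feed and saved it as ocr.json; that filename, and hashing whole files in memory, are simplifications:

  import hashlib
  import json
  import os
  import urllib

  # assumes the JSON feed shown above has been fetched and saved as ocr.json
  feed = json.load(open("ocr.json"))

  for part in feed["ocr"]:
      name = part["name"]
      if not os.path.exists(name):
          print("downloading %s" % name)
          urllib.urlretrieve(part["url"], name)
      # verify the transfer against the sha1 published in the feed
      sha1 = hashlib.sha1(open(name, "rb").read()).hexdigest()
      if sha1 != part["sha1"]:
          print("checksum mismatch for %s" % name)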

Again, I guess we could’ve kicked the tires on the emerging ResourceSync specification to similar effect. But ResourceSync is definitely still in development, and, well, Atom is a pretty nice Internet standard for publishing changes. Syndication technologies like RSS and Atom have already been used by folks like Wikipedia for publishing the availability of data dumps. ResourceSync seems intent on using Zip for compressing dump files, and bzip2 is common enough, and enough better than zip, that it’s worth diverging. In some ways this blog post has turned into a when-to-eschew-digital-library-standards post in favor of more mainstream or straightforward patterns. I didn’t actually plan that, but those of you who know me are probably not surprised.

If you plan to use the OCR dumps I, and others on the NDNP team, would love to hear from you. One of the big problems with them so far is that there is no explicit statement that the data is in the public domain, which it is. I’m hopeful this can be rectified soon. If you have feedback on the use of Atom here I would be interested in that too. But the nice thing about using it is really how uncontroversial it is, so I doubt I’ll hear much feedback on that front.


archiving wikitweets

Earlier this year I created a little toy webapp called wikitweets that uses the Twitter streaming API to identify tweets that reference Wikipedia, which it then displays realtime in your browser. It was basically a fun experiment to kick the tires on NodeJS and SocketIO using a free, single process Heroku instance.

At the time I announced the app on the wiki-research-l discussion list to see if anyone was interested in it. Among the responses I received were ones from Emilio Rodríguez-Posada and Taha Yasseri asking whether the tweets are archived as they stream by. This struck a chord with me, since I’m a software developer working in the field of “digital preservation”. You know that feeling when you suddenly see one of your huge gaping blindspots? Yeah.

Anyway, some 6 months or so later I finally got around to adding an archive function to wikitweets, and I thought it might be worth writing about very quickly. Wikitweets uses the S3 API at Internet Archive to store the tweets in batches of 1,000. So you can visit this page at Internet Archive and download the tweets. Now I don’t know how long Internet Archive is going to be around, but I bet it will be longer than inkdroid.org, so it seemed like a logical (and free) safe harbor for the data.

In addition to sharing the files, Internet Archive also makes a BitTorrent seed available, so the data can easily be distributed around the Internet. For example you could open wikitweets_archive.torrent in your BitTorrent client and download a copy of the entire dataset, while providing a redundant copy. I don’t really expect this to happen much with the wikitweets collection, but it seems to be a practical offering in the Lots of Copies Keeps Stuff Safe category.

I tried to coerce several of the seemingly excellent s3 libraries for NodeJS to talk to the Internet Archive, but ended up writing my own very small library that works specifically with Internet Archive. ia.js is bundled as part of wikitweets, but I guess I could put it on npm if anyone is really interested. It gets used by wikitweets like this:

  var c = ia.createClient({
    accessKey: config.ia_access_key,
    secretKey: config.ia_secret_key,
    bucket: config.ia_bucket
  });

  var name = "20120919030946.json";
  c.addObject({name: name, value: tweets}, function() {
    console.log("archived " + name);
  });

The nice thing is that you can use s3 libraries that have support for Internet Archive, like boto, to programmatically pull down the data. For example, here is a Python program that goes through each file and prints out the Wikipedia article title referenced by each tweet:

  import json
  import boto

  ia = boto.connect_ia()
  bucket = ia.get_bucket("wikitweets")

  for keyfile in bucket:
      content = keyfile.get_contents_as_string()
      for tweet in json.loads(content):
          print tweet['article']['title']

The archiving has only been running for the last 24 hours or so, so I imagine there will be tweaks that need to be made. I’m considering compressing the tweets as one of them. Also it might be nice to put the files in subdirectories, but it seemed that Internet Archive’s API wanted to URL encode object names that have slashes in them.

If you have any suggestions I’d love to hear them.


finding soundcloud users with lastfm

I stumbled upon the lovely Soundcloud API this weekend, and before I knew it I was hacking together something that would use the LastFM API to look up artists that I listen to, and then see if they are on Soundcloud. If you haven’t seen it before, Soundcloud is a social networking site for musicians and audiophiles to share tracks. Sometimes artists will share works in progress, which is really fascinating.

It’s kind of amazing what you can accomplish in just HTML and JavaScript these days. It sure makes it easy to deploy, which I did at http://inkdroid.org/lastcloud/. If you want to give it a try, enter your LastFM username, or the username of someone you know, like mine: inkdroid. As you can see the hack sorta worked. I say sorta because there seem to be a fair number of users who are squatting on the names of musicians. There also seem to be accounts that are run by fans, pretending to be the artist. Below is a list of seemingly legit Soundcloud accounts I found, and have followed. If you have any ideas for improving the hack, I put the code up on GitHub.
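
The hack itself is just HTML and JavaScript running in the browser, but the basic idea fits in a few lines. Here is a rough Python 2 sketch of the same lookup; the API keys are placeholders, and the endpoints and response fields reflect my reading of the LastFM and Soundcloud APIs, so treat them as assumptions:

  import json
  import urllib

  LASTFM_KEY = "..."      # your LastFM API key
  SOUNDCLOUD_ID = "..."   # your Soundcloud client id

  def top_artists(user):
      params = urllib.urlencode({"method": "user.gettopartists", "user": user,
                                 "api_key": LASTFM_KEY, "format": "json"})
      response = json.load(urllib.urlopen("http://ws.audioscrobbler.com/2.0/?" + params))
      return [artist["name"] for artist in response["topartists"]["artist"]]

  def soundcloud_users(artist):
      params = urllib.urlencode({"q": artist.encode("utf-8"), "client_id": SOUNDCLOUD_ID})
      return json.load(urllib.urlopen("https://api.soundcloud.com/users.json?" + params))

  for artist in top_artists("inkdroid"):
      for user in soundcloud_users(artist):
          print("%s -> %s" % (artist, user.get("permalink_url")))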


fido test suite

I work in a digital preservation group at the Library of Congress where we do a significant amount of work in Python. Lately, I’ve been spending some time with OpenPlanets’ FIDO utility, mainly to see if I could refactor it so that it’s a bit easier to use as a module in other Python applications. At the moment FIDO is designed to be used from the command line. This work involved more than a little bit of refactoring, and the more I looked at the code, the more it became clear that a test suite would be useful to have as a safety net.

Conveniently, I also happened to have been reading a recent report from the National Library of Australia on File Characterization Tools, which, in addition to talking about FIDO, pointed me at the govdocs1 dataset. Govdocs1 is a dataset of 1 million files harvested from the .gov domain by the NSF funded Digital Corpora project. The data was collected to serve as a public domain corpus for forensics tools to use as a test bed. I thought it might be useful to survey the filenames in the dataset and cherry pick out examples of particular formats for use in my FIDO test suite.

So I wrote a little script that crawled all the filenames, and kept track of the file extensions used (a sketch of that kind of script follows the results table). Here are the results:

extension count
pdf 232791
html 191409
jpg 109281
txt 84091
doc 80648
xls 66599
ppt 50257
xml 41994
gif 36301
ps 22129
csv 18396
gz 13870
log 10241
eps 5465
png 4125
swf 3691
pps 1629
kml 995
kmz 949
hlp 660
sql 632
dwf 474
java 323
pptx 219
tmp 196
docx 169
ttf 104
js 92
pub 76
bmp 75
xbm 51
xlsx 46
jar 34
zip 27
wp 17
sys 8
dll 7
exported 5
exe 5
tif 3
chp 2
pst 1
squeak 1
data 1
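
Here is roughly the kind of script I mean, assuming the govdocs1 filenames have already been collected into a local file, one per line (fetching the listings is left out):

  import os
  from collections import Counter

  # tally file extensions from a list of govdocs1 filenames, one per line
  extensions = Counter()
  for line in open("filenames.txt"):
      ext = os.path.splitext(line.strip())[1].lstrip(".").lower()
      if ext:
          extensions[ext] += 1

  for ext, count in extensions.most_common():
      print("%s %s" % (ext, count))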

With this list in hand, I downloaded an example of each file extension, ran it through the current release of FIDO, and used the output to generate a test suite for my new refactored version. Interestingly, two tests fail:

======================================================================
FAIL: test_pst (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 244, in test_pst
    self.assertEqual(i.puid, "x-fmt/249")
AssertionError: 'x-fmt/248' != 'x-fmt/249'

======================================================================
FAIL: test_pub (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 260, in test_pub
    self.assertEqual(i.puid, "x-fmt/257")
AssertionError: 'x-fmt/252' != 'x-fmt/257'
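
For context, each generated test boils down to a unittest assertion against FIDO’s identification result. Here is a hedged sketch of what one looks like; the identify() helper, the sample path, and the expected PUID are hypothetical stand-ins, since the refactored API isn’t shown here:

  import unittest

  # hypothetical: the refactored module is assumed to expose an identify()
  # helper that returns match objects with a .puid attribute
  from fido import identify

  class FidoTests(unittest.TestCase):

      def test_pdf(self):
          # sample file cherry picked from govdocs1 (hypothetical path)
          matches = identify("test-files/sample.pdf")
          self.assertTrue(len(matches) > 0)
          # expected PUID recorded from the current FIDO release's output
          self.assertEqual(matches[0].puid, "fmt/18")

  if __name__ == "__main__":
      unittest.main()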

I’ll need to dig in to see what could be different between the two versions that would confuse x-fmt/248 with x-fmt/249 and x-fmt/252 with x-fmt/257. Perhaps it is related to Dave Tarrant’s recent post about how FIDO’s identification patterns have flip flopped in the past.

You may have noticed that I’m linking the PUIDs to Andy Jackson’s PRONOM Prototype Registry (built in 6 days with Drupal) instead of the official PRONOM registry. I did this because a Google search for the PRONOM identifier (PUID) pulled up a nice detail page for the format in Andy’s prototype, and it doesn’t seem possible (at least in the 5 minutes I tried) to link directly to a file format record in the official PRONOM registry. I briefly tried the Linked Data prototype, but it proved difficult to search for a given PUID (server errors, the unforgiving glare of SPARQL query textareas, etc).

I hope OpenPlanets and/or the National Archives give Andy’s Drupal experiment a fair shake. Getting a functional PRONOM registry running in 6 days with an open source toolkit like Drupal definitely seems more future proof than spending years with a contractor only to get closed source code. The Linked Data prototype looks promising, but as the recent final report on the Unified Digital Format Registry project highlights, choosing to build on a semantic web stack has its risks compared with more mainstream web publishing frameworks or content management systems like Drupal. PRONOM just needs an easy way for digital preservation practitioners to collaboratively update the registry, and for each format to have a unique URL that uses the PUID. My only complaint is that Andy’s prototype advertises RDF/XML in the HTML but seems to return an empty RDF document: for example, the HTML at http://beta.domd.info/pronom/x-fmt/248 has a <link> that points at http://beta.domd.info/node/1303/rdf.

I admit I am a fan of linked data, or being able to get machine readable data back (RDFa, Microdata, JSON, RDF/XML, XML, etc) from Cool URLs. But triplestores and SPARQL don’t seem to be terribly important things for PRONOM to have at this point. And if they are there under the covers, there’s no need to confront the digital preservation practitioner with them. My guess is that they want an application that lets them work with their peers to document file formats, not learn a new query or ontology language. Perhaps Jason Scott’s Just Solve the Problem effort in October will be a good kick in the pants to mobilize grassroots community work around digital formats.

Meanwhile, I’ve finished up the FIDO API changes and the test suite enough to have submitted a pull request to OpenPlanets. My fork of the OpenPlanets repository is similarly on GitHub. I’m not really holding my breath waiting for it to be accepted, as it represents a significant change, and they have their own published roadmap of work to do. But I am hopeful that they will recognize the value in having a test suite as a safety net as they change and refactor FIDO going forward. Otherwise I guess it could be the beginnings of a fido2, but I would like to avoid that particular future.

Update: after posting this Ross Spencer tweeted me some instructions for linking to PRONOM

https://twitter.com/beet_keeper/status/242515266146287616

Maybe I missed it, but PRONOM could use a page that describes this.


From Polders to Postmodernism

From Polders to Postmodernism: A Concise History of Archival Theory by John Ridener
My rating: 3 of 5 stars

This was a nice little find for my continuing self-education in archives. As its title suggests, it’s a short survey (less than 200 pages) that traces a series of paradigm shifts in archival theory, starting in the 19th century Netherlands and leading up to the present. Ridener focuses on the approaches to subjectivity and objectivity in archival theory in order to show how the theories have changed and built on each other over the last 200 years. He does a nice job of sketching the context for the theories and the changes in society and technology that drove them, as well as providing some interesting biographical material about individuals such as Jenkinson and Schellenberg. After having just read Controlling the Past I felt like I had some exposure to contemporary thinking about archives, but was lacking some of the historical background, so this book was very helpful. I think I might have to read Schellenberg’s Modern Archives now, especially because of the NARA connection. But that might get sidelined to read more of Terry Cook’s work on macro-appraisal. My only small complaint is that I noticed quite a few typos in the first half of the book, which got a little distracting at times.


Wikimania Revisited

I recently attended the Wikimania conference here in Washington, DC. I really can’t express how amazing it was to be a Metro ride away from more than 1,400 people from 87 countries who were passionate about creating a world in which every single human being can freely share in the sum of all knowledge. It was my first Wikimania, and I had pretty high expectations, but I was blown away by the level of enthusiasm and creativity of the attendees. Since my employer supported me by allowing me to spend the week there, I thought I would jot down some notes about the things that I took from the conference, from the perspective of someone working in the cultural heritage sector.

Archivy

Of course the big news from Wikimania for folks like me who work in libraries and archives was the plenary speech by the Archivist of the United States, David Ferriero. Ferriero did an excellent job of connecting NARA’s mission to that of the Wikipedia community. In particular he stressed that NARA cannot solve difficult problems like the preservation of electronic records without the help of open government, transparency and citizen engagement to shape its policies and activities. As a library software developer I’m as interested as the next person in technical innovations in the digital preservation space: be they new repository software, flavors of metadata and digital object packaging, web services and protocols, etc. But over the past few years I’ve been increasingly convinced that access to the content that is being preserved is an absolutely vital ingredient to its preservation. If open access (as in the case of NARA) isn’t possible due to licensing concerns, then it is still essential to let access by some user community drive and ground efforts to collect and preserve digital content. Seeing high level leadership in the cultural heritage space (and from the federal government no less) address this issue was really inspiring.

At the Archives our concepts of openness and access are embedded in our mission. The work we do every day is rooted in the belief that citizens have the right to see, examine, and learn from the records that guarantee citizens rights, document government actions, and tell the story of our nation.

My biggest challenge is visibility: not everyone knows who we are, what we do, or more importantly, the amazing resources we collect and house. The lesson I learned in my time in New York is that it isn’t good enough to create great digital collections, and sit back and expect people to find you. You need to be where the people are.

The astounding thing is that it’s not just talk–Ferriero went on to describe several ways the Archives is collaborating with the Wikipedia community, which is also documented at a high level in NARA’s Open Government Plan. One example that stood out for me was NARA’s Today’s Document website, which highlights documents from its collections. On June 1st, 2011 they featured a photograph of Howard P. Perry, who was the first African American to enlist in the US Marine Corps after it was desegregated on June 1st, 1942. NARA’s Wikipedian in Residence Dominic McDevitt-Parks’ efforts to bring archival content to the attention of Wikipedians resulted in a new article Desegregation in the United States Marine Corps being created that same day…and the photograph on NARA’s website was viewed more than 4 million times in 8 hours. What proportion of the web traffic was driven by Wikipedia specifically rather than other social networking sites wasn’t exactly clear, but the point is that this is what happens when you get your content where the users are. If my blog post is venturing into tl;dr territory, please be sure to at least watch his speech; it’ll just take 20 minutes.

Resident Wikipedians

In a similar vein Sara Snyder made a strong case for the use of archival materials on Wikipedia in her talk 5 Reasons Why Archives are an Untapped Goldmine for Wikimedians. She talked about the work that Sarah Stierch did as the Wikipedia in Residence at the Smithsonian Archives of American Art. The partnership resulted in ~300 WPA images being uploaded to Wikimedia Commons, 37 new Wikipedia articles, and new connections with a community of volunteers who participated in edit-a-thons to improve Wikipedia and learn more about the AAA collections. She also pointed out that since 2010 Wikipedia has driven more traffic to the Archives of American Art website than all other social media combined.

In the same session Dominic McDevitt-Parks spoke about his activities as the Wikipedian in Residence at the US National Archives. Dominic focused much of his presentation on NARA’s digitization work, largely done by volunteers, the use of Wikimedia Commons as a content platform for the images, and ultimately WikiSource as a platform for transcribing the documents. The finished documents are then linked to from NARA’s Online Catalog, as in this example: Appeal for a Sixteenth Amendment from the National Woman Suffrage Association. NARA also prominently links out to the content waiting to be transcribed at WikiSource on its Citizen Archivist Dashboard. If you are interested in learning more, Dominic has written a bit about the work with WikiSource on the NARA blog. Both Dominic and Sara will be speaking next month at the Society of American Archivists Annual Meeting, making the case for Wikipedia to the archival community. Their talk is called 80,000 Volunteers Can’t Be Wrong: The Case for Greater Collaboration with Wikipedia, and I encourage you to attend if you will be at SAA.

The arrival of Wikipedians in Residence is a welcome sea change in the Wikipedia community, where historically there had been some uncertainty about the best way for cultural heritage organizations to highlight their original content in Wikipedia articles. As Sara pointed out in her talk, it helps both sides (the institutional side, and the Wikipedia side) to have an actual, experienced Wikipedian on site to help the organization understand how they want to engage the community. Having direct contact with archivists, curators and librarians that know their collections backwards and forwards also helps the resident in knowing how to direct their work, and the work of other Wikipedians. The Library of Congress made an announcement at the Wikimania reception that the World Digital Library is seeking a Wikipedian in Residence. I don’t work directly on the project anymore, but I know people who do, so let me know if you are interested and I can try to connect the dots.

I think in a lot of ways the residency program is an excellent start, but really it’s just that–a start. The task at hand of connecting the Wikipedia community and article content with the collections of galleries, libraries, archives and museums is a huge one. One person, especially a temporary volunteer, can only do so much. As you probably know, Wikipedia editors can often be found embedded in cultural heritage organizations. It’s one of the reasons why we started having informal Wikipedia lunches at the Library of Congress: to see what can be done at the grass roots level by staff to integrate Wikipedia into our work. When we started to meet I learned about an earlier, 4 year old effort to create a policy that provides guidance to staff about how to interact with the Wikipedia community as editors. Establishing a residency program is an excellent way to signal a change in institutional culture, and to bootstrap and focus the work. But I think the residencies also highlight the need to empower staff throughout the organization to participate as well, so that after the resident leaves the work goes on. In addition to establishing a WDL Wikipedian in Residence I would love to see the Library of Congress put the finishing touches on its Wikipedia policy, which would empower staff to use and contribute to Wikipedia as part of their work without lingering doubt about whether it is sanctioned or not. It would also be helpful for organizations that already have such policies to publish them as examples for others wanting to do the same.

Wikipedia as a Platform

Getting back to Wikimania, I wanted to highlight a few other GLAM related projects that use Wikipedia as a platform.

Daniel Mietchen spoke about work he was doing around the Open Access Media Importer (OAMI). The OAMI is a tool that harvests media files (images, movies, etc) from open access materials and uploads them to Wikimedia Commons for use in article content. Efforts to date have focused primarily on PubMed from the National Institutes of Health. For someone working in the digital preservation field, one of the interesting outcomes of the work so far was a table illustrating the media formats present in PubMed:

Since Daniel and other OAMI collaborators are scientists they have been focused primarily on science related media…so they naturally are interested in working with arXiv. arXiv is a heavily trafficked, volunteer supported pre-print server that is normally a poster child for open repositories. But one odd thing about arXiv that Daniel pointed out is that while arXiv collects licensing information from authors as part of deposit, they do not indicate in the user interface which license has been used. This makes it particularly difficult for the OAMI to determine which content can be uploaded to the Wikimedia Commons. I learned from Simeon Warner shortly afterwards that while the licensing information doesn’t show up in the UI currently, and isn’t present in all the metadata formats that their OAI-PMH service provides, it can be found squirreled away in the arXivRaw format. So it should be theoretically possible to modify the OAMI to use arXivRaw.

Another challenge the OAMI faces is metadata extraction. For example, media files often don’t share all the subject keywords that are appropriate for the entire article. So knowing which ones to apply can be difficult. In addition, metadata extraction from Wikimedia Commons was reported to not be optimal, since it involves parsing MediaWiki templates, which limits the downstream use of the content added to the Commons. I don’t know if the Public Library of Science is on the radar for harvesting, but if it isn’t it should be. The OAMI work also seems loosely related to the issue of research data storage and citation which seems to be on the front burner for those interested in digital repositories. Jimmy Wales has reportedly been advising the UK government on how to make funded research available to the public. I’m not sure if datasets fit the purview of the Wikimedia Commons, but since Excel is #3 in the graph above perhaps it is. It might be interesting to think more about Wikimedia Commons as a platform for publishing (and citing) datasets.

I learned about another interesting use of the Wikimedia Commons from Maarten Dammers and Dan Entous during their talk about the GLAMwiki Toolset. The project is a partnership between Wikipedia Netherlands and Europeana. If you aren’t already familiar with Europeana it is an EU funded effort to enhance access to European cultural heritage material on the Web. The project is just getting kicked off now, and is aiming to:

…develop a scalable, maintainable, easy to use system for mass uploading open content from galleries, libraries, archives and museums to Wikimedia Commons and to create GLAM-specific requirements for usage statistics.

Wikimedia Commons can be difficult to work with in an automated, batch oriented way for a variety of reasons. One that was mentioned above is metadata. The GLAMwiki Toolset will provide some mappings from commonly held metadata formats (starting with Dublin Core) to Commons templates, and will provide a framework for adapting the tool to custom formats. Also there is a perceived need for tools to manage batch imports as well as exports from the Commons. The other big need is usable analytics tools that let you see how content is used and referenced on the Commons once it has been uploaded. Maarten indicated that they are seeking participation in the project from other GLAM organizations. I imagine that there are other organizations that would like to use the Wikimedia Commons as a content platform, to enable collaboration across institutional boundaries. Wikipedia is one of the most popular destinations on the Web, so they have been forced to scale their technical platform to support this demand. Even the largest cultural heritage organizations can often find themselves bound to somewhat archaic legacy systems that can make it difficult to similarly scale their infrastructure. I think services like Wikimedia Commons and WikiSource have a lot to offer cash strapped organizations that want to do more to provide access to their organization’s unique materials on the Web, but are not in a position to make the technical investments to make it happen. I’m hoping that efforts like the GLAMwiki Toolset will make this easier to achieve, and it’s something I personally would like to get involved in.

Incidentally, one of the more interesting technical track talks I attended was a talk by Ben Hartshorne from the Wikimedia Foundation Operations Team, about their transition from NFS to OpenStack Swift for media storage. I had some detailed notes about this talk, but proceeded to lose them. I seem to remember that in total, the various Wikimedia properties amount to 40T of media storage (images, videos, etc), and they want to be able to grow this to 200T this year. Ben included lots of juicy details about the hardware and deployment of Swift in their infrastructure, so I’ve got an email out to him to see if he can share his slides (update: he just shared them, thanks Ben!). The placement of various caches (Swift is an HTTP REST API), as well as the hooks into MediaWiki were really interesting to me. The importance of URL addressable object storage for bitstreams in an enterprise that is made up of many different web applications can’t be overstated. It was also fun to hear about the impact that digitization projects like Wiki Loves Monuments and the NARA work mentioned above are having on the backend infrastructure. It’s great to hear that Wikipedia is planning for growth in the area of media storage, and can scale horizontally to meet it, without paying large sums of money for expensive, proprietary, vendor supplied NAS solutions. What wasn’t entirely clear from the presentation is whether there is a generic tipping point where investing in staff and infrastructure to support something like Swift becomes more cost-effective than using a storage solution like Amazon S3. Ben did indicate that their use of Swift and the abstractions they built into MediaWiki would allow for using storage APIs like S3.

Before I finish this post, there were a couple other Wikipedia related topics that I didn’t happen to see discussed at Wikimania (it’s a multi-track event so I may have just missed them). One is the topic of image citation on Wikipedia. Helena Zinkham (Chief of the Prints and Photographs Division at the Library of Congress) recently floated a project proposal at LC’s Wikipedia Lunch to more prominently place the source of an image in Wikipedia articles. For an example of what Helena is talking about take a look at the article for Walt Whitman: notice how the caption doesn’t include information about where the image came from? If you click on the image you get a detail page that does indicate that the photograph is from LC’s Prints & Photographs collection, with a link back to the Prints & Photographs Online Catalog. I agree with Helena that more prominent information about the source of photographs and other media in Wikipedia could encourage more participation from the GLAM community. The best way to proceed with the idea is still an open question; I’m new to the way projects get started and RFCs work there. Hopefully we will continue to work on this in the context of the grassroots Wikipedia work at LC. If you are interested please drop me an email.

Another Wikipedia project directly related to my $work is the Digital Preservation WikiProject that the National Digital Stewardship Alliance is trying to kickstart. One of the challenges of digital preservation is the identification of file formats, and their preservation characteristics. English Wikipedia currently has 325 articles about Computer File Formats, and one of the goals of the Digital Preservation project is to enhance these with predictable infoboxes that usefully describe the format. External data sources such as PRONOM and UDFR also contain information about data formats. It’s possible that some of them could be used to improve Wikipedia articles, to more widely disseminate digital preservation information. Also, as Ferriero noted, it’s important for cultural heritage organizations to get their information out to where the people are. Jason Scott of ArchiveTeam has been talking about a similar project to aggregate information about file formats to build better tools for format identification. While I can understand the desire to build a new wiki to support this work, and there are challenges to working with the Wikipedia community, I think Linus’ Law points the way to using Wikipedia.

Beginning

So, I could keep going, but in the interests of time (yours and mine) I have to wrap this Wikimania post up (for now). Thanks for reading this far through my library colored glasses. Oddly I didn’t even get to mention the most exciting and high profile Wikidata and Visual Editor projects that are under development, and are poised to change what it means to use and contribute to Wikipedia for everyone, not just GLAM organizations. Wikidata is of particular interest to me because, if successful, it will bring many of the ideas of Linked Data to bear on an eminently practical problem that Wikipedia faces. In some ways the Wikidata project is following in the footsteps of the successful DBpedia and Google Freebase projects. But there is a reason why Freebase and DBpedia have spent time engineering their Wikipedia updates–because it’s where the users are creating content. Hopefully I’ll be able to attend Wikimania next year to see how they are doing. And I hope that my first Wikimania marks the beginning of a more active engagement in what Wikipedia is doing to transform the Web and the World.


and then the web happened

Here is the text of my talk I’m giving at Wikimania today, and the slides.

Let me begin by saying thank you to the conference organizers for accepting my talk proposal. I am excited to be here at my first WikiMania conference, and hope that it will be the first of many. Similar to millions of other people around the world, I use Wikipedia every day at work and at home. In the last three years I’ve transitioned from being a consumer to a producer, by making modest edits to articles about libraries, archives, and occasionally music. I had heard horror stories of people having their work reverted and deleted, so I was careful to cite material in my edits. I was pleasantly surprised when editors swooped in not to delete my work, but to improve it. So, I also want to say thanks to all of you for creating such an improbably open and alive community. I know there is room for improvement, but it’s a pretty amazing thing you all have built.

And really, that’s all my talk about Wikistream is about. Wikistream was born out of a desire to share just how amazing the Wikipedia community is, with people who didn’t know it already. I know, I’m preaching to the choir. I also know that I’m speaking in the Technology and Infrastructure track, and I promise to get to some details about how Wikistream works. But really, there’s nothing radically new in Wikistream–and perhaps what I’m going to say would be more appropriate for the GLAM track, or a performance art track, if there was one. If you are a multi-tasker and want to listen to me with one ear, please pull up http://wikistream.inkdroid.org in your browser, and try to make sense of it as I talk. Let’s see what breaks first, the application or the wi-fi connection–hopefully neither.

Wikipedia and the BBC

A couple years ago I was attending the dev8d conference in London and dropped into a 2nd Linked Data Meetup that happened to be going on nearby. Part of the program included presentations from Tom Scott, Silver Oliver and Georgi Kobilarov about some work they did at the BBC. They demo’d two web applications, the BBC Wildlife Finder and BBC Music, that used Wikipedia as a content management platform [1, 2].

If I’m remembering right it was Tom who demonstrated how an edit to a Wikipedia article resulted in the content being immediately updated at the BBC. It seemed like magic. More than that it struck me as mind-blowingly radical for an organization like the BBC to tap into the Wikipedia platform and community like this.

After a few questions I learned from Georgi that part of the magic of this update mechanism was a bot that the BBC created which sits in the #en.wikipedia IRC chatroom, where edits are announced [4]. I logged into the chatroom and was astonished by the number of edits flying by:

And remember this was just the English language Wikipedia channel. There are more than 730 other Wikimedia related channels where updates are announced. The BBC’s use of Wikipedia really resonated with me, but to explain why I need to back up a little bit more.

Crowdsourcing in the Library

I work as a software developer at the Library of Congress. In developing web applications there I often end up using data about books, people and topics that have been curated for hundreds of years, and which began to be made available in electronic form in the early 1970s. The library community has had a longstanding obsession with collaboration, or (dare I say) crowdsourcing, to maintain its information about the bibliographic universe. Librarians would most likely call it cooperative cataloging instead of crowdsourcing, but the idea is roughly the same.

As early as 1850, Charles Jewett proposed that the Smithsonian be established as the national library of the United States, which would (among other things) collect the catalogs of libraries all around the country [3]. The Smithsonian wasn’t as sure as Jewett, so it wasn’t until the 1890s that we saw his ideas take hold when the Library of Congress assumed the role of the national library, and home to the Copyright Office. To this day, copyright registration results in a copy of a registered book being deposited at the Library of Congress. In 1901 the Library of Congress established its printed card service, which made its catalog cards available to libraries around the United States and the world.

This meant that a book could be cataloged once by one of the growing army of catalogers at the Library of Congress, instead of the same work being done over and over by all the libraries all over the country. But the real innovation happened in 1971 when Fred Kilgour’s dream of an online shared cataloging database was powered up at OCLC. This allowed a book to be cataloged by any library, and instantly shared with other libraries around the country. It was at this point that the cataloging became truly cooperative, because catalogers could be anywhere, at any member institution, and weren’t required to be in an office at the Library of Congress.

This worked for a bit, but then the Web happened. As the Web began to spread in the mid to late 1990s the library community got it into their head that they would catalog it, with efforts like the Cooperative Online Resource Catalog. But the Web was growing too fast, there just weren’t enough catalogers who cared, and the tools weren’t up to the task, so the project died.

So when I saw Tom, Silver and Georgi present on the use of Wikipedia as a curated content platform at the BBC, and saw how active the community was, I had a bit of a light bulb moment. It wasn’t an if-you-can’t-beat-em-join-em moment in which libraries and other cultural heritage organizations (like the BBC) fade into the background and become irrelevant, but one in which Wikipedia helps libraries do their job better…and maybe libraries can help make Wikipedia better. It just so happened that this was right as the Galleries, Libraries, Archives and Museums (GLAM) effort was kicking off at Wikipedia. I really wanted to be able to help show librarians and others not likely to drop into an IRC chat how active the Wikipedia community was, and that’s how Wikistream came to be.

How

So now that you understand the why of Wikistream I’ll tell you briefly about the how. When I released Wikistream I got this really nice email from Ward Cunningham, who is a personal hero of mine, and I imagine a lot of you too:

To: wiki-research-l@lists.wikimedia.org
From: Ward Cunningham <ward@c2.com>
Subject: Re: wikistream: displays wikipedia updates in realtime
Date: Jun 16, 2011 7:43:11 am

I've written this app several times using technology from text-to-speech 
to quartz-composer. I have to tip my hat to Ed for doing a better job 
than I ever did and doing it in a way that he makes look effortless. 
Kudos to Ed for sharing both the page and the software that produces 
it. You made my morning. -- Ward

Sure enough, my idea wasn’t really new at all. But at least I was in good company. I was lucky to stumble across the idea for Wikistream when a Google search for streaming to the browser pulled up SocketIO. If you haven’t seen it before SocketIO is a JavaScript library that allows you to easily stream data to the browser without needing to care about the transport mechanisms that the browser supports: WebSocket, Adobe FlashSocket, AJAX long polling, AJAX multipart-streaming, Forever iframe, JSONP Polling. It autodetects the capabilities of the browser and the server, and gives you a simple callback API for publishing and consuming events. For example here is the code that runs in your browser to connect to the server and start getting updates:

$(document).ready(function() {
  var socket = io.connect();
  socket.on('message', function(msg) {
    addUpdate(msg);
  });
});

There’s a bit more to it, like loading the SocketIO library, and the details of adding the information about the change stored in the msg JavaScript object (more on that below) to the DOM, but SocketIO makes the hard part of streaming data from the server to the client easy.

Of course you need a server to send the updates, and that’s where things get a bit more interesting. SocketIO is designed to run in a NodeJS environment with the Express web framework. Once you have your webapp set up, you can add SocketIO to it:

var express = require("express");
var sio = require("socket.io");

var app = express.createServer();
// set up standard app routes/views
var io = sio.listen(app);

Then the last bit is to do the work of listening to the IRC chatrooms and pushing the updates out to the clients that want to be updated. To make this a bit easier I created a reusable library called wikichanges that abstracts away the business of connecting to the IRC channels and parsing the status updates into a JavaScript object, and lets you pass in a callback function that will be given updates as they occur.

var wikichanges = require('wikichanges');

var w = wikichanges.WikiChanges();
w.listen(function(msg) {
  io.sockets.emit('message', msg);
});

This results in updates being delivered as JSON objects to the client code we started with, where each update looks something like:

{ 
  channel: '#en.wikipedia',
  wikipedia: 'English Wikipedia',
  page: 'Persuasion (novel)',
  pageUrl: 'http://en.wikipedia.org/wiki/Persuasion_(novel)',
  url: 'http://en.wikipedia.org/w/index.php?diff=498770193&oldid=497895763',
  delta: -13,
  comment: '/* Main characters */',
  wikipediaUrl: 'http://en.wikipedia.org',
  user: '108.49.244.224',
  userUrl: 'http://en.wikipedia.org/wiki/User:108.49.244.224',
  unpatrolled: false,
  newPage: false,
  robot: false,
  anonymous: true,
  namespace: 'Article',
  flag: ''
}

As I already mentioned I extracted the interesting bit of connecting to the IRC chatrooms, and parsing the IRC colored text into a JavaScript object as a NodeJS library called wikichanges. Working with the stream of edits is surprisingly addictive, and I found myself wanting to create some other similar applications:

  • wikipulse which displays the rate of change of wikipedias as a set of accelerator displays
  • wikitweets: a visualization of how Wikipedia is cited on Twitter
  • wikibeat: a musical exploration of how Wikipedia is changing created by Dan Chudnov and Chris Burns.

So wikichanges is there to make it easier to bootstrap applications that want to do things with the Wikipedia update stream. Here is a brief demo of getting wikichanges working on a stock Ubuntu EC2 instance:
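
Getting wikichanges going on a fresh machine boils down to installing NodeJS and npm, running npm install wikichanges, and then running a small script. Something like the following sketch (the filename and the console output format are my own choices) will print edits to the terminal as they happen:

// changes.js -- a minimal sketch that prints each Wikipedia edit as it happens
// assumes NodeJS is installed and `npm install wikichanges` has been run
var wikichanges = require('wikichanges');

var w = new wikichanges.WikiChanges();
w.listen(function(msg) {
  // msg has the shape shown above: wikipedia, page, delta, user, url, ...
  console.log(msg.wikipedia + ': ' + msg.page + ' (' + msg.delta + ') by ' + msg.user);
});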

What’s Next?

So this was a bit of a wild ride; I hope you were able to follow along. I could have spent some time explaining why Node was a good fit for wikistream; perhaps we can talk about that in the Q&A if there is time. Let’s just say I actually reach for Python first when working on a new project, but the particular nature of this application and the tools that were available made Node a natural fit. Did we crash it yet?

The combination of the GLAM effort and Wikidata is poised to really transform the way cultural heritage organizations contribute to and use Wikipedia. I hope wikistream might help you make the case for Wikipedia in your organization when you give presentations. If you have ideas on how to use the wikichanges library to do something with the update stream, I would love to hear about them.


straw

By now I imagine you’ve heard the announcement that OCLC has started to make WorldCat bibliographic data available as openly licensed Linked Data. The availability of microdata and RDFa metadata in WorldCat pages, coupled with the ODC-BY license and sitemaps for crawlers, is a huge win for the library community. Similar announcements about Dewey Decimal Classification and the Virtual International Authority File are further evidence that there is a big paradigm shift going on at OCLC.

A few weeks ago Richard Wallis (formerly of Talis, and now at OCLC) asked me to take a look at the strawman library microdata vocabulary that OCLC put together for the WorldCat release: http://purl.org/library. Richard stressed that the library vocabulary was a prototype, intended to focus and gather interest from the cultural heritage sector outside of OCLC and from the metadata community in general. Combined with the prototype microdata at WorldCat, I think it represents an excellent first step. At this point I should reiterate that these remarks about schema.org are mine and not those of my employer.

The vocabulary is currently expressed in OWL, and visiting that URL will redirect you to an application that lets you read the OWL file as documentation. Rather than write up a few paragraphs and send my comments to Richard in email, I figured I would jot them down here, in case anyone else has feedback.

Examining the classes that the library vocabulary defines tells the majority of the story. They are broken down into:

  • ArchiveMaterial
  • Carrier
  • ComputerFile
  • Game
  • Image
  • InteractiveMultimedia
  • Kit
  • MusicalScore
  • Newspaper
  • Periodical
  • Thesis
  • Toy
  • Video
  • VideoGame
  • VisualMaterial
  • WebSite

These classes should seem familiar to catalogers who have worked with MARC, since there is a lot of similarity with the types of data encoded in the 008 field. However, some are missing, such as maps, dictionaries, encyclopedias, etc. It’s kind of amusing that Book isn’t mentioned. I’m not sure what the rationale was for selecting these classes; perhaps some sort of ranking based on use in WorldCat? Examining the OWL shows that OCLC has made an effort to express mappings between the library vocabulary and schema.org:

library class → schema.org mapping
http://purl.org/library/ArchiveMaterial → http://schema.org/CreativeWork/ArchiveMaterial
http://purl.org/library/ComputerFile → http://schema.org/CreativeWork/ComputerFile
http://purl.org/library/Game → http://schema.org/CreativeWork/Game
http://purl.org/library/Image → http://schema.org/CreativeWork/Image
http://purl.org/library/InteractiveMultimedia → http://schema.org/CreativeWork/InteractiveMultimedia
http://purl.org/library/Kit → http://schema.org/CreativeWork/Kit
http://purl.org/library/MusicalScore → http://schema.org/CreativeWork/MusicalScore
http://purl.org/library/Newspaper → http://schema.org/CreativeWork/Newspaper
http://purl.org/library/Periodical → http://schema.org/CreativeWork/Periodical
http://purl.org/library/Thesis → http://schema.org/CreativeWork/Book/Thesis
http://purl.org/library/Toy → http://schema.org/CreativeWork/Toy
http://purl.org/library/Video → http://schema.org/CreativeWork/Video
http://purl.org/library/VideoGame → http://schema.org/CreativeWork/VideoGame
http://purl.org/library/VisualMaterial → http://schema.org/CreativeWork/VisualMaterial
http://purl.org/library/WebSite → http://schema.org/CreativeWork/WebSite

However, these schema.org URLs do not resolve, and the classes are not actually part of schema.org’s CreativeWork hierarchy. Perhaps the presence of these mappings in the library vocabulary is evidence of a desire to create these classes at schema.org. But then there are cases like library:Image, which seems to bear a lot of resemblance to schema.org’s existing ImageObject.

Examining the OWL also yields a set of library:Carrier instances.

  • BlurayDisk
  • CassetteTape
  • CD
  • DVD
  • FilmReel
  • LP
  • Microform
  • VHSTape
  • Volume
  • WWW

Again, there are more carriers than this in the MARC world, and why these were selected is a bit of a mystery. It also isn’t clear what library:WWW has to do with library:WebSite (if anything), and so on.

So even in this prototype library vocabulary there is a lot to examine and unpack. I imagine some phone calls or face-to-face meetings would be needed to get at what went into these choices.

Be that as it may, I think it could prove more useful to look at the WorldCat microdata and see how the library vocabulary was actually used. For example, here is the microdata extracted from the WorldCat page for Tim Berners-Lee’s Weaving the Web, expressed as JSON:

{
  "type": "http://schema.org/Book", 
  "properties": {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
      "http://schema.org/Book"
    ], 
    "http://purl.org/library/placeOfPublication": [
      {
        "type": "http://schema.org/Place", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Place"
          ], 
          "http://schema.org/name": [
            "San Francisco :"
          ]
        }
      }
    ], 
    "http://schema.org/bookEdition": [
      "1st ed."
    ], 
    "http://schema.org/publisher": [
      {
        "type": "http://schema.org/Organization", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Organization"
          ], 
          "http://schema.org/name": [
            "HarperSanFrancisco"
          ]
        }
      }
    ], 
    "http://schema.org/genre": [
      "History"
    ], 
    "http://schema.org/name": [
      "Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor"
    ], 
    "http://schema.org/numberOfPages": [
      "226"
    ], 
    "http://purl.org/library/holdingsCount": [
      "2096"
    ], 
    "http://schema.org/about": [
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Erfindung."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "WWW."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "prospective informatique."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.w3.org/2004/02/skos/core#inScheme": [
            "http://dewey.info/scheme/e21/"
          ]
        }, 
        "id": "http://dewey.info/class/025/e21/"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World wide web."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World Wide Web."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/subjects/sh95000541"
          ], 
          "http://schema.org/name": [
            "World Wide Web--History."
          ]
        }
      }, 
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/no99010609"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim."
          ]
        }, 
        "id": "http://viaf.org/viaf/85312226"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Web--Histoire."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "World Wide Web"
          ]
        }, 
        "id": "http://id.worldcat.org/fast/1181326"
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "historique informatique."
          ]
        }
      }, 
      {
        "type": "http://www.w3.org/2004/02/skos/core#Concept", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
          ], 
          "http://schema.org/name": [
            "Web--Histoire."
          ]
        }
      }, 
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim"
          ]
        }
      }
    ], 
    "http://schema.org/description": [
      "Enquire within upon everything -- Tangles, links, and webs -- info.cern.ch -- Protocols: simple rules for global systems -- Going global -- Browsing -- Changes -- Consortium -- Competition and consensus -- Web of people -- Privacy -- Mind to mind -- Machines and the Web -- Weaving the Web."
    ], 
    "http://purl.org/library/oclcnum": [
      "41238513"
    ], 
    "http://schema.org/copyrightYear": [
      "1999"
    ], 
    "http://schema.org/contributor": [
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/n97003262"
          ], 
          "http://schema.org/name": [
            "Fischetti, Mark."
          ]
        }, 
        "id": "http://viaf.org/viaf/874883"
      }
    ], 
    "http://schema.org/isbn": [
      "9780062515872", 
      "006251587X", 
      "0062515861", 
      "9780062515865"
    ], 
    "http://schema.org/inLanguage": [
      "en"
    ], 
    "http://schema.org/reviews": [
      {
        "type": "http://schema.org/Review", 
        "properties": {
          "http://schema.org/reviewBody": [
            "Tim Berners-Lee, the inventor of the World Wide Web, has been hailed by Time magazine as one of the 100 greatest minds of this century. His creation has already changed the way people do business, entertain themselves, exchange ideas, and socialize with one another.\" \"Berners-Lee offers insights to help readers understand the true nature of the Web, enabling them to use it to their fullest advantage. He shares his views on such critical issues as censorship, privacy, the increasing power of software companies in the online world, and the need to find the ideal balance between the commercial and social forces on the Web. His criticism of the Web's current state makes clear that there is still much work to be done. Finally, Berners-Lee presents his own plan for the Web's future, one that calls for the active support and participation of programmers, computer manufacturers, and social organizations to make it happen.\"--Jacket."
          ], 
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Review"
          ], 
          "http://schema.org/itemReviewed": [
            "http://www.worldcat.org/oclc/41238513"
          ]
        }
      }
    ], 
    "http://schema.org/author": [
      {
        "type": "http://schema.org/Person", 
        "properties": {
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            "http://schema.org/Person"
          ], 
          "http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority": [
            "http://id.loc.gov/authorities/names/no99010609"
          ], 
          "http://schema.org/name": [
            "Berners-Lee, Tim."
          ]
        }, 
        "id": "http://viaf.org/viaf/85312226"
      }
    ]
  }, 
  "id": "http://www.worldcat.org/oclc/41238513"
}

Yes, that’s a lot of data. But interestingly only three library vocabulary elements were used:

  • placeOfPublication
  • holdingsCount
  • oclcnum

One could argue that rather than creating library:placeOfPublication, OCLC could have used schema:publisher with a nested Organization item that has a schema:location. Similarly, library:oclcnum could have been expressed using itemid with a value of info:oclc/41238513, using the info-uri namespace whose registry OCLC maintains. This leaves library:holdingsCount, which does seem to be missing from schema.org, but it also raises the question: whose holdings?
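
To make that concrete, here is a rough sketch (my own illustration, not markup that WorldCat actually publishes) of how the same information from the example above could look using plain schema.org terms plus itemid, in the same JSON shape as the extraction:

// an illustration only: the place of publication hangs off the publisher as a
// schema:location Place, and the OCLC number becomes the item id as an info URI
{
  "type": "http://schema.org/Book",
  "id": "info:oclc/41238513",
  "properties": {
    "http://schema.org/publisher": [
      {
        "type": "http://schema.org/Organization",
        "properties": {
          "http://schema.org/name": ["HarperSanFrancisco"],
          "http://schema.org/location": [
            {
              "type": "http://schema.org/Place",
              "properties": {
                "http://schema.org/name": ["San Francisco"]
              }
            }
          ]
        }
      }
    ]
  }
}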

As Tom Gruber famously said:

Every ontology is a treaty – a social agreement – among people with some common motive in sharing.

So the question for me is: what is the library vocabulary trying to do, and for whom? Is it trying to make it easy to share MARC data as microdata on the Web? Is it trying to communicate something to search engines so that they can build enhanced displays? Who are the people that want to share and consume this data? I think having rough consensus about the answers to these questions is really important before diving into modeling exercises…even prototypes. And when the modeling begins, I think it’s really important to follow the lead of the WorldCat developers in using the bits of schema.org vocabulary they could, and minting vocabulary terms for the things that are missing. I don’t think it’s going to be fruitful to start from the position of modeling the bibliographic universe completely. I’d rather see real implementations (both publishers and consumers) drive the discovery of what is missing or awkward in schema.org, and how it can be fixed. Ideally, schema.org implementors like GoodReads would be at the table, along with members of the academic community like Jason Ronallo, Jonathan Rochkind and Ed Chamberlain (among others) who care about these issues. In addition, my employer is actively engaged in an effort to rethink bibliographic data on the Web. It seems imperative that these efforts at schema.org and Zepheira’s work be combined somehow, especially since OCLC and Zepheira are hardly strangers.

I was of course flattered to be asked my opinion about the library vocabulary. I hope that my remarks haven’t accidentally set this strawman vocabulary on fire, because I think the work that OCLC has begun in this area is incredibly important. My experience watching the designers of SKOS has made me mindful of minimizing ontological commitments when designing a vocabulary, and wary of trying to exhaustively model a domain. In some ways I guess I’m a bit of a schema.org skeptic given its encyclopedic coverage. schema.org should take a page from the HTML 5 book and stay hyper-focused on letting implementations drive standardization. A bit of Seymour Lubetzky’s attention to simplification and user friendliness would be welcome as well.