viaf ntriples

I had a few requests for the Virtual International Authority File ntriples file I wrote about earlier. Having the various flavors of VIAF data available is great, but if an RDF dump is going to be made available I think ntriples kinda makes more sense than line oriented rdf/xml. I say this only because most RDF libraries and tools have support for bulk loading ntriples, and none (to my knowledge) support loading line oriented rdf/xml files.

I’ve made the 1.9G bzipped ntriples file available on Amazon S3 if you are interested in having at it:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2
Incidentally, you can also torrent it, which would help spread the download around (and spare me some cost on S3), by pointing your BitTorrent client at:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2?torrent
As with the original VIAF dataset, this ntriples VIAF download contains information from VIAF (Virtual International Authority File), which is made available under the ODC Attribution License (ODC-By). Similarly, I am making the ntriples VIAF download available under the ODC-By License as well, because I think I have to, given the viral nature of ODC-By. At least that’s my unprofessional (I am not a lawyer) reading of the license. I’m not really complaining either, I’m all for openness going viral :-)


On a side note, I upgraded my laptop after the 4 days it took to initially create the ntriples file. In the process I accidentally deleted the ntriples file when I reformatted my hard drive. So the second time around I spent some time seeing if I could generate it quicker on Elastic MapReduce by splitting the file across multiple workers that would generate the ntriples from the rdf/xml and merge it back together. The conversion of the rdf/xml to ntriples using rdflib was largely CPU bound on my laptop, so I figured Hadoop Streaming would help me run my little Python script on as many worker nodes as I needed.
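
The conversion itself is simple, since each line of the dump is a standalone chunk of rdf/xml describing one cluster: parse a line, serialize it back out as ntriples, repeat. This isn’t the script I ran, just a minimal sketch of the idea (the file names are placeholders):

    import gzip
    from rdflib import Graph

    # each line of the dump is assumed to be one rdf/xml document (one VIAF cluster)
    with gzip.open("viaf-clusters-rdf.xml.gz", "rt", encoding="utf-8") as infile, \
         open("viaf-clusters-rdf.nt", "w", encoding="utf-8") as outfile:
        for line in infile:
            if not line.strip():
                continue
            g = Graph()
            g.parse(data=line, format="xml")          # parse one cluster
            outfile.write(g.serialize(format="nt"))   # write its triples as ntriples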

EMR made setting up and running a job quite easy, but I ran into my own ignorance pretty quickly. It turns out that Hadoop was not able to split the gzipped VIAF data, which meant data was only ever sent to one worker, no matter how many I ran. I then ran across some advice to use LZO compression, which is supposedly splittable on EMR, but after lots of experimentation I couldn’t get it to split either. I thought about uncompressing the original gzipped file on S3, but felt kind of depressed about doing that for some reason.

I had time-boxed only a few days to try to get EMR working, so I backpedaled to rewriting the conversion script with Python’s multiprocessing library. I thought multiprocessing would let me take advantage of a multi-core EC2 machine. But oddly the conversion ran slower using multiprocessing’s Pool than it did as a single process. I chalked this up to the overhead of pickling large strings of rdf/xml and ntriples to send them around during inter-process communication…but I didn’t investigate further.
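
For reference, what I tried looked roughly like the sketch below (not my actual script): each line of rdf/xml gets pickled out to a worker process, and the resulting ntriples string gets pickled back, which is where I suspect the overhead crept in.

    from multiprocessing import Pool
    from rdflib import Graph

    def convert(line):
        # parse one line of rdf/xml and return its ntriples serialization
        g = Graph()
        g.parse(data=line, format="xml")
        return g.serialize(format="nt")

    if __name__ == "__main__":
        with open("viaf-clusters-rdf.xml") as infile, \
             open("viaf-clusters-rdf.nt", "w") as outfile, \
             Pool(processes=4) as pool:
            # lines go out to the workers, ntriples strings come back over IPC
            for ntriples in pool.imap(convert, infile, chunksize=100):
                outfile.write(ntriples)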

So, in the end, I just re-ran the script for 4 days, but this time up at EC2 so that I could use my laptop during that time. &sigh;


On Discovery

There’s an interesting story over at The Atlantic which discusses the important role that cataloging and archival description play in historical research. The example is a recently discovered report to the Surgeon General from Charles Leale about his treatment of Abraham Lincoln after he was shot. A few weeks ago a researcher named Helen Papaioannou discovered the report while combing a collection of correspondence to the Surgeon General looking for materials related to Lincoln for a project at the Abraham Lincoln Presidential Library and Museum. The Atlantic piece boldly declares in its title:

If You ‘Discover’ Something in an Archive, It’s Not a Discovery.

Then it goes on to heap accolades on the silent archivists toiling away for centuries who made the report possible to find. I’ve done my fair share of cataloging, and put in enough time working with EAD finding aids to enjoy the pat on the back. But something about the piece struck me as odd, and it took a bit of reading of the announcement of the discovery, and listening to an NPR interview with Papaioannou, to put my finger on it.

It’s very possible, of course, with the volume of material that archives hold, for a particular professional to not know exactly what the repository holds. This is because archivists catalogue not at “item level,” a description of every piece of paper, which would take millennia, but at “collection level,” a description of the shape of the collection, who owned it, and what kinds of things it contains. With the volume of materials, some collections may be undescribed or even described wrongly. But if anyone thought that a report to the Surgeon General from a physician who saw Lincoln post-assassination existed, they might have looked through these correspondence files – which is exactly what the researcher, Helen Papaioannou, did. The exciting part about the Leale report is not that it was rescued from a “dusty archives” (an abhorrent turn of phrase!) but that since it’s now catalogued, everyone who wants to find it can.

Papaioannou’s own account is a bit more nuanced though:

Well, the record group I was currently searching was the records of the Office of the Surgeon General. And I was looking through his letters received, and I was in the L’s. And I was going through 1865, so I - since Lincoln died in 1865. I was almost finished with L and there it was, sitting right in the middle of a box.

This account makes it sound more like she was combing various record groups looking for correspondence from Lincoln, and accidentally ran across a letter from Leale that was filed nearby…and she happened to notice that it was about Lincoln, and subsequently that the document’s existence was not known. So Papaioannou didn’t suspect that the report to the Surgeon General existed and go searching for it. She was instead examining various record groups for any correspondence from Lincoln, and was alert enough to notice something as she was moving through the collection. And most importantly she recognized that the document was not known to the historical community: the all-important context, which is not completely knowable by any individual cataloger or archivist. At least that’s how I’m reading it.

Saying that there is no discovery in libraries and archives because all the discovery has been pre-coordinated by librarians and archivists puts the case for the work we do too strongly. It doesn’t give enough credit to the acts of discovery and creativity that library users like Papaioannou perform, and which our institutions depend on. I’m not an expert, but it seems to me that the lines that divide the historian and the archivist are semi-permeable, especially since historical research itself gets archived, and archivists end up doing their own flavor of historical research when documenting the provenance of a collection. If we care about the future of libraries and archives we need not only to pat ourselves on the back for the work we do, but also to recognize and appreciate the real work that goes on inside our buildings and on our websites.

And yes it’s great that the letter is now cataloged for re-discovery. But even better (for me) was that I was able to read the Atlantic piece, do some searches, and then go and listen to an interview with Papaioannou, and read the announcement from the Lincoln Library which includes a transcription of the actual letter.

…and then go and update the Wikipedia entry for Charles Leale to include information about the (very real) discovery of the letter.

Hopefully it won’t get reverted :-)



Wikimania Justification

Due to fiscal constraints we (understandably) have to write justifications for travel requests at $work, to make it clear how the conference/meeting fits in with the goals of the institution. I am planning on going to Wikimania for the first time this year, which is happening a short metro ride away at George Washington University. The cost for the full event is $50, which is an amazing value, and makes it a bit of a no-brainer on the cost-benefit scale. But I still need to justify it, mainly because of the time away from work. If you work in a cultural heritage organization and ever find yourself wanting to go to Wikimania maybe the justification I wrote will be of interest. I imagine you could easily substitute in your own organization’s Web publishing projects appropriately …

The Wikimania conference is the annual conference supporting the Wikipedia community. It is attended by thousands of people from around the world, and is the premier event for discussions and research about the continued development of Wikipedia–and it is being held in Washington, DC this year. Wikipedia comprises 22 million articles, in 285 languages, and it has become the largest and most popular general reference work on the Internet, ranking sixth globally among all websites.

Wikimania is of particular interest to cultural heritage institutions, and specifically the Library of Congress, because of the important role that collections like American Memory, Chronicling America, the Prints and Photographs Online Catalog and the World Digital Library (among others) have for Wikipedia editors. Primary resource material on the Web is extremely important to editors for verifiability of article content–so much so that the Wikipedia community is specifically conducting outreach with its Galleries, Libraries, Archives and Museums (GLAM) project. Several of our peer institutions are involved in the GLAM effort, including the National Archives, the Smithsonian and OCLC. Wikipedia remains one of the top referrers of web traffic to the Library of Congress web properties. LC’s multi-decade effort to put its unique collections online for the American people naturally aligns it with the mission of Wikipedia, and Wikimania is an excellent place to learn more about the collaboration that is going on with cultural heritage organizations.

I will be presenting on the value of open access to underlying datasets when conducting a real-time visualization of Wikipedia edits. There is a track of presentations for the cultural heritage community which I plan on attending. There is also a workshop on the Wikidata project, which has particular relevance for LC’s historic involvement in subject and name authority control files. In addition there is a Wikipedia Loves Libraries workshop being sponsored by OCLC to explore the ways in which libraries and Wikipedia can support each other in enriching discoverability and access to research material.


diving into VIAF

Last week saw a big (well big for library data nerds) announcement from OCLC that they are making the data for the Virtual International Authority File (VIAF) available for download under the terms of the Open Data Commons Attribution (ODC-BY) license. If you’re not already familiar with VIAF here’s a brief description from OCLC Research:

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF’s goal is to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF seeks to include authoritative names from many libraries into a global service that is available via the Web. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities

More specifically, the VIAF service links national and regional-level authority records, creating clusters of related records, and expands the concept of universal bibliographic control by:

  • allowing national and regional variations in authorized form to coexist
  • supporting needs for variations in preferred language, script and spelling
  • playing a role in the emerging Semantic Web

If you go and look at the OCLC Research page you’ll notice that last month the VIAF project moved to OCLC. This is evidence of a growing commitment on OCLC’s part to make VIAF part of the library information landscape. It currently includes data about people, places and organizations from 22 different national libraries and other organizations.

Already there has been some great writing about what the release of VIAF data means for the cultural heritage sector. In particular Thom Hickey’s Outgoing is a trove of information about the project, providing a behind-the-scenes look at the various services it offers.

Rather than paraphrase what others have said already I thought I would download some of the data and report on what it looks like. Specifically I’m interested in the RDF data (as opposed to the custom XML, and MARC variants) since I believe it to have the most explicit structure and relations. The shared semantics in the RDF vocabularies that are used also make it the most interesting from a Linked Data perspective.

Diving In

The primary data structure of interest in the data dumps that OCLC has made available is what they call the cluster. A cluster is essentially a hub-and-spoke model, with a resource for the person, place or organization in the middle, attached via the spokes to conceptual resources at the participating VIAF institutions. As an example, here is an illustration of the VIAF cluster for the Canadian archivist Hugh Taylor:

Here you can see a FOAF Person resource (yellow) in the middle that is linked to from SKOS Concepts (blue) for the Bibliothèque nationale de France, Library and Archives Canada, the Deutsche Nationalbibliothek, BIBSYS (Norway) and the Library of Congress. Each of the SKOS Concepts has its own preferred label, which you can see varies across institutions. This high level view obscures quite a bit of data, which is probably best viewed in Turtle if you want to see it:

<http://viaf.org/viaf/14894854>
    rdaGr2:dateOfBirth "1920-01-22" ;
    rdaGr2:dateOfDeath "2005-09-11" ;
    a rdaEnt:Person, foaf:Person ;
    owl:sameAs <http://d-nb.info/gnd/109337093> ;
    foaf:name "Taylor, Hugh A.",
        "Taylor, Hugh A. (Hugh Alexander), 1920-",
        "Taylor, Hugh Alexander 1920-2005" .
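
Once you have a cluster’s triples on disk (cut from the dump, or saved from viaf.org) it’s easy to poke at them with rdflib. Here’s a quick sketch–the file name is made up, and what prints obviously depends on which triples you have:

    from rdflib import Graph, URIRef
    from rdflib.namespace import FOAF, OWL

    g = Graph()
    g.parse("hugh-taylor-cluster.nt", format="nt")   # a file holding the triples above

    person = URIRef("http://viaf.org/viaf/14894854")
    for name in g.objects(person, FOAF.name):
        print("name:", name)        # the variant name headings
    for other in g.objects(person, OWL.sameAs):
        print("sameAs:", other)     # links out to other identifier systems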


way, way back

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive’s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API, for determining when a given URL is available in a web archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think that simplicity may be obscured for software developers who need to fully digest the spec in order to say they “do” Memento.
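
The core interaction really is just an HTTP request with one extra header: you ask a TimeGate for a URL as of some date, and it redirects you to the closest archived copy. A sketch with the requests library–the TimeGate URL here is only illustrative, since each archive or aggregator publishes its own:

    import requests

    # ask a TimeGate for the capture of a page closest to a given date
    timegate = "http://web.archive.org/web/http://www.nytimes.com/"  # illustrative endpoint
    response = requests.get(
        timegate,
        headers={"Accept-Datetime": "Tue, 01 Jan 2008 00:00:00 GMT"},
        allow_redirects=False,
    )

    print(response.status_code)               # typically a redirect to the chosen memento
    print(response.headers.get("Location"))   # URL of the archived snapshot
    print(response.headers.get("Link"))       # rel="original", rel="timemap", etc.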

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>4425</numreturned>
    <type>urlquery</type>
    <enddate>20120503151837</enddate>
    <numresults>4425</numresults>
    <firstreturned>0</firstreturned>
    <url>nytimes.com/</url>
    <resultsrequested>40000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>68043717</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>IA-001766.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961112181513</capturedate>
    </result>
    <result>
      <compressedoffset>8107</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BK-000007.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961121230155</capturedate>
    </result>
    ...
  </results>
</wayback>

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19910806145620</startdate>
    <numreturned>201</numreturned>
    <type>urlquery</type>
    <enddate>20120503152750</enddate>
    <numresults>201</numresults>
    <firstreturned>0</firstreturned>
    <url>bbc.co.uk/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>75367408</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-196764-0.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080410162445</capturedate>
    </result>
    <result>
      <compressedoffset>92484146</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-7307314-46.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080527231330</capturedate>
    </result>
    ...
  </results>
</wayback>

It turns out the British Library is using this structured data to access content from Hadoop, where their web archives live on HDFS as WARC files–which is pretty slick. Actually, WARCs on spinning disk are pretty awesome by themselves, no matter how you are doing it.
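
It also means you can check programmatically whether a given URL has been archived with just a few lines of code. Here’s a sketch against the Internet Archive endpoint mentioned above–the element names follow the example output, and may well vary across Wayback deployments:

    import requests
    import xml.etree.ElementTree as ET

    # query a Wayback xmlquery endpoint and list the captures it reports
    response = requests.get(
        "http://wayback.archive.org/web/xmlquery",
        params={"url": "http://www.nytimes.com"},
    )
    doc = ET.fromstring(response.content)

    for result in doc.findall("./results/result"):
        # capturedate and url follow the element names in the example output above
        print(result.findtext("capturedate"), result.findtext("url"))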

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

  1. Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
  2. How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
  3. Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heritrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)


Lessons of JSON

A recent (and short) IEEE Computing Conversations interview with Douglas Crockford about the development of JavaScript Object Notation (JSON) offers some profound, and sometimes counter-intuitive, insights into standards development on the Web.

Pave the Cowpaths

I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.

Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON he instead says he discovered it–almost as if he was a naturalist on an expedition in some uncharted territory. Looking at the Web as an ecosystem with forms of life in it might seem more like a stretched metaphor or sci-fi plot; but I think it’s a useful and pragmatic stance. The Web is a complex space, and efforts to forcibly make it move in particular directions often fail, even when big organizations and trans-national corporations are behind them. Grafting, aligning and cross-fertilizing technologies, while respecting the communities that they grow from will probably feel more chaotic, but it’s likely to yield better results in the long run.

Necessity is the Mother of Invention

Crockford needed JSON when building an application where a client written in JavaScript needed to communicate with a server written in Java. He wanted something simple that would let him solve a real need he had right in front of him. He didn’t want the client to have to compose a query for the server, have the server perform the query against a database, and return something to the client that in turn needed to be queried again. He wanted something where the data serialization matched the data structures available to both programming language environments. He wanted something that made his life easier. Since Crockford was actually using JSON in his work it has usability at its core.
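
That design decision is easy to see from any scripting language: a JSON text maps directly onto the data structures you already have, with no query layer in between. A trivial illustration in Python:

    import json

    # JSON objects, arrays, strings, numbers, booleans and null line up with
    # native dicts, lists, strs, ints/floats, bools and None
    record = {"format": "JSON", "languages": ["JavaScript", "Java"], "versioned": False}

    text = json.dumps(record)   # serialize to send over the wire
    data = json.loads(text)     # parse on the other side
    assert data == record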

Less is More

Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with push back about JSON not being a “standard” Crockford registered json.org, put up a specification that documented the data format, and declared it as a standard. He didn’t expend a lot of energy in top-down mode, trying to convince industry and government agencies that JSON was the way to go. Software developers discovered it, and started using it in their applications because it made their life easier too. Some people complained that the X in AJAX stood for XML, and therefore JSON should not be used. But this dogmatism faded into the background over time as the benefits of JSON were recognized.

JSON is not versioned, and has no mechanism for revision. JSON cannot change. This probably sounds like heresy to many folks involved in standardization. It is more radical than the WHATWG’s decision to remove the version number from HTML5. It may only be possible because Crockford focused so much on keeping the JSON specification so small and tight. Crockford is realistic about JSON not solving all data representation problems, and speculates that we will see use cases that JSON does not help solve. But when this happens something new will be created instead of extending JSON. This relieves the burden of backwards compatibility that can often drag projects down into a quagmire of complexity. Software designed to work with JSON will always work with whatever valid JSON is found in the wild.

Anyhow

Don’t listen to me, go watch the interview!


on not linking

NPR Morning Edition recently ran an interview with Teju Cole about his most recent project called Small Fates. Cole is the recipient of the 2012 Hemingway Foundation/PEN Award for his novel Open City. Small Fates is a series of poetic snippets from Nigerian newspapers, which Cole has been posting on Twitter. It turns out Small Fates draws on a tradition of compressed news stories known as fait divers. The interview is a really nice description of the poetic use of this material to explore time and place. In some ways it reminds me a little of the cut-up technique that William S. Burroughs popularized; albeit in a more lyrical, less dadaist form.

At about the 3 minute mark in the interview Cole mentions that he has recently been drawing on content from historic New York newspapers in Chronicling America. For example:

Chronicling America is a software project I work on. Of course we were all really excited to hear Cole mention us on NPR. One thing that we were wondering is whether he could include shortened URLs to the newspaper page in Chronicling America in his tweets. Obviously this would be a clever publicity vehicle for the NEH funded National Digital Newspaper Program. It would also allow the Small Fates reader to follow the tweet to the source material, if they wanted more context.

Through the magic of Facebook, Twitter, good old email and Teju’s generosity I got in touch with him to see if he would be willing to include some shortened Chronicling America URLs in his tweets. His response indicated that he had clearly already thought about linking, but had decided not to. His reasons for not linking struck me as really interesting, and he agreed to let me quote them here:

I can’t include links directly in my tweets for three reasons.

The first is aesthetic: I like the way the tweets look as clean sentences. One wouldn’t wish to hyperlink a poem.

The second is artistic: I want people to stay here, not go off somewhere else and crosscheck the story. Why go through all the trouble of compression if they’re just going to go off and read more about it? What’s omitted from a story is, to me, an important part of a writer’s storytelling strategy.

And the third is practical: though I seldom use up all 140 characters, rarely do I have enough room left for a url string, even a shortened one.

I really loved this artistic (and pragmatic) rationale for not linking, and thought you might too.


cc0 and git for data

In case you missed it, the Cooper-Hewitt National Design Museum at the Smithsonian Institution made a pretty important announcement almost a month ago: they have released their collection metadata on GitHub using the CC0 Creative Commons license. The choice to use GitHub is interesting (more about that below) but the big news is the choice to license the data with CC0. John Wilbanks wrote a nice piece about why the use of CC0 is important. Rather than paraphrase I’ll just quote his main point:

… the fact that the Smithsonian has gone to CC0 is actually a great step. It means that data owners inside the USG have the latitude to use tools that put USG works into a legal status outside the US that is interoperable with their public domain status inside the US, and that’s an unalloyed Good Thing in my view.

While I was helping to prototype and bring the first version of id.loc.gov online, the licensing of the data was a persistent question that I heard from people who wanted to use the data. The about page at id.loc.gov currently says:

The Library of Congress has prepared this vocabulary terminology system and is making it available as a public domain data set. While it has attempted to minimize inaccuracies and defects in the data and services furnished, THE LIBRARY OF CONGRESS IS PROVIDING THIS DATA AND THESE SERVICES “AS IS” AND DISCLAIMS ANY AND ALL WARRANTIES, WHETHER EXPRESS OR IMPLIED.

But as John detailed in his post, this isn’t really enough for folks outside the US. I think even for parties inside the US a CC0 license would add more confidence to using the data set in different contexts, and would help align the Library of Congress more generally with the Web. Last Friday Eric Miller of Zepheira spoke about Linked Data at the Library of Congress (eventually the talk should be made available). Near the end of his talk he focused on things that need to be worked on, and I was glad to hear him stress that work needed to be done on licensing. The issue is really nothing new, and it really transcends the Linked Data space. I’m not saying it’s easy, but I agree with everyone who is saying it is important to focus on…and it’s great to see the advances that others in the federal government are making.

Git and GitHub

The other fun part of the Smithsonian announcement was the use of GitHub as a platform for publishing the data. To do this Cooper-Hewitt established an organizational account on GitHub, which you might think is easy, but is actually no small achievement by itself for an organization in the US federal government. With the account in hand the collection project was created and the collection metadata was released as two CSV files (media.csv and objects.csv) by Micah Walter. The repository was then forked by Aaron Straup Cope, who added some Python scripts for converting the CSV files into record-based JSON files. In the comments to the Cooper-Hewitt Labs blog post Aaron commented on why he chose to break up the CSV into JSON. The beautiful thing about using Git and GitHub this way for data is that you have a history view like this:

For digital preservation folks this view of what changed, when, and by whom is extremely important for establishing provenance. The fact that you get this for free by using the open source Git version control system and pushing your repository to GitHub is very compelling.
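
Aaron’s scripts aren’t reproduced here, but the general shape of that kind of conversion–splitting a collection-level CSV like objects.csv into one JSON file per record–is small enough to sketch. The id column name is a guess on my part, not Cooper-Hewitt’s actual schema:

    import csv
    import json
    import os

    # split a collection-level CSV into one JSON file per record
    os.makedirs("objects", exist_ok=True)
    with open("objects.csv", newline="", encoding="utf-8") as csvfile:
        for row in csv.DictReader(csvfile):
            record_id = row["id"]   # assumed column name for the record identifier
            path = os.path.join("objects", record_id + ".json")
            with open(path, "w", encoding="utf-8") as out:
                json.dump(row, out, indent=2, sort_keys=True)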

Over the past couple of years there has been quite a bit of informal discussion in the digital curation community about using Git for versioning data. Just a couple weeks before the Smithsonian announcement Stephanie Collett and Martin Haye from the California Digital Library reported on the use of Git and Mercurial to version data at Code4lib 2012.

But as Alf Eaton observed:

In this case we’re talking 205,137 files. If you doubt Alf, try cloning the repository. Here’s what I see:

ed@rorty:~$ time git clone https://github.com/cooperhewitt/collection.git cooperhewitt-collection
Cloning into cooperhewitt-collection…
remote: Counting objects: 230004, done.
remote: Compressing objects: 100% (19507/19507), done.
remote: Total 230004 (delta 102489), reused 223775 (delta 96260)
Receiving objects: 100% (230004/230004), 27.84 MiB | 3.96 MiB/s, done.
Resolving deltas: 100% (102489/102489), done.


Curating Curation

This post was composed over at Storify and exported here.