A few months ago Brian Tingle posted some exciting news that the Social Networks and Archival Context (SNAC) project was releasing the data that sits behind their initial prototype:

As a part of our work on the Social Networks and Archival Context Project, the SNAC team is pleased to release more early results of our ongoing research.

A property graph of correspondedWith and associatedWith relationships between corporate, personal, and family identities is made available under the Open Data Commons Attribution License in the form of a graphML file. The graph expresses 245,367 relationships between 124,152 named entities.

The graphML file, as well as the scripts to create and load a graph database from EAC or graphML, are available on Google Code.

We are still researching how to map from the property graph model to RDF, but this graph processing stack will likely power the interactive visualization of the historical social networks we are developing.

The SNAC project has aggregated archival finding aid data for manuscript collections at the Library of Congress, Northwest Digital Archives, Online Archive of California and Virginia Heritage. They then used authority control data from NACO/LCNAF, the Getty Union List of Artist Names Online (ULAN) and VIAF to knit these archival finding aids together using the Encoded Archival Context – Corporate bodies, Persons, and Families (EAC-CPF) standard.

I wrote about SNAC here about 9 months ago, and about how much potential there is in the idea of visualizing archival collections across institutions, along the axis of identity. I had also privately encouraged Brian to look into releasing some portion of the data that is driving their prototype. So when Brian delivered I felt some obligation to look at the data and try to do something with it. Since Brian indicated that the project was interested in an RDF serialization, and Mark had pointed me at Aaron Rubenstein’s Arch vocabulary, I decided to take a stab at converting the GraphML data to some Arch-flavored RDF.

So I forked Brian’s Mercurial repository and wrote a script that parses the GraphML XML that Brian provided and writes RDF (using arch:correspondedWith, arch:primaryProvenanceOf, arch:appearsWith) to a local triple store using rdflib. Since RDF has URLs cooked in pretty deep, part of this conversion involved reverse-engineering the SNAC URLs in the prototype, which wasn’t terribly clean, but it seemed good enough for demonstration purposes.
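For anyone who wants to picture the shape of that conversion, here is a minimal sketch. It is not the script from the repository: it uses only the Python standard library and emits N-Triples lines instead of loading an rdflib store. The sample GraphML, the `name` and `rel` key ids, and the `http://example.org/snac/` URI scheme are all assumptions for illustration; the real SNAC file and the reverse-engineered SNAC URLs will differ.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"
ARCH = "http://purl.org/archival/vocab/arch#"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical input: the key ids ("name", "rel") are assumptions,
# not the layout of the actual SNAC GraphML file.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <key id="rel" for="edge" attr.name="rel" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="n1"><data key="name">Bush, Vannevar, 1890-1974.</data></node>
    <node id="n2"><data key="name">Example, Person</data></node>
    <edge source="n1" target="n2"><data key="rel">correspondedWith</data></edge>
  </graph>
</graphml>
"""


def node_uri(node_id):
    # Placeholder URI scheme standing in for the reverse-engineered SNAC URLs.
    return "http://example.org/snac/" + node_id


def escape_literal(text):
    # Escape the characters that may not appear raw in an N-Triples literal.
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def data_value(element, key):
    # Return the text of the <data> child with the given key, or None.
    for data in element.findall(GRAPHML_NS + "data"):
        if data.get("key") == key:
            return data.text or ""
    return None


def graphml_to_ntriples(xml_text):
    """Turn GraphML nodes and edges into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter(GRAPHML_NS + "node"):
        name = data_value(node, "name")
        if name:
            lines.append('<%s> <%s> "%s" .' % (
                node_uri(node.get("id")), FOAF_NAME, escape_literal(name)))
    for edge in root.iter(GRAPHML_NS + "edge"):
        rel = data_value(edge, "rel")
        if rel:
            lines.append("<%s> <%s> <%s> ." % (
                node_uri(edge.get("source")), ARCH + rel,
                node_uri(edge.get("target"))))
    return lines


if __name__ == "__main__":
    print("\n".join(graphml_to_ntriples(SAMPLE)))
```

Writing N-Triples keeps the sketch dependency-free; the resulting lines could be loaded into rdflib or any other triple store afterwards.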

Once I had those triples (877,595 of them) I learned from Cory Harper that the SNAC folks had matched up the archival identities with entries in the Virtual International Authority File. The VIAF URLs aren’t present in their GraphML data (GraphML is not as expressive as RDF), but they are available in the prototype HTML, which I had URLs for. So, again, in the name of demonstration and not purity, I wrote a little scraper that would use the reverse-engineered SNAC URL to pull down the VIAF ID. I tried to be respectful and not do this scraping in parallel, and to sleep a bit between requests. After a few days of running I had 40,237 owl:sameAs assertions that linked the SNAC URLs with the VIAF URLs.
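The polite, one-at-a-time scraping described above could be sketched roughly like this. The regular expression for spotting a VIAF link is an assumption about the prototype's HTML, and the function names, the two-second default delay, and the injectable `fetch` and `sleep` parameters are hypothetical choices made so the loop can be exercised without touching the network.

```python
import re
import time
import urllib.request

# Assumed pattern for a VIAF link in the prototype HTML; the real markup may differ.
VIAF_LINK = re.compile(r"https?://viaf\.org/viaf/(\d+)")


def extract_viaf_id(html):
    """Return the first VIAF id found in the page, or None."""
    match = VIAF_LINK.search(html)
    return match.group(1) if match else None


def fetch_with_urllib(url):
    # Plain sequential HTTP GET; no parallelism on purpose.
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_viaf_ids(snac_urls, fetch=fetch_with_urllib, delay=2.0, sleep=time.sleep):
    """Fetch each SNAC page in turn, pausing between requests.

    Returns a dict mapping SNAC URL -> VIAF URI, i.e. the owl:sameAs pairs.
    """
    same_as = {}
    for i, url in enumerate(snac_urls):
        if i > 0:
            sleep(delay)  # be gentle: wait before every request after the first
        viaf_id = extract_viaf_id(fetch(url))
        if viaf_id is not None:
            # URI form follows the VIAF URIs used in the Turtle in this post.
            same_as[url] = "http://viaf.org/viaf/%s/#foaf:Person" % viaf_id
    return same_as
```

Passing a fake `fetch` (for example a dictionary's `get` method) makes it easy to check the parsing logic before pointing the loop at a live server.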

With the VIAF URLs in hand I thought it would be useful to have a graph of only the VIAF-related resources. It seemed like a VIAF-centered graph of archival information could demonstrate something we’ve been talking about some in the Library Linked Data W3C Incubator Group: that Linked Data actually provides a technology that lets the archival and bibliographic description communities cross-pollinate and share. This is the real insight of the SNAC effort: these communities have a lot in common, in that they both deal with people, places, organizations, etc. So I wrote another little script that created a named graph within the larger triple store, and used the owl:sameAs assertions to do some brute-force inferencing, to generate triples relating VIAF resources with Arch.
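The brute-force inferencing amounts to rewriting triples through the owl:sameAs mapping. A rough sketch of that idea, using plain Python tuples and sets in place of a named graph in a triple store, might look like the following. The rule that person-to-person relationships are kept only when both ends have a VIAF match is an assumption, not something the original script is known to do, and the `snac:` and `viaf:` strings are placeholder identifiers.

```python
ARCH = "http://purl.org/archival/vocab/arch#"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Predicates that link two identities; both ends must map to VIAF (assumed rule).
PERSON_TO_PERSON = {ARCH + "correspondedWith", ARCH + "appearsWith"}


def viaf_graph(triples, same_as):
    """Rewrite SNAC-centered triples onto VIAF URIs.

    triples: iterable of (subject, predicate, object) strings
    same_as: dict mapping SNAC URI -> VIAF URI
    """
    inferred = set()
    # Keep a link from each VIAF URI back to the SNAC page it came from.
    for snac_uri, viaf_uri in same_as.items():
        inferred.add((viaf_uri, OWL_SAME_AS, snac_uri))
    for s, p, o in triples:
        if s not in same_as:
            continue  # subject has no VIAF match, so it stays out of this graph
        if p in PERSON_TO_PERSON:
            if o not in same_as:
                continue  # other party has no VIAF match
            o = same_as[o]
        # Other objects (e.g. finding aid URLs) pass through unchanged.
        inferred.add((same_as[s], p, o))
    return inferred


# Placeholder data for illustration only.
SAME_AS = {"snac:a": "viaf:1", "snac:b": "viaf:2"}
TRIPLES = [
    ("snac:a", ARCH + "correspondedWith", "snac:b"),
    ("snac:a", ARCH + "correspondedWith", "snac:c"),
    ("snac:a", ARCH + "primaryProvenanceOf", "http://example.org/findaid/1"),
    ("snac:c", ARCH + "appearsWith", "snac:a"),
]

if __name__ == "__main__":
    for triple in sorted(viaf_graph(TRIPLES, SAME_AS)):
        print(triple)
```

A real reasoner would treat owl:sameAs as symmetric and transitive; this one-directional substitution is the "brute force" shortcut, and it is enough to produce a VIAF-only view of the data.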

I realize that Turtle probably isn’t the most compelling example of the result, but in the absence of an app that uses it (maybe more on that forthcoming), it’ll have to do for now. So here are the assertions for Vannevar Bush, for the Linked Data fetishists out there:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix arch: <http://purl.org/archival/vocab/arch#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

    a foaf:Person ;
    foaf:name "Bush, Vannevar, 1890-1974." ;
    arch:appearsWith <http://viaf.org/viaf/21341544/#foaf:Person>, 
        <http://viaf.org/viaf/79397853/#foaf:Person> ;
    arch:correspondedWith <http://viaf.org/viaf/13632081/#foaf:Person>,
        <http://viaf.org/viaf/92419478/#foaf:Person> ;
    arch:primaryProvenanceOf <http://hdl.loc.gov/loc.mss/eadmss.ms001043>, 
        <http://www.oac.cdlib.org/findaid/ark:/13030/kt8w1014rz> ;
    owl:sameAs <http://socialarchive.iath.virginia.edu/xtf/view?docId=Bush+Vannevar+1890-1974-cr.xml> .

I’ve made a full dump of the data I created available if you are interested in taking a look. The nice thing is that the URIs are already published on the web, so I didn’t need to mint any identifiers myself to publish this Linked Data. I did play a bit fast and loose with the SNAC URIs for people, though, since they don’t do the httpRange-14 dance. It’s interesting that it doesn’t seem to have immediately broken anything. It would be nice if the SNAC Prototype URIs were a bit cleaner, I guess. Perhaps they could use some kind of identifier instead of baking the heading into the URL?

So maybe I’ll have some time to build a simple app on top of this data. But hopefully I’ve at least communicated how useful it could be for the cultural heritage community to share web identifiers for people, and use them in their data. RDF also proved to be a nice malleable data model for expressing the relationships, and serializing them so that others could download them. Here’s to the emerging (hopefully) Giant Global GLAM Graph!