a look at who makes the LCNAF

As a follow-up to my last post about visualizing Library of Congress Name Authority File (LCNAF) records created by year, I decided to dig a little bit deeper to see how easy it would be to visualize how participating Name Authority Cooperative institutions have contributed to the LCNAF over time. This idea was mostly born out of spending the latter part of last week participating in a conversation about the need for a National Archival Authority Cooperative (NAAC) hosted at NARA. This blog post is one part nerdy technical notes on how I worked with the LCNAF Linked Data, and one part line charts showing who creates and modifies LCNAF records. It might’ve made more sense to start with the pretty charts, and then show you how I did it…but if the tech details don’t interest you, you can jump to the second half.

The Work

After a very helpful Twitter conversation with Kevin Ford I discovered that the Linked Data MADSRDF representation of the LCNAF includes assertions about the institution responsible for creating or revising a record. Here’s a snippet of Turtle for RDF that describes who created and modified the LCNAF record for J. K. Rowling (if your eyes glaze over when you see RDF, don’t worry, keep reading; it’s not essential that you understand this):

@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
@prefix ri: <http://id.loc.gov/ontologies/RecordInfo#> .

<http://id.loc.gov/authorities/names/n97108433>
    madsrdf:adminMetadata [
        ri:recordChangeDate "1997-10-28T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ],
    [
        ri:recordChangeDate "2011-08-25T06:29:06"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ] .

So I picked up an EC2 m1.large spot instance (7.5G of RAM, 2 virtual cores, 850G of storage) for a miserly $0.026/hour, installed 4store (which is a triplestore I’d heard good things about), and loaded the data.

% wget http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
% gunzip authoritiesnames.nt.madsrdf.gz
% sudo apt-get install 4store
% sudo mkdir /mnt/4store
% sudo chown fourstore:fourstore /mnt/4store
% sudo ln -s /mnt/4store /var/lib/4store
% sudo -u fourstore 4s-backend-setup lcnaf --segments 4
% sudo -u fourstore 4s-backend lcnaf
% sudo -u fourstore 4s-import --verbose lcnaf authoritiesnames.nt.madsrdf

I used 4 segments as a best guess to match the 4 EC2 compute units available to an m1.large. The only trouble was that after loading 90M of the 226M assertions the import slowed to a crawl because the instance's memory was nearly used up.

I thought briefly about upgrading to a larger instance…but it occurred to me that I actually didn’t need all the triples. I just needed the ones related to the record changes, and the organization that made them. So I filtered out just the assertions I needed. By the way, this is a really nice artifact of the ntriples data format: each triple sits on its own line, so it is very easy to munge with line oriented Unix utilities and scripting tools:

zcat authoritiesnames.nt.madsrdf.gz | egrep '(recordChangeDate)|(recordContentSource)|(recordStatus)'  > updates.nt
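If you would rather do this filtering from a script than from the shell, here is a minimal Python sketch of the same idea. It is not something I ran for this post; the function names (filter_triples, filter_dump) are made up for illustration, and it assumes, like the egrep above, that keeping any line that mentions one of the three property names is good enough.

```python
import gzip
import re
from typing import Iterable, Iterator

# Same three property names as the egrep pattern above.
WANTED = re.compile(r"recordChangeDate|recordContentSource|recordStatus")


def filter_triples(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the N-Triples lines that mention one of the wanted properties."""
    for line in lines:
        if WANTED.search(line):
            yield line


def filter_dump(src_path: str, dest_path: str) -> int:
    """Stream a gzipped N-Triples dump into dest_path; return the number of lines kept.

    The dump is read line by line, so it never has to fit in memory.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dest:
        for line in filter_triples(src):
            dest.write(line)
            kept += 1
    return kept
```

Calling filter_dump("authoritiesnames.nt.madsrdf.gz", "updates.nt") should leave you with roughly the same file as the one-liner. This only works because ntriples puts one complete triple on each line.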

This left me with 50,313,810 triples which loaded in about 20 minutes! With the database populated I was then able to execute the following query to fetch all the create dates with their institution code using 4s-query:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

This returned a tab delimited file that looked something like:

"1991-08-16T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/dlc>
"1995-01-07T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/djbf>
"2004-03-04T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/nic>

I then wrote a simplistic Python program (stats.py) to read in the TSV file and output a table of data where each row represented a year and the columns were the institution codes.
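To give a sense of what that step involves, here is a rough sketch of such a program. It is a reconstruction for illustration, not the actual stats.py: the function names are hypothetical, it assumes every result line looks like the three sample lines above (a typed dateTime literal, whitespace, then an organization URI), and it takes the last path segment of the URI as the institution code. Lines that don't match, such as a header row, are skipped.

```python
import csv
import re
from collections import Counter, defaultdict

# One line of query output: "YYYY-..."^^<datatype>  <organization URI>
LINE = re.compile(r'^"(\d{4})-[^"]*"\^\^<[^>]+>\s+<([^>]+)>\s*$')


def parse_line(line):
    """Return (year, institution_code) for one result line, or None if it doesn't match."""
    match = LINE.match(line)
    if match is None:
        return None
    year, uri = match.groups()
    # Assumption: the code is the last path segment of the organization URI.
    return int(year), uri.rstrip("/").rsplit("/", 1)[-1]


def tally(lines):
    """Count records per year and institution: {year: Counter({code: n})}."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            year, code = parsed
            counts[year][code] += 1
    return counts


def top_codes(counts, n=None):
    """Institution codes ordered by total records across all years, largest first."""
    totals = Counter()
    for per_year in counts.values():
        totals.update(per_year)
    return [code for code, _ in totals.most_common(n)]


def write_table(counts, out, n=None):
    """Write a CSV table: one row per year, one column per institution code."""
    codes = top_codes(counts, n)
    writer = csv.writer(out)
    writer.writerow(["year"] + codes)
    for year in sorted(counts):
        # A Counter returns 0 for institutions with no records that year.
        writer.writerow([year] + [counts[year][code] for code in codes])
```

Something like write_table(tally(sys.stdin), sys.stdout, n=25) (after importing sys) would read the query output on standard input and print a table limited to the 25 largest contributors.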

The Result

If you’d like to see the table you can check it out as a Google Fusion Table. If you are interested, you should be able to easily pull the data out into your own table, modify it, and visualize it. Google Fusion Tables can be rendered very easily in a variety of ways, including a line graph, which I’ve embedded here, just displaying the top 25 contributors:

While I didn’t quite expect to see LC tapering off the way it is, I did expect it to dominate the graph. Removing LC from the mix makes the graph a little bit more interesting. For example you can see the steady climb of the British Library, and the strong role that Princeton University plays:

Out of curiosity I then executed a SPARQL query for record updates (or revisions), repeated the step with stats.py, uploaded to Google Fusion Tables, and removed LC to better see trends in who is updating records:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

I definitely never understood what Twin Peaks was about, and I similarly don’t really know what the twin peaks in this graph signify (2000 and 2008). I guess these were years where there were a lot of coordinated edits? Perhaps some NACO folks who have been around for a few years may know the answer. You can also see in this graph that Princeton University plays a strong role in updating records as well as creating them.

So I’m not sure I understand the how/when/why of an NAAC any better, but I did learn:

EC2 is a big win for quick data munging projects like this. I spent $0.98 with the instance up and running for 3 days.
Filtering ntriples files to what you actually need prior to loading into a triplestore can save time and money.
Working with ntriples is still pretty esoteric, and the options out there for processing a dump of ntriples (or rdf/xml) of LCNAF’s size are truly slim. If I’m wrong about this I would like to be corrected.
Google Fusion Tables are a nice way to share data and charts.
It seems like while more LCNAF records are being created per year, they are being created by a broader base of institutions instead of just LC (who appear to be in decline). I think this is a good sign for NAAC.
Open Data, and Open Data Curators (thanks Kevin) are essential to open, collaborative enterprises.

Now I could’ve made some hideous mistakes here, so in the unlikely event you have the time and inclination I would be interested to hear if you can reproduce these results. If the results confirm or disagree with other views of LCNAF participation I would be interested to see them.