lcnaf unix hack

I was in a meeting today listening to a presentation about the Library of Congress Name Authority File (LCNAF), and I got it into my head to see if I could quickly graph record creation by year. Part of this might’ve been prompted by sitting next to Kevin Ford, who was multitasking by what looked like loading some MARC data into id.loc.gov. I imagine this isn’t perfect, but I thought it was kind of a fun hack that demonstrates what you can get away with on the command line with some open data:

  curl http://id.loc.gov/static/data/authoritiesnames.nt.skos.gz \
    | zcat - \
    | perl -ne '/terms\/created> "(\d{4})-\d{2}-\d{2}/; print "$1\n" if $1;' \
    | sort \
    | uniq -c \
    | perl -ne 'chomp; @cols = split / +/; print "$cols[2]\t$cols[1]\n";' \
    > lcnaf-years.tsv

Which yields a tab-delimited file where column 1 is the year and column 2 is the number of records created in that year. The key part is the Perl one-liner on line 3, which looks for assertions like this in the N-Triples RDF and pulls out the year:

<http://id.loc.gov/authorities/names/n90608287> <http://purl.org/dc/terms/created> "1990-02-05T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

The use of sort and uniq -c together is a handy trick my old boss Fred Lindberg taught me, for quickly generating aggregate counts from a stream of values. It works surprisingly well with quite large sets of values, because of all the work that has gone into making sort efficient.
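
As a rough illustration of the trick on a tiny stream, the sketch below counts a handful of made-up year values. The `count_values` function is a hypothetical helper, and it swaps the columns with awk rather than the Perl one-liner in the pipeline above, assuming the values themselves contain no whitespace:

```shell
# count_values (hypothetical helper): reads one value per line on stdin and
# prints "value<TAB>count" for each distinct value.
count_values() {
  # sort groups identical values together; uniq -c collapses each group into
  # "count value"; awk flips the two columns and joins them with a tab.
  sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Six made-up values standing in for the stream of years.
printf '%s\n' 1990 1985 1990 1990 1985 2001 | count_values
```

The sort has to come first because uniq only collapses adjacent duplicates.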

With the TSV in hand I trimmed the pre-1980 values. I think lots of records are attributed to 1980 because that’s when OPAC came online, and I wasn’t sure what the dribs and drabs prior to 1980 represented. Then I dropped the data into ye olde chart maker (in this case Google Docs) and voilà:

It would be more interesting to see the results broken out by contributing NACO institution, but I don’t think that data is in the various RDF representations. I don’t even know if the records contributed by other NACO institutions are included in the LCNAF. I imagine a similar graph is available somewhere else, but it was neat that the availability of the LCNAF data meant I could get a rough answer to this passing question fairly quickly.

The numbers add up to ~7.8 million, which seems within the realm of possible correctness. But if you notice something profoundly wrong with this display please let me know!
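
For anyone who wants to repeat the trimming and the sanity-check total on the command line, here is one way it could be sketched with awk. This is an assumption about how to do it, not a record of how it was actually done; the helper names `trim_from_year` and `sum_counts` are hypothetical, and the sample rows are toy numbers rather than real LCNAF counts:

```shell
# trim_from_year (hypothetical helper): keep rows whose first column (the
# year) is at least the given cutoff.
trim_from_year() {
  awk -F'\t' -v min="$1" '$1 >= min'
}

# sum_counts (hypothetical helper): add up the second column (the counts).
sum_counts() {
  awk -F'\t' '{ sum += $2 } END { print sum + 0 }'
}

# Toy rows in the same year<TAB>count shape as lcnaf-years.tsv (made-up data).
total=$(printf '1975\t3\n1980\t10\n1981\t4\n' | trim_from_year 1980 | sum_counts)

echo "$total"
```

Against the real file, something like `trim_from_year 1980 < lcnaf-years.tsv | sum_counts` would give the total for the trimmed data, and dropping the trim step would give the total for everything.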

lcnaf unix hack by Ed Summers, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 4.0 International License.