I was in a meeting today listening to a presentation about the Library of Congress Name Authority File and I got it into my head to see if I could quickly graph record creation by year. Part of this might’ve been prompted by sitting next to Kevin Ford, who was multi-tasking by what looked like loading some MARC data into id.loc.gov. I imagine this isn’t perfect, but I thought it was kind of fun hack that demonstrates what you can get away with on the command line with some open data:

  curl http://id.loc.gov/static/data/authoritiesnames.nt.skos.gz \
    | zcat - \
    | perl -ne '/terms\/created> "(\d{4})-\d{2}-\d{2}/; print "$1\n" if $1;' \
    | sort \
    | uniq -c \
    | perl -ne 'chomp; @cols = split / +/; print "$cols[2]\t$cols[1]\n";' \
    > lcnaf-years.tsv

Which yields a tab delimited file where column 1 is the year and column 2 is the number of records created in that year. The key part is the perl one-liner on line 3 which looks for assertions like this in the ntriples rdf, and pulls out the year:

<http://id.loc.gov/authorities/names/n90608287> <http://purl.org/dc/terms/created> "1990-02-05T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

The use of sort and uniq -c together is a handy trick my old boss Fred Lindberg taught me, for quickly generating aggregate counts from a stream of values. It works surprisingly well with quite large sets of values, because of all the work that has gone into making sort efficient.

WIth the tsv in hand I trimmed the pre-1980 values, since I think there are lots of records attributed to 1980 since that’s when OPAC came online, and I wasn’t sure what the dribs and drabs prior to 1980 represented. Then I dropped the data into ye olde chart maker (in this case GoogleDocs) and voilà:

It would be more interesting to see the results broken out by contributing NACO institution, but I don’t think that data is in the various RDF representations. I don’t even know if the records contributed by other NACO institutions are included in the LCNAF. I imagine a similar graph is available somewhere else, but it was neat that the availability of the LCNAF data meant I could get a rough answer to this passing question fairly quickly.

The numbers add up to ~7.8 million which seems within the realm of possibile correctness. But if you notice something profoundly wrong with this display please let me know!