lcnaf unix hack
I was in a meeting today listening to a presentation about the Library of Congress Name Authority File and I got it into my head to see if I could quickly graph record creation by year. Part of this might’ve been prompted by sitting next to Kevin Ford, who was multitasking by what looked like loading some MARC data into id.loc.gov. I imagine this isn’t perfect, but I thought it was kind of a fun hack that demonstrates what you can get away with on the command line with some open data:
curl http://id.loc.gov/static/data/authoritiesnames.nt.skos.gz \
| zcat - \
| perl -ne '/terms\/created> "(\d{4})-\d{2}-\d{2}/; print "$1\n" if $1;' \
| sort \
| uniq -c \
| perl -ne 'chomp; @cols = split / +/; print "$cols[2]\t$cols[1]\n";' \
> lcnaf-years.tsv
Which yields a tab-delimited file where column 1 is the year and column 2 is the number of records created in that year. The key part is the Perl one-liner on line 3, which looks for assertions like this in the N-Triples RDF and pulls out the year:
<http://id.loc.gov/authorities/names/n90608287> <http://purl.org/dc/terms/created> "1990-02-05T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
The use of sort and uniq -c together is a handy trick my old boss Fred Lindberg taught me, for quickly generating aggregate counts from a stream of values. It works surprisingly well with quite large sets of values, because of all the work that has gone into making sort efficient.
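The trick works because uniq -c only counts runs of adjacent identical lines, so the input has to be sorted first. Here is a small sketch on a handful of made-up year values. The helper name count_values is an assumption, and it uses awk to swap the columns where the pipeline above uses Perl; it also assumes the values contain no whitespace.

```shell
# count_values is a hypothetical helper: it reads one value per line on stdin
# and prints "value<TAB>count", ordered by value.
count_values() {
  # sort groups identical values together, uniq -c collapses each group to
  # "count value", and awk flips that to "value<TAB>count".
  sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up input: six years in no particular order.
printf '%s\n' 1990 1985 1990 1990 1985 2001 | count_values
```

Letting awk split on whitespace avoids having to account for the leading padding that uniq -c puts in front of each count.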
With the TSV in hand I trimmed the pre-1980 values: I think lots of records are attributed to 1980 because that’s when OPAC came online, and I wasn’t sure what the dribs and drabs prior to 1980 represented. Then I dropped the data into ye olde chart maker (in this case Google Docs) and voilà:
It would be more interesting to see the results broken out by contributing NACO institution, but I don’t think that data is in the various RDF representations. I don’t even know if the records contributed by other NACO institutions are included in the LCNAF. I imagine a similar graph is available somewhere else, but it was neat that the availability of the LCNAF data meant I could get a rough answer to this passing question fairly quickly.
The numbers add up to ~7.8 million, which seems within the realm of possible correctness. But if you notice something profoundly wrong with this display please let me know!
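For anyone who wants to repeat the trimming and the sanity-check total on the command line, here is a rough sketch. The helper names trim_before and sum_counts are made up, and the sample rows are invented counts purely to show the mechanics. They are not real LCNAF figures.

```shell
# trim_before is a hypothetical helper: it keeps rows of a year<TAB>count
# stream whose year is at or after the cutoff given as the first argument.
trim_before() {
  awk -F'\t' -v cutoff="$1" '$1 + 0 >= cutoff + 0'
}

# sum_counts is a hypothetical helper: it adds up column 2 of the stream.
sum_counts() {
  awk -F'\t' '{ total += $2 } END { print total + 0 }'
}

# Invented sample rows standing in for lcnaf-years.tsv.
sample=$(printf '1978\t12\n1979\t40\n1980\t900\n1981\t350\n')

# Rows from 1980 onward, then their total.
printf '%s\n' "$sample" | trim_before 1980
printf '%s\n' "$sample" | trim_before 1980 | sum_counts
```

Against the real file the equivalent would be something like trim_before 1980 < lcnaf-years.tsv | sum_counts.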