If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 
FROM data_fields
GROUP BY tag
ORDER BY tag_count DESC;

tag | tag_count
-----+-----------
035 | 558727
670 | 496600
040 | 379999
010 | 379999
953 | 369625
906 | 272196
550 | 232544
150 | 217556
450 | 211067
952 | 185012
151 | 158900
451 | 143538
781 | 122490
043 | 92656
053 | 92404
675 | 42496
551 | 24797
667 | 14434
985 | 13725
680 | 10342
681 | 8873
410 | 7103
360 | 4126
073 | 3540
180 | 3000
019 | 1832
678 | 1311
580 | 857
480 | 808
260 | 753
185 | 501
510 | 369
485 | 262
042 | 260
500 | 259
016 | 243
585 | 192
400 | 147
682 | 134
710 | 132
979 | 107
530 | 93
430 | 82
665 | 44
182 | 36
482 | 8
969 | 4
181 | 4
555 | 4
581 | 4
455 | 4
582 | 3
481 | 3
052 | 3
411 | 2
155 | 2
751 | 2
014 | 2
050 | 2
856 | 1

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count

 value | value_count 
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.