Archive for October, 2007

wdl peeps

Wednesday, October 31st, 2007

Speaking of smarties here’s a picture of some of the folks I was fortunate to work with on the recent WDL effort. From left to right: Dan Chudnov, Andy Boyko, Babak Hamidzadeh, Dave Hafken, myself, and Chris Thatcher. I feel really fortunate to be working with all of them. The best part is that these are just the folks that were involved with the WDL project–and there are a bunch more equally fun/talented people in our group that are working on other things. I can safely say that I haven’t worked with a group before that is as simultaneously top-notch and fun to work with.

…thanks to Michael Neubert for the snapshot taken outside the Adams building at the Library of Congress

tools

Thursday, October 18th, 2007


At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve seen dchud in top form managing complicated data/organizational workflows while making technical decisions. A nice outgrowth of working with smarties is ending up with a fun and productive technology stack: python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware. Given the press and the commitment to UNESCO I think the code is going to start being a bit more than a prototype pretty soon :-)

marcdb

Monday, October 1st, 2007

If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 
FROM data_fields 
GROUP BY tag 
ORDER BY tag_count DESC;

tag | tag_count 
-----+----------- 
035 |    558727
670 |    496600
040 |    379999
010 |    379999
953 |    369625
906 |    272196
550 |    232544
150 |    217556
450 |    211067 
952 |    185012 
151 |    158900 
451 |    143538 
781 |    122490 
043 |     92656 
053 |     92404 
675 |     42496 
551 |     24797 
667 |     14434 
985 |     13725 
680 |     10342 
681 |      8873 
410 |      7103 
360 |      4126 
073 |      3540 
180 |      3000 
019 |      1832 
678 |      1311 
580 |       857 
480 |       808 
260 |       753 
185 |       501 
510 |       369 
485 |       262 
042 |       260 
500 |       259 
016 |       243 
585 |       192 
400 |       147 
682 |       134 
710 |       132 
979 |       107 
530 |        93 
430 |        82 
665 |        44 
182 |        36 
482 |         8 
969 |         4 
181 |         4 
555 |         4 
581 |         4 
455 |         4 
582 |         3 
481 |         3 
052 |         3 
411 |         2 
155 |         2 
751 |         2 
014 |         2 
050 |         2 
856 |         1

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count

 value | value_count
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.