Archive for the ‘marc’ Category

baby steps at linking library data

Monday, May 5th, 2008

Alistair wanted to have some data to demonstrate the potential of linked library data, so I quickly converted 10K MARC records (using a slightly modified version of MARC21slim2RDFDC.xsl) and rewrote the subjects as lcsh.info URIs using a few lines of python…all a bit hackish, but it got this particular job done quickly.

The rewriting of subjects is basically a transformation of:

<http://lccn.loc.gov/00009010#manifestation>
  dc:creator "Rollo, David.";
  dc:date "c2000." ;
  dc:description "Includes bibliographical references (p. 173-223) and index." ;
  dc:identifier
     "URN:ISBN:0816635463 (alk. paper)",
     "URN:ISBN:0816635471 (pbk. : alk. paper)",
     "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
  dc:language "eng" ;
  dc:publisher "Minneapolis : University of Minnesota Press," ;
  dc:subject
    "Anglo-Norman literature",
    "Benoi?t, de Sainte-More, 12th cent.",
    "Latin prose literature, Medieval and modern",
    "Literacy",
    "Literature and history",
    "Magic in literature." ;
  dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
  dc:type "text" .

to:

<http://lccn.loc.gov/00009010#manifestation>
    dc:creator "Rollo, David." ;
    dc:date "c2000." ;
    dc:description "Includes bibliographical references (p. 173-223) and index." ;
    dc:identifier "URN:ISBN:0816635463 (alk. paper)", "URN:ISBN:0816635471 (pbk. : alk. paper)", "http://www.loc.gov/catdir/toc/fy032/00009010.html" ;
    dc:language "eng" ;
    dc:publisher "Minneapolis : University of Minnesota Press," ;
    dc:subject <http://lcsh.info/sh85005082#concept>,
      <http://lcsh.info/sh85077482#concept>,
      <http://lcsh.info/sh85077565#concept>,
      <http://lcsh.info/sh85079624#concept>,
      <http://lcsh.info/sh86008161#concept>,
      "Benoi?t, de Sainte-More, 12th cent." ;
    dc:title "Glamorous sorcery : magic and literacy in the High Middle Ages /" ;
    dc:type "text" .

Clearly there are lots of ways to improve even this simplified description: URIs for entries in the Name Authority File, referencing identifiers as resources rather than string literals (an artifact of the XSLT transform), removing ISBD punctuation, unicode normalization (&cough;), etc.
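The subject rewriting itself can be sketched roughly as follows. This is not the original script, just a minimal illustration: the lookup table, the `normalize` and `rewrite_subjects` helpers, and the placeholder identifiers are all hypothetical, and a real run would need a full label-to-URI mapping built from the LCSH data.

```python
# Hypothetical lookup from heading label to lcsh.info concept URI.
# The identifiers below are placeholders, not real LCSH identifiers.
LCSH_LOOKUP = {
    "Anglo-Norman literature": "http://lcsh.info/sh00000001#concept",
    "Literacy": "http://lcsh.info/sh00000002#concept",
}


def normalize(label):
    """Strip whitespace and a trailing ISBD full stop before the lookup."""
    return label.strip().rstrip(".").strip()


def rewrite_subjects(subjects, lookup=LCSH_LOOKUP):
    """Split subjects into (uris, literals).

    Headings found in the lookup become URIs; anything else is kept
    as the original string literal, as with the name heading above.
    """
    uris, literals = [], []
    for label in subjects:
        uri = lookup.get(normalize(label))
        if uri is None:
            literals.append(label)
        else:
            uris.append(uri)
    return uris, literals


uris, literals = rewrite_subjects(
    ["Anglo-Norman literature", "Literacy.", "Some unmatched heading"]
)
print(uris)
print(literals)
```

Headings that do not match (such as the personal name in the example record) simply stay as literals, which is why one string survives in the rewritten description.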

You may notice I kind of fudged the URI for the book itself using the LCCN service at LC: http://lccn.loc.gov/00009010#manifestation (which does resolve, but doesn’t serve up RDF yet). I’m no FRBR expert so I’m not sure if the use of “manifestation” in this hash URI makes sense. I just wanted to distinguish between the URI for the description, and the URI for the thing being described. I think it’s high time for me to understand FRBR a lot more.

If you prefer diagrams to turtle here is a graph visualization from the w3c rdf validator for the record.

literals and resources

Wednesday, March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seem to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data
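To make the literal/resource distinction concrete, here is a small sketch that writes the same statement both ways as N-Triples. The book URI is made up, and dc:creator is only a stand-in property for illustration; neither comes from the RDA discussion itself.

```python
# dc:creator is used as a stand-in property; the book URI is hypothetical.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
BOOK = "http://example.org/book/1#manifestation"


def literal_triple(subject, predicate, value):
    """Serialize a triple whose object is a plain string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


def resource_triple(subject, predicate, obj):
    """Serialize a triple whose object is a resource (a URI)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


as_literal = literal_triple(BOOK, DC_CREATOR, "Twain, Mark, 1835-1910")
as_resource = resource_triple(BOOK, DC_CREATOR, "http://lccn.loc.gov/n79021164")
print(as_literal)
print(as_resource)
```

The literal form is a dead end for a consumer, while the resource form gives them something they can (in principle) follow to learn more about the author.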

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PPS, no, that LCCN URL doesn’t currently resolve (it does for bibliographic records, but not for authority records) or return RDF (hopefully someday).)

pymarc PEP-8 cleanup

Thursday, February 28th, 2008

pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant.

If you are a current user of pymarc your code will most likely break, since methods like: addField() will now look like add_field(). This is a small price to pay for pythonistas who typically prefer clean, consistent and more coherent code (how’s that for alliteration?). It had to be done and I’m very grateful to gsf for taking the time to do it.
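If you have a lot of old code to update, a tiny helper like the one below can suggest the new spelling of a method name. It is only a sketch: it assumes every rename followed the plain camelCase to snake_case pattern seen with addField(), so check the pymarc source for the actual names before relying on it.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase name to snake_case, e.g. addField -> add_field."""
    # Insert an underscore before any capital that follows a lowercase
    # letter or digit, then lowercase the whole thing.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


print(camel_to_snake("addField"))
```

Names that are already snake_case pass through unchanged, so it is safe to run over a mixed list.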

Another big thing is that we’ve switched from using subversion to bzr for revision control. Initially it seemed like a lightweight way for gsf and me to collaborate without monkeying with svn authentication (again)…and to learn the zen of distributed revision control. We both liked it so much that Gabe is hosting the pymarc repository at http://fruct.us/bzr/pymarc.

So if you like the latest/greatest/shiniest, and/or want to contribute some of your own changes to pymarc:

  % bzr branch http://fruct.us/bzr/pymarc
  % # hack, hack, hack, hackety, hack
  % bzr commit
  % bzr send --mail-to gsf@fruct.us --message "Gabe, I added a jammies method to the record object!"
  % # or publish your own repo and point us at it :-)

metadata hackers

Monday, December 31st, 2007

I opened the paper this morning to read a story of another person involved in the creation of MARC who has just died. I hadn’t realized before reading Henrietta Avram and Samuel Snyder’s obituaries that there was a bit of an NSA-LC connection when MARC was being created.

From 1964 to 1966, [Samuel Snyder] was coordinator of the Library of Congress’s information systems office. He was among the creators of the library’s Machine Readable Cataloging system that replaced the handwritten card with an electronic searchable database system that became the standard worldwide.

I imagine NSA folks had a lot to do with early automation efforts in the federal government…but it’s still an interesting connection. One of my coworkers is reading up on this early history of MARC so this is for him in the unlikely event that he missed it…email would probably have worked better I guess, but I also wanted to pay tribute. Libraries wouldn’t be what they are today without this influential early work.

more marcdb

Monday, November 5th, 2007

This morning Clay and I were chatting about Library of Congress Subject Headings and SKOS a bit. At one point we found ourselves musing about how much reuse there is of topical subdivisions in topical headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spero’s authority data–so I fired up psql and wrote a query:

SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC;

And a few seconds later…

 subdivision                          | total
--------------------------------------+-------
 Law and legislation                  |  3342
 Religious aspects                    |  2500
 Buddhism, [Christianity, etc.]       |   898
 History                              |   847
 Equipment and supplies               |   571
 Taxation                             |   566
 Baptists, [Catholic Church, etc.]    |   476
 Diseases                             |   450
 Research                             |   422
 Campaigns                            |   378
 Awards                               |   342
 Finance                              |   284
 Study and teaching                   |   284
 Surgery                              |   275
 Employees                            |   269
 Spectra                              |   261
 Computer programs                    |   259
 Labor unions                         |   218
 Testing                              |   207
 Diagnosis                            |   194
 Isotopes                             |   190
 Complications                        |   183
 Physiological effect                 |   172
 Programming                          |   163

There’s nothin’ like the smell of strong set theory in the morning. Although something seems a bit fishy about [Christianity, etc.] and [Catholic Church, etc.]… If you want to try similar stuff and don’t want to wait hours for marcdb to import all the data and you use postgres, here’s the full database dump which you ought to be able to import:

  % createdb authorities
  % wget http://inkdroid.org/data/authorities.sql.bz2
  % bunzip2 authorities.sql.bz2
  % psql authorities < authorities.sql
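If you just want to see how the subdivision query behaves without loading the whole dump, here is a toy sketch using an in-memory SQLite database from Python. The table and column names are taken from the query above; the column types and the sample rows are made up for illustration and are not the real marcdb schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Minimal stand-in for the two tables the query touches.
conn.executescript("""
CREATE TABLE data_fields (id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE subfields (
    id INTEGER PRIMARY KEY,
    data_field_id INTEGER REFERENCES data_fields(id),
    code TEXT,
    value TEXT
);
""")

# Three topical headings (150) and one geographic heading (151).
conn.executemany(
    "INSERT INTO data_fields (id, tag) VALUES (?, ?)",
    [(1, "150"), (2, "150"), (3, "150"), (4, "151")],
)
conn.executemany(
    "INSERT INTO subfields (data_field_id, code, value) VALUES (?, ?, ?)",
    [
        (1, "a", "Heading one"), (1, "x", "History"),
        (2, "a", "Heading two"), (2, "x", "History"),
        (3, "a", "Heading three"), (3, "x", "Taxation"),
        (4, "a", "Some place"), (4, "x", "History"),
    ],
)

# Same query as above: count $x subdivisions on 150 fields only.
rows = conn.execute("""
SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC
""").fetchall()

for subdivision, total in rows:
    print(subdivision, total)
```

The $x on the 151 field is deliberately ignored, which is the point of the tag filter.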

marcdb

Monday, October 1st, 2007

If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 
FROM data_fields 
GROUP BY tag 
ORDER BY tag_count DESC;

tag | tag_count
----+----------
035 |    558727
670 |    496600
040 |    379999
010 |    379999
953 |    369625
906 |    272196
550 |    232544
150 |    217556
450 |    211067 
952 |    185012 
151 |    158900 
451 |    143538 
781 |    122490 
043 |     92656 
053 |     92404 
675 |     42496 
551 |     24797 
667 |     14434 
985 |     13725 
680 |     10342 
681 |      8873 
410 |      7103 
360 |      4126 
073 |      3540 
180 |      3000 
019 |      1832 
678 |      1311 
580 |       857 
480 |       808 
260 |       753 
185 |       501 
510 |       369 
485 |       262 
042 |       260 
500 |       259 
016 |       243 
585 |       192 
400 |       147 
682 |       134 
710 |       132 
979 |       107 
530 |        93 
430 |        82 
665 |        44 
182 |        36 
482 |         8 
969 |         4 
181 |         4 
555 |         4 
581 |         4 
455 |         4 
582 |         3 
481 |         3 
052 |         3 
411 |         2 
155 |         2 
751 |         2 
014 |         2 
050 |         2 
856 |         1

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND subfields.data_field_id = data_fields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count DESC;

 value | value_count
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.
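For reading values like the ones in that result, a small decoder for the first character of subfield $w can help. This is a partial sketch based on my reading of the MARC 21 authority format: only a handful of position 0 codes are listed, the later positions (as in 'nne') are ignored, and the function name is made up.

```python
# Partial table of $w position 0 codes in authority tracing fields.
# Codes not listed here are reported as unlisted rather than guessed.
W_POSITION_0 = {
    "a": "earlier heading",
    "b": "later heading",
    "g": "broader term",
    "h": "narrower term",
    "n": "not applicable",
}


def describe_w(value):
    """Describe the relationship coded in the first character of $w."""
    if not value:
        raise ValueError("empty $w value")
    return W_POSITION_0.get(value[0], "unlisted code")


for w in ["g", "nne", "a", "b"]:
    print(w, "->", describe_w(w))
```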

pymarc, marc8 and nothingness

Friday, July 20th, 2007

pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world it is inevitable that the MARC8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for various programming languages do not support it. So you need to know to use a specialized tool like marc4j, yaz, or MARC::Charset for converting from MARC8 into something useful like Unicode.

The MARC8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.

The exercise was instructive to me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stored the mapping in memory. Since python stores bytecode on disk after compiling there were some performance gains to be had over Perl–since Perl would compile the big mapping hash every time. But the main thing is that Aaron seemed to choose the simplest solution first–whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only MARC-8 to Unicode but Unicode back to MARC-8. In hindsight this was a mistake because going back to MARC-8 gets more insane as each day passes.

Aaron’s code as a result is much cleaner and easier to understand because, well, there’s less of it. I’m reading Beautiful Code at the moment and was just reading Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote” — which really crystallized things. Definitely check out Beautiful Code if you have a chance. Maybe the quiet books4code could revive to read it as a group?

MARC, Perl and Unicode

Thursday, May 5th, 2005

I’ve been doing some work for Texas A&M who need a MARC::Record module that is Unicode safe. Many ILS vendors are moving away from MARC-8 encoded records towards Unicode. No doubt this move is being spurred on by big players like OCLC who are moving (or have moved) their mammoth WorldCat database to Unicode.

At any rate Texas A&M have workflows that use MARC::Record for transforming records in their catalog and they need the Unicode support for their new Voyager system. Technically there were very few places where MARC::Record needed to be adjusted. The problem is that the antiquated transmission format for MARC records uses byte lengths in the so-called directory, as offsets into the record. MARC::Record uses length() and substr() to create and work with the directory…which works fine when 1 character equals 1 byte. However, a Unicode character encoded as UTF-8 can take multiple bytes…so the character-oriented length() will create faulty record directories, and substr() will extract data from the rest of the record incorrectly.
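The mismatch is easy to see in a few lines. The sketch below is in Python rather than Perl, and it is simplified: it builds a single 12-character directory entry (tag, 4-digit length, 5-digit start) for a bare string, ignoring indicators and subfield delimiters, and the function name is made up.

```python
FIELD_TERMINATOR_LENGTH = 1  # each field ends with a one-byte terminator


def directory_entry(tag, field_text, start, count_bytes=True):
    """Build a directory entry: 3-char tag, 4-digit length, 5-digit start."""
    if count_bytes:
        length = len(field_text.encode("utf-8"))  # what the format requires
    else:
        length = len(field_text)  # character count: wrong for non-ASCII
    length += FIELD_TERMINATOR_LENGTH
    return f"{tag}{length:04d}{start:05d}"


name = "Beno\u00eet"  # six characters, seven bytes in UTF-8

print(directory_entry("100", name, 0, count_bytes=True))
print(directory_entry("100", name, 0, count_bytes=False))
```

With the character count the entry comes up one byte short, so every field after it would be read from the wrong offset.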

Fortunately there is the bytes pragma which alters the behavior of various character oriented Perl functions. Unfortunately these functions were added to Perl relatively recently, so this new version of MARC::Record will require Perl >= v5.8.2. Technically it could run on 5.8.1, however I found that the 5.8.1 that ships with OS X 10.3 lacks bytes::substr(). Not only that but if you try to call a nonexistent function in the bytes namespace you’ll go into an infinite loop. This is the case even with Perl 5.8.6.

All in all I really have come to dislike Perl’s Unicode support. The magical utf8 flag on scalars has a tendency to pop on and off for obscure reasons. And I’ve found the behavior of bytes::length() to be a bit unpredictable. Surely this is because I don’t fully understand the mechanics involved, but judging from the traffic on perl-unicode I’m not the only one who has struggled with it. My experience using unicode in Java and Python has been much more pleasant, and really confirms my decision to move towards doing new work in these languages. Perl has served me well, and there are some things I really love about the language, but these nasty corners are a bit scary.