pymarc, marc8 and nothingness
pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world, it is inevitable that the MARC-8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for various programming languages do not support it. So you need to know to use a specialized tool like marc4j, yaz, or MARC::Charset to convert from MARC-8 into something useful like Unicode.
The MARC-8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.
The exercise was instructive to me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stores the mapping in memory. Since Python stores bytecode on disk after compiling, there were some performance gains to be had over Perl, which would compile the big mapping hash every time. But the main thing is that Aaron seemed to choose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only MARC-8 to Unicode but also Unicode back to MARC-8. In hindsight this was a mistake, because going back to MARC-8 gets more insane with each passing day.
Aaron’s code as a result is much cleaner and easier to understand because, well, there’s less of it. I’m reading Beautiful Code at the moment and was just reading Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote”, which really crystallized things. Definitely check out Beautiful Code if you have a chance. Maybe the quiet books4code could revive to read it as a group?
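To make the "just keep the table in memory" idea concrete, here is a minimal Python 3 sketch. It is not Aaron's code and not the table pymarc ships: `COMBINING` and `toy_marc8_to_unicode` are hypothetical names, the table holds only three ANSEL combining marks, and escape sequences for alternate character sets are deliberately not handled. It does show one wrinkle any converter has to deal with: MARC-8 puts a combining mark before its base letter, while Unicode puts it after.

```python
# Toy illustration only: three entries, NOT pymarc's real mapping table.
# Keys are MARC-8 byte values from the default extended Latin (ANSEL) set.
COMBINING = {
    0xE1: "\u0300",  # combining grave accent
    0xE2: "\u0301",  # combining acute accent
    0xE8: "\u0308",  # combining diaeresis
}

ESCAPE = 0x1B  # introduces a character-set switch; unsupported in this toy


def toy_marc8_to_unicode(data: bytes) -> str:
    """Translate a small subset of MARC-8 bytes to a Unicode string."""
    out = []
    pending = []  # combining marks seen but not yet attached to a base letter
    for byte in data:
        if byte in COMBINING:
            # MARC-8 order: the mark comes first, so hold on to it.
            pending.append(COMBINING[byte])
        elif byte == ESCAPE:
            raise ValueError("escape sequences are not handled in this sketch")
        elif byte < 0x80:
            # Plain ASCII base character: emit it, then any held marks.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError("byte 0x%02X is not in the toy table" % byte)
    out.extend(pending)  # marks with no following base letter are kept as-is
    return "".join(out)


print(toy_marc8_to_unicode(b"caf\xe2e"))
```

The whole lookup is one dict access per byte, with nothing to open or close, which is roughly the appeal of the in-memory approach. The result is in decomposed form (base letter followed by U+0301); `unicodedata.normalize("NFC", ...)` would compose it if that is what you need.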
July 20th, 2007 at 10:58 am
Great things, Ed (and thanks to Mark and Aaron, too). I svn’d and tried it out on a dump I had lying around of 105,000 of our records. Following the pymarc.__doc__, I did:
>>> reader = MARCReader('bibs105000.out')
>>> for record in reader:
...     print record['245']['a']
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 51, in next
ValueError: invalid literal for int(): bibs1
>>>
But I noticed “self.reader = pymarc.MARCReader(file('test/test.dat'))” in test/reader.py, so I:
>>> marc8_file = file('bibs105000.out')
>>> reader = MARCReader(marc8_file)
>>> for record in reader:
...     print record['245']['a']
...
Microeconomics :
The multilateral development banks :
The Accountants digest.
Achievement.
ALA bulletin.
Acta arithmetica.
Acta crystallographica.
ASLE transactions.
Acta mathematica.
Acta mechanica.
Acta physica Polonica,
Acta physica Austriaca /
Acta physica Polonica.
Acta polytechnica Scandinavica.
[etc. -- big old list of titles, streaming by, very fast]
So MARCReader expects a file object, not a filename. Does that doc string need updating or did I misread it?
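For anyone wondering where `bibs1` came from: the first five bytes of a MARC record are its length as ASCII digits, and `bibs1` is the first five characters of the filename, so the error is consistent with the filename itself being parsed as record data. The Python 3 sketch below shows that kind of framing against a file-like object. It is not pymarc's reader; `iter_raw_records` is a hypothetical helper, and the sample bytes are fake records with no real leader or directory.

```python
import io


def iter_raw_records(handle):
    """Yield raw records from a binary file-like object.

    Each record starts with a 5-digit ASCII length that counts the
    whole record, including those 5 bytes.
    """
    while True:
        length_field = handle.read(5)
        if not length_field:
            return  # clean end of file
        if len(length_field) < 5:
            raise ValueError("truncated record length: %r" % length_field)
        # Raises ValueError if the bytes are not digits, e.g. b"bibs1".
        length = int(length_field.decode("ascii"))
        rest = handle.read(length - 5)
        if len(rest) < length - 5:
            raise ValueError("record shorter than its declared length")
        yield length_field + rest


# Two fake "records": length prefix, payload, record terminator 0x1D.
sample = io.BytesIO(b"00010abcd\x1d00008ab\x1d")
for raw in iter_raw_records(sample):
    print(len(raw), raw)
```

With a real file you would pass an object opened in binary mode, such as `open('bibs105000.out', 'rb')`, rather than the filename string.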
I then tried to test marc8_to_unicode:
>>> utf8_file = marc8_to_unicode(marc8_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 8, in marc8_to_unicode
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 43, in translate
TypeError: len() of unsized object
>>>
Looking at test/marc8.py I saw marc8_to_unicode expects a string, so I:
>>> marc8_file = file('bibs105000.out')
>>> marc8_file_str = marc8_file.read()
>>> utf8_file_str = marc8_to_unicode(marc8_file_str)
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 221 221
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 220 220
couldn't find 66 69 221 221
>>>
This took a long time, running at about 98% of my cpu and 20% of my ~4GB of memory. So I timed it (only 3 times because, you know, I’m impatient):
>>> t = Timer(stmt='utf8_file_str = marc8_to_unicode(marc8_file_str)', setup='marc8_file = file("bibs105000.out"); marc8_file_str = marc8_file.read(); from pymarc import marc8_to_unicode')
>>> t.timeit(3)
[a bunch of the "couldn't find 66 69 22[01] 22[01]", three times over]
786.2618350982666
So it took a little while. How big is the file?
>>> len(marc8_file_str)
93747869
>>> len(utf8_file_str)
87474412
Hmm. Those should match, right? Well, let’s see if we can read the new file:
>>> pymarc_utf8_file = file('bibs105000_pymarc_utf8.dat', 'w')
>>> pymarc_utf8_file.write(utf8_file_str.encode('utf8'))
>>> reader = MARCReader(file('bibs105000_pymarc_utf8.dat'))
>>> for record in reader:
...     print record['245']['a']
...
None
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 54, in next
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 46, in __init__
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 123, in decodeMARC
pymarc.exceptions.BaseAddressInvalid: Base address exceeds size of record
>>>
Have I done something screwy, or is it something in the file? I can get the file to you if you want to test on your system. I also have a utf8 file to compare the output against, produced by `yaz-marcdump -f MARC-8 -t UTF-8 -I bibs105000.out > bibs105000_utf8.out`, and of course the diff is a mile long. If I get time later I’ll test line by line, as done in test/marc8.py.
Apologies for the long long comment — should be on some mailing list or something.
July 20th, 2007 at 3:13 pm
[...] I’m no expert, but I’m glad that I could help bring pymarc up to version 1.0 and that I’ve had a chance to begin enjoy programming again. I’m also glad to see that Catalogablog has spread the word. Download a copy and start hacking; maybe you’ll be rewarded with rediscovering the joy of code like I was. [...]
January 30th, 2008 at 8:13 pm
Wow, sorry for the delay gabe, this slipped under my radar. 1.7 was just released with some fixes you recently sent me in a patch.
In addition, I documented the marc8_to_unicode function more thoroughly, so it is hopefully clear that you aren’t supposed to pass in a serialized MARC record, just a chunk of text extracted from the record that you’d like to translate.
Thanks for the info and the patch!
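Here is a small Python 3 sketch of why translating a whole serialized record in one go tends to end badly. A MARC record carries its own byte counts, and the same text usually has a different byte length in MARC-8 and UTF-8, so a converted record stops agreeing with what it says about itself. The framing is a simplified stand-in (length prefix, payload, terminator), not a real leader and directory, and `frame` and `declared_length` are hypothetical helpers, not pymarc functions.

```python
RECORD_TERMINATOR = b"\x1d"


def frame(payload: bytes) -> bytes:
    """Wrap a payload in a 5-digit length prefix and a record terminator."""
    total = 5 + len(payload) + len(RECORD_TERMINATOR)
    return ("%05d" % total).encode("ascii") + payload + RECORD_TERMINATOR


def declared_length(record: bytes) -> int:
    """Read back the length the record claims to have."""
    return int(record[:5].decode("ascii"))


# The same text in both encodings: MARC-8 puts the acute accent (0xE2)
# before the 'e'; UTF-8 encodes 'e' followed by U+0301 (two bytes).
marc8_payload = b"caf\xe2e"
utf8_payload = "cafe\u0301".encode("utf-8")

original = frame(marc8_payload)

# Naive whole-record conversion: the text changes, the stored length does not.
naive = original[:5] + utf8_payload + RECORD_TERMINATOR

# Translating only the text and re-serializing recomputes the length.
rebuilt = frame(utf8_payload)

print(declared_length(original), len(original))  # 11 11
print(declared_length(naive), len(naive))        # 11 12
print(declared_length(rebuilt), len(rebuilt))    # 12 12
```

In a real record the directory offsets and base address go stale in the same way, which fits the advice above to translate one extracted chunk of text at a time.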