pymarc, marc8 and nothingness
pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today's networked world, it is inevitable that the MARC-8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for various programming languages do not support it, so you need to know to reach for a specialized tool like marc4j, yaz, or MARC::Charset to convert from MARC-8 into something useful like Unicode.

The MARC-8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress, we were able to find and fix a few glitches.

The exercise was instructive to me because of my previous experience working on the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron's code simply stores the mapping in memory. Since Python caches compiled bytecode on disk, there were some performance gains to be had over Perl, which would compile the big mapping hash every time. But the main thing is that Aaron chose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only from MARC-8 to Unicode but from Unicode back to MARC-8. In hindsight this was a mistake, because going back to MARC-8 looks more insane with each passing day. As a result, Aaron's code is much cleaner and easier to understand because, well, there's less of it.

I'm reading Beautiful Code at the moment and was just reading Jon Bentley's chapter "The Most Beautiful Code I Never Wrote", which really crystallized things. Definitely check out Beautiful Code if you have a chance.
Maybe the quiet books4code could revive to read it as a group?
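As a postscript, here's a toy sketch of the in-memory-mapping approach discussed above, and of one reason MARC-8 conversion is trickier than a straight byte-for-byte table lookup: MARC-8 puts combining diacritics *before* the base character, while Unicode combining marks follow it. This is not pymarc's actual code, and the three ANSEL byte values used here (0xE1 grave, 0xE2 acute, 0xE8 diaeresis) are assumptions for illustration:

```python
# Toy slice of a MARC-8 (ANSEL extended) mapping table, held in a
# plain in-memory dict -- the "simplest solution first" approach.
# Byte values are illustrative, not a complete or verified table.
COMBINING = {0xE1: "\u0300", 0xE2: "\u0301", 0xE8: "\u0308"}

def marc8_to_unicode_sketch(data: bytes) -> str:
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in COMBINING:
            # MARC-8: diacritic comes first, so hold it until we see
            # the base character it modifies.
            pending.append(COMBINING[byte])
        elif 0x20 <= byte <= 0x7E:
            # MARC-8 basic Latin is ASCII-compatible.
            out.append(chr(byte))
            out.extend(pending)  # Unicode: marks follow the base
            pending = []
        else:
            raise ValueError(f"unmapped MARC-8 byte: {byte:#x}")
    return "".join(out)

# MARC-8 bytes for "déjà": d, <acute>e, j, <grave>a
print(marc8_to_unicode_sketch(b"d\xe2ej\xe1a"))  # -> 'de\u0301ja\u0300'
```

A real converter also has to handle escape sequences that switch between graphic character sets (Greek, Cyrillic, East Asian), which is where most of the remaining complexity lives.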