pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world, it is inevitable that the MARC-8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character-set tools for most programming languages do not support it, so you need a specialized tool like marc4j, yaz, or MARC::Charset to convert MARC-8 into something useful like Unicode.

The MARC-8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.

The exercise was instructive for me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not holding the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stores the mapping in memory. Since Python caches compiled bytecode on disk, there were some performance gains to be had over Perl, which had to compile the big mapping hash on every run. But the main thing is that Aaron chose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only from MARC-8 to Unicode but from Unicode back to MARC-8. In hindsight this was a mistake, because going back to MARC-8 gets more insane with each passing day. As a result Aaron’s code is much cleaner and easier to understand because, well, there’s less of it.

I’m reading Beautiful Code at the moment and had just read Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote”, which really crystallized things. Definitely check out Beautiful Code if you have a chance.
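To give a flavor of why a dedicated converter is needed: in MARC-8, a combining diacritic byte comes *before* the letter it modifies, while Unicode combining marks come *after* the base character, so a converter has to remap byte values and reorder them. Here is a toy sketch of that idea — not pymarc’s actual marc8_to_unicode() implementation, and the two-entry diacritic table is just an illustrative subset of the real mapping tables:

```python
# Toy MARC-8 -> Unicode sketch. Real converters (pymarc's marc8_to_unicode,
# MARC::Charset, yaz) use complete mapping tables; this handles only ASCII
# plus two combining diacritics, to show the remapping and reordering.

# Illustrative subset of the MARC-8 combining-diacritic table
# (assumption: just these two entries; the real table is much larger).
MARC8_COMBINING = {
    0xE2: "\u0301",  # MARC-8 acute accent  -> COMBINING ACUTE ACCENT
    0xE8: "\u0308",  # MARC-8 umlaut        -> COMBINING DIAERESIS
}

def toy_marc8_to_unicode(data: bytes) -> str:
    out = []
    pending = []  # diacritics seen before their base character
    for byte in data:
        if byte in MARC8_COMBINING:
            # MARC-8 puts the diacritic *before* the letter it modifies
            pending.append(MARC8_COMBINING[byte])
        else:
            # Unicode puts combining marks *after* the base character
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
    return "".join(out)

# In MARC-8 the acute-accent byte precedes the 'e' it modifies:
print(toy_marc8_to_unicode(b"P\xe2erez"))  # -> 'Pérez' ('e' + U+0301)
```

With pymarc itself you would just call marc8_to_unicode() on the raw field bytes instead of rolling your own.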
Maybe the quiet books4code could revive to read it as a group?