Archive for the ‘perl’ Category

the weight of legacy data

Sunday, May 20th, 2007

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8 and have used MARC::Charset >= v0.8 to do it, you may very well have null characters (0x00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.
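To make the failure mode concrete, here is a minimal sketch in Python (not the Perl inside MARC::Charset). The table layout, the function names, and the assumption that an empty UCS value ends up being read as zero are all hypothetical and only for illustration. The alternate code points are simply the Unicode values of the two combining characters named above.

```python
import unicodedata

# Hypothetical, simplified stand-in for two mapping-table rows: the primary
# UCS value is empty and only an alternate value is present.
MAPPING = {
    "DOUBLE TILDE, SECOND HALF": {"ucs": "", "alt": "FE23"},
    "LIGATURE, SECOND HALF": {"ucs": "", "alt": "FE21"},
}


def naive_lookup(name):
    """Read only the UCS column; an empty value silently becomes U+0000."""
    entry = MAPPING[name]
    return chr(int(entry["ucs"] or "0", 16))


def lookup_with_fallback(name):
    """Prefer the UCS value, but use the alternate value when UCS is empty."""
    entry = MAPPING[name]
    return chr(int(entry["ucs"] or entry["alt"], 16))


for name in MAPPING:
    bad = naive_lookup(name)
    good = lookup_with_fallback(name)
    print(f"{name}: naive={bad!r} fallback={unicodedata.name(good)}")
```

The naive lookup is one plausible way a null character could slip into converted output. The fallback version returns the combining half mark instead.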

v0.97 now uses the alternate value when the UCS value is not available…which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies, and hear my plea for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news but it feels good to be saying it. If you’re not convinced, read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried/true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data-sharing libraries, but it’s time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.

Update to SRU and CQL::Parser

Wednesday, August 10th, 2005

If you are tracking it, you might be interested to know that Brian Cassidy added a Catalyst plugin to the SRU CPAN module. Catalyst is an MVC framework that is getting quite a bit of mindshare in the Perl community (at least the small subset I hang out with in #code4lib). And if that wasn’t enough, Brian also committed some changes to CQL::Parser that provide toLucene() functionality for converting CQL queries into queries that can be passed off to Lucene. Thanks Brian!

Net::OAI::Harvester v1.0

Wednesday, July 27th, 2005

I got an email from Thorsten Schwander at LANL about a bug in Net::OAI::Harvester when using a custom metadata handler with the auto-resumption token handling code. This was the first I’d heard about anyone using the custom metadata handling feature in N:O:H, so I was pleased to hear about it. Thorsten was kind enough to send a patch, so a new version is on its way around the CPAN mirrors. While it’s hardly a major change, this is bumping the version from 0.991 to 1.0. It’s been over 2 years since N:O:H was first released, and it’s been pretty stable for the past year.

CPAN Module Wins Library Award

Wednesday, May 11th, 2005

Jane Jacobs (the brains behind MARC::Detrans) let me know that the CPAN module won the NYLink Achievement Award for “Innovation in Technology” (search for Detrans, it’s a big page). Good going Jane, Elizabeth and Stuart!

MARC, Perl and Unicode

Thursday, May 5th, 2005

I’ve been doing some work for Texas A&M who need a MARC::Record module that is Unicode safe. Many ILS vendors are moving away from MARC-8 encoded records towards Unicode. No doubt this move is being spurred on by big players like OCLC who are moving (or have moved) their mammoth WorldCat database to Unicode.

At any rate, Texas A&M have workflows that use MARC::Record for transforming records in their catalog, and they need the Unicode support for their new Voyager system. Technically there were very few places where MARC::Record needed to be adjusted. The problem is that the antiquated transmission format for MARC records uses byte lengths in the so-called directory as offsets into the record. MARC::Record uses length() and substr() to create and work with the directory…which works fine when 1 character equals 1 byte. However, a Unicode character encoded as UTF-8 can take multiple bytes…so the character-oriented length() will create faulty record directories, and substr() will extract data from the rest of the record incorrectly.
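The character-versus-byte mismatch is easy to see in a small sketch. This one is in Python rather than Perl, purely for illustration, and it is not MARC::Record's code. The helper name and the two sample fields are hypothetical, and the fields are simplified (no indicators or subfield delimiters). The directory entry layout of a 3-character tag, a 4-digit field length and a 5-digit starting position follows the MARC transmission format.

```python
FIELD_TERMINATOR = "\x1e"  # every variable field ends with this one-byte terminator


def build_directory(fields, use_bytes):
    """Build directory entries: tag (3), field length (4), start offset (5)."""
    entries = []
    offset = 0
    for tag, data in fields:
        data += FIELD_TERMINATOR
        # The format wants byte counts; counting characters is the bug.
        length = len(data.encode("utf-8")) if use_bytes else len(data)
        entries.append(f"{tag}{length:04d}{offset:05d}")
        offset += length
    return entries


# "\u00eb" is e-with-diaeresis: 1 character, but 2 bytes in UTF-8.
fields = [("100", "Bront\u00eb, Emily"), ("245", "Wuthering Heights")]

char_directory = build_directory(fields, use_bytes=False)
byte_directory = build_directory(fields, use_bytes=True)

print(char_directory)  # ['100001400000', '245001800014']
print(byte_directory)  # ['100001500000', '245001800015']
```

With character counts, the first field is recorded as one byte too short, so every later field's starting position is off by one. Each additional multi-byte character pushes the offsets further out.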

Fortunately there is the bytes pragma, which alters the behavior of various character-oriented Perl functions. Unfortunately the functions in the bytes namespace were added to Perl relatively recently, so this new version of MARC::Record will require Perl >= v5.8.2. Technically it could run on 5.8.1; however, I found that the 5.8.1 that ships with OS X 10.3 lacks bytes::substr(). Not only that, but if you try to call a nonexistent function in the bytes namespace you’ll go into an infinite loop. This is the case even with Perl 5.8.6.

All in all I really have come to dislike Perl’s Unicode support. The magical utf8 flag on scalars has a tendency to pop on and off for obscure reasons. And I’ve found the behavior of bytes::length() to be a bit unpredictable. Surely this is because I don’t fully understand the mechanics involved, but judging from the traffic on perl-unicode I’m not the only one who has struggled with it. My experience using Unicode in Java and Python has been much more pleasant, and really confirms my decision to move towards doing new work in these languages. Perl has served me well, and there are some things I really love about the language, but these nasty corners are a bit scary.