MARC, Perl and Unicode

I’ve been doing some work for Texas A&M who need a MARC::Record module that is Unicode safe. Many ILS vendors are moving away from MARC-8 encoded records towards Unicode. No doubt this move is being spurred on by big players like OCLC who are moving (or have moved) their mammoth WorldCat database to Unicode.

At any rate Texas A&M have workflows that use MARC::Record for transforming records in their catalog and they need the Unicode support for their new Voyager system. Technically there were very few places where MARC::Record needed to be adjusted. The problem is that the antiquated transmission format for MARC records uses byte lengths in the so called directory, as offsets into the record. MARC::Record uses length() and substr() to create and work with the directory…which works fine when 1 character equals 1 byte. However, Unicode characters can have multiple bytes per character…so the character oriented length() will create faulty record directories, and substr() will extract data from the rest of the record incorrectly.

Fortunately there is the bytes pragma which alters the behavior of various character oriented Perl functions. Unfortunately these functions were added to Perl relatively recently, so this new version of MARC::Record will require Perl >= v5.8.2. Technically it could run on 5.8.1, however I found that the 5.8.1 that ships with OS X 10.3 lacks the bytes::substr(). Not only that but if you try to call a non existent function in the bytes namespace you’ll go into an infinite loop. This is even the case with Perl 5.8.6 as well.

All in all I really have come to dislike Perl’s Unicode support. The magical utf8 flag on scalars has a tendency to pop on and off for obscure reasons. And I’ve found the behavior of bytes::length() to be a bit unpredictable. Surely this is because I don’t fully understand the mechanics involved, but judging from the traffic on perl-unicode I’m not the only one who has struggled with it. My experience using unicode in Java and Python has been much more pleasant, and really confirms my decision to move towards doing new work in these languages. Perl has served me well, and there are some things I really love about the language, but these nasty corners are a bit scary.


name authority fun

As a joke dchud suggested that panizzi (the friendly neighborhood bot in #code4lib) should have a plugin for querying the Library of Congress Name Authority File that OCLC provides. The Name Authority File allows librarians the world over to use the same established names when cataloging books, etc. It would serve no purpose in irc, but it could be a good conversation piece…

I had goofed around writing a command line app about half a year ago so I figured it couldn’t be that hard to hack this into the infobot source code. However I guessed wrong…granted I only tried for 30 minutes or so.

Fortunately, python’s supybot was a different story. It’s more modern, has command line programs for configuring a supybot, has built in support for plugins – and has documentation. There is even a command line program supybot-newplugin that will ask a few questions and then autogenerate a template plugin module. All you have to do after that is add a method (with a particular signature given in the docs) which will then do the work and respond.


HTML/HTTP

RFC 2397 has been around since August 1998 and I’m just learning about the data URL scheme today. Perhaps browser support for it is new? Basically data URLs allow you to embed data, like images directly in an HTML page. Data URLs remind me of Fred Lindberg’s old idea (circa 2001) of “mailing the web” by freezing web pages as email with MIME attachments.

It’s fun to be learning new things about HTML/HTTP: technologies that I thought I was familiar with already. Perhaps I’ve been out of web development for long enough to fall behind. The other day I learned about iframes from my friend and sometime coworker Jason and was similarly blown away by something new under the sun. iframes are esentially the same things as regular frames but for the browser user they don’t see separate panes. Useful for scrolling panels inside of pages and other things I’m sure.

I guess this is all part of the web renaissance that is going on now, spurred on by Google’s forays and investment in javascript and xml. It’s really interesting to see how a big player like Google can redefine what is acceptable technology to rely on in web applications. For years I’ve avoided doing too much in javascript since it was a headache to get it working across different browsers, at least for this programmer. Now javascript is on my list of things to learn more about.


Hello WordPress

Hello WordPress, bye bye custom blog code written in Perl. Well the old code is still running, but I’ve wanted to install WordPress for the past few months and finally got around to it this weekend. I had a little bit of trouble getting PHP installed, only because I decided to use the older php4 with the latest mysql, and php4 didn’t seem to want to configure itself using the latest mysql. Fortunately using php5 was a different story and WordPress was a breeze to install.

My reasons for switching from my homegrown code to WordPress are several.

  • there was really no way of commenting on stories, only adding them.
  • the old code didn’t really archive or categorize stories the way I wanted to
  • links to stories didn’t work, and I wanted to join dan’s Planet #code4lib.
  • I didn’t use the RSS aggregation features I wrote since I started using Bloglines.
  • I’ve been coding more in Python these days and don’t feel particularly tied to my Perl code base any longer. WordPress is PHP, which I’m not a huge fan of, but I think this had more to do with the PHP that I was exposed to more than the language itself. Installing WordPress and the various plugins like the audioscrobbler one you see to the right was very pleasant.
  • the WordPress community is extremely rich. I spent some time with Kesa looking at different themes, but in the end decided to stay with the default for now. There are tons of neat plugins to look at.

So what you can expect here is more of the same. I’m going to try to write more about my work as a programmer, mainly as a journal for myself to keep track of what I’m working on, where I’ve been, and where I’d like to go. Perhaps you are thinking spare me the details, where are the pictures of Chloe?! If this is the case you should see a link to the photos over on the right.