Archive for the ‘programming’ Category

sicp reading

Wednesday, January 23rd, 2008

If you’ve ever harbored any interest in reading (or re-reading) Structure and Interpretation of Computer Programs, please consider joining some of the books4code folks as we work through the (free) SICP course on MIT OpenCourseWare. Chris McAvoy has set up a wiki-page with details, and a calendar to subscribe to, to keep us honest. The book is available for free, and so are video lectures, notes, exercise answers, etc. Thanks, Jason, for getting us to take this up again :-)

metadata hackers

Monday, December 31st, 2007

I opened the paper this morning to read a story of another person involved in the creation of MARC who has just died. I hadn’t realized before reading Henrietta Avram’s and Samuel Snyder’s obituaries that there was a bit of an NSA-LC connection when MARC was being created.

From 1964 to 1966, [Samuel Snyder] was coordinator of the Library of Congress’s information systems office. He was among the creators of the library’s Machine Readable Cataloging system that replaced the handwritten card with an electronic searchable database system that became the standard worldwide.

I imagine NSA folks had a lot to do with early automation efforts in the federal government…but it’s still an interesting connection. One of my coworkers is reading up on this early history of MARC so this is for him in the unlikely event that he missed it…email would probably have worked better I guess, but I also wanted to pay tribute. Libraries wouldn’t be what they are today without this influential early work.

tools

Thursday, October 18th, 2007


At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve seen dchud in top form managing complicated data/organizational workflows while making technical decisions. A nice outgrowth of working with smarties is ending up with a fun and productive technology stack: python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware. Given the press and the commitment to UNESCO I think the code is going to start being a bit more than a prototype pretty soon :-)

pymarc, marc8 and nothingness

Friday, July 20th, 2007

pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world it is inevitable that the MARC8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for various programming languages do not support it, so you need to know to use a specialized tool like marc4j, yaz or MARC::Charset to convert MARC8 into something useful like Unicode.

The MARC8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.

The exercise was instructive to me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stores the mapping in memory. Since Python stores bytecode on disk after compiling, there were some performance gains to be had over Perl, which would compile the big mapping hash every time. But the main thing is that Aaron chose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only MARC-8 to Unicode but also Unicode back to MARC-8. In hindsight this was a mistake, because going back to MARC-8 gets more insane with each passing day.
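To make the “just keep the table in memory” idea concrete, here is a minimal sketch. It is not Aaron’s code or pymarc’s: the function name toy_marc8_to_unicode is made up for illustration, the table covers only a handful of ANSEL combining marks, and escape sequences for alternate character sets are ignored entirely. The one wrinkle it does show is that MARC-8 puts a diacritic before its base letter, while Unicode puts the combining mark after it.

```python
import unicodedata

# Toy subset of the MARC-8 (ANSEL) combining marks. The real mapping table
# is far larger and spans several character sets.
COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis
}


def toy_marc8_to_unicode(data: bytes) -> str:
    """Decode ASCII plus a few ANSEL diacritics (hypothetical helper)."""
    out = []
    pending = []  # diacritics seen before their base letter
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)  # Unicode wants marks after the base letter
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this toy table")
    if pending:
        raise ValueError("diacritic with no base letter")
    # Compose base letter + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


decoded = toy_marc8_to_unicode(b"Su\xe2arez Lynch, B.")
print(decoded == "Su\u00e1rez Lynch, B.")
```

The whole mapping is a plain dict literal, so there is nothing to open, cache or tear down, which is most of why this approach ends up with less code.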

Aaron’s code as a result is much cleaner and easier to understand because, well, there’s less of it. I’m reading Beautiful Code at the moment and was just reading Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote” — which really crystallized things. Definitely check out Beautiful Code if you have a chance. Maybe the quiet books4code could revive to read it as a group?

app and repositories

Monday, July 16th, 2007

Pete Johnston blogged recently about a very nice use of the Atom Publishing Protocol (APP) to provide digital library repository functionality. The project is supported by UKOLN at the University of Bath and is called Simple Web-service Offering Repository Deposit (SWORD).

If you are interested in digital repositories and web services take a look at their APP profile. It’s a great example of how APP encourages the use of the Atom XML format and RESTful practices, which can then be extended to suit the particular needs of a community of practice.

To understand APP you really only need to grok a handful of concepts from the data model and REST. The data model is basically made up of a service document, which describes a set of collections, each of which aggregates member entries, which can in turn point to a media entry. All of these types of resources are identified with URLs. Since they are URLs you can interact with the objects with plain old HTTP, just like your web browser does. For example you can list the entries in a collection by issuing a GET to the collection URL. Or you can create a member resource by doing a POST to the collection URL. Similarly you can delete a member entry by issuing a DELETE to the member entry URL. The full details are available in the latest draft of the RFC, and also in a wide variety of articles including this one.
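As a rough sketch of how little machinery that involves, the three interactions can be written as plain HTTP requests with nothing but the Python standard library. The URLs and the abbreviated entry below are made up for illustration, and the requests are only constructed, not sent.

```python
from urllib.request import Request

# Hypothetical URLs; a real server advertises its collections in the
# service document.
COLLECTION = "http://example.org/app/collection"
MEMBER = COLLECTION + "/entry-1"

# Abbreviated Atom entry, just enough to have a request body.
ENTRY = b"""<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>An example entry</title>
</entry>"""

# List the entries in a collection.
list_entries = Request(COLLECTION, method="GET")

# Create a new member by POSTing an entry to the collection.
create_member = Request(
    COLLECTION,
    data=ENTRY,
    headers={"Content-Type": "application/atom+xml;type=entry"},
    method="POST",
)

# Delete a member by its own URL.
delete_member = Request(MEMBER, method="DELETE")

for req in (list_entries, create_member, delete_member):
    print(req.get_method(), req.full_url)
```

Passing any of these to urllib.request.urlopen() would actually send it.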

So to perform a SWORD deposit a program would have to:

  1. get the service document for the repository (GET http://www.myrepository.ac.uk/app/servicedocument)
  2. see what collections it can add objects to
  3. create some IMS, METS or DIDL metadata to describe your repository object and ZIP it up with any of the object’s datastreams
  4. POST the zip file to the appropriate collection URL with the appropriate X-Format-Namespace to identify the format of the submitted object
  5. check that you got a 201 Created status code and record the Location of the newly created resource
  6. profit!
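The steps above can be sketched offline in Python. Everything specific here is an assumption made for illustration: the sample service document, the collection path, the mets.xml file name, and the use of the METS namespace as the X-Format-Namespace value. The XML namespaces are the ones later published in RFC 5023, which may differ from the draft the profile was written against. Nothing is sent over the network; the POST is only constructed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from urllib.request import Request

NS = {"app": "http://www.w3.org/2007/app", "atom": "http://www.w3.org/2005/Atom"}

# Hypothetical service document, standing in for the response to step 1.
SERVICE_DOC = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>My Repository</atom:title>
    <collection href="http://www.myrepository.ac.uk/app/theses">
      <atom:title>Theses</atom:title>
      <accept>application/zip</accept>
    </collection>
  </workspace>
</service>"""


def list_collections(service_xml):
    """Step 2: map collection titles to their URLs."""
    root = ET.fromstring(service_xml)
    return {
        c.findtext("atom:title", namespaces=NS): c.get("href")
        for c in root.iterfind("app:workspace/app:collection", NS)
    }


def build_package(metadata_xml, datastreams):
    """Step 3: zip the metadata together with the object's datastreams."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mets.xml", metadata_xml)
        for name, payload in datastreams.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def build_deposit(collection_url, package, format_ns):
    """Step 4: the POST request, constructed but not sent."""
    return Request(
        collection_url,
        data=package,
        headers={"Content-Type": "application/zip", "X-Format-Namespace": format_ns},
        method="POST",
    )


def deposit_location(status, headers):
    """Step 5: return the new resource's URL, or fail if nothing was created."""
    if status != 201:
        raise RuntimeError(f"deposit failed with HTTP {status}")
    return headers["Location"]


collections = list_collections(SERVICE_DOC)
package = build_package("<mets xmlns='http://www.loc.gov/METS/'/>", {"thesis.pdf": b"%PDF-"})
request = build_deposit(collections["Theses"], package, "http://www.loc.gov/METS/")
print(request.get_method(), request.full_url)
```

With a live repository, the status and headers passed to deposit_location() would come from the response returned by urllib.request.urlopen(request).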

1 and 2 are perhaps not even necessary if the URL for the target collection is already known. Some notable things about the SWORD profile of APP:

  • two levels of conformance (one really minimalistic one)
  • the idea that collections imply particular treatments or workflows associated with how the object is ingested
  • service documents dynamically change to describe only the collections that a particular user can see
  • no ability to edit resources
  • no ability to delete resources
  • no ability to list collections
  • repository objects are POSTed as ZIP files to collections
  • HTTP Basic Authentication + TLS for security
  • the use of DublinCore to describe collections and their respective policies.
  • collections can support mediated deposit which means deposits can include the X-On-Behalf-Of HTTP header to identify the user to create the resource for.
  • the use of X-Format-Namespace HTTP header to explicitly identify the format of the submission package that is zipped up: for example IMS, METS or DIDL.
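To see how the authentication and header items in that list might fit together on a single request, here is a small sketch of the headers a depositing client could send. The function name, the user names and the namespace value are placeholders, not anything defined by the SWORD profile; only the header names come from the list above.

```python
import base64


def deposit_headers(username, password, format_ns, on_behalf_of=None):
    """Build headers for a (possibly mediated) deposit; hypothetical helper."""
    # HTTP Basic Authentication: base64 of "user:password". This is only
    # safe over TLS, since the encoding is trivially reversible.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/zip",
        "X-Format-Namespace": format_ns,
    }
    # Mediated deposit: the authenticated user deposits for someone else.
    if on_behalf_of is not None:
        headers["X-On-Behalf-Of"] = on_behalf_of
    return headers


direct = deposit_headers("depositor", "secret", "http://www.loc.gov/METS/")
mediated = deposit_headers("depositor", "secret", "http://www.loc.gov/METS/", on_behalf_of="author42")
print(sorted(mediated))
```

In this sketch the authenticated account and the X-On-Behalf-Of user are deliberately different people.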

While I understand why update and delete would be disabled for deposited packages I don’t really understand why the listing of collections would be disabled. An atom feed for a collection would essentially enable harvesting of a repository, much like ListRecords in OAI-PMH.

I’m not quite sure I completely understand X-On-Behalf-Of and sword:mediation either. I could understand X-On-Behalf-Of in an environment where there is no authentication. But if a user is authenticated couldn’t their username be used to identify who is doing the deposit? Perhaps there are cases (as the doc suggests) where a deposit is done for another user?

All in all this is really wonderful work. Of particular value for me was seeing the list of SWORD extensions and also the use of HTTP status codes. If I have the time I’d like to throw together a sample repository server and client to see just how easy it is to implement SWORD. I did try some experiments along these lines for my presentation back in February…but they never got as well defined as SWORD.

theory

Wednesday, March 28th, 2007

The second book I checked out of the Library of Congress with my shiny new borrowing card was Alistair Cockburn’s Agile Software Development: The Cooperative Game (which happened to just win this year’s Jolt Award). Early on Cockburn recommends jumping to an appendix to read Peter Naur’s article “Programming as Theory Building” (thanks ksclarke).

This is my second time reading the article, but this time it is really resonating with me–the idea of writing programs as building theories. Partly I think this is because I was reading it while I attended a recent Haskell tutorial by coworker Adam Turoff here in DC (which I will write about shortly).

On the ride to work this morning a particular quote stood out, and I’m just writing it here so I don’t forget it:

… the problems of program modification arise from acting on the assumption that programming consists of program text production, instead of recognizing programming as an activity of theory building.

It seems obvious at first I guess. But it’s a powerful statement about what the activity of software development ought to be–instead of a string of hacks that eventually brings a piece of software to its knees.

identifiers and authority records

Saturday, January 6th, 2007

Authority files are rather important for unambiguously talking about a person, place or thing. In database lingo they essentially amount to a primary key for a table. Given the time and effort libraries spend in maintaining authority records and assigning control numbers to individuals it makes sense that a URI could be assigned to an individual in such authority files. I realize this idea is nothing new, but until recently I hadn’t seen it put into practice particularly well.
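The primary key analogy can be made literal with a toy database. This is only a sketch: the table and column names are invented, and the control number, heading and PURL prefix are borrowed from the Borges example further down in this post.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce REFERENCES
db.executescript("""
    CREATE TABLE authority (
        lccn    TEXT PRIMARY KEY,   -- the control number identifies the person
        heading TEXT NOT NULL       -- the established form of the name
    );
    CREATE TABLE work (
        title        TEXT NOT NULL,
        creator_lccn TEXT NOT NULL REFERENCES authority(lccn)
    );
""")
db.execute("INSERT INTO authority VALUES (?, ?)", ("n79-7035", "Borges, Jorge Luis, 1899-"))
db.executemany(
    "INSERT INTO work VALUES (?, ?)",
    [("Ficciones", "n79-7035"), ("El Aleph", "n79-7035")],
)


def authority_uri(lccn):
    """Turn the key into a URI, using the PURL pattern shown later in the post."""
    return "http://errol.oclc.org/laf/" + lccn


rows = db.execute(
    "SELECT w.title, a.heading FROM work w "
    "JOIN authority a ON a.lccn = w.creator_lccn ORDER BY w.title"
).fetchall()
print(rows)
print(authority_uri("n79-7035"))
```

Works point at the key rather than repeating the name string, so changing the heading in one place changes it everywhere; a URI extends the same trick beyond a single database.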

I imagine this has been there all along but I just noticed that OCLC’s Linked Authority File includes PURLs for authors now. For example the following URL contains a LCCN:

http://errol.oclc.org/laf/n79-7035

When you GET this your browser is automatically redirected with an HTTP 302 to:

http://alcme.oclc.org/laf/servlet/OAIHandler?
verb=GetRecord&metadataPrefix=oai_dc&identifier=n79-7035

which you’ll notice is an OAI-PMH request to fetch a DublinCore record with the identifier n79-7035:

<oai_dc:dc
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
    http://openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    Borges, Jorge Luis,--1899-
  </dc:creator>
  <dc:description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    Suárez Lynch, B.--nnnc
  </dc:description>
</oai_dc:dc>
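Pulling the heading out of a record like this takes only a few lines of Python. The sketch below parses a trimmed copy of the record above rather than fetching it, and the helper name dc_values is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced tag as "{namespace}localname".
DC = "{http://purl.org/dc/elements/1.1/}"

# Trimmed copy of the record shown above (creator only).
RECORD = """<oai_dc:dc
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>
    Borges, Jorge Luis,--1899-
  </dc:creator>
</oai_dc:dc>"""


def dc_values(record_xml, element):
    """Return the stripped text of every dc:<element> in the record."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.iter(DC + element) if el.text]


print(dc_values(RECORD, "creator"))
```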

So now we know who this identifier is for, and the established heading for the individual. But it gets better (or worse depending on your perspective). Since this is an OAI-PMH server you can issue a ListMetadataFormats request to see what other flavors this record might be available in. If you do you’ll find out that this record is also available as marcxml in all its unholy glory (if you follow that link your browser will use a stylesheet to turn the raw xml into something a bit more presentable). Putting aside my snideness about MARC for a moment, this is a lot of useful data being made available.
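OAI-PMH requests are just a base URL plus a verb and a few arguments, so building them is a one-liner. In this sketch the base URL comes from the redirect shown earlier, the helper name oai_request is made up, and passing identifier to ListMetadataFormats uses the optional argument the protocol allows for asking about a single record.

```python
from urllib.parse import urlencode

BASE = "http://alcme.oclc.org/laf/servlet/OAIHandler"


def oai_request(verb, **params):
    """Build an OAI-PMH request URL (hypothetical helper, sends nothing)."""
    return BASE + "?" + urlencode({"verb": verb, **params})


get_record = oai_request("GetRecord", metadataPrefix="oai_dc", identifier="n79-7035")
list_formats = oai_request("ListMetadataFormats", identifier="n79-7035")

print(get_record)
print(list_formats)
```

Swapping the metadataPrefix for one of the prefixes reported by ListMetadataFormats is how you would ask for the MARC flavor of the same record.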

You can also search the name authority file and get relevant PURLs via a SOAP/REST service. For example the irc bot panizzi in #code4lib actually has a bit of logic that allows it to do lookups in the linked authority file:

06:56 < edsu> @naf borges, jorge
06:56 < panizzi> edsu: [20 matches] [~1] Borges, Jorge Luis, 1899-
                 <http://errol.oclc.org/laf/n79-7035>; [~2] Macedo, Jorge
                 Borges de. <http://errol.oclc.org/laf/n82-149895>; [~3]
                 Borges, Jorge G. (Jorge Guillermo), 1874-1938
                 <http://errol.oclc.org/laf/n90-681877>; [~4] Sua?rez Lynch, B.
                 <http://errol.oclc.org/laf/n82-21644>; [~5] Borges, Jorge
                 Wheliton Miranda <http://errol.oclc.org/laf/n92-76758>; [~6]
                 Canido Borges, Jorge Oscar (3 more messages)

All in all it’s an impressive mix of technology, standards and practice. It is not entirely clear to me how this work relates to the Virtual International Authority File. Perhaps LAF wasn’t considered a good acronym? If you are interested in such things Thom Hickey had a really interesting talk at Access2006 which has audio available.

DemoCampDC

Friday, January 5th, 2007

DemoCampDC is an adaptation of BarCamp to provide an informal mechanism for sharing technology shtuff in the DC area. If you are interested and in the DC area please add your name to the list of attendees and stay tuned.

#9

Thursday, January 4th, 2007

22:01 < edsu> i would try to separate them now before it’s
      too late :)
22:02 < erikhatcher> it’s never too late, but i certainly want
      to keep this clean from the start

New Year’s Resolution #9 - never underestimate the power of a positive attitude…

evergreen

Friday, December 22nd, 2006


In case you missed it linux.com is running an article by Michael Stutz on Evergreen, an open source integrated library system developed by the state of Georgia to support a consortium of 44 different libraries. (Thanks for the link Adam)

Hanging out with miker_ and bradl in irc and having open-ils in my feed reader makes me take this sort of work for granted sometimes…and Michael’s article made me wake up and marvel at how truly remarkable the work they’ve done is.

The evergreen folks are hosting this year’s code4libcon where I’m supposed to be doing a presentation on the Atom Publishing Protocol. It’s a low cost/pragmatic alternative to the usual library technology conference options, and will be a good opportunity to buy these Evergreeners a beer. I hope to see you there.