Archive for the ‘opensource’ Category

open standards

Friday, September 15th, 2006

Folks who are interested in libraries and technology are often drawn to the issue of open standards. Using open standards is very important to libraries for a variety of reasons that Ed Corrado summarizes nicely.

This week my podcast reader picked up an excellent interview with Danese Cooper of the Open Source Initiative where she talks about the Open Standards Requirement (OSR), which was introduced a few months ago. It provides a new perspective on the same issue from outside of the library community.

Essentially the OSR amounts to 5 guidelines for identifying a truly open standard. These guidelines are different though because they focus on what makes a standard open for an implementor. Whether the standard was created by an open process or not is really outside of scope. The important thing is how easy it is for a software developer to write software that uses the standard. A nice feature of the OSR is that the guidelines would fit on an index card. Here’s my regurgitation of them:

  1. The spec can’t omit details needed for implementation
  2. The standard needs to be freely/publicly available
  3. All patents involved in the spec need to be royalty free
  4. Clicking through a license agreement is not necessary
  5. The spec can’t be dependent on a standard that is not open as well

Danese was quick to point out that these are simply guidelines and not rules. For example Unicode fails on 2. since you have to pay for a copy of the spec. But in this case printing the standard is a publishing feat–given all the glyphs and their number. It’s not unusual that the book would cost money. So this guideline could be waived if the OSI folks agreed.

Rather than the OSI going and applying these rules to all known standards the idea is that standards bodies could claim self-compliance–and as developers implement the standard the compliance will be ascertained.

The guidelines themselves and the process are still being fine tuned/hammered on–and they are looking for volunteers…

set your data free … with unapi

Monday, August 28th, 2006

Dan, Jeremy, Peter, Michael, Mike, Ross and I wrote an article in the latest Ariadne introducing the lightweight web protocol unAPI. Essentially unAPI is an easy way to include references to digital objects in your HTML which can then be predictably retrieved by a machine…yes ‘machine’ includes JavaScript running in a browser :-) Dan and a really nice cross section of developers around the world have been working on this spec for over a year now and I think it could be poised to play an important role in the emerging open data movement.

Imagine you have a citation database which is searchable via the web. The search results include hits. Wouldn’t it be nice to align your human viewable results with machine readable representations so that people could write browser hacks and the like to remix your application data?

As far as I can tell there are a few options available to help you do this (apart from doing something ad-hoc).

  1. use a citation microformat and mark up your HTML predictably so that it can be recognized and parsed
  2. use GRDDL to map your HTML to RDF via an XSLT profile.
  3. embed RDF in your HTML essentially using an RDF microformat.
  4. use OpenURL and/or COinS to link in-page IDs to OpenURL servers.
  5. use unAPI and include a unapi server url (familiar autodiscovery like RSS/Atom), and identifiers (simple element attributes) and write a simple server side script that emits xml for a given identifier.
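To make option 5 a bit more concrete, here is a minimal Python sketch of the client side, based on my reading of the unAPI spec: find the autodiscovery link (rel="unapi-server"), collect identifiers from abbr elements with class "unapi-id", and build the three kinds of request URL (list all formats, list formats for one id, fetch one id in one format). The sample page, the example.org URLs, the identifier, and the names UnapiScanner and unapi_url are all made up for illustration; treat it as a sketch under those assumptions rather than a reference client.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class UnapiScanner(HTMLParser):
    """Collect the unAPI server URL and the identifiers found in a page."""

    def __init__(self):
        super().__init__()
        self.server = None
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").split()
        classes = (attrs.get("class") or "").split()
        if tag == "link" and "unapi-server" in rels:
            # Autodiscovery, in the same spirit as RSS/Atom feed links.
            self.server = attrs.get("href")
        elif tag == "abbr" and "unapi-id" in classes:
            # The identifier is carried in the title attribute.
            self.ids.append(attrs.get("title"))


def unapi_url(server, identifier=None, fmt=None):
    """Build a request URL: no arguments, id only, or id plus format."""
    params = {}
    if identifier is not None:
        params["id"] = identifier
        if fmt is not None:
            params["format"] = fmt
    return server + ("?" + urlencode(params) if params else "")


# Hypothetical page: one server link and one identified record.
PAGE = """
<html><head>
<link rel="unapi-server" type="application/xml" title="unAPI"
      href="http://example.org/unapi">
</head><body>
<p>A search hit <abbr class="unapi-id" title="urn:example:record:1"></abbr></p>
</body></html>
"""

scanner = UnapiScanner()
scanner.feed(PAGE)
for identifier in scanner.ids:
    # First ask which formats exist for the record, then ask for one of them.
    print(unapi_url(scanner.server, identifier))
    print(unapi_url(scanner.server, identifier, "oai_dc"))
```

The server side script is then just something that answers those three URL shapes with XML; "oai_dc" here is only an example format name that a server might advertise.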

I like microformats a lot and I think a citation format will eventually get done. But it’s been a long time coming and there’s no indication it’s going to get done any time soon. What’s more unAPI is bigger than just citation data–and it allows you to publish all kinds of rich data objects without waiting for a community to ratify a particular representation in HTML.

Options 2 and 3 use RDF which I actually like quite a bit as well. GRDDL implies a GRDDL aware browser which would be cool but is a bit heavyweight. XSLT will require clean XHTML–or pipelines to clean it. Embedding RDF in HTML using microformat techniques is compelling because you can theoretically process the RDF data similarly–whereas unAPI doesn’t require any particular kind of machine readable format (apart from HTML). Actually there’s nothing stopping you from using unAPI to link human viewable objects with RDF representations. The advantage unAPI has here is you can learn RDF if you want to, but you don’t have to learn RDF to get going with unAPI today.

Option 4 leverages work done in the library community on citation linking. OpenURL routers are widely deployed in libraries around the world and COinS is a quasi-microformat for putting OpenURL context objects into your HTML so that they can be extracted and fired off at an OpenURL server. OpenURL is a relatively complex and subtle standard which can do a lot more than just citation linking. Compared to OpenURL/COinS unAPI allows for ease of implementation in languages like JavaScript and provides a simple introspection mechanism for discovering what formats a particular resource is available in. AFAIK this can’t be done simply using OpenURL/COinS. If I’m wrong, comments should be open. I would argue that the sheer power and flexibility of OpenURL paradoxically make it hard to understand…and that unAPI, thanks to Dan’s adherence to a one-page spec, is more limited and simpler. Less is more…
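For comparison, the COinS half of option 4 can be sketched in a few lines of Python as well. A COinS is an empty span with class "Z3988" whose title attribute holds a URL-encoded OpenURL context object; a client pulls the title out and appends it to the base URL of whatever resolver it wants to use. The snippet, the resolver address and the names CoinsScanner and resolver_link below are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsScanner(HTMLParser):
    """Collect OpenURL context objects from COinS spans in a page."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # HTMLParser has already turned &amp; back into & here.
            self.context_objects.append(attrs.get("title") or "")


def resolver_link(resolver_base, context_object):
    """Fire a context object at a resolver by using it as the query string."""
    return resolver_base + "?" + context_object


# Hypothetical markup describing a made-up journal article.
SNIPPET = (
    '<span class="Z3988" title="ctx_ver=Z39.88-2004'
    '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal'
    '&amp;rft.atitle=An+example+article'
    '&amp;rft.jtitle=Example+Journal"></span>'
)

scanner = CoinsScanner()
scanner.feed(SNIPPET)
for ctx in scanner.context_objects:
    print(parse_qs(ctx))  # the citation as key/value pairs
    print(resolver_link("http://resolver.example.org/openurl", ctx))
```

Note that the span describes the citation itself and nothing in it advertises which other representations of the record exist, which is the introspection gap mentioned above (as far as I can tell).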

So if this piques your interest, read the article. It does a much better job of describing the origins of the work and where it’s headed, and it has examples and links out to sites/tools that use unAPI today. I must admit I wrote very little of the article, and mostly contributed text snippets and screenshots of the unAPI validator I wrote, which uses my unapi ruby gem.

code4libcon 2007

Sunday, July 2nd, 2006

Here’s to making sure that code4libcon 2007 is a watershed moment for women library technologists.

code4libcon 2006 in Corvallis wasn’t all male, but it was largely…and I can only remember two women speaking to the audience. To a large extent code4libcon was modeled after technology conferences like yapc, pycon, oscon, barcamp, etc–which have much the same sort of ratio. But libraries are different because the majority of people who work in libraries are women. So it was a bit surprising that more women didn’t end up at code4libcon 2006.

2006 did get organized practically overnight with a very small (male) clique in an irc room (that’s not always well behaved, but means well–hey it’s IRC). When people actually started signing up and sending in papers to the more formal discussion list I think we were all kind of surprised. I seriously thought we were just going to be hanging out in some random space with free wifi, and it turned into this really successful event.

Some folks like Dan Chudnov, Art Rhyno, Jeremy Frumkin and Roy Tennant started thinking and talking early about making the conference appeal to women library technologists. But it seems that the voting (open to all, but the voters were all men for some reason) somehow subconsciously counteracted this.

AFAIK the keynote voting is still going on, and I imagine you can still suggest speakers. There will only be more voting to do as we get into selecting presenters. If you’d like to participate just email Brad LaJeunesse and he’ll hook you up with a backpack login. Also, sign up for the code4lib and code4libcon discussion lists. Luckily Dorothea Salo is involved and vocal and I’m hoping that other women technologists will get involved too. This is a grassroots thing after all, not some sort of LITA top-tech trends panel. It’ll become whatever we want it to be.

professionalism in the age of discontent

Friday, May 19th, 2006

After seeing him speak and meeting him a couple times I’m a big fan of Adrian’s work. He was one of the first people to “mash up” google maps at chicagocrime.org; has set the bar for local online media content at lawrence.com; created the Congressional Votes Database at the washingtonpost which allows you to (among other things) get an RSS feed for your representative’s votes; and has created probably the most popular web framework for python.

But the thing that really impresses me the most about him is how he mixes the role of technologist and journalist. If you are curious take a look at the commencement speech he just gave at his alma mater, University of Missouri’s School of Journalism. Now if you work in/for libraries/archives (which is likely given this blog’s focus) just substitute ‘libraries’ for ‘journalism’ as you read the piece. You may be surprised to learn that the field of Journalism finds itself in much the same dire straits that Librarianship is in:

Then there’s this whole Internet thing — which is clearly evil. Some guy in San Francisco runs a Web site, Craigslist, that lets anybody post a classified ad for free — completely bypassing the newspaper classifieds and, therefore, chipping away at one of newspapers’ most important sources of revenue. Why would I post a classified ad in a newspaper, which charges me money for a tiny ad in which I’m forced to use funky abbreviations just to fit within the word limit, when I can post a free ad to Craigslist, with no space limitation and the ability to post photos, maps and links? Google lets anybody place an ad on search results. Why would I, the consumer, place an ad on TV, radio or in a newspaper, if I can do the same on Google for less money and arguably more reach?

Ahem, Google Scholar or Amazon anyone?

The foundation that you got here is important because it will guide you for the rest of your journalism career. It’s important because, no matter what you do in this industry, it all comes back to that foundation. No matter how the industry changes, no matter how your jobs may change, it all comes back to the core journalism values you’ve learned here at Missouri.

But, most of all, the foundation is important because you need to understand the rules before you can break them. And now, more than ever, this industry needs to break some rules.

You’re going to be the people breaking the rules. You’re going to be the people inventing new ones. You’ll be the person who says, “Hey, let’s try this new way of getting our journalism out to the public.” You’ll be the PR person who says, “Let’s try this new way of public relations that takes advantage of the Internet.” You’ll be the photographer who says, “Wow, quite a few amateur photographers are posting their photos online. Let’s try to incorporate that into our journalism somehow.”

So think about how exciting that is. Rarely is an entire industry in a position such that it needs to completely reinvent itself.

What are the rules of the library profession that we need to break? In my conversations with fellow library technologists we often talk about how the profession needs to be advanced, as if we are uniquely affected by the massive changes in media/information in the last 10 years. I think we should draw some comfort from the fact that we’re not the only ones dealing with this new terrain–as we kick ourselves in the pants. Perhaps some new professions are being born out of this melange.

Adrian is the type of professional I’d like to be, that’s for sure.

reading 2.0

Monday, March 20th, 2006

Reading 2.0 slipped under my radar, but I guess that was the idea: to let people from O’Reilly, Los Alamos National Labs, OCLC, The Internet Archive, Adobe, Yahoo, Harvard and Elsevier hobnob away from prying eyes. I haven’t seen any audio/video for the event but Tim O’Reilly has a nice fly on the wall summary of what went on.

It’s refreshing to see library technologies/concepts such as OpenURL, COinS, OAI-PMH, FRBR, METS and Dublin Core starting to be talked about in the context of a larger information environment. For example I had no idea that Yahoo is harvesting data from the Internet Archive using the OAI-PMH protocol. And I didn’t know Yahoo is starting to leverage microformats, but should’ve guessed considering the recent news about Flickr starting to use hCard.

All in all these are exciting “lowercase semantic web” times we’re living in. And it’s interesting to watch some of the things people you know have worked on starting to catch on. Hopefully Reading 2.0 was just the start of this ongoing collaboration. Case in point, I just heard Robert Sanderson say in #code4lib that he’s visiting the a9 folks to talk about opensearch and sru. This is just the sort of cross-fertilization we need going on in library land.

(py)?lucene 1.9

Thursday, March 2nd, 2006

So on March 1st lucene v1.9 was released and the *next day* pylucene v1.9 was released. Nice work!

I guess there are a bunch of methods that are deprecated in 1.9 which will disappear entirely in v2.0. Now would be a good time to update usage…

a new type of journal

Wednesday, February 22nd, 2006

In the unlikely event that you haven’t seen it there is an interesting thread over on the code4lib discussion list about establishing a code4lib journal. I think Mark Jordan has the right idea:

…would creating a section at http://code4lib.org/ that was reserved for formal, maybe even peer-reviewed articles do what you’re describing? The articles would be the starting point, but the Web 1.9-compliant features that are already appearing on the site (comments, attachments, microformat links, etc.) may satisfy what you’re describing. Heck, maybe we could write a module for http://code4lib.org/ that would pull some of these things together (drupal already has a publishing module). In other words, http://code4lib.org/ could _be_ the journal but it could be a new type of journal.

Dan followed up with a +1 and I think he is right. The drupal instance running on code4lib.org was thrown together at the last minute and rejiggered by lots of people to serve as a place to put conference information. I’ve been wondering what might be in the cards for the site as we move post-conference and I think this “new kind of journal” idea might be where it can go. While there are lots of people with administrator access via the web, there aren’t many people with shell access. I’d like to get to where mjordan and others can have shell access (if they want it) so that we can make hardcore changes if necessary. Perhaps we just need another plugin, and we can go to town…or as Ross says:

It still sounds like there’d still need to be /a/ process (and we need to work that out), but the overhead is very low.

And I like that.

I like it too.

code4lib days 2-3

Monday, February 20th, 2006

So I didn’t have time to journal about the 2nd and 3rd days of the conference since there was so much good stuff going on. I’m on the plane back to Chicago now so I’ve got a few moments to jot down some notes about those days and some general thoughts about the conference.

To be honest the 2nd and 3rd days kind of blur together for me because I really didn’t get much sleep between them. I was pretty much blown away by the variety and quality of the presentations. Thom provided a detailed look at how he builds nimble, high-powered applications using short n’sweet python code on a beowulf cluster using techniques like map-reduce.
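For anyone who hasn’t run into map-reduce before, here is a toy, single-machine sketch of the pattern in Python. It is not Thom’s code and says nothing about how his cluster jobs are actually written; the sample records and the names map_record, reduce_counts and map_reduce are invented for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce step: combine every value that was emitted for one key."""
    return key, sum(values)


def map_reduce(records, mapper, reducer):
    """Run the mapper over all records, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up input: count how often each word appears across two titles.
counts = map_reduce(["open data", "open standards"], map_record, reduce_counts)
print(counts)
```

On a cluster the map calls would be spread across nodes and the grouping would be handled by the framework; the appeal is that the two small functions stay about this short.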

While they did separate talks on different topics I found some common strands between Devon Smith’s talk about metadata processing and Rob Sanderson’s talk about indexing in Cheshire3. Both of them had interesting workflows which they illustrated with neat diagrams which I should be able to link to from here soon. It wasn’t UML or anything boring like that. Rob’s illustration was more an overlaid animation over a bunch of slides showing the full lifecycle of a document being busted apart, indexed, a query coming in, triggering retrieval and then reconstitution of the document. Devon used interesting shaped objects to represent components in his metadata management framework. It was so much more fun than a dry description of what the software was doing, and really evoked what’s so much fun about building software–metaphor creation and architecture. Similarly Colleen Whitney of the California Digital Library had some really neat ways of visualizing search results which I wish I could link to as well.

Ryan Chute’s talk about the aDORe archiving framework from Los Alamos was interesting, but it largely seemed like a verbalization of the series of articles about aDORe that have been published. Don’t get me wrong, it’s fascinating stuff, and perhaps I just had super-high expectations–but I was hoping to hear more details of how they are actually using the aDORe framework at Los Alamos. It was good to hear Aaron Krowne talk about his experiments with quality metrics at Emory–especially after hearing a bit about it months ago in IRC. It turns out he was able to layer his new metrics over lucene without having to dive into the lucene code itself. I’m looking forward to seeing the code once it is released. I knew Aaron was a smart dude from talking to him in IRC, but was surprised to see he is a confident and articulate public speaker as well.

Of course Roy Tennant is so at home at public speaking he was probably the only person that could easily tackle the “future of code4lib” in a presentation. He talked for 20 minutes about a variety of options that could be in the cards to make code4lib into a more formal organization; and then afterwards he did a breakout session on the topic. Unfortunately I wasn’t able to attend this because I was sitting in on Ross’s openurl Ruby library discussion. I heard that the basic consensus at the end was that things will stay much as they are now, but there might be a niche for code4lib to provide educational training for libraries. I think this idea came from Dan, and I think it’s a great idea. Hopefully we’ll get a chance to discuss it over the coming months.

One of the neatest things I witnessed was Ross Singer spontaneously suggesting a breakout session about designing an openurl library for ruby…and something like 20 people showed up. Not that just any 20 people were there: we had Jeff Young (who wrote OCLC’s openurl library), Eric Hellman (who helped write the openurl spec and who just sold his company to OCLC), Todd Holbrook (the software developer behind CUFTS) and Jay (?) one of the software developers behind ExLibris’ SFX product. We had a good discussion, which Ross was able to facilitate, and I think we came away with some good ideas on how to improve the existing library, and perhaps think about providing a common DOM-like api for openurl implementations.

I could go on and on. Like how great the lightning talks were…for example Terry Reese’s five minute laid back demo of his MARCEdit software that was so polished and amazing I couldn’t believe it. It can query Z39.50/SRU targets, and crosswalk to MODS and other metadata formats. Casey Bisson finished the conference on the right note encouraging library software developers to get involved in the technology world outside libraries and to look outwards for cowpaths to pave rather than navel gazing and using only standards developed by libraries. I think he definitely has a point, and that the converse is also true–we should be promoting library standards such as sru/cql in the outside world and encouraging them to pave some of our cowpaths. I was hoping to follow Casey’s talk with my lightning talk about microformats but alas we ran out of time.

All in all I had a great time, and got a chance to meet some really interesting folks (some of whom I got to hang out in Portland with afterwards: Gabriel, Devon, Rob, Aaron). I don’t think it would’ve been possible without the support of people like Art Rhyno, Roy Tennant, Dan Chudnov and of course Jeremy Frumkin who managed to make it just happen. The most important feature of the conference was the size, which was big enough to make it interesting, but small enough to make it easily experienced as a whole, and relaxed enough to be fun. I think that it’s pretty clear that it hit a sweet spot, and that it is highly likely that it will happen again.

code4lib day 1

Thursday, February 16th, 2006

So the first day of “the conference” was a lot of fun. It is just great to see all these people who care about the same stuff in the same place. The lightning talks and the breakout sessions built in some breathing space between the presentations which worked pretty well I thought. Memorable moments for me included:

- hearing people in the audience shout “OPA!” like we were in Greek Town during Dan’s “Connecting Everything to Everything” talk.
- being able to ask Jeff Young to do a lightning talk about Info URIs and then hear him do it later. (jyoung++)
- picking Rob Sanderson’s brain during break about the fine details of CQL.
- having beers with tholbroo and calvinm at the crowbar
- being able to ask Eric Hellman about the guts of openly’s data collection efforts.
- chilling at jaf’s comfy house in the hills of corvallis

off to corvallis

Monday, February 13th, 2006

So tomorrow I’m headed for Corvallis, Oregon to attend the first ever code4lib conference. It’s been amazing to watch this conference start as a glimmer in the eye of a handful of people in IRC and turn into a real event attended by 80 library technologists from all over the place.

I’m planning on doing a lightning talk or two, and had spent some time preparing some slides which I tossed in the end. I’m going to talk about using eclipse, microformats and object-relational-mapping — hopefully by just doing some live coding. We’ll see how it goes.

I plan on taking copious notes, so keep your eye on planet.code4lib.org for the play by play.