Archive for the ‘ruby’ Category

ruby-zoom v0.3.0

Tuesday, July 10th, 2007

Thanks to some prodding from William Denton and Jason Ronallo and the kindness of Laurent Sansonetti I’ve been added as a developer to the ruby-zoom project, which provides a Ruby wrapper to the yaz Z39.50 library. I essentially wanted to remove some unused code from the project that was interfering with the ruby-marc gem … and I also wanted to create a gem for ruby-zoom. This was the first time I’ve tried packaging up a C wrapper as a gem and it was remarkably smooth. I also added a test suite and a Rakefile. So assuming you have yaz installed you can install ruby-zoom with:

% gem install zoom

I’ll admit, I’m no huge fan of Z39.50 but the fact remains that it’s pretty much the most widely deployed machine API for getting at bibliographic data locked up in online catalogs. It’s really nice to see forward-thinking systems from Talis, Evergreen and Koha that have (or have at least experimented with) OpenSearch implementations.

rsinger++

Thursday, September 28th, 2006

So Ross beat out 11 other projects to win the OCLC Research Software Contest for his next generation OpenURL resolver umlaut. Second place went to Jesse Andrews’ BookBurro–so the competition was fierce this year. Much more so than last year when there were 4 contestants.

Those of us who hang out in #code4lib got to hear about this project when it was just a glimmer in his eye…and had front row seats for hearing about the development as it progressed. Essentially umlaut is an openurl router that’s able to consult online catalogs (via SRU), other OpenURL resolvers (SFX), Amazon, Google, Yahoo, Connotea, CiteULike and OAI-PMH. It’s all written in Ruby and RubyOnRails.

I feel particularly proud because Ross is enough of a mad genius to have found a use for some ruby gems I wrote for doing sru, oai-pmh and querying OCLC’s xisbn service.

Speaking of which we’ve been collaborating recently on a little ruby gem for querying OCLC’s OpenURL Resolver Registry. This registry essentially makes it easy to determine what the appropriate OpenURL resolver is given a particular IP address. So you could theoretically rewrite your fulltext URLs so that they were geospatially aware. For example:

  require "resolver_registry"
 
  client = ResolverRegistry::Client.new
  institution = client.find('130.207.50.91')
  print institution.resolver.base_address

If you want to take a look direct your svn client like so:

svn co http://rsinger.library.gatech.edu/svn/openurl_registry/

I imagine it’ll get released to rubyforge sometime shortly.
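
To make the URL rewriting idea a bit more concrete: once you have a base address, pointing a citation at the reader’s own resolver is mostly string building. Here’s a minimal sketch, assuming a made-up base address and the key/value (KEV) flavor of OpenURL 1.0. The openurl_for helper is hypothetical and is not part of the resolver_registry gem.

```ruby
require 'uri'

# Hypothetical helper: append an OpenURL 1.0 KEV query string to a
# resolver base address. Not part of the resolver_registry gem.
def openurl_for(base_address, citation)
  params = {
    'url_ver'     => 'Z39.88-2004',
    'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal'
  }
  # Citation fields go in under the rft. (referent) prefix.
  citation.each { |key, value| params["rft.#{key}"] = value }
  separator = base_address.include?('?') ? '&' : '?'
  base_address + separator + URI.encode_www_form(params)
end

# The base address is invented for illustration; in practice it would
# come from institution.resolver.base_address.
url = openurl_for('http://resolver.example.edu/sfx',
                  'issn' => '0273-0979', 'volume' => '37', 'spage' => '351')
puts url
```

The only decision the helper makes is whether the base address already carries a query string, since some resolvers are configured that way.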

ruby-oai v0.0.3

Tuesday, September 19th, 2006

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.

After committing the branch he was working on I ran some performance tests of my own. I ran a vanilla ListRecords request against dspace, eprints and american memory oai-pmh servers using both the rexml (default) and libxml backend parsers. Here are the results:

  server           parser  real           user           sys
  dspace           rexml   0m3.632s       0m2.008s       0m0.044s
                   libxml  0m1.900s       0m0.212s       0m0.032s
                   saved   1.732s (+48%)  1.796s (+89%)  0.012s (+27%)

  eprints          rexml   0m19.807s      0m1.984s       0m0.036s
                   libxml  0m19.344s      0m0.236s       0m0.024s
                   saved   0.463s (+2%)   1.748s (+88%)  0.012s (+33%)

  american-memory  rexml   0m12.991s      0m5.424s       0m0.052s
                   libxml  0m7.420s       0m0.324s       0m0.032s
                   saved   5.571s (+43%)  5.104s (+94%)  0.02s (+38%)

Those percentage values are speed improvements. Thanks Terry :-)
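
In case the arithmetic isn’t obvious, each percentage is the time saved by libxml as a share of the rexml time. A quick sketch that reproduces the “real” column from the numbers above (the improvement method name is just something I made up for this example):

```ruby
# Wall-clock ("real") timings in seconds, copied from the results above.
timings = {
  'dspace'          => { rexml: 3.632,  libxml: 1.900 },
  'eprints'         => { rexml: 19.807, libxml: 19.344 },
  'american-memory' => { rexml: 12.991, libxml: 7.420 }
}

# Improvement = time saved by libxml as a percentage of the rexml time.
def improvement(rexml, libxml)
  ((rexml - libxml) / rexml * 100).round
end

timings.each do |server, t|
  saved = (t[:rexml] - t[:libxml]).round(3)
  puts "#{server}: #{saved}s saved (+#{improvement(t[:rexml], t[:libxml])}%)"
end
```

The eprints number is a good reminder that wall-clock time is dominated by the network there, so the parser barely matters for that server.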

set your data free … with unapi

Monday, August 28th, 2006

Dan, Jeremy, Peter, Michael, Mike, Ross and I wrote an article in the latest Ariadne introducing the lightweight web protocol unAPI. Essentially unAPI is an easy way to include references to digital objects in your HTML which can then be predictably retrieved by a machine…yes ‘machine’ includes JavaScript running in a browser :-) Dan and a really nice cross section of developers around the world have been working on this spec for over a year now and I think it could be poised to play an important role in the emerging open data movement.

Imagine you have a citation database which is searchable via the web. The search results include hits. Wouldn’t it be nice to align your human viewable results with machine readable representations so that people could write browser hacks and the like to remix your application data?

As far as I can tell there are a few options available to help you do this (apart from doing something ad-hoc).

  1. use a citation microformat and mark up your HTML predictably so that it can be recognized and parsed
  2. use GRDDL to map your HTML to RDF via an XSLT profile.
  3. embed RDF in your HTML essentially using an RDF microformat.
  4. OpenURL and/or COinS to link in page IDs to OpenURL servers.
  5. use unAPI and include a unapi server url (familiar autodiscovery like RSS/Atom), and identifiers (simple element attributes) and write a simple server side script that emits xml for a given identifier.

I like microformats a lot and I think a citation format will eventually get done. But it’s been a long time coming and there’s no indication it’s going to get done any time soon. What’s more unAPI is bigger than just citation data–and it allows you to publish all kinds of rich data objects without waiting for a community to ratify a particular representation in HTML.

Options 2 and 3 use RDF which I actually like quite a bit as well. GRDDL implies a GRDDL aware browser which would be cool but is a bit heavy weight. XSLT will require clean XHTML–or pipelines to clean it. Embedding RDF in HTML using microformat techniques is compelling because you can theoretically process the RDF data similarly–whereas unAPI doesn’t require any particular kind of machine readable format (apart from HTML). Actually there’s nothing stopping you from using unAPI to link human viewable objects with RDF representations. The advantage unAPI has here is you can learn RDF if you want to, but you don’t have to learn RDF to get going with unAPI today.

Option 4 leverages work done in the library community on citation linking. OpenURL routers are widely deployed in libraries around the world and COinS is a quasi-microformat for putting OpenURL context objects into your HTML so that they can be extracted and fired off at an OpenURL server. OpenURL is a relatively complex and subtle standard which can do a lot more than just citation linking. Compared to OpenURL/COinS unAPI allows for ease of implementation in languages like JavaScript and provides a simple introspection mechanism for discovering what formats a particular resource is available in. AFAIK this can’t be done simply using OpenURL/COinS. If I’m wrong, comments should be open. I would argue that the sheer power and flexibility of OpenURL paradoxically make it hard to understand…and that unAPI in Dan’s adherence to a one-page-spec is more limited and simple. Less is more…

So if this piques your interest read the article. It does a much better job of describing the origins of the work, where it’s headed, has examples and links out to sites/tools that use unAPI today. I must admit I wrote very little of the article, and mostly contributed text snippets and screenshots of the unAPI validator I wrote, which uses my unapi ruby gem.
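
For a feel of how little a unAPI server has to do, here is a rough sketch of the request handling. It assumes my reading of the spec’s three request shapes (no parameters, id only, id plus format) and its suggested status codes (300 for the per-identifier format list, 404 for an unknown identifier, 406 for an unavailable format). The RECORDS data and the method names are invented for illustration, and this is not the code behind the validator or the unapi gem.

```ruby
require 'cgi'

# Hypothetical in-memory data: identifier => { format name => [mime type, body] }
RECORDS = {
  'info:example/1' => {
    'oai_dc' => ['application/xml', '<dc><title>Book Review</title></dc>'],
    'text'   => ['text/plain', 'Book Review']
  }
}.freeze

# Build the <formats> listing, optionally scoped to one identifier.
def formats_xml(formats, id = nil)
  id_attr = id ? %( id="#{CGI.escapeHTML(id)}") : ''
  items = formats.map { |name, (type, _body)| %(<format name="#{name}" type="#{type}"/>) }
  %(<formats#{id_attr}>#{items.join}</formats>)
end

# Returns [status, content_type, body] for a unAPI-style request.
def unapi(params)
  id = params['id']
  format = params['format']

  if id.nil?
    # No parameters: list every format the service knows about.
    all = RECORDS.values.reduce({}) { |acc, formats| acc.merge(formats) }
    return [200, 'application/xml', formats_xml(all)]
  end

  record = RECORDS[id]
  return [404, 'text/plain', 'unknown identifier'] unless record
  # id only: list the formats available for this one object.
  return [300, 'application/xml', formats_xml(record, id)] if format.nil?

  type, body = record[format]
  return [406, 'text/plain', 'format not available'] unless type

  [200, type, body]
end

p unapi('id' => 'info:example/1')
p unapi('id' => 'info:example/1', 'format' => 'text')
```

The id-only request is the introspection step mentioned above: a client can ask what formats exist before committing to one.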

gems…on ice

Saturday, August 12th, 2006


When developing and deploying RubyOnRails applications you’ve often got to think about the gem dependencies your project might have. It’s particularly useful to freeze a version of rails in your vendor directory so that your app uses that version of rails rather than a globally installed (or not installed) one. It’s easy to do this by simply invoking:

  rake freeze_gems

Which will unpack all the rails gems into vendor, and your application will magically use these instead of the globally installed rails gems.

The cool thing is that with a little bit of plugin help you can freeze your other gems in vendor as well. Simply install Rick Olson’s elegantly simple gem plugin into vendor/plugins. Then assuming you are using let’s say my oai-pmh gem you can simply:

  rake gems:freeze GEM=oai

and the gem will be unpacked in vendor, and the $LOAD_PATH for your application will automatically include the library path for the new gem. Very useful, thanks Rick!
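
I haven’t reproduced Rick’s plugin here, but the load path trick itself is simple enough to sketch. This assumes the gems end up in vendor/gems/NAME-VERSION/lib, which is my guess at the layout rather than something I’ve checked against the plugin, and add_vendored_gems_to_load_path is a name I made up.

```ruby
require 'tmpdir'
require 'fileutils'

# Prepend every vendor/gems/*/lib directory under app_root to the load
# path so that require finds the frozen copy before any installed gem.
def add_vendored_gems_to_load_path(app_root, load_path = $LOAD_PATH)
  Dir.glob(File.join(app_root, 'vendor', 'gems', '*', 'lib')).sort.each do |lib|
    load_path.unshift(lib) unless load_path.include?(lib)
  end
  load_path
end

# Demonstration against a throwaway directory standing in for a Rails app.
Dir.mktmpdir do |root|
  lib = File.join(root, 'vendor', 'gems', 'oai-0.0.3', 'lib')
  FileUtils.mkdir_p(lib)
  File.write(File.join(lib, 'frozen_demo.rb'), "FROZEN_DEMO = true\n")

  add_vendored_gems_to_load_path(root)
  require 'frozen_demo' # resolved from the vendored directory
  puts FROZEN_DEMO
end
```

Prepending rather than appending is the important bit: it is what makes the frozen copy win over a globally installed one.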

repositories and domain-specific-languages

Thursday, July 6th, 2006

At work I’ve been doing some experiments with the Fedora repository software. One of the strengths of Fedora is that it is fundamentally designed as a set of extensible web services. At first I set about becoming familiar with the set of web services and decided that Ruby would be a useful and lightweight language to do this from. Sure enough, Ruby was plenty capable of putting stuff into Fedora and getting stuff back out again.

As time went on it became clear that what was really needed was a layer of abstraction around this Fedora web services API that would allow it (or another repository framework) to be used in a programmatic way without having to make SOAP calls and build FOXML all over the place. Typically in software pattern lingo this is referred to as a facade.

So I worked on creating a facade, and ended up with something I half-jokingly called ‘bitbucket’ which looks something like this:

  require 'bitbucket'
 
  # create a repository
  repository = BitBucket::Repository.new
 
  # create a new repository object
  o = BitBucket::RepositoryObject.new
  o.dc.title = 'Automatic for the People'
  o.dc.creator = 'REM'
 
  # add a datastream to the object
  o << BitBucket::DataStream.new_from_file('The Sidewinder Sleeps Tonight.mp3')
 
  # ingest it!
  id = repository.ingest(o)

Now this code is pretty basic: it creates an object for a CD, associates an mp3 with it, and then adds it to a repository. This is the typical ‘ingest’ process but notice that the ingest format, the SOAP requests, mime-types, and the actual type of the repository are unspecified. The truth is even more could be hidden, such as the Dublin Core. Some things could use better names: ‘Resource’ instead of ‘RepositoryObject’ perhaps. If you’re interested in using this code (yes it works!) let me know–I imagine it could be liberated from a private subversion repository.
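
To show the shape of the pattern without Fedora in the picture, here is a stripped-down sketch of a facade. It is not the bitbucket source: the FacadeSketch module, the in-memory backend and the string standing in for a datastream are all stand-ins I made up so the example runs on its own.

```ruby
module FacadeSketch
  # Stand-in for whatever really stores objects (Fedora over SOAP, say).
  class MemoryBackend
    def initialize
      @store = {}
      @next_id = 0
    end

    def store(payload)
      id = "demo:#{@next_id += 1}"
      @store[id] = payload
      id
    end

    def fetch(id)
      @store[id]
    end
  end

  DublinCore = Struct.new(:title, :creator)

  class RepositoryObject
    attr_reader :dc, :datastreams

    def initialize
      @dc = DublinCore.new
      @datastreams = []
    end

    # Attach a datastream; returns self so calls can be chained.
    def <<(datastream)
      @datastreams << datastream
      self
    end
  end

  class Repository
    def initialize(backend = MemoryBackend.new)
      @backend = backend
    end

    # Callers never see how the object is serialized or transported.
    def ingest(object)
      @backend.store(dc: object.dc.to_h, datastreams: object.datastreams.dup)
    end

    def fetch(id)
      @backend.fetch(id)
    end
  end
end

repository = FacadeSketch::Repository.new
o = FacadeSketch::RepositoryObject.new
o.dc.title = 'Automatic for the People'
o.dc.creator = 'REM'
o << 'The Sidewinder Sleeps Tonight.mp3'
id = repository.ingest(o)
puts id
```

Swapping the backend object is the whole point: the calling code above would not change if the in-memory store were replaced with something that builds FOXML and makes SOAP calls.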

Just after finishing this up it struck me that while I was trying to build a facade around Fedora I was at the same time striving for a domain specific language for repositories.

The basic idea of a domain specific language (DSL) is a computer language that’s targeted to a particular kind of problem, rather than a general purpose language that’s aimed at any kind of software problem.

As Martin Fowler goes on to describe there are two different types of DSLs: external and internal. External DSLs are custom languages such as regular expressions, postscript, ant configuration files, etc. Typically a syntax for the mini-language is determined and a small (hopefully) interpreter is written which parses and processes the DSL. Internal DSLs on the other hand use the constructs of a host programming language to define the DSL. There is a strong tradition of using DSLs in Lisp and Smalltalk…and it seems to also be a growing tradition in the Ruby community as well.
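
A tiny illustration of the internal flavor in Ruby: nothing below is a new language, just a block evaluated against an object with instance_eval, yet the calling code reads like a small vocabulary of its own. The class and method names are hypothetical and only exist for this example.

```ruby
# Collects whatever the block declares about a repository object.
class ObjectDescription
  attr_reader :fields, :files

  def initialize
    @fields = {}
    @files = []
  end

  def title(value)
    @fields[:title] = value
  end

  def creator(value)
    @fields[:creator] = value
  end

  def datastream(path)
    @files << path
  end
end

# Hypothetical entry point: evaluates the block with a fresh description
# as self, so bare words like title and creator resolve to its methods.
def describe_object(&block)
  description = ObjectDescription.new
  description.instance_eval(&block)
  description
end

cd = describe_object do
  title 'Automatic for the People'
  creator 'REM'
  datastream 'The Sidewinder Sleeps Tonight.mp3'
end

p cd.fields
p cd.files
```

An external DSL would need its own parser for that block of text; here the Ruby interpreter does the parsing for free.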

So a DSL for repositories would provide a mini-language, if you will, for interacting with a repository. I think that the efforts underway to build models for interoperability across scholarly repositories are in a way groping after this same thing–an unambiguous language for interacting with repositories.

The Pathways Core poster session at JCDL was very exciting. While the ideas were compelling enough, Jeroen Bekaert created some absolutely beautiful diagrams which really sold some of the concepts. I wish I could find some to put here. I got a chance to pick Xiaoming Liu’s brain a lot at the conference and over beers and I am really looking forward to their upcoming papers on this topic.

What I’d like to see is how easy it would be to use this emerging pathways model to create a Ruby DSL that uses the Atom Publishing Protocol as a backend. I’d also like to take a look at JSR 170. My main purpose in this is to see how well the aims of the scholarly community map to the content management solutions being developed outside the digital library community.

xml spelunking

Saturday, May 27th, 2006

As part of my day job I’ve been rifling through large foreign XML files–learning the rhyme and reason of tags used, looking at content, etc… I opened files in Firefox and vim and that was OK–but I like working from the command line. After minimal searching I wasn’t able to find a suitable tool that would simply outline the structure of an xml document in the way I wanted–although artunit pointed out Gadget from MIT which looks like a really wonderful GUI tool to try out. So (predictably) I wrote my own:

biblio:~ ed$ xmltree
Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]

Specific options:
    -d, --depth n                    max levels
    -x, --xpath /foo/bar             xpath to apply
    -c, --content                    include tag content
    -n, --namespaces                 include namespace information
    -h, --help                       show this message

You can use it to list all the elements in a document like this:

biblio:~ ed$ xmltree pmets.xml

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
    descMDcurated
     mdWrap
      xmlData
       PorticoArticleMetadata
        article
... many lines of content removed

Maybe it’s a huge file and you only want to see a few levels in:

biblio:~ ed$ xmltree --depth=3 pmets.xml

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
   div
   div
   div
 structMapMetadata
  div
   mdGroup
   fileGrp

And if you just want to explore a particular node you can use an xpath:

biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id
     journal-title
     issn
     issn
    article-meta
     article-id
     article-id
     title-group
      article-title
     contrib-group
      contrib
     pub-date
      day
      month
      year
      string-date
     volume
     issue
     fpage
     page-range
     product
     copyright-year
     copyright-holder
     self-uri

And finally if you want to eyeball the content of the fields you can use the --content option:

biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/
  descMDextracted/mdWrap/xmlData --content sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id='bull'
     journal-title='Bulletin of the American Mathematical Society'
     issn='0273-0979'
     issn='1088-9485'
    article-meta
     article-id='S0273-0979-00-00866-1'
     article-id='10.1090/S0273-0979-00-00866-1'
     title-group
      article-title='Book Review'
     contrib-group
      contrib='David Marker'
     pub-date
      day='02'
      month='03'
      year='2000'
      string-date='02 March 2000'
     volume='37'
     issue='03'
     fpage='351'
     page-range='351-357'
     product='Tame topology and o-minimal structures, by Lou van den Dries,
       Cambridge Univ. Press, New York (1998), x + 180 pp., $39.95,
       ISBN 0-521-59838-9'
     copyright-year='2000'
     copyright-holder='American Mathematical Society'
     self-uri='http://www.ams.org/jourcgi/jour-getitem?pii=S0273-0979-
       00-00866-1'

Anyhow, if you have a favorite tool for doing this sort of stuff please let me know. If you want to try out xmltree you can grab it out of my subversion repository. You’ll just need a modern Ruby.
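
If you’d rather not dig through the repository, the core of the idea fits in a few lines of REXML. This is a rough sketch of the approach, not the actual xmltree source: the outline method and its keyword arguments are made up for this example, and namespaces and option parsing are left out.

```ruby
require 'rexml/document'

# Return one line per element, indented one space per level of nesting.
# max_depth counts levels below the starting element; nil means no limit.
def outline(element, max_depth: nil, content: false, depth: 0)
  line = (' ' * depth) + element.name
  text = element.text.to_s.strip
  line += "='#{text}'" if content && !text.empty?
  lines = [line]
  return lines if max_depth && depth >= max_depth

  element.elements.each do |child|
    lines.concat(outline(child, max_depth: max_depth, content: content, depth: depth + 1))
  end
  lines
end

xml = <<~XML
  <article>
    <front>
      <journal-meta><issn>0273-0979</issn></journal-meta>
    </front>
  </article>
XML

doc = REXML::Document.new(xml)

# Whole document, limited to two levels below the root.
puts outline(doc.root, max_depth: 2)

# Start from an XPath match instead and include the text content.
start = REXML::XPath.first(doc, '//journal-meta')
puts outline(start, content: true)
```

Returning an array of lines rather than printing directly keeps the recursion easy to test and leaves the output format up to the caller.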

building and ingesting

Friday, April 28th, 2006

I prefer using an XML generating mini-language (elementtree, XML::Writer, REXML, Stan, etc) to actually writing raw XML. It’s just too easy for me to forget or mistype an end tag, or forget to encode strings properly–and I find all those inline strings or even here-docs make a mess of an otherwise pretty program.

Recently I wanted some code to write FOXML for ingesting digital objects into my Fedora test instance. I’m working in Ruby so REXML seemed like the best place to start…but after I finished I ran across Builder. The Builder code turned out to be somewhat shorter, much more expressive and consequently a bit easier to read (for my eyes). Here’s a quick example of how Builder’s API improves on REXML when writing this little chunk of XML:

<dc xmlns='http://purl.org/dc/elements/1.1/'>
  <title>Communication in the Presence of Noise</title>
</dc>

So here’s the REXML code:

dc = REXML::Element.new 'dc'
dc.add_attributes 'xmlns' => 'http://purl.org/dc/elements/1.1/'
title = REXML::Element.new 'title', dc
title.text = 'Communication in the Presence of Noise'

and the Builder code:

x = Builder::XmlMarkup.new 
x.dc 'xmlns' => 'http://purl.org/dc/elements/1.1/' do
  x.title 'Communication in the Presence of Noise'
end

So both are four lines, but look at how the Builder::XmlMarkup object infers the name of the element from the message that is passed to it. Element attributes and content can be set when the element is created–something I wasn’t able to do with REXML. My favorite though is Builder’s use of blocks so that the hierarchical structure of the code directly mirrors that of the XML content!
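
If the name inference seems like magic, the trick underneath is Ruby’s method_missing hook. Here is a stripped-down imitation of the idea. It is not Builder’s actual code (the TinyMarkup class is made up for this example, and it skips indentation, empty-element syntax and much else), but it handles the little snippet above.

```ruby
# Any method the object doesn't define is treated as an element name.
class TinyMarkup
  ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

  def initialize
    @out = +''
  end

  # The markup generated so far.
  def target!
    @out
  end

  def method_missing(name, *args, &block)
    attrs = args.last.is_a?(Hash) ? args.pop : {}
    text = args.first
    attr_text = attrs.map { |key, value| %( #{key}="#{escape(value)}") }.join

    @out << "<#{name}#{attr_text}>"
    @out << escape(text) if text
    block.call if block # nested calls append their own elements
    @out << "</#{name}>"
    self
  end

  private

  def escape(value)
    value.to_s.gsub(/[&<>"]/, ESCAPES)
  end
end

x = TinyMarkup.new
x.dc 'xmlns' => 'http://purl.org/dc/elements/1.1/' do
  x.title 'Communication in the Presence of Noise'
end
puts x.target!
```

Because the closing tag is written only after the block returns, the nesting of the Ruby blocks is what produces the nesting of the XML.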

So anyway, if you read this far you might actually like to see how a FOXML document can be built and ingested into Fedora–so here goes building the document:

x = Builder::XmlMarkup.new :indent => 2
 
x.digitalObject 'xmlns' => 'info:fedora/fedora-system:def/foxml#' do
 
  x.objectProperties do
    x.property 'NAME' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
      'VALUE' => 'FedoraObject'
    x.property 'NAME' => 'info:fedora/fedora-system:def/model#state',
      'VALUE' => 'A'
  end
 
  x.datastream 'ID' => 'DC', 'STATE' => 'A', 'CONTROL_GROUP' => 'X' do
    x.datastreamVersion 'ID' => 'DC.0', 'MIMETYPE' => 'text/xml' do
      x.xmlContent do
        x.tag! 'oai_dc:dc',
          'xmlns:oai_dc' => 'http://www.openarchives.org/OAI/2.0/oai_dc/',
          'xmlns:dc' => 'http://purl.org/dc/elements/1.1/' do
          x.tag! 'dc:title', 'Communication in the Presence of Noise'
          x.tag! 'dc:creator', 'Claude E Shannon'
          x.tag! 'dc:subject', 'Information Science'
        end
      end
    end
  end
 
end

And here’s some code to fire the foxml at Fedora in a SOAP call:

require 'Fedora-API-M-WSDLDriver'
 
# configure the api_m soap client for basic authentication
host = 'http://localhost:8080/fedora/services/management'
user = 'fedoraAdmin'
pass = 'fedoraAdmin'
fedora = FedoraAPIM.new
fedora.options['protocol.http.basic_auth'] << [host, user, pass]
 
# target! returns the generated XML; to_s would be emitted as a <to_s/> element
fedora.ingest SOAP::SOAPBase64.new(x.target!), 'foxml1.0', 'added test object'

Cataloging at the BBC with RubyOnRails

Wednesday, April 26th, 2006

It’s nice to see that BBC Programme Catalogue (built with RubyOnRails and MySQL) has gone live. Here is some historical background from the about page:

The BBC has been cataloguing and indexing its programmes since the 1920s. The development of the programme catalogue has reflected the changes in the BBC and in broadcasting over the last seventy five years. For example, in the early days of broadcasting, for both Radio and TV, the majority of programmes were broadcast live and were never recorded. There was therefore little point at the time to do extensive cataloguing and indexing of material that did not exist. As you will see, the number of catalogue entries for a day in the 1990s, far exceeds the entries for a day from the 1950s.

As recording technology developed in both mediums, the requirement to keep material for re-use also grew. If material was going to be re-used, it had to be catalogued and indexed. The original records of radio programmes were handwritten into books; over time, card catalogues were developed, and from the mid-1980s onwards there have been computer based catalogues.

This experimental catalogue database holds over 900,000 entries. It is a sub-set of the data from the internal BBC database created and maintained by the BBC’s Information and Archives department. This public version is updated daily as new records are added and updated in the main catalogue. This figure is so high because, for example, each TV news story now has an individual entry in the catalogue.

Talk about sexy retrospective conversion eh? Hats off to Matt Biddulph and his colleagues. I wish I was going to RailsConf to hear more of the technical details. Actually, if you haven’t already, take a look at the RailsConf program–it looks like it’s going to be a great event.

Fedora/SOAP and Ruby

Tuesday, April 25th, 2006

I’ve been playing around getting ruby to talk to the fedora framework for building digital repositories. Fedora makes its api available by different sets of SOAP services, defined in WSDL files. What follows is a brief howto on getting Ruby to talk to the API-A and API-M.

To get basic API-A and API-M clients working you’ll need the following:

  • A modern ruby: probably >= 1.8.2
  • The latest soap4r: the one that comes standard in 1.8.4 may work but emits some warnings when processing the fedora wsdl files.
  • The latest http-access2 if you plan on doing API-M with basic authentication.
  • A tarball of ruby classes I generated with wsdl2ruby using the wsdl files in the latest fedora distribution.

So assuming you’ve unpacked the ruby-fedora.tar.gz you should be able to go in there and write the following program which will attempt to connect to a fedora server at localhost:8080 and retrieve the PDF datastream for an object with PID ‘biblio:2’ and write it out to disk. I guess to get it working right you should change the datastream label and PID to something relevant in your repository.

#!/usr/bin/env ruby
 
require 'Fedora-API-A-WSDLDriver'
 
fedora = FedoraAPIA.new
ds = fedora.getDatastreamDissemination('biblio:2', 'PDF', nil)
 
# open in binary mode so the PDF bytes are written untouched
File.open('shannon.pdf', 'wb') {|f| f.write ds.stream}

To talk API-M it’s just a little bit more work since you have to tell the SOAP client what the username and password are. In this example we simply ask for the next PID in the ‘biblio’ namespace.

#!/usr/bin/env ruby
 
require 'Fedora-API-M-WSDLDriver'
 
host = 'http://localhost:8080/fedora/services/management'
user = 'fedoraAdmin'
pass = 'fedoraAdmin'
 
fedora = FedoraAPIM.new
fedora.options['protocol.http.basic_auth'] << [host, user, pass]
 
print fedora.getNextPID(SOAP::SOAPNonNegativeInteger.new(1), 'biblio')

Obviously there’s a lot more depth to go into as far as exploring the fedora api. But these are the basics. Tomorrow I’m going to explore FOXML some more and look at what’s involved in doing ingest with ruby.