archiving the web

Folks at Cornell are doing some fun stuff with Internet Archive data. William Arms presented Building a Research Library for the History of the Web at JCDL last week, which summarized some of the architectural decisions they had to make in designing a system for mirroring and providing access to 240 terabytes of web content. Their goal is both to function as a full mirror of IA and to build tools that allow social science and computer science researchers to use this data.

A few interesting tidbits include:

  • Rather than building a distributed system for processing the data (which is what IA and Google have), they went with a symmetric multi-processor. Not just any kind of multi-processor, mind you, but two dedicated Unisys ES7000/430s, each with 16 Itanium2 processors running at 1.5 GHz and 16GB of RAM. The argument was that the very high data rates made this architecture more palatable. The kicker for me was that they are using Microsoft Windows Server 2003 as the operating system. But it gets weirder.
  • The pre-load system extracts useful metadata from ARC files and stores it in a relational database, while saving the actual content off to a separate Page Store. The Page Store has some intelligence in it: it uses an MD5 checksum to figure out whether content has changed, and it provides a layer of abstraction that will allow some content to be stored offline on tape, etc. Apparently IA stores quite a bit of redundant data, and Cornell will be able to save a significant amount of disk space if they de-dupe (a toy sketch of this checksum idea follows the list). Arms detailed the trade-offs of using a relational db, namely that they had to get the schema right, because changing it down the road would require another complete pass over the content. Ok, so the weirder part is that they are using SQL Server 2000 as the RDBMS.
  • They have created web-service and high-performance clients for extracting data from the archive so that cpu-intensive research operations can be performed locally instead of on the main server. I’d be interested to learn more about the high-performance clients, since we’ve been keen to have file-system-like clients in the repository we are building at the LoC. Among the more interesting things the extractors can do is extract the sub-graph around a particular node on the web.
  • They have a retro-browser which (from the paper) sounds like an interesting HTTP proxy that turns any old browser into a time machine. It performs a similar function to the Wayback Machine, but sounds a lot cooler.
  • Full-text indexing is initially being done using Nutch on an extracted subset of nodes. However, Cornell is investigating the use of NutchWAX for providing full-text indexes. NutchWAX was written by Doug Cutting for working directly with IA ARC files. It also has the ability to distribute indexing–which seems counter to the non-distributed nature of this system at Cornell…but there you go.
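
Cornell’s Page Store is surely more sophisticated than this, but the checksum idea is easy to sketch. Here’s a toy content-addressed store in Ruby (the class and method names are mine, not theirs): pages are filed under the MD5 of their bytes, so re-crawled content that hasn’t changed is stored only once.

require 'digest/md5'
require 'fileutils'

class PageStore
  def initialize(root)
    @root = root
    FileUtils.mkdir_p(root)
  end

  # file the content under its checksum; skip the write if we already have it
  def put(content)
    checksum = Digest::MD5.hexdigest(content)
    path = File.join(@root, checksum)
    File.open(path, 'wb') { |f| f.write(content) } unless File.exist?(path)
    checksum
  end

  def get(checksum)
    File.read(File.join(@root, checksum))
  end
end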

I’ve learned from my colleague Andy Boyko that the Library of Congress has been doing similar work with IA…and has been doing other work archiving the world wild web. I imagine my other team members have already been exposed to the work Cornell has been doing in this area, but it was useful for me to learn more. It’s important work–as Arms said:

Everyone with an interest in the history of the Web must be grateful to Brewster Kahle for his foresight in preserving the content of the Web for future generations…


citation graphs

JCDL2006 was chock-full of good content. A set of papers presented on the first day in the Named Entities track explored a common theme: applying graph theory to citation networks in order to cluster works by the same author. For example, an author’s name may appear as Daniel Chudnov, D Chudnov, or Dan Chudnov. There is also the converse problem: two authors with the same name may actually be two different people. Being able to group all the works by an author is very important for good search interfaces…and also for calculating citation counts and impact factors.

The most interesting paper in the bunch (for me) was Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation. This paper revolves around the hypothesis that authors tend to cite their own works more frequently than others do–so-called ‘self-citation’. Self-citation isn’t the result of navel gazing or self-promotion so much as it is the result of researchers building on work they’ve done previously. In addition to self-citation graphs, co-authorship and source-URL graphs are also used to build a graph of a particular author’s works.
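
Just to make the idea concrete, here’s a toy of my own (not the AKTiveAuthor algorithm, and the links are made up): treat each name variant as a node, add an edge whenever a work under one variant cites a work under another, and take the connected components as clusters of names that probably belong to the same author.

require 'set'

edges = [
  ['Daniel Chudnov', 'D Chudnov'],   # hypothetical self-citation links
  ['D Chudnov', 'Dan Chudnov'],
  ['J Smith', 'John Smith']
]

# build an adjacency list and pull out connected components
graph = Hash.new { |h, k| h[k] = [] }
edges.each { |a, b| graph[a] << b; graph[b] << a }

seen = Set.new
graph.keys.each do |node|
  next if seen.include?(node)
  component, queue = [], [node]
  until queue.empty?
    n = queue.shift
    next if seen.include?(n)
    seen << n
    component << n
    queue.concat(graph[n])
  end
  puts component.join(' | ')
end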

The paper reports some good precision/recall figures (.997/.818), which points to the value of using self-citation for name clustering. This paper, along with some growing interest I have in RDF and Jena, has made me realize that I’d like to spend a bit of time over the coming year learning about graph theory.


oai-pmh tut

I couldn’t pass up the opportunity to hear Simeon Warner talk about oai-pmh in the second group of tutorials. I’ve implemented data and service providers before–so I consider myself fairly knowledgeable about the protocol. But Simeon has worked with this stuff constantly at Cornell since his arXiv days, so I was certain there would be things to learn from him…he did not disappoint.

Drawing on some recent work at NSDL and his experience with the protocol, Simeon provided some really useful advice on using oai-pmh. Here are the things I picked up on:

  • avoid using sets–especially overloading sets to do searches. There is an interesting edge case in the protocol: when a record moves from set A to set B, harvesters who are only harvesting set A will miss the update entirely.
  • pay attention to datestamps. Make sure datestamps are assigned to records when they actually change in the repository or else harvesters can miss updates. The protocol essentially is a way of exposing updates, so getting the datestamps right is crucial.
  • resumption tokens need to be idempotent. This means a harvester should be able to use a resumption token more than once and get the same result (barring updates to the repository). This is essential so that harvesters engaged in a lengthy harvest can recover from network failures and other exceptions (a minimal resumable harvest loop is sketched just after this list).
  • pay attention to character encoding. Use a parser that decodes character entities in XML and store the UTF-8. This will make your life simpler as you layer new services over harvested data. Make sure that HTML entities aren’t used in oai-pmh responses. utf8conditioner is Simeon’s command line app for debugging UTF-8 data.
  • be aware of the two great myths of oai-pmh: the myth that oai-pmh only allows exposure of DC records, and the myth that oai-pmh only allows a single metadata format to be exposed.
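
To make the resumption token advice concrete, here is a minimal (and simplified) ListRecords loop in Ruby; the base URL is made up, and a real harvester would also handle errors, retry-after headers, and so on. Because tokens are idempotent, a failed request can simply be retried with the same token.

require 'net/http'
require 'uri'
require 'rexml/document'

base = 'http://oai.example.org/oai'   # hypothetical data provider
params = { 'verb' => 'ListRecords', 'metadataPrefix' => 'oai_dc' }

loop do
  uri = URI(base)
  uri.query = URI.encode_www_form(params)
  doc = REXML::Document.new(Net::HTTP.get(uri))

  REXML::XPath.each(doc, '//record/header/identifier') { |id| puts id.text }

  token = REXML::XPath.first(doc, '//resumptionToken')
  break if token.nil? || token.text.nil? || token.text.strip.empty?
  # subsequent requests carry only the verb and the token
  params = { 'verb' => 'ListRecords', 'resumptionToken' => token.text }
end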

There are lots more recommendations at NSDL, but it was useful to have this overview and the chance to ask Simeon questions. For example, even though oai-pmh requires records to have an XML schema, it would be possible to create a wrapping schema for freeform data like RDF.

The main reason I was interested in this tutorial was to hear more about using oai-pmh to distribute not only resource metadata records, but also the resources themselves. There were a couple of initial problems with using the protocol to provide access to the actual resources.

The first is that identifiers such as URLs in metadata records, which point to the resource to be captured, had too much ambiguity. Some of the URLs pointed at splash screens where someone could download a particular flavor of the resource, others went directly to a PDF, etc. This made machine harvesting data-provider specific. In addition, there is a problem with the datestamp semantics when a remote resource changes: when the resource is updated but the metadata stays the same, the datestamp is not required to change. This makes it impossible for a harvester to know when it needs to download the resource again.

Fortunately there was a solution, which is detailed more fully in a paper written by Simeon and the usual suspects. It boils down to actually distributing the resource as a metadata format. This plays a little bit with what metadata itself is…but it makes the two previously mentioned problems disappear. Simeon gave a brief overview of MPEG21 DIDL but was keen to point out that METS and other packaging formats can work the same way. Using oai-pmh in this way is really interesting to me since it enables repositories to share actual objects with each other–with OAI-PMH working almost like an ingestion protocol.

I asked about mechanisms to autodiscover oai-pmh metadata in HTML, like unAPI. Simeon pointed out that the usual suspects are actually extending/refining this idea a bit in some recent work done with the Andrew W. Mellon Foundation on interoperability. Apparently they’ve experimented with the Live Clipboard idea in support of some of this work. More on this later.


xquery

So my new employer was kind enough to send me to the Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned–rather than waiting till the end to write up a report.

I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink W3C working groups, teaches at Brown, has over 20 years of experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands-on demonstration of XQuery using the eXist XML database.

About the first half was spent going over the syntax of XQuery, which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native XML databases quite a bit on #code4lib, so I was really looking forward to learning more about it from a practical perspective.

I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up, and imported a body of test XML data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the WebDAV interface, which allows you to mount the eXist database as an editable filesystem–very handy. In addition, eXist provides REST and XMLRPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.
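
Since I’m more likely to call eXist from a script than from the sandbox, here’s a quick Ruby sketch against the REST interface. The port, the /exist/rest prefix, the _query parameter, and the document path are what I’d expect from a default install–adjust to match yours.

require 'net/http'
require 'uri'

# find speeches mentioning 'love', ordered by speaker
query = "for $s in //SPEECH where $s/LINE[contains(., 'love')] " \
        "order by $s/SPEAKER return <hit>{$s}</hit>"

uri = URI('http://localhost:8080/exist/rest/db/shakespeare/hamlet.xml')
uri.query = URI.encode_www_form('_query' => query)

puts Net::HTTP.get(uri)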

I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style a little bit jarring–but that’s normally the case in an environment that supports templating:

<hits>
{
for $speech in //SPEECH[LINE &= 'love']
return <hit>{$speech}</hit>
}
</hits>

which can generate:

<HITS>
<HIT>
<SPEECH>
<SPEAKER>KING CLAUDIUS</SPEAKER>
<LINE>‘Tis sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your father:</LINE>
<LINE>But, you must know, your father lost a father;</LINE>
<LINE>That father lost, lost his, and the survivor bound</LINE>
<LINE>In filial obligation for some term</LINE>
<LINE>To do obsequious sorrow: but to persever</LINE>
<LINE>In obstinate condolement is a course</LINE>
<LINE>Of impious stubbornness; ’tis unmanly grief;</LINE>
<LINE>It shows a will most incorrect to heaven,</LINE>
<LINE>A heart unfortified, a mind impatient,</LINE>
<LINE>An understanding simple and unschool’d:</LINE>
<LINE>For what we know must be and is as common</LINE>
<LINE>As any the most vulgar thing to sense,</LINE>
<LINE>Why should we in our peevish opposition</LINE>
<LINE>Take it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to nature,</LINE>
<LINE>To reason most absurd: whose common theme</LINE>
<LINE>Is death of fathers, and who still hath cried,</LINE>
<LINE>From the first corse till he that died to-day,</LINE>
<LINE>’This must be so.’ We pray you, throw to earth</LINE>
<LINE>This unprevailing woe, and think of us</LINE>
<LINE>As of a father: for let the world take note,</LINE>
<LINE>You are the most immediate to our throne;</LINE>
<LINE>And with no less nobility of love</LINE>
<LINE>Than that which dearest father bears his son,</LINE>
<LINE>Do I impart toward you. For your intent</LINE>
<LINE>In going back to school in Wittenberg,</LINE>
<LINE>It is most retrograde to our desire:</LINE>
<LINE>And we beseech you, bend you to remain</LINE>
<LINE>Here, in the cheer and comfort of our eye,</LINE>
<LINE>Our chiefest courtier, cousin, and our son.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>For God’s love, let me hear.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>OPHELIA</SPEAKER>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>Ghost</SPEAKER>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the night,</LINE>
<LINE>And for the day confined to fast in fires,</LINE>
<LINE>Till the foul crimes done in my days of nature</LINE>
<LINE>Are burnt and purged away. But that I am forbid</LINE>
<LINE>To tell the secrets of my prison-house,</LINE>
<LINE>I could a tale unfold whose lightest word</LINE>
<LINE>Would harrow up thy soul, freeze thy young blood,</LINE>
<LINE>Make thy two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O, list!</LINE>
<LINE>If thou didst ever thy dear father love–</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>Haste me to know’t, that I, with wings as swift</LINE>
<LINE>As meditation or the thoughts of love,</LINE>
<LINE>May sweep to my revenge.</LINE>
</SPEECH>
</HIT>
</HITS>

Apart from the nitty-gritty of XQuery, David also provided an interesting look at some tricks that eXist uses to make it possible to join tree-based structures. Basically the algorithm numbers the nodes of the tree with identifiers, making an assumption about the maximum number of children beneath a particular node. Practically this means it’s easy to do math to traverse the tree and join subtrees–but a side effect is that lots of ‘ghost nodes’ are created.

Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can actually exhaust the available identifier space, even on a 64-bit machine. An example of an irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful have something like 60 definitions…this causes the identifier space padding to get bloated with tons of ghost nodes.
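
eXist’s real numbering scheme is per-level and more refined than this, but the arithmetic flavor is easy to see. If we pretend every node can have at most K children, parents and children can be computed directly from the identifiers–and every missing child still burns an identifier, which is where the ghost nodes come from. A toy version:

K = 60   # hypothetical worst-case fanout (that one dictionary entry)

def child_id(n, i)   # identifier of node n's i-th child (i is 1-based)
  K * (n - 1) + 1 + i
end

def parent_id(n)     # identifier of node n's parent
  (n - 2) / K + 1
end

# a node with only 3 real children still reserves K identifiers beneath it
node, real_children = 5, 3
puts "children of #{node}: #{(1..real_children).map { |i| child_id(node, i) }.inspect}"
puts "#{K - real_children} ghost identifiers wasted under node #{node}"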

If you are interested in any of this, take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery: The XML Query Language by Michael Brundage for learning more about XQuery. David also said there is work going on at the W3C on extensions for search and update (XQuery Search and Update), which will be good to keep an eye on.

All in all I like XQuery, and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in, say, a Java program much like SQL is used via JDBC–and be able to get back results as JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.


xml spelunking

As part of my day job I’ve been rifling through large foreign XML files–learning the rhyme and reason of the tags used, looking at content, etc. I opened files in Firefox and vim and that was OK–but I like working from the command line. After minimal searching I wasn’t able to find a suitable tool that would simply outline the structure of an XML document the way I wanted–although artunit pointed out Gadget from MIT, which looks like a really wonderful GUI tool to try out. So (predictably) I wrote my own:

biblio:~ ed$ xmltree
Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]

Specific options:
    -d, --depth n                    max levels
    -x, --xpath /foo/bar             xpath to apply
    -c, --content                    include tag content
    -n, --namespaces                 include namespace information
    -h, --help                       show this message

You can use it to list all the elements in a document like this:

biblio:~ ed$ xmltree pmets.xml

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
    descMDcurated
     mdWrap
      xmlData
       PorticoArticleMetadata
        article
... many lines of content removed

Maybe it’s a huge file and you only want to see a few levels in:

biblio:~ed$ xmltree --depth=3 pmets.xml 

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
   div
   div
   div
 structMapMetadata
  div
   mdGroup
   fileGrp

And if you just want to explore a particular node you can use an xpath:

biblio:~ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id
     journal-title
     issn
     issn
    article-meta
     article-id
     article-id
     title-group
      article-title
     contrib-group
      contrib
     pub-date
      day
      month
      year
      string-date
     volume
     issue
     fpage
     page-range
     product
     copyright-year
     copyright-holder
     self-uri

And finally if you want to eyeball the content of the fields you can use the --content option:

biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/
  descMDextracted/mdWrap/xmlData --content sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id='bull'
     journal-title='Bulletin of the American Mathematical Society'
     issn='0273-0979'
     issn='1088-9485'
    article-meta
     article-id='S0273-0979-00-00866-1'
     article-id='10.1090/S0273-0979-00-00866-1'
     title-group
      article-title='Book Review'
     contrib-group
      contrib='David Marker'
     pub-date
      day='02'
      month='03'
      year='2000'
      string-date='02 March 2000'
     volume='37'
     issue='03'
     fpage='351'
     page-range='351-357'
     product='Tame topology and o-minimal structures, by Lou van den Dries,
       Cambridge Univ. Press, New York (1998), x + 180 pp., $39.95,
       ISBN 0-521-59838-9'
     copyright-year='2000'
     copyright-holder='American Mathematical Society'
     self-uri='http://www.ams.org/jourcgi/jour-getitem?pii=S0273-0979-
       00-00866-1'
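
If you’re curious what’s going on under the hood, the heart of a script like this is just a depth-limited recursive walk. Here’s a stripped-down sketch using REXML (not the actual xmltree source, and it skips the xpath and namespace options):

#!/usr/bin/env ruby

require 'rexml/document'

def outline(element, depth = 0, max_depth = nil, show_content = false)
  return if max_depth && depth >= max_depth
  line = ' ' * depth + element.name
  line += "='#{element.text.strip}'" if show_content && element.has_text?
  puts line
  element.each_element { |child| outline(child, depth + 1, max_depth, show_content) }
end

doc = REXML::Document.new(File.read(ARGV[0]))
outline(doc.root)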

Anyhow, if you have a favorite tool for doing this sort of stuff please let me know. If you want to try out xmltree you can grab it out of my subversion repository. You’ll just need a modern Ruby.


professionalism in the age of discontent

After seeing him speak and meeting him a couple of times, I’m a big fan of Adrian Holovaty’s work. He was one of the first people to “mash up” Google Maps at chicagocrime.org; he set the bar for local online media content at lawrence.com; created the Congressional Votes Database at the Washington Post, which allows you to (among other things) get an RSS feed of your representatives’ votes; and created probably the most popular web framework for Python (Django).

But the thing that really impresses me the most about him is how he mixes the roles of technologist and journalist. If you are curious, take a look at the commencement speech he just gave at his alma mater, the University of Missouri’s School of Journalism. Now, if you work in or for libraries/archives (which is likely, given this blog’s focus), just substitute ‘libraries’ for ‘journalism’ as you read the piece. You may be surprised to learn that the field of journalism finds itself in much the same dire straits that librarianship is in:

Then there’s this whole Internet thing – which is clearly evil. Some guy in San Francisco runs a Web site, Craigslist, that lets anybody post a classified ad for free – completely bypassing the newspaper classifieds and, therefore, chipping away at one of newspapers’ most important sources of revenue. Why would I post a classified ad in a newspaper, which charges me money for a tiny ad in which I’m forced to use funky abbreviations just to fit within the word limit, when I can post a free ad to Craigslist, with no space limitation and the ability to post photos, maps and links? Google lets anybody place an ad on search results. Why would I, the consumer, place an ad on TV, radio or in a newspaper, if I can do the same on Google for less money and arguably more reach?

Ahem, Google Scholar or Amazon anyone?

The foundation that you got here is important because it will guide you for the rest of your journalism career. It’s important because, no matter what you do in this industry, it all comes back to that foundation. No matter how the industry changes, no matter how your jobs may change, it all comes back to the core journalism values you’ve learned here at Missouri.

But, most of all, the foundation is important because you need to understand the rules before you can break them. And now, more than ever, this industry needs to break some rules.

You’re going to be the people breaking the rules. You’re going to be the people inventing new ones. You’ll be the person who says, “Hey, let’s try this new way of getting our journalism out to the public.” You’ll be the PR person who says, “Let’s try this new way of public relations that takes advantage of the Internet.” You’ll be the photographer who says, “Wow, quite a few amateur photographers are posting their photos online. Let’s try to incorporate that into our journalism somehow.”

So think about how exciting that is. Rarely is an entire industry in a position such that it needs to completely reinvent itself.

What are the rules of the library profession that we need to break? In my conversations with fellow library technologists we often talk about how the profession needs to be advanced, as if we are uniquely affected by the massive changes in media and information over the last 10 years. I think we should draw some comfort from the fact that we’re not the only ones dealing with this new terrain–even as we kick ourselves in the pants. Perhaps some new professions are being born out of this melange.

Adrian is the type of professional I’d like to be, that’s for sure.


info-uris and opening up library data

I had a few moments to read the info-uri spec during a short flight from DC to Chicago this past weekend. info-uri, aka RFC 4452, is a spec that allows you to create URIs for identifiers in public namespaces.

So what does this mean in practice and why would you want to use one?

If you have a database of stuff you make available on the web, and you have ids for the stuff (say a primary_key on a Stuff table), you essentially have an identifier in a public namespace. Go register the namespace!

So, the LoC assigns identifiers called Library of Congress Control Numbers (LCCNs) to each of its metadata records. Here’s the personal-name authority record (expressed as MADS) that allows works by Tim Berners-Lee to be grouped together:

<?xml version='1.0' encoding='UTF-8'?>
<madsCollection
    xmlns:xlink='http://www.w3.org/1999/xlink'
    xmlns='http://www.loc.gov/mods/v3'
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
    xsi:schemaLocation='http://www.loc.gov/mads
      http://www.loc.gov/standards/mads/mads.xsd'>
  <mads version='beta'>
    <authority>
      <name type='personal' authority='naf'>
        <namePart>Berners-Lee, Tim</namePart>
      </name>
      <titleInfo authority='naf'>
        <title/>
      </titleInfo>
    </authority>
    <variant type='other'>
      <name type='personal'>
        <namePart>Lee, Tim Berners-</namePart>
      </name>
      <titleInfo>
        <title/>
      </titleInfo>
    </variant>
    <variant type='other'>
      <name type='personal'>
        <namePart>Berners-Lee, Timothy J</namePart>
      </name>
      <titleInfo>
        <title/>
      </titleInfo>
    </variant>
    <note type='source'>The WWW virtual library Web site,
      Feb. 15, 1999 about the virtual library (Tim Berners-Lee; creator
      of the Web)</note>
    <note type='source'>OCLC, Feb. 12, 1999 (hdg.: Berners-Lee,
      Tim; usage: Tim Berners-Lee)</note>
    <note type='source'>Gaines, A. Tim Berners-Lee and the
      development of the World Wide Web, 2001: CIP galley
      (Timothy J. Berners-Lee; b. London, England, June 8, 1955)</note>
    <recordInfo>
      <identifier>no 99010609 </identifier>
      <recordContentSource authority='marcorg'>NBL</recordContentSource>
      <recordCreationDate encoding='marc'>990216</recordCreationDate>
      <recordChangeDate encoding='iso8601'>20010716094452.0</recordChangeDate>
      <recordIdentifier>1851704</recordIdentifier>
      <languageOfCataloging>
        <languageTerm authority='iso639-2b' type='code'>eng</languageTerm>
      </languageOfCataloging>
    </recordInfo>
  </mads>
</madsCollection>

In the mads/recordInfo/identifier element you can find the LCCN:

no 99010609

Which can be represented as an info-uri:

info:lccn/no99010609
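
A rough sketch of that transformation in Ruby (real LCCN normalization has a few more rules, but for this record it amounts to trimming and squeezing out the internal space):

lccn = 'no 99010609 '                  # as it appears in the MADS record
normalized = lccn.strip.gsub(/\s+/, '')
puts "info:lccn/#{normalized}"         # => info:lccn/no99010609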

Now why would you ever want to express an LCCN as an info-uri? The LoC has spent a lot of time and effort establishing these personal name and subject authorities. You might want to use a URI like info:lccn/no99010609 to identify Tim Berners-Lee as an individual in your data, so that other people will know who you are talking about and be able to interoperate with you. For example, you can now unambiguously say that Tim Berners-Lee created Weaving the Web:

<info:lccn/99027665> <http://purl.org/dc/elements/1.1/creator>
    <info:lccn/no99010609> .

That was for you, ksclarke :-) Pretty nifty, eh? Now what’s really cool is that, while info-uris aren’t necessarily resolvable (by design), OCLC does have the Linked Authority File, which allows you to look up these records. So tbl’s record can be found here:

http://errol.oclc.org/laf/no99-10609.html

I imagine that this is part of the joint OCLC/LoC/Die Deutsche Bibliothek project to build a Virtual International Authority File…but I’m not totally sure. At any rate, there’s currently no way to drop an lccn info-uri in there and have it resolve to the XML–but that looks like an easy thing to add.

It feels like there is a real opportunity for libraries and archives to offer up their data to the larger web community. How can we make it easy for non-library folks to find and repurpose this data we’ve so assiduously collected over the years?

tbl is encouraging people to give themselves a URI…I wonder if he knew that he (and millions of others) already have one!

Addendum:

If you are interested section 6 of the RFC details the subtle rationale behind why the authors chose to create a new URI scheme rather than:

  1. using an existing URI scheme
  2. creating a new URN namespace

In essence they didn’t want to use an existing URI scheme because they all assume that you should be able to dereference the URI. An example of dereferencing in action can be found when clicking on a link like http://www.yahoo.com where the magic of DNS allows you to find yahoo’s web server and talk to it on port 80 in a predictable way. info-uris are designed to be agnostic as to whether or not the identifier can be dereferenced through a resolver of some kind.

Using URNs was thrown out since URNs are intended to persistently identify information resources, and info-uris are designed to identify persistent namespaces, not the resources themselves. Also, the process of establishing a URN namespace isn’t for the faint of heart, which is evidenced by the short list of them. info-uris, by contrast, have a registrar who will expedite the process of registering a namespace, and there is a framework for publishing validation/normalization rules. The registry is currently run by OCLCRLGBORG^w OCLC on behalf of NISO. So basically you don’t have to write an RFC to register your namespace.


building and ingesting

I prefer using an XML-generating mini-language (elementtree, XML::Writer, REXML, Stan, etc.) to actually writing raw XML. It’s just too easy for me to forget or mistype an end tag, or forget to encode strings properly–and I find all those inline strings or even here-docs make a mess of an otherwise pretty program.

Recently I wanted some code to write FOXML for ingesting digital objects into my Fedora test instance. I’m working in Ruby, so REXML seemed like the best place to start…but after I finished I ran across Builder. The Builder code turned out to be somewhat shorter, much more expressive, and consequently a bit easier to read (for my eyes). Here’s a quick example of how Builder’s API improves on REXML when writing this little chunk of XML:

<dc xmlns='http://purl.org/dc/elements/1.1/'>
  <title>Communication in the Presence of Noise</title>
</dc>

So here’s the REXML code:

dc = REXML::Element.new 'dc'
dc.add_attributes 'xmlns' => 'http://purl.org/dc/elements/1.1/'
title = REXML::Element.new 'title', dc
title.text = 'Communication in the Presence of Noise'

and the Builder code:

x = Builder::XmlMarkup.new 
x.dc 'xmlns' => 'http://purl.org/dc/elements/1.1/' do
  x.title 'Communication in the Presence of Noise'
end

So both are four lines, but look at how the Builder::XmlMarkup object infers the name of the element from the message that is passed to it. Element attributes and content can be set when the element is created–something I wasn’t able to do with REXML. My favorite though is Builder’s use of blocks, so that the hierarchical structure of the code directly mirrors that of the XML content!

So anyway, if you read this far you might actually like to see how a FOXML document can be built and ingested into Fedora–so here goes building the document:

x = Builder::XmlMarkup.new :indent => 2

x.digitalObject 'xmlns' => 'info:fedora/fedora-system:def/foxml#' do

  x.objectProperties do
    x.property 'NAME' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
      'VALUE' => 'FedoraObject'
    x.property 'NAME' => 'info:fedora/fedora-system:def/model#state',
      'VALUE' => 'A'
  end
end


Cataloging at the BBC with RubyOnRails

It’s nice to see that BBC Programme Catalogue (built with RubyOnRails and MySQL) has gone live. Here is some historical background from the about page:

The BBC has been cataloguing and indexing its programmes since the 1920s. The development of the programme catalogue has reflected the changes in the BBC and in broadcasting over the last seventy five years. For example, in the early days of broadcasting, for both Radio and TV, the majority of programmes were broadcast live and were never recorded. There was therefore little point at the time to do extensive cataloguing and indexing of material that did not exist. As you will see, the number of catalogue entries for a day in the 1990s, far exceeds the entries for a day from the 1950s.

As recording technology developed in both mediums, the requirement to keep material for re-use also grew. If material was going to be re-used, it had to be catalogued and indexed. The original records of radio programmes were handwritten into books; over time, card catalogues were developed, and from the mid-1980s onwards there have been computer based catalogues.

This experimental catalogue database holds over 900,000 entries. It is a sub-set of the data from the internal BBC database created and maintained by the BBC’s Information and Archives department. This public version is updated daily as new records are added and updated in the main catalogue. This figure is so high because, for example, each TV news story now has an individual entry in the catalogue.

Talk about sexy retrospective conversion, eh? Hats off to Matt Biddulph and his colleagues. I wish I were going to RailsConf to hear more of the technical details. Actually, if you haven’t already, take a look at the RailsConf program–it looks like it’s going to be a great event.


Fedora/SOAP and Ruby

I’ve been playing around with getting Ruby to talk to the Fedora framework for building digital repositories. Fedora makes its API available as different sets of SOAP services, defined in WSDL files. What follows is a brief howto on getting Ruby to talk to API-A and API-M.

To get basic API-A and API-M clients working you’ll need the following:

  • A modern Ruby: probably >= 1.8.2
  • The latest soap4r: the one that comes standard with 1.8.4 may work, but it emits some warnings when processing the Fedora WSDL files.
  • The latest http-access2, if you plan on doing API-M with basic authentication.
  • A tarball of Ruby classes I generated with wsdl2ruby using the WSDL files in the latest Fedora distribution.

So, assuming you’ve unpacked ruby-fedora.tar.gz, you should be able to go in there and write the following program, which will attempt to connect to a Fedora server at localhost:8080, retrieve the PDF datastream for an object with PID ‘biblio:2’, and write it out to disk. I guess to get it working right you should change the datastream label and PID to something relevant in your repository.

#!/usr/bin/env ruby
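
# NOTE: this is a sketch rather than the original listing. It talks to the
# API-A WSDL directly with soap4r's WSDLDriverFactory instead of the generated
# classes, and the WSDL location, datastream ID and the field names on the
# response are assumptions; check them against your Fedora install.

require 'soap/wsdlDriver'

wsdl = 'http://localhost:8080/fedora/wsdl?api=API-A'   # assumed WSDL location
driver = SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver

# API-A getDatastreamDissemination(pid, dsID, asOfDateTime)
stream = driver.getDatastreamDissemination('biblio:2', 'PDF', nil)

# the returned MIMETypedStream should carry the raw bytes of the datastream
File.open('biblio-2.pdf', 'wb') do |f|
  f.write stream.stream
end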