At work I’ve been doing some experiments with the Fedora repository software. One of the strengths of Fedora is that it is fundamentally designed as a set of extensible web services. At first I set about becoming familiar with the set of web services and decided that Ruby would be a useful and lightweight language to do this from. Sure enough, Ruby was plenty capable of putting stuff into Fedora and getting stuff back out again.
As time went on it became clear that what was really needed was a layer of abstraction around this Fedora web services API that would allow it (or another repository framework) to be used in a programmatic way without having to make SOAP calls and build FOXML all over the place. Typically in software pattern lingo this is referred to as a facade.
So I worked on creating a facade, and ended up with something I half-jokingly called ‘bitbucket’ which looks something like this:
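Something along these lines (a simplified sketch–the real thing talks to Fedora over its web services rather than an in-memory hash, and the class and method names here are purely illustrative):

```ruby
# A simplified, hypothetical sketch: the real 'bitbucket' talks to
# Fedora over SOAP/REST, while this stand-in just uses an in-memory
# hash. The point is the shape of the facade, not the plumbing.
class BitBucket
  def initialize(host, port)
    @host, @port = host, port
    @objects = {}   # stand-in for the Fedora repository
  end

  # store some content (plus optional metadata) and get back an id
  def put(content, metadata = {})
    id = "bitbucket:#{@objects.size + 1}"
    @objects[id] = { :content => content, :metadata => metadata }
    id
  end

  # fetch the content back out by its identifier
  def get(id)
    obj = @objects[id] or raise "no such object: #{id}"
    obj[:content]
  end
end

bucket = BitBucket.new('localhost', 8080)
id = bucket.put('hello fedora', :title => 'a test object')
puts bucket.get(id)   # prints: hello fedora
```

The nice thing about a facade like this is that the caller never sees SOAP or FOXML at all–the same interface could front a completely different repository.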
Here’s to making sure that code4libcon 2007 is a watershed moment for women library technologists.
code4libcon 2006 in Corvallis wasn’t all male, but it was largely…and I can only remember two women speaking to the audience. To a large extent code4libcon was modeled after technology conferences like yapc, pycon, oscon, barcamp, etc–which have much the same sort of ratio. But libraries are different because the majority of people who work in libraries are women. So it was a bit surprising that more women didn’t end up at code4libcon 2006.
2006 did get organized practically overnight by a very small (male) clique in an irc room (that’s not always well behaved, but means well–hey it’s IRC). When people actually started signing up and sending in papers to the more formal discussion list I think we were all kind of surprised. I seriously thought we were just going to be hanging out in some random space with free wifi, and it turned into this really successful event.
Some folks like Dan Chudnov, Art Rhyno, Jeremy Frumkin and Roy Tennant started thinking and talking early about making the conference appeal to women library technologists. But it seems that the voting (open to all, but for some reason all men) somehow subconsciously counteracted this.
AFAIK the keynote voting is still going on, and I imagine you can still suggest speakers. There will only be more voting to do as we get into selecting presenters. If you’d like to participate just email Brad LaJeunesse and he’ll hook you up with a backpack login. Also, sign up for the code4lib and code4libcon discussion lists. Luckily Dorothea Salo is involved and vocal and I’m hoping that other women technologists will get involved too. This is a grassroots thing after all, not some sort of LITA top-tech trends panel. It’ll become whatever we want it to be.
It’s nice to see that Dr. King’s papers found a home at his alma mater–and won’t be locked away in somebody’s safe.
So the big news for me at JCDL was Carl Lagoze’s presentation on the state of the NSDL in his paper Metadata Aggregation and “Automated Digital Libraries”: A Retrospective on the NSDL Experience. It ended up winning the Vannevar Bush Best Paper Award so I guess it was of interest to some other folks. I highly recommend printing out the paper for a train ride home if you are at all interested in digital libraries and metadata.
The paper is essentially a review of a three year effort to build a distributed digital library using the OAI-PMH protocol. Lagoze’s pain in relating some of the findings was audible and understandable given his involvement in OAI-PMH…and the folks seated before him.
The findings boil down to a few key points:
The OAI-PMH isn’t low barrier enough
Overall success rate for OAI-PMH harvesting has been 64%. I took the liberty of pulling out this graphic that Carl put up on the big screen at JCDL:
The reason for harvest failure varied but centered around XML encoding problems (character sets, well-formedness, schema validation) as well as improper use of datestamps and resumption tokens. These errors are often unique to a particular harvest request–so that a repository that has passed “validation” can often turn around and randomly emit bad XML. So perhaps more than OAI-PMH not being low-barrier enough we are really talking about XML not being low barrier enough eh?
This issue actually came up in the Trusted Repositories workshop after JCDL (which I’ll write about eventually) when Simeon Warner stated directly (in MacKenzie Smith’s direction) that off the shelf repositories like DSpace should be engineered in such a way that they are unable to emit invalid/ill-formed XML. MacKenzie responded by saying that as far as she knew the issues had been resolved, but that she is unable to force people to upgrade their software. This situation sounds similar to the bind that M$ finds itself in–but at least with Windows you are prompted to upgrade your software…I wonder if DSpace has anything like that.
So anyway–bad XML. Perhaps on the flipside brittle XML tools are part of the problem…and the fact that folks are generating XML by hand instead of using tools such as Builder.
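For instance, with a generator library even nasty characters come out well-formed. A quick sketch using Ruby’s bundled REXML (Builder works along the same lines):

```ruby
require 'rexml/document'

# Generating XML with a library instead of string concatenation makes
# it hard to emit ill-formed output: reserved characters in text get
# escaped automatically on serialization.
doc = REXML::Document.new
doc.add_element('record')
title = doc.root.add_element('title')
title.text = 'Cats & Dogs'   # the ampersand gets escaped on output

out = ''
doc.write(out)
puts out   # the & comes out as &amp;
```

Hand-rolled string interpolation, by contrast, would have happily emitted the bare ampersand and broken every downstream harvester.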
The other rubs were datestamps and resumption tokens which I got to hear about extensively in Simeon’s OAI-PMH tutorial. The point being that there are these little wrinkles to creating a data provider which don’t sound like much; but when multiplied out to many nodes can result in an explosion of administrative emails (170 messages per provider, per year) for the central hub. This amounts to a lot of (wo)man hours lost for both the central hub and the data providers. It makes one wonder at how lucky/brilliant tbl was in creating protocols which scaled massively to the web we know today…with a little help from his friends.
Good metadata is hard
…or at least harder than expected. Good metadata requires domain expertise, metadata expertise and technical expertise–and unfortunately the NSDL data providers typically lacked people or a team with these skills (aka library technologists).
Essentially the NSDL was banking that the successful Union Catalog model (WorldCat) could be re-expressed using OAI-PMH and a minimalist metadata standard (oai_dc)…but WorldCat has professionally trained librarians and NSDL did not. oai_dc was typically too low-resolution to be useful, and only 50% of the collections used the recommended nsdl_dc qualified DC…which made it less than useful at the aggregator level. Furthermore only 10% of data providers even provided another type of richer metadata.
The NSDL team expended quite a bit of effort building software for scrubbing and transforming the oai_dc but:
In the end, all of these transforms don’t enhance the richness of the information in the metadata. Minimally descriptive metadata, like Dublin Core, is still minimally descriptive even after multiple quality repairs. We suggest that the time spent on such format-specific transforms might be better spent on analysis of the resource itself–the source of all manner of rich information.
which brings us to…
Resource-centric (rather than metadata-centric) DL systems are the future
OAI-PMH by definition is essentially a protocol for sharing metadata. The system that the NSDL built is centered around a Metadata Repository which is essentially an RDBMS with a Dublin Core metadata record at its core. Various services are built up around the MR including a service for Search and an Archiving service.
However as the NSDL has started to build out other digital library services they’ve discovered problems such as multiple records for the same resource. But more importantly they want to build services that provide context to resources, and the current MR model puts metadata at the center rather than the actual resource.
In the future, we also want to express the relationships between resources and other information, such as annotations and standards alignments…we wish to inter-relate resources themselves, such as their co-existence within a lesson plan or curriculum.
So it seems to me the NSDL folks have decided that the Union Catalog approach just isn’t working out, and that the resource itself needs to be moved into the center of the picture. At the end of the day, the same is true of libraries–where the most important thing is the content that is stored there–not the organization of it. More information about this sea change can be found in An Information Network Overlay Architecture for the NSDL, and in the work on Pathways, which was also discussed and which I hope to summarize here in the coming days.
I think it’s important to keep these findings in the context of the NSDL enterprise. For example the folks at lanl.gov are using OAI-PMH heavily in their aDORe framework. Control of data providers is implicit in the aDORe framework–so XML and metadata quality problems are mitigated. Furthermore aDORe is resource centric since MPEG21 surrogates for the resources themselves are being sent over OAI-PMH. But it does seem somehow odd that a metadata protocol is being overloaded to transfer objects…but that raises a philosophical question about metadata which I’m not even going to entertain right now.
I think it’s useful to compare the success of oai-pmh with the success of the www. Consider this thought experiment…Imagine that early web browsers (lynx/mosaic/netscrape/etc) required valid HTML in order to display a page. If the page was invalid you would get nothing but a blank page. If you really needed to view the page you’d have to email the document author and tell them to fix an unclosed tag, or a particular character encoding. Do you think that the web would’ve still propagated at the speed that it did?
fault_tolerant_xml_tools + using_xml_generator_libraries == happiness
Folks at Cornell are doing some fun stuff with Internet Archive data. William Arms presented Building a Research Library for the History of the Web at JCDL last week which summarized some of the architectural decisions they had to make in designing a system for mirroring and providing access to 240 terabytes of web content. Their goal is both to function as a full mirror of IA and to build tools that allow social science and computer science researchers to use this data.
A few interesting tidbits include:
Rather than building a distributed system for processing the data (which is what IA and Google have) they went with a symmetric multi-processor. Not just any kind of multi-processor mind you but two dedicated Unisys ES7000/430 each with 16 Itanium2 processors running at 1.5 Gigahertz with 16GB of RAM. The argument was that the very high data rates made this architecture more palatable. The kicker for me was that they are using Microsoft Windows Server 2003 as the operating system. But it gets weirder.
The system’s pre-load system extracts useful metadata from ARC files and then stores this in a relational database, while saving off the actual content to a separate Page Store. The Page Store has some intelligence in it which uses an MD5 checksum to figure out if the content has changed; it also provides a layer of abstraction that will allow some content to be stored offline on tapes, etc. Apparently IA stores redundant data quite a bit, and Cornell will be able to save a significant amount of disk space if they de-dupe. Arms detailed the trade offs with using a relational db, namely that they had to get the schema right because if they decided to change it down the road it would require a complete pass over the content again. Ok, so the weirder part is that they are using SQLServer 2000 as the RDBMS.
They have created web-service and high-performance clients for extracting data from the archive so that cpu-intensive research operations can be performed locally instead of on the main server. I’d be interested to learn more about the high-performance clients since we’ve been keen to have file-system-like clients in the repository we are building at the LoC. Among the more interesting things the extractors can do is extracting the sub-graph of a particular node on the web.
They have a retro-browser which (from the paper) sounds like an interesting http-proxy which turns any old browser into a time-machine. It performs a similar function as the way-back machine, but sounds a lot cooler.
Full-text indexing is initially being done using Nutch on an extracted subset of nodes. However Cornell is investigating the use of NutchWAX for providing fulltext indexes. NutchWAX was written by Doug Cutting for working directly with IA ARC Files. It also has the ability to distribute indexing–which seems counter to the non-distributed nature of this system at Cornell…but there you go.
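The checksum trick in the Page Store is simple enough to sketch (hypothetical code, not Cornell’s–just the idea of only storing content whose MD5 hasn’t been seen before):

```ruby
require 'digest/md5'

# A toy sketch of checksum-based de-duplication like the Page Store's:
# content is keyed by its MD5, and an identical page fetched a second
# time is recognized and not stored again.
class PageStore
  def initialize
    @pages = {}   # checksum => content
  end

  # returns the checksum and whether the content was already stored
  def store(content)
    key = Digest::MD5.hexdigest(content)
    duplicate = @pages.key?(key)
    @pages[key] = content unless duplicate
    [key, duplicate]
  end
end

store = PageStore.new
key, dup = store.store('<html>same page</html>')
puts dup   # false: first copy gets stored
key, dup = store.store('<html>same page</html>')
puts dup   # true: identical content, nothing new written
```

Given how often IA recrawls unchanged pages, it’s easy to see how this adds up to serious disk savings.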
I’ve learned from my colleague Andy Boyko that the Library of Congress has been doing similar work with IA…and has been doing other work archiving the world wild web. I imagine my other team members have already been exposed to the work Cornell has been doing in this area, but it was useful for me to learn more. It’s important work–as Arms said:
Everyone with an interest in the history of the Web must be grateful to Brewster Kahle for his foresight in preserving the content of the Web for future generations…
JCDL2006 was chock-a-block full of good content. A set of papers presented on the first day in the Named Entities track explored a common theme of applying graph theory to citation networks in order to cluster works by the same author. For example an author name may appear as Daniel Chudnov, D Chudnov, Dan Chudnov. There is also a similar problem when two authors with the same name are actually two different people. Being able to group all the works by an author is very important for good search interfaces…and also for calculating citation counts and impact factors.
The most interesting paper in the bunch (for me) was Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation. This paper revolves around the hypothesis that authors tend to cite their own works more frequently than others–so called ‘self-citation’. Self-citation isn’t the result of navel gazing or self-promotion so much as it is the result of researchers building on the work that they’ve done previously. In addition to self-citation graphs, co-authorship and source URL graphs are also used to build a graph of a particular author’s works.
The paper reports some good precision/recall figures (.997/.818), which points to the value of using self-citation for name clustering. This paper, and some growing interest I have in RDF and Jena, have made me realize that I’d like to spend a bit of time over the coming year learning about graph theory.
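Speaking of graph theory, the core intuition is easy to sketch as connected components over a citation graph (a toy version of my own–the actual system also folds in co-authorship and source URL evidence):

```ruby
# A toy sketch of the self-citation intuition: treat each work as a
# node, add an edge when one work cites another attributed to a
# similar author name, and take connected components as candidate
# "same author" clusters. Union-find does the component grouping.
def clusters(works, citation_edges)
  parent = Hash[works.map { |w| [w, w] }]
  find = lambda do |w|
    parent[w] = find.call(parent[w]) if parent[w] != w  # path compression
    parent[w]
  end
  citation_edges.each { |a, b| parent[find.call(a)] = find.call(b) }
  works.group_by { |w| find.call(w) }.values
end

works = ['paper1', 'paper2', 'paper3', 'paper4']
edges = [['paper2', 'paper1'], ['paper3', 'paper2']]  # self-citations
puts clusters(works, edges).inspect
```

Here paper1, paper2 and paper3 fall into one cluster via the chain of self-citations, and paper4 stands alone.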
I couldn’t pass up the opportunity to hear Simeon Warner talk about oai-pmh in the second group of tutorials. I’ve implemented data and service providers before–so I consider myself fairly knowledgeable about the protocol. But Simeon works with this stuff constantly on arXiv at Cornell, so I was certain there would be things to learn from him…he did not disappoint.
Using some recent work at NSDL, and his experience with the protocol, Simeon provided some really useful advice on using oai-pmh. Here are the things I picked up on:
avoid using sets–especially overloading sets to do searches. There is an interesting edge case in the protocol when a record moves from a particular set A to set B, which makes harvesters who are harvesting set A totally miss the update.
pay attention to datestamps. Make sure datestamps are assigned to records when they actually change in the repository or else harvesters can miss updates. The protocol essentially is a way of exposing updates, so getting the datestamps right is crucial.
resumption tokens need to be idempotent. This means a harvester should be able to use the resumption token more than once and get the same result (barring updates to the repository). This is essential so that harvesters engaged in a lengthy harvest can recover from network failure and other exceptions.
pay attention to character encoding. Use a parser that decodes character entities in XML and store the utf8. This will make your life simpler as you layer new services over harvested data. Make sure that HTML entities aren’t used in oai-pmh responses. utf8conditioner is Simeon’s command line app for debugging utf8 data.
be aware of the two great myths of oai-pmh: the myth that oai-pmh only allows exposure of DC records, and the myth that oai-pmh only allows a single metadata format to be exposed.
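The resumption token advice is easier to see in code. Here’s a sketch of the harvesting loop (the fetch step is faked with a lambda; a real client would do an HTTP GET with the token and parse the XML response):

```ruby
# A sketch of the resumption-token loop at the heart of OAI-PMH
# harvesting. Because tokens are idempotent, a harvester that dies
# mid-harvest can simply re-issue the last token it saw and pick up
# where it left off.
def harvest(fetcher)
  records = []
  token = nil
  loop do
    page = fetcher.call(token)        # nil token means the first request
    records.concat(page[:records])
    token = page[:resumption_token]
    break if token.nil?               # no token means the list is complete
  end
  records
end

# a fake data provider that serves its records in two pages
pages = {
  nil     => { :records => ['rec1', 'rec2'], :resumption_token => 'page2' },
  'page2' => { :records => ['rec3'],         :resumption_token => nil }
}
puts harvest(lambda { |token| pages[token] }).inspect   # ["rec1", "rec2", "rec3"]
```

If the provider had handed back a one-shot token instead, any network hiccup on page two would force the harvester to start the whole list over–exactly the wrinkle that generates all those administrative emails.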
There are lots more recommendations at NSDL, but it was useful to have this overview and have the chance to ask Simeon questions. For example even though oai-pmh requires records to have an XML schema, it would be possible to create a wrapping schema for freeform data like RDF.
The main reason I was interested in this tutorial was to hear more about using oai-pmh to distribute not only resource metadata records, but also the resources themselves. There were a couple of initial problems with using the protocol to provide access to the actual resources.
The first is that identifiers such as URLs in metadata records which point to the resource for capture had too much ambiguity. Some of the URLs point at splash screens where someone could download a particular flavor of the resource, others went directly to a PDF, etc. This made machine harvesting data-provider specific. In addition there is a problem with the datestamp semantics when a remote resource changes–when the resource is updated but the metadata stays the same, the datestamp is not required to change. This makes it impossible for harvesters to know when they need to download the resource again.
Fortunately there was a solution that is detailed more fully in a paper written by Simeon and the usual suspects. It boils down to actually distributing the resource as a metadata format. This plays a little bit with what metadata itself is…but it makes the two previously mentioned problems disappear. Simeon gave a brief overview of MPEG21 DIDL but was keen to point out that METS and other packaging formats can work the same way. Using oai-pmh in this way is really interesting to me since it enables repositories to share actual objects with each other–with OAI-PMH working almost like an ingestion protocol.
I asked about mechanisms to autodiscover oai-pmh metadata in HTML, like unAPI. Simeon pointed out that the usual suspects are actually extending/refining this idea a bit in some recent work done with the Andrew W. Mellon Foundation on interoperability. Apparently they’ve experimented with the LiveClipboard idea in support of some of this work. More on this later.
So my new employer was kind enough to send me to Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned–rather than waiting till the end to write up a report.
I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink w3c working groups, teaches at Brown, has over 20 years experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands on demonstration of XQuery using the eXist xml database.
About the first half was spent going over the syntax of XQuery which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native xml databases quite a bit on #code4lib, so I really was looking forward to learning more about it from a practical perspective.
I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up and imported a body of test xml data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the webdav interface which allows you to mount the eXist database as an editable filesystem, which is very handy. In addition eXist provides REST and XMLRPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.
I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style a little bit jarring–but that’s normally the case in an environment that supports templating:
for $speech in //SPEECH[LINE &= 'love']
return $speech/LINE
which can generate:
<LINE>‘Tis sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your father:</LINE>
<LINE>But, you must know, your father lost a father;</LINE>
<LINE>That father lost, lost his, and the survivor bound</LINE>
<LINE>In filial obligation for some term</LINE>
<LINE>To do obsequious sorrow: but to persever</LINE>
<LINE>In obstinate condolement is a course</LINE>
<LINE>Of impious stubbornness; ’tis unmanly grief;</LINE>
<LINE>It shows a will most incorrect to heaven,</LINE>
<LINE>A heart unfortified, a mind impatient,</LINE>
<LINE>An understanding simple and unschool’d:</LINE>
<LINE>For what we know must be and is as common</LINE>
<LINE>As any the most vulgar thing to sense,</LINE>
<LINE>Why should we in our peevish opposition</LINE>
<LINE>Take it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to nature,</LINE>
<LINE>To reason most absurd: whose common theme</LINE>
<LINE>Is death of fathers, and who still hath cried,</LINE>
<LINE>From the first corse till he that died to-day,</LINE>
<LINE>’This must be so.’ We pray you, throw to earth</LINE>
<LINE>This unprevailing woe, and think of us</LINE>
<LINE>As of a father: for let the world take note,</LINE>
<LINE>You are the most immediate to our throne;</LINE>
<LINE>And with no less nobility of love</LINE>
<LINE>Than that which dearest father bears his son,</LINE>
<LINE>Do I impart toward you. For your intent</LINE>
<LINE>In going back to school in Wittenberg,</LINE>
<LINE>It is most retrograde to our desire:</LINE>
<LINE>And we beseech you, bend you to remain</LINE>
<LINE>Here, in the cheer and comfort of our eye,</LINE>
<LINE>Our chiefest courtier, cousin, and our son.</LINE>
<LINE>For God’s love, let me hear.</LINE>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the night,</LINE>
<LINE>And for the day confined to fast in fires,</LINE>
<LINE>Till the foul crimes done in my days of nature</LINE>
<LINE>Are burnt and purged away. But that I am forbid</LINE>
<LINE>To tell the secrets of my prison-house,</LINE>
<LINE>I could a tale unfold whose lightest word</LINE>
<LINE>Would harrow up thy soul, freeze thy young blood,</LINE>
<LINE>Make thy two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O, list!</LINE>
<LINE>If thou didst ever thy dear father love–</LINE>
<LINE>Haste me to know’t, that I, with wings as swift</LINE>
<LINE>As meditation or the thoughts of love,</LINE>
<LINE>May sweep to my revenge.</LINE>
Apart from the nitty gritty of XQuery, David also provided an interesting look at some tricks that eXist uses to make it possible to join tree-based structures. Basically the algorithm indexes the nodes of the tree with numeric identifiers, making an assumption about the number of children beneath any particular node. Practically this means it’s easy to do math to traverse the tree and join subtrees–but a side effect is that lots of ‘ghost nodes’ are created.
Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can actually easily exhaust the available identifier space, even on a 64bit machine. An example of an irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful have 60 or so…this causes the identifier space padding to get bloated with tons of ghost nodes.
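The numbering scheme is basically k-ary heap arithmetic. A toy version of my own (not eXist’s actual code) shows why traversal is just math, and where the ghost slots come from:

```ruby
# A toy version of the numbering idea: if every node is assumed to
# have at most K children, parent and child identifiers can be
# computed with arithmetic alone, like a K-ary heap numbered from 0
# at the root. Slots for children that don't actually exist
# ("ghost nodes") still consume identifier space.
K = 3   # assumed maximum fanout

def children(id)
  (1..K).map { |n| id * K + n }
end

def parent(id)
  (id - 1) / K
end

puts children(0).inspect   # the root's K child slots: [1, 2, 3]
puts parent(7)             # 7 is a child of node 2
```

If a node only has one real child, the other K-1 identifiers in its range are ghosts–and when K has to be sized for the worst-case fanout (that 60-definition dictionary entry), the ghosts multiply fast.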
If you are interested in any of this take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery: The XML Query Language by Michael Brundage for learning more about XQuery. In the future David said there is work going on at the W3C on extensions for search and update: XQuery Search and Update, which will be good to keep an eye on.
All in all I like XQuery and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in say a Java program much like SQL is used via JDBC–and be able to get back results say as JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.
As part of my day job I’ve been rifling through large foreign XML files–learning the rhyme and reason of tags used, looking at content, etc… I opened files in Firefox and vim and that was OK–but I like working from the command line. After minimal searching I wasn’t able to find a suitable tool that would simply outline the structure of an xml document in the way I wanted–although artunit pointed out Gadget from MIT which looks like a really wonderful GUI tool to try out. So (predictably) I wrote my own:
biblio:~ ed$ xmltree
Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]
-d, --depth n max levels
-x, --xpath /foo/bar xpath to apply
-c, --content include tag content
-n, --namespaces include namespace information
-h, --help show this message
You can use it to list all the elements in a document like this:
biblio:~ ed$ xmltree pmets.xml
... many lines of content removed
Maybe it’s a huge file and you only want to see a few levels in:
biblio:~ed$ xmltree --depth=3 pmets.xml
And if you just want to explore a particular node you can use an xpath:
biblio:~ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData sample.pmets
And finally if you want to eyeball the content of the fields you can use the –content option:
biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/
descMDextracted/mdWrap/xmlData --content sample.pmets
journal-title='Bulletin of the American Mathematical Society'
string-date='02 March 2000'
product='Tame topology and o-minimal structures, by Lou van den Dries,
Cambridge Univ. Press, New York (1998), x + 180 pp., $39.95,
copyright-holder='American Mathematical Society'
Anyhow, if you have a favorite tool for doing this sort of stuff please let me know. If you want to try out xmltree you can grab it out of my subversion repository. You’ll just need a modern Ruby.
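Under the hood there isn’t much to it. The heart of a tool like this is a short recursive walk–here’s a minimal sketch with REXML (not the actual xmltree source):

```ruby
require 'rexml/document'

# A minimal sketch (not the actual xmltree source) of the recursive
# walk at the heart of a tool like this: collect each element name,
# indented by its depth, optionally stopping at a maximum depth.
def outline(element, depth = 0, max_depth = nil)
  return [] if max_depth && depth >= max_depth
  lines = ['  ' * depth + element.name]
  element.elements.each { |child| lines += outline(child, depth + 1, max_depth) }
  lines
end

doc = REXML::Document.new('<mets><structMap><div/></structMap></mets>')
puts outline(doc.root)
```

The --depth and --xpath options just fall out of passing a max depth or starting the walk at the result of an XPath query instead of the document root.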
After seeing him speak and meeting him a couple times I’m a big fan of Adrian’s work. He was one of the first people to “mash up” google maps at chicagocrime.org; has set the bar for local online media content at lawrence.com; created the Congressional Votes Database at the washingtonpost which allows you to (among other things) get an RSS feed for your representatives votes; and has created probably the most popular web framework for python.
But the thing that really impresses me the most about him is how he mixes the roles of technologist and journalist. If you are curious take a look at the commencement speech he just gave at his alma mater, University of Missouri’s School of Journalism. Now if you work in/for libraries/archives (which is likely given this blog’s focus) just substitute ‘libraries’ for ‘journalism’ as you read the piece. You may be surprised to learn that the field of Journalism finds itself in much the same dire straits that Librarianship is in:
Then there’s this whole Internet thing – which is clearly evil. Some guy in San Francisco runs a Web site, Craigslist, that lets anybody post a classified ad for free – completely bypassing the newspaper classifieds and, therefore, chipping away at one of newspapers’ most important sources of revenue. Why would I post a classified ad in a newspaper, which charges me money for a tiny ad in which I’m forced to use funky abbreviations just to fit within the word limit, when I can post a free ad to Craigslist, with no space limitation and the ability to post photos, maps and links? Google lets anybody place an ad on search results. Why would I, the consumer, place an ad on TV, radio or in a newspaper, if I can do the same on Google for less money and arguably more reach?
Ahem, Google Scholar or Amazon anyone?
The foundation that you got here is important because it will guide you for the rest of your journalism career. It’s important because, no matter what you do in this industry, it all comes back to that foundation. No matter how the industry changes, no matter how your jobs may change, it all comes back to the core journalism values you’ve learned here at Missouri.
But, most of all, the foundation is important because you need to understand the rules before you can break them. And now, more than ever, this industry needs to break some rules.
You’re going to be the people breaking the rules. You’re going to be the people inventing new ones. You’ll be the person who says, “Hey, let’s try this new way of getting our journalism out to the public.” You’ll be the PR person who says, “Let’s try this new way of public relations that takes advantage of the Internet.” You’ll be the photographer who says, “Wow, quite a few amateur photographers are posting their photos online. Let’s try to incorporate that into our journalism somehow.”
So think about how exciting that is. Rarely is an entire industry in a position such that it needs to completely reinvent itself.
What are the rules of the library profession that we need to break? In my conversations with fellow library technologists we often talk about how the profession needs to be advanced, as if we are uniquely affected by the massive changes in media/information of the last 10 years. I think we should draw some comfort from the fact that we’re not the only ones dealing with this new terrain–as we kick ourselves in the pants. Perhaps some new professions are being born out of this melange.
Adrian is the type of professional I’d like to be, that’s for sure.