Archive for June, 2006

nice work atlanta

Tuesday, June 27th, 2006

It’s nice to see that Dr. King’s papers found a home at his alma mater–and won’t be locked away in somebody’s safe.

oai-pmh revisited

Wednesday, June 21st, 2006

So the big news for me at JCDL was Carl Lagoze’s presentation on the state of the NSDL in his paper Metadata Aggregation and “Automated Digital Libraries”: A Retrospective on the NSDL Experience. It ended up winning the Vannevar Bush Best Paper Award so I guess it was of interest to some other folks. I highly recommend printing out the paper for a train ride home if you are at all interested in digital libraries and metadata.

The paper is essentially a review of a three year effort to build a distributed digital library using the OAI-PMH protocol. Lagoze’s pain in relating some of the findings was audible and understandable given his involvement in OAI-PMH…and the folks seated before him.

The findings boil down to a few key points:

The OAI-PMH isn’t low barrier enough

Overall success rate for OAI-PMH harvesting has been 64%. I took the liberty of pulling out this graphic that Carl put up on the big screen at JCDL:

The reason for harvest failure varied but centered around XML encoding problems (character sets, well-formedness, schema validation) as well as improper use of datestamps and resumption tokens. These errors are often unique to a particular harvest request–so that a repository that has passed “validation” can often turn around and randomly emit bad XML. So perhaps more than OAI-PMH not being low-barrier enough we are really talking about XML not being low barrier enough eh?

This issue actually came up in the Trusted Repositories workshop after JCDL (which I’ll write about eventually) when Simeon Warner stated directly (in Mackenzie Smith’s direction) that off-the-shelf repositories like DSpace should be engineered in such a way that they are unable to emit invalid/ill-formed XML. Mackenzie responded by saying that as far as she knew the issues had been resolved, but that she is unable to force people to upgrade their software. This situation sounds similar to the bind that M$ finds themselves in–but at least with Windows you are prompted to upgrade your software…I wonder if DSpace has anything like that.

So anyway–bad XML. Perhaps on the flipside brittle XML tools are part of the problem…and the fact that folks are generating XML by hand instead of using tools such as Builder.
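To make the by-hand versus generator-library point concrete, here is a minimal sketch. Builder is a Ruby library; Python's standard xml.etree.ElementTree stands in for it here, and the record/title element names are made up for illustration rather than taken from any OAI-PMH schema.

```python
import xml.etree.ElementTree as ET


def record_by_hand(title):
    # String pasting: nothing escapes '&' or '<' inside the title.
    return "<record><title>" + title + "</title></record>"


def record_with_library(title):
    # The library escapes reserved characters when it serializes the tree.
    record = ET.Element("record")
    ET.SubElement(record, "title").text = title
    return ET.tostring(record, encoding="unicode")


def is_well_formed(xml_text):
    # True when the standard parser accepts the text, False otherwise.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

The hand-built version works fine until a title contains an ampersand or an angle bracket, which is exactly the kind of error that shows up only on some harvest requests.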

The other rubs were datestamps and resumption tokens which I got to hear about extensively in Simeon’s OAI-PMH tutorial. The point being that there are these little wrinkles to creating a data provider which don’t sound like much; but when multiplied out to many nodes can result in an explosion of administrative emails (170 messages per provider, per year) for the central hub. This amounts to a lot of (wo)man hours lost for both the central hub and the data providers. It makes one wonder at how lucky/brilliant tbl was in creating protocols which scaled massively to the web we know today…with a little help from his friends.

Good metadata is hard

…or at least harder than expected. Good metadata requires domain expertise, metadata expertise and technical expertise–and unfortunately the NSDL data providers typically lacked people or a team with these skills (aka library technologists).

Essentially the NSDL was banking that the successful Union Catalog model (WorldCat) could be re-expressed using OAI-PMH and a minimalist metadata standard (oai_dc)…but WorldCat has professionally trained librarians and NSDL did not. oai_dc was typically too low-resolution to be useful, and only 50% of the collections used the recommended nsdl_dc qualified DC…which made it less than useful at the aggregator level. Furthermore only 10% of data providers even provided another type of richer metadata.

The NSDL team expended quite a bit of effort building software for scrubbing and transforming the oai_dc but:

In the end, all of these transforms don’t enhance the richness of the information in the metadata. Minimally descriptive metadata, like Dublin Core, is still minimally descriptive even after multiple quality repairs. We suggest that the time spent on such format-specific transforms might be better spent on analysis of the resource itself–the source of all manner of rich information.

which brings us to…

Resource-centric (rather than metadata-centric) DL systems are the future

OAI-PMH by definition is essentially a protocol for sharing metadata. The system that the NSDL built is centered around a Metadata Repository which is essentially an RDBMS with a Dublin Core metadata record at its core. Various services are built up around the MR including a service for Search and an Archiving service.

However as the NSDL has started to build out other digital library services they’ve discovered problems such as multiple records for the same resource. But more importantly they want to build services that provide context to resources, and the current MR model puts metadata at the center rather than the actual resource.

In the future, we also want to express the relationships between resources and other information, such as annotations and standards alignments…we wish to inter-relate resources themselves, such as their co-existence within a lesson plan or curriculum.

So it seems to me the NSDL folks have decided that the Union Catalog approach just isn’t working out, and that the resource itself needs to be moved into the center of the picture. At the end of the day, the same is true of libraries–where the most important thing is the content that is stored there–not the organization of it. More information about this sea change can be found in An Information Network Overlay Architecture for the NSDL, and in the work on Pathways, which was discussed and I hope to summarize here in the coming days.

So…

I think it’s important to keep these findings in the context of the NSDL enterprise. For example the folks at lanl.gov are using OAI-PMH in their aDORe framework heavily. Control of data providers is implicit in the aDORe framework–so XML and metadata quality problems are mitigated. Furthermore aDORe is resource centric since MPEG21 surrogates for the resources themselves are being sent over OAI-PMH. But it does seem somehow odd that a metadata protocol is being overridden to transfer objects…but that begs a philosophical question about metadata which I’m not even going to entertain right now.

I think it’s useful to compare the success of oai-pmh with the success of the www. Consider this thought experiment…Imagine that early web browsers (lynx/mosaic/netscrape/etc) required valid HTML in order to display a page. If the page was invalid you would get nothing but a blank page. If you really needed to view the page you’d have to email the document author and tell them to fix an unclosed tag, or a particular character encoding. Do you think that the web would’ve still propagated at the speed that it did?

fault_tolerant_xml_tools + using_xml_generator_libraries == happiness
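As a sketch of the fault-tolerant half of that equation, here is one small thing a harvester could forgive: HTML named entities (like &eacute;) that XML parsers reject because they are undefined. This is my own assumption about a reasonable repair, not something described in the paper, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    # Swap HTML-only named entities for the characters they stand for.
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return chr(name2codepoint[name])

    return ENTITY.sub(swap, text)


def parse_tolerantly(text):
    # Try a strict parse first; retry once after repairing HTML entities.
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # Any other well-formedness problem still raises ParseError here.
        return ET.fromstring(replace_html_entities(text))
```

It only papers over one class of error, but it is the browser attitude in miniature: display what you can instead of showing a blank page.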

archiving the web

Monday, June 19th, 2006

Folks at Cornell are doing some fun stuff with Internet Archive data. William Arms presented Building a Research Library for the History of the Web at JCDL last week which summarized some of the architectural decisions they had to make in designing a system for mirroring and providing access to 240 terabytes of web content. Their goal is to function as both a full mirror of IA, and to build tools that allow social science and computer science researchers to use this data.

A few interesting tidbits include:

  • Rather than building a distributed system for processing the data (which is what IA and Google have) they went with a symmetric multi-processor. Not just any kind of multi-processor mind you but two dedicated Unisys ES7000/430 each with 16 Itanium2 processors running at 1.5 Gigahertz with 16GB of RAM. The argument was that the very high data rates made this architecture more palatable. The kicker for me was that they are using Microsoft Windows Server 2003 as the operating system. But it gets weirder.
  • The system’s pre-load system extracts useful metadata from ARC files and then stores this in a relational database, while saving off the actual content to a separate Page Store. The Page Store has some intelligence in it which uses an MD5 checksum to figure out if the content has changed; it also provides a layer of abstraction that will allow some content to be stored offline on tapes, etc. Apparently IA stores redundant data quite a bit, and Cornell will be able to save a significant amount of disk space if they de-dupe. Arms detailed the trade offs with using a relational db, namely that they had to get the schema right because if they decided to change it down the road it would require a complete pass over the content again. Ok, so the weirder part is that they are using SQLServer 2000 as the RDBMS.
  • They have created web-service and high-performance clients for extracting data from the archive so that cpu-intensive research operations can be performed locally instead of on the main server. I’d be interested to learn more about the high-performance clients since we’ve been keen to have file-system-like clients in the repository we are building at the LoC. Among the more interesting things the extractors can do is extracting the sub-graph of a particular node on the web.
  • They have a retro-browser which (from the paper) sounds like an interesting http-proxy which turns any old browser into a time-machine. It performs a similar function to the way-back machine, but sounds a lot cooler.
  • Full-text indexing is initially being done using Nutch on an extracted subset of nodes. However Cornell is investigating the use of NutchWAX for providing full-text indexes. NutchWAX was written by Doug Cutting for working directly with IA ARC Files. It also has the ability to distribute indexing–which seems counter to the non-distributed nature of this system at Cornell…but there you go.
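The MD5 de-duping idea in the Page Store bullet can be sketched in a few lines. This is only an illustration of content-addressed storage under my own assumptions; the class and field names are hypothetical and it is not how the Cornell system is actually built.

```python
import hashlib


class PageStore:
    """Toy page store: identical content is kept once, keyed by its MD5 digest."""

    def __init__(self):
        self.blobs = {}     # md5 hex digest -> content bytes
        self.captures = []  # (url, crawl_date, digest), one entry per crawl

    def add(self, url, crawl_date, content):
        # Record a capture; return True only if the content was not stored before.
        digest = hashlib.md5(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.captures.append((url, crawl_date, digest))
        return is_new
```

Every crawl still gets a row in captures (the part that would live in the relational database), but a page that did not change between crawls costs no extra space in blobs.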

I’ve learned from my colleague Andy Boyko that the Library of Congress has been doing similar work with IA…and has been doing other work archiving the world wild web. I imagine my other team members have already been exposed to the work Cornell has been doing in this area, but it was useful for me to learn more. It’s important work–as Arms said:

Everyone with an interest in the history of the Web must be grateful to Brewster Kahle for his foresight in preserving the content of the Web for future generations…

citation graphs

Friday, June 16th, 2006

JCDL2006 was chock-a-block full of good content. A set of papers presented on the first day in the Named Entities track explored a common theme of applying graph theory to citation networks in order to cluster works by the same author. For example an author name may appear as Daniel Chudnov, D Chudnov, Dan Chudnov. There is also a similar problem when two authors with the same name are actually two different people. Being able to group all the works by an author is very important for good search interfaces…and also for calculating citation counts and impact factors.

The most interesting paper in the bunch (for me) was Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation. This paper revolves around the hypothesis that authors tend to cite their own works more frequently than others–so called ‘self-citation’. Self-citation isn’t the result of navel gazing or self-promotion so much as it is the result of researchers building on the work that they’ve done previously. In addition to self-citation graphs, co-authorship and source URL graphs are also used to build a graph of a particular author’s works.

The paper concludes with some good precision/recall figures (.997/.818) which point to the value in using self-citation for name clustering. This paper and some growing interest I have in RDF and Jena have made me realize that I’d like to spend a bit of time over the coming year learning about graph theory.
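Here is a toy sketch of the self-citation idea, mostly to convince myself it is just connected components on a graph. It is not the AKTiveAuthor algorithm: it assumes one author per paper, treats two name strings as compatible when surname and first initial match, and ignores the co-authorship and source URL graphs entirely. All function names are hypothetical.

```python
def surname_and_initial(name):
    # Reduce 'Daniel Chudnov', 'D. Chudnov', 'Dan Chudnov' to ('chudnov', 'd').
    parts = name.replace(".", "").split()
    return parts[-1].lower(), parts[0][0].lower()


def cluster_by_self_citation(papers, citations):
    """papers: {paper_id: author name}; citations: iterable of (citing, cited)."""
    parent = {pid: pid for pid in papers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for citing, cited in citations:
        # A citation between compatible names is treated as a self-citation.
        if surname_and_initial(papers[citing]) == surname_and_initial(papers[cited]):
            parent[find(citing)] = find(cited)

    clusters = {}
    for pid in papers:
        clusters.setdefault(find(pid), set()).add(pid)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Two papers by a "D Chudnov" only end up in the same cluster if a chain of citations connects them, which is how this approach keeps two different people with the same name apart.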

oai-pmh tut

Tuesday, June 13th, 2006

I couldn’t pass up the opportunity to hear Simeon Warner talk about oai-pmh in the second group of tutorials. I’ve implemented data and service providers before–so I consider myself fairly knowledgeable about the protocol. But Simeon works with this stuff constantly at Cornell, since arXiv lives there, so I was certain there would be things to learn from him…he did not disappoint.

Using some recent work at NSDL, and his experience with the protocol, Simeon provided some really useful advice on using oai-pmh. Here are the things I picked up on:

  • avoid using sets–especially overloading sets to do searches. There is an interesting edge case in the protocol when a record moves from a particular set A to set B, which makes harvesters who are harvesting set A totally miss the update.
  • pay attention to datestamps. Make sure datestamps are assigned to records when they actually change in the repository or else harvesters can miss updates. The protocol essentially is a way of exposing updates, so getting the datestamps right is crucial.
  • resumption tokens need to be idempotent. This means a harvester should be able to use the resumption token more than once and get the same result (barring updates to the repository). This is essential so that harvesters engaged in a lengthy harvest can recover from network failure and other exceptions.
  • pay attention to character encoding. Use a parser that decodes character entities in XML and store the utf8. This will make your life simpler as you layer new services over harvested data. Make sure that HTML entities aren’t used in oai-pmh responses. utf8conditioner is Simeon’s command line app for debugging utf8 data.
  • be aware of the two great myths of oai-pmh: the myth that oai-pmh only allows exposure of DC records, and that oai-pmh only allows a single metadata format to be exposed.
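The resumption token advice is the easiest of these to sketch. One way to get idempotence is to make the token stateless: it simply encodes the original request arguments plus an offset, so replaying it replays the same page. This is my own minimal illustration, not code from the tutorial; PAGE_SIZE, the token layout and the function names are all assumptions.

```python
import base64
import json

PAGE_SIZE = 2  # hypothetical; real providers return far more per response


def encode_token(state):
    # Serialize the paging state into an opaque, URL-safe string.
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def list_identifiers(records, from_date=None, token=None):
    """records: list of (identifier, datestamp) tuples with ISO date strings."""
    state = decode_token(token) if token else {"from": from_date, "offset": 0}
    # A stable sort order is what lets the same offset mean the same page.
    matching = sorted(
        (r for r in records if state["from"] is None or r[1] >= state["from"]),
        key=lambda r: (r[1], r[0]),
    )
    start = state["offset"]
    page = matching[start:start + PAGE_SIZE]
    next_token = None
    if start + PAGE_SIZE < len(matching):
        next_token = encode_token({"from": state["from"], "offset": start + PAGE_SIZE})
    return page, next_token
```

A harvester that loses its connection halfway through can resend the last token it received and pick up where it left off. The "barring updates to the repository" caveat still applies, since an offset shifts if records change underneath it.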

There are lots more recommendations at NSDL, but it was useful to have this overview and have the chance to ask Simeon questions. For example even though oai-pmh requires records to have an XML schema, it would be possible to create a wrapping schema for freeform data like RDF.

The main reason I was interested in this tutorial was to hear more about using oai-pmh to distribute not only resource metadata records, but also the resources themselves. There were a couple of initial problems with using the protocol to provide access to the actual resources.

The first is that identifiers such as URLs in metadata records which point to the resource for capture had too much ambiguity. Some of the URLs point at splash screens where someone could download a particular flavor of the resource, others went directly to a PDF, etc. This made machine harvesting data-provider specific. In addition there is a problem with the datestamp semantics when a remote resource changes–when the resource is updated but the metadata stays the same the datestamp is not required to change. This makes it impossible for harvesters to know when they need to download the resource again.
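The datestamp problem is easy to see with a toy harvester. In this sketch (my own illustration with hypothetical field names) each record carries the datestamp the provider exposes plus a resource_modified date that the harvester never gets to see.

```python
def incremental_harvest(records, last_harvest):
    # What a harvester can do: ask for records whose datestamp is newer.
    return sorted(r["id"] for r in records if r["datestamp"] > last_harvest)


def missed_resource_updates(records, last_harvest):
    # An omniscient view: resources that changed while their datestamp stayed put.
    return sorted(
        r["id"]
        for r in records
        if r["resource_modified"] > last_harvest and r["datestamp"] <= last_harvest
    )


records = [
    # The PDF behind this record was replaced, but the metadata did not change.
    {"id": "oai:x:1", "datestamp": "2006-05-01", "resource_modified": "2006-06-10"},
    # Here the metadata and the resource changed together.
    {"id": "oai:x:2", "datestamp": "2006-06-12", "resource_modified": "2006-06-12"},
]
```

With a previous harvest on 2006-06-01 the first function only finds oai:x:2, and the second shows that oai:x:1 silently went stale.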

Fortunately there was a solution that is detailed more fully in a paper written by Simeon and the usual suspects. It boils down to actually distributing the resource as a metadata format. This plays a little bit with what metadata itself is…but it makes the two previously mentioned problems disappear. Simeon gave a brief overview of MPEG21 DIDL but was keen to point out METS and other packaging formats can work the same. Using oai-pmh in this way is *really* interesting to me since it enables repositories to share actual objects with each other–with OAI-PMH working almost like an ingestion protocol.

I asked about mechanisms to autodiscover oai-pmh metadata in HTML, like unAPI. Simeon pointed out that the usual suspects are actually extending/refining this idea a bit in some recent work done w/ the Andrew Mellon Foundation on interoperability. Apparently they’ve experimented with the LiveClipboard idea in support of some of this work. More on this later.

xquery

Monday, June 12th, 2006

So my new employer was kind enough to send me to the Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned–rather than waiting till the end to write up a report.

I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink w3c working groups, teaches at Brown, has over 20 years experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands on demonstration of XQuery using the eXist xml database.

About the first half was spent going over the syntax of XQuery which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native xml databases quite a bit on #code4lib, so I really was looking forward to learning more about it from a practical perspective.

I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up and imported a body of test xml data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the webdav interface which allows you to mount the eXist database as an editable filesystem, which is very handy. In addition eXist provides REST and XMLRPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.

I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style was a little bit jarring–but that’s normally the case in an environment that supports templating:

<HITS>
{
(: &= is eXist's keyword-match extension, not standard XQuery :)
for $speech in //SPEECH[LINE &= 'love']
return <HIT>{$speech}</HIT>
}
</HITS>

which can generate:

<HITS>
<HIT>
<SPEECH>
<SPEAKER>KING CLAUDIUS</SPEAKER>
<LINE>’Tis sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your father:</LINE>
<LINE>But, you must know, your father lost a father;</LINE>
<LINE>That father lost, lost his, and the survivor bound</LINE>
<LINE>In filial obligation for some term</LINE>
<LINE>To do obsequious sorrow: but to persever</LINE>
<LINE>In obstinate condolement is a course</LINE>
<LINE>Of impious stubbornness; ’tis unmanly grief;</LINE>
<LINE>It shows a will most incorrect to heaven,</LINE>
<LINE>A heart unfortified, a mind impatient,</LINE>
<LINE>An understanding simple and unschool’d:</LINE>
<LINE>For what we know must be and is as common</LINE>
<LINE>As any the most vulgar thing to sense,</LINE>
<LINE>Why should we in our peevish opposition</LINE>
<LINE>Take it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to nature,</LINE>
<LINE>To reason most absurd: whose common theme</LINE>
<LINE>Is death of fathers, and who still hath cried,</LINE>
<LINE>From the first corse till he that died to-day,</LINE>
<LINE>’This must be so.’ We pray you, throw to earth</LINE>
<LINE>This unprevailing woe, and think of us</LINE>
<LINE>As of a father: for let the world take note,</LINE>
<LINE>You are the most immediate to our throne;</LINE>
<LINE>And with no less nobility of love</LINE>
<LINE>Than that which dearest father bears his son,</LINE>
<LINE>Do I impart toward you. For your intent</LINE>
<LINE>In going back to school in Wittenberg,</LINE>
<LINE>It is most retrograde to our desire:</LINE>
<LINE>And we beseech you, bend you to remain</LINE>
<LINE>Here, in the cheer and comfort of our eye,</LINE>
<LINE>Our chiefest courtier, cousin, and our son.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>For God’s love, let me hear.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>OPHELIA</SPEAKER>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>Ghost</SPEAKER>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the night,</LINE>
<LINE>And for the day confined to fast in fires,</LINE>
<LINE>Till the foul crimes done in my days of nature</LINE>
<LINE>Are burnt and purged away. But that I am forbid</LINE>
<LINE>To tell the secrets of my prison-house,</LINE>
<LINE>I could a tale unfold whose lightest word</LINE>
<LINE>Would harrow up thy soul, freeze thy young blood,</LINE>
<LINE>Make thy two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O, list!</LINE>
<LINE>If thou didst ever thy dear father love–</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>Haste me to know’t, that I, with wings as swift</LINE>
<LINE>As meditation or the thoughts of love,</LINE>
<LINE>May sweep to my revenge.</LINE>
</SPEECH>
</HIT>
</HITS>
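For anyone without eXist handy, the same filter can be approximated outside the database. This is a rough Python sketch using only the standard library; it matches whole lower-cased words, which is my guess at how the keyword match behaves and may differ from eXist's actual tokenizing rules. The function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def speeches_mentioning(play_xml, word):
    # Wrap every SPEECH that has a LINE containing the word in a HIT element.
    root = ET.fromstring(play_xml)
    hits = ET.Element("HITS")
    for speech in root.iter("SPEECH"):
        for line in speech.findall("LINE"):
            tokens = re.findall(r"\w+", (line.text or "").lower())
            if word.lower() in tokens:
                ET.SubElement(hits, "HIT").append(speech)
                break  # one HIT per speech, however many lines match
    return hits
```

The loop-and-append version makes it clearer how much work the six-line query is doing.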

Apart from the nitty gritty of XQuery David also provided an interesting look at some tricks that eXist uses to make it possible to join tree based structures. Basically the algorithm creates a tree structure and then indexes the nodes with identifiers making an assumption about the number of children beneath a particular node. Practically this means it’s easy to do math to traverse the tree, and join subtrees–but a side effect is that lots of ‘ghost nodes’ are created.

Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can actually easily exceed the available resources on a 64-bit machine. An example of an irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful have like 60 definitions…this causes the identifier space padding to get bloated with tons of ghost nodes.
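To get a feel for the arithmetic, here is the simplest version of that idea: number the nodes level by level while pretending every node has exactly k children, where k is the largest fan-out in the document. This is a simplification for illustration and not necessarily the exact scheme eXist uses; the nested-list tree representation and the function names are my own.

```python
def child_id(node_id, j, k):
    # Identifier of the j-th child (1-based) of node_id when the root is 1.
    return k * (node_id - 1) + j + 1


def parent_id(node_id, k):
    # Inverse of child_id: pure integer math, no tree lookup needed.
    return (node_id - 2) // k + 1


def shape(node):
    # Return (max fan-out, depth in levels, real node count) of a nested-list tree.
    k, depth, count = len(node), 1, 1
    for child in node:
        ck, cd, cc = shape(child)
        k = max(k, ck)
        depth = max(depth, cd + 1)
        count += cc
    return k, depth, count


def id_space(k, depth):
    # Identifiers reserved for a complete k-ary tree with this many levels.
    return sum(k ** level for level in range(depth))


# A tiny 'dictionary': three entries with 2, 1 and 6 definitions.
tree = [[[], []], [[]], [[], [], [], [], [], []]]
k, depth, real_nodes = shape(tree)
ghost_nodes = id_space(k, depth) - real_nodes
```

Even in this tiny example most of the reserved identifiers are ghosts, because the one entry with six definitions sets k for everybody. Since the identifier space grows as k raised to the depth, a few unusually wide nodes in a deep document add up fast.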

If you are interested in any of this take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery - The XML Query Language by Michael Brundage for learning more about XQuery. In the future David said there is work going on at the W3C on extensions for search and update: XQuery Search and Update, which will be good to keep an eye on.

All in all I like XQuery and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in say a Java program much like SQL is used via JDBC–and be able to get back results say as JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.