xquery

So my new employer was kind enough to send me to Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned–rather than waiting till the end to write up a report.

I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink w3c working groups, teaches at Brown, has over 20 years experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands on demonstration of XQuery using the eXist xml database.

About the first half was spent going over the syntax of XQuery which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native xml databases quite a bit on #code4lib, so I really was looking forward to learning more about it from a practical perspective.

I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up and imported a body of test xml data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the webdav interface which allows you to mount the eXist database as an editable filesystem, which is very handy. In addition eXist provides REST and XMLRPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.

I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system in and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style was a little bit jarring–but that’s normally the case in an environment that supports templating:

<hits>
{
for $speech in //SPEECH[LINE &= 'love']
return <hit>{$speech}</hit>
}
</hits>

which can generate:

<HITS>
<HIT>
<SPEECH>
<SPEAKER>KING CLAUDIUS</SPEAKER>
<LINE>‘Tis sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your father:</LINE>
<LINE>But, you must know, your father lost a father;</LINE>
<LINE>That father lost, lost his, and the survivor bound</LINE>
<LINE>In filial obligation for some term</LINE>
<LINE>To do obsequious sorrow: but to persever</LINE>
<LINE>In obstinate condolement is a course</LINE>
<LINE>Of impious stubbornness; ’tis unmanly grief;</LINE>
<LINE>It shows a will most incorrect to heaven,</LINE>
<LINE>A heart unfortified, a mind impatient,</LINE>
<LINE>An understanding simple and unschool’d:</LINE>
<LINE>For what we know must be and is as common</LINE>
<LINE>As any the most vulgar thing to sense,</LINE>
<LINE>Why should we in our peevish opposition</LINE>
<LINE>Take it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to nature,</LINE>
<LINE>To reason most absurd: whose common theme</LINE>
<LINE>Is death of fathers, and who still hath cried,</LINE>
<LINE>From the first corse till he that died to-day,</LINE>
<LINE>’This must be so.’ We pray you, throw to earth</LINE>
<LINE>This unprevailing woe, and think of us</LINE>
<LINE>As of a father: for let the world take note,</LINE>
<LINE>You are the most immediate to our throne;</LINE>
<LINE>And with no less nobility of love</LINE>
<LINE>Than that which dearest father bears his son,</LINE>
<LINE>Do I impart toward you. For your intent</LINE>
<LINE>In going back to school in Wittenberg,</LINE>
<LINE>It is most retrograde to our desire:</LINE>
<LINE>And we beseech you, bend you to remain</LINE>
<LINE>Here, in the cheer and comfort of our eye,</LINE>
<LINE>Our chiefest courtier, cousin, and our son.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>For God’s love, let me hear.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>OPHELIA</SPEAKER>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>Ghost</SPEAKER>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the night,</LINE>
<LINE>And for the day confined to fast in fires,</LINE>
<LINE>Till the foul crimes done in my days of nature</LINE>
<LINE>Are burnt and purged away. But that I am forbid</LINE>
<LINE>To tell the secrets of my prison-house,</LINE>
<LINE>I could a tale unfold whose lightest word</LINE>
<LINE>Would harrow up thy soul, freeze thy young blood,</LINE>
<LINE>Make thy two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O, list!</LINE>
<LINE>If thou didst ever thy dear father love–</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>Haste me to know’t, that I, with wings as swift</LINE>
<LINE>As meditation or the thoughts of love,</LINE>
<LINE>May sweep to my revenge.</LINE>
</SPEECH>
</HIT>
</HITS>

Apart from the nitty gritty of XQuery David also provided an interesting look at some tricks that eXist uses to make it possible to join tree based structures. Basically the algorithm creates a tree structure and then indexes the nodes with identifiers making an assumption about the number of children beneath a particular node. Practically this means it’s easy to do math to traverse the tree, and join subtrees–but a side effect is that lots of ‘ghost nodes’ are created.

Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can actually easily exceed the available resources on a 64bit machine. An example of a irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful have like 60 definitions…this causes the identifier space padding to get bloated with tons of ghost nodes.

If you are interested about any of this take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery - The XML Query Language by Micael Brundage for learning more about XQuery. In the future David said there is work going on at W3C on extensions to search and update: XQuery Search and Update, which will be good to keep an eye on.

All in all I like XQuery and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in say a Java program much like SQL is used via JDBC–and be able to get back results say as JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.