xquery
So my new employer was kind enough to send me to the Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned, rather than waiting till the end to write up a report.
I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink W3C working groups, teaches at Brown, has over 20 years of experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands-on demonstration of XQuery using the eXist XML database.
About the first half was spent going over the syntax of XQuery, which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native XML databases quite a bit on #code4lib, so I was really looking forward to learning more about it from a practical perspective.
I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up and imported a body of test XML data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the WebDAV interface, which allows you to mount the eXist database as an editable filesystem. In addition, eXist provides REST and XML-RPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.
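As a rough illustration of the REST interface, here is a minimal Python sketch that builds a query URL and, given a running server, fetches the result. It assumes a default local install reachable at `http://localhost:8080/exist/rest/db` and that the REST interface accepts the `_query` and `_howmany` parameters; the function names `build_query_url` and `run_query` are hypothetical helpers invented for this example, and the details should be checked against your eXist version.

```python
import urllib.parse
import urllib.request

# Assumption: a default local eXist install serving its REST interface here.
BASE_URL = "http://localhost:8080/exist/rest/db"


def build_query_url(xquery, how_many=10, base_url=BASE_URL):
    """Build a GET URL asking the REST interface to evaluate `xquery`.

    `_query` and `_howmany` are assumed parameter names; verify them
    against the documentation for the eXist version you are running.
    """
    params = urllib.parse.urlencode({"_query": xquery, "_howmany": how_many})
    return f"{base_url}?{params}"


def run_query(xquery, how_many=10, base_url=BASE_URL):
    """Send the query and return the raw XML response as text.

    This needs a running server, so it is defined but not called here.
    """
    url = build_query_url(xquery, how_many, base_url)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


# Building the URL works without a server; the query is URL-encoded.
url = build_query_url("//SPEECH[LINE &= 'love']", how_many=5)
print(url)
```

Keeping the URL construction separate from the network call makes it easy to inspect exactly what would be sent before pointing it at a real database.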
I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style a little bit jarring, but that’s normally the case in an environment that supports templating:
<hits>
{
    for $speech in //SPEECH[LINE &= 'love']
    return <hit>{$speech}</hit>
}
</hits>
which can generate:
<HITS>
<HIT>
<SPEECH>
<SPEAKER>KING CLAUDIUS</SPEAKER>
<LINE>‘Tis
sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your
father:</LINE>
<LINE>But, you must know, your father
lost a father;</LINE>
<LINE>That father lost, lost
his, and the survivor bound</LINE>
<LINE>In filial
obligation for some term</LINE>
<LINE>To do obsequious
sorrow: but to persever</LINE>
<LINE>In obstinate
condolement is a course</LINE>
<LINE>Of impious
stubbornness; ’tis unmanly grief;</LINE>
<LINE>It
shows a will most incorrect to heaven,</LINE>
<LINE>A
heart unfortified, a mind impatient,</LINE>
<LINE>An
understanding simple and unschool’d:</LINE>
<LINE>For
what we know must be and is as common</LINE>
<LINE>As
any the most vulgar thing to sense,</LINE>
<LINE>Why
should we in our peevish opposition</LINE>
<LINE>Take
it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to
nature,</LINE>
<LINE>To reason most absurd: whose
common theme</LINE>
<LINE>Is death of fathers, and who
still hath cried,</LINE>
<LINE>From the first corse
till he that died to-day,</LINE>
<LINE>’This must be
so.’ We pray you, throw to earth</LINE>
<LINE>This
unprevailing woe, and think of us</LINE>
<LINE>As of a
father: for let the world take note,</LINE>
<LINE>You
are the most immediate to our throne;</LINE>
<LINE>And
with no less nobility of love</LINE>
<LINE>Than that
which dearest father bears his son,</LINE>
<LINE>Do I
impart toward you. For your intent</LINE>
<LINE>In
going back to school in Wittenberg,</LINE>
<LINE>It is
most retrograde to our desire:</LINE>
<LINE>And we
beseech you, bend you to remain</LINE>
<LINE>Here, in
the cheer and comfort of our eye,</LINE>
<LINE>Our
chiefest courtier, cousin, and our son.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>For God’s love, let me hear.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>OPHELIA</SPEAKER>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>Ghost</SPEAKER>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the
night,</LINE>
<LINE>And for the day confined to fast
in fires,</LINE>
<LINE>Till the foul crimes done in my
days of nature</LINE>
<LINE>Are burnt and purged away.
But that I am forbid</LINE>
<LINE>To tell the secrets
of my prison-house,</LINE>
<LINE>I could a tale unfold
whose lightest word</LINE>
<LINE>Would harrow up thy
soul, freeze thy young blood,</LINE>
<LINE>Make thy
two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O,
list!</LINE>
<LINE>If thou didst ever thy dear father
love–</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>Haste me to
know’t, that I, with wings as swift</LINE>
<LINE>As
meditation or the thoughts of love,</LINE>
<LINE>May
sweep to my revenge.</LINE>
</SPEECH>
</HIT>
</HITS>
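For readers without eXist at hand, the query above can be approximated in plain Python. This is only a sketch of the idea, not how eXist evaluates it: the `&=` operator is an eXist-specific keyword match rather than standard XQuery, and here it is imitated with a simple case-insensitive word comparison. The sample document, `has_word`, and `find_hits` are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A tiny made-up sample in the same SPEECH/SPEAKER/LINE shape as the play markup.
SAMPLE = """<PLAY>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>For God's love, let me hear.</LINE></SPEECH>
  <SPEECH><SPEAKER>HORATIO</SPEAKER><LINE>Two nights together had these gentlemen,</LINE></SPEECH>
  <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>My lord, he hath importuned me with love</LINE><LINE>In honourable fashion.</LINE></SPEECH>
</PLAY>"""


def has_word(text, word):
    """Crude stand-in for a keyword match: compare whole lowercase words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return word.lower() in tokens


def find_hits(root, word):
    """Mimic the for/return expression: wrap each matching SPEECH in a <hit>."""
    hits = ET.Element("hits")
    for speech in root.iter("SPEECH"):
        lines = speech.findall("LINE")
        if any(has_word("".join(line.itertext()), word) for line in lines):
            hit = ET.SubElement(hits, "hit")
            hit.append(speech)
    return hits


hits = find_hits(ET.fromstring(SAMPLE), "love")
speakers = [hit.find("SPEECH/SPEAKER").text for hit in hits]
print(speakers)
print(ET.tostring(hits, encoding="unicode"))
```

The loop plays the role of the `for` clause, the `if` stands in for the predicate inside the square brackets, and building the `hit` element corresponds to the `return` clause.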
Apart from the nitty-gritty of XQuery, David also provided an interesting look at some tricks that eXist uses to make it possible to join tree-based structures. Basically the algorithm creates a tree structure and then indexes the nodes with identifiers, making an assumption about the number of children beneath a particular node. Practically this means it’s easy to do math to traverse the tree and join subtrees, but a side effect is that lots of ‘ghost nodes’ are created.
Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can easily exceed the available resources on a 64-bit machine. An example of an irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful of which have something like 60 definitions. This causes the identifier space padding to get bloated with tons of ghost nodes.
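The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not eXist's actual implementation: it assumes identifiers are handed out level by level as if every node at a given level had the same maximum number of children, with the root numbered 1. The function names and the dictionary figures (100,000 entries, 5 outliers with 60 definitions) are invented for illustration.

```python
def child_id(parent, k, i):
    """ID of the i-th child (1-based) of `parent` when every node has k child slots."""
    return k * (parent - 1) + i + 1


def parent_id(node, k):
    """Inverse of child_id: recover the parent with integer math, no lookup needed."""
    return (node - 2) // k + 1


def reserved_ids(fanouts):
    """Identifiers reserved when every node at level i gets fanouts[i] child slots."""
    total, level_size = 1, 1  # start with the root
    for k in fanouts:
        level_size *= k
        total += level_size
    return total


# Traversal is pure arithmetic: with k = 3 the root's children are 2, 3 and 4.
print([child_id(1, 3, i) for i in (1, 2, 3)])

# Made-up dictionary: most entries have 3 definitions, but 5 entries have 60,
# so every entry must reserve 60 definition slots.
entries, typical_defs, outlier_defs, outliers = 100_000, 3, 60, 5
real_nodes = 1 + entries + (entries - outliers) * typical_defs + outliers * outlier_defs
reserved = reserved_ids([entries, outlier_defs])
ghost_nodes = reserved - real_nodes
print(real_nodes, reserved, ghost_nodes)

# With deeper nesting the reserved space grows multiplicatively per level;
# eleven levels of 60 slots each already overflow a 64-bit identifier.
print(reserved_ids([60] * 11) > 2**64)
```

In this toy model roughly 400,000 real nodes end up reserving about 6.1 million identifiers, which shows how a handful of outliers inflates the padding for everything else at the same level.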
If you are interested in any of this, take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery - The XML Query Language by Michael Brundage for learning more about XQuery. David said there is work going on at W3C on extensions for search and update: XQuery Search and Update, which will be good to keep an eye on.
All in all I like XQuery and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in, say, a Java program much like SQL is used via JDBC, and be able to get back results as, say, JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.