Archive for the ‘xml’ Category

MIME types and library metadata

Wednesday, April 23rd, 2008

While thinking about library metadata and RESTful web services I got to wondering how many application/*+xml MIME types have actually been registered. It turns out that 120 out of the 633 other application/* MIME types.

Does it seem like a generally useful thing to be able to identify metadata representations with MIME types? Rebecca Guenther registered application/marc back in 1997. Maybe we could have application/marc+xml, application/mods+xml, application/dc+xml?

MIME types for established library metadata formats would be useful to use in applications like AtomPub implementations, or say OAI-ORE resource maps that want to identify the format of a particular resource. In general it would be useful to have in RESTful environments where content-negotiation for resources is encouraged.

If you are curious, here is a current (as of Apr 23, 2008) list of registered MIME types that are in the application/*+xml space.

application/atom+xml
application/atomcat+xml
application/atomsvc+xml
application/auth-policy+xml
application/beep+xml
application/ccxml+xml
application/cellml+xml
application/cnrp+xml
application/conference-info+xml
application/cpl+xml
application/csta+xml
application/CSTAdata+xml
application/davmount+xml
application/dialog-info+xml
application/epp+xml
application/im-iscomposing+xml
application/kpml-request+xml
application/kpml-response+xml
application/mbms-associated-procedure-description+xml
application/mbms-deregister+xml
application/mbms-envelope+xml
application/mbms-msk-response+xml
application/mbms-msk+xml
application/mbms-protection-description+xml
application/mbms-reception-report+xml
application/mbms-register-response+xml
application/mbms-register+xml
application/mbms-user-service-description+xml
application/media_control+xml
application/mediaservercontrol+xml
application/oebps-package+xml
application/pidf+xml
application/pls+xml
application/poc-settings+xml
application/rdf+xml
application/reginfo+xml
application/resource-lists+xml
application/rlmi+xml
application/rls-services+xml
application/samlassertion+xml
application/samlmetadata+xml
application/sbml+xml
application/shf+xml
application/simple-filter+xml
application/smil+xml
application/soap+xml
application/sparql-results+xml
application/spirits-event+xml
application/srgs+xml
application/ssml+xml
application/vnd.3gpp.bsf+xml
application/vnd.3gpp2.bcmcsinfo+xml
application/vnd.adobe.xdp+xml
application/vnd.apple.installer+xml
application/vnd.avistar+xml
application/vnd.chemdraw+xml
application/vnd.criticaltools.wbs+xml
application/vnd.ctct.ws+xml
application/vnd.eszigno3+xml
application/vnd.google-earth.kml+xml
application/vnd.HandHeld-Entertainment+xml
application/vnd.informedcontrol.rms+xml
application/vnd.irepository.package+xml
application/vnd.liberty-request+xml
application/vnd.llamagraphics.life-balance.exchange+xml
application/vnd.marlin.drm.actiontoken+xml
application/vnd.marlin.drm.conftoken+xml
application/vnd.mozilla.xul+xml
application/vnd.ms-playready.initiator+xml
application/vnd.nokia.conml+xml
application/vnd.nokia.iptv.config+xml
application/vnd.nokia.landmark+xml
application/vnd.nokia.landmarkcollection+xml
application/vnd.nokia.n-gage.ac+xml
application/vnd.nokia.pcd+xml
application/vnd.oma.bcast.associated-procedure-parameter+xml
application/vnd.oma.bcast.drm-trigger+xml
application/vnd.oma.bcast.imd+xml
application/vnd.oma.bcast.notification+xml
application/vnd.oma.bcast.sgdd+xml
application/vnd.oma.bcast.smartcard-trigger+xml
application/vnd.oma.bcast.sprov+xml
application/vnd.oma.dd2+xml
application/vnd.oma.drm.risd+xml
application/vnd.oma.group-usage-list+xml
application/vnd.oma.poc.detailed-progress-report+xml
application/vnd.oma.poc.final-report+xml
application/vnd.oma.poc.groups+xml
application/vnd.oma.poc.invocation-descriptor+xml
application/vnd.oma.poc.optimized-progress-report+xml
application/vnd.oma.xcap-directory+xml
application/vnd.omads-email+xml
application/vnd.omads-file+xml
application/vnd.omads-folder+xml
application/vnd.otps.ct-kip+xml
application/vnd.poc.group-advertisement+xml
application/vnd.pwg-xhtml-print+xml
application/vnd.recordare.musicxml+xml
application/vnd.solent.sdkm+xml
application/vnd.sun.wadl+xml
application/vnd.syncml.dm+xml
application/vnd.syncml+xml
application/vnd.uoml+xml
application/vnd.wv.csp+xml
application/vnd.wv.ssp+xml
application/vnd.zzazz.deck+xml
application/voicexml+xml
application/watcherinfo+xml
application/wsdl+xml
application/wspolicy+xml
application/xcap-att+xml
application/xcap-caps+xml
application/xcap-el+xml
application/xcap-error+xml
application/xcap-ns+xml
application/xenc+xml
application/xhtml+xml
application/xmpp+xml
application/xop+xml
application/xv+xml

app and repositories

Monday, July 16th, 2007

Pete Johnston blogged recently about a very nice use of the Atom Publishing Protocol (APP) to provide digital library repository functionality. The project is supported by UKOLN at the University of Bath and is called Simple Web-service Offering Repository Deposit (SWORD).

If you are interested in digital repositories and web services take a look at their APP profile. It’s a great example of how APP encourages the use of the Atom XML format and RESTful practices, which can then be extended to suit the particular needs of a community of practice.

To understand APP you really only need to grok a handful of concepts from the data model and REST. The data model is basically made up of a service document, which describes a set of collections, which aggregates member entries, which can in turn point to a media entry. All of these types of resources are identified with URLs. Since they are URLs you can interact with the objects with plain old HTTP–just like your web browser. For example you can list the entries in a collection by issuing a GET to the collection URL. Or you can create a member resource by doing a POST to the collection URL. Similarly you can delete a member entry by issuing a DELETE to the member entry. The full details are available in the latest draft of the RFC–and also in a wide variety of articles including this one.

So to perform a SWORD deposit a program would have to:

  1. get the service document for the repository (GET http://www.myrepository.ac.uk/app/servicedocument)
  2. see what collections it can add objects to
  3. create some IMS, METS or DIDL metadata to describe your repository object and ZIP it up with any of the objects datastreams
  4. POST the zip file to the appropriate collection URL with the appropriate X-Format-Namespace to identify the format of the submitted object
  5. check that you got a 201 Created status code and record the Location of the newly created resource
  6. profit!

1 and 2 are perhaps not even necessary if the URL for the target collection is already known. Some notable things about the SWORD profile of APP:

  • two levels of conformance (one really minimalistic one)
  • the idea that collections imply particular treatments or workflows associated with how the object is ingested
  • service documents dynamically change to describe only the collections that a particular user can see
  • no ability to edit resources
  • no ability to delete resources
  • no ability to list collections
  • repository objects are POSTed as ZIP files to collections
  • HTTP Basic Authentication + TLS for security
  • the use of DublinCore to describe collections and their respective policies.
  • collections can support mediated deposit which means deposits can include the X-On-Behalf-Of HTTP header to identify the user to create the resource for.
  • the use of X-Format-Namespace HTTP header to explicitly identify the format of the submission package that is zipped up: for example IMS, METS or DIDL.

While I understand why update and delete would be disabled for deposited packages I don’t really understand why the listing of collections would be disabled. An atom feed for a collection would essentially enable harvesting of a repository, much like ListRecords in OAI-PMH.

I’m not quite sure I completely understand X-On-Behalf-Of and sword:mediation either. I could understand X-On-Behalf-Of in an environment where there is no authentication. But if a user is authenticated couldn’t their username be used to identify who is doing the deposit? Perhaps there are cases (as the doc suggests) where a deposit is done for another user?

All in all this is really wonderful work. Of particular value for me was seeing the list of SWORD extensions and also the use of HTTP status codes. If I have the time I’d like to throw together a sample repository server and client to see just how easy it is to implement SWORD. I did try some experiments along these lines for my presentation back in February…but they never got as well defined as SWORD.

the weight of legacy data

Sunday, May 20th, 2007

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8 and have used MARC::Charset >= v0.8 to do it you may very well have null characters (0×00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.

v0.97 now uses the alternate value when the ucs is not available…which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies–and hear my plea to for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news but it feels good to be saying it. If you’re not convinced read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried/true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data sharing libraries, but its time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.

oclc registry

Monday, February 19th, 2007

So OCLC’s WorldCat Registry is a nice new addition to OCLCs growing list of web services. Do a search for your library and take a look at the URL: aye that’s right it’s SRU. In fact do a view source on the results page and you’ll see an SRU response in XML–the HTML is being rendered with client side XSLT.

If you drill into a particular institution you’ll see a pleasantly cool uri:

http://worldcat.org/registry/Institutions/89073

…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:

http://worldcat.org/webservices/registry/content/Institutions/89073

This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html. Yeah, I did check:

  curl -I "Accept: text/xml" http://worldcat.org/registry/Institutions/89073

It’s inspiring to see OCLC going the extra mile to make their new services have web friendly machine APIs.

Update: for deeper analysis check out Pete Johnston’s WorldCat Institution Registry and Identifiers. He has some great points on the use of identifiers in the xml responses.

identifiers and authority records

Saturday, January 6th, 2007

Authority files are rather important for unambiguously talking about a person, place or thing. In database lingo they essentially amount to a primary key for a table. Given the time and effort libraries spend in maintaining authority records and assigning control numbers to individuals it makes sense that a URI could be assigned to an individual in such authority files. I realize this idea is nothing new, but until recently I hadn’t seen it put into practice particularly well.

I imagine this has been there all along but I just noticed that OCLC’s Linked Authority File includes PURLs for authors now. For example the following URL contains a LCCN:

http://errol.oclc.org/laf/n79-7035

When you GET this your browser is automatically redirected with an HTTP 302 to:

http://alcme.oclc.org/laf/servlet/OAIHandler?
verb=GetRecord&metadataPrefix=oai_dc&identifier=n79-7035

which you’ll notice is a OAI-PMH request to fetch a DublinCore record with the identifier n79-7035:

<oai_dc:dc
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
    http://openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    Borges, Jorge Luis,--1899-
  </dc:creator>
  <dc:description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    SuaÌrezLynch, B.--nnnc
  </dc:description>
</oai_dc:dc>

So now we know who this identifier is for, and the established heading for the individual. But it gets better (or worse depending on your perspective). Since this is an OAI-PMH server you can issue a ListMetadataFormats request to see what other flavors this record might be available in. If you do you’ll find out that this record is also available as marcxml in all its unholy glory (if you follow that link your browser will use a stylesheet to turn the raw xml into something a bit more presentable). Putting aside my snideness about MARC for a moment, this is a lot of useful data being made available.

You can also search the name authority file and get relevant PURLs via a SOAP/REST service. For example the irc bot panizzi in #code4lib actually has a bit of logic that allows it do lookups in the linked authority file:

06:56 < edsu> @naf borges, jorge
06:56 < panizzi> edsu: [20 matches] [~1] Borges, Jorge Luis, 1899-
                 <http://errol.oclc.org/laf/n79-7035>; [~2] Macedo, Jorge
                 Borges de. <http://errol.oclc.org/laf/n82-149895>; [~3]
                 Borges, Jorge G. (Jorge Guillermo), 1874-1938
                 <http://errol.oclc.org/laf/n90-681877>; [~4] Sua?rez Lynch, B.
                 <http://errol.oclc.org/laf/n82-21644>; [~5] Borges, Jorge
                 Wheliton Miranda <http://errol.oclc.org/laf/n92-76758>; [~6]
                 Canido Borges, Jorge Oscar (3 more messages)

All in all it’s an impressive mix of technology, standards and practice. It is not entirely clear to me how this work relates to the Virtual International Authority File. Perhaps LAF wasn’t considered a good acronym? If you are interested in such things Thom Hickey had a really interesting talk at Access2006 which has audio available.

ruby-oai v0.0.3

Tuesday, September 19th, 2006

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.

After committing the branch he was working on I ran some performance tests of my own. I ran a vanilla ListRecords request against dspace, eprints and american memory oai-pmh servers using both the rexml (default) and libxml backend parsers. Here are the results

server parser real user sys
dspace rexml 0m3.632s 0m2.008s 0m0.044s
libxml 0m1.900s 0m0.212s 0m0.032s
  1.732s (+48%) 1.796s (+89%) 0.012s (+27%)
 
eprints rexml 0m19.807s 0m1.984s 0m0.036s
libxml 0m19.344s 0m0.236s 0m0.024s
  0.463s (+2%) 1.748s (+88%) 0.012s (+33%)
 
american-memory rexml 0m12.991s 0m5.424s 0m0.052s
libxml 0m7.420s 0m0.324s 0m0.032s
  5.571s (+43%) 5.104s (+94%) 0.02s (+38%)

Those percentage values are speed improvements. Thanks Terry :-)

xquery

Monday, June 12th, 2006

So my new employer was kind enough to send me to Joint Conference on Digital Libraries this year. The JCDL program has caught my eye for a few years now, but my previous employer didn’t really see the value in being involved in the digital library community. It’s nice to be back listening to new people with good ideas again. I plan on taking sparse free-form notes here just so I have a record of what I attended and what I learned–rather than waiting till the end to write up a report.

I spent the morning in David Durand’s XQuery tutorial. David has worked on the XML and XLink w3c working groups, teaches at Brown, has over 20 years experience with SGML/XML technologies, and is currently running a startup out of the third floor of his house. He gave a nice hands on demonstration of XQuery using the eXist xml database.

About the first half was spent going over the syntax of XQuery which included a nice mini-tutorial on XPath. I’ve been interested in XQuery since hearing Kevin Clarke talk about it and native xml databases quite a bit on #code4lib, so I really was looking forward to learning more about it from a practical perspective.

I was blown away by how easy it is to actually set up eXist and start adding content and querying it. While David was talking I literally downloaded it, set it up and imported a body of test xml data in 5 minutes. The setup amounts to downloading a jar file and running it. A nice feature is the webdav interface which allows you to mount the eXist database as an editable filesystem, which is very handy. In addition eXist provides REST and XMLRPC interfaces. David used the snazzy XQuery Sandbox web interface for exploring XQuery.

I found the functional aspects of XQuery to be really interesting. David nicely summarized the XQuery type system in and covered enough of the basic flow constructs (let, for, where, return, order by) to start experimenting right away. I must admit that I found the mixture of templating functionality (like that in PHP) with the functional style was a little bit jarring–but that’s normally the case in an environment that supports templating:

<hits>
{
for $speech in //SPEECH[LINE &= 'love']
return <hit>{$speech}</hit>
}
</hits>

which can generate:

<HITS>
<HIT>
<SPEECH>
<SPEAKER>KING CLAUDIUS</SPEAKER>
<LINE>’Tis sweet and commendable in your nature, Hamlet,</LINE>
<LINE>To give these mourning duties to your father:</LINE>
<LINE>But, you must know, your father lost a father;</LINE>
<LINE>That father lost, lost his, and the survivor bound</LINE>
<LINE>In filial obligation for some term</LINE>
<LINE>To do obsequious sorrow: but to persever</LINE>
<LINE>In obstinate condolement is a course</LINE>
<LINE>Of impious stubbornness; ’tis unmanly grief;</LINE>
<LINE>It shows a will most incorrect to heaven,</LINE>
<LINE>A heart unfortified, a mind impatient,</LINE>
<LINE>An understanding simple and unschool’d:</LINE>
<LINE>For what we know must be and is as common</LINE>
<LINE>As any the most vulgar thing to sense,</LINE>
<LINE>Why should we in our peevish opposition</LINE>
<LINE>Take it to heart? Fie! ’tis a fault to heaven,</LINE>
<LINE>A fault against the dead, a fault to nature,</LINE>
<LINE>To reason most absurd: whose common theme</LINE>
<LINE>Is death of fathers, and who still hath cried,</LINE>
<LINE>From the first corse till he that died to-day,</LINE>
<LINE>’This must be so.’ We pray you, throw to earth</LINE>
<LINE>This unprevailing woe, and think of us</LINE>
<LINE>As of a father: for let the world take note,</LINE>
<LINE>You are the most immediate to our throne;</LINE>
<LINE>And with no less nobility of love</LINE>
<LINE>Than that which dearest father bears his son,</LINE>
<LINE>Do I impart toward you. For your intent</LINE>
<LINE>In going back to school in Wittenberg,</LINE>
<LINE>It is most retrograde to our desire:</LINE>
<LINE>And we beseech you, bend you to remain</LINE>
<LINE>Here, in the cheer and comfort of our eye,</LINE>
<LINE>Our chiefest courtier, cousin, and our son.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>For God’s love, let me hear.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>OPHELIA</SPEAKER>
<LINE>My lord, he hath importuned me with love</LINE>
<LINE>In honourable fashion.</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>Ghost</SPEAKER>
<LINE>I am thy father’s spirit,</LINE>
<LINE>Doom’d for a certain term to walk the night,</LINE>
<LINE>And for the day confined to fast in fires,</LINE>
<LINE>Till the foul crimes done in my days of nature</LINE>
<LINE>Are burnt and purged away. But that I am forbid</LINE>
<LINE>To tell the secrets of my prison-house,</LINE>
<LINE>I could a tale unfold whose lightest word</LINE>
<LINE>Would harrow up thy soul, freeze thy young blood,</LINE>
<LINE>Make thy two eyes, like stars, start from their spheres,</LINE>
<LINE>Thy knotted and combined locks to part</LINE>
<LINE>And each particular hair to stand on end,</LINE>
<LINE>Like quills upon the fretful porpentine:</LINE>
<LINE>But this eternal blazon must not be</LINE>
<LINE>To ears of flesh and blood. List, list, O, list!</LINE>
<LINE>If thou didst ever thy dear father love–</LINE>
</SPEECH>
</HIT>
<HIT>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>Haste me to know’t, that I, with wings as swift</LINE>
<LINE>As meditation or the thoughts of love,</LINE>
<LINE>May sweep to my revenge.</LINE>
</SPEECH>
</HIT>
</HITS>

Apart from the nitty gritty of XQuery David also provided an interesting look at some tricks that eXist uses to make it possible to join tree based structures. Basically the algorithm creates a tree structure and then indexes the nodes with identifiers making an assumption about the number of children beneath a particular node. Practically this means it’s easy to do math to traverse the tree, and join subtrees–but a side effect is that lots of ‘ghost nodes’ are created.

Ghost nodes are gaps in the identifier space, and if you are working with irregularly structured XML documents you can actually easily exceed the available resources on a 64bit machine. An example of a irregularly structured document could be a dictionary that has hundreds of thousands of entries, which on average have 2-3 definitions, but a handful have like 60 definitions…this causes the identifier space padding to get bloated with tons of ghost nodes.

If you are interested about any of this take a look at eXist: An Open Source XML Database by Wolfgang Meier. David also recommended XQuery - The XML Query Language by Micael Brundage for learning more about XQuery. In the future David said there is work going on at W3C on extensions to search and update: XQuery Search and Update, which will be good to keep an eye on.

All in all I like XQuery and I’m glad that I finally seem to understand it enough to consider it part of my tool set. I’d like to see XQuery used in say a Java program much like SQL is used via JDBC–and be able to get back results say as JDOM or XOM objects. I must admit I’m not so interested in using XQuery as a general programming language though.