Archive for the ‘metadata’ Category

the weight of legacy data

Sunday, May 20th, 2007

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8 and have used MARC::Charset >= v0.8 to do it you may very well have null characters (0x00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
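
A quick way to see whether converted data is affected is to scan it for nulls. This is only a rough sketch in Python: it assumes the records have already been decoded into strings, and the sample list is made up for illustration.

```python
def find_nul_records(records):
    """Return the indexes of records whose text contains a NUL (U+0000)."""
    return [i for i, text in enumerate(records) if "\x00" in text]


# Hypothetical sample data: the second string mimics a record where a
# combining character was replaced by a null during conversion.
sample = ["Borges, Jorge Luis", "Tchai\x00kovsky", "Tolkien, J. R. R."]
affected = find_nul_records(sample)
print(affected)
```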

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.
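
The fallback this calls for can be sketched roughly as follows. This is Python rather than the Perl that MARC::Charset is written in, and the dictionary layout and field names ("ucs", "alt") are assumptions for illustration, not the module's internals. The code points are simply the Unicode characters that carry the names listed above, and the "ACUTE" entry is a made-up example of the normal case.

```python
import unicodedata

# Hypothetical in-memory form of a few mapping entries. An empty "ucs"
# means the mapping file gave no UCS value for that character.
MAPPING = {
    "LIGATURE, SECOND HALF": {"ucs": "", "alt": "FE21"},
    "DOUBLE TILDE, SECOND HALF": {"ucs": "", "alt": "FE23"},
    "ACUTE": {"ucs": "0301", "alt": ""},
}


def to_unicode(entry):
    """Prefer the UCS value; fall back to the alternate value if it is empty."""
    hex_value = entry["ucs"] or entry["alt"]
    if not hex_value:
        # Refuse to emit anything rather than silently producing a null.
        raise ValueError("entry has neither a UCS nor an alternate value")
    return chr(int(hex_value, 16))


for name, entry in MAPPING.items():
    print(name, "->", unicodedata.name(to_unicode(entry)))
```

Raising an error when both values are missing is a design choice of this sketch: a loud failure is easier to notice than a stray null in the output.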

v0.97 now uses the alternate value when the UCS value is not available…which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies–and hear my plea for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news but it feels good to be saying it. If you’re not convinced read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried/true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data sharing libraries, but it’s time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.

rda/frbr and the semantic web

Thursday, May 3rd, 2007

I was interested to learn from Alistair Miles that folks in the library community are starting to look at expressing models such as RDA and FRBR using the semantic web technology stack–including things like Dublin Core Abstract Model.

It’s exciting and timely to see the luminaries in the field get together to talk about this sort of convergence. I must admit I’m still a little bit hazy about why we need DCAM when we already have RDF, RDFS and OWL … but I think only good things can come from this interaction.

It’s particularly heartening that the library community is exploring what RDA and FRBR look like when the rubber meets the road of data representation. Although Ian Davis, Richard Newman and Bruce D’Arcus have arguably already done this for FRBR.

Update: official announcement from the British Library.

late easter present

Tuesday, April 10th, 2007

I finally took the time to make pymarc setuptools friendly. This basically means that if you’ve got easy_install handy you can:

sudo easy_install pymarc

If you haven’t looked at eggs yet, they are pretty much the de facto standard for distributing Python code. PyPI (the Python Package Index, aka the Python Cheese Shop) allows easy_install to locate and download packages, which are then unpacked and installed.

pymarc was basically an experiment to make sure I understood how eggs worked with pypi. Next up Rob Sanderson has sent me some code he and a colleague did for parsing Library of Congress Classification Numbers which I’m going to bundle up as an egg as well. Stay tuned.

oclc registry

Monday, February 19th, 2007

So OCLC’s WorldCat Registry is a nice new addition to OCLC’s growing list of web services. Do a search for your library and take a look at the URL: aye that’s right it’s SRU. In fact do a view source on the results page and you’ll see an SRU response in XML–the HTML is being rendered with client side XSLT.

If you drill into a particular institution you’ll see a pleasantly cool uri:

http://worldcat.org/registry/Institutions/89073

…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:

http://worldcat.org/webservices/registry/content/Institutions/89073

This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html. Yeah, I did check:

  curl -I -H "Accept: text/xml" http://worldcat.org/registry/Institutions/89073
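
For the record, the same check can be sketched in Python with nothing but the standard library. The function names are my own, and the q-value on text/html is just one way of putting text/xml ahead of it. Nothing is sent until negotiated_content_type() is called.

```python
import urllib.request


def build_head_request(url, accept="text/xml, text/html;q=0.9"):
    """Build a HEAD request whose Accept header prefers XML over HTML."""
    return urllib.request.Request(url, headers={"Accept": accept}, method="HEAD")


def negotiated_content_type(url):
    """Send the HEAD request and report the Content-Type the server chose."""
    with urllib.request.urlopen(build_head_request(url)) as response:
        return response.headers.get("Content-Type")


# Building the request does not touch the network.
request = build_head_request("http://worldcat.org/registry/Institutions/89073")
print(request.get_method(), request.get_header("Accept"))
```

Calling negotiated_content_type() with the institution URL would show whether the server honours the header.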

It’s inspiring to see OCLC going the extra mile to make their new services have web friendly machine APIs.

Update: for deeper analysis check out Pete Johnston’s WorldCat Institution Registry and Identifiers. He has some great points on the use of identifiers in the xml responses.

exhibit

Friday, February 16th, 2007

If you haven’t tried Exhibit out yet, the simile folks have created a truly wonderful data publishing framework that runs entirely in your browser with a bit of javascript, html and css.

The remarkable part is that it requires no backend database, but simply operates on a stream of json. If you have a couple minutes take a look at their Getting Started Tutorial which shows you how to create an exhibit of MIT-related Nobel laureates with a tiny bit of HTML, CSS and JavaScript.

Just as an experiment I tried pointing it at my delicious json feed for metadata. It turns out that exhibit wants json data to be a hash with a key ‘items’ that points to a list of items. In addition it also wants each item to have a ‘label’ key. I quickly reformatted the delicious json with simplejson, and got this.
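
The reshaping looks roughly like this. It is a sketch, not the script I actually ran: it uses the standard library json module (whose dumps/loads calls mirror simplejson's), and it assumes each delicious post is a dictionary with "u" (url), "d" (description) and "t" (tags) keys. The sample posts are invented.

```python
import json


def to_exhibit(posts):
    """Wrap a list of posts in the {'items': [...]} hash and add a label."""
    items = []
    for post in posts:
        item = dict(post)  # copy so the input is left untouched
        # Use the description as the label, falling back to the URL.
        item["label"] = post.get("d") or post.get("u", "")
        items.append(item)
    return {"items": items}


# Hypothetical posts in the assumed delicious feed shape.
sample_posts = [
    {"u": "http://example.org/a", "d": "An example bookmark", "t": ["metadata"]},
    {"u": "http://example.org/b", "t": ["metadata", "rdf"]},
]
exhibit_json = json.dumps(to_exhibit(sample_posts), indent=2)
print(exhibit_json)
```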

A few minutes later I prodded the simile folks to see if there is a way of filtering json data on the way into exhibit so that it can be normalized…time passes (like maybe an hour) and then I hear from Johan Sundström that the latest/greatest exhibit code has this sort of filtering built in!

Tangential to the exhibit code, there has been an interesting discussion recently about how to expose exhibit content to indexing services like google. Since exhibit content is generated with pure javascript, and google (as far as we know) primarily indexes html content–the exhibit content is rendered invisible. This is a problem that digital library applications and repositories have to deal with as well, so it may be of interest.

oxford dictionary of national biography

Monday, February 5th, 2007

It’s interesting to see that the Oxford Dictionary of National Biography has created Cool URIs for their index of notable people. So for example if you want an identifier for JRR Tolkien you can use:

http://www.oxforddnb.com/index/101031766

Alas, the full content of the biography isn’t available (unless you subscribe), but I guess some publishers still have business models to hold on to. To see all the entries you have to browse them.

I think it’s a nice simple example of how authority files can be integrated into the web as we know it. Thanks to Caroline Arms for forwarding this on to me…

identifiers and authority records

Saturday, January 6th, 2007

Authority files are rather important for unambiguously talking about a person, place or thing. In database lingo they essentially amount to a primary key for a table. Given the time and effort libraries spend in maintaining authority records and assigning control numbers to individuals it makes sense that a URI could be assigned to an individual in such authority files. I realize this idea is nothing new, but until recently I hadn’t seen it put into practice particularly well.

I imagine this has been there all along but I just noticed that OCLC’s Linked Authority File includes PURLs for authors now. For example the following URL contains an LCCN:

http://errol.oclc.org/laf/n79-7035

When you GET this your browser is automatically redirected with an HTTP 302 to:

http://alcme.oclc.org/laf/servlet/OAIHandler?
verb=GetRecord&metadataPrefix=oai_dc&identifier=n79-7035

which you’ll notice is an OAI-PMH request to fetch a Dublin Core record with the identifier n79-7035:

<oai_dc:dc
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
    http://openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    Borges, Jorge Luis,--1899-
  </dc:creator>
  <dc:description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    Suárez Lynch, B.--nnnc
  </dc:description>
</oai_dc:dc>

So now we know who this identifier is for, and the established heading for the individual. But it gets better (or worse depending on your perspective). Since this is an OAI-PMH server you can issue a ListMetadataFormats request to see what other flavors this record might be available in. If you do you’ll find out that this record is also available as marcxml in all its unholy glory (if you follow that link your browser will use a stylesheet to turn the raw xml into something a bit more presentable). Putting aside my snideness about MARC for a moment, this is a lot of useful data being made available.
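
Both requests are plain OAI-PMH URLs, so they are easy to build by hand. Here is a small Python sketch: the base URL and the identifier come from the redirect above, while the helper name is my own invention. The verbs and the optional identifier argument to ListMetadataFormats are part of the OAI-PMH protocol.

```python
from urllib.parse import urlencode

# Base URL taken from the redirect target shown above.
BASE_URL = "http://alcme.oclc.org/laf/servlet/OAIHandler"


def oai_url(verb, **params):
    """Build an OAI-PMH request URL; the verb always comes first."""
    query = [("verb", verb)] + list(params.items())
    return BASE_URL + "?" + urlencode(query)


# The request the PURL redirects to.
get_record = oai_url("GetRecord", metadataPrefix="oai_dc", identifier="n79-7035")

# Ask which metadata formats are available for this one record.
list_formats = oai_url("ListMetadataFormats", identifier="n79-7035")

print(get_record)
print(list_formats)
```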

You can also search the name authority file and get relevant PURLs via a SOAP/REST service. For example the irc bot panizzi in #code4lib actually has a bit of logic that allows it to do lookups in the linked authority file:

06:56 < edsu> @naf borges, jorge
06:56 < panizzi> edsu: [20 matches] [~1] Borges, Jorge Luis, 1899-
                 <http://errol.oclc.org/laf/n79-7035>; [~2] Macedo, Jorge
                 Borges de. <http://errol.oclc.org/laf/n82-149895>; [~3]
                 Borges, Jorge G. (Jorge Guillermo), 1874-1938
                 <http://errol.oclc.org/laf/n90-681877>; [~4] Sua?rez Lynch, B.
                 <http://errol.oclc.org/laf/n82-21644>; [~5] Borges, Jorge
                 Wheliton Miranda <http://errol.oclc.org/laf/n92-76758>; [~6]
                 Canido Borges, Jorge Oscar (3 more messages)

All in all it’s an impressive mix of technology, standards and practice. It is not entirely clear to me how this work relates to the Virtual International Authority File. Perhaps LAF wasn’t considered a good acronym? If you are interested in such things Thom Hickey had a really interesting talk at Access2006 which has audio available.

got data?

Monday, November 20th, 2006

Just saw this float by on simile-general

… thanks to Ben, we now have permission to publish the barton RDF dump (consisting of 50 million juicy RDF statements from the MIT library catalogue). They are now available at

http://simile.mit.edu/rdf-test-data/

Juicy indeed…it would be nice to see more libraries do this sort of thing.

rsinger++

Thursday, September 28th, 2006

So Ross beat out 11 other projects to win the OCLC Research Software Contest for his next generation OpenURL resolver umlaut. Second place went to Jesse Andrews’ BookBurro–so the competition was fierce this year. Much more so than last year when there were 4 contestants.

Those of us who hang out in #code4lib got to hear about this project when it was just a glimmer in his eye…and had front row seats for hearing about the development as it progressed. Essentially umlaut is an openurl router that’s able to consult online catalogs (via SRU), other OpenURL resolvers (SFX), Amazon, Google, Yahoo, Connotea, CiteULike and OAI-PMH. It’s all written in Ruby and RubyOnRails.

I feel particularly proud because Ross is enough of a mad genius to have found a use for some ruby gems I wrote for doing sru, oai-pmh and querying OCLC’s xisbn service.

Speaking of which we’ve been collaborating recently on a little ruby gem for querying OCLC’s OpenURL Resolver Registry. This registry essentially makes it easy to determine what the appropriate OpenURL resolver is given a particular IP address. So you could theoretically rewrite your fulltext URLs so that they were geospatially aware. For example:

  require "resolver_registry"
 
  client = ResolverRegistry::Client.new
  institution = client.find('130.207.50.91')
  print institution.resolver.base_address

If you want to take a look direct your svn client like so:

svn co http://rsinger.library.gatech.edu/svn/openurl_registry/

I imagine it’ll get released to rubyforge sometime shortly.

ruby-oai v0.0.3

Tuesday, September 19th, 2006

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.

After committing the branch he was working on I ran some performance tests of my own. I ran a vanilla ListRecords request against dspace, eprints and american memory oai-pmh servers using both the rexml (default) and libxml backend parsers. Here are the results:

server           parser  real           user           sys
dspace           rexml   0m3.632s       0m2.008s       0m0.044s
                 libxml  0m1.900s       0m0.212s       0m0.032s
                         1.732s (+48%)  1.796s (+89%)  0.012s (+27%)

eprints          rexml   0m19.807s      0m1.984s       0m0.036s
                 libxml  0m19.344s      0m0.236s       0m0.024s
                         0.463s (+2%)   1.748s (+88%)  0.012s (+33%)

american-memory  rexml   0m12.991s      0m5.424s       0m0.052s
                 libxml  0m7.420s       0m0.324s       0m0.032s
                         5.571s (+43%)  5.104s (+94%)  0.02s (+38%)

Those percentage values are speed improvements. Thanks Terry :-)
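
In case the arithmetic isn't obvious: each figure is the time saved by libxml, followed by that saving as a percentage of the rexml time. A tiny Python sketch (the function name is made up) using the dspace "real" numbers:

```python
def improvement(before, after):
    """Return the seconds saved and the saving as a whole percentage of `before`."""
    saved = before - after
    return saved, round(100 * saved / before)


# dspace "real" times: rexml 3.632s, libxml 1.900s
saved, percent = improvement(3.632, 1.900)
print(f"{saved:.3f}s (+{percent}%)")
```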