Archive for the ‘metadata’ Category

more marcdb

Monday, November 5th, 2007

This morning Clay and I were chatting about Library of Congress Subject Headings and SKOS a bit. At one point we found ourselves musing about how much reuse there is of topical subdivisions in topical headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spero’s authority data–so I fired up psql and wrote a query:

SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC;

And a few seconds later…

 subdivision                          | total
--------------------------------------+-------
 Law and legislation                  |  3342
 Religious aspects                    |  2500
 Buddhism, [Christianity, etc.]       |   898
 History                              |   847
 Equipment and supplies               |   571
 Taxation                             |   566
 Baptists, [Catholic Church, etc.]    |   476
 Diseases                             |   450
 Research                             |   422
 Campaigns                            |   378
 Awards                               |   342
 Finance                              |   284
 Study and teaching                   |   284
 Surgery                              |   275
 Employees                            |   269
 Spectra                              |   261
 Computer programs                    |   259
 Labor unions                         |   218
 Testing                              |   207
 Diagnosis                            |   194
 Isotopes                             |   190
 Complications                        |   183
 Physiological effect                 |   172
 Programming                          |   163

There’s nothin’ like the smell of strong set theory in the morning. Although something seems a bit fishy about [Christianity, etc.] and [Catholic Church, etc.]… If you use postgres and want to try similar queries without waiting hours for marcdb to import all the data, here’s the full database dump, which you ought to be able to import:

  % createdb authorities
  % wget http://inkdroid.org/data/authorities.sql.bz2
  % bunzip2 authorities.sql.bz2
  % psql authorities < authorities.sql
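
If you just want to poke at the shape of the query without downloading anything, here is a minimal sketch using Python's built-in sqlite3 module. It assumes only the two tables and the columns named in the query above; the rest of the schema and the four sample headings are made up for illustration and are not taken from the real dump.

```python
import sqlite3

# Hypothetical miniature of the two marcdb tables the query touches.
# Only the columns used by the query are modelled here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_fields (id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE subfields (
        id INTEGER PRIMARY KEY,
        data_field_id INTEGER REFERENCES data_fields(id),
        code TEXT,
        value TEXT
    );
""")

# Made-up sample headings: (field tag, [(subfield code, value), ...])
headings = [
    ("150", [("a", "Banks and banking"), ("x", "Law and legislation")]),
    ("150", [("a", "Insurance"), ("x", "Law and legislation")]),
    ("150", [("a", "Peace"), ("x", "Religious aspects")]),
    ("151", [("a", "France"), ("x", "History")]),  # not a 150, so filtered out
]
for tag, subs in headings:
    cur = conn.execute("INSERT INTO data_fields (tag) VALUES (?)", (tag,))
    conn.executemany(
        "INSERT INTO subfields (data_field_id, code, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, code, value) for code, value in subs],
    )

# Same logic as the psql query, written with an explicit JOIN.
rows = conn.execute("""
    SELECT subfields.value AS subdivision, count(*) AS total
    FROM subfields
    JOIN data_fields ON subfields.data_field_id = data_fields.id
    WHERE subfields.code = 'x' AND data_fields.tag = '150'
    GROUP BY subfields.value
    ORDER BY total DESC
""").fetchall()

for subdivision, total in rows:
    print(f"{subdivision:<25} | {total}")
```

The explicit JOIN is only a stylistic choice; it should return the same rows as the comma-style join in the original query.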

good ore

Friday, November 2nd, 2007

In case you missed it, the Object-Reuse-and-Exchange (ORE) folks are having a get-together at Johns Hopkins University (Baltimore, MD) on March 3, 2008. It’s free to register, but space is limited. The Compound information objects whitepaper, May 2007 Technical Committee notes and the more recent Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication provide a good taste of what the beta ORE specs are likely to look like.

The ORE group isn’t small, and includes individuals from quite different organizations. So any consensus that can be garnered I think will be quite powerful. Personally I’ve been really pleased to see how much the ORE work is leaning on web architecture: notably resolvable HTTP URIs, content-negotiation, linked-data and named graphs. Also interesting in the recent announcement is that the initial specs will use RFC 4287 for encoding the data model. Who knows, perhaps the spec will rely on archive feeds as discussed recently on the code4lib discussion list.

I’m particularly interested to see what flavor of URIs are used to identify the compound objects:

The protocol-based URI of the Resource Map identifies an aggregation of resources (components of a compound object) and their boundary-type inter-relationships. While this URI is clearly not the identifier of the compound object itself, it does provide an access point to the Resource Map and its representations that list all the resources of the compound object. For many practical purposes, this protocol-based URI may be a handy mechanism to reference the compound object because of the tight dependency of the visibility of the compound object in web space on the Resource Map (i.e., in ORE terms, a compound object exists in web space if and only if there exists a Resource Map describing it).

We note, however, two subtle points regarding the use of the URI of the Resource Map to reference the compound object. First, doing so is inconsistent with the web architecture and URI guidelines that are explicit in their suggestion that a URI should identify a single resource. Strictly interpreted, then, the use of the URI of the Resource Map to identify both the Resource Map and the compound object that it describes is incorrect. Second, some existing information systems already use dedicated URIs for the identification of compound information objects “as a whole.” For example, many scholarly publishers use DOIs whereas the Fedora and aDORe repositories have adopted identifiers of the info URI scheme. These identifiers are explicitly distinct from the URI of the Resource Map. (from Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication)

I understand the ORE group is intentionally not aligning themselves too closely with the semantic web community. However I think they need to consider whether compound information objects are WWW information resources or not:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” (from Architecture of the World Wide Web vol. 1).

I’m not totally convinced that the resource map can’t serve as a suitable representation for the compound information object–however for the sake of argument let’s say I am. It seems to me that the URI for the compound information object identifies the concept of a particular compound information object, which lies in various pieces on the network. However this doesn’t preclude the use of HTTP URLs to identify the compound objects. Indeed What HTTP URIs Identify and Cool URIs for the Semantic Web both provide specific guidance on how to serve up these non-information resources.

Of course philosophical arguments around httpRange-14 have raged for a while. But the Linking Open Data project is using the hash URI and 303 redirect very effectively. There has even been some work on a sitemap extension to enable crawling. As a practical matter using URLs to identify compound information objects will encourage their use because they will naturally find their way into publications, blogs and other compound objects. Using non-resolvable or quasi-resolvable info-uris or DOIs will mean people just won’t create the links–and when they do they will create links that can’t be easily verified and evolved over time with standard web tools.

The OAI-ORE effort represents a giant leap forward for the digital library community into the web. Here’s to hoping they land safely–we need this stuff.

groupthink

Wednesday, September 26th, 2007

This little hack came up in channel after Bruce posted some XSLT to transform OCLC Identities XML into FOAF.

xsltproc \
  http://inkdroid.org/data/identity-foaf.xsl \
  http://orlabs.oclc.org/Identities/lccn-no99-10609 \
  | xmllint --format -

!!!

XSLT has its place to be sure.

OCLC deserves some REST

Wednesday, September 26th, 2007

Hey WorldCat Identities, you are doing awesome work–you deserve some REST. Why not use content-negotiation to serve up your HTML and XML representations? So:

  curl --header "Accept: text/html" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return HTML and

  curl --header "Accept: application/xml" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return XML. This would allow you to:

  • not be limited to XSLT driven user views (doesn’t that get tedious?)
  • scale to other sorts of output (application/rdf+xml, etc)
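
For the curious, here is a rough sketch of what the server side of that negotiation could look like. To be clear, this is not how WorldCat Identities works; the function names and the list of available types are my own assumptions, and the matching is simplified (it honours q-values and wildcards but ignores the specificity rules in the HTTP spec).

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media range without a q parameter counts as q=1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, available, default="text/html"):
    """Pick one of the available media types for this request."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" becomes "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in available:
            return media_type
    return None  # a real server would answer 406 Not Acceptable here


# Hypothetical set of representations for one identity record
AVAILABLE = ("text/html", "application/xml", "application/rdf+xml")

for header in ("text/html", "application/xml", "application/rdf+xml;q=0.9, text/html;q=0.5"):
    print(header, "->", negotiate(header, AVAILABLE))
```

Adding a new output format then means adding one entry to the list of available types rather than minting a new URL pattern.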

At least from the outside I’d have to disagree w/ Roy — it appears that institutions can and do innovate. But I won’t say it is easy …

linking open data

Monday, August 27th, 2007


If it isn’t already, put the Linking Open Data project on your radar. It’s a grassroots effort to make large data sets available on the web. These aren’t just tarballs sitting in an FTP directory either–they’re URL addressable information resources available in machine readable format. A few weeks ago Joshua Tauberer announced the availability of the US Census as close to 1 billion triples. If you like data and the web the discussion list is a wonderful place to watch these data sets getting released and linked together.

app and repositories

Monday, July 16th, 2007

Pete Johnston blogged recently about a very nice use of the Atom Publishing Protocol (APP) to provide digital library repository functionality. The project is supported by UKOLN at the University of Bath and is called Simple Web-service Offering Repository Deposit (SWORD).

If you are interested in digital repositories and web services take a look at their APP profile. It’s a great example of how APP encourages the use of the Atom XML format and RESTful practices, which can then be extended to suit the particular needs of a community of practice.

To understand APP you really only need to grok a handful of concepts from the data model and REST. The data model is basically made up of a service document, which describes a set of collections, which aggregate member entries, which can in turn point to a media entry. All of these types of resources are identified with URLs. Since they are URLs you can interact with the objects with plain old HTTP–just like your web browser. For example you can list the entries in a collection by issuing a GET to the collection URL. Or you can create a member resource by doing a POST to the collection URL. Similarly you can delete a member entry by issuing a DELETE to the member entry. The full details are available in the latest draft of the RFC–and also in a wide variety of articles including this one.

So to perform a SWORD deposit a program would have to:

  1. get the service document for the repository (GET http://www.myrepository.ac.uk/app/servicedocument)
  2. see what collections it can add objects to
  3. create some IMS, METS or DIDL metadata to describe your repository object and ZIP it up with any of the object’s datastreams
  4. POST the zip file to the appropriate collection URL with the appropriate X-Format-Namespace to identify the format of the submitted object
  5. check that you got a 201 Created status code and record the Location of the newly created resource
  6. profit!

Steps 1 and 2 are perhaps not even necessary if the URL for the target collection is already known. Some notable things about the SWORD profile of APP:

  • two levels of conformance (one really minimalistic one)
  • the idea that collections imply particular treatments or workflows associated with how the object is ingested
  • service documents dynamically change to describe only the collections that a particular user can see
  • no ability to edit resources
  • no ability to delete resources
  • no ability to list collections
  • repository objects are POSTed as ZIP files to collections
  • HTTP Basic Authentication + TLS for security
  • the use of DublinCore to describe collections and their respective policies.
  • collections can support mediated deposit which means deposits can include the X-On-Behalf-Of HTTP header to identify the user to create the resource for.
  • the use of X-Format-Namespace HTTP header to explicitly identify the format of the submission package that is zipped up: for example IMS, METS or DIDL.
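
To make the deposit steps a bit more concrete, here is a sketch of steps 3 to 5 using only the Python standard library. It builds the ZIP package and the POST request but deliberately does not send anything. The collection URL, the file names and the user name are invented for illustration, and I am assuming the METS namespace URI is what goes in X-Format-Namespace; check the SWORD profile before relying on any of it.

```python
import io
import urllib.request
import zipfile

# Hypothetical values: this collection URL is invented for the example
COLLECTION_URL = "http://www.myrepository.ac.uk/app/collection/articles"
METS_NAMESPACE = "http://www.loc.gov/METS/"


def build_package(metadata_xml, datastreams):
    """Step 3: zip the metadata together with the object's datastreams."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("mets.xml", metadata_xml)  # file name is an assumption
        for name, content in datastreams.items():
            package.writestr(name, content)
    return buffer.getvalue()


def build_deposit_request(collection_url, package, format_namespace,
                          on_behalf_of=None):
    """Step 4: prepare (but do not send) the POST to the collection URL."""
    headers = {
        "Content-Type": "application/zip",
        "X-Format-Namespace": format_namespace,
    }
    if on_behalf_of:
        # Only meaningful for collections that support mediated deposit
        headers["X-On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(
        collection_url, data=package, headers=headers, method="POST"
    )


def check_deposit(status, headers):
    """Step 5: expect 201 Created and return the new resource's Location."""
    if status != 201:
        raise RuntimeError(f"deposit failed with HTTP status {status}")
    return headers.get("Location")


package = build_package(
    b"<mets xmlns='http://www.loc.gov/METS/'/>",
    {"article.pdf": b"%PDF-1.4 placeholder bytes"},
)
request = build_deposit_request(
    COLLECTION_URL, package, METS_NAMESPACE, on_behalf_of="someuser"
)
print(request.get_method(), request.full_url, len(package), "bytes")
```

Passing the request to urllib.request.urlopen would perform the actual deposit, and the response's status and headers could then be handed to check_deposit. A real client would also need to supply HTTP Basic credentials over TLS, which this sketch leaves out.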

While I understand why update and delete would be disabled for deposited packages I don’t really understand why the listing of collections would be disabled. An atom feed for a collection would essentially enable harvesting of a repository, much like ListRecords in OAI-PMH.

I’m not quite sure I completely understand X-On-Behalf-Of and sword:mediation either. I could understand X-On-Behalf-Of in an environment where there is no authentication. But if a user is authenticated couldn’t their username be used to identify who is doing the deposit? Perhaps there are cases (as the doc suggests) where a deposit is done for another user?

All in all this is really wonderful work. Of particular value for me was seeing the list of SWORD extensions and also the use of HTTP status codes. If I have the time I’d like to throw together a sample repository server and client to see just how easy it is to implement SWORD. I did try some experiments along these lines for my presentation back in February…but they never got as well defined as SWORD.

purl2

Thursday, July 12th, 2007

It’s great to see that OCLC is going to work with Zepheira on a new version of the PURL service and that it’s going to have an Apache license. Other than addressing scalability issues it sounds like Zepheira is going to build in support for resources that are outside of the information space of the web:

The new PURL software will also be updated to reflect the current understanding of Web architecture as defined by the World Wide Web Consortium (W3C). This new software will provide the ability to permanently identify networked information resources, such as Web documents, as well as non-networked resources such as people, organizations, concepts and scientific data. This capability will represent an important step forward in the adoption of a machine-processable “Web of data” enabled by the Semantic Web.

Since Eric Miller helped start up Zepheira it’s not surprising that purl2 will take this on. As part of some experiments I’ve been doing with SKOS, and serving up Concepts over HTTP it has become clear that a minimal bit of work for managing these identifiers would be useful. I can definitely see the need for a general solution that helps manage identifiers for people, organizations, concepts, etc. which also fits into how HTTP should/could serve up the resources associated with them.

via Thom Hickey

ruby-zoom v0.3.0

Tuesday, July 10th, 2007

Thanks to some prodding from William Denton and Jason Ronallo and the kindness of Laurent Sansonetti I’ve been added as a developer to the ruby-zoom project which provides a Ruby wrapper to the yaz Z39.50 library. I essentially wanted to remove some unused code from the project that was interfering with the ruby-marc gem … and I also wanted to create a gem for ruby-zoom. This was the first time I’ve tried packaging up a C wrapper as a gem and it was remarkably smooth. I also added a test suite and a Rakefile. So assuming you have yaz installed you can install ruby-zoom with:

% gem install zoom

I’ll admit, I’m no huge fan of Z39.50 but the fact remains that it’s pretty much the most widely deployed machine API for getting at bibliographic data locked up in online catalogs. It’s really nice to see forward-thinking systems at Talis, Evergreen and Koha that have (or have at least experimented with) OpenSearch implementations.

Angela’s dilemma

Thursday, May 31st, 2007

If you are interested in practical ways to garden in the emerging web-of-data take a look at this draft finding that folks in the W3C Technical Architecture Group are considering. Or for a different expression of the same idea look at Cool URIs for the Semantic Web.

These two documents describe a simple use of HTTP and URLs to identify resources that are outside of the information space of the web. Yes, you read that right: resources that are outside the information space of the web. Why would I want to use URLs to address resources that aren’t on the web!? The finding illustrates this subtlety using Angela’s dilemma:

Angela is creating an OWL ontology that defines specific characteristics of devices used to access the Web. Some of these characteristics represent physical properties of the device, such as its length, width and weight. As a result, the ontology includes concepts such as unit of measure, and specific instances, such as meter and kilogram. Angela uses URIs to identify these concepts. Having chosen a URI for the concept of the meter, Angela faces the question of what should be returned if that URI is ever dereferenced. There is general advice that owners of URIs should provide representations [AWWW] and Angela is keen to comply. However, the choices of possible representations appear legion. Given that the URI is being used in the context of an OWL ontology, Angela first considers a representation that consists of some RDF triples that allow suitable computer systems to discover more information about the meter. She then worries that these might be less useful to a human user, who might prefer the appropriate Wikipedia entry. Perhaps, she reasons, a better approach would be to create a representation which itself contains a set of URIs to a range of resources that provide related representations. Perhaps content negotiation can help? She could return different representations based on the content type specified in the request.

Angela’s dilemma is, of course, based on the fact that none of the representations she is considering are actually representations of the units of measure themselves. Even if the Web could deliver a platinum-iridium bar with two marks a meter apart at zero degrees celsius, or 1,650,763.73 wavelengths of the orange-red emission line in the electromagnetic spectrum of the krypton-86 atom in a vacuum, or even two marks, a meter apart on a screen, such representations are probably less than completely useful in the context of an information space. The representations that Angela is considering are not representations of the meter itself. Instead, they are representations of information resources related to the meter.

It is not appropriate for any of the individual representations that Angela is considering to be returned by dereferencing the URI that identifies the concept of the meter. Not only do the representations she is considering fail to represent the concept of the meter, they each have a different essence and so they should each have their own URI. As a consequence, it would also be inappropriate to use content negotiation as a way to provide them as alternate representations when the URI for the concept of the meter is dereferenced.

So assuming we are agreed about the problem what’s the solution? Basically you can use content negotiation and a 303 See Other HTTP status code to redirect to the appropriate resource. For an example of the basic idea in action fire up curl and take a look at how this instance of the SemanticMediaWiki responds to a GET request:

% curl --head http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:03:12 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Ruby
Content-Type: text/html; charset=UTF-8

Nothing too surprising there–basically just got redirected to another URL that serves up some friendly HTML describing the Ruby programming language. But send along an extra Accept header:

% curl --head --header 'Accept: application/rdf+xml' \
    http://ontoworld.org/wiki/Special:URIResolver/Ruby
HTTP/1.1 303 See Other
Date: Thu, 31 May 2007 20:04:36 GMT
Server: Apache/2.2.3 (Debian) ...
Location: http://ontoworld.org/wiki/Special:ExportRDF/Ruby
Content-Type: text/html; charset=UTF-8

Notice how you are redirected to another URL that results in RDF/XML describing Ruby coming down the pipe? Ruby on Rails and other frameworks have good REST support built in for doing content negotiation to provide multiple representations of a single information resource. But the use of the 303 See Other here is a new, subtle twist to accommodate the fact that the resource in question isn’t really a canonical set of bits on disk somewhere. The good news is that your browser will display the human readable resource when you visit http://ontoworld.org/wiki/Special:URIResolver/Ruby
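
Here is a toy version of the resolver logic behind those two curl transcripts. It is only a sketch of the idea, not the SemanticMediaWiki implementation: the function name is made up, the Accept check is a crude substring match, and the Vary header is my own addition (the transcripts above do not show one).

```python
BASE = "http://ontoworld.org/wiki/"


def resolve(name, accept=""):
    """Answer a GET on the concept URI with a 303 to a document about it.

    The concept itself has no representation, so the response never carries
    a body of its own; it only points at an information resource.
    """
    wants_rdf = "application/rdf+xml" in accept.lower()
    if wants_rdf:
        target = f"{BASE}Special:ExportRDF/{name}"  # machine-readable description
    else:
        target = f"{BASE}{name}"  # human-readable wiki page
    # Vary tells caches that the redirect target depends on the Accept header
    return 303, {"Location": target, "Vary": "Accept"}


print(resolve("Ruby"))
print(resolve("Ruby", "application/rdf+xml"))
```

The point to notice is that the HTML page and the RDF export each keep their own URL; the concept URI only ever redirects to one of them.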

Some folks would argue that resources that are outside the web don’t deserve URLs and should instead be identified with URIs like info-uris that are not required to resolve. My personal feeling is that info-uris do have a great deal of use in the enterprise (where they are most likely resolvable). But in situations like Angela’s where she is creating a public RDF document that needs to refer to concepts like “length” and “meter” I think it makes sense that these concepts should resolve to appropriate representations that will guide appropriate usage. Or as the Architecture of the World Wide Web puts it:

A URI owner may supply zero or more authoritative representations of the resource identified by that URI. There is a benefit to the community in providing representations. A URI owner SHOULD provide representations of the resource it identifies.

It’ll be interesting to see how these issues shake out as more and more structured data is made available on the web.

the weight of legacy data

Sunday, May 20th, 2007

v0.97 of MARC::Charset was just released with an important bugfix. If you’ve had the misfortune of needing to convert from MARC-8 to UTF-8 and have used MARC::Charset >= v0.8 to do it you may very well have null characters (0x00) in your UTF-8 data. Well, only if your MARC-8 data contained either of the following characters:

  • DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
  • LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

It turns out that the mapping file kindly provided by the Library of Congress does not include UCS mapping values for these two characters, and instead relies on alternate values.
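
To make the failure mode concrete, here is a small Python sketch (MARC::Charset itself is Perl, and this is not its code). A lookup that only consults the UCS column has nothing to emit for these two characters, whereas one that falls back to the alternate value does. The three-row table is a hand-made illustration of the shape of the mapping data, not a copy of the Library of Congress file, and the function names are hypothetical.

```python
import unicodedata

# Illustrative rows only: name -> (UCS value, alternate value), as hex strings.
# An empty string stands for a value that is missing from the mapping data.
MAPPING = {
    "DOUBLE TILDE, SECOND HALF": ("", "FE23"),
    "LIGATURE, SECOND HALF": ("", "FE21"),
    "ACUTE": ("0301", ""),
}


def lookup_ucs_only(name):
    """Mimic the bug: a missing UCS value silently degrades to NUL."""
    ucs, _alt = MAPPING[name]
    return chr(int(ucs, 16)) if ucs else "\x00"


def lookup_with_fallback(name):
    """Prefer the UCS value, fall back to the alternate, else fail loudly."""
    ucs, alt = MAPPING[name]
    value = ucs or alt
    if not value:
        raise LookupError(f"no Unicode mapping for {name}")
    return chr(int(value, 16))


for name in MAPPING:
    char = lookup_with_fallback(name)
    print(name, "->", f"U+{ord(char):04X}", unicodedata.name(char))
```

Failing loudly when neither value is present is a design choice of this sketch; the important part is that nothing quietly turns into a null character.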

v0.97 now uses the alternate value when the UCS value is not available…which is good going forward. But I am literally sad when I think about how this little bug has added to the noise of erroneous extant MARC data. Please accept my humble apologies–and hear my plea for bibliographic data that starts in Unicode rather than MARC-8. I’ll go further:

Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to.

Use MARCXML, MODS, RDF, JSON, YAML or something else instead. I realize this is hardly news but it feels good to be saying it. If you’re not convinced read Bill’s Pride and Prejudice installments. The library world needs to use common formats and encodings (with lots of tried/true tool sets)…and stop painting itself into a corner. Z39.2 has been hella useful for building up vast networks of data sharing libraries, but it’s time to leverage that data in ways that are more familiar to the networked world at large.

Many thanks to Michael O’Connor and Mike Rylander for discovering and resolving this bug.