Archive for the ‘metadata’ Category

literals and resources

Wednesday, March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seem to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PPS, no, that lccn url doesn’t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))

oai-ore post baltimore thoughts

Thursday, March 13th, 2008

The recent OAI-ORE meeting was just up the road in Baltimore, so it was easy for a bunch of us from the Library of Congress to attend. I work on a team at LC that is specifically looking at the role that repositories play at the library; I’ve implemented OAI-PMH data providers and harvesters, and in the past couple of years I’ve gotten increasingly interested in semantic web technologies — so OAI-ORE is of particular interest to me. I’ve commented a bit about OAI-ORE on here before, but I figure it can’t hurt to follow in my coworker’s footsteps and summarize my thoughts after the meeting.

(BTW, above is an image of some constellations I nabbed off of wikipedia. I included it here because the repeated analogy (during the meeting) of OAI-ORE resource maps as constellations was really compelling — and quite poetic.)

The Vocabulary

It seems to me that the real innovation of the OAI-ORE effort is that it provides a lightweight RDF vocabulary for talking about aggregated resources on the web. Unfortunately I think that this kernel gets a little bit lost in the 6 specification documents that were released en masse a few months ago.

The ORE vocabulary essentially consists of three new resource types: ore:ResourceMap, ore:Aggregation, ore:AggregatedResource; and five new properties to use with those types: ore:describes, ore:isDescribedBy, ore:aggregates, ore:isAggregatedBy, ore:analogousTo. In addition, the Vocabulary document provides guidance on how to use a few terms from the Dublin Core vocabulary: dc:creator, dc:rights, dcterms:modified, dcterms:created.

The vocabulary is small, so if I were them I would publish the vocabulary elements using hash URIs instead of slash URIs. The reason for this is that you don’t have to jigger the web server to do an httpRange-14 style 303 correctly:

  • http://www.openarchives.org/ore/0.2/terms#Aggregation
  • http://www.openarchives.org/ore/0.2/terms#AggregatedResource
  • http://www.openarchives.org/ore/0.2/terms#ResourceMap
  • http://www.openarchives.org/ore/0.2/terms#describes
  • http://www.openarchives.org/ore/0.2/terms#isDescribedBy
  • http://www.openarchives.org/ore/0.2/terms#aggregates
  • http://www.openarchives.org/ore/0.2/terms#isAggregatedBy
  • http://www.openarchives.org/ore/0.2/terms#analogousTo
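To make the hash URI point a bit more concrete, here is a minimal sketch (Python standard library only) of what a client does with these URIs. The fragment is stripped before the HTTP request goes out, so all eight terms end up pointing at one plain document and no 303 is needed. The BASE value is just the proposed namespace from the list above, not something the ORE specs are known to publish.

```python
from urllib.parse import urldefrag

# Proposed namespace from the list above (an assumption, not the official one).
BASE = "http://www.openarchives.org/ore/0.2/terms"
TERMS = [
    "Aggregation", "AggregatedResource", "ResourceMap", "describes",
    "isDescribedBy", "aggregates", "isAggregatedBy", "analogousTo",
]


def document_for(term_uri):
    """Return the URL a client actually requests: the URI minus its fragment."""
    return urldefrag(term_uri).url


hash_uris = [f"{BASE}#{name}" for name in TERMS]

# Every hash URI collapses to the same retrievable document.
documents = {document_for(uri) for uri in hash_uris}

for uri in hash_uris:
    print(uri, "->", document_for(uri))
```

With slash URIs each term would be its own request target, which is where the server-side 303 configuration comes in.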

Also, I think ore:AggregatedResource is currently missing from the RDF/XML vocabulary, so it should be added, and ore:isDescribedBy seems to be commented out.

There is a lot of redundancy between the Abstract Data Model and the Vocabulary documents–so I would recommend collapsing them down into a single, succinct document. This is in keeping with the DRY principle and will have the added benefit of making it easier for newbies to hit the ground running (not having to wade through multiple docs and mentally reconcile them). I could understand having a separate Abstract Data Model document if it were totally divorced from the web and semantic web technologies like RDF, but it’s not.

The Graph

The OAI-ORE effort seemed to be mostly driven by a desire to take harvesting agents the last mile to the actual repository resources themselves–enabling digital library objects (in addition to their metadata) to be harvested from repositories (using HTTP), and to be referenced from other contexts (say objects in other repositories). This desire was born out of real, hard-won experience with harvesting metadata records, and marked a shift from metadata-centric harvesting to resource-centric harvesting.

In addition OAI-ORE marks a departure from predictable and mind-numbing arguments about SIP formats (METS, DIDL, FOXML, IEEE LOM, XFDU, etc). Yet as soon as we have our shiny new OAI-ORE vocabulary we have to learn yet-another-packaging-format, this time one built on top of Atom.

First, let me just say I’m a big fan of RFC 4287, in particular how it is used in the RESTful Atom Publishing Protocol (RFC 5023). I also think it makes sense to have an Atom serialization for OAI-ORE resource maps, assuming there is a GRDDL transform for turning it into RDF. But the workshop in Baltimore seemed to stress that the Atom serialization was the only way to do OAI-ORE, and didn’t emphasize that there are in fact lots of ways of representing RDF graphs on the web. For example GRDDL allows you to associate arbitrary XML with an XSLT transform to extract an RDF graph. And you could encode your RDF graph directly with RDFa, N3, Turtle, N-Triples, or RDF/XML.

Perhaps there is a feeling that stressing the RDF graph too much will alienate some people who are more familiar with XML technologies. Or perhaps all these graph serialization choices could be perceived as being too overwhelming. But I think the opposite extreme of making it look like you can only use an overloaded Atom document as a means to publishing ORE resource maps is misguided, and will ultimately slow adoption. Why not encourage people to publish GRDDL transforms for METS, DIDL or mark up their “splash pages” with RDFa? This would bring the true value of the OAI-ORE work home–it’s not about yet-another-packaging format, it’s about what the various packaging formats have in common on the web.

Release Early, Release Often

In hindsight I think it would’ve been helpful for the OAI-ORE group to privately build consensus about the core OAI-ORE vocabulary (if necessary), then release that into the world wild web for discussion. Then, once the kinks were worked out and there was general understanding, the group could have moved on to issues such as discovery and serialization. As it stands the various documents were all dumped at the same time, and seem somewhat fragmented, and in places redundant. Clearly a lot of conversations have gone on that aren’t happening on the public discussion list.

I expressed interest in being part of the OAI-ORE group and was politely turned down. I’m actually kind of glad really because I also don’t want to be part of some cabal of digital library practitioners. Maybe I should’ve titled this post “Sour Grapes” :-) Seriously though, the digital library needs good practical solutions and communities of users that encourage widespread adoption and tool support. We don’t need research-ware. Having secret discussions and occasional public events that feel more like lectures than meetings isn’t a good way to encourage adoption.

Anyhow, I hope that this isn’t all seen as being too harsh. Everyone’s a critic eh? All in all there is a lot in OAI-ORE to be proud of. The effort to integrate Web Architecture into Digital Library practices is most welcome indeed. Keep up the good work y’all.

oai-ore and the shadow web

Friday, February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seems to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?”. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oai-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

lcsh, thesauri and skos

Wednesday, January 23rd, 2008

Simon Spero has an interesting post on why LCSH cannot be considered a thesaurus. At $work I’ve been working on mapping LCSH/MARC to SKOS, so Simon’s efforts in both collecting and analyzing LCSH authority data have been extremely valuable. In particular Simon and Leonard Willpower’s involvement with SKOS alerted me relatively early on to some of the problems that lie in store when thinking of LCSH in terms of a thesaurus.

The problem stems from very specific (standardized) notions of what thesauri are. Z39.19-2005 defines broader relationships in thesauri as being transitive. So if a has the broader term b, and b has the broader term c, then you can infer a has the broader term c.

Now consider the broader relationships (BT for those of you w/ the red books handy, or who care to browse authorities.loc.gov from the comfort of your chair) from the heading “Non-alcoholic cocktails”:

If broader relationships are to be considered transitive one is obliged to treat Alcoholic beverages as a broader term for Non-alcoholic cocktails. But clearly it’s nonsense to consider a non-alcoholic cocktail a specialization of an alcoholic beverage. As Simon pointed out the problem was recognized by Mary Dykstra soon after LCSH adopted terminology from the thesauri world (BT, NT, RT) in 1986. Her article, LC Subject Headings Disguised as a Thesaurus describes the many difficulties of treating LCSH as a thesaurus. In the example above from LCSH the broader (BT) relationship is used for both hierarchical (IS-A) relationships, as well as part/whole (HAS-A) relationships. According to thesauri folks this is a no-no.
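The inference is easy to reproduce. Below is a minimal sketch that treats BT as transitive and walks the links upward from “Non-alcoholic cocktails”. The links are the same ones that appear in the graphviz file further down; the BROADER dictionary and the broader_transitive function are names made up for this illustration.

```python
# BT links, narrower term -> set of broader terms (taken from the dot file below).
BROADER = {
    "Non-alcoholic cocktails": {"Cocktails", "Non-alcoholic beverages"},
    "Cocktails": {"Alcoholic beverages"},
    "Alcoholic beverages": {"Beverages"},
    "Non-alcoholic beverages": {"Beverages"},
    "Non-alcoholic beer": {"Non-alcoholic beverages"},
}


def broader_transitive(term, broader=BROADER):
    """Return every term reachable from `term` by following BT one or more times."""
    seen = set()
    stack = list(broader.get(term, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(broader.get(current, ()))
    return seen


inferred = broader_transitive("Non-alcoholic cocktails")
print(sorted(inferred))
# ['Alcoholic beverages', 'Beverages', 'Cocktails', 'Non-alcoholic beverages']
```

“Alcoholic beverages” is never asserted directly as a broader term of “Non-alcoholic cocktails”; it only shows up once the relationship is treated as transitive.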

LCSH aside, the semantics of broader/narrower have been an issue for SKOS for a fair amount of time. Guus Schreiber proposed a resolution, which was just accepted at yesterday’s SWD telecon. SKOS is trying to straddle several different worlds, enabling the representation of a range of knowledge organization systems from thesauri and taxonomies to subject heading lists, folksonomy and other controlled vocabularies. To remain flexible in this way, while still appealing to the thesaurus world a compromise was reached where the skos:broader and skos:narrower semantic relations were declared to be sub-properties of two new properties: skos:broaderTransitive and skos:narrowerTransitive (respectively). Since transitivity is not inherited, SKOS can still be used by people who want to represent loose broader relationships (LCSH, and others). At the same time SKOS will allow vocabulary owners to infer transitive broader/narrower relationships across concepts. Incidentally the SKOS Reference was just approved yesterday as a W3C Working Draft, which is its first step along the way to hopefully becoming a Recommendation.

My pottering about with LCSH and SKOS has also illustrated the value in making links between concepts explicit. Modeling LCSH as a graph data structure (SKOS), where each concept has a unique identifier has been a simple and yet powerful step in working with the data. For example to generate the image above, I simply wrote a script that transformed the subgraph related to “Non-alcoholic cocktails” to a graphviz dot file:


digraph G {
  rankdir = "BT"
  "Non-alcoholic cocktails" -> "Cocktails";
  "Alcoholic beverages" -> "Beverages";
  "Non-alcoholic beverages" -> "Beverages";
  "Cocktails" -> "Alcoholic beverages";
  "Non-alcoholic cocktails" -> "Non-alcoholic beverages";
  "Non-alcoholic beer" -> "Non-alcoholic beverages";
}

And then ran that through the graphviz dot utility:


% dot -T png non-alcoholic-cocktails.dot > non-alcoholic-cocktails.png

to generate the PNG file you see. It’s my hope that making a richly linked graph like LCSH/SKOS available will enable not only enhanced use of the vocabulary, but also aid in creative, collaborative refactoring of the graph. I know that these issues are not new to LC, however tools that enable refactoring along the lines of what Margherita Sini proposed for the cocktail problem above will only be possible in a world where the graph can easily be manipulated and downstream applications (library catalogs, etc) can easily adapt to the changing concept scheme.
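The script itself isn’t shown here, so the following is only a sketch of the last step: turning a list of (narrower, broader) pairs into the dot text above. How the real script picked out the subgraph from the SKOS data is not covered; EDGES and to_dot are illustrative names.

```python
# (narrower, broader) pairs, copied from the dot file above.
EDGES = [
    ("Non-alcoholic cocktails", "Cocktails"),
    ("Alcoholic beverages", "Beverages"),
    ("Non-alcoholic beverages", "Beverages"),
    ("Cocktails", "Alcoholic beverages"),
    ("Non-alcoholic cocktails", "Non-alcoholic beverages"),
    ("Non-alcoholic beer", "Non-alcoholic beverages"),
]


def to_dot(edges, rankdir="BT"):
    """Serialize edges as a graphviz digraph, drawn bottom-to-top by default."""
    lines = ["digraph G {", f'  rankdir = "{rankdir}"']
    for narrower, broader in edges:
        lines.append(f'  "{narrower}" -> "{broader}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(EDGES)
print(dot_text)
```

Redirecting the output to non-alcoholic-cocktails.dot gives a file that the dot command above can render.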

metadata hackers

Monday, December 31st, 2007

I opened the paper this morning to read a story of another person involved in the creation of MARC who has just died. I hadn’t realized before reading Henrietta Avram and Samuel Snyder’s obituaries that there was a bit of an NSA-LC connection when MARC was being created.

From 1964 to 1966, [Samuel Snyder] was coordinator of the Library of Congress’s information systems office. He was among the creators of the library’s Machine Readable Cataloging system that replaced the handwritten card with an electronic searchable database system that became the standard worldwide.

I imagine NSA folks had a lot to do with early automation efforts in the federal government…but it’s still an interesting connection. One of my coworkers is reading up on this early history of MARC so this is for him in the unlikely event that he missed it…email would probably have worked better I guess, but I also wanted to pay tribute. Libraries wouldn’t be what they are today without this influential early work.

permalinks reloaded

Monday, December 17th, 2007

The recently announced Zotero / Internet Archive partnership is exciting on a bunch of levels. The one that immediately struck me was the use of the Internet Archive URI. As you may have noticed before, all the content in the Internet Archive Wayback Machine can be referenced with a URL that looks something like:

  • http://web.archive.org/web/{yyyymmddhhmmss}/{url}

Where url is the document URL you want to look up in the archive at the given time. So for example:

is a URL for what http://google.com looked like on December 02, 1998 at 23:04:10. Perhaps this is documented somewhere prominent or is common knowledge, but it looks like you can play with the timestamp, and archive.org will adjust as needed, redirecting you to the closest snapshot it can find:

and even:

which redirects to the most recent content for a given URL. It’s just a good old 302 at work:

ed@curry:~$ curl -I http://web.archive.org/web/199812/http://www.google.com/
HTTP/1.1 302 Found
Date: Mon, 17 Dec 2007 21:11:12 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.2 mod_ssl/2.0.54 OpenSSL/0.9.7g mod_perl/2.0.1 Perl/v5.8.7
Location: http://web.archive.org/web/19981202230410/www.google.com/
Content-Type: text/html; charset=iso-8859-1
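Building these URLs programmatically is a one-liner. Here is a minimal sketch based only on the URL pattern and the partial-timestamp behavior observed above; wayback_url and its digits parameter are made-up names, and nothing here is a documented archive.org API.

```python
from datetime import datetime

WAYBACK = "http://web.archive.org/web"


def wayback_url(url, when, digits=14):
    """Build a Wayback Machine URL for `url` as of `when`.

    `digits` truncates the yyyymmddhhmmss timestamp (6 keeps just yyyymm),
    relying on the redirect-to-closest-snapshot behavior seen above.
    """
    if not 1 <= digits <= 14:
        raise ValueError("digits must be between 1 and 14")
    timestamp = when.strftime("%Y%m%d%H%M%S")[:digits]
    return f"{WAYBACK}/{timestamp}/{url}"


snapshot = datetime(1998, 12, 2, 23, 4, 10)
print(wayback_url("http://www.google.com/", snapshot))
print(wayback_url("http://www.google.com/", snapshot, digits=6))
```

The second call produces the same URL used in the curl request above.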

So anyhow, pretty cool use of URIs and HTTP right? The addition of Zotero to the mix will mean that scholars can cite the web as it appeared at a particular point in time:

… as scholars begin to use not only traditional primary sources that have been digitized but also “born digital” materials on the web (blogs, online essays, documents transcribed into HTML), the possibility arises for Zotero users to leverage the resources of IA to ensure a more reliable form of scholarly communication. One of the Internet Archive’s great strengths is that it has not only archived the web but also given each page a permanent URI that includes a time and date stamp in addition to the URL.

Currently when a scholar using Zotero wishes to save a web page for their research they simply store a local copy. For some, perhaps many, purposes this is fine. But for web documents that a scholar believes will be important to share, cite, or collaboratively annotate (e.g., among a group of coauthors of an article or book) we will provide a second option in the Zotero web save function to grab a permanent copy and URI from IA’s web archive. A scholar who shares this item in their library can then be sure that all others who choose to use it will be referring to the exact same document.

This is pretty fundamental to scholarship on the web. Of course when generating a time-anchored permalink with Zotero one can well expect that archive.org will on occasion not have a snapshot of said content, resulting in a 404. It would be great if archive.org could leverage these requests for snapshots as requests to go out and archive the page. One could imagine a blocking and a nonblocking request: the former would spawn a request to fetch a particular URI, stash the content away, and return the permalink; the latter would just quickly return the best match it’s already got (which may be a 404).

Anyhow, it’s really good to see these two outfits working together. Nice work!

ps. dear lazyweb is there a documented archive.org api available?

more marcdb

Monday, November 5th, 2007

This morning Clay and I were chatting about Library of Congress Subject Headings and SKOS a bit. At one point we found ourselves musing about how much reuse there is of topical subdivisions in topical headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spero’s authority data–so I fired up psql and wrote a query:

SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC;

And a few seconds later…

 subdivision                          | total
--------------------------------------+-------
 Law and legislation                  |  3342
 Religious aspects                    |  2500
 Buddhism, [Christianity, etc.]       |   898
 History                              |   847
 Equipment and supplies               |   571
 Taxation                             |   566
 Baptists, [Catholic Church, etc.]    |   476
 Diseases                             |   450
 Research                             |   422
 Campaigns                            |   378
 Awards                               |   342
 Finance                              |   284
 Study and teaching                   |   284
 Surgery                              |   275
 Employees                            |   269
 Spectra                              |   261
 Computer programs                    |   259
 Labor unions                         |   218
 Testing                              |   207
 Diagnosis                            |   194
 Isotopes                             |   190
 Complications                        |   183
 Physiological effect                 |   172
 Programming                          |   163

There’s nothin’ like the smell of strong set theory in the morning. Although something seems a bit fishy about [Christianity, etc.] and [Catholic Church, etc.]… If you want to try similar stuff and don’t want to wait hours for marcdb to import all the data and you use postgres, here’s the full database dump which you ought to be able to import:

  % createdb authorities
  % wget http://inkdroid.org/data/authorities.sql.bz2
  % bunzip2 authorities.sql.bz2
  % psql authorities < authorities.sql
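If you don’t have postgres handy, the shape of the query can still be tried out against a toy in-memory SQLite database. This is only a sketch: the two tables below recreate just the columns the query touches (the real marcdb schema presumably has more), and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal layout: only the columns used by the query above.
conn.executescript("""
    CREATE TABLE data_fields (id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE subfields (
        id INTEGER PRIMARY KEY,
        data_field_id INTEGER,
        code TEXT,
        value TEXT
    );
""")

# Invented rows: three topical headings (150) and one geographic heading (151).
data_fields = [(1, "150"), (2, "150"), (3, "150"), (4, "151")]
subfields = [
    (1, "a", "Heading one"), (1, "x", "Law and legislation"),
    (2, "a", "Heading two"), (2, "x", "Law and legislation"),
    (3, "a", "Heading three"), (3, "x", "History"),
    (4, "a", "Some place"), (4, "x", "History"),
]
conn.executemany("INSERT INTO data_fields (id, tag) VALUES (?, ?)", data_fields)
conn.executemany(
    "INSERT INTO subfields (data_field_id, code, value) VALUES (?, ?, ?)",
    subfields,
)

# Same query as above: count $x subdivisions, but only within 150 fields.
QUERY = """
SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC;
"""
rows = conn.execute(QUERY).fetchall()
for subdivision, total in rows:
    print(f"{subdivision:<25} | {total}")
```

The “History” subdivision under the 151 field is left out of the count, which is what the tag = '150' condition is for.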

good ore

Friday, November 2nd, 2007

In case you missed it the Object-Reuse-and-Exchange (ORE) folks are having a get-together at Johns Hopkins University (Baltimore, MD) on March 3, 2008. It’s free to register, but space is limited. The Compound information objects whitepaper, May 2007 Technical Committee notes and the more recent Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication provide a good taste of what the beta ORE specs are likely to look like.

The ORE group isn’t small, and includes individuals from quite different organizations. So any consensus that can be garnered I think will be quite powerful. Personally I’ve been really pleased to see how much the ORE work is leaning on web architecture: notably resolvable HTTP URIs, content-negotiation, linked-data and named graphs. Also interesting in the recent announcement is that the initial specs will use RFC 4287 for encoding the data model. Who knows, perhaps the spec will rely on archive feeds as discussed recently on the code4lib discussion list.

I’m particularly interested to see what flavor of URIs are used to identify the compound objects:

The protocol-based URI of the Resource Map identifies an aggregation of resources (components of a compound object) and their boundary-type inter-relationships. While this URI is clearly not the identifier of the compound object itself, it does provide an access point to the Resource Map and its representations that list all the resources of the compound object. For many practical purposes, this protocol-based URI may be a handy mechanism to reference the compound object because of the tight dependency of the visibility of the compound object in web space on the Resource Map (i.e., in ORE terms, a compound object exists in web space if and only if there exists a Resource Map describing it).

We note, however, two subtle points regarding the use of the URI of the Resource Map to reference the compound object. First, doing so is inconsistent with the web architecture and URI guidelines that are explicit in their suggestion that a URI should identify a single resource. Strictly interpreted, then, the use of the URI of the Resource Map to identify both the Resource Map and the compound object that it describes is incorrect. Second, some existing information systems already use dedicated URIs for the identification of compound information objects “as a whole.” For example, many scholarly publishers use DOIs whereas the Fedora and aDORe repositories have adopted identifiers of the info URI scheme. These identifiers are explicitly distinct from the URI of the Resource Map. (from Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication)

I understand the ORE group is intentionally not aligning themselves too closely with the semantic web community. However I think they need to consider whether compound information objects are WWW information resources or not:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.” (from Architecture of the World Wide Web vol. 1).

I’m not totally convinced that the resource map can’t serve as a suitable representation for the compound information object–however for the sake of argument let’s say I am. It seems to me that the URI for the compound information object identifies the concept of a particular compound information object, which lies in various pieces on the network. However this doesn’t preclude the use of HTTP URLs to identify the compound objects. Indeed the What HTTP URIs Identify and Cool URIs for the Semantic Web documents provide specific guidance on how to serve up these non-information resources. Of course philosophical arguments around httpRange-14 have raged for a while. But the Linking Open Data project is using the hash URI and 303 redirect approaches very effectively. There has even been some work on a sitemap extension to enable crawling. As a practical matter using URLs to identify compound information objects will encourage their use because they will naturally find their way into publications, blogs, and other compound objects. Using non-resolvable or quasi-resolvable info URIs or DOIs will mean people just won’t create the links–and when they do they will create links that can’t be easily verified and evolved over time with standard web tools. The OAI-ORE effort represents a giant leap forward for the digital library community into the web. Here’s to hoping they land safely–we need this stuff.

groupthink

Wednesday, September 26th, 2007

This little hack came up in channel after Bruce posted some XSLT to transform OCLC Identities XML into FOAF.

xsltproc \
  http://inkdroid.org/data/identity-foaf.xsl \
  http://orlabs.oclc.org/Identities/key/lccn-no99-10609 \
  | xmllint --format -

!!!

XSLT has its place to be sure.

OCLC deserves some REST

Wednesday, September 26th, 2007

Hey Worldcat Identities you are doing awesome work–you deserve some REST. Why not use content-negotiation to serve up your HTML and XML representations? So:

  curl --header "Accept: text/html" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return HTML and

  curl --header "Accept: application/xml" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return XML. This would allow you to:

  • not be limited to XSLT-driven user views (doesn’t that get tedious?)
  • scale to other sorts of output (application/rdf+xml, etc)
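For what it’s worth, the server side of this doesn’t take much. Here is a deliberately simplified sketch of picking a representation from the Accept header. It honors q-values and wildcards but ignores the finer precedence rules in the HTTP spec, and the AVAILABLE list and function names are made up for illustration; it says nothing about how the Identities service is actually built.

```python
# Representations this hypothetical resource can serve, in server preference order.
AVAILABLE = ["text/html", "application/xml", "application/rdf+xml"]


def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = [p.strip() for p in part.split(";")]
        media_range = pieces[0].lower()
        q = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q as "not acceptable"
        prefs.append((media_range, q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def matches(media_range, media_type):
    """True if media_type falls within media_range (supports */* and type/*)."""
    if media_range == "*/*":
        return True
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type == m_type and r_sub in ("*", m_sub)


def negotiate(accept_header, available=AVAILABLE):
    """Return the best available media type, or None (a 406 in HTTP terms)."""
    for media_range, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        for media_type in available:
            if matches(media_range, media_type):
                return media_type
    return None


print(negotiate("text/html"))
print(negotiate("application/xml"))
print(negotiate("application/rdf+xml;q=0.9, text/html;q=0.5"))
```

Adding a new output format then means adding one entry to the list of available representations, with the URI staying the same.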

At least from the outside I’d have to disagree w/ Roy — it appears that institutions can and do innovate. But I won’t say it is easy …