lcsh, thesauri and skos

Simon Spero has an interesting post on why LCSH cannot be considered a thesaurus. At $work I’ve been working on mapping LCSH/MARC to SKOS, so Simon’s efforts in both collecting and analyzing LCSH authority data have been extremely valuable. In particular Simon and Leonard Willpower’s involvement with SKOS alerted me relatively early on to some of the problems that lie in store when thinking of LCSH in terms of a thesaurus.

The problem stems from very specific (standardized) notions of what thesauri are. ANSI/NISO Z39.19-2005 defines broader relationships in thesauri as being transitive. So if a has the broader term b, and b has the broader term c, then you can infer that a has the broader term c.

Now consider the broader relationships (BT, for those of you with the red books handy, or who care to browse from the comfort of your chair) from the heading “Non-alcoholic cocktails”:

If broader relationships are to be considered transitive, one is obliged to treat Alcoholic beverages as a broader term for Non-alcoholic cocktails. But clearly it’s nonsense to consider a non-alcoholic cocktail a specialization of an alcoholic beverage. As Simon pointed out, the problem was recognized by Mary Dykstra soon after LCSH adopted terminology from the thesaurus world (BT, NT, RT) in 1986. Her article, “LC Subject Headings Disguised as a Thesaurus”, describes the many difficulties of treating LCSH as a thesaurus. In the example above from LCSH, the broader (BT) relationship is used for both hierarchical (IS-A) relationships and part/whole (HAS-A) relationships. According to thesauri folks this is a no-no.
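To make the inference concrete, here is a minimal Python sketch that follows BT links transitively. The edges are copied from the dot file shown later in this post; the names `BROADER` and `broader_closure` are made up for illustration and are not part of any LCSH or SKOS tooling.

```python
# BT (broader term) links for the "Non-alcoholic cocktails" subgraph,
# keyed by the narrower heading. Data copied from the dot file in this post.
BROADER = {
    "Non-alcoholic cocktails": {"Cocktails", "Non-alcoholic beverages"},
    "Cocktails": {"Alcoholic beverages"},
    "Alcoholic beverages": {"Beverages"},
    "Non-alcoholic beverages": {"Beverages"},
    "Non-alcoholic beer": {"Non-alcoholic beverages"},
}


def broader_closure(term, broader=BROADER):
    """Return every heading reachable from `term` by following BT links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ancestors = broader_closure("Non-alcoholic cocktails")
# If BT is treated as transitive, this is True, which is the nonsense result.
print("Alcoholic beverages" in ancestors)
```

Under a transitive reading, “Alcoholic beverages” ends up among the broader terms of “Non-alcoholic cocktails”, even though no one asserted that link directly.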

LCSH aside, the semantics of broader/narrower have been an issue for SKOS for a fair amount of time. Guus Schreiber proposed a resolution, which was just accepted at yesterday’s SWD telecon. SKOS is trying to straddle several different worlds, enabling the representation of a range of knowledge organization systems, from thesauri and taxonomies to subject heading lists, folksonomies and other controlled vocabularies. To remain flexible in this way while still appealing to the thesaurus world, a compromise was reached in which the skos:broader and skos:narrower semantic relations were declared to be sub-properties of two new transitive properties: skos:broaderTransitive and skos:narrowerTransitive (respectively). Since a sub-property does not inherit transitivity from its super-property, skos:broader and skos:narrower themselves remain non-transitive, so SKOS can still be used by people who want to represent loose broader relationships (LCSH, and others). At the same time, SKOS will allow vocabulary owners who want them to infer transitive broader/narrower relationships across concepts via the new properties. Incidentally, the SKOS Reference was just approved yesterday as a W3C Working Draft, which is its first step along the way to hopefully becoming a Recommendation.
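The effect of the compromise can be sketched in a few lines of Python. This is only an illustration of the two entailment rules described above (the sub-property rule and transitivity of skos:broaderTransitive), not an implementation of a SKOS reasoner; the `ex:` identifiers and the `infer` function are hypothetical.

```python
BROADER = "skos:broader"
BROADER_TRANSITIVE = "skos:broaderTransitive"


def infer(triples):
    """Apply the two rules repeatedly until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        # Rule 1: skos:broader is a sub-property of skos:broaderTransitive,
        # so every asserted broader link is also a broaderTransitive link.
        for s, p, o in inferred:
            if p == BROADER:
                new.add((s, BROADER_TRANSITIVE, o))
        # Rule 2: skos:broaderTransitive is transitive.
        pairs = [(s, o) for s, p, o in inferred if p == BROADER_TRANSITIVE]
        for s1, o1 in pairs:
            for s2, o2 in pairs:
                if o1 == s2:
                    new.add((s1, BROADER_TRANSITIVE, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical concept identifiers, for illustration only.
asserted = {
    ("ex:NonAlcoholicCocktails", BROADER, "ex:Cocktails"),
    ("ex:Cocktails", BROADER, "ex:AlcoholicBeverages"),
    ("ex:AlcoholicBeverages", BROADER, "ex:Beverages"),
}
closure = infer(asserted)

# The transitive link is inferred...
print(("ex:NonAlcoholicCocktails", BROADER_TRANSITIVE, "ex:AlcoholicBeverages") in closure)
# ...but no new skos:broader link is, so the asserted hierarchy stays as written.
print(("ex:NonAlcoholicCocktails", BROADER, "ex:AlcoholicBeverages") in closure)
```

An application that only looks at skos:broader sees exactly what the vocabulary owner asserted; one that wants the transitive view queries skos:broaderTransitive instead.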

My pottering about with LCSH and SKOS has also illustrated the value in making links between concepts explicit. Modeling LCSH as a graph data structure (SKOS), where each concept has a unique identifier, has been a simple and yet powerful step in working with the data. For example, to generate the image above, I simply wrote a script that transformed the subgraph related to “Non-alcoholic cocktails” to a graphviz dot file:

digraph G {
  rankdir = "BT"
  "Non-alcoholic cocktails" -> "Cocktails";
  "Alcoholic beverages" -> "Beverages";
  "Non-alcoholic beverages" -> "Beverages";
  "Cocktails" -> "Alcoholic beverages";
  "Non-alcoholic cocktails" -> "Non-alcoholic beverages";
  "Non-alcoholic beer" -> "Non-alcoholic beverages";
}

And then ran that through the graphviz dot utility:

% dot -T png non-alcoholic-cocktails.dot > non-alcoholic-cocktails.png

to generate the PNG file you see. It’s my hope that making a richly linked graph like LCSH/SKOS available will enable not only enhanced use of the vocabulary, but also aid in creative, collaborative refactoring of the graph. I know that these issues are not new to LC; however, tools that enable refactoring along the lines of what Margherita Sini proposed for the cocktail problem above will only be possible in a world where the graph can easily be manipulated and downstream applications (library catalogs, etc.) can easily adapt to the changing concept scheme.
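The transformation script itself isn’t shown here, but the idea is small enough to sketch. The following Python is a hypothetical stand-in, assuming the subgraph has already been pulled out as (narrower, broader) label pairs; `quote`, `to_dot` and `EDGES` are illustrative names, not the script I actually ran.

```python
def quote(label):
    """Wrap a label in double quotes for dot, escaping embedded quotes."""
    return '"' + label.replace('"', '\\"') + '"'


def to_dot(edges, rankdir="BT"):
    """Render (narrower, broader) pairs as a graphviz digraph."""
    lines = ["digraph G {", f"  rankdir = {quote(rankdir)}"]
    for narrower, broader in edges:
        lines.append(f"  {quote(narrower)} -> {quote(broader)};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# The BT links from the "Non-alcoholic cocktails" subgraph.
EDGES = [
    ("Non-alcoholic cocktails", "Cocktails"),
    ("Alcoholic beverages", "Beverages"),
    ("Non-alcoholic beverages", "Beverages"),
    ("Cocktails", "Alcoholic beverages"),
    ("Non-alcoholic cocktails", "Non-alcoholic beverages"),
    ("Non-alcoholic beer", "Non-alcoholic beverages"),
]

print(to_dot(EDGES), end="")
```

The printed text can be saved to a file and handed to the dot utility. Because the graph is just a list of pairs, refactoring a relationship means editing one pair and regenerating the picture.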

8 thoughts on “lcsh, thesauri and skos”

  1. I believe that transitivity is not really essential here: the fundamental issue is class inclusion vs. prototypes. For example, is “stone lion” a hyponym of “lion”? If we say no, we are hard put to it to understand what the relationship is; if we say yes, we have to abandon such obvious facts about lions as that they are made out of meat (hat tip to Terry Bisson here) and that they have parents that are also lions. Similar issues arise over “teddy bear” and “bear”, “ostrich” and “bird”, and “T-girl” and “girl” :-).

    If on the other hand we treat “lion” as a prototype category, then it’s easy to see that stone lions are lions that simply lack some of the prototypical lion properties while preserving others such as the mane, the tail, the jaws, and the pugnacious expression.

    The OO version of this problem has people deriving a ColoredPoint or 3DPoint class directly from a (2D, colorless) Point class, because it’s easy to add just one instance variable, though it should be obvious that a 3D point is not a 2D point. Instead, both should be derived from AbstractPoint, a class that is uncommitted to issues of dimensionality and color. Similarly we could have “abstract lion” as the hypernym of both “meat lion” and “stone lion”. But then how much do we factor out? It’s impossible to say a priori. By having a lion prototype object, we can clone it several times to create real or stone lions as appropriate by overriding prototype properties.

    And yet. Class inclusion is so handy when it does work, so powerful and expressive, it’s hard to think of abandoning it altogether.

  2. Spot-on, John.

    Cocktails aside, “alcoholic beverages” and “non-alcoholic beverages” are clearly disjoint sets. Can this be exploited somehow?

    The fact that “non-alcoholic cocktails” has a path (indirectly) to both might be used to derive a score for edges in the graph, hinting at how strict the parental relationships are. We can derive from your graph that “cocktails share many properties of alcoholic drinks, but not all of them”, and this could be presented as an annotation on the edge between cocktails and alcoholic beverages.

    A simple score model might be to count the number of edges in a node’s subtree which link to a peer of the current node, and apply this value to the edge connecting the current node to its parent. The distance of the subnode from the current node should probably be factored into the score. So, cocktails-to-alcoholic beverages might get a weight of 0.5, since it has a child that refers to Non-alcoholic beverages.

    I’m not a graph expert by any means, and this weighting approach may be naive, but at least it is something that can be derived programmatically, and can be presented visually (e.g. the thickness of the edges could depend on their scores), and that might help others in studying relationships in the graph, especially when viewing subgraphs like the one in your post. (For example, “Beer” isn’t on the graph, but its influence could be implied by edge annotation.)

  3. You stumbled upon the classical diamond problem of object-oriented knowledge modelling. Just forget about transitivity and secondary differences between IS-A and HAS-A. In a general thesaurus there is only a broader/narrower relationship; everything else depends on your specific use case and can be discussed.

  4. Thanks for the helpful comments John, Graham and Jakob. I agree, I kind of muddied the waters focusing on transitivity in this post. The distinction between class inclusion vs prototypes is what I was after, and I appreciate the clarification.

    The good news is there is nothing preventing SKOS from being extended in a way to capture these two specializations of skos:broader…the bad news is that, well you have to extend SKOS, and multiple communities might do it totally differently. This is the double-edged sword of trying to serve multiple communities.

    Graham, if memory serves Alistair Miles’ thesis contains some details about the weighting of links between concepts along similar lines to what you suggested. I’m not a graph expert either, so these suggestions are most welcome.

  5. And, damn.

    Dykstra says “We librarians have lived with LCSH as a liability for a long time. The matter now, however, must no longer be lived with, for it has become a professional disgrace.”

    OUCH. She said that in 1988. Read her essay. Pretty much everything she complains about is still with us 20 years later. 20 YEARS. That’s an awful long time to still be living with what Dykstra was not afraid to call a professional disgrace. Ouch ouch ouch.
