c4l09

So code4lib2009 was a whole lot of fun. The amazing thing about the conference isn’t really reflected in the program of talks. I feel like I can say that since I gave one of them.

The real value is the social space and the time to talk to people you’ve seen online, throw around ideas, get background/contextual information on projects, etc. Hats off to Jean Rainwater and Birkin Diana for picking a beautifully casual and intimate hotel to hold the conference in.

It’s taken me a few days to get some perspective on all that happened. In the meantime I’ve read a few accounts that capture important aspects of the event from: Terry Reese, Jon Phipps, Jay Luker, Declan Fleming, Richard Wallis (1,2,3), Dan Chudnov, Gabe Farrell.

The Linked Data Pre-conference was quite valuable. For one it gave attendees some experience in what it means to publish data in a distributed way, and to write code to aggregate it using an attendees/FOAF experiment. Mike Giarlo aptly surmised from this that the key points for teaching beginners about linked data are that:

  1. Validators are essential
  2. You are not your FOAF

In other words:

  1. Am I doing this rdf/xml, turtle, rdfa right?
  2. ZOMG, httpRange-14!

Ian Davis presented the basics of RDF for people who are already familiar with traditional data management. Apparently Ian’s slides hit #1 for the day on SlideShare, which highlights the interest in linked data that is percolating through the Web. The pre-conf was very well attended as well.

Some folks like Jonathan Brinley and Michael Klein were able to hack on a Supybot Plugin to work with the FOAF data generated by the crawler. I also got chatting with William Denton about the potential of linked data for FRBR/RDA efforts. Unfortunately I didn’t hear about Alistair Miles’ new project on google-code for exploring the translation of traditional MARC/MODS into RDA/FRBR until after the event. Most of the other slides from presenters at the pre-conf are available from the wiki page.

I was really struck by some of the issues that Dan Chudnov raised in his talk about Caching and Proxying Linked Data right before lunch. In particular his comparison of the Linking Open Data Cloud to what libraries understand as their ready reference collection:


See p.9 of Dan’s slides

Dan explored how we need to think about the technical and administrative details of managing linked data if it is to be taken seriously by the library community. Relatedly, the pre-conf gave me an opportunity to publicly apologize to Anders Söderbäck for yanking lcsh.info offline in such an abrupt manner, and disturbing his links from subject authority records at libris.kb.se to lcsh.info. Dan’s ideas for consuming library linked data and Anders’ and my experience publishing library linked data gelled nicely in my brain. Similar ideas from Jon Phipps (one of the authors of Best Practice Recipes for Publishing RDF Vocabularies) have led me to believe this could be a nice little area for some research.

Prepping for the pre-conference itself was good fun, since it led me to discover a series of connections between the early development of the www and Brown University (where the conference was being held) and the history of hyperdata/text: in a nutshell it was Tim Berners-Lee’s proposal for the web -> Dynatext -> Steve DeRose -> Andy van Dam -> Hypertext Editing System -> Ted Nelson -> Doug Engelbart -> Vannevar Bush. Yeah, I guess you had to be there … or maybe that didn’t help. At any rate the slides, complete with breakdancing instructions, are available.

I haven’t even started talking about the main event yet. The things I took away from the 3 days of presentations and talks, in no particular order were:

  • I want to learn more about the Author-ID effort that Geoffrey Bilder talked about
  • Stefano Mazzocchi’s keynote and Sean Hannan’s presentation convinced me that I need to understand and play with Freebase’s JavaScript application development environment Acre and the sparql-ish, query by example Metaweb Query Language (MQL). It seems like Freebase is exploring some really interesting territory in building a shared knowledge base of machine readable, human editable data, which can sit behind a seemingly infinite amount of web presentation layers.
  • Terence Ingram’s presentation, Ross Singer’s presentation about Jangle, Mike’s and my SWORD presentation, a chat with Fedora/REST proponent Matt Zumwalt, and hearing about the Talis Platform have convinced me that real REST has got mind-share and traction in the library technology world.
  • Ian Davis’ keynote on the second day captured for me the constant challenge of staying true to the roots of the web, and how important it is to do so. It was really interesting to hear him emphasize the importance of data over code, and the necessity of decentralization over centralization.
  • Chatting with Jodi Schneider and William Denton and listening to their presentation made me want to understand RDA and FRBR at a practical level. This includes getting into the vocabularies that are being developed, and trying to convert some data. The history of FRBR in particular as told by Bill is also a gateway into a really fascinating history of cataloging. Also the work that Diane Hillman and Jon Phipps have been doing to enable vocabulary development like RDA/FRBR seems really important to keep abreast of.

More tidbits will probably float into my blog or into my tweets over the coming weeks, as the beer wears off, and the ideas sink in. But for now I’ll leave you with some of my favorite photos from the conference. It’s the people that make code4lib what it is. It was great to connect up, and meet new folks in the field.

 

Oh and in case you missed it, the tweetstream and the other fine photos.


the importance of being crawled

While lcsh.info was up and running, harvesters actively crawled it. At its core all lcsh.info did was mint a URI for every Library of Congress Subject Heading. This is similar in spirit to Brewster Kahle’s more ambitious OpenLibrary project to mint a URI for every book, or in his words:

One web page for every book

Aside: It’s also similar in spirit to RESTful web development, and to the linked data, semantic web effort generally.

Minting a URI for every Library of Congress Subject Heading meant that there were lots of densely interlinked pages. Some researchers at Stanford did a data visualization of LCSH two years ago, which illustrates just how deeply linked LCSH is:

I wanted lcsh.info to get crawled so I intentionally put some high level, well connected concepts (Humanities, Science, etc) on the home page to provide a doorway for web crawlers to walk through into the site and begin discovering all the broader, narrower, related links between concepts–without having to perform a search.

So lcsh.info is down now, but it turns out you can still see its shadow living on in quite a usable form in web search engines. For example type this into any of the big three search engines:

site:lcsh.info mathematics

And you’ll see:

Google, Yahoo, and Microsoft (screenshots of the lcsh.info result in each)

It’s interesting that (unlike Google and Yahoo) Microsoft’s relevancy ranking actually puts the heading for “Mathematics” at the top. Also note that simple things like giving the page a good title and descriptive text make the heading show up in usable form in each search engine.

It’s not too surprising that trying the same for authorities.loc.gov doesn’t work out so well. Umm, yeah http://authorities.loc.gov/robots.txt

On the one hand, I’m just being nostalgic looking at the content that once was there &sigh;. But on the other there seems to be a powerful message here, that putting data out onto the open web, and making it crawlable means your content is viewable via lots of different lenses. Maybe you don’t have to get search exactly right on your website, let other people do it for you.

Two other things come to mind: LOCKSS and Brewster’s even more ambitious project. I’ve been sort of hoping that somehow or another the Internet Archive and the Open Library would find their way into being publicly funded projects. What if? I can daydream, right?


crawling bibliographic data

Today’s Guardian article Why you can’t find a library book in your search engine prompted me to look at Worldcat’s robots.txt file for the first time. Part of the beauty of the web is that it’s an open information space where anyone (people and robots) can start with a single URL and follow their nose to other URLs. This seemingly simple principle is what has allowed a advertising^w search company like Google (that we all use every day) to grow and prosper.

The robots.txt file is a simple mechanism that allows web publishers to tell web crawlers what they are allowed to look at on a website. Predictably, the file is always found at the root of a website and is named robots.txt. You don’t have to have one, but many publishers like to control what gets indexed on their website, sometimes to hide content, and other times to shield what may be costly server-side operations. Anyway, here’s what you see today for worldcat.org:

User-agent: *
Disallow: /search

Sitemap: http://worldcat.org/identities/sitemap_index.xml

So this instructs a web crawler to not follow any links that match /search in the path, such as:

http://www.worldcat.org/search?qt=worldcat_org_all&q=everything+is+miscellaneous
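
A crawler would check these rules before fetching anything; here’s a minimal sketch of that check using Python’s standard urllib.robotparser, fed the two directives quoted above (everything else is just illustration):

from urllib import robotparser

# The two directives quoted above from worldcat.org's robots.txt
rules = [
    "User-agent: *",
    "Disallow: /search",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# /search URLs are off limits to a well-behaved crawler ...
print(rp.can_fetch("*", "http://www.worldcat.org/search?q=everything+is+miscellaneous"))  # False

# ... but item pages like /oclc/77271226 are fair game
print(rp.can_fetch("*", "http://www.worldcat.org/oclc/77271226"))  # True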

Now if you look on the homepage for Worldcat there are very few links into the dense bibliographic information space that is worldcat. But you’ll notice a few in the lower left box “Create lists”. So a crawler could for example discover a link to:

http://www.worldcat.org/oclc/77271226

This URL is allowed by the robots.txt so the harvester could go on to that page. Once at that item page there are lots of links to other bibliographic records, but notice that the links to other record displays all seem to match the /search pattern disallowed by the robots.txt, such as:

http://www.worldcat.org/search?q=au%3AC++S+Harris&qt=hot_author

or

http://www.worldcat.org/search?q=su%3ALondon+%28England%29+Fiction.&qt=hot_subject

So a web crawler will not be able to wander into the rich syndetic structure of Worldcat and start indexing.

However, all is not lost. Notice above that OCLC does reference a Worldcat sitemap in their robots.txt. Sitemaps are a lightweight mechanism that Yahoo, Google and Microsoft developed for instructing a web harvester on how to walk through a site.

So if we look at OCLC’s sitemap index we’ll see this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://worldcat.org/identities/lccn-no99-80690.sitemap.xml</loc>
    <lastmod>2008-05-19</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://worldcat.org/identities/lccn-sh95-8559.sitemap.xml</loc>
    <lastmod>2008-05-19</lastmod>
  </sitemap>
</sitemapindex>

This essentially defers to two other sitemaps. The first 30 lines of the first one (careful in clicking, it’s big!) look like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://worldcat.org/identities/lccn-no99-80690</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n78-95332</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n79-41716</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n80-92173</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  ...

Now we can see the beauty of sitemaps. They are basically just an XML representation for sets of web resources, much like syndicated feeds. There are actually 40,000 links listed in the first sitemap file, and 12,496 in the second. Now URLs like

http://worldcat.org/identities/lccn-no99-80690

are clearly allowed by the robots.txt file. So indexers can wander around and index the lovely identities portion of Worldcat. It’s interesting though, that the content served up by the identities portion of Worldcat is not HTML–it’s XML that’s transformed client side to HTML w/ XSLT. So it’s unclear how much a stock web crawler would be able to discover from the XML. If google/yahoo/microsoft’s crawlers are able to apply the XSLT transform, they will get some HTML to chew on. But notice in the HTML view that all the links into Worldcat proper (that aren’t other identities) are disallowed because they start with /search.

And a quick grep and perl pipeline confirms that all 52,496 URLs in the sitemaps are to the identities portion of the site…
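
Here’s roughly what that pipeline does, sketched in Python with just the standard library (it assumes the standard sitemaps.org namespace and that the sitemap URLs above still resolve):

import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_xml(url):
    # Fetch a sitemap (or sitemap index) and parse it
    with urllib.request.urlopen(url) as resp:
        return ET.parse(resp).getroot()

# The sitemap index referenced in Worldcat's robots.txt points at child sitemaps ...
index = fetch_xml("http://worldcat.org/identities/sitemap_index.xml")
sitemap_urls = [loc.text for loc in index.iter(NS + "loc")]

# ... each of which lists the URLs a crawler is being invited to visit
urls = []
for sitemap_url in sitemap_urls:
    urls.extend(loc.text for loc in fetch_xml(sitemap_url).iter(NS + "loc"))

print(len(urls))
print(all(u.startswith("http://worldcat.org/identities/") for u in urls))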

So this is a long way of asking: I wonder if web crawlers are crawling the book views on Worldcat at all? I imagine someone else has written about this already, and there is a known answer, but I felt like writing about the web and library data anyhow.

Since OCLC has gone through the effort of providing a web presentation for millions of books, and even links out to the libraries that hold them, they seem uniquely positioned to provide a global gateway for web crawlers to the library catalogs around the world. The links from worldcat out to the rest of the world’s catalogs would turn OCLC into a bibliographic super node in the graph of the web, much like Amazon and Google Books. But perhaps this is perceived as giving up the family jewels? Or maybe it would put too much stress on the system? Of course it would also be great to see machine readable data served up in a similar linked way.

So in conclusion, it would be awesome to see either (or maybe both):

  • the /search exclusion removed from the robots.txt file
  • sitemaps added for the web resources that look like http://www.worldcat.org/oclc/77271226

Of course one of the big projects I work on at LC is Chronicling America which is currently excluded by LC’s robots.txt…so I know that there can be real reasons for restricting crawling access (in our case performance problems we are trying to fix).


Oh gosh, I just noticed when re-reading the Guardian article that my lcsh.info experiment was mentioned. Hopefully there will be good news to report from LC on this front shortly.


work identifiers and the web

Michael Smethurst’s In Search of Cultural Identifiers post over at the BBC Radio Labs got me thinking about web identifiers for works, about LibraryThing and OCLC as linked library data providers, and finally about the International Standard Text Code. Admittedly it’s kind of a hodge-podge of topics, and I’m going to take some liberties with what ‘linked data’ and ‘works’ mean, so bear with me.

Both OCLC Worldcat and LibraryThing mint URIs for bibliographic works, like these for Wide Sargasso Sea:

So the library community really does have web identifiers for works–or more precisely web identifiers for human readable records about works. What’s missing (IMHO) is the ability to use that identifier to get back something meaningful for a machine. Tools like Zotero need to scrape the screen to pull out the data points of interest to citation management. Sure, if you want you can implement COinS or unAPI to allow the metadata to be extracted, but could there be a more web-friendly way of doing this?

Consider how blog syndication works on the web. You visit a blog (like this one) and your browser is able to magically figure out the location of an RSS or Atom feed for the blog, and give you an option to subscribe to it.

Well, it’s not really magic; it’s just a bit of markup in the blog’s HTML <head>: a <link rel="alternate"> element whose type and href point at the RSS or Atom feed.

Simple, right?

Now back to work identifiers. Consider that both Worldcat and LibraryThing have web2.0 apis for retrieving machine readable data for a work:

http://www.librarything.com/services/rest/1.0/?method=librarything.ck.getwork&id={work_id}&apikey={your_key}

or:

http://www.worldcat.org/webservices/catalog/content/{oclc_number}?wskey={key}

What if the web pages for these resources at OCLC and LibraryThing linked directly to these machine readable versions? For example, the page for Wide Sargasso Sea at LibraryThing could include a <link rel="alternate"> element in its <head> pointing at the getwork web service URL above.

This would allow browsers, plugin tools like Zotero and web crawlers to follow the natural grain of the web and discover the machine readable representation. Admittedly this is something that COinS and unAPI are designed to do. But the COinS and unAPI protocols are really optimized for making citation data and non-web identifiers available and routable via a resolver of some kind. Maybe I’m just overreaching a bit, but this approach of using the <link> header seems to embrace the notion that there are resources within the Worldcat and Librarything websites, and there can be alternate representations of those resources that can be discovered in a hypertext-driven way.
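
Following that grain takes only a few lines of client code. Here’s a minimal sketch with Python’s html.parser; the second link is hypothetical, reusing the getwork URL pattern above, and its type value is just a guess at how such a link might be labelled:

from html.parser import HTMLParser

class AlternateLinkFinder(HTMLParser):
    # Collect <link rel="alternate"> elements from a page
    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "alternate":
            self.alternates.append((attrs.get("type"), attrs.get("href")))

# A feed-style autodiscovery link plus a hypothetical machine readable work link
page = """
<head>
  <link rel="alternate" type="application/atom+xml" href="/feed/" />
  <link rel="alternate" type="application/xml"
        href="http://www.librarything.com/services/rest/1.0/?method=librarything.ck.getwork&amp;id={work_id}&amp;apikey={your_key}" />
</head>
"""

finder = AlternateLinkFinder()
finder.feed(page)
for content_type, href in finder.alternates:
    print(content_type, href)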

Of course there is the issue of the API key. In the example above I used the demo key in LibraryThing’s docs. More important in the context of web identifiers for works is the need to distinguish between the identifier for the record, and the identifier for the concept of the work, which is most elegantly solved (IMHO) by following a pattern from the Cool URIs for the Semantic Web doc. But I think it’s important that people realize that it’s not necessary to jump headlong into RDF to start leveraging some of the principles behind the Architecture of the World Wide Web. Henry Thompson has a nice web-centric discussion of this issue in his What’s a URI and Why Does it Matter?

While writing this blog post I noticed a thread over on Autocat saying that Bowker has been named the US Registrar for the International Standard Text Code. The gist is that the ISTC will be a “global identification system for textual works”, and that registrars (like Bowker) will mint identifiers for works, such as:

ISTC 0A9 2002 12B4A105 7

Where the structure of the identifier is roughly:

ISTC {registration agency} {year element} {work element} {check digit}

It’s interesting that the meat of the ISTC is the work element, which is:

… assigned automatically by the central ISTC registration system after a metadata record has been submitted for registration and the system has verified that the record is unique;

The metadata record in question is actually a chunk of ONIX, which presumably Bowker will send to the ISTC central registrar, and get back a work id.

This work that the ISTC is taking on is really important–and one would imagine quite costly. One thing I would suggest to them is that they may want to make the ISTC codes have a URI equivalent like:

http://istc-international/0A9/2002/12B4A1057

They also should encourage Bowker and other registrars to publish their work identifiers on the web:

http://bowker.com/istc/0A9/2002/12B4A1057
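
Just to make the shape of that suggestion concrete, the mapping is trivial; here’s a sketch (the base URIs are only the illustrative ones above, and the function is mine, not anything ISTC or Bowker have proposed):

def istc_uri(istc, base="http://istc-international"):
    # Turn a displayed ISTC like "ISTC 0A9 2002 12B4A105 7" into a URI,
    # joining the work element and check digit as in the examples above
    parts = istc.split()
    if parts[0] == "ISTC":
        parts = parts[1:]
    agency, year, work, check = parts
    return "%s/%s/%s/%s%s" % (base, agency, year, work, check)

print(istc_uri("ISTC 0A9 2002 12B4A105 7"))
# http://istc-international/0A9/2002/12B4A1057

print(istc_uri("ISTC 0A9 2002 12B4A105 7", base="http://bowker.com/istc"))
# http://bowker.com/istc/0A9/2002/12B4A1057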

It seems to me that we might (in the long term) be better served by a system that embraces the distributed nature of the web. A web in which organizations like Bowker, ISTC, OCLC, LibraryThing, Library of Congress and national libraries publish their work identifiers using URIs, and return meaningful metadata for them. Rather than waiting for someone else to solve our problems top-down, why don’t we start solving them ourselves, bottom-up?

Anyhow I feel like I’m kind of being messy in suggesting this linked-data-lite idea. Is it heresy? My alibi/excuse is that I’ve been sitting in the same room as dchud for extended periods of time.


q & a

Q: What do 100 year old knitting patterns and a lost Robert Louis-Stevenson story have in common?

A: A digitally preserved newspaper page.

Q: What about if you add:

A: Just a typical lunch time conversation at Pete’s with a couple people I work with. The cool thing (for me) is that this is normal, involves a host of smart/interesting characters, and is routinely encouraged. I love my job.


100,000 Books and FRBR

The news about 100,000 books on Freebase got me poking around with curl. I was pleased to see that Freebase actually distinguishes between a book as a work, and a particular edition of that book. To FRBR aficionados this will be familiar as the difference between a Work and a Manifestation.

For example here is a URI for James Joyce’s Dubliners as a work:

http://rdf.freebase.com/ns/en.dubliners

and here is a URI for a 1991 edition of Dubliners:

http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000048ea5b4

If you follow those links in your browser you’ll most likely be redirected to the human readable html view. But machine agents can use the same URL to discover, say, an RDF representation of this edition of Dubliners, for example with curl:

curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000048ea5b4

@prefix fb: <http://rdf.freebase.com/ns/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .

 <http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000048ea5b4> a 
         <http://rdf.freebase.com/ns/book.book_edition>,
         <http://rdf.freebase.com/ns/common.topic>,
         <http://rdf.freebase.com/ns/media_common.creative_work>;
     <http://rdf.freebase.com/ns/book.book_edition.ISBN> "0486268705";
     <http://rdf.freebase.com/ns/book.book_edition.LCCN> "91008517";
     <http://rdf.freebase.com/ns/book.book_edition.author_editor> <http://rdf.freebase.com/ns/en.james_joyce>;
     <http://rdf.freebase.com/ns/book.book_edition.book> <http://rdf.freebase.com/ns/en.dubliners>;
     <http://rdf.freebase.com/ns/book.book_edition.dewey_decimal_number> "823";
     <http://rdf.freebase.com/ns/book.book_edition.number_of_pages> <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000009a3be60>;
     <http://rdf.freebase.com/ns/book.book_edition.publication_date> "1991";
     <http://rdf.freebase.com/ns/type.object.name> "Dubliners";
     <http://rdf.freebase.com/ns/type.object.permission> <http://rdf.freebase.com/ns/boot.all_permission>. 

 <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000009a3be60> a <http://rdf.freebase.com/ns/book.pagination>;
     <http://rdf.freebase.com/ns/book.pagination.numbered_pages> "152"^^<http://www.w3.org/2001/XMLSchema#int>;
     <http://rdf.freebase.com/ns/type.object.permission> <http://rdf.freebase.com/ns/boot.all_permission>. 

There are a few assertions that struck me as interesting:

  • the statement that the resource is in fact an edition (of type http://rdf.freebase.com/ns/book.book_edition)
  • the statement which links the edition with the work (http://rdf.freebase.com/ns/en.dubliners)
  • and the assertion which states the Library of Congress Control Number (LCCN) for the book

I was mostly surprised to see the library-centric metadata being collected such as LCCN, OCLC Number, Dewey Decimal Classification, LC Classification. There are even human readable instructions for how to enter the data (take that AACR2!).

Anyhow it got me wondering what it would be like to stuff all the Freebase book data into a triple store, assert:

<http://rdf.freebase.com/ns/book.book> <http://www.w3.org/2002/07/owl#sameAs> <http://purl.org/vocab/frbr/core#Work> .
<http://rdf.freebase.com/ns/book.book_edition> <http://www.w3.org/2002/07/owl#sameAs> <http://purl.org/vocab/frbr/core#Manifestation> .

and then run some basic inferencing and get some FRBR data. I know, crazy-talk … but it’s interesting in theory (to me at least).
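
If I ever try it, the first step would look something like this with rdflib (just a sketch: it assumes a current rdflib can still negotiate Turtle from the Freebase URL used above, and an OWL reasoner would still be needed to actually materialize the FRBR typing):

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

FB = "http://rdf.freebase.com/ns/"
FRBR = "http://purl.org/vocab/frbr/core#"

g = Graph()

# Load the Turtle for the 1991 edition of Dubliners shown above
g.parse("http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000048ea5b4",
        format="turtle")

# Add the two bridging assertions from above
g.add((URIRef(FB + "book.book"), OWL.sameAs, URIRef(FRBR + "Work")))
g.add((URIRef(FB + "book.book_edition"), OWL.sameAs, URIRef(FRBR + "Manifestation")))

# Hand the graph to an OWL reasoner (not shown here) and, in theory,
# out come FRBR Works and Manifestations.
print(len(g))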


digital-curation

Some folks at LC and CDL are trying to kick-start a new public discussion list for talking about digital curation in its many guises: repositories, tools, standards, techniques, practices, etc. The intuition is that there is a social component to the problems of digital preservation and repository interoperability.

Of course NDIIPP (the arena for the CDL/LC collaboration) has always been about building and strengthening a network of partners. But as Priscilla Caplan points out in her survey of the digital preservation landscape Ten Years After, organizations in Europe like the JISC and NESTOR seem to have understood that there is an educational component to digital preservation as well. Yet even the JISC and NESTOR have tended to focus more on the preservation of scholarly output, whereas digital preservation really extends beyond that realm of materials.

The continual need to share good ideas and hard-won-knowledge about digital curation, and to build a network of colleagues and experts that extends out past the normal project/institution specific boundaries is just as important as building the collections and the technologies themselves.

So I guess this is a rather highfalutin goal … here’s some text stolen from the digital-curation home page to give you more of a flavor:

The digital preservation and repositories domain is fortunate to have a diverse set of institutional and consortial efforts, software projects, and standardization initiatives. Many discussion lists have been created for these individual efforts. The digital-curation discussion list is intended to be a public forum that encourages cross-pollination across these project and institutional boundaries in order to foster wider awareness of project- and institution-specific work and encourage further collaboration.

Topics of conversation can include (but are not limited to):

  • digital repository software (Fedora, DSpace, EPrints, etc.)
  • management of digital formats (JHOVE, djatoka, etc.)
  • use and development of standards (OAIS, OAI-PMH/ORE, MPEG21, METS, BagIt, etc.)
  • issues related to identifiers, packaging, and data transfer
  • best practices and case studies around curation and preservation of digital content
  • repository interoperability
  • conference, workshop, tutorial announcements
  • recent papers
  • job announcements
  • general chit chat about problems, solutions, itches to be scratched
  • humor and fun

We’ll see how it goes. If you are at all interested please sign up.


simplicity

So we have a few bookshelves in our house–one of which is in our kitchen. Only one or two of the shelves in this bookshelf actually house books, most of which are food-stained cookbooks. The rest of the 4 or 5 shelves are given over to photographs, albums, pamphlets from schools, framed pictures, compact discs, pencils, letters, screwdrivers, coins, candles, bills, artwork, crayons–basically the knickknacks and detritus of daily living. We spend a lot of time in the kitchen, so it’s convenient and handy to just stash stuff there.

The only problem is IT DRIVES ME INSANE!

The randomness and perceived messiness of the bookshelf drive me crazy. I look at it and I see chaos, complexity and disorder. I know I have a problem, but that knowledge doesn’t seem to help. I am constantly shuffling things around, grouping things, moving things, throwing things out while more and more things are quietly added. I’d almost prefer the bookshelf to be somewhere out of sight, but then we’d probably use something else in the kitchen.

This morning, on my way to work, I got a call from Kesa asking where two flower petals were that needed to be ironed on to Chloe’s Girl Scouts uniform. They were in the bookshelf at one point. Did I throw them away? I can’t remember; it’s all a blur. I admit that I probably did. I can hear Chloe crying in the background. I feel bad…and resentful about having to keep this bookshelf organized.

Why am I writing here about this? Well mostly it wouldn’t fit within a 140 byte limit. But srsly – I guess I just feel like this bookshelf is a living emblem of my professional life as a software developer at a library. I strive to create software that is simple in its expression, that does one thing and does it well, and which is hopefully easy to maintain by more people than just me. I relish working at an institution that values the preservation of objects and knowledge.

But I threw away the flower decal …

It’s important to remember that real life is complicated, and that the messiness is something to be relished as well. The useful bookshelf, or bag of bits, chunk of json, or half-remembered perl script in someone’s homedir are valuable for their organic resilience. Or as Einstein famously said:

Things should be made as simple as possible, but not simpler.

I’m sorry Chloe.


bagit and .deb

I’m just now (OK I’m slow) marveling at how similar BagIt turned out to be to the Debian Package Format. Given some of the folks involved, this synchronicity isn’t too surprising.

Both .deb and BagIt use a directory ‘data’ for bundling the files in the package (well .deb has it as a compressed file data.tar.gz). Both have md5sum-style checksum files for stating the fixity values of said files. Both have simple rfc2822-style text files for expressing metadata. Both have files that contain the version number of the packaging format. One nice thing that deb has which BagIt intentionally eschewed was a serialization format. But no matter.
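
To make the comparison concrete, here’s a minimal sketch of what laying down a bag by hand looks like in Python (a sketch only, following my reading of the data/ directory, bagit.txt and manifest-md5.txt conventions; it isn’t the LC library mentioned below):

import hashlib
import os
import shutil

def make_bag(src_dir, bag_dir, version="0.96"):
    # Copy the payload into bag_dir/data, like the data directory in a .deb
    data_dir = os.path.join(bag_dir, "data")
    shutil.copytree(src_dir, data_dir)

    # Declaration file holding the packaging format version, like debian-binary
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: %s\n" % version)
        f.write("Tag-File-Character-Encoding: UTF-8\n")

    # md5sum-style fixity manifest, like the md5sums file in a .deb
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        for dirpath, _, filenames in os.walk(data_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as payload:
                    digest = hashlib.md5(payload.read()).hexdigest()
                f.write("%s  %s\n" % (digest, os.path.relpath(path, bag_dir)))

make_bag("my_files", "my_bag")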

At LC we (a.k.a. coding machine Justin Littman) are working on a software library for creating and validating bags, as well as a shiny GUI that’ll sit on top of it to assist in bag creation for people who like shiny things.

It’s an interesting counterpoint to this process of creating BagIt tools to look at how a .deb can be downloaded and inspected. Here’s a sampling of a shell session where I downloaded and extracted the parts of the .deb for python-rdflib.

ed@curry:~/tmp$ aptitude download python-rdflib
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information       
Initializing package states... Done
Building tag database... Done      
Get:1 http://us.archive.ubuntu.com hardy/universe python-rdflib 2.4.0-4 [276kB]
Fetched 276kB in 0s (346kB/s) 

ed@curry:~/tmp$ ar -xv python-rdflib_2.4.0-4_i386.deb 
x - debian-binary
x - control.tar.gz
x - data.tar.gz

ed@curry:~/tmp$ tar xvfz control.tar.gz 
./
./postinst
./prerm
./md5sums
./control

ed@curry:~/tmp$ cat control
Package: python-rdflib
Source: rdflib
Version: 2.4.0-4
Architecture: i386
Maintainer: Ubuntu MOTU Developers 
Original-Maintainer: Nacho Barrientos Arias 
Installed-Size: 1608
Depends: libc6 (>= 2.5-5), python-support (>= 0.3.4), python (<< 2.6), python (>= 2.4), python-setuptools
Provides: python2.4-rdflib, python2.5-rdflib
Section: python
Priority: optional
Description: RDF library containing an RDF triple store and RDF/XML parser/serializer
 RDFLib is a Python library for working with RDF, a simple yet
 powerful language for representing information. The library
 contains an RDF/XML parser/serializer that conforms to the
 RDF/XML Syntax Specification and both in-memory and persistent
 Graph backend.
 .
 This package also provides a serialization format converter
 called rdfpipe in order to deal with the different formats
 RDFLib works with.
 .
  Homepage: http://rdflib.net/

ed@curry:~/tmp$ cat md5sums 
75af966e839159902537614e5815c415  usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/SPARQLParserc.so
a33eb3985c6de5589cb723d03d2caeb1  usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/SPARQLParserc.so
d1b5578dd1d64432684d86bbb816fafc  usr/bin/rdfpipe
0191b561e3efe1ceea7992e2c865949b  usr/share/doc/python-rdflib/changelog.gz
98a861211f3effe1e69d6148c1e31ab2  usr/share/doc/python-rdflib/copyright
d75c2ab05f3a4239963d8765c0e9e7c5  usr/share/doc/python-rdflib/examples/example.py
17b61c23d0600e6ce17471dc7216d3fa  usr/share/doc/python-rdflib/examples/swap_primer.py
3894fa16d075cf0eee1c36e6bcc043d8  usr/share/doc/python-rdflib/changelog.Debian.gz
15653f75f35120b16b1d8115e6b5a179  usr/share/man/man1/rdfpipe.1.gz
405cb531a83fd90356ef5c7113ecd774  usr/share/python-support/python-rdflib/rdflib/sparql/bison/CompositionalEvaluation.py
41e28217ddd2eb394017cd8f12b1dfd5  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Util.py
ec9ae5147463ed551d70947c2824bc82  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Resource.py
6e018a69ca242acb613effe420c2cdc7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/SolutionModifier.py
7e72a08f29abc91faddb85e91f17e87c  usr/share/python-support/python-rdflib/rdflib/sparql/bison/FunctionLibrary.py
648384e5980ef39278466be38572523a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Expression.py
494386730a6edf5c6caf7972ed0bf4ba  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Bindings.py
4513b2fdc116dc9ff02895222a81421d  usr/share/python-support/python-rdflib/rdflib/sparql/bison/IRIRef.py
a800bdac023ae0c02767ab623dffe67b  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Triples.py
6c31647f2b3be724bdfcc35f631162b1  usr/share/python-support/python-rdflib/rdflib/sparql/bison/SPARQLEvaluate.py
c158b3fb8fd66858f598180084f481c4  usr/share/python-support/python-rdflib/rdflib/sparql/bison/GraphPattern.py
bff095caa2db064cc2b1827c4b90a9e7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Processor.py
2db0c4925d17b49f5bb355d7860150c2  usr/share/python-support/python-rdflib/rdflib/sparql/bison/QName.py
10e02ecf896d07c0546b791a450da633  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Query.py
eee29bb22b05b16da2a5e6552044bf22  usr/share/python-support/python-rdflib/rdflib/sparql/bison/__init__.py
a29a508631228f6674e11bb077c24afc  usr/share/python-support/python-rdflib/rdflib/sparql/bison/PreProcessor.py
479a4702ebee35f464055a554ebf5324  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Filter.py
d2fe75aa4394ec7d9106a1e02bb3015a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Operators.py
da186350e65c8e062887724b1758ef80  usr/share/python-support/python-rdflib/rdflib/sparql/Query.py
0130de0f5d28087d7c841e36d89714c4  usr/share/python-support/python-rdflib/rdflib/sparql/graphPattern.py
826ffe4c6b3f59a9635524f0746299fe  usr/share/python-support/python-rdflib/rdflib/sparql/sparqlOperators.py
...

ed@curry:~/tmp$ tar xvfz data.tar.gz 
./
./usr/
./usr/lib/
./usr/lib/python-support/
./usr/lib/python-support/python-rdflib/
./usr/lib/python-support/python-rdflib/python2.5/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/SPARQLParserc.so
./usr/lib/python-support/python-rdflib/python2.4/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/SPARQLParserc.so
./usr/bin/
./usr/bin/rdfpipe
./usr/share/
./usr/share/doc/
./usr/share/doc/python-rdflib/
./usr/share/doc/python-rdflib/changelog.gz
./usr/share/doc/python-rdflib/copyright
./usr/share/doc/python-rdflib/examples/
./usr/share/doc/python-rdflib/examples/example.py
./usr/share/doc/python-rdflib/examples/swap_primer.py
./usr/share/doc/python-rdflib/changelog.Debian.gz
./usr/share/man/
./usr/share/man/man1/
./usr/share/man/man1/rdfpipe.1.gz
./usr/share/python-support/
./usr/share/python-support/python-rdflib/
./usr/share/python-support/python-rdflib/rdflib/
./usr/share/python-support/python-rdflib/rdflib/sparql/
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/CompositionalEvaluation.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Util.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Resource.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/SolutionModifier.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/FunctionLibrary.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Expression.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Bindings.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/IRIRef.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Triples.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/SPARQLEvaluate.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/GraphPattern.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Processor.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/QName.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Query.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/__init__.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/PreProcessor.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Filter.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Operators.py
./usr/share/python-support/python-rdflib/rdflib/sparql/Query.py
./usr/share/python-support/python-rdflib/rdflib/sparql/graphPattern.py
./usr/share/python-support/python-rdflib/rdflib/sparql/sparqlOperators.py
...

Here are some more useful notes on the structure of .deb files and how to create them. If you are interested in trying out the nascent-alpha BagIt tools give me a holler (ehs at pobox dot com) or just add a comment here…


bibliovirus

Terry’s analysis of the proposed changes to OCLC’s record policy is essential reading. I’m really concerned that these 996 fields will slip somewhat unnoticed into data that I use.

996 $aOCLCWCRUP $iUse and transfer of this record is governed by the OCLC® Policy for Use and Transfer of WorldCat® Records. $uhttp://purl.org/oclc/wcrup

This appears to be an engineered, legal virus for our bibliographic ecosystems. I’m not a lawyer, so I can’t fully determine the significance of these legal terms…mostly because there isn’t a policy at the end of that PURL right now. There’s a FAQ full of ominous references to “the Policy”, and a glossy, feel-good overview, but the policy itself is empty at the moment. So the precise nature of the virus is so far unknown…or am I wrong?

At any rate, I think libraries need to be careful about letting these 996 fields creep into their data–especially data that they create. I wonder: are there other examples of legalese that have slipped into MARC data over the years?

Update 2008-11-03: it appears that “the Policy” was removed sometime Sunday evening? Perhaps it’s best not to jump to conclusions, eh? But that image of the virus is too cool, and I needed an excuse to post it on my blog.

Update 2008-11-07: check out Terry’s re-analysis of “the Policy” when a new version was brought back online by OCLC.