Archive for the ‘repositories’ Category

bagit and .deb

Wednesday, November 5th, 2008

I’m just now (OK I’m slow) marveling at how similar BagIt turned out to be to the Debian Package Format. Given some of the folks involved, this synchronicity isn’t too surprising.

Both .deb and BagIt use a directory ‘data’ for bundling the files in the package (well .deb has it as a compressed file data.tar.gz). Both have md5sum-style checksum files for stating the fixity values of said files. Both have simple rfc2822-style text files for expressing metadata. Both have files that contain the version number of the packaging format. One nice thing that deb has which BagIt intentionally eschewed was a serialization format. But no matter.

At LC we (a.k.a. coding machine Justin Littman) are working on a software library for creating and validating bags, as well as a shiny GUI that’ll sit on top of it to assist in bag creation for people who like shiny things.

It’s an interesting counterpoint to this process of creating BagIt tools to look at how a .deb can be downloaded and inspected. Here’s a sampling of a shell session where I downloaded and extracted the parts of the .deb for python-rdflib.

ed@curry:~/tmp$ aptitude download python-rdflib
Reading package lists... Done
Building dependency tree
Reading state information... Done
Reading extended state information
Initializing package states... Done
Building tag database... Done
Get:1 http://us.archive.ubuntu.com hardy/universe python-rdflib 2.4.0-4 [276kB]
Fetched 276kB in 0s (346kB/s) 

ed@curry:~/tmp$ ar -xv python-rdflib_2.4.0-4_i386.deb
x - debian-binary
x - control.tar.gz
x - data.tar.gz

ed@curry:~/tmp$ tar xvfz control.tar.gz
./
./postinst
./prerm
./md5sums
./control

ed@curry:~/tmp$ cat control
Package: python-rdflib
Source: rdflib
Version: 2.4.0-4
Architecture: i386
Maintainer: Ubuntu MOTU Developers 
Original-Maintainer: Nacho Barrientos Arias 
Installed-Size: 1608
Depends: libc6 (>= 2.5-5), python-support (>= 0.3.4), python (<< 2.6), python (>= 2.4), python-setuptools
Provides: python2.4-rdflib, python2.5-rdflib
Section: python
Priority: optional
Description: RDF library containing an RDF triple store and RDF/XML parser/serializer
 RDFLib is a Python library for working with RDF, a simple yet
 powerful language for representing information. The library
 contains an RDF/XML parser/serializer that conforms to the
 RDF/XML Syntax Specification and both in-memory and persistent
 Graph backend.
 .
 This package also provides a serialization format converter
 called rdfpipe in order to deal with the different formats
 RDFLib works with.
 .
  Homepage: http://rdflib.net/

ed@curry:~/tmp$ cat md5sums
75af966e839159902537614e5815c415  usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/SPARQLParserc.so
a33eb3985c6de5589cb723d03d2caeb1  usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/SPARQLParserc.so
d1b5578dd1d64432684d86bbb816fafc  usr/bin/rdfpipe
0191b561e3efe1ceea7992e2c865949b  usr/share/doc/python-rdflib/changelog.gz
98a861211f3effe1e69d6148c1e31ab2  usr/share/doc/python-rdflib/copyright
d75c2ab05f3a4239963d8765c0e9e7c5  usr/share/doc/python-rdflib/examples/example.py
17b61c23d0600e6ce17471dc7216d3fa  usr/share/doc/python-rdflib/examples/swap_primer.py
3894fa16d075cf0eee1c36e6bcc043d8  usr/share/doc/python-rdflib/changelog.Debian.gz
15653f75f35120b16b1d8115e6b5a179  usr/share/man/man1/rdfpipe.1.gz
405cb531a83fd90356ef5c7113ecd774  usr/share/python-support/python-rdflib/rdflib/sparql/bison/CompositionalEvaluation.py
41e28217ddd2eb394017cd8f12b1dfd5  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Util.py
ec9ae5147463ed551d70947c2824bc82  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Resource.py
6e018a69ca242acb613effe420c2cdc7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/SolutionModifier.py
7e72a08f29abc91faddb85e91f17e87c  usr/share/python-support/python-rdflib/rdflib/sparql/bison/FunctionLibrary.py
648384e5980ef39278466be38572523a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Expression.py
494386730a6edf5c6caf7972ed0bf4ba  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Bindings.py
4513b2fdc116dc9ff02895222a81421d  usr/share/python-support/python-rdflib/rdflib/sparql/bison/IRIRef.py
a800bdac023ae0c02767ab623dffe67b  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Triples.py
6c31647f2b3be724bdfcc35f631162b1  usr/share/python-support/python-rdflib/rdflib/sparql/bison/SPARQLEvaluate.py
c158b3fb8fd66858f598180084f481c4  usr/share/python-support/python-rdflib/rdflib/sparql/bison/GraphPattern.py
bff095caa2db064cc2b1827c4b90a9e7  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Processor.py
2db0c4925d17b49f5bb355d7860150c2  usr/share/python-support/python-rdflib/rdflib/sparql/bison/QName.py
10e02ecf896d07c0546b791a450da633  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Query.py
eee29bb22b05b16da2a5e6552044bf22  usr/share/python-support/python-rdflib/rdflib/sparql/bison/__init__.py
a29a508631228f6674e11bb077c24afc  usr/share/python-support/python-rdflib/rdflib/sparql/bison/PreProcessor.py
479a4702ebee35f464055a554ebf5324  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Filter.py
d2fe75aa4394ec7d9106a1e02bb3015a  usr/share/python-support/python-rdflib/rdflib/sparql/bison/Operators.py
da186350e65c8e062887724b1758ef80  usr/share/python-support/python-rdflib/rdflib/sparql/Query.py
0130de0f5d28087d7c841e36d89714c4  usr/share/python-support/python-rdflib/rdflib/sparql/graphPattern.py
826ffe4c6b3f59a9635524f0746299fe  usr/share/python-support/python-rdflib/rdflib/sparql/sparqlOperators.py
...

ed@curry:~/tmp$ tar xvfz data.tar.gz
./
./usr/
./usr/lib/
./usr/lib/python-support/
./usr/lib/python-support/python-rdflib/
./usr/lib/python-support/python-rdflib/python2.5/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/
./usr/lib/python-support/python-rdflib/python2.5/rdflib/sparql/bison/SPARQLParserc.so
./usr/lib/python-support/python-rdflib/python2.4/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/
./usr/lib/python-support/python-rdflib/python2.4/rdflib/sparql/bison/SPARQLParserc.so
./usr/bin/
./usr/bin/rdfpipe
./usr/share/
./usr/share/doc/
./usr/share/doc/python-rdflib/
./usr/share/doc/python-rdflib/changelog.gz
./usr/share/doc/python-rdflib/copyright
./usr/share/doc/python-rdflib/examples/
./usr/share/doc/python-rdflib/examples/example.py
./usr/share/doc/python-rdflib/examples/swap_primer.py
./usr/share/doc/python-rdflib/changelog.Debian.gz
./usr/share/man/
./usr/share/man/man1/
./usr/share/man/man1/rdfpipe.1.gz
./usr/share/python-support/
./usr/share/python-support/python-rdflib/
./usr/share/python-support/python-rdflib/rdflib/
./usr/share/python-support/python-rdflib/rdflib/sparql/
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/CompositionalEvaluation.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Util.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Resource.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/SolutionModifier.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/FunctionLibrary.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Expression.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Bindings.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/IRIRef.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Triples.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/SPARQLEvaluate.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/GraphPattern.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Processor.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/QName.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Query.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/__init__.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/PreProcessor.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Filter.py
./usr/share/python-support/python-rdflib/rdflib/sparql/bison/Operators.py
./usr/share/python-support/python-rdflib/rdflib/sparql/Query.py
./usr/share/python-support/python-rdflib/rdflib/sparql/graphPattern.py
./usr/share/python-support/python-rdflib/rdflib/sparql/sparqlOperators.py
...
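
Since the md5sums file is just "checksum, whitespace, relative path" per line, checking the extracted files against it takes very little code. Here is a rough Python sketch, not anything dpkg or the BagIt tools actually ship: the function names (parse_md5sums, md5_of, verify) are made up for illustration, and it assumes control.tar.gz and data.tar.gz were both extracted into the same directory as in the session above.

```python
import hashlib
from pathlib import Path


def parse_md5sums(text):
    """Parse md5sum-style lines ("<hex digest>  <relative path>") into a dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or truncated lines rather than guess
        digest, path = parts
        # md5sum marks binary-mode entries with a leading "*" on the path
        sums[path.lstrip("*")] = digest.lower()
    return sums


def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(root, md5sums_text):
    """Return a list of (path, problem) for files that are missing or differ."""
    problems = []
    for rel, expected in parse_md5sums(md5sums_text).items():
        target = Path(root) / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif md5_of(target) != expected:
            problems.append((rel, "checksum mismatch"))
    return problems
```

Something like verify(".", open("md5sums").read()) should come back with an empty list if the extraction went fine. A BagIt md5 manifest has the same line shape, so the same loop ought to work on a bag with the bag directory as root.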

Here are some more useful notes on the structure of .deb files and how to create them. If you are interested in trying out the nascent-alpha BagIt tools give me a holler (ehs at pobox dot com) or just add a comment here…

resource maps and site maps

Friday, August 1st, 2008

Andy reminds me that a relatively simple idea (I think it was David’s at RepoCamp) for the OAI-ORE Challenge would be to create a tool that transformed OAI-ORE resource maps expressed as Atom into Google Site Maps. This would allow “repositories” that exposed their “objects” as resource maps, to easily be crawled by Google and others.
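
To get a feel for how small such a tool could be, here is a minimal Python sketch. It assumes the resource map is a single Atom entry in which each aggregated resource appears as an atom:link whose rel is the ORE "aggregates" URI, and it only emits the bare loc element for each URL. The function names (aggregated_urls, sitemap_for) are invented for this example, and nothing here handles paging, lastmod, or feeds of many resource maps.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
SITEMAP = "http://www.sitemaps.org/schemas/sitemap/0.9"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def aggregated_urls(atom_xml):
    """Collect the href of every atom:link marked as an ORE aggregated resource."""
    root = ET.fromstring(atom_xml)
    urls = []
    for link in root.iter("{%s}link" % ATOM):
        if link.get("rel") == ORE_AGGREGATES and link.get("href"):
            urls.append(link.get("href"))
    return urls


def sitemap_for(urls):
    """Serialize a list of URLs as a minimal sitemap <urlset> document."""
    # Register the sitemap namespace as the default so output has no prefixes.
    # Note this is a process-wide setting in ElementTree.
    ET.register_namespace("", SITEMAP)
    urlset = ET.Element("{%s}urlset" % SITEMAP)
    for url in urls:
        entry = ET.SubElement(urlset, "{%s}url" % SITEMAP)
        ET.SubElement(entry, "{%s}loc" % SITEMAP).text = url
    return ET.tostring(urlset, encoding="unicode")
```

Calling sitemap_for(aggregated_urls(xml_text)) on a resource map would then give a site map listing just the aggregated resources.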

It would also be useful to demonstrate what value-add OAI-ORE resource maps give you: to answer the question of why not just generate the site map and be done with it. I think there definitely are advantages, such as being able to identify compound objects or aggregations of web resources, and then make assertions about them (a.k.a. attach metadata to them).

RepoCamp recap

Monday, July 28th, 2008

So RepoCamp was a lot of fun. The goal was to discuss repository interoperability–and at the very least repository practitioners got to interoperate, and have a few beers afterwards. Hats off to David Flanders who clearly has got running these events down to a fine art.

I finally got to meet Ben O’Steen after bantering with him on #code4lib and #talis … and also got to chat with Jim Downing (Cambridge Univ) about SWORD stuff, and Stephan Drescher (Los Alamos National Lab) about validating OAI-ORE.

Stephan and I had a varied and wide ranging discussion about the web in general, which was a lot of fun. I really dug his metaphor of the web as an aquatic ecosystem, with interdependent organisms and shared environments. It reminded me a bit of how shocked I was to discover how rich and varied the ecosystem is around a “simple” service like twitter. If I ever return to school it will be to study something along the lines of web science.

It was also interesting to hear that other people saw a parallel between OAI-ORE Resource Maps and BagIt’s fetch.txt. The parallel being that both resource maps and bags are aggregations of web resources. Of course bags can also just be files on disk; it’s only when fetch.txt is present in the bag that the package is made up of web resources. It would be interesting to see what vocabularies are available for expressing fixity information (md5 checksums and the like), and if they could be layered into the resource map Atom serialization. Perhaps PREMIS v2.0? It might be fun to code up what a simple OAI-ORE resource map harvester that checked fixity values would look like, using LC’s existing BagIt parallelretriever.py as a starting point. God I wish I could just hyperlink to that :-(
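
As a rough idea of the bag side of that, here is a Python sketch that walks fetch.txt-style lines (URL, length, filename, with "-" for an unknown length) and checks each retrieved body against an md5 manifest. This is not parallelretriever.py, which isn't shown here. All the names are hypothetical, and the retrieve argument is simply any callable that turns a URL into bytes, so the checking logic can be tried without touching the network.

```python
import hashlib


def parse_fetch(text):
    """Parse fetch.txt-style lines into (url, length_or_None, filename) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        url, length, filename = line.split(None, 2)
        entries.append((url, None if length == "-" else int(length), filename))
    return entries


def parse_manifest(text):
    """Parse "<md5 hex digest> <filename>" manifest lines into a dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            sums[parts[1]] = parts[0].lower()
    return sums


def check_fetched(fetch_text, manifest_text, retrieve):
    """Fetch each entry via retrieve(url) -> bytes and report fixity problems."""
    expected = parse_manifest(manifest_text)
    problems = []
    for url, length, filename in parse_fetch(fetch_text):
        body = retrieve(url)
        if length is not None and len(body) != length:
            problems.append((filename, "length mismatch"))
        elif filename not in expected:
            problems.append((filename, "not in manifest"))
        elif hashlib.md5(body).hexdigest() != expected[filename]:
            problems.append((filename, "checksum mismatch"))
    return problems
```

For real use, retrieve could be something like lambda url: urllib.request.urlopen(url).read(). A resource map harvester would presumably swap parse_fetch for something that pulls aggregated URLs out of the Atom, provided the map carried checksums somewhere to compare against.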

At any rate, I now need to investigate OAuth because Jim thinks it fits really nicely with AtomPub and SWORD in particular. And if it’s good enough for Google it’s probably worth checking out. Jim also said that there is a possibility that SWORD 2.0 might take shape as an IETF RFC, which would be good to see.

Thanks to all who made it happen, and to all of you who traveled long distances to join us at the Library of Congress.