Some folks at LC and CDL are trying to kick-start a new public discussion list for talking about digital curation in its many guises: repositories, tools, standards, techniques, practices, etc. The intuition is that there is a social component to the problems of digital preservation and repository interoperability.
Of course NDIIPP (the arena for the CDL/LC collaboration) has always been about building and strengthening a network of partners. But as Priscilla Caplan points out in her survey of the digital preservation landscape Ten Years After, organizations in Europe like the JISC and NESTOR seem to have understood that there is an educational component to digital preservation as well. Yet even the JISC and NESTOR have tended to focus more on the preservation of scholarly output, whereas digital preservation really extends beyond that realm of materials.
The continual need to share good ideas and hard-won knowledge about digital curation, and to build a network of colleagues and experts that extends past the usual project- and institution-specific boundaries, is just as important as building the collections and the technologies themselves.
So I guess this is a rather highfalutin goal … here’s some text stolen from the digital-curation home page to give you more of a flavor:
The digital preservation and repositories domain is fortunate to have a diverse set of institutional and consortial efforts, software projects, and standardization initiatives. Many discussion lists have been created for these individual efforts. The digital-curation discussion list is intended to be a public forum that encourages cross-pollination across these project and institutional boundaries in order to foster wider awareness of project- and institution-specific work and encourage further collaboration.
Topics of conversation can include (but are not limited to)
digital repository software (Fedora, DSpace, EPrints, etc.)
management of digital formats (JHOVE, djatoka, etc.)
use and development of standards (OAIS, OAI-PMH/ORE, MPEG21, METS, BagIt, etc.)
issues related to identifiers, packaging, and data transfer
best practices and case studies around curation and preservation of digital content
conference, workshop, tutorial announcements
general chit chat about problems, solutions, itches to be scratched
humor and fun
We’ll see how it goes. If you are at all interested please sign up.
So we have a few bookshelves in our house–one of which is in our kitchen. Only one or two of the shelves in this bookshelf actually house books, most of which are food-stained cookbooks. The rest of the 4 or 5 shelves are given over to photographs, albums, pamphlets from schools, framed pictures, compact discs, pencils, letters, screwdrivers, coins, candles, bills, artwork, crayons–basically the knickknacks and detritus of daily living. We spend a lot of time in the kitchen, so it’s convenient and handy to just stash stuff there.
The only problem is IT DRIVES ME INSANE!
The randomness and perceived messiness of the bookshelf drives me crazy. I look at it and I see chaos, complexity and disorder. I know I have a problem, but that knowledge doesn’t seem to help. I am constantly shuffling things around, grouping things, moving things, throwing things out while more and more things are quietly added. I’d almost prefer the bookshelf to be somewhere out of sight, but then we’d probably use something else in the kitchen.
This morning, on my way to work, I got a call from Kesa asking where two flower petals were that needed to be ironed on to Chloe’s Girl Scouts uniform. They were in the bookshelf at one point. Did I throw them away? I can’t remember; it’s all a blur. I admit that I probably did. I can hear Chloe crying in the background. I feel bad … and resentful about having to keep this bookshelf organized.
Why am I writing here about this? Well mostly it wouldn’t fit within a 140 byte limit. But srsly – I guess I just feel like this bookshelf is a living emblem of my professional life as a software developer at a library. I strive to create software that is simple in its expression, that does one thing and does it well, and which is hopefully easy to maintain by more people than just me. I relish working at an institution that values the preservation of objects and knowledge.
But I threw away the flower decal …
It’s important to remember that real life is complicated, and that the messiness is something to be relished as well. The useful bookshelf, bag of bits, chunk of JSON, or half-remembered Perl script in someone’s homedir is valuable for its organic resilience. Or as Einstein reputedly said:
Things should be made as simple as possible, but not simpler.
I’m sorry Chloe.
I’m just now (OK I’m slow) marveling at how similar BagIt turned out to be to the Debian Package Format. Given some of the folks involved, this synchronicity isn’t too surprising.
Both .deb and BagIt use a directory ‘data’ for bundling the files in the package (well .deb has it as a compressed file data.tar.gz). Both have md5sum-style checksum files for stating the fixity values of said files. Both have simple rfc2822-style text files for expressing metadata. Both have files that contain the version number of the packaging format. One nice thing that deb has which BagIt intentionally eschewed was a serialization format. But no matter.
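The parallel is easy to see on disk. Here’s a hand-rolled sketch of a minimal bag (the file names come from the BagIt spec; the payload file is made up):

```shell
# create the payload directory and a sample file
mkdir -p mybag/data
echo "hello" > mybag/data/hello.txt

# the bag declaration with the format version, the analog of .deb's debian-binary file
printf 'BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n' > mybag/bagit.txt

# an md5sum-style manifest over the payload, like .deb's md5sums
(cd mybag && md5sum data/hello.txt > manifest-md5.txt)
cat mybag/manifest-md5.txt
```

The fixity entries in manifest-md5.txt are what a validator checks later on.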
At LC we (a.k.a. coding machine Justin Littman) are working on a software library for creating and validating bags, as well as a shiny GUI that’ll sit on top of it to assist in bag creation for people who like shiny things.
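The validation step is conceptually tiny. This isn’t the LC library itself, just a sketch of the core fixity check any bag validator has to perform:

```python
import hashlib
import os

def validate_bag(bag_dir):
    """Compare each checksum in manifest-md5.txt against the payload files.

    Returns a list of payload paths whose fixity check failed.
    """
    errors = []
    with open(os.path.join(bag_dir, "manifest-md5.txt")) as manifest:
        for line in manifest:
            expected, path = line.split(None, 1)
            path = path.strip()
            with open(os.path.join(bag_dir, path), "rb") as payload:
                actual = hashlib.md5(payload.read()).hexdigest()
            if actual != expected:
                errors.append(path)
    return errors
```

An empty list back means the bag is intact, at least fixity-wise; the real library also has to worry about incomplete bags, extra files, and so on.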
It’s an interesting counterpoint to this process of creating BagIt tools to look at how a .deb can be downloaded and inspected. Here’s a sampling of a shell session where I downloaded and extracted the parts of the .deb for python-rdflib.
ed@curry:~/tmp$ aptitude download python-rdflib
Reading package lists... Done
Building dependency tree
Reading state information... Done
Reading extended state information
Initializing package states... Done
Building tag database... Done
Get:1 http://us.archive.ubuntu.com hardy/universe python-rdflib 2.4.0-4 [276kB]
Fetched 276kB in 0s (346kB/s)
ed@curry:~/tmp$ ar -xv python-rdflib_2.4.0-4_i386.deb
x - debian-binary
x - control.tar.gz
x - data.tar.gz
ed@curry:~/tmp$ tar xvfz control.tar.gz
ed@curry:~/tmp$ cat control
Maintainer: Ubuntu MOTU Developers
Original-Maintainer: Nacho Barrientos Arias
Depends: libc6 (>= 2.5-5), python-support (>= 0.3.4), python (<< 2.6), python (>= 2.4), python-setuptools
Provides: python2.4-rdflib, python2.5-rdflib
Description: RDF library containing an RDF triple store and RDF/XML parser/serializer
RDFLib is a Python library for working with RDF, a simple yet
powerful language for representing information. The library
contains an RDF/XML parser/serializer that conforms to the
RDF/XML Syntax Specification and both in-memory and persistent
This package also provides a serialization format converter
called rdfpipe in order to deal with the different formats
RDFLib works with.
ed@curry:~/tmp$ cat md5sums
ed@curry:~/tmp$ tar xvfz data.tar.gz
Here are some more useful notes on the structure of .deb files and how to create them. If you are interested in trying out the nascent-alpha BagIt tools give me a holler (ehs at pobox dot com) or just add a comment here…
Terry’s analysis of the proposed changes to OCLC’s record policy is essential reading. I’m really concerned that these 996 fields will slip somewhat unnoticed into data that I use.
996 $aOCLCWCRUP $iUse and transfer of this record is governed by the OCLC® Policy for Use and Transfer of WorldCat® Records. $uhttp://purl.org/oclc/wcrup
This appears to be an engineered, legal virus for our bibliographic ecosystems. I’m not a lawyer, so I can’t fully determine the significance of these legal terms…mostly because there isn’t a policy at the end of that PURL right now. There’s a FAQ full of ominous references to “the Policy”, and a glossy, feel-good overview, but the policy itself is empty at the moment. So the precise nature of the virus is so far unknown…or am I wrong?
At any rate, I think libraries need to be careful about letting these 996 fields creep into their data–especially data that they create. I wonder are there other examples of legalese that have slipped into MARC data over the years?
Update 2008-11-03: it appears that “the Policy” was removed sometime Sunday evening? Perhaps it’s best not to jump to conclusions, eh? But that image of the virus is too cool, and I needed an excuse to post it on my blog.
Update 2008-11-07: check out Terry’s re-analysis of “the Policy” when a new version was brought back online by OCLC.
Ok, this is pretty big news for linked data folks, and for semweb-heads in general. Freebase is now a linked-data target. This is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs.
The web2.0-meets-semweb space is also being explored by folks like Talis. It’ll be interesting to see how this plays out–particularly in light of SPARQL adoption, which I remain kind of neutral about for some undefined, wary, spooky reason. I get the idea of web resources having data views. It seems like a logical “one small step for a web agent, one giant leap for the web”. But queryability with SPARQL sounds like something to push off, particularly if you’ve already got a search API that could be hooked up to the data views.
At any rate, what this announcement means is that you can get machine readable data back from freebase using a URI. The descriptions then use more URIs, which you can then follow-your-nose to, and get more machine readable data. So if you are on a page like:
you can construct a URL for Tim Berners-Lee like this:
Then you resolve that URL asking for application/turtle (you could ask for application/rdf+xml, but I find the turtle more readable).
curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.tim_berners-lee
And you’ll get back a description like this. There’s a lot of useful data there, but the interesting part for me is the follow-your-nose effect where you can see an assertion like:
And you can then go look up Ted Nelson using that URI:
curl --location --header "Accept: application/turtle" http://rdf.freebase.com/ns/en.ted_nelson
And get another chunk of data which includes this assertion:
And you can then continue following your nose to:
Lather, rinse, repeat.
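In code, the nose-following loop is just URI construction plus content negotiation. A minimal sketch (the helper names are mine, and actually fetching anything of course assumes the Freebase service is reachable):

```python
from urllib.request import Request, urlopen

FREEBASE_NS = "http://rdf.freebase.com/ns/"

def freebase_uri(topic_id):
    """Build the linked data URI for a Freebase topic id like 'en.ted_nelson'."""
    return FREEBASE_NS + topic_id

def fetch_turtle(uri):
    """Resolve a URI asking for turtle via the Accept header (needs network)."""
    req = Request(uri, headers={"Accept": "application/turtle"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(freebase_uri("en.ted_nelson"))  # http://rdf.freebase.com/ns/en.ted_nelson
```

A crawler is then just a matter of parsing the returned turtle for more freebase URIs and feeding them back through fetch_turtle.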
So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed for agents to be able to crawl the web of documents and build indexes of the data to allow people to find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.
Oh, there are a few little rough spots which probably should be ironed out … but when is that ever not the case eh? Inspiring stuff.
I spent half an hour goofing around with the new (to me) SemanticProxy service from Calais. You give the service a URL along with your API key, and it’ll go pull down the content and then give you back some HTML or RDF/XML. The call is pretty simple, it’s just a GET:
Here’s an example of some turtle you can get for my friend Dan’s blog. Obviously there’s a lot of data there, but I wanted to see exactly what entities are being recognized, and their labels. It doesn’t take long to notice that most of the resource types are in the namespace:
And most of these resources have a property which seems to assign a literal string label to the resource:
It’s kind of a bummer that these vocabulary terms don’t resolve, because it would be sweet to get a bigger picture look at their vocabulary.
At any rate, with these two little facts gleaned from looking at the RDF for a few moments I wrote a little script (using rdflib) which you feed a URL and it’ll munge through the RDF and print out the recognized entities:
ed@curry:~/bzr/calais$ ./entities.py http://onebiglibrary.net
a Company named Lehman Bros.
a Company named Southwest Airlines
a Company named Costco
a Company named Everbank
a Holiday named New Year's Day
a ProvinceOrState named Illinois
a ProvinceOrState named Arizona
a ProvinceOrState named Michigan
a IndustryTerm named media ownership rules
a IndustryTerm named unreliable technologies
a IndustryTerm named bank
a IndustryTerm named health care insurance
a IndustryTerm named bank panics
a IndustryTerm named free software
a City named Lansing
a Facility named Big Library
a Person named Ralph Nader
a Person named Dan Chudnov
a Person named Shouldn't Bob Barr
a Person named John Mayer
a Person named Daniel Chudnov
a Person named Cynthia McKinney
a Person named Bob Barr
a Person named John Legend
a Country named Iraq
a Country named United States
a Country named Afghanistan
a Organization named FDIC
a Organization named senate
a Currency named USD
Quite easy and impressive IMHO. One thing that is missing from this output is the URIs that identify the various recognized resources, like Dan’s:
Like the vocabulary URIs, it doesn’t resolve (at least outside the Reuters media empire). Sure would be nice if it did. It’s got the fact that it’s a person cooked into it (pershash) … but otherwise it seems to be just a simple hashing algorithm applied to the string “Dan Chudnov”.
I didn’t actually spend any time looking at the licensing issues around using the service. I’ve heard they are somewhat stultifying and vague, which is to be expected I guess. The news about Reuters and Zotero isn’t exactly encouraging … but it is interesting to see how good some of the NLP analysis is getting at institutions like Reuters. It would be lovely to get a backend look at how this technology is actually being used internally at Reuters.
If you want to take this entities.py for a spin and can’t be bothered to download it, just drop into #code4lib and ask #zoia for entities:
14:45 < edsu> @entities http://www.frbr.org/2008/10/21/xid-updates-at-oclc
14:45 < zoia> edsu: 'ok I found: a Facility Library of Congress, a Company FRBR
Review Group, a City York, a EmailAddress firstname.lastname@example.org, a Person
Jenn Riley, a Person Robert Maxwell, a Person Arlene Taylor, a
Person William Denton, a Person Barbara Tillett, a Organization
Congress, a Organization Open Content Alliance, a Organization
In Python, JSON is faster, smaller and more portable than pickle …
At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a django app, and the db schema currently looks something like:
Anyhow, if you look at the schema you’ll notice that we have a Page model, and attached to that is an OCR model. If you haven’t heard of it before, OCR is an acronym for optical character recognition. For each newspaper page we have a TIFF image of the original page, and we have rectangle coordinates for the position of every word on the page. Basically it’s XML that looks something like this (warning: your browser may choke on this, you might want to right-click-download).
So there are roughly 2,500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page … and we’re looking to model 1,000,000 pages soon. So if we got really prissy with normalization we could soon be looking at (worst case) 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don’t need such fine-grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.
So let me get to the interesting part already. To avoid having to think about databases with billions of rows, I radically denormalized the data and stored the word coordinates as a blob of JSON in the database. So we just have a word_coordinates_json column in the OCR table, and when we need to look up the coordinates for a page we just load up the JSON dictionary and we’re good to go. JSON plays nice with Django, since Django’s ORM doesn’t seem to support storing blobs in the database, and JSON is just text. This worked just fine on single page views, but we also do hit highlighting on search results pages where 10 pages are being viewed at the same time. So we started noticing large lags on these page views–because it was taking a while to load the JSON (sometimes 10 × 327K of JSON).
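Concretely, the denormalized column just holds a word-to-boxes dictionary, so a highlight lookup is one deserialize plus one dict access (the contents here are made up for illustration):

```python
import json

# an illustrative stand-in for the word_coordinates_json column:
# each word maps to a list of [x, y, width, height] boxes on the page
word_coordinates_json = json.dumps({
    "liberty": [[100, 250, 42, 12], [300, 900, 40, 12]],
    "press": [[120, 500, 30, 12]],
})

# highlighting a search hit is then just:
coords = json.loads(word_coordinates_json)
print(coords["liberty"])  # every box where "liberty" appears on the page
```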
As I mentioned we’re using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed slowdowns I decided to compare django.utils.simplejson to the latest simplejson and python-cjson. And just for grins I figured it couldn’t hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON and a 507K pickle file 100 times using each technique. Here are the results:
Yeah, that’s right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than even cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it’ll take 10 * 0.033183 = 0.3 seconds to parse all the JSON instead of the totally unacceptable 10 * 0.976193 = 9.7 seconds. I guess in some circles 0.3 seconds might be unacceptable; we’ll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser w/ AJAX. If you want, please try out my benchmarks yourself on your own platform. I’d be curious if you see the same ranking.
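The benchmark itself is nothing fancy. A simplified, modernized sketch of its shape, using the stdlib json and pickle modules on synthetic coordinate data (the original compared django.utils.simplejson, simplejson, python-cjson, pickle and cPickle):

```python
import json
import pickle
import timeit

# synthetic word-coordinate data, shaped roughly like the OCR payload
data = {"word%d" % i: [[10, 20, 30, 40]] * 5 for i in range(1000)}
json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data, protocol=2)

# time 100 deserializations of each blob, as in the original benchmark
json_time = timeit.timeit(lambda: json.loads(json_blob), number=100)
pickle_time = timeit.timeit(lambda: pickle.loads(pickle_blob), number=100)

print("json:   %.6f seconds" % json_time)
print("pickle: %.6f seconds" % pickle_time)
```

Which one wins will depend on your Python version and whether C accelerators are compiled in, which is exactly why it’s worth running on your own platform.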
Here are the versions for various bits I used:
django trunk: r9231 2008-10-13 15:38:18 -0400
So in summary for Pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats, in that you can only store the simple datatypes that JSON allows, not full-fledged Python objects. But in my use case JSON’s data types were just fine. Makes me that much happier that simplejson, a.k.a. json, is now cooked into the Python 2.6 standard library.
Note: if you aren’t seeing simplejson performing better than cPickle you may need to have python development libraries installed:
aptitude install python-dev # or the equivalent for your system
You can verify if the optimizations are available in simplejson by:
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simplejson
>>> simplejson._speedups
<module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/_speedups.so'>
Thanks eby, mjgiarlo, BenO and Kapil for their pointers and ideas.
If you are curious how lcsh.info is being used I’ve made the apache server logs available, including the ones for the sparql service. I’ve been meaning to do some analysis of the logs but haven’t got the time yet. You’ll notice that among the data that’s collected is the Accept header sent by agents, since it’s so important to what representation is served up. Thanks to danbri for the idea to simply make them available.
I just learned from Tom Heath that The International Semantic Web Conference is coming to Washington DC next year. This is pretty cool news to me, since traveling to conferences isn’t always the easiest thing to navigate. Also, Tom suggested that it might be fun to organize a VoCamp around the conference, to provide an informal collaboration space for vocabulary demos, development, q/a, etc. If you want to help out please join the mailing list.
good news via the govtrack mailing list
Carl Malamud of public.resource.org, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.
More here: http://public.resource.org/gpo.gov/index.html
And Carl’s letter to the GPO:
It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.