Adam Bosworth has some good advice for would-be standards developers in the form of a 7 item list. It is strangely reassuring to know that someone in the US Federal Government called someone like Adam for advice about standards…even if it was at some inhuman hour. Number 5 really resonated with me:
Always have real implementations that are actually being used as part of [the] design of any standard … And the real implementations should be supportable by a single engineer in a few weeks.
It is interesting because I think it could be argued that #1, #2, #4, #6 and #7 largely fall out from really doing #5.
Keep the standard as simple and stupid as possible.
The data being exchanged should be human readable and easy to understand.
Standards should have precise encodings.
Put in hysteresis for the unexpected.
Make the spec itself free, public on the web, and include lots of simple examples on the web site.
That leaves #3 Standards work best when they are focused – which seems to be a really tricky one to get right. Maybe it comes down to being able to say No to the right things, to keep scope creep at bay. Or to be able to say:
Stop It. Just Stop.
to ward off complexity, like Joel Spolsky’s Duct Tape Programmer. But I think #3 is really about being able to say Yes to the right things. The right things are the things that bind together the most people involved in the effort. Maybe it’s the “rough consensus” in the Tao of the IETF:
We reject kings, presidents and voting. We believe in rough consensus and running code.
Whatever it is, it seems slippery and subtle – a zen-like appreciation for what is best in people, mixed with being in the right place at the right time.
Each event had a different flavor, but the topic under discussion at each was digital preservation. iPRES focused generally on digital preservation, particularly from a research angle. IIPC also had a bit of a research flavor, but focused more specifically on the practicalities of archiving web content. And PASIG was less research oriented, and much more oriented around building/maintaining large scale storage systems. There was so much good content at these events, that it’s kind of impossible to summarize it here. But I thought I would at least attempt to blurrily characterize some of the ideas from the three events that I’m taking back with me.
Long term digital preservation has many hard problems–so many that I think it is rational to feel somewhat overwhelmed and to some extent even paralyzed. It was important to see other people recognize the big problems of emulation, format characterization/migration, compression – but continue working on pragmatic solutions, for today. Martha Anderson made the case several times for thinking of digital preservation in terms of 5-10 year windows, instead of forever. The phrase “to get to forever you have to get to 5 years first” got mentioned a few times, but I don’t know who said it first. John Kunze brought up the notion of preservation as a “relay”, where bits are passed along at short intervals–and how digital curation needs to enable these hand-offs to happen easily. It came to my attention later that this relay idea is something Chris Rusbridge wrote about back in 2006.
On a similar note, Martha Anderson indicated that making bits useful today is a key factor that the National Digital Information Infrastructure and Preservation Program (NDIIPP) weighs when making funding decisions. Brewster Kahle in his keynote for IIPC struck a similar note that “preservation is driven by access”. Gary Wright gave an interesting presentation about how the Church of Latter Day Saints had to adjust the Reference Model for Open Archival Information System (OAIS) to enable thousands of concurrent users access to their archive of 3.1 billion genealogical image records. Jennifer Waxman was kind enough to give me a pointer to some work Paul Conway has done on this topic of access driven preservation. The topic of access in digital preservation is important to me, because I work in a digital preservation group at the Library of Congress, working primarily on access applications. We’ve had a series of pretty intense debates about the role of access in digital preservation … so it was good to hear the topic come up in San Francisco. In a world where Lots of Copies Keeps Stuff Safe, access to copies is pretty important.
Less is More (More or Less)
Over the week I got several opportunities to hear details from John Kunze, Stephen Abrams, and Margaret Low about the California Digital Library’s notion of curation micro-services, and how they enable digital preservation efforts at CDL. Several folks in my group at LC have been taking a close look at the CDL specifications recently, so getting to hear about the specs, and even see some implementation demos from Margaret was really quite awesome. The specs are interesting to me because they seem to be oriented around the fact that our digital objects ultimately reside on some sort of hierarchical file-system. Filesystem APIs are fairly ubiquitous. In fact, as David Rosenthal has pointed out, some file systems are even designed to resist change. As Kunze said at PASIG in his talk Permanent Objects, Evolving Services, and Disposable Systems: An Emergent Approach to Digital Curation Infrastructure:
What is the thinnest smear of functionality that we can add to the filesystem so that it can act as an object storage system?
Approaches to building digital repository software thus far have been primarily aimed at software stacks (dspace, fedora, eprints) which offer particular services, or service frameworks. But the reality is that these systems come and go, and we are left with the bits. Why don’t we try to get the bits in shape so that they can be handed off easily in the relay from application to application, filesystem to filesystem? What is nice about the micro-services approach is that:
The services are compose-able, allowing digital curation activities to be emergent, rather than imposed by a pre-defined software architecture. Since I’ve been on a bit of a functional programming kick lately, I see compose-ability as a pretty big win.
The services are defined by short specifications, not software–so they are ideas instead of implementations. The specifications are clearly guided by ease of implementation, but ultimately they could be implemented in a variety of languages, and tools. Having a 2-3 page spec that defines a piece of functionality, and can be read by a variety of people, and implemented by different groups seems to be an ideal situation to strive for.
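To make the compose-ability point concrete, here is a toy sketch (my own, not CDL’s code) of curation activities as plain functions that can be chained into a pipeline; the service names and the shape of the object state are made up purely for illustration:

```python
# A toy sketch of file-based curation micro-services composed as plain
# functions: each service takes an object's state and returns an updated
# copy, so new services can be slotted in without a framework.
from functools import reduce

def checksum(state):
    # hypothetical service: record a fixity value for each file
    state = dict(state)
    state["fixity"] = {name: hash(data) for name, data in state["files"].items()}
    return state

def inventory(state):
    # hypothetical service: record a manifest of what the object contains
    state = dict(state)
    state["manifest"] = sorted(state["files"])
    return state

def compose(*services):
    # chain independent services left-to-right into one curation pipeline
    return lambda state: reduce(lambda s, svc: svc(s), services, state)

pipeline = compose(checksum, inventory)
obj = {"files": {"page1.tif": b"...", "ocr.xml": b"<ocr/>"}}
curated = pipeline(obj)
```

The win is that `compose(checksum, inventory)` and `compose(inventory, checksum, replicate)` are equally easy to express, which is what lets curation practice emerge rather than be imposed.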
Everything Else Is Miscellaneous
Like I said, there was a ton of good content over the week…and it seems somewhat foolhardy to try to summarize it all in a single blog post. I tried to summarize the main themes I took home with me on the plane back to DC…but there were also lots of nuggets of ideas that came up in conversation, and in presentations that I want to at least jot down:
While archival storage may not be best served by HDFS, jobs like virus scanning huge web crawls are well suited to distributed computing environments like Hadoop. We need to be able to operate at this scale at loc.gov.
In Cliff Lynch’s summary wrap up for PASIG he indicated that people don’t talk so much about what we do when the inevitable happens, and bits are lost. The digital preservation community needs to share more statistics on bit loss, system failure modes, and software design patterns that let us build more sustainable storage systems.
I want to learn more about the (w)arc data format, and perhaps contribute to some of the existing codebases for working w/ (w)arc. I’m particularly interested in using harvesting tools and WARC to preserve linked data…which I believe some of the Sindice folks have worked on for their bot.
It’s long since time I understood how LOCKSS works as a technology. It was mentioned as the backbone of several projects during the week. I even overheard some talk about establishing rogue LOCKSS networks, which of course piqued my interest even more.
It would be fun to put a jython or jruby web front end on DROID for format identification, but it seems that Carol Chou of the Florida Center for Library Automation has already done something similar. Still, it would be neat to at least try it out, and perhaps have it conneg to Dave’s P2 registry or PRONOM.
Ok, braindump complete. Thanks for reading this far!
As an experiment to learn more about xmpp I created a little utility that will poll an oai-pmh server and send new records as a chunk of xml over xmpp. The idea wasn’t necessarily to see all the xml coming into my jabber client (although you can do that). I wanted to enable downstream applications to have records pushed to them, instead of them having to constantly poll for updates. So you could write a client that archived away metadata and potentially articles as they are found, or write a current awareness tool that listened for articles that matched a particular user’s research profile, etc…
which would poll Directory of Open Access Journals for new articles every 10 minutes, and send them via xmpp to email@example.com. You can adjust the poll interval, and limit to records within a particular set with the --pollinterval and --set options, e.g.:
It’s a one file python hack in the spirit of Thom Hickey’s 2PageOAI that has a few dependencies documented in the file (lxml, xmpppy, httplib2). I’ve run it for about a week against DOAJ and arxiv.org without incident (it does respect 503 HTTP status codes telling it to slow down). You can find it here.
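For the curious, the polling half of the hack boils down to issuing an OAI-PMH ListRecords request on each cycle, using the protocol’s `from` argument so only records added since the last poll come back. A minimal sketch of that request-building (standard OAI-PMH semantics, not the actual script):

```python
# Sketch of building the selective-harvesting URL an OAI-PMH poller
# would fetch on each cycle. The `from` argument limits the response
# to records added/changed since the previous poll.
from urllib.parse import urlencode

def list_records_url(base_url, last_poll, metadata_prefix="oai_dc", set_spec=None):
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_poll,  # UTC timestamp of the previous poll
    }
    if set_spec:
        params["set"] = set_spec  # optional set limiting, like --set
    return base_url + "?" + urlencode(params)

url = list_records_url("http://www.doaj.org/oai.article", "2009-09-01T00:00:00Z")
```

Each new record in the ListRecords response would then get shipped down the wire as an xmpp message body.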
If you try it out, have any improvements, or ideas let me know.
Pooh began to feel a little more comfortable, because when you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://covers.openlibrary.org/b/olid/OL619435M-M.jpg> foaf:depicts <http://openlibrary.org/b/OL619435M> .
<http://inkdroid.org/journal/2009/09/16/think-of-things/#thingQuote> a bibo:Quote ;
dct:source <http://openlibrary.org/b/OL619435M> ;
bibo:content """Pooh began to feel a little more comfortable, because when you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it."""@en ;
bibo:pages "266"@en .
I've struggled in the past with what constitutes an Information Resource in the context of Web Architecture, Linked Data and practical digital library applications such as the National Digital Newspaper Project I work on at the Library of Congress. So it was reassuring to see the issue come up a few months ago during a review of the effort to revise the HTTP specification (RFC 2616). It would be a major effort to summarize the entire conversation here. However an interesting sub-discussion circled around the idea of normalizing the language in the Architecture of the World Wide Web and RFC 2616 with respect to Resources.
Well into the multi-month thread Tim Berners-Lee offered up a very helpful, historical recap of the "what is a resource" issue, in which he said:
I would like to see what the documents [AWWW and RFC 2616] all look like if edited to use the words Document and Thing, and eliminate Resource.
Which, somewhat predictably, started a discussion of what a Document is. However this conversation seemed more tangible and earthy, and culminated in Larry Masinter recommending David M. Levy's book Scrolling Forward:
... since much of the thought behind it informs a lot of my own thinking about the nature of "Document", "representation", "Resource" and the like.
Now Larry is a scientist at Adobe, a company that knows a thing or two about electronic documents. He also works closely with the W3C and IETF on web architectural issues. So when he suggested reading a book to learn what he means by Document my ears perked up. The interjection of a book reference into this rapid-fire email exchange was like a magic spell that made me pause and consider that a working definition of Document was nuanced enough to be the subject matter of an entire book.
I've come to expect references to Michael Buckland's classic What is a Document? in discussions of documents. I hadn't run across David Levy's name before so Larry's recommendation was enough for me to request it from the stacks, and give it a read. I wasn't disappointed. Scrolling Forward is an ode to documents of all shapes and sizes, from all time periods. It's a joyful, mind expanding work, that explores the entire landscape of our documents: from cash register receipts, the multi-editioned Leaves of Grass, email messages, letters, books, photographs, papyrus scrolls, greeting cards and web pages. Since this takes place in 212 pages, it is not surprising that the analysis
Use python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
Parse HTML into a fine, queryable data structure using BeautifulSoup.
Locate topic names and their associated URLs, and gently add them to the graph with a pinch of SKOS.
Go back to step 1 to fetch the next batch of topics, until you’ve finished Z.
Bake the RDF graph as an rdf/xml file.
If you don’t feel like cooking up the rdf/xml yourself you can download it from here (might want to right-click to download, some browsers might have trouble rendering the xml), or download the 68 line implementation and run it yourself.
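The heart of the recipe, adding a scraped topic to the graph with a pinch of SKOS, can be sketched roughly like this. To keep the sketch dependency-free it emits Turtle via string templates rather than an RDF library, and the concept-scheme URI is my own invention (the real 68-line script baked rdf/xml):

```python
# Sketch of step 3: turn a scraped topic name + URL into SKOS triples.
# The inScheme URI below is hypothetical, for illustration only.
def topic_to_skos(topic_url, label,
                  scheme_url="http://topics.nytimes.com/top/reference/timestopics"):
    return (
        '<%s> a <http://www.w3.org/2004/02/skos/core#Concept> ;\n'
        '    <http://www.w3.org/2004/02/skos/core#prefLabel> "%s"@en ;\n'
        '    <http://www.w3.org/2004/02/skos/core#inScheme> <%s> .\n'
    ) % (topic_url, label, scheme_url)

triples = topic_to_skos(
    "http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury/index.html",
    "Bradbury, Ray",
)
```

The point being that each topic page URL doubles as a resolvable identifier for the concept, which is exactly what makes the vocabulary linkable.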
The point of this exercise was mainly to show how thinking of the New York Times Topics as a controlled vocabulary, one that can be serialized as a file and still be present on the Web, could be useful. Perhaps to someone writing an application that needs to integrate with the New York Times and who wants to be able to tag content using the same controlled vocabulary. Or perhaps to someone who wants to be able to link their own content with similar content at the New York Times. These are all use cases for expressing the Topics as SKOS, and being able to ship it around with resolvable identifiers for the concepts.
Of course there is one slight wrinkle. Take a look at this Turtle snippet for the concept of Ray Bradbury:
I spent an hour checking out the HathiTrust API docs this morning; mainly to see what the similarities and differences are with the as-of-yet undocumented API for Chronicling America. There are quite a few similarities in the general RESTful approach, and the use of Atom, METS and PREMIS in the metadata that is made available.
Everyone’s a critic right? Nevertheless, I’m just going to jot down a few thoughts about the API, mainly for my friend over in #code4lib, Bill Dueber, who works on the project. Let me just say at the outset that I think it’s awesome that HathiTrust are providing this API, especially given some of the licensing constraints around some of the content. The API is a good example of putting library data on the web using both general and special purpose standards. But there are a few minor things that could be tweaked I think, to make the API fit into the web and the repository space a bit better.
it would be nice if the OpenSearch description document referenced in the HTML at
worked. It should be pretty easy and non-invasive to add a basic description file for the HTML response since the search is already GET driven. Ideally it would be nice to see the responses also available as Atom and/or JSON with Atom Feed Paging.
Another thing that would be nice to see is the API being merged more into the human usable webapp. The best way to explain this is with an example. Consider the HTML page for this 1914 edition of Walt Whitman’s Leaves of Grass, available with this clean URI:
Now, you can get a few flavors of metadata for this book, and an aggregated zip file of all the page images and OCR if you are a HathiTrust member. Why not make these alternate representations discoverable right from the item display? It could be as simple as adding some <link> elements to the HTML, that use the link relations they’ve already established for their Atom:
If you wanted to get fancy you could also put human readable links into the <body> and annotate them w/ RDFa. But this would just be icing on the cake. There are a few reasons for doing at least the bare minimum. The big one is to enable in browser applications (like Zotero, etc) to be able to learn more about a given resource in a relatively straightforward and commonplace way. The other big one is to let automated agents like GoogleBot and YahooSlurp and Internet Archive’s Heritrix, etc. discover the deep web data that’s held behind your API. Another nice side effect is that it helps people who might ordinarily scrape your site automatically discover the API in a straightforward way.
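To illustrate the consumer side of this, here is a hedged sketch of how a browser plugin or bot might pull alternate representations out of those <link> elements; the sample markup, type values and paths are hypothetical, not HathiTrust’s actual HTML:

```python
# Sketch: discover alternate representations of a page by scanning
# its <head> for <link rel="alternate"> elements. The sample HTML
# below is made up for illustration.
from html.parser import HTMLParser

class AltLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.alternates = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates.append((a.get("type"), a.get("href")))

html = """<html><head>
<link rel="alternate" type="application/atom+xml" href="/cgi/htd/meta/atom?id=uc1.b4261708" />
<link rel="alternate" type="application/marc+xml" href="/cgi/htd/meta/marc?id=uc1.b4261708" />
</head><body></body></html>"""

parser = AltLinks()
parser.feed(html)
```

This is all a tool like Zotero, or a crawler like Heritrix, would need to follow its nose from the human-readable page to the machine-readable data.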
Lastly, I was curious to know if HathiTrust considered adjusting their Atom response to use the Atom pattern recommended by the OAI-ORE folks. They are pretty close already, and in fact seem to have modeled their own aggregation vocabulary on OAI-ORE. It would be interesting to hear why they diverged if it was intentional, and if it might be possible to use a bit of oai-ore in there so we can bootstrap an oai-ore harvesting ecosystem.
I’m not sure that I can still call this approach to integrating web2.0 APIs into web1.x applications Linked Data anymore, since it doesn’t really involve RDF directly. It does involve thinking in a RESTful way about the resources you are publishing on the web, and how they can be linked together to form a graph. My colleague Dan has been writing in Computers in Libraries recently about how perhaps thinking in terms of “building a better web” may be a more accurate way of describing this activity.
For reasons I don’t fully understand I’ve been reading a lot of Wittgenstein (well mainly books about Wittgenstein honestly) lately during the non-bike commute. The trajectory of his thought over his life is really interesting to me. He had this zen-like, controversial idea that
Philosophy simply puts everything before us, and neither explains nor deduces anything. — Since everything lies open to view there is nothing to explain. For what is hidden, for example, is of no interest to us. (PI 126)
I really like this idea that our data APIs on the web could be “open to view” by checking out the HTML, following your nose, and writing scrapers, bots and browser plugins to use what you find. I think it’s unfortunate that the recent changes to the Linked Data Design Issues, and the ensuing discussion, seemed to create this dividing line about the use of RDF and SPARQL. I had always hoped (and continue to hope) that the Linked Data effort is bigger than a particular brand, or reformulation of the semantic web effort … for me it’s a pattern for building a better web. I think RDF is very well suited to expressing the core nature of the web, the Giant Global Graph. I’ve served up RDF representations in applications I’ve worked on just for this reason. But I think the Linked Data pattern will thrive most if it is thought of as an inclusive continuum of efforts, similar to what Dan Brickley has suggested. We technology people strive for explicitness – it’s an occupational hazard – but there’s sometimes quite a bit of strength in ambiguity.
Anyhow, my little review of the HathiTrust API turned into a bit of a soapbox for me to stand on and shout like a lunatic. I guess I’ve been wanting to write about what I think Linked Data is for a few weeks now, and it just kinda bubbled up when I least expected it. Sorry Bill!
So for example this newspaper page on Chronicling America:
Is also available on Flickr:
I haven’t written about it here yet, but Chronicling America is just a regular vanilla Django/MySQL/Apache webapp which exposes machine readable metadata for the newspaper content using the Linked Data pattern. It just so happens that Dave was able to use these linked data views to determine the metadata to use when uploading content to Flickr. For example if a curator wants to have this newspaper page uploaded to Flickr, Dave’s flickr uploading program is able to use the associated metadata (referenced in a link element in the HTML) to get the newspaper title, issue, date, and other related metadata. The beauty of this was Dave was able to do the technical work on his own, and it didn’t require any formal project coordination.
A few weeks ago, on learning of the Flickr / Chronicling America integration Rob Sanderson asked if we could possibly reference the Flickr content in our own data views. Rob became interested in Chronicling America because of its use of the Open Archives Initiative Object Reuse and Exchange Vocabulary in the linked data views. Rob has written a rather nice oai-ore Greasemonkey visualization tool called foresite-explorer, which can visualize oai-ore links to Flickr. It also makes sense from a curatorial perspective to want to capture these bi-directional links between Chronicling America and Flickr in the Chronicling America database.
After agreeing with Rob I’ve had it on my plate to get the list of Flickr URLs and their associated Chronicling America URLs from Dave, for loading into Chronicling America so that the links can be served up in the data views, and perhaps maybe in the HTML. But yesterday morning I had the realization that I didn’t really need to ask (and keep asking every month) Dave for the list. Since Dave created a Flickr set for these pages, and has embedded the URI for the Chronicling America page as a machine tag I can get it right from Flickr. So I hacked together a quick script, and now I have the list too.
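For the curious: a machine tag is just namespace:predicate=value, so pulling the Chronicling America URI back out of a photo’s tags is a one-regex job. The namespace and predicate below are illustrative guesses, not necessarily the ones Dave actually used:

```python
# Sketch: parse Flickr machine tags (namespace:predicate=value) and
# pull out an embedded Chronicling America page URI. The dc:identifier
# namespace/predicate here is a guess for illustration.
import re

MACHINE_TAG = re.compile(r"^(?P<ns>[^:]+):(?P<predicate>[^=]+)=(?P<value>.+)$")

def chronam_uri(tags, ns="dc", predicate="identifier"):
    # return the first machine tag value matching the namespace/predicate
    for tag in tags:
        m = MACHINE_TAG.match(tag)
        if m and m.group("ns") == ns and m.group("predicate") == predicate:
            return m.group("value")
    return None

uri = chronam_uri([
    "newspaper",
    "dc:identifier=http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-3/",
])
```

Loop that over the photos in Dave’s set (via flickr.photosets.getPhotos) and you have the Flickr URL to Chronicling America URL mapping without ever asking anybody for a list.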
The point of this blog post is that Dave, Rob and I were able to independently work on tools for pushing the same repository content around without having to talk much at all. The World Wide Web, and URIs in particular, enables this sort of integration. I guess some people would argue that this is a Web2.0 Mashup, and I guess they would be right in a way. But really it is using the Web the way it was designed to be used–as a decentralized information space. Flickr can tell me about the Chronicling America content that’s on Flickr; and Chronicling America can tell other people about the repository objects themselves.
Now I just need to make those links available for Rob in the data views :-)
Through an internal discussion list at the Library of Congress I learned that this year will mark the 20th Anniversary of American Memory. The exact date of the anniversary depends on how you want to mark it: either the beginning of FY90 on October 1st, 1989 (thanks David) when work officially began, or earlier in the year when the President signed the bill that included the Legislative Branch appropriations for that year (exact date yet to be determined).
Via the discussion list I learned that Shirley Liang (with the help of Nancy Eichacker) located a transcript of the hearings, which includes the details of Carl Fleischhauer’s demo of a Hypercard / Laser Video Disc based system before the House and later the Senate. Yes, HyperCard. LoC was making a pitch for American Memory before Congress just a few months after Tim Berners-Lee made his proposal to build a “web of notes with links” at CERN. Incidentally, I learned recently in Andrew Lih’s Wikipedia Revolution, that Ward Cunningham’s first implementation of Wiki was written using Hypercard.
I digress…and I want to digress more.
As a Library School student in the mid 90s I became a big fan of American Memory. It seemed like an audacious and exciting experiment right on the cutting edge of what the World Wide Web made (and continues to make) possible. The work that Caroline Arms and Carl Fleischhauer did to expose metadata about American Memory collections (with the technical expertise of Dave Woodward) deepened my interest in what LoC was doing. In hindsight, I think seeing this work from afar is what got me interested in trying to find a job at the Library of Congress.
Seeing that American Memory is turning 20 this year made me fess up to a crazy idea of writing a history of the project. In conversation with those much more knowledgeable than me I think I’ve convinced myself that a good place to start would be compiling a bibliography of things that have been written about the project. It seems a relatively simple and logical place to start.
As the last post indicated I’m part of a team at loc.gov working on an application that serves up page views like this for historic newspapers–almost a million of them in fact. For each page view there is another URL for a view of the OCR text gleaned from that image, such as this. Yeah, kind of yuckster at the moment, but we’re working on that.
Perhaps it’s obvious, but the goal of making the OCR html view available is so that search engine crawlers can come and index it. Then when someone is searching for someone’s name, say Dr. Herbert D. Burnham in Google they’ll come to page 3 in the 08/25/1901 issue of the New York Tribune. And this can happen without the searcher needing to know anything about the Chronicling America project beforehand. Classic SEO…
The current OCR view at the moment is quite confusing, so we wanted to tell Google that when they link to the page in their search results they use the page zoom view instead. We reached for Google’s (and now other major search engine’s) rel=“canonical”, since it seemed like a perfect fit.
… we now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that’s accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.
From our logs we can see that Google has indeed come and fetched both the page viewer and the ocr view for this particular page, and also the text/plain and application/xml views.
But it doesn’t look like the ocr html view is being indexed, at least based on the results for this query I just ran. We can see the .txt file is showing up, which was harvested just after the OCR html view … so it really ought to be in the search results.
Amazon.com also suffers from URL canonicalization issues, multiple URLs reference identical or similar content. For example, our (Google’s) Discovery crawl crawls both
The two URLs return identical content and offer little value since these pages offer two “different” views on an empty customer review list. Simple crawlers cannot detect these types of duplicate URLs without downloading all duplicate URLs first, processing their content, and wasting resources in the process.
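The paper’s point can be sketched in a few lines: without rel=“canonical” a crawler only discovers duplicates after fetching and fingerprinting each URL’s content. The URLs and response bodies here are stand-ins, not real crawl data:

```python
# Sketch of why duplicate URLs waste crawler resources: duplicates are
# only detectable after downloading and fingerprinting each body.
import hashlib

def content_key(body):
    # fingerprint a page body; real systems normalize markup/boilerplate first
    return hashlib.sha1(body.encode("utf-8")).hexdigest()

fetched = {
    "http://example.com/review?sort=helpful": "<html>empty review list</html>",
    "http://example.com/review?sort=recent": "<html>empty review list</html>",
}

seen = {}
duplicates = []
for url, body in fetched.items():
    key = content_key(body)
    if key in seen:
        duplicates.append((url, seen[key]))  # duplicate of an earlier URL
    else:
        seen[key] = url
```

With rel=“canonical” the site declares the preferred URL up front, and the crawler can skip all that download-then-compare work.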
It seems like a somewhat non-obvious (to me) side effect of asserting a canonical relationship with another URI is that Google will not index the document at the alternate URI. I guess I’m just learning to only use canonical when a site has “identical or vastly similar content that’s accessible through multiple URLs” … Does this seem about right to you?
(thanks Erik Wilde for the pointer to the Schonfeld and Shivakumar paper)