Hacking O'Reilly RDFa

I recently learned from Ivan Herman’s blog that O’Reilly has begun publishing RDFa in their online catalog of books. So if you install the RDFa Highlight bookmarklet, then visit a page like this and click it, you’ll see something like:



Those red boxes you see are graphical depictions of where metadata can be found interleaved in the HTML. In my screenshot you can just barely make out an assertion about the title being displayed:

<urn:x-domain:oreilly.com:product:9780596516499.IP> dc:title "Natural Language Processing with Python"

But there is actually quite a lot of metadata hiding in the page, which can be found by running the page through the RDFa Distiller (quickly skim over this if your eyes glaze over when you see Turtle):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix frbr: <http://vocab.org/frbr/core#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
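
If you want to poke at this metadata programmatically, rdflib makes short work of it. Here’s a minimal sketch, assuming you’ve saved the distiller output to a local file called oreilly.ttl (my made-up filename):

from rdflib import Graph, Namespace

DC = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.parse("oreilly.ttl", format="turtle")  # the Turtle produced by the RDFa Distiller

# print every dc:title assertion found in the page, e.g. the one above
for subject, title in g.subject_objects(DC.title):
    print(subject, title)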


thank you wikipedia

I just donated to Wikipedia because I use it every day. I work as a software developer at the Library of Congress. I’m not ashamed to admit that I’ve spent the last 10 years filling in gaps in my computer science, math and philosophy knowledge. Working in libraries makes this sort of self-education process easier because of all the access to books, journals and whatnot. But Wikipedia has made this process much more fun and collaborative. I don’t think I could do my job without it.

I’m also a Linked Data enthusiast, and appreciate the essential role that Wikipedia plays, via data sets like dbpedia, yago and freebase, in bootstrapping Linked Data around the world. Seeing Wikipedia pages float to the top of Google search results really brought home to me how important it is that we can use URLs as names for things in the world, and gather a shared understanding of what they are.

If you use Wikipedia I encourage you to take a moment to say thank you as well.


MARCetplace

Last Saturday I passed the time while waiting in line at the DMV by reading the recently released Study of the North American MARC Records Marketplace. The analysis of the survey results seems to focus on the role of the Library of Congress in the marketplace, which is understandable given that LC funded the report. But there seems to be a real effort to look at LC’s role in the broader MARCetplace (sorry, I couldn’t resist).

Anyhow, I jotted down some random notes and questions in the margins, and
figured I’d add them here before my notes got tossed in the circular file.

So I found this kind of surprising at the time:

7 participating distributors report that they do not acquire MARC records from external sources, but the rest do. Of those external sources, LC was predominant, followed by OCLC, LC record resellers, Library and Archives Canada, and the British National Library. Approximately 14% of respondents acquire a significant portion of their records via Z39.50 protocols and various web crawlers.

p. 19

Should I be surprised that there are more LC subscribers than OCLC subscribers
among the 70 distributors participating in the survey? I am surprised.

Much has changed since this law was formulated. First, LC took on a community oriented role by underwriting the CIP program, which accounted for 53,000 new titles in 2008. Second, for the past 25 years or so, LC records have been distributed electronically. This has not only lowered the cost of distribution, but has made the records easily transferable from one institution to another, often without payment. One result is that LC records are significantly underpriced, since the cost of production is not included. Another is that an entire industry has developed around free (or at least very cheap) MARC records. Consider that an LC record for a single title might appear in thousands of library catalogs, while its MARC Distribution Service lists only 74 customers, 30 of them foreign. Most copies of LC records are obtained either free (via its Z39.50 servers and WebOPAC) or purchased from OCLC or vendors who supply those records in conjunction with the materials they sell. In short, many libraries and vendors benefit from a product for which production costs are not recovered.

pp. 26-27

It would’ve been nice to see how much money it costs to distribute MARC data from the LC FTP site compared with how much money LC gets through its MARC subscription program. The report points out elsewhere that LC catalogs items through the CIP program that it ends up discarding, so the cost of cataloging them isn’t technically part of the operating cost of the library–if you don’t consider the Copyright Office part of the Library of Congress. The last time I looked at the LC organization chart the Copyright Office was part of LC. Furthermore, unless I missed it there is no indication of how many records fall into that category. Extrapolating from the 74 customers and the current price of the subscription service ($21,905), it would appear that LC gets approximately $1,620,970.00 a year in revenue from its distribution of the MARC data. It’s difficult for me to imagine that the cost of generating CIP records for items LC discards, added to the cost of operating an FTP site, would equal this number.
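
For what it’s worth, here is the back-of-the-envelope arithmetic behind that revenue estimate, assuming all 74 customers pay the full subscription price:

# rough extrapolation: 74 MARC Distribution Service customers paying
# the full $21,905 annual subscription price
customers = 74
annual_price = 21905
print(customers * annual_price)  # 1620970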

Another major distribution channel involves direct downloads from LC’s Voyager database. At present, LC offers four separate interfaces:

  • A Web OPAC for bib records that supports 875 simultaneous users
  • A Web OPAC for authority records that supports 500 simultaneous users
  • Z39.50 direct access for users with Z39.50 clients, which supports 340 simultaneous users
  • Z39.50 gateway interface that supports up to 250 simultaneous users

In total, these search interfaces process about 500,000 searches each business day. While not every search leads to a download, the volume of searches is a clear indication of interest. Major users, to the degree that can be determined, include school libraries and small publics, who may not be OCLC members. In addition, vendors, open database providers, and firms such as Amazon regularly seek these records.

p. 35

Wow, half a million searches a day, that’s bigger than I would’ve thought. It would be interesting to see how many actual MARC downloads there are through these services, and also to see a breakdown across services. Ironically, I think providing piecemeal access to records and supporting search interfaces such as Z39.50 has quite a high cost in practice, and that simply making bulk downloads available to the public for free via FTP or what have you would do a lot to mitigate those costs.

Lastly the findings with respect to copy cataloging were really interesting.

In looking at the median numbers of original catalogers reported, we estimated that well over 30,000 professional catalogers are at work in North America. In the earlier example, we suggested that if each of those catalogers were to produce one record per work day, that would provide the capacity to create 6.8 million records per year.

p. 36

I probably missed it, but the report doesn’t seem to estimate how much backlogged material there is in the United States. Presumably it is lower than 6.8 million? It is kind of staggering to think how much untapped potential there is for original cataloging by professional catalogers in the United States. I lay the blame for the lack of original cataloging at the doorstep of archaic and arcane systems, data formats, and rules for content generation. The barrier to entry is just too high. Unfortunately the barrier to entry for getting the bibliographic data that is generated with taxpayers’ money is too high as well.

These are obviously my own rambling thoughts and not those of my employer, or anyone else I work with for that matter.


cloaking and fulltext

It’s comforting to know that the California Digital Library is selectively serving up fulltext content in HTML from its institutional repository for search engines to chew on. For example, compare the output of:

curl http://escholarship.org/uc/item/2896686x

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://escholarship.org/uc/item/2896686x

You should see full-text content for the article in the latter and not in the former:

...

qt2896686x repo "Wholly Visionary": the American Library Association, the Library of Congress, and the Card Distribution Program wholly visionary the american library association the library of congress and the card distribution program 2009 2009 2009 2009-04-01 2009-04-01 20090401 yee yy::Yee, Martha M Yee, Martha M American Library Association American Library Association Library of Congress Library of Congress card distribution program card distribution program shared cataloging shared cataloging cooperative cataloging cooperative cataloging national bibliography national bibliography cataloging rules and standards cataloging rules and standards library history united states library history united states This paper offers a historical review of the events and institutional influences in the nineteenth century that led to the ...

The advantage to doing this is that when I was searching for a quote from Title 2, Chapter 5, Section 150 of the US Code:

The Librarian of Congress is authorized to furnish to such institutions or individuals as may desire to buy them

I found Martha Yee’s paper “Wholly Visionary”: the American Library Association, the Library of Congress, and the Card Distribution Program as the 5th hit in the search results.

We do this at the Library of Congress as well in Chronicling America to make the OCR text of historic newspaper pages available to search engines, while not burdening the UI search interface with all the (much noisier) textual content. Compare:

curl http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/
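
Under the hood this boils down to a user-agent check when the page is rendered. Here’s a minimal sketch of the pattern (not the actual Chronicling America code; the Flask app, the template and the get_ocr_text stub are all hypothetical stand-ins):

from flask import Flask, render_template_string, request

app = Flask(__name__)

CRAWLER_MARKERS = ("Googlebot", "bingbot", "Slurp")

PAGE_TEMPLATE = """<html><body>
<h1>{{ title }}</h1>
{% if ocr_text %}<div class="ocr">{{ ocr_text }}</div>{% endif %}
</body></html>"""

def get_ocr_text(lccn, date, seq):
    # stand-in for a lookup of the stored OCR text for this page
    return "... full OCR text for %s %s seq %s ..." % (lccn, date, seq)

@app.route("/lccn/<lccn>/<date>/ed-1/seq-<int:seq>/")
def newspaper_page(lccn, date, seq):
    # only include the (noisy) OCR text when the client looks like a crawler
    ua = request.headers.get("User-Agent", "")
    is_crawler = any(marker in ua for marker in CRAWLER_MARKERS)
    ocr_text = get_ocr_text(lccn, date, seq) if is_crawler else None
    return render_template_string(PAGE_TEMPLATE, title="%s (%s)" % (lccn, date), ocr_text=ocr_text)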

However we’ve got a ticket in our tracking system to revisit this practice in light of Google themselves frowning on the practice of ‘cloaking’:

Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

We were thinking of returning the OCR text in all the responses and putting it in a background pane of some kind that can be selected. But this would most likely increase the size of the HTTP response, and may significantly impact the load time. As more and more fulltext content moves online it would be nice to have a pattern digital libraries could follow for minting URIs for books, articles, etc., while still making the fulltext content available to user agents that can effectively use it.

Google hasn’t dropped Chronicling America’s pages from its index yet, which is a good sign. After running across a similar pattern at CDL I’m wondering if it’s OK to continue doing what we are doing. What do you think?

Update: Leigh Dodds let me know on Twitter that much of the content gets into Google Scholar via cloaking.


skos as atom

I’ll be the first to admit the tone and content of my last post were a bit off-kilter. I guess that was pretty clear immediately from the title of the post. Chalk it up to a second night of insomnia; and also to my unrealistic and probably unnecessary goal of bringing the Atom/REST camp into closer alignment with the RDF/LinkedData camp … at least in my own brain if not on the web.

So, ever the pragmatist, Ian Davis called my bluff a bit on some of the crazier stuff I said, and asked to see what this would actually look like as Atom.

I know Peter Keane took a stab at this over the summer. But I couldn’t find sample output lying around on the web, so I marked up one by hand to serve as a strawman. So here’s the turtle for the LCSH concept “World Wide Web”:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://id.loc.gov/authorities/sh95000541#concept>
    a skos:Concept ;
    skos:prefLabel "World Wide Web"@en ;
    dcterms:modified "2001-10-01T09:56:06-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    skos:altLabel "W3 (World Wide Web)"@en, "WWW (World Wide Web)"@en, "Web (World Wide Web)"@en, "World Wide Web (Information retrieval system)"@en ;
    skos:broader <http://id.loc.gov/authorities/sh88002671#concept>, <http://id.loc.gov/authorities/sh92002381#concept> ;
    skos:narrower <http://id.loc.gov/authorities/sh2002000569#concept>, <http://id.loc.gov/authorities/sh2003001415#concept>, <http://id.loc.gov/authorities/sh2007008317#concept>, <http://id.loc.gov/authorities/sh2007008319#concept>, <http://id.loc.gov/authorities/sh2008009697#concept>, <http://id.loc.gov/authorities/sh97003254#concept> ;
    skos:related <http://id.loc.gov/authorities/sh92002816#concept> ;
    skos:closeMatch <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb13319953j> .

And here’s the “corresponding” atom:

<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <id>http://id.loc.gov/authorities/sh95000541#concept</id>
    <title>LCSH: World Wide Web</title>
    <author><name>Library of Congress</name></author>
    <updated>2001-10-01T09:56:06Z</updated>
    <skos:prefLabel>World Wide Web</skos:prefLabel>
    <skos:altLabel>W3 (World Wide Web)</skos:altLabel>
    <skos:altLabel>Web (World Wide Web)</skos:altLabel>
    <skos:altLabel>World Wide Web (Information retrieval system)</skos:altLabel>
    <skos:altLabel>WWW (World Wide Web)</skos:altLabel>
    <link rel="http://www.w3.org/2004/02/skos/core#broader" href="http://id.loc.gov/authorities/sh88002671#concept" title="Hypertext systems" />
    <link rel="http://www.w3.org/2004/02/skos/core#broader" href="http://id.loc.gov/authorities/sh92002381#concept" title="Multimedia systems" />
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh2008009697#concept" title="Invisible web"/>
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh2007008317#concept" title="Mashups (World Wide Web)" />
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh2002000569#concept" title="Semantic Web" />
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh2007008319#concept" title="Web 2.0" />
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh97003254#concept" title="WebDAV (Standard)" />
    <link rel="http://www.w3.org/2004/02/skos/core#narrower" href="http://id.loc.gov/authorities/sh97003254#concept" title="WebTV (Trademark)" />
    <link rel="http://www.w3.org/2004/02/skos/core#related" href="http://id.loc.gov/authorities/sh92002816#concept" title="Internet" />
    <link rel="http://www.w3.org/2004/02/skos/core#closeMatch" href="http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb13319953j" title="Web" />
    <link rel="alternate" href="http://id.loc.gov/authorities/sh95000541" type="text/html" />
    <link rel="alternate" href="http://id.loc.gov/authorities/sh95000541.json" type="application/json" />
</entry>
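
To check that the Atom really does carry the same statements, here’s a rough sketch of a GRDDL-style transform back into SKOS triples, done in Python with ElementTree and rdflib rather than XSLT. The concept.atom filename is my own placeholder, and the mapping only covers the pieces used above:

from xml.etree import ElementTree as ET
from rdflib import Graph, Literal, Namespace, URIRef

ATOM = "{http://www.w3.org/2005/Atom}"
SKOS_XML = "{http://www.w3.org/2004/02/skos/core#}"
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

entry = ET.parse("concept.atom").getroot()
concept = URIRef(entry.findtext(ATOM + "id"))

g = Graph()
g.bind("skos", SKOS)

# literal properties carried as skos extension elements
g.add((concept, SKOS.prefLabel,
       Literal(entry.findtext(SKOS_XML + "prefLabel"), lang="en")))
for alt in entry.findall(SKOS_XML + "altLabel"):
    g.add((concept, SKOS.altLabel, Literal(alt.text, lang="en")))

# semantic relations carried as typed atom:link elements
for link in entry.findall(ATOM + "link"):
    rel = link.get("rel", "")
    if rel.startswith(str(SKOS)):
        g.add((concept, URIRef(rel), URIRef(link.get("href"))))

print(g.serialize(format="turtle"))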

Maybe I botched something? It could use a GRDDL stylesheet I suppose. At least the Atom validates. I really am a bit conflicted posting any of this here because there is so much about the Linked Data community that I like, and want to be a part of. But I’m finding it increasingly difficult to see a Linked Data future where RDF/XML is deployed all over. Instead I bleakly expect we’ll see more fragmentation, and dueling idioms/cultures … and I’m trying to see if perhaps things aren’t as bleak as they seem by grasping at what the groups have in common. Maybe John Cowan’s idea (in the comments) of coming up with an RDF serialization that is valid Atom wasn’t so bad after all? My apologies to any Linked Data folks who have helped me in the past who may have been rubbed the wrong way by my last blog post.

Update: Sean Palmer clued me in to some earlier work he has done in the area of Atom and RDF, the Atom Extensibility Framework. And Niklas Lindström let me know of some thinking he’s done on the topic that is grounded in some work he has been doing for legal information systems in Sweden.


alien vs predator: www-style

I finally got around to reading Web Services for Recovery.gov by Erik Wilde, Eric Kansa and Raymond Yee. The authors wrote the report with funding from the Sunlight Foundation, who are deeply engaged in improving the way the US Federal Government provides transparent access to its data assets.

I highly recommend giving it a read if you are interested in web services, REST, Linked Data, and simple things you can do to open up access to data. The practicality of the advice is clearly gleaned from the experience of an actual implementation over at recovery.berkeley.edu where they kick the tires on their ideas.

Erik’s blog has a succinct summary of the paper’s findings, which for me boils down to:

any data source that is frequently updated must have simple ways for synchronizing data

Web syndication is a widely deployed mechanism for presenting a list of updated web resources. The authors make a pretty strong case for Atom because of its pervasive use of identifiers for content, extensibility, rich linking semantics, paging, the potential for write-enabled services, install base, and generally just good old Resource Oriented Architecture a.k.a. REST.
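
To make that concrete, this is roughly all a downstream application needs to do to stay in sync with an Atom feed. A sketch using the feedparser library; the feed URL is a made-up placeholder, and a real client would also remember what it had already seen between runs:

import feedparser

FEED_URL = "http://example.gov/recovery/reports/feed.atom"  # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # each entry carries a stable identifier, a timestamp and links,
    # which is enough to tell what has changed since the last poll
    print(entry.id, entry.updated, entry.link)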

Because of my interest in Linked Data the paragraph that discusses why RDF/XML wasn’t chosen as a data format is particularly interesting:

The approach described in this report, driven by a desire for openness and accessibility, uses the most widely established technologies and data formats to ensure that access to reporting data is as easy as possible. Recently, the idea of openly accessible data has been promoted under the term of “linked data”, with recent recommendations being centered around a very specific choice of technologies and data models (all centered around Semantic Web approaches focusing on RDF for data representation and centralized data storage). While it is possible to use these approaches for building Web applications, our recommendation is to use better established and more widely supported technologies, thereby lowering the barrier-to-entry and choosing a simpler toolset for achieving the same goals as with the more sophisticated technologies envisioned for the Semantic Web.

It could be argued that the growing amount of RDF/XML in the Linked Data web makes it a contender for Atom’s install base–especially when you consider RSS 1.0. However I think the main point the authors are making is that the tools for working with XML documents far outnumber the tools that are available for processing RDF/XML graphs. Furthermore, most programmers I know are more familiar with the processing model and standards associated with XML documents (DOM, XSLT, XPath, XQuery) than with RDF graphs (triples, directed graphs, GRDDL, SPARQL). Maybe this says more about the people I know … and if I were to jump into the biomedical field I’d feel differently. But perhaps the most subtle point is that whether or not developers know it, Atom expresses a graph model just like RDF/XML … but it does it in a much more straightforward, familiar document-centric way.

Of course the debate over whether RDF needed to be a part of Linked Data or not rippled through the semantic web community a few months ago–and there’s little chance of resolving any of those issues here. In the pure-land of RDF model theory the choice between Atom and RDF/XML is a bit of a false dilemma, since RDF/XML is minimally processable with, well, XML tools … and idioms like GRDDL allow Atom to be explicitly represented as an RDF graph. And in fact, REST and content negotiation would allow both serializations to co-exist nicely in the context of a single web application. However, I’d argue that this point isn’t a particularly easy thing to explain, and it certainly isn’t terrain that you would want to navigate in documentation on the recovery.gov website. The choice of whether RDF belongs in Linked Data or not has technical and business considerations – but I’m increasingly seeing it as a cultural issue, one that perhaps doesn’t really even need resolving.

Even Tim Berners-Lee recognizes that there are quite large hurdles to modeling all government data on the Linked Data web in RDF and querying it with SPARQL. It’s a bit unrealistic to expect the Federal Government to start modeling and storing their enterprise data in a fundamentally new and somewhat experimental way in order to support what amounts to arbitrary database queries from anyone on the web. If that’s what the Linked Data brand is I’m not buying it. That being said, I see a great deal of value in the RDF data model (the giant global graph), especially as a tool for seeing how your data fits the contours of the web.

The important message that Erik, Eric and Raymond’s paper communicates is that the Federal Government should be focused on putting data out on the web in familiar ways, using sound web architecture practices that have allowed the web to grow and evolve into the wonderful environment it is today. Atom is a flexible, simple, commonly supported, well understood XML format for letting downstream applications know about newly published web resources. If the Federal Government is serious about the long term sustainability of efforts like recovery.gov and data.gov they should focus on enabling an ecosystem of visualization applications created by third parties, rather than trying to produce those applications themselves. I hope the data.gov folks also run across this important work. Thanks to Sunlight Foundation for funding the folks at Berkeley.


hackability

Adam Bosworth has some good advice for would-be standards developers in the form of a 7-item list. It is strangely reassuring to know that someone in the US Federal Government called someone like Adam for advice about standards…even if it was at some inhuman hour. Number 5 really resonated with me:

Always have real implementations that are actually being used as part of [the] design of any standard … And the real implementations should be supportable by a single engineer in a few weeks.

It is interesting because I think it could be argued that #1, #2, #4, #6 and #7 largely fall out from really doing #5.

  • Keep the standard as simple and stupid as possible.
  • The data being exchanged should be human readable and easy to understand.
  • Standards should have precise encodings.
  • Put in hysteresis for the unexpected.
  • Make the spec itself free, public on the web, and include lots of simple examples on the web site.

That leaves #3, “Standards work best when they are focused” – which seems to be a really tricky one to get right. Maybe it comes down to being able to say No to the right things, to keep scope creep at bay. Or to be able to say:

Stop It. Just Stop.

to ward off complexity, like Joel Spolsky’s Duct Tape Programmer. But I think #3 is really about being able to say Yes to the right things. The right things are the things that bind together the most people involved in the effort. Maybe it’s the “rough consensus” in the Tao of the IETF:

We reject kings, presidents and voting. We believe in rough consensus and running code.

Whatever it is, it seems slippery and subtle – a zen-like appreciation for what is best in people, mixed with being in the right place at the right time.


ipres, iipc, pasig roundup/braindump

I spent last week in San Francisco attending 3 back-to-back conferences: the International Conference on Preservation of Digital Objects (iPRES), the International Internet Preservation Consortium (IIPC), and the Sun Preservation and Archiving Special Interest Group (PASIG)…thanks to the Library of Congress and to Kesa Summers for letting me go. Also, thanks to the 3 conferences for deciding to co-locate in San Francisco at the same time, which made this sort of tag-team-digital-preservation-event-week possible. I hadn’t been to iPRES, IIPC or PASIG before, so it was a lot of fun being able to take them all in at once…especially since, given the nature of my group at the Library of Congress, these are my kind of people.

Each event had a different flavor, but the topic under discussion at each was digital preservation. iPRES focused generally on digital preservation, particularly from a research angle. IIPC also had a bit of a research flavor, but focused more specifically on the practicalities of archiving web content. And PASIG was less research oriented, and much more oriented around building/maintaining large scale storage systems. There was so much good content at these events, that it’s kind of impossible to summarize it here. But I thought I would at least attempt to blurrily characterize some of the ideas from the three events that I’m taking back with me.

Forever

Long term digital preservation has many hard problems–so many that I think it is rational to feel somewhat overwhelmed and to some extent even paralyzed. It was important to see other people recognize the big problems of emulation, format characterization/migration, and compression – but continue working on pragmatic solutions, for today. Martha Anderson made the case several times for thinking of digital preservation in terms of 5-10 year windows, instead of forever. The phrase “to get to forever you have to get to 5 years first” got mentioned a few times, but I don’t know who said it first. John Kunze brought up the notion of preservation as a “relay”, where bits are passed along at short intervals–and how digital curation needs to enable these handoffs to happen easily. It came to my attention later that this relay idea is something that Chris Rusbridge wrote about back in 2006.

Access

On a similar note, Martha Anderson indicated that making bits useful today is a key factor that the National Digital Information Infrastructure and Preservation Program (NDIIPP) weighs when making funding decisions. Brewster Kahle in his keynote for IIPC struck a similar note that “preservation is driven by access”. Gary Wright gave an interesting presentation about how the Church of Latter Day Saints had to adjust the Reference Model for Open Archival Information System (OAIS) to give thousands of concurrent users access to its archive of 3.1 billion genealogical image records. Jennifer Waxman was kind enough to give me a pointer to some work Paul Conway has done on this topic of access-driven preservation. The topic of access in digital preservation is important to me, because I work in a digital preservation group at the Library of Congress, working primarily on access applications. We’ve had a series of pretty intense debates about the role of access in digital preservation … so it was good to hear the topic come up in San Francisco. In a world where Lots of Copies Keeps Stuff Safe, access to copies is pretty important.

Less is More (More or Less)

Over the week I got several opportunities to hear details from John Kunze, Stephen Abrams, and Margaret Low about the California Digital Library’s notion of curation micro-services, and how they enable digital preservation efforts at CDL. Several folks in my group at LC have been taking a close look at the CDL specifications recently, so getting to hear about the specs, and even see some implementation demos from Margaret, was really quite awesome. The specs are interesting to me because they seem to be oriented around the fact that our digital objects ultimately reside on some sort of hierarchical filesystem. Filesystem APIs are fairly ubiquitous. In fact, as David Rosenthal has pointed out, some filesystems are even designed to resist change. As Kunze said at PASIG in his talk Permanent Objects, Evolving Services, and Disposable Systems: An Emergent Approach to Digital Curation Infrastructure:

What is the thinnest smear of functionality that we can add to the filesystem so that it can act as an object storage system?

Approaches to building digital repository software thus far have been primarily aimed at software stacks (dspace, fedora, eprints) which offer particular services, or service frameworks. But the reality is that these systems come and go, and we are left with the bits. Why don’t we try to get the bits in shape so that they can be handed off easily in the relay from application to application, filesystem to filesystem? What is nice about the micro-services approach is that:

  • The services are compose-able, allowing digital curation activities to be emergent, rather than imposed by a pre-defined software architecture. Since I’ve been on a bit of a functional programming kick lately, I see compose-ability as a pretty big win (see the toy sketch after this list).
  • The services are defined by short specifications, not software–so they are ideas instead of implementations. The specifications are clearly guided by ease of implementation, but ultimately they could be implemented in a variety of languages, and tools. Having a 2-3 page spec that defines a piece of functionality, and can be read by a variety of people, and implemented by different groups seems to be an ideal situation to strive for.
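
To illustrate what I mean by compose-ability, here’s a toy of my own (not one of the CDL micro-service specs): small, independent functions over a plain directory tree that can be snapped together into a simple curation workflow, with no repository stack in the way.

import hashlib
import os

def walk_files(root):
    """Yield every regular file under a directory tree."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def fixity(path, algorithm="sha256"):
    """Return (path, hexdigest) so the result can feed another service."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return path, h.hexdigest()

def write_inventory(root, out="manifest-sha256.txt"):
    """Compose the two functions above into a simple manifest writer."""
    with open(out, "w") as manifest:
        for path, digest in (fixity(p) for p in walk_files(root)):
            manifest.write("%s  %s\n" % (digest, os.path.relpath(path, root)))

if __name__ == "__main__":
    write_inventory(".")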

Everything Else Is Miscellaneous

Like I said, there was a ton of good content over the week…and it seems somewhat foolhardy to try to summarize it all in a single blog post. I tried to summarize the main themes I took home with me on the plane back to DC…but there were also lots of nuggets of ideas that came up in conversation, and in presentations that I want to at least jot down:

  • While archival storage may not be best served by HDFS, jobs like virus scanning huge web crawls are well suited to distributed computing environments like Hadoop. We need to be able to operate at this scale at loc.gov.
  • In Cliff Lynch’s summary wrap up for PASIG he indicated that people don’t talk so much about what we do when the inevitable happens, and bits are lost. The digital preservation community needs to share more statistics on bit loss, system failure modes, and software design patterns that let us build more sustainable storage systems.
  • Dave Tarrant’s presentation on Where the Semantic Web and Web 2.0 meet format risk management: P2 registry was a welcome revelation about the intersection of my interest in linked data and digital preservation. His presentation of the PRONOM format registry as linked data, and Kevin De Vorsey’s talk about Obsolescence, Risk Management, and Preservation Planning at the National Library of New Zealand made me think that it might be interesting to explore how the LC’s Digital Formats website could be delivered as linked data, and linked to something like PRONOM. David Pearson also suggested that collaborative wiki-spaces could be used by digital format specialists to collect information…which got me thinking of how a semantic media wiki instance could be used in conjunction with Tarrant’s ideas. How easy would it be to use the web to build a distributed network of preservation information, as opposed to some p2p solution?
  • I want to learn more about the (w)arc data format, and perhaps contribute to some of the existing code bases for working w/ (w)arc. I’m particularly interested in using harvesting tools and WARC to preserve linked data…which I believe some of the Sindice folks have worked on for their bot.
  • It’s long since time I understood how LOCKSS works as a technology. It was mentioned as the backbone of several projects during the week. I even overheard some talk about establishing rogue LOCKSS networks, which of course piqued my interest even more.
  • It would be fun to put a jython or jruby web front end on DROID for format identification, but it seems that Carol Chou of the Florida Center for Library Automation has already done something similar. Still, it would be neat to at least try it out, and perhaps have it conneg to Dave’s P2 registry or PRONOM.

Ok, braindump complete. Thanks for reading this far!


oai-pmh and xmpp

As an experiment to learn more about xmpp I created a little utility that will poll an oai-pmh server and send new records as a chunk of xml over xmpp. The idea wasn’t necessarily to see all the xml coming into my jabber client (although you can do that). I wanted to enable downstream applications to have records pushed to them, instead of having to constantly poll for updates. So you could write a client that archives away metadata and potentially articles as they are found, or a current awareness tool that listens for articles that match a particular user’s research profile, etc…

Here’s how you start it up:

oai2xmpp.py http://www.doaj.org/oai.article from@example.com to@example.org

which would poll the Directory of Open Access Journals for new articles every 10 minutes, and send them via xmpp to to@example.org. You can adjust the poll interval, and limit to records within a particular set, with the --pollinterval and --set options, e.g.:

oai2xmpp.py http://export.arxiv.org/oai2 currents@jabber.org ehs@jabber.org --set cs --pollinterval 86400

It’s a one-file Python hack in the spirit of Thom Hickey’s 2PageOAI that has a few dependencies documented in the file (lxml, xmpppy, httplib2). I’ve run it for about a week against DOAJ and arxiv.org without incident (it does respect 503 HTTP status codes telling it to slow down). You can find it here.
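
For the curious, the core of the idea is just a polling loop around the OAI-PMH ListRecords verb. Here’s a rough sketch of that half (not the actual oai2xmpp.py code); it hands each record to a send() callable where the real tool uses an xmpppy connection, and it skips resumption tokens and the 503 back-off mentioned above:

import time
import urllib.parse
import urllib.request
from xml.etree import ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(base_url, from_date, prefix="oai_dc"):
    """Fetch one page of ListRecords results as ElementTree elements."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "from": from_date}
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return tree.findall(".//" + OAI_NS + "record")

def poll(base_url, send, interval=600):
    """Poll forever, pushing each newly found record through send()."""
    last = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    while True:
        time.sleep(interval)
        for record in list_records(base_url, last):
            send(ET.tostring(record, encoding="unicode"))
        last = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())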

If you try it out, have any improvements, or ideas let me know.


Think of Things

Pooh began to feel a little more comfortable, because when you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.

The Complete Tales of Winnie-the-Pooh p. 266

This page sent through the RDFa Distiller yields some quotation data marked up with the Bibliographic Ontology, Dublin Core and FOAF vocabularies:

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://covers.openlibrary.org/b/olid/OL619435M-M.jpg> foaf:depicts <http://openlibrary.org/b/OL619435M> .

<http://inkdroid.org/journal/2009/09/16/think-of-things/#thingQuote> a bibo:Quote ;
     dct:source <http://openlibrary.org/b/OL619435M> ;  
     bibo:content """Pooh began to feel a little more comfortable, because when you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it."""@en ;
     bibo:pages "266"@en .