notes on retooling libraries

If you work in the digital preservation field and haven’t seen Dorothea Salo’s Retooling Libraries for the Data Challenge in the latest issue of Ariadne, definitely give it a read. Dorothea takes an unflinching look at the scope and characteristics of data assets currently being generated by scholarly research, and at how equipped traditional digital library efforts are to deal with them. I haven’t seen so many of the issues I’ve had to deal with (largely unconsciously) as part of my daily work summarized so neatly before. Having them laid out in such a lucid, insightful and succinct way is really refreshing–and inspiring.

In the section on Taylorist Production Processes, Dorothea makes a really good point that libraries have tended to optimize workflows for particular types of materials (e.g. photographs, books, journal articles, maps). When materials require a great deal of specialized knowledge to deal with, and the tools that are used to manage and provide access to the content are similarly specialized, it’s not hard to understand why this balkanization has happened. On occasion I’ve heard folks (/me points finger at self) bemoan the launch of a new website as the creation of “yet another silo”. But the technical silos that we perceive are largely artifacts of the social silos we work in: archives, special collections, rare books, maps, etc. They are the collections we break our libraries up into, and the projects that mirror those collections. We need to work better together before we can build common digital preservation tools. To paraphrase something David Brunton has said to me before: we need to think of our collections more as sets of things that can be rearranged at will, with ease and even impunity. In fact the architecture of the Web (and each website on it) is all about doing that.

Even though it can be tough (particularly in large organizations) I think we can in fact achieve some levels of common tooling (in areas like storage and auditing); but we must admit (to ourselves at least) that some levels of access will likely always be specialized in terms of technical infrastructure and user interface requirements:

Some, though not all, data can be shoehorned into a digital library not optimised for them, but only at the cost of the affordances surrounding them. Consider data over which a complex Web-based interaction environment has been built. The data can be removed from the environment for preservation, but only at the cost of loss of the specialised interactions that make the data valuable to begin with. If the dataset can be browsed via the Web interface, a static Web snapshot becomes possible, but it too will lack sophisticated interaction. If the digital library takes on the not inconsiderable job of recreating the entire environment, it is committing to rewriting interaction code over and over again indefinitely as computing environments change.

Dorothea’s statement about committing to rewriting interaction code over and over again is important. I’m a software developer, and a web developer to boot – so there’s nothing I like more than yanking the data out of some encrusted old app and creating it afresh using the web-framework-du-jour. But in my heart of hearts I know that while this may work for large collections of homogeneous data, it doesn’t scale very well for a vast sea of heterogeneous data. However, all is not lost. As the JISC are fond of saying:

The coolest thing to do with your data will be thought of by someone else.

So why don’t we data archivists get out of the business of building the “interaction code”? Maybe our primary service should be to act as data wholesalers who collect the data and make it available in bulk to those who do want to build access layers on top of it. Let’s make our data easy for other people to use (with clear licensing) and reference (with web identifiers) so that they can annotate it, and we can pull back those annotations and views. In a way this is kind of hearkening back to the idea of Data Providers and Service Providers that was talked about a lot in the context of OAI-PMH. But in this case we’d be making the objects available as well as the metadata that describes them, similar to the use cases around OAI-ORE. I got a chance to chat with Kate Zwaard of the GPO at CurateCamp a few weeks ago, and learned how the new Federal Register is a presentation application for raw XML data being made available by the GPO. Part of the challenge is making these flows of data public, and giving credit where credit is due – not only to the creators of the shiny site you see, but to the folks behind the scenes who make it possible.

Another part of Dorothea’s essay that stuck out a bit for me was the advice to split ingest, storage and access systems.

Ingest, storage, and end-user interfaces should be as loosely coupled as possible. Ideally, the same storage pool should be available to as many ingest mechanisms as researchers and their technology staff can dream up, and the items within should be usable within as many reuse, remix, and re-evaluation environments as the Web can produce.

This is something we (myself and other folks at LC) did as part of the tooling to support the National Digital Newspaper Program. Our initial stab at the software architecture was to use Fedora to manage the full life cycle (from ingest, to storage, to access) of the newspaper content we receive from program awardees around the US. The only trouble was that we wanted the access system to support heavy use by researchers and also robots (Google, Yahoo, Microsoft, etc) building their own views on the content. Unfortunately, the way we had put the pieces together meant we couldn’t support that. Increasingly we found ourselves working around Fedora as much as possible to squeeze a bit more performance out of the system.

So in the end we (and by we I mean David) decided to bite the bullet and split off the inventory systems keeping track of where received content lives (what storage systems, etc) from the access systems that delivered content on the Web. Ultimately this meant we could leverage industry-proven web development tools to deliver the newspaper content…which was a huge win. Now that’s not to say that Fedora can’t be used to provide access to content. I think the problems we experienced may well have been the result of our use of Fedora, rather than Fedora itself. Having to do multiple, large XSLT transforms on source XML files to render a page is painful. While it’s a bit of a truism, a good software developer tries to pick the right tool for the job. Half the battle there is deciding on the right granularity for the job … the single job we were trying to solve with Fedora (preservation and access) was too big for us to do either right.

Having a system that’s decomposable, like the approach that CDL is taking with Microservices, is essential for long-term thinking about software in the context of digital preservation. I guess you could say “there’s no there there” with Microservices, since there’s not really a system to download–but in a way that’s kind of the point.

I guess this is just a long way of saying, Thanks Dorothea! :-)


top hosts referenced in english wikipedia

I’ve recently been experimenting a bit to provide some tools to allow libraries, archives and museums to see how Wikipedians are using their content as primary source material. I didn’t actually anticipate the interest in having a specialized tool like linkypedia to monitor who is using your institution’s content on Wikipedia. So the demo site is having some scaling problems–not the least of which is the feeble VM that it is running on. That’s why I wanted to make the code available for other people to run where it made sense, at least until I’ve had some time to think through how to scale it better.

Anyhow, I wanted to get a handle on just how many external links there are in the full snapshot of English Wikipedia. A month or so ago Jakob Voss pointed me at the External Links SQL dump over at Wikipedia as a possible way to circumvent heavy use of Wikipedia’s API, by providing a baseline to update against. So I thought to myself that I could just suck this down, import it into MySQL, and run some analysis on that to see how many links there were and what sorts of host name concentrations existed.

Sucking down the file didn’t take too long. But the MySQL import of the dump had been running for about 24 hours (on my laptop) before I killed it. On a hunch I peeked into the 4.5G SQL file and noticed that the table had several indexes defined. So I went through some contortions with csplit to remove the indexes from the DDL, and lo and behold it loaded in something like 20 minutes. Then I wrote some Python to query the database, get each external link URL, extract the hostname from the URL, and write it out through a unix pipeline to count up the unique hostnames:

./hosts.py | sort -S 1g | uniq -c | sort -rn > enwiki-externallinks-hostnames.txt
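
A minimal sketch of what hosts.py might look like is below. This is a reconstruction for illustration, assuming the dump was loaded into a local MySQL database named enwiki, that the link targets live in the externallinks table’s el_to column (check the DDL in your copy of the dump), and that the MySQLdb module is installed:

#!/usr/bin/env python
# hosts.py (sketch): print the hostname of every external link, one per
# line, so the sort | uniq -c pipeline above can do the counting.
# The database and column names here are assumptions; adjust to the dump.

import sys
import MySQLdb
from urlparse import urlparse

db = MySQLdb.connect(db="enwiki", user="root", passwd="")
cursor = db.cursor()

# for the full 30M+ rows a server side cursor (MySQLdb.cursors.SSCursor)
# would be easier on laptop memory
cursor.execute("SELECT el_to FROM externallinks")

row = cursor.fetchone()
while row:
    host = urlparse(row[0]).netloc
    if host:
        sys.stdout.write(host + "\n")
    row = cursor.fetchone()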

That sort | uniq -c pipeline is a little unix trick my old boss Fred Lindberg taught me years ago, and it still works remarkably well: 30,127,734 URLs were sorted into 2,162,790 unique domains in another 20 minutes or so. If you are curious the full output is available here. The number 1 host was toolserver.org with 3,169,993 links. This wasn’t too surprising since it is a hostname heavily used by wikipedians as they go about their business. Next is www.google.com at 2,117,967 links, most of which appeared to be canned searches. This wasn’t terribly exciting either. So I removed toolserver.org and www.google.com (so as not to visually skew things too much), and charted the rest of the top 100:

[Chart: top 100 hostnames in English Wikipedia external links, excluding toolserver.org and www.google.com]

I figured that could be of some interest to somebody, sometime. I didn’t find similar current stats available anywhere on the web, but if you know of them please let me know. The high rankings of www.ncbi.nlm.nih.gov and dx.doi.org were pleasant surprises. I did a little superficial digging and found some fascinating bots like Citation Bot and ProteinBoxBot, which seem to trawl external article databases looking for appropriate Wikipedia pages to add links to. Kind of amazing.


version control and digital curation

For some time now I have been meaning to write about some of the issues around version control in repositories, as they relate to some projects going on at $work. Most repository systems have a requirement to maintain original data as submitted. But as we all know this content often changes over time–sometimes immediately. Change is in the very nature of digital preservation, as archaic formats are migrated to fresher, more usable ones, and the wheels of movage keep turning. At the same time it’s essential that when content is used and cited, the citation points to the specific version in time, or as Cliff Lynch is quoted as saying in the Attributes of a Trusted Repository:

It is very easy to replace an electronic dataset with an updated copy, and … the replacement can have wide-reaching effects. The processes of authorship … produce different versions which in an electronic environment can easily go into broad circulation; if each draft is not carefully labeled and dated it is difficult to tell which draft one is looking at or whether one has the “final” version of a work.

For example at $work we have a pilot project to process journal content submitted by publishers. We don't have the luxury of telling them exactly what content to submit (packaging formats, xml schemas, formats, etc), but on receipt we want to normalize the content for downstream applications by bagging it up, and then extracting some of the content into a standard location.

So a publisher makes an issue of a journal available for pickup via FTP:

adi_v2_i2
|-- adi_015.pdf
|-- adi_015.xml
|-- adi_018.pdf
|-- adi_018.xml
|-- adi_019.pdf
|-- adi_019.xml
|-- adi_v2_i2_ofc.pdf
|-- adi_v2_i2_toc.pdf
`-- adi_v2_i2_toc.xml

The first thing we do is bag up the content, to capture what was retrieved (manifest + checksums) and stash away some metadata about where it came from.

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   `-- adi_v2_i2_toc.xml
`-- manifest-md5.txt

Next we run some software that knows about the particularities of this publisher's content, and persist it into the bag in a predictable, normalized way:

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   |-- adi_v2_i2_toc.xml
|   `-- eco
|       `-- articles
|           |-- 1
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           |-- 2
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           `-- 3
|               |-- article.pdf
|               |-- meta.json
|               `-- ocr.xml
`-- manifest-md5.txt

The point of this post isn't the particular way these changes were layered into the filesystem--it could be done in a multitude of other ways (mets, oai-ore, mpeg21-didl, foxml, etc). The point is rather that the data has undergone two transformations very soon after it was obtained. And if we are successful, it will undergo many more as it is preserved over time. The advantage of layering content like this into the filesystem is that it makes some attempt to preserve the data by disentangling it from the code that operates on it, which will hopefully be swapped out at some point. At the same time we want to be able to get back to the original data, and to also possibly roll back changes which inadvertently corrupted or destroyed portions of the data. Shit happens.

So the way I see it we have three options:

  • Ignore the problem and hope it will go away.
  • Create full copies of the content, and link them together.
  • Version the content.

While tongue in cheek, option 1 is sometimes reality. Perhaps your repository content is such that it never changes. Or if it does, it happens in systems that are downstream from your repository. The overhead of managing versions isn't something that you need to worry about...yet.
I've been told that DSpace doesn't currently handle versioning, and that they are looking to the upcoming Fedora integration to provide versioning options.

In option 2 version control is achieved by copying the files that make up a repository object, making modifications to the relevant files, and then linking this new copy to the older copy. We currently do this in our internal handling of the content in the National Digital Newspaper Program, where we work at what we call the batch level, which is essentially a directory of content shipped to the Library of Congress, containing a bunch of jp2, xml, pdf and tiff files. When we have accepted a batch, and later have discovered a problem, we then version the batch by creating a copy of the directory. When service copies of the batches are 50G or so, these full copies can add up. Sometimes just to make a few small modifications to some of the XML files we end up having what appears to be largely redundant copies of multi-gigabyte directories. Disk is cheap as they say, but it makes me feel a bit dirty.

Option 3 is to leverage some sort of version control system. One of the criteria I have for a version control system for data is that versioned content should not be dependent on a remote server. The reason for this is I want to be able to push the versioned content around to other systems, tape backup, etc. and not have it become stale because some server configuration happened to change at some point. So in my mind, subversion and cvs are out. Revision control systems like rcs, git, mercurial and bazaar on the other hand are possibilities. Some of the nice things about using git or mercurial:

  • they are very active projects, so there are lots of eyes on the code fixing bugs and making enhancements
  • the content repositories are distributed, so they allow content to migrate into other contexts
  • developers who come to inherit content will be familiar with using them
  • you get audit logs for free
  • they include functionality to publish content on the web, and to copy or clone repository content

But there are trade-offs with using git, mercurial, etc. since these are quite complex pieces of software. Changes in them sometimes trigger backwards incompatible changes in repository formats--which require upgrading the repository. Fortunately there are tools for these upgrades, but one must stay on top of them. These systems also have a tendency to see periods of active use, followed by an exodus to a newer, shinier version control system. But there are usually tools to convert your repositories from the old to the new.

To anyone familiar with digital preservation this situation should seem pretty familiar: format migration, or more generally data migration.

The California Digital Library have recognized the need for version control, but have taken a slightly different approach.
CDL created a specification for a simplified revision control system called Reverse Directory Deltas (ReDD), and built an implementation into their Storage Service. The advantage that ReDD has is that the repository format is quite a bit more transparent, and less code dependent, than other repository formats like git, mercurial, etc. It is also a specification that CDL manages, and can change at will, rather than being at the mercy of an external opensource project. As an exercise about a year ago I wanted to see how easy it was to use ReDD by implementing it. I ended up calling the tool dflat, since ReDD fits in with CDL's filesystem convention, the D-flat specification.

In the pilot project I mentioned above I started out wanting to use ReDD. I didn't really want to use all of D-flat, because it did a bit more than I needed, and thought I could refactor the ReDD bits of my dflat code into a ReDD library, and then use it to version the journal content. But in looking at the code a year later I found myself wondering about the trade-offs again. Mostly I wondered about the bugs that might be lurking in the code, which nobody else was really using. And I thought about new developers that might work on the project, and who wouldn't know what the heck ReDD was and why I wasn't using something more common.

So I decided to be lazy, and use mercurial instead.

I figured this was a pilot project, so what better way to get a handle on the issues of using distributed revision control for managing data? The project was using the Python programming language extensively, and Mercurial is written in Python -- so it seemed like a logical choice. I also began to think of the problem of upgrading software and repository formats as another facet of the format migration problem...which we wanted to have a story for anyway.

I wanted to start a conversation about some of these issues at Curate Camp, so I wrote a pretty simplistic tool that compares git, mercurial and dflat/redd in terms of the time it takes to initialize a repository, and commit changes to an arbitrary set of content in a directory called 'data'. For example I ran it against a gigabyte of mp3 content, and it generates the following output:

ed@curry:~/Projects$ git clone git://github.com/edsu/versioning-metrics.git
ed@curry:~/Projects/versioning-metrics$ cp ~/Music data
ed@curry:~/Projects/versioning-metrics$ ./compare.py
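
To make the Mercurial option a bit more concrete, here is a rough sketch (not the actual pilot code) of how a publisher delivery like the one above could be bagged up and put under version control. It assumes the bagit Python library and the hg command line tool are available; the directory name and bag-info values are made up for illustration:

# bag up a publisher delivery and version it with mercurial (sketch)
import subprocess
import bagit

issue_dir = "adi_v2_i2"

# turn the delivery into a bag: adds bagit.txt, bag-info.txt and
# manifest-md5.txt, and moves the original files under data/
bagit.make_bag(issue_dir, {"Source-Organization": "Example Publisher"})

# initialize a mercurial repository and commit the initial state
subprocess.check_call(["hg", "init"], cwd=issue_dir)
subprocess.check_call(["hg", "addremove"], cwd=issue_dir)
subprocess.check_call(["hg", "commit", "-u", "ingest", "-m",
                       "bagged delivery as received"], cwd=issue_dir)

# ... later, after normalizing content into data/eco/articles/ ...
subprocess.check_call(["hg", "addremove"], cwd=issue_dir)
subprocess.check_call(["hg", "commit", "-u", "ingest", "-m",
                       "added normalized articles"], cwd=issue_dir)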


archival context on the web

On Mark’s advice I attended Moving Forward With Authority, a Society of American Archivists pre-conference focused on the role of authority control in archival descriptions. There were lots of presentations/discussions over the day, but by and large most of them revolved around the recent release of the Encoded Archival Context: Corporate Bodies, Persons and Families (EAC-CPF) XML schema. Work on EAC-CPF began in 2001 with the Toronto Tenets, which articulated the need for encoding not only the contents of archival finding aids (w/ Encoded Archival Description), but also the context (people, families, organizations) that the finding aids reference:

Archival context information consists of information describing the circumstances under which records (defined broadly here to include personal papers and records of organizations) have been created and used. This context includes the identification and characteristics of the persons, organizations, and families who have been the creators, users, or subjects of records, as well as the relationships amongst them.
Toronto Tenets

I’ll admit I’m jaded. So much XML has gone under the bridge that it’s hard for me to get terribly excited (anymore) about yet-another-xml-schema. Yes, structured data is good. Yes, encouraging people to make their data available in similar ways is important. But for me to get hooked I need a story for how this structured data is going to live, and be used, on the Web. This is the main reason I found Daniel Pitti’s talk about the Social Networks of Archival Context (SNAC) to be so exciting.

SNAC is an NEH (and possibly IMLS soon) funded project of the University of Virginia, the California Digital Library, and the Berkeley School of Information. The project’s general goal is to:

… unlock descriptions of people from descriptions of their records and link them together in exciting new ways.

Where “descriptions of their records” are EAD XML documents, and “linking them together” means exposing the entities buried in finding aids (people, organizations, etc), assigning identifiers to them, and linking them together on the web. I guess I might be reading what I want a bit into the goal based on the presentation, and my interest in Linked Data. If you are interested, more accurate information about the project can be found in the NEH Proposal that this quote came from.

Even though SNAC was only very recently funded, Daniel was already able to demonstrate a prototype application that he and Brian Tingle worked on. If I’m remembering this right, Daniel basically got hold of the EAD finding aids from the Library of Congress, extracted contextual information (people, families, corporate names) from the relevant EAD elements, and serialized these facts as EAC-CPF documents. Brian then imported the documents using an extension to the eXtensible Text Framework (XTF), which allowed XTF to be EAC-CPF aware.

The end result is a web application that lets you view distinct web pages for individuals mentioned in the archival material. For example, here’s one for John von Neumann.

All those names in the list on the right are themselves hyperlinks which take you to that person’s page. If you were able to scroll down (the prototype hasn’t been formally launched yet) you could see links to corporate names like Los Alamos Scientific Laboratory, The Institute for Advanced Study, US Atomic Energy Commission, etc. You would also see a Resources section that lists related finding aids and books, such as the John von Neumann Papers finding aid at the Library of Congress.

I don’t think I was the only one in the audience to immediately see the utility of this. In fact it is territory well trodden by OCLC and the other libraries involved in the VIAF project, which essentially creates web pages for authority records for people like John von Neumann who have written books. It’s also similar to what People Australia, BibApp and VIVO are doing to establish richly linked public pages for people. As Daniel pointed out: archives, libraries and museums do a lot of things differently; but ultimately they all have a deep and abiding interest in the intellectual output and artifacts created by people. So maybe this is an area where we can see more collaboration across the divides between cultural heritage institutions. The activity of putting EAD documents, and their transformed HTML cousins, on the web is important. But for them to be more useful they need to be contextualized in the web itself using applications like this SNAC prototype.

I immediately found myself wondering if the URL say for this SNAC view of John von Neumann could be considered an identifier for the EAC-CPF record. And what if the HTML contained the structured EAC-CPF data using xml+xslt, a microformat, rdfa or a link rel pointing at an external XML document? Notice I said an instead of the identifier. If something like EAC-CPF is going to catch on lots of archives would need to start generating (and publishing) them. Inevitably there would be duplication: e.g. multiple institutions with their own notion of John von Neumann. I think this would be a good problem to have, and that having web resolvable identifiers for these records would allow them to be knitted together. It would also allow hubs of EAC-CPF to bubble up, rather than requiring some single institution to serve as the master database (as in the case of VIAF).

A few things that would be nice to see for EAC-CPF would be:

  • Instructions on how to link EAD documents to EAC-CPF documents.
  • Recommendations on how to make EAC-CPF data available on the Web.
  • A wikipage or some low-cost page for letting people share where they are publishing EAC-CPF.

In addition I think it would be cool to see:

Anyhow I just wanted to take a moment to say how exciting it is to see the stuff hiding in finding aids making it out onto the Web, with URLs for resources like People, Corporations and Families. I hope to see more as the SNAC project integrates with existing name authority files (work that Ray Larson at Berkeley is going to be doing), and imports finding aids from more institutions with different EAD encoding practices.


federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936 in the depths of the Great Depression as a way to communicate in one place all that the agencies were doing to try to jump start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes spent browsing the new site and comparing it to the old one makes it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down the silo walls, and that every well designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one off’ dedicated web applications.

As a software developer working in the federal government there were a few things about the Federal Register 2.0 that I found really exciting:

  • Fruitful collaboration between federal employees and citizen activist/geeks initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, ical feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for search related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can be discovered via links (see below about permalinks). They also have made a sitemap available for crawlers to efficiently discover URLs for the content (see the sketch just after this list).
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leapfrog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register in social media sites like Facebook, Twitter and out on the open web in blogs, and in collaborative applications like Wikipedia.
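
As a small illustration of the crawlability point in the robots.txt item above, here is a quick check one could run with the Python standard library’s robotparser module; the article and search URLs are made-up examples rather than real Federal Register 2.0 paths:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.federalregister.gov/robots.txt")
rp.read()

# article pages should be crawlable, dynamic search results should not be
print rp.can_fetch("*", "http://www.federalregister.gov/articles/example")
print rp.can_fetch("*", "http://www.federalregister.gov/articles/search?q=example")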

I would like to see more bulk access to XML data made available for re-purposing on other websites–although I guess one might be able to walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations. Perhaps they could even follow the lead of the London Gazette and try to make some of the structured metadata available in the HTML using RDFa. It also looks like content is only available from 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.

But the great thing about what these folks have done is now I can fork the project on github, see how easy it is to add the changes, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts–so hats off to everyone who helped make this happen.


linking things and common sense

Tom Scott’s recent Linking Things post got me jotting down what I’ve been thinking lately about URIs, Linked Data and the Web. First go read Tom’s post if you haven’t already. He does a really nice job of setting the stage for why people care about using distinct URIs (web identifiers) for identifying web documents (aka information resources) and real world things (aka non-information resources). Tom’s opinions are grounded in the experience of really putting these ideas into practice at the BBC. His key point, which he attributes to Michael Smethurst, is that:

Some people will tell you that the whole non-information resource thing isn’t necessary – we have a web of documents and we just don’t need to worry about URIs for non-information resources; others will claim that everything is a thing and so every URL is, in effect, a non-information resource.

Michael, however, recently made a very good point (as usual): all the interesting assertions are about real world things not documents. The only metadata, the only assertions people talk about when it comes to documents are relatively boring: author, publication date, copyright details etc.

If this is the case then perhaps we should focus on using RDF to describe real world things, and not the documents about those things.

I think this is an important observation, but I don’t really agree with the conclusion. I would conclude instead that the distinction between real world and document URIs is a non-issue. We should be able to tell if the thing being described is a document or a real world thing based on the vocabulary terms that are being used.

For example, if I assert:

<http://en.wikipedia.org/wiki/William_Shakespeare> a foaf:Person ; foaf:name "William Shakespeare" .

Isn’t it reasonable to assume http://en.wikipedia.org/wiki/William_Shakespeare identifies a person whose name is William Shakespeare? I don’t have to try to resolve the URL and see if I get a 303 or 200 response code do I?

And if I also assert,

<http://en.wikipedia.org/wiki/William_Shakespeare> dcterms:modified "2010-06-28T17:02:41-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

can’t I assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that was modified on 2010-06-28T17:02:41? Does it really make sense to think that the person William Shakespeare was modified then? Not really…

Similarly if I said,

<http://en.wikipedia.org/wiki/William_Shakespeare> cc:license <http://creativecommons.org/licenses/by-sa/3.0/> .

Isn’t it reasonable to assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that is licensed with the Attribution-ShareAlike 3.0 Unported license? It doesn’t really make sense to say that the person William Shakespeare is licensed with Attribution-ShareAlike 3.0 Unported does it? Not really…
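
To put that common sense into code, here is a minimal sketch using rdflib of letting the vocabulary terms suggest whether a URI is being described as a person or as a document. The lists of person-ish and document-ish terms are hand picked and obviously incomplete:

# guess whether a URI is being described as a person or a document by
# looking at the vocabulary terms used in assertions about it
from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DCTERMS = Namespace("http://purl.org/dc/terms/")
CC = Namespace("http://creativecommons.org/ns#")

PERSON_TERMS = set([FOAF["Person"], FOAF["name"]])
DOCUMENT_TERMS = set([DCTERMS["modified"], CC["license"]])

def describes(graph, uri):
    terms = set()
    for s, p, o in graph.triples((uri, None, None)):
        terms.add(p)
        terms.add(o)  # picks up rdf:type values like foaf:Person
    if terms & PERSON_TERMS and terms & DOCUMENT_TERMS:
        return "both? the vocabulary (or the data) is being sloppy"
    elif terms & PERSON_TERMS:
        return "a person"
    elif terms & DOCUMENT_TERMS:
        return "a document"
    return "no idea"

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://en.wikipedia.org/wiki/William_Shakespeare> a foaf:Person ;
    foaf:name "William Shakespeare" .
""", format="n3")

print describes(g, URIRef("http://en.wikipedia.org/wiki/William_Shakespeare"))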

Why does the Linked Data community lean on using identifiers to do this common sense work? Well, largely because people argued about it for three years and this is the resolution the W3C came to. In general I like the REST approach of saying a URL identifies a Resource, and that when you resolve one you get back a Representation (a document of some kind, html, rdf/xml, whatever). Why does it have to be more complicated than that?

If it’s not clear if an assertion is about a document or a thing, why isn’t that a problem with the vocabulary in use being underspecified and vague? I believe this is essentially the point that Xiaoshu Wang made three years ago in his paper URI Identity and Web Architecture Revisited.

To get back to Tom’s point, I agree that the really interesting assertions in Linked Data are about things, and their relations, or as Richard Rorty said a bit more expansively:

There is nothing to be known about anything except an initially large, and forever expandable, web of relations to other things. Everything that can serve as a term of relation can be dissolved into another set of relations, and so on for ever. There are, so to speak, relations all the way down, all the way up, and all the way out in every direction: you never reach something which is not just one more nexus of relations.

Philosophy and Social Hope, pp 53-54.

But assertions about a document, albeit a bit more on the dry side, are also useful and important: who created the web document, when they created it, a license associated with the document, its relation to previous versions, etc. As a software developer working in a library I’m actually really interested in this sort of administrivia. In fact the Open Archives Initiative Object Reuse and Exchange vocabulary, and the Memento effort, are largely about relating web documents together in meaningful and useful ways: to be able to harvest compound objects out of the web, and to navigate between versions of web documents. Heck, the Dublin Core vocabulary started out as an effort to describe networked resources (essentially documents), and the gist of the Dublin Core Metadata Terms retains much of this flavor. So I think RDF is also important for describing documents on the web, or (more accurately) representations.

So, in short:

  1. URLs identify resources.
  2. A resource can be anything.
  3. When you resolve a URL you get a representation of that resource.
  4. If a representation is some sort of flavor of RDF, the semantics of an RDF vocabulary should make it clear what is being described.
  5. If it’s not clear, maybe the vocabulary sucks.

I think this is basically the point that Harry Halpin and Pat Hayes were making in their paper In Defense of Ambiguity. A URL has a dual role: it identifies resources, and it allows us to access representations of resources. This ambiguity is the source of its great utility, expressiveness and power. It’s why we see URLs on the sides of buses and buildings. It’s why a QR Code slapped on some real world thing has a URL embedded in it.

In an ideal world (where people agreed with Xiaoshu, Harry and Pat) I don’t think this would mean we would have to redo all the Linked Data that we have already. I think it just means that publishers who want the granularity of distinguishing between real world things and documents at the identifier level can have it. It would also mean that the Linked Data space can accommodate RESTafarians, and other mere mortals who don’t want to ponder whether their resources are information resources or not. And, of course, it would mean we could use a URL like http://en.wikipedia.org/wiki/William_Shakespeare to identify William Shakespeare in our RDF data …

Wouldn’t that be nice?


scoping intertwingularity

Dan Brickley’s recent post to the public-lod discussion list about the future of RDF is one of the best articulations of why I appreciate the practice of linking data:

And why would anyone care to get all this semi-related, messy Web data? Because problems don’t come nicely scoped and packaged into cleanly distinct domains. Whenever you try to solve one problem, it borders on a dozen others that are a higher priority for people elsewhere. You think you’re working with ‘events’ data but find yourself with information describing musicians; you think you’re describing musicians, but find yourself describing digital images; you think you’re describing digital images, but find yourself describing geographic locations; you think you’re building a database of geographic locations, and find yourself modeling the opening hours of the businesses based at those locations. To a poet or idealist, these interconnections might be beautiful or inspiring; to a project manager or product manager, they are as likely to be terrifying.

Any practical project at some point needs to be able to say “Enough with all this intertwingularity! this is our bit of the problem space, and forget the rest for now”. In those terms, a linked Web of RDF data provides a kind of safety valve. By dropping in identifiers that link to a big pile of other people’s data, we can hopefully make it easier to keep projects nicely scoped without needlessly restricting future functionality. An events database can remain an events database, but use identifiers for artists and performers, making it possible to filter events by properties of those participants. A database of places can be only a link or two away from records describing the opening hours or business offerings of the things at those places. Linked Data (and for that matter FOAF…) is fundamentally a story about information sharing, rather than about triples. Some information is in RDF triples; but lots more is in documents, videos, spreadsheets, custom formats, or [hence FOAF] in people’s heads.

Dan’s description is also a nice illustration of how the web can help us avoid Yak Shaving, by leveraging the work of others:

Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you’re working on.

I’m just stashing that away here so I can find it again when I need it. Thanks danbri!


Confessions of a Graph Addict

Today I’m going to be at the annual conference of the American Library Association for a pre-conference about Libraries and Linked Data. I’m going to try talking about how Linked Data, and particularly the graph data structure, fits the way catalogers have typically thought about bibliographic information. Along the way I’ll include some specific examples of Linked Data projects I’ve worked on at the Library of Congress–and gesture at work that remains to be done.

Tomorrow there’s an unconference style event at ALA to explore what Linked Data means for Libraries. The pre-conference today is booked up, but the event tomorrow is open to the public, so please consider dropping by if you are interested and in the DC area.


bibliographic records on the web

There are a couple of interesting threads (disclaimer: I inadvertently started one) going on over on the Open Library technical discussion list about making Linked Data views available for authors. Since the topic was largely how to model people, part of the discussion spilled over to foaf-dev (also my fault).

When making library Linked Data available my preference has been to follow the lead of Martin Malmsten, Anders Söderbäck and the Royal Library of Sweden by modeling authors as People using the FOAF vocabulary:

<http://libris.kb.se/resource/auth/317488>
    libris:key "Berners-Lee, Tim" ;
    a foaf:Person ;
    rdfs:isDefinedBy <http://libris.kb.se/data/auth/317488> ;
    skos:exactMatch <http://viaf.org/viaf/23002995> ;
    foaf:name "Berners-Lee, Tim", "Lee, Tim Berners-", "Tim Berners- Lee", "Tim Berners-Lee" .

It seems sensible enough, right? But there is some desire in the library community to model an author as a Bibliographic Resource and then relate this resource to a Person resource. While I can understand wanting to have this level of indirection to assert a bit more control, and to possibly use some emerging vocabularies for RDA, I think (for now) using something like FOAF for modeling authors as people is a good place to start.

It will engage folks from the FOAF community who understand RDF and Linked Data, and get them involved in the Open Library Project. It will make library data fit in with other Linked Data out on the web. Plus, it just kind of fits my brain better to think of authors as people…isn’t that what libraries were trying to do all along with their authority data? I’m not saying that FOAF will have everything the library world needs (it won’t), but it’s an open world and we can add stuff that we need, collaborate, and make it a better place.

Anyway, that’s not really what I wanted to talk about here. Over the course of this discussion Erik Hetzner raised what I thought was an important question:

Are you saying that there is a usable distinction between:

  1. a bibliographic record, and
  2. the data contained in that bibliographic record?

    From above, my first notion would be to model things as, in
    pseudo-Turtle::

    <Victor Hugo> a frbr:Person .
    <Victor Hugo> rdfs:isDefinedBy <bib record> .
    <bib record> dc:modified “…”^^xsd:date .

    But it seems to me that you are adding a further distinction::

    <Victor Hugo> a frbr:Person .
    <Victor Hugo> rdfs:isDefinedBy <bib record> .
    <bib record> rdfs:isDefinedBy <bib record data>
    <bib record data> dc:modified “…”^^xsd:date .

    Is this a usable or useful distinction? Are there times when we want to distinguish between the abstract bibliographic record and the representation of a bibliographic record? In linked data-speak, is a bibliographic record a non-information resource? My thinking has been that a bibliographic record is an information resource, and that one does not need to distinguish between (1) and (2) above.

I think it’s an important question because I don’t think it’s been really discussed much before, and it has a direct impact on what sort of URL you can use to identify a Bibliographic Record, and what sort of HTTP response a client gets when it is resolved. This is the httpRange-14 issue, which is covered in Cool URIs for the Semantic Web. If a Bibliographic Record is an Information Resource then it’s OK to identify the record with any old URL, and for the server to say 200 OK like normal. If it’s not an Information Resource then the URL should either have a hash fragment in it, or the server should respond 303 See Other and redirect to another location.

In my view if a Bibliographic Record is on the web with a URL, it is useful to think of it as an Information Resource…or (as Richard Cyganiak dubs it) a Web Document. I don’t think it’s worthwhile philosophizing about this, but instead to think about it pragmatically. I think it’s useful to consider a URL like

http://lccn.loc.gov/99027665

as being an identifier for a bibliographic record that happens to be in HTML. Likewise the /mods, /dc and /marcxml variants of that URL

http://lccn.loc.gov/99027665/mods
http://lccn.loc.gov/99027665/dc
http://lccn.loc.gov/99027665/marcxml

are all identifiers for Bibliographic Records in MODS, Dublin Core and MARCXML respectively. It might be useful to link them together as they are with <link> elements in the HTML, or in some RDF serialization. It also could be useful to treat one as canonical, and content negotiate from one of the URLs (e.g. curl --header "Accept: application/marc+xml" http://lccn.loc.gov/99027665). But I think it simplifies deployment of library Linked Data to think of bibliographic records as things that can be put on the web as documents, without worrying too much about httpRange-14. A nice side effect of this is that it would grandfather in all the OPAC record views out there. Maybe it’ll be useful to distinguish between an abstract notion of a bibliographic record, and the actual document that is the bibliographic record – but I’m not seeing it right now…and I think it would introduce a lot of unnecessary complexity in this fragile formative period for library Linked Data.
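
For what it’s worth, the same content negotiation can be done from Python as easily as from curl; a minimal sketch with urllib2, assuming lccn.loc.gov honors the Accept header as described above:

import urllib2

req = urllib2.Request("http://lccn.loc.gov/99027665",
                      headers={"Accept": "application/marc+xml"})
response = urllib2.urlopen(req)

print response.geturl()          # the URL we ended up at, after any redirects
print response.info().gettype()  # the media type that actually came back
marcxml = response.read()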


the 5 stars of open linked data

While perusing the minutes of today’s w3c egov telecon I noticed mention of Tim Berners-Lee’s Bag of Chips talk at the gov2.0 expo last week in Washington, DC. I actually enjoyed the talk not so much for the bag-of-chips example (which is good), but for the examination of Linked Data as part of a continuum of web publishing activities associated with gold stars, like the ones you got in school. Here they are: