xhtml, wayback

The Internet Archive gave the Wayback Machine a facelift back in January. It actually looks really nice, but I noticed something kinda odd. I was looking for old archived versions of the lcsh.info site. Things work fine for the latest archived copies:

But during part of lcsh.info’s brief lifetime the site was serving up XHTML with the application/xhtml+xml media type. Now Wayback rightly (I think) remembers the media type, and serves it up that way:

ed@curry:~$ curl -I http://replay.waybackmachine.org/20081216020433/http://lcsh.info/
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-Connection: close
X-Archive-Orig-Content-Length: 6497
X-Archive-Orig-Content-Type: application/xhtml+xml; charset=UTF-8
X-Archive-Orig-Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.4.6 PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_wsgi/1.3 Python/2.5.2
X-Archive-Orig-Date: Tue, 16 Dec 2008 02:04:31 GMT
Content-Type: application/xhtml+xml;charset=utf-8
X-Varnish: 1458812435 1458503935
Via: 1.1 varnish
Date: Wed, 09 Mar 2011 23:09:47 GMT
X-Varnish: 903390921
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS
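
The X-Archive-Orig- prefix is how the original response headers travel along with the replayed one. As a rough sketch (the helper name and the trimmed sample are mine, not anything from Wayback itself), pulling them back out of a response like the one above could look like this:

```python
PREFIX = "X-Archive-Orig-"

# A trimmed copy of the curl output above, used as sample input.
RAW = """HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Content-Length: 6497
X-Archive-Orig-Content-Type: application/xhtml+xml; charset=UTF-8
X-Archive-Orig-Date: Tue, 16 Dec 2008 02:04:31 GMT
Content-Type: application/xhtml+xml;charset=utf-8
"""


def original_headers(raw):
    """Return the headers the crawled site sent, keyed without the prefix."""
    found = {}
    for line in raw.splitlines():
        # split on the first colon only, so dates keep their colons
        name, sep, value = line.partition(":")
        if sep and name.lower().startswith(PREFIX.lower()):
            found[name[len(PREFIX):]] = value.strip()
    return found


print(original_headers(RAW))
```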

But to add navigation controls and branding, Wayback also splices its own HTML into the display, and that HTML unfortunately is not valid XML. Since the application/xhtml+xml media type makes browsers parse the page with a strict XML parser, the pages render in Firefox like this:

And in Chrome like this:

Now I don’t quite know what the solution should be here. Perhaps the HTML that is spliced in should be valid XML. Or maybe Wayback should just serve up the HTML as text/html. Or maybe this is a good use case for frames (gasp). But I imagine it will similarly afflict any other XHTML that was served up as application/xhtml+xml when Heritrix crawled it.
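
To make the failure concrete, here is a minimal sketch of why one stray unclosed tag is enough to break the whole page once an XML parser is in charge. The banner markup is invented for illustration, not what Wayback actually inserts, and Python's ElementTree stands in for the browser's XML parser:

```python
import xml.etree.ElementTree as ET

# A tiny well-formed XHTML page, standing in for the archived original.
ORIGINAL = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>lcsh.info</title></head>"
    "<body><p>hello</p></body></html>"
)

# Hypothetical banner: fine as tag-soup HTML, but <img> and <br> are never closed.
BANNER = '<div id="banner"><img src="logo.png"><br>archived copy</div>'


def is_well_formed(markup):
    """True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


spliced = ORIGINAL.replace("<body>", "<body>" + BANNER)

print(is_well_formed(ORIGINAL))
print(is_well_formed(spliced))
```

Closing the tags in the banner (`<img ... />`, `<br />`) would be one way to get the spliced page through the same check.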

Sigh. I sure am glad that HTML5 is arriving on the scene and XHTML is riding off into the sunset. Although it’s kind of the Long Goodbye given Internet Archive has archived it.

Update: just a couple hours later I got an email that a fix for this was deployed. And sure enough now it works. I quickly eyeballed the response and didn’t see what the change was. Thanks very much Internet Archive!

on "good" repositories

Chris Rusbridge kicked off an interesting thread on JISC-REPOSITORIES with the tag line What makes a good repository? Implicit in this question, and perhaps the discussion list, is that he is asking about “digital” repositories, and not the brick n’ mortar libraries, archives, etc that are arguably also repositories.

The question of what a repository is, is pretty much verboten in the group I work in. This is kind of amusing since I work in a group whose latest name (in a string of names, and no-names) is the Repository Development Center. Well, maybe saying it’s “verboten” is putting it a bit too strongly. It’s not as if the “repository” is equivalent to He Who Shall Not Be Named or anything. It’s just that the word means so many things, to so many different people, and encompasses so much of what we do, that it’s hardly worth talking about. At our best (IMHO) we focus on what staff and researchers want to do with digital materials, and building out services that help them do that. Getting wrapped around the axle about what set of technologies we are using, and whether they model data in a particular way, is putting the cart before the horse.

As Dan penned one April 1st: “if you seek a pleasant repository, look about you”. I guess this largely depends on where you are sitting. But seriously, if there’s one thing that the Trustworthy Repositories Audit & Certification: Criteria and Checklist, the Ten Principles for Digital Preservation Repositories, and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access make abundantly clear (after you’ve clawed out your eyes) it’s that the fiscal and social dimensions of repositories are a whole lot more important in the long run than the technical bits of how a repository is assembled in the now. I’m a software developer, and by nature I reach for technical solutions to problems, but in my heart of hearts I know it’s true.

Back to Chris’ question. Perhaps the “digital” is a red herring. What if we consider his question in light of traditional libraries? This got me thinking: could Ranganathan and his Five Laws of Library Science serve as a touchstone? Granted, bringing Ranganathan into library discussions is a bit of a cliché. But asking ethical questions like the “goodness” of something is a great excuse to dip into the canon. So put on your repository colored glasses, which magically substitute Repository Object for Book, and …

Repository Objects Are For Use

We can build repositories that function as dark archives. But it kind of rots your soul to do it. It rots your soul because no matter what awesome technologies you are using to enable digital preservation in the repository, the repository needs to be used by people. If it isn’t used, the stuff rots. And the digital stuff rots a whole lot faster than the physical materials. Your repository should have a raison d’être. It should be anchored in a community of people that want to use the materials that it houses. If it isn’t, the repository is likely to not be good.

Every Reader His/Her Repository Object

Depending on their raison d’être (see above) repositories are used by a wide variety of people: researchers, administrators, systems developers, curators, etc. It does a disservice to these people if the repository doesn’t support their use cases. A researcher probably doesn’t care when fixity checks were last performed, and an administrator generating a report on fixity checks doesn’t care about how a repository object was linked to and tagged in Twitter. Does your repository allow these different views, for different users, to co-exist for the same object? Does it allow new classes of users to evolve?

Every Repository Object Its Reader

Are the objects in your repository discoverable? Are there multiple access pathways to them? For example, can someone do a search in Google and wind up looking at an item in your repository? Can someone link to it from a Wikipedia article? Can someone do a search within your repository to find an object of interest? Can they browse a controlled vocabulary or subject guide to find it? Are repository objects easily identified and found by automated agents like web crawlers and software components that need to audit them? Is it easy to extend, enhance and refine your description of what the repository object is as new users are discovered?

Save the Time of the Reader

Is your repository collection meaningfully on the Web? If it isn’t, it should be, because that’s where a lot of people are doing research today…in their web browser. If it can’t be open access on the web, that’s OK … but the collection and its contents should be discoverable so that someone can arrange an onsite visit. For example, can a genealogist do a search for a person’s name in a search engine and end up in your repository? Or do they have to know to come to your application to type in a search there? Once they are in your repository can they easily limit their search along familiar dimensions such as who, what, why, when, and where? Is it easy for someone to bookmark a search, or an item, for later use? Do you allow your repository objects to be reused in other contexts like Facebook, Twitter, Flickr, etc which put the content where people are, instead of expecting them to come to you?

The Repository is a Growing Organism

This is my favorite. Can you keep adding numbers and types of objects, and scale your architecture linearly? Or are you constrained in how large the repository can grow? Is this constraint technical, social and/or financial? Can your repository change as new types or numbers of users (both human and machine) come into existence? When the limits of a particular software stack are reached, is it possible to throw it away and build another without losing the repository objects you have? How well does your repository fit into the web ecosystem? As the web changes do you anticipate your repository will change along with it? How can you retire functionality and objects; to let them naturally die, with respect, and make way for the new?

So …

I guess there are more questions here than answers. I hadn’t thought of framing repository questions in terms of Ranganathan’s laws before, but I imagine it has occurred to other people. They still seem to be quite good principles to riff on, even in the digital repository realm–at least for a blog post. If you happen to run across similar treatments elsewhere I would appreciate hearing about them.

release early, release often

The National Digital Newspaper Program (NDNP) went live with a new JavaScript viewer today (as well as a lot of other stylistic improvements) in the Beta area portion of Chronicling America.

Being able to smoothly zoom in on images, and go into fullscreen mode is really reminiscent (for me) of the visceral experience of using a microfilm reader. We joked about adding whirring sound effects when moving between pages. But you’ll be glad to know we showed restraint :-) It’s all kind of deeply ironic given the Web’s roots in the Memex.

Hats off to Dan Krech, Risa Ohara and Chris Adams for really digging into things like dynamically rendering tiles from our JPEG2000 access copies using Python (maybe more on that later).

I hacked together a brief video demonstration (above) of looking up the day the American Civil War ended (April 9, 1865) in the New York Daily Tribune, to show off the viewer. One thing I forgot to do was go into fullscreen mode (F11 w/ Firefox, Chrome, etc), which amplifies the effect somewhat.

Aside from the improvements on the site, this is a real milestone for the project and (I believe) the Library of Congress generally, since it is a ‘beta’ preview of what we would like to replace the existing site with. Given the nature of what they do, libraries are typically fairly conservative and slow moving organizations. So knowing how to use a beta/experimental area has proven to be a challenge. Hopefully a little space for experimentation will pay off. I don’t think we could’ve gotten this far without the help of our fearless leader in all things tech, David Brunton.

If you have ideas, feel free to leave feedback using the little widget on the lower-right of pages at Chronicling America, or using our new mailing list that is devoted to the Open Source software project that makes Chronicling America available. Open Source too, imagine that!


Yesterday at Code4lib 2011 Karen Coombs gave a talk where (among other things) she demonstrated mapFAST that lets you find relevant subject headings for a given location, and then click on a subject heading and find relevant books on the topic. Go check out the archived video of her talk (note you’ll have to jump 39 minutes or so into the stream). Karen mentioned that the demo UI uses the mapFAST REST/JSON API. The service lets you construct a URL like this to get back subjects for any location you can identify with lat/lon coordinates:


For example:

ed@curry:~$ curl -i 'http://experimental.worldcat.org/mapfast/services?geo=39.01,-77.01;crs=wgs84&radius=100000&mq=&sortby=distance&max-results=1'
HTTP/1.1 200 OK
Date: Wed, 09 Feb 2011 14:07:39 GMT
Server: Apache/2.0.63 (Unix)
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/json

{
  "Status": {
    "code": 200, 
    "request": "geocode"
  }, 
  "Placemark": [
    {
      "point": {
        "coordinates": "39.0064,-77.0303"
      }, 
      "description": "", 
      "ExtendedData": [
        {
          "name": "NormalizedName", 
          "value": "maryland silver spring woodside park"
        }, 
        {
          "name": "Feature", 
          "value": "ppl"
        }, 
        {
          "name": "FCode", 
          "value": "P"
        }
      ], 
      "id": "fst01324433", 
      "name": "Maryland -- Silver Spring -- Woodside Park"
    }
  ], 
  "name": "FAST Authority Records"
}

Recently I have been reading Mark Pilgrim’s wonderful Dive into HTML5 book, so I got it into my head that it would be fun to try out some of the geo-location features in modern browsers to display subject headings that are relevant for wherever you are. A short time later I now have a simple HTML5/JavaScript application (dubbed subjects-here) that does just that. The application itself is really just a toy. Part of Karen’s talk was emphasizing the importance of using more than just text in Library Applications…and subjects-here kind of misses that key point.
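
If you would rather poke at the service outside the browser, here is a minimal Python sketch of building the request URL and unpacking a response shaped like the one shown earlier. The function names are mine, the sample is a cut-down copy of that response, and I am assuming the coordinates come back in the same lat,lon order as the query:

```python
import json

BASE = "http://experimental.worldcat.org/mapfast/services"

# Cut-down copy of the response shown earlier, used instead of a live request.
SAMPLE = """{
  "Status": {"code": 200, "request": "geocode"},
  "Placemark": [
    {"point": {"coordinates": "39.0064,-77.0303"},
     "id": "fst01324433",
     "name": "Maryland -- Silver Spring -- Woodside Park"}
  ],
  "name": "FAST Authority Records"
}"""


def build_url(lat, lon, radius=100000, max_results=15):
    """Build a mapFAST query URL using the same parameters as the curl example."""
    return ("%s?geo=%s,%s;crs=wgs84&radius=%d&mq=&sortby=distance&max-results=%d"
            % (BASE, lat, lon, radius, max_results))


def headings(response_text):
    """Return (id, name, lat, lon) for each placemark in a response."""
    data = json.loads(response_text)
    results = []
    for placemark in data.get("Placemark", []):
        lat, lon = (float(x) for x in placemark["point"]["coordinates"].split(","))
        results.append((placemark["id"], placemark["name"], lat, lon))
    return results


print(build_url(39.01, -77.01, max_results=1))
print(headings(SAMPLE))
```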

What I wanted to highlight is this line in the HTTP response above:

Access-Control-Allow-Origin: *

The Access-Control-Allow-Origin HTTP header is a Cross-Origin Resource Sharing (CORS) header. If you’ve developed JavaScript applications before, you probably have run into situations where you wanted to load some JavaScript from a service elsewhere on the web. But you were prevented from doing this by Same Origin Policy, which prevents your JavaScript code from talking to a website that is different from the one it loaded from. So normally you hack around this by creating a proxy for that web service in your own application, which is a bit of work. Sometimes license agreements frown on you re-exposing their service, so you have to jump through a few more hoops to make sure it’s not an open proxy for the web service.

Enter CORS.

What the folks at OCLC did was add an Access-Control-Allow-Origin header to their JSON response. This basically means that my JavaScript served up at inkdroid.org is able to run in your browser and talk to the server at experimental.worldcat.org. OCLC has decided to allow this, to make their Web Service easier to use. So to create subjects-here I didn’t have to write a single bit of server side code, it’s just static HTML and JavaScript:

function main() {
    if (Modernizr.geolocation) {
        // ask the browser where we are, then look up subjects for that spot
        navigator.geolocation.getCurrentPosition(lookup_subjects);
    }
    else {
        // no geolocation support in this browser, so there is nothing to look up
    }
}

function lookup_subjects(position) {
    var lat = parseFloat(position.coords.latitude);
    var lon = parseFloat(position.coords.longitude);
    var url = "http://experimental.worldcat.org/mapfast/services?geo=" + lat + "," + lon + ";crs=wgs84&radius=100000&mq=&sortby=distance&max-results=15";
    $.getJSON(url, display_subjects);
}

function display_subjects(data) {
    // putting results into the DOM left as exercise to the reader
}

Nice and simple right? The full code is on GitHub, which seemed a bit superfluous since there is no server-side piece (it’s all in the browser). So the big wins are:

  • OCLC gets to see who is actually using their web service, not who is proxying it.
  • I don’t have to write some proxy code.

The slight drawbacks are:

  • My application has a runtime dependency on experimental.worldcat.org, but it kinda did already when I was proxying it.
  • Most modern browsers support CORS headers, but not all of them. So you would need to evaluate whether that matters to you.

I guess this is just a long way of saying USE CORS!! and help make the web a better place (pun intended).
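
For anyone on the publishing side wondering what “using CORS” amounts to, it can be as small as one extra response header. This is only a sketch of a toy WSGI application in Python, not how OCLC’s service is built; the payload is a stand-in:

```python
import json


def app(environ, start_response):
    """Toy WSGI app that serves JSON and lets any origin read it."""
    body = json.dumps({"name": "FAST Authority Records"}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # the one line that opens the response up to cross-origin JavaScript
        ("Access-Control-Allow-Origin", "*"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally you could hand `app` to `wsgiref.simple_server.make_server("localhost", 8000, app)` and call `serve_forever()` on the result. A service that only wants to admit particular sites would send a specific origin in the header instead of `*`.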

Update: and also, it is a good example where something like GeoJSON and OpenSearch Geo could’ve been used to help spread common patterns for data on the Web. Thanks to Sean Gillies for pointing that out.

Update: and Chris is absolutely right, JSONP is another pattern in the Web Developer community that is a bit of a hack, but is an excellent fallback for older browsers.
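
For comparison, the JSONP fallback boils down to the server wrapping its JSON in a function call named by the client. A rough server-side sketch in Python, where the helper name and the callback check are my own assumptions rather than anything mapFAST does:

```python
import json
import re

# Only allow plain JavaScript identifiers as callback names, so the
# response cannot be turned into arbitrary script.
_CALLBACK = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def jsonp(callback, data):
    """Wrap data as a call to the named JavaScript function."""
    if not _CALLBACK.fullmatch(callback):
        raise ValueError("unsafe callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(data))


print(jsonp("display_subjects", {"name": "FAST Authority Records"}))
```

The browser loads that text with a script tag, which is not subject to the Same Origin Policy, and the named function receives the data.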


Andy Powell has a post over on the eFoundations blog about some metadata guidelines he and Pete Johnston are working on for the UK Resource Discovery Taskforce. I got to rambling in a text area on his blog, but I guess I wrote too much, or included too many external URLs, so I couldn’t post it in the end. So I thought I’d just post it here, and let trackback do the rest.

So uh, please s/you/Andy/g in your head as you are reading this …

A bit of healthy skepticism, from a 15-year vantage point, is definitely warranted, bearing in mind that oftentimes it’s hard to move things forward without taking a few risks. I imagine constrained fiscal resources could also be a catalyst to improving access to data flows that cultural heritage institutions participate in, or want to participate in. I wonder if it would be useful to factor in the money that organizations can save by working together better?

As I’ve heard you argue persuasively in the past, the success of the WWW as a platform for delivery of information is hard to argue with. One of the things that the WWW did right (from the beginning) was focus the technology on people actually doing stuff…in their browsers. It seems really important to make sure whatever this metadata is, that users of the Web will see it (somehow) and will be able to use it. Ian Davis’ points in Is the Semantic Web Destined to be a Shadow are still very relevant today I think.

My friend Dan Krech calls this an “alignment problem”. So I was really pleased to see this in the vision document:

Agreed core standards for metadata for
the physical objects and digital objects in
aggregations ensuring the aggregations
are open to all major search engines

Aligning with the web is a good goal to have. Relatively recent service offerings from Google and Facebook indicate their increased awareness of the utility of metadata to their users. And publishers are recognizing how important they are for getting their stuff before more eyes. It’s a kind of virtuous cycle I hope.

This must feel like it has been a long time in coming for you and Pete. Google’s approach encourages a few different mechanisms: RDFa, Microdata and Microformats. Similarly, Google Scholar parses a handful of metadata vocabularies present in the HTML head element. The web is a big place to align with I guess.

I imagine there will be hurdles to get over, but I wonder if your task-force could tap into this virtuous cycle. For example it would be great if cultural heritage data could be aggregated using techniques that big Search companies also use: e.g. RDFa, microformats and microdata; and sitemaps and Atom for updates. This would assume a couple things: publishers could allow (and support) crawling, and that it would be possible to build aggregator services to do the crawling. An important step would be releasing the aggregated content in an open way too. This seems to be an approach that is very similar to what I’ve heard Europeana is doing…which may be something else to align with.
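
As one small illustration of the “sitemaps for updates” idea, here is a sketch of how an aggregator might decide what to re-crawl from a publisher’s sitemap. The sitemap content and URLs are made up, and a real crawler would of course fetch the file rather than embed it:

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sitemap for a hypothetical publisher.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/items/1</loc><lastmod>2011-01-05</lastmod></url>
  <url><loc>http://example.org/items/2</loc><lastmod>2010-11-20</lastmod></url>
  <url><loc>http://example.org/items/3</loc></url>
</urlset>"""


def changed_since(sitemap_xml, cutoff):
    """Return URLs modified on or after cutoff, plus any with no lastmod."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # with no lastmod we cannot tell, so err on the side of crawling
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= cutoff:
            urls.append(loc)
    return urls


print(changed_since(SITEMAP, date(2011, 1, 1)))
```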

I like the idea of your recommendations providing a sliding scale, for people to get their feet wet in providing some basic information, and then work their way up to the harder stuff. Staying focused on what sorts of services moving up the scale provides seems to be key. Part of the vision document mentions that the services are intended for staff. There is definitely a need for administrators to manage these systems (I often wonder what sort of white-listing functionality Google employs with its Rich Snippets service to avoid spam). But keeping the ultimate users of this information in mind is really important.

Finally I’m a bit curious about the use of ‘aggregations’ in the RLUK vision. Is that some OAI-ORE terminology percolating through?


Wikipedia’s 10th Birthday Party at the National Archives in Washington DC on Saturday was a lot of fun. Far and away, the most astonishing moment for me came early in the opening remarks by David Ferriero, the Archivist of the United States, when he stated (in no uncertain terms) that he was a big fan of Wikipedia, and that it was often his first go-to for information. Not only that, but when discussion about a bid for a DC WikiMania (the Wikipedia Annual Conference) came up later in the morning, Ferriero suggested that the National Archives would be willing to host it if it came to pass. I’m not sure if anything actually came of this later in the day–a WikiMania in DC would be incredible. It was just amazing to hear the Archivist of the United States be supportive of Wikipedia as a reference source…especially as stories of schools, colleges and universities rejecting Wikipedia as a source are still common. Ferriero’s point was even more poignant with several high schoolers in attendance. Now we all can say:

If Wikipedia is good enough for the Archivist of the United States, maybe it should be good enough for you.

Another highlight for me was meeting Phoebe Ayers, who is a reference librarian at UC Davis, member of the Wikimedia Foundation Board of Trustees, and author of How Wikipedia Works. I strong-armed Phoebe into signing my copy (I bought this copy on Amazon after it was de-accessioned from Cuyahoga County Public Library in Parma, Ohio). Phoebe has some exciting ideas for creating collaborations between libraries and Wikipedia, which I think fit quite well into the Galleries, Libraries, Archives and Museums (GLAM) effort within Wikipedia. I think she is still working on how to organize the effort.

Later in the day we heard how the National Archives is thinking of following the lead of the British Museum and establishing a Wikipedian in Residence. Liam Wyatt, the first Wikipedian in Residence, put a human face on Wikipedia for the British Museum, and familiarized museum staff with editing Wikipedia, through activities like the Hoxne Challenge. Having a Wikipedian in Residence at the National Archives (and who knows maybe the Smithsonian and the Library of Congress) would be extremely useful I think.

In a similar vein, Sage Ross spoke at length about the Wikipedia Ambassador Program. The Ambassador Program is a formal way for folks to represent Wikipedia in academic settings (universities, high schools, etc). Ambassadors can get training in how to engage with Wikipedia (editing, etc) and can help professors and teachers who want to integrate Wikipedia into their curriculum, and scholarly activities.

I got to meet Peter Benjamin Meyer of the Bureau of Labor Statistics, who has some interesting ideas for aggregating statistical information from federal statistical sources, and writing some bots that will update article info-boxes for places in the United States. The impending release of the 2010 US Census Data has the Wikipedia community discussing the best way to update the information that was added by a bot for the 2000 census. It seemed like Peter might be able to piggy back some of his efforts on this work that is going on at Wikipedia for the 2010 Census.

Jyothis Edthoot, an Oracle employee and Wikipedia Steward, gave me a behind the scenes look at the tools he and others in the Counter Vandalism Unit use to keep Wikipedia open for edits from anyone in the world. I also got to meet Harihar Shankar from Herbert van de Sompel’s team at Los Alamos National Lab, and to learn more about the latest developments with Memento, which he gave a lightning talk about. I also ran into Jeanne Kramer-Smyth of the World Bank, and got to hear about their efforts to provide meaningful access to their document collections to web crawlers using their metadata.

I did end up giving a lightning talk about Linkypedia (slides on the left). I was kind of rushed, and I wasn’t sure that this was exactly the right audience for the talk (being mainly Wikipedians instead of folks from the GLAM sector). But it helped me think through some of the challenges in expressing what Linkypedia is about, and who it is for. All in all it was a really fun day, with a lot of friendly folks interested in the Wikipedia community. There must’ve been at least 70 people there on a very cold Saturday–a promising sign of good things to come for collaborations between Wikipedia and the DC area.

wikipedia external links: a redis database

As part of my continued meandering linkypedia v2 experiments I created a Redis database of high level statistics about host names and top-level-domain names in external links from Wikipedia articles. Tom Morris happened to mention he has been loading the external links as well (thanks Alf), so I thought I’d make the redis database dump available to anyone that is interested in looking at it. If you want to give it a whirl try this out:

% wget http://inkdroid.org/data/wikipedia-extlinks.rdb
% sudo aptitude install redis-server
% sudo mv wikipedia-extlinks.rdb /var/lib/redis/dump.rdb
% sudo chown redis:redis /var/lib/redis/dump.rdb
% sudo /etc/init.d/redis-server restart
% sudo pip install redis # or easy_install (version in apt is kinda old)
% python 
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.Redis()
>>> r.zrevrange("hosts", 0, 25, True)
[('toolserver.org', 2360809.0), ('www.ncbi.nlm.nih.gov', 508702.0), ('dx.doi.org', 410293.0), ('commons.wikimedia.org', 408986.0), ('www.imdb.com', 398877.0), ('www.nsesoftware.nl', 390636.0), ('maps.google.com', 346997.0), ('books.google.com', 323111.0), ('news.bbc.co.uk', 214738.0), ('tools.wikimedia.de', 181215.0), ('edwardbetts.com', 168102.0), ('dispatch.opac.d-nb.de', 166322.0), ('web.archive.org', 165665.0), ('www.insee.fr', 160797.0), ('www.iucnredlist.org', 155620.0), ('stable.toolserver.org', 155335.0), ('www.openstreetmap.org', 154127.0), ('d-nb.info', 141504.0), ('ssd.jpl.nasa.gov', 137200.0), ('www.youtube.com', 133827.0), ('www.google.com', 131011.0), ('www.census.gov', 124182.0), ('www.allmusic.com', 117602.0), ('maps.yandex.ru', 114978.0), ('news.google.com', 102111.0), ('amigo.geneontology.org', 95972.0)]
>>> r.zrevrange("hosts:edu", 0, 25, True)
[('nedwww.ipac.caltech.edu', 28642.0), ('adsabs.harvard.edu', 25699.0), ('animaldiversity.ummz.umich.edu', 21747.0), ('www.perseus.tufts.edu', 20438.0), ('genome.ucsc.edu', 20290.0), ('cfa-www.harvard.edu', 14234.0), ('penelope.uchicago.edu', 9806.0), ('www.bucknell.edu', 8627.0), ('www.law.cornell.edu', 7530.0), ('biopl-a-181.plantbio.cornell.edu', 5747.0), ('ucjeps.berkeley.edu', 5452.0), ('plato.stanford.edu', 5243.0), ('www.fiu.edu', 5004.0), ('www.volcano.si.edu', 4507.0), ('calphotos.berkeley.edu', 4446.0), ('www.usc.edu', 4345.0), ('ftp.met.fsu.edu', 3941.0), ('web.mit.edu', 3548.0), ('www.lpi.usra.edu', 3497.0), ('insects.tamu.edu', 3479.0), ('www.cfa.harvard.edu', 3447.0), ('www.columbia.edu', 3260.0), ('www.yale.edu', 3122.0), ('www.fordham.edu', 2963.0), ('www.people.fas.harvard.edu', 2908.0), ('genealogy.math.ndsu.nodak.edu', 2726.0)]
>>> r.zrevrange("tlds", 0, 25, True)
[('com', 11368104.0), ('org', 7785866.0), ('de', 1857158.0), ('gov', 1767137.0), ('uk', 1489505.0), ('fr', 1173624.0), ('ru', 897413.0), ('net', 868337.0), ('edu', 793838.0), ('jp', 733995.0), ('nl', 707177.0), ('pl', 590058.0), ('it', 486441.0), ('ca', 408163.0), ('au', 387764.0), ('info', 296508.0), ('br', 276599.0), ('es', 242767.0), ('ch', 224692.0), ('us', 179223.0), ('at', 163397.0), ('be', 132395.0), ('cz', 92683.0), ('eu', 91671.0), ('ar', 89856.0), ('mil', 87788.0)]
>>> r.zscore("hosts", "www.bbc.co.uk")

Basically there are a few sorted sets in there:

  • “hosts”: all the hosts sorted by the number of externallinks
  • “hosts:%s”: where %s is a top level domain (“com”, “uk”, etc)
  • “tlds”: all the tlds sorted by the number of externallinks
  • “wikipedia”: the wikipedia languages sorted by total number of externallinks

I’m not exactly sure how portable redis databases are but I was able to move it between a couple Ubuntu machines and Ryan successfully looked at it on a Gentoo box he had available. You’ll need roughly 300MB of RAM available. I must say I was impressed with redis and in particular sorted sets for this stats collection task. Thanks to Chris Adams for pointing me in the direction of redis in the first place.
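
If you are curious how counts like these could be put together, here is a rough sketch of the tallying step in plain Python. It is not the code I used to build the dump; the key names mirror the sorted sets described above, and the sample URLs are invented:

```python
from collections import Counter
from urllib.parse import urlparse


def tally(urls):
    """Count external links per host, per tld, and per host within each tld."""
    counts = {"hosts": Counter(), "tlds": Counter()}
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue  # skip things that are not really URLs
        tld = host.rsplit(".", 1)[-1]
        counts["hosts"][host] += 1
        counts["tlds"][tld] += 1
        # one counter per tld, named like the "hosts:%s" sorted sets
        counts.setdefault("hosts:" + tld, Counter())[host] += 1
    return counts


sample = [
    "http://news.bbc.co.uk/a",
    "http://news.bbc.co.uk/b",
    "http://www.bbc.co.uk/",
    "http://plato.stanford.edu/entries/x",
]
counts = tally(sample)
print(counts["hosts"].most_common(2))
print(counts["tlds"].most_common(2))
```

Each increment here corresponds to what a ZINCRBY against the matching Redis sorted set would do, which is what makes sorted sets such a comfortable fit for this kind of stats collection.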

Wikipedia 10

Wikipedia is turning ten years old on January 15th, and celebratory gatherings are going on around the globe, including one in my area (Washington DC) on January 22 at the National Archives.

Like you, I’ve been an accidental user of Wikipedia when searching for a topic to learn more about. Over time I have started actively searching on Wikipedia for things, linking to Wikipedia articles (in email and HTML), and even saying thank you with a small donation. I’ve only attended one of the local DC chapter events before, but am definitely planning on attending the DC event to meet other people who value Wikipedia as a resource.

Perhaps also like you (since you are reading this blog) I work in a cultural heritage organization, well a library to be precise. I wasn’t able to attend the recent Galleries, Libraries, Archives, Museums & Wikimedia conference at the British Museum last November. But I have been listening with great interest to the audio that they kindly provided recently. If you are interested in the role that cultural heritage organizations can play on Wikipedia, and the Web in general, definitely check it out if you’ve got a long commute (or time to kill) and an audio device of some kind. There are lots of museums, galleries, archives and libraries in the DC area, so I’m hoping that this event on the 22nd will be an opportunity for folks to get together across institutional lines to talk about how they are engaging with the Wikipedia community. Who knows, maybe it could be a precursor to something similar to GLAM-WIKI here in DC?

I’m planning on doing a lightning talk about this side/experimental project I’ve been working on called linkypedia. The basic idea is to give web publishers (and specifically cultural heritage organizations like libraries, museums, archives, etc) an idea of how their content is being used as primary resource material on Wikipedia. The goal is to validate the work that these institutions have done to make this content available, and for them to do more…and also to engage with the Wikipedia community. Version 1 of the (open source) software is running here on my minimal Linode VPS. But I’m also working on version 2, which will hopefully scale a bit better, and provide a more global (not just English Wikipedia) and real time picture of how your stuff is being used on Wikipedia. Part of the challenge is figuring out how to pay for it, given the volume of external links in the major language Wikipedias. I’m hoping a tip-jar and thoughtful use of Amazon EC2 will be enough.

If you are interested in learning more about the event on the 22nd check out Katie Filbert’s recent post to the Sunlight Labs Discussion List, and the (ahem) wiki page to sign up! Thanks Mark for letting me know about the birthday celebration last week in IRC. Sometimes with all the discussion lists, tweets, and blogs things like this slip by without me noticing them. So a direct prod in IRC helps :-)

on preserving bookmarks

While it’s not exactly clear what the future of Delicious is, the recent news about Yahoo closing the doors or selling the building prompted me to look around at other social bookmarking tools, and to revisit some old stomping grounds.

Dan Chudnov has been running Unalog since 2003 (roughly when Delicious started). In fact I can remember Dan and Joshua Schachter having some conversations about the idea of social bookmarking as both of the services co-evolved. So my first experience with social bookmarking was on Unalog, but I ended up switching to Delicious in 2004 for reasons I can’t quite remember. I think I liked some of the tools that had sprouted up around Delicious, and felt a bit guilty for abandoning Unalog.

Anyhow, I wanted to take the exported Delicious bookmarks and see if I could get them into Unalog. So I set up a dev Unalog environment, created a friendly fork of Dan’s code, and added the ability to POST a chunk of JSON:

curl --user user:pass \
         --header "Content-type: application/json" \
         --data '{"url": "http://example.com", "title": "Example"}' \

Here’s a fuller example of the JSON that you can supply:

    {
      "url": "http://zombo.com",
      "title": "ZOMBO",
      "comment": "found this awesome website today",
      "tags": "website awesome example",
      "content": "You can do anything at Zombo.com. The only limit is yourself. Etc...",
      "is_private": true
    }

The nice thing about Unalog is that if you supply the content field, Unalog will index the text of the resource you’ve bookmarked. This allows you to do a fulltext search over your bookmarked materials.

So yeah, to make a long story a bit shorter, I created a script that reads in the bookmarks from a Delicious bookmark export (an HTML file) and pushes them up to a Unalog instance. Since the script GETs each bookmark URL in order to send Unalog the content to index, you also get a log of the HTTP status codes, which provides the fodder for a linkrot report like:

200 OK 4546
404 Not Found 367
403 Forbidden 141
DNS failure 81
Connection refused 28
500 Internal Server Error 19
503 Service Unavailable 10
401 Unauthorized 9
410 Gone 5
302 Found 5
400 Bad Request 4
502 Bad Gateway 2
412 Precondition Failed 1
300 Multiple Choices 1
201 Created 1
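For the curious, the export-reading half of a script like that can be sketched in a few lines of Python. This is only a rough sketch, not the actual script: it assumes the export is the usual Netscape-style bookmark file (DT/A/DD elements with HREF, TAGS and PRIVATE attributes), and the names DeliciousExportParser and parse_export are made up for illustration. Fetching each URL and POSTing to Unalog are left out.

```python
import json
from html.parser import HTMLParser


class DeliciousExportParser(HTMLParser):
    """Collect bookmarks from a Netscape-style bookmark export (assumed format)."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._target = None  # field that incoming character data belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "a" and "href" in attrs:
            self.bookmarks.append({
                "url": attrs["href"],
                "title": "",
                "comment": "",
                # export uses commas; the JSON example above uses spaces
                "tags": " ".join((attrs.get("tags") or "").split(",")),
                "is_private": attrs.get("private") == "1",
            })
            self._target = "title"
        elif tag == "dd" and self.bookmarks:
            self._target = "comment"  # <DD> carries the note for the last link
        elif tag == "dt":
            self._target = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._target = None

    def handle_data(self, data):
        if self._target and self.bookmarks:
            self.bookmarks[-1][self._target] += data


def parse_export(html):
    """Return a list of dicts shaped like the JSON payload shown earlier."""
    parser = DeliciousExportParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    for bookmark in parser.bookmarks:
        bookmark["title"] = bookmark["title"].strip()
        bookmark["comment"] = bookmark["comment"].strip()
    return parser.bookmarks


SAMPLE = """<DL><p>
<DT><A HREF="http://zombo.com" PRIVATE="1" TAGS="website,awesome">ZOMBO</A>
<DD>found this awesome website today
<DT><A HREF="http://example.com" PRIVATE="0" TAGS="">Example</A>
</DL><p>"""

bookmarks = parse_export(SAMPLE)
payloads = [json.dumps(b) for b in bookmarks]  # one JSON body per POST
```

Each string in payloads is what you would hand to curl’s --data (or an HTTP client) when posting to a Unalog instance.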

The report above totals 5,220 bookmarks collected over 5 years, which initially seemed low, until I did the math and figured I did 3 bookmarks a day on average. If we lump together all the non-200 OK responses, that amounts to 13% linkrot. At first blush this seems significantly different from the research done by Spinellis in 2003 (thanks Mike), which reported 28% linkrot. I would’ve expected the Spinellis results to be better than my haphazard bookmark collection since he was sampling academic publications on the Web. But he was also sampling links from the 1995-1999 period, while I had links from 2005-2010. I know this is mere conjecture, but maybe we’re learning to do things better on the web w/ Cool URIs. I’d like to think so at least. Maybe a comparison with some work done by folks at HP and Microsoft would provide some insight.
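If you want to check my arithmetic, here it is spelled out in Python using the counts from the report. The variable names are just mine for illustration, and "5 years" is treated as 5 × 365 days.

```python
# Status counts copied from the linkrot report
counts = {
    "200 OK": 4546,
    "404 Not Found": 367,
    "403 Forbidden": 141,
    "DNS failure": 81,
    "Connection refused": 28,
    "500 Internal Server Error": 19,
    "503 Service Unavailable": 10,
    "401 Unauthorized": 9,
    "410 Gone": 5,
    "302 Found": 5,
    "400 Bad Request": 4,
    "502 Bad Gateway": 2,
    "412 Precondition Failed": 1,
    "300 Multiple Choices": 1,
    "201 Created": 1,
}

total = sum(counts.values())          # all bookmarks checked
rotten = total - counts["200 OK"]     # everything that was not a plain 200 OK
linkrot_pct = 100 * rotten / total    # share of bookmarks counted as rotten
per_day = total / (5 * 365)           # average bookmarks per day over 5 years

print(f"{total} bookmarks, {rotten} not OK, "
      f"{linkrot_pct:.1f}% linkrot, {per_day:.1f} per day")
```

Note that lumping everything that isn’t a 200 together is a blunt instrument: it counts the 302 and 201 responses as rot too, although they are a tiny fraction of the total.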

At the very least this was a good reminder of how important this activity of pouring data from one system into another is to digital preservation: what Greg Janée calls relay-supporting preservation.

Most of all I want to echo the comments of former Yahoo employee Stephen Hood who wrote recently about the value of this unique collection of bookmarks to the web community. If for some reason Yahoo were to close the doors on Delicious it would be great if they could donate the public bookmarks to the Web somehow, either via a public institution like the Smithsonian or the Library of Congress (full disclosure: I work at the Library of Congress in a Digital Preservation group), or to an organization dedicated to the preservation of the Web like the International Internet Preservation Consortium, of which LC, other National Libraries, and the Internet Archive are members.

dcat:distribution considered helpful

The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they have embedded in their dataset webpages. As an example here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described using the dcat:distribution property.

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little Python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).

So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable 3rd party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer, but copies are important for digital preservation, when the unfortunate happens. The subtlety is being able to know where the copies of a particular dataset are in the enterprise, in a distributed system like the Web, and the mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations of other research to the actual dataset it is based upon. And lastly aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.

Consider this practical example: as someone interested in digital preservation I’d like to be able to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the referenced dataset, I was able to write a small Python program to crawl the datasets and log the media type and HTTP status code along the way, to generate some results like:

media type datasets
text/html 5898
application/octet-stream 1266
application/vnd.ms-excel 874
text/plain 234
text/csv 220
application/pdf 167
text/xml 81
text/comma-separated-values 51
application/x-zip-compressed 36
application/vnd.ms-powerpoint 33
application/zip 31
application/x-msexcel 28
application/excel 21
application/xml 18
text/x-comma-separated-values 14
application/x-gzip 13
application/x-bittorrent 12
application/octet_stream 12
application/msword 10
application/force-download 10
application/x-vnd.oasis.opendocument.presentation 9
application/x-octet-stream 9
application/vnd.excel 9
application/x-unknown-content-type 6
application/xhtml+xml 6
application/vnd.msexcel 5
application/vnd.google-earth.kml+xml kml 5
application/octetstream 4
application/csv 3
vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/octet-string 2
image/jpeg 1
image/gif 1
application/x-mspowerpoint 1
application/vnd.google-earth.kml+xml 1
application/powerpoint 1
application/msexcel 1

Granted, some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the object of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.
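A tally like the one in the table can be put together with nothing but the Python standard library. What follows is a rough sketch rather than the program I actually ran: the function names head, media_type and tally are hypothetical, and it assumes you already have the list of dcat:distribution URLs in hand.

```python
import urllib.error
import urllib.request
from collections import Counter


def head(url, timeout=10):
    """Return (status, Content-Type) for a URL, or (None, "") on network failure.

    Note that urlopen follows redirects, so this reports the final response.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code and headers
        return err.code, err.headers.get("Content-Type", "")
    except (urllib.error.URLError, OSError):
        return None, ""


def media_type(content_type):
    """Strip parameters such as charset and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def tally(content_types):
    """Count media types, ignoring responses that had no Content-Type."""
    return Counter(media_type(ct) for ct in content_types if ct)


# Offline usage example with header values as a server might send them
example = tally(["text/html; charset=utf-8", "TEXT/HTML", "text/csv", ""])
for name, count in example.most_common():
    print(name, count)
```

To run it against real data you would call head() on each distribution URL and feed the resulting Content-Type values to tally().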

The application/octet-stream media type (and its variants) is also kind of useless for these purposes, since it basically says the dataset is made of bits. It would be more helpful if the servers in these cases could send something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream to figure out just what this data is, whether it is valid, etc.
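To give a flavor of what that kind of post-hoc analysis involves, here is a toy sketch that guesses a format from the first few bytes of a file. To be clear, this is not how JHOVE or DROID work internally (they do far more, including validation); it is just a handful of well-known magic numbers, and the sniff function and its labels are made up for illustration.

```python
# (magic bytes, label) pairs for a few well-known file signatures
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also .xlsx and OpenDocument containers
    (b"\x1f\x8b", "application/gzip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .xls/.doc/.ppt)"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff(data):
    """Return a best-guess label for the leading bytes, or None if unrecognized."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


print(sniff(b"%PDF-1.4\n..."))   # recognized as a PDF
print(sniff(b"a,b,c\n1,2,3\n"))  # plain text such as CSV has no signature
```

Text formats like CSV have no signature at all, which is part of why a more specific Content-Type from the server would be so helpful.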

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format analysis application could be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information and whereabouts of datasets, but a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Microservices being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regularly scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL for the HTML view for the resource. For example:


instead of:


This is understandable given some of the dictums about Linked Data and trying to separate the Information Resource from the Non-Information Resource. But it would be nice if the URL resolved via a 303 redirect to the HTML as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset it’s important that it resolves so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.