on not linking

NPR Morning Edition recently ran an interview with Teju Cole about his latest project, Small Fates. Cole is the recipient of the 2012 Hemingway Foundation/PEN Award for his novel Open City. Small Fates is a series of poetic snippets from Nigerian newspapers, which Cole has been posting on Twitter. It turns out Small Fates draws on a tradition of compressed news stories known as fait divers. The interview is a really nice description of the poetic use of this material to explore time and place. In some ways it reminds me a little of the cut-up technique that William S. Burroughs popularized, albeit in a more lyrical, less Dadaist form.

At about the three-minute mark in the interview Cole mentions that he has recently been drawing on content from historic New York newspapers in Chronicling America. For example:

Chronicling America is a software project I work on, so of course we were all really excited to hear Cole mention us on NPR. One thing we wondered was whether he could include shortened URLs for the newspaper pages in Chronicling America in his tweets. Obviously this would be a clever publicity vehicle for the NEH-funded National Digital Newspaper Program. It would also allow Small Fates readers to follow a tweet back to the source material if they wanted more context.

Through the magic of Facebook, Twitter, good old email and Teju’s generosity I got in touch with him to see if he would be willing to include some shortened Chronicling America URLs in his tweets. His response indicated that he had clearly already thought about linking, but had decided not to. His reasons for not linking struck me as really interesting, and he agreed to let me quote them here:

I can’t include links directly in my tweets for three reasons.

The first is aesthetic: I like the way the tweets look as clean sentences. One wouldn’t wish to hyperlink a poem.

The second is artistic: I want people to stay here, not go off somewhere else and crosscheck the story. Why go through all the trouble of compression if they’re just going to go off and read more about it? What’s omitted from a story is, to me, an important part of a writer’s storytelling strategy.

And the third is practical: though I seldom use up all 140 characters, rarely do I have enough room left for a url string, even a shortened one.

I really loved this artistic (and pragmatic) rationale for not linking, and thought you might too.

wikipedia external links: a redis database

As part of my continued meandering linkypedia v2 experiments I created a Redis database of high-level statistics about the host names and top-level domain (TLD) names in external links from Wikipedia articles. Tom Morris happened to mention he has been loading the external links as well (thanks Alf), so I thought I’d make the Redis database dump available to anyone who is interested in looking at it. If you want to give it a whirl try this out:

% wget http://inkdroid.org/data/wikipedia-extlinks.rdb
% sudo aptitude install redis-server
% sudo mv wikipedia-extlinks.rdb /var/lib/redis/dump.rdb
% sudo chown redis:redis /var/lib/redis/dump.rdb
% sudo /etc/init.d/redis-server restart
% sudo pip install redis # or easy_install (version in apt is kinda old)
% python 
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.Redis()
>>> r.zrevrange("hosts", 0, 25, True)
[('toolserver.org', 2360809.0), ('www.ncbi.nlm.nih.gov', 508702.0), ('dx.doi.org', 410293.0), ('commons.wikimedia.org', 408986.0), ('www.imdb.com', 398877.0), ('www.nsesoftware.nl', 390636.0), ('maps.google.com', 346997.0), ('books.google.com', 323111.0), ('news.bbc.co.uk', 214738.0), ('tools.wikimedia.de', 181215.0), ('edwardbetts.com', 168102.0), ('dispatch.opac.d-nb.de', 166322.0), ('web.archive.org', 165665.0), ('www.insee.fr', 160797.0), ('www.iucnredlist.org', 155620.0), ('stable.toolserver.org', 155335.0), ('www.openstreetmap.org', 154127.0), ('d-nb.info', 141504.0), ('ssd.jpl.nasa.gov', 137200.0), ('www.youtube.com', 133827.0), ('www.google.com', 131011.0), ('www.census.gov', 124182.0), ('www.allmusic.com', 117602.0), ('maps.yandex.ru', 114978.0), ('news.google.com', 102111.0), ('amigo.geneontology.org', 95972.0)]
>>> r.zrevrange("hosts:edu", 0, 25, True)
[('nedwww.ipac.caltech.edu', 28642.0), ('adsabs.harvard.edu', 25699.0), ('animaldiversity.ummz.umich.edu', 21747.0), ('www.perseus.tufts.edu', 20438.0), ('genome.ucsc.edu', 20290.0), ('cfa-www.harvard.edu', 14234.0), ('penelope.uchicago.edu', 9806.0), ('www.bucknell.edu', 8627.0), ('www.law.cornell.edu', 7530.0), ('biopl-a-181.plantbio.cornell.edu', 5747.0), ('ucjeps.berkeley.edu', 5452.0), ('plato.stanford.edu', 5243.0), ('www.fiu.edu', 5004.0), ('www.volcano.si.edu', 4507.0), ('calphotos.berkeley.edu', 4446.0), ('www.usc.edu', 4345.0), ('ftp.met.fsu.edu', 3941.0), ('web.mit.edu', 3548.0), ('www.lpi.usra.edu', 3497.0), ('insects.tamu.edu', 3479.0), ('www.cfa.harvard.edu', 3447.0), ('www.columbia.edu', 3260.0), ('www.yale.edu', 3122.0), ('www.fordham.edu', 2963.0), ('www.people.fas.harvard.edu', 2908.0), ('genealogy.math.ndsu.nodak.edu', 2726.0)]
>>> r.zrevrange("tlds", 0, 25, True)
[('com', 11368104.0), ('org', 7785866.0), ('de', 1857158.0), ('gov', 1767137.0), ('uk', 1489505.0), ('fr', 1173624.0), ('ru', 897413.0), ('net', 868337.0), ('edu', 793838.0), ('jp', 733995.0), ('nl', 707177.0), ('pl', 590058.0), ('it', 486441.0), ('ca', 408163.0), ('au', 387764.0), ('info', 296508.0), ('br', 276599.0), ('es', 242767.0), ('ch', 224692.0), ('us', 179223.0), ('at', 163397.0), ('be', 132395.0), ('cz', 92683.0), ('eu', 91671.0), ('ar', 89856.0), ('mil', 87788.0)]
>>> r.zscore("hosts", "www.bbc.co.uk")
56245.0

Basically there are a few sorted sets in there:

  • “hosts”: all the hosts sorted by the number of external links
  • “hosts:%s”: where %s is a top-level domain (“com”, “uk”, etc.)
  • “tlds”: all the TLDs sorted by the number of external links
  • “wikipedia”: the Wikipedia languages sorted by the total number of external links
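
So, for example, the per-TLD and per-language sets can be queried the same way the session above queries “hosts” and “tlds” (the key names are the ones listed above; the scores are just link counts):

import redis

r = redis.Redis()

# top hosts within a single TLD, e.g. .de
print(r.zrevrange("hosts:de", 0, 9, withscores=True))

# Wikipedia languages ranked by their total number of external links
print(r.zrevrange("wikipedia", 0, 9, withscores=True))

# the external link count for one particular host
print(r.zscore("hosts", "news.bbc.co.uk"))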

I’m not exactly sure how portable Redis databases are, but I was able to move this one between a couple of Ubuntu machines, and Ryan successfully looked at it on a Gentoo box he had available. You’ll need roughly 300MB of RAM. I must say I was impressed with Redis, and with sorted sets in particular, for this stats collection task. Thanks to Chris Adams for pointing me in the direction of Redis in the first place.
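
In case you’re curious what the loading side looks like, it’s little more than ZINCRBY against those four kinds of keys. This isn’t the code that built the dump, just a sketch of the idea, and it assumes you already have (language, URL) pairs extracted from the externallinks dumps:

import redis

try:
    from urlparse import urlparse           # Python 2, as in the session above
except ImportError:
    from urllib.parse import urlparse       # Python 3

r = redis.Redis()

def tally(lang, url):
    # bump the sorted set counters for a single external link
    host = urlparse(url).netloc.split(":")[0].lower()
    if not host or "." not in host:
        return
    tld = host.rsplit(".", 1)[-1]
    # note: redis-py >= 3.0 flips the argument order to zincrby(name, amount, value)
    r.zincrby("hosts", host, 1)
    r.zincrby("hosts:%s" % tld, host, 1)
    r.zincrby("tlds", tld, 1)
    r.zincrby("wikipedia", lang, 1)

# e.g. tally("en", "http://news.bbc.co.uk/")

Sorted sets make this kind of counting pleasantly simple, since ZINCRBY creates the member and bumps its score in a single call.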

on preserving bookmarks

While it’s not exactly clear what the future of Delicious is, the recent news about Yahoo closing the doors or selling the building prompted me to look around at other social bookmarking tools, and to revisit some old stomping grounds.

Dan Chudnov has been running Unalog since 2003 (roughly when Delicious started). In fact I can remember Dan and Joshua Schachter having some conversations about the idea of social bookmarking as the two services co-evolved. So my first experience with social bookmarking was on Unalog, but a year or so later I ended up switching to Delicious in 2004 for reasons I can’t quite remember. I think I liked some of the tools that had sprouted up around Delicious, and felt a bit guilty for abandoning Unalog.

Anyhow, I wanted to take the exported Delicious bookmarks and see if I could get them into Unalog. So I set up a dev Unalog environment, created a friendly fork of Dan’s code, and added the ability to POST a chunk of JSON:

curl --user user:pass \
         --header "Content-type: application/json" \
         --data '{"url": "http://example.com", "title": "Example"}' \
         http://unalog.com/entry/new

Here’s a fuller example of the JSON that you can supply:

{
  "url": "http://zombo.com",
  "title": "ZOMBO",
  "comment": "found this awesome website today",
  "tags": "website awesome example",
  "content": "You can do anything at Zombo.com. The only limit is yourself. Etc...",
  "is_private": true
}

The nice thing about Unalog is that if you supply the content field, it will index the text of the resource you’ve bookmarked. This allows you to do a fulltext search over your bookmarked materials.
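
For what it’s worth, here is roughly what that POST looks like from Python using the requests library rather than curl. It’s just a sketch, using the same /entry/new URL and basic auth as the curl example; the username, password and content are made up.

import json

import requests

entry = {
    "url": "http://zombo.com",
    "title": "ZOMBO",
    "tags": "website awesome example",
    # including content is what gets the text indexed for fulltext search
    "content": "You can do anything at Zombo.com. The only limit is yourself.",
}

response = requests.post(
    "http://unalog.com/entry/new",
    auth=("user", "pass"),
    headers={"Content-type": "application/json"},
    data=json.dumps(entry),
)
print(response.status_code)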

So yeah, to make a long story a bit shorter, I created a script that reads in the bookmarks from a Delicious bookmark export (an HTML file) and pushes them up to a Unalog instance. Since the script GETs each bookmark URL in order to send Unalog the content to index, you also get a log of HTTP status codes, which provides the fodder for a linkrot report like the one below (a rough sketch of such a script follows the report):

200 OK 4546
404 Not Found 367
403 Forbidden 141
DNS failure 81
Connection refused 28
500 Internal Server Error 19
503 Service Unavailable 10
401 Unauthorized 9
410 Gone 5
302 Found 5
400 Bad Request 4
502 Bad Gateway 2
412 Precondition Failed 1
300 Multiple Choices 1
201 Created 1
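
Here is a rough sketch of what such a script can look like. It is not the actual script: it assumes the Delicious export is the standard Netscape bookmarks HTML, leans on BeautifulSoup and requests, and the function names are made up.

import json

import requests
from bs4 import BeautifulSoup

def bookmarks(export_file):
    # yield (url, title) pairs from a Delicious bookmark export (HTML)
    soup = BeautifulSoup(open(export_file), "html.parser")
    for a in soup.find_all("a"):
        if a.get("href"):
            yield a["href"], a.get_text()

def push_bookmarks(export_file, user, password):
    for url, title in bookmarks(export_file):
        entry = {"url": url, "title": title}
        # fetch the page so unalog has some text to index, and note the result
        try:
            resp = requests.get(url, timeout=30)
            status = "%s %s" % (resp.status_code, resp.reason)
            entry["content"] = resp.text
        except requests.RequestException as e:
            status = e.__class__.__name__
        print("%s %s" % (status, url))
        requests.post(
            "http://unalog.com/entry/new",
            auth=(user, password),
            headers={"Content-type": "application/json"},
            data=json.dumps(entry),
        )

push_bookmarks("delicious.html", "user", "pass")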

That was 5,220 bookmarks total, collected over 5 years, which initially seemed low until I did the math and figured out it works out to roughly 3 bookmarks a day on average. If we lump together all the non-200 OK responses, that amounts to about 13% linkrot. At first blush this seems significantly better than the research done by Spinelli from 2003 (thanks Mike), which reported 28% linkrot. I would’ve expected the Spinelli results to be better than my haphazard bookmark collection, since he was sampling academic publications on the Web. But he was also sampling links from the 1995-1999 period, while I had links from 2005-2010. I know this is mere conjecture, but maybe we’re learning to do things better on the web with Cool URIs. I’d like to think so at least. Maybe a comparison with some work done by folks at HP and Microsoft would provide some insight.
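
Just to show my arithmetic:

total = 5220   # bookmarks in the delicious export
ok = 4546      # responses that came back 200 OK
years = 5

print(total / (years * 365.0))            # roughly 2.9 bookmarks a day
print(100 * (total - ok) / float(total))  # roughly 12.9%, i.e. about 13% linkrot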

At the very least this was a good reminder of how important this activity of pouring data from one system into another is to digital preservation: what Greg Janée calls relay-supporting preservation.

Most of all I want to echo the comments of former Yahoo employee Stephen Hood, who wrote recently about the value of this unique collection of bookmarks to the web community. If for some reason Yahoo were to close the doors on Delicious, it would be great if they could donate the public bookmarks to the Web somehow: either to a public institution like the Smithsonian or the Library of Congress (full disclosure: I work at the Library of Congress in a Digital Preservation group), or to an organization dedicated to the preservation of the Web like the International Internet Preservation Consortium, of which LC, other national libraries, and the Internet Archive are members.