on preserving bookmarks

While it’s not exactly clear what the future of Delicious is, the recent news about Yahoo shutting it down or selling it off prompted me to look around at other social bookmarking tools, and to revisit some old stomping grounds.

Dan Chudnov has been running Unalog since 2003 (roughly when Delicious started). In fact I can remember Dan and Joshua Schachter having some conversations about the idea of social bookmarking as the two services co-evolved. So my first experience with social bookmarking was on Unalog, but a year or so later, in 2004, I ended up switching to Delicious for reasons I can’t quite remember. I think I liked some of the tools that had sprouted up around Delicious, and felt a bit guilty for abandoning Unalog.

Anyhow, I wanted to take the exported Delicious bookmarks and see if I could get them into Unalog. So I set up a dev Unalog environment, created a friendly fork of Dan’s code, and added the ability to POST a chunk of JSON:

curl --user user:pass \
     --header "Content-type: application/json" \
     --data '{"url": "http://example.com", "title": "Example"}' \
     http://unalog.com/entry/new

Here’s a fuller example of the JSON that you can supply:

{
  "url": "http://zombo.com",
  "title": "ZOMBO",
  "comment": "found this awesome website today",
  "tags": "website awesome example",
  "content": "You can do anything at Zombo.com. The only limit is yourself. Etc...",
  "is_private": true
}

The nice thing about Unalog is that if you supply the content field, Unalog will index the text of the resource you’ve bookmarked. This lets you do a full-text search over your bookmarked materials.

So yeah, to make a long story a bit shorter, I created a script that reads the bookmarks from a Delicious bookmark export (an HTML file) and pushes them up to a Unalog instance. Since the script GETs each bookmark URL in order to send Unalog the content to index, you also get a log of HTTP status codes, which provides the fodder for a linkrot report like this:
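The script itself isn’t included here, but a minimal sketch of the idea looks something like this. The Unalog endpoint and JSON fields come from the curl example above; the credentials, helper names, and the regex-based parsing of the Delicious export (a Netscape-style bookmark HTML file) are my own assumptions, not the actual code:

```python
# Sketch: read a Delicious bookmark export, fetch each page, log the
# status, and push the bookmark (with content) to a Unalog instance.
import base64
import json
import re
import urllib.error
import urllib.request

UNALOG_URL = "http://unalog.com/entry/new"  # endpoint from the curl example
CRED = base64.b64encode(b"user:pass").decode()  # placeholder credentials


def parse_export(html):
    """Yield (url, title) pairs from a Delicious bookmark export file."""
    for m in re.finditer(r'<A HREF="([^"]+)"[^>]*>([^<]*)</A>', html, re.I):
        yield m.group(1), m.group(2)


def fetch(url):
    """GET the bookmarked page; return (status, body) for the linkrot log."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return "%s %s" % (resp.status, resp.reason), \
                   resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as e:
        return "%s %s" % (e.code, e.reason), ""
    except Exception as e:
        return str(e), ""  # DNS failure, connection refused, etc.


def push(url, title, content):
    """POST one bookmark to Unalog as JSON, as in the curl example."""
    body = json.dumps({"url": url, "title": title, "content": content})
    req = urllib.request.Request(
        UNALOG_URL,
        data=body.encode(),
        headers={"Content-type": "application/json",
                 "Authorization": "Basic " + CRED},
    )
    urllib.request.urlopen(req)
```

The driver loop would just be: for each `(url, title)` from `parse_export`, call `fetch`, append the status to the log, and `push` the result.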

200 OK 4546
404 Not Found 367
403 Forbidden 141
DNS failure 81
Connection refused 28
500 Internal Server Error 19
503 Service Unavailable 10
401 Unauthorized 9
410 Gone 5
302 Found 5
400 Bad Request 4
502 Bad Gateway 2
412 Precondition Failed 1
300 Multiple Choices 1
201 Created 1
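A tally like the one above falls out of the status log with a few lines of Python. This is a sketch, assuming the log holds one status string per bookmark:

```python
# Tally status strings from the run log and compute the share of
# responses that were not "200 OK" (the linkrot figure below).
from collections import Counter


def linkrot_report(statuses):
    """Return (status counts, fraction of non-'200 OK' responses)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    rot = 1 - counts.get("200 OK", 0) / total
    return counts, rot


# Illustrative subset of the numbers above, not the full log.
statuses = (["200 OK"] * 4546 +
            ["404 Not Found"] * 367 +
            ["DNS failure"] * 81)
counts, rot = linkrot_report(statuses)
```

Running the same tally over all 5,220 logged statuses gives the 13% figure discussed below (674 non-200 responses out of 5,220).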

That was 5,220 bookmarks total collected over 5 years, which initially seemed low until I did the math and realized it averages out to about 3 bookmarks a day. Lumping together all the non-200 OK responses amounts to 13% linkrot. At first blush this seems significantly better than the research done by Spinellis in 2003 (thanks Mike), which reported 28% linkrot. I would’ve expected the Spinellis results to be better than my haphazard bookmark collection, since he was sampling academic publications on the Web. But he was also sampling links from the 1995-1999 period, while my links were from 2005-2010. I know this is mere conjecture, but maybe we’re learning to do things better on the web with Cool URIs. I’d like to think so, at least. Maybe a comparison with some work done by folks at HP and Microsoft would provide some insight.

At the very least this was a good reminder of how important this activity of pouring data from one system into another is to digital preservation, or what Greg Janée calls relay-supporting preservation.

Most of all I want to echo the comments of former Yahoo employee Stephen Hood, who wrote recently about the value of this unique collection of bookmarks to the web community. If for some reason Yahoo were to close the doors on Delicious, it would be great if they could donate the public bookmarks to the Web somehow: either via a public institution like the Smithsonian or the Library of Congress (full disclosure: I work at the Library of Congress in a digital preservation group), or via an organization dedicated to the preservation of the Web, like the International Internet Preservation Consortium, of which LC, other national libraries, and the Internet Archive are members.