broken wordpress links

Internet Archive recently announced their new Availability API for checking if a representation for a given URL is in their archive with a simple HTTP call. In addition to the API they highlighted a few IA related projects, including a Wordpress plugin called Broken Link Checker which will check the links in your Wordpress site, and offer to fix any broken ones using an Internet Archive URL, if it is available based on a call to the Availability API.

I installed the plugin here and let it run for a bit. It detected 3898 unique URLs in 4910 links of which 482 are broken. This amounts to 12% link rot … but there were also 1038 redirects that resulted in a 200 OK ; so there may be a fair bit of reference rot lurking there. The plugin itself doesn’t provide a summary of HTTP status codes for the “broken URLS” but they are listed one by one in the broken link report. Since I could see the HTTP status codes in the table, I figured out you can easily log into your Wordpress database and run a query like this to get a summary:

mysql> select http_code, count(*) from wp_blc_links where broken is true group by http_code;
+-----------+----------+
| http_code | count(*) |
+-----------+----------+
|         0 |      113 |
|       400 |        9 |
|       403 |       15 |
|       404 |      333 |
|       410 |        5 |
|       416 |        1 |
|       500 |        3 |
|       502 |        1 |
|       503 |        2 |
+-----------+----------+

I’m assuming the 113 (23% of the broken links) are DNS lookup failure, or connection failures. Once the broken links are identified, you have to manually curate each link to decide whether you want to link out to the Internet Archive based on whether it’s possible, and whether the most recent link is appropriate or not. This can take some time, but it is useful given I uncovered a number of fat-fingered URLs that looked like they never worked, which I was able to fix. Of the legitimately broken URLs, 136 URLs (~28%) were available at the Internet Archive. Once you’ve decided to use an IA URL the plugin can quickly rewrite the original content without requiring you to do in and tweak the content yourself.

One thing that would be nice would be for the API to be queried for a representation of the resource based on when the post was authored. For example my Snakes and Rubies post had a broken link to http://snakesandrubies.com and the plugin correctly found that it was available at the Internet Archive with an API query like:

% curl --silent 'http://archive.org/wayback/available?url=snakesandrubies.com' | python -mjson.tool
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "status": "200",
            "timestamp": "20130131030609",
            "url": "http://web.archive.org/web/20130131030609/http://snakesandrubies.com/"
        }
    }
}

When requesting that URL you get this hit in the archive: http://web.archive.org/web/20130131030609/http://snakesandrubies.com/ but that’s an archive of a cyerbsquatted version of the page:

If the timestamp of the blogpost were used perhaps a better representation like this could’ve been found automatically, or at least it could have been offered first?

Based on the svn log for the plugin it appears to have been 2007-10-08 and has been downloaded 2,099,072 times since then. When people gripe about the Web being broken by design I think it’s good to remember that tools like this exist to help make it better, one website and link at a time.