top hosts referenced in english wikipedia

I’ve recently been experimenting a bit to provide some tools that let libraries, archives and museums see how Wikipedians are using their content as primary source material. I didn’t actually anticipate the interest in having a specialized tool like linkypedia to monitor who is using your institution’s content on Wikipedia. So the demo site is having some scaling problems, not the least of which is the feeble VM it is running on. That’s why I wanted to make the code available for other people to run where it made sense, at least until I’ve had some time to think through how to scale it better.

Anyhow, I wanted to get a handle on just how many external links there are in the full snapshot of English Wikipedia. A month or so ago Jakob Voss pointed me at the External Links SQL dump over at Wikipedia as a possible way to circumvent heavy use of the Wikipedia API, by providing a baseline to update against. So I thought I could just suck this down, import it into MySQL, and run some analysis to see how many links there were and how they were concentrated by hostname.

Sucking down the file didn’t take too long. But the MySQL import of the dump had run for about 24 hours (on my laptop) before I killed it. On a hunch I peeked into the 4.5GB SQL file and noticed that the table had several indexes defined. So I went through some contortions with csplit to remove the indexes from the DDL, and lo and behold it loaded in something like 20 minutes. Then I wrote some Python to query the database, pull out each external link URL, extract the hostname, and write it out through a unix pipeline to count up the unique hostnames:

./hosts.py | sort -S 1g | uniq -c | sort -rn > enwiki-externallinks-hostnames.txt
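
In case it’s useful to anyone else staring down a slow import, the index removal boils down to something like the filter below. I actually did it by hacking at the file with csplit and a text editor, but a little script gets the same result. This is just a sketch: the strip_indexes.py name is only for illustration, and it assumes mysqldump’s usual one-definition-per-line CREATE TABLE layout.

#!/usr/bin/env python
# strip_indexes.py (illustrative name): drop the secondary KEY definitions
# from the CREATE TABLE in a mysqldump file so the data loads without
# building indexes along the way; the indexes can be added back afterwards.
# Usage: python strip_indexes.py dump.sql > dump-noindex.sql

import sys
import fileinput

create_block = []
in_create = False

for line in fileinput.input():
    if line.startswith("CREATE TABLE"):
        in_create = True
    if not in_create:
        sys.stdout.write(line)
        continue
    if line.strip().startswith("KEY "):
        continue                          # drop the index definition
    create_block.append(line)
    if line.startswith(")"):              # end of the CREATE TABLE statement
        # dropping the KEY lines can leave a dangling comma on the last column
        create_block[-2] = create_block[-2].rstrip().rstrip(",") + "\n"
        sys.stdout.write("".join(create_block))
        create_block = []
        in_create = False

And hosts.py itself is nothing fancy. The gist of it is something like this (a sketch rather than the exact script, assuming the dump got loaded into a local database called enwiki with the stock externallinks table and its el_to column):

#!/usr/bin/env python
# hosts.py, more or less: stream every external link URL out of the
# externallinks table and print its hostname, one per line, for the
# sort | uniq -c pipeline to count. Adjust the connection settings to taste.

import sys
import urlparse

import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(db="enwiki", user="root", passwd="")
# use a server side cursor so we don't try to hold 30 million rows in memory
cursor = db.cursor(MySQLdb.cursors.SSCursor)
cursor.execute("SELECT el_to FROM externallinks")

for (url,) in cursor:
    netloc = urlparse.urlsplit(url)[1]
    # strip any user:pass@ prefix and :port suffix to get a bare hostname
    host = netloc.split("@")[-1].split(":")[0].lower()
    if host:
        sys.stdout.write(host + "\n")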

The sort | uniq -c | sort -rn part is a little unix trick my old boss Fred Lindberg taught me years ago, and it still works remarkably well: 30,127,734 URLs were sorted into 2,162,790 unique hostnames in another 20 minutes or so. If you are curious, the full output is available here. The number 1 host was toolserver.org with 3,169,993 links. This wasn’t too surprising since it is a hostname heavily used by Wikipedians as they go about their business. Next was www.google.com at 2,117,967 links, many of which appeared to be canned searches. This wasn’t terribly exciting either. So I removed toolserver.org and www.google.com (so as not to visually skew things too much), and charted the rest of the top 100:

Top 100-2 Hostnames in Wikipedia External Links

I figured that could be of some interest to somebody, sometime. I didn’t find similar current stats available anywhere on the web, but if you know of them please let me know. The high rankings of www.ncbi.nlm.nih.gov and dx.doi.org were pleasant surprises. I did a little superficial digging and found some fascinating bots like Citation Bot and ProteinBoxBot, which seem to trawl external article databases looking for appropriate Wikipedia pages to add links to. Kind of amazing.

federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register, located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936, in the depths of the Great Depression, as a way to communicate in one place all that the agencies were doing to try to jump-start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes spent browsing the new site and comparing it to the old one makes it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed-down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one-off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down the silo walls, and that every well designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one-off’ dedicated web applications.

As a software developer working in the federal government there were a few things about the Federal Register 2.0 that I found really exciting:

  • Fruitful collaboration between federal employees and citizen activists/geeks, initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, iCal feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for the search-related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can still be discovered via links (see the point about permalinks below). They have also made a sitemap available so crawlers can efficiently discover URLs for the content.
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up-front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leapfrog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register in social media sites like Facebook and Twitter, out on the open web in blogs, and in collaborative applications like Wikipedia.

I would like to see more bulk access to the XML data made available for re-purposing on other websites, although I guess one could walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich that it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations. Perhaps they could even follow the lead of the London Gazette and try to make some of the structured metadata available in the HTML using RDFa. It also looks like content is only available from 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.
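
Just to make that feed-walking idea a bit more concrete, it could look something like the sketch below. The feed URL and the .xml suffix here are guesses purely for illustration, so check the site for the actual feed locations and per-entry XML links:

#!/usr/bin/env python
# sketch of walking a syndicated feed and pulling down the detailed XML
# for each entry. The feed URL and the ".xml" suffix are hypothetical;
# check federalregister.gov for the real feed and XML locations.

import sys
import urllib2

import feedparser

FEED_URL = "http://www.federalregister.gov/articles.rss"  # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # guess: the detailed XML lives alongside the HTML permalink
    xml_url = entry.link.rstrip("/") + ".xml"
    try:
        xml = urllib2.urlopen(xml_url).read()
    except urllib2.HTTPError as e:
        sys.stderr.write("couldn't fetch %s: %s\n" % (xml_url, e))
        continue
    # stash the XML away for re-use; here we just report what we got
    sys.stdout.write("%s (%d bytes)\n" % (xml_url, len(xml)))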

But the great thing about what these folks have done is that now I can fork the project on GitHub, see how easy it is to make the changes I’d like to see, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts, so hats off to everyone who helped make this happen.