I’ve recently been experimenting a bit with tools that let libraries, archives and museums see how Wikipedians are using their content as primary source material. I didn’t anticipate the interest in having a specialized tool like linkypedia to monitor who is using your institution’s content on Wikipedia. So the demo site is having some scaling problems, not the least of which is the feeble VM it’s running on. That’s why I wanted to make the code available for other people to run where it made sense, at least until I’ve had some time to think through how to scale it better.
Anyhow, I wanted to get a handle on just how many external links there are in the full snapshot of English Wikipedia. A month or so ago Jakob Voss pointed me at the External Links SQL dump over at Wikipedia as a way to circumvent heavy use of Wikipedia’s API, by providing a baseline to update against. So I figured I could just suck it down, import it into MySQL, and run some queries to see how many links there were and how they were concentrated by hostname.
Sucking down the file didn’t take too long. But the MySQL import of the dump had been running for about 24 hours (on my laptop) before I killed it. On a hunch I peeked into the 4.5G SQL file and noticed that the table had several indexes defined. So I went through some contortions with csplit to strip the index definitions out of the DDL, and lo and behold the dump loaded in something like 20 minutes. Then I wrote some Python to query the database, pull out each external link URL, extract the hostname from the URL, and write it out through a unix pipeline to count up the unique hostnames:
./hosts.py | sort -S 1g | uniq -c | sort -rn > enwiki-externallinks-hostnames.txt
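Backing up a step: the csplit contortions amounted to filtering the index definitions out of the CREATE TABLE statement. Here’s a rough sketch of the same idea in Python rather than csplit, assuming the dump’s usual one-definition-per-line DDL formatting (the toy schema below is simplified, not the dump’s exact DDL):

```python
def strip_indexes(ddl):
    """Drop secondary index definitions (KEY / UNIQUE KEY lines) from a
    CREATE TABLE statement, so a bulk load doesn't maintain the indexes
    row by row. Indexes can be recreated after the load finishes."""
    kept = [line for line in ddl.splitlines()
            if not line.strip().startswith(("KEY ", "UNIQUE KEY "))]
    # dropping the last column definitions can leave a dangling comma
    # on the line just before the closing ")", so clean that up too
    fixed = []
    for i, line in enumerate(kept):
        if (line.rstrip().endswith(",") and i + 1 < len(kept)
                and kept[i + 1].lstrip().startswith(")")):
            line = line.rstrip().rstrip(",")
        fixed.append(line)
    return "\n".join(fixed)
```

The same trick applies to any big MySQL dump: load first, add indexes afterwards.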
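As for hosts.py itself, it isn’t shown above, but a minimal sketch looks something like the following. The hostname extraction is the real work; the database bits (MySQLdb, a local enwiki database, the externallinks table with its el_to column) are assumptions about my setup:

```python
from urllib.parse import urlparse

def hostname(url):
    """Pull the hostname out of an external link URL, or return None
    for malformed values (the dump contains plenty of those)."""
    if isinstance(url, bytes):  # el_to is a blob, so rows may come back as bytes
        url = url.decode("utf-8", "replace")
    try:
        return urlparse(url.strip()).hostname
    except ValueError:
        return None

def dump_hostnames():
    """Stream every el_to URL out of MySQL and print its hostname,
    one per line, ready for the sort | uniq -c pipeline above."""
    import MySQLdb  # assumption: mysqlclient installed, enwiki loaded locally
    from MySQLdb.cursors import SSCursor
    db = MySQLdb.connect(db="enwiki", cursorclass=SSCursor)
    cur = db.cursor()  # server-side cursor: don't buffer 30M rows in memory
    cur.execute("SELECT el_to FROM externallinks")
    for (url,) in cur:
        host = hostname(url)
        if host:
            print(host)
```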
This is a little unix trick my old boss Fred Lindberg taught me years ago, and it still works remarkably well: 30,127,734 urls were sorted into 2,162,790 unique domains in another 20 minutes or so. If you are curious, the full output is available here. The number 1 host was toolserver.org with 3,169,993 links. This wasn’t too surprising, since it is a hostname heavily used by wikipedians as they go about their business. Next was www.google.com at 2,117,967 links, which appeared to be quite a few canned searches. This wasn’t terribly exciting either. So I removed toolserver.org and www.google.com (so as not to visually skew things too much), and charted the rest of the top 100:
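For comparison, the same tally can be sketched in Python with collections.Counter, though the unix pipeline has the advantage that sort -S spills to disk instead of holding all 30 million hostnames in memory:

```python
from collections import Counter

def top_hosts(hostnames, n=10):
    """Count hostname occurrences and return the n most common,
    like `sort | uniq -c | sort -rn | head -n`."""
    return Counter(hostnames).most_common(n)

# toy sample standing in for the 30M real hostnames
hosts = ["toolserver.org", "www.google.com", "toolserver.org",
         "dx.doi.org", "toolserver.org", "www.google.com"]
print(top_hosts(hosts, 3))
# [('toolserver.org', 3), ('www.google.com', 2), ('dx.doi.org', 1)]
```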
I figured that could be of some interest to somebody, sometime. I didn’t find similar current stats available anywhere else on the web, but if you know of any please let me know. The high rankings of www.ncbi.nlm.nih.gov and dx.doi.org were pleasant surprises. I did a little superficial digging and found some fascinating bots, like Citation Bot and ProteinBoxBot, which seem to trawl external article databases looking for appropriate Wikipedia pages to add links to. Kind of amazing.