Jodi Schneider pointed out to me in an email that my previous post about the top 100 hosts referenced in Wikipedia may have been slightly off balance, since it counted links from *all* pages on Wikipedia (talk pages, files, etc.) rather than only links in articles. The indicator for her was the high ranking of www.google.com, which seemed odd for the article space.
So I downloaded enwiki-latest-page.sql.gz, loaded it into the database, and joined against it in my query to come up with a new list restricted to article pages. Jodi was right: the new list is a lot more interesting.
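For anyone curious what that join might look like, here is a minimal sketch rather than the query I actually ran. It assumes the older MediaWiki table layout, where `externallinks` has `el_from` (the linking page's id) and `el_to` (the URL), and `page` has `page_id` and `page_namespace`, with namespace 0 being the article space. It uses an in-memory SQLite database and a few made-up rows in place of the real dumps, and the function name `top_article_hosts` is hypothetical.

```python
import sqlite3
from collections import Counter
from urllib.parse import urlsplit


def top_article_hosts(conn, limit=100):
    """Count external-link hosts, keeping only links from article pages."""
    rows = conn.execute(
        """
        SELECT el.el_to
        FROM externallinks AS el
        JOIN page AS p ON p.page_id = el.el_from
        WHERE p.page_namespace = 0  -- 0 is the main (article) namespace
        """
    )
    hosts = Counter()
    for (url,) in rows:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            hosts[host] += 1
    return hosts.most_common(limit)


# Toy stand-in for the real dumps: the rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                       page_namespace INTEGER,
                       page_title TEXT);
    CREATE TABLE externallinks (el_from INTEGER, el_to TEXT);
    """
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [
        (1, 0, "Example_article"),  # article
        (2, 1, "Example_article"),  # talk page
        (3, 2, "Example_user"),     # user page
    ],
)
conn.executemany(
    "INSERT INTO externallinks VALUES (?, ?)",
    [
        (1, "http://adsabs.harvard.edu/abs/x"),
        (1, "http://www.worldcat.org/oclc/1"),
        (1, "http://adsabs.harvard.edu/abs/y"),
        (2, "http://www.google.com/search?q=a"),
        (3, "http://www.google.com/search?q=b"),
        (3, "http://www.facebook.com/someone"),
    ],
)

for host, count in top_article_hosts(conn):
    print(host, count)
```

In this toy data the links from the talk page and the user page drop out of the count, which is the same effect the namespace filter had on the real list.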
This removed a lot of the interwiki links between the English Wikipedia and other language Wikipedias (which would be interesting to look at in their own right). It also removed administrative links to things like www.dnsstuff.com. Also interesting is that it removed www.facebook.com from the list, which was probably linked to mostly from user profile pages. The neat thing is that it introduced new sites into the top 100, like the following:
We can see a lot more pop culture media present: newspapers, magazines, and sporting information. We can also see research-oriented websites like worldcat.org, ssd.jpl.nasa.gov, and adsabs.harvard.edu make it into the top 100.
I work for the US federal government, so I was interested to see which .gov domains were in the top 100:
This points to the importance of federal biomedical, geospatial, scientific, demographic and biographical information to Wikipedians. It would be interesting to take a look at higher education institutions at some point. Doing these one-off reports is giving me some ideas about what linkypedia could turn into. Thanks, Jodi.
top hosts referenced in wikipedia (part 2) by Ed Summers, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 4.0 International License.