top hosts referenced in wikipedia (part 2)

Jodi Schneider pointed out to me in an email that my previous post about the top 100 hosts referenced in wikipedia may have been slightly off balance since it counted *all* pages on wikipedia (talk pages, files, etc), and was not limited to only links in articles. The indicator for her was the high ranking of www.google.com, which seemed odd to her in the article space.

So I downloaded the enwiki-latest-page.sql.gz dump, loaded it in, and then joined against it in my query to come up with a new list. Jodi was right, it’s a lot more interesting.

This removed a lot of the interwiki links between the English wikipedia and other language wikipedias (which would be interesting to look at in their own right). It also removed administrative links to things like www.dnsstuff.com. Also interesting is that it removed www.facebook.com from the list, which was probably linked to from user profile pages. The neat thing is that it introduced new sites into the top 100, like the following:

adsabs.harvard.edu
bioguide.congress.gov
cfa-www.harvard.edu
eclipse.gsfc.nasa.gov
openjurist.org
select.nytimes.com
ssd.jpl.nasa.gov
worldcat.org
www1.arbitron.com
www.animenewsnetwork.com
www.cbc.ca
www.cricinfo.com
www.cricketarchive.com
www.discogs.com
www.expasy.org
www.fifa.com
www.gutenberg.org
www.history.navy.mil
www.hockeydb.com
www.imagesofengland.org.uk
www.independent.co.uk
www.jstor.org
www.leighrayment.com
www.mtv.com
www.nfl.com
www.nhm.ac.uk
www.nps.gov
www.racingpost.com
www.radio-locator.com
www.reuters.com
www.rollingstone.com
www.rsssf.com
www.soccerbase.com
www.usatoday.com
www.variety.com

We can see a lot more pop culture media present: newspapers, magazines, sporting information. We can also see research-oriented websites like worldcat.org, ssd.jpl.nasa.gov, and adsabs.harvard.edu make it into the top 100.

I work for the US federal government so I was interested to look at what .gov domains were in the top 100:

hostname links
www.ncbi.nlm.nih.gov 419816
www.pubmedcentral.nih.gov 62134
geonames.usgs.gov 57423
factfinder.census.gov 48530
www.census.gov 33018
www.nr.nps.gov 25962
www.fcc.gov 25941
ssd.jpl.nasa.gov 20178
eclipse.gsfc.nasa.gov 20063
bioguide.congress.gov 18880
www.nlm.nih.gov 15115
www.nps.gov 12196

This points to the importance of federal biomedical, geospatial, scientific, demographic and biographical information to wikipedians. It would be interesting to take a look at higher education institutions at some point. Doing these one-off reports is giving me some ideas about what linkypedia could turn into. Thanks Jodi.
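For anyone who wants to try this kind of one-off report on a small scale, here is a minimal Python sketch of the approach: join the external links against the page table, keep only namespace 0 (articles), count hostnames, and then pull out a suffix like .gov. It is not the code behind the numbers above. SQLite stands in for whatever database you load the dumps into, the rows are made up, the `el_to` column name is an assumption about where the dump keeps the link URL, and `host_counts` and `with_suffix` are hypothetical helper names.

```python
import sqlite3
from collections import Counter
from urllib.parse import urlsplit

# Toy stand-ins for the externallinks and page dumps (made-up rows, not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER);
    CREATE TABLE externallinks (el_from INTEGER, el_to TEXT);
    INSERT INTO page VALUES (1, 0), (2, 0), (3, 1);  -- namespace 0 = articles, 1 = talk
    INSERT INTO externallinks VALUES
        (1, 'http://www.nps.gov/a'),
        (2, 'http://www.nps.gov/b'),
        (2, 'http://www.jstor.org/c'),
        (3, 'http://www.google.com/search?q=x');
""")


def host_counts(conn, namespace=0):
    """Count link hostnames for pages in a single namespace."""
    rows = conn.execute(
        "SELECT el_to FROM externallinks "
        "JOIN page ON externallinks.el_from = page.page_id "
        "WHERE page.page_namespace = ?",
        (namespace,),
    )
    counts = Counter()
    for (url,) in rows:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            counts[host] += 1
    return counts


def with_suffix(counts, suffix):
    """Keep only hosts ending in the given suffix, e.g. '.gov'."""
    return {host: n for host, n in counts.items() if host.endswith(suffix)}


counts = host_counts(conn)
print(counts.most_common())
print(with_suffix(counts, ".gov"))
```

Swapping `".gov"` for `".edu"` would be one rough way to start on the higher education question, though plenty of universities outside the US would not match that suffix.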

7 thoughts on “top hosts referenced in wikipedia (part 2)”

  1. Thanks for posting these results.

    I think you’ve posted the original list again rather than the updated one, because toolserver.org is the top entry. It also appears to be served with the wrong MIME type (or perhaps it is compressed twice?).

    This is just the tip of the iceberg and it would be interesting to see what kinds of additional insights could be teased out. Some of the things that come to mind include identifying patterns used by bots and excluding/segregating them so that you can see more organic results, aggregating equivalent domains (e.g. Google Books in different TLDs), reversing domains and aggregating subtotals (gov.nih.*, gov.nps.*), aggregating by type/subject (e.g. recensement.insee.fr & census.gov), etc, etc. Days of fun for someone who wants to slice & dice the data!

  2. Actually toolserver.org was in the top of the results when limiting to articles as well. I just removed it from my barchart again.

    As far as I can see with curl, the content-type header for the list is “application/x-gzip” which is right?

    ed@curry:~/Projects/openpub/examples$ curl -I http://inkdroid.org/data/enwiki-externallinks-hostnames-articles-only.txt.gz
    HTTP/1.1 200 OK
    Date: Thu, 26 Aug 2010 02:03:31 GMT
    Server: Apache/2.2.14 (Ubuntu)
    Last-Modified: Wed, 25 Aug 2010 11:30:30 GMT
    ETag: "f8013-b44b97-48ea4357ce180"
    Accept-Ranges: bytes
    Content-Length: 11815831
    Vary: Accept-Encoding
    Connection: close
    Content-Type: application/x-gzip
    

    Those are some good ideas for further analysis, thanks!

  3. Ed, when you say “referenced in Wikipedia” do you mean actually used as a “References” or “Notes” link to substantiate a citation, or do you simply mean “external link” of any variety, anywhere on an article page?

    If the latter, might I suggest you use some different terminology in this blog post?

    Great work, by the way. It’s the sort of stuff that the Wikimedia Foundation itself should be doing, but they’re too busy hiring new staff to not do much.

  4. @thekosher, well by external links I mean links included in enwiki-latest-externallinks.sql.gz made available by en.wikipedia.org. It’s my understanding that these are referenced citations, and not just any link on an article page. The stats in this post also used the enwiki-latest-page.sql.gz dump to limit to only external links in articles with the following SQL:

    -- Keep only external links that come from pages in namespace 0 (articles).
    SELECT *
    FROM externallinks
    JOIN page ON externallinks.el_from = page.page_id
    WHERE page.page_namespace = 0;
    

    I hope that clarifies things somewhat.
