NoDB

Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:

I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I’m using for other things like this blog.
I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available, and even if I had I didn’t have Liam’s idea of using them in this way.

1 was easily rectified since I still had the database lying around and had just disabled the Linkypedia Apache vhost. I brought it up, and added www.nma.gov.au to the list of hosts to monitor. #2 was surprisingly easy to solve as well, with a somewhat hastily written Django management command that pulls down the hourly stats and rifles through them looking for links to sites that Linkypedia manages. Incidentally the Python requests library makes efficiently iterating through large amounts of gzip compressed text at a URL very simple indeed.

After the management command had run for a few days and the stats for 2012 had been added to the Linkypedia database, I was able to see that the top 10 most accessed Wikipedia articles that linked to the National Museum were: Mick Jagger, Cricket, Kangaroo, Victorian Era, Ned Kelly, James Cook, Thylacine, Indigenous Australians and Playing Card and Emu … if you are curious the full list is available with counts, as well as similar lists for the Museum of Modern Art, the Biodiversity Heritage Library, Wikileaks, Thomas and other sites which Linkypedia also monitors somewhat sporadically.

So this was interesting…but it’s not actually what I set out to write about today.

While I had seen aggregate reports of Wikipedia page view data, prior to Liam’s email I didn’t know that hourly dumps of page view statistics existed. I recently did a series of experiments with realtime Wikipedia data, so naturally I wondered what might be do-able with the page-view stats. The data is gzip compressed and space delimited, which made it a perfect fit for noodling around in the Unix shell with curl, cut, sort, etc. Before much time passed I had a bash script that could run from cron every hour and dump out the top 1000 accessed English Wikipedia pages as a JSON file:

#!/bin/bash

# fetch.sh is a shell script to run from cron that will download the latest 
# hour's page view statistics and write out the top 1000 to a JSON file
# you will probably want to run it sufficiently after the top of the hour
# so that the file is likely to be there, e.g.
#
# 30 * * * * cd /home/ed/Projects/wikitrends/; ./fetch.sh

# first, get some date info

year=`date -u +%Y`
month=`date -u +%m`
day=`date -u +%d`
hour=`date -u +%H`

# this is the expected URL for the pagecount dump file

url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0000.gz" 

# sometimes the filenames have a timestamp of 1 second instead of 0 
# so if 0000.gz isn't there try using 0001.gz instead

curl -f -s -I $url > /dev/null
retval=$?
if [ $retval -ne 0 ]; then
    url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0001.gz" 
fi

# create a directory and filename for the JSON

json_dir="data/$year/$month/$day"
json_file="$json_dir/$hour.json"
mkdir -p $json_dir

# fetch the data and write out the stats!

curl --silent $url | \
    gunzip -c | \
    egrep '^en ' | \
    perl -npe '@cols=split/ /; print "$cols[2] $cols[1]\n";' | \
    sort -rn | \
    head -n 1000 | \
    ./jsonify.py > $json_file

And a short time later after that I had some pretty simple HTML and JavaScript that could use the JSON to display the top 25 accessed Wikipedia articles from the last hour.

Which I put it up on GitHub as Wikitrends. It was kind of surprising at first to see how prevalent mainstream media (mostly television) figures into the statistics. Perhaps it was a function of me working on the code at night when “normal” people are typically watching TV and looking up stuff related to the shows they are looking at. I did notice some odd things pop up in the list occasionally and found myself googling to see if there was recent news on the topic.

To help provide some context I added flyovers so that when you hover over the article title you will see the summary of the Wikipedia article. Behind the scenes the JavaScript looks up the article using the Wikipedia API and extracts the summary. This got me thinking that it could also be useful to include some links to canned searches at Google (last hour), Twitter and Facebook to provide context for the spike that the Wikipedia article is seeing. Perhaps it would be more interesting to see this information flow by somehow on the side…

A nice side effect of this minimalist (hereby dubbed NoDB…take that NoSQL!) approach to developing the Wikitrends app is that I have uniquely named, URL addressable JSON files for the top 1000 accessed English Wikipedia articles every hour at http://inkdroid.org/wikitrends/data. Even better, the JSON files even get archived at GitHub. Now don’t take this seriously, it’s late (or early) and I’m really just making a really lame joke. I’m sure having a database would be useful for trend analysis and whatnot. But for my immediate purposes it wasn’t really needed.

So, Wikitrends is another Wikiepdia stats curiosity app. At the very least I got a chance to use JavaScript a bit more seriously by working with Underscore and the very slick async library. Perhaps there are some ways you can think of to make Wikitrends more useful or interesting. If so please let me know.

And, since I haven’t said it enough before: thank you Wikimedia for taking such a pragmatic approach to making your data available. It is an inspiration.