NoDB
Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically, Liam was interested in seeing the list of articles that reference the National Museum, sorted by how often they are viewed on Wikipedia. This presented me with two problems:
1. I turned Linkypedia off a few months ago, since the site hadn't been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I'm using for other things like this blog.
2. I hadn't incorporated Wikipedia page view statistics into Linkypedia, because I didn't know they were available, and even if I had I didn't have Liam's idea of using them in this way.
#1 was easily rectified since I still had the database lying around and had just disabled the Linkypedia Apache vhost. I brought it back up and added www.nma.gov.au to the list of hosts to monitor. #2 was surprisingly easy to solve as well, with a somewhat hastily written Django management command that pulls down the hourly stats and rifles through them looking for links to sites that Linkypedia manages. Incidentally, the Python requests library makes efficiently iterating through large amounts of gzip compressed text at a URL very simple indeed.
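To give a rough idea of what that iteration looks like, here is a minimal sketch, not the actual management command. It assumes each line of the dump is "project title views bytes" separated by spaces, and the function names and the set of watched titles are made up for illustration. It uses only the standard library and takes any binary file-like object; with requests, my assumption is that you would pass in the raw attribute of a response fetched with stream=True.

```python
import gzip
import io
from collections import Counter


def iter_pagecounts(fileobj):
    """Yield (project, title, views) from a gzip-compressed pagecounts stream.

    fileobj is any binary file-like object. Lines are decompressed and
    parsed one at a time, so the whole dump is never held in memory.
    """
    with gzip.GzipFile(fileobj=fileobj) as gz:
        for raw in gz:
            parts = raw.decode("utf-8", errors="replace").split()
            if len(parts) < 3:
                continue  # skip malformed lines
            try:
                views = int(parts[2])
            except ValueError:
                continue  # view count was not a number
            yield parts[0], parts[1], views


def count_views(fileobj, watched_titles, project="en"):
    """Sum views for the watched titles (a hypothetical set of article names)."""
    totals = Counter()
    for proj, title, views in iter_pagecounts(fileobj):
        if proj == project and title in watched_titles:
            totals[title] += views
    return totals
```

Calling count_views with an open stream and a set such as {"Emu", "Kangaroo"} returns a Counter keyed by article title.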
After the management command had run for a few days and the stats for 2012 had been added to the Linkypedia database, I was able to see that the top 10 most accessed Wikipedia articles that linked to the National Museum were: Mick Jagger, Cricket, Kangaroo, Victorian Era, Ned Kelly, James Cook, Thylacine, Indigenous Australians, Playing Card and Emu. If you are curious, the full list is available with counts, as are similar lists for the Museum of Modern Art, the Biodiversity Heritage Library, Wikileaks, Thomas and other sites which Linkypedia also monitors somewhat sporadically.
So this was interesting… but it's not actually what I set out to write about today.
While I had seen aggregate reports of Wikipedia page view data, prior to Liam's email I didn't know that hourly dumps of page view statistics existed. I recently did a series of experiments with realtime Wikipedia data, so naturally I wondered what might be do-able with the page-view stats. The data is gzip compressed and space delimited, which made it a perfect fit for noodling around in the Unix shell with curl, cut, sort, etc. Before much time passed I had a bash script that could run from cron every hour and dump out the top 1000 accessed English Wikipedia pages as a JSON file:
#!/bin/bash
# fetch.sh is a shell script to run from cron that will download the latest
# hour's page view statistics and write out the top 1000 to a JSON file
# you will probably want to run it sufficiently after the top of the hour
# so that the file is likely to be there, e.g.
#
# 30 * * * * cd /home/ed/Projects/wikitrends/; ./fetch.sh
# first, get some date info
year=$(date -u +%Y)
month=$(date -u +%m)
day=$(date -u +%d)
hour=$(date -u +%H)
# this is the expected URL for the pagecount dump file
url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0000.gz"
# sometimes the filenames have a timestamp of 1 second instead of 0
# so if 0000.gz isn't there try using 0001.gz instead
curl -f -s -I "$url" > /dev/null
retval=$?
if [ $retval -ne 0 ]; then
url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0001.gz"
fi
# create a directory and filename for the JSON
json_dir="data/$year/$month/$day"
json_file="$json_dir/$hour.json"
mkdir -p "$json_dir"
# fetch the data and write out the stats!
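# keep only English Wikipedia lines, emit "count title", sort by count
# (perl -ne rather than -npe, so the original line is not printed as well)
curl --silent "$url" | \
gunzip -c | \
grep '^en ' | \
perl -ne '@cols = split / /; print "$cols[2] $cols[1]\n";' | \
sort -rn | \
head -n 1000 | \
./jsonify.py > "$json_file"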
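The jsonify.py filter at the end of the pipeline isn't shown here, so the following is only a guess at its shape, not the real script. It assumes the input is the "count title" lines produced by the sort and head steps, and the output layout (a list of objects with "title" and "count" keys) is my own choice for illustration.

```python
import json
import sys


def parse_ranked_lines(lines):
    """Turn 'count title' lines into a list of {"title", "count"} dicts.

    The order of the input is preserved, so lines that arrive sorted by
    count stay ranked in the output.
    """
    pages = []
    for line in lines:
        count, _, title = line.strip().partition(" ")
        if not count.isdigit() or not title:
            continue  # skip blank or malformed lines
        pages.append({"title": title, "count": int(count)})
    return pages


def write_json(lines, out):
    """Write the parsed lines to the file-like object out as a JSON array."""
    json.dump(parse_ranked_lines(lines), out, indent=2)
    out.write("\n")


# As a filter in the shell pipeline this would be called as:
#     write_json(sys.stdin, sys.stdout)
```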
A short time after that I had some pretty simple HTML and JavaScript that could use the JSON to display the top 25 accessed Wikipedia articles from the last hour.
I put it up on GitHub as Wikitrends. It was kind of surprising at first to see how prominently mainstream media (mostly television) figures in the statistics. Perhaps it was a function of me working on the code at night, when "normal" people are typically watching TV and looking up stuff related to the shows they are watching. I did notice some odd things pop up in the list occasionally and found myself googling to see if there was recent news on the topic.
To help provide some context I added flyovers so that when you hover over the article title you will see the summary of the Wikipedia article. Behind the scenes the JavaScript looks up the article using the Wikipedia API and extracts the summary. This got me thinking that it could also be useful to include some links to canned searches at Google (last hour), Twitter and Facebook to provide context for the spike that the Wikipedia article is seeing. Perhaps it would be more interesting to see this information flow by somehow on the side…
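For the curious, the summary lookup can be sketched roughly like this. Wikitrends does it in JavaScript; this is a Python rendering of the same idea, and the particular query (prop=extracts with exintro and explaintext) is my assumption about one way to ask the API for an introduction, not necessarily the call the app makes. The function names and the User-Agent string are invented for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def summary_url(title):
    """Build an API URL asking for the plain-text intro of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "redirects": 1,
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def extract_summary(response):
    """Pull the first extract out of a decoded API response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_summary(title):
    """Fetch and return the summary for title (makes a network request)."""
    req = Request(summary_url(title), headers={"User-Agent": "wikitrends-sketch/0.1"})
    with urlopen(req) as resp:
        return extract_summary(json.load(resp))
```

Splitting the URL building and the response parsing from the network call keeps the two fiddly parts easy to check without hitting Wikipedia.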
A nice side effect of this minimalist (hereby dubbed NoDB… take that NoSQL!) approach to developing the Wikitrends app is that I have uniquely named, URL addressable JSON files for the top 1000 accessed English Wikipedia articles every hour at http://inkdroid.org/wikitrends/data. Even better, the JSON files get archived at GitHub. Now don't take this seriously; it's late (or early) and I'm just making a really lame joke. I'm sure having a database would be useful for trend analysis and whatnot. But for my immediate purposes it wasn't really needed.
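Even without a database, a little trend analysis is possible by reading the hourly files back in. The sketch below sums a day's worth of them. It assumes the data/YYYY/MM/DD/HH.json layout that fetch.sh writes, and it assumes each file holds a JSON list of objects with "title" and "count" keys, which is a guess on my part about the file contents; the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def merge_hours(hourly_pages):
    """Sum view counts across several hours.

    hourly_pages is an iterable of lists, one per hour, each holding
    dicts with "title" and "count" keys (an assumed file layout).
    """
    totals = Counter()
    for pages in hourly_pages:
        for page in pages:
            totals[page["title"]] += page["count"]
    return totals


def load_day(data_dir, year, month, day):
    """Yield the parsed contents of each hourly JSON file for one day."""
    day_dir = Path(data_dir) / f"{year:04d}" / f"{month:02d}" / f"{day:02d}"
    for path in sorted(day_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield json.load(fh)


def daily_top(data_dir, year, month, day, n=25):
    """Return the n most viewed (title, count) pairs for one day."""
    return merge_hours(load_day(data_dir, year, month, day)).most_common(n)
```

Note that an article only contributes to the total for the hours in which it made the top 1000, so the sums are a lower bound rather than true daily counts.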
So, Wikitrends is another Wikipedia stats curiosity app. At the very least I got a chance to use JavaScript a bit more seriously by working with Underscore and the very slick async library. Perhaps there are some ways you can think of to make Wikitrends more useful or interesting. If so, please let me know.
And, since I haven't said it enough before: thank you Wikimedia for taking such a pragmatic approach to making your data available. It is an inspiration.