data dumps

As usual, the following comments are the reflections of a software developer working at the Library of Congress and are not an official statement of my employer.

One of the challenges that we’ve had at the National Digital Newspaper Program’s website Chronicling America has been access to data. At the surface level, Chronicling America is a conventional web application that provides access to millions of pages of historic newspapers. Here “access” means a researcher’s ability to browse to each newspaper, issue and page, as well as to search across the OCR text for each page.

Digging a bit deeper, “access” also means programmatic access via a Web API. Chronicling America’s API enables custom software to issue queries using the popular OpenSearch protocol, and it also makes URL-addressable data available using principles of Linked Data. In addition, the website makes the so-called “batch” data that each NDNP awardee sends to the Library of Congress available on the Web. The advantage of making the batch data available is that 3rd parties are then able to build their own custom search indexes on top of the data, so their own products and services don’t have a runtime dependency on our Web API. Researchers can also choose to index things differently, perform text mining operations, or conduct other experiments. Each batch contains JPEG 2000, PDF, OCR XML and METS XML data for all the newspaper content; it is in fact the very same data that the Chronicling America web application ingests. The batch data views make it possible for interested parties to crawl the content using wget or some similar tool that talks HTTP, and fetch a lot of newspaper data.

But partly because of NDNP’s participation in the NEH’s Digging Into Data program, as well as interest from other individuals and organizations, we’ve recently started making data dumps of the OCR content available. This same OCR data is available as part of the batch data mentioned above, but the dumps provide two new things:

  1. The ability to download a small set of large compressed files with checksums to verify their transfer, as opposed to having to issue HTTP GETs for millions of uncompressed files with no verification.
  2. The ability to easily map each of the OCR files to their corresponding URL on the web. While it is theoretically possible to extract the right bits from the METS XML data in the batch data, the best expression of how to do this is encapsulated in the Chronicling America ingest code, and it is non-trivial.

So when you download, decompress and untar one of the files you will end up with a directory structure like this:

sn86063381/
|-- 1908
|   |-- 01
|   |   |-- 01
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml
|   |   |-- 02
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml

...

The pattern here is:

{lccn}/{year}/{month}/{day}/{edition}/{sequence}/

If you don’t work in a library: an lccn is a Library of Congress Control Number, a unique ID for each newspaper title. Each archive file is laid out in a similar way, so you can process each .tar.bz2 file and end up with a complete snapshot of the OCR data on your filesystem. The pattern maps pretty easily to URLs of the format:


http://chroniclingamerica.loc.gov/lccn/{lccn}/{year}-{month}-{day}/{edition}/{sequence}/

This is an obvious use case for a pattern like PairTree, but there was some perceived elegance to using paths that were a bit more human readable, and easier on the filesystem, which stands a good chance of not being ZFS.
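
For the sake of illustration, here is a small Python sketch of that mapping (the function name and regular expression are mine, not code from Chronicling America):

import re

def page_url(ocr_path):
    """
    Map a dump path like sn86063381/1908/01/01/ed-1/seq-1/ocr.xml to the
    corresponding Chronicling America page URL, following the
    {lccn}/{year}/{month}/{day}/{edition}/{sequence} layout described above.
    """
    m = re.match(
        r"(?P<lccn>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
        r"(?P<ed>ed-\d+)/(?P<seq>seq-\d+)/",
        ocr_path,
    )
    if not m:
        return None
    return ("http://chroniclingamerica.loc.gov/lccn/%(lccn)s/"
            "%(year)s-%(month)s-%(day)s/%(ed)s/%(seq)s/" % m.groupdict())

print(page_url("sn86063381/1908/01/01/ed-1/seq-1/ocr.xml"))
# http://chroniclingamerica.loc.gov/lccn/sn86063381/1908-01-01/ed-1/seq-1/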

Another side effect of having a discrete set of files to download is that each dump file can be referenced in an Atom feed, so that you can keep your snapshot up to date with a little bit of automation. Here’s a snippet of the feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
    <title>Chronicling America OCR Data Feed</title>
    <link rel="self" type="application/atom+xml" href="http://chroniclingamerica.loc.gov/ocr/feed/" />
    <id>info:lc/ndnp/ocr</id>
    <author>
        <name>Library of Congress</name>
        <uri>http://loc.gov</uri>
    </author>
    <updated>2012-09-20T10:34:02-04:00</updated>
 
    <entry>
        <title>part-000292.tar.bz2</title>
        <link rel="enclosure" length="650169965" hash="sha1:bb7fa00e8e07041501a9703bf85afbe5040e3448" type="application/x-bzip2" href="http://chroniclingamerica.loc.gov/data/ocr/part-000292.tar.bz2" />
        <id>info:lc/ndnp/dump/ocr/part-000292.tar.bz2</id>
        <updated>2012-09-20T10:34:02-04:00</updated>
        <summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">OCR dump file <a href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-000292.tar.bz2">part-000292.tar.bz2</a> with size 620.1 MB generated Sept. 20, 2012, 10:34 a.m.</div></summary>
    </entry>
 
    ...
 
</feed>

As you can see it’s a pretty vanilla Atom feed that should play nicely with whatever feed reader or library you are using. You may notice that the <link> element has some attributes you might not be used to seeing. The enclosure relation and the length attribute come directly from RFC 4287, and give clients an idea that the referenced resource might be on the large side. The hash attribute is a generally useful attribute from James Snell’s Atom Link Extensions IETF draft.
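
If you want to consume the feed from code, here is a rough Python sketch (mine, not part of Chronicling America) that pulls the enclosure links, with their length and hash attributes, out of the feed using the standard library and requests:

import xml.etree.ElementTree as ET

import requests

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_URL = "http://chroniclingamerica.loc.gov/ocr/feed/"

# fetch and parse the Atom feed
feed = ET.fromstring(requests.get(FEED_URL).content)

for entry in feed.findall(ATOM + "entry"):
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "enclosure":
            # length comes from RFC 4287; hash comes from the Atom Link
            # Extensions draft, e.g. "sha1:bb7fa00e..."
            print("%s %s %s" % (link.get("href"), link.get("length"),
                                link.get("hash")))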

If parsing XML is against your religion, there’s also a JSON-flavored feed that looks like:

{
  "ocr": [
    {
      "url": "http://chroniclingamerica.loc.gov/data/ocr/part-000337.tar.bz2",
      "sha1": "fd73d8e1df33015e06739c897bd9c08a48294f82",
      "size": 283454353,
      "name": "part-000337.tar.bz2",
      "created": "2012-09-21T06:56:35-04:00"
    },
    ...
  ]
}
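
To give a sense of how the dumps and checksums fit together, here is a rough Python sketch that walks a feed shaped like the JSON above, downloads each dump file and verifies its sha1 as it goes (the JSON_FEED_URL placeholder is mine; substitute whatever URL the site advertises for this feed):

import hashlib

import requests

JSON_FEED_URL = "..."  # placeholder for the JSON feed's URL

def download(entry):
    """Download one dump file from the feed and verify its sha1 checksum."""
    sha1 = hashlib.sha1()
    resp = requests.get(entry["url"], stream=True)
    with open(entry["name"], "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            sha1.update(chunk)
            out.write(chunk)
    if sha1.hexdigest() != entry["sha1"]:
        raise Exception("checksum mismatch for %s" % entry["name"])

feed = requests.get(JSON_FEED_URL).json()
for entry in feed["ocr"]:
    download(entry)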

Again, I guess we could’ve kicked the tires on the emerging ResourceSync specification to similar effect. But ResourceSync is definitely still in development, and, well, Atom is a pretty nice Internet standard for publishing changes. Syndication technologies like RSS and Atom have already been used by folks like Wikipedia for publishing the availability of data dumps. ResourceSync seems intent on using Zip for compressing dump files, while bzip2 is common enough, and enough better than Zip, that it seemed worth diverging. In some ways this blog post has turned into a when-to-eschew-digital-library-standards post, in favor of more mainstream or straightforward patterns. I didn’t actually plan that, but those of you who know me probably are not surprised.

If you plan to use the OCR dumps I, and others on the NDNP team, would love to hear from you. One of the big problems with them so far is that there is no explicit statement about how the data is in the public domain, which it is. I’m hopeful this can be rectified soon. If you have feedback on the use of Atom here I would be interested in that too. But the nice thing about using it is really how uncontroversial it is, so I doubt I’ll hear much feedback on that front.

NoDB

Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:

  1. I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I’m using for other things like this blog.
  2. I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available, and even if I had I didn’t have Liam’s idea of using them in this way.

#1 was easily rectified, since I still had the database lying around and had just disabled the Linkypedia Apache vhost. I brought it back up and added www.nma.gov.au to the list of hosts to monitor. #2 was surprisingly easy to solve as well, with a somewhat hastily written Django management command that pulls down the hourly stats and rifles through them looking for links to sites that Linkypedia manages. Incidentally, the Python requests library makes efficiently iterating through large amounts of gzip-compressed text at a URL very simple indeed.
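
The management command itself isn’t much to look at, but the streaming trick is worth showing. Roughly (this is a sketch, not the actual Linkypedia code; hour_url and pages_of_interest are stand-ins):

import gzip

import requests

def pageview_lines(url):
    """
    Stream one of the hourly gzip-compressed pagecount dumps, yielding one
    decoded line at a time instead of holding the whole file in memory.
    Each line looks like: project page_title view_count bytes
    """
    resp = requests.get(url, stream=True)
    with gzip.GzipFile(fileobj=resp.raw) as f:
        for line in f:
            yield line.decode("utf-8", "replace")

# stand-ins for illustration only
hour_url = "http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-09/pagecounts-20120921-060000.gz"
pages_of_interest = {"National_Museum_of_Australia"}

counts = {}
for line in pageview_lines(hour_url):
    parts = line.split()
    if len(parts) == 4 and parts[0] == "en" and parts[1] in pages_of_interest:
        counts[parts[1]] = counts.get(parts[1], 0) + int(parts[2])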

After the management command had run for a few days and the stats for 2012 had been added to the Linkypedia database, I was able to see that the top 10 most accessed Wikipedia articles that linked to the National Museum were: Mick Jagger, Cricket, Kangaroo, Victorian Era, Ned Kelly, James Cook, Thylacine, Indigenous Australians, Playing Card and Emu … if you are curious, the full list is available with counts, as well as similar lists for the Museum of Modern Art, the Biodiversity Heritage Library, Wikileaks, Thomas and other sites which Linkypedia also monitors somewhat sporadically.

So this was interesting…but it’s not actually what I set out to write about today.

While I had seen aggregate reports of Wikipedia page view data, prior to Liam’s email I didn’t know that hourly dumps of page view statistics existed. I recently did a series of experiments with realtime Wikipedia data, so naturally I wondered what might be do-able with the page-view stats. The data is gzip compressed and space delimited, which made it a perfect fit for noodling around in the Unix shell with curl, cut, sort, etc. Before much time passed I had a bash script that could run from cron every hour and dump out the top 1000 accessed English Wikipedia pages as a JSON file:

#!/bin/bash
 
# fetch.sh is a shell script to run from cron that will download the latest 
# hour's page view statistics and write out the top 1000 to a JSON file
# you will probably want to run it sufficiently after the top of the hour
# so that the file is likely to be there, e.g.
#
# 30 * * * * cd /home/ed/Projects/wikitrends/; ./fetch.sh
 
# first, get some date info
 
year=`date -u +%Y`
month=`date -u +%m`
day=`date -u +%d`
hour=`date -u +%H`
 
# this is the expected URL for the pagecount dump file
 
url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0000.gz" 
 
# sometimes the filenames have a timestamp of 1 second instead of 0 
# so if 0000.gz isn't there try using 0001.gz instead
 
curl -f -s -I $url > /dev/null
retval=$?
if [ $retval -ne 0 ]; then
    url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0001.gz" 
fi
 
# create a directory and filename for the JSON
 
json_dir="data/$year/$month/$day"
json_file="$json_dir/$hour.json"
mkdir -p $json_dir
 
# fetch the data and write out the stats!
 
curl --silent $url | \
    gunzip -c | \
    egrep '^en ' | \
    perl -ne '@cols=split/ /; print "$cols[2] $cols[1]\n";' | \
    sort -rn | \
    head -n 1000 | \
    ./jsonify.py > $json_file
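
The ./jsonify.py helper at the end of that pipeline isn’t reproduced here, but something along these lines, reading the "count page_title" lines on stdin and writing them out as JSON, would do the job (this is my own reconstruction and field naming, not necessarily what Wikitrends actually does):

#!/usr/bin/env python
"""Read "view_count page_title" lines on stdin and write them out as JSON."""

import json
import sys

pages = []
for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2 or not parts[0].isdigit():
        continue
    count, title = parts
    pages.append({"title": title, "views": int(count)})

json.dump(pages, sys.stdout, indent=2)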

And a short time later I had some pretty simple HTML and JavaScript that could use the JSON to display the top 25 accessed Wikipedia articles from the last hour.

I put it up on GitHub as Wikitrends. It was kind of surprising at first to see how prominently mainstream media (mostly television) figures in the statistics. Perhaps it was a function of me working on the code at night, when “normal” people are typically watching TV and looking up stuff related to the shows they are watching. I did notice some odd things pop up in the list occasionally, and found myself googling to see if there was recent news on the topic.

To help provide some context I added flyovers so that when you hover over the article title you will see the summary of the Wikipedia article. Behind the scenes the JavaScript looks up the article using the Wikipedia API and extracts the summary. This got me thinking that it could also be useful to include some links to canned searches at Google (last hour), Twitter and Facebook to provide context for the spike that the Wikipedia article is seeing. Perhaps it would be more interesting to see this information flow by somehow on the side…
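
The flyover code itself is JavaScript running in the browser, but the same lookup is easy to sketch in Python against the MediaWiki extracts API (the parameters below are the ones I would reach for, not necessarily the exact call Wikitrends makes):

import requests

def summary(title):
    """Fetch the plain-text introductory extract of a Wikipedia article."""
    resp = requests.get("http://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "format": "json",
        "titles": title,
    })
    pages = resp.json()["query"]["pages"]
    # the result is keyed by page id, so just take the first (only) page
    return list(pages.values())[0].get("extract", "")

print(summary("Thylacine"))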

A nice side effect of this minimalist (hereby dubbed NoDB … take that, NoSQL!) approach to developing the Wikitrends app is that I have uniquely named, URL-addressable JSON files for the top 1000 accessed English Wikipedia articles every hour at http://inkdroid.org/wikitrends/data. Even better, the JSON files get archived at GitHub. Now don’t take this too seriously: it’s late (or early) and I’m really just making a lame joke. I’m sure having a database would be useful for trend analysis and whatnot. But for my immediate purposes it wasn’t really needed.
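
If the public URLs mirror the data/{year}/{month}/{day}/{hour}.json layout that fetch.sh writes (a sketch under that assumption), grabbing a given hour is just a URL template away:

import requests

def top_pages(year, month, day, hour):
    """
    Fetch one hour of Wikitrends data, assuming the public URLs mirror the
    data/{year}/{month}/{day}/{hour}.json layout that fetch.sh writes.
    """
    url = "http://inkdroid.org/wikitrends/data/%04d/%02d/%02d/%02d.json" % (
        year, month, day, hour)
    return requests.get(url).json()

# e.g. top_pages(2012, 9, 21, 6) would fetch the 06:00 UTC snapshot
# for September 21, 2012, if that hour has been archived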

So, Wikitrends is another Wikipedia stats curiosity app. At the very least I got a chance to use JavaScript a bit more seriously by working with Underscore and the very slick async library. Perhaps there are some ways you can think of to make Wikitrends more useful or interesting. If so, please let me know.

And, since I haven’t said it enough before: thank you Wikimedia for taking such a pragmatic approach to making your data available. It is an inspiration.