data dumps

As usual, the following comments are the reflections of a software developer working at the Library of Congress and are not an official statement of my employer.

One of the challenges that we’ve had at the National Digital Newspaper Program’s website Chronicling America has been access to data. At the surface level Chronicling America is a conventional web application that provides access to millions of pages of historic newspapers. Here “access” means a researcher’s ability to browse to each newspaper, issue and page, as well as search across the OCR text for each page.

Digging a bit deeper, “access” also means programmatic access via a Web API. Chronicling America’s API enables custom software to issue queries using the popular OpenSearch protocol, and it also makes URL-addressable data available using principles of Linked Data. In addition, the website makes the so-called “batch” data that each NDNP awardee sends to the Library of Congress available on the Web. The advantage of making the batch data available is that third parties can build their own custom search indexes on top of the data, so their products and services don’t have a runtime dependency on our Web API. Researchers can also choose to index things differently, perform text mining operations, or conduct other experiments. Each batch contains JPEG 2000, PDF, OCR XML and METS XML data for all the newspaper content; it is in fact the very same data that the Chronicling America web application ingests. The batch data views make it possible for interested parties to crawl the content using wget or a similar tool that talks HTTP, and fetch a lot of newspaper data.
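As a rough illustration, a query against the page search API might look something like the sketch below. The endpoint path and the proxtext/format/rows parameters are assumptions on my part, so treat them as a starting point to check against the site’s API documentation rather than gospel.

import json
import urllib.request

# A sketch, not an official client: search the OCR text for a phrase and
# print a few of the matching pages. The endpoint and parameter names are
# assumptions to be verified against the Chronicling America API docs.
url = ("http://chroniclingamerica.loc.gov/search/pages/results/"
       "?proxtext=thomas+edison&format=json&rows=5")

with urllib.request.urlopen(url) as response:
    results = json.load(response)

for item in results.get("items", []):
    print(item.get("date"), item.get("title"), item.get("id"))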

But partly because of NDNP’s participation in the NEH’s Digging Into Data program, as well as interest from other individuals and organizations, we’ve recently started making data dumps of the OCR content available. This same OCR data is available as part of the batch data mentioned above, but the dumps provide two new things:

  1. The ability to download a small set of large compressed files with checksums to verify their transfer, as opposed to having to issue HTTP GETs for millions of uncompressed files with no verification.
  2. The ability to easily map each of the OCR files to their corresponding URL on the web. While it is theoretically possible to extract the right bits from the METS XML data in the batch data, the best expression of how to do this is encapsulated in the Chronicling America ingest code, and is non-trivial.

So when you download, decompress, and untar one of the files, you will end up with a directory structure like this:

sn86063381/
|-- 1908
|   |-- 01
|   |   |-- 01
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml
|   |   |-- 02
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml

...

The pattern here is:

{lccn}/{year}/{month}/{day}/{edition}/{sequence}/

If you don’t work in a library, an lccn is a Library of Congress Control Number, which is a unique ID for each newspaper title. Each archive file is laid out in the same way, so you can process each .tar.bz2 file and end up with a complete snapshot of the OCR data on your filesystem. The pattern maps pretty easily to URLs of the format:


http://chroniclingamerica.loc.gov/lccn/{lccn}/{year}-{month}-{day}/{edition}/{sequence}/

This is an obvious use case for a pattern like PairTree, but there was some perceived elegance to using paths that were a bit more human readable, and easier on the filesystem, which stands a good chance of not being ZFS.
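In case it’s useful, here’s a minimal sketch of that mapping in Python, assuming the dump layout and URL pattern shown above:

import re

def ocr_path_to_url(path):
    """Map a path from an OCR dump, e.g.
    sn86063381/1908/01/01/ed-1/seq-1/ocr.txt, to the corresponding page URL
    on chroniclingamerica.loc.gov. A sketch based on the patterns above.
    """
    m = re.match(
        r"(?P<lccn>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"
        r"/(?P<ed>ed-\d+)/(?P<seq>seq-\d+)/ocr\.(txt|xml)$", path)
    if not m:
        raise ValueError("unrecognized path: %s" % path)
    return ("http://chroniclingamerica.loc.gov/lccn/%(lccn)s/"
            "%(year)s-%(month)s-%(day)s/%(ed)s/%(seq)s/" % m.groupdict())

print(ocr_path_to_url("sn86063381/1908/01/01/ed-1/seq-1/ocr.txt"))
# http://chroniclingamerica.loc.gov/lccn/sn86063381/1908-01-01/ed-1/seq-1/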

Another side effect of having a discrete set of files to download is that each dump file can be referenced in an Atom feed, so that you can keep your snapshot up to date with a little bit of automation. Here’s a snippet of the feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
    <title>Chronicling America OCR Data Feed</title>
    <link rel="self" type="application/atom+xml" href="http://chroniclingamerica.loc.gov/ocr/feed/" />
    <id>info:lc/ndnp/ocr</id>
    <author>
        <name>Library of Congress</name>
        <uri>http://loc.gov</uri>
    </author>
    <updated>2012-09-20T10:34:02-04:00</updated>
 
    <entry>
        <title>part-000292.tar.bz2</title>
        <link rel="enclosure" length="650169965" hash="sha1:bb7fa00e8e07041501a9703bf85afbe5040e3448" type="application/x-bzip2" href="http://chroniclingamerica.loc.gov/data/ocr/part-000292.tar.bz2" />
        <id>info:lc/ndnp/dump/ocr/part-000292.tar.bz2</id>
        <updated>2012-09-20T10:34:02-04:00</updated>
        <summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">OCR dump file <a href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-000292.tar.bz2">part-000292.tar.bz2</a> with size 620.1 MB generated Sept. 20, 2012, 10:34 a.m.</div></summary>
    </entry>
 
    ...
 
</feed>

As you can see it’s a pretty vanilla Atom feed that should play nicely with whatever feed reader or library you are using. You may notice that the <link> element has some attributes you might not be used to seeing. The rel="enclosure" and length attributes come directly from RFC 4287, and give clients an idea that the referenced resource might be on the large side. The hash attribute is a generally useful attribute from James Snell’s Atom Link Extensions IETF draft.
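Here’s a minimal sketch of that “little bit of automation” using nothing but the standard library; the element and attribute names come from the snippet above, and it assumes the hash attribute always carries a sha1: prefix.

import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_URL = "http://chroniclingamerica.loc.gov/ocr/feed/"

# List each dump file's URL, size and sha1 from the Atom feed, which is
# enough information to decide what to fetch and how to verify it.
with urllib.request.urlopen(FEED_URL) as response:
    feed = ET.parse(response)

for entry in feed.iter(ATOM + "entry"):
    link = entry.find(ATOM + "link[@rel='enclosure']")
    sha1 = link.attrib["hash"].split(":", 1)[1]  # e.g. "sha1:bb7fa00e..."
    print(link.attrib["href"], link.attrib["length"], sha1)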

If parsing XML is against your religion, there’s also a JSON flavored feed that looks like:

{
  "ocr": [
    {
      "url": "http://chroniclingamerica.loc.gov/data/ocr/part-000337.tar.bz2",
      "sha1": "fd73d8e1df33015e06739c897bd9c08a48294f82",
      "size": 283454353,
      "name": "part-000337.tar.bz2",
      "created": "2012-09-21T06:56:35-04:00"
    },
    ...
  ]
}
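Keeping a local snapshot up to date is then not much code. Here’s a sketch that works against the JSON structure above; the feed URL is a placeholder, and the checksum handling re-verifies every file after download.

import hashlib
import json
import os
import urllib.request

# Placeholder URL for the JSON flavored feed; the keys used below
# (ocr, name, url, sha1) match the snippet shown above.
FEED_URL = "http://chroniclingamerica.loc.gov/ocr.json"

with urllib.request.urlopen(FEED_URL) as response:
    feed = json.load(response)

for dump in feed["ocr"]:
    if not os.path.exists(dump["name"]):
        urllib.request.urlretrieve(dump["url"], dump["name"])
    # verify the transfer against the published sha1
    sha1 = hashlib.sha1()
    with open(dump["name"], "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    if sha1.hexdigest() != dump["sha1"]:
        print("checksum mismatch: %s" % dump["name"])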

Again, I guess we could’ve kicked the tires on the emerging ResourceSync specification to similar effect. But ResourceSync is definitely still in development, and, well, Atom is a pretty nice Internet standard for publishing changes. Syndication technologies like RSS and Atom have already been used by folks like Wikipedia for publishing the availability of data dumps. ResourceSync seems intent on using Zip for compressing dump files, and bzip2 is common enough, and enough better than Zip, that it’s worth diverging. In some ways this blog post has turned into a when-to-eschew-digital-library-standards post, in favor of more mainstream or straightforward patterns. I didn’t actually plan that, but those of you who know me are probably not surprised.

If you plan to use the OCR dumps, I, and others on the NDNP team, would love to hear from you. One of the big problems with them so far is that there is no explicit statement that the data is in the public domain, which it is. I’m hopeful this can be rectified soon. If you have feedback on the use of Atom here I would be interested in that too. But the nice thing about using it is really how uncontroversial it is, so I doubt I’ll hear much feedback on that front.

Confessions of a Graph Addict

Today I’m going to be at the annual conference of the American Library Association for a pre-conference about Libraries and Linked Data. I’m going to try talking about how Linked Data, and particularly the graph data structure, fits the way catalogers have typically thought about bibliographic information. Along the way I’ll include some specific examples of Linked Data projects I’ve worked on at the Library of Congress, and gesture at work that remains to be done.

Tomorrow there’s an unconference style event at ALA to explore what Linked Data means for Libraries. The pre-conference today is booked up, but the event tomorrow is open to the public, so please consider dropping by if you are interested and in the DC area.

New York Times Topics as SKOS

Serves 23,376 SKOS Concepts

INGREDIENTS

DIRECTIONS

  1. Open a new file using your favorite text editor.
  2. Instantiate an RDF graph with a dash of rdflib.
  3. Use python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
  4. Parse HTML into a fine, queryable data structure using BeautifulSoup.
  5. Locate topic names and their associated URLs, and gently add them to the graph with a pinch of SKOS.
  6. Go back to step 3 to fetch the next batch of topics, until you’ve finished Z.
  7. Bake the RDF graph as an rdf/xml file.
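
For the impatient, here is a condensed sketch of those directions. It is not the 68 line implementation linked in the notes below, and the index page URLs and link filtering are assumptions, since the Times’ markup is liable to change.

import string
import urllib.request

from bs4 import BeautifulSoup
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
graph = Graph()
graph.bind("skos", SKOS)

for letter in string.ascii_uppercase:
    # the index_X.html pattern is an assumption; adjust to the real
    # Times Topics index page URLs
    url = ("http://topics.nytimes.com/top/reference/timestopics/"
           "index_%s.html" % letter)
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].split("?")[0]
        if "/top/reference/timestopics/" not in href:
            continue
        concept = URIRef(href + "#concept")
        graph.add((concept, RDF.type, SKOS.Concept))
        graph.add((concept, SKOS.prefLabel,
                   Literal(anchor.get_text(strip=True))))

graph.serialize("nyt_topics.rdf", format="xml")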

NOTES

If you don’t feel like cooking up the rdf/xml yourself you can download it from here (you might want to right-click to download, since some browsers have trouble rendering the xml), or download the 68 line implementation and run it yourself.
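Once you have the file locally, consuming it with rdflib is equally short; a sketch, assuming the rdf/xml file is named nyt_topics.rdf as in the snippet above:

from rdflib import Graph, Literal, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

# Load the baked rdf/xml and look up the concept URI for a label, which is
# the basic move behind tagging your own content with the Times' vocabulary.
graph = Graph()
graph.parse("nyt_topics.rdf", format="xml")

for concept in graph.subjects(SKOS.prefLabel, Literal("Bradbury, Ray")):
    print(concept)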

The point of this exercise was mainly to show how thinking of the New York Times Topics as a controlled vocabulary, one that can be serialized as a file and still be present on the Web, could be useful. Perhaps to someone writing an application that needs to integrate with the New York Times and wants to be able to tag content using the same controlled vocabulary. Or perhaps to someone who wants to link their own content with similar content at the New York Times. These are all use cases for expressing the Topics as SKOS, and being able to ship it around with resolvable identifiers for the concepts.

Of course there is one slight wrinkle. Take a look at this Turtle snippet for the concept of Ray Bradbury:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept> a skos:Concept;
    skos:prefLabel "Bradbury, Ray";
    skos:broader <http://topics.nytimes.com/top/reference/timestopics/people#concept>;
    skos:inScheme <http://topics.nytimes.com/top/reference/timestopics#conceptScheme> 
    .

Notice the URI being used for the concept?


http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept

The wrinkle is that there’s no way to get RDF back from this URI currently. But since NYT is already using XHTML, it wouldn’t be hard to sprinkle in some RDFa such that:

<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
...
<h1 about="http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept" property="skos:prefLabel">Ray Bradbury</h1>
...
</html>

And voilà, you’ve got Linked Data. I took the 5 minutes to mark up the HTML myself and put it here, which you can run through the RDFa Distiller to get some Turtle. Of course, if the NYT ever decided to alter their HTML to provide this markup, this recipe would be simplified greatly: no more error-prone scraping, since the assertions could be pulled directly out of the HTML.