New York Times Topics as SKOS

Serves 23,376 SKOS Concepts

INGREDIENTS

Text editor: Vim, Emacs, TextMate, etc.
Python
BeautifulSoup
rdflib
Internet connection

DIRECTIONS

Open a new file using your favorite text editor.
Instantiate an RDF graph with a dash of rdflib.
Use Python’s urllib to fetch the HTML for each of the Times Topics Index Pages, e.g. for A.
Parse HTML into a fine, queryable data structure using BeautifulSoup.
Locate topic names and their associated URLs, and gently add them to the graph with a pinch of SKOS.
Go back to step 3 to fetch the next batch of topics, until you’ve finished Z.
Bake the RDF graph as an rdf/xml file.

NOTES

If you don’t feel like cooking up the rdf/xml yourself, you can download it from here (you may want to right-click to download, since some browsers have trouble rendering the XML), or download the 68-line implementation and run it yourself.

The point of this exercise was mainly to show how useful it could be to think of the New York Times Topics as a controlled vocabulary that can be serialized as a file while still being present on the Web. Perhaps someone writing an application that needs to integrate with the New York Times wants to tag content using the same controlled vocabulary. Or perhaps someone wants to link their own content with similar content at the New York Times. These are all use cases for expressing the Topics as SKOS and being able to ship it around with resolvable identifiers for the concepts.

Of course there is one slight wrinkle. Take a look at this Turtle snippet for the concept of Ray Bradbury:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept> a skos:Concept;
    skos:prefLabel "Bradbury, Ray";
    skos:broader <http://topics.nytimes.com/top/reference/timestopics/people#concept>;
    skos:inScheme <http://topics.nytimes.com/top/reference/timestopics#conceptScheme> 
    .

Notice the URI being used for the concept?

http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept

The wrinkle is that there’s no way to get RDF back from this URI currently. But since NYT is already using XHTML, it wouldn’t be hard to sprinkle in some RDFa such that:


...
Ray Bradbury
...

And voilà, you’ve got Linked Data. I took five minutes to mark up the HTML myself and put it here, where you can run it through the RDFa Distiller to get some Turtle. Of course, if the NYT ever decided to alter their HTML to provide this markup, this recipe would be greatly simplified: no more error-prone scraping, since the assertions could be pulled directly out of the HTML.