polling and pushing with the Times Newswire API

I’ve been continuing to play around with Node.js some. Not because it’s the only game in town, but mainly because of the renaissance that’s going in the JavaScript community, which is kind of fun and slightly addictive. Ok, I guess that makes me a fan-boy, whatever…

So the latest in my experiments is nytimestream, which is a visualization (ok, it’s just a list) of New York Times headlines using the Times Newswire API. When I saw Derek Willis recently put some work into a Ruby library for the API I got to thinking what it might be like to use Node.js and Socket.IO to provide a push stream of updates. It didn’t take too long. I actually highly doubt anyone is going to use nytimestream much. So you might be wondering why I bothered to create it at all. I guess it was kind more of an academic exercise than anything to reinforce some things that Node.js has been teaching me.

Normally if you wanted a web page to dynamically update based on events elsewhere you’d have some code running in the browser routinely poll a webservice for updates. In this scenario our clients (c1, c2 and c3) poll the Times Newswire directly:

But what happens if lots of people start using your application? Yup, you get lots of requests going to the web service…which may not be a good thing, particularly if you are limited to a certain number of requests per day.

So a logical next step is to create a proxy for the webservice, which will reduce hits on the Times Newswire API.

But still, the client code needs to poll for updates. This can result in the proxy web service needing to field lots of requests as the number of clients increases. You can poll less, but that will diminish the real time nature of your app. If you are interested in having the real time updates in your app in the first place this probably won’t seem like a great solution.

So what if you could have the proxy web service push updates to the clients when it discovers an update?

This is basically what an event-driven webservice application allows you to do (labelled NodeJS in the diagram above). Node’s Socket.IO provides a really nice abstraction around streaming updates in the browser. If you view source on nytimestream you’ll see a bit of code like this:

var socket = io.connect();
socket.on('connect', function() {
  socket.on('story', function(story) {
    addStory(story);
    removeOld();
    fadeList();
  });
});

story is a JavaScript object that comes directly from the proxy webservice as a chunk of JSON. I’ve got the app running on Heroku, which currently recommends Socket.IO be configured to only do long polling (xhr-polling). Socket.IO actually supports a bunch of other transports suitable for streaming, including web sockets. xhr-polling basically means the browser keeps a connection open to the server until an update comes down, after which it quickly reconnects to wait for the next update. This is still preferable to constant polling, especially for the NYTimes API which often sees 15 minutes or more go by without an update. Keeping connections open like this can be expensive in more typical web stacks where each connection translates into a thread or process. But this is what Node’s non-blocking IO programming environment fixes up for you.

Just because I could, I added a little easter egg view in nytimestream, which allows you to see new stories come across the wire as JSON when nytimestream discovers them. It’s similar to Twitter’s stream API in that you can call it with curl. It’s different in that, well, there’s hardly the same amount of updates. Try it out with:

curl http://nytimestream.herokuapp.com/stream/

The occasional newlines are there to prevent the connection from timing out.

New York Times Topics as SKOS

Serves 23,376 SKOS Concepts

INGREDIENTS

DIRECTIONS

  1. Open a new file using your favorite text editor.
  2. Instantiate an RDF graph with a dash of rdflib.
  3. Use python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
  4. Parse HTML into a fine, queryable data structure using BeautifulSoup.
  5. Locate topic names and their associated URLs, and gently add them to the graph with a pinch of SKOS.
  6. Go back to step 3 to fetch the next batch of topics, until you’ve finished Z.
  7. Bake the RDF graph as an rdf/xml file.

NOTES

If you don’t feel like cooking up the rdf/xml yourself you can download it from here (might want to right-click to download, some browsers might have trouble rendering the xml), or download the 68 line implementation and run it yourself.

The point of this exercise was mainly to show how thinking of the New York Times Topics as a controlled vocabulary, that can be serialized as a file, and still present on the Web, could be useful. Perhaps to someone writing an application that needs to integrate with the New York Times and who want to be able to tag content using the same controlled vocabulary. Or perhaps someone wants to be able to link your own content with similar content at the New York Times. These are all use cases for expressing the Topics as SKOS, and being able to ship it around with resolvable identifiers for the concepts.

Of course there is one slight wrinkle. Take a look at this Turtle snippet for the concept of Ray Bradbury:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@Prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept> a skos:Concept;
    skos:prefLabel "Bradbury, Ray";
    skos:broader <http://topics.nytimes.com/top/reference/timestopics/people#concept>;
    skos:inScheme <http://topics.nytimes.com/top/reference/timestopics#conceptScheme> 
    .

Notice the URI being used for the concept?


http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept

The wrinkle is that there’s no way to get RDF back from this URI currently. But since NYT is already using XHTML, it wouldn’t be hard to sprinkle in some RDFa such that:

<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
...
<h1 about="http://topics.nytimes.com/top/reference/timestopics/people/b/ray_bradbury#concept" property="skos:prefLabel">Ray Bradbury</h1>
...
</html>

And voila you’ve got Linked Data. I took the 5 minutes to mark up the HTML myself and put it here which you can run through the RDFa Distiller to get some Turtle. Of course if the NYT ever decided to alter their HTML to provide this markup this recipe would be simplified greatly: no more error prone scraping, the assertions could be pulled directly out of the HTML.