Lessons of JSON

A recent (and short) IEEE Computing Conversations interview with Douglas Crockford about the development of JavaScript Object Notation (JSON) offers some profound, and sometimes counter-intuitive, insights into standards development on the Web.

Pave the Cowpaths

I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.

Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON, he says he discovered it, almost as if he were a naturalist on an expedition in some uncharted territory. Looking at the Web as an ecosystem with forms of life in it might seem like a stretched metaphor or a sci-fi plot, but I think it’s a useful and pragmatic stance. The Web is a complex space, and efforts to forcibly move it in particular directions often fail, even when big organizations and trans-national corporations are behind them. Grafting, aligning and cross-fertilizing technologies, while respecting the communities they grow from, will probably feel more chaotic, but it’s likely to yield better results in the long run.

Necessity is the Mother of Invention

Crockford needed JSON when building an application where a client written in JavaScript needed to communicate with a server written in Java. He wanted something simple that would let him solve a real need he had right in front of him. He didn’t want the client to have to compose a query for the server, have the server perform the query against a database, and return something to the client that in turn needed to be queried again. He wanted something where the data serialization matched the data structures available to both programming language environments. He wanted something that made his life easier. Since Crockford was actually using JSON in his work it has usability at its core.
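The appeal is easy to demonstrate in a few lines of JavaScript: the serialization is just the language’s own literal notation, so data round-trips between the wire format and native objects with no schema or query layer in between. (The message object below is my own invented example, not from Crockford’s application.)

```javascript
// A plain JavaScript object serializes to JSON and parses back
// without any schema, query language, or extra tooling.
var message = {
  id: 42,
  author: "crockford",
  tags: ["json", "javascript"]
};

// What goes over the wire between client and server:
var wire = JSON.stringify(message);
console.log(wire); // {"id":42,"author":"crockford","tags":["json","javascript"]}

// The receiving side gets the same structure back:
var parsed = JSON.parse(wire);
console.log(parsed.tags[1]); // javascript
```

The same serialization maps directly onto maps and lists in Java, Python, and most other languages, which is exactly the property Crockford needed for his client/server pair.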

Less is More

Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with pushback about JSON not being a “standard”, Crockford registered json.org, put up a specification that documented the data format, and declared it a standard. He didn’t expend a lot of energy in top-down mode, trying to convince industry and government agencies that JSON was the way to go. Software developers discovered it, and started using it in their applications because it made their lives easier too. Some people complained that the X in AJAX stood for XML, and therefore JSON should not be used. But this dogmatism faded into the background over time as the benefits of JSON were recognized.

JSON is not versioned, and has no mechanism for revision. JSON cannot change. This probably sounds like heresy to many folks involved in standardization. It is more radical than the WHATWG’s decision to remove the version number from HTML5. It may only be possible because Crockford focused so much on keeping the JSON specification so small and tight. Crockford is realistic about JSON not solving all data representation problems, and speculates that we will see use cases that JSON does not help solve. But when this happens something new will be created instead of extending JSON. This relieves the burden of backwards compatibility that can often drag projects down into a quagmire of complexity. Software designed to work with JSON will always work with whatever valid JSON is found in the wild.


Don’t listen to me, go watch the interview!

on not linking

NPR Morning Edition recently ran an interview with Teju Cole about his most recent project called Small Fates. Cole is the recipient of the 2012 Hemingway Foundation/PEN Award for his novel Open City. Small Fates is a series of poetic snippets from Nigerian newspapers, which Cole has been posting on Twitter. It turns out Small Fates draws on a tradition of compressed news stories known as fait divers. The interview is a really nice description of the poetic use of this material to explore time and place. In some ways it reminds me a little of the cut-up technique that William S. Burroughs popularized; albeit in a more lyrical, less dadaist form.

At about the 3 minute mark in the interview Cole mentions that he has recently been drawing on content from historic New York newspapers in Chronicling America. For example:

Chronicling America is a software project I work on. Of course we were all really excited to hear Cole mention us on NPR. One thing we wondered was whether he could include shortened URLs to the newspaper pages in Chronicling America in his tweets. Obviously this would be a clever publicity vehicle for the NEH-funded National Digital Newspaper Program. It would also allow Small Fates readers to follow a tweet back to the source material, if they wanted more context.

Through the magic of Facebook, Twitter, good old email and Teju’s generosity I got in touch with him to see if he would be willing to include some shortened Chronicling America URLs in his tweets. His response indicated that he had clearly already thought about linking, but had decided not to. His reasons for not linking struck me as really interesting, and he agreed to let me quote them here:

I can’t include links directly in my tweets for three reasons.

The first is aesthetic: I like the way the tweets look as clean sentences. One wouldn’t wish to hyperlink a poem.

The second is artistic: I want people to stay here, not go off somewhere else and crosscheck the story. Why go through all the trouble of compression if they’re just going to go off and read more about it? What’s omitted from a story is, to me, an important part of a writer’s storytelling strategy.

And the third is practical: though I seldom use up all 140 characters, rarely do I have enough room left for a url string, even a shortened one.

I really loved this artistic (and pragmatic) rationale for not linking, and thought you might too.

cc0 and git for data

In case you missed it, the Cooper-Hewitt National Design Museum at the Smithsonian Institution made a pretty important announcement almost a month ago: they have released their collection metadata on GitHub using the CC0 Creative Commons license. The choice to use GitHub is interesting (more about that below), but the big news is the choice to license the data with CC0. John Wilbanks wrote a nice piece about why the use of CC0 is important. Rather than paraphrase, I’ll just quote his main point:

… the fact that the Smithsonian has gone to CC0 is actually a great step. It means that data owners inside the USG have the latitude to use tools that put USG works into a legal status outside the US that is interoperable with their public domain status inside the US, and that’s an unalloyed Good Thing in my view.

While I helped prototype and bring the first version of id.loc.gov online, the licensing of the data was a persistent question that I heard from people who wanted to use the data. The about page at id.loc.gov currently says:

The Library of Congress has prepared this vocabulary terminology system and is making it available as a public domain data set. While it has attempted to minimize inaccuracies and defects in the data and services furnished, THE LIBRARY OF CONGRESS IS PROVIDING THIS DATA AND THESE SERVICES “AS IS” AND DISCLAIMS ANY AND ALL WARRANTIES, WHETHER EXPRESS OR IMPLIED.

But as John detailed in his post, this isn’t really enough for folks outside the US. I think even for parties inside the US a CC0 license would add more confidence to using the data set in different contexts, and would help align the Library of Congress more generally with the Web. Last Friday Eric Miller of Zepheira spoke about Linked Data at the Library of Congress (eventually the talk should be made available). Near the end of his talk he focused on things that need to be worked on, and I was glad to hear him stress that work needed to be done on licensing. The issue is really nothing new, and it really transcends the Linked Data space. I’m not saying it’s easy, but I agree with everyone who is saying it is important to focus on…and it’s great to see the advances that others in the federal government are making.

Git and GitHub

The other fun part of the Smithsonian announcement was the use of GitHub as a platform for publishing the data. To do this Cooper-Hewitt established an organizational account on GitHub, which you might think is easy, but is actually no small achievement by itself for an organization in the US federal government. With the account in hand the collection project was created and the collection metadata was released as two CSV files (media.csv and objects.csv) by Micah Walter. The repository was then forked by Aaron Straup Cope, who added some Python scripts for converting the CSV files into record-based JSON files. In the comments on the Cooper-Hewitt Labs blog post Aaron explained why he chose to break the CSV up into JSON. The beautiful thing about using Git and GitHub this way for data is that you get a history view like this:

For digital preservation folks this view of what changed, when, and by whom is extremely important for establishing provenance. The fact that you get this for free by using the open source Git version control system, and pushing your repository to GitHub, is very compelling.

Over the past couple of years there has been quite a bit of informal discussion in the digital curation community about using Git for versioning data. Just a couple weeks before the Smithsonian announcement Stephanie Collett and Martin Haye from the California Digital Library reported on the use of Git and Mercurial to version data at Code4lib 2012.

But as Alf Eaton observed:

In this case we’re talking 205,137 files. If you doubt Alf, try cloning the repository. Here’s what I see:

ed@rorty:~$ time git clone https://github.com/cooperhewitt/collection.git cooperhewitt-collection
Cloning into cooperhewitt-collection…
remote: Counting objects: 230004, done.
remote: Compressing objects: 100% (19507/19507), done.
remote: Total 230004 (delta 102489), reused 223775 (delta 96260)
Receiving objects: 100% (230004/230004), 27.84 MiB | 3.96 MiB/s, done.
Resolving deltas: 100% (102489/102489), done.

Curating Curation

This post was composed over at Storify and exported here.


Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:

  1. I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I’m using for other things like this blog.
  2. I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available, and even if I had I didn’t have Liam’s idea of using them in this way.

2011 Musics

2011 was the year of streaming music for me–specifically using Rdio. Being able to follow what friends and folks I admire are listening to, easily listen along, and then build my own online collection in the cloud was a revelation. Being able to easily do it from my desktop at home, or at work, or on my mobile device for $5/month was just astounding. The world ain’t perfect, but this is damn near close.

Anyhow, here’s some of my favorite music from 2011, in no particular order … a lot of which I probably wouldn’t have listened to if it wasn’t streamable on the Web. You might have to wait a few seconds while the YouTube clips load.

genealogy of a typo

I got a Kindle Touch today for Christmas–thanks Kesa! Admittedly I’m pretty late to this party. As I made ready to purchase my first ebook I hopped over to my GoodReads to-read list, to pick something out. I scanned the list quickly, and my eye came to rest on Stephen Ramsay’s recent book Reading Machines. But I got hung up on something irrelevant: the subtitle was Toward and Algorithmic Criticism instead of Toward an Algorithmic Criticism, the latter of which is clearly correct based on the cover image.

Having recently looked at API services for book data I got curious about how the title appeared on other popular web properties, such as Amazon:


Barnes & Noble:

and LibraryThing

I wasn’t terribly surprised not to find it on OpenLibrary. But it does seem interesting that the exact same typo is present on all these book websites as well, while the title appears correct on the publisher’s website:

and at OCLC:

It’s hard to tell for sure, but my guess is that Amazon, Barnes & Noble, and GoogleBooks got the error from Bowker Link (the Books in Print data service), and that LibraryThing then picked up the data from Amazon, and similarly GoodReads picked up the data from GoogleBooks. LibraryThing can pull data from a variety of sources, including Amazon; and I’m not entirely sure where GoodReads gets their data from, but it seems likely that it comes from the GoogleBooks API given other tie-ins with Google.

If you know more about the lineage of the data in these services I would be interested to hear it. Specifically, if you have a subscription to Bowker Link it would be great if you could check the title there. It would be nice to live in a world where these sorts of data provenance issues were easier to read.

polling and pushing with the Times Newswire API

I’ve been continuing to play around with Node.js some. Not because it’s the only game in town, but mainly because of the renaissance that’s going on in the JavaScript community, which is kind of fun and slightly addictive. Ok, I guess that makes me a fan-boy, whatever…

So the latest in my experiments is nytimestream, which is a visualization (ok, it’s just a list) of New York Times headlines using the Times Newswire API. When I saw Derek Willis recently put some work into a Ruby library for the API I got to thinking what it might be like to use Node.js and Socket.IO to provide a push stream of updates. It didn’t take too long. I actually highly doubt anyone is going to use nytimestream much. So you might be wondering why I bothered to create it at all. I guess it was more of an academic exercise than anything, a way to reinforce some of the things Node.js has been teaching me.

Normally, if you wanted a web page to update dynamically based on events elsewhere, you’d have some code running in the browser routinely polling a webservice for updates. In this scenario our clients (c1, c2 and c3) poll the Times Newswire directly:

But what happens if lots of people start using your application? Yup, you get lots of requests going to the web service…which may not be a good thing, particularly if you are limited to a certain number of requests per day.

So a logical next step is to create a proxy for the webservice, which will reduce hits on the Times Newswire API.

But still, the client code needs to poll for updates. This can result in the proxy web service needing to field lots of requests as the number of clients increases. You can poll less, but that will diminish the real time nature of your app. If you are interested in having the real time updates in your app in the first place this probably won’t seem like a great solution.

So what if you could have the proxy web service push updates to the clients when it discovers an update?

This is basically what an event-driven webservice application allows you to do (labelled NodeJS in the diagram above). Node’s Socket.IO provides a really nice abstraction around streaming updates in the browser. If you view source on nytimestream you’ll see a bit of code like this:

var socket = io.connect();
socket.on('connect', function() {
  socket.on('story', function(story) {
    // … render the new story at the top of the page …
  });
});

story is a JavaScript object that comes directly from the proxy webservice as a chunk of JSON. I’ve got the app running on Heroku, which currently recommends Socket.IO be configured to only do long polling (xhr-polling). Socket.IO actually supports a bunch of other transports suitable for streaming, including web sockets. xhr-polling basically means the browser keeps a connection open to the server until an update comes down, after which it quickly reconnects to wait for the next update. This is still preferable to constant polling, especially for the NYTimes API which often sees 15 minutes or more go by without an update. Keeping connections open like this can be expensive in more typical web stacks where each connection translates into a thread or process. But this is what Node’s non-blocking IO programming environment fixes up for you.

Just because I could, I added a little easter egg view in nytimestream, which allows you to see new stories come across the wire as JSON when nytimestream discovers them. It’s similar to Twitter’s stream API in that you can call it with curl. It’s different in that, well, there’s hardly the same amount of updates. Try it out with:

curl http://nytimestream.herokuapp.com/stream/

The occasional newlines are there to prevent the connection from timing out.

fascinating, hypnotic, inspirational, appalling, irrelevant

Thanks to everyone who noticed the Wikistream coverage in the NextWeb article and elsewhere. If you happen to have tweeted about Wikistream in the last 2 days you should see your avatar to the left. Click on it to make it bigger. I’m in there somewhere too :-)

Before Sunday the site hadn’t seen more than 180 unique visitors per day, and on Monday it saw almost 30,000. The site is kind of different since it streams all the Wikipedia updates to the browser as JSON, where they are then displayed. I had some nail-biting moments as I watched Node serving up to 300 concurrent streaming connections. It was a wild ride for my little Linode VPS with 512MB of RAM, where WordPress and a Django website were also running…but it seemed to weather the storm OK. Mostly I think I could have used more RAM during peak usage, when Node and Redis wanted enough memory to push the system into swap. Thanks to Gabe and Chris for helping me get the cache headers set right in Express.

I thought briefly about upgrading to a larger Linode instance, putting the app on EC2, or maybe asking Wikimedia if they wanted to host it. But Wikistream is really more a piece of performance art than a useful website. I expect people who have looked at Wikistream once will have seen how actively Wikipedia is edited, and won’t feel compelled to look at it again. After a few days I expect the usage to plummet, and it can go back to running comfortably on my little Linode VPS as a live prop in presentations about Wikipedia, crowd-sourcing, Web culture, etc.

One of my favorite mentions of Wikistream came from Nat Torkington’s Four Short Links on O’Reilly Radar. Nat described Wikistream as

fascinating and hypnotic and inspirational and appalling and irrelevant all at once

I took this as high praise, of course. I could only get the last two days out of Twitter’s search API, which misses the day the NextWeb article appeared and the subsequent pickup on Hacker News. But it was 226 tweets, and it provided a fun little data set to look at. I wrote a little script to look for URLs in the tweets, unshorten them, and come up with a list of web pages that mentioned Wikistream in the past few days. One thing that was really interesting to me was the predominance of non-English websites. Here’s a list of some of them if you are interested.

OK, I’m finished with the narcissistic navel gazing for a bit. But seriously, thanks for the attention. I’ve never experienced anything like that before. Wow.

thanks Wikipedia

Support Wikipedia
The field of software development, the Web and libraries is changing so fast that there is no way to know everything I need to know to do my job well. Wikipedia continues to be an essential resource for learning about technologies, algorithms, people, and history related to my work. It’s hard to imagine what the world would be like without it. Thanks for another year of awesome Wikipedia! The check is in the mail; well OK it’s actually coming from PayPal … you know the drill.