in python JSON is faster, smaller and more portable than pickle ...

At work, I'm working on a project where we're modeling newspaper content in a relational database. We've got newspaper titles, issues, pages, institutions, places and some other fun stuff. It's a django app, and the db schema currently looks something like:

Anyhow, if you look at the schema you'll notice that we have a Page model, and that attached to that is an OCR model. If you haven't heard of it before, OCR is an acronym for optical character recognition. For each newspaper page we have a TIF image of the original page, and we have rectangle coordinates for the position of every word on the page. Basically it's XML that looks something like this (warning: your browser may choke on this, so you might want to right-click and download).

There are roughly 2500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page ... and we're looking to model 1,000,000 pages soon. So if we got really prissy with normalization we could soon be looking at (worst case) 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don't need fine-grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.
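The back-of-the-envelope arithmetic behind that worst case is just the three figures above multiplied together:

```python
# Worst-case sizing for a fully normalized word-coordinate table,
# using the rough figures from the post.
words_per_page = 2500       # ~words on a newspaper page
max_occurrences = 350       # worst-case occurrences of a single word on a page
pages = 1000000             # number of pages we want to model

worst_case_rows = words_per_page * max_occurrences * pages
print(worst_case_rows)  # 875000000000 -- 875 billion rows
```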

So let me get to the interesting part already. To avoid having to think about databases with billions of rows, I radically denormalized the data and stored the word coordinates as a blob of JSON in the database. We just have a word_coordinates_json column in the OCR table, and when we need to look up the coordinates for a page we load up the JSON dictionary and we're good to go. JSON plays nicely with django, since django's ORM doesn't seem to support storing blobs in the database, and JSON is just text. This worked just fine on single page views, but we also do hit highlighting on search results pages where 10 newspaper pages are being viewed at the same time. We started noticing large lags on those views -- because it was taking a while to load the JSON (sometimes 10 * 327K of JSON).
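Stripped of the Django plumbing, the round trip is just serializing a dict of word-to-rectangle lists into a text column and parsing it back on page view. A minimal sketch (the sample words and rectangles are made up, not our actual data) using the stdlib json module, which shares simplejson's API:

```python
import json

# Hypothetical word-coordinate data for one page:
# word -> list of [x, y, width, height] rectangles.
coordinates = {
    "liberty": [[120, 340, 58, 12], [411, 1022, 60, 13]],
    "union": [[89, 77, 40, 11]],
}

# This string is what gets stored in the OCR table's
# word_coordinates_json text column.
word_coordinates_json = json.dumps(coordinates)

# On page view: parse the column and look up one word
# to get its rectangles for hit highlighting.
loaded = json.loads(word_coordinates_json)
rects = loaded["liberty"]
```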

As I mentioned, we're using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed slowdowns I decided to compare django.utils.simplejson to the latest simplejson and python-cjson. And just for grins I figured it couldn't hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON file and a 507K pickle file 100 times using each technique. Here are the results:

method             total seconds   avg seconds
django-simplejson     140.606723      1.406067
simplejson              2.260988      0.022610
pickle                 45.032428      0.450324
cPickle                 4.569351      0.045694
cPickle1                2.829307      0.028293
cPickle2                3.042940      0.030429
python-cjson            1.852755      0.018528
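This isn't the actual benchmark script, but a harness in this spirit looks roughly like the following (the generated payload and iteration count are stand-ins for the real 327K file and 100 runs):

```python
import json
import pickle
import time

# Stand-in payload; the real benchmark loaded a 327K word-coordinate file.
data = dict(("word%d" % i, [[i, i, 10, 12]]) for i in range(1000))
json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data, 2)  # binary protocol 2, like cPickle2 above

def bench(label, load, blob, runs=100):
    """Time `runs` deserializations of blob and report total/avg seconds."""
    start = time.time()
    for _ in range(runs):
        load(blob)
    total = time.time() - start
    print("%-10s total=%f avg=%f" % (label, total, total / runs))
    return total

bench("json", json.loads, json_blob)
bench("pickle2", pickle.loads, pickle_blob)
```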

Yeah, that's right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than even cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it'll take 10 * 0.022610 = 0.2 seconds to parse all the JSON instead of the totally unacceptable 10 * 1.406067 = 14 seconds. I guess in some circles 0.2 seconds might be unacceptable; we'll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser w/ AJAX. If you want, please try out my benchmarks yourself on your own platform. I'd be curious whether you see the same ranking.

Here are the versions for various bits I used:

  • python v2.5.2
  • django trunk: r9231 2008-10-13 15:38:18 -0400
  • simplejson 2.0.3

So in summary for pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats, in that you can only store the simple datatypes that JSON allows, not full-fledged Python objects. But in my use case JSON's data types were just fine. Makes me that much happier that simplejson (aka json) is now cooked into the Python 2.6 standard library.
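The caveat in practice: JSON can only represent dicts, lists, strings, numbers, booleans and null, while pickle round-trips arbitrary Python objects. A quick illustration (WordBox is a made-up class for the example):

```python
import json
import pickle

class WordBox(object):
    """Hypothetical object wrapping a word's page coordinates."""
    def __init__(self, x, y):
        self.x, self.y = x, y

box = WordBox(120, 340)

# pickle handles the custom object...
restored = pickle.loads(pickle.dumps(box))
assert restored.x == 120

# ...json does not: serializing a custom object raises TypeError.
try:
    json.dumps(box)
except TypeError:
    print("json can't serialize WordBox directly")

# The workaround is to reduce objects to JSON-friendly types first.
as_dict = json.loads(json.dumps({"x": box.x, "y": box.y}))
```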

Note: if you aren't seeing simplejson performing better than cPickle you may need to have python development libraries installed:

  aptitude install python-dev # or the equivalent for your system

You can verify whether the optimizations are available in simplejson like so:

ed@hammer:~/bzr/jsonickle$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simplejson
>>> simplejson._speedups
<module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/'>

Thanks to eby, mjgiarlo, BenO and Kapil for their pointers and ideas.