json vs pickle
In Python, JSON is faster, smaller and more portable than pickle …
At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a django app, and the db schema currently looks something like:
Anyhow, if you look at the schema you’ll notice that we have a Page model, and that attached to that is an OCR model. If you haven’t heard of it before, OCR is an acronym for optical character recognition. For each newspaper page we have a TIF image of the original page, and we have rectangle coordinates for the position of every word on the page. Basically it’s XML that looks something like this (warning: your browser may choke on this, so you might want to right-click and download).
So there are roughly 2500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page … and we’re looking to model 1,000,000 pages soon … so if we got really prissy with normalization we could soon be looking at (worst case) 2,500 * 350 * 1,000,000 = 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don’t need fine-grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.
So let me get to the interesting part already. To avoid having to think
about databases with billions of rows, I radically denormalized the data
and stored the word coordinates as a blob of JSON in the database. So we just have
a word_coordinates_json column in the OCR table, and when
we need to look up the coordinates for a page we just load up the JSON
dictionary and we’re good to go. JSON is nice with django, since
django’s ORM doesn’t seem to support storing blobs in the database, and
JSON is just text. This worked just fine on single page views, but we
also do hit highlighting on pages where there are 10 pages being viewed
at the same time. So we started noticing large lags on these page views
– because it was taking a while to load the JSON (sometimes 327K * 10 of
JSON).
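To make the lookup concrete, here is a minimal sketch of the idea, not the project’s actual code. It assumes a hypothetical layout in which the JSON blob maps each word to a list of [x, y, width, height] rectangles. The helper name coordinates_for and the sample values are made up for illustration, and it uses the standard-library json module of later Pythons rather than simplejson.

```python
import json

# Hypothetical blob as it might sit in the word_coordinates_json column:
# each word maps to a list of [x, y, width, height] rectangles (assumed layout).
word_coordinates_json = json.dumps({
    "senate": [[120, 340, 88, 22], [410, 1290, 90, 21]],
    "harbor": [[233, 512, 95, 24]],
})


def coordinates_for(blob, word):
    """Parse the JSON blob and return the rectangles recorded for one word."""
    coords = json.loads(blob)
    return coords.get(word, [])


# Rectangles to highlight for a search hit on "senate".
boxes = coordinates_for(word_coordinates_json, "senate")
```

Note that every lookup parses the whole blob, which is why the speed of the JSON parser ends up mattering so much on a view that shows 10 pages at once.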
As I mentioned we’re using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed slowdowns I decided to compare django.utils.simplejson to the latest simplejson and python-cjson. And just for grins I figured it couldn’t hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON and a 507K pickle file 100 times using each technique. Here are the results:
method | total seconds | avg seconds |
---|---|---|
django-simplejson | 140.606723 | 1.406067 |
simplejson | 2.260988 | 0.022610 |
pickle | 45.032428 | 0.450324 |
cPickle | 4.569351 | 0.045694 |
cPickle1 | 2.829307 | 0.028293 |
cPickle2 | 3.042940 | 0.030429 |
python-cjson | 1.852755 | 0.018528 |
Yeah, that’s right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than even cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it’ll take 10 * 0.033183 ≈ 0.3 seconds to parse all the JSON instead of the totally unacceptable 10 * 0.976193 ≈ 9.8 seconds. I guess in some circles 0.3 seconds might be unacceptable; we’ll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser with AJAX. If you want, please try out my benchmarks yourself on your own platform. I’d be curious if you see the same ranking.
Here are the versions for various bits I used:
- python v2.5.2
- django trunk: r9231 2008-10-13 15:38:18 -0400
- simplejson 2.0.3
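If you would rather not track down those exact versions, the shape of the benchmark can be approximated with nothing but the standard library. This is a rough sketch and not the original script. It assumes a modern Python where json and pickle stand in for simplejson and cPickle, it leaves out django.utils.simplejson and python-cjson, and it times synthetic data (the word names and rectangles are made up) rather than a real 327K page. Expect different numbers, and possibly a different ranking, from the table above.

```python
import json
import pickle
import timeit

# Synthetic stand-in for one page of word coordinates (assumed layout and size).
data = {
    "word%d" % i: [[i, i + 1, 80, 20] for _ in range(5)]
    for i in range(2500)
}

# Serialize once up front so that only deserialization is timed.
json_blob = json.dumps(data)
pickle_blobs = {p: pickle.dumps(data, protocol=p) for p in (0, 1, 2)}


def bench(label, func, runs=100):
    """Call func `runs` times and return (label, total seconds, avg seconds)."""
    total = timeit.timeit(func, number=runs)
    return label, total, total / runs


results = [bench("json", lambda: json.loads(json_blob))]
for proto, blob in pickle_blobs.items():
    # Bind blob as a default argument so each lambda keeps its own payload.
    results.append(bench("pickle%d" % proto, lambda b=blob: pickle.loads(b)))

for label, total, avg in results:
    print("%-8s | %.6f | %.6f" % (label, total, avg))
```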
So in summary for pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats, in that you can only store the simple datatypes that JSON allows, not full-fledged Python objects. But in my use case JSON’s data types were just fine. Makes me that much happier that simplejson, aka json, is now cooked into the Python 2.6 standard library.
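The datatype caveat is easy to see in a few lines. The sketch below uses the standard-library json and pickle modules on a made-up record (the field names and values are hypothetical) containing a date, a set and a tuple. pickle round-trips it unchanged, while json rejects it until the values are reduced to plain strings and lists.

```python
import datetime
import json
import pickle

# A record using Python types that JSON has no equivalent for (made-up values).
record = {
    "issued": datetime.date(1908, 10, 13),
    "words": {"senate", "harbor"},
    "rect": (120, 340, 88, 22),
}

# pickle restores the date, the set and the tuple as they were.
restored = pickle.loads(pickle.dumps(record))

# json raises TypeError because it cannot serialize the date or the set.
try:
    json.dumps(record)
    json_accepts_record = True
except TypeError:
    json_accepts_record = False

# Reduced to simple types the record serializes fine, but the tuple
# comes back as a list and the date stays a string.
plain = {
    "issued": record["issued"].isoformat(),
    "words": sorted(record["words"]),
    "rect": record["rect"],
}
as_json = json.loads(json.dumps(plain))
```

For word coordinates, which are just strings and numbers, none of this gets in the way.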
Note: if you aren’t seeing simplejson performing better than cPickle you may need to have python development libraries installed:
aptitude install python-dev # or the equivalent for your system
You can verify if the optimizations are available in simplejson by:
    ed@hammer:~/bzr/jsonickle$ python
    Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
    [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import simplejson
    >>> simplejson._speedups
    <module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/_speedups.so'>
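The same check can be done from a script instead of the interactive prompt. This is a small sketch that assumes the C extension is importable as simplejson._speedups, as in the session above. The function name has_speedups is made up, and it simply reports False if simplejson or its extension is missing.

```python
def has_speedups():
    """Return True if simplejson's C extension can be imported, else False."""
    try:
        import simplejson._speedups  # noqa: F401 (imported only to test availability)
    except ImportError:
        # Either simplejson is not installed or it was built without the extension.
        return False
    return True


speedups_available = has_speedups()
print("simplejson speedups available: %s" % speedups_available)
```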
Thanks eby, mjgiarlo, BenO and Kapil for their pointers and ideas.