<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>inkdroid &#187; ocr</title>
	<atom:link href="http://inkdroid.org/journal/tag/ocr/feed/" rel="self" type="application/rss+xml" />
	<link>http://inkdroid.org/journal</link>
	<description>$pithy_personal_mission_statement</description>
	<lastBuildDate>Wed, 28 Jul 2010 13:48:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>json vs pickle</title>
		<link>http://inkdroid.org/journal/2008/10/24/json-vs-pickle/</link>
		<comments>http://inkdroid.org/journal/2008/10/24/json-vs-pickle/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 07:09:02 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/?p=401</guid>
		<description><![CDATA[in python JSON is faster, smaller and more portable than pickle &#8230; At work, I&#8217;m working on a project where we&#8217;re modeling newspaper content in a relational database. We&#8217;ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It&#8217;s a django app, and the db schema currently looks something like: Anyhow, if [...]]]></description>
			<content:encoded><![CDATA[<p><em>in python JSON is faster, smaller and more portable than pickle &#8230; </em></p>
<p>At work, I&#8217;m working on a project where we&#8217;re modeling newspaper content in a relational database. We&#8217;ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It&#8217;s a django app, and the db schema currently looks something like:</p>
<p><a href="http://inkdroid.org/images/ndnp-schema.png"><br />
<img width="500" src="http://inkdroid.org/images/ndnp-schema.png" title="Click for Big Picture" border="none" /><br />
</a></p>
<p>Anyhow, if you look at the schema you&#8217;ll notice that we have a <code>Page</code> model, and that attached to that is an <code>OCR</code> model. If you haven&#8217;t heard of it before OCR is an acronym for <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a>. For each newspaper page we have, we have a TIF image for the original page, and we have rectangle coordinates for the position of every word on the page. Basically it&#8217;s xml that looks something like <a href="http://inkdroid.org/data/ndnp-ocr.xml">this</a> (warning your browser may choke on this, you might want to right-click-download). </p>
<p>So there are roughly 2500 words on a page of newspaper text, and there can sometimes be 350 occurrences of a particular word on a page&#8230;and we&#8217;re looking to model 1,000,000 pages soon &#8230; so if we got really prissy with normalization we could soon be looking at (worst case: 2500 * 350 * 1,000,000) 875,000,000,000 rows in a table. While I am interested in getting a handle on how to manage large databases like this, we just don&#8217;t need the fine grained queries into the word coordinates. But we do need to be able to look up the coordinates for a particular word on a particular page to do hit highlighting in search results.</p>
<p>So let me get to the interesting part already. To avoid having to think about databases with billions of rows, I radically denormalized the data and stored the word coordinates as a blob of <a href="http://www.json.org/">JSON</a> in the database. So we just have a <code>word_coordinates_json</code> column in the OCR table, and when we need to look up the coordinates for a page we just load up the JSON dictionary and we&#8217;re good to go. JSON is nice with django, since django&#8217;s ORM doesn&#8217;t seem to support storing blobs in the database, and JSON is just text. This worked just fine on single page views, but we also do hit highlighting on pages where there are 10 pages being viewed at the same time. So we started noticing large lags on these page views &#8212; because it was taking a while to load the JSON (sometimes 327K * 10 of JSON).</p>
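<p>A minimal sketch of that lookup, assuming a layout the post does not actually show: each word maps to a list of <code>[x, y, width, height]</code> rectangles, one per occurrence on the page. The helper name <code>coordinates_for</code> and the sample values are hypothetical, and the code targets Python 3 with the standard library <code>json</code> module.</p>

```python
import json

# Hypothetical blob for one page: word -> list of [x, y, width, height]
# rectangles. The real column format is not shown in the post.
word_coordinates_json = json.dumps({
    "senate": [[120, 340, 88, 22], [415, 1290, 90, 21]],
    "tariff": [[640, 512, 75, 20]],
})


def coordinates_for(blob, word):
    """Return every rectangle recorded for `word` in one page's JSON blob."""
    coordinates = json.loads(blob)
    # Words are assumed to be stored lowercased; unknown words give [].
    return coordinates.get(word.lower(), [])


print(coordinates_for(word_coordinates_json, "Senate"))
```

<p>The whole blob has to be parsed before a single word can be looked up, which is why the parse time matters so much on pages that highlight several newspaper pages at once.</p>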
<p>As I mentioned we&#8217;re using Django, so it was easy to use django.utils.simplejson for the parsing. When we noticed slowdowns I decided to compare django.utils.simplejson to the latest <a href="http://www.undefined.org/python/">simplejson</a> and <a href="http://pypi.python.org/pypi/python-cjson">python-cjson</a>. And just for grins I figured it couldn&#8217;t hurt to see if using pickle or cPickle (protocols 0, 1 and 2) would prove to be faster than using JSON. So I wrote a little benchmark script that timed the loading of a 327K JSON and a 507K pickle file 100 times using each technique. Here are the results:</p>
<table padding="10px" style="border: thin solid gray; padding: 15px;">
<tr>
<th>method</th>
<th>total seconds</th>
<th>avg seconds</th>
</tr>
<tr>
<td>django-simplejson</td>
<td>140.606723</td>
<td>1.406067</td>
</tr>
<tr>
<td>simplejson</td>
<td>2.260988</td>
<td>0.022610</td>
</tr>
<tr>
<td>pickle</td>
<td>45.032428</td>
<td>0.450324</td>
</tr>
<tr>
<td>cPickle</td>
<td>4.569351</td>
<td>0.045694</td>
</tr>
<tr>
<td>cPickle1</td>
<td>2.829307</td>
<td>0.028293</td>
</tr>
<tr>
<td>cPickle2</td>
<td>3.042940</td>
<td>0.030429</td>
</tr>
<tr>
<td>python-cjson</td>
<td>1.852755</td>
<td>0.018528</td>
</tr>
</table>
<p><img src="http://chart.apis.google.com/chart?cht=bhs&#038;chd=t:139.92,2.24,44.03,4.4,3.03,3.11,1.85&#038;chdl=django-simplejson|simplejson|pickle|cpickle|cpickle1|cpickle2|python-cjson&#038;chco=33FF33|9900FF|FF0033|3366FF|2211EE|99AAFF|00FFCC&#038;chs=450x225&#038;chg=20&#038;chxt=x&#038;chx0=0,100" /></p>
<p>Yeah, that&#8217;s right. The real simplejson is 62 times faster than django.utils.simplejson! Even more surprising, simplejson seems to be faster than cPickle (even using binary protocols 1 and 2), and python-cjson seems to have a slight edge on simplejson. This is good news for our search results page that has 10 newspaper pages to highlight on it, since it&#8217;ll take 10 * 0.033183 = .3 seconds to parse all the JSON instead of the totally unacceptable 10 * 0.976193 = 9.7 seconds. I guess in some circles 0.3 seconds might be unacceptable; we&#8217;ll have to see how it pans out. We may be able to remove the JSON deserialization from the page load time by pushing some of the logic into the browser w/ AJAX. If you want, please try out <a href="http://inkdroid.org/bzr/jsonickle">my benchmarks</a> yourself on your own platform. I&#8217;d be curious if you see the same ranking.</p>
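<p>For readers who just want the shape of such a comparison, here is a rough sketch. It is not the benchmark script linked above: it assumes Python 3, uses only the standard library <code>json</code> and <code>pickle</code> modules rather than the libraries in the table, and times a synthetic page whose layout (word to list of boxes) is made up. Its numbers are therefore not comparable to the ones reported here.</p>

```python
import json
import pickle
import timeit

# Synthetic stand-in for one page of word coordinates (assumed shape:
# word -> list of [x, y, width, height] boxes). Not the author's data.
page = {
    "word%d" % i: [[i, i + n, 40, 12] for n in range(3)]
    for i in range(2500)
}

# Serialize once up front so that only deserialization is timed.
payloads = {
    "json": (json.loads, json.dumps(page)),
    "pickle0": (pickle.loads, pickle.dumps(page, protocol=0)),
    "pickle2": (pickle.loads, pickle.dumps(page, protocol=2)),
}


def benchmark(runs=100):
    """Return {method: (total_seconds, avg_seconds)} for `runs` loads each."""
    results = {}
    for name, (loader, data) in payloads.items():
        total = timeit.timeit(lambda: loader(data), number=runs)
        results[name] = (total, total / runs)
    return results


if __name__ == "__main__":
    for name, (total, avg) in sorted(benchmark().items()):
        print("%-8s total=%.6f avg=%.6f" % (name, total, avg))
```

<p>Serializing outside the timed call keeps the measurement focused on loading, which is the step that sits on the page-view path.</p>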
<p>Here are the versions for various bits I used:</p>
<ul>
<li>python v2.5.2</li>
<li>django trunk: r9231 2008-10-13 15:38:18 -0400</li>
<li>simplejson 2.0.3</li>
</ul>
<p>So in summary for pythoneers: JSON is faster, smaller and more portable than pickle. Of course there are caveats in that you can only store simple datatypes that JSON allows you to, not the full fledged Python objects. But in my use case JSON&#8217;s data types were just fine. Makes me that much happier that simplejson aka <code>json</code> is now cooked into the <a href="http://docs.python.org/whatsnew/2.6.html">Python 2.6</a> standard library.</p>
<p><em>Note: if you aren&#8217;t seeing simplejson performing better than cPickle you may need to have python development libraries installed:</p>
<pre>
  aptitude install python-dev # or the equivalent for your system
</pre>
<p>You can verify if the optimizations are available in simplejson by:</p>
<pre>
ed@hammer:~/bzr/jsonickle$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; import simplejson
&gt;&gt;&gt; simplejson._speedups
&lt;module 'simplejson._speedups' from '/home/ed/.python-eggs/simplejson-2.0.3-py2.5-linux-i686.egg-tmp/simplejson/_speedups.so'&gt;
</pre>
<p>Thanks <a href="http://blog.ryaneby.com/">eby</a>, <a href="http://lackoftalent.org/michael/blog/">mjgiarlo</a>, <a href="http://oxfordrepo.blogspot.com/">BenO</a> and Kapil for their pointers and ideas.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/10/24/json-vs-pickle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>calais and ocr newspaper data</title>
		<link>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/</link>
		<comments>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/#comments</comments>
		<pubDate>Wed, 13 Feb 2008 20:30:16 +0000</pubDate>
		<dc:creator>ed</dc:creator>
				<category><![CDATA[libraries]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semanticweb]]></category>
		<category><![CDATA[sparql]]></category>
		<category><![CDATA[webservices]]></category>

		<guid isPermaLink="false">http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/</guid>
		<description><![CDATA[Like you I&#8217;ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it&#8217;s RDF that uses a bunch of homespun vocabularies. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://opencalais.com/"><img src="/images/calais.gif" style="margin-right: 10px; float: left;"/></a> Like you I&#8217;ve been <a href="http://radar.oreilly.com/archives/2008/02/reuters_semantic_web_moneytech.html">reading</a> <a href="http://ebiquity.umbc.edu/blogger/2008/02/02/reuters-calais-offers-free-text-extraction-services-producing-rdf/">about</a> the new <a href="http://opencalais.com/">Reuters Calais Web Service</a>. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it&#8217;s RDF that uses a bunch of homespun vocabularies. </p>
<p>At work <a href="http://eikeon.com">Dan</a>, <a href="http://ardvaark.net">Brian</a> and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It&#8217;s a very different approach in that Calais is doing <a href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> and we instead are looking for patterns in the structure of XML. But the end result is the same&#8211;an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have <a href="http://loc.gov/chroniclingamerica">large amounts</a> of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais&#8230;</p>
<p>To aid in the process I wrote a helper utility (<a href="http://inkdroid.org/bzr/calais/calais.py">calais.py</a>) that bundles up the Calais web service into a function call that returns an RDF graph, courtesy of Dan&#8217;s <a href="http://rdflib.net">rdflib</a>:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">  <span style="color: #ff7700;font-weight:bold;">from</span> calais <span style="color: #ff7700;font-weight:bold;">import</span> calais_graph
  graph = calais_graph<span style="color: black;">&#40;</span>content<span style="color: black;">&#41;</span></pre></div></div>

<p>This is dependent on you getting a calais <a href="http://developer.opencalais.com/member/register">license key</a> and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here&#8217;s the <a href="http://inkdroid.org/bzr/calais/people">people</a> script. <em>note, the angle brackets are missing from the sparql prefixes intentionally, since they don&#8217;t render properly (yet) in wordpress</em>.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">  <span style="color: #ff7700;font-weight:bold;">from</span> calais <span style="color: #ff7700;font-weight:bold;">import</span> calais_graph
  <span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">sys</span> <span style="color: #ff7700;font-weight:bold;">import</span> argv
&nbsp;
  filename = argv<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
  content = <span style="color: #008000;">file</span><span style="color: black;">&#40;</span>filename<span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  g = calais_graph<span style="color: black;">&#40;</span>content<span style="color: black;">&#41;</span>
&nbsp;
  sparql = <span style="color: #483d8b;">&quot;&quot;&quot;
          PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
          PREFIX ct: http://s.opencalais.com/1/type/em/e/
          PREFIX cp: http://s.opencalais.com/1/pred/
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          &quot;&quot;&quot;</span>
&nbsp;
  <span style="color: #ff7700;font-weight:bold;">for</span> row <span style="color: #ff7700;font-weight:bold;">in</span> g.<span style="color: black;">query</span><span style="color: black;">&#40;</span>sparql<span style="color: black;">&#41;</span>:
      <span style="color: #ff7700;font-weight:bold;">print</span> row<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here&#8217;s what we get when I run <a href="http://inkdroid.org/bzr/calais/data/ndnp:774348">this</a> OCR data through (take a <a href="http://inkdroid.org/bzr/calais/data/ndnp:774348">look</a> at the linked OCR to see just how irregular this data is).</p>
<pre>
  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer
</pre>
<p>Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here&#8217;s the output of <a href="http://inkdroid.org/bzr/calais/cities">cities</a>. </p>
<pre>
  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO
</pre>
<p>Not too shabby. If you want to try this out, install <a href="http://rdflib.net">rdflib</a>, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:</p>
<pre>
  bzr branch http://inkdroid.org/bzr/calais
</pre>
<p>If you do dive into calais.py you&#8217;ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF. </p>
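<p>The extract-and-unescape step can be sketched roughly as follows. This is an illustration under assumptions, not the code in calais.py: the wrapper element name <code>string</code>, the sample payload and both helper names are hypothetical, and the sketch uses Python 3 with the standard library <code>xml.etree.ElementTree</code> instead of rdflib.</p>

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_sample_envelope():
    """Build a stand-in response: RDF/XML escaped inside a wrapper element.

    The wrapper name "string" is a guess used for illustration only.
    """
    rdf = ET.Element("{%s}RDF" % RDF_NS)
    ET.SubElement(rdf, "{%s}Description" % RDF_NS)
    wrapper = ET.Element("string")
    # Assigning markup to .text makes ElementTree escape it on output.
    wrapper.text = ET.tostring(rdf, encoding="unicode")
    return ET.tostring(wrapper, encoding="unicode")


def extract_rdf(envelope):
    """Return the unescaped RDF/XML carried as text inside the envelope."""
    # Parsing the envelope unescapes its text content for us.
    return ET.fromstring(envelope).text


rdf_xml = extract_rdf(build_sample_envelope())
root = ET.fromstring(rdf_xml)
print(root.tag)
```

<p>Once the inner document is recovered as a string it can be handed to an RDF parser such as rdflib in the usual way.</p>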
]]></content:encoded>
			<wfw:commentRss>http://inkdroid.org/journal/2008/02/13/calais-and-ocr-newspaper-data/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>
