<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: pymarc, marc8 and nothingness</title>
	<atom:link href="http://inkdroid.org/journal/2007/07/20/pymarc-marc8/feed/" rel="self" type="application/rss+xml" />
	<link>http://inkdroid.org/journal/2007/07/20/pymarc-marc8/</link>
	<description>$pithy_personal_mission_statement</description>
	<pubDate>Wed, 03 Dec 2008 22:32:54 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
		<item>
		<title>By: Nodalities &#187; Blog Archive &#187; This Week&#8217;s Semantic Web</title>
		<link>http://inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-59652</link>
		<dc:creator>Nodalities &#187; Blog Archive &#187; This Week&#8217;s Semantic Web</dc:creator>
		<pubDate>Thu, 17 Apr 2008 17:05:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-59652</guid>
		<description>[...] pymarc, marc8 and nothingness - new function, marc8_to_unicode() [...]</description>
		<content:encoded><![CDATA[<p>[...] pymarc, marc8 and nothingness - new function, marc8_to_unicode() [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ed</title>
		<link>http://inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-53141</link>
		<dc:creator>ed</dc:creator>
		<pubDate>Thu, 31 Jan 2008 03:13:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-53141</guid>
		<description>Wow, sorry for the delay, Gabe, this slipped under my radar. &lt;a href="http://pypi.python.org/pypi/pymarc" rel="nofollow"&gt;1.7&lt;/a&gt; was just released with the fixes you recently sent me in a patch.

In addition, I expanded the documentation for the marc8_to_unicode function so that it hopefully makes clear you aren't supposed to pass in a serialized MARC record, but just a chunk of text extracted from the record that you'd like to translate.

Thanks for the info and the patch!</description>
		<content:encoded><![CDATA[<p>Wow, sorry for the delay, Gabe, this slipped under my radar. <a href="http://pypi.python.org/pypi/pymarc" rel="nofollow">1.7</a> was just released with the fixes you recently sent me in a patch.</p>
<p>In addition, I expanded the documentation for the marc8_to_unicode function so that it hopefully makes clear you aren&#8217;t supposed to pass in a serialized MARC record, but just a chunk of text extracted from the record that you&#8217;d like to translate.</p>
<p>Thanks for the info and the patch!</p>
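<p>A minimal sketch of why converting a whole serialized record goes wrong, assuming only the standard MARC layout (leader positions 00-04 hold the record length, 12-16 the base address of the data). It is plain Python 3 and does not use pymarc. The helper build_record and the title are hypothetical, and Latin-1 stands in for MARC-8 because the standard library has no MARC-8 codec:</p>

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def build_record(title_bytes):
    """Serialize a one-field record (245 $a) from already-encoded title bytes."""
    field = b"10" + SUBFIELD + b"a" + title_bytes + FIELD_END
    # Directory entry: tag (3 bytes), field length (4), starting offset (5).
    directory = b"245" + b"%04d" % len(field) + b"00000" + FIELD_END
    base_address = 24 + len(directory)
    total = base_address + len(field) + len(RECORD_END)
    # Only the two length values matter here; the other leader bytes are illustrative.
    leader = b"%05d" % total + b"nam a22" + b"%05d" % base_address + b" a 4500"
    return leader + directory + field + RECORD_END


# A record whose title holds one non-ASCII character (e-acute in Latin-1).
latin1_record = build_record(b"Caf\xe9")
print(int(latin1_record[:5]), len(latin1_record))

# Transcoding the whole serialized record grows the data by a byte,
# but the lengths stored in the leader and directory are left stale.
whole = latin1_record.decode("latin-1").encode("utf-8")
print(int(whole[:5]), len(whole))

# Transcoding only the field text, then serializing, keeps them in step.
utf8_record = build_record(b"Caf\xe9".decode("latin-1").encode("utf-8"))
print(int(utf8_record[:5]), len(utf8_record))
```

<p>In the second case the declared length no longer matches the actual byte count, which is the kind of mismatch a reader can be expected to reject.</p>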
]]></content:encoded>
	</item>
	<item>
		<title>By: thesecretmirror.com &#187; Blog Archive &#187; When Life Hands You MARC, make pymarc</title>
		<link>http://inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-39168</link>
		<dc:creator>thesecretmirror.com &#187; Blog Archive &#187; When Life Hands You MARC, make pymarc</dc:creator>
		<pubDate>Fri, 20 Jul 2007 22:13:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-39168</guid>
		<description>[...] I&#8217;m no expert, but I&#8217;m glad that I could help bring pymarc up to version 1.0 and that I&#8217;ve had a chance to begin enjoying programming again. I&#8217;m also glad to see that Catalogablog has spread the word. Download a copy and start hacking; maybe you&#8217;ll be rewarded with rediscovering the joy of code like I was. [...]</description>
		<content:encoded><![CDATA[<p>[...] I&#8217;m no expert, but I&#8217;m glad that I could help bring pymarc up to version 1.0 and that I&#8217;ve had a chance to begin enjoying programming again. I&#8217;m also glad to see that Catalogablog has spread the word. Download a copy and start hacking; maybe you&#8217;ll be rewarded with rediscovering the joy of code like I was. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gabriel Farrell</title>
		<link>http://inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-39150</link>
		<dc:creator>Gabriel Farrell</dc:creator>
		<pubDate>Fri, 20 Jul 2007 17:58:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.inkdroid.org/journal/2007/07/20/pymarc-marc8/#comment-39150</guid>
		<description>Great things, Ed (and thanks to Mark and Aaron, too).  I svn'd and tried it out on a dump I had lying around of 105,000 of our records.  Following the pymarc.__doc__, I did:

&#62;&#62;&#62; reader = MARCReader('bibs105000.out')
&#62;&#62;&#62; for record in reader:
...     print record['245']['a']
... 
Traceback (most recent call last):
  File "", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 51, in next
ValueError: invalid literal for int(): bibs1
&#62;&#62;&#62; 

But I noticed "self.reader = pymarc.MARCReader(file('test/test.dat'))" in test/reader.py, so I:

&#62;&#62;&#62; marc8_file = file('bibs105000.out')
&#62;&#62;&#62; reader = MARCReader(marc8_file)
&#62;&#62;&#62; for record in reader:
...     print record['245']['a']
... 
Microeconomics :
The multilateral development banks :
The Accountants digest.
Achievement.
ALA bulletin.
Acta arithmetica.
Acta crystallographica.
ASLE transactions.
Acta mathematica.
Acta mechanica.
Acta physica Polonica,
Acta physica Austriaca /
Acta physica Polonica.
Acta polytechnica Scandinavica.
[etc. -- big old list of titles, streaming by, very fast]

So MARCReader expects a file object, not a filename.  Does that doc string need updating or did I misread it?

I then tried to test marc8_to_unicode:

&#62;&#62;&#62; utf8_file = marc8_to_unicode(marc8_file)
Traceback (most recent call last):
  File "", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 8, in marc8_to_unicode
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 43, in translate
TypeError: len() of unsized object
&#62;&#62;&#62; 

Looking at test/marc8.py I saw marc8_to_unicode expects a string, so I:

&#62;&#62;&#62; marc8_file = file('bibs105000.out')
&#62;&#62;&#62; marc8_file_str = marc8_file.read()
&#62;&#62;&#62; utf8_file_str = marc8_to_unicode(marc8_file_str)
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 221 221
couldn't find 66 69 221 221
couldn't find 66 69 220 220
couldn't find 66 69 220 220
couldn't find 66 69 221 221
&#62;&#62;&#62; 

This took a long time, running at about 98% of my CPU and 20% of my ~4GB of memory.  So I timed it (only 3 times because, you know, I'm impatient):

&#62;&#62;&#62; t = Timer(stmt='utf8_file_str = marc8_to_unicode(marc8_file_str)', setup='marc8_file = file("bibs105000.out"); marc8_file_str = marc8_file.read(); from pymarc import marc8_to_unicode')
&#62;&#62;&#62; t.timeit(3)
[a bunch of the "couldn't find 66 69 22[01] 22[01]", three times over]
786.2618350982666

So it took a little while.  How big is the file?

&#62;&#62;&#62; len(marc8_file_str)
93747869
&#62;&#62;&#62; len(utf8_file_str)
87474412

Hmm.  Those should match, right?  Well, let's see if we can read the new file:

&#62;&#62;&#62; pymarc_utf8_file = file('bibs105000_pymarc_utf8.dat', 'w')
&#62;&#62;&#62; pymarc_utf8_file.write(utf8_file_str.encode('utf8'))
&#62;&#62;&#62; reader = MARCReader(file('bibs105000_pymarc_utf8.dat'))
&#62;&#62;&#62; for record in reader:
...     print record['245']['a']
... 
None
Traceback (most recent call last):
  File "", line 1, in ?
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 54, in next
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 46, in __init__
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 123, in decodeMARC
pymarc.exceptions.BaseAddressInvalid: Base address exceeds size of record
&#62;&#62;&#62; 

Have I done something screwy, or is it something in the file?  I can get the file to you if you want to test on your system.  I also have a UTF-8 file to compare the output against, produced by `yaz-marcdump -f MARC-8 -t UTF-8 -I bibs105000.out &#62; bibs105000_utf8.out`, and of course the diff is a mile long.  If I get time later I'll test line by line, as done in test/marc8.py.

Apologies for the long long comment -- should be on some mailing list or something.</description>
		<content:encoded><![CDATA[<p>Great things, Ed (and thanks to Mark and Aaron, too).  I svn&#8217;d and tried it out on a dump I had lying around of 105,000 of our records.  Following the pymarc.__doc__, I did:</p>
<p>&gt;&gt;&gt; reader = MARCReader('bibs105000.out')<br />
&gt;&gt;&gt; for record in reader:<br />
...     print record['245']['a']<br />
...<br />
Traceback (most recent call last):<br />
  File "", line 1, in ?<br />
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 51, in next<br />
ValueError: invalid literal for int(): bibs1<br />
&gt;&gt;&gt; </p>
<p>But I noticed "self.reader = pymarc.MARCReader(file('test/test.dat'))" in test/reader.py, so I:</p>
<p>&gt;&gt;&gt; marc8_file = file('bibs105000.out')<br />
&gt;&gt;&gt; reader = MARCReader(marc8_file)<br />
&gt;&gt;&gt; for record in reader:<br />
...     print record['245']['a']<br />
...<br />
Microeconomics :<br />
The multilateral development banks :<br />
The Accountants digest.<br />
Achievement.<br />
ALA bulletin.<br />
Acta arithmetica.<br />
Acta crystallographica.<br />
ASLE transactions.<br />
Acta mathematica.<br />
Acta mechanica.<br />
Acta physica Polonica,<br />
Acta physica Austriaca /<br />
Acta physica Polonica.<br />
Acta polytechnica Scandinavica.<br />
[etc. -- big old list of titles, streaming by, very fast]</p>
<p>So MARCReader expects a file object, not a filename.  Does that doc string need updating or did I misread it?</p>
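<p>A minimal sketch of why the filename fails, assuming (from the traceback, not from the pymarc source) that the reader treats whatever it is given as MARC data and parses the first five bytes as the record length, which is where the MARC leader stores it. The helper record_length and the sample leader are hypothetical, and this is plain Python 3 rather than pymarc itself:</p>

```python
def record_length(marc_data):
    """Return the record length stored in leader positions 00-04."""
    # A MARC 21 record begins with its own length as five ASCII digits.
    return int(marc_data[:5])


# A real record starts with digits, so the length parses cleanly.
print(record_length("00714cam a2200205 a 4500"))

# A filename is not MARC data: its first five characters are "bibs1".
try:
    record_length("bibs105000.out")
except ValueError as err:
    print("ValueError:", err)
```

<p>Opening the file first and handing over the file object means the five bytes come from the data rather than from the name.</p>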
<p>I then tried to test marc8_to_unicode:</p>
<p>&gt;&gt;&gt; utf8_file = marc8_to_unicode(marc8_file)<br />
Traceback (most recent call last):<br />
  File "", line 1, in ?<br />
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 8, in marc8_to_unicode<br />
  File "build/bdist.linux-x86_64/egg/pymarc/marc8.py", line 43, in translate<br />
TypeError: len() of unsized object<br />
&gt;&gt;&gt; </p>
<p>Looking at test/marc8.py I saw marc8_to_unicode expects a string, so I:</p>
<p>&gt;&gt;&gt; marc8_file = file('bibs105000.out')<br />
&gt;&gt;&gt; marc8_file_str = marc8_file.read()<br />
&gt;&gt;&gt; utf8_file_str = marc8_to_unicode(marc8_file_str)<br />
couldn't find 66 69 221 221<br />
couldn't find 66 69 220 220<br />
couldn't find 66 69 221 221<br />
couldn't find 66 69 220 220<br />
couldn't find 66 69 221 221<br />
couldn't find 66 69 221 221<br />
couldn't find 66 69 220 220<br />
couldn't find 66 69 220 220<br />
couldn't find 66 69 221 221<br />
&gt;&gt;&gt; </p>
<p>This took a long time, running at about 98% of my CPU and 20% of my ~4GB of memory.  So I timed it (only 3 times because, you know, I&#8217;m impatient):</p>
<p>&gt;&gt;&gt; t = Timer(stmt='utf8_file_str = marc8_to_unicode(marc8_file_str)', setup='marc8_file = file("bibs105000.out"); marc8_file_str = marc8_file.read(); from pymarc import marc8_to_unicode')<br />
&gt;&gt;&gt; t.timeit(3)<br />
[a bunch of the "couldn't find 66 69 22[01] 22[01]", three times over]<br />
786.2618350982666</p>
<p>So it took a little while.  How big is the file?</p>
<p>&gt;&gt;&gt; len(marc8_file_str)<br />
93747869<br />
&gt;&gt;&gt; len(utf8_file_str)<br />
87474412</p>
<p>Hmm.  Those should match, right?  Well, let&#8217;s see if we can read the new file:</p>
<p>&gt;&gt;&gt; pymarc_utf8_file = file('bibs105000_pymarc_utf8.dat', 'w')<br />
&gt;&gt;&gt; pymarc_utf8_file.write(utf8_file_str.encode('utf8'))<br />
&gt;&gt;&gt; reader = MARCReader(file('bibs105000_pymarc_utf8.dat'))<br />
&gt;&gt;&gt; for record in reader:<br />
...     print record['245']['a']<br />
...<br />
None<br />
Traceback (most recent call last):<br />
  File "", line 1, in ?<br />
  File "build/bdist.linux-x86_64/egg/pymarc/reader.py", line 54, in next<br />
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 46, in __init__<br />
  File "build/bdist.linux-x86_64/egg/pymarc/record.py", line 123, in decodeMARC<br />
pymarc.exceptions.BaseAddressInvalid: Base address exceeds size of record<br />
&gt;&gt;&gt; </p>
<p>Have I done something screwy, or is it something in the file?  I can get the file to you if you want to test on your system.  I also have a UTF-8 file to compare the output against, produced by `yaz-marcdump -f MARC-8 -t UTF-8 -I bibs105000.out &gt; bibs105000_utf8.out`, and of course the diff is a mile long.  If I get time later I&#8217;ll test line by line, as done in test/marc8.py.</p>
<p>Apologies for the long long comment &#8212; should be on some mailing list or something.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
