Archive for the ‘python’ Category

lcsh.info SPARQL endpoint

Monday, July 7th, 2008

I’ve set up a SPARQL endpoint for lcsh.info at sparql.lcsh.info. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in a variety of formats. If you are a regular expression and/or SQL junkie, and like data, then SPARQL is definitely worth taking a look at.

If you are new to SPARQL and/or LCSH as SKOS you can try the default query and you’ll get back the first 10 triples in the triple store:

SELECT ?s ?p ?o
WHERE {?s ?p ?o}
LIMIT 10

As a first tweak try increasing the limit to 100. If you are feeling more adventurous perhaps you’d like to look up all the triples for a concept like Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  ?s skos:prefLabel "Buddhism"@en .
}

Or, perhaps you are interested in seeing what narrower terms there are for Buddhism:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label
WHERE {
  <http://lcsh.info/sh85017454#concept> skos:narrower ?uri .
  ?uri skos:prefLabel ?label
}

Or maybe you don’t know the exact skos:prefLabel (aka authorized heading), so you look for all the LCSH headings that start with “Independence”:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?label
WHERE {
  ?s skos:prefLabel ?label.
  FILTER regex(?label, '^independence', 'i')
}

Feel free to use the service however you want. I’m interested in seeing what its limitations are.
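If you would rather script these queries than paste them into the form, here is a minimal Python sketch of how a request to a SPARQL endpoint is usually put together. It assumes the endpoint accepts the query text as a `query` parameter on a GET request, as the SPARQL protocol describes. The endpoint URL (built from the hostname mentioned above) and the `build_sparql_url` helper are assumptions for illustration, not something the service documents.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint URL, built from the hostname mentioned in the post.
ENDPOINT = "http://sparql.lcsh.info/"

NARROWER_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label
WHERE {
  <http://lcsh.info/sh85017454#concept> skos:narrower ?uri .
  ?uri skos:prefLabel ?label
}
"""

def build_sparql_url(endpoint, query):
    """Return a GET URL carrying the query in a 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

url = build_sparql_url(ENDPOINT, NARROWER_QUERY)
# Fetching is left out on purpose; urllib.request.urlopen(url) would do it.
print(url)
```

The result format you get back depends on the endpoint, so the sketch stops at building the URL.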

Benjamin Nowack’s ARC made it extremely easy to load up the 2,441,494 LCSH triples in a few hours with a script like:

include_once('arc/ARC2.php');
 
$config = array(
    'db_name'               => 'arc',
    'db_user'               => 'arc',
    'db_pwd'                => 'notapassword',
    'store_name'            => 'lcsh',
    'store_log_inserts'     => 1,
);
 
$store = ARC2::getStore($config);
 
if (!$store->isSetup()) {
    $store->setUp();
}
 
$store->reset();
$rs = $store->query('LOAD <http://lcsh.info/static/lcsh.nt>');
 
print_r($rs);

Then it’s just a simple matter of putting up a PHP script like:

/* ARC2 static class inclusion */
include_once('arc/ARC2.php');
 
/* MySQL and endpoint configuration */
$config = array(
  /* db */
  'db_host' => 'localhost', /* optional, default is localhost */
  'db_name' => 'arc',
  'db_user' => 'arc',
  'db_pwd' => 'fakepassword',
 
  /* store name */
  'store_name' => 'lcsh',
 
 
  /* endpoint */
  'endpoint_features' => array(
    'select', 'construct', 'ask', 'describe'
  ),
  'endpoint_timeout' => 60, /* not implemented in ARC2 preview */
  'endpoint_read_key' => '', /* optional */
  'endpoint_write_key' => 'fakekey', /* optional */
  'endpoint_max_limit' => 1000, /* optional */
);
 
/* instantiation */
$ep = ARC2::getStoreEndpoint($config);
 
/* request handling */
$ep->go();

Ideally I would’ve been able to quickly bring up a SPARQL endpoint on top of the rdflib Sleepycat triple store that is being used to serve up the linked data at lcsh.info. But rather than pursuing elegance (this is kinda side work after all), I wanted to quickly put the SPARQL service out there for experimentation, and this was the quickest way for me to do that. If the service proves useful I’ll look more at what it takes to create an rdflib SPARQL service, or at porting over the little Python code I have to PHP (gasp).

pymarc PEP-8 cleanup

Thursday, February 28th, 2008

pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant.

If you are a current user of pymarc your code will most likely break, since methods like addField() will now look like add_field(). This is a small price to pay for pythonistas who typically prefer clean, consistent and more coherent code (how’s that for alliteration?). It had to be done and I’m very grateful to gsf for taking the time to do it.
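To get a feel for the renaming pattern when updating your own code, here is a small Python sketch that turns a camelCase name into its PEP-8 style equivalent. The `camel_to_snake` helper is hypothetical and not part of pymarc, and `someOtherMethod` is a made-up name. Check the pymarc source for the names that actually changed.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase name such as 'addField' to 'add_field'."""
    # Put an underscore before every capital letter that is not at the
    # start of the name, then lowercase the whole thing.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# 'addField' comes from the post; 'someOtherMethod' is a made-up example.
for old_name in ["addField", "someOtherMethod"]:
    print(old_name, "->", camel_to_snake(old_name))
```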

Another big thing is that we’ve switched from using subversion to bzr for revision control. Initially it seemed like a lightweight way for gsf and me to collaborate without monkeying with svn authentication (again)…and to learn the zen of distributed revision control. We both liked it so much that Gabe is hosting the pymarc repository at http://fruct.us/bzr/pymarc.

So if you like the latest/greatest/shiniest, and/or want to contribute some of your own changes to pymarc:

  % bzr branch http://fruct.us/bzr/pymarc
  % # hack, hack, hack, hackety, hack
  % bzr commit
  % bzr send --mail-to gsf@fruct.us --message "Gabe, I added a jammies method to the record object!"
  % # or publish your own repo and point us at it :-)

calais and ocr newspaper data

Wednesday, February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns an RDF graph, courtesy of Dan’s rdflib:

  from calais import calais_graph
  graph = calais_graph(content)

This is dependent on you getting a Calais license key and stashing it away in ~/.calais. I wrote a couple of sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example, here’s the people script:

  from calais import calais_graph
  from sys import argv
 
  filename = argv[1]
  content = file(filename).read()
  g = calais_graph(content)
 
  sparql = """
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ct: <http://s.opencalais.com/1/type/em/e/>
          PREFIX cp: <http://s.opencalais.com/1/pred/>
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """
 
  for row in g.query(sparql):
      print row[0]

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine a ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.
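The ranked-list idea can be sketched in a few lines of Python. This is only an illustration: the per-page groupings below are made up (the names are borrowed from the people output above), and `rank_names` is a hypothetical helper, not one of the sample scripts.

```python
from collections import Counter

# Hypothetical per-page results, made up for illustration; the names are
# borrowed from the sample output, the page groupings are not real data.
pages = [
    ["Herbert Putnam", "Bryan", "McKmley"],
    ["Herbert Putnam", "Bryan", "Pptter Palmer"],
    ["Herbert Putnam", "James Stafford"],
]

def rank_names(pages):
    """Count how many pages each name appears on, most frequent first."""
    counts = Counter()
    for names in pages:
        counts.update(set(names))  # count each name once per page
    return counts.most_common()

for name, n in rank_names(pages):
    print("%-20s %d" % (name, n))
```

Names that recur across pages rise to the top, while one-off OCR garbles like “McKmley” stay at a count of one and drift down the tail.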

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.
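For the curious, the extract-and-unescape step can be sketched with nothing but the standard library. The envelope below is an assumption made up for illustration: the `string` element name and the tiny payload are not the documented Calais format, and `extract_rdf` is a hypothetical helper rather than the code in calais.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: the element name "string" and the payload are
# assumptions for illustration, not the documented Calais format.
envelope = (
    '<string>'
    '&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;'
    '&lt;/rdf:RDF&gt;'
    '</string>'
)

def extract_rdf(envelope_xml):
    """Return the text inside the envelope with XML entities unescaped."""
    # The XML parser turns &lt; and &gt; back into < and > for us.
    return ET.fromstring(envelope_xml).text

rdf_xml = extract_rdf(envelope)
print(rdf_xml)
```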

WoGroFuBiCo wc

Thursday, January 10th, 2008

word count
library 263
bibliographic 236
data 170
libraries 144
lc 127
control 109
information 98
cataloging 91
records 88
subject 82
materials 81
standards 81
use 80
congress 79
work 76
record 73
community 67
users 61
working 59
group 58
access 57
recommendations 56
resources 53
authority 52
metadata 47
future 46
new 40
environment 37
development 37
web 36
collections 35
systems 35
available 35
creation 35
services 34
headings 32
national 31
findings 30
research 30
unique 29
sharing 29
oclc 28
model 28
catalog 28
international 27
develop 27
value 27
lcsh 26
pcc 26
user 26
need 26
report 25
make 25
practices 25
rda 25
used 25
time 24
needs 24
rare 24
including 24
provide 23
discovery 23
communities 23
special 23
frbr 23
current 22
resource 22
rules 22
digital 21
cooperative 21
program 21
participants 21
management 21
service 20
dc 20
programs 20
online 20
costs 20
washington 20
standard 19
support 19
knowledge 19
different 19
appropriate 19
effort 18
applications 18
marc 18
shared 18
exchange 18
process 18
changes 17
lcs 17
increase 16
public 16
search 16
creating 16
broader 16
catalogs 16
controlled 16

I converted the PDF to a text file called ‘lc’ with xpdf and then wrote a little Python:

#!/usr/bin/env python
 
from urllib import urlopen
from re import sub
 
stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
text = file('lc').read()
 
counts = {}
for word in text.split():
    word = word.lower()
    word = sub(r'\W', '', word)
    word = sub(r'\d+', '', word)
    if word == ''  or word in stop_words: continue
    counts[word] = counts.get(word,0) + 1
 
words = counts.keys()
words.sort(lambda a,b: cmp(counts[b], counts[a]))
for word in words[0:100]:
    print "%20s %i" % (word, counts[word])
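The script above is written for the Python of its day. As a hedged sketch of the same counting logic on Python 3, here is a version built on collections.Counter. The `top_words` function, the sample sentence, and the one-word stop list are stand-ins for illustration; the real run used the ‘lc’ file and the downloaded stop word list.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=100):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for word in text.split():
        # Same cleanup as the script above: lowercase, then drop
        # non-word characters and digits.
        word = re.sub(r"\W", "", word.lower())
        word = re.sub(r"\d+", "", word)
        if word and word not in stop_words:
            counts[word] += 1
    return counts.most_common(n)

# Tiny stand-in for the report text and the stop word list.
sample = "The library, the Library data 2008"
for word, count in top_words(sample, {"the"}, n=2):
    print("%20s %i" % (word, count))
```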

Does me writing code to read the report count as reading the report? …

tools

Thursday, October 18th, 2007


At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve seen dchud in top form managing complicated data/organizational workflows while making technical decisions. A nice outgrowth of working with smarties is ending up with a fun and productive technology stack: python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware. Given the press and the commitment to UNESCO I think the code is going to start being a bit more than a prototype pretty soon :-)

marcdb

Monday, October 1st, 2007

If you are a library data wrangler at some point you’ve probably wanted to stuff MARC data into a relational database so you can do queries across it. Perhaps your $vendor supports querying like this, but perhaps not. At any rate for some work I’ve been doing I’ve really needed to be able to get a feel for a batch of MARC authority data, in particular the data that Simon Spero has kindly made available.

So I created a little tool I’m calling marcdb which slurps in MARCXML or MARC and stuffs it into a relational database schema. The source for marcdb is available and you can install via the python cheeseshop with easy_install if you have it. As you can see from the README it lets SQLAlchemy and Elixir do the database talkin’. This results in a nice little python file that defines the schema in terms of Python classes. You ought to be able to use marcdb with any backend database (mysql, sqlite, postgres) that is supported by SQLAlchemy.

At any rate, the point of all this is to enable querying. So for example after I loaded Simon’s authority data I can do a query to see what the lay of the land is in terms of number of tags.

SELECT tag, COUNT(*) AS tag_count 
FROM data_fields 
GROUP BY tag 
ORDER BY tag_count DESC;

tag | tag_count
----+-----------
035 |    558727
670 |    496600
040 |    379999
010 |    379999
953 |    369625
906 |    272196
550 |    232544
150 |    217556
450 |    211067 
952 |    185012 
151 |    158900 
451 |    143538 
781 |    122490 
043 |     92656 
053 |     92404 
675 |     42496 
551 |     24797 
667 |     14434 
985 |     13725 
680 |     10342 
681 |      8873 
410 |      7103 
360 |      4126 
073 |      3540 
180 |      3000 
019 |      1832 
678 |      1311 
580 |       857 
480 |       808 
260 |       753 
185 |       501 
510 |       369 
485 |       262 
042 |       260 
500 |       259 
016 |       243 
585 |       192 
400 |       147 
682 |       134 
710 |       132 
979 |       107 
530 |        93 
430 |        82 
665 |        44 
182 |        36 
482 |         8 
969 |         4 
181 |         4 
555 |         4 
581 |         4 
455 |         4 
582 |         3 
481 |         3 
052 |         3 
411 |         2 
155 |         2 
751 |         2 
014 |         2 
050 |         2 
856 |         1
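If you want to try the shape of that query without loading any MARC, here is a minimal sketch using Python’s built-in sqlite3 module. The one-column schema and the handful of rows are assumptions made up for illustration. The real marcdb schema is defined by its own Python classes and has more to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the column the query needs.
conn.execute("CREATE TABLE data_fields (id INTEGER PRIMARY KEY, tag TEXT)")
# Made-up rows, just enough to give the GROUP BY something to count.
conn.executemany(
    "INSERT INTO data_fields (tag) VALUES (?)",
    [("035",), ("035",), ("670",), ("035",), ("150",), ("670",)],
)

rows = conn.execute(
    "SELECT tag, COUNT(*) AS tag_count "
    "FROM data_fields "
    "GROUP BY tag "
    "ORDER BY tag_count DESC"
).fetchall()

for tag, tag_count in rows:
    print("%s | %9d" % (tag, tag_count))
```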

Or, here’s a more complex query for determining the types of relationships found in See Also From Tracing fields.

SELECT subfields.value, count(*) AS value_count
FROM data_fields, subfields
WHERE data_fields.tag in ('500', '510', '511', '530', '548', '550', '551',
  '555', '580', '581', '582', '585')
AND data_fields.id = subfields.id
AND subfields.code = 'w'
GROUP BY subfields.value
ORDER BY value_count DESC;

 value | value_count
-------+-------------
 g     |        8438
 nne   |        1243
 nnaa  |        1083
 a     |         146
 b     |         140
 nna   |           8
 bnna  |           4
 anna  |           3
 n     |           2
 nnnd  |           2
 nnnb  |           1
(11 rows)

So most of the relations are ‘g’ which is for broader relations. I know MARC is kind of passé these days, but there’s a lot of it around in libraries, and it’s important to be able to make decisions about it–especially when converting it to more web-viable formats. I’d be interested in feedback if you get a chance to try it out.

pymarc, marc8 and nothingness

Friday, July 20th, 2007

pymarc 1.0 went out the day before yesterday with a new function: marc8_to_unicode(). When trying to leverage MARC bibliographic data in today’s networked world it is inevitable that the MARC-8 character encoding will at some point rear its ugly head and make your brain hurt. The problem is that the standard character set tools for various programming languages do not support it. So you need to know to use a specialized tool like marc4j, yaz, or MARC::Charset for converting from MARC-8 into something useful like Unicode.

The MARC-8 support in pymarc is the brainchild of Aaron Lav and Mark Matienzo. Aaron gave permission for us to package up some of his code from PyZ3950 into pymarc. In testing with equivalent MARC-8 and UTF-8 record batches from the Library of Congress we were able to find and fix a few glitches.

The exercise was instructive to me because of my previous experience working with the MARC::Charset Perl module. When I wrote MARC::Charset I was overly concerned with not storing the mapping table in memory, so I originally used an on-disk Berkeley DB. Aaron’s code simply stored the mapping in memory. Since Python stores bytecode on disk after compiling there were some performance gains to be had over Perl, since Perl would compile the big mapping hash every time. But the main thing is that Aaron seemed to choose the simplest solution first, whereas I was busy performing a premature optimization. I also went through some pains to enable mapping not only MARC-8 to Unicode but Unicode back to MARC-8. In hindsight this was a mistake because going back to MARC-8 gets more insane as each day passes.
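To make the “just keep the mapping in memory” idea concrete, here is a toy Python sketch. It is not pymarc’s or Aaron’s code. The table holds only two combining marks (the byte values are my understanding of the ANSEL grave and acute, so treat them as an assumption), and `toy_marc8_to_unicode` is a hypothetical name. It also shows one wrinkle of the conversion: MARC-8 puts combining marks before the base character, while Unicode puts them after.

```python
import unicodedata

# Toy, partial table: two ANSEL combining marks as I understand them.
# A real MARC-8 table is far larger; this is not pymarc's code.
COMBINING = {
    0xE1: "\u0300",  # combining grave accent
    0xE2: "\u0301",  # combining acute accent
}

def toy_marc8_to_unicode(data):
    """Decode ASCII plus the two combining marks in COMBINING."""
    out = []
    pending = []  # MARC-8 puts combining marks before the base character
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)  # Unicode wants them after the base
            pending = []
        else:
            raise ValueError("byte 0x%02X is not in the toy table" % byte)
    return unicodedata.normalize("NFC", "".join(out))

print(toy_marc8_to_unicode(b"caf\xe2e"))
```

A plain dict lookup per byte is about as simple as it gets, which is the point of the comparison above.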

Aaron’s code as a result is much cleaner and easier to understand because, well, there’s less of it. I’m reading Beautiful Code at the moment and was just reading Jon Bentley’s chapter “The Most Beautiful Code I Never Wrote” — which really crystallized things. Definitely check out Beautiful Code if you have a chance. Maybe the quiet books4code could revive to read it as a group?

late easter present

Tuesday, April 10th, 2007

I finally took the time to make pymarc setuptools friendly. This basically means that if you’ve got easy_install handy you can:

sudo easy_install pymarc

If you haven’t looked at eggs yet, they are pretty much the de facto standard for distributing Python code. PyPI (the Python Package Index, aka the Python Cheese Shop) allows easy_install to locate and download packages, which are then unpacked and installed.

pymarc was basically an experiment to make sure I understood how eggs worked with pypi. Next up Rob Sanderson has sent me some code he and a colleague did for parsing Library of Congress Classification Numbers which I’m going to bundle up as an egg as well. Stay tuned.

oai/sru and ruby

Thursday, April 20th, 2006

biblio:~/Projects/ruby-oai ed$ ruby test.rb
Loaded suite test
Started
..........
Finished in 171.247595 seconds.

10 tests, 280 assertions, 0 failures, 0 errors

So after about 4 hours of hacking I’ve got ruby-oai, which is an OAI-PMH client library for Ruby. Included is a test suite that puts all 6 OAI-PMH verbs through their paces using OAI-PMH servers at loc.gov, lanl.gov, and pubmedcentral.gov.
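For readers who haven’t met OAI-PMH, every request is an HTTP call with a `verb` parameter plus a few verb-specific arguments. Here is a minimal sketch of that request shape, written in Python to match the rest of this category rather than in Ruby. It is not the ruby-oai API: `build_oai_url` is a hypothetical helper and the base URL is a placeholder.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
VERBS = (
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
)

def build_oai_url(base_url, verb, **params):
    """Return the request URL for one OAI-PMH verb."""
    if verb not in VERBS:
        raise ValueError("unknown OAI-PMH verb: %s" % verb)
    # Put the verb first, then the remaining arguments in a stable order.
    query = [("verb", verb)] + sorted(params.items())
    return base_url + "?" + urlencode(query)

# Placeholder base URL, not a real repository.
print(build_oai_url("http://example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc"))
```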

Just a few days before I did something with sruby, which is an SRU client library for Ruby. Now these are just the initial versions, and I’m sure there are ways that they can be improved. But after doing a few years of solid Java coding it’s just another reminder of how dynamic languages such as Ruby and Python can really help catapult productivity.

I have to admit, I do miss javac sitting on my shoulder reminding me of things I’m doing wrong at compile time compared with discovering bugs/errors at run time in python and ruby. But I’ve really come around to Bruce Eckel’s point about Strong Typing vs Strong Testing. Building sruby and ruby-oai using test driven development really makes me more confident that my code is working properly…and what’s more it makes me much happier with the look and feel of the API. This look n’ feel aspect is something that the mechanical javac can’t really help with. Doing test driven development in Java with Eclipse approaches this level–but somehow isn’t as fun–but I imagine it scales better to larger teams. However, I don’t work on any of these uber large teams anyhow. Hopefully I’ll find a moment to talk about the new work I’m doing in the coming weeks.

learning programming w/ python

Tuesday, March 14th, 2006

Programming For Newbies With Python

Begins: Sat, 25 Mar 2006 at 1:00 PM

Ends: Sat, 25 Mar 2006 at 3:30 PM

Location: Loyola University, Chicago, IL

Link: details

Here’s an interesting event for promoting Python as a first language. Below is an excerpt from mtobis’ email announcement:

The Chicago Python User Group (ChiPy) with the kind cooperation of the
Computer Science Department at Loyola University of Chicago will be
offering a free introduction to computer programming using the Python
language.

I’m looking for people who would be interested in taking up computing
as a serious hobby. The final impetus to present this was presented by
a father-son team who want to learn to program together. I would
welcome teenagers or adults. Parent/teen pairs are especially welcome.

Children under the age of 13 may attend if accompanied by an adult but
for most pre-teens this may prove too challenging.

On the other hand, professional programmers will find the pace too slow.

You need no coding experience at all, but you shouldn’t be unfamiliar
with a computer altogether. A small amount of exposure to HTML would
be helpful.

The first meeting will be an introduction to the power of Python, and
an organizational meeting. By the time you leave you will have written
a small and amusing piece of working software.

We’ll also poll the group about your interests, and decide on where
and how often we should meet in the future, and set up some online
communication to keep each other in contact. We will probably meet
every second Saturday.

Tags: chicago python