Category Archives: python

Hacking O’Reilly RDFa

I recently learned from Ivan Herman’s blog that O’Reilly has begun publishing RDFa in their online catalog of books. If you install the RDFa Highlight bookmarklet, visit a page like this, and click on the bookmarklet, you’ll see something like:

Those red boxes you see are graphical depictions of where metadata [...]

oai-pmh and xmpp

As an experiment to learn more about xmpp I created a little utility that will poll an oai-pmh server and send new records as a chunk of xml over xmpp. The idea wasn’t necessarily to see all the xml coming into my jabber client (although you can do that). I wanted to enable [...]
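The polling half of that idea can be sketched roughly as follows. This is a minimal sketch, not the utility described in the post: it assumes a standard OAI-PMH `ListRecords` request with a `from` datestamp, the helper names `list_records_url` and `extract_records` are hypothetical, and the XMPP side is left out entirely.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since, metadata_prefix="oai_dc"):
    """Build a ListRecords request for records changed since a datestamp."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,
    }
    return base_url + "?" + urlencode(params)


def extract_records(response_xml):
    """Return each <record> element of an OAI-PMH response as an XML string."""
    root = ET.fromstring(response_xml)
    return [
        ET.tostring(record, encoding="unicode")
        for record in root.iter(OAI_NS + "record")
    ]
```

Each string returned by `extract_records` is one chunk of XML that a poller could hand to whatever messaging layer it uses. Remembering the datestamp of the last poll and passing it as `since` is what keeps the next request limited to new records.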

New York Times Topics as SKOS

Serves 23,376 SKOS Concepts
INGREDIENTS

Text editor: Vim, Emacs, TextMate, etc
Python
BeautifulSoup
rdflib
Internet connection

DIRECTIONS

Open a new file using your favorite text editor.
Instantiate an RDF graph with a dash of rdflib.
Use python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
Parse HTML into a fine, queryable data structure using BeautifulSoup.
Locate topic names and [...]
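The directions above might look something like the sketch below. To stay self-contained it swaps the recipe's BeautifulSoup and rdflib for the standard library's `html.parser` and hand-written N-Triples, so it is an illustration of the idea rather than the recipe itself. The names `LinkCollector` and `topics_to_ntriples` are hypothetical, and it assumes every link on the page is a topic, which a real index page would not satisfy without a narrower selector.

```python
from html.parser import HTMLParser

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def topics_to_ntriples(html):
    """Describe each linked topic as a skos:Concept with a skos:prefLabel."""
    parser = LinkCollector()
    parser.feed(html)
    lines = []
    for href, label in parser.links:
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"<{href}> <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f'<{href}> <{SKOS}prefLabel> "{escaped}"@en .')
    return "\n".join(lines)
```

Feeding the function the HTML fetched for one index page yields two triples per topic link. With rdflib the same statements would be added to a graph object instead of being formatted by hand.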

LibraryThing Ubuntu Screen Saver

I read about the LibraryThing Mac Screensaver and of course wanted the same thing for my Ubuntu workstation at $work. Naturally, I’m supposed to be working on some high-priority tickets on a tight deadline…so I started to work right away on how to do this. Your tax dollars at work, etc…
I’m sure that there’s a [...]

json vs pickle

In Python, JSON is faster, smaller and more portable than pickle …
At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a django app, and the db schema currently looks something like:

Anyhow, if you [...]
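A comparison like this is easy to rerun on your own data. The sketch below is a hypothetical harness, not the measurement from the post: the `compare` function and the sample records are made up for illustration, and the outcome depends heavily on the Python version, the pickle protocol and the shape of the data, so treat any single result with caution.

```python
import json
import pickle
import timeit


def compare(data, number=1000):
    """Return serialized size in bytes and total dump time in seconds per format."""
    serializers = (
        ("json", lambda d: json.dumps(d).encode("utf-8")),
        ("pickle", pickle.dumps),
    )
    results = {}
    for name, dumps in serializers:
        size = len(dumps(data))
        seconds = timeit.timeit(lambda: dumps(data), number=number)
        results[name] = {"bytes": size, "seconds": seconds}
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for rows pulled out of the database.
    sample = [
        {"title": "Example Gazette %d" % i, "issues": i, "place": "Washington"}
        for i in range(500)
    ]
    for name, stats in compare(sample).items():
        print(name, stats["bytes"], "bytes", round(stats["seconds"], 4), "seconds")
```

The harness only times serialization. Loading time, and whether the data contains types that JSON cannot represent directly, matter just as much when choosing between the two.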

lcsh.info SPARQL endpoint

I’ve set up a SPARQL endpoint for lcsh.info at sparql.lcsh.info. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in [...]
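For readers new to the idea, here is roughly what calling such an endpoint looks like from Python. This is a sketch under assumptions: the hostname comes from the post, but the trailing-slash path, the JSON results format requested in the `Accept` header, and the example query are guesses rather than documented behavior of that service. The `query` parameter name is the one defined by the SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hostname taken from the post; the exact path is an assumption.
ENDPOINT = "http://sparql.lcsh.info/"

# Illustrative query: ten concepts and their preferred labels.
EXAMPLE_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label
WHERE { ?concept skos:prefLabel ?label }
LIMIT 10
"""


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Asking for JSON results is an assumption about what the server supports.
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` would send the query. The sketch stops short of that because the service may no longer be online.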

pymarc PEP-8 cleanup

pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant.
If you are a current user of pymarc your code will most likely break, since methods [...]

calais and ocr newspaper data

Like you, I’ve been reading about the new Reuters Calais Web Service. The basic gist is that you can send the service text and get back machine-readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun [...]

WoGroFuBiCo wc

word             count
library          263
bibliographic    236
data             170
libraries        144
lc               127
control          109
information      98
cataloging       91
records          88
subject          82
materials        81
standards        81
use              80
congress         79
work             76
record           73
community        67
users            61
working          59
group            58
access           57
recommendations  56
resources        53
authority        52
metadata         47
future           46
new              40
environment      37
development      37
web              36
collections      35
systems          35
available        35
creation         35
services         34
headings         32
national         31
findings         30
research         30
unique           29
sharing          29
oclc             28
model            28
catalog          28
international    27
develop          27
value            27
lcsh             26
pcc              26
user             26
need             26
report           25
make             25
practices        25
rda              25
used             25
time             24
needs            24
rare             24
including        24
provide          23
discovery        23
communities      23
special          23
frbr             23
current          22
resource         22
rules            22
digital          21
cooperative      21
program          21
participants     21
management       21
service          20
dc               20
programs         20
online           20
costs            20
washington       20
standard         19
support          19
knowledge        19
different        19
appropriate      19
effort           18
applications     18
marc             18
shared           18
exchange         18
process          18
changes          17
lcs              17
increase         16
public           16
search           16
creating         16
broader          16
catalogs         16
controlled       16

I converted the PDF to a text file called ‘lc’ with xpdf and then wrote a little Python:

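#!/usr/bin/env python
# Python 2: urllib.urlopen and file() are Python 2 APIs.

from urllib import urlopen
from re import sub

# Fetch a stop word list and read the text extracted from the PDF.
stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
text = file('lc').read()

counts = {}
for word in text.split():
    # Normalize: lowercase, then strip non-word characters and digits.
    word = word.lower()
    word = sub(r'\W', '', word)
    word = sub(r'\d+', '', word)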
[...]
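Since the listing is cut off, here is a sketch of the same idea in Python 3. It is not the rest of the original script: it keeps the normalization steps shown above, and the stop word filtering and tallying with `collections.Counter` are assumptions about how such a count is usually finished. The function name `word_counts` is hypothetical.

```python
import re
from collections import Counter


def word_counts(text, stop_words=()):
    """Count normalized words in text, skipping stop words and empty tokens."""
    stop = set(stop_words)
    counts = Counter()
    for word in text.split():
        # Same normalization as the listing: lowercase, drop non-word
        # characters, then drop digits.
        word = re.sub(r"\W", "", word.lower())
        word = re.sub(r"\d+", "", word)
        if word and word not in stop:
            counts[word] += 1
    return counts
```

Calling `word_counts(text, stop_words).most_common(100)` on the extracted text would give a ranked list in the same shape as the table above.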

tools

At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve [...]