Tuesday, December 22, 2009
I recently learned from Ivan Herman’s blog that O’Reilly has begun publishing RDFa in their online catalog of books. So if you go and install the RDFa Highlight bookmarklet and then visit a page like this and click on the bookmarklet you’ll see something like:
Those red boxes you see are graphical depictions of where metadata [...]
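To get a feel for what a highlighter like this is looking for, here is a minimal sketch that lists the elements in a page that carry RDFa attributes. It uses only Python's standard library. The sample markup, the `ex:` and `dc:` prefixes and the helper names are made up for illustration; none of it is copied from O'Reilly's catalog or from the bookmarklet's source.

```python
from html.parser import HTMLParser

# A small subset of RDFa attributes that signal "metadata lives here".
RDFA_ATTRS = {"about", "property", "typeof", "resource"}


class RDFaFinder(HTMLParser):
    """Collect (tag, rdfa_attributes) for every element that carries RDFa."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        rdfa = {name: value for name, value in attrs if name in RDFA_ATTRS}
        if rdfa:
            self.found.append((tag, rdfa))


# Hypothetical markup, invented for this example.
SAMPLE = """
<div about="urn:example:book" typeof="ex:Book">
  <span property="dc:title">An Example Book</span>
  <p>No metadata here.</p>
</div>
"""


def find_rdfa(html):
    finder = RDFaFinder()
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    for tag, attrs in find_rdfa(SAMPLE):
        print(tag, attrs)
```

A real RDFa parser also resolves prefixes and builds triples; this only shows where the annotated elements sit in the markup.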
Wednesday, September 23, 2009
As an experiment to learn more about XMPP I created a little utility that will poll an OAI-PMH server and send new records as a chunk of XML over XMPP. The idea wasn’t necessarily to see all the XML coming into my Jabber client (although you can do that). I wanted to enable [...]
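The polling half of that idea can be sketched roughly as follows. This is not the utility itself: the base URL, the sample response and the function names are assumptions for illustration. The XMPP send is left out, since it needs a third-party library. The `verb`, `metadataPrefix` and `from` parameters are standard OAI-PMH request arguments.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since, metadata_prefix="oai_dc"):
    """Build a ListRecords request for records changed since a given date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,
    }
    return base_url + "?" + urlencode(params)


def split_records(response_xml):
    """Return each <record> element as its own chunk of XML text."""
    root = ET.fromstring(response_xml)
    return [
        ET.tostring(record, encoding="unicode")
        for record in root.iter(OAI_NS + "record")
    ]


# Hypothetical response body, trimmed down to the parts used here.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai", "2009-09-23"))
    for chunk in split_records(SAMPLE_RESPONSE):
        print(chunk)  # each chunk would become the body of one XMPP message
```

A real poller would also remember the timestamp of its last request and follow `resumptionToken` values for large result sets.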
Serves 23,376 SKOS Concepts
INGREDIENTS
Text editor: Vim, Emacs, TextMate, etc
Python
BeautifulSoup
rdflib
Internet connection
DIRECTIONS
Open a new file using your favorite text editor.
Instantiate an RDF graph with a dash of rdflib.
Use Python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A.
Parse HTML into a fine, queryable data structure using BeautifulSoup.
Locate topic names and [...]
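The directions above can be sketched as follows, with two substitutions so that it runs without extra installs: the standard library's `html.parser` stands in for BeautifulSoup, and plain N-Triples strings stand in for an rdflib graph. The sample HTML, the example.org URLs, the `@en` language tag and the choice of `skos:prefLabel` are assumptions; the real index pages and the rest of the recipe are not shown in this excerpt. The fetching step is left out and would be a `urllib` call per index page.

```python
from html.parser import HTMLParser

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TopicLinkParser(HTMLParser):
    """Collect (href, link text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def topics_to_ntriples(html):
    """Describe each linked topic as a skos:Concept with a preferred label."""
    parser = TopicLinkParser()
    parser.feed(html)
    lines = []
    for href, label in parser.links:
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"<{href}> <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f'<{href}> <{SKOS}prefLabel> "{escaped}"@en .')
    return lines


# Hypothetical index page fragment.
SAMPLE_INDEX = """
<ul>
  <li><a href="http://example.org/topics/abacus">Abacus</a></li>
  <li><a href="http://example.org/topics/acoustics">Acoustics</a></li>
</ul>
"""

if __name__ == "__main__":
    print("\n".join(topics_to_ntriples(SAMPLE_INDEX)))
```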
Wednesday, March 11, 2009
I read about the LibraryThing Mac Screensaver and of course wanted the same thing for my Ubuntu workstation at $work. Naturally, I’m supposed to be working on some high-priority tickets on a tight deadline…so I started to work right away on how to do this. Your tax dollars at work, etc…
I’m sure that there’s a [...]
in python JSON is faster, smaller and more portable than pickle …
At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a Django app, and the db schema currently looks something like:
Anyhow, if you [...]
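If you want to check the size claim for your own data, a small sketch like the one below serializes the same object both ways and compares byte counts. The record and its field names are invented, since the schema is not shown in this excerpt. Results depend heavily on the Python version and the pickle protocol; protocol 0 is used here as an assumption because it is the old text-based format.

```python
import json
import pickle

# Hypothetical record loosely modeled on the entities mentioned above.
record = {
    "title": "Example Gazette",
    "issues": 1200,
    "pages": [1, 2, 3, 4],
    "place": "Washington, DC",
}


def roundtrip_sizes(obj, pickle_protocol=0):
    """Serialize obj with json and pickle; return (json_bytes, pickle_bytes)."""
    as_json = json.dumps(obj)
    as_pickle = pickle.dumps(obj, protocol=pickle_protocol)
    # Make sure both formats give back the original object.
    assert json.loads(as_json) == obj
    assert pickle.loads(as_pickle) == obj
    return len(as_json.encode("utf-8")), len(as_pickle)


if __name__ == "__main__":
    json_size, pickle_size = roundtrip_sizes(record)
    print("json bytes:  ", json_size)
    print("pickle bytes:", pickle_size)
```

For timing, wrapping the `dumps` and `loads` calls with the standard `timeit` module gives a rough comparison on your own machine.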
I’ve set up a SPARQL endpoint for lcsh.info at sparql.lcsh.info. If you are new to SPARQL endpoints, they are essentially REST web services that allow you to query a pool of RDF data using a query language that combines features of pattern matching, set logic and the web, and then get back results in [...]
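As a rough illustration of the REST side of this, the sketch below builds the GET URL for a query without sending it. The `query` parameter comes from the SPARQL protocol. The `http://` form of the endpoint URL and the use of `skos:prefLabel` in the data are assumptions, and the endpoint may no longer be online.

```python
from urllib.parse import urlencode

# Assumed URL form; the post only gives the host name.
ENDPOINT = "http://sparql.lcsh.info/"

# Ask for ten concepts and their labels (assumes SKOS labels in the data).
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label
WHERE { ?concept skos:prefLabel ?label }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query into the URL of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


if __name__ == "__main__":
    print(sparql_get_url(ENDPOINT, QUERY))
```

Fetching that URL with an `Accept` header such as `application/sparql-results+json` is the usual way to ask for results in a particular format, if the server supports it.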
Thursday, February 28, 2008
pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant.
If you are a current user of pymarc your code will most likely break, since methods [...]
Wednesday, February 13, 2008
Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun [...]
Thursday, January 10, 2008
word             count
library          263
bibliographic    236
data             170
libraries        144
lc               127
control          109
information      98
cataloging       91
records          88
subject          82
materials        81
standards        81
use              80
congress         79
work             76
record           73
community        67
users            61
working          59
group            58
access           57
recommendations  56
resources        53
authority        52
metadata         47
future           46
new              40
environment      37
development      37
web              36
collections      35
systems          35
available        35
creation         35
services         34
headings         32
national         31
findings         30
research         30
unique           29
sharing          29
oclc             28
model            28
catalog          28
international    27
develop          27
value            27
lcsh             26
pcc              26
user             26
need             26
report           25
make             25
practices        25
rda              25
used             25
time             24
needs            24
rare             24
including        24
provide          23
discovery        23
communities      23
special          23
frbr             23
current          22
resource         22
rules            22
digital          21
cooperative      21
program          21
participants     21
management       21
service          20
dc               20
programs         20
online           20
costs            20
washington       20
standard         19
support          19
knowledge        19
different        19
appropriate      19
effort           18
applications     18
marc             18
shared           18
exchange         18
process          18
changes          17
lcs              17
increase         16
public           16
search           16
creating         16
broader          16
catalogs         16
controlled       16
I converted the PDF to a text file called ‘lc’ with xpdf and then wrote a little Python:
#!/usr/bin/env python
from urllib import urlopen
from re import sub
stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
text = file('lc').read()
counts = {}
for word in text.split():
    word = word.lower()
    word = sub(r'\W', '', word)
    word = sub(r'\d+', '', word)
[...]
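The script above is Python 2 and is cut off in this excerpt. Here is a self-contained Python 3 sketch of the same idea. The normalization steps mirror the lines shown; skipping empty strings and stop words, and the `count_words` name, are assumptions about the elided part rather than the original code.

```python
import re
from collections import Counter


def count_words(text, stop_words=frozenset()):
    """Count normalized words in text, ignoring anything in stop_words."""
    counts = Counter()
    for word in text.split():
        word = word.lower()
        word = re.sub(r"\W", "", word)   # drop punctuation
        word = re.sub(r"\d+", "", word)  # drop digits
        if word and word not in stop_words:
            counts[word] += 1
    return counts


if __name__ == "__main__":
    sample = "Library data, library DATA and 2008 records."
    for word, count in count_words(sample, {"and"}).most_common():
        print(word, count)
```

To reproduce something like the table, read the converted text file into a string, load a stop word list into a set, and print `count_words(text, stop_words).most_common(100)`.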
Thursday, October 18, 2007
At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve [...]