Tag Archives: python

bad xml smells

I’m used to refactoring code smells, but sometimes you can catch a bad whiff in XML too. Before: <?xml version="1.0" encoding="UTF-8"?> <mets TYPE="urn:library-of-congress:ndnp:mets:encyclopedia:encyclopediaEntry" PROFILE="urn:library-of-congress:mets:profiles:ndnp:encyclopediaEntry:v1.1" LABEL="The National Forum Scope Note" xmlns:mods="http://www.loc.gov/mods/v3" xmlns="http://www.loc.gov/METS/" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <!--METS HEADER--> <metshdr CREATEDATE="2007-01-10T09:00:00" ><!--CREATEDATE should be populated with creation date of the record. RECORDSTATUS should only be set [...]

version control and digital curation

For some time now I have been meaning to write about version control issues in repositories as they relate to some projects going on at $work. Most repository systems have a requirement to maintain original data as submitted. But as we all know, this content often changes over time, sometimes immediately. [...]

data.gov.uk and rdfa

The recent public release of the UK Government’s data.gov.uk site got picked up by the press last week in articles at The Guardian, Prospect Magazine and elsewhere. These have been supplemented by some more technical discussions at ReadWriteWeb, Open Knowledge Foundation, Talis, Jeni Tennison’s blog, and some helpful emails from Leigh Dodds (Talis) and Jonathan [...]

New York Times Topics as SKOS

Serves 23,376 SKOS Concepts. INGREDIENTS: Text editor (Vim, Emacs, TextMate, etc.), Python, BeautifulSoup, rdflib, Internet connection. DIRECTIONS: Open a new file using your favorite text editor. Instantiate an RDF graph with a dash of rdflib. Use Python’s urllib to extract the HTML for each of the Times Topics Index Pages, e.g. for A. Parse HTML [...]
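The parse-and-describe step of this recipe can be sketched roughly as follows. To keep it runnable without extra installs, this is a standard-library stand-in: `html.parser` takes the place of BeautifulSoup, and hand-written N-Triples lines take the place of an rdflib graph. The sample markup, the `example.org` URLs, and the names `LinkCollector` and `topics_to_ntriples` are all made up for illustration and are not taken from the original post or from the real Times Topics pages.

```python
from html.parser import HTMLParser

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            if label:
                self.links.append((self._href, label))
            self._href = None


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def topics_to_ntriples(html):
    """Turn every labelled link in the HTML into a skos:Concept."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    lines = []
    for href, label in parser.links:
        lines.append(f"<{href}> <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f'<{href}> <{SKOS}prefLabel> "{escape_literal(label)}"@en .')
    return lines


# Hypothetical index-page fragment; the real page structure will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.org/topics/abacus">Abacus</a></li>
  <li><a href="http://example.org/topics/abalone">Abalone</a></li>
</ul>
"""

if __name__ == "__main__":
    for line in topics_to_ntriples(SAMPLE_HTML):
        print(line)
```

A real index page would need a narrower selector than "every link", since navigation and footer links would otherwise end up as concepts too.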

flickr, digital curation and the web

The Library of Congress has started to put selected content from Chronicling America into Flickr as part of the Illustrated Newspaper Supplements set. More details on the rationale and process involved can be found in a FAQ on the LC Newspapers and Current Periodical Reading Room website. So for example this newspaper page on Chronicling [...]

LibraryThing Ubuntu Screen Saver

I read about the LibraryThing Mac Screensaver and of course wanted the same thing for my Ubuntu workstation at $work. Naturally, I’m supposed to be working on some high-priority tickets on a tight deadline…so I started to work right away on how to do this. Your tax dollars at work, etc… I’m sure that there’s [...]

json vs pickle

In Python, JSON is faster, smaller and more portable than pickle … At work, I’m working on a project where we’re modeling newspaper content in a relational database. We’ve got newspaper titles, issues, pages, institutions, places and some other fun stuff. It’s a Django app, and the db schema currently looks something like: Anyhow, if [...]
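A claim like this is easy to check on your own data. Below is a minimal sketch using only the standard library; the `record` dictionary and the `compare` helper are invented for illustration and are not the project's actual model or benchmark. Which format comes out ahead depends on the Python version, the pickle protocol, and the shape of the data, so treat the numbers as something to measure rather than assume.

```python
import json
import pickle
import timeit

# Hypothetical record, loosely shaped like a newspaper title with issues.
record = {
    "title": "The National Forum",
    "place": "Washington, D.C.",
    "issues": [
        {"date": "1910-04-30", "pages": 8},
        {"date": "1910-05-07", "pages": 8},
    ],
}


def compare(obj, number=1000):
    """Return serialized sizes in bytes and total dump times in seconds."""
    json_bytes = json.dumps(obj).encode("utf-8")
    pickle_bytes = pickle.dumps(obj)  # uses the default pickle protocol
    return {
        "json_size": len(json_bytes),
        "pickle_size": len(pickle_bytes),
        "json_seconds": timeit.timeit(lambda: json.dumps(obj), number=number),
        "pickle_seconds": timeit.timeit(lambda: pickle.dumps(obj), number=number),
    }


if __name__ == "__main__":
    for name, value in compare(record).items():
        print(name, value)
```

The portability point does not need a benchmark: the JSON output is plain text that any language can read, while a pickle can only be loaded by Python.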

pymarc PEP-8 cleanup

pymarc v2.0 was released yesterday afternoon. I’m mentioning it here to give a big tip of the hat to Gabriel Farrell (gsf on #code4lib) who spent a significant amount of time cleaning up the code to be PEP-8 compliant. If you are a current user of pymarc your code will most likely break, since methods [...]