iogdc ramblings

Yesterday I was at the first day of the International Open Government Data Conference in Washington DC. It was an exciting day, with a great deal of enthusiasm being expressed by luminaries like Tim Berners-Lee, Jim Hendler, Beth Noveck, and Vivek Kundra for enabling participatory democracy by opening up access to government data. Efforts to aggregate egov datasets from various jurisdictions were well represented, although it would’ve been great to hear more from places like Spain and Sweden, as well as from groups like the Sunlight Foundation and Open Knowledge Foundation … but there are two more days to go. Here are my reflections from the first day:


New Zealand is embracing the use of Creative Commons licenses to release their datasets onto the web. Their NZGOAL project got cabinet approval for using CC licenses in June of this year. They are now doing outreach within government agencies, and building tools to help data owners put these licenses into play, so that data can go out on the web. Where I work at the Library of Congress, the general understanding is that our data is public domain (in the US) … except when it’s not. For example, some of the high resolution images in the Prints and Photographs Catalog aren’t available outside the physical buildings of the Library of Congress, due to licensing concerns. So I’m totally envious of New Zealand’s coordinated efforts to iron out these licensing issues.


Vivek Kundra and Alan Mallie of data.gov touted the number of datasets that data.gov is federating access to. But it remains unclear exactly how content is federated, and how datasets flow from agencies into data.gov itself. Perhaps some of these details are included in the v1.0 release of the data.gov Concept of Operations (which Kundra announced). An excellent question posed to Berners-Lee and Kundra concerned what role centralized and distributed approaches play in publishing data. While there is value in one-stop shopping, where you can find data aggregated in one place, Berners-Lee really stressed that the web grew because it was distributed. Aggregated collections of datasets like data.gov need to be able to efficiently pull data from the places where it is collected. We need to use the web effectively to enable this.

Legacy Data

There are tons of datasets waiting to be put on the web. Steve Young of the EPA described a few, such as the Toxics Release Inventory, whose goal is to:

provide communities with information about toxic chemical releases and waste management activities and to support informed decision making at all levels by industry, government, non-governmental organizations, and the public.

This data has been collected for 22 years, since the passage of the Emergency Planning and Community Right-to-Know Act. Young emphasized how important it is that this data be used in applications, and combined with other datasets. The data is available for download directly from the EPA, and is also available on data.gov. It would’ve been interesting to learn more about the mechanics of how the EPA gets data onto data.gov, and how updates can flow.

But a really important question came from Young’s colleague at the EPA (sorry, I didn’t note her name). She asked how the data in their relational databases could be made available on the web. Should they simply dump the database? Or is there something else they could do? Young said that it’s early days, but he hoped that Linked Data might have some answers. The issue came up again later in the day at the Is the Semantic Web Ready Yet panel, where there was a question about how to make Linked Data relevant to folks whose focus is enterprise data. In my opinion Linked Data advocates overemphasize the importance of using the RDF and SPARQL standards, and of converting all the data over, without completely understanding how invasive these solutions are. Not enough is done to show enterprise data folks, who typically think in terms of relational databases, what they can do to put their lovingly crafted and hugged data on the web. Consider a primary key in a database: what does it identify, and what relations does that thing have with other things? Why not use that key in constructing a URL for the thing, and link things together using the URLs? Then other people could use your URLs in their own data as well. I think the drumbeat to use SPARQL and triple stores often misses explaining this fundamental baby step that data owners could take. As Derek Willis said (on the second day, as I write this), people want to use your data, but not your database … people want to browse your data using their web browser. Assigning URLs to the important stuff in your databases is the first important step towards Linked Data.
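The primary-key-to-URL baby step described above is small enough to sketch. Everything here is hypothetical (the `facility` and `operator` tables, the column names, and the base URL are all made up), but it shows the move: reuse the keys you already maintain as web identifiers, and express foreign keys as links.

```python
# Hypothetical schema: a "facility" table with primary key "id" and a
# foreign key "operator_id" into an "operator" table. The base URL and
# table names are invented for illustration.

BASE = "https://data.example.gov"

def facility_url(pk):
    """Mint a stable URL for a facility row from its primary key."""
    return f"{BASE}/facility/{pk}"

def operator_url(pk):
    """Mint a stable URL for an operator row from its primary key."""
    return f"{BASE}/operator/{pk}"

def row_to_statements(row):
    """Express one row as simple subject/predicate/object statements,
    reusing the primary and foreign keys as URLs so others can link in."""
    s = facility_url(row["id"])
    return [
        (s, "name", row["name"]),
        (s, "operatedBy", operator_url(row["operator_id"])),
    ]

statements = row_to_statements({"id": 42, "name": "Acme Plant", "operator_id": 7})
```

Because the operator ends up as a URL rather than an opaque integer, anyone who fetches the facility can follow the link, and anyone else can reuse those URLs in their own data.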


Robert Schaefer of the Applied Physics Lab at Johns Hopkins University pointed out that enabling virtual communities around our data is an essential part of making data available and usable. In my opinion this is the true potential of data aggregator platforms like data.gov: they can allow users of government datasets to share what they have done, and learn from each other. Efforts like Civic Commons also promise to be places where this collaboration can take place. The communities may be born inside or outside of government, but they inevitably must include both. The W3C eGov effort might also be a good place to collaborate on standards.

me and my homepage

Thanks for all the positive feedback to my last post about using URLs to identify resources. Tom Heath (one of the founding fathers of the Linked Data meme/pattern) suggested that discussions about this topic are harmful, so of course I have to continue the conversation … even if I don’t have much new to say. Hell, I just wanted an excuse to re-publish another one of Paul Downey’s lovely REST Tarot Cards that he is doing for NaNoDrawMo 2010, and get some more hits on my backwater blog :-)

Anyhow, so Tom said:

Joe Developer … has to take a bit of responsibility for writing sensible RDF statements. Unfortunately, people like Ed seeming to conflate himself and his homepage (and his router and its admin console) don’t help with the general level of understanding. I’ve tried many times to explain to someone that I am not my homepage, and as far as I know I’ve never failed. In all this frantic debate about the 303 mechanism, let’s not abandon certain basic principles that just make sense.

I’m glad Tom is able to explain this stuff about Information Resources better than me. I think I was probably one of the people he explained it to at some point. I understand the principles that Tom and other Linked Data advocates are promulgating well enough to throw together some Linked Data implementations at the Library of Congress, such as the Library of Congress Subject Headings and Chronicling America, which put millions of resources online using the principles that got documented in Cool URIs for the Semantic Web.

How do you know if someone understood something you said? Normally by what they do in response to what you say, right? The rapid growth of the Linked Data cloud is a testament to the Linked Data folks’ ability to effectively communicate with software developers. No question. But let’s face it, the principles of web architecture have seen way more adoption, right? The successes that Linked Data has enjoyed so far have been a result of grounding the Semantic Web vision in the mechanics of the web we have now. And my essential point is that they didn’t go far enough in making it easier.

So, yeah…I’m not my homepage. As someone would’ve said to me in grade school: “No shit Sherlock” :-) Although, our blogs sure seem to be having a friendly argument with each other at the moment (thanks Keith). What is a homepage anyhow? The Oxford English Dictionary defines a homepage as:

A document created in a hypertext system (esp. on the World Wide Web) which serves either as an introductory page for a visitor to a web site, or as a focus of information on a particular topic, and which usually contains hypertext links to related documents and other web sites.

So my homepage is a hypertext document with a particular focus, in this case the person Ed Summers. If you are at your desk and fire up your browser and type in the URL for my homepage, you get an HTML document. If you are on the train and type the same URL into the browser on your mobile device, you might get a very different HTML document, optimized for rendering on a smaller screen. This is how the web was designed to work, albeit a bit ex post facto (which is why it is awesome). A URL identifies a Resource; a Resource can be anything; when you request that Resource using HTTP you get back a Representation of the current state of the Resource. The Representation you get back is determined by the way it was requested: in this case the User-Agent header of the browser determined which HTML came back.

It’s very easy to look down over your bi-focals and say things like “surely Ed realizes he is not his homepage”. But if we’re going to go there, it kind of raises the question: what is a homepage … and who am I? Identity is hard. Tom should be pretty familiar with how hard it is, since his instructions on using owl:sameAs to link resources together prove to be a bit harder to follow in practice than in theory.

But let’s not go there. Who really wants to think about stuff like that when you are building a webapp that makes reusable machine readable data available?

My contention is that this whole line of discussion is largely academic, and gets in the way of actually putting resource descriptions out on the web. The reality is that people can and do use the URL of my homepage as an identifier for me, Ed Summers. It is natural, and effortless, and doesn’t require someone with a PhD in Knowledge Representation to understand it. If I want to publish some RDF that says:

<> a foaf:Person .

I can do that. It’s my website, and I decide what the resources on it are. If someone puts that URL into their browser and gets some HTML, that’s cool. If someone’s computer program likes RDF and gets back some “application/rdf+xml”, all the better. If a script wants to nibble on some JSON, sure, here’s some “application/json” for ya. If someone wants to publish RDF about me, and use that URL as the identifier to hang their descriptions off of, I say go right ahead. It’s still an Open Web, right? (Oh please say it still is.)
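The HTML-or-RDF-or-JSON dispatch described above can be sketched in a few lines. This is an assumed, minimal example: one resource, three canned placeholder representations, and no handling of q-values, which a real server would also honor.

```python
# Sketch: choose a representation of a single resource based on the
# HTTP Accept header. The bodies are placeholders, not real content.

REPRESENTATIONS = {
    "text/html": "<html>...homepage...</html>",
    "application/rdf+xml": "<rdf:RDF>...</rdf:RDF>",
    "application/json": '{"name": "Ed Summers"}',
}

def negotiate(accept_header, default="text/html"):
    """Return (media_type, body) for the first acceptable type we offer,
    falling back to HTML when nothing in the Accept header matches."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()  # drop any q= parameters
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    return default, REPRESENTATIONS[default]
```

The point is that all three representations hang off the same URL; only the request headers differ.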

And best of all, if someone wants different URLs for themselves and their homepage, that’s fine too. The Linked Data we have deployed by following the rules to the best of our ability is still legit. It’s all good. I don’t mind following rules, but ultimately they have to make sense…to me. And this website is all about me, right? :-)

routers, webcams and thermometers

If you have a local wi-fi network at home you probably use something like a Linksys wireless router to let your laptop and other devices connect to the Internet. When you bought it and plugged it in you probably followed the instructions, typed the router’s address into your web browser, and visited a page to configure the router: setting its name, admin password, etc.

Would you agree that this router sitting on top of your TV, or wherever it is, is a real world thing? It’s not some abstract concept of a router: you can pick it up, turn it off and on, take it apart and try to put it back together again. And the router is identified with a URL. When your web browser resolves that URL it gets back some HTML that lets you see the router’s current state, and make modifications to it. You don’t get the router itself. That would be silly, right?

In terms of REST, the router is a Resource that has a URL Identifier, which when resolved returns an HTML Representation of the Resource. But you don’t really have to think about it much at all, because it’s intuitively part of how you use the web every day.

In fact the Internet is strewn with online devices that have embedded web servers in them. A 5 year old BoingBoing article More Googleable Unsecured Webcams shows how you can drop a web search for inurl:“view/index.shtml” into Google, and get back thousands of webcams from around the world. You can zoom and pan these cameras using your web browser. These are URLs for real world cameras. When you put the URL in your browser you don’t get the camera itself, that’s crazy talk; instead you get some HTML describing the camera’s current state, and some form controls for changing its position. Again all is well in the REST world, where the camera is the Resource identified with a URL, and your browser receives a Representation of the Resource.

If you are an Arduino hacker you might follow some instructions to build an online thermometer. You wire up the temperature sensor, and configure the Arduino to listen for HTTP requests at a particular IP address. You can then visit a URL in your web browser, and the server returns a Representation of the current temperature. It doesn’t return the Arduino board, the thermometer, or the thermodynamic state of its environment…that’s crazy talk. It returns a Representation of the temperature.
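The thermometer idea can be sketched in a few lines of Python instead of Arduino firmware. This is a toy, with `read_sensor()` as a stub standing in for real hardware, but it makes the REST framing concrete: the device is the Resource, and GET returns a Representation of its current state.

```python
# Sketch of an online thermometer: HTTP GET returns a Representation
# of the current temperature, never the device itself.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_sensor():
    """Stand-in for a real temperature sensor."""
    return 21.5

def temperature_representation():
    """A JSON representation of the thermometer's current state."""
    return json.dumps({"celsius": read_sensor()})

class ThermometerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = temperature_representation().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To put it on the web:
#   HTTPServer(("", 8000), ThermometerHandler).serve_forever()
```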

So imagine I want to give myself a URL. Is this so different from the camera, the router and the thermometer? Sure, I don’t have a web server embedded in me. But even if I did, nobody would expect it to return me, would they? Just as in the other cases, people would expect a Representation of me to be returned. Heck, there are millions of OpenID URLs deployed for people already. But this argument is used time and time again in the Semantic Web and Linked Data communities to justify the need for elaborate, byzantine, hard-to-explain HTTP behavior when making RDF descriptions of real world things available. The pattern is best described in the Cool URIs for the Semantic Web W3C Note. I understand it. But if you’ve ever had to explain it to a web developer not already brainwashed^w familiar with the pattern, you will understand that it is hard to explain convincingly. It’s even harder to implement correctly, since you are constantly asking yourself nonsensical questions like “is this an Information Resource?” when you are building your application.

I was pleased to see Ian Davis’ recent well-articulated posts about whether the complicated HTTP behavior is essential for deploying Linked Data. I know I am biased, because I was introduced to much of the Semantic Web and Linked Data world when Ian Davis and Sam Tunnicliffe visited the Library of Congress three years ago. I agree with Ian’s position: the current situation with the 303 redirect is potentially wasteful, error prone and bordering on the absurd … and the Linked Data community could do a lot to make it easier to deploy Linked Data. At its core, Ian’s advice in Guide to Publishing Linked Data Without Redirects does a nice job of making Linked Data publishing seem familiar to folks who have used HTTP’s content-negotiation features to enable internationalization, or to build RESTful web services. A URL for a resource with a more constrained set of representations allows for Agent Driven Negotiation in situations where custom tuning the Accept header in the client isn’t convenient or practical. Providing a pattern for linking these resources together, with something like wdrs:describedby and/or the describedby relation that’s now available in RFC 5988, is helpful for people building REST APIs and Linked Data applications.
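In practice, linking a thing to its description document with the RFC 5988 describedby relation comes down to emitting a Link header. A rough sketch (the URL is a made-up placeholder, and the media type is just one plausible choice):

```python
# Sketch: build an HTTP Link header value that points from a resource
# to the document describing it, using the "describedby" relation
# registered by RFC 5988.

def describedby_header(doc_url):
    """Return a Link header value for the description document."""
    return f'<{doc_url}>; rel="describedby"; type="application/rdf+xml"'

header = describedby_header("http://example.org/doc/alice.rdf")
# sent by the server as:
#   Link: <http://example.org/doc/alice.rdf>; rel="describedby"; type="application/rdf+xml"
```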

At the end of the day, it would be useful if the W3C could de-emphasize httpRange-14, simplify the Architecture of the World Wide Web (by removing the notion of Information Resources), and pave the cowpaths we are already seeing for Real World Objects on the Web. It would be great to have a W3C document that guided people on how to put URIs for things on the web, that fit with how people are already doing it, and that made intuitive sense. We’re already used to things like our routers, cameras and thermometers being on the web, and my guess is we’re going to see much, much more of this in the coming years. I don’t think a move like this would invalidate documents like Cool URIs for the Semantic Web, or make the existing Linked Data that is out there somehow wrong. It would simply lower the bar for people who want to publish Linked Data, and who don’t necessarily want to go through the process of using URIs to distinguish non-Information Resources from Information Resources.

If the W3C doesn’t have the stomach for it, I imagine we will see the IETF lead the way, or innovation will happen elsewhere, as with HTML5.

Linked Library Data at the Deutsche Nationalbibliothek

Just last week Lars Svensson from the Deutsche Nationalbibliothek (German National Library, aka DNB) made a big announcement: they have released their authority data as Linked Data for the world to use. What this means is that there are now unique URLs (with machine readable data at the other end of them) for each of their authority records.

The full dataset that the DNB has made available for download amounts to 38,849,113 individual statements (aka triples). Linked Data enthusiasts who are used to thinking in terms of billions of triples might not even blink at these numbers. But it is important to remember that these data assets have been curated by a network of German, Austrian and Swiss libraries for close to a hundred years, as they documented (and continue to document) all known German-language publications.

The simple act of making each of these authority records URL addressable means that they can now meaningfully participate in the global information space some call the Web of Data. It’s true, the records were available as part of the DNB’s Online Catalog before they were released as Linked Data. What’s new is that the DNB has committed to using persistent URLs to identify these records, using a new host name in combination with their own record identifiers. This means that people can persistently link to these DNB resources in their own web applications and data. Another subtle thing, and really the heart of what the Linked Data pattern offers, is the ability to use the same URL to retrieve the record as structured metadata. The important thing about having machine readable data is that it allows other applications to easily re-purpose the information, much like libraries have done traditionally by shipping around batches of Machine Readable Cataloging (MARC) records. Here’s a practical example:

The DNB now has a URL that identifies the author Herta Müller, who won the Nobel Prize for Literature in 2009. If you load that URL in your web browser you should see a web page (HTML) for the authority record describing Herta Müller. But if a web client requests that same URL asking for RDF, it will (via a redirect) get the same authority record as RDF. RDF is more a data model than a particular file format, so it has a variety of serializations … The server returns RDF/XML, and the DNB has made its data dumps available as N-Triples … but I’m kind of fond of the Turtle serialization, which is kind of JSON-ish and makes the RDF a bit more readable. Here is the RDF (as Turtle) for Herta Müller that the DNB makes available:

@prefix gnd: <http://d-nb.info/gnd/> .
@prefix rdaGr2: <http://rdvocab.info/ElementsGr2/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
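The content negotiation described above is easy to exercise from the client side. A sketch with the standard library; note the GND identifier in the URL is a made-up placeholder, not Herta Müller’s actual record number:

```python
# Sketch: ask a DNB authority URL for RDF instead of HTML by setting
# the Accept header; the server is said to redirect to the RDF flavor.

import urllib.request

def rdf_request(url):
    """Build a request for the RDF/XML representation of a resource."""
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})

req = rdf_request("http://d-nb.info/gnd/123456789")  # placeholder identifier
# urllib.request.urlopen(req) would follow the redirect and return
# the RDF/XML serialization instead of the HTML page
```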


I fully admit that there is a not uncommon craze for trichotomies. I do not know but the psychiatrists have provided a name for it. If not, they should … it might be called triadomany. I am not so afflicted; but I find myself obliged, for truth’s sake, to make such a large number of trichotomies that I could not [but] wonder if my readers, especially those of them who are in the way of knowing how common the malady is, should suspect, or even opine, that I am a victim of it … I have no marked predilection for trichotomies in general.

Charles S. Peirce quoted in The Sign of Three, edited by Umberto Eco and Thomas A. Sebeok.

It’s hard not to read a bit of humor and irony into this quote from Peirce. My friend Dan Chudnov observed once that all this business with RDF and Linked Data often seems like fetishism. RDF colored glasses are kind of hard to take off when you are a web developer and have invested a bit of time in understanding the Semantic Web / Linked Data vision. I seem to go through phases of interest with the triples: ebbs and flows. Somehow it’s comforting to read of Peirce’s predilections for triples at the remove of a century.

Seeing the Linked Open Data Cloud for the first time was a revelation of sorts. It helped me understand concretely how the Web could be used to assemble a distributed, collaborative database. That same diagram is currently being updated to include new datasets. But a lot of Linked Data has been deployed since then … and a lot of it has been collected as part of the annual Billion Triple Challenge.

It has always been a bit mysterious to me how nodes get into the LOD Cloud, so I wondered how easy it would be to create a cloud from the 2010 Billion Triple Challenge dataset. It turns out that with a bit of unix pipelining and the nice Protovis library it’s not too hard to get something “working”. It sure is nice to work in an environment with helpful folks who can set aside a bit of storage and compute time for experiments like this, without my having to bog down my laptop for a long time.
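The pipelining can be sketched roughly like this, under a couple of simplifying assumptions I’m making for illustration: the input is N-Triples/N-Quads, and a URI’s hostname is a good-enough proxy for the dataset it belongs to. Edges in the cloud are then just counts of cross-host statements.

```python
# Sketch: derive LOD-cloud-style nodes and edges from RDF statements by
# counting links whose subject and object live on different hosts.

from collections import Counter
from urllib.parse import urlparse

def host(term):
    """Hostname of a <...>-wrapped URI term."""
    return urlparse(term.strip("<>")).netloc

def dataset_edges(lines):
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or not parts[2].startswith("<"):
            continue  # skip literals and malformed lines
        s, o = host(parts[0]), host(parts[2])
        if s and o and s != o:
            counts[(s, o)] += 1
    return counts

sample = [
    '<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://sws.geonames.org/2950159/> .',
    '<http://dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/name> "Berlin" .',
]
counts = dataset_edges(sample)
```

The resulting (host, host) counts are what a visualization library can turn into weighted edges between dataset nodes.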

If you click on the image you should be taken to the visualization. It’s kind of heavy on JavaScript processing, so a browser like Chrome will probably render it best.

But as Paul Downey pointed out to me on Twitter:

Paul is so right. I find myself drawn to these graph visualizations for magical reasons. I can console myself that I did manage to find a new Linked Data supernode that I didn’t know about before: bibsonomy, which doesn’t appear to be in the latest curated view of the Linked Open Data Cloud. And I did have a bit of fun making the underlying data available as RDF/XML and Turtle using the Vocabulary of Interlinked Datasets (VoID). And I generated a similar visualization for the 2009 data. But it does feel a bit navel-gazy, so a sense of humor about the enterprise is often a good tonic. I guess this is the whole point of the Challenge: to get something generally useful (and not navel-gazy) out of the sea of triples.

Oh and Sign of Three is an excellent read so far :-)

edu, gov and tlds in en.wikipedia external links

Some folks over at the Wikipedia Signpost asked if they could use some of the bar charts I’ve been posting here recently. They needed the graphs to be released with a free license, which was a good excuse to slap a Creative Commons Attribution 3.0 license on all the content here at inkdroid. I’m kinda ashamed I didn’t think of doing this before…

I was also asked how easy it would be to generate .gov and .edu graphs, as well as one for top level domains. I already had the hostnames exported, so it was just a few greps, sorts and uniqs away. I’ve included the graphs and the full data files below. My friend Dan Chudnov suggested that plotting this data on a logarithmic scale would probably look better. I think he’s probably right, but I just haven’t gotten around to doing that yet. It’s definitely something I will keep in mind for an app that allows this kind of slicing and dicing of the Wikipedia external links data.
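The greps, sorts and uniqs amount to frequency counts, which look something like this in Python. The sample hostnames are made up; the real input would be one hostname per line, already extracted from the external links dump.

```python
# Sketch: tally .gov/.edu hosts and top level domains from a list of
# hostnames, mirroring a grep | sort | uniq -c | sort -rn pipeline.

from collections import Counter

def host_counts(hostnames, suffix=None):
    """Count hostnames, optionally keeping only those under a suffix
    such as ".gov" or ".edu"; returns (host, count) pairs, biggest first."""
    counts = Counter()
    for h in hostnames:
        h = h.strip().lower()
        if suffix and not h.endswith(suffix):
            continue
        counts[h] += 1
    return counts.most_common()

def tld_counts(hostnames):
    """Count top level domains by taking everything after the last dot."""
    return Counter(h.strip().lower().rsplit(".", 1)[-1] for h in hostnames).most_common()

hosts = ["www.loc.gov", "www.nih.gov", "www.loc.gov", "web.mit.edu"]
```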

top 100 .edu hosts in en.wikipedia external links


top 100 .gov hosts in en.wikipedia external links


top 100 top level domains in en.wikipedia external links


lots of copies keeps epubs safe

Over the weekend you probably saw the announcements going around about Google Books releasing over 1 million public domain ebooks on the web as epubs. This is great news: epub is a web-friendly, open format, and having all this content available as epub is important.

Now I might be greedy, but when I saw that 1 million epubs are available my mind immediately jumped to thinking of getting them, indexing them and whatnot. Then I guiltily justified my greedy thoughts by pondering the conventional digital preservation wisdom that Lots of Copies Keeps Stuff Safe (LOCKSS). The books are in the public domain, so … why not?

Google Books has a really nice API, which lets you get back search results as Atom, with lots of links to things like thumbnails, annotations, item views, etc. You also get a nice amount of Dublin Core metadata. And you can limit your search to books published before 1923. For example, here’s a search for pre-1923 books that mention “Stevenson” (disclaimer: I don’t think the 1923 limit is actually working):

curl '' | xmllint --format -

which yields:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/terms">
  <title>Search results for Stevenson</title>
  <author>
    <name>Google Books Search</name>
  </author>
  <generator>Google Book Search data API</generator>
  <entry>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:format>308 pages</dc:format>
    <dc:title>being memoirs of the adventures of David Balfour in the year 1851 ...</dc:title>
  </entry>
  <entry>
    <title>Treasure Island</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:creator>George Edmund Varian</dc:creator>
    <dc:description>CHAPTER I THE OLD SEA DOG AT THE &quot;ADMIRAL BENBOW&quot; SQUIRE Trelawney, Dr. Livesey,
and the rest of these gentlemen having asked me to write down the whole ...</dc:description>
    <dc:format>306 pages</dc:format>
    <dc:title>Treasure Island</dc:title>
  </entry>
  <entry>
    <title>Adlai Ewing Stevenson</title>
    <dc:creator>Grace Darling</dc:creator>
    <dc:creator>David Darling</dc:creator>
    <dc:format>127 pages</dc:format>
    <dc:subject>Biography &amp; Autobiography</dc:subject>
  </entry>
  <entry>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:description>This scarce antiquarian book is included in our special Legacy Reprint Series.</dc:description>
    <dc:format>128 pages</dc:format>
    <dc:publisher>Kessinger Pub Co</dc:publisher>
    <dc:title>Day by Day</dc:title>
  </entry>
  <entry>
    <title>A child's garden of verses</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:description>IN winter I get up at night And dress by yellow candle-light. In summer, quite
the other way, I have to go to bed by day. I have to go to bed and see The ...</dc:description>
    <dc:format>136 pages</dc:format>
    <dc:subject>Children's poetry, Scottish</dc:subject>
    <dc:title>A child's garden of verses</dc:title>
    <dc:title>by Robert Louis Stevenson; illustrated by Charles Robinson</dc:title>
  </entry>
  <entry>
    <title>Travels with a donkey in the Cevennes</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:description>THE DONKEY, THE PACK, AND THE PACK - SADDLE IN a little place called Le
Monastier, in a pleasant highland valley fifteen miles from Le Puy, I spent
about a ...</dc:description>
    <dc:format>287 pages</dc:format>
    <dc:subject>Cévennes Mountains (France)</dc:subject>
    <dc:title>Travels with a donkey in the Cevennes</dc:title>
    <dc:title>An inland voyage</dc:title>
  </entry>
  <entry>
    <title>St. Ives</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:description>IVES CHAPTER IA TALE OF A LION RAMPANT IT was in the month of May,, that I was
so unlucky as to fall at last into the hands of the enemy. ...</dc:description>
    <dc:format>528 pages</dc:format>
    <dc:title>St. Ives</dc:title>
    <dc:title>being the adventures of a French prisoner in England</dc:title>
  </entry>
  <entry>
    <title>Cruising with Robert Louis Stevenson</title>
    <dc:creator>Oliver S. Buckton</dc:creator>
    <dc:description>Cruising with Robert Louis Stevenson: Travel, Narrative, and the Colonial Body is the first book-length study about the influence of travel on Robert Louis ...</dc:description>
    <dc:format>344 pages</dc:format>
    <dc:publisher>Ohio Univ Pr</dc:publisher>
    <dc:subject>Literary Criticism</dc:subject>
    <dc:title>Cruising with Robert Louis Stevenson</dc:title>
    <dc:title>travel, narrative, and the colonial body</dc:title>
  </entry>
  <entry>
    <title>New Arabian nights</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:description>residence in London, the accomplished Prince Florizel of Bohemia gained the ...</dc:description>
    <dc:format>386 pages</dc:format>
    <dc:title>New Arabian nights</dc:title>
  </entry>
  <entry>
    <title>Robert Louis Stevenson</title>
    <dc:creator>Richard Ambrosini</dc:creator>
    <dc:creator>Richard Dury</dc:creator>
    <dc:description>As the editors point out in their Introduction, Stevenson reinvented the “personal essay” and the “walking tour essay,” in texts of ironic stylistic ...</dc:description>
    <dc:format>377 pages</dc:format>
    <dc:publisher>Univ of Wisconsin Pr</dc:publisher>
    <dc:subject>Literary Criticism</dc:subject>
    <dc:title>Robert Louis Stevenson</dc:title>
    <dc:title>writer of boundaries</dc:title>
  </entry>
</feed>

Now it would be nice if the Atom included <link> elements for the epubs themselves. Perhaps the feed could even use the recently released “acquisition” link relation defined by OPDS v1.0. For example, by including something like the following in each atom:entry element:
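A hedged sketch of what such an element might look like: the rel value is the acquisition relation defined by OPDS 1.0, but the href is a made-up placeholder, since the feed does not actually expose the epub download URLs.

```xml
<link rel="http://opds-spec.org/acquisition"
      type="application/epub+zip"
      href="http://books.google.com/books/download/EXAMPLE.epub"/>
```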

Theoretically it should be possible to construct the appropriate link for the epub based on what data is available in the Atom. But making the epub URLs explicitly available in a programmatic way would enable quite a bit more use of the epubs. Unfortunately we would still be limited to dipping into the full dataset with queries, instead of being able to crawl the entire archive with something like a paged Atom feed. From a conversation over on get-theinfo it appears that this approach might not be as easy as it sounds. Also, it turns out that, magically, many of the books have been uploaded to the Internet Archive. 902,188 of them in fact.

So maybe not that much work needs to be done. But presumably more public domain content will become available from Google Books, and it would be nice to be able to say there was at least one other copy of it elsewhere, for digital preservation purposes. It would be great to see Google step up and do some good by making their API usable for folks wanting to replicate the public domain content. Still, at least they haven’t done evil by locking it away completely. Dan Brickley had an interesting suggestion to possibly collaborate on this work.

simplicity and digital preservation, sorta

Over on the Digital Curation discussion list, Erik Hetzner of the California Digital Library raised the topic of simplicity as it relates to digital preservation, and specifically to CDL’s notion of Curation Microservices. He referenced a recent bit of writing by Martin Odersky (the creator of Scala) with the title Simple or Complicated. In one of the responses Brian Tingle (also of CDL) suggested that simplicity for an end user and simplicity for the programmer are often inversely related. My friend Kevin Clarke prodded me in #code4lib into making my response to the discussion list into a blog post, so here it is (slightly edited).

For me, the Odersky piece is a really nice essay on why simplicity is often in the eye of the beholder. Often the key to simplicity is working with people who see things in roughly the same way. People who have similar needs, that are met by using particular approaches and tools. Basically a shared and healthy culture to make emergent complexity palatable.

Brian made the point about simplicity for programmers having an inversely proportional relationship to simplicity for end users, or in his own words:

I think that the simpler we make it for the programmers, usually the more complicated it becomes for the end users, and visa versa.

I think the only thing to keep in mind is that the distinction between programmers and end users isn’t always clear.

As a software developer I’m constantly using, or inheriting someone else’s code: be it a third party library that I have a dependency on, or a piece of software that somebody wrote once upon a time, who has moved on elsewhere. In both these cases I’m effectively an end-user of a program that somebody else designed and implemented. The interfaces and abstractions that this software developer has chosen are the things I (as an end user) need to be able to understand and work with. Ultimately, I think that it’s easier to keep software usable for end users (of whatever flavor) by keeping the software design itself simple.

Simplicity makes the software easier to refactor over time when the inevitable happens, and someone wants some new or altered behavior. Simplicity also should make it clear when a suggested change to a piece of software doesn’t fit the design of the software in question, and is best done elsewhere. One of the best rules of thumb I’ve encountered over the years to help get to this place is the Unix Philosophy:

Write programs that do one thing and do it well. Write programs to work together.

As has been noted elsewhere, composability is one of the guiding principles of the Microservices approach, and it’s why I’m a big fan (in principle). Another aspect of the Unix philosophy that Microservices seems to embody is:

Data dominates.

The software can (and will) come and go, but we are left with the data. That’s the reality of digital preservation. It could be argued that the programs themselves are data, which gets us into sci-fi virtualization scenarios. Maybe someday, but I personally don’t think we’re there yet.

Another approach I’ve found that works well to help ensure code simplicity has been unit testing. Admittedly it’s a bit of a religion, but at the end of the day, writing tests for your code encourages you to use the APIs, interfaces and abstractions that you are creating. So you notice sooner when things don’t make sense. And of course, they let you refactor with a safety net, when the inevitable changes rear their head.
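As a tiny illustration of the point (the function here is hypothetical), the test is also the first consumer of the API you just designed, so an awkward interface tends to surface immediately:

```python
# Toy example: a trivial fixity-style check, and the unit tests that
# exercise its interface. checksum() is invented for illustration.

import unittest

def checksum(data: bytes) -> int:
    """Toy fixity check: sum of byte values, mod 65536."""
    return sum(data) % 65536

class ChecksumTest(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(checksum(b"abc"), 294)  # 97 + 98 + 99

    def test_empty(self):
        self.assertEqual(checksum(b""), 0)

# unittest.main() would run these when the file is executed as a script
```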

And, another slightly more humorous way to help ensure simplicity:

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.

Which leads me to a jedi mind trick my former colleague Keyser Söze Andy Boyko tried to teach me (I think): it’s useful to know when you don’t have to write any code at all. Sometimes existing code can be used in a new context. And sometimes the perceived problem can be recast, or examined from a new perspective that makes it go away.

I’m not sure what all this has to do with digital preservation. The great thing about what CDL is doing with microservices is that they are trying to focus on the what, and not the how, of digital preservation. Whatever ends up happening with the implementation of Merritt itself, I think they are discovering what the useful patterns of digital preservation are, trying them out, and documenting them … and it’s incredibly important work that I don’t really see happening much elsewhere.

bad xml smells

I’m used to refactoring code smells, but sometimes you can catch a bad whiff in XML too. Here’s a (simplified) before and after:


<?xml version="1.0" encoding="UTF-8"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:mods="http://www.loc.gov/mods/v3">
  <metsHdr>
    <agent ROLE="CREATOR" TYPE="ORGANIZATION">
      <name>Library of Congress</name>
    </agent>
  </metsHdr>
  <dmdSec ID="essay">
    <mdWrap MDTYPE="MODS">
      <xmlData>
        <mods:mods>
          <mods:name>
            <mods:namePart>Library of Congress</mods:namePart>
          </mods:name>
          <mods:titleInfo>
            <mods:title>The National Forum (Washington, DC), 1910-19??</mods:title>
          </mods:titleInfo>
          <mods:abstract>The first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women's news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city's black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.</mods:abstract>
        </mods:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>
</mets>


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dcterms="http://purl.org/dc/terms/">
  <head>
    <title property="dcterms:title">The National Forum (Washington, DC), 1910-19??</title>
    <meta property="dcterms:creator" content="Library of Congress" />
    <meta property="dcterms:created" content="2007-01-10T09:00:00" />
  </head>
  <body>
    <div property="dcterms:description">
      <p>The first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women's news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city's black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.</p>
    </div>
  </body>
</html>

I basically took a complicated METS wrapper around some XHTML, which was really just expressing metadata about the HTML, and refactored it as XHTML. Not that METS is a bad XML smell generally, but in this particular case it was overkill. If you look closely you’ll see I’m using RDFa, similar to what Facebook is doing with its Open Graph Protocol. There’s less to get wrong, what’s there should look more familiar to web developers who aren’t versed in arcane library standards, and I can now read the metadata from the XHTML with an RDFa-aware parser, like Python’s rdflib:

>>> import rdflib
>>> g = rdflib.Graph()
>>> g.parse('essays/1.html', format='rdfa')
>>> for triple in g: print triple
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/creator'), rdflib.term.Literal(u'Library of Congress'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/title'), rdflib.term.Literal(u'The National Forum (Washington, DC), 1910-19??'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/description'), rdflib.term.Literal(u'The first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women\'s news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city\'s black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper\'s offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral')))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/created'), rdflib.term.Literal(u'2007-01-10T09:00:00'))

top hosts referenced in wikipedia (part 2)

Jodi Schneider pointed out to me in an email that my previous post about the top 100 hosts referenced in wikipedia may have been slightly off balance since it counted all pages on wikipedia (talk pages, files, etc), and was not limited to only links in articles. The indicator for her was the high ranking of, which seemed odd to her in the article space.

So I downloaded the enwiki-latest-page.sql.gz, loaded it in, and then joined on it in my query to come up with a new list. Jodi was right, it’s a lot more interesting:

This removed a lot of the interwiki links between the English wikipedia and other language wikipedias (which would be interesting to look at in their own right). It also removed administrative links to things like Also interesting is that it removed from the list, which probably were linked to from user profile pages? The neat thing is it introduced new sites into the top 100 like the following:

We can see a lot more pop culture media present: newspapers, magazines, sporting information. Also we can see research oriented websites like,, make it into the top 100.
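The article-namespace filtering behind these lists amounts to joining externallinks against the page table and keeping only namespace 0. A rough Python sketch of the same idea (the sample rows are made up; only the column semantics echo MediaWiki’s page/externallinks schema):

```python
from collections import Counter
from urllib.parse import urlparse

# page table: page_id -> page_namespace (0 = article, 2 = user, ...)
pages = {1: 0, 2: 0, 3: 2}

# externallinks table: (el_from page id, el_to url)
links = [
    (1, 'http://www.nytimes.com/2010/11/some-story.html'),
    (2, 'http://www.ncbi.nlm.nih.gov/pubmed/12345'),
    (3, 'http://toolserver.org/~somebody'),  # user-page link, filtered out
]

# roughly: SELECT el_to FROM externallinks el
#          JOIN page p ON el.el_from = p.page_id
#          WHERE p.page_namespace = 0
hosts = Counter(
    urlparse(url).netloc
    for page_id, url in links
    if pages.get(page_id) == 0
)
print(hosts.most_common())
```

The join is what drops the talk-, file- and user-page links that skewed the first report.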

I work for the US federal government so I was interested to look at what .gov domains were in the top 100:

hostname links 419816 62134 57423 48530 33018 25962 25941 20178 20063 18880 15115 12196

Which points to the importance of federal biomedical, geospatial, scientific, demographic and biographical information to wikipedians. It would be interesting to take a look at higher education institutions at some point. Doing these one off reports is giving me some ideas about what linkypedia could turn into. Thanks Jodi.