Fielding notes

a tongue-in-cheek change request from Paul Downey

I’ve been doing a bit of research into the design of the Web for a paper I’m trying to write. In my travels I ran across Jon Udell’s 2006 interview with Roy Fielding. The interview is particularly interesting because of Roy’s telling of how (as a graduate student) he found himself working on libwww-perl which helped him discover the architecture of the Web that was largely documented by Tim Berners-Lee’s libwww HTTP library for Objective-C.

For the purposes of note taking, and giving some web spiders some text to index, here are a few moments that stood out:

Udell: A little later on [in Roy’s dissertation] you talk about how systems based on what you call control messages are in a very different category from systems where the decisions that get made are being made by human beings, and that that’s, in a sense, the ultimate rationale for designing data driven systems that are web-like, because people need to interact with them in lots of ways that you can’t declaratively define.

Fielding: Yeah, it’s a little bit easier to say that people need to reuse them, in various unanticipated ways. A lot of people think that when they are building an application that they are building something that’s going to last forever, and almost always that’s false. Usually when they are building an application the only thing that lasts forever is the data, at least if you’re lucky. If you’re lucky the data retains some semblance of archivability, or reusability over time.

Udell: There is a meme out there to the effect that what we now call REST architectural style was in a sense discovered post facto, as opposed to having been anticipated from the beginning. Do you agree with that or not?

Fielding: No, it’s a little bit of everything, in the sense that there are core principles involved that Berners-Lee was aware of when he was working on it. I first talked to Tim about what I was calling the HTTP Object Model at the time, which is a terrible name for it, but we talked when I was at the W3C in the summer of 95, about the software engineering principles. Being a graduate student of software engineering, that was my focus, and my interest originally. Of course all the stuff I was doing for the Web that was just for fun. At the time that was not considered research.

Udell: But did you at the time think of what you then called the HTTP object model as being in contrast to more API like and procedural approaches?

Fielding: Oh definitely. The reason for that was that the first thing I did for the Web was statistical analysis software, which turned out to be very effective at helping people understand the value of communicating over the Web. The second thing was a program called MOMSpider. It was one of the first Web spiders, a mechanism for testing all the links that were on the Web.

Udell: And that was when you also worked on libwww-perl?

Fielding: Right, and … at the time it was only the second protocol library available for the Web. It was a combination of pieces from various sources, as well as a lot of my own work, in terms of filling out the details, and providing an overall view of what a Web client should do with an HTTP library. And as a result of that design process I realized some of the things Tim Berners-Lee had designed into the system. And I also found a whole bunch of cases where the design didn’t make any sense, or the way it had been particularly implemented over at NCSA, or one of the other clients, or various history of the Web had turned out to be not-fitting with the rest of the design. So that led to a lot of discussions with the other early protocol developers particularly people like Rob McCool, Tony Sanders and Ari Luotonen–people who were building their own systems and understood both what they were doing with the Web, and also what complaints they were getting from their users. And from that I distilled a model of basically what was the core of HTTP. Because if you look back in the 93/94 time frame, the HTTP specification did not look all that similar to what it does now. It had a whole range of methods that were never used, and a lot of talk about various aspects of object orientation which never really applied to HTTP. And all of that came out of Tim’s original implementation of libwww, which was an Objective-C implementation that was trying to be as portable as possible. It had a lot of the good principles of interface separation and genericity inside the library, and really the same principles that I ended up using in the Perl library, although they were completely independently developed. It was just one of those things where that kind of interaction has a way of leading to a more extensible design.

Udell: So was focusing down on a smaller set of verbs partly driven by the experience of having people starting to use the Web, and starting to experience what URLs could be in a human context as well as in a programmatic context?

Fielding: Well, that was really a combination of things. One that’s a fairly common paradigm: if you are trying to inter-operate with people you’ve never met, try to keep it as simple as possible. There’s also just inherent in the notion of using URIs to identify everything, which is of course really the basis of what the Web is, provides you with that frame of mind where you have a common resource, and you want to have a common resource interface.

spotify vs rdio 2012

Back in August of 2011 I wrote a little utility that pulled down Alf Eaton’s Album of the Year data. AOTY is nice for two reasons: a) I like Alf’s taste in music, so the lists are relevant to me; and b) AOTY is a nice example of layering structured metadata into HTML, for easy processing (aka scraping). With the data in hand it was easy to check whether the albums were available on the streaming services Spotify and Rdio using their respective APIs. I was trying to decide which one to use at the time, and wanted to know if there was any significant difference in their catalogs.

Back then, it looked like 32% of the albums were available on Spotify, and 46% on Rdio. Alf has updated his list for 2012 so I decided to rerun aotycmp, and it appears that coverage of both has improved, with Spotify (41%) narrowing the gap with Rdio (49%), which still has a comfortable lead. If you want the availability data I’ve updated it on GitHub.

I’ve been very happy with Rdio, although pieces like Damon Krukowski’s (thanks (???)) make me wish there was a better way to a) stream music while b) actually putting money in the artists’ pockets. I’d love to have the ability to pay a little bit more if I knew it was going to help support the artist in creating more of their art.

Darth Nader

This may be a bad/shortlived idea, but as part of a New Year’s resolution to write more varied material I’m going to try to use my blog (partly) as a dream journal. This will probably drive the few readers I have away, but I’m hoping it might provide some amusement. I barely remember my dreams these days, and would like to remember more of them, so here goes. Feel free to file under TMI.

Walking into a cafe/restaurant in the morning, in what feels like New York, but I’m not sure…it could be any city. It’s a cosy, narrow setup, with all the seats taken by people quietly chatting. I manage to get a cup of coffee to go, and stand waiting for a table to open up. I discover a staircase and vaguely remember that there is seating upstairs. I go up the stairs carefully balancing my wide bowl-like cup of coffee.

The upstairs area is quite large and sprawling, dimly lit, with comfortable chairs, wider tables, and in the middle a life-sized sculpture of a woman in motion, looking behind her while walking, who is apparently the owner of the establishment. A hostess shows me to a table nearby, and says she can’t remember the name of the server, but that someone will be with me shortly. I sit down with my coffee.

After just a few minutes I notice that it feels like evening. There are lots of conversations going on nearby, which I’m able to hear fairly easily. One man in his early 30s is standing at his table, and in a kind of spotlight. He is talking quietly, as if on stage, not obviously on a cell phone, about a meeting that he has just had, and how they will need to travel to Austin, Texas to help protect some geographic area. I can’t remember the exact details of what he was saying but it is clear he is working for an organization that is trying to save some ecosystem features in Austin.

There is a bookshelf nearby with a disembodied head on it, which looks like Ralph Nader, and also a bit like Darth Vader when Luke takes his helmet off at the end of Return of the Jedi. The head is animated, and seems to be simulating the other half of the conversation. He is saying that this is important work, and is similar to a recent project in Seattle. The conversation ends, and the man walks out of the coffee shop.

I notice three other people with big, thick, Ginsbergian beards also leave their tables at the same time, deep in conversation about something different. There is a counter-culture, occupy-like feeling in the air, of people steadily working to make their corner of the world a better place. It’s a good feeling.


Half awake I found myself thinking about the talking head, and how it reminded me of LibraryBox. It was as if the head made it possible to easily tune into public conversations that were going on in the local context of the coffee shop…and it served as an archive or store of these conversations for others to discover later. I don’t know if LibraryBox actually lets any of that happen, but it’s something I’ve been meaning to learn more about in the new year.

By the way, dream interpretations as comments are most welcome…

archiving tweets

If you are an active Twitter user you may have heard that you can now download your complete archive of tweets. The functionality is still being rolled out across the millions of accounts, so don’t be surprised if you don’t see the function yet in your settings.

The WSJ piece kind of joked about the importance of this move on Twitter’s part, which is a bit unfortunate, since it’s a pretty important issue. Yes, you can use third-party apps for downloading your Twitter data, but it says a lot when a company takes “archiving” seriously enough to offer it as a service to its users.

If you work in the digital preservation space it’s kind of fun to take a look at the way that Twitter makes these personal archives available. Luckily (if, like me, you don’t have the archive download button yet) Dave Winer has started collecting some archives, and making them publicly available for browsing and download off of S3. For example we can look at Sarah Bourne’s (who tipped me off to Dave’s work–thanks Sarah!). After you’ve downloaded the ZIP file you get a directory that looks like:

|-- css
|   `-- application.min.css
|-- data
|   |-- csv
|   |   |-- 2008_08.csv
|   |   |-- 2008_09.csv
|   |   |-- 2008_10.csv
|   |   |-- 2008_11.csv
|   |   |-- 2008_12.csv
|   |   |-- 2009_01.csv
|   |   |-- 2009_02.csv
|   |   |-- 2009_03.csv
|   |   |-- 2009_04.csv
|   |   |-- 2009_05.csv
|   |   |-- 2009_06.csv
|   |   |-- 2009_07.csv
|   |   |-- 2009_08.csv
|   |   |-- 2009_09.csv
|   |   |-- 2009_10.csv
|   |   |-- 2009_11.csv
|   |   |-- 2009_12.csv
|   |   |-- 2010_01.csv
|   |   |-- 2010_02.csv
|   |   |-- 2010_03.csv
|   |   |-- 2010_04.csv
|   |   |-- 2010_05.csv
|   |   |-- 2010_06.csv
|   |   |-- 2010_07.csv
|   |   |-- 2010_08.csv
|   |   |-- 2010_09.csv
|   |   |-- 2010_10.csv
|   |   |-- 2010_11.csv
|   |   |-- 2010_12.csv
|   |   |-- 2011_01.csv
|   |   |-- 2011_02.csv
|   |   |-- 2011_03.csv
|   |   |-- 2011_04.csv
|   |   |-- 2011_05.csv
|   |   |-- 2011_06.csv
|   |   |-- 2011_07.csv
|   |   |-- 2011_08.csv
|   |   |-- 2011_09.csv
|   |   |-- 2011_10.csv
|   |   |-- 2011_11.csv
|   |   |-- 2011_12.csv
|   |   |-- 2012_01.csv
|   |   |-- 2012_02.csv
|   |   |-- 2012_03.csv
|   |   |-- 2012_04.csv
|   |   |-- 2012_05.csv
|   |   |-- 2012_06.csv
|   |   |-- 2012_07.csv
|   |   |-- 2012_08.csv
|   |   |-- 2012_09.csv
|   |   |-- 2012_10.csv
|   |   |-- 2012_11.csv
|   |   `-- 2012_12.csv
|   `-- js
|       |-- payload_details.js
|       |-- tweet_index.js
|       |-- tweets
|       |   |-- 2008_08.js
|       |   |-- 2008_09.js
|       |   |-- 2008_10.js
|       |   |-- 2008_11.js
|       |   |-- 2008_12.js
|       |   |-- 2009_01.js
|       |   |-- 2009_02.js
|       |   |-- 2009_03.js
|       |   |-- 2009_04.js
|       |   |-- 2009_05.js
|       |   |-- 2009_06.js
|       |   |-- 2009_07.js
|       |   |-- 2009_08.js
|       |   |-- 2009_09.js
|       |   |-- 2009_10.js
|       |   |-- 2009_11.js
|       |   |-- 2009_12.js
|       |   |-- 2010_01.js
|       |   |-- 2010_02.js
|       |   |-- 2010_03.js
|       |   |-- 2010_04.js
|       |   |-- 2010_05.js
|       |   |-- 2010_06.js
|       |   |-- 2010_07.js
|       |   |-- 2010_08.js
|       |   |-- 2010_09.js
|       |   |-- 2010_10.js
|       |   |-- 2010_11.js
|       |   |-- 2010_12.js
|       |   |-- 2011_01.js
|       |   |-- 2011_02.js
|       |   |-- 2011_03.js
|       |   |-- 2011_04.js
|       |   |-- 2011_05.js
|       |   |-- 2011_06.js
|       |   |-- 2011_07.js
|       |   |-- 2011_08.js
|       |   |-- 2011_09.js
|       |   |-- 2011_10.js
|       |   |-- 2011_11.js
|       |   |-- 2011_12.js
|       |   |-- 2012_01.js
|       |   |-- 2012_02.js
|       |   |-- 2012_03.js
|       |   |-- 2012_04.js
|       |   |-- 2012_05.js
|       |   |-- 2012_06.js
|       |   |-- 2012_07.js
|       |   |-- 2012_08.js
|       |   |-- 2012_09.js
|       |   |-- 2012_10.js
|       |   |-- 2012_11.js
|       |   `-- 2012_12.js
|       `-- user_details.js
|-- img
|   |-- bg.png
|   `-- sprite.png
|-- index.html
|-- js
|   `-- application.min.js
|-- lib
|   |-- bootstrap
|   |   |-- bootstrap-dropdown.js
|   |   |-- bootstrap.min.css
|   |   |-- bootstrap-modal.js
|   |   |-- bootstrap-tooltip.js
|   |   |-- bootstrap-transition.js
|   |   |-- glyphicons-halflings.png
|   |   `-- glyphicons-halflings-white.png
|   |-- hogan
|   |   `-- hogan-2.0.0.min.js
|   |-- jquery
|   |   `-- jquery-1.8.3.min.js
|   |-- twt
|   |   |-- sprite.png
|   |   |-- sprite.rtl.png
|   |   |-- twt.all.min.js
|   |   `-- twt.min.css
|   `-- underscore
|       `-- underscore-min.js
`-- README.txt

So why is this interesting?

The Data

The archive includes data both as CSV and as JavaScript. The CSV is perfect for throwing into a spreadsheet, and doing stuff with it there. The JavaScript is actually a very light shim over some JSON data that is quite a bit richer than the CSV. The JavaScript shim is needed so that it can be used by the app that comes in the archive (more on that later). For example here’s a randomly picked tweet from Sarah:

Here is how the Tweet shows up in the CSV:

"281405942321532929","281400879465238529","61233","","","2012-12-19 14:29:39 +0000","Janetter","@monkchips Ouch. Some regrets are harsher than others."
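If you would rather pick the CSV apart in code than in a spreadsheet, here is a minimal sketch of splitting a row like the one above into fields. It assumes every field is wrapped in double quotes and that a literal quote inside a field is doubled, which is common CSV practice but not something I have confirmed for the archive. The function name parseCsvLine is my own. Comparing with the JSON, the fields appear to be the tweet id, the id of the tweet being replied to, the id of the user being replied to, two fields that are empty here, the timestamp, the source and the text.

```javascript
// Sketch: split one line of the archive CSV into fields.
// Assumes quoted fields with "" as the escape for a literal quote;
// fields that span several lines are not handled.
function parseCsvLine(line) {
  var fields = [], field = "", inQuotes = false;
  for (var i = 0; i < line.length; i++) {
    var c = line.charAt(i);
    if (inQuotes) {
      if (c === '"' && line.charAt(i + 1) === '"') { field += '"'; i++; }
      else if (c === '"') inQuotes = false;
      else field += c;
    } else if (c === '"') {
      inQuotes = true;
    } else if (c === ",") {
      // an unquoted comma ends the current field
      fields.push(field);
      field = "";
    } else {
      field += c;
    }
  }
  fields.push(field);
  return fields;
}

var row = parseCsvLine('"281405942321532929","281400879465238529","61233","","","2012-12-19 14:29:39 +0000","Janetter","@monkchips Ouch. Some regrets are harsher than others."');
console.log(row.length, row[0], row[7]);
```

Keeping the ids as strings matters here, since they are too large to be represented exactly as JavaScript numbers.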

And here’s the archived JSON for the Tweet:

{
  "source" : "Janetter",
  "entities" : {
    "user_mentions" : [ {
      "name" : "James Governor",
      "screen_name" : "monkchips",
      "indices" : [ 0, 10 ],
      "id_str" : "61233",
      "id" : 61233
    } ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "in_reply_to_status_id_str" : "281400879465238529",
  "geo" : {
  },
  "id_str" : "281405942321532929",
  "in_reply_to_user_id" : 61233,
  "text" : "@monkchips Ouch. Some regrets are harsher than others.",
  "id" : 281405942321532929,
  "in_reply_to_status_id" : 281400879465238529,
  "created_at" : "Wed Dec 19 14:29:39 +0000 2012",
  "in_reply_to_screen_name" : "monkchips",
  "in_reply_to_user_id_str" : "61233",
  "user" : {
    "name" : "Sarah Bourne",
    "screen_name" : "sarahebourne",
    "protected" : false,
    "id_str" : "16010789",
    "profile_image_url_https" : "",
    "id" : 16010789,
    "verified" : false
  }
}

So there’s quite a bit more structured data in the archived JSON, including geo coordinates, hashtags, URLs mentioned, etc. Also, the avatar images are still referenced out on the Web, where they can change, disappear, etc. It’s also interesting to compare the archived JSON against what you get back from the Twitter API for the same Tweet:

{
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "default_profile_image": false,
    "id": 16010789,
    "verified": false,
    "profile_text_color": "080C0C",
    "profile_image_url_https": "",
    "profile_sidebar_fill_color": "FCFAEF",
    "entities": {
      "url": {
        "urls": [
          {
            "url": "",
            "indices": [
              0,
              38
            ],
            "expanded_url": null
          }
        ]
      },
      "description": {
        "urls": []
      }
    },
    "followers_count": 2367,
    "profile_sidebar_border_color": "FFFFFF",
    "id_str": "16010789",
    "profile_background_color": "DAE0D9",
    "listed_count": 331,
    "profile_background_image_url_https": "",
    "utc_offset": -18000,
    "statuses_count": 20090,
    "description": "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
    "friends_count": 784,
    "location": "Boston, MA, USA",
    "profile_link_color": "800326",
    "profile_image_url": "",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "",
    "profile_background_image_url": "",
    "screen_name": "sarahebourne",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 3147,
    "name": "Sarah Bourne",
    "notifications": null,
    "url": "",
    "created_at": "Wed Aug 27 12:24:25 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "favorited": false,
  "entities": {
    "user_mentions": [
      {
        "id": 61233,
        "indices": [
          0,
          10
        ],
        "id_str": "61233",
        "screen_name": "monkchips",
        "name": "James Governor"
      }
    ],
    "hashtags": [],
    "urls": []
  },
  "contributors": null,
  "truncated": false,
  "text": "@monkchips Ouch. Some regrets are harsher than others.",
  "created_at": "Wed Dec 19 14:29:39 +0000 2012",
  "retweeted": false,
  "in_reply_to_status_id_str": "281400879465238529",
  "coordinates": null,
  "in_reply_to_user_id_str": "61233",
  "source": "Janetter",
  "in_reply_to_status_id": 281400879465238529,
  "in_reply_to_screen_name": "monkchips",
  "id_str": "281405942321532929",
  "place": null,
  "retweet_count": 0,
  "geo": null,
  "id": 281405942321532929,
  "in_reply_to_user_id": 61233
}

Using json-diff it’s not too difficult to see what the differences are between the archived version and the API version:

+  favorited: false
+  contributors: null
+  truncated: false
+  retweeted: false
+  coordinates: null
+  place: null
+  retweet_count: 0
   entities: {
-    media: [
-    ]
   }
-  geo: {
-  }
+  geo: null
   user: {
+    follow_request_sent: false
+    profile_use_background_image: true
+    default_profile_image: false
+    profile_text_color: "080C0C"
+    profile_sidebar_fill_color: "FCFAEF"
+    entities: {
+      url: {
+        urls: [
+          {
+            url: ""
+            indices: [
+              0
+              38
+            ]
+            expanded_url: null
+          }
+        ]
+      }
+      description: {
+        urls: [
+        ]
+      }
+    }
+    followers_count: 2367
+    profile_sidebar_border_color: "FFFFFF"
+    profile_background_color: "DAE0D9"
+    listed_count: 331
+    profile_background_image_url_https: ""
+    utc_offset: -18000
+    statuses_count: 20090
+    description: "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests."
+    friends_count: 784
+    location: "Boston, MA, USA"
+    profile_link_color: "800326"
+    profile_image_url: ""
+    following: true
+    geo_enabled: false
+    profile_banner_url: ""
+    profile_background_image_url: ""
+    lang: "en"
+    profile_background_tile: true
+    favourites_count: 3147
+    notifications: null
+    url: ""
+    created_at: "Wed Aug 27 12:24:25 +0000 2008"
+    contributors_enabled: false
+    time_zone: "Eastern Time (US & Canada)"
+    default_profile: false
+    is_translator: false
   }
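For the curious, the gist of a comparison like this can be sketched in a few lines. This is not how json-diff works internally; it is a simplified recursive walk of my own (the name diffKeys and the trimmed-down sample records are made up for illustration) that reports keys removed from, added to, or changed between two parsed JSON values, treating arrays as whole values.

```javascript
// Sketch: report keys that were removed (-), added (+) or changed (~)
// between two parsed JSON values. Arrays are compared as whole values.
function diffKeys(a, b, path, out) {
  path = path || "";
  out = out || [];
  var isObj = function (v) {
    return v !== null && typeof v === "object" && !Array.isArray(v);
  };
  if (isObj(a) && isObj(b)) {
    Object.keys(a).forEach(function (k) {
      if (!(k in b)) out.push("- " + path + k);
      else diffKeys(a[k], b[k], path + k + ".", out);
    });
    Object.keys(b).forEach(function (k) {
      if (!(k in a)) out.push("+ " + path + k);
    });
  } else if (JSON.stringify(a) !== JSON.stringify(b)) {
    // drop the trailing "." from the path of the value that differs
    out.push("~ " + path.slice(0, -1));
  }
  return out;
}

// heavily trimmed, illustrative versions of the archived and API records
var archived = {geo: {}, entities: {media: [], urls: []}, user: {id: 16010789}};
var fromApi = {geo: null, entities: {urls: []}, user: {id: 16010789, lang: "en"}, favorited: false};
console.log(diffKeys(archived, fromApi).join("\n"));
```

Run against the full records, a walk like this would list the same additions and removals as the diff above, only flattened into dotted paths.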

To be fair some of the user profile information has been normalized in the archive (perhaps to save space for the viewing application) out to a user_details.js file, which looks like:

{
  "screen_name" : "sarahebourne",
  "location" : "Boston, MA, USA",
  "full_name" : "Sarah Bourne",
  "bio" : "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
  "id" : "16010789",
  "created_at" : "Wed Aug 27 12:24:25 +0000 2008"
}

Notably missing from this is a homepage for the user, their number of favourites, their number of friends, followers, whether geo is enabled, etc.

All these details aside, Twitter deserves a lot of credit for making the data available as CSV for ease of use, and also as JavaScript for programmatic use.
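Since the JavaScript files are only a thin wrapper around JSON, it should be possible to read them without a JavaScript engine at all. Here is a minimal sketch, assuming each tweet file consists of a single assignment of the form Grailbird.data.tweets_2012_12 = followed by a JSON array (that is what the files I looked at contain, but it is not a documented format). The function name parseShim and the sample string are mine.

```javascript
// Sketch: treat a tweet file as "<assignment target> = <JSON array>" and
// parse the JSON part directly instead of evaluating the file.
function parseShim(text) {
  var start = text.indexOf("=");
  if (start === -1) throw new Error("no assignment found");
  // the last dotted segment of the assignment target, e.g. tweets_2012_12
  var name = text.slice(0, start).trim().split(".").pop();
  var body = text.slice(start + 1).trim();
  // tolerate a trailing semicolon after the array
  if (body.charAt(body.length - 1) === ";") body = body.slice(0, -1);
  return {name: name, tweets: JSON.parse(body)};
}

// hypothetical file contents, trimmed down to two fields
var sample = 'Grailbird.data.tweets_2012_12 = \n [ {"id_str" : "281405942321532929", "source" : "Janetter"} ]';
var parsed = parseShim(sample);
console.log(parsed.name, parsed.tweets.length, parsed.tweets[0].id_str);
```

One thing to watch for when parsing this way: the numeric id values are larger than JavaScript can represent exactly, so it is safer to rely on id_str.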

The Code

So the really, really neat thing about the archive is that it comes with a pure HTML, CSS and JavaScript application that you can open locally in your browser and view your archive. It looks pretty, for example here is Sarah’s archive that Dave Winer mounted up on S3. It even has a keyword search across all your tweets, which takes a bit of time (it interactively loads all your tweet JavaScript files mentioned above), but it works. You can zip the data up, give it to someone else, and it all just works.

The archive uses some third party libraries such as jQuery, Underscore, Twitter Bootstrap and Hogan, which all come minified and bundled statically in the archive. The application itself is called Grailbird and comes minified as well. Grailbird loads the static JavaScript (as needed) and displays it. The only network traffic I saw while it was running was fetching avatar images.

Assuming JavaScript backwards compatibility, and browser support for JavaScript, the Twitter archive’s contextual display for the underlying data could last a long, long time. At least that’s a possible interpretation based on David Rosenthal’s hypothesis about the Web’s effect on format obsolescence. I think it’s safe to say that this app written for the local Web platform is likely to last longer than a GUI application written in another language environment. The separation of code and data, and independence from a particular browser implementation are big wins. These are qualities that we all had to fight and work hard for on the Web, and I think it makes sense to re-purpose them here in an archival context.

I doubt anyone from Twitter has read this far, but if someone has, it would be great to see Grailbird show up with the other great stuff you have released to Github. I found myself wanting to quickly search across tweets looking for things, like geo-enabled tweets (to make sure that they are there). I could look at the minified Grailbird source in Chrome using developer tools, but it wasn’t good enough for me to figure out how to dynamically load data. I resorted to using NodeJS, and evaling the JavaScript files…and was able to confirm that there is geo data in the archives if you have it enabled. Here’s the simplistic script I came up with:

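var fs = require('fs');

// the tweet files assign into Grailbird.data, so it needs to exist first
var Grailbird = {data: {}};

// load all the tweet data (run from the top of the unzipped archive)
eval(fs.readFileSync("data/js/tweet_index.js", "utf8"));
for (var i = 0; i < tweet_index.length; i++) {
  eval(fs.readFileSync(tweet_index[i].file_name, "utf8"));
}

// look at each tweet and print out the date and geolocation if it's there
for (var slice in Grailbird.data) {
  for (var j = 0; j < Grailbird.data[slice].length; j++) {
    var tweet = Grailbird.data[slice][j];
    // geo is an empty object in the archive when there are no coordinates
    if (tweet.geo && tweet.geo.coordinates) {
      console.log(tweet.created_at + "," + tweet.geo.coordinates.join(","));
    }
  }
}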

And here is the output for Jeremy Keith’s archive:

% node geo.js
Fri Nov 30 13:08:33 +0000 2012,50.8262027605,-0.138112306595
Sat Nov 17 12:09:18 +0000 2012,54.6000387923,-5.9254288673
Fri Nov 16 22:32:03 +0000 2012,54.5925614526,-5.930852294
Thu Nov 15 13:35:35 +0000 2012,54.595909,-5.922033
Sat Nov 10 12:59:37 +0000 2012,50.825832,-0.142381
Fri Nov 09 13:54:51 +0000 2012,50.8262027605,-0.1381123066
Wed Nov 07 18:07:24 +0000 2012,50.825977,-0.138339
Tue Nov 06 16:58:49 +0000 2012,50.8378257671,-1.1800042739
Tue Oct 30 11:19:53 +0000 2012,50.8262027605,-0.1381123066
Thu Oct 18 17:51:22 +0000 2012,43.0733634985,-89.38608062
Tue Oct 16 17:29:20 +0000 2012,43.0872606735,-89.3659955263
Tue Oct 09 18:11:20 +0000 2012,40.7406891129,-74.0076184273
Sun Oct 07 14:27:50 +0000 2012,50.82906975,-0.126056
Sat Oct 06 16:29:30 +0000 2012,50.825832,-0.142381
Thu Oct 04 16:46:56 +0000 2012,50.8262027605,-0.1381123066
Tue Oct 02 17:46:42 +0000 2012,50.826646,-0.136921
Mon Oct 01 10:46:04 +0000 2012,50.8262027605,-0.1381123066
Mon Oct 01 10:43:46 +0000 2012,50.8262027605,-0.1381123066
Mon Oct 01 09:38:01 +0000 2012,50.8236703111,-0.1387184062
Mon Oct 01 08:53:15 +0000 2012,50.8236703111,-0.1387184062
Thu Sep 27 13:05:16 +0000 2012,59.915652,10.749959
Sun Sep 23 12:54:16 +0000 2012,50.8281663943,-0.128531456
Sat Sep 22 13:44:09 +0000 2012,50.87447886,0.017625
Thu Sep 20 13:16:11 +0000 2012,50.8262027605,-0.1381123066
Thu Sep 20 09:27:55 +0000 2012,50.8262027605,-0.1381123066
Mon Sep 17 07:51:20 +0000 2012,47.9952739036,7.8525775405
Sun Sep 16 09:01:28 +0000 2012,51.1599172667,-0.1787844393
Thu Sep 13 12:40:26 +0000 2012,50.822951,-0.136905
Tue Sep 11 18:41:47 +0000 2012,50.822746,-0.142274
Tue Sep 11 17:19:38 +0000 2012,50.822219,-0.140802
Tue Sep 11 13:05:59 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 13:03:35 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 12:48:51 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 12:06:36 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 08:23:00 +0000 2012,50.8262027605,-0.1381123066
Sun Sep 09 19:10:21 +0000 2012,50.826646,-0.136921
Tue Sep 04 17:33:44 +0000 2012,50.826646,-0.136921
Tue Sep 04 12:57:16 +0000 2012,50.822951,-0.136905
Mon Sep 03 16:03:37 +0000 2012,50.8262027605,-0.1381123066
Mon Sep 03 15:26:41 +0000 2012,50.8262027605,-0.1381123066
Sun Sep 02 19:40:38 +0000 2012,50.8229428584,-0.1390289018
Sun Sep 02 19:24:45 +0000 2012,50.8229428584,-0.1390289018
Sun Sep 02 19:08:55 +0000 2012,50.825977,-0.138339
Sun Sep 02 18:25:08 +0000 2012,50.825449,-0.137123
Sun Sep 02 17:04:15 +0000 2012,50.825449,-0.137123
Sun Sep 02 15:34:31 +0000 2012,50.8229428584,-0.1390289018
Fri Aug 31 17:33:20 +0000 2012,50.8291396274,-0.133923449
Fri Aug 31 09:20:04 +0000 2012,50.8311581116,-0.1335176435
Tue Aug 28 20:44:32 +0000 2012,41.8844650304,-87.6257600109
Mon Aug 27 13:57:24 +0000 2012,41.8844650304,-87.6257600109
Sat Aug 25 18:45:51 +0000 2012,41.8851594291,-87.6232355833
Wed Aug 22 12:32:45 +0000 2012,50.824415,-0.134691
Tue Aug 21 11:39:46 +0000 2012,50.8262027605,-0.1381123066
Mon Aug 20 11:01:28 +0000 2012,51.535132,-0.069309
Fri Aug 17 12:03:40 +0000 2012,50.8262027605,-0.1381123066
Sat Aug 11 16:08:13 +0000 2012,50.826646,-0.136921
Fri Aug 10 14:25:15 +0000 2012,50.8262027605,-0.1381123066
Wed Aug 08 11:51:45 +0000 2012,50.8262027605,-0.1381123066
Tue Aug 07 15:45:49 +0000 2012,50.8262027605,-0.1381123066
Fri Aug 03 16:38:55 +0000 2012,50.8262027605,-0.1381123066
Fri Aug 03 14:33:04 +0000 2012,50.8262027605,-0.1381123066
Sat Jul 28 14:57:52 +0000 2012,50.825449,-0.137123
Sat Jul 28 12:09:01 +0000 2012,50.828404,-0.137435
Thu Jul 26 17:17:22 +0000 2012,50.8266230357,-0.1367429505
Tue Jul 24 15:07:39 +0000 2012,50.8262027605,-0.1381123066
Mon Jul 23 12:25:35 +0000 2012,50.823104,-0.139515
Sat Jul 21 12:46:25 +0000 2012,50.827943,-0.136033
Fri Jul 20 13:21:41 +0000 2012,50.8262027605,-0.1381123066
Mon Jul 16 19:28:01 +0000 2012,50.825449,-0.137123
Sun Jul 15 10:48:44 +0000 2012,51.4714930776,-0.4883337021
Sat Jul 14 23:08:27 +0000 2012,41.974037,-87.890239
Tue Jul 10 13:44:08 +0000 2012,30.2655234842,-97.7385378752
Mon Jul 09 19:32:48 +0000 2012,30.2655234842,-97.7385378752
Mon Jul 09 14:40:21 +0000 2012,30.2656095537,-97.7385592461
Sat Jul 07 15:08:12 +0000 2012,51.4726745412,-0.4817537462
Fri Jun 29 10:55:03 +0000 2012,50.8262027605,-0.1381123066
Wed Jun 20 10:23:29 +0000 2012,51.488197,-0.120692
Mon Jun 18 12:12:01 +0000 2012,50.8262027605,-0.1381123066
Mon Jun 18 12:02:43 +0000 2012,50.8262027605,-0.1381123066
Sat Jun 16 15:51:15 +0000 2012,50.8244773427,-0.1387893509
Sat Jun 16 15:10:29 +0000 2012,50.827972412,-0.136271402
Fri Jun 15 22:15:44 +0000 2012,50.947306,0.090209
Fri Jun 15 12:58:27 +0000 2012,50.947306,0.090209
Wed Jun 13 12:12:49 +0000 2012,50.822951,-0.136905
Mon Jun 11 14:05:50 +0000 2012,50.825977,-0.138339
Wed Jun 06 16:31:48 +0000 2012,51.50361668,-0.683839
Wed Jun 06 15:38:45 +0000 2012,51.50361668,-0.683839
Sat Jun 02 15:40:48 +0000 2012,50.825449,-0.137123
Fri Jun 01 13:29:40 +0000 2012,50.8262027605,-0.1381123066
Thu May 31 16:37:18 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 14:58:46 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 12:45:33 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 12:32:27 +0000 2012,50.8262027605,-0.1381123066
Tue May 29 12:12:15 +0000 2012,50.8242644595,-0.1329624653
Tue May 29 08:12:24 +0000 2012,50.8307708894,-0.1330473622
Sun May 27 21:06:57 +0000 2012,47.5608179303,-52.70936785
Mon May 21 19:15:05 +0000 2012,50.824975,3.26387
Mon May 21 13:56:02 +0000 2012,51.0541040608,3.7238935404
Mon May 21 12:19:17 +0000 2012,51.055163,3.720835
Sat May 19 15:52:22 +0000 2012,50.821309,-0.1434404
Sat May 19 14:19:38 +0000 2012,50.822215,-0.154896
Sun May 13 14:08:33 +0000 2012,50.8244462443,-0.139321602
Sun May 13 13:29:30 +0000 2012,50.8192217888,-0.1411056519
Sat May 12 19:32:13 +0000 2012,50.820359,-0.14243
Sat May 12 17:51:57 +0000 2012,50.822623,-0.142676
Fri May 11 09:22:05 +0000 2012,52.366239,4.894655
Tue May 08 12:39:36 +0000 2012,50.8287188784,-0.1423922896
Sun May 06 20:38:27 +0000 2012,50.871762,0.011501
Fri May 04 14:35:37 +0000 2012,50.8262027605,-0.1381123066
Thu May 03 16:03:52 +0000 2012,50.8262027605,-0.1381123066
Thu May 03 12:05:08 +0000 2012,50.8242644595,-0.1329624653
Wed May 02 12:43:38 +0000 2012,50.8262027605,-0.1381123066
Tue May 01 14:50:47 +0000 2012,50.8244094849,-0.1399479955
Tue May 01 13:17:36 +0000 2012,50.8262027605,-0.1381123066
Tue May 01 12:01:59 +0000 2012,50.826779,-0.138462
Tue May 01 11:22:41 +0000 2012,50.8262027605,-0.1381123066
Mon Apr 30 15:58:14 +0000 2012,50.8262027605,-0.1381123066
Fri Apr 27 17:26:19 +0000 2012,50.825449,-0.137123
Thu Apr 26 12:44:54 +0000 2012,50.8262027605,-0.1381123066
Tue Apr 24 11:30:25 +0000 2012,50.8262027605,-0.1381123066
Sat Apr 21 14:37:59 +0000 2012,50.8244773427,-0.1387893509
Wed Apr 18 11:05:28 +0000 2012,51.514461,-0.15415
Tue Apr 17 11:38:39 +0000 2012,50.8262027605,-0.1381123066
Mon Apr 16 17:28:09 +0000 2012,50.825449,-0.137123
Fri Apr 13 17:35:30 +0000 2012,50.825449,-0.137123
Fri Apr 13 11:39:01 +0000 2012,50.8262027605,-0.1381123066
Thu Apr 12 20:59:46 +0000 2012,50.8284865994,-0.1406764984
Thu Apr 12 20:43:24 +0000 2012,50.8284865994,-0.1406764984
Thu Apr 12 12:38:06 +0000 2012,50.8262027605,-0.1381123066
Wed Apr 04 17:35:46 +0000 2012,50.829236,-0.130433
Wed Apr 04 11:20:06 +0000 2012,50.8262027605,-0.1381123066
Wed Mar 28 19:51:57 +0000 2012,50.82533,-0.1371919
Wed Mar 28 17:41:06 +0000 2012,50.8266230357,-0.1367429505
Sat Mar 24 15:24:22 +0000 2012,50.82578,-0.139591
Sat Mar 24 14:42:14 +0000 2012,50.8244773427,-0.1387893509
Thu Mar 22 20:33:36 +0000 2012,50.821049,-0.140416
Thu Mar 15 16:00:20 +0000 2012,32.8975517297,-97.0442533493
Wed Mar 14 15:41:13 +0000 2012,30.265426,-97.740498
Tue Mar 13 19:52:43 +0000 2012,30.2647199679,-97.7443528175
Tue Mar 13 16:29:12 +0000 2012,30.2653850259,-97.7383099888
Mon Mar 12 02:03:53 +0000 2012,30.2669212002,-97.745683415
Sun Mar 11 17:45:31 +0000 2012,30.2626071693,-97.739803791
Sun Mar 11 15:18:53 +0000 2012,30.2647199679,-97.7443528175
Fri Mar 09 15:11:51 +0000 2012,30.2671521557,-97.7396624407
Mon Mar 05 10:56:37 +0000 2012,50.8262027605,-0.1381123066
Thu Mar 01 09:55:16 +0000 2012,50.8304057758,-0.1329698575
Wed Feb 22 23:56:59 +0000 2012,-33.8782765912,151.221249511
Wed Feb 22 02:00:43 +0000 2012,-41.328228677,174.809947014
Thu Feb 16 01:13:27 +0000 2012,-41.2890508786,174.777774995
Wed Feb 15 21:39:06 +0000 2012,-41.2893031956,174.777374268
Wed Feb 15 18:50:42 +0000 2012,-41.2893031956,174.777374268
Wed Feb 15 02:10:18 +0000 2012,-41.29336192,174.776485
Mon Feb 13 04:07:07 +0000 2012,-41.2893031956,174.777374268
Mon Feb 13 03:36:49 +0000 2012,-41.2924914456,174.776140451
Mon Feb 13 03:00:13 +0000 2012,-41.293314,174.776395
Mon Feb 13 02:40:18 +0000 2012,-41.2934345895,174.775958061
Mon Feb 13 01:22:04 +0000 2012,-41.2939726591,174.775840044
Sat Feb 11 23:39:04 +0000 2012,-36.405247,174.65600431
Sat Feb 11 07:32:16 +0000 2012,-36.405247,174.65600431
Sat Feb 11 06:49:42 +0000 2012,-36.405247,174.65600431
Wed Feb 08 23:20:25 +0000 2012,-33.878302,151.221256
Sat Feb 04 11:14:52 +0000 2012,50.828205,-0.1378011703
Thu Feb 02 13:41:42 +0000 2012,50.8262027605,-0.1381123066
Wed Feb 01 16:57:16 +0000 2012,50.8262027605,-0.1381123066
Sat Jan 28 16:57:35 +0000 2012,50.827062,-0.135349
Sat Jan 28 15:55:49 +0000 2012,50.828295,-0.138769
Thu Jan 26 12:42:08 +0000 2012,50.8262027605,-0.1381123066
Mon Jan 23 12:34:45 +0000 2012,50.822219,-0.140802
Sun Jan 22 15:18:32 +0000 2012,50.825832,-0.142381
Sat Jan 21 14:27:51 +0000 2012,50.8213,-0.1409
Fri Jan 20 12:45:34 +0000 2012,51.9479484763,-0.5020558834
Thu Jan 19 20:49:09 +0000 2012,52.9556027724,-1.1504852772
Thu Jan 19 12:38:47 +0000 2012,52.954584773,-1.1563324928
Wed Jan 18 16:42:24 +0000 2012,52.954584773,-1.1563324928
Wed Jan 18 16:39:09 +0000 2012,52.954584773,-1.1563324928
Tue Jan 17 15:00:09 +0000 2012,50.8262027605,-0.1381123066
Mon Jan 16 10:03:12 +0000 2012,50.8303548561,-0.1329055827
Sat Jan 14 16:11:55 +0000 2012,50.824838842,-0.1516896486
Wed Jan 11 21:07:19 +0000 2012,51.522789913,-0.0784921646
Wed Jan 11 19:27:24 +0000 2012,51.5237223711,-0.0770612686
Sat Jan 07 14:49:09 +0000 2012,50.824424,-0.138875
Fri Apr 09 01:52:12 +0000 2010,47.4412234282,-122.3010026978
Fri Apr 09 00:00:15 +0000 2010,47.4432422071,-122.3010595342
Thu Apr 08 01:29:11 +0000 2010,47.6873506139,-122.3341637453
Wed Apr 07 00:16:03 +0000 2010,47.6109922102,-122.3480262842
Sun Apr 04 18:47:33 +0000 2010,47.7083958758,-122.3272574643
Sat Apr 03 18:06:54 +0000 2010,47.6687063559,-122.3942997359
Sat Apr 03 18:05:00 +0000 2010,47.6687063559,-122.3942997359

I guess it’s kind of scary that you can do this, and is perhaps why Twitter doesn’t let you export anyone’s account, even if it is public. But returning to the issue of Grailbird being on Github, I imagine there would be people that would write code that uses Grailbird as an API to the archive data, to provide extensions that would display a map of where you’ve been over time for example, or an analysis of your friendship network, or a view on hashtags you’ve used, events you’ve been at etc.
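As a rough illustration of the map idea, here is a minimal Python sketch that reads lines shaped like the listing above and sorts them into a time-ordered path. It assumes the three columns are the tweet's created_at time, latitude and longitude; the function name parse_points is made up for this example, and it reads the already-extracted text rather than calling Grailbird itself.

```python
import csv
from datetime import datetime
from io import StringIO

# Timestamp layout as it appears in the listing above, e.g.
# "Sat May 12 19:32:13 +0000 2012" (assumes English day/month names).
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def parse_points(text):
    """Parse 'created_at,latitude,longitude' lines into (datetime, lat, lon) tuples."""
    points = []
    for row in csv.reader(StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        created, lat, lon = row
        points.append((datetime.strptime(created, TWITTER_TIME), float(lat), float(lon)))
    # Oldest first, so the points can be drawn as a path over time.
    return sorted(points)


sample = """Sat May 12 19:32:13 +0000 2012,50.820359,-0.14243
Fri May 11 09:22:05 +0000 2012,52.366239,4.894655
Fri Apr 09 01:52:12 +0000 2010,47.4412234282,-122.3010026978"""

for when, lat, lon in parse_points(sample):
    print(when.date().isoformat(), lat, lon)
```

The sorted list could then be handed to whatever mapping library you prefer; that part is left out here on purpose.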

I think from an archival perspective, it would be really useful to be able to receive something like a Tweet archive from a donor, and overlay functionality on top of it. The model of using the Web as a local application platform for this sort of archival content seems like it could be a growth area.

Inside Out Libraries

Peter Brantley tells a sad tale about where public library leadership is at, as we plunge headlong into the ebook future, that has been talked about for what seems like forever, and which is now upon us. It’s not pretty.

The general consensus among participants was that public libraries have two, maybe three years to establish their relevance in the digital realm, or risk fading from the central place they have long occupied in the world’s literary culture.

The fact that a bunch of big-wigs invited by IFLA were seemingly unable to find inspiration and reason to hope that public libraries will continue to exist is not surprising in the least I guess. I’m not sure that libraries were ever the center of the world’s literary culture. But for the sake of argument let’s assume they were, and that now they’re increasingly not. Let us also assume that the economic landscape around ebooks is in incredible turmoil, and that there will continue to be sea changes in technologies, and people’s use of them in this area for the foreseeable future.

What can libraries do to stay relevant? I think part of the answer is: stop being libraries…well, sorta.

The HyperLocal

The most serious threat facing libraries does not come from publishers, we argued, but from e-book and digital media retailers like Amazon, Apple, and Google. While some IFLA staff protested that libraries are not in the business of competing with such companies, the library representatives stressed that they are. If public libraries can’t be better than Google or Amazon at something, then libraries will lose their relevance.

In my mind the thing that libraries have to offer, which these big corporations cannot, is authentic, local context for information about a community’s past, present and future. But in the past century or so libraries have focused on collecting mass produced objects, and sharing data about said objects. The mission of collecting hyper-local information has typically been a side task, that has fallen to special collections and archives. If I were invited to that IFLA meeting I would’ve said that libraries need to shift their orientation to caring more about the practices of archives and manuscript collections, by collecting unique, valued, at risk local materials, and adapting collection development and descriptive practices to the realities of more and more of this information being available as data.

As Mark Matienzo indicated (somewhat indirectly in Twitter) after I published this blog post, a lot of this work involves focusing less on hoarding items like books, and focusing more on the functions, services, and actions that public libraries want to document and engage with in their communities. Traditionally this orientation has been a strength area for archivists in their practice and theory of appraisal where:

… considerations … include how to meet the record-granting body’s organizational needs, how to uphold requirements of organizational accountability (be they legal, institutional, or determined by archival ethics), and how to meet the expectations of the record-using community. Wikipedia

I think this represents a pretty significant cognitive shift for library professionals, and would in fact take some doing. But perhaps that’s just because my exposure to archival theory in “library school” was pretty pathetic. Be that as it may here are some practical examples of growth areas for public libraries that I wish came up at the IFLA meeting.

Web Archiving

The Internet Archive and national libraries that are part of the International Internet Preservation Consortium don’t have the time, resources, and often the mandate to collect web content that is of interest at the local level. What if the tooling and expertise existed for public libraries to perform some of this work, and to have the results fed into larger aggregations of web archives?

Municipality Reports and Data

Increasing amounts of data are being collected as part of the daily working of our local governments. What if your public library had the resources to be a repository for this data? Yeah, I said the R word. But I’m not suggesting that public libraries get the expertise to set up Fedora instances with Hydra heads, or something. I’m thinking about approaches to allowing data to easily flow into an organization, where it is backed up, and made available in a clearinghouse manner similar to on the Web, for search engines to pick up. Perhaps even services like LibraryBox offer another lens to look at the opportunities that lie in this area.

Born Digital Manuscript Collections

Public libraries should be aggressively collecting the “papers” of local people who have made significant contributions to their communities. Increasingly, these aren’t paper at all, but are born digital content. For example: email correspondence, document archives, digital photograph collections. I think that librarians and archivists know, in theory, that this born digital content is out there, but the reality is it’s not flowing into the public library/archive. How can we change this? Efforts such as Personal Digital Archiving are important for two reasons: they help set up the right conditions for born digital collections to be donated, and they also make professionals think about how they would like to receive materials so that they are easier to process. Think more things like AIMS, training and tooling for both professionals and citizens.


It’s not unusual for archives and special collections to have all sorts of donor gift agreements that place restrictions on how their donated materials can be used. To some extent needing to visit the collection, request it, and not being able to leave the room with it, has mitigated some of this special-snowflakism. But when things are online things change a bit. We need to normalize these agreements so that content can flow online, and be used online in clearer ways. What if we got donors to think about Creative Commons licenses when they donated materials? How can we make sure donated material can become a usable part of the Web?


We all know that things come and go on the Web. But it doesn’t need to be that way for everything on the Web. Libraries and archives have an opportunity to show how focusing on being a clearinghouse for data assets can allow for things to live persistently on the Web. Thinking about our URLs as identifiers for things we are taking care of is important. Practical strategies for achieving that are possible, and repeatable. What if public libraries were safe harbors for local content on the World Wide Web? This might sound hard to do, but I think it’s not as hard as people think.


As libraries/archives make more local content available publicly on the Web it becomes important to track how this content is accessed and used online. Web analytics tools (Google Analytics, for example) are quick wins for seeing what is being accessed and from where. Seeing how content is cited in social media applications like Facebook, Twitter, Pinterest and Wikipedia is important for reporting on the value of online collections. But encouraging professionals to use this information to become part of the conversations is equally important. Good metrics are also essential for collection development purposes, seeing what content is of interest, and what is not.

Inside Out Libraries

So, no, I don’t think public libraries need a new open source Overdrive. The ebook market will likely continue to take care of itself. I am also not really convinced we need some overarching organization like the Digital Public Library of America to serve as a single point of failure when the funding runs dry. We need distributed strategies for documenting our local communities, so that this information can take its rightful place on the Web, and be picked up by Google so that people can find it when they are on the other side of the world. Things will definitely keep changing, but I think libraries and archives need to invest in the Web as an enduring delivery platform for information.

I’ve never been before but I was so excited to read the call for the European Library Automation Group (ELAG) this year.

The theme of this year’s conference is ‘The INSIDE-OUT Library’. This theme was chosen at last year’s conference, because we concluded:

  • Libraries have been focusing on bringing the world to their users. Now information is globally available.
  • Libraries have been producing metadata for the same publications in parallel. Now they are faced with deduplicating redundancy.
  • Libraries have been selecting things for their users. Now the users select things themselves.
  • Libraries have been supporting users by indexing things locally. Now everything is being indexed in global, shared indexes.

Instead of being an OUTSIDE-IN library, libraries should try and stay relevant by shifting their paradigm 180 degrees. Instead of only helping users to find what is available globally, they should also focus on making local collections and production available to the world. Instead of doing the same thing everywhere, libraries should focus on making unique information accessible. Instead of focusing on information trapped in publications, libraries should try and give the world new views on knowledge.

This blog post is really just a somewhat shabby rephrasing of that call. Maybe IFLA could use some of the folks on the ELAG program committee at their next meeting about the future of public libraries? Hopefully 2013 will be a year I can make it to ELAG.

I expect public libraries will continue to exist, but there isn’t going to be some magical technical solution to their problems. Their future will be forged by each local relationship they make, which leads to them better documenting their place on the Web. We may not call these places public libraries at first, but that’s what they will be.

linkrot: use your illusion

Mike Giarlo wrote a bit last week about the issues of citing datasets on the Web with Digital Object Identifiers (DOI). It’s a really nice, concise characterization of why libraries and publishers have promoted and used the DOI, and indirect identifiers more generally. Mike defines indirect identifiers as

… identifiers that point at and resolve to other identifiers.

I might be reading between the lines a bit, but I think Mike is specifically talking about any identifier that has some documented or ad-hoc mechanism for turning it into a Web identifier, or URL. A quick look at the Wikipedia identifier category yields lots of these, many of which (but not all) can be expressed as a URI.

The reason why I liked Mike’s post so much is that he was able to neatly summarize the psychology that drives the use of indirect identifier technologies:

… cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers

Mike goes on to get to the heart of the problem, which is that indirect identifier technologies don’t solve the problem of broken links on the Web, they just push it elsewhere. The real problem of maintaining the indirect identifier when the actual URL changes becomes someone else’s problem. Out of sight, out of mind … except it’s not really out of sight right? Unless you don’t really care about the content you are putting online.

We all know that linkrot on the Web is a real thing. I would be putting my head in the sand if I were to say it wasn’t. But I would also be putting my head in the sand if I said that things don’t go missing from our brick and mortar libraries. But still, we should be able to do better than 1/2 the URLs in arXiv going dead right? I make a living as a web developer, I’m an occasional advocate for linked data, and I’m a big fan of the work Henry Thompson and David Orchard did for the W3C analyzing the use of alternate identifier schemes on the Web…so, admittedly, I’m a bit of a zealot when it comes to promoting URLs as identifiers, and taking the Web seriously as an information space.

Mike’s post actually kicked off what I thought was a useful Twitter conversation (yes they can happen), which left me contemplating the future of libraries and archives on (or in) the Web. Specifically, it got me thinking that perhaps libraries and archives of the not too distant future will be places that take special care in how they put content on the Web, so that it can be accessed over time, just like a traditional physical library or archive. The places where links and the content they reference are less likely to go dead will be the new libraries and archives. These may not be the same institutions we call libraries today. Just like today’s libraries, these new libraries may not necessarily be free to access. You may need to be part of some community to access them, or to pay some sort of subscription fee. But some of them, and I hope most, will be public assets.

So how to make this happen? What will it look like? Rather than advocating a particular identifier technology I think these new libraries need to think seriously about providing Terms of Service documents for their content services. I think these library ToS documents will do a few things.

  • They will require the library to think seriously about the service they are providing. This will involve meetings, more meetings, power lunches, and likely lawyers. The outcome will be an organizational understanding of what the library is putting on the Web, and the commitment they are entering into with their users. It won’t simply be a matter of a web development team deciding to put up some new website…or take one down. This will likely be hard, but I think it’s getting easier all the time, as the importance of the Web as a publishing platform becomes more and more accepted, even in conservative organizations like libraries and archives.
  • The ToS will address the institution’s commitment to continued access to the content. This will involve a clear understanding of the URL namespaces that the library manages, and a statement about how they will be maintained over time. The Web has built in mechanisms for content moving from place to place (HTTP 301), and for when resources are removed (HTTP 410), so URLs don’t need to be written in stone. But the library needs to commit to how resources will redirect permanently to new locations, and for how long–and how they will be removed.
  • The ToS will explicitly state the licensing associated with the content, preferably with Creative Commons licenses (hey I’m daydreaming here) so that it can be confidently used.
  • Libraries and archives will develop a shared palette of ToS documents. Each institution won’t have its own special snowflake ToS that nobody reads. There will be some normative patterns for different types of libraries. They will be shared across consortia, and among peer institutions. Maybe they will be incorporated into, or reflect shared principles found in documents like ALA’s Library Bill of Rights or SAA’s Code of Ethics.
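
To make the HTTP 301 and HTTP 410 commitments mentioned in the list above a little more concrete, here is a minimal Python sketch of a URL policy lookup. Everything in it is hypothetical: the paths, the LIVE, MOVED and GONE tables, and the resolve function are invented for illustration, and a real library would wire this sort of policy into its web server or application framework rather than a bare function.

```python
from http import HTTPStatus

# Hypothetical URL policy: what is live, what has moved, what was withdrawn.
LIVE = {"/collections/1234"}
MOVED = {"/findingaids/1234": "/collections/1234"}
GONE = {"/exhibits/2009-summer"}


def resolve(path):
    """Return (status, headers) for a request path under the policy above."""
    if path in LIVE:
        return HTTPStatus.OK, {}
    if path in MOVED:
        # 301: the resource lives on at a new, permanent location.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    if path in GONE:
        # 410: the resource was deliberately removed and is not coming back.
        return HTTPStatus.GONE, {}
    # Anything the policy has never heard of is a plain 404.
    return HTTPStatus.NOT_FOUND, {}


for p in ["/collections/1234", "/findingaids/1234", "/exhibits/2009-summer", "/nope"]:
    status, headers = resolve(p)
    print(p, int(status), headers)
```

The point of keeping MOVED and GONE as explicit tables is that they become something the ToS can talk about: how long an entry stays in MOVED, and who decides when something goes into GONE.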

I guess some of this might be a bit reminiscent of the work that has gone into what makes a trusted repository. But I think a Terms of Service between a library/archive and its researcher is something a bit different. It’s more outward looking, less interested in certification and compliance and more interested in entering into and upholding a contract with the user of a collection.

As I was writing this post, Dan Brickley tweeted about a recent talk Tony Ageh (head of the archive development team at the BBC) gave at the recent Economies of the Commons conference. He spoke about his ideas for a future Digital Public Space, and the role that archives and organizations like the BBC play in helping create it.

Things no longer ‘need’ to disappear after a certain period of time. Material that once would have flourished only briefly before languishing under lock and key or even being thrown away — can now be made available forever. And our Licence Fee Payers increasingly expect this to be the way of things. We will soon need to have a very, very good reason for why anything at all disappears from view or is not permanently accessible in some way or other.

That is why the Digital Public Space has placed the continuing and permanent availability of all publicly-funded media, and its associated information, as the default and founding principle.

I think Tony and Mike are right. Cultural heritage organizations need to think more seriously, and more long term about the content they are putting on the Web. They need to put this thought into clear, and succinct contracts with their users. The organizations that do will be what we call libraries and archives tomorrow. I guess I need to start by getting my own house in order eh?

level 0 linked archival data

Depósito del Archivo de la Fundación

TLDR; let’s see if we can share structured archival data better by adding HTML <link> elements that point at our EAD XML files.

A few weeks ago I attended a small meeting of DC museums, archives and libraries that were discussing what Linked Data means for Archives. Hillel Arnold and I took collaborative notes in Pirate Pad. For a good part of the time we went around the room talking about how we describe archival collections with various workflows using Encoded Archival Description (EAD), and how this was mostly working (or not).

Some good work has already been done imagining how Linked Data can transform archival description by the LOCAH (now Linking Lives) project as well as the Social Networks and Archival Context project. I think tools like Editors’ Notes, CWRC Writer, and Google’s Research Pane could provide really useful models for how the work of an archivist could benefit from linking to external resources such as Wikipedia, dbpedia, VIAF, etc. But we really didn’t talk about that in too much detail. The focus instead was on various tools people used in their EAD workflows: Archivists’ Toolkit, Oxygen, ExistDB, Access databases, etc … and the hope that ArchivesSpace could possibly improve matters. We did touch briefly on what it means to make finding aids available on the Web, but not in a very satisfactory way.

I was really struck by how everyone was using EAD, even if their tools were different. I was also left with the lingering suspicion that not much of this EAD data was linked to from the HTML presentation of the finding aid. After some conversations it was also my understanding that even after 20 years of work on EAD, there is not a listing of websites that make EAD finding aids available. It seems particularly sad that institutions have invested a lot of time and effort in putting EAD into practice, and yet we still aren’t really sharing them very well with each other.

So in a bit of a fit of frustration I did some hacking to see if I could use Google and ArchiveGrid to identify websites that serve up finding aids either as HTML or as EAD XML. I wanted to:

  1. Get a list of websites that made HTML and EAD XML finding aids available. We can rely on Google to index the Web, but maybe we could index the archival web a bit better ourselves if we had a better understanding of where the EAD data was available. The idea is that this initial list could be used to bootstrap a list of websites making EAD finding aids available in the Wikipedia entry for EAD.
  2. See which websites have HTML representations that link to an EAD XML representation. The rationale here is to encourage a very simple best practice for linking to structured archival data when it is available. More on that below.

I was able to identify 201 hosts that served up finding aids either as HTML or XML. You should be able to see them here in this spreadsheet. I also collected URLs for finding aids (both HTML and XML) that I was able to locate, which can be seen in this JSON file.

With the URLs in hand I wrote a little script to examine which of the 156 hosts serving up HTML representations of finding aids had a link to an XML EAD document. I looked for a very simple kind of link that was popularized by the RSS and Atom syndication community for autodiscovery of blog feeds: a <link> tag that has a rel attribute of alternate and a type attribute set to application/xml. Out of the 156 websites serving up HTML representations of finding aids I could only find two websites that used this link pattern: Princeton University and Emory University.
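
This is not the script I actually ran, but a minimal Python sketch of the kind of check described above, using only the standard library. The names AlternateLinkFinder and find_ead_links, and the example.org URL in the sample HTML, are invented for illustration; the sample is not the markup from either Princeton or Emory.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="application/xml"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalize to "".
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        is_xml = a.get("type", "").lower() == "application/xml"
        if "alternate" in rels and is_xml and a.get("href"):
            self.hrefs.append(a["href"])


def find_ead_links(html):
    """Return the hrefs of autodiscovery-style links to XML found in an HTML string."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.hrefs


# Hypothetical finding aid page, for illustration only.
sample_html = """<html><head>
<title>Example finding aid</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/xml" href="http://example.org/findingaids/ms001.xml" />
</head><body><h1>Example Collection</h1></body></html>"""

print(find_ead_links(sample_html))
```

Fetching each finding aid URL and feeding the response body to find_ead_links is all that is left to turn this into a survey across hosts.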

For example if you view the HTML source for the Einstein Collection finding aid at Princeton you’ll see this link:

Similarly the finding aid for the Salman Rushdie collection at Emory University has this link:

As the title of this blog post suggests, I’m calling this pattern level 0 linked data. Linked Data purists would probably say this isn’t Linked Data at all since it doesn’t involve an RDF serialization. And I guess they would be right. But it does express a graph of HTML and EAD data that is linked, and it serves a real need. If you are interested in Linked Data and archives I encourage you to add these links to your HTML finding aids today.

So why are these links important?

The main reason is they are found in HTML documents, which are the representations that matter most on the Web. HTML documents are read by people. They are hypertext documents that link to and from other places on an archive’s website and elsewhere on the Web at large. They are well understood technically by the Web development community…if you hire a developer they might have strong feelings about using PHP or Ruby, but they will know HTML backwards and forwards. They are crawled and indexed by search engine bots so that researchers around the world can discover our collections. They are cited in social environments like Twitter, Facebook, blog posts, etc. We have a responsibility to create stable homes (URLs) for our archival descriptions that fit into the Web.

The other reason these links are important is that they make our investment in EAD visible on the Web for anyone who is looking. Nobody but ArchiveGrid actively crawls EAD XML data. They are the only ones that can find them, because they have been told where they are. If we did a better job of advertising the availability of our EAD documents I think we would see more tools and services around them. ArchiveGrid is a good example of the sort of tool that could be built on top of a web of EAD data. But what about archival collections in your local area? Perhaps it would be useful to have a service that let you look across the archival holdings of institutions in a consortium you belong to. Or perhaps you might want to create an alerting service that lets researchers know what new archival collections are being made available. Or maybe you need to collaborate with archives in a specific domain, and need tools that provide a custom experience for that distributed collection. I imagine there would be lots of ideas for apps if there were just a teensy bit more thought put into how finding aids (both the HTML and the XML) are put on the Web, and how we shared information about their availability.

Going forward I think HTML5 microdata and RDFa present some excellent opportunities for Linked Data representations of finding aids. Especially when you consider some of the vocabulary development being done around them; as well as some of the work being done by Tim Sherratt on using linked data to create new user experiences around archival data. But if your institution has already invested in creating EAD documents I think trying this link pattern with data you already have could be a good first step towards introducing linked data into your archive. I hope it is a first baby step that archives can take in merging some of the structured data found in the EAD XML document into the HTML they publish about their collections.

I’m planning on getting the list of EAD publishers into the Wikipedia article for EAD, and putting out a call for others to add their website if it is missing. I also think that a simple crawling and aggregation service that uses the links in some fashion could encourage more linking. A lot of this blog post has been mental preparation for my involvement in an IMLS funded project run out of Tufts that will be looking at Linked Archival Metadata, which is about to be kicked off this winter. If you’ve read this far, and have any thoughts or suggestions about this I’d enjoy hearing them either here, on Twitter or via email.

who creates the LCNAF (part 2)

I ended my A Look at Who Creates the LCNAF post with a hunch that the Library of Congress Name Authority File is increasingly supported by participants in the Name Authority Cooperative (NACO) rather than by the Library of Congress itself. It didn’t occur to me until a few days later that I missed a pretty obvious opportunity to graph the number of records created by LC compared with all the other members of the collective. So, here it is:

It looks like this has been a trend since about 1996 or so. I think it validates the cooperative aspect of the PCC and NACO. Not that it needs any validating. It’s just nice to see libraries and librarians working together to build something. I guess the name Library of Congress Name Authority File is also increasingly ironic…

Update: thanks to Kevin Ford (who emailed me privately) it seems that LC has been quite aware of this trend, and highlighted the event in 1996 when NACO members began contributing more records than LC with a press release.

Always Already New

Always Already New: Media, History, And The Data Of Culture by Lisa Gitelman
My rating: 3 of 5 stars

I enjoyed this book, mainly for the author’s technique of exploring what media means in our culture by using two examples, separated in time: the phonograph and the Internet. She admits that in some ways this amounts to comparing apples to oranges, and there is definitely a creative tension in the book. Gitelman’s emphasis is not that media technologies change society and culture, but that a technology is introduced and is in turn shaped by its particular social and historical context, which then reshapes society and culture.

I define media as socially realized structures of communication, where structures include both technological forms and their associated protocols, and where communication is a cultural practice, a ritualized collocation of different people on the same mental map, sharing or engaged with popular ontologies of representation. As such, media are unique and complicated historical subjects.

It’s tempting to talk about media technologies as if their ultimate use is somehow inevitable. For example, Gitelman discusses how the initial commercial placement of the phonograph centered largely around the idea that it would transform dictation and the office. Early demonstrations intended to increase sales of the device focused on recording and playback, rather than simply playback. They didn’t initially see the market for recorded music, which would so transform the device. To some extent we’ve cynically come to expect this out of marketing and “evangelism” about media technologies all the time. But this mode of thinking is also present in purely technical discussions, which don’t account for the placement of the technology in a particular social context.

Getting a sense of the social context you are in the middle of, as opposed to one you are historically removed from, presents some challenges. I think this difficulty is more evident in the second part of the book which focuses on the Internet and the World Wide Web against a backdrop of libraries and bibliography. Like many others I imagine, my knowledge of JCR Licklider’s influence on the development of ARPAnet, and the Internet was largely culled from Where Wizards Stay Up Late. I had no idea, until reading Always Already New, that Licklider contracted with the Council on Library Resources (now Council on Library and Information Resources) to write a report Libraries of the Future on the topic of how computing would change libraries.

I enjoyed the discussion of the role that the Request for Comments (RFC) played on the Internet: how these documents, which were initially shared via the post, helped bootstrap the technologies that would create the Internet that allowed them to be shared as electronic documents or text. I didn’t know about the RFC-Online project that Jon Postel started right before his death, to recover the earliest RFCs that had already been lost. Gitelman’s study of linking, citation and “publishing” on the Web was also really enjoyable, mainly because of her orientation to these topics:

I will argue that far from making history impossible, the interpretive space of the World Wide Web can prompt history in exciting new ways.

All this being said, I finished the book with the sneaking feeling that I needed to reread it. Gitelman’s thesis was subtle enough that it was only when I got to the end that I felt like I understood it: the strange loop that thinking and media participate in, and how difficult (and yet fruitful) it is to talk about media and their social context. Maybe this was also partly the effect of reading it on a Kindle :-)


learning from people that do

Anil Dash recently wrote a nice piece about the need for what he calls a Hi-Tech Vo-tech in the technology sector. If you are not familiar with it already, Vo-Tech is shorthand in the US for vocational-technical schools, which provide focused training in specific areas, often on a part-time basis. The Vo-Tech experience is markedly different from the typical four-year university experience, which tends to be focused more on theory than practice.

I totally agree.

But if you are looking to work as a software developer, and to help build this amazing information space we call the World Wide Web, you don’t need to wait for this dream of a better high school curriculum for computer programming, or Hi-Tech Vo-Techs to come to your town. I don’t want to minimize the effort involved in finding your way into the workplace…it’s hard, especially when there is competition from “qualified” candidates, and the skill sets seem to be constantly shifting. But here are some relatively simple steps you can take to get started.

Look at Job Ads

Go to the Craigslist for your area and look at what jobs are available under the internet engineers and software / qa / dba sections. I suggest Craigslist because of its local flavor and the low cost to advertise, which typically means the jobs are at smaller companies that are less interested in finding someone with the right college degree, and more interested in finding someone who can get things done. Look for jobs that focus on what you can do rather than schooling. Don’t apply for any of the jobs just yet. Note down the tools they want people to know: computer languages, operating systems, web frameworks, etc. Research them on Wikipedia. Focus on tools that seem to pop up a lot, are opensource, and can be downloaded and used without cost. You don’t need to do anything with them just yet though.

Go To User Group Meetings

I say opensource because opensource tools often have open communities around them. You should be able to find user groups in your area where people present on how they use these tools at their place of work. You might have to drive a while, or take a long bus/train ride – but it’s worth it. To find the meetings do some searches by technology and location on Meetup. Alternatively you can Google for whatever the technology is + “user group” + your area (e.g. Philadelphia) and go through a few pages of results. At a user group meeting you will not only learn about the details of the technology, but you will meet actual, real people who are using it. There are often subtle differences in the cultures and communities of practice around software tools. Some user groups will feel more comfortable than others. Pay attention to your gut reactions–they are indicators of how much you would like a job working with the technology, and the people who like it. If you get a bad vibe, don’t take it personally, try another meeting. Finding a job is often a matter of who you know, not what you know … and user groups are a great place to get to know people working in the software development field. There’s no online substitute for meeting people in real life.

Use Social Networks

At user group meetings you meet people who you can learn from. See if they have a blog, are on Twitter or Facebook. Maybe they use a social bookmarking tool you can follow. Or perhaps there are email discussion lists you can subscribe to. It’s not stalking, these people are your mentors, learn from them. Take a dip into sites like Hacker News or Programming Reddit. Watch the trends, you aren’t being a fanboy/girl, you are learning about what people care about in the field. Don’t feel bad if it’s overwhelming (it’s overwhelming to “experts” too), focus on what seems interesting. Also, cultivate your own online identity by posting stuff that you are interested in, or have questions about. Stay positive, and try not to bash things: people (and potential employers) are watching you the same way you are watching them.


Sometimes the speakers at User Group meetings will also be authors of books. You will see books reviewed on sites like Hacker News. People you follow may mention the books they read, or have accounts on sites like GoodReads. See if a library or a bookstore has them, and go skim them. Buy or borrow the ones you like. Take notes about them online, so people can see your interests. Get a Google Reader account and follow blogs related to tools you would like to use. Look for tools that have approachable/readable tutorials. Try out the examples, and get a feel for how well the theory of the tutorial translates into practice. If tools don’t install or seem to work the way they are described, don’t feel like you did something wrong…move on to tools that work more smoothly, and fit your brain better. The benefit to focusing on opensource projects is that you will find more content about them online. You can read code. Reading the source code for Ruby or GoLang is definitely not for the faint of heart, though it’s nice you can do it. It’s more important that you look at code that uses these tools. Go to GitHub and see what projects there are that use the tool. Browse the source online, or clone the repositories to your workstation. See if you can help out with some low hanging fruit tasks in their issue queue.

Find a Niche

You are probably interested in things other than programming. For example I like libraries and archives, and the cultural heritage sector. I’ve found a virtual community of software developers in this area called code4lib, which helps me learn more about new projects and tools in the field, and is a way to get to know people. You may be surprised to find a similar community around something you are interested in: be it astrophysics, cartoons, music, maps, real estate, etc. If you don’t find one, maybe think about starting one up–you might be surprised by how many people turn up. Sometimes there are collaborative projects that need your help, like Wikipedia or OpenStreetMap, where the ability to automate mundane tasks is needed. You might not get paid for this work, but it will broaden your circle of contacts, deepen your technical skills, build your self confidence, and be something to put on your resume. The key thing that finding a niche can do is make your job search a bit easier, since technology skills cut across domains. You will also find that your niche has a particular set of tools that it likes to use. These typically aren’t hard and fast rules about using X instead of Y, but are norms. Pay attention to them, and learn about things that interest you.

Be Confident

I don’t mean to imply any of this is easy. It can be extremely difficult to get out of your comfort zone and explore things you don’t know. But you will be rewarded for your efforts, by learning from people who actually do things in the world. I’ve worked with some really excellent software developers who didn’t have a compsci degree, and some who I wasn’t even sure had graduated from high school. Sometimes I wonder if I even graduated from high school. So be confident in your ability to learn and do this thing we call software development. Show that you are humble about what you don’t know, and that you are hungry to learn it. Above all, don’t buy into the cult of the “real programmer” … she doesn’t exist. There are just people to learn from, and if you are doing it right, you never stop learning.