emoji dick and mo tweets

The news about Emoji Dick (the version of Moby Dick translated into Emoji) being acquired by the Library of Congress prompted me to capriciously go to Twitter Search to see who was talking about it. As I drilled backwards I was surprised to see the search results went back to Fred Benenson’s original Tweet about the project.


That Tweet is from 4 years ago!

Up until recently you could only search back a couple of weeks, tops. The only sad thing is that the Twitter Search API still seems to have the two week window. I used my little twarc utility to drill back in the search results via the API and the earliest it was able to find for the same query was from 2013-02-18.

Hopefully the search window for the API will be opened up at some point, since it is at least theoretically possible now. If you happen to know any of the details about how the search functionality works I would be most grateful to hear from you.
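
For anyone curious what "drilling back" through search results looks like, here is a minimal sketch of the general technique, not twarc's actual code. It assumes the search endpoint accepts something like a max_id parameter that limits results to tweets at or below a given id; fetch, fake_fetch_factory and to_json_lines are hypothetical names, and the fake fetcher stands in for the real network call.

```python
import json
from typing import Callable, Iterator, List, Optional

Page = List[dict]


def page_back(fetch: Callable[[Optional[int]], Page]) -> Iterator[dict]:
    """Yield tweets newest-first, asking for older pages until one is empty."""
    max_id = None
    while True:
        page = fetch(max_id)
        if not page:
            return
        for tweet in page:
            yield tweet
        # Next request: only tweets strictly older than the oldest one seen.
        max_id = min(t["id"] for t in page) - 1


def fake_fetch_factory(tweets: Page, page_size: int = 2):
    """Hypothetical stand-in for the search API, backed by a list in memory."""
    ordered = sorted(tweets, key=lambda t: t["id"], reverse=True)

    def fetch(max_id: Optional[int]) -> Page:
        older = [t for t in ordered if max_id is None or t["id"] <= max_id]
        return older[:page_size]

    return fetch


def to_json_lines(tweets) -> str:
    """Serialize tweets as line-oriented JSON, one tweet per line."""
    return "".join(json.dumps(t, sort_keys=True) + "\n" for t in tweets)
```

The paging stops as soon as the service returns an empty page, which is exactly where a limited search window would show up.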

Oh, and of course, I had to request Emoji Dick from the stacks:

STATUS: Your request has been received.
REQUEST ID: 243106235
SEND TO: Adams Charge Station (LA 5244) - Staff
REQUEST RECEIVED: Mon Feb 25 12:56:19 EST 2013
TITLE: Emoji Dick ; or The Whale / by Herman Melville ; Edited and Compiled by Fred Benenson ; Translation by Amazon Mechanical Turk. 
AUTHOR: Melville, Herman, 1819-1891. 
CALL#: PS2384 .M6 2012

The one-time cataloger in me thinks that there was a missed opportunity to add a uniform title to the LC catalog record… But the statement of responsibility mentioning that it is a translation made by Amazon Mechanical Turk more than makes up for that!

Thanks Jay for letting me know what is going on at my own place of work.

brief note on Ernst

Although the traditional archive used to be a rather static memory, the notion of the archive in Internet communication tends to move the archive toward an economy of circulation: permanent transformations and updating. The so-called cyberspace is not primarily about memory as cultural record but rather about a performative form of memory as communication. Within this economy of permanent recycling of information, there is less need for emphatic but short-term, updatable memory, which comes close to the operative storage management in the von Neumann architecture of computing. Repositories are no longer final destinations but turn into frequently accessed sites. Archives become cybernetic systems. The aesthetics of fixed order is being replaced by permanent reconfigurability.

Wolfgang Ernst. “Archives in Transition.” Digital Memory and the Archive.

I was reading this and remembering Kevin Kelly’s idea of movage, and the idea of relay-supporting archives from Janée et al. I really like the way Ernst works this idea into the way the Internet works, and the ways that the Web transforms the archival function. I’m only halfway through the book, and will likely have more to say when I finish, so I’m just taking some notes for myself. Carry on…

genealogy of a braeburn

It has been observed that when systems break down we get to actually see how they operate. I wonder what this breakage below says about the use of Freebase and Wikipedia data in Google’s Knowledge Graph.

Yes, that’s an image of Braeburn from My Little Pony to the right, and text about the apple to the left. Interestingly it’s fine at Wikipedia:

And it’s not even there in Freebase (according to a search).

I don’t know if this reveals what’s going on in the flow of entities between Wikipedia, Freebase and Google. But I thought it was interesting. I wonder where to report such an anomaly. Is there a place?

Thanks to Jeff Godin in #code4lib for noticing the breakage in Knowledge Graph.

See also Hilary Mason’s post about how her identity got mixed up on Bing. (Thanks Chris).

Update: 2012-02-04

I thought to check a week later, and the Knowledge Graph results got even funnier. Now it’s a collage of apples and My Little Pony:


Aaron Swartz left us all a week ago. It’s strange, I only met Aaron once at the Internet Archive, and had a handful of conversations with him via email/irc … but not a day has passed since last Saturday that I haven’t thought about him, and his principled life.

I’ve been asked a few times why Aaron has been on my mind so much, and I’ve struggled to put it into words. Meanwhile, so many thoughtful things have been written about him. The arc of his life, his ideals, and abilities, charisma, and chutzpah, seem larger than life. And yet, he was just a person, a son, a friend, with people who loved him. It’s just heartbreaking.

I work as a software developer in libraryland, trying to bridge the world of information we’ve had with the world we are building on the Web. So for me, Aaron was a role model, a teacher whose lessons weren’t in textbooks or scholarly journals, but in his blog, in his code, in his talks, in his experiments with real world results. He was only 26 when he died, but he was, and remains, as Tim Berners-Lee paradoxically called him, a “wise elder”.

I wanted to write something here, but more than that I wanted to do something.

I noticed that Internet Archive created a collection devoted to online material related to Aaron, and thought I would try to collect together all the Twitter conversations that mention him. Twitter’s search is limited to the last week, so I quickly wrote a command line utility that pages through search results using their API, and writes out the complete data as line-oriented JSON. I also pulled in the tweets that mention #pdftribute since they were largely inspired by Aaron’s efforts in the open access space. I packaged up the data using BagIt and put it up at Internet Archive. Here’s the description from the bag-info.txt

On January 11, 2013 the Internet activist Aaron Swartz took his own life, and a great deal of grief, anger, and constructive thinking erupted on the Web and in Twitter. In particular the #pdftribute Twitter tag was born, in an attempt to raise awareness about Open Access issues, which Aaron did so much to further during his life.

This package contains Twitter JSON data for two Twitter search queries that were collected in the week following Aaron’s death:

  • “Aaron Swartz” OR aaronsw
  • #pdftribute

aaronsw.json.gz contains 630,397 tweets, for the period starting with 2013-01-11 16:50:22 and ending 2013-01-18 13:50:02.

pdftribute.json.gz contains 42,277 tweets, for the period starting with Jan 13 02:42:26 and ending Jan 17 03:33:46.

In addition, the URLs mentioned in the tweets found in aaronsw.json.gz were extracted, unshortened, and then aggregated to provide a report of what people linked to. These URLs are available in aaronsw-urls.txt.gz.

It is hoped that this data will help document the Web community’s response to Aaron’s death, and life.
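
The aggregation step described above can be sketched roughly as follows. This is not the script that produced the report, just an illustration under a couple of assumptions: each line is one tweet as JSON, and links live under entities.urls with an expanded_url (falling back to url). The unshortening step is left out because it needs network access, and count_urls is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def count_urls(lines: Iterable[str]) -> Counter:
    """Tally the URLs mentioned across tweets stored as line-oriented JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        tweet = json.loads(line)
        # Assumption: links are listed under entities.urls in each tweet.
        urls = (tweet.get("entities") or {}).get("urls") or []
        for entry in urls:
            url = entry.get("expanded_url") or entry.get("url")
            if url:
                counts[url] += 1
    return counts
```

With the real file you could pass it gzip.open("aaronsw.json.gz", "rt", encoding="utf-8") and then ask for counts.most_common(50).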

Below is a list of the top 50 links shared in tweets about Aaron. There were 36,506 in all.

Page Shares
RIP, Aaron Swartz - Boing Boing 11763
The Truth about Aaron Swartz’s “Crime” « Unhandled Exception 6641
Aaron Swartz commits suicide - The Tech 5539
Remove United States District Attorney Carmen Ortiz from office for overreach in the case of Aaron Swartz. 6478
Prosecutor as bully - Lessig Blog 3738
The inspiring heroism of Aaron Swartz | Glenn Greenwald | Comment is free | guardian.co.uk 2522
Aaron Swartz Faced A More Severe Prison Term Than Killers, Slave Dealers And Bank Robbers | ThinkProgress 2367
Farewell to Aaron Swartz, an Extraordinary Hacker and Activist - EFF 2042
Internet Activist, a Creator of RSS, Is Dead at 26, Apparently a Suicide - New York Times 1927
Aaron Swartz muere por suicidio a sus 26 años 1572
Technology’s Greatest Minds Say Goodbye to Aaron Swartz 1558
Aaron Swartz a través de 5 grandes contribuciones a la red 1495
Aaron Swartz, American hero 1397
Internet Activist Aaron Swartz Commits Suicide 1330
Anonymous hacks MIT after Aaron Swartz’s suicide | Internet & Media - CNET News 1327
danah boyd | apophenia » processing the loss of Aaron Swartz 1280
Official Statement from the family and partner of Aaron Swartz - Remember Aaron Swartz 1199
depression lies | WIL WHEATON dot NET: 2.0 1164
BBC News - Aaron Swartz, internet freedom activist, dies aged 26 1143
In the Wake of Aaron Swartz’s Death, Let’s Fix Draconian Computer Crime Law - EFF 1088
Westboro Baptist Church Drops Aaron Swartz Funeral Protest After Anonymous Vows Action (VIDEO) 1079
Soup • Official Statement from the Family and Partner of… 1067
‘Aaron was killed by the government’ - Robert Swartz on his son’s death — RT 1066
#PDFTribute list of documents 1044
Internet prodigy, activist Aaron Swartz commits suicide - CNN.com 1009
Remembering Aaron Swartz | The Nation 1003
If I get hit by a truck… 991
Suicide d’Aaron Swartz, activiste à l’origine du format RSS et de Creative Commons 938
Hacker, Activist Aaron Swartz Commits Suicide | ZDNet 896
Activism “How We Stopped SOPA” by Aaron Swartz (1986-2013) 896
Muere a los 26 años el ciberactivista Aaron Swartz | Tecnología | EL PAÍS 887
10 Awful Crimes That Get You Less Prison Time Than What Aaron Swartz Faced | Alternet 868
Aaron Swartz, Coder and Activist, Dead at 26 | Threat Level | Wired.com 856
How the Legal System Failed Aaron Swartz–and Us : The New Yorker 849
https://aaronsw.jottit.com/howtoget 811
How Anonymous Got Westboro to Back Off Aaron Swartz’s Funeral - National - The Atlantic Wire 804
Muerte de Aaron Swartz: la necesidad del Open Data en el I+D 779
US court drops charges on Aaron Swartz days after his suicide — RT 772
Researchers begin posting article PDFs to twitter in #pdftribute to Aaron Swartz « Neuroconscience 745
My Aaron Swartz, whom I loved. | Quinn Said 742
The inspiring heroism of Aaron Swartz | Glenn Greenwald | Comment is free | guardian.co.uk 713
Government formally drops charges against Aaron Swartz | Ars Technica 708
Aaron Swartz’s Politics « naked capitalism 704
CNN.com - Breaking News, U.S., World, Weather, Entertainment & Video News 690
After Aaron Swartz: The Tech World Must Talk About Depression 670
JSTOR liberator 663
Internet Activist Aaron Swartz Commits Suicide 661
Anonymous tumba las webs del MIT y DOJ como tributo a Aaron Swartz 652
Anonymous Hacks MIT, Leaves Farewell Message for Aaron Swartz 647

There were 209,839 Twitter users that mentioned Aaron on Twitter in the last week. I was one of them. I wish I could’ve done more to help.

Fielding notes

a tongue-in-cheek change request from Paul Downey

I’ve been doing a bit of research into the design of the Web for a paper I’m trying to write. In my travels I ran across Jon Udell’s 2006 interview with Roy Fielding. The interview is particularly interesting because of Roy’s telling of how (as a graduate student) he found himself working on libwww-perl which helped him discover the architecture of the Web that was largely documented by Tim Berners-Lee’s libwww HTTP library for Objective-C.

For the purposes of note taking, and giving some web spiders some text to index, here are a few moments that stood out:

Udell: A little later on [in Roy’s dissertation] you talk about how systems based on what you call control messages are in a very different category from systems where the decisions that get made are being made by human beings, and that that’s, in a sense, the ultimate rationale for designing data driven systems that are web-like, because people need to interact with them in lots of ways that you can’t declaratively define.

Fielding: Yeah, it’s a little bit easier to say that people need to reuse them, in various unanticipated ways. A lot of people think that when they are building an application that they are building something that’s going to last forever, and almost always that’s false. Usually when they are building an application the only thing that lasts forever is the data, at least if you’re lucky. If you’re lucky the data retains some semblance of archivability, or reusability over time.

Udell: There is a meme out there to the effect that what we now call REST architectural style was in a sense discovered post facto, as opposed to having been anticipated from the beginning. Do you agree with that or not?

Fielding: No, it’s a little bit of everything, in the sense that there are core principles involved that Berners-Lee was aware of when he was working on it. I first talked to Tim about what I was calling the HTTP Object Model at the time, which is a terrible name for it, but we talked when I was at the W3C in the summer of 95, about the software engineering principles. Being a graduate student of software engineering, that was my focus, and my interest originally. Of course all the stuff I was doing for the Web that was just for fun. At the time that was not considered research.

Udell: But did you at the time think of what you then called the HTTP object model as being in contrast to more API like and procedural approaches?

Fielding: Oh definitely. The reason for that was that the first thing I did for the Web was statistical analysis software, which turned out to be very effective at helping people understand the value of communicating over the Web. The second thing was a program called MOMSpider. It was one of the first Web spiders, a mechanism for testing all the links that were on the Web.

Udell: And that was when you also worked on libwww-perl?

Fielding: Right, and … at the time it was only the second protocol library available for the Web. It was a combination of pieces from various sources, as well as a lot of my own work, in terms of filling out the details, and providing an overall view of what a Web client should do with an HTTP library. And as a result of that design process I realized some of the things Tim Berners-Lee had designed into the system. And I also found a whole bunch of cases where the design didn’t make any sense, or the way it had been particularly implemented over at NCSA, or one of the other clients, or various history of the Web had turned out to be not-fitting with the rest of the design. So that led to a lot of discussions with the other early protocol developers particularly people like Rob McCool, Tony Sanders and Ari Luotonen–people who were building their own systems and understood both what they were doing with the Web, and also what complaints they were getting from their users. And from that I distilled a model of basically what was the core of HTTP. Because if you look back in the 93/94 time frame, the HTTP specification did not look all that similar to what it does now. It had a whole range of methods that were never used, and a lot of talk about various aspects of object orientation which never really applied to HTTP. And all of that came out of Tim’s original implementation of libwww, which was an Objective-C implementation that was trying to be as portable as possible. It had a lot of the good principles of interface separation and genericity inside the library, and really the same principles that I ended up using in the Perl library, although they were completely independently developed. It was just one of those things where that kind of interaction has a way of leading to a more extensible design.

Udell: So was focusing down on a smaller set of verbs partly driven by the experience of having people starting to use the Web, and starting to experience what URLs could be in a human context as well as in a programmatic context?

Fielding: Well, that was really a combination of things. One that’s a fairly common paradigm: if you are trying to inter-operate with people you’ve never met, try to keep it as simple as possible. There’s also just inherent in the notion of using URIs to identify everything, which is of course really the basis of what the Web is, provides you with that frame of mind where you have a common resource, and you want to have a common resource interface.

spotify vs rdio 2012

Back in August of 2011 I wrote a little utility that pulled down Alf Eaton’s Album of the Year data. AOTY is nice for two reasons: a) I like Alf’s taste in music, so the lists are relevant to me; and b) AOTY is a nice example of layering structured metadata into HTML, for easy processing (aka scraping). With the data in hand it was easy to check to see if the albums were available on the streaming services Spotify and Rdio using their respective APIs. I was trying to decide which one to use at the time, and wanted to know if there was any significant difference in their catalogs.

Back then, it looked like 32% of the albums were available on Spotify, and 46% on Rdio. Alf has updated his list for 2012 so I decided to rerun aotycmp, and it appears that coverage of both has improved, with Spotify (41%) closing the gap a bit on Rdio (49%), which still has a comfortable lead. If you want the availability data I’ve updated it on Github.
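
The coverage numbers are just the share of albums on the list that each service could find. A minimal sketch of that calculation, assuming the availability checks have already been done and can be expressed as a yes/no function per album (coverage and the sample catalogs below are hypothetical, not what aotycmp does internally):

```python
from typing import Callable, Sequence


def coverage(albums: Sequence[str], is_available: Callable[[str], bool]) -> float:
    """Return the percentage of albums for which is_available() is true."""
    if not albums:
        return 0.0
    hits = sum(1 for album in albums if is_available(album))
    return 100.0 * hits / len(albums)


# Hypothetical data standing in for the scraped list and the API lookups.
albums = ["album-a", "album-b", "album-c", "album-d"]
on_service_one = {"album-a"}
on_service_two = {"album-a", "album-c"}

service_one_pct = coverage(albums, lambda a: a in on_service_one)
service_two_pct = coverage(albums, lambda a: a in on_service_two)
```

Keeping the lookup as a function means the same calculation works whichever service's API sits behind it.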

I’ve been very happy with Rdio, although pieces like Damon Krukowski’s (thanks (???)) make me wish there was a better way to a) stream music while b) actually putting money in the artists’ pockets. I’d love to have the ability to pay a little bit more if I knew it was going to help support the artist in creating more of their art.

Darth Nader

This may be a bad/shortlived idea, but as part of a New Year’s resolution to write more varied material I’m going to try to use my blog (partly) as a dream journal. This will probably drive the few readers I have away, but I’m hoping it might provide some amusement. I barely remember my dreams these days, and would like to remember more of them, so here goes. Feel free to file under TMI.

Walking into a cafe/restaurant in the morning, in what feels like New York, but I’m not sure…it could be any city. It’s a cosy, narrow setup, with all the seats taken by people quietly chatting. I manage to get a cup of coffee to go, and stand waiting for a table to open up. I discover a staircase and vaguely remember that there is seating upstairs. I go up the stairs carefully balancing my wide bowl-like cup of coffee.

The upstairs area is quite large and sprawling, dimly lit, with comfortable chairs, wider tables, and in the middle is a life sized sculpture of a woman in motion, looking behind, while walking–who apparently is the owner of the establishment. A hostess shows me to a table nearby, and says she can’t remember the name of the server, but that someone would be with me shortly. I sit down with my coffee.

After just a few minutes I notice that it feels like evening. There are lots of conversations going on nearby, which I’m able to hear fairly easily. One man in his early 30s is standing at his table, and in a kind of spotlight. He is talking quietly, as if on stage, not obviously on a cell phone, about a meeting that he has just had, and how they will need to travel to Austin, Texas to help protect some geographic area. I can’t remember the exact details of what he was saying but it is clear he is working for an organization that is trying to save some ecosystem features in Austin.

There is a bookshelf nearby with a disembodied head on it, which looks like Ralph Nader, and also a bit like Darth Vader when Luke takes his helmet off at the end of Return of the Jedi. The head is animated, and seems to be simulating the other half of the conversation. He is saying that this is important work, and is similar to a recent project in Seattle. The conversation ends, and the man walks out of the coffee shop.

I notice three other people, with big, thick, Ginsbergian beards, also leave their tables at the same time, deep in conversation about something different. There is a counter-culture, occupy-like feeling in the air, of people steadily working to make their corner of the world a better place. It’s a good feeling.


Half awake I found myself thinking about the talking head, and how it reminded me of LibraryBox. It was as if the head made it possible to easily tune into public conversations that were going on in the local context of the coffee shop…and it served as an archive or store of these conversations for others to discover later. I don’t know if LibraryBox actually lets any of that happen, but it’s something I’ve been meaning to learn more about in the new year.

By the way, dream interpretations as comments are most welcome…

archiving tweets

If you are an active Twitter user you may have heard that you can now download your complete archive of tweets. The functionality is still being rolled out across the millions of accounts, so don’t be surprised if you don’t see the function yet in your settings.

The WSJ piece kind of joked about the importance of this move on Twitter’s part, which is a bit unfortunate, since it’s a pretty important issue. Yes, you can use third-party apps for downloading your Twitter data, but it says a lot when a company takes “archiving” seriously enough to offer it as a service to its users.

If you work in the digital preservation space it’s kind of fun to take a look at the way that Twitter makes these personal archives available. Luckily (if you don’t have the archive download button yet like me) Dave Winer has started collecting some archives, and making them publicly available for browsing and download off of S3. For example we can look at Sarah Bourne’s (who tipped me off to Dave’s work–thanks Sarah!). After you’ve downloaded the ZIP file you get a directory that looks like:

|-- css
|   `-- application.min.css
|-- data
|   |-- csv
|   |   |-- 2008_08.csv
|   |   |-- 2008_09.csv
|   |   |-- 2008_10.csv
|   |   |-- 2008_11.csv
|   |   |-- 2008_12.csv
|   |   |-- 2009_01.csv
|   |   |-- 2009_02.csv
|   |   |-- 2009_03.csv
|   |   |-- 2009_04.csv
|   |   |-- 2009_05.csv
|   |   |-- 2009_06.csv
|   |   |-- 2009_07.csv
|   |   |-- 2009_08.csv
|   |   |-- 2009_09.csv
|   |   |-- 2009_10.csv
|   |   |-- 2009_11.csv
|   |   |-- 2009_12.csv
|   |   |-- 2010_01.csv
|   |   |-- 2010_02.csv
|   |   |-- 2010_03.csv
|   |   |-- 2010_04.csv
|   |   |-- 2010_05.csv
|   |   |-- 2010_06.csv
|   |   |-- 2010_07.csv
|   |   |-- 2010_08.csv
|   |   |-- 2010_09.csv
|   |   |-- 2010_10.csv
|   |   |-- 2010_11.csv
|   |   |-- 2010_12.csv
|   |   |-- 2011_01.csv
|   |   |-- 2011_02.csv
|   |   |-- 2011_03.csv
|   |   |-- 2011_04.csv
|   |   |-- 2011_05.csv
|   |   |-- 2011_06.csv
|   |   |-- 2011_07.csv
|   |   |-- 2011_08.csv
|   |   |-- 2011_09.csv
|   |   |-- 2011_10.csv
|   |   |-- 2011_11.csv
|   |   |-- 2011_12.csv
|   |   |-- 2012_01.csv
|   |   |-- 2012_02.csv
|   |   |-- 2012_03.csv
|   |   |-- 2012_04.csv
|   |   |-- 2012_05.csv
|   |   |-- 2012_06.csv
|   |   |-- 2012_07.csv
|   |   |-- 2012_08.csv
|   |   |-- 2012_09.csv
|   |   |-- 2012_10.csv
|   |   |-- 2012_11.csv
|   |   `-- 2012_12.csv
|   `-- js
|       |-- payload_details.js
|       |-- tweet_index.js
|       |-- tweets
|       |   |-- 2008_08.js
|       |   |-- 2008_09.js
|       |   |-- 2008_10.js
|       |   |-- 2008_11.js
|       |   |-- 2008_12.js
|       |   |-- 2009_01.js
|       |   |-- 2009_02.js
|       |   |-- 2009_03.js
|       |   |-- 2009_04.js
|       |   |-- 2009_05.js
|       |   |-- 2009_06.js
|       |   |-- 2009_07.js
|       |   |-- 2009_08.js
|       |   |-- 2009_09.js
|       |   |-- 2009_10.js
|       |   |-- 2009_11.js
|       |   |-- 2009_12.js
|       |   |-- 2010_01.js
|       |   |-- 2010_02.js
|       |   |-- 2010_03.js
|       |   |-- 2010_04.js
|       |   |-- 2010_05.js
|       |   |-- 2010_06.js
|       |   |-- 2010_07.js
|       |   |-- 2010_08.js
|       |   |-- 2010_09.js
|       |   |-- 2010_10.js
|       |   |-- 2010_11.js
|       |   |-- 2010_12.js
|       |   |-- 2011_01.js
|       |   |-- 2011_02.js
|       |   |-- 2011_03.js
|       |   |-- 2011_04.js
|       |   |-- 2011_05.js
|       |   |-- 2011_06.js
|       |   |-- 2011_07.js
|       |   |-- 2011_08.js
|       |   |-- 2011_09.js
|       |   |-- 2011_10.js
|       |   |-- 2011_11.js
|       |   |-- 2011_12.js
|       |   |-- 2012_01.js
|       |   |-- 2012_02.js
|       |   |-- 2012_03.js
|       |   |-- 2012_04.js
|       |   |-- 2012_05.js
|       |   |-- 2012_06.js
|       |   |-- 2012_07.js
|       |   |-- 2012_08.js
|       |   |-- 2012_09.js
|       |   |-- 2012_10.js
|       |   |-- 2012_11.js
|       |   `-- 2012_12.js
|       `-- user_details.js
|-- img
|   |-- bg.png
|   `-- sprite.png
|-- index.html
|-- js
|   `-- application.min.js
|-- lib
|   |-- bootstrap
|   |   |-- bootstrap-dropdown.js
|   |   |-- bootstrap.min.css
|   |   |-- bootstrap-modal.js
|   |   |-- bootstrap-tooltip.js
|   |   |-- bootstrap-transition.js
|   |   |-- glyphicons-halflings.png
|   |   `-- glyphicons-halflings-white.png
|   |-- hogan
|   |   `-- hogan-2.0.0.min.js
|   |-- jquery
|   |   `-- jquery-1.8.3.min.js
|   |-- twt
|   |   |-- sprite.png
|   |   |-- sprite.rtl.png
|   |   |-- twt.all.min.js
|   |   `-- twt.min.css
|   `-- underscore
|       `-- underscore-min.js
`-- README.txt

So why is this interesting?

The Data

The archive includes data both as CSV and as JavaScript. The CSV is perfect for throwing into a spreadsheet, and doing stuff with it there. The JavaScript is actually a very light shim over some JSON data that is quite a bit richer than the CSV. The JavaScript shim is needed so that it can be used by the app that comes in the archive (more on that later). For example here’s a randomly picked tweet from Sarah:


Here is how the Tweet shows up in the CSV:

"281405942321532929","281400879465238529","61233","","","2012-12-19 14:29:39 +0000","Janetter","@monkchips Ouch. Some regrets are harsher than others."

And here’s the archived JSON for the Tweet:

{
  "source" : "Janetter",
  "entities" : {
    "user_mentions" : [ {
      "name" : "James Governor",
      "screen_name" : "monkchips",
      "indices" : [ 0, 10 ],
      "id_str" : "61233",
      "id" : 61233
    } ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "in_reply_to_status_id_str" : "281400879465238529",
  "geo" : {
  },
  "id_str" : "281405942321532929",
  "in_reply_to_user_id" : 61233,
  "text" : "@monkchips Ouch. Some regrets are harsher than others.",
  "id" : 281405942321532929,
  "in_reply_to_status_id" : 281400879465238529,
  "created_at" : "Wed Dec 19 14:29:39 +0000 2012",
  "in_reply_to_screen_name" : "monkchips",
  "in_reply_to_user_id_str" : "61233",
  "user" : {
    "name" : "Sarah Bourne",
    "screen_name" : "sarahebourne",
    "protected" : false,
    "id_str" : "16010789",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "id" : 16010789,
    "verified" : false
  }
}

So there’s quite a bit more structured data in the archived JSON, including geo coordinates, hashtags, URLs mentioned, etc. Also, the avatar images are still referenced out on the Web, where they can change, disappear, etc. It’s also interesting to compare the archived JSON against what you get back from the Twitter API for the same Tweet:

{
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "default_profile_image": false,
    "id": 16010789,
    "verified": false,
    "profile_text_color": "080C0C",
    "profile_image_url_https": "https://si0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "profile_sidebar_fill_color": "FCFAEF",
    "entities": {
      "url": {
        "urls": [
          {
            "url": "http://www.linkedin.com/in/sarahbourne",
            "indices": [
              0,
              38
            ],
            "expanded_url": null
          }
        ]
      },
      "description": {
        "urls": []
      }
    },
    "followers_count": 2367,
    "profile_sidebar_border_color": "FFFFFF",
    "id_str": "16010789",
    "profile_background_color": "DAE0D9",
    "listed_count": 331,
    "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg",
    "utc_offset": -18000,
    "statuses_count": 20090,
    "description": "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
    "friends_count": 784,
    "location": "Boston, MA, USA",
    "profile_link_color": "800326",
    "profile_image_url": "http://a0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "https://si0.twimg.com/profile_banners/16010789/1348096060",
    "profile_background_image_url": "http://a0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg",
    "screen_name": "sarahebourne",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 3147,
    "name": "Sarah Bourne",
    "notifications": null,
    "url": "http://www.linkedin.com/in/sarahbourne",
    "created_at": "Wed Aug 27 12:24:25 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "favorited": false,
  "entities": {
    "user_mentions": [
      {
        "id": 61233,
        "indices": [
          0,
          10
        ],
        "id_str": "61233",
        "screen_name": "monkchips",
        "name": "James Governor"
      }
    ],
    "hashtags": [],
    "urls": []
  },
  "contributors": null,
  "truncated": false,
  "text": "@monkchips Ouch. Some regrets are harsher than others.",
  "created_at": "Wed Dec 19 14:29:39 +0000 2012",
  "retweeted": false,
  "in_reply_to_status_id_str": "281400879465238529",
  "coordinates": null,
  "in_reply_to_user_id_str": "61233",
  "source": "Janetter",
  "in_reply_to_status_id": 281400879465238529,
  "in_reply_to_screen_name": "monkchips",
  "id_str": "281405942321532929",
  "place": null,
  "retweet_count": 0,
  "geo": null,
  "id": 281405942321532929,
  "in_reply_to_user_id": 61233
}

Using json-diff it’s not too difficult to see what the differences are between the archived version and the API version:

 {
+  favorited: false
+  contributors: null
+  truncated: false
+  retweeted: false
+  coordinates: null
+  place: null
+  retweet_count: 0
   entities: {
-    media: [
-    ]
   }
-  geo: {
-  }
+  geo: null
   user: {
+    follow_request_sent: false
+    profile_use_background_image: true
+    default_profile_image: false
+    profile_text_color: "080C0C"
+    profile_sidebar_fill_color: "FCFAEF"
+    entities: {
+      url: {
+        urls: [
+          {
+            url: "http://www.linkedin.com/in/sarahbourne"
+            indices: [
+              0
+              38
+            ]
+            expanded_url: null
+          }
+        ]
+      }
+      description: {
+        urls: [
+        ]
+      }
+    }
+    followers_count: 2367
+    profile_sidebar_border_color: "FFFFFF"
+    profile_background_color: "DAE0D9"
+    listed_count: 331
+    profile_background_image_url_https: "https://si0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg"
+    utc_offset: -18000
+    statuses_count: 20090
+    description: "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests."
+    friends_count: 784
+    location: "Boston, MA, USA"
+    profile_link_color: "800326"
+    profile_image_url: "http://a0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg"
+    following: true
+    geo_enabled: false
+    profile_banner_url: "https://si0.twimg.com/profile_banners/16010789/1348096060"
+    profile_background_image_url: "http://a0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg"
+    lang: "en"
+    profile_background_tile: true
+    favourites_count: 3147
+    notifications: null
+    url: "http://www.linkedin.com/in/sarahbourne"
+    created_at: "Wed Aug 27 12:24:25 +0000 2008"
+    contributors_enabled: false
+    time_zone: "Eastern Time (US & Canada)"
+    default_profile: false
+    is_translator: false
   }
 }
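
If you don’t have json-diff handy, the same kind of comparison can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how json-diff itself works: it walks two parsed JSON objects and reports which keys were added (+), removed (-) or changed (~). The function name diff_keys and the "~" marker are my own inventions for illustration.

```python
from typing import List, Tuple


def diff_keys(old: dict, new: dict, prefix: str = "") -> List[Tuple[str, str]]:
    """List (marker, dotted.path) pairs describing how new differs from old."""
    changes = []
    for key in sorted(set(old) | set(new)):
        path = prefix + key
        if key not in old:
            changes.append(("+", path))      # only in the new object
        elif key not in new:
            changes.append(("-", path))      # only in the old object
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are objects: recurse and report nested paths.
            changes.extend(diff_keys(old[key], new[key], path + "."))
        elif old[key] != new[key]:
            changes.append(("~", path))      # present in both, value differs
    return changes


# Tiny stand-ins for the archived and API versions of a tweet.
archived = {"text": "x", "geo": {}, "entities": {"media": [], "urls": []}}
from_api = {"text": "x", "geo": None, "entities": {"urls": []}, "favorited": False}
report = diff_keys(archived, from_api)
```

Lists are compared as whole values here, which is cruder than a real diff tool but enough to spot missing fields.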

To be fair some of the user profile information has been normalized in the archive (perhaps to save space for the viewing application) out to a user_details.js file, which looks like:

{
  "screen_name" : "sarahebourne",
  "location" : "Boston, MA, USA",
  "full_name" : "Sarah Bourne",
  "bio" : "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
  "id" : "16010789",
  "created_at" : "Wed Aug 27 12:24:25 +0000 2008"
}

Notably missing from this is a homepage for the user, their number of favourites, their number of friends, followers, whether geo is enabled, etc.

All these details aside, Twitter deserves a lot of credit for making the data available as CSV for ease of use, and also as JavaScript for programmatic use.

The Code

So the really, really neat thing about the archive is that it comes with a pure HTML, CSS and JavaScript application that you can open locally in your browser and view your archive. It looks pretty; for example, here is Sarah’s archive that Dave Winer mounted up on S3. It even has a keyword search across all your tweets, which takes a bit of time (it interactively loads all your tweet JavaScript files mentioned above), but it works. You can zip the data up, give it to someone else, and it all just works.

The archive uses some third party libraries such as jQuery, Underscore, Twitter Bootstrap and Hogan, which all come minified and bundled statically in the archive. The application itself is called Grailbird and comes minified as well. Grailbird loads the static JavaScript (as needed) and displays it. The only network traffic I saw while it was running was fetching avatar images.

Assuming JavaScript backwards compatibility, and browser support for JavaScript, the Twitter archive’s contextual display for the underlying data could last a long, long time. At least that’s a possible interpretation based on David Rosenthal’s hypothesis about the Web’s effect on format obsolescence. I think it’s safe to say that this app written for the local Web platform is likely to last longer than a GUI application written in another language environment. The separation of code and data, and independence from a particular browser implementation are big wins. These are qualities that we all had to fight and work hard for on the Web, and I think it makes sense to re-purpose them here in an archival context.

I doubt anyone from Twitter has read this far, but if someone has, it would be great to see Grailbird show up with the other great stuff you have released to Github. I found myself wanting to quickly search across tweets looking for things, like geo-enabled tweets (to make sure that they are there). I could look at the minified Grailbird source in Chrome using developer tools, but it wasn’t good enough for me to figure out how to dynamically load data. I resorted to using NodeJS, and evaling the JavaScript files…and was able to confirm that there is geo data in the archives if you have it enabled. Here’s the simplistic script I came up with:

var fs = require('fs');

// the archive's JavaScript files assign into a global Grailbird.data object
var Grailbird = {data: {}};

// load all the tweet data: the index lists one JavaScript file per month
eval(fs.readFileSync("data/js/tweet_index.js", "utf8"));
for (var i = 0; i < tweet_index.length; i++) {
  eval(fs.readFileSync(tweet_index[i].file_name, "utf8"));
}

// look at each tweet and print out the date and geolocation if it's there
for (var slice in Grailbird.data) {
  for (var j = 0; j < Grailbird.data[slice].length; j++) {
    var tweet = Grailbird.data[slice][j];
    // geo can be missing or empty, so check it before reading coordinates
    if (tweet.geo && tweet.geo.coordinates) {
      console.log(tweet.created_at + "," + tweet.geo.coordinates.join(","));
    }
  }
}

And here is the output for Jeremy Keith’s archive:

% node geo.js
Fri Nov 30 13:08:33 +0000 2012,50.8262027605,-0.138112306595
Sat Nov 17 12:09:18 +0000 2012,54.6000387923,-5.9254288673
Fri Nov 16 22:32:03 +0000 2012,54.5925614526,-5.930852294
Thu Nov 15 13:35:35 +0000 2012,54.595909,-5.922033
Sat Nov 10 12:59:37 +0000 2012,50.825832,-0.142381
Fri Nov 09 13:54:51 +0000 2012,50.8262027605,-0.1381123066
Wed Nov 07 18:07:24 +0000 2012,50.825977,-0.138339
Tue Nov 06 16:58:49 +0000 2012,50.8378257671,-1.1800042739
Tue Oct 30 11:19:53 +0000 2012,50.8262027605,-0.1381123066
Thu Oct 18 17:51:22 +0000 2012,43.0733634985,-89.38608062
Tue Oct 16 17:29:20 +0000 2012,43.0872606735,-89.3659955263
Tue Oct 09 18:11:20 +0000 2012,40.7406891129,-74.0076184273
Sun Oct 07 14:27:50 +0000 2012,50.82906975,-0.126056
Sat Oct 06 16:29:30 +0000 2012,50.825832,-0.142381
Thu Oct 04 16:46:56 +0000 2012,50.8262027605,-0.1381123066
Tue Oct 02 17:46:42 +0000 2012,50.826646,-0.136921
Mon Oct 01 10:46:04 +0000 2012,50.8262027605,-0.1381123066
Mon Oct 01 10:43:46 +0000 2012,50.8262027605,-0.1381123066
Mon Oct 01 09:38:01 +0000 2012,50.8236703111,-0.1387184062
Mon Oct 01 08:53:15 +0000 2012,50.8236703111,-0.1387184062
Thu Sep 27 13:05:16 +0000 2012,59.915652,10.749959
Sun Sep 23 12:54:16 +0000 2012,50.8281663943,-0.128531456
Sat Sep 22 13:44:09 +0000 2012,50.87447886,0.017625
Thu Sep 20 13:16:11 +0000 2012,50.8262027605,-0.1381123066
Thu Sep 20 09:27:55 +0000 2012,50.8262027605,-0.1381123066
Mon Sep 17 07:51:20 +0000 2012,47.9952739036,7.8525775405
Sun Sep 16 09:01:28 +0000 2012,51.1599172667,-0.1787844393
Thu Sep 13 12:40:26 +0000 2012,50.822951,-0.136905
Tue Sep 11 18:41:47 +0000 2012,50.822746,-0.142274
Tue Sep 11 17:19:38 +0000 2012,50.822219,-0.140802
Tue Sep 11 13:05:59 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 13:03:35 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 12:48:51 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 12:06:36 +0000 2012,50.8262027605,-0.1381123066
Tue Sep 11 08:23:00 +0000 2012,50.8262027605,-0.1381123066
Sun Sep 09 19:10:21 +0000 2012,50.826646,-0.136921
Tue Sep 04 17:33:44 +0000 2012,50.826646,-0.136921
Tue Sep 04 12:57:16 +0000 2012,50.822951,-0.136905
Mon Sep 03 16:03:37 +0000 2012,50.8262027605,-0.1381123066
Mon Sep 03 15:26:41 +0000 2012,50.8262027605,-0.1381123066
Sun Sep 02 19:40:38 +0000 2012,50.8229428584,-0.1390289018
Sun Sep 02 19:24:45 +0000 2012,50.8229428584,-0.1390289018
Sun Sep 02 19:08:55 +0000 2012,50.825977,-0.138339
Sun Sep 02 18:25:08 +0000 2012,50.825449,-0.137123
Sun Sep 02 17:04:15 +0000 2012,50.825449,-0.137123
Sun Sep 02 15:34:31 +0000 2012,50.8229428584,-0.1390289018
Fri Aug 31 17:33:20 +0000 2012,50.8291396274,-0.133923449
Fri Aug 31 09:20:04 +0000 2012,50.8311581116,-0.1335176435
Tue Aug 28 20:44:32 +0000 2012,41.8844650304,-87.6257600109
Mon Aug 27 13:57:24 +0000 2012,41.8844650304,-87.6257600109
Sat Aug 25 18:45:51 +0000 2012,41.8851594291,-87.6232355833
Wed Aug 22 12:32:45 +0000 2012,50.824415,-0.134691
Tue Aug 21 11:39:46 +0000 2012,50.8262027605,-0.1381123066
Mon Aug 20 11:01:28 +0000 2012,51.535132,-0.069309
Fri Aug 17 12:03:40 +0000 2012,50.8262027605,-0.1381123066
Sat Aug 11 16:08:13 +0000 2012,50.826646,-0.136921
Fri Aug 10 14:25:15 +0000 2012,50.8262027605,-0.1381123066
Wed Aug 08 11:51:45 +0000 2012,50.8262027605,-0.1381123066
Tue Aug 07 15:45:49 +0000 2012,50.8262027605,-0.1381123066
Fri Aug 03 16:38:55 +0000 2012,50.8262027605,-0.1381123066
Fri Aug 03 14:33:04 +0000 2012,50.8262027605,-0.1381123066
Sat Jul 28 14:57:52 +0000 2012,50.825449,-0.137123
Sat Jul 28 12:09:01 +0000 2012,50.828404,-0.137435
Thu Jul 26 17:17:22 +0000 2012,50.8266230357,-0.1367429505
Tue Jul 24 15:07:39 +0000 2012,50.8262027605,-0.1381123066
Mon Jul 23 12:25:35 +0000 2012,50.823104,-0.139515
Sat Jul 21 12:46:25 +0000 2012,50.827943,-0.136033
Fri Jul 20 13:21:41 +0000 2012,50.8262027605,-0.1381123066
Mon Jul 16 19:28:01 +0000 2012,50.825449,-0.137123
Sun Jul 15 10:48:44 +0000 2012,51.4714930776,-0.4883337021
Sat Jul 14 23:08:27 +0000 2012,41.974037,-87.890239
Tue Jul 10 13:44:08 +0000 2012,30.2655234842,-97.7385378752
Mon Jul 09 19:32:48 +0000 2012,30.2655234842,-97.7385378752
Mon Jul 09 14:40:21 +0000 2012,30.2656095537,-97.7385592461
Sat Jul 07 15:08:12 +0000 2012,51.4726745412,-0.4817537462
Fri Jun 29 10:55:03 +0000 2012,50.8262027605,-0.1381123066
Wed Jun 20 10:23:29 +0000 2012,51.488197,-0.120692
Mon Jun 18 12:12:01 +0000 2012,50.8262027605,-0.1381123066
Mon Jun 18 12:02:43 +0000 2012,50.8262027605,-0.1381123066
Sat Jun 16 15:51:15 +0000 2012,50.8244773427,-0.1387893509
Sat Jun 16 15:10:29 +0000 2012,50.827972412,-0.136271402
Fri Jun 15 22:15:44 +0000 2012,50.947306,0.090209
Fri Jun 15 12:58:27 +0000 2012,50.947306,0.090209
Wed Jun 13 12:12:49 +0000 2012,50.822951,-0.136905
Mon Jun 11 14:05:50 +0000 2012,50.825977,-0.138339
Wed Jun 06 16:31:48 +0000 2012,51.50361668,-0.683839
Wed Jun 06 15:38:45 +0000 2012,51.50361668,-0.683839
Sat Jun 02 15:40:48 +0000 2012,50.825449,-0.137123
Fri Jun 01 13:29:40 +0000 2012,50.8262027605,-0.1381123066
Thu May 31 16:37:18 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 14:58:46 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 12:45:33 +0000 2012,50.8262027605,-0.1381123066
Wed May 30 12:32:27 +0000 2012,50.8262027605,-0.1381123066
Tue May 29 12:12:15 +0000 2012,50.8242644595,-0.1329624653
Tue May 29 08:12:24 +0000 2012,50.8307708894,-0.1330473622
Sun May 27 21:06:57 +0000 2012,47.5608179303,-52.70936785
Mon May 21 19:15:05 +0000 2012,50.824975,3.26387
Mon May 21 13:56:02 +0000 2012,51.0541040608,3.7238935404
Mon May 21 12:19:17 +0000 2012,51.055163,3.720835
Sat May 19 15:52:22 +0000 2012,50.821309,-0.1434404
Sat May 19 14:19:38 +0000 2012,50.822215,-0.154896
Sun May 13 14:08:33 +0000 2012,50.8244462443,-0.139321602
Sun May 13 13:29:30 +0000 2012,50.8192217888,-0.1411056519
Sat May 12 19:32:13 +0000 2012,50.820359,-0.14243
Sat May 12 17:51:57 +0000 2012,50.822623,-0.142676
Fri May 11 09:22:05 +0000 2012,52.366239,4.894655
Tue May 08 12:39:36 +0000 2012,50.8287188784,-0.1423922896
Sun May 06 20:38:27 +0000 2012,50.871762,0.011501
Fri May 04 14:35:37 +0000 2012,50.8262027605,-0.1381123066
Thu May 03 16:03:52 +0000 2012,50.8262027605,-0.1381123066
Thu May 03 12:05:08 +0000 2012,50.8242644595,-0.1329624653
Wed May 02 12:43:38 +0000 2012,50.8262027605,-0.1381123066
Tue May 01 14:50:47 +0000 2012,50.8244094849,-0.1399479955
Tue May 01 13:17:36 +0000 2012,50.8262027605,-0.1381123066
Tue May 01 12:01:59 +0000 2012,50.826779,-0.138462
Tue May 01 11:22:41 +0000 2012,50.8262027605,-0.1381123066
Mon Apr 30 15:58:14 +0000 2012,50.8262027605,-0.1381123066
Fri Apr 27 17:26:19 +0000 2012,50.825449,-0.137123
Thu Apr 26 12:44:54 +0000 2012,50.8262027605,-0.1381123066
Tue Apr 24 11:30:25 +0000 2012,50.8262027605,-0.1381123066
Sat Apr 21 14:37:59 +0000 2012,50.8244773427,-0.1387893509
Wed Apr 18 11:05:28 +0000 2012,51.514461,-0.15415
Tue Apr 17 11:38:39 +0000 2012,50.8262027605,-0.1381123066
Mon Apr 16 17:28:09 +0000 2012,50.825449,-0.137123
Fri Apr 13 17:35:30 +0000 2012,50.825449,-0.137123
Fri Apr 13 11:39:01 +0000 2012,50.8262027605,-0.1381123066
Thu Apr 12 20:59:46 +0000 2012,50.8284865994,-0.1406764984
Thu Apr 12 20:43:24 +0000 2012,50.8284865994,-0.1406764984
Thu Apr 12 12:38:06 +0000 2012,50.8262027605,-0.1381123066
Wed Apr 04 17:35:46 +0000 2012,50.829236,-0.130433
Wed Apr 04 11:20:06 +0000 2012,50.8262027605,-0.1381123066
Wed Mar 28 19:51:57 +0000 2012,50.82533,-0.1371919
Wed Mar 28 17:41:06 +0000 2012,50.8266230357,-0.1367429505
Sat Mar 24 15:24:22 +0000 2012,50.82578,-0.139591
Sat Mar 24 14:42:14 +0000 2012,50.8244773427,-0.1387893509
Thu Mar 22 20:33:36 +0000 2012,50.821049,-0.140416
Thu Mar 15 16:00:20 +0000 2012,32.8975517297,-97.0442533493
Wed Mar 14 15:41:13 +0000 2012,30.265426,-97.740498
Tue Mar 13 19:52:43 +0000 2012,30.2647199679,-97.7443528175
Tue Mar 13 16:29:12 +0000 2012,30.2653850259,-97.7383099888
Mon Mar 12 02:03:53 +0000 2012,30.2669212002,-97.745683415
Sun Mar 11 17:45:31 +0000 2012,30.2626071693,-97.739803791
Sun Mar 11 15:18:53 +0000 2012,30.2647199679,-97.7443528175
Fri Mar 09 15:11:51 +0000 2012,30.2671521557,-97.7396624407
Mon Mar 05 10:56:37 +0000 2012,50.8262027605,-0.1381123066
Thu Mar 01 09:55:16 +0000 2012,50.8304057758,-0.1329698575
Wed Feb 22 23:56:59 +0000 2012,-33.8782765912,151.221249511
Wed Feb 22 02:00:43 +0000 2012,-41.328228677,174.809947014
Thu Feb 16 01:13:27 +0000 2012,-41.2890508786,174.777774995
Wed Feb 15 21:39:06 +0000 2012,-41.2893031956,174.777374268
Wed Feb 15 18:50:42 +0000 2012,-41.2893031956,174.777374268
Wed Feb 15 02:10:18 +0000 2012,-41.29336192,174.776485
Mon Feb 13 04:07:07 +0000 2012,-41.2893031956,174.777374268
Mon Feb 13 03:36:49 +0000 2012,-41.2924914456,174.776140451
Mon Feb 13 03:00:13 +0000 2012,-41.293314,174.776395
Mon Feb 13 02:40:18 +0000 2012,-41.2934345895,174.775958061
Mon Feb 13 01:22:04 +0000 2012,-41.2939726591,174.775840044
Sat Feb 11 23:39:04 +0000 2012,-36.405247,174.65600431
Sat Feb 11 07:32:16 +0000 2012,-36.405247,174.65600431
Sat Feb 11 06:49:42 +0000 2012,-36.405247,174.65600431
Wed Feb 08 23:20:25 +0000 2012,-33.878302,151.221256
Sat Feb 04 11:14:52 +0000 2012,50.828205,-0.1378011703
Thu Feb 02 13:41:42 +0000 2012,50.8262027605,-0.1381123066
Wed Feb 01 16:57:16 +0000 2012,50.8262027605,-0.1381123066
Sat Jan 28 16:57:35 +0000 2012,50.827062,-0.135349
Sat Jan 28 15:55:49 +0000 2012,50.828295,-0.138769
Thu Jan 26 12:42:08 +0000 2012,50.8262027605,-0.1381123066
Mon Jan 23 12:34:45 +0000 2012,50.822219,-0.140802
Sun Jan 22 15:18:32 +0000 2012,50.825832,-0.142381
Sat Jan 21 14:27:51 +0000 2012,50.8213,-0.1409
Fri Jan 20 12:45:34 +0000 2012,51.9479484763,-0.5020558834
Thu Jan 19 20:49:09 +0000 2012,52.9556027724,-1.1504852772
Thu Jan 19 12:38:47 +0000 2012,52.954584773,-1.1563324928
Wed Jan 18 16:42:24 +0000 2012,52.954584773,-1.1563324928
Wed Jan 18 16:39:09 +0000 2012,52.954584773,-1.1563324928
Tue Jan 17 15:00:09 +0000 2012,50.8262027605,-0.1381123066
Mon Jan 16 10:03:12 +0000 2012,50.8303548561,-0.1329055827
Sat Jan 14 16:11:55 +0000 2012,50.824838842,-0.1516896486
Wed Jan 11 21:07:19 +0000 2012,51.522789913,-0.0784921646
Wed Jan 11 19:27:24 +0000 2012,51.5237223711,-0.0770612686
Sat Jan 07 14:49:09 +0000 2012,50.824424,-0.138875
Fri Apr 09 01:52:12 +0000 2010,47.4412234282,-122.3010026978
Fri Apr 09 00:00:15 +0000 2010,47.4432422071,-122.3010595342
Thu Apr 08 01:29:11 +0000 2010,47.6873506139,-122.3341637453
Wed Apr 07 00:16:03 +0000 2010,47.6109922102,-122.3480262842
Sun Apr 04 18:47:33 +0000 2010,47.7083958758,-122.3272574643
Sat Apr 03 18:06:54 +0000 2010,47.6687063559,-122.3942997359
Sat Apr 03 18:05:00 +0000 2010,47.6687063559,-122.3942997359

I guess it’s kind of scary that you can do this, and is perhaps why Twitter doesn’t let you export anyone’s account, even if it is public. But returning to the issue of Grailbird being on Github, I imagine there would be people that would write code that uses Grailbird as an API to the archive data, to provide extensions that would display a map of where you’ve been over time for example, or an analysis of your friendship network, or a view on hashtags you’ve used, events you’ve been at etc.
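As a rough sketch of the kind of extension I have in mind, the coordinates that the script above prints could be reshaped into GeoJSON, which most web mapping libraries can display. The tweetsToGeoJSON function and the second sample tweet are made up for illustration; the only assumption about the archive is that tweet.geo.coordinates holds latitude then longitude, which is what the output above suggests.

```javascript
// Hypothetical helper: turn tweets that carry coordinates into a GeoJSON
// FeatureCollection that a mapping library could display.
function tweetsToGeoJSON(tweets) {
  var features = [];
  for (var i = 0; i < tweets.length; i++) {
    var tweet = tweets[i];
    // skip tweets without a geolocation
    if (!tweet.geo || !tweet.geo.coordinates) continue;
    var lat = tweet.geo.coordinates[0];
    var lon = tweet.geo.coordinates[1];
    features.push({
      type: "Feature",
      // GeoJSON positions are [longitude, latitude], the reverse of the
      // order assumed for the archive
      geometry: {type: "Point", coordinates: [lon, lat]},
      properties: {created_at: tweet.created_at}
    });
  }
  return {type: "FeatureCollection", features: features};
}

// Sample input: the first tweet reuses a line of the output above,
// the second is invented to show a tweet with no location.
var sample = [
  {created_at: "Fri Nov 30 13:08:33 +0000 2012",
   geo: {coordinates: [50.8262027605, -0.138112306595]}},
  {created_at: "Thu Nov 29 10:00:00 +0000 2012", geo: {}}
];

console.log(JSON.stringify(tweetsToGeoJSON(sample), null, 2));
```

In practice the sample array would be replaced by the tweets loaded from Grailbird.data, and the resulting JSON could be saved next to the archive for a map page to pick up.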

I think from an archival perspective, it would be really useful to be able to receive something like a Tweet archive from a donor, and overlay functionality on top of it. The model of using the Web as a local application platform for this sort of archival content seems like it could be a growth area.

Inside Out Libraries

Peter Brantley tells a sad tale about where public library leadership is at, as we plunge headlong into the ebook future, that has been talked about for what seems like forever, and which is now upon us. It’s not pretty.

The general consensus among participants was that public libraries have two, maybe three years to establish their relevance in the digital realm, or risk fading from the central place they have long occupied in the world’s literary culture.

The fact that a bunch of big-wigs invited by IFLA were seemingly unable to find inspiration and reason to hope that public libraries will continue to exist is not surprising in the least I guess. I’m not sure that libraries were ever the center of the world’s literary culture. But for the sake of argument let’s assume they were, and that now they’re increasingly not. Let us also assume that the economic landscape around ebooks is in incredible turmoil, and that there will continue to be sea changes in technologies, and people’s use of them in this area for the foreseeable future.

What can libraries do to stay relevant? I think part of the answer is: stop being libraries…well, sorta.

The HyperLocal

The most serious threat facing libraries does not come from publishers, we argued, but from e-book and digital media retailers like Amazon, Apple, and Google. While some IFLA staff protested that libraries are not in the business of competing with such companies, the library representatives stressed that they are. If public libraries can’t be better than Google or Amazon at something, then libraries will lose their relevance.

In my mind the thing that libraries have to offer, which these big corporations cannot, is authentic, local context for information about a community’s past, present and future. But in the past century or so libraries have focused on collecting mass produced objects, and sharing data about said objects. The mission of collecting hyper-local information has typically been a side task, that has fallen to special collections and archives. If I were invited to that IFLA meeting I would’ve said that libraries need to shift their orientation to caring more about the practices of archives and manuscript collections, by collecting unique, valued, at risk local materials, and adapting collection development and descriptive practices to the realities of more and more of this information being available as data.

As Mark Matienzo indicated (somewhat indirectly in Twitter) after I published this blog post, a lot of this work involves focusing less on hoarding items like books, and focusing more on the functions, services, and actions that public libraries want to document and engage with in their communities. Traditionally this orientation has been a strength area for archivists in their practice and theory of appraisal where:

… considerations … include how to meet the record-granting body’s organizational needs, how to uphold requirements of organizational accountability (be they legal, institutional, or determined by archival ethics), and how to meet the expectations of the record-using community. Wikipedia

I think this represents a pretty significant cognitive shift for library professionals, and would in fact take some doing. But perhaps that’s just because my exposure to archival theory in “library school” was pretty pathetic. Be that as it may, here are some practical examples of growth areas for public libraries that I wish had come up at the IFLA meeting.

Web Archiving

The Internet Archive and national libraries that are part of the International Internet Preservation Consortium don’t have the time, resources and, often, the mandate to collect web content that is of interest at the local level. What if the tooling and expertise existed for public libraries to perform some of this work, and to have the results fed into larger aggregations of web archives?

Municipality Reports and Data

Increasing amounts of data are being collected as part of the daily working of our local governments. What if your public library had the resources to be a repository for this data? Yeah, I said the R word. But I’m not suggesting that public libraries get the expertise to set up Fedora instances with Hydra heads, or something. I’m thinking about approaches to allowing data to easily flow into an organization, where it is backed up, and made available in a clearinghouse manner similar to public.resource.org on the Web, for search engines to pick up. Perhaps even services like LibraryBox offer another lens to look at the opportunities that lie in this area.

Born Digital Manuscript Collections

Public libraries should be aggressively collecting the “papers” of local people who have had significant contributions to their communities. Increasingly, these aren’t paper at all, but are born digital content. For example: email correspondence, document archives, digital photograph collections. I think that librarians and archivists know, in theory, that this born digital content is out there, but the reality is it’s not flowing into the public library/archive. How can we change this? Efforts such as Personal Digital Archiving are important for two reasons: they help set up the right conditions for born digital collections to be donated, and they also make professionals think about how they would like to receive materials so that they are easier to process. Think more things like AIMS, training and tooling for both professionals and citizens.


It’s not unusual for archives and special collections to have all sorts of donor gift agreements that place restrictions on how their donated materials can be used. To some extent needing to visit the collection, request it, and not being able to leave the room with it, has mitigated some of this special-snowflakism. But when things are online things change a bit. We need to normalize these agreements so that content can flow online, and be used online in clearer ways. What if we got donors to think about Creative Commons licenses when they donated materials? How can we make sure donated material can become a usable part of the Web?


We all know that things come and go on the Web. But it doesn’t need to be that way for everything on the Web. Libraries and archives have an opportunity to show how focusing on being a clearinghouse for data assets can allow for things to live persistently on the Web. Thinking about our URLs as identifiers for things we are taking care of is important. Practical strategies for achieving that are possible, and repeatable. What if public libraries were safe harbors for local content on the World Wide Web? This might sound hard to do, but I think it’s not as hard as people think.


As libraries/archives make more local content available publicly on the Web it becomes important to track how this content is accessed and used online. Web analytics tools (such as Google Analytics) are quick wins for seeing what is being accessed and from where. Seeing how content is cited in social media applications like Facebook, Twitter, Pinterest and Wikipedia is important for reporting on the value of online collections. But encouraging professionals to use this information to become part of the conversations is equally important. Good metrics are also essential for collection development purposes, seeing what content is of interest, and what is not.

Inside Out Libraries

So, no, I don’t think public libraries need a new open source Overdrive. The ebook market will likely continue to take care of itself. I also am not really convinced we need some overarching organization like the Digital Public Library of America to serve as a single point of failure when the funding runs dry. We need distributed strategies for documenting our local communities, so that this information can take its rightful place on the Web, and be picked up by Google so that people can find it when they are on the other side of the world. Things will definitely keep changing, but I think libraries and archives need to invest in the Web as an enduring delivery platform for information.

I’ve never been before but I was so excited to read the call for the European Library Automation Group (ELAG) this year.

The theme of this year’s conference is ‘The INSIDE-OUT Library’. This theme was chosen at last year’s conference, because we concluded:

  • Libraries have been focusing on bringing the world to their users. Now information is globally available.
  • Libraries have been producing metadata for the same publications in parallel. Now they are faced with deduplicating redundancy.
  • Libraries have been selecting things for their users. Now the users select things themselves.
  • Libraries have been supporting users by indexing things locally. Now everything is being indexed in global, shared indexes.

Instead of being an OUTSIDE-IN library, libraries should try and stay relevant by shifting their paradigm 180 degrees. Instead of only helping users to find what is available globally, they should also focus on making local collections and production available to the world. Instead of doing the same thing everywhere, libraries should focus on making unique information accessible. Instead of focusing on information trapped in publications, libraries should try and give the world new views on knowledge.

This blog post is really just a somewhat shabby rephrasing of that call. Maybe IFLA could use some of the folks on the ELAG program committee at their next meeting about the future of public libraries? Hopefully 2013 will be a year I can make it to ELAG.

I expect public libraries will continue to exist, but there isn’t going to be some magical technical solution to their problems. Their future will be forged by each local relationship they make, which leads to them better documenting their place on the Web. We may not call these places public libraries at first, but that’s what they will be.

linkrot: use your illusion

Mike Giarlo wrote a bit last week about the issues of citing datasets on the Web with Digital Object Identifiers (DOI). It’s a really nice, concise characterization of why libraries and publishers have promoted and used the DOI, and indirect identifiers more generally. Mike defines indirect identifiers as

… identifiers that point at and resolve to other identifiers.

I might be reading between the lines a bit, but I think Mike is specifically talking about any identifier that has some documented or ad-hoc mechanism for turning it into a Web identifier, or URL. A quick look at the Wikipedia identifier category yields lots of these, many of which (but not all) can be expressed as a URI.

The reason why I liked Mike’s post so much is that he was able to neatly summarize the psychology that drives the use of indirect identifier technologies:

… cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers

Mike goes on to get to the heart of the problem, which is that indirect identifier technologies don’t solve the problem of broken links on the Web, they just push it elsewhere. The real problem of maintaining the indirect identifier when the actual URL changes becomes someone else’s problem. Out of sight, out of mind … except it’s not really out of sight right? Unless you don’t really care about the content you are putting online.

We all know that linkrot on the Web is a real thing. I would be putting my head in the sand if I were to say it wasn’t. But I would also be putting my head in the sand if I said that things don’t go missing from our brick and mortar libraries. But still, we should be able to do better than 1/2 the URLs in arXiv going dead right? I make a living as a web developer, I’m an occasional advocate for linked data, and I’m a big fan of the work Henry Thompson and David Orchard did for the W3C analyzing the use of alternate identifier schemes on the Web…so, admittedly, I’m a bit of a zealot when it comes to promoting URLs as identifiers, and taking the Web seriously as an information space.

Mike’s post actually kicked off what I thought was a useful Twitter conversation (yes they can happen), which left me contemplating the future of libraries and archives on (or in) the Web. Specifically, it got me thinking that perhaps libraries and archives of the not too distant future will be places that take special care in how they put content on the Web, so that it can be accessed over time, just like a traditional physical library or archive. The places where links and the content they reference are less likely to go dead will be the new libraries and archives. These may not be the same institutions we call libraries today. Just like today’s libraries, these new libraries may not necessarily be free to access. You may need to be part of some community to access them, or to pay some sort of subscription fee. But some of them, and I hope most, will be public assets.

So how to make this happen? What will it look like? Rather than advocating a particular identifier technology I think these new libraries need to think seriously about providing Terms of Service documents for their content services. I think these library ToS documents will do a few things.

  • They will require the library to think seriously about the service they are providing. This will involve meetings, more meetings, power lunches, and likely lawyers. The outcome will be an organizational understanding of what the library is putting on the Web, and the commitment they are entering into with their users. It won’t simply be a matter of a web development team deciding to put up some new website…or take one down. This will likely be hard, but I think it’s getting easier all the time, as the importance of the Web as a publishing platform becomes more and more accepted, even in conservative organizations like libraries and archives.
  • The ToS will address the institution’s commitment for continued access to the content. This will involve a clear understanding of the URL namespaces that the library manages, and a statement about how they will be maintained over time. The Web has built in mechanisms for content moving from place to place (HTTP 301), and for when resources are removed (HTTP 410), so URLs don’t need to be written in stone. But the library needs to commit to how resources will redirect permanently to new locations, and for how long–and how they will be removed.
  • The ToS will explicitly state the licensing associated with the content, preferably with Creative Commons licenses (hey I’m daydreaming here) so that it can be confidently used.
  • Libraries and archives will develop a shared palette of ToS documents. Each institution won’t have its own special snowflake ToS that nobody reads. There will be some normative patterns for different types of libraries. They will be shared across consortia, and among peer institutions. Maybe they will be incorporated into, or reflect shared principles found in documents like ALA’s Library Bill of Rights or SAA’s Code of Ethics.
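
To make the point about HTTP 301 and HTTP 410 a bit more concrete, here is a minimal sketch in NodeJS of a server that answers 301 for content that has moved and 410 for content that has been deliberately removed. The paths, the moved and gone tables, and the resolve function are all hypothetical; a real library would keep this information in whatever system manages its URL namespace.

```javascript
var http = require('http');

// Hypothetical records of what has happened to old URLs
var moved = {"/old/report.pdf": "/collections/reports/1"};
var gone = {"/exhibits/2009/temp": true};

// Decide how to answer a request path: a permanent redirect, a
// "gone" response, or null when the path needs no special handling
function resolve(path) {
  if (Object.prototype.hasOwnProperty.call(moved, path)) {
    return {status: 301, headers: {Location: moved[path]}};
  }
  if (Object.prototype.hasOwnProperty.call(gone, path)) {
    return {status: 410, headers: {}};
  }
  return null;
}

function handler(req, res) {
  var answer = resolve(req.url);
  if (answer) {
    res.writeHead(answer.status, answer.headers);
    res.end();
  } else {
    // stand-in for serving the real content
    res.writeHead(200, {"Content-Type": "text/plain"});
    res.end("content for " + req.url + "\n");
  }
}

// The server is created but not started, so the sketch exits on its own;
// calling server.listen(8080) would start serving requests.
var server = http.createServer(handler);
```

The useful part is less the code than the two tables: keeping them up to date, and saying in the ToS how long entries will stay in them, is the commitment.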

I guess some of this might be a bit reminiscent of the work that has gone into what makes a trusted repository. But I think a Terms of Service between a library/archive and its researcher is something a bit different. It’s more outward looking, less interested in certification and compliance and more interested in entering into and upholding a contract with the user of a collection.

As I was writing this post, Dan Brickley tweeted about a recent talk that Tony Ageh (head of the archive development team at the BBC) gave at the Economies of the Commons conference. He spoke about his ideas for a future Digital Public Space, and the role that archives and organizations like the BBC play in helping create it.

Things no longer ‘need’ to disappear after a certain period of time. Material that once would have flourished only briefly before languishing under lock and key or even being thrown away — can now be made available forever. And our Licence Fee Payers increasingly expect this to be the way of things. We will soon need to have a very, very good reason for why anything at all disappears from view or is not permanently accessible in some way or other.

That is why the Digital Public Space has placed the continuing and permanent availability of all publicly-funded media, and its associated information, as the default and founding principle.

I think Tony and Mike are right. Cultural heritage organizations need to think more seriously, and more long term about the content they are putting on the Web. They need to put this thought into clear, and succinct contracts with their users. The organizations that do will be what we call libraries and archives tomorrow. I guess I need to start by getting my own house in order eh?