#c4l15

code4lib 2015 is about to kick off in Portland this morning. Unfortunately I couldn’t make it this year, but I’m looking forward to watching the livestream over the next few days. Thanks so much to the conference organizers for setting up the livestream. The schedule has the details about who is speaking when.

As a little gift to real and virtual conference goers (mostly myself) I quickly created a little web app that will watch the Twitter stream for #c4l15 tweets, and keep track of which URLs people are talking about. You can see it running here, at least while the conference is going on.

I’ve done this sort of thing in an ad hoc way with twarc and some scripts, mostly after (rather than during) an event. For example, here’s a report of URLs mentioned during #dlfforum. But I wanted something a bit more dynamic. As usual the somewhat unkempt code is up on GitHub as a project named earls, in case you have ideas you’d like to try out.
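For a flavor of what that ad hoc reporting looks like, here’s a rough Python sketch (not the actual script I used; it just assumes you pass a file of twarc output on the command line) that tallies the expanded URLs in line-delimited tweet JSON:

import json
import sys
from collections import Counter

counts = Counter()

# twarc writes one tweet as JSON per line
with open(sys.argv[1]) as tweets:
    for line in tweets:
        tweet = json.loads(line)
        for url in tweet.get("entities", {}).get("urls", []):
            counts[url["expanded_url"]] += 1

# print the most mentioned URLs first
for url, count in counts.most_common():
    print(count, url)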

#c4l15 urls, or earls

earls is a node app that listens to Twitter’s filter stream API for tweets mentioning #c4l15. When it finds one, it looks for one or more links in the tweet. Each link is fetched (which also unshortens it), any HTML is parsed (thanks cheerio) to find a page title, and these details are stashed along with the tweet in redis.
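In case it helps to see the flow in one place, here is a very rough sketch of that pipeline. The real earls is node, so this Python is purely illustrative, and the redis key names are my own invention:

import requests
import redis
from bs4 import BeautifulSoup

r = redis.Redis()

def process_tweet(tweet):
    # look for one or more links in the tweet
    for url in tweet.get("entities", {}).get("urls", []):
        # fetching follows redirects, which unshortens the link
        resp = requests.get(url["expanded_url"], timeout=10)
        final_url = resp.url

        # try to parse any HTML to find a page title
        title = None
        if "html" in resp.headers.get("content-type", ""):
            soup = BeautifulSoup(resp.text, "html.parser")
            if soup.title:
                title = soup.title.get_text().strip()

        # stash the counts and titles in redis
        r.hincrby("counts", final_url, 1)
        if title:
            r.hset("titles", final_url, title)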

When you load the page it will show you the latest counts for all URLs it has found so far. Unfortunately, at the moment you need to reload the page to get an update. If I have time I will work on making it update live in the page with socket.io. earls could be used for other conferences, and ought to run pretty easily on Heroku for free.

Oh, and you can see the JSON data here in case you have other ideas of things you’d like to do with the data.

Have a superb conference you crazy dreamers and doers!


Documenting Ferguson Emails

If you are an IfThisThenThat user and are interested in archives, maybe you’ll be interested in this recipe that will email you when a new item is added to the Documenting Ferguson repository. Let me know if you give it a try! I just created the recipe and it hasn’t emailed me yet. But the RSS feed from Washington University’s Omeka instance reports that the last item was added on January 30th, 2015, so the collection is still being added to.

IFTTT Recipe: Documenting Ferguson Emails connects feed to email

I thought about having it tweet, but that would involve creating a Twitter account for the project and that isn’t my place. Plus, RSS and Email are still fun Web 1.0 technologies that don’t get enough love. Well I guess Email predates the Web entirely heh, but you get my drift.


A Life Worth Noting

There are no obituaries for the war casualties that the United States inflicts, and there cannot be. If there were to be an obituary there would have had to have been a life, a life worth noting, a life worth valuing and preserving, a life that qualifies for recognition. Although we might argue that it would be impractical to write obituaries for all those people, or for all people, I think we have to ask, again and again, how the obituary functions as the instrument by which grievability is publicly distributed. It is the means by which a life becomes, or fails to become, a publicly grievable life, an icon for national self-recognition, the means by which a life becomes noteworthy. As a result, we have to consider the obituary as an act of nation-building. The matter is not a simple one, for, if a life is not grievable, it is not quite a life; it does not qualify as a life and is not worth a note. It is already the unburied, if not the unburiable.

Precarious Life by Judith Butler (p. 34)


Library of Alexandria v2.0

In case you missed it, Jill Lepore has written a superb article for the New Yorker about the Internet Archive and archiving the Web in general. The story of the Internet Archive is largely the story of its creator Brewster Kahle. If you’ve heard Kahle speak you’ve probably heard the Library of Alexandria v2.0 metaphor before. As a historian, Lepore is particularly attuned to this dimension of the story of the Internet Archive:

When Kahle started the Internet Archive, in 1996, in his attic, he gave everyone working with him a book called “The Vanished Library,” about the burning of the Library of Alexandria. “The idea is to build the Library of Alexandria Two,” he told me. (The Hellenism goes further: there’s a partial backup of the Internet Archive in Alexandria, Egypt.)

I’m kind of embarrassed to admit that until reading Lepore’s article I never quite understood the metaphor…but now I think I do. The Web is on fire and the Internet Archive is helping save it, one HTTP request and response at a time. Previously I could only picture this vast collection of Web content that the Internet Archive is building as yet another centralized collection of valuable material that, as with v1.0, is vulnerable to disaster or, more likely, as Heather Phillips writes, to creeping neglect:

Though it seems fitting that the destruction of so mythic an institution as the Great Library of Alexandria must have required some cataclysmic event like those described above – and while some of them certainly took their toll on the Library - in reality, the fortunes of the Great Library waxed and waned with those of Alexandria itself. Much of its downfall was gradual, often bureaucratic, and by comparison to our cultural imaginings, somewhat petty.

I don’t think it can be overstated: like the Library of Alexandria before it, the Internet Archive is an amazingly bold and priceless resource for human civilization. I’ve visited the Internet Archive on multiple occasions, and each time I’ve been struck by how unlikely it is that such a small and talented team have been able to build and sustain a service with such impact. It’s almost as if it’s too good to be true. I’m nagged by the thought that perhaps it is.

Herbert van de Sompel is quoted by Lepore:

A world with one archive is a really bad idea.

Van de Sompel and his collaborator Michael Nelson have repeatedly pointed out just how important it is for there to be multiple archives of Web content, and for there to be a way for them to be discoverable and to work together. Another thing I learned from Lepore’s article is that Brewster’s initial vision for the Internet Archive was much more collaborative; it gave birth to the International Internet Preservation Consortium, which is made up of 32 member organizations that do Web archiving.

A couple of weeks ago one prominent IIPC member, the California Digital Library, announced that it was retiring its in-house archiving infrastructure and outsourcing its operation to Archive-It, the subscription web archiving service from the Internet Archive.

The CDL and the UC Libraries are partnering with Internet Archive’s Archive-It Service. In the coming year, CDL’s Web Archiving Service (WAS) collections and all core infrastructure activities, i.e., crawling, indexing, search, display, and storage, will be transferred to Archive-It. The CDL remains committed to web archiving as a fundamental component of its mission to support the acquisition, preservation and dissemination of content. This new partnership will allow the CDL to meet its mission and goals more efficiently and effectively and provide a robust solution for our stakeholders.

I happened to tweet this at the time:

Which at least inspired some mirth from Jason Scott, who is an Internet Archive employee, and also a noted Internet historian and documentarian.

Jason is also well known for his work with ArchiveTeam, which quickly mobilizes volunteers to save content on websites that are being shut down. This content is often then transferred to the Internet Archive. He gets his hands dirty doing the work, and inspires others to do the same. So I deserved a bit of derisive laughter for my hand-wringing.

But here’s the thing. What does it mean if one of the pre-eminent digital library organizations needs to outsource its Web archiving operation? And what if, as the announcement indicates, Harvard, MIT, Stanford, UCLA, and others are not far behind? Should we be concerned that the technical expertise and infrastructure for doing this work is becoming consolidated in a single organization? What does it say about our Web archiving tools that it is more cost-effective for CDL to outsource this work?

The situation isn’t as dire as it might sound, since Archive-It subscribers retain the right to download their content and store it themselves. How many institutions do that with regularity isn’t well known (at least to me). But Web content isn’t like paper that you can put in a box, in a climate controlled room, and return to years hence. As Matt Kirschenbaum has pointed out:

the preservation of digital objects is logically inseparable from the act of their creation — the lag between creation and preservation collapses completely, since a digital object may only ever be said to be preserved if it is accessible, and each individual access creates the object anew

Can an organization download their WARC content, not provide any meaningful access to it, and say that it is being preserved? I don’t think so. You can’t do digital preservation without thinking about some kind of access to make sure things are working and people can use the stuff. If the content you are accessing is on a platform somewhere else that you have no control over you should probably be concerned.

I’m hopeful that this partnership between CDL, Archive-It, and other organizations will lead to fruitful collaboration and improved tools. But I’m worried that it will mean organizations can simply outsource the expertise and infrastructure of web archiving, while helping reinforce what is already a huge single point of failure. David Rosenthal of Stanford University notes that diversity is a vital component of digital preservation:

Media, software and hardware must flow through the system over time as they fail or become obsolete, and are replaced. The system must support diversity among its components to avoid monoculture vulnerabilities, to allow for incremental replacement, and to avoid vendor lock-in.

I’d like to see more Web archiving classes in iSchools and computer science departments. I’d like to see improved and simplified tools for doing the work of Web archiving. Ideally I’d like to see more in-house crawling and access of web archives, not less. I’d like to see more organizations like the Internet Archive that are not just technically able to do this work, but are also bold enough to collect what they think is important to save on the Web and make it available. If we can’t do this together I think the Library of Alexandria metaphor will be all too literal.


When Google Met WikiLeaks

When Google Met WikiLeaks by Julian Assange
My rating: 4 of 5 stars

This book is primarily the transcript of a conversation between Julian Assange and Eric Schmidt (then CEO of Google) and Jared Cohen for their book The New Digital Age. The transcript is also available in its entirety (fittingly) on the WikiLeaks website, along with the actual audio of the conversation. The transcript is book-ended by several essays: Beyond Good and “Don’t Be Evil”, The Banality of “Don’t Be Evil” (also published in the New York Times), and Deliver us from “Don’t Be Evil”.

Assange read The New Digital Age and wasn’t happy with the framing of the conversation, or with how little of his interview was actually included. When Google Met WikiLeaks is Assange’s attempt to reframe the discussion in terms of the future of publishing, information and the Internet. In particular, Assange takes issue with Schmidt and Cohen’s assertion that:

The information released on WikiLeaks put lives at risk and inflicted serious diplomatic damage.

Schmidt and Cohen offer no source for this bold assertion, and in a note they equate WikiLeaks with minimally enabling espionage, again with no citation. Assange makes the case that WikiLeaks is actually in the business of publishing and journalism, not secretly selling information for private gain. I think he succeeds, but more importantly, he presents a view of the near future of the Internet, presaged by WikiLeaks, that is genuinely interesting and compelling. The transcript itself is heavily annotated with footnotes, many of which include URLs that are archived at archive.today.

For me the most interesting parts of the book center on what Assange calls the Naming of Things:

The naming of human intellectual work and our entire intellectual record is possibly the most important thing. So we all have words for different objects, like “tomato.” But we use a simple word, “tomato,” instead of actually describing every little aspect of this god damn tomato…because it takes too long. And because it takes too long to describe this tomato precisely we use an abstraction so we can think about it so we can talk about it. And we do that also when we use URLs. Those are frequently used as a short name for some human intellectual content. And we build all of our civilization, other than on bricks, on human intellectual content. And so we currently have a system with URLs where the structure we are building our civilization out of is the worst kind of melting plasticine imaginable. And that is a big problem.


Transcript of secret meeting between Julian Assange and Google CEO Eric Schmidt

This particular section goes on to talk about some really interesting topics, such as the effects of right to be forgotten laws, DNS, BitTorrent magnet URIs, how not to pick ISPs, hashing algorithms, digital signatures, public key cryptography, Bitcoin, NameCoin, flood networks, and distributed hash tables. The fascinating thing is that Schmidt is asking Assange for these details to understand how WikiLeaks operates; but Assange’s response is to discuss some general technologies that may influence a new kind of Web of documents. A Web where identity matters, where documents are signed and mirrored, republished and resilient.
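To make the contrast with URLs a bit more concrete, here’s a toy Python illustration of content addressing, the idea underneath hashing, magnet URIs, and the signed-and-mirrored web of documents Assange is gesturing at. Nothing here is drawn from WikiLeaks itself; it just shows a name derived from content rather than location:

import hashlib

def content_name(data: bytes) -> str:
    # name the document by the hash of its bytes, not by where it lives
    return "sha256:" + hashlib.sha256(data).hexdigest()

doc = b"We build all of our civilization on human intellectual content."
name = content_name(doc)
print(name)

# any mirror can serve the bytes, and anyone holding the name can
# verify they got the right document, no matter where it came from
assert content_name(doc) == name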

Assange has been largely demonized by the mainstream press, and this book humanizes him quite a bit. It’s hard not to think of him in the Ecuadorian Embassy in London (where he will have been for 1500 days tomorrow) quietly adding footnotes to the transcript, and archiving web content.

OR Books’ role in printing this content on paper, for bookshelves everywhere, is another aspect of this process of replication. Hats off to them for putting this project together.

Here’s some musical accompaniment to go along with this post:


Bowie

Bowie by Simon Critchley
My rating: 5 of 5 stars

If you are a Bowie fan, you will definitely enjoy this. If you are curious why other people are so into Bowie you will enjoy this. If you’ve never read any Critchley and are interested in something quick and accessible by him you will enjoy this. I fell into the first and third categories so I guess I’m guessing about the second. But I suspect it’s true.

I finished the book feeling like I understand the why and how of my own fascination with Bowie’s work much better. I also want to revisit some of his albums like Diamond Dogs, Heathen and Outside which I didn’t quite connect with at first. I would’ve enjoyed a continued discussion of Bowie’s use of the cutup technique, but I guess that fell out of the scope of the book.

I want to read some more Critchley too – so if you have any recommendations please let me know. The sketches at the beginning of each chapter are wonderful. OR Books continues to impress.


Languages on Twitter

There have been some interesting visualizations of languages in use on Twitter, like this one done by Gnip and published in the New York Times. Recently I’ve been involved in some research on a particular topical collection of tweets. One angle that’s been particularly relevant for this dataset is language. When perusing some of the tweet data we retrieved from Twitter’s API we noticed that there were two lang properties in the JSON: one was attached to the embedded user profile stanza, and the other was a top level property of the tweet itself.

We presumed that the user profile language was the language the user (who submitted the tweet) had selected, and that the second language on the tweet was the language of the tweet itself. The first is what Gnip used in its visualization. Interestingly, Twitter’s own documentation for the /get/statuses/:id API call only shows the user profile language.

When you send a tweet you don’t indicate what language it is in. For example you might indicate in your profile that you speak primarily English, but send some tweets in French. I can only imagine that detecting language for each tweet isn’t a cheap operation for the scale that Twitter operates at. Milliseconds count when you are sending 500 million tweets a day, in real time. So at the time I was skeptical that we were right…but I added a mental note to do a little experiment.

This morning I noticed my friend Dan had posted a tweet in Hebrew, and figured now was as good a time as any.

I downloaded the JSON for the tweet from the Twitter API and sure enough, the user profile had language en and the tweet itself had language iw, which is the deprecated ISO 639-1 code for Hebrew (the current code is he). Here’s the raw JSON for the tweet; search for lang:

{
  "contributors": null,
  "truncated": false,
  "text": "\u05d0\u05e0\u05d7\u05e0\u05d5 \u05e0\u05ea\u05d2\u05d1\u05e8",
  "in_reply_to_status_id": null,
  "id": 540623422469185537,
  "favorite_count": 2,
  "source": "Tweetbot for Mac",
  "retweeted": false,
  "coordinates": null,
  "entities": {
    "symbols": [],
    "user_mentions": [],
    "hashtags": [],
    "urls": []
  },
  "in_reply_to_screen_name": null,
  "id_str": "540623422469185537",
  "retweet_count": 0,
  "in_reply_to_user_id": null,
  "favorited": true,
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "profile_text_color": "333333",
    "default_profile_image": false,
    "id": 17981917,
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "verified": false,
    "profile_location": null,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "profile_sidebar_fill_color": "DDFFCC",
    "entities": {
      "description": {
        "urls": []
      }
    },
    "followers_count": 1841,
    "profile_sidebar_border_color": "BDDCAD",
    "id_str": "17981917",
    "profile_background_color": "9AE4E8",
    "listed_count": 179,
    "is_translation_enabled": false,
    "utc_offset": -18000,
    "statuses_count": 14852,
    "description": "",
    "friends_count": 670,
    "location": "Washington DC",
    "profile_link_color": "0084B4",
    "profile_image_url": "http://pbs.twimg.com/profile_images/524709964905218048/-CuYZQQY_normal.jpeg",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/17981917/1354047961",
    "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/3725850/woods.jpg",
    "name": "Dan Chudnov",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 1212,
    "screen_name": "dchud",
    "notifications": false,
    "url": null,
    "created_at": "Tue Dec 09 02:56:15 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "geo": null,
  "in_reply_to_user_id_str": null,
  "lang": "iw",
  "created_at": "Thu Dec 04 21:47:22 +0000 2014",
  "in_reply_to_status_id_str": null,
  "place": null
}
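If you want to pull those two values out yourself, a minimal snippet like this works against JSON like the above (assuming you’ve saved it as tweet.json):

import json

with open("tweet.json") as f:
    tweet = json.load(f)

print("profile language:", tweet["user"]["lang"])  # en
print("tweet language:", tweet["lang"])            # iw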

Although tweets are short they certainly can contain multiple languages. I was curious what would happen if I tweeted two words, one in English and one in French.

When I fetched the JSON data for this tweet the language of the tweet was indicated to be pt, or Portuguese! As far as I know neither testing nor essai is Portuguese.

This made me think perhaps the tweet was a bit short so I tried something a bit longer, with the number of words in each language being equal.

This one came across with lang fr. So having the text be a bit longer helped in this case. Admittedly this isn’t a very sound experiment, but it seems interesting and useful to see that Twitter is detecting language in tweets. It isn’t perfect, but that shouldn’t be surprising at all given the nature of human language. It might be useful to try a more exhaustive test using a more complete list of languages to see how it fares. I’m adding another mental note…
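If I ever do get around to that, I imagine it looking something like this sketch, which compares Twitter’s lang value against a local language detector for a file of line-delimited tweets (langdetect is just one option, and the filename is made up):

import json
from langdetect import detect

with open("tweets.json") as f:
    for line in f:
        tweet = json.loads(line)
        try:
            guess = detect(tweet["text"])
        except Exception:
            # langdetect raises an error on text it can't classify
            guess = None
        # print the tweets where the two disagree
        if guess != tweet.get("lang"):
            print(tweet["id_str"], tweet.get("lang"), guess, tweet["text"])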



An Invitation to Study

These are some brief remarks I prepared for a 5 minute lightning talk at the Ferguson Town Hall meeting at UMD on December 3, 2014.

Thank you for the opportunity to speak here this evening. It is a real privilege. I’d like to tell you a little bit about an archive of 13 million Ferguson-related tweets we’ve assembled here at the University of Maryland. You can see a random sampling of some of them up on the screen here. The 140 characters of text in a tweet only make up about 2% of the data for each tweet. The other 98% includes metadata such as who sent it, their profile, how many followers they have, what tweet they are replying to, who they are retweeting, when the tweet was sent, (sometimes) where the tweet was sent from, and embedded images and video. I’m hoping that I can interest some of you in studying this data.

I intentionally used the word privilege in my opening sentence to recognize that my ethnicity and my gender enabled me to be here speaking to you this evening. I’d actually like to talk very quickly about a different set of privileges I have: those of my profession, and as a member of the UMD academic community that we are all a part of:

In a Democracy Now interview back in August, hip-hop artist and activist Talib Kweli characterized the Ferguson related activity on Twitter in a deeply insightful way:

I remember a world before the Internet. And I remember what it really takes to have movement on the ground. Someone tweeted me back and said, “Well, you know, back in the day they didn’t have Twitter, but they had letters, and they wrote letters to each other, so…” I said, “Yeah, but ain’t nobody saying that the letters started the revolution.”

When I look at the Green Revolution, when I look what happened to Egypt, when I look at what happened to Occupy Wall Street, yeah, the tweets helped—they helped a lot—but without those bodies in the street, without the people actually being there, ain’t nothing to tweet about. If Twitter worked like that, Joseph Kony would be locked down in a jail right now.

Of course, Talib is right. It’s why we are all here this evening. In some sense it doesn’t matter what happens on Twitter. What matters is what happened in Ferguson, what is happening in Ferguson, and in meetings and demonstrations like this one all around the country.

Talib’s comparison of a tweet to a letter struck me as particularly insightful. I work as an archivist and software developer in the Maryland Institute for Technology in the Humanities here at UMD. Humanities scholars have traditionally studied a particular set of historical materials, of which letters are one. These materials form the heart of what we call the archive. What gets collected in archives and studied is inevitably what forms our cultural canon. It is a site of controversy, for as George Orwell wrote in 1984:

Who controls the past controls the future. Who controls the present controls the past.

Would we be here tonight if it wasn’t for Twitter? Would the President be talking about Ferguson if it wasn’t for the groundswell of activity on Twitter? Without Twitter what would the main stream media have reported about Mike Brown, John Crawford, Eric Garner, Renisha McBride, Trayvon Martin, and Tamir Rice? As you know this list of injustices is long…it is vast and overwhelming. It extends back to the beginnings of this state and this country. But what trace do we have of these injustices and these struggles in our archives?

The famed historian and social activist Howard Zinn said this when addressing a group of archivists in 1970:

the existence, preservation and availability of archives, documents, records in our society are very much determined by the distribution of wealth and power. That is, the most powerful, the richest elements in society have the greatest capacity to find documents, preserve them, and decide what is or is not available to the public. This means government, business and the military are dominant.

This is where social media and the Web present such a profoundly new opportunity for us, as we struggle to understand what happened in Ferguson…as we struggle to understand how best to act in the present. We need to work to make sure the voices of Ferguson are available for study–and not just in the future, but now. Let’s put our privilege as members of this academic community to work. Is there something to learn in these 13 million tweets, these letters from ordinary people, the thousands of videos and photographs, and the links to stories? I think there is. I’m hopeful that these digital traces provide us with new insight into an old problem…insights that can guide our actions here in the present.

If you have questions you’d like to ask of the data please get in touch with either me or Neil Fraistat (Director of MITH) here tonight, or via email or Twitter.