twarc & Ferguson demo

Here’s a brief demo of what it looks like to use twarc on the command line to archive tweets mentioning Ferguson. I’ve been doing archiving around this topic off and on since August of last year, and happened to start it up again recently to collect the response to the Justice Department report.

I kind of glossed over getting your Twitter keys set up, which is a bit tedious. I have them set in environment variables for that demo, but you can now pass them in on the command line. I guess that could be another demo sometime. If you are interested, send me a tweet.
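
For the curious, the command line demo boils down to something like this if you use twarc as a Python library instead. This is just a sketch assuming the twarc 1.x API, and the environment variable names for the keys are illustrative:

    # A minimal sketch, assuming the twarc 1.x Python API: stream tweets
    # mentioning "ferguson" into a file of line-oriented JSON, reading the
    # Twitter keys from environment variables (names here are illustrative).
    import json
    import os

    from twarc import Twarc

    t = Twarc(
        os.environ["CONSUMER_KEY"],
        os.environ["CONSUMER_SECRET"],
        os.environ["ACCESS_TOKEN"],
        os.environ["ACCESS_TOKEN_SECRET"],
    )

    with open("ferguson.jsonl", "a") as archive:
        # filter() follows Twitter's streaming API and yields each tweet as a dict
        for tweet in t.filter(track="ferguson"):
            archive.write(json.dumps(tweet) + "\n")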


JavaScript and Archives

Tantek Çelik has some strong words about the use of JavaScript in Web publishing, specifically regarding its accessibility and longevity:

… in 10 years nothing you built today that depends on JS for the content will be available, visible, or archived anywhere on the web

It is a dire warning. It sounds and feels true. I am in the middle of writing a webapp that happens to use React, so Tantek’s words are particularly sobering.

And yet, consider for a moment how Twitter makes personal downloadable archives available. When you request your archive you eventually get a zip file. When you unzip it and open the index.html file in your browser, you are provided with a view of all the tweets you’ve ever sent.

If you take a look under the covers you’ll see it is actually a JavaScript application called Grailbird. If you have JavaScript turned on it looks something like this:

JavaScript On

If you have JavaScript turned off it looks something like this:

JavaScript Off

But remember, this is a static site. There is no server side piece. Everything is happening in your browser. You can disconnect from the Internet and, as long as your browser has JavaScript turned on, it is fully functional. (Well, the avatar URLs break, but that could be fixed.) You can search across your tweets. You can drill into particular time periods. You can view your account summary. It feels pretty durable. I could stash it away on a hard drive somewhere, come back in 10 years, and (assuming there are still web browsers with a working JavaScript runtime) still look at it, right?

So is Tantek right about JavaScript being at odds with preservation of Web content? I think he is, but I also think JavaScript can be used in the service of archiving, and that there are starting to be some options out there that make archiving JavaScript heavy websites possible.

The real problem that Tantek is talking about is when human readable content isn’t available in the HTML and is getting loaded dynamically from Web APIs using JavaScript. This started to get popular back in 2005 when Jesse James Garrett coined the term AJAX for building app-like web pages using asynchronous requests for XML, which is now mostly JSON. The scene has since exploded with all sorts of client side JavaScript frameworks for building web applications as opposed to web pages.

So if someone (e.g. the Internet Archive) comes along and tries to archive a URL, they will get the HTML and the associated images, stylesheets and JavaScript files that are referenced in that HTML. These will get saved just fine. But when the content is played back later (e.g. in the Wayback Machine) the JavaScript will run and try to talk to those external Web APIs to load content. If those APIs no longer exist, the content won’t load.

One solution to this problem is for the web archiving process to execute the JavaScript and to archive any of the dynamic content that was retrieved. This can be done using headless browsers like PhantomJS, and supposedly Google has started executing JavaScript when it crawls. Like Tantek I’m dubious about how widely they execute JavaScript: I’ve had trouble getting Google to index a JavaScript heavy site that I’ve inherited at work. But even if the crawler does execute the JavaScript, user interactions can cause different content to load. So does the bot start clicking around in the application to get content to load? That is yet more work for an archiving bot to do, and could potentially trigger write operations, which an archiving bot really shouldn’t be doing.
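
To make the rendering idea concrete, here is a minimal sketch using Selenium to drive PhantomJS, roughly the way it was done at the time. It only grabs the rendered DOM; a real crawler would record the full HTTP traffic (e.g. as WARC) rather than a single HTML snapshot:

    # A minimal sketch of rendering a JavaScript heavy page before archiving it,
    # using Selenium to drive the PhantomJS headless browser (a period approach;
    # Selenium later dropped PhantomJS support). The URL is illustrative.
    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("http://example.com/js-heavy-page/")

    # page_source is the DOM *after* the JavaScript has run, so content fetched
    # from Web APIs has already been folded into the page
    rendered_html = driver.page_source

    with open("rendered.html", "w") as out:
        out.write(rendered_html)

    driver.quit()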

Another option is to change, or at least augment, the current web archiving paradigm by adding curator driven web archiving to the mix. The best examples I’ve seen of this are Ilya Kreymer’s work on pywb and pywb-recorder. Ilya is a former Internet Archive engineer, and is well aware of the limitations in the most common forms of web archiving today. pywb is a new playback engine for web archives and pywb-recorder is a new recording environment. Both work in concert to let archivists interactively select web content that needs to be archived, and then play that content back. The best example of this is his demo service webrecorder.io, which composes pywb and pywb-recorder so that anyone can create a web archive of a highly dynamic website, download the WARC archive file, and then reupload it for playback.

The nice thing about Ilya’s work is that it is geared toward archiving this JavaScript heavy content. Rhizome and the New Museum in New York City have started working with Ilya to use pywb to archive highly dynamic Web content. I think this represents a possible bright future for archives, where curators or archivists are more part of the equation, and where Web archives are more distributed, not just at the Internet Archive and some major national libraries. I think the work Genius is doing to annotate the Web, including archived versions of the Web, is in a similar space. It’s exciting times for Web archiving. You know, exciting if you happen to be an archivist and/or archiving things.

At any rate, getting back to Tantek’s point about JavaScript: if you are in the business of building archives on the Web, definitely think twice about using client side JavaScript frameworks. If you do, make sure your site degrades gracefully so that the majority of the content is still available without JavaScript. You want to make it easy for the Internet Archive to archive your content (lots of copies keeps stuff safe) and you want to make it easy for Google et al. to index it, so people looking for your content can actually find it. Stanford University’s Web Archiving team has a super set of pages describing the archivability of websites. We can’t control how other people publish on the Web, but I think as archivists we have a responsibility to think about these issues as we create archives on the Web.


Facts are Mobile

To classify is, indeed, as useful as it is natural. The indefinite multitude of particular and changing events is met by the mind with acts of defining, inventorying and listing, reducing to common heads and tying up in bunches. But these acts like other intelligent acts are performed for a purpose, and the accomplishment of purpose is their only justification. Speaking generally, the purpose is to facilitate our dealing with unique individuals and changing events. When we assume that our clefts and bunches represent fixed separations and collections in rerum natura, we obstruct rather than aid our transactions with things. We are guilty of a presumption which nature promptly punishes. We are rendered incompetent to deal effectively with the delicacies and novelties of nature and life. Our thought is hard where facts are mobile; bunched and chunky where events are fluid, dissolving.

Dewey (1957), p. 131.

Dewey, J. (1957). Human nature and conduct. New York: Modern Library. Retrieved from https://archive.org/details/humannatureandco011182mbp




#c4l15

code4lib 2015 is about to kick off in Portland this morning. Unfortunately I couldn’t make it this year, but I’m looking forward to watching the livestream over the next few days. Thanks so much to the conference organizers for setting up the livestream. The schedule has the details about who is speaking when.

As a little gift to real and virtual conference goers (mostly myself) I quickly created a little web app that will watch the Twitter stream for #c4l15 tweets, and keep track of which URLs people are talking about. You can see it running here, at least while the conference is going on.

I’ve done this sort of thing in an ad hoc way with twarc and some scripts, mostly after (rather than during) an event. For example here’s a report of URLs mentioned during #dlfforum. But I wanted something a bit more dynamic. As usual the somewhat unkempt code is up on GitHub as a project named earls, in case you have ideas you’d like to try out.

#c4l15 urls, or earls

earls is a node app that listens to Twitter’s filter stream API for tweets mentioning #c4l15. When it finds one, it looks for one or more links in the tweet. Each link is fetched (which also unshortens it), the HTML is parsed (thanks cheerio) to find a page title, and these details, along with the tweet, are stashed in redis.
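
earls itself is node, but the pipeline is simple enough to sketch. Here is roughly the same idea in Python; the key names, redis layout and environment variables are illustrative and not taken from earls:

    # A rough Python sketch of the earls pipeline (earls itself is a node app):
    # watch the filter stream for #c4l15, resolve each link, grab the page
    # title, and tally counts in redis.
    import os

    import redis
    import requests
    from bs4 import BeautifulSoup
    from twarc import Twarc

    r = redis.StrictRedis()
    t = Twarc(
        os.environ["CONSUMER_KEY"],
        os.environ["CONSUMER_SECRET"],
        os.environ["ACCESS_TOKEN"],
        os.environ["ACCESS_TOKEN_SECRET"],
    )

    for tweet in t.filter(track="#c4l15"):
        for link in tweet.get("entities", {}).get("urls", []):
            # following redirects unshortens t.co (and friends) for us
            resp = requests.get(link["expanded_url"], timeout=10)
            url = resp.url
            title = BeautifulSoup(resp.text, "html.parser").title
            r.hset("titles", url, title.get_text(strip=True) if title else url)
            r.hincrby("counts", url, 1)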

When you load the page it will show you the latest counts for all the URLs it has found so far. Unfortunately at the moment you need to reload the page to get an update. If I have time I will work on making it update live in the page with socket.io. earls could be used for other conferences, and ought to run pretty easily on Heroku for free.

Oh, and you can see the JSON data here in case you have other ideas of things you’d like to do with the data.

Have a superb conference you crazy dreamers and doers!


Documenting Ferguson Emails

If you are an IfThisThenThat user and are interested in archives, maybe you’ll be interested in this recipe that will email you when a new item is added to the Documenting Ferguson repository. Let me know if you give it a try! I just created the recipe and it hasn’t emailed me yet. But the RSS feed from Washington University’s Omeka instance reports that the last item was added on January 30th, 2015. So the collection is still being added to.

IFTTT Recipe: Documenting Ferguson Emails connects feed to email

I thought about having it tweet, but that would involve creating a Twitter account for the project and that isn’t my place. Plus, RSS and email are still fun Web 1.0 technologies that don’t get enough love. Well, I guess email predates the Web entirely, heh, but you get my drift.
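
For what it’s worth, the recipe isn’t doing anything fancier than polling the feed and mailing anything new. Here is a rough Python sketch of the same idea; the feed URL, addresses and SMTP setup are all placeholders:

    # A rough sketch of what the IFTTT recipe does: poll the repository's RSS
    # feed and email yourself about items you haven't seen before. The feed URL,
    # addresses and SMTP server are placeholders; a real version would also
    # persist `seen` between runs instead of keeping it in memory.
    import smtplib
    from email.message import EmailMessage

    import feedparser

    FEED_URL = "https://omeka.example.edu/items/browse?output=rss2"  # placeholder
    seen = set()

    for entry in feedparser.parse(FEED_URL).entries:
        if entry.link in seen:
            continue
        seen.add(entry.link)
        msg = EmailMessage()
        msg["Subject"] = "New Documenting Ferguson item: " + entry.title
        msg["From"] = "me@example.com"
        msg["To"] = "me@example.com"
        msg.set_content(entry.link)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)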


A Life Worth Noting

There are no obituaries for the war casualties that the United States inflicts, and there cannot be. If there were to be an obituary there would have had to have been a life, a life worth noting, a life worth valuing and preserving, a life that qualifies for recognition. Although we might argue that it would be impractical to write obituaries for all those people, or for all people, I think we have to ask, again and again, how the obituary functions as the instrument by which grievability is publicly distributed. It is the means by which a life becomes, or fails to become, a publicly grievable life, an icon for national self-recognition, the means by which a life becomes noteworthy. As a result, we have to consider the obituary as an act of nation-building. The matter is not a simple one, for, if a life is not grievable, it is not quite a life; it does not qualify as a life and is not worth a note. It is already the unburied, if not the unburiable.

Precarious Life by Judith Butler (p. 34)


Library of Alexandria v2.0

In case you missed it, Jill Lepore has written a superb article for the New Yorker about the Internet Archive and archiving the Web in general. The story of the Internet Archive is largely the story of its creator, Brewster Kahle. If you’ve heard Kahle speak you’ve probably heard the Library of Alexandria v2.0 metaphor before. As a historian Lepore is particularly attuned to this dimension of the story of the Internet Archive:

When Kahle started the Internet Archive, in 1996, in his attic, he gave everyone working with him a book called “The Vanished Library,” about the burning of the Library of Alexandria. “The idea is to build the Library of Alexandria Two,” he told me. (The Hellenism goes further: there’s a partial backup of the Internet Archive in Alexandria, Egypt.)

I’m kind of embarrassed to admit that until reading Lepore’s article I never quite understood the metaphor…but now I think I do. The Web is on fire and the Internet Archive is helping save it, one HTTP request and response at a time. Previously I couldn’t shake the image of this vast collection of Web content that the Internet Archive is building as yet another centralized collection of valuable material that, as with v1.0, is vulnerable to disaster or, more likely, as Heather Phillips writes, to creeping neglect:

Though it seems fitting that the destruction of so mythic an institution as the Great Library of Alexandria must have required some cataclysmic event like those described above – and while some of them certainly took their toll on the Library – in reality, the fortunes of the Great Library waxed and waned with those of Alexandria itself. Much of its downfall was gradual, often bureaucratic, and by comparison to our cultural imaginings, somewhat petty.

I don’t think it can be overstated: like the Library of Alexandria before it, the Internet Archive is an amazingly bold and priceless resource for human civilization. I’ve visited the Internet Archive on multiple occasions, and each time I’ve been struck by how unlikely it is that such a small and talented team have been able to build and sustain a service with such impact. It’s almost as if it’s too good to be true. I’m nagged by the thought that perhaps it is.

Herbert van de Sompel is quoted by Lepore:

A world with one archive is a really bad idea.

Van de Sompel and his collaborator Michael Nelson have repeatedly pointed out just how important it is for there to be multiple archives of Web content, and for there to be a way for them to be discoverable and work together. Another thing I learned from Lepore’s article is that Brewster’s initial vision for the Internet Archive was much more collaborative, and gave birth to the International Internet Preservation Consortium (IIPC), which is made up of 32 member organizations that do Web archiving.

A couple weeks ago one prominent IIPC member, the California Digital Library, announced that it was retiring its in-house archiving infrastructure and outsourcing its operation to Archive-It, the subscription web archiving service from the Internet Archive.

The CDL and the UC Libraries are partnering with Internet Archive’s Archive-It Service. In the coming year, CDL’s Web Archiving Service (WAS) collections and all core infrastructure activities, i.e., crawling, indexing, search, display, and storage, will be transferred to Archive-It. The CDL remains committed to web archiving as a fundamental component of its mission to support the acquisition, preservation and dissemination of content. This new partnership will allow the CDL to meet its mission and goals more efficiently and effectively and provide a robust solution for our stakeholders.

I happened to tweet this at the time:

Which at least inspired some mirth from Jason Scott, who is an Internet Archive employee, and also a noted Internet historian and documentarian.

Jason is also well known for his work with ArchiveTeam, which quickly mobilizes volunteers to save content on websites that are being shut down. This content is often then transferred to the Internet Archive. He gets his hands dirty doing the work, and inspires others to do the same. So I deserved a bit of derisive laughter for my hand-wringing.

But here’s the thing. What does it mean if one of the pre-eminent digital library organizations needs to outsource its Web archiving operation? And what if, as the announcement indicates, Harvard, MIT, Stanford, UCLA, and others are not far behind? Should we be concerned that the technical expertise and infrastructure for doing this work is becoming consolidated in a single organization? What does it say about our Web archiving tools that it is more cost-effective for CDL to outsource this work?

The situation isn’t as dire as it might sound, since Archive-It subscribers retain the right to download their content and store it themselves. How many institutions do that with regularity isn’t well known (at least to me). But Web content isn’t like paper that you can put in a box, in a climate-controlled room, and return to years hence. As Matt Kirschenbaum has pointed out:

the preservation of digital objects is logically inseparable from the act of their creation — the lag between creation and preservation collapses completely, since a digital object may only ever be said to be preserved if it is accessible, and each individual access creates the object anew

Can an organization download their WARC content, not provide any meaningful access to it, and say that it is being preserved? I don’t think so. You can’t do digital preservation without thinking about some kind of access to make sure things are working and people can use the stuff. If the content you are accessing is on a platform somewhere else that you have no control over you should probably be concerned.

I’m hopeful that this partnership between CDL, Archive-It and other organizations will lead to fruitful collaboration and improved tools. But I’m worried that it will mean organizations can simply outsource the expertise and infrastructure of web archiving, while helping reinforce what is already a huge single point of failure. David Rosenthal of Stanford University notes that diversity is a vital component of digital preservation:

Media, software and hardware must flow through the system over time as they fail or become obsolete, and are replaced. The system must support diversity among its components to avoid monoculture vulnerabilities, to allow for incremental replacement, and to avoid vendor lock-in.

I’d like to see more Web archiving classes in iSchools and computer science departments. I’d like to see improved and simplified tools for doing the work of Web archiving. Ideally I’d like to see more in-house crawling and access of web archives, not less. I’d like to see more organizations like the Internet Archive that are not just technically able to do this work, but are also bold enough to collect what they think is important to save on the Web and make it available. If we can’t do this together I think the Library of Alexandria metaphor will be all too literal.


When Google Met WikiLeaks

When Google Met WikiLeaks by Julian Assange
My rating: 4 of 5 stars

This book is primarily the transcript of a conversation between Julian Assange and Eric Schmidt (then CEO of Google) and Jared Cohen, conducted for their book The New Digital Age. The transcript is also available in its entirety (fittingly) on the WikiLeaks website, along with the actual audio of the conversation. The transcript is book-ended by several essays: Beyond Good and “Don’t Be Evil”, The Banality of “Don’t Be Evil” (also published in the New York Times) and Deliver Us from “Don’t Be Evil”.

Assange read The New Digital Age and wasn’t happy with the framing of the conversation, or the degree to which his interview wasn’t included. When Google Met WikiLeaks is Assange’s attempt to reframe the discussion in terms of the future of publishing, information and the Internet. In particular Assange takes issue with Schmidt and Cohen’s assertion that:

The information released on WikiLeaks put lives at risk and inflicted serious diplomatic damage.

Schmidt and Cohen offer no source for this bold assertion, and in a note they equate WikiLeaks with minimally enabling espionage, again with no citation. Assange makes the case that WikiLeaks is actually in the business of publishing and journalism, not secretly selling information for private gain. I think he succeeds, but more importantly, he presents a view of the near future of the Internet, presaged by WikiLeaks, that is actually interesting and compelling. The transcript itself is heavily annotated with footnotes, many of which contain URLs that are archived at archive.today.

For me the most interesting parts of the book center on what Assange calls the Naming of Things:

The naming of human intellectual work and our entire intellectual record is possibly the most important thing. So we all have words for different objects, like “tomato.” But we use a simple word, “tomato,” instead of actually describing every little aspect of this god damn tomato…because it takes too long. And because it takes too long to describe this tomato precisely we use an abstraction so we can think about it so we can talk about it. And we do that also when we use URLs. Those are frequently used as a short name for some human intellectual content. And we build all of our civilization, other than on bricks, on human intellectual content. And so we currently have a system with URLs where the structure we are building our civilization out of is the worst kind of melting plasticine imaginable. And that is a big problem.


Transcript of secret meeting between Julian Assange and Google CEO Eric Schmidt

This particular section goes on to talk about some really interesting topics, such as the effects of right to be forgotten laws, DNS, BitTorrent magnet URIs, how not to pick ISPs, hashing algorithms, digital signatures, public key cryptography, Bitcoin, Namecoin, flood networks, and distributed hash tables. The fascinating thing is that Schmidt is asking Assange for these details to understand how WikiLeaks operates, but Assange’s response is to discuss some general technologies that may influence a new kind of Web of documents. A Web where identity matters, where documents are signed and mirrored, republished and resilient.

Assange has been largely demonized by the mainstream press, and this book humanizes him quite a bit. It’s hard not to think of him in the Ecuadorian Embassy in London (where he will have been for 1500 days tomorrow) quietly adding footnotes to the transcript, and archiving web content.

OR Books’ role in printing this content on paper, for bookshelves everywhere, is another aspect of this process of replication. Hats off to them for putting this project together.

Here’s some musical accompaniment to go along with this post: