Here’s a brief demo of what it looks like to use twarc on the command line to archive tweets that mention Ferguson. I’ve been archiving around this topic off and on since August of last year, and happened to start it up again recently to collect the response to the Justice Department report.
I kind of glossed over getting your Twitter keys set up, which is a bit tedious. I have them set in environment variables for that demo, but you can pass them in on the command line now. I guess that could be another demo sometime. If you are interested send me a tweet.
… in 10 years nothing you built today that depends on JS for the content will be available, visible, or archived anywhere on the web
It is a dire warning. It sounds and feels true. I am in the middle of writing a webapp that happens to use React, so Tantek’s words are particularly sobering.
And yet, consider for a moment how Twitter makes personal downloadable archives available. When you request your archive you eventually get a zip file. When you unzip it, you open an index.html file in your browser, and are provided with a view of all the tweets you’ve ever sent.
Another option is to change or at least augment the current web archiving paradigm by adding curator driven web archiving to the mix. The best examples I’ve seen of this are Ilya Kreymer’s work on pywb and pywb-recorder. Ilya is a former Internet Archive engineer, and is well aware of the limitations in the most common forms of web archiving today. pywb is a new player for web archives and pywb-recorder is a new recording environment. Both work in concert to let archivists interactively select web content that needs to be archived, and then for that content to be played back. The best example of this is his demo service webrecorder.io which composes pywb and pywb-recorder so that anyone can create a web archive of a highly dynamic website, download the WARC archive file, and then reupload it for playback.
To classify is, indeed, as useful as it is natural. The indefinite multitude of particular and changing events is met by the mind with acts of defining, inventorying and listing, reducing to common heads and tying up in bunches. But these acts like other intelligent acts are performed for a purpose, and the accomplishment of purpose is their only justification. Speaking generally, the purpose is to facilitate our dealing with unique individuals and changing events. When we assume that our clefts and bunches represent fixed separations and collections in rerum natura, we obstruct rather than aid our transactions with things. We are guilty of a presumption which nature promptly punishes. We are rendered incompetent to deal effectively with the delicacies and novelties of nature and life. Our thought is hard where facts are mobile; bunched and chunky where events are fluid, dissolving.
Dewey (1957), p. 131.
To be satisfied with repeating, with traversing the ruts which in other conditions led to good, is the surest way of creating carelessness about present and actual good.
Dewey (1957), p. 67.
Human motives sharpen all our questions, human satisfactions lurk in all our answers, all our formulas have a human twist.
William James in Pragmatism and Humanism.
code4lib 2015 is about to kick off in Portland this morning. Unfortunately I couldn’t make it this year, but I’m looking forward to watching the livestream over the next few days. Thanks so much to the conference organizers for setting up the livestream. The schedule has the details about who is speaking when.
As a little gift to real and virtual conference goers (mostly myself) I quickly created a little web app that will watch the Twitter stream for #c4l15 tweets, and keep track of which URLs people are talking about. You can see it running here, at least while the conference is going.
I’ve done this sort of thing in an ad hoc way with twarc and some scripts, mostly after (rather than during) an event. For example here’s a report of URLs mentioned during #dlfforum. But I wanted something a bit more dynamic. As usual the somewhat unkempt code is up on GitHub as a project named earls, in case you have ideas you’d like to try out.
earls is a node app that listens to Twitter’s filter stream API for tweets mentioning #c4l15. When it finds one it looks for one or more links in the tweet. Each link is fetched (which also unshortens it), any HTML is parsed (thanks cheerio) to find a page title, and these details are stashed in redis along with the tweet.
When you load the page it will show you the latest counts for all URLs it has found so far. Unfortunately at the moment you need to reload the page to get an update. If I have time I will work on making it update live in the page with socket.io. earls could be used for other conferences, and ought to run pretty easily on heroku for free.
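For anyone who wants to play with the idea without running node, here is a rough Python sketch of the counting step described above. It is not the earls code: it skips the streaming, fetching and redis parts entirely, the function names and sample tweets are made up for illustration, and it assumes tweets shaped like Twitter’s v1.1 JSON, where links live under entities.urls with an expanded_url. Counting a URL once per tweet is my own choice here.

```python
from collections import Counter
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() or None


def tweet_urls(tweet):
    """Return the links in a tweet, preferring the expanded form."""
    urls = tweet.get("entities", {}).get("urls", [])
    found = [u.get("expanded_url") or u.get("url") for u in urls]
    return [u for u in found if u]


def count_urls(tweets):
    """Count how many tweets mention each URL (once per tweet)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(tweet_urls(tweet)))
    return counts


# Made-up sample data, not real tweets.
SAMPLE_TWEETS = [
    {"entities": {"urls": [{"url": "http://t.co/a", "expanded_url": "http://example.org/talk"}]}},
    {"entities": {"urls": [{"url": "http://t.co/b", "expanded_url": "http://example.org/talk"}]}},
    {"entities": {"urls": [{"url": "http://t.co/c", "expanded_url": "http://example.org/slides"}]}},
    {"entities": {"urls": []}},
]

if __name__ == "__main__":
    for url, n in count_urls(SAMPLE_TWEETS).most_common():
        print(n, url)
    print(page_title("<html><head><title> A talk </title></head></html>"))
```

In a real run the HTML passed to page_title would come from fetching each link, which is also where the unshortening happens.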
Oh, and you can see the JSON data here in case you have other ideas of things you’d like to do with the data.
Have a superb conference you crazy dreamers and doers!
If you are an IfThisThenThat user and are interested in archives maybe you’ll be interested in this recipe that will email you when a new item is added to the Documenting Ferguson repository. Let me know if you give it a try! I just created the recipe and it hasn’t emailed me yet. But the RSS Feed from Washington University’s Omeka instance reports that the last item was added on January 30th, 2015. So the collection is still being added to.
I thought about having it tweet, but that would involve creating a Twitter account for the project and that isn’t my place. Plus, RSS and Email are still fun Web 1.0 technologies that don’t get enough love. Well I guess Email predates the Web entirely heh, but you get my drift.
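If you would rather not depend on IfThisThenThat, the same check is easy to sketch in Python. This is only an illustration of the idea, not how the recipe works internally: it assumes the feed is RSS 2.0 with a pubDate on each item, and the new_items function and the sample feed below are invented for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def new_items(rss_xml, last_seen):
    """Return (title, link, published) for feed items newer than last_seen.

    last_seen must be a timezone-aware datetime, and each pubDate is
    assumed to carry a timezone offset.
    """
    root = ET.fromstring(rss_xml)
    found = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if not pub_date:
            continue  # skip items that carry no date
        published = parsedate_to_datetime(pub_date)
        if published > last_seen:
            found.append((item.findtext("title"), item.findtext("link"), published))
    return found


# A made-up feed for illustration; not the real repository feed.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Sample repository</title>
    <item>
      <title>Newer item</title>
      <link>http://example.org/items/2</link>
      <pubDate>Fri, 30 Jan 2015 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Older item</title>
      <link>http://example.org/items/1</link>
      <pubDate>Sat, 10 Jan 2015 12:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""

if __name__ == "__main__":
    since = datetime(2015, 1, 20, tzinfo=timezone.utc)
    for title, link, published in new_items(SAMPLE_FEED, since):
        print(published.isoformat(), title, link)
```

Run from cron with the time of the last check saved somewhere, this gives you the list of things to put in an email.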
There are no obituaries for the war casualties that the United States inflicts, and there cannot be. If there were to be an obituary there would have had to have been a life, a life worth noting, a life worth valuing and preserving, a life that qualifies for recognition. Although we might argue that it would be impractical to write obituaries for all those people, or for all people, I think we have to ask, again and again, how the obituary functions as the instrument by which grievability is publicly distributed. It is the means by which a life becomes, or fails to become, a publicly grievable life, an icon for national self-recognition, the means by which a life becomes noteworthy. As a result, we have to consider the obituary as an act of nation-building. The matter is not a simple one, for, if a life is not grievable, it is not quite a life; it does not qualify as a life and is not worth a note. It is already the unburied, if not the unburiable.
Precarious Life by Judith Butler, (p. 34)
In case you missed it, Jill Lepore has written a superb article for the New Yorker about the Internet Archive and archiving the Web in general. The story of the Internet Archive is largely the story of its creator Brewster Kahle. If you’ve heard Kahle speak you’ve probably heard the Library of Alexandria v2.0 metaphor before. As a historian Lepore is particularly attuned to this dimension of the story of the Internet Archive:
When Kahle started the Internet Archive, in 1996, in his attic, he gave everyone working with him a book called “The Vanished Library,” about the burning of the Library of Alexandria. “The idea is to build the Library of Alexandria Two,” he told me. (The Hellenism goes further: there’s a partial backup of the Internet Archive in Alexandria, Egypt.)
I’m kind of embarrassed to admit that until reading Lepore’s article I never quite understood the metaphor…but now I think I do. The Web is on fire and the Internet Archive is helping save it, one HTTP request and response at a time. Previously I couldn’t get past the image of the vast collection of Web content that the Internet Archive is building as yet another centralized collection of valuable material that, as with v1.0, is vulnerable to disaster but, more likely, as Heather Phillips writes, to creeping neglect:
Though it seems fitting that the destruction of so mythic an institution as the Great Library of Alexandria must have required some cataclysmic event like those described above – and while some of them certainly took their toll on the Library - in reality, the fortunes of the Great Library waxed and waned with those of Alexandria itself. Much of its downfall was gradual, often bureaucratic, and by comparison to our cultural imaginings, somewhat petty.
I don’t think it can be overstated: like the Library of Alexandria before it, the Internet Archive is an amazingly bold and priceless resource for human civilization. I’ve visited the Internet Archive on multiple occasions, and each time I’ve been struck by how unlikely it is that such a small and talented team have been able to build and sustain a service with such impact. It’s almost as if it’s too good to be true. I’m nagged by the thought that perhaps it is.
Herbert van de Sompel is quoted by Lepore:
A world with one archive is a really bad idea.
Van de Sompel and his collaborator Michael Nelson have repeatedly pointed out just how important it is for there to be multiple archives of Web content, and for there to be a way for them to be discoverable, and work together. Another thing I learned from Lepore’s article is that Brewster’s initial vision for the Internet Archive was much more collaborative, which gave birth to the International Internet Preservation Consortium, which is made up of 32 member organizations who do Web archiving.
A couple weeks ago one prominent IIPC member, the California Digital Library, announced that it was retiring its in-house archiving infrastructure and outsourcing its operation to ArchiveIt, which is the subscription web archiving service from the Internet Archive.
The CDL and the UC Libraries are partnering with Internet Archive’s Archive-It Service. In the coming year, CDL’s Web Archiving Service (WAS) collections and all core infrastructure activities, i.e., crawling, indexing, search, display, and storage, will be transferred to Archive-It. The CDL remains committed to web archiving as a fundamental component of its mission to support the acquisition, preservation and dissemination of content. This new partnership will allow the CDL to meet its mission and goals more efficiently and effectively and provide a robust solution for our stakeholders.
I happened to tweet this at the time:
Which at least inspired some mirth from Jason Scott, who is an Internet Archive employee, and also a noted Internet historian and documentarian.
Jason is also well known for his work with ArchiveTeam, which quickly mobilizes volunteers to save content on websites that are being shut down. This content is often then transferred to the Internet Archive. He gets his hands dirty doing the work, and inspires others to do the same. So I deserved a bit of derisive laughter for my hand-wringing.
But here’s the thing. What does it mean if one of the pre-eminent digital library organizations needs to outsource its Web archiving operation? And what if, as the announcement indicates, Harvard, MIT, Stanford, UCLA, and others might not be far behind? Should we be concerned that the technical expertise and infrastructure for doing this work is becoming consolidated in a single organization? What does it say about our Web archiving tools that it is more cost-effective for CDL to outsource this work?
The situation isn’t as dire as it might sound since ArchiveIt subscribers retain the right to download their content and store it themselves. How many institutions do that with regularity isn’t well known (at least to me). But Web content isn’t like paper that you can put in a box, in a climate controlled room, and return to years hence. As Matt Kirschenbaum has pointed out:
the preservation of digital objects is logically inseparable from the act of their creation — the lag between creation and preservation collapses completely, since a digital object may only ever be said to be preserved if it is accessible, and each individual access creates the object anew
Can an organization download their WARC content, not provide any meaningful access to it, and say that it is being preserved? I don’t think so. You can’t do digital preservation without thinking about some kind of access to make sure things are working and people can use the stuff. If the content you are accessing is on a platform somewhere else that you have no control over you should probably be concerned.
I’m hopeful that this partnership between CDL and ArchiveIt, and other organizations, will lead to fruitful collaboration and improved tools. But I’m worried that it will mean organizations can simply outsource the expertise and infrastructure of web archiving, while helping reinforce what is already a huge single point of failure. David Rosenthal of Stanford University notes that diversity is a vital component of digital preservation:
Media, software and hardware must flow through the system over time as they fail or become obsolete, and are replaced. The system must support diversity among its components to avoid monoculture vulnerabilities, to allow for incremental replacement, and to avoid vendor lock-in.
I’d like to see more Web archiving classes in iSchools and computer science departments. I’d like to see improved and simplified tools for doing the work of Web archiving. Ideally I’d like to see more in-house crawling of and access to web archives, not less. I’d like to see more organizations like the Internet Archive that are not just technically able to do this work, but are also bold enough to collect what they think is important to save on the Web and make it available. If we can’t do this together I think the Library of Alexandria metaphor will be all too literal.
When Google Met Wikileaks by Julian Assange
My rating: 4 of 5 stars
This book is primarily the transcript of a conversation between Julian Assange and Eric Schmidt (then CEO of Google) and Jared Cohen for their book The New Digital Age. The transcript is also available in its entirety (fittingly) on the WikiLeaks website along with the actual audio of the conversation. The transcript is book-ended by several essays: Beyond Good and “Don’t Be Evil”, The Banality of “Don’t Be Evil” (also published in the New York Times) and Deliver us from “Don’t Be Evil”.
Assange read The New Digital Age and wasn’t happy with the framing of the conversation, or the degree to which his interview wasn’t included. When Google Met WikiLeaks is Assange’s attempt to reframe the discussion in terms of the future of publishing, information and the Internet. In particular Assange takes issue with Schmidt and Cohen’s assertion that:
The information released on WikiLeaks put lives at risk and inflicted serious diplomatic damage.
Schmidt and Cohen offer no source for this bold assertion, and in a note they equate WikiLeaks with minimally enabling espionage, again with no citation. Assange makes the case that WikiLeaks is actually in the business of publishing and journalism, not secretly selling information for private gain. I think Assange does this, but more importantly, he presents a view of the near future of the Internet, that is presaged by WikiLeaks, which is actually interesting and compelling. The transcript itself is heavily annotated with footnotes, many of which have URLs, that are archived at archive.today.
For me the most interesting parts of the book center on what Assange calls the Naming of Things:
The naming of human intellectual work and our entire intellectual record is possibly the most important thing. So we all have words for different objects, like “tomato.” But we use a simple word, “tomato,” instead of actually describing every little aspect of this god damn tomato…because it takes too long. And because it takes too long to describe this tomato precisely we use an abstraction so we can think about it so we can talk about it. And we do that also when we use URLs. Those are frequently used as a short name for some human intellectual content. And we build all of our civilization, other than on bricks, on human intellectual content. And so we currently have system with URLs where the structure we are building our civilization out of is the worst kind of melting plasticine imaginable. And that is a big problem.
Transcript of secret meeting between Julian Assange and Google CEO Eric Schmidt
This particular section goes on to talk about some really interesting topics, such as the effects of right to be forgotten laws, DNS, BitTorrent magnet URIs, how not to pick ISPs, hashing algorithms, digital signatures, public key cryptography, Bitcoin, Namecoin, flood networks, and distributed hash tables. The fascinating thing is that Schmidt is asking Assange for these details to understand how WikiLeaks operates; but Assange’s response is to discuss some general technologies that may influence a new kind of Web of documents. A Web where identity matters, where documents are signed and mirrored, republished and resilient.
Assange has been largely demonized by the mainstream press, and this book humanizes him quite a bit. It’s hard not to think of him in the Ecuadorian Embassy in London (where he will have been for 1500 days tomorrow) quietly adding footnotes to the transcript, and archiving web content.
OR Books’ role in printing this content on paper, for bookshelves everywhere, is another aspect of this process of replication. Hats off to them for putting this project together.
Here’s some musical accompaniment to go along with this post: