Tag Archives: internet archive

Glass Houses

You may have noticed Brooklyn Museum’s recent announcement that they have pulled out of Flickr Commons. Apparently they’ve seen a “steady decline in engagement level” on Flickr, and decided to remove their content from that platform, so they can focus on their own website as well as Wikimedia Commons.

Brooklyn Museum announced three years ago that they would be cross-posting their content to Internet Archive and Wikimedia Commons. Perhaps I’m not seeing their current bot, but they appear to have two, neither of which has done an upload since March of 2011, based on their user activity. It’s kind of ironic that content like this was uploaded to Wikimedia Commons by Flickr Uploader Bot and not by one of their own bots.

The announcement stirred up a fair bit of discussion about how an institution devoted to the preservation and curation of cultural heritage material could delete all the curation that has happened at Flickr. The concern is that the comments, tagging and annotation that happened on Flickr have not been migrated to Wikimedia Commons. I’m not even sure if there’s a place where this structured data could live at Wikimedia Commons. Perhaps some sort of template could be created, or it could live in Wikidata?

Fortunately, Aaron Straup-Cope has a backup copy of Flickr Commons metadata, which includes a snapshot of the Brooklyn Museum’s content. He’s been harvesting this metadata out of concern for Flickr’s future, but surprise, surprise — it was an organization devoted to preservation of cultural heritage material that removed it. It would be interesting to see how many comments there were. I’m currently unpacking a tarball of Aaron’s metadata on an ec2 instance just to see if it’s easy to summarize.

But:

I’m pretty sure I’m living in one of those.

I agree with Ben:

It would help if we had a bit more method to the madness of our own Web presence. Too often the Web is treated as a marketing platform instead of our culture’s predominant content delivery mechanism. Brooklyn Museum deserves a lot of credit for talking about this issue openly. Most organizations just sweep it under the carpet and hope nobody notices.

What do you think? Is it acceptable that Brooklyn Museum discarded the user contributions that happened on Flickr, and that all the people who happened to be pointing at said content from elsewhere now have broken links? Could Brooklyn Museum instead have decided to leave the content there, with a banner of some kind indicating that it is no longer actively maintained? Don’t lots of copies keep stuff safe?

Or perhaps having too many copies detracts from the perceived value of the currently endorsed places to find the content? Curators have too many places to look, and those places aren’t synchronized, which adds confusion and duplication. Maybe it’s better to have one place where people can focus their attention?

Perhaps these two positions aren’t at odds, and what’s actually at issue is a framework for thinking about how to migrate Web content between platforms, along with different expectations about content that is self-hosted and content that is hosted elsewhere?

Fresh Data

In his 1970 talk “Secrecy, Archives and the Public Interest,” Howard Zinn famously challenged professional archivists to realize the role of politics in their work. His talk included 7 points of criticism, all of which are still relevant today, but the last two really moved me to transcribe and briefly comment on them here:

6. That the emphasis is on the past over the present, on the antiquarian over the contemporary; on the non-controversial over the controversial; the cold over the hot. What about the transcripts of trials? Shouldn’t these be made easily available to the public? Not just important trials like the Chicago Conspiracy Trial I referred to, but the ordinary trials of ordinary persons, an important part of the record of our society. Even the extraordinary trials of extraordinary persons are not available, but perhaps they do not show our society at its best. The trial of the Catonsville 9 would be lost to us if Father Daniel Berrigan had not gone through the transcript and written a play based on it.

7. That far more resources are devoted to the collection and preservation of what already exists as records, than to recording fresh data: I would guess that more energy and money is going for the collection and publication of the Papers of John Adams than for recording the experiences of soldiers on the battlefront in Vietnam. Where are the interviews of Seymour Hersh with those involved in the My Lai Massacre, or Fred Gardner’s interviews with those involved in the Presidio Mutiny Trial in California, or Wallace Terry’s interviews with black GIs in Vietnam? Where are the recorded experiences of the young Americans in Southeast Asia who quit the International Volunteer Service in protest of American policy there, or of the Foreign Service officers who have quietly left?

What if Zinn were to ask archivists today about contemporary events? While the situation is far from perfect, the Web has allowed phenomena like Wikipedia, Wikileaks, the Freedom of the Press Foundation and many, many others, to emerge, and substantially level the playing field in ways that we are still grappling with. The Web has widened, deepened and amplified traditional journalism. Indeed, electronic communication media like the Web have copying and distribution cooked into their very essence, and make it almost effortless to share information. Fresh data, as Zinn presciently calls it, is what the Web is about; and the Internet that the Web is built on allows us to largely route around power interests…except, of course, when it doesn’t.

Strangely, I think if Zinn were talking to archivists today he would be asking them to think seriously about where this content will be in 20 years–or maybe even one year. How do we work together as professionals to collect the stuff that needs saving? The Internet Archive is awesome…it’s simply amazing what such a small group of smart people has been able to do. But this is a heavy weight for them to bear alone, and lots of copies keeps stuff safe, right? Where are the copies? Yes, there is the IIPC, but can we assume this job is being taken care of? What web content is being collected? How do we decide what is collected? How do we share our decisions with others so that interested parties can fill in gaps they are interested in? Maybe I’m just not in the know, but it seems like there’s a lot of (potentially fun) work to do.

aaronsw

Aaron Swartz left us all a week ago. It’s strange, I only met Aaron once at the Internet Archive, and had a handful of conversations with him via email/irc … but not a day has passed since last Saturday that I haven’t thought about him, and his principled life.

I’ve been asked a few times why Aaron has been on my mind so much, and I’ve struggled to put it into words. Meanwhile, so many thoughtful things have been written about him. The arc of his life, his ideals, and abilities, charisma, and chutzpah, seem larger than life. And yet, he was just a person, a son, a friend, with people who loved him. It’s just heartbreaking.


I work as a software developer in libraryland, trying to bridge the world of information we’ve had with the world we are building on the Web. So for me, Aaron was a role model, a teacher whose lessons weren’t in textbooks or scholarly journals, but in his blog, in his code, in his talks, in his experiments with real world results. He was only 26 when he died, but he was, and remains, as Tim Berners-Lee paradoxically called him, a “wise elder”.

I wanted to write something here, but more than that I wanted to do something.


I noticed that Internet Archive created a collection devoted to online material related to Aaron, and thought I would try to collect together all the Twitter conversations that mention him. Twitter’s search is limited to the last week, so I quickly wrote a command line utility that pages through search results using their API, and writes out the complete data as line-oriented JSON. I also pulled in the tweets that mention #pdftribute since they were largely inspired by Aaron’s efforts in the open access space. I packaged up the data using BagIt and put it up at Internet Archive. Here’s the description from the bag-info.txt

On January 11, 2013 the Internet activist Aaron Swartz took his own life, and a great deal of grief, anger, and constructive thinking erupted on the Web and in Twitter. In particular the #pdftribute Twitter tag was born, in an attempt to raise awareness about Open Access issues, that Aaron did so much to further during his life.

This package contains Twitter JSON data for two Twitter search queries that were collected in the week following Aaron’s death:

  • “Aaron Swartz” OR aaronsw
  • #pdftribute

aaronsw.json.gz contains 630,397 tweets, for the period starting with 2013-01-11 16:50:22 and ending 2013-01-18 13:50:02.

pdftribute.json.gz contains 42,277 tweets, for the period starting with Jan 13 02:42:26 and ending Jan 17 03:33:46.

In addition the URLs mentioned in the tweets found in aaronsw.json.gz were extracted, unshortened, and then aggregated to provide a report of what people linked to. These URLs are available in aaronsw-urls.txt.gz.

It is hoped that this data will help document the Web community’s response to Aaron’s death, and life.
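For anyone who wants to poke at the bag, here is a minimal sketch of how the link counts could be rebuilt from one of the line-oriented JSON files. It is not the script I used to produce aaronsw-urls.txt.gz. It assumes each line is a single tweet object that carries Twitter's `entities.urls` list with an `expanded_url` (or at least a `url`) on each entry, and it skips the unshortening step entirely, since that needs network access. The function names `iter_tweets` and `count_urls` are made up for this illustration.

```python
import gzip
import json
from collections import Counter


def iter_tweets(path):
    """Yield one decoded tweet per non-empty line of a gzipped, line-oriented JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_urls(tweets):
    """Count how often each URL appears across the given tweets.

    Assumption: tweets follow the `entities.urls` layout; tweets that
    lack it (or have it set to null) simply contribute nothing.
    """
    counts = Counter()
    for tweet in tweets:
        for url in (tweet.get("entities") or {}).get("urls") or []:
            # Prefer the expanded form; fall back to the short link as tweeted.
            target = url.get("expanded_url") or url.get("url")
            if target:
                counts[target] += 1
    return counts
```

Something like `count_urls(iter_tweets("aaronsw.json.gz")).most_common(50)` would then give a rough top-50 report. The numbers will differ from the report in the bag wherever several short links resolve to the same page, because nothing here follows redirects.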

Below is a list of the top 50 links shared in tweets about Aaron. There were 36,506 in all.

Page Shares
RIP, Aaron Swartz – Boing Boing 11763
The Truth about Aaron Swartz’s “Crime” « Unhandled Exception 6641
Aaron Swartz commits suicide – The Tech 5539
Remove United States District Attorney Carmen Ortiz from office for overreach in the case of Aaron Swartz. 6478
Prosecutor as bully – Lessig Blog 3738
The inspiring heroism of Aaron Swartz | Glenn Greenwald | Comment is free | guardian.co.uk 2522
Aaron Swartz Faced A More Severe Prison Term Than Killers, Slave Dealers And Bank Robbers | ThinkProgress 2367
Farewell to Aaron Swartz, an Extraordinary Hacker and Activist – EFF 2042
Internet Activist, a Creator of RSS, Is Dead at 26, Apparently a Suicide – New York Times 1927
Aaron Swartz muere por suicidio a sus 26 años 1572
Technology’s Greatest Minds Say Goodbye to Aaron Swartz 1558
Aaron Swartz a través de 5 grandes contribuciones a la red 1495
Aaron Swartz, American hero 1397
Internet Activist Aaron Swartz Commits Suicide 1330
Anonymous hacks MIT after Aaron Swartz’s suicide | Internet & Media – CNET News 1327
danah boyd | apophenia » processing the loss of Aaron Swartz 1280
Official Statement from the family and partner of Aaron Swartz – Remember Aaron Swartz 1199
depression lies | WIL WHEATON dot NET: 2.0 1164
BBC News – Aaron Swartz, internet freedom activist, dies aged 26 1143
In the Wake of Aaron Swartz’s Death, Let’s Fix Draconian Computer Crime Law – EFF 1088
Westboro Baptist Church Drops Aaron Swartz Funeral Protest After Anonymous Vows Action (VIDEO) 1079
Soup • Official Statement from the Family and Partner of… 1067
‘Aaron was killed by the government’ – Robert Swartz on his son’s death — RT 1066
#PDFTribute list of documents 1044
Internet prodigy, activist Aaron Swartz commits suicide – CNN.com 1009
Remembering Aaron Swartz | The Nation 1003
If I get hit by a truck… 991
Suicide d’Aaron Swartz, activiste à l’origine du format RSS et de Creative Commons 938
Hacker, Activist Aaron Swartz Commits Suicide | ZDNet 896
Activism “How We Stopped SOPA” by Aaron Swartz (1986-2013) 896
Muere a los 26 años el ciberactivista Aaron Swartz | Tecnología | EL PAÍS 887
10 Awful Crimes That Get You Less Prison Time Than What Aaron Swartz Faced | Alternet 868
Aaron Swartz, Coder and Activist, Dead at 26 | Threat Level | Wired.com 856
How the Legal System Failed Aaron Swartz–and Us : The New Yorker 849
https://aaronsw.jottit.com/howtoget 811
How Anonymous Got Westboro to Back Off Aaron Swartz’s Funeral – National – The Atlantic Wire 804
Muerte de Aaron Swartz: la necesidad del Open Data en el I+D 779
US court drops charges on Aaron Swartz days after his suicide — RT 772
Researchers begin posting article PDFs to twitter in #pdftribute to Aaron Swartz « Neuroconscience 745
My Aaron Swartz, whom I loved. | Quinn Said 742
The inspiring heroism of Aaron Swartz | Glenn Greenwald | Comment is free | guardian.co.uk 713
Government formally drops charges against Aaron Swartz | Ars Technica 708
Aaron Swartz’s Politics « naked capitalism 704
CNN.com – Breaking News, U.S., World, Weather, Entertainment & Video News 690
After Aaron Swartz: The Tech World Must Talk About Depression 670
JSTOR liberator 663
Internet Activist Aaron Swartz Commits Suicide 661
Anonymous tumba las webs del MIT y DOJ como tributo a Aaron Swartz 652
Anonymous Hacks MIT, Leaves Farewell Message for Aaron Swartz 647

There were 209,839 Twitter users that mentioned Aaron on Twitter in the last week. I was one of them. I wish I could’ve done more to help.