On Forgetting

Since writing about the Ferguson Twitter archive a few months ago, three people have emailed me out of the blue asking for access to the data. One was a principal at a small, scaryish defense contracting company, and the other two were from a prestigious university. I’ve also heard from a handful of interested people where I work, at the University of Maryland.

I ignored the defense contractor. Maybe that was mean, but I don’t want to be part of that. I’m sure they can go buy the data if they really need it. My response to the external academic researchers wasn’t much more helpful since I mostly pointed them to Twitter’s Terms of Service which says:

If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

Any Content provided to third parties via non-automated file download remains subject to this Policy.

It’s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. But there are 13,480,000 of them.

So that’s what I’m doing today: publishing the Tweet IDs. You can download them from the Internet Archive:

https://archive.org/details/ferguson-tweet-ids

I’m making it available under the CC-BY license.

Hydration

On the one hand, it seems unfair that this portion of the public record is unshareable in its most information rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter’s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can’t we have lots of copies to keep stuff safe? More on this in a moment.

Twitter limits users to 180 requests every 15 minutes, where a user is effectively a unique access token. Each request to the statuses/lookup REST API endpoint can hydrate up to 100 Tweet IDs.

180 requests * 100 tweets = 18,000 tweets/15 min 
                          = 72,000 tweets/hour
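
The same back-of-the-envelope calculation in a couple of lines of Python, scaled up to the whole collection:

    # 180 requests of 100 tweets every 15 minutes, i.e. 4 times an hour
    tweets = 13480000
    tweets_per_hour = 180 * 100 * 4
    print(tweets / (tweets_per_hour * 24.0))  # about 7.8 days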

So hydrating all 13,480,000 tweets will take about 7.8 days. This is a bit of a pain, but realistically it’s not so bad. I’m sure people doing research have plenty of work to do before running any kind of analysis on the full data set, and they can use a portion of it for testing while the rest is downloading. But how do you download it?

Gnip, who were recently acquired by Twitter, offer a rehydration API. It is limited to tweets from the last 30 days, and, like Twitter’s API, lets you fetch up to 100 tweets at a time. Unlike the Twitter API, though, you can issue a request every second, which means you could download the full set in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500…

I suspect there are other hydration services out there. But I adapted twarc, the tool I used to collect the data (which already handled rate limiting), to also do hydration. Once you have the tweet IDs in a file, you just need to install twarc and run it. Here’s how you would do that on an Ubuntu instance:

    
    sudo apt-get install python-pip
    sudo pip install twarc
    twarc.py --hydrate ids.txt > tweets.json
    
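
If you’re curious, here’s a rough sketch of what a hydration loop looks like under the hood: batch the IDs 100 at a time into statuses/lookup calls, and sleep enough between calls to stay under the rate limit. The credentials and the use of the requests and requests_oauthlib libraries here are purely illustrative; twarc takes care of all of this for you, with more careful rate limit handling.

    import json
    import time

    import requests
    from requests_oauthlib import OAuth1

    # illustrative placeholders: substitute your own Twitter API credentials
    auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
                  "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    def lookup(ids, out):
        # statuses/lookup returns full tweet JSON for up to 100 ids at a time;
        # deleted and protected tweets are simply missing from the response
        resp = requests.post(
            "https://api.twitter.com/1.1/statuses/lookup.json",
            data={"id": ",".join(ids)},
            auth=auth,
        )
        for tweet in resp.json():
            out.write(json.dumps(tweet) + "\n")
        # 180 requests per 15 minutes works out to one request every 5 seconds
        time.sleep(5)

    with open("ids.txt") as id_file, open("tweets.json", "w") as out:
        batch = []
        for line in id_file:
            batch.append(line.strip())
            if len(batch) == 100:
                lookup(batch, out)
                batch = []
        if batch:
            lookup(batch, out)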

After a week or so, you’ll have the full JSON for each of the tweets.

Archive Fever

Well, not really. You will have most of them. But you won’t have the ones that have been deleted. If a user decided to remove a Tweet they made, or to remove their account entirely, you won’t be able to get their Tweets back from Twitter using the API. I think it’s interesting to consider Twitter’s Terms of Service as what Katie Shilton would call a value lever.

The metadata-rich JSON data (which often includes geolocation and other behavioral data) wasn’t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. Sure, a tweet appears on the Web, but it’s in there with the other half a trillion Tweets, going all the way back to the first one. Requiring researchers to go back to the Twitter API for this data, and not allowing it to circulate freely in bulk, means that users have an opportunity to remove their content. Sure, it has already been collected by other people, and it’s pretty unlikely that the NSA are deleting their tweets. But in a way Twitter is taking an ethical position so that their publishers are able to remove their data. To exercise their right to be forgotten. To remove a teensy bit of informational toxic waste.

As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the why of an archive: negotiating what is to be remembered, and by whom, is its principal concern. Ironically, it seems it’s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn’t such a bad thing. If I were Twitter I’d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.

If you are interested in learning more about value levers, you can download a pre-print of Shilton’s Value Levers: Building Ethics into Design.

Sign o’ the Times

a sign, a metaphor by Casey Bisson.

An old acquaintance took this photo in Coaldale, Nevada. I had to have a copy for myself.

Social Machines and the Archive

Yesterday MIT announced that Twitter made a 5 million dollar investment to help them create a Laboratory for Social Machines (LSM) as part of the MIT Media Lab proper.

It seems like an important move for MIT to formally recognize that social media is a new medium that deserves its own research focus and investment in infrastructure. The language on the homepage gives a nice flavor of the type of work they plan to be doing. I was particularly struck by their frank assessment of how our governance systems are failing us, and of social media’s potential role in understanding and helping solve the problems we face:

In a time of growing political polarization and institutional distrust, social networks have the potential to remake the public sphere as a realm where institutions and individuals can come together to understand, debate and act on societal problems. To date, large-scale, decentralized digital networks have been better at disrupting old hierarchies than constructing new, sustainable systems to replace them. Existing tools and practices for understanding and harnessing this emerging media ecosystem are being outstripped by its rapid evolution and complexity.

Their notion of “social machines” as “networked human-machine collaboratives” reminds me a lot of my somewhat stumbling work on @congressedits and on archiving Ferguson Twitter data. As Nick Diakopoulos has pointed out, we really need a theoretical framework for thinking about what sorts of interactions these automated social media agents can participate in, for formulating their objectives, and for measuring their effects. Full disclosure: I work with Nick at the University of Maryland, but he wrote that post mentioning me before we met here, which was kind of awesome to discover after the fact.

Some of the news stories about the Twitter/MIT announcement have included this quote from Deb Roy of MIT, who will lead the LSM:

The Laboratory for Social Machines will experiment in areas of public communication and social organization where humans and machines collaborate on problems that can’t be solved manually or through automation alone.

What a lovely encapsulation of the situation we find ourselves in today, where the problems we face are localized and yet global; where algorithms and automation are indispensable for analysis and data gathering, but people and collaborative processes are all the more important. The ethical dimensions of algorithms, and our understanding of them, are also of growing importance as the stories we read are mediated more and more by automated agents. It is super that Twitter has decided to help build a space at MIT where people can ask these questions, and have the infrastructure to support answering them.

When I read the quote I was immediately reminded of the problem that some of us were discussing at the last Society of American Archivists meeting in DC: how do we document the protests going on in Ferguson?

Much of the primary source material was being distributed through Twitter. The Internet Archive was looking for nominations of URLs to use in its web crawl. But weren’t all the people tweeting about Ferguson including URLs for stories, audio and video that were of value? If people are talking about something, can we infer its value in an archive? Or rather, is it a valuable place to start inferring from?

I ended up archiving 13 million tweets that mention “ferguson” from the two-week period after the killing of Michael Brown. I then went through the URLs in those tweets, unshortened them, and came up with a list of 417,972 unshortened URLs. You can see the top 50 of them here, and the top 50 for August 10th (the day after Michael Brown was killed) here.
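
Pulling the URLs out of the hydrated data is the easy part, since each tweet’s JSON includes an entities.urls list with the URL behind Twitter’s t.co wrapper (often itself still a shortened bit.ly or similar link). Here’s a rough sketch that tallies them from the line-oriented tweets.json produced by the hydration step:

    import json
    from collections import Counter

    urls = Counter()
    with open("tweets.json") as tweets:
        for line in tweets:
            tweet = json.loads(line)
            for url in tweet.get("entities", {}).get("urls", []):
                urls[url["expanded_url"]] += 1

    # the 50 most tweeted URLs, many of them still shortened
    for url, count in urls.most_common(50):
        print(count, url)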

I did a lot of this work in prototyping mode, writing quick one-off scripts to do this and that. One nice unintended side effect was unshrtn, a microservice for unshortening URLs, which John Kunze gave me the idea for years ago. It gets a bit harder when you are unshortening millions of URLs.
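
The core of unshortening is just following the redirect chain to see where a link ends up, something like the sketch below (the example URL is made up, and some servers need a GET rather than a HEAD). Doing it quickly and reliably for millions of URLs is where a service like unshrtn comes in:

    import requests

    def unshorten(url):
        # follow redirects and report the final landing URL,
        # falling back to the original if anything goes wrong
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            return resp.url
        except requests.RequestException:
            return url

    print(unshorten("http://bit.ly/example"))  # a made-up shortened URL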

But what would a tool look like that let us analyze events in social media, and helped us (archivists) collect information that needs to be preserved for future use? These tools are no doubt being created by those in positions of power, but we need them for the archive as well. We also desperately need to figure out what it means to explore these archives: how do we provide access to them, and share them? It feels like there could be a project here along the lines of what George Washington University is doing with their Social Feed Manager. Full disclosure again: I’ve done some contracting work with the fine folks at GW on a new interface to their library catalog.

The 5 million dollars aside, an important contribution that Twitter is making here (one that’s probably worth a whole lot more) is firehose access to the Tweets that are happening now, as well as to the historical data. I suspect Deb Roy’s role at MIT as a professor and as Chief Media Scientist at Twitter helped make that happen. Since MIT has such a strong history of supporting open research, it will be interesting to see how the LSM chooses to share the data that supports its research.