Every document, every moment in every document, conceals (or reveals) an indeterminate set of interfaces that open into alternate spaces and temporal relations.
Traditional criticism will engage this kind of radiant textuality more as a problem of context than a problem of text, and we have no reason to fault that way of seeing the matter. But as the word itself suggests, “context” is a cognate of text, and not in any abstract Barthesian sense. We construct the poem’s context, for example, by searching out the meanings marked in the physical witnesses that bring the poem to us. We read those witnesses with scrupulous attention, that is to say, we make our detailed way through the looking glass of the book and thence to the endless reaches of the Library of Babel, where every text is catalogued and multiple cross-referenced. In making the journey we are driven far out into the deep space, as we say these days, occupied by our orbiting texts. There objects pivot about many different points and poles, the objects themselves shapeshift continually and the pivots move, drift, shiver, and even dissolve away. Those transformations occur because “the text” is always a negotiated text, half perceived and half created by those who engage with it.
After writing about the Ferguson Twitter archive a few months ago three people have emailed me out of the blue asking for access to the data. One was a principal at a small, scaryish defense contracting company, and the other two were from a prestigious university. I’ve also had a handful of people interested where I work at the University of Maryland.
I ignored the defense contractor. Maybe that was mean, but I don’t want to be part of that. I’m sure they can go buy the data if they really need it. My response to the external academic researchers wasn’t much more helpful since I mostly pointed them to Twitter’s Terms of Service which says:
If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.
You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.
Any Content provided to third parties via non-automated file download remains subject to this Policy.
It’s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. But there are 13,480,000 of them.
So that’s what I’m doing today: publishing the tweet ids. You can download them from the Internet Archive:
On the one hand, it seems unfair that this portion of the public record is unshareable in its most information rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter’s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can’t we have lots of copies to keep stuff safe? More on this in a moment.
Twitter limits users to 180 requests every 15 minutes. A user is effectively a unique access token. Each request can hydrate up to 100 Tweet IDs using the statuses/lookup REST API call.
So to hydrate all of the 13,480,000 tweets will take about 7.8 days. This is a bit of a pain, but realistically it’s not so bad. I’m sure people doing research have plenty of work to do before running any kind of analysis on the full data set. And they can use a portion of it for testing as it is downloading. But how do you download it?
Gnip, who were recently acquired by Twitter, offer a rehydration API. Their API is limited to tweets from the last 30 days, and similar to Twitter’s API you can fetch up to 100 tweets at a time. Unlike the Twitter API you can issue a request every second. So this means you could download the results in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500…
I suspect there are other hydration services out there. But I adapted <a href="http://github.com/edsu/twarc’>twarc the tool I used to collect the data, which already handled rate-limiting, to also do hydration. Once you have the tweet IDs in a file you just need to install twarc, and run it. Here’s how you would do that on an Ubuntu instance:
After a week or so, you’ll have the full JSON for each of the tweets.
Well, not really. You will have most of them. But you won’t have the ones that have been deleted. If a user decided to remove a Tweet they made, or decided to remove their account entirely you won’t be able to get their Tweets back from Twitter using their API. I think it’s interesting to consider Twitter’s Terms of Service as what Katie Shilton would call a value lever.
The metadata rich JSON data (which often includes geolocation and other behavioral data) wasn’t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. Sure, a tweet appears on the Web but it’s in with the other half a trillion Tweets out on the Web, all the way back to the first one. Requiring researchers to go back to the Twitter API to get this data and not allowing it circulate freely in bulk means that users have an opportunity to remove their content. Sure it has already been collected by other people, and it’s pretty unlikely that the NSA are deleting their tweets. But in a way Twitter is taking an ethical position for their publishers to be able to remove their data. To exercise their right to be forgotten. Removing a teensy bit of informational toxic waste.
As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the why of an archive. Negotiating what is to be remembered and by whom is the principal concern of the archive. Ironically it seems it’s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn’t such a bad thing. If I were Twitter I’d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.
It seems like an important move for MIT to formally recognize that social media is a new medium that deserves its own research focus, and investment in infrastructure. The language on the homepage gives a nice flavor for the type of work they plan to be doing. I was particularly struck by their frank assessment of how our governance systems are failing us, and social media’s potential role in understanding and helping solve the problems we face:
In a time of growing political polarization and institutional distrust, social networks have the potential to remake the public sphere as a realm where institutions and individuals can come together to understand, debate and act on societal problems. To date, large-scale, decentralized digital networks have been better at disrupting old hierarchies than constructing new, sustainable systems to replace them. Existing tools and practices for understanding and harnessing this emerging media ecosystem are being outstripped by its rapid evolution and complexity.
Their notion of “social machines” as “networked human-machine collaboratives” reminds me a lot of my somewhat stumbling work on (???) and archiving Ferguson Twitter data. As Nick Diakopoulos has pointed out we really need a theoretical framework for thinking about what sorts of interactions these automated social media agents can participate in, formulating their objectives, and for measuring their effects. Full disclosure: I work with Nick at the University of Maryland, but he wrote that post mentioning me before we met here, which was kind of awesome to discover after the fact.
Some of the news stories about the Twitter/MIT announcement have included this quote from Deb Roy from MIT who will lead the LSM:
The Laboratory for Social Machines will experiment in areas of public communication and social organization where humans and machines collaborate on problems that can’t be solved manually or through automation alone.
What a lovely encapsulation of the situation we find ourselves in today, where the problems we face are localized and yet global. Where algorithms and automation are indispensable for analysis and data gathering, but people and collaborative processes are all the more important. The ethical dimensions to algorithms and our understanding of them is also of growing importance, as the stories we read are mediated more and more by automated agents. It is super that Twitter has decided to help build this space at MIT where people can answer these questions, and have the infrastructure to support asking them.
When I read the quote I was immediately reminded of the problem that some of us were discussing at the last Society of American Archivists meeting in DC: how do we document the protests going on in Ferguson?
Much of the primary source material was being distributed through Twitter. Internet Archive were looking for nominations of URLs to use in their web crawl. But weren’t all the people tweeting about Ferguson including URLs for stories, audio and video that were of value? If people are talking about something can we infer its value in an archive? Or rather, is it a valuable place to start inferring from?
I ended up archiving 13 million of the tweets that mention “ferguson” for the 2 week period after the killing of Michael Brown. I then went through the URLs in these tweets, and unshortened them and came up with a list of 417,972 unshortened URLs. You can see the top 50 of them here, and the top 50 for August 10th (the day after Michael Brown was killed) here.
I did a lot of this work in prototyping mode, writing quick one off scripts to do this and that. One nice unintended side effect was unshrtn which is a microservice for unshortening URLs, which John Kunze gave me the idea for years ago. It gets a bit harder when you are unshortening millions of URLs.
But what would a tool look like that let us analyze events in social media, and helped us (archivists) collect information that needs to be preserved for future use? These tools are no doubt being created by those in positions of power, but we need them for the archive as well. We also desperately need to explore what it means to explore these archives: how do we provide access to them, and share them? It feels like there could be a project here along the lines of what George Washington University University are doing with their Social Feed Manager. Full disclosure again: I’ve done some contracting work with the fine folks at GW on a new interface to their library catalog.
The 5 million dollars aside, an important contribution that Twitter is making here (that’s probably worth a whole lot more) is firehose access to the Tweets that are happening now, as well as the historic data. I suspect Deb Roy’s role at MIT as a professor and as Chief Media Scientist at Twitter helped make that happen. Since MIT has such strong history of supporting open research, it will be interesting to see how the LSM chooses to share data that supports its research.
Pick a random substring in the suggestion as the seed for the next line.
Assuming that Google’s suggestions are personalized for you (if you are logged into Google) and your location (your IP address), the poem is dependent on you. So I suppose it’s more of a collective subconscious in a way.
If you find an amusing phrase, please hover over the stanza and tweet it – I’d love to see it!
I’m just finishing up my first week at MITH. What a smart, friendly group to be a part of, and with such exciting prospects for future work. Riding along the Sligo Creek and Northwest Branch trails to and from campus certainly helps. Let’s just say I couldn’t be happier with my decision to join MITH, and will be writing more about the work as I learn more, and get to work.
But I already have a question, that I’m hoping you can help me with.
I’ve been out of academia for over ten years. In my time away I’ve focused on my role as an agile software developer – increasingly with a lower case “a”. Working directly with the users of software (stakeholders, customers, etc), and getting the software into their hands as early as possible to inform the next iteration of work has been very rewarding. I’ve seen it work again, and again, and I suspect you have too on your own projects.
What I’m wondering is if you know of any tips, books, articles, etc on how to apply these agile practices in the context of grant funded projects. I’m still re-aquainting myself with how grants are tracked, and reported, but it seems to me that they seem to often encourage fairly detailed schedules of work, and cost estimates based on time spent on particular tasks, which (from 10,000 ft) reminds me a bit of the waterfall.
Who usually acts as the product owner in grant drive software development projects? How easy is it to adapt schedules and plans based on what you have learned in a past iteration? How do you get working software into the hands of its potential users as soon as possible? How often do you meet, and what is the focus of discussion? Are there particular funding bodies that appreciate agile software development? Are grants normally focused on publishing research and data instead of software products?
If you take a look you’ll see that only the Twitter ids are listed in the data that you can download. The full metadata that Mark collected (with twarc incidentally) doesn’t appear to be there. Laura knows from her work on the Social Feed Manager that it is fairly common practice in the research community to only openly distribute lists of Tweet ids instead of the raw data. I believe this is done out of concern for Twitter’s terms of service (1.4.A):
If you provide downloadable datasets of Twitter Content or an API that returns Twitter Content, you may only return IDs (including tweet IDs and user IDs).
You may provide spreadsheet or PDF files or other export functionality via non-programmatic means, such as using a “save as” button, for up to 100,000 public Tweets and/or User Objects per user per day. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted.
There are privacy concerns here (redistributing data that users have chosen to remove). But I suspect Twitter has business reasons to discourage widespread redistribution of bulk Twitter data, especially now that they have bought the social media data provider Gnip.
I haven’t really seen a discussion of this practice of distributing Tweet ids, and its implications for research and digital preservation. I see that the International Conference on Weblogs and Social Media now have a dataset service where you need to agree to their “Sharing Agreement”, which basically prevents re-sharing of the data.
Please note that this agreement gives you access to all ICWSM-published datasets. In it, you agree not to redistribute the datasets. Furthermore, ensure that, when using a dataset in your own work, you abide by the citation requests of the authors of the dataset used.
I can certainly understand wanting to control how some of this data is made available, especially after the debate after Facebook’s Emotional Contagion Study went public. But this does not bode well for digital preservation where lots of copies keeps stuff safe. What if there were a standard license that we could use that encouraged data sharing among research data repositories? A viral license like the GPL that allowed data to be shared and reshared within particular contexts? Maybe the CC-BY-NC, or is it too weak? If each tweet is copyrighted by the person who sent it, can we even license them in bulk? What if Twitter’s terms of service included a research clause that applied to more than just Twitter employees, but to downstream archives?
Back of the Envelope
So if I were to make the ferguson tweet ids available, to work with the dataset you would need to refetch the data using the Twitter API, one tweet at a time. I did a little bit of reading and poking at the Twitter API and it appears an access token is limited to 180 requests every 15 minutes. So how long would it take to reconstitute 13 million Twitter ids?
13,000,000 tweets / 180 tweets per interval = 72,222 intervals
72,222 intervals * 15 minutes per interval = 1,083,330 minutes
1,083,330 minutes is two years of constant accesses to the Twitter API. Please let me know if I’ve done something conceptually/mathematically wrong.
Update: it turns out the statuses/lookup API call can return full tweet data for up to 100 tweets per request. So a single access token could fetch about 72,000 tweets per hour (100 per request, 180 requests per 15 minutes) … which only amounts to 180 hours, which is just over a week. James Jacobs rightly points out that a single application could use multiple access tokens, assuming users allowed the application to use them. So if 7 Twitter users donated their Twitter account API quota, the 13 million tweets could be reconstituted from their ids in roughly a day. So the situation is definitely not as bad as I initially thought. Perhaps there needs to be an app that allows people to donate some of the API quota for this sort of task? I wonder if that’s allowed by Twitter’s ToS.
The big assumption here is that the Twitter API continues to operate as it currently does. If Twitter changes its API, or ceases to exist as a company, there would be no way to reconstitute the data. But what if there were a functioning Twitter archive that could reconstitute the original data using the list of Twitter ids…
Digital Preservation as a Service
I’ve hesitated to write about LC’s Twitter archive while I was an employee. But now that I’m no longer working there I’ll just say I think this would be a perfect experimental service for them to consider providing. If a researcher could upload a list of Twitter ids to a service at the Library of Congress and get them back a few hours, days or even weeks later, this would be much preferable to managing a two year crawl of Twitter’s API. It also would allow an ecosystem of Twitter ID sharing to evolve.
The downside here is that all the tweets are in one basket, as it were. What if LC’s Twitter archiving program is discontinued? Does anyone else have a copy? I wonder if Mark kept the original tweet data that he collected, and it is private, available only inside the UNT archive? If someone could come and demonstrate to UNT that they have a research need to see the data, perhaps they could sign some sort of agreement, and get access to the original data?
I have to be honest, I kind of loathe idea of libraries and archives being gatekeepers to this data. Having to decide what is valid research and what is not seems fraught with peril. But on the flip side Maciej has a point:
These big collections of personal data are like radioactive waste. It’s easy to generate, easy to store in the short term, incredibly toxic, and almost impossible to dispose of. Just when you think you’ve buried it forever, it comes leaching out somewhere unexpected.
Managing this waste requires planning on timescales much longer than we’re typically used to. A typical Internet company goes belly-up after a couple of years. The personal data it has collected will remain sensitive for decades.
It feels like we (the research community) need to manage access to this data so that it’s not just out there for anyone to use. Maciej’s essential point is that businesses (and downstream archives) shouldn’t be collecting this behavioral data in the first place. But what about a tweet (its metadata) is behavioural? Could we strip it out? If I squint right, or put on my NSA colored glasses, even the simplest metadata such as who is tweeting to who seems behavioral.
It’s a bit of a platitude to say that social media is still new enough that we are still figuring out how to use it. Does a legitimate right to be forgotten mean that we forget everything? Can businesses blink out of existence leaving giant steaming pools of informational toxic waste, while research institutions aren’t able to collect and preserve small portions as datasets? I hope not.
To bring things back down to earth, how should I make this Ferguson Twitter data available? Are a list of tweet ids the best the archiving community can do, given the constraints of Twitter’s Terms of Service? Is there another way forward that addresses very real preservation and privacy concerns around the data? Some archivists may cringe at the cavalier use of the word “archiving” in the title of this post. However, I think the issues of access and preservation bound up in this simple use case warrant the attention of the archival community. What archival practices can we draw and adapt to help us do this work?
If you are interested in an update about where/how to get the data after reading this see here.
Much has been written about the significance of Twitter as the recent events in Ferguson echoed round the Web, the country, and the world. I happened to be at the Society of American Archivists meeting 5 days after Michael Brown was killed. During our panel discussion someone asked about the role that archivists should play in documenting the event.
There was wide agreement that Ferguson was a painful reminder of the type of event that archivists working to “interrogate the role of power, ethics, and regulation in information systems” should be documenting. But what to do? Unfortunately we didn’t have time to really discuss exactly how this agreement translated into action.
Fortunately the very next day the Archive-It service run by the Internet Archiveannounced that they were collecting seed URLs for a Web archive related to Ferguson. It was only then, after also having finally read Zeynep Tufekci’s terrific Medium post, that I slapped myself on the forehead … of course, we should try to archive the tweets. Ideally there would be a “we” but the reality was it was just “me”. Still, it seemed worth seeing how much I could get done.
I had some previous experience archiving tweets related to Aaron Swartz using Twitter’s search API. (Full disclosure: I also worked on the Twitter archiving project at the Library of Congress, but did not use any of that code or data then, or now.) I wrote a small Python command line program named twarc (a portmanteau for Twitter Archive), to help manage the archiving.
You give twarc a search query term, and it will plod through the search results, in reverse chronological order (the order that they are returned in), while handling quota limits, and writing out line-oriented-json, where each line is a complete tweet. It worked quite well to collect 630,000 tweets mentioning “aaronsw”, but I was starting late out of the gate, 6 days after the events in Ferguson began. One downside to twarc is it is completely dependent on Twitter’s search API, which only returns results for the past week or so. You can search back further in Twitter’s Web app, but that seems to be a privileged client. I can’t seem to convince the API to keep going back in time past a week or so.
So time was of the essence. I started up twarc searching for all tweets that mention ferguson, but quickly realized that the volume of tweets, and the order of the search results meant that I wouldn’t be able to retrieve the earliest tweets. So I tried to guesstimate a Twitter ID far enough back in time to use with twarc’s –max_id parameter to limit the initial query to tweets before that point in time. Doing this I was able to get back to 2014-08-10 22:44:43 – most of August 9th and 10th had slipped out of the window. I used a similar technique of guessing a ID further in the future in combination with the –since_id parameter to start collecting from where that snapshot left off. This resulted in a bit of a fragmented record, which you can see visualized (sort of below):
In the end I collected 13,480,000 tweets (63G of JSON) between August 10th and August 27th. There were some gaps because of mismanagement of twarc, and the data just moving too fast for me to recover from them: most of August 13th is missing, as well as part of August 22nd. I’ll know better next time how to manage this higher volume collection.
Apart from the data, a nice side effect of this work is that I fixed a socket timeout error in twarc that I hadn’t noticed before. I also refactored it a bit so I could use it programmatically like a library instead of only as a command line tool. This allowed me to write a program to archive the tweets, incrementing the max_id and since_id values automatically. The longer continuous crawls near the end are the result of using twarc more as a library from another program.
Bag of Tweets
To try to arrange/package the data a bit I decided to order all the tweets by tweet id, and split them up into gzipped files of 1 million tweets each. Sorting 13 million tweets was pretty easy using leveldb. I first loaded all 16 million tweets into the db, using the tweet id as the key, and the JSON string as the value.
db = leveldb.LevelDB('./tweets.db')
for line in fileinput.input():
tweet = json.loads(line)
This took almost 2 hours on a medium ec2 instance. Then I walked the leveldb index, writing out the JSON as I went, which took 35 minutes:
db = leveldb.LevelDB('./tweets.db')
for k, v in db.RangeIter(None, include_value=True):
After splitting them up into 1 million line files with cut and gzipping them I put them in a Bag and uploaded it to s3 (8.5G).
I am planning on trying to extract URLs from the tweets to try to come up with a list of seed URLs for the Archive-It crawl. If you have ideas of how to use it definitely get in touch. I haven’t decided yet if/where to host the data publicly. If you have ideas please get in touch about that too!