On Archiving Tweets

After my last post about collecting 13 million Ferguson tweets, Laura Wrubel from George Washington University’s Social Feed Manager project recommended looking at how Mark Phillips made his Yes All Women collection of tweets available in the University of North Texas Digital Library. By the way, both are awesome projects to check out if you are interested in how access informs digital preservation.

If you take a look you’ll see that only the Twitter ids are listed in the data that you can download. The full metadata that Mark collected (with twarc, incidentally) doesn’t appear to be there. Laura knows from her work on the Social Feed Manager that it is fairly common practice in the research community to openly distribute only lists of Tweet ids instead of the raw data. I believe this is done out of concern for Twitter’s terms of service (1.4.A):

If you provide downloadable datasets of Twitter Content or an API that returns Twitter Content, you may only return IDs (including tweet IDs and user IDs).

You may provide spreadsheet or PDF files or other export functionality via non-programmatic means, such as using a “save as” button, for up to 100,000 public Tweets and/or User Objects per user per day. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted.

There are privacy concerns here (redistributing data that users have chosen to remove). But I suspect Twitter has business reasons to discourage widespread redistribution of bulk Twitter data, especially now that they have bought the social media data provider Gnip.
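For what it’s worth, boiling a collection down to a shareable id list is simple enough. Here is a minimal sketch, assuming the tweets are stored the way twarc writes them (one JSON tweet per line) and using hypothetical filenames:

    import json

    # Reduce a line-oriented JSON collection to one tweet id per line.
    with open("ferguson.json") as tweets, open("ferguson-ids.txt", "w") as ids:
        for line in tweets:
            tweet = json.loads(line)
            ids.write(tweet["id_str"] + "\n")

Using id_str rather than the numeric id avoids mangling the ids in tools that treat large numbers as floating point.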

I haven’t really seen a discussion of this practice of distributing Tweet ids and its implications for research and digital preservation. I see that the International Conference on Weblogs and Social Media now has a dataset service where you need to agree to their “Sharing Agreement”, which basically prevents re-sharing of the data.

Please note that this agreement gives you access to all ICWSM-published datasets. In it, you agree not to redistribute the datasets. Furthermore, ensure that, when using a dataset in your own work, you abide by the citation requests of the authors of the dataset used.

I can certainly understand wanting to control how some of this data is made available, especially after the debate that followed Facebook’s Emotional Contagion Study. But this does not bode well for digital preservation, where lots of copies keeps stuff safe. What if there were a standard license that we could use that encouraged data sharing among research data repositories? A viral license like the GPL that allowed data to be shared and reshared within particular contexts? Maybe CC-BY-NC, or is that too weak? If each tweet is copyrighted by the person who sent it, can we even license them in bulk? What if Twitter’s terms of service included a research clause that applied to more than just Twitter employees, but to downstream archives as well?

Back of the Envelope

So if I were to make the Ferguson tweet ids available, you would need to refetch the data using the Twitter API, one tweet at a time, in order to work with the dataset. I did a little bit of reading and poking at the Twitter API and it appears an access token is limited to 180 requests every 15 minutes. So how long would it take to reconstitute 13 million tweets from their ids?

13,000,000 tweets / 180 tweets per interval = 72,222 intervals
72,222 intervals * 15 minutes per interval =  1,083,330 minutes

1,083,330 minutes is roughly two years of constant access to the Twitter API. Please let me know if I’ve done something conceptually or mathematically wrong.
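Here’s the arithmetic in Python, in case I’ve fumbled it:

    # Back of the envelope: one tweet per request, 180 requests per 15 minute window.
    tweets = 13000000
    requests_per_window = 180.0
    window_minutes = 15

    minutes = tweets / requests_per_window * window_minutes
    print(minutes)                    # ~1,083,333 minutes
    print(minutes / 60 / 24 / 365)    # ~2.06 years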

Update: it turns out the statuses/lookup API call can return full tweet data for up to 100 tweets per request. So a single access token could fetch about 72,000 tweets per hour (100 per request, 180 requests per 15 minutes), which amounts to roughly 180 hours, or just over a week. James Jacobs rightly points out that a single application could use multiple access tokens, assuming users allowed the application to use them. So if 7 Twitter users donated their API quota, the 13 million tweets could be reconstituted from their ids in roughly a day. The situation is definitely not as bad as I initially thought. Perhaps there needs to be an app that allows people to donate some of their API quota for this sort of task? I wonder if that’s allowed by Twitter’s ToS.
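To make that concrete, here is a rough sketch of what the hydration loop could look like, using statuses/lookup with 100 ids per request. The credentials and filenames are placeholders, and I’m assuming the requests and requests_oauthlib libraries; this isn’t a polished tool, just the shape of the thing:

    import json
    import time

    import requests
    from requests_oauthlib import OAuth1

    # Placeholder credentials -- substitute your own.
    auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
                  "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    LOOKUP = "https://api.twitter.com/1.1/statuses/lookup.json"

    with open("ferguson-ids.txt") as f:
        ids = [line.strip() for line in f]

    with open("ferguson-hydrated.json", "w") as out:
        for i in range(0, len(ids), 100):
            batch = ids[i:i + 100]
            resp = requests.post(LOOKUP, data={"id": ",".join(batch)}, auth=auth)
            if resp.status_code == 429:
                # Rate limited: sleep out the 15 minute window and retry once.
                time.sleep(15 * 60)
                resp = requests.post(LOOKUP, data={"id": ",".join(batch)}, auth=auth)
            for tweet in resp.json():
                out.write(json.dumps(tweet) + "\n")

Tweets that have since been deleted or protected simply won’t come back, which is part of what makes an id list a lossy form of preservation.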

The big assumption here is that the Twitter API continues to operate as it currently does. If Twitter changes its API, or ceases to exist as a company, there would be no way to reconstitute the data. But what if there were a functioning Twitter archive that could reconstitute the original data using the list of Twitter ids…

Digital Preservation as a Service

I’ve hesitated to write about LC’s Twitter archive while I was an employee. But now that I’m no longer working there I’ll just say I think this would be a perfect experimental service for them to consider providing. If a researcher could upload a list of Twitter ids to a service at the Library of Congress and get the tweets back a few hours, days or even weeks later, that would be much preferable to managing a two year crawl of Twitter’s API. It would also allow an ecosystem of Twitter ID sharing to evolve.

The downside here is that all the tweets are in one basket, as it were. What if LC’s Twitter archiving program is discontinued? Does anyone else have a copy? I wonder if Mark kept the original tweet data that he collected, and whether it is held privately, available only inside the UNT archive. If someone could demonstrate to UNT that they have a research need to see the data, perhaps they could sign some sort of agreement and get access to the original data?

I have to be honest, I kind of loathe the idea of libraries and archives being gatekeepers to this data. Having to decide what is valid research and what is not seems fraught with peril. But on the flip side, Maciej has a point:

These big collections of personal data are like radioactive waste. It’s easy to generate, easy to store in the short term, incredibly toxic, and almost impossible to dispose of. Just when you think you’ve buried it forever, it comes leaching out somewhere unexpected.

Managing this waste requires planning on timescales much longer than we’re typically used to. A typical Internet company goes belly-up after a couple of years. The personal data it has collected will remain sensitive for decades.

It feels like we (the research community) need to manage access to this data so that it’s not just out there for anyone to use. Maciej’s essential point is that businesses (and downstream archives) shouldn’t be collecting this behavioral data in the first place. But what about a tweet (its metadata) is behavioral? Could we strip it out? If I squint right, or put on my NSA-colored glasses, even the simplest metadata, such as who is tweeting to whom, seems behavioral.

It’s a bit of a platitude to say that social media is still new enough that we are still figuring out how to use it. Does a legitimate right to be forgotten mean that we forget everything? Can businesses blink out of existence leaving giant steaming pools of informational toxic waste, while research institutions aren’t able to collect and preserve small portions as datasets? I hope not.

To bring things back down to earth, how should I make this Ferguson Twitter data available? Is a list of tweet ids the best the archiving community can do, given the constraints of Twitter’s Terms of Service? Is there another way forward that addresses the very real preservation and privacy concerns around the data? Some archivists may cringe at the cavalier use of the word “archiving” in the title of this post. However, I think the issues of access and preservation bound up in this simple use case warrant the attention of the archival community. What archival practices can we draw on and adapt to help us do this work?

RealAudio, AAC and Archivy

A few months ago I happened to read a Pitchfork interview with David Grubbs about his book Records Ruin the Landscape. In the interview Grubbs mentioned how his book was influenced by a 2004 Kenny Goldsmith interview with Henry Flynt…and Pitchfork usefully linked to the interview in the WFMU archive.

You know, books linking to interviews linking to interviews linking to archives, the wondrous beauty and utility of hypertext.

I started listening to the interview on my Mac with Chrome and the latest RealAudio plugin, but after a few minutes it went into a feedback loop of some kind, full of echoes and loops, and became completely unlistenable. This is WFMU, so I thought maybe this was part of the show, but it went on for a while, which seemed a little odd. I tried reloading, thinking it might be some artifact of the stream, but the exact same thing happened again. I noticed a prominent Get Help link right next to the link for listening to the content. I clicked on it and filled out a brief form, not really expecting to hear back.

The WFMU archive view for the interview is sparse but eminently useful.

Unexpectedly, just a few hours later I received an email from Jeff Moore, who wrote that playback problems with RealAudio had been reported before on some items in the archive, and that they were in the process of migrating them to AAC. My report had pushed this particular episode up in the queue, and I could now reload the page and listen to an AAC stream via their Flash player. I guess now that it’s AAC, there is probably something that could be done with the audio HTML element to avoid the Flash bit. But now I could listen to the interview (which, incidentally, is awesome), so I was happy.

I asked Jeff how they were converting the RealAudio, because we have a fair bit of RealAudio lying around at my place of work. He wrote back with some useful notes that I thought I would publish on the Web for others googling for how to do it at this particular point in time. I’d be curious to know if you regard RealAudio as a preservation risk, and a good example of a format we ought to be migrating. The playback options seem quite limited, and precarious, but perhaps that’s just my own limited experience.

The whole interaction with WFMU, from discovery, to access, to preservation, to interaction seemed like such a perfect illustration of what the Web can do for archives, and vice-versa.

Jeff’s Notes

The text below is from Jeff’s email to me. Jeff, if you are reading this and don’t really want me quoting you this way, just let me know.

I’m still fine-tuning the process, which is why the whole bulk transcode isn’t done yet. I’m trying to find the sweet spot where I use enough space / bandwidth for the resulting files so that I don’t hear any obvious degradation from the (actually pretty terrible-sounding) Real files, but don’t just burn extra resources with nothing gained.

Our Real files are mostly mono sampled at 22.05kHz, using a codec current decoders often identify as “Cook”.

I’ve found that ffmpeg does a good job of extracting a WAV file from the Real originals – oh, and since there are two warring projects which each provide a program called ffmpeg, I mean this one:

http://ffmpeg.org/

We’ve been doing our AAC encoding with the Linux version of the Nero AAC Encoder released a few years ago:

http://www.nero.com/enu/company/about-nero/nero-aac-codec.php

…although I’m still investigating alternatives.

One interesting thing I’ve encountered is that a straight AAC re-encoding from the Real file (mono, 22.05k) plays fine as a file on disk, but hasn’t played correctly for me (in the same VLC version) when streamed from Amazon S3. If I convert the mono archive to stereo and AAC-encode that with the Nero encoder, it’s been streaming fine.

Oh, and if you want to transfer tags from the old Real files to any new files, and your transcoding pipeline doesn’t automatically copy tags, note that ffprobe (also from the ffmpeg package) can extract tags from Real files, which you can then stuff back in (with neroAacTag or the tagger of your choice).
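To pull Jeff’s notes together, here is a rough sketch of the pipeline as I understand it. The filenames are placeholders, the stereo upmix follows his S3 observation, and the neroAacEnc and neroAacTag options should be double checked against whatever versions you have installed:

    import json
    import subprocess

    REAL_FILE = "show.rm"     # placeholder filenames
    WAV_FILE = "show.wav"
    AAC_FILE = "show.m4a"

    # 1. Extract a WAV from the RealAudio original, upmixing mono to stereo
    #    (per Jeff's note that mono AAC didn't stream well from S3 for him).
    subprocess.check_call(["ffmpeg", "-i", REAL_FILE, "-ac", "2", WAV_FILE])

    # 2. Encode the WAV to AAC with the Nero encoder.
    subprocess.check_call(["neroAacEnc", "-if", WAV_FILE, "-of", AAC_FILE])

    # 3. Read the tags out of the Real file with ffprobe ...
    probe = subprocess.check_output([
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-show_format", REAL_FILE,
    ])
    tags = json.loads(probe.decode("utf-8")).get("format", {}).get("tags", {})

    # 4. ... and stuff them back into the AAC file with neroAacTag
    #    (tag names may need mapping to the fields neroAacTag supports).
    meta = ["-meta:%s=%s" % (name.lower(), value) for name, value in tags.items()]
    if meta:
        subprocess.check_call(["neroAacTag", AAC_FILE] + meta)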

Afterword

Here is Googlebot coming to get the content a few minutes after I published this post.

54.241.82.166 - - [23/May/2014:10:36:22 +0000] "GET http://inkdroid.org/journal/2014/05/23/realaudio-aac-and-archivy/ HTTP/1.1" 200 20752 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

So someone searching for how to convert RealAudio to AAC might stumble across it. This decentralized Web thing is kinda neat. We need to take care of it.

Fresh Data

In his 1970 talk Secrecy, Archives and the Public Interest, Howard Zinn famously challenged professional archivists to recognize the role of politics in their work. His talk included seven points of criticism, which are still so relevant today, but the last two really moved me to transcribe and briefly comment on them here:

6. That the emphasis is on the past over the present, on the antiquarian over the contemporary; on the non-controversial over the controversial; the cold over the hot. What about the transcripts of trials? Shouldn’t these be made easily available to the public? Not just important trials like the Chicago Conspiracy Trial I referred to, but the ordinary trials of ordinary persons, an important part of the record of our society. Even the extraordinary trials of extraordinary persons are not available, but perhaps they do not show our society at its best. The trial of the Catonsville 9 would be lost to us if Father Daniel Berrigan had not gone through the transcript and written a play based on it.

7. That far more resources are devoted to the collection and preservation of what already exists as records, than to recording fresh data: I would guess that more energy and money is going for the collection and publication of the Papers of John Adams than for recording the experiences of soldiers on the battlefront in Vietnam. Where are the interviews of Seymour Hersh with those involved in the My Lai Massacre, or Fred Gardner’s interviews with those involved in the Presidio Mutiny Trial in California, or Wallace Terry’s interviews with black GIs in Vietnam? Where are the recorded experiences of the young Americans in Southeast Asia who quit the International Volunteer Service in protest of American policy there, or of the Foreign Service officers who have quietly left?

What if Zinn were to ask archivists today about contemporary events? While the situation is far from perfect, the Web has allowed phenomena like Wikipedia, Wikileaks, the Freedom of the Press Foundation and many, many others to emerge, and to substantially level the playing field in ways that we are still grappling with. The Web has widened, deepened and amplified traditional journalism. Indeed, electronic communication media like the Web have copying and distribution cooked into their very essence, and make it almost effortless to share information. Fresh data, as Zinn presciently calls it, is what the Web is about; and the Internet that the Web is built on allows us to largely route around power interests…except, of course, when it doesn’t.

Strangely, I think if Zinn were talking to archivists today he would be asking them to think seriously about where this content will be in 20 years, or maybe even one year. How do we work together as professionals to collect the stuff that needs saving? The Internet Archive is awesome…it’s simply amazing what such a small group of smart people has been able to do. But this is a heavy weight for them to bear alone, and lots of copies keeps stuff safe, right? Where are the copies? Yes, there is the IIPC, but can we just assume this job is being taken care of? What web content is being collected? How do we decide what is collected? How do we share our decisions with others, so that interested parties can fill in the gaps? Maybe I’m just not in the know, but it seems like there’s a lot of (potentially fun) work to do.