Google’s Subconscious

Can a poem provide insight into the inner workings of a complex algorithm? If Google Search had a subconscious, what would it look like? If Google mumbled in its sleep, what would it say?

A few days ago, I ran across these two quotes within hours of each other:

So if algorithms like autocomplete can defame people or businesses, our next logical question might be to ask how to hold those algorithms accountable for their actions.

Algorithmic Defamation: The Case of the Shameless Autocomplete by Nick Diakopoulos


A beautiful poem should re-write itself one-half word at a time, in pre-determined intervals.

Seven Controlled Vocabularies by Tan Lin.

Then I got to thinking about what a poem auto-generated from Google’s autosuggest might look like. Ok, the idea is of dubious value, but it turned out to be pretty easy to do in just HTML and JavaScript (low computational overhead), and I quickly pushed it up to GitHub.

Here’s the heuristic:

  1. Pick a title for your poem, which also serves as a seed.
  2. Look up the seed in Google’s lightly documented suggestion API.
  3. Get the longest suggestion (text length).
  4. Output the suggestion as a line in the poem.
  5. Stop if more than n lines have been written.
  6. Pick a random substring in the suggestion as the seed for the next line.
  7. GOTO 2
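The steps above can be sketched in JavaScript roughly as follows. This is a minimal sketch, not the code from the GitHub repository: `suggest` is a hypothetical async function that takes a seed and returns an array of suggestion strings (how it talks to Google's suggestion API is left out on purpose), "random substring" is read literally as a random run of characters, and the loop writes at most `maxLines` lines.

```javascript
// Step 3: pick the longest suggestion by text length.
// Returns undefined when the list is empty.
function longestSuggestion(suggestions) {
  return suggestions.reduce(
    (best, s) => (best === undefined || s.length > best.length ? s : best),
    undefined
  );
}

// Step 6: pick a random contiguous substring of the suggestion.
// `random` is injectable so the function can be tested deterministically.
function randomSubstring(text, random = Math.random) {
  const start = Math.floor(random() * text.length);
  const length = 1 + Math.floor(random() * (text.length - start));
  const piece = text.slice(start, start + length).trim();
  // Fall back to the whole text if we happened to pick only whitespace.
  return piece === "" ? text : piece;
}

// Steps 1-7: `suggest` is a hypothetical lookup (seed -> Promise of strings).
async function generatePoem(title, maxLines, suggest, random = Math.random) {
  const lines = [];
  let seed = title; // Step 1: the title doubles as the first seed.
  while (lines.length < maxLines) {
    const line = longestSuggestion(await suggest(seed)); // Steps 2 and 3
    if (line === undefined) break; // no suggestions: end the poem early
    lines.push(line); // Step 4
    seed = randomSubstring(line, random); // Step 6, then loop (Step 7)
  }
  return lines;
}
```

Passing `suggest` in as a parameter keeps the loop independent of how the suggestions are fetched, so a stub can stand in for the real lookup while experimenting.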

The initial results were kind of light on verbs, so I found a list of verbs and occasionally added one at random to the suggested text. The poem is generated in your browser using JavaScript, so hack on it and send me a pull request.
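The verb sprinkling could look something like the sketch below. Everything specific here is an assumption made for illustration: the verb list, the 30% chance, and the choice to drop the verb at a random word boundary are all hypothetical, since the post does not say which verbs were used or where they were placed.

```javascript
// Hypothetical verb list; the list actually used is not given in the post.
const VERBS = ["drift", "remember", "dissolve", "wander"];

// With probability `chance`, insert one random verb at a random word boundary.
// `random` is injectable so the behaviour can be checked deterministically.
function maybeAddVerb(line, verbs = VERBS, chance = 0.3, random = Math.random) {
  if (random() >= chance) return line; // most lines pass through unchanged
  const words = line.split(" ");
  const verb = verbs[Math.floor(random() * verbs.length)];
  const position = Math.floor(random() * (words.length + 1));
  words.splice(position, 0, verb);
  return words.join(" ");
}
```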

Assuming that Google’s suggestions are personalized for you (if you are logged into Google) and your location (your IP address), the poem is dependent on you. So I suppose it’s more of a collective subconscious in a way.

If you find an amusing phrase, please hover over the stanza and tweet it — I’d love to see it!

One Big Archive

Several months ago Hillel Arnold asked me to participate in a panel at the Society of American Archivists, A Trickle Becomes a Flood: Agency, Ethics and Information. The description is probably worth repeating here:

From the Internet activism of Aaron Swartz to Wikileaks’ release of confidential U.S. diplomatic cables, numerous events in recent years have challenged the scope and implications of privacy, confidentiality, and access for archives and archivists. With them comes the opportunity, and perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems. How are we to engage with these questions as archivists and citizens, and what are their implications for user access?

It is a broad and important topic. I had 20 minutes to speak. Here’s what I said.

Thanks to Dan Chudnov for his One Big Library idea, which I’ve adapted/appropriated.

Thanks for inviting me to participate in this panel today. I’ve been a Society of American Archivists member for just two years, and this is my first SAA. So go easy on me, I’m a bit of a newb. I went to “Library School” almost 20 years ago, and was either blind to it or just totally missed out on the rich literature of the archive. As I began to work in the area of digital preservation 8 years ago several friends and mentors encouraged me to explore how thinking about the archive can inform digital repository systems. So here I am today. It’s nice to be here with you.

One thing from the panel description I’d like to focus on is:

the opportunity, perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems

I’d like to examine a specific, recent controversy, in order to explore how the dynamics of power and ethics impact the archive. I’m going to stretch our notion of what the archive is, perhaps to its breaking point, but I promise to bring it back to its normal shape at the end. I hope to highlight just how important archival thinking and community are to the landscape of the Web.


Perhaps I’ve just been perseverating about what to talk about today, but it seems like the news has been full of stories about the role of ethics in information systems lately. One that is still fresh in most people’s minds is the recent debate about Facebook’s emotional contagion study, where users’ news feeds were directly manipulated to test theories about how emotions are transferred in social media.

As Tim Carmody pointed out, this is significant not only for the individuals that had their news feeds tampered with, but also for the individuals who posted content to Facebook, only to have it manipulated. He asks:

What if you were one of the people whose posts were filtered because your keywords were too happy, too angry, or too sad?

However, much of the debate centered on the Terms of Service that Facebook users agree to when they use that service. Did the ToS allow Facebook and Cornell to conduct this experiment?

At the moment there appears to be language in the Facebook Terms of Service that allows user-contributed data to be used for research purposes. That language was not present in the 2012 version of the ToS, when the experiment was conducted. But just because the fine print of a ToS document allows for something doesn’t necessarily mean it is ethical. Are Terms of Service documents the right way to be talking about the ethical concerns and decisions of an organization? Aren’t they just an artifact of the ethical decisions and conversations that have already happened?

Also, it appears that the experiment itself was conducted before Cornell University’s Institutional Review Board had approved the study. Are IRBs functioning the way we want?


The Facebook controversy got an added dimension when dots were connected between one of the Cornell researchers involved in the study and the Department of Defense-funded Minerva Initiative, which

aims to determine “the critical mass (tipping point)” of social contagions by studying their “digital traces” in the cases of “the 2011 Egyptian revolution, the 2011 Russian Duma elections, the 2012 Nigerian fuel subsidy crisis and the 2013 Gazi park protests in Turkey.”

I’m currently reading Betty Medsger’s excellent book about how a group of ordinary peace activists in 1971 came to steal files from a small FBI office in Media, Pennsylvania that provided evidence for the FBI’s Counter Intelligence Program (COINTELPRO). COINTELPRO was an illegal program for “surveying, infiltrating, discrediting, and disrupting domestic political organizations”. Medsger was the Washington Post journalist who received the documents. It’s hard not to see the parallels to today, where prominent Muslim-American academics and politicians are having their email monitored by the FBI and the NSA. Where the NSA is collecting millions of phone records from Verizon every day. And where the NSA’s XKeyScore allows analysts to search through the emails, online chats and browsing histories of millions of people.

Both Facebook and Cornell University later denied that this particular study was funded through the Defense Department’s Minerva Initiative, but did not deny that Professor Hancock had previously received funding from it. I don’t think you need to be an information scientist (although many of you are) to see how one person’s study of social contagions might inform another study of social contagion by that very same person. But information scientists have bills to pay just like anyone else. I imagine if many of us traced our sources of income back far enough we would find the Defense Department as a source. Still, the degrees of separation matter, don’t they?

Big Data

In her excellent post What does the Facebook Experiment Teach Us, social media researcher Danah Boyd explores how the debate about Facebook’s research practices surfaces a general unease about the so-called era of big data:

The more I read people’s reactions to this study, the more I’ve started to think the outrage has nothing to do with the study at all. There is a growing amount of negative sentiment towards Facebook and other companies that collect and use data about people. In short, there’s anger at the practice of big data. This paper provided ammunition for people’s anger because it’s so hard to talk about harm in the abstract.

Certainly part of this anxiety is also the result of what we have learned our own government is doing in willing (and unwilling) collaboration with these companies, based on documents leaked by whistleblower Edward Snowden, and the subsequent journalism from the Guardian and the Washington Post that won them the Pulitzer Prize for Public Service. But as Maciej Ceglowski argues in his (awesome) The Internet With a Human Face:

You could argue (and I do) that this data is actually safer in government hands. In the US, at least, there’s no restriction on what companies can do with our private information, while there are stringent limits (in theory) on our government. And the NSA’s servers are certainly less likely to get hacked by some kid in Minsk.

But I understand that the government has guns and police and the power to put us in jail. It is legitimately frightening to have the government spying on all its citizens. What I take issue with is the idea that you can have different kinds of mass surveillance.

If these vast databases are valuable enough, it doesn’t matter who they belong to. The government will always find a way to query them. Who pays for the servers is just an implementation detail.

Ceglowski makes the case that we need more regulation around the collection of behavioural data, specifically what is collected and how long it is kept. This should be starting to sound familiar.

Boyd takes a slightly more pragmatic tack by pointing out that social media companies need to establish ethics boards that allow users and scholars to enter into conversation with employees that have insight into how things work (policy, algorithms, etc). We need a bit more nuance than “Don’t be evil”. We need to know how the companies that run the websites where we put our content are going to behave, or at least how they would like to behave, and what their moral compass is. I think we’ve seen this to some degree in the response of Google to the Right to be Forgotten law in the EU (a topic for another panel). But I think Boyd is right, that much more coordinated and collaborative work could be done in this area.

The Archive

So, what is the relationship between big data and the archive? Or put a bit more concretely: given the right context, would you consider Facebook content to be archival records?

If you are wearing the right colored glasses, or perhaps Hugh Taylor’s spectacles, I think you will likely say yes, or at least maybe.

Quoting from SAA’s What is an Archive:

An archives is a place where people go to find information. But rather than gathering information from books as you would in a library, people who do research in archives often gather firsthand facts, data, and evidence from letters, reports, notes, memos, photographs, audio and video recording.

Perhaps content associated with Facebook’s user, business and community group accounts is a form of provenance or fonds?

Does anyone here use Facebook? Have you downloaded your Facebook data?

If an individual downloads their Facebook content and donates it to an archive along with other material, does this Facebook data become part of the archival record? How can researchers then access these records? What constraints are in place that govern who can see them, and when? These are familiar questions for the archivist, aren’t they?

Some may argue that the traditional archive doesn’t scale in the same way that big data behemoths like Facebook or Google do. I think in one sense they are right, but this is a feature, not a bug.

If an archive accessions an individual’s Facebook data there is an opportunity to talk with the donor about how they would like their records to be used, by whom and when. Think of an archival fonds as small data. When you add up enough of these fonds, and put them on the Web, you get big data. But the scope and contents of each of these fonds fit in the brain of a person who can reason ethically about them. This is a good thing: it is at the heart of our profession, and we must not lose it.

When you consider theories of the postcustodial archive, the scope of the archive enlarges greatly. Is there a role for the archivist and archival thinking and practices at Facebook itself? I think there is. Could archivists help balance the needs of researchers and records creators, and foster communication around the ethical use of these records? I think so. Could we help donors think about how they want their content to fit into the Web (when the time comes) by encouraging the use of Creative Commons licenses? Could we help organizations think about how they allow for people to download and/or delete their data, and how to package it up so it stands a chance of being readable?

Is it useful to look at the Web as One Big Archive, where assets are moved from one archive to another, where researcher rights are balanced with the rights of record creators? Where long term access is taken more seriously in some pockets than in others? I hope so. I think it’s what we do.

I haven’t mentioned Aaron, who worked so hard for these things. I could’ve spent the entire time talking about his work helping to build the Web, and interrogating power. Maybe I should have…but I guess I kind of did. There isn’t any time left to help Aaron. But we still have time to help make the Web a better place.