Google's Subconscious

Can a poem provide insight into the inner workings of a complex algorithm? If Google Search had a subconscious, what would it look like? If Google mumbled in its sleep, what would it say?

A few days ago, I ran across these two quotes within hours of each other:

So if algorithms like autocomplete can defame people or businesses, our next logical question might be to ask how to hold those algorithms accountable for their actions.

Algorithmic Defamation: The Case of the Shameless Autocomplete by Nick Diakopoulos


A beautiful poem should re-write itself one-half word at a time, in pre-determined intervals.

Seven Controlled Vocabularies by Tan Lin.

Then I got to thinking about what a poem auto-generated from Google’s autosuggest might look like. OK, the idea is of dubious value, but it turned out to be pretty easy to do in just HTML and JavaScript (low computational overhead), and I quickly pushed it up to GitHub.

Here’s the heuristic:

  1. Pick a title for your poem, which also serves as a seed.
  2. Look up the seed in Google’s lightly documented suggestion API.
  3. Get the longest suggestion (text length).
  4. Output the suggestion as a line in the poem.
  5. Stop if more than n lines have been written.
  6. Pick a random substring in the suggestion as the seed for the next line.
  7. GOTO 2
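Outside the browser, the same loop can be sketched in a few lines of Python. This is only a sketch: the suggest endpoint, its client=firefox parameter, and the response shape (a two-element array of query and suggestions) are my assumptions about the lightly documented API, not code from the project.

```python
import json
import random
import urllib.parse
import urllib.request

# Assumed endpoint; with client=firefox it returns ["query", ["s1", "s2", ...]]
SUGGEST = 'https://suggestqueries.google.com/complete/search?client=firefox&q='

def suggestions(seed):
    """Step 2: look the seed up in the suggest API."""
    with urllib.request.urlopen(SUGGEST + urllib.parse.quote(seed)) as resp:
        return json.load(resp)[1]

def poem(title, n=10, fetch=suggestions):
    """Steps 1-7: grow a poem of at most n lines from a seed title."""
    lines, seed = [], title
    while len(lines) < n:                       # step 5: stop after n lines
        results = fetch(seed)
        if not results:
            break
        line = max(results, key=len)            # step 3: longest suggestion
        if not line:
            break
        lines.append(line)                      # step 4: output as a line
        start = random.randrange(len(line))     # step 6: random substring
        seed = line[start:random.randrange(start, len(line)) + 1]
    return lines
```

Passing fetch as a parameter makes the loop testable without hitting the network, which is also roughly how you'd stub it in the browser version.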

The initial results were kind of light on verbs, so I found a list of verbs and occasionally spliced a random one into the suggested text. The poem is generated in your browser using JavaScript, so hack on it and send me a pull request.

Assuming that Google’s suggestions are personalized for you (if you are logged into Google) and your location (your IP address), the poem is dependent on you. So I suppose it’s more of a collective subconscious in a way.

If you find an amusing phrase, please hover over the stanza and tweet it – I’d love to see it!

Agile in Academia

I’m just finishing up my first week at MITH. What a smart, friendly group to be a part of, and with such exciting prospects for future work. Riding along the Sligo Creek and Northwest Branch trails to and from campus certainly helps. Let’s just say I couldn’t be happier with my decision to join MITH, and will be writing more about the work as I learn more, and get to work.

But I already have a question that I’m hoping you can help me with.

I’ve been out of academia for over ten years. In my time away I’ve focused on my role as an agile software developer – increasingly with a lower case “a”. Working directly with the users of software (stakeholders, customers, etc.), and getting the software into their hands as early as possible to inform the next iteration of work, has been very rewarding. I’ve seen it work again and again, and I suspect you have too on your own projects.

What I’m wondering is if you know of any tips, books, articles, etc. on how to apply these agile practices in the context of grant funded projects. I’m still re-acquainting myself with how grants are tracked and reported, but it seems to me that they often encourage fairly detailed schedules of work, and cost estimates based on time spent on particular tasks, which (from 10,000 ft) reminds me a bit of the waterfall.

Who usually acts as the product owner in grant driven software development projects? How easy is it to adapt schedules and plans based on what you have learned in a past iteration? How do you get working software into the hands of its potential users as soon as possible? How often do you meet, and what is the focus of discussion? Are there particular funding bodies that appreciate agile software development? Are grants normally focused on publishing research and data instead of software products?

Any links, references, citations, tips or advice you could send my way here, (???), or by email would be greatly appreciated. I’ve already got Bethany Nowviskie’s Lazy Consensus bookmarked for re-reading :-)

On Archiving Tweets

After my last post about collecting 13 million Ferguson tweets, Laura Wrubel from George Washington University’s Social Feed Manager project recommended looking at how Mark Phillips made his Yes All Women collection of tweets available in the University of North Texas Digital Library. By the way, both are awesome projects to check out if you are interested in how access informs digital preservation.

If you take a look you’ll see that only the Twitter ids are listed in the data that you can download. The full metadata that Mark collected (with twarc incidentally) doesn’t appear to be there. Laura knows from her work on the Social Feed Manager that it is fairly common practice in the research community to only openly distribute lists of Tweet ids instead of the raw data. I believe this is done out of concern for Twitter’s terms of service (1.4.A):

If you provide downloadable datasets of Twitter Content or an API that returns Twitter Content, you may only return IDs (including tweet IDs and user IDs).

You may provide spreadsheet or PDF files or other export functionality via non-programmatic means, such as using a “save as” button, for up to 100,000 public Tweets and/or User Objects per user per day. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted.

There are privacy concerns here (redistributing data that users have chosen to remove). But I suspect Twitter has business reasons to discourage widespread redistribution of bulk Twitter data, especially now that they have bought the social media data provider Gnip.

I haven’t really seen a discussion of this practice of distributing Tweet ids, and its implications for research and digital preservation. I see that the International Conference on Weblogs and Social Media now has a dataset service where you need to agree to their “Sharing Agreement”, which basically prevents re-sharing of the data.

Please note that this agreement gives you access to all ICWSM-published datasets. In it, you agree not to redistribute the datasets. Furthermore, ensure that, when using a dataset in your own work, you abide by the citation requests of the authors of the dataset used.

I can certainly understand wanting to control how some of this data is made available, especially after the debate after Facebook’s Emotional Contagion Study went public. But this does not bode well for digital preservation where lots of copies keeps stuff safe. What if there were a standard license that we could use that encouraged data sharing among research data repositories? A viral license like the GPL that allowed data to be shared and reshared within particular contexts? Maybe the CC-BY-NC, or is it too weak? If each tweet is copyrighted by the person who sent it, can we even license them in bulk? What if Twitter’s terms of service included a research clause that applied to more than just Twitter employees, but to downstream archives?

Back of the Envelope

So if I were to make the ferguson tweet ids available, to work with the dataset you would need to refetch the data using the Twitter API, one tweet at a time. I did a little bit of reading and poking at the Twitter API and it appears an access token is limited to 180 requests every 15 minutes. So how long would it take to reconstitute 13 million tweets from their ids?

13,000,000 tweets / 180 tweets per interval = 72,222 intervals
72,222 intervals * 15 minutes per interval = 1,083,330 minutes

1,083,330 minutes is roughly two years of constant access to the Twitter API. Please let me know if I’ve done something conceptually/mathematically wrong.

Update: it turns out the statuses/lookup API call can return full tweet data for up to 100 tweets per request. So a single access token could fetch about 72,000 tweets per hour (100 per request, 180 requests per 15 minutes), which amounts to about 180 hours, or just over a week. James Jacobs rightly points out that a single application could use multiple access tokens, assuming users allowed the application to use them. So if 7 Twitter users donated their Twitter account API quota, the 13 million tweets could be reconstituted from their ids in roughly a day. So the situation is definitely not as bad as I initially thought. Perhaps there needs to be an app that allows people to donate some of their API quota for this sort of task? I wonder if that’s allowed by Twitter’s ToS.
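To sanity check the update’s arithmetic:

```python
# Back-of-the-envelope check: statuses/lookup returns up to 100 tweets per
# request, and one access token gets 180 requests per 15-minute window.
per_window = 100 * 180         # 18,000 tweets per 15-minute window
per_hour = per_window * 4      # 72,000 tweets per hour for one token
hours = 13000000.0 / per_hour  # about 180.6 hours, just over a week
days_for_7 = hours / 24 / 7    # about 1.1 days if 7 users donate their quota
print(per_hour, round(hours, 1), round(days_for_7, 1))
```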

The big assumption here is that the Twitter API continues to operate as it currently does. If Twitter changes its API, or ceases to exist as a company, there would be no way to reconstitute the data. But what if there were a functioning Twitter archive that could reconstitute the original data using the list of Twitter ids…

Digital Preservation as a Service

I’ve hesitated to write about LC’s Twitter archive while I was an employee. But now that I’m no longer working there I’ll just say I think this would be a perfect experimental service for them to consider providing. If a researcher could upload a list of Twitter ids to a service at the Library of Congress and get the tweets back a few hours, days or even weeks later, this would be much preferable to managing a two-year crawl of Twitter’s API. It also would allow an ecosystem of Twitter ID sharing to evolve.

The downside here is that all the tweets are in one basket, as it were. What if LC’s Twitter archiving program is discontinued? Does anyone else have a copy? I wonder if Mark kept the original tweet data he collected, private and available only inside the UNT archive. If someone could demonstrate to UNT that they have a research need to see the data, perhaps they could sign some sort of agreement and get access to the original data?

I have to be honest, I kind of loathe the idea of libraries and archives being gatekeepers to this data. Having to decide what is valid research and what is not seems fraught with peril. But on the flip side Maciej has a point:

These big collections of personal data are like radioactive waste. It’s easy to generate, easy to store in the short term, incredibly toxic, and almost impossible to dispose of. Just when you think you’ve buried it forever, it comes leaching out somewhere unexpected.

Managing this waste requires planning on timescales much longer than we’re typically used to. A typical Internet company goes belly-up after a couple of years. The personal data it has collected will remain sensitive for decades.

It feels like we (the research community) need to manage access to this data so that it’s not just out there for anyone to use. Maciej’s essential point is that businesses (and downstream archives) shouldn’t be collecting this behavioral data in the first place. But what about a tweet (its metadata) is behavioral? Could we strip it out? If I squint right, or put on my NSA colored glasses, even the simplest metadata, such as who is tweeting to whom, seems behavioral.

It’s a bit of a platitude to say that social media is still new enough that we are still figuring out how to use it. Does a legitimate right to be forgotten mean that we forget everything? Can businesses blink out of existence leaving giant steaming pools of informational toxic waste, while research institutions aren’t able to collect and preserve small portions as datasets? I hope not.

To bring things back down to earth, how should I make this Ferguson Twitter data available? Is a list of tweet ids the best the archiving community can do, given the constraints of Twitter’s Terms of Service? Is there another way forward that addresses the very real preservation and privacy concerns around the data? Some archivists may cringe at the cavalier use of the word “archiving” in the title of this post. However, I think the issues of access and preservation bound up in this simple use case warrant the attention of the archival community. What archival practices can we draw on and adapt to help us do this work?

A Ferguson Twitter Archive

If you are interested in an update about where/how to get the data after reading this, see here.

Much has been written about the significance of Twitter as the recent events in Ferguson echoed round the Web, the country, and the world. I happened to be at the Society of American Archivists meeting 5 days after Michael Brown was killed. During our panel discussion someone asked about the role that archivists should play in documenting the event.

There was wide agreement that Ferguson was a painful reminder of the type of event that archivists working to “interrogate the role of power, ethics, and regulation in information systems” should be documenting. But what to do? Unfortunately we didn’t have time to really discuss exactly how this agreement translated into action.

Fortunately the very next day the Archive-It service run by the Internet Archive announced that they were collecting seed URLs for a Web archive related to Ferguson. It was only then, after also having finally read Zeynep Tufekci’s terrific Medium post, that I slapped myself on the forehead … of course, we should try to archive the tweets. Ideally there would be a “we” but the reality was it was just “me”. Still, it seemed worth seeing how much I could get done.


I had some previous experience archiving tweets related to Aaron Swartz using Twitter’s search API. (Full disclosure: I also worked on the Twitter archiving project at the Library of Congress, but did not use any of that code or data then, or now.) I wrote a small Python command line program named twarc (a portmanteau of Twitter and archive) to help manage the archiving.

You give twarc a search query term, and it will plod through the search results in reverse chronological order (the order that they are returned in), while handling quota limits, and writing out line-oriented JSON, where each line is a complete tweet. It worked quite well to collect 630,000 tweets mentioning “aaronsw”, but this time I was starting late out of the gate, 6 days after the events in Ferguson began. One downside to twarc is that it is completely dependent on Twitter’s search API, which only returns results for the past week or so. You can search back further in Twitter’s Web app, but that seems to be a privileged client. I can’t seem to convince the API to keep going back in time past a week or so.
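Since each line is a complete tweet, downstream processing needs nothing more than a line-by-line JSON parse. A minimal reader sketch (the function and filename handling are mine, not part of twarc):

```python
import json

def tweets(path):
    """Yield one tweet dict per non-blank line of a line-oriented JSON file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Each yielded dict is a full tweet, so fields like id_str, created_at and text are all immediately in hand.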

So time was of the essence. I started up twarc searching for all tweets that mention ferguson, but quickly realized that the volume of tweets, and the order of the search results, meant that I wouldn’t be able to retrieve the earliest tweets. So I tried to guesstimate a Twitter ID far enough back in time to use with twarc’s --max_id parameter to limit the initial query to tweets before that point in time. Doing this I was able to get back to 2014-08-10 22:44:43 – most of August 9th and 10th had slipped out of the window. I used a similar technique of guessing an ID further in the future in combination with the --since_id parameter to start collecting from where that snapshot left off. This resulted in a bit of a fragmented record, which you can see visualized (sort of) below:
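The guesstimating works because tweet IDs are snowflake IDs: the high bits encode a millisecond timestamp relative to Twitter’s custom epoch, shifted left 22 bits to leave room for worker and sequence numbers. A sketch of deriving a --max_id guess from a date (the helper function is mine):

```python
import datetime

# Milliseconds between the Unix epoch and Twitter's snowflake epoch
TWITTER_EPOCH_MS = 1288834974657

def tweet_id_for(dt):
    """Smallest possible tweet ID for a UTC datetime -- usable as a --max_id guess."""
    ms = round(dt.timestamp() * 1000) - TWITTER_EPOCH_MS
    return ms << 22

# e.g. a cutoff near the start of August 9th, 2014 (UTC)
guess = tweet_id_for(datetime.datetime(2014, 8, 9, tzinfo=datetime.timezone.utc))
```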

In the end I collected 13,480,000 tweets (63G of JSON) between August 10th and August 27th. There were some gaps because of my mismanagement of twarc, and the data just moving too fast for me to recover from them: most of August 13th is missing, as well as part of August 22nd. I’ll know better next time how to manage this higher volume collection.

Apart from the data, a nice side effect of this work is that I fixed a socket timeout error in twarc that I hadn’t noticed before. I also refactored it a bit so I could use it programmatically like a library instead of only as a command line tool. This allowed me to write a program to archive the tweets, incrementing the max_id and since_id values automatically. The longer continuous crawls near the end are the result of using twarc more as a library from another program.

Bag of Tweets

To try to arrange/package the data a bit I decided to order all the tweets by tweet id, and split them up into gzipped files of 1 million tweets each. Sorting 13 million tweets was pretty easy using leveldb. I first loaded all 13 million tweets into the db, using the tweet id as the key, and the JSON string as the value.

import json
import fileinput

import leveldb

db = leveldb.LevelDB('./tweets.db')

for line in fileinput.input():
    tweet = json.loads(line)
    # the id_str keys are all the same length, so leveldb's lexicographic
    # key order is also numeric (and therefore chronological) order
    db.Put(tweet['id_str'], line)

This took almost 2 hours on a medium EC2 instance. Then I walked the leveldb index, writing out the JSON as I went, which took 35 minutes:

import leveldb

db = leveldb.LevelDB('./tweets.db')

# RangeIter walks the keys in sorted order; the trailing comma avoids
# an extra newline, since each value already ends with one
for k, v in db.RangeIter(None, include_value=True):
    print v,

After splitting them up into 1 million line files with split and gzipping them, I put them in a Bag and uploaded it to S3 (8.5G).
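For the record, the chunking and compressing can be done with standard tools. A sketch on a tiny stand-in file (the real run used -l 1000000 against the full sorted output; all filenames here are illustrative):

```shell
# Chunk a (stand-in) sorted tweet file into fixed-size pieces and gzip them;
# the real run used -l 1000000 on the 13M-line sort output.
seq 5 > tweets-sorted.json
split -l 2 tweets-sorted.json chunk-
gzip chunk-*
ls chunk-*.gz
```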

I am planning to extract URLs from the tweets to come up with a list of seed URLs for the Archive-It crawl. If you have ideas of how to use the data definitely get in touch. I haven’t decided yet if/where to host the data publicly. If you have ideas please get in touch about that too!
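The URL extraction itself should be straightforward, since Twitter’s tweet JSON includes an entities.urls list with both shortened and expanded forms. A sketch, using a fabricated sample tweet:

```python
import json

def seed_urls(lines):
    """Yield expanded URLs from tweets serialized one JSON object per line."""
    for line in lines:
        tweet = json.loads(line)
        for u in tweet.get('entities', {}).get('urls', []):
            yield u.get('expanded_url') or u.get('url')

sample = ['{"entities": {"urls": [{"url": "http://t.co/x", '
          '"expanded_url": "http://example.com/story"}]}}']
print(list(seed_urls(sample)))
```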

Moving On to MITH

I’m very excited to say that I am going to be joining the Maryland Institute for Technology in the Humanities to work as their Lead Developer. Apart from the super impressive folks at MITH that I will be joining, I am really looking forward to helping them continue to explore how digital media can be used in interdisciplinary scholarship and learning…and to build and sustain tools that make and remake the work of digital humanities research. MITH’s role in the digital humanities field, and its connections on campus and to institutions and projects around the world, were too good an opportunity to pass up.

I’m extremely grateful to my colleagues at the Library of Congress who have made working over the last 8 years such a transformative experience. It’s a bit of a cliche that there are challenges to working in the .gov space. But at the end of the day, I am so proud to have been part of the mission to “further the progress of knowledge and creativity for the benefit of the American people”. It’s a tall order that took me in a few directions that ended up bearing fruit, and some others that didn’t. But that’s always the way of things, right? Thanks to all of you (you know who you are) that made it so rewarding…and so much fun.

When I wrote recently about the importance of broken world thinking I had no idea I would be leaving LC to join MITH in a few weeks. I feel incredibly lucky to be able to continue to work in the still evolving space of digital curation and preservation, with a renewed focus on putting data to use, and helping make the world a better place.

Oh, and I still plan on writing here … so, till the next episode.

One Big Archive

Several months ago Hillel Arnold asked me to participate in a panel at the Society of American Archivists, A Trickle Becomes a Flood: Agency, Ethics and Information. The description is probably worth repeating here:

From the Internet activism of Aaron Swartz to Wikileaks’ release of confidential U.S. diplomatic cables, numerous events in recent years have challenged the scope and implications of privacy, confidentiality, and access for archives and archivists. With them comes the opportunity, and perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems. How are we to engage with these questions as archivists and citizens, and what are their implications for user access?

It is a broad and important topic. I had 20 minutes to speak. Here’s what I said.

Thanks to Dan Chudnov for his One Big Library idea, which I’ve adapted/appropriated.

Thanks for inviting me to participate in this panel today. I’ve been a Society of American Archivists member for just two years, and this is my first SAA. So go easy on me, I’m a bit of a newb. I went to “Library School” almost 20 years ago, and was either blind to it or just totally missed out on the rich literature of the archive. As I began to work in the area of digital preservation 8 years ago several friends and mentors encouraged me to explore how thinking about the archive can inform digital repository systems. So here I am today. It’s nice to be here with you.

One thing from the panel description I’d like to focus on is:

the opportunity, perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems

I’d like to examine a specific, recent controversy, in order to explore how the dynamics of power and ethics impact the archive. I’m going to stretch our notion of what the archive is, perhaps to its breaking point, but I promise to bring it back to its normal shape at the end. I hope to highlight just how important archival thinking and community are to the landscape of the Web.


Perhaps I’ve just been perseverating about what to talk about today, but it seems like the news has been full of stories about the role of ethics in information systems lately. One that is still fresh in most people’s minds is the recent debate about Facebook’s emotional contagion study, where users’ news feeds were directly manipulated to test theories about how emotions are transferred in social media.

As Tim Carmody pointed out, this is significant not only for the individuals that had their news feeds tampered with, but also for the individuals who posted content to Facebook, only to have it manipulated. He asks:

What if you were one of the people whose posts were filtered because your keywords were too happy, too angry, or too sad?

However, much of the debate centered on the Terms of Service that Facebook users agree to when they use that service. Did the ToS allow Facebook and Cornell to conduct this experiment?

At the moment there appears to be language in the Facebook Terms of Service that allows user contributed data to be used for research purposes. But that language was not present in the 2012 version of the ToS, when the experiment was conducted. But just because the fine print of a ToS document allows for something doesn’t necessarily mean it is ethical. Are Terms of Service documents the right way to be talking about the ethical concerns and decisions of an organization? Aren’t they just an artifact of the ethical decisions and conversations that have already happened?

Also, it appears that the experiment itself was conducted before Cornell University’s Institutional Review Board had approved the study. Are IRBs functioning the way we want?


The Facebook controversy got an added dimension when dots were connected between one of the Cornell researchers involved in the study and the Department of Defense funded Minerva Initiative, which,

aims to determine “the critical mass (tipping point)” of social contagions by studying their “digital traces” in the cases of “the 2011 Egyptian revolution, the 2011 Russian Duma elections, the 2012 Nigerian fuel subsidy crisis and the 2013 Gazi park protests in Turkey.”

I’m currently reading Betty Medsger’s excellent book about how a group of ordinary peace activists in 1971 came to steal files from a small FBI office in Media, Pennsylvania that provided evidence for the FBI’s Counter Intelligence Program (COINTELPRO). COINTELPRO was an illegal program for “surveilling, infiltrating, discrediting, and disrupting domestic political organizations”. Medsger was the Washington Post journalist who received the documents. It’s hard not to see the parallels to today, where prominent Muslim-American academics and politicians are having their email monitored by the FBI and the NSA. Where the NSA is collecting millions of phone records from Verizon every day. And where the NSA’s XKeyScore allows analysts to search through the emails, online chats and browsing histories of millions of people.

Both Facebook and Cornell University later denied that this particular study was funded through the Defense Department’s Minerva Project, but did not deny that Professor Hancock had previously received funding from it. I don’t think you need to be an information scientist (although many of you are) to see how one person’s study of social contagions might inform another study of social contagion, by that very same person. But information scientists have bills to pay just like anyone else. I imagine if many of us traced our sources of income back sufficiently far enough we would find the Defense Department as a source. Still, the degrees of separation matter, don’t they?

Big Data

In her excellent post What does the Facebook Experiment Teach Us, social media researcher Danah Boyd explores how the debate about Facebook’s research practices surfaces a general unease about the so called era of big data:

The more I read people’s reactions to this study, the more I’ve started to think the outrage has nothing to do with the study at all. There is a growing amount of negative sentiment towards Facebook and other companies that collect and use data about people. In short, there’s anger at the practice of big data. This paper provided ammunition for people’s anger because it’s so hard to talk about harm in the abstract.

Certainly part of this anxiety is also the result of what we have learned our own government is doing in willing (and unwilling) collaboration with these companies, based on documents leaked by whistleblower Edward Snowden, and the subsequent journalism from the Guardian and the Washington Post that won them the Pulitzer Prize for Public Service. But as Maciej Ceglowski argues in his (awesome) The Internet With a Human Face:

You could argue (and I do) that this data is actually safer in government hands. In the US, at least, there’s no restriction on what companies can do with our private information, while there are stringent limits (in theory) on our government. And the NSA’s servers are certainly less likely to get hacked by some kid in Minsk.

But I understand that the government has guns and police and the power to put us in jail. It is legitimately frightening to have the government spying on all its citizens. What I take issue with is the idea that you can have different kinds of mass surveillance.

If these vast databases are valuable enough, it doesn’t matter who they belong to. The government will always find a way to query them. Who pays for the servers is just an implementation detail.

Ceglowski makes the case that we need more regulation around the collection of behavioural data, specifically what is collected and how long it is kept. This should be starting to sound familiar.

Boyd takes a slightly more pragmatic tack by pointing out that social media companies need to establish ethics boards that allow users and scholars to enter into conversation with employees that have insight into how things work (policy, algorithms, etc). We need a bit more nuance than “Don’t be evil”. We need to know how the companies that run the websites where we put our content are going to behave, or at least how they would like to behave, and what their moral compass is. I think we’ve seen this to some degree in the response of Google to the Right to be Forgotten law in the EU (a topic for another panel). But I think Boyd is right, that much more coordinated and collaborative work could be done in this area.

The Archive

So, what is the relationship between big data and the archive? Or put a bit more concretely: given the right context, would you consider Facebook content to be archival records?

If you are wearing the right colored glasses, or perhaps Hugh Taylor’s spectacles, I think you will likely say yes, or at least maybe.

Quoting from SAA’s What is an Archive:

An archive is a place where people go to find information. But rather than gathering information from books as you would in a library, people who do research in archives often gather firsthand facts, data, and evidence from letters, reports, notes, memos, photographs, audio and video recording.

Perhaps content associated with Facebook’s user, business and community group accounts are a form of provenance or fonds?

Does anyone here use Facebook? Have you downloaded your Facebook data?

If an individual downloads their Facebook content and donates it to an archive along with other material, does this Facebook data become part of the archival record? How can researchers then access these records? What constraints are in place that govern who can see them, and when? These are familiar questions for the archivist, aren’t they?

Some may argue that the traditional archive doesn’t scale in the same way that big data behemoths like Facebook or Google do. I think in one sense they are right, but this is a feature, not a bug.

If an archive accessions an individual’s Facebook data there is an opportunity to talk with the donor about how they would like their records to be used, by whom and when. Think of an archival fonds as small data. When you add up enough of these fonds, and put them on the Web, you get big data. But the scope and contents of these fonds fit in the brain of a person who can reason ethically about them. This is a good thing, at the heart of our profession, and something we must not lose.

When you consider theories of the postcustodial archive the scope of the archive enlarges greatly. Is there a role for the archivist and archival thinking and practices at Facebook itself? I think there is. Could archivists help balance the needs of researchers and records creators, and foster communication around the ethical use of these records? I think so. Could we help donors think about how they want their content to fit into the Web (when the time comes) by encouraging the use of creative commons licenses? Could we help organizations think about how they allow for people to download and/or delete their data, and how to package it up so it stands a chance of being readable?

Is it useful to look at the Web as One Big Archive, where assets are moved from one archive to another, where researcher rights are balanced with the rights of record creators? Where long term access is taken more seriously in some pockets than in others? I hope so. I think it’s what we do.

I haven’t mentioned Aaron, who worked so hard for these things. I could’ve spent the entire time talking about his work helping to build the Web, and interrogating power. Maybe I should have…but I guess I kind of did. There isn’t any time left to help Aaron. But we still have time to help make the Web a better place.

Paper Work

The connective quality of written traces is still more visible in the most despised of all ethnographic objects: the file or the record. The “rationalization” granted to bureaucracy since Hegel and Weber has been attributed by mistake to the “mind” of (Prussian) bureaucrats. It is all in the files themselves.

A bureau is, in many ways, and more and more every year, a small laboratory in which many elements can be connected together just because their scale and nature has been averaged out: legal texts, specifications, standards, payrolls, maps, surveys (ever since the Norman conquest, as shown by Clanchy, 1979). Economics, politics, sociology, hard sciences, do not come into contact through the grandiose entrance of “interdisciplinarity” but through the back door of the file.

The “cracy” of bureaucracy is mysterious and hard to study, but the “bureau” is something that can be empirically studied, and which explains, because of its structure, why some power is given to an average mind just by looking at files: domains which are far apart become literally inches apart; domains which are convoluted and hidden, become flat; thousands of occurrences can be looked at synoptically.

More importantly, once files start being gathered everywhere to insure some two-way circulation of immutable mobiles, they can be arranged in cascade: files of files can be generated and this process can be continued until a few men consider millions as if they were in the palms of their hands. Common sense ironically makes fun of these “gratte papiers” and “paper shufflers”, and often wonders what all this “red tape” is for; but the same question should be asked of the rest of science and technology. In our cultures “paper shuffling” is the source of an essential power, that constantly escapes attention since its materiality is ignored.

from Visualization and Cognition: Drawing Things Together by Bruno Latour

why @congressedits?

Note: as with all the content on this blog, this post reflects my own thoughts about a personal project, and not the opinions or activities of my employer.

Two days ago a retweet from my friend Ian Davis scrolled past in my Twitter stream:

The simplicity of combining Wikipedia and Twitter in this way immediately struck me as a potentially useful transparency tool. So using my experience on a previous side project I quickly put together a short program that listens to all major language Wikipedias for anonymous edits from Congressional IP address ranges (thanks Josh) … and tweets them.
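The heart of such a program is a simple test: is this edit anonymous, and does the editor's IP address fall inside a watched network range? Here is a minimal Python sketch of that check. The IP ranges and the shape of the `change` dictionary are illustrative assumptions for this post, not the actual configuration or code the bot uses.

```python
import ipaddress

# Illustrative placeholder ranges, not an authoritative list of
# Congressional networks.
WATCHED_RANGES = [
    ipaddress.ip_network("143.231.0.0/16"),
    ipaddress.ip_network("156.33.0.0/16"),
]

def is_watched(ip: str) -> bool:
    """Return True if the IP address falls inside any watched range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WATCHED_RANGES)

def tweet_worthy(change: dict) -> bool:
    """An edit is interesting if it is anonymous (Wikipedia records the
    editor's IP address as the username) and that IP is watched."""
    if not change.get("anonymous"):
        return False
    return is_watched(change["user"])

if __name__ == "__main__":
    print(is_watched("143.231.42.7"))  # True: inside a watched range
    print(is_watched("8.8.8.8"))       # False: outside
```

A real bot would feed this check from Wikipedia's recent-changes feed and post matching edits to Twitter; the filtering itself is just a few lines.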

In less than 48 hours the @congressedits Twitter account had more than 3,000 followers. My friend Nick set up gccaedits for Canada using the same software … and @wikiAssemblee (France) and @RiksdagWikiEdit (Sweden) were quick to follow.

Watching the followers rise, and the flood of tweets from them, brought home something that I believed intellectually but hadn’t felt quite so viscerally before. There is an incredible yearning in this country and around the world for using technology to provide more transparency about our democracies.

Sure, there were tweets and media stories that belittled the few edits that have been found so far. But by and large people on Twitter have been encouraging, supportive and above all interested in what their elected representatives are doing. Despite historically low approval ratings for Congress, people still care deeply about our democracies, our principles and dreams of a government of the people, by the people and for the people.

We desperately want to be part of a more informed citizenry that engages with our local communities, sees the world as our stage, and the World Wide Web as our medium.

Consider this thought experiment. Imagine if our elected representatives and their staffers logged in to Wikipedia, identified themselves much like Dominic McDevitt-Parks (a federal employee at the National Archives), and used their knowledge of the issues and local history to help make Wikipedia better? Perhaps in the process they enter into conversation on an article’s talk page with a constituent or political opponent, and learn something from them, or perhaps compromise? The version history becomes a history of the debate and discussion around a topic. Certainly there are issues of conflict of interest to consider, but we always edit topics we are interested and knowledgeable about, don’t we?

I think there is often fear that increased transparency can lead to increased criticism of our elected officials. It’s not surprising given the way our political party system and media operate: always looking for scandal, and the salacious story that will push public opinion a point in one direction, to someone’s advantage. This fear encourages us to clamp down, to decrease or obfuscate the transparency we have. We all kinda lose, irrespective of our political leanings, because we are ultimately less informed.

I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

I created @congressedits because I hoped it could engender more, better ideas and tools like it. More thought experiments. More care for our communities and peoples. More understanding, and willingness to talk to each other. More humor. More human.

I’m pretty sure zarkinfrood meant @congressedits figuratively, not literally. As if perhaps @congressedits was emblematic, in its very small way, of something a lot bigger and more important. Let’s not forget that when we see the inevitable mockery and bickering in the media. Don’t forget the big picture. We need transparency in our government more than ever, so we can have healthy debates about the issues that matter. We need to protect and enrich our Internet, and our Web … and to do that we need to positively engage in debate, not tear each other down.

Educate and inform the whole mass of the people. Enable them to see that it is their interest to preserve peace and order, and they will preserve them. And it requires no very high degree of education to convince them of this. They are the only sure reliance for the preservation of our liberty. – Thomas Jefferson

Who knew TJ was a Wikipedian…

MayDay - We Can Fix This Thing

We won’t get our democracy back until we change the way campaigns are funded.

TL;DR if you were thinking of supporting MayDay and have been putting it off please act by July 4th. Every contribution helps, and you will only be charged if they hit their 5 million dollar target (2 million to go right now as I write this). Plus, and this is a big plus, you will be able to tell yourself and maybe your grandkids that you helped make real political reform happen in the United States. Oh and it’s July 4th weekend, what better way to celebrate the independence we have left!

If you are reading my blog you are most likely a fan of Lawrence Lessig and Aaron Swartz’s work on Creative Commons to help create a content ecosystem for the Web that works for its users … you and me.

Before he left us, Aaron convinced Lessig that we need to get to the root of the problem, how political decisions are made in Congress, in order to address macro problems like copyright reform. For a really personal interview with Lessig that covers this evolution in his thinking, how it led to the Granny D inspired midwinter march across New Hampshire, and the recent MayDay effort, check out last week’s podcast of The Good Fight.

Or if you haven’t seen it, definitely watch Lessig’s 13 minute TED Talk:

If you are part of the 90% of Americans who think that our government is broken because of the money in politics, please check out MayDay’s efforts to crowdsource enough money to fund political campaigns that are committed to legislation that will change it.

It doesn’t matter if you are on the left or the right, or if you live in a red or blue state, or honestly whether you live in the United States or not. This is an issue that impacts all of us, and generations to come. We can fix this thing. But we have to try to fix it, we can’t just sit back and expect someone else to fix it for us. Lessig has a plan, and he’s raised 4 million dollars so far (if you include the previous 1 million campaign) from people like you who think he’s on to something. Let’s push MayDay over the five million dollar edge and see what happens next!


And as my friend Mike reminded me, even if you aren’t sure about the politics, donate in memory of Aaron. He was such an advocate and innovator for the Web, libraries and archives … and continues to be sorely missed. Catch the recently released documentary about Aaron in movie theaters now or for free on the Internet Archive since it is Creative Commons licensed:

No Place to Hide

No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State by Glenn Greenwald
My rating: 4 of 5 stars

I think Greenwald’s book is a must read if you have any interest in the Snowden story, and the role of investigative journalism and its relationship to political power and the media. Greenwald is clearly a professional writer: his narrative is both lucid and compelling, and focuses on three areas that roughly correlate to sections of the book.

The first (and most exciting) section of the book goes behind the scenes to look at how Greenwald first came into contact with Snowden, and worked to publish his Guardian articles about the NSA wiretapping program. It is a riveting story that provides a lot of insight into what motivated Snowden to do what he did. Snowden comes off as a very ethical, courageous and intelligent individual. Particularly striking were Snowden’s efforts to make sure that the documents were not simply dumped on the Internet, but that journalists had an opportunity to interpret and contextualize them, to encourage constructive discussion and debate.

In sixteen hours of barely interrupted reading, I managed to get through only a small fraction of the archive. But as the plane landed in Hong Kong, I knew two things for certain. First, the source was highly sophisticated and politically astute, evident in his recognition of the significance of most of the documents. He was also highly rational. The way he chose, analyzed, and described the thousands of documents I now had in my possession proved that. Second, it would be very difficult to deny his status as a classic whistle-blower. If disclosing proof that top-level national security officials lied outright to Congress about domestic spying programs doesn’t make one indisputably a whistle-blower, then what does?

This section is followed by a quite detailed overview of what the documents revealed about the NSA wiretapping program, and their significance. If you are like me, and haven’t read all the articles that have been published in the last year you’ll enjoy this section.

And lastly, the book analyzes the relationship between journalism and power in our media organizations, and the role of the independent journalist. The Guardian comes off as quite a progressive and courageous organization. Other media outlets like the New York Times and the Washington Post don’t fare so well. I recently unsubscribed from the Washington Post after vague feelings of uneasiness about their coverage, so it was good to read Greenwald’s pointed critique. Having just spent some time reading Archives Power, I was also struck by the parallels between positivist theories of the archive and of journalism, and how important it is to recognize how power shapes and influences what we write, or archive.

Every news article is the product of all sorts of highly subjective cultural, nationalistic, and political assumptions. And all journalism serves one faction’s interest or another’s. The relevant distinction is not between journalists who have opinions and those who have none, a category that does not exist. It is between journalists who candidly reveal their opinions and those who conceal them, pretending they have none.

The only reason I withheld the fifth star from my rating is that it would’ve been interesting to know more about Snowden’s departure from Hong Kong, his negotiations to seek asylum, and his relationship with Wikileaks and Sarah Harrison. Maybe that information wasn’t known to Greenwald, but it would’ve been interesting to have a summary of what was publicly known.

One thing that No Place to Hide really did for me was underscore the importance of privacy and cryptography on the Web and the Internet. This is particularly relevant today, exactly one year after Greenwald’s first Guardian article was published, and as many people celebrate the anniversary by joining the Reset the Net campaign. I haven’t invested in a signed SSL certificate yet, but I’m committing to doing that now. I’ve also recently started using GPGTools with Mail on my Mac. If you are curious about steps you can take, check out the Reset the Net Privacy Pack. In No Place to Hide, Greenwald talks quite frankly about how he found cryptography tools difficult to use and understand, how he got help in using them, and how essential these tools are to his work.

View all my reviews