Many thanks to the Congrés d’Arxivística de Catalunya for the invitation to come and talk about social media archives and the Documenting the Now project. This post was prepared in part for my remarks there, and also to summarize some recent work we have been doing on the Documenting the Now project to review the HCI literature around consent, particularly in light of archival practice.
In the second phase of the Documenting the Now project we are continuing to support and develop tools that embody ethical practices for social media archiving. This work is being informed by a series of workshops with activist communities, as well an effort to bring social media archiving practices into the college classroom. The structure of the project reflects an understanding that was developed early on in Phase 1, that one of the most significant challenges that social media archiving practice faces is how to meaningfully work with issues of consent. Consent in this essay is being conceived of not just in the legal sense, but more broadly as a communicative act that, when it works, can be morally transformative (Kleinig, 2010).
Simply because a user has quickly agreed to a social media platform’s Terms of Service does not mean that they have agreed to have their content collected and preserved for the long term in an archive. Consider the case of the not fully realized Library of Congress Twitter archive, in which users had no opportunity to opt-in or opt-out of having their tweets archived for perpetuity. This archive, and of late, an increased anxiety about the archival nature of social media, gave rise to a variety of services like Tweet Delete, Tweet Wipe and Tweet Deleter that allow you to routinely or fully purge your Twitter history. It’s productive to look at these services as a mirror of records management processes in more traditional archival settings, where appraisal and retention schedules determine what records are placed into an archive, how long particular records are to be held.
It is also interesting to speculate for a moment about how the Library of Congress Twitter partnership could have been significantly altered if Twitter had added a checkbox to the user settings page to indicate whether to archive a user’s posts and media at the Library of Congress. What should the default setting have been? How could it have been explained? In many ways this essay is a meditation on this very simple issue. You may think that consent is no longer an issue for the Library of Congress since they have stopped archiving all public tweets (Library of Congress, 2017):
As the twelfth year of Twitter draws to a close, the Library has decided to change its collection strategy for receipt of tweets on December 31, 2017. After this time, the Library will continue to acquire tweets but will do so on a very selective basis under the overall guidance provided in the Library’s Collections Policy Statements and associated documents. Generally, the tweets collected and archived will be thematic and event-based, including events such as elections, or themes of ongoing national interest, e.g. public policy.
But even when selectively archiving content from social media, doesn’t consent still play a significant role? If the content is created by public figures, such as elected officials, a strong case can be made that consent isn’t so much of an issue. After all, these are individuals that are familiar with being in the limelight, and often have teams of people that help them manage it. But this will not be the case for all people posting with an election hashtag. Furthermore, if only public figures’ tweets are to be made part of the historical record, doesn’t that privilege particular voices in society? Doesn’t that very same privileging shortchange the democratizing and transformational potential of having a social media archive in the first place?
These issues of consent were brought into stark relief for us, on a much smaller scale, as we collected 28,560,078 tweets surrounding the protests in Ferguson after the murder of Michael Brown, and the subsequent decision not to indict Darren Wilson, the police officer that killed him. We were interested in how these tweets from people on the ground mobilized media awareness of the events in Ferguson, and documented the subsequent state response. We were interested in how these individual and collective voices formed an essential part of the historical record, and as documentation for achieving social justice. Our work was operating out of a concern that if archivists aren’t able to collect this material when needed, then these voices and their actions would be lost to history, and forgotten.
However, as we learned, we were not the only actors who were interested in collecting this material. Law enforcement, and the technology companies that serve them, were actively collecting and analyzing Ferguson tweets too. Some of them contacted us directly asking for access to the tweets we had collected. These corporate and government actors were also interested in the voices of the Ferguson community, and the many people that traveled there to participate in the protests, but for purposes of political and social control–not social justice and history. They wanted to be able to use this data in order to predict when uprisings occur, in order to foreclose on them, and to identify, track and incarcerate the activists involved.
This should come as no surprise to archival practitioners and scholars who understand that archives are always instruments and sites for the exercise of power. Whether they are managed by an archivist at a large state institution, or a policy expert working at a social media company, or by a volunteer in a small community-based archive, or an individual scholar as they do field work on site, record keeping practices both enact power, and hold power accountable (Ketelaar, 2002).
Archival practices can reinforce and calcify dominant cultural forms, or augment, subvert, destabilize and overcome them by offering counterstories that surface submerged records, and present multiple narratives and truths (Dunbar, 2006). The key question for us then became, whose power is our Ferguson archive serving? As Dunbar writes, Critical Race Theory (CRT) provides a lens for examining what medial forms are permitted in an archive, and stresses the importance, and degree to which record creators can participate in the creation of archives:
CRT techniques of evidential rectifying could be useful to archival discourse in terms of broadening notions of what constitutes a record, the role of human subjects documented as co-creators of the record, and assumptions about archives and archivists as neutral third parties in the preservation and use of the record and other forms of historical evidence. (p. 114)
This analysis, and archival approaches to social justice more generally (Punzalan & Caswell, 2016 ; Flinn, 2011), informed our own work, but we were left with the conundrum of how to enact these principles while working with millions of records from hundreds of thousands of people.
Consent at Web Scale?
What does consent mean at the scale of the web? Is consent, as it is conventionally framed, even the right conceptual tool to center in our application designs? The expedient answer to these questions is to say consent can’t be done at scale and either a) ignore social media archiving, or b) say the ethics of consent don’t apply to “public” tweets, and to collect it all. And yet few Twitter users are aware that their public tweets could be used by researchers, and the majority felt that researchers should not be able to use tweets without consent (Fiesler & Proferes, 2018). So we chose instead to build on some of the privacy ensuring features in our tools by informing our design work by reviewing the archival and Human Computer Interaction (HCI) research literature around consent and social media data collection, to see what the state of the art was. Below are a few things we’ve found so far.
The general consensus in the literature, which many of us know from direct experience, is that Terms of Service (ToS), End User License Agreements (EULA), and the notice-and-consent model as currently construed, are ineffective for managing users expectations of how their data will be used. We all have the experience of signing up for a new service and being presented with a long document, which we must agree to in order to continue. McDonald & Cranor (2008) estimate that it would take approximately 201 hours per year for an individual (reading at 250 words per minute) to read all the Terms of Service documents they encounter in a year. In terms of the number of people online in the United States, and the average income, this expenditure in time would amount to $781 billion dollars per year. Most of us cannot make this kind of time commitment, and so the ToS documents go unread.
The purpose of these ToS documents is to reduce power asymmetries between corporate and government entities on the one side and individual actors on the other. Legal theories around privacy and consent assume an informed, rational and autonomous data subject. However, in a digital environment these assumptions have led to some rather pernicious side effects where individuals are inundated with increasingly opaque requests for consent. As noted by Schermr, Custers, & van der Hof (2014), increased regulation around consent, such as the General Data Protection Regulation (GDPR), can perversely lead to Consent Transaction Overload where users are desensitized to the importance of consent. Paternalistic approaches to consent can result in users blindly accepting cookie notifications, and clicking through consent transactions in order to get on with whatever task they are performing (Böhme & Köpsell, 2010).
Even if users were able to spend the time required to read ToS documents, it is unlikely that they would understand the full ramifications of their choices. Acquisti & Grossklags (2005) found that 41% of individuals with high privacy concerns admit that they don’t read privacy policies. Individuals generally lack sufficient information to make decisions about their privacy and often trade off their long-term privacy for short term benefits. As Schermer indicates, it is increasingly unlikely that we will be able to comprehend the processual flows of data bound up in the computer systems that EULA’s describe:
as data processing becomes more and more complex, more factors need to be taken into account. The result is that the reality of data processing will become even further removed from the simplistic mental models employed by data subjects.
Internet protocols allow data to be easily copied, and instantly transported at great distance, which can yield extremely complex, forking data flows, that can lead to a loss of control of information as it moves between contexts. Nissenbaum (2011) refers to this as the Transparency Paradox where the full picture of these data flows cannot be understood, let alone read, while on the other hand, summarizing practices remove details that could ultimately be quite important.
To further complicate matters, acts of consent happen at a particular point in time, like when signing up to use an online service or platform. Afterwards giving consent your data participates in information systems that then adapt and change as business models are adjusted or pivoted, or utterly transformed, as is the case when businesses go bankrupt or are acquired. Custers (2016) notes that a click should not confer consent forever, and that perhaps EULA’s need expiration dates that require re-consenting, particularly at significant junctures.
Finally our ethical frameworks, such as the [Belmont Principles], and Institutional Review Boards (IRB) that protect the privacy of research subjects are being strained to the breaking point by online research. Overly simplistic notions of what constitutes public and private on the web can lead to situations such as when researchers at Aarhus University released a dataset of personal profiles collected from the OKCupid dating website. This data was released as a public dataset, was not anonymized, and included information such as geographic location and sexual preference, which put many individuals at risk (Zimmer, 2016). However many IRB processes will allow such research to take place because the data is “public”. Despite this example, Vitak, Shilton, & Ashktorab (2016) have found that in fact there are emerging ethical research practices, which suggest three areas to extend frameworks like the Belmont Principles:
- Transparency with participants.
- Ethical deliberations with colleagues.
- Caution with sharing with results.
Admittedly, there are often disciplinary pressures that limit transparency with participants, as is the case when researchers are concerned with potential Observer Effect. Never the less, it is generally good practice to inform participants when their data is being collected, even when it is perceived to be public and on the web. The challenge is how to do it, especially when bulk data collection is at play. While these suggestions speak specifically to the research community they also suggest productive ways that archives can transform their own practice, and operate as trusted places where this research can happen. In what ways can our data collection methodologies and tools, such as the ones we are building in the Documenting the Now project, embody these emerging principles?
Two other important themes that emerge from the HCI literature around privacy are that 1) privacy is context dependent, and 2) online spaces themselves provide a context that can shape user’s perceptions and decisions. Nissenbaum’s idea of Contextual Integrity has been hugely influential because of her insight that privacy is a function of appropriate flows of data, that are determined by contextual norms. For example your expectation that your doctor may share your health records with another doctor to assist in treatment, but will not sell it to a marketing company for profit Nissenbaum (2011). These norms of behavior when they happen online are not generalized or monolithic, but are as varied as the privacy contexts we find ourselves in everyday:
Not only is life online integrated into social life, and hence not productively conceived as a discrete context, it is radically heterogeneous, comprising multiple social contexts, not just one, and certainly is not just a commercial context where protecting privacy amounts to protecting consumer privacy and commercial information … the contexts in which activities are grounded shape expectations that, when unmet, cause anxiety, fright, and resistance.
The central question for Nissenbaum then is identifying the relevant contextual norms for a given application domain, and then using that understanding to craft the appropriate flows of data. Hutton & Henderson (2015) have found that contextual norms can actually be put to use in calibrating the consent transactions that take place online. They distinguish between secured consent, where consent is granted once, and sustained consent, where consent is continually requested of an individual, in order to highlight the ineffectiveness of both approaches with respect to the burden they place on the individual, and the validity of the consent that is granted. These two factors, burden and validity, are linked together: attempts at attaining greater validity require increased burden on the consenter (consent transactions)…but lowering this burden, also lowers the validity, making the consent granted increasingly meaningless. Instead, they suggest a middle ground of contextual integrity consent be used, which reduces the number of consent requests that are made based on the how well the consenter’s previous answers fit in with established norms. The difficulty here is determining what these norms are, and the degree to which they are being shaped by the online environments themselves, and people operating within them.
Indeed, the design of online environments, their interfaces, actions, affordances, and protocols all work to fashion the experience of privacy online. In a recent extensive review of the design literature around privacy Acquisti et al. (2017) summarize the many ways that behavioral economy and experimental psychology are being used to delimit and influence privacy decisions taken online using a technique they call nudges.
Every design choice is a nudge.
Whether they are aware of it or not, the designers and implementors of online systems present and highlight particular privacy options for their users. One way of understanding these nudges is by considering the extensive use of A/B testing in user experience design, which allows platforms like Google, Facebook and Netflix to constantly evolve their platforms to maximize desired outcomes using small improvements. Acquisiti et. al. document a large set of nudge types that can be used to influence user behavior when it comes to making privacy decisions. Below is a quick overview of these types of nudges with some potential ways in which our social media archiving applications can incorporate them.
Nudging with Information
Users are more likely to approve requests for their information when they are provided with information about why the request is being made. Furthermore, people are likely to adjust their behavior regarding the privacy of others when they are supplied information such as the privacy preferences of others (Könings, Thoma, Schaub, & Weber, 2014). There has been some experimentation with broadcasting privacy preferences using techniques such as PrivIcons, which are bits of metadata that are sent along with email messages, which can then be rendered by an email client (König & König, 2011). These icons allow email senders to communicate their preferences about how the receivers should treat the email they are sending.
Alexandra Dolan-Mescal of the Documenting the Now project is currently experimenting in this area with our Social Humans labels which are designed for creators and archivists to use when they share and archive social media content. Sample actions include delaying access to one’s content, or expiring access after a particular time, removing media, or requesting credit.
While communicating information is important, Acquisiti et al. stress that users need some measure of feedback as they make decisions about their privacy. Components like password meters help users ascertain the strength of the password they are choosing. Whereas browser extensions like Lightbeam and Ghostery allow users to see contextual information about how a website is sharing information with third parties when visiting a web site.
It is important for us on the Documenting the Now project that tools such as the DocNow application allow users to see what data has been collected from them already, and giving them controls over what kinds of content will be collected, which will then re-present those decisions. Information such as who is doing the data collection and why will be extremely important for users trying to
Nudging with Presentation
Deeply related to information and feedback is the presentation of privacy controls. This includes how the data collection is being framed, how the different components of the privacy concerns are ordered in terms of their saliencey, and also the structure of those controls. Framing is particularly important for archiving social media because this is not the usual request for consent for research. The job of the collecting software is to adequately communicate why the collection is being built: is it a large research university collecting all your tweets about Ferguson for sociological research? Or is it a the Ferguson Public Library who wants to collect specific tweets you sent with videos of activist speech? The framing matters, as does the order in which options are given, and their clarity.
Nudging with Defaults
It goes without saying that default behaviors often go unchanged. This is because people often don’t investigate their settings at all. So it is important that application’s chosen defaults reflect the intended values. If our tools prioritize the needs of the data collector the defaults will be set differently than if they instead centered the needs of the data subject. As with concerns about presentation, it matters how and when the default settings are communicated. Rather than having mass defaults, where everyone gets the same defaults, it is also possible to use contextual factors to set them.
For example in their pioneering report Beyond the Hashtags Freelon, McIlwain and Clark decided not only link to tweets in their publications when the user had 3,000 or more followers or were Twitter verified (Freelon, Mcilwain, & Clark, 2016). Setting and communicating default behaviors appropriately is something our tools can do to respect not only the privacy of individuals, but also to honor their content, especially when the context is long term preservation.
Nudging with Incentives
When presenting users with decisions about how their content can be collected and used in an archival setting it is crucial to provide them with reasons why this might be beneficial to them. For example, if a public library is collecting photographs about a local parade for a collection they are creating, the user could be presented with information about the collection that their photographs will be added to, and give them a sense of how the collection has been used to enrich the local history. Incentives are intrinsically related to how the data collection is being framed.
On the other hand, if the data collection interface requires a user to perform repetitive or costly tasks in order to respond to a request for consent, then the application is punishing the user instead of providing incentives. For example if the user must click through and consent for every photograph being collected, and then must do the same thing for the textual content, this is unlikely to result in either the data collector or the data subject in achieving a satisfying result. Interfaces must be ergonomically designed to incentivize both the collectors and the collected to work together.
User decisions about what social media content to share and who to share it with will change over time. We see this reflected in applications like the previously mentioned TweetDelete which allows users to retroactively remove their publicly posted tweets, or to selectively purge them on a schedule. Application behaviors and controls that facilitate these interactions can lead to increased and more meaningful trust when it comes to collecting, using and archving people’s data. For example, if users are able to consent to having their data collected for a particular period, and then to turn it off.
When studying user behaviors around tagging on Facebook, Besmer & Lipford (2010) found that tensions that exist between the tagger and the tagged can be mitigating by bringing out of band communication into the application. Their experimentation with Restricting Others functionality allowed tagged users to send a message to the tagger indicating that they would like to hide their tag from other users or sets of users. This allowed the content to continue to circulate, but not as widely as before, and tended to prevent tagging wars where a user removes their name from a photo only to have it placed on the photo again.
What types of communications can be enabled between content creators and archivists that can facilitate meaningful preservation of content that accounts for both the privacy of the creators, and their wish to contribute that content to a particular collection?
Timing of Nudges
Finally, nudges happen at particular points in time, and their timing is an important factor when considering how users make decisions about their privacy online. If too many requests are clustered in a particular period they will likely be ignored, or reacted to negatively. But if they are too infrequent, or happen at too great a remove in time the consent that is provided will be less meaningful, as context of particular content has shifted. The time factor of privacy nudges directly recalls the relationship between burden and validity that was discussed above Hutton & Henderson (2015).
Timing is especially important for archival collections because the value proposition offered by an archive is fundamentally about time. Social media content is created with the expectation that it will be momentary, fleeting and temporary rather than something permanent like an archival record. Communicating the context and purpose of social media content in an archive, is a challenge. It would be useful if archives could communicate with content creators in a timely manner, when users are actively documenting events of interest to an archive. But it may be important for content creators to revisit those collections later, to see what was collected, and perhaps revise or augment their decisions.
Appraisal & Values
Let’s bring the themes of contextual integrity and privacy nudges back together, and make some connections for archival practice and tool design. First, it is essential to dispel any notions that tech alone is going to save us. It won’t. Acquisti et al.’s detailed cataloging of privacy nudges clearly points out that these nudges can be used to promote user privacy…and also to take it away. Information feedback, presentation, incentives, reversibility, and timing can be used to privilege the data collector, and disadvantage the content owner. In many instances tech is actually the problem, because it is being mobilized to further the needs of surveillance capitalism (Zuboff, 2015). The website Dark Patterns attempts to use social media itself to identify and share awareness of these design exploits that are being used by companies to extract data from us online. But sometimes it’s easier to describe what not to do, than it is to find productive ways forward. This is where context helps.
The norms of information sharing in our lived experience should guide our efforts to collect social media for archives. Of course, there are many different types of archives. There are government archives, and corporate archives, and community-based archives, and even personal archives. Each of these types of archive operate in different contexts, with different sets of values. As Nissenbaum’s work makes clear, it is a category error to treat all online data flow behaviors uniformly, and the same is true of archives. Each archive has its own context, and while multiple archives might share a context, we must be on guard against frameworks that entirely collapse archival ethics and values. Especially pernicious is the idea that our archives provide neutral spaces. This is why appraisal, both its theory and practice, matter more than ever when it comes to building collections of social media content. Appraisal is all about aligning organizational values with the record creators and their records (Cook, 2005 ; Booms, 1987 ; Samuels, 1986). Appraisal is all about preserving content in the knowledge that not all content can be saved. Appraisal is all about understanding the context of information, which allows it to move it into, and out of, the contexts of preservation and access (McKemmish, Upward, & Reed, 2010). And thus, reconnecting with appraisal is fundamental to understanding context and privacy, in order to do meaningful social media archiving (Bingo, 2011 ; Henttonen, 2017).
For the Documenting the Now project we have become increasingly focused on the needs of activist community-based archives. When we started the project in 2015 we brought together a variety of stakeholders, including archivists, activists, journalists, researchers, technologists, and lawyers to understand the particular dynamics of social media archiving in Ferguson. Our initial efforts at tool design were aimed at all of these contexts, and as a result we fell short of most of them. We also came to realize that there were in fact already tools for extracting social media content that were available to institutional archives. This has led us to focus specifically on the needs of memory work in the context of community-based archives, especially ones that are doing social justice work, and are operating outside of institutional contexts. We are currently planning a series of workshops that put us into collaboration with these activist groups, to explore what social media archives might mean for them in their work. In these workshops we will be working with activists to think about how their social media materials can be placed into new preservation contexts like Mukurtu, which allow media objects to circulate relative to the values of particular communities (Christen, 2015).
As we design tools that help us do this work it is important to think about the specific needs for consent and privacy, and how they may require new approaches to Human Computer Interaction. Hutton & Henderson (2017) suggest that when designing information systems for consent we must reorient our efforts around Human Data Interaction instead of HCI (Mortier, Haddadi, Henderson, McAuley, & Crowcroft, 2014 ; Crabtree & Mortier, 2015). HDI centers questions of legibility, agency and negotiability. We must make our data archiving activities legible to the people who are having their data archived. One way of insuring this is for archivists to operate as members of the communities they are documenting. But in situations where we are archiving other people’s data we have a responsibility to make our activities discoverable, and engage meaningfully with the communities we are documenting, to allow them to meaningfully participate in, and benefit from, those activities (agency). And finally we must recognize that decisions about privacy happen in time, and that data can shift from one context to another. Providing people with ways to respond to requests for their content in time, but also to give notifications to users about how their content is being used is something we can do to build solidarity.