The day-to-day reality of the COVID-19 crisis hit home three weeks ago. I’ve been trying to stay focused on the work I had going on before the pandemic…but it has been difficult. It’s really hard not to attend to things that seem relevant to this time we are living through.
In many ways COVID-19 feels like a moment of clarity, for re-examining how we work, and how we live. It is a singular opportunity for collectively recognizing that our lives can, and must, change–especially in light of other crises that demand global attention and cooperation like climate change. Our everyday activities are always being refashioned and rearranged by events. But, unfortunately, COVID-19 is also an opportunity for powerful interests to capitalize on this moment as well. As our infrastructures show their seams, break down, and transmute, our political and economic worlds are being remade.
If you’ve been watching this blog for the past few years you may have noticed that I’ve been interested in how practices of web archiving and archival appraisal meet, and where they don’t meet at all. The web is a big place, but the universe of documentation has always been a pretty big place. So how do we decide what to collect when the web makes everything instantly available, but collecting everything just isn’t realistic? How do our decisions about what to collect reflect (and create) technologies for appraisal that become expressions of our values?
A few weeks ago I learned about an ongoing effort by the International Internet Preservation Consortium (IIPC) to create a collection of web materials in their Novel Coronavirus (COVID-19) Archive-It collection. The IIPC made a nomination form using Google Forms for anyone to submit web content to be archived. These nomination forms have become a kind of standard practice for collaborative web archiving, and in this case are being used to drive Archive-It’s web crawling activities.
You can see previous collaborative web archiving efforts by the IIPC in a list of collections. As of this writing, the IIPC homepage reports that 2573 “sites” have been archived, from 30 languages. It seems fitting that an international organization like the IIPC would focus on the topics like the Olympics, climate change, and the refugee crisis. But how are sites being selected within those topic areas, and who is doing the selecting?
This IIPC collecting effort really spoke to me because it is the type of work we’ve been doing in the Documenting the Now project, since its beginnings in 2014 after the murder of Michael Brown in Ferguson, Missouri. One of the things that Ferguson made clear to us was that a large amount of documentation work was happening in social media, and that it was important to meaningfully engage with social media as a tool for appraisal when documenting an event like Ferguson.
And so thinking about this IIPC COVID-19 collection I became curious about how it intersects with other collecting efforts that are happening. For example a few weeks ago I learned in an email from Amelia Acker that a group of people were using GitHub to collect URLs for Chinese language stories about the COVID-19 crisis in the nCovMemory GitHub repository. While its unclear why GitHub was specifically chosen, media scholar Christina Xu points out that GitHub is difficult to censor in China, not for technical reasons, but for social ones.
People are also circulating this link to Github as a place to share journalism from the outside and personal accounts. Historically, Github has been hard to block in China because so much of the tech industry relies on it. https://t.co/fBycxNi77r— Christina Xu ((???)) February 6, 2020
Escaping censorship could have been a crucial factor in curating this collection on GitHub. Being able to easily copy the data within GitHub as forks (there are currently over a thousand) and externally as full clones are also affordances that GitHub (the platform) and Git (the technology) both provide. Amelia was rightly concerned about sharing the repository URL publicly on social media, not wanting to draw draw unwanted attention to its authors. But now that information about the repository has spread widely on Twitter, and there has even been a New York Times video documentary that mentions it I think those concerns are less pressing now.
The repository can be edited directly on GitHub, or contributors can submit content via their issue tracker and another project member will integrate it in. The repository also is used to make a static website available. Many of these stories document the Chinese government’s response to the crisis, and contributors are actively archiving content using archive.is as part of their process.
In addition people are finding and sharing web content using platforms like Pinboard and Reddit. Milligan, Ruest, & Lin (2016) have described some of the benefits of looking at social media streams like Twitter as a source of URLs–of leveraging the “masses” in addition to the specialized knowledge of content specialists like archivists. With over a 1.7 million members, I think the Coronavirus subreddit definitely qualifies as “the masses”. But nCovMemory and Pinboard have much smaller numbers of contributors, and it appears that they may be specialists in their own right.
So, I thought it would be interesting to look at these sites as potential sources of content that might be in need of archiving. But its important to be a bit more precise with language here, because “archiving” really encompasses preservation, description and access. Making sure there is a copy of the content at a URL is fundamental, but its only part of the work. As Maemura, Worby, Milligan, & Becker (2018) highlight, it’s extremely important to document how the content came to be archived if it is going to be used later. Researchers need to know why content was selected for the collection. So in the IIPC’s case, the spreadsheet that sits behind their Google Form is an essential part of the archive.
But lets return to the case of nCovMemory, Pinboard and Reddit. What would it mean to use these sites to help us document COVID-19? Part of the problem is knowing what URLs are found in these web platforms. Also the platforms are in constant motion so the URLs they make available are constantly changing. While it might be possible to build a generic tool that interfaces with APIs from Twitter, Reddit and Pinboard, I think there will always be ad hoc work to do, as is the case with nCovMemory. Another problem that dedicated applications or tools do is that they tend to obscure their inner workings, and present results in shiny surfaces that don’t reflect the many decisions that went into deciding what is (and is not) there.
Out of habit I started working locally on my laptop in a Jupyter Notebook to see what the IIPC Collection seeds (URLs for archiving) looked like using Archive-It’s Partner API (some of which is public). You can see that notebook here. I then moved on to a nCovMemory notebook, a Reddit notebook, and then a Pinboard notebook. You will see that each notebook fits the kinds of data each platform provides, as well as their authentication methods, and API surfaces.
But one nice thing about working in Jupyter, rather than creating an “application” to perform the collection and present the results, is that notebooks center writing prose about what is being done and why. You can share the notebooks with others as static documents that become executable again in the right environment. For example you can view them all as documents on GitHub launch my notebooks over in MyBinder, where you can rerun them to get the latest results, or run additional analysis and visualization.
It only occurred to me as I was in the middle of putting these notebooks together that just as the IIPC seed spreadsheets are part of the provenance of their collection, these notebooks could serve as documentation of the provenance of how a COVID-19 web archive (or collection if you prefer) came into being.
We’ve seen significant effort by the Library of Congress to to explore their web archives, sometimes using Jupyter notebooks. Recently Andy Jackson and Tim Sherrat announced that they are going to be working on building out practices for exploring history using Jupyter and web archives (see the IIPC Slack for a window into this work). But perhaps Jupyter also has a place when documenting how a collection came into being in the first place? What would be some useful practices for how to do that? Netflix has written about how they use Jupyter notebooks not only for doing data visualization, but as a unit of work in their data analysis pipelines. I think we must consider adding to our toolbox of web appraisal methods, to do more than simply ask people what they think should be archived, and to factor in what they are talking about, and sharing. Using Jupyter notebooks could be a viable way of both doing that work and providing documentation about it.
Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If these crawls could talk: Studying and documenting web archives provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. Retrieved from https://tspace.library.utoronto.ca/handle/1807/82840
Milligan, I., Ruest, N., & Lin, J. (2016). Content selection and curation for web archiving: The gatekeepers vs. the masses. In Proceedings of the Joint Conference on Digital Libraries. Retrieved from https://cs.uwaterloo.ca/~jimmylin/publications/Milligan _etal _JCDL2016.pdf