Comprehensively Collecting the Web

I had some interesting conversations after my talk yesterday. In particular, Nicholas Taylor asked me a provocative question. He was playing devil’s advocate a bit, since I don’t think his own work as an archivist reflects this approach at all. I’m paraphrasing:

If we are concerned about systematic bias in archives, wouldn’t it be better for archivists to comprehensively collect in an area rather than being selective? If we attempt to collect all of something then we are less likely to project our own bias and preferences into our collections.

It’s a compelling logical argument that, on its face, doesn’t seem easy to refute. But I think it does come apart if we consider two things:

First, what area should the archivist collect in? If the archivist is going to comprehensively collect in an area like computer science, how do they choose that area? They must make choices about what to collect at some level, and if they aren’t prepared to say why one area over another, then how is the archive not biased?
The second was the topic of my talk yesterday, in which I tried to argue, somewhat obliquely, that the accelerated growth of the web and social media is a reflection of a capitalist mode of production that is literally eating our world alive. If every country on earth had a legal deposit program, and was somehow able to comprehensively archive its corner of the web, we would be greatly deepening this crisis.

So we must confront the fact that we have to make choices, and be prepared to talk about those choices. It’s ok for us to talk about our efforts to be comprehensive as long as that talk is tempered with a real discussion of the ways in which we are not comprehensive, and with documentation of the trade-offs. Maybe it’s easier just to abandon the idea of archives being comprehensive altogether, but archivists cling fiercely to this idea because their professional identity seems bound up with it.

Furthermore, if we are archiving the web, we must face up to the fact that we cannot archive it all, and that this is, in fact, a very good thing for our planet.

Of course this argument of mine is really nothing new, and is just a rehash of decades of discussion about appraisal in archives, in response to challenges from Ham (1981), Bearman (1989), and Hedstrom (1991) right at the dawn of the World Wide Web. I think we are still coming to terms with their work, and seeing how it fits with the World Wide Web we have today. It is particularly relevant as archives of the web are being used to create computational models and algorithms that we interact with in our daily lives (boyd & Crawford, 2012; Noble, 2018) and to make policy decisions (Eubanks, 2018).

References

Bearman, D. (1989). Archival methods. Archives and Museum Informatics, 3(1). Retrieved from http://www.archimuse.com/publishing/archival_methods/

boyd, danah, & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679.

Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.

Ham, F. G. (1981). Archival strategies for the post-custodial era. The American Archivist, 44(3), 207–216.

Hedstrom, M. (1991). Understanding electronic incunabula: A framework for research on electronic records. The American Archivist, 54(3), 334–354.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press.