To support some thinking about the role of web archives in social practices that I've been doing with Jess Ogden and Shawn Walker, I thought it could be interesting to look at the rate of ingest in the Internet Archive's Save-Page-Now (SPN) functionality. SPN allows anyone with a web browser to go to the Internet Archive's Wayback Machine and add a webpage to the archive. According to Brewster Kahle (the Internet Archive's founder), SPN is currently seeing up to 100 URLs added every second:
[Embedded tweet from Brewster Kahle, May 10, 2018]
This is an astonishing rate for an archival process – but the Internet Archive isn't just any archive. Part of the reason for this is that the Internet Archive has made it extremely easy to add a URL via SPN. All it takes is a single HTTP GET request to the Wayback Machine's save endpoint.
PastPages has even created a little utility to save URLs from the command line, if that's your thing. I would guess that there is quite a bit of automation at work adding things to the Internet Archive. In fact, I know there is, because I've written some myself.
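To make the mechanics concrete, here is a minimal sketch of what such automation looks like, using only Python's standard library. The `web.archive.org/save/` path is the endpoint SPN exposes; the use of the `Content-Location` response header to find the new snapshot is how it has historically behaved, though that detail may change.

```python
from urllib.request import Request, urlopen

SPN_ENDPOINT = "https://web.archive.org/save/"

def spn_url(target):
    # The save endpoint takes the target URL directly as the path suffix.
    return SPN_ENDPOINT + target

def save_page_now(target):
    # Issuing the GET is what triggers the capture. Historically the
    # Content-Location header of the response has pointed at the new
    # /web/<timestamp>/<url> snapshot, but don't take that as gospel.
    req = Request(spn_url(target), headers={"User-Agent": "spn-example"})
    with urlopen(req) as resp:
        return resp.headers.get("Content-Location")
```

A cronjob wrapping a loop over `save_page_now` is all it takes to turn this into the kind of automated ingest discussed below, which is part of why the aggregate rate can climb so high.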
So what does the ingest rate look like over time? Here's what I was able to determine using metadata made publicly available via the Internet Archive's API. The short story is that while the SPN WARC data itself isn't publicly available (for privacy reasons, I suspect), the metadata for the items in this collection suffices for estimating the aggregate volume over time.
Head over to the Jupyter Notebook itself for the details of how the data was collected and the graph was generated.
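The core of the analysis is simple: bucket each item's size by the month it was added, then sum. Here is a sketch of that aggregation step. The `addeddate` and `item_size` field names mirror what the Internet Archive's search API returns for items, but treat them as an assumption here; the sample records are made up for illustration.

```python
from collections import defaultdict

def monthly_volume(records):
    """Aggregate item sizes (bytes) into per-month totals.

    `records` is assumed to be a list of item-metadata dicts, each with
    an ISO-style `addeddate` timestamp and an `item_size` in bytes
    (field names are an assumption, modeled on the IA search API).
    """
    totals = defaultdict(int)
    for rec in records:
        month = rec["addeddate"][:7]  # "YYYY-MM"
        totals[month] += rec["item_size"]
    return dict(totals)

# Hypothetical sample records, just to show the shape of the data.
records = [
    {"addeddate": "2018-04-03T12:00:00Z", "item_size": 500_000_000},
    {"addeddate": "2018-04-21T08:30:00Z", "item_size": 700_000_000},
    {"addeddate": "2018-05-01T09:00:00Z", "item_size": 900_000_000},
]
print(monthly_volume(records))
# → {'2018-04': 1200000000, '2018-05': 900000000}
```

Plot those monthly totals and you get a graph like the one in the notebook.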
The graph itself is clear evidence of the increased use of Save-Page-Now. There are a few anomalous bumps that would be interesting to zoom in on. But many people in the web archiving community probably already knew this intuitively, so by itself it's not a surprising result. Still, ~1.4 TB a month of user-initiated, participatory web archiving is very significant.
As Nick tweeted just after I posted these initial results, much of this activity could be automated:
"I have a few cronjobs that are probably in that graph :-D" — nick ruest, May 19, 2018
… and I'm sure Nick is not alone. But even with automation like a cronjob, or other processes that extract URLs from tweets or Wikipedia edits, it's important to remember that behind every piece of automation there are people making decisions. Yes, bots are people too.
In terms of archives, I see these practices of deciding which streams of data should flow into a web archive as appraisal decisions. It's possible, and useful, to think of them much like we have traditionally thought about appraisal techniques such as macro-appraisal, functional analysis, content analysis, the fat file method, sampling, documentation strategies, etc. But as with appraisal generally, the challenge is in documenting (think visualizing) these decisions, so that researchers can understand both the records that are there and those that are not.
Documentation of this kind is more difficult to generate because of how algorithms are developed and operate (Burrell, 2016; Seaver, 2017). This is where ideas about appraisal usefully intersect with the notion of provenance. Fortunately, folks like Emily Maemura are beginning to look at this very issue in the context of web archives (Maemura, Worby, Milligan, & Becker, 2018).
Burrell, J. (2016). How the machine thinks: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1).
Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If these crawls could talk: Studying and documenting web archives provenance. Retrieved from https://tspace.library.utoronto.ca/handle/1807/82840
Seaver, N. (2017). Algorithms as culture: Some tactics for the ethnography of algorithmic systems. Big Data & Society, 4(2).