Web Archives and QA

Here are some notes from today’s Digital Preservation Coalition meeting focused on quality assurance (QA) in web archives. Archives of web content often can encounter problems when the content is viewed in a replay system like OpenWayback or pywb. Noticing and then addressing these problems has been a bit of a known sore point for web archiving programs for quite some time. So it wasn’t surprising that the event was well attended, with about 130 people online when I looked.

The webinar was made up of three presentations from the British Library, the Library of Congress and the National Archives UK. Having the perspectives of smaller web archiving programs came up during the QA. I didn’t note the names of the presenters below but you can find those in the event description. I think it’s a common problem in web archiving discussions that the needs of large institutions, often with mandates to collect national regions of the web are considered.

British Library (UK Web Archive)

BL characterized their QA work in the UK Web Archive (UKWA) as “reactive”, where quality assurance often happens in response to a report by a curator that something is wrong. They used to have a button in their web interface that let people easily report problems with the replay of a web page but they had to turn it off since it generated so many reports!

They have 3 dedicated web archives staff, one of whom focuses on QA, but it wasn’t their only responsibility. Attention to time and resources spent on QA is always an issue. They have ~300 staff members who can annotate crawls–it wasn’t entirely clear what was meant by annotation, but I think they are going to be sharing information about that in the future?

Update: Andy Jackson pointed me at their previous work on W3ACT and indicated that they are in the process of thinking about separating the management of what to archive from the description of what has been archived.

Sometimes errors were related to the metadata for the crawl, which drove what did and did not get archived. So QA can sometimes lead to changing this metadata.

A Google Sheet is being used to check a site’s [sitemap] but it wasn’t entirely clear (to me) why this was relevant for QA.

They have a service status page that provides an indicator of whether the various software systems are online and operating, which can be an important factor in whether content is playing back.

Some class of QA errors are related to the originating web server not delivering the content to the web archiving crawlers. A good example of this is CloudFlare locking out clients because they have been identified as malicious actors due to the crawl rate.

Occasionally some errors are related to the versions of software that are being used for crawling and replay. Sometimes looking at versions on Github is needed, or consulting with engineers. It occurred to me later that perhaps this would be good information for the status page, to check if the software was up to date.

BL also have a Grafana based dashboard that provides some insight into Heretrix crawler logs. These reports can help diagnose if things are working correctly, but I didn’t catch any specific examples of that. A weekly crawl report from crawl engineers helps in determining some class of playback errors.

Openwayback is still being used in the reading rooms, and PyWB is used for open access web archive content. QA focuses on using PyWB, since it is typically better at playback, and BL is in the process of moving to PyWB in the reading rooms.

For diagnosing specific playback problems they often will open the browser dev tools window and look at the network traffic. Focusing on what resources are loaded from where, how they load in time, and which ones are not found (not HTTP 200 OK) can help identify problems. Returning to the live site and blocking resources at particular host names can be helpful in testing a theory about playback failure. If the same broken behavior can be seen on the live site after disabling that can be a good indicator that the missing resource is responsible for the problem.

When evaluating what to do in response to QA analysis it is sometimes important to consider whether the content is still online. If it’s not online, or it has changed substantially there’s not much that can be done to fix it since the content is now gone.

Library of Congress

LC is using an approach to QA that is guided by Ayala (2022) which focuses on issues of:

correspondence: similarity between the original and archived content
relevance: pertinence of the contents of an archived website and the original
archiveability: properties of the original website that make it easier or harder to archive

Maybe I missed it, but this third point didn’t really come up during this presentation. But it did come up in the discussion at the end.

LC pointed out that they were only talking about capture assessment which they think of as different from seed assessment. It wasn’t entirely clear how these were distinguished from each other.

People have the ability to score aspects of of web content from 1-5. I was listening at this point in the discussion and didn’t see what categories of things are evaluated. Results get fed into Jira. Scores less than 5 trigger some in depth QA. It wasn’t entirely clear what causes users to visit this data gathering tool.

Some detailed statistics were gathered. Apparently about 75% of the reports involved missing images. One wonders if it missing graphical components to the page instead of missing img tags. The team tries to respond via email. There are about 8 people who work on the Web Archiving Team (WAT), and they have recently trained ~50 people in how to evaluate websites.

I found the repeated mention of Grounded Theory in the presentation to be a little bit confusing. Grounded Theory is a qualitative research method which is used to develop a hypothesis or theory about a research question by gathering data and using inductive reasoning. It seems that Ayala (2020) found correspondence to be the most important aspect when evaluating the quality of web archives, and that Grounded Theory was the method that was used to discover that? I’m definitely going to give the article a read, and appreciated the pointer to it.

UK National Archives

[The UK Government Web Archive] (UKGWA) is part of the UK National Archives only collects UK government websites. Problems on high priority pages are given more QA attention. They have about 800 active sites, but more in the full archive.

QA is resource intensive so UKGWA has paid attention to automating QA tasks, and prioritizing QA analysis as much as possible. Content is prioritized for QA according to:

Newness: is the content new to the archive?
Is the site is from “central” government? What determines the centralness wasn’t really explained. *Update: David Underdown clarified for me what “central” means here.
Has the site been problematic in the past? If it has presumably that lowers the priority?
Is the content about to disappear or go offline? If so it’s probably more important to get to it faster?

They explicitly think about QA as part of the archiving process where QA analysis feeds into patch crawls which bring the content back up for QA. They have a nice diagram for this–although you have to wonder if it sometimes leads to doing a full crawl instead of a patch crawl.

It was asserted that almost anything can be automated, which is certainly a viewpoint, but I’m not sure I personally agree with it. At some level quality is a human measure. While tools can definitely help make the problem more manageable, the tools themselves are created by people, and have values about what matters encoded into their design, and are under constant revision as long as they are used.

UKGWA have released their automated QA tools on GitHub. This tool focuses on three areas:

log analysis: looking for 4XX and 5XX HTTP responses in Heretrix logs. Missing content gets added to a list and then tried again in a patch crawl. Presumably the WARC files could also be used for that since these responses should be written there too? This makes me think some generic tools that looked in WARC files and summarized problems would potentially be useful for QA. Maybe they already exist?
DOM rendering: pages are loaded with Screaming Frog and the list of resources is compared with what was crawled with Heretrix (their crawler of choice). This will catch when non-browser based crawling missing loading some resources because the JavaScript was not executed. This is less of an issue with browser based crawling like Browsertrix Crawler or Brozzler.
PDF analysis: they extract URLs from PDFs looking for additional resources that need to be archived. This is probably very important since they are focused on document centric government institutions. I’ve found URL parsing in PDFs to be a bit trickier than it seems because of page layout and line wrapping, but it would be good to take a closer look at what they are doing. Interesting work is going on elsewhere at ensuring that cited works in PDFs are adequately archived, so this work relates to that I think.

UKGWA have customized their Jira instance to create issues for these QA tasks, and to have the ability to trigger these QA processes from JIRA itself. It sounded like the log analysis is always run, but the DOM and PDF analysis is much more resource intensive and is run on a case by case basis through interacting with Jira.

Discussion

There was a bit of discussion after the three presentations. It might be worth watching the video when it comes online for that. One conspicuous topic that didn’t come up at all except briefly in the discussion was the use of screenshots. It seems that BL has been creating screenshots but isn’t currently using them in QA yet. It was pointed out that straight up comparison of screenshots of archived content with live web content probably wouldn’t suffice because web sites change over time. So that matching would need to be able to deal with some level of ambiguity. Perhaps removing boilerplate somehow prior to taking the screen shot could help. It could also hurt. People seemed to agree that semi-automated visual comparison is a viable topic for future research and practice in QA.

Overall this was a really useful session. It could have used the perspective of smaller organizations, but it can be hard to get people to talk honestly about web archives QA, so major props to the DPC for organizing this!

Thanks to Andy Jackson for feedback about the difference between the UKWA and the UKGWA which an earlier version of this post completely confused!

References

Ayala, B. R. (2020). Correspondence as the Primary Measure of Quality for Web Archives: A Grounded Theory Study. In M. Hall, T. Merčun, T. Risse, & F. Duchateau (Eds.), Digital Libraries for Open Knowledge (Vol. 12246, pp. 73–86). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-54956-5_6