Memento Bisect

Quite a few websites have decided to block OpenAI from crawling their content, including The New York Times and The Guardian. This is basically them saying: “Actually no, please don’t take our intellectual property and resell it, thank you very much.”

While reading these stories I thought it would be a bit of fun to use the Internet Archive’s Wayback Machine to identify approximately when the New York Times started blocking GPTBot by looking at the snapshots of their robots.txt file.

Once you know which versions to diff it’s pretty easy to see. But finding the right versions requires some wandering around in the list of snapshots. I finally came up with the two versions which I could download using the “raw” URL (note the id_ in the URL pattern), and then diff with a tool like vimdiff (RIP Bram Moolenaar):

$ wget -q -O robots-20230817.txt https://web.archive.org/web/20230817012138id_/https://www.nytimes.com/robots.txt
$ wget -q -O robots-20230818.txt https://web.archive.org/web/20230818012335id_/https://www.nytimes.com/robots.txt
$ vimdiff robots-20230817.txt robots-20230818.txt
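If you would rather script this step than open vimdiff, here is a minimal Python sketch of the same idea. The function names raw_snapshot_url and added_lines are made up for illustration and are not part of any tool mentioned here. The first rewrites a Wayback-style snapshot URL so that it carries the id_ modifier; the second lists the lines that only appear in the newer file.

```python
import difflib
import re


def raw_snapshot_url(url: str) -> str:
    """Put the id_ modifier after the 14-digit timestamp of a Wayback-style URL.

    Any existing two-letter modifier (such as mp_) is replaced.
    """
    return re.sub(r"/(\d{14})(?:[a-z]{2}_)?/", r"/\1id_/", url, count=1)


def added_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines that are present in new_text but not in old_text."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
    )
    # Keep "+" lines, but skip the "+++" file header that unified_diff emits.
    return [
        line[1:]
        for line in diff
        if line.startswith("+") and not line.startswith("+++")
    ]
```

Feeding the contents of the two downloaded robots.txt files to added_lines should surface whatever rules were introduced between the two captures.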

It occurred to me while doing this that it might be helpful to have a tool similar to git bisect that would methodically search the list of versions for a page in a particular web archive, looking for some text to be present or missing.

When mentioning this over in the Fediverse I learned about recent work that Lesley Frew has been doing at Old Dominion University to model and visualize the changes to websites that are stored in web archives (Frew, Nelson, & Weigle, 2023).

She has been studying how journalists and researchers, such as those associated with the Environmental Data Governance Initiative, need to be able to find and visualize changes to web content. EDGI’s Web Monitoring project has been using empirical data collected from the web to help them ground their research and reporting around environmental data access issues.

There are various services for tracking how web content changes over time. But they tend to be oriented around security use cases, and not questions about how discourse changes over time, sometimes beyond the document level. I’m interested to see where Frew’s work goes from here since it does seem like an underexplored HCI topic that is somewhat uniquely situated for hypermedia archives.

It is worth noting here that the Internet Archive’s Wayback Machine offers a useful “Changes” view, which lets you visualize the changes in a page over time, and then generate a diff of two selected versions. For example, here is the view for the NYTimes robots.txt file:

As useful as these views are, they don’t actually help us efficiently identify when a particular change was made. You would have to manually perform a binary search yourself across the versions to narrow down when a change might have happened.

In designing a solution Frew extended other technical work done to address full text search in web archives (SolrWayback). Having a text index of some kind seems like it would be essential for interactively analyzing and viewing a large number of documents and changes efficiently.

I was imagining something lighter weight that would use the Memento support (RFC 7089) in a web archive to dynamically get a list of the snapshots for a given web resource over time, also known as its “TimeMap”. With that list in hand, you can ensure it is sorted by date, and then perform a binary search for the snapshot where the text appears or goes missing. It’s worth pointing out that this approach is much slower than querying a prebuilt index, since each snapshot needs to be fetched on demand.
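To make the “get the list and sort it” step concrete, here is a rough sketch of turning a link-format TimeMap into a date-sorted list of snapshots. It is not the code memento-cli uses: parse_timemap and the two regular expressions are my own illustrative names, the sample TimeMap is hand-written rather than real archive output, and the regexes assume well-behaved input (no escaped quotes inside parameter values).

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime

# One entry looks like: <target>; name="value"; name="value"
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')

# Hand-written sample for illustration only, deliberately out of date order.
SAMPLE_TIMEMAP = """\
<https://www.nytimes.com/robots.txt>; rel="original",
<https://web.archive.org/web/20230818012335/https://www.nytimes.com/robots.txt>; rel="memento"; datetime="Fri, 18 Aug 2023 01:23:35 GMT",
<https://web.archive.org/web/20230817012138/https://www.nytimes.com/robots.txt>; rel="first memento"; datetime="Thu, 17 Aug 2023 01:21:38 GMT"
"""


def parse_timemap(text: str) -> list[tuple[datetime, str]]:
    """Return (datetime, snapshot URL) pairs for every memento, oldest first."""
    mementos = []
    for target, params_text in LINK_RE.findall(text):
        params = {
            name.lower(): quoted or bare
            for name, quoted, bare in PARAM_RE.findall(params_text)
        }
        # rel can hold several space-separated values, e.g. "first memento".
        rels = params.get("rel", "").split()
        if "memento" in rels and "datetime" in params:
            when = parsedate_to_datetime(params["datetime"])
            mementos.append((when, target))
    return sorted(mementos)
```

Note that the datetime values themselves contain commas, which is why splitting the TimeMap on commas does not work and the quoted values have to be matched as a unit.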

So this method is similar in principle to using git bisect to find when a bug was introduced. But instead of running a test to check whether the bug exists, the page can be evaluated by a person, or in the case of simple text search, the page can be rendered and searched.
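The search itself can be sketched in a few lines of Python. This is an illustration of the general technique rather than memento-cli’s actual implementation: bisect_change is a hypothetical name, and has_text stands in for whatever check you use, whether that is a person answering yes or no, or an automated text search. It assumes the change happened once between the two endpoints; if the text came and went several times, it will find one of the transitions but not necessarily the first.

```python
from typing import Callable, Sequence


def bisect_change(
    snapshots: Sequence[str],
    has_text: Callable[[str], bool],
    missing: bool = True,
) -> str:
    """Return the first snapshot where the text went missing (or appeared).

    snapshots must be sorted oldest first. With missing=True the text is
    expected in the oldest snapshot and absent from the newest; with
    missing=False it is the other way around.
    """
    def changed(snapshot: str) -> bool:
        return has_text(snapshot) != missing

    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots to bisect")
    lo, hi = 0, len(snapshots) - 1
    if changed(snapshots[lo]) or not changed(snapshots[hi]):
        raise ValueError("the endpoints do not bracket a change")

    # Invariant: snapshots[lo] is unchanged and snapshots[hi] is changed.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if changed(snapshots[mid]):
            hi = mid
        else:
            lo = mid
    return snapshots[hi]
```

Each pass halves the window, so even a list of a thousand snapshots needs only about ten page checks beyond the two endpoints, which matters when every check means fetching a page from an archive.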

The nice thing about using Memento instead of Wayback Machine specific URLs, or their CDX API, is that the same tool can work on other Memento-supporting web archives, like the UK Web Archive, or the Stanford Web Archive Portal.

I called the tool memento-cli which you can install:

$ pip install memento-cli

The simplest way to use the new memento command is with an archive URL (aka Memento) such as https://web.archive.org/web/20230407140923/https:/help.twitter.com/en/rules-and-policies/hateful-conduct-policy to list all the other snapshots that are available:

$ memento list https://web.archive.org/web/20230407140923/https:/help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2017-12-29 05:40:51 https://web.archive.org/web/20171229054051/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-03 20:03:00 https://web.archive.org/web/20180103200300/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-04 06:39:58 https://web.archive.org/web/20180104063958/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-06 16:08:07 https://web.archive.org/web/20180106160807/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 06:10:07 https://web.archive.org/web/20180112061007/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 17:40:16 https://web.archive.org/web/20180112174016/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 18:40:34 https://web.archive.org/web/20180112184034/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 19:11:48 https://web.archive.org/web/20180112191148/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:05:57 https://web.archive.org/web/20180120190557/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:19:20 https://web.archive.org/web/20180120191920/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
...

Since memento works with any archive that supports RFC 7089, you can use it to list versions in other web archives as well:

$ memento list https://www.webarchive.org.uk/wayback/archive/20130501020401/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-05-01 02:03:57 https://www.webarchive.org.uk/wayback/archive/20130501020357mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-05-01 02:04:01 https://www.webarchive.org.uk/wayback/archive/20130501020401mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-07-29 12:58:03 https://www.webarchive.org.uk/wayback/archive/20130729125803mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-07-29 12:58:06 https://www.webarchive.org.uk/wayback/archive/20130729125806mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2021-01-22 06:38:21 https://www.webarchive.org.uk/wayback/archive/20210122063821mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2022-03-14 16:36:16 https://www.webarchive.org.uk/wayback/archive/20220314163616mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/

Things get more interesting when you want to search for a change. For example let’s suppose you know that the Twitter Hateful Conduct Policy used to have language about:

women, people of color, lesbian, gay, bisexual, transgender, queer, intersex, asexual individuals

You can see it in the Internet Archive Wayback Machine in 2019. But you can’t see it on the page in 2023. This is not a contrived example.

To identify when the change was introduced, you can bisect the version history between those two snapshots, searching for the version where the text went missing. This will perform a binary search between the two versions looking for the given text to go missing, and should eventually return this snapshot: https://web.archive.org/web/20230408115900/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

$ memento bisect --missing --text "women, people of color, lesbian, gay" \
  https://web.archive.org/web/20190711134608/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy \
  https://web.archive.org/web/20230621094005/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

A (slow) video that shows the narrowing window of snapshots that are being searched as the binary search algorithm homes in on the snapshot.

The bisect command uses a browser behind the scenes (Selenium) in order to fully render the page by executing JavaScript. If you want to find out when some text appears (rather than goes missing), remove the --missing option from the command.

And if you would prefer to examine the pages in between manually, leave off the --text parameter and memento will prompt you to continue, and show you the browser it is controlling.

Finally if you would like to see the browser when using --text then use the --show-browser option.
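For pages that don’t depend on JavaScript, the text check can be approximated without a browser at all. The sketch below is not how memento bisect works (it renders the page with Selenium, as described above); page_has_text and VisibleText are hypothetical names, and the approach is a simplification that only looks at text present in the raw HTML, ignoring anything inside script or style elements and collapsing runs of whitespace before comparing.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def collapse(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def page_has_text(html: str, needle: str) -> bool:
    """Check whether needle occurs in the text of an HTML document."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart, at the cost of
    # splitting a word that is broken up by inline tags.
    return collapse(needle) in collapse(" ".join(parser.chunks))
```

A function like this could serve as the yes-or-no check in a binary search over snapshots, but it will miss any text that a page only inserts with JavaScript.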

Lesley Frew rightly pointed out to me that the tool could use the MemGator API to discover and search in snapshots across a large set of web archives, rather than just one web archive. I didn’t want memento-cli to be tied to a specific service like MemGator, but I think this would be a good option to add?

You may notice in the little video above that it can take a while to retrieve snapshot data from web archives. Fully rendering the page also takes time, since Selenium will wait until all the requests for resources (images, JavaScript, CSS, etc) have settled. Unfortunately this is just how many web archives function, and it is why a tool like memento bisect is helpful. Please let me know what you think if you do try it out.

PS. Parsing RFC 6690 formatted links in a TimeMap is not trivial, and probably is a barrier for Memento adoption by clients?

References

Frew, L., Nelson, M. L., & Weigle, M. C. (2023, April 30). Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives. arXiv. Retrieved from http://arxiv.org/abs/2305.00546