Memento Bisect
Quite a few websites have decided to block OpenAI from crawling their content, including The New York Times and The Guardian. This is basically them saying: "Actually no, please don't take our intellectual property, and resell it, thank you very much."
While reading these stories I thought it would be a bit of fun to use the Internet Archive's Wayback Machine to identify approximately when the New York Times started blocking GPTBot by looking at the snapshots of their robots.txt file.
Once you know which versions to diff it's pretty easy to see. But finding the right versions requires some wandering around in the list of snapshots. I finally came up with the two versions which I could download using the "raw" URL (note the id_ in the URL pattern), and then diff with a tool like vimdiff (RIP Bram Moolenaar):
$ wget -q -O robots-20230817.txt https://web.archive.org/web/20230817012138id_/https://www.nytimes.com/robots.txt
$ wget -q -O robots-20230818.txt https://web.archive.org/web/20230818012335id_/https://www.nytimes.com/robots.txt
$ vimdiff robots-20230817.txt robots-20230818.txt
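If vimdiff isn't handy, the same comparison could be sketched in Python with the standard library's difflib. The `added_lines` helper is a name made up for this illustration, and it only reports lines that are new in the second file:

```python
import difflib


def added_lines(old_text, new_text):
    """Return the lines that are present in new_text but not in old_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes added lines with "+ " and removed lines with "- "
    return [line[2:] for line in diff if line.startswith("+ ")]


# Usage with the two files downloaded above:
#
#   from pathlib import Path
#   old = Path("robots-20230817.txt").read_text()
#   new = Path("robots-20230818.txt").read_text()
#   print("\n".join(added_lines(old, new)))
```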
It occurred to me while doing this that it might be helpful to have a tool similar to git bisect that would methodically search the list of versions for a page in a particular web archive, looking for some text to be present or missing.
When mentioning this over in the Fediverse I learned about recent work that Lesley Frew has been doing at Old Dominion University to model and visualize the changes to websites that are stored in web archives (Frew, Nelson, & Weigle, 2023).
She has been studying how journalists and researchers, such as those associated with the Environmental Data Governance Initiative, need to be able to find and visualize changes to web content. EDGI's Web Monitoring project has been using empirical data collected from the web to help them ground their research and reporting around environmental data access issues.
There are various services for tracking how web content changes over time. But they tend to be oriented around security use cases, and not questions about how discourse changes over time, sometimes beyond the document level. I'm interested to see where Frew's work goes from here since it does seem like an underexplored HCI topic that is somewhat uniquely situated for hypermedia archives.
It is worth noting here that the Internet Archive's Wayback Machine offers a useful "Changes" view, which lets you visualize the changes in a page over time, and then generate a diff of two selected versions. For example, here is the view for the NYTimes robots.txt file:


As useful as these views are, they don't actually help us efficiently identify when a particular change was made. You would have to manually perform a binary search yourself across the versions to narrow down when a change might have happened.
In designing a solution Frew extended other technical work done to address full text search in web archives (SolrWayback). Having a text index of some kind seems like it would be essential for interactively analyzing and viewing a large number of documents and changes efficiently.
I was imagining something lighter weight that would use the Memento support (RFC 7089) in a web archive to dynamically get a list of the snapshots for a given web resource over time, also known as its "Time Map". With that list in hand, you can ensure it is sorted by date, and then perform a binary search looking for the point where the text appears or goes missing. It's worth pointing out that this approach is much slower than querying a prebuilt index, since the data needs to be pulled on demand.
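To make the Time Map step concrete, here is a rough sketch of turning a link-format Time Map into a date-sorted list of snapshots. It is not how memento-cli does it; `parse_timemap` and the two regular expressions are assumptions for illustration, and they only handle the common case where every parameter value is double-quoted:

```python
import re
from email.utils import parsedate_to_datetime

# One entry looks like: <URL>; rel="memento"; datetime="Thu, 02 Jan 2020 03:04:05 GMT",
# Splitting on commas is not safe because URLs and dates can contain them,
# so match each <URL> together with the quoted parameters that follow it.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return a date-sorted list of (datetime, url) for the mementos in a Time Map."""
    mementos = []
    for url, raw_params in LINK.findall(text):
        params = dict(PARAM.findall(raw_params))
        # rel can hold several tokens, e.g. "first memento" or "last memento"
        rels = params.get("rel", "").split()
        if "memento" in rels and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), url))
    return sorted(mementos)
```

The Wayback Machine serves Time Maps at URLs of the form https://web.archive.org/web/timemap/link/ followed by the original URL, so the body of that response is what would be passed in as `text`.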
So this method is similar in principle to using git bisect to find when a bug was introduced. But instead of running a test to check whether the bug exists, the page can be evaluated by a person, or, in the case of simple text search, rendered and searched automatically.
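The search itself can be sketched in a few lines. This is only an illustration of the idea, not the code in memento-cli: `bisect_change` and `has_text` are hypothetical names, and the sketch assumes the snapshots are sorted by date, that the first one is in the "before" state and the last in the "after" state, and that the text changes state only once in between:

```python
def bisect_change(snapshots, has_text, missing=False):
    """Return the index of the first snapshot where the text has changed state.

    has_text(snapshot) should return True if the text is on the page. In a
    real tool this is the slow part: fetch or render the snapshot, then search
    it. With missing=True the search looks for the text disappearing rather
    than appearing.
    """
    want = not missing  # the "after" state: text present, or absent if missing=True
    lo, hi = 0, len(snapshots) - 1
    # Invariant: snapshots[lo] is in the "before" state, snapshots[hi] in the "after" state
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_text(snapshots[mid]) == want:
            hi = mid  # change happened at or before mid
        else:
            lo = mid  # change happened after mid
    return hi
```

Each pass halves the range, so a thousand snapshots need only around ten page loads, which matters when every load means a round trip to the archive.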
The nice thing about using Memento instead of Wayback Machine specific URLs, or their CDX API, is that the same tool can work on other Memento supporting web archives, like the UK Web Archive, or the Stanford Web Archive Portal.
I called the tool memento-cli which you can install:
$ pip install memento-cli
The simplest way to use the new memento command is with an archive URL (aka Memento) such as https://web.archive.org/web/20230407140923/https:/help.twitter.com/en/rules-and-policies/hateful-conduct-policy to list all the other snapshots that are available:
$ memento list https://web.archive.org/web/20230407140923/https:/help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2017-12-29 05:40:51 https://web.archive.org/web/20171229054051/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-03 20:03:00 https://web.archive.org/web/20180103200300/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-04 06:39:58 https://web.archive.org/web/20180104063958/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-06 16:08:07 https://web.archive.org/web/20180106160807/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 06:10:07 https://web.archive.org/web/20180112061007/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 17:40:16 https://web.archive.org/web/20180112174016/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 18:40:34 https://web.archive.org/web/20180112184034/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 19:11:48 https://web.archive.org/web/20180112191148/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:05:57 https://web.archive.org/web/20180120190557/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:19:20 https://web.archive.org/web/20180120191920/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
...
Since memento works with any RFC 7089 supporting archive you can use it to list versions in other web archives as well:
$ memento list https://www.webarchive.org.uk/wayback/archive/20130501020401/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-05-01 02:03:57 https://www.webarchive.org.uk/wayback/archive/20130501020357mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-05-01 02:04:01 https://www.webarchive.org.uk/wayback/archive/20130501020401mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-07-29 12:58:03 https://www.webarchive.org.uk/wayback/archive/20130729125803mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-07-29 12:58:06 https://www.webarchive.org.uk/wayback/archive/20130729125806mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2021-01-22 06:38:21 https://www.webarchive.org.uk/wayback/archive/20210122063821mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2022-03-14 16:36:16 https://www.webarchive.org.uk/wayback/archive/20220314163616mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
Things get more interesting when you want to search for a change. For example let's suppose you know that the Twitter Hateful Conduct Policy used to have language about:
women, people of color, lesbian, gay, bisexual, transgender, queer, intersex, asexual individuals
You can see it in the Internet Archive Wayback Machine in 2019. But you can't see it on the page in 2023. This is not a contrived example.
To identify when the change was introduced, you can bisect the version history to search for the version where the text went missing, using the two snapshots. This will perform a binary search between the two versions looking for the given text to go missing, and should eventually return with this snapshot: https://web.archive.org/web/20230408115900/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
$ memento bisect --missing --text "women, people of color, lesbian, gay" \
    https://web.archive.org/web/20190711134608/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy \
    https://web.archive.org/web/20230621094005/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
The bisect command uses a browser behind the scenes (Selenium) in order to fully render the page by executing JavaScript. If you wanted to find out when some text appears (rather than goes missing) then remove the --missing parameter from the command.
And if you would prefer to examine the pages in between manually, leave off the --text parameter and memento will prompt you to continue, and show you the browser it is controlling.
Finally if you would like to see the browser when using --text then use the --show-browser option.
Lesley Frew rightly pointed out to me that the tool could use the Memgator API to discover and search in snapshots across a large set of web archives, rather than just one web archive. I didn't want memento-cli to be tied to a specific service like Memgator, but I think this would be a good option to add?
You may notice in the little video above that it can take a while to retrieve snapshot data from web archives. Fully rendering the page also takes time, since Selenium will wait until all the requests for resources (images, JavaScript, CSS, etc) have settled. This is just how many web archives function unfortunately, and is why a tool like memento bisect is helpful. Please let me know what you think if you do try it out.
PS. Parsing the RFC 6690 formatted links in a Time Map is not trivial, and is probably a barrier to Memento adoption by clients?