Quite a few websites have decided to block OpenAI from crawling their content, including The New York Times and The Guardian. This is basically them saying: “Actually no, please don’t take our intellectual property and resell it, thank you very much.”
While reading these stories I thought it would be a bit of fun to use the Internet Archive’s Wayback Machine to identify approximately when the New York Times started blocking GPTBot by looking at the snapshots of their robots.txt file.
Once you know which versions to diff it’s pretty easy to see. But finding the right versions requires some wandering around among the snapshots. I finally came up with the two versions, which I could download using the “raw” URL (note the id_ in the URL pattern), and then diff with a tool like vimdiff (RIP
$ wget -q -O robots-20230817.txt https://web.archive.org/web/20230817012138id_/https://www.nytimes.com/robots.txt
$ wget -q -O robots-20230818.txt https://web.archive.org/web/20230818012335id_/https://www.nytimes.com/robots.txt
$ vimdiff robots-20230817.txt robots-20230818.txt
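If you would rather script the comparison than eyeball it in an editor, something like the following Python sketch could list the lines that were added between two downloaded versions. It uses the standard library's difflib; the added_lines helper and the sample robots.txt bodies are made up for illustration and are not the real NYT files.

```python
import difflib


def added_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines that ndiff reports as added in new_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes each added line with "+ "; strip the two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]


# Made-up robots.txt bodies for illustration, not the real NYT files.
old = "User-agent: *\nDisallow: /ads/\n"
new = old + "User-agent: GPTBot\nDisallow: /\n"

for line in added_lines(old, new):
    print(line)
```

In practice you would read the two files fetched with wget instead of using inline strings.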
It occurred to me while doing this that it might be helpful to have a tool similar to git bisect that would methodically search the list of versions for a page in a particular web archive, looking for some text to be present or missing.
When mentioning this over in the Fediverse I learned about recent work that Lesley Frew has been doing at Old Dominion University to model and visualize the changes to websites that are stored in web archives (Frew, Nelson, & Weigle, 2023).
She has been studying how journalists and researchers, such as those associated with the Environmental Data Governance Initiative, need to be able to find and visualize changes to web content. EDGI’s Web Monitoring project has been using empirical data collected from the web to help them ground their research and reporting around environmental data access issues.
There are various services for tracking how web content changes over time. But they tend to be oriented around security use cases, and not questions about how discourse changes over time, sometimes beyond the document level. I’m interested to see where Frew’s work goes from here since it does seem like an underexplored HCI topic that is somewhat uniquely situated for hypermedia archives.
It is worth noting here that the Internet Archive’s Wayback Machine offers a useful “Changes” view, which lets you visualize the changes in a page over time, and then generate a diff of two selected versions. For example, here is the view for the NYTimes robots.txt file:
As useful as these views are, they don’t actually help us efficiently identify when a particular change was made. You would have to manually perform a binary search yourself across the versions to narrow down when a change might have happened.
In designing a solution Frew extended other technical work done to address full text search in web archives (SolrWayback). Having a text index of some kind seems like it would be essential for interactively analyzing and viewing a large number of documents and changes efficiently.
I was imagining something lighter weight that would use the Memento support (RFC 7089) in a web archive to dynamically get a list of the snapshots for a given web resource over time, also known as its “TimeMap”. With that list in hand, you can ensure it is sorted by date, and then perform a binary search looking for the point where the text appears or goes missing. It’s worth pointing out that this approach is much slower than querying a prebuilt index, since the data needs to be pulled on demand, instead of assembling a database to be queried.
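To make the idea concrete, here is a minimal Python sketch of the binary search step. It is not the code from the tool described below: bisect_snapshots and has_text are hypothetical names, and the sketch assumes the snapshots are already sorted oldest to newest and that the text is present in every snapshot before the change and absent in every one after it.

```python
from typing import Callable, Optional, Sequence


def bisect_snapshots(
    snapshots: Sequence[str],
    has_text: Callable[[str], bool],
) -> Optional[int]:
    """Return the index of the first snapshot where has_text() is False.

    Assumes snapshots are sorted oldest to newest and that the text
    disappears once and stays gone. Returns None if the text is still
    present in the newest snapshot (or if there are no snapshots).
    """
    lo, hi = 0, len(snapshots)
    while lo < hi:
        mid = (lo + hi) // 2
        if has_text(snapshots[mid]):
            lo = mid + 1  # text still there, so the change happened later
        else:
            hi = mid  # text gone, so the change happened here or earlier
    return lo if lo < len(snapshots) else None


# Stand-in data: in a real tool has_text would fetch and search the page.
pages = {"a": "foo bar", "b": "foo", "c": "baz", "d": "qux"}
snapshots = ["a", "b", "c", "d"]
print(bisect_snapshots(snapshots, lambda s: "foo" in pages[s]))  # prints 2
```

Each call to has_text stands in for a slow page fetch, so the appeal of bisecting is that it needs roughly log2(n) fetches rather than n. If the text comes and goes more than once, the search will still return a snapshot where the text is absent, but not necessarily the earliest one.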
So this method is similar in principle to using git bisect to find when a bug was introduced. But instead of running a test to check whether the bug exists, the page can be evaluated by a person, or in the case of simple text search, the page can be rendered and searched.
The nice thing about using Memento instead of Wayback Machine specific URLs, or their CDX API, is that the same tool can work on other Memento supporting web archives, like the UK Web Archive, or the Stanford Web Archive Portal.
I called the tool memento-cli, which you can install with pip:
$ pip install memento-cli
The simplest way to use the new memento command is to give it an archive URL (aka Memento) to list all the other snapshots that are available:
$ memento list https://web.archive.org/web/20230407140923/https:/help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2017-12-29 05:40:51 https://web.archive.org/web/20171229054051/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-03 20:03:00 https://web.archive.org/web/20180103200300/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-04 06:39:58 https://web.archive.org/web/20180104063958/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-06 16:08:07 https://web.archive.org/web/20180106160807/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 06:10:07 https://web.archive.org/web/20180112061007/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 17:40:16 https://web.archive.org/web/20180112174016/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 18:40:34 https://web.archive.org/web/20180112184034/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-12 19:11:48 https://web.archive.org/web/20180112191148/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:05:57 https://web.archive.org/web/20180120190557/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
2018-01-20 19:19:20 https://web.archive.org/web/20180120191920/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
...
Since memento works with any RFC 7089 supporting archive you can use it to list versions in other web archives as well:
$ memento list https://www.webarchive.org.uk/wayback/archive/20130501020401/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-05-01 02:03:57 https://www.webarchive.org.uk/wayback/archive/20130501020357mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-05-01 02:04:01 https://www.webarchive.org.uk/wayback/archive/20130501020401mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2013-07-29 12:58:03 https://www.webarchive.org.uk/wayback/archive/20130729125803mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition
2013-07-29 12:58:06 https://www.webarchive.org.uk/wayback/archive/20130729125806mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2021-01-22 06:38:21 https://www.webarchive.org.uk/wayback/archive/20210122063821mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
2022-03-14 16:36:16 https://www.webarchive.org.uk/wayback/archive/20220314163616mp_/http://www.vam.ac.uk/content/exhibitions/david-bowie-is/david-bowie-is-inside-the-exhibition/
Things get more interesting when you want to search for a change. For example let’s suppose you know that the Twitter Hateful Conduct Policy used to have language about:
women, people of color, lesbian, gay, bisexual, transgender, queer, intersex, asexual individuals
To identify when the change was introduced, you can bisect the version history between two snapshots: an earlier one that has the text and a later one that does not. This performs a binary search between the two versions, looking for the point where the given text goes missing, and should eventually return this snapshot: https://web.archive.org/web/20230408115900/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
$ memento bisect --missing --text "women, people of color, lesbian, gay" \
    https://web.archive.org/web/20190711134608/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy \
    https://web.archive.org/web/20230621094005/https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
The bisect command uses a browser behind the scenes. If you wanted to find out when some text appears (rather than goes missing) then remove the --missing parameter from the command.
And if you would prefer to examine the pages in between manually, leave off the --text parameter and memento will prompt you to continue, and show you the browser it is controlling.
Finally if you would like to see the browser when using
--text then use the
Lesley Frew rightly pointed out to me that the tool could use the MemGator API to discover and search in snapshots across a large set of web archives, rather than just one web archive. I didn’t want memento-cli to be tied to a specific service like MemGator, but I think this would be a good option to add?
You may notice in the little video above that it can take a while to retrieve snapshot data from web archives. Fully rendering the page also takes time, since Selenium will wait until all the requests for the page have completed. This is how many web archives function unfortunately, and is why a tool like memento bisect is helpful. Please let me know what you think if you do try it out.
PS. Parsing the RFC 6690 formatted links in a TimeMap is not trivial, and probably is a barrier for Memento adoption by clients?
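Part of the difficulty is that you can't just split the TimeMap on commas, because the datetime values contain commas themselves. As a rough illustration, here is a small regex-based Python sketch, which is not the parser memento-cli uses. The parse_timemap name is hypothetical, the sample TimeMap is adapted from the style of the examples in RFC 7089, and the sketch assumes every parameter value is double-quoted, which is a simplification.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime

# One link-value: <url> followed by zero or more ;name="value" parameters.
# Assumption: all parameter values are double-quoted.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text: str) -> list[tuple[datetime, str]]:
    """Return (datetime, url) pairs for the mementos, sorted oldest first."""
    mementos = []
    for url, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel can hold several values, e.g. "first memento".
        rels = params.get("rel", "").split()
        if "memento" in rels and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), url))
    return sorted(mementos)


# Sample TimeMap with example.org hosts, deliberately out of date order.
sample = """<http://a.example.org>;rel="original",
<http://arxiv.example.net/web/20010911203610/http://a.example.org>
  ; rel="memento";datetime="Tue, 11 Sep 2001 20:36:10 GMT",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>
  ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"
"""

for when, url in parse_timemap(sample):
    print(when.isoformat(), url)
```

A sorted list like this is what a binary search over snapshots needs as its input. A more robust parser would also need to handle unquoted parameter values and other details of the link format.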