TL;DR if you are annoyed that your web archiving crawls aren’t collecting everything you want, and you don’t mind a bit of JavaScript then Browsertrix Crawler “drivers” might be for you.


As part of my involvement with web archiving at Stanford I’ve had the opportunity to work with Peter Chan (Stanford’s Web Archivist) to diagnose some crawls that have proven difficult to do with Archive-It for various reasons.

One example is the Temporality website which was created by poet Stephen Ratcliffe. I actually haven’t asked Peter the full story of how this site came to be selected for archiving at Stanford. But Ratcliffe is the author of many books, the Director of the Creative Writing program at Mills College (in Oakland, CA) and was also a fellow at Stanford, so there is a strong California connection.

At first Temporality looks like a pretty vanilla Blogger website, with some minor customization, which should pose little difficulty for archiving. But on closer inspection things get interesting because Ratcliffe has been writing a post every day there since May 1, 2009. Yes, every day–sometimes the post includes a photograph too.

Given that time span there should be about 5000 posts, and perhaps a few thousand more post listing pages. But the problem is that even after 2 months of continuous crawling with Archive-It, and collecting 1,010,351 pages (90GB), the dashboard still indicated that it needed to crawl another million pages.

We took a look at the Archive-It reports and noticed the crawler had spent a lot of time fetching URLs like this:

https://www.blogger.com/navbar.g?targetBlogID=5356039418422135632&blogName=Temporality&publishMode=PUBLISH_MODE_BLOGSPOT&navbarType=LIGHT&layoutType=LAYOUTS&searchRoot=https://stephenratcliffe.blogspot.com/search&blogLocale=en&v=2&homepageUrl=https://stephenratcliffe.blogspot.com/&vt=7450728192459364499&usegapi=1&jsh=m%3B%2F_%2Fscs%2Fabc-static%2F_%2Fjs%2Fk%3Dgapi.lb.en.z9QjrzsHcOc.O%2Fd%3D1%2Frs%3DAHpOoo8359JQqZQ0dzCVJ5Ui3CZcERHEWA%2Fm%3D__features__#id=navbar-iframe&_gfid=navbar-iframe&parent=https%3A%2F%2Fstephenratcliffe.blogspot.com&pfname=&rpctoken=24676009 

These turned out to be some kind of iframe tracker, that seemed to generate a unique URL for every page load. Unless these were blocked the crawl would never complete!

Adding exclusions for these isn’t hard in Archive-It–but knowing what patterns to exclude when you are looking at millions of queued URLs can be a bit tricky.

Unfortunately we expended our monthly budget doing this crawl, so we decided to give Browsertrix Crawler a try using our staff workstation.

Actually there actually was another reason why we wanted to try Browsertrix Crawler.

We noticed during playback that the calendar menu on the right wasn’t working correctly. When you selected a month it would just display a Loading message forever.

The problem here is that the button to open the month was a <span> element that had an onClick event which triggered an AJAX call. Since it wasn’t a button or an anchor tag the crawlers wouldn’t usually click on them, and so the AJAX response would never be made part of the archive.

Browsertrix Crawler does have a feature called behaviors which allow you to trigger custom crawler actions for popular social media platforms like Twitter, Instagram and Facebook. There isn’t a behavior for Blogger, so I started out by creating one (more about this in a future post).

In the process of writing a unit test for my new Blogger behavior I discovered that Blogger can be customized quite a bit. The calendar menu present on Temporality was just one of many possible ways to configure your archive links. So a general purpose behavior probably wasn’t the right way to approach this problem.

Browsertrix Crawler offers an another option called drivers, which inject crawl specific behavior directly into the crawl. This was actually quite a bit simpler than developing a plugin, because it was easier to ensure the behavior ran only once during the entire crawl, rather than for each page (which is how Browsertrix Behaviors usually work).

Like the site specific behavior plugins drivers are a bit of JavaScript that Browsertrix Crawler will load and call when it starts up. It passes in the data which includes information about the crawl, a page which is a Chromium browser page managed by Puppeteer, and the browsertrix Crawler object.

In my case all I had to do was to create a new function openCalendar() which clicks on all the links triggering the AJAX calls, and then continues along with the crawl by calling crawler.loadPage(). Note the use of openedCalendar to track whether the calendar links had been opened already.

let openedCalendar = false;

async function openCalendar(page) {

  // the first browser window crawls the calendar
  if (openedCalendar) return;
  openedCalendar = true;

  // Go to the Temporality homepage
  await page.goto("https://stephenratcliffe.blogspot.com/")

  // This is evaluated in the context of the browser page with access to the DOM
  await page.evaluate(async () => {
  
    // A helper function for sleeping
    function sleep(timeout) {
      return new Promise((resolve) => setTimeout(resolve, timeout));
    }

    // Get each element in the list of years
    for (const year of document.querySelectorAll("#ArchiveList > div > ul.hierarchy > li")) {

      // Expand each year
      const yearToggle = year.children[0];
      if (year.classList.contains("collapsed")) {
        yearToggle.click();
      }

      // Expand each month within each year
      for (const month of year.querySelectorAll("ul.hierarchy > li.collapsed")) {
        const monthToggle = month.children[0];
        await monthToggle.click();
        await sleep(250);
      }
    }
  });

}

module.exports = async ({data, page, crawler}) => {
  await openCalendar(page);
  await crawler.loadPage(page, data);
};

I’ve found running the Browsertrix Crawler is easiest (and most repeatable) when you put all the options in a YAML configuration file and then starting up Browsertrix Crawler pointed at it:

docker run -p 9037:9037 -it -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl --config /crawls/config/temporality.yaml

You can keep the YAML file as a record of how the crawl was configured, you can use it to rerun the crawl, and you can copy the config into new files and edit them to setup similar crawls of different sites.

collection: stephenratcliffe
workers: 4
generateWACZ: true
screencastPort: 9037
driver: /crawls/drivers/stephenratcliffe.js
seeds:
  - url: https://stephenratcliffe.blogspot.com
    scopeType: host
    exclude:
      - stephenratcliffe\.blogspot\.com/navbar.g.*

The -p 9037:9037 in the long docker command above allows you to open http://localhost:9037 in your browser to watch the progress of your crawl by seeing the 4 browser windows do their work. This is surprisingly useful for diagnosing when things appear to be stuck.

I wanted to write this up mostly to encourage others to fine tune their crawl behaviours using drivers to meet the specific needs of websites they are trying to archive. While there are aspects of web archiving that can be generalized, the web is a platform for distributing client side applications so there is a real limit on what can be done generally, and a real need to be able to control aspects of the crawl.

I also wanted to practice writing about this topic because I want to help think about the maintenance and further development of Browsertrix Behaviors which are so important for the Webrecorder approach to web archiving.


PS. The browsertrix crawl ran for 2.6 hours, collected 9,994 pages and generated 1.4 GB of WARC data. An improvement!

PSS. Karen has some other good examples of where drivers can be helpful!