Sometimes I want to archive a web page or website using browsertrix-crawler, but I always have to go to the documentation to look up the specifics of the Docker incantation. Once I get something working, I often want a record of how I ran the crawl so I can remember how it was created. I’ve done this enough times that I decided to create a little shell script and directory structure for the configuration files.
$ git clone https://github.com/edsu/browsertricky.git
$ cd browsertricky
$ ./browsertricky example
🎉 ✨ 🦄 🛸
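For a sense of what the script saves you from typing, here is a rough sketch of the kind of Docker incantation involved. It is based on the config-file usage in the browsertrix-crawler documentation, not on the contents of the browsertricky script itself. The function name `crawl_command`, the `crawls` output directory, and the mount paths are assumptions for illustration, and the function only prints the command rather than running it.

```shell
# Sketch only: print (do not run) a browsertrix-crawler command for a named config.
# crawl_command, the crawls/ directory and the mount paths are illustrative assumptions.
crawl_command() {
  name="$1"
  echo docker run \
    -v "$PWD/config/$name.yaml:/app/crawl-config.yaml" \
    -v "$PWD/crawls:/crawls/" \
    webrecorder/browsertrix-crawler crawl \
    --config /app/crawl-config.yaml
}

# Show the command that would be used for config/example.yaml
crawl_command example
```

Printing the command first is a handy habit here: you can paste it into your notes as a record of how the crawl was run.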
Ok, that’s not a terribly interesting example (pun intended). But you can use the example config as a template for another configuration to archive another site. To do that you’ll need to:
$ cp config/example.yaml config/mysite.yaml
Then edit config/mysite.yaml, adding information about a site you would like to archive:
- Change the name to mysite, which will change where your WACZ is written.
- Change the seeds list to include a new URL.
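As a rough idea of what the edited file could end up looking like, here is a sketch that writes a minimal config from scratch. It assumes the name maps to browsertrix-crawler's `collection` option and uses https://example.com/ purely as a placeholder seed. The real example.yaml in the repository may contain more settings, so treat this as an illustration rather than a copy of it. Note that it overwrites config/mysite.yaml.

```shell
# Sketch: write a minimal browsertrix-crawler config (overwrites config/mysite.yaml).
# The seed URL is a placeholder; swap in the site you actually want to archive.
mkdir -p config
cat > config/mysite.yaml <<'EOF'
collection: mysite
generateWACZ: true
seeds:
  - url: https://example.com/
    scopeType: page
EOF
```

Here `scopeType: page` limits the crawl to the seed page itself, which is a reasonable starting point before widening the scope to a whole site.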
And run it!
$ ./browsertricky mysite
If you open http://localhost:9037 while the crawl is underway you should see a screencast of the browser.
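For the screencast to be reachable, two things need to line up: the crawler has to serve it on a port, and that port has to be published from the container. The sketch below assumes this is done with the crawler's `--screencastPort` option and a matching `-p` flag. The variable names are made up for illustration, volume mounts are left out for brevity, and the command is only printed, not run.

```shell
# Sketch: the two places the screencast port shows up in a container run.
# SCREENCAST_PORT, publish_flag and crawler_flag are illustrative names.
SCREENCAST_PORT=9037
publish_flag="-p $SCREENCAST_PORT:$SCREENCAST_PORT"   # container runtime flag, before the image name
crawler_flag="--screencastPort $SCREENCAST_PORT"      # crawler flag, after "crawl"

echo "docker run $publish_flag webrecorder/browsertrix-crawler crawl $crawler_flag --config /app/crawl-config.yaml"
```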
Read the browsertrix-crawler documentation for all the options you can put in your YAML configuration files. There are quite a few!
PS. I originally meant to adapt this to work with podman too. It has since been updated to prefer podman to docker if it is available.
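One common way to express that preference in a shell script is to check which command is on the PATH. This is a minimal sketch of the idea, not the script's actual code, and `pick_runtime` is a hypothetical helper name.

```shell
# Sketch: pick podman when it is on PATH, otherwise fall back to docker.
# pick_runtime is a hypothetical helper, not a function from the script.
pick_runtime() {
  if command -v podman >/dev/null 2>&1; then
    echo podman
  elif command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo "neither podman nor docker found" >&2
    return 1
  fi
}

if runtime="$(pick_runtime)"; then
  echo "using $runtime"
fi
```

Since podman aims to be command-line compatible with docker for common `run` usage, the chosen name can usually be dropped in at the front of the same command.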
PPS. Check out browsertrix-cloud if you don’t want to work on the command line and would rather use a web application to do it.