Following a tip from Joan Donovan and her team at the Shorenstein Center, I’ve been experimenting a bit lately with ways to examine new domains related to the COVID-19 pandemic. I’m writing this blog post to document a way of using web archiving tools, like the WARC format and the warcio Python module, in Jupyter Notebooks to study the web. I think we often consider WARC as useful for saving what a page looks like so that it can be “played back” later, using something akin to the Internet Archive’s Wayback Machine. But (as I hope you’ll see) WARC is also a useful and convenient way to store web content along with critical information about how it was retrieved, so you can study it later.

Specifically, I’ve been looking at domains that have been flagged by DomainTools, a company that helps other companies with brand protection, domain monitoring, domain valuation, and cybercrime investigations. In late March DomainTools released a public dataset of COVID-19 related domains that they have flagged as likely threats. Threats can include phishing scams and other attacks, as well as attempts to capitalize on the pandemic.