Off a tip from Joan
Donovan and her team at the Shorenstein Center, I’ve been
experimenting a bit lately with examining new domains related to the
COVID-19 pandemic. I’m writing this blog post to document a way of using
web archiving tools like the WARC format and the
warcio Python
module, in the context of Jupyter
Notebooks, to study the web. I think we often consider WARC
as useful for saving what a page looks like so that it can be “played
back” later, using something akin to the Internet Archive’s Wayback Machine. But (as I hope
you’ll see) WARC is also a useful and convenient method for storing web
content along with critical information about how it was
retrieved, so you can study it later.
Specifically, I’ve been looking at domains that have been flagged by
the company DomainTools, which helps
other companies with brand protection, domain monitoring, domain
valuation, and cybercrime investigations. In late March DomainTools released
a public dataset of COVID-19 related domains that they have flagged as
being likely threats. Threats can include phishing scams and attacks,
as well as attempts to capitalize on the pandemic.
The DomainTools dataset is updated every day, and since April 5 I’ve
been downloading it and creating a snapshot at the Internet
Archive. This dataset not only provides a unique picture of the use of web
platforms to capitalize on the Coronavirus crisis, but also offers a
rare (but narrow) view into the operations of a cybersecurity firm.
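If you want to do something similar, one way to automate that daily download-and-deposit step is sketched below. The dataset URL and Internet Archive item identifier are placeholders rather than the real ones, and it assumes the internetarchive Python client is installed and configured with Archive credentials:

import datetime
import requests
from internetarchive import upload  # the Internet Archive's official Python client

# Placeholder values: swap in the real dataset URL and an Internet Archive
# item identifier you control.
DATASET_URL = "https://example.com/covid-19-domain-list.csv"
IA_ITEM = "example-covid19-domains"

today = datetime.date.today().isoformat()
filename = f"domaintools-covid19-{today}.csv"

# Download today's copy of the dataset
resp = requests.get(DATASET_URL)
resp.raise_for_status()
with open(filename, "wb") as fh:
    fh.write(resp.content)

# Deposit the file as a dated snapshot in the Internet Archive item
upload(IA_ITEM, files=[filename])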
Given the news
of how Shopify is being used by
various actors to make money off of the crisis, I thought it could be
interesting to look at the DomainTools data, and see how prevalent
Shopify sites were.
The first step is to collect the homepages for these domains. In this
Jupyter
notebook I used the warcio library from the
Webrecorder project to walk through
the domains, fetch each homepage, and write the data to a WARC file.
The notebook contains some logic for probing where the homepage for a
domain might be, but you can see the basics of how easy it is to write
to a WARC file as you crawl the web in
the warcio documentation.
Using warcio here is very useful because it means I don’t have to
manage writing the HTTP responses to disk myself. In addition, the WARC
data preserves the full HTTP transaction (including the HTTP headers) as
well. It took a good week to collect homepages for the more than 100,000
domains that DomainTools had marked with a high threat score (99). This
is because I did it in the simplest possible way, one domain at a time,
and a fair number (26,335) no longer had responsive web servers
running.
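To give a sense of the crawling side, here is a minimal sketch of that one-domain-at-a-time loop using warcio’s capture_http helper, which records each request/response pair into a WARC file as it happens. The domain list and URL probing are simplified placeholders; the notebook linked above has the real logic:

from warcio.capture_http import capture_http
import requests  # note: requests must be imported after capture_http

# placeholder list; in practice the domains come from the DomainTools dataset
domains = ["example-covid-shop.com", "example-test-kits.com"]

with capture_http("warc.gz"):
    for domain in domains:
        try:
            # the notebook probes http/https and www variants; this sketch
            # just tries the https homepage directly
            requests.get(f"https://{domain}/", timeout=30)
        except requests.exceptions.RequestException:
            # many of the flagged domains no longer have a responsive server
            pass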
With the WARC data in hand I then worked in this
notebook to scan the HTTP responses looking for evidence of Shopify.
Just reading the file line by line showed that indeed there were quite a
few sites that set a Shopify cookie, which is used in their analytics tracking:
Set-Cookie: _shopify_y=53b92b06-c72e-40a3-a64e-d6b0f81b6c47; path=/; expires=Thu, 14 Apr 2022 01:41:58 GMT
Given that little hint it’s then possible to read the WARC data with
warcio and identify domains that appear to be served up by Shopify,
using some code like this (see the notebook for the details):
from warcio.archiveiterator import ArchiveIterator

shopify_urls = []

# WARC files need to be opened in binary mode for warcio to read them
with open('warc.gz', 'rb') as stream:
    for r in ArchiveIterator(stream):
        if r.rec_type == 'response':
            url = r.rec_headers.get_header('WARC-Target-URI')
            cookie = r.http_headers.get_header('Set-Cookie', '')
            if '_shopify_y' in cookie:
                shopify_urls.append(url)
This found 372 domains. Joining this list back up with the
DomainTools data lets us see how these shops have been created over
time, with the caveat that the created time could be when the domain was
registered or when DomainTools first noticed the domain.
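The join itself is just a bit of pandas. Below is a rough sketch of how it can be done; the filename and the column names ('domain', 'create_date') are assumptions about the DomainTools CSV layout rather than the exact ones used in the notebook:

from urllib.parse import urlparse
import pandas as pd

# assumed filename and column names for the DomainTools CSV
df = pd.read_csv("domaintools.csv")

# shopify_urls holds full homepage URLs from the WARC, so reduce them to
# bare hostnames before matching against the domain column
shopify_domains = {urlparse(u).netloc.replace("www.", "") for u in shopify_urls}

shopify_df = df[df["domain"].isin(shopify_domains)].copy()
shopify_df["create_date"] = pd.to_datetime(shopify_df["create_date"])

# count new Shopify-backed domains per day and chart them
per_day = shopify_df.groupby(shopify_df["create_date"].dt.date).size()
per_day.plot(kind="bar", figsize=(12, 4))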

So it looks like they peaked on March 15. But if we compare that
graph to the graph of all the domains being tracked by DomainTools, we
can see it roughly follows the same general trend:

It’s important to stress that what we are seeing here is as much
(perhaps more) a picture of how DomainTools collects, analyzes
and classifies DNS, as it is a picture of the Coronavirus web. You can
find the chart generation for these graphs in their notebooks.
Given that Shopify claims to have started blocking shops that sold
testing equipment, I thought it could be interesting to see how many of
the websites mention “test kit”. This is pretty easy to do with
warcio using some similar code, but this time looking at the HTML
response content:
import re

from warcio.archiveiterator import ArchiveIterator

test_kit_urls = []

with open('warc.gz', 'rb') as stream:
    for r in ArchiveIterator(stream):
        if r.rec_type == 'response':
            url = r.rec_headers.get_header('WARC-Target-URI')
            # some responses lack a Content-Type header, so default to ''
            if 'html' in r.http_headers.get_header('Content-Type', ''):
                content = r.content_stream().read().decode('utf8', errors='ignore')
                if re.search('test kit', content, re.IGNORECASE):
                    test_kit_urls.append(url)
This found 388 domains, but only one appeared to be using Shopify:
rapidmedicalsystems.com. We can join this list back with the original
DomainTools data, as we did before, to see test-kit sites created over
time, which largely fits the same pattern we saw before.

If you are interested in inspecting these domains for yourself I’ve
made a snapshot of the
DomainTools data with additional columns for Shopify and Test Kits, to
indicate if they tested true for either of those. A few caveats to note:
these domains were determined using the 2020-04-11 snapshot of
the DomainTools dataset, and only domains that were flagged with
the highest risk score of 99 (78% of the domains) were included.
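For what it’s worth, producing that annotated snapshot is also just a few lines of pandas. As before, the filenames and the 'domain' column name are illustrative assumptions rather than the exact ones used:

from urllib.parse import urlparse
import pandas as pd

# assumed filename and column name for the 2020-04-11 DomainTools snapshot
df = pd.read_csv("domaintools-2020-04-11.csv")

shopify_domains = {urlparse(u).netloc.replace("www.", "") for u in shopify_urls}
test_kit_domains = {urlparse(u).netloc.replace("www.", "") for u in test_kit_urls}

# add boolean columns indicating whether each domain tested true
df["shopify"] = df["domain"].isin(shopify_domains)
df["test_kits"] = df["domain"].isin(test_kit_domains)

df.to_csv("domaintools-covid19-annotated.csv", index=False)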