I spent a few hours today using the Internet Archive Wayback Machine CDX API in a notebook to see how many snapshots of the Los Angeles Police Department website they’ve made (1,120,030), and specifically how many unique PDF documents there are in there (10,437).
It’s really quite beautiful how you can easily use the wayback module to drop the results of searching the API directly into a Pandas DataFrame for analysis:
wb = pandas.DataFrame(wb.search('lapdonline.org', scopeType='prefix')lapd
I did run into some inconsistencies between searching using a scopeType=domain versus scopeType=prefix.
It’s my understanding that using domain allows you to fetch
the complete results for any subdomain of that domain, so
assets.lapdonline.org in addition to
www.lapdonline.org when searching for
But that didn’t seem to be the case (see this example). So I used scopeType=domain to discover the relevant hostnames and then looked for each individually with a scopeType=prefix query.