Check out our SavePageNow Jupyter Notebooks for examples of how to process WARC data with Apache Spark and the warcio Python library.


Over the last year or so I’ve been working with Shawn Walker and Jess Ogden along with Mark Graham at the Internet Archive to try to get an understanding of the types of activity moving through the archive’s SavePageNow service. If you haven’t seen it before, SavePageNow allows anyone with an Internet connection to take a snapshot of a URL and save it for playback in the Wayback Machine. We all bring different research interests to the table, but I’m most interested in examining SavePageNow as a participatory archive service par excellence (Huvila, 2015).

Brewster Kahle estimated over a year ago that SavePageNow was receiving around 100 URLs per second, which works out to roughly 8.6 million requests per day. All this got us thinking about the different types of agents that could be using the SavePageNow service.

We presented some of our initial findings at RESAW 2019 last week. You can find the slides for our presentation up at Shawn’s website. We’re hoping to get these findings out as a publication sometime soon. But until then I wanted to briefly share something interesting we discovered when analyzing the SavePageNow data. This isn’t a finding about the data itself so much as it is a reflection on the process of analyzing WARC data with Apache Spark.

WARC Data

When you are working with web archive data, most of the time you are working with WARC data. A full description of the WARC format is a bit out of scope here, but suffice it to say that the software archiving a web page needs to get data from the web and save it to a file in some fashion. WARC’s notion of response records allows retrieved data to be written to disk in such a way that it can be predictably used by playback software like the Wayback Machine or pywb.

One of the peculiar aspects of our SavePageNow study is that we were interested not only in the content that was collected from the web, but also in the way it was collected, and by whom. Specifically, we were interested in characterizing the participation that the SavePageNow activity represented. For privacy reasons the WARC data itself doesn’t contain identifiers that allow the tracking of individuals. However, the data does contain information about the User-Agent that was used to request that a page be archived. Given the volume of activity (100 URLs/sec) we expected that there would be significant differences between the types of users. But to see those differences we needed to look at specific information in the WARC data.

We started our analysis using the amazing Archives Unleashed Toolkit (AUT), which provides a whole host of tools for analyzing WARC data using the Apache Spark data analysis framework. However, we ran into a bit of a problem when it came to looking at the User-Agent, because it is stored in the WARC request record, while the data of the response is stored in the WARC response record. We wanted to connect the requests and the responses and return data such as the User-Agent, the URL requested, the media-type of the response, and so on. We took a look at the internals of AUT to see if we could figure it out, but none of us had adequate Scala programming skills to be confident of doing this correctly.

We also looked at the somewhat more composable ArchiveSpark library from Helge Holzmann, who is now at the Internet Archive. It seemed like it ought to be possible to unite the request and response records, but again, since it was written in Scala we weren’t able to figure it out. Fortunately I did get a chance to talk to Helge at RESAW, and he said that we hadn’t missed anything: ArchiveSpark currently doesn’t have the ability to collect data from both the request and response records. But he did say that it is something that has come up before, and that they may add it at some point.

warcio

So what we did instead was switch to working with Spark in Jupyter notebooks, using Python and the warcio library developed by the good folks at the Webrecorder project. While Python can’t be optimized by the JVM running Spark the way Scala can, we thought trading some speed for the legibility and understandability of Python code could be worthwhile. And while we didn’t do any extensive benchmarking, we did notice reduced memory use and roughly the same speed as we saw when processing with AUT, so things weren’t as bad as we expected.
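To give a sense of what warcio looks like on its own, here is a minimal sketch that simply walks the records in a single WARC file (the filename is just a placeholder):

from warcio.archiveiterator import ArchiveIterator

# print the record type and target URL for every record in one WARC file
with open('warcs/example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))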

The first trick to processing WARC data in Spark with warcio is to create what I called extractor functions. These functions are given a single WARC record (actually a warcio ArcWarcRecord) and are responsible for extracting some data from the record and returning it. So for example, here’s one that will extract the target URL from a WARC response record.

@extractor
def url(record):
    # yield the target URL for each response record
    if record.rec_type == 'response':
        yield (record.rec_headers.get_header('WARC-Target-URI'), )

If you are a Pythonista you may notice that our function has a decorator called extractor. Decorators are a common programming language feature that allows you to reduce code duplication by creating functions that annotate other functions with boilerplate operations. We created an extractor decorator that we could use to write all kinds of extractors that would pull out things of interest from the WARC data. If you are interested you can see this decorator defined here. The point is for the decorator to slip into the background a bit. If there is interest I could put this function up on PyPI.
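The linked decorator is the definitive version, but a minimal sketch of the idea, assuming each Spark partition is an iterator of WARC file paths, might look something like this:

import functools
from warcio.archiveiterator import ArchiveIterator

def extractor(f):
    # wrap an extractor so that it can be handed an iterator of WARC file
    # paths (e.g. a Spark partition) instead of individual records
    @functools.wraps(f)
    def wrapper(warc_files):
        for warc_file in warc_files:
            with open(warc_file, 'rb') as stream:
                for record in ArchiveIterator(stream):
                    yield from f(record)
    return wrapper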

Another thing to note here is that our extractor function yields instead of returning data directly. This makes our extractor functions into generators, which can return multiple values rather than just one; basically they act as iterators. In the case of the URL this isn’t terribly significant, because there’s only one target URL for a WARC response. But consider an extractor that needed to return all the links, or all the images, in an HTML response. Having a single interface for all these operations is useful. Decorators and generators are admittedly more advanced programming concepts, but developers who use web frameworks like Django or Flask use them all the time.
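For example, a hypothetical link extractor might look like the sketch below (it assumes BeautifulSoup is available for the HTML parsing, and isn’t something we actually ran):

from bs4 import BeautifulSoup

@extractor
def links(record):
    # yield a (page URL, link URL) pair for every <a href> in an HTML response
    if record.rec_type == 'response' and record.http_headers and \
            record.http_headers.get_header('Content-Type', '').startswith('text/html'):
        page_url = record.rec_headers.get_header('WARC-Target-URI')
        html = record.content_stream().read()
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            yield (page_url, a['href'])

Because the extractor decorator simply yields whatever the wrapped function yields, an extractor like this plugs into the same mapPartitions call shown below.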

At any rate, once the extractor is defined you can apply it using standard pyspark operations. In our case we had a big list of WARC files that we wanted to process in parallel:

import glob
from pyspark.sql import SparkSession

# the SparkSession gives us a SparkContext as well as DataFrame support
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

warc_files = glob.glob('warcs/*.warc.gz')

warcs = sc.parallelize(warc_files)
output = warcs.mapPartitions(url)

# write the extracted URLs out as CSV
output.toDF(['url']).write.csv('urls')

Joining Records

In our case we were interested in exploring differences in the data collected based on the User-Agent. As mentioned above, the response data (media-type, content, date, size, etc.) is stored in the WARC response record, while the User-Agent is stored in the WARC request record. Fortunately the two are linked together using the WARC-Concurrent-To and WARC-Record-ID fields.

Example WARC Request Headers

WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:e6d16180-4273-4956-b56f-b2f1378b451f>
WARC-Date: 2018-10-25T21:47:18Z
Content-Length: 387
Content-Type: application/http; msgtype=request
WARC-Concurrent-To: <urn:uuid:9d11abcf-b15d-47a2-8fcb-05adf194222a>
WARC-Target-URI: https://www.youtube.com/channel/UC6x7GwJxuoABSosgVXDYtTw
WARC-Warcinfo-ID: <urn:uuid:54787850-6be1-4fd5-b41f-143cec5fe7da>

Example WARC Response Headers

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:9d11abcf-b15d-47a2-8fcb-05adf194222a>
WARC-Date: 2018-10-25T21:47:18Z
Content-Length: 715816
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:EQRZMNEOAGEMJUNWKUTVQO7YGQGMGSVI
WARC-Target-URI: https://www.youtube.com/channel/UC6x7GwJxuoABSosgVXDYtTw
WARC-Warcinfo-ID: <urn:uuid:54787850-6be1-4fd5-b41f-143cec5fe7da>

Alas there is no guaranteed order to the request and response records, and it’s also possible that the records could exist in totally separate WARC files. So we needed to pass through the data and join the records together so that we could then save information about the request with information about the response. We did this by writing an extractor called get_urls that extracted different information based on the type of record it was passed:

@extractor
def get_urls(record):

    if record.rec_type == 'request':
        # a request points at its response via WARC-Concurrent-To
        id = record.rec_headers.get_header('WARC-Concurrent-To')
        user_agent = record.http_headers.get_header('user-agent')
        yield (id, {"user_agent": user_agent})

    elif record.rec_type == 'response':
        id = record.rec_headers.get_header('WARC-Record-ID')
        url = record.rec_headers.get_header('WARC-Target-URI')
        status_code = record.http_headers.get_statuscode()
        yield (id, {"url": url, "status_code": status_code})

This extractor function yields a tuple of (id, dictionary) where the id and the dictionary are obtained differently depending on whether the record is a request or response. Then we can apply our extractor:

warc_files = glob.glob('warcs/*.warc.gz')
warcs = sc.parallelize(warc_files)
results = warcs.mapPartitions(get_urls)

At this point our results are a Spark RDD that contains a mixture of tuples for requests and responses, but we want to combine them together based on the id. There may be other ways of achieving this, but I chose to use Spark’s combineByKey method, which allows you to combine the data as it is being processed, in a single pass through the data. combineByKey is probably one of the more complicated RDD methods you can use in Spark, but if you have a copy, the explanation in Learning Spark is quite good.

def unpack(d1, d2):
    # merge the two dictionaries, keeping the keys from both
    d1.update(d2)
    return d1

dataset = results.combineByKey(
    lambda d: d,   # start with the first dictionary seen for a key
    unpack,        # merge a new dictionary into it within a partition
    unpack         # merge the results across partitions
)

combineByKey takes three functions that are applied at different phases of processing the RDD. The first initializes a data structure for each key; in our case we just use the supplied dictionary. The second combines dictionaries that have been joined on the key (the WARC record id) within a partition. And the third, which is identical in this case, is applied when combining data from the separate partitions. To do this I created a simple function unpack that takes two dictionaries and combines them together.
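Incidentally, since the first step here is just the identity function and the two merge steps are the same, this particular join could arguably also be expressed with reduceByKey; a sketch, not what we actually ran:

# merge the request and response dictionaries that share a WARC record id
dataset = results.reduceByKey(unpack)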

Finally, this results in an RDD of (id, dictionary) tuples where the dictionary contains the user_agent, url and status_code, which we can easily output to CSV for further analysis. Exporting the data as CSV is useful because it means we no longer need to go back to the WARC data to analyze patterns. The data can be easily loaded into a Spark or Pandas DataFrame for analysis.

dataset.take(3)

[
  ('<urn:uuid:5fc7e893-f91a-4fe7-bf0b-0f35d14c2958>', {'url': 'https://www.youtube.com/channel/UCXZkaZ4f7d3-YuMyy76yS6A', 'user_agent': 'Wget/1.19.5 (linux-gnu)', 'status_code': '200'}),
  ('<urn:uuid:24ea637b-05c6-4591-8bfe-4a964650d8fa>', {'url': 'https://www.youtube.com/channel/UCQIUhhcmXsu6cN6n3y9-Pww/about', 'user_agent': 'Wget/1.19.5 (linux-gnu)', 'status_code': '200'}),
  ('<urn:uuid:a9d9d853-3ce0-4d6b-88ae-f42d713b60c6>', {'url': 'https://www.youtube.com/channel/UCQaX6TS_0hKrucF0LcuvWxA/about', 'user_agent': 'Wget/1.19.5 (linux-gnu)', 'status_code': '200'})
]
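For a dataset of this shape, one way to get to CSV, sketched here with a made-up output filename and with the results pulled back to the driver, is Python’s csv module; for larger results a Spark DataFrame write would be a better fit:

import csv

# flatten the (id, dictionary) tuples into CSV rows on the driver
with open('spn-urls.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=['url', 'user_agent', 'status_code'])
    writer.writeheader()
    for record_id, row in dataset.toLocalIterator():
        writer.writerow(row)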

So…

So this has been a pretty esoteric look at processing WARC data with Spark and warcio. I wanted to jot down these notes mostly to highlight how toolkits such as ArchiveSpark and the Archives Unleashed Toolkit are a great place to start when analyzing WARC data. But if your research questions require you to peek under the covers to see how the data is being processed, or if you need to do something novel that isn’t covered by the framework, it’s possible to use Python and warcio in Spark as well.

Even when you are doing distant reading (Moretti, 2013) of WARC data, you often need to do close reading of the data to properly understand what it is that you are viewing from a distance. Indeed, as we are using the methodology of trace ethnography in our project, it’s particularly important that we engage with WARC data with as little mediation and friction as possible (Geiger & Ribes, 2011; Geiger & Ribes, 2010).

References

Geiger, R. Stuart, & Ribes, D. (2010). The work of sustaining order in Wikipedia. Computer Supported Cooperative Work (CSCW).
Geiger, R. Stuart, & Ribes, D. (2011). Trace ethnography: Following coordination through documentary practices. In 44th Hawaii International Conference on System Sciences (pp. 1–10). Retrieved from http://www.stuartgeiger.com/trace-ethnography-hicss-geiger-ribes.pdf
Huvila, I. (2015). The unbearable lightness of participating? Revisiting the discourses of “participation” in archival literature. Journal of Documentation, 43, 29–41.
Moretti, F. (2013). Distant reading. London; New York: Verso.