Less is (sometimes) More
Below is a short presentation that I prepared for iPRES 2020 (a.k.a. #WeMissiPRES), which was held remotely due to the Coronavirus pandemic.
In the next 9 minutes I hope to convince you that a text file of numbers is an important resource for archiving the web. Yes, that's right, just a list of numbers like this:
My hat is off to the organizers because I couldn't have asked for a better person to speak after than Rhiannon, since she just presented on the topic of significant properties and OAIS. Hopefully you will see the connection too in a moment.
There doesn't appear to be anything significant about these numbers. What possible preservation value could a list of numbers like this have for archiving the web?
Of course significant properties have been the topic of significant critique from the digital preservation community. In 2006 Chris Rusbridge noted that:
.. there is no way of precisely defining the designated community, and similarly no way of foretelling the properties that future users might deem significant. This leads to pressure for preservation that must be faithful to the original in all respects. (Rusbridge, 2006)
And this pressure to remain faithful to the "original" can sometimes work perniciously to guarantee that instead nothing is preserved. It's all or nothing, and most of the time that means nothing.
15 years ago John Kunze (who has an uncanny ability for naming things) gave a talk here at iPRES titled Future Proofing the Web in which he introduced the idea of "preservation through desiccation". He drew attention to the properties of paper that made it such a successful preservation medium, and asked us to consider the venerable IETF RFC standards archive, which used plain text files without fonts, graphics, colors, or diacritics, but which retained "essential cultural value". Part of the argument John made was:
The simplest technologies to maintain and understand today are the simplest to carry forward and to recreate in the future.
Today this principle is known as minimal computing, at least in some digital humanities circles. But the idea goes back further to the earlyish days of the web, when in 1998 Tim Berners-Lee wrote down the Principle of Least Power to describe his process for designing web standards like HTML:
When designing computer systems, one is often faced with a choice between using a more or less powerful language for publishing information, for expressing constraints, or for solving some problem. This finding explores tradeoffs relating the choice of language to reusability of information. The "Rule of Least Power" suggests choosing the least powerful language suitable for a given purpose.
Ok, so what does all this have to do with a list of numbers? To understand that I need to quickly tell you about three interrelated problems we encountered on the Documenting the Now project (they should sound familiar). Documenting the Now is a project (thank you Mellon Foundation) that is cultivating a community of practice for social media and web archiving that centers the rights, safety and voices of content creators. The project started in 2014 in the wake of the murder of Michael Brown in Ferguson, Missouri with the recognition that:
- Social media presents a huge opportunity for documenting previously undocumented historical events. However, cultural organizations often (rightly) steer clear of engaging in it because of concerns about how to provide meaningful access without harming the people doing the documentation. You may remember this subject being described last year at iPRES by Michelle Caswell in her keynote: Whose Digital Preservation?.
- Researchers of all disciplinary stripes routinely create collections of social media for use as data in their research. But by and large they do not provide access to these collections because social media platforms forbid it.
- Content creators in social media have little control over how their data is being used in archives, and instead are the subject of widespread surveillance capitalism (Zuboff, 2015).
Why would we want to wade into this river, you might ask? Honestly, it was the voices of the activists in Ferguson that kept us going as we tried to find what we could do so that their work was not forgotten. It is worth stating clearly at this point that there is no technical fix for this problem. Memory is a people problem. Tools can help (and hurt), but there is no silver bullet (Stiegler, 2012; Brooks, 1975).
Over the past five years we've developed a few tools that can be used separately or in combination to address parts of these problems, given the right set of actors to use them responsibly. Here's the basic intervention we made while focused on the social media platform Twitter, which was so critical to documenting the events in Ferguson:
Twitter does not allow data to be collected from its platform and then reshared with third parties. It's bad for business, because they want to sell it. But they do allow the sharing of tweet identifiers (long numbers like the ones above), and explicitly encourage academic researchers to do this. So why not encourage the sharing of tweet id datasets in digital repositories, and provide a view into them as a whole? That's why we created The Catalog.
But how do you create these lists of tweet identifiers? And what would you do with a list once you downloaded it? We created a few tools, most notably twarc, for collecting data from the Twitter API, and the Hydrator, which lets you turn those identifiers back into data again.
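To make the "list of numbers" concrete, here is a minimal sketch of going from collected tweets to a shareable identifier list. It assumes the collected data is stored as one JSON tweet object per line with the identifier in an `id_str` field (as in Twitter's v1.1 tweet JSON). The function name `dehydrate` and the sample identifiers are made up for illustration, and this is not twarc's own implementation.

```python
import io
import json


def dehydrate(lines):
    """Yield the tweet identifier found on each line of tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: the id is in "id_str"; fall back to the numeric "id".
        yield str(tweet.get("id_str") or tweet["id"])


# Made-up sample data standing in for a file of collected tweets.
sample = io.StringIO(
    '{"id_str": "1000000000000000001", "text": "first"}\n'
    "\n"
    '{"id": 1000000000000000002, "text": "second"}\n'
)

ids = list(dehydrate(sample))
print("\n".join(ids))
```

Everything except the identifier is thrown away, and that is the point: the resulting text file is the part that can be deposited and shared.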
Ok, fine. But what about the rights of content creators? What say do they have in how their data is collected? First, twarc only collects public tweets. So if their account is protected, their tweets won't show up in the filter stream or search API endpoints that twarc uses. The same is true of the API endpoint that the Hydrator uses: if a tweet id dataset is published and the creator then decides to delete a tweet or protect their account, that data can no longer be "hydrated". This gives some measure of agency back to content creators.
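The effect of deletion on hydration can be sketched without touching the network. The example below assumes a lookup service that accepts identifiers in batches (a batch size of 100 is assumed here) and silently omits any tweet that has been deleted or protected. `fake_lookup`, `available`, and the identifiers are hypothetical stand-ins, not the Hydrator's actual code or a real API client.

```python
from itertools import islice


def batches(ids, size):
    """Split an iterable of ids into lists of at most `size` items."""
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def hydrate(ids, lookup, batch_size=100):
    """Yield the tweets that `lookup` still returns for the given ids."""
    for chunk in batches(ids, batch_size):
        yield from lookup(chunk)


# Hypothetical stand-in for the platform: tweet ...002 has been deleted.
available = {
    "1000000000000000001": {"id_str": "1000000000000000001", "text": "still public"},
    "1000000000000000003": {"id_str": "1000000000000000003", "text": "also public"},
}


def fake_lookup(chunk):
    # Deleted or protected tweets are simply left out of the response.
    return [available[i] for i in chunk if i in available]


dataset = ["1000000000000000001", "1000000000000000002", "1000000000000000003"]
hydrated = list(hydrate(dataset, fake_lookup, batch_size=2))
found = {tweet["id_str"] for tweet in hydrated}
missing = [i for i in dataset if i not in found]
print(len(hydrated), "hydrated;", len(missing), "no longer available")
```

The published dataset still lists three identifiers, but only two can be turned back into data. The creator's decision to delete is carried forward to everyone who hydrates later.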
This obviously isn't a perfect solution because many content creators need more control, and some need less (we're working on that too). Researchers studying things like disinformation campaigns won't be happy with the deletes that go missing from hydrated datasets. But the Catalog's primary purpose is to serve as a clearinghouse for where these datasets live in fuller representations in repositories. I'm normally neutral on OAIS, but in this case I think it's actually useful to consider the tweet identifiers as an OAIS Dissemination Information Package (DIP). Using the contact information in the Catalog, it's within the realm of possibility to gain access to the original data by reaching out and becoming a project partner rather than a third party.
But rather than convincing you that the work we've done on Documenting the Now is the bee's knees, the cat's meow, or a real humdinger (sorry, I got lost in a thesaurus), I hope to have convinced you that (sometimes) less is more. Strategically sharing less data can serve the interests of digital preservation and access. Less isn't just a matter of technical sustainability; it's also a lever (Shilton, 2012) that we have at our disposal when we consider the positionality of our memory work. Digital preservation isn't always about the highest resolution representation with the most significant properties. Use this value lever wisely!
Here is an audio version of this post with some slides. Spoiler Alert: there are slides containing a bee, a cat and a bell.