Here’s a really nice talk by Everest Pipkin about the need to curate datasets for generative text algorithms, especially when they are being used in creative work. Everest defines creative work broadly, as any work where care for the experience is important.
Curating your own corpora gives you hyper-specific control over the tone, vibe, content, ethics, language, and poetics of that space.
Since I’m teaching a data curation class in an information studies department this semester, and I also work in a digital humanities center, this approach to understanding the algorithmic impacts of data curation is super interesting to me. I particularly liked how Everest extended this idea of curation to the rules that sit on top of the models, which select and deselect the interactions that are useful in a given context.
It feels like the creative practice Everest provides a view into could actually be more prevalent than the popular press might have it. There is so much focus on obtaining larger and larger datasets, and on neural networks with more and more synapses to approach the complexity found in the brain.
Rather than the choice of algorithm, the tuning of its hyperparameters, and the vast infrastructure for training being the secret sauce, perhaps the data that is selected, and how model interactions are interpreted, are equally if not more important, especially in particular contexts.
I guess this also highlights why data assets are so guarded, and why projects like Wikipedia and Common Crawl are so important for tools like GPT-3 and spaCy. It would be pretty cool to be able to select a model based on some subset, or subsets, of a dataset like Common Crawl. For example, if you wanted to generate text based on a fan fiction site like AO3, or even an author or set of authors within AO3. Maybe something like that already exists?
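In a toy form, that kind of subset selection might look something like this. Crawl records in Common Crawl carry a source URL, so a themed sub-corpus can be built by filtering on the URL's domain; the `records` list and `subset_by_domain` helper here are invented purely for illustration, not any real crawl-processing API:

```python
from urllib.parse import urlparse

# Stand-in for a stream of (url, text) pairs pulled from a crawl dump.
records = [
    ("https://archiveofourown.org/works/1", "fic text one"),
    ("https://example.com/news/2", "a news article"),
    ("https://archiveofourown.org/works/3", "fic text two"),
]

def subset_by_domain(records, domain):
    """Keep only documents whose source URL is on the given domain."""
    return [text for url, text in records
            if urlparse(url).netloc == domain]

corpus = subset_by_domain(records, "archiveofourown.org")
print(len(corpus))  # 2
```

A sub-corpus like this could then feed whatever model or fine-tuning step you had in mind; the curation happens in that filtering choice, long before any training run.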