Corpus Linguistics (CL) involves the study of language using machine readable texts or corpora. It’s interesting to run across CL in the context of discourse analysis. I first ran up against corpus linguistics as an undergraduate student, where I remember writing a term paper using a printed concordance to analyze the use of the word “orange” in Hemingway’s A Farewell to Arms. I don’t quite remember what I concluded there, but I do remember thinking how interesting it was that there were books that contained lists of words that were in other books. I guess nowadays there might be quite a bit of overlap between corpus linguistics and computational linguistics – which relies heavily on corpora.

CL in the context of Discourse Analysis focuses on written and spoken text, and rests on a particular theory that sees language variation as systematic, functional and tied to a particular social context. I think this view is fundamental to discourse analysis and is shared by the other types of discourse analysis we’ve looked at this semester such as pragmatics, conversational analysis, ethnography of communication, critical discourse analysis, interaction sociolinguistics. But what is different in corpus linguistics is the attention to the corpus, both the way it is assembled and the way that it is used. CL also is distinguished by taking a quantitative or statistical approach to the study of language. It is not purely statistical however, since the interpretation of words and their significance often involves social context and cultural factors.

While there are several examples of well known large corpus datasets (BASE, BAWE, TOEFL, LSWE) they don’t necessarily have to be super large. Specialized corpora focused in specific areas can be very useful, even if they don’t contain many documents. Results from specialized corpora can also be compared to results from larger corpora for context. Paltridge (2012) outlines some key issues to keep in mind when building a corpus of text:

  • authenticity: what language and/or dialects are present in the corpus. Which ones are missing? How does this impact the types of questions that can be asked of the corpus?
  • time: What time period are they taken from? How often is it updated? How does this influence the types of questions that can be asked?
  • size: the number and length of documents needed can vary depending on the research question. Making assertions about a larger population requires a sample that reflects the size of the population of texts. Is the size of that population even known?
  • balance: sampling should reflect the distribution of texts within the corpus

Some examples of structures that researchers have looked for in corpora of written text include:

  • non-clausal units: utterances that lack a subject or verb
  • personal pronouns and ellipsis: where items are left out of conversation because they are part of the context
  • repetition: words that repeat in particular types of conversation to add emphasis
  • lexical bundles: formulaic multi-word sequences

Corpora can also be made of text from conversation, where researchers can look at:

  • pauses: gaps in utterances and the flow of conversation
  • prefaces: patterns for introducing conversation
  • tags: phrases added to the end of utterances, such as questions, or repeating
  • concepts with different wording
  • informality or casualness of words

Maybe I’m not being imaginative enough but it seems like it could be difficult to locate some of these features automatically using a computational methods. Identifying repetitions and lexical bundles seems like it could be fairly easy once the text has been modeled as n-grams and collocation statistics are generated so that they can be browsed. But to programmatically identify pauses seems to require some kind of pre-existing markup for the time gaps like what we saw in CA transcription. I guess these could be determined by a computer if audio recordings are available and digitized. But it seems like it could be difficult to identify turns in conversation (changes in speaker), and when the pause occurs there or within the flow of a turn. Also identifying where contextual features are being elided in conversation seems like it would require some degree understanding of an utterance, which is notoriously difficult for computers (McCarthy & Hayes, 1969).

However some of the features CL enables you to study are quite down to earth and useful. Collocation looks at what words appear most in particular texts, and across genres. Analysis can be top-down where a particular discourse structure is identified beforehand and then the corpus is examined looking for that structure. It can also be bottom-up where lower level shifts and repetitions in word usage are used to identify discourse structures. I imagine that a given study could cycle back and forth between these modes: bottom up leading to discourse structure that is then examined top-down?

The main criticisms of corpus studies is that its quantitative focus on words, tends to reduce the focus on the social context that the words are used in. Ways of countering this are to do qualitative interviews and surveys to provide this extra dimension, which is what Hyland (2004) did in his study of academic discourse.

Tribble (2002) offers a framework for using corpus studies to look at contextual features such as: cultural values, communicative purpose and grammatical features that are stylistically salient. The framework works better on a corpus that has a genre focus than a register focus – or one that is more tightly scoped. The difference between register and genre seems significant and a little bit difficult to grasp. I guess it’s a matter of scale or abstraction– but also of materiality perhaps? A genre suggests a particular embodiment of text. It could be a useful distinction to make if I am going to study collection development policies which are embodied in a particular way as documents on the Web.

Mautner (2016) has several interesting things to say about using CL as a methodology in support of Critical Discourse Analysis (CDA).

  • CL is built on a theoretical foundation that positions language variation as systematic, functional and tied to social context (see Firth)
  • CL allows for the analysis of larger amounts of data
  • CL provides a different view of data that can be useful in triangulation.
  • CL can bring some measure of quantification that can temper potential researcher bias and subjectivity.
  • CL provides some methodological for qualitative analysis of data, such as browsing collocational information.

If words, or lexical items, are an important measure in your research then CL is a good tool to use. In many ways it allows CDA researchers, who are usually focused beyond the text, to ground their analysis on the text itself.

Matuner points out how collocation lists with their t-score and Mutual Information (MI) score can be used in CDA. The MI score is a ratio that measures the observed number of co-occurring words against the expected number of co-occurring words. The expected number is known because you know how many times a word occurs in the entire text. The t-score is a complementary statistic that weighs the MI score based on the number of times the word appears in the corpus.

CL tools allow analyst to scan and examine word lists, or examples of co-ocurring words which can suggest qualitative or contextual factors. CDA encourages people not to jettison context in making texts machine readable. Features such as textual layout and emphasis, or video/audio from recorded speech are important to retain.

The concept of saturation, that is foundational in qualitative methods is somewhat at odds with CL, because the strength of how common some features are is important to measurements such as MI and the t-score. So saturation artificially limits them. Ideally you want to sample the entire population of texts. I guess this presupposes that the texts need to be digitized in some way. At any rate some judgment about the entire population of texts needs to be made, and this is a really important decision to be made in CL studies.

It is interesting to see that a skill-gap was identified as a criticism of CL. It reminds me of criticisms of digital humanities. Realistically I imagine the same critique could be directed at CA or EC as conversation transcription and field studies are learned skills. Disciplinary boundaries between computational and social scientists seems a little irrelevant in the deeply interdisciplinary space that we’re in now – at least in most information studies departments.

I thought the criticisms of the embarrassment of riches that the Web offers to be really fascinating, especially in light of how I’ve been collecting Twitter data. Texts should still be reflexively selected because they speak to the research being performed, and should not simply be slurped up mindlessly just because it’s easy to do. The nature of the documents that make up the corpora, and the means and manner in which they were selected are of key significance. It’s also important to remember that context still matters, and that some patterns in text will be invisible to the quantitative measures provided in CL.


Hyland, K. (2004). Disciplinary discourses: Social interactions in academic writing. University of Michigan Press.

Mautner, G. (2016). Methods of critical discourse studies. In R. Wodak & M. Meyer (Eds.),. Sage.

McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. Readings in Artificial Intelligence, 431–450.

Paltridge, B. (2012). Discourse analysis: An introduction. Bloomsbury Publishing.

Tribble, C. (2002). Corpora and corpus analysis: New windows on academic writing. Academic Discourse, 131–149.