How humans vs. machines identify corpus discourse topics

Activity: Talk or presentation (Science to science)

Description

Identifying discourses and discursive topics in a set of texts has been of interest not only to linguists, but also to researchers working across the full breadth of the social sciences. Traditionally, these analyses have been small-scale and interpretive, involving some form of close reading. Naturally, however, close reading is only possible when the dataset is small, and it leaves the analyst open to accusations of bias, cherry-picking and a lack of representativeness (Mautner, 2015).

Designed to avoid these issues, other methods have emerged which involve larger datasets and have some form of quantitative component. Within linguistics, this has typically been through the use of corpus-assisted methods, whilst outside of linguistics, topic modelling is one of the most widely used approaches. Increasingly, researchers are also exploring the utility of large language models (LLMs), such as ChatGPT, to assist analyses (Curry et al., 2023). Where corpus linguistics, topic modelling, and LLM-assisted work differ, though, is in the degree of contextualisation available to the researcher. Topic modelling algorithms reduce texts to a simple bag of words, presenting only a list of co-occurring words to the researcher for analysis. Researchers utilising topic modelling typically eyeball these words and attempt to ascertain topic labels (Gillings and Hardie, 2022). On the other hand, corpus-assisted methods, and in particular concordance analysis, allow the user to see words of interest within their co-text (typically a few words on either side). Corpus-assisted methods, then, sit somewhere in between the completely decontextualised topic modelling and the completely contextualised close reading.
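The "bag of words" reduction described above can be made concrete with a minimal sketch (not any specific topic-modelling implementation): once word order is discarded, texts that differ radically in meaning can become indistinguishable to the algorithm.

```python
from collections import Counter

def bag_of_words(text):
    # Reduce a text to an unordered multiset of word counts --
    # all sequence and co-text information is discarded, which is
    # what a topic model's input typically looks like.
    return Counter(text.lower().split())

# Two claims with very different meanings collapse to the same bag:
a = bag_of_words("emissions reduced despite growth")
b = bag_of_words("growth reduced despite emissions")
print(a == b)  # the model cannot tell these texts apart
```

This is precisely the loss of context that concordance analysis restores by re-attaching a window of co-text to each word of interest.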

This talk reports on a study assessing the effect that analytical method has on the interpretation of texts, specifically in relation to the identification of the main topics. Using a corpus of corporate sustainability reports, totalling 98,277 words, we asked six researchers, along with ChatGPT, to interrogate the corpus and decide on its main 'topics' via four different methods. Each successive method increases the amount of context available to the analyst.

Method A: ChatGPT was used to categorise the topic model output and assign topic labels;
Method B: Two researchers were asked to view a topic model output and assign topic labels based purely on eyeballing the co-occurring words;
Method C: Two researchers were asked to assign topic labels based on a concordance analysis of 100 randomised lines of each co-occurring word;
Method D: Two researchers were asked to reverse-engineer a topic model output by creating topic labels based on a close reading.
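The concordance sampling in Method C can be sketched in a few lines; this is a hypothetical illustration of the general technique (keyword-in-context lines with a random sample capped at 100), not the concordancer used in the study.

```python
import random

def concordance(tokens, node, window=4, sample=100, seed=0):
    # Collect every occurrence of `node` with `window` words of
    # co-text on either side, then draw a random sample of at
    # most `sample` concordance lines.
    hits = [
        (" ".join(tokens[max(0, i - window):i]),
         tokens[i],
         " ".join(tokens[i + 1:i + 1 + window]))
        for i, tok in enumerate(tokens) if tok == node
    ]
    random.Random(seed).shuffle(hits)
    return hits[:sample]

# Toy corpus standing in for the sustainability reports:
tokens = "the report notes that emissions fell while other emissions rose".split()
lines = concordance(tokens, "emissions")
for left, node, right in lines:
    print(f"{left:>30} | {node} | {right}")
```

Each returned line pairs the node word with its immediate co-text, which is exactly the intermediate degree of contextualisation that distinguishes Method C from the eyeballing in Method B and the close reading in Method D.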

The talk explores how the identified topics differed both between researchers in the same condition and between researchers in different conditions, shedding light on some of the mechanisms underlying topic identification by machines versus humans, or by machines assisted by humans. Ultimately, we find that the more context is available, the more divergent the interpretations of the text. We conclude with a series of tentative observations regarding the benefits and limitations of each method, and with recommendations for researchers when it comes to choosing an analytical technique for the identification of discourse topics.
Period: 17 Jul 2024
Event title: 7th Corpora & Discourse International Conference
Event type: Conference
Location: Innsbruck, Austria
Degree of recognition: International