Unsupervised Topic Modeling with BERTopic for Coarse and Fine-Grained News Classification

Mohamad Al Sayed, Adrian M.P. Brașoveanu, Lyndon J.B. Nixon, Arno Scharl

Publication: Chapter in book/conference proceeding › Contribution to conference proceedings

Abstract

Transformer models have achieved state-of-the-art results for news classification tasks, but remain difficult to modify to yield the desired class probabilities in a multi-class setting. Using a neural topic model to create dense topic clusters helps generate these class probabilities. The presented work uses the BERTopic clustered embeddings model as a preprocessor to eliminate documents that do not belong to any distinct cluster or topic. By combining the resulting embeddings with a Sentence Transformer fine-tuned with SetFit, we obtain a prompt-free framework that demonstrates competitive performance even with few-shot labeled data. Our findings show that incorporating BERTopic in the preprocessing stage leads to a notable improvement in the classification accuracy of news documents. Furthermore, our method outperforms hybrid approaches that combine text and images for news document classification.
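The abstract outlines a two-stage pipeline: BERTopic first filters out documents that fall into no distinct topic cluster, then a SetFit-tuned Sentence Transformer classifies the remaining documents from few-shot labels. The Python sketch below illustrates one plausible realization under stated assumptions: the setfit >= 1.0 API, the all-MiniLM-L6-v2 embedding model, the 20 Newsgroups corpus as a stand-in dataset, and hypothetical few-shot examples; none of these are taken from the paper itself.

# Minimal sketch of the pipeline described in the abstract, assuming the
# setfit >= 1.0 API. Model names, the stand-in corpus, the few-shot
# examples, and all hyperparameters are illustrative assumptions, not the
# paper's actual configuration.
from bertopic import BERTopic
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; BERTopic needs at least a few hundred documents
# for the clustering step to produce meaningful topics.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

# Step 1: cluster the documents. BERTopic assigns topic -1 to outliers,
# i.e. documents that belong to no distinct cluster; these are dropped.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, _ = topic_model.fit_transform(docs)
kept_docs = [doc for doc, topic in zip(docs, topics) if topic != -1]

# Step 2: fine-tune a Sentence Transformer with SetFit on few-shot labels
# (hypothetical examples; real training data would come from the task).
few_shot = Dataset.from_dict({
    "text": ["Stocks rallied after the central bank held rates steady.",
             "The striker scored twice in the cup final."],
    "label": ["business", "sports"],
})
model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=few_shot,
)
trainer.train()

# Step 3: classify only the documents that survived the topic filter.
predictions = model.predict(kept_docs)

The design point worth noting is that the outlier filter runs before classification, so the few-shot classifier is never asked to force ambiguous documents into a class, which is what the abstract credits for the accuracy improvement.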
Original language: English
Title of host publication: Advances in Computational Intelligence
Subtitle of host publication: 17th International Work-Conference on Artificial Neural Networks, IWANN 2023, Ponta Delgada, Portugal, June 19–21, 2023, Proceedings, Part I
Editors: Ignacio Rojas, Gonzalo Joya, Andreu Catala
Place of Publication: Cham
Publisher: Springer
Pages: 162-174
Number of pages: 13
Volume: 1
Edition: 1
ISBN (Electronic): 978-3-031-43085-5
ISBN (Print): 978-3-031-43084-8
Publication status: Published - 2023
Externally published: Yes

Publication series

Series: Lecture Notes in Computer Science (LNCS)
Volume: 14134
ISSN: 0302-9743
Projects

  • Gentio

    Hornik, K. (PI - Project head), Seiler, A. (Contact person for administrative matters), Polleres, A. (Researcher) & Disselbacher-Kollmann, K. (Contact person for administrative matters)

    1/01/20 → 30/06/23

    Project: Research
