Latent Dirichlet Allocation in R

Martin Ponweiser

Publication: Working/Discussion Paper, WU Working Paper


Abstract

Topic models are a recent research field within information retrieval and text mining, two areas of computer science. They are generative probabilistic models of text corpora, inferred by machine learning, and can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes.
This thesis focuses on LDA's practical application. Its main goal is to replicate the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers, using the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process is fully documented and commented: extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, and presentation of the results. The outcome closely matches the analyses of the original paper, so the research by Griffiths and Steyvers can be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA.
Original language: English
Place of Publication: Vienna
Publisher: WU Vienna University of Economics and Business
Publication status: Published - 1 May 2012

Publication series

Name: Theses / Institute for Statistics and Mathematics
No.: 2

WU Working Paper Series

  • Theses / Institute for Statistics and Mathematics
