Distributed Text Mining with tm

  • Stefan Theußl (Contributor)
  • Ingo Feinerer (Contributor)
  • Kurt Hornik (Contributor)

Activity: Talk or presentation, Science to science

Description

Text mining is a widely used technique that applies statistical and machine learning methods to extract
patterns or knowledge from large unstructured text data sets. Recently, R has gained explicit text mining
support via the tm package. This infrastructure provides sophisticated methods for document handling, transformations, filters, and data export (e.g., term-document matrices).

However, the availability of very large and ever-growing text corpora poses new challenges for the efficient
handling of these data sets, mainly due to the architectural performance limits of single-processor environments and to memory restrictions. On the other hand, we observe an increasing availability of multicore architectures, even in commodity computers, and of high performance computing environments, i.e., distributed and highly integrated computing clusters.

In this context, we propose to make use of MapReduce, a technique that is widely used in high performance computing because of its functional programming nature. Existing building blocks in tm
allow for adding new layers to support this kind of parallelism and distributed allocation. In particular, we identify compute-intensive parts of tm, break these parts up into units suitable for parallel processing, and finally encapsulate the resulting parallelism in a functional programming style.
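
The MapReduce pattern described above can be illustrated with a minimal sketch. It is written in Python rather than R and does not use the tm API; the function names (map_document, reduce_counts, term_frequencies) and the sample corpus are hypothetical and chosen only for illustration. Each document is mapped independently to a partial term count, and the partial results are then folded into one table.

```python
from collections import Counter
from functools import reduce


def map_document(text):
    """Map step: turn one document into its own term-frequency table."""
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial term-frequency tables."""
    return left + right


def term_frequencies(documents, mapper=map):
    """Apply the map step to every document, then fold the partial results.

    `mapper` is any map-like callable; the built-in `map` runs sequentially.
    """
    partials = mapper(map_document, documents)
    return reduce(reduce_counts, partials, Counter())


# Hypothetical three-document corpus, for illustration only.
corpus = [
    "text mining with R",
    "distributed text mining",
    "mining large text corpora",
]
totals = term_frequencies(corpus)
```

Because the map step touches each document in isolation, the sequential `map` could in principle be replaced by a parallel or distributed map without changing the reduce step, which is the property the functional style is meant to expose.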

A key factor in large scale text mining is the efficient management of data. Therefore, we show how
distributed storage can be utilized to facilitate parallel processing of large and very large data sets. This
approach offers us a reliable, flexible, and scalable high performance computing solution for distributed text mining.
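
The role of storage can be sketched in the same spirit. The example below is an assumption-laden illustration in Python, not the setup used in the talk: a temporary local directory stands in for a distributed file system, threads stand in for worker nodes, and the chunk layout (one document per line) and all function names are hypothetical. The point is only that each worker reads and processes its own chunk, so no single process has to hold the whole corpus.

```python
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunks(documents, directory, chunk_size):
    """Store the corpus as chunk files, one document per line (hypothetical layout)."""
    paths = []
    for start in range(0, len(documents), chunk_size):
        path = Path(directory) / f"chunk-{start // chunk_size:04d}.txt"
        chunk = documents[start:start + chunk_size]
        path.write_text("\n".join(chunk), encoding="utf-8")
        paths.append(path)
    return paths


def count_chunk(path):
    """Worker task: read only this chunk and count its terms."""
    counts = Counter()
    for line in path.read_text(encoding="utf-8").splitlines():
        counts.update(line.lower().split())
    return counts


def count_corpus(paths, workers=2):
    """Process the chunks concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, paths))
    return sum(partials, Counter())


# Hypothetical corpus; the temporary directory stands in for distributed storage.
documents = [
    "text mining with R",
    "distributed text mining",
    "mining large text corpora",
    "parallel processing of text",
]
with tempfile.TemporaryDirectory() as directory:
    chunk_paths = write_chunks(documents, directory, chunk_size=2)
    chunk_totals = count_corpus(chunk_paths)
```
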
Period: 8 Jul 2009 to 10 Jul 2009
Event title: useR!
Event type: Unknown
Degree of Recognition: International

Austrian Classification of Fields of Science and Technology (ÖFOS)

  • 102022 Software development
  • 102023 Supercomputing