Simple Parallel Computing with Hadoop

  • Stefan Theußl (Speaker)

Activity: Talk, scientific talk (Science-to-Science)


The availability of very large and steadily growing data sets poses new challenges for efficient data handling, mainly due to the architectural performance limits of single-processor environments and memory restrictions. On the other hand, we observe an increasing availability of multicore architectures, even in commodity computers, and of high-performance computing environments, i.e., distributed and highly integrated computing clusters.

In this context, we propose to make use of a technique called MapReduce, which is widely used in high-performance computing because of its functional programming nature. Apache Hadoop is an open-source Java software framework that implements MapReduce and thus supports massive data processing across a cluster of workstations.

We present the parallel programming model MapReduce, its implementation Hadoop, and how this framework can be used in conjunction with R. Based on an example in text mining, we illustrate how one may break up existing building blocks into suitable entities for parallel processing and how Hadoop can be used to encapsulate this parallelism within R. Furthermore, we show how distributed storage can be utilized to facilitate parallel processing of large and very large data sets. This approach offers us a simple, flexible, and scalable parallel computing solution.
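
As a rough illustration of the MapReduce model described above, the following minimal sketch counts term frequencies across a small corpus, a typical building block in text mining. It is not the code from the talk: it uses plain Python rather than R or Hadoop, runs in a single process, and all names (`map_phase`, `shuffle`, `reduce_phase`, `term_frequencies`, and the toy corpus) are hypothetical. It only shows how a task splits into independent map calls per document and independent reduce calls per term, which is the structure a framework such as Hadoop would distribute across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every token of one document.

    Each document is processed independently, so these calls could run
    in parallel on different workers.
    """
    return [(token, 1) for token in text.lower().split()]


def shuffle(pairs):
    """Group intermediate values by key.

    In a real MapReduce framework this grouping happens between the
    map and reduce phases; here it is done in memory.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Reduce step: combine all counts emitted for one term."""
    return term, sum(counts)


def term_frequencies(corpus):
    """Run map, shuffle, and reduce sequentially over a dict of documents."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in corpus.items()
    )
    return dict(reduce_phase(term, counts) for term, counts in shuffle(mapped).items())


# Hypothetical toy corpus, for illustration only.
corpus = {
    "doc1": "big data needs parallel computing",
    "doc2": "parallel computing with big clusters",
}
counts = term_frequencies(corpus)
```

Because neither `map_phase` nor `reduce_phase` depends on shared state, the sequential loops in `term_frequencies` could be replaced by distributed execution without changing the two functions themselves.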
Period: 28 June 2009 to 2 July 2009
Event title: Rmetrics Workshop
Event type: Not specified

Austrian Classification of Fields of Science and Technology (ÖFOS)

  • 102023 Supercomputing