MapReduce Join Across Geo-Distributed Data Centers

Tomarchio O.
2019-01-01

Abstract

MapReduce is without doubt the parallel computation paradigm that has best interpreted and served the need, expressed in every field, to run fast and accurate analyses on Big Data. Its strength lies in its ability to exploit the computing power of a cluster of resources by distributing the load across multiple computing units, and to scale with the number of those units. Many data analysis algorithms are now available in MapReduce form: data sorting, data indexing, word counting, and relational joins, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce runs on nodes that are geographically distant from each other, performance degrades dramatically. In such scenarios, the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. This paper discusses the issues of running MapReduce joins in a geo-distributed computing context and proposes to boost the performance of the join algorithm by leveraging a hierarchical computing approach.
Year: 2019
ISBN: 9783030273545
Keywords: Geo-distributed computation; Hadoop; Hierarchical MapReduce; Join; MapReduce

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/375216