MapReduce Join Across Geo-Distributed Data Centers

Di Modica G.; Tomarchio O.

2019-01-01

Abstract

MapReduce is undoubtedly the parallel computation paradigm that has best met the need, felt in every field, for fast and accurate analysis of Big Data. Its strength lies in its ability to exploit the computing power of a cluster of resources by distributing the load across multiple computing units, and to scale with the number of those units. Many data analysis algorithms are now available in MapReduce form: data sorting, data indexing, word counting, and relational joins, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce runs on nodes that are geographically distant from each other, performance degrades dramatically. In such scenarios, the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. This paper discusses the issues of running MapReduce joins in a geo-distributed computing context and proposes to boost the performance of the join algorithm by leveraging a hierarchical computing approach.
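To make the data-movement cost concrete, here is a minimal in-memory sketch of a generic reduce-side join, a common textbook way to express a join in MapReduce. It is not the authors' algorithm and does not use Hadoop; the function names, the "L"/"R" tags, and the sample relations are all assumptions made for illustration. The `shuffle` step stands in for the phase that, on a real cluster, ships every tagged record across the network to the node responsible for its key, which is the traffic that becomes expensive over geographic links.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag):
    """Emit (join_key, (tag, row)) pairs; the tag marks the source relation."""
    for key, row in records:
        yield key, (tag, row)


def shuffle(pairs):
    """Group values by join key (hypothetical stand-in for the network shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every left row with every right row (inner join)."""
    for key, values in groups.items():
        left = [row for tag, row in values if tag == "L"]
        right = [row for tag, row in values if tag == "R"]
        for left_row, right_row in product(left, right):
            yield key, left_row, right_row


def reduce_side_join(left_records, right_records):
    """Run map, shuffle, and reduce in sequence on two keyed relations."""
    pairs = list(map_phase(left_records, "L")) + list(map_phase(right_records, "R"))
    return list(reduce_phase(shuffle(pairs)))


# Illustrative data only: two small relations keyed by a user id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_side_join(users, orders)
print(result)
```

Note that every record of both relations passes through `shuffle`, including rows for keys 2 and 3 that never find a match; in a geo-distributed setting those rows would still cross the slow links.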

Year: 2019
ISBN: 9783030273545
Keywords: Geo-distributed computation; Hadoop; Hierarchical MapReduce; Join; MapReduce
Publication type: 4.1 Conference proceedings contribution

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/375216