Authors:
Marco Cavallo; Lorenzo Cusma'; Giuseppe Di Modica; Carmelo Polito and Orazio Tomarchio
Affiliation:
University of Catania, Italy
Keyword(s):
Big Data, MapReduce, Hierarchical Hadoop, Context Awareness, Integer Partitioning.
Related Ontology Subjects/Areas/Topics:
Big Data Cloud Services; Cloud Computing; Cloud Computing Architecture; Fundamentals; Platforms and Applications
Abstract:
In many application fields, such as social networks, e-commerce and content delivery networks, large amounts of data are continuously produced at geographically distributed sites and need to be processed in a timely manner. Distributed computing frameworks such as Hadoop (based on the MapReduce paradigm) have been used to process big data by exploiting the computing power of many cluster nodes interconnected through high-speed links. Unfortunately, Hadoop has been shown to perform very poorly in the scenario just described. We designed and developed a Hadoop framework that is capable of scheduling and distributing Hadoop tasks among geographically distant sites in a way that optimizes the overall job performance. We propose a hierarchical approach in which a top-level entity, by exploiting information about data location, produces a smart schedule of low-level, independent MapReduce sub-jobs. A software prototype of the framework was developed. Tests run on the prototype showed that the job scheduler makes good forecasts of the expected job execution time.
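
The abstract describes a top-level entity that uses data-location information to plan independent per-site MapReduce sub-jobs and to forecast the overall job execution time. The following Java sketch illustrates that idea only in broad strokes; it is not the authors' implementation, and the class names, the linear cost model and the sample figures are all illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of a hierarchical top-level scheduler: given the
 * amount of input data held at each geographic site and each site's
 * estimated MapReduce throughput, it creates one independent sub-job
 * per site (processing data where it resides) and forecasts the overall
 * job execution time as the slowest sub-job, since sub-jobs run in
 * parallel before a final top-level merge of their results.
 */
public class TopLevelScheduler {

    /** A low-level MapReduce sub-job assigned to a single site. */
    public static class SubJob {
        final String site;
        final long inputBytes;
        final double forecastSeconds;

        SubJob(String site, long inputBytes, double forecastSeconds) {
            this.site = site;
            this.inputBytes = inputBytes;
            this.forecastSeconds = forecastSeconds;
        }
    }

    /**
     * Builds one sub-job per site holding data and returns the forecast
     * of the overall execution time (the makespan of the parallel sub-jobs).
     */
    public static double schedule(Map<String, Long> bytesAtSite,
                                  Map<String, Double> bytesPerSecondAtSite,
                                  Map<String, SubJob> outSubJobs) {
        double makespan = 0.0;
        for (Map.Entry<String, Long> e : bytesAtSite.entrySet()) {
            String site = e.getKey();
            long bytes = e.getValue();
            double throughput = bytesPerSecondAtSite.getOrDefault(site, 1.0);
            double forecast = bytes / throughput;    // simplistic linear cost model
            outSubJobs.put(site, new SubJob(site, bytes, forecast));
            makespan = Math.max(makespan, forecast); // sub-jobs run in parallel
        }
        return makespan;
    }

    public static void main(String[] args) {
        Map<String, Long> data = new HashMap<>();
        data.put("site-A", 50_000_000_000L);      // 50 GB stored at site A (example)
        data.put("site-B", 20_000_000_000L);      // 20 GB stored at site B (example)

        Map<String, Double> throughput = new HashMap<>();
        throughput.put("site-A", 400_000_000.0);  // ~400 MB/s aggregate (example)
        throughput.put("site-B", 250_000_000.0);  // ~250 MB/s aggregate (example)

        Map<String, SubJob> plan = new HashMap<>();
        double forecast = schedule(data, throughput, plan);
        System.out.printf("Forecast job execution time: %.1f s%n", forecast);
    }
}

A real scheduler of this kind would also have to account for inter-site transfer costs and could move part of the data before processing; the sketch keeps only the data-locality and forecasting aspects mentioned in the abstract.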