Published in

2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems

DOI: 10.1109/cisis.2015.37

Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments

Proceedings article published in 2015 by Vrushali Ubarhande, Alina-Madalina Popescu, and Horacio Gonzalez-Velez
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

The Hadoop framework has been developed to efficiently process data-intensive MapReduce applications. Hadoop users specify the application logic as a map and a reduce function; such programs are commonly termed MapReduce applications. The Hadoop Distributed File System stores application data on cluster nodes called Data nodes, while a Name node acts as the control point for all Data nodes. Although this design increases resilience, Hadoop's current data-distribution methodologies are not necessarily efficient for heterogeneous distributed environments such as public clouds. This work contends that existing data-distribution techniques are not necessarily suitable there, since Hadoop's performance typically degrades in heterogeneous environments whenever data distribution is not determined according to the computing capability of the nodes. Data locality and its impact on the performance of Hadoop are key factors, since they affect performance in the Map phase when tasks are scheduled; task-scheduling techniques in Hadoop should therefore take data locality into account. Various task-scheduling techniques have been analysed to understand their data-locality awareness, and other system factors also play a major role in achieving high performance in Hadoop data processing. The main contribution of this work is a novel data-placement methodology that distributes data to Hadoop Data nodes according to their computing ratio. Two standard MapReduce applications, Word Count and Grep, have been executed, and a significant performance improvement has been observed with the proposed data-distribution technique.
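The idea of placing data in proportion to each node's computing ratio can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`compute_ratios`, `place_blocks`) and the use of per-node benchmark runtimes as the capability measure are assumptions made here for clarity.

```python
def compute_ratios(benchmark_times):
    """Derive each node's computing ratio: speed is taken to be
    inversely proportional to its benchmark runtime (an assumption)."""
    speeds = {node: 1.0 / t for node, t in benchmark_times.items()}
    total = sum(speeds.values())
    return {node: s / total for node, s in speeds.items()}

def place_blocks(num_blocks, ratios):
    """Assign HDFS input blocks to nodes proportionally to their ratios,
    so faster nodes receive more local data."""
    assignment = {node: int(num_blocks * r) for node, r in ratios.items()}
    # Floor rounding may leave a few blocks unassigned; hand the
    # remainder to the fastest nodes first.
    remainder = num_blocks - sum(assignment.values())
    for node, _ in sorted(ratios.items(), key=lambda kv: -kv[1]):
        if remainder == 0:
            break
        assignment[node] += 1
        remainder -= 1
    return assignment

# Example: node n1 is twice as fast as n2 and four times as fast as n3,
# so it should hold roughly four sevenths of the input blocks.
ratios = compute_ratios({"n1": 10, "n2": 20, "n3": 40})
assignment = place_blocks(70, ratios)
```

With such a placement, Map tasks scheduled on fast nodes are more likely to find their input blocks locally, which is the data-locality effect the abstract highlights.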