Improving the Parallelism of Task Execution to Optimize Utilization of MapReduce Cluster Resources

        MapReduce, as a programming model, has become an important solution for large-scale data-intensive processing and is widely used in fields such as Web search, machine learning, and e-commerce. Hadoop, an open-source implementation of MapReduce, is widely used for offline processing of massive data and consists of the MapReduce engine and HDFS. In studying Hadoop, we found that its data parallelism is coarse-grained: a task runs single-threaded and cannot take full advantage of multi-core systems, which lowers the utilization and efficiency of the whole cluster. To make Hadoop a fine-grained data-parallel framework, we propose a strategy that scales the parallelism of execution within each map/reduce task. We implement this strategy as a new feature for Hadoop, and our experiments show that it not only improves the utilization of MapReduce cluster resources but also speeds up job completion by up to 3x.
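
        To illustrate the general idea of intra-task (fine-grained) parallelism discussed above, the sketch below uses Hadoop's stock MultithreadedMapper, which runs several instances of a user mapper concurrently inside one map task so that a single task can occupy multiple cores. This is only an analogy under assumed settings (the class name MultithreadedWordCount and the thread count of 8 are illustrative), not the implementation proposed in this paper.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class MultithreadedWordCount {

            // Ordinary word-count mapper; MultithreadedMapper creates one
            // instance of it per worker thread inside a single map task.
            public static class TokenizerMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, ONE);
                    }
                }
            }

            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                private final IntWritable result = new IntWritable();

                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) {
                        sum += v.get();
                    }
                    result.set(sum);
                    context.write(key, result);
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "multithreaded word count");
                job.setJarByClass(MultithreadedWordCount.class);

                // Wrap the real mapper in MultithreadedMapper so each map task
                // processes its input split with a pool of threads instead of a
                // single thread, letting one task use several cores.
                job.setMapperClass(MultithreadedMapper.class);
                MultithreadedMapper.setMapperClass(job, TokenizerMapper.class);
                MultithreadedMapper.setNumberOfThreads(job, 8);  // assumed: one thread per core

                job.setReducerClass(SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

        Note that MultithreadedMapper only parallelizes record processing on the map side and requires the wrapped mapper to be thread-safe; the strategy evaluated in this paper targets the broader goal of scaling execution parallelism in both map and reduce tasks.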