Investigation of Data Locality and Fairness in MapReduce


Traditional High-Performance Computing (HPC) environments separate compute and storage resources and adopt a "bring data to compute" strategy. MapReduce is a data-parallel model that uses the same set of nodes for both compute and storage. As a result, data affinity is integrated into the scheduling algorithm to bring compute to data. In data-intensive computing, data locality matters more than before because it can significantly reduce network traffic. In this project, we investigate the data locality of MapReduce in detail: 1) we summarize important system factors and theoretically derive the relationship between those factors and data locality; 2) we analyze the state-of-the-art Hadoop scheduling algorithms to investigate their performance; 3) we propose new scheduling algorithms that yield optimal data locality; 4) we integrate data locality and fairness; 5) we compare our algorithms with the default Hadoop scheduling algorithm.
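To make "bring compute to data" concrete, the following minimal Python sketch assigns pending tasks to idle slots one slot at a time, preferring a task whose input has a replica on the slot's node. It is an illustrative simplification, not Hadoop's actual scheduler and not the algorithms proposed in this project; the names `schedule_greedy`, `data_local_ratio`, `task_locations`, and `idle_slots` are hypothetical.

```python
def schedule_greedy(task_locations, idle_slots):
    """Assign one pending task to each idle slot, preferring data-local tasks.

    task_locations: dict mapping task id -> set of nodes holding a replica
                    of that task's input (hypothetical input format).
    idle_slots:     list of node names, one entry per idle slot, in the
                    order in which the slots ask for work.
    Returns a list of (task, node, is_local) tuples.
    """
    pending = list(task_locations)  # dicts preserve insertion order
    assignments = []
    for node in idle_slots:
        if not pending:
            break
        # Take the first pending task with a replica on this node;
        # fall back to the head of the queue if none is data-local.
        task = next((t for t in pending if node in task_locations[t]), pending[0])
        pending.remove(task)
        assignments.append((task, node, node in task_locations[task]))
    return assignments


def data_local_ratio(assignments):
    """Fraction of assigned tasks that run on a node holding their input."""
    if not assignments:
        return 0.0
    return sum(1 for _, _, local in assignments if local) / len(assignments)


if __name__ == "__main__":
    tasks = {"t1": {"A", "B"}, "t2": {"A"}}
    result = schedule_greedy(tasks, ["A", "B"])
    print(result)                    # t1 goes to A (local), t2 goes to B (non-local)
    print(data_local_ratio(result))  # 0.5
```

In this toy input, the slot-by-slot choice gives t1 to node A and leaves t2 without a local slot, so only half of the tasks are data-local. Assigning t2 to A and t1 to B would make both tasks data-local, which illustrates why considering all tasks and slots together can improve locality over a greedy per-slot decision.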

Intellectual Merit

This project addresses an important issue in MapReduce: data locality. Our proposed algorithms yield optimal data locality and can dramatically reduce the time spent on data movement. The integration of data locality and fairness allows users to choose the tradeoff that best fits their environments and requirements.

Broader Impact

In the era of data-intensive computing, data locality is critical because moving extremely large amounts of data during processing is inefficient. This project helps researchers understand MapReduce data locality quantitatively. In addition, it produces conclusions and results that lay a foundation for further research on data-parallel systems.

Use of FutureGrid

We ran extensive simulation experiments on FutureGrid bare metal machines.

Scale of Use

We used 1 to 5 HPC nodes.



Our experimental results show that our proposed algorithms improve data locality and substantially outperform the default Hadoop scheduling algorithm. For example, the ratio of data-local tasks increases by 12% to 14%, and the cost of data movement is reduced by up to 90%.
The detailed results of this project are presented in two papers: "Investigation of data locality and fairness in MapReduce" [1], and "Investigation of Data Locality in MapReduce" [2].


  1. [Guo:2012:IDL:2287016.2287022] Guo, Z., G. Fox, and M. Zhou, "Investigation of data locality and fairness in MapReduce", Proceedings of the Third International Workshop on MapReduce and its Applications, Delft, The Netherlands, ACM, pp. 25–32, 2012.
  2. [fg-261-05-2012-a] Guo, Z., G. Fox, and M. Zhou, "Investigation of Data Locality in MapReduce", Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, Canada, IEEE Computer Society, pp. 419–426, 05/2012.
Zhenhua Guo
Indiana University

