Hadoop Job Optimization
Microsoft IT White Paper
Writers: Sherman Wang, Liang Mo, Andy Miao
Published: April 2013
Applies to: HDInsight, Hadoop on Windows
Summary: The Map/Reduce paradigm has greatly simplified development of large-scale data processing tasks. However, when processing data at the terabyte or petabyte scale in Hadoop, jobs might run for hours or even days. Therefore, understanding how to analyze, fix, and fine-tune the performance of Map/Reduce jobs is an extremely important skill for Hadoop developers.
This paper describes the principal bottlenecks that occur in Hadoop jobs, and presents a selection of techniques for resolving each issue and mitigating performance problems on different workloads. The paper explains the interaction of disk I/O, CPU, RAM and other resources, and demonstrates with examples why efforts to tune performance should adopt a balanced approach.
It includes the results of extensive performance-tuning experiments, which showed significant differences in the speed of the same Map/Reduce job before and after tuning.
To review the document, please download the Hadoop Job Optimization Word document.