Hadoop Job Optimization

Microsoft IT White Paper

Writers: Sherman Wang, Liang Mo, Andy Miao

Published: April 2013

Applies to: HDInsight, Hadoop on Windows

Summary: The Map/Reduce paradigm has greatly simplified the development of large-scale data processing tasks. However, when processing data at the terabyte or petabyte scale in Hadoop, jobs might run for hours or even days. Understanding how to analyze, fix, and fine-tune the performance of Map/Reduce jobs is therefore an essential skill for Hadoop developers.

This paper describes the principal bottlenecks that occur in Hadoop jobs, and presents a selection of techniques for resolving each issue and mitigating performance problems across different workloads. The paper explains the interaction of disk I/O, CPU, RAM, and other resources, and demonstrates with examples why performance tuning should take a balanced approach.

It also includes the results of extensive performance-tuning experiments, which show significant differences in the running time of the same Map/Reduce job before and after tuning.

To review the document, please download the Hadoop Job Optimization Word document.