Compression in Hadoop: A Microsoft IT White Paper

Microsoft IT White Paper

Writers: Sherman Wang, Liang Mo, Andy Miao

Applies to: HDInsight, Hadoop on Windows

Summary: Large data sets pose many challenges in Hadoop. This document describes compression techniques that you can use to optimize your Hadoop jobs and reduce the bottlenecks associated with moving and processing large data sets.

In this paper, we describe the problems that large data volumes cause in the different phases of a Hadoop job and explain how we have used compression to mitigate them. We review the available compression tools and techniques and report on our tests of each tool. We also describe how to enable compression and decompression using both command-line arguments and configuration files.

To read the full paper, download the Compression in Hadoop Word document.