Overview of LINQ to HPC and the Distributed Storage Catalog (Beta 2)

This will be the final preview of LINQ to HPC, and we do not plan to move forward with a production release. In line with our announcement in October at the PASS conference, we will focus our effort on bringing Apache Hadoop to both Windows Server and Windows Azure. For more information, see "Microsoft to develop Hadoop distributions for Windows Server and Azure" and "Microsoft Expands Data Platform to Help Customers Manage the ‘New Currency of the Cloud’".

LINQ to HPC and the Distributed Storage Catalog (DSC) help developers write programs that use cluster-based computing to manipulate and analyze very large data sets. LINQ to HPC and the DSC include services that run on a Windows® High Performance Computing (HPC) cluster or a Windows Azure Scheduler deployment in Windows Azure, as well as client-side features that are invoked by applications.

LINQ to HPC and the DSC support data-intensive computing for many types of applications, such as data mining, image processing, and scientific computing. Where possible, LINQ to HPC co-locates a subset of the data, and the computational logic that processes that data, on the same computer. This proximity is known as data locality. With data locality, you can scale out data-intensive applications by adding more computers to the cluster. With a large enough cluster, you can use LINQ to HPC to store petabytes of data and execute queries across terabyte-sized data sets.

Two features distinguish LINQ to HPC from other approaches to distributed computing. The first is that LINQ to HPC uses a high-level, query-based programming model that is based on Language-Integrated Query (LINQ). Using LINQ enables programmers to write high-level queries that express distributed computations in a familiar and concise way. Much of the code in a typical LINQ to HPC application is similar to that used by LINQ to Objects applications. For example, applications use LINQ operators to transform or group the elements of a sequence, and use foreach to enumerate the transformed values. The query-based programming model is more expressive and flexible than the model used by distributed computing frameworks that require all processing to be formulated in terms of the map-reduce pattern.
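
The transform, group, and enumerate style described above can be sketched in a few lines. LINQ to HPC is a .NET API, so the following Python is only an analogy to that style and does not show the LINQ to HPC API itself. The sample data and all names are hypothetical.

```python
from itertools import groupby

# Hypothetical input sequence.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Transform: a lazy generator, evaluated only when it is enumerated
# (roughly comparable to a Select operator).
upper = (w.upper() for w in words)

# Group: itertools.groupby only groups adjacent items, so the input is
# sorted first (a LINQ GroupBy operator does not need this step).
by_initial = {key: list(group)
              for key, group in groupby(sorted(upper), key=lambda w: w[0])}

# Enumerate the grouped, transformed values (comparable to foreach).
for initial, group in by_initial.items():
    print(initial, group)
```

The point of the sketch is that the program states what to compute (transform, then group) rather than how to distribute the work, which is the property that lets a query provider choose an execution strategy.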

The second unique aspect of LINQ to HPC is its ability to automatically analyze and optimize the execution plan of a query based on many factors, including the topology of the cluster that runs the query. For example, you can use LINQ to express a grouped aggregation query (a group-by operation that is followed by an aggregation operation). LINQ to HPC automatically analyzes expression trees to create an optimized query plan that decomposes the grouped aggregation into efficient, distributed computations. The plan uses partial aggregation to greatly reduce the amount of network data transfer. With LINQ to HPC, programmers do not need to be experts in distributed computing algorithms in order to write LINQ queries that execute efficiently on a cluster.
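
To illustrate why partial aggregation reduces network transfer, the following minimal Python sketch compares two plans for a grouped count. It is not LINQ to HPC code and does not reflect how the actual optimizer is implemented. The partitions, the function names, and the "records shipped" measure are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the partition stored on one node.
partitions = [
    ["error", "info", "info", "warn", "info"],
    ["info", "error", "error", "info"],
    ["warn", "info", "info", "info", "error"],
]

def naive_plan(parts):
    """Ship every record to one place, then group and count there."""
    shipped = [record for part in parts for record in part]
    return Counter(shipped), len(shipped)

def partial_aggregation_plan(parts):
    """Count within each partition first, then merge the small partial results."""
    partials = [Counter(part) for part in parts]   # computed where the data lives
    shipped = sum(len(p) for p in partials)        # only (key, count) pairs move
    total = Counter()
    for p in partials:
        total.update(p)                            # merge step adds the counts
    return total, shipped

naive_counts, naive_shipped = naive_plan(partitions)
partial_counts, partial_shipped = partial_aggregation_plan(partitions)

print(naive_counts == partial_counts)   # both plans give the same answer
print(naive_shipped, partial_shipped)   # items moved under each plan
```

In this toy data set the naive plan moves all 14 records, while the partial aggregation plan moves 8 key and count pairs and produces the same totals. The saving grows as the number of records per key grows, because each partition ships at most one pair per distinct key.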

This document discusses the features that make up the LINQ to HPC system. It also gives an overview of how a LINQ to HPC query executes.

This document assumes that readers are already familiar with HPC clusters and with concepts such as compute nodes and the role of the HPC head node. For information about HPC clusters, see the Windows HPC Server site at http://www.microsoft.com/hpc/en/us/default.aspx.

In this overview: