Implementing Distributed Algorithms by Using LINQ to HPC

The records of a DSC file set are partitioned into DSC files that reside on the compute nodes of your cluster. How those records are distributed is ordinarily not visible to a LINQ to HPC query. Instead, the LINQ to HPC runtime components know how the records are distributed and create an optimized execution plan for the query, based on that information.

As you become more familiar with LINQ to HPC and use it for increasingly complex applications, you may need to write queries that, unlike most queries, are aware of how the records are partitioned. Such queries let you extend LINQ to HPC with new distributed algorithms.

Implementing new distributed algorithms is not required for most LINQ to HPC applications. The reason is that the built-in LINQ to HPC operations already use distributed algorithms in their implementations. Distributed processing is applied for you, with no extra programming needed on your part.

Unlike the other operators supported by LINQ to HPC, the operators that you use to implement new distributed algorithms either depend on how the input file set has been divided into DSC files, or they alter how the output will be divided into DSC files. These operators fall into three categories, based on the ways they are most typically used. These categories are:

  • Partial reduction. Partial reduction is a technique that breaks aggregation operations into multiple, distributed stages. Operators for partial reduction process the records of individual DSC files. They convert subsequences of records into other subsequences of records that act as intermediate results. LINQ to HPC includes operators that apply partial reduction to DSC files that are part of a single file set, and to pairs or tuples of records that are taken from the DSC files of multiple input file sets. Overloaded versions of the Apply operator provide the interface to partial reduction. These overloads invoke user-provided methods that have the custom attributes [DistributiveOverConcat] and [LeftDistributiveOverConcat].

  • Repartitioning. Repartitioning operators partition the records of the output file set into DSC files that do not match the DSC files of the input file set. For example, in some cases you might want to co-locate all records with matching keys in the same DSC file. In other cases, you might want to increase or decrease the number of DSC files that are used for a particular sequence of records. The repartitioning operators are RangePartition and HashPartition.

  • Global reduction. Many distributed algorithms require a final processing step that combines intermediate results into a final result. The Apply operator performs global reduction when it invokes a user-provided function that does not have either the [DistributiveOverConcat] or the [LeftDistributiveOverConcat] attribute. Global reduction is also a special kind of repartitioning operation because it merges all of the DSC files of its input file into an output file set that has a single DSC file.

In addition to the operators that are aware of how a file set has been partitioned into DSC files, there are also higher-level operators whose functionality can be extended by using custom attributes on user-supplied reduction operators. For example, instead of implementing your own aggregation algorithm, you can reuse LINQ to HPC’s implementation of distributed aggregation simply by annotating the user-provided aggregation method with the [Associative] custom attribute. You can reuse the LINQ to HPC implementation of distributed grouped aggregation for your own aggregation methods with the [Decomposable] attribute.

These building blocks for distributed algorithms are described in the following sections, along with examples of how to use them.

This section includes the following topics.