The HashPartition Operator

The HashPartition operator uses a hash value that is computed for each record's key to partition records into DSC files. Like the RangePartition operator, the HashPartition operator always places records with the same key in the same DSC file. However, the HashPartition operator assigns keys to files based on the calculated hash value. As a result, each DSC file of the output file set contains what is effectively a randomly selected subset of the available keys, with the constraint that records with the same key will be placed in the same DSC file. The records in the output file set have no defined order.

Hash partitioning is useful for implementing distributed algorithms that are based on unsorted grouping. It can also be used to change the number of DSC files that are used to store the records of a file set. The HashPartition operator is typically used rather than the RangePartition operator because it is less computationally intensive. Hash partitioning is often used as part of distributed grouping algorithms, while range partitioning is often used in distributed sorting algorithms.

There are four overloaded versions of the HashPartition operator. The overloads give you the option of specifying a different number of DSC files than are included in the input file set.

Hash partitioning without changes to the number of DSC files

Here is the signature of the overloaded versions of the HashPartition operator that group records with the same hash key into the same DSC file, but do not change the overall number of DSC files. One of these enables you to specify how the hash code should be computed. Here are the signatures of the overloaded methods.

public static IQueryable<TSource> HashPartition<TSource, TKey>(
  this IQueryable<TSource> source, 
  Expression<Func<TSource, TKey>> keySelector) 

public static IQueryable<TSource> HashPartition<TSource, TKey>(
  this IQueryable<TSource> source, 
  Expression<Func<TSource, TKey>> keySelector, 
  IEqualityComparer<TKey> comparer)

In the output file set, all records that have the same key value are located in the same DSC file.

Here is an example of hash partitioning that does not change the number of DSC files. The example behaves like the LINQ to HPC GroupBy operator when the operator is applied to a file set that contains records of type KeyValuePair<string, string> that use x => x.Key as the key selector.

string classFileSetName = ...
string partitionedFileSetName = ...

context.FromDsc<Tuple<string, string>>(classFileSetName)
         .HashPartition(x => x.Item1)
         .Apply(localRecords => LocalGroupBy(localRecords))
         .ToDsc(partitionedFileSetName)
         .SubmitAndWait(); 

This code implements a distributed grouping algorithm. The input file set consists of pairs, with the name of a teacher followed by the name of a student. If the file set contains more than one DSC file, then the HashPartition operation is distributed. The result of the HashPartition is a distributed data set where records for a given teacher are guaranteed to be in the same DSC file. The number of DSC files will not change.

noteNote
The HpcTuple data type that is used in this example is user defined. It is not provided by .NET Framework 3.5. The implementation of HpcTuple is in the HpLinqExtras project that ships as part of the samples.

Here is the definition of the LocalGroupBy helper method.

[DistributiveOverConcat]
public static IEnumerable<IGrouping<string, HpcTuple<string, string>>> 
LocalGroupBy(IEnumerable<HpcTuple<string, string>> localRecords)
{
  return localRecords.GroupBy(x => x.Item1);
}

The LocalGroupBy method is applied on a per-file basis. It uses the LINQ-to-Objects GroupBy operator for each of the DSC files in the distributed hash-partitioned file set.

Hash partitioning that changes the number of DSC files

You can also use hash partitioning to change the number of DSC files that store a sequence of records.

There are two overloaded versions of the HashPartition operator that enable you to specify the number of DSC files that are used in the output file set. The second version uses a comparer parameter.

public static IQueryable<TSource> HashPartition<TSource, TKey>(
  this IQueryable<TSource> source, 
  Expression<Func<TSource, TKey>> keySelector, 
  int partitionCount)

public static IQueryable<TSource> HashPartition<TSource, TKey>(
  this IQueryable<TSource> source, 
  Expression<Func<TSource, TKey>> keySelector, 
  IEqualityComparer<TKey> comparer, 
  int partitionCount)

The count parameter is the number of DSC files that are used in the output file set. The comparer parameter is a comparison object that provides the GetHashCode method. The comparer parameter can be specified when you want to override the default hash code generator of the TKey type with a hash code function that you provide.



Show: