Creating Free-Form Compound Queries

In the examples that have been shown so far, most queries have been structured as a sequence of stages. However, LINQ to HPC also allows for query structures that are arbitrary directed acyclic graphs. This feature enables you to create queries that calculate multiple results and to use results from subqueries, even scalar results, as inputs to subsequent stages of a compound query.

An example is the clearest way to show how to create a free-form compound query. The following compound query calculates the mean, standard deviation, and element count of an input file set.

string fileSetName = ...

var input = context.FromDsc<double>(fileSetName);
var mean = input.AverageAsQuery();
var count = input.LongCountAsQuery();
var errors = input.Select(x => x - mean.Single());
var squaredErrors = errors.SumAsQuery(x => x * x);
var stdDev = squaredErrors.CalcStdDevQuery(count);
var result = stdDev.Select(x => new Statistics(mean.Single(), count.Single(), x));

var statistics = result.Single();
Console.WriteLine("The data has:\n  Average: {0}\n  Count: {1}\n  StdDev: {2}", 
                  statistics.Average, statistics.Count, statistics.StdDev);

The example calculates the standard deviation of a sequence of values as the square root of the average of squared deviations from the mean.

To perform this calculation as a single compound query, the example computes the following values.

  • The mean value (that is, the average of all values in the input file set)

  • The count of all values

  • The error sequence, which consists of each value minus the mean

  • The sum of squared errors

  • The quotient that is calculated by dividing the sum of the squared errors by the number of elements

  • The square root of the quotient
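The steps above can be sketched locally with ordinary LINQ to Objects. This is only an in-memory illustration of the arithmetic, not a LINQ to HPC query; the class name LocalStdDevSketch and the sample data are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalStdDevSketch
{
    // Local, in-memory analog of the listed steps (not LINQ to HPC).
    public static double StdDev(IEnumerable<double> values)
    {
        double[] data = values.ToArray();
        double mean = data.Average();                             // the mean value
        long count = data.LongLength;                             // the count of all values
        IEnumerable<double> errors = data.Select(x => x - mean);  // each value minus the mean
        double sumOfSquares = errors.Sum(e => e * e);             // the sum of squared errors
        double quotient = sumOfSquares / count;                   // divide by the number of elements
        return Math.Sqrt(quotient);                               // the square root of the quotient
    }

    public static void Main()
    {
        // Hypothetical sample data: mean 5, sum of squared errors 32, count 8.
        double[] sample = { 2, 4, 4, 4, 5, 5, 7, 9 };
        Console.WriteLine(StdDev(sample)); // prints 2
    }
}
```

Each local variable in the sketch corresponds to one subquery in the compound query. The difference is that here every step runs immediately in memory, whereas the compound query defers all of them until the final result is requested.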

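The code uses a struct to contain the values that are returned by the query. The struct defines a three-argument constructor because the query creates its result with new Statistics(mean.Single(), count.Single(), x). Here is its definition.

[Serializable]
public struct Statistics
{
     public double Average;
     public long Count;
     public double StdDev;

     // Constructor used by the final Select stage of the compound query.
     public Statistics(double average, long count, double stdDev)
     {
          Average = average;
          Count = count;
          StdDev = stdDev;
     }
}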

The example uses a technique that relies on IQueryable objects that have a single value. It uses the LINQ to HPC special operators that return single-valued IQueryable result types for scalar queries. The AverageAsQuery, SumAsQuery, and LongCountAsQuery operators are of this type. Unlike their counterparts Average, Sum, and LongCount, these operators do not cause a query to execute immediately. This behavior allows them to be used to chain multiple subqueries together.
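The difference between an immediate operator and a deferred, single-valued one can be illustrated with plain LINQ to Objects. The following is a rough local analog, not the LINQ to HPC implementation: AverageDeferred, Source, and the Enumerations counter are hypothetical names introduced only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredScalarSketch
{
    // Counts how many times the source has actually been enumerated.
    public static int Enumerations;

    // Hypothetical source that records each enumeration when it starts.
    public static IEnumerable<double> Source()
    {
        Enumerations++;
        yield return 1.0;
        yield return 2.0;
        yield return 3.0;
    }

    // Hypothetical local analog of AverageAsQuery: returns a single-valued
    // sequence and does no work until that sequence is enumerated.
    public static IEnumerable<double> AverageDeferred(this IEnumerable<double> input)
    {
        yield return input.Average();
    }

    // Returns the enumeration count observed after each step.
    public static int[] Run()
    {
        Enumerations = 0;
        IEnumerable<double> mean = Source().AverageDeferred();
        int afterDeferredAverage = Enumerations;   // nothing has executed yet

        IEnumerable<double> doubled = mean.Select(m => m * 2);
        int afterChaining = Enumerations;          // chaining still executes nothing

        double value = doubled.Single();           // execution happens here
        int afterSingle = Enumerations;

        double eager = Source().Average();         // Average executes immediately
        int afterEagerAverage = Enumerations;

        return new[] { afterDeferredAverage, afterChaining, afterSingle, afterEagerAverage };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // prints 0, 0, 1, 2
    }
}
```

The counter stays at zero while the deferred stages are being chained and only advances when Single is called. This is the property that lets the AsQuery operators serve as building blocks of a larger query.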

You can use the single-valued IQueryable technique with a user-provided operator, as well as with LINQ to HPC built-in operators such as SumAsQuery. The example does this with the CalcStdDevQuery method. Here is its definition.

public static IQueryable<double> 
CalcStdDevQuery(this IQueryable<double> input, IQueryable<long> count)
{
  IQueryable<double> stddev = 
     input.Apply(count, (x, y) => Scale(x, y))
                       .Select(x => Math.Sqrt(x));
  return stddev;
}

The CalcStdDevQuery method uses queries to perform scalar arithmetic on the cluster. Its inputs are the single-valued queries input and count. Its return value is a single-valued query whose value is the square root of the quotient input[0] / count[0]. The Apply operator calculates the quotient: its argument list includes a lambda expression that calls the Scale method to perform the division. Here is the definition of the Scale method.

public static IEnumerable<double> 
Scale(IEnumerable<double> left, IEnumerable<long> right)  
{
   // left and right will contain a single value each
   double l = left.Single();
   long r = right.Single();
   yield return l / (double)r;
}

The result of the Apply operator is a single-valued query that contains the calculated quotient. A subsequent Select query transforms the quotient by taking its square root.
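To see the arithmetic in isolation, the Scale method can be exercised locally on single-element sequences. This sketch calls Scale directly instead of going through the LINQ to HPC Apply operator; CalcStdDevLocal and the input values are hypothetical stand-ins chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScaleSketch
{
    // Same Scale method as in the text.
    public static IEnumerable<double> Scale(IEnumerable<double> left, IEnumerable<long> right)
    {
        // left and right will contain a single value each
        double l = left.Single();
        long r = right.Single();
        yield return l / (double)r;
    }

    // Hypothetical local stand-in for CalcStdDevQuery: divides the sum of
    // squared errors by the count, then takes the square root.
    public static IEnumerable<double> CalcStdDevLocal(
        IEnumerable<double> sumOfSquares, IEnumerable<long> count)
    {
        return Scale(sumOfSquares, count).Select(x => Math.Sqrt(x));
    }

    public static void Main()
    {
        double[] sumOfSquares = { 32.0 };  // assumed sum of squared errors
        long[] count = { 8L };             // assumed element count
        Console.WriteLine(CalcStdDevLocal(sumOfSquares, count).Single()); // prints 2
    }
}
```

Both inputs and the output are sequences that hold exactly one element, which mirrors the single-valued query idiom. Scale throws an InvalidOperationException from Single if either input contains zero or several elements.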

Finally, the mean, count, and stdDev values can be returned together as a single-valued IQueryable<Statistics> object. To do this, the code applies the Select operator to the single-valued stdDev query that was returned by the CalcStdDevQuery method. This is shown in the following code.

var result = stdDev.Select(x => new Statistics(mean.Single(), count.Single(), x));

var statistics = result.Single();
Console.WriteLine("The data has:\n  Average: {0}\n  Count: {1}\n  StdDev: {2}", 
                  statistics.Average, statistics.Count, statistics.StdDev);

The code creates an IQueryable<Statistics> object whose single value is a Statistics struct. The code assigns the query as the value of the result variable. Calling the Single method on this query causes the compound query, including its subqueries, to be queued for execution as a LINQ to HPC job on the cluster. The Single method blocks until the query has completed.

Note
If this example had not used the single-valued query idiom, it would have been necessary to run at least three separate LINQ to HPC queries to calculate the result. By combining those queries into a single query, the example avoided the overhead and management costs of having multiple HPC jobs.


