Managing LINQ to HPC Queries

This section includes the following topics:

  • Running concurrent queries

  • Monitoring a running LINQ to HPC query

  • Canceling a LINQ to HPC query

  • Debugging a LINQ to HPC query

  • Optimizing the performance and scalability of LINQ to HPC clusters

  • Security considerations for LINQ to HPC queries

  • Supported versions of .NET

  • Migrating to LINQ to HPC

The last topic is for developers who are migrating to LINQ to HPC from other distributed computing frameworks.

Running concurrent queries

You can use an overloaded version of the Submit method to run multiple queries simultaneously on the cluster as a single job.
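As a hypothetical sketch of this technique (the fileset names and the exact shape of the multi-query Submit call are assumptions; check the SDK reference for the actual overload), submitting two queries together might look like this:

```csharp
// Hypothetical sketch: run two queries as a single HPC job by using a
// Submit overload that accepts multiple queries. The fileset names and
// the exact form of the overload are assumptions, not the documented API.
var query1 = context.FromDsc<LineRecord>("fileset1").CountAsQuery();
var query2 = context.FromDsc<LineRecord>("fileset2").CountAsQuery();

// Both queries are scheduled on the cluster as one job.
HpcLinqJobInfo info = HpcLinqQueryable.Submit(query1, query2);
info.Wait();

Console.WriteLine("Counts: {0}, {1}", query1.First(), query2.First());
```

Running related queries as a single job avoids the scheduling overhead of submitting each query separately.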

Monitoring a running LINQ to HPC query

When a LINQ to HPC query runs, the scheduler creates a new HPC job to execute the query. In most of the examples so far, the SubmitAndWait method has been used to block program execution until the LINQ to HPC query has completed. In the following example, the program uses the Submit method to schedule a job on the cluster. Submit returns an HpcLinqJobInfo object that has two properties: JobId and HeadNode. The program then uses the Wait method to block until the job has completed before displaying the result.

string fileSetName = ...

var query = context.FromDsc<LineRecord>(fileSetName).CountAsQuery();
HpcLinqJobInfo info = query.Submit();
Console.WriteLine("Submitted job " + info.JobId + " on " + info.HeadNode + "... ");
info.Wait();
Console.WriteLine("Found " + query.First() + " lines. ");

You can also use the methods of the Scheduler class, in the Microsoft.Hpc.Scheduler namespace, to monitor the status of the job. You must add a reference to the Microsoft.Hpc.Scheduler assembly, which is included in the SDK but not in the client installation.

The following code shows an extension method for the LINQ to HPC HpcLinqJobInfo type that gets the scheduler job associated with the query.

public static ISchedulerJob GetJob(this HpcLinqJobInfo jobInfo)
{
   // Connect to the head node and open the scheduler job for this query.
   IScheduler scheduler = new Scheduler();
   scheduler.Connect(jobInfo.HeadNode);
   return scheduler.OpenJob(jobInfo.JobId);
}
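As an illustrative sketch (the query variable is a placeholder for any submitted LINQ to HPC query, and the JobState enumeration comes from the Microsoft.Hpc.Scheduler.Properties namespace), the GetJob extension method can be combined with the job's State property to poll a running query:

```csharp
// Illustrative sketch: poll the scheduler until the query's job reaches
// a terminal state. "query" is a placeholder for a LINQ to HPC query;
// GetJob is the extension method defined above.
HpcLinqJobInfo info = query.Submit();
ISchedulerJob job = info.GetJob();

while (job.State != JobState.Finished &&
       job.State != JobState.Failed &&
       job.State != JobState.Canceled)
{
    System.Threading.Thread.Sleep(2000);  // avoid hammering the head node
    job.Refresh();                        // re-read properties from the scheduler
}

Console.WriteLine("Job {0} ended in state {1}", info.JobId, job.State);
```

Polling with Refresh is the simplest approach; the scheduler API also raises job state events if you prefer notification over polling.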

Canceling a LINQ to HPC query

You can cancel a running LINQ to HPC query. The following code shows how to do this.

string intData = ...

var query = context.FromDsc<long>(intData).SumAsQuery();

HpcLinqJobInfo info = query.Submit();

// Cancel the job by using the CancelJob extension method defined below.
context.CancelJob(info);

JobState job = info.GetJob().State;

Console.WriteLine("Job state is: {0}", job);

The Wait method is discussed in the previous section, Monitoring a Running LINQ to HPC Query. The CancelJob method is a user-provided extension method that is defined in the following code.

public static void CancelJob(this HpcLinqContext context, HpcLinqJobInfo info)
{
  if (info == null)
    throw new ArgumentNullException("info");

  if (context.Configuration.LocalDebug)
    return; // in local debug mode there is no cluster job to cancel

  using (IScheduler scheduler = new Scheduler())
  {
     scheduler.Connect(context.Configuration.HeadNode);
     scheduler.CancelJob(info.JobId, "Job canceled by user");
  }
}

Debugging a LINQ to HPC query

This section discusses how to debug a LINQ to HPC query both locally and remotely.

Examining the failed job output

Before you debug a query locally or attach the remote debugger, it is often helpful to simply examine the output of the failed vertex to see if the error information can help you to resolve the issue. The following steps show you how to do this.

  1. Open HPC Job Manager and select the Job Management tab.

  2. Select the failed job that is associated with the query.

  3. Double-click the Graph Manager task to show the View Job dialog box.

  4. Review the Results tab. It contains the call stack from the failed vertex, the output from the graph manager, and a path to a share on the vertex node that contains output data that is related to the vertex.

This information is often enough to tell you why the vertex failed, without any need to use a debugger.

Examining the failed query compilation output

It is possible to write a query that compiles within your application but that fails to compile within an assembly that is intended to execute as part of a LINQ to HPC query. An example is a query whose lambda expression calls a method that is private to your application assembly. In this case, an HpcLinqException is thrown at run time. The exception message includes the path to the dynamically generated assembly for the query. The LINQ to HPC runtime also generates an HpcLinq.log file in the same directory as the generated assembly. This log file contains the compiler output, which will help you to understand why the runtime was unable to compile the query assembly.

Local debugging

If you set the LocalDebug property of your context's HpcLinqConfiguration object to true, then any queries that are created from that context will use the LINQ to HPC local debugging mode. In local debugging mode, LINQ to HPC queries execute in the current AppDomain and use LINQ-to-Objects to execute the query. Local debugging is useful if you want to debug user functions before you execute the job on the cluster. This mode accesses the DSC in the normal way for input and output data. (Local debug mode does not let you run without a connection to the cluster.) However, the vertex code is not compiled, and the job is not submitted to the HPC server.

Query optimizations do not occur in local debug mode. For example, if you implement the IDecomposable interface, the IDecomposable methods are not called while you are in this mode.

Custom serializers cannot be invoked in local debug mode.
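The following sketch shows how local debug mode is enabled; the LocalDebug property and the context types are the ones described above, and the head node name is a placeholder.

```csharp
// Enable local debug mode: queries created from this context execute in
// the current AppDomain by using LINQ to Objects. "MYHEADNODE" is a
// placeholder for your cluster's head node.
HpcLinqConfiguration config = new HpcLinqConfiguration("MYHEADNODE");
config.LocalDebug = true;

HpcLinqContext context = new HpcLinqContext(config);
// Queries still read and write DSC file sets on the cluster, but no
// HPC job is submitted, so you can step through user functions locally.
```

Because the query runs in-process, you can set ordinary breakpoints in your lambda expressions and debug them with the standard Visual Studio debugger.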

Remote debugging

LINQ to HPC queries are divided into units of work called vertices. If a vertex fails, the LINQ to HPC run-time components save the inputs to the vertex for diagnostic purposes. To diagnose the problem, you can attach the Microsoft® Visual Studio® debugger to the process and rerun the vertex on the DSC node where it failed.

Typically, you use remote debugging to discover why a vertex failed. The reason is that it is unlikely that DSC nodes have the Microsoft Visual Studio development environment installed. The nodes of an HPC cluster are often a shared resource, and IT administrators may be reluctant to grant individual developers the privileges required to install and run Visual Studio. To use remote debugging, you must have Visual Studio installed on the client machine, and the Remote Debugging Monitor (Msvsmon.exe) installed on the DSC node. For more information about Msvsmon.exe, see Remote Debugging Setup.

Remote debugging is not supported by the Visual Studio Express edition.

If you do not administer your own cluster, check with your IT administrator to see if Msvsmon.exe is installed on the HPC compute nodes.

The following procedures show you how to begin a remote debugging session. There are four procedures, which you should perform in order:

  • Configure the query to support debugging

  • Find the failed vertex

  • Set up the DSC node

  • Begin the remote debugging session

Configure the query to support debugging

Set the CompileForVertexDebugging property to configure your LINQ to HPC query to support debugging. The following code shows how to do this.

HpcLinqConfiguration config = new HpcLinqConfiguration("MYHEADNODE");
config.CompileForVertexDebugging = true;

This code sets the appropriate compiler flags, and ensures that the program databases (PDBs) are copied to the vertex.

Find the failed vertex

  1. On the client machine, start the HPC Job Manager and select the failed job.

  2. In the Tasks tab, double-click the Graph Manager Task to view the job details.

  3. Select View Tasks. The Results tab displays the vertex failure details, which tell you the node to debug. Here is an example.

    Trace level set to ERROR
    Graph abort because vertex failed 6 times: vertex 8 in stage Super__19
    Error returned from managed runtime invocation, Error code 2148734208 (0x80131500)
    For vertex logs, exception dump and rerun batch files see working directory for failed vertex:

In this example, the vertex that failed was part of job 2125, which was run by the user USERNAME on COMPUTENODE1. Make a note of the directory. You will need it in the next procedure.

Set up the DSC node

  1. On the client machine, start the HPC Cluster Manager.

  2. Use Remote Desktop to open a session on the DSC node that has the failed vertex. Open a Windows® command prompt and type Msvsmon.exe to start the Remote Debugging Monitor.

  3. Navigate to the working directory of the failed vertex. (This is the directory that is included in the vertex failure details.)

  4. From the Windows command prompt, type: set LINQTOHPC_DEBUGVERTEX=1. This command sets the LINQTOHPC_DEBUGVERTEX environment variable, which ensures that the vertex waits for the debugger to attach instead of immediately executing. In fact, you can set the variable to any value other than "LAUNCH".

  5. From the Windows command prompt, type: vertex-*-*-rerun.cmd. This command runs the batch file that was created when the vertex failed. The asterisks (*) must be replaced by the numbers used in the actual file name. Here is an example: vertex-8-6-rerun.cmd

All vertices first check the LINQTOHPC_DEBUGVERTEX environment variable to determine how to behave. If the variable is not set, the vertex begins to execute. If the variable is set to any value other than LAUNCH, the vertex waits for a debugger to attach. After it detects the attachment, the vertex issues a debug break. If the variable is set to LAUNCH, the vertex launches the default debugger that is registered on the system.

Begin the remote debugging session

  1. On the client machine, start Visual Studio.

  2. On the Debug menu, select Attach to Process.

  3. In the Attach to Process dialog box, in the Qualifier field, log on to the remote machine using the format username@machinename.

  4. Click Refresh to obtain the list of processes.

  5. Select the HpcQueryVertexHost.exe process.

  6. Click Attach. The debugger displays some assembly code, which you can ignore.

  7. Depending on what you want to do, you can:

    1. Point the Visual Studio debugger to the correct PDB files, and source files for the client application.

    2. Set breakpoints in the application code that runs on the vertex. Typically, this is code that executes within the lambda expressions of your LINQ to HPC query.

  8. Press F5 to continue execution of your application on the vertex.

Optimizing the performance and scalability of LINQ to HPC clusters

Here are some guidelines for optimizing the performance and scalability of a LINQ to HPC application.

If LINQ to HPC executes a query such as OrderBy, which establishes ordering constraints on the output, the graph manager can use that information to optimize succeeding query operations. However, if you use the ToDsc operation to save the results of a query to a new DSC file set, subsequent queries that use the FromDsc operation to read the file set will not be aware that the data is sorted. File sets do not contain any metadata that records all the context conditions that are used to generate optimized query plans.

You can tell the graph manager that a file set is the result of an OrderBy operation by including the AssumeOrderBy operator in the subsequent query. The following code is an example of how to use the OrderBy operation.

string intFileSetName= ...
string sortedIntFileSetName = ...

var query1 = context.FromDsc<long>(intFileSetName)
                    .OrderBy(x => x)
                    .ToDsc(sortedIntFileSetName)
                    .SubmitAndWait();
This code sorts a file set of 64-bit integers into ascending order.

The following code is an example of how to use the AssumeOrderBy operator.

string sortedIntFileSetName = ...

var query2 = context.FromDsc<long>(sortedIntFileSetName)
                    .AssumeOrderBy(x => x, false)
                    .Min();
This code gets the minimum value from the sorted list. The AssumeOrderBy operator tells the graph manager that the data is sorted.

In addition to the AssumeOrderBy operator, there are also operators named AssumeHashPartition and AssumeRangePartition that let you alert the graph manager about assumptions that it can make about the way the file set is divided into DSC files.
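As a hypothetical sketch (the operator signature is an assumption modeled on AssumeOrderBy above; check the SDK reference for the exact form), alerting the graph manager that a file set is already hash partitioned might look like this:

```csharp
// Hypothetical sketch: the signature of AssumeHashPartition is an
// assumption, not the documented API. The fileset name is a placeholder.
string hashedFileSetName = "...";

var query3 = context.FromDsc<long>(hashedFileSetName)
                    .AssumeHashPartition(x => x)  // data already hash partitioned by key
                    .GroupBy(x => x);
```

When the assumption holds, the graph manager can skip the exchange stage that would otherwise repartition the data before the GroupBy.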

Security considerations for LINQ to HPC queries

A typical LINQ to HPC application distributes data among many different computers, which must all have appropriate access to each other. In general, LINQ to HPC requires that the user be a member of the HPCUsers security group on the cluster. If a LINQ to HPC application encounters file access errors, it is probably because the user account on the client workstation does not have adequate privileges on at least one of the associated computers.

The user account on the client workstation that runs the job must have permission to submit and run jobs on the HPC cluster. The actual user password must be securely cached by the HPC client infrastructure. The HPC system prompts for missing credentials for console applications; however, in other situations, the proper credentials must be provided ahead of time with the IScheduler.SetCachedCredentials method or the HpcLinqConfiguration object’s JobPassword and JobUsername properties. If credentials are missing, the LINQ to HPC job will not be submitted successfully. Also, at a minimum, the application must have read access to any input data that is stored in files on other computers. For more information, see Security in Windows HPC Server 2008 R2 on TechNet.
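For non-interactive clients, the credentials can be supplied through the configuration object before the query is submitted. The following sketch assumes the JobUsername and JobPassword properties named above; ReadPasswordFromSecureStore is a hypothetical helper, and how the password is obtained is an application decision.

```csharp
// Sketch: pre-supply cluster credentials for a non-interactive client,
// such as a Windows service, that cannot respond to a console prompt.
// JobUsername and JobPassword are the HpcLinqConfiguration properties
// mentioned above; never hard-code the password in source.
HpcLinqConfiguration config = new HpcLinqConfiguration("MYHEADNODE");
config.JobUsername = @"DOMAIN\clusteruser";
config.JobPassword = ReadPasswordFromSecureStore();  // hypothetical helper

HpcLinqContext context = new HpcLinqContext(config);
```

With the credentials cached ahead of time, job submission proceeds without prompting, which is essential for unattended or scheduled workloads.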

Supported versions of .NET

LINQ to HPC and the DSC are based on Microsoft .NET version 3.5. LINQ to HPC cannot execute on the cluster using .NET 4. See the Configure a New LINQ to HPC Project in Visual Studio section for details of how to set up a Visual Studio project for LINQ to HPC.

Migrating to LINQ to HPC

If you currently use distributed computing frameworks, such as Hadoop, that are based on map-reduce, you will generally find that the flexibility of LINQ queries makes your applications simpler. Although LINQ can express map-reduce, it also supports many other distributed algorithms, and its syntax is very clear and expressive.

In general, the easiest way to port your data set is to divide it into text file partitions and use the DSC command-line interface to copy the data set to an HPC cluster. From there, it is very easy to implement map-reduce in terms of LINQ to HPC. The following section gives an example.

How to: Implement map-reduce in LINQ to HPC

The following code uses LINQ to implement the standard map-reduce algorithm.

public static IQueryable<TResult> MapReduce<TSource, TMap, TKey, TResult>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, IEnumerable<TMap>>> mapper,
    Expression<Func<TMap, TKey>> keySelector,
    Expression<Func<TKey, IEnumerable<TMap>, TResult>> reducer)
{
  return source.SelectMany(mapper).GroupBy(keySelector, reducer);
}
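To see the shape of the three functions, here is a small usage sketch that runs the same extension method over an in-memory sequence with LINQ to Objects; the input strings are invented for illustration. On a LINQ to HPC source, the identical call executes across the cluster.

```csharp
// Usage sketch: word count over an in-memory source, using the MapReduce
// extension method defined above. With a LINQ to HPC source the same
// call runs in a distributed manner.
var lines = new[] { "the quick brown fox", "the fox" }.AsQueryable();

var counts = lines.MapReduce(
    line => line.Split(' '),                                      // mapper
    word => word,                                                 // key selector
    (word, words) => new { Word = word, Count = words.Count() }); // reducer

foreach (var c in counts)
    Console.WriteLine("{0} : {1}", c.Word, c.Count);
```

The mapper flattens each input record into intermediate records, the key selector groups them, and the reducer collapses each group into one result, which is exactly the map-reduce contract.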

If the source query is a LINQ to HPC query that contains distributed data, then the map-reduce implementation will be executed across the compute nodes of the cluster in a distributed manner. The TSource type parameter specifies the type of the input values.

The mapper parameter is the user-provided map function. The TMap type parameter specifies the type of the mapper's result records.

The keySelector parameter is the user-provided key extraction function. The TKey type parameter specifies the type of the keys that will be used for hash exchange.

The reducer parameter is the user-provided reduce function. The TResult type parameter specifies the return type of the reduce function.

You can optimize the application of the reducer function by applying the [Decomposable] attribute to the definition of the reducer method. See Using the [Associative] Attribute for Distributed Aggregation.

The following code is taken from the MapReduce sample, which is part of the LINQ to HPC download. The MapReduce sample returns the word counts from a DSC file set that contains arbitrary text. This code shows the MapReduce operation that implements the word count operation. It is analogous to the word count operation that is implemented on the Hadoop website.

const string headNodeName = ...
string[] inputFiles = ...
string inputFileSetName = ...

HpcLinqConfiguration config = new HpcLinqConfiguration(headNodeName);
HpcLinqContext context = new HpcLinqContext(config);

// ...

// Define map expression

Expression<Func<LineRecord, IEnumerable<string>>> mapper = 
     (line) => line.Line.Split(new[] { ' ', '\t' }, 
                               StringSplitOptions.RemoveEmptyEntries);

// Define key selector

Expression<Func<string, string>> selector = word => word;

// Define reducer (LINQ to HPC is able to infer the Decomposable 
// nature of this expression)

Expression<Func<string, IEnumerable<string>, Pair>> reducer = 
     (key, words) => new Pair(key, words.Count());

// Map-reduce query with ordered results and take top 200.

IQueryable<Pair> results = context.FromDsc<LineRecord>(inputFileSetName)
            .MapReduce(mapper, selector, reducer)
            .OrderByDescending(pair => pair.Count)
            .Take(200);

// Print results

Console.WriteLine("Most common words in order of occurrences:");
foreach (Pair result in results)
   Console.WriteLine("{0,-20} : {1}", result.Word, result.Count);

The input file set consists of lines of text. The code splits each line into words, counts the occurrences of each word, and then returns the 200 most frequently used words, along with the number of times each is used.

See Using Distributed Grouped Aggregation for information about how LINQ to HPC automatically optimizes the group-aggregate phase of map-reduce.

You may also want to examine the Histogram sample, which uses the LINQ to HPC APIs directly to calculate the same word counts. It is instructive to compare the two code samples to see the expressiveness of the LINQ to HPC programming model.