Creating a DSC File Set from Text Files

LINQ to HPC queries are often used to analyze lines of text from multiple files. For example, you might want to analyze logs of a web server for usage trends.

There are several ways to create a DSC file set from text files. You can write a Microsoft® .NET program that creates the file set programmatically, or you can use the Dsc.exe command-line interface. Either technique creates a file set that has one DSC file for each text file. You can also use techniques that create file sets with one DSC file for each DSC node. In this case, some of the text files must be split or concatenated. One of these techniques is called hash partitioning. It is somewhat more complicated than the other two approaches, but it can be very useful when you have many more files than compute nodes. For more information, see Creating a DSC File Set from Text Files by Using Hash Partitioning.

Creating a DSC file set from text files programmatically

Here is an example of a .NET program that creates a DSC file set from text files. The code produces a file set with one DSC file for each text file. Although it is the simplest way to distribute text data to the cluster, it may not be suitable when there are a large number of files.

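var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);

string myFileSetName = ...
string[] files = ...

// Create a new, empty file set that stores its data uncompressed.
DscFileSet fileSet = context.DscService.CreateFileSet(myFileSetName, DscCompressionScheme.None);

foreach (var path in files)
{
  // Add an empty DSC file, passing the length of the text file, and then
  // copy the text file to the UNC path that the DSC assigns to the new file.
  FileInfo info = new FileInfo(path);
  File.Copy(path, fileSet.AddNewFile((ulong)info.Length).WritePath);
}

// Mark the file set as complete so that it can be read and replicated.
fileSet.Seal();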

This code example creates a DSC file set on the HPC cluster. The name of the new file set is given by the value of the myFileSetName variable. You must create an instance of the HpcLinqContext class before you can perform any operations on the DSC. The CreateFileSet method creates a new, empty file set that will hold the uncompressed data. The AddNewFile method creates an empty DSC file on the cluster. The DSC chooses a DSC node for the file for you. The code uses the File.Copy method to transfer the contents of each text file to the DSC node that the DSC designates to hold the data.

Note that the WritePath property of the DscFile class returns a UNC path name of a file in the cluster. You cannot write directly to a DscFile object. Instead, you must get a path name from the WritePath property.

The Seal method tells the DSC that the file set’s data is complete. When you use the CreateFileSet method to create a DSC file set, you must call the Seal method before you attempt to read data from the new file set. Invoking the FromDsc method on an unsealed file set causes an exception to be thrown. Sealing a file set allows it to be replicated according to the current DSC replication factor. The seal will fail if there are insufficient nodes to replicate the file set. By default, the cluster replication factor is 3, so the cluster must contain at least 3 DSC nodes in order to successfully create and seal a file set.

Note
You may observe some loss in performance if the data in a DSC file set is divided into many small files. When you run a query, the LINQ to HPC runtime creates a vertex for each DSC file in the file set. If you have many small files, consider using hash partitioning to reduce their number. (For more information, see Creating a DSC File Set from Text Files by Using Hash Partitioning.) You can also concatenate some of the files before you create the DSC file set. See Choosing the Right Number of DSC Files for guidelines on how to partition your data for the best performance.

Note
The Seal operation can sometimes take a long time. In particular, if multiple clients are sealing multiple file sets at the same time, performance may be compromised.
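The suggestion in the first note, to concatenate small files before you create the DSC file set, can be done locally with the standard .NET file APIs. The following is a minimal sketch and not part of LINQ to HPC. The TextFileBatcher class, its Concatenate method, and the batchN.txt output names are hypothetical names introduced for illustration. The sketch assigns the input files to batches in round-robin order and copies them line by line, so that a file that lacks a trailing newline does not run into the next file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of LINQ to HPC.
public static class TextFileBatcher
{
    // Concatenates the input files into at most batchCount larger files in
    // outputDir and returns the paths of the files that were written.
    public static List<string> Concatenate(
        IList<string> inputPaths, int batchCount, string outputDir)
    {
        if (batchCount < 1)
            throw new ArgumentOutOfRangeException("batchCount");

        Directory.CreateDirectory(outputDir);
        var outputs = new List<string>();

        // Assign input i to batch (i % batchCount), so that the number of
        // input files per batch differs by at most one.
        var batches = inputPaths
            .Select((path, index) => new { path, index })
            .GroupBy(x => x.index % batchCount, x => x.path);

        foreach (var batch in batches)
        {
            string outPath =
                Path.Combine(outputDir, "batch" + batch.Key + ".txt");
            using (var writer = new StreamWriter(outPath))
            {
                // Copy line by line so that every line ends with a newline.
                foreach (string path in batch)
                    foreach (string line in File.ReadLines(path))
                        writer.WriteLine(line);
            }
            outputs.Add(outPath);
        }
        return outputs;
    }
}
```

You could then pass the returned paths as the files array in the earlier example. Round-robin assignment balances the number of input files per batch, not the number of bytes, so batches can still differ in size if the input files do.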

After you create the DSC file set that contains the text data, you can run a LINQ to HPC query. The following code is an example.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);
string myFileSetName = ...

var lines = context.FromDsc<LineRecord>(myFileSetName);
Console.WriteLine("The number of lines is {0}", lines.Count());

This code prints the aggregate number of text lines. In other words, it produces the sum of the line counts for all of the DSC files in the file set. The LineRecord class represents a line of text. You can use the Line property on the LineRecord class to convert a line record into a .NET string.

Creating a DSC file set from text files with Dsc.exe

You do not need to write a program to create a DSC file set. Instead, you can use the command-line tool, Dsc.exe, to create a DSC file set from individual text files.

Here is an example of how to use the command-line interface.

dsc.exe fileset add \\MyServer\data MyFileSet /compression:None /service:MyHpcClusterHeadNode

The fileset add command creates a new file set named MyFileSet and adds it to the DSC. All files in the specified directory, \\MyServer\data, will be uploaded as DSC files to the new file set. The upload uses the DSC service that runs on the computer named MyHpcClusterHeadNode. If you omit the /service option, the dsc.exe program uses the host name that is stored in the environment variable %CCP_SCHEDULER%.
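If you want your own .NET program to fall back to the same environment variable when no head node name is supplied, you could resolve the name as follows. This is a sketch under assumptions: the HeadNodeResolver class and its Resolve method are hypothetical, and the fallback behavior mirrors what is described above for dsc.exe rather than any documented behavior of the LINQ to HPC classes.

```csharp
using System;

// Hypothetical helper, not part of LINQ to HPC or Dsc.exe.
public static class HeadNodeResolver
{
    // Returns explicitName if one is given. Otherwise, returns the value
    // of the CCP_SCHEDULER environment variable.
    public static string Resolve(string explicitName)
    {
        if (!string.IsNullOrEmpty(explicitName))
            return explicitName;

        string fromEnvironment =
            Environment.GetEnvironmentVariable("CCP_SCHEDULER");
        if (string.IsNullOrEmpty(fromEnvironment))
            throw new InvalidOperationException(
                "No head node name was given and CCP_SCHEDULER is not set.");

        return fromEnvironment;
    }
}
```

The resolved name could then be passed to the HpcLinqConfiguration constructor in place of the literal "MyHpcClusterHeadNode" that the examples in this topic use.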

The same cautions apply to the command-line approach as to the programmatic approach that is explained in Creating a DSC File Set from Text Files Programmatically.

Creating a DSC file set from text files by using hash partitioning

As discussed previously, you may want to control how your data is grouped into DSC files when you distribute it across the compute nodes. For example, you may want to create one DSC file per DSC node, even if you have more files than nodes. If this is your situation, then use hash partitioning. The following code sample shows an example.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);    
       
string[] filePaths = ...
string FileListTempFileSetName = ...
string DataPartitionedFileSetName = ...

// Create a temporary file set with a single DSC file. The file set
// contains the path names of the files to copy.
            
DscFileSet fileSet = 
  context.DscService.CreateFileSet(FileListTempFileSetName, ... );
            
DscFile file = fileSet.AddNewFile(100);

File.WriteAllLines(file.WritePath, filePaths);

fileSet.Seal();

// Read the path names to copy as LineRecords within a LINQ to HPC query, and
// distribute the path names across the cluster based on each path
// name’s hash code. Then, copy the text lines contained in the files to each node.

int stageSize = context.GetNodeCount();

context.FromDsc<LineRecord>(fileSet.Name)
       .HashPartition(r => r, stageSize)
       .SelectMany(lr => Utilities.ReadAllLinesAsEnumerable(lr.Line))
       .ToDsc(DataPartitionedFileSetName)
       .SubmitAndWait();

This approach requires several steps. First, create a temporary file set with a single DSC file that contains the path names of the files that you want to import.

Next, run a hash partitioning query. A hash partitioning query is an operation that distributes data across the cluster based on user-provided categories. In this case, the category is the path name. The hash codes are used to arbitrarily distribute path names across the cluster, which makes it very likely that the data is divided approximately evenly across all the nodes. The second parameter to the HashPartition operator, n, is the number of partitions to create. Quite often, you set n equal to the number of DSC nodes. GetNodeCount is a helper method, defined in the HpcLinqExtras project, which returns the number of nodes in the user-defined group LinqToHpcNodes. If this group is not defined, then the method returns the total number of cluster nodes.
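The idea of assigning items to partitions by hash code can be illustrated locally, without a cluster. The following sketch is not the LINQ to HPC implementation of HashPartition. It uses only LINQ to Objects, and the HashPartitionSketch class and its PartitionOf and Partition methods are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local illustration, not the LINQ to HPC implementation.
public static class HashPartitionSketch
{
    // Maps a key to a partition index in the range [0, partitionCount).
    public static int PartitionOf(string key, int partitionCount)
    {
        if (partitionCount < 1)
            throw new ArgumentOutOfRangeException("partitionCount");

        // Clear the sign bit so that the remainder is never negative.
        return (key.GetHashCode() & 0x7FFFFFFF) % partitionCount;
    }

    // Groups the path names by partition index. Partitions that receive
    // no path names do not appear in the result.
    public static Dictionary<int, List<string>> Partition(
        IEnumerable<string> paths, int partitionCount)
    {
        return paths
            .GroupBy(p => PartitionOf(p, partitionCount))
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

If you call Partition with, for example, 1,000 distinct path names and a partition count of 4, you would typically see roughly 250 path names per partition. The exact counts depend on the hash function, and string hash codes in .NET can differ between runtime versions and processes, so no specific output is shown here. Note that the sketch balances the number of path names, not the sizes of the files that they refer to.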

Then, use the SelectMany operator to map path names into lines of text. The ReadAllLinesAsEnumerable helper method converts each path name into an enumerable sequence that contains all of the text lines of the file with the given path name. The SelectMany operator transforms each input element into zero or more output elements and flattens the results into a single sequence. In this example, SelectMany is a fully distributed operation that runs across the nodes of the cluster.
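The flattening behavior of SelectMany is the same as in ordinary LINQ to Objects, so you can try it on a single computer. The following sketch uses only the standard .NET libraries and does not run on the cluster. The SelectManySketch class and its AllLines method are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical local illustration that uses LINQ to Objects.
public static class SelectManySketch
{
    // Maps each path name to the lines of that file and flattens the
    // per-file sequences into one list of lines, in input order.
    public static List<string> AllLines(IEnumerable<string> paths)
    {
        return paths
            .SelectMany(path => File.ReadLines(path))
            .ToList();
    }
}
```

With Select instead of SelectMany, the result would be a sequence of sequences, one per file. SelectMany removes that extra level of nesting, which is why the query in this topic produces a flat collection of line records.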

As a final step, use the ToDsc method to copy the lines of the text files into a new file set. Because the copy operations are distributed across the cluster, the LINQ to HPC runtime ensures that the resulting file set is distributed, as well.

The ReadAllLinesAsEnumerable helper method is defined in the following code.

public static IEnumerable<LineRecord> ReadAllLinesAsEnumerable(string filePath)
{
  using (StreamReader stream = File.OpenText(filePath))
  {
    while (!stream.EndOfStream)
        yield return new LineRecord(stream.ReadLine());
  }
}

The helper method transforms a path name into an enumeration of text lines in the associated text file.

The SubmitAndWait method starts a LINQ to HPC query and then waits for it to finish. The wait is necessary because the Submit method is asynchronous: it returns before the query completes. See Monitoring a Running LINQ to HPC Query for information on how to wait for a submitted query to complete.


