Creating a DSC File Set for Application-Specific Files
In many situations, you may want to process application-specific files from a LINQ to HPC query. For example, you might want to apply an operation on a per-file basis.
An example of processing application-specific files is the DupPic2 example that is shipped as one of the LINQ to HPC samples. The DupPic2 example calculates the checksum of each image in a large set of images, and then looks for duplicates. Duplicate images have the same checksum.
To perform a query that accesses individual files, you first need to transfer the JPEG files to the cluster. The following code does this.
var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
config.OutputDataCompressionScheme = DscCompressionScheme.None;
var context = new HpcLinqContext(config);
string[] imageFilePaths = ...
string inputFileSetName = ...
// Load all the images from a share onto the cluster and store them in
// FileRecords. The FileRecords are stored in a file set that contains
// multiple records per file and is distributed over the cluster.
int stageSize = context.GetNodeCount();
context.FromEnumerable(imageFilePaths)
       .HashPartition(r => r, stageSize)
       .Select(path => new FileRecord(path))
       .ToDsc(inputFileSetName)
       .SubmitAndWait();
The FromEnumerable operator serializes the in-memory objects and copies them to a temporary file set that consists of a single DSC file. The rest of the query then runs as a LINQ to HPC query that reads from this temporary file set.
Note: This example uses the OutputDataCompressionScheme property of the HpcLinqConfiguration class to disable compression. Image data is already compressed, and there is no performance benefit to be gained by compressing it a second time.
The HashPartition operator distributes the list of path names, by hash code, into a file set that has as many DSC files as there are DSC nodes on the cluster.
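The idea behind hash partitioning can be sketched locally. The partition-index rule below is a hypothetical illustration (the hash function LINQ to HPC uses internally is an implementation detail), but it shows how a key selector and a partition count map records to partitions:

```csharp
using System;
using System.Linq;

static class HashPartitionSketch
{
    // Hypothetical partitioning rule, for illustration only; the hash
    // function LINQ to HPC applies internally may differ.
    static int PartitionIndex(string key, int partitionCount) =>
        (key.GetHashCode() & int.MaxValue) % partitionCount;

    static void Main()
    {
        string[] paths = { @"\\share\a.jpg", @"\\share\b.jpg", @"\\share\c.jpg" };
        int stageSize = 4;   // plays the role of context.GetNodeCount()

        // Records with the same key always land in the same partition,
        // which is the property HashPartition(r => r, stageSize) relies on.
        foreach (var g in paths.GroupBy(p => PartitionIndex(p, stageSize)))
            Console.WriteLine("Partition {0}: {1}", g.Key, string.Join(", ", g));
    }
}
```

The essential guarantee is only that equal keys map to the same partition; the exact distribution across partitions depends on the hash function.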
Then, a FileRecord object is created for each path. The constructor of the FileRecord object reads the binary image of the JPEG file and stores it as a byte array along with the path to the original file. Here is the definition of the FileRecord class.
[Serializable]
public class FileRecord
{
    public string FilePath { get; private set; }
    public byte[] FileData { get; private set; }

    public FileRecord(string filePath)
    {
        FilePath = filePath;
        FileData = File.ReadAllBytes(filePath);
    }
}
After the data is loaded to the cluster, you can perform operations on the application-specific files. For example, to list duplicated JPEG files, you can use the following LINQ to HPC query.
var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);
string inputFileSetName = ...
var duplicatedFiles =
    context.FromDsc<FileRecord>(inputFileSetName)
           .Select(r => new {
               Path = r.FilePath,
               Checksum = GetChecksum(r.FileData)
           })
           .GroupBy(record => record.Checksum)
           .Where(group => group.Count() > 1)
           .SelectMany(group => group.Select(record => record.Path));

Console.WriteLine("\nThe following files are duplicates:");
foreach (var filepath in duplicatedFiles)
    Console.WriteLine("  {0}", filepath);
The records processed in the Select operation are instances of the FileRecord class. They contain the original path names, as well as binary copies of the application-specific files stored on the local compute nodes. In this example, the code calculates a checksum for each file, and then checks for duplicate checksums.
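The GetChecksum helper is not shown in the query above. One minimal sketch, assuming an MD5 hash over the raw file bytes is an acceptable checksum (any strong hash would serve for duplicate detection), is shown below together with a local, cluster-free run of the same GroupBy/Where/SelectMany shape:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

static class ChecksumSketch
{
    // Hypothetical implementation of GetChecksum, which the sample does
    // not define: an MD5 hash of the raw file bytes, rendered as a string.
    static string GetChecksum(byte[] fileData)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(fileData));
    }

    static void Main()
    {
        // Two identical payloads and one distinct payload stand in for
        // image files read by File.ReadAllBytes.
        var files = new[]
        {
            new { Path = "a.jpg", Data = new byte[] { 1, 2, 3 } },
            new { Path = "b.jpg", Data = new byte[] { 1, 2, 3 } },
            new { Path = "c.jpg", Data = new byte[] { 9, 9, 9 } },
        };

        // Same query shape as the cluster version, run with LINQ to Objects.
        var duplicates = files
            .Select(f => new { f.Path, Checksum = GetChecksum(f.Data) })
            .GroupBy(r => r.Checksum)
            .Where(g => g.Count() > 1)
            .SelectMany(g => g.Select(r => r.Path));

        Console.WriteLine(string.Join(", ", duplicates)); // a.jpg, b.jpg
    }
}
```

Because the checksum is computed from the file contents alone, files with different paths but identical bytes fall into the same group, which is exactly what the duplicate search needs.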
Note