Choosing the Right Number of DSC Files
How many DSC files to use in a file set is a design decision that requires you to balance several competing factors.
Smaller (and therefore more numerous) files can make better use of the available DSC nodes than large files. For example, if you have a cluster with k DSC nodes, a file set needs at least k DSC files before a Select operation on that file set can use all of the DSC nodes, because Select runs one vertex per DSC file. The amount of work performed by each vertex may also differ. If some vertices complete more quickly than others, some compute nodes can stand idle while other compute nodes are still running. Dividing the file set into more DSC files than there are compute nodes compensates for uneven execution times: if one DSC file takes a particularly long time to process, the other nodes can run additional vertices in the meantime.
To allow for statistical load balancing of a cluster of k nodes, a good rule of thumb is to use approximately 5 * k DSC files per file set. Experiment to see what works best for a particular query, using from 1 to 10 DSC files for each node on the cluster. Try to partition a file set into DSC files that are a few gigabytes or smaller.
In general, choose a number of DSC files that is an integer multiple of the number of DSC nodes, and never fewer files than nodes.
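The guidelines above can be combined into a simple calculation. The following sketch is illustrative only; the function name, the default of 5 files per node, and the 2 GB size cap are assumptions chosen to match the rules of thumb in this section, not part of any DSC API.

```python
import math

def suggest_dsc_file_count(total_size_gb, node_count,
                           files_per_node=5, max_file_size_gb=2.0):
    """Suggest how many DSC files to split a file set into.

    Applies the guidelines above: aim for roughly 5 files per node,
    keep each file a few gigabytes or smaller, and round the result
    up to an integer multiple of the node count (so it is never
    smaller than the node count).
    """
    # Start from the rule-of-thumb target of ~5 files per node.
    count = files_per_node * node_count
    # Add files until each one fits under the size cap.
    min_for_size = math.ceil(total_size_gb / max_file_size_gb)
    count = max(count, min_for_size)
    # Round up to the nearest multiple of the node count.
    return math.ceil(count / node_count) * node_count

# A 100 GB file set on an 8-node cluster: 5 * 8 = 40 files would be
# 2.5 GB each, over the 2 GB cap, so the count grows to 50 and then
# rounds up to 56, a multiple of 8 (about 1.8 GB per file).
print(suggest_dsc_file_count(100, 8))
```

Treat the result as a starting point for the experimentation described above, not as a fixed answer.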
Another advantage of smaller DSC files is that a vertex is less likely to run out of memory as it processes the data. If your application needs more memory, consider dividing the file set into smaller DSC files. For example, a LINQ to HPC query that has a Join operation followed immediately by an Aggregate operation can require large amounts of memory. In such situations, you can partition the data into smaller files to reduce the memory required by a single vertex.
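As a rough back-of-the-envelope model for the memory scenario, you can treat a vertex's peak memory use as proportional to the size of the DSC file it processes and solve for the file count. This sketch and its assumptions (the proportionality model and the function name) are illustrative, not part of DSC.

```python
import math

def files_for_memory_budget(total_size_gb, node_count, vertex_memory_gb):
    """Estimate the minimum number of DSC files so that no single
    vertex processes more data than its memory budget allows.

    Assumes, as a rough model, that a vertex's peak memory use is
    proportional to the size of its DSC file. The result is rounded
    up to an integer multiple of the node count, per the earlier
    guideline.
    """
    # Minimum files needed so each file fits the per-vertex budget.
    count = math.ceil(total_size_gb / vertex_memory_gb)
    # Round up to the nearest multiple of the node count.
    return math.ceil(count / node_count) * node_count

# A 100 GB file set, 8 nodes, and a 4 GB per-vertex budget:
# ceil(100 / 4) = 25 files, rounded up to 32 (a multiple of 8).
print(files_for_memory_budget(100, 8, 4))
```

A memory-hungry query such as the Join-then-Aggregate example may need a budget well below the node's physical memory, since other vertices and the operating system share it.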
On the other hand, large DSC files reduce the amount of work that the DSC must perform to manage a file set. Tests of the system show that some LINQ to HPC operations achieve better throughput when they are applied to large files; in many test cases, allocating 1 GB for each DSC file gave the best performance. However, for any particular application, you must consider the attributes of the cluster (for example, the number of DSC nodes, the amount of memory, and the CPU ratio), as well as the nature of the query that you intend to run.