January 2011

Volume 26 Number 01

Parallel Computing - Data Processing: Parallelism and Performance

By Johnson M. Hart | January 2011

Processing data collections is a fundamental computing task, and a number of practical problems are inherently parallel, potentially enabling improved performance and throughput on multicore systems. I’ll compare several distinct Windows-based methods to solve problems with a high degree of data parallelism.

The benchmark I’ll use for this comparison is a search problem (“Geonames”) from Chapter 9 of Troy Magennis’ book, “LINQ to Objects Using C# 4.0” (Addison-Wesley, 2010). The alternative solutions are:

  • Parallel Language Integrated Query (PLINQ) and C# 4.0, with and without enhancements to the original code.
  • Windows native code using C, the Windows API, threads and memory-mapped files.
  • Windows C#/Microsoft .NET Framework multithreaded code.

The source code for all solutions is available on my Web site (jmhartsoftware.com). Other parallelism techniques, such as the .NET Framework 4 Task Parallel Library (TPL), are not examined directly, although PLINQ is layered on the TPL.

Comparing and Evaluating Alternative Solutions

The solution evaluation criteria, in order of importance, are:

  • Total performance in terms of elapsed time to complete the task.
  • Scalability with the degree of parallelism (number of tasks), core count and data collection size.
  • Code simplicity, elegance, ease of maintenance and similar intangible factors.

Results Summary

This article will show that, for a representative benchmark search problem:

  • You can successfully exploit multicore 64-bit systems to improve performance in many data-processing problems, and PLINQ can be part of the solution.
  • Competitive, scalable PLINQ performance requires indexed data collection objects; supporting the IEnumerable interface isn’t sufficient.
  • The C#/.NET and native code solutions are fastest.
  • The original PLINQ solution is slower by a factor of nearly 10, and it doesn’t scale beyond two tasks, whereas the other solutions scale well up to six tasks with six cores (the maximum tested). However, code enhancements improve the original solution significantly.
  • The PLINQ code is the simplest and most elegant in all respects because LINQ provides a declarative query capability for memory-resident and external data. The native code is ungainly, and the C#/.NET code is considerably better than—but not as simple as—the PLINQ code.
  • All methods scale well with file size up to the limits of the test system physical memory.

The Benchmark Problem: Geonames

The idea for this article came from Chapter 9 of Magennis’ LINQ book, which demonstrates PLINQ by searching a geographic database containing more than 7.25 million place names in an 825MB file (that’s more than one location for every 1,000 people). Each place name is represented by a UTF-8 (en.wikipedia.org/wiki/UTF-8) text line record (variable length) with more than 15 tab-separated data columns. Note: The UTF-8 encoding assures that a tab (0x9) or line feed (0xA) value won’t occur as part of a multi-byte sequence; this is essential for several implementations.
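
To make the point concrete, here is a minimal sketch (illustration only; it isn’t part of any of the benchmark solutions) that counts the file’s records by scanning raw bytes for 0x0A:

using System;
using System.IO;

static class Utf8LineCounter
{
  // Count records by scanning raw UTF-8 bytes for the line feed value.
  // No decoding is needed: 0x0A can only be a line feed, never part of
  // a multi-byte character sequence.
  public static long CountLines(String path)
  {
    long lineCount = 0;
    byte[] buffer = new byte[64 * 1024];
    using (FileStream fs = File.OpenRead(path))
    {
      int bytesRead;
      while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        for (int i = 0; i < bytesRead; i++)
          if (buffer[i] == 0x0A) lineCount++;
    }
    return lineCount;
  }
}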

Magennis’ Geonames program implements a hard-coded query to identify all locations with an elevation (column 15) greater than 8,000 meters, displaying the location name, country and elevation, sorted by decreasing elevation. In case you’re wondering, there are 16 such locations, and Mt. Everest is the highest at 8,848 meters.

Magennis reports elapsed times of 22.3 seconds (one core) and 14.1 seconds (two cores). Previous experience (for example, see my article “Windows Parallelism, Fast File Searching and Speculative Processing” at informit.com/articles/article.aspx?p=1606242) shows that files of this size can be processed in a few seconds and performance scales well with the core count. Therefore, I decided to attempt to replicate that experience and also try to enhance Magennis’ PLINQ code for better performance. The initial PLINQ enhancements almost doubled the performance but didn’t improve scalability; further enhancements, however, do yield performance nearly as good as native and C# multithreaded code.

This benchmark is interesting for several reasons:

  • The subject (geographical places and attributes) is inherently interesting, and it’s easy to generalize the query.
  • There’s a high degree of data parallelism; in principle, every record could be processed concurrently.
  • The file size is modest by today’s standards, but it’s easy to test with larger files simply by concatenating the Geonames allCountries.txt file to itself multiple times.
  • The processing is not stateless; it’s necessary to determine the line and field boundaries in order to partition the file, and the lines need to be processed to identify the individual fields.

An Assumption: The number of identified records (in this instance, locations higher than 8,000 meters) is assumed to be small, so that sorting and display time are minimal relative to the overall processing time, which involves examining every byte.

Another Assumption: The performance results represent time required to process memory-resident data collections, such as data produced by a prior program step. The benchmark program does read a file, but the test programs are run several times to assure that the file is memory-resident. However, I’ll mention the time required to load the file initially, and this time is approximately the same for all solutions.

Performance Comparison

The first test system is a six-core desktop system running Windows 7 (AMD Phenom II, 2.80 GHz, 4GB RAM). Later, I’ll present results for three other systems with hyper-threading, or HT (en.wikipedia.org/wiki/Hyper-threading), and different core counts.

Figure 1 shows the results for six different Geonames solutions, with elapsed time (seconds) as a function of “Degree of Parallelism,” or DoP (the number of parallel tasks, which can be set higher than the number of processors); the test system has six cores, but the implementations control the DoP. Six tasks are optimal; using more than six tasks degraded performance. All tests use the original Geonames 825MB allCountries.txt data file.

image: Geonames Performance as a Function of Degree of Parallelism

Figure 1 Geonames Performance as a Function of Degree of Parallelism

The implementations are (fuller explanations will follow):

  1. Geonames Original. This is Magennis’ original PLINQ solution. The performance isn’t competitive and doesn’t scale with processor count.
  2. Geonames Helper. This is a performance-enhanced version of Geonames Original.
  3. Geonames MMChar. This was an unsuccessful attempt to enhance Geonames Helper with a memory-mapped file class similar to that used in Geonames Threads. Note: Memory mapping allows a file to be referenced as if it were in memory without explicit I/O operations, and it can provide performance benefits.
  4. Geonames MMByte. This solution modifies MMChar to process the individual bytes of the input file, whereas the previous three solutions convert UTF-8 characters to Unicode (at 2 bytes each). The performance is the best of the first four solutions and is more than double that of Geonames Original.
  5. Geonames Threads doesn’t use PLINQ. This is the C#/.NET implementation using threads and a memory-mapped file. The performance is faster than Index (the next item) and about the same as Native. This solution and Geonames Native provide the best parallelism scalability.
  6. Geonames Index. This PLINQ solution preprocesses the data file (requiring about nine seconds) to create a memory-resident List<byte[]> object for subsequent PLINQ processing. The preprocessing cost can be amortized over multiple queries, giving performance that’s only slightly slower than Geonames Native and Geonames Threads.
  7. Geonames Native (not shown in Figure 1) doesn’t use PLINQ. This is the C Windows API implementation using threads and a memory-mapped file as in Chapter 10 of my book, “Windows System Programming” (Addison-Wesley, 2010). Full compiler optimization is essential for these results; the default optimization provides only about half the performance.

All implementations are 64-bit builds. 32-bit builds work in most cases, but they fail for the larger files used in the next test. Figure 2 shows performance using DoP 4 and larger files.

image: Geonames Performance as a Function of File Size

Figure 2 Geonames Performance as a Function of File Size

The test system in this case has four cores (AMD Phenom Quad-Core, 2.40 GHz, 8GB RAM). The larger files were created by concatenating multiple copies of the original file. Figure 2 shows just the three fastest solutions, including Geonames Index—the fastest PLINQ solution (not counting file preprocessing)—and performance scales with the file size up to the limits of physical memory.

I’ll now describe implementations two through seven and discuss the PLINQ techniques in more detail. Following that, I’ll discuss results on other test systems and summarize the findings.

The Enhanced PLINQ Solutions: Geonames Helper

Figure 3 shows Geonames Helper with my changes to the Geonames Original code; the changes are described in the notes following the listing.

Figure 3 Geonames Helper with Highlighted Changes to the Original PLINQ Code

class Program
{
  static void Main(string[] args)
  {
    const int nameColumn = 1;
    const int countryColumn = 8;
    const int elevationColumn = 15;
    String inFile = "Data/AllCountries.txt";
    if (args.Length >= 1) inFile = args[0];
        
    int degreeOfParallelism = 1;
    if (args.Length >= 2) degreeOfParallelism = int.Parse(args[1]);
    Console.WriteLine("Geographical data file: {0}. 
      Degree of Parallelism: {1}.", inFile, degreeOfParallelism);
    var lines = File.ReadLines(Path.Combine(
      Environment.CurrentDirectory, inFile));
    var q = from line in 
      lines.AsParallel().WithDegreeOfParallelism(degreeOfParallelism)
        let elevation = 
          Helper.ExtractIntegerField(line, elevationColumn)
        where elevation > 8000 // elevation in meters
        orderby elevation descending
        select new
        {
          elevation = elevation,
          thisLine = line
         };
    foreach (var x in q)
    {
      if (x != null)
      {
        String[] fields = x.thisLine.Split(new char[] { '\t' });
        Console.WriteLine("{0} ({1}m) - located in {2}",
          fields[nameColumn], fields[elevationColumn], 
          fields[countryColumn]);
      }
    }
  }
}

As many readers may not be familiar with PLINQ and C# 4.0, I’ll provide a few comments about Figure 3, including descriptions of the enhancements:

  • Lines 9-14 allow the user to specify the input file name and the degree of parallelism (the maximum number of concurrent tasks) on the command line; these values are hardcoded in the original.
  • Lines 16-17 set up a deferred (lazy) read of the file lines; lines is implicitly typed as an IEnumerable<String>, and the values aren’t actually read until the query executes in Lines 19-27. Other solutions, such as Geonames MMByte, use a different class with its own ReadLines method, and these code lines are the only lines that need to be changed.
  • Lines 19-27 are LINQ code along with the PLINQ AsParallel extension. The code is similar to SQL, and the variable "q" is implicitly typed as a parallel query of anonymous objects, each consisting of an integer elevation and a String. Notice that PLINQ performs all the thread-management work; the AsParallel method is all that’s needed to turn serial LINQ code into PLINQ code.
  • Line 20. Figure 4 shows the Helper.ExtractIntegerField method. The original program uses the String.Split method in a manner similar to that used to display the results in Line 33 (Figure 3). This is the key to the improved performance of Geonames Helper compared to Geonames Original, as it’s no longer necessary to allocate String objects for every field in every line.

Figure 4 Geonames Helper Class and ExtractIntegerField Method

class Helper
{
  public static int ExtractIntegerField(String line, int fieldNumber)
  {
    int value = 0, iField = 0;
    byte digit;
    // Skip to the specified field number and extract the decimal value.
    foreach (char ch in line)
    {
      if (ch == '\t') { iField++; if (iField > fieldNumber) break; }
      else
      {
        if (iField == fieldNumber)
        {
          digit = (byte)(ch - 0x30);  // 0x30 is the character '0'
          if (digit >= 0 && digit <= 9) 
            { value = 10 * value + digit; }
          else // Character not in [0-9]. Reset the value and quit.
          { value = 0; break; }
        }
      }
    }
    return value;
  }
}

Note that the AsParallel method used in Line 19 can be used with any IEnumerable object. As I mentioned earlier, Figure 4 shows the Helper class ExtractIntegerField method. It simply extracts and evaluates the specified field (elevation in this case), avoiding library methods for performance. Figure 1 shows that this enhancement doubles the performance with DoP 1.

Geonames MMChar and Geonames MMByte

Geonames MMChar represents an unsuccessful attempt to improve performance by memory mapping the input file, using a custom class, FileMmChar. Geonames MMByte, however, does yield significant benefits, as the input file bytes aren’t expanded to Unicode.

MMChar requires a new class, FileMmChar, which supports the IEnumerable<String> interface. The FileMmByte class is similar and deals with byte[] objects rather than String objects. The only significant code change is to Figure 3, Lines 16-17, which are now:

var lines = FileMmByte.ReadLines(Path.Combine(
    Environment.CurrentDirectory, inFile));

The code for

public static IEnumerable<byte[]> ReadLines(String path)

that supports the IEnumerable<byte[]> interface in FileMmByte constructs a FileMmByte object and an IEnumerator<byte[]> object that scans the mapped file for individual lines.
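
The actual iterator scans the memory-mapped view, but a simplified sketch conveys its shape. The ByteLineReader class below is hypothetical and, for brevity, loads the whole file into a byte array with File.ReadAllBytes instead of mapping it:

using System;
using System.Collections.Generic;
using System.IO;

static class ByteLineReader
{
  // Simplified, hypothetical stand-in for FileMmByte.ReadLines: yield each
  // line as a byte[] by scanning for the line feed byte (0x0A). The real
  // class scans a memory-mapped view rather than an in-memory array.
  public static IEnumerable<byte[]> ReadLines(String path)
  {
    byte[] data = File.ReadAllBytes(path);
    int lineStart = 0;
    for (int i = 0; i < data.Length; i++)
    {
      if (data[i] != 0x0A) continue;
      int length = i - lineStart;
      if (length > 0 && data[lineStart + length - 1] == 0x0D)
        length--;                    // drop a trailing carriage return
      byte[] line = new byte[length];
      Array.Copy(data, lineStart, line, 0, length);
      yield return line;
      lineStart = i + 1;
    }
    if (lineStart < data.Length)     // final line without a terminator
    {
      byte[] last = new byte[data.Length - lineStart];
      Array.Copy(data, lineStart, last, 0, last.Length);
      yield return last;
    }
  }
}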

Note that the FileMmChar and FileMmByte classes are “unsafe” because they create and use pointers to access the files, and they use C#/native code interoperability. All the pointer usage is, however, isolated in a separate assembly, and the code uses arrays rather than pointer dereferencing. The .NET Framework 4 MemoryMappedFile class doesn’t help, because it’s necessary to use accessor functions to move data from mapped memory.
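
For comparison, here is a brief sketch of the .NET Framework 4 MemoryMappedFile accessor pattern referred to above (the ReadChunk helper is hypothetical); the ReadArray call copies data out of the mapped view into an ordinary byte[], which is exactly the extra step the custom classes avoid:

using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCopyExample
{
  // Hypothetical helper: read 'count' bytes starting at 'offset' through a
  // MemoryMappedViewAccessor. The data must be copied out of the mapped
  // view with ReadArray before it can be used as a byte[].
  public static byte[] ReadChunk(string path, long offset, int count)
  {
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var accessor = mmf.CreateViewAccessor(offset, count,
                                                 MemoryMappedFileAccess.Read))
    {
      byte[] buffer = new byte[count];
      accessor.ReadArray(0, buffer, 0, count);   // copy from mapped memory
      return buffer;
    }
  }
}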

Geonames Native

Geonames Native exploits the Windows API, threads and file memory mapping. The basic code patterns are described in Chapter 10 of “Windows System Programming.” The program must manage the threads directly and must also carefully map the file to memory. The performance is much better than all PLINQ implementations except Geonames Index.

There is, however, an important distinction between the Geonames problem and a simple, stateless file search or transformation. The challenge is to determine the correct method to partition the input data so as to assign different partitions to different tasks. There’s no obvious way to determine the line boundaries without scanning the entire file, so it isn’t feasible to simply assign each task a fixed set of lines in advance. However, the solution is straightforward when illustrated with DoP 4:

  • Divide the input file into four equal partitions, with the partition start location communicated to each thread as part of the thread function argument.
  • Have each thread then process all lines (records) that start in the partition. This means that a thread will probably scan into the next partition in order to complete processing of the last line that starts in the partition. (A minimal sketch of this rule follows the list.)
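
Here is a minimal C# sketch of that rule, assuming for illustration that the file content is available as a byte[] named data (the native code applies the same logic to the mapped view). Advancing each boundary to the next line start is equivalent to letting a thread finish the last line that starts in its partition:

static class Partitioning
{
  // Compute each task's [starts[i], starts[i+1]) range: advance every
  // equal-sized boundary (except the first) to the next line start, so
  // each line is owned by exactly one task.
  public static long[] PartitionStarts(byte[] data, int degreeOfParallelism)
  {
    long[] starts = new long[degreeOfParallelism + 1];
    long partitionSize = data.Length / degreeOfParallelism;
    starts[0] = 0;
    for (int i = 1; i < degreeOfParallelism; i++)
    {
      long pos = i * partitionSize;
      while (pos > 0 && pos < data.Length && data[pos - 1] != 0x0A) pos++;
      starts[i] = pos;               // first line starting in partition i
    }
    starts[degreeOfParallelism] = data.Length;   // sentinel: end of data
    return starts;
  }
}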

Geonames Threads

Geonames Threads uses the same logic as Geonames Native; in fact, some of the code is the same or nearly the same. However, lambda expressions, extension methods, containers and other C#/.NET features greatly simplify the coding.

As with MMByte and MMChar, the file memory mapping requires “unsafe” classes and C#/native code interoperability in order to use pointers to the mapped memory. The effort is worthwhile, however, because Geonames Threads performance is the same as Geonames Native performance with much simpler code.
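
The following skeletal sketch shows that pattern. The names are hypothetical, it reuses the PartitionStarts sketch from the previous section, and it works over an ordinary byte[] instead of the mapped view; the per-partition search itself is passed in as a lambda:

using System;
using System.Collections.Generic;
using System.Threading;

static class ThreadedSearch
{
  // One thread per partition: each lambda captures its own [start, end)
  // range and fills a per-thread result list, which is merged after Join.
  public static List<string> Run(byte[] data, int degreeOfParallelism,
    Func<byte[], long, long, List<string>> searchRange)
  {
    long[] starts = Partitioning.PartitionStarts(data, degreeOfParallelism);
    var results = new List<string>[degreeOfParallelism];
    var threads = new Thread[degreeOfParallelism];

    for (int i = 0; i < degreeOfParallelism; i++)
    {
      int t = i;   // capture a copy, not the loop variable
      threads[t] = new Thread(() =>
        { results[t] = searchRange(data, starts[t], starts[t + 1]); });
      threads[t].Start();
    }

    var merged = new List<string>();
    for (int i = 0; i < degreeOfParallelism; i++)
    {
      threads[i].Join();
      merged.AddRange(results[i]);
    }
    return merged;
  }
}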

Geonames Index

The PLINQ results (Original, Helper, MMChar and MMByte) are disappointing compared to the Native and .NET Threads results. Is there a way to exploit the simplicity and elegance of PLINQ without sacrificing performance?

Although it’s impossible to determine exactly how PLINQ processes the query (Lines 16-27 in Figure 3), it’s probable that PLINQ has no good way to partition the input lines for parallel processing by separate tasks. As a working hypothesis, assume that partitioning may be the cause of the PLINQ performance problem.

From Magennis’ book (pp. 276-279), the lines collection supports the IEnumerable<String> interface (see also John Sharp’s book, “Microsoft Visual C# 2010 Step by Step” [Microsoft Press, 2010], Chapter 19). However, lines isn’t indexed, so PLINQ probably uses “chunk partitioning.” Furthermore, the IEnumerator.MoveNext methods for the FileMmChar and FileMmByte classes are slow because they need to scan every character until the next new line is located.
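
One way to make the distinction concrete is the Partitioner class in System.Collections.Concurrent, which PLINQ can consume directly. An indexed IList<T> can be split into static ranges up front, whereas a plain IEnumerable<T> must be handed out a chunk at a time through MoveNext. A small sketch (the method names are hypothetical):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

static class PartitioningStrategies
{
  // An indexed list permits static range partitioning.
  public static ParallelQuery<byte[]> RangePartitioned(
    IList<byte[]> lines, int dop)
  {
    return Partitioner.Create(lines, loadBalance: false)
                      .AsParallel()
                      .WithDegreeOfParallelism(dop);
  }

  // A plain enumerable forces chunk partitioning through MoveNext.
  public static ParallelQuery<byte[]> ChunkPartitioned(
    IEnumerable<byte[]> lines, int dop)
  {
    return Partitioner.Create(lines)
                      .AsParallel()
                      .WithDegreeOfParallelism(dop);
  }
}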

What would happen if the lines collection were indexed? Could we improve PLINQ performance, especially when supplemented by memory mapping the input file? Geonames Index shows that this technique does improve performance, yielding results comparable to native code. In general, however, either there’s a necessary up-front cost to move the lines into an in-memory list or array, which is indexed (the cost can then be amortized over multiple queries), or the file or other data source is already indexed, perhaps during generation in an earlier program step, eliminating the preprocessing cost.

The up-front indexing operation is simple: access each line, one at a time, and add it to a list. The list object then replaces lines (Figure 3, Lines 16-17), as shown in this code snippet, which includes the preprocessing:

// Preprocess the file to create an indexed list of byte[] lines
List<byte[]> lineListByte = new List<byte[]>();
var lines = 
    FileMmByte.ReadLines(Path.Combine(Environment.CurrentDirectory, inFile));
foreach (byte[] line in lines) { lineListByte.Add(line); }
// ... Multiple queries can now use the indexed lineListByte object ...
var q = from line in lineListByte.AsParallel().
    WithDegreeOfParallelism(degreeOfParallelism)

Note that it’s slightly more efficient to process the data further by converting the list to an array, although this increases the preprocessing time.
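
For example (a minimal sketch reusing lineListByte and degreeOfParallelism from the preceding snippet; only the data source changes, the query itself doesn’t):

// One-time conversion of the indexed list to an array; the query is then
// written against lineArrayByte instead of lineListByte.
byte[][] lineArrayByte = lineListByte.ToArray();
var source = lineArrayByte.AsParallel()
                          .WithDegreeOfParallelism(degreeOfParallelism);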

A Final Performance Enhancement

Geonames Index performance can be improved still further by indexing the fields in each line so that the ExtractIntegerField method doesn’t need to scan all the characters in a line out to the specified field.

The implementation, Geonames IndexFields, modifies the ReadLines method so that a returned line is an object containing both a byte[] array and a uint[] array containing the locations of each field. This results in about a 33 percent performance improvement over Geonames Index and brings the performance fairly close to the native and C#/.NET solutions. (Geonames IndexFields is included with the code download.) Furthermore, it’s now much easier to construct more general queries as the individual fields are readily available.
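
A sketch of the kind of per-line object this implies (the names here are hypothetical; the actual implementation is in the code download): the raw line plus the starting offset of each tab-separated field, so a numeric field can be read without rescanning the line from the beginning:

class IndexedLine
{
  public byte[] Line;        // raw UTF-8 bytes of one record
  public uint[] FieldStart;  // FieldStart[i] = offset of field i within Line

  // Extract a non-negative decimal field directly from its known offset.
  public int IntegerField(int fieldNumber)
  {
    int value = 0;
    for (uint i = FieldStart[fieldNumber];
         i < Line.Length && Line[i] != (byte)'\t'; i++)
    {
      int digit = Line[i] - (byte)'0';
      if (digit < 0 || digit > 9) return 0;   // not a simple decimal field
      value = 10 * value + digit;
    }
    return value;
  }
}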

Limitations

The efficient solutions all require memory-resident data, and the performance advantages don’t extend to very large data collections. “Very large” in this case refers to data sizes that approach the system physical memory size. In the Geonames example, the 3,302MB file (four copies of the original file) could be processed on the 8GB test system. A test with eight concatenated copies of the file, however, was very slow with all the solutions.

As mentioned earlier, performance is also best when the data files are “live,” in the sense that they’ve been accessed recently and are likely to be in memory. Paging in the data file during the initial run can take 10 or more seconds and is comparable to the indexing operation in the earlier code snippet.

In summary, the results in this article apply to memory-resident data structures, and today’s memory sizes and prices allow significant data objects, such as a file of 7.25 million place names, to be memory-resident.

Additional Test System Results

Figure 5 shows test results on an additional system (Intel i7 860, 2.80GHz, four cores, eight threads, Windows 7, 4GB RAM). The processor supports hyper-threading, so the tested DoP values are 1, 2, ..., 8. Figure 1 is based on a test system with six AMD cores; this system doesn’t support hyper-threading.

image: Intel i7 860, 2.80GHz, Four Cores, Eight Threads, Windows 7, 4GB RAM

Figure 5 Intel i7 860, 2.80GHz, Four Cores, Eight Threads, Windows 7, 4GB RAM

Two additional test configurations produced similar results (and the full data is available on my Web site):

  • Intel i7 720, 1.60GHz, four cores, eight threads, Windows 7, 8GB RAM
  • Intel i3 530, 2.93GHz, two cores, four threads, Windows XP64, 4GB RAM

Interesting performance characteristics include:

  • Geonames Threads consistently provides the best performance, along with Geonames Native.
  • Geonames Index is the fastest PLINQ solution, approaching the Geonames Threads performance. Note: Geonames IndexFields is slightly faster but isn’t shown in Figure 5.
  • Other than Geonames Index, all PLINQ solutions scale negatively with the DoP for DoP greater than two; that is, performance decreases as the number of parallel tasks increases. Based on this example, PLINQ produces good performance only when used with indexed objects.
  • The hyper-threading performance contribution is marginal. Therefore, Geonames Threads and Geonames Index performance doesn’t increase significantly for DoP greater than four. This poor HT scalability might be a consequence of scheduling two threads on logical processors on the same core, rather than assuring that they run on distinct cores when possible. However, this explanation doesn’t appear to be plausible, as Mark E. Russinovich, David A. Solomon and Alex Ionescu say that physical processors are scheduled before logical processors on p. 40 of their book, “Windows Internals, Fifth Edition” (Microsoft Press, 2009). The AMD systems without HT (Figure 1) provided about three to four times the performance with DoP greater than four, compared to the sequential DoP 1 results for Threads, Native and Index. Figure 1 shows that the best performance occurs when the DoP is the same as the core count, where the multithreaded performance is 4.2 times the DoP 1 performance.

Results Summary

PLINQ provides an excellent model for processing in-memory data structures, and it’s possible to improve existing code performance with some simple changes (for example, Helper) or with more advanced techniques, as was shown with MMByte. However, none of the simple enhancements provide performance close to that of native or multithreaded C#/.NET code. Furthermore, these enhancements don’t scale with the core count and DoP.

PLINQ can come close to native and C#/.NET code performance, but it requires the use of indexed data objects.

Using the Code and Data

All code is available on my Web site (jmhartsoftware.com/SequentialFileProcessingSupport.html), following these instructions:

  • Go to the page to download a ZIP file containing PLINQ code and Geonames Threads code. All PLINQ variations are in the GeonamesPLINQ project (Visual Studio 2010; Visual Studio 2010 Express is sufficient). Geonames Threads is in the GeonamesThreads Visual Studio 2010 project. The projects are both configured for 64-bit release builds. The ZIP file also contains a spreadsheet with the raw performance data used in Figures 1, 2 and 5. A simple “Usage” comment at the head of the file explains the command-line option to select the input file, DoP and implementation.
  • Go to the Windows System Programming support page (jmhartsoftware.com/comments_updates.html) to download the Geonames Native code and projects (a ZIP file), where you’ll find the Geonames project. A ReadMe.txt file explains the structure.
  • Download the GeoNames database from download.geonames.org/export/dump/allCountries.zip.

Unexplored Questions

This article compared the performance of several alternative techniques to solve the same problem. The approach has been to use standard interfaces as they’re described and to assume a simple shared-memory model for the processors and threads. There was, however, no significant effort to probe deeply into the underlying implementations or specific features of the test machines, and numerous questions could be investigated in the future. Here are some examples:

  • What’s the effect of cache misses, and is there any method to reduce the impact?
  • What would be the impact of solid-state disks?
  • Is there any way to reduce the performance gap between the PLINQ Index solution and the Threads and Native solutions? Experiments reducing the amount of data copying in the FileMmByte IEnumerator.MoveNext and Current methods didn’t show any significant benefit.
  • Is the performance close to the theoretical maximum as determined by memory bandwidth, CPU speed and other architectural features?
  • Is there a way to get scalable performance on HT systems (see Figure 5) that’s comparable to systems without HT (Figure 1)?
  • Can you use profiling and Visual Studio 2010 tools to identify and remove performance bottlenecks?

I hope you’re able to investigate further.


Johnson (John) M. Hart is a consultant specializing in Microsoft Windows and Microsoft .NET Framework application architecture and development, technical training and writing. He has many years of experience as a software engineer, engineering manager and architect at Cilk Arts Inc. (since acquired by Intel Corp.), Sierra Atlantic Inc., Hewlett-Packard Co. and Apollo Computer. He served as a computer science professor for many years and has authored four editions of “Windows System Programming” (Addison-Wesley, 2010).

Thanks to the following technical experts for reviewing this article: Michael Bruestle, Andrew Greenwald, Troy Magennis and CK Park