January 2013

Volume 28 Number 01

C# - The C# Memory Model in Theory and Practice, Part 2

By Igor Ostrovsky | January 2013

This is the second article in a two-part series that discusses the C# memory model. As explained in the first part in the December issue of MSDN Magazine (msdn.microsoft.com/magazine/jj863136), the compiler and the hardware may subtly transform the memory operations of a program in ways that don’t affect single-threaded behavior, but can impact multi-threaded behavior. As an example, consider this method:

void Init() {
  _data = 42;
  _initialized = true;
}

If _data and _initialized are ordinary (that is, non-volatile) fields, the compiler and the processor are allowed to reorder the operations so that Init executes as if it were written like this:

void Init() {
  _initialized = true;
  _data = 42;
}

In the previous article, I described the abstract C# memory model. In this article, I’ll explain how the C# memory model is actually implemented on different architectures supported by the Microsoft .NET Framework 4.5.

Compiler Optimizations

As mentioned in the first article, the compiler might optimize the code in a way that reorders memory operations. In the .NET Framework 4.5, the csc.exe compiler that compiles C# to IL doesn’t do many optimizations, so it won’t reorder memory operations. However, the just-in-time (JIT) compiler that converts IL to machine code will, in fact, perform some optimizations that reorder memory operations, as I’ll discuss.

Loop Read Hoisting Consider the polling loop pattern:

class Test
{
  private bool _flag = true;
  public void Run()
  {
    // Set _flag to false on another thread
    new Thread(() => { _flag = false; }).Start();
    // Poll the _flag field until it is set to false
    while (_flag) ;
    // The loop might never terminate!
  }
}

In this case, the .NET 4.5 JIT compiler might rewrite the loop like this:

if (_flag) { while (true); }

In the single-threaded case, this transformation is entirely legal and, in general, hoisting a read out of a loop is an excellent optimization. However, if the _flag is set to false on another thread, the optimization can cause a hang.

Note that if the _flag field were volatile, the JIT compiler would not hoist the read out of the loop. (See the “Polling Loop” section in the December article for a more detailed explanation of this pattern.)

Read Elimination Another compiler optimization that can cause errors in multi-threaded code is illustrated in this sample:

class Test
{
  private int _A, _B;
  public void Foo()
  {
    int a = _A;
    int b = _B;
    ...
  }
}

The class contains two non-volatile fields, _A and _B. Method Foo first reads field _A and then field _B. However, because the fields are non-volatile, the compiler is free to reorder the two reads. So, if the correctness of the algorithm depends on the order of the reads, the program contains a bug.

It’s hard to imagine what the compiler would gain by switching the order of the reads. Given the way Foo is written, the compiler probably wouldn’t swap the order of the reads.

However, the reordering does happen if I add another innocuous statement at the top of the Foo method:

public bool Foo()
{
  if (_B == -1) throw new Exception(); // Extra read
  int a = _A;
  int b = _B;
  return a > b;
}

On the first line of the Foo method, the compiler loads the value of _B into a register. Then, the second load of _B simply uses the value that’s already in the register instead of issuing a real load instruction.

Effectively, the compiler rewrites the Foo method as follows:

public bool Foo()
{
  int b = _B;
  if (b == -1) throw new Exception(); // Extra read
  int a = _A;
  return a > b;
}

Although this code sample gives a rough approximation of how the compiler optimizes the code, it’s also instructive to look at the disassembly of the code:

if (_B == -1) throw new Exception();
  push        eax
  mov         edx,dword ptr [ecx+8]
  // Load field _B into EDX register
  cmp         edx,0FFFFFFFFh
  je          00000016
int a = _A;
  mov         eax,dword ptr [ecx+4]
  // Load field _A into EAX register
return a > b;
  cmp         eax,edx
  // Compare registers EAX and EDX
...

Even if you don’t know assembly, what’s happening here is pretty easy to understand. As a part of evaluating the condition _B == -1, the compiler loads the _B field into the EDX register. Later, when field _B is read again, the compiler simply reuses the value it already has in EDX instead of issuing a real memory read. Consequently, the reads of _A and _B get reordered.

In this case, the correct solution is to mark field _A as volatile. If that’s done, the compiler shouldn’t reorder the reads of _A and _B, because the load of _A has load-acquire semantics. However, I should  point out that the .NET Framework, through version 4, doesn’t handle this case correctly and, in fact, marking the _A field as volatile will not prevent the read reordering. This issue has been fixed in the .NET Framework version 4.5.

Read Introduction As I just explained, the compiler sometimes fuses multiple reads into one. The compiler can also split a single read into multiple reads. In the .NET Framework 4.5, read introduction is much less common than read elimination and occurs only in very rare, specific circumstances. However, it does sometimes happen.

To understand read introduction, consider the following example:

public class ReadIntro {
  private Object _obj = new Object();
  void PrintObj() {
    Object obj = _obj;
    if (obj != null) {
      Console.WriteLine(obj.ToString());
    // May throw a NullReferenceException
    }
  }
  void Uninitialize() {
    _obj = null;
  }
}

If you examine the PrintObj method, it looks like the obj value will never be null in the obj.ToString expression. However, that line of code could in fact throw a NullReferenceException. The CLR JIT might compile the PrintObj method as if it were written like this:

void PrintObj() {
  if (_obj != null) {
    Console.WriteLine(_obj.ToString());
  }
}

Because the read of the _obj field has been split into two reads of the field, the ToString method may now be called on a null target.

Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.

C# Memory Model Implementation on the x86-x64

As the x86 and x64 have the same behavior with respect to the memory model, I’ll consider both processor variants together.

Unlike some architectures, the x86-x64 processor provides fairly strong ordering guarantees on memory operations. In fact, the JIT compiler doesn’t need to use any special instructions on the x86-x64 to achieve volatile semantics; ordinary memory operations already provide those semantics. Even so, there are still specific cases when the x86-x64 processor does reorder memory operations.

x86-x64 Memory Reordering Even though the x86-x64 processor provides fairly strong ordering guarantees, a particular kind of hardware reordering still happens.

The x86-x64 processor will not reorder two writes, nor will it reorder two reads. However, the one (and only) possible reordering effect is that when a processor writes a value, that value will not be made immediately available to other processors. Figure 1 shows an example that demonstrates this behavior.

Figure 1 StoreBufferExample

class StoreBufferExample
{
  // On x86 .NET Framework 4.5, it makes no difference
  // whether these fields are volatile or not
  volatile int A = 0;
  volatile int B = 0;
  volatile bool A_Won = false;
  volatile bool B_Won = false;
  public void ThreadA()
  {
    A = true;
    if (!B) A_Won = true;
  }
  public void ThreadB()
  {
    B = true;
    if (!A) B_Won = true;
  }
}

Consider the case when methods ThreadA and ThreadB are called from different threads on a new instance of StoreBufferExample, as shown in Figure 2. If you think about the possible outcomes of the program in Figure 2, three cases seem to be possible:

  1. Thread 1 completes before Thread 2 starts. The outcome is A_Won=true, B_Won=false.
  2. Thread 2 completes before Thread 1 starts. The outcome is A_Won=false, B_Won=true.
  3. The threads interleave. The outcome is A_Won=false, B_Won=false.

Calling ThreadA and ThreadB Methods from Different Threads
Figure 2 Calling ThreadA and ThreadB Methods from Different Threads

But, surprisingly, there’s a fourth case: It’s possible that both the A_Won and B_Won fields will be true after this code has finished! Because of the store buffer, stores can get “delayed,” and therefore end up reordered with a subsequent load. Even though this outcome is not consistent with any interleaving of Thread 1 and Thread 2 executions, it can still happen.

This example is interesting because we have a processor (the x86-x64) with relatively strong ordering, and all fields are volatile—and we still observe a reordering of memory operations. Even though the write to A is volatile and the read from A_Won is also volatile, the fences are both one-directional, and in fact allow this reordering. So, the ThreadA method may effectively execute as if it were written like this:

public void ThreadA()
{
  bool tmp = B;
  A = true;
  if (!tmp) A_Won = 1;
}

One possible fix is to insert a memory barrier into both ThreadA and ThreadB. The updated ThreadA method would look like this:

public void ThreadA()
{
  A = true;
  Thread.MemoryBarrier();
  if (!B) aWon = 1;
}

The CLR JIT will insert a “lock or” instruction in place of the memory barrier. A locked x86 instruction has the side effect of flushing the store buffer:

mov         byte ptr [ecx+4],1
lock or     dword ptr [esp],0
cmp         byte ptr [ecx+5],0
jne         00000013
mov         byte ptr [ecx+6],1
ret

As an interesting side note, the Java programming language takes a different approach. The Java memory model has a slightly stronger definition of “volatile” that doesn’t permit store-load reordering, so a Java compiler on the x86 will typically emit a locked instruction after a volatile write.

x86-x64 Remarks The x86 processor has a fairly strong memory model, and the only source of reordering at the hardware level is the store buffer. The store buffer can cause a write to get reordered with a subsequent read (store-load reordering).

Also, certain compiler optimizations can result in reordering of memory operations. Notably, if several reads access the same memory location, the compiler might choose to perform the read only once and keep the value in a register for subsequent reads.

One interesting piece of trivia is that the C# volatile semantics closely match the hardware reordering guarantees made by x86-x64 hardware. As a result, reads and writes of volatile fields require no special instructions on the x86: Ordinary reads and writes (for example, using the MOV instruction) are sufficient. Of course, your code shouldn’t depend on these implementation details because they vary among hardware architectures and possibly .NET versions.

C# Memory Model Implementation on the Itanium Architecture

The Itanium hardware architecture has a memory model weaker than that of the x86-x64. Itanium was supported by the .NET Framework until version 4.

Even though Itanium is no longer supported in the .NET Framework 4.5, understanding the Itanium memory model is useful when you read older articles on the .NET memory model and have to maintain code that incorporated recommendations from those articles.

Itanium Reordering Itanium has a different instruction set than the x86-x64, and memory model concepts show up in the instruction set. Itanium distinguishes between an ordinary load (LD) and load-acquire (LD.ACQ), and an ordinary store (ST) and store-release (ST.REL).

Ordinary loads and stores can be freely reordered by the hardware, as long as the single-threaded behavior doesn’t change. For example, look at this code:

class ReorderingExample
{
  int _a = 0, _b = 0;
  void PrintAB()
  {
    int a = _a;
    int b = _b;
    Console.WriteLine("A:{0} B:{1}", a, b);
  }
  ...
}

Consider two reads of _a and _b in the PrintAB method. Because the reads access an ordinary, non-volatile field, the compiler will use ordinary LD (not LD.ACQ) to implement the reads. Consequently, the two reads might be effectively reordered in the hardware, so that PrintAB behaves as if it were written like this:

void PrintAB()
{
  int b = _b;
  int a = _a;
  Console.WriteLine("A:{0} B:{1}", a, b);
}

In practice, whether the reordering happens or not depends on a variety of unpredictable factors—what’s in the processor cache, how busy the processor pipeline is and so forth. However, the processor will not reorder two reads if they’re related via data dependency. Data dependency between two reads occurs when the value returned by a memory read determines the location of the read by a subsequent read.

This example illustrates data dependency:

class Counter { public int _value; }
class Test
{
  private Counter _counter = new Counter();
  void Do()
  {
    Counter c = _counter; // Read 1
    int value = c._value; // Read 2
  }
}

In the Do method, Itanium will never reorder Read 1 and Read 2, even though Read 1 is an ordinary load and not load-acquire. It might seem obvious that these two reads can’t be reordered: The first read determines which memory location the second read should access! However, some processors—other than Itanium—may in fact reorder the reads. The processor might guess on the value that Read 1 will return and perform Read 2 speculatively, even before Read 1 has completed. But, again, Itanium will not do that.

I’ll get back to the data dependency in Itanium discussion in a bit, and its relevance to the C# memory model will become clearer.

Also, itanium will not reorder two ordinary reads if they’re related via control dependency. Control dependency occurs when the value returned by a read determines whether a subsequent instruction will execute.

So, in this example, the reads of _initialized and _data are related via control dependency:

void Print() {
  if (_initialized)            // Read 1
    Console.WriteLine(_data);  // Read 2
  else
    Console.WriteLine("Not initialized");
}

Even if _initialized and _data are ordinary (non-volatile) reads, the Itanium processor will not reorder them. Note that the JIT compiler is still free to reorder the two reads, and in some cases will.

Also, it’s worth pointing out that, like the x86-x64 processor, Itanium also uses a store buffer, so the StoreBufferExample shown in Figure 1 will exhibit the same kind of reorderings on Itanium as it did on the x86-x64. An interesting piece of trivia is that if you use LD.ACQ for all reads and ST.REL for all writes on Itanium, you basically get the x86-x64 memory model, where the store buffer is the only source of reordering.

Compiler Behavior on Itanium The CLR JIT compiler has one surprising behavior on Itanium: all writes are emitted as ST.REL, and not ST. Consequently, a volatile write and a non-volatile write will typically emit the same instruction on Itanium. However, an ordinary read will be emitted as LD; only reads from volatile fields are emitted as LD.ACQ.

This behavior might come as a surprise because the compiler is certainly not required to emit ST.REL for non-volatile writes. As far as the European Computer Manufacturers Association (ECMA) C# specification is concerned, the compiler could emit ordinary ST instructions. Emitting ST.REL is just something extra that the compiler chooses to do, in order to ensure that a particular common (but in theory incorrect) pattern will work as expected.

It can be difficult to imagine what that important pattern might be where ST.REL must be used for writes, but LD is sufficient for reads. In the PrintAB example presented earlier in this section, constraining just the writes wouldn’t help, because reads could still be reordered.

There’s one very important scenario in which using ST.REL with ordinary LD is sufficient: when the loads themselves are ordered using data dependency. This pattern comes up in lazy initialization, which is an extremely important pattern. Figure 3 shows an example of lazy initialization.

Figure 3 Lazy Initialization

// Warning: Might not work on future architectures and .NET versions;
// do not use
class LazyExample
{
  private BoxedInt _boxedInt;
  int GetInt()
  {
    BoxedInt b = _boxedInt; // Read 1
    if (b == null)
    {
      lock(this)
      {
        if (_boxedInt == null)
        {
          b = new BoxedInt();
          b._value = 42;  // Write 1
          _boxedInt = b; // Write 2
        }
      }
    }
    int value = b._value; // Read 2
    return value;
  }
}

In order for this bit of code to always return 42—even if GetInt is called from multiple threads concurrently—Read 1 must not be reordered with Read 2, and Write 1 must not be reordered with Write 2. The reads will not be reordered by the Itanium processor because they’re related via data dependency. And, the writes will not be reordered because the CLR JIT emits them as ST.REL.

Note that if the _boxedInt field were volatile, the code would be correct according to the ECMA C# spec. That’s the best kind of correct, and arguably the only real kind of correct. However, even if _boxed is not volatile, the current version of the compiler will ensure that the code still works on Itanium in practice.

Of course, loop read hoisting, read elimination and read introduction may be performed by the CLR JIT on Itanium, just as they are on x86-x64.

Itanium Remarks The reason Itanium is an interesting part of the story is that it was the first architecture with a weak memory model that ran the .NET Framework.

As a result, in a number of articles about the C# memory model and the volatile keyword and C#, the authors generally had Itanium in mind. After all, until the .NET Framework 4.5, Itanium was the only architecture other than the x86-x64 that ran the .NET Framework.

Consequently, the author might say something like, “In the .NET 2.0 memory model, all writes are volatile—even those to non-volatile fields.” What the author means is that on Itanium, CLR will emit all writes as ST.REL. This behavior isn’t guaranteed by the ECMA C# spec, and, consequently, might not hold in future versions of the .NET Framework and on future architectures (and, in fact, does not hold in the .NET Framework 4.5 on ARM).

Similarly, some folks would argue that lazy initialization is correct in the .NET Framework even if the holding field is non-­volatile, while others would say that the field must be volatile.

And of course, developers wrote code against these (sometimes contradictory) assumptions. So, understanding the Itanium part of the story can be helpful when trying to make sense of concurrent code written by someone else, reading older articles or even just talking to other developers.

C# Memory Model Implementation on ARM

The ARM architecture is the most recent addition to the list of architectures supported by the .NET Framework. Like Itanium, ARM has a weaker memory model than the x86-x64.

ARM Reordering Just like Itanium, ARM is allowed to freely reorder ordinary reads and writes. However, the solution that ARM provides to tame the movement of reads and writes is somewhat different from that of Itanium. ARM exposes a single instruction—DMB—that acts as a full memory barrier. No memory operation can pass over DMB in either direction.

In addition to the constraints imposed by the DMB instruction, ARM also respects data dependency, but doesn’t respect control dependency. See the “Itanium Reordering” section earlier in this article for explanations of data and control dependencies.

Compiler Behavior on ARM The DMB instruction is used to implement the volatile semantics in C#. On ARM, the CLR JIT implements a read from a volatile field using an ordinary read (such as LDR) followed by the DMB instruction. Because the DMB instruction will prevent the volatile read from getting reordered with any subsequent operations, this solution correctly implements the acquire semantics.

A write to a volatile field is implemented using the DMB instruction followed by an ordinary write (such as STR). Because the DMB instruction prevents the volatile write from getting reordered with any prior operations, this solution correctly implements the release semantics.

Just as with the Itanium processor, it would be nice to go beyond the ECMA C# spec and keep the lazy initialization pattern working, because a lot of existing code depends on it. However, making all writes effectively volatile is not a good solution on ARM because the DBM instruction is fairly costly.

In the .NET Framework 4.5, the CLR JIT uses a slightly different trick to get lazy initialization working. The following are treated as “release” barriers:

  1. Writes to reference-type fields on the garbage collector (GC) heap
  2. Writes to reference-type static fields

As a result, any write that might publish an object is treated as a release barrier.

This is the relevant part of LazyExample (recall that none of the fields are volatile):

b = new BoxedInt();
b._value = 42;  // Write 1
// DMB will be emitted here
_boxedInt = b; // Write 2

Because the CLR JIT emits the DMB instructions prior to the publication of the object into the _boxedInt field, Write 1 and Write 2 will not be reordered. And because ARM respects data dependence, the reads in the lazy initialization pattern will not be reordered either, and the code will work correctly on ARM.

So, the CLR JIT makes an extra effort (beyond what’s mandated in the ECMA C# spec) to keep the most common variant of incorrect lazy initialization working on ARM.

As a final comment on ARM, note that—as on x86-x64 and Itanium—loop read hoisting, read elimination and read introduction are all legitimate optimizations as far as the CLR JIT is concerned.

Example: Lazy Initialization

It can be instructive to look at a few different variants of the lazy initialization pattern and think about how they’ll behave on different architectures.

Correct Implementation The implementation of lazy initialization in Figure 4is correct according to the C# memory model as defined by the ECMA C# spec, and so it’s guaranteed to work on all architectures supported by current and future versions of the .NET Framework.

Figure 4 A Correct Implementation of Lazy Initialization

class BoxedInt
{
  public int _value;
  public BoxedInt() { }
  public BoxedInt(int value) { _value = value; }
}
class LazyExample
{
  private volatile BoxedInt _boxedInt;
  int GetInt()
  {
    BoxedInt b = _boxedInt;
    if (b == null)
    {
      b = new BoxedInt(42);
      _boxedInt = b;
    }
    return b._value;
  }
}

Note that even though the code sample is correct, in practice it’s still preferable to use the Lazy<T> or the LazyInitializer type.

Incorrect Implementation No. 1Figure 5 shows an implementation that isn’t correct according to the C# memory model. In spite of this, the implementation will probably work on the x86-x64, Itanium and ARM in the .NET Framework. This version of the code is not correct. Because _boxedInt isn’t volatile, a C# compiler is permitted to reorder Read 1 with Read 2, or Write 1 with Write 2. Either reordering would potentially result in 0 returned from GetInt.

Figure 5 An Incorrect Implementation of Lazy Initialization

// Warning: Bad code
class LazyExample
{
  private BoxedInt _boxedInt; // Note: This field is not volatile
  int GetInt()
  {
    BoxedInt b = _boxedInt; // Read 1
    if (b == null)
    {
      b = new BoxedInt(42); // Write 1 (inside constructor)
      _boxedInt = b;        // Write 2
    }
    return b._value;        // Read 2
  }
}

However, this code will behave correctly (that is, always return 42) on all architectures in the .NET Framework versions 4 and 4.5:

  • x86-x64:
    • Writes and reads will not be reordered. There is no store-load pattern in the code, and also no reason the compiler would cache values in registers.
  • Itanium:
    • Writes will not be reordered because they are ST.REL.
    • Reads will not be reordered due to data dependency.
  • ARM:
    • Writes will not be reordered because DMB is emitted before “_boxedInt = b.”
    • Reads will not be reordered due to data dependency.

Of course, you should use this information only to try to understand the behavior of existing code. Do not use this pattern when writing new code.

Incorrect Implementation No. 2 The incorrect implementation in Figure 6 may fail on both ARM and Itanium.

Figure 6 A Second Incorrect Implementation of Lazy Initialization

// Warning: Bad code
class LazyExample
{
  private int _value;
  private bool _initialized;
  int GetInt()
  {
    if (!_initialized) // Read 1
    {
      _value = 42;
      _initialized = true;
    }
    return _value;     // Read 2
  }
}

This version of lazy initialization uses two separate fields to track the data (_value) and whether the field is initialized (_initialized). As a result, the two reads—Read 1 and Read 2—are no longer related via data dependency. Additionally, on ARM, the writes might also get reordered, for the same reasons as in the next incorrect implementation (No. 3).

As a result, this version may fail and return 0 on ARM and Itanium in practice. Of course, GetInt is allowed to return 0 on x86-x64 (and also as a result of JIT optimizations), but that behavior doesn’t seem to happen in the .NET Framework 4.5.

Incorrect Implementation No. 3 Finally, it’s possible to get the example to fail even on x86-x64. I just have to add one innocuous-looking read, as shown in Figure 7.

Figure 7 A Third Incorrect Implementation of Lazy Initialization

// WARNING: Bad code
class LazyExample
{
  private int _value;
  private bool _initialized;
  int GetInt()
  {
    if (_value < 0) throw new Exception(); // Note: extra reads to get _value
                                         // pre-loaded into a register
    if (!_initialized)      // Read 1
    {
      _value = 42;
      _initialized = true;
      return _value;
    }
    return _value;          // Read 2
  }
}

The extra read that checks whether _value < 0 can now cause the compiler to cache the value in register. As a result, Read 2 will get serviced from a register, and so it gets effectively reordered with Read 1. Consequently, this version of GetInt may in practice return 0 even on x86-x64.

Wrapping Up

When writing new multi-threaded code, it’s generally a good idea to avoid the complexity of the C# memory model altogether by using high-level concurrency primitives like locks, concurrent collections, tasks, and parallel loops. When writing CPU-intensive code, it sometimes makes sense to use volatile fields, as long as you only rely on the ECMA C# specification guarantees and not on architecture-specific implementation details.


Igor Ostrovsky is a senior software development engineer at Microsoft. He has worked on Parallel LINQ, the Task Parallel Library, and other parallel libraries and primitives in the .NET Framework. Ostrovsky blogs about programming topics at igoro.com.

Thanks to the following technical experts for reviewing this article: Joe Duffy, Eric Eilebrecht, Joe Hoag, Emad Omara, Grant Richins, Jaroslav Sevcik and Stephen Toub