High Availability

Keep Your Code Running with the Reliability Features of the .NET Framework

Stephen Toub

This article is based on a prerelease version of the .NET Framework 2.0. All information herein is subject to change.

This article discusses:
  • Understanding OutOfMemoryException, StackOverflowException, and ThreadAbortException
  • Constrained Execution Regions
  • Critical Finalizers and SafeHandle
  • FailFast and MemoryGates
This article uses the following technologies:
.NET Framework, SQL Server 2005

Contents

Villainy
Constrained Execution Regions
RuntimeHelpers
Reliability Contracts
Explicit Preparation
Handling StackOverflowException
CLR Hosts and Escalation Policies
Critical Finalizers and Safe Handles
Critical Regions
FailFast and MemoryGates
Conclusion

Do you write reliable managed code? Obviously if your manager asks you that question, you'll want the answer to be yes. You use try/finally blocks to release resources deterministically and you eagerly dispose all of your disposable objects. So, of course, your code is reliable, right? A job well done?

Regrettably, that's not the whole story. In the context of writing managed code, reliability requires the capacity to execute a sequence of operations in a deterministic way, even under exceptional conditions. This allows you to ensure that resources are not leaked and that you can maintain state consistency without relying on application domain unloading (or worse, process restarts) to fix any corrupted state. Unfortunately, in the Microsoft® .NET Framework, not all exceptions are deterministic and synchronous, which makes it difficult to write code that is always deterministic in its ability to execute a predetermined sequence of operations. And with the .NET Framework 1.x, some situations make it nearly impossible. In this article I'll show you why, and I'll then explore new features of the .NET Framework 2.0 that help you to mitigate these situations and write more reliable code.

As a primary example of why this is important, starting with the SQL Server™ 2005 release, SQL Server is able to host the common language runtime (CLR), allowing stored procedures, functions, and triggers to be written in managed code. Since access to these stored procedures must be fast, SQL Server hosts the CLR in-process. ASP.NET makes use of process recycling to ensure high availability, starting new worker processes when it determines that the current worker processes are not functioning correctly. But SQL Server with its in-process hosting doesn't have that luxury; it can't just restart the worker process, since restarting that main database process will create downtime. Instead, SQL Server opts for application domain isolation as a backstop against unpredicted failures, such that a deteriorating application domain can simply be unloaded and replaced with a new domain. As a result, it is very important that code running in SQL Server be as reliable as possible, and that when shared state is corrupted, that corruption be constrained such that the server can recover. (Process-wide corruption may force SQL Server to disable the CLR.) Reliability is important for client applications that consume system-wide resources, and it's extremely important for any application that requires significant up-time, be it SQL Server, a Windows® service, or any other host or application that might need to run for prolonged periods of time. Ensuring that your code is reliable is a great step towards being able to run in these environments.

Villainy

For me, the best drama involves the misunderstood antagonist, the rival who you might root for if the situation were different. The villains in this story are no different, useful for many situations, and yet a thorn in the side of reliable code. These villains come in the form of OutOfMemoryException, StackOverflowException, and ThreadAbortException.

An OutOfMemoryException is thrown during an attempt to procure more memory for a process when there is not enough contiguous memory available to satisfy the demand. Typically, one thinks of these exceptions as happening in response to explicit instructions in code to create new objects, meaning newobj and newarr instructions at the Microsoft intermediate language (MSIL) level. But other operations result in memory allocations, too. Boxing (using the MSIL box instruction), for example, requires heap allocation to store a value type. Calling a method that references a type for the first time will result in the relevant assembly being delay-loaded into memory, thus requiring allocations. Executing a previously unexecuted method requires that method to be just-in-time (JIT) compiled, requiring memory allocations to store the generated code and associated runtime data structures. And so on.
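As a rough illustration (the class and method names here are hypothetical, chosen just for the sketch), heap allocations hide behind ordinary-looking statements that contain no new keyword at all:

```csharp
using System;

class HiddenAllocations
{
    // Returns values produced by three "invisible" heap allocations.
    public static string Demo()
    {
        int[] values = { 1, 2, 3 };

        object boxed = 42;                  // box instruction: heap-allocates an Int32
        string s = 17.ToString();           // allocates a new string object
        int[] copy = (int[])values.Clone(); // allocates a new array

        return string.Format("{0} {1} {2}", boxed, s, copy.Length);
    }

    static void Main()
    {
        Console.WriteLine(Demo()); // prints "42 17 3"
    }
}
```

Any one of these lines could, in principle, be the point at which an OutOfMemoryException surfaces.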

In truth, memory allocations in managed code can happen in the unlikeliest of places, making it very hard to correctly handle these exceptions, even with the liberal use of try/catch/finally and finalizers. What if those back-out code blocks and finalizers require memory allocations? What if they haven't been JIT compiled yet? What if a method you call allocates memory, as do many of the APIs in the Framework itself? The fact is that the CLR in the .NET Framework 1.x makes no guarantees that back-out code will ever be executed, and this lack of execution guarantees makes it extremely difficult to author applications that reliably handle out-of-memory conditions.

StackOverflowException is also problematic. This exception will occur when the execution stack for the current thread overflows by having too many pending method calls, often the result of a highly recursive function or a stack frame that consumes a significant amount of stack space, such as one that uses the C# stackalloc keyword (corresponding to the MSIL localloc instruction). Some method calls, such as invocations of a method in another application domain (which involves .NET Remoting), can also result in significant stack consumption, even if the target method itself doesn't require much. As with OutOfMemoryException, the CLR in the .NET Framework 1.x makes no guarantees that back-out code will ever be executed; in fact, it makes no guarantees that a StackOverflowException can even be caught. In version 1.x, an exception will be thrown if the overflow occurs in managed code, but if the overflow occurs within the runtime, the process will be torn down.

When Windows detects that a thread has overflowed the stack, the process can attempt to handle the error. However, if the process's exception handling code isn't written extremely carefully, it can result in another stack overflow. If the same thread overflows the stack twice without resetting the stack's guard page, the operating system kills the process. Most applications and libraries never attempt to address this problem, because it is very difficult to write code that can survive a stack overflow from every method call (handling stack overflows also requires part of the stack to be unwound, which means that finally blocks close to the overflow point may be skipped altogether). In fact, in version 1.x, the CLR itself could get a stack overflow while handling a stack overflow from managed code, which would cause the operating system to kill the process. In the .NET Framework 2.0, the CLR can reliably detect stack overflows, and then, based on host policy, decide whether to tear down the process or to throw a StackOverflowException and allow relevant managed catch and finally blocks to run.
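To make the stack-consumption point concrete, here is a minimal sketch (using the modern Span-based form of stackalloc rather than the unsafe pointer form of the article's era; the class and method names are hypothetical):

```csharp
using System;

class StackHungry
{
    // Each call reserves a ~64KB buffer on the stack, so recursing through
    // this method exhausts a default 1MB thread stack after only a handful
    // of levels, where a plain recursive method could go thousands deep.
    public static int DeepFrame(int depth)
    {
        Span<byte> buffer = stackalloc byte[64 * 1024];
        buffer[0] = (byte)depth;
        return depth == 0 ? buffer[0] : DeepFrame(depth - 1);
    }

    static void Main()
    {
        // A shallow, safe call; pushing depth into the hundreds would
        // overflow the stack and, by default, tear down the process.
        Console.WriteLine(DeepFrame(3)); // prints 0
    }
}
```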

Arguably the worst of the lot is ThreadAbortException. When a thread calls Thread.Abort on itself (for example, Thread.CurrentThread.Abort), the result isn't particularly interesting: a synchronous ThreadAbortException is thrown. The reliability problem occurs when a thread uses Abort to terminate another thread—or when AppDomain.Unload is called, which has the effect of calling Abort on all threads currently executing in the target domain or that have an existing stack frame from that domain. This causes the runtime to inject a ThreadAbortException into the target thread, and this can occur in the target thread between any two machine instructions. Consider the following code:

IntPtr memPtr = Marshal.AllocHGlobal(0x100);
try
{
    ... // use allocated unmanaged memory here
}
finally { Marshal.FreeHGlobal(memPtr); }

What if a ThreadAbortException is thrown after allocating the memory, but before entering the try block? The finally block will not be executed (since the exception did not occur within the corresponding try block), and you now have a memory leak as the garbage collector (GC) knows nothing about this unmanaged memory. You might try rewriting the code so that the allocation is performed within the try block, but that won't help if the exception occurs after AllocHGlobal returns and before the machine instruction executes to store that value to memPtr. The pointer to the allocated memory will be lost, and the memory will be leaked.

Alternatively, what if the exception is raised in the finally block, which can happen in the .NET Framework 1.x and in certain scenarios in the .NET Framework 2.0? In short, if user code has no way of indicating regions of code that can't be interrupted by ThreadAbortExceptions, it's nearly impossible to write reliable code in the face of asynchronous exceptions.

Constrained Execution Regions

The .NET Framework 2.0 introduces Constrained Execution Regions (CER), which impose restrictions both on the runtime and on the developer. In a region of code marked as a CER, the runtime is constrained from throwing certain asynchronous exceptions that would prevent the region from executing in its entirety. The developer is also constrained in the actions that can be performed in the region. This creates a framework and an enforcement mechanism for authoring reliable managed code, making it a key player in the reliability story for the .NET Framework 2.0.

For the runtime to meet its burden, it makes two accommodations for CERs. First, the runtime will delay thread aborts for code that is executing in a CER. In other words, if a thread calls Thread.Abort to abort another thread that is currently executing within a CER, the runtime will not abort the target thread until execution has left the CER. Second, the runtime will prepare CERs as soon as is possible to avoid out-of-memory conditions. This means that the runtime will do everything up front that it would normally do during the code region's JIT compilation. It will also probe for a certain amount of free stack space to help eliminate stack overflow exceptions. By doing this work up front, the runtime can better avoid exceptions that might occur within the region and prevent resources from being cleaned up appropriately.

To use CERs effectively, developers must avoid actions that might result in asynchronous exceptions. Code in a CER is constrained from performing explicit allocations, boxing, virtual method calls (unless the target of the virtual method call has already been prepared), method calls through reflection, use of Monitor.Enter (or the lock keyword in C# and SyncLock in Visual Basic®), isinst and castclass instructions on COM objects, field access through transparent proxies, serialization, and multidimensional array accesses.
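The contrast can be sketched in a few lines. This is a hedged illustration, not a definitive pattern; the class and field names are invented, and the commented-out lines show the kinds of operations the constraints rule out:

```csharp
using System;
using System.Runtime.CompilerServices;

class CerConstraints
{
    static bool _released;

    public static bool Release()
    {
        RuntimeHelpers.PrepareConstrainedRegions();
        try { }
        finally
        {
            // Fine inside a CER: simple field writes and calls into
            // already-prepared methods.
            _released = true;

            // Avoid inside a CER -- each can fail with a runtime-induced
            // exception:
            //   object boxed = 42;               // boxing allocates
            //   lock (typeof(CerConstraints)) {} // Monitor.Enter
            //   var buf = new byte[256];         // explicit allocation
        }
        return _released;
    }

    static void Main() { Console.WriteLine(Release()); } // prints True
}
```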

In short, CERs are a way to move any runtime-induced failure point from your code to a time either before the code runs (in the case of JIT compiling), or after the code completes (for thread aborts). However, CERs really do constrain the code you can write. Restrictions such as not allowing most allocations or virtual method calls to unprepared targets are significant, implying a high development cost to authoring them. This means CERs aren't suited for large bodies of general-purpose code, and they should instead be thought of as a technique to guarantee execution of small regions of code.

RuntimeHelpers

By default, the runtime doesn't assume code is in a constrained execution region. A developer must explicitly mark which regions of code are to be protected and prepared using methods from the System.Runtime.CompilerServices.RuntimeHelpers class. The most important method from this class is the PrepareConstrainedRegions static method. This method may probe for sufficient stack space and serves as a marker to the CLR, informing it that a CER is about to begin. At the MSIL level, it must come immediately before the start of a try block, and has the effect of making the try a "reliable try" by ensuring that all of the resources needed by the back-out code are allocated in advance (thereby protecting the region from CLR-introduced out-of-memory exceptions within the region). It also ensures that any thread aborts are delayed until the back-out code has finished its execution.

Note that the only code prepared is that in the catch, finally, fault, and filter blocks associated with the try, and not in the try block itself. It's still possible for the code in the try block to throw an OutOfMemoryException from the JIT compiler or to be interrupted by a ThreadAbortException. However, since the back-out code has already been prepared, the related catch, finally, fault, and filter blocks will be able to run and handle these exceptions. (Had this code not been prepared, an out-of-memory situation could prevent the back-out code from executing.) The Visual Basic code in Figure 1 demonstrates what code will and will not be prepared (though it does not demonstrate fault blocks, since they are not currently available from any Microsoft languages other than MSIL).

Figure 1 RuntimeHelpers.PrepareConstrainedRegions

Imports System
Imports System.Runtime.CompilerServices

Class RegionsDemo
    Shared Sub Main()
        ... ' will not be prepared
        RuntimeHelpers.PrepareConstrainedRegions()
        Try
            ... ' will not be prepared
        Catch exc As SomeException When SomeFilter(exc)
            ... ' will be prepared
        Finally
            ... ' will be prepared
        End Try
        ... ' will not be prepared
    End Sub

    Shared Function SomeFilter(ByVal exc As Exception) As Boolean
        ... ' will be prepared
    End Function
End Class

Because the code in a try block marked with PrepareConstrainedRegions will not be prepared, you'll frequently see (and may find useful) the following pattern for creating a region of code that is not interruptible:

RuntimeHelpers.PrepareConstrainedRegions();
try {} finally
{
    ... // your noninterruptible code here
}

Here, instead of the finally block acting as back-out code used to recover from state changes in the try block, the finally block contains the code that makes forward progress.
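Applying this pattern to the earlier Marshal.AllocHGlobal example yields a version where the allocation and the store to the pointer variable can no longer be separated by an injected ThreadAbortException. This is a sketch under the assumption that the surrounding host hasn't altered abort semantics; the class and method names are invented:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class DelayedAbortDemo
{
    public static byte Demo()
    {
        IntPtr memPtr = IntPtr.Zero;

        // Marks the following try as a CER: thread aborts are delayed until
        // the finally completes, so the allocation and the store to memPtr
        // cannot be separated by an injected ThreadAbortException.
        RuntimeHelpers.PrepareConstrainedRegions();
        try { }
        finally
        {
            memPtr = Marshal.AllocHGlobal(0x100);
        }

        try
        {
            // ... use the unmanaged memory here
            Marshal.WriteByte(memPtr, 0x7F);
            return Marshal.ReadByte(memPtr);
        }
        finally
        {
            Marshal.FreeHGlobal(memPtr);
        }
    }

    static void Main() { Console.WriteLine(Demo()); } // prints 127
}
```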

In addition to preparing the code in the back-out blocks, the effect of PrepareConstrainedRegions is transitive, meaning that the CLR will walk the call graph from the back-out code and will prepare any methods found in the graph. There are, however, a few restrictions to what methods will be prepared. First, the CLR can only prepare methods that it can find. If a call site involves any sort of indirection, such as a call through an interface, a delegate, a virtual method, or reflection, the CLR won't be able to trace through the graph to the target, and thus the target won't be prepared. As a result, out-of-memory exceptions may still occur at run time and thread aborts won't be delayed properly. Second, every method in the graph must be covered by an appropriate reliability contract.

Reliability Contracts

In order for the runtime to eagerly prepare a CER, it must know that the code within that CER and its call graph adheres to the constraints required for execution within a CER. For this, the .NET Framework provides ReliabilityContractAttribute (shown in Figure 2) in the System.Runtime.ConstrainedExecution namespace.

Figure 2 ReliabilityContractAttribute

[AttributeUsage(AttributeTargets.Interface | AttributeTargets.Method |  
    AttributeTargets.Constructor | AttributeTargets.Struct | 
    AttributeTargets.Class | AttributeTargets.Assembly, Inherited=false)] 
public sealed class ReliabilityContractAttribute : Attribute
{
    public ReliabilityContractAttribute(
        Consistency consistencyGuarantee, Cer cer);
    public Cer Cer { get; set; }
    public Consistency ConsistencyGuarantee { get; set; }
}

[Serializable]
public enum Cer { None = 0, MayFail = 1, Success = 2 }

[Serializable]
public enum Consistency
{
    MayCorruptProcess = 0, MayCorruptAppDomain = 1,
    MayCorruptInstance = 2, WillNotCorruptState = 3
}

A reliability contract expresses two different, but related, concepts: what kind of state corruption could result from asynchronous exceptions being thrown during the method's execution, and what kind of completion guarantees the method can make if it were to run in a CER. In addition to being supplied at the method level, contracts can also be specified at the class and assembly levels. Contracts found on methods override any found at the class or assembly level, and those found on classes override those at the assembly level.

The first part of the contract is expressed through the attribute's ConsistencyGuarantee property, which takes a value from the Consistency enumeration. This property describes the level of state corruption that can result if an asynchronous exception is thrown during the method's execution. (Think of this as a gauge used to provide the caller with some idea of what state needs to be thrown away in order to recover to a known-good state.) The worst corruption is process corruption (MayCorruptProcess), which is used to indicate that the method in question might have been mucking with process-wide state when the exception occurred, such that the state might now be inconsistent. Similarly, MayCorruptAppDomain signals that the method might be messing with state isolated to the application domain (such as a static variable), and thus that state might be inconsistent (though process-wide state should still be consistent). MayCorruptInstance is used to signal that the method might have left an instance in an inconsistent state (but nothing more), and WillNotCorruptState means that the method could not possibly have left any state in an inconsistent fashion (often the case for methods that simply read state information).

The Cer property and enumeration are used to indicate what kind of completion guarantees a method makes if it is executing in a CER (if it's not in a CER, this guarantee is worthless). A value of Success means that this method will always complete successfully when executing under a CER, assuming valid input; a method marked with Success may still throw exceptions if supplied with parameters that are not valid.

A Cer value of MayFail is used to signal that when faced with asynchronous exceptions, the code may not complete in an expected fashion. Since thread aborts are being delayed over constrained execution regions, this really means that your code is doing something that may cause memory to be allocated or that might result in a stack overflow. More importantly, it means that you must take the possible failures into consideration when calling this method.

Only three combinations of Cer and Consistency values are valid for methods in a CER root's call graph:

[ReliabilityContract(Consistency.MayCorruptInstance, Cer.MayFail)]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]

The first indicates that in the face of exceptional conditions, the method may fail, but at worst it will only corrupt the particular instance; application domain and process-wide state will still be intact. The second indicates that the method may fail, but even if it does, all state will still be valid. This corresponds to what the C++ community calls the "strong exception guarantee," meaning either the action completed or it failed with no bad side effects and a thrown exception. The third implies that the method will always succeed and will not corrupt any state at all. This last pairing is the strongest guarantee possible, but as a result, very few methods can be marked as such. In fact, most methods you come across will either be marked with a Cer value of None or will lack a reliability contract completely, meaning that the method makes no guarantees whatsoever in the face of exceptional conditions. The lack of a reliability contract on a method implies Cer.None and Consistency.MayCorruptProcess.
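As a minimal sketch of applying these contracts (the Counter class and its methods are hypothetical, and whether these annotations are appropriate depends on what each method actually does):

```csharp
using System;
using System.Runtime.ConstrainedExecution;

class Counter
{
    private int _count;

    // Strongest valid pairing: always succeeds under a CER and only
    // reads state.
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    public int Read() { return _count; }

    // May fail under exceptional conditions, but even then leaves all
    // state valid -- the "strong exception guarantee."
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
    public void Increment()
    {
        _count = _count + 1; // a single field write: no torn state
    }
}

class ContractsDemo
{
    public static int Demo()
    {
        Counter c = new Counter();
        c.Increment();
        return c.Read();
    }

    static void Main() { Console.WriteLine(Demo()); } // prints 1
}
```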

When setting up reliability contracts for your own methods, keep in mind that it is considered a breaking change to lower your Cer or Consistency guarantees. Callers may have taken a dependency on your level of reliability, and by changing your levels you can break those dependencies. Leaving reliability contracts off most of your methods is a reasonable first-pass approach until you know exactly which of your methods will likely be used from within constrained execution regions.

Reliability contracts also have some interesting interpretation issues. For example, one of the problems the CLR designers ran into is that most well-implemented methods check their parameters and throw exceptions for input that is not valid. Thus, an approach of simply outlawing all allocations from within CERs would have been next to useless (some methods may also allocate and then recover from any out of memory exceptions). Reliability contracts are an attempt to abstract away that type of detail from the runtime, allowing the code author to clearly state whether callers need to worry about failures, and if so, how much state may need to be thrown away. As a consequence, reliability contracts are considered null and void if the arguments passed to the method are illegal or the method is used incorrectly. Additionally, there may be concepts that aren't easily expressed through the contracts. Imagine a method that accepts an Object as a parameter and calls its Equals method. This method may be reliable, but only if you pass in an object that has a reliable Equals override. There is currently no built-in way to express that concept.

Explicit Preparation

As mentioned earlier, PrepareConstrainedRegions will cause the runtime to walk any call graphs extending from the CER root. Unfortunately, the runtime isn't omniscient and can't predict the actual target method from virtual call sites. Thus, if an interface, virtual method, delegate, or generic method is used from a constrained region, the runtime may not be able to prepare the target methods ahead of time, since the target method won't be decided until run time (or in the case of generics, it may be decided, but some memory allocation may still be required the first time the method is invoked with unique type parameters). To help the CLR, a developer can make use of two additional methods on the RuntimeHelpers class, specifically PrepareMethod and PrepareDelegate. PrepareMethod accepts a RuntimeMethodHandle for the MethodBase of the target method. At run time, code can retrieve the method handle of the actual method to be invoked, and can then use PrepareMethod to prepare it before entering the CER.

For example, consider the first code snippet in Figure 3. BaseObject exposes a virtual method VirtualMethod that's used in a CER. Since it's virtual, the CLR is not able to determine through static analysis what the actual target of the invocation will be. If the object passed to SomeMethod is actually a DerivedObject, DerivedObject's VirtualMethod override may not be properly prepared by the time it's called, thus putting the CER in jeopardy. To fix the situation, PrepareMethod can be used once the actual target is known. Of course, reflecting to get method handles can be an expensive operation. If you know all of the possible methods that will be called, you might consider doing all preparation work up front, such as in a static constructor.

Unhandled Exceptions

In the .NET Framework 1.x, unhandled exceptions only sometimes cause a process to terminate, with the exact behavior varying based on the type of thread on which the exception is thrown as well as on the type of the exception. Unhandled exceptions on thread pool threads, on the finalizer thread, and on threads created with Thread.Start are all backstopped by the runtime. Thread pool threads are simply returned to the thread pool. The finalization thread eats the exception. Threads created by Thread.Start are terminated gracefully. Unhandled exceptions on the application's main thread cause the process to terminate. And threads created explicitly with the Win32® CreateThread function have their own set of rules.

Ignoring unhandled exceptions has caused a variety of reliability problems. Errors that would otherwise cause a system to crash—quickly exposing the problem and frequently providing important debugging information, such as the call stack at the point where the exception originated—instead have frequently caused system performance to degrade, eventually resulting in hangs and deadlocks. As a result, the CLR team decided that this unhandled exception policy is a poor one, and in the .NET Framework 2.0 the policy has been replaced by a new one that makes it easier for developers to track down such problems.

In the .NET Framework 2.0, all unhandled exceptions lead to process termination. The only exemptions from this rule are exceptions that are used by the CLR for flow control, including ThreadAbortException and AppDomainUnloadedException. This is a breaking change, one of relatively few that have been introduced for the new version of the .NET Framework. As such, there are two ways to roll back this behavior. First, a configuration file can be used to enable the legacy behavior:

<system>
    <runtime><legacyUnhandledExceptionPolicy enabled="1"/></runtime>
</system>

Second, an application hosting the CLR can make use of the ICLRPolicyManager::SetUnhandledExceptionPolicy method to inform the CLR of the unhandled exception policy to be used. Specifying eRuntimeDeterminedPolicy tells the runtime to use its default policy, which is to tear down the process on all unhandled exceptions. Alternatively, eHostDeterminedPolicy can be used to inform the CLR that it should revert to the behavior from the .NET Framework 1.x, where most exceptions are not treated as fatal to the process. The host can then subscribe to the relevant exception-related events on the appropriate AppDomain and implement its own unhandled exception policy accordingly.

Figure 3 Virtual Method Calls in CERs

Erroneous Virtual Method Call in CER

public void SomeMethod(BaseObject o)
{
    RuntimeHelpers.PrepareConstrainedRegions();
    try { ... } finally { o.VirtualMethod(); }
}

Fixed Virtual Method Call in CER

public void SomeMethod(BaseObject o)
{
    RuntimeHelpers.PrepareMethod(
        o.GetType().GetMethod("VirtualMethod").MethodHandle);
    RuntimeHelpers.PrepareConstrainedRegions();
    try { ... } finally { o.VirtualMethod(); }
}

Early Preparation if All Targets Known Up Front

static SomeType()
{
    foreach (Type t in 
        new Type[]{typeof(DerivedType1), typeof(DerivedType2), ...})
    {
        RuntimeHelpers.PrepareMethod(
            t.GetMethod("VirtualMethod").MethodHandle);
    }
}

public void SomeMethod(BaseObject o)
{
    RuntimeHelpers.PrepareConstrainedRegions();
    try { ... } finally { o.VirtualMethod(); }
}

Similar to PrepareMethod is PrepareDelegate, which accepts a delegate and prepares the referenced method. PrepareDelegate only prepares the specific delegate referenced and does not prepare any delegates linked to by that delegate. In other words, even if a multicast delegate is composed of multiple delegates, only the explicitly referenced one will be prepared. If a delegate will be raised from within a CER, either the delegate's invocation list must be retrieved and each individual delegate prepared, or each individual delegate should be prepared prior to being combined with the rest of the delegates. This has ramifications for events. If you have an event that you plan on raising from within a CER, you might consider supplying a custom add accessor for the event that will prepare any delegates registered with it before combining the supplied delegate with any previously registered:

public event EventHandler MyEvent {
    add {
        if (value == null) return;
        RuntimeHelpers.PrepareDelegate(value);
        lock(this) _myEvent += value;
    }
    remove { lock(this) _myEvent -= value; }
}

This is how several events on AppDomain are implemented, such as ProcessExit and DomainUnload, both of which are raised from within a CER implicitly rooted within the CLR.

It can be difficult to track down problems where PrepareMethod and PrepareDelegate are not used where they should have been used. The CLR provides several managed debug assistants (MDAs) for diagnosing such CER-related problems. See the sidebar on Managed Debug Assistants for more information.

Handling StackOverflowException

The CLR does not make any guarantees that a StackOverflowException in a try block will result in the relevant back-out code executing. To account for scenarios where back-out code must absolutely execute in the face of StackOverflowExceptions, the RuntimeHelpers class provides the ExecuteCodeWithGuaranteedCleanup method. This method accepts three parameters: a RuntimeHelpers.TryCode delegate containing the code to run in the try block, a RuntimeHelpers.CleanupCode delegate containing the code to run in the finally block, and user data that will be supplied to both delegates. An example of using this method is shown in the code in Figure 4.

Figure 4 ExecuteCodeWithGuaranteedCleanup

try
{
    RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(
        TryMethod, CleanupMethod, this);
}
catch(Exception exc) { ... } // Handle unexpected failure

...

[MethodImpl(MethodImplOptions.NoInlining)]
void TryMethod(object userdata) { ... } 

[MethodImpl(MethodImplOptions.NoInlining)]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
[PrePrepareMethod]  // For better performance if the assembly is ngen'ed.
void CleanupMethod(object userdata, bool fExceptionThrown) { ... }

ExecuteCodeWithGuaranteedCleanup erects a structured exception handling (SEH) handler within the CLR. With that protection in place, the method calls back into the managed TryCode. If a stack overflow occurs within that code, the runtime catches it and ensures that enough stack space exists to run CleanupCode correctly. Of course, you still need to make sure that the back-out code itself does not cause a stack overflow (it should not make highly recursive calls or depend on code that requires a large or unknown amount of stack space). As an aside, note that C# doesn't currently allow you to specify a custom attribute on an anonymous method, so try not to use anonymous methods for your back-out code.

CLR Hosts and Escalation Policies

When does an out-of-memory condition not result in an OutOfMemoryException? When a CLR host says it shouldn't. An unmanaged application hosting the CLR can control how the runtime reacts to certain types of situations, including resource allocation failures, resource allocation failures in critical regions of code, orphaned locks, and fatal errors within the runtime itself. Through the unmanaged ICLRPolicyManager interface, a host can control what actions the CLR takes when certain failures occur. These actions include throwing exceptions, aborting threads, unloading application domains, exiting the process, disabling the runtime, and taking no action at all (ignoring the failure). Throwing exceptions and ignoring failures are the actions used by the default CLR host. So, for example, when memory can't be allocated, the runtime throws an OutOfMemoryException, and when the runtime is alerted to an orphaned lock, it ignores the problem. With ICLRPolicyManager and its SetActionOnFailure method, a host can change these defaults. This means that when memory can't be allocated, a host might choose to have the runtime abort the thread by throwing a ThreadAbortException instead of throwing an OutOfMemoryException. When an orphaned lock is discovered, a host might choose to have the runtime unload the application domain rather than ignoring the error. You should be aware that this is possible when writing your code.

These types of policies aren't enough for all hosts. Sometimes a host needs to be able to escalate the action that's taken if the previous action wasn't enough. For example, consider what happens if ThreadAbortException is thrown within a try block, and the finally block associated with that try goes into an infinite loop. With the default runtime policy, that thread will never be aborted, and an unmanaged host that requires reliability guarantees needs some way to deal with this scenario. A solution comes in the form of the SetTimeoutAndAction method of ICLRPolicyManager (and to a lesser degree its SetDefaultAction method). This method accepts three parameters: an action, a timeout, and a response action. If the runtime is currently performing an action that has a timeout, and that timeout is reached, the runtime will upgrade the action to the response action. For example, a host could specify that all thread aborts should take no more than five seconds, and that if one does, the application domain should be unloaded. The runtime itself only has one default timeout, and that is for process exit. If an attempt to gracefully exit a process (aborting all threads, unloading all application domains, and so on) doesn't succeed within 40 seconds or so, the runtime will terminate the process.

I mentioned before that there are multiple actions a host can take in response to certain failures, including aborting threads and unloading application domains. What I didn't mention is that there are multiple levels of severity to some of these actions, including thread aborts, application domain unloads, and process exits. Up to this point, I've simply talked about thread aborts as the result of the runtime throwing ThreadAbortException on a thread. Typically, this will cause the thread to terminate. However, a thread can handle a thread abort, preventing it from terminating the thread. To account for this, the runtime provides a more powerful action, the aptly named rude thread abort. A rude thread abort causes a thread to cease execution. When this happens, the CLR makes no guarantees that any back-out code on the thread will run (unless the code is executing in a CER). Rude, indeed.

Similarly, while a typical application domain unload will gracefully abort all threads in the domain, a rude application domain unload will rudely abort all threads in the domain, and makes no guarantees that normal finalizers associated with objects in that domain will run. SQL Server 2005 is one CLR host that makes use of rude thread aborts and rude application domain unloads as part of its escalation policy. When a resource allocation failure occurs, it will be upgraded to a thread abort. And if a thread abort doesn't finish within a time span set by SQL Server, it'll be upgraded to a rude thread abort. Similarly, if an application domain unload operation does not finish within a time span set by SQL Server, it'll be upgraded to a rude application domain unload. (Note that the policies just laid out are not exactly what SQL Server uses, as SQL Server also takes into account whether code is executing in critical regions, but more on that topic shortly.)

It's important to internalize how the CLR treats graceful thread aborts differently from rude thread aborts. In the .NET Framework 1.x, there is no such thing as a rude thread abort, and thread aborts can be raised anywhere in managed code, with no protection provided by the CLR. In the .NET Framework 2.0, the CLR delays graceful thread aborts by default over CERs, finally blocks, catch blocks, static constructors, and unmanaged code. However, rude thread aborts will only be delayed over CERs and unmanaged code (the CLR has little to no control over the latter).

Rude thread aborts and rude application domain unloads are used by CLR hosts to ensure that runaway code can be kept in check. Of course, failure to run finalizers or non-CER finally blocks due to these actions presents the CLR host with new reliability problems, since there's a good chance these actions will leak the resources the back-out code was supposed to clean up. Thank goodness for critical finalizers.

Critical Finalizers and Safe Handles

A graceful thread abort will allow any relevant finally blocks to run, but that guarantee is not made for rude thread aborts (unless in a CER). A graceful application domain unload uses graceful thread aborts and will allow any relevant finally blocks and any finalizers for objects in that domain to run, but that guarantee is not made for rude application domain unloads (unless in a CER). With rude application domain unloads, the runtime makes no guarantees about back-out code or normal finalizers running. How can a system be reliable under these conditions?

The .NET Framework 2.0 introduces a new kind of finalizer, referred to as a critical finalizer. A critical finalizer is a special kind of finalizer in that it runs even during a rude application domain unload, and it runs within a CER. Usage of critical finalizers should be reserved for only the most important of finalizers, those that are necessary for security or reliability. Only a handful of classes in the .NET Framework make use of critical finalizers.

To implement a critical finalizer, simply derive the class in question from the CriticalFinalizerObject class in the System.Runtime.ConstrainedExecution namespace:

[SecurityPermission(
    SecurityAction.InheritanceDemand, UnmanagedCode=true)]
public abstract class CriticalFinalizerObject
{
    protected CriticalFinalizerObject();
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    protected override void Finalize();
}

Any finalizer you implement in your derived class will be called as part of a CER and must have reliability guarantees that match those of the reliability contract on CriticalFinalizerObject.Finalize. In other words, you should only derive from CriticalFinalizerObject if your finalizer is guaranteed to succeed and is guaranteed not to corrupt any state when invoked under a CER. When an object that derives from CriticalFinalizerObject is instantiated, the runtime prepares the Finalize method then and there, so it won't have to JIT any code (or do other preparation) when it comes time to actually run the method for the first time.
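As a concrete illustration, here's a minimal hypothetical wrapper (CriticalResource is not a BCL type) that frees a block of native memory from a critical finalizer:

```csharp
using System;
using System.Runtime.ConstrainedExecution;
using System.Runtime.InteropServices;

// Hypothetical wrapper whose cleanup must run even during a rude unload.
public sealed class CriticalResource : CriticalFinalizerObject, IDisposable
{
    private IntPtr _resource = Marshal.AllocHGlobal(64);

    public bool IsFreed { get { return _resource == IntPtr.Zero; } }

    public void Dispose()
    {
        Free();
        GC.SuppressFinalize(this);
    }

    // Runs within a CER; it must be simple, nonblocking, and
    // guaranteed to succeed without corrupting state.
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    ~CriticalResource()
    {
        Free();
    }

    private void Free()
    {
        if (_resource != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_resource);
            _resource = IntPtr.Zero;
        }
    }
}
```

Note that the deterministic path through Dispose is still preferred; the critical finalizer is the last line of defense, not the primary cleanup mechanism.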

One of the classes in the .NET Framework that derives from CriticalFinalizerObject is SafeHandle, which lives in the System.Runtime.InteropServices namespace. SafeHandle is a very welcome addition to the .NET Framework and goes a long way towards solving many of the reliability-related problems present in previous versions. At its core, SafeHandle is simply a managed wrapper around an IntPtr with a finalizer that knows how to release the underlying resource referenced by that IntPtr. Since SafeHandle derives from CriticalFinalizerObject, this finalizer is prepared when SafeHandle is instantiated, and will be called from within a CER to ensure that asynchronous thread aborts do not interrupt the finalizer.

Consider the Win32 FindFirstFile function, which is used to enumerate files in a directory:

HANDLE FindFirstFile(
    LPCTSTR lpFileName, LPWIN32_FIND_DATA lpFindFileData);

In the .NET Framework 1.x, this function would typically be declared for P/Invoke as follows:

[DllImport("kernel32.dll", CharSet=CharSet.Auto, SetLastError=true)]
private static extern IntPtr FindFirstFile(
    string pFileName, [In, Out] WIN32_FIND_DATA pFindFileData);

Unfortunately, if an asynchronous exception is thrown after FindFirstFile returns but before the resulting IntPtr handle is stored, that operating system resource is now leaked without any decent hopes for freeing it. SafeHandle comes to the rescue. In the .NET Framework 2.0, you might rewrite this signature as follows:

[DllImport("kernel32.dll", CharSet=CharSet.Auto, SetLastError=true)]
private static extern SafeFindHandle FindFirstFile(
    string fileName, [In, Out] WIN32_FIND_DATA data);

The only difference here is that I've replaced the IntPtr return type with SafeFindHandle, where SafeFindHandle is a custom type that derives from SafeHandle. When the runtime invokes a call to FindFirstFile, it first creates an instance of SafeFindHandle. When FindFirstFile returns, the runtime stores the resulting IntPtr into the already created SafeFindHandle. The runtime guarantees that this operation is atomic: if the P/Invoke method successfully returns, the IntPtr will be stored safely inside the SafeHandle. From then on, even if an asynchronous exception prevents the calling code from storing the returned SafeFindHandle reference, the underlying IntPtr is already held by a managed object whose finalizer will ensure its proper release.
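The article doesn't show SafeFindHandle itself, so here's a plausible sketch of what such a type might look like — find handles must be released with FindClose rather than CloseHandle:

```csharp
using System;
using System.Runtime.ConstrainedExecution;
using System.Runtime.InteropServices;
using System.Security;
using Microsoft.Win32.SafeHandles;

// A plausible sketch of a SafeFindHandle type. FindFirstFile returns
// INVALID_HANDLE_VALUE (-1) on failure, and a fresh handle defaults to 0,
// so the zero-or-minus-one base class covers both invalid states.
internal sealed class SafeFindHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    internal SafeFindHandle() : base(true) { }

    // Called at most once, from within a CER, when the handle is valid.
    protected override bool ReleaseHandle()
    {
        return FindClose(handle);
    }

    [SuppressUnmanagedCodeSecurity]
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr handle);
}
```

Because the constructor is internal and parameterless, only the P/Invoke marshaler (or code in the same assembly) can create one, which keeps arbitrary IntPtrs from being smuggled into the wrapper.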

Internally, the .NET Framework makes use of a large number of SafeHandle-derived types, one for each type of unmanaged resource it needs to deal with. Publicly, only a few are exposed, including SafeFileHandle (used to wrap file handles) and SafeWaitHandle (used to wrap synchronization handles). Of course, if you're dealing with any unmanaged resources not covered by these, you can create your own SafeHandle-derived types. Figure 5 shows some examples of doing just that. To make writing your own easier, the .NET Framework provides two additional public SafeHandle-derived types, SafeHandleZeroOrMinusOneIsInvalid and SafeHandleMinusOneIsInvalid. SafeHandle has to be able to tell whether the IntPtr it's storing is valid for the relevant type of resource. Since the majority of resource handles in the Win32 world are invalid when they're -1 or when they're either 0 or -1, these classes have been provided to incorporate those checks so that you don't have to implement them in your code.

Figure 5 Sample SafeHandle Implementations

SafeHandle for Heap Memory

[SecurityPermission(SecurityAction.LinkDemand, UnmanagedCode=true)]
public sealed class SafeLocalAllocHandle : 
    SafeHandleZeroOrMinusOneIsInvalid
{
    [DllImport("kernel32.dll")]
    public static extern SafeLocalAllocHandle LocalAlloc(
       int uFlags, IntPtr sizetdwBytes);

    private SafeLocalAllocHandle() : base(true) { }

    protected override bool ReleaseHandle()
    {
        return LocalFree(handle) == IntPtr.Zero;
    }

    [SuppressUnmanagedCodeSecurity]
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    [DllImport("kernel32.dll", SetLastError=true)]
    private static extern IntPtr LocalFree(IntPtr handle);
}

SafeHandle for LoadLibrary

[SecurityPermission(SecurityAction.LinkDemand, UnmanagedCode=true)]
public sealed class SafeLibraryHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    [DllImport("kernel32.dll", SetLastError=true)]
    public static extern SafeLibraryHandle LoadLibrary(
        string lpFileName);

    private SafeLibraryHandle() : base(true) { }

    protected override bool ReleaseHandle()
    {
        return FreeLibrary(handle);
    }

    [SuppressUnmanagedCodeSecurity]
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    [DllImport("kernel32.dll", SetLastError=true)]
    private static extern bool FreeLibrary(IntPtr hModule);
}

In addition to resource lifetime management, safe handles provide other benefits. First, they aid in memory management by reducing graph promotions for finalization. In the .NET Framework 1.x, a class that requires unmanaged resources typically stores the relevant IntPtr in the class, along with any other managed objects the class relies on. This class almost certainly implements a finalizer to ensure that the IntPtr is properly released. Since finalizable objects always survive at least one garbage collection, the net result is that the entire object graph starting from the class in question survives a collection, all because of that one IntPtr. Since SafeHandle now replaces that IntPtr, and since SafeHandle has its own finalizer, in most scenarios the class that stores the SafeHandle no longer needs its own finalizer. This eliminates those graph promotions, reducing the stress on the GC. Additionally, removing the finalizer typically means you can get rid of calls to GC.KeepAlive(this) in those objects.
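To illustrate, here's a sketch (all type names are illustrative) in which the only finalization burden is carried by a small SafeHandle wrapping native heap memory, leaving the outer class with no finalizer at all:

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// A stand-in unmanaged resource: a block of native heap memory wrapped
// in a SafeHandle (these are illustrative names, not BCL types).
sealed class SafeNativeBuffer : SafeHandleZeroOrMinusOneIsInvalid
{
    public SafeNativeBuffer(int byteCount) : base(true)
    {
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

// Because the buffer's lifetime is guarded by its own finalizable
// SafeHandle, this class needs no finalizer of its own, so a garbage
// collection never promotes this object (and everything it references)
// just to protect one IntPtr.
sealed class Parser : IDisposable
{
    private readonly SafeNativeBuffer _buffer = new SafeNativeBuffer(256);

    public void Dispose()
    {
        _buffer.Dispose();
    }
}
```

Only the tiny SafeNativeBuffer instance is ever promoted for finalization; the Parser and its object graph are collected normally.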

Safe handles also help plug possible security holes present due to handle recycling issues. Windows maintains internal tables that map handles to their associated kernel objects. When a handle is released, Windows is free to reuse that handle to point to a different resource. Under certain circumstances, it might be possible for an attacker to use multiple threads to take advantage of this handle recycling, closing the handle on one thread and making use of it on another, hoping that Windows will reassign the handle to a different resource, thereby allowing the attacker to gain access to a resource that might not otherwise have been available. To negate this possibility, SafeHandle implements reference counting. When a SafeHandle is passed to a P/Invoke method, the runtime increments the use count on the SafeHandle. When the P/Invoke returns, the use count is decremented. When a call to Dispose or Close is made on a SafeHandle, its use count is checked. If that count is 0, the operation is allowed to continue. If the count is greater than 0, the operation is delayed until such time that the count is 0. In this fashion, it is not possible to launch a handle recycling attack through SafeHandle. This reference counting does incur a minor expense. If for some reason that expense is too great for your situation, you can instead make use of the CriticalHandle class, also in the System.Runtime.InteropServices namespace. The CriticalHandle class is like SafeHandle except that it does not implement reference counting.

Using SafeHandle is straightforward. Consider the SafeLibraryHandle implementation shown in Figure 5. This class is used to store a handle to a library as provided by LoadLibrary. Since LoadLibrary uses reference counting to ensure that libraries are only freed after all references have been released, it's important to ensure that each call to LoadLibrary is matched by a call to FreeLibrary. And since LoadLibrary returns a handle, it's a perfect fit for SafeHandle. A very common request for the .NET Framework 1.x was to be able to use P/Invoke functionality to invoke an unmanaged function bound dynamically at run time. Figure 6 shows an example of doing just that with the .NET Framework 2.0.

Figure 6 Using a SafeLibraryHandle

[return: MarshalAs(UnmanagedType.Bool)]
private delegate bool ReturnBooleanHandler();

[DllImport("kernel32.dll")]
private static extern IntPtr GetProcAddress(
    SafeLibraryHandle hModule, string procname);
...
using (SafeLibraryHandle lib = 
    SafeLibraryHandle.LoadLibrary("user32.dll"))
{
    IntPtr procAddr = GetProcAddress(lib, "LockWorkStation");
    Delegate d = Marshal.GetDelegateForFunctionPointer(
        procAddr, typeof(ReturnBooleanHandler));
    d.DynamicInvoke();
}

Critical Regions

Critical regions are used to denote that the code running within the region is holding a lock and may be editing shared state. In essence, they're another form of lock counting.

The impact of a thread abort or of an unhandled exception within a critical region might not be limited to the current task (see the Unhandled Exceptions sidebar for a look at changes to the unhandled exceptions policy in the .NET Framework 2.0). Consider the following code snippet that makes use of a static member variable _someSharedLock:

lock(_someSharedLock) { /* do important stuff here */ }

Now consider what happens if a rude thread abort happens within the lock. The aborted thread was quite possibly editing shared state, in which case there's a good chance that state is now inconsistent. Moreover, this lock is now orphaned by the deceased thread, which was never able to exit the lock due to premature termination. Other threads could very easily deadlock since the lock on _someSharedLock will never be released. The best course of action from a reliability standpoint might be to scrap the whole application domain in which this thread was running.

To give CLR hosts that option, managed code can declare when it is about to enter and leave a critical region, one where the host might want to kill the whole application domain should something go wrong. This is possible through the new BeginCriticalRegion and EndCriticalRegion static methods on the Thread class. Hosts are alerted when a thread is terminated from within a critical region, and they can choose to handle the situation as they see fit. (Hosts that participate in managed memory allocations are also alerted when memory requests are made from within critical regions; this allows hosts to prioritize those requests in order to minimize failures in critical regions.) SQL Server takes advantage of this in its failure escalation policy, similar to the one shown in Figure 7. If a thread is aborted in a critical region, a lock is held, the code is likely to be editing shared state, and thus the action is escalated to an application domain unload.

Figure 7 Sample Host Escalation Policy


I used the C# lock keyword in my example earlier as motivation for critical regions. However, it's actually not a fair example. The C# lock keyword (SyncLock in Visual Basic) is built around the Monitor class, which already has this notification system built into it. Other locking mechanisms aren't as simple as Monitor.

Consider auto reset events. What does it mean for a thread to have "acquired" an auto reset event? Not much. As such, the runtime needs the help of a developer to know when code is executing within a critical region. The net result of this is that you should consider using Thread.BeginCriticalRegion and Thread.EndCriticalRegion any time your code makes use of EventWaitHandle (from which both ManualResetEvent and AutoResetEvent derive in the .NET Framework 2.0), Semaphore, a spin lock (see Jeffrey Richter's column in this issue of MSDN® Magazine for more information on spin locks), a custom managed lock mechanism, or any unmanaged locking mechanism you access through P/Invoke.
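For example, a hand-rolled critical section built on a Semaphore might notify the host like this (a simplified sketch; the type and member names are invented):

```csharp
using System;
using System.Threading;

// A Semaphore, unlike Monitor, gives the runtime no lock notifications of
// its own, so we bracket the lock-held code with critical-region calls.
static class SharedCounter
{
    private static readonly Semaphore _gate = new Semaphore(1, 1);
    private static int _value;

    public static void Increment()
    {
        Thread.BeginCriticalRegion();   // tell the host: entering lock-held code
        _gate.WaitOne();
        try
        {
            _value++;                   // edit shared state under the lock
        }
        finally
        {
            _gate.Release();
            Thread.EndCriticalRegion(); // shared state is consistent again
        }
    }

    public static int Value { get { return _value; } }
}
```

If a host's escalation policy kills a thread between BeginCriticalRegion and EndCriticalRegion, it knows the shared state may be inconsistent and can choose to unload the whole application domain rather than soldier on.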

FailFast and MemoryGates

CERs and critical regions let the runtime and a CLR host help make your code more reliable. Sometimes, however, something happens that's so bad you want to take matters into your own hands and kill your app as soon as possible to prevent something even worse from happening. In such a situation, you need a method that lets you end your application's process quickly. Enter FailFast.

System.Environment.FailFast is a simple method that does three things. First, it writes an event to the Windows Application log noting that a fatal error occurred. This message includes custom information supplied to FailFast as its sole string parameter. Second, FailFast causes a Watson error report and mini-dump to be generated and uploaded to the Microsoft Windows Error Reporting (WER) service. You can then use Windows Quality Online Services (winqual.microsoft.com) to access WER data for the application and analyze the data to locate the source of the problem. Finally, FailFast kills the process.
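Since FailFast never returns and runs no finally blocks or finalizers, it's typically wrapped in a guard that fires only when state is provably corrupt. A hypothetical helper might look like this:

```csharp
using System;

static class Invariants
{
    // Hypothetical guard: if an internal invariant is broken, the process's
    // state can't be trusted, so log the failure, report through WER, and
    // terminate immediately. No back-out code runs after FailFast.
    public static void Require(bool invariantHolds, string description)
    {
        if (!invariantHolds)
        {
            Environment.FailFast("Invariant violated: " + description);
        }
    }
}
```

A caller might write, say, Invariants.Require(_freeList != null, "free list head") at the top of a method that mutates critical shared structures; if the check fails, no further damage can be done.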

FailFast is good for handling situations after something has gone wrong. But what about doing checks up front to ensure that nothing will go wrong? A memory gate is a check for sufficient resources prior to initiating an activity that requires a large amount of memory. Failing the check prevents the operation from starting, reducing the probability of an application failing during execution due to lack of resources.

Memory gates are implemented in the .NET Framework 2.0 through the System.Runtime.MemoryFailPoint class. To use a memory gate, you create an instance of MemoryFailPoint, passing to its constructor the number of megabytes of memory that the upcoming operation is expected to use. If the specified amount of memory is not available, an InsufficientMemoryException (which derives from OutOfMemoryException) is raised. This can be very useful as it raises the exception before any of the operations have started rather than during them, thereby preventing dependencies on back-out code in some scenarios. Typical use of MemoryFailPoint is as follows:

using (new MemoryFailPoint(10)) // operation will require 10 MB of memory
{
    ... // perform operation
} 
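A slightly fuller sketch (the 16 MB estimate and type names are invented for illustration) shows how failing the gate leaves nothing to back out:

```csharp
using System;
using System.Runtime;

static class BatchProcessor
{
    // Gate a hypothetical memory-hungry operation: if the estimated 16 MB
    // isn't available, fail before any work or state change has begun.
    public static bool TryProcess()
    {
        try
        {
            using (new MemoryFailPoint(16))
            {
                // ... perform the operation that needs roughly 16 MB ...
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            // Nothing was started, so there's nothing to clean up;
            // the caller can simply queue the work for retry.
            return false;
        }
    }
}
```

Because InsufficientMemoryException derives from OutOfMemoryException, existing catch handlers for out-of-memory conditions will also catch a failed gate, but catching the more specific type keeps the "nothing has happened yet" case distinct from a genuine allocation failure mid-operation.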

In the current implementation of MemoryFailPoint, the system is checked for the specified amount of memory, but that amount of memory is not reserved. It's possible that another thread or process could come along and claim those resources. Even so, memory gates such as this have been employed successfully for years and are still useful for providing a deterministic checkpoint for resource consumption.

Managed Debug Assistants

Writing robust code is not easy. To help, the .NET Framework 2.0 provides Managed Debug Assistants (MDAs) for finding problems at run time in the code you've written. If you're familiar with the Customer Debug Probes (CDP) made available as part of the .NET Framework 1.x, think of MDAs as their mature older sibling. MDAs are probes built into the .NET Framework that, when enabled, check for certain conditions and alert the user when one is found. The .NET Framework provides four MDAs that relate to CERs. These MDAs can help you discover when you've written CER-related code that won't behave in the way you expect. Here's a quick guide to the new CER-related MDAs in the .NET Framework 2.0:

IllegalPrepareConstrainedRegions alerts you to a misplaced call to RuntimeHelpers.PrepareConstrainedRegions. At the MSIL level, a call to PrepareConstrainedRegions must be the instruction before the start of a try block. If a call to it is found at any other location, this MDA will fire.

InvalidCERCall fires when a CER call graph contains a call to a method with a weak reliability contract. This includes any method without an explicitly defined reliability contract or with a contract that specifies that application domain or process-wide state may be corrupted if an asynchronous exception occurs.

VirtualCERCall alerts you when a call graph within a CER is making use of a virtual target such as a call to a virtual method or through an interface. This can help remind you to verify that you've correctly made use of RuntimeHelpers.PrepareMethod in order to prepare any methods that might be the actual target for this invocation.

OpenGenericCERCall fires when a generic type with at least one reference type parameter is being used in a CER call graph. The native code the JIT compiler generates for generic instantiations over reference types is shared, but methods on generic types still lazily allocate resources (referred to as generic dictionary entries) during the method's first run. PrepareMethod can be used as a workaround for this scenario, just as with virtual and interface methods.
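For the last two MDAs, the remedy is the same: prepare the concrete target ahead of time. A small sketch (the types here are invented for illustration):

```csharp
using System;
using System.Runtime.CompilerServices;

interface IReleasable
{
    void Release();
}

sealed class FileReleaser : IReleasable
{
    public static int ReleaseCount;
    public void Release() { ReleaseCount++; }
}

static class PrepareDemo
{
    public static void Run()
    {
        // The JIT can't know at CER-preparation time which implementation
        // an interface call will dispatch to, so explicitly prepare the
        // concrete target before entering the CER.
        RuntimeHelpers.PrepareMethod(
            typeof(FileReleaser).GetMethod("Release").MethodHandle);

        IReleasable r = new FileReleaser();
        r.Release(); // inside a real CER, this call now needs no JIT work
    }
}
```

With the target prepared, the VirtualCERCall MDA's warning about the interface dispatch can be safely acknowledged, since the only implementation that can be invoked has already been compiled.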

MDAs are very useful for finding these types of problems early. Of course, this type of static analysis work could be done before run time, making liberal use of FxCop rules to analyze compiled code for these problems. Unfortunately, the Visual Studio® 2005 release of FxCop does not include any such rules (consider writing the FxCop rules a good homework exercise). For more information on MDAs, including how to enable them, see the .NET Framework documentation.

Conclusion

Writing reliable code in the face of everything that can go wrong can be a daunting task. The good news is that unless you're writing a framework or a library for use in CLR hosts that require prolonged periods of up-time, you probably won't need to think about this stuff too often. For those of you who do, the .NET Framework 2.0 provides a useful toolset to make the job possible and easier. With an understanding of how these systems operate and are used, you can write managed code that is as reliable as carefully written unmanaged code.

Stephen Toub is the Technical Editor for MSDN Magazine, for which he also writes the .NET Matters column.