CLR Inside Out

Writing Reliable .NET Code

Alessandro Catorcini and Brian Grunkemeyer

Code download available at: CLRInsideOut2007_12.exe (156 KB)

Contents

A Look at Runtime Failure
Dealing with Failure
Recycling the Host Process
Mirrored Processes
Recycling a Portion of the Process
What Is an AppDomain?
State Corruption
Detecting Shared State
Escalation Policy
Resiliency to Escalation
Writing Reliable Code
Choosing a Reliability Bar
Clean Up Code
Constrained Execution Regions
Reliability Contracts
Mitigating Failures the Hard Way
When to Use a Strong Reliability Contract
SafeHandle
Using Locks
Hard and Soft OOM Conditions

When we talk about something being reliable, we're referring to it being dependable and predictable. When it comes to software, however, there are other key attributes that must also be present for the code to be considered reliable.

Software needs to be resilient, meaning that it will continue to function in the face of internal and external disruptions. It must be recoverable, such that it knows how to restore itself to a previously known, consistent state. The software needs to be predictable, so it will provide timely and expected service. It must be undisruptable, meaning that changes and upgrades won't affect its service. And, finally, the software must be production-ready, meaning that it contains a minimal number of bugs and will require only a limited number of updates. When these criteria are met, then the software can be considered truly reliable.

These key attributes of reliable code depend on various factors—some depend on the overall architecture of the software, some depend upon the OS on which the software will run, and others depend on the tools used to develop the application and the framework on which it is built. Resilience is an attribute that relies on every layer, and an application will only be as resilient as its weakest link.

Now consider Microsoft® .NET Framework-based applications. These apps delegate to the runtime certain operations that in a native environment either did not exist (such as just-in-time compilation of IL code) or were under the direct control of the developer (such as memory management). In terms of reliability, the platform itself can introduce its own points of failure that impact the reliability of the applications that run on top of it. It's important to understand where these breakdowns can occur and what techniques you can use to create more reliable .NET-based apps.

A Look at Runtime Failure

There are certain exceptional events that can occur at any time and in any code section. These events, which we will call asynchronous exceptions, are resource exhaustions (out of memory and stack overflows), thread aborts, and access violations. (In the execution of managed code, access violations occur in the runtime.)

This last case is not very interesting—if this event does actually occur, it means that a serious bug in the implementation of the common language runtime (CLR) is being exposed and should be fixed. The first two cases, however, deserve deeper analysis.

In theory, we would imagine that resource exhaustions would be gracefully managed by the runtime and that they would never affect the ability of application code to continue running. That's just theory, though—reality is more complex.

To explain, we'll start by taking a look at how some popular server applications deal with out-of-memory (OOM) events. Server applications, such as ASP.NET and Exchange Server 2007, that require very high availability have achieved this through AppDomain and process recycling. The operating system provides a very powerful mechanism to clean up memory and most other resources used by a process—all this is done for you when the process terminates.

In a client scenario, when memory pressure gets to the point that even small allocations fail, the overall system reaches such a level of unresponsiveness due to extensive thrashing and paging that the user is much more likely to reach for the reset button or the task manager than to allow any recovery code to run. In a sense, the user's initial reaction is to perform the same action manually that ASP.NET or Exchange 2007 will do automatically.

Some OOMs may not even be caused by any particular issue with the running code. Another process running on the machine or another AppDomain running in the process may be hogging the available resource pool and causing allocations to fail. In this sense, you should consider resource exhaustions to be asynchronous in that they can occur at any time in the execution of code and they may depend on environmental factors external to and independent from the running code.

This problem is exacerbated by the fact that the runtime may allocate memory to perform operations related to its own workings. Here are a few examples of allocations that happen in the runtime that could fail in a constrained resource environment:

  • boxing and unboxing
  • delayed class loading until the first use of the class
  • remoted operations on MarshalByRef objects
  • certain operations on strings
  • security checks
  • JITing methods

This is just a partial list of the many internal operations in the runtime that may need to allocate resources, but it should give you an idea of why it is not really practical to predict and mitigate the consequences of any specific allocation failure.

In the family of the asynchronous exceptions, thread aborts have a special role. Thread aborts are not errors due to resource exhaustions (like OOMs and SOs), but they too can happen at any time. If you abort a thread from a different thread, the point on the aborted thread where the exception will be raised is completely random.

Stack overflows also have their own idiosyncrasies. Stack space is reserved per thread and committed eagerly, so it should always be possible to avoid any competition for the resource. But there are problems with this. It is a guessing game to predict how much stack is enough for every application. The OS limits the amount of stack space per thread. There are issues with presenting an exception via Structured Exception Handling (SEH) when the thread is low on stack space due to unwinding issues. And reentrancy and recursion rule out computing a finite upper bound on the stack space required by a method.

In a nutshell: it is practically impossible to forecast when OutOfMemoryException, StackOverflowException, and ThreadAbortException might occur. Thus, writing backout code to attempt to recover from asynchronous exceptions is not practical on a large scale.

You may be wondering whose responsibility it is to make the application reliable if the runtime cannot guarantee resilience to asynchronous exceptions. The ASP.NET example we just discussed hinted at the answer. While the application code is responsible for handling the common synchronous exceptions, it cannot handle asynchronous ones; those must be handled by the host process. In the case of ASP.NET, the host contains the logic that triggers process recycling when memory consumption goes beyond a known threshold.

In other more sophisticated cases, hosts like SQL Server™ 2005 utilize the CLR's hosting APIs to decide whether to abort the managed thread running a transaction and roll it back, unload an AppDomain, or even suspend the execution of all managed code on the server. The default policy for an application is that the host process will be killed, but there are several techniques you can use to embrace and extend this approach or to override this behavior. For further reading on the CLR's hosting APIs, see the August 2006 CLR Inside Out column (msdn.microsoft.com/msdnmag/issues/06/08/CLRInsideOut).

Dealing with Failure

By now, you've probably identified the key to solving this problem. A resilient application is one that isolates the work in units that can independently fail without affecting the other units. So far, three models have been proven successful at creating resilient, managed applications.

The first model is to make the process itself the unit of failure and isolate the managed code execution in one or more worker processes that can die and spawn liberally.

The second is to keep two redundant processes working in parallel doing the processing, with one active and the other dormant. On failure, the dormant process takes over and spawns another dormant process to act as a backup in case of another failure.

The third model is to make the AppDomain the unit of failure and ensure that the process is never affected by any failure occurring in managed code or in the runtime.

We'll look at these three approaches in more depth, analyzing the cost of implementing each one and the design choices each one involves.

Recycling the Host Process

Say we accept the fact that resource exhaustion could tear down the process hosting the CLR. And say the system is built in such a way that the work is isolated in one or more child processes supervised by a master process whose job it is to manage the lifetime of the worker processes. Under these circumstances, we have a cheap and effective solution with which to provide a reliable system. The system will not be resilient, but it will be fully recoverable and predictable in its behavior.

This approach is ideal whenever you are handling a large number of independent stateless requests, such as Web requests in ASP.NET, and you want to isolate the execution of their processing. If an asynchronous exception is raised, a worker process dies and the requests being worked on will simply not be serviced; they need to be resubmitted. This, however, makes the approach ill-suited for long and expensive operations in which the cost of resubmitting the job may be too high.

Still, this is the cheapest way to provide a reliable service running managed code. You are effectively making use of the runtime's default behavior.
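The master/worker arrangement described above can be sketched roughly as follows. This is a minimal illustration rather than code from the article's download: the Supervisor class and the startWorkerAndWait callback are hypothetical names, and the callback stands in for whatever launches a worker process, waits for it to exit, and returns its exit code.

```csharp
using System;

public static class Supervisor
{
    // startWorkerAndWait is a hypothetical callback that launches one worker
    // process, blocks until it exits, and returns its exit code (0 = clean exit).
    // The return value is the number of restarts that were needed.
    public static int RunUntilCleanExit(Func<int> startWorkerAndWait, int maxRestarts)
    {
        int restarts = 0;
        while (true)
        {
            int exitCode = startWorkerAndWait();
            if (exitCode == 0)
            {
                return restarts;
            }
            if (restarts >= maxRestarts)
            {
                throw new InvalidOperationException(
                    "Worker kept failing; last exit code was " + exitCode + ".");
            }
            // The worker died. Its in-flight requests are lost and must be
            // resubmitted; recycle by starting a fresh worker.
            restarts++;
        }
    }
}
```

The supervisor itself holds no request state, which is what keeps it simple: everything of value either lives in the worker, and is discarded with it, or is resubmitted by the client.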

Mirrored Processes

IT shops make extensive use of redundancy for everything from local disk drives to entire servers. If a disk or server fails, a second one that is in sync with the first quickly takes over.

A similar approach can be taken with a process. You can design software that runs two copies of a process on the same machine, each receiving the same input and producing the same output. Under circumstances where the main process fails due to a temporary fault in that specific process (as opposed to a reproducible bug that will affect all instances of the process), this model will provide some resiliency. You should also use a transacted store in this scenario to ensure that any failed requests can be safely rolled back.

The National Aeronautics and Space Administration (NASA) uses a model like this for the computers on the space shuttle. When life-and-death operations depend on a computer, some degree of redundancy is a must.

NASA actually ran into a problem when using just two computers in this scenario: a situation occurred where the two computers disagreed on the result of a computation. Which one was right? With only two computers, you can't know which is correct in the event of a single unit failure. So, NASA added a third machine, and if two computers agree but the third returns a different result, the third is considered to be broken.

This level of redundancy is nice—unless one of your three boxes fails, because then you're back in the scenario where only two are available and you don't know which is correct in the event of a second failure. So NASA added a fourth, then talked itself into adding a fifth. Clearly, resolving non-crashing failures in a pair of mirrored processes is not simple.

Another problem with this approach is that if you run multiple processes on the same machine and one of them exhausts a system-level resource, there's a reasonable chance that the other process will need the same resource simultaneously. And this may cause the reserve process to fail in the same manner as the first. In fact, they may even both be competing with one another for the same resource.

Recycling a Portion of the Process

For a high level of resiliency, tearing down a process and restarting it or failing over to another process just won't do. What you really need to do is find the part of the application that failed and recycle that part. This requires isolating various parts of your application's process into recyclable chunks. Operations must either be stateless or they must use a transactional system to ensure that no writes occur or that they get backed out. Additionally, all resources in use must be freed when you recycle a part of the process.

When thinking about long-living servers, think about state corruption and lack of consistency. Consistency can apply to several different layers. While a simple linked list might be consistent, a complex data structure will require additional invariants. If a consumer has an invariant that all elements in a linked list also must be stored as values in a hash table, then consistency in the linked list doesn't mean the application is consistent. For this reason, you must treat even the slightest possibility of corruption that breaks invariants as a problem. If an asynchronous exception occurs, how much of the state may potentially be corrupted and how can a server be resilient to this corruption?
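To make the linked list and hash table example concrete, here is a small sketch of what checking such an application-level invariant might look like. The InvariantCheck class and its IsConsistent method are hypothetical names chosen for illustration, and Dictionary<TKey, TValue> stands in for the hash table.

```csharp
using System.Collections.Generic;

public static class InvariantCheck
{
    // Hypothetical application-level invariant: every element of the linked
    // list must also be stored as a value in the table. Each collection can be
    // internally consistent on its own while this check still fails.
    public static bool IsConsistent<TKey, TValue>(
        LinkedList<TValue> list, Dictionary<TKey, TValue> table)
    {
        var values = new HashSet<TValue>(table.Values);
        foreach (TValue element in list)
        {
            if (!values.Contains(element))
            {
                return false;
            }
        }
        return true;
    }
}
```

If a thread is interrupted after adding to the list but before adding to the table, both collections remain usable, yet IsConsistent returns false. That gap is the kind of corruption the surrounding text is concerned with.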

To carve an application into recyclable chunks, you must isolate operations from one another. When an asynchronous exception occurs, it may have caused state corruption. To avoid recycling the entire process, you need to have encapsulated a smaller portion of your process, including the set of failed operations and all relevant state information.

What Is an AppDomain?

An application domain, or AppDomain for short, is a sub-process unit of isolation for managed code. Most assemblies can be loaded into an AppDomain. When the AppDomain is unloaded, the assemblies can often be unloaded as well.

Each AppDomain gets its own copy of static variables. While threads can cross AppDomain boundaries, it is (almost) impossible to initiate communication across an AppDomain boundary without using .NET remoting or a similar technique. AppDomains give you a relatively firm boundary to contain code.

If an asynchronous exception occurs on a thread, you must decide how much corruption might be present and how to be resilient to that specific level of corruption.

State Corruption

There are three buckets that state corruption may fall into. The first is local state, which includes local variables and heap objects that are only used by a particular thread. The second is shared state, which includes anything shared across threads within the AppDomain, such as objects stored in static variables. Caches often fall into this category. The third is process-wide, machine-wide, and cross-machine shared state—files, sockets, shared memory, and distributed lock managers fall into this camp.

The amount of state that can be corrupted by an async exception is the maximum amount of state a thread is currently modifying. If a thread allocates a few temporary objects and doesn't expose them to other threads, only those temporary objects can be corrupted. But if a thread is writing to shared state, that shared resource may be corrupted, and other threads may potentially encounter this corrupted state. You must not let that happen. In this case, you abort all the other threads in the AppDomain and then unload the AppDomain. In this way, an asynchronous exception escalates to an AppDomain, causing it to unload and ensuring that any potentially corrupted state is thrown away. Given a transacted store like a database, this AppDomain recycling provides resiliency to corruption of local and shared state.

Detecting Shared State

Figuring out whether a thread is modifying shared state is not the easiest problem. It's not obvious in most code whether the value of a local variable on the stack was initialized locally, whether it was passed in as a parameter to the method, or whether it simply refers to an object that is stored in a static variable (or is reachable from a static variable).

A subtle hint about whether code is modifying shared state comes from the concurrency space: whenever you edit a static variable, your code almost inevitably holds a lock. So keeping a lock count on each thread informs our escalation policy as to whether a thread may potentially be editing shared state. It's true that locks are often taken while merely reading from shared state, and that no state corruption occurs while reading (assuming there is no lazy initialization), but this pessimistic heuristic is the tightest we have found that does not require an unacceptable additional level of discipline from users.
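The lock-count heuristic can be illustrated at user level with a sketch like the one below. This is not how the CLR tracks locks internally; LockTracker, RunLocked, and the EscalationAction enum are hypothetical names, and the decision rule simply mirrors the text: a thread holding no locks is assumed to touch only local state, while a thread holding any lock is assumed to be editing shared state.

```csharp
using System;

public enum EscalationAction { AbortThread, UnloadAppDomain }

public static class LockTracker
{
    // One counter per thread, as described in the text.
    [ThreadStatic]
    private static int t_locksHeld;

    public static int LocksHeld { get { return t_locksHeld; } }

    // Runs body under the lock and keeps the per-thread count accurate,
    // including when body throws.
    public static void RunLocked(object gate, Action body)
    {
        lock (gate)
        {
            t_locksHeld++;
            try
            {
                body();
            }
            finally
            {
                t_locksHeld--;
            }
        }
    }

    // Pessimistic heuristic: any held lock means shared state may be
    // mid-update, so the whole AppDomain would have to go.
    public static EscalationAction DecideForCurrentThread()
    {
        return t_locksHeld > 0
            ? EscalationAction.UnloadAppDomain
            : EscalationAction.AbortThread;
    }
}
```

The heuristic is deliberately conservative: a thread that only reads under a lock is still treated as a potential writer.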

You can safely use interlocked operations, since each one edits only a single piece of shared state and does so atomically. Knowing whether they succeed is also easy: they either do or they do not.
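As a brief illustration of that all-or-nothing behavior, the sketch below uses Interlocked.CompareExchange to claim a slot. The SlotOwner class and TryClaim method are hypothetical names; the convention that 0 means "unowned" is an assumption of this example.

```csharp
using System.Threading;

public static class SlotOwner
{
    // Atomically claims the slot for ownerId if it is currently unowned (0).
    // CompareExchange returns the value that was in the slot beforehand, so
    // the caller learns immediately whether the single write happened.
    public static bool TryClaim(ref int slot, int ownerId)
    {
        return Interlocked.CompareExchange(ref slot, ownerId, 0) == 0;
    }
}
```

There is no intermediate state for an asynchronous exception to expose: either the slot holds the new owner or it still holds the old value.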

Escalation Policy

The CLR's escalation policy goes a bit further. We'd like to attempt to give user code a chance to clean up when aborting threads. So the CLR attempts to run finally blocks and finalizers when aborting threads and before the AppDomain is unloaded. But there is tension caused by the user's desire to run arbitrary cleanup code and the host's availability needs. You can impose some timeout on running finally blocks and finalizers and simply abort them if they do not finish within a reasonable time.
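The idea of bounding cleanup time can be sketched at user level as shown here. This is only an analogy to what a host does through the CLR's hosting APIs: CleanupWithTimeout and TryRun are hypothetical names, and unlike a host, this code cannot abort a cleanup routine that overruns. It can only stop waiting and report that escalation is needed.

```csharp
using System;
using System.Threading.Tasks;

public static class CleanupWithTimeout
{
    // Runs cleanup on a thread-pool thread and waits up to timeout for it.
    // Returns true if cleanup finished in time, false if the caller should
    // escalate. If cleanup throws, Wait rethrows it as an AggregateException.
    public static bool TryRun(Action cleanup, TimeSpan timeout)
    {
        Task task = Task.Run(cleanup);
        return task.Wait(timeout);
    }
}
```

A caller might treat a false result the way the escalation policy treats a stuck finalizer: give up on graceful cleanup and move to a more drastic action.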

A harder question, though, is how to be resilient if an asynchronous exception occurs while accessing a process-wide, machine-wide, or cross-machine piece of state. We'll discuss this in more detail later.

Resiliency to Escalation

Escalation policy also imposes limits on code, whether written by users or by library authors. For SQL Server, the CLR allows stored procedures to be written in managed code, with some very tight restrictions on what that code can express. For scalability, reliability, and security reasons, user code in SQL Server should not launch or terminate threads, it should minimize or completely avoid shared state, and it should not be allowed to access certain types of OS resources. However, trusted libraries such as the .NET Framework must access these resources, often on behalf of this relatively untrusted user code.

The CLR provides code access security as a first line of defense to tweak the set of permissions given to user code. However, the CLR does not include permissions for all interesting resource types. For this purpose, the CLR defines the HostProtectionAttribute attribute, which can be used to mark methods that raise programming model concerns, such as by having the ability to kill threads.

These limitations on user code are actually a very good thing because by restricting user code from accessing OS resources directly (along with other limitations), user code is freed from the responsibility of tracking its use of these resources.

In the process recycling world, whenever a process is killed, the OS frees all of the machine-wide resources used by the process. For AppDomain recycling to truly provide resiliency, AppDomain unloading must provide the same level of guarantees. Since an AppDomain is just a unit within the process, managed libraries that provide access to resources must fill the gap between the operating system's ability to clean up on process exit and the demands of AppDomain unloading.

Writing Reliable Code

AppDomain unloading must be clean. This is the guiding principle for writing libraries resilient to async exceptions. For a transacted host like SQL Server that guarantees its own consistency, all other concerns about user and library code (like correctness, performance, and maintainability) are secondary to the host's ability to guarantee its availability. If user code has a bug that occasionally causes a crash, the server will live on as long as it can make forward progress and doesn't degrade over time.

Clean AppDomain unloading and careful resource management allow your code to fight against the CLR's escalation policy, enabling you to identify code that must have an opportunity to run to ensure consistency of a resource or to ensure that no resources are leaked. In most cases, an asynchronous exception will have already happened or will be unpreventable—these are the tools your code will need to become resilient to failure. The most important advice for library writers is to use SafeHandle, and it's critical that you understand several other available features so you can fully grasp why to use SafeHandle.

Choosing a Reliability Bar

Not all code is equally critical. You should think about what level of reliability is required from a certain block of code. The techniques we describe can increase development costs, so a good engineer should determine how much resiliency is necessary for a block of code.

First, ask yourself what your code should do when a power failure occurs. As a starting point, all code obviously must be able to restart and work correctly after a power failure. Even if client applications will lose work and encounter data corruption, they still must, at a bare minimum, be able to start back up.

Designing mail servers to ensure they don't lose e-mail when a power failure occurs is a significantly harder problem. Similarly, software controlling a nuclear power plant must be able to tolerate such failures with greater resilience than, say, a basic productivity app. Figuring out where your application falls on this continuum will help inform your decisions on how much to invest in resiliency.

For most client applications, the surprising answer is to do very little. Surviving an async exception is overkill for most client apps. Killing the process and restarting is often sufficient. And when using the Windows Vista® Restart Manager APIs, this approach can help limit how much state is lost when the client app crashes. Outlook® offers a great example. If Outlook 2007 crashes on Windows Vista, it can recover and reopen all windows to the right spot. If you were composing a message when Outlook crashed, you might lose only the last minute or two of typing, rather than the entire message.

For libraries, the reliability bar is determined by the most aggressive host in which your code will run. If your library is used by hosts that recycle processes, your reliability needs are less than for a host that recycles AppDomains. However, if your library allows accesses to resources without using some of the techniques we describe, and your code is used in a host that recycles AppDomains, your library can eventually cause that host to fail.

Clean Up Code

You should use try/finally blocks and finalizers to clean up resources. As a first approximation, these tools give developers a simple way to restore state to a reasonable level of consistency. Try/finally blocks (and language-specific keywords like the C# "using" statement) will improve correctness and performance by ensuring that resources are cleaned up or disposed of at deterministic points in your code.
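A small sketch of deterministic cleanup with the C# using statement follows. TrackedResource and CleanupDemo are hypothetical names used only to make the effect observable; the exception here is an ordinary synchronous one.

```csharp
using System;

public sealed class TrackedResource : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose()
    {
        IsDisposed = true;
    }
}

public static class CleanupDemo
{
    // Uses the resource, fails partway through, and returns the resource so
    // the caller can observe that Dispose still ran.
    public static TrackedResource UseAndFail()
    {
        var resource = new TrackedResource();
        try
        {
            // "using" expands to a try/finally that calls Dispose.
            using (resource)
            {
                throw new InvalidOperationException("simulated synchronous failure");
            }
        }
        catch (InvalidOperationException)
        {
            // Swallowed only so the demo can return the resource.
        }
        return resource;
    }
}
```

The finally block generated by using runs before the exception reaches the catch, so the resource is released at a known point instead of whenever a finalizer happens to run.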

The problem with this approach, though, is that the CLR cannot easily guarantee that the code in your finally block will complete. While the CLR attempts to avoid aborting threads while running in finally blocks, resource exhaustion and thus async exceptions can occur at any time.

Likewise, we have no way of ensuring that any given finalizer will complete. Additionally, the CLR's escalation policy allows a host to abort threads in finally blocks and finalizers in case a finally block goes into an infinite loop or blocks indefinitely.

Constrained Execution Regions

A constrained execution region, or CER, is a reliability primitive that helps enable code to maintain consistency. This is best for when you are willing to work extremely hard to guarantee some limited amount of forward progress or ability to undo changes to an object. Constrained execution regions are a best-effort attempt from the runtime to run your code. With CERs, the runtime will hoist any CLR-induced failures to predictable locations. This doesn't guarantee that your code will run—you can't sprinkle CERs like fairy dust and expect your code to magically work—but if you're writing code with some rigid constraints, there is a good chance your code will run to completion.

CERs are exposed in three forms:

  • With ExecuteCodeWithGuaranteedCleanup, which is a stack-overflow safe form of a try/finally.
  • As a try/finally block preceded immediately by a call to RuntimeHelpers.PrepareConstrainedRegions. In this case, the try block isn't constrained, but all catch, finally, and fault blocks for that try are constrained.
  • As a critical finalizer. Here, any subclass of CriticalFinalizerObject has a finalizer that is eagerly prepared before an instance of the object is allocated.

(Note that there is a special case: SafeHandle's ReleaseHandle method, a virtual method that is eagerly prepared before the subclass is allocated and called from SafeHandle's critical finalizer.)

To give your code an opportunity to execute, the CLR will eagerly prepare your code, meaning it will JIT the statically discoverable call graph for your method. If you use delegates within a CER, the target method of the delegate must be eagerly prepared, preferably by calling RuntimeHelpers.PrepareMethod() on the method. Additionally, if you use ngen, you can mark the method with the PrePrepareMethodAttribute to eliminate some need to pull in the JIT. This eager preparation will eliminate CLR-induced out-of-memory exceptions.
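Preparing a delegate's target ahead of time might look roughly like this. RuntimeHelpers.PrepareMethod is the API named above; the EagerPreparation wrapper is a hypothetical helper, and the sketch assumes a single-target delegate whose method is not generic and not declared on a generic type.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class EagerPreparation
{
    // Asks the runtime to compile the delegate's target method now, so that
    // invoking it later (for example from a constrained finally block) does
    // not need to pull in the JIT at that point.
    public static Action Prepare(Action callback)
    {
        if (callback == null)
        {
            throw new ArgumentNullException("callback");
        }
        RuntimeHelpers.PrepareMethod(callback.Method.MethodHandle);
        return callback;
    }
}
```

The call belongs in setup code that runs before the region you care about, where a failure to prepare can still be handled normally.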

With regard to thread aborts, the CLR will disable these over CERs. The CLR will also probe for some stack, if and only if the CLR's host (for example, any native app that uses the CLR's COM hosting interfaces to launch the CLR) wants to survive stack overflow. Barring bugs in the runtime itself, heap corruption, and hardware problems, this should eliminate all CLR-induced asynchronous exceptions caused by your code.

Reliability Contracts

You may have noticed that we keep drawing a line between CLR-induced failures and all other failures. Failures can occur at all levels of the system since each layer of the system has a different view of consistency. Consider an application that wants to maintain a list of items in sorted order. If you use an unsorted collection like List<T>, your code may look like this:

public static void InsertItem<T>(List<T> list, T item)
{
    // List<T> isn't sorted, but the app relies on it being sorted.
    list.Add(item);
    list.Sort();
}

Now imagine that all the methods on List<T> magically succeeded all the time. We'll use the hypothetical ReliableList<T> to imply this behavior. Imagine that Add never needs to grow the size of the list and the Sort routine always works (even with its indirect calls into comparers and CompareTo methods on the generic type T). If a ThreadAbortException occurs in InsertItem after the call to Add completed but before the call to Sort, we still run into a consistency problem.

From the perspective of ReliableList<T>, the list is fully consistent. The internal data structures in the list are in a perfectly fine state in that there is no extra item half-added to the collection. And we can continue to use the list without any problems. But, from the perspective of the application (remember that it required the list to be sorted), this invariant has been broken and the method hasn't been written to recover from this problem.

This scenario illustrates that consistency exists at various levels in the system. This is indicated to users via reliability contracts—a very coarse-grained mechanism for stating the extent of corruption if an asynchronous exception occurs. In this case, the method corrupted the list, and if we admit the possibility of an OutOfMemoryException during the Add method call, then we have to mark this method as potentially failing, like this:

[ReliabilityContract(Consistency.MayCorruptInstance, Cer.MayFail)]
public static void InsertItem<T>(ReliableList<T> list, T item)

Reliability contracts, which are required on all code called within CERs, serve primarily as documentation to the developer about whether to expect the possibility of a failure. Here, reliability contracts could do a better job of indicating corruption of parameters versus a "this" pointer, but these are really meant to start a reliability discussion between the author of the InsertItem method and the person calling it. In this case, someone calling InsertItem<T> will notice that it may cause corruption of an instance if it fails, so they will know that they need a mitigation strategy.

Mitigating Failures the Hard Way

Constrained execution regions offer some ways to mitigate failures in code like InsertItem. The options vary in complexity and some techniques may not work well if the code changes significantly from one version to another.

The most obvious technique is to try to understand the failure and avoid it. In this sample, it may be possible to hoist allocations to a place where you can recover from the failure, such as allocating the list with sufficient capacity to begin with. This would hoist any allocation failures. However, this requires that you understand InsertItem<T>, List<T>, and how they both fail.

Additionally, this technique doesn't solve the thread abort problem. The list may still not be sorted. Here, a CER could be used to ensure that the finally block runs, so that the list is sorted all the time. Regardless of whether Add fails or not, we ensure consistency of the list by guaranteeing that it is sorted. This requires a strong reliability guarantee on our fictitious ReliableList<T>'s Sort method. But with that in place, we could write our code as shown in Figure 1, offering a stronger consistency guarantee in InsertItem's reliability contract.

Figure 1 A Stronger Consistency Guarantee

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public static void InsertItem<T>(ReliableList<T> list, T item)
{
    RuntimeHelpers.PrepareConstrainedRegions(); // CER marker
    try
    {
        // Add can fail with an OutOfMemoryException.
        // That's fine, since the list didn't change.
        list.Add(item);
        // Consider a failure here, perhaps from calling a second method.
    }
    finally
    {
        // Restore consistency by guaranteeing the list is sorted.
        list.Sort();
    }
}

Keep in mind that these techniques don't work in a real-world implementation of this example since ReliableList<T> doesn't really exist. It would be particularly difficult to write due to the nature of sort algorithms—you need any comparer used by Sort and most likely T's CompareTo method to also be reliable.

But there's another technique similar to the recycling model. If you can live with the performance hit, you can copy the data, make your changes on your copy, and then expose your consistent data structure to the world. Here's one way to do this, complete with compromises to the method signature:

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public static void InsertItem<T>(ref List<T> list, T item)
{
    // Forward progress
    List<T> copy = new List<T>(list); // expensive copy operation
    copy.Add(item);
    copy.Sort();

    // Expose the results
    list = copy;
}

Here, we can use the real List<T>, but be sure to note that we take a significant performance hit by copying the data. Additionally, using a ref parameter here isn't really a preferable solution. However, a strong reliability contract saying "will not corrupt state" is often highly desirable.

Another technique is to try to make forward progress and provide backout code in case of a failure. This is similar to constructing a transaction around your changes and providing code to roll back the changes made in the transaction.
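A minimal sketch of the forward-progress-with-backout pattern appears below. The Catalog class and its Register method are hypothetical, and the example deals only with an ordinary synchronous failure (a duplicate key). Protecting the backout step itself against asynchronous exceptions would still require the CER techniques discussed earlier.

```csharp
using System.Collections.Generic;

public static class Catalog
{
    // Keeps two structures in step: an ordered list and a by-name index.
    // If the second update fails, the first is rolled back so the caller
    // never observes an item that is in the list but not in the index.
    public static void Register<T>(
        List<T> order, Dictionary<string, T> byName, string name, T item)
    {
        order.Add(item);               // forward progress, step 1
        bool committed = false;
        try
        {
            byName.Add(name, item);    // step 2: throws on a duplicate name
            committed = true;
        }
        finally
        {
            if (!committed)
            {
                // Backout: undo step 1. The item is still last because
                // nothing else touched the list in between.
                order.RemoveAt(order.Count - 1);
            }
        }
    }
}
```

Compared with the copy-and-swap version, this avoids duplicating the data, at the price of having to reason about exactly which steps have completed when the failure occurs.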

When to Use a Strong Reliability Contract

The previous examples show that a strong reliability contract is difficult to implement efficiently. Using CERs is like rocket science, since you need to be prepared for failures that can occur anywhere. Note that for CERs exposed as an annotated try/finally block, the try block is not a CER, so even multiple assignment operations may be interrupted by an async exception. For this reason, CERs are not for most people—better choices are to rely on a host's escalation policy, or for client applications, recycling the entire application.

That said, CERs are truly required in places that manipulate machine-wide or process-wide state, such as a shared memory segment. And in these cases, business needs often require that these applications be written to recover from power failures anywhere in the code, so you not only have to write CERs, but you also have to be very careful about consistency throughout significant parts of the application. This complexity is why you don't want your summer intern writing control software for a nuclear power plant.

SafeHandle

You might be wondering whether you can safely return values from a method in the presence of asynchronous exceptions. You cannot safely do this! When you call an unhardened P/Invoke method like CreateFile that returns an IntPtr and then assign the return value to a local variable, two distinct machine instructions are generated, and threads can be aborted between any two machine instructions. In situations like this, using IntPtr to represent OS handles is fundamentally unreliable. But there is a solution.

When you access an OS resource, such as a file, socket, or event, you get a handle for that resource, and that resource must eventually be freed. SafeHandle ensures that if you can allocate a resource, you can free that resource. To provide this guarantee, SafeHandle aggregates multiple CLR features.

You define a subclass of SafeHandle and provide an implementation of the ReleaseHandle method. This method is a CER, called from SafeHandle's critical finalizer. Assuming that your ReleaseHandle method obeys all the CER rules, you are guaranteed that if you can successfully allocate an instance of a native resource, your ReleaseHandle method will get a chance to run.
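A minimal sketch of such a subclass might look like this. The UnmanagedBufferHandle type and its Allocate method are hypothetical; the sketch wraps memory from Marshal.AllocHGlobal rather than an OS handle so that it stays self-contained, and it omits the reliability contract that a production ReleaseHandle on the .NET Framework would normally carry.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical handle type: owns a block of unmanaged memory.
public sealed class UnmanagedBufferHandle : SafeHandle
{
    private UnmanagedBufferHandle() : base(IntPtr.Zero, true) { }

    // A zero pointer means there is nothing to release.
    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    public static UnmanagedBufferHandle Allocate(int byteCount)
    {
        // Create the wrapper first so a failure here leaks nothing.
        UnmanagedBufferHandle result = new UnmanagedBufferHandle();

        // Assumption: this sketch still has a small window between the
        // allocation returning and SetHandle storing the pointer. A P/Invoke
        // that returns a SafeHandle directly does not have that window.
        result.SetHandle(Marshal.AllocHGlobal(byteCount));
        return result;
    }

    // Called at most once, from Dispose or from the critical finalizer,
    // and only when the handle is valid.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

A caller would typically wrap the handle in a using statement; if the caller forgets, the critical finalizer still releases the memory.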

SafeHandle also offers some key security features. It prevents handle recycling attacks by ensuring that no thread can free a handle while another thread is actively using it. It also prevents a subtle race condition between a class using a handle and its own finalizer. SafeHandle is also integrated with the P/Invoke marshaling layer. For any method that returns or consumes a handle, you can replace that handle with a subclass of SafeHandle. CreateFile, for example, would return a SafeFileHandle, and WriteFile would take a SafeFileHandle as its first parameter.

Using Locks

Locks must be taken on the right types of objects. Consider the C# lock keyword, which expands into a call to Monitor.Enter(Object), followed in a finally block by Monitor.Exit(Object). Locks should have a strong sense of identity. Unboxed value types do not fit the bill, as they would get boxed into a new object each time you pass them as an Object. But the CLR also shares certain types across AppDomain boundaries in certain cases. Taking locks on Strings, Type objects, CultureInfo instances, and byte[] instances may end up taking a lock across AppDomain boundaries. Similarly, taking a lock on any subclass of MarshalByRefObject from outside of that class's implementation may lock on a transparent proxy instead of the real object in the correct AppDomain, meaning you might not take the right lock! In naïvely written code, if an async exception occurs, the finally block to release the lock will not run and you may cause other threads to block indefinitely.
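One way to follow this advice is to lock on a private object that exists only to serve as the lock. The SafeCounter class below is a hypothetical example; the point is that the lock target is a reference type with a stable identity that no code outside the class, and no other AppDomain, can reach.

```csharp
using System;

public sealed class SafeCounter
{
    // Private, dedicated lock object: never a string, Type, value type,
    // or anything handed out to callers.
    private readonly object _sync = new object();
    private int _value;

    public int Increment()
    {
        // Expands to Monitor.Enter / Monitor.Exit in a try/finally.
        lock (_sync)
        {
            _value++;
            return _value;
        }
    }
}
```

Each call to Increment returns the next value in sequence, regardless of how many threads call it concurrently.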

Do not define your own locks unless you are a true expert when it comes to memory models and concurrency and you have demonstrated a need for a better lock. Beyond the obvious pitfalls like alertable waits and how to spin on hyper-threaded CPUs, your lock must also cooperate with the CLR's escalation policy, which needs to keep track of whether each thread is holding a lock. Thread.BeginCriticalRegion and EndCriticalRegion can help the CLR identify when third-party locks are acquired and released.
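For illustration only, here is roughly how a custom lock could bracket its ownership with these two calls. TinySpinLock is a hypothetical type and deliberately simplistic (no reentrancy, no ownership tracking, no fairness), so treat it as a sketch of where the calls go rather than a lock to ship.

```csharp
using System;
using System.Threading;

// Hypothetical, deliberately minimal lock; not a replacement for Monitor.
public sealed class TinySpinLock
{
    private int _taken; // 0 = free, 1 = held

    public void Enter()
    {
        // Tell the CLR this thread is about to hold a lock, so a failure
        // here may affect shared state rather than just this thread.
        Thread.BeginCriticalRegion();

        SpinWait spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
        {
            spinner.SpinOnce();
        }
    }

    public void Exit()
    {
        Volatile.Write(ref _taken, 0);

        // The lock is released; the thread leaves the critical region.
        Thread.EndCriticalRegion();
    }
}
```

Callers would pair Enter with Exit in a try/finally, just as the lock keyword does with Monitor.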

Hard and Soft OOM Conditions

There is another mitigation technique besides AppDomain unloading for out-of-memory errors. The MemoryFailPoint class will attempt to predict whether a memory allocation will fail. You allocate a MemoryFailPoint for X megabytes of memory, where X is an upper bound on the expected additional working set for processing one request. You then process the request and call Dispose on the MemoryFailPoint.

If there is not enough memory available, the constructor throws an InsufficientMemoryException, which is a different exception type used to express the concept of a soft OOM. Apps can use this to throttle their own performance based on available memory. The exception is thrown before any memory is actually allocated, so shared state has not been corrupted and there is no need for an escalation policy to kick in.

MemoryFailPoint does not reserve memory in the sense of reserving or committing physical pages of memory, so this technique is not iron-clad—you can get into races with other heap allocations in the process. But this technique does maintain an internal process-wide reservation count to keep track of all threads using a MemoryFailPoint in the process. And we believe this can reduce the frequency of hard OOMs requiring invocation of the framework's escalation policy.
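The allocate, process, and dispose pattern described above can be sketched like this. RequestGate and TryProcess are hypothetical names, and the megabyte estimate is something the caller has to supply from its own measurements.

```csharp
using System;
using System.Runtime;

public static class RequestGate
{
    // Hypothetical wrapper: runs the work only if the estimated memory
    // appears to be available, and reports a soft OOM otherwise.
    public static bool TryProcess(int estimatedMegabytes, Action work)
    {
        MemoryFailPoint gate;
        try
        {
            // Throws before any of the request's memory has been allocated.
            gate = new MemoryFailPoint(estimatedMegabytes);
        }
        catch (InsufficientMemoryException)
        {
            // Soft OOM: nothing was touched, so the caller can throttle or retry.
            return false;
        }

        // Dispose releases this request's share of the process-wide count.
        using (gate)
        {
            work();
        }
        return true;
    }
}
```

A server might call RequestGate.TryProcess(64, HandleRequest) and queue the request for later when it returns false. Constructing the MemoryFailPoint outside the using block keeps an exception thrown by the work itself from being mistaken for a soft OOM.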

Send your questions and comments to clrinout@microsoft.com.

Alessandro Catorcini is a Lead Program Manager in the CLR team. In Visual Studio 2005 he was responsible for the hosting API layer and CLR integration into SQL Server 2005.

Brian Grunkemeyer has been a Software Design Engineer on the .NET Framework since 1998. He has implemented a large portion of the Framework Class Libraries, as well as cross-cutting features, such as generics, managed code reliability, and versioning considerations.