August 2012

Volume 27 Number 08

CLR - .NET Development for ARM Processors

By Andrew Pardoe | August 2012

Consumers are a large driver of the technology market today. As evidenced by the trend known as “the consumerization of IT,” long battery life and always-connected and media-rich experiences are important to all technology customers. To enable the best experience for devices with long battery life, Microsoft is bringing the Windows 8 OS to systems built on the low-powered ARM processor, which powers most mobile devices today. In this article, I’ll discuss details about the Microsoft .NET Framework and the ARM processor, what you as a .NET developer should keep in mind and what we at Microsoft (where I’m a program manager on the CLR team) had to do to bring .NET over to ARM.

As a .NET developer, you can imagine that writing apps to run on a variety of different processors would pose a bit of a quandary. The ARM processor’s instruction set architecture (ISA) is incompatible with the x86 processor’s ISA. Apps built to run natively on x86 run well on x64 because the x64 processor’s ISA is a superset of the x86 ISA. But the same isn’t true of native x86 apps running on ARM—they need to be recompiled to execute on the incompatible architecture. Being able to choose from a range of different devices is great for consumers, but it brings some complexity to the developer story.

Writing your app in a .NET language not only allows you to reuse your existing skills and code, it also enables your app to run on all Windows 8 processors without recompilation. By porting the .NET Framework to ARM, we helped to abstract the unique characteristics of the architecture that are unfamiliar to most Windows developers. But there are still some things you might need to watch out for when writing code to run on ARM.

The Road to ARM: .NET Past and Present

The .NET Framework already runs on ARM processors, but it’s not the exact same .NET Framework as the version that runs on the desktop. Back when we started working on the first version of .NET, we realized that being able to write easily portable code across processors was key to our value proposition of increased developer productivity. The x86 processor dominated the desktop computing space, but a huge variety of processors existed in the embedded and mobile space. To enable developers to target those processors, we created a version of the .NET Framework called the .NET Compact Framework, which runs on machines that have memory and processor constraints.

The first devices that the .NET Compact Framework supported had as little as 4MB of RAM and a 33 MHz CPU. The design of the .NET Compact Framework emphasized both an efficient implementation (which allowed it to run on such constrained devices) and portability (so it could run across the vast range of processors that were common in the mobile and embedded spaces). But the most popular mobile devices—smartphones—now run on configurations that are comparable to the computers of 10 years ago. The desktop .NET Framework was designed to run on Windows XP machines with at least a 300 MHz processor and 128MB of RAM. Windows Phone devices today require at least 256MB of RAM and a modern ARM Cortex processor.

The .NET Compact Framework is still a big part of the Windows Embedded Compact developer story. Devices in embedded scenarios run in constrained configurations, often with as little as 32MB of RAM. We’ve also created a version of the .NET Framework called the .NET Micro Framework, which runs on processors that have as little as 64KB of RAM. So we actually have three versions of the .NET Framework, each of which runs on a different class of processor. But this is the first time that our flagship product, the desktop .NET Framework, has joined the Compact and Micro Frameworks in running on ARM processors.

Running on ARM

Although the .NET Framework was designed to be platform-neutral, it’s mostly been run on x86-based hardware throughout its existence. This means that a few x86-specific patterns have slipped into the collective minds of .NET programmers. You should be able to focus on writing great apps instead of writing for the processor architecture, but you should keep a few things in mind when writing .NET code to run on ARM. These include a weaker memory model and stricter data alignment requirements as well as some places where function parameters are treated differently. Finally, there are a few project-configuration steps in Visual Studio that differ when you target devices. I’ll discuss each of these.

A Weaker Memory Model A “memory model” refers to the visibility of changes made to global state in a multithreaded program. A program that shares data between two (or more) threads will normally take a lock on that shared data. Depending on the particular lock used, if one thread is accessing the data, other threads that attempt to access the data will block until the first thread is finished with the shared data. But locks aren’t necessary if you know that every thread accessing the shared data will do so without interfering with other threads’ view of that data. Programming in such a manner is called using a “lock-free” algorithm.
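As a rough sketch of the two approaches just described, the following compares a counter protected by a lock with one updated through the lock-free Interlocked.Increment call. The class names (LockedCounter, LockFreeCounter, CounterDemo) are hypothetical and exist only for this illustration.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example types; the names are illustrative only.
public sealed class LockedCounter
{
  private readonly object gate = new object();
  private int count;

  public void Increment()
  {
    lock (gate) { count++; }  // Other threads block here until the lock is released
  }

  public int Value
  {
    get { lock (gate) { return count; } }
  }
}

public sealed class LockFreeCounter
{
  private int count;

  public void Increment()
  {
    Interlocked.Increment(ref count);  // Atomic read-modify-write; no lock is taken
  }

  public int Value
  {
    get { return Volatile.Read(ref count); }
  }
}

public static class CounterDemo
{
  // Runs 'workers' loop iterations that may execute in parallel; each one
  // increments both counters 'perWorker' times. Returns both final totals.
  public static int[] Run(int workers, int perWorker)
  {
    var locked = new LockedCounter();
    var lockFree = new LockFreeCounter();
    Parallel.For(0, workers, w =>
    {
      for (int i = 0; i < perWorker; i++)
      {
        locked.Increment();
        lockFree.Increment();
      }
    });
    return new[] { locked.Value, lockFree.Value };
  }
}
```

Both counters should report the same total (workers multiplied by perWorker), because every update is either serialized by the lock or performed as a single atomic operation.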

The trouble with lock-free algorithms comes when you don’t know the precise order in which your code will execute. Modern processors reorder instructions to ensure the processor can make progress on every clock cycle and combine writes to memory in order to decrease latency. Although almost every processor performs these optimizations, there’s a difference in how the ordering of reads and writes is presented to the program. x86-based processors guarantee that the processor will look like it’s executing most reads and writes in the same order that the program specifies them. This guarantee is called a strong memory model, or strong write ordering. ARM processors don’t make as many guarantees—they’re generally free to move operations around as long as doing so doesn’t change the way the code would run in a single-threaded program. The ARM processor does make some guarantees that allow carefully constructed lock-free code, but it has what’s called a weak memory model.

Interestingly, the .NET Framework CLR itself has a weak memory model. All the references to write ordering in the ECMA Common Language Infrastructure (CLI) specification (available as a PDF at bit.ly/1Hv1xw), the standard that the CLR is designed to meet, refer to volatile accesses. In C# this means accesses to variables marked with the volatile keyword (see section 12.6 of the CLI specification for reference). But in the last decade, most managed code has been run on x86 systems, and the CLR just-in-time (JIT) compiler hasn’t added much to the reorderings permitted by the hardware, so there were relatively few cases where the memory model would reveal latent concurrency bugs. This could present a problem if managed code written for and tested only on x86-based machines is expected to work the same way on ARM systems.

Most of the patterns that require additional caution with regard to reordering are rare in managed code. But some of the patterns that do exist are deceptively simple. Here’s an example of code that doesn’t look like it has a bug, but if these statics are changed on another thread, this code might break on a machine with a weak memory model:

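// Static fields shared by every thread
static bool isInitialized = false;
static SomeValueType myValue;

// Inside a method that several threads may call
if (!isInitialized)
{
  myValue = new SomeValueType();
  isInitialized = true;  // With write reordering, may become visible before myValue
}
myValue.DoSomething();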

To make this code correct, simply indicate that the isInitialized flag is volatile:

static volatile bool isInitialized = false; // Properly marked as volatile

Execution of this code without reordering is shown in the left block in Figure 1. Thread 0 is the first to initialize SomeValueType on its local stack and copies the locally created SomeValueType to an AppDomain global location. Thread 1 determines by checking isInitialized that it also needs to create SomeValueType. But there’s no problem because the data is being written back to the same AppDomain global location. (Most often, as in this example, any mutations made by the DoSomething method are idempotent.) 

Figure 1 Write Reordering

The right-side block shows execution of the same code with a system that supports write reordering (and a conveniently placed stall in execution). This code fails to execute properly because Thread 1 determines by reading isInitialized’s value that SomeValueType doesn’t need to be initialized. The call to DoSomething refers to memory that hasn’t been initialized. Any data read from SomeValueType will have been set by the CLR to 0.

Your code won’t often execute incorrectly due to this kind of reordering, but it does happen. These reorderings are perfectly legal—the order of the writes doesn’t matter when executing on a single thread. It’s best when writing concurrent code to mark volatile variables correctly with the volatile keyword.
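Here’s a minimal, compilable sketch of the flag-and-value pattern with the flag marked volatile. The SomeValueType definition and the Publisher class are hypothetical stand-ins that I’ve made up for illustration. In C#, a volatile write has release semantics and a volatile read has acquire semantics, so the write to myValue can’t be observed after the write to isInitialized.

```csharp
// Hypothetical stand-in for SomeValueType; the payload is illustrative only.
public struct SomeValueType
{
  public int Payload;
  public SomeValueType(int payload) { Payload = payload; }
}

public static class Publisher
{
  // Volatile: earlier writes can't be moved after a write to this field,
  // and later reads can't be moved before a read of it.
  private static volatile bool isInitialized;
  private static SomeValueType myValue;

  public static int GetPayload()
  {
    if (!isInitialized)                 // Volatile read (acquire)
    {
      myValue = new SomeValueType(42);  // Ordinary write, published by the flag
      isInitialized = true;             // Volatile write (release)
    }
    return myValue.Payload;
  }
}
```

Two threads can still both run the initialization, but as in the scenario above that’s harmless here because each writes the same value to the same location.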

The CLR is allowed to expose a stronger memory model than the ECMA CLI specification requires. On x86, for example, the memory model of the CLR is strong because the processor’s memory model is strong. The .NET team could’ve made the memory model on ARM as strong as the model on x86, but ensuring the perfect ordering whenever possible can have a notable impact on code execution performance. We’ve done targeted work to strengthen the memory model on ARM—specifically, we’ve inserted memory barriers at key points when writing to the managed heap to guarantee type safety—but we’ve made sure to only do this with a minimal impact on performance. The team went through multiple design reviews with experts to make sure that the techniques applied in the ARM CLR were correct. Moreover, performance benchmarks show that .NET code execution performance scales the same as native C++ code when compared across x86, x64 and ARM.

If your code relies on lock-free algorithms that depend on the implementation of the x86 CLR (rather than the ECMA CLI specification), you’ll want to add the volatile keyword to relevant variables as appropriate. Once you’ve marked shared state as volatile, the CLR will take care of everything for you. If you’re like most developers, you’re ready to run on ARM because you’ve already used locks to protect your shared data, properly marked volatile variables and tested your app on ARM.
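If you’d rather not reason about reordering at all, one-time initialization can be guarded with a lock or handed to Lazy&lt;T&gt;. The following is a sketch under that assumption; the Settings and SafeInit types are hypothetical names invented for this example.

```csharp
using System;
using System.Threading;

// Hypothetical type used only for illustration.
public sealed class Settings
{
  public readonly int Value;
  public Settings(int value) { Value = value; }
}

public static class SafeInit
{
  private static readonly object gate = new object();
  private static Settings lockedInstance;

  // Option 1: take a lock around both the check and the initialization.
  public static Settings GetWithLock()
  {
    lock (gate)
    {
      if (lockedInstance == null)
        lockedInstance = new Settings(7);
      return lockedInstance;
    }
  }

  // Option 2: let Lazy<T> run the factory once and publish the result safely.
  private static readonly Lazy<Settings> lazyInstance =
    new Lazy<Settings>(() => new Settings(7),
                       LazyThreadSafetyMode.ExecutionAndPublication);

  public static Settings GetWithLazy()
  {
    return lazyInstance.Value;
  }
}
```

Neither version depends on the memory model of the processor it runs on, which is why code written this way generally needs no changes for ARM.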

Data Alignment Requirements Another difference that might affect some programs is that ARM processors require some data to be aligned. The specific pattern in which alignment requirements apply is when you have a 64-bit value (that is, an int64, a uint64 or a double) that isn’t aligned on a 64-bit boundary. The CLR takes care of alignment for you, but there are two ways to force an unaligned data type. The first way is to explicitly specify the layout of a structure with the [StructLayout(LayoutKind.Explicit)] custom attribute. The second way is to incorrectly specify the layout of a structure passed between managed and native code.
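To make the first case concrete, here’s a sketch of an explicitly laid-out structure that places a 64-bit field at an offset that isn’t a multiple of eight, next to a version that keeps it aligned. The struct names and the LayoutCheck helper are hypothetical; Marshal.OffsetOf is used only to report where each field lands.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical structs for illustration.
[StructLayout(LayoutKind.Explicit)]
public struct Misaligned
{
  [FieldOffset(0)] public int Header;
  [FieldOffset(4)] public long Value;  // 64-bit field at offset 4: not on a 64-bit boundary
}

[StructLayout(LayoutKind.Explicit)]
public struct Aligned
{
  [FieldOffset(0)] public int Header;
  [FieldOffset(8)] public long Value;  // 64-bit field at offset 8: on a 64-bit boundary
}

public static class LayoutCheck
{
  // Returns true when the named field starts at a multiple of 8 bytes.
  public static bool Is64BitAligned(Type structType, string fieldName)
  {
    int offset = Marshal.OffsetOf(structType, fieldName).ToInt32();
    return offset % 8 == 0;
  }
}
```

A check like LayoutCheck.Is64BitAligned(typeof(Misaligned), "Value") returns false, which is a hint that the layout deserves a second look before the structure is used on ARM.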

If you notice a P/Invoke call come back with garbage, you might want to take a look at any structures being marshaled. As an example, we fixed a bug while porting some .NET libraries in which a COM interface passed a POINTL structure containing two 32-bit fields to a function in managed code that took a single 64-bit value (a long in C#) as a parameter. The function used bit operations to obtain the two 32-bit fields. Here’s a simplified version of the buggy function:

void CalledFromNative(int parameter, long point)
{
  // Unpack native POINTL from long point
  int x = (int)(point & 0xFFFFFFFF);
  int y = (int)((point >> 32) & 0xFFFFFFFF);
  ...  // Do something with POINTL here
}

The native code didn’t have to align the POINTL structure on a 64-bit boundary, because it contained two 32-bit fields. But ARM requires the 64-bit value to be aligned when it’s passed into the managed function. Making certain that the types are specified to be the same on both sides of the managed-native call is critical if your types require alignment.
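One way to make the two sides agree is to declare the parameter as a structure with the same two 32-bit fields as the native POINTL. This is a sketch of that idea, not the actual fix that shipped; the Callbacks class and the return value are hypothetical and exist only so the result can be inspected.

```csharp
using System.Runtime.InteropServices;

// Mirrors the native POINTL: two 32-bit fields, so only 32-bit alignment is needed.
[StructLayout(LayoutKind.Sequential)]
public struct POINTL
{
  public int x;
  public int y;
}

public static class Callbacks
{
  // Hypothetical corrected signature: the structure is declared the same way
  // on both sides of the call, so no bit unpacking is required.
  public static int[] CalledFromNative(int parameter, POINTL point)
  {
    int x = point.x;
    int y = point.y;
    return new[] { x, y };  // Placeholder for the real work on the point
  }
}
```

Because the managed declaration no longer claims to receive a 64-bit value, the marshaled size is eight bytes with no 64-bit alignment requirement.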

Inside Visual Studio Most developers won’t ever notice the differences I’ve discussed because .NET code by design isn’t specific to any processor architecture. But there are some differences in Visual Studio when profiling or debugging Windows 8 apps on an ARM device, because Visual Studio doesn’t run on ARM devices.

You’re already familiar with the cross-platform development process if you write apps for Windows Phone. Visual Studio runs on your x86 dev box and you launch your app remotely on the device or an emulator. The app uses a proxy installed on your device to communicate with your development machine through an IP connection. Other than the initial setup steps, the debugging and profiling experiences behave the same on all processors.

One other point to be aware of is that the Visual Studio project settings now offer ARM, alongside x86 and x64, as a choice of target processor. You’ll normally choose to target AnyCPU when you write a .NET app for Windows on ARM, and your app will just run on all Windows 8 architectures.

Going Deep into Supporting ARM

Having made it this far into the article, you already know a lot about ARM. Now I’d like to share some of the interesting, deep technical details about the .NET team’s work to support ARM. You won’t have to do this kind of work in your .NET code—this is just a quick behind-the-scenes peek at the kind of work we did.

Most of the changes inside the CLR itself were straightforward because the CLR was designed to be portable across architectures. We did have to make a few changes to conform to the ARM Application Binary Interface (ABI). We also had to rewrite assembly code in the CLR to target ARM and change our JIT compiler to emit ARM Thumb-2 instructions.

The ABI specifies the how of a processor’s programmable interface. It’s similar to the API that specifies the what of an OS’s programmatically available functions. The three areas of the ABI that affected our work are the function calling convention, the register conventions and call stack unwind information. I’ll discuss each.

Function Calling Convention A calling convention is an agreement between code that calls a function and the function being called. The convention specifies how parameters and return values are laid out in memory, as well as which registers need to have their values preserved across the call. In order for function calls to work across boundaries (such as a call from a program into the OS), code generators need to generate function calls that match the convention that the processor defines, including alignment of 64-bit values.

ARM was the first 32-bit processor where the CLR had to handle aligning parameters and objects on the managed heap on a 64-bit boundary. The simplest solution would be to align all parameters, but the ABI requires that a code generator not leave bubbles in the stack when no alignment is required so there’s no performance degradation. Thus the simple operation of pushing a bunch of parameters on the stack becomes more delicate on the ARM processor. Because a user structure can contain an int64, the CLR’s solution was to use a bit on each type to indicate if it requires alignment. This gives the CLR enough information to ensure that function calls containing 64-bit values don’t accidentally corrupt the call stack.
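The idea of a per-type alignment bit can be sketched as follows. This is a simplified model I’ve written for illustration, not the CLR’s code: it ignores registers entirely and only assigns byte offsets in a hypothetical stack argument area, padding to a 64-bit boundary only for parameters whose type is flagged. ParamShape and StackLayout are made-up names.

```csharp
using System.Collections.Generic;

// Hypothetical description of a parameter's type.
public struct ParamShape
{
  public readonly int Size;                     // Size in bytes (a multiple of 4)
  public readonly bool Requires64BitAlignment;  // The per-type alignment bit

  public ParamShape(int size, bool requires64BitAlignment)
  {
    Size = size;
    Requires64BitAlignment = requires64BitAlignment;
  }
}

public static class StackLayout
{
  // Returns the byte offset of each parameter. Padding ("bubbles") is added
  // only in front of parameters that actually require 64-bit alignment.
  public static int[] ComputeOffsets(IList<ParamShape> parameters)
  {
    var offsets = new int[parameters.Count];
    int next = 0;
    for (int i = 0; i < parameters.Count; i++)
    {
      if (parameters[i].Requires64BitAlignment)
        next = (next + 7) & ~7;  // Round up to the next 8-byte boundary
      offsets[i] = next;
      next += parameters[i].Size;
    }
    return offsets;
  }
}
```

For a 4-byte int, an 8-byte long and two more 4-byte ints, this model yields offsets 0, 8, 16 and 20: one bubble in front of the long and none between the trailing ints.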

Register Convention The data-alignment requirement carries over when structures are fully or partially enregistered on ARM. This means we had to modify the code inside the CLR that moves frequently used data from memory into registers to make sure the data is properly aligned in the registers. This work had to be done for two situations: first, making certain that 64-bit values start in even registers, and second, placing homogeneous floating-point aggregates (HFAs) in the proper registers.

If a code generator enregisters an int64 on ARM, it must be stored in an even-odd pair (that is, R0-R1 or R2-R3). The protocol for HFAs allows up to four double or single floating-point values in a homogeneous structure. If these are enregistered, they must be stored in either the S (single) or D (double) register sets but not in the general-purpose R registers.
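The even-odd pair rule can be modeled in a few lines. The sketch below is a deliberately simplified illustration, assuming only 32-bit and 64-bit integer arguments and the four argument registers R0 through R3; it ignores floating-point registers, HFAs and structures. RegisterAllocator is a hypothetical name, not a CLR type.

```csharp
public static class RegisterAllocator
{
  // For each argument, returns the index of its first core register
  // (0 for R0 ... 3 for R3), or -1 if the argument goes on the stack.
  // is64Bit[i] is true for a 64-bit argument and false for a 32-bit one.
  public static int[] AssignCoreRegisters(bool[] is64Bit)
  {
    var result = new int[is64Bit.Length];
    int next = 0;  // Next free core register
    for (int i = 0; i < is64Bit.Length; i++)
    {
      int needed = is64Bit[i] ? 2 : 1;
      if (is64Bit[i] && (next % 2) != 0)
        next++;  // Skip the odd register so the pair starts on an even one
      if (next + needed <= 4)
      {
        result[i] = next;
        next += needed;
      }
      else
      {
        result[i] = -1;
        next = 4;  // Once an argument goes to the stack, later ones follow
      }
    }
    return result;
  }
}
```

With an int, a long and another int, the model puts the int in R0, skips R1 so that the long occupies R2-R3, and sends the last int to the stack, returning 0, 2 and -1.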

Unwind Information Unwind information records the effects that a function call has on the stack and records where nonvolatile registers are saved over function calls. On x86, Windows looks at FS:0 to view a linked list of each function’s exception registration information in the event of an unhandled exception. 64-bit Windows introduced the concept of unwind information that allows Windows to crawl the stack in case of an unhandled exception. The ARM design extended this unwind information from 64-bit designs. The CLR code generators, in turn, had to change to accommodate the new design.

Assembly Code Even though most of the CLR runtime engine is written in C++, we have assembly code that must be ported to each new processor. Most of this assembly code is what we call “stub functions,” or “stubs.” Stubs serve as interface glue that enables us to tie together the C++ and JIT-compiled portions of the runtime. The remainder of the assembly code inside the CLR is written in assembly for performance. For example, the garbage collector write barrier needs to be extremely fast because it’s called frequently—any time an object reference is written to an object on the managed heap.

One example of a stub is what we call the “shuffle thunk.” We call it a shuffle thunk because it shuffles parameter values across registers. Sometimes the CLR has to change the placement of parameters in registers just before a function call is made. The CLR uses the shuffle thunk to do this when invoking delegates.

Conceptually, when you invoke a delegate, you just make a call to the Invoke method. In reality, the CLR makes an indirect call through a field of the delegate rather than making a named method call (except when you explicitly call Invoke through reflection). This method is far faster than a named method call because the runtime can simply swap the instance of the delegate (obtained from the target pointer) for the delegate in the function call. That is, for an instance foo of delegate d, the call to the d.Member method is mapped to the foo.Member method.

If you make a closed instance delegate call, the this pointer is stored inside the first register used for passing parameters, R0, and the first parameter is stored in the next register, R1. But this only works when you have a delegate bound to an instance method. What happens if you’re calling an open static delegate? In that case, you expect that the first parameter is stored in R0 (as there’s no this pointer). The shuffle thunk moves the first parameter from R1 into R0, the second parameter from R2 into R1 and so on, as shown in Figure 2. Because the purpose of this shuffle thunk is to move values from register to register, it needs to be rewritten specifically for each processor.

Figure 2 A “Shuffle Thunk” Shuffles Value Parameters Across Registers
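From C#, the difference between the two kinds of delegate is visible through the Target property: a delegate bound to an instance method carries that instance, while a delegate created from a static method has a null Target. The sketch below shows both, along with a toy model of the register shuffle. Greeter, DelegateDemo and ShuffleLeft are hypothetical names, and the array-based shuffle only imitates the idea; the real thunk is processor-specific assembly.

```csharp
using System;

// Hypothetical type used only for illustration.
public sealed class Greeter
{
  private readonly string prefix;
  public Greeter(string prefix) { this.prefix = prefix; }

  public string Greet(string name) { return prefix + name; }                   // Instance method
  public static string Shout(string name) { return name.ToUpperInvariant(); }  // Static method
}

public static class DelegateDemo
{
  // Closed instance delegate: Target is the Greeter instance.
  public static Func<string, string> Closed()
  {
    return new Greeter("Hello, ").Greet;
  }

  // Delegate over a static method: there is no instance, so Target is null.
  public static Func<string, string> OpenStatic()
  {
    return Greeter.Shout;
  }

  // Toy model of the shuffle: each slot takes the value of the slot after it
  // (R0 <- R1, R1 <- R2, ...). The last slot keeps its old value.
  public static int[] ShuffleLeft(int[] registers)
  {
    var shuffled = (int[])registers.Clone();
    for (int i = 0; i + 1 < shuffled.Length; i++)
      shuffled[i] = shuffled[i + 1];
    return shuffled;
  }
}
```

In the toy model, a register file holding a placeholder followed by parameters 1, 2 and 3 ends up with the parameters in the first three slots, which is the arrangement a static method expects.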

Just Focus on the Code

To review, porting the .NET Framework to ARM was an interesting project and a lot of fun for the .NET team. And writing .NET apps to run on top of the .NET Framework on ARM should be fun for you as a .NET developer. Your .NET code might execute differently on ARM than it does on x86-based processors in a few situations, but the .NET Framework virtual execution environment normally abstracts those differences for you. This means you don’t have to worry about what processor architecture your .NET app runs on. You can just focus on writing great code.

I believe having Windows 8 available on ARM will be great for both developers and end users. ARM processors are especially suited for long battery life, so they enable lightweight, portable, always-connected devices. The most significant issues you’ll see when porting your app to ARM are performance differences from desktop processors. But make sure to run your code on ARM before saying it actually works on ARM—don’t trust that developing on x86 is enough. For most developers, that’s all that’s needed. And if you do run into any issues, you can refer back to this article to get some insight into where to start investigating.


Andrew Pardoe is a program manager for the CLR team, helping to ship the Microsoft .NET Framework on all kinds of processors. His personal favorite remains the Itanium. He can be reached at Andrew.Pardoe@microsoft.com.

Thanks to the following technical experts for reviewing this article: Brandon Bray, Layla Driscoll, Eric Eilebrecht and Rudi Martin