
Examining Execution Speed of JITted Code with CF 2.0

.NET Compact Framework 1.0
 

Chris Tacke, eMVP
Principal Partner
OpenNETCF Consulting, LLC

March 2006

Contents

Introduction
The Baseline: Toggling a GPIO with C
Using C#
Conclusion

Introduction

I like to think of myself as a "systems" guy rather than an application developer, but I love the benefits that both managed code and Visual Studio 2005 provide as a development platform. In the early days of the Compact Framework, a deep understanding of Platform Invoke (P/Invoke) and of marshaling data across the managed/unmanaged boundary was almost always required for any robust solution. I tended to split my development time between writing driver and kernel code and writing managed code to access those native resources. As a natural product of doing that kind of work, I've always been intrigued by the idea of writing a driver in managed code, so I'm always looking for ways that it might be realistic or achievable.

The reality is that it's not possible to write a true driver with the Compact Framework as it is today. The Execution Engine (EE) cannot be hosted by a native process, which makes it impossible to expose the entry points required for device.exe to load an assembly. However, if we relax the definition of a "driver" a bit and consider a driver to simply be a piece of software dedicated to managing specific hardware resources, then the story changes. Of course, a couple of large hurdles still have to be overcome to achieve this definition of a driver: first, the lack of deterministic behavior in a managed code environment, and second, the purported lack of performance of managed code.

The first hurdle — the nondeterministic nature of a garbage-collected environment — has been discussed and demonstrated fairly well, and while it's something I'm still trying to work around, I don't feel it needs much discussion here.

The second hurdle — performance — is something different. I've seen it argued on several occasions that since managed code is not natively compiled and runs against the .NET Common Language Runtime, it inherently must perform worse than native code. Surprisingly, I've never actually seen anything that specifically set out to quantify the difference, so I decided that before I just accept "common knowledge," maybe a little testing was in order.

Lastly, I want to make it clear that I don't recommend writing a driver in managed code for a production environment. To make it work, a lot of care must be taken, right down to each line of code, and the implications of every call have to be heavily modeled. With today's Compact Framework tools, it's just too fragile a system to trust on a large scale. However, as you'll see in this article, there are some extremely promising behaviors that, with just a few additions from the Compact Framework team, could indeed make managed drivers a distinct reality in the next version.

The Baseline: Toggling a GPIO with C

Before I could test managed code and have meaningful results, I needed to get a set of control data. How fast can some meaningful action occur with typical unmanaged code? I decided that a reasonable test would be to toggle a processor general-purpose input/output (GPIO) line as fast as possible, since it's a common action in a driver, and using an oscilloscope it would be very easy to get a quantifiable measurement of speed.

So I put together the following piece of code to toggle a GPIO as fast as possible:

#define GPIO3 (1 << 3)
...
DWORD *p = (DWORD*)MapAddress(0x40E00000);

DWORD *gpdr = p + (0x10 / sizeof(DWORD));
DWORD *gpsr = p + (0x18 / sizeof(DWORD));
DWORD *gpcr = p + (0x24 / sizeof(DWORD));

*gpdr |= GPIO3;
while(true)
{
        *gpsr = GPIO3;
        *gpcr = GPIO3;
}

For the curious, MapAddress is a function that wraps VirtualAlloc and VirtualCopy, in the same way MmMapIoSpace does, to get a mapped virtual address for a specified physical address. Here I passed in the base physical address of the PXA255's GPIO registers, then created pointers to the direction (GPDR), set (GPSR), and clear (GPCR) registers. These registers work as follows: setting a bit in GPDR configures the corresponding line as an output, writing that same bit to GPSR drives the line high, and writing it to GPCR drives it low.

The measurement would be nothing more than the two calls in the while loop to set and clear GPIO3 as fast as possible. Below is the compiler output for the previous code.

; 49   : while(true)
; 50   : {
; 51   :        *gpsr = GPIO3;
  0004c e5930000         ldr       r0, [r3]
  00050 e5804000         str       r4, [r0]
; 52   :        *gpcr = GPIO3;
  00054 e5921000         ldr       r1, [r2]
  00058 e5814000         str       r4, [r1]
; 53   : }

You can see that the compiler has turned this into two pairs of load and store operations. While this could have been made faster by writing it in assembly, the purpose of this test wasn't to get the best possible time from unmanaged code, but to compare managed code with typical unmanaged code.

Figure 1 is a captured scope trace of the output produced by the unmanaged code, measured right on the pin of the processor. The important piece of information to see is that a state change (high-to-low or low-to-high) is pretty consistent and is about 110ns.

Figure 1. Oscilloscope traces for unmanaged code

Using C#

Now that we've got the control measured, let's take a look at how we can implement the same feature (toggling GPIO3) in managed code and the speed we see from it. For my testing I chose to use C# instead of VB.NET. Initially this choice was simply a matter of personal preference, but as we'll see shortly, some features available in C# but not VB.NET gave faster results.

My first tests were done with a simple class that P/Invokes the VirtualAlloc and VirtualCopy APIs to map a physical address to a virtual address, just like the C code used earlier. The full class code is available in the downloadable sample, but the interesting part, the calling code, can be seen here:

int gpio3 = (1 << 3);

// map all of GPIO space
PhysicalAddressPointer pap;
pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

// make GPIO3 an output
int gpdr = pap.ReadInt32(0x10);
pap.WriteInt32(gpdr | gpio3, 0x10);

while(true)
{
   // turn it off
   pap.WriteInt32(gpio3, 0x24);

   // turn it on
   pap.WriteInt32(gpio3, 0x18);
}

An important difference between the hardware access in this code, versus what was done with the unmanaged code, is that the virtual address is stored as an IntPtr in managed code. This means that any reads from or writes to the address are done through a call to Marshal.Copy instead of directly to the pointer address like we were able to do in C. Intuitively I felt that this was going to add some overhead, and the resulting scope trace, seen in Figure 2, shows that it is indeed slower — almost seven times slower.

Figure 2. Oscilloscope traces for managed code

Even though the managed code was significantly slower, it was very consistent, and it was still faster than I had expected, considering managed code had to make a function call to the Marshal class, which then had to marshal the data to the IntPtr address location. The question remained: "How much of the difference is the overhead of the extra calls, and how much can be attributed to the Common Language Runtime (CLR) itself?" To determine that, I needed a better way to get at the hardware address, something that takes the IntPtr and Marshal calls out of the picture. This is where I had to turn to a C#-only feature: unsafe code. Unsafe code simply means that if I set a specific compiler option (/unsafe), I'm allowed to declare and use pointers in my managed code.

To use a pointer, I had to make a slight modification to the previously used PhysicalAddressPointer class to give external classes access to the internal virtual address as a raw pointer, using the IntPtr.ToPointer method. Using the newly exposed function, I modified my test code to look like this:

// toggle GPIO 3
int gpio3 = (1 << 3);

// map all of GPIO space
PhysicalAddressPointer pap;
pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

unsafe
{
   int *p = (int*)pap.GetUnsafePointer();
   int *gpdr = p + (0x10 / 4);
   int *gpsr = p + (0x18 / 4);
   int *gpcr = p + (0x24 / 4);

   // make GPIO3 an output
   *gpdr |= gpio3;

   while(true)
   {
      // set the pin
      *gpsr = gpio3;
      // clear the pin
      *gpcr = gpio3;
   }
}

When I measured the state changes this time I was pleasantly surprised. The traces, as seen in Figure 3, were identical to the unmanaged traces, meaning that the CLR was adding zero measurable overhead to the hardware access. Yes, you read that right — the managed code implementation was just as fast as the native implementation. All of the latency measured in the first managed code test lies in the overhead of the call to the Marshal class.

Figure 3. Oscilloscope traces with unsafe code

My longer-term goal was to make access to the hardware a little more user-friendly by providing a wrapper class for the entire PXA255 processor, while keeping maximum performance as a goal. I also wanted VB.NET developers to have the same advantages that C# developers would get, so I did some rethinking on how to get at the virtual address without going through the Marshal class.

The first thought was to try to get a struct that would map its members directly to the registers in the processor, and then pin those into memory. Unfortunately, even with a pinned struct, you're still relegated to using the Marshal class for passing data to the mapped target address.

I then decided that if the PXA255 wrapper used unsafe pointers internally, wrapped by CLS-compliant properties, VB developers would be able to access hardware directly and benefit from the speed of unsafe code. I put together a comparable test using the PXA255 class and checked the performance with the scope.

PXA25x pxa = new PXA25x();

// set gpio3 as an output
pxa.GPIO.GPDR0 |= PXA25x.GPIO3;

while(true)
{
   // set the pin
   pxa.GPIO.GPSR0 = PXA25x.GPIO3;
   // clear the pin
   pxa.GPIO.GPCR0 = PXA25x.GPIO3;
}

Once again I was surprised by the result, but this time the surprise wasn't pleasant. It turned out that even though the class was using unsafe pointers, the results showed a similarly large latency (see Figure 4). It appeared that it wasn't the internals of the Marshal class that were the performance hit after all, it was simply the fact that a method call was being made.

Figure 4. Oscilloscope traces using the PXA255 class

The last step was to verify that hunch, so I wrote one last bit of test code using the PXA255 class, but this time retrieving its internal pointer and using the pointer locally in the test. Of course, this isn't VB-accessible, but it would prove the theory about the location of the performance bottleneck.

PXA25x pxa = new PXA25x();

unsafe
{
   uint *gpio = pxa.GetGPIORegistersUnsafePointer();
   uint *gpsr = gpio + (0x18 / 4);
   uint *gpcr = gpio + (0x24 / 4);
   uint *gpdr = gpio + (0x10 / 4);

   // make GPIO an output
   *gpdr |= PXA25x.GPIO3;

   while(true)
   {
      // set the pin
      *gpsr = PXA25x.GPIO3;
      *gpcr = PXA25x.GPIO3;
   }
}

In Figure 5 you can see that using the pointer again provided the same level of performance as the unmanaged code, proving that the expense is due to the method call being made, not to any inherent bottleneck in the Marshal class. In fact, this shows that the Marshal class is actually quite performant internally, adding very little overhead beyond the call into it.

Figure 5. Oscilloscope traces using the PXA255 class and pointer

Conclusion

So you see, the performance differences between managed code and unmanaged code can be negligible if you are cognizant of the behavior characteristics of the managed environment when you are writing the code. What does that buy us as a community of developers? Potentially, the implications are immense.

I know that I said in the introduction that I didn't recommend writing drivers in managed code, but as it stands right now we easily have the required performance to write device drivers for many devices that are tolerant of potentially large, but typically rare, latencies. Things like I²C or SPI synchronous serial buses or other GPIO devices are well within the realm of possibility. We've also seen that managed code can perform equally to unmanaged code, so if we can find a way to eke out deterministic behavior from the CLR, then we have the performance required for a whole host of devices.

With a little ingenuity on our part and a little cooperation from those developing future versions of managed compilers and specifications, writing device drivers in managed code could become a commonplace task. I'm not advocating that we do away with unmanaged code; it certainly has its place today and will for the foreseeable future, just as assembly still does. But we don't need to fear managed developers playing with hardware any more than the assembly developers of yesteryear needed to worry about C and C++ developers. Change is what has always driven the industry, and what I advocate is embracing that change, because it looks like it's going to be fun.
