Testing Real-Time Systems in Microsoft Windows CE .NET

 

Mike Hall
Microsoft Corporation

Steve Maillet
Entelechy Consulting

June 28, 2002

In last month's Get Embedded article, we took a look at how to build a Microsoft® Windows® CE .NET x86-based reference platform using a PC. (This reference platform is known as a CEPC, simply because we run the CE operating system on a PC). During the coming months, we will take a look at how to build interesting platforms using the Windows CE .NET emulator and CEPC reference platforms (the projects could easily be used on other reference platforms). This month's article concentrates on Windows CE .NET real-time support (is there such a thing as fake time?) and the tools used to measure and test the real-time performance of a device.

Keep in mind that real-time systems are not verified by a single analysis that pronounces them correct. Testing a real-time system is a proof by exhaustion: you work to develop confidence in the solution. The tools described below work together to further that confidence by providing detailed timing information about your application and its interactions with the operating system.

There is much discussion in the newsgroups about what real time actually is. Before we take a look at the tools that ship with Windows CE .NET for measuring the real-time response of your device, it's probably good to start with a definition of real time. In this case I've taken a quote from the comp.realtime FAQ. The canonical definition of a real-time system (from Donald Gillies) is as follows:

"A real-time system is one in which the correctness of the computations not only depends upon the logical correctness of the computation but also upon the time at which the result is produced. If the timing constraints of the system are not met, system failure is said to have occurred."

The industrial automation applications of high-speed I/O, robotics, and machinery control create the most demanding timing constraints, so Microsoft chose to ask that community for its specific requirements. General Motors Powertrain Group (GMPTG) has been the leader in implementing OMAC technologies in its manufacturing applications since 1986, and GMPTG later drove the formation of an OMAC User Group. Together, they reviewed hundreds of applications and found that most systems (95 percent) require cycle times of one millisecond or greater. A tolerable variation on this one-millisecond cycle time would be 10 percent, or 100 microseconds (µs). This is the design goal for Windows CE .NET on a 200-megahertz (MHz) x86 system, with an average response of 50 µs on that platform. Windows CE .NET meets or exceeds the requirements of 95 percent of the hard real-time applications OMAC reviewed.

A large portion of the industrial automation applications reviewed were driven by an external signal from a piece of machinery. This signal is presented to a hard real-time application in the form of an interrupt. Microsoft has been encouraging Windows CE developers to put as much application code as possible into interrupt service threads (ISTs). The OMAC jitter definition therefore becomes a timing constraint on IST latency of no more than 100 µs. The remainder of the applications reviewed use timers to create their cycle times, which led to the requirement of 1 millisecond timers with no more than 100 µs of latency or jitter. In summary, the OMAC definition provided the following design and test requirements:

  • Interrupt service thread (IST) latencies of no more than 100 µs.
  • 1 millisecond timers with a maximum of 100 µs latency.

So real-time systems are in some ways similar to great comedy: it's all about timing. That being the case, we should take a look at the tools that ship with Windows CE .NET that can be used to determine interrupt timings, application execution behavior, operating system feature timings, and scheduling timings.

It's also important to distinguish between a real-time system and a real-time operating system. A real-time system contains all elements, including hardware, operating system, and applications. A real-time operating system is just one element of the complete real-time system. For more information, take a look at Designing and Optimizing Microsoft Windows CE for Real-Time Performance.

We will be looking at a number of tools and the results from each. These include:

  • ILTiming. This tool determines the ISR and IST latencies for your platform. ISR latency is the time from a hardware interrupt occurring to the first instruction of the interrupt service routine executing. IST latency is the time from the ISR exiting to the interrupt service thread starting.
  • OSBench. This tool enables you to collect timing samples to measure the performance of the kernel by conducting scheduler performance-timing tests.
  • Kernel Tracker. This tool provides a visual representation of the execution of the Windows CE .NET operating system on a target device. This tool can be used to view thread interactions, internal dependencies, and system state information in a real-time environment. For the purposes of this paper, we will examine the interaction between threads and processes.
  • Call Profiler. This tool can be used to identify algorithmic bottlenecks in your code.

There are a number of places within a device that can affect real-time performance, including hardware, drivers, and applications. In this case we will start at the application level. An application running in a real-time environment should allocate all resources at startup. All kernel objects (processes, threads, mutexes, critical sections, semaphores, and events) are allocated in virtual memory, which is committed on demand. Allocating memory on demand is nondeterministic; the time the operating system takes to complete the operation is potentially unbounded, and therefore it cannot be relied on during the real-time execution of an application.
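For illustration only, a minimal sketch of this startup-allocation pattern might look like the following; the names and sizes are hypothetical, not taken from any particular sample.

#include <windows.h>

// Hypothetical startup-allocation sketch: every kernel object and buffer the
// real-time path needs is created once at initialization, so no on-demand
// (and therefore nondeterministic) allocation happens while the system runs.
static HANDLE           g_hDataReady;      // signaled by the IST when data arrives
static HANDLE           g_hShutdown;       // used to stop the worker thread
static CRITICAL_SECTION g_csSharedState;   // protects state shared between threads
static BYTE            *g_pWorkBuffer;     // working memory, committed up front

BOOL InitRealTimeResources(void)
{
    g_hDataReady = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset event
    g_hShutdown  = CreateEvent(NULL, TRUE,  FALSE, NULL);   // manual-reset event
    InitializeCriticalSection(&g_csSharedState);

    // Commit the buffer now so later accesses never hit an unbounded
    // allocation or paging path.
    g_pWorkBuffer = (BYTE *)VirtualAlloc(NULL, 64 * 1024,
                                         MEM_COMMIT, PAGE_READWRITE);

    return (g_hDataReady != NULL && g_hShutdown != NULL && g_pWorkBuffer != NULL);
}

Everything the time-critical code touches after this point already exists, so the only waits left in the real-time path are the ones you intend.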

Remote Call Profiler

We all know that you write great code, thoroughly tested, and totally free of bugs or bottlenecks. Still, let's ask a hypothetical question: What if there are bottlenecks in your application code? How do we test for this situation? This could, of course, affect the overall performance of your device. The answer: This is the domain of the Remote Call Profiler, a tool that ships with Windows CE .NET. This tool provides answers to questions like: What code is executing and when? What are the interactions of the various software components? Where is the CPU spending its time getting through your application code?

To prove a point, I'm using the 'dining philosophers' application built and running on Windows CE. So here's the deal: Five philosophers (threads) are sitting at a round table. In front of each of them is a bowl of food. Each philosopher starts with one chopstick. Before a philosopher can eat, however, he must have two chopsticks (therefore, one of the other philosophers must give up one of their chopsticks). The philosophers must find some way to share chopsticks so that they all get to eat.

Likewise, when multithreaded programs have more than one thread (philosopher) competing for resources (food), there is the possibility of a deadlock—or a fight depending on how hungry the philosophers are! Having multiple threads all waiting on scarce resources can cause threads to wait indefinitely, freezing the application. This is not a great scenario for a real-time application, hence the option of running the application with the Remote Call Profiler.
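The Philosophers sample itself isn't reproduced here, but a minimal sketch of the idea is shown below; the thread function, the names, and the simple lock-ordering trick used to avoid deadlock are mine, not the article's code.

#include <windows.h>

#define NUM_PHILOSOPHERS 5

// One mutex per chopstick; each is created with CreateMutex(NULL, FALSE, NULL)
// at startup (see the note above about allocating resources up front).
static HANDLE g_hChopstick[NUM_PHILOSOPHERS];

DWORD WINAPI Philosopher(LPVOID lpParam)
{
    int i = (int)lpParam;
    int meal;

    // Always acquire the lower-numbered chopstick first. Imposing a global
    // lock order is one simple way to break the circular wait behind deadlock.
    int first  = min(i, (i + 1) % NUM_PHILOSOPHERS);
    int second = max(i, (i + 1) % NUM_PHILOSOPHERS);

    for (meal = 0; meal < 10; meal++)
    {
        WaitForSingleObject(g_hChopstick[first],  INFINITE);  // pick up one chopstick
        WaitForSingleObject(g_hChopstick[second], INFINITE);  // then the other
        Sleep(1);                                             // "eat"
        ReleaseMutex(g_hChopstick[second]);
        ReleaseMutex(g_hChopstick[first]);
        Sleep(1);                                             // "think"
    }
    return 0;
}

If each thread instead grabbed its own chopstick first and then waited for its neighbor's, all five could end up blocked forever, which is exactly the kind of stall the profiling tools help you spot.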

The Remote Call Profiler can display profiling information in a number of different views, including a graphical call graph. This shows the amount of time spent on a per function basis during the run of the application. You can imagine that this is a great tool to run against applications that deal with real-time compression/decompression of video/audio streams! The following table shows the available views within the Remote Call Profiler application.

Table 1. Views available within the Remote Call Profiler

Top X View: Displays the functions that consume the greatest time or counter value, based on timing criteria that you specify.
Module Summary View: Lists statistics for loaded modules and lists instances of entry into the functions in each module.
Call Tree View: Displays a hierarchical depiction of the call paths of functions.
Butterfly View: Displays data in tabular format that shows the relationship between calling and called functions.
Call Graph View: Displays a summary of the parent-child relationships between functions.
Call Trace and Function Trace Views: Display a data grid that shows the actual and calculated timing values for function entry and exit events.

The following figure shows the Call Graph view for the Philosophers application. This shows that 87% of the application's time is being spent in a function AckAc2( ). Perhaps it's time to review the content of that function!


Figure 1. Remote Call Profiler

You're probably wondering what needs to be added to your application source code in order to run with the Remote Call Profiler. Actually, you don't need to change your code at all; you simply need to compile with an additional flag (/Gh for x86, and /callcap for other CPUs) and link against CeCap.lib, which can be found in the C:\Program Files\Common Files\Microsoft Shared\Windows CE Tools\Platman\target\<processor type> folder. Then you're done. We get a ton of useful information by simply changing the compile and link steps of the application.
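As a rough sketch only, in a Platform Builder sources file the change might look something like the following; the macro usage and the CECAP_LIB_PATH placeholder are assumptions on my part, and your build environment (for example, eMbedded Visual C++ project settings) may express the same thing differently.

# Hypothetical sources-file fragment for an x86 build: add the profiling hook
# flag and pull in CeCap.lib (CECAP_LIB_PATH is a placeholder for the
# Platman\target\<processor type> folder mentioned above).
TARGETNAME=philo
TARGETTYPE=PROGRAM
SOURCES=philo.cpp

CDEFINES=$(CDEFINES) /Gh
TARGETLIBS=$(TARGETLIBS) $(CECAP_LIB_PATH)\cecap.lib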

The addition of CeCap.lib will of course affect application performance; an application may take between two and six times as long to run. In exchange, the CeCap library gives you a uniquely detailed view of your application logic during execution, and it can be used during lower speed testing to develop confidence in your application code.

Kernel Tracker

The Remote Kernel Tracker can be used to examine the interaction between processes, threads, and interrupts on a running device. Again, the sample described below demonstrates integration with Kernel Tracker. The example is an application that has two threads, each trying to get access to a process-global critical section; of course, only one thread at a time can get access.

One thread is set to thread priority 50 and the other to thread priority 250, using the function CeSetThreadPriority(). (Note that threads are, by default, created with a thread priority of 251.) The higher priority thread obtains the critical section, outputs a Kernel Tracker event, and sits in a loop querying the performance counter to spin for one millisecond before releasing the critical section. The lower priority thread obtains the critical section, outputs a Kernel Tracker event (more on this in a second), and then releases the critical section. We would therefore expect to see the high priority thread running longer than the lower priority thread within Kernel Tracker.
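The KevLab source isn't listed in the article, so here is only a hedged sketch of the high priority thread just described; the function and variable names are illustrative.

#include <windows.h>
#include <celog.h>

static CRITICAL_SECTION g_cs;   // process-global critical section shared by both threads

DWORD WINAPI HighPriorityThread(LPVOID lpParam)
{
    LARGE_INTEGER liFreq, liStart, liNow;
    DWORD dwThreadID = GetCurrentThreadId();

    CeSetThreadPriority(GetCurrentThread(), 50);   // well above the default of 251
    QueryPerformanceFrequency(&liFreq);

    EnterCriticalSection(&g_cs);

    // Custom event that shows up in Kernel Tracker as the four-square box.
    CELOGDATA(TRUE, CELID_RAW_LONG, &dwThreadID,
              (WORD)(sizeof(DWORD)), 1, CELZONE_MISC);

    // Spin for roughly one millisecond while holding the critical section.
    QueryPerformanceCounter(&liStart);
    do
    {
        QueryPerformanceCounter(&liNow);
    } while ((liNow.QuadPart - liStart.QuadPart) < (liFreq.QuadPart / 1000));

    LeaveCriticalSection(&g_cs);
    return 0;
}

The lower priority thread is essentially the same minus the spin loop, set to priority 250.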

Here's Kernel Tracker. The user interface is divided into three areas: the left pane shows the interrupts and processes, the center pane shows the thread/process interaction, and the right pane (not shown) is a key to the symbols used in the center pane. We can clearly see the KevLab application running (at the bottom of the image), with its two threads visible. We can see that the topmost thread within the KevLab application (the high priority thread) has access to the critical section for a long period of time compared with the lower priority thread. But exactly how long is the higher priority thread running?

Figure 2. The Remote Kernel Tracker user interface

Kernel Tracker provides the ability to set time markers between events and displays the time difference in the status bar of the Kernel Tracker application. There are a number of predefined events, covering synchronization, miscellaneous, and user-defined categories. The thread state is also displayed, showing threads running, blocked, sleeping, and migrating. In this example, each thread makes a call to CELOGDATA to output the ID of the currently executing thread. This can be seen in the Kernel Tracker window as a box with four squares, which shows that a custom event type has been output.

Here's an example of how to call CELOGDATA(). In this case I'm outputting the current thread ID, although I could output a count (say, the number of interrupts I've handled) or my current thread priority; it's your call...

// Log the current thread ID as a raw DWORD in the miscellaneous zone (requires celog.h).
DWORD dwThreadID = GetCurrentThreadId();
CELOGDATA(TRUE, CELID_RAW_LONG, &dwThreadID,
          (WORD)(sizeof(DWORD)), 1, CELZONE_MISC);

So, back to the time markers. In the image below, we've set the first time marker on the high priority thread's CELOGDATA event (when that thread gets access to the critical section), and the second on the low priority thread's CELOGDATA event (again, when the low priority thread gains access to the critical section).

The marker time delta is shown in the status bar of Kernel Tracker as .001120 seconds, or 1120 µs. This is the time for which the high priority thread held the critical section, stalling the lower priority thread.

Figure 3. Remote Kernel Tracker time delta

The Kernel Tracker tool can be used to locate and examine deadlock situations, and to examine the amount of time being spent in your application and driver threads. Running Kernel Tracker does affect the overall timing of the operating system, perhaps adding an additional 2 to 3% of overhead to the system.

OSBENCH

This program provides benchmarking for an extensive set of operating system operations, ranging from the timing of the kernel's synchronization primitives to the overhead of a Protected Server Library (PSL) call from your application into one of the operating system processes (FileSys.exe, Device.exe, and so on). The tests are broken down into seven basic groups, as follows:

  1. CriticalSections
  2. Event set-wakeup
  3. Semaphore release-acquire
  4. Mutex
  5. Voluntary yield
  6. PSL API call overhead
  7. Interlocked APIs (decrement, increment, testexchange, exchange)

As an example, let's look at the output of a couple of tests:

===================================================================
|  0.01  |  IP =  NO  |  CS =  NO  |       1 IPS
-------------------------------------------------------------------
EnterCriticalSection traditional (blocking) without priority inversion :
Time from a higher priority thread calling EnterCS (blocked) to a lower
priority runnable thread getting run
-------------------------------------------------------------------
|  Max Time =         13.409 µs
|  Min Time =          7.543 µs
|  Avg Time =          8.389 µs
===================================================================
===================================================================
|  0.02  |  IP =  NO  |  CS =  NO  |    1000 IPS
-------------------------------------------------------------------
EnterCriticalSection fastpath (uncontested)
-------------------------------------------------------------------
|  Subtracting out base result of 12 ticks
|  Max Time =          0.064 µs
|  Min Time =          0.061 µs
|  Avg Time =          0.061 µs
===================================================================

These tests compare the amount of time it takes to process a call to EnterCriticalSection(). There are two paths through that function. The first is the blocking path, taken when there is contention for the critical section; it causes the implementation to transition to the kernel to handle the contention. The second, the fast path, remains entirely in the calling process when there is no contention for the critical section, and is obviously a lot faster. (This explains why critical sections are preferred for synchronization.)
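To make the difference concrete, here is a hedged sketch of how you might time the uncontested fast path yourself with the performance counter; this is in the spirit of OSBench, not its actual source.

#include <windows.h>

// Time an uncontested EnterCriticalSection/LeaveCriticalSection pair. With no
// other thread owning the critical section, the calls never leave the process.
void TimeUncontestedCriticalSection(void)
{
    CRITICAL_SECTION cs;
    LARGE_INTEGER liFreq, liStart, liEnd;
    const int ITERATIONS = 1000;
    int i;

    InitializeCriticalSection(&cs);
    QueryPerformanceFrequency(&liFreq);

    QueryPerformanceCounter(&liStart);
    for (i = 0; i < ITERATIONS; i++)
    {
        EnterCriticalSection(&cs);   // fast path: no contention, no kernel transition
        LeaveCriticalSection(&cs);
    }
    QueryPerformanceCounter(&liEnd);

    // Average cost of one Enter/Leave pair, in nanoseconds.
    __int64 i64TotalNs = (liEnd.QuadPart - liStart.QuadPart) * 1000000000
                         / liFreq.QuadPart;
    RETAILMSG(1, (TEXT("Enter/Leave pair: %d ns\r\n"),
                  (int)(i64TotalNs / ITERATIONS)));

    DeleteCriticalSection(&cs);
}

Measuring the contended case takes two threads and some care with priorities, which is exactly why OSBench does it for you.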

ILTIMING

ILTIMING measures the Interrupt Latencies in a system. It uses a number of OAL support capabilities to measure the response times to interrupts for the ISR and IST. These numbers are critical for understanding the constraints of your system.

Let's take a look at the numbers on an AMD K6 500 MHz-based CEPC system.

Table 2. ILTiming results

              ISR starts      IST starts
Minimum       1.6 µs          23.5 µs
Average       3.1 µs          32.2 µs
Maximum       10.6 µs         122.6 µs

Whoa! Hold on! Those don't look like anything that could be classified as real time. What gives?

Glad you asked! Well, it shows that real time is a combination of a number of factors in a system, and not just the operating system. So, what does one do to get better responses in the system? Good question. You need to understand the problem in order to answer that. And like most real-world problems, there is actually more than one issue to contend with.

The first issue is the PC architecture of the CEPC platform. The PC uses a separate real-time clock (RTC) to keep track of the time across power cycles. This device is attached to the I/O space of the CPU on the ISA bus. (On most modern systems, it's on a PCI-ISA bridge.) Reading the current time requires several reads from the device, and these read cycles can generate bus cycles of up to 60 µs during which the system bus is locked. This really throws off the interrupt latency. So how do we fix that?

Well it turns out the answer to that is actually documented in the online help. (At this point I have to thank my fellow eMVP, Nat Frampton, who politely pointed out this issue with the Internet equivalent of a smack in the head and a polite RTFM <grin>. When I saw the results of that first test with ILTIMING, I was a bit perplexed since I'd seen numbers from other tests that were much better. Nat set me straight.) The online help contains a sample of a "virtualized" RTC implementation where the RTC is read once at initialization time and then never read again. Instead, the system tick is used to keep track of the amount of time since the initial read. Making this change reduces the numbers significantly, but the maximum times are still out of whack with what we'd expect or want. Something else is going on. (We did mention there was more than one issue, didn't we?)
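The help sample itself isn't reproduced here, but the shape of a virtualized RTC is roughly as follows; the hardware-read and conversion helpers are hypothetical placeholders, and CurMSec is the millisecond tick counter the kernel maintains for the OAL.

#include <windows.h>

// Hedged sketch of a "virtualized" RTC for the OAL: read the hardware clock
// once at boot, then derive the current time from the millisecond tick counter
// so the slow, bus-locking ISA reads never happen again at run time.
extern volatile DWORD CurMSec;          // kernel millisecond counter, updated by the timer interrupt

static ULONGLONG g_ullBaseTime100ns;    // RTC value captured once at init (100 ns units)
static DWORD     g_dwBaseTickMs;        // tick count when the RTC was read

void VirtualRTCInit(void)
{
    g_ullBaseTime100ns = ReadHardwareRTC100ns();   // hypothetical: the only real RTC access
    g_dwBaseTickMs     = CurMSec;
}

BOOL OEMGetRealTime(LPSYSTEMTIME lpst)
{
    // Elapsed time comes from the tick counter, not the RTC, so this call
    // never issues the locked bus cycles described above.
    DWORD dwElapsedMs = CurMSec - g_dwBaseTickMs;
    ULONGLONG ullNow100ns = g_ullBaseTime100ns + (ULONGLONG)dwElapsedMs * 10000;
    Time100nsToSystemTime(ullNow100ns, lpst);      // hypothetical conversion helper
    return TRUE;
}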

The second issue is a bit subtler and is the result of the way the CPU itself works. When using virtual memory, the CPU has a Translation Lookaside Buffer (TLB) that is used to cache virtual-to-physical memory-address mappings. The x86 architecture uses a software-assisted method of handling TLB misses. So if there is a TLB cache miss, software needs to handle updating the TLB, and that eats up CPU time with interrupts off, affecting the interrupt latency of the system. So how do we get around the problem?

Well, the operating system team at Microsoft put some thought into this issue, and in Windows CE .NET they added a variable in the OAL called dwOEMTPoolSize. This value specifies the TLB pool size in pages. The larger this value, the smaller the impact of a TLB miss (and the fewer of them that occur). However, this doesn't come for free: it takes more time to handle process context switches, since more information needs saving, and it requires memory to hold the pages in the pool. You can choose to pay the price and set it to its maximum value (512), or you can figure out an optimal value for your particular system.

The value for dwOEMTPoolSize is the number of pages to set aside for the page table pool. Mapping a complete process space (32 MB) requires 8 pages. Mapping all of the 32 processes and the shared memory-mapped region (2 GB of address space in total) requires a pool size of 512, which translates to an extra 2 MB of memory used for the pool. (Remember, we said it wasn't free!) So you can count the number of processes you will use in your system and multiply by 8. Then you need to find out how much of the pool to set aside for the shared memory area. To calculate that, you can use the "mi" command from the CE Target Control window in the IDE and locate the entries labeled "Mapped file section," as follows:

Mapped file section 0
Slot base 42000000  Section ptr 83ff3000
Page summary: code=0(0) data r/o=0 r/w=6500 stack=0 reserved=1691

Mapped file section 1
Slot base 44000000  Section ptr 83fe7000
Page summary: code=0(0) data r/o=0 r/w=0 stack=0 reserved=8192

Mapped file section 2
Slot base 46000000  Section ptr 83fd6000
Page summary: code=0(0) data r/o=145 r/w=1 stack=0 reserved=5998

You then add the r/w section values of each and divide by 32. So in this sample: (6500 + 0 + 1)/32 = 203.15625. Rounded up, that's 204. Then add this together with the value for the number of processes (8 processes x 8 pages each = 64) for a total of 268. You can then set the dwOEMTPoolSize to 268 to optimize it for this particular system.

Keep in mind that any changes to the system requiring more pages will cause an impact on the system response times. In a bit we'll take a look at the numbers if you just set it to 512 and chose not to worry about it. There's still one more factor to account for that will affect the timing results we are trying to measure.

Whenever you instrument a system to measure something, there is always the risk that the instrumentation itself will throw off the results. That can actually occur in this case. However, the potential problem is not ILTIMING itself, but the method of receiving its results. The debug connection to the device used to capture the output will affect the latency of the system. We don't need that connection in a release build, so there is no point in spending a lot of effort on it. Instead, we will simply eliminate the issue by building a release image with KITL and CE target control disabled. Once we have that, we can use the debug serial port to capture the output: the data is collected first and then output as a complete set, so any timing issue the serial port might have won't affect the results. Now it's time to look at some of those numbers again and see what happens:

Table 3. dwOEMTPoolSize = 16 (the default out of the box for a CEPC)

              ISR starts      IST starts
Minimum       1.6 µs          7.5 µs
Average       1.6 µs          8.4 µs
Maximum       2.5 µs          36.0 µs

Table 4. dwOEMTPoolSize = 512 (the extreme)

              ISR starts      IST starts
Minimum       1.6 µs          21.7 µs
Average       1.6 µs          22.7 µs
Maximum       1.6 µs          26.8 µs

Now those values are all a bit larger due to the overhead of managing the larger pool size, but they are also a lot more consistent and won't spike due to a TLB miss. (Then there is the addition of the extra memory for the larger pool.) So this shows the worst-case impact. If that's acceptable in your system, you can just leave the pool maxed out at 512 and not worry about the issue any more. Otherwise, you will need to go through the process of calculating the optimum pool size for the conditions of your system.

So there we have it. On our system, Windows CE .NET does conform to the OMAC definition of a hard real-time operating system, and it ships with the tools and resources needed to build, test, and deploy a real-time device. All of these tools (Kernel Tracker, Remote Call Profiler, OSBENCH, and ILTIMING) work together to help you evaluate the real-time capabilities of Windows CE .NET on your platform. Next month we will create a Windows CE .NET Web cam project. Unfortunately, it's time to go; we're out of (real) time for this month's article.

 

Get Embedded

Mike Hall is a Product Manager in the Microsoft Embedded and Appliance Platform Group (EAPG). Mike has been working with Windows CE since 1996—in developer support, Embedded System Engineering, and the Embedded product group. When not at the office, Mike can be found with his family, working on Skunk projects, or riding a Honda ST1100.

Steve Maillet is the Founder and Senior Consultant for Entelechy Consulting. Steve has provided training and has developed Windows CE solutions for clients since 1997, when CE was first introduced. Steve is a frequent contributor to the Microsoft Windows CE development newsgroups. When he's not at his computer burning up the keys, Steve can be found jumping out of airplanes at the nearest drop zone.