Designing and Optimizing Microsoft Windows CE .NET for Real-Time Performance

 

Mike Thomson
Jason Browne
Microsoft Corporation

Updated June 2002

Applies to:
    Microsoft® Windows® CE .NET and later

Summary: This paper describes in technical detail the changes made in the Microsoft Windows CE operating system (OS) that are designed to enhance its real-time performance characteristics. It also discusses the tools available to test real-time performance, and provides representative real-time performance test results for particular hardware configurations. (21 printed pages)

Contents

Introduction
Changes to the Kernel
Real-Time Measurement Tools
Performance Measurements
Summary

Introduction

Real-time performance is essential for the time-critical responses required in high-performance embedded applications, such as telecommunications switching equipment, industrial automation and control systems, medical monitoring equipment, and space navigation and guidance systems. Such applications must deliver their responses within specified time parameters in real time.

What is real-time performance? For the Microsoft Windows CE .NET OS, the following list defines real-time performance:

  • Guaranteed upper bounds on high-priority thread scheduling—only for the highest-priority thread among all the scheduled threads.
  • Guaranteed upper bound on delay in scheduling high-priority interrupt service routines (ISRs). The kernel has a few places where pre-emption is turned off for a short, bounded time.
  • Fine control over the scheduler and how it schedules threads.

It is important to distinguish between a real-time system and a real-time OS (RTOS). The real-time system consists of all elements—the hardware, OS, and applications—that are needed to meet the system requirements. The RTOS is just one element of the complete real-time system and must provide sufficient functionality to enable the overall real-time system to meet its requirements.

Although previous versions of Windows CE offered some RTOS capabilities, a number of significant changes made to the kernel since Windows CE 3.0 have greatly enhanced real-time performance. The Windows CE .NET kernel contains the same real-time enhancements as Windows CE 3.0, with some additional features. This paper describes the following changes, made in Windows CE .NET and Windows CE 3.0:

Windows CE .NET

  • Added the ability to specify the page table pool size through an OEM-defined variable for x86 platforms.

Windows CE 3.0

  • Increased number of thread priority levels from 8 to 256.
  • More control over times and scheduling. Applications can control the amount of time provided to each thread and manipulate the scheduler to their advantage. Timer accuracy is now one millisecond for Sleep- and Wait-related application programming interfaces (APIs).
  • Improved method for handling priority inversion.
  • Full support for nested interrupts.
  • Reduced ISR and interrupt service thread (IST) latencies.
  • More granular memory management control.

In addition, this paper describes tools used to test the real-time performance of the kernel and provides real-time performance test results on three different CPUs.

Changes to the Kernel

The kernel is the inner core of the Windows CE OS and is responsible for scheduling and synchronizing threads, processing exceptions and interrupts, loading applications, and managing virtual memory. In Windows CE 3.0, the kernel underwent the following changes to increase performance and reduce latencies:

  • All kernel data structures were moved into physical memory, which largely avoids translation look-aside buffer (TLB) misses while non-preemptible kernel code is executing.
  • All non-preemptible, but interruptible, portions of the kernel, known as KCALLs, were broken into smaller non-preemptible sections. This introduces some complexity, due to the increased number of sections, but it lets pre-emption be turned off for shorter periods of time.

This section describes further changes made to the kernel to enhance the real-time performance of Windows CE 3.0.

More Priority Levels

The kernel's scheduler runs a thread with a higher-priority level first and runs threads with the same priority in a round-robin fashion. Assigning priority levels to threads is one way to manage the execution speed.

Windows CE 3.0 increased the number of priority levels available for threads from 8 to 256, with 0 being the highest priority and 255 the lowest. Priority levels 0 to 7 of the previous version of Windows CE correspond to levels 248 to 255 in Windows CE 3.0. The larger number of priority levels gives developers greater flexibility in controlling the scheduling of embedded systems and helps prevent random applications from degrading system performance by restricting the priority levels available to them.

To assign these new priorities, Windows CE 3.0 introduces two new functions: CeSetThreadPriority and CeGetThreadPriority. The new functions look exactly like the SetThreadPriority and GetThreadPriority functions in Windows CE 2.12, except that the new functions take a number in the range of 0 to 255.
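
The following minimal sketch shows how an application might move a thread into the extended priority range. The function name and the priority value 130 are illustrative only; choose a level that fits the priority scheme of your system.

#include <windows.h>

// Move the current thread into the extended priority range
// (0 = highest, 255 = lowest). The value 130 is an arbitrary example.
void SetWorkerPriority(void)
{
    HANDLE hThread = GetCurrentThread();
    int nOldPriority = CeGetThreadPriority(hThread);

    if (!CeSetThreadPriority(hThread, 130)) {
        // Handle the failure; nOldPriority still holds the original level
        // and can be restored later with CeSetThreadPriority.
    }
}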

More Control over Times and Scheduling

Windows CE 3.0 has improved timer performance with one-millisecond accuracy in the timer and Sleep function calls, and applications can set a quantum for each thread.

The timer, or system tick, is the rate at which a timer interrupt is generated and serviced by the OS. Previously, the timer was also the thread quantum, the maximum amount of time that a thread could run in the system without being preempted. In Windows CE 3.0, the timer is no longer directly related to the thread quantum.

Previously, the OEM set the timer and the quantum as a constant in the OEM adaptation layer (OAL) at about 25 milliseconds. When the timer fired, the kernel scheduled a new thread if one was ready. In Windows CE 3.0, the timer is always set to one millisecond and the quantum can be set for each thread.

Changing the timer from OEM-defined to one millisecond lets an application perform a Sleep(1) function and expect to receive approximately one-millisecond accuracy. Of course, this is dependent on the priority of the thread, the priority of other threads, and whether ISRs are running. Previously, Sleep(1) returned on a system tick, which meant Sleep(1) was really Sleep(25) if the timer was set to 25 milliseconds.
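
As a rough illustration, a high-priority thread can time its own Sleep(1) call with the high-resolution counter. This sketch is illustrative only; the measured value will vary with thread priorities and system load.

#include <windows.h>

// Time a single Sleep(1) call, in microseconds. Run this from a
// high-priority thread so that other threads do not delay the wake-up.
DWORD MeasureSleepOneMs(void)
{
    LARGE_INTEGER liFreq, liStart, liEnd;

    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    Sleep(1);
    QueryPerformanceCounter(&liEnd);

    return (DWORD)(((liEnd.QuadPart - liStart.QuadPart) * 1000000) /
                   liFreq.QuadPart);
}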

Timer interrupt

The kernel now exposes a few variables to developers that determine whether a reschedule is required on the system tick. A fully implemented system tick ISR can prevent an unnecessary reschedule by returning the SYSINTR_NOP flag instead of the SYSINTR_RESCHED flag, when appropriate. Nk.lib exports the following variables for use in the timer ISR:

  • dwPreempt is the number of milliseconds until the thread is pre-empted.
  • dwSleepMin is the number of milliseconds until the first timeout—if any—expires, requiring a reschedule.
  • ticksleft is the number of system ticks that have elapsed but have not yet been processed by the scheduler's sleep queues; thus, a non-zero value causes a reschedule.

In the Timer ISR, additional logic optimizes the scheduler and prevents the kernel from doing unnecessary work as the following code sample shows.

if (ticksleft || (dwSleepMin && (DiffMSec >= dwSleepMin)) || (dwPreempt && (DiffMSec >= dwPreempt)))
    return SYSINTR_RESCHED;

return SYSINTR_NOP;

OEMIdle function

The OEM implements the OEMIdle function, which is called by the kernel when there are no threads to schedule. In previous releases, the timer tick forced the OS out of an idle state and back into the kernel to determine if threads were ready to be scheduled. If no threads were ready, the kernel again called OEMIdle. This operation caused the kernel to be activated every 25 milliseconds—or other quantum specified by the OEM—to determine that there were still no threads to schedule. On a battery-powered device, such an operation uses up valuable battery life.

To reduce power consumption despite the higher tick rate in Windows CE 3.0, the OEMIdle function can put the CPU in standby mode for longer than one millisecond. The OEM programs the system tick timer to wake up at the first available timeout by using the dwSleepMin and DiffMSec variables. DiffMSec is the current millisecond value since the last interval time was retrieved from the TimerCallBack function.

The hardware timer is likely to have a maximum timeout that is less than MAX_DWORD milliseconds, so the timer may be programmed for its maximum wait time. In all cases, when the system returns from idle, the OEMIdle function must update CurMSec and DiffMSec with the actual number of milliseconds that have elapsed. CurMSec is the current value for the interval time—that is, the number of milliseconds since startup.
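
The outline below sketches the general shape of such an OEMIdle implementation. The routines OEMProgramTimerAndHalt and OEMMillisecondsInStandby, and the MAX_TIMER_MSEC limit, are hypothetical placeholders for platform-specific timer code; only CurMSec, DiffMSec, and dwSleepMin come from the kernel.

// Kernel timing variables referenced by the OAL.
extern volatile DWORD CurMSec;     // milliseconds since startup
extern volatile DWORD DiffMSec;    // milliseconds since the last interval update
extern DWORD dwSleepMin;           // milliseconds until the first timeout, if any

#define MAX_TIMER_MSEC 1000        // hypothetical hardware timer limit

// Hypothetical platform routines; a real OAL uses its own timer hardware code.
void  OEMProgramTimerAndHalt(DWORD dwMSec);
DWORD OEMMillisecondsInStandby(void);

void OEMIdle(DWORD dwIdleParam)
{
    DWORD dwWait, dwElapsed;

    // Sleep until the first pending timeout, or for the hardware timer's
    // maximum period if no timeout is pending.
    if (dwSleepMin)
        dwWait = (dwSleepMin > DiffMSec) ? (dwSleepMin - DiffMSec) : 1;
    else
        dwWait = MAX_TIMER_MSEC;
    if (dwWait > MAX_TIMER_MSEC)
        dwWait = MAX_TIMER_MSEC;

    // Program the timer and enter standby; any interrupt ends the idle state.
    OEMProgramTimerAndHalt(dwWait);

    // Account for the time actually spent idle.
    dwElapsed = OEMMillisecondsInStandby();
    CurMSec  += dwElapsed;
    DiffMSec += dwElapsed;
}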

Thread quantum

In Windows CE 3.0, the thread quantum is flexible enough to enable an application to set the quantum on a thread-by-thread basis. This lets a developer adapt the scheduler to the current needs of the application. To adjust the time quantum, two new functions have been added: CeGetThreadQuantum and CeSetThreadQuantum. This change enables an application to set the quantum of a thread based on the amount of time needed by the thread to complete a task. By setting the thread quantum of any thread to zero, a round-robin scheduling algorithm can change to a run-to-completion algorithm. Only a higher-priority thread or a hardware interrupt can preempt a thread that is set to run-to-completion.

The default quantum is 100 milliseconds, but an OEM can override the default for the system by setting the kernel variable dwDefaultThreadQuantum to any value greater than zero during the OEM initialization phase.
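
The following sketch shows one way an application might use these functions; the function name is illustrative.

#include <windows.h>

// Switch the current thread to run-to-completion scheduling by setting its
// quantum to zero. Only a higher-priority thread or a hardware interrupt can
// then preempt it. The previous quantum is returned so it can be restored.
DWORD MakeRunToCompletion(void)
{
    HANDLE hThread = GetCurrentThread();
    DWORD dwOldQuantum = CeGetThreadQuantum(hThread);

    CeSetThreadQuantum(hThread, 0);
    return dwOldQuantum;
}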

Changes to Handling Priority Inversion

To help improve response time, Windows CE 3.0 changed its approach to priority inversion, which occurs when a low-priority thread owns a kernel object that a higher-priority thread requires. Windows CE deals with priority inversion by using priority inheritance: a thread that holds a kernel object needed by a higher-priority thread temporarily inherits that higher priority. Priority inheritance enables the lower-priority thread to run and free the resource for use by the higher-priority thread. Previously, the kernel handled an entire inversion chain. Since Windows CE 3.0, the kernel guarantees handling priority inversion only to a depth of one level.

There are two basic examples of priority inversion. The first is a simple case where the processing of priority inversion has not changed from Windows CE 2.12 to Windows CE 3.0. This case can be seen when, for example, you have three threads in a running state. Thread A is at priority 1 and threads B and C are at a lower priority. If thread A is running and becomes blocked because thread B is holding a kernel object that thread A needs, then thread B's priority is boosted to A's priority level to allow thread B to run. If thread B then becomes blocked because thread C is holding a kernel object that thread B needs, thread C's priority is boosted to A's priority level to allow thread C to also run.

The second and more interesting case is where thread A can run at a higher priority than B and C, thread B holds a kernel object needed by A, thread B is blocked waiting for C to release a kernel object that it needs, and C is running. In Windows CE 2.12, when A runs and then blocks on B, the priorities of both B and C are boosted to A's priority to enable them to run. In Windows CE 3.0, when A blocks on B, only thread B's priority is boosted. By reducing the complexity and changing the algorithm, the largest KCALL in Windows CE was greatly reduced and bounded.
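
The sketch below sets up this second case on a target so that the behavior can be observed. The priority values, the Sleep-based ordering, and the BusyWork helper are illustrative only; under Windows CE 3.0, only thread B inherits thread A's priority while thread C continues at its own level.

#include <windows.h>

static CRITICAL_SECTION g_cs1, g_cs2;

// Burn CPU time while holding a critical section.
static void BusyWork(DWORD dwMs)
{
    DWORD dwStart = GetTickCount();
    while ((GetTickCount() - dwStart) < dwMs)
        ;
}

static DWORD WINAPI ThreadC(LPVOID pv)          // lowest priority
{
    EnterCriticalSection(&g_cs2);
    BusyWork(500);                              // run while holding CS2
    LeaveCriticalSection(&g_cs2);
    return 0;
}

static DWORD WINAPI ThreadB(LPVOID pv)          // middle priority
{
    EnterCriticalSection(&g_cs1);
    Sleep(10);                                  // let C acquire CS2 first
    EnterCriticalSection(&g_cs2);               // blocks on C
    LeaveCriticalSection(&g_cs2);
    LeaveCriticalSection(&g_cs1);
    return 0;
}

static DWORD WINAPI ThreadA(LPVOID pv)          // highest priority
{
    Sleep(20);                                  // let B acquire CS1 first
    EnterCriticalSection(&g_cs1);               // blocks on B; B inherits A's priority
    LeaveCriticalSection(&g_cs1);
    return 0;
}

void RunInversionScenario(void)
{
    HANDLE h[3];

    InitializeCriticalSection(&g_cs1);
    InitializeCriticalSection(&g_cs2);

    h[0] = CreateThread(NULL, 0, ThreadC, NULL, CREATE_SUSPENDED, NULL);
    h[1] = CreateThread(NULL, 0, ThreadB, NULL, CREATE_SUSPENDED, NULL);
    h[2] = CreateThread(NULL, 0, ThreadA, NULL, CREATE_SUSPENDED, NULL);

    CeSetThreadPriority(h[0], 202);             // C: lowest of the three
    CeSetThreadPriority(h[1], 201);             // B: middle
    CeSetThreadPriority(h[2], 200);             // A: highest

    ResumeThread(h[0]);
    ResumeThread(h[1]);
    ResumeThread(h[2]);
    WaitForSingleObject(h[2], INFINITE);        // wait for A to finish
}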

Interrupt Handling and Nested Interrupts

Real-time applications use interrupts as a way to ensure that the OS quickly notices external events. Within Windows CE, the kernel and the OAL are tuned to optimize interrupt delivery and event dispatching to the rest of the system. Windows CE balances performance and ease of implementation by splitting interrupt processing into two steps: an interrupt service routine (ISR) and an interrupt service thread (IST).

Each hardware interrupt request line (IRQ) is associated with one ISR. When interrupts are enabled and an interrupt occurs, the kernel calls the registered ISR for that interrupt. The ISR, the kernel-mode portion of interrupt processing, is kept as short as possible. Its responsibility is primarily to direct the kernel to launch the appropriate IST.

The ISR performs its minimal processing and returns an interrupt identifier to the kernel. The kernel examines the returned interrupt identifier and sets the associated event that links an ISR to an IST. The IST waits on that event. When the kernel sets the event, the IST stops waiting and starts performing its additional interrupt processing if it is the highest-priority thread ready to run. Most of the interrupt handling actually occurs within the IST.
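
The following sketch shows the typical shape of an IST. SYSINTR_MYDEVICE and the IST priority value are placeholders; a real driver uses the interrupt identifier associated with its own ISR.

#include <windows.h>

#define SYSINTR_MYDEVICE (SYSINTR_FIRMWARE + 1)   // placeholder interrupt identifier

// Typical interrupt service thread. The event is associated with the
// interrupt identifier that the device's ISR returns to the kernel.
DWORD WINAPI MyDeviceIST(LPVOID pvContext)
{
    HANDLE hIntrEvent = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset

    if (!hIntrEvent ||
        !InterruptInitialize(SYSINTR_MYDEVICE, hIntrEvent, NULL, 0))
        return 1;

    CeSetThreadPriority(GetCurrentThread(), 100);   // example IST priority

    for (;;) {
        // The ISR-IST event may only be waited on with WaitForSingleObject.
        WaitForSingleObject(hIntrEvent, INFINITE);

        // Service the device here.

        InterruptDone(SYSINTR_MYDEVICE);            // re-arm the interrupt
    }
}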

Nested interrupts

In versions prior to Windows CE 3.0, when an ISR was running, all other interrupts were turned off. This prevented the kernel from handling any additional interrupts until one ISR had completed. So if a high-priority interrupt were ready, the kernel would not handle the new interrupt until the current ISR had completed operations and returned to the kernel.

To prevent the loss and delay of high-priority interrupts, Windows CE 3.0 added support for nesting interrupts based on priority, if the CPU or additional associated hardware supports it. When an ISR is running in Windows CE 3.0, the kernel runs the specified ISR as before, but disables only interrupts of the same and lower priority. If a higher-priority ISR is ready to run, the kernel saves the state of the running ISR and lets the higher-priority ISR run. The kernel can nest as many ISRs as the CPU supports; ISRs nest in order of their hardware priority.

In most cases, an OEM's current ISR code would not change because the kernel takes care of the details. If the OEM is sharing global variables between ISRs, changes may be required, but in general, ISRs are not aware that they have been interrupted for a higher-priority ISR. Where an ISR performs an action periodically, a noticeable delay may occur but only if a higher-priority IRQ is fired.

After the highest-priority ISR ends, any pending lower-priority ISRs are executed. Then the kernel resumes processing any KCALL that was interrupted. If a thread was being scheduled and was interrupted in the middle of its KCALL, the scheduler resumes processing the thread. This enables the kernel to pick up where it left off and not totally restart the scheduling of a thread, saving valuable time. Once the pending KCALL is complete, the kernel reschedules the threads for execution and starts executing the highest-priority thread that is ready to run.

Interrupt Latencies

One of the most important features of kernel real-time performance is the ability to service an IRQ within a specified amount of time. Interrupt latency refers primarily to the software interrupt handling latencies—that is, the amount of time that elapses from the time that an external interrupt arrives at the processor until the time that the interrupt processing begins.

Windows CE 3.0 interrupt latency times are bounded for threads locked in memory, if paging does not occur. This makes it possible to calculate the worst-case latencies—the total times to the start of the ISR and to the start of the IST. The total amount of time until the interrupt is handled can then be determined by calculating the amount of time needed within the ISR and IST.

ISR latency

ISR latency is the time from when an IRQ is set at the CPU to when the ISR begins to run. The following three time-related variables affect the start of an ISR:

A Maximum time that interrupts are off in the kernel. The kernel seldom turns off interrupts, but when they are turned off, it is for a bounded amount of time.

B Time between when the kernel dispatches an interrupt and when an ISR is actually invoked. The kernel uses this time to determine what ISR to run and to save any register that must be saved before proceeding.

C Time between when an ISR returns to the kernel and the kernel actually stops processing the interrupt. This is the time when the kernel completes the ISR operation by restoring any states, such as registers, that were saved before the ISR was invoked.

The start of the ISR that is being measured can be calculated from the current status of other interrupts in the system. If an interrupt is in progress, calculating the start of the new ISR to be measured must account for two factors: the number of higher-priority interrupts that will occur after the interrupt of interest has occurred, and the amount of time spent executing each of those ISRs. The resulting start time is:

Start of ISR = A + B + TISR(1) + TISR(2) + ... + TISR(NISR)

Where NISR is the number of higher-priority interrupts that will occur after the interrupt of interest has occurred, and TISR(N) is the amount of time needed to execute the Nth of those ISRs.

Figure 1. Schematic representation of formula for ISR start time

If no higher-priority interrupts occur (NISR = 0), the previous formula reduces to the following:

Start of ISR = A + B

Both Windows CE and the OEM affect the time to execute an ISR. Windows CE is in control of the variables A, B, and C, all of which are bounded. The OEM is in control of NISR and TISR(N), both of which can dramatically affect ISR latencies.

IST Latency

IST latency is the time from when an ISR finishes execution—that is, signals a thread—to when the IST begins execution. The following four time-related variables affect the start of an IST:

B Time between when the kernel dispatches an interrupt and when an ISR is actually invoked. The kernel uses this time to determine what ISR to run and to save any register that must be saved before proceeding.

C Time between when ISR returns to the kernel and the kernel actually stops processing the interrupt. This is the time when the kernel completes the ISR operation by restoring any state, such as registers, that were saved before the ISR was invoked.

L Maximum time in a KCALL.

M Time to schedule a thread.

The start time of the highest-priority IST begins after the ISR returns to the kernel and the kernel performs some work to begin the execution of the IST. The IST start time is also affected by the total time in all ISRs that run after the measured ISR returns and signals its IST. The resulting start time is:

Start of highest priority IST = B + C + L + M + TISR(1) + TISR(2) + ... + TISR(NISR)

Figure 2. Schematic representation of formula for start of highest priority IST

Both Windows CE and the OEM affect the time required to execute an IST. Windows CE is in control of the variables B, C, L, and M, which are bounded. The OEM is in control of NISR and TISR(N), which can dramatically affect IST latencies.

Windows CE 3.0 also adds the following restriction to ISTs: The event handle that links the ISR and IST can only be used in the WaitForSingleObject function. Windows CE 3.0 prevents the ISR-IST event handle from being used in a WaitForMultipleObjects function, which means that the kernel can guarantee an upper bound on the time to trigger the event and time to release the IST.

Memory and Real-Time Performance

The kernel supports several types of kernel objects, such as processes, threads, critical sections, mutexes, events, and semaphores. Because the OS uses virtual memory, all kernel objects are allocated in virtual memory, and thus the memory for these objects is allocated on demand. Because allocating memory on demand can affect performance, OEMs should allocate kernel objects whenever a process starts. Note that once the kernel has allocated memory for a kernel object, it does not release the memory back to the system after the object has been freed. The kernel keeps this pool of memory available; it reuses memory from the pool when necessary and allocates more memory when the memory pool is insufficient.

There are three types of memory that can affect real-time performance: virtual, heap, and stack.

Virtual memory

Windows CE is a virtual memory–based OS that makes full use of a memory management unit (MMU). Because memory allocation and deallocation are based on virtual memory, real-time performance may be affected. The lowest level of memory allocation is virtual memory, accessed through APIs such as VirtualAlloc. When allocating virtual memory, the kernel searches for free physical memory and then associates that memory with the virtual address space of the process. Because virtual memory is allocated on 64-kilobyte (KB) boundaries, the virtual memory APIs should not be used for small allocations. During a virtual memory allocation request, the kernel searches for physical memory from its pool of physical memory. Depending on the amount of memory in use and its fragmentation, the time required for this search varies. In a virtually managed system, all processes obtain memory from the same pool of physical memory.

To reduce the impact on allocating virtual memory, a process should allocate and commit all virtual memory before proceeding with normal processing.
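
A minimal sketch of this approach follows; the buffer size, names, and the assumption of a 4-KB page size are illustrative.

#include <windows.h>

#define RT_BUFFER_SIZE (256 * 1024)     // example working-set size
#define RT_PAGE_SIZE   4096             // assumed page size

static BYTE *g_pRtBuffer;

// Reserve and commit the working buffer before real-time processing starts,
// so that no physical-memory search happens on the time-critical path.
BOOL PreallocateRtBuffer(void)
{
    DWORD i;

    g_pRtBuffer = (BYTE *)VirtualAlloc(NULL, RT_BUFFER_SIZE,
                                       MEM_RESERVE | MEM_COMMIT,
                                       PAGE_READWRITE);
    if (!g_pRtBuffer)
        return FALSE;

    // Touch every page so each one is backed by physical memory up front.
    for (i = 0; i < RT_BUFFER_SIZE; i += RT_PAGE_SIZE)
        g_pRtBuffer[i] = 0;

    return TRUE;
}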

Heap memory

The next level above virtual memory is heap memory and its associated APIs. The heap APIs rely on the low-level services that virtual memory APIs use for support. Memory allocations in a heap are not made on a 64-KB boundary and are meant for allocations smaller than 192 KB. Anything higher than 192 KB will cause the Heap Management System to use virtual memory directly to create a separate heap for the allocation.

A default heap is created for a process when it is created. An application can use the process heap for its memory allocations and the heap will grow and shrink accordingly. However, a performance penalty could be incurred if the amount and type of memory allocations in the default heap cause the heap to become fragmented. When a heap becomes fragmented, it spends more time trying to find a hole for a new memory allocation, which can affect performance. To prevent fragmentation from occurring in a process, an OEM should take control of the memory allocation process by creating separate heaps for similar objects. If a heap exists that contains same-sized objects, the Heap Manager will more easily be able to find a free block large enough to hold a new allocation. However, even if fragmentation is manageable through separate heaps, there is still a cost associated with allocating memory as you go.

To reduce the impact on allocating heap memory, a process should allocate heap memory before proceeding with normal processing.
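
The sketch below shows one way to keep same-sized objects in their own heap; the structure and sizes are illustrative.

#include <windows.h>

// Example fixed-size object kept in a dedicated heap so that these
// allocations do not fragment the process default heap.
typedef struct { DWORD dwData[16]; } MYOBJ;

static HANDLE g_hObjHeap;

BOOL InitObjectHeap(void)
{
    // Size the heap for the expected number of objects; it can still grow.
    g_hObjHeap = HeapCreate(0, 1024 * sizeof(MYOBJ), 0);
    return (g_hObjHeap != NULL);
}

MYOBJ *AllocObject(void)
{
    return (MYOBJ *)HeapAlloc(g_hObjHeap, 0, sizeof(MYOBJ));
}

void FreeObject(MYOBJ *pObj)
{
    HeapFree(g_hObjHeap, 0, pObj);
}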

Stack memory

When a new thread is created in the system, the kernel reserves memory for the stack. The amount of memory reserved for the stack is determined by the /STACK parameter passed to the linker when the module is built. The default stack size for Windows CE components is 65 KB. When a thread is scheduled for the first time, the thread's stack memory is committed; stack memory is committed one page at a time and committed only when needed.

To prevent the initial commitment of stack memory from affecting performance, ensure that your thread is scheduled at least once before it performs real-time processing. In the case of an IST, this generally happens because the thread performs several actions before settling into a WaitForSingleObject loop. Further stack commits can be avoided if your thread requires no more than the initial stack allocation. Because memory is typically scarce in a Windows CE environment, the kernel will at times reclaim stack memory from threads that no longer need it. The OEM controls the timing of this procedure and the parameters passed to SetOomEvent. If the values passed to SetOomEvent trigger a hunt for stack memory, the kernel suspends each thread that has free stack pages while that memory is reclaimed. To avoid this situation, carefully set the memory usage parameters for SetOomEvent.

Page table pool size

In Windows CE .NET, a new OEM variable called dwOEMPTPoolSize was created. This variable allows the OEM to define the size of the virtual memory page table data structure that the kernel uses to maintain its list of memory pages for each process on x86 platforms. The default value is 16 pages, with the maximum being 512 pages. Depending on the system, the OEM will have to increase the value until the right fit is found. The size of the page table pool affects how often a page of the pool must be zeroed to map memory to a different area in the same process or to a different process. The zeroing of a page occurs when there are no available pointers in the page table data structure, and for security reasons a page must be zeroed before it is assigned to the new process. Zeroing several pages can adversely affect the IST latency of a real-time platform. However, the OEM must balance this against the fact that as the structure gets bigger, the average process switch time increases, because the kernel needs to reset permissions for each page pointer in the structure. Because this variable is available only in Windows CE .NET and later, a quick fix engineering (QFE) release was created for developers who use Windows CE 3.0 that lets the page table pool size be set statically. For more information, please see this Microsoft Web site, open the Readme.txt file, and then search for QFE 33.
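
As a sketch, and assuming the variable is set during OEM initialization in the same way as other kernel tuning variables such as dwDefaultThreadQuantum, the OAL might contain something like the following; the value 32 is only an example.

extern DWORD dwOEMPTPoolSize;    // kernel page table pool size, in pages (x86)

void OEMInit(void)
{
    // ... other platform initialization ...

    // Default is 16 pages, maximum is 512; measure IST latency and process
    // switch time to find the right value for the platform.
    dwOEMPTPoolSize = 32;
}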

Real-Time Measurement Tools

The updates to the kernel for Windows CE 3.0 include two kernel-level tools, Interrupt Timing Analysis (ILTiming) and Scheduler Timing Analysis (OSBench), which test the real-time performance of the kernel and measure specific latencies. Performance numbers are hardware specific, depending on CPU type and speed, memory architecture, and cache organization and size.

Interrupt Timing Analysis (ILTiming)

The measurements of ISR and IST latencies have been combined in the ILTiming test tool, previously called IntrTime, which is available in source code and is also distributed with Microsoft Platform Builder. The measurements are made by using the system clock timer so that ILTiming can run on all hardware platforms that Windows CE supports, because some platforms do not provide a separate, unused timer.

Under normal circumstances, the system clock interrupts the kernel in regular intervals. The associated system timer ISR then processes the tick and returns either SYSINTR_NOP directing the kernel to ignore the tick or SYSINTR_RESCHED to wake up the scheduler.

The ILTiming test tool measures the latencies by taking every nth tick of the system clock (by default, every fifth system tick) and signaling a special SYSINTR_TIMING interrupt identifier event. The ILTiming application's main thread waits on the SYSINTR_TIMING interrupt event, thus becoming the IST. The ISR and IST measurements are derived from time stamps—that is, the counter values of the high-resolution timer since the last system tick.

Because ILTiming requires special modifications to the OAL only and not to the kernel, it can be easily adapted and can run on any OEM platform.

ILTiming command prompt parameters

The ILTiming command prompt parameters allow for the introduction of the following variations:

  • Setting the IST to run on various priorities
  • Flushing or not flushing the cache after each interrupt
  • Changing the ISR rate and number of interrupts captured
  • Printing or outputting to file the collected results

The following ILTiming command prompt parameters are available:

Usage: iltiming [ options ]
Options:
  -p num          Priority of the IST (default 0 ; highest)
  -ni             no idle priority thread 
  -i0             no idle thread (same as -ni)
  -i1             Run idle thread type 1
  -i2             Run idle thread type 2
  -i3             Run idle thread type 3
  -i4             Run idle thread type 4
  -i5             Run idle thread type 5
  -t num          SYSINTR_TIMING interval (default 5)
  -n num          number of interrupts (default 10)
  -all            print all data (default: print summary only)
  -o file         output to file (default: output to debug)
  -h              Display the help screen

The IST can be run at different priority levels (-p). By default, the application flushes the cache before each run. The option -ncs disables the CacheSync call. The -t option sets the ISR rate, and the system tick ISR returns SYSINTR_TIMING every tth tick.
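
For example, the following command line (the output file name is illustrative) runs the IST at the highest priority, returns SYSINTR_TIMING every fifth tick, captures 1000 interrupts, and writes the full data set to a file:

iltiming -p 0 -t 5 -n 1000 -all -o iltiming.txt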

ILTiming can also create one or more idle threads running in the background. This affects the IST latencies by allowing the kernel to be in a non-preemptible kernel call that must finish before the IST is run. The following five types of idle threads are available:

  • Idle thread 1: One thread simply spinning, doing nothing
  • Idle thread 2: One thread spinning doing a SetThreadPriority(IDLE)
  • Idle thread 3: Two threads alternating the SetEvent and WaitForSingleObject functions with a 10-second timeout
  • Idle thread 4: Two threads alternating the SetEvent and WaitForSingleObject functions with an infinite timeout
  • Idle thread 5: One thread calling VirtualAlloc and VirtualFree with an infinite timeout. Designed to flush the cache and the TLB.

External interrupt response measurements

For quick assessment of the day-to-day, real-time performance of the system, the Interrupt Timing Analysis tool is sufficient to determine the ISR and IST interrupt latencies. This convenient method works across all supported processors but relies on the timer on the device itself, which may affect the measurements.

Thus, a more elaborate setup can be used to accurately measure ISR and IST latencies. The following two machines can be set up:

  • A workstation that generates an external interrupt and measures the time it takes to receive acknowledgements from ISR and IST routines.
  • Device-under-test that receives the external interrupt and toggles output lines when ISR and IST routines are reached.

Testing is performed under various stress levels, running anywhere from one to hundreds of threads of varying priorities on the test device.

The Microsoft Windows NT® 4.0–based workstation, equipped with a National Instruments PC-TIO-10 digital I/O timer/counter card, is used to generate interrupts and time responses, and a CEPC target platform equipped with an identical card is used to respond to those interrupts. The Windows NT software takes advantage of the driver library supplied by National Instruments, while the Windows CE software is written by Microsoft.

The theory of operation is simple: The PC-TIO-10 card has two sets of five timers. Each set contains one timer that provides 200-nanosecond resolution, while the other timers have one-microsecond granularity. In addition, the card contains two sets of eight digital I/O lines, with each set providing one line that can be used to interrupt on edge or level triggering. One output line from the Windows NT–based machine is wired both to the external interrupt pin of the CEPC target platform and back to the timers on the Windows NT–based workstation's card.

As the Windows NT–based workstation asserts one of its output lines, it generates an interrupt on the CEPC target platform and starts ISR and IST timers on the Windows NT card. The ISR on the CEPC target platform acknowledges the receipt of the interrupt by asserting an output line on the card, which stops the ISR timer on the Windows NT–based workstation and notifies the kernel to schedule the IST. When the IST starts running, it asserts a different output line, stopping the second timer on the Windows NT–based workstation. At this point, the Windows NT–based workstation can read the values on the timer counters to determine the intervals between an interrupt being generated and the CEPC target platform's responses. As soon as the Windows NT–based workstation has read the counter values, it issues another interrupt that the CEPC target platform uses to bring all output lines to the standby state, ready for another cycle.

Preliminary results gathered using the above measurements confirm the accuracy of the ILTiming testing results.

Scheduler Timing Analysis (OSBench)

OSBench, previously called CEBench, is a performance tool included in Windows CE 3.0 and later. Its scheduler timing tests focus on measuring the time required to perform basic kernel operations, such as the following synchronization actions: how long it takes to acquire a critical section, how long it takes to schedule a thread waiting on an event that another thread has just set, and so on. Wherever appropriate, the test runs two sets of metrics: thread-to-thread within a process and thread-to-thread across processes. If appropriate, a stress suite may be applied while running the test.

OSBench collects timing samples for the following performance metrics in Windows CE:

  • Acquire/release critical section (both fastpath and traditional)
  • Wait/signal an event (wait single with auto reset)
  • Semaphores
  • Mutexes
  • Voluntary yield using Sleep(0)

Metrics that are different from the above yield/run scenarios are timings for interlocked APIs and the system-call overhead. These metrics are Interlocked Increment/Decrement, Interlocked Exchange, and System API call overhead.

The following sample code shows the OSBench command-prompt parameters.

Usage: osbench [ options ]
Options:
  -all         Run all tests (default: run only those specified by -t option)
  -t num       ID of test to run (need separate -t for each test)
  -n num       Number of samples per test (default = 100)
  -m addr      Virtual address to write marker values to (default = <none>)
  -list        List test IDs with descriptions
  -v           Verbose : show all measurements
  -o file      Output to CSV file (default: output only to debug)
  -h           Display the help screen

osbench -list
TestId 0 : CriticalSections
TestId 1 : Event set-wakeup
TestId 2 : Semaphore release-acquire
TestId 3 : Mutex
TestId 4 : Voluntary yield
TestId 5 : PSL API call overhead

As with ILTiming measurements, the QueryPerformanceCounter function call is used to obtain timing information. In addition, at every timing point where QueryPerformanceCounter is invoked, a user can specify that a specific marker value be written to a virtual address. Providing the virtual address at the command prompt when OSBench begins enables this hardware verification feature. Markers written at the virtual address can then be monitored by an analyzer and independently timed by an external device, and the results used to double-check the QueryPerformanceCounter timing accuracy. A setup similar to the one used for external measurements of interrupt latency can be used for this purpose.

When using the QueryPerformanceCounter function call to get time stamps, the frequency of the counter on a particular platform and the overhead of calling this function have to be taken into account when analyzing the results. Care must be taken to exclude this measurement overhead from the final timing numbers. The QueryPerformanceCounter call is looped for a number of iterations before every test, and the average is subtracted from the final result.
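
A simple calibration sketch of this kind is shown below; the loop count and function name are illustrative.

#include <windows.h>

#define CALIBRATION_LOOPS 1000    // example iteration count

// Estimate the average cost of one QueryPerformanceCounter call, in
// microseconds, so that it can be subtracted from measured results.
double QpcOverheadMicroseconds(void)
{
    LARGE_INTEGER liFreq, liStart, liEnd;
    int i;

    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for (i = 0; i < CALIBRATION_LOOPS; i++)
        QueryPerformanceCounter(&liEnd);

    return ((double)(liEnd.QuadPart - liStart.QuadPart) * 1000000.0) /
           ((double)liFreq.QuadPart * CALIBRATION_LOOPS);
}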

When the operation takes a very short time to complete, the overhead of the QueryPerformanceCounter function call becomes significant. In those cases, the operation is looped for a fixed number of iterations per sample (IPS), which is clearly indicated with every test, and the result is then averaged. A special submarker value is provided for these cases if hardware verification was enabled. A side effect of this looping is that the cache cannot be flushed between each iteration of the operation. For other tests, where the IPS is equal to 1, the test is run twice, once with and once without a cache flush for each iteration.

The following sample code shows OSBench test output.

==================================================
|  1.00  |  IP =  NO  |  CS =  NO  |       1 IPS
--------------------------------------------------
Event intraprocess :
Time from SetEvent in one thread to a blocked 
              WaitForSingleObject() 
waking in another thread in the same process.
--------------------------------------------------
|  Max Time =         10.057 us
|  Min Time =          5.867 us
|  Avg Time =          6.823 us
==================================================

In the example of test number 1.00, whose output is shown above, the operation measured was the intraprocess event synchronization object. The IPS was 1; CacheSync (CS) was not performed after each run; and the interprocess status (IP) shows that a second process was not used—both threads were in the same process. The maximum, minimum, and average results for 100 operations—the default if nothing is specified at the command prompt—are given in microseconds. The basic suite of tests and the overall layout of the OSBench program allow new test cases and measurements to be added easily, augmenting the implementation for particular kernel functions that might be of special interest.

Performance Measurements

Performance measurements of Windows CE .NET were taken on three different x86 CPUs. All measurements are in microseconds and the results could vary depending on system load.

ILTiming Test Results

The following table shows the ISR and IST latencies.

Note   These numbers, in microseconds, should only be used as an example of performance on a specific piece of hardware. Many hardware and software variables can affect these numbers so users should run these tests on their own platforms and not expect to obtain identical results.

Table 1.

CPU                     ISR latency (microseconds)         IST latency (microseconds)
                        Minimum  Maximum  Average          Minimum  Maximum  Average
AMD K6-2, 350 MHz       1.6      3.3      1.6              9.2      34.3     21.2
Pentium, 166 MHz        3.3      6.7      4.2              19.2     105.6    62.1
Pentium III, 500 MHz    3.3      6.7      3.4              10       26.8     17.2

OSBench Test Results

The OSBench tests were run on all three CPUs and used the following two basic variations to calculate the performance numbers:

  • The tests were performed in the same process or between two or more processes.
  • The tests were performed without cache flushes and with cache flushes—CacheSync—to flush both the data and instruction cache.

The table below shows the results for the OSBench tests. The results times are in microseconds to perform a specific test, which is represented by a number in the first column and defined in the section that follows the table.

Note   These numbers, in microseconds, should only be used as an example of performance on a specific device. Many hardware and software variables can affect these numbers, so you should run these tests on your own platforms and not expect to obtain identical results.

Table 2.

Test    AMD K6-2, 350 MHz        Pentium, 166 MHz         Pentium III, 500 MHz
        Min     Max     Avg      Min     Max     Avg      Min     Max     Avg
1 9.219 12.571 9.836 26.819 32.686 27.776 6.705 17.6 7.61
2 10.057 12.571 11.14 28.496 31.848 29.901 8.381 11.734 9.71
3 0.082 0.088 0.082 0.144 0.152 0.145 0.22 0.229 0.221
4 0.067 0.074 0.067 0.143 0.155 0.143 0.221 0.23 0.221
5 9.219 10.057 9.972 27.658 30.172 28.631 6.705 17.6 7.839
6 9.219 12.571 11.258 27.658 45.258 31.162 8.381 11.734 9.659
7 6.705 15.086 7.339 20.115 25.143 21.486 5.029 7.543 5.604
8 20.114 24.305 20.977 67.886 74.591 69.629 11.734 22.629 12.825
9 6.705 14.248 7.322 20.953 30.172 21.664 5.029 6.705 5.367
10 20.114 20.952 20.427 68.724 74.591 69.765 11.734 20.953 12.775
11 6.705 10.895 7.238 21.791 27.658 22.925 5.029 8.381 6.197
12 20.114 25.981 20.325 67.048 83.81 70.18 12.572 21.791 13.164
13 6.705 7.543 7.187 22.629 29.334 23.966 5.867 15.924 6.247
14 20.114 23.467 20.503 69.562 76.267 70.823 12.572 16.762 13.604
15 8.381 10.895 8.533 25.981 31.848 27.166 6.705 16.762 7.576
16 21.79 25.143 22.298 72.077 79.62 74.938 13.41 15.924 14.569
17 8.381 11.733 8.474 26.819 41.067 27.708 6.705 17.6 7.644
18 21.79 25.143 22.154 72.077 85.486 76.013 14.248 23.467 14.798
19 3.352 4.19 3.851 12.572 21.791 13.384 2.515 4.191 3.446
20 15.924 21.79 16.499 52.8 62.858 53.935 9.219 19.277 10.023
21 3.352 7.543 4.046 11.734 16.762 12.817 2.515 5.867 3.353
22 15.924 19.276 16.635 52.8 62.02 53.926 9.219 20.115 10.057
23 2.717 2.74 2.726 5.385 5.466 5.42 2.417 2.514 2.476
24 2.647 2.667 2.653 5.369 5.405 5.373 2.386 2.495 2.463
25 2.83 2.839 2.832 5.932 5.959 5.935 2.618 2.728 2.691
26 16.623 16.656 16.641 46.761 46.785 46.766 9.386 9.451 9.412
27 16.624 16.672 16.65 48.881 48.897 48.886 9.41 9.517 9.462
28 17.302 17.34 17.322 47.307 47.323 47.31 9.634 9.766 9.685
29 1.756 1.765 1.758 3.043 3.053 3.044 1.479 1.507 1.504
30 0.019 0.021 0.019 0.003 0.017 0.004 0.023 0.035 0.023
31 0.019 0.022 0.019 0.003 0.011 0.003 0.023 0.027 0.023
32 0.253 0.255 0.253 0.072 0.08 0.073 0.056 0.067 0.057
33 0.039 0.042 0.039 0.015 0.025 0.015 0.015 0.02 0.015

Test descriptions

The following descriptions pertain to the tests listed in Table 2:

  • (1) EnterCriticalSection traditional blocking with priority inversion: Time from when a lower-priority thread calls LeaveCriticalSection until a higher-priority thread waiting on an EnterCriticalSection call is unblocked.
  • (2) EnterCriticalSection traditional blocking without priority inversion: Time from when a higher-priority thread calls EnterCriticalSection and blocks until a lower-priority thread is released to run.
  • (3) EnterCriticalSection fastpath: An uncontested call to EnterCriticalSection.
  • (4) LeaveCriticalSection fastpath: An uncontested call to LeaveCriticalSection.
  • (5) EnterCriticalSection with inversion and CacheSync: Time from when a lower-priority thread calls LeaveCriticalSection until a higher-priority thread waiting on an EnterCriticalSection call is unblocked.
  • (6) EnterCriticalSection traditional blocking without priority inversion and CacheSync: Time from when a higher-priority thread calls EnterCriticalSection (blocked) until a lower-priority thread is released to run.
  • (7) Event intraprocess: Time from when the SetEvent function in one thread signals an event until a thread that is blocked on WaitForSingleObject in the same process is released.
  • (8) Event interprocess: Time from when SetEvent in one thread signals an event until a thread that is blocked on WaitForSingleObject in a different process is released.
  • (9) Event intraprocess with CacheSync: Time from when SetEvent in one thread signals an event until a thread that is blocked on WaitForSingleObject in the same process is released.
  • (10) Event interprocess with CacheSync: Time from when SetEvent in one thread signals an event until a thread that is blocked on WaitForSingleObject in a different process is released.
  • (11) Semaphore signaling intraprocess: Time from when a lower-priority thread calls ReleaseSemaphore until a higher-priority thread that is blocked on WaitForSingleObject in the same process is released.
  • (12) Semaphore signaling interprocess: Time from when a lower-priority thread calls ReleaseSemaphore until a higher-priority thread that is blocked on WaitForSingleObject in a different process is released.
  • (13) Semaphore signaling intraprocess with CacheSync: Time from when a lower-priority thread calls ReleaseSemaphore until a higher-priority thread that is blocked on WaitForSingleObject in the same process is released.
  • (14) Semaphore signaling interprocess with CacheSync: Time from when a lower-priority thread calls ReleaseSemaphore until a higher-priority thread that is blocked on WaitForSingleObject in a different process is released.
  • (15) Mutex intraprocess: Time from when a lower-priority thread calls ReleaseMutex until a higher priority thread that is blocked on WaitForSingleObject in the same process is released.
  • (16) Mutex interprocess: Time from when a lower-priority thread calls ReleaseMutex until a higher-priority thread that is blocked on WaitForSingleObject in a different process is released.
  • (17) Mutex intraprocess with CacheSync: Time from when a lower-priority thread calls ReleaseMutex until a higher-priority thread that is blocked on WaitForSingleObject in the same process is released.
  • (18) Mutex interprocess with CacheSync: Time from when a lower-priority thread calls ReleaseMutex until a higher-priority thread that is blocked on WaitForSingleObject in a different process is released.
  • (19) Yield to thread timing intraprocess: Time from when a thread calls Sleep(0) until a same-priority thread in the same process wakes from a previous call to Sleep(0).
  • (20) Yield to thread timing interprocess: Time from when a thread calls Sleep(0) until a same-priority thread in a different process wakes from a previous call to Sleep(0).
  • (21) Yield to thread timing intraprocess with CacheSync: Time from when a thread calls Sleep(0) until a same-priority thread in the same process wakes from a previous call to Sleep(0).
  • (22) Yield to thread timing interprocess with CacheSync: Time from when a thread calls Sleep(0) until a same-priority thread in a different process wakes from a previous call to Sleep(0).
  • (23) System API call (roundtrip) intraprocess: Time required to call a system API that is part of the current process with no parameters and have the call return immediately.
  • (24) System API call (roundtrip) intraprocess: Time required to call a system API that is part of the current process with seven DWORD parameters and have the call return immediately.
  • (25) System API call (roundtrip) intraprocess: Time required to call a system API that is part of the current process with seven PVOID parameters and have the call return immediately.
  • (26) System API call (roundtrip) interprocess: Time required to call a system API that is in a different process with no parameters and have the call return immediately.
  • (27) System API call (roundtrip) interprocess: Time required to call a system API that is in a different process with seven DWORD parameters and have the call return immediately.
  • (28) System API call (roundtrip) interprocess: Time required to call a system API that is in a different process with seven PVOID parameters and have the call return immediately.
  • (29) System API call (roundtrip) to Nk.exe: Time required to call a system API in the kernel that returns immediately.
  • (30) InterlockedIncrement: Time to call the InterlockedIncrement API.
  • (31) InterlockedDecrement: Time to call the InterlockedDecrement API.
  • (32) InterlockedExchange: Time to call the InterlockedExchange API.
  • (33) InterlockedTestExchange: Time to call the InterlockedTestExchange API.

Summary

Although previous versions of Windows CE offered some real-time OS (RTOS) capabilities, a number of significant changes made to the kernel since Windows CE 3.0 have greatly enhanced real-time performance. As this paper has described, the following changes were made in Windows CE 3.0 and Windows CE .NET:

  • Increased number of thread priority levels from 8 to 256.
  • More control over times and scheduling. Applications can control the amount of time provided to each thread and manipulate the scheduler to their advantage. Timer accuracy is now one millisecond for Sleep- and Wait-related APIs.
  • Improved method for handling priority inversion.
  • Full support for nested interrupts.
  • Reduced ISR and IST latencies.
  • More granular memory management control.
  • Ability to specify the page table pool size through an OEM-defined variable on x86 platforms.

This paper has also described the tools—ILTiming and OSBench—used to test the real-time performance of the kernel, and has provided real-time performance test results on three CPUs.

For More Information

For more information about the Windows Embedded family, please see the Windows Embedded Web site.