Assessing Microsoft Windows CE 3.0 Real-Time Capabilities
Siemens CT SE 2
Dr. Holger Küfner
Dr. Odo Stierschneider
Features of Windows CE 3.0
Driver and Hardware Support
Application Runtime Services
GUI and User
Microsoft Windows CE Platform Builder 3.0
eMbedded Visual Tools 3.0
Real-Time Properties of Windows CE 3.0
Hardware Configuration for Performance and Latency Measurement
Latency Tool Description
Performance Tool Description
Results of Latency Measurement
Results for the Hitachi SH4 Evaluation Board
Results for the Intel Pentium Board
Results for the Pentium Board with QFE for the x86 Kernel
Results of Performance Measurement
Results for the Intel Pentium Board
Results for the Pentium Board with QFE for the x86 Kernel
Summary of the Latency Results
Summary of the Performance Results
Real-Time programming hints
Embedded real-time applications are becoming increasingly important for a number of business segments. Siemens Corporation, a player in e-business and IT technology, evaluated the Microsoft® Windows® CE operating system for its real-time capabilities. Various properties of this operating system, like its built-in multimedia capabilities, its optional GUI support, and the availability of the Microsoft® Win32® application programming interface, make it a natural choice for Siemens in embedded applications. However, its real-time behavior still must be assessed to determine its usability for today's demanding real-time applications. The purpose of this paper is to provide this evaluation.
The important properties for real-time systems are the latency times for Interrupt Service Routines (ISR) and Interrupt Service Threads (IST). Additionally, performance of certain kernel calls has been measured and compared. These requirements are particularly relevant to the industrial automation industry.
This document does not provide a detailed description of all the features of Windows CE 3.0. If you need more information about these features, see the references in the Bibliography.
The following sections outline new features and characteristics of Windows CE 3.0. The information in these sections is taken from experiences with beta releases, information from Microsoft project managers, and the white paper "Information to Windows CE 3.0".
Power management is essential for mobile computing. Windows CE 3.0 uses a considerably higher tick rate than previous releases, which increases power consumption. To compensate, the OEMIdle function can now be adapted to place the CPU into standby for arbitrarily long periods that are no longer bound to the thread quantum.
To achieve this, the kernel now exposes the following variables, which allow for custom CPU standby periods:
- dwPreempt: Number of milliseconds until the current thread is preempted.
- dwSleepMin: Number of milliseconds until the next scheduler interrupt is necessary; this equals the minimum of the sleeping times of all available threads.
- ticksleft: The number of ticks that have elapsed but have not been processed by the scheduler.
- CurrMSec: The time in milliseconds since the system last booted.
- DiffMSec: The time in milliseconds since the scheduler last ran.
- Nested interrupts are possible, with support from the CPU and/or additional hardware. An interrupt with a higher priority level will be serviced immediately without having to wait for a lower priority ISR to complete.
- Better thread response. The upper bound of scheduler latencies for high-priority ISTs has been tightened (see "Real-Time Properties of Windows CE 3.0").
- 256 priority levels, which allow better control over the system behavior. The original eight priorities are mapped to the lowest eight of the 256 priority levels. A thread of any priority can now be run to completion by setting its scheduling quantum to zero (see the next item). This should only be done in critical situations, because it makes the system less predictable. Normally, only priority level 0 (highest priority level) threads are meant to run to completion. Therefore, special driver threads or hard real-time threads should be the only ones to use this feature.
- Thread level scheduling quantum control. Any thread can have any scheduling quantum (time slice). This feature enables CPU load balancing among several threads that have the same priority. Using this feature enables threads at any priority level to run to completion as long as no higher priority thread is running. In effect, this decouples the time a thread can run before being preempted from the system tick.
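The priority mapping and the zero-quantum rule described above can be illustrated with a minimal sketch. The function and constant names are illustrative assumptions, not Windows CE identifiers, and the sketch assumes that the original eight levels are also numbered from 0 (highest) to 7 (lowest).

```python
# Illustrative model only: names are assumptions, not Windows CE identifiers.
NUM_PRIORITY_LEVELS = 256  # Windows CE 3.0: 0 (highest) to 255 (lowest)
NUM_LEGACY_LEVELS = 8      # earlier releases (assumed numbering: 0 highest, 7 lowest)


def legacy_to_ce3_priority(legacy: int) -> int:
    """Map one of the original eight priorities onto the lowest eight CE 3.0 levels."""
    if not 0 <= legacy < NUM_LEGACY_LEVELS:
        raise ValueError("legacy priority must be in 0..7")
    return NUM_PRIORITY_LEVELS - NUM_LEGACY_LEVELS + legacy


def runs_to_completion(quantum_ms: int) -> bool:
    """A quantum of zero means the thread is never time-sliced against its peers."""
    return quantum_ms == 0
```

Under these assumptions the legacy range 0..7 lands on levels 248..255, which leaves levels 0..247 free for drivers and real-time threads.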
The kernel underwent some major changes in Windows CE 3.0. The dominant reason behind this redesign was to enhance its Real Time Operating System (RTOS) capabilities. The changes include:
- Priority inheritance, which corrects thread priority inversion, is now supported for one level only (in earlier versions of Windows CE the inheritance depth was unbounded).
- Large kernel calls have been segmented into smaller pieces to allow for faster preemption.
- 1 ms system tick. This allows a more timely response to Sleeps and Waits. Sleep(0) forces a reschedule.
- Process/Thread synchronization objects now also include semaphores.
- The file compression module is optional, which reduces kernel size. If it is not included in a kernel image, programs in ROM must not be compressed.
- Multiple Execute In Place (XIP) regions supported. For example, you could install the core operating system components into ROM, and install the other operating system components into flash memory. XIP applications could run in both places.
- Full kernel mode support, in which all threads run in kernel mode and performance optimizations are enabled.
- Support for on-chip debugging. Hardware-assisted debugging enables debugging of the OAL before the OS kernel is running, simplifying the OAL debugging process.
- Event tracking that allows platform developers to track events and improve performance.
- Replaceable ROM DLLs with implicit links to other modules.
- Kernel-level security. A new security model prevents unauthorized applications from accessing system APIs that might potentially damage the platform. An OEM can specify whether individual modules and processes will be allowed to run and specify those that are fully trusted on a particular platform. Two new APIs allow software developers to retrieve the assigned trust level of a module or a process.
- Footprint: Smallest Windows CE version: about 400 kB. Networking, but no graphics: about 1200 kB. Version with GWES (Window Manager, GDI and Events), shell and networking: about 3-4 MB. Full-blown version with task manager, handwriting recognition, etc.: up to 6 MB. Windows CE shell: about 3 MB.
- Support has been added for querying VERSIONINFO resources to obtain version and language-support information from files.
- Dial-up bootloader. The bootloader enables the platform image to be updated in the field. It supports security and certificate authentication, and connects to standard Web server technologies, such as HTTP, FTP, PPP and SLIP.
- Enhanced USB support.
- UHCI (Universal Host Controller Interface) driver.
- Sample USB Function controller driver, based on the Scanlogic SL11 chipset.
- USB HID (Human Interface Device) driver with Direct Input Interface.
- Sample USB, PCMCIA and Serial Smart Card drivers.
- Enhanced display-driver support.
- Flat mode driver that works with most of the PC-based display cards currently available.
- Accelerated display driver for ATI Rage XL 2D-only.
- Sample display drivers that illustrate the use of software cursors and anti-aliased text.
- New Serial MDD driver that improves serial interface performance.
- Support for DiskOnChip Millennium product from M-Systems (also limited Flash devices).
Windows CE 3.0 supports larger data storage systems and larger files within those systems.
- The number of objects that can be kept in the object store has been increased from 65,536 to 4.19 million (object IDs are reused).
Windows CE 3.0 has enhanced capabilities in the area of security.
- Smart card subsystem, including a resource manager API and reader drivers, for developing PC/SC-compliant smart card systems for Windows CE.
- RSAENH (Microsoft Enhanced Cryptography Service), which includes 128-bit encryption algorithms.
The Registry has been enhanced to support the following features:
- Write protected sections of the registry.
Communications enhancements to Windows CE include the following:
- TAPI 2.1 support. It includes inbound data modem call capability, and support for adding TSPs (telephony service providers). A Unimodem TSP for AT-command-based modems is included.
- RAS support for multiple sessions.
- Improved TCP/IP support, equivalent to the TCP/IP support in Windows 2000. Windows CE now supports dead gateway detection, which enables better fault tolerance across different network segments, as well as packet forwarding across multiple interfaces.
- NDIS (network driver interface specification) now supports intermediate drivers, the NDIS WAN (wide area network) media type, and token-ring networks. The Windows CE 3.0 DDK includes a new, improved NDIS Test utility.
- DHCP (Dynamic Host Configuration Protocol) client support now includes Autonet, which assigns an IP address to a device if a DHCP server is not available. Autonet allows Windows CE-based devices to connect to a network where there is no DHCP server available.
- IP Helper APIs provide application developers more control over TCP/IP. For example, they provide access to the Route and ARP tables. In addition to the IP helper APIs, Windows CE 3.0 supports the following tools: Route, IPConfig, and Ping.
- Common Internet File System (CIFS) redirector. Windows CE supports the CIFS redirector for accessing remote file systems and printers. The CIFS protocol is also known as the Server Message Block (SMB) protocol.
- Telnet server sample.
- Web (HTTP) server support.
- Support for persistent connections, multiple connections, folder browsing, and multiple virtual paths.
- ISAPI extensions and filters.
- Dynamic Web Pages: a subset of standard ASP functionality is supported. As a default, ASP scripting is possible via the Microsoft® Visual Basic®, Scripting Edition (VBScript) programming language and the Microsoft® JScript® development software. Custom scripting engines can be added if necessary.
- Remote server-based management.
- Support for caching dynamic link libraries (ISAPI extensions and filters) and COM objects.
- Control Panel (with source code).
- Telnet server sample enabling a Telnet client to connect to a remote Windows CE-based (version 3.0) device. This feature can be used for a software update via the Internet (used by CT SE 2 for Siemens Medical).
- Application Manager to add/remove programs on a Windows CE-based device.
Windows CE 3.0 supports two separate COM modules that offer two different levels of COM support.
- A limited-feature, small-footprint module that provides in-process calls and a free-threading model (similar to Windows CE 2.12).
- A full-featured module that supports out-of-process calls, full-threading model support, and DCOM. The DCOM module, with the exception of the security interfaces, is fully compatible with Microsoft® Windows NT® 4.0 Service Pack 5 (SP5).
- Building upon COM, the Microsoft® ActiveX® standard is supported (with and without ATL).
- Enhanced MSMQ compatible with the Windows NT, Microsoft® Windows 98, and Microsoft® Windows 2000 operating systems.
- Microsoft® Windows Media™ technologies player with multimedia streaming, ASF/ASX.
- Windows Media player control with JScript, VCR-like controls, information retrieval, and monitoring.
- Sound, used to draw user attention to critical events or to play .WAV files.
- Microsoft® DirectX® application programming interface support for cutting edge graphics and audio performance.
- Support for the latest audio/video codecs, such as MPEG-4, WMA, and MP3.
- Resolution-independent controls and dialog boxes.
Development tools for Windows CE 3.0 include Platform Builder 3.0 for creating and debugging new embedded designs and Microsoft® eMbedded Visual Tools 3.0 for creating embedded applications and components that can be used with Platform Builder.
Platform Builder can package the OS configuration of a kernel image that has been built into a software development kit (SDK), and then export the SDK for use with eMbedded Visual Tools. This SDK contains the tools and information needed to develop applications for platforms based on the Windows CE operating system.
Disadvantage: Creating an application with eMbedded Visual Tools makes it more difficult to develop that application in parallel for Windows NT and Windows CE. Currently, the configuration and project files of Microsoft® Visual Studio® and eMbedded Visual Tools are not compatible and, at the time of this writing, no tools for converting project and configuration files between those development platforms were known to exist.
The features of Platform Builder 3.0 are:
- The Windows CE operating system can be built for 8 predefined configurations ranging from a minimal system with little more than a kernel to a comprehensive system complete with a rich graphical user interface (GUI) and preloaded applications.
- Offers an integrated, intuitive UI that supports the needs of developers designing platforms and components.
- A tool for creating and exporting SDKs that can be imported into eMbedded Visual Tools to develop applications.
- Optional support for Microsoft run-time libraries, including Microsoft Foundation Classes (MFC) for Windows CE, Active Template Library (ATL) for Windows CE, and Visual Basic.
- Component development tools that enable the developer to build custom components for a platform, including device drivers, applications, dynamic-link libraries (DLLs), and static libraries.
- Documentation is available for Windows CE 3.0 and Platform Builder 3.0. Online and context-sensitive help is also provided.
- A Platform Wizard helps to quickly choose individual OS modules as needed.
- Two board support packages (ODO and CEPC) are included. Microsoft offers additional BSPs on its Web site. The creation of custom BSPs is also possible. The HARP platform for the Infineon TriCore™ processor will probably not be available for version 3.0 but will be released for the next version.
- Connectivity and download features are integrated in the IDE.
- A Status Monitor displays information regarding communication between the developer workstation and target device.
- The Extended Microprocessor Kit supports Add-in CPUs. Add-in CPU configurations are added by third-party vendors.
- Hardware debugging support is integrated.
- Import of Microsoft® Visual C++® projects is supported.
- Data Visualization support for third party visualization tools.
- Enhanced debugger user interface.
- New build options.
- Integrated Target Control that allows viewing process and thread information, starting and stopping processes, setting debug zone information, and much more.
eMbedded Visual Tools comprises Microsoft® eMbedded Visual C++® 3.0 and Microsoft® eMbedded Visual Basic® 3.0. Both tools are integrated development environments (IDE) that can be used by developers to build applications for the next generation of Windows CE-based communication, entertainment, and information-access devices. eMbedded Visual Tools 3.0 provides help to create, debug, and deploy applications for the Windows CE operating system quickly for a wide range of devices using well-known Visual C++ and Visual Basic development systems and techniques (including an adapted version of the Win32 application programming interface).
The features of eMbedded Visual Tools 3.0 are:
- Increased Developer Productivity
- Simplified Development and Integrated Debugging
- Comprehensive Access to the Windows CE Platform
- Fast, Flexible Data Access
As mentioned before, using these tools makes cross-development of applications for Windows NT and Windows CE more difficult, because two tool sets must be supported (Visual Studio and eMbedded Visual Tools) and the exchange of configuration and project files is currently not supported. Furthermore, code created by the various wizards is not always compatible with the other version of the development tools. For example, the source code of an ATL component created with the ATL wizard in the Visual C++ 6.0 development system will not build in eMbedded Visual C++ 3.0 without editing the program code.
Real-time capabilities are very important in most embedded systems. Devices like telecommunications switching equipment, medical monitoring equipment, aircraft "fly-by-wire" controls, space navigation and guidance, laboratory experiment control, automobile engine control, and robotics systems must often manage time critical responses.
A real-time system is a system in which the correctness of the computations not only depends on the logical correctness of the computation, but also on the time when the result is produced. If timing constraints of the system are not met, system failure is said to have occurred. (Source: newsgroup.comp.realtime, Donald Gilles)
Because an RTOS is part of a real-time system, it must provide the functionality to enable the overall real-time system to meet its requirements.
Requirements of standard real-time operating systems (RTOS) are:
- The OS (operating system) must be multithreaded and preemptive.
- The OS must support thread priority.
- The OS must support predictable thread synchronization mechanisms.
In addition, the OS behavior must be predictable. This means real-time system developers must have detailed information about the system interrupt levels, system calls, and timing:
- The maximum time during which interrupts are masked by the OS and by device drivers must be known.
- The maximum time that device drivers use to process an interrupt, and specific IRQ information relating to those device drivers, must be known.
- The interrupt latency (the time from interrupt to task run) must be predictable and compatible with application requirements.
A major requirement for an RTOS is to process interrupts within a given range of time. As in most modern RTOSes, the processing of interrupts in Windows CE is divided into two steps, the Interrupt Service Routine (ISR) and the Interrupt Service Thread (IST).
The ISR performs minimal processing and returns an interrupt ID to the kernel. The kernel examines the returned interrupt ID and sets the associated event. The interrupt service thread is waiting for that event. When the kernel sets the event, the IST stops waiting and starts performing its additional interrupt processing. Most of the interrupt handling actually occurs within the IST.
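The hand-off between ISR, kernel, and IST can be modeled with ordinary threads and events. The following is a minimal simulation of the flow described above, not Windows CE code; the class `Kernel`, the function names, and the interrupt ID value are hypothetical.

```python
import threading

SYSINTR_DEMO = 17  # hypothetical interrupt ID returned by the ISR


class Kernel:
    """Toy kernel that maps interrupt IDs to events, as in the description above."""

    def __init__(self):
        self._events = {}

    def register_ist(self, interrupt_id):
        # The IST obtains the event it will wait on for this interrupt ID.
        event = threading.Event()
        self._events[interrupt_id] = event
        return event

    def dispatch(self, isr):
        # The ISR does minimal work and returns an ID; the kernel sets the event.
        interrupt_id = isr()
        self._events[interrupt_id].set()


def demo_isr():
    # Minimal processing only: identify the interrupt source.
    return SYSINTR_DEMO


def run_demo():
    kernel = Kernel()
    event = kernel.register_ist(SYSINTR_DEMO)
    handled = []

    def ist():
        event.wait()  # blocked until the kernel signals the interrupt
        handled.append("processed interrupt %d" % SYSINTR_DEMO)

    thread = threading.Thread(target=ist)
    thread.start()
    kernel.dispatch(demo_isr)
    thread.join(timeout=5)
    return handled
```

The split keeps the time spent with interrupts masked short: only `demo_isr` runs in interrupt context in this model, while the bulk of the work happens in a schedulable thread.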
Several factors determine the quality of the real-time behavior of an operating system. Clearly, the interrupt latencies of the ISRs and the ISTs under worst-case workload are significant. The behavior of the scheduler and, in the case of Windows CE, the quality of the MMU handling as implemented are also important characteristics. However, more important for a complete real-time system is the jitter of ISRs and especially of ISTs running on top of the underlying RTOS. Events with real-time requirements occur asynchronously, but most of the time periodically and predictably. In most real-time designs it is possible to derive characteristic system reaction times from these events. This allows for real-time designs in which the latencies do not need to be especially small, provided that the jitter, the variation of the interrupt latencies, is very small. Think of a band-pass filter: the smaller the bandwidth, the better the quality of the filter. Similarly, the smaller the jitter, the better the quality of a real-time OS. Figure 1 clarifies this relationship.
In automation applications, events from an external technical process (e.g. chemical transformation of material, GSM protocol stack) are usually sent to the controlling system (usually RTOS applications) on a periodic basis. Therefore, the notion of a "working period" is used to denote the time interval between the instant when an external process generates one of these periodic events (in the form of an interrupt) and the moment it expects a response to this event. The duration of one working period is determined by the requirements of the external technical process. It is valid for the processing of a low-priority event to take longer than one working period if the processed result is not needed in the same period in which the event occurred. Sporadic events (e.g. alarm events), on the other hand, usually have a high priority and therefore have to be processed at once. From this description we can derive a general rule for priorities: periodic events determining the "working period" have a lower priority than sporadic events.
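The working-period notion and the priority rule can be expressed as a small check. This is a sketch under simplifying assumptions: the response time is taken as worst-case latency plus processing time, and the class `EventClass` and all numbers used with it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EventClass:
    name: str
    sporadic: bool           # True for alarm-like events, False for periodic ones
    worst_latency_us: float  # worst-case interrupt latency
    processing_us: float     # application-specific processing time


def response_time_us(event: EventClass) -> float:
    """Simplified response time: worst-case latency plus processing time."""
    return event.worst_latency_us + event.processing_us


def meets_working_period(event: EventClass, period_us: float) -> bool:
    """True if the response is available within one working period."""
    return response_time_us(event) <= period_us


def by_priority(events):
    """Order events so that sporadic ones come before periodic ones."""
    return sorted(events, key=lambda e: not e.sporadic)
```

For example, a periodic event with 100 µs worst-case latency and 300 µs of processing fits a 1000 µs working period with 600 µs to spare in this model.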
To get significant characteristics concerning real-time behavior and performance of Windows CE, we tracked two different views for the measurements:
- Latencies: measurement of interrupt latencies and operating system kernel call duration under different workloads to assess real-time and performance capabilities for Windows CE 3.0.
- Real-time behavior between different processor families: To collect data of RTOS implementation qualities, we verified interrupt latencies for two processor types, the Hitachi SH4 and the Intel Pentium. Comparisons between the SH4 (RISC processor) and the Pentium (more or less CISC) are made.
The results of the measurements will be graphically presented in sections 3.5 and 3.6, and a summary in sections 3.7 and 3.8. The following paragraphs describe the hardware configuration of the SH4 and the Pentium boards in more detail, as well as the tools used for interrupt latency and performance measurements.
The measurements were performed with the evaluation board S1 from Hitachi Europe (PFM-DS6C) for the Hitachi SH4 processor, and with a standard Siemens PC (PCD-5H) for the Intel Pentium processor. The following configurations were used:
|Board:||Hitachi S1 (PFM-DS6C)||Siemens PC (PCD-5H)|
|Processor:||SH4, 198 MHz||Pentium, 100 MHz|
|Instruction Cache:||8 kB||L1: 16 kB, L2: 256 kB|
|Operand Cache:||16 kB||L1: 16 kB|
|Memory:||64 MB||32 MB|
|Graphic Chip:||64413 (Q2SD Board)||Tseng ET 4000|
|Operating System:||Microsoft Windows CE 3.0, Build 126, Maxall (both boards)|
Microsoft has developed a test tool for latency measurement named "iltiming". This tool was modified by Siemens CT SE 2 to support LAN services, which are used to send the test results to a workstation or PC. In this manner, errors in measurement are minimized. More precise statements can only be made with the help of a logic analyzer.
Iltiming uses the timer interrupt to measure latency times. It measures the time difference between the occurrence of the timer interrupt and the start of the Interrupt Service Routine, and also the time difference between the time the ISR finished and the time the Interrupt Service Thread started.
To understand the method that is used to get the time difference between the occurrence of the interrupt and the start of the IST, we take a closer look at the timer. During system initialization, the timer is initialized with a start value of the timer counter. The timer counter decrements every clock tick until it becomes zero. If it becomes zero, the timer triggers an interrupt, loads its timer counter register with the start value, and starts to decrement its timer counter again. This means the interrupt occurred when the timer counter register was loaded with its start value.
The ISR reads the timer counter register immediately after invocation. The difference between the start value of the counter and the value that was read from the ISR, together with the frequency of the timer, can be used to calculate the time difference.
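The conversion from a counter reading to a time difference can be sketched as follows. The function name and the example timer values are assumptions for illustration; the actual timer frequency and start value depend on the platform.

```python
def elapsed_us(start_value: int, counter_reading: int, timer_hz: float) -> float:
    """Time since the interrupt, for a counter that decrements from start_value."""
    ticks = start_value - counter_reading
    if ticks < 0:
        raise ValueError("reading exceeds start value; the counter may have reloaded")
    return ticks * 1_000_000 / timer_hz
```

With a hypothetical 1 MHz timer that reloads to 10,000, a reading of 9,988 in the ISR corresponds to 12 ticks, or 12 µs, since the interrupt occurred.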
The ISR takes a second timestamp (a reading of the timer counter register) before leaving. A third timestamp is taken by the Interrupt Service Thread (IST) immediately after invocation. The second and the third timestamps are used by "iltiming" to calculate the time that was needed to start the Interrupt Service Thread. To get the complete interrupt latency from the occurrence of a hardware interrupt up to the activation of the IST, both latency numbers and the individual runtime of the ISR have to be added; this is done by our graphical front end on the host system.
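The addition performed by the front end can be sketched from the three timestamps. This is an illustrative reconstruction, not the actual front-end code; the type `Sample` and the function name are hypothetical, and the sketch assumes that the IST starts before the down-counter reloads.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    isr_start: int  # counter reading at ISR entry (first timestamp)
    isr_end: int    # counter reading at ISR exit (second timestamp)
    ist_start: int  # counter reading at IST entry (third timestamp)


def latency_components_us(sample: Sample, start_value: int, timer_hz: float):
    """Return (ISR latency, ISR runtime, IST latency, total) in microseconds."""
    to_us = 1_000_000 / timer_hz
    # The counter decrements, so earlier readings are numerically larger.
    isr_latency = (start_value - sample.isr_start) * to_us
    isr_runtime = (sample.isr_start - sample.isr_end) * to_us
    ist_latency = (sample.isr_end - sample.ist_start) * to_us
    total = isr_latency + isr_runtime + ist_latency
    return isr_latency, isr_runtime, ist_latency, total
```

Keeping the three components separate makes it possible to see whether a long total latency stems from masked interrupts (ISR latency), the ISR itself, or the scheduler (IST latency).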
It should be noted that usually the time needed to process an IST is also included in the latency measurements. This means that the latency denotes the time difference between the occurrence of an event (beginning of the ISR) and the end of its processing (usually the end of the IST). As the latency in this case is application specific and as we tried to present a "normalized", application independent view on latencies in Windows CE, the time spent inside the IST was not included in our latency experiments.
For the latency measurements the iltiming tool was called with the following parameters:
iltiming -n 1000 -t 2 -udp <ip_addr>
|-n 1000||a measurement cycle consists of 1000 measurements|
|-t 2||every second timer interrupt is used to measure latency times|
|-udp <ip_addr>||iltiming sends the test result via UDP to the workstation specified by the IP address <ip_addr>|
To simulate workload, mathematical functions as well as graphics and network operations were used. As graphical workload, we used an application that rotates a rectangle (see figure 2). The network workload is an application that sends 250 UDP packets per second. To simulate computation, we used standard math functions (mathload.exe).
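A network workload of this kind can be approximated with a paced UDP sender. The sketch below is not the application used in the measurements; the function names, the payload size, and the pacing strategy are assumptions.

```python
import socket
import time


def send_interval_s(packets_per_second: float) -> float:
    """Spacing between packets for a given rate (250 packets/s gives 4 ms)."""
    return 1.0 / packets_per_second


def run_udp_load(host, port, packets_per_second=250, duration_s=1.0,
                 payload=b"x" * 64):  # payload size is an arbitrary assumption
    """Send UDP packets at a fixed rate and return the number of packets sent."""
    interval = send_interval_s(packets_per_second)
    count = int(duration_s * packets_per_second)
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        start = time.monotonic()
        for i in range(count):
            # Pace against absolute deadlines so that timing errors do not accumulate.
            delay = start + i * interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            sock.sendto(payload, (host, port))
            sent += 1
    return sent
```

On a desktop operating system the sleep granularity limits how evenly the packets are spaced, so this only approximates a constant load.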
The Dhrystone Test was used to measure arithmetic and string manipulations regarding compiler optimizations and OS support while the SSC-Benchmark was used to benchmark operating system related issues.
Dhrystone is a suite of arithmetic and string manipulating programs. Since the whole program is less than 8 kBytes, it fits into the processor cache. It can be used to measure two aspects: the processor's speed and the optimizing capabilities of the compiler. The resulting number is the number of executions of the program suite per second.
SSC-Benchmark tests the following operating system functionality (structure as specified by SSC):
- Create/Delete Task
We create a task with a higher priority and start it. The task does not do anything and terminates itself. Therefore, the creator task will be executed again.
Windows CE needs four system calls to perform this test: one for the creation of the task (we create the task suspended), one to set the priority, one to get the task running after setting the priority, and the last system call closes the handle to the created task (after the task has finished itself).
The measurement is labeled "Create Task".
- Ping Suspend/Resume Task
A task with a lower priority resumes another suspended task (with a higher priority). The task with the higher priority suspends itself immediately and therefore returns to the task with the lower priority.
The measurement is labeled "Suspend Ping".
- Suspend/Resume Task
A task with a higher priority suspends a task with a lower priority and resumes it afterwards (without a context switch).
The measurement is labeled "Suspend/Resume".
- Ping Mutex
Two tasks alternately acquire and release a semaphore.
Windows CE 3.0 supports semaphores, but in this case, the testing was performed using mutexes. A mutex object can only be acquired or released. It is not equal to a counting semaphore, but in this measurement its behavior is similar.
The measurement is labeled "Sema/Mutex Ping".
- Get/Release Mutex
One task acquires a mutex and releases it immediately.
The measurement is labeled "Semaphore/Mutex".
- Queue Fill
Filling the message queue without blocking.
Because Windows CE only uses one message queue (the standard queue for receiving Windows Messages), we do not need to create a separate message queue. However, the messages have to be user-defined messages with an ID higher than or equal to "WM_USER".
The measurement is labeled "Fill Queue".
- Queue Drain
Emptying the message queue after it has been filled by "Queue Fill".
The measurement is labeled "Empty Queue".
- Queue Fill/Drain
One task sends a message to itself and receives it immediately.
The measurements are labeled "Send/Recv" and "Queue Send/Recv".
- Ping Fill/Drain
Two tasks are connected with each other via two message queues. The second task waits for messages from the first queue and immediately sends them to the second queue. The first task sends a message to the first queue and waits to receive the message from the second queue. The two tasks have equal priorities.
The measurement is labeled "Ping Queue".
- Allocate/Deallocate Memory
Allocating and freeing memory regions.
Windows CE is able to allocate memory from the local heap and memory from an additional requested heap. Freeing of the allocated memory was also measured.
The measurements are labeled "Heap Alloc", "Heap Free", "Local Alloc 32B", "Local Alloc 128B" and "Local Free".
- Time Calls
Getting the value of a timer.
For Windows CE we measure two different timer calls, one to get the system time and the other to get the ticks from a high performance timer.
The measurements are labeled "GetTickCount" and "GetSystemTime".
- Ping Event
Two tasks with equal priorities are each waiting for an event generated by the other task.
The measurement was labeled "Ping Event".
- Event Pending
A task with a higher priority is waiting for an event, which is generated from a task with lower priority.
The measurement was labeled "Event Pending".
- Set Event
We measure the time for setting an event. There is no other task waiting for the event.
The measurement was labeled "Set Event".
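To make the structure of these tests more concrete, the "Ping Event" case can be sketched with two threads and two events. This is a desktop approximation, not the SSC-Benchmark code: Python threads have no priorities, the function name is hypothetical, and the absolute numbers are not comparable to the Windows CE results.

```python
import threading
import time


def ping_event(round_trips: int = 1000) -> float:
    """Return the average time in seconds for one event round trip between two threads."""
    ping = threading.Event()
    pong = threading.Event()

    def responder():
        for _ in range(round_trips):
            ping.wait()   # wait for the event generated by the first task
            ping.clear()
            pong.set()    # generate the event the first task is waiting for

    worker = threading.Thread(target=responder)
    worker.start()
    start = time.perf_counter()
    for _ in range(round_trips):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - start
    worker.join()
    return elapsed / round_trips
```

Each round trip forces two context switches, which is why a test of this shape is a reasonable proxy for combined event and scheduler overhead.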
The following figures present the interrupt latency measurement results that Windows CE 3.0 shows under different workloads running on the specified boards.
The following latency times were measured with the "iltiming" tool on a:
- SH4 processor 198 MHz (S1 board),
- MAXALL Platform.
WORKLOAD: Rotate and Mathload
Figure 3 shows the results for the graphical and mathematical workload. Each multicolored column represents the results from one execution of the "iltiming" tool (as described in section 3.3, each execution of "iltiming" records 1,000 measurements). The top of the black bar represents the minimum latency measured. The top of the green bar represents the average latency. The top of the red bar represents the maximum latency. The jitter is computed as the difference between the worst (maximum) and best (minimum) IST measurements. The IST minimum, average and maximum runtimes are fairly constant, which can be explained by the constant workload that has been applied to the system.
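The values plotted for each column can be derived from one measurement cycle as follows. This is a minimal sketch of the computation described above; the function name is an assumption and the sample values in the usage note are made up, not measured.

```python
def summarize(latencies_us):
    """Return minimum, average, maximum and jitter (max - min) of one cycle."""
    if not latencies_us:
        raise ValueError("at least one measurement is required")
    lowest = min(latencies_us)
    highest = max(latencies_us)
    return {
        "min": lowest,
        "avg": sum(latencies_us) / len(latencies_us),
        "max": highest,
        "jitter": highest - lowest,
    }
```

For instance, a cycle consisting of the made-up values 10, 12 and 20 µs has an average of 14 µs and a jitter of 10 µs.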
The ISR runtimes also show rather stable values with the exception of occasional peaks, which can be attributed to the occurrence of other interrupts or to sections of code in the kernel where interrupts are disabled. A plausible explanation is that calls to the graphical subsystem of Windows CE are responsible for those peaks.
WORKLOAD: UDP messages 250 packets/second
Figure 4 presents the results when sending UDP packets as a workload. The IST times are again fairly stable due to the constant nature of the system workload.
The ISR runtime is about the same as before but without the occasional peaks found with the graphical and mathematical workload. Looking at this figure, it should be noted that the occurrence of the worst maximum values of the IST does not necessarily coincide with the maximum runtime values of the ISR.
The following latency times were measured with the "iltiming" tool on a:
- Intel Pentium processor 100 MHz (Siemens PCD-5H),
- MAXALL Platform.
WORKLOAD: Rotate and Mathload
Figure 5 reveals a considerable performance problem of the standard x86 Windows CE kernel. Although the Pentium is running at just half the clock rate of the SH4, this result (jitter three times as high as on the S1 board) cannot be explained by merely referring to the clock rate. As a result of these measurements, Microsoft produced a QFE (Quick Fix Engineering update) that remedies those performance drawbacks under certain circumstances.
Again, the time spent in the ISR is fairly stable and comparable to the times measured on the Hitachi board. It should be noted that the test on the Pentium board does not show the sporadic peaks that the SH4 has.
WORKLOAD: UDP messages 250 packets/s
Figure 6 presents about the same picture as Figure 5. The jitter is extraordinarily high, and the maximum IST runtime values vary strongly from run to run.
The ISR, on the other hand, still shows about the same performance as in the rotate/mathload test above.
In response to the performance concerns on x86, Microsoft generated a QFE with optimized TLB handling for the Intel processors:
- QFE to reduce interrupt latencies for the x86 processor.
- Currently optimized for up to 8 processes or 8 x 32 MB of virtual memory.
- With more than 8 processes, interrupt latency increases up to the standard interrupt latency shown above.
- Solution mainly based on an enlarged page pool.
- Needs about 300K more memory space.
The following latency times were measured with the "iltiming" tool in this configuration:
- Intel Pentium processor 100 MHz (Siemens PCD-5H),
- MAXALL Platform.
- QFE for x86 kernel applied.
WORKLOAD: Rotate and Mathload
Figure 7 shows that with the QFE installed, the Pentium board yields a considerably improved IST performance, now comparable to the Hitachi evaluation board. Also, the jitter was reduced to about the same value as on the SH4 board.
Again the ISR consumes the same amount of time as it did in the preceding test runs.
WORKLOAD: UDP messages 250 packets/s
Figure 8 confirms the improved real-time qualities of the x86 kernel through the QFE. The IST runtimes and the jitter are comparable to the respective values measured on the Hitachi board running the same workload.
Although the QFE does improve the real-time capabilities of Windows CE, it should be noted that it does so only for up to eight processes and at the expense of 300K additional memory. As we will see below, the QFE even has some negative effects on other performance measurements, so caution has to be used when thinking about applying this fix.
Finally, we tested Windows CE without any workload to show that it is very important to test the OS with the right workloads for the specified applications. Otherwise, the measurements are absolutely worthless.
This last test, performed without any workload at all, shows how important it is to include the expected workload in one's real-time considerations. Due to some sporadic peaks, the jitter is the same with workload as without, whereas the overall time consumption of the IST is, as expected, very small (these are "marketing" numbers).
The ISR runtime again is stable at around 3 to 4 µs.
This section presents detailed results of the performance testing conducted on the two different computing boards. The tests for the Pentium board were conducted once with and once without the latency QFE applied. Information on how to interpret the following results can be found in chapter 3.8. The measurement results present the minimum, the average, the maximum, and the last measured value. Because a program makes more than one kernel call, we compare the average values in chapter 3.8. This is in contrast to the interrupt latency measurements, where the worst-case values are of interest.
Results for the Hitachi SH4 Evaluation Board
*** Dhrystone: ***
Dhrystone Benchmark, Version 2.1 (Language: C)
Program compiled without 'register' attribute
Execution starts, 1000000 runs through Dhrystone
Execution ends
Final values of the variables used in the benchmark:
Microseconds for one run through Dhrystone: 3.3
Dhrystones per Second: 302388.9
--==< Start: 20493 Iterations: 10 Current: 48663 >==--
Used timer runs with 2083333 Hz, 1 tick is 4.800001e-007 s
Semaphore/Mutex (10000) last/min/avg/max 37/37/38/366 us, t = 380.5 ms
Sema/Mutex Ping (10000) last/min/avg/max 117/117/120/2135 us, t = 1202.5 ms
HeapAlloc (9999) last/min/avg/max 18/16/19/411 us, t = 190.7 ms
HeapFree (10000) last/min/avg/max 27/17/18/594 us, t = 178.4 ms
LocalAlloc 32B (10000) last/min/avg/max 19/18/22/1134 us, t = 222.8 ms
LocalAlloc 128B (10000) last/min/avg/max 19/18/22/2357 us, t = 222.5 ms
LocalFree (10000) last/min/avg/max 28/20/24/2353 us, t = 235.9 ms
GetTickCount (10000) last/min/avg/max 14/14/15/423 us, t = 147.7 ms
GetSystemTime (10000) last/min/avg/max 64/62/64/923 us, t = 636.7 ms
Create Task (10000) last/min/avg/max 1149/1131/1159/11594 us, t = 11590.9 ms
Suspend Ping (9999) last/min/avg/max 68/56/59/2065 us, t = 586.0 ms
Suspend/Resume (10000) last/min/avg/max 39/38/40/2048 us, t = 400.6 ms
Event Pending (10000) last/min/avg/max 844/64/68/2076 us, t = 682.7 ms
Set Event (10000) last/min/avg/max 20/20/21/513 us, t = 206.1 ms
Ping Event (10000) last/min/avg/max 100/99/102/2113 us, t = 1019.2 ms
Ping Queue (10000) last/min/avg/max 335/329/344/4282 us, t = 3442.1 ms
Fill Queue (10000) last/min/avg/max 81/69/80/2110 us, t = 799.1 ms
Empty Queue (10000) last/min/avg/max 56/39/45/2989 us, t = 445.8 ms
Queue Send/Recv (10000) last/min/avg/max 107/106/109/2159 us, t = 1089.0 ms
*** End of Performance Test ***
Results for the Intel Pentium Board
*** Dhrystone: ***
Dhrystone Benchmark, Version 2.1 (Language: C)
Program compiled without 'register' attribute
Execution starts, 1000000 runs through Dhrystone
Execution ends
Final values of the variables used in the benchmark:
Microseconds for one run through Dhrystone: 6.5
Dhrystones per Second: 153917.2
--==< Start: 51427 Iterations: 10 Current: 74905 >==--
Used timer runs with 1193180 Hz, 1 tick is 8.380965e-007 s
Semaphore/Mutex (10000) last/min/avg/max 23/23/24/509 us, t = 239.2 ms
Sema/Mutex Ping (10000) last/min/avg/max 104/96/106/2912 us, t = 1056.7 ms
HeapAlloc (10000) last/min/avg/max 8/8/9/381 us, t = 93.2 ms
HeapFree (10000) last/min/avg/max 8/7/9/875 us, t = 86.8 ms
LocalAlloc 32B (10000) last/min/avg/max 8/8/11/659 us, t = 105.9 ms
LocalAlloc 128B (10000) last/min/avg/max 8/8/11/619 us, t = 106.7 ms
LocalFree (10000) last/min/avg/max 8/8/10/1132 us, t = 97.4 ms
GetTickCount (10000) last/min/avg/max 6/6/7/535 us, t = 71.1 ms
GetSystemTime (10000) last/min/avg/max 112/106/115/2948 us, t = 1151.9 ms
Create Task (10000) last/min/avg/max 1251/1245/1281/4565 us, t = 12813.7 ms
Suspend Ping (10000) last/min/avg/max 34/31/33/2819 us, t = 333.2 ms
Suspend/Resume (10000) last/min/avg/max 21/20/22/873 us, t = 220.4 ms
Event Pending (10000) last/min/avg/max 870/43/47/908 us, t = 466.5 ms
Set Event (10000) last/min/avg/max 10/9/10/859 us, t = 103.7 ms
Ping Event (10000) last/min/avg/max 74/71/76/2964 us, t = 756.9 ms
Ping Queue (10000) last/min/avg/max 360/342/356/3373 us, t = 3561.8 ms
Fill Queue (10000) last/min/avg/max 70/68/80/2984 us, t = 799.8 ms
Empty Queue (10000) last/min/avg/max 49/48/52/1436 us, t = 522.4 ms
Queue Send/Recv (10000) last/min/avg/max 122/122/126/3732 us, t = 1258.4 ms
*** End of Performance Test ***
Results for the Pentium Board with QFE for the x86 Kernel
*** Dhrystone: ***
Dhrystone Benchmark, Version 2.1 (Language: C)
Program compiled without 'register' attribute
Execution starts, 1000000 runs through Dhrystone
Execution ends
Final values of the variables used in the benchmark:
Microseconds for one run through Dhrystone: 6.5
Dhrystones per Second: 152998.8
--==< Start: 31411 Iterations: 10 Current: 57662 >==--
Used timer runs with 1193180 Hz, 1 tick is 8.380965e-007 s
Semaphore/Mutex (10000) last/min/avg/max 23/22/24/893 us, t = 236.6 ms
Sema/Mutex Ping (10000) last/min/avg/max 107/98/105/2849 us, t = 1048.9 ms
HeapAlloc (10000) last/min/avg/max 8/8/9/851 us, t = 95.0 ms
HeapFree (10000) last/min/avg/max 10/7/9/501 us, t = 86.5 ms
LocalAlloc 32B (10000) last/min/avg/max 8/8/11/2751 us, t = 109.2 ms
LocalAlloc 128B (10000) last/min/avg/max 8/8/11/577 us, t = 108.1 ms
LocalFree (10000) last/min/avg/max 8/8/10/1044 us, t = 102.0 ms
GetTickCount (10000) last/min/avg/max 7/6/7/247 us, t = 70.1 ms
GetSystemTime (10000) last/min/avg/max 113/111/117/19503 us, t = 1170.3 ms
Create Task (10000) last/min/avg/max 1457/1432/1482/16431 us, t = 14818.0 ms
Suspend Ping (10000) last/min/avg/max 34/31/33/2746 us, t = 333.4 ms
Suspend/Resume (10000) last/min/avg/max 23/20/22/305 us, t = 222.6 ms
Event Pending (10000) last/min/avg/max 977/42/48/2135 us, t = 481.3 ms
Set Event (10000) last/min/avg/max 10/8/10/498 us, t = 103.5 ms
Ping Event (10000) last/min/avg/max 80/70/74/2781 us, t = 735.2 ms
Ping Queue (10000) last/min/avg/max 413/397/412/3154 us, t = 4116.5 ms
Fill Queue (10000) last/min/avg/max 81/80/92/2815 us, t = 916.5 ms
Empty Queue (10000) last/min/avg/max 61/59/64/2797 us, t = 637.5 ms
Queue Send/Recv (10000) last/min/avg/max 145/143/149/2890 us, t = 1485.9 ms
*** End of Performance Test ***
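Each output above reports the frequency of the timer used ("Used timer runs with ... Hz") and gives per-call times as last/min/avg/max in microseconds. The following sketch shows one way such values could be derived from raw tick differences. It is an assumption about the approach, not the benchmark's actual source, and the names `ticks_to_us`, `Stat`, `stat_add` and `stat_avg` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick difference of a free-running counter to whole
   microseconds, rounded to nearest. timer_hz is the counter frequency,
   e.g. 2083333 (SH4 run) or 1193180 (Pentium runs) in the outputs above. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t timer_hz)
{
    assert(timer_hz > 0);
    /* Multiply before dividing to keep precision. This overflows only
       for tick differences far beyond a single kernel call. */
    return (ticks * 1000000u + timer_hz / 2) / timer_hz;
}

/* Running last/min/avg/max record; must be zero-initialized before use. */
typedef struct {
    uint64_t last, min, max, sum, count;
} Stat;

/* Fold one measured duration (in microseconds) into the record. */
static void stat_add(Stat *s, uint64_t us)
{
    s->last = us;
    if (s->count == 0 || us < s->min) s->min = us;
    if (s->count == 0 || us > s->max) s->max = us;
    s->sum += us;
    s->count++;
}

/* Integer average of all recorded durations. */
static uint64_t stat_avg(const Stat *s)
{
    assert(s->count > 0);
    return s->sum / s->count;
}
```

With a 2083333 Hz timer, one tick is about 0.48 µs, so 100 ticks convert to 48 µs. With the 1193180 Hz timer of the Pentium board, the same 100 ticks correspond to about 84 µs, which shows that the timer resolution differs between the two boards.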
It is important to note that the QFE worsens some of the performance values of the SSC-Benchmark considerably. CreateTask and the queue functions particularly suffer in performance. This indicates that caution has to be used when applying this fix.
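To quantify this effect, the following sketch computes the relative change of the average values between the two Pentium runs above. The averages are taken from the benchmark outputs; the helper names `KernelCall`, `percent_change` and `worst_regression` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Average times in microseconds from the two Pentium runs. */
typedef struct {
    const char *name;
    double avg_us;     /* Pentium, Windows CE 3.0 */
    double avg_qfe_us; /* Pentium, Windows CE 3.0 + QFE */
} KernelCall;

static const KernelCall calls[] = {
    { "Create Task",     1281.0, 1482.0 },
    { "Ping Queue",       356.0,  412.0 },
    { "Fill Queue",        80.0,   92.0 },
    { "Empty Queue",       52.0,   64.0 },
    { "Queue Send/Recv",  126.0,  149.0 },
};

/* Relative change from before to after in percent; positive means slower. */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Index of the entry with the largest relative slowdown (n > 0). */
static size_t worst_regression(const KernelCall *c, size_t n)
{
    assert(c != NULL && n > 0);
    size_t worst = 0;
    for (size_t i = 1; i < n; ++i) {
        if (percent_change(c[i].avg_us, c[i].avg_qfe_us) >
            percent_change(c[worst].avg_us, c[worst].avg_qfe_us)) {
            worst = i;
        }
    }
    return worst;
}
```

With these averages, Create Task becomes roughly 16 percent slower with the QFE, and Empty Queue shows the largest relative slowdown at roughly 23 percent.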
The following table compares the results of the interrupt latency measurements. It presents the best case, average and worst case results of the ISR and IST measurements for the SH4 as well as for the Pentium processor. The jitter is computed as the difference between the worst (maximum) and best (minimum) IST measurements. These numbers are indicative of the behavior of Windows CE 3.0 on both processors concerning interrupt latencies.
| |SH4 198 MHz, Windows CE 3.0|Pentium 100 MHz, Windows CE 3.0|Pentium 100 MHz, Windows CE 3.0+QFE|
|ISR Min [µs]|0.9|1.9|1.9|
Table 1: Latency measurement results
Again, it should be noted that Windows CE 3.0 supports nested interrupts. This tends to complicate the kernel's internal interrupt management and would be likely to block the processor more than in earlier versions. Microsoft divided the interrupt kernel calls into multiple, shorter parts, which compensates for this situation.
In Table 1, the interesting values are the "Max" times of the IST in general, and especially those for the Pentium processor with and without the QFE. With the QFE, the maximum IST latency is about one third of the value measured without it. This directly influences the value of jitter. With the interrupt latency characteristics that Windows CE 3.0 shows for the SH4 as well as for the Pentium processor, it can be claimed that it possesses real-time capabilities, which will substantially improve products.
Table 2 contains the performance results of Windows CE 3.0 on the SH4 platform (Hitachi S1 development board) and the Intel platform (Siemens Pentium board). The results show that the SH4 is about twice as fast in computation (Dhrystone), but needs more time for kernel calls. The results have to be interpreted with caution. On the one hand, the results depend upon the motherboard, the peripherals, the clock rate of the processors, and the processor type itself (RISC or CISC). On the other hand, the SH family does not support TLB exception handling by hardware. In standard automation applications, the overall processor time consumed by kernel calls is usually small compared to the processor time consumed by the user application. Therefore, the considerable performance differences in the SSC-Benchmark shown in Table 2 are not as important as they might appear.
| |SH4 198 MHz, Windows CE 3.0|Pentium 100 MHz, Windows CE 3.0|Pentium 100 MHz, Windows CE 3.0+QFE|
|Sema/Mutex Ping [µs]|120|106|105|
|LocalAlloc 32B [µs]|22|11|11|
|LocalAlloc 128B [µs]|22|11|11|
|Create Task [µs]|1159|1281|1482|
|Suspend Ping [µs]|59|33|33|
|Event Pending [µs]|68|47|48|
|Set Event [µs]|21|10|10|
|Ping Event [µs]|102|76|74|
|Ping Queue [µs]|344|356|412|
|Fill Queue [µs]|80|80|92|
|Empty Queue [µs]|45|52|64|
|Queue Send/Recv [µs]|109|126|149|
Table 2: Performance measurement results
Table 2 shows that some functions achieve relatively poor results (e.g. Create Task). These are functions which should normally only be called during the boot time of a real-time application. Most functions that are used during runtime achieve good results (e.g. Set Event, Queue Send/Recv). However, some "runtime functions" have worse numbers (e.g. Semaphore/Mutex Ping) than other functions that usually get called during runtime. Improvements in these areas are necessary in order to increase the performance of the OS.
It is important for programmers to know, and always to keep in mind, that some of the "ease of use" real-time programming functionality introduced in earlier versions of the Windows CE operating system has been removed in order to improve the real-time behavior of Windows CE 3.0. One example is that only one level of priority inversion is supported. Unfortunately, many Windows CE programmers are not aware of certain real-time related issues and are not careful with regard to real-time programming (see the following paragraph). We suggest that software architects check the programming restrictions of Windows CE 3.0 against their real-time requirements.
Following is a summary of real-time programming do's and don'ts to reach real-time performance.
To achieve real-time performance, do not:
- spend inordinate amounts of time in ISRs,
- spin in the highest-priority thread, because this will starve the system,
- use non-real-time APIs in real-time threads and expect real-time performance (e.g. SetTimer, file system calls, Process/Thread creation),
- allow priority inversion to occur. Keep in mind: by using a good system and program architecture, priority inversion is avoidable.
To achieve real-time performance, do:
- pre-allocate all your resources (e.g. memory, threads, processes, mutexes, semaphores, events, etc.),
- pre-allocate resources in system/application pools to use these resources during runtime without allocation and releasing,
- set up the embedded program structure correctly, especially process communication,
- buffer data in ISR if passing it directly to IST isn't fast enough,
- use the ISR for all the work if:
  - no system services are required, and
  - no extensive processing (long ISR time) is required,
- set priorities and scheduling quanta correctly,
- use the function LoadDriver() to avoid page faults or turn off the demand-pager,
- watch out for slow hardware or hardware that utilizes the bus for extended periods (e.g. a CMOS real-time clock takes 70µs to access; display acceleration hardware steals bus cycles from CPU). These cases block ISR hits and delay ISR calls,
- watch for exception handling, which is very time consuming.
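The advice to pre-allocate resources in pools can be illustrated with a fixed-block pool that hands out and takes back blocks in constant time, without calling the heap at runtime. This is a minimal sketch under stated assumptions, not a Windows CE facility: the names `Pool`, `pool_init`, `pool_alloc` and `pool_free` and the sizes are hypothetical, and the code is not thread-safe, so a real application would have to serialize access (for example with a critical section).

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 8   /* hypothetical number of blocks */
#define BLOCK_SIZE  128 /* hypothetical payload size in bytes */

/* A block is either on the free list or in use, never both. */
typedef union Block {
    union Block  *next;             /* link while the block is free */
    unsigned char data[BLOCK_SIZE]; /* payload while the block is in use */
} Block;

typedef struct {
    Block  storage[POOL_BLOCKS]; /* all memory is reserved up front */
    Block *free_list;
} Pool;

/* Called once at startup: chain all blocks into the free list. */
static void pool_init(Pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; ++i) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
}

/* Constant-time allocation; returns NULL when the pool is exhausted. */
static void *pool_alloc(Pool *p)
{
    Block *b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
    }
    return b;
}

/* Constant-time release; ptr must come from pool_alloc on the same pool. */
static void pool_free(Pool *p, void *ptr)
{
    Block *b = (Block *)ptr;
    assert(b >= p->storage && b < p->storage + POOL_BLOCKS);
    b->next = p->free_list;
    p->free_list = b;
}
```

The pool is initialized once during startup, when timing is not critical. During runtime, `pool_alloc` and `pool_free` only relink pointers, so their execution time does not depend on heap state, in contrast to the large maximum values measured for HeapAlloc and LocalAlloc above.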
All programming efforts are worthless if:
- no structured development process is explicitly declared,
- no test and verification strategy is developed and used during the whole project,
- no real-time requirements and their verification are considered (logic testing is not enough!).
Real-Time Programming for Embedded Systems, Holger Küfner, course of lectures, 2000
Getting Started with CE 2.12, Luis Alonso, Odo Stierschneider, CT/SE2, November 1999
Getting Started with CE 2.11, Luis Alonso, CT/SE2, October 1998
Designing and Optimizing Microsoft Windows CE 3.0 for Real-Time Performance, Microsoft Embedded web site, Microsoft Corporation, June 1999
Real-Time Systems with Microsoft Windows CE, Microsoft Embedded web site, Microsoft Corporation, April 2000
Updated with the New Kernel Features, Windows CE 3.0 Really Packs a Punch, Douglas Boling, Microsoft Systems Journal web site, Microsoft Corporation, July 1999
Microsoft Windows CE Platform Builder Release Notes 3.0 beta, Microsoft Corporation, July 1999
Microsoft Windows CE Platform Builder Books Online 3.0 beta, Microsoft Corporation, July 1999
Programming Windows CE, Douglas Boling, Microsoft Press, 1998
The Microsoft Embedded Technology Tour, Collected Slides, Microsoft Corporation, October 1999
Microsoft Windows CE Developers Conference, Collected Slides, Microsoft Corporation, June 1999
Windows CE 3.0 Product Overview, Microsoft Corporation, June 2000
Building Embedded Systems with the Windows CE 3.0 Platform, Microsoft Embedded web site, Microsoft Corporation, June 2000
Windows CE Platform Builder 3.0: Getting Started, Microsoft Embedded web site, Microsoft Corporation, May 2000
What's New in Platform Builder 3.0, Microsoft Embedded web site, Microsoft Corporation, June 2000
Comparing Windows CE 3.0 with Windows CE 2.12, Microsoft Embedded web site, Microsoft Corporation, June 2000
Microsoft Windows CE 3.0 Data Sheet, Microsoft Embedded web site, Microsoft Corporation, June 2000