
Implementing Fault Tolerant Systems with Windows CE .NET

Windows CE .NET
 

Nat Frampton
Microsoft Embedded MVP
President, Real Time Development Corp.

Richard Lee
Microsoft Embedded MVP
Software Architect, Vibren Technologies Inc.

Updated April 2003

Applies to:
    Microsoft® Windows® CE .NET

Summary: Discover techniques to add fault tolerance to software components, using the supporting APIs and services available on Windows CE .NET for realizing robust and reliable systems. (26 printed pages)

Contents

Introduction
Specification and Implementation
Software Fault Tolerance Techniques
Key Operating System Properties
Applying Fault Tolerant Techniques
Conclusions
Additional Resources
References

Introduction

Microsoft Windows CE .NET has proven to deliver stable, real-time operating system services, making it an increasingly popular software platform for deploying complex solutions. Demand for highly available, lower cost systems that deliver reliable and correct services is being driven by a variety of embedded applications, ranging from industrial control to home entertainment appliances. As the Windows CE .NET embedded implementations grow more complex, software errors become an important cause of failure to deliver services as designed, with consequences ranging from the inconvenient to the catastrophic.

Elimination of all software errors is virtually impossible for many systems, which may integrate software from diverse sources, thereby introducing unknown and boundless combinations of interactions. Improving the overall dependability requires that software errors be tolerated, without loss of service. This paper describes techniques to add fault tolerance to software components, using the supporting APIs and services available on Windows CE .NET for realizing robust, reliable systems.

The feature demands for embedded systems are always growing. The entrance of Microsoft® Windows® CE into the embedded community over the past four years has proved to be a catalyst for these demands. Web integration, elaborate user interfaces, and data collection are just a few of the expected features of today's embedded systems. The users of these products expect a degree of reliability and availability that is often greater than what is expected from desktop computers. Microsoft Windows CE .NET has provided the underlying operating system (OS) framework for a broad range of industrial control and consumer products for which safety, availability, and overall dependability are critical requirements. The low cost and considerable growth in the capability and reliability of platform hardware drive the addition of software features and enhancements. At some point, software faults become the major cause of service failure [Gray 91]. As software complexity increases to meet the new feature demands, the ability of the system to detect software errors and respond correctly to them becomes imperative.

History

Dependable computing concepts trace back to Charles Babbage's Calculating Engine [Lard 43]. The general problem of building reliable systems from unreliable parts was addressed in the early days of electronic computing, by Moore and Shannon [Moor 56], and von Neumann [Neum 56]. Avizienis introduced fault tolerant system design concepts, to meet the reliability demands of computer systems used in space [Aviz 67]. Computers used in flight systems, medical devices for life support, and interplanetary operation use carefully engineered hardware and software redundancy to achieve the rigorous dependability levels required. The time and costs associated with the development of such systems are considerable. For the many applications for which system dependability is highly valuable, but failure can be managed, solutions based on off-the-shelf subsystems—both software and hardware—can be created within the constraints of the development budget.

The field of reliability engineering has grown to address the classification and characterization of software faults and the quantification of a software system's probability of failure. Traditional hardware reliability characterization focuses on finding the Mean Time Between Failures (MTBF) for a particular hardware component or an entire system. The MTBF statistics for a system are found by exhaustive testing. Designers specify their requirements as MTBFs and have techniques to combine the characteristics of individual system components into a system-level MTBF. Research in reliability engineering has grown to create a similar classification for software systems. The end user's tolerance for, and characterization of, an individual failure are combined through statistical models to create the equivalent of MTBFs for individual software components. These individual statistics are then combined to determine the entire system's MTBF. Individual user errors are traced back to software interfaces, allowing software components to be associated with a particular failure. The system-level MTBF can then be used to determine whether a software product can ship, based on the user requirements and tolerance for particular errors. A complex system may have certain failures, found in testing under highly unlikely system states, that can be tolerated by the user and thus allow the software to ship (with the error being fixed in later releases).

For example, a certain rare state of a system's hard drive after a power failure may cause a dialog box to receive an incorrect value. The user has determined that this failure can be tolerated in the first delivery. Reaching zero failures in a complex software system may not be economically feasible. The field of reliability engineering offers great promise and is being leveraged by many major software development companies.

Definitions and Concepts

There is a precise and rigorous terminology used in the literature to describe the basic concepts of dependable computing, codified by Laprie [Lapr 92]; see also Avizienis, Laprie, and Randell [Aviz 01].

A system is a collection of interacting components that deliver a service through a service interface to a user. The user can be a human operator, or another computer system. The service delivered by a system is its behavior perceived by the user.

Dependability of a computing system is the ability to deliver service that can justifiably be trusted. Applications can emphasize different attributes of dependability, including:

  • Availability, the readiness for correct service.
  • Reliability, the continuity of that service.
  • Safety, the avoidance of catastrophic consequences on the environment.
  • Security, the prevention of unauthorized access.

The function of a system is what the system is intended to do, as described by the functional specification. A system failure occurs when the service delivered does not comply with the specification. The system state is the set of the component states.

An error is a system state that may lead to failure. An error is detected if an error message or signal is produced within the system, or latent if not detected. A fault is the cause of an error, and is active when it results in an error, otherwise it is dormant.

Fault tolerance is the ability of a system to deliver correct service in the presence of faults [Aviz 76]. This is achieved by error processing (removing the system error state) and by treating the source of the fault. The ability to detect and process error states and to assess the consequences are critical requirements of fault tolerant design.

Fault tolerance—both hardware and software—is achieved through some kind of redundancy. Hardware redundancy techniques often make use of multiple identical units, in addition to a means for arbitrating the resulting output. ECC memory, for example, uses a few extra bits to detect and correct errors resulting from faults in the individual storage bits. Running the same input data through a faulty software module multiple times yields the same erroneous result each time. Software fault tolerance is built by applying algorithmic diversity, computing results through independent paths, and by judging the results. This adds complexity to the system in general. Adding software fault tolerance will improve system reliability only if the gains made by the added redundancy are not offset by commensurate new faults introduced by the parallel code.

Ground Rules

Creating fault tolerant behavior in a hardware/software system is a complex process. Faults have diverse sources, ranging from physical failures of the hardware to logic errors in the software, and may be internal or external to the system. They may be operational errors or the result of malicious use. Faults can be temporary or persistent. Software faults are characterized as permanent (though they may be hard to reproduce), and are always due to flaws in the design of the system.

Descriptions of failure scenarios range from the complete loss of power to the failure of individual components. The classification of failures and their consequences is always unique to the particulars of the service provided. Consequently, approaches for achieving a particular level of dependability will vary. Fault prevention in the design phase and fault removal through maintenance are important means of delivering reliable software. The information provided in this document assumes that the hardware processor, memory, and peripheral chips are valid, and that the hardware designer has tied all unused pins appropriately. This discussion focuses on how to improve the reliability of a system by detecting and dealing with faults that originate in the software logic design.

Specification and Implementation

System design begins with specification of the service functionality and performance levels that are to be delivered. The specification must address dependability goals and identify any special service modes, such as fail-safe shut down. A key part of the specification is the identification and classification of faults that must be tolerated or otherwise remediated. The specification should define fault and error scenarios, consider system response when faults occur in coincidence, and consider the likelihood and handling of latent errors. A careful specification helps development managers to make an informed choice when allocating limited development resources between fault avoidance and fault tolerance efforts.

Implementation begins with the partitioning of the system into subsystems. This defines independent error containment boundaries, such that a failure in one subsystem does not in itself cause an error in a separate subsystem. Partitioning should reflect the hierarchy of fault tolerance, identifying which faults are handled within the individual subsystems, and which faults involve a group of subsystems. The system dependability requirements can be apportioned to the appropriate subsystems after the hierarchy and containment boundaries are determined. Subsystems implement local error detection and recovery, and provide support for higher-level detection and recovery.

Error detection can be performed concurrently with the delivery of the service, or by preemptively suspending the service to execute tests. Because you are concerned with errors in the software design, the error detection methods must be able to recognize faults in the fault tolerance logic in addition to those in the base algorithms.

Software Fault Tolerance Techniques

Fault tolerance is generally achieved through redundancy. Improvements in system reliability are achieved when the failures of redundant components occur independently. This independence improves the likelihood that a failure in one component will not occur concurrently with failures in the other components, which increases the probability of a correct result. Studies of faults in independently developed software by Knight et al. challenge the assumption that the errors which occur are independent [Knig 86]. The improvements achieved through software redundancy are therefore less than predicted by theoretical considerations that assume independence.

Two common software fault tolerance schemes, based on redundancy and the assumption that correlated faults are unlikely, are the Recovery Block [Rand 75] and N-Version Programming [Aviz 85].

The Recovery Block (RB) design makes use of a primary and one or more alternate program blocks, each of which performs the specified operations, but in different ways. A test on the results determines the acceptability of the results. This approach is often referred to as acceptance test-based fault tolerance.

The following code example shows a recovery block:

CurrentMethod = PrimaryMethod;
do {
  result = CurrentMethod();
  if (AcceptanceTest(result) == OK)
    return result;
} while (CurrentMethod = NextAlternateMethod());
fail();

A precise acceptance test may be as complex as the primary method. Clearly for the RB design to improve reliability over the primary method alone, the acceptance test must be carefully designed and implemented. An imperfect test can actually decrease reliability by reporting failure when the primary method yields correct results.

The response to a detected fault results in the serial execution of the alternate modules. This can compromise execution deadlines in a real-time environment.

N-Version Programming (NVP) uses parallel execution of N>=2 independently developed functionally equivalent software versions. The outputs of all versions are examined in a voter block to determine the correct output, if one exists. NVP is commonly called voter-based fault tolerance.

The main difference between the acceptance test-based approach and the NVP voter is that the voter does not need to understand the details of the application and can, in principle, be simpler and more general purpose than the acceptance test code.
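
As a minimal sketch only (ComputeV1, ComputeV2, and ComputeV3 are hypothetical stand-ins for three independently developed versions of the same specified computation), a 2-out-of-3 majority voter might look like the following. For brevity the versions are invoked serially here; in a real NVP design they would typically execute in parallel threads or processes.

#include <windows.h>

// Three independently developed versions of the same specified
// computation (hypothetical names).
int ComputeV1( int input );
int ComputeV2( int input );
int ComputeV3( int input );

// Majority voter: returns TRUE and the voted result when at least two
// of the three versions agree, FALSE when no majority exists.
BOOL MajorityVote( int input, int *pResult )
{
   int r1 = ComputeV1( input );
   int r2 = ComputeV2( input );
   int r3 = ComputeV3( input );

   if( r1 == r2 || r1 == r3 ) { *pResult = r1; return TRUE; }
   if( r2 == r3 )             { *pResult = r2; return TRUE; }

   return FALSE;   // No majority: report the failure to a higher level
}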

Both RB and NVP depend on design diversity, an expensive process with the potential to inject new logic errors into the system. Ammann and Knight [Amma 88] consider the use of data diversity, noting that programs often fail under a particular combination of input values. Introducing small changes into the input values may be sufficient to allow the software to execute without failure. This deliberate perturbation of the data can be done serially, followed by an acceptance test, in a data-diverse analogue of the recovery block approach; alternatively, a vector of perturbed data can be fed through the algorithm, with a voter selecting the correct result. This latter approach is sometimes called N-copy programming (NCP) because of its similarity to NVP. Note that the voter code in NCP can be considerably more complex than in NVP if the perturbed input data results in different, but acceptable, outputs.
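
To make the data-diverse variant concrete, the following sketch retries a single algorithm with slightly perturbed inputs until an acceptance test passes. Algorithm, Perturb, and AcceptanceTest are hypothetical, application-specific routines, and the approach assumes that small input perturbations are tolerable for the service being delivered.

#include <windows.h>

#define MAX_RETRIES  3

double Algorithm( double input );             // The single (non-diverse) algorithm
double Perturb( double input, int nPass );    // Small, tolerable re-expression of the input
BOOL   AcceptanceTest( double result );       // Application-specific result check

// Data-diverse retry block: run the unmodified input first, then
// re-express the input and retry until the acceptance test passes.
BOOL RunWithDataDiversity( double input, double *pResult )
{
   for( int nPass = 0; nPass <= MAX_RETRIES; nPass++ )
   {
      double result = Algorithm( nPass == 0 ? input : Perturb( input, nPass ) );

      if( AcceptanceTest( result ) )
      {
         *pResult = result;
         return TRUE;
      }
   }

   return FALSE;   // Every re-expressed input failed the acceptance test
}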

Key Operating System Properties

In building fault tolerant systems on Windows CE .NET, several key operating system features can be exploited. The proper use and partitioning of application code into processes and threads, the correct setting of priorities, the use of exception handling, and the implementation of watchdog behaviors at both the application and interrupt level can all be applied where appropriate. Understanding the Windows CE .NET implementation of each of these areas is critical to developing fault tolerant applications. Full documentation of each operating system feature discussed below can be found in both the Platform Builder and Microsoft eMbedded Visual C++® Help systems.

Processes, Threads, and Virtual Memory

Each application that runs under Windows CE .NET is called a process. The process contains both code and data. The basic unit of execution is the thread. Threads execute any part of the code in the process, including code that another thread may be running. Threads allow the application to perform multiple tasks at the same time, and allow logic to be partitioned into the most appropriate architecture for the application. Support for preemptive multitasking allows Windows CE .NET to create the effect of executing multiple threads simultaneously. Switching from one application to another (or one process to another) requires saving the state of the entire process; in contrast, switching from one thread to another within a single process is very efficient and appears to the developer as threads running simultaneously. It is critical to remember that only one thread is executing at any instant, so protection of shared data and code is necessary.

Windows CE .NET supports up to 32 processes and as many threads as memory permits. Each process has a designated primary thread, which begins execution at the WinMain function. The process creates additional threads as needed. The key differences between threads within the same process and threads in two different processes are the memory protection and the context switch times. Windows CE .NET processes run within a virtual memory environment enforced by hardware. Memory must be requested through the API functions. The Windows CE .NET kernel controls and tracks the allocation of physical memory, and can limit access rights to the requesting process only. A system exception occurs when a thread within a process attempts to directly access memory belonging to another process without first mapping that memory. The separation of the kernel services, file systems, device logic, UI tasks, and user applications into separate individual process address spaces provides isolation of code and data for these key services. This address protection provides a level of error containment and prevents memory corruption due to a fault in one process from interfering with other processes and the OS services required for fault detection and remediation.
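
As a minimal sketch of this isolation, the main application can launch a subsystem in its own process. Worker.exe is a hypothetical module name, and on Windows CE most CreateProcess parameters are not supported and are passed as NULL or 0.

#include <windows.h>

// Launch a subsystem in its own address space so that a fault in it
// cannot corrupt the memory of the calling process.
BOOL StartIsolatedWorker( PROCESS_INFORMATION *ppi )
{
   return CreateProcess( TEXT("\\Windows\\Worker.exe"),   // Hypothetical module
                         NULL,          // Command line
                         NULL, NULL,    // Security attributes (not supported)
                         FALSE,         // Handle inheritance (not supported)
                         0,             // Creation flags
                         NULL,          // Environment (not supported)
                         NULL,          // Current directory (not supported)
                         NULL,          // Startup info (not used on Windows CE)
                         ppi );         // Receives process and thread handles
}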

Synchronization Objects

Partitioning a monolithic application into several threads and processes requires coordination and synchronization. Windows CE .NET provides many synchronization objects that enable communication between multiple threads and processes. Some of these objects include critical sections, mutexes, semaphores, and events. Thread interactions with the first three of these objects can be characterized by one thread taking ownership of the object before proceeding with execution. If the object is owned by another thread, the requesting thread enters a wait state until the competing thread makes the object available. With events, the thread tells the operating system (OS) what must occur to resume execution of that thread, and then enters a wait state. When the OS or some other process signals the event, the yielding thread resumes execution. The thread has now synchronized itself with the rest of the system through that object.
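
For example, the following minimal sketch uses one of these objects, a mutex, to serialize access to data shared by two threads; g_nSharedCount and the one-second timeout are illustrative choices only.

#include <windows.h>

HANDLE g_hMutex;              // Created once at startup: CreateMutex( NULL, FALSE, NULL )
int    g_nSharedCount = 0;    // Data shared between threads

void UpdateSharedCount( void )
{
   // Block until ownership of the mutex is granted, or give up after one second.
   if( WaitForSingleObject( g_hMutex, 1000 ) == WAIT_OBJECT_0 )
   {
      g_nSharedCount++;            // Protected update
      ReleaseMutex( g_hMutex );    // Release ownership promptly
   }
   // A WAIT_TIMEOUT here is itself an error worth detecting and reporting.
}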

Exception Handling

Windows CE .NET supports the try-except and try-finally Microsoft extensions to the C language, which pass execution control to exception handling code when the enclosed code performs an operation that would otherwise terminate the application. Examples are an illegal memory reference or division by zero. The compilers included with Windows CE .NET also support the C++ try, catch, and throw exception handling keywords.
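
As a minimal sketch of the try-except extension, the following routine contains an illegal memory reference locally instead of letting it terminate the application; returning zero on failure is an illustrative choice, not a general recommendation.

#include <windows.h>

// Read through a pointer that may be invalid; contain the fault locally.
int SafeReadInt( int *pValue )
{
   __try
   {
      return *pValue;   // May raise an access violation
   }
   __except( GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                ? EXCEPTION_EXECUTE_HANDLER
                : EXCEPTION_CONTINUE_SEARCH )
   {
      // The fault is contained here; a real system would also log it.
      return 0;
   }
}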

Security

Windows CE .NET supports the enforcement of application trust rules defined by the platform OEM. The OEM implements rules to validate each DLL and EXE as it is loaded into the system, ensuring that its signature is valid. Invalid modules will not be loaded by the operating system. Additionally, when an OEM enforces a trusted environment, Windows CE .NET can reserve the 248 real-time priority levels (0-247) from untrusted applications, to guarantee that errors or malicious behavior in an application cannot lock up the system.

Priorities

Windows CE .NET provides 256 priority levels that you can set on a per-thread basis. The choice of thread priorities is critical. Thread priorities from 0 through 247, with 0 being the highest priority, are referred to as the real-time priorities and require a call to CeSetThreadPriority() to access them. The normal thread priorities, from 248 through 255, are accessed by using SetThreadPriority().
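
As a minimal sketch of the two calls (hWorker and hBackground are assumed to be thread handles returned by CreateThread elsewhere):

#include <windows.h>

void ConfigurePriorities( HANDLE hWorker, HANDLE hBackground )
{
   // Real-time range (0-247): priority 50 is an open slot in Table 1 below,
   // above the below-driver range.
   CeSetThreadPriority( hWorker, 50 );

   // Normal range (248-255) through the classic API; the THREAD_PRIORITY_*
   // constants map onto this range on Windows CE.
   SetThreadPriority( hBackground, THREAD_PRIORITY_NORMAL );
}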

The following tables enumerate the priorities in use by modules included in Windows CE .NET, and the responsibilities of the threads at each priority.

Table 1 Real-Time Thread Priorities: CeSetThreadPriority

Priority    Component
0-19        Open – Real Time Above Drivers
20          Permedia Vertical Retrace
21-98       Open – Real Time Above Drivers
99          Power Management Resume Thread
100-108     USB OHCI, UHCI, Serial
109-129     Irsir1, NDIS, Touch
130         KITL
131         VMini
132         CxPort
133-144     Open – Device Drivers
145         PS2 Keyboard
146-147     Open – Device Drivers
148         IRComm
149         Open – Device Drivers
150         TAPI
151-152     Open – Device Drivers
153-247     Open – Real Time Below Drivers

Table 2 Normal Thread Priorities: SetThreadPriority

Priority    Component
248         Power Management
249         WaveDev, TVIA5000, Mouse, PnP, Power
250         WaveAPI
251         Power Manager Battery Thread
252-255     Open

Interrupt Architecture

The following illustration provides an overview of Windows CE .NET's Interrupt Architecture. The illustration contains a representation of the hardware, kernel, OEM Adaptation Layer (OAL) and thread interactions during an interrupt.

Figure 1. Windows CE .NET Interrupt Architecture

Time increases to the right of the diagram. The handling of an interrupt involves interactions at many layers. The first reaction to an interrupt is the kernel-level exception handler, which dispatches control to the OEM Adaptation Layer ISR registered with HookInterrupt() during system startup. The OAL ISR may optionally dispatch to installable driver ISRs, which are not shown in this diagram. Upon returning from the OAL ISR, the kernel sets the event defined in the call to InterruptInitialize(). The Interrupt Service Thread (IST) in the application or driver, which is waiting on this event, is scheduled for execution on return from the kernel ISR. After the application or driver concludes its work, the IST signals completion with InterruptDone() and the kernel then re-enables that interrupt. The key element to note in the diagram is that the OAL ISR logic is reached before any application or driver code is run, allowing an ISR behavior to be executed successfully independent of the IST state.
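
As a minimal sketch of the driver or application side of this sequence, the following IST waits on the event associated with a logical interrupt; the identifier dwSysIntr is assumed to have been obtained from the OAL beforehand (for example, through an IOCTL such as IOCTL_HAL_REQUEST_SYSINTR).

#include <windows.h>

// Interrupt Service Thread: waits for the kernel to signal the event,
// services the device, and re-enables the interrupt.
DWORD WINAPI InterruptServiceThread( LPVOID lpvParam )
{
   DWORD  dwSysIntr    = (DWORD)lpvParam;
   HANDLE hevInterrupt = CreateEvent( NULL, FALSE, FALSE, NULL );

   if( !hevInterrupt )
      return 1;

   // Associate the logical interrupt with the event; the kernel signals
   // the event each time the OAL ISR returns this SYSINTR value.
   if( !InterruptInitialize( dwSysIntr, hevInterrupt, NULL, 0 ) )
      return 1;

   for( ;; )
   {
      WaitForSingleObject( hevInterrupt, INFINITE );

      // ... service the hardware here ...

      InterruptDone( dwSysIntr );   // Re-enable the interrupt
   }

   return 0;
}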

Applying Fault Tolerant Techniques

The initial porting or development of applications for Windows CE .NET often involves creating a WinMain() application that is the equivalent of a main() in a standard ANSI C application. Applications characterized by a single logic loop that contains the entire application logic are particularly vulnerable to software faults. The goal of this discussion is to transform these early developments into more fault tolerant implementations that leverage strategic Windows CE .NET characteristics.

The proper use and partitioning of application code into processes and threads, setting appropriate priorities, the use of exception handling and the implementation of watchdog behaviors from both an application and interrupt level can all be applied where appropriate. These areas are discussed in the following sections.

Partition into Threads and Processes

Threads and processes are the building blocks of Windows CE .NET applications. Both can be leveraged to create isolation and fault tolerant behavior. Isolation of logic into separate threads within the same process is the first level of protection, and moving threads into other processes provides additional protection. The following example demonstrates both advantages.

The following source code shows a simple monolithic application in which a single timing loop schedules the execution of three algorithms at different intervals.

// AllInOneBefore.cpp : Implements a simple monolithic application.
//

#include "stdafx.h"

// Global Memory
//
int g_nCount1 = 0;
int g_nCount2 = 0;


// MAIN
//
int WINAPI WinMain(   HINSTANCE hInstance,
         HINSTANCE hPrevInstance,
         LPTSTR    lpCmdLine,
         int       nCmdShow)
{

   while( 1 )
   {
      Algorithm1();
      if( g_nCount1 == 4 )
      {
         Algorithm2();
         g_nCount1 = 0;
      }
      if( g_nCount2 == 9 )
      {
         Algorithm3();
         g_nCount2 = 0;
      }
      
      g_nCount1++;
      g_nCount2++;

      Sleep(5);
   }

   return 0;
}

This monolithic application uses a single sleep routine with counters to get several algorithms to execute at various intervals. All the algorithms run within the same process and the same thread. Any fault in one algorithm will stop all algorithms from being executed. Each algorithm is run at an equal priority.

Transformation

The following two processes take the place of the single process in the preceding example. The first process integrates the Algorithm1 and Algorithm2 behaviors from the previous example, and the second process executes Algorithm3. Each algorithm call is placed inside its own thread, so that one thread can terminate without affecting the others. Algorithm3 is further isolated from Algorithm1 and Algorithm2 by being placed in a separate process.

// AllInOneAfter1.cpp : Breaks up the monolithic application.
//

#include "stdafx.h"

// Global Memory
//
HANDLE            g_htAlgorithm1;
HANDLE            g_htAlgorithm2;
BOOL              g_fFinished = FALSE;

// Prototypes
//
DWORD WINAPI      ThreadAlgorithm1  ( LPVOID lpvParam );
DWORD WINAPI      ThreadAlgorithm2  ( LPVOID lpvParam );

// MAIN 
//
int WINAPI WinMain(   HINSTANCE hInstance,
         HINSTANCE hPrevInstance,
         LPTSTR    lpCmdLine,
         int       nCmdShow)
{
   DWORD   dwThreadID;


   // Create the Algorithm 1 thread 
   //
   g_htAlgorithm1 = CreateThread( NULL,               // CE Security
                                  0,                  // Default Size
                                  ThreadAlgorithm1,   // Algorithm1
                                  NULL,               // No Parameters
                                  0,                  // No Flags
                                  &dwThreadID );      // Thread Id
   if( !g_htAlgorithm1 ) return 20;

   // Create the Algorithm 2 thread 
   //
   g_htAlgorithm2 = CreateThread( NULL,               // CE Security
                                  0,                  // Default Size
                                  ThreadAlgorithm2,   // Algorithm2
                                  NULL,               // No Parameters
                                  0,                  // No Flags
                                  &dwThreadID );      // Thread Id
   if( !g_htAlgorithm2 ) return 20;


   // Set the thread priorities
   //
   if( !CeSetThreadPriority( g_htAlgorithm1, 50 )) return 30;
   if( !CeSetThreadPriority( g_htAlgorithm2, 60 )) return 30;


   while( !g_fFinished )
   {
       Sleep( 500 );
   }

   return 0;
}

DWORD   WINAPI   ThreadAlgorithm1( LPVOID lpvParam )
{

   while( !g_fFinished )
   {
      // Sleep
      Sleep( 5 );

      Algorithm1();

   }

   return 0;
}

DWORD   WINAPI   ThreadAlgorithm2( LPVOID lpvParam )
{

   while( !g_fFinished )
   {
      // Sleep
      Sleep( 25 );

      Algorithm2();

   }

   return 0;
}

The key elements of Process 1 are:

  • Create the two threads, ThreadAlgorithm1 and ThreadAlgorithm2.
  • Set the thread priorities:
    • Set ThreadAlgorithm1 to priority 50 (higher than ThreadAlgorithm2).
    • Set ThreadAlgorithm2 to priority 60.
  • Loop every 500 ms waiting for the g_fFinished flag to be set.
  • ThreadAlgorithm1 and ThreadAlgorithm2 sleep the appropriate amount of time and call their respective algorithms.

The following source code shows the second process:

// AllInOneAfter2.cpp : Breaks up the monolithic application.
//

#include "stdafx.h"

// Global Memory
//
HANDLE            g_htAlgorithm3;
BOOL              g_fFinished = FALSE;

// Prototypes
//
DWORD WINAPI      ThreadAlgorithm3  ( LPVOID lpvParam );

// MAIN 
//
int WINAPI WinMain(   HINSTANCE hInstance,
         HINSTANCE hPrevInstance,
         LPTSTR    lpCmdLine,
         int       nCmdShow)
{
   DWORD   dwThreadID;


   // Create the Algorithm 3 thread 
   //
   g_htAlgorithm3 = CreateThread( NULL,               // CE Security
                                  0,                  // Default Size
                                  ThreadAlgorithm3,   // Algorithm3
                                  NULL,               // No Parameters
                                  0,                  // No Flags
                                  &dwThreadID );      // Thread Id
   if( !g_htAlgorithm3 ) return 20;

   // Set the thread priority
   //
   if( !CeSetThreadPriority( g_htAlgorithm3, 40 )) return 30;


   while( !g_fFinished )
   {
       Sleep( 500 );
   }

   return 0;
}

DWORD   WINAPI   ThreadAlgorithm3( LPVOID lpvParam )
{

   while( !g_fFinished )
   {
      // Sleep
      Sleep( 50 );

      Algorithm3();

   }

   return 0;
}

The key elements of Process 2 are:

  • Create the single thread, ThreadAlgorithm3.
  • Set the thread priority:
    • Set ThreadAlgorithm3 to priority 40 (higher than ThreadAlgorithm1 and ThreadAlgorithm2).
  • Loop every 500 ms waiting for the g_fFinished flag to be set.
  • ThreadAlgorithm3 sleeps the appropriate amount of time and calls Algorithm3.

The developer now has a tremendous amount of flexibility. The two processes have partitioned the algorithms and increased their ability to continue running in the event of logic faults in the others. The ability to set the thread priorities allows for the prioritization of the algorithms, which was not available in the monolithic case. The advantages of such a partitioning are just the beginning of the fault tolerant techniques available to the Windows CE .NET developer.

Implement Watchdog Threads

Watchdog threads can be implemented to verify that a particular set of logic is continuously executing. The following example has a control thread that is responsible for setting a particular watchdog event on each control cycle. If the watchdog thread ever wakes up without the control thread having run, the watchdog thread executes a shutdown behavior. This example has both threads execute within a single process; it can easily be modified to place the threads in two processes. The following source code shows the WinMain with the initialization code.

// WatchDogExample.cpp : Implements a simple watchdog example.
//

#include "stdafx.h"

// Global Memory
//
HANDLE               g_hevWatchdog;
HANDLE               g_htWatchdog;
HANDLE               g_htControl;
BOOL               g_fFinished = FALSE;

// Prototypes
//
DWORD WINAPI      ThreadWatchdog   ( LPVOID lpvParam );
DWORD WINAPI      ThreadControl   ( LPVOID lpvParam );

// MAIN
//
int WINAPI WinMain(   HINSTANCE hInstance,
         HINSTANCE hPrevInstance,
         LPTSTR    lpCmdLine,
         int       nCmdShow)
{
   DWORD   dwThreadID;

   // Create an event
   //
   g_hevWatchdog = CreateEvent(NULL, FALSE, FALSE, NULL);
   if(!g_hevWatchdog) return 10;


   // Create a Watchdog thread that waits for signaling
   //
   g_htWatchdog      = CreateThread(  NULL,         // CE Security
                    0,          // Default Size
                    ThreadWatchdog,   // Watchdog      
                       NULL,         // No Parameters
                    CREATE_SUSPENDED,   // Suspended
                    &dwThreadID      // Thread Id
                           );
   if( !g_htWatchdog )return 20;

   // Create a control thread that waits for signaling
   //
   g_htControl      = CreateThread(  NULL,         // CE Security
                    0,          // Default Size
                    ThreadControl,   // Control      
                       NULL,         // No Parameters
                    CREATE_SUSPENDED,   // Suspended
                    &dwThreadID      // Thread Id
                           );
   if( !g_htControl )return 20;

   // Set the thread priorities
   //
   if( !CeSetThreadPriority(g_htWatchdog, 5 ))return 30;
   if( !CeSetThreadPriority(g_htControl, 10 ))return 40;

   // Get the Threads going
   //
   ResumeThread( g_htWatchdog );
   ResumeThread( g_htControl );


   while( !g_fFinished )
   {
       Sleep( 500 );
   }

   return 0;
}

The key steps in initializing the controller from WinMain are:

  • Create the watchdog event.
  • Create the watchdog and control threads in the suspended state.
  • Set the thread priorities.
    • Set the watchdog thread priority to 5 (higher than the control thread).
    • Set the control thread priority to 10.
  • Resume both the watchdog and control threads.
  • Wait for the g_fFinished flag to be set.

Control Thread

The following is the control thread code.

DWORD   WINAPI   ThreadControl( LPVOID lpvParam )
{

   while( !g_fFinished )
   {
      SetEvent( g_hevWatchdog );

      // Do application processing here...
        //

      // Sleep
      Sleep( 5);
   }

   return 0;
}

The ThreadControl is responsible for setting the watchdog event and then executing its control logic. It then sleeps for 5 milliseconds (ms) and repeats the process.

Watchdog Thread

The following is the source code for the watchdog thread.

DWORD   WINAPI   ThreadWatchdog( LPVOID lpvParam )
{
   DWORD   result;

   while( !g_fFinished )
   {
      // Get Request to update
      //
      result = WaitForSingleObject( g_hevWatchdog, 500 );

      // See if signaled
      //
      if( result == WAIT_TIMEOUT  ) 
      {
           // We have a failure, application thread is dead!!!
           // Shutdown safely
         //
         WRITE_PORT_UCHAR( (PUCHAR)PowerAddress, POWER_SHUTDOWN);
      }

   }

   return 0;
}

The ThreadWatchdog continuously loops, checking the finished flag, and waits on the watchdog event for up to 500 ms on each pass. If the wait returns because of a timeout, the control thread has failed to signal the event within that interval, and the watchdog thread executes a shutdown behavior.

Leverage Exception Handling

The ANSI C++ standard includes exception handling. Exceptions are anomalous situations that may occur during your program execution, including common software faults such as arithmetic value domain errors, and access to invalid memory addresses. The Microsoft C++ compiler implements the C++ exception handling model based on the ANSI C++ standard. Exception handling is achieved by leveraging the try, throw, and catch statements. With C++ exception handling you can directly handle exceptions or pass exception control up to a higher level where handling the error is more appropriate.

Here is a simple example of using exception handling to monitor the use of a buffer returned from a GetBuffer function.

#include <windows.h>
#include <iostream.h>

char *GetBuffer( void );   // Returns NULL on failure (implemented elsewhere)

int WINAPI WinMain(   HINSTANCE hInstance,
         HINSTANCE hPrevInstance,
         LPTSTR    lpCmdLine,
         int       nCmdShow)
{
    char *buf;

    try
    {
        while( 1 )
        {
            buf = GetBuffer();

            if( buf == 0 ) throw "Memory Failure from GetBuffer!";

            Sleep(4);
        }
    }
    catch( char* str )
    {
        cout << "Exception raised! " << str << '\n';
    }
    catch( ... )   // Catch all other exceptions
    {
        throw;     // Pass control to the next level
    }

    return 0;
}

This example wraps the main processing loop, which accesses a GetBuffer function, in a try ... catch block. The significant features of this example are the ability to throw an exception and the ability to hand off exception handling to a particular level. The first catch block, catch( char* str ), handles the string type exceptions. The catch( ... ) block handles all other exceptions and then, by using the throw statement, passes execution to the next level of exception handling if one exists. In this case, there is no other exception handling in the process. In this simple example, the particular exception has been identified and handled at the appropriate level, which is a fundamental step in building fault tolerant software.

Implement Interrupt Level Fault Detection

As described in the Interrupt Architecture section, the Interrupt Service Routine (ISR) gives the developer an opportunity to run fault tolerant logic independent of the application logic state. In this basic scenario, the ISR monitors a shared memory address where the application thread is updating a watchdog value. Each time the interrupt occurs in response to a hardware timer countdown, the ISR logic checks that the application logic has updated the watchdog value. If the value is the same as on the last execution, the ISR assumes a system fault and commands the hardware to perform a safe shutdown.

To install an ISR into a platform, there are two required steps:

  1. Call the LoadIntChainHandler function to load the DLL containing the ISR code.
  2. Return the proper interrupt identifier from the ISR, in this case SYSINTR_WATCHDOG.

The DLL is loaded into the kernel's address space, and may not be linked to COREDLL.DLL or other dynamic link libraries. The source code is limited to accessing hardware registers and memory. This is adequate for certain gross fault detection and recovery behaviors. The advantage of execution within the ISR is that there are very few software logic errors that can block the execution of the ISR or corrupt the kernel memory structures involved in ISR dispatch.

The following source code example demonstrates a basic shell for creating an installable watchdog timer ISR. There are three functions:

  • DllEntry receives the process and thread attach messages.
  • IOControl is the handler for any IST calls made through KernelLibIoControl.
  • ISRHandler is the actual fault tolerant ISR.

static   UCHAR uLastWatchdog = 0;

BOOL __stdcall   DllEntry(    HINSTANCE hinstDll,
            DWORD dwReason,
            LPVOID lpReserved )
{
     if (dwReason == DLL_PROCESS_ATTACH) {}

     if (dwReason == DLL_PROCESS_DETACH) {}

     return TRUE;
}


BOOL IOControl(   DWORD   InstanceIndex,
                  DWORD   IoControlCode,
                  LPVOID  pInBuf,
                  DWORD   InBufSize,
                  LPVOID  pOutBuf,
                  DWORD   OutBufSize,
                  LPDWORD pBytesReturned )
{
   switch( IoControlCode )
   {
      case IOCTL_WATCHDOG_DRIVER:
         // Your I/O Code Here
         return TRUE;

      default:
         // Invalid IOCTL
         return FALSE;
   }
}

DWORD ISRHandler( DWORD InstanceIndex )
{
   BYTE   Value;

   Value = READ_PORT_UCHAR( (PUCHAR)IntAddress );
   if( Value == WATCHDOG_INTERRUPT )
   {
      Value = READ_PORT_UCHAR( (PUCHAR)WatchdogAddress );

      // Check the watchdog to see if the application has tickled it
      if( Value == uLastWatchdog )
      {
         // We have a failure, application thread is dead!!!
         // Shutdown safely
         //
         WRITE_PORT_UCHAR( (PUCHAR)PowerAddress, POWER_SHUTDOWN );

         // Add failure handling code here..
      }

      uLastWatchdog = Value;
      return SYSINTR_WATCHDOG;
   }
   else
   {
      return SYSINTR_CHAIN;
   }
}

The ISR handler code first uses a port I/O call to read the IntAddress location to ensure that this is a watchdog interrupt; otherwise it returns SYSINTR_CHAIN to allow the OS to call the remaining ISRs. The ISR then reads the WatchdogAddress to see if the application thread has updated the memory location. If the thread has not updated the location, the ISR writes a POWER_SHUTDOWN to the PowerAddress and returns. This simple example can easily be expanded by keeping in mind the access limitations of an ISR.
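
For completeness, the following is a sketch of step 1 of the installation described earlier; the DLL name wdog_isr.dll and the IRQ number are hypothetical and platform specific, and the exact parameters of LoadIntChainHandler are described in the Platform Builder documentation.

#include <windows.h>

// Load the ISR DLL into the kernel and chain ISRHandler onto the given IRQ.
// A NULL return indicates that the handler could not be installed.
HANDLE InstallWatchdogIsr( BYTE bIrq )
{
   return LoadIntChainHandler( TEXT("wdog_isr.dll"),   // Hypothetical DLL name
                               TEXT("ISRHandler"),     // Exported ISR entry point
                               bIrq );                 // Platform-specific IRQ line
}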

Conclusions

The entrance of Windows CE .NET into the embedded community has enabled a new class of complex and robust embedded software products. With complexity comes the need to acknowledge and deal with software faults. Microsoft has provided a rich framework of operating system features and capabilities that may be leveraged to improve system reliability through software fault tolerant design. Partitioning monolithic, complex algorithms into separate threads and processes, proper usage of thread priorities, the use of exception handling, the implementation of watchdog behaviors, and algorithmic redundancy are readily available techniques that may be used in a Windows CE .NET implementation to deliver more dependable services.

Additional Resources

For more information about Windows CE .NET, please visit the Windows Embedded Web site.

For online documentation and context-sensitive Help included with Windows CE .NET, please visit the Windows CE .NET product documentation.

http://msdn.microsoft.com/library/en-us/wcelib40/html/pb_start.asp?frame=true

References

[Gray 91] J. Gray and D. Siewiorek, High-availability computer systems. IEEE Computer, Sept 1991

[Lard 43] D. Lardner, Babbage's calculating engine, Edinburgh Review, July 1834. Dionysus Lardner's lectures on Mr. B's appliance at the Mechanics Institute caught the attention of Ms. A. Ada Lovelace, and the rest is software history.

[Moor 56] E.F. Moore and C.E. Shannon. Reliable circuits using less reliable relays. J. Franklin Institute, Sept/Oct 1956

[Neum 56] J. von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Annals of Math Studies, ed. C.E. Shannon, Princeton University Press 1956

[Aviz 67] A. Avizienis. Design of fault tolerant computers. Proc. 1967 Fall Joint Computer Conf., 1967

[Lapr 92] J.C. Laprie (ed.). Dependability: Basic Concepts and Terminology, Dependable Computing and Fault-Tolerant Systems series, Vol. 5. Springer-Verlag 1992

[Aviz 01] A. Avizienis, J.-C. Laprie, B. Randell. Fundamental Concepts of Dependability, 2001

[Aviz 76] A. Avizienis. Fault Tolerant Systems, IEEE Trans. Computers Vol. C-25 No. 12, 1976

[Knig 86] J.C. Knight, N.G. Leveson. An Experimental Evaluation of the Assumption of Independence in Multiversion Programming, IEEE Trans. Soft. Eng., Vol. SE-12, No. 1, 1986

[Rand 75] B. Randell. System Structure for Software Fault Tolerance, IEEE Trans. Soft. Eng., Vol. SE-1, 1975

[Aviz 85] A. Avizienis. The N-Version Approach to Fault tolerant Software. IEEE Trans. Soft. Eng. Vol SE-11 No 12. 1985

[Amma 88] P.E. Ammann and J.C. Knight. Data Diversity: An Approach to Software Fault Tolerance. IEEE Trans. Computers Vol. C-37 No. 4, 1988
