August 2012

Volume 27 Number 08

Windows with C++ - Lightweight Cooperative Multitasking with C++

By Kenny Kerr | August 2012

If you work for a company that has one of those coding standards documents that would annihilate an entire rainforest were it ever to be printed, you’d better stop reading now. Chances are, what I’m about to show you will violate many of the sacred cows in the aforementioned epic. I’m going to tell you about a particular technique I originally developed to allow me to write completely asynchronous code efficiently and without the need for complex state machines.

Unless your name is Donald Knuth, it’s likely any code you write will resemble something that has already been done. Indeed, the oldest account I can find of the technique I describe here is a mention by Knuth himself, and the most recent is by a gentleman from England by the name of Simon Tatham, known for the popular PuTTY terminal emulator. Nevertheless, to paraphrase the judge in the recent Oracle v. Google dispute, “you can’t patent a for loop.” Still, we’re all indebted to our peers in the field of computer programming as we push the craft forward.

Before I dive in and describe what my technique is and how it works, I need to present a quick diversion in the hope that it will give you a bit more perspective for what’s to come. In “Compilers: Principles, Techniques, and Tools, Second Edition” (Prentice Hall, 2006) by Alfred V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman—more commonly known as the Dragon Book—the authors sum up the purpose of a compiler as being a program that can read a program in one language and translate it into an equivalent program in another language. If you were to ask C++ designer Bjarne Stroustrup what language he used to write C++, he would tell you it was C++. What this means is that he wrote a preprocessor that read C++ and produced C, which a standard C compiler could then further translate into some machine language. If you look closely enough, you can see variants of this idea in many places. The C# compiler, for example, translates the seemingly magical yield keyword into a regular class that implements an iterator. Late last year, I noticed that the Visual C++ compiler first translated portions of the C++/CX syntax into standard C++ before compiling it further. This may have since changed, but the point is that compilers are fundamentally about translation.

Sometimes, when using a particular language, you might conclude that a feature would be desirable but the very nature of the language prohibits you from implementing it. This rarely happens in C++, because the language by design is suited to the expression of different techniques, thanks to its rich syntax and meta-programming facilities. However, this in fact did happen to me.

I spend most of my time these days working on an embedded OS that I created from scratch. Neither Linux, Windows nor any other OS runs under the covers. I rely on no open source software whatsoever, and indeed that would typically be impossible anyway for reasons that will become clear. This has opened my eyes to a whole world of C and C++ programming that’s quite different from the traditional world of PC software development. Most embedded systems have very different constraints from those of “normal” programs. They have to be extremely reliable. Failure can be costly. Users are seldom around to “restart” a failed program. Systems might have to run for years uninterrupted and without human intervention. Imagine a world without Windows Update or the like. These systems might also have relatively scarce computing resources. Correctness, reliability and, in particular, concurrency all play central roles in the design of such systems.

As such, the plush world of Visual C++, with its powerful libraries, is seldom appropriate. Even if Visual C++ were to target my embedded microcontroller, the accompanying libraries aren’t well-suited for systems with such scarce computing resources and, often, hard real-time constraints. One of the microprocessors I’m currently using has just 32KB of RAM and runs at less than 50 MHz, and this is still luxurious to some in the embedded community. It should be clear that by “embedded” I do not mean your average smartphone with a half-gig of RAM and a 1 GHz processor.

In “Programming: Principles and Practice Using C++” (Addison-Wesley Professional, 2008), Stroustrup cites free-store allocations (for example, new and delete), exceptions and dynamic_cast in a short list of features that must be avoided in most embedded systems because they aren’t predictable. Unfortunately, that precludes the use of most of the standard, vendor-supplied and open source C++ libraries available today.

The result is that most embedded programming—and for that matter, kernel-mode programming on Windows—still employs C rather than C++. Given that C is primarily a subset of C++, I tend to use a C++ compiler but stick to a strict subset of the language that’s predictable, portable and works well in embedded systems.

This led me on a journey to find a suitable technique to enable concurrency in my little embedded OS. Up to this point, my OS had a single thread, if you can call it that. There are no blocking operations, so any time I needed to implement something that might take some time, such as waiting for a storage I/O interrupt or for a network retransmission timeout, I would need to write a carefully constructed state machine. This is standard practice with event-driven systems, but it results in code that’s hard to reason through logically, because the code isn’t sequential.

Imagine a storage driver. There might be a storage_read function to read a page of memory from a device’s persistent flash storage. The storage_read function might first check whether the peripheral or bus is busy, and if so, simply queue the read request before returning. At some point the peripheral and bus become free and the request can proceed. This might involve disabling the transceiver, formatting a command appropriate for the bus, preparing the direct memory access buffers, enabling the transceiver again and then returning to allow the caller to do something else while the transfer completes in hardware. Eventually the bus signals its completion and the caller is notified via some callback mechanism, and any other queued requests proceed. Needless to say, it can get pretty complicated managing the queues, callbacks and state machines. Verifying everything is correct is even harder. I haven’t even described how a nonblocking file system might be implemented on top of this abstraction or how a Web server might use the file system to serve up data—all without ever blocking. A different approach is needed to reduce the inevitable and growing complexity.

Now, imagine that C++ had a few keywords that let you transport the processor from one call stack to another in mid-function. Imagine the following hypothetical C++ code:

void storage_read(storage_args & args) async
{
  wait while (busy);
  busy = true;
  begin_transfer(args);
  yield until (done_transmitting());
  busy = false;
}

Notice the “async” contextual keyword after the parameter list. I’m also using two imaginary spaced keywords named “wait while” and “yield until.” Consider what it means for C++ to have such keywords. The compiler would somehow have to express the notion of an interlude, if you will. Knuth called this a coroutine. The async keyword might appear with a function declaration to let the compiler know that the function can run asynchronously and must thus be called appropriately. The hypothetical wait and yield keywords are the actual points at which the function ceases to execute synchronously and can potentially return to the caller, only to resume where it left off at a later stage. You could also imagine a “wait until” conditional keyword as well as an unconditional yield statement.

I’ve seen alternatives to this cooperative form of concurrency—notably the Asynchronous Agents Library included with Visual C++—but all the solutions I found depended on some runtime scheduling engine. What I propose here and will illustrate in a moment is that it’s entirely possible that a C++ compiler might indeed provide cooperative concurrency without any runtime cost whatsoever. Keep in mind that I’m not saying this will solve the manycore scalability challenge. What I’m saying is we should be able to write fast and responsive event-driven programs without involving a scheduler. And as with the existing C and C++ languages, nothing should prevent those techniques from being used along with OS threads or other concurrency runtimes.

Obviously, C++ doesn’t support what I’m describing right now. What I discovered, however, is that it can be simulated reasonably well using Standard C or C++ and without relying on assembler trickery. Using this approach, the storage_read function described earlier might look as follows:

task_begin(storage_read, storage_args & args)
{
  task_wait_while(busy);
  busy = true;
  begin_transfer(args);
  task_yield_until(done_transmitting());
  busy = false;
}
task_end

Obviously, I’m relying on macros here. Gasp! Clearly, this violates item 16 in the C++ Coding Standards (bit.ly/8boiZ0), but the alternatives are worse. The ideal solution is for the language to support this directly. Alternatives include using longjmp, but I find that’s worse and has its own pitfalls. Another approach might be to use assembly language, but then I’d lose all portability. It’s debatable whether it could even be done as efficiently in assembly language, because that would most likely result in a solution that used more memory due to the loss of contextual information and the inevitable one-stack-per-task implementation. So humor me as I show you how simple and effective this is, and then how it all works.

To keep things clear, I’ll henceforth call these asynchronous functions tasks. Given the task I described earlier, it can be scheduled simply by calling it as a function:

storage_args args = { ... };
storage_read(args);

As soon as the task decides it can’t proceed, it will simply return to the caller. Tasks employ a bool return type to indicate to callers whether they’ve completed. Thus, you could continuously schedule a task until it completes, as follows:

while (storage_read(args));

Of course, this would block the caller until the task completes. This might actually be appropriate, perhaps when your program first starts in order to load a configuration file or the like. Apart from that exception, you’d rarely want to block in this manner. What you need is a way to wait for a task in a cooperative manner:

task_wait_for(storage_read, args);

This assumes the caller is itself a task and will then yield to its caller until the nested task completes, at which point it will continue. Now let me loosely define the task keywords, or pseudo-functions, and then go through an example or two you can actually try for yourself: 

  • task_declare(name, parameter): Declares a task, typically in a header file.
  • task_begin(name, parameter): Begins the definition of a task, typically in a C++ source file.
  • task_end: Ends the definition of a task.
  • task_return(): Terminates the execution of a task and returns control to the caller.
  • task_wait_until(expression): Waits until the expression is true before continuing.
  • task_wait_while(expression): Waits while the expression is true before continuing.
  • task_wait_for(name, parameter): Executes the task and waits for it to complete before continuing.
  • task_yield(): Yields control unconditionally, continuing when the task is rescheduled.
  • task_yield_until(expression): Yields control at least once, continuing when the expression is non-zero.

It’s important to remember that none of these routines block in any way. They’re all designed to achieve a high degree of concurrency in a cooperative manner. Let me illustrate with a simple example. This example uses two tasks, one to prompt the user for a number and the other to calculate a simple arithmetic mean of the numbers as they arrive. First is the average task, shown in Figure 1.

Figure 1 The Average Task

struct average_args
{
  int * source;
  int sum;
  int count;
  int average;
  int task_;
};
task_begin(average, average_args & args)
{
  args.sum = 0;
  args.count = 0;
  args.average = 0;
  while (true)
  {
    args.sum += *args.source;
    ++args.count;
    args.average = args.sum / args.count;
    task_yield();
  }
}
task_end

A task accepts exactly one argument that must be passed by reference and must include an integer member variable called task_. Obviously, this is something the compiler would hide from the caller given the ideal scenario of first-class language support. However, for the purpose of this simulation, I need a variable to keep track of the task’s progress. All the caller needs to do is initialize it to zero.

The task itself is interesting in that it contains an infinite while loop with a task_yield call within the loop’s body. The task initializes some state before entering this loop. It then updates its aggregates and yields, allowing other tasks to execute before repeating indefinitely.

Next is the input task, as shown in Figure 2.

Figure 2 The Input Task

#include <stdio.h> // for printf and scanf_s

struct input_args
{
  int * target;
  int task_;
};
task_begin(input, input_args & args)
{
  while (true)
  {
    printf("number: ");
    if (!scanf_s("%d", args.target))
    {
      task_return();
    }
    task_yield();
  }
}
task_end

This task is interesting in that it illustrates that tasks may in fact block, as the scanf_s function will do while it waits for input, although that’s not ideal for an event-driven system. This task also illustrates using the task_return call to complete the task in mid-function rather than using a conditional expression in the while statement. A task completes either by calling task_return or by falling off the end of the function, so to speak. Either way, the caller will see this as the task completing, and it will be able to resume.

To bring these tasks to life, all you need is a simple main function to act as a scheduler:

int main()
{
  int share = -1;
  average_args aa = { &share };
  input_args ia = { &share };
  while (input(ia))
  {
    average(aa);
    printf("sum=%d count=%d average=%d\n",
      aa.sum, aa.count, aa.average);
  }
}

The possibilities are endless. You could write tasks representing timers, producers and consumers, TCP connection handlers and much more.
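As a sketch of one such possibility, here’s what a periodic timer task might look like, assuming a hypothetical current_ticks function that returns a monotonic tick count (it’s not part of the code in this column):

struct timer_args
{
  unsigned start;
  unsigned period;
  int task_;
};

task_begin(timer, timer_args & args)
{
  while (true)
  {
    args.start = current_ticks();
    task_wait_until(current_ticks() - args.start >= args.period);
    printf("timer fired\n"); // do the periodic work here
  }
}
task_end

Notice that the timer’s state lives in the args structure rather than in local variables, for reasons I’ll come back to.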

So how does it work? First, keep in mind again that the ideal solution is for the compiler to implement this, in which case it could use all kinds of clever tricks to do so efficiently; what I’m about to describe won’t be anywhere near as sophisticated or complicated.

As best as I can tell, this comes down to a discovery by a programmer named Tom Duff, who found out that you can play clever tricks with the switch statement. As long as it’s syntactically valid, you can nest various selection or iteration statements within a switch statement to effectively jump in and out of a function at will. Duff published a technique for manual loop unrolling, and Tatham then realized it could be used to simulate coroutines. I took those ideas and implemented tasks as follows.
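Before dissecting the macros, here’s Duff’s original device for reference, lightly modernized. In the original, to pointed at a memory-mapped output register, which is why it’s never incremented. The case labels landing inside the do-while body are precisely the trick the task macros exploit:

void send(short * to, short * from, int count)
{
  // Copy count values to a device register, unrolled eight at a time
  // (assumes count > 0).
  int n = (count + 7) / 8;
  switch (count % 8)
  {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
  }
}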

The task_begin and task_end macros define the surrounding switch statement:

#define task_begin(name, parameter) \
                                    \
  bool name(parameter)              \
  {                                 \
    bool yield_ = false;            \
    switch (args.task_)             \
    {                               \
      case 0:
#define task_end                    \
                                    \
    }                               \
    args.task_ = 0;                 \
    return false;                   \
  }

It should now be obvious what the single task_ variable is for and how it all works. Initializing task_ to zero ensures that execution jumps to the beginning of the task. When a task ends, it’s again set back to zero as a convenience so the task can be restarted easily. Given that, the task_wait_until macro provides the necessary jump location and cooperative return facility:

#define task_wait_until(expression)      \
                                         \
  args.task_ = __LINE__; case __LINE__:  \
  if (!(expression))                     \
  {                                      \
    return true;                         \
  }

The task_ variable is set to the predefined line number macro, and the case label gets the same line number, thus ensuring that if the task yields, the next time it’s scheduled the code will resume right where it left off.
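To make this concrete, here’s roughly what a minimal task containing a single wait might expand to, with __LINE__ replaced by a made-up line number and with a hypothetical ready function and ready_args struct:

bool wait_ready(ready_args & args)
{
  bool yield_ = false;
  switch (args.task_)
  {
    case 0:
      args.task_ = 123; case 123:  // __LINE__ at the expansion point
      if (!(ready()))
      {
        return true;               // not ready yet, so yield to the caller
      }
  }
  args.task_ = 0;                  // reset so the task can be restarted
  return false;                    // tell the caller the task has completed
}

The remaining macros are shown in Figure 3.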

Figure 3 The Remaining Macros

#define task_return()                    \
                                         \
  args.task_ = 0;                        \
  return false;
#define task_wait_while(expression)      \
                                         \
  task_wait_until(!(expression))
#define task_wait_for(name, parameter)   \
                                         \
  task_wait_while(name(parameter))
#define task_yield_until(expression)     \
                                         \
  yield_ = true;                         \
  args.task_ = __LINE__; case __LINE__:  \
  if (yield_ || !(expression))           \
  {                                      \
    return true;                         \
  }
#define task_yield()                     \
                                         \
  task_yield_until(true)

These should all be patently obvious. Perhaps the only subtlety worth mentioning is task_yield_until, as it’s similar to task_wait_until but for the fact that it will always yield at least once. task_yield, in turn, will only ever yield exactly once, and I’m confident that any respectable compiler will optimize away my shorthand. I should mention that task_wait_until is also a great way to deal with out-of-memory conditions. Rather than failing in some deeply nested operation with dubious recoverability, you can simply yield until the memory allocation succeeds, giving other tasks an opportunity to complete and hopefully free up some much-needed memory. Again, this is critical for embedded systems where failure is not an option.
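Here’s a sketch of that pattern, assuming a hypothetical allocate_page function that returns a null pointer while no memory is available:

// Yield until a page becomes available; the assignment inside the
// condition is one reason warning C4706 is disabled (see below).
task_wait_until(args.buffer = allocate_page());

Incidentally, the task_declare macro from the earlier list doesn’t appear in any of the figures; presumably it amounts to nothing more than a forward declaration suitable for a header:

#define task_declare(name, parameter) \
                                      \
  bool name(parameter);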

Given that I’m emulating coroutines, there are some pitfalls. You can’t reliably use local variables within tasks, and any code that violates the validity of the hidden switch statement is going to cause trouble. Still, given that I can define my own task_args—and considering how much simpler my code is thanks to this technique—I’m thankful that it works as well as it does.
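To see why local variables are off-limits, consider this sketch, where compute and use are hypothetical:

task_begin(bad, bad_args & args)
{
  int local = compute(); // looks harmless, but the case label hidden in
  task_yield();          // task_yield means the switch would jump over
  use(local);            // this initialization, which C++ rejects outright;
}                        // keep such state in the args structure instead
task_end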

I found it useful to disable the following Visual C++ compiler warnings:

#pragma warning(disable: 4127) // Conditional expression is constant
#pragma warning(disable: 4189) // Local variable is initialized but not referenced
#pragma warning(disable: 4706) // Assignment within conditional expression

Finally, if you’re using the Visual C++ IDE, be sure to disable Edit and Continue debugging by using /Zi instead of /ZI, because under /ZI the __LINE__ macro doesn’t expand to a constant expression and therefore can’t be used as a case label.

As I concluded this column, I looked around the Web for any similar initiatives and found the new async and await keywords that the Visual Studio 2012 C# compiler has introduced. In many ways, this is an attempt to solve a similar problem. I expect the C++ community to follow suit. The question is whether these solutions will come to C and C++ in a manner that will produce predictable code suitable for embedded systems as I’ve described in this column or whether they’ll rely on a concurrency runtime, as the current Visual C++ libraries do. My hope is that one day I’ll be able to throw away these macros, but until that day comes, I’ll remain productive with this lightweight, cooperative multitasking technique.

Stay tuned for the next installment of Windows with C++ in which I’ll show you some new techniques that Niklas Gustafsson and Artur Laksberg from the Visual C++ team have been working on to bring resumable functions to C++.


Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical expert for reviewing this article: Artur Laksberg