This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

MSDN Magazine

Improving Runtime Performance with the Smooth Working Set Tool, Part 1

John Robbins

Download the code for this article: Bugslayer1000.exe (155KB)
Browse the code for this article at Code Center: Working Set Tuner and Smooth Working Set

I was recently extolling the benefits of the Working Set Tuner (WST) to a group of developers. As I was telling them how you could find WST on the Platform SDK, someone pointed out that they had the latest Platform SDK and there was no such thing as the WST. I quickly whipped out my laptop (which always has the latest Platform SDK installed) and lo and behold, I was telling a lie! The WST no longer exists!
      The idea behind WST was to arrange your functions in memory so that the most frequently called functions would be packed closest together. This allowed the operating system to discard infrequently used parts of an application without affecting the most used portions. If applications are properly tuned, the overall machine performance will improve dramatically because the operating system will be doing less work.
      In the Bugslayer column in the February 1999 issue of MSJ, I applauded Microsoft for finally fixing the broken WST. I explained why you should use the WST and how to make it work. The last time I used WST was in the prerelease days of Windows® 2000, and I was quite shocked to find it missing in the final Platform SDK release. Seeing an excellent opportunity, I decided to right the wrong and create a WST replacement, which I call Smooth Working Set (SWS).
      In this month's column, I will discuss the first half of the project. I'll start with a quick overview of WST/SWS so you can understand the purpose behind these tools. Then I'll discuss the design decisions I made when writing SWS. Finally, I will highlight the first part of the implementation.

What is Smooth Working Set?

      Unfortunately, with today's 256MB machines and 1GB servers, some programmers just don't worry about the size of their programs. However, as the Godzilla movie of a few years ago so eloquently put it, "size does matter." The smaller your program is in terms of overall size, the better. Of course, you should also watch the amount of RAM your program uses when it's running. This is called the working set.
      The working set is the set of your program's pages that are currently resident in physical memory. The smaller you can make it, the faster your computer will run because you will avoid page faults. A page fault occurs when your program touches memory that is not in its working set: if the page is elsewhere in RAM and just has to be mapped back in, that's a soft page fault; if it has to be brought in from the hard disk, that's a hard page fault. As a wise man once told me, "A page fault can ruin your day!"
      Since tuning an application means lumping the common functions close together in memory and putting the most-used functions at the start of the working set, you can probably guess that the order of all the pieces depends on the linker. It might surprise you that the linker, by default, plops functions into a binary in the order in which they appear in the OBJ files. If you change the order of OBJ files on the linker command line, the function order will change as well. To control the exact placement of functions in your application, you need to tell the linker where to put each function. This is done through the /ORDER command-line option to LINK.EXE. The parameter to the /ORDER option is simply a text file, called the order file, which lists each function on a single line in the order in which you want them to appear. The idea behind SWS is to produce the order file for your application.
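      To make this concrete, here's what an order file and its matching linker invocation might look like. Everything here is hypothetical and just for illustration. The key facts are that /ORDER expects the decorated form of each function name, which you can pull from the linker's map file, and that the linker can only order functions packaged as COMDATs, so you must also compile with /Gy.

REM A hypothetical link line; /MAP gets you the decorated names.
LINK /ORDER:@MYAPP.PRF /MAP ...the rest of your options and OBJ files...

; MYAPP.PRF -- most frequently called functions first, one per line.
?ProcessMessage@@YAHPAUtagMSG@@@Z
?CommonInit@@YAXXZ
_WinMain@16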
      Producing a simple order file actually takes quite a bit of work on the part of the SWS program, as well as on your part. To decide how to order the functions, you have to gather data about your application's calling patterns. You do this by setting up and running SWS, and then performing the tasks that the user will most commonly perform with your program. You should not exercise everything in your program; that would defeat the purpose of SWS. While you might always immediately jump to the About box and press the special Easter Egg keys to see your name scroll across the screen, your users do not. Depending on your application, it might take quite a bit of thinking to determine the most common user scenarios. It's a good idea to observe several customers during your beta cycle to see what they do with your application.
      For further information on WST, I strongly suggest you search for the name Russ Blake on the MSDN® CD. Russ was in charge of performance tuning on the Windows NT® team and wrote Volume 4 of the Microsoft® Windows NT 3.51 Resource Kit book, Optimizing Windows NT (Microsoft Press, 1995), which is an excellent introduction to performance tuning. You should definitely read the chapter "Tuning the Working Set of Your Application." There, Russ discusses how to use the WST tool and how the performance team designed it. I used that chapter as a starting point for designing SWS, and I am assuming you will read it before looking at the rest of this column.

Reality Check

      SWS is not a magic elixir. If you have terrible performance before running SWS, you will have terrible performance after running SWS. Tuning your working set should be the final performance tweak you perform after you have perfected your algorithms and taken care of any incidental performance bugs you find.
      There are times you might not need to run SWS. If you have a small application whose total compiled size for all binaries is less than 2 to 3MB, you might not see any working set reduction at all. The maximum benefit is achieved when you tune the working set on larger applications because the more pages you have, the more room there is for improvement. For those larger applications, Russ Blake says that you should expect a 30% reduction in resource usage for the scenario that you are testing.
      If you are using a third-party code coverage product, such as Rational's PureCoverage or NuMega's TrueCoverage, you can easily get the number of times each function was called. What makes these third-party tools attractive is that you don't have to do the extra setup that SWS requires. Additionally, the third-party tools have excellent merge capabilities, so the work of combining multiple runs of your application to get the final numbers is trivial. Keep in mind that a coverage tool's call counts alone don't give you everything WST did; WST went beyond sorting by call frequency and attempted to pack each code page as full of functions as possible. However, I feel that just getting the functions sorted by how frequently they are called is sufficient for most applications. Now that you have some background, I'll turn to some of my initial design decisions.

Design Decisions

      When thinking about how SWS should work, I felt that I should follow the same basic usage model that WST uses, except I didn't want SWS to require those outdated COFF symbols. The best reason to follow the WST model is that I can just point you to the WST documentation on the MSDN CD when you want to learn how to use SWS. Also, WST used an interesting hook to get control on each function call, and I could not think of a cleaner way to accomplish that.
      The Visual C++® compiler has a switch, /Gh, that inserts a special function call, _penter (for prolog enter), into each function prolog. That means you have the opportunity to do anything you want each time a function compiled with /Gh is called. Without the /Gh compilation option, creating a tool such as SWS would be considerably more difficult and error prone. As you can imagine, the magic of /Gh could allow you to write some very interesting bugslaying tools.
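      To give you a feel for what a /Gh hook involves, here's a minimal _penter sketch for x86, based on the documented /Gh behavior. The RecordFunctionCall helper is a hypothetical name, and the hook itself must live in code that is not compiled with /Gh, or it would call itself forever.

// A hedged _penter sketch for x86. RecordFunctionCall is a
// hypothetical helper, not part of the SWS sources.
extern "C" void RecordFunctionCall ( DWORD dwReturnAddr ) ;

extern "C" void __declspec ( naked ) __cdecl _penter ( void )
{
    __asm
    {
        pushad                   // Save all general-purpose registers.
        mov  eax, [esp + 32]     // The return address sits just above
                                 //  the 32 bytes PUSHAD pushed.
        push eax
        call RecordFunctionCall  // __cdecl helper, so the caller . . .
        add  esp, 4              //  . . . cleans up the stack.
        popad                    // Restore all registers.
        ret
    }
}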
      With the function-hooking design mercifully out of the way, I started looking at the actions that had to take place each time the _penter function is called. The basic algorithm looks like the one shown in Figure 1.
      As you can see, the _penter function is not very exciting. Things get more interesting when you start to think about how to organize the address data. Since I will need to associate an address with a function name, I need to use my good friend the DBGHELP.DLL symbol engine somewhere in the program. However, looking up symbols in the symbol engine is not the fastest of operations, so I needed a way to keep the data access small and fast because it will happen on every function call.
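      For reference, here's roughly what walking a module's symbols looks like with the DBGHELP.DLL symbol engine. The names and the missing error handling are mine, and I'm assuming SymInitialize has already been called and the module has been loaded with SymLoadModule; treat it as a sketch, not the final SWS code.

#include <windows.h>
#include <stdio.h>
#include <dbghelp.h>
#pragma comment ( lib , "dbghelp.lib" )

// DBGHELP calls this once for each symbol in the module.
BOOL CALLBACK EnumSymbolsProc ( PSTR  szName       ,
                                ULONG ulAddress    ,
                                ULONG ulSize       ,
                                PVOID pUserContext )
{
    // The real code would store the address, size, and name here.
    printf ( "0x%08X %6u %s\n" , ulAddress , ulSize , szName ) ;
    return ( TRUE ) ;   // TRUE means keep enumerating.
}

void DumpModuleSymbols ( HANDLE hProcess , DWORD dwModuleBase )
{
    SymEnumerateSymbols ( hProcess , dwModuleBase , EnumSymbolsProc , NULL ) ;
}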
      When I thought about it, what I wanted for the data arrangement was just a sorted array of all the function addresses with their associated execution counts. That way when I had the return address in the _penter function, I could just do a quick binary search for the address. That seemed relatively straightforward because all I had to do was enumerate the modules' symbols, sort the array of functions, and I would have the necessary data.
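      Here's a hedged sketch of that data arrangement; all the names are hypothetical stand-ins for whatever ends up in the real code. The one subtle point worth noting is that the return address seen by _penter points just past the CALL in the caller's prolog, so the lookup wants the greatest function start address that is less than or equal to it.

struct FUNCDATA
{
    DWORD dwAddress ;    // Function start address.
    LONG  lExecCount ;   // Times the function has executed.
} ;

static FUNCDATA * g_pFuncData = NULL ;   // Kept sorted by dwAddress.
static int        g_iNumFuncs = 0 ;

// Sort the enumerated symbols once with:
//   qsort ( g_pFuncData , g_iNumFuncs , sizeof ( FUNCDATA ) , CompareFunc ) ;
int __cdecl CompareFunc ( const void * p1 , const void * p2 )
{
    DWORD dw1 = ( (const FUNCDATA *)p1 )->dwAddress ;
    DWORD dw2 = ( (const FUNCDATA *)p2 )->dwAddress ;
    return ( ( dw1 < dw2 ) ? -1 : ( dw1 > dw2 ) ? 1 : 0 ) ;
}

// Binary search for the function that owns dwRetAddr, that is, the
// entry with the greatest start address <= dwRetAddr.
FUNCDATA * FindFunction ( DWORD dwRetAddr )
{
    int iLow  = 0 ;
    int iHigh = g_iNumFuncs - 1 ;
    FUNCDATA * pBest = NULL ;
    while ( iLow <= iHigh )
    {
        int iMid = ( iLow + iHigh ) / 2 ;
        if ( g_pFuncData [ iMid ].dwAddress <= dwRetAddr )
        {
            pBest = &g_pFuncData [ iMid ] ;
            iLow  = iMid + 1 ;
        }
        else
        {
            iHigh = iMid - 1 ;
        }
    }
    return ( pBest ) ;
}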
      I thought that SWS, like the WST, would keep each module's execution counts for a run in a separate data file. I favor that approach because if a particular application run is not what you wanted, you can delete the particular run that you don't want to include in the overall merged data set. Unlike the WST, which uses a "module name.run number" naming format, I wanted SWS to use a "module name.run number.SWS" scheme so I could eventually write a graphical program to make combining all the runs easier.
      After I decided how I wanted to approach the runtime data processing, I looked at how I would generate the actual order file. As I have already mentioned, I need to concatenate the individual runs. However, when I thought about producing the actual order file, I realized that I hadn't brought forward all the necessary data. The order file needs the names of the functions as well as their sizes, but all I had planned to store were the function addresses. While I could have used the symbol engine again when doing the actual order file generation, the only way to get a symbol's size is to enumerate all the module's symbols. Since I was already doing full symbol enumeration in the initial data generation phases, I thought I should just go ahead and write out to a separate data file at that point. The name of the file is the module name with an .SDW extension. It contains the function addresses, the function size, and the associated names.
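      Just so you can picture it, a record in the .SDW file might look something like the following structure. This layout is a hypothetical illustration only; the real format is in the SWSFILE.DLL sources in this month's code.

// A hypothetical .SDW record (see SWSFILE.DLL for the real layout).
struct SDWFUNCREC
{
    DWORD dwAddress ;        // Function start address.
    DWORD dwSize ;           // Function size in bytes.
    char  szName [ 256 ] ;   // Name to write into the order file.
} ;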
      If you read the chapter "How the Working Set Tuner Works" in Russ Blake's Optimizing Windows NT, as I suggested, you might wonder why I didn't mention anything about bit sets and time intervals. The Windows NT performance team implemented WST using a scheme where a bit in a bit set represents each function. Every so many milliseconds, WST writes out the bit set to record which functions executed in that time interval. I've often wondered why they implemented WST that way. At first glance, it seems like you do save some memory with the bit set, but you have to remember that there must be some way of mapping a function address to a bit, so I don't think it saves much more memory than my scheme. I believe the Windows NT performance team used the bit set scheme because they were tuning the entire operating system with WST. In contrast, I'm looking at individual binaries, so it's an issue of scale.
      When designing my data structures, I was concerned about one point: I wanted to simply increment a value whenever a function executed. In multithreaded programs, I need to protect that value so only one thread at a time manipulates it. My goal was to make the SWS runtime as fast as possible, so the best choice was to increment the function count with the InterlockedIncrement API function. Since that function uses the hardware locks, I can guarantee data coherency in multithreaded applications. However, in Win32®, the largest data size that you can pass to InterlockedIncrement is a 32-bit value. Therefore, it's possible to wrap the count after 4,294,967,295 function calls. While four billion is a large number, functions called from tight message loops could reach that count if you run your application long enough.
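      Continuing the earlier lookup sketch, the heart of the hypothetical RecordFunctionCall helper is nothing more than a lookup followed by an interlocked bump of the 32-bit count:

void RecordFunctionCall ( DWORD dwRetAddr )
{
    FUNCDATA * pFunc = FindFunction ( dwRetAddr ) ;
    if ( NULL != pFunc )
    {
        // A hardware-locked 32-bit increment. Safe across threads,
        // but it wraps after 4,294,967,295 calls.
        InterlockedIncrement ( &pFunc->lExecCount ) ;
    }
}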
      The WST program avoided this problem by only recording whether a function executed in an interval, and then calculating the total executions. When tuning the operating system, the odds are quite good that you could execute some functions more than four billion times. However, with SWS I am willing to take the chance during my initial design and do some testing to determine if this approach is a problem. One thing that makes wrapping the function count less likely is that I'm going to allow you to start and stop the data collection at will. That way you can generate data only for the cases where you need it, instead of for major parts of the application each time you run it.
      If I do find out that the count wrapping is an issue (which I don't think will happen), I have two options. The first is to implement the data counting variable as a 64-bit integer and have a per-module synchronization object to protect all the bits. The other option is more drastic: I would need to implement the SWS runtime to work like WST does. Of course, the wrapping problem only happens on Win32, so another option would be to implement SWS for Win64™ only, because there's no way you will ever hit 18,446,744,073,709,551,615 function executions in your lifetime.
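      If I ever have to fall back to the first option, a hedged sketch of it might look like the following, with a per-module critical section guarding 64-bit counters at the cost of a slower hook:

// A hypothetical 64-bit fallback. Initialize g_csCounts with
// InitializeCriticalSection at module startup.
static CRITICAL_SECTION g_csCounts ;

void IncrementCount64 ( ULONGLONG * pullCount )
{
    EnterCriticalSection ( &g_csCounts ) ;
    *pullCount += 1 ;
    LeaveCriticalSection ( &g_csCounts ) ;
}

      Now I'll turn to some of the implementation highlights in this month's code.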

Initial Implementation Highlights

      The first portion of SWS that I tackled was the actual file handling. When I looked at the problem, I realized that I wanted to use a common base class since many of the file handling chores were shared between reading and writing the files. As you can see in Figure 2, the on-disk file format is straightforward. For the most part, the file implementation is standard Win32 file manipulation, nothing special. Please note that you can look at all the file manipulation code in SWSFILE.DLL if you are interested (see the code link at the top of this article).
      It may seem odd that I exported several classes from SWSFILE.DLL. In general, I am loath to do that because it can get messy. However, I thought that since I was calling the file manipulation classes from both the runtime and the data generation program, I should make it sharable. As most C++ programmers developing for Windows know, exporting a C++ class from a DLL is perfectly acceptable. That's how MFC does it.
      When I first thought I would export a class from the DLL, I needed to check that Microsoft had fixed the nasty new/delete mismatch problem with exported classes. The problem was that exported classes were allocated by the new operator in the EXE, but would be deallocated with the delete from the DLL, thus resulting in your application leaving a large crater in the middle of the user's screen. For all the gory details check out Knowledge Base article Q122675 at https://support.microsoft.com/support/kb/articles/Q122/6/75.asp. Since I had run into this problem before, I automatically did three things:

  • used __declspec(dllexport) in the class declaration to export the class
  • used __declspec(dllimport) wherever I used the class
  • declared all destructors as virtual, which you should always do anyway

If you follow these three steps, exporting a class from a DLL will work correctly with Visual C++ 5.0 and higher; the sketch below shows how it fits together.
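      As a concrete illustration of those three steps, here's roughly how it shapes up. The SWSFILE_EXPORTS symbol and the CSWSFileBase class name are hypothetical, not necessarily what you'll find in this month's code.

// SWSFILE_EXPORTS would be defined only in the DLL project settings,
// so the DLL exports the class and every client imports it.
#ifdef SWSFILE_EXPORTS
#define SWSFILE_API __declspec(dllexport)
#else
#define SWSFILE_API __declspec(dllimport)
#endif

class SWSFILE_API CSWSFileBase
{
public :
    CSWSFileBase ( ) ;
    // Virtual, so deleting through a base pointer always runs the
    // right destructor in the right module.
    virtual ~CSWSFileBase ( ) ;
} ;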
      The other two binaries in this month's code (which you can download from the link at the top of this article) are SWSDLL.DLL and SWS.EXE. SWSDLL.DLL will eventually contain all the runtime code. Right now, it holds the data file generation and dumping functions. SWS.EXE is the program that you will run to perform the actual working set tuning to produce the order file. Additionally, SWS.EXE allows you to dump existing data files and to generate the initial data files. Normally, the runtime will create the initial data files if they do not exist, but if you want to speed up your SWS program runs, you can generate the files ahead of time.

Wrap-up

      Armed with the first half of the SWS program, you might want to implement the two remaining portions, the runtime and the order file generation. There are two interesting points about the runtime that you need to keep in mind. First, you will need to implement _penter in such a way that you can reliably get at the return address. If you're a loyal Bugslayer reader, you'll remember that I showed you how to solve this in the February 1999 issue of MSJ. Second, you might want to consider some sort of last-used module and last-used address caching scheme. Since most functions inside a module call other functions inside the same module (and functions generally call functions near them), you might be able to speed up the runtime drastically with an appropriate caching scheme. It would be interesting to compare how you and I would solve each of these problems.
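      To show you what I mean, here's a hedged sketch of the simplest form of that idea, a one-entry cache bolted onto the RecordFunctionCall sketch from earlier. Since an aligned pointer read or write is atomic on x86, the worst a race between threads can do here is force an extra full lookup, not corrupt a count.

static FUNCDATA * volatile g_pLastFunc = NULL ;

void RecordFunctionCall ( DWORD dwRetAddr )
{
    FUNCDATA * pFunc = g_pLastFunc ;

    // It's a cache hit only when dwRetAddr falls between this entry's
    // start address and the next entry's start in the sorted array.
    if ( ( NULL == pFunc )                ||
         ( dwRetAddr < pFunc->dwAddress ) ||
         ( ( pFunc != &g_pFuncData [ g_iNumFuncs - 1 ] ) &&
           ( dwRetAddr >= ( pFunc + 1 )->dwAddress ) ) )
    {
        pFunc = FindFunction ( dwRetAddr ) ;
        g_pLastFunc = pFunc ;
    }
    if ( NULL != pFunc )
    {
        InterlockedIncrement ( &pFunc->lExecCount ) ;
    }
}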
      In this month's code distribution, I included a new version of the ongoing BUGSLAYERUTIL.DLL project. Walter Warniaha found a whopper of a bug in my Crash Handler code. I was allocating the array of modules to which I limit the exception filter with HeapAlloc, but deleting it with the C runtime free function. Wow! That's my most stupid bug to date! Speaking of my bugs, Jim Miars found another small bug in CRASHANDLER.CPP, where I had the last two parameters to FillMemory reversed, so the memory wasn't initialized to a known value. I appreciate both Walter and Jim letting me know about the bugs so that I could get them fixed.
      As a follow-up to my last column in which I implemented the COM wrapper on the DBGHELP.DLL symbol engine, Staffan Gustafsson took the challenge and implemented a real OLE DB provider symbol engine! Staffan did a great job. If you want to know how to do it, you'll find his full source code at my little spot on the Web, https://www.wintellect.com/robbins.

Da Tips!

      It's October, so if you don't get your debugging tips to me at john@wintellect.com, the ghosts and goblins that appear at your door on the 31st will leave all their candy wrappers littered on your yard!

Tip 37 Spencer Low writes: I was looking for a way to perform compile-time checks on constant-expression assertions, rather than just waiting until runtime to hit them. As I was looking through WINNT.H, I ran across the following code, which is exactly what I needed to solve the problem.

//
// C_ASSERT() can be used to perform many compile-time assertions:
// type sizes, field offsets, etc.
//
// An assertion failure results in error C2118: negative subscript.
//

#define C_ASSERT(e) typedef char __C_ASSERT__[(e)?1:-1]
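As a quick example (the specific checks are just ones I made up), a false expression declares a char array with a negative size, so the compiler stops you cold:

// These cost nothing at runtime; a false expression becomes a
// char[-1] typedef and compilation fails with error C2118.
C_ASSERT ( sizeof ( DWORD ) == 4 ) ;
C_ASSERT ( sizeof ( WPARAM ) >= sizeof ( WORD ) ) ;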

Tip 38 Reed Mangino writes: While this might not be a typical Bugslayer tip, it can save you lots of time in the debugger and while editing in the Visual C++ IDE. Use CTRL+SHIFT+G to bring up source files from numerous places. In the Find window, type the name of the file to open (for example, WINNT.H), press CTRL+SHIFT+G and, bingo, WINNT.H will open. In an editor window, if you place the cursor inside the <> or "" of an #include statement, pressing CTRL+SHIFT+G will open that include file. If the editor doesn't find the files, you might need to add the appropriate paths in the Include files section of the Options dialog Directories tab.

John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in programming with Windows and COM. He is the author of Debugging Applications (Microsoft Press, 2000). You can contact John at https://www.wintellect.com.

From the October 2000 issue of MSDN Magazine.