Printer Friendly Version      Send     
Click to Rate and Give Feedback
Related Articles

There is a large body of research on group techniques to determine the best alternative from a set of options. Dr. James McCaffrey outlines five of them.

Dr. James McCaffrey

MSDN Magazine November 2008

...

Read more!

This article presents an overview of the motivation behind new techniques that decompose problems into independent pieces for optimal use of parallel programming.

David Callahan

MSDN Magazine October 2008

...

Read more!

Because Virtual Server is built upon a set of COM modules, you can automate the creation and testing of virtual machines. Here we use Windows PowerShell to run the tests.

Dr. James McCaffrey and Paul Despe

MSDN Magazine December 2008

...

Read more!

This month we explain how pseudo variables and format specifiers provide a wealth of information for use in debugging.

Kenny Kerr

MSDN Magazine December 2008

...

Read more!

Designing testability into your app means smaller tests that are cheaper to create, easier to understand, faster to run, and much simpler to debug.

Jeremy Miller

MSDN Magazine December 2008

...

Read more!

Also by this Author

This installment of Bugslayer covers the use of ADPlus to create a minidump of your Microsoft .NET Framework 2.0 pro¬cesses on specific exceptions.

John Robbins

MSDN Magazine November 2006

...

Read more!

Those of you who have been reading this old Bugslayer column over the last nine years have branded into your frontal lobe a single word: ASSERT! Anytime you can have the code tell you about a problem instead of having to find it by slaving away with a debugger is a huge timesaver.

John Robbins

MSDN Magazine November 2005

...

Read more!

In the June 2004 installment of the Bugslayer column, I introduced the amazing FxCop, which analyzes your . NET assemblies for errors and problems based on code that violates the . NET Design Guidelines.

John Robbins

MSDN Magazine September 2004

...

Read more!

Having talked to thousands of developers who use the Microsoft . NET Framework, I've heard one consistent complaint: "I really wish all the samples were written in my programming language. " Nothing is more frustrating than having braved the wilds of Internet searches for a snippet of code that does exactly what you want but is written in a language you don't use.

John Robbins

MSDN Magazine August 2004

...

Read more!

Windows Vista has a new API called Wait Chain Traversal (WCT), which allows you to determine when and why a process is deadlocked. Read on.

John Robbins

MSDN Magazine July 2007

...

Read more!

Popular Articles

Here are some design patterns that allow you to achieve higher cohesion and looser coupling for more flexible, reusable applications.

Jeremy Miller

MSDN Magazine October 2008

...

Read more!

Jeff Prosise explains when it's better to use UpdatePanel and when it's better to use asynchronous calls to WebMethods or page methods instead.

Jeff Prosise

MSDN Magazine June 2007

...

Read more!

Learn how to automate custom SharePoint application deployments, use the SharePoint API, and avoid the hassle of custom site definitions.

E. Wilansky, P. Olszewski, and R. Sneddon

MSDN Magazine May 2008

...

Read more!

Now you can perform efficient, sophisticated text analysis using regular expressions in SQL Server 2005.

David Banister

MSDN Magazine February 2007

...

Read more!

This article presents an overview of the motivation behind new techniques that decompose problems into independent pieces for optimal use of parallel programming.

David Callahan

MSDN Magazine October 2008

...

Read more!

Our Blog

Windows Presentation Foundation (WPF) adds functionality to the Microsoft .NET Framework so that you actually can reliably keep bound controls synchronized with their data sources.

In the December 2008 issue of MSDN Magazine, Ken Getz demonstrates how to use the ObservableCollection class provided by WPF to keep bound controls in ...

Read more!

Earlier this year MSDN Magazine embarked on a collaborative project with Behind the Code, an interview program airing on MSDN Channel 9. In this program, Robert Hess interviews prominent developers at Microsoft, and those developers also write a column for { End Bracket } in MSDN Magazine. In the newest interview, Richard Ward talks about working on the core infrastructure components of future versions of Windows, as well as ...

Read more!

We're currently in the process of stepping back and taking a critical look at our Web site to see how you all are using it - and how we can redesign parts of it (big or small) to make that experience better.  We are continuously receiving your feedback on existing frustrations and we are working hard to remedy those (as a general fyi, most of the frustrations have to do with navigation).  However, in order to get a sense of whether we need to look at some of the more fundamental ...

Read more!

Silverlight and SharePoint provide a simple, yet powerful, infrastructure for building intranet and extranet applications with sophisticated user interface designs and interactions.

In the November 2008 issue of MSDN Magazine, Steve Fox and Paul Stubbs demonstrate how to build a SharePoint Web Part as a wrapper for a Silverlight application.

...

Read more!

With the releases of LINQ to SQL and the ADO.NET Entity Framework, developers now have two products from Microsoft designed to tie together relational data and object-oriented programming.

In the December 2008 issue of MSDN Magazine, Anthony Sneed provides a roadmap to these technologies and demonstrates how you can create ...

Read more!

This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
MSDN Magazine
Improving Runtime Performance with the Smooth Working Set Toolâ€"Part 1
John Robbins
Download the code for this article: Bugslayer1000.exe (155KB)
Browse the code for this article at Code Center: Working Set Tuner and Smooth Working Set

I
was recently extolling the benefits of the Working Set Tuner (WST) to a group of developers. As I was telling them how you could find WST on the Platform SDK, someone pointed out that they had the latest Platform SDK and there was no such thing as the WST. I quickly whipped out my laptop (which always has the latest Platform SDK installed) and lo and behold, I was telling a lie! The WST no longer exists!
      The idea behind WST was to arrange your functions in memory so that the most frequently called functions would be packed closest together. This allowed the operating system to discard infrequently used parts of an application without affecting the most used portions. If applications are properly tuned, the overall machine performance will improve dramatically because the operating system will be doing less work.
      In the Bugslayer column in the February 1999 issue of MSJ, I applauded Microsoft for finally fixing the broken WST. I explained why you should use the WST and how to make it work. The last time I used WST was in the prerelease days of Windows® 2000, and I was quite shocked to find it missing in the final Platform SDK release. Seeing an excellent opportunity, I decided to right the wrong and create a WST replacement, which I call Smooth Working Set (SWS).
      In this month's column, I will discuss the first half of the project. I'll start with a quick overview of WST/SWS so you can understand the purpose behind these tools. Then I'll discuss the design decisions I made when writing SWS. Finally, I will highlight the first part of the implementation.

What is Smooth Working Set?

      Unfortunately, with today's 256MB machines and 1GB servers, some programmers just don't worry about the size of their programs. However, as the Godzilla movie of a few years ago so eloquently put it, "size does matter." The smaller your program is in terms of overall size, the better. Of course, you should also watch the amount of RAM your program uses when it's running. This is called the working set.
      The working set is the amount of memory that's dedicated to the current running parts of the program. The smaller you can make it, the faster your computer will run because you will avoid page faults. A page fault occurs when parts of your application are either not in cache memory (soft page faults) or not in memory at all and have to be brought in from the hard disk (hard page faults). As a wise man once told me, "A page fault can ruin your day!"
      Since tuning an application means lumping the common functions close together in memory and putting the most-used functions at the start of the working set, you can probably guess that the order of all the pieces depends on the linker. It might surprise you that the linker, by default, plops functions into a binary in the order in which they appear in the OBJ files. If you change the order of OBJ files on the linker command line, the function order will change as well. To control the exact placement of functions in your application, you need to tell the linker where to put each function. This is done through the /ORDER command-line option to LINK.EXE. The parameter to the /ORDER option is simply a text file, called the order file, which lists each function on a single line in the order in which you want them to appear. The idea behind SWS is to produce the order file for your application.
      Producing a simple order file actually takes quite a bit of work on the part of the SWS program, as well as on your part. In order to decide the order of the functions, you have to gather data about the calling patterns. You do this by setting up and running SWS, and then performing the tasks that the user will most commonly perform with your program. You should not run everything in your program; that would defeat the purpose of SWS. While you might always immediately jump to the About box and press the special Easter Egg keys to see your name scroll across the screen, your users do not. Depending on your application, it might take quite a bit of thinking to determine the most common user scenarios. It's a good idea to observe several customers during your beta sequence to see what they do with your application.
      For further information on WST, I strongly suggest you search for the name Russ Blake on the MSDN® CD. Russ was in charge of performance tuning on the Windows NT® team and wrote Volume 4 of the Microsoft® Windows NT 3.51 Resource Kit book, Optimizing Windows NT (Microsoft Press, 1995), which is an excellent introduction to performance tuning. You should definitely read the chapter "Tuning the Working Set of Your Application." There, Russ discusses how to use the WST tool and how the performance team designed it. I used that chapter as a starting point for designing SWS, and I am assuming you will read it before looking at the rest of this column.

Reality Check

      SWS is not a magic elixir. If you have terrible performance before running SWS, you will have terrible performance after running SWS. Tuning your working set should be the final performance tweak you perform after you have perfected your algorithms and taken care of any incidental performance bugs you find.
      There are times you might not need to run SWS. If you have a small application whose total compiled size for all binaries is less than 2 to 3MB, you might not see any working set reduction at all. The maximum benefit is achieved when you tune the working set on larger applications because the more pages you have, the more room there is for improvement. For those larger applications, Russ Blake says that you should expect a 30% reduction in resource usage for the scenario that you are testing.
      If you are using a third-party code coverage product, such as Rational's PureCoverage or NuMega's TrueCoverage, you can easily get the number of times each function was called. What makes these third-party tools attractive is that you don't have to do the extra setup that SWS requires. Additionally, the third-party tools have excellent merge capabilities, so the work of combining multiple runs of your application to get the final numbers is trivial. SWS doesn't just use the third-party tool's list to determine the order of your functions. It attempts to pack each code page as full of functions as possible. However, I feel that just getting the functions sorted by how frequently they are called is sufficient for most applications. Now that you have some background, I'll turn to some of my initial design decisions.

Design Decisions

      When thinking about how SWS should work, I felt that I should follow the same basic usage model that WST uses, except I didn't want SWS to require those outdated COFF symbols. The best reason to follow the WST model is that I can just point you to the WST documentation on the MSDN CD when you want to learn how to use SWS. Also, WST used an interesting hook to get control on each function call, and I could not think of a cleaner way to accomplish that.
      The Visual C++® compiler has a switch, /Gh, that inserts a special function call, _penter (or prolog enter), into each function prolog. That means you have the opportunity to do anything you want on each function compiled with /Gh. Without the /Gh compilation option, creating a tool such as SWS would be considerably difficult and error prone. As you can imagine, the magic of /Gh could allow you to write some very interesting bugslaying tools.
      With the function-hooking design mercifully out of the way, I started looking at the actions that had to take place each time the _penter function is called. The basic algorithm looks like the one shown in Figure 1.
      As you can see, the _penter function is not very exciting. Things get more interesting when you start to think about how to organize the address data. Since I will need to associate an address with a function name, I need to use my good friend the DBGHELP.DLL symbol engine somewhere in the program. However, since looking up symbols in the symbol engine is not the fastest of operations, I needed a way to keep the data access small and fast, since it will be touched on each function call.
      When I thought about it, what I wanted for the data arrangement was just a sorted array of all the function addresses with their associated execution counts. That way when I had the return address in the _penter function, I could just do a quick binary search for the address. That seemed relatively straightforward because all I had to do was enumerate the modules' symbols, sort the array of functions, and I would have the necessary data.
      I thought that SWS, like the WST, would keep each module's execution counts for a run in a separate data file. I favor that approach because if a particular application run is not what you wanted, you can delete the particular run that you don't want to include in the overall merged data set. Unlike the WST, which uses the module name.run number naming format, I wanted SWS to use a module name.run number.SWS scheme so I could eventually write a graphical program to make combining all the runs easier.
      After I decided how I wanted to approach the runtime data processing, I looked at how I would generate the actual order file. As I have already mentioned, I need to concatenate the individual runs. However, when I thought about producing the actual order file, I realized that I hadn't brought forward all the necessary data. The order file needs the names of the functions as well as their sizes, but all I had planned to store were the function addresses. While I could have used the symbol engine again when doing the actual order file generation, the only way to get a symbol's size is to enumerate all the module's symbols. Since I was already doing full symbol enumeration in the initial data generation phases, I thought I should just go ahead and write out to a separate data file at that point. The name of the file is the module name with an .SDW extension. It contains the function addresses, the function size, and the associated names.
      If you read the chapter, "How the Working Set Tuner Works" in Russ Blake's Optimizing Windows NT, as I suggested, you might wonder why I didn't mention anything about bit sets and time intervals. The Windows NT performance team implemented WST using a scheme where a bit in a bit set represents each function. Every so many milliseconds, WST writes out the bit set to record which functions executed in that time interval. I've often wondered why they implemented WST that way. At first glance, it seems like you do save some memory with the bit set, but you have to remember that there must be some way of mapping a function address to a bit, so I don't think it saves much more memory than my scheme. I believe the Windows NT performance team used the bit set scheme because they were tuning the operating system with WST. In contrast, I'm looking at individual binariesâ€"so it's an issue of scale.
      When designing my data structures, I was concerned about one point: I wanted to simply increment a value whenever a function executed. In multithreaded programs, I need to protect that value so only one thread at a time is manipulating it. My goal was to make the SWS runtime as fast as possible, so the best choice would be to increment the function count with the InterlockedIncrement API function. Since that function uses the hardware locks, I can guarantee data coherency in multithreaded applications. However, in Win32®, the largest data size that you can pass to InterlockedIncrement is a 32-bit value. Therefore, it's possible to wrap after 4,294,967,295 function calls. While four billion is a large number, some message loops could execute that high if you run your application long enough.
      The WST program avoided this problem by only recording if the function executed, and then calculating the total executions. When tuning the operating system, the odds are quite good that you could execute some functions more than four billion times. However, with SWS I am willing to take the chance during my initial design and do some testing to determine if this approach is a problem. One thing that might make it less likely to wrap the function count is that I'm going to allow you to start and stop the data collection at will. That way you can only generate data on the cases where you need it, instead of major parts of the application each time you run it.
      If I do find out that the count wrapping is an issue (which I don't think will happen), I have two options. The first is to implement the data counting variable as a 64-bit integer and have a per-module synchronization object to protect all the bits. The other option is drastic in that I will need to implement the SWS runtime to work like WST does. Of course, the wrapping problem only happens on Win32, so another option would be to implement SWS for Win64™ only, because there's no way you will ever hit 18,446,744,073,709,551,615 function executions in your lifetime. Now I'll turn to some of the implementation highlights in this month's code.

Initial Implementation Highlights

      The first portion of SWS that I tackled was the actual file handling. When I looked at the problem, I realized that I wanted to use a common base class since many of the file handling chores were shared between reading and writing the files. As you can see in Figure 2, the on-disk file format is obvious. For the most part, the file implementation is standard Win32 file manipulationâ€"nothing special. Please note that you can look at all the file manipulation code in SWSFILE.DLL if you are interested (see the code link at the top of this article).
      It may seem odd that I exported several classes from SWSFILE.DLL. In general, I am loath to do that because it can get messy. However, I thought that since I was calling the file manipulation classes from both the runtime and the data generation program, I should make it sharable. As most C++ programmers developing for Windows know, exporting a C++ class from a DLL is perfectly acceptable. That's how MFC does it.
      When I first thought I would export a class from the DLL, I needed to check that Microsoft had fixed the nasty new/delete mismatch problem with exported classes. The problem was that exported classes were allocated by the new operator in the EXE, but would be deallocated with the delete from the DLL, thus resulting in your application leaving a large crater in the middle of the user's screen. For all the gory details check out Knowledge Base article Q122675 at http://support.microsoft.com/support/kb/articles/Q122/6/75.asp. Since I had run into this problem before, I automatically did three things:
  • used __declspec(dllexport) in the class declaration to export the class
  • used __declspec(dllimport) wherever I used the class
  • declared all destructors as virtual, which you should always do anyway
If you follow these three steps, exporting a class from a DLL will work correctly with Visual C++ 5.0 and higher.
      The other two binaries in this month's code (which you can download from the link at the top of this article) are SWSDLL.DLL and SWS.EXE. SWSDLL.DLL will eventually contain all the runtime code. Right now, it holds the data file generation and dumping functions. SWS.EXE is the program that you will run to perform the actual working set tuning to produce the order file. Additionally, SWS.EXE allows you to dump existing data files and to generate the initial data files. Normally, the runtime will create the initial data files if they do not exist, but if you want to speed up your SWS program runs, you can generate the files ahead of time.

Wrap-up

      Armed with the first half of the SWS program, you might want to implement the two remaining portions, the runtime and the order file generation. There are two interesting points about the runtime that you need to keep in mind. First, you will need to implement _penter in such a way that you can reliably get at the return address. If you're a loyal Bugslayer reader, you'll remember that I showed you how to solve this in the February 1999 issue of MSJ. Second, you might want to consider some sort of last-used module and last-used address caching scheme. Since most functions inside a module call other functions inside the same moduleâ€"and functions generally call functions around themâ€"you might be able to speed up the runtime drastically with an appropriate caching scheme. It would be interesting to compare how you and I would solve each of these problems.
      In this month's code distribution, I included a new version of the ongoing BUGSLAYERUTIL.DLL project. Walter Warniaha found a whopper of a bug in my Crash Handler code. I was allocating the array of modules, which I would call the exception filter with HeapAlloc, but deleting it with the C runtime free function. Wow! That's my most stupid bug to date! Speaking of my bugs, Jim Miars found another small bug in CRASHANDLER.CPP, where I had the last two parameters to FillMemory reversed so the memory wasn't initialized to a known value. I appreciate both Walter and Jim letting me know about the bugs so that I could get them fixed.
      As a follow-up to my last column in which I implemented the COM wrapper on the DBGHELP.DLL symbol engine, Staffan Gustafsson took the challenge and implemented a real OLE DB provider symbol engine! Staffan did a great job. If you want to know how to do it, you'll find his full source code at my little spot on the Web, http://www.wintellect.com/robbins.

Da Tips!

      It's October, so if you don't get your debugging tips to me at john@wintellect.com, the ghosts and goblins that appear at your door on the 31st will leave all their candy wrappers littered on your yard!
Tip 37 Spencer Low writes: I am looking for a way to perform compile-time checks on constant-expression assertions, rather than just waiting until runtime to hit them. As I was looking through WINNT.H, I ran across the following code, which is exactly what I needed to solve the problem.
//
// C_ASSERT() can be used to perform many compile-time assertions:
//            type sizes, field offsets, etc.
//
// An assertion failure results in error C2118: negative subscript.
//

#define C_ASSERT(e) typedef char __C_ASSERT__[(e)?1:-1]
Tip 38 Reed Mangino writes: While this might not be a typical Bugslayer tip, it can save you lots of time in the debugger and while editing in the Visual C++ IDE. Use CTRL+SHIFT+G to bring up source files from numerous places. In the Find window, type the name of the file to open (for example, WINNT.H), press CTRL+SHIFT+G and, bingo, WINNT.H will open. In an editor window, if you place the cursor inside the <> or "" of an include statement, pressing CTRL+SHIFT+G will open that include file. If the editor doesn't find the files, you might need to add the appropriate paths in the Include files section of the Options dialog Directories tab.
John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in programming with Windows and COM. He is the author of Debugging Applications (Microsoft Press, 2000). You can contact John at http://www.wintellect.com.

From the October 2000 issue of MSDN Magazine.

Page view tracker