|Improving Runtime Performance with the Smooth Working Set Tool, Part 2|
|Download the code for this article: Bugslayer1200.exe (1,933KB)|
Browse the code for this article at Code Center: Smooth Working Set 2
As everyone in software development knows, smaller is better: the less memory your application uses, the faster it will run. The working set is the set of memory pages your application currently has in physical memory. SWS's job is to help you determine which functions are called most frequently so that it can build an order file, which the linker uses to place the most frequently called functions together. Packing the hot functions together means fewer memory pages are needed to run your application, and the fewer pages you touch, the fewer page faults you take and the faster your application runs. Read my October column to understand the SWS design philosophy.
Using SWS
Using SWS is a three-stage process. The first stage involves recompiling your application to get SWS hooked in so you can collect the function execution data. The second stage involves running the most common user scenarios using the special compiled version. To use SWS correctly, you must spend some time determining exactly what those user scenarios are so that you can duplicate them precisely. Just running your application randomly under SWS won't help reduce your working set much at all. As I mentioned last time, what you do with your application is not necessarily the same thing a typical user does with it.
The third stage involves generating the order file for the linker (which is very simple to do), and integrating that order file into your final build. The whole SWS system consists of three DLLs and a single executable. Figure 1 gives a brief overview of each file.
Getting your application compiled for SWS is generally straightforward; you simply follow these steps:
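The numbered steps from the original column are not reproduced here, but the heart of them is the compiler's /Gh switch, which injects a call to _penter at the top of every function, plus linking against the SWS runtime. A hypothetical command line might look like the following (the library name swsdll.lib is an assumption based on the DLL names in Figure 1; your project settings are the authoritative place to add the switches):

```
cl /Gh /Zi /Od /c myapp.cpp
link /DEBUG myapp.obj swsdll.lib
```

The /Zi switch matters because SWS relies on debug symbols to turn addresses back into function names.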
The two core data files generated are modulename.SWS and modulename.SDW. The .SWS file contains nothing but the addresses in your module and room for the execution counts. The .SDW file holds the addresses and names of all your functions. I broke the names out into a separate file because there was no need to drag them into each specific run's file, and it helped reduce memory overhead. When you dump a .SWS file using the -d command-line option to SWS.EXE, SWS.EXE does all the work of matching up the .SWS and .SDW files. Figure 4 shows all the command-line options to SWS.EXE.
If you run your application a couple of times, you will notice that a few more .SWS files appear in the same directory where your special compiled binary resides. Each run's data is stored in modulename.#.SWS, where # is the number of the run. The base .SWS file does not contain execution counts, so you can dump a specific run to see what you executed by using the -d command-line option to SWS.EXE and passing the specific run file on the command line.
Once you've run all of your common user scenarios, it's time to tune your application and generate the order file. SWS.EXE provides the front end to the tuning with the -t command-line option followed by just the module name of the binary to tune. Tuning produces two files, a .TWS file and the actual order file. A .TWS file contains the summed execution counts sorted in highest to lowest order. You can dump the .TWS file with SWS.EXE just like a regular .SWS file. The order file has a .PRF extension, mainly because that's what the old Working Set Tuner tool produced. The .PRF file is just a text file that you pass to the linker.
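For the curious, an order file is nothing more than one decorated function name per line, most frequently executed first. A hypothetical fragment (the names below are made up for illustration) might look like this:

```
?OnPaint@CMainFrame@@IAEXXZ
?OnCreate@CMainFrame@@IAEHPAUtagCREATESTRUCTA@@@Z
_WinMain@16
```

The linker matches these decorated names against the functions in your object files when it lays out the code section.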
Once you have the .PRF order file, it's time to apply the reordering to the linker. You can apply the following steps to your regular build, but I opt to create yet another configuration to indicate the build is special. The following steps show you how to get the order file integrated.
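The key piece of those steps is the linker's /ORDER switch, which takes the order file with an @ prefix. In a command-line build it would look something like the following (the .PRF name is a placeholder; in the Visual C++ IDE you add the same switch to the link options for your special configuration):

```
link /ORDER:@modulename.prf <your other linker options>
```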
Implementation Highlights
Now that you know how to run SWS, I want to turn to some of the implementation highlights so you can get an idea how SWS works under the covers. SWS is not exactly rocket science, but I found it quite fun to implement. The most interesting part of SWS is the _penter function that's automatically generated by the compiler when you use the /Gh switch.
Figure 5 shows the code for my _penter. As you can see from the code, it's naked and I generate my own prolog and epilog to get the start of the function. If you remember past Bugslayer columns, the reason I go naked is to make it easy to get the return address of the function. Fortunately, when the compiler says it will generate _penter before anything else, it means it! The following disassembly shows the effects of the /Gh switch. As you can see, the call to _penter comes even before the PUSH EBP standard function prolog.
If you daydream a little bit you can see that the /Gh switch might allow some other interesting utilities. The first one that pops into my mind is a performance tool. Unfortunately, since the compiler does not offer an epilog exit, you will have to do a little more work to keep everything straight. Maybe if we all ask Microsoft nicely, they will implement the epilog exit switch.
In the October issue, I discussed the design of the file DLL, SWSFILE.DLL, and how I approached the issue of making the individual runs fast. I thought I was all done with the file handling, but the day after I submitted that column, I realized I forgot something very important.
When generating the initial .SWS file, I was using the addresses as they came out of the module. The problem is: what happens if the module is relocated in memory? The SWSDLL.DLL runtime would be called with one address, but there would be no record of that address in any of the loaded modules' .SWS files. While everyone should always be rebasing their DLLs, sometimes people forget, and I wanted to make sure SWS didn't crater when that happened. Consequently, I had to go back and add the original load address into SWSFILE.DLL. In the runtime itself, I also had to add code to check whether a module was relocated, to keep everything kosher.
One area that did give me a little trouble was generating the symbols for the initial SWS module. Because of the way programs are linked and symbols are generated, many of the symbols reported in a module are not those that have _penter calls inserted in them. For example, if you link against the static C runtime, your module will have all sorts of C runtime functions added. Since the address lookup would be faster in the SWS runtime if there were fewer symbols, I looked at a few ways to minimize the numbers.
Figure 6 shows the symbol enumeration callback and how I started limiting the number of symbols. The first step I took was to check if the symbol had corresponding line information with it. Because I assume that functions that have _penter calls were properly compiled using the steps I specified earlier, I safely got rid of many extraneous symbols. The next test to eliminate symbols was to check if specific strings are part of the symbols. For example, any symbols that start with "_imp__" are imported functions from other DLLs. There are two other checks that I did not implement, but left as exercises for you dear readers. The first is that you should be able to flag symbols from specific files, which SWS should ignore. The main reason for implementing this feature is so that you can add all the C runtime source files to that list. The last symbol elimination trick ensures that the address in question only comes from a code section in the module. You might not need this last check, but it would ensure that only true code symbols are used.
One symbol problem that I had at runtime happened because the symbol engine does not return static functions. Being Mr. Contentious, if I did not find an address that came out of a module, I popped my usual six or seven assertion message boxes. At first I was a little confused that I was seeing the assertions, because one of my test programs did not have anything declared as static. When I popped up the stack in the debugger, I found I was looking at a symbol named something like $E127. There was a call to _penter in the function and everything looked good. It finally dawned on me that I was looking at a compiler-generated function, such as a copy constructor. While I would have really liked to keep the error checking in the code, I noticed that there were quite a few of those static/compiler-generated functions in WordPad, so all I could do was report the problem with a TRACE statement in debug builds.
The last interesting part of SWS is the tuning of a module. The code for the TuneModule function is large, so Figure 7 shows the algorithm. As you can see, I work to ensure that I pack each code page with as many functions as possible to eliminate padding. The interesting part is where I hunt down the best fitting function. I decided to try to pack as many functions with execution counts into the pages as possible. If I can't find a function with an execution count that fits, I will use a function that has no execution counts. My initial algorithm for fitting everything together worked great. However, it started crashing when tuning certain modules.
A little exploration revealed that I was getting into a situation where I had a page almost filled, but only had a function whose size was bigger than the page. That's right, a function size reported by the symbol engine was bigger than a memory page. When I looked more closely, I noticed that those huge functions only appeared when they were the last symbols in the code section. Evidently, the symbol engine treats everything after certain symbols as part of the symbol, so the size is wrong. In the tuning algorithm, you can see that if I get a symbol larger than the page size, the only thing I can do is punt and drop the symbol into the order file. That might not be the best solution, but it's a boundary condition that you shouldn't run into too often.
What's Next for SWS?
As it stands, SWS is good enough for your module slimming and svelting needs. If you are interested in SWS, here are a few cool things you might do in future versions:
Wrap-up
Even though the Working Set Tuner disappeared, with SWS there's no excuse for having extra fat in your working set. If you use SWS on your application, I'd be curious to know how much space you actually save in the end. Even though SWS was an ambitious utility and took two columns to cover, I think it was well worth it.
In my next column, I will start tackling debugging in Microsoft® .NET. While you might think that .NET is supposed to make all your bugs and problems go away, I've been playing with it and all it means is that there will be new and different debugging challenges for us to tackle. .NET will provide enough material for many cool debugging columns.
In this month's source code distribution, I have included an updated version of my BugslayerUtil.DLL. Included is a new option to write your assertions to the event log under Windows NT® 4.0 and Windows® 2000. If you are working on server applications that don't have UIs, getting the assertion output in a common place can make all the difference in the world. Also included is a bug fix in HookImportedFunctionsByName reported by Attila Szepesváry and Tim Tabor. Finally, thanks to Craig Ball for reporting that Crash Handler didn't report that an application crashed on a Visual C++ exception.
Da Tips!
Guess what? The holidays are almost upon us. If you don't send your tips to me at firstname.lastname@example.org, you might not be getting any presents for Hanukkah or Christmas!
Tip 39 Microsoft has released an interesting utility called PageHeap to help track memory corruption problems. You can read more about PageHeap in Knowledge Base article Q264471.
Tip 40 Don't you just hate it when you turn on memory leak detection in the C runtime and all of your leaks allocated by the new operator are reported as coming from CRTDBG.H instead of from where you actually allocated the memory? That drives me nuts! The problem is a bug in CRTDBG.H: the debug new operator is declared as an inline function. Since debug builds turn all inlining off, the new operator becomes just another function, so the __FILE__ macro expands to CRTDBG.H. Fortunately, I found a workaround. Make all of your precompiled headers look like the following:
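The original listing did not survive in this copy of the article, but a sketch of the usual form of this workaround, built from the standard CRT debug macros, looks like the following (assume this goes at the top of your precompiled header; the macro names are the documented CRT ones, while the exact listing may have differed):

```cpp
// Sketch of the precompiled-header workaround (e.g., in stdafx.h).
#define _CRTDBG_MAP_ALLOC
#include <crtdbg.h>
#ifdef _DEBUG
// Route operator new through the debug allocator so leak reports
// carry the caller's __FILE__/__LINE__, not CRTDBG.H's.
#define new new(_NORMAL_BLOCK, __FILE__, __LINE__)
#endif
```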
The one drawback to this approach is that you need to ensure that STL headers in particular are included only in your precompiled header file. If they are included after the precompiled header, you will get compilation errors. Additionally, if you have custom new operators for a class, you will also get errors. You will need to undefine new before declaring your class and restore the defines I just mentioned after it. Also, include a placement version of operator new in your class that matches the one in CRTDBG.H so you can get the source and line information.
| John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in programming in Windows and COM. He is the author of Debugging Applications (Microsoft Press, 2000). You can contact John at http://www.wintellect.com.|
From the December 2000 issue of MSDN Magazine