This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Happy 10th Anniversary, Windows
Matt Pietrek
As I write this, Microsoft® Windows® is approaching a major anniversary, one most people have overlooked. The release of Windows 3.0 in May 1990 marked the beginning of widespread Windows-based development, so, in many ways, May 2000 marks the 10-year anniversary of programming for Windows.
      In my day-to-day work, I often encounter people who never programmed for 16-bit Windows. I end up explaining that no matter how bad programmers think they have it now, programming for Windows used to be far more primitive. This month, using Dr. Evil's "time machine," I'll compare key aspects of programming Windows 3.0 in May 1990 to programming with Win32® today. In order to make this more than just a rehashing of old tales, I'm going to highlight some key architectural details that still affect programmers.

All About Modes

      To set the stage, it's important to first understand modes in Windows. Windows 3.0 was the high-water mark in terms of the number of operating modes available to users. Today, most users turn on their computer, wait a while, and Windows appears. In 1990, a user would first boot her PC to an MS-DOS® prompt. At the prompt, she'd then run the WIN.COM program to start Windows. Little did she know that the primary task of WIN.COM was to figure out what mode to run Windows in.
      Windows 3.0 had three separate modes it could be run in, depending on the type of processor installed in the PC. The lowliest of these was real mode, and only required an 8086/8088. In real mode, Windows 3.0 was not much more than a pretty GUI layer over MS-DOS. It was limited to using 640KB, plus whatever memory might be available via the Expanded Memory Specification (EMS). In addition, in real mode Windows utilized none of the processor features like read-only segments and paging. Programming for real mode compatibility was a pain, and mercifully Windows 3.1 did away with real mode.
      Next up the totem pole of modes was standard mode. Using standard mode required at least an 80286 processor. The big advantage of standard mode was that it used the CPU's protected mode. Today, we always run in protected mode of one sort or another, but back then it was revolutionary. Protected mode let the operating system assign attributes (such as read-only) to segments of memory, and the CPU would enforce those attributes. An attempt to write to a code segment marked as read-only caused a fault that the system would catch, and hopefully it would cleanly terminate the offending program.
      Protected mode also brought a big change in the way memory was referenced. Even in the hazy past of programming for Windows, addresses were 32 bits. Unfortunately, they weren't the simple 32-bit addresses you use today. Instead, a Win16 protected mode address consisted of a 16-bit selector and a 16-bit offset. Selectors were used to reference specific regions of memory, which were called segments.
      Through a hardware mechanism known as the Local Descriptor Table (LDT), every valid selector value mapped to a 24- or 32-bit base address, depending on the processor mode. In standard mode, base addresses were 24 bits. To access memory with a selector and offset, the CPU added the selector's base address and the 16-bit offset to form a linear address. Since the offset portion of an address was only 16 bits, the maximum amount of memory that could be read or written through a single selector was 64KB.
      In addition to a base address, each selector also had an associated length and attributes such as read-only or code. These were the primary means of memory protection. If you allocated a 6KB data segment and tried to read from an offset 7KB within it, you'd cause a processor fault. Likewise, if you tried to write to a segment with the code attribute, you'd fault. This, rather than paging, was how Windows 3.0 tried to protect programs from errant pointers.
      One interesting aspect of standard mode Windows was how it handled MS-DOS-based programs. Back then, the vast majority of existing programs were written for MS-DOS. While standard mode Windows supported MS-DOS-based programs, it did this by switching the processor between standard mode when running Windows-based code and real mode when running MS-DOS-based programs. Thus, if you hosed your PC while inside an MS-DOS-based program, you lost the entire Windows session. What's worse, when performing operations like file I/O, programs running under Windows had to switch to real mode so that real mode MS-DOS could execute the request.
      Standard mode liberated Windows from the constraints of the 640KB barrier. In 80286 protected mode, the CPU used 24 address lines, making the usable physical memory a maximum of 16MB. However, when running with less than 16MB of physical memory, standard mode could still use up to 16MB of address space by a mechanism known as swapping. In standard mode swapping, an entire memory segment (up to 64KB) could be copied out to a swap file, and the corresponding selector could be marked not-present. When the selector was subsequently referenced, the CPU generated a fault. The system handled this by finding memory elsewhere and copying the data from the swap file back into memory.
      Another nice feature available in standard mode was that the actual location of segments could move around in physical memory without the application needing to know where they were relocated. Prior to standard mode, Windows had a much more difficult job of moving code segments around in memory, and applications had to be written to be aware of this. Because of the selector address translation provided by the CPU in standard mode, Windows simply had to update the selector's base address in the LDT, and applications were none the wiser.
      Windows 95 killed off standard mode. This was possible because of the new enhanced mode introduced in Windows 3.0. Looking back at the past 10 years of programming for Windows, I can't think of anything as revolutionary as the addition of this mode. Enhanced mode Windows required at least an 80386, but the capabilities it wrung from the processor were phenomenal (at the time), and are the foundation of Windows today.
      Enhanced mode Windows rode squarely on the two main features introduced with the 80386: paging and virtual 8086 mode. Paging offered Windows the ability to use vast amounts of memory, even if there wasn't a corresponding amount of physical memory. This was done by using disk space to simulate real physical memory. Virtual 8086 mode enabled Windows to run MS-DOS-based programs in a processor simulation of an 8086 CPU. Ill-behaved MS-DOS-based programs were rarely able to bring down the entire system anymore.
      Interestingly, enhanced mode Windows was superior to standard mode primarily because of additions underneath the standard mode code. The layer of ring 0 code known as the Virtual Machine Manager (VMM), along with Virtual Device Drivers (VxDs), introduced the architecture that still powers Windows 98. This ring 0 code did many things, but the two most powerful were page-based memory management and virtual machines.
      Enhanced mode Windows implemented page-based memory management, but this fact was essentially invisible to applications and the ring 3 system components; the VMM simply made a bigger pool of memory for the system to allocate from. All user mode code still dealt with segments and was oblivious to the fact that physical memory might not be assigned to a segment at any particular point in time.
      Virtual machines allowed Windows-based applications and multiple MS-DOS prompts to run concurrently, with each believing that it had sole control of the keyboard, mouse, screen, and so forth. All Windows-based applications ran in a single virtual machine reserved for their use. This was made possible by the virtual 8086 mode support introduced with the 80386.
      As an interesting note, the ring 3 components of enhanced mode Windows were nearly identical to the standard mode components. Components such as window management (USER.EXE) and graphics (GDI.EXE) used the same code in both modes. The only significant difference between the two modes was in the kernel (KRNL286.EXE versus KRNL386.EXE), and in how MS-DOS-based programs were run.

Memory Management: Then and Now

      If you've only programmed with Win32, you've been spared the major annoyance of memory segments and the selectors used to reference them. Segments permeated nearly every aspect of Win16 programming, but I'll highlight only the biggest areas here.
      Many programmers aren't aware that segments are still used in Win32 today. This is only because x86 processors require the use of segments. Windows 9x and Windows 2000 make segments essentially invisible by creating and using just a few segments that have a limit of 4GB. At a minimum, these systems have four segments: matching code and data segments for user mode and kernel mode. The selector used for the stack is the same selector used for the data segment. All processes share the same small set of selectors. You can see their values by looking at the CS, DS, and SS registers in your debugger.
      The ringmaster of segments in Windows 3.0 was the global heap manager. The global heap managed up to 8192 different selectors and their associated segments. Allocations were made from the global heap in a number of ways, including direct allocations by programs, and by the OS when loading parts of an executable.
      The key thing to know about the global heap in Windows 3.0 is that the memory truly was global: any program that knew a valid selector could access that segment's memory, even if the memory had been allocated by another process. Put another way, programs running under Windows 3.0 all shared the same address space. This was sometimes a blessing and sometimes a curse.

Sharing Memory

      Since Win16 programs all played in the same address space, sharing memory between programs was extremely simple. In contrast, user mode Win32-based applications have no notion of implicitly shared memory accessible to all processes. To share dynamically allocated memory between processes, all programs wanting to see the memory must use memory mapped files. In Windows NT®, there's no guarantee that two processes will even see the shared memory at the same address. For Windows 9x, the story is more convoluted: basically, the entire region between 2GB and 3GB is accessible to all programs (although you're not supposed to rely on this fact in your programming).
      In Windows 9x, the system doesn't change the page table mappings for memory above 2GB when switching between Win32 processes. This region above 2GB is also where the Win16 global heap region gets its memory. The planned net effect of this architecture is that all Win16 programs can see one another for Win16 compatibility, and they can also see the current Win32 process.
      Having 16-bit code and the current 32-bit process in the same effective address context is what allows relatively quick thunking between 16 and 32-bit code. Figure 1 shows how 16 and 32-bit Windows-based programs coexist in the processor address space. It's not a stretch to say that the memory mapped file and shared memory area in Windows 9x is influenced by the Win16 architecture.
Figure 1 Programs Sharing Processor Address Space

      Cross-process sharing of a DLL's data area is another gulf between programs written in 1990 and today. Back then, all memory was implicitly visible to every program, so sharing was not an issue. In fact, programmers often went to elaborate lengths to make their DLLs store data on a per-process basis. Typically this involved some sort of code stub that mapped the current task ID to a selector assigned to the current task, and loaded the DS register with this selector. There was no explicit OS support for doing this, so various home-grown solutions sprang up.
      Today, the pain of per-process DLL data is gone. Win32 DLLs implicitly have their data area in per-process memory. Each DLL starts out with a fresh new copy of its data section when a new process loads the DLL. If you want to share data sections across processes, it's as easy as making a special section with the data to share, then telling the linker to give the section the SHARED attribute.
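      With Microsoft's compiler, the pattern looks roughly like the following sketch. The section name `.shared` and the variable are arbitrary choices for illustration, and note that shared-section variables generally must be initialized so the compiler actually places them in the named section rather than in uninitialized data.

```c
/* MSVC-style sketch: place a variable in a named section, then ask the
   linker to mark that section read/write/shared across processes. */
#pragma data_seg(".shared")
int g_hits = 0;          /* initialized, so it lands in .shared */
#pragma data_seg()
#pragma comment(linker, "/SECTION:.shared,RWS")
```

Every process that loads the DLL then sees the same `g_hits`, rather than a fresh per-process copy.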

More on the Global Heap Manager

      To allocate memory directly from the global heap, Windows 3.0-based programs called the GlobalAlloc API. GlobalAlloc still exists in Win32, but only to ease porting of Win16 code. Under the hood, GlobalAlloc is just a layer over the same heap manager used by HeapAlloc, which allocates memory visible only to the calling process. Thus, GlobalAlloc memory in Win32 is local to its allocating process, and not really global.
      When loading EXEs and DLLs, the system kernel also used the global heap manager. Since day one, 16-bit programs have broken their code and data areas into multiple segments stored in an executable. Each segment loaded from an executable resulted in an allocation from the global heap. There was a practical limit of 253 segments allowed in a given executable.
      Besides code and data segments, Win16 executable files also contained resources. Back then, each resource was loaded into its own segment allocated from the global heap. In Win32, LoadResource just looks up the address of a resource within an executable and lets normal demand paging take care of bringing the data into memory. In Win16, LoadResource had to find the resource in the file, allocate a segment of the appropriate size, and copy the resource data from the executable.
      When allocating memory from the global heap, much attention was paid to whether the memory was moveable (GMEM_MOVEABLE) or fixed at a specific address (GMEM_FIXED). You could also move allocated segments (the GlobalWire function), and affect when segments would be swapped (the GlobalLRUNewest function). Only a black belt programmer truly understood what all the permutations of attributes and API calls did. You can skip all this nonsense today since the Win32 heap managers don't need to mimic legacy heap behaviors.

Memory Models

      A particularly troublesome aspect of segmentation had to do with compiler memory models. On the x86 processor, loading a selector into a segment register (CS, DS, ES, SS, FS, or GS) is a relatively lengthy operation, so you want to avoid it if possible. Therefore, making a near call to another offset in the same segment is less expensive than making a far call to a different segment and offset. Rather than make all generated code capable of operating in the worst possible conditions, compiler vendors used memory models. (To be fair, memory models existed previously in programming for MS-DOS, but weren't nearly so esoteric.)
      If you were writing a fairly small program with less than 64KB of code and 64KB of data, you could use the small memory model. In the small model, the compiler generated code that assumed all calls were near (within the same segment). In addition, all global and local variables on the stack were in the same data segment, accessible by just a 16-bit offset. Of course, exceptions were made when calling outside of your code, such as a call to a DLL, where a far call was needed. Compiling in the small model was often a tricky task since all your data and the stack had to fit within 64KB.
      At the opposite end of the spectrum were the large and huge memory models. When compiling for these models, functions and global variables were referenced using far pointers, with the attendant slowdown in performance. Generally, every source file in your project caused its own distinct code segment to be placed in the executable. The effect here was to make calls to routines outside the current source file more expensive than calling routines in the same file.
      Win16 compilers also had medium and compact models, which fell between the small and large models. The medium model had multiple code segments and a single data segment, while the compact model was the inverse: one code segment and multiple data segments. It's fair to say that a large number of hair-pulling incidents stemmed from selecting the wrong memory model and trying to squeeze code and data into 64KB segments.
      Today, Win32-based programs use something akin to the small memory model; that is, all calls are made to offsets within the same segment, and all data resides in the same segment. The difference is that these offsets are 32 bits, and the segments are 4GB. The only time that a Win32 system changes segments is when it transitions between ring 0 and ring 3. Running Win16 programs under Win32 is an exception, of course. In a strange twist, Win64™-based programming on the IA64 architecture will involve something akin to Win16 data segments, although nowhere near as onerous. I'll have much more to say about this and Win64 in future columns.

The Process Model

      The biggest difference between the process model used by Windows 10 years ago and the one used today is that Win32-based applications are much more isolated from each other. This provides benefits beyond the protection of having your own address space. If a Win16 application failed to clean up and exit properly, it wasn't uncommon to leave orphaned DLLs and global heap blocks. Even worse were unfreed USER and GDI resources, which came from two 64KB heaps. These small heaps gave rise to the infamous free system resources problem, where available space kept dropping until the only recourse was to reboot.
      Today, a Windows-based application can do pretty much whatever evil or stupid things it wants and still not affect other processes. This is the theory at least. Windows 2000 does a very respectable job using paging to keep system data structures out of the way of a rogue process. Win16 applications running under Windows 2000 are isolated in their own sandbox (NTVDM.EXE), which is just another Win32 process as far as the OS is concerned.
      Windows 9x is a different story because of its evolution from Windows 3.0. The system does a fair job of protecting the main process address space up to 2GB. However, the region between 2GB and 3GB is problematic because it's shared across all processes. Key parts of the Win16 system code (that is, USER.EXE and GDI.EXE) are still used in Windows 98, and keep critical data such as the window manager heap in this region. A Dr. Evil-concocted program could bash memory in this area and potentially bring down the whole system. Fire the laser!
      The nice thing about Windows 2000 and Windows 9x is that they do a much better job of cleaning up resources from a program that crashes. DLLs are associated with a particular process, and the system cleans up when a process exits. DLL reference counting is essentially a thing of the past. Likewise, the USER and GDI portions of the Win32 systems associate their resources with a process and clean up after the process when it exits. In other words, Austin gets his mojo back!
      Even if today's Win32-based platforms didn't free up resources any better, you'd have a much harder time hitting a 64KB resource barrier. Windows 9x still uses 64KB heaps, but moves many of the larger items out into an extended 2MB region following the heap. USER and GDI resources for Windows 2000 are bound only by the virtual memory available. I am often asked, "How do I find the free system resources on Windows NT?" The answer is that Windows NT doesn't have a way to tell you because USER and GDI system resources aren't really an issue.
      One of the few nice things about the Windows 3.0 process model was that countless programmers didn't spend a lot of time figuring out how to get their code loaded into every process. For a variety of reasons, programmers often want to know about all file I/O calls by every program, or see every call to create a new process, or all text written to the screen, or... you name it. In Win16, this wasn't so hard because once a DLL was loaded, it could patch, intercept, or otherwise tinker with the code for any or all running programs. This is the upside to a single shared address space. James Finnegan exploited this when he wrote the classic ProcHook article in the January 1994 issue of Microsoft Systems Journal, which showed how to globally hook APIs in Win16.
      Win32 provides a much higher hurdle to tinkering outside your process. DLLs only load into the processes that request them. If you want your DLL to poke around at another process (for instance, to see all messages sent to the Explorer process's windows), you need to somehow get your DLL injected into the desired processes. There are several ways to do this, including Windows hooks, using a debug loop, and CreateRemoteThread. Each method has its strengths and weaknesses, but that's a column in its own right. My December 1996 MSJ column contains more information on DLL injection techniques.
      As a final note on processes, it's hard to believe that 10 years ago, the only kind of processes available were GUI-based. Every Windows 3.0-based program typically had some sort of boilerplate code to register a window class, create a window, and pump messages. Console mode processes and services for Windows didn't arrive until Windows NT 3.1 in 1993.

Threading and Synchronization

      From one perspective, threading 10 years ago was really simple. Each task had a single thread, and that was it. In practice, things were a lot messier. Windows 3.0 used cooperative multitasking. In this setup, the current task got the CPU until it gave up control, usually by calling GetMessage or some other API that wrapped a GetMessage loop. An example is the DialogBox API.
      The problem with cooperative multitasking was that programs often weren't that cooperative. A program might get a message that it handled by querying a database on another machine. If the database took 30 seconds to respond, no other programs got any CPU time for those 30 seconds. A lot of user time was wasted trying to determine whether Windows 3.0 was hung or merely taking a long time to do something.
      To smooth out these problems, programmers often resorted to using a timer and breaking lengthy tasks into small chunks. Whenever the program's message loop received a timer message, it would do the next small chunk of work, then return to processing messages. A related problem was waiting for a spawned task to exit. A Win16 program would create a secondary message loop and process messages while waiting for the spawned task to exit. Afterward, the program would exit its secondary message loop and return to the primary message loop.
      In Win32, real preemptive threading eliminates the pain of having to constantly pump messages. Preemptive means the system can steal the CPU from your thread at any time. In addition, a thread that's waiting for something won't use any CPU time while it's blocked. This lets the system devote the CPU to threads that can actually use it.
      The preemptive threading of Win32 makes programming easier and more difficult. The easy part is that when you have work that takes a long time to complete, you can create another thread to do the work without holding up your main thread. Gone are the days of trying to juggle multiple logical work items with a single thread and window messages.
      On the other hand, the preemption of Win32 brings with it the need for synchronization objects: critical sections, mutexes, events, and semaphores. Synchronization can be wonderful to use, or a nightmare. To wait for a spawned process to exit, just call WaitForSingleObject, passing it the process handle of the spawned process. The WaitForSingleObject call won't return until the other process has terminated. What could be easier?
      Synchronization nightmares are typically deadlocks. Trying to find which set of threads are blocked (each waiting for another thread to release a synchronization object) can be horrifically painful. There are tools such as John Robbins's DeadlockDetection from the October 1998 issue of MSJ that can help, but there's no substitute for truly understanding what you're getting into when you use multiple threads. I won't say that I prefer Windows 3.0 and message pumps, but it sure was easier to debug a hung application.

DLLs and Module Management

      One of the bigger problem areas in Win16 that came about as a result of a single address space was DLL reference counts. In Win16, once a DLL was loaded by one task, any task could use it. Each DLL had a systemwide reference count, and only when it dropped to zero would the system remove it from memory. A program that loaded a DLL with LoadLibrary and didn't call FreeLibrary later would leave the DLL in memory even after the program exited.
      As mentioned earlier, crashing Win16 programs would orphan all their dynamically loaded DLLs in memory, a situation that required rebooting or manually removing them with a tool like NukeDLL from my May 1994 MSJ column. In Win32, DLL management is much saner. Every process has its own list of DLLs and associated reference counts. A process can only see a DLL that it has loaded, either by implicitly linking to it or by calling LoadLibrary. DLLs can't be orphaned in Win32 since without an address space to live in, a DLL can't be mapped into memory.
      Another Win16 DLL pain banished in Win32 involved initialization notifications. LibMain and WEP were two Win16 DLL routines called when the DLL was loaded and unloaded, respectively. Unfortunately, LibMain was only called when the DLL first loaded. There was no Win32 DLL_PROCESS_ATTACH-style notification, so a DLL wasn't informed when a second process started using it. Likewise, the WEP routine was only called when the DLL unloaded, rather than when each referencing task terminated.
      A problem that plagued Win16 programmers and still hits people today is cross-module memory allocations. In Win16, each DLL implicitly received its own local heap with a maximum size of 64KB. Win32 DLLs don't automatically get their own heap, but most compiler-supplied runtime libraries create one for each DLL.
      A common bug caused by DLL heaps is when memory allocated in one DLL is freed in another. Since a given heap doesn't know about any other heaps, it can't very well free memory allocated from a different heap. It's easy to bite yourself here when using C++. For example, I personally have overloaded operator new for a class and forgotten to overload operator delete to release the memory from the appropriate heap.
      While on the subject of modules, two common questions are, "How can I tell if I'm the second instance of a program running, and why is the hPrevInstance parameter to WinMain always 0?" There's a story here. In Win16, a second copy of a running program could be started by just making a fresh copy of the program's data segment as stored in the EXE (this is a bit simplified, but close). The data segment associated with a given task was known as its instance.
      Commonly, the two running copies of a program wanted to communicate some data between themselves. By passing the first program's instance handle (that is, its data selector) to the second copy, the second copy could read memory directly out of the first program's data segment. Does this make you shiver? Another reason for passing the hPrevInstance value was because window classes were global. A second copy of a running program could skip calling RegisterClass since the first copy had already done so. Checking the hPrevInstance parameter was an easy way to see if you were running as a second instance.
      Thankfully, Win32 is much more consistent about isolating processes from each other. Setting hPrevInstance to NULL allowed Win16 code to be ported to Win32 without changing its RegisterClass logic. Any Win32-based code that looked at hPrevInstance could only assume it was a first instance. The minor hassle is that Win32-based programs that need to check for other instances of themselves running need to add a bit of code. A typical method is to create a named object with a predefined name at program startup. If the object already exists, you can infer that another instance of your code is running.

Primitive Conditions

      While thinking about how much Windows has evolved in the past 10 years, I was really struck by how many absolutely fundamental Windows-based technologies didn't exist back then. I won't list them all, but a few are worth mentioning.
      For starters, there was no structured exception handling provided by the operating system. No __try and __except blocks. IsBadReadPtr and IsBadWritePtr didn't make an appearance until Windows 3.1 in 1992. Even then, Windows didn't have anything like the frame-based structured exception handling described in my February 1997 MSJ article. Later, some Win16 compilers added some structured exception handling support, but they had no help from the operating system.
      Also missing 10 years ago was the registry. Instead, information that needed to be persisted across program runs was stored in INI files. You can still use INI files today, although few programmers do. The advantage of INI files was that they were text-based and easily edited. The disadvantage was that they littered your hard drive with hundreds of files, and the information couldn't be stored in a hierarchy the way the registry stores it.
      Without the registry, it's hard to imagine another key technology: COM. Back in 1990, there was no COM or OLE. The closest thing was Dynamic Data Exchange (DDE). When OLE was introduced, it was touted as being a far better replacement for DDE, which was incredibly arcane and complex. Of course, some wags might say the same about COM today, but we won't get into that particular debate.
      Finally, when Windows 3.0 was introduced, C++ was still very new. Microsoft didn't yet have a C++ compiler. There are two obvious implications to this. First, strange as it may seem, there was a time when my friend Paul DiLascia didn't write a C++ column for MSJ or MSDN™ Magazine. The other implication is that there was a Windows-based programming world before MFC. I'll let you decide which is harder to conceive of.

Follow up to February 2000

      In my Under the Hood column in the February 2000 issue of MSJ, I discussed the measurable effects of basing and binding on program load time. Technical reviewer Jonathan Russ of Microsoft had some great comments. First, my timing code was a little complex and had some overhead that wasn't easily filtered out. If I had only been concerned with running under Windows 2000, I could have used the GetProcessTimes API. While I was shooting for cross-platform capability, I have to confess that I forgot about this API.
      The second issue was with the negative effects of DLL address collisions under Windows 2000. Here's what Jonathan had to say:
First, the app takes a time hit because of the address fix-ups and the fact that the entire DLL image must be copied to the pagefile. Second, the system takes a virtual memory hit because the pagefile must house this copy of the DLL. I have handled numerous issues in which a developer was not properly basing his huge DLLs and then wondering 1) why it takes so long for the app to load; 2) why it takes just as long for a second or third instance to load; and 3) why the app consumes so much memory.
      Well said, and thanks for the feedback, Jonathan.

Matt Pietrek does advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site, at http://www.wheaty.net/, has a FAQ page and information on previous columns and articles.

From the July 2000 issue of MSDN Magazine.
