Inside SQL Server 2000's Memory Management Facilities
This column is excerpted from The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003) by Ken Henderson. Used by permission. All rights reserved.
Summary: Ken Henderson explores the internals of SQL Server memory management from a developer's perspective. (10 printed pages)
In this column, we'll explore the internals of SQL Server memory management from a developer's perspective. That is, we'll discuss the way the server manages memory in terms of the APIs and operating system facilities it makes use of and how they work. Exploring a product in this fashion helps us get inside the heads of the developers who built it and helps us understand how they intended it to be used. Understanding how a product works internally and its intended use is key to mastering it.
We'll begin our investigation by first covering some basic Windows® memory management fundamentals. Like all 32-bit Windows applications, SQL Server makes use of Windows memory management facilities to allocate, free, and generally manage memory resources. It calls Win32® memory management API functions to interact with the memory resources provided by the operating system in the same way that any Windows application does.
Because nearly all memory allocations within SQL Server make use of virtual memory (rather than heaps), most allocation code eventually resolves to a call to the Win32 VirtualAlloc or VirtualFree API function. The server calls VirtualAlloc to reserve and commit virtual memory and VirtualFree to free virtual memory.
Virtual Memory vs. Physical Memory
On the x86 family of processors, Windows provides all processes a 4GB virtual memory sandbox in which to play. By "virtual," I mean that the memory isn't memory in the traditional sense. It is merely a range of addresses with no physical storage implicitly associated with it. As a process makes memory allocations, these addresses are used and physical storage is associated with them. However, this physical storage is not necessarily (and not usually) physical memory. It is usually disk drive space. Specifically, it is space in the system paging file(s). That's how multiple applications can run on a system with 128MB of memory, each with a virtual address space of 4GB—it's not real memory, but it seems like it to the application. Windows transparently handles copying data to and from the paging file so that the application can allocate more memory than physically exists in the machine and so that multiple applications can have equal access to the machine's physical RAM.
This 4GB address space is divided into two partitions: the user mode partition and the kernel mode partition. By default, each of these is sized at 2GB, though you can change this through BOOT.INI switches on the Windows NT® family of the operating system (OS). (Windows NT, Windows 2000, Windows XP, and Windows Server™ 2003 are members of the Windows NT family; Windows 9x and Windows ME are not.)
Figure 1. Windows partitions a process's virtual address space into the user mode (application) and kernel mode (operating system) partitions.
Although each process receives its own virtual memory address space, OS code and device driver code share a single private address space. Each virtual memory page is associated with a particular processor mode. In order for the page to be accessed, the processor must be in the required mode. This means that user applications cannot access kernel mode virtual memory directly; the system must switch into kernel mode in order for kernel mode memory to be accessible.
Application Memory Tuning
The /3GB boot option (available on the Advanced Server and Data Center editions of Windows 2000 and later) permits these default sizes to be changed. It allows a process's user mode address space to be expanded from 2GB to 3GB at the expense of the kernel mode address space (which is reduced from 2GB to 1GB). In Windows parlance, this facility is known as application memory tuning or 4GB tuning (4GT). You enable application memory tuning by adding /3GB to the appropriate line in the [operating systems] section of your BOOT.INI. It's common for people to configure their systems to be bootable with and without /3GB by setting up the entries in the [operating systems] section of BOOT.INI such that they can choose either option at startup.
Warning You can also boot Windows 2000 Professional and Windows 2000 Server with the /3GB switch. However, this has the negative consequence of reducing kernel mode space to 1GB without increasing user mode space. In other words, you gain nothing for the kernel mode space you give up.
Note Windows XP and Windows Server 2003 introduced a new boot option, /USERVA, to be used in conjunction with /3GB that allows a finer degree of control than /3GB alone. You add /USERVA to your BOOT.INI just as you would /3GB. The advantage of /USERVA over /3GB alone is that it permits you to specify exactly how much address space to set aside for user mode access. For example, /USERVA=2560 configures 2.5GB for user mode space and leaves the remaining 1.5GB for the kernel. The caveats that apply to the /3GB switch alone apply here as well.
Before support for /3GB was added to Windows, an application could never access a pointer with the high bit set. Only addresses that could be represented by the first 31 bits of a 32-bit pointer could be accessed by user mode applications. This left 1 bit unused, so some developers, being the clever coders they were and not wanting to waste so much as a bit in the process address space, made use of it for other purposes (e.g., to flag a pointer as referencing a particular type of application-specific allocation). This caused a conundrum when /3GB was introduced because these types of applications would not be able to easily distinguish a legitimate pointer that happened to reference memory above the 2GB boundary from a pointer that referenced memory below 2GB but had its high bit set for other reasons. Basically, booting a machine with /3GB could have broken such applications. To deal with this, Microsoft added support for a new bit flag in the Characteristics field of the Win32 Portable Executable (PE) file format (the format that defines the layout of executable files—EXEs and DLLs—under Windows) that indicates whether an application is large-address aware. When this flag (IMAGE_FILE_LARGE_ADDRESS_AWARE) is enabled, bit 32 in the Characteristics field in an executable file's header will be set. By having this flag set in its executable header, an application indicates to Windows that it can correctly handle pointers with the high bit set—that it doesn't do anything exotic with this bit. When this flag is set and the appropriate version of Windows has been booted with the /3GB option, the system will provide the process with an expanded private user mode address space. You can check whether an executable has this flag enabled using utilities such as DumpBin and ImageCfg that can dump the header of an executable file. Visual C++® exposes IMAGE_FILE_LARGE_ADDRESS_AWARE via its /LARGEADDRESSAWARE linker switch. SQL Server has this flag enabled, so if you boot with the /3GB switch on the appropriate version of Windows, the system will expand SQL Server's user-mode address space.
Note Windows checks IMAGE_FILE_LARGE_ADDRESS_AWARE flag at process startup and ignores it for DLLs. DLL codes must always behave appropriately when presented with a pointer whose high bit is set.
Physical Address Extension
Intel processors, starting with the Pentium Pro and later, include support for a memory-mapping model called Physical Address Extension (PAE). PAE can provide access for up to 64GB of physical memory. In PAE mode, the memory management unit (MMU) still implements page directory entries (PDEs) and page table entries (PTEs), but a new level exists above them: the page directory pointer table. Also, in PAE mode, PDEs and PTEs are 64 bits wide (rather than the standard 32 bits.) The system can address more memory than the standard translation because PDEs and PTEs are twice their standard width, not because of the page directory pointer table. The page directory pointer table is needed to manage these high-capacity tables and the indexes into them. A special version of the Windows kernel is required to use PAE mode. This kernel ships with every version of Windows 2000 and later and resides in Ntkrnlpa.exe for uniprocessor machines and in Ntkrnlpamp for multiprocessor machines. You enable PAE use by adding the /PAE switch to your BOOT.INI file, just as you might add /3GB or /USERVA.
Address Windowing Extensions
The Address Windowing Extensions (AWE) facility in Windows exists to allow applications to access more than 4GB of physical memory. A 32-bit pointer is an integer that is limited to storing values of 0xFFFFFFFF or less—that is, to references within a linear 4GB memory address space. AWE allows an application to circumvent this limitation and access all the memory supported by the operating system.
At a conceptual level, AWE is nothing new—operating systems and applications have been using similar mechanisms to get around pointer limitations practically since the dawn of computers. For example, back in the DOS days, 32-bit extenders (e.g., Phar Lap, Plink, and others) were commonly used to allow 16-bit applications to access memory outside their normal address space. Special-purpose managers and APIs for extended and expanded memory were common; you may even remember products such as the Quarterdeck QEMM-386 product, which was commonly used for this sort of thing way back when. Typically, mechanisms that allow a pointer to access memory at locations beyond its direct reach (i.e., at addresses too large to store in the pointer itself) pull off their magic by providing a window or region within the accessible address space that is used to transfer memory to and from the inaccessible region. This is how AWE works: You provide a region in the process address space—a window—to serve as a kind of staging area for transfers to and from memory that would otherwise be inaccessible to user mode code.
In order to use AWE, an application:
- Allocates the physical memory to be accessed using the Win32 AllocateUserPhysicalPages API function. This function requires that the caller have the Lock Pages in Memory permission.
- Creates a region in the process address space to serve as a window for mapping views of this physical memory using the VirtualAlloc API function.
- Maps a view of the physical memory into the virtual memory window using the MapUserPhysicalPages or MapUserPhysicalPagesScatter Win32 API functions.
While AWE exists on all editions of Windows 2000 and later and can be used even on systems with less than 2GB of physical RAM, it's most typically used on systems with 2GB or more of memory because it's the only way a 32-bit process can access memory beyond 3GB. If you enable AWE support in SQL Server on a system with less than 3GB of physical memory, the system ignores the option and uses conventional virtual memory management instead. One interesting characteristic of AWE memory is that it is never swapped to disk. You'll notice that the AWE-specific API routines refer to the memory they access as physical memory. This is exactly what AWE memory is: physical memory that is never swapped to or from the system paging file.
The virtual memory window used to buffer the physical memory provided by AWE requires read-write access. Hence, the only protection attribute that can be passed into VirtualAlloc when you set up this window is PAGE_READWRITE. Not surprisingly, this also means that you can't use VirtualProtect to protect pages within this region from modification or access.
Note None of the tools you typically use to inspect application memory use (Task Manager, Perfmon/Sysmon, etc.) show the amount of AWE memory used by individual processes. There's no indication of the amount of AWE memory used by each process, nor is this memory included in the working set size reported for a given process.
/3GB vs. AWE
The ability to increase the private process address space by as much as 50 percent via application memory tuning is certainly a handy and welcome enhancement to Windows memory management facilities; however, the Windows AWE facility is far more flexible and scalable. As I said earlier, when you increase the private process address space by a gigabyte, that gigabyte comes from the kernel mode address space, which shrinks from 2GB to 1GB. Since the kernel mode code is already cramped for space even when it has the full 2GB to work with, shrinking this space means that certain internal kernel structures must also shrink. Chief among these is the table Windows uses to manage the physical memory in the machine. When you shrink the kernel mode partition to 1GB, you limit the size of this table such that it can manage a maximum of only 16GB of physical memory. For example, if you're running under Windows 2000 Data Center on a machine with 64GB of physical memory and you boot with the /3GB option, you'll be able to access only 25 percent of the machine's RAM—the remaining 48GB will not be usable by the operating system or applications. AWE also allows you to access far more memory than /3GB does. Obviously, you get just one additional gigabyte of private process space via /3GB. This additional space is made available to applications that are large-address aware automatically and transparently, but it is limited to just 1GB. AWE, by contrast, can make the entirety of the physical RAM that's available to the operating system available to an application provided the application has been coded to make use of the AWE Win32 API functions. So, while AWE is more trouble to use and access, it's far more flexible and open-ended.
This isn't to say that there aren't situations where /3GB is preferable to AWE—there certainly are. For example, if you need more space for memory allocations that cannot reside in AWE memory (thread stacks, lock memory, procedure plans), you may find that /3GB is a better fit.
SQL Server organizes the memory it allocates into two distinct regions: the BPool (buffer pool) and MemToLeave (memory to leave) regions. If you make use of AWE memory, there's actually a third region: the physical memory above 3GB made available by the Windows AWE support.
The BPool is the preeminent region of the three. It is the primary allocation pool in SQL Server, serves primarily as a data and index page cache, and is also used for memory allocations less than 8KB. MemToLeave consists of the virtual memory space within the user mode address space that is not used by the BPool. The AWE memory above 3GB functions as an extension of the BPool and provides additional space for caching data and index pages.
When you start SQL Server, the BPool's upper limit is computed based on the physical memory in the machine and the size of the user mode address space. Once this size is computed, the MemToLeave region is set aside (reserved) so that it will not be fragmented by the BPool reservations that are to follow. The BPool region is then set aside, using as many as 32 separate reservations in order to work around the DLLs and other allocations that may already be taking up virtual address space within the SQL Server process by the time the BPool is reserved. Once the BPool is reserved, the MemToLeave region is released. It is used for internal SQL Server allocations that exceed 8KB of contiguous space and for allocations made by external consumers (i.e., memory consumers outside of the main server engine hosted within the SQL Server process) such as OLE DB providers, in-process COM objects, and the like.
So, once SQL Server has started, the BPool has been reserved, but not committed, and the MemToLeave region is essentially free space within the virtual memory address space of the process. If you view the Virtual Bytes Perfmon counter for the SQL Server process just after SQL Server has started, you'll see that it reflects the BPool reservation. I've seen people become alarmed because this number is often so high—after all, it usually reflects either the total physical memory in the machine or the maximum user mode address space minus the size of the MemToLeave region. This is nothing to worry about, however, because it is only reserved, not committed, space. As I said earlier, reserved space is just address space—it does not have physical storage behind it until it is committed. Over time, the amount of memory committed to the BPool will increase until it reaches the upper limit computed when the server was originally started.
Monitoring SQL Server Virtual Memory Use
You can track the computed maximum size of the BPool via the SQL Server:Buffer Manager\Target Pages Perfmon counter. As different parts of the server need memory, the BPool commits the 8KB pages it originally reserved until this committed size reaches the computed target. You can track the BPool's use of committed virtual memory via the SQL Server:Buffer Manager\Total Pages Perfmon counter. You can track the server's overall use of committed virtual memory via the Private Bytes counter for the SQL Server process.
Because most of SQL Server's virtual memory usage comes from the BPool, these two counters will, generally speaking, increase or level off in tandem (keep in mind that, when AWE support has been enabled, the Private Bytes counter won't reflect the entirety of SQL Server's memory use). If the Total Pages counter levels off but the Private Bytes counter continues to climb, this usually indicates continued allocations from the MemToLeave region. These allocations could be completely normal—for example, they could be allocations related to thread stacks as additional worker threads are created within the server—or they could indicate a leak by an external consumer such as an in-process COM object or an xproc. If the process runs out of virtual memory address space because the MemToLeave region is exhausted due to a leak or overconsumption (or if the maximum free block within the MemToLeave region falls below the default thread stack size of .5MB), the server will be unable to create new worker threads, even if the sp_configure max worker threads value has not been reached. In this situation, if the server needs to create a new worker thread in order to carry out a work request—for example, to process a new connection coming into the server—this work request will be delayed until the server can create the thread or another worker becomes available for it to use. This can prevent a user from connecting to the server because the connection may time out before a sufficient amount of MemToLeave space is freed or another worker becomes available to process the connection request.
A memory consumer within the server initiates a memory allocation by first creating a memory object to manage the request. When the object allocates the request, it calls on the appropriate memory manager within the server to fulfill the request from either the BPool or the MemToLeave region. For requests of less than 8KB, the request is usually filled using memory from the BPool. For requests of 8KB or more of contiguous space, the request is usually filled using memory from the MemToLeave region. Because a single memory object may be used to carry out multiple allocations, it's possible for an allocation that is well below 8KB in size (including its management overhead) to be allocated from the MemToLeave region. Consumers of memory within the SQL Server process space are usually internal consumers, that is, consumers or objects within the SQL Server code itself that need memory to carry out a task, but they do not have to be. They can also be external consumers, as I've said. Usually, these external consumers use normal Win32 memory API functions to allocate and manage memory and, therefore, allocate space from the MemToLeave space since it is the only region within the SQL Server process that appears to be available. However, xprocs are a special exception. When an xproc calls the Open Data Services srv_alloc API function, it is treated just like any other consumer within the server. Generally speaking, srv_alloc requests for less than 8KB of memory are allocated from the BPool. Larger allocations come from the MemToLeave space.
The Memory Manager
As the server runs, the memory manager checks to make sure that a given amount of physical memory remains available on the server so that Windows and other applications on the server continue to run smoothly. This amount can vary between 4MB and 10MB (it trends closer to 10MB on Windows Server 2003) and is based on the system load and the page life expectancy for the BPool. If the available physical memory on the server begins to dip below this threshold, the server decommits BPool pages in order to shrink its physical storage usage (assuming that dynamic memory configuration is enabled). The memory manager also ensures that a given number of pages remain free at any given point in time so that, as new allocation requests come in, they will not have to wait for memory to be allocated. By "free," I mean that the page is committed but not used. Unused committed BPool pages are tracked via a free list. As pages are used from this list, the memory manager commits more pages from the BPool reservation until the entire reservation has been committed. You will see the Process:Private Bytes Perfmon counter increase gradually (and usually linearly) due to this activity.
There is a separate free list for each CPU on the system. When a free page is needed to satisfy an allocation request, the free list associated with the current CPU requesting the allocation is checked first, followed by the lists for the other CPUs on the system. This is done to improve scalability by making better use of the local cache for each processor on a multiprocessor system. You can monitor a specific BPool partition via the SQL Server:Buffer Partition Perfmon object. You can monitor the free list for all partitions via the SQL Server:Buffer Manager\Free Pages Perfmon counter.
So, throughout the time that it runs, SQL Server's memory manager process (whether run from the memory manager thread or via another server thread) monitors the memory status of the system to be sure that a reasonable amount of free physical memory remains available to the rest of the system and that a healthy number of free pages remains available for use by new memory allocation requests. Some aspects of this change by necessity when AWE memory is used by the server. The BPool begins by acquiring and locking physical memory on the machine. The amount of memory it locks varies based on whether maximum server memory has been set. If it has, the BPool attempts to lock the amount specified by maximum server memory. If it has not, the BPool locks all of the physical memory on the machine except for approximately 128MB, which it leaves available for other processes. The BPool then uses the physical memory above 3GB (the AWE memory) as a kind of paging file for data and index pages. It maps physical pages from this region into the virtual memory address space as necessary so they can be referenced via 32-bit pointers.
So there you have it: SQL Server memory management in a nutshell. Understanding how an application allocates and manages memory is essential to understanding how the application itself works. Memory is such an important resource and its efficient use is such an integral element of sound application design that understanding how an application manages memory gives you great insight into the overall design of the application.
So how does all this affect you as a developer? Knowing how the server manages memory gives you insight into how to write applications that use it efficiently and helps you troubleshoot memory-related problems with the server. Say, for example, that, in an attempt to speed up client connections, you increase SQL Server's default network packet size to 8KB. Shortly thereafter, the server begins writing messages to the error log indicating problems reserving virtual memory in the MemToLeave region. You immediately know on seeing this that your change is probably at least partially to blame because you know that allocations of 8KB or larger come from the MemToLeave region. The buffers associated with servicing SQL Server connections will come from this area because you have set the network packet size so high. Given that the default network packet size for the .NET Framework's SQLClient provider is 8KB, this situation isn't as hypothetical as it may sound. In fact, it's not uncommon at all to see issues related to a shortage of space in the MemToLeave region caused, at least in part, by a network packet size that is too large.
Also, knowing the way in which SQL Server partitions memory can help you understand whether your custom code running within the SQL Server process is competing with critical system resources, such as the data cache, for memory. Say, for example, that you create an extended procedure that calls srv_alloc() to allocate memory. Let's assume, for the moment, that the buffers allocated by your code are smaller than 8KB in size. Based on our discussion above, we know that your extended procedure will allocate memory from the BPool—memory that instead could be used by the data cache.
Knowing how SQL Server manages memory can also help you size a system to begin with—it can help you compute the amount of physical memory you need and how to allocate and partition it for SQL Server. For example, having a good understanding of the relative strengths and weaknesses of using AWE vs. application memory tuning can help you determine whether you need more than 3GB of physical RAM in your server and can help you decide whether to enable the /3GB boot switch.
The design and implementation of SQL Server's memory management facilities has a major impact on the design and implementation of applications based on SQL Server. Knowing how the server does things will help guide how you design, build, and test your own SQL Server-based applications.
SQL Server for Developers
Ken Henderson is a husband and father living in suburban Dallas, Texas. He is the author of eight books on a variety of technology-related topics, including the recently released The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003). An avid Dallas Mavericks fan, Ken spends his spare time watching his kids grow up, playing sports, and gardening.