The Memory Manager in Windows Server 2003 and Windows Vista
Landy Wang, Software Design Engineer, Windows Kernel Team, Microsoft Corporation
Outline
Memory Manager (MM) improvements in Windows Server 2003 SP1
64-bit Windows features and enhancements
NUMA and large page support added
Performance enhancements
Support for No Execute (NX) capability
Large page support added for user images and pagefile-backed sections
Large pages now also used on 32-bit systems, even when booted with the /3GB switch, for:
Kernel Page Frame Number (PFN) database
Initial non-paged pool
Pool tagging paths parallelized
Introduced shared acquire mode for spinlocks, employed for tag table updates
Hash table for tagging large pages is now expanded as soon as searches are detected, instead of waiting for the table to be entirely filled
Overlapped asynchronous flushing for user requests to maximize I/O throughput
Pagefiles zeroed in parallel instead of serially
Faster shutdown when "zero my pagefile" is set
Per-process working set lock used to synchronize PTE updates and working set list changes to an address space (system, session or process)
This lock was converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
In conjunction with 2-byte interlocked operations, this allows parallelization of many operations
MmProbeAndLockPages
PFN lock acquire completely removed from this very hot routine
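The difference between the two acquire modes can be illustrated with a small user-mode sketch. This is not how a kernel pushlock is implemented; the class name `SharedExclusiveLock` and its methods are hypothetical, and a condition variable stands in for the real wait mechanism. It only shows why shared mode lets several readers of an address space proceed at once while an exclusive holder still gets the lock to itself.

```python
import threading


class SharedExclusiveLock:
    """Toy lock with shared and exclusive acquire modes (illustration only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared_holders = 0      # number of current shared owners
        self._exclusive_held = False  # True while one exclusive owner exists

    def acquire_shared(self):
        # Shared acquirers only wait for an exclusive owner, never for each other.
        with self._cond:
            while self._exclusive_held:
                self._cond.wait()
            self._shared_holders += 1

    def release_shared(self):
        with self._cond:
            self._shared_holders -= 1
            if self._shared_holders == 0:
                self._cond.notify_all()

    def try_acquire_exclusive(self):
        # Non-blocking attempt: succeeds only when nobody holds the lock.
        with self._cond:
            if self._exclusive_held or self._shared_holders:
                return False
            self._exclusive_held = True
            return True

    def acquire_exclusive(self):
        # Exclusive acquirers wait for all shared and exclusive owners to leave.
        with self._cond:
            while self._exclusive_held or self._shared_holders:
                self._cond.wait()
            self._exclusive_held = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive_held = False
            self._cond.notify_all()
```

A mutex corresponds to using only the exclusive methods. Note that this sketch makes no fairness guarantee: a steady stream of shared acquirers could starve an exclusive waiter.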
Major PFN lock reduction to improve scalability
Reducing time held
Replacing acquisitions with lock-free or alternative lock sequences in many places and APIs
Support for no execute (NX) capability
New Win32 SetThreadStackGuarantee API
Allows user applications and the CLR to specify guaranteed stack space requirements
Requirements honored even in low resource scenarios
Support for hot-patching a running system
Patch system without reboot to reduce downtime
Backported to Windows XP SP2
Windows Vista - Dynamic Address Space
System virtual address (VA) space allocated on-demand
Instead of at boot time, based on registry and configuration information
Region sizes bounded only by VA limitations
Applies to non-paged pool, paged pool, session space, mapped views, etc.
Kernel page tables allocated on demand
No longer preallocated at system boot, which saves:
1.5MB on x86 systems
3MB on PAE systems
16MB to 2.5GB on 64-bit machines
Key Benefits of Dynamic Address Space
No registry editing and reboots to reconfigure systems due to resource imbalances
Maximum resources available in a wide range of scenarios, with no human intervention:
Desktop heap exhaustion
Terminal Server maximum scaling
Large video clients
/3GB SQL and Exchange machines
HTTP servers, NFS servers, etc.
Windows Vista - Planned Enhancements for NUMA, Large System, Large Page Support
Initial nonpaged pool now NUMA aware, with separate VA ranges for each node
Per-node look-asides for full pages
Page table allocation for system PTEs, the system cache, etc. distributed across nodes:
More even locality
Avoids exhausting free pages from the boot node
NUMA-related APIs for device drivers:
MmAllocateContiguousMemorySpecifyCacheNode
MmAllocatePagesForMdlEx
Default when no node is specified has been changed from the current processor to the thread's ideal processor
Windows Vista - Planned Enhancements for NUMA, Large System, Large Page Support
Win32® APIs that specify nodes for allocations and mapped views on a per-VAD and per-section basis:
VirtualAllocExNuma
CreateFileMappingExNuma
MapViewOfFileExNuma
Scalable query:
QueryWorkingSetEx
Higher performance for very physically sparse machines
Example: Hewlett-Packard Superdome
1TB gaps between chunks of physical memory
PFN database & initial nonpaged pool always mapped with large pages regardless of physical memory sparseness
Windows Vista - Planned Enhancements for NUMA, Large System, Large Page Support
/3GB mode on 32-bit systems supports up to 64GB of RAM
Booting in /3GB mode on 32-bit systems now supports up to 64GB of RAM instead of just 16GB
Booting without /3GB on 32-bit systems continues to support up to 128GB of RAM
Windows Vista - Planned Enhancements for NUMA, Large System, Large Page Support
Much faster large page allocations in kernel and user mode
Support for cache-aligned pool allocation directives
Data structures describing the non-paged pool free list converted from a linked list to a bitmap:
Reduced lock contention by over 50%
Bitmaps can be searched opportunistically lock-free
Costly combining of adjacent allocations on free no longer necessary
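To see why a bitmap removes the need to combine adjacent allocations on free, consider the following minimal sketch. It is not the kernel's data structure: the class `PageBitmap` and its methods are hypothetical names, one bit stands for one page, and the lock-free search mentioned above is not modeled. Freeing simply clears bits, so two neighboring free runs merge automatically.

```python
class PageBitmap:
    """Toy page allocator: bit i set means page i is allocated (illustration only)."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.bits = 0  # all pages start free

    def allocate(self, count):
        """Return the first page of a free run of `count` pages, or None."""
        if count <= 0:
            raise ValueError("count must be positive")
        run_start, run_len = 0, 0
        for page in range(self.total_pages):
            if (self.bits >> page) & 1:
                # Page in use: any run in progress is broken, restart after it.
                run_start, run_len = page + 1, 0
            else:
                run_len += 1
                if run_len == count:
                    self.bits |= ((1 << count) - 1) << run_start
                    return run_start
        return None

    def free(self, start, count):
        """Clear the bits for the run; adjacent free runs coalesce for free."""
        self.bits &= ~(((1 << count) - 1) << start)
```

With a linked free list, releasing two neighboring blocks requires finding and merging their list entries. Here the next `allocate` call just sees one longer run of clear bits.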
Windows Vista - New Video Model Support
Dramatically different video architecture in Windows Vista
More fully exploits modern GPUs & virtual memory
MM provides new mapping type
Rotate virtual address descriptors (VADs)
Allow video drivers to quickly switch user views from regular application memory into cached, non-cached, or write-combined AGP or video RAM mappings
Allows the video architecture to use the GPU to rotate unneeded clients in and out on demand
First time a Windows-based OS has supported fully pageable mappings with arbitrary cache attributes
Windows Vista - I/O Section Access Improvements
Pervasive prefetch-style clustering for all types of page faults and system cache read ahead
Major benefits over previous clustering:
Unbounded read ahead size instead of 64KB maximum
Dummy page usage
So a single large I/O is always issued regardless of valid pages encountered in the cluster
Pages for the I/O are put in transition (not valid)
No VA space is required
If the pages are not subsequently referenced, no working set trim and TLB flush is needed either
Further emphasizes that driver writers must be aware that MDL pages can have their contents change!
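The dummy page idea can be sketched as follows. This is a simplified model under stated assumptions, not the memory manager's code: `build_cluster_io`, `DUMMY_PAGE`, and the `allocate_frame` callback are hypothetical names, and page frames are represented as strings. Pages in the cluster that are already valid are pointed at one shared dummy page, so the read can still be issued as a single contiguous I/O and the data for those pages is simply discarded.

```python
DUMMY_PAGE = "dummy"  # hypothetical stand-in for the single shared dummy page


def build_cluster_io(first_page, page_count, resident_pages, allocate_frame):
    """Build the frame list for ONE read covering the whole cluster.

    resident_pages: set of file page numbers that are already valid in memory.
    allocate_frame: callback returning a fresh frame for a page that must be read.
    """
    frames = []
    for page in range(first_page, first_page + page_count):
        if page in resident_pages:
            # Already valid: do not overwrite it, aim this part of the
            # transfer at the dummy page instead of splitting the I/O.
            frames.append(DUMMY_PAGE)
        else:
            frames.append(allocate_frame(page))
    return frames
```

Because the same dummy page can appear in many frame lists at once, its contents can change at any time, which is one way to read the warning to driver writers above.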
Windows Vista - I/O Section Access Improvements
Significant changes in pagefile writing:
Larger clusters, up to 4GB
Align near neighbors
Sort by virtual address (VA)
Reduced fragmentation
Improved reads
Cache manager read ahead size limitations in thread structure removed
Improved synchronization between cache manager and memory manager data flushing to maximize filesystem/disk throughput and efficiency
Windows Vista - I/O Section Access Improvements
Mapped file writing and file flushing performance increases
Support for writes of any size up to 4GB instead of the previous 64KB limit per write
Multiple asynchronous flushes can be issued, both internally and by the caller, to satisfy a single call
Windows Vista - I/O Section Access Improvements
Elimination of pagefile writes and potential subsequent re-reads of completely zero pages
Check pages at trim time to see if they are all zero
Optimization used to make this nearly free:
The user virtual address is used to check whether the first and last ULONG_PTR are zero
If they both are, then after the page is trimmed and the TLB invalidated, a kernel mapping is used to make the final check of the entire page
Avoids needless scans and TLB flushes
We've measured over 90% success rate with this algorithm
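The two-stage check can be sketched in a few lines. This is a user-mode illustration only: the function names are hypothetical, a `bytes` object stands in for the page, and the sketch does not model the separate user and kernel mappings or the trim and TLB invalidation that happen between the two stages. `WORD_SIZE` plays the role of `sizeof(ULONG_PTR)`.

```python
import struct

PAGE_SIZE = 4096
WORD_SIZE = struct.calcsize("P")  # pointer-sized word, like ULONG_PTR


def ends_look_zero(page):
    """Cheap filter: are the first and last pointer-sized words both zero?"""
    zero_word = bytes(WORD_SIZE)
    return page[:WORD_SIZE] == zero_word and page[-WORD_SIZE:] == zero_word


def is_zero_page(page):
    """Full check, only paid for when the cheap filter passes."""
    if not ends_look_zero(page):
        return False          # most non-zero pages are rejected here
    return not any(page)      # scan the entire page
```

The cheap filter reads only two words, so pages that hold real data are usually rejected without a full scan, and the full scan is reserved for pages that are likely to be zero.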
Windows Vista - I/O Section Access Improvements
Access to large section performance increases
A subsection is the name of the data structure used to describe on-disk file spans for sections
The subsection structure was converted from a singly linked list (i.e., linear walk required) to a balanced AVL tree
Enables huge performance gain for sections mapping large files:
User mappings and flushes, system cache mappings, flushes and purges, section-based backups, etc.
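The gain from replacing a linear walk with an ordered lookup can be sketched as below. To keep the example short, a sorted array searched with `bisect` stands in for the balanced AVL tree; both give logarithmic lookups, but this is a swapped-in technique, not the structure the slide describes. The class `SubsectionIndex` and the `(start_offset, length, name)` tuples are hypothetical.

```python
import bisect


class SubsectionIndex:
    """Maps a file offset to the subsection that describes it (illustration only)."""

    def __init__(self, subsections):
        # subsections: non-overlapping (start_offset, length, name) tuples
        self._spans = sorted(subsections)
        self._starts = [start for start, _, _ in self._spans]

    def find_linear(self, offset):
        """Old style: walk every entry in order, O(n) per lookup."""
        for start, length, name in self._spans:
            if start <= offset < start + length:
                return name
        return None

    def find_sorted(self, offset):
        """Ordered lookup, O(log n) per lookup, as a balanced tree would give."""
        i = bisect.bisect_right(self._starts, offset) - 1
        if i < 0:
            return None
        start, length, name = self._spans[i]
        return name if offset < start + length else None
```

For a section with a handful of subsections the two lookups cost about the same. The difference shows up for sections that map very large files, where the linear walk has to pass many entries on every lookup.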
Windows Vista - I/O Section Access Improvements
Dependencies between modified writer and mapped writer removed to:
Increase parallelism
Reduce filesystem deadlock rules
Provide the cache manager with a way to influence which portions of files get written first
To optimize disk seeks as well as avoid valid data length extension costs
Windows Vista - I/O Section Access Improvements
Core support for "Superfetch"
Enables significantly faster app launch by deciding which pages should be prioritized
Provides mechanisms to pre-fetch pages and prevent premature cannibalization
Includes support for:
Per page priorities
Access bit tracing
Private page pre-fetching
Section (including pagefile-backed) pre-fetching
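One way per page priorities can prevent premature cannibalization is to keep reusable pages on separate lists by priority and always take from the lowest non-empty list first. The sketch below illustrates that idea only. The class `StandbyLists`, its methods, and the choice of eight priority levels are assumptions made for this example, not details taken from the slides.

```python
from collections import deque

NUM_PRIORITIES = 8  # assumed number of levels for this sketch


class StandbyLists:
    """Toy model of reusable pages grouped by priority (illustration only)."""

    def __init__(self):
        self._lists = [deque() for _ in range(NUM_PRIORITIES)]

    def add(self, page, priority):
        """Park a page that may be reused later, tagged with its priority."""
        self._lists[priority].append(page)

    def repurpose(self):
        """Take a page for reuse: lowest priority first, oldest first within it."""
        for pages in self._lists:
            if pages:
                return pages.popleft()
        return None
```

In this model, a burst of low-priority background activity consumes only low-priority pages until those run out, so pages prefetched for an application launch survive longer.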
Windows Vista - Fast S4 Support
Hibernation converted to use memory management mirroring facilities
Hibernation time reduced by 2x, with 50% smaller hiber-file
Resume time reductions
Windows Vista - Internal Data Structure and Algorithmic Performance Enhancements
PFN lock time reduction is always ongoing and has included areas like:
User address space trimming and deletion
MEM_RESET
Page allocations
The PFN sharecount now uses interlocked updates instead of requiring the PFN lock, etc.
Page faults
Modified writes
Page color generation
MDL construction for fault I/Os, and so on
Windows Vista - Internal Data Structure and Algorithmic Performance Enhancements
The per-process address space lock is used to synchronize creation, deletion and changes to user address spaces
This lock was converted from a mutex to a pushlock
Pushlocks support both shared and exclusive acquire modes
Mutexes support only exclusive acquisitions
Allowed parallelization of many operations like VirtualAlloc, etc.
VirtualAlloc support has been revamped to speed up:
Conventional (non-AWE) allocations by over 30%
AWE allocations by over 2500% (not a typo)
Address Windowing Extension (AWE) non-zeroed allocations are >10x faster than in SP1
Can now therefore be used for HTTP responses, for example
Windows Vista - Internal Data Structure and Algorithmic Performance Enhancements
PFN database contains information about all physical memory in the machine
In the past, whenever a new page was needed:
The PFN spinlock was acquired
New page removed from the appropriate list chained through the PFN database
Windows Vista - Terminal Services Improvements
Added Terminal Server session objects
Enables various components to have secure session IDs and implement compartment IDs, for example
Major overhaul of Terminal Server global-per-session image support
Eliminated multiple image control areas
To provide a single image cache and fix flush/purge/truncate races
Only the shared subsections themselves are now per-session, instead of the entire image
Shared subsections use an AVL tree instead of a linked list, for faster searches
Windows Vista - Additional Robustness and Diagnosability
Capability to mark system cache views as read only
Used by Registry to protect views from inadvertent driver corruption
Reduced data loss in the face of crashes
Flush all modified data to its backing store (local and remote) if we are going to bugcheck due to a failed inpage
Only failed inpages of kernel and/or drivers are fatal
Failed inpages of user process code/data merely result in an inpage exception being handed to the application
Commit thresholds now reflected in global named events
Apps can use this to monitor the system
Windows Vista - Additional Robustness and Diagnosability
.pagein debugger support for kernel/driver addresses added
Allows for viewing memory addresses that have been paged out to disk when debugging crashes
Call to Action
Consider these significant Memory Manager enhancements as you develop drivers for Windows Server 2003 and Windows Vista
Use new APIs when available in Windows Vista