Owning the Image Object File Format, the Compiler Toolchain, and the Operating System
As the Windows kernel continues to pursue in its quest for ever-stronger security features and exploit mitigations, the existence of fixed addresses in memory continues to undermine the advances in this area, as attackers can use data corruption vulnerabilities and combine these with stack and instruction pointer control in order to bypass SMEP, DEP, and countless other architectural defense-in-depth techniques. In some cases, entire mitigations (such as CFG) are undone due to their reliance on a single, well-known static address. In the latest builds of Windows 10 Redstone 1, aka “Anniversary Update”, the kernel takes a much stronger toward Kernel Address Space Layout Randomization (KASLR), employing an arsenal of tools that can only be available to an operating system developer that also happens to own the world’s most commercially successful compiler, and the world’s most pervasive executable object image format.The Page Table Entry Array
One of the most unique aspects of the Windows kernel is the reliance on a fixed kernel address to represent the virtual base address of an array of page table entries that describes the entire virtual address space, and the usage of a self-referencing entry which acts as a pivot describing the page directory for the space itself (and, on x64 systems, describing the page directory table itself, and the page map level 4 itself). This elegant solutions allows instant O(1) translation of any virtual address to its corresponding PTE, and with the correct shifts and base addresses, a conversion into the corresponding PDE (and PPE/PXE on x64 systems). For example, the function MmGetPhysicalAddress only needs to work as follows: Each iteration of the MMU table walk uses simple MiAddressTo macros such as the one below, which in turn rely on hard-code static addresses. As attackers have figured out, however, this “elegance” has notable security implications. For example, if a write-what-where is mitigated by the existence of a read-only page (which, in Linux, would often imply requiring the WP bit to be disabled in CR0), a Windows attacker can simply direct the write-what-where attack toward the pre-computed PTE address in order to disable the WriteProtect bit, and then follow that by the actual write-what-where on the data. Similarly, if an exploit is countered by SMEP, which causes an access violation when a Ring 0 Code Segment’s Instruction Pointer (CS:RIP) points to a Ring 3 PTE, the exploit can simply use a write-what-where (if one exists), or ROP (if the stack can be controlled), in order to mark the target user-mode PTE containing malicious code, as a Ring 0 page. Other PTE-based attacks are also possible, such as by using write-what-where vulnerabilities to redirect a PTE to a different physical address which is controlled by the attacker (undocumented APIs available to Administrators will leak the physical address space of the OS, and some physical addresses are also leaked in the registry or CPU registers). Ultimately, the list goes on and on, and many excellent papers exist on the topic. It’s clear that Microsoft needed to address this limitation of the operating system (or clever optimization, as some would call it). Unfortunately, a number of obstacles exist:- Using virtual-mapped tables based on the EPROCESS structure (as Linux and OS X do) causes significant performance impact, as pointer chasing the different tables now causes cache misses and page translations. This becomes even worse when thinking about multi-processor systems, and the cache waste that this causes (where the TLB may end up getting filled with the various global (locked) pages corresponding to the page tables of various processes, instead of only the current process).
- Changing the address of the PTE array has a number of compatibility concerns, as PTE_BASE is actually documented in ntddk.h, part of the Windows Driver Kit. Additionally, once the new address is discovered, attackers can simply adjust their exploits to use the appropriate static address based on the version of the operating system.
- Randomizing the address of the PTE array means that Windows memory manager functions can no longer use a static constant or preprocessor definition for the address, but must instead access a global address which contains it. Forcing every processor to dereference a single global address every single time a virtual memory operation (allocation, protection, page walk, fault, etc…) is performed is a significantly negative performance hit, especially on multi-socket, NUMA systems.
- Dealing with the global variable problem above by creating cache-aligned copies of the address in a per-processor structure causes a waste of precious kernel storage space (for example, assuming a 64-byte cache line and 640 processors, 40KB of physical memory are used to replicate the variable most efficiently). However, on NUMA systems, one would also want the page containing this data to be local to the node, so we might imagine an overhead of 4KB per socket. In practice, this wouldn’t be quite as bad, as Windows already has a per-NUMA-node-allocated, per-processor, cache-aligned list of critical kernel variables: the Kernel Processor Region Control Block (KPRCB).