KAISER: hiding the kernel from user space
Some address-space history
On 32-bit systems, the address-space layout for a running process dedicated the bottom 3GB (0x00000000 to 0xbfffffff) for user-space use and the top 1GB (0xc0000000 to 0xffffffff) for the kernel. Each process saw its own memory in the bottom 3GB, while the kernel-space mapping was the same for all. On an x86_64 system, the user-space virtual address space goes from zero to 0x7fffffffffff (the bottom 47 bits), while kernel-space mappings are scattered in the range above 0xffff880000000000. While user space can, in some sense, see the address space reserved for the kernel, it has no actual access to that memory.
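The ranges described above can be turned into a small classifier. What follows is a minimal sketch in Python, not kernel code: the function names are invented for illustration, and the 64-bit kernel boundary is simply the address quoted above, treated here as an inclusive lower bound.

```python
# Address ranges as quoted in the text; helper names are illustrative only.
KERNEL_START_32 = 0xC0000000          # start of the top 1GB on 32-bit systems
USER_END_64 = 0x7FFFFFFFFFFF          # top of the 47-bit user range on x86_64
KERNEL_START_64 = 0xFFFF880000000000  # assumption: used as an inclusive bound


def classify_32(addr: int) -> str:
    """Classify a 32-bit virtual address under the 3GB/1GB split."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit address")
    return "user" if addr < KERNEL_START_32 else "kernel"


def classify_64(addr: int) -> str:
    """Classify a 64-bit virtual address using the ranges quoted above."""
    if not 0 <= addr <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("not a 64-bit address")
    if addr <= USER_END_64:
        return "user"
    if addr >= KERNEL_START_64:
        return "kernel"
    return "neither"  # the gap between the two ranges
```

Note that the classifier says nothing about whether an access would succeed; as the text says, user space can "see" the kernel range but cannot touch it.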
This mapping scheme has caused problems in the past. On 32-bit systems, it limits the total size of a process's address space to 3GB, for example. The kernel-side problems are arguably worse, in that the kernel can only directly access a bit less than 1GB of physical memory; using more memory than that required the implementation of a complicated "high memory" mechanism. 32-Bit systems were never going to be great at using large amounts of memory (for a 20th-century value of "large"), but keeping the kernel mapped into user space made things worse.
Nonetheless, this mechanism persists for a simple reason: getting rid of it would make the system run considerably slower. Keeping the kernel permanently mapped eliminates the need to flush the processor's translation lookaside buffer (TLB) when switching between user and kernel space, and it allows the TLB entries for kernel space to never be flushed. Flushing the TLB is an expensive operation for a couple of reasons: having to go to the page tables to repopulate the TLB hurts, but the act of performing the flush itself is slow enough that it can be the biggest part of the cost.
Back in 2003, Ingo Molnar implemented a different mechanism, where user space and kernel space each got a full 4GB address space and the processor would switch between them on every context switch. The "4G/4G" mechanism solved problems for some users and was shipped by some distributors, but the associated performance cost ensured that it never found its way into the mainline kernel. Nobody has seriously proposed separating the two address spaces since.
Rethinking the shared address space
On contemporary 64-bit systems, the shared address space does not constrain the amount of virtual memory that can be addressed as it used to, but there is another problem that is related to security. An important technique for hardening the system is kernel address-space layout randomization (KASLR), which randomizes the placement of the kernel in the virtual address space at boot time. By denying an attacker the knowledge of where the kernel lives in memory, KASLR makes many types of attack considerably more difficult. As long as the actual location of the kernel does not leak to user space, attackers will be left groping in the dark.
The problem is that this information leaks in many ways. Many of those leaks date back to simpler days when kernel addresses were not sensitive information; it even turns out that your editor introduced one such leak in 2003. Nobody was worried about exposing that information at that time. More recently, a concerted effort has been made to close off the direct leaks from the kernel, but none of that will be of much benefit if the hardware itself reveals the kernel's location. And that would appear to be exactly what is happening.
This paper from Daniel Gruss et al. [PDF] cites a number of hardware-based attacks on KASLR. They use techniques like exploiting timing differences in fault handling, observing the behavior of prefetch instructions, or forcing faults using the Intel TSX (transactional memory) instructions. There are rumors circulating that other such channels exist but have not yet been disclosed. In all of these cases, the processor responds differently to a memory access attempt depending on whether the target address is mapped in the page tables, regardless of whether the running process can actually access that location. These differences can be used to find where the kernel has been placed — without making the kernel aware that an attack is underway.
Fixing information leaks in the hardware is difficult and, in any case, deployed systems are likely to remain vulnerable. But there is a viable defense against these information leaks: making the kernel's page tables entirely inaccessible to user space. In other words, it would seem that the practice of mapping the kernel into user space needs to end in the interest of hardening the system.
KAISER
The paper linked above provided an implementation of separated address spaces for the x86-64 kernel; the authors called it "KAISER", which evidently stands for "kernel address isolation to have side-channels efficiently removed". This implementation was not suitable for inclusion into the mainline, but it was picked up and heavily modified by Dave Hansen. The resulting patch set (still called "KAISER") is in its third revision and seems likely to find its way upstream in a relatively short period of time.
Whereas current systems have a single set of page tables for each process, KAISER implements two. One set is essentially unchanged; it includes both kernel-space and user-space addresses, but it is only used when the system is running in kernel mode. The second "shadow" page table contains a copy of all of the user-space mappings, but leaves out the kernel side. Instead, there is a minimal set of kernel-space mappings that provides the information needed to handle system calls and interrupts, but no more. Copying the page tables may sound inefficient, but the copying only happens at the top level of the page-table hierarchy, so the bulk of that data is shared between the two copies.
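To see why copying only the top level is cheap, consider a toy model in which a top-level table is a list of 512 slots and each slot refers to a lower-level table. This is a sketch under stated assumptions, not the patch set's actual code: `make_shadow` and `entry_only` are hypothetical names, and the lower-level tables are stand-in dictionaries.

```python
ENTRIES = 512        # slots in an x86-64 top-level page table
USER_ENTRIES = 256   # the lower half of the slots covers user space


def make_shadow(full_top, entry_only):
    """Build a shadow top-level table from the full one.

    User-space slots are copied by reference, so the lower-level tables
    are shared rather than duplicated. Kernel slots stay empty except for
    those listed in `entry_only` (hypothetical: separate, minimal tables
    that map just enough to enter the kernel).
    """
    shadow = [None] * ENTRIES
    shadow[:USER_ENTRIES] = full_top[:USER_ENTRIES]
    for slot, table in entry_only.items():
        if slot < USER_ENTRIES:
            raise ValueError("entry mappings belong in the kernel half")
        shadow[slot] = table
    return shadow


# Stand-ins for lower-level tables; the names are illustrative only.
full = [None] * ENTRIES
full[0] = {"name": "user text and data"}
full[255] = {"name": "user stack"}
full[273] = {"name": "kernel direct map"}
full[511] = {"name": "kernel text"}

entry_only = {511: {"name": "entry code only"}}
shadow = make_shadow(full, entry_only)
```

In this model the shadow table costs one 512-slot list per process; everything below the top level on the user side is the same object in both tables.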
Whenever a process is running in user mode, the shadow page tables will be active. The bulk of the kernel's address space will thus be completely hidden from the process, defeating the known hardware-based attacks. Whenever the system needs to switch to kernel mode, in response to a system call, an exception, or an interrupt, for example, a switch to the other page tables will be made. The code that manages the return to user space must then make the shadow page tables active again.
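The switching discipline described above can be modeled as a pair of transitions around every kernel entry. The following is a toy sketch only; the `Cpu` class and its method names are invented for illustration, and the real work is done in low-level entry code rather than anything resembling this.

```python
from contextlib import contextmanager


class Cpu:
    """Toy CPU that tracks which set of page tables is active."""

    def __init__(self):
        self.active = "shadow"   # user mode runs on the shadow tables
        self.switches = 0

    def _load(self, tables):
        self.active = tables
        self.switches += 1

    @contextmanager
    def kernel_mode(self):
        """Model a system call, exception, or interrupt from user mode."""
        self._load("full")           # entry: switch to the full page tables
        try:
            yield self
        finally:
            self._load("shadow")     # return to user space: hide the kernel


cpu = Cpu()
with cpu.kernel_mode():
    inside = cpu.active              # what the kernel sees while it runs
```

Each trip into the kernel costs two page-table switches in this model, which is where the performance concern discussed below comes from.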
The defense provided by KAISER is not complete, in that a small amount of kernel information must still be present to manage the switch back to kernel mode. In the patch description, Hansen wrote:
While the patch does not mention it, one could imagine that, if the presence of the remaining information turns out to give away the game, it could probably be located separately from the rest of the kernel at its own randomized address.
The performance concerns that drove the use of a single set of page tables have not gone away, of course. More recent processors offer some help, though, in the form of process-context identifiers (PCIDs). These identifiers tag entries in the TLB; lookups in the TLB will only succeed if the associated PCID matches that of the thread running in the processor at the time. Use of PCIDs eliminates the need to flush the TLB at context switches; that reduces the cost of switching page tables during system calls considerably. Happily, the kernel got support for PCIDs during the 4.14 development cycle.
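The effect of tagging can be illustrated with a toy TLB whose entries are keyed by PCID. This is a simplified model, not a description of real hardware; the `Tlb` class, the `run` helper, and the page numbers are all made up for the example.

```python
class Tlb:
    """Toy TLB whose entries are tagged with a PCID."""

    def __init__(self):
        self.entries = {}   # (pcid, virtual page) -> physical frame
        self.pcid = 0
        self.misses = 0

    def switch(self, pcid, flush=False):
        """Switch page tables; flush=True models a CPU without PCIDs."""
        if flush:
            self.entries.clear()
        self.pcid = pcid

    def translate(self, vpage, page_table):
        key = (self.pcid, vpage)
        if key not in self.entries:
            self.misses += 1                     # page-table walk needed
            self.entries[key] = page_table[vpage]
        return self.entries[key]


def run(flush):
    """Count TLB misses over three user -> kernel round trips."""
    tlb = Tlb()
    user_pt = {0x10: 0xA0}
    kernel_pt = {0x10: 0xA0, 0x900: 0xB0}
    for _ in range(3):
        tlb.switch(1, flush)
        tlb.translate(0x10, user_pt)
        tlb.switch(2, flush)
        tlb.translate(0x900, kernel_pt)
    return tlb.misses
```

With flushing, every access after a switch misses (`run(True)` counts six misses); with tagged entries, only the first access to each page misses (`run(False)` counts two).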
Even so, there will be a performance penalty to pay when KAISER is in use:
Not that long ago, a security-related patch with that kind of performance penalty would not have even been considered for mainline inclusion. Times have changed, though, and most developers have realized that a hardened kernel is no longer optional. Even so, there will be options to enable or disable KAISER, perhaps even at run time, for those who are unwilling to take the performance hit.
All told, KAISER has the look of a patch set that has been put onto the fast track. It emerged nearly fully formed and has immediately seen a lot of attention from a number of core kernel developers. Linus Torvalds is clearly in support of the idea, though he naturally has pointed out a number of things that, in his opinion, could be improved. Nobody has talked publicly about time frames for merging this code, but 4.15 might not be entirely out of the question.
Index entries for this article | |
---|---|
Kernel | Memory management/User-space layout |
Kernel | Security/Meltdown and Spectre |
Security | Linux kernel |
Posted Nov 15, 2017 5:54 UTC (Wed) by jreiser (subscriber, #11027)
Posted Nov 15, 2017 15:39 UTC (Wed) by hansendc (subscriber, #7363)
Posted Dec 9, 2018 15:58 UTC (Sun) by dembego3 (guest, #129120)
Posted Nov 15, 2017 8:19 UTC (Wed) by marcH (subscriber, #57642)
I find timing-based attacks fascinating.
In order to grow, computer performance has become less and less deterministic. On one hand this makes real-time and predictions harder. On the other hand this leaks more and more information about the system.
Posted Nov 15, 2017 14:19 UTC (Wed) by epa (subscriber, #39769)
Posted Nov 15, 2017 15:55 UTC (Wed) by matthias (subscriber, #94967)
I also did not know this before, but several of these attacks are described in the linked paper.
Posted Nov 27, 2017 15:46 UTC (Mon) by abufrejoval (guest, #100159)
If instead you set a randomization bias you can run CPUs at say 5GHz logical clock and then add random delays to hit say 3, 2 or 1 GHz on average depending on the workload. Every iteration of an otherwise pretty identical loop would wind up a couple of clocks different, throwing off snoop code without much of an impact elsewhere. Of course it shouldn't be one central clock overall, but essentially any clock domain could use its own randomization source and bias. I guess CPUs have vast numbers of clock synchronization gates these days anyway, so very little additional hardware should be required.
Stupid, genius or simply old news?
Posted Nov 27, 2017 16:55 UTC (Mon) by NAR (subscriber, #1313)
Posted Jan 5, 2018 8:26 UTC (Fri) by clbuttic (guest, #121058)
Posted Jan 6, 2018 16:02 UTC (Sat) by nix (subscriber, #2304)
Posted Nov 27, 2017 17:43 UTC (Mon) by excors (subscriber, #95769)
For example, the KAISER paper says the "double page fault attack" distinguishes page faults taking 12282 cycles for mapped pages and 12307 cycles for unmapped pages, i.e. a difference of 25 cycles. If I remember how maths works: You could add a random delay to the page fault handler (or randomly vary the CPU speed or whatever) so it has a mean and standard deviation of (M, S) for mapped pages and (M+25, S) for unmapped. If S > 25 (very roughly) then the attacker can measure a page fault but can't be sure whether it belongs to the first category or the second.
But if they repeat it 10,000 times (which only takes a few msecs) and average their measurements, they'd expect to get a number in the distribution (M, S/100) for mapped pages or (M+25, S/100) for unmapped. You'd have to make S > 2500 to stop them being able to distinguish those cases easily. At that point it's much more expensive than the KAISER defence, and it would still be useless against an attacker who can repeat the measurement a million times. And that's for measuring a relatively tiny difference of 25 cycles in an operation that takes ~12K cycles; it's harder to protect the TSX or prefetch attacks where the operation only takes ~300 cycles.
It seems much safer to ensure operations will always take a constant amount of time, rather than adding randomness and just hoping the statistics work in your favour.
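The arithmetic above can be checked with a few lines of Python. This is only a sketch of the statistics in the comment, assuming independent measurements so that the standard deviation of an average of N samples is S divided by the square root of N; the function names are made up for the example.

```python
import math


def noise_after_averaging(sigma: float, samples: int) -> float:
    """Standard deviation of the mean of `samples` independent measurements."""
    return sigma / math.sqrt(samples)


def sigma_needed(delta: float, samples: int) -> float:
    """Per-measurement sigma at which the averaged noise equals `delta`."""
    return delta * math.sqrt(samples)
```

For the 25-cycle difference, `sigma_needed(25, 10_000)` gives 2500 cycles, matching the figure above, and `sigma_needed(25, 1_000_000)` gives 25000 cycles, which is twice the cost of the page fault being measured.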
Posted Nov 15, 2017 9:27 UTC (Wed) by sorokin (guest, #88478)
A single change may not affect performance significantly (although a 5% slowdown is too much for my taste). But multiple changes can stack up over time. In the past this has been seen both for performance-improving and for performance-regressing changes. A performance regression example is how compilers' performance regressed over time (although since GCC 6 the trend has reversed). A performance improvement example is sqlite.
Posted Nov 15, 2017 11:25 UTC (Wed) by ballombe (subscriber, #9523)
IIRC this is not true on SPARC, is it?
Posted Nov 15, 2017 14:05 UTC (Wed) by arjan (subscriber, #36785)
Posted Nov 15, 2017 14:56 UTC (Wed) by cborni (subscriber, #12949)
Posted Nov 16, 2017 2:06 UTC (Thu) by jamesmorris (subscriber, #82698)
Posted Nov 16, 2017 2:10 UTC (Thu) by jamesmorris (subscriber, #82698)
Posted Nov 16, 2017 12:52 UTC (Thu) by ballombe (subscriber, #9523)
Posted Nov 16, 2017 4:57 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
I would definitely like to protect my browser and anything started by it, but I would like my gcc started from a terminal to run at full speed.
Posted Nov 16, 2017 20:40 UTC (Thu) by hansendc (subscriber, #7363)
You would essentially need to keep a bit of per-cpu data that was consulted very early in assembly at kernel entry. It would have to be updated at every context switch, probably from some flag in the task_struct. Again, doable, but far from trivial.
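A toy model of that idea might look like the following. It is purely a sketch: `Task`, `PerCpu`, and the `wants_isolation` flag are hypothetical names, and nothing here reflects how the patch set or the kernel's entry assembly is actually written.

```python
class Task:
    """Stand-in for a task with a hypothetical per-task isolation flag."""

    def __init__(self, name, wants_isolation):
        self.name = name
        self.wants_isolation = wants_isolation


class PerCpu:
    """Stand-in for per-CPU data consulted very early at kernel entry."""

    def __init__(self):
        self.isolate = True   # default to the safe choice

    def context_switch(self, task):
        # Updated at every context switch from the incoming task's flag.
        self.isolate = task.wants_isolation

    def tables_for_user_mode(self):
        # Decides which page tables the return to user space should load.
        return "shadow" if self.isolate else "full"


cpu = PerCpu()
cpu.context_switch(Task("browser", wants_isolation=True))
browser_tables = cpu.tables_for_user_mode()
cpu.context_switch(Task("gcc", wants_isolation=False))
gcc_tables = cpu.tables_for_user_mode()
```

The model hides the hard part, which is that the real check has to happen in entry code before much of the kernel is even mapped.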
Posted Nov 19, 2017 6:16 UTC (Sun) by luto (subscriber, #39314)
But this is definitely not a v1 feature.
Posted Nov 16, 2017 7:21 UTC (Thu) by alkbyby (subscriber, #61687)
Otherwise I cannot see why simply hiding kernel addresses better suddenly becomes important enough to spend a massive amount of CPU on it.
Posted Nov 16, 2017 8:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
But it looks like software-based hiding is ineffective by itself with the current model.
Posted Nov 16, 2017 9:29 UTC (Thu) by alkbyby (subscriber, #61687)
Posted Nov 16, 2017 10:41 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
The problem is that hardware simply makes all software countermeasures irrelevant without something like KAISER.
Posted Nov 16, 2017 18:34 UTC (Thu) by ttelford (guest, #44176)
Posted Nov 16, 2017 20:52 UTC (Thu) by hansendc (subscriber, #7363)
Most of the KAISER performance impact is purely from the cost of manipulating the hardware. L4 and other kernels would pay the same cost Linux would.
It's not fair to compare a non-hardened kernel to a hardened one, though. It's apples-to-oranges.
Posted Nov 17, 2017 17:45 UTC (Fri) by valarauca (guest, #109490)
AVX512 adds explicit flags to suppress memory errors on scatter/gather load/store vectorized instructions which will just add another method to exploit this. The ways of _accessing_ memory you can't access on x64 just continue to grow. I really don't see how AMD64 can fix this without breaking either the page table or the debug timers.
Posted Nov 18, 2017 0:09 UTC (Sat) by anton (subscriber, #25547)
Concerning the article, hyperbole is the standard in security news, but "a hardened kernel is no longer optional" seems to be a little extreme even so. I very much hope that stuff like this will be optional.
A possibly less costly way to mitigate attacks that try to defeat KASLR might be to map additional inaccessible address space that would respond to the attacks just like real kernel memory.
Posted Nov 30, 2017 1:32 UTC (Thu) by Garak (guest, #99377)
Posted Nov 22, 2017 10:40 UTC (Wed) by mlankhorst (subscriber, #52260)
Put the kernel mapping at ring3, and make the remainder of the upper 64-bits mapped RWX at ring 1 or 2.
Try to exploit timing differences then from RING 0!
Or am I thinking too simple?
Posted Jan 3, 2018 16:55 UTC (Wed) by EdRowland (guest, #120787)
Posted Jan 3, 2018 17:30 UTC (Wed) by excors (subscriber, #95769)
You could try to reduce the size by e.g. using a single dummy PTE table that's shared by all the higher-level tables, instead of keeping them distinct. But an attacker can likely measure the timing difference between a page walk that fetches the PTE from cache, vs one that fetches it from RAM. If you access address A, then address A+4096, and the second one is fast (i.e. the PTE is already in the cache), you know that's using the dummy PTE, so it's still leaking information about where the kernel is.
Posted Jan 6, 2018 0:26 UTC (Sat) by ridethewave (guest, #121115)
(CR3 manipulation) add a few hundred cycles to a syscall or interrupt. That's a couple L3 cache misses [CAS latency on SDRAM has been ~60ns for decades] which probably is tolerable on a syscall. But hundreds of cycles is horrible for an interrupt. [33MHz is a common bus clock, so just generating an interrupt already requires an average latency of ~15ns.] Some architectures have a special interrupt context (and/or separate small locked caches) exactly for this reason.
These days, when CPUs constantly vary their speeds either to exploit every bit of thermal headroom they can find or to re-adjust constantly to hit an energy optimum for a limited-value workload, it seems almost stupid to try sticking to a constant speed.
Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt
A few hundred cycles to a syscall or interrupt is vaguely similar to the basic IPC cost of the L4 microkernel. (200-300 cycles for amd64).
Kernels are not my area of expertise, so I have to ask: if a syscall is about to become as expensive as IPC on L4... would the (theoretical) performance of the respective kernels be similar after KAISER?
I read some performance caveats about vmaskmovps (AVX, not sure if there is an SSE equivalent) that make me think that this instruction can be used for such purposes, too.
wordage nuance
Concerning the article, hyperbole is the standard in security news, but "a hardened kernel is no longer optional" seems to be a little extreme even so. I very much hope that stuff like this will be optional.
My reaction for a couple of seconds as well, till I read the next sentence. I agree that sentence is not the best way to describe things. I think it's important to highlight that security-vs-performance tradeoffs are a vast spectrum of subtle choices that *depend on the situation/deployment*. There are many different situations. Quite often a performance hit from enabling SELinux or whatever new hardening-with-five-percent-hit tactic is absolutely not worth it. Other times your computers are trying to secure millions of dollars of cryptocurrency/etc. Most users should be taught about such nuance versus the "more secure equals always better" narrative. If something is useful to lots of people, sure, it should be available as an option. But leave it to the distributors and then the end users to figure out when and where various options should be tuned.
Couldn't you just map each virtual address to the same physical address then?