Shadow stacks for user space
Shadow-stack basics
Whenever one function calls another, information for the called function, including any parameters and the address to which the function should return once it has done its work, is pushed onto the call stack. As the call chain deepens, the chain of return addresses on the stack grows apace. Normally, all works as intended, but any corruption of the stack can cause one or more return addresses to be overwritten; that, in turn, will cause execution to "return" to an unintended location. With luck, that will cause the application to crash; if the corrupt data was deliberately placed there, instead, execution could continue in a way that will cause worse things to happen.
Shadow stacks seek to mitigate this problem by creating a second copy of the stack that (usually) only contains the return-address data. Whenever a function is called, the return address is pushed onto both the regular stack and the shadow stack. When that function returns, the return addresses are popped off both stacks and compared; if they fail to match, the system goes into red alert and (probably) kills the process involved. Shadow stacks can be implemented entirely in software; even if the shadow stack is writable, it raises the bar for an attacker, who must now be able to corrupt two areas of memory, one of which is at an arbitrary location. Hardware support can make shadow stacks stronger, though.
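The push-and-compare scheme can be illustrated with a toy model in C. This is only a sketch of the idea, not how any compiler or CPU implements it: the structure, the function names, and the fixed depth are all invented for illustration, and return addresses are modeled as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEPTH 64  /* arbitrary capacity chosen for this sketch */

/* Toy model: both stacks hold only return addresses here. */
struct model_stacks {
    uintptr_t regular[MODEL_DEPTH]; /* stands in for the ordinary call stack */
    uintptr_t shadow[MODEL_DEPTH];  /* the protected second copy */
    size_t top;                     /* number of live frames */
};

/* Model of a function call: push the return address onto both stacks. */
static void model_call(struct model_stacks *s, uintptr_t ret_addr)
{
    assert(s->top < MODEL_DEPTH);
    s->regular[s->top] = ret_addr;
    s->shadow[s->top] = ret_addr;
    s->top++;
}

/*
 * Model of a function return: pop both stacks and compare.
 * Stores the address found on the regular stack in *ret_addr and
 * returns true only if the two copies agree.
 */
static bool model_return(struct model_stacks *s, uintptr_t *ret_addr)
{
    assert(s->top > 0);
    s->top--;
    *ret_addr = s->regular[s->top];
    return s->regular[s->top] == s->shadow[s->top];
}
```

In this model, overwriting an entry in regular[] (as a stack overflow might) makes the matching model_return() call report a mismatch, which is the point at which a real system would kill the process.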
Intel processors, among others, can provide that support. If a shadow stack has been set up (which is a privileged operation), the pushing of return addresses onto that stack and comparison on function return are all done by the CPU itself. Meanwhile, the shadow stack is normally not writable by the application (other than by way of the function-call and return instructions), and thus not corruptible by an attacker. The hardware also requires the presence of a special "restore token" on the shadow stack itself that, among other things, ensures that two processes cannot be sharing the same shadow stack — a situation that, once again, would facilitate attacks.
Supporting user-space shadow stacks
The current version of the shadow-stack support patches has been posted by Rick Edgecombe; the bulk of the patches themselves were written by Yu-cheng Yu, who has posted numerous earlier versions of this work. Enabling this feature requires 35 non-trivial patches, and the problem is not entirely solved yet. One might wonder why it is so hard, since shadow stacks seem like a feature that most code could ignore almost all of the time, but life is never so simple.
As might be expected, the kernel must contain the code to manage user-space shadow stacks. That includes enabling the feature at the processor level and handling it for each specific process. Each process needs its own shadow stack set up with a proper restore token, then the (privileged) shadow-stack pointer register must be aimed at it. Faults, both normal page faults and things like integrity-violation traps, must be handled. There is yet more information to be managed on context switches. This is all pretty normal stuff for a new feature of this sort.
The memory allocated for the shadow stack itself must be treated specially. It belongs to user space, but user-space code must not normally be allowed to write to it. The processor must also recognize memory dedicated to shadow stacks, so its pages must be marked specially in the page tables, and that's where things get a little interesting. There are a number of bits set aside in each page-table entry (PTE) to describe the protections that apply and various other types of status, but the x86 architecture does not include a "this is a shadow-stack page" bit. There are some PTE bits set aside for the operating system's use; Linux does not use them all and could have spared one for this purpose, but evidently certain other operating systems have no spare PTE bits, so stealing one for this purpose would not be welcome.
The solution that the hardware engineers arrived at might well be described as a bit of a hack. If a page's write-enable bit is clear (indicating that it cannot be written to), but its dirty bit is set (indicating that it has been written to), the CPU will conclude that the page in question is part of a shadow stack. This is a combination of settings that, in ordinary usage, might not make sense, so it evidently seemed like fair game.
Unfortunately, Linux kernel developers came to a similar conclusion many years ago, so Linux has its own interpretation for that combination of PTE bits. Specifically, that is how the kernel marks copy-on-write pages. The lack of write access will cause a trap should a process attempt to write the page; the presence of the dirty bit then tells the kernel to make a copy of the page and give the process write access to it. It all works well, until the CPU comes along and applies its own interpretation to that bit combination. As a result, much of the patch set is focused on grabbing one of those unused PTE bits for a new _PAGE_COW flag and causing the memory-management code to use it.
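The clash between the two interpretations can be sketched in a few lines of C. The positions used for the present, write-enable, and dirty bits (0, 1, and 6) are the architectural x86 ones; everything else is an assumption made for illustration. In particular, PTE_SW_COW merely stands in for _PAGE_COW at an arbitrarily chosen software-available bit, and the helper functions are hypothetical rather than taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT (UINT64_C(1) << 0)
#define PTE_RW      (UINT64_C(1) << 1)  /* write-enable */
#define PTE_DIRTY   (UINT64_C(1) << 6)
/* Placeholder for _PAGE_COW; the bit position is an assumption. */
#define PTE_SW_COW  (UINT64_C(1) << 58)

static_assert((PTE_SW_COW & (PTE_PRESENT | PTE_RW | PTE_DIRTY)) == 0,
              "the software bit must not overlap hardware-defined bits");

enum pte_kind {
    PTE_KIND_NOT_PRESENT,
    PTE_KIND_WRITABLE,
    PTE_KIND_SHADOW_STACK,
    PTE_KIND_READ_ONLY,
};

/* How a shadow-stack-aware CPU would interpret a PTE in this model. */
static enum pte_kind classify_hw(uint64_t pte)
{
    if (!(pte & PTE_PRESENT))
        return PTE_KIND_NOT_PRESENT;
    if (pte & PTE_RW)
        return PTE_KIND_WRITABLE;
    if (pte & PTE_DIRTY)
        return PTE_KIND_SHADOW_STACK;  /* write=0, dirty=1 */
    return PTE_KIND_READ_ONLY;
}

/* Older approach: just drop write access, leaving the dirty bit alone. */
static uint64_t wrprotect_legacy(uint64_t pte)
{
    return pte & ~PTE_RW;
}

/* Sketch of the fix: record "was dirty" in a software bit instead. */
static uint64_t wrprotect_with_sw_bit(uint64_t pte)
{
    pte &= ~PTE_RW;
    if (pte & PTE_DIRTY)
        pte = (pte & ~PTE_DIRTY) | PTE_SW_COW;
    return pte;
}
```

Write-protecting a dirty, writable page with wrprotect_legacy() yields an entry that classify_hw() reports as a shadow-stack page, which is exactly the conflict; wrprotect_with_sw_bit() produces an ordinary read-only entry while keeping the information in the software bit.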
Shadow stacks bring other complications as well, of course. If a process calls clone(), a new shadow stack must be allocated for the child process; the kernel handles this task automatically. Signals, as always, add complications of their own, since they already involve various stack manipulations. It gets worse if a process has set up an alternative stack for signal handlers with sigaltstack() — to the point that the current patch set does not handle that case at all. From such details (and more), a long patch series is made.
ABI issues
The use of shadow stacks should be entirely transparent to most applications; after all, developers rarely think about the call stack in any case. But there will always be applications that do tricky things with their stacks, starting with multi-threaded programs that explicitly manage the stack area for each thread. Others may place their own specially crafted thunks onto the stack or even more obscure things. Without special care, all of those applications will break if they are suddenly set up with a shadow stack. That sort of mass regression tends to make security features unpopular, so various measures have been taken to avoid it.
The carefully considered plan that emerged was to mark applications (with a special property in the .note.gnu.property ELF section) that are prepared to run with a shadow stack. Applications that do no stack trickery could simply be rebuilt and run with shadow stacks thereafter. For the more complicated cases, a set of arch_prctl() operations was defined to enable the explicit manipulation of shadow stacks. The GNU C Library was enhanced to use these calls to configure the environment properly on application startup, and the kernel would enable shadow stacks whenever a suitably marked program was run. Some distributions, including Fedora and Ubuntu, have been building their binaries for shadow stacks; all they need is a suitably equipped kernel to run with the extra protection.
It is always dangerous to ship code using kernel features that have not yet been accepted and merged; shadow stacks turn out to be an example of why. According to the cover letter on the current series, the arch_prctl() API was "abandoned for being strange". But those shadow-stack-ready binaries deployed on systems worldwide were built expecting that API, strange or not, to be present; if the kernel respects the markings in the ELF file and enables shadow stacks for those programs, some of them will break. That would cause system administrators worldwide to disable shadow stacks until at least 2040, rather defeating the purpose of the whole exercise.
One obvious workaround for this problem would be to never recognize the current ELF marker for shadow stacks and, instead, create a new one to mark binaries using the interface actually supported by the kernel. The decision that was made, though, was to get the kernel out of the business of recognizing shadow-stack-capable binaries entirely and let the C library take care of it. So, if this version of the ABI is adopted, the kernel will never enable shadow stacks unless user space requests it.
The proposed interface
Overall control of shadow stack functionality is to be had with a (presumably not strange) arch_prctl() call:
status = arch_prctl(ARCH_X86_FEATURE_ENABLE, ARCH_X86_FEATURE_SHSTK);
There is also an ARCH_X86_FEATURE_DISABLE operation that can be used to turn shadow stacks off, and ARCH_X86_FEATURE_LOCK to prevent future changes.
While most applications need not worry about shadow stacks, some of them will need to be able to create new ones. Applications using makecontext() and friends are a prominent example. Creating a shadow stack requires kernel support; the associated memory must have the special page bits set as described above, and must also include the restore token. So there is a new system call for this operation:
void *map_shadow_stack(unsigned long size, unsigned int flags);
The size of the desired stack is passed as size, while flags has a single possibility: SHADOW_STACK_SET_TOKEN to request that a restore token be stored in the stack. The return value on success is the address of the base of this stack.
Actually using this new stack is a matter of executing the RSTORSSP instruction to make the switch, most likely done as a part of a user-space context switch between threads. That instruction will perform the necessary verification of page permissions and the restore token before making the switch. It will also mark the token on the new shadow stack as being busy, preventing that stack from being used by any other process.
Applications doing especially tricky things may require the ability to write to the shadow stack. That access is normally not allowed for obvious reasons but, as Edgecombe noted, that "restricts any potential apps that may want to do exotic things at the expense of a little security". For the exotic case, another feature (LINUX_X86_FEATURE_WRSS) can be turned on with arch_prctl(); that, in turn, enables the WRSS instruction, which can write to shadow-stack memory. Directly writing to that memory by dereferencing a pointer is still disallowed in this case.
What next?
This work is not exactly new; an early version of it was covered in this 2018 article. In its previous incarnation, the shadow-stack patch set got up to version 30. The other half of the control-flow integrity work (indirect branch tracking), which got up to version 29 itself, has been set aside for the moment (though Peter Zijlstra has just shown up with a separate implementation). With a new developer heading up the work, a reduction in scope, and some asked-for changes, it is hoped that this work can finally make some progress toward the mainline.
In a number of ways, it looks like that hope might be realized. While there were comments on various parts of the patch set, there does not appear to be a lot of opposition to how it works at this point. Developers did express concern, though, about the lack of support for alternate signal stacks. That is a feature that seems certain to be wanted at some point, so it would be good to see how it fits into the whole picture before this functionality is merged.
There was also a separate subthread regarding problems with Checkpoint/restore in user space (CRIU), which engages in no end of underhanded tricks to get its job done. One part of the checkpoint process involves injecting "parasite" code into a process to be checkpointed, grabbing the needed information, then doing a special return out of the parasite to resume normal execution. That is just the sort of control-flow tampering that shadow stacks are meant to prevent. Various possible solutions were discussed, but nothing has appeared in code form at this point. A solution here, too, seems necessary before shadow stacks can be merged; as Thomas Gleixner put it: "We can't break CRIU systems with a kernel upgrade".
Finally, the range of supported hardware will almost certainly need to be expanded. Some AMD CPUs implement shadow stacks, evidently in a compatible manner, but only Intel CPUs are supported in this patch set; a lack of testing is cited as the reason. That, at least, will probably need to change for the work to go forward. Shadow stacks are also unsupported on 32-bit systems; fixing that may be harder and it is not clear whether the motivation to do that work exists. With or without 32-bit support, though, there is clearly still work to be done before this code enters the mainline. It should not be expected to show up in a near-future release.
Posted Feb 21, 2022 17:01 UTC (Mon)
by epa (subscriber, #39769)
Posted Feb 21, 2022 17:11 UTC (Mon)
by corbet (editor, #1)
Posted Feb 21, 2022 17:31 UTC (Mon)
by epa (subscriber, #39769)
Posted Feb 21, 2022 17:35 UTC (Mon)
by epa (subscriber, #39769)
Posted Feb 21, 2022 18:26 UTC (Mon)
by pbonzini (subscriber, #60935)
Posted Feb 22, 2022 17:00 UTC (Tue)
by marcH (subscriber, #57642)
I think all the confusion about these bits comes from a lack of clarity about _who owns what_ and how. A bit ironic considering this is a feature meant to catch memory corruption.
Posted Feb 22, 2022 20:42 UTC (Tue)
by nix (subscriber, #2304)
(ref: http://www.os2museum.com/wp/theres-more-to-the-286-xenix-...)
Posted Mar 6, 2022 6:17 UTC (Sun)
by oldtomas (guest, #72579)
The point Paolo is making is that there /is/ a protocol for the software/OS to tell the hardware "go ahead, use this bit for your shadow stack".
It seems that Redmond, though... well, we know that routine :-)
Posted Feb 22, 2022 5:46 UTC (Tue)
by geuder (subscriber, #62854)
I guess the point is that, because the "other operating system" has no allocation for the extra bits, the hardware designers chose not to use them for the new purpose. Instead they used the mentioned combination of previously used bits.
Linux has no problem using the extra bits. CoW is now being moved to one of them.
Posted Feb 22, 2022 8:58 UTC (Tue)
by jorgegv (subscriber, #60484)
In my opinion, if you remove the word "operating", then the whole paragraph makes much more sense to me: this would mean that Linux on x86 would be able to use the spare PTE bits, but Linux on other systems would not have those spare PTE bits available, and thus a different solution, one that could be available for Linux on all hardware platforms, has been used.
Posted Feb 22, 2022 9:02 UTC (Tue)
by johill (subscriber, #25196)
But the *CPU* couldn't take one of the "OS reserved" bits because non-Linux-OSes running on x86 hardware already use them for other purposes.
Hence the CPU repurposes an otherwise invalid (to it, anyway) combination, necessitating the Linux use of this combination be moved to the "OS reserved" bits.
Posted Feb 22, 2022 9:41 UTC (Tue)
by jorgegv (subscriber, #60484)
Thanks for the heads up.
Posted Feb 22, 2022 17:51 UTC (Tue)
by luto (subscriber, #39314)
Posted Feb 23, 2022 19:31 UTC (Wed)
by jthill (subscriber, #56558)
That part of the discussion is about hardware support for shadow stacks. The hardware PTEs have so-far-unused bits, left uninterpreted and unchecked so they're available to operating systems. Linux doesn't use all of those bits, but other operating systems do. So the hardware engineers implementing hardware-shadow-stack support were loath to claw back a PTE bit because it'd break those other operating systems.
Posted Feb 23, 2022 19:59 UTC (Wed)
by zdzichu (subscriber, #17118)
Posted Feb 23, 2022 20:16 UTC (Wed)
by corbet (editor, #1)
Posted Feb 21, 2022 21:22 UTC (Mon)
by mtaht (subscriber, #11087)
Vaporware, but helpful to dream about, I suppose.
Posted Feb 22, 2022 9:00 UTC (Tue)
by roc (subscriber, #30627)
Posted Feb 24, 2022 4:02 UTC (Thu)
by ncm (guest, #165)
And, there is more than one $other. Does anybody know whether all of them can make this work? Can Intel have succeeded in quizzing everybody who has an OS that might want shadow stacks?
Posted Feb 28, 2022 14:44 UTC (Mon)
by jtaylor (subscriber, #91739)
Linux uses that combination already but that can be changed as it has free bits.
So an OS only has a problem if it both has no free page bits and also applies its own interpretation to the write-enable + dirty combination. Probably this is sufficiently unlikely to exist in the real world.
Posted Mar 19, 2022 8:12 UTC (Sat)
by cpitrat (subscriber, #116459)
I didn't understand this bit:
"Other operating systems" and "stealing" a bit
There are some PTE bits set aside for the operating system's use; Linux does not use them all and could have spared one for this purpose, but evidently certain other operating systems have no spare PTE bits, so stealing one for this purpose would not be welcome.
Surely if the system is running Linux, then Linux can use the PTE bits for whatever purposes it wants? And when it's booted into another operating system, that OS doesn't have to care what interpretation Linux would give to the bits. And I thought the page table entry was something purely OS specific anyway, but the article seems to imply it's provided by "the x86 architecture" itself.
Page tables are defined and used by the hardware — on x86, at least. Some of the bits in the PTE are given over to software use. Others, including the "present" and "dirty" bits, along with the permission bits, are defined by the hardware, so how the architecture interprets them matters.
It took me a bit too.
All operating systems will need changes to make use of this feature; it doesn't seem all that terrible if asking this particular change from Linux turned out to be the path of least resistance. Especially since Intel is actually doing the work to effect that change.
Other operating systems