Shadow stacks for user space

By Jonathan Corbet
February 21, 2022
The call stack is a favorite target for attackers attempting to compromise a running process; if an attacker finds a way to overwrite a return address on the stack, they can redirect control to code of their choosing, leading to a situation best described as "game over". As a result, a great deal of effort has gone into protecting the stack. One technique that offers promise is a shadow stack; support for shadow stacks is thus duly showing up in various processors. Support for protecting user-space applications with shadow stacks is taking a bit longer; it is currently under discussion within the kernel community, but adding this feature is trickier than one might think. Among other things, these patches have been around for long enough that they have developed some backward-compatibility problems of their own.

Shadow-stack basics

Whenever one function calls another, information for the called function, including any parameters and the address to which the function should return once it has done its work, is pushed onto the call stack. As the call chain deepens, the chain of return addresses on the stack grows apace. Normally, all works as intended, but any corruption of the stack can cause one or more return addresses to be overwritten; that, in turn, will cause execution to "return" to an unintended location. With luck, that will cause the application to crash; if the corrupt data was deliberately placed there, instead, execution could continue in a way that will cause worse things to happen.

Shadow stacks seek to mitigate this problem by creating a second copy of the stack that (usually) only contains the return-address data. Whenever a function is called, the return address is pushed onto both the regular stack and the shadow stack. When that function returns, the return addresses are popped off both stacks and compared; if they fail to match, the system goes into red alert and (probably) kills the process involved. Shadow stacks can be implemented entirely in software; even if the shadow stack is writable, it raises the bar for an attacker, who must now be able to corrupt two areas of memory, one of which is at an arbitrary location. Hardware support can make shadow stacks stronger, though.
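
The push-and-compare scheme described above can be sketched in a few lines of C. This is only an illustration of the software-only variant; the names (example_shadow_push() and friends) and the fixed depth are invented here, and a real implementation would be emitted by the compiler in function prologues and epilogues rather than called by hand.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_SHADOW_DEPTH 64   /* arbitrary depth for this sketch */

/* Hypothetical software shadow stack: it holds return addresses only. */
static uintptr_t example_shadow[EXAMPLE_SHADOW_DEPTH];
static size_t example_shadow_top;

/*
 * Called on function entry: record the return address a second time.
 * Returns 0 on success, -1 if the shadow stack is full.
 */
int example_shadow_push(uintptr_t return_addr)
{
    if (example_shadow_top == EXAMPLE_SHADOW_DEPTH)
        return -1;
    example_shadow[example_shadow_top++] = return_addr;
    return 0;
}

/*
 * Called on function return with the address found on the regular stack.
 * Pops the shadow copy and returns 1 if the two match, 0 on a mismatch
 * or underflow. A real implementation would kill the process at that
 * point instead of returning.
 */
int example_shadow_pop_and_check(uintptr_t return_addr_from_stack)
{
    if (example_shadow_top == 0)
        return 0;
    return example_shadow[--example_shadow_top] == return_addr_from_stack;
}
```

An attacker who overwrites only the regular stack changes the value passed to example_shadow_pop_and_check(), but not the copy it is compared against, so the check fails.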

Intel processors, among others, can provide that support. If a shadow stack has been set up (which is a privileged operation), the pushing of return addresses onto that stack and comparison on function return are all done by the CPU itself. Meanwhile, the shadow stack is normally not writable by the application (other than by way of the function-call and return instructions), and thus not corruptible by an attacker. The hardware also requires the presence of a special "restore token" on the shadow stack itself that, among other things, ensures that two processes cannot be sharing the same shadow stack — a situation that, once again, would facilitate attacks.

Supporting user-space shadow stacks

The current version of the shadow-stack support patches has been posted by Rick Edgecombe; the bulk of the patches themselves were written by Yu-cheng Yu, who has posted numerous earlier versions of this work. Enabling this feature requires 35 non-trivial patches, and the problem is not entirely solved yet. One might wonder why it is so hard, since shadow stacks seem like a feature that most code could ignore almost all of the time, but life is never so simple.

As might be expected, the kernel must contain the code to manage user-space shadow stacks. That includes enabling the feature at the processor level, and handling it for each specific process. Each process needs its own shadow stack set up with a proper restore token, then the (privileged) shadow-stack pointer register must be aimed at it. Faults, both normal page faults and things like integrity-violation traps, must be handled. There is yet more information to be managed on context switches. This is all pretty normal stuff for a new feature of this sort.

The memory allocated for the shadow stack itself must be treated specially. It belongs to user space, but user-space code must not normally be allowed to write to it. The processor must also recognize memory dedicated to shadow stacks, so they must be marked specially in the page tables, and that's where things get a little interesting. There are a number of bits set aside in each page-table entry (PTE) to describe the protections that apply and various other types of status, but the x86 architecture does not include a "this is a shadow-stack page" bit. There are some PTE bits set aside for the operating system's use; Linux does not use them all and could have spared one for this purpose, but evidently certain other operating systems have no spare PTE bits, so stealing one for this purpose would not be welcome.

The solution that the hardware engineers arrived at might well be described as a bit of a hack. If a page's write-enable bit is clear (indicating that it cannot be written to), but its dirty bit is set (indicating that it has been written to), the CPU will conclude that the page in question is part of a shadow stack. This is a combination of settings that, in ordinary usage, might not make sense, so it evidently seemed like fair game.

Unfortunately, Linux kernel developers came to a similar conclusion many years ago, so Linux has its own interpretation for that combination of PTE bits. Specifically, that is how the kernel marks copy-on-write pages. The lack of write access will cause a trap should a process attempt to write the page; the presence of the dirty bit then tells the kernel to make a copy of the page and give the process write access to it. It all works well — until the CPU comes along and applies its own interpretation to that bit combination. So much of the patch set is focused on grabbing one of those unused PTE bits for a new _PAGE_COW flag and causing the memory-management code to use it.
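
The clash between the two interpretations can be made concrete with a small sketch. The write-enable and dirty bit positions below are the architectural x86 ones; everything else is an assumption made for illustration: the EX_ names are invented, the position chosen for the stand-in _PAGE_COW bit is arbitrary (the article does not say which software bit the patches use), and the real memory-management changes are considerably more involved than these helpers suggest.

```c
#include <stdint.h>

/* Architectural x86 PTE bits, interpreted by the hardware. */
#define EX_PTE_RW     (UINT64_C(1) << 1)   /* write-enable */
#define EX_PTE_DIRTY  (UINT64_C(1) << 6)   /* page has been written to */

/*
 * Stand-in for the new _PAGE_COW flag. The bit position is an assumption:
 * it only needs to be one of the bits the hardware leaves to software.
 */
#define EX_PTE_SOFT_COW (UINT64_C(1) << 58)

/* How a shadow-stack-aware CPU classifies a PTE: write=0 and dirty=1. */
int ex_cpu_sees_shadow_stack(uint64_t pte)
{
    return !(pte & EX_PTE_RW) && (pte & EX_PTE_DIRTY) != 0;
}

/* The old copy-on-write marking: read-only, but dirty. */
uint64_t ex_mark_cow_old(uint64_t pte)
{
    return (pte & ~EX_PTE_RW) | EX_PTE_DIRTY;
}

/* The new marking: read-only, with the state carried by a software bit. */
uint64_t ex_mark_cow_new(uint64_t pte)
{
    return (pte & ~(EX_PTE_RW | EX_PTE_DIRTY)) | EX_PTE_SOFT_COW;
}
```

With the old marking, ex_cpu_sees_shadow_stack() returns true for an ordinary copy-on-write page, which is exactly the misinterpretation the patch set has to avoid; with the new marking it does not.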

Shadow stacks bring other complications as well, of course. If a process calls clone(), a new shadow stack must be allocated for the child process; the kernel handles this task automatically. Signals, as always, add complications of their own, since they already involve various stack manipulations. It gets worse if a process has set up an alternative stack for signal handlers with sigaltstack() — to the point that the current patch set does not handle that case at all. From such details (and more), a long patch series is made.

ABI issues

The use of shadow stacks should be entirely transparent to most applications; after all, developers rarely think about the call stack in any case. But there will always be applications that do tricky things with their stacks, starting with multi-threaded programs that explicitly manage the stack area for each thread. Others may place their own specially crafted thunks onto the stack or even more obscure things. Without special care, all of those applications will break if they are suddenly set up with a shadow stack. That sort of mass regression tends to make security features unpopular, so various measures have been taken to avoid it.

The carefully considered plan that emerged was to mark applications (with a special property in the .note.gnu.property ELF section) that are prepared to run with a shadow stack. Applications that do no stack trickery could simply be rebuilt and run with shadow stacks thereafter. For the more complicated cases, a set of arch_prctl() operations was defined to enable the explicit manipulation of shadow stacks. The GNU C Library was enhanced to use these calls to configure the environment properly on application startup, and the kernel would enable shadow stacks whenever a suitably marked program was run. Some distributions, including Fedora and Ubuntu, have been building their binaries for shadow stacks; all they need is a suitably equipped kernel to run with the extra protection.
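
For the curious, checking a binary for that marker might look roughly like the following. The sketch assumes the caller has already located the descriptor of the NT_GNU_PROPERTY_TYPE_0 note inside .note.gnu.property and that the file uses the host's byte order; the constants mirror GNU_PROPERTY_X86_FEATURE_1_AND and GNU_PROPERTY_X86_FEATURE_1_SHSTK from the x86-64 psABI, while the function name and EX_ prefixes are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors of the psABI constants, prefixed to avoid clashing with <elf.h>. */
#define EX_GNU_PROPERTY_X86_FEATURE_1_AND    0xc0000002u
#define EX_GNU_PROPERTY_X86_FEATURE_1_SHSTK  (1u << 1)

/*
 * Walk the property array in a GNU property note descriptor. Each entry
 * is a 32-bit type, a 32-bit data size, then the data, padded to an
 * 8-byte boundary (64-bit ELF). Returns 1 if the shadow-stack bit is
 * set, 0 if it is not or if the data is malformed.
 */
int ex_note_marks_shstk(const unsigned char *desc, size_t len)
{
    size_t off = 0;

    while (len - off >= 8) {
        uint32_t type, datasz;
        size_t padded;

        memcpy(&type, desc + off, 4);
        memcpy(&datasz, desc + off + 4, 4);
        off += 8;
        if (datasz > len - off)
            return 0;               /* truncated property */

        if (type == EX_GNU_PROPERTY_X86_FEATURE_1_AND && datasz >= 4) {
            uint32_t features;

            memcpy(&features, desc + off, 4);
            return (features & EX_GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
        }

        padded = ((size_t)datasz + 7) & ~(size_t)7;
        if (padded > len - off)
            return 0;               /* missing padding */
        off += padded;
    }
    return 0;
}
```

The name of the property type ends in "AND" because the linker combines the feature bits of all input objects with a bitwise AND; a single object built without shadow-stack support is enough to clear the bit in the final binary.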

It is always dangerous to ship code using kernel features that have not yet been accepted and merged; shadow stacks turn out to be an example of why. According to the cover letter on the current series, the arch_prctl() API was "abandoned for being strange". But those shadow-stack-ready binaries deployed on systems worldwide were built expecting that API, strange or not, to be present; if the kernel respects the markings in the ELF file and enables shadow stacks for those programs, some of them will break. That would cause system administrators worldwide to disable shadow stacks until at least 2040, rather defeating the purpose of the whole exercise.

One obvious workaround for this problem would be to never recognize the current ELF marker for shadow stacks and, instead, create a new one to mark binaries using the interface actually supported by the kernel. The decision that was made, though, was to get the kernel out of the business of recognizing shadow-stack-capable binaries entirely and let the C library take care of it. So, if this version of the ABI is adopted, the kernel will never enable shadow stacks unless user space requests it.

The proposed interface

Overall control of shadow stack functionality is to be had with a (presumably not strange) arch_prctl() call:

    status = arch_prctl(ARCH_X86_FEATURE_ENABLE, ARCH_X86_FEATURE_SHSTK);

There is also an ARCH_X86_FEATURE_DISABLE operation that can be used to turn shadow stacks off, and ARCH_X86_FEATURE_LOCK to prevent future changes.
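
How the three operations interact can be modeled with a toy stand-in for the kernel side of the call. Nothing here talks to a real kernel: the numeric values are placeholders (the article names the operations but gives no numbers), struct ex_task and ex_arch_prctl() are invented, and the assumption that locking a feature refuses both later enables and later disables is this sketch's reading of "prevent future changes".

```c
/* Placeholder values; the real numbers are defined by the patch set. */
#define EX_ARCH_X86_FEATURE_ENABLE   1
#define EX_ARCH_X86_FEATURE_DISABLE  2
#define EX_ARCH_X86_FEATURE_LOCK     3
#define EX_ARCH_X86_FEATURE_SHSTK    (1ul << 0)

/* Hypothetical per-process state the kernel would have to track. */
struct ex_task {
    unsigned long enabled;   /* features currently turned on */
    unsigned long locked;    /* features whose state may no longer change */
};

/*
 * Toy model of the proposed arch_prctl() operations.
 * Returns 0 on success, -1 if the request is refused or unknown.
 */
long ex_arch_prctl(struct ex_task *t, int op, unsigned long features)
{
    switch (op) {
    case EX_ARCH_X86_FEATURE_ENABLE:
        if (t->locked & features)
            return -1;
        t->enabled |= features;
        return 0;
    case EX_ARCH_X86_FEATURE_DISABLE:
        if (t->locked & features)
            return -1;
        t->enabled &= ~features;
        return 0;
    case EX_ARCH_X86_FEATURE_LOCK:
        t->locked |= features;
        return 0;
    default:
        return -1;
    }
}
```

In this model, a C library that enables shadow stacks at startup and then locks the feature leaves an attacker with no way to switch the protection off again through the same interface.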

While most applications need not worry about shadow stacks, some of them will need to be able to create new ones. Applications using makecontext() and friends are a prominent example. Creating a shadow stack requires kernel support; the associated memory must have the special page bits set as described above, and must also include the restore token. So there is a new system call for this operation:

    void *map_shadow_stack(unsigned long size, unsigned int flags);

The size of the desired stack is passed as size, while flags has a single possibility: SHADOW_STACK_SET_TOKEN to request that a restore token be stored in the stack. The return value on success is the address of the base of this stack.
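
What "storing a restore token" amounts to can be sketched against an ordinary array standing in for shadow-stack memory. The token layout used here (the token sits in the highest slot and holds the address just above itself, with bit 0 set to indicate 64-bit mode and bit 1 clear) is an assumption based on Intel's CET documentation rather than anything stated in the article, and both function names are invented.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_TOKEN_64BIT_MODE UINT64_C(1)   /* bit 0: token is for 64-bit mode */

/*
 * slots:     simulated shadow-stack memory (8-byte entries)
 * n_slots:   size of the stack in entries, at least 1
 * base_addr: address at which the stack is imagined to be mapped
 *
 * Writes a restore token into the highest slot and returns the
 * shadow-stack pointer value that the token encodes.
 */
uint64_t ex_place_restore_token(uint64_t *slots, size_t n_slots,
                                uint64_t base_addr)
{
    uint64_t top = base_addr + (uint64_t)n_slots * 8;

    slots[n_slots - 1] = top | EX_TOKEN_64BIT_MODE;
    return top;
}

/*
 * The kind of check performed before switching to a stack: the token
 * must be in 64-bit mode, have bit 1 clear, and point just above itself.
 */
int ex_token_is_valid(const uint64_t *slots, size_t n_slots,
                      uint64_t base_addr)
{
    uint64_t token_addr = base_addr + (uint64_t)(n_slots - 1) * 8;
    uint64_t token = slots[n_slots - 1];

    return (token & EX_TOKEN_64BIT_MODE) && !(token & UINT64_C(2)) &&
           (token & ~UINT64_C(3)) == token_addr + 8;
}
```

Because the token encodes its own location, it cannot simply be copied to some other address and used to conjure up a second valid shadow stack.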

Actually using this new stack is a matter of executing the RSTORSSP instruction to make the switch, most likely done as a part of a user-space context switch between threads. That instruction will perform the necessary verification of page permissions and the restore token before making the switch. It will also mark the token on the new shadow stack as being busy, preventing that stack from being used by any other process.

Applications doing especially tricky things may require the ability to write to the shadow stack. That access is normally not allowed for obvious reasons but, as Edgecombe noted, that "restricts any potential apps that may want to do exotic things at the expense of a little security". For the exotic case, another feature (LINUX_X86_FEATURE_WRSS) can be turned on with arch_prctl(); that, in turn, enables the WRSS instruction, which can write to shadow-stack memory. Directly writing to that memory by dereferencing a pointer is still disallowed in this case.

What next?

This work is not exactly new; an early version of it was covered in this 2018 article. In its previous incarnation, the shadow-stack patch set got up to version 30. The other half of the control-flow integrity work (indirect branch tracking), which got up to version 29 itself, has been set aside for the moment (though Peter Zijlstra has just shown up with a separate implementation). With a new developer heading up the work, a reduction in scope, and some asked-for changes, it is hoped that this work can finally make some progress toward the mainline.

In a number of ways, it looks like that hope might be realized. While there were comments on various parts of the patch set, there does not appear to be a lot of opposition to how it works at this point. Developers did express concern, though, about the lack of support for alternate signal stacks. That is a feature that seems certain to be wanted at some point, so it would be good to see how it fits into the whole picture before this functionality is merged.

There was also a separate subthread regarding problems with Checkpoint/restore in user space (CRIU), which engages in no end of underhanded tricks to get its job done. One part of the checkpoint process involves injecting "parasite" code into a process to be checkpointed, grabbing the needed information, then doing a special return out of the parasite to resume normal execution. That is just the sort of control-flow tampering that shadow stacks are meant to prevent. Various possible solutions were discussed, but nothing has appeared in code form at this point. A solution here, too, seems necessary before shadow stacks can be merged; as Thomas Gleixner put it: "We can't break CRIU systems with a kernel upgrade".

Finally, the range of supported hardware will almost certainly need to be expanded. Some AMD CPUs implement shadow stacks, evidently in a compatible manner, but only Intel CPUs are supported in this patch set; a lack of testing is cited as the reason. That, at least, will probably need to change for the work to go forward. Shadow stacks are also unsupported on 32-bit systems; fixing that may be harder and it is not clear whether the motivation to do that work exists. With or without 32-bit support, though, there is clearly still work to be done before this code enters the mainline. It should not be expected to show up in a near-future release.

Index entries for this article
Kernel: Security/Control-flow integrity
Security: Linux kernel



"Other operating systems" and "stealing" a bit

Posted Feb 21, 2022 17:01 UTC (Mon) by epa (subscriber, #39769) [Link] (15 responses)

I didn't understand this bit:
There are some PTE bits set aside for the operating system's use; Linux does not use them all and could have spared one for this purpose, but evidently certain other operating systems have no spare PTE bits, so stealing one for this purpose would not be welcome.
Surely if the system is running Linux, then Linux can use the PTE bits for whatever purposes it wants? And when it's booted into another operating system, that OS doesn't have to care what interpretation Linux would give to the bits. And I thought the page table entry was something purely OS specific anyway, but the article seems to imply it's provided by "the x86 architecture" itself.

"Other operating systems" and "stealing" a bit

Posted Feb 21, 2022 17:11 UTC (Mon) by corbet (editor, #1) [Link] (7 responses)

Page tables are defined and used by the hardware — on x86, at least. Some of the bits in the PTE are given over to software use. Others, including the "present" and "dirty" bits, along with the permission bits, are defined by the hardware, so how the architecture interprets them matters.

"Other operating systems" and "stealing" a bit

Posted Feb 21, 2022 17:31 UTC (Mon) by epa (subscriber, #39769) [Link] (6 responses)

Thanks for clarifying that the page table entry is defined by the CPU (on x86). But I still don't understand the comment about other operating systems. If there are certain bits reserved for the OS's use, then surely Linux can give any interpretation it wants to them, without worrying about what would happen if the machine were booted into a different OS.

"Other operating systems" and "stealing" a bit

Posted Feb 21, 2022 17:35 UTC (Mon) by epa (subscriber, #39769) [Link] (4 responses)

Oh, I see, it's a roundabout way of saying that all the PTE bits are already used, in practice, and there isn't one that can be used for this new feature. The interpretation of the bit would have to be done by the hardware, not by the OS. And while the hardware doesn't currently care what happens to Spare Bit Number Three, some OSes might be using that bit, so it would be a backwards-incompatible change for the hardware to start taking notice of it.

"Other operating systems" and "stealing" a bit

Posted Feb 21, 2022 18:26 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (3 responses)

What the processor could do would be to start using Spare Bit Number Three only if a certain bit has been set to 1 in a control register. For example, CR4.PKE and CR4.PKS control the interpretation of bits 59 to 62. However, our editor's (presumably informed) guess is that somebody in Redmond begged Intel not to do that for the shadow stack.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 17:00 UTC (Tue) by marcH (subscriber, #57642) [Link] (2 responses)

If a PTE bit has been given to software / the OSes, then it is not a "spare" bit anymore. This is not just about Redmond begging, it's about not "stealing" back something that was given.

I think all the confusion about these bits comes from a lack of clarity about _who owns what_ and how. A bit ironic considering this is a feature meant to catch memory corruption.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 20:42 UTC (Tue) by nix (subscriber, #2304) [Link]

Both Intel and software on Intel processors have learned this before: in the 286 days lots of software used reserved bits freely, and then the 386 started using them and all hell broke loose. It's literally impossible to do that now, because the processor stops you.

(ref: http://www.os2museum.com/wp/theres-more-to-the-286-xenix-...)

"Other operating systems" and "stealing" a bit

Posted Mar 6, 2022 6:17 UTC (Sun) by oldtomas (guest, #72579) [Link]

"If a PTE bit has been given to software / the OSes..."

The point Paolo is making is that there /is/ a protocol for the software/OS to tell the hardware "go ahead, use this bit for your shadow stack".

It seems that Redmond, though... well, we know that routine :-)

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 5:46 UTC (Tue) by geuder (subscriber, #62854) [Link]

I also had some difficulty understanding that part. You need to read further and it should become clear.

I guess the point is because the "other operating system" has no allocation for the extra bits the hardware designers chose not to use them for the new purpose. Instead they used the mentioned combination of previously used bits.

Linux has no problem to use the extra bits. CoW is now being moved to one of them.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 8:58 UTC (Tue) by jorgegv (subscriber, #60484) [Link] (3 responses)

In my opinion, if you remove the "operating" word, then all the paragraph makes much more sense to me:

"There are some PTE bits set aside for the operating system's use; Linux does not use them all and could have spared one for this purpose, but evidently certain other systems have no spare PTE bits, so stealing one for this purpose would not be welcome."

So this would mean that Linux on x86 would be able to use the spare PTE bits, but Linux on other systems would not have those spare PTE bits available, and thus a different solution, which could be available for Linux on all HW platforms, has been used.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 9:02 UTC (Tue) by johill (subscriber, #25196) [Link] (1 responses)

I don't think so, we're talking only about x86 (CPUs) here.

But the *CPU* couldn't take one of the "OS reserved" bits because non-Linux-OSes running on x86 hardware already use them for other purposes.

Hence the CPU repurposes an otherwise invalid (to it, anyway) combination, necessitating the Linux use of this combination be moved to the "OS reserved" bits.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 9:41 UTC (Tue) by jorgegv (subscriber, #60484) [Link]

Ah, yes, I have re-read that paragraph and the next couple of them, and now I understand. The COW bit combination needs to be migrated to use a new PTE bit because the HW engineers decided to use the same bit combination that Linux was using for COW.

Thanks for the heads up.

"Other operating systems" and "stealing" a bit

Posted Feb 22, 2022 17:51 UTC (Tue) by luto (subscriber, #39314) [Link]

I suspect that “other operating system” means Windows. Non-x86 is irrelevant here — no one expects any sort of page table compatibility across architectures.

"Other operating systems" and "stealing" a bit

Posted Feb 23, 2022 19:31 UTC (Wed) by jthill (subscriber, #56558) [Link] (2 responses)

It took me a bit too.

That part of the discussion is about hardware support for shadow stacks. The hardware PTEs have so-far-unused bits, left uninterpreted and unchecked so they're available to operating systems. Linux doesn't use all of those bits, but other operating systems do. So the hardware engineers implementing hardware-shadow-stack support were loath to claw back a PTE bit because it'd break those other operating systems.

"Other operating systems" and "stealing" a bit

Posted Feb 23, 2022 19:59 UTC (Wed) by zdzichu (subscriber, #17118) [Link] (1 responses)

It is interesting that the HW guys chose the solution which required modification in Linux. The "other system" (could we guess it was MS Windows?) used all bits, so instead of telling it "change your ways and do as Linux does", they told Linux "you have to change". Isn't that unfair?

"Other operating systems" and "stealing" a bit

Posted Feb 23, 2022 20:16 UTC (Wed) by corbet (editor, #1) [Link]

All operating systems will need changes to make use of this feature; it doesn't seem all that terrible if asking this particular change from Linux turned out to be the path of least resistance. Especially since Intel is actually doing the work to effect that change.

Shadow stacks for user space

Posted Feb 21, 2022 21:22 UTC (Mon) by mtaht (subscriber, #11087) [Link]

For other ways to think about this... http://millcomputing.com/wiki/Protection

Vaporware, but helpful to dream about, I suppose.

Shadow stacks for user space

Posted Feb 22, 2022 9:00 UTC (Tue) by roc (subscriber, #30627) [Link]

Not really looking forward to making rr support this.

Other operating systems

Posted Feb 24, 2022 4:02 UTC (Thu) by ncm (guest, #165) [Link] (2 responses)

Wondering... Linux doesn't use all the bits, so can claim one to label a COW page. But $other does use them all, already. So, $other can't use Linux's method. What can $other do, instead? Has it already used one of those bits to tag a COW page, i.e. it already does what Linux will do?

And, there is more than one $other. Does anybody know whether all of them can make this work? Can Intel have succeeded in quizzing everybody who has an OS that might want shadow stacks?

Other operating systems

Posted Feb 28, 2022 14:44 UTC (Mon) by jtaylor (subscriber, #91739) [Link]

As I understood it, because others do not have free bits, the CPUs use the combination of write-enable and dirty bits.

Linux uses that combination already but that can be changed as it has free bits.

So an OS only has a problem if it has no free page bits and also interprets the write-enable + dirty combination itself. Such an OS is probably sufficiently unlikely to exist in the real world.

Other operating systems

Posted Mar 19, 2022 8:12 UTC (Sat) by cpitrat (subscriber, #116459) [Link]

IIUC, COW is purely software so Linux can do whatever it wants with the invalid bits combinations. Shadow stack requires hardware support too (unless I'm mistaken) so it's a different story.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds