Reliable user-space stack traces with SFrame
Rostedt began by saying that obtaining a full stack trace of a user-space process is useful for a number of purposes. It is needed for accurate profiling, so both perf and ftrace make use of stack traces. BPF programs, too, can benefit from a picture of the state of the call stack.
The traditional way to reliably obtain stack frames is to build the program in question with frame pointers. The frame pointer is simply a CPU register that is dedicated to containing the base address of the current stack frame. That frame will include a saved copy of the previous frame pointer, indicating where the previous frame began. The kernel (or any other program) can thus follow the chain of frame pointers to locate each frame on the stack. Without frame pointers, the kernel's perf subsystem must instead copy a large portion of the stack at each event for later postprocessing with a DWARF unwinder. That is a costly thing to do.
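The frame-pointer chain described above can be sketched as a small simulation. This is only an illustration, not kernel code: the `memory` dictionary stands in for the stack, the function name is made up, and the layout (saved frame pointer at `[fp]`, return address at `[fp + 8]`) is an assumed x86-64-style convention.

```python
# Simulated stack memory: address -> 8-byte word stored at that address.
# Assumed layout: [fp] holds the caller's saved frame pointer and
# [fp + WORD] holds the return address into the caller.
WORD = 8

def walk_frame_pointers(memory, fp, max_depth=64):
    """Collect return addresses by following the chain of saved frame pointers."""
    trace = []
    for _ in range(max_depth):
        if fp == 0 or fp not in memory:
            break  # end of chain, or an address we cannot read
        return_address = memory.get(fp + WORD)
        if return_address is None:
            break
        trace.append(return_address)
        next_fp = memory[fp]
        # The stack grows down, so a caller's frame must sit at a higher
        # address; anything else is a corrupt chain, so stop walking.
        if next_fp != 0 and next_fp <= fp:
            break
        fp = next_fp
    return trace

memory = {
    0x7000: 0x7040, 0x7008: 0x401234,  # innermost frame
    0x7040: 0x7080, 0x7048: 0x401500,
    0x7080: 0x0,    0x7088: 0x401900,  # outermost frame: saved fp of 0 ends the chain
}
frames = walk_frame_pointers(memory, 0x7000)
print([hex(address) for address in frames])
```

The walk needs only two reads per frame, which is why frame pointers make unwinding cheap at trace time even though they cost a register at run time.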
But frame pointers are not free either. Managing the frame pointer requires some setup code to run at the entry to every function. Using a register for the frame pointer makes a scarce CPU register unavailable for other uses, slowing program execution. As described in this article, building user space with frame pointers can lead to measurable performance regressions, which makes their use controversial.
The kernel, Rostedt continued, has a stack unwinder called ORC that is much simpler than DWARF. It was added in the 4.14 release to support live patching — another application that needs reliable stack traces. The kernel's objtool utility creates the ORC data at build time and adds two tables to a section in the kernel executable: orc_unwind to hold stack-frame information, and orc_unwind_ip to map instruction pointer values to the appropriate unwind entry.
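The two-table arrangement lends itself to a binary search: find the last instruction-pointer entry at or below the address of interest, then use the unwind entry at the same index. The sketch below assumes that arrangement; the `UnwindEntry` fields are simplified and hypothetical, and do not reproduce the real ORC entry layout.

```python
import bisect
from typing import NamedTuple, Optional, Sequence

class UnwindEntry(NamedTuple):
    # Simplified, hypothetical fields; the real ORC entry layout differs.
    cfa_offset: int  # distance from the stack pointer to the caller's frame
    ra_offset: int   # where the return address sits relative to that frame

def find_unwind_entry(ip_table: Sequence[int],
                      unwind_table: Sequence[UnwindEntry],
                      ip: int) -> Optional[UnwindEntry]:
    """Return the entry covering ip, i.e. the one with the largest start <= ip."""
    index = bisect.bisect_right(ip_table, ip) - 1
    if index < 0:
        return None  # ip lies before the first described instruction
    return unwind_table[index]

# Parallel tables: ip_table[i] is the first address that unwind_table[i] covers.
ip_table = [0x1000, 0x1001, 0x1004, 0x1020]
unwind_table = [
    UnwindEntry(8, -8),
    UnwindEntry(16, -8),
    UnwindEntry(48, -8),
    UnwindEntry(8, -8),
]
entry = find_unwind_entry(ip_table, unwind_table, 0x1010)
print(entry)
```

Keeping the sorted addresses in their own table means the search touches only small, densely packed keys; the larger unwind entries are read once, after the index is known.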
SFrame is based on ORC; it provides the same mechanism, but for user space rather than the kernel. When an executable is built with SFrame data, the kernel can create full stack traces without the need for frame pointers. There is always a cost, of course; in this case, developers are sacrificing a bit of disk space (to hold the SFrame data) for speed. This data is read, if needed, in the kernel's ptrace() path, so it doesn't affect execution when it is not needed. Some additional effort was required to handle some user-space complications; for example, since binaries are relocatable, there must be a mechanism to apply the correct offsets to the SFrame data.
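The relocation problem can be illustrated with a little arithmetic. The sketch assumes, purely for illustration, that the unwind data is indexed by link-time addresses, so a run-time instruction pointer must be translated back before any lookup; the `Mapping` type and its `link_base` field are hypothetical names, not part of SFrame or the kernel.

```python
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    start: int      # run-time address where the text was mapped
    end: int        # first address past the mapping
    link_base: int  # address the same text had at link time (hypothetical field)

def to_link_time_address(mapping: Mapping, ip: int) -> Optional[int]:
    """Translate a run-time instruction pointer into the address space used
    by the on-disk unwind data, or return None if ip is outside the mapping."""
    if not (mapping.start <= ip < mapping.end):
        return None
    return ip - mapping.start + mapping.link_base

text = Mapping(start=0x7f0000001000, end=0x7f0000005000, link_base=0x1000)
print(hex(to_link_time_address(text, 0x7f0000001234)))
```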
Rostedt provided an overview of how SFrame support would work in the kernel. The generation of a perf event starts with a non-maskable interrupt (NMI), which ends up in the perf code. If a stack trace is called for, then the kernel will make an attempt to read the call stack; if that encounters a page fault, then there will be no stack trace for this event. He would like to change that code to look for the SFrame data instead. The NMI handler would set a flag indicating that there is work to be done before returning to user space; the ptrace() path would see that flag and reconstruct the stack trace in user context. Among other things, that would make it possible to recover the stack even if page faults occur while reading it.
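The deferred scheme can be modeled in a few lines. This is a toy simulation under stated assumptions, not the proposed kernel implementation: `Task`, `nmi_sample()`, and `return_to_user()` are invented names, and the `unwind` callback stands in for whatever would read the SFrame data.

```python
class Task:
    """Toy stand-in for a task with a deferred-work flag and an event log."""
    def __init__(self, name):
        self.name = name
        self.unwind_pending = False
        self.events = []

def nmi_sample(task, timestamp):
    # NMI context: faults cannot be handled here, so only record the
    # sample and note that a stack trace is owed.
    task.events.append(("sample", timestamp))
    task.unwind_pending = True

def return_to_user(task, unwind):
    # Task context, just before resuming user space: page faults are
    # allowed here, so the stack can be read safely.
    if task.unwind_pending:
        task.unwind_pending = False
        task.events.append(("stack", unwind(task)))

task = Task("demo")
nmi_sample(task, 1)
nmi_sample(task, 2)  # a second sample arrives before the task returns to user space
return_to_user(task, lambda t: ["main", "parse", "read_token"])
return_to_user(task, lambda t: ["unused"])  # nothing pending, so nothing is emitted
print(task.events)
```

Both samples end up sharing a single stack record, which is the situation the later discussion of the ring buffer turns on.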
This approach would require some changes to the user-space perf tool as well. The initial perf event, generated at NMI time, will not include the call stack (which will not be obtained until later), so it will, instead, have a bit set saying "a stack trace is coming". There may be several intervening events generated before that stack trace finally shows up in the ring. Joel Fernandes asked whether the kernel could just reserve space in the ring buffer at NMI time, then fill it in later. Rostedt answered that the ring may end up with multiple events all with the same stack trace; reserving that space for each would end up wasting space.
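On the tool side, the matching step might look something like the following sketch. The record format is invented for illustration (tuples tagged `"sample"`, `"stack"`, or anything else), and it assumes a single thread; a real tool would need to match records per thread.

```python
def attach_deferred_stacks(records):
    """Attach each later "stack" record to every sample still waiting for one."""
    resolved = []
    pending = []
    for record in records:
        kind = record[0]
        if kind == "sample":
            sample = {"time": record[1], "stack": None}
            resolved.append(sample)
            pending.append(sample)
        elif kind == "stack":
            # One stack record satisfies all samples taken since the last one.
            for sample in pending:
                sample["stack"] = record[1]
            pending.clear()
        # Any other record kind is an intervening event and is ignored here.
    return resolved

records = [
    ("sample", 1),
    ("sample", 2),
    ("other",),
    ("stack", ["main", "worker"]),
    ("sample", 3),  # its stack has not arrived yet
]
resolved = attach_deferred_stacks(records)
print(resolved)
```

Because one stack record serves several samples, reserving ring space per sample at NMI time would store the same trace repeatedly, which matches the objection Rostedt raised.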
Rostedt concluded his part by saying that the stack is unlikely to be swapped out, so generating the trace will not normally create I/O to fault pages back in. That said, generating the trace will need to bring in some other data, since the SFrame tables are stored in the executable on disk. The SFrame data should only be mapped when it is actually used, so the first use within a process will cause a brief stall while that mapping takes place.
Bhagat (who has done much of the work to implement this functionality) said that there could perhaps be a problem with code in parts of the kernel that are written in assembly. The non-standard stack usage in that code may well confuse the unwinder. It remains to be seen whether unwinding through those parts of the kernel is important, she said.
Another potential issue is that the SFrame data is stored unaligned on the disk; that can lead to unaligned memory accesses in the kernel. Avoiding that requires a certain amount of copying of data, "weird casts", and such. The alternative, forcing the data to be aligned, would bloat the format, though. There seemed to be agreement that storing the data unaligned is the best solution, and that there was no need to change it.
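The size trade-off can be shown with a made-up record of one 1-byte field followed by one 4-byte field; this layout is an assumption for illustration and is not the SFrame format. Packed, each record takes 5 bytes and the wider field lands on odd offsets; padded for alignment, it takes 8.

```python
import struct

PACKED = struct.Struct("<BI")    # 1-byte field immediately followed by a 4-byte field
PADDED = struct.Struct("<B3xI")  # same fields plus 3 padding bytes for 4-byte alignment

def read_packed(blob, index):
    """Decode record number index from a packed array. unpack_from copies the
    bytes out rather than dereferencing an aligned pointer, so the 4-byte
    field may start at any offset."""
    return PACKED.unpack_from(blob, index * PACKED.size)

blob = PACKED.pack(1, 0xDEADBEEF) + PACKED.pack(2, 0x12345678)
second = read_packed(blob, 1)
print(PACKED.size, PADDED.size, second)
```

In C, the equivalent of the copy done here would typically be a `memcpy()` into a local variable instead of a cast to a wider pointer type.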
Other outstanding problems include the need to handle dlopen(), which maps executable text from another file into a range of the calling process's memory. This issue could perhaps be addressed by adding a system call to tell the kernel where the SFrame data for a given executable mapping can be found. Just-in-time compiled code is also a problem; when there is no backing file for a mapping, there is no SFrame data either.
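One way to picture the bookkeeping behind such a system call is a table mapping executable address ranges to the location of their SFrame data. The class below is a hypothetical stand-in, not an existing or proposed kernel interface; the names and the string used for the data location are invented.

```python
import bisect

class SFrameRegistry:
    """Hypothetical per-process table: executable range -> SFrame data location."""
    def __init__(self):
        self._starts = []   # sorted start addresses, parallel to _entries
        self._entries = []  # (start, end, sframe_location)

    def register(self, start, end, sframe_location):
        # What a dynamic loader might report after mapping a library.
        index = bisect.bisect_left(self._starts, start)
        self._starts.insert(index, start)
        self._entries.insert(index, (start, end, sframe_location))

    def lookup(self, ip):
        # Find the range with the largest start <= ip, then check its end.
        index = bisect.bisect_right(self._starts, ip) - 1
        if index < 0:
            return None
        start, end, location = self._entries[index]
        return location if ip < end else None

registry = SFrameRegistry()
registry.register(0x7f0000200000, 0x7f0000240000, "libplugin.so:.sframe")
registry.register(0x400000, 0x420000, "app:.sframe")
print(registry.lookup(0x7f0000200abc))
print(registry.lookup(0x7f0000900000))  # e.g. JIT-generated code: nothing registered
```

A lookup that returns nothing corresponds to the just-in-time case from the discussion: with no backing file, there is no SFrame data for the unwinder to consult.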
As the session concluded, the sentiment in the room seemed to be that SFrame would be a nice tool to have and that this work should continue.
Index entries for this article | |
---|---|
Kernel | Development tools/SFrame |
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2023 |
Posted May 23, 2023 0:32 UTC (Tue)
by brenns10 (subscriber, #112114)
[Link] (6 responses)
This is such an exciting project; it's the "have your cake and eat it, too" approach to stack unwinding. No extra code generated for frame pointers, no wasted register or icache. But still reliable unwinding without relying on the full DWARF debuginfo.
Hopefully this becomes standard along with CTF for lightweight introspection. Programs may want to unwind their own stack or examine the layout of data structures, so there are already good use cases. What's more, debuggers can do a lot with a symbol table, a reliable unwinder, and the basic information about types provided by CTF. While DWARF is better suited for development tasks, these smaller formats could fill a niche for basic diagnostics in production environments where debuginfo isn't available.
Posted May 23, 2023 7:25 UTC (Tue)
by roc (subscriber, #30627)
[Link] (5 responses)
I'm worried that people who want to build binaries with full debugging information or just stack traces with parameter values are going to have to build even *bigger* binaries with both DWARF and SFrame information.
Posted May 23, 2023 8:40 UTC (Tue)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Posted May 23, 2023 9:52 UTC (Tue)
by atnot (subscriber, #124910)
[Link] (2 responses)
I don't know how much of this is needed to only do unwinding, but the idea of DWARF in the kernel is a very spooky prospect to me.
Posted May 23, 2023 9:59 UTC (Tue)
by atnot (subscriber, #124910)
[Link]
Posted May 24, 2023 5:09 UTC (Wed)
by lathiat (subscriber, #18567)
[Link]
Posted May 25, 2023 2:53 UTC (Thu)
by himi (subscriber, #340)
[Link]
I think it'd be pretty similar to the dlopen() scenario, except that instead of just pointing at existing SFrame data for the object it'd generate the data from another source first.
Posted May 23, 2023 3:14 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted May 23, 2023 13:29 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
I'm not sure what this means. The mechanism for unwinding (in-kernel, copies to userspace, whatever) is orthogonal to the format being used (DWARF, SFrame, ORC): they can presumably all be unwound using code running in many contexts. They're just formats after all.
But... in general in a signal handler you can't do anything useful involving the process you're running inside -- in particular you can't use stdio or allocate memory and more or less arbitrary locks might be taken out, and that's when nothing has gone wrong: and if you're backtracing quite often it's because all hell has broken loose and the program might be in any state at all. glibc removed the machinery that gave (fp-based) backtraces on stack-protector failure for a reason.
One attractive-sounding alternative suggested at a past LPC is to use a coredump handler: that is given an image of as much or as little of the process as you wish to configure (this stuff is customizable in /proc) and can do whatever it wants because it's a completely separate process that nothing has gone wrong with and which isn't in a signal handler and has no unexpected locks or half-completed mallocs fouling things up. But a signal handler? The more you do with signals, the more pain you'll eventually be in, and that goes double if the process is halfway through crashing!
Posted May 23, 2023 23:32 UTC (Tue)
by eklitzke (subscriber, #36426)
[Link]
Posted May 23, 2023 7:29 UTC (Tue)
by izbyshev (subscriber, #107996)
[Link] (1 responses)
Why would this problem be specific to dlopen()? ISTM it's the same for any dynamically-linked executables (even if they don't use dlopen()). Dynamic linking happens in user space, so the kernel currently learns about libraries only indirectly (by seeing them mmap'ed for execution).
Posted May 23, 2023 11:31 UTC (Tue)
by nevets (subscriber, #11875)
[Link]
Posted May 23, 2023 13:24 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted May 23, 2023 16:38 UTC (Tue)
by SLi (subscriber, #53131)
[Link] (1 responses)
Posted May 23, 2023 17:34 UTC (Tue)
by ibhagat (subscriber, #133641)
[Link]
Posted May 23, 2023 17:29 UTC (Tue)
by ibhagat (subscriber, #133641)
[Link] (1 responses)
The commonality between SFrame and ORC is that both encode the stack offsets directly. Beyond that, though, the two formats diverge enough to be quite different: SFrame is generated by the toolchain, supports AMD64 and AArch64 (AAPCS64), and has compactness-related optimizations in its on-disk representation; ORC is designed for the kernel stack-tracing use case.
Just saying..."SFrame is based on ORC" can be misleading.
Posted May 23, 2023 20:50 UTC (Tue)
by nevets (subscriber, #11875)
[Link]
https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-...
https://news.ycombinator.com/item?id=33788794