
Leading items

Welcome to the LWN.net Weekly Edition for July 11, 2024

This edition contains the following feature content:

  • New features in C++26: hazard pointers, user-space RCU, and other changes headed for the next version of the language.
  • Sxmo: a text-centric mobile user interface: a minimalist, script-driven environment for Linux phones.
  • Another try for getrandom() in the vDSO: faster secure random numbers for user space.
  • Offload-friendly network encryption in the kernel: a proposal to support Google's PSP protocol.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

New features in C++26

By Daroc Alden
July 5, 2024

ISO releases new C++ language standards on a three-year cadence; now that it's been more than a year since the finalization of C++23, we have a good idea of what features could be adopted for C++26 — although proposals can still be submitted until January 2025. Of particular interest is the addition of support for hazard pointers and user-space read-copy-update (RCU). Even though C++26 is not yet a standard, many of the proposed features are already available to experiment with in GCC or Clang.

New threading libraries

Hazard pointers are a technique for building lock-free concurrent code. In a system with hazard pointers, each thread keeps a list of shared objects that it is currently accessing. Other threads can use that information to avoid modifying or freeing objects that are still in use.

In the proposed C++ library, a thread that wants to free a potentially shared object instead "retires" it, passing the responsibility for reclaiming the object to the library. Retiring an object is done atomically, such that there can be no new readers after an object is retired. When an object is retired, the library checks to see whether any existing hazard pointers reference it. If it is still referenced, it is added to a set of retired objects, which are rechecked periodically (when another object is retired, not on a timer). The object is freed once no hazard pointers reference it.

The proposed interface requires any classes protected by a hazard pointer to be a subclass of hazard_pointer_obj_base. Then, users can call make_hazard_pointer() to create a hazard pointer. Calling the protect() method of the returned hazard pointer protects the object with the pointer, and the thread can use the object normally. A thread that wants to retire an object calls object->retire(), which hands it over to the library. In all, the proposed API would be used like this:

    // Register a new (empty) hazard pointer with the library
    hazard_pointer hp = make_hazard_pointer();

    // Acquire a pointer to the object that needs protecting
    const atomic<T*>& object = ...;

    // Protect the pointer by putting it in the hazard_pointer
    T* normal_pointer = hp.protect(object);

    ... // Perform operations using normal_pointer

    // Remove the object from the hazard_pointer once done,
    // or let RAII clean up the hazard_pointer.
    hp.reset_protection(normal_pointer);

    // Meanwhile, another thread could retire the object:
    object.load()->retire();

Hazard pointers aren't the only addition, however. User-space RCU support has also been proposed for inclusion. RCU is a technique that is widely used in the Linux kernel. Access to an object protected by RCU is done through a pointer; when a thread wants to change the object, it first makes a separate copy, and then edits that copy. Finally, the thread atomically swaps the pointer to point to the new copy. Using a compare-and-exchange instruction lets the thread know that it will need to try again if it happened to contend with another writer. The library also keeps track of some information to determine when all readers have finished with the old version of the object, allowing the writing thread to free it. The exact details can vary between implementations, so the proposed API doesn't mandate a particular approach to ensuring that readers are done.

Like the hazard-pointer proposal, the new RCU library defines an rcu_obj_base class that objects which will be protected by RCU can inherit from. Unlike with hazard pointers, this inheritance is not required; rcu_retire() can be used on objects that are not descendants of that class.

Both libraries already have reference implementations available. There are several user-space RCU libraries cited, but both proposals list the Apache-2.0-licensed folly library as the primary reference implementation. If accepted, these features will become part of the C++ standard library.

Other library changes

There's also a good number of proposals for features to include in the standard library that are smaller or less significant. C++26 could see a debugging header that supplies a breakpoint() function, a linear algebra header that incorporates features from BLAS, and a text encoding header that lets users access the IANA Character Sets registry — the official list of character sets that can be used on the internet.

There's a long list of fixes and updates to other parts of the standard library, including changes to charconv functions, several updates to formatting and printing, stable sorting at compile time, changes to make more types usable as map keys, and several removals of deprecated items.

New core language changes

Also planned for the new standard are changes to the language itself. Some are relatively small changes; C++26 will probably contain a clarification of what it means for an attribute to be ignorable, and a change to the wording describing how to determine the length of an array when some of the initializers use the brace elision feature. There are also a handful of small fixes for string literals, such as defining previously undefined behavior during lexing (including specifying that unterminated string literals are ill-formed), specifying that characters that can't be encoded in the source file's encoding are not permitted in string literals, and clarifying when string literals should be evaluated by the compiler.

But most of the upcoming changes are a good deal more interesting. One proposal would turn some infinite loops into defined behavior. This is particularly important for the correctness of some low-level code that uses infinite loops intentionally as a way to halt a thread. Infinite loops had originally been made undefined behavior in order to allow optimizing compilers to assume that programs made forward progress; in general, determining whether a loop condition ever becomes false is equivalent to the halting problem. If the compiler is allowed to assume that a loop will eventually halt, it can use that information to make optimizations that are otherwise not possible.

C and C++ have slightly different requirements for how implementations must treat infinite loops, however. C has an explicit exception for loops with constant control expressions. So, since C11, these two loops have been meaningfully different:

    int cond = ...;

    while (cond) {
        // ...
    }

    while (true) {
        if (!cond)
            break;
        // ...
    }

The former loop can be assumed to eventually terminate, but the latter loop cannot. In contrast, C++11 allows the compiler to assume that both loops must eventually terminate. The proposal would change these rules so that "trivial" infinite loops — those with empty bodies — are not considered undefined behavior. This is different from what C specifies, but it would allow low-level code that actually intends to have a CPU spin in an empty loop.

Another change would allow casting from void* to other types in constexpr, which is C++'s mechanism for compile-time code execution. Over time, the language has slowly been expanding what is possible at compile time, mostly in the form of library changes to mark functions as constexpr. But there are still some fundamental restrictions around constexpr code, many of which deal with memory. Allowing constexpr code to use void pointers is another step toward loosening those restrictions.

That isn't the only compile-time improvement — there are proposals to have static_assert() take a user-supplied message, as well as adding messages to the = delete syntax (which allows the programmer to suppress the generation of methods that the compiler normally provides, such as copy constructors). Both of those changes could make it easier to communicate the reasoning behind compile-time checks and make writing maintainable code easier. There is also a proposal to make binding a returned glvalue to a temporary value ill-formed. A glvalue is any expression where its evaluation determines the identity of an object or function — a "generalized lvalue" that could have something assigned to it. In other words, code like this will no longer be accepted:

    const std::string_view& getString() {
        static std::string s;
        return s;
    }

If getString() had returned a std::string_view (or the std::string itself) by value, it would be — and remains — valid. The problem comes because std::string_view is a non-owning view of a string: returning s implicitly converts the static std::string into a temporary std::string_view, the returned reference binds to that temporary, and the temporary is destroyed when the function returns, leaving the reference dangling. Languages like Rust solve this problem with an ownership system; C++26 would not go that far — it would still be possible to write a function that returns a dangling reference. But it would become a bit more difficult to do by mistake, since returned references bound to temporary values (such as the one created by the implicit conversion to std::string_view) would be rejected by the compiler.

Template improvements

C++26 also has a few improvements planned for the template system. In C++, when a template takes a parameter with a "..." before the name, the compiler creates a special structure holding multiple parameters called a "pack". A small proposal would allow the [] operator to index into a pack in a template. To distinguish this from a normal index operation, the proposal provides new syntax for accessing an element of a pack: name...[]. For example:

    template <typename... T>
    constexpr auto first_plus_last(T... values) -> T...[0] {
        return T...[0](values...[0] + values...[sizeof...(values)-1]);
    }

Packs can also be used in more places, with one change permitting them in the friend declaration of a class. This change is potentially useful because it permits programmers to use templates to implement the passkey idiom, a technique for exposing given methods only to particular classes.

The last proposed changes to the language itself to date include two changes to variable bindings: allowing attributes on structured bindings and allowing _ in a binding to discard a value. Finally, there is an obscure change to how compilers are required to initialize template parameters, and a change to braced initializers that makes them more efficient.

C++ may not be the first language people consider when thinking about evolving languages, but it still sees several important improvements in each edition. If no problems crop up, these changes will likely be accepted during the C++26 standardization meetings next year. In the meantime, there is still an open window to reflect on these changes and contribute additional suggestions.

[Thanks to Alison Chaiken for the suggestion to cover user-space RCU that led to this article.]

Comments (134 posted)

Sxmo: a text-centric mobile user interface

July 10, 2024

This article was contributed by Koen Vervloesem

Sxmo, short for "Simple X Mobile", is described on its web site as "a minimalist environment for Linux mobile devices"; it offers a menu-driven interface that is controlled with the phone's hardware buttons. Sxmo enables the user to send SMS messages from a text editor and is entirely customizable with shell scripts. This peculiar mobile user interface significantly differs from the prevailing approach—but it works.

[Sxmo main menu]

While mobile user interfaces such as Phosh, KDE Plasma Mobile, and Lomiri have some differences between them, they are all rooted in the same philosophy. They center on touch-based interactions and display apps through icons, an approach influenced by the conventional point-and-click paradigm of desktop user interfaces. However, on the desktop, a text-centric approach centering on keyboard input and terminal programs following the Unix philosophy has remained popular among advanced users. Sxmo aims to offer such an environment for mobile devices.

There are two flavors of Sxmo: one based on Xorg and one on Wayland. The Xorg version is based on a couple of forks of tools from the suckless project, which has "a focus on simplicity, clarity, and frugality". This includes the dynamic window manager dwm, the menu system dmenu, and the simple terminal emulator st. The Wayland version uses some tools inspired by their Xorg counterparts, including the tiling Wayland compositor Sway, the menu system bemenu, and the terminal emulator foot. In practice, both versions work similarly, with some minor low-level differences in configuration.

Sxmo is best supported on the postmarketOS Linux distribution for mobile devices (previously covered here). Pre-built images of postmarketOS with Sxmo for various devices can be found on the download page of the distribution. Alternatively, a custom image can be built by running pmbootstrap init and choosing sxmo-de-sway (for the Wayland version) or sxmo-de-dwm (for the Xorg version) as the interface. I tested Sxmo's Wayland version by installing a custom-built postmarketOS image.

Menu-driven interface

Since interacting with Sxmo deviates significantly from other mobile interfaces, its user guide is required reading, especially to learn about the actions behind the phone's hardware buttons. Most phones have three buttons on the side: volume-up, volume-down, and power. For each of these buttons, Sxmo triggers an action based on whether you tap the button once, twice, or thrice; a long-press can be used instead of tapping three times. This way, the user can start nine different actions solely by using the hardware buttons, just with the thumb. Touch-based input also works.

Sxmo's home screen is merely a background image featuring the current date and time, with a status bar on top that includes a workspace number (only one at first), and, next to it, status icons for the mobile network, WiFi connection, battery, volume, lock state, and time. The global system menu opens when the user taps on the volume-up button or swipes down from the top of the screen. While within a menu, the hardware buttons exhibit a different behavior: the volume-up button navigates to the previous item, while the volume-down button advances to the next item. The power button selects the current item. Tapping an item on the touch screen also selects it.

The global system menu gives access to various scripts and applications. Several menu items open an application in a terminal window upon selection. For instance, scanning for WiFi networks starts the nmcli d wifi list command. Configuring the phone (under the Config submenu) enables setting the brightness, enabling or disabling touch, gestures, and Bluetooth, as well as upgrading packages, among other things.

SMS messages and calls

[Sxmo receiving call]

Sending an SMS message requires typing a phone number or selecting a person from the contact list first. Then the Vim-like text editor vis is opened to compose the message. After the user exits the editor and confirms, the message is sent. An incoming SMS message briefly appears on the home screen; it can also be read later from the global system menu, where it's included in a Notifications menu item, and in the Texts submenu. Upon receipt of a new text message, the phone's LED emits a green light and the vibration motor momentarily triggers.

Calling uses a similar text-centric process. To place a new call, open the Dialer submenu in the global system menu and enter a phone number or choose an entry from the contact list. After selecting the number, Sxmo starts calling, and once the call connects, a menu appears with options to hang up the call, manage audio routing, and more. An incoming call triggers the phone's green LED and the vibration motor, and a menu appears that enables the user to accept or dismiss the call.

Mobile terminal

When no menu is active, a single tap on the volume-down button shows or hides the virtual keyboard at the bottom of the screen. This action can also be performed by swiping up and down from the bottom of the screen. One tap on the volume-up button launches an application-specific context menu for the currently focused window of a supported application. Three taps or holding the power button opens the terminal emulator. And three taps (or hold) on the volume-down button terminates the currently focused window.

Although working with terminal applications on a primarily touch-based device might not seem like a good match, Sxmo defines some one-finger swipe gestures that ease the experience. For example, swiping from left to right along the bottom edge sends a Return key to the application. Similarly, swiping from right to left along the bottom sends Backspace. Swiping top to bottom along the right edge sends an arrow-down key, while swiping bottom to top along the right edge sends an arrow-up key. Swiping right to left onto the left edge sends a left-arrow key, while swiping left to right onto the right edge sends a right-arrow key. These gestures allow the user to scroll through the shell history and the current command without the need to open the virtual keyboard and use up precious screen space. The virtual keyboard does not support swipe typing, however.

Plain-text files and shell scripts

Sxmo's configuration is entirely based on plain-text files. For example, contact details are stored in a tab-separated file in ~/.config/sxmo/contacts.tsv. This file has two columns: first the phone number, followed by the contact name. Contacts can be added either by manually editing the file or from the Contacts entry in the global system menu. Similarly, the ~/.config/sxmo/block.tsv file lists phone numbers and corresponding contact names that the user wishes to block.

Sxmo's behavior is defined in the ~/.config/sxmo/profile file, as well as in dozens of hooks, which are shell scripts with a specific name. For example, when the phone is receiving an incoming call, the shell script in /usr/share/sxmo/default_hooks/sxmo_hook_ring.sh is executed with the first argument ($1) set to the contact name or incoming number. This script can be overridden by the ~/.config/sxmo/hooks/sxmo_hook_ring.sh file if the user wants to change the default behavior. User scripts can also be added to the global system menu for custom functionality. The postmarketOS wiki hosts a tips and tricks page for Sxmo, featuring some helpful advice and configuration snippets. For a deeper exploration under the hood, Sxmo's system guide offers an excellent source of information.

Contributing to Sxmo

Much of Sxmo's core functionality is built upon shell scripts. While this allows complete customization, it requires some Linux and shell-scripting knowledge. The developers actively invite contributions, especially for device profiles, which are shell scripts that Sxmo loads early on to set some attributes that ensure the devices work well with Sxmo.

Sxmo was originally designed for the PinePhone. Meanwhile, support for a few other devices has been added, including the Librem 5 and the Fairphone 4, but also the 15-year-old Nokia N900 with a physical keyboard and the Kobo Clara HD e-reader. Users trying to run Sxmo on an unsupported device can find some instructions to create a device profile in the sxmo-utils repository linked above. The result is a shell script that exports some environment variables for the touch-screen device, display output, screen scale, buttons, and more.

In the last two years, Sxmo has had eight releases, slowing down a bit during the last year. The most recent was Sxmo 1.16.1 from June 3, which was a minor release. It had some tiny improvements in the control of the suspend functionality along with initial device profiles for PINE64's PineTab 2 tablet and Xiaomi's Redmi Note 4 phone. There's a ticket list tracking bugs and feature improvements, but it doesn't seem very active. The sxmo-devel mailing list does get patches regularly for various repositories of the project.

With Sxmo's focus on delivering a minimalist environment, it is no surprise that the project doesn't intend to switch to systemd, as postmarketOS has done. In the FAQ about the switch, the postmarketOS developers make it clear that Sxmo will be sticking with OpenRC for its init system.

Conclusion

Sxmo is even less suited for most users than postmarketOS. However, for tech-savvy, seasoned Linux users who want to have a lightweight and completely scriptable interface on their phone, Sxmo provides a different path from the conventional mobile user interfaces. It takes some time to adjust to the hardware buttons to control the phone's menu-based interface, but after some time, it works remarkably well.

Comments (3 posted)

Another try for getrandom() in the vDSO

By Jonathan Corbet
July 4, 2024
Random numbers, it seems, can never be random enough, and they cannot be generated quickly enough. The kernel's getrandom() system call might, after years of discussion, be seen as sufficiently secure by most users, but it is still a system call. Linux system calls are relatively fast, but they are necessarily slower than calling a function directly. In an attempt to speed the provision of secure random data to user space, Jason Donenfeld has put together an implementation of getrandom() that lives in the virtual dynamic shared object (vDSO) area.

Random data is used in no end of applications, including modeling of natural phenomena, generation of identifiers like UUIDs, and game play; it's how NetHack somehow summons those three balrogs all in one corner of a dark room. Security-related operations, such as the generation of nonces and keys, also make heavy use of random data — and depend heavily on that data actually being random. Some applications need so much random data that they spend a significant amount of time in the getrandom() system call.

One possible solution to this problem is to generate random numbers in user space, perhaps seeded by random data from the kernel; many developers have taken that route over the years. But, as Donenfeld explains in the cover letter to his patch series, that approach is not ideal. The kernel has the best view of the amount of entropy in the system and what is needed to generate truly random data. It is also aware of events, such as virtual-machine forks, that can compromise a random-number generator and make a reseeding necessary. He concluded:

The simplest statement you could make is that userspace RNGs that expand a getrandom() seed at some point T1 are nearly always *worse*, in some way, than just calling getrandom() every time a random number is desired.

Always calling getrandom() ensures the best random data, but the associated performance problem remains. Moving that function into the vDSO can help to address that problem.

getrandom() in the vDSO

The vDSO is a special mechanism provided to accelerate tasks that require some kernel involvement, but which can otherwise be carried out just as well in user space. It contains code and data provided in user space directly by the kernel in a memory area mapped into every thread's address space. The classic vDSO function is gettimeofday(), which returns the current system time as kept by the kernel. This function can be implemented as a system call, but that will slow down applications that frequently query the time, of which there are many. So the Linux vDSO includes an implementation of gettimeofday(); that implementation can simply read a time variable in memory shared with the kernel and return it to the caller, avoiding the need to make a system call.

getrandom() is a similar sort of function; it reads data from the kernel and returns it to user space. So a vDSO implementation of getrandom() might make sense. Such an implementation must be done carefully, though; it should return data that is just as random as a direct call into the kernel would, and it must be robust against the types of events (a fork, for example) that could compromise the state of a thread's random-number generation.

In Donenfeld's implementation, user-space programs will continue to just call getrandom() as usual, with no changes needed. Under the hood, though, there are some significant changes needed within the C library, which provides the getrandom() wrapper for the system call.

State-area allocation

The random-number generator works on some state data stored in memory. When random data is needed, a pseudo-random-number generator creates it from that state, mutating the state in the process. Every thread must have its own state, and care must be taken to avoid exposing that state during events like process forks, core dumps, virtual-machine forks, or checkpointing. That state should be reseeded with random data regularly, and specifically at any time when its content might have been compromised.

The vDSO implementation of getrandom() requires that the memory to be used for this state be allocated by the kernel. So the first thing that the C library must do is to allocate this state storage for as many threads as it thinks are likely to run. That is done with a new system call:

    struct vgetrandom_alloc_args {
        u64 flags;
        u64 num;
        u64 size_per_each;
        u64 bytes_allocated;
    };

    void *vgetrandom_alloc(struct vgetrandom_alloc_args *args, size_t args_len);

The structure pointed to by args describes the allocation request, while args_len is sizeof(*args); that allows the structure to be extended in a compatible way if needed in the future. Within that structure, flags must currently be zero, and num is the number of thread-state areas that the kernel is being requested to allocate. On a successful return, num will be set to the number of areas actually allocated, size_per_each describes the size of a state area, and bytes_allocated is the total amount of memory that was allocated. The return value will point to the base of the allocated area.

The allocated area is ordinary anonymous memory, except that it will be specially marked within the kernel using a number of virtual-memory-area flags. The VM_WIPEONFORK flag causes its contents to be zeroed if the process forks (so that the two processes do not generate the same stream of random numbers), VM_DONTDUMP keeps its contents from being written to core dumps, and VM_NORESERVE causes no swap space to be reserved for it. Donenfeld also added a new flag to use with this area: VM_DROPPABLE allows the memory-management subsystem to simply reclaim the memory if need be; since this is anonymous memory, accessing it after it is reclaimed will cause a new, zero-filled page to be allocated. The result is memory that should remain private, but which can be zeroed (or reclaimed, which has the same effect) by the kernel at any time.

Generating random data

The kernel also shares some memory with the vDSO containing this structure:

    struct vdso_rng_data {
        u64 generation;
        u8  is_ready;
    };

This structure is used by the vDSO version of getrandom(), which has this prototype:

    ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                       void *opaque_state, size_t opaque_len);

The first three arguments mirror getrandom(), describing the amount of random data needed and whether the call should block waiting for the kernel's random-number generator to be ready. The final two, instead, describe one of the state areas allocated by vgetrandom_alloc(). This function's job is to provide the same behavior that getrandom() would.

It starts by looking at the is_ready field in the shared structure; if the kernel's random-number generator is not yet ready, vgetrandom() will just call getrandom() to handle the request. Once the random-number generator has initialized, though, that fallback will no longer be necessary. So the next thing to do is to compare the generation count (which tracks the number of times that the kernel's random-number generator has been reseeded) with a generation count stored in the state area. If the two don't match, then the state area must be reseeded with random data obtained from the kernel.

When the state area is first allocated, it is zeroed, so the generation number found there will be zero, which will never match the kernel's generation number; that will cause the state area to be seeded on the first call to vgetrandom(). The same thing will happen if this area has been cleared by the kernel, as the result of a fork (VM_WIPEONFORK) or the memory being reclaimed (VM_DROPPABLE), for example. So the kernel is able to clear that memory at any time in the knowledge that vgetrandom() will do the right thing.

Once the state area is known to be in a good condition, vgetrandom() uses it to generate the requested random data using the same algorithm used within the kernel itself. Doing this calculation securely is a bit tricky; if the process forks or core-dumps while it is underway, any data kept on the stack could be exposed. So vgetrandom() has to use an implementation of the ChaCha20 stream cipher that uses no stack at all. The patch series only includes an x86 implementation of this cipher; other architectures seem certain to follow.

As a final step before returning the generated data to the caller, vgetrandom() checks the generation number one more time. If, for example, the state area was wiped by the kernel while the call was executing, the generation-number check will fail. In such cases, vgetrandom() will throw away its work and start over.

Donenfeld described the end result of this work as "pretty stellar (around 15x for uint32_t generation)" and noted happily that "it seems to be working".

Prospects

LWN last looked at this work at the beginning of 2023. At that time, there were a number of objections, many of which were focused on the VM_DROPPABLE changes to the memory-management subsystem, which included some tricky, x86-specific tricks. When version 15 of the patch series was posted several months later, VM_DROPPABLE remained, but the logic had been simplified considerably in the hope of addressing those concerns, seemingly successfully. There does not appear to be anybody who is arguing against the inclusion of this series now.

As of the current version (20), this work has been added to linux-next for wider testing; if all goes well, it could go upstream as soon as the 6.11 merge window later this month. "If all goes well", of course, includes passing muster with Linus Torvalds, who has not commented this time around; he was not thrilled with previous versions, though. Should the mainline merge happen, the work to integrate the needed changes into the C libraries can begin. The end result will be a significant internal change, but the only thing that users should notice is that their programs run faster.

Comments (25 posted)

Offload-friendly network encryption in the kernel

By Daroc Alden
July 9, 2024

The PSP security protocol (PSP) is a way to transparently encrypt packets by efficiently offloading encryption and decryption to network interface cards (NICs); Google uses it for connections inside its data centers. The protocol is similar to IPsec, in that it allows for wrapping arbitrary traffic in a layer of encryption. The difference is that PSP is encapsulated in UDP, and designed from the beginning to reduce the amount of state that NICs have to track in order to send and receive encrypted traffic, allowing for more simultaneous connections. Jakub Kicinski wants to add support for the protocol to the Linux kernel.

The protocol

PSP is a fairly minimal protocol. It completely avoids the topic of how to do a secure key exchange, assuming that the applications on either end of a connection will be able to exchange symmetric-encryption keys somehow. This is not an unreasonable design decision, since IPsec does the same thing. There are several existing protocols for securely exchanging keys, such as Internet Key Exchange or Kerberized Internet Negotiation of Keys. Usually, those protocols are indifferent to the source of the symmetric keys. PSP is a bit different. To support hardware-offload, PSP requires that the NIC itself generate the keys for a PSP connection. The PSP architecture specification goes into detail about how NICs should do that.

The main requirement is that the NIC should be able to rederive the key for a session from a fairly limited amount of information — specifically, from a 32-bit "Security Parameter Index" (SPI) value and a secret key stored on the device. That key is generated by the NIC and used to derive the session-specific encryption key for each connection as needed, using a secure key-derivation function. Therefore, the SPI alone is not enough to decrypt a packet; this allows SPIs to be included in PSP packets, letting the NIC rederive the encryption key for a packet on the fly. In turn, this means that the NIC does not actually need to store an encryption key in order to receive and process a packet, greatly reducing the amount of memory necessary on the device.
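This rederivation scheme can be sketched in a few lines. The real key-derivation function is defined in the PSP architecture specification and runs in NIC hardware; in this illustrative user-space model, HMAC-SHA256 stands in for it, and the names are hypothetical:

```python
import hashlib
import hmac

def derive_session_key(device_key: bytes, spi: int) -> bytes:
    """Derive a per-connection key from the NIC's device key and a SPI.

    Illustrative only: real PSP hardware uses the KDF defined in the
    PSP architecture specification; HMAC-SHA256 stands in for it here.
    """
    spi_bytes = spi.to_bytes(4, "big")  # the SPI is a 32-bit value
    return hmac.new(device_key, spi_bytes, hashlib.sha256).digest()[:16]

# The receiving NIC needs only its device key: the SPI carried in each
# packet is enough to rederive that packet's session key on the fly.
device_key = bytes(32)  # placeholder for the NIC-held secret
key_a = derive_session_key(device_key, 0x12345678)
key_b = derive_session_key(device_key, 0x12345679)
assert key_a != key_b   # each SPI yields a distinct session key
```

The important property is visible even in the sketch: the device key never leaves the NIC, and the public SPI is useless without it, so per-connection keys need not be stored at all on the receive side.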

Unfortunately, this requirement to rederive keys comes at a price — PSP connections are unidirectional. For real use cases where bidirectional communication is required, the application needs to set up a separate PSP connection for each direction. Since rederiving the key requires access to a device key stored on the NIC, only the receiving NIC can rederive it, so the transmitting device still needs to store the key somewhere. While that could be on the NIC, the PSP specification leaves open the possibility of a hardware implementation that requires the computer to send the encryption key to the NIC alongside any transmitted packets.

To encrypt packets, PSP uses AES-128-GCM or AES-256-GCM. These are both authenticated encryption with associated data (AEAD) schemes — they guarantee that the received data has not been tampered with (authentication) and bundle some encrypted data alongside some associated plain-text data. In PSP's case, this is used to implement an offset that allows the sender to leave the headers of a protocol encapsulated in PSP unencrypted, while still protecting the contents. Supporting only two modes of AES keeps the implementation complexity of PSP low.
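The "crypt offset" mechanism is easy to model. In this simplified sketch (the real offset encoding is defined by the PSP specification), the first bytes of the encapsulated packet stay in the clear and would be fed to AES-GCM as associated data, so they are authenticated but still readable by the network; everything after the offset gets encrypted:

```python
def split_by_crypt_offset(payload: bytes, crypt_offset: int):
    """Split a PSP payload at the crypt offset.

    Simplified model: the first crypt_offset bytes (e.g. an inner
    protocol header) remain plain text and serve as AES-GCM associated
    data, so tampering is still detected; the rest is encrypted.
    """
    aad = payload[:crypt_offset]
    to_encrypt = payload[crypt_offset:]
    return aad, to_encrypt

# An encapsulated packet with a 20-byte inner header left readable:
inner = b"H" * 20 + b"secret application data"
aad, body = split_by_crypt_offset(inner, 20)
assert aad == b"H" * 20
assert body == b"secret application data"
```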

PSP also has a packet layout designed to make parsing the packet in hardware more efficient, by providing explicit lengths in the header and using fewer optional headers than IPsec does. In combination with the unique key-derivation scheme, PSP ends up being more hardware-implementation-friendly than other encryption protocols like IPsec or TLS.

While Google is both the originator and largest current user of PSP, the protocol could potentially be useful to other users. Compared to other encrypted protocols, PSP requires a lot less on-device memory, letting it scale to larger numbers of connections. Because the protocol doesn't mandate a key-exchange standard, PSP is probably a good choice for an environment where the user controls both ends of the connection, but still wants to ensure that traffic can be encrypted.

The discussion

Despite how useful Google finds the protocol, kernel developers were dubious about adding yet another encryption protocol to the kernel, which already handles IPsec, WireGuard, TLS, and others. Paul Wouters expressed surprise that Kicinski wanted to add PSP to the kernel, when the IETF had declined to standardize the protocol on the basis that it is too similar to IPsec.

Steffen Klassert shared a draft that the IPsecME working group has been putting together that covers some of the same use cases as PSP. That may not be as helpful as it sounds, however, because there are already hardware devices implementing PSP, Willem de Bruijn pointed out. "It makes sense to work to get to an IETF standard protocol that captures the same benefits. But that is independent from enabling what is already implemented."

That answer didn't satisfy Wouters, who asked: "How many different packet encryption methods should the linux kernel have?" He said that waiting for protocols to be standardized provides interoperability, and chances to make sure a protocol is actually useful for more than one use case. PSP and IPsec can also use a lot of the same NIC hardware, he pointed out.

"I don't disagree on the merits of a standards process, of course. It's just moot at this point wrt PSP", de Bruijn replied. Klassert agreed that existing PSP users need to be supported, but thought that it was still important to work on standardizing a modern encryption protocol that meets everyone's needs. He invited Google to send representatives to the IETF IPsecME working group meeting to discuss the topic. De Bruijn said that the company would.

But some reviewers also had technical objections to Kicinski's patch set. The API he proposed allows user space to set a PSP encryption key on a socket, after which data transmitted through that socket will be encrypted. The problem is how exactly this interacts with retransmissions. PSP is encapsulated in UDP, and has no guarantee that data will arrive intact or in order, leaving that detail up to higher-level protocols. But Kicinski is mainly interested in PSP as a TLS replacement — which of course runs on top of TCP. PSP supports wrapping TCP, but just by encrypting individual packets, not by participating in TCP's retransmission logic.

Together, this creates some edge cases, de Bruijn pointed out. How is the kernel supposed to handle retransmissions of plain-text data after a PSP key has been associated with a socket? "Like TLS offload, the data is annotated 'for encryption' when queued. So data queued earlier or retransmits of such data will never be encrypted", Kicinski explained.
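The queue-time annotation Kicinski describes can be modeled with a toy send queue (the class and field names below are hypothetical, not from the patch set): each segment's encryption decision is made exactly once, when the data is queued, so data queued before the key was installed — and any retransmission of it — stays in plain text:

```python
class Segment:
    def __init__(self, data: bytes, encrypt: bool):
        self.data = data
        self.encrypt = encrypt  # decided once, at queue time

class TxQueue:
    """Toy model of the queue-time annotation Kicinski described."""
    def __init__(self):
        self.key_installed = False
        self.segments = []

    def queue(self, data: bytes):
        # The annotation is fixed here and never revisited, so a later
        # retransmit of this segment keeps the same marking.
        self.segments.append(Segment(data, encrypt=self.key_installed))

q = TxQueue()
q.queue(b"hello")        # queued before the PSP key exists: clear text
q.key_installed = True   # user space installs a key on the socket
q.queue(b"world")        # queued afterward: will be encrypted
assert [s.encrypt for s in q.segments] == [False, True]
```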

De Bruijn wasn't completely satisfied with that, pointing out that it still leaves edge cases. What happens if one peer upgrades the connection at the same time the other peer decides to drop it? "If (big if) that can happen, then the connection cannot be cleanly closed." In response, Kicinski suggested "only enforcing encryption of data-less segments once we've seen some encrypted data". De Bruijn agreed that might help, but remained worried that the kernel API could be used in ways that break the correctness of the protocol, and suggested that some more thorough documentation might be appropriate.

In response, Kicinski put together some sequence diagrams showing how a PSP-secured socket is set up and torn down. De Bruijn thought that was a substantial improvement, but remained unconvinced that all of the edge cases had been handled.

Lance Richardson raised some questions about the kernel API as well, noting that there didn't seem to be any real support for rekeying a socket, something that PSP mandates every 24 hours. In a follow-up message, Richardson suggested that it might be as simple as keeping an old key around for a minute or two after rekeying. Kicinski agreed that made sense, and promised to add it to the next version of the patch set.
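Richardson's suggestion amounts to a short grace window on the receive side. A hypothetical sketch (the names and the window length are illustrative, not from the patch set): after a rekey, packets are tried against the current key first, with the previous key kept as a fallback until the window expires:

```python
class RxKeys:
    """Sketch of Richardson's suggestion: after a rekey, keep the old
    key around briefly so in-flight packets can still be decrypted."""
    GRACE_SECONDS = 120  # illustrative "a minute or two"

    def __init__(self, key: bytes):
        self.current = key
        self.previous = None
        self.rekeyed_at = None

    def rekey(self, new_key: bytes, now: float):
        self.previous = self.current
        self.current = new_key
        self.rekeyed_at = now

    def candidate_keys(self, now: float):
        # Try the current key first; fall back to the previous key
        # only while the grace window is still open.
        keys = [self.current]
        if self.previous and now - self.rekeyed_at < self.GRACE_SECONDS:
            keys.append(self.previous)
        return keys

keys = RxKeys(b"old-key")
keys.rekey(b"new-key", now=1000.0)
assert keys.candidate_keys(1001.0) == [b"new-key", b"old-key"]
assert keys.candidate_keys(1300.0) == [b"new-key"]  # grace expired
```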

While there are lots of reasons to use PSP, and the presence of hardware that supports it is a good sign, the lack of a standard and questions around the implementation suggest that it may be some time before support is finalized. Still, in the future we could see yet another encryption protocol come to the kernel. The exact details of how that happens remain to be seen.

Comments (15 posted)

A new API for tree-in-dcache filesystems

By Jake Edge
July 9, 2024
LSFMM+BPF

There are a number of kernel filesystems that store their directory entries directly in the directory-entry cache (dcache) without having any permanent storage for those objects. It started out as a "neat hack" for ramfs, Al Viro said, at the start of his filesystem-track session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. Unfortunately, as the use of this technique has grown into other filesystems, there has been a lot of scope creep that has gotten out of control. He wanted to discuss some new infrastructure that he is working on to try to clean some of that up.

[Al Viro]

Viro displayed some notes on his thoughts to accompany his talk; some of this article derives from those notes. He has a patch set to implement those ideas (contained in his "untested.persistency" branch) that is "very much a work in progress", which is untested and may not compile on anything other than x86, he said. He wanted to describe the problem it is meant to solve and how it does so.

It all started with a demonstration by Linus Torvalds of how to create a filesystem without a backing store, he said. The technique kept all of the files and directories in the cache and was the basis for ramfs. A "controlled dentry [directory entry] leak" was used; reference counts are artificially increased to ensure that the directory entries do not get evicted. When an unmount is done, they are all cleaned up.

The technique was adopted by tmpfs, hugetlbfs, and in other places, because it is simpler than what procfs uses. There are problems that arise in some of the other users, however, that do not exist in the original. The original intent was only for filesystems that were being populated from user space, but eventually it was used for filesystems that are populated by the kernel, or, perhaps worse, both the kernel and user space.

For example, rmdir() only removes directories that are empty, but the configfs user-space tools expect the system call to be able to remove a populated subtree if all of its entries were created by the kernel. If there are user-created subdirectories, the tools expect the rmdir() to fail. Christian Brauner pointed out that the control-group filesystem (cgroupfs) also has this behavior. Viro said that filesystems of this sort have to implement their own rmdir() because it is so specialized. For configfs, it needs to check if there have been any directories created by the user inside the target—or any that are in progress. The code "is horrible", he said.

There is a real need for some infrastructure to help these filesystems, Viro said. There are around a dozen different implementations of the subdirectory-removal handling, none of which have been done correctly. His idea is to introduce a flag, DCACHE_PERSISTENT, that will be used to mark the dentries that are being "controllably leaked" so that they can be properly handled. Then kernel-initiated operations and those from user space can both set the flag, so that they are handled in the same way, which is not the case right now.

There would be two new functions that would be the counterparts to dget() and dput() (which obtain and release references to directory entries); d_make_persistent() would do the equivalent of dget() and set the flag, while d_make_discardable() would do a dput() and clear it. There are new helper functions to handle both the simple filesystems like ramfs and the more complicated varieties, including handling some of the variations of open-coded directory-removal code. There are more details in the notes file.
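The "controlled leak" semantics can be modeled outside the kernel. This user-space Python sketch mirrors the names of Viro's proposed helpers but is purely illustrative of the reference-counting behavior — it is not kernel code, and the internals shown are assumptions:

```python
class Dentry:
    """Toy model of a directory entry and its reference count."""
    def __init__(self, name: str):
        self.name = name
        self.refcount = 0
        self.persistent = False

    def dget(self):
        self.refcount += 1

    def dput(self):
        self.refcount -= 1

def d_make_persistent(d: Dentry):
    # Pin the dentry in the cache: the extra reference is the
    # "controlled leak" that keeps it from being evicted, and the
    # flag records why the reference exists.
    d.dget()
    d.persistent = True

def d_make_discardable(d: Dentry):
    # Drop the pinning reference so the dentry can be evicted or
    # freed normally once other users release it.
    d.persistent = False
    d.dput()

d = Dentry("ramfs-file")
d_make_persistent(d)
assert d.persistent and d.refcount == 1
d_make_discardable(d)
assert not d.persistent and d.refcount == 0
```

The point of the flag, per the session, is that both kernel- and user-created entries would be pinned through the same helpers, so generic code can finally tell the two cases apart when tearing down a subtree.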

There are still four filesystems that remain to be converted, Viro said. They all have "interesting problems" that need to be resolved; two of them are for USB gadgets, one is configfs, and the other is apparmorfs. The diffstat of his patch set shows that the changes would actually result in a net removal of around 500 lines of code from the tree.

He had hoped to discuss configfs with Christoph Hellwig, who was not present, though he did arrive later in the day. He plans to talk to Greg Kroah-Hartman about the USB-gadget filesystems, but is not sure who to talk to about apparmorfs. There is some strange locking being done in apparmorfs, which he mentioned to the AppArmor developers, but got nowhere, he said. There was some further discussion on this and related work as time ran out on the session.

Comments (2 posted)

Improving pseudo filesystems

By Jake Edge
July 10, 2024
LSFMM+BPF

The eventfs filesystem provides an interface to the tracepoints that are available to be used by various Linux tracing tools (e.g. ftrace, perf, uprobes, etc.); it is meant to be a version of the tracefs filesystem that dynamically allocates its entries as needed. The goal is to reduce the memory required for multiple instances of tracefs, as Steven Rostedt described in a session at the 2022 Linux Storage, Filesystem, Memory Management, and BPF Summit. He returned to the 2024 edition of the summit to talk further about how to make pseudo (or virtual) filesystems, such as tracefs/eventfs, more like regular Linux filesystems, where the directory entries (dentries) and inodes are only created (and cached) as needed.

Background

He began with some background on eventfs; it is based on tracefs, which was itself based on debugfs. Because of the interface that debugfs provides, eventfs maintained dentries for each of its files and directories. Around the same time that Rostedt proposed a session on virtual filesystems for this year, eventfs was being extensively reworked to avoid a number of problems, some of which were security related. As part of that, Linus Torvalds made it clear (in his inimitable way) that a dentry-centric approach was not right.

[Steven Rostedt]

At the session in 2022, Christian Brauner had suggested using kernfs as the basis for eventfs. When Rostedt looked at that, he saw that only sysfs and control groups used kernfs, and it did not look like it applied to what he was trying to do. After playing with it more recently, he can see that it might make sense to convert all of debugfs and tracefs to use kernfs, but it will not work for eventfs, he said.

One of the things he is working on is tracing infrastructure for Chromebooks, some of which have only 2GB of RAM; "memory is very much a hot commodity there". There are "thousands and thousands of files" in eventfs; new instances of eventfs create a new ring buffer, but they also duplicate most of the files, which uses a lot of memory. So, eventfs was turned into a dynamic filesystem that did not create dentries and inodes until they were actually needed, which provided substantial memory savings.

The crux of the disagreement with Torvalds is based in Rostedt's lack of understanding of how filesystems are supposed to be implemented, coupled with the API for debugfs. Torvalds asked Rostedt why dentries were being created for eventfs before the filesystem was even mounted. Creating a file with debugfs_create_file() returns a dentry, however, so Rostedt thought that was the way it should be done. Al Viro pointed out that eventfs went far beyond what debugfs had ever done, though, which "was really scary"; he is "not fond" of what debugfs does, but eventfs took things much further. Things are "much saner" after the fixes that went into eventfs, Viro said.

Rostedt said that now that he has learned more, he is concerned that debugfs needs attention; "maybe we should update it". Viro noted that debugfs has some object-lifetime problems as well as a lack of "sane exclusion" when doing I/O on files that are being removed. Rostedt wondered if debugfs should be switched to using kernfs, but Viro said that "kernfs has different issues".

Kernfs is not fully namespace-aware for one thing, Viro said. Brauner suggested that debugfs did not need namespace support, but Rostedt and Viro said that people want to be able to mount debugfs inside containers. "That's insane on the face of it", Brauner said; "we are not going to do this".

One of the problems that he has encountered, Rostedt said, is that developers start by using debugfs for some project, but that once it gains some traction, they want to move it to its own filesystem. Debugfs is not really a good basis for that as it stands. That's what happened to him; "I'm here to say, let's not have someone else follow my steps". For that reason, he thinks debugfs should be switched to a better interface that people can use as a basis for their filesystems.

The path?

After some discussion between Brauner and Viro about the current status of debugfs, Rostedt shifted gears slightly and asked what the proper path is for kernel developers who want to move their filesystems from debugfs to a real filesystem. Viro reiterated some of the concerns he has with debugfs, including the ability for applications to continue reading from open file descriptors after the debugfs file has been removed.

Rostedt wondered about the use of dentries in the debugfs interface; he thought that there might well be memory concerns for those who are mounting it. Brauner said that there is no real control over how many files there are in debugfs, since any random driver can add entries whenever it wants. Writing to one of those files might deadlock or crash the system. For those and other reasons, "mounting debugfs on a production system is ... adventurous".

Dave Chinner said that Rostedt was asking the wrong question. Starting with debugfs and then trying to move that code to a production filesystem is wrong. If it is destined for production, it should be developed within sysfs, but Rostedt noted that sysfs is restrictive; sysfs files are supposed to only have a single value. Chinner said that the restriction was often ignored.

Developers who are not filesystem-focused choose debugfs as a starting point because it is easy to do so, Rostedt said; they are typically just doing it for debug purposes in the early going. Then the functionality turns out to be useful, but now the code has been built around debugfs, which is "not the way to do it", he said.

"Ask the experts", Chinner suggested, but Rostedt said that "a lot of times the experts are busy doing their own thing". He had posted versions of his work along the way, he said, but rarely got any comments from the experts.

Brauner and Viro talked about some approaches that might scale reasonably for the eventfs use case, but did not really come to a conclusion. Part of the problem, Ted Ts'o said, is that "there is no one general solution" to point kernel developers at; debugfs is perfectly fine for a small number of files and when just a single instance is needed. For situations with multiple instances and millions of files, there is no existing code to point to. So it is not possible to give advice that pertains to all of the possible filesystems that may be needed.

But Rostedt said that he was not trying to solve the eventfs problem—it is a specialized use case—but wanted to figure out what to tell developers who want to move a fairly simple debugfs-based filesystem to a real filesystem. Brauner said that there are some questions that need to be answered first: is there a single instance of the filesystem or does each mount create a new one? Does the filesystem need to be namespace-aware? Without those answers, it is difficult to say what the proper path might be.

As time was running down, Rostedt said that what was being said in the session "was like gold to me". He thinks that what was discussed needs to be fully documented and that the questions that need to be answered are obviously a big part of that. Viro said that sounded like a "frequently asked questions" document, which elicited some laughter. Rostedt agreed, though, and asked if there was any documentation of that sort. Brauner said that there was not, "and I think that is a fair point"; for example, most filesystems these days will need to be namespace-aware but there are not really good examples to point developers to. Whether that will result in documentation patches was not clear, however.

Comments (none posted)

Giving bootloaders the boot with nmbl

By Joe Brockmeier
July 8, 2024
DevConf.cz

At DevConf.cz 2024, Marta Lewandowska gave a talk to discuss a new approach for booting Linux systems, "No more boot loader: Please use the kernel instead". The talk, available on YouTube, introduced a new project called nmbl (for "no more bootloader", pronounced "nimble"). The idea is to get rid of bootloaders (e.g., GNU GRUB) with a Unified Kernel Image (UKI) that removes the need for a separate bootloader altogether. It is early days for nmbl; currently the project is only being tested for use with virtual machines, but the idea is compelling. If successful, nmbl could offer security, performance, and maintenance benefits compared to GRUB and other separate bootloaders.

Rationale

Longtime Linux users have seen their share of bootloaders, the software that initializes hardware and loads the operating system, over the years. In the earliest days, users might have used loadlin to boot into Linux from MS-DOS. Then there was Linux Loader (LILO). It was the popular choice for Linux distributions until the mid-2000s, when Linux distributions began switching to the GRand Unified Bootloader (GRUB), and then GRUB 2 (which has supplanted GRUB legacy, so we will just say "GRUB" after this). The SYSLINUX family of bootloaders was a popular choice for booting from floppies, CD-ROMs (ISOLINUX), network servers using PXE boot (PXELINUX), and a variety of filesystems (EXTLINUX). That is not an exhaustive list, merely a sampling of more widely used bootloaders on x86/x86_64 systems. Other platforms required their own bootloaders, of course.

Lewandowska, a quality engineer at Red Hat, started her talk with a discussion of the purpose of the bootloader and things that can go wrong. The bootloader, she said, is the first piece of software that runs and "gets everything ready for booting and getting the operating system running" and then transfers control to the kernel.

That may not sound like a big deal when booting a desktop or laptop system, she said, but it becomes much more complicated for multiple architectures, complex storage schemes, and booting over the network. All of those things have to be possible, and "all of those of you who have filed bugs with us know" it can go wrong in many ways.

On top of that, there is secure boot. The idea of secure boot, of course, is to ensure that a machine "doesn't have any malware from the beginning", she said. The chain of trust starts with hardware that will only load a trusted bootloader, which will only load a trusted kernel. Lewandowska explained briefly how this works with UEFI and Linux systems. When the system starts, its firmware will load a first-stage bootloader, called the shim, if it has a trusted signature. Then it will load a signed bootloader, in this case GRUB. Next, GRUB will provide a menu of boot options and/or allow the user to edit boot options. Then GRUB will verify the selected kernel's signature with a protocol installed by the shim. GRUB also has to load the initramfs, the initial root filesystem image used for booting the kernel, which Lewandowska said is "the biggest security hole" because it is not signed.

GRUB is great, she said, but it is also complex and needs to handle a lot of functionality that is duplicated in the Linux kernel. And, of course, it has security vulnerabilities too. She showed a slide with a list of 15 CVEs for GRUB since 2021. (Slides here.) There has been only one in 2024, so far, but "believe me, more are coming", Lewandowska said. In addition to the CVEs there are plenty of regular bugs in GRUB such as filesystem, storage, and memory-allocation bugs that are difficult to solve and that those working with GRUB don't have the resources to fix. GRUB is not as actively developed as the Linux kernel is, so things are fixed more slowly. She noted that Red Hat was carrying "hundreds of downstream patches" for GRUB. "We're trying to fix all this, but it's a huge task and it goes slowly." That finally brings us to nmbl, she said.

Why nmbl

The idea for nmbl is "taking a whole bunch of things that have already existed" in Linux, adding a bit of code, and putting them together, Lewandowska said. Nmbl is delivered as a UKI: a single image in Portable Executable/Common Object File Format (PE/COFF) that bundles the kernel image and resources needed to boot. Nmbl includes the kernel command line, an initramfs, the kernel, and a UEFI stub (using systemd-stub) wrapped up as a UEFI executable that can be run from the UEFI firmware. As a bonus, nmbl can be signed so "now the whole thing becomes secure", including the initramfs.

Most of this is already in Fedora, she said. Fedora has Dracut for generating initramfs images, and has been adding support for UKIs. (LWN has covered Fedora's plans for UKIs, and the progress toward UKI support in Fedora 40.) Nmbl also uses grub-emu to provide a GRUB-style menu that is already familiar for Linux users.

In a blog post timed to accompany the talk, she explained the advantages that nmbl might offer. First is improved boot time. Currently there are two variants of nmbl being worked on, one that provides direct booting of the desired kernel, and another that allows the user to boot into different kernels. The direct-boot option loads the same kernel used by nmbl and performs a switch root to switch from the initramfs to the user-space filesystem. The other option loads the nmbl UKI and uses grub-emu to display the menu of bootable kernels. Then it uses kexec to boot the final kernel and bring up the system. When nmbl is the target kernel, it would substantially decrease boot time since there is no need to boot a second kernel.

It will also speed up feature delivery, says Lewandowska. "Since kernel and bootloader code will no longer have to be duplicated, features will only have to be implemented once to be available in both places." In addition, she wrote that implementing features once will reduce duplicate work and remove the worry that the bootloader's copies of things like kernel filesystem drivers could fall out of sync when the kernel changes. Since the bootloader and kernel would be "one and the same", any kernel changes would be immediately available to the bootloader as well.

She also touted the idea of increased security. Including all of the initramfs into a signed binary would "considerably increase security" she wrote. On top of that, nmbl would significantly reduce the attack surface "since the new code comprising nmbl is only several hundred lines of code compared to GRUB's hundreds of thousands". Finally, Lewandowska said that since the Linux kernel has a much larger community, nmbl would receive more scrutiny than GRUB does on its own.

Early days

There is still a lot of work to be done, Lewandowska said. The next thing that nmbl developers want to do is to build nmbl with every Fedora kernel build. Another feature on the horizon is shim A/B booting, which would allow fallback to the previous nmbl kernel if the newest one fails for some reason. This would make failed upgrades easier to recover from. The proof-of-concept (POC) for nmbl right now runs on UEFI. Longer-term, the team wants to get nmbl working on other architectures that do not use UEFI as well.

Support for bare metal is also on the wish list. During the Q&A of her talk, Lewandowska said that development and testing of nmbl is primarily being done with virtual machines at the moment. It has been tested on hardware, she said, but not recently. She has posted instructions on how to do this in a virtual machine with Fedora 39, including guidance on how to generate a signing certificate and enroll the key to use secure boot with the nmbl UKI.

It will likely be years before nmbl is ready to supplant GRUB as the bootloader of choice for most Linux users, but it's an interesting approach that could have a lot of advantages if it succeeds.

Comments (54 posted)

Page editor: Jonathan Corbet


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds