
FLAShadow: A Flash-based Shadow Stack for Low-end Embedded Systems

Published: 12 August 2024

Abstract

Runtime attacks are a rising threat to both low- and high-end systems with the spread of techniques such as Return-Oriented Programming (ROP), which aims at hijacking the control flow of vulnerable applications. Although several control flow integrity schemes have been proposed by both academia and the industry, the vast majority of them are not compatible with low-end embedded devices, especially the ones that lack hardware security features.
In this article, we propose FLAShadow, a secure shadow stack design and implementation for low-end embedded systems that relies on zero hardware security features. The key idea is to leverage a software-based memory isolation mechanism to establish an integrity-protected memory area on the Flash of the target device, where FLAShadow can be securely maintained. FLAShadow exclusively reserves a register for maintaining the integrity of the stack pointer and also depends on a minimal trusted runtime component to avoid trusting the compiler toolchain. We evaluate an open-source implementation of FLAShadow for the MSP430 architecture, showing an average performance and memory overhead of 168.58% and 25.91%, respectively. While the average performance overhead is high, we show that it is application dependent and drops below 5% for some applications.

1 Introduction

In recent decades, computer systems have been evolving on all fronts, aiming at providing better experiences for users. Nevertheless, the foundation of all software stacks in most systems is still written in unsafe programming languages such as C and C++ for obvious flexibility and performance reasons. Memory corruption vulnerabilities are very common in such languages because they require the programmer to manually manage memory allocations. This opens the door for committing programming mistakes that result in vulnerabilities, which can be exploited in several ways. Due to the widespread deployment of defences such as Data Execution Prevention (DEP) [42] that have raised the bar for successfully mounting traditional code-injection attacks [52], control-flow hijacking attacks have become the most popular software exploitation technique over the past years. Such attacks, exemplified by Return-Oriented Programming (ROP) [48] and Jump-Oriented Programming (JOP) [15], circumvent traditional deployed defences, e.g., DEP and Address Space Layout Randomization (ASLR) [2], by reusing some code snippets, called gadgets, that exist in memory.
In particular, control-flow hijacking attacks exploit memory corruption vulnerabilities, e.g., buffer overflows and use-after-free, to divert program execution from its intended control flow by executing a chain of gadgets in a controlled manner. Hence, rather than injecting malicious payloads directly onto the stack or heap, where DEP would block them from being executed, control-flow hijacking attacks inject addresses of existing in-memory code fragments, so-called gadgets, onto the victim stack (i.e., ROP attacks [48]) or heap (i.e., JOP attacks [15]), causing the victim application to execute its own binary code in an unanticipated order. As the names indicate, ROP attacks target backward edges by means of gadgets ending with return (RET) instructions, whereas JOP attacks target forward-edge gadgets that end with indirect jump or call (JMP/CALL) instructions.
To narrow the focus, ROP is the most widely leveraged attack technique, posing a more significant threat than other control-flow hijacking attacks for both high-end devices [14, 49] and embedded systems [3, 18, 37, 40]. ROP attacks are also expected to increase in frequency and remain dangerous despite the deployment of some defences [16, 17, 35]. Mitigating ROP attacks requires guaranteeing the integrity of return addresses pushed onto the stack. So far, a multitude of defences have been proposed by both industry and academia to prevent control-flow hijacking attacks, most notably Control Flow Integrity (CFI) mechanisms for both forward [1, 54, 58, 63] and backward [16, 21, 39, 64] edges. CFI guarantees that program execution follows one of the valid paths in a pre-generated static Control Flow Graph (CFG). Forward-edge CFI techniques are deployed by big information technology (IT) players such as Google, to protect Chrome and Android [55], and Microsoft, to protect Windows 10 and later versions [41]. In contrast, the most comprehensive backward-edge CFI mechanisms, such as safe stacks [56] and shadow stacks [57], have never seen wide adoption despite being available in mainline compilers. This is mainly because they can be easily bypassed via information disclosure attacks that leak the location of the shadow/safe stack, allowing tampering with its content [16, 17]. Additionally, the performance cost of some implementations on some architectures is not tolerable [21].
Yet, regardless of the aforementioned limitations, existing ROP defences are not compatible with many deployed low-end embedded devices due to one or more of the following reasons. First, the underlying MicroController Units (MCUs) of embedded systems have constrained resources, e.g., a few KBs of Flash and RAM, and lack basic protection features found on high-end desktop systems. For instance, the vast majority of such MCUs lack Memory Management Units (MMUs) and thus cannot enable the DEP and ASLR protection mechanisms that are a prerequisite for any ROP defence. Second, a significant number of embedded systems run bare-metal applications, lacking any Operating System (OS) that would help enforce security policies. Third, MCUs load and run a single statically linked binary image in a single address space, making it very challenging to compartmentalize any part of the software securely. Notably, static linking does not decrease the attack surface for ROP attacks; if anything, it can introduce additional ROP gadgets.
As a countermeasure against ROP attacks on embedded systems, several directions have been followed by the research community. For instance, various control flow attestation schemes have been proposed to detect control-flow hijacking attacks on embedded devices by a trusted remote party [4, 25, 46]. Such schemes can only detect attacks after they happen, offering no prevention capabilities. Another direction has been to enforce backward-edge integrity through CFI schemes adapted to embedded systems [6, 26]. More recently, a shadow stack approach has been proposed for securing return addresses in embedded devices [65]. Despite the potential advantages that distinguish each of the aforementioned techniques, all of them depend on hardware features that might not exist in all embedded devices or are basic enough to be easily disabled by a motivated attacker [29]. This brings up two important questions: (i) Is it possible to secure embedded systems with zero hardware security features against ROP attacks and, (ii) if so, would the proposed solution be secure enough and acceptable in terms of performance and memory overhead? To fill this knowledge gap, this article proposes and evaluates FLAShadow, a shadow stack approach for low-end embedded systems, requiring no hardware security features.
In doing so, we address a security issue on many of these legacy devices that lack proper security hardware, thus representing vulnerable potential targets for ROP attacks. Notably, such devices, spanning several architectures such as MSP430, ARM, and AVR, are still widely employed and will most likely not be replaced in the foreseeable future.
FLAShadow extends and builds atop PISTIS [30], a pure-software memory isolation technique, to fully protect the shadow stack from malicious tampering. PISTIS is a software-based trusted computing architecture that offers basic security features such as memory isolation, DEP, and remote attestation (RA) through selective software instrumentation and load-time verification of binaries. To guarantee the integrity of return addresses, FLAShadow dedicates a write- and execute-protected memory area in the Flash, to which a copy of return addresses is pushed and then popped by inline instructions that are inserted and verified by the trusted architecture. The access control rules are also enforced by our architecture, and adherence to them is verified on the device itself, eliminating the need to trust the compiler toolchain. To achieve full protection, FLAShadow leverages some of PISTIS's security features to complement its role. For instance, the static RA module is leveraged to guarantee the integrity of the binary loaded into memory. Unlike the designs and implementations of software-based shadow stacks in high-end systems [21, 57], FLAShadow is secure against information leaks and can be deployed on low-end systems without the need for any randomization scheme. The average memory overhead of FLAShadow is 25.91%. While the performance overhead might not be tolerable for some applications (up to 578.16%), we show that FLAShadow incurs less than 5% performance overhead for others.
In a nutshell, this article makes the following contributions:
A design of a novel Flash-based shadow stack technique for generic low-end embedded systems that lack hardware security features.
An open-source implementation for the MSP430 architecture, publicly available at [13].
An extensive evaluation of the proposed design on different applications, showing the impact on performance, memory, and energy consumption.
Article outline. The remainder of the article is organized as follows. Preliminaries are presented in Section 2. Section 3 discusses PISTIS, the memory isolation technique that we leverage to build FLAShadow. Section 4 describes FLAShadow in detail. Implementation details and evaluation are reported in Sections 5 and 6, respectively. Section 7 reviews the related work. Section 8 concludes.

2 Preliminaries

2.1 Return-Oriented Programming (ROP)

ROP is a classical exploit technique that allows an attacker to execute code in the presence of security defences that prevent injecting shell code. The attacker exploits a memory corruption vulnerability to gain control of the call stack, hijacks the program control flow, and then executes carefully chosen machine instruction sequences, called gadgets, that are already present in the code memory of the target device. Each gadget typically ends in a return instruction and is located in a subroutine within the existing program or linked library code. By chaining these gadgets together, the attacker is able to perform the malicious task. Figure 1 visualizes the principle behind ROP, assuming a buffer overflow vulnerability that allows the attacker to overwrite a return address. Notably, ROP attacks and their variants (e.g., ROP without returns [19]) can succeed in the presence of several combined defences such as stack canaries, DEP, and ASLR. This motivates the need for other effective defences, such as shadow stacks.
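To make the entry point concrete, the following minimal C sketch shows the kind of memory corruption bug that hands an attacker control of a saved return address; the function name and buffer size are illustrative, not from the paper.

    #include <string.h>

    /* Illustrative only: an unchecked copy into a fixed-size stack buffer.
       Input longer than 16 bytes overwrites adjacent stack slots, including
       the saved return address, which is the foothold a ROP chain needs. */
    void parse_packet(const char *input) {
        char buf[16];
        strcpy(buf, input);  /* no bounds check */
    }

With DEP active, the attacker cannot execute the injected bytes themselves, but can still overwrite the return address with the address of the first gadget in a chain.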
Fig. 1.
Fig. 1. A visualization of the steps of Return-Oriented Programming attacks (ROP).

2.2 Scope of Embedded Devices

FLAShadow targets tiny embedded devices that have small MCUs based on the Von Neumann architecture with few or no hardware security features. In general, such MCUs have a single core and feature Flash and Static Random Access Memory (SRAM). They execute instructions in place (in physical memory) and have no Memory Management Unit (MMU) to support virtual memory. Some support a Memory Protection Unit (MPU); however, our design and implementation disregard MPUs due to their shortcomings, as clarified in [66] and [29]. In particular, FLAShadow targets single-threaded, yet multi-tasking, bare-metal applications, which are the most common ones in the Internet of Things (IoT) domain [20]. Notably, many of these low-end devices are still employed in both critical and non-critical fields due to their low cost and power consumption. However, due to their poor hardware configuration, they are incompatible with most state-of-the-art security techniques. Consequently, they are often unprotected against common threats such as ROP attacks, which, depending on the scenario, could lead to disastrous consequences.
Our implementation is based on the MSP430 architecture. This choice is due to the wide use of this architecture in many IoT devices as well as research prototypes. Nevertheless, our design has the same requirements as PISTIS [30]; thus, it is applicable to other low-end MCUs in the same class, such as ARM Cortex-M.

2.3 Challenges in Designing FLAShadow

To guarantee security, any target system should be able to protect the building blocks of the provided security services against malicious tampering. This is especially important when software is the backbone of these services. No kind of protection can be enforced without some sort of isolation. Trusted Execution Environments (TEEs) have been proposed to fill this gap; however, they have only been realized in mid-range and high-end devices. The low-end spectrum of embedded devices lacks rich hardware security features that provide memory isolation in addition to confidentiality guarantees. To tackle this issue, we first propose PISTIS, a runtime memory isolation mechanism that depends on software instrumentation. Atop it, FLAShadow is constructed as an effective pure-software shadow stack approach to mitigate ROP attacks. The design of both PISTIS and FLAShadow is not straightforward and poses several challenges.
The biggest challenge is the underlying architecture of target MCUs. In general, all MCUs used in small embedded devices (regardless of their vendors) are designed based on one of two memory architectures: Von Neumann or Harvard. Both architectures are visualized in Figure 2. FLAShadow targets the Von Neumann architecture, which is more challenging than the Harvard one for two main reasons. First, the Harvard architecture features hardware-isolated address spaces, as the program (non-volatile Flash) and data (volatile SRAM) memories are physically separated and accessed through different instructions. This simplifies isolating part of the memory to host the trust anchor, as the data memory is not executable; thus, there is no need to instrument the instructions that access it [9]. This is not the case in the Von Neumann architecture, in which program and data memories share the same memory address space and are thus both executable, facilitating code injection attacks, as basically any instruction can alter the state of the program memory, i.e., read from or write to it. This poses a challenge when designing PISTIS and FLAShadow without incurring an intolerable performance overhead. Second, in contrast to the Harvard architecture, the Von Neumann one has a variable-length instruction set that can be exploited to break any software-based memory isolation technique by jumping into the middle of a multi-word instruction and executing one of its words as an unaligned instruction. This challenge cannot be handled by borrowing well-known Software Fault Isolation (SFI) techniques [9] without carefully adapting them.
Fig. 2.
Fig. 2. Harvard versus Von Neumann architecture.
Realizing FLAShadow on top of PISTIS poses another challenge, as hosting the shadow stack in either the RAM or the Flash has pros and cons. While the RAM can be enforced by PISTIS to be non-executable, and thus appears to be a good choice for FLAShadow, it is very limited in size, and fully protecting it would incur a high performance overhead. On the other hand, implementing FLAShadow in part of the Flash requires a sophisticated design that considers, among many other factors, (i) the execute-only property to ensure strong security guarantees, (ii) the write and read mechanisms of the Flash, which are more complex compared with the RAM, and (iii) the right location of the stack to avoid corrupting any functionality when growing in size.
To solve the main challenge of realizing memory isolation with optimised performance, PISTIS focuses on checking the position of the Program Counter (PC) rather than instrumenting the entire instruction set. Also, PISTIS devises a special canary mechanism to mark legitimate jump targets, thus handling the variable-length instruction set issue. We then design and implement FLAShadow on top of PISTIS in part of the Flash, considering the aforementioned challenges. Further details about design decisions are provided in the next sections.

3 Memory Isolation

Systems security strongly relies on the concept of trust, as lacking it renders any security service infeasible. Trust is typically assured via some sort of memory isolation, a mandatory security primitive for all security services. To this end, we first construct PISTIS as a software-based memory isolation technique, atop which FLAShadow is built. Our memory isolation primitive follows a bottom-up approach to create an ARM TrustZone-like capability [11] that is highly optimised for low-end embedded devices without hardware support while allowing for direct memory access (DMA) and interrupt operations. Compared with ARM TrustZone, our software architecture itself plays the role of a secure monitor with multiple secure entry points [12]. To achieve this, we first deploy an initial code base, called the Trusted Computing Module (TCM), that occupies part of the Flash memory, including the bootloader area. The main goal of the TCM is to guarantee memory protection by isolating its memory area from the other memory parts, creating two logically isolated memory zones: secure and insecure. The TCM leverages the secure memory part to deploy security services and primitives like FLAShadow. Note that while our memory isolation technique adapts some SFI techniques, i.e., binary rewriting, to its particular domain, it does not depend on any hardware feature, such as the segmentation registers of MPUs, as in [31]. Furthermore, in contrast to standard SFI mechanisms [53], we do not trust the compiler toolchain. This significantly broadens the attacker model and boosts the security guarantees of our approach. The correctness of the memory isolation design in PISTIS is formally verified, as clarified in [30].
In what follows, we outline the adversary model, which is also valid for FLAShadow, and then describe the design of PISTIS in detail. The design of FLAShadow is discussed in the next section.

3.1 Adversary Model

We consider a software-based adversary Adv who has full access to the network or is potentially present inside the device itself in the form of malware. Adv can eavesdrop on or tamper with traffic on any communication medium supported by the target device. Adv can also control any software deployed in the insecure memory part of the device. This includes trying to inject malicious code, reading or writing any memory address that is not explicitly protected, corrupting specific data, manipulating I/O pins, or mounting ROP attacks. More importantly, Adv could compromise the toolchain and tamper with the code before it is loaded on the device. Notably, we consider Denial of Service (DoS) attacks out of scope, as they do not tamper with the device memory, and the literature proposes several complementary security solutions to address this class of attacks [5, 36].
We assume that the TCM is initially and correctly installed on the embedded device by a trusted party. This can easily be achieved during a new IoT deployment or, in the case of existing systems, a clean reset of the device with a software update suffices. We also assume that the TCM is bug free and does not contain memory corruption vulnerabilities.1 This means that Adv cannot bypass any protection rules enforced by the TCM. Finally, we rule out all physical and hardware-focused attacks.

3.2 PISTIS: From the Ground Up

In what follows, we will describe the main building blocks of PISTIS, namely, the TCM and the accompanying compiler toolchain, that guarantee memory isolation. We then build some needed Trusted Applications (TAs), i.e., remote attestation and secure code update, leveraging memory protection as a basis.

3.2.1 Memory Isolation Design Rationale.

The memory isolation property of PISTIS aims at enforcing memory protection to guarantee the integrity and confidentiality of PISTIS's TCM and the TAs deployed on top of it, as well as the integrity of FLAShadow. This is achieved by first deploying the TCM on the device using a physical programming interface, e.g., JTAG, by a trusted party. The TCM is responsible for protecting itself against any untrusted software that is deployed on the same device afterwards. To achieve this, the TCM requires any software to be deployed through it so that its safety can be verified at the instruction level. The entire deployment is rejected by the TCM if there is at least one unsafe instruction that violates the memory isolation property. Our TCM is accompanied by a PISTIS-enabled compiler toolchain that produces compatible binary images. However, this toolchain is untrusted, as it is easy for Adv to tamper with it. Therefore, the security of PISTIS depends on the load-time and runtime verification that occurs on the device itself by the TCM.
Trusted Computing Module (TCM). The TCM is a set of software functions that reserves part of the non-volatile memory, including the bootloader area, to act as a hypervisor that fully manages access control to the entire memory area. The TCM consists of the following components:
Initial code: A bootloader code that replaces the original one. When the MCU is powered on, this code decides whether to continue booting from the TCM memory or from the memory that holds untrusted software.
Loader/Verifier: This software module is responsible for receiving the untrusted software image from the network interface and verifying whether it is PISTIS compliant. If so, the software will be installed and activated on the device. Otherwise, it will be rejected and erased.
Virtualized Instructions: A set of functions that represent safe equivalent versions of some potentially unsafe instructions that cannot be checked at load time. During compilation, the PISTIS-enabled toolchain will replace each such instruction with a hyper-call to a safe equivalent one that can be safely verified at runtime.
Helper modules: Other necessary helper functions, such as those that read and write memory pages.
The TCM code runs in a privileged mode in the sense that it can perform any memory access operation. In contrast, untrusted software is subject to restrictions that limit its access to some memory regions. Such restrictions are meant to enforce memory protection; thus, they are checked and enforced by the TCM. Figure 3 shows the memory layout of the MCU when activating PISTIS. The TCM is the first to be physically deployed, in the PISTIS Core memory part. It then manages other deployments to maintain the shown layout. When deploying an application, the TCM verifies its instructions according to the following access policy (AP):
Fig. 3.
Fig. 3. A standard memory map vs. the FLAShadow-enabled memory map in a Von Neumann architecture. The latter also shows R/W/X privileges of the untrusted application.
Read access is limited to Memory Mapped I/O (MMIO), Application Data, and Application Read-Only memory.
Write access is limited to MMIO memory2 and Application Data memory.
Jumps (to execute instructions) are limited to Application Instruction memory and specific entry points of the PISTIS Core memory.
In contrast to the Harvard architecture, in which maintaining memory isolation only requires taking care of control-transfer instructions, any instruction in the Von Neumann architecture can violate the aforementioned AP. Therefore, to adhere to such a policy, the TCM's Loader/Verifier should smartly verify each memory-access (read/write) or control-transfer (jump/call) instruction of untrusted software before deploying it on the device. Initially, the TCM only knows the boundary of its non-volatile memory area and the dedicated space in the volatile one. To know the other boundaries shown in Figure 3, the PISTIS-enabled compiler toolchain produces some meta-data that is sent ahead of the deployed binary image. The TCM performs the verification process according to such meta-data. Note that the values of such meta-data are accepted as long as they do not cross protected memory regions. With that in mind, the TCM installs the entire received binary image in the expected part of the non-volatile application memory. It then starts verifying it after disabling all interrupts to ensure atomicity. The following checks must be passed before the actual deployment of untrusted software:
Instructions with a static addressing mode, whose target memory address is known, are checked and verified at load time. These instructions can be either control-transfer or memory-access ones. They both must comply with the AP visualized in Figure 3 in the sense that:
Control-transfer instructions can only target the memory area where the application will be installed or one of the entry points of the TCM. This check guarantees the DEP property of the data memory since the PC will not be allowed to jump there.
Write instructions can only target the permitted part of the volatile data memory (RAM) or (some) MMIO registers.
Read instructions can only target any write-permitted memory location or the application read-only memory.
Instructions with a dynamic addressing mode, whose target address is only known at runtime, are replaced with static instructions in the form of hyper-calls to their secure virtualized versions (part of the TCM). If at runtime the target address is deemed to be unsafe, PISTIS performs a soft reset of the MCU to block this operation.
If the binary image contains at least one instruction that does not pass the above checks, it will not be deployed and accordingly will be erased.3 Note that PISTIS does not support self-modifying code, in the sense that the untrusted software cannot directly write to its instruction memory area. However, if required, the untrusted software can invoke the TCM's Loader/Verifier module after installing the needed chunks of code in the non-executable data memory. The TCM's Loader/Verifier will relocate the installed chunks to the required memory location if they adhere to the aforementioned AP.
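To make the load-time pass concrete, here is a hedged C sketch of such a check loop; the instruction decoder, region bounds, and helper predicates are illustrative assumptions of ours, not PISTIS internals.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative classification; a real verifier decodes MSP430 opcodes. */
    enum kind { I_OTHER, I_STATIC_JUMP, I_STATIC_WRITE, I_STATIC_READ, I_DYNAMIC };
    struct insn { enum kind kind; uint16_t target; };

    extern struct insn decode(const uint16_t *code, size_t idx);
    extern bool is_tcm_entry(uint16_t addr);   /* valid TCM entry point?     */
    extern bool is_mmio_ok(uint16_t addr);     /* whitelisted MMIO register? */
    extern bool is_app_ro(uint16_t addr);      /* application read-only mem? */

    #define APP_TEXT_LO 0x8000u   /* application instruction memory (assumed) */
    #define APP_TEXT_HI 0xF000u
    #define APP_DATA_LO 0x2400u   /* permitted RAM (assumed)                  */
    #define APP_DATA_HI 0x4400u

    /* Returns true iff every instruction of the image complies with the AP. */
    bool verify_image(const uint16_t *code, size_t n_insns) {
        for (size_t i = 0; i < n_insns; i++) {
            struct insn in = decode(code, i);
            switch (in.kind) {
            case I_STATIC_JUMP:   /* app text or a TCM entry point only */
                if ((in.target < APP_TEXT_LO || in.target >= APP_TEXT_HI)
                    && !is_tcm_entry(in.target)) return false;
                break;
            case I_STATIC_WRITE:  /* permitted RAM or whitelisted MMIO only */
                if ((in.target < APP_DATA_LO || in.target >= APP_DATA_HI)
                    && !is_mmio_ok(in.target)) return false;
                break;
            case I_STATIC_READ:   /* write-permitted memory or app read-only */
                if ((in.target < APP_DATA_LO || in.target >= APP_DATA_HI)
                    && !is_mmio_ok(in.target) && !is_app_ro(in.target)) return false;
                break;
            case I_DYNAMIC:       /* raw dynamic forms must already have been
                                     rewritten into hyper-calls: reject      */
                return false;
            default:
                break;
            }
        }
        return true;   /* AP-compliant: the image may be activated */
    }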
Modified Toolchain. PISTIS leverages a modified toolchain to transparently rewrite each potentially unsafe dynamic instruction and replace it with a safe virtualized equivalent that can be accessed via a call to a subroutine stored in the protected TCM memory area. The target address of the corresponding instruction is verified at runtime when the subroutine is invoked. The execution continues normally if it is valid. Otherwise, an MCU reset is triggered.
Figure 4 visualizes the modified compiler toolchain. We chose the open-source GCC compiler toolchain and modified it to suit our needs. Our modifications comprise (i) an instrumenter plugin that verifies the validity of static instructions and rewrites dynamic ones and (ii) a custom linker script that maintains the required memory layout and resolves the addresses of valid TCM entry points. The instrumenter module is placed between the compiler and the assembler, targeting assembly instructions. During instrumentation, all instructions with a static addressing mode are left untouched, as they are checked at load time by the TCM on the device itself. All control-transfer and memory-access instructions with a dynamic addressing mode are rewritten as previously clarified. Note that Adv cannot write hand-crafted assembly instructions or use its own toolchain without being detected by the TCM. Application developers can use the PISTIS-enabled toolchain with the same ease of use as any other toolchain. No extra action is needed, as all required steps are performed transparently. Furthermore, developers can simply allow their software to interact with the TCM by invoking any of its valid entry points.
Fig. 4.
Fig. 4. FLAShadow-enabled compiler toolchain.
Virtual Instructions. Virtual instructions are part of the TCM. They represent a safe replacement for some dynamic unsafe instructions, maintaining the same functionality and adhering to the aforementioned AP. This allows the TCM to verify the safety of the intended operation at runtime before executing the corresponding instruction.
Listing 1 shows an example of an unsafe dynamic CALL instruction that tries to jump to an address held by a register. Such an address is only known at runtime. When compiling with the PISTIS-enabled toolchain, such an instruction is replaced by the sequence of instructions shown in Listing 2. The main purpose of these instructions is to safely invoke an equivalent safe routine inside the TCM. This routine will check the validity of the target address before jumping to it. The instructions of such a routine are shown in Listing 3.
Listing 1.
Listing 1. An example of a dynamic unsafe CALL instruction.
Listing 2.
Listing 2. A safe equivalent virtualization to the CALL instruction in Listing 1.
Listing 3.
Listing 3. An example of a safe virtual function. safe_call checks the validity of the original destination before jumping.
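The bodies of Listings 1-3 are not reproduced in this text-only version; the following C model, with assumed names and conventions, captures the logic they describe: the instrumented call site hands the register-held target to a TCM routine that validates it before jumping.

    #include <stdint.h>

    typedef void (*fn_t)(void);

    extern void mcu_soft_reset(void);          /* PISTIS blocks unsafe ops by resetting */
    extern int  target_is_valid(uintptr_t a);  /* AP range + canary check (Section 3.2.2) */

    /* TCM-resident replacement for a dynamic `CALL Rn`. The deployed routine
       (Listing 3) is MSP430 assembly; this C rendering only sketches its logic. */
    void safe_call(fn_t target) {
        if (!target_is_valid((uintptr_t)target))
            mcu_soft_reset();      /* invalid runtime destination: block    */
        target();                  /* validated: perform the original call  */
    }

Under this model, the rewrite in Listing 2 reduces to moving the original register operand into an argument register and emitting a static call to safe_call, which the Loader/Verifier can then check at load time like any other static instruction.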

3.2.2 Memory Isolation: Variable Length Instructions.

The Von Neumann architecture supports a variable-length instruction set, i.e., instructions can hold one or more words. Adv can maliciously leverage this feature to arbitrarily execute code and bypass the memory protection enforced by PISTIS. In other words, a seemingly benign jump can target the middle of a multi-word instruction located in the insecure memory zone; the word at the target location may decode as a different, unverified instruction, allowing illegal access to the protected memory area. Therefore, it is crucial for PISTIS to only allow jumping to verified instructions. While the TCM can easily check static jumps at load time, a similar runtime check of dynamic jumps would incur a high runtime overhead.
To tackle this issue, PISTIS employs the following countermeasures. First, the instrumenter plugin (in the PISTIS-enabled toolchain) inserts a special instruction, called an instruction canary, before any valid address that can be the target of a potential jump. This canary takes the form of a No Operation (NOP) slide: a sequence of two NOP instructions. Given that our modification to the toolchain limits the number of dynamic jump instructions where possible, only a few NOP slides are inserted in each application. Second, the TCM is configured to take such NOP slides into account when virtualizing jump instructions. When a dynamic jump executes at runtime, the corresponding virtual safe routine inside the TCM checks whether a NOP slide precedes the target address. If so, the jump is valid (i.e., the jump target is a permitted address inside the insecure memory area). Otherwise, an MCU reset is performed as a consequence of an invalid target address.
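A minimal sketch of the canary test that a routine like safe_call could apply; 0x4303 (MOV #0, R3) is MSP430's canonical NOP encoding, but the exact opcode and the word-level indexing here are assumptions of this sketch.

    #include <stdint.h>

    /* Returns non-zero iff a two-NOP slide immediately precedes the target,
       i.e., the target is a legitimate dynamic-jump destination. */
    static int preceded_by_nop_slide(uintptr_t target) {
        const uint16_t *code = (const uint16_t *)target;
        return code[-2] == 0x4303 && code[-1] == 0x4303;
    }

Combined with the address-range test, a check like this would implement the target_is_valid predicate assumed in the earlier safe_call sketch.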
Figure 5 shows a code snippet highlighting the working mechanism of the two-NOP canary in the native MSP430 implementation. Note that PISTIS by itself does not enforce control flow integrity. Nevertheless, the original design guarantees that diverting the control flow does not break the maintained memory isolation property, both for dynamic jumps and return addresses. With the introduction of FLAShadow, we only retain the former protection, leveraging the shadow stack to secure the return addresses.
Fig. 5.
Fig. 5. Binary code (right) of the assembly source code (left) with 3 NOP slides: a legitimate slide at the beginning of a function, inserted by PISTIS, and two accidental slides in the middle of two instructions, which are either maliciously inserted or accidentally derived from combinations of opcodes and operands. The safe_call virtual function only allows calls to locations preceded by a slide. For instance, in the case of a corrupted stack and a dynamic call, the control flow can at most be diverted to another location containing an accidental slide. However, the instruction executed after the slide will always be safe, since it has been verified.

3.2.3 Security Services.

FLAShadow is not the only service that can be built on top of PISTIS; many other security services can be considered. In this section, we show how remote attestation (RA) and secure code update services can be easily supported.
Remote Attestation (RA). Considering that the TCM memory is neither writable nor readable by untrusted software, we leverage it to design an RA service. To do so, a secret key (pre-shared with the verifier) and an integrity verification function based on message authentication codes (MACs) are installed in the TCM memory. The first instruction of this function is considered a valid entry point to PISTIS. Whenever an attestation request is received, this function computes a digest (MAC value) of the entire memory and then sends it back to the verifier for verification. Either a nonce or a timestamp can be used to avoid replay attacks. Our RA is similar to SIMPLE [8, 10].
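A compact sketch of what the attestation entry point could look like; the two-part HMAC prototype, key storage, and memory bounds are our assumptions, not the HACL API used by the implementation.

    #include <stdint.h>

    #define KEY_LEN 32
    extern const uint8_t  ATT_KEY[KEY_LEN];   /* pre-shared key in secure storage */
    extern const uint8_t *MEM_BASE;           /* start of the attested memory     */
    extern uint32_t       MEM_LEN;

    /* Assumed HMAC-SHA256 prototype over two message parts (avoids buffering
       the whole memory); the paper uses HACL's implementation. */
    extern void hmac_sha256_2(uint8_t out[32],
                              const uint8_t *key, uint32_t key_len,
                              const uint8_t *m1, uint32_t l1,
                              const uint8_t *m2, uint32_t l2);

    /* RA entry point: MAC the verifier's nonce together with the entire
       memory and return the digest for remote verification. */
    void ra_attest(const uint8_t nonce[16], uint8_t mac_out[32]) {
        hmac_sha256_2(mac_out, ATT_KEY, KEY_LEN, nonce, 16, MEM_BASE, MEM_LEN);
    }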
Secure Code Update. We also extended the TCM to include a secure code update service that complies with SUIT [43], an IETF standard for software updates on IoT devices that requires maintaining the authenticity, integrity, and confidentiality of the deployed software. By extending the cryptographic module of the TCM with a decryption function, our secure code update mechanism meets the above requirements [7]. Considering a pre-shared secret key, the verifier has to send the software image encrypted along with a MAC value. The image will be deployed if it is decrypted and verified (in terms of matching MAC values and checking the compliance with regard to the AP of PISTIS) successfully.
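The update flow could then look roughly as follows; the encrypt-then-MAC ordering, names, and signatures are assumptions of this sketch, not the paper's exact protocol.

    #include <stdbool.h>
    #include <stdint.h>

    extern const uint8_t UPD_KEY[32];   /* pre-shared update key (assumed) */
    extern bool mac_ok(const uint8_t *msg, uint32_t len, const uint8_t mac[32]);
    extern void chacha20_decrypt(uint8_t *buf, uint32_t len,
                                 const uint8_t key[32], const uint8_t nonce[12]);
    extern bool ap_compliant(const uint8_t *img, uint32_t len);  /* Loader/Verifier */
    extern void install(const uint8_t *img, uint32_t len);
    extern void erase(uint8_t *img, uint32_t len);

    bool secure_update(uint8_t *img, uint32_t len,
                       const uint8_t mac[32], const uint8_t nonce[12]) {
        if (!mac_ok(img, len, mac))                  /* authenticity + integrity */
            return false;
        chacha20_decrypt(img, len, UPD_KEY, nonce);  /* confidentiality          */
        if (!ap_compliant(img, len)) {               /* image must also pass the */
            erase(img, len);                         /* PISTIS AP verification   */
            return false;
        }
        install(img, len);
        return true;
    }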

4 FLAShadow

Low-end embedded systems can be particularly vulnerable to advanced attacks such as ROP. These aim at bypassing traditional security primitives, such as DEP, to divert the control flow of an application and execute arbitrary code. Many embedded systems lack the hardware support to mount adequate defences such as shadow stacks, thus leaving a considerable attack surface for Adv.
To fill this security gap and offer protection against these advanced attacks on low-end devices, we propose FLAShadow, a unique software-based shadow stack design. FLAShadow contributes to the protection of the application control flow by preventing Adv from tampering with the return addresses on the stack. To the best of our knowledge, FLAShadow is the first software-based shadow stack solution that does not rely on any hardware component, e.g., MMU or MPU, while offering strong security guarantees. As such, its design is compatible with most low-end embedded systems that are not equipped with rich hardware features. Notably, the lack of such features often impedes the use of most state-of-the-art security solutions, and replacing these devices with more capable ones is often impractical and expensive. Conversely, FLAShadow does not require any secure hardware component, making it a practical option where hardware-based protection is unavailable.
A shadow stack is a fully precise technique that verifies the integrity of backward edges on each execution of a return instruction. The main idea is to store a copy of the return address in a separate, isolated region of memory that is not accessible to the attacker. Upon returning, the integrity of the program return address is checked against the protected copy on the shadow stack. Execution is terminated if the two values differ. As such, FLAShadow has been designed to ensure that return instructions correctly return to their legitimate callers. Nevertheless, this is not straightforward in embedded systems, considering the lack of hardware support.
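In its conventional (RAM-based) form, the mechanism is a few lines of logic, sketched here in C for exposition only; Section 4.2 explains why FLAShadow must deviate from this on Flash.

    #include <stdint.h>

    #define SLOTS 128
    static uintptr_t shadow[SLOTS];   /* isolated copy of return addresses */
    static unsigned  top;             /* index of the first free slot      */

    extern void mcu_soft_reset(void);

    void shadow_push(uintptr_t ret_addr) {        /* mirrors every CALL */
        shadow[top++] = ret_addr;
    }

    void shadow_check_pop(uintptr_t ret_addr) {   /* mirrors every RET  */
        if (top == 0 || shadow[--top] != ret_addr)
            mcu_soft_reset();   /* return address was tampered with */
    }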
To compensate for the lack of proper security hardware, we leverage PISTIS, a purely software-based approach, and extend its security guarantees with those of a shadow stack. As such, FLAShadow is compatible with any device that supports PISTIS and can be deployed alongside it on new devices using any trusted software update technique (e.g., using JTAG or over-the-air (OTA) updates). Once deployed, FLAShadow transparently takes care of protecting the untrusted applications that are compiled with our toolchain and deployed with our Secure Update service.
To accomplish this, FLAShadow introduces (i) an integrity-protected memory area, (ii) an integrity-protected shadow stack pointer, and (iii) a firmware integration for stack manipulation. The integrity protection is provided by PISTIS, whereas the firmware integration is built on top of its TCM.

4.1 FLAShadow: Design

To achieve its goal, a shadow stack must be secured against unauthorised accesses, e.g., Adv trying to inject malicious return addresses. There exist several techniques to protect the shadow stack, one of which considers using a hidden memory area through randomization [21, 57]. However, besides being vulnerable to information disclosure attacks [16], these approaches are not suitable for embedded systems due to the limited entropy. For this reason, FLAShadow opts for a sounder approach by introducing an integrity-protected memory area to hold a copy of the application return addresses during execution. Notably, the entire area could be leaked to an attacker without compromising the security of the system, as Adv would need to modify, not merely read, such values.
Although integrity protection is often provided by MMU- or MPU-like hardware, FLAShadow leverages PISTIS to achieve it. In particular, we implement the shadow stack on top of the software-based memory protection primitive of PISTIS without incurring any extra design overhead. In practice, this translates to adding a new protected area to the PISTIS AP, as can be seen in Figure 3.
In embedded systems, protecting the integrity of the shadow stack alone is not enough to prevent ROP attacks. The integrity of the Shadow Stack Pointer (SSP) also needs to be protected, as an attacker with access to it could invalidate it, i.e., overwrite it with a pointer to an attacker-controlled memory area. To protect the SSP, FLAShadow stores it in a protected register that is exclusively reserved for this purpose. The value of this pointer always points to the top of the shadow stack so that the system knows which return address to pop upon return and where to push the new address when executing a call instruction.
Finally, any system using a shadow stack must implement a mechanism to store (push) and fetch (pop) the protected addresses on the new stack. While on high-end systems this is accomplished with dedicated hardware, FLAShadow introduces new software hooks for both push and pop operations. Specifically, we extend the custom toolchain (Section 3.2.1) and the verifier module to treat all application CALL instructions as unsafe, so that each of them is instrumented. Then, we design new safe_call and safe_return virtual functions, with a sequence of safe operations that manipulate the shadow stack every time a function is called or returned from.
In summary, FLAShadow comprises a new protected memory area for the return addresses, a new protected register for the safe shadow stack pointer, a compiler/verifier extension for the instrumentation of every application call instruction, and a TCM extension to some of its virtual functions.
It is worth mentioning that FLAShadow also makes some of the original PISTIS features redundant. For instance, the NOP slide mechanism described in Section 3.2.2 is slightly revised: the introduction of a shadow stack makes the NOP slide superfluous for return statements, as the new FLAShadow operations replace the safe_return NOP slide verification. Nevertheless, this mechanism is still required for forward edges, mainly for dynamic call instructions.

4.2 FLAShadow: Flash-Based Stack

Traditionally, shadow stacks are maintained in the RAM for performance reasons. In light of the revised AP in Figure 7, creating an integrity-protected RAM segment would require a redesign of the current PISTIS architecture, which leverages the Flash controller to minimise the complexity of its memory protection primitive and thus reduce the execution overhead. Therefore, we design FLAShadow to store the new shadow stack on the Flash memory (Flash + Shadow Stack = FLAShadow), effectively leveraging the native PISTIS TCM without losing its performance optimisation. PISTIS enforces an integrity-protected area over the entire Flash, preventing any unauthorised write operation. This satisfies the FLAShadow security requirements, allowing it to operate without interference from attackers. Notably, the only requirement for FLAShadow is PISTIS, i.e., a TCM with memory isolation, which is entrusted with the protection of the shadow stack. Porting FLAShadow to a RAM-based implementation of PISTIS would only require an engineering effort.
Fig. 6.
Fig. 6. Shadow stack representation with the calling sequence call(A) - call(B) - ret(B) - call(C) - ret(C) - call(D) - ret(D) - ret(A). The dashed PSP marks its travel down the stack. The final stack state reflects the stack-erasure optimisation.
Fig. 7.
Fig. 7. FLAShadow (FhSw) memory map on the MSP430 architecture.
While the use of Flash reduces the complexity of the overall security architecture, effectively minimising the software instrumentation required, it also introduces several design challenges stemming from the intrinsic rigidity of Flash. We recall that a shadow stack is a highly dynamic memory structure that grows and shrinks whenever new values are pushed to or popped from its top. In practice, new return addresses are often pushed to a previously popped location, overwriting the stale content of the shadow stack, thus maintaining the correct call sequence. However, this overwrite operation can be quite challenging on a Flash memory, where write operations can only target a cleared memory location, i.e., that has never been used or that has been previously cleared via a memory erasure operation.4 This means that whenever a value is popped from the shadow stack, the memory location containing the popped value must be cleared before another value can be pushed to it.
Considering that the smallest area that can be erased in the Flash of any embedded device is a block of bytes (e.g., 512 B), overwrite operations become cumbersome and particularly expensive: modifying a single value requires (i) copying its entire segment to RAM, (ii) modifying the value in RAM, (iii) erasing the Flash segment, and, finally, (iv) overwriting the original Flash segment with the modified segment from RAM. Notably, zeroing a single Flash address, i.e., overwriting it with 0, is a one-cycle central processing unit (CPU) operation: since Flash programming can only clear bits (from 1 to 0), writing all-zeros over any value is always permitted without a prior erasure.
To overcome this limitation, we propose a novel only-growing stack design that does not involve overwrite operations. Starting from an empty Flash segment,5 we always push new values to the first free address at the top of the stack, i.e., the first never-before-used stack entry. To keep track of the popped values, and thus remove them from the stack, we zero them. As a result, at any point in time, the stack contains an interleaved sequence of zero (popped) and non-zero (un-popped) values, effectively representing the current sequence of calls.
By never reusing popped locations, FLAShadow avoids the expensive overwrite operations, thus reducing the runtime overhead. However, this has a considerable impact on the size of the shadow stack. Rather than growing linearly with the depth of the calling sequence, FLAShadow grows linearly with the total number of function calls. Let us define the size of a stack, at any point in time, as the difference between its base address and the address of its first available slot. Considering the application in Listing 4, a traditional shadow stack would never exceed a size of 1: each return shrinks the stack, and each call grows it. In contrast, with FLAShadow, the size of the shadow stack is 3, as each new call uses a new memory location, regardless of whether the other locations are still in use.
Listing 4.
Listing 4. A simple C program with 3 consecutive calls to an empty function.
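The body of Listing 4 is not reproduced in this text-only version; from its caption, it is presumably equivalent to the following:

    /* Plausible reconstruction of Listing 4: three consecutive calls to an
       empty function. A traditional shadow stack never exceeds size 1 here,
       whereas FLAShadow's only-growing stack consumes three slots. */
    static void f(void) { }

    int main(void) {
        f();
        f();
        f();
        return 0;
    }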
Having a bigger shadow stack is not the only drawback of a Flash-based design. The presence of zero values and the impossibility of overwriting them make pop and push operations more expensive. While in a traditional scenario we would always operate on the top of the stack, with FLAShadow, we might need to interact with any point of the stack. This makes updating the shadow stack pointer expensive, as we have to find the next address to be popped (the highest non-zero value) and the first free entry where the new address is to be pushed (the top of the stack). Considering the ever-increasing size of the stack and its zero entries, these searches can become expensive in terms of CPU cycles. To overcome this limitation, we propose a two-pointer design with a Top Stack Pointer (TSP) and a Pop Stack Pointer (PSP). The TSP always points to the top of the shadow stack, i.e., the lowest free slot, thus reducing the cost of push operations. This resembles the behaviour of a normal shadow stack. The PSP is instead used to keep track of the next value to be popped. Keeping this pointer up to date is cheap upon a new push (PSP = TSP) but can be quite expensive after a pop operation, considering that the next non-zero value could be at any point of the stack. Figure 6 shows an example of FLAShadow in action. It can be seen how on ret(C) and ret(D) the PSP has to travel down several positions. Nevertheless, we note that this is the cheapest design that can be realized in pure software while the stack is maintained on the Flash.
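The following C model, with illustrative sizes and a RAM array standing in for the Flash segment, sketches the resulting push/pop mechanics:

    #include <stdint.h>

    #define SLOTS 1024
    static uint16_t slot[SLOTS];   /* 0 = erased/popped, non-zero = live entry */
    static unsigned tsp;           /* Top Stack Pointer: first never-used slot */
    static unsigned psp;           /* Pop Stack Pointer: next entry to pop     */

    void fs_push(uint16_t ret_addr) {
        slot[tsp] = ret_addr;      /* fresh slot: programming needs no erase   */
        psp = tsp++;               /* cheap PSP update on push (PSP = TSP)     */
    }

    /* Callers must not pop an empty stack; the deployed hooks enforce this. */
    uint16_t fs_pop(void) {
        uint16_t ret_addr = slot[psp];
        slot[psp] = 0;             /* zeroing only clears bits: always legal   */
        while (psp > 0 && slot[psp] == 0)
            psp--;                 /* "long travel" to the highest live entry  */
        return ret_addr;
    }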

5 Implementation

Although FLAShadow is compatible with every architecture that supports PISTIS, we propose an implementation of FLAShadow for the MSP430 architecture from Texas Instruments [23]. MSP430 MCUs are based on a Von Neumann memory model and are widely used in many critical application domains. For instance, they are employed in implantable medical devices (IMDs), e.g., pacemakers, that support standard interfaces for wireless communication [22, 32, 50]. They have also been the target of many research prototypes, including SANCUS [44], SMART [27], and VRASED [45]. We focus on a specific architecture to optimise our implementation and provide a more significant evaluation. Nevertheless, the design could be implemented on a different target architecture, such as ARM Cortex-M.
In particular, we implemented FLAShadow on top of the latest version of PISTIS, which targets the MSP430F5529 MCU featuring ~132 kB of Flash, ~8 kB of SRAM, and a CPU speed of up to 8 MHz using internal oscillators. We used HMAC-SHA256 and ChaCha20 from the HACL library [67] to implement the other security services, namely, RA and Secure Code Update.

5.1 FLAShadow in Numbers

Our FLAShadow implementation is two-fold: we provide a software-based implementation for the MSP430F5529 MCU and an extension to the GCC compiler, used to produce FLAShadow-compliant binary images. Both contributions are extensions to the work in PISTIS [30]. Our TCM is a modular software composed of a core, which includes our memory isolation primitive and FLAShadow, and various TAs that can be deployed independently by adjusting some configurations in a custom Makefile, which serves as an application programming interface (API). The TCM core, including the boot code, Loader/Verifier, virtual functions, shadow stack hooks (pop and push), and some helper functions, comprises 1,232 lines of C code along with 744 lines of Assembly. The cryptographic primitives used, namely, HMAC-SHA256 and ChaCha20, comprise 944 lines of code, extracted from the HACL library [67]. Leveraging these primitives, the RA implementation includes 78 lines of code, whereas the secure code update mechanism is composed of 236 lines of C code. The MSP430 architecture features 60 main instructions, of which 24 had to be instrumented to maintain strong isolation guarantees (further details can be found in [30]). We provide an API, comprising a 121-line Makefile in addition to an extended MSP430F5529 linker script of 601 lines, that fully automates the deployment of the TCM on the target device. Our GCC plugin serves as another API to fully automate the production of FLAShadow-compliant binary images. It comprises a 159-line Makefile that transparently modifies 15 lines of the original linker script of applications and executes 740 lines of Python scripts to rewrite instructions and embed meta-data. The user API includes a 236-line Python script that automates the deployment of produced binaries over serial communication. Notably, these figures include both the FLAShadow implementation and the underlying PISTIS.

5.2 Memory Map in Practice

Section 3 elaborated on the design rationale for enforcing memory isolation between software modules, and Figure 3 visualized the resulting memory map with regard to the employed AP. Given that the AP is mainly concerned with read, write, and execute rights, the number of application instructions requiring virtualisation might vary depending on the application's nature.
The implementation of our memory isolation was optimised for our target MSP430 architecture, reducing the number of virtualised instructions by following a slightly different memory map that, nevertheless, has equivalent security guarantees to the one shown in Figure 3. The new memory map, visualized in Figure 7, introduces a secure storage segment, where all sensitive data, including cryptographic keys, are stored. This memory part is only accessible by the PISTIS core, relaxing the application access policy as follows: (i) read access is extended to cover the PISTIS Core, App Instruction memory, PISTIS Data memory, and the FLAShadow stack; and (ii) write access is extended to cover PISTIS Data memory.
The FLAShadow design rationale is compatible with the general design of PISTIS, leveraging its virtualisation and verification primitives to achieve a secure shadow stack. However, as Section 4.2 highlights, we propose a Flash-based shadow stack to match the implementation proposed in [30]. As such, our FLAShadow implementation leverages two existing commodity hardware features of MSP430 MCUs: the Bootloader Section (BSL) and the Flash memory controller. Further details follow.

5.2.1 Secure Storage using BSL.

The BSL is a small memory segment, located as a part of the Flash memory, whose confidentiality and integrity are hardware enforced by the MCU. It triggers a reset at any illegal access. A legal access occurs through a few entry points, i.e., the Z-area, which can be configured during the physical deployment of the TCM. Leveraging its intrinsic properties, PISTIS customises part of this segment to form its secure storage, thus blocking any access by the untrusted application. The Loader/Verifier module in the TCM is only required to check that the application instructions do not jump to the Z-area. Only the TCM can freely access the Z-area. This allows applications to have read access to the rest of the Flash memory without breaking our memory isolation security.

5.2.2 Integrity of FLAShadow.

While the untrusted software is allowed to read any part of the PISTIS core memory, including the FLAShadow stack, it must not be allowed write access to it, in order to preserve the integrity of the entire core. One possibility is to follow the guidelines of the PISTIS design and virtualise all write instructions. However, the MSP430 architecture offers a hardware feature that achieves the same goal with better performance. The Flash memory controller regulates all write accesses to the Flash. It must be set up via custom memory-mapped registers that enable (unlock) or disable (lock) writing to the Flash according to their loaded value. In particular, writing to the Flash only takes effect if an instruction first unlocks the Flash controller, i.e., by writing a specific pre-defined byte sequence (password) into one of the controller registers. Thus, PISTIS prevents the untrusted application from writing to any part of the Flash memory by checking that none of its instructions tries to unlock the Flash memory controller. This reduces the number of instructions that have to be virtualised and fully preserves the integrity of FLAShadow. Note that the volatile Data memory (SRAM) is not controlled by the Flash memory controller. Therefore, to reduce the performance overhead, PISTIS allows applications to read from or write to any part of it. Instead, PISTIS prevents the leakage of any sensitive data by clearing all shared memory areas before any context switch from a privileged to a non-privileged mode.
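A sketch of the resulting load-time rule follows; the FCTL register block (FCTL1 at 0x0140 through FCTL4 on MSP430F5xx) and the 0xA5 key byte follow TI's documentation, but the exact bounds and decoding here are assumptions of this sketch.

    #include <stdbool.h>
    #include <stdint.h>

    #define FCTL_LO 0x0140u   /* FCTL1 (assumed base of the controller block) */
    #define FCTL_HI 0x0148u   /* one past FCTL4 (assumed)                     */

    /* Called by the Loader/Verifier for every static write instruction: any
       write into the Flash controller block is rejected, so untrusted code
       can never present the FWKEY password (0xA5xx) and unlock programming. */
    static bool writes_flash_controller(uint16_t dst_addr) {
        return dst_addr >= FCTL_LO && dst_addr < FCTL_HI;
    }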

5.3 \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) in Practice

Although \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) was designed as an extension of the \(\sf {{\rm\small PISTIS}}\) core security primitives, implementing \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) alongside \(\sf {{\rm\small PISTIS}}\) on constrained devices entails many challenges. We recall that \(\sf {{\rm\small PISTIS}}\) was designed for very low-end MCUs, with an implementation that makes highly optimised use of the few resources available. This leaves little room for additional extensions such as \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Nevertheless, we managed to tweak the \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) implementation to make it compatible with the reference MSP430 target architecture.
Let us consider the two shadow stack registers: TSP and PSP. Although using the available CPU registers makes their operations fast, reserving two additional general-purpose registers (on top of the three already used by the \(\sf {{\rm\small PISTIS}}\) TCM) might be counterproductive. First, it makes the untrusted application code run slower, having been compiled with one fewer general-purpose register available. Second, it reduces compatibility with some applications and external libraries.
For instance, instrumenting libc with four reserved registers already required several tweaks; instrumenting it with five might require considerable effort. Notably, libc is heavily used on the MSP430 architecture because of its limited instruction set, in which many complex operations are emulated via stub functions. These auxiliary functions are invoked transparently by the compiler, even when the application does not call standard libc functions (e.g., malloc()).
To tackle this issue, we reduce both TSP and PSP to 10-bit pointers and store both of them in a single 20-bit register: [TSP<<10 | PSP]. While merging the two saves one CPU register, it also makes pop and push operations slightly more expensive, requiring additional shift and copy instructions. We reckon that a two-register \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) implementation could still be employed for a subset of compatible applications.
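To illustrate the encoding (not the exact register allocation), the following sketch shows the pack/unpack logic; in the real implementation, the packed value lives in a reserved 20-bit MSP430X CPU register rather than in a memory variable:

    /* Sketch of the [TSP<<10 | PSP] encoding. A variable stands in for
     * the reserved 20-bit MSP430X register used by the implementation. */
    #include <stdint.h>

    #define PTR_MASK 0x3FFu                 /* 10-bit pointer mask */

    static uint32_t packed;                 /* 20 useful bits      */

    static inline uint16_t psp(void) { return (uint16_t)(packed & PTR_MASK); }
    static inline uint16_t tsp(void) { return (uint16_t)((packed >> 10) & PTR_MASK); }

    static inline void set_pointers(uint16_t t, uint16_t p) {
        packed = ((uint32_t)(t & PTR_MASK) << 10) | (p & PTR_MASK);
    }

In the actual implementation, the shifts and masks operate directly on the reserved register, which is what makes push and pop slightly more expensive than with two dedicated registers.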
We further propose a second optimisation to reduce the cost of return instructions, which become more expensive as the stack grows. With each pop, we have to traverse the shadow stack to find the highest non-zero value, as seen in Figure 6. We call this the long-travel side-effect. While it cannot be avoided in most scenarios, the long-travel becomes unnecessary when our stack contains only zeros, i.e., when \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) is empty. To exploit this, we propose a stack-erasure optimisation: we erase the entire shadow stack as soon as the PSP reaches the bottom, as in the case of Figure 6 after ret(A).
We recall that erasing a Flash segment allows it to be written once again. On the one hand, this allows both the PSP and TSP to restart from the bottom, temporarily alleviating the long-travel issue. On the other hand, erasing a segment is an expensive operation, costing several milliseconds. Consequently, there is a risk of over-erasure, whereby applications such as the one in Listing 4 would trigger an erasure after each return, considerably degrading the performance of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) .
To mitigate this issue while still leveraging the performance boost in the case of long-travels, we introduce a threshold \(\theta\) for the TSP below which the stack is not erased. In other words, we erase the stack only when it is empty and its size has reached \(\theta\) . It is worth mentioning that \(\theta\) has an impact on the performance of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) : an excessively high value would leave the long-travel issue in place, whereas an excessively low value would accentuate the over-erasure issue. Notably, the ideal \(\theta\) is application dependent, being strictly related to the control flow of an application. Nevertheless, based on both statistics and testing, we set \(\theta =166\) . Algorithm 1 and Algorithm 2 show the pseudo-code of the new virtual functions, safe_ret and safe_call, respectively.
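For readers without access to the algorithms, the following C-style sketch is consistent with the behaviour described above. It is a reconstruction, not the exact pseudo-code of Algorithm 1 and Algorithm 2: the Flash helpers and the reset primitive are assumed TCM internals, the pointers are shown unpacked for readability, and the mismatch-triggers-reset behaviour follows Section 6.5.

    /* Hedged reconstruction of safe_call/safe_ret; not the paper's
     * exact pseudo-code. THETA is the erase threshold of Section 5.3. */
    #include <stdint.h>

    #define THETA 166u

    extern uint16_t shadow[];               /* shadow stack, in locked Flash */
    extern uint16_t tsp, psp;               /* packed in one register in practice */
    extern void tcm_flash_write_word(uint16_t *dst, uint16_t v);
    extern void tcm_flash_erase_stack(void);
    extern void tcm_reset(void);

    void safe_call(uint16_t ret_addr) {
        tcm_flash_write_word(&shadow[tsp], ret_addr);  /* append at the top */
        psp = tsp;
        tsp++;
    }

    uint16_t safe_ret(uint16_t stack_ret_addr) {
        uint16_t ret_addr = shadow[psp];
        if (ret_addr != stack_ret_addr)
            tcm_reset();                    /* tampered program stack (Section 6.5) */
        tcm_flash_write_word(&shadow[psp], 0);         /* pop: zero the slot  */
        while (psp > 0 && shadow[psp] == 0)            /* the long-travel     */
            psp--;
        if (shadow[psp] == 0 && tsp >= THETA) {        /* empty and large     */
            tcm_flash_erase_stack();                   /* restart from bottom */
            tsp = 0;
            psp = 0;
        }
        return ret_addr;
    }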

5.3.1 Limitations.

The current \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) implementation has some limitations. First, due to the 10-bit pointers, the shadow stack cannot hold more than 511 entries. This poses a hard limit on some recursive applications. In practice, the limit is exacerbated by the ever-growing design, which might further reduce the number of active entries: the TSP might be higher than 0 even with an empty stack. In this regard, \(\theta\) also has an effect on the maximum stack size; lower \(\theta\) values allow for more frequent stack erasure, thus limiting the aforementioned issue.

6 Experimental Evaluation

Embedded systems run a wide variety of low-level applications, differing in the type of operations performed and the hardware resources utilised. This heterogeneity, along with the lack of suitable benchmarks, makes the evaluation of a control flow integrity technique such as \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) non-trivial, requiring a careful selection of a representative set of applications that reflects the overhead incurred by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) from various perspectives.
We chose a set of 8 applications, considering three main factors: (i) compatibility with our target experimental platform (MSP430F5529LP), (ii) compatibility with the limitations of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) , and (iii) a good balance between the CPU-intensive and the IO-intensive tasks that are typical of an embedded device. Our evaluation metrics include memory footprint, execution time, and power consumption.
The proposed \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) implementation is a security primitive built on top of the \(\sf {{\rm\small PISTIS}}\) TCM [30], bringing new security functionality at the cost of additional overhead. Notably, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) also brings several optimisations and improvements to this TCM. Moreover, we use a different compiler version with a different libc library and slightly revised applications to make the tests more representative of our scenario. Finally, the runtime evaluation of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) was performed with a Saleae Logic Pro 8 logic analyser rather than with the internal clock timers of the reference board. As a result, the baseline data for \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) might differ from the data reported in the tables of \(\sf {{\rm\small PISTIS}}\) [30].
NOTE. Unless otherwise specified, applications are compiled with the -O3 optimisation flag and run at a CPU speed of 8 MHz.

6.1 Memory Footprint

Instrumentation inevitably adds extra bytes to the size of each application due to the insertion of extra instructions, e.g., NOP slides and virtual calls. Considering a GCC toolchain with the -O3 optimisation flag (which further optimises the execution time), Table 1 shows the size differences (in bytes) between binaries compiled with a standard GCC toolchain and with our custom toolchain that adds the required instrumentation. In other words, the recorded sizes represent the amount of consumed Flash memory in bytes.
Table 1.

App         | Memory Footprint              | Execution Time
            | Orig.    | Mod.               | Orig.      | Mod.
SerialMSP   | 296 B    | 346 B (+16.89%)    | 274.06 ms  | 275.44 ms (+0.50%)
CopyDMA     | 452 B    | 544 B (+20.35%)    | 118.90 ms  | 806.33 ms (+578.16%)
Bitcount    | 1,432 B  | 1,696 B (+18.44%)  | 5.4609 ms  | 5.7412 ms (+5.13%)
ML-acc      | 5,422 B  | 8,612 B (+58.83%)  | 3,985.4 ms | 16,407 ms (+311.69%)
16bitSwitch | 134 B    | 148 B (+10.45%)    | 0.00759 ms | 0.00760 ms (+0.13%)
8bitMatrix  | 876 B    | 878 B (+0.23%)     | 0.5498 ms  | 0.5499 ms (+0.02%)
MatrixMul   | 524 B    | 526 B (+0.38%)     | 0.3280 ms  | 0.3290 ms (+0.30%)
dhrystone   | 1,270 B  | 2,308 B (+81.73%)  | 100.42 ms  | 386.25 ms (+284.63%)
Average     |          | +25.91%            |            | +168.58%

Table 1. Memory Footprint and Execution Time Comparison between Original Applications (Orig.) and \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) -enabled Ones (Mod.)
Although \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) adds some instrumentation on top of \(\sf {{\rm\small PISTIS}}\) , e.g., requiring all calls to be instrumented, it also makes some instrumentation redundant (e.g., return NOP slides). Furthermore, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) brings several optimisations to the TCM, ultimately decreasing the memory overhead to an average of 25.91% (from the 33.74% of native \(\sf {{\rm\small PISTIS}}\) [30]).
Table 1 shows the comparison of the binaries with \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Notably, the results are application dependent, with the memory footprint increasing according to the types of instructions used. None of the tested applications incurred issues due to the increased memory size. We believe that in practical scenarios this memory increase is acceptable and compatible with the memory capacity of Class 1 devices [34]. Interestingly, by inspecting the instrumented binaries, it can be observed that, in some cases, a considerable part of the additional memory footprint lies within the libc stub functions.
As shown in Figure 7, both the \(\sf {{\rm\small PISTIS}}\) TCM and \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) persistently occupy part of the Flash memory. Compared with the original size of the native \(\sf {{\rm\small PISTIS}}\) core (7.4 kB), \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) brings some optimisations that result in a reduced core of 7.2 kB, corresponding to 5.5% of the available Flash memory ( \(\sim\) 132 kB) of the target MCU. Thus, although \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) expands the multi-fold functionality of \(\sf {{\rm\small PISTIS}}\) , its optimisations yield an overall smaller memory footprint.

6.2 Execution Time

Like all control-flow integrity techniques, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) imposes a runtime overhead on applications’ execution times due to the need to perform runtime checks on their control transfer instructions. To evaluate the impact of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) on the normal execution of applications, we measured the difference between the time it takes to execute our test applications with and without \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Measurements are recorded at a CPU speed of 1 MHz, averaging 5 different test runs for each application. Table 1 shows the overhead imposed by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) , which increases the execution time by an average of 168.58%.
As expected, the \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) overhead is greater than that of native \(\sf {{\rm\small PISTIS}}\) (40.30%) due to the need to insert shadow stack operations on all call and return statements. The resulting overhead averages 168.58%, highlighting how the cost of a software-only shadow stack can be prohibitive in several scenarios. However, the results are highly application dependent, with some applications incurring an overhead of around 5%. In particular, the overhead is high in scenarios with a large number of function calls, while being negligible in other cases.
This might be an acceptable price to pay for accommodating a full-featured trusted computing architecture with an integrated shadow stack without any hardware modification, especially in scenarios in which security is critical. In such cases, the performance drop might be overshadowed by the increased security, thus allowing the safe execution of the untrusted application. We also note that optimisation flags other than -O3 might improve the execution time for some applications. Finally, we note that the \(\sf {\mathcal {R}A}\) mechanism accompanying \(\sf {{\rm\small PISTIS}}\) takes 7.4 seconds to compute a MAC digest over a 64 kB memory block at the default clock speed (8 MHz), using the HACL* HMAC-SHA2 cryptographic primitive [67].
To further analyse the results, we performed a deep inspection of the instrumented code of each application and of its execution. From this analysis, we could determine that a large part of the overhead can be ascribed to the high number of function calls introduced by the MSP430 libc library, which is transparently used by the compiler and comprises many stub functions that are crucial to the MSP430 architecture; these functions make heavy use of the shadow stack. We reckon that a standard library crafted explicitly for our use case, and thus with a reduced number of function calls, could boost the performance of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Similarly, it is possible to decrease the application overhead by inlining function calls within the application code. This would produce denser yet faster code, thus increasing the compatibility of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) with more performance-demanding applications.
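As a simple illustration of the latter point, forcing the compiler to inline a small helper removes the call/return pair, and with it one shadow-stack push/pop round-trip per invocation. The GCC attribute below is standard; the function names are hypothetical:

    /* Hypothetical example: after inlining, no CALL/RET pair is emitted
     * for the helper, so FLAShadow inserts no shadow-stack operations. */
    #include <stdint.h>

    static inline __attribute__((always_inline))
    uint16_t scale(uint16_t x) {
        return (uint16_t)(x << 2);
    }

    uint16_t process(uint16_t v) {
        return (uint16_t)(scale(v) + 1);    /* no call/ret emitted here */
    }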

6.3 Power Consumption

Embedded devices are often used in areas where a connection to a power line is infeasible or expensive. Therefore, such devices depend on batteries as their main power source, requiring optimised energy usage to reduce maintenance costs.
Figure 8 compares the amount of energy consumed by applications compiled with and without \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Three types of applications are considered: I/O-, memory-, and CPU-intensive. Considering a 3.0 V / 3 Ah battery, Figure 8 shows the impact on battery life at different execution rates. In the worst-case scenario, i.e., running once a second, our exemplar applications SerialMSP, Bitcount, and CopyDMA decrease the battery lifetime by 0.30%, 0.15%, and 69.69%, respectively. The battery lifetime degradation drops below 0.01% when running the aforementioned applications at a rate of once every 120 min, 30 min, and 360 h, respectively. We also note that the battery would last no less than 1 year when performing our \(\sf {\mathcal {R}A}\) mechanism once an hour while any of the sample applications executes once every 2 minutes.
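The underlying estimate is straightforward; the sketch below reproduces the calculation with placeholder per-run energies and idle power (the measured values behind Figure 8 are not repeated here):

    /* Battery-lifetime estimate used for Figure 8-style comparisons.
     * The per-run energy and idle-power numbers are placeholders. */
    #include <stdio.h>

    #define BATTERY_WH (3.0 * 3.0)   /* 3.0 V x 3 Ah = 9 Wh */

    static double lifetime_hours(double e_run_wh, double runs_per_hour,
                                 double p_idle_w) {
        return BATTERY_WH / (p_idle_w + e_run_wh * runs_per_hour);
    }

    int main(void) {
        double orig = lifetime_hours(1.0e-6, 3600.0, 1.0e-4); /* placeholder */
        double mod  = lifetime_hours(1.2e-6, 3600.0, 1.0e-4); /* placeholder */
        printf("lifetime degradation: %.2f%%\n", 100.0 * (1.0 - mod / orig));
        return 0;
    }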
Fig. 8.
Fig. 8. An analysis of the power consumption of three exemplar applications when running with their original binary, with a native \(\sf {{\rm\small PISTIS}}\) -compliant binary, and with a \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) -compliant binary.

6.4 \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) in Real-World Scenarios

The security protection introduced by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) comes at a memory and performance price that might not always be acceptable. In particular, the high runtime overhead might be prohibitive in real-time applications, where \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) could break their functionality, or in scenarios in which performance is more important than security. However, we argue that many real-world applications deployed on these very constrained devices can afford the performance drop in exchange for the protection of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Let us consider a few real-world examples: a smart lock [51], a health monitoring system [33], and an agricultural irrigation node [38]. In such instances, the embedded applications might have an execution time on the order of fractions of a second, which is almost irrelevant to the functionality of the device. It is of utmost importance for these devices to correctly carry out their task, regardless of whether this is accomplished with a small delay. Consequently, the \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) runtime overhead would not disrupt these applications’ functionality, and its protection would fill their security gap and prevent potentially catastrophic attacks.
Furthermore, we reckon that in practice the overhead introduced by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) in these real-world scenarios can be much smaller. These embedded applications are seldom computationally intensive, focusing rather on interaction with the outside world: examples are sensors, actuators, or network nodes. In such cases, the process of data acquisition or transmission introduces delays that make most computational overhead negligible. We demonstrate one such case by emulating a real-world deployment on an MSP430-based device equipped with a gyroscope sensor and a low-power IEEE 802.15.4-compliant radio. The main task of the device was to obtain 10 different measurements from the gyroscope (x, y, z coordinates), encrypt them, transmit the encrypted data over the network to a host controller, and wait for an acknowledgement. By measuring the time required to perform this entire set of actions (which constitutes the job of the IoT device), we observed that, out of a total execution time of 276 ms, only 6.02% of the time was spent in the software instrumented by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . The majority of the time (93.98%) was consumed by the network layer (which is hosted in another MCU and interfaced with through peripherals). This means that even if \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) incurs a 168% performance overhead on the corresponding software module, the overall overhead on the total execution time of the entire job will not exceed 10.1%. Therefore, we believe that the runtime overhead incurred by \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can be reasonable in practice.
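To make the arithmetic explicit: with the instrumented module accounting for \(6.02\%\) of the job, a \(168\%\) slowdown inflates the total to \(93.98\% + 6.02\% \times 2.68 \approx 110.1\%\) of the original execution time, i.e., an overall overhead of roughly \(10.1\%\) .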
In conclusion, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can be a valid and practical security solution in many security-sensitive contexts where more sophisticated security hardware is too expensive or impractical to deploy and software-based techniques are the only viable option.

6.5 Security Evaluation

\(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) is designed with the goal of thwarting ROP attacks, thus increasing the security of vulnerable applications against control-flow attacks. Considering the attacker model described in Section 3.1, we proceed to demonstrate the effectiveness of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) . Let us consider an application with a buffer overflow that grants the attacker the ability to write to any arbitrary memory address.
First, we consider the case in which the attacker tries to directly modify a return address on the stack. In such a case, even a successful corruption would not alter the application control flow since \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) fetches its return addresses from the separate shadow stack. Second, the attacker could try to directly modify the values on the shadow stack instead by either attempting a direct injection or by hijacking the virtual function safe_call. Both these attacks would fail due to the intrinsic write protection granted by \(\sf {{\rm\small PISTIS}}\) TCM, which prevents any unauthorised write on Flash, where both the shadow stack and the code of the virtual functions are stored.
A more sophisticated attacker could try to circumvent the use of the shadow stack by crafting code that does not use the safe_ret virtual call. This could be achieved either by tampering with the compiled binary or by tampering with the toolchain itself. However, the Loader/Verifier in the TCM thwarts such an attack by ensuring that the executed binary does not contain unprotected return statements and that those encoded in variable-length instructions are inaccessible (see Section 3.2.2). Finally, a different ROP attack vector is the use of exceptions, trying to corrupt the return address at the end of the exception frame. However, \(\sf {{\rm\small PISTIS}}\) intrinsically protects the stack-frame return address by storing it in the write-protected Flash and then using safe_reti to fetch it.
To further show the effectiveness of \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) , we created a demo application vulnerable to ROP attacks and then tested how \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can protect it. Our application simulates a smart lock service that (i) receives a PIN from the user, (ii) compares it with a pre-defined correct PIN, and (iii) opens the lock if the two match. For the sake of simplicity, we hard-coded the insertion of the PIN in the application code, bypassing a more realistic interaction between the user and a peripheral (e.g., a keyboard). Rather than writing to some MMIO registers to open an actual lock, we simply turn on a red LED to simulate an open lock. These two simplifications are reasonable since they do not alter the logic of the application.
One of the prerequisites for an ROP attack is a memory vulnerability, which is very common in C and C++ programs. As such, we introduced a buffer overflow in the receivePin function, allowing potential attackers to corrupt the stack. In practical and more complex ROP attacks, the attacker could leverage this vulnerability to chain several ROP gadgets, as in [3], and execute Turing-complete code. Here, we simply demonstrate the feasibility of ROP attacks by simulating a malicious over-length user input that corrupts the stack, overwriting the return address of the receivePin function with the address of the openLock function. In an unprotected environment, this simple attack manages to divert the control flow of the application to an illegal path, successfully triggering the opening of the lock.
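For concreteness, the vulnerable service might look like the following sketch. The function names follow the description above, but the bodies are illustrative reconstructions rather than the exact demo code:

    /* Illustrative reconstruction of the smart-lock demo; not the exact
     * evaluation code. P1DIR/P1OUT/BIT0 are standard MSP430 GPIO symbols
     * driving the red LED that stands in for the lock actuator. */
    #include <msp430.h>
    #include <string.h>

    #define CORRECT_PIN "4321"              /* placeholder value */

    static void openLock(void) {
        P1DIR |= BIT0;                      /* red LED on: "lock open" */
        P1OUT |= BIT0;
    }

    static void receivePin(const char *input, char *pin_out) {
        char buf[8];
        strcpy(buf, input);                 /* planted overflow: no bounds check */
        strcpy(pin_out, buf);
    }

    int main(void) {
        char pin[8];
        WDTCTL = WDTPW | WDTHOLD;           /* stop the watchdog */
        /* Hard-coded "user input": an over-length string here can overwrite
         * receivePin's return address with the address of openLock. */
        receivePin("1111", pin);
        if (strcmp(pin, CORRECT_PIN) == 0)
            openLock();
        for (;;) { /* idle */ }
    }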
To demonstrate the effectiveness of our shadow stack, we tested the same application and the same attack on our MSP430 device with \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) installed. With our shadow stack in place, the attacker still manages to overwrite the return address on the program stack. However, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) forces the MCU to trigger a reset as soon as the receivePin function tries to return to the corrupted address. This simple PoC thus demonstrates how \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can protect the backward edges of the application, effectively defending it from ROP attacks.

7 Related Work

The research area of control-flow hijacking attacks and defences has been active for more than two decades, with various proposals for detecting or protecting against such attacks. Nevertheless, many of the proposed defences are not compatible with low-end embedded systems, whose protection is often hindered by the lack of proper security hardware. Therefore, this section focuses on related work targeting MCUs, which represent the core of any embedded device. CFI and Control Flow Attestation (CFA) are the most common defence approaches proposed for embedded systems.
Control Flow Integrity. CFI is a security technique that mitigates control-flow hijacking attacks by restricting the set of possible control-flow transfers to those that are strictly required for correct program execution. In general, CFI solutions instrument indirect control transfer (ICT) instructions, whose transfer targets could be compromised, with inline reference monitors that enforce, at runtime, that ICT instructions only jump to legitimate targets. The enforced policy is derived from a legitimate CFG generated at compile time. Notably, there is a wide range of CFI implementations, each making its own choice of CFI policy, i.e., what precision of CFG it enforces, and thus achieving a different degree of attack-surface reduction.
CFI-CaRE [47] proposes a forward-edge CFI mechanism along with a shadow stack targeting ARMv8-M-based MCUs, leveraging TrustZone-M [11] for hardware isolation and protection of metadata. SCFP [61] extends a RISC-V core designed for embedded devices with a hardware module between the CPU’s fetch and decode stages, which continuously authenticates instructions, yielding a fine-grained control-flow integrity scheme for both forward and backward edges. Similar to CFI-CaRE, RECFISH [60] also proposes a forward-edge CFI mechanism with a shadow stack, targeting closed-source binaries running on ARM Cortex-R MCUs with real-time operating systems. It uses binary instrumentation and a Memory Protection Unit (MPU) to place the shadow stack in a privileged region, requiring a system call for each return. Notably, both CFI-CaRE and RECFISH incur a high performance overhead in many cases, i.e., \(\approx 500\%\) for some applications. \(\mu\) RAI [6] mitigates ROP attacks by replacing all return instructions with hard-coded jump tables, where the address of the right return entry is encoded in a single reserved register that is never spilled into memory without protection. To handle cases such as calling non-instrumented libraries and interrupt routines that execute at a higher privilege level, \(\mu\) RAI depends on a variant of the SFI approach [59] along with the existence of an MPU. Silhouette [65] also targets ROP attacks by proposing an MPU-protected shadow stack for ARM-based MCUs. It creates a logical separation between code and data memory by exploiting the ability to execute store instructions in either privileged or unprivileged mode: the proposed shadow stack can only be modified using privileged store instructions, and the MPU enforces the memory access rules. FH-CFI [28] utilises a hash-based message authentication code scheme to encrypt and decrypt return and call-site instructions to defeat control-flow hijacking attacks; it mainly targets ARM MCUs and thus depends on some hardware features to protect encryption keys. Kage [26] is a software system that mitigates control-flow hijacking attacks on MCUs with real-time embedded operating systems. It consists of a kernel extension that enforces protection at runtime by leveraging an MPU, and a compiler that transforms code to be Kage-compliant. Kage can be seen as an enhancement of Silhouette [65] that enables intra-address-space isolation, separating control and non-control data in two different parts of memory. Finally, FIXER [24] is a hardware extension for embedded RISC-V processors that guarantees CFI against both JOP and ROP attacks.
All of the aforementioned approaches offer different performance and, more importantly, different security guarantees. We compare the latter in Table 2. Notably, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) is the only system that effectively supports the use of an untrusted toolchain while also protecting the interrupt routines. Specifically, support for an untrusted toolchain guards the device against tampering with the application binaries or with the toolchain itself, while interrupt protection prevents an attacker from exploiting vulnerabilities in the code that handles interrupts and exceptions to mount ROP attacks. Both of these security guarantees considerably lower the attack surface, guarding against stronger attacker models, as shown in Section 6.5. Regardless of the pros and cons of the aforementioned schemes, they all depend on some sort of hardware assistance to enforce protection. As Table 2 highlights, this assistance ranges from the introduction of custom hardware to the use of specific security components. Consequently, these solutions become impractical and expensive, requiring the replacement and redeployment of a considerable number of legacy, cheap, and very low-end devices. In contrast, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) is a pure-software approach targeting embedded systems with zero hardware security features, thus filling this security gap by allowing the reuse of existing devices without compromising the security of the system. As argued in Section 6.4, this is most needed in security-sensitive scenarios.
Table 2.

CFI Defence                           | HW Requirements | Untrusted Toolchain | Interrupt Routines Protection
FIXER [24]                            | Custom HW       | Yes                 | No
CFI-CaRE [47]                         | TrustZone + MPU | No                  | Yes
SCFP [61]                             | Custom HW       | No                  | Yes
RECFISH [60]                          | MPU             | No                  | No
\(\mu\) RAI [6]                       | MPU             | No                  | Yes
Silhouette [65]                       | MPU             | No                  | No
Kage [26]                             | MPU             | No                  | Yes
\(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) | None            | Yes                 | Yes

Table 2. \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) vs. State-of-the-art Backward CFI Defences from Various Perspectives
Control Flow Attestation (CFA). CFA has emerged as a technique for attesting the integrity of the execution state of applications, following the main principles of static attestation mechanisms [10]. The main argument for CFA is that naïvely integrating CFI approaches into static remote attestation protocols would provide limited state information to the remote verifier: CFI techniques can only report whether a control-flow attack occurred and provide no information about the actual executed control-flow path. Therefore, the verifier cannot determine which path has been hijacked to divert the control-flow execution.
C-FLAT [4] is the first CFA proposal for embedded systems. It assigns a unique ID to each basic block of code and instruments control-transfer instructions to log the IDs of their blocks in a protected memory area, targeting and leveraging TrustZone-enabled MCUs. A hash chain of the accumulated logs is calculated and sent to the verifier upon request; the verifier then checks the received report against reference measurements derived from a CFG pre-generated at compile time. LO-FAT [25] overcomes the high performance overhead of C-FLAT (due to software instrumentation) by proposing a pure-hardware CFA scheme, extending a RISC-V core with hardware monitors. ATRIUM [62] extends LO-FAT by considering a stronger adversary that is capable of mounting some sort of physical attack. Tiny-CFA [46] is a hardware-software co-design for a CFA scheme that requires minimal hardware modifications and is thus cheaper than LO-FAT and ATRIUM. To the best of our knowledge, all existing CFA schemes are hardware-assisted; therefore, they are infeasible for simple embedded systems that are already manufactured and cannot be modified. Furthermore, compared with \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) , while such schemes are able to detect ROP attacks, they do not provide any prevention capabilities. In critical systems in which detection is not enough, \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can provide stronger security guarantees, preventing ROP attacks at runtime.

8 Conclusion

This article proposes \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) , a software-based shadow stack design to mitigate ROP attacks on low-end embedded systems. \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) is an extension of \(\sf {{\rm\small PISTIS}}\) , a software-based memory isolation mechanism. The key idea is to provide an integrity-protected memory area on the Flash, leveraging software-based memory isolation, where a shadow stack can be securely maintained. With our evaluation, we strive to answer the question of how feasible it is to protect the lowest-end devices, those for which the state of the art does not offer comprehensive solutions due to the lack of security hardware. We show how a solution such as \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) can be reasonable in some real-world scenarios, offering strong security guarantees with a practical overhead. Such a solution fills a security gap for which other existing solutions might be too expensive or impractical. As future work, we plan to design and implement a fine-grained control flow integrity scheme for forward edges that can be seamlessly integrated with \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) .

Acknowledgments

Though funded by the European Union, views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Climate, Infrastructure and Environment Executive Agency (CINEA). Neither the European Union nor the granting authority can be held responsible for them.

Footnotes

1. As in the work in [9], the memory-safety property can be achieved by formally verifying the code before deploying it.
2. Exceptions can be made for critical MMIO registers.
3. The list of instrumented instructions is described in Table 5 in Appendix A of the original paper [30].
4. On MSP430, cleared addresses contain 0xFF.
5. This allows \(\sf {{\rm FLAS}{\rm\small{HADOW}}}\) to cheaply store values without overwriting.

References

[1]
Martín Abadi, Mihai Budiu, Ulfar Erlingsson, and Jay Ligatti. 2009. Control-flow integrity principles, implementations, and applications. ACM Transactions on Information and System Security (TISSEC) 13, 1 (2009), 1–40.
[2]
Martín Abadi and Gordon D. Plotkin. 2012. On protection by layout randomization. ACM Transactions on Information and System Security (TISSEC) 15, 2 (2012), 1–29.
[3]
AbdElaziz Saad AbdElaziz AbdElaal, Kai Lehniger, and Peter Langendörfer. 2021. Incremental code updates exploitation as a basis for return oriented programming attacks on resource-constrained devices. In 2021 5th Cyber Security in Networking Conference (CSNet’21). IEEE, 55–62.
[4]
Tigist Abera, N. Asokan, Lucas Davi, Jan-Erik Ekberg, Thomas Nyman, Andrew Paverd, Ahmad-Reza Sadeghi, and Gene Tsudik. 2016. C-FLAT: Control-flow attestation for embedded systems software. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 743–754.
[5]
Nada Abughazaleh, R. Bin, and Mai Btish. 2020. DoS attacks in IoT systems and proposed solutions. International Journal of Computer Applications 176, 33 (2020), 16–19.
[6]
Naif Saleh Almakhdhub, Abraham A. Clements, Saurabh Bagchi, and Mathias Payer. 2020. \(\mu\) RAI: Securing embedded systems with return address integrity. In Network and Distributed Systems Security (NDSS) Symposium.
[7]
Mahmoud Ammar and Bruno Crispo. 2020. Verify&Revive: Secure detection and recovery of compromised low-end embedded devices. In Proceedings of the 36th Annual Computer Security Applications Conference (ACSAC’20).
[8]
Mahmoud Ammar, Bruno Crispo, Ivan De Oliveira Nunes, and Gene Tsudik. 2021. Delegated attestation: Scalable remote attestation of commodity CPS by blending proofs of execution with software attestation. In Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks. 37–47.
[9]
Mahmoud Ammar, Bruno Crispo, Bart Jacobs, Danny Hughes, and Wilfried Daniels. 2019. S \(\mu\) V–The security MicroVisor: A formally-verified software-based security architecture for the Internet of Things. IEEE Transactions on Dependable and Secure Computing 16, 5 (2019), 885–901.
[10]
Mahmoud Ammar, Bruno Crispo, and Gene Tsudik. 2020. SIMPLE: A remote attestation approach for resource-constrained IoT devices. In 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS’20). IEEE, 247–258.
[11]
ARM. 2015. TrustZone Technology for Armv8-M. Retrieved 10 November 2020 from https://developer.arm.com/ip-products/security-ip/trustzone/trustzone-for-cortex-m
[12]
ARM. 2022. TrustZone Technology for the ARMv8-M Architecture Version 2.0. Retrieved 13 November 2022 from https://developer.arm.com/documentation/100690/0200/ARM-TrustZone-technology
[13]
Michele Grisafi. 2023. FLAShadow Source Code Repository. Retrieved 10 July 2024 from https://github.com/MicheleGrisafi/Flashadow
[14]
Ayush Bansal and Debadatta Mishra. 2021. A practical analysis of ROP attacks. arXiv preprint arXiv:2111.03537 (2021).
[15]
Tyler Bletsch, Xuxian Jiang, Vince W. Freeh, and Zhenkai Liang. 2011. Jump-oriented programming: A new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security. 30–40.
[16]
Nathan Burow, Xinping Zhang, and Mathias Payer. 2019. SoK: Shining light on shadow stacks. In 2019 IEEE Symposium on Security and Privacy (SP’19). IEEE, 985–999.
[17]
Nicholas Carlini and David Wagner. 2014. ROP is still dangerous: Breaking modern defenses. In 23rd USENIX Security Symposium (USENIX Security’14). 385–399.
[18]
Claude Castelluccia, Aurélien Francillon, Daniele Perito, and Claudio Soriente. 2009. On the difficulty of software-based attestation of embedded devices. In Proceedings of the 16th ACM Conference on Computer and Communications Security. 400–409.
[19]
Stephen Checkoway, Lucas Davi, Alexandra Dmitrienko, Ahmad-Reza Sadeghi, Hovav Shacham, and Marcel Winandy. 2010. Return-oriented programming without returns. In Proceedings of the 17th ACM Conference on Computer and Communications Security. 559–572.
[20]
Abraham A. Clements, Naif Saleh Almakhdhub, Khaled S. Saab, Prashast Srivastava, Jinkyu Koo, Saurabh Bagchi, and Mathias Payer. 2017. Protecting bare-metal embedded systems with privilege overlays. In 2017 IEEE Symposium on Security and Privacy (SP’17). IEEE, 289–303.
[21]
Thurston H. Y. Dang, Petros Maniatis, and David Wagner. 2015. The performance cost of shadow stacks and stack canaries. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security. 555–566.
[22]
Dave Muoio. 2020. DA Warns of Bluetooth Low Energy Vulnerability Affecting Connected Medical Devices. Retrieved 10 May 2021 from https://www.mobihealthnews.com/news/fda-warns-bluetooth-low-energy-vulnerability-affecting-connected-medical-devices
[23]
John H. Davies. 2008. MSP430 Microcontroller Basics. Elsevier.
[24]
Asmit De, Aditya Basu, Swaroop Ghosh, and Trent Jaeger. 2019. FIXER: Flow integrity extensions for embedded RISC-V. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE’19). IEEE, 348–353.
[25]
Ghada Dessouky, Shaza Zeitouni, Thomas Nyman, Andrew Paverd, Lucas Davi, Patrick Koeberl, N. Asokan, and Ahmad-Reza Sadeghi. 2017. LO-FAT: Low-Overhead control Flow ATtestation in hardware. In Proceedings of the 54th Annual Design Automation Conference 2017. 1–6.
[26]
Yufei Du, Zhuojia Shen, Komail Dharsee, Jie Zhou, Robert J. Walls, and John Criswell. 2022. Holistic control-flow protection on real-time embedded systems with Kage. In 31st USENIX Security Symposium (USENIX Security’22). 2281–2298.
[27]
Karim Eldefrawy, Gene Tsudik, Aurélien Francillon, and Daniele Perito. 2012. SMART: Secure and minimal architecture for (establishing dynamic) root of trust. In 19th Annual Network and Distributed System Security Symposium (NDSS’12). The Internet Society.
[28]
Anmin Fu, Weijia Ding, Boyu Kuang, Qianmu Li, Willy Susilo, and Yuqing Zhang. 2022. FH-CFI: Fine-grained hardware-assisted control flow integrity for ARM-based IoT devices. Computers & Security 116 (2022), 102666.
[29]
Michele Grisafi, Mahmoud Ammar, and Bruno Crispo. 2022. On the (in) security of memory protection units: A cautionary note. In 2022 IEEE International Conference on Cyber Security and Resilience (CSR’22). IEEE, 157–162.
[30]
Michele Grisafi, Mahmoud Ammar, Marco Roveri, and Bruno Crispo. 2022. PISTIS: Trusted computing architecture for low-end embedded systems. In 31st USENIX Security Symposium (USENIX Security’22).
[31]
Taylor Hardin, Ryan Scott, Patrick Proctor, Josiah Hester, Jacob Sorber, and David Kotz. 2018. Application memory isolation on ultra-low-power MCUs. In USENIX Annual Technical Conference (ATC’18). 127–132.
[32]
IMEC. 2020. The Medical Implants of the Future: Faster, Smarter and More Connected. Retrieved 10 May 2021 from https://www.imec-int.com/en/imec-magazine/imec-magazine-april-2020/the-medical-implants-of-the-future-faster-smarter-and-more-connected
[33]
M. U. Inamdar and M. S. Biradar. 2015. Human health monitoring system in abnormal condition using MSP 430 to remote PC. IJIERT - International Journal of Innovations in Engineering Research and Technology 2, 4 (2015).
[34]
Internet Engineering Task Force (IETF). 2014. Terminology for Constrained-Node Networks. Retrieved 10 July 2024 from https://datatracker.ietf.org/doc/html/rfc7228#section-2.1
[35]
Harshvardhan P. Joshi, Aravindhan Dhanasekaran, and Rudra Dutta. 2015. Trading off a vulnerability: Does software obfuscation increase the risk of ROP attacks. Journal of Cyber Security and Mobility (2015), 305–324.
[36]
Rozan Khader and Derar Eleyan. 2021. Survey of DoS/DDoS attacks in IoT. Sustainable Engineering and Innovation 3, 1 (2021), 23–28.
[37]
Tim Kornau. 2010. Return Oriented Programming for the ARM Architecture. Master’s thesis, Ruhr-Universität Bochum.
[38]
V. R. Kumari, P. Naga Lakshmi, V. Baby Soujanya, and R. Geetha. 2021. MSP 430 lunch box based smart irrigation system. International Research Journal of Innovations in Engineering and Technology 5, 4 (April 2021), 79–82. https://www.proquest.com/scholarly-journals/msp-430-lunch-box-based-smart-irrigation-system/docview/2607595035/se-2
[39]
Jinfeng Li, Liwei Chen, Qizhen Xu, Linan Tian, Gang Shi, Kai Chen, and Dan Meng. 2020. Zipper stack: Shadow stacks without shadow. In European Symposium on Research in Computer Security. Springer, 338–358.
[40]
Qi Liu, Kaibin Bao, and Veit Hagenmeyer. 2022. Binary exploitation in industrial control systems: Past, present and future. IEEE Access 10 (2022), 48242–48273.
[41]
Microsoft. 2022. Control Flow Guard for Platform Security. Retrieved 13 October 2022 from https://learn.microsoft.com/en-us/windows/win32/secbp/control-flow-guard
[42]
Ingo Molnar. 2003. Exec Shield. Retrieved from http://lkml.org/lkml/2003/5/2/96
[43]
Brendan Moran, Milosch Meriac, Hannes Tschofenig, and David Brown. 2019. A firmware update architecture for Internet of Things devices. Internet Engineering Task Force, Internet-Draft (2019).
[44]
Job Noorman, Pieter Agten, Wilfried Daniels, Raoul Strackx, Anthony Van Herrewege, Christophe Huygens, Bart Preneel, Ingrid Verbauwhede, and Frank Piessens. 2013. SANCUS: Low-cost trustworthy extensible networked devices with a zero-software trusted computing base. In 22nd USENIX Security Symposium. 479–498.
[45]
Ivan De Oliveira Nunes, Karim Eldefrawy, Norrathep Rattanavipanon, Michael Steiner, and Gene Tsudik. 2019. VRASED: A verified hardware/software co-design for remote attestation. In 28th USENIX Security Symposium. 1429–1446.
[46]
Ivan De Oliveira Nunes, Sashidhar Jakkamsetti, and Gene Tsudik. 2021. Tiny-CFA: Minimalistic control-flow attestation using verified proofs of execution. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 641–646.
[47]
Thomas Nyman, Jan-Erik Ekberg, Lucas Davi, and N. Asokan. 2017. CFI CaRE: Hardware-supported call and return enforcement for commercial microcontrollers. In International Symposium on Research in Attacks, Intrusions, and Defenses. Springer, 259–284.
[48]
Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. 2012. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC) 15, 1 (2012), 1–34.
[49]
Hovav Shacham. 2007. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security. 552–561.
[50]
Muhammad Ali Siddiqi, Angeliki-Agathi Tsintzira, Georgios Digkas, Miltiadis G. Siavvas, and Christos Strydis. 2021. Adding security to implantable medical devices: Can we afford it?. In EWSN. 67–78.
[51]
Surya Prasada Rao Borra and Geetha Devi Appari. 2021. A smart home security locker using microcontroller. IOSR Journal of Engineering (IOSRJEN) 11, 7 (July 2021), 1–4.
[52]
Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. 2013. SoK: Eternal war in memory. In 2013 IEEE Symposium on Security and Privacy. IEEE, 48–62.
[53]
Gang Tan. 2017. Principles and Implementation Techniques of Software-based Fault Isolation. Now Publishers.
[54]
Jack Tang and Trend Micro Threat Solution Team. 2015. Exploring Control Flow Guard in Windows 10. Retrieved from http://blog.trendmicro.com/trendlabs-security-intelligence/exploring-control-flow-guard-in-windows-10/
[55]
The Clang Team. 2022. Control Flow Integrity. Retrieved 13 October 2022 from https://clang.llvm.org/docs/ControlFlowIntegrity.html
[56]
The Clang Team. 2022. Safe Stack. Retrieved 13 October 2022 from https://clang.llvm.org/docs/SafeStack.html
[57]
The Clang Team. 2022. Shadow Call Stack. Retrieved 13 October 2022 from https://clang.llvm.org/docs/ShadowCallStack.html
[58]
Caroline Tice, Tom Roeder, Peter Collingbourne, Stephen Checkoway, Úlfar Erlingsson, Luis Lozano, and Geoff Pike. 2014. Enforcing forward-edge control-flow integrity in GCC & LLVM. In 23rd USENIX Security Symposium (USENIX Security’14). 941–955.
[59]
Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. 1993. Efficient software-based fault isolation. ACM SIGOPS Operating Systems Review 27, 5 (Dec. 1993), 203–216.
[60]
Robert J. Walls, Nicholas F. Brown, Thomas Le Baron, Craig A. Shue, Hamed Okhravi, and Bryan C. Ward. 2019. Control-flow integrity for real-time embedded systems. In 31st Euromicro Conference on Real-Time Systems (ECRTS’19). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[61]
Mario Werner, Thomas Unterluggauer, David Schaffenrath, and Stefan Mangard. 2018. Sponge-based control-flow protection for IoT devices. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P’18). IEEE, 214–226.
[62]
Shaza Zeitouni, Ghada Dessouky, Orlando Arias, Dean Sullivan, Ahmad Ibrahim, Yier Jin, and Ahmad-Reza Sadeghi. 2017. Atrium: Runtime attestation resilient under memory attacks. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’17). IEEE, 384–391.
[63]
Chao Zhang, Tao Wei, Zhaofeng Chen, Lei Duan, Laszlo Szekeres, Stephen McCamant, Dawn Song, and Wei Zou. 2013. Practical control flow integrity and randomization for binary executables. In 2013 IEEE Symposium on Security and Privacy. IEEE, 559–573.
[64]
Mingwei Zhang and R. Sekar. 2015. Control flow and code integrity for COTS binaries: An effective defense against real-world ROP attacks. In Proceedings of the 31st Annual Computer Security Applications Conference. 91–100.
[65]
Jie Zhou, Yufei Du, Zhuojia Shen, Lele Ma, John Criswell, and Robert J. Walls. 2020. Silhouette: Efficient protected shadow stacks for embedded systems. In 29th USENIX Security Symposium (USENIX Security’20). 1219–1236.
[66]
Wei Zhou, Le Guan, Peng Liu, and Yuqing Zhang. 2019. Good motive but bad design: Why ARM MPU has become an outcast in embedded systems. arXiv preprint arXiv:1908.03638 (2019).
[67]
Jean-Karim Zinzindohoué, Karthikeyan Bhargavan, Jonathan Protzenko, and Benjamin Beurdouche. 2017. HACL*: A verified modern cryptographic library. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 1789–1806.
