Open access

Coherence Attacks and Countermeasures in Interposer-based Chiplet Systems

Authors:

Vassos Soteriou

ACM Transactions on Architecture and Code Optimization, Volume 21, Issue 2

Article No.: 23, Pages 1 - 25

https://doi.org/10.1145/3633461

Published: 15 February 2024

Abstract

Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, and so on via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets that come from untrusted, or not fully trusted, third-party vendors.

In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks that pose a significant threat to chiplet-based designs and demonstrate how these basic attacks can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme using an active interposer as a generic, secure-by-construction platform that forms a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overheads on system performance, ∼4%. Further, we demonstrate that our scheme scales efficiently as system size and memory capacities increase, resulting in reduced performance overheads.

1 Introduction

A recent trend in computing systems is the adoption of chiplets on interposers as a form of 2.5D integration [44, 60, 61, 75]. This approach disaggregates the functional components across multiple smaller chips, i.e., chiplets, designed and manufactured separately. These chiplets then serve as hard intellectual property (IP) modules and are consolidated on an integration and interconnect carrier, i.e., an interposer [44, 60, 61, 72, 75]. The adoption of chiplet and interposer integration raises design reuse to the level of the physical system, optimizing yields and streamlining time to market, resulting in significant cost benefits. Industry has already adopted 2.5D integration in products such as AMD Epyc processors [60, 61]. Most 2.5D integrated systems are single vendor to date with future designs expected to source chiplets from various vendors [1, 30].

Similar to systems on chips (SoCs), chiplet-based systems are composed of different IPs provided by different designers. Unlike SoCs, chiplets raise the level of IP disaggregation to the system level, integrating different hard IPs that are not only designed by different designers but also manufactured at different facilities (and integrated in a final manufacturing stage). While chiplets incur a higher design complexity, industry is developing standards [1, 3] to describe their communication and integration. As chiplets can each be a specialized component, there is speculation that a market of vendors specializing in chiplet design and manufacturing capabilities will grow to an estimated $47Bn industry by 2031 [2].

1.1 Security Challenges

Interposer-based systems are vulnerable to traditional attacks and a range of dedicated, new attacks. Various third-party chiplets may introduce vulnerabilities, e.g., via untrusted design or fabrication [40] of chiplets, malicious or buggy third-party IPs [67] within the chiplets, or collusion of multiple malicious actors across chiplets.

Hardware Trojans, or Trojans for short [8], are a threat in which an attacker infiltrates some level of the design or fabrication process to insert malicious circuitry into a design. Trojans can cause disastrous system failures via confidentiality, integrity, and availability violations. Prior work demonstrates that Trojans can leak data from memory [42], disrupt cryptographic security features [9], and induce denial-of-service attacks [45].

As industry moves towards 2.5D designs integrating chiplets from multiple vendors, specific chiplets used in building these systems may be untrustworthy. Even if the IP vendor is trustworthy, the manufacturing process may not be, leading to infiltration and insertion of Trojans. An attacker may camouflage malicious circuitry within a chiplet. A large body of work seeks to identify hardware Trojans in SoC designs, e.g., through formal verification and testing that attempt to verify designs against expected functional and logical correctness. Doing so remains a difficult challenge, however, as no single detection method is exhaustive enough to detect and observe every possible type of Trojan. The issue is further exacerbated in chiplet systems, which rely on multiple designs and supply chain sources. The vulnerability of a single chiplet can undermine the entire system’s security if not appropriately addressed.

A large body of work addresses secure network-on-chip (NoC) fabrics [16, 28, 46, 62, 66, 68] with untrusted IP modules and seeks to secure the overall system in their presence; such approaches are extendable to interposer-based systems. However, working at the NoC abstraction level limits attackers and defenders to communication that passes along specific routes in the NoC. These attacks and defenses target the communication’s behavior but not the communication’s impact on a system’s data. Attackers may target a higher level of communication abstraction to exploit the system-spanning cache-coherence protocol to directly affect the system’s data.

Cache coherence is an essential mechanism to ensure all components maintain a consistent view of the system’s memory. Coherence protocols are used in interposer-based systems, SoCs, and chip multiprocessors broadly. We identify the coherence system as a highly attractive target for Trojan attacks as coherence mechanisms control how all components communicate data updates, making them highly prevalent and predictable.

Current coherence schemes do not enforce existing virtual/physical memory permissions. Thus, a Trojan connected to the coherence scheme can directly manipulate any memory region in the system regardless of memory permissions or physical location. A Trojan working as a cache-coherent entity can obviate existing memory-protection hardware and software by directly creating coherence messages [45]. Unlike prior packet-level NoC attacks, Trojans targeting cache coherence do not need to be physically on the path between the victim and the memory controller to launch effective attacks. The properties of cache coherence allow any connected device to request access to any memory, with each device expected to follow a predefined coherence protocol; extending this logic to an attacker means that any hardware Trojan that has compromised the coherence system or its communication interface can request any memory, even if its host IP would otherwise not have access. Integrating defenses directly into the coherence system is difficult as coherence protocols are naturally very complex and difficult to verify, requiring extensive verification and design effort.

1.2 Our Contributions

Alternative defense approaches that match the security challenges of modern chiplet systems are needed; notably, such defenses are within reach of interposer-based systems themselves.

First, this work provides new insights into how Trojans can manipulate coherence systems to violate the security of a chiplet system. Specifically, we propose several new Trojan attacks leveraging the coherence system protocol to maliciously manipulate a victim process’ memory using only legal coherence interactions. We first describe a set of new fundamental attacks that a Trojan can mount on coherence systems based on Basak et al.’s taxonomy [6] of passive reading, masquerading, modifying, and diverting attacks. We assume an attacker implements these coherence system attacks in hardware through compromised design or manufacturing. While each attack violates a system’s security, we further demonstrate how orchestrating them together allows attackers to perform complex Forging attacks which modify any process’ memory, even memory the compromised chiplet should not be able to access. Pure hardware-centric attacks cannot be thwarted by contemporary software defense mechanisms as all exploited coherence interactions are transparent to software and legal within the coherence protocol. No prior work considers such attacks on coherence systems in the context of 2.5D systems with chiplets or traditional 2D systems.

Second, we leverage interposer-based system designs to establish a secure-by-construction root of trust in modern multi-core, multi-chiplet systems. Importantly, unlike prior art for secure system design, we do not assume/require trusted manufacturing of the whole system, only of the interposer, to provide system-level security promises. We use this backbone to implement our Coherence Message Checkers (CMCs) that inspect messages traversing the NoC against tampering, thereby preventing chiplets from accessing memory addresses they do not have permission for. This is achieved via a region-based memory protection scheme, thwarting the aforementioned Trojan attacks.

Commercial products and prototypes demonstrate [44, 60, 61, 72, 75] that a small trusted team can manage an interposer’s design, possibly in-house, and a relatively low-end trustworthy fab can manufacture it, possibly domestically. In the context of 2.5D systems with an interposer as a root of trust, we argue that prior schemes for securing NoCs look at the wrong part of the problem. Instead of securing each part of the NoC at a low level, including links within chiplets, we propose a defensive strategy targeting the coherence-level communication from any untrusted chiplet directly at its interface with the trusted system.

The contributions of our work are summarized as follows:

(1) We present potential attack stages available to Trojan designers exploiting coherence systems by snooping, spoofing, modifying, or diverting coherence messages.

(2) We demonstrate the use of these fundamental stages to orchestrate a complex Trojan attack to enable unprivileged memory references and data forging.

(3) To mitigate these threats around coherence-oriented system-level communication, we propose an active interposer as the physical backbone for a secure-by-construction root of trust, including a secure interconnect fabric for multi-chiplet systems.

(4) We introduce a novel microarchitecture to secure communication passing from untrusted chiplets onto the interposer and into the system via per-packet validation at the interposer ingress links. Our design does not modify the system’s underlying coherence protocol but prevents untrusted chiplets from divulging or manipulating sensitive information. The key objective of our proposal is to realize a secure large-scale system of untrusted chiplets.

(5) We implement and evaluate our technique in gem5 and examine the implications of our security features. We characterize the performance impact as a low, ∼4% overhead. Further, we show that the overhead decreases as the number of workloads scales.

2 Background and Motivation

Here, we review key concepts of interposer technology, hardware security, and cache coherence protocols. We also motivate the contributions of our work considering the security challenges and promises for the respective state-of-the-art.

2.1 Interposer Technology

Interposer technology, also known as 2.5D integration, is the process of manufacturing two or more chips, or chiplets, separately and then integrating and interconnecting them using a carrier made of silicon or other materials [44, 60, 61, 72, 75]. While future 2.5D designs are expected to be more heterogeneous, current state-of-the-art systems are largely homogeneous, cache-coherent, multi-core chiplet designs [23, 44, 60, 61, 75]. Compared to traditional, monolithic SoC designs, 2.5D integration drastically reduces time to market and allows for design and manufacturing process optimization, increasing chiplet yield. A system designer can procure IP as commodity chiplets and integrate them at the physical system level, so that design effort is required only for the interposer.

An interposer design is categorized as active or passive, with active interposers containing active devices (e.g., NoC routers, voltage regulators), while passive interposers act solely as integration carriers and wiring mediums. Passive interposers are cheap to manufacture, but their physical design can be quite challenging [38, 44]. In contrast, active interposers allow for buffered interconnects but incur design, power, and delay overheads.

An active interposer with an embedded NoC is apt for large-scale integration and system communication. The chiplet interconnect fabric is separate from the interposer’s NoC and attached at the interposer’s edge router. The heterogeneous fabric allows cross-optimizing topologies on the chiplets and the interposer, opening up considerable opportunities for system design [7, 20, 23, 38, 75, 84]. Active interposers improve testability [33, 72, 75] and improve the final system’s yield.

2.2 Hardware Security

2.2.1 IC Manufacturing.

Industry has widely adopted a work mode where a design house and partners carry out IC design and verification in-house and then outsource fabrication and testing to offshore facilities to provide access to advanced technologies, reduce production costs, and streamline the time to market [82]. However, this practice raises concerns regarding the trustworthiness of the outsourced fabrication facilities, which may seek to insert security vulnerabilities in general or hardware Trojans in particular [40].

The threat vector posed by untrusted fabrication facilities implies that the ICs they manufacture are untrustworthy, posing a security challenge for modern systems in multiple ways. First, any hardware security feature embedded in an outsourced IC may no longer offer the desired protection, presenting a profound challenge. Second, a modern system may be comprised of chiplets with varying trustworthiness. Any malicious chiplet behavior may compromise the entire system due to its interconnected nature.

The interposer technology can avoid such complications since an interposer can be fabricated separately in a trusted facility and embed security features. We propose an interposer designed to constitute a hardware-enforced root of trust that can be built upon to ensure a system’s trustworthiness.

2.2.2 Hardware Trojans.

Hardware Trojans can lead to catastrophic system security failures. For example, Bidmeshki et al. [9] study an attack scenario where a hardware Trojan renders the cryptography subsystem vulnerable, Khan et al. [42] demonstrate Trojans that leak data from processor caches, and Kim et al. [45] introduce Trojans that inject malicious coherence messages to create a denial-of-service attack. In Section 5, we identify potential attack vectors for Trojans to maliciously modify data owned by a process operating in a different chiplet than the Trojan-compromised chiplet.

Detecting Trojans is challenging as chiplet-based systems contain multiple complex IPs sourced from various vendors. Testing a chiplet’s functionality may occur during the manufacturing or integration stage, which requires a reference model or device [8]. However, if the third-party IP’s source is untrustworthy, the reference model may incorporate the Trojan, or the IP may camouflage the Trojan as a correct implementation. For this work, we assume that an attacker infiltrates some stage within the design or manufacturing process to target the system’s coherence mechanisms. Attackers can conceal a Trojan by designing it to trigger only under rare or specific conditions. For example, the Trojan we describe in Section 5.2 only activates when it observes references to a specific address. These properties make Trojans difficult to detect, as simply testing the chiplet’s functionality does not guarantee that the Trojan will be active at an observable output node.
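
The detection problem can be illustrated with a toy model. The sketch below is not the Trojan of Section 5.2; the class `TriggeredTrojan`, the helper `functional_test`, and the trigger address are hypothetical names and values chosen only to show why a functional test that never references the trigger address observes correct behavior.

```python
class TriggeredTrojan:
    """Toy model: dormant unless a reference hits the trigger address."""

    def __init__(self, trigger_addr):
        self.trigger_addr = trigger_addr  # hypothetical trigger condition
        self.activations = 0

    def observe(self, addr):
        """Return True if the malicious payload would fire for this reference."""
        if addr == self.trigger_addr:
            self.activations += 1
            return True
        return False  # indistinguishable from a correct implementation


def functional_test(trojan, test_addrs):
    """Return True only if some test reference activates the Trojan."""
    return any(trojan.observe(a) for a in test_addrs)


trojan = TriggeredTrojan(trigger_addr=0xDEAD0040)
# A sweep over the first 4 KiB in 64-byte steps never hits the trigger.
exposed = functional_test(trojan, range(0, 4096, 64))
```

Under these assumptions, `exposed` stays `False` unless the test set happens to contain the exact trigger address, which is unlikely in a large physical address space.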

Prior art focuses on infiltrating the NoC of a target design to cause deadlocks [45], leak information [42], or disrupt security features [9]. However, NoC-based attacks require the Trojan to attack NoC packets directly, limiting the Trojan to only packets traversing a particular path in the NoC. Prior attacks would not work in a 2.5D integrated system because attacking chiplets are not on the path between victim chiplets and the memory controller.

2.2.3 Secure Interconnect Fabrics.

Prior art for secure NoCs assumes malicious activities arise from connected components or the network fabric. Fiorin et al. [29] propose security features for policy-based message checking against untrusted components. Selected works focus on securing the system through encryption/decryption of packets exchanged through NoC fabrics [25, 31]. Kinsy et al. [46] propose organizing secure and non-secure software/hardware entities as tenants, configuring NoC routers to exchange messages securely. The amount of key exchanges required to isolate nodes/tenants incurs high latencies, inhibiting scaling.

Nabeel et al. [59] propose an interposer-based architecture where security modules monitor the interconnect fabric at the bus level to block transactions that violate memory access policies. Their design represents a relevant first work toward secure 2.5D integration and the viability of interposer designs, but it has several limitations. First, the authors consider an overly simplistic architecture, ignoring that state-of-the-art 2.5D designs are fully memory-mapped and cache-coherent, opting to focus instead on physical design and layout. We find that addressing the coherence model is critical to providing system-level security. Second, the authors overlook new security challenges arising from interposer designs. Critically, their design would fail to hinder recent coherence-oriented snooping attacks that do not violate memory access policies/permissions. We also find the transaction monitor (TRANSMON) they propose is largely impractical, as their design requires an entry in their policy register space (PRS) tables for each and every allocated memory region in the physical memory space. That work does not discuss how large the table must be, what to do when it overflows, or how many cycles it would take to access. Furthermore, their work provides neither an architectural evaluation nor a performance analysis, which prevents a direct comparison with our scheme.

For most prior art, networks are not secure-by-construction, requiring high-overhead solutions such as key-based security [16, 28, 68], model checking [12, 66], or structures to verify traffic patterns [53, 62, 66]. While prior work has proposed packet-checking schemes, e.g., [69], the underlying defense mechanisms often only address a single attack vector [11, 68] and/or fail to address the coherence system’s exploitability [28, 66, 69].

While these works check the message’s memory operation, they do not differentiate between coherence message types and how coherence messages can be exploited beyond simple read or write traffic. Further, most prior art assumes, often implicitly, trusted manufacturing of the whole system. Outsourced supply chains challenge such an assumption; 2.5D integration using chiplets from various vendors only exacerbates these concerns.

2.2.4 Hardware Support for Root of Trust.

Intel’s SGX provides trusted execution environments (TEEs), called enclaves [21, 58]. Enclaves prevent unprivileged access to secure data during security-sensitive execution. SGX maps protected memory pages to reserved memory regions in which a hardware encryption module encrypts the pages. ARM’s hardware-enforced TEE isolates secure execution from untrusted software [54], leveraging a normal OS alongside a secure OS. The latter can access the full range of a device’s peripherals and memory. In contrast, the normal OS only has access to a subset of peripherals and memory to prevent unauthorized access to sensitive resources.

Recent works have shown vulnerabilities in SGX due to programming errors and untrusted software [43, 64] as well as speculative execution [13, 17, 49, 70]. In general, TEEs are prone to vulnerabilities from architectural, implementation, and hardware issues [14].

2.3 Cache Coherence

Coherence protocols ensure updates to cached copies of data are visible to all cores and IP blocks in modern multi-core designs [23, 60, 75]. Coherence schemes are broadly categorized as broadcast (or snooping) protocols [4, 10, 71] and directory protocols [37, 52, 85]. While simple to implement, broadcast protocols suffer from high traffic due to the many messages multi-core systems require to maintain coherence. Directory protocols allow for fine-grained state tracking and unicast messages, making them highly scalable, but they are difficult to implement and have higher access latencies.

A coherence protocol is generally oblivious to software and may permit malicious accesses that leak sensitive information [73, 83]. Existing countermeasures address conflict-based and transient-execution side-channel attacks, but do not consider threats from maliciously manipulated/malformed coherence message packets [47, 79, 80, 81].

Since coherence protocols act only based on rules for updating memory across multiple parties, attackers may exploit the protocol’s low-level behavior. Crucially, coherence is a hardware-managed, micro-architectural feature that is neither influenced by nor exposed to the software executing on the system, rendering software-based defenses ineffective. Further, page-level/TLB-based memory protections are enforced only on initial access to the cache, not on the resulting coherence traffic; thus, a Trojan producing “legal” coherence messages can bypass these traditional protections.
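
The gap between page-level protection and coherence traffic can be sketched with a toy memory model. Everything in the block below is an illustrative assumption (the class `ToyMemorySystem`, its method names, the 4 KiB page size, and the process IDs); it only shows that a permission check placed on the software-visible access path does not cover a request that enters directly at the coherence level.

```python
PAGE = 4096  # assumed page size in bytes


class ToyMemorySystem:
    def __init__(self):
        self.memory = {}      # addr -> value
        self.page_owner = {}  # page number -> process ID allowed to access it

    def map_page(self, addr, pid, value):
        self.page_owner[addr // PAGE] = pid
        self.memory[addr] = value

    def core_load(self, pid, addr):
        """Software-visible path: the page-permission check happens here."""
        if self.page_owner.get(addr // PAGE) != pid:
            raise PermissionError("page fault: access denied")
        return self.coherence_request(addr)

    def coherence_request(self, addr):
        """Hardware path: carries only an address, so nothing is checked."""
        return self.memory[addr]


mem = ToyMemorySystem()
mem.map_page(0x2000, pid=7, value="secret")
# A Trojan that emits the coherence request itself skips core_load entirely.
leaked = mem.coherence_request(0x2000)
```

In this model, process 9 calling `core_load(9, 0x2000)` is rejected, while the direct `coherence_request` succeeds because no permission information travels with the message.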

2.4 Summarized Motivation for Our Work

Given all the above security challenges for state-of-the-art interposer technology, hardware security, and cache coherence, our work is motivated as follows.

(1) Coherence systems are a prevalent and predictable mechanism within modern large-scale devices, providing communication between devices. Prior art targets a device’s communication medium, i.e., the NoC, but does not address attacks that occur at higher abstraction levels and directly interact with memory. Thus, from the attacker’s perspective, we propose targeting the communications’ values and interactions as opposed to the communication medium’s behavior. The demonstrations in this work cover novel and practically relevant threats.

(2) Active interposers are typically manufactured in relatively older nodes [59, 75]. Therefore, it is realistic that a trusted facility is available to manufacture such active interposers. From the defense perspective, we propose an active interposer-based root of trust with security features embedded within its NoC routers, all based on trusted design and manufacturing.

We enforce system-level security for untrusted commodity chiplets by integrating them on an interposer-based root of trust, the only component requiring trusted fabrication, thereby providing a secure-by-construction NoC. Without the need to secure the NoC’s integrity, a more simplified approach can ensure the overall system security, resulting in lower overheads. Traditional root of trust schemes incur overheads, are prone to dedicated attacks, and are generally susceptible to Trojans. In contrast, our approach has little impact on system performance, and its key components are secure-by-construction.

(3) Our solution does not require modifying the coherence protocol. Rather than risking complex, adversarial system behavior side-effects, we ensure coherence messages’ integrity and prevent untrustworthy chiplets from exploiting the coherence protocol.

(4) Our work is orthogonal to and compatible with prior art on Trojan detection and mitigation, e.g., [32, 34, 74]. Here we do not seek to prevent Trojans but their attacks from affecting the system-level security. Specifically, we seek to prevent hardware-centric attacks from executing through the memory and coherence system, bypassing existing page-table/TLB-based memory protections. A clear physical separation of untrusted commodity chiplets and security features residing in the trusted interposer enforces the notion of system-level security. Prior art on Trojan detection and mitigation cannot offer such secure-by-construction organization as ours.

3 Secure Interposer-based Chiplet System: Architecture Overview

Figure 1 outlines the secure interposer-based multi-chiplet and multi-core system proposed in this work. We loosely base the baseline system on the architecture of the Rocket-64 design proposed by Kim et al. [44]. Note that we evaluate a homogeneous 64-core system as a case study for the impact of hardware Trojans and our proposed defense on coherent systems. Our conclusions are not limited to homogeneous systems, but are applicable to any interposer-based system that implements cache coherence, as the coherence protocol is agnostic to the chiplets’ functionality so long as they follow the protocol’s required request/response interactions. In addition to the overview in this section, Section 6 provides more details.

Fig. 1.

3.1 Chiplet and Interconnects Architecture

We employ eight chiplets in this system, each containing eight CPU cores for 64 cores total, similar to recent AMD processors [60, 61]. Each core has an L1 instruction cache, data cache, and a unified L2 cache; all cache levels are private. The cache controllers generate coherence messages that the network interface (NI) in each chiplet converts to network packets before injection into the interposer NoC (via interface routers). A 2D-mesh NoC in the active interposer interconnects each chiplet and four memory controllers (MC). The interface routers, depicted along the east and west edges of the system, serve as ingress links for the chiplets into the interposer NoC.
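
The system dimensions given above can be captured in a few constants. The sketch below assumes a contiguous core numbering (cores 0 to 7 on chiplet 0, and so on), which the text does not specify; `chiplet_of` and `crosses_interposer` are hypothetical helper names.

```python
NUM_CHIPLETS = 8
CORES_PER_CHIPLET = 8
NUM_MEMORY_CONTROLLERS = 4
TOTAL_CORES = NUM_CHIPLETS * CORES_PER_CHIPLET  # 64 cores in total


def chiplet_of(core_id):
    """Map a core ID to its chiplet, assuming contiguous numbering."""
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError("core_id out of range")
    return core_id // CORES_PER_CHIPLET


def crosses_interposer(src_core, dst_core):
    """Core-to-core traffic enters the interposer NoC when the chiplets differ."""
    return chiplet_of(src_core) != chiplet_of(dst_core)
```

For example, under this numbering cores 7 and 8 sit on different chiplets, so a message between them would pass through an interface router into the interposer NoC.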

Many other architectures are practical for interposer-based systems [7, 20, 23, 38, 75, 84]. Our design can be ported to other systems using cache-coherent shared memory and an active interposer. Notably, the security principles we leverage are extendable to various physical fabrics and communication protocols in homogeneous and heterogeneous systems. For example, interfaces such as PCIe are typically memory-mapped; checking memory-system messages can prevent malicious chiplets’ unauthorized access.

3.2 Principles for System-Level Security

We propose the interposer as the root of trust for integrating untrustworthy chiplets into a secure system by enforcing policy checking for all system-level communication. The key attributes to enable such a secure system are: (1) the interposer is manufactured separately from the untrusted chiplets in a trusted facility; and (2) the interposer serves as the integration and communication backbone between chiplets.

Any system-level communication across chiplets must pass through the interposer. For example, if a CPU core wants to read/write data from/to memory, a corresponding coherence message (embedded in a packet) must traverse the interposer NoC. Similarly, if a core wants to communicate with another core in another chiplet, such direct messages must also traverse the interposer NoC. Significantly, all direct communication messages are limited to legal coherence messages, as is typical in most multi-processor systems.

Since all messages must inevitably traverse through the active interposer, we embed our security features, the Coherence Message Checkers (CMCs), in the physical ingress links of the interposer to validate all coherence messages coming from chiplets into the active interposer. We also add a secure co-processor for critical tasks, including system-level memory allocation. Since these security features are implemented exclusively within the trusted active interposer, their hardware is trustworthy and free from Trojans by construction.
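
A minimal sketch of the ingress check may help fix ideas. It is not the CMC microarchitecture itself: the names `Region`, `CoherenceMessage`, and `CoherenceMessageChecker`, the message-type strings, the sender-field check, and the linear search over a region table are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int   # first physical address of the region
    size: int   # region length in bytes
    owner: int  # chiplet ID permitted to reference the region


@dataclass(frozen=True)
class CoherenceMessage:
    src_chiplet: int  # sender field as written by the untrusted chiplet
    addr: int         # physical address the message refers to
    msg_type: str     # illustrative type name, e.g. "GETS" or "GETX"


class CoherenceMessageChecker:
    """Hypothetical checker, one instance per interposer ingress link."""

    def __init__(self, port_chiplet, regions):
        self.port_chiplet = port_chiplet  # chiplet physically attached here
        self.regions = list(regions)

    def allows(self, msg):
        # Assumption: the link is tied to one chiplet, so a sender field
        # naming any other chiplet is treated as masquerading.
        if msg.src_chiplet != self.port_chiplet:
            return False
        # Region check: the address must fall in a region this chiplet owns.
        return any(
            r.owner == self.port_chiplet and r.base <= msg.addr < r.base + r.size
            for r in self.regions
        )


regions = [Region(0x1000, 0x1000, owner=0), Region(0x4000, 0x1000, owner=1)]
cmc0 = CoherenceMessageChecker(port_chiplet=0, regions=regions)
```

With this table, `cmc0.allows(CoherenceMessage(0, 0x1800, "GETS"))` passes, while a request from chiplet 0 to address `0x4000` is rejected because that region belongs to chiplet 1. Because the check runs on every packet at the ingress link, the coherence protocol on the chiplets does not need to change.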

3.3 Cache Coherence Protocol

We focus on the MOESI Hammer cache coherence protocol [19] as the basis for our implementation due to its use in many AMD systems as a protocol for multi-core systems. MOESI Hammer has been used extensively in modern micro-architectural performance [24, 26, 27] and security research [36, 55]. Our approach is extendable to other schemes as well.

MOESI Hammer is a hybrid protocol based on MOESI that encapsulates the scalability of directory protocols without high implementation complexity while achieving the low latency of broadcast protocols without overly increasing broadcasted coherence message traffic. To that end, MOESI Hammer maintains a sparse directory between multiple home nodes to track cache lines’ states and owners. Coherence requests access a cache line’s home-node directory and DRAM in parallel to reduce the cost of a directory miss, canceling the DRAM response if a directory entry is present. Traffic is reduced by only broadcasting to all cores for specific state transitions.
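
The parallel directory and DRAM lookup can be sketched sequentially as follows. This is a simplified model under stated assumptions, not the gem5 protocol implementation: `serve_request` is a hypothetical function, the directory is modeled as a dictionary from address to owning cache ID, and forwarding to the tracked owner on a hit is an assumption.

```python
def serve_request(addr, directory, dram):
    """Return (source, payload) for a request to addr.

    directory: dict mapping addr -> ID of the cache tracked as owner (sparse)
    dram:      dict mapping addr -> data held in memory
    """
    dram_data = dram.get(addr)   # DRAM access is issued without waiting
    owner = directory.get(addr)  # directory lookup proceeds alongside it
    if owner is not None:
        # Directory hit: the DRAM response is cancelled (dropped here) and
        # the request is forwarded to the tracked owner (assumption).
        return ("cache", owner)
    # Directory miss: the DRAM access already in flight supplies the data.
    return ("dram", dram_data)


directory = {0x40: 3}                      # line 0x40 is owned by cache 3
dram = {0x40: "stale", 0x80: "fresh"}
hit = serve_request(0x40, directory, dram)
miss = serve_request(0x80, directory, dram)
```

Issuing the DRAM access early means a directory miss costs no additional serialized latency, at the price of a wasted DRAM access on a hit.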

4 Threat Model

Here we introduce the scope and assumptions of our threat model and discuss the various threat vectors we expect in a heterogeneous chiplet-based system. We extend these threat vectors into a new set of coherence-oriented attacks [15] and demonstrate how fundamental coherence threats can be combined to exploit coherence mechanisms.

The focus of this work is a system wherein multiple chiplets have been fabricated by various untrusted parties and connected using an active, trusted interposer. Our work is orthogonal to prior art on Trojan detection and mitigation for non-interposer-based systems [32, 34, 74] in that we seek to prevent Trojans from affecting the system-level security.

The key assumption is that the fabrication and operational behavior of chiplets, either designed in-house or composed of third-party IPs, cannot be trusted. In other words, we assume that some Trojan(s) may exist in some chiplet(s). We also assume that attacks target memory-system traffic, which is the only type of traffic physically passing through the interposer. We specifically demonstrate and defend against Trojans that target memory space that their compromised chiplet has never accessed and may not have permissions for.

Another key assumption is that all proposed security features are designed, manufactured, and operated in an entirely trustworthy manner. Furthermore, we assume a secure boot-up and OS environment for memory management tasks. Both assumptions are physically enforced by implementing the related hardware exclusively within the trusted interposer.

Our scheme operates on a per-chiplet granularity; attacks between cores in the same chiplet [57, 63] are out of this work’s scope. Similarly, out of scope are attacks wherein code running on one core attempts to violate the security of other processes running on that same core or another core in the same chiplet. Further, Rowhammer [50] attacks are also out of scope. Prior defenses against all these threats are orthogonal to our work and can be applied as needed. For memory allocation, we assume that the detection of excessive memory requests from denial-of-service attacks is handled by other means. Further, denial-of-service attacks resulting from some chiplets dropping coherence messages are out of scope.

5 Trojan Attacks On Coherence Systems

5.1 Basic Attacks

Figure 2 illustrates our proposed basic coherence attacks. We assume a Trojan is placed at a core’s cache controller and can intercept coherence messages from the network interface ahead of the state directory. While these attacks target the MOESI Hammer hybrid protocol, the basic principles of the attacks are consistent with any coherence protocol. Any of these attacks may be modified to target either incoming or outgoing messages from the compromised chiplet. Here we provide specific examples of each basic attack.

Fig. 2.

Passive Reading (Figure 2(a)): Trojans passively reading (snooping) observe incoming coherence messages from the chiplet’s NoC sub-system as they reach the L2’s state directory. The Trojan may buffer messages, identify specific request patterns, and facilitate a covert communication channel. The Trojan does not affect the system’s state but may trigger a more complex Trojan.

Masquerading (Figure 2(b)): Masquerading (spoofing) occurs when a Trojan modifies the packet’s sender field such that the packet appears to originate from a different core. If the target packet is a request, masquerading can result in a deadlock since all responses from the directory or other cores are sent to the incorrect core. If the target packet is a response, the Trojan may block it and respond with an acknowledgment that appears to be from a different core, resulting in an incoherent memory state.

Modification (Figure 2(c)): Such attacks occur when the Trojan directly modifies the message type of a coherence message. This attack may result in a deadlock since the Trojan can cause the memory controller’s directory to assume the data is in one state, due to a modified packet, while the local directory holds the data in a different—incorrect—state.

Diverting (Figure 2(d)): Trojans can launch diverting attacks by blocking the local state directory from observing a request and then resending the request with a different destination field. This results in the compromised core and the original requester becoming incoherent with respect to the rest of the memory system.

Limitations of Basic Attacks: Any of the above attacks can individually result in incoherence or deadlocks but cannot directly manipulate a victim’s data. Only by combining these attacks into a more complex Trojan can it pose a significant security threat.

5.2 Forging Attack: Trojan Design and Operation

To demonstrate the threat these attacks pose when orchestrated into a complex attack, we propose the Forging Attack, a novel attack that manipulates legal coherence transactions to allow a Trojan to write to a target address in a different process operating in a different chiplet. Since the Trojan resides between the network interface and a core’s directory in the compromised chiplet (Figure 3), it has a complete view of incoming/outgoing coherence messages and can block the core from observing specific interactions. The Trojan holds registers to track the target data’s current state relative to the Trojan. These registers imitate the core’s directory to ensure the Trojan correctly responds to the global directory.

Fig. 3.

Here we assume the Trojan has a predefined target address. In a real-world scenario, the Trojan can observe coherence messages broadcast to the compromised chiplet on the network to select its target. This can be achieved with prior knowledge of the execution behavior of the target application and through an additional passive reading stage. Since coherence operates on physical memory, the hardware Trojan can freely observe physical memory addresses beyond software and virtual memory defenses. The coherence protocol requires the global directory to send invalidation messages when a core sends a write request (GETX) to a line it does not own. The invalidation broadcast removes all copies in other cores before updating the line with new data.

The Trojan operates in two phases. During the first phase, the Trojan deceives the global directory into giving the Trojan access to the data. During the second phase, the Trojan follows the protocol’s required transactions to write to the target address, which the victim will later read. The interactions caused by the Trojan in both phases are legal from the perspective of the global directory. Furthermore, they are transparent to the software executing in the victim process and all other security software in the system.

Phase 1, Acquiring Access to Target Data: Figure 3(left) illustrates the initial steps the Trojan must take to gain access permissions to the target address before it can maliciously write to it. ➊ The Trojan observes coherence requests, waiting for a specific address to trigger the attack. ➋ The Trojan generates a malicious GETX packet for the target address. ➌ The directory receives the GETX request, broadcasts an invalidation to all cores, and waits for all cores to send acknowledgments (ACKs). ➍ The directory forwards the data and all ACKs to the compromised core. ➎ The Trojan blocks the local directory from seeing any response from the directory or cores, waiting to receive all ACKs. ➏ Once all ACKs are received, the Trojan can access the data, since the directory considers the compromised core as the data owner.

Phase 2, Writing Malicious Data: Once access permissions are acquired, the global directory assumes that the Trojan’s core is the exclusive owner of the data. Figure 3(right) illustrates Phase 2 of the attack. This phase allows the Trojan to mimic the legal operations for writing to main memory as if the core was evicting the data after modifying it. The steps of the attack are: ➊ Once the Trojan receives the final ACK, the requests to the target address are unblocked. ➋ The Trojan immediately sends a PUTX to the directory to indicate that it is “evicting” modified data. ➌ The directory responds with a WRITEBACK_ACKNOWLEDGEMENT, allowing the Trojan to proceed with “evicting” the maliciously changed dirty data. ➍ The Trojan responds to the WRITEBACK_ACKNOWLEDGEMENT with a WRITEBACK_EXCLUSIVE_DIRTY response containing the malicious data. ➎ The data is written to memory.
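The message sequence of the two phases can be traced with a toy model. The sketch below is an illustration only, not the gem5 implementation used in this work: the `ToyDirectory` class, its method names, and the flat `memory` dictionary are hypothetical simplifications, and the invalidation/ACK exchange is collapsed into a single call.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDirectory:
    """Hypothetical, heavily simplified global directory."""
    memory: dict = field(default_factory=dict)   # address -> data
    owner: dict = field(default_factory=dict)    # address -> owning core id
    log: list = field(default_factory=list)      # trace of messages

    def getx(self, requester, addr, all_cores):
        # Phase 1, steps 3-4: broadcast invalidations, collect ACKs,
        # then record the requester as the exclusive owner.
        for core in all_cores:
            if core != requester:
                self.log.append(("INV", core, addr))
                self.log.append(("ACK", core, addr))
        self.owner[addr] = requester
        return self.memory.get(addr)

    def putx(self, requester, addr):
        # Phase 2, steps 2-3: only the recorded owner may evict.
        if self.owner.get(addr) != requester:
            return False
        self.log.append(("WRITEBACK_ACKNOWLEDGEMENT", requester, addr))
        return True

    def writeback(self, requester, addr, data):
        # Phase 2, steps 4-5: dirty data from the owner reaches memory.
        if self.owner.get(addr) != requester:
            raise PermissionError("writeback from non-owner")
        self.log.append(("WRITEBACK_EXCLUSIVE_DIRTY", requester, addr))
        self.memory[addr] = data
        del self.owner[addr]


def forging_attack(directory, trojan_core, addr, payload, all_cores):
    """Replay the Trojan's message sequence against the toy directory."""
    directory.getx(trojan_core, addr, all_cores)        # Phase 1
    if directory.putx(trojan_core, addr):               # Phase 2
        directory.writeback(trojan_core, addr, payload)


directory = ToyDirectory(memory={0x1000: "victim data"})
forging_attack(directory, trojan_core=9, addr=0x1000,
               payload="forged", all_cores=range(16))
print(directory.memory[0x1000])
```

Every call in the trace is one the toy directory accepts from a legitimate owner, which mirrors the observation that the attack is legal from the directory’s perspective.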

5.3 Forging Attack: Results

We evaluate the Trojans in gem5 (see Section 7.1), targeting a victim that employs the classical advanced encryption standard (AES) processing to first encrypt a target file and then, after some other operation proceeds, to again decrypt the file. The encryption and decryption stages of the application require multiple reads and writes to memory that generate coherence traffic, which is observable by a hardware Trojan located in a different chiplet and core from the victim application. Importantly, the Trojan-compromised chiplet and core do not have access to the victim’s address space, have never held the target data in their caches, and do not interact with the victim in any way during execution. When the victim begins to decrypt the target file, the Trojan targets specific addresses being written to by the victim and is able to launch a Forging Attack that successfully modifies the decryption stage’s output without disrupting the victim’s execution.

Unlike prior work, which focuses on Trojans modifying packets [5, 41, 65], we leverage the coherence mechanism itself to modify data in memory that is never touched and not owned by the chiplet containing the Trojan. Our attack does not require the data to be in the compromised core’s caches. Generating and blocking specific coherence messages allows the Trojan to mislead the global directory about the state and ownership of the targeted data.

6 Secure Interposer-based Chiplet System: Design

Our proposed design prevents attacks on any given chiplet from violating the overall system’s security by physically enforcing protection against any unauthorized access to shared-memory regions and by continuously checking the integrity and validity of cache coherence messages. Next, we discuss the system design.

6.1 Microarchitecture

6.1.1 CMC Overview.

With the proposed CMCs, we monitor and validate all incoming packets to the interposer. Figure 4 depicts the CMC embedded in a router of the interposer NoC. The CMC is a pipelined, non-blocking structure that monitors messages traversing the physical links prior to entering the virtual channel buffers within the routers.

Fig. 4.

Operation Basics: As packets enter the network, they are converted into 64b flits; then, they move onto the interposer layer and are broken down to be analyzed by the CMC. The CMC requires one pipeline stage to analyze a 64b flit, i.e., to extract the fields required by the PCM and APU to enforce region protections and prevent malicious coherence messages. The subsequent pipeline stages are then required to look up the appropriate permissions in the APU and respond to the request if necessary (the CMC-1 has two pipeline stages, while the CMC-2 requires three, as discussed below). This design provides low power and latency overhead as its behavior is pipelined and does not heavily impact the flow of packets into the network, as discussed in Section 7.2.

Each CMC has two components described as follows.

Packet Checker/Modifier (PCM): The PCM monitors and modifies cache coherence messages as needed. The proposed system follows standard shared-memory semantics with memory accesses creating coherence messages during communications between cores, IPs, I/O buses, and memory. Thus, the PCM operates on coherence messages to check addresses and permissions.

Address Protection Unit (APU) Table: This is a direct-mapped, SRAM-based look-up table with entries for each memory region and associated per-chiplet permissions. As outlined in Section 6.2, the physical memory is partitioned into multiple fixed-size regions; each region has a corresponding entry in the APU. This is compatible with the page-level protections enforced by the TLBs, but coarser-grained, allowing for a more overhead-efficient implementation.

6.1.2 CMC Types and Placement.

Recall Figure 1, depicting CMCs embedded in the secure interposer-based system. We denote CMCs connected to the physical links for chiplets as “CMC-1” and denote those connected to the physical links for MCs as “CMC-2.” CMC-1 monitors and verifies coherence messages entering the interposer, whereas CMC-2 modifies specific coherence messages at the MC directories (i.e., to counter passive-reading threats on broadcast messages). Router-to-router connections exclusively within a trusted interposer do not require monitoring.

CMC-1: Prevents the attached chiplet from injecting malicious coherence messages into the system that violate the shared-memory organization (see Section 6.2). The PCM monitors all traffic from the attached chiplet based on the physical address of the packet. This physical address is compared against the per-region permissions stored in the APU table (described further below). If a message is of an allowed type to an allowed memory region for the given chiplet (e.g., a GETS to a read-only memory region it owns), the message may proceed into the interposer NoC. Otherwise, the packet is rejected, a dedicated security signal realized as a machine-check exception is thrown, and system execution stops.¹

CMC-2: Prevents the broadcast of coherence messages to chiplets that are not permitted to access the related memory region. MOESI Hammer [19] does not maintain per-core sharing information, hence specific message requests cause an MC directory to broadcast the request to all cores. The cores then respond if that core shares the cache block, raising concerns of passive reading/snooping.

To prevent snooping, the PCM uses the APU table to determine whether a given broadcast message is allowed to access the referred memory region. If the chiplet does not have access, the broadcast message is converted into an appropriate direct message sent only to the original requester. For example, an invalidation request for data owned by Chiplet 0 does not need to be observed by any other chiplet.

Note that this approach is legal within the coherence scheme: if a chiplet does not have access to a memory region, then its caches cannot contain lines associated with that region. This allows the CMC-2 to divert broadcast messages from the directory and prevent snooping safely.
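The filtering decision made by a CMC-2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_broadcast`, the dictionary-based permission representation, and the returned pair of lists are hypothetical and do not describe the hardware implementation.

```python
NO_ACCESS, READ_ONLY, READ_WRITE = 0b00, 0b01, 0b11


def split_broadcast(region_perms, requester, all_chiplets):
    """Decide, per chiplet, whether a directory broadcast is delivered.

    region_perms maps chiplet id -> 2-bit permission for the memory region
    that the broadcast address falls in (hypothetical representation).
    Returns (forward, suppress): chiplets that still receive the message,
    and chiplets on whose behalf the CMC-2 answers the requester instead.
    """
    forward, suppress = [], []
    for chiplet in all_chiplets:
        if chiplet == requester:
            continue
        if region_perms.get(chiplet, NO_ACCESS) == NO_ACCESS:
            suppress.append(chiplet)
        else:
            forward.append(chiplet)
    return forward, suppress


# Region private to chiplet 0: no other chiplet observes the invalidation.
private_result = split_broadcast({0: READ_WRITE}, requester=0,
                                 all_chiplets=range(8))

# Region shared by chiplets 0, 2, and 5 (chiplet 5 read-only).
shared_result = split_broadcast({0: READ_WRITE, 2: READ_WRITE, 5: READ_ONLY},
                                requester=0, all_chiplets=range(8))
print(private_result, shared_result)
```

Suppressing delivery is safe for exactly the reason given above: a chiplet without access to the region cannot hold any of its lines.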

Also note that we do not include an operational diagram for the CMC as it is a direct look-up into the APU table, realizing a comparison of the message fields to the chiplet’s ID and allowed message types. A page table and interaction with the memory management unit are not required because the addresses used by the CMC and coherence messages are physical. See next for more details on the APU table structure.

6.1.3 APU Table.

The APU table is a lookup table containing entries describing the access permissions for applications running within a respective chiplet. The access permissions are determined by a secure OS running exclusively within the active interposer, independently of the regular OS running on the chiplets. The permissions are programmed into the APU tables during runtime, as outlined in Section 6.2.

Figure 5 depicts the APU table. Each entry represents one memory region, with two bits allocated to represent the access permissions of applications running in each chiplet: no permissions (“00”), read-only permissions (“01”), or read/write permissions (“11”), with “10” unused.

Fig. 5.

When the PCM intercepts a packet, the upper bits of its physical address are extracted and used to index into the APU table. The related entry is read and handed back to the PCM to compare the request type, requester ID, and destination ID against the permission levels in the APU table entry.
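The lookup can be sketched in a few lines. The sketch assumes the 64-entry, 64 MB-per-region configuration used later in this section; the names `apu_table`, `region_index`, and `cmc1_allows` are hypothetical, and only GETS and GETX are modelled.

```python
NO_ACCESS, READ_ONLY, READ_WRITE = 0b00, 0b01, 0b11

REGION_BITS = 26      # 64 MB regions: the low 26 address bits are the offset
NUM_REGIONS = 64      # 4 GB of physical memory / 64 MB per region
NUM_CHIPLETS = 8

# apu_table[region][chiplet] -> 2-bit permission; all entries start at "00".
apu_table = [[NO_ACCESS] * NUM_CHIPLETS for _ in range(NUM_REGIONS)]


def region_index(phys_addr):
    """The upper physical-address bits select the APU entry."""
    return (phys_addr >> REGION_BITS) % NUM_REGIONS


def cmc1_allows(phys_addr, chiplet, msg_type):
    """Permission check for a request entering the interposer.

    Rejecting every type other than GETS/GETX is an assumption of this
    sketch, not a statement about the full protocol.
    """
    perm = apu_table[region_index(phys_addr)][chiplet]
    if msg_type == "GETS":                 # read request
        return perm in (READ_ONLY, READ_WRITE)
    if msg_type == "GETX":                 # write request
        return perm == READ_WRITE
    return False


apu_table[3][0] = READ_WRITE   # region 3 is private to chiplet 0
apu_table[4][1] = READ_ONLY    # chiplet 1 may only read region 4

addr_in_region_3 = 3 * (1 << REGION_BITS) + 0x40
addr_in_region_4 = 4 * (1 << REGION_BITS)
print(cmc1_allows(addr_in_region_3, 0, "GETX"),
      cmc1_allows(addr_in_region_3, 1, "GETS"))
```

Because the index is taken directly from the physical address, no translation structure is involved in the check.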

6.2 OS Support and Memory Organization

Here we extend and build upon prior work in security-enabled OS environments [18, 22, 48, 77], TEEs (Section 2.2.4), and secure boot-up and execution environments [35, 54, 78]. Our scheme requires delegating critical tasks to a secure environment, including updates to the APU.

Such environments must ensure that malicious OS threads running on an untrusted chiplet are physically incapable of purposefully assigning memory regions that would violate security policies. To this end, we use a trustworthy OS located in a co-processor embedded in the interposer, where the active interposer’s construction physically prevents attacks on the security components.

6.2.1 Representing Memory Regions.

Permissions in shared memory systems are typically defined per physical page by the OS during memory allocation. For our system, enforcing per-page permissions in a CMC poses several challenges. Page-level tracking requires a TLB-like structure to cache translations [76]. The support required to maintain the structure in coherence with the full system’s page table significantly increases hardware complexity and performance overhead.

We argue that a page-level implementation at the interposer is excessive in a system of relatively few and coarse-grained chiplets. Instead, we partition the physical memory into coarse-grained memory regions, similar to prior art [48, 51, 77, 78]. Here we aim for the “sweet spot” between too coarse-grained, where few memory regions are available and fragmentation consumes capacity, versus too fine-grained, where the APU table cannot hold regions in excess without incurring high access latency or placing entries in a backstore. We find that a total number of memory regions of 4$\times$ to 8$\times$ the number of chiplets is sufficient, allowing for diverse private and shared memory regions without too much fragmentation. For our design with eight chiplets and 4 GB of physical memory, the APU has 64 entries, each representing a 64 MB region (16,384 pages).
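The figures for our configuration follow from simple arithmetic. The 4 KB page size used below is not stated explicitly; it is the value implied by 16,384 pages per 64 MB region.

```latex
\begin{align*}
\text{region size} &= \frac{4\,\text{GB}}{64\ \text{regions}}
                    = \frac{2^{32}\,\text{B}}{2^{6}} = 2^{26}\,\text{B} = 64\,\text{MB},\\
\text{pages per region} &= \frac{2^{26}\,\text{B}}{2^{12}\,\text{B}} = 2^{14} = 16{,}384,\\
\frac{\text{regions}}{\text{chiplets}} &= \frac{64}{8} = 8 .
\end{align*}
```

The last line places this configuration at the upper end of the 4$\times$ to 8$\times$ range.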

Each memory region is designated as read- or write-able independently to any given chiplet, with permissions updated as needed. Data private to a single chiplet is placed in a region (or set of regions) only accessible by that chiplet. A page shared across multiple chiplets is assigned to a memory region the given chiplets are allowed to access.

6.2.2 Memory Allocation and OS Modifications.

The interposer includes a trusted co-processor to host a secure and trustworthy OS for system-level memory allocation. It is the role of this OS to ensure that processes that must be protected from each other are placed on different chiplets with different memory regions assigned to them. The OS threads running on the chiplets must delegate their page allocation to the OS running on the interposer. The code in the chiplet’s OS threads must be extended to call the interposer’s secure OS for all page allocation.

The secure OS provides the chiplet’s OS threads with a physical page based on the chiplet’s current memory regions and access permissions. The API between the OS threads running on the chiplets and the secure OS is composed of two functions, APU_ALLOCATE and APU_DELETE. The APU_ALLOCATE interface function is called by the chiplet’s OS when access to a new physical page is required. Similarly, when the chiplet’s OS threads are ready to free a physical page, they call the APU_DELETE function to return that page to the secure OS.

Initial memory partitioning and permission setting occurs during the initial soft page fault on a virtual page. An API call from the OS updates region allocations and permissions, similar to Intel’s SGX page allocation model [58]. After a page fault, the following occurs:

(1) The chiplet’s OS requests a physical page for the process from the secure OS operating on the interposer’s co-processor via the APU_ALLOCATE interface function.

(2) The secure OS running in the interposer then searches for an available page with the correct permissions for the given chiplet, differentiating three scenarios:

(a) If the chiplet already has a region allocated and assigned in the APUs, and this region has unassigned physical pages, a page from this region is selected.

(b) If the chiplet does not have an entry in the APU or its current region is fully allocated, the interposer updates the APU tables to allocate a new region with appropriate permissions for the chiplet requesting the page. The secure OS then selects a free page from the newly allocated region.

(c) If space is unavailable in the assigned regions and there are no unassigned regions, memory allocation fails.²

(3) The secure OS then provides the allocated physical page to the chiplet’s OS.

Since the APU table update occurs on the trusted interposer, chiplets not involved in the allocation process are unaware of the memory allocation request. Critically, a malicious chiplet that somehow gains knowledge of the request cannot access the region since the new permissions are set in the APU tables before other malicious operations target the memory region.
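The three allocation scenarios above can be sketched as follows. This is a minimal model under stated assumptions: the `SecureOS` class and its data structures are hypothetical, only private (single-chiplet) regions are modelled, and the method names merely echo the APU_ALLOCATE and APU_DELETE interface functions.

```python
PAGES_PER_REGION = 16_384      # 64 MB region / 4 KB pages
NUM_REGIONS = 64
READ_WRITE = 0b11


class SecureOS:
    """Hypothetical model of the interposer-resident page allocator."""

    def __init__(self):
        self.region_owner = {}   # region -> chiplet id
        self.free_pages = {}     # region -> list of free page numbers
        self.apu = {}            # (region, chiplet) -> 2-bit permission

    def _regions_of(self, chiplet):
        return [r for r, c in self.region_owner.items() if c == chiplet]

    def apu_allocate(self, chiplet):
        # Scenario (a): an assigned region still has unassigned pages.
        for region in self._regions_of(chiplet):
            if self.free_pages[region]:
                return region, self.free_pages[region].pop()
        # Scenario (b): assign a fresh region; the APU is programmed
        # before any page of that region is handed out.
        for region in range(NUM_REGIONS):
            if region not in self.region_owner:
                self.region_owner[region] = chiplet
                self.apu[(region, chiplet)] = READ_WRITE
                self.free_pages[region] = list(range(PAGES_PER_REGION))
                return region, self.free_pages[region].pop()
        # Scenario (c): no free page and no unassigned region.
        raise MemoryError("memory allocation failed")

    def apu_delete(self, region, page):
        # Return a page to the secure OS.
        self.free_pages[region].append(page)


secure_os = SecureOS()
first = secure_os.apu_allocate(chiplet=0)
second = secure_os.apu_allocate(chiplet=0)
other = secure_os.apu_allocate(chiplet=1)
print(first, second, other)
```

Programming the APU entry before returning the first page reflects the ordering argument made above: the permissions exist before any other operation can target the region.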

6.3 Implementation Details

6.3.1 NoC Configuration.

Regardless of the interposer’s NoC topology, CMCs are emplaced at the interface between chiplets or MCs and the active interposer. However, the physical link’s width impacts the CMC design and its logic. In our implementation and evaluation, the link width is 128 bits within chiplets and 64/128 bits within the interposer.

In MOESI Hammer, every control message fits within a single 128-bit flit [19]. When such a flit enters the interposer, it is either split into two 64-bit flits (64-bit interposer links) or kept as one flit (128-bit links), and the CMC logic analyzes it over two clock cycles or one clock cycle, respectively. In the case of 64-bit links, depicted in Figure 6, we dedicate the first cycle to extracting the control parameters from the head and the second cycle to extracting the address for the cache block being accessed. The CMC logic is similar for request and response messages, as both cases require analyzing the first two flits.

Fig. 6.

6.3.2 Cache Coherence Protocol.

The system’s cache coherence protocol affects the CMC design and logic as the CMC must analyze the coherence message fields. MOESI Hammer’s responses are either control or data messages; a control response follows the same flit structure as a request, whereas a data response has additional flits containing 64 bytes of data.

Based on the message type and identified threats (Section 4), the CMC must analyze specific key parameters; these are highlighted in Figure 6. The PCM extracts the parameters and compares them with the permissions set in the APU table. Since an attacker may exploit either request or response messages, both message types must be analyzed.

6.3.3 Protocol Compliance.

First, coherence messages are converted into network packets by the chiplets’ NIs. However, these packets are not guaranteed to adhere to the rules of the network and coherence protocols. For example, a Trojan may fabricate an invalid message type, yielding undefined, possibly vulnerable behavior. Second, messages corresponding to particular virtual networks (VNs) must follow a specific, limited set of requester/destination IDs and message types.

To address both aspects, the PCM checks the possible field values to verify the legality of messages. Since these checks are orthogonal to memory-region permission checking, they are performed in parallel and incur no extra delay.

6.3.4 Design Cost.

We design the PCM module with three pipeline stages for lookup, packet checking, and modification. The third stage is excluded in CMC-1 instances as these only monitor packets on ingress to the interposer. An APU table is embedded in each of the twelve routers (8 for the chiplets, 4 for the MCs) in the interposer, imposing a total memory footprint of only 1.5 KB. More specifically, each of the 64 table entries holds two permission bits for each of the eight chiplets; hence, an APU table requires 1,024 bits (128 bytes).
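For reference, the footprint works out as follows, using only the figures given above.

```latex
\begin{align*}
\text{bits per entry} &= 8\ \text{chiplets} \times 2\ \text{bits} = 16,\\
\text{bits per APU table} &= 64\ \text{entries} \times 16\ \text{bits} = 1{,}024\ \text{bits} = 128\,\text{B},\\
\text{total footprint} &= 12\ \text{routers} \times 128\,\text{B} = 1{,}536\,\text{B} = 1.5\,\text{KB}.
\end{align*}
```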

6.3.5 Threat Model Coverage.

Our scheme addresses the threat model (Section 4) as follows:

Passive reading: This threat is prevented by rerouting broadcast messages as they enter the CMC-2 at the interposer/MC boundary. Broadcast messages from the directories are converted into negative acknowledgments (back to the requester) for chiplets without permissions to the message’s memory region. In all cases, the CMC will throw a security check exception and halt execution if an attack is detected.

Masquerading: Every CMC-1 is programmed with the range of IDs expected in each coherence message’s requester ID field. For example, in Figure 1, the CMC-1 in router 72 expects requester IDs in the range of 0–7 and will reject messages with an ID outside this range, as discussed in Section 6.3.3.

Modification: This is detected by comparing a message type, such as GETX/GETS, with the access permissions in the APU table.

Diversion: This threat is detected by checking the destination ID and the message type. Only specific message types can have other cores as the destination ID. This, along with the memory region permissions in the APU table, allows us to detect any malicious diversion of messages.

Complex: Complex attacks, like the Forging Attack demonstrated in Section 5.2, are similarly covered by our technique as they are composed of multiple simple attacks.

Software Threats: Our design prevents unauthorized accesses to memory regions due to privilege escalation or exploitation using the mechanics described above for hardware threats. Importantly, since coherence messages are generated by hardware, a solely software-driven attack cannot engage in these threats through packet manipulation without malicious hardware intervention.
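The field checks behind the masquerading and diversion items can be sketched as below. The sketch is illustrative only: the `CoherenceMsg` record, the `cmc1_field_checks` function, and in particular the set of message types allowed to target a core are hypothetical placeholders, since the actual field layout and legal combinations depend on the protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoherenceMsg:
    msg_type: str
    requester: int
    destination: str      # "directory" or "core" in this sketch


# Hypothetical policy: message types that may name another core as the
# destination. The real set is protocol-specific.
CORE_DESTINED_TYPES = {"ACK", "DATA"}


def cmc1_field_checks(msg, allowed_requesters):
    """Return the list of violations found in one message."""
    violations = []
    if msg.requester not in allowed_requesters:
        violations.append("masquerading: requester ID outside chiplet range")
    if msg.destination == "core" and msg.msg_type not in CORE_DESTINED_TYPES:
        violations.append("diversion: message type may not target a core")
    return violations


chiplet0_cores = range(0, 8)      # requester IDs 0-7, as in the example above
ok = CoherenceMsg("GETX", requester=3, destination="directory")
spoofed = CoherenceMsg("GETX", requester=12, destination="directory")
diverted = CoherenceMsg("GETX", requester=3, destination="core")

for msg in (ok, spoofed, diverted):
    print(msg, cmc1_field_checks(msg, chiplet0_cores))
```

In the proposed system any non-empty result would raise the security exception and halt execution; the sketch only reports the violation.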

6.3.6 Security Testing.

To test the system’s ability to counter all threats, we inject tailored, malicious coherence messages at the network interface of cores. We verify that the respective check exceptions are thrown for all attacks listed above; no malicious packets enter the interposer NoC before the system halts. We also verify that the complex Forging Attack is rendered incapable of executing its payload in our secure system.

7 Secure Interposer-based Chiplet System: Evaluation

We first discuss our evaluation methodology. Then, we examine the security coverage our design provides. Finally, we examine the performance overheads caused by our scheme.

7.1 Methodology

We implement and evaluate our proposed system in gem5 [56] using system emulation. Table 1 lists the configuration details. The particular proof-of-concept system studied here is inspired by the Rocket-64 design [44]. We simulate the 8-chiplet, 64-core system described in Section 3. We assume the interposer is fabricated using an older process node operating at 250 MHz, a quarter of the chiplets’ frequency.

Table 1. System Architecture Configuration

Component	Value
Chiplet Architecture
Core	8 RISC-V cores
Private L1 I-Cache	32 KB
Private L1 D-Cache	64 KB
L2 Cache	2 MB
NoC	Eight-port, 128-bit Crossbar
vc_per_vnet	4, 6, 8, or 10
Chiplet Frequency	1 GHz
System Architecture
Chiplets	8
MCs	4
Main Memory	4 GB
Memory Regions	64, 64 MB each
NoC	3x4 2D-Mesh, 64 or 128 bit
Virtual Networks	4
vc_per_vnet	4, 6, 8, or 10
Interposer Frequency	250 MHz
CMC Latency (64b or 128b link width)	1 or 2 cycles
Cache Coherence
Model	MOESI Hammer
Simulation Configuration
Processor Model	TimingSimpleCPU
Simulation Model	System emulation

We measure performance impact as the IPC speedup/slowdown of the secure, CMC-enabled configuration over the insecure baseline configuration. The CMC latencies are discussed in Section 6.3; the CMCs are disabled for the unsecured baseline. Due to the long simulation times of this large system, we evaluate IPC using a subset of the SPEC 2006 benchmarks. We perform single-threaded and multi-programmed simulations to better understand the CMCs’ impact.
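The metric can be written down compactly. In the sketch below, the function names are hypothetical and the IPC values are synthetic placeholders chosen only to exercise the code; they are not measurements from this work.

```python
import math


def speedup(ipc_secure, ipc_baseline):
    """IPC of the CMC-enabled run relative to the insecure baseline.

    Values below 1.00 indicate a slowdown.
    """
    return ipc_secure / ipc_baseline


def geometric_mean(values):
    """Geometric mean, commonly used to average ratios such as speedups."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Synthetic (secure IPC, baseline IPC) pairs, for illustration only.
placeholder_ipcs = {"bench_a": (0.90, 1.00), "bench_b": (1.20, 1.25)}
speedups = {name: speedup(s, b) for name, (s, b) in placeholder_ipcs.items()}
print(speedups, geometric_mean(speedups.values()))
```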

7.2 Single-Threaded Performance Impact

Figure 8 shows the impact on the IPC of the system with CMCs enabled compared to the baseline configuration. As expected, all workloads experience a speedup of less than 1.00 as the CMCs introduce higher latencies to the network. As the figure shows, the CMCs impose an average performance loss of $\sim$4%, with several benchmarks (povray, hmmer, libquantum) showing little to no impact. sphinx3, however, is an outlier, showing a significant $\sim$27% performance loss.

Fig. 7.

Fig. 8.

To investigate further, we examine the L2 miss rates of each benchmark in Figure 9. The figure demonstrates that the variation between each benchmark’s result in Figure 8 is highly correlated to a benchmark’s cache hit rate. For instance, sphinx3 shows a much higher L2 cache miss rate than other benchmarks at $\sim$68%. The CMCs must process each packet resulting in increased memory access latencies. Thus, the CMC-enabled system’s performance depends on the number of coherence messages that the L2 cache misses inject into the NoC.

Fig. 9.

The performance degradation in some benchmarks is further analyzed in Figure 7, showing the percentage change for pre-injection queuing latency versus in-network latency and the total latency experienced by packets in the network. Interestingly, while the queuing latency increases by $\sim$80%, the in-network latencies drop by 5%–10%. The increase in queuing latency is expected due to the extra pipeline delays on network insertion that the CMCs cause. The decrease in in-network latency is due to CMC-2 instances rerouting acknowledgment messages back to only the original requester. Thus, the CMC-2 reduces the total network load by removing one packet in the transaction.

The total packet latency increases by 39% on average. Interestingly, although sphinx3 incurs a higher performance impact than the other benchmarks, it does not see a significantly different packet latency. That is, as discussed above, sphinx3’s performance loss is due to a higher L2 miss rate and hence higher packet injection, not a higher per-packet latency. sphinx3’s higher miss rate exposes it more to the increase of network latency than other applications with lower miss rates.

Figure 10 depicts the speedup of the benchmarks with three different VC configurations (vc_per_vnet). We observe that the geometric mean speedup approaches 0.98 with more VCs. This is because VCs act as parallel buffers that allow for higher bandwidth and for more packets to be in-flight in the network, all while avoiding deadlocks.

Fig. 10.

We see a significant improvement in speedup for sphinx3 due to the improvement in queuing latencies at the network interfaces. These significant gains imply that increasing the VC count is a good way to improve performance if the application has a high cache miss rate, as it allows for more misses to be in-flight within the network.

In Figure 11, we analyze the impact of increasing the interposer link widths to 128 bits versus the baseline of 64 bits. Note that, due to runtime constraints for such large-scale simulations running on our shared high-performance computing cluster, we focus on a representative subset of benchmark runs for this experiment. The larger bandwidth provides only slightly better speedup than the baseline, implying that increasing the bit-width for the physical links in the interposer is likely not worthwhile. However, this depends on the designer’s tradeoff for costs/overheads and the system’s scalability.

Fig. 11.

7.3 Multi-Programmed Performance Impact

We also evaluate the impact of the CMCs on multi-programmed workloads, using random mixes of two benchmarks each, executed in two cores in separate chiplets. Here we simulate until all applications complete at least five billion instructions, and we report the weighted speedup of the combination using the methodology of Kadjo et al. [39].

Figure 12 shows the speedup for these multi-programmed workloads. In general, speedups range between 0.95 and 1.06. In some cases, namely bzip2-namd and bwaves-gcc, the speedup with the CMCs enabled was better than the baseline. Further, the mixes which included sphinx3 showed reduced performance loss versus the stand-alone sphinx3. Similar to the work presented by Bilir et al. [10], the improvement results from CMC-2 filtering out packets otherwise sent to unauthorized chiplets. This reduces the bandwidth pressure multiple applications induce on the NoC, implying that the overheads on performance do not increase as we add more workloads.

Fig. 12.

8 Secure Interposer-based Chiplet System: Discussion

In this section, we discuss the scalability of the proposed system as well as its applicability to other coherence protocols.

8.1 Scalability

The performance regression shown in Figure 8 for single application workloads is due to the latency incurred by the CMCs. Since the CMCs are located on the periphery of the NoC, the performance overhead in terms of added latency for each packet remains constant, regardless of the NoC size. Thus, the overhead is a function of the network perimeter, not the network node count. As the system scales to include more nodes in the network and chiplets in the system, the impact of the CMC overhead will decrease because the in-network per-hop latency will increase faster than the CMC overhead. Similarly, the number of CMCs scales with the network perimeter, meaning that the overhead will scale sublinearly with respect to energy and area of the overall system.
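The scaling argument can be illustrated with a toy latency model. Everything in this sketch is an assumption made for illustration: a square n-by-n mesh, uniform random traffic, a fixed cost per hop, no contention, and hypothetical function names. It is not the evaluated 3x4 configuration and does not reproduce any measured result.

```python
def mean_hops(n):
    """Mean Manhattan distance between two uniformly random nodes
    of an n x n mesh: 2 * (n^2 - 1) / (3 * n)."""
    return 2 * (n * n - 1) / (3 * n)


def cmc_share(n, cmc_cycles=2, cycles_per_hop=1):
    """Fraction of end-to-end latency spent in the CMC.

    The CMC cost is paid once at the network boundary, so it stays
    constant while the in-network part grows with the mesh size.
    """
    network_cycles = mean_hops(n) * cycles_per_hop
    return cmc_cycles / (cmc_cycles + network_cycles)


for n in (2, 4, 8, 16):
    print(f"{n:2d} x {n:<2d} mesh: mean hops {mean_hops(n):5.2f}, "
          f"CMC share {cmc_share(n):.2f}")
```

Under these assumptions the constant boundary cost accounts for a shrinking share of total packet latency as the mesh grows, which is the trend argued above.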

Our case study implementation partitions the physical memory space into 64 memory regions in total. We chose this as what we felt was a “reasonable” tradeoff between design complexity of the APU versus the number of independently configurable unshared/shared memory regions for the eight chiplets. Other design choices are possible based on system requirements, each with its own tradeoffs. For example, allocating 128 MB per region implies the APU only needs 32 entries for our 4 GB system. This results in a smaller, faster APU (and CMC) with reduced energy consumption, but at the cost of fewer sharing options for system applications. Conversely, if the system design requires more diverse and finer-grained sharing options, one might increase the number of entries in the APU from 64 to 128. Then, each region would be allocated 32 MB for a 4 GB system. The cost of this would be a 2$\times$ larger APU with corresponding timing/area/power overheads.
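The three design points mentioned above can be tabulated with a short helper. The function name `apu_design_point` is hypothetical; the per-table storage assumes two permission bits per chiplet per entry, as in Section 6.3.4, and says nothing about timing, area, or power.

```python
GB = 1 << 30
MB = 1 << 20


def apu_design_point(num_entries, memory_bytes=4 * GB,
                     num_chiplets=8, bits_per_permission=2):
    """Region size (MB) and per-table storage (bits) for an entry count."""
    region_mb = memory_bytes // num_entries // MB
    table_bits = num_entries * num_chiplets * bits_per_permission
    return region_mb, table_bits


for entries in (32, 64, 128):
    region_mb, table_bits = apu_design_point(entries)
    print(f"{entries:3d} entries: {region_mb:3d} MB per region, "
          f"{table_bits} bits per APU table")
```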

8.2 Other Coherence Protocols

Recall that coherence models can be broadly categorized into two types (Section 2.3): directory and broadcast protocols. The Trojans outlined in Section 5 are tailored to MOESI Hammer, leveraging the protocol’s broadcast component. Thus, these attacks can be trivially implemented in any broadcast protocol.

It is also possible to apply these attacks to directory protocols by having the Trojan generate an additional fake GETS so that it can monitor the target cache line’s state. For example, in the case of the Forging Trojan (Section 5.2), the Trojan would generate a GETS message for its target address, which registers it as a sharer in the view of the directory. This is needed because, in traditional directory protocols, the directory only sends invalidation requests to the cores it identifies as sharers. When another chiplet then attempts to write to that cache line with a GETX request, the global directory sends invalidations to all sharers, including the Trojan. Once the Trojan observes the invalidation, it can launch the attack as previously outlined.
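The role of the fake GETS can be seen in a toy model of the directory's sharer list. This is a minimal sketch that tracks only sharers per cache line; it omits ownership, data transfer, and every other protocol state, and the class `ToyDirectory` and the node names are hypothetical.

```python
from collections import defaultdict


class ToyDirectory:
    """Tracks, per cache line, which nodes the directory believes are sharers."""

    def __init__(self) -> None:
        self.sharers = defaultdict(set)

    def gets(self, requester: str, line: int) -> None:
        """A read request registers the requester as a sharer of the line."""
        self.sharers[line].add(requester)

    def getx(self, requester: str, line: int) -> list[str]:
        """A write request invalidates all other sharers; return who is notified."""
        invalidated = sorted(self.sharers[line] - {requester})
        self.sharers[line] = {requester}
        return invalidated


LINE = 0x1000

# Without a prior GETS, the third party never hears about the write.
quiet = ToyDirectory()
quiet.gets("victim", LINE)
print(quiet.getx("writer", LINE))   # ['victim']

# After a GETS for the same line, it is on the sharer list and is notified.
probed = ToyDirectory()
probed.gets("victim", LINE)
probed.gets("trojan", LINE)
print(probed.getx("writer", LINE))  # ['trojan', 'victim']
```

In a broadcast protocol this extra step is unnecessary, since every node observes the request regardless of the sharer list.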

The extension of our secure interposer-based design proposed in Section 6 to directory-based protocols is more straightforward. Generally, CMC-2 instances would not be needed in directory-based schemes since their primary purpose is to filter unnecessary broadcast messages. CMC-1 instances do not need significant modification, other than to properly understand and verify the coherence message types used by the new coherence scheme. Otherwise, they will function the same as in broadcast schemes. Thus, our attacks and proposed design can be extended to systems using other coherence models.
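To make the role of CMC-2 concrete, the following sketch shows the kind of filtering a broadcast scheme benefits from and a directory scheme does not need. It is an illustration under stated assumptions, not the microarchitecture of Section 6: the table layout (`apu_table` mapping a region index to the set of authorized chiplet IDs) and the function `filter_broadcast` are hypothetical.

```python
def filter_broadcast(
    region: int,
    destinations: list[int],
    apu_table: dict[int, frozenset[int]],
) -> list[int]:
    """Keep only destinations authorized for the region; unknown regions allow none."""
    allowed = apu_table.get(region, frozenset())
    return [chiplet for chiplet in destinations if chiplet in allowed]


# Assumed example: region 5 is shared only between chiplets 0 and 3.
apu_table = {5: frozenset({0, 3})}

# Chiplet 0 broadcasts a request for region 5 to the seven other chiplets.
broadcast = [chiplet for chiplet in range(8) if chiplet != 0]
delivered = filter_broadcast(5, broadcast, apu_table)

print(delivered)                         # [3]
print(len(broadcast) - len(delivered))   # 6 messages never enter the remote chiplets
```

With a directory, requests are already sent only to the nodes on the sharer list, so there is no broadcast fan-out left for such a filter to trim.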

We do note that, in the case of non-broadcast protocols, we will likely lose the performance benefit from filtering out unnecessary broadcast messages that we see with the MOESI Hammer implementation. Nevertheless, the latency overheads of the proposed system will still scale with the perimeter of the network, not the node count. Thus, we also do not expect the performance to degrade with larger networks in directory-based protocols.

Finally, we note that our proposed design can be directly used with more heterogeneous interposer-based designs made up of CPUs, GPUs, hardware accelerators, and so on. That is, as long as the system maintains a cache coherence protocol in the interconnect for communication between the components/chiplets and the memory, our interposer-based root of trust system leveraging CMCs can enforce the security guarantees.

9 Conclusion

We propose an active interposer as a root of trust for modern chiplet-based systems by implementing hardware security features directly in the interposer. Specifically, we devise a CMC, which we propose including at the boundaries between the interposer and the chiplets and memory controllers. We show how such a scheme addresses various attacks arising from malicious chiplets with a relatively low performance impact, $\sim$4%.

Footnotes

This is a secure and protocol-conforming approach. For the sake of system-level throughput, one may want to isolate only the chiplet(s) triggering a security violation. Doing so safely, however, is not trivial, requiring significant modifications of the coherence protocol itself to prevent deadlocks; this is planned for future work.

In future work, we may consider extending this scenario. However, this would increase the complexity of the process and require careful investigation.

References

[1] [n.d.]. Compute Express Link (CXL). Retrieved October 14, 2022 from www.computeexpresslink.org
