
Cross-VM Covert- and Side-Channel Attacks in Cloud FPGAs

Published: 22 December 2022

Abstract

The availability of FPGAs in cloud data centers offers rapid, on-demand access to reconfigurable hardware compute resources that users can adapt to their own needs. However, the low-level access to the FPGA hardware and associated resources such as the PCIe bus, SSD drives, or DRAM modules also opens up threats of malicious attackers uploading designs that are able to infer information about other users or about the cloud infrastructure itself. In particular, this work presents a new, fast PCIe-contention-based channel that is able to transmit data between FPGA-accelerated virtual machines (VMs) by modulating the PCIe bus usage. This channel further works with different operating systems and achieves bandwidths reaching 20 kbps with 99% accuracy. This is the first cross-FPGA covert channel demonstrated on commercial clouds and has a bandwidth which is over 2000× larger than prior voltage- or temperature-based cross-board attacks. This article further demonstrates that the PCIe receivers are able to not just receive covert transmissions, but can also perform fine-grained monitoring of the PCIe bus, including detecting when co-located VMs are initialized, even prior to their associated FPGAs being used. Moreover, the proposed mechanism can be used to infer the activities of other users, or even slow down the programming of the co-located FPGAs as well as other data transfers between the host and the FPGA. Beyond leaking information across different virtual machines, the ability to monitor the PCIe bandwidth over hours or days can be used to estimate the data center utilization and map the behavior of the other users. The article also introduces further novel threats in FPGA-accelerated instances, including contention due to network traffic, contention due to shared NVMe SSDs, as well as thermal monitoring to identify FPGA co-location using the DRAM modules attached to the FPGA boards. This is the first work to demonstrate that it is possible to break the separation of privilege in FPGA-accelerated cloud environments, and highlights that defenses for public clouds using FPGAs need to consider PCIe, SSD, and DRAM resources as part of the attack surface that should be protected.

1 Introduction

Public cloud infrastructures with FPGA-accelerated virtual machine (VM) instances allow for easy, on-demand access to reconfigurable hardware that users can program with their own designs. The FPGA-accelerated instances can be used to accelerate machine learning, image and video manipulation, or genomic applications, for example [5]. The potential benefits of the instances with FPGAs have resulted in numerous cloud providers including Amazon Web Services (AWS) [14], Alibaba [3], Baidu [20], Huawei [37], and Tencent [59], giving public users direct access to FPGAs. However, providing users low-level access to upload their own hardware designs has resulted in serious implications for the security of cloud users and the cloud infrastructure itself. Several recent works have considered the security implications of shared FPGAs in the cloud, and have demonstrated covert-channel [29] and side-channel [33] attacks in this multi-tenant setting. However, today’s cloud providers, such as AWS with their F1 instances, only offer “single-tenant” access to FPGAs. In the single-tenant setting, each FPGA is fully dedicated to the one user who rents it, while many other users may be in parallel using their separate, dedicated FPGAs which are within the same server. Once an FPGA is released by a user, it can then be assigned to the next user who rents it. This can lead to temporal thermal covert channels [61], where heat generated by one circuit can be later observed by other circuits that are loaded onto the same FPGA. Such channels are slow (less than \(1 \,b/\mathrm{s}\) ), and are only suitable for covert communication since they require the two parties to coordinate and keep being scheduled on the same physical hardware one after the other. Other means of covert communication in the single-tenant setting do not require being assigned to the same FPGA chip. For example, multiple FPGA boards in servers share the same power supply, and prior work has shown the potential for such shared power supplies to leak information between FPGA boards [30]. However, the resulting covert channel was slow (less than \(10 \,b/\mathrm{s}\) ) and was only demonstrated in a lab setup.
Another single-tenant security topic that has been previously explored is that of fingerprinting FPGA instances using Physical Unclonable Functions (PUFs) [60, 62]. Fingerprinting allows users to partially map the infrastructure and get some insights about the allocation of FPGAs (e.g., how likely a user is to be re-assigned to the same physical FPGA they used before), but fingerprinting by itself does not lead to information leaks. A more recent fingerprinting-related work explored mapping FPGA infrastructures using PCIe contention to find which FPGAs are co-located in the same Non-Uniform Memory Access (NUMA) node within a server [63]. However, no prior work has successfully launched a cross-VM covert- or side-channel attack in a real cloud FPGA setting.
By contrast, our work shows that shared resources can be used to leak information across separate VMs running on the FPGA-accelerated F1 instances in AWS data centers. In particular, we use the contention of the PCIe bus to not only demonstrate a new, fast covert channel (reaching up to \(20 \,\mathrm{k}b/\mathrm{s}\) ) that persists across different operating systems but also to identify patterns of activity based on the PCIe signatures of different Amazon FPGA Images (AFIs) used by other users. This includes detecting when co-located VMs are initialized or performing an interference attack that can slow down the programming of other users’ FPGAs, or more generally degrade the transfer bandwidth between the FPGA and the host VM. Our attacks do not require special privileges or potentially malicious circuits such as Ring Oscillators (ROs) or Time-to-Digital Converters (TDCs), and thus cannot easily be detected through static analysis or Design Rule Checks (DRCs) that cloud providers may perform. We further introduce three new methods of finding co-located instances that are in the same physical server: (a) through reducing the network bandwidth via PCIe contention, (b) through resource contention of the Non-Volatile Memory Express (NVMe) SSDs that are accessible from each F1 instance via the PCIe bus, and (c) through the common thermal signatures obtained from the decay rates of each FPGA’s DRAM modules. Our work, therefore, shows that single-tenant attacks in real FPGA-accelerated cloud environments are practical, and demonstrates several ways to infer information about the operations of other cloud users and their FPGA-accelerated VMs or the data center itself.

1.1 Contributions

In summary, the contributions of this work are:
(1)
Demonstrating the first FPGA-based covert channel between separate VMs, reaching \(20 \,\mathrm{k}b/\mathrm{s}\) with 99% accuracy.
(2)
Characterizing the cross-VM covert-channel accuracy and bandwidth tradeoffs across different operating systems.
(3)
Inferring information about the behavior of other users through the PCIe signatures of their AFIs.
(4)
Detecting when co-located VM instances with FPGAs are initialized.
(5)
Performing long-term monitoring of data center activity through PCIe contention.
(6)
Slowing down communication between the host and the FPGA, resulting in attacks that degrade the FPGA programming times and other application data transfers.
(7)
Identifying network- and SSD-based interference mechanisms and covert channels between separate F1 users.
(8)
Exploiting the thermal signatures of DRAM modules to identify F1 instances which are on separate NUMA nodes, but share the same server.

1.2 Responsible Disclosure

Our findings and a copy of this article have been shared with the AWS security team.

1.3 Article Organization

The remainder of the article is organized as follows. Section 2 provides the background on today’s deployments of FPGAs in public cloud data centers and summarizes related work. Section 3 discusses typical FPGA-accelerated cloud servers and PCIe contention that can occur among the FPGAs, while Section 4 evaluates our fast, PCIe-based, cross-VM channel. Using the ideas from the covert channel, Section 5 investigates how to infer information about other VMs through their PCIe traffic patterns, including detecting the initialization of co-located VMs, long-term PCIe monitoring of data center activity, and slowing down PCIe traffic on adjacent instances. Section 6 then presents alternative sources of information leakage due to network bandwidth contention, shared SSDs, and thermal signatures of DRAM modules. The article concludes in Section 7.

2 Background and Related Work

This section provides a brief background on FPGAs in public cloud computing data centers, with a focus on the F1 instances from AWS [14] that are evaluated in this work. It also summarizes related work in the area of FPGA cloud security.

2.1 AWS F1 Instance Architecture

AWS has offered FPGA-accelerated VM instances to users since late 2016 [4]. These so-called F1 instances are available in three sizes: f1.2xlarge, f1.4xlarge, and f1.16xlarge, where the number in the instance name is twice the number of FPGAs it contains (so f1.2xlarge has 1 FPGA, f1.4xlarge has 2, and f1.16xlarge has 8 FPGAs). Each instance is allocated 8 virtual CPUs (vCPUs), \(122 \,\mathrm{G}{\rm i}B\) of DRAM, and \(470 \,\mathrm{G}B\) of NVMe SSD storage per FPGA. For example, f1.4xlarge instances have 16 vCPUs, \(244 \,\mathrm{G}{\rm i}B\) of DRAM, and \(940 \,\mathrm{G}B\) of SSD space [14], since they contain 2 FPGAs.
Each FPGA board is attached to the server over a x16 PCIe Gen 3 bus. In addition, each FPGA board contains four DDR4 DRAM chips, totaling \(64 \,\mathrm{G}{\rm i}B\) of memory per FPGA board [14]. These memories are separate from the server’s DRAM and are directly accessible by each FPGA. The F1 instances use Virtex UltraScale+ XCVU9P chips [14], which contain over \(1.1\) million lookup tables (LUTs), \(2.3\) million flip-flops (FFs), and \(6.8\) thousand Digital Signal Processing (DSP) blocks [69].
As has recently been shown, each server contains 8 FPGA boards, which are evenly split between two NUMA nodes [63]. The AWS server architecture deduced by Tian et al. [63] is shown in Figure 1 and is consistent with publicly available information on AWS instances [12, 14]. AWS servers containing FPGAs have two Intel Xeon E5-2686 v4 (Broadwell) processors, connected through an Intel QuickPath Interconnect (QPI) link. Each processor forms its own NUMA node with its associated DRAM and four FPGAs attached as PCIe devices. Due to this architecture, an f1.2xlarge instance may be on the same NUMA node as up to three other f1.2xlarge instances, or on the same NUMA node as one other f1.2xlarge instance and one f1.4xlarge instance (which uses 2 FPGAs). Conversely, an f1.4xlarge instance may share the same NUMA node with up to two f1.2xlarge instances or one f1.4xlarge instance. Finally, as f1.16xlarge instances use up all 8 FPGAs in the server, they do not share any resources with other instances, since both NUMA nodes of the server are fully occupied. In this work, we are able to produce a more fine-grained topology of the servers and their PCIe topologies due to additional sources of contention via NVMe SSDs and Network Interface Controller (NIC) cards.
Fig. 1.
Fig. 1. Prior work suggested that AWS servers contain 8 FPGAs divided between two NUMA nodes [63].

2.2 Programming AWS F1 Instances

Users utilizing F1 instances do not retain entirely unrestricted control of the underlying hardware, but instead, need to adapt their hardware designs to fit within a predefined architecture. In particular, user designs are defined as “Custom Logic (CL)” modules that interact with external interfaces through the cloud-provided “Shell”, which hides physical aspects such as clocking logic and I/O pinouts (including for PCIe and DRAM) [29, 62]. This restrictive Shell interface further prevents users from accessing identifier resources, such as eFUSE and Device DNA primitives, which could be used to distinguish between different FPGA boards [29, 62]. Finally, users cannot directly upload bitstreams to the FPGAs. Instead, they generate a Design Checkpoint (DCP) file using Xilinx’s tools and then provide it to Amazon to create the final bitstream (Amazon FPGA Image, or AFI), after it has passed a number of DRCs. The checks, for example, include prohibiting combinatorial loops such as ROs as a way of protecting the underlying hardware [28, 29], though alternative designs bypassing these restrictions have been proposed [29, 57].

2.3 Related Work

Since the introduction of FPGA-accelerated cloud computing about five years ago, a number of researchers have been exploring the security aspects of FPGAs in the cloud. A key feature differentiating such research from prior work on FPGA security outside of cloud environments is the threat model, which assumes remote attackers without physical access to or modifications of the FPGA boards. This section summarizes selected work that is applicable to the cloud setting, leaving traditional FPGA security topics to existing books [38] or surveys [26, 39, 50, 73].

2.3.1 PCIe-Based Threats.

The Peripheral Component Interconnect Express (PCIe) standard provides a high-bandwidth, point-to-point, full-duplex interface for connecting peripherals within servers. Existing work has shown that PCIe switches can cause bottlenecks in multi-GPU systems [21, 25, 27, 55, 56], leading to severe stalls due to their scheduling policy [44]. In terms of PCIe contention in FPGA-accelerated cloud environments, prior work has shown that different driver implementations result in different overheads [66] and that changes in PCIe bandwidth can be used to co-locate different instances on the same server [63]. In parallel to this work, PCIe contention was used for side-channel attacks which can recover the workload of GPUs and NICs via changes in the delay of PCIe responses [58]. Our work is similar but presents the first successful cross-VM attacks using PCIe contention on a real public cloud. Moreover, by going beyond just PCIe, our work is able to deduce cross-NUMA-node co-location using the DRAM thermal fingerprinting approach.

2.3.2 Power-Based Threats.

Computations that cause data-dependent power consumption can result in information leaks that can be detected even by adversaries without physical access to the device under attack. For example, it is known that a shared power supply in a server can be used to leak information between different FPGAs, where one FPGA modulates power consumption and the other measures the resulting voltage fluctuations [30]. However, such work results in low transmission rates (below \(10 \,b/\mathrm{s}\) ), and has only been demonstrated in a lab environment.
Other work has shown that it is possible to develop stressor circuits which modulate the overall power consumption of an FPGA and generate a lot of heat, for instance by using ROs or transient short circuits [1, 2, 35]. These large power draws can be used for fault attacks [40], or as Denial-of-Service (DoS) attacks [42] which simply make the hardware unavailable for an extended period of time. Such attacks could also prematurely age FPGAs, due to the potentially excessive heat for an extended period of time [19]. Our work has instead focused on information leaks and non-destructive reverse-engineering of the cloud infrastructure.

2.3.3 Thermal-Based Threats.

It is now well-known that it is possible to implement temperature sensors suitable for thermal monitoring on FPGAs using ROs [23], whose frequency drifts in response to temperature variations [45, 46, 65, 72]. A receiver FPGA could thus use an RO to observe the ambient temperature of a data center. For example, existing work [61] has explored a new type of temporal thermal attack: heat generated by one circuit can be later observed by other circuits that are loaded onto the same FPGA. This type of attack is able to leak information between different users of an FPGA who are assigned to the same FPGA over time. However, the bandwidth of temporal attacks is low (less than \(1 \,b/\mathrm{s}\) ), while our covert channels can reach a bandwidth of up to \(20 \,\mathrm{k}b/\mathrm{s}\) .

2.3.4 DRAM-Based Threats.

Recent work has shown that direct control of the DRAM connected to the FPGA boards can be used to fingerprint them [62]. This can be combined with existing work [63] to build a map of the cloud data centers where FPGAs are used. Such fingerprinting does not by itself, however, help with cross-VM covert channels, as it does not provide co-location information. By contrast, our PCIe, NIC, SSD, and DRAM approaches are able to co-locate instances in the same server and enable cross-VM covert channels and information leaks.

2.3.5 Multi-Tenant Security.

This work has focused on the single-tenant setting, where each user gets full access to the FPGA, and thus reflects the current environment offered by cloud providers. However, there is also a large body of security research in the multi-tenant context, where a single FPGA is shared by multiple, logically (and potentially physically) isolated users. For example, several researchers have shown how to recover information about the structure [64, 74] or inputs [51] of machine learning models or cause timing faults to reduce their accuracy [24, 54]. Other work in this area has shown that crosstalk due to routing wires [28] and logic elements [31] inside the FPGA chips can be used to leak static signals, while voltage drops due to dynamic signals can lead to covert-channel [29], side-channel [33, 36], and fault [52] attacks. Several works have also tried to address such issues to enable multi-tenant applications, proposing static checks [41, 43], voltage monitors [34, 48, 53], or a combination of the two [42]. Our work on PCIe, SSD, and DRAM threats is orthogonal to such work but is directly applicable to current cloud FPGA deployments.

3 PCIe Contention in Cloud FPGAs

The user’s CL running on the FPGA instances can use the Shell to communicate with the server through the PCIe bus. Users cannot directly control the PCIe transactions, but, instead, perform simple reads and writes to predefined address ranges through the Shell. These memory accesses get translated into PCIe commands and PCIe data transfers between the server and the FPGA. Users may also set up Direct Memory Access (DMA) transfers between the FPGA and the server. By designing hardware modules with low logic overhead, users can generate transfers fast enough to saturate the PCIe bandwidth. In fact, because of the shared PCIe bus within each NUMA node, these transfers can create interference and bus contention that affects the PCIe bandwidth of other users. The resulting performance degradation can be used for detecting co-location [63], or, as we show in this work, for fast covert- and side-channel attacks, breaking the isolation between otherwise logically and physically separate VM instances.
In our covert-channel analysis (Section 4), we show that the communication bandwidth is not identical for all pairs of FPGAs in a NUMA node. In particular, this suggests that the 4 PCIe devices are not directly connected to each CPU, but instead likely go through two separate switches, forming the hierarchy shown in Figure 2, improving the deduced model of prior work [63]. Although not publicly confirmed by AWS, this topology is similar to the one described for P4d instances, which contain 8 GPUs [7]. As a result, even though all 4 FPGAs in a NUMA node contend with each other, the covert-channel bandwidth is highest amongst those sharing a PCIe switch, due to the bottleneck imposed by the shared link (Section 4).
Fig. 2.
Fig. 2. The newly-deduced PCIe configuration for F1 servers is based on the experiments in this work: each CPU has two PCIe links, each of which provides connectivity to two FPGAs, an NVMe SSD, and an NIC through a PCIe switch.
We also expand on this model to show that the PCIe switches provide connectivity to an NVMe SSD drive and a Network Interface Card (NIC), identifying additional sources of PCIe contention and thereby enlarging the attack surface. Finally, as we show in Section 4.5, how effectively these PCIe links can be saturated depends not only on the user-level software and the underlying hardware architecture, but also on the operating system and kernel configuration.

4 Cross-VM Covert Channels

In this section, we describe our implementation for the first cross-FPGA covert-channel on public clouds (Section 4.1) and discuss our experimental setup (Section 4.2). We then analyze bandwidth vs. accuracy tradeoffs (Section 4.3), before investigating the impact of receiver and transmitter transfer sizes on the covert-channel accuracy for a given covert-channel bandwidth (Section 4.4). We finish the section by discussing differences in the covert-channel bandwidth between VMs using different operating systems (Section 4.5). Side channels and information leaks based on PCIe contention from other VMs are discussed in Section 5.

4.1 Covert-Channel Implementation

Our covert channel is based on saturating the PCIe link between the FPGA and the server, so, at their core, both the transmitter and the receiver consist of (a) an FPGA image that interfaces with the host over PCIe with minimal latency in accepting write requests or responding to read requests, and (b) software that attaches to the FPGA and repeatedly writes to (or reads from) the mapped Base Address Register (BAR). These requests are translated to PCIe transactions, transmitted over the data and physical layers, and then relayed to the CL hardware through the shell (SH) logic as AXI reads or writes. The transmitter stresses its PCIe link to transmit a 1, but remains idle to transmit a bit 0, while the receiver keeps measuring its own bandwidth during the transmission period (the receiver is thus identical to a transmitter that sends a 1 during every measurement period). The receiver then classifies the received bit as a 1 if the bandwidth \(B\) has dropped below a threshold \(T\) and as 0 otherwise.
The two communicating parties need to have agreed upon some minimal information prior to the transmissions: the specific data center to use (region and availability zone, e.g., us-east-1e), the time \(t\) to start communications, and the initial measurement period, expressed as the time difference between successive transmissions \(\delta\) . All other aspects of the communication can be handled within the channel itself, including detecting that the two parties are on the same NUMA node, or increasing the bandwidth by decreasing \(\delta\) . To ensure that the PCIe link returns to idle between successive measurements, transmissions stop before the end of the measurement interval, i.e., the transmission duration \(d\) satisfies \(d\lt \delta\) . Note that synchronization is implicit due to the receiver and transmitter having access to a shared wall clock time, e.g., via the Network Time Protocol (NTP). Figure 3 provides a high-level overview of our covert-channel mechanism.
Fig. 3.
Fig. 3. Example cross-VM covert communication: The transmitter (Alice) sends the ASCII byte “H”, represented as 01001000 in binary, to the receiver (Bob) in 8 intervals by stressing her PCIe bandwidth to transmit a 1 and remaining idle to transmit a 0. If Bob’s FPGA bandwidth \(B\) drops below a threshold \(T\) , he detects a 1, otherwise, a 0 is detected. To ensure no residual effects after each transmission, the time difference \(\delta\) between successive measurements is slightly larger than the transmission duration \(d\) .
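The per-bit timing of Figure 3 can be summarized in a short sketch. The following fragment is illustrative rather than the exact code used in our experiments: stress_pcie() and measure_pcie_bytes() are hypothetical helper routines standing in for the BAR-based transfer loop described in Section 4.2, and the slot boundaries are aligned to the shared wall clock as discussed above.
```c
/* Illustrative per-bit timing for the covert channel of Figure 3 (a sketch,
 * not the exact experimental code). stress_pcie() and measure_pcie_bytes()
 * are hypothetical helpers whose bodies correspond to the timed BAR-transfer
 * loop sketched in Section 4.2. All times are in nanoseconds, with d < delta. */
#include <stdint.h>
#include <time.h>

extern void     stress_pcie(uint64_t d_ns);        /* saturate PCIe for d ns  */
extern uint64_t measure_pcie_bytes(uint64_t d_ns); /* bytes moved in d ns     */

static void sleep_until(uint64_t t_ns)             /* absolute CLOCK_REALTIME */
{
    struct timespec ts = { (time_t)(t_ns / 1000000000ULL),
                           (long)(t_ns % 1000000000ULL) };
    clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL);
}

/* Transmitter: send one byte MSB-first, one bit per delta-long slot. */
void send_byte(uint8_t byte, uint64_t t0_ns, uint64_t delta_ns, uint64_t d_ns)
{
    for (int i = 7; i >= 0; i--) {
        sleep_until(t0_ns);              /* align to the slot boundary        */
        if ((byte >> i) & 1)
            stress_pcie(d_ns);           /* bit 1: saturate the shared link   */
        /* bit 0: remain idle for the rest of the slot                        */
        t0_ns += delta_ns;
    }
}

/* Receiver: classify each slot by comparing its own bandwidth to threshold T. */
uint8_t recv_byte(uint64_t t0_ns, uint64_t delta_ns, uint64_t d_ns, double T)
{
    uint8_t byte = 0;
    for (int i = 7; i >= 0; i--) {
        sleep_until(t0_ns);
        double bw = (double)measure_pcie_bytes(d_ns) / ((double)d_ns * 1e-9);
        if (bw < T)                      /* contention observed => bit 1      */
            byte |= (uint8_t)(1u << i);
        t0_ns += delta_ns;
    }
    return byte;
}
```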
Before they can communicate, the two parties (Alice and Bob in the example of Figure 3) first need to ensure that they are co-located on the same NUMA node within the server. To do so, they can launch multiple instances at or near an agreed upon time and attempt to detect whether any of their instances are co-located by sending handshake messages and expecting a handshake response, using the same setup information as for the covert channel itself (i.e., the time \(t^{\prime }\) to start the communication, the measurement duration \(\delta\) , and location information such as the data center region and availability zone). They additionally need to have agreed on the handshake message \(H\) , which determines the per-handshake measurement duration \(\Delta\) . This co-location process is summarized in Figure 4. Note that as prior work has shown [63], by launching multiple instances, the probability for co-location is high, but the two parties would have to agree on a “timeout” approach. For instance, they could have a maximum number of handshake attempts \(M\) , after which they re-launch instances at time \(t^{\prime }+M\cdot \Delta\) , or launch additional instances for every unsuccessful handshake attempt (e.g., after attempt 1, Alice and Bob both launch a new instance, while Alice terminates instance 1).
Fig. 4.
Fig. 4. The process to find a pair of co-located f1.2xlarge instances using PCIe contention uses the covert-channel mechanism to check for pre-agreed handshake messages: Alice transmits the handshake message with her first FPGA, and waits to see if Bob acknowledges the handshake message. In parallel, Bob measures the bandwidths of all his FPGAs. In this example, Bob detects the contention in his seventh FPGA during the fourth handshake attempt. Note that Alice and Bob can rent any number of FPGAs for finding co-location, with five and seven shown in this figure as an example.
In our experiments, we typically launch 5 instances per user at the same time in the same region and availability zone, have a 128-bit handshake message \(H\) , and consider the co-location attempt successful if the message was recovered with \(\ge \!\!80\%\) accuracy. Other fixed parameters, such as the measurement duration or transfer sizes, were informed by early manual experiments and the work in [63] to ensure we can reliably detect co-location. Note that these parameters can be different from those used after the co-location has been established. For instance, co-location detection can use low-bandwidth transfers (e.g., \(200 \,b/\mathrm{s}\) ) that are reliable across all NUMA node setups, and can be increased as part of the setup process, once co-location has been established.
During the co-location process, the two communicating parties can also establish what the threshold \(T\) should be and whether the communication bandwidth should be increased. As shown in [63], the PCIe bandwidth of an instance drops from over \(3000 \,\mathrm{M}B/\mathrm{s}\) to under \(1000 \,\mathrm{M}B/\mathrm{s}\) when there is an external PCIe stressor. As a result, this threshold \(T\) could be simply hardcoded (at, say, \(2000 \,\mathrm{M}B/\mathrm{s}\) ), or be adaptive, as the mid-point between the minimum \(b_m\) and maximum \(b_M\) bandwidths recorded during a successful handshake. The latter is the approach we use in our work: if the two instances are not co-located, \(b_m\approx b_M\) , and the decoded bits will be random, and hence will not match the expected handshake message \(H\) . If the two instances are co-located, \(b_M\gg b_m\) (assuming \(H\) contains at least one 0 and at least one 1), so any bit 1 will correspond to a bandwidth \(b_1\approx b_m \ll (b_m+b_M)/2=T\) and any bit 0 will result in bandwidth \(b_0\approx b_M \gg (b_m+b_M)/2=T\) .
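The adaptive threshold and the handshake check can be expressed compactly; the sketch below is illustrative, with function names and the array-based interface of our own choosing.
```c
/* Illustrative sketch of the adaptive threshold and handshake check described
 * above (names are illustrative, not taken from the experimental code).
 * bw[i] is the bandwidth measured during handshake slot i, and h[i] is the
 * expected bit of the pre-agreed message H. */
#include <stddef.h>

/* T is the midpoint between the minimum and maximum observed bandwidths. */
double adaptive_threshold(const double *bw, size_t n)
{
    double b_min = bw[0], b_max = bw[0];
    for (size_t i = 1; i < n; i++) {
        if (bw[i] < b_min) b_min = bw[i];
        if (bw[i] > b_max) b_max = bw[i];
    }
    return (b_min + b_max) / 2.0;        /* T = (b_m + b_M) / 2               */
}

/* Declare co-location if at least 80% of the decoded bits match H. */
int handshake_matches(const double *bw, const int *h, size_t n)
{
    double T = adaptive_threshold(bw, n);
    size_t correct = 0;
    for (size_t i = 0; i < n; i++) {
        int bit = bw[i] < T;             /* low bandwidth => bit 1            */
        if (bit == h[i]) correct++;
    }
    return correct * 10 >= n * 8;        /* accuracy >= 80%                   */
}
```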

4.2 Experimental Setup

For the majority of our experiments, we use VMs with AWS FPGA Developer Amazon Machine Image (AMI) [17] version 1.8.1, which runs CentOS 7.7.1908, and develop our code with the Hardware and Software Development Kit (HDK/SDK) version 1.4.15 [8]. We vary the operating systems used for the transmitters and receivers and significantly improve the covert-channel bandwidth in Section 4.5. For our FPGA bitstream, we use the unmodified CL_DRAM_DMA image provided by AWS (agfi-0d132ece5c8010bf7) [11] for both the transmitter and the receiver designs. This design maps the \(128 \,\mathrm{G}{\rm i}B\) PCIe Application Physical Function (AppPF) BAR4 (a 64-bit prefetchable BAR [10]) to the \(64 \,\mathrm{G}{\rm i}B\) of FPGA DRAM (twice). It is not necessary to use the FPGA DRAMs themselves: just responding to the PCIe requests so that the interfaces do not hang, as in the CL_HELLO_WORLD example [13], is sufficient. Our custom-written software maps the FPGA DRAM modules via the BAR4 register, and uses the BURST_CAPABLE flag to support write-combining for higher performance. Data transfers are implemented using memcpy, achieving similar performance to the AWS benchmarks [6]. Algorithm 1 summarizes our software routine in pseudocode.
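The fragment below gives a minimal sketch of this routine. It is not a verbatim copy of Algorithm 1: the actual software attaches to AppPF BAR4 through the AWS SDK with the BURST_CAPABLE (write-combining) flag, whereas the sketch maps the BAR through the generic Linux sysfs PCI interface, and the device path shown is purely illustrative.
```c
/* Minimal sketch of the measurement routine summarized as Algorithm 1. The
 * actual software uses the AWS HDK/SDK to attach to AppPF BAR4 with the
 * BURST_CAPABLE (write-combining) flag; here the BAR is instead mapped via
 * the generic Linux sysfs PCI interface, and the device path is illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (64 * 1024)                /* transmitter chunk size (64 kB)    */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    /* Hypothetical sysfs path of the FPGA's application physical function.  */
    int fd = open("/sys/bus/pci/devices/0000:00:1d.0/resource4", O_RDWR);
    if (fd < 0) return 1;

    /* Map a small window of the 128 GiB BAR4 aperture. */
    uint8_t *bar = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) return 1;

    static uint8_t buf[CHUNK];
    const uint64_t d_ns = 4000000ULL;    /* transmission duration d = 4 ms    */
    uint64_t bytes = 0, start = now_ns();

    /* Keep issuing writes until the end of the transmission window; the
     * bytes moved divided by d give the achieved PCIe bandwidth.             */
    while (now_ns() - start < d_ns) {
        memcpy(bar, buf, CHUNK);         /* translated into PCIe write bursts */
        bytes += CHUNK;
    }
    double bw = (double)bytes / ((double)d_ns * 1e-9);
    (void)bw;                            /* report, or compare against T      */

    munmap(bar, CHUNK);
    close(fd);
    return 0;
}
```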
Unless otherwise noted, we primarily perform experiments with “spot” instances in the us-east-1 (North Virginia) region in availability zone d for cost reasons, though prior work has shown that PCIe contention is present with both spot and on-demand instances, in all regions and different availability zones where F1 instances are currently supported, namely ap-southeast-2 (Sydney), eu-west-1 (Ireland), us-east-1 (North Virginia), and us-west-2 (Oregon) [63]. Although the results presented are for instances launched by a single user, it should also be noted that we have successfully created a cross-VM covert channel between instances launched by two different users.

4.3 Bandwidth vs. Accuracy Tradeoffs

Using our co-location mechanism, we are able to easily find 4 distinct f1.2xlarge instances that are all in the same NUMA node, and then measure the covert-channel accuracy for different bandwidths, i.e., different measurement parameters \(d\) and \(\delta\) . Specifically, we test \((d,\delta)\) from \((0.1 \,\mathrm{m}\mathrm{s}, 0.2 \,\mathrm{m}\mathrm{s})\) to \((9 \,\mathrm{m}\mathrm{s}, 10 \,\mathrm{m}\mathrm{s})\) , corresponding to transmission rates between \(5 \,\mathrm{k}b/\mathrm{s}\) and \(100 \,b/\mathrm{s}\). For these experiments, the receiver keeps transferring \(2 \,\mathrm{k}B\) chunks of data from the host, while the transmitter repeatedly sends \(64 \,\mathrm{k}B\) of data in each transmission period (i.e., until the end of the interval \(d\) ). These parameters are explored separately in Section 4.4 below.
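Since one bit is transferred per measurement period, the raw channel rate follows directly from \(\delta\):
\[ R = \frac{1}{\delta}, \qquad \text{e.g., } \delta =0.2 \,\mathrm{m}\mathrm{s} \Rightarrow R=5 \,\mathrm{k}b/\mathrm{s}, \quad \delta =10 \,\mathrm{m}\mathrm{s} \Rightarrow R=100 \,b/\mathrm{s}. \]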
The results of our experiments, shown in Figure 5, indicate that we can create a fast covert channel between any two FPGAs in either direction: at \(200 \,b/\mathrm{s}\) and below, the accuracy of the covert channel is 100%, with the accuracy at \(250 \,b/\mathrm{s}\) dropping to 99% for just one pair. At \(500 \,b/\mathrm{s}\) , three of the six possible pairs can communicate at 100% accuracy, while one pair can communicate with 97% accuracy at \(2 \,\mathrm{k}b/\mathrm{s}\) (and sharply falls to 70% accuracy even at \(2.5 \,\mathrm{k}b/\mathrm{s}\) —though in Section 4.5 we show that bandwidths of \(20 \,\mathrm{k}b/\mathrm{s}\) at 99% accuracy are possible). It should be noted that, as expected, the bandwidth within any given pair is symmetric, i.e., it remains the same when the roles of the transmitter and the receiver are reversed. As the VMs occupy a full NUMA node, there should not be any impact from other users’ traffic. The variable bandwidth between different pairs is therefore likely due to the PCIe topology.
Fig. 5.
Fig. 5. Bandwidth and accuracy for covert-channel transmissions between any pair of FPGAs, among the four FPGAs in the same NUMA node. Each FPGA pair is color-coded, with transmitters indicated through different markers and receivers through different line styles. For any given pair, the bandwidth is approximately the same in each direction, i.e., the bandwidth from FPGA \(X\) to FPGA \(Y\) is approximately the same as the bandwidth from FPGA \(Y\) to FPGA \(X\) . Communication is possible between any two FPGAs in the NUMA node, but the bandwidths for different pairs diverge.

4.4 Transfer Sizes

In this set of experiments, we fix \(d=4 \,\mathrm{m}\mathrm{s}\) , \(\delta =5 \,\mathrm{m}\mathrm{s}\) (i.e., a covert-channel bandwidth of \(200 \,b/\mathrm{s}\) ), and vary the transmitter and receiver transfer sizes. Figure 6 first shows the per-pair channel accuracy for different transmitter sizes. The results show that at \(4 \,\mathrm{k}B\) and above, the covert-channel accuracy is 100%, while it becomes much lower at smaller transfer sizes. This is because sending smaller chunks of data over PCIe results in lower bandwidth due to the associated PCIe overhead of each transaction. For example, in one 4 ms transmission, the transmitter completes 140301 transfers of \(1 \,B\) each, corresponding to a PCIe bandwidth of only \(1 \,B\times 140301/4 \,\mathrm{m}\mathrm{s}=33.5 \,\mathrm{M}B/\mathrm{s}\) . However, at the same time, a transmitter can complete 1890 transfers of \(4 \,\mathrm{k}B\) , for a PCIe bandwidth of \(4 \,\mathrm{k}B\times 1890/4 \,\mathrm{m}\mathrm{s}=1.8 \,\mathrm{G}B/\mathrm{s}\).
Fig. 6.
Fig. 6. Covert-channel accuracy for different transmitter transfer sizes. Each chunk transmitted over PCIe needs to be \(\ge \!\!4 \,\mathrm{k}B\) to ensure an accuracy of 100% at \(200 \,b/\mathrm{s}\) between any two FPGAs in the NUMA node.
The results of the corresponding experiments for receiver transfer sizes are shown in Figure 7. Similar to the transmitter experiments, very small transfer sizes are unsuitable for the covert channel due to the low resulting bandwidth. However, unlike in the transmitter case, large receiver transfer sizes are also problematic, as the number of transfers completed within each measurement interval is too small to be able to distinguish between external transmissions and the inherent measurement noise.
Fig. 7.
Fig. 7. Covert-channel accuracy for different receiver transfer sizes. Chunks between \(64 \,B\) and \(4 \,\mathrm{k}B\) are suitable for 100% accuracies, but sizes outside this range result in a drop in accuracy for at least one pair of FPGAs in the NUMA node.

4.5 Operating Systems

Starting with FPGA AMI version 1.10.0, Amazon has provided AMIs based on Amazon Linux 2 (AL2) [18] alongside AMIs based on CentOS [17] (both using the Xilinx-provided XOCL PCIe driver). AL2 uses a Linux kernel that has been tuned for the AWS infrastructure [15], and may therefore impact the performance of the covert channel. Since the attacker does not have control over the victim’s VM, it is necessary to explore the effect of the operating system on our communication channel, and thus experiment with both types of operating systems as receivers and transmitters. We use the co-location methodology of Section 4.1 to find different instances that are in the same NUMA node, and report the accuracy of our cross-VM covert channel for bandwidths ranging from as low as \(0.1 \,\mathrm{k}b/\mathrm{s}\) to as high as \(66.6 \,\mathrm{k}b/\mathrm{s}\) . As described in Section 3 and shown in Figure 2, each NUMA node contains four FPGAs, which we fill with 4 distinct f1.2xlarge instances, each of which can run either AL2 or CentOS. As Section 4.3 identified, the bandwidth between different FPGA pairs will depend on where they are in the PCIe topology, so to get an accurate estimate of the maximum cross-instance covert-channel bandwidth for different setups, we run experiments on three different configurations of full NUMA nodes. The first experiment contains one instance running CentOS and three instances running AL2 (Figure 8), the second contains two instances with CentOS and two with AL2 (Figure 9), while the last setup has three CentOS VMs and one AL2 VM (Figure 10). For each experiment, we collect the covert-channel bandwidths for all pairs of instances, and in both directions of communication, resulting in 12 different bandwidth vs. accuracy sets of measurements.
Fig. 8.
Fig. 8. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where three instances are running AL2 and the last one is running CentOS. Each FPGA pair is color-coded, with transmitters indicated through different markers and receivers through different line styles.
Fig. 9.
Fig. 9. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where two instances are running AL2 and the other two are running CentOS.
Fig. 10.
Fig. 10. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where only one instance is running AL2 and the remaining are running CentOS.
Figure 8 shows the covert channel bandwidths for all FPGA pairs, where one instance is running CentOS and the remaining three are running AL2. For any pair of AL2 instances, the covert-channel accuracy at \(20 \,\mathrm{k}b/\mathrm{s}\) is over 90% (in fact, reaching 99%), and for a subset of those pairs remains above 80% at even \(40 \,\mathrm{k}b/\mathrm{s}\) . However, when a CentOS instance is involved, the bandwidth drops to \(0.5 \,\mathrm{k}b/\mathrm{s}\) , for either direction of communication.
Figures 9 and 10 show that, depending on where the instances are on the PCIe topology, the bandwidth can vary. Indeed, Figure 9 shows that the bandwidth for an AL2 transmitter and a CentOS receiver can reach \(2.5 \,\mathrm{k}b/\mathrm{s}\) at 98% accuracy, but CentOS transmitters and AL2 receivers generally have bandwidths below \(0.5 \,\mathrm{k}b/\mathrm{s}\) , though in repeated individual experiments (outside of a full NUMA node), we have been able to get a channel at \(5.9 \,\mathrm{k}b/\mathrm{s}\) at 95% accuracy. The CentOS-CentOS results of Figure 10 are consistent with those of Section 4.3, with bandwidths between \(250 \,b/\mathrm{s}\) and \(1.4 \,\mathrm{k}b/\mathrm{s}\) for all but the fastest pair of instances. Table 1 summarizes these results, while Table 2 compares the achieved bandwidths to prior work in cross-FPGA communications.
Table 1. Cross-VM Covert Channel Bandwidth for Different Receiver and Transmitter Operating Systems

Transmitter      | Receiver                  | Bandwidth                           | Accuracy
CentOS           | CentOS                    | \(2.0 \,\mathrm{k}b/\mathrm{s}\)    | 97%
CentOS           | Amazon Linux 2\(^\ast\)   | \(0.3 \,\mathrm{k}b/\mathrm{s}\)    | 100%
Amazon Linux 2   | CentOS                    | \(2.5 \,\mathrm{k}b/\mathrm{s}\)    | 94%
Amazon Linux 2   | Amazon Linux 2            | \(20.0 \,\mathrm{k}b/\mathrm{s}\)   | 99%

\(^\ast\) A bandwidth of \(5.9 \,\mathrm{k}b/\mathrm{s}\) at 95% accuracy could be sustained across repeated individual experiments outside of a full NUMA node.
Table 2. Cross-FPGA Covert Channel Bandwidth Achieved by Different Works

Cloud       | Method                     | Reference | Bandwidth                        | Accuracy
TACC        | Thermal Attack             | [61]      | \(\ll \!\!0.1 \,b/\mathrm{s}\)   | 100%
\(^\ast\)   | Voltage Stressing          | [30]      | \(6.1 \,b/\mathrm{s}\)           | 99%
AWS         | PCIe Contention (CentOS)   | This work | \(2000.0 \,b/\mathrm{s}\)        | 97%
AWS         | PCIe Contention (AL2)      | This work | \(20000.0 \,b/\mathrm{s}\)       | 99%

The PCIe contention approach of our work achieves bandwidths that are several orders of magnitude faster than prior research, and is demonstrated on a commercial public cloud. \(^\ast\) Achieved only in a lab setup.

5 Cross-VM Side-Channel Leaks

In this section, we explore what kinds of information malicious adversaries can infer about computations performed by non-cooperating victim users that are co-located in the same NUMA node in different, logically isolated VMs. We first show that the PCIe activity of an off-the-shelf video-processing AMI from the AWS Marketplace leaks information about the resolution and frame-rate properties of the video being processed, allowing adversaries to infer the activity of different users (Section 5.1). We then show that it is possible to detect when a VM in the same NUMA node is being initialized (Section 5.2), and more generally monitor the PCIe bus over a long period of time (Section 5.3). We finally show that PCIe contention can be used for interference attacks, including slowing down the programming of the FPGA itself, or other data transfers between the FPGA and the host VM (Section 5.4). The attacks of this and the next section are summarized in Figure 11.
Fig. 11.
Fig. 11. Summary of the (a) passive monitoring side-channel and (b) active interference contention-based attacks presented in Sections 5 and 6. Bandwidths are not drawn to scale.

5.1 Inferring User Activity

To help users in accelerating various types of computations on F1 FPGA instances, the AWS Marketplace lists numerous VM images created and sold by independent software vendors [16]. Users can purchase instances with pre-loaded software and hardware FPGA designs for data analytics, machine learning, and other applications, and deploy them directly on the AWS Elastic Cloud Compute (EC2) platform. AWS Marketplace products are usually delivered as AMIs, each of which provides the VM setup, system environment settings, and all the required programs for the application that is being sold. AWS Marketplace instances for FPGAs naturally use PCIe to communicate between the software and the hardware of the purchased instance. In this section, we first introduce an AMI we purchased to test as the victim software and hardware design (Section 5.1.1), and then discuss the recovery of potentially private information from the victim AMI’s activity by running a co-located receiver VM that monitors the victim’s PCIe activity (Section 5.1.2).

5.1.1 Experimental Setup.

Among the different hardware accelerator solutions for cloud FPGAs, in this section, we target video processing using the DeepField AMI, which leverages FPGAs to accelerate the Video Super-Resolution (VSR) algorithm to convert low-resolution videos to high-resolution ones [22]. The DeepField AMI is based on AL2, and sets up the system environment to make use of the proprietary, pre-trained neural network models [22]. To use the AMI, the VM software first loads the AFI onto the associated FPGA using the load_afi command to set up the FPGA board on the F1 instance. The ffmpeg program, which is customized for the FPGA platform, is called to convert an input video of no more than \(1280\times 720\) in resolution to a high-resolution video with a maximum output resolution of \(3840\times 2160\) . As discussed above, the DeepField AMI handles all of the software and provides the FPGA image for the acceleration of the VSR algorithm. Users do not know how the FPGA logic operates, since it is provided as a pre-compiled AFI. However, PCIe contention allows us to reveal potentially private information from such example AMIs by running an attacker VM to measure the PCIe activity of the victim. In particular, this type of high-performance computing for image and video processing inevitably requires massive data transfers between the FPGA and the host processor through PCIe. These AMI behaviors are reflected in the PCIe bandwidth trace.
For our experiments, we first launch a group of f1.2xlarge instances running the DeepField AMI to find a co-located F1 instance pair using our PCIe contention approach of Section 4. After verifying that the attacker and the victim are co-located, we set up the attacker VM in monitoring mode, which continuously measures the PCIe bandwidth, similar to the receiver in the covert-channel setup. The monitoring program has been configured to measure bandwidth with a measurement duration of \(\delta =20 \,\mathrm{m}\mathrm{s}\) and a data transfer duration of \(d=18 \,\mathrm{m}\mathrm{s}\) .
The victim VM then runs the unmodified DeepField AMI to convert different lower-resolution videos to higher-resolution ones using the ffmpeg program. In our experiments, each run of the DeepField AMI takes approximately \(5 \,\mathrm{min}\) , and each bandwidth trace in the attacker VM lasts for \(10 \,\mathrm{min}\) , thus covering both the conversion process, as well as periods of inactivity. As discussed in Section 5.1.2, by comparing the bandwidth traces among the different experiments, we observe that we can (a) infer information about whether the victim is actively in the process of converting a video, and (b) deduce certain parameters of the videos.
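The monitoring mode is a simple logging loop around the same bandwidth measurement used by the covert-channel receiver. The sketch below is illustrative only; measure_pcie_bytes() is the same hypothetical helper as in the Section 4.1 sketch.
```c
/* Illustrative monitoring loop (not the exact experimental code): one
 * bandwidth sample is logged per measurement period so that traces such as
 * Figure 12 can be reconstructed offline. measure_pcie_bytes() is the
 * hypothetical helper from the Section 4.1 sketch (a timed burst of BAR
 * accesses). */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

extern uint64_t measure_pcie_bytes(uint64_t d_ns);

int main(void)
{
    const uint64_t d_ns = 18000000ULL;   /* data transfer duration d = 18 ms  */
    const uint64_t delta_us = 20000ULL;  /* measurement duration delta = 20 ms */

    for (;;) {
        uint64_t bytes = measure_pcie_bytes(d_ns);
        double bw_MBps = (double)bytes / ((double)d_ns * 1e-9) / 1e6;
        printf("%ld,%.1f\n", (long)time(NULL), bw_MBps);  /* timestamp, MB/s  */
        fflush(stdout);
        usleep(delta_us - d_ns / 1000);  /* idle for the rest of the period   */
    }
}
```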

5.1.2 Leaking Private Information from Marketplace AMIs.

We now show that private information regarding the activities of co-located instances can be revealed through the PCIe bandwidth traces. Figure 12 shows the PCIe bandwidth measured by the attacker while the victim is running the DeepField AMI on an f1.2xlarge instance. We test different input video files, with three different resolutions (360p, 480p, and 720p) and two frame rates of 15 and 30 frames-per-second (FPS). All videos have a 16:9 aspect ratio, and, except for the resolutions and frame rates, the contents of the input video files are otherwise identical. The output video produced for each conversion always has a resolution of \(3840\times 2160\) , but maintains the same frame rate as the original input. The beginning and ending of the VSR conversion on the FPGA can be clearly seen in Figure 12, where vertical red lines delineating the start and end of the process have been added for clarity. We observe that the PCIe bandwidth drops during the conversion, and that the runtime is reduced as the input resolution or the input frame rate decreases. For example, the runtime for a 720p, 30 FPS video (Figure 12(f)) is approximately twice as long as for a 15 FPS one (Figure 12(c)).
Fig. 12.
Fig. 12. PCIe bandwidth traces collected by the attacker while the victim runs the DeepField AMI to perform VSR conversions with input videos of different resolutions and frame rates. Within each sub-figure, the red lines label the start and end of the VSR conversion on the FPGA.

5.2 Detecting Instance Initialization

In the experiments of this work, we have thus far only focused on covert communication and side-channel information leakage between VM instances that have already been initialized. By contrast, in this section, we show for the first time that the instance initialization process can also be detected by monitoring the bandwidth of the PCIe bus. Indeed, on AWS, there is a time lag between when a user requests that an instance with a target AMI be launched and when it is provisioned, initialized, and ready for the user to connect to it over SSH. This process can take multiple minutes, and, as we show in this work, causes significant PCIe traffic that is measurable by co-located adversaries.
For our experiments, we first create an f1.2xlarge instance (named INST-A) and start the PCIe bandwidth monitoring program on it. We then launch five f1.2xlarge instances in sequence, named INST-B-i, for \(i \in \lbrace 1,2,3,4,5\rbrace\) . For each INST-B-i, we attempt to complete a handshake with INST-A at a pre-determined time, and then terminate the instance before launching the next one. As the monitoring program on INST-A is running throughout the experiments (including when no INST-B is running), it is able to capture the initialization, handshake, and termination of any potentially co-located instances.
Figure 13 plots the PCIe bandwidth of the monitoring instance INST-A, along with three reference lines for each of the five instance initializations:
Fig. 13.
Fig. 13. Detecting the VM initialization process for co-located f1.2xlarge instances by monitoring the PCIe traffic. In this experiment, five new instances are created in sequence, of which the last three happen to be co-located with the monitoring instance.
“Create VM” denotes the request for initializing a new VM.
“Finish Init” means that the VM has been initialized, which we define as being able to SSH into the VM instance.
“Terminate VM” indicates the request for shutting down the VM.
For each VM, we load the PCIe transmitter AFI and software and attempt a handshake between the “Finish Init” and “Terminate VM” steps. The handshake results suggest that the last three instances are co-located with INST-A but the first two are not. Incidentally, the last three instances also cause large PCIe bandwidth drops (from \(1600 \,\mathrm{M}B/\mathrm{s}\) to \(600 \,\mathrm{M}B/\mathrm{s}\) ) during their initialization process, as shown in Figure 13. The PCIe bandwidth stays stable for the first two instances, as they are not co-located with INST-A. Note that this bandwidth drop occurs before we can SSH into the instances, and therefore reflects the initialization process itself. Moreover, it is worth noting that the termination step is not reflected in the PCIe trace, indicating a potentially lazy termination process that does not require heavy data transfers. The ability to detect when other users are being allocated to the same NUMA node not only helps with the covert-channel handshaking process of Section 4.1, but can also alert non-adversarial users to potential interference from other users so that they can tweak their applications to expect slower transfers.

5.3 Long-Term PCIe Monitoring

In this section, we present the results of measuring the PCIe bandwidth for two on-demand f1.2xlarge instances in the us-east-1 region (availability zone e). These experiments took place between 5pm on April 25, 2021 and 2am on April 26 (Eastern Time, as us-east-1 is located in North Virginia). For both sets of four-hour measurements, the first f1.2xlarge instance (Figure 14) is measuring with a transmission duration of \(d=4 \,\mathrm{m}\mathrm{s}\) and a measurement duration of \(\delta =5 \,\mathrm{m}\mathrm{s}\) , while the second instance (Figure 15) has \(d=18 \,\mathrm{m}\mathrm{s}\) and \(\delta =20 \,\mathrm{m}\mathrm{s}\) . For the first instance, the PCIe link remains mostly idle during the evening (Figure 14(a)), but experiences contention in the first night hour (Figure 14(b)). The second instance instead appears to be co-located with other FPGAs that make heavier use of their PCIe bandwidth. During the evening measurements (Figure 15(a)), the PCIe bandwidth drops momentarily below \(1200 \,\mathrm{M}B/\mathrm{s}\) during the third hour and below \(800 \,\mathrm{M}B/\mathrm{s}\) during the fourth hour. These large drops are likely due to co-located VMs being initialized rather than normal user traffic, as described in Section 5.2. The instance also experiences sustained contention in the third hour of the night measurement (Figure 15(b)). Although the bandwidth in the two instances is comparable, the 5 ms measurements are noisier compared to the 20 ms ones. Finally, note that, generally, our covert-channel code results in bandwidth drops of over \(800 \,\mathrm{M}B/\mathrm{s}\) , while the activity of other users tends to cause drops of less than \(50 \,\mathrm{M}B/\mathrm{s}\) , suggesting that noise from external traffic has minimal impact on our channel.
Fig. 14.
Fig. 14. Long-term PCIe-based data center monitoring between the evening of April 25 and the early morning of April 26, with \(d=4 \,\mathrm{m}\mathrm{s}\) and \(\delta =5 \,\mathrm{m}\mathrm{s}\) on an f1.2xlarge on-demand instance.
Fig. 15.
Fig. 15. Long-term PCIe-based data center monitoring on a different f1.2xlarge on-demand instance with \(d=18 \,\mathrm{m}\mathrm{s}\) and \(\delta =20 \,\mathrm{m}\mathrm{s}\) .

5.4 Interference Attacks

The PCIe contention mechanism we have uncovered can also be used to degrade the performance of co-located applications by other users. Indeed, as we have shown in a prior work [63], the bandwidth can fall from \(3 \,\mathrm{G}B/\mathrm{s}\) to under \(1 \,\mathrm{G}B/\mathrm{s}\) using just one PCIe stressor (transmitter), and to below \(200 \,\mathrm{M}B/\mathrm{s}\) when using two stressors.
To exemplify how the reduced PCIe bandwidth can affect user applications, we again find a full NUMA node with four co-located VMs, but only use three of them. Specifically, the first VM is running the DeepField AMI VSR algorithm [22], and represents the victim user. The second VM is monitoring the PCIe bandwidth (similar to the experiments of Section 5.1), while the third acts as a PCIe stressor. The fourth one is unused and left idle, to avoid unintended interference. To further minimize any other external effects, the VSR computation in Figure 16 is repeated five times in sequence. As Figure 16 shows, the PCIe bandwidth measured by the monitoring instance drops from over \(1950 \,\mathrm{M}B/\mathrm{s}\) to under \(650 \,\mathrm{M}B/\mathrm{s}\) , and the conversion time in the victim instance increases by 33%. In addition to slowing down the victim application, when using a stressor, the attacker can extract even more fine-grained information about the victim. Indeed, as Figure 16(b) shows, the boundary between the five repetitions becomes clear, aiding the AMI fingerprinting attacks discussed in Section 5.1.
Fig. 16.
Fig. 16. PCIe bandwidth traces collected by the monitoring instance while the victim instance runs the DeepField AMI to perform a VSR conversion of the same video five consecutive times, (a) without and (b) with the third instance acting as a PCIe stressor. Within each sub-figure, the red lines label the start and end of the VSR conversion on the FPGA.
Furthermore, one particular, and perhaps unexpected, consequence of the reduced PCIe bandwidth is a more time-consuming programming process, whose duration can, in some cases, more than triple. To investigate this effect, we measure the FPGA programming time in one of the instances (INST-A) under different conditions, including:
(1)
Whether a PCIe bandwidth-hogging application is running on a second instance, INST-B.
(2)
Whether just the CL or both the CL and FPGA shell (SH) are reloaded with fpga-load-local-image (using the -F flag).
(3)
The size of the loaded AFI in terms of the logic resources used (see Table 3). Because AWS uses partial reconfiguration [9], “the size of a partial bitstream is directly proportional to the size of the region it is reconfiguring” [68], with larger images therefore requiring more data transfers from the host to the FPGA device.
Table 3. Resources Used by the Three AFIs Tested

AFI     | Lookup Tables (LUTs) | Registers | CARRY8 Chains | Multiplexers
Small   | 6728                 | 8369      | 75            | 72
Medium  | 139020               | 220061    | 2529          | 4741
Large   | 310462               | 321713    | 7316          | 28597
The results of our experiments are summarized in Figure 17, where three AFIs of different sizes are loaded onto INST-A with/without reloading the shell, and with/without PCIe contention on INST-B. As Figure 17(a) shows, PCIe contention slows down the FPGA programming of all AFIs, with the effect being more prominent for larger AFIs, where programming slows down from \(\approx \!\!7 \,\mathrm{s}\) to \(\approx \!\!12 \,\mathrm{s}\) . When the shell is also reloaded (Figure 17(b)), the same pattern holds, but the effects are even more pronounced: even reloading the small AFI slows down from \(\approx \!\!\!7 \,\mathrm{s}\) to over \(20 \,\mathrm{s}\) , while the large AFI takes over \(30 \,\mathrm{s}\) compared to \(\approx \!\!\!9 \,\mathrm{s}\) without PCIe stressing. The effect is likely not just due to the fact that the AFI needs to be transferred to the FPGA over PCIe using the fpga-load-local-image command, but in part also because the AFIs need to be fetched over the network from the cloud provider’s internal servers. As we show in the next section, network bandwidth is also impacted by the FPGA’s PCIe activity.
Fig. 17.
Fig. 17. The FPGA programming time can be slowed down by heavy PCIe traffic from co-located instances. In (a), only the user’s CL is reconfigured, while in (b), both the FPGA shell (SH) and the CL are reloaded onto the FPGA. Three AFIs with different numbers of logic resources are used.

6 Other Cross-instance Effects

In this section, we investigate how other hardware components present in F1 servers, namely NICs (Section 6.1), NVMe SSD storage (Section 6.2), and the DRAM modules directly attached to the FPGAs (Section 6.3), leak information that crosses the VM instance boundary and can be used, for example, to interfere with other users or to determine that different VM instances belong to the same server. The NIC and SSD contention-based attacks are summarized in Figure 11(b).

6.1 Network-Based Contention

NIC cards provide connectivity between a VM and the Internet through external devices such as switches and routers. NIC cards are typically also connected to the host over PCIe, and therefore share the bandwidth with the FPGAs. To test whether the FPGA PCIe traffic has any effect on the network bandwidth, we rent three co-located f1.2xlarge instances and test each instance as the PCIe bandwidth-hogging stressor, and use the remaining two instances in turn to measure the network bandwidth using the speedtest-cli program [49] (a total of six combinations).
The results for all six pairs of instances are identical: when the PCIe stressor is not running, speedtest-cli --bytes reports a download bandwidth of approximately \(233 \,\mathrm{M}B/\mathrm{s}\) and an upload bandwidth of \(157 \,\mathrm{M}B/\mathrm{s}\) . However, when the stressor is running on a co-located instance, the download bandwidth drops to \(100 \,\mathrm{M}B/\mathrm{s}\) , while the upload bandwidth is reduced to \(75 \,\mathrm{M}B/\mathrm{s}\) . This means that the PCIe stressor can demonstrably halve the network bandwidth of co-located instances as a result of the NIC sharing the PCIe bus with the FPGAs, as shown in Figure 2. It is worth noting that our experiments did not reveal any influences in the other direction, i.e., the PCIe and network bandwidth of co-located instances remained the same when running a network bandwidth stressor, likely because such a network stressor does not saturate the PCIe bus.
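As an illustration, the following Python sketch measures the download and upload bandwidth in the same way, by invoking speedtest-cli with the --bytes flag used above; the output-parsing regular expressions assume the tool’s usual human-readable format and may need adjusting for other versions.

```python
import re
import subprocess

def network_bandwidth():
    """Run speedtest-cli in bytes mode and return (download, upload).

    The regular expressions assume output of the form "Download: 233.00 Mbyte/s";
    other versions of the tool may format the results differently.
    """
    out = subprocess.run(["speedtest-cli", "--bytes"],
                         capture_output=True, text=True, check=True).stdout
    download = float(re.search(r"Download:\s+([\d.]+)", out).group(1))
    upload = float(re.search(r"Upload:\s+([\d.]+)", out).group(1))
    return download, upload

if __name__ == "__main__":
    # Compare the reported values with and without the FPGA-based PCIe
    # stressor running on a co-located instance.
    down, up = network_bandwidth()
    print(f"download: {down:.1f} MB/s, upload: {up:.1f} MB/s")
```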

6.2 SSD Contention

Another shared resource that can lead to contention is the SSD storage that F1 instances can access. The public specification of F1 instances notes that f1.2xlarge instances have access to \(470 \,\mathrm{G}B\) of NVMe SSD storage, f1.4xlarge have \(940 \,\mathrm{G}B\) , and f1.16xlarge have \(4\times 940 \,\mathrm{G}B\) [14]. This suggests that F1 servers have four separate \(940 \,\mathrm{G}B\) SSD drives, each of which can be shared between two f1.2xlarge instances. In this section, we confirm our hypothesis that one SSD drive can be shared between multiple instances, and explain how this fact can be exploited to reverse-engineer the PCIe topology and co-locate VM instances. The SSD contention we uncover can also be used for a slow, but reliable, covert channel, or to degrade the performance of other users, akin to the interference attack of Section 5.4. We also demonstrate the existence of FPGA-to-SSD contention, which is likely the result of the SSD going through the same PCIe switch, as shown in Figure 2. This topology remains consistent with the one publicly described for GPU-based P4d instances [7], which appear to be architecturally similar to F1 instances.

6.2.1 SSD-to-SSD Contention.

We test for SSD contention by measuring the SSD read bandwidth with the hdparm command and its -t option, which performs disk reads without any data caching [47]. Measurements are averaged over repeated reads of \(2 \,\mathrm{M}B\) chunks from the disk in a period of 3 seconds. When the server is otherwise idle, hdparm reports the SSD read bandwidth to be over \(800 \,\mathrm{M}B/\mathrm{s}\). However, when the other f1.2xlarge instance that shares the same SSD stresses it using the stress command [67] with the --io 4 --hdd 4 parameters, the bandwidth drops below \(50 \,\mathrm{M}B/\mathrm{s}\). With these parameters, stress spawns 4 threads calling sync (to stress the read buffers) and another 4 threads calling write and unlink (to stress write performance). The total number of threads is kept to 8 to match the number of vCPUs allocated to an f1.2xlarge instance, while all FPGAs remain idle during these experiments.
This non-uniform SSD behavior can be used for a robust covert channel with a bandwidth of \(0.125 \,b/\mathrm{s}\) with 100% accuracy. Specifically, for a transmission of bit 1, stress is called for 7 seconds, while for a transmission of bit 0, the transmitter remains idle. The receiver uses hdparm to measure its SSD’s bandwidth, and can distinguish between contention and no-contention of the SSD resources (i.e., bits 1 and 0 respectively) using a simple threshold. The period of 8 seconds per bit also accounts for 1 second of inactivity in every transmission, allowing the disk usage to return to normal.
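The following Python sketch outlines both ends of this SSD-based covert channel under the parameters described above (7 seconds of activity or idleness per bit plus 1 second of guard time); the NVMe device path and the decision threshold are illustrative assumptions rather than values prescribed by our setup.

```python
import re
import subprocess
import time

BIT_PERIOD = 8           # seconds per bit: 7 s of (in)activity + 1 s of guard time
THRESHOLD = 400.0        # MB/s; sits between the ~800 MB/s idle and <50 MB/s contended readings
DEVICE = "/dev/nvme1n1"  # hypothetical NVMe device path of the instance's SSD

def transmit(bits):
    """Transmitter: stress the shared SSD for a 1 bit, stay idle for a 0 bit."""
    for bit in bits:
        if bit:
            subprocess.run(["stress", "--io", "4", "--hdd", "4", "--timeout", "7"],
                           capture_output=True)
        else:
            time.sleep(7)
        time.sleep(1)  # guard interval so disk usage returns to normal

def receive(n_bits):
    """Receiver: sample the SSD read bandwidth with hdparm once per bit period."""
    bits = []
    for _ in range(n_bits):
        start = time.monotonic()
        out = subprocess.run(["sudo", "hdparm", "-t", DEVICE],
                             capture_output=True, text=True).stdout
        mbps = float(re.search(r"=\s*([\d.]+)\s*MB/sec", out).group(1))
        bits.append(1 if mbps < THRESHOLD else 0)
        time.sleep(max(0.0, BIT_PERIOD - (time.monotonic() - start)))
    return bits
```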
The same mechanism can be exploited to degrade the performance of other tenants. It can further be used to co-locate instances at an even finer granularity than was previously possible. To accomplish this, we rent several f1.2xlarge instances until we find four that form a full NUMA node using the PCIe-based co-location approach of Section 4. We then stress the SSD in one of the four instances, and measure the SSD performance in the remaining three. We discover two pairs of instances with mutual SSD contention, which supports our hypothesis and is also consistent with the PCIe topology of other instance types [7].
The fact that SSD contention only exists between two f1.2xlarge instances can be beneficial for adversaries: when the covert-channel receiver and the transmitter are scheduled on two instances that share an SSD, they can communicate without interference from other tenants in the same NUMA node.4

6.2.2 FPGA-to-SSD Contention.

To formalize the above observations, we use the methodology described in Section 4 to find four co-located f1.2xlarge instances in the same NUMA node. Then, for each pair of instances, we repeatedly run hdparm in the “receiver” instance for a period of 3 minutes, while in the “transmitter” instance we (a) run stress for \(30 \,\mathrm{s}\) starting at the one-minute mark, and (b) use our FPGA-based covert-channel code as a stressor, constantly transmitting the bit 1 during each measurement period, for another \(30 \,\mathrm{s}\) starting at the two-minute mark.
The results of these experiments are summarized in Figure 18. During idle periods, the SSD bandwidth is approximately \(800 \,\mathrm{M}B/\mathrm{s}\,{\rm to}\, 900 \,\mathrm{M}B/\mathrm{s}\) . However, for the two instances with SSD contention, i.e., pairs \((A,D)\) and \((B,C)\) , the bandwidth drops to as low as \(7 \,\mathrm{M}B/\mathrm{s}\) while the stress command is running (the bandwidth for the other instance pairs remains unaffected). When the FPGA-based PCIe stressor is enabled, the SSD bandwidth reported by hdparm is reduced in a measurable way to approximately \(700 \,\mathrm{M}B/\mathrm{s}\) .
Fig. 18. NVMe SSD bandwidth for all transmitter and receiver pairs in a NUMA node, as measured by hdparm. Running stress between seconds 60 to 90 causes a bandwidth drop in exactly one other instance in the NUMA node, while running the FPGA-based PCIe stressor (between seconds 120 and 150) reduces the SSD bandwidth in all cases.
We further test for the opposite effect, i.e., whether stressing the SSD can cause a measurable difference in the FPGA-based PCIe performance. We again stress the SSD between \(60\,\mathrm{s}\) and \(90\,\mathrm{s}\), and stress the FPGA between \(120\,\mathrm{s}\) and \(150\,\mathrm{s}\). As the results of Figure 19 show, the PCIe bandwidth drops from almost \(1.8 \,\mathrm{G}B/\mathrm{s}\) to approximately \(500 \,\mathrm{M}B/\mathrm{s}\) to \(1000 \,\mathrm{M}B/\mathrm{s}\) when the FPGA stressor is enabled, but there is no significant difference in performance when the SSD-based stressor is turned on. Similar to the experiments of Section 6.1, this is likely because the FPGA-based stressor can more effectively saturate the PCIe link, while the SSD-based stressor is limited by the performance of the drive itself, whose idle bandwidth (\(800 \,\mathrm{M}B/\mathrm{s}\)) is much lower than that of the FPGA (\(1.8 \,\mathrm{G}B/\mathrm{s}\)). In summary, using the FPGA as a PCIe stressor can cause the SSD bandwidth to drop, but the converse is not true, since there is no observable influence on the FPGA PCIe bandwidth as a result of SSD activity.
Fig. 19. FPGA PCIe bandwidth for all transmitter and receiver pairs in a NUMA node, as measured by our covert-channel receiver. Running stress between seconds 60 to 90 does not cause a bandwidth drop, but running the FPGA-based PCIe stressor (between seconds 120 and 150) reduces the bandwidth in all cases.

6.3 DRAM-Based Thermal Monitoring

DRAM decay is known to depend on the temperature of the DRAM chip and its environment [70, 71]. Since the FPGAs in cloud servers have direct access to the on-board DRAM, these DRAM modules can be used as sensors for detecting and estimating the temperature around the FPGA boards, supplementing the PCIe-traffic-based measurements.
Figure 20 summarizes how the DRAM decay of on-board chips can be used to monitor thermal changes in the data center. When a DRAM module is being initialized with some data, the DRAM cells will become charged to store the values, with true cells storing logical 1s as charged capacitors, and anti-cells storing them as depleted capacitors. Typically, true and anti-cells are paired, so initializing the DRAM to all ones will ensure only half of the DRAM cells will be charged, even if the actual location of true and anti-cells is not known.
Fig. 20. By alternating between AFIs that instantiate DRAM controllers or leave them unconnected, the decay rate of DRAM cells can be measured as a proxy for environmental temperature monitors [62].
After the data has been written to the DRAM and the cells have been charged, the DRAM refresh is disabled. Disabling DRAM refresh in the server itself is not possible, as the physical hardware on the server is controlled by the hypervisor, not the users. However, the FPGA boards have their own DRAMs. By programming the FPGAs with AFIs that do and do not have DRAM controllers, disabling the DRAM refresh can be emulated, allowing the DRAM cells to decay [62]. Eventually, some of the cells lose enough charge to “flip” their value (for example, data written as 1 is read back as 0 for true cells, since the charge has dissipated).
The DRAM contents can then be read back after a fixed time \(T_{decay}\), called the decay time. The number of cells that flip during this time depends on the temperature of the DRAM and its environment [71], and can therefore serve as a coarse-grained, DRAM-based temperature sensor for F1 instances.
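A minimal sketch of the flip-counting step is shown below, assuming the initialized pattern and the contents read back after \(T_{decay}\) are available as byte buffers (e.g., as produced by the tooling of [62]); the region size in the commented example is arbitrary, and read_dram_after_decay is a hypothetical helper.

```python
import numpy as np

def count_flips(initial, decayed):
    """Count the bit flips between the initialized DRAM contents and the
    contents read back after T_decay (with refresh disabled in between)."""
    a = np.frombuffer(initial, dtype=np.uint8)
    b = np.frombuffer(decayed, dtype=np.uint8)
    # XOR marks the differing bits; unpackbits lets us sum them directly.
    return int(np.unpackbits(a ^ b).sum())

# Example with an arbitrary 16 MiB region initialized to all ones; a higher
# flip count after the same T_decay indicates a warmer DRAM module and, by
# extension, a warmer environment around the FPGA board.
# initial = b"\xff" * (16 * 1024 * 1024)
# decayed = read_dram_after_decay()  # hypothetical helper built on the tooling of [62]
# print(count_flips(initial, decayed))
```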
Prior work [63] and this article have so far focused on information leaks due to shared resources within a NUMA node, but have not attempted to co-locate instances that are in the same physical server yet belong to different NUMA nodes. In this section, we propose a methodology for doing so that uses the boards’ thermal signatures, which are obtained from the decay rates of each FPGA’s DRAM modules. To collect these signatures, we use the method and code provided by Tian et al. [62], alternating between bitstreams that instantiate DRAM controllers and ones that leave them unconnected, in order to initialize the memory and then disable its refresh. When two instances are in the same server, the temperatures of all 8 FPGAs in an f1.16xlarge instance (and, by extension, their DRAM thermal signatures) are highly correlated. However, when the instances come from different servers, the decay rates differ, and thus contain distinguishable patterns that can be used to tell the two instances apart. This insight can be used to find FPGA instances that are co-located in the same server, even if they span different NUMA nodes.

6.3.1 Setup and Evaluation.

Our method for co-locating instances within a server has two aspects: first, we show that we can successfully identify two FPGA boards as being in the same server with high probability using their DRAM decay rates, and second, we show that by using PCIe-based co-location we can build the full profile of a server and identify all eight of its FPGA boards, even if they are in different NUMA nodes. More specifically, we use the open-source software by Tian et al. [62] to collect DRAM decay measurements for several FPGAs over a long period of time, and then determine which FPGAs’ DRAM decay patterns are the “closest”.
To validate our approach, we rent three f1.16xlarge instances (a total of 24 FPGAs) for a period of 24 hours, and measure how “close” each pair of FPGA traces is by calculating the total distance between their data points over the entire measurement period for three different metrics. The first metric compares the raw number of bit flips from the DRAM decay measurement \(c^i_{\rm raw}\) directly. The second approach normalizes the data to fit in the \([-1,1]\) range, i.e., \(c^i_{\rm norm}=(2c^i_{\rm raw}-m-M)/(M-m)\), where \(m=\min_i c^i_{\rm raw}\) and \(M=\max_i c^i_{\rm raw}\). In Figure 21, we show an alternative metric, which takes the difference between successive raw measurements, i.e., \(c^i_{\rm diff}=c^i_{\rm raw}-c^{i-1}_{\rm raw}\). Note that if FPGA A is the closest to FPGA B using these metrics, then B is not necessarily the closest to A. However, if FPGA A is closest to B and B is closest to C, then A, B, and C are all in the same server.
Fig. 21. DRAM decay traces from three f1.16xlarge instances (24 FPGAs in total) for a period of 24 hours, using the differences between successive measurements \(c^i_{\rm diff}\) as the comparison metric, which results in the highest co-location accuracy of 96%. Within each server, measurements from slots in the same NUMA node have been drawn in the same style.
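The sketch below expresses the three comparison metrics and the “closest FPGA” pairing in Python; the use of an L1 (sum of absolute differences) distance is one plausible interpretation of the total distance between traces, and the function names are illustrative.

```python
import numpy as np

def raw(c_raw):
    """First metric: compare the raw flip counts directly."""
    return np.asarray(c_raw)

def normalize(c_raw):
    """Second metric: c_norm = (2*c_raw - m - M) / (M - m), mapping a trace into [-1, 1]."""
    c = np.asarray(c_raw)
    m, M = c.min(), c.max()
    return (2 * c - m - M) / (M - m)

def diff(c_raw):
    """Third metric: differences between successive raw measurements."""
    return np.diff(np.asarray(c_raw))

def closest_fpga(traces, metric=diff):
    """For each FPGA, return the other FPGA whose transformed decay trace has
    the smallest total (L1) distance over the measurement period."""
    transformed = {name: metric(trace) for name, trace in traces.items()}
    closest = {}
    for a, ta in transformed.items():
        distances = {b: np.abs(ta - tb).sum() for b, tb in transformed.items() if b != a}
        closest[a] = min(distances, key=distances.get)
    return closest
```

Note that, as in the text, the resulting “closest” relation is not symmetric, which is why chains of closest FPGAs are grouped together.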
The raw data metric has an accuracy of 75%, the normalized metric is 71% accurate, while the difference metric succeeds in correctly pairing all FPGAs except for one, for an accuracy of 96%. Shorter measurement periods still result in high accuracies. For example, using the DRAM data from the first 12 hours results in only one additional FPGA mis-identification, for an accuracy of 92%. We plot the classification accuracy for the three metrics as a function of time in Figure 22.
Fig. 22. Accuracy of classifying individual FPGAs as belonging to the right server as a function of measurement time using the three different proposed metrics.
In the experiments of Figure 21, the \(c_{\rm diff}\) metric places slots 0–4 of server A together (along with, mistakenly, slot 0 of server B), slots 5–7 of server A as a second group, slots 1–7 of server B as one server, and slots 0–3 and 4–7 of server C as the two final groups. Consequently, our method successfully identifies the six NUMA nodes without making use of PCIe contention at all.
However, by using insights about the NUMA nodes that can be extracted through our PCIe-based experiments, the accuracy and reliability of this method can be further increased. For example, slot 0 of server B could already be placed in the same NUMA node as slots 1–3 using PCIe-based co-location. Leveraging the PCIe-based co-location method, if the “closest” FPGA is known to be in the same NUMA node due to PCIe contention, and the second-closest FPGA (not in the same NUMA node according to PCIe contention) is only farther by at most 1% compared to the closest FPGA, then this second-closest FPGA can be identified as belonging to the second NUMA node of the same server. In the experiment of Figure 21, this approach successfully groups all FPGAs in the three tested servers without errors.
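One way to express this combined heuristic is sketched below; the data structures (a distance map derived from the DRAM traces and a set of NUMA-node neighbors derived from PCIe contention) are assumptions about how the two sources of information might be represented.

```python
def same_server_candidates(distances, same_numa, tolerance=0.01):
    """Given `distances` (DRAM-trace distance from one FPGA to every other
    FPGA) and `same_numa` (the FPGAs known via PCIe contention to share its
    NUMA node), return the FPGAs attributed to the same physical server."""
    ranked = sorted(distances, key=distances.get)
    first, second = ranked[0], ranked[1]
    same_server = set()
    if first in same_numa:
        same_server.add(first)
        # A second-closest FPGA from another NUMA node that is within 1% of
        # the closest distance is placed in the server's other NUMA node.
        if second not in same_numa and \
           distances[second] <= (1 + tolerance) * distances[first]:
            same_server.add(second)
    return same_server
```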

7 Conclusion

This article introduced a novel, fast covert-channel attack between separate users in a public, FPGA-accelerated cloud computing setting. It characterized how contention of the PCIe bus can be used to create a robust communication mechanism, even among users of different operating systems, with bandwidths reaching \(20 \,\mathrm{k}b/\mathrm{s}\) with 99% accuracy. In addition to making use of contention of the PCIe bus for covert channels, this article demonstrated that contention can be used to monitor or disrupt the activities of other users, including inferring information about their applications, or slowing them down. This work further identified alternative co-location mechanisms, which make use of network cards, SSDs, or even the DRAM modules attached to the FPGA boards, allowing adversaries to co-locate FPGAs in the same server, even if they are on separate NUMA nodes.
More generally, this work demonstrated that malicious adversaries can use PCIe monitoring to observe data center server activity, breaking the separation of privilege that isolated VM instances are supposed to provide. With more types of accelerators becoming available on the cloud, including FPGAs, GPUs, and TPUs, PCIe-based threats are bound to become a key aspect of cross-user attacks. Overall, our insights showed that low-level, direct access to PCIe, NIC, SSD, and DRAM hardware creates new attack vectors that need to be considered by both users and cloud providers alike when deciding how to trade off performance, cost, and security for their designs: even if the endpoints of computations (e.g., CPUs and FPGAs) are assumed to be secure, the shared nature of cloud infrastructures poses new challenges that need to be addressed.

Footnotes

1
This article extends the work accepted at HOST 2021 [32] by (a) measuring and identifying differences in the covert-channel bandwidth across different operating systems, (b) detecting when co-located VM instances with FPGAs are initialized, (c) showing that malicious adversaries can use PCIe contention for slowing down the communication between the host and the FPGA, leading to slower FPGA programming times and applications, and (d) introducing a new method of instance co-location based on network bandwidth contention. Our new findings also allow us to update the deduced PCIe topology of F1 server architectures used by AWS.
2
Section 4.5 shows that different setups can result in even higher bandwidths exceeding \(20 \,\mathrm{k}b/\mathrm{s}\) .
3
A maximum transfer size of \(1 \,\mathrm{M}B\) was chosen to ensure that multiple transfers were possible within each transfer interval without ever interfering with the next measurement interval.
4
Assuming that slots within a server are assigned randomly, the probability of getting instances with shared SSDs given that they are already co-located in the same NUMA node is 33%: out of the three remaining slots in the same NUMA node, exactly one slot can be in an instance that shares the SSD.

References

[1]
Andreas Agne, Hendrik Hangmann, Markus Happe, Marco Platzner, and Christian Plessl. 2014. Seven recipes for setting your FPGA on fire—A cookbook on heat generators. Microprocessors and Microsystems 38, 8(2014), 911–919.
[2]
Md Mahbub Alam, Shahin Tajik, Fatemeh Ganji, Mark Tehranipoor, and Domenic Forte. 2019. RAM-Jam: Remote temperature and voltage fault attack on FPGAs using memory collisions. In Proceedings of the Workshop on Fault Diagnosis and Tolerance in Cryptography.
[3]
Alibaba Cloud. 2022. Instance Families. Retrieved March 20, 2022 from https://www.alibabacloud.com/help/doc-detail/25378.html.
[4]
Amazon Web Services. 2016. Developer Preview—EC2 Instances (F1) with Programmable Hardware. Retrieved March 20, 2022 from https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-f1-with-programmable-hardware/.
[5]
Amazon Web Services. 2018. The Agility of F1: Accelerate Your Applications with Custom Compute Power. Retrieved March 20, 2022 from https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf.
[6]
Amazon Web Services. 2019. F1 FPGA Application Note: How to Use Write Combining to Improve PCIe Bus Performance. Retrieved March 20, 2022 from https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-Write-Combining.
[7]
Amazon Web Services. 2020. Amazon EC2 P4d Instances Deep Dive. Retrieved March 20, 2022 from https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/.
[8]
Amazon Web Services. 2020. Official repository of the AWS EC2 FPGA Hardware and Software Development Kit. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/v1.4.15.
[9]
Amazon Web Services. 2021. AWS FPGA - Frequently Asked Questions. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/blob/master/FAQs.md.
[10]
Amazon Web Services. 2021. AWS Shell Interface Specification. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md.
[11]
Amazon Web Services. 2021. CL_DRAM_DMA Custom Logic Example. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma.
[12]
Amazon Web Services. 2021. F1 FPGA Application Note: How to Use the PCIe Peer-2-Peer Version 1.0. Retrieved March 20, 2022 from https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-Peer2Peer.
[13]
Amazon Web Services. 2021. Hello World CL Example. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_hello_world.
[14]
Amazon Web Services. 2022. Amazon EC2 Instance Types. Retrieved March 20, 2022 from https://aws.amazon.com/ec2/instance-types/.
[15]
Amazon Web Services. 2022. Amazon Linux 2 FAQs. Retrieved March 20, 2022 from https://aws.amazon.com/amazon-linux-2/faqs/.
[16]
Amazon Web Services. 2022. AWS Marketplace. Retrieved March 20, 2022 from https://aws.amazon.com/marketplace.
[17]
Amazon Web Services. 2022. FPGA Developer AMI. Retrieved March 20, 2022 from https://aws.amazon.com/marketplace/pp/prodview-gimv3gqbpe57k.
[18]
Amazon Web Services. 2022. FPGA Developer AMI (Amazon Linux 2). Retrieved March 20, 2022 from https://aws.amazon.com/marketplace/pp/prodview-iehshpgi7hcjg.
[19]
Abdulazim Amouri, Florent Bruguier, Saman Kiamehr, Pascal Benoit, Lionel Torres, and Mehdi Tahoori. 2014. Aging effects in FPGAs: An experimental analysis. In Proceedings of the International Conference on Field Programmable Logic and Applications.
[20]
Baidu Cloud. 2022. FPGA Cloud Compute. Retrieved March 20, 2022 from https://cloud.baidu.com/product/fpga.html.
[21]
Gavin Baker and Chris Lupo. 2017. TARUC: A topology-aware resource usability and contention benchmark. In Proceedings of the ACM/SPEC International Conference on Performance Engineering.
[22]
BLUEDOT. 2021. DeepField-SR Video Super Resolution Hardware Accelerator. Retrieved March 20, 2022 from https://www.xilinx.com/products/acceleration-solutions/deepField-sr.html.
[23]
Eduardo Boemo and Sergio López-Buedo. 1997. Thermal monitoring on FPGAs using ring-oscillators. In Proceedings of the International Workshop on Field-Programmable Logic and Applications.
[24]
Andrew Boutros, Matthew Hall, Nicolas Papernot, and Vaughn Betz. 2020. Neighbors from hell: Voltage attacks against deep learning accelerators on multi-tenant FPGAs. In Proceedings of the International Conference on Field-Programmable Technology.
[25]
Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the Workshop on General-Purpose Processing on Graphics Processing Units.
[26]
Shijin Duan, Wenhao Wang, Yukui Luo, and Xiaolin Xu. 2021. A survey of recent attacks and mitigation on FPGA systems. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI.
[27]
Iman Faraji, Seyed H. Mirsadeghi, and Ahmad Afsahi. 2016. Topology-aware GPU selection on multi-GPU nodes. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops.
[28]
Ilias Giechaskiel, Kasper B. Rasmussen, and Jakub Szefer. 2019. Measuring long wire leakage with ring oscillators in cloud FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications.
[29]
Ilias Giechaskiel, Kasper B. Rasmussen, and Jakub Szefer. 2019. Reading between the dies: Cross-SLR covert channels on multi-tenant cloud FPGAs. In Proceedings of the IEEE International Conference on Computer Design.
[30]
Ilias Giechaskiel, Kasper B. Rasmussen, and Jakub Szefer. 2020. C3APSULe: Cross-FPGA covert-channel attacks through power supply unit leakage. In Proceedings of the IEEE Symposium on Security and Privacy.
[31]
Ilias Giechaskiel and Jakub Szefer. 2020. Information leakage from FPGA routing and logic elements. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design.
[32]
Ilias Giechaskiel, Shanquan Tian, and Jakub Szefer. 2021. Cross-VM information leaks in FPGA-accelerated cloud environments. In Proceedings of the IEEE International Symposium on Hardware Oriented Security and Trust.
[33]
Ognjen Glamočanin, Louis Coulon, Francesco Regazzoni, and Mirjana Stojilović. 2020. Are cloud FPGAs really vulnerable to power analysis attacks? In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.
[34]
Ognjen Glamočanin, Dina G. Mahmoud, Francesco Regazzoni, and Mirjana Stojilović. 2021. Shared FPGAs and the holy grail: Protections against side-channel and fault attacks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.
[35]
Dennis R. E. Gnad, Fabian Oboril, and Mehdi B. Tahoori. 2017. Voltage drop-based fault attacks on FPGAs using valid bitstreams. In Proceedings of the International Conference on Field Programmable Logic and Applications.
[36]
Mustafa Gobulukoglu, Colin Drewes, William Hunter, Ryan Kastner, and Dustin Richmond. 2021. Classifying computations on multi-tenant FPGAs. In Proceedings of the Design Automation Conference.
[37]
Huawei Cloud. 2022. FPGA Accelerated Cloud Server. Retrieved March 20, 2022 from https://www.huaweicloud.com/en-us/product/fcs.html.
[38]
Ted Huffmire, Cynthia Irvine, Thuy D. Nguyen, Timothy Levin, Ryan Kastner, and Timothy Sherwood. 2010. Handbook of FPGA Design Security (1st ed.). Springer.
[39]
Chenglu Jin, Vasudev Gohil, Ramesh Karri, and Jeyavijayan Rajendran. 2020. Security of cloud FPGAs: A survey. arXiv:2005.04867. Retrieved from https://arxiv.org/abs/2005.04867.
[40]
Jonas Krautter, Dennis R. E. Gnad, and Mehdi B. Tahoori. 2019. Mitigating Electrical-level Attacks towards Secure Multi-Tenant FPGAs in the Cloud. ACM Trans. Reconfigurable Technol. Syst. 12, 3, Article 12 (September 2019), 26 pages.
[41]
Jonas Krautter, Dennis R. E. Gnad, and Mehdi B. Tahoori. 2019. Mitigating electrical-level attacks towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable Technology and Systems 12, 3(2019).
[42]
Tuan La, Khoa Pham, Joseph Powell, and Dirk Koch. 2021. Denial-of-Service on FPGA-based cloud infrastructures – Attack and defense. Transactions on Cryptographic Hardware and Embedded Systems, 3(2021), 441–464.
[43]
Tuan Minh La, Kaspar Matas, Nikola Grunchevski, Khoa Dang Pham, and Dirk Koch. 2020. FPGADefender: Malicious Self-oscillator Scanning for Xilinx UltraScale + FPGAs. ACM Trans. Reconfigurable Technol. Syst. 13, 3, Article 15 (September 2020), 31 pages.
[44]
Chen Li, Yifan Sun, Lingling Jin, Lingjie Xu, Zheng Cao, Pengfei Fan, David Kaeli, Sheng Ma, Yang Guo, and Jun Yang. 2019. Priority-based PCIe scheduling for multi-tenant multi-GPU systems. IEEE Computer Architecture Letters 18, 2(2019), 157–160.
[45]
Sergio López-Buedo, Javier Garrido, and Eduardo Boemo. 2000. Thermal testing on reconfigurable computers. IEEE Design & Test of Computers 17, 1 (Jan.2000), 84–91.
[46]
Sergio López-Buedo, Javier Garrido, and Eduardo Boemo. 2002. Dynamically inserting, operating, and eliminating thermal sensors of FPGA-based systems. IEEE Transactions on Components and Packaging Technologies 25, 4(2002), 561–566.
[47]
Mark Lord. 2021. hdparm. Retrieved March 20, 2022 from https://sourceforge.net/projects/hdparm/.
[48]
Yukui Luo and Xiaolin Xu. 2020. A quantitative defense framework against power attacks on multi-tenant FPGA. In Proceedings of the International Conference on Computer-Aided Design.
[49]
Matt Martz. 2021. speedtest-cli. Retrieved March 20, 2022 from https://github.com/sivel/speedtest-cli.
[50]
Seyedeh Sharareh Mirzargar and Mirjana Stojilović. 2019. Physical side-channel attacks and covert communication on FPGAs: A survey. In Proceedings of the International Conference on Field Programmable Logic and Applications.
[51]
Shayan Moini, Shanquan Tian, Daniel Holcomb, Jakub Szefer, and Russell Tessier. 2021. Remote power side-channel attacks on BNN accelerators in FPGAs. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.
[52]
G. Provelengios, D. Holcomb, and R. Tessier. 2020. Power Distribution Attacks in Multitenant FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 12 (2020), 2685–2698.
[53]
George Provelengios, Daniel Holcomb, and Russell Tessier. 2021. Mitigating Voltage Attacks in Multi-Tenant FPGAs. ACM Trans. Reconfigurable Technol. Syst. 14, 2, Article 9 (June 2021), 24 pages.
[54]
Adnan Siraj Rakin, Yukui Luo, Xiaolin Xu, and Deliang Fan. 2021. Deep-Dup: An adversarial weight duplication attack framework to crush deep neural network in multi-tenant FPGA. In Proceedings of the USENIX Security Symposium.
[55]
Dana Schaa and David Kaeli. 2009. Exploring the multiple-GPU design space. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops.
[56]
Kyle Spafford, Jeremy S. Meredith, and Jeffrey S. Vetter. 2011. Quantifying NUMA and contention effects in multi-GPU systems. In Proceedings of the Workshop on General-Purpose Processing on Graphics Processing Units.
[57]
Takeshi Sugawara, Kazuo Sakiyama, Shoei Nashimoto, Daisuke Suzuki, and Tomoyuki Nagatsuka. 2019. Oscillator without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters 15, 11(2019), 640–642.
[58]
Mingtian Tan, Junpeng Wan, Zhe Zhou, and Zhou Li. 2021. Invisible probe: Timing attacks with PCIe congestion side-channel. In Proceedings of the IEEE Symposium on Security and Privacy.
[59]
Tencent Cloud. 2022. FPGA Cloud Server. Retrieved March 20, 2022 from https://cloud.tencent.com/product/fpga.
[60]
Shanquan Tian, Andrew Krzywosz, Ilias Giechaskiel, and Jakub Szefer. 2020. Cloud FPGA security with RO-based primitives. In Proceedings of the International Conference on Field-Programmable Technology.
[61]
Shanquan Tian and Jakub Szefer. 2019. Temporal thermal covert channels in cloud FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
[62]
Shanquan Tian, Wenjie Xiong, Ilias Giechaskiel, Kasper B. Rasmussen, and Jakub Szefer. 2020. Fingerprinting cloud FPGA infrastructures. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
[63]
Shanquan Tian, Wenjie Xiong, Ilias Giechaskiel, and Jakub Szefer. 2021. Cloud FPGA cartography using PCIe contention. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines.
[64]
Shanquan Tian, Wenjie Xiong, Ilias Giechaskiel, and Jakub Szefer. 2021. Remote power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines.
[65]
Boyan Valtchanov, Alain Aubert, Florent Bernard, and Viktor Fischer. 2008. Modeling and observing the jitter in ring oscillators implemented in FPGAs. In Proceedings of the IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems.
[66]
X. Wang, Y. Niu, F. Liu, and Z. Xu. 2022. When FPGA Meets Cloud: A First Look at Performance. IEEE Transactions on Cloud Computing 10, 2 (2022), 1344–1357.
[67]
Amos P. Waterland. 2014. stress. Retrieved March 20, 2022 from https://web.archive.org/web/20190502/https://people.seas.harvard.edu/~apw/stress/.
[68]
Xilinx, Inc. 2021. 63419 - Vivado Partial Reconfiguration - What types of bitstreams are used in Partial Reconfiguration (PR) solutions? Retrieved March 20, 2022 from https://support.xilinx.com/s/article/63419.
[69]
Xilinx, Inc. 2021. UltraScale+ FPGAs: Product Tables and Product Selection Guides. Retrieved March 20, 2022 from https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf.
[70]
Wenjie Xiong, Nikolaos Athanasios Anagnostopoulos, André Schaller, Stefan Katzenbeisser, and Jakub Szefer. 2019. Spying on temperature using DRAM. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.
[71]
Wenjie Xiong, André Schaller, Nikolaos A. Anagnostopoulos, Muhammad Umair Saleem, Sebastian Gabmeyer, Stefan Katzenbeisser, and Jakub Szefer. 2016. Run-time accessible DRAM PUFs in commodity devices. In Proceedings of the Conference on Cryptographic Hardware and Embedded Systems.
[72]
Chi-En Yin and Gang Qu. 2009. Temperature-aware cooperative ring oscillator PUF. In Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust.
[73]
Jiliang Zhang and Gang Qu. 2019. Recent Attacks and Defenses on FPGA-based Systems. ACM Trans. Reconfigurable Technol. Syst. 12, 3, Article 14 (September 2019), 24 pages.
[74]
Y. Zhang, R. Yasaei, H. Chen, Z. Li, and M. A. A. Faruque. 2021. Stealing Neural Network Structure Through Remote FPGA Side-Channel Analysis. IEEE Transactions on Information Forensics and Security 16 (2021), 4377–4388.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1 (March 2023), 403 pages.
ISSN: 1936-7406; EISSN: 1936-7414; DOI: 10.1145/35733111
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 December 2022
Online AM: 12 May 2022
Accepted: 01 May 2022
Revised: 21 January 2022
Received: 11 October 2021
Published in TRETS Volume 16, Issue 1


Author Tags

  1. Cloud FPGAs
  2. FPGA security
  3. information leakage
  4. covert channels
  5. side channels
  6. interference attacks
  7. PCIe contention

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • NSF

