Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure

Sangpyo Kim Department of Intelligence and Information
Seoul National University
Seoul, South Korea
vnb987@snu.ac.kr
   Jongmin Kim {@IEEEauthorhalign} Jaeyoung Choi Interdisciplinary Program in Artificial Intelligence
Seoul National University
Seoul, South Korea
jongmin.kim@snu.ac.kr
Department of Intelligence and Information
Seoul National University
Seoul, South Korea
sojae0518@snu.ac.kr
   Jung Ho Ahn Department of Intelligence and Information
Seoul National University
Seoul, South Korea
gajh@snu.ac.kr
Abstract

Fully homomorphic encryption (FHE) is in the spotlight as a definitive solution for privacy, but the high computational overhead of FHE poses a challenge to its practical adoption. Although prior studies have attempted to design ASIC accelerators to mitigate the overhead, their designs require excessive chip resources (e.g., areas) to contain and process massive data for FHE operations. We propose CiFHER, a chiplet-based FHE accelerator with a resizable structure, to tackle the challenge with a cost-effective multi-chip module (MCM) design. First, we devise a flexible core architecture whose configuration is adjustable to conform to the global organization of chiplets and design constraints. Its distinctive feature is a composable functional unit providing varying computational throughput for the number-theoretic transform, the most dominant function in FHE. Then, we establish generalized data mapping methodologies to minimize the interconnect overhead when organizing the chips into the MCM package in a tiled manner, which becomes a significant bottleneck due to the packaging constraints. This study demonstrates that a CiFHER package composed of a number of compact chiplets provides performance comparable to state-of-the-art monolithic ASIC accelerators while significantly reducing the package-wide power consumption and manufacturing cost.

Index Terms:
Fully homomorphic encryption, Domain-specific architecture, Chiplet

I introduction

Privacy is no longer merely a virtue but a critical objective for corporates due to recently established strict data privacy policies, such as the California Consumer Privacy Act (CCPA) and the EU General Data Protection Regulation (GDPR). For example, Apple’s App Tracking Transparency (ATT) policy cost technology companies $15.8 billion in 2022 [65].

Fully homomorphic encryption (FHE) is emerging as an attractive solution providing strong privacy guarantees. FHE enables an unlimited number of computations directly on encrypted data without decryption; thus, it allows users to safely offload services without disclosing private information. Numerous studies have shown its applicability to practical privacy-preserving applications, including genomics [56, 46], biometrics [60, 76], and machine learning (ML) tasks [37, 61, 62, 14, 24].

There are two major competing classes of FHE based on the learning with errors (LWE) computational problem: one based on the ring LWE variant (e.g., BGV [15], BFV [16, 33], and CKKS [22] schemes) and the other based on the torus LWE (e.g., FHEW [31] and TFHE [23] schemes). This work focuses on the former class and, specifically, CKKS [22], which stands out among the state-of-the-art FHE schemes due to its relatively high performance [9] and the ability to encrypt complex vectors. Such benefits make CKKS suitable for numerous practical tasks, particularly ML tasks.

Still, alleviating the computational and memory overhead of FHE is a challenging prerequisite for its widespread use as FHE workloads run over 10,000×\times× slower than their unencrypted counterparts [81, 52]. To mitigate the overhead, prior work has attempted hardware acceleration [11, 34, 51, 2, 95, 58, 80, 81, 55]. In particular, ASIC FHE accelerator proposals [81, 55, 58] achieved considerable performance enhancements. For example, ARK [55] can perform a CIFAR-10 CNN inference in 0.125 seconds with the ResNet-20 model, which takes 2,271 seconds in a single-threaded CPU [61]. However, these remarkable speedups owe to the massive chip area usage (373.6–472.3mm2) that enables the deployment of unconventionally large on-chip memory capacity (256–512MB) and substantial amounts of computational logic.

As prior massive FHE accelerator designs are prohibitively costly to realize, we turn to using a multi-chip module (MCM). The cost of a chip skyrockets with the die area in recent technology nodes due to severe degradation in yield and high design complexity [45, 84]. By splitting a monolithic chip design into several small chiplets, an MCM drastically reduces the cost and becomes a scalable solution in the post-Moore era. Abundant studies [84, 89, 96, 43, 64, 8], including commercial chips [73, 69, 70, 94], have demonstrated the efficiency of MCMs in high-performance computing (HPC).

We propose CiFHER, a flexible and cost-effective MCM architecture for FHE. First, we propose a core architecture that is resizable at design time. Our composable number-theoretic transform (NTT) unit, which can provide varying computational throughput depending on the package configuration and technology demands, provides the resizing flexibility, enabling architects to split a monolithic design into multiple cores with a best-fit size for a given die area budget.

Given a chiplet architecture, we construct efficient package configurations of CiFHER, focusing on minimizing the high network cost of MCMs. Due to the technical constraints of a network on package (NoP) connecting chiplets, an MCM architecture becomes bottlenecked by the NoP communication among the cores; data communication has been a source of bottleneck even in prior monolithic accelerators, and MCM intensifies the problem. We address the problem by developing generalized data mapping methodologies optimized for the many-core MCM architecture. We also devise a limb duplication algorithm, which significantly reduces data communication during base conversion (BConv), the second most dominant function in FHE.

Based on the analysis, we discover balanced combinations of the individual chiplet design, data mapping, and FHE algorithms under various design constraints. These optimizations enable CiFHER to achieve comparable performance to prior monolithic chips; a CiFHER package with 16 cores performs an encrypted CIFAR-10 CNN inference using the ResNet-20 model [61] in 0.189 seconds. Depending on the area budget, CiFHER provides various core configurations, ranging from four 47.08mm2 core dies to sixty-four 4.28mm2 core dies in default settings.

Overall, this work makes the following key contributions:

  • We design a flexible chiplet core with a composable NTT unit, which allows the distribution of memory and compute resources across multiple dies while taking advantage of the vector NTT unit.

  • We introduce generalized data mapping methodologies on a tiled FHE accelerator to resolve the NoP communication bottleneck of MCMs.

  • We propose limb duplication, an algorithmic optimization tailored to the MCM design, reducing the amount of die-to-die communication, which would have caused significant latency and energy overhead.

II Background & Motivation

II-A Fully Homomorphic Encryption (FHE)

Homomorphic encryption (HE) is classified into leveled HE (LHE) and fully HE (FHE) depending on the problem size. LHE supports only a limited number of operations (ops) and is usually used in conjunction with other cryptographic protocols [53, 42]. In contrast, FHE solely supports an unlimited number of ops. A defining feature of FHE is bootstrapping. The LWE problem [77] forms the cryptographic basis for state-of-the-art HE schemes, and relies on the error in encryption for its security. When performing HE ops on encrypted data, the magnitude of errors increases, threatening data integrity unless managed properly. Bootstrapping reduces the magnitude of errors, thus allowing more HE ops performed. In a typical use case, a bootstrapping op is inserted after every predetermined number of HE ops is applied to the encrypted data.

II-B CKKS FHE Scheme

We briefly describe the CKKS FHE scheme, the main target of this paper. During CKKS encryption, a (vector) message composed of n𝑛nitalic_n numbers is first packed into a degree-(N1𝑁1N-1italic_N - 1) integer polynomial referred to as plaintext, which is an element of a cyclotomic polynomial ring Q=Q[X]/(XN+1)subscript𝑄subscript𝑄delimited-[]𝑋superscript𝑋𝑁1\mathcal{R}_{Q}=\mathbb{Z}_{Q}[X]/(X^{N}+1)caligraphic_R start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT = blackboard_Z start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT [ italic_X ] / ( italic_X start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT + 1 ) with the degree N2n𝑁2𝑛N\geq 2nitalic_N ≥ 2 italic_n and the modulus Q𝑄Qitalic_Q. All integer ops are performed modulo-Q𝑄Qitalic_Q in Qsubscript𝑄\mathcal{R}_{Q}caligraphic_R start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT. A plaintext is encrypted into a ciphertext composed of two polynomials in Qsubscript𝑄\mathcal{R}_{Q}caligraphic_R start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT. We denote polynomials with bold characters (e.g., 𝐯𝐯\mathbf{v}bold_v), and ciphertexts encrypting polynomials with brackets ([𝐯]delimited-[]𝐯[\mathbf{v}][ bold_v ]). Then, the encryption of 𝐯𝐯\mathbf{v}bold_v can be formulated as [𝐯]=(𝐯1,𝐯2)Q2delimited-[]𝐯subscript𝐯1subscript𝐯2superscriptsubscript𝑄2[\mathbf{v}]=(\mathbf{v}_{1},\mathbf{v}_{2})\in\mathcal{R}_{Q}^{2}[ bold_v ] = ( bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∈ caligraphic_R start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

A high N𝑁Nitalic_N value around 215superscript2152^{15}2 start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT217superscript2172^{17}2 start_POSTSUPERSCRIPT 17 end_POSTSUPERSCRIPT and Q𝑄Qitalic_Q as large as 21500superscript215002^{1500}2 start_POSTSUPERSCRIPT 1500 end_POSTSUPERSCRIPT are utilized to meet the security constraints of FHE [7, 28]. To handle such a large Q𝑄Qitalic_Q, residue number system (RNS) is utilized [21]. Q𝑄Qitalic_Q is set to the product of L𝐿Litalic_L number of small primes (qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s), such that Q=i=1Lqi𝑄superscriptsubscriptproduct𝑖1𝐿subscript𝑞𝑖Q=\prod_{i=1}^{L}q_{i}italic_Q = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Then, each coefficient c𝑐citalic_c of a polynomial is represented by an L𝐿Litalic_L-tuple of small residues (c mod q1,,c mod qL)𝑐 mod subscript𝑞1𝑐 mod subscript𝑞𝐿(c\text{ mod }q_{1},\cdots,c\text{ mod }q_{L})( italic_c mod italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_c mod italic_q start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ). Each row of the matrix corresponding to the prime qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is referred to as limb, which can be regarded as a polynomial in qi=qi[X]/(XN+1)subscriptsubscript𝑞𝑖subscriptsubscript𝑞𝑖delimited-[]𝑋superscript𝑋𝑁1\mathcal{R}_{q_{i}}=\mathbb{Z}_{q_{i}}[X]/(X^{N}+1)caligraphic_R start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT = blackboard_Z start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X ] / ( italic_X start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT + 1 ). Using RNS, modulo-Q𝑄Qitalic_Q ops between polynomials having large coefficients are replaced by a set of L𝐿Litalic_L pairwise modulo-qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ops between limbs having small word-sized coefficients. Also, the modulus can change over HE ops such that Q=i=1qisubscript𝑄superscriptsubscriptproduct𝑖1subscript𝑞𝑖Q_{\ell}=\prod_{i=1}^{\ell}q_{i}italic_Q start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is used at a particular moment. As a result, a polynomial at that moment is represented as an ×N𝑁\ell\times Nroman_ℓ × italic_N matrix of residues.

In a client-server scenario of FHE, the server performs HE ops to manipulate the message encrypted in a ciphertext. Mult/add between two ciphertexts (𝙷𝙼𝚞𝚕𝚝𝙷𝙼𝚞𝚕𝚝\mathtt{HMult}typewriter_HMult/𝙷𝙰𝚍𝚍𝙷𝙰𝚍𝚍\mathtt{HAdd}typewriter_HAdd) or between a ciphertext and a plaintext (𝙿𝙼𝚞𝚕𝚝𝙿𝙼𝚞𝚕𝚝\mathtt{PMult}typewriter_PMult/𝙿𝙰𝚍𝚍𝙿𝙰𝚍𝚍\mathtt{PAdd}typewriter_PAdd) are some of the most frequently used HE ops. Also, it is often required to change the data order inside the message vector composed of n𝑛nitalic_n numbers. Such data reordering can be performed with an 𝙷𝚁𝚘𝚝𝙷𝚁𝚘𝚝\mathtt{HRot}typewriter_HRot op, which evaluates the cyclic rotation of the message encrypted inside a ciphertext. Among these HE ops, 𝙷𝙼𝚞𝚕𝚝𝙷𝙼𝚞𝚕𝚝\mathtt{HMult}typewriter_HMult and 𝙷𝚁𝚘𝚝𝙷𝚁𝚘𝚝\mathtt{HRot}typewriter_HRot are the most computationally expensive due to complex key-switching procedures involved in them. Key-switching is a procedure to multiply a polynomial with evaluation keys (𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks), which are special types of ciphertexts.

II-C Primary Functions of CKKS

We defer a more detailed explanation of each HE op to prior work [22, 21, 38, 12] and instead break down HE ops into primary functions to discuss their computational aspects.

Number-theoretic transform (NTT): To reduce the complexity of a polynomial mult, NTT is utilized. A polynomial mult in qisubscriptsubscript𝑞𝑖\mathcal{R}_{q_{i}}caligraphic_R start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT is equivalent to a negacyclic convolution between the coefficients. NTT, a type of discrete Fourier transform, and its inverse (iNTT) are defined for each qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. (i)NTT reduces the overall complexity of the convolution into O(NlogN)𝑂𝑁𝑁O(N\log N)italic_O ( italic_N roman_log italic_N ) for each limb by adopting a fast Fourier transform (FFT) algorithm [26].

Base conversion (BConv): The modulus Q𝑄Qitalic_Q changes frequently over HE ops, which requires BConv to be performed on polynomials. For example, for a polynomial 𝐯Q𝐯subscriptsubscript𝑄\mathbf{v}\in\mathcal{R}_{Q_{\ell}}bold_v ∈ caligraphic_R start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT end_POSTSUBSCRIPT, we often need to change its modulus to an auxiliary modulus P=i=1Kpi𝑃superscriptsubscript𝑖1𝐾subscript𝑝𝑖P=\sum_{i=1}^{K}p_{i}italic_P = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. For this situation, 96% of BConv computation is spent on a matrix-matrix mult between a pre-computed BConv table (a K×𝐾K\times\ellitalic_K × roman_ℓ matrix) and 𝐯𝐯\mathbf{v}bold_v (an ×N𝑁\ell\times Nroman_ℓ × italic_N matrix) [55]. We obtain a new 𝐯P𝐯subscript𝑃\mathbf{v}\in\mathcal{R}_{P}bold_v ∈ caligraphic_R start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT through BConv.

Automorphism (ϕrsubscriptitalic-ϕ𝑟\phi_{r}italic_ϕ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT): During 𝙷𝚁𝚘𝚝𝙷𝚁𝚘𝚝\mathtt{HRot}typewriter_HRot, automorphism, which is a data permutation function, is performed. The i𝑖iitalic_i-th coefficient of a polynomial is mapped to the (i5r mod N𝑖superscript5𝑟 mod 𝑁i\cdot 5^{r}\text{ mod }Nitalic_i ⋅ 5 start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT mod italic_N)-th position.

Element-wise or constant functions: Other than the aforementioned primary functions, we only perform element-wise mult/add or constant mult/add on polynomials for HE ops.

FHE computations are extremely expensive, mainly attributed to key-switching, which involves multiple (i)NTTs and BConvs. (i)NTT and BConv account for more than 80% of the total computations in CKKS [58, 55]. Thus, enhancing (i)NTT and BConv performance is essential to FHE acceleration.

II-D Previous ASIC FHE Accelerators

F1 [80] is the first ASIC accelerator that reports bootstrapping performance. However, F1 targets small parameters of N=214𝑁superscript214N\!=\!2^{14}italic_N = 2 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT and L=16𝐿16L\!=\!16italic_L = 16, with which it only partially supports bootstrapping. Subsequent accelerator proposals, BTS [58], CraterLake [81], and ARK [55], fully support bootstrapping by targeting larger parameters of N=216𝑁superscript216N=2^{16}italic_N = 2 start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT217superscript2172^{17}2 start_POSTSUPERSCRIPT 17 end_POSTSUPERSCRIPT and L=24𝐿24L=24italic_L = 2460606060. BTS places 2,048 processing elements (PEs) in a grid and connects them by global wires that span the whole chip. Each PE contains a butterfly unit for (i)NTT, a multiply-accumulate unit for BConv, and a multiplier and an adder for element-wise functions. A butterfly unit is capable of performing the butterfly ops required in (i)NTT, which are shown in Eq. 1. ARK and CraterLake develop upon F1’s design of placing N𝑁\sqrt{N}square-root start_ARG italic_N end_ARG parallel lanes in a cluster (CraterLake refers to it as lane group) that feeds data into massive vector NTT units (NTTUs). A vector NTTU contains 12NlogN12𝑁𝑁\frac{1}{2}\sqrt{N}\log Ndivide start_ARG 1 end_ARG start_ARG 2 end_ARG square-root start_ARG italic_N end_ARG roman_log italic_N butterfly units, placed in an organization directly reflecting the four-step FFT dataflow [10]. ARK deploys four NTTUs, and CraterLake 16.

I. 𝙱𝚞𝚝𝚝𝚎𝚛𝚏𝚕𝚢NTT(a,b,w)=(a+bw,abw)II. 𝙱𝚞𝚝𝚝𝚎𝚛𝚏𝚕𝚢iNTT(a,b,w)=(a+b,(ab)w)I. subscript𝙱𝚞𝚝𝚝𝚎𝚛𝚏𝚕𝚢NTT𝑎𝑏𝑤𝑎𝑏𝑤𝑎𝑏𝑤II. subscript𝙱𝚞𝚝𝚝𝚎𝚛𝚏𝚕𝚢iNTT𝑎𝑏𝑤𝑎𝑏𝑎𝑏𝑤\begin{split}\text{I. }&\mathtt{Butterfly}_{\text{NTT}}(a,b,w)=(a+b\cdot w,\ a% -b\cdot w)\\ \text{II. }&\mathtt{Butterfly}_{\text{iNTT}}(a,b,w)=(a+b,\ (a-b)\cdot w)\end{split}start_ROW start_CELL I. end_CELL start_CELL typewriter_Butterfly start_POSTSUBSCRIPT NTT end_POSTSUBSCRIPT ( italic_a , italic_b , italic_w ) = ( italic_a + italic_b ⋅ italic_w , italic_a - italic_b ⋅ italic_w ) end_CELL end_ROW start_ROW start_CELL II. end_CELL start_CELL typewriter_Butterfly start_POSTSUBSCRIPT iNTT end_POSTSUBSCRIPT ( italic_a , italic_b , italic_w ) = ( italic_a + italic_b , ( italic_a - italic_b ) ⋅ italic_w ) end_CELL end_ROW (1)

To tackle the increased number of computations and size of the working set when using large parameters, CraterLake, BTS, and ARK all rely on enormous amounts of computational logic and on-chip memory capacity; the latter even reaches 256–512MB. These designs end up using massive chip resources, utilizing a monolithic die sized 373.6–472.3mm2 and dissipating up to 317W of power.

II-E Multi-Chip Module (MCM)

MCM is a promising solution for scaling chips in the post-Moore era. By organizing a module with multiple small dies, referred to as chiplets, and connecting them using advanced packaging technologies providing high bandwidth on-package interconnects [41, 66, 17], MCMs reduce the area of an individual die. Because the design and fabrication cost of a die substantially drops as the die area is reduced [45, 84], MCMs can greatly reduce the total cost. With yield and technology scaling declining over chip fabrication generations, MCMs have become one of the few left options for scaling systems. Recent Intel [73] and AMD [69, 70] CPUs are notable examples of MCMs; multiple chiplets containing several CPU cores and cache memory are integrated into a single package using EMIB [66] or CoWoS [41] packaging technologies. Also, numerous domain-specific architectures utilizing MCMs have been proposed [84, 89, 43, 64, 88]. Moreover, MCMs have been realized at various area granularity, ranging from AMD’s high-end 74mm2 die [70] to 6mm2 die of Simba [84].

These recent trends motivate us to design an MCM architecture for cost-effective FHE acceleration. By using MCMs, we can enjoy the aforementioned benefits coming from a significant reduction in individual die areas. Also, this approach is highly scalable as we can reuse the same design for each chiplet and place multiple chiplets to facilitate developing various MCM FHE accelerators with different computational throughput and bandwidth required by the technology constraints and the purpose of use. Prior ASIC FHE accelerator proposals include high bandwidth memories (HBMs), already assuming the use of advanced packaging for connecting HBMs to the chip. The same packaging can be utilized to connect multiple chiplets with little additional cost [54]. We utilize the open specification of Universal Chiplet Interconnect Express (UCIe) [91] to design and evaluate the MCM FHE accelerator.

Refer to caption
Figure 1: Exemplar configurations of a composable NTT unit, simplified to N=28𝑁superscript28N=2^{8}italic_N = 2 start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT. A submodule, the smallest unit occupying N44𝑁\sqrt[4]{N}nth-root start_ARG 4 end_ARG start_ARG italic_N end_ARG lanes, is shown above. A two-submodule configuration and its NTT process for a length-N𝑁Nitalic_N polynomial are shown below.
Refer to caption
Figure 2: The CiFHER package consists of multiple core chiplets and two I/O dies handling HBM. A core comprises functional units (FUs), register files (RFs), and networking components.

III CiFHER Microarchitecture

To reduce the high cost of a monolithic chip design, we propose CiFHER, an MCM architecture for FHE acceleration. CiFHER has a tiled architecture with multiple chiplet cores connected by a mesh NoP. Fig. 2 depicts the package and core organization of CiFHER. The main challenge in designing an MCM architecture lies in reducing the NoP pressure. As deployable bandwidth and topology are heavily restricted by the technology constraints of the NoP, we devise novel data mapping methodologies (§IV) and algorithms (§V) addressing the high latency and low bandwidth of NoP. Nonetheless, prior to determining the configuration of the MCM package, we will first architect the internals of a chiplet core.

III-A Core Organization

Prior accelerators either utilize little (PE in BTS) or big (cluster or lane group in F1, ARK, and CraterLake) cores without providing much flexibility in design. In contrast, we design the cores of CiFHER to have varying computational throughput, which can be adjusted at design time depending on the package configuration.

A core comprises a composable NTTU, a systolic BConv unit (BConvU), an automorphism unit (AutoU), an element-wise function unit (EFU), a PRNG 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evk generator (PRNG), a router and PHYs for NoP, and register files (RFs). We adopt the vector architecture utilized in F1, CraterLake, and ARK, where we place multiple parallel lanes in a core. Each lane has a dedicated RF space not accessible by the other lanes, which enables maximizing parallelism with low costs for synchronization and arbitration. The vector architecture has shown effectiveness in reducing the communication overhead and the RF bandwidth pressure, the major bottlenecks in FHE.

By effectively combining components from different vector accelerators with proper modifications, CiFHER provides an improved core design for FHE accelerators. In particular, we adopt the distributed RF, systolic BConvU, and AutoU structures from ARK, and adopt PRNG from CraterLake. A systolic BConvU is composed of multiple multiply-accumulate (MAC) units, forming an output-stationary systolic array.

To achieve a flexible core design, we aim to adjust configurations, particularly the number of lanes. While most functional units (FUs) readily adapt to lane count and throughput goals, the preceding vector NTTU presents a challenge. This critical vector core component is a substantial computing unit with its 2,048 butterfly units locked into a fixed organization.

The inflexibility of previous large-scale NTTU designs severely limits our design options, ultimately harming hardware efficiency. To manage the on-chip memory bandwidth demands of NTT computation, prior studies rely on deeply pipelined NTTU designs. These treat a length-N polynomial as an N×N𝑁𝑁\sqrt{N}\times\sqrt{N}square-root start_ARG italic_N end_ARG × square-root start_ARG italic_N end_ARG matrix, performing the NTT in four steps: column-wise N𝑁\sqrt{N}square-root start_ARG italic_N end_ARG-point NTT, twisting, transposition, and row-wise N𝑁\sqrt{N}square-root start_ARG italic_N end_ARG-point NTT. With N𝑁Nitalic_N reaching 216217superscript216superscript2172^{16}\!-\!2^{17}2 start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT - 2 start_POSTSUPERSCRIPT 17 end_POSTSUPERSCRIPT, this vector NTTU becomes bulky and expensive. Worse, its single-configuration design forces us to use large chiplets to accommodate it, regardless of the actual computational throughput needed. This leads to underutilized NTTUs, creating an imbalanced and inefficient design for MCMs.

III-B Composable NTT Unit

We devise a composable NTTU supporting various configurations with the number of lanes adjustable from 16 to 256. We also regard a length-N𝑁Nitalic_N polynomial as a N×N𝑁𝑁\sqrt{N}\times\sqrt{N}square-root start_ARG italic_N end_ARG × square-root start_ARG italic_N end_ARG matrix and perform N𝑁\sqrt{N}square-root start_ARG italic_N end_ARG-point (i)NTT along the rows and then along the columns in sequence, but additionally apply the four-step FFT dataflow [10] to each of the row- and column-direction (i)NTTs. We obtain a compact NTTU spanning N4=164𝑁16\sqrt[4]{N}=16nth-root start_ARG 4 end_ARG start_ARG italic_N end_ARG = 16 lanes shown in Fig. 1 (simplified to N=28𝑁superscript28N=2^{8}italic_N = 2 start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT), which we refer to as the submodule of our NTTU.

Up to 16 submodules can be stacked to deliver higher computational throughput for (i)NTT, providing flexibility for our design space exploration. We devise a method to combine multiple submodules to have them collaboratively perform (i)NTT on a limb by performing a perfect shuffle [87] between the submodules. The perfect shuffle enables the rearrangement of dispersed data across various submodules by transforming it into a sequence that can be effectively transposed using subsequent quadrant swap units. An exemplar two-submodule NTTU configuration for N=28𝑁superscript28N\!=\!2^{8}italic_N = 2 start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT performing NTT is shown in Fig. 1. We regard a length-28superscript282^{8}2 start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT polynomial as a 16×16161616\!\times\!1616 × 16 matrix. The first half of a submodule performs a four-step NTT on a row (data indices: 0–15/16–32). It starts from stride-1 butterfly ops and ends with stride-8 ops to perform 16-point NTT on a row. Then the buffering and shuffling steps in the middle redistribute the data between different submodules so that each submodule holds data elements with stride 16 in the second buffer; i.e., each submodule holds the elements of a column. Finally, the second half of a submodule performs a four-step NTT on a column (data indices: 0–240/4–244, stride 16) by starting from stride-16 butterfly ops and ending with stride-128 ops to perform 16-point NTT on a column.

Our composable NTTU also allows up to N4=164𝑁16\sqrt[4]{N}=16nth-root start_ARG 4 end_ARG start_ARG italic_N end_ARG = 16 cores to perform (i)NTT of a limb together. In this case, data exchange between the collaborating cores occurs after the buffering and shuffling steps in the case of NTT.

Table I: Target parameters of CiFHER.
Param. N𝑁Nitalic_N L𝐿Litalic_L K𝐾Kitalic_K Q𝑄Qitalic_Q P𝑃Pitalic_P Security
Descr. Degree # of qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT # of pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT i=1Lqisuperscriptsubscriptproduct𝑖1𝐿subscript𝑞𝑖\prod_{i=1}^{L}q_{i}∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT i=1Kpisuperscriptsubscriptproduct𝑖1𝐾subscript𝑝𝑖\prod_{i=1}^{K}p_{i}∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (bits)
Value 216superscript2162^{16}2 start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT 48absent48\leq 48≤ 48 12 21218absentsuperscript21218\leq 2^{1218}≤ 2 start_POSTSUPERSCRIPT 1218 end_POSTSUPERSCRIPT 2336superscript23362^{336}2 start_POSTSUPERSCRIPT 336 end_POSTSUPERSCRIPT 128absent128\geq 128≥ 128 [12]

III-C Word Length & Logic Optimization

We choose 32 bits as the word length to use in CiFHER as it is beneficial to use short word lengths for the efficiency of logical operations. While BTS and ARK utilize a word length of 64 bits, which has been traditionally used in CKKS libraries [32, 27, 20], F1 (32 bits), CraterLake (28 bits), and FPGA FHE accelerators (FAB [2] (54 bits) and Poseidon [95] (32 bits)) utilize shorter word lengths. Despite using a relatively short word length, we maintain high precision by the methods in [1] to utilize high scales ranging from 247superscript2472^{47}2 start_POSTSUPERSCRIPT 47 end_POSTSUPERSCRIPT to 255superscript2552^{55}2 start_POSTSUPERSCRIPT 55 end_POSTSUPERSCRIPT. Table I tabulates our target parameters.

We utilize the word-level Montgomery reduction circuit proposed in [67] with further enhancements using the signed Montgomery reduction algorithm in [82]. The circuit is utilized for all modular reduction circuits in CiFHER, including EFUs, NTTUs, PRNGs, and BConvUs.

Systolic BConvU: By using a 32-bit word length, more limbs are utilized (higher L𝐿Litalic_L and K𝐾Kitalic_K) than 64-bit ARK. Thus, to provide higher word-level throughput for BConv, we use a longer 1×121121\!\times\!121 × 12 BConv configuration per lane by default.

EFU: Inside an EFU, we place multiple modular multipliers, modular adders, and element-wise op circuits for specific purposes required in FHE (e.g., double-word accumulation and reduction). An EFU can perform compound element-wise ops to reduce the pressure on RFs.

IV Data mapping methodology

IV-A Tiled Architecture

The technology constraints of NoP enforce CiFHER to be a tiled architecture connected by a mesh network. Due to the short channel reach of an advanced interface [91], a complex network topology that requires connections between distant nodes is less suitable for MCMs. Thus, we fix the topology to 2D mesh (see Fig. 2). Also, due to the high cost of NoP communication, we set the default bisection bandwidth of the entire package to 2TB/s, several to an order of magnitude lower than that of CraterLake (29TB/s) or ARK (8TB/s).

Prior FHE accelerators co-design the data mapping method with the network on chip (NoC) organization. They design complex NoC structures using direct connections between distant nodes with high bandwidth and uniform communication time. However, in an MCM architecture with a mesh network, communication latency between nodes is non-uniform, as distant nodes are separated by multiple hops. Lagging data from distant nodes causes others to wait idly, degrading the overall performance [84]. The low bandwidth of NoP exacerbates this problem. Therefore, we delve into data mapping methodologies to reduce the pressure of the network.

IV-B Combining Data Mapping Methods

Prior data mapping methods can be classified into coefficient scattering and limb scattering. Coefficient scattering is a method of distributing N𝑁Nitalic_N coefficients of each limb equally to the entire computing units. By contrast, in limb scattering, each prime and the corresponding limbs are allocated to a specific computing unit. BTS and CraterLake use coefficient scattering, whereas F1 and ARK mostly use limb scattering. ARK temporarily switches to coefficient scattering during BConv, which we detail in §V-A.

Due to the conflicting data access patterns of HE ops, core-to-core communication is inevitable regardless of the data mapping method. (i)NTT and BConv, the two most dominant primitives, require different data mapping due to their data access patterns. For a polynomial, expressed as an ×N𝑁\ell\times Nroman_ℓ × italic_N matrix, (i)NTT can be separately applied to each of \ellroman_ℓ limbs, favoring limb scattering. By contrast, as BConv is mostly a matrix-matrix mult with a K×𝐾K\times\ellitalic_K × roman_ℓ BConv table (see §II-C), data access is performed along the columns of a polynomial and makes coefficient scattering a better option. Thus, data exchange between cores is required either during (i)NTT (coefficient scattering) or BConv (limb scattering).

In an accelerator with many cores, both data mapping methods have limitations. As the number of limbs (\ellroman_ℓ) vary over HE ops, it is difficult to distribute the limbs equally to cores without causing fragmentation issues in limb scattering. A core will be assigned fewer limbs than others depending on \ellroman_ℓ and wait idly for others to finish processing. In contrast, coefficient scattering is free from fragmentation issues because the degree of a polynomial (N𝑁Nitalic_N) does not change. However, coefficient scattering has a scaling limit due to quadratic growth in the number of transferred packets with increasing cores. For example, during (i)NTT with c𝑐citalic_c cores, coefficient scattering results in all-to-all data exchange between the cores, requiring a total of c(c1)𝑐𝑐1c(c-1)italic_c ( italic_c - 1 ) packets to be transferred. These issues become more severe as more cores participate in the distribution. Also, with more cores, additional performance degradation from long tail latency in lagging data occurs.

Therefore, in an MCM architecture with many cores and restrained NoP bandwidth, the combined use of limb scattering and coefficient scattering enhances performance, even with the increased total amount of transferred data due to performing data exchanges for both (i)NTT and BConv. We partition a polynomial in both limb and coefficient directions by dividing cores into limb clusters and coefficient clusters. A limb cluster is a set of cores each holding an evenly distributed subset of the coefficients of a limb. By contrast, the cores in a coefficient cluster partition the residues of a coefficient corresponding to different primes. Then, (i)NTT or BConv only incurs data exchange within each limb cluster or coefficient cluster, reducing the number of cores participating in the exchange. Thus, combining two data mapping methods suffers less from the aforementioned issues, leading to performance enhancement.

{subcaptiongroup}Refer to caption
\phantomcaption{subcaptiongroup}
\phantomcaption{subcaptiongroup}
\phantomcaption
Figure 3: Generalized data mapping methods of CiFHER with 64 cores: 3 dimension-wise clustering and block clustering for two different exemplar block sizes of 3 2×4242\times 42 × 4 and 3 4×4444\times 44 × 4.

IV-C Generalized Data Mapping for a Tiled FHE Accelerator

We introduce two generalized data mapping methods that combine coefficient scattering and limb scattering for mapping HE ops to a tiled architecture: dimension-wise clustering and block clustering. Dimension-wise clustering binds the cores on the same vertical line into a limb cluster and the cores on the same horizontal line into a coefficient cluster (or in the oppositice directions) on a mesh network. Fig. 3 shows an example of dimension-wise clustering on an 8×8888\times 88 × 8 mesh. This distribution method has the advantage that each data exchange process of (i)NTT and BConv can be performed without interference because data movement occurs in different axial directions. However, the configuration is fixed for a given mesh shape, potentially sharing the same limitations of prior mapping methods when the mesh shape is skewed.

Block clustering is a more generalized mapping method that places limb clusters in the form of blocks with an arbitrary size, and forms coefficient clusters with the cores in the same position within each block. Fig. 3 and Fig. 3 show configurations dividing an 8×8888\times 88 × 8 mesh into eight 2×4242\times 42 × 4 or four 4×4444\times 44 × 4 limb cluster blocks. Unlike dimension-wise clustering, various block clustering configurations are possible by changing the size of a block. We refer to the configuration on a dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT mesh using a bh×bwsubscript𝑏subscript𝑏𝑤b_{h}\times b_{w}italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT block size as dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT-BK-bh×bwsubscript𝑏subscript𝑏𝑤b_{h}\times b_{w}italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT. Dimension-wise clustering is simply denoted as dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT-DW, which is indeed a special variant of block clustering; e.g., Fig. 3 can be expressed as 8×8888\times 88 × 8-BK-8×1818\times 18 × 1. Also, limb scattering and coefficient scattering can be expressed as dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT-BK-1×1111\times 11 × 1 and dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT-BK-dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT, respectively.

Determining the data mapping method of CiFHER is challenging as it requires considering various factors that have conflicting effects on its performance. In addition to the trade-offs discussed in §IV-B, generalized data mapping methods reduce the maximum number of hops a packet travels by restricting its destination to adjacent cores within a cluster. Also, it is challenging to balance the limb clusters and coefficient clusters; an unbalanced configuration would result in fragmentation or severe performance degradation as either (i)NTT or BConv ops must wait for the other to finish due to the dependency in common CKKS use cases. We account for these challenges by developing an extensive simulation tool for CiFHER, which we use for the evaluation in §VI.

V Algorithmic optimizations

V-A Limb Duplication for BConv

We propose a novel limb duplication algorithm, which duplicates the input limbs of BConv to reduce the amount of data communication caused by the redistribution of the output limbs during BConv. BConv requires gathering the residues from different cores. F1 uses parameters which allows performing BConv in parallel for each of the \ellroman_ℓ limbs without redistribution of input data (\ellroman_ℓ limbs) before BConv [20]. However, F1’s method produces massive 2superscript2\ell^{2}roman_ℓ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT output limbs, which need to be redistributed among cores. More recently proposed ARK adopts parameters that make BConv produce fewer limbs (roughly β𝛽\ell\cdot\betaroman_ℓ ⋅ italic_β limbs for a small integer β𝛽\betaitalic_β), which also significantly reduces the amount of computation [38]. However, as it becomes harder to parallelize BConv, ARK switches to coefficient scattering prior to BConv and switches back to limb scattering after BConv, inducing data redistribution of both input and output data of BConv. Fig. 4 shows the estimated amount of data transfer required for key-switching depending on \ellroman_ℓ (the number of primes) when using ARK’s method. As the number of input limbs is usually smaller than the number of output limbs, the output limbs of BConv account for most of the data transfer regardless of \ellroman_ℓ.

Refer to caption
Figure 4: The amount of data transfer during key-switching using the method of ARK [55] under different \ellroman_ℓ conditions.

To eliminate the data transfer required for the redistribution of generated limbs, limb duplication duplicates input limbs and broadcasts them to all the cores in a coefficient cluster so that each core holds the entire polynomial (×N𝑁\ell\times Nroman_ℓ × italic_N matrix when only considering limb scattering). Then, each core performs BConv with a piece of the BConv table (K×𝐾K\times\ellitalic_K × roman_ℓ matrix); e.g., a core in charge of the primes p1,p5subscript𝑝1subscript𝑝5p_{1},p_{5}italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, and p9subscript𝑝9p_{9}italic_p start_POSTSUBSCRIPT 9 end_POSTSUBSCRIPT would only multiply the 1st, 5th, and 9th rows of the BConv table with the polynomial to obtain BConv results. As the output limbs already belong to the right owner, data redistribution after BConv is unnecessary.

However, limb duplication requires additional data transfer for broadcasting input limbs. The overhead of broadcasting varies depending on the ratio of the number of input limbs and the number of output limbs and the data mapping methodology. We can formulate the reduction in the number of transferred limbs compared to the ARK’s method as in Eq. 2. 𝚋𝚛𝚘𝚊𝚍𝚌𝚊𝚜𝚝overheadsubscript𝚋𝚛𝚘𝚊𝚍𝚌𝚊𝚜𝚝overhead\mathtt{broadcast}_{\text{overhead}}typewriter_broadcast start_POSTSUBSCRIPT overhead end_POSTSUBSCRIPT denotes the ratio of data transfer required for broadcasting compared to that required for even distribution and has a value greater than one.

#𝚕𝚒𝚖𝚋𝚜output#𝚕𝚒𝚖𝚋𝚜input×(𝚋𝚛𝚘𝚊𝚍𝚌𝚊𝚜𝚝overhead1)#subscript𝚕𝚒𝚖𝚋𝚜output#subscript𝚕𝚒𝚖𝚋𝚜inputsubscript𝚋𝚛𝚘𝚊𝚍𝚌𝚊𝚜𝚝overhead1\begin{split}\mathtt{\#limbs}_{\text{output}}-\mathtt{\#limbs}_{\text{input}}% \times(\mathtt{broadcast}_{\text{overhead}}-1)\end{split}start_ROW start_CELL # typewriter_limbs start_POSTSUBSCRIPT output end_POSTSUBSCRIPT - # typewriter_limbs start_POSTSUBSCRIPT input end_POSTSUBSCRIPT × ( typewriter_broadcast start_POSTSUBSCRIPT overhead end_POSTSUBSCRIPT - 1 ) end_CELL end_ROW (2)

Eq. 2 must be larger than zero to make limb duplication beneficial. We selectively use limb duplication only when this condition is satisfied for each BConv. As there are much more output limbs than input limbs in general, limb duplication is usually beneficial for the reduction of communication.

V-B An Eclectic Approach to Prior Algorithms

We adopt the state-of-the-art CKKS algorithms mostly from an FHE CPU library, Lattigo [32]. We combine the bootstrapping algorithms from [19, 13]. There are also accelerator-specific algorithms, which offers trade-offs between computation and off-chip memory access. They target 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks and plaintext polynomials, which are loaded from the off-chip memory and are large in size, possibly inducing the off-chip memory bandwidth bottleneck. We explain how such algorithms can (or cannot) be applied to CiFHER.

Minimum key-switching: To reduce the memory access for 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks, ARK builds on the algorithm proposed in [36] to suggest the minimum key-switching algorithm for HRot ops. Most of the 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks are for 𝙷𝚁𝚘𝚝𝙷𝚁𝚘𝚝\mathtt{HRot}typewriter_HRot ops, which require a different 𝖾𝗏𝗄rsubscript𝖾𝗏𝗄𝑟\mathsf{evk}_{r}sansserif_evk start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT for each rotation amount (r𝑟ritalic_r). The algorithm transforms a common computational pattern of sequential HRot ops with different rotation amounts forming an arithmetic progression into a recursive sequence of HRot ops with the same rotation amount equal to the common difference of the progression. Then, only one 𝖾𝗏𝗄rsubscript𝖾𝗏𝗄𝑟\mathsf{evk}_{r}sansserif_evk start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT is required for the whole sequence. CiFHER also adopts minimum key-switching as it is crucial for performance to reduce the off-chip memory access for 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks.

On-the-fly limb extension: Plaintexts (polynomials) required for 𝙿𝙼𝚞𝚕𝚝𝙿𝙼𝚞𝚕𝚝\mathtt{PMult}typewriter_PMult ops also account for a considerable portion of off-chip memory access. For example, bootstrapping requires several GBs of plaintexts. On-the-fly limb extension compresses each plaintext to occupy only one limb per polynomial and generates the rest at runtime. However, the runtime extension involves NTT, incurring expensive NoP data transfer. As the overall cost for runtime extension exceeds the benefits coming from the reduced off-chip memory access in an MCM architecture, we do not use this algorithm in CiFHER.

PRNG 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evk generation: We adopt the PRNG 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evk generation from CraterLake. Half of polynomials composing 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks are purely random. Therefore, by deploying deterministic PRNG circuits inside a chip, they can be generated on-the-fly using a short seed, reducing the off-chip memory access for 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks to half [81]. As 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks can be generated within each core in parallel, no additional overhead is required for MCMs.

VI Evaluation

Table II: Area breakdown of the default configurations of CiFHER with varying numbers of cores.
Area / core (mm2)
Configuration 4 core 8 core 16 core 32 core 64 core
(2×2222\times 22 × 2) (2×4242\times 42 × 4) (4×4444\times 44 × 4) (4×8484\times 84 × 8) (8×8888\times 88 × 8)
Register files 31.71 15.86 7.93 3.96 1.98
NTTU 3.75 1.82 0.88 0.43 0.22
BConvU 1.07 0.65 0.44 0.34 0.28
EFU 0.72 0.36 0.18 0.09 0.05
AutoU 2.32 0.58 0.14 0.04 0.01
PRNG 0.71 0.36 0.18 0.09 0.04
Router/PHY 6.80 3.40 3.40 1.70 1.70
Total 47.08 23.02 13.15 6.65 4.28
Package area (mm2)
Cores 188.32 184.13 210.43 212.75 273.87
I/O dies 36.71
Total 225.04 220.84 247.14 249.46 310.59
Refer to caption
(a) Boot
Refer to caption
(b) ResNet
Refer to caption
(c) Sort
Refer to caption
(d) HELR256
Refer to caption
(e) HELR1024
Figure 5: Performance comparison between CiFHER default configurations and prior vector FHE accelerators, 5(a)5(b)5(d) CLake+ and 5(a)5(b)5(c)5(e) ARK, for the workloads. n𝑛nitalic_n-chip denotes the default configuration of CiFHER with n𝑛nitalic_n core chiplets, except for 1-chip, which is modified from 4-chip to integrate 4 cores into a monolithic die and has double bisection bandwidth to account for the bandwidth gap between NoC and NoP. BK: block clustering. DW: dimension-wise clustering. duplicated: using limb duplication.
Refer to caption
Figure 6: Relative manufacturing cost of CiFHER and ARK systems compared to the 4-core CiFHER system, depending on the number of manufactured systems.

VI-A Implementation & Default Configurations

Performance modeling: We evaluate CiFHER by implementing a cycle-accurate simulator for multi-core systems. It takes a sequence of HE ops as input and translates it into a flow graph of instructions. It divides instructions into micro-instructions and maps the micro-instructions to chiplets according to the data mapping method. The simulator tracks the status of data transmission, data dependency between micro-instructions, and virtual channel allocation status of the NoP to avoid structural hazards and functionality issues. Micro-instructions are scheduled based on this tracking data. Our simulator adopts the decoupled data orchestration of CraterLake to prefetch data from HBM using the static nature of HE ops.

Hardware modeling: We synthesized logic components in RTL using the ASAP7 7.5-track 7nm process design kit [25]. Large SRAM components with a bank size larger than 2KB were evaluated using a memory modeling tool, FinCACTI [83]. We modified FinCACTI to match the IEEE IRDS roadmap [44], ASAP7, and other published data in 7nm nodes [18, 86, 72, 93, 48, 50]. We used published data for HBM [75] and PHYs [91, 50]. We integrated the results into the performance modeling tool to automatically derive power and die area. Overhead due to long wires is also estimated in the tool. All components run at 1GHz.

Manufacturing cost analysis: We also estimated the relative cost of a CiFHER package using the cost modeling tool from [35], results of which are shown in Fig. 6.

Routing and NoP implementation: We used XY routing and 5×5555\times 55 × 5 virtual channel routers [30] for NoP communication, which is a simple yet effective solution for the highly balanced traffic pattern of CiFHER. Each input and output port has 4 virtual channels. We modeled NoP performance and energy using the specification of the UCIe advanced package with a PHY supporting 16GT/s of transfer rate [91].

Memory instances: Two types of RFs, the main scratchpad RF and an auxiliary RF for key-switching, are utilized in CiFHER. The RFs together can provide 6 reads and 6 writes per lane in a cycle through bank interleaving [74]. We modeled the NoP traffic for HBM access by injecting packets through edges connecting the I/O die and its adjacent cores on our simulator. The bandwidth of this edge is the same as that of edges connecting the cores. Two HBM stacks are utilized, each providing 500GB/s of bandwidth [47].

CiFHER default configurations: The default configurations have different numbers of cores but similar levels of total computational throughput and memory bandwidth. We can adjust the number of vector lanes in a core from 16 to 256. We fixed (# of cores)×(# of lanes in a core)# of cores# of lanes in a core(\text{\# of cores})\times(\text{\# of lanes in a core})( # of cores ) × ( # of lanes in a core ) to 1,024 to make different configurations have similar total computational throughput. We fixed the bisection bandwidth of the package to 2TB/s and decided the bandwidth of each edge by dividing it by the number of edges crossing the bisection. The aggregate capacity of auxiliary RFs was set to 16MB (e.g., 1MB per core in a 16-core configuration). Also, we used a total of 256MB for the scratchpad RF, which is sufficient for CiFHER and is the same as the RF capacity of CraterLake. Table II shows the area breakdown of the default configurations of CiFHER.

VI-B Experimental Setup

We utilized four representative FHE workloads that have been utilized for evaluating the previous accelerators.

CKKS bootstrapping (Boot): FHE CKKS workloads require frequent bootstrapping, each involving hundreds of HE ops. We performed bootstrapping for a ciphertext containing 215superscript2152^{15}2 start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT complex numbers. We divided the execution time with the number of rescaling ops between consecutive bootstrapping execution (nine in our parameters with L=48𝐿48L=48italic_L = 48) to account for the frequency of bootstrapping.

CNN inference (ResNet): We used a CKKS implementation [61] of the ResNet-20 model for CIFAR-10. We report the latency of a single-image inference, which takes 2,271 seconds in the original single-threaded CPU implementation.

Sorting 214superscript2142^{14}2 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT numbers (Sort): We used the two-way sorting implementation from [40]. The original 32-thread CPU implementation takes 23,066 seconds to finish.

Logistic regression training (HELR): We used the implementation from [37]. We performed encrypted training with binary classification for classifying the numbers 3333 and 8888 in the MNIST dataset. We iterated single-batch training 32 times and report the average execution time per iteration. We tested for two batch sizes of 256 (HELR256) and 1,024 (HELR1024).

We compare CiFHER with CraterLake and ARK using their reported performance and power values under 128-bit security constraints. For 12/14nm CraterLake, optimistic compensations for the difference in the technology node (14nm to 7nm [72]) were applied, which we refer to as CLake+.

Refer to caption
(a) 4×\times×4-BK-2×\times×2 configuration
Refer to caption
(b) 8×\times×8-BK-4×\times×4 configuration
Refer to caption
(c) Data movement amount
Figure 7: 7(a)7(b) Delay, energy breakdown, and energy-delay-area product (EDAP) [63] of CiFHER when using the data mapping of ARK, CraterLake (CL), and our block clustering without (BK) and with limb duplication (BL). 7(c) Amount of data movement between cores as CiFHER performs each workload with and without limb duplication.

VI-C Performance & Efficiency

Among the default configurations, 4-core CiFHER using dimension-wise clustering and limb duplication showed the best energy-delay product (EDP), whose values are shown with the 1×\times× equi-EDP lines in Fig. 5. We limited the number of limb clusters to eight to avoid severe fragmentation issues.

When we integrated the 4 cores into a single monolithic die and connected the cores in the NoC with 2×\times× higher bisection bandwidth (1-chip in Fig. 5), additional performance and EDP enhancements were achieved. 1-chip CiFHER resembles ARK with datapath reduced to 32 bits and showed a similar EDP with the original 64-bit ARK except for HELR1024. ARK performs inferior in HELR1024 because we reordered the HE ops to further enhance the reuse of 𝖾𝗏𝗄𝖾𝗏𝗄\mathsf{evk}sansserif_evks in the memory-bound region, which ARK identifies as a bottleneck of HELR.

However, the fabrication cost of 1-chip CiFHER or ARK is significatly higher than chiplet-based CiFHER for reasonable production volumes (see Fig. 6), mainly due to the high non-recurring engineering (NRE) cost stemming from the large chip size. 1-chip CiFHER (respecitvely, ARK) has 1.22×\times× (2.22×\times×) and 1.37×\times× (2.48×\times×) higher NRE compared to 4-core and 16-core CiFHER, Although NRE can be amortized for large production volumes, the cost of 1-chip CiFHER becomes on par with that of 4-core CiFHER when one million packages are produced, which is a huge amount for domain-specific accelerators. Meanwhile, ARK is always expensive than 4-core CiFHER due to higher raw chip fabrication cost and yield loss.

Table III: Execution time and relative energy-delay-area product (EDAP) [63] of CiFHER and prior vector FHE accelerators.
Execution time (ms), the lower the better
CLake+ ARK CiFHER
4 cores 16 cores 64 cores
Boot 0.79 0.44 0.62 0.64 0.73
ResNet 321 125 194 189 222
Sort - 1990 2282 2328 2683
HELR256 3.81 - 3.34 3.55 4.09
HELR1024 - 7.42 5.16 5.50 6.20
Relative EDAP (vs. 4-core CiFHER), the lower the better
Boot 2.04×\times× 1.46×\times× 1.00×\times× 1.35×\times× 2.63×\times×
ResNet 4.09×\times× 1.20×\times× 1.00×\times× 1.26×\times× 2.56×\times×
Sort - 2.04×\times× 1.00×\times× 1.34×\times× 2.62×\times×
HELR256 1.40×\times× - 1.00×\times× 1.41×\times× 2.80×\times×
HELR1024 - 4.70×\times× 1.00×\times× 1.42×\times× 2.77×\times×

Under the fixed computational throughput and bisection bandwidth of the package, as we split the chip and utilize smaller cores, performance and efficiency inevitably decline because more NoP data transfer is required. Thus, the area constraint is a decisive factor in an MCM FHE accelerator design. If the cost budget allows, as in recent AMD server CPUs (74mm2 per compute die [70]), utilizing four cores would be best. Table III shows that 4-core CiFHER enhances efficiency in terms of energy-delay-area product (EDAP) [63] with comparable performance to prior monolithic FHE accelerators due to significantly lower power dissipation (see Fig. 5). Compared to ARK, 4-core CiFHER performs 1.15×\times× slower but reduces EDAP by 2.03×\times× in geometric mean (geomean) of the workloads. Compared to CLake+, it performs 1.34×\times× faster and results in a 2.27×\times× EDAP reduction. By contrast, if we can only afford dies as small as recent domain-specific MCM accelerators, such as NN-Baton [89] (2mm2) and Simba [84] (6mm2), 16-core or 64-core CiFHER needs to be utilized.

Our data mapping and limb duplication minimize the performance and efficiency degradation of many-core configurations. The best 16-core and 64-core configurations respectively perform only 1.03×\times× and 1.18×\times× slower in geomean than 4-core. Compared to 4-core, EDP increases to 1.23×\times× (16-core) and 1.94×\times× (64 core), and EDAP increases to 1.35×\times× (16-core) and 2.67×\times× (64-core) in geomean using our optimized configurations, whereas a naïve configuration (e.g., 64-core using limb scattering with ARK’s method) would require 17.1×\times× higher EDP and 23.6×\times× higher EDAP in bootstrapping.

Meanwhile, due to the skewed organization, 8-core (2 × 4) and 32-core (4 × 8) configurations suffer from an imbalance in NoP throughput between the horizontal and vertical directions, increasing the tail latency. As a result, performance drops severely, making these configurations less attractive.

Refer to caption
(a) CiFHER baseline
Refer to caption
(b) CiFHER with double NoP bandwidth
Refer to caption
(c) CiFHER with double compute
Figure 8: Bootstrapping performance of CiFHER with/without limb duplication. The three best-performing mapping configurations were selected for each number of cores. ARK’s redistribution method for BConv was used when limb duplication was not used.

VI-D Sensitivity Study

We estimated energy, delay, and EDAP of CiFHER while incrementally applying block clustering (4×\times×4-BK-2×\times×2 and 8×\times×8-BK-4×\times×4) and limb duplication (see Fig. 7). Limb scattering of ARK shows 1.44–1.53×\times× (4×\times×4) and 2.89–3.06×\times× (8×\times×8) slow execution time compared to block clustering. Coefficient scattering of CraterLake is 1.10–1.13×\times× faster than our method for the 4×\times×4 configuration, whereas it is 2.76–2.81×\times× slower for 8×\times×8. The data exchange pattern during limb scattering is not all-to-all as we can group some limbs together to have data transfer occur within each group. By contrast, coefficient scattering requires all-to-all data exchange and thus scales inferior to limb scattering. Coefficient scattering consumes 2.07×\times× more energy on NoP and shows 4.66×\times× higher (worse) EDAP than our mapping in geomean for 8×\times×8. Yet, all-to-all data exchange between 16 cores is manageable, making coefficient scattering faster than our mapping for 4×\times×4.

Fig. 7(c) compares the amounts of data movement between the cores of CiFHER, with and without limb duplication. Limb duplication consistently eliminated 18–22% of the data movement between the cores. When applying limb duplication to our mapping in the 4×\times×4 configuration, we obtained 1.10×\times× delay and 1.16×\times× EDAP reductions in geomean, and the delay and EDAP gaps with coefficient scattering were also reduced to only 1.3% and 2.1% in geomean.

Refer to caption
(a) 2x2-DW
Refer to caption
(b) 4x4-BK-2x2
Figure 9: Normalized delay, energy, and EDAP of CiFHER when performing bootstrapping with various single-core computational throughput using the composable NTTU or an alternative NTTU.

To assess the efficiency of composable NTTUs, we also measured delay, energy, and EDAP of CiFHER during bootstrapping with different single-core computational throughput, represented by the number of lanes. We either use composable NTTUs or alternative NTTUs, which extend upon prior NTTUs by interpreting a length-N𝑁Nitalic_N polynomial as a three-dimensional vector (N=#lane×#lane×Nrest𝑁subscript#𝑙𝑎𝑛𝑒subscript#𝑙𝑎𝑛𝑒subscript𝑁𝑟𝑒𝑠𝑡N\!=\!\text{\#}_{lane}\times\text{\#}_{lane}\times N_{rest}italic_N = # start_POSTSUBSCRIPT italic_l italic_a italic_n italic_e end_POSTSUBSCRIPT × # start_POSTSUBSCRIPT italic_l italic_a italic_n italic_e end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_r italic_e italic_s italic_t end_POSTSUBSCRIPT) and performing a sequence of #lanesubscript#𝑙𝑎𝑛𝑒\text{\#}_{lane}# start_POSTSUBSCRIPT italic_l italic_a italic_n italic_e end_POSTSUBSCRIPT-point, #lanesubscript#𝑙𝑎𝑛𝑒\text{\#}_{lane}# start_POSTSUBSCRIPT italic_l italic_a italic_n italic_e end_POSTSUBSCRIPT-point, and Nrestsubscript𝑁𝑟𝑒𝑠𝑡N_{rest}italic_N start_POSTSUBSCRIPT italic_r italic_e italic_s italic_t end_POSTSUBSCRIPT-point NTTs. Similar to prior NTTUs, twisting and transposing steps are appropriately inserted in between NTT steps. Alternative NTTUs can also adjust the computational throughput according to the number of lanes (#lanesubscript#𝑙𝑎𝑛𝑒\text{\#}_{lane}# start_POSTSUBSCRIPT italic_l italic_a italic_n italic_e end_POSTSUBSCRIPT); however, whereas data exchange in composable NTTUs mostly occur in between adjacent lanes, alternative NTTUs require data exchange with distant lanes, increasing overall energy and area due to long wires.

Composable NTTUs increase the efficiency of CiFHER for many-core configurations by enabling a compact MCM architecture with a balance between computational throughput and available data movement bandwidth. In the 2x2-DW configuration, where data transfer through NoP is relatively less heavy and high computational throughput is required for each core, energy and delay increased when prior NTTUs were replaced with composable NTTUs with lower throughput (see Fig. 9(a)). Specifically, the delay (respectively, energy) increases by 1.43×(1.29×)1.43\times(1.29\times)1.43 × ( 1.29 × ) and 2.7×(1.92×)2.7\times(1.92\times)2.7 × ( 1.92 × ) as the number of lanes in a core decreases from 256 to 128 and 64 by utilizing composable NTTUs. Conversely, in the 4x4-BK-2x2 configuration that demands substantial NoP data transfer, halving the number of lanes by using composable NTTUs do not increase the delay by a large margin. Using 128 lanes leads to a slight 6% increase in delay and improves the overall EDAP by 12% due to smaller chiplet area.

Composable NTTUs are more efficient in terms of energy and area compared to alternative NTTUs. In Fig. 9(b), the reduction to 128 lanes using alternative NTTU also leads to an around 2% enhancement in EDAP. Nevertheless, due to the substantial wire requirements of alternative NTTUs, the energy consumption and area footprint are larger by 1.05×1.05\times1.05 × for both when compared to those of composable NTTUs, culminating in an EDAP value 1.10×1.10\times1.10 × larger than that achieved with composable NTTUs.

VI-E Limb Duplication Effects

The effects of limb duplication differ by each configuration. We selected the three best-performing configurations for each number of cores in Fig. 5(a) and compared the performance of bootstrapping with and without applying limb duplication. Fig. 8(a) shows the results. Limb duplication improved bootstrapping performance by up to 31% for 4×\times×8-BK-4×\times×4 but rather caused 28% performance degradation for 8×\times×8-BK-2×\times×4.

Although limb duplication makes fewer limbs exchanged, the bursty nature of data broadcasting causes network contention and introduces additional latency. The overhead of broadcasting in limb duplication is affected by the arrangement of limb clusters. Only 8×\times×8-BK-2×\times×4 in Fig. 8(a) has a coefficient cluster composed of eight cores; coefficient clusters in the other configurations consist of one to four cores. The large number of cores in a coefficient cluster greatly increases the broadcasting latency, degrading performance.

As limb duplication is effective for reducing the NoP pressure, increasing the NoP bandwidth reduced the performance gains from limb duplication (see Fig. 8(b)). Limb duplication enhanced the performance of the baseline configurations by 7% in geomean. With the NoP bandwidth doubled, limb duplication rather decreased performance by 3% in geomean, and the maximum performance gain observed for 4×8-BK-4×4 was reduced to 14% from 31% in the baseline.

When we intensified the NoP bandwidth bottleneck by doubling computational throughput, performance gains from using limb duplication increased (see Fig. 8(c)). In these configurations, limb duplication enhanced performance by 14% in geomean, and the maximum performance gain reached 43%.

VI-F Scalability

Refer to caption
Figure 10: Performance of CiFHER while scaling the number of cores under default and double NoP bandwidth conditions. The number of NTT submodules in each core is fixed to eight.

To evaluate the scalability of CiFHER, we estimated the performance of each workload while increasing the number of cores in Fig. 10. We set the internals of a core fixed to an eight-NTTU-submodule setting (128 lanes). Limb duplication and dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT-BK-dx/2×dy/2subscript𝑑𝑥2subscript𝑑𝑦2\nicefrac{{d_{x}}}{{2}}\times\nicefrac{{d_{y}}}{{2}}/ start_ARG italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG × / start_ARG italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG block clustering were used for a dx×dysubscript𝑑𝑥subscript𝑑𝑦d_{x}\times d_{y}italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT package configuration. We fixed the HBM bandwidth and the bisection bandwidth of a package. Thus, the bandwidth of a network edge decreases as the number of cores increases.

The limited bandwidth of NoP restricts the scalability of CiFHER. When we increased the total computational throughput by increasing the number of cores from 4 to 16, 1.99–2.11×\times× of speedups were achieved for the workloads. Increasing from 16 to 64 only resulted in a 1.07×\times× speedup. Therefore, increasing the number of cores without consideration for proper sizing of the core has an adversarial effect on efficiency. Meanwhile, skewed configurations (8-core and 32-core), which already suffer from the NoP bandwidth imbalance, showed suboptimal performance.

HBM bandwidth becomes the next limiting factor for scalability when NoP bandwidth is sufficient. Doubling NoP bandwidth mitigates the NoP pressure and improves the performance of the 16-core configuration by 1.34×\times× in geomean. Scalability slightly improves as 4-to-16 and 16-to-64 number-of-cores transitions result in 2.23×\times× and 1.13×\times× speedups. However, the degree of speedups still falls largely behind the increase in the number of cores as the hardware becomes bottlenecked by the HBM bandwidth, whose utilization reaches 75–80% in the 64-core configuration.

VII Related Work

Function-level acceleration and leveled HE (LHE): [52] accelerates non-RNS CKKS using Intel AVX-512 instructions and GPUs. HEXL [11] specifically utilizes AVX512-IFMA52 instructions for further CPU optimization. [97] accelerates (i)NTT through high-radix FFT and instruction-level optimizations using Intel GPUs. [57] identifies off-chip memory access as the performance bottleneck of (i)NTT on GPUs and applies a better use of the memory hierarchy of GPUs, improving data reusability to address the bottleneck. [6] accelerates the BFV [16, 4] scheme by replacing the major functions with ones more suitable for GPUs. [5] further partitions data to multiple GPUs to exploit abundant parallelism inside BFV.

Numerous FPGA works attempted to optimize core HE ops by deploying tailored datapaths. [79] and [90] proposed specialized FUs for the BFV scheme. HEAX [78] and coxHE [39] designed FPGA modules targeting key-switching, a dominant HE op in CKKS. [59] supports larger parameters adequate for FHE CKKS; however, it only accelerates (i)NTT.

Torus LWE (TLWE) FHE: TLWE-based FHE schemes [23, 31] are fundamentally different from RLWE in that usually a single prime modulus (limb) is utilized. Recent studies have specifically targeted the TFHE [23] scheme. cuFHE [92] speeds up TFHE computations on a GPU by utilizing primitive functions accelerated in cuHE [29], which uses a specialized prime to reduce the cost of modular reduction. [71] implements an FPGA datapath for an optimized TFHE bootstrapping algorithm using inter-operation parallelism. In TFHE, floating-point Fourier transform is often used instead of NTT. [68] utilizes bit-level parallelism and the NVIDIA cuFFT library to accelerate TFHE on GPUs. Matcha [49] is the first ASIC proposal and devises an approximate integer FFT method that replaces mult ops with shift and add ops.

Ring LWE (RLWE) FHE: [51] applies kernel fusion and parameter optimization to accelerate CKKS bootstrapping on GPUs. TensorFHE [34] utilizes the tensor cores in recent NVIDIA GPUs for (i)NTT. [85] introduces new microarchitectural extensions that minimize redundant data access to DRAM and enhance the computational throughput of modular reduction operations, further boosting the performance of FHE operations on current GPUs. FAB [2] reduces the working set to alleviate the off-chip memory bandwidth bottleneck even with limited FPGA memory space. Poseidon [95] builds a 512-lane vector FPGA architecture with FUs reducing the on-chip memory bandwidth pressure, such as high-radix NTTUs. GPU and FPGA solutions achieve orders of magnitude higher performance compared to CPU but recent ASIC FHE accelerators [80, 81, 58, 55] introduced in Section II-D achieve even higher performance gains. [3] also introduces an FHE accelerator design using chiplets to reduce chip size but only covered a fixed number of chiplets, set as the sweet spot.

VIII Conclusion

We have proposed CiFHER, a fully homomorphic encryption (FHE) accelerator based on a multi-chip module (MCM). We devised methods for minimizing die-to-die communication, which incurs serious latency and energy overhead for an MCM. Our data mapping limits data communication to happen within a cluster, reducing the traveling distance of data and alleviating the network pressure. A novel limb duplication algorithm eliminates data transfer caused by redistributing output during base conversion, a key function of FHE. By utilizing multiple chiplets with a resizable structure, enabled by using composable number-theoretic transform units, we created various optimized MCM package configurations of CiFHER. Multiple chiplets, each configurable to be as small as 4.28mm2, collaborate efficiently with our data mapping methods and algorithms; the resulting CiFHER performs comparable to state-of-the-art monolithic ASIC accelerators with significantly reduced area and energy.

Acknowledgement

This work was supported in part by Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd. and by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-01343 and IITP-2023-RS-2023-00256081). The EDA tool was supported by the IC Design Education Center (IDEC), Korea. Jung Ho Ahn, the corresponding author, is with the Department of Intelligence and Information, the Interdisciplinary Program in Artificial Intelligence, and ICT (Institute of Computer Technology), Seoul National University, South Korea.

References

  • [1] R. Agrawal et al., “High-Precision RNS-CKKS on Fixed but Smaller Word-Size Architectures: Theory and Application,” in Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2023, pp. 23–34.
  • [2] R. Agrawal et al., “FAB: An FPGA-based Accelerator for Bootstrappable Fully Homomorphic Encryption,” in HPCA, 2023, pp. 882–895.
  • [3] A. Aikata, A. C. Mert, S. Kwon, M. Deryabin, and S. S. Roy, “REED: Chiplet-Based Scalable Hardware Accelerator for Fully Homomorphic Encryption,” arXiv preprint arXiv:2308.02885, 2023.
  • [4] A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, “Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme,” IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 2, pp. 941–956, 2019.
  • [5] A. Al Badawi, B. Veeravalli, J. Lin, N. Xiao, M. Kazuaki, and A. Khin Mi Mi, “Multi-GPU Design and Performance Evaluation of Homomorphic Encryption on GPU Clusters,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 2, pp. 379–391, 2021.
  • [6] A. Al Badawi, B. Veeravalli, C. F. Mun, and K. M. M. Aung, “High-performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 70–95, 2018.
  • [7] M. Albrecht et al., “Homomorphic Encryption Standard,” in Protecting Privacy through Homomorphic Encryption.   Springer, 2021, pp. 31–62.
  • [8] A. Arunkumar et al., “MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability,” in ISCA, 2017, pp. 320–332.
  • [9] A. A. Badawi and Y. Polyakov, “Demystifying Bootstrapping in Fully Homomorphic Encryption,” Cryptology ePrint Archive, Paper 2023/149, 2023. [Online]. Available: https://eprint.iacr.org/2023/149
  • [10] D. H. Bailey, “FFTs in External or Hierarchical Memory,” in ACM/IEEE Conference on Supercomputing, 1989, pp. 234–242.
  • [11] F. Boemer, S. Kim, G. Seifu, F. D. M. de Souza, and V. Gopal, “Intel HEXL: Accelerating Homomorphic Encryption with Intel AVX512-IFMA52,” in Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2021, pp. 57–62.
  • [12] J. Bossuat, C. Mouchet, J. R. Troncoso-Pastoriza, and J. Hubaux, “Efficient Bootstrapping for Approximate Homomorphic Encryption with Non-sparse Keys,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021, pp. 587–617.
  • [13] J. Bossuat, J. Troncoso-Pastoriza, and J. Hubaux, “Bootstrapping for Approximate Homomorphic Encryption with Negligible Failure-Probability by Using Sparse-Secret Encapsulation,” in Applied Cryptography and Network Security, 2022, pp. 521–541.
  • [14] F. Bourse, M. Minelli, M. Minihold, and P. Paillier, “Fast Homomorphic Evaluation of Deep Discretized Neural Networks,” in Annual International Cryptology Conference, 2018, pp. 483–512.
  • [15] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(Leveled) Fully Homomorphic Encryption without Bootstrapping,” ACM Transactions on Computing Theory, vol. 6, no. 3, pp. 1–36, 2014.
  • [16] Z. Brakerski and V. Vaikuntanathan, “Efficient Fully Homomorphic Encryption from (Standard) LWE,” SIAM Journal on Computing, vol. 43, no. 2, pp. 831–871, 2014.
  • [17] L. Cao, “Advanced Packaging Technology Platforms for Chiplets and Heterogeneous Integration,” in International Electron Devices Meeting, 2022, pp. 3.3.1–3.3.4.
  • [18] J. Chang et al., “A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low-VMIN Applications,” in IEEE International Solid-State Circuits Conference, 2017, pp. 206–207.
  • [19] H. Chen and K. Han, “Homomorphic Lower Digits Removal and Improved FHE Bootstrapping,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2018, pp. 315–337.
  • [20] H. Chen, K. Laine, and R. Player, “Simple Encrypted Arithmetic Library - SEAL v2.1,” in Financial Cryptography and Data Security, 2017, pp. 3–18.
  • [21] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “A Full RNS Variant of Approximate Homomorphic Encryption,” in Selected Areas in Cryptography, 2018, pp. 347–368.
  • [22] J. H. Cheon, A. Kim, M. Kim, and Y. S. Song, “Homomorphic Encryption for Arithmetic of Approximate Numbers,” in International Conference on the Theory and Applications of Cryptology and Information Security, 2017, pp. 409–437.
  • [23] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, “TFHE: Fast Fully Homomorphic Encryption Over the Torus,” Journal of Cryptology, vol. 33, no. 1, pp. 34–91, 2020.
  • [24] I. Chillotti, M. Joye, and P. Paillier, “Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks,” in Cyber Security Cryptography and Machine Learning, 2021, pp. 1–19.
  • [25] L. T. Clark et al., “ASAP7: A 7-nm FinFET Predictive Process Design Kit,” Microelectronics Journal, vol. 53, pp. 105–115, 2016.
  • [26] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
  • [27] CryptoLab Inc., “HEAAN v2.1,” Sep 2018. [Online]. Available: https://github.com/snucrypto/HEAAN
  • [28] B. R. Curtis and R. Player, “On the Feasibility and Impact of Standardising Sparse-secret LWE Parameter Sets for Homomorphic Encryption,” in Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2019, pp. 1–10.
  • [29] W. Dai and B. Sunar, “cuHE: A Homomorphic Encryption Accelerator Library,” in Cryptography and Information Security in the Balkans, 2016, pp. 169–186.
  • [30] W. J. Dally, “Virtual-Channel Flow Control,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194–205, 1992.
  • [31] L. Ducas and D. Micciancio, “FHEW: Bootstrapping Homomorphic Encryption in Less Than a Second,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2015, pp. 617–640.
  • [32] EPFL-LDS and Tune Insight SA, “Lattigo v4,” Aug 2022. [Online]. Available: https://github.com/tuneinsight/lattigo
  • [33] J. Fan and F. Vercauteren, “Somewhat Practical Fully Homomorphic Encryption,” Cryptology ePrint Archive, Paper 2012/144, 2012. [Online]. Available: https://eprint.iacr.org/2012/144
  • [34] S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, “TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU,” in HPCA, 2023, pp. 922–934.
  • [35] Y. Feng and K. Ma, “Chiplet Actuary: A Quantitative Cost Model and Multi-Chiplet Architecture Exploration,” in ACM/IEEE Design Automation Conference, 2022, p. 121–126.
  • [36] S. Halevi and V. Shoup, “Faster Homomorphic Linear Transformations in HElib,” in Annual International Cryptology Conference, 2018, pp. 93–120.
  • [37] K. Han, S. Hong, J. H. Cheon, and D. Park, “Logistic Regression on Homomorphic Encrypted Data at Scale,” in AAAI Conference on Artificial Intelligence, 2019, pp. 9466–9471.
  • [38] K. Han and D. Ki, “Better Bootstrapping for Approximate Homomorphic Encryption,” in Cryptographers’ Track at the RSA Conference, 2020, pp. 364–390.
  • [39] M. Han, Y. Zhu, Q. Lou, Z. Zhou, S. Guo, and L. Ju, “coxHE: A Software-Hardware Co-Design Framework for FPGA Acceleration of Homomorphic Computation,” in Design, Automation & Test in Europe Conference & Exhibition, 2022, pp. 1353–1358.
  • [40] S. Hong, S. Kim, J. Choi, Y. Lee, and J. H. Cheon, “Efficient Sorting of Homomorphic Encrypted Data With k-Way Sorting Network,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 4389–4404, 2021.
  • [41] S. Y. Hou et al., “Wafer-Level Integration of an Advanced Logic-Memory System Through the Second-Generation CoWoS Technology,” IEEE Transactions on Electron Devices, vol. 64, no. 10, pp. 4071–4077, 2017.
  • [42] Z. Huang, W. jie Lu, C. Hong, and J. Ding, “Cheetah: Lean and Fast Secure Two-Party Deep Neural Network Inference,” in USENIX Security Symposium, 2022, pp. 809–826.
  • [43] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations,” in ISCA, 2020, pp. 968–981.
  • [44] IEEE, “International Roadmap for Devices and Systems,” Tech. Rep., 2018. [Online]. Available: https://irds.ieee.org/editions/2018/
  • [45] IEEE, “Heterogeneous Integration Roadmap,” Tech. Rep., 2021. [Online]. Available: https://eps.ieee.org/technology/heterogeneous-integration-roadmap.html
  • [46] Y. Ishimaki, H. Imabayashi, K. Shimizu, and H. Yamana, “Privacy-Preserving String Search for Genome Sequences with FHE Bootstrapping Optimization,” in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 3989–3991.
  • [47] JEDEC, “High Bandwidth Memory DRAM (HBM3),” Tech. Rep. JESD238, 2022.
  • [48] W. Jeong et al., “True 7nm Platform Technology featuring Smallest FinFET and Smallest SRAM cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break,” in IEEE Symposium on VLSI Technology, 2018, pp. 59–60.
  • [49] L. Jiang, Q. Lou, and N. Joshi, “MATCHA: A Fast and Energy-Efficient Accelerator for Fully Homomorphic Encryption over the Torus,” in ACM/IEEE Design Automation Conference, 2022, pp. 235–240.
  • [50] N. P. Jouppi et al., “Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product,” in ISCA, 2021, pp. 1–14.
  • [51] W. Jung, S. Kim, J. Ahn, J. H. Cheon, and Y. Lee, “Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs,” IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2021, no. 4, pp. 114–148, 2021.
  • [52] W. Jung et al., “Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization,” IEEE Access, vol. 9, pp. 98 772–98 789, 2021.
  • [53] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, “{{\{{GAZELLE}}\}}: A Low Latency Framework for Secure Neural Network Inference,” in USENIX Security Symposium, 2018, pp. 1651–1669.
  • [54] A. Kannan, N. E. Jerger, and G. H. Loh, “Enabling Interposer-Based Disintegration of Multi-Core Processors,” in MICRO, 2015, pp. 546–558.
  • [55] J. Kim et al., “ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse,” in MICRO, 2022, pp. 1237–1254.
  • [56] M. Kim et al., “Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation,” Cell Systems, vol. 12, no. 11, pp. 1108–1120.e4, 2021.
  • [57] S. Kim, W. Jung, J. Park, and J. Ahn, “Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs,” in IEEE International Symposium on Workload Characterization, 2020, pp. 264–275.
  • [58] S. Kim et al., “BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption,” in ISCA, 2022, pp. 711–725.
  • [59] S. Kim, K. Lee, W. Cho, Y. Nam, J. H. Cheon, and R. A. Rutenbar, “Hardware Architecture of a Number Theoretic Transform for a Bootstrappable RNS-based Homomorphic Encryption Scheme,” in IEEE International Symposium on Field-Programmable Custom Computing Machines, 2020, pp. 56–64.
  • [60] T. Kim, Y. Oh, and H. Kim, “Efficient Privacy-Preserving Fingerprint-Based Authentication System Using Fully Homomorphic Encryption,” Security and Communication Networks, vol. 2020, pp. 1–11, 2020.
  • [61] E. Lee et al., “Low-Complexity Deep Convolutional Neural Networks on Fully Homomorphic Encryption Using Multiplexed Parallel Convolutions,” in International Conference on Machine Learning, 2022, pp. 12 403–12 422.
  • [62] J.-W. Lee et al., “Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network,” IEEE Access, vol. 10, pp. 30 039–30 054, 2022.
  • [63] S. Li, J. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” in MICRO, 2009, pp. 469–480.
  • [64] Y. Li, A. Louri, and A. Karanth, “SPACX: Silicon Photonics-based Scalable Chiplet Accelerator for DNN Inference,” in HPCA, 2022, pp. 831–845.
  • [65] Lotame, “IDFA and Big Tech Impact – One Year Later,” 2022. [Online]. Available: https://www.lotame.com/idfa-and-big-tech-impact-one-year-later
  • [66] R. Mahajan et al., “Embedded Multi-die Interconnect Bridge (EMIB) – A High Density, High Bandwidth Packaging Interconnect,” in IEEE Electronic Components and Technology Conference, 2016, pp. 557–565.
  • [67] A. C. Mert, E. Öztürk, and E. Savaş, “Design and Implementation of a Fast and Scalable NTT-Based Polynomial Multiplier Architecture,” in Euromicro Conference on Digital System Design, 2019, pp. 253–260.
  • [68] T. Morshed, M. M. A. Aziz, and N. Mohammed, “CPU and GPU Accelerated Fully Homomorphic Encryption,” in IEEE International Symposium on Hardware Oriented Security and Trust, 2020, pp. 142–153.
  • [69] S. Naffziger et al., “Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families : Industrial Product,” in ISCA, 2021, pp. 57–70.
  • [70] S. Naffziger, K. Lepak, M. Paraschou, and M. Subramony, “AMD Chiplet Architecture for High-Performance Server and Desktop Products,” in IEEE International Solid-State Circuits Conference, 2020, pp. 44–45.
  • [71] K. Nam, H. Oh, H. Moon, and Y. Paek, “Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA,” in IEEE/ACM International Conference on Computer-Aided Design, 2022, pp. 1–9.
  • [72] S. Narasimha et al., “A 7nm CMOS Technology Platform for Mobile and High Performance Compute Application,” in IEEE International Electron Devices Meeting, 2017, pp. 29.5.1–29.5.4.
  • [73] N. Nassif et al., “Sapphire Rapids: The Next-Generation Intel Xeon Scalable Processor,” in IEEE International Solid-State Circuits Conference, vol. 65, 2022, pp. 44–46.
  • [74] W. Oed and O. Lange, “On the Effective Bandwidth of Interleaved Memories in Vector Processor Systems,” IEEE Transactions on Computers, vol. C-34, no. 10, pp. 949–957, 1985.
  • [75] M. O’Connor et al., “Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems,” in MICRO, 2017, pp. 41–54.
  • [76] G. Pradel and C. Mitchell, “Privacy-Preserving Biometric Matching Using Homomorphic Encryption,” in IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2021, pp. 494–505.
  • [77] O. Regev, “On Lattices, Learning with Errors, Random Linear Codes, and Cryptography,” Journal of the ACM, vol. 56, no. 6, pp. 1–40, 2009.
  • [78] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “HEAX: An Architecture for Computing on Encrypted Data,” in ASPLOS, 2020, pp. 1295–1309.
  • [79] S. S. Roy, F. Turan, K. Järvinen, F. Vercauteren, and I. Verbauwhede, “FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data,” in HPCA, 2019, pp. 387–398.
  • [80] N. Samardzic et al., “F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption,” in MICRO, 2021, pp. 238–252.
  • [81] N. Samardzic et al., “CraterLake: A Hardware Accelerator for Efficient Unbounded Computation on Encrypted Data,” in ISCA, 2022, pp. 173–187.
  • [82] G. Seiler, “Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography,” Cryptology ePrint Archive, Paper 2018/039, 2018. [Online]. Available: https://eprint.iacr.org/2018/039
  • [83] A. Shafaei, Y. Wang, X. Lin, and M. Pedram, “FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices,” in IEEE Computer Society Annual Symposium on VLSI, 2014, pp. 290–295.
  • [84] Y. S. Shao et al., “Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture,” in MICRO, 2019, pp. 14–27.
  • [85] K. Shivdikar et al., “GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption,” in MICRO, 2023, pp. 670–684.
  • [86] T. Song et al., “A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications,” in IEEE International Solid-State Circuits Conference, 2018, pp. 198–200.
  • [87] H. S. Stone, “Parallel Processing with the Perfect Shuffle,” IEEE Transactions on Computers, vol. C-20, no. 2, pp. 153–161, 1971.
  • [88] E. Talpes et al., “The Microarchitecture of DOJO, Tesla’s Exa-Scale Computer,” IEEE Micro, pp. 1–5, 2023.
  • [89] Z. Tan, H. Cai, R. Dong, and K. Ma, “NN-BATON: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators,” in ISCA.   IEEE, 2021, pp. 1013–1026.
  • [90] F. Turan, S. S. Roy, and I. Verbauwhede, “HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA,” IEEE Transactions on Computers, vol. 69, no. 8, pp. 1185–1196, 2020.
  • [91] Universal Chiplet Interconnect Express, “Universal Chiplet Interconnect express (UCIe) Specification,” Tech. Rep., 2022. [Online]. Available: https://www.uciexpress.org/specification
  • [92] Vernam Group, “CUDA-Accelerated Fully Homomorphic Encryption Library,” Feb 2019. [Online]. Available: https://github.com/vernamlab/cuFHE
  • [93] S. Wu et al., “A 7nm CMOS Platform Technology Featuring 4th Generation FinFET Transistors with a 0.027um2 High Density 6-T SRAM cell for Mobile SoC Applications,” in IEEE International Electron Devices Meeting, 2016, pp. 2.6.1–2.6.4.
  • [94] J. Xia, C. Cheng, X. Zhou, Y. Hu, and P. Chun, “Kunpeng 920: The First 7-nm Chiplet-Based 64-Core ARM SoC for Cloud Services,” IEEE Micro, vol. 41, no. 5, pp. 67–75, 2021.
  • [95] Y. Yang, H. Zhang, S. Fan, H. Lu, M. Zhang, and X. Li, “Poseidon: Practical Homomorphic Encryption Accelerator,” in HPCA, 2023, pp. 870–881.
  • [96] F. Zaruba, F. Schuiki, and L. Benini, “Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing,” IEEE Micro, vol. 41, no. 2, pp. 36–42, 2021.
  • [97] Y. Zhai et al., “Accelerating Encrypted Computing on Intel GPUs,” in IEEE International Parallel and Distributed Processing Symposium, 2022, pp. 705–716.