Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS Cloud

Published: 08 August 2022 Publication History

Abstract

We present a High Level Synthesis compiler that automatically obtains a multi-chip accelerator system from a single-threaded sequential C/C++ application. Invoking the multi-chip accelerator is functionally identical to invoking the single-threaded sequential code the multi-chip accelerator is compiled from. Therefore, software development for using the multi-chip accelerator hardware is simplified, but the multi-chip accelerator can exhibit extremely high parallelism. We have implemented, tested, and verified our push-button system design model on multiple field-programmable gate arrays (FPGAs) of the Amazon Web Services EC2 F1 instances platform, using, as an example, a sequential-natured DES key search application that does not have any DOALL loops and that tries each candidate key in order and stops as soon as a correct key is found. An 8- FPGA accelerator produced by our compiler achieves 44,600 times better performance than an x86 Xeon CPU executing the sequential single-threaded C program the accelerator was compiled from. New features of our compiler system include: an ability to parallelize outer loops with loop-carried control dependences, an ability to pipeline an outer loop without fully unrolling its inner loops, and fully automated deployment, execution and termination of multi-FPGA application-specific accelerators in the AWS cloud, without requiring any manual steps.

1 Introduction

The sequential programming abstraction is known to increase programmer productivity and (through behavioral or high-level synthesis (HLS)) hardware designer productivity as well, as compared to parallel programming and traditional parallel hardware design, e.g., because the sequential programming abstraction is deterministic and is free from difficult, hard-to-debug race conditions, and because the sequential programming abstraction relieves the programmer from the extra burden of proving that a parallel version of the program is equivalent to a sequential specification. Today, the sequential programming abstraction is mature: The C language and, in particular, the C++ language allow a programmer to express a sequential algorithm at a fairly high level, which is then reliably compiled by mature optimizing compilers into the very efficient sequential object code of an industry standard processor.
The sequential programming abstraction has often been associated with the Von Neumann computational model, which processes data sequentially, one word at a time: in particular, John Backus in his 1977 Turing award lecture remarked that the Von Neumann computation model has imposed “an intellectual bottleneck that has kept us tied to word-at-a-time thinking” [17]. New parallel programming models have been proposed throughout the years (e.g., [12, 36, 37, 38, 39]) to replace the sequential abstraction model and to overcome this word-at-a-time thinking. However, we believe that the Von Neumann computational model does not fairly reflect the inherent parallelism within the sequential abstraction: with sufficient hardware resources, parallelism in the sequential abstraction can indeed be very large, even in sequential-natured applications (see, for instance [44, 83]). Instructions of a sequential program do not need to be executed one after the other, they merely need to appear as if they were executed one after the other. Indeed, only the instructions in the execution trace of a sequential program that depend on each other should be executed sequentially: independent instructions can be executed in parallel. Parallel execution of a sequential program can be completed in a time period almost as short as the length of the critical path in the execution trace, i.e., almost in optimal time, except in some known corner cases [79]. The authors believe that parallel hardware accelerator design with sequential code involves understanding the concept of minimum initiation interval (MII) in an outermost loop of a program (MII is the minimum number of cycles between the start of iteration n and the start of iteration \(n+1\)), seeing the maximum parallelism that already exists in the sequential code if sufficient resources are provided, and then, if needed, improving the sequential code to reduce this MII further, to obtain an even higher performance hardware accelerator.
Hence, sequential code is merely an abstraction, which is useful because of its simplicity: It is not the same as the Von Neumann computational model. In the present work, we are in fact proposing the inherent parallelism in sequential code as a parallel programming model for creating application-specific hardware accelerator systems. We will describe some of our new High Level Synthesis compiler contributions that actually approach parallelism limits in sequential code.
We have created a high level synthesis compiler system where a sequential code, without any parallelization directives, is the full system high performance parallel hardware description. Since only the dependent operations in an execution trace of a program need to be executed sequentially, and since all other operations in the execution trace can be executed in parallel, sequential code already has plenty of parallelism, as mentioned above. For this reason, our compiler does not require constructs outside of ordinary sequential code semantics to extract parallelism, unlike some existing HLS tools that extend C/C++ with various explicit parallelism constructs expressed through pragmas, that are not part of sequential C/C+ (e.g., a “function level pipelining” pragma, or a “dataflow” pragma in Vivado(TM) [91]). Our compiler, therefore, offers an important productivity advantage by creating multi-chip application-specific hardware accelerators merely from sequential software programs. Such a multi-chip application-specific hardware accelerator would be difficult to create by existing methods.
Software development can be a large cost factor in new hardware development projects. For example, the development of new assemblers, compilers, APIs, and kernel drivers is usually required to use the new hardware. But the multi-field-programmable gate array (FPGA) hardware accelerator produced by our compiler is functionally identical to the sequential single-threaded C/C++ function (including its hierarchy of sub-functions) the accelerator was compiled from. Thus, invoking the multi-FPGA hardware accelerator is simply equivalent to invoking, in software, the sequential single-threaded C/C++ function the accelerator was compiled from. This feature of our compiler system can reduce the software development costs of deploying new multi-chip hardware accelerators inside or outside the cloud.
Improvements in processor performance have been maintained for many years thanks to Moore’s law [78] and Dennard scaling law [35]. However, after reaching the “voltage scaling wall” in semiconductor and chip manufacturing technologies, the laws of Dennard scaling have broken down [19], meaning that it is no longer possible to have constant power density between chip generations to attain performance improvements in processors. Note that increasing the number of cores in multi-core processors cannot always be translated into performance improvements in a user’s application, even for server workloads that contain a large scale of parallelism [49].
As one means to remedy the stalled upward trend in the performance of general purpose processors, we believe in designing application-specific hardware architectures aim at achieving the best performance and power efficiency in the user’s application itself (as opposed to in the circuits of a processor, which interprets the software instructions implementing the user’s application). We believe that designing application-specific hardware is preferable to waiting for performance improvements in general purpose processors through advancements in semiconductor technology [34, 46]. The combination of high performance FPGAs and HLS compilers can even today make it practical to generate dedicated specialized hardware for each different problem. As for ASIC design, “semi-reconfigurable ASIC”s can be created by an HLS compiler while still maintaining the performance and power efficiency benefits of an ASIC, while reducing Non-Recurring Expenses through an ASIC “union chip” that can take on different identities depending on the configuration, as we have proposed in [42].
Without implying any lack of generality in our approach, we have selected a highly sequential version of the Data Encryption Standard (DES) key search algorithm (DES cracking algorithm) as a running example and case study in this article, as given in Algorithm 1 (“find the first correct key, trying all candidate keys in strictly sequential order”, where a “correct” key is one that decrypts the given ciphertext and obtains the given plaintext), to demonstrate the high performance achieved by our HLS compiler, which takes the optimized sequential x86 assembler code produced by the gcc compiler as input (including source level information made available by a -g gcc debug option) and automatically generates large scale hardware engines spanning multi-FPGA devices in the Amazon Elastic Compute Cloud platform [10]. The hardware accelerator produced by our compiler behaves identically to the sequential single-threaded C program, it was compiled from.
While the DES has little security value today, DES cracking remains an excellent example of an HLS compiler test program, that still has high computation requirements. Note that while DES key search is usually implemented with an embarrassingly parallel problem specification [29], Algorithm 1 is not embarrassingly parallel in the traditional sense, since the loops of Algorithm 1 have a conditional, data-dependent exit on line 19, and are therefore not DOALL loops. The conditional, data-dependent exit from the loops of Algorithm 1 requires that candidate keys be tried in strict sequential order and furthermore resists easy automatic parallelization on a SIMD or multicore machine. While DOALL loops are easy to handle in a parallelizing compiler, Algorithm 1 is an automatic parallelization challenge, which is representative of general sequential single-threaded algorithms where strict sequential order matters (e.g., searching a key in a linked list of linked lists of key-value pairs as in the xlygetvalue function on page 3 of [66] in the SPEC95 li benchmark follows the same nested loops and conditional, data-dependent exit pattern as in Algorithm 1). We will demonstrate the automatic parallelization of Algorithm 1 as written, in the present article, using our HLS compiler.

2 Contributions

Our compilation techniques aim at reaching the theoretical performance limit of the sequential application being compiled, namely, an execution time equal to as little as the critical path length of the sequential application’s execution trace, where possible. Our compiler creates a complete parallel hardware accelerator system (and not just a component of a larger system) specified by the input sequential program, where the sequential program is the parallel hardware accelerator full system hardware description.
In view of the DES cracking application given in Algorithm 1 (“find the first key that decrypts the given ciphertext and obtains the given plaintext, trying all candidate keys in strictly sequential order”) used in this article as an example, our compiler’s differences from existing HLS techniques include:
An ability to parallelize outer loops with loop-carried control dependences;
An ability to pipeline an outer loop without fully unrolling its inner loops;
Multi-chip system design;
Super Logic Region-aware placement by auto-pipelining; and
Productive deployment of multi-chip application-specific hardware accelerators in the cloud.
We will summarize some new features of our high-level synthesis compiler for creating multi-FPGA systems below.

2.1 Ability to Parallelize Outer Loops with Loop-carried Control Dependences

Within existing parallel execution programming platforms such as OpenMP [32] (for multi-core processors), MPI [25] (for distributed parallel processors), or NVIDIA Cuda [73] (for General-Purpose Computation on Graphics Processing Unit (GPGPU) machines with many parallel threads), outer or inner loops can be efficiently parallelized only if there are no loop-carried dependences [5, 6, 7], i.e., all iterations of the loop can be executed simultaneously without fear of giving an incorrect result. Synchronization is expensive on existing architectures [30, 31, 55, 80, 96].
HLS, since it creates a customized, application-specific hardware design, is not limited by the multi-core, distributed computing, or GPGPU architecture paradigms, and has an unbounded ability to overcome limitations of existing architectures and to create the right parallel hardware for a particular application. i.e., with HLS, “sky’s the limit” in terms of hardware acceleration innovation. But, while innermost loops with loop-carried dependences have been parallelized in previous compiler literature on Instruction Level Parallelism as well as on HLS, using software pipelining [4, 33, 43, 45, 64, 68, 75, 88] (also present with some limitations in today’s commercial HLS platforms [91]) existing HLS techniques have not been able to parallelize outer loops with loop-carried dependences.
As an example of how our compilation techniques parallelize outer loops with loop carried control dependences, in the DES cracking problem as stated in this document in Algorithm 1, there is a loop carried control dependence in the outer loop (in the sequential view, iteration p of the outer loop is executed, and if a key is found, the outer loop ends before starting the next outer loop iteration p + 1). In our solution, many iterations p + 1, p + 2\(, \ldots\) of the outer loop are started speculatively, while iteration p of the outer loop is still executing, and are discarded as wasted work if iteration p of the outer loop exits; therefore, overcoming the sequential-natured execution ostensibly required by loop-carried control dependences. The reader is referred to [42] for further details of our compilation techniques for overcoming loop-carried control and data dependences in outer loops and related synchronization requirements.

2.2 Ability to Pipeline an Outer Loop without Fully Unrolling Its Inner Loops

Existing commercial HLS tools (e.g., Vivado [91]) are unable to pipeline an outer loop without fully unrolling all its inner loops first. But fully unrolling inner loops leads to wasteful use of the hardware resources, on the order of (number of iterations x number of instructions per iteration). Just the act of fully unrolling an inner loop alone can exceed the resources of an FPGA chip. Pipelining the fully unrolled inner loop exacerbates the already high resource requirements, Furthermore, many loops such as for loops with unknown bounds or while loops such as one traversing a linked list, cannot be fully unrolled.
Unlike previous HLS techniques, our compilation techniques can pipeline an outer loop (as exemplified in the DES outer loop without fully unrolling inner loops. Pipelining an outer loop is similar to pipelining an inner loop; it means: starting iteration \(n+1\), \(n+2, \ldots\) of the outer loop, before iteration n of the outer loop is finished, when dependences and resources permit. Our technique of pipelining an outer loop, called hierarchical software pipelining [42] is achieved in summary by pipelining an inner loop, creating many copies of the pipelined inner loop, and adding a result reordering circuit so that the results of the inner loops are observed in sequential order by the outer loop, i.e., the collection of inner loops is made to behave like a simple pipelined multiplier that instead of taking a pair of multiplication operands every cycle, takes the inputs of an entire loop invocation of an inner loop every cycle, and delivers an entire inner loop invocation result every cycle after the pipeline fill time has elapsed, when dependences and resources permit (the entire loop invocation of an inner loop means: all code from the first instruction of the first iteration to the last instruction of the last iteration of the inner loop; it does not mean a single iteration of the inner loop). It is possible to achieve 1 cycle per iteration in the outer loop with sufficient duplication of the inner loops. By applying the technique to outer-outer, outer-outer-outer\(, \ldots\) loops recursively, even the outermost loop in a loop nest can be executed at a rate of one cycle per iteration if dependences permit and sufficient multi-chip hardware resources are provided. E.g., consider adding a further outer loop
to the searchkey function in Algorithm 1, which would greatly increase the number of required FPGA chips and the associated parallelism. The loop nest can have arbitrary structure and arbitrary control flow, for example, an inner loop can conditionally take an early exit from not just itself but from the outer loop as well, as in line 19 of the DES cracking example here in Algorithm 1. Because each pipelined inner loop usually occupies less resources than a fully unrolled and then pipelined inner loop, and because pipelined inner loop hardware components communicate with other hardware components not by ordinary wires but by packet-switched networks (which make highly scalable design partitioning possible), many copies of a pipelined inner loop can be created and distributed across multiple chips, using our scalable design partitioning technique, further summarized below.
Liu et al. in [67] proposed software pipelining of an outer loop without fully unrolling its inner loops, but were unaware of our earlier work on this topic in [42]. Furthermore, [67] did not propose a method for handling loop-carried dependences in an outer loop, or for scalable partitioning of a resulting large design into multiple chips. This reference will be discussed further in the related work section.

2.3 Multi-chip Design

While there have been manually designed multi-chip FPGA hardware accelerator designs as it will explained in related work section, existing HLS tools [21, 28, 59, 60, 72, 92] are unable to automatically create a multiple chip application-specific hardware accelerator system from sequential code.
Our compilation approach is able to automatically partition a large hardware accelerator design into multiple chips as exemplified by the DES cracking application of the present paper (see Figures 1 and 2). Our compiler uses I/O controllers to address design partitioning and chip-to-chip communication. In summary, an outgoing message from a message source component in a message source FPGA chip is routed within an on-chip partial network to reach the I/O controller of the message source FPGA chip. The message is then converted to a standard UDP message format and sent out of the FPGA and out of the network interface of the AWS EC2 F1 machine instance through an efficient Data Plane Development Kit (DPDK) software “poll-mode driver”, which operates without invoking the kernel (see the virtual ethernet and the cl_sde design in AWS FPGA github [9], and DPDK software [58], a fast packet processing library). The message then travels through the AWS virtual private cloud of the user, and reaches the network interface of the destination AWS EC2 F1 machine instance and through DPDK reaches the destination FPGA chip containing the destination component. The I/O controller of the destination FPGA chip converts the incoming message back to its original form and then the message is routed to the message destination component via an on-chip partial network.
Fig. 1.
Fig. 1. Block diagram of flat, non-partitioned DES cracking hardware accelerator design and its legend for the blocks.
Fig. 2.
Fig. 2. Block diagram of partitioned DES cracking hardware accelerator (with eight FPGAs, 16 inner loops per FPGA) hardware design. Only the first partition has top task adapter and the outer loop finite state machine.

2.4 Compiler-level Super Logic Region Crossing with Auto-pipelining

Xilinx FPGAs comprise multiple “Super Logic Regions” (SLRs) that arise from 2.5D stacking technology, where there is a significant added delay when a signal must traverse more than one SLR. Since the highest performance FPGA designs will not fit into a single SLR, the timing closure problem arising from crossing SLRs must be addressed.
The auto-pipelining feature provided by the Vivado tool inserts additional pipeline registers automatically during the placement phase to help achieve timing closure (see pp. 79-84 of [93]). A hardware component using the auto-pipelined register chain feature, is normally functionally equivalent to a FIFO, but whose sender (input) side may be very far away from its receiver (output) side, in a different SLR. But since the number of register stages the Vivado placer will add to the auto-pipelined register chain is unpredictable at an early compilation stage, one must use an extra internal FIFO with enough elements on the receiver side of the component, to ensure that there will not be any buffer overruns, as the sender side sends new data items without being aware of buffer overrun hazards on the receiver side. As our contribution, we have added a very light-weight and energy-efficient credit based flow control to the sender side (input side) of our component in the form of a small counter predicting the “remaining internal FIFO buffer elements in the receiver side”, so that a much smaller internal FIFO (with only 2 or 3 elements) can be used in the receiver side instead of an internal FIFO large enough to cover the worst case traffic. This simplification allows our compiler to use many duplicates of our new auto-pipelined FIFO component pervasively with little penalty in latency or chip resources, in every design part where SLR crossing can potentially occur, without having to know exactly where the SLR crossings will occur. Then, during the placement phase, on a long wire, more registers are added and on a short wire less registers (possibly only one) are added by Vivado. Our new auto-pipelined FIFO helped us meet a 250 MHz timing goal by merely using Vivado’s automatic place-and-route with a design spanning multiple SLRs.

2.5 Productive Deployment of Multi-chip Application-specific Hardware Accelerators in the Cloud

Large companies such as Microsoft and Google have already deployed multi-chip FPGA-based and ASIC based hardware accelerators in the cloud. However, only a few companies (AWS [9, 10, 11] and Alibaba Cloud [2, 3]) allow the user to directly program FPGAs in the cloud at the Register Transfer Level. But using FPGAs in the cloud is not easy. For example, even for an expert engineer, designing and deploying just one AWS EC2 F1 FPGA instance in the cloud requires many manual steps using the AWS console interface and manual editing of tcl scripts as needed. The requirement for taking such manual steps currently hampers human engineering productivity. Our HLS compiler helps to overcome this productivity barrier, by means of:
(1)
a gcc-compatible compilation command for creating a multi-chip FPGA-based application-specific hardware accelerator from a sequential C/C++ program. The executable file generated by our compiler automatically deploys an application-specific multi-chip FPGA-based hardware accelerator in the cloud, intended solely for accelerating the given sequential C/C++ program. It gives the same results as gcc without hardware acceleration but can do so significantly faster (for the DES example, an 8 FPGA accelerator produced by our compiler achieves \(\mathbf {44,\!600}\) times better performance than an x86 Xeon CPU executing the sequential single-threaded C program the accelerator was compiled from).
(2)
newly created commands (using the AWS Command Line Interface primitives under the covers) for automating all the steps of creating an FPGA-based hardware accelerator in the cloud, running the sequential program with FPGA-based hardware acceleration, and terminating the FPGA-based hardware accelerator in the cloud as soon as the application is finished, to minimize costs to the user.
After this summary, our HLS compiler’s new features will be elaborated in further detail in the sections below.

3 multi-FPGA System Architecture

Our general system architecture is depicted in Figure 3, which shows our cloud-level hyper-scale vision for application-specific multi-FPGA architectures based on AWS F1 2x.large instances. AWS EC2 F1 instances are located in the same “availability zone” and the same “placement group” (for achieving lower latency communication), and are connected to each other through 10 Gbps Ethernet links through an AWS “virtual private cloud” network belonging to the user. We developed a UDP packet-based interface between F1 instances for instance-to-instance communication.
Fig. 3.
Fig. 3. Multi-FPGA System Architecture based on 8 AWS F1.2xlarge Instances with mapping of compiler generated partitions for each FPGA.

3.1 AWS EC2 F1 Instance FPGA Architecture

AWS EC2 F1 is an FPGA-based AWS cloud instance family, which consists of three types of instances: f1.2xlarge, f1.4xlarge, and f1.16xlarge. A general block diagram of an instance is shown in Figure 4. The AWS f1.2xlarge instance contains an Intel Xeon E5-2686 v4 x86 CPU and a Xilinx Virtex UltraScale+ VU9P FPGA device (xcvu9p-flgb2104-2-i). The on-board x86 Xeon CPU and the FPGA are connected via a PCI Express x16 Gen3 mode bus. Table 1 shows the available resources on the AWS EC2 F1. We used AWS f1.2xlarge instances for our multi-FPGA framework.
Fig. 4.
Fig. 4. General Architecture of AWS F1.2xlarge Instance.
Table 1.
Available Resources in FPGAOff-chip
SLRsLUTFFDSPURAMBRAMDDR
3895k1790k\(5,\!640\)225Mb59Mb64GB
Table 1. Available Hardware Resources in AWS f1.2xlarge Instance
There are some works that use an AWS EC2 F1 cloud instance [18, 23, 24, 40, 56, 84, 85] in the open literature, but they mainly target a single FPGA device.
In the AWS EC2 F1 family of instances, the FPGAs are not able to directly access the virtual private cloud network. The on-board x86-based processor in an instance can access the virtual private cloud network through an Elastic Network Interface (ENI); thus, an FPGA can only talk to the virtual private cloud network indirectly, through the on-board x86 processor’s ENI.
AWS provides a “Shell” hardware design for this family of instances, which includes necessary hardware interfaces including PCIe and DDR4 to help with the development of hardware applications on FPGAs. AWS provides an AWS EC2 FPGA Development Kit [9], which contains a hardware development kit (HDK) and a software development kit (SDK). The HDK and SDK, in turn, contain tools for programming an FPGA on the AWS EC2 F1 platform.
The VU9P-FPGA device has three FPGA dies (super logic regions, SLRs) called SLR0, SLR1, and SLR2. The AWS Shell is placed in some portions of SLR0 and SLR1. It is important to consider the placement of the I/O controller, which is responsible for communication among different F1 instances. The I/O controller is implemented as a custom logic part that has two interfaces: one for communicating with compiler generated custom logic and one for talking to the AWS Shell through a “Streaming Data Engine” IP provided by AWS. In order to achieve timing closure of the design at relatively higher frequencies, placement of the I/O controller in the right SLR is important.

3.2 Networking

The partitioned verilog register-transfer-level (RTL) hardware generated by our compiler is not technology-specific, and can be easily ported to any multi-chip platform interconnected by a scalable network (e.g., k-ary-n-cube, fat tree) using either FPGA or ASIC chips each having its own off-chip DRAM. Our compiler has the capability to create parts of a k-ary n-cube network (e.g., hypercube) directly within each hardware partition, so that each serial network link of a chip can be connected point-to-point to a serial network link of another chip with a copper or optical cable, implementing a k-ary n-cube interconnection, without requiring external network hardware. A cluster of chips interconnected by an external fat tree switch is also a good target hardware platform for our HLS compiler. For optimizing total cost of ownership for a frequently used applications, NRE costs of an ASIC design can be reduced by using a “chip-unioning” technique [42], so that a single ASIC can implement more than one hardware partition, depending on configuration parameters supplied during power-up. In the present section, we will describe our engineering effort to port our multi-chip hardware accelerator framework to the AWS-EC2 F1 instances platform, where scalable communication across FPGAs must be done over an AWS Virtual Private Cloud network and where communication must also include an efficient packet forwarding software layer, such as DPDK, since there is no direct connection from an FPGA to a network interface leading to the Virtual Private Cloud network.
Ideally, an FPGA or ASIC chip that is a part of an application-specific hardware accelerator must connect to a network interface directly, as some researchers have already suggested (e.g., [57]). However, in the AWS EC2 F1 instances platform, the network interface cannot be directly accessed from an FPGA. Instead, AWS has developed a sample FPGA design called the Streaming Data Engine, “cl_sde” together with related packet forwarding software called Virtual Ethernet [9], which can route Ethernet packets between the network interface and an FPGA. cl_sde and Virtual Ethernet use the fast open source DPDK software [58] running on the on-board x86 processor in conjunction with the network interface, as if they were a combined network device connected to the FPGA. DPDK avoids the overhead of kernel invocations by relying on “poll-mode” user-space drivers not needing any kernel functions.
As part of the present project, we have modified cl_sde to connect to one of our accelerator hardware partitions inside an FPGA. We have modified the relevant DPDK functions, to route packets coming in from the network interface to the FPGA, and to route packets going out from the FPGA to the network interface. More details of how our compiler uses the cl_sde IP and DPDK will be explained below.
Note that network bandwidth across AWS EC2 F1 FPGA instances that are connected via Ethernet over the virtual private cloud can be a bottleneck in terms of performance for networking applications. The f1.2xlarge instance has up to 10Gb network bandwidth. Packet size plays a significant role in achieving available bandwidth and packets per second (PPS) performance on the AWS F1 instances (see [23]).
We are employing UDP-based (User datagram protocol) communication among different f1.2xlarge instances that contain compiler-generated accelerator partitions. UDP is simple and fast, but it provides unreliable data transmission. However, in our multi-FPGA experiments with f1.2xlarge instances, we did not yet encounter any packet drops, perhaps due to the currently small number of FPGAs that are being used. We do plan to address the unreliability of UDP communication in our ongoing work by adding reliability features on top of the UDP protocol. There are many works that propose reliability without sacrificing throughput [52, 54, 65, 70, 86]. We are evaluating the alternatives suitable for an AWS Virtual Private Cloud Network.

3.2.1 SDE-based Streaming Accelerator Interface.

The Streaming Data Engine (SDE) is a hardware IP module that is provided in the AWS-FPGA library [9]. The SDE provides high-performance streaming connectivity among FPGAs in different F1 instances and also the user’s software application operating at the Host CPU.
We used the AWS cl_sde example design as a baseline and modified it so that a custom accelerator partition within the FPGA can send and receive messages through the SDE module. In order to achieve higher connectivity between different AWS EC2 F1 instances through Ethernet in a virtual private cloud, currently the FPGA, SDE IP, and DPDK Soft Patch Panel (SPP) software for interfacing with the FPGA should be used together. For achieving higher throughput values, we have integrated each of our accelerator partitions within an SDE module in an AWS-EC2 F1 instance FPGA, to ensure high-speed communication among the accelerator partitions, each implemented within a different AWS EC2 instance FPGA.
Figure 5 shows the interface of SDE module with the shell and the custom logic, where the latter includes one partition of our compiler-generated accelerator. SDE talks to the shell via AXI4 bus memory-mapped PCIS (PCI slave) and PCIM (PCI master) interfaces. The SDE uses the PCIS-AXI4 interface to obtain descriptors written by DPDK software, which contain information about the data transfer being performed, e.g., the destination physical addresses for the data, and the number of bytes for the data transfer. Data are transferred between the custom logic and the memory of the on-board x86 processor through the PCIM AXI4 interface.
Fig. 5.
Fig. 5. General block diagram per AWS EC2 F1 FPGA instance.
SDE provides two AXI stream interfaces (H2C and C2H) to the custom logic to be able to send and receive data in streaming fashion.
Each accelerator FPGA partition has two global streaming interfaces, A2H and H2A, and they are connected to the SDE streaming interfaces as shown in Figure 5.

3.2.2 Chip to Chip Communication on AWS Through DPDK.

Chip to chip communication is a bottleneck in high-performance multi-chip implementations due to physical constraints. The most efficient way to interconnect chips is directly wiring them through very fast communication mediums. In the current version of FPGA instances of AWS cloud, FPGAs cannot access a virtual cloud network within the data center directly, but they are able to connect to the virtual cloud network through the network interface card of the on-board x86 CPU. In software, DPDK (1) achieves fast packet processing when communicating and dispatching tasks among several AWS EC2 F1 instances that are parts of our FPGA-based cloud hardware accelerator, and (2) allows us to bypass the operating system (OS) kernel, in order to achieve fast data transfer between the ENI connected to the on-board x86 processor and the FPGA custom logic.
In DPDK, the environmental abstraction layer (EAL) enables the application to gain access to low-level hardware resources and memory space. An EAL thread (lcore), which is a Linux pthread, executes the tasks issued by the remote_launch function. The DPDK user-level application runs with two lcores in (1) packet forwarding mode, (2) autostart mode, which starts forwarding on initialization, and (3) chained mode, as a port topology to enable the forwarding of packets to the next available port in our case.
Eliminating network noise packets: Even if the FPGAs communicate with each other on a separate, dedicated subnet, in the AWS virtual cloud environment many unrelated Ethernet packets, not generated by our accelerator hardware and not useful for acceleration, arrive at an EC2 F1 instance, such as Address Resolution Protocol (ARP) packets. We have modified DPDK code to efficiently check for any packets not generated by our compiler-generated hardware. These incoming packets are dropped using only a few x86 instructions. They are not delivered to the FPGA; thus, we eliminate the network noise packets.
While using any software for network communication is not ideal, we used the DPDK software specifically to overcome a limitation of the current AWS EC2 F1 platform since FPGAs cannot communicate with a network interface directly: in a future, for example, ASIC-based implementation of a cloud hardware accelerator generated by our compiler, each chip will be able to connect to a plurality of high-speed network interfaces directly.

4 multi-FPGA Design for DES Key Search

In this section, we will explain how our HLS compiler extracts parallelism from sequential code for multi-FPGA system design for the DES key search application, how it generates the full system hardware with all the necessary communication components, and how it automatically builds an FPGA-based multi-instance cloud accelerator system in the AWS cloud. We will first start with an explanation of the DES computation using its high-level sequential description.

4.1 DES Algorithm (High-level Sequential Description)

We have implemented the original National Institute of Standards and Technology Data Encryption Standard (DES) specification FIPS PUB 46-3 [74] (reaffirmed on October 25, 1999, withdrawn on May 19, 2005 in favor of AES) verbatim, without any optimizations, in C.
Algorithm 2 shows the decryption procedure of DES, where the symbol \(||\) means concatenation and the symbol & denotes the logic AND operation. We provide this high-level algorithm of DES in order to explain the computation. Our main contribution is that our high-level compiler takes the same high-level description (written in C) and generates very high-performance multi-FPGA distributed hardware architecture, which has the potential to outperform all previous manually-designed DES hardware architecture results.
Note that C does not have any native operations to represent hardware bit manipulation operations such as bit permutation. Such operations must be implemented with and, or, xor, shift, or other operations in C. Our compiler recognizes sequences of and, or, xor, shift and other operation that are equivalent to a rearranging of bits (or negations of bits) belonging to one or more registers, and implements such sequences of operations with a single Verilog concatenation of bits (or negations of bits) where possible. Our compiler also implements bit-width compression optimizations in registers and network payloads and attempts to avoid creating any flip-flops or wires for constant bits, redundant bits, which are copies of another bit, or dead bits. As a result, where possible, the compiler infers hardware registers and network payloads of a size smaller than standard C data type sizes of 8, 16, 32, or 64 bits.
Algorithm 3 shows the Feistel f function of DES. After input to the f function is expanded from 32-bits to 48-bits, it is XORed with the provided subkey. An S-box operation is now applied to the result. In order to generate addresses for the S-box tables of DES, there is a special encoding (line 4 in Algorithm 3).
Loop independent dependence in the i-loop in Algorithm 3, e.g., between lines 3 and 4 and also between lines 4 and 5, can be implemented as a parallel hardware [20] (or pipelined hardware even if there is a true dependence on a variable or a memory location), but it is challenging for a compiler to automatically understand the immense parallelism in the given sequential code and generate parallel and pipelined hardware execution units without any rewriting of the code. The sequential description of DES actually has a lot of parallelism. A skillful hardware engineer is able to design a parallel hardware architecture for DES, but at the cost of a long design cycle. Full system verification also requires very careful analysis and long verification efforts. However, in the present study, a user of our proposed framework can automatically create a multi-FPGA accelerator system from a simple sequential C description without rewriting the C code, without writing any Register Transfer Level code, and without having any cloud-level expertise. How our HLS compiler extracts the parallelism and generates the hardware is explained in the following section.

4.2 Compiler Overview and Parallelization

Our compiler maps a sequential program such as the DES cracker whose code is shown in Algorithm 1 to a parallel hardware accelerator such as the one in Figure 1. The hardware components are highly pipelined FSMs (shown as rectangles). Very efficient, lightweight packet switching networks (shows as ovals) are used for communication between hardware components. The hardware follows the loop hierarchy of the program, where a (possibly duplicated) inner loop is considered a child of its outer loop, and an outer loop is considered the parent of its inner loop copies. In Figure 1, one can see the outer loop of the DES cracker application, and its duplicated inner loops. However, because of the need to duplicate inner loops of an outer loop to achieve parallelism, the initial design will normally not fit in a single FPGA. The number of inner loop copies per outer loop can be estimated by the compiler or can be specified by the user for additional control.
Types of packet switching networks include incomplete butterfly multi-stage networks (“incomplete” meaning that the number of input ports and/or output ports of the network need not be a power of two), load balancing “task networks” and linear array networks.
We will provide a simplified high-level summary, illustrated in Figure 6, of our HLS compiler, which demonstrates one possible way to implement our compilation approach; there are many other ways. Our HLS compiler uses the object code and assembly code outputs of an unmodified gcc/g++ compiler as inputs and operates as a linker that has hardware acceleration capabilities. All phases of our HLS compiler are encapsulated in a gcc/g++ compatible command line tool. A compiler-generated executable that invokes a hardware accelerator also behaves like an ordinary, gcc/g++ generated executable.
Fig. 6.
Fig. 6. Overview of the proposed framework for multi-FPGA hardware design synthesis from a sequential C/C++ program.
The compiler phases comprise the following:
The gcc/g++ compiler is invoked from within our HLS compiler and converts the user’s C/C++ program into an x86 assembly language file. The assembly language file also contains debugging data produced by a -g flag, for providing a degree of source-level information to our HLS compiler. The user also indicates the function(s) to be accelerated. Optimizations and scheduling are performed only on the functions to be accelerated.
An intermediate code translator translates the x86 assembly language file to unoptimized intermediate code, consisting of RISC primitives for implementing each x86 instruction by following the x86 architecture specification verbatim.
An optimizer applies standard and x86-specific optimizations to the unoptimized intermediate code and obtains clean RISC-like optimized intermediate code.
A scheduler applies hierarchical software pipelining to the optimized intermediate code, as if targeting an extremely wide-issue architecture, as explained below. As a result, scheduled, software pipelined program regions are obtained.
An FSM generator converts each scheduled, software pipelined program region into an FSM in verilog. Columns 44–52 of [42] provide an example of more details of this transformation.
A design integrator combines the FSMs obtained from the user’s C/C++ code and verilog modules picked from the library of the compiler, to create a complete non-partitioned flat accelerator design (as if the target chip had infinite area) by wiring together the FSMs, application-specific on-chip and cross-chip networks, application-specific on-chip memories, floating point units, top task adaptor, auto-pipelined interfaces, response re-ordering units, and so on, selected from the library. The final duplication counts of Loop FSMs for achieving performance through hierarchical software pipelining, are also decided on at this stage.
A partitioner then partitions the flat accelerator design into multiple chips, and creates I/O controllers in each chip for cross-chip communication; the result is a partitioned accelerator design.
An executable packer then combines the un-accelerated part of the software, an executable manager program, and the locations of the files in AWS S3 storage that will contain the FPGA image identifiers when they become ready, and creates an executable. When the executable is started, the manager program will be invoked first to do housekeeping actions such as verifying the readiness of FPGA images and initializing FPGA instances, before starting the un-accelerated part of the software. When the C/C++ function within the user’s software application intended for hardware acceleration is attempted to be executed, communication messages are exchanged between the accelerator and software application, to realize the hardware acceleration and to maintain memory coherence between the software application and accelerator. The executable packer stage also creates a stand-alone verilog tarball from the partitioned accelerator design and starts Vivado processing at AWS.
An AWS-FPGA image creator accepts the verilog tarball and it passes on to FPGA Developer AMI instances on AWS, which then in parallel execute scripts that go through all the steps to convert the verilog files to a Vivado design checkpoint, which is, in turn, submitted to AWS, which, in turn, delivers the FPGA image for each partition. When all the FPGA images are ready and required FPGA instances are up, the previously created executable will now run with hardware acceleration. Prior to this point, the executable will still run, but in software only.
For achieving verification of a multi-chip design, the executable packer also has the option to enable multi-chip verilog simulation in software. The Verilator tool is used to create a verilog simulation executable for each hardware partition. The executable packer will pack together the manager program, the un-accelerated part of the software application, and the verilog simulators for each hardware partition. When such an executable is started, the manager program will first distribute the executables to multiple servers, before starting the un-accelerated part of the software application. Then, when the C/C++ function within the user’s software application intended for hardware acceleration is attempted to be executed, simulated multi-chip hardware acceleration will be realized through message exchanges among the software application and the verilog simulators.
Parallelization consists of applying software pipelining, which is mainly based on enhanced pipeline scheduling [71], recursively in a bottom up manner following the loop hierarchy in reverse post-order enumeration. First, an inner loop is pipelined. Then, by duplicating the inner loop and adding a result-reordering hardware circuit, the collection of inner loops is made to look like a simple pipelined multiplier or simple store queue (in case the inner loop is executed only for side effects), that delivers an inner loop invocation every cycle after its pipeline is full, when dependences and resources permit. After inner loops of an outer loop are thus converted to simple pipelined units, the outer loop is software pipelined, wherein the inner loops appear like simple instructions with appropriate dependences and latencies, as observed from the outer loop. Because inner loops are made to appear as simple instructions, the usual software pipelining algorithm is applied to the outer loop as well. The algorithm continues recursively in this manner with outer-outer loops, outer-outer-outer loops, and so on, potentially creating a massive amount of additional hardware at each loop level. The outermost program, not being a loop, is not pipelined, it is merely scheduled.
Because our compiler performs hierarchical software pipelining in the presence of memory dependences and in the presence of arbitrary control flow, and since it creates synchronization circuits tailored to the application, it constitutes a general algorithm that can accept any single-threaded program as input.
To understand why reordering of the results of copies of inner loops is necessary, before an outer loop observes these results, again consider the sequential DES cracking algorithm specified in this paper. In general, inner loops of an outer loop can complete their work and return their result after an unpredictable delay. For example, while the inner loop of iteration p of the outer loop of the DES cracker application is continuing, the inner loop of iteration \(p+1\) may be speculatively started, and the inner loop of iteration \(p+1\) may find a correct key immediately, and may finish immediately. The outer loop must confirm that the inner loop of its iteration p will not find a correct key, before looking at the result of the inner loop of its iteration \(p+1\), to exactly replicate the sequential algorithm specification while achieving high parallelism. When it is guaranteed that a parallel hardware accelerator implements a sequential algorithm exactly, the authors believe that the result is conceptual simplicity, which, in turn, improves the designer’s productivity. A reordering hardware unit can be implemented, for example, by a special FIFO-like queue, where elements are read and removed sequentially from the front of the queue (blocking when the desired next element is not yet present in the front of the queue), and where elements are written with random access anywhere within the special FIFO-like queue.

4.3 Partitioning

Once the flat, non-partitioned design of the hardware accelerator is obtained, it will normally not fit in a single chip, and will need to be partitioned into multiple chips. The hardware components of the flat hardware accelerator design are partitioned into parts that can each fit into a single chip.
The Figures 1 and 2 indicate the partitioning of the example DES cracker application. Figure 1 depicts the flat non-partitioned design containing all 128 inner loops. Figure 2(a) is partition 1 containing inner loops from \(0-15\). Figure 2(b) represents partitions \(2-8\) of the hardware accelerator each containing 16 inner loops, numbered 16–31, 32–\(47, \ldots, 112\)–127, respectively.
Note that there can be a plurality of networks in the flat, non-partitioned hardware accelerator design. When two hardware components x and y connected to a network n in the original flat hardware accelerator are assigned to different chips \(\mathsf {A}\) and \(\mathsf {B}\), respectively, after partitioning, the first component x in chip \(\mathsf {A}\) can still send a message to the second component y in chip \(\mathsf {B}\), without being aware that the design is partitioned into multiple chips. Partitioning phase is depicted in Figure 7, and is achieved as follows:
Fig. 7.
Fig. 7. An illustration of our partitioning.
We will call the destination port numbers of the original network n of the flat hardware accelerator the virtual destination port numbers.
The message source component x in chip \(\mathsf {A}\) sends a message to destination component y in chip \(\mathsf {B}\) over network n, not being aware the design is partitioned, using virtual destination port number y in the message from x to y.
We instantiated I/O controllers in each chip to manage cross-chip communication over a scalable cross-chip network. As the cross-chip network in this study, we used a virtual private cloud network in the AWS EC2 cloud platform, which allows fast communication between instances in the same “availability zone” and “placement group” [8] for minimizing communication delays. The part of network n that remained locally in chip \(\mathsf {A}\) routes the message from x to y to the I/O controller of chip \(\mathsf {A}\), since the destination component y is not in chip \(\mathsf {A}\). This is done by a small combinatorial logic or ROM routing table that maps the virtual destination port y to the physical destination port of the local partial n network connected to the local I/O controller. The routing table is accessed normally in a single cycle and can be further optimized, if the number of components on the original network n is a power of two and the components are distributed evenly to chips. The physical destination port routing bits are prepended to the message from x to y, to guide the message through the local partial network n. These physical routing bits are discarded when the local physical destination port (in this case connected to the I/O controller of chip \(\mathsf {A}\)) is reached.
The I/O controller of chip \(\mathsf {A}\) then converts the payload size of the message from x to y to a standard “flit” size (512 bits), and adds a header that indicates chip \(\mathsf {B}\) as the destination chip, and also the network number indicating which network this message should enter in the destination chip \(\mathsf {B}\), after reaching the destination chip \(\mathsf {B}\). This will be network n.
After the message from x to y arrives at the I/O controller of chip \(\mathsf {B}\) over a scalable chip-to-chip network, (in this case an AWS virtual cloud network) its header is used to make the message enter the correct local partial network n in chip \(\mathsf {B}\). The header of the message is deleted and the payload size of the message from x to y is then converted back to its original size.
The message from x to y then enters the local partial network n of chip \(\mathsf {B}\) to go from the I/O controller of chip \(\mathsf {B}\) to the component y in chip \(\mathsf {B}\). For this purpose, another small lookup table is used to map the virtual destination port y to the physical destination port of the local n network connected to the y component. The physical destination port number is prepended to the message as it enters the local partial n network at the I/O controller of chip \(\mathsf {B}\) and is used to guide the message within the local partial n network, so it goes to the destination component y. When the destination component y is reached, the physical destination port routing bits are discarded.
Thus, components x and y need not even be aware that the hardware accelerator is partitioned and that they are on different chips. This approach avoids the design changes inside the already-created components, that could otherwise be necessary for creating a partitioned hardware accelerator, after the flat hardware accelerator is created.

4.4 Automatically Generated Inner-Loop FSMs

Our compiler automatically generates a finite state machine (FSM) for each loop copy from the given sequential C/C++ code (in this case, DES searchkey function given in Algorithm 1). The compiler generated (annotated) Verilog code for part of the critical loop that executes at 1 cycle per iteration is given in Listing 1. The FSM for the inner loop has a total of 39 states. First state S0 is responsible to wait for and accept a new incoming task. When a new task has been issued by the outer loop and received by this FSM, the pipeline is filled in states S\(1, \ldots,\) and S37. When the pipeline reaches steady state S38, a total of 303 different operations are all executed concurrently in a single cycle, and FSM stays in the same state until the inner loop is finished. Inner loop of searchkey function contains xor, table lookup (rom load) for the constant, read-only “sbox” array accesses in the C code, or bit permutation operations (bit rearrangement operations emanating from, e.g., and, or, shift operations in the C code), which are necessary for DES decryption. Thus, state S38 consists of 303 total operations including xor, not, bitpermute, romload, copy, network receive, and network send operations. When the inner loop is finished (once the inner loop reaches a block boundary), the FSM sends back an acknowledgement to the outer loop to indicate the end of its operation. Therefore, our HLS compiler offers a productivity advantage: This FSM’s hardware would be difficult to create by Register Transfer Level design or by existing HLS tools.

4.5 I/O Controller

The compiler-generated Verilog code of the I/O controller in the first FPGA chip that performs chip-to-chip and chip-to-host communication is partially given in Listing 2. For each FPGA chip, there is one specialized I/O controller unit. Only the first FPGA has an interface with the host CPU through host I/O module for chip-to-host communication. The I/O controller, which resides in the first FPGA chip is responsible for forwarding incoming packets from the host to the destination FPGA chip and outgoing packets from the FPGA chip to the host unit. The I/O controllers instantiated in other FPGA chips does not have a host interface. Each I/O controller sends and receives UDP packets; thus, all the necessary information in creating a UDP packet is embedded into I/O controller, e.g., the MAC addresses of the instances, and UDP source and destination port numbers.
Loop FSM units in the chip can communicate with the I/O controller through an on-chip packet switching network (or directly, through a network elision optimization [42], in case the packet switching network has only one input and one output port), in order to send a packet to the external interface (out of the chip). The I/O controller accepts data from the loop FSMs through loop slave interfaces. The width of the received data might differ from the external I/O bus width (see, for instance, the 263-bit inputData_9_0 in Listing 2). Thus, the I/O controller has to convert this 263-bit width into the I/O bus width (512 bits in this case) via the bus_width_converter module before sending the packet from the loop FSM to other chips, or to the host CPU device through the I/O bus. The message header information (UDP header and Ethernet header), including the IP header checksum, is computed, and the data being sent out of the chip is encapsulated with this message header using the insert_message_header module. There might exist other slave interfaces in the design, e.g., host_io and loop, that send packets out of the chip. Hence, there is an incomplete butterfly switch (in this case a 2-to-1 switch) that handles incoming packets from more than one input interface going to a single output interface. The output of the incomplete butterfly switch is registered via the auto_pipelined_reg module. The number of pipeline register stages to be inserted between the I/O controller and the SDE unit is determined during the FPGA placement phase to achieve timing closure. The output of the auto_pipelined_reg module is connected to the external I/O interface.
If a packet received from outside the chip through the response channel of the external I/O master interface is being forwarded to a component (e.g., a loop FSM) inside the chip, the previous operations are performed in reverse. First, the data are passed through the auto_pipelined_reg unit. The output of the auto_pipelined_reg unit is connected to an incomplete butterfly network unit (now a 1-to-2 network) to forward the incoming packets from a single (external) interface to two different on-chip units. Then, the packet is decapsulated, and the headers are removed from the packet via delete_message_header. Lastly, the bus_width_converter module adjusts the bus size, and its output is connected to the response channel of the loop FSM unit or the host I/O unit (which resides only in chip 1 in this case), depending on the destination unit information.

4.6 The Second Enhanced Network Interface

All AWS EC2 instances of interest for FPGA acceleration already have a first Elastic Network Interface (ENI), which can, for example, be used for connecting to the instance via normal remote ssh. But communication among F1 (FPGA) instances and the host CPU instance for acceleration purposes occurs only over a second enhanced network interface of each AWS EC2 instance, which is created by our tools on a separate subnet of the virtual private cloud.
The second network interface is brought up after an instance is started and reaches the “system-status-OK” state. The second network interface is brought down before stopping or terminating an EC2 F1 instance.
The second network interface does not need any IPV4 public address. It is used only for FPGA to FPGA communication. Other communication, such as ssh and scp commands sent from a non-FPGA instance (e.g., the host CPU instance) to an FPGA instance for initialization, uses the first enhanced network interface of EC2 instances. As mentioned above, the user can also perform a remote ssh into the instance of interest using the first enhanced network interface of the instance via a public IPV4 address.

4.7 Implementation of Compiler-level SLR Crossing with Auto-pipelining

The AWS cl_sde example design can run at a frequency of 250 MHz. In this study, we also aimed at reaching a frequency of 250 MHz for our compiler-generated accelerator partitions. However, when the loop duplication factor is increased at the compiler level, the generated hardware utilizes a significant part of the logic resources in the SLRs and also crosses the SLRs. Our initial experiments showed that critical data paths that span two SLRs require special attention: extra pipeline registers should be added into the critical paths of the design when crossing SLRs.
The auto-pipelining feature provided by the Vivado tool inserts additional pipeline registers automatically during the placement phase to help achieve timing closure (see pp. 79–84 of [93]). In our hardware accelerator model, our HLS compiler uses this feature to enable the Vivado tool to add pipeline registers where necessary, for example, in I/O controllers, linear array networks, and pass-through units, which are used for receiving messages coming in to a loop FSM machine, in order to meet the 250 MHz frequency target. The compiler-generated Verilog-level design spans multiple SLR regions. Furthermore, there is communication between components in different SLR regions. Hence, it is challenging to meet the 250 MHz frequency target if SLR crossings are not optimized. Auto-pipelinable registers (i.e., with “autopipeline” attributes indicated in the RTL) are in fact inserted by our compiler into our hardware accelerator designs, for example, in the I/O controller unit, linear array network stage units, and pass-through units.
Figure 8 shows how we enable auto-pipelined registers with a simple credit-based flow control between the input and output streaming interfaces of the I/O controller unit, linear array networks, and pass-through units. The Vivado tool decides on the number of pipeline register stages that are needed, during the placement of our design, to achieve timing closure. The compiler-generated Verilog code contains the necessary auto-pipeline attribute information.
Fig. 8.
Fig. 8. Auto-pipelining register insertion (via Vivado design tool) with our credit-based flow control.
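As an illustration of how the “autopipeline” attributes mentioned above can be attached to registers in the RTL, the following hypothetical sketch marks the forward data/valid registers of an SLR-crossing path for Vivado's auto-pipelining. The attribute names follow Vivado's documented auto-pipelining feature, but the module and signal names are our own, and the group name and the limit of 12 are only examples consistent with the settings described later in this section.

(* autopipeline_module = "true" *)
module auto_pipelined_reg_sketch #(
    parameter W = 512
) (
    input  wire         clk,
    input  wire [W-1:0] s_data,
    input  wire         s_valid,
    output wire [W-1:0] m_data,
    output wire         m_valid
);
    // Vivado may turn each of these registers into 1..autopipeline_limit stages
    // during placement, depending on the physical distance (e.g., an SLR crossing)
    // that the path has to cover. Backpressure is handled by the sender-side
    // credit counter of Figure 8, so no combinational ready signal needs to
    // travel back across the SLR boundary.
    (* autopipeline_group = "fwd", autopipeline_limit = 12 *)
    reg [W-1:0] data_q;
    (* autopipeline_group = "fwd", autopipeline_limit = 12 *)
    reg         valid_q;

    always @(posedge clk) begin
        data_q  <= s_data;
        valid_q <= s_valid;
    end

    assign m_data  = data_q;
    assign m_valid = valid_q;
endmodule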
Our contribution in inserting auto-pipelinable registers includes a credit-based flow controller design implemented via a small counter (ctr). Referring to Figure 8, assuming that there are M1 pipeline register stages on the way from the output s_ready signal to the input, and M2 pipeline register stages on the way back from the input m_valid signal to the output, the counter logic on the input side must analyze information coming from M1 cycles in the past and predict the number of FIFO elements M2 cycles into the future. The final numbers of register stages M1 and M2 are determined by the Vivado placer tool. The counter predicts what the remaining number of FIFO elements at the output side will be M2 cycles from now (see Figure 8), using the following logic:
The auto-pipelined unit will be ready to receive data from the input side (will assert input \(\texttt {m_ready}==1\)) if and only if the counter indicates that at least one element of FIFO budget remains at the output side (i.e., ctr is not zero).
The logic responsible to change the ctr value is depicted in Figure 8, and is explained as follows:
The counter is initialized to the number of FIFO elements at power-up time.
The counter is decremented when a new data element is accepted from the input side, consuming one unit of the FIFO budget.
The counter is incremented when the time-delayed control signals arriving from the output side (s_ready and inverted s_valid in Figure 8) indicate that the output FIFO has drained one element, returning one unit of the FIFO budget.
Otherwise, the counter value is unchanged.
For a slightly better timing margin, ctr can be initialized to \(\texttt {the number of FIFO elements}-1\) and instead of testing for \(\texttt {ctr}==0\), \(\texttt {ctr}\)<0 (ctr sign bit) can be tested. This optimization was done in Figure 8.
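The following is a minimal Verilog sketch of the sender-side budget counter with the FIFO-elements-minus-one/sign-bit optimization just described. The signal names, the parameter values, and the exact increment/decrement conditions (here a local push and a delayed drain indication from the output side) are our own illustrative assumptions; the actual conditions used by our compiler are the ones shown in Figure 8.

module credit_flow_ctrl_sketch #(
    parameter DEPTH = 16,   // capacity of the small FIFO at the output side
    parameter CTR_W = 6     // wide enough for DEPTH plus a sign bit
) (
    input  wire clk,
    input  wire rst,
    input  wire m_valid,        // sender has a data element to push into the auto-pipelined path
    output wire m_ready,        // asserted while FIFO budget remains
    input  wire fifo_drain_del  // time-delayed indication (derived from s_ready/s_valid) that the output FIFO drained one element
);
    reg signed [CTR_W-1:0] ctr;           // remaining-FIFO-budget counter

    wire push = m_valid && m_ready;       // one budget unit consumed

    always @(posedge clk) begin
        if (rst)
            ctr <= DEPTH - 1;             // initialize to (number of FIFO elements - 1)
        else if (push && !fifo_drain_del)
            ctr <= ctr - 1;               // element entered, none drained: budget shrinks
        else if (!push && fifo_drain_del)
            ctr <= ctr + 1;               // element drained, none entered: budget grows
        // otherwise (both or neither) the counter value is unchanged
    end

    assign m_ready = ~ctr[CTR_W-1];       // ready while the sign bit is clear (ctr >= 0)
endmodule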
Using the circuit described above and in Figure 8, flow control is enabled by simply decrementing or incrementing the ctr value to keep track of the remaining budget in the FIFO. The input-side logic (the one that starts the data transaction) knows when the receiver side (the output logic in Figure 8) has enough resources to accept the data. The input-side logic is stalled until there is enough FIFO capacity at the output side. The credit-based logic allows simple and efficient flow control between the input and output of the streaming interface via a small counter at the Verilog level. Thus, the auto-pipeline logic only utilizes a small FIFO (which occupies fewer resources than the larger FIFO of the original design intended for worst-case traffic), and it allows pervasive use of auto-pipeline register insertion across the SLR regions, wherever it is necessary in the whole design for each chip. Depending on the unit in our DES cracker hardware accelerator, we configure the number of stages of the FIFO and other auto-pipelining parameters at compile time. As part of the present experiments, we enabled auto-pipelining, and Vivado inserted auto-pipeline registers depending on the distance of the communicating components. Hence, our inner loops are able to send and receive messages across the SLR regions, and yet are able to achieve 250 MHz.
When we compare the simple credit-based flow control that we developed (from a source unit to a sink unit in a single chip with a fixed delay, determined after the Vivado placement phase) with a well-known credit-based flow control method used in networks (where the sender and receiver are often far away from each other), such as the N23 method summarized in Kung et al. [63], in our Register Transfer Level solution the sender (input) side does not need any handshaking or credit packets from the receiver (output) side indicating how many empty buffer elements remain in the receiver buffer. Since, in our proposal, the flow control is done in the same circuit with the same clock, even when the sender and receiver sides are in different SLRs, it is accomplished by simple, time-delayed, synchronous, and continuous control signals arriving from the receiver side (s_ready and inverted s_valid), which increment or decrement a small remaining-FIFO-budget counter on the sender side. This is an energy-efficient solution for performing the flow control task, as compared to a packet-based implementation of the same task. There is an inherent advantage to our method, which is based on implementing flow control in RTL and relying on the presence of a fixed delay between the sender and receiver: In a network where a sender and a receiver are far away from each other, the delay between the sender and the receiver (possibly on different compute nodes) cannot be expressed as a fixed number of clock cycles; thus, the sender may be required to receive a credit value as a packet from the receiver.
Figure 9 illustrates the FPGA layout of the placed and routed first partition (see the block diagram of the first partition in Figure 2). The first partition is special compared to the other FPGA chip partitions, since it contains the top task adapter and the outer-loop finite state machine. Inner-loop FSMs have a rectangular shape in the layout and are labeled with different colors. The AWS shell logic is colored orange in the layout and spans SLR0 and SLR1. The rest of the SLR0 and SLR1 resources and all SLR2 resources can be utilized for user custom logic. As shown, the DES key search hardware accelerator spans three SLRs, and no single inner-loop FSM spans multiple SLRs. For communication between inner-loop designs, auto-pipelining is enabled to achieve the high-frequency target.
Fig. 9.
Fig. 9. Screenshot from the Vivado design tool of the placed and routed design (8 FPGAs, 16 inner loops per FPGA) for the first partition (depicted in Figure 2(a)) with the host unit, top task adapter, and task network (colored red, with logic in all three SLRs). The I/O controller is colored yellow. The inner loops (rectangular blocks) are colored with different colors for better visualization. AWS Shell logic (orange) spans SLR0 and SLR1.
The number of register stages inserted by Vivado for each auto-pipelining module becomes known only after the floorplanning step during the placement phase; hence, the number of registers remains unknown before this step. We examined each placed and routed partition (chip) to see the number of registers that were added into the potentially long paths where our compiler instantiated auto-pipelining modules in the RTL (a potentially long path is one that may traverse a longer distance or may cross multiple SLRs). Since we set the auto-pipeline limit to 12 in this case, Vivado inserted between one and six register stages for each path where we use an auto-pipelining module. Figure 10 shows two different paths where auto-pipelining is applied: one is a longer path (six registers were added by the Vivado tool) and the other is a shorter path (only one register was added by the Vivado tool). We infer that Vivado adds more register stages when the path is long (see Figure 10(b)) and fewer register stages (possibly one) when the path is short (see Figure 10(a)). Having more register stages in the longer paths is acceptable in our design approach, since our components communicate not with ordinary wires, but with messages going through light-weight packet switching networks, and since the function of our compiler-generated design is unaffected by message latencies. The added latencies also do not hurt the achievable theoretical performance. Therefore, the auto-pipelining technique, in conjunction with our credit-based flow control enhancement, solved the timing issues, enabling our compiler-generated design to reach 250 MHz despite the inevitable SLR crossings.
Fig. 10.
Fig. 10. FPGA layout of two different auto-pipelined paths. Yellow-colored paths illustrate two different auto-pipelined paths with different amounts of register stages.

4.8 Maintaining gcc Compatibility in the Presence of Extremely Long FPGA Image Generation Times

Generating AWS-FPGA images from the Verilog files using Vivado can take many hours and, therefore, cannot become part of a normal gcc-like compiler flow. Since, in our case, the hardware accelerator (executing on a cluster of FPGAs and communicating with the un-accelerated part of the software) is functionally equivalent to the original sequential code fragment it was compiled from, the hardware accelerator is analogous to the next tier of compiler optimization after the highest compiler optimization tier in a just-in-time compilation system [16]. While we are not currently determining hot application executables and hot functions within these executables dynamically by profiling (a user must currently supply this information), executables created by our gcc-compatible HLS compiler have the property that they can be initially run in software, while Vivado processing continues in the background. The software application is then seamlessly replaced by its hardware-accelerated version, as soon as its required FPGA images all become ready, asynchronously, after many hours of Vivado processing. Below we explain how we have prototyped this seamless replacement. Our vision is a future operating system where large parts of frequently executed user applications, and even large parts of the OS kernel, can be compiled into multi-chip hardware accelerators over time, without disruptions to the users.
For this reason, when starting a job to create an FPGA image from a set of Verilog files:
The Vivado tool is run in parallel on multiple AWS-FPGA Developer AMI instances, with multiple (currently three) hardware partitions assigned to each instance. When an instance doing Vivado processing is finished, it writes its results into the s3 storage and terminates itself.
A uniquely named directory, whose name is the concatenation of the (flattened) file location of the executable and the checksum of the Verilog files, is created under the user’s designated s3 bucket. A creation time stamp file is written into that uniquely named directory.
A second compilation of the same program with the same Verilog checksum will not be started if a creation time stamp is present, or if the FPGA images have already been created for the program. Thus, each version of the Verilog files of a sequential software application is compiled to FPGA images only once.
Our Vivado compilation scripts will place the FPGA image id (AWS agfi and afi id numbers) of the N generated FPGA images in files named 001, \(002, \ldots, N\) under this uniquely named directory when the Verilog compilation is finished, and will send Vivado design checkpoints to AWS to create the N final FPGA images. When AWS completes the FPGA image creation phase after a considerable time, for each requested FPGA image, the AWS-FPGA image creation service will write the final timing summary to the user’s designated location within s3 storage, and confirm that the requested FPGA image is available for use.
The compiler-generated executable for the software application internally contains the name of the unique directory in s3 that contains its FPGA image id’s and the total number of FPGAs to use for acceleration of the software application.
At the beginning of execution of the software application, the executable invokes a manager function to manage the loading of FPGA images onto F1 instances. This manager function checks, in the unique directory of the AWS s3 bucket, that the correct files 001, \(002, \ldots, N\) have been created, and that the FPGA image id’s therein are currently available. If not all the required FPGA image id’s are available, the application is executed in software only, but a message indicating that the FPGAs were not ready is written to a log file. A second compilation and execution of the same application the next day may find that the FPGAs are ready.
On the other hand, if all the FPGA image id’s are available, the manager function looks up the private IPV4 address of each EC2 F1 instance in the virtual cloud from a table, loads the correct FPGA image id on the F1 machine, and starts DPDK message forwarding. Only after all F1 instances have been thus initialized and DPDK message forwarding has started on each F1 instance, the software application is allowed to start. Since the actual acceleration messaging among FPGAs and the software application must start with the software application sending an “initial registers” message to the first FPGA of the accelerator, no further start-up synchronization is necessary. It suffices that the FPGAs are up and are receiving messages before the software application starts.
When the software program ends, nothing special is done about the F1 instances or DPDK. The next application execution will stop DPDK, and will reload the FPGAs with possibly different images belonging to the next application, and will start DPDK message forwarding again.
The net result is that we have created a gcc-compatible compiler, which keeps working seamlessly even when the FPGA images are not ready because of the extremely long Vivado processing time. When the FPGA images are finally ready, the executable is elevated to the next compiler optimization tier, but its function stays the same, just as in a HotSpot-style just-in-time compiler.

5 Experimental Analysis

Our HLS compiler is able to take a sequential DES key search code as input and automatically generate a DES key search hardware accelerator with different design parameters, e.g., number of FPGAs and number of inner loops per FPGA. Three design configurations are compared here, i.e., (1) with eight FPGAs, eight inner loops per FPGA, (2) with two FPGAs, 16 inner loops per FPGA, and (3) with eight FPGAs, 16 inner loops per FPGA, all running at 250 MHz frequency.

5.1 Results and Evaluation

Preliminary experimental results for DES key search hardware accelerators in the cloud automatically compiled for multiple FPGAs are presented in Table 2.
Table 2.
Initial 56-bit key (hex) | Keys searched (hex) | Keys searched (decimal) | x86 CPU: Time (sec) / keys/sec | 8 FPGAs, 8 inner loops each: Time (sec) / keys/sec | 2 FPGAs, 16 inner loops each: Time (sec) / keys/sec | 8 FPGAs, 16 inner loops each: Time (sec) / keys/sec | Performance ratio vs. x86: 8 FPGAs (8 inner-loop) / 2 FPGAs (16 inner-loop) / 8 FPGAs (16 inner-loop)
DB4A6528000000 | 3019719 | 50,435,865 | 88.44 / 5.703E+05 | 0.0051 / 9.8894E+09 | 0.0080 / 6.3045E+09 | 0.0034 / 1.4834E+10 | 17341.18 / 11055.00 / 26011.76
DB4A6520000000 | B019719 | 184,653,593 | 323.78 / 5.703E+05 | 0.0147 / 1.2561E+10 | 0.0247 / 7.4759E+09 | 0.0086 / 2.1471E+10 | 22024.49 / 13107.69 / 37646.51
DB4A6500000000 | 2B019719 | 721,524,505 | 1265.19 / 5.703E+05 | 0.0531 / 1.3588E+10 | 0.0919 / 7.8512E+09 | 0.0295 / 2.4458E+10 | 23824.30 / 13765.72 / 42883.73
DB4A6400000000 | 12B019719 | 5,016,491,801 | 8796.31 / 5.706E+05 | 0.3599 / 1.3939E+10 | 0.6291 / 7.9741E+09 | 0.1967 / 2.5503E+10 | 24429.98 / 13976.08 / 44699.29
DB4A6000000000 | 52B019719 | 22,196,360,985 | - / - | 1.5860 / 1.3995E+10 | 2.7780 / 7.9901E+09 | 0.8665 / 2.5616E+10 | - / - / -
DB4A4000000000 | 252B019719 | 159,635,314,457 | - / - | 11.3933 / 1.4011E+10 | 19.9681 / 7.9945E+09 | 6.2134 / 2.5692E+10 | - / - / -
DB4A0000000000 | 652B019719 | 434,513,221,401 | - / - | 31.0175 / 1.4009E+10 | 54.3483 / 7.9950E+09 | 16.928 / 2.5668E+10 | - / - / -
DB480000000000 | 2652B019719 | 2,633,536,476,953 | - / - | 196.4420 / 1.3406E+10 | 329.5455 / 7.9914E+09 | 102.53 / 2.5686E+10 | - / - / -
DB400000000000 | A652B019719 | 11,429,629,499,161 | - / - | 832.0758 / 1.3736E+10 | 1429.5599 / 7.9952E+09 | 445.32 / 2.5666E+10 | - / - / -
Table 2. Performance Results of DES Key Search on x86 CPU and our Multi-FPGA System Hardware Accelerator Automatically Created within AWS Cloud Instances in a Push-button Way
Execution time on the x86 CPU becomes very long after some point, since the key space is too large for a sequential execution to complete the search in a reasonable time.
It takes hours or days to complete the experiment on the CPU after some point.
Note that the number of inner loops per FPGA and the number of FPGAs utilized vary across the experiments. Correct 56 bit key \(=\) 0xdb4a652b019719 in hexadecimal (hex).
In particular, referring to line 4 of this table, an application-specific hardware accelerator design consisting of eight FPGAs, each containing 16 inner loops (128 total inner loops), was compiled from the sequential single-threaded DES key search C program of Algorithm 1, and achieved a frequency of 250MHz, and a performance that is \((8796 \text{ sec})/(0.197 \text{ sec}) \approx \mathbf {44,\!600}\) times faster than an AWS EC2 m5.8xlarge Xeon x86 machine running the original sequential single-threaded C program the hardware accelerator was compiled from.
The algorithm, Algorithm 1, is in fact a brute-force exhaustive search of the DES key space, where the search stops immediately after a correct key is found, as we have summarized previously:
Given a plaintext and ciphertext pair and an initial 56-bit key to start the search from, try all candidate keys starting from the initial key in strict sequential order and stop immediately when a key that decrypts the ciphertext correctly obtaining the plaintext is found.
The sequential-natured “find the first correct key” feature makes the algorithm more difficult to parallelize with traditional means as previously discussed. If the first key so found was not correct (was a false positive), the accelerated procedure can be called again to continue where it left off. On average, the correct key will be found after searching half the key space, where the whole key space consists of \(2^{56}\) or about \(72 \times 10^{15}\) candidate keys.
Algorithm 1 has a two-level nested loop, where there is an outer loop beginning on line 15 and an inner loop beginning on line 17. As also mentioned in earlier sections of this article, our compiler generates FSM units for the loops and duplicates the inner-loop FSMs depending on the duplication factor, which is defined by the user or estimated by the compiler. In our case, an outer loop dispatches blocks of keys to a multitude of inner loop iterations, and reorders the results from the inner loops (each returning either “key not found in block of potential keys” or the first working key in the block of potential keys) with extra reordering hardware. Before knowing if the correct key is within block n, the outer loop speculatively dispatches blocks \((n+1), (n+2), \ldots, (n + \text{number of inner loop FSMs} + \delta)\) before checking the result of block n, where \(\delta\) is an extra speculation amount to increase the chances that all inner loop FSMs will be kept busy.
Each inner loop FSM takes a block of potential keys as input and processes the corresponding block by delivering a decryption every cycle once its pipeline is filled. It takes 38 cycles to fill its pipeline. Once the pipelines of the inner-loop FSMs are full, all the operations are executed concurrently, and one new decryption result is delivered in every cycle.
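As a rough back-of-the-envelope model (our own approximation, not a formula produced by the compiler), an inner-loop FSM that needs 38 cycles to fill its pipeline and then retires one candidate key per cycle processes a block of \(B\) candidate keys in approximately
\begin{equation*} \text{cycles}(B) \approx 38 + B, \qquad \text{keys per second per inner loop} \approx \frac{B}{(38+B)/f} \approx f \text{ for } B \gg 38, \end{equation*}
where \(f = 250\) MHz is the clock frequency. This is consistent with the peak-rate estimates discussed in Section 5.2.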
In the experiments below, time spent in function searchkey is measured in microseconds using the linux gettimeofday function in a version that uses FPGA acceleration, and a version of code that does not use FPGA acceleration (running only on one x86 machine).
In our experiments, we call the DES cracking application from the following terminal command running on the Host machine in the cloud:
The x86 processor being compared against the FPGA hardware accelerator is an m5.8xlarge instance on AWS, “Intel Xeon Platinum \(8,\!000\) series processor (Skylake-SP or Cascade Lake) with a sustained all core Turbo CPU clock speed of up to 3.1 GHz”. The original C code is compiled with the following:
for experiments using the x86 CPU alone. Note that because none of the loops of the sequential DES key search algorithm are DOALL loops, due to the data-dependent, conditional exit from the loops on line 19 of Algorithm 1, the x86 CPU baseline implementation is not multi-threaded. Our hardware accelerator was compiled from Algorithm 1 as written, including the data-dependent, conditional exit from its loops. If multi-threading is desired in the x86 CPU, the code must be rewritten so that the data-dependent, conditional exit is removed, and an embarrassingly parallel algorithm specification for DES key search must be used (“try all candidate keys and save the correct ones”). Thus, the performance comparison of our hardware accelerator to the necessarily sequential, single-threaded Algorithm 1, from which the hardware was compiled, is a logical comparison.
It is also important to compare our hardware accelerator compiled from Algorithm 1 as written, to an embarrassingly parallel OpenMP implementation on the same 32 thread m5.8xlarge x86 machine, in terms of candidate keys tried per second. We will provide this comparison in Table 4. In short, as compared to the sequential single-threaded Algorithm 1 as written, the embarrassingly parallel OpenMP implementation examines 15.8x more candidate keys per second, but performs more work on average.
To be able to compare the FPGA-based hardware accelerator’s execution time to that of an x86 CPU implementation of the same sequential code, the run time needs to be shortened. For the present experiment, a random 64 bit plaintext 0x055424a43a2ccef5, and a random 56 bit key 0xdb4a652b019719 was picked, and the plaintext was encrypted with the key, obtaining the ciphertext 0xdb8b883c1da79b9a. Thus, the correct 56 bit key that needs to be found in this experiment happens to be 0xdb4a652b019719. The algorithm will try to decrypt the given ciphertext with exactly (correct key - initial candidate key) incorrect candidate keys, comparing the decrypted text to the expected plaintext value in each case, before finding the correct key. The accelerated hardware version will try several more candidate keys speculatively, but the useless speculative work (that the original sequential program would not have executed) will be discarded.
The Register Transfer Level accelerator designs were also tested for correctness with random inputs.
Table 2 contains nine lines, and each line corresponds to a specific portion of the key space. After the fourth line, the key space becomes so large that execution on the x86 CPU would take an extremely long time (finding the key in a reasonable time is not possible). In the first line of this table, the starting candidate key is DB4A6528000000, which is the correct key with the low-order 27 bits set to 0. Each subsequent line in the table causes more keys to be tried by setting one more bit of the correct key to 0 and using that as the initial key, therefore gradually increasing the number of incorrect keys being tried, and therefore increasing the execution time. When the initial key is set to all 0 bits, the searchkey function will perform a full DES key search, potentially searching the entire \(2^{56}\) candidate key space. The present experiments also allow us to compute the
\begin{equation*} \texttt {(number of keys)/(second)} \end{equation*}
parameter for each different multi-FPGA design, which is a fair metric to compare different computations.
In the experiment of the fourth line of Table 2 related to the eight FPGAs and 16 inner loops per chip design, about 5.016 billion keys (see column 3) are searched, in 0.197 seconds, which means 25.5 billion keys per second performance is achieved with eight FPGAs and 16 inner loops per chip. The performance improvement vs. the x86 CPU running the sequential single-threaded C program the accelerator was compiled from, is about \(44,\!600\)x.
In the experiment of the fourth line of Table 2 related to the eight FPGAs and eight inner loops per chip design, about 5.016 billion keys are searched, in 0.360 seconds, which means 13.9 billion keys per second performance is achieved with eight FPGAs and eight inner loops per chip. The performance improvement vs. the x86 CPU is about \(24,\!400\)x.
In the experiment of the fourth line of Table 2 related to the two FPGAs and 16 inner loops per chip design, about 5.016 billion keys are searched, in 0.629 seconds, which means 8.0 billion keys per second performance is achieved with two FPGAs and 16 inner loops per chip. The performance improvement vs. the x86 CPU is about \(13,\!900\)x.
Note that when dependences permit, there is no limit in our HLS compiler on the number of FPGAs when creating an FPGA-based hardware accelerator from the sequential high-level description. The number of inner-loop copies per chip is constrained by the amount of logic resources available in the target FPGA/ASIC device. But if there are enough hardware logic resources to enable instantiating many inner-loop copies and dependences permit, our compiler is able to efficiently utilize these resources even if they are in different SLR regions, in different chips, or in different instances.
Our HLS compiler automatically synthesizes, places, and routes the design for the AWS EC2 F1 instance FPGA targets, using the AWS infrastructure in the cloud, and, therefore, can also collect area and timing information for each FPGA chip. Table 3 summarizes our place-and-route results measured with Vivado 2020.2. In our ongoing work, we are optimizing component placements in the FPGA devices to achieve higher-frequency targets, as we will mention in the following section.
Table 3.
           | SLRs | LUT       | FF        | DSP   | URAM | BRAM
Utilized   | 3    | 597,268   | 445,877   | 3     | 43   | 447
Available  | 3    | 1,180,984 | 2,361,968 | 6840  | 960  | 4320
Percentage | 100% | 51%       | 19%       | 0.04% | 5%   | 10%
Table 3. Place-and-Route Results for the First Partition (FPGA 1) of DES Key Search (with 8 FPGAs, 16 Inner Loops Per FPGA Experiment) Running at 250MHz on AWS f1.2xlarge Instance (xcvu9p-flgb2104-2-i)

5.2 Discussion and Ongoing Work

By building upon our partitioning techniques for multiple chips, we are considering, as ongoing work, compiler-directed placement of hardware components into SLRs and into regions smaller than SLRs, which can potentially improve the frequency of the hardware accelerator design.
The eight FPGA design with 16 inner loops per FPGA delivers only 25.5 billion keys per second, but its peak performance at 250 MHz should be
\begin{equation*} (250 \cdot 10^{6} \cdot 8 \cdot 16) \approx 32 \text{ billion keys per second}, \end{equation*}
even with the wasted work due to speculation. Similarly, the eight-FPGA design with eight inner loops per chip delivers only 13.9 billion keys per second, while its peak performance should be 16 billion keys per second. The inner loops are possibly not getting enough work; we are investigating the reasons. On the other hand, the two-FPGA, 16 inner loops per chip design does achieve its peak performance of about 8 billion keys per second.
The main goal of the present work is to showcase our HLS compilation technology on a computation intensive problem expressed using a sequential abstraction. However, actual DES key search can also be done with increased speed using our technology. The average running time in seconds for a DES cracker that stops as soon as a correct key is found can be simply calculated as
\begin{equation*} \frac{(2^{56})/2}{(\texttt {number of chips}) \cdot (\texttt {keys per second per chip})}. \end{equation*}
As only a guess, if 32 inner loops can be placed in each FPGA chip by various improvements, each FPGA will search very nearly 8 billion keys per second. Then, with \(1,\!024\) FPGAs on AWS, the full 56 bit DES cracking problem can be solved in:
\begin{equation*} \frac{(2^{56})/2}{(1024)\cdot (8 \cdot 10^{9})} = 4,\!398 \text{ seconds,} \end{equation*}
or 1 hour and 13 minutes on average, which will beat the performance of all existing DES cracking implementations, at a cost of about $\(2,\!061\), assuming on-demand pricing on AWS at $1.65 per FPGA per hour (billed by the second). When the user is billed by the second as on the AWS platform, and when performance increases linearly with the number of FPGAs, increasing the number of FPGAs reduces the solution time for a problem but does not significantly change the cost. Increasing the number of FPGAs to \(2,\!048\), \(4,\!096,\) and \(8,\!192\) will reduce the hardware accelerator completion time to \(1/2\) (36.6 minutes), \(1/4\) (18.3 minutes), and \(1/8\) (9.2 minutes), respectively, while the cost stays about the same.
In the present article, we showed that a straightforward sequential C implementation of the DES key search algorithm is one and the same as a highly parallel application-specific hardware accelerator for performing the DES key search, exactly as defined in the sequential C implementation. Also note that the on-demand availability of multiple high-performance FPGA chips in the cloud that are billed by the second makes nearly unbounded resources available to the hardware designer on a reasonable budget (in fact, the flat, non-partitioned hardware accelerator shown in Figure 1 in this article is similar to a chip of unbounded size, which is then automatically partitioned so that each partition fits in a single chip). Such large resources were not easily available to hardware designers before. Also note that while a straightforward sequential code may correspond to one kind of application-specific hardware accelerator, a hardware designer with a clear understanding of how sequential code is mapped to highly parallel hardware can find ways to recode the sequential algorithm in an alternative way, e.g., to reduce the critical path, or to improve hardware utilization. The resulting hardware can be a different and better application-specific hardware accelerator. The fact that a highly parallel application-specific hardware accelerator design has been shown to be one and the same as a sequential C/C++ function is, we believe, a harbinger of exciting future possibilities toward more productive hardware design.

6 Related Work

In this section, we will summarize the previous work related to our research.
Modern High-Level Synthesis Tools: Even for a single application, FPGA programming for achieving high performance can be a very challenging and time-consuming process with Register Transfer Level design. For example, designing manually-tuned hardware architectures for cryptographic algorithms that mainly contain several bit manipulations similar to the DES algorithm requires extensive Register Transfer Level design efforts (finding the best scheduling, creating the optimum datapath architecture for each algorithm, and designing the controller circuit etc.) [13, 14, 15, 76, 77]. To overcome this barrier in design productivity, HLS tools have been developed [21, 28, 59, 60, 72, 92]. These tools enable hardware and software designers to generate RTL hardware architectures from a function written in the C/C++ high-level language; however, they require many manual steps in order to create a full system together with I/O controllers for cross-chip communication over a scalable network, specialized memory hierarchies, and specialized networks between the hardware units of the accelerator for achieving high performance.
In our approach [42], our compiler generates all the necessary hardware components at Verilog-level for a full system including I/O controllers for cross-chip communication, task networks for efficient dispatch of hardware tasks, accelerated execution units, specialized memory hierarchies, and so on. The hardware accelerator generated by our compiler contains many pipelined finite-state machines that synchronize with each other using specialized synchronization circuits, and jointly perform the function described by the sequential C/C++ code in parallel. These compute engines are interconnected via on-chip and off-chip networks with compiler-level customization for each specific application that is accelerated. As another new feature, our HLS compiler is able to pipeline outer loops with loop-carried control dependences, which has been further elaborated in the “Contributions” section earlier in this article.
Liu et al. in [67] propose ElasticFlow, a method for software pipelining of an outer loop without fully unrolling its inner loops, but were unaware of our earlier work in [42] on the same topic. To pipeline an outer loop containing an inner loop, [67] proposes, for example, in Figure 3 of [67], replacing the inner loop by:
a distributor network for distributing inner loop tasks (this was disclosed at least as a “task network” and an “incomplete butterfly network” in column 17, lines 36–67, and Figures 7–9 of [42]),
multiple copies of the inner loop where the copies operate in parallel (this was disclosed at least as “hierarchical software pipelining” in Figure 15 and column 25, line 45 –column 26, line 13 of [42]), and
a reordering collector network for collecting results of the inner loop copies and putting them back in order (this was disclosed at least in section “How to receive responses out of order” starting at column 60, line 28 of [42]).
The ElasticFlow article further suggests improving hardware utilization by having a hardware functional unit implement more than one function, e.g., in Figures 5(d) and (e) of the ElasticFlow article, named the mLPA Architecture (this was disclosed at least in Figures 58 and 59 and the section “Primitive structural transformations for sharing resources among thread units” starting on column 90, line 55 of [42]).
Furthermore, the ElasticFlow article does not propose a method for handling loop-carried dependences in an outer loop, or for scalable partitioning of a resulting large design into multiple chips.
Dai et al. [33] allows MAYBE dependences between memory instructions to be optimistically ignored (achieved by carefully coding each memory operation as an access to a different array, as appropriate, in the input C/C++ code fed into a commercial HLS compiler), and suggests constructing a customized, application-specific memory with a high number of virtual ports, that takes on the responsibility of detecting data speculation errors among memory accesses at run time. This article also implements a data speculation error recovery mechanism in cooperation with the pipelined FSM, by sending a replay signal to the pipelined FSM in the midst of pipeline execution when a data speculation error is detected, e.g., when a logically earlier store in iteration n is determined to store into the same location as a logically later load in iteration n + k, k \(\gt\) 0, that was already executed with data speculation, and has already loaded the wrong, stale value of memory location. An implication of the recovery mechanism is an instant reverse execution of iterations n + k, n + k + 1\(, \ldots\) in the FSM to return to exactly the cycle where the incorrect data-speculative load of iteration n + k is executed, with minimal disruption to a standard software pipelined schedule. Iterations n, n + 1\(, \ldots,\) n + k \(-\) 1 can continue unimpeded. This article’s approach is more resource efficient than the earlier, general purpose Multiscalar Architecture “Address Resolution Buffer” work [51] (for avoiding an associative search among many addresses, also see [61]), because of the application-specific customization during HLS. The article’s approach is also potentially less complex than software pipelining of a loop with conditional branches (implied by the load/store address comparisons in this problem), which can lead to a Minimum Initiation Interval that dynamically varies at run-time [45]. However, a further implication of the method proposed in the article is that not only the incorrect data speculative load of iteration n + k must be re-executed, but also all the operations that depended on this load that were already executed, must be re-executed with their original operands (some of these operands may be corrected after re-executing the incorrect load). This further implies undoing changes to memories and registers to return to the exact cycle of the incorrect data speculative load of iteration n + k. Also, any harmful side effects that depend on an incorrect data-speculative load (e.g., sending a network packet that has a side effect like printing a check) must be prevented. Also, a second store occurring later in iteration n can be later found to overlap with a different incorrect data-speculative load that occurred even earlier in iteration n + k, leading to a second replay/reverse execution of iterations n + k, n + k + 1\(, \ldots\) emanating from the data speculation error detected by the second store in iteration n. But this interesting article feels incomplete to the reader, in the sense that these implications and their solutions are not discussed at all, and no HLS compiler algorithm is given. Only hardware solutions for particular examples are given. 
The squash and recovery (e.g., reverse execution) issues for handling data speculation in outer loops have not yet been addressed in Dai et al., since this article relies on a standard software-pipelined schedule produced by a commercial HLS tool, and commercial HLS tools do not currently implement outer loop pipelining without full unrolling of inner loops.
Push-button High-Level Hardware System Compilation: There are a small number of existing push-button compilation systems from a sequential C program to create an application-specific hardware in the open literature. Zhang et al. [94] present a push-button compilation from a C program into a full system hardware design on FPGA, with the help of pragma directives. Compared to the study by Zhang et al., our method (1) does not utilize any parallelization directives and relies solely on the inherent parallelism within the input sequential single-threaded code, (2) aims at creating a massively parallel multi-FPGA system with a minimum latency when sufficient resources are available, and (3) also achieves high-frequency thanks to our proposed RTL-level inter-SLR pipelined communication mechanism. Zhang et al. mainly utilize the polyhedral model to optimize the loops and propose a task-level polyhedral model. According to their example model [94, Figure 5], using pragmas, an innermost loop may be removed from the polyhedral model in order to reduce complexity and also gain freedom from the burden of meeting the requirements of using the polyhedral model. But unlike the polyhedral model, our compiler does not impose any requirement at all on the input program. In particular, using affine array subscript expressions or affine loop bounds are not required. For example, code involving linked list traversals (e.g., searching a key in a linked list of linked lists of key-value pairs as in the xlygetvalue function on page 3 of [66]) can be parallelized using our hierarchical software pipelining.
But polyhedral loop transformations can potentially make radical changes in a sequential program without altering the program’s function, in the case where the user does not care about precise exceptions. Polyhedral loop transformations may in principle be performed on the input sequential code of our compiler, before parallelization with our hierarchical software pipelining begins, to provide an additional benefit. One such transformation could be the fusion of nested loops [95], which can reduce the total latency of a hierarchically software-pipelined program.
Cong et al. [26, 27] propose an automated FPGA compilation solution for an application-specific hardware design with the help of pragma directives, depending on the program to be accelerated. Cong et al. have used the Merlin compiler and have mapped the accelerators to a rack-level multi-chip system. Their Merlin compiler requires a user/digital designer to specify how the user’s C/C++ program is to be parallelized, using OpenMP-like pragma directives such as parallel and pipeline. In Cong et al. the baseline overall framework is an existing parallel software framework, namely, Apache Spark (see Figure 3 in Cong et al.). The Merlin compiler takes a user’s C/C++ code as input and produces optimized OpenCL code, which is deployed on an FPGA connected to each node within the parallel software framework, using the OpenCL implementation available on the target platform. There is no communication among the FPGAs in different nodes. Our compiler approach differs from Cong et al., since it relies not on OpenCL and not on any parallel software platform such as Spark, but on the inherent maximum parallelism in the sequential single-threaded input code, and does not require parallelization directives on the part of the user. Our compiler creates a multi-chip FPGA hardware accelerator directly from sequential code, where the FPGAs communicate and synchronize among themselves to jointly perform the acceleration.
SLR Crossing in an FPGA chip: Another problem is that today’s HLS tools suffer from timing issues due to generated long circuit paths, especially across multiple SLRs, which prevent reaching high-frequency designs. A very recent study that addresses such a problem is the AutoBridge system proposed by Guo et al. [53]. They propose a methodology to enhance the design by coupling coarse-grained floorplanning with pipelining in order to meet timing closure, especially when there are long wires crossing the SLRs in the latest modern FPGAs, e.g., the FPGA devices in the AWS cloud. Their method assumes that the HLS functions are written in a dataflow programming model; they divide the FPGA device into a set of regions and assign each HLS function to one region. When inter-region communication is needed, Guo et al. pipeline this communication during HLS compilation. An extension of their method to non-dataflow programs is given as a discussion, although the designs appear to be dependent on the dataflow programming model, whose insensitivity to message latencies, typically small-sized filter functions, and point-to-point FIFO communication channels seem essential to the success of the AutoBridge floorplanning methodology. Unlike our work, AutoBridge does not optimize its FIFOs with credit-based flow control. Our approach of placing a small auto-pipelined FIFO with credit-based flow control in each on-chip network communication channel where an SLR crossing is possible has the advantage that it is simple and works with Vivado’s existing automatic placement phase. Note that our credit-based auto-pipelined long-distance FIFO units can be used to reduce resources in any design that pervasively uses FIFO communication channels between components. Our proposed method is furthermore not restricted to any programming model, such as dataflow.
Existing Multi-FPGA Architectures: Caulfield et al. [22] present a reconfigurable cloud architecture for datacenter applications. Their proposed architecture has FPGA devices between network switches and the servers to tailor the hardware architecture to the selected workload. Their architecture has direct FPGA-to-FPGA communication for better latency. It provides (1) local compute acceleration, (2) network acceleration, and (3) global application acceleration. Caulfield et al. [22] demonstrated the performance of their Configurable Cloud architecture for Web search ranking and high-speed (network line rate) encryption workloads. According to their results, they offload encryption/decryption processes into FPGA devices to reduce the burden on the CPU cores. For example, without FPGAs, 5 or 15 cores may be required just for cryptography, depending on the encryption procedure. But when using FPGAs to accelerate the cryptography tasks, the CPU cores thus unburdened can now become free to do other work and can generate revenue.
There is a recent study [82] that accelerates database management systems (DBMSs) in the AWS F1 cloud for a single FPGA card using the Vivado HLS tool. Sun et al. [82] also discuss potential directions for using multiple FPGAs in the cloud for DBMS acceleration. Jianga et al. [62] propose a method for the design of real-time AI applications on a platform with a CPU and one FPGA. They also discuss a multi-FPGA design for their method as a future study.
However, previous work has not displayed a full system design perspective including higher design productivity and multi-FPGA system compilation. Our proposed high-level synthesis compiler addresses this full-system design perspective problem efficiently, by automatically compiling sequential code into an application-specific hardware accelerator system targeting a multi-FPGA cloud.
There are also earlier studies about multi-FPGA design, but they are mainly implemented on local FPGAs (see for instance [87]).
Previous Fastest DES Cracker Machines: In 1998, the Electronic Frontier Foundation (EFF) DES Cracker machine [50], which contains custom ASIC chips, found the 56-bit key in 56 hours. The machine contains 29 boards, where each board has 64 chips. Each chip has 24 non-pipelined search units, and each search unit completes one decryption in 16 cycles running at 40 MHz. Hence, each search unit could examine 2.5 million keys per second. The cost of the project was around $\(250,\!000\).
As emphasized by EFF team, “The right way to crack DES is with special-purpose hardware. A custom-designed chip, even with a slow clock, can easily outperform even the fastest general-purpose computer” [50]. However, to get the highest performance with a specialized hardware architecture for each application, Register Transfer Level hardware design complexity is a barrier. In our study, we address this problem and demonstrate that high performance is achieved using a high-level compiler that generates a full system architecture from merely sequential code. Our present proposal to create, deploy, and terminate a virtual FPGA-based hardware accelerator in the cloud on-demand, paying only during the actual use of the hardware accelerator, is also very cost effective for short-running applications as compared to an ASIC-based hardware accelerator, given the high cost of ASIC chip design and manufacturing.
But the fact that FPGA-based hardware accelerators rented by the second in the cloud are cost-effective for short-running applications, does not diminish the total cost of ownership (TCO) advantages of ASIC-based hardware accelerators in the cloud, for long-running applications [69]. Because of ASIC chip foundry features for optimizing total ASIC design cost at low and medium volume, such as multi-project wafer and multi-layer mask, and because the software investment for deploying a multi-chip hardware accelerator created by our HLS compiler is lower than normal, an ASIC-based hardware accelerator can indeed become cost-effective for important applications that must be repeatedly executed [47, 48, 90] and will achieve better performance, better power efficiency and lower total cost of ownership. ASIC-based hardware accelerators can also realize a higher performance scalable chip-to-chip network as compared to, e.g., an AWS virtual private cloud.
In 2019, Sugier [81] showed that 40 different keys per cycle can be searched with 40 pipelined parallel decoding modules (P16 version) running at 186 MHz on a single Xilinx 7S100 low-cost FPGA device. The proposed P16 architecture is approximately 150 times faster than one custom ASIC chip proposed in [50]. However, the technologies used in these two chips are different.
One of the modern DES cracking services is crack.sh [29], an online service for commercial DES cracking. It is a manually tuned hardware implementation of DES key search. crack.sh promises that the searched key will be found within at most 26 hours using 48 Virtex-6 FPGA devices. Each FPGA contains 40 fully pipelined DES cores that run at 400 MHz, meaning that the 48 FPGAs together search \(1,\!920\) different keys per clock cycle.
It is interesting to note that in the crack.sh website, the designers also estimate that \(1,\!800\) Graphics processing unit (GPU) devices would be needed to perform the same DES key search within 26 hours with GPU computing. Note that since FPGAs are power-efficient [89], an FPGA-based design is also energy efficient as compared to a GPU-based solution running the same algorithm.
For reference, we are including the performance of the sequential single-threaded Algorithm 1 on an m5.8xlarge AWS x86 CPU in the line marked “m5.8xlarge Xeon, single-threaded”. Note that current automatic parallelizers are not able to automatically convert the sequential single-threaded Algorithm 1 into an embarrassingly parallel algorithm, because Algorithm 1 has a data-dependent, conditional exit on line 19 from both of its loops. But a programmer can rewrite Algorithm 1 in an embarrassingly parallel way without the data-dependent conditional exit, although the embarrassingly parallel version is not equivalent to the single-threaded version: Assuming the correct solution key is random, Algorithm 1 tries half of the candidate keys on average (performs half the work of the embarrassingly parallel version) and examines the candidate keys following a strict sequential order, whereas the embarrassingly parallel version will try all candidate keys and will return all potentially correct keys. We have indeed rewritten Algorithm 1 in this embarrassingly parallel way, with nested DOALL loops, have used OpenMP to parallelize it, and have measured its performance on the same m5.8xlarge AWS machine instance (which has 32 threads or vCPUs); this appears as the line “m5.8xlarge Xeon, embarrassingly parallel”. The recoded OpenMP implementation achieved 15.8x the keys/second performance of the sequential single-threaded algorithm on the m5.8xlarge x86 machine. Note that the results in this table do not consider the “less work” advantage of our parallelized single-threaded Algorithm 1. But it can be seen that an FPGA or GPU has a performance advantage over a CPU for the DES key search application.
GPU-based commodity computing machines are powerful for single-instruction multiple data (SIMD) type of applications since they process a high number of threads concurrently within their large number of hardware execution units. There are some studies that use GPUs for a DES cracking application. Ahmadzadeh et al. [1] propose a single instruction multiple thread architecture for DES cracking application on GPUs. Instead of conventional bit swapping during a bit permutation, entire registers are swapped in their method. By enabling register swap and shared memory implementation of the DES algorithm on GPUs, in each iteration, each thread examines a set of 32 keys. They achieved approximately \(6.5\times 10^9\) keys per second with only one GPU device. They also showed linear speed-up when two GPUs are used and accomplished \(13\times 10^9\) keys per second performance since in their model, there is no dependency between keys.
However, an individual encryption or decryption algorithm of [1] is no longer the original DES algorithm; it is a different algorithm that takes a (plaintext, ciphertext) pair and a vector of 32 56-bit candidate keys (stored in transposed bit-matrix form inside 56 32-bit registers) and returns a vector of 32 results, indicating which, if any, of the 32 candidate keys were correct. The algorithm also relies on the key evaluations being independent: It does not comply with any requirement to stop and return the correct key as soon as a correct key is discovered, trying all candidate keys in strict sequential order. Thus, the starting point of [1] is a different algorithm as compared to the other DES key search studies discussed here.
Table 4 shows the keys-per-second performance of DES cracker machines in the open literature, of our x86 CPU experiments, and of our main result, a multi-FPGA accelerator automatically compiled from a high-level description. As one can see from Table 4, crack.sh [29] performs best when keys per second is considered. However, it uses 48 cooperating chips, more chips than are used in our present study. All of the open-literature implementations are manually designed or tuned for their target platforms or architectures. By contrast, our method presents an automatic translation from a simple high-level C/C++ description to a cloud-level, multi-instance, multi-FPGA implementation of DES.
Table 4. Candidate Keys Per Second Performance of DES Cracker Machines

System | Year | Target platform | Number of chips | Number of search units per chip | Keys per second in total
EFF DES Cracker machine [50] | 1998 | ASIC | 64 | 24 search units | 3.8 billion
Sugier [81], P16 | 2019 | FPGA | 1 | 40 pipelined decrypters | 7.45 billion
crack.sh [29] | 2021 | FPGA | 48 | 40 fully pipelined DES cores | 768 billion
Ahmadzadeh et al. [1] | 2018 | GPU | 4 | 128 threads | 26 billion
m5.8xlarge Xeon, single-threaded (our sequential software experiment) | 2021 | CPU | 1 | 32 threads (1 used) | 0.0005706 billion
m5.8xlarge Xeon, embarrassingly parallel (our OpenMP parallel software experiment) | 2021 | CPU | 1 | 32 threads (all used) | 0.009015 billion
This work (automatic multi-FPGA compilation from sequential description) | 2021 | FPGA | 8 | 16 inner loop search engines | 25.6 billion
Table 4 contains two CPU-based experimental results for the DES key search application. One of them is the single-threaded implementation; our purpose in giving this result is to highlight the performance gain, since our compiler accepts the same sequential C description. We have also provided a multi-core implementation of the DES key search targeting the m5.8xlarge AWS machine instance with 32 vCPUs, in order to utilize all the available processor cores. Our automatically generated hardware system for AWS FPGA instances performs much better than both the single-core and multi-core CPU implementations.
The current DES cracking literature mostly optimizes the workload manually for the targeted platform. Our method solves most of the low-level implementation issues automatically. The DES algorithm is just an example for our proposed multi-FPGA compiler framework. Our work targets an FPGA platform, but it can easily be configured for other hardware platforms. The main difference between our work and the literature is that our hardware accelerator is automatically compiled from a sequential, non-optimized C/C++ program. A second difference is that our implementation, being a parallelization of the single-threaded Algorithm 1, does less work on average than an embarrassingly parallel implementation.

7 Conclusion

We presented an application-specific, high-performance approach for multi-FPGA accelerator system design starting from sequential code. We implemented, tested, and verified our push-button system design model on FPGA-based AWS EC2 F1 instances and demonstrated its viability.
The authors believe that at least one surprising and non-obvious feature of this work is that an entire highly parallel FPGA hardware accelerator system has been fully described by ordinary sequential code, without any parallelization directives. The authors believe that sequential code can offer a more productive way to design future multi-chip FPGA-based or ASIC-based application-specific hardware accelerator systems.

A Automated Deployment and Termination of FPGA-based Hardware Accelerators in the Cloud

Developing and deploying even a single FPGA accelerator in the AWS cloud currently takes a long time and requires many manual steps, each of which can go wrong during development or deployment. Thus, it is essential to reliably automate the deployment and termination of a multi-chip FPGA accelerator in the cloud. In addition to our HLS compiler, we have created the following commands, which use the AWS-CLI primitives under the covers and can significantly increase the productivity of a user deploying multi-chip FPGA accelerators (a rough sketch of how such a command can wrap the AWS CLI follows the command list):
FPGA_create command:
This command creates an AWS virtual private cloud complete with two subnets (the first subnet intended for normal access, including ssh access, and the second subnet reserved only for accelerator messages among the FPGAs and the host CPU) for running several EC2 F1 instances containing FPGAs, as well as one non-FPGA instance (the host CPU) for running the non-accelerated part of the user’s sequential application software. However, FPGA_create does not start any machine instances.
FPGA_destroy command:
This command destroys a previously created virtual private cloud for running FPGA-based hardware accelerators. No instances should be running on the virtual private cloud.
FPGA_start command:
This command allocates second network interfaces for <max machines> machines within the virtual cloud previously created for <FPGA cluster name>, if not already allocated, but starts only <machines to start now> machines. All instances are allocated within the same placement group to minimize communication delay. Using the MAC addresses and fixed static private IP addresses of the second network interfaces, it is possible to create, say, nine machines (one non-FPGA instance and eight FPGA instances) but start just one machine to run the compiler. Since the second network interfaces of all machines in the hardware accelerator are known, the compiler can generate Verilog code with hard-coded MAC and private IP addresses (belonging to the second subnet and second ENI) of all the EC2 F1 instances within the virtual private cloud, achieving the best efficiency in inter-FPGA communication without needing to bring up the actual F1 instances during compilation.
FPGA_stop command:
This command terminates all running machines of this FPGA cluster.
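As a rough illustration only, and not the actual implementation of these commands, an FPGA_create-style command could wrap standard AWS CLI primitives as in the following C sketch. The CIDR blocks, the fpga_cluster placement-group name, and the vpc-PLACEHOLDER id are hypothetical placeholders; in practice, the resource ids would be parsed from the CLI’s JSON output.

```c
#include <stdio.h>
#include <stdlib.h>

/* Run one AWS CLI command, echoing it first and aborting on failure. */
static void run(const char *cmd)
{
    printf("+ %s\n", cmd);
    if (system(cmd) != 0) {
        fprintf(stderr, "command failed: %s\n", cmd);
        exit(1);
    }
}

int main(void)
{
    /* One VPC for the whole accelerator cluster. */
    run("aws ec2 create-vpc --cidr-block 10.0.0.0/16");

    /* First subnet: normal access (ssh, control traffic).
     * Second subnet: reserved for accelerator messages.
     * The VPC id below stands in for the id returned by create-vpc. */
    run("aws ec2 create-subnet --vpc-id vpc-PLACEHOLDER --cidr-block 10.0.1.0/24");
    run("aws ec2 create-subnet --vpc-id vpc-PLACEHOLDER --cidr-block 10.0.2.0/24");

    /* Cluster placement group, so that instances started later sit
     * close together on the network for low inter-FPGA latency. */
    run("aws ec2 create-placement-group --group-name fpga_cluster"
        " --strategy cluster");
    return 0;
}
```

In the same spirit, an FPGA_start-style command could wrap aws ec2 run-instances (with a --placement GroupName=... option so that all F1 instances land in the same placement group), and an FPGA_stop-style command could wrap aws ec2 terminate-instances.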
It would have been better to deploy an FPGA image on an FPGA hardware resource instantly, on demand, at the point where the application actually starts using the FPGA image, and to terminate the deployment of the FPGA image as soon as the FPGA resource currently running the image is perceived to be idle and/or must be preempted by a higher-priority hardware task (e.g., see the work on an all-hardware parallel hypervisor for efficient on-demand deployment of multi-chip accelerators within future FPGA and ASIC clouds in [41]). But as of today, initialization of the AWS EC2 F1 instances takes minutes and is not quick enough for an on-demand launch. This is why we settled on the above FPGA_start and FPGA_stop commands, which can be issued just before and just after a series of accelerated application executions, to minimize costs.

References

[1]
Armin Ahmadzadeh, Omid Hajihassani, and Saeid Gorgin. 2018. A high-performance and energy-efficient exhaustive key search approach via GPU on DES-like cryptosystems. The Journal of Supercomputing 74, 1 (2018), 160–182.
[2]
Alibaba Cloud. 2021. Deep Dive into Alibaba Cloud F3 FPGA as a Service Instances. Retrieved from https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057. Accessed: 2021-04-19.
[3]
Alibaba Cloud. 2021. FPGA-accelerated compute optimized instance family. Retrieved from https://www.alibabacloud.com/help/doc-detail/108504.htm. Accessed: 2021-04-19.
[4]
Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software pipelining. ACM Computing Surveys 27, 3 (1995), 367–432.
[5]
John Randal Allen. 1983. Dependence Analysis for Subscripted Variables and Its Application to Program Transformations. Ph.D. Dissertation. Rice University.
[6]
Randy Allen and Ken Kennedy. 1987. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems 9, 4 (Oct. 1987), 491–542.
[7]
Randy Allen and Ken Kennedy. 1987. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems 9, 4 (1987), 491–542.
[8]
Amazon. 2021. Amazon EC2 Placement Groups. Retrieved from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html. [Online; accessed 11-April-2021].
[9]
Amazon EC2. 2021. AWS EC2 FPGA Development Kit. Retrieved 28-June-2022 from https://github.com/aws/aws-fpga.
[10]
Amazon Elastic Cloud Compute. 2021. Amazon EC2 F1 Instances. Retrieved from https://aws.amazon.com/ec2/instance-types/f1/. Accessed: 2021-04-19.
[11]
Amazon FPGA Development User Forum. 2021. FPGA Development - AWS Developer Forums. Retrieved from https://forums.aws.amazon.com/forum.jspa?forumID=243&start=0. Accessed: 2021-04-19.
[12]
Arvind, Kim P. Gostelow, and Wil Plouffe. 1978. ID Report: An Asynchronous Programming Language and Computing Machine. Technical Report 114, University of California at Irvine, Computer Science Department, May 1978.
[13]
Nuray At, Jean-Luc Beuchat, Eiji Okamoto, Ismail San, and Teppei Yamazaki. 2013. Compact hardware implementations of chacha, BLAKE, threefish, and skein on FPGA. IEEE Transactions on Circuits and Systems I: Regular Papers 61, 2 (2013), 485–498.
[14]
Nuray At, Jean-Luc Beuchat, Eiji Okamoto, Ismail San, and Teppei Yamazaki. 2017. A low-area unified hardware architecture for the AES and the cryptographic hash function Grøstl. Journal of Parallel and Distributed Computing 106 (2017), 106–120.
[15]
Nuray At, Jean-Luc Beuchat, and Ismail San. 2012. Compact implementation of threefish and skein on FPGA. In Proceedings of the 2012 5th International Conference on New Technologies, Mobility and Security. IEEE, Istanbul, Turkey, 1–5.
[16]
John Aycock. 2003. A brief history of just-in-time. ACM Computing Surveys 35, 2 (June 2003), 97–113. DOI:
[17]
John Backus. 1978. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM 21, 8 (1978), 613–641.
[18]
Michaela Blott, Thomas B. Preußer, Nicholas J. Fraser, Giulio Gambardella, Kenneth O’brien, Yaman Umuroglu, Miriam Leeser, and Kees Vissers. 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems 11, 3 (2018), 1–23.
[19]
M. Bohr. 2007. A 30 year retrospective on Dennard’s MOSFET scaling paper. IEEE Solid-State Circuits Society Newsletter 12, 1 (Winter 2007), 11–13.
[20]
Pierre Boulet, Alain Darte, Georges-André Silber, and Frédéric Vivien. 1998. Loop parallelization algorithms: From parallelism extraction to code generation. Parallel Computing 24, 3–4 (1998), 421–444.
[21]
Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 33–36.
[22]
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, Taipei, Taiwan, 1–13.
[23]
Jongsok Choi, Ruolong Lian, Zhi Li, Andrew Canis, and Jason Anderson. 2018. Accelerating memcached on AWS cloud FPGAs. In Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. Association for Computing Machinery, New York, NY, 1–8.
[24]
Young-Kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2019. In-depth analysis on microarchitectures of modern heterogeneous CPU-FPGA platforms. ACM Transactions on Reconfigurable Technology and Systems 12, 1, Article 4 (feb 2019), 20 pages.
[25]
Lyndon Clarke, Ian Glendinning, and Rolf Hempel. 1994. The MPI message passing interface standard. In Programming Environments for Massively Parallel Distributed Systems, Karsten M. Decker and René M. Rehmann (Eds.), Birkhäuser Basel, Basel, 213–218.
[26]
Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. 2016. Software infrastructure for enabling FPGA-based accelerations in data centers: Invited paper. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. Association for Computing Machinery, New York, NY, 154–155.
[27]
Jason Cong, Muhuan Huang, Di Wu, and Cody Hao Yu. 2016. Invited - heterogeneous datacenters: Options and opportunities. In Proceedings of the 53rd Annual Design Automation Conference. Association for Computing Machinery, New York, NY, Article 16, 6 pages.
[28]
Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (2011), 473–491.
[29]
crack.sh. 2021. The World’s Fastest DES Cracker. Retrieved from https://crack.sh/. Accessed: 2021-04-20.
[30]
Ron Cytron. 1985. Useful parallelism in a multiprocessing environment. IBM Thomas J. Watson Research Division.
[31]
Ron Cytron. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. IEEE, 836–844.
[32]
Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55.
[33]
Steve Dai, Ritchie Zhao, Gai Liu, Shreesha Srinath, Udit Gupta, Christopher Batten, and Zhiru Zhang. 2017. Dynamic hazard resolution for pipelining irregular loops in high-level synthesis. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 189–194. DOI:
[34]
J. Dean, D. Patterson, and C. Young. 2018. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro 38, 2 (Mar 2018), 21–29.
[35]
Robert H. Dennard, Fritz H. Gaensslen, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc. 1974. Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits 9, 5 (1974), 256–268.
[36]
Jack B. Dennis. 1968. Programming generality, parallelism and computer architecture. In Proceedings of Information Processing, IFIP Congress 1968.
[37]
Jack B. Dennis. 1972. On the Design and Specification of a Common Base Language. Technical Report. Massachusetts Institute of Technology, Project MAC, Cambridge, MA. 47–74 pages.
[38]
Jack B. Dennis. 1974. First version of a data flow procedure language. In Proceedings of the Programming Symposium, B. Robinet (Ed.), Springer Berlin Heidelberg, Berlin, 362–376.
[39]
Jack B. Dennis. 1980. Data flow supercomputers. IEEE Computer 13, 11 (1980), 48–56.
[40]
Lorenzo Di Tucci, Marco Rabozzi, Luca Stornaiuolo, and Marco D. Santambrogio. 2017. The role of cad frameworks in heterogeneous FPGA-based cloud systems. In Proceedings of the 2017 IEEE International Conference on Computer Design. IEEE, Boston, MA, 423–426.
[41]
Atakan Doğan and Kemal Ebcioğlu. 2021. Cloud building block chip for creating FPGA and ASIC clouds. ACM Transactions on Reconfigurable Technology and Systems 15, 2, Article 14 (dec 2021), 35 pages. DOI:
[42]
Kemal Ebcioglu, Emre Kultursay, and Mahmut Taylan Kandemir. 2015. Method and system for converting a single-threaded software program into an application-specific supercomputer. Retrieved from https://patents.google.com/patent/US8966457B2. US Patent 8,966,457, filed on 15 November 2011.
[43]
Kemal Ebcioğlu. 1987. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the 20th Annual Workshop on Microprogramming. Association for Computing Machinery, New York, NY, 69–79.
[44]
Kemal Ebcioğlu, Erik R. Altman, Michael Gschwind, and Sumedh Sathaye. 1999. Optimizations and oracle parallelism with dynamic translation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 284–295.
[45]
K. Ebcioğlu and Toshio Nakatani. 1990. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Proceedings of the Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing. Pitman Publishing, Inc., 213–229.
[46]
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture. Association for Computing Machinery, New York, NY, 365–376.
[47]
EUROPRACTICE. 2021. Multi Layer Mask. Retrieved from https://europractice-ic.com/mpw-prototyping/general/mlm/. Accessed: 2021-04-28.
[48]
EUROPRACTICE. 2021. Multi Project Wafer (MPW). Retrieved from https://europractice-ic.com/mpw-prototyping/general/mpw-minisic/. Accessed: 2021-04-28.
[49]
M. Ferdman, N. Hardavellas, A. Ailamaki, and B. Falsafi. 2011. Toward dark silicon in servers. IEEE Micro 31, 04 (jul 2011), 6–15.
[50]
The Electronic Frontier Foundation. 1998. Cracking DES: Secrets of Encryption Research, Wiretap Politics and Chip Design. O’Reilly & Associates, Inc.
[51]
M. Franklin and G. S. Sohi. 1996. ARB: A hardware mechanism for dynamic reordering of memory references. IEEE Transactions on Computers 45, 5 (1996), 552–571. DOI:
[52]
Yunhong Gu and Robert L. Grossman. 2007. UDT: UDP-based data transfer for high-speed wide area networks. Computer Networks 51, 7 (2007), 1777–1799.
[53]
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. Association for Computing Machinery, New York, NY, 81–92.
[54]
Eric He, Jason Leigh, Oliver Yu, and Thomas A. DeFanti. 2002. Reliable blast UDP: Predictable high performance bulk data transfer. In Proceedings. IEEE International Conference on Cluster Computing. IEEE, Chicago, Illinois, 317–324.
[55]
Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing. Association for Computing Machinery, New York, NY, 311–320.
[56]
Kai Huang, Mehmet Gungor, Xin Fang, Stratis Ioannidis, and Miriam Leeser. 2019. Garbled circuits in the cloud using FPGA enabled nodes. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference. IEEE, Waltham, MA, 1–6.
[57]
IBM, Cloud & AI Systems Research. 2021. cloudFPGA: Field programmable gate arrays for the cloud. Retrieved from https://www.zurich.ibm.com/cci/cloudFPGA/. Accessed: 2021-04-19.
[58]
Intel. 2021. Data Plane Development Kit. Retrieved from http://dpdk.org. Accessed 04.04.2021.
[59]
Intel. 2021. Intel FPGA SDK for OpenCL. Retrieved from https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html. Accessed: 2021-04-25.
[60]
Intel. 2021. Intel High Level Synthesis Compiler. Retrieved from https://www.intel.com.tr/content/www/tr/tr/software/programmable/quartus-prime/hls-compiler.html. Accessed: 2021-04-25.
[61]
Jaime Humberto Moreno and Mayan Moudgill. 1998. Method and apparatus for reordering memory operations in a processor. Retrieved from https://patents.google.com/patent/US5758051. US Patent 5,758,051, filed on 6 November 1996.
[62]
Wei Jiang, Ziwei Song, Jinyu Zhan, Zhiyuan He, Xiangyu Wen, and Ke Jiang. 2020. Optimized co-scheduling of mixed-precision neural network accelerator for real-time multitasking applications. Journal of Systems Architecture 110 (2020), 101775.
[63]
H. T. Kung, Trevor Blackwell, and Alan Chapman. 1994. Credit-based flow control for ATM networks: Credit update protocol, adaptive credit allocation and statistical multiplexing. In Proceedings of the Conference on Communications Architectures, Protocols and Applications. Association for Computing Machinery, New York, NY, 101–114.
[64]
Monica S. Lam. 1989. Software pipelining. In Proceedings of the Systolic Array Optimizing Compiler. Springer, 83–124.
[65]
Adam Langley, Alistair Riddoch, Alyssa Wilk, Antonio Vicente, Charles Krasic, Dan Zhang, Fan Yang, Fedor Kouranov, Ian Swett, Janardhan Iyengar, Jeff Bailey, Jeremy Dorfman, Jim Roskind, Joanna Kulik, Patrik Westin, Raman Tenneti, Robbie Shade, Ryan Hamilton, Victor Vasiliev, Wan-Teh Chang, and Zhongyi Shi. 2017. The QUIC transport protocol: Design and internet-scale deployment. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. Association for Computing Machinery, New York, NY, 183–196.
[66]
James Larus. 2015. Whole Program Paths (slides). Retrieved from https://pdfs.semanticscholar.org/6328/1ffa177c6d88841ddc6e01d1b0a74ea853e0.pdf. Accessed 11.08.2021.
[67]
Gai Liu, Mingxing Tan, Steve Dai, Ritchie Zhao, and Zhiru Zhang. 2017. Architecture and synthesis for area-efficient pipelining of irregular loop nests. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36, 11 (2017), 1817–1830. DOI:
[68]
Junyi Liu, John Wickerson, Samuel Bayliss, and George A. Constantinides. 2017. Polyhedral-based dynamic loop pipelining for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 9 (2017), 1802–1815.
[69]
Ikuo Magaki, Moein Khazraee, Luis Vega Gutierrez, and Michael Bedford Taylor. 2016. ASIC clouds: Specializing the datacenter. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture. ACM/IEEE, Seoul, Korea (South), 178–190.
[70]
Madzirin Masirap, Mohd Harith Amaran, Yusnani Mohd Yussoff, Ruhani Ab Rahman, and Habibah Hashim. 2016. Evaluation of reliable UDP-based transport protocols for internet of things (IoT). In Proceedings of the 2016 IEEE Symposium on Computer Applications & Industrial Electronics. IEEE, Penang, Malaysia, 200–205.
[71]
Soo-Mook Moon and Kemal Ebcioğlu. 1997. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems 19, 6 (1997), 853–898.
[72]
Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al. 2015. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (2015), 1591–1604.
[73]
NVIDIA. 2021. NVIDIA CUDA platform. Retrieved from https://developer.nvidia.com/cuda-zone. Accessed 29.04.2021.
[74]
U.S. DEPARTMENT OF COMMERCE/National Institute of Standards and Technology. 1999. FIPS PUB 46-3, Data Encryption Standard (DES). Retrieved from https://csrc.nist.gov/csrc/media/publications/fips/46/3/archive/1999-10-25/documents/fips46-3.pdf. Accessed 30.04.2021.
[75]
B. Ramakrishna Rau. 1994. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture. Association for Computing Machinery, New York, NY, 63–74.
[76]
Ismail San and Nuray At. 2012. Compact keccak hardware architecture for data integrity and authentication on FPGAs. Information Security Journal: A Global Perspective 21, 5 (2012), 231–242.
[77]
Ismail San and Nuray At. 2014. Improving the computational efficiency of modular operations for embedded systems. Journal of Systems Architecture 60, 5 (2014), 440–451.
[78]
Robert R. Schaller. 1997. Moore’s law: Past, present and future. IEEE Spectrum 34, 6 (1997), 52–59.
[79]
Uwe Schwiegelshohn, Franco Gasperoni, and Kemal Ebcioǧlu. 1991. On optimal parallelization of arbitrary loops. Journal of Parallel and Distributed Computing 11, 2 (1991), 130–134.
[80]
Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture. Association for Computing Machinery, New York, NY, 647–659.
[81]
Jarosław Sugier. 2019. Cracking the DES cipher with cost-optimized FPGA devices. In Proceedings of the International Conference on Dependability and Complex Systems. Springer, Cham, 478–487.
[82]
Xuan Sun, Chun Jason Xue, Jinghuan Yu, Tei-Wei Kuo, and Xue Liu. 2021. Accelerating data filtering for database using FPGA. Journal of Systems Architecture 114 (2021), 101908. DOI:
[83]
Kevin B. Theobald, Guang R. Gao, and Laurie J. Hendren. 1992. On the limits of program parallelism and its smoothability. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, Washington, DC, 10–19.
[84]
Stephen Tridgell, Martin Kumm, Martin Hardieck, David Boland, Duncan Moss, Peter Zipf, and Philip H. W. Leong. 2019. Unrolling ternary neural networks. ACM Transactions on Reconfigurable Technology and Systems 12, 4, Article 22 (oct 2019), 23 pages.
[85]
Furkan Turan, Sujoy Sinha Roy, and Ingrid Verbauwhede. 2020. HEAWS: An accelerator for homomorphic encryption on the amazon AWS FPGA. IEEE Transactions on Computers 69, 8 (2020), 1185–1196.
[86]
David Velten, Robert Hinden, and Jack Sax. 1984. Reliable Data Protocol. Technical Report. RFC-908, BBN Communications Corporation.
[87]
S. Voigt, Malte Baesler, and Thomas Teufel. 2010. Dynamically reconfigurable dataflow architecture for high-performance digital signal processing. Journal of Systems Architecture 56, 11 (2010), 561–576.
[88]
Jiang Wang and Christine Eisenbeis. 1993. Decomposed software pipelining: A new approach to exploit instruction level parallelism for loop programs. In Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism. North-Holland Publishing Co., NLD, 3–14.
[89]
Dennis Weller, Fabian Oboril, Dimitar Lukarski, Juergen Becker, and Mehdi Tahoori. 2017. Energy efficient scientific computing on FPGAs using OpenCL. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 247–256.
[90]
Meng-Chiou Wu and Rung-Bin Lin. 2005. Multiple project wafers for medium-volume IC production. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems. IEEE, Kobe, Japan, 4725–4728.
[91]
Xilinx. 2020. Vivado Design Suite User Guide: High-Level Synthesis (UG902) (v2020.1 ed.). Xilinx. Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf.
[92]
Xilinx. 2021. Vitis High-Level Synthesis. Retrieved from http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. Accessed: 2021-04-25.
[93]
Xilinx. 2021. Vivado Design Suite User Guide: Implementation (UG904) (v2020.2 ed.). Xilinx. Retrieved from https://www.xilinx.com/content/dam/xilinx/support/documentation/sw_manuals/xilinx2020_2/ug904-vivado-implementation.pdf.
[94]
Peng Zhang, Muhuan Huang, Bingjun Xiao, Hui Huang, and Jason Cong. 2015. CMOST: A system-level FPGA compilation framework. In Proceedings of the 2015 52nd ACM/EDAC/IEEE Design Automation Conference. IEEE, San Francisco, CA, 1–6.
[95]
Jie Zhao and Peng Di. 2020. Optimizing the memory hierarchy by compositing automatic transformations on computations and data. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, Athens, Greece, 427–441.
[96]
Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, and Guang R. Gao. 2007. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. SIGARCH Computer Architecture News 35, 2 (jun 2007), 35–45.

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4, December 2022, 476 pages.
ISSN: 1936-7406; EISSN: 1936-7414; DOI: 10.1145/3540252
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 08 August 2022
Online AM: 15 May 2022
Accepted: 01 December 2021
Revised: 01 November 2021
Received: 01 May 2021
Published in TRETS Volume 15, Issue 4


Author Tags

  1. Multi-FPGA
  2. hierarchical software pipelining
  3. high-level synthesis
  4. compilers
  5. AWS cloud
  6. DES cracker

Qualifiers

  • Research-article
  • Refereed
