
AIgean: An Open Framework for Deploying Machine Learning on Heterogeneous Clusters

Published: 27 December 2021

Abstract

     AIgean, pronounced like the sea, is an open framework to build and deploy machine learning (ML) algorithms on a heterogeneous cluster of devices (CPUs and FPGAs). We leverage two open source projects: Galapagos, for multi-FPGA deployment, and hls4ml, for generating ML kernels synthesizable using Vivado HLS. AIgean provides a full end-to-end multi-FPGA/CPU implementation of a neural network. The user supplies a high-level neural network description, and our tool flow is responsible for synthesizing the individual layers, partitioning them across different nodes, and generating the bridging and routing required for these layers to communicate. If the user is an expert in a particular domain and would like to tinker with the implementation details of the neural network, we define a flexible implementation stack for ML that includes the layers of Algorithms, Cluster Deployment & Communication, and Hardware. This allows the user to modify specific layers of abstraction without having to worry about components outside of their area of expertise, highlighting the modularity of AIgean. We demonstrate the effectiveness of AIgean with two use cases: an autoencoder, and ResNet-50 running across 10 and 12 FPGAs. AIgean leverages the FPGA’s strength in low-latency computing, as our implementations target batch-1 inference.

    1 Introduction

    Using FPGAs for computing at scale has become desirable because of the need for increased performance and reduced power consumption. The flagship example of this is the Microsoft Catapult project that has led to an FPGA being deployed in every Microsoft server [1]. FPGAs at Microsoft are used for search engine acceleration, in a machine learning (ML) framework for applications within the data center, as well as for many network and packet-processing tasks.
    A distinguishing feature of the Catapult architecture is that the FPGAs can directly communicate with other FPGAs and CPUs as peers on the network versus the more common accelerator model for FPGAs where the FPGAs are attached to a CPU and only accessible through the CPU. The peer model is more efficient for applications that are large enough to span multiple FPGAs requiring low-latency communication between the FPGAs. Although Microsoft has shown significant success in scaling up and using multiple FPGAs in a single application, such as Project Brainwave [2, 3] used for real-time AI, there is no public description of how the applications are built and deployed to the FPGAs, and the platform and tools are not available for others to build their own applications. There is also no known equivalent open source platform available where someone can build their own version of Brainwave.
    Brainwave has shown how useful multi-FPGA implementations can be because they keep all their weights in on-chip memory rather than accessing off-chip DRAM. A framework for building custom circuits like the one in Brainwave would allow users to create their own networks, which is currently quite difficult because of the lack of abstractions within FPGA systems. Beyond keeping weights in on-chip memory, the effectively unbounded fabric provided by a multi-FPGA framework lets us unroll our computations completely, or to any desired degree. This enables the construction of very low latency, high throughput networks that can run at batch 1. In this work, our goal is to provide the abstraction of an arbitrarily large FPGA fabric by hiding the difficulties of working with network-connected FPGAs. This opens up a broad range of possible applications where low-latency inference with large AI models is needed to process information in real time, such as systems control, web search, real-time physics applications, and medical image processing.
    For this work, we define a cluster of network-connected FPGAs (i.e., all FPGAs have direct connections to the network) as a multi-FPGA cluster. By this definition, Brainwave is a multi-FPGA application.
    The focus of this article is to describe how we created AIgean, which is an open source platform that can be used to build multi-FPGA ML applications on multi-FPGA clusters. AIgean provides the user with multiple layers of abstraction. The user can use AIgean as a black box that takes neural net descriptions as input and produces programmed FPGAs as output. Our black box is responsible for creating the IP cores and communication protocols, partitioning the neural net across multiple devices, and generating the final bitstreams for all FPGAs. Our focus with AIgean is ease of use as well as modularity. However, the parts of the black box are implemented as a stack of abstraction layers and can be further customized by users who are experts in the various layers. This stack is built in modular pieces, which also allows for alternative implementations at each layer. In particular, a user can modify the architecture of a particular convolution layer, and implement the layer in hardware on an FPGA or in software on a processor. The communications layer can be modified to use different protocols, such as UDP, TCP/IP, layer 2 Ethernet, PCIe, parallel buses between devices, or any custom protocol. A change at any of the layers of AIgean does not affect any of the other layers, especially the application layer at the top of the stack. This provides portability between platforms, particularly across different types of FPGAs.
    We started with two open source projects: hls4ml [4] and Galapagos [5, 6]. By using hls4ml, we can convert ML descriptions into C++ code synthesizable with high-level synthesis (HLS). Galapagos is a framework for deploying streaming computation kernels onto a cluster of heterogeneous devices [5, 6], especially FPGAs, which are particularly suited to streaming computation. We define streaming as communication via streams of data moving from one kernel to another where the processing is effectively done on-the-fly versus a mechanism like a source kernel writing to memory and the destination kernel reading from that memory.
    Although conceptually AIgean is a combination of two existing platforms, a significant effort was required to integrate the two platforms. Initially, the idea seemed straightforward, but when considering the details, much more was required. Hls4ml was not initially designed to build the layers of the network as individual cores and required significant enhancements to enable the output of separate cores. The interfaces between layers needed to comply with the streaming interfaces required by Galapagos, and the output cores had to be put into a directory structure suitable for processing by Galapagos. Galapagos had not been tested with a large application, and the deployment of a large neural network was the first attempt at doing so. We then realized that for very large applications, an automated partitioner is required, and Galapagos was enhanced to have a new layer that can do the partitioning. The first partitioner is just sufficient to build AIgean, but future work can enhance it in many ways. These contributions would not have come to light without AIgean and are important considerations for developing future application frameworks that leverage Galapagos.
    An important contribution of this work is to describe that effort and more generally show the challenges of building multi-FPGA application frameworks that can be customizable and portable across multiple kinds of FPGA hardware. The main outcome is an ML platform that enables ML practitioners to use familiar tools and map them to a multi-FPGA cluster without needing to do any hardware design. We contrast AIgean with the current vendor ML flows [7, 8] that only target a few FPGAs hosted in a single server and lack the ability to scale easily.
    Our contributions in this work are as follows:
    (1)
    A fully push-button flow to take an ML network input from popular ML tools and deploy the network to a multi-FPGA/CPU back-end. Abstracted away from the user are the creation of the hardware IP cores for the given ML network, the partitioning of these IP cores, and the connecting and routing between them. Some of the core functionality was already handled by hls4ml and Galapagos, but large modifications and additions were required for scaling out to using multiple FPGAs.
    (2)
    Modifications to hls4ml to generate separate IP cores for each layer and the automatic inclusion of a bridge to combine the many parallel streams between the hls4ml cores into the single stream supported by the Galapagos framework. The bridges are created at compile time because their width depends on the number of dimensions in the layer the user wants to deploy.
    (3)
    Galapagos modifications to add the partitioning layer of the stack. This layer decides how many FPGAs are required and where to place the IP cores generated by our modified hls4ml. Our first partitioner is a simple greedy partitioner leaving a lot of room for future research into partitioners that can produce more efficient results. A partitioner is required within our AIgean stack to enable the seamless push-button flow from front-end to multi-node back-end. Given that the partitioner is a separate abstraction layer, changing the partitioner can be done without requiring any changes in the other layers.
    (4)
    A framework that allows for incremental development and deployment of an ML application because we can seamlessly integrate hardware and software IP cores. For example, the first step to deploying an ML network is to do it entirely in software targeting a multi-CPU back-end. By simply changing a configuration file, layers of the network can be incrementally switched from running in software to running on FPGAs. Eventually, all layers can be targeted for FPGAs, or the user may choose to run with a heterogeneous implementation where some layers are in software and some are in hardware.
    (5)
    A large use case of ResNet-50 deployed with two configurations, one with 10 FPGAs and the other with 12 FPGAs. Changing between these implementations is done by changing only a few lines of hls4ml code and re-running the flow. This is also a case study that demonstrates the effort required to create a multi-FPGA application on the Galapagos platform.
    (6)
    A fully integrated hardware and application layer stack that starts with FPGA shells (the layer in the FPGA that abstracts the application logic from the specifics of each FPGA board), continues with the hardware middleware layer that deals with the connectivity between IP cores and a communications layer that implements the desired networking protocol between IP cores instantiated on different CPUs or FPGAs, and ends with an application layer that takes ML networks as input and generates the required IP cores. These carefully defined abstraction layers provide an excellent research platform for experts at each layer to tinker and make each layer better. AIgean is available as open source to enable further research at all the layers of its stack and can be downloaded at https://github.com/UofT-HPRC/AIgean.
    In Section 2, we describe related work, followed by Section 3, where we provide an overview of hls4ml and Galapagos. We describe the implementation and tool flow of AIgean in Section 4 and present some results in Section 5. Future work is described in Section 6, and, finally, we present conclusions in Section 7.

    2 Related Work

    We describe AIgean as a platform that can be used to build heterogeneous ML implementations with a particular focus on using FPGAs and CPUs. As a platform, AIgean spans the full computing stack from the hardware to the tools used to create the inputs to AIgean. We have built AIgean with the goal of making it flexible and modifiable at all levels of the stack to enable research and continued improvement. With this view, we present the related work according to our model of the full ML computing stack. We first describe the model and then present the related work as it fits within our model.

    2.1 The ML Computing Stack

    The ML computing stack is shown in Figure 1. At the top of the stack, we have a wide range of Applications & Algorithms, many of which have strict performance constraints. At the Cluster Deployment & Communication layer of the stack, we have petabytes of data being transferred, and at the Hardware layer, we have many mathematical operations (typically matrix/vector multiplications) implemented on a computing substrate ranging from programmable processors to custom hardware. Each layer of this stack presents its own challenges. For example, at the Applications layer, the user has to decide which error rates are acceptable for their given application. At the Communication layer, the user has to decide how they will connect their devices (consisting of computing devices as well as sensors gathering data). Finally, at the low-level Hardware layer, the user may want to make optimizations on bit-level operations for their given application or define different levels of parallelism. There are opportunities for research at all levels of this stack.
    Fig. 1. Abstraction stack for common ML frameworks.

    2.2 Software ML Frameworks

    We define software frameworks as those that mainly target CPUs and GPUs that are programmed via software. Leading software ML frameworks include TensorFlow [9], Torch [10], and Caffe [11]. They provide the users with libraries in various programming languages (e.g., Python, C++) to describe their ML applications. These frameworks then compile the applications into a series of instructions to be executed. Furthermore, they offer an interface to create custom layers that can be compiled into instructions to run on different back-end devices. Finally, they also support connectivity across multiple devices. For example, TensorFlow provides an API [12] to run on distributed clusters, where the communication between different devices (CPUs, GPUs, and TPUs [13]) is either through the CPU network link, through NVLink [14] (i.e., a proprietary link between certain NVIDIA GPUs), or via a direct network link. NVIDIA also provides the NVIDIA Collective Communication Library [15], which implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. This enables scaling GPU computations across large numbers of GPUs available on a network and is supported by several popular deep learning frameworks. Note that in the current implementation of the NVIDIA Collective Communication Library, the GPUs do not have direct connections to the network, unlike what we are able to do with FPGAs. The GPUs connect to the server’s network interface through PCIe. We expect that with the acquisition of Mellanox by NVIDIA [16], GPUs will soon also be able to access the network directly and bypass the need for a PCIe transfer.
    These frameworks have a high level of customization at the application level. They also allow the users to input custom instructions, but the underlying hardware circuitry is limited to CPU, GPU, and TPU computation and cannot be modified.

    2.3 FPGA Overlay Frameworks

    Frameworks such as the Xilinx ML Suite and the Intel Deep Learning Accelerator (DLA) provide overlays implemented on FPGAs [17, 18]. An overlay is essentially a programmable engine implemented on the FPGA, which has a limited level of customization. These suites are integrated into existing OpenCL development environments (IDEs): Xilinx SDAccel [19] and Intel OpenCL [20].
    The OpenCL IDEs use HLS to improve the accessibility of the design of accelerators for FPGAs over traditional approaches based on VHDL or Verilog, which are time consuming and unfamiliar to most ML experts. In addition, these suites both provide libraries for Direct Memory Access (DMA), buffers, and communication channels, and abstract the underlying hardware, such as device drivers, PCIe link, interconnect, and accelerator placement [21]. These frameworks are similar to the Software ML Frameworks except they add the capability to customize the processor by tuning the overlay architecture on the FPGA.
    Nurvitadhi et al. [22] describe a platform that supports multiple PCIe-connected FPGAs in a single server. They build a software stack on top of the Intel OPAE [23] and tightly couple operations on the CPU with operations on the FPGA. Their goal is to implement low-latency neural machine translation and do this by keeping the model in on-chip memories to avoid slower off-chip memory accesses. The ability to leverage multiple FPGAs makes this feasible. This work shows how to leverage the communication layer implemented with PCIe to target multiple FPGA overlays, but their scalability is limited by the number of boards available in one node.
    These frameworks allow application developers to seamlessly deploy ML applications on FPGAs thanks to mature software and hardware development environments. However, while the high level of abstraction through overlays minimizes FPGA design time, it also reduces the user’s control over the generated hardware. With respect to the ML stack we define in Figure 1, these overlay frameworks allow some flexibility in the algorithms and limited flexibility in the hardware. Depending on the hooks available, a user can implement different supported layers to make their own customizations. The hardware flexibility is quite limited as an IP core is already generated. Some frameworks allow the user to modify the IP core through parameters, but this flexibility is generally limited.

    2.4 FPGA ML Core Generators

    In this category of work, the focus is at the hardware level of our ML stack where the goal is to make it easier to generate cores for ML computations. These cores must then be integrated into a system that provides the full ML computing stack. Here, we present open source tools that generate ML accelerators as third-party IPs to be integrated into FPGA projects.
    CHaiDNN [24] is an ML library for the acceleration of deep neural networks on Xilinx UltraScale MPSoCs. The library provides a subset of ML operators to be synthesized with Vivado HLS and uses 6/8-bit integer arithmetic. Pynq DL [25] provides only a configurable IP for the convolution on Xilinx Zynq SoCs. FINN [26] is a framework for the implementation of binary neural networks that use a dataflow architecture. PipeCNN [27] is an OpenCL-based FPGA accelerator for large-scale CNNs and uses pipelined functional kernels to achieve improved throughput in inference computation. The design is scalable both in performance and hardware resources, and thus can be deployed on a variety of FPGA platforms. HLSLibs [28] is a set of libraries implemented in standard C++ for bit-accurate HLS design. Many of the library operators (e.g., MatMult, SoftMax, Sigmoid) can be easily integrated into the design of ML accelerators. Recently, CNN implementations similar to the design in this article have been produced for low bit precision CNNs [29] and for sparse CNNs [30]. These solutions provide more flexibility as the developer, in some cases, can modify the generated cores, as well as integrate additional circuitry around the provided IP cores. However, this design flow is only accessible to those with hardware design knowledge.
    With respect to the ML stack, ML core generators provide full flexibility of the hardware and supported algorithms. However, they provide very little when it comes to support for integrating systems at a much larger scale, as it is the user’s responsibility to integrate the generated cores into their larger design.

    2.5 ML Computing on Multi-FPGA Clusters

    In a multi-FPGA cluster, all FPGAs are network connected; Brainwave [2, 3], built on top of Microsoft’s Catapult network-connected FPGA framework [1], is the most successful and well-known example. Each FPGA contains a customizable overlay. The focus of Brainwave is to minimize latency. Thus, the entire processing only uses on-chip memory and resources, and the neural network is partitioned across multiple FPGAs accordingly. The links between the network-connected FPGAs use Catapult’s 40-Gb/s custom Lightweight-Transport-Layer, a lightweight reliability layer on top of a communication protocol similar to that of UDP. When characterizing Brainwave using the stack defined in Figure 1, it can be observed that Brainwave also provides a flexible Application layer as multiple types of neural networks are supported. Brainwave is limited to the Lightweight-Transport-Layer for cluster communication between FPGAs, but this is still an improvement over frameworks that force all accelerator communication through a CPU. Finally, Brainwave provides some flexibility at synthesis time to customize precision, vector size, number of data lanes, and the size of the matrix-vector tile engine. These works allow users to scale a large ML framework across multiple nodes, providing the cluster deployment layer in the ML stack. These works also support a number of layers allowing for the user to customize their algorithm. However, due to the scale, there is little hardware flexibility, as the parameterization happens at the node level as opposed to the level of the IP core.

    2.6 Where AIgean Fits

    Although AIgean can be used with a single FPGA, it best fits the category of Section 2.5, or ML Computing on Multi-FPGA Clusters, and has a similar goal as Brainwave. Both platforms use network-connected FPGAs in a peer-to-peer configuration. Brainwave uses a programmable overlay that has some parameterization that can be invoked at the time the overlay is synthesized and can implement many different ML networks depending on the program that is loaded. AIgean synthesizes custom hardware cores and implements each ML network directly in hardware, so changing an ML network will take much longer than recompiling the program for an overlay. With AIgean, there is the ability for researchers to experiment at the hardware implementation layer with hls4ml, the possibility to experiment with the communication protocols used, and to specify how the computation kernels are deployed. All of the related works provide some layer of the ML stack defined in Figure 1. AIgean is the only work that can provide support at all of these layers, allowing users to parameterize at the IP core level and at the cluster level, and support many algorithms. This is all made possible by the layered approach used by AIgean and because everything is available as open source.

    3 Background

    The goal of AIgean is to provide a scalable platform for implementing ML applications using multiple FPGAs. We use hls4ml to build the ML cores and Galapagos as the substrate for deploying an application across multiple FPGAs. In this section, we present the background required to understand hls4ml and Galapagos before describing how they are integrated into the platform we call AIgean.

    3.1 Hls4ml

    We need a way to implement hardware ML-inference cores that can take specifications from common ML frameworks. We choose hls4ml [4] because it can translate the specification of ML models from common frameworks such as Keras [31], PyTorch [32], ONNX-formatted models [33], and the quantized version of Keras, QKeras [34], into Register-Transfer Level (RTL) implementations for FPGAs using HLS tools [35]. In our experiments, we use Vivado HLS [36] as the hls4ml back-end, though the flow can be extended to other HLS tools. Hls4ml currently has support for Vivado HLS, Quartus HLS, and Mentor Catapult HLS [37].
    At the start of the AIgean development, hls4ml was a tool targeted only at implementing ML applications that fit on a single FPGA. In this section, we describe the baseline capabilities of hls4ml, and in Section 4.2, we describe the changes we made to integrate hls4ml into AIgean. A more detailed description of the changes to hls4ml is found in Appendix A.1.
    An ML designer prepares a neural network for a specific task, such as image classification, in Keras or PyTorch. After an iterative training phase that ends when the target accuracy/error goals are met, the ML designer releases a final model to be deployed for inference. The model is usually described as two files in standard formats: a JSON file for the model architecture, and an HDF5 file for the model weights and biases. These are the inputs for hls4ml. At this point, a hardware designer can fine-tune the hls4ml project and push-button translate it into a complete Vivado HLS specification (C++ and TCL files) to be synthesized and implemented for a target FPGA.
    The hardware designer faces the challenge of creating an optimal FPGA implementation from the given ML model. The hls4ml framework exposes a crafted set of configuration parameters (HLS knobs) to balance the FPGA resource usage and the latency and throughput goals. The design of ML-inference accelerators using HLS is simplified with hls4ml by hiding the large variety of HLS knobs and providing carefully optimized layer implementations for HLS.
    The conversion from deep learning model to HLS-based software is done by constructing a custom intermediate network representation that is amenable to low-latency design. From this intermediate representation, HLS code is generated with design guidelines specified in a configuration file. Optimized HLS implementations of neural network layers are generated, with the optimization dependent on specified configuration parameters. The code is thoroughly modular, and most optimizations can be tuned after the HLS code generation. Hls4ml has been used to construct MLP networks, CNNs, Graph neural networks, RNNs, and BDTs [38, 39, 40, 41].
    The hls4ml design flow explicitly focuses on batch-1 processing. Larger batch processing is not considered. The design flow is similar to the FINN architecture [26, 42] in that model-specific layers are implemented. A critical element of the design of hls4ml is to allow for very low latency implementation of ML algorithms with a low initiation interval. As a consequence, hls4ml generates an HLS firmware implementation of the neural network on a layer-by-layer basis. Each layer corresponds to a different firmware block, and therefore individual layers can be run concurrently. This design paradigm differs from most other FPGA deep learning implementations, such as Xilinx ML Suite, where the same firmware blocks are repeatedly used to perform the inference computations over many layers of a neural network. Separation of the layers into separate firmware blocks is particularly tractable for multiprocessor use since layers can easily be split into separate IP blocks without any modifications in the algorithm design or changes in resource usage. We leveraged this capability for AIgean.
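    To make the layer-per-firmware-block structure concrete, the following is a minimal HLS sketch in the spirit of what hls4ml generates; it is our own illustration, not hls4ml output, and the layer bodies, type widths, and sizes are placeholders. Each layer is a separate function connected to its neighbors by on-chip streams, so under the dataflow pragma all layers execute concurrently, and any one of them could instead be cut out and placed on another device.

        // Illustrative sketch (not hls4ml-generated code) of the layer-per-firmware-block style.
        #include "ap_fixed.h"
        #include "hls_stream.h"

        typedef ap_fixed<16, 6> data_t;   // example precision; hls4ml makes this configurable
        static const int N = 64;          // example feature size (hypothetical)

        // One firmware block: a trivial element-wise layer standing in for a real hls4ml layer.
        static void scale_layer(hls::stream<data_t> &in, hls::stream<data_t> &out) {
            for (int i = 0; i < N; i++) {
                #pragma HLS pipeline II=1
                out.write(in.read() * (data_t)0.5);
            }
        }

        // A second firmware block: the ReLU activation.
        static void relu_layer(hls::stream<data_t> &in, hls::stream<data_t> &out) {
            for (int i = 0; i < N; i++) {
                #pragma HLS pipeline II=1
                data_t x = in.read();
                out.write(x > (data_t)0 ? x : (data_t)0);
            }
        }

        // Top level: the layers form a concurrent pipeline connected by a FIFO.
        void two_layer_network(hls::stream<data_t> &in, hls::stream<data_t> &out) {
            #pragma HLS dataflow
            hls::stream<data_t> between("between");
            #pragma HLS stream variable=between depth=64
            scale_layer(in, between);
            relu_layer(between, out);
        }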
    The trade-off among latency, initiation interval, and resource usage determines the parallelization of the accelerator logic (and vice versa). In hls4ml, this trade-off is configured with a single configuration parameter—the reuse factor. The choice of a value for the reuse factor affects the initiation interval of the RTL pipelines and the number of critical resources (e.g., DSPs) in each layer of the neural network.
    Within the hls4ml implementation, the reuse factor dictates the number of times a single DSP is reused within a single matrix multiplication. This factor translates directly to the number of resources that each layer uses. In particular, both the DSP usage and the number of BRAM partitions will scale with the reuse. By scaling the reuse factor value, the designer can explore various implementations. A reuse factor of 1 generates a completely parallel implementation (lowest latency); a reuse factor of R generates an implementation that uses roughly 1/R of the DSPs and BRAM partitions (lower resource usage). Designers may choose a larger value for reuse factor in the case of limited resources and a smaller value when they can afford higher parallelism.
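    As a rough worked example (with round numbers of our own choosing, not figures from this article): a dense layer with \(N_{in}\) inputs and \(N_{out}\) outputs performs \(N_{in} \times N_{out}\) multiplications. With a reuse factor of \(R\), each DSP is time-multiplexed across \(R\) of those multiplications, so the layer needs roughly \(N_{in} N_{out} / R\) DSPs and has an initiation interval of about \(R\) cycles. A \(64 \times 64\) layer at \(R = 1\) would therefore use on the order of 4,096 multipliers with an initiation interval of 1, whereas \(R = 16\) would bring that down to roughly 256 multipliers at an initiation interval of 16.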
    For the development of AIgean, a number of improvements were made within hls4ml. These developments include:
    Streaming dataflow between the layers (with Galapagos)
    Optimized large layers for the Dense/Linear Layer, CNN Layer, Pooling Layer, Split Layer, and Merge Layer
    Modified Reuse Factor for CNN throughput
    Weight reconfiguration through the use of external block RAM ports
    Finally, the generated ML accelerators have interfaces that are system agnostic. In Section 4, we illustrate our extension to the Galapagos flow that enables a designer to rapidly prototype ML accelerators and deploy them in a Galapagos system with minimal effort.

    3.2 AIgean Stack

    Fig. 2. The AIgean stack. It includes an Application layer on top of the previously developed Galapagos stack [6].
    AIgean is a development stack for deploying ML applications across multi-FPGA and CPU clusters. This logically can be seen as a superset of the Galapagos development stack with a specific application layer. Galapagos is a hardware stack that provides customization at different levels of abstraction [6]. The main goal of Galapagos is to abstract the low-level hardware plumbing required to deploy an application across multiple FPGAs while also providing the ability to port applications across multiple FPGA platforms (i.e., platforms built using different FPGA cards with different networking infrastructures). We know of no other platform that can take as input just the computation kernels and a logical description of the connections between the kernels and then generate all of the FPGA bitstreams with all of the network connectivity included. Without Galapagos, an application developer with a multi-FPGA cluster would need to be an expert in hardware design. In addition to building the computation kernels, the developer would need to incorporate into their design the interfaces to the on-board memory, the network interfaces, the network protocol hardware (most likely hardware UDP or TCP/IP cores), and configure Ethernet MAC addresses, IP addresses, the routing information for moving data between kernels, as well as build all of the packet formatting and protocol translation between the computation kernels. The FPGA vendor platforms for OpenCL [19, 20] are usable by non-hardware application developers because they abstract away these details. Galapagos does the equivalent abstraction, but for a multi-FPGA cluster environment. By building on Galapagos for AIgean, we can leverage the multi-FPGA abstraction that is provided by Galapagos, and can focus on the integration of hls4ml and not worry about the low-level hardware plumbing required.
    The structure of Galapagos is analogous to a traditional software or networking stack, with each layer of the stack providing an API for the layer above. The lower the layer in the stack, the closer it is to the physical hardware. Figure 2 shows the AIgean stack.
    Physical hardware and connectivity. This layer represents the physical hardware that runs applications, and for this work we focus on the FPGAs. Aside from implementing the computations in FPGA logic, we can also implement different forms of connectivity. In Galapagos, we can use PCIe, 10G SFP+ Ethernet, 100G QSFP28 Ethernet, and L1 circuit switching. For Ethernet, we can select TCP/IP, UDP, and raw L2 Ethernet. Once configured in this lower level, typical software and ML practitioners can work at a higher level of abstraction.
    Hypervisor. The hypervisor abstracts away the I/O interfaces of a single FPGA so that hardware applications only need to connect to a standardized interface; they can then be implemented on any FPGA that has the same hypervisor. This is the key requirement that enables applications to be portable across multiple hardware platforms enabled with Galapagos. This is analogous to a hypervisor in the software world, which provides an abstraction of the hardware and some level of services, typically I/O and memory.
    Middleware. This layer connects the different devices within the Galapagos cluster and sets the off-chip network communication protocols between them. Within Galapagos, computation kernels can address each other and are agnostic of their placement.
    Communication layer. The communication layer provides the APIs for sending packets over the connections laid out by the middleware. Galapagos transmits packets using the AXI-Stream protocol [43]. All of the kernels within the cluster can reach any other kernel via AXI-Stream. Since the middleware layer provides the network address translation functionalities to convert AXI-Stream into off-chip network packets using the desired network communication, the network details and locations of kernels are abstracted away from the user. In software, this is the role of network-socket libraries or other network and communication protocols such as the Message Passing Interface (MPI) [44]. A sketch of a kernel interface that uses such streams appears at the end of this section.
    Application layer. For AIgean, the application layer is the ML layer provided by hls4ml and the tools used to generate the inputs to hls4ml.
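    As an illustration of the communication layer, the following is a minimal sketch of a Galapagos-facing streaming kernel written with Vivado HLS's AXI-Stream side-channel type; the field widths and the kernel itself are our own examples, not the exact Galapagos definitions.

        // Illustrative sketch of a streaming kernel whose flits carry a destination side
        // channel, as described above; widths are examples, not the Galapagos defaults.
        #include "ap_axi_sdata.h"
        #include "hls_stream.h"

        // 64-bit data, 1-bit user, 1-bit id, 16-bit dest side channel.
        typedef ap_axiu<64, 1, 1, 16> flit_t;

        // Forward every flit to logical kernel 3.  The middleware resolves where kernel 3
        // actually lives (same FPGA, another FPGA, or a CPU) and handles the network hop.
        void forwarding_kernel(hls::stream<flit_t> &in, hls::stream<flit_t> &out) {
            #pragma HLS interface axis port=in
            #pragma HLS interface axis port=out
            #pragma HLS interface ap_ctrl_none port=return
            flit_t flit = in.read();
            flit.dest = 3;   // logical kernel ID, not a network address
            out.write(flit);
        }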

    4 Implementation and Tool Flow

    The implementation of AIgean requires a significant effort to create a seamless integration of hls4ml, which builds hardware cores for ML, and Galapagos, which builds multi-FPGA applications. We had to make substantial modifications to hls4ml so that it generated streaming cores, and we needed to build the adaptation layer that can take output from hls4ml and convert it into the format for input to Galapagos. Furthermore, as part of this work, we implemented a number of improvements to both hls4ml and Galapagos to further optimize the functionality of both systems. In this section, we highlight the details of our AIgean tool flow along with the specific changes to hls4ml and Galapagos required to integrate them into AIgean.

    4.1 Tool Flow

    The stages of the AIgean tool flow are visually presented in Figure 3. The automated AIgean flow provides a black box that takes an ML model and deploys it on a CPU/FPGA cluster. In the following sections, we highlight the inner components of the black box because they can be modified by domain experts working within each part of the tool flow to explore a large design space relevant to their interests.
    Fig. 3. The AIgean flow. The components in the black box are abstracted away from the average user.
    Each of these stages corresponds to layers of the abstraction stack for common ML frameworks we described in Figure 1. The stages are described as follows.

    4.1.1 Implementation-Agnostic Model Tooling.

    This stage of the flow corresponds to the top layer of the ML stack, Applications & Algorithms. This layer of the stack is for data scientists and ML experts, who can tune their network for a given accuracy independent of the implementation and performance. For a given application, the users will decide on the algorithms they wish to use for their ML implementation. Using their application-specific test data, they can determine a suitable accuracy for a given neural network, independent of the implementation being done in hardware or software. Once a model is trained with the appropriate precision and performance requirements, AIgean will take this model and perform a full conversion to a distributed deep learning inference engine.

    4.1.2 HLS Layer Implementation.

    At this stage, the input from the previous stage is transformed by hls4ml into RTL synthesizable C++ code that can be executed as software to verify functional behavior. It is at the discretion of the user to select the granularity of the IP blocks generated, where the finest granularity the user can select is at the boundary of the neural net layers. A coarse granularity can limit flexibility in partitioning networks across multiple devices and may create IP blocks that are too large, but it is simpler. The user will also select the reuse factor, where a lower reuse factor unrolls the implementation of the IP blocks to use more DSPs and BRAMs. Once the user tunes these cores for their resources, the functional correctness of the individual IP blocks can be verified by running the code in software. Users who wish to tinker with implementations of individual layers can work at this level.
    In Section 5, we explore two different IP core implementations of ResNet-50 as an example. This part of the flow first generates a directory structure with many sub-directories, and each sub-directory is for an individual IP core (one IP core per layer), containing the HLS source code and build files. The top-level directory also has a build file, and the user can then do a parallel build across all sub-directories of the IP cores. For our ResNet-50 case study in Section 5.5, the generation of the directory structure and HLS source files can take on the order of minutes, whereas the HLS runs can take on the order of a few hours. The HLS flow also does an out-of-context place and route so that we can get a more accurate resource utilization, which is then used by the partitioner.

    4.1.3 Layer Partitioning.

    At this stage, the user begins with IP cores described in C++ that were generated from hls4ml. Each IP core is input to the HLS tool to generate RTL, which is then placed and routed out of context (i.e., as a stand-alone circuit) to produce an estimate of resource usage for each IP core. Using these estimates, the user can allocate one or more IP cores to FPGAs. We have implemented a simple partitioning tool within Galapagos that can automate the placement of IP cores on FPGAs. Galapagos can take IP cores labeled as “floating” IPs within the cluster and place them on any available FPGA. Our implementation is a simple greedy algorithm that uses the resource estimates of the IP cores and the available resources on the respective FPGAs. Once the IP cores are partitioned, our tool analyzes the graph to investigate the edges between FPGA node boundaries. Based on these boundaries, our tool places an hls4ml-to-Galapagos bridge that is custom made to fit the dimensions of the layers on the output and input FPGAs. This is needed because an hls4ml kernel has a parallel stream for each tensor dimension, whereas a Galapagos kernel has a single stream.
    This is our first implementation of the partitioner, and it leaves considerable room for future work on partitioning. Given that the partitioner is a separate abstraction layer, changing the partitioner can be done without requiring any changes in the other layers. The partitioner is also implemented within Galapagos as this is independent of the ML use case and can be applied to other domains.
    Once we get a partitioning from our Galapagos partitioner, our AIgean-specific bridging cores then provide bridging based on the kernels that occur at the edges of each FPGA. This is done separately by our ML-to-Galapagos layer (ML2G) because the bridges required at each edge are specific to the partitioning: the bridge differs depending on the width of the ML kernel on the edge. The hls4ml bridge is explained in detail in Section 4.2. Once the bridges have been appended to the partitioned cluster, we can generate bitstreams or model the partitioned cluster in software. The output of this layer is the Galapagos configuration files, describing all the kernels and their connectivity. In our ResNet-50 case study, this takes on the order of a few seconds.
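    For illustration, the following is a minimal sketch of the kind of greedy placement described above; the data structures and function names are our own and do not reflect the actual Galapagos partitioner code.

        // Illustrative greedy partitioner: walk the kernels in pipeline order and fill the
        // current FPGA until the next kernel no longer fits, then open a new FPGA.
        // Cross-FPGA edges are where the hls4ml-to-Galapagos bridges get inserted.
        #include <vector>

        struct Resources { double luts, dsps, brams; };
        struct Kernel    { Resources cost; };              // post-HLS resource estimate
        struct Fpga      { Resources capacity, used; std::vector<int> kernels; };

        static bool fits(const Fpga &f, const Resources &c) {
            return f.used.luts  + c.luts  <= f.capacity.luts  &&
                   f.used.dsps  + c.dsps  <= f.capacity.dsps  &&
                   f.used.brams + c.brams <= f.capacity.brams;
        }

        std::vector<Fpga> greedy_partition(const std::vector<Kernel> &kernels,
                                           const Resources &per_fpga) {
            std::vector<Fpga> cluster(1, Fpga{per_fpga, {0, 0, 0}, {}});
            for (size_t k = 0; k < kernels.size(); k++) {
                if (!fits(cluster.back(), kernels[k].cost))           // current FPGA is full:
                    cluster.push_back(Fpga{per_fpga, {0, 0, 0}, {}}); // open the next one
                Fpga &f = cluster.back();                 // (a kernel larger than one FPGA is
                f.used.luts  += kernels[k].cost.luts;     //  still placed here in this sketch)
                f.used.dsps  += kernels[k].cost.dsps;
                f.used.brams += kernels[k].cost.brams;
                f.kernels.push_back((int)k);
            }
            return cluster;
        }

    A better partitioner could, for example, also weigh the bandwidth of the edges it cuts; this is part of the future work on partitioning referred to above.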

    4.1.4 Software Cluster Implementation.

    This stage is optional but highly recommended for heterogeneous development. The underlying Galapagos framework can wrap HLS synthesizable C++ code with software libraries to enable network socket communication. The underlying Galapagos software library [45] translates Galapagos stream packets into network packets in a user-specified off-chip network protocol (e.g., UDP, TCP). We describe the underlying Galapagos framework in Appendix A.3. Galapagos can be seen as using the standard AXI-streaming protocol, typically used for streaming kernels within a single Xilinx FPGA. There is also basic routing with a destination field within AXI-stream. Galapagos can take AXI-stream and encapsulate packets with higher-level protocols to get the convenience of a single device AXI-stream but over multiple nodes. The user at this stage can create a homogeneous cluster partitioned across multiple software nodes (with each software node taking the place of a hardware node), recreating the network topology the user wishes to have for their heterogeneous deployment. All the network connections, binary generation, and deployment are automated with the underlying Galapagos framework.
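    The functional portability between software and hardware nodes can be illustrated with the following sketch; it shows the general idea only and is not the libGalapagos API. The kernel body is written against a generic stream type, so the same code can be exercised with a plain software FIFO on a CPU or with an HLS stream when synthesized for an FPGA.

        // Illustrative sketch (not the libGalapagos API) of writing a kernel once and
        // running it against either a software FIFO or an HLS stream.
        #include <queue>

        // A trivial software stream with the same read()/write() surface as hls::stream.
        template <typename T>
        struct sw_stream {
            std::queue<T> q;
            void write(const T &v) { q.push(v); }
            T read() { T v = q.front(); q.pop(); return v; }
        };

        // Kernel body written once, against whichever stream type it is instantiated with.
        template <typename STREAM>
        void add_one_kernel(STREAM &in, STREAM &out, int n) {
            for (int i = 0; i < n; i++)
                out.write(in.read() + 1);
        }

        // Software test:  add_one_kernel<sw_stream<int> >(in, out, n);
        // Hardware build: instantiate the same body with hls::stream and wrap it for HLS.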

    4.1.5 Heterogeneous Cluster Implementation.

    Once the neural network is partitioned across multiple software nodes and shown to be working correctly, the user can then migrate parts of their software deployment into hardware nodes. This is done by simply changing a parameter in one of the Galapagos files to indicate that an IP core should be implemented in hardware rather than run in software. Since Galapagos ensures that both software and hardware nodes use the same protocol, the migration is seamless. The migration of cores from software to hardware can be done in an iterative process as the generation of hardware bitstreams can be a time-consuming process. The outputs of this stage are the final bitstreams. For each FPGA, the IP cores are put together and synthesized to a bitstream. In our ResNet-50 case study, this took on the order of a couple of hours.

    4.2 Hls4ml Modifications

    The full details of the modifications implemented in hls4ml are specified in Appendix A.1. In particular, we modified hls4ml to produce HLS cores with streaming interfaces so that they can fit with the streaming model of Galapagos. As part of these modifications, an auxiliary channel is added between layers to allow the network inference to be reset in the middle of an inference. This additional option is useful for multi-FPGA implementations where data streams are vulnerable and can be interrupted. Additionally, hls4ml was extended to support large CNN layers with millions of weights. The previous CNN implementation was intended for low-latency use and could not support as many weights. The core of hls4ml, including the fully connected layers and activation functions, remains the same and is embedded in the streaming implementation. As a consequence, the full functionality of hls4ml is preserved in this streaming implementation, allowing for a broad range of models to be implemented.
    Further optimizations are applied for ResNet-50, including the fusing of batchnorm layers with the convolutional layers and the packing of pairs of 8-bit weights into single 16-bit words so that each DSP multiplication serves two weights, halving the total number of multiplications needed. Finally, an additional configuration parameter is added to the autogeneration that allows for approximate tuning of the reuse factor to obtain the desired CNN throughput. With this new option, the tuning factor for the network is defined by the desired throughput in operational clocks, and the reuse factor is adjusted so that every layer achieves the desired throughput. As a result of the throughput tuning, the reuse factor will be adjusted to ensure the inter-layer latency is roughly the same. A balanced throughput avoids significant bottlenecks between the layers. The full details of the throughput tuning are described in Appendix A.1.
    The hls4ml streams have no side channels. There are only the data payload (e.g., fixed-point data) and ready/valid signals between kernels. This kind of stream suffices for point-to-point connections within one FPGA. However, off-chip communication between cores requires additional routing information. Galapagos IP cores use HLS streams with side channels to provide routing information (i.e., destination). The destination field that is used in Galapagos by default is 16 bits, but this is configurable depending on the number of IP cores we have in our cluster. We designed bridges to transform hls4ml streams to Galapagos streams by adding the additional routing information and packing the data in larger-bit-width Galapagos streams. These bridges convert a single tensor consisting of many parallel AXI streams, with one stream per dimension, into a single large-bit-width stream for off-chip communication. Since the bridge’s size is dependent on the hls4ml IP core (the bridge’s input size depends on the number of streams), this also needs to be auto-generated. A visual representation of this can be seen in Figure 4.
    Fig. 4. An hls4ml kernel can be surrounded by custom bridges to enable off-chip communication.
    Furthermore, processing streams allows us to send flits of data corresponding to different images within the same packet, making more efficient use of bandwidth. These streams are also configurable: the user can set how many AXI-Stream flits to pack within one network packet. This is significant for FPGA-CPU links, where it is crucial to amortize the cost of network communication on the CPU due to the overhead added by the Linux network stack.
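    To make the bridging concrete, the following is a simplified sketch of an hls4ml-to-Galapagos bridge in the direction described above; the element width, the number of streams, and the packing of one flit per tensor are our own simplifications (the generated bridges are sized per layer and also handle the packet framing described above).

        // Simplified sketch of an hls4ml-to-Galapagos bridge: N narrow per-dimension streams
        // are packed into one wide stream that carries a destination side channel.
        // Widths, N, and the use of raw 16-bit words (instead of hls4ml's fixed-point
        // types) are illustrative, not the auto-generated code.
        #include "ap_axi_sdata.h"
        #include "ap_int.h"
        #include "hls_stream.h"

        static const int N = 8;                              // example tensor width (streams per layer)
        typedef ap_uint<16> word_t;                          // stand-in for one 16-bit hls4ml word
        typedef ap_axiu<N * 16, 1, 1, 16> galapagos_flit_t;  // wide data plus 16-bit dest

        void hls4ml_to_galapagos_bridge(hls::stream<word_t> in[N],
                                        hls::stream<galapagos_flit_t> &out,
                                        ap_uint<16> dest) {
            #pragma HLS interface axis port=in
            #pragma HLS interface axis port=out
            galapagos_flit_t flit;
            for (int i = 0; i < N; i++) {
                #pragma HLS unroll
                // Pack dimension i of the tensor into bit-slice i of the wide word.
                flit.data.range(16 * i + 15, 16 * i) = in[i].read();
            }
            flit.dest = dest;   // logical Galapagos kernel ID of the next hop
            flit.keep = -1;     // all bytes valid
            flit.strb = -1;
            flit.user = 0;
            flit.id   = 0;
            flit.last = 1;      // one tensor per Galapagos flit in this sketch
            out.write(flit);
        }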

    4.3 Galapagos Modifications

    To explore the design space of large ML networks (like ResNet-50) across multiple FPGAs, we developed an automated partitioner to work with the rest of the Galapagos framework. When we turned to ResNet-50 to implement a very large network, it quickly became clear that we needed an automated means for partitioning a large application to make the best use of the resources. Our first partitioner is described in Section 4.1.3 and is not specific to hls4ml kernels; it can be used for any streaming kernels that require placement.
    The original Galapagos framework supported 10G TCP and L2 Ethernet for off-chip communication. We designed a bridge to provide the option for 100G UDP cores to increase the performance of our network links and to reduce the probability of the network communication being the bottleneck. This enhances the capability of any application using Galapagos, not just AIgean.
    To support 100G, we also improved the portability within Galapagos to support multiple bit-widths of data. The prior bridging within Galapagos assumes all kernels communicate over AXI-stream with a destination side channel. However, because hls4ml kernels require an additional bridge to communicate over AXI-stream with a destination side channel, we needed to modify Galapagos to support the insertion of application-specific bridging. This is shown in Figure 5. For more details on the bridging provided within Galapagos, please refer to Appendix A.3. In this case, one such application-layer bridge is the hls4ml bridge described in Section 4.2. An application-layer bridge transforms an application-layer protocol into Galapagos packets. There is a configuration control path for the user to adjust the properties of the bridge. For the network bridge, this allows users to adjust the routing between kernels by changing the mapping of destinations to IP addresses.
    Fig. 5. The IP cores providing the bridging in Galapagos.
    A major goal of AIgean and Galapagos is the ease of development. Galapagos offers functional portability of cores between hardware and software by implementing software libraries to model the hardware routers and bridges shown in Figure 5. The software library (i.e., libGalapagos) is described in the work of Tarafdar and Chow [45]. Furthermore, Galapagos allows fast simulations of the cluster thanks to the combination of the HLS IP cores (in C++) and libGalapagos for the connections between cores. When we designed the additional cores and bridges for AIgean (e.g., 100G UDP core), we also implemented the libGalapagos equivalent of these cores to maintain functional portability. On top of prototyping the entire cluster in software, we have also added the ability to simulate the entire cluster in RTL. We designed an RTL model of a network switch that is configurable and can simulate the latency between network links. This capability has been invaluable during the development of AIgean as a platform but would not be required during the normal use of AIgean. The combined efforts of both these frameworks result in a fully configurable design space exploration tool for multi-node heterogeneous ML clusters.

    5 Results

    This section presents the outcomes of our efforts to build AIgean. It is important to emphasize that the initial goal of this work is to build a platform to enable the development of multi-FPGA ML applications. The performance results that we report here demonstrate that AIgean is working, and even with our first example applications, the results are reasonable. For this work, we claim success if we are able to easily build ML applications and map them to multiple network-connected FPGAs. There is much room to tune for application performance given a working AIgean, and we will now be using AIgean to explore opportunities for tuning and to build other kinds of networks.
    We first describe the hardware testbed used for our experiments and discuss the ease of use of AIgean in its current state with a case study we did for our own experiments. Then we present more quantitative results by addressing the physical limits of the communication links, and finally we present the current performance results of our first applications starting with a small network to illustrate the latency benefits of using network-connected FPGAs and then for ResNet-50 as a test to see whether we can implement a very large network.

    5.1 Hardware Testbed and Tools

    Our hardware testbed comprises Supermicro servers with Intel Xeon E5-2650V4 CPUs and 64 GB of memory. The FPGA boards we have available are Alpha Data 8K5s with a Xilinx KU115-2 FPGA, Fidus Sidewinders with Xilinx ZU19EG FPGAs, and Xilinx Alveo U200 and U250 cards with XCU200 and XCU250 FPGAs, respectively. We have 16 Sidewinders mounted in a 16-slot PCIe chassis, and the other boards are mounted in PCIe slots of our servers. For the network interconnect, we have two Dell S4048-ON 10G and two Z9100-ON 100G switches. The servers are connected to a 10G switch and the FPGA boards are mostly connected to 100G switches. For the AIgean tests reported here, we used Vivado 2019.1 and Sidewinder FPGA boards connected to 100G switches.
    The SDAccel platform we used was on an Amazon f1.2xlarge instance using SDAccel v2018.2. For those tests, hls4ml was used to generate the cores, and they were invoked as OpenCL kernels using SDAccel. The GPU tests used an Nvidia 1080Ti.

    5.2 Ease of Use Case Study

    While working on this article, we have gone through several iterations of different layers of the stack. One iteration involved optimizing our HLS library to use DSPs more efficiently, with the same functionality. We describe the results in Section 5.5, but we would like to discuss the steps required to change our cluster implementation between the two IP core implementations. This change was done at the hls4ml level, which is the domain of a hardware expert looking to optimize the hardware implementation of a particular IP core. Following the change, we ran hls4ml, which generated a directory structure and a makefile, with a subdirectory per IP core. At this point, we can build at the top-level makefile by typing “make,” and this will rebuild all IP cores that have changed their implementation. Then we point our IP core directory to the rest of the AIgean flow and type “make.” This will then partition the IP cores, add the bridges, and generate the bitstreams. This case study shows that an expert in IP core generation only has to focus on their layer of abstraction and then rebuild the entire cluster by typing “make” twice.

    5.3 Communication Protocol

    In this section, we present the latency and throughput measurements for different link configurations. This is to provide some understanding of the penalties for communication over network links. For communication between nodes, we can use a 100G UDP core [46] or a 10G UDP core [47] on the FPGA. These are interchangeable within our framework by the user. The 100G core uses a 512-bit interface as compared to the 64-bit interface for the 10G core. The CPU NIC we use is a 10G SFP NIC [48], even when communicating to a 100G FPGA. The specific FPGA board we are using is the Fidus Sidewinder with an MPSoC FPGA [49]. For latency measurements, we send a single flit of data (8 bytes) using the four different types of links listed in Table 1. The results in Table 1 involving software are shown with the 100G UDP core on the FPGA, but similar results are observed when using the 10G UDP core. From hardware to software, we observe the FPGA outputting at 100 Gb/s, but we experience packet drop in the software when doing the throughput measurement. Observe that the links involving software are limited by the CPU network stack and library implementation, whereas the FPGA-to-FPGA links can transfer at the full network bandwidth. Note that the results in Table 1 are for the raw throughput, including the protocol headers, which is why it is possible to achieve the full link bandwidth when using the FPGAs.
    Table 1. Round-Trip Latencies and Throughputs of Four Different Links
    Link                      | Latency    | Throughput
    Software to Hardware      | 0.029 ms   | 0.244 Gb/s
    Hardware to Hardware QSFP | 0.00017 ms | 100 Gb/s
    Hardware to Hardware SFP  | 0.0003 ms  | 10 Gb/s
    Hardware to Software      | 0.0203 ms  | N/A

    5.4 Autoencoder

    Here we describe our first small multi-FPGA network implemented with AIgean. We consider an example network with applications for high-energy physics. Specifically, our network is an autoencoder designed to detect anomalous events, potentially from new physics. An autoencoder is an unsupervised learning technique that leverages a neural network where a bottleneck in the shape of the network forces a compressed representation of the original input. Details about the model and use cases can be found in Appendix A.2.
    This network is a very interesting size for our studies, as it can be implemented on a single FPGA, but this requires a high degree of resource reuse that necessarily increases the inference latency. When splitting the network across multiple FPGAs, we can adjust the throughput and latency of the network by changing the reuse factor and compiling the network across multiple FPGAs. The network split across multiple FPGAs will have a higher throughput but incurs some latency from the transfer of the intermediate results.
    The resources for the autoencoder network are shown in Table 2 along with the resources available on the FPGAs we used. To test this autoencoder, we considered two separate implementations of the network: an implementation using an AWS F1-instance (VU9P FPGA) using SDAccel, and a second implementation using AIgean on three Sidewinder (ZU19EG FPGA) boards. What is notable is that the single FPGA implementation would not be able to fit on a single Sidewinder board, and it would have to be spread over multiple FPGAs for the chosen reuse factor. The single FPGA implementation also requires more than one super logic region, and as a consequence has difficulty meeting timing when compiled on the F1 instance with SDAccel.
    Table 2. Autoencoder Resources Compared to Target FPGA Resources
                                  | Initiation Interval | DSPs | BRAMs   | LUTs  | Flip-Flops
    Autoencoder Resources         | 552                 | 768  | 35.9 MB | 1.02M | 335K
    F1 Resources Per FPGA         | -                   | 9.2K | 72.6 MB | 1.29M | 2.59M
    Sidewinder Resources Per FPGA | -                   | 1.9K | 34.6 MB | 522K  | 1.04M
    Table 3 highlights the results from implementing the autoencoder on various devices as well as on a single FPGA using SDAccel and three FPGAs using AIgean.
    Table 3. Round-Trip Latency of a Single Batch Inference
    Device                                   | Latency (ms)
    Xeon E5-2650V4                           | 3.3
    Nvidia 1080Ti                            | 2.5
    1 FPGA Implemented in SDAccel (125 MHz)  | 0.24
    3 FPGAs Implemented in AIgean (190 MHz)  | 0.08
    3 FPGAs Implemented in AIgean (125 MHz)  | 0.12
    Our 1-FPGA autoencoder is clocked at 125 MHz at a low reuse factor when using SDAccel. Limitations in our version of SDAccel, as well as the resources required for the FPGA, prevented us from using a higher clock speed. For the 3-FPGA version, we used AIgean and were able to achieve 200 MHz for two of the FPGAs and 190 MHz for the third one. We did not try to improve this, so we use 190 MHz as the limiting clock speed. To make a fair comparison to the 1-FPGA implementation, we scale the AIgean latency by the ratio of clock speeds and get \(0.08 \times 190/125 = 0.12\) ms, which is still \(0.24/0.12 = 2\) times better than the latency using SDAccel. This shows that there is a significant architectural advantage to using multiple FPGAs, which is not unexpected because more resources are available. The performance increase with three FPGAs can be attributed to (a) the use of networking to directly communicate with the FPGA, yielding low latency, and (b) lower resource demands per FPGA, since only one-third of the model is implemented on each FPGA.
The implementations of this model on a single FPGA and on three FPGAs both have an initiation interval of 552 clocks and require roughly the same resources (the reuse factor is the same). In other words, the three FPGAs can accept a new input every 2.76 \(\mu s\) (362 kHz). Such a throughput approaches the demands of real-time anomaly detection at the LHC. Although the single-FPGA SDAccel implementation has a potential throughput of roughly half that of the 3-FPGA implementation, reaching it would require buffering inputs and outputs efficiently by batching calls on and off the FPGA through DDR and PCIe transfers. As a consequence, the individual (batch-1) latency would be significantly degraded before the throughput could approach even half that of the 3-FPGA implementation.
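The conversions quoted above are reproduced in the short Python sketch below; the clock frequencies, initiation interval, and measured latencies are taken directly from Tables 2 and 3, and the snippet only restates that arithmetic.

    # Clock-normalized comparison of the 1-FPGA (SDAccel) and 3-FPGA (AIgean) autoencoders.
    sdaccel_latency_ms = 0.24     # 1 FPGA at 125 MHz (Table 3)
    aigean_latency_ms = 0.08      # 3 FPGAs at 190 MHz (Table 3)

    # Scale the AIgean latency to the 125 MHz clock used by the SDAccel implementation.
    aigean_at_125mhz = aigean_latency_ms * 190 / 125   # ~0.12 ms
    speedup = sdaccel_latency_ms / aigean_at_125mhz    # ~2x

    # Throughput from the initiation interval: 552 clocks at 200 MHz.
    ii_clocks = 552
    f_clk_hz = 200e6
    period_s = ii_clocks / f_clk_hz                    # 2.76 us per inference
    throughput_hz = 1.0 / period_s                     # ~362 kHz

    print(f"AIgean at 125 MHz: {aigean_at_125mhz:.2f} ms ({speedup:.1f}x faster than SDAccel)")
    print(f"New input every {period_s * 1e6:.2f} us -> {throughput_hz / 1e3:.0f} kHz")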

    5.5 ResNet-50

To test AIgean on a much larger network, we have developed a multi-FPGA implementation of ResNet-50 [50]. The flexibility provided by AIgean allows us to target a high-throughput implementation in which the multiplications in each CNN layer are unrolled in proportion to the number of pixels processed by that layer. This allows the ResNet-50 design to be balanced across the different CNN layers so that they have a uniform throughput.
Most of ResNet-50’s architecture can be broken down into many sub-blocks consisting of a split, two to three convolutions followed by ReLU activations, and an addition operator, as shown in Figure 6. The dashed boxes represent the IP block granularity that we have used within our implementations.
    Fig. 6.
    Fig. 6. Sub-blocks found throughout ResNet-50 and our IP cores.
We have two implementations of ResNet-50: the first requires 12 Sidewinder boards using int-8 precision (using roughly 80% to 90% of the resources on each FPGA); the second is more DSP efficient and requires 10 Sidewinder boards, also using int-8 precision. We additionally use one FPGA as a 100G data generator that feeds inputs to the cluster at line rate. For the 12-FPGA configuration, we tested in a piecewise fashion:5 we tested the traffic generator with the first 10 of the 12 FPGAs, and then the traffic generator with the remaining 2 FPGAs. We have verified that both the full 10-FPGA configuration and the piecewise 12-FPGA configuration run at 660 images per second.
Table 4.
Implementation                                 Throughput          Latency
AIgean Using CPU/FPGA network                  400 images/s        7 ms
AIgean Using FPGA Data Generator               660 images/s        1.9 ms
Microsoft Brainwave Batch 1                    559 images/s [51]   10 ms
Nvidia V100 GPU Mixed Precision Batch 1        250 images/s [51]   5.9 ms
Table 4. Performance of Different Layers and Implementations at Batch 1
Table 4 summarizes the throughput and latency results of our full 12-FPGA implementation of ResNet-50. When the source data comes from the CPU, we observe that the maximum throughput is only 400 images per second with a latency of 7 ms, due to the bandwidth limitation between the CPU and the FPGA (5-ms latency between the CPU and the FPGA). To demonstrate the full performance achievable with the FPGAs, we use the FPGA data generator and observe a throughput of 660 images per second with a latency of about 1.9 ms. The latency is determined through a simulation of the full ResNet-50 network where each layer is separately run in parallel. The network delay between each FPGA is estimated from Table 1 using the QSFP. For 10 hops, the total network delay would be 0.0017 ms, which is insignificant compared to the computation latency. The next row gives the values for Microsoft’s Brainwave [51]. For the latency of Brainwave, we quote the end-to-end latency determined by sending an image to a Brainwave server and receiving the result from a CPU within the same computing cluster. The final row shows the performance for an Nvidia V100 GPU using the mixed-precision implementation of ResNet-50 at batch 1. The latency and throughput quoted are obtained through the use of the Triton inference server with a client on the same machine; as a consequence, the latency numbers include the PCIe transfer time in addition to the network inference. Equivalent numbers quoted by Nvidia yield a batch-2 latency of 1 ms with a throughput of 2,000 images per second for the same model [52]; batch-1 latency is not quoted.
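As a rough illustration of how the pipelined latency estimate is assembled, the sketch below combines the simulated compute latency with the per-hop network delay; the per-hop figure is simply the quoted 0.0017 ms for 10 hops divided by 10, so this is a back-of-envelope restatement rather than new data.

    # Back-of-envelope latency model for a layer pipeline spread over several FPGAs.
    hops = 10
    per_hop_ms = 0.0017 / 10        # ~0.17 us per QSFP hop, inferred from the text
    compute_latency_ms = 1.9        # simulated ResNet-50 pipeline latency

    total_latency_ms = compute_latency_ms + hops * per_hop_ms
    print(f"network share: {hops * per_hop_ms:.4f} ms of {total_latency_ms:.4f} ms total")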
Table 5.
FPGA Number    Flip-Flops (%)    LUTs (%)    DSPs (%)    BRAM (%)
0              31.8              40.1        71.5        1.73
1              26.7              35.1        74.8        11.9
2              11.12             12.06       74.8        0.68
3              49.3              66.6        65.0        6.90
4              38.3              50.9        71.5        2.00
5              20.5              23.2        78.5        14.0
6              54.0              72.6        65.0        7.08
7              57.3              75.9        65.0        10.1
8              60.1              78.2        68.3        13.8
9              58.9              76.5        52.0        7.26
10             44.5              57.4        58.5        5.12
11             30.9              39.9        38.7        8.72
Total Absolute Resources Used Across All FPGAs    5.05 M    3.28 M    15.4 K    31.5 MB
Total Resources Available Per FPGA                1.04 M    522 K     1.9 K     34.6 MB
Table 5. Resource Utilization Percentage of Each FPGA and Total Resources Available Per FPGA
Table 5 summarizes the resources used for our 12-FPGA implementation. Note that this was partitioned with our greedy partitioning scheme, which uses a heuristic of 80% utilization before allocating the next FPGA. The 10-FPGA configuration is very similar in terms of resources but with half the DSPs. A noteworthy pattern is that the smaller layers early in the network leave their FPGAs DSP limited, whereas the larger layers later in the network are logic limited. The limiting factor for each FPGA is its most highly utilized resource in Table 5 (with the exception of the last FPGA, which is not fully used). For perspective, the total resources available on an individual FPGA are shown at the bottom of Table 5; this FPGA is approximately equivalent to a single SLR of the VU9P FPGA in the Amazon F1 instance (each VU9P has three SLRs) [53]. For further perspective, we can also compare with the Xilinx Alveo U250 [54]: our current utilization is DSP limited, and since the U250 provides 12.2K DSPs, we could fit our entire ResNet-50 implementation on two Alveo U250 boards.
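For illustration, a minimal sketch in the spirit of the greedy, utilization-driven partitioner described above is shown below. The per-FPGA capacities are taken from Table 5 and the 80% threshold from the text; the per-layer resource estimates, data layout, and function names are assumptions made for this example, not the actual AIgean implementation.

    # Hedged sketch: greedily pack layers onto FPGAs, opening a new FPGA once any
    # resource would exceed 80% of the device capacity (Sidewinder figures from Table 5).
    FPGA_CAPACITY = {"dsp": 1968, "lut": 522_000, "ff": 1_040_000, "bram_mb": 34.6}
    THRESHOLD = 0.80   # heuristic utilization limit before allocating the next FPGA

    def greedy_partition(layers):
        """layers: ordered list of dicts with the same resource keys as FPGA_CAPACITY."""
        partitions, current, used = [], [], {k: 0 for k in FPGA_CAPACITY}
        for layer in layers:
            fits = all(used[k] + layer[k] <= THRESHOLD * FPGA_CAPACITY[k]
                       for k in FPGA_CAPACITY)
            if not fits and current:          # close the current FPGA and start a new one
                partitions.append(current)
                current, used = [], {k: 0 for k in FPGA_CAPACITY}
            current.append(layer)             # an oversize layer still gets its own FPGA
            for k in FPGA_CAPACITY:
                used[k] += layer[k]
        if current:
            partitions.append(current)
        return partitions

Because the partitioner only looks at per-layer resource estimates in pipeline order, communication patterns between kernels are ignored; Section 6 discusses this limitation.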
Last, we would like to contrast this implementation with previous implementations of ResNet-50. The design flow of AIgean differs from previous 8-bit implementations of ResNet-50 in that no overlay is used, and each layer is implemented separately. In this scenario, it is possible to continuously stream images through the implementation without having to wait for an image to complete. With an overlay architecture, images are streamed through one layer into a buffer, the next layer is then loaded, and the buffered results are streamed through that layer; as a consequence, a scheme is needed for buffering each input, and some time is needed to switch between layers. With the AIgean design flow, the whole network exists on the FPGA fabric, so images can be continuously pumped through. This leads to a more efficient use of multiplier resources, at the cost of additional resources to route individual layers together, and it directly yields batch-1 streaming. Furthermore, because images flow through continuously, the buffering between layers is limited to just the partial image needed for the matrix multiplications of the CNN applied to nearby pixels.
To understand the efficiency of the resource usage, we compute the total number of multiplication operations needed for a perfectly efficient FPGA clocked at 200 MHz. Our implementation of ResNet-50 contains a total of 4B multiplications; dividing by the 3 \(\times 10^{5}\) clocks available for a 1.5-ms latency at 200 MHz yields roughly 13,500 multiplications per cycle. Our current implementation uses 15,419 DSPs, which is slightly more, because many of the individual layers are tuned to a latency that is actually below 1.5 ms. The number of DSPs can be reduced in two ways: first, through the sharing of DSPs, which is only partially implemented here, and second, through a faster clock frequency. Sharing DSPs would lead to roughly a factor-of-2 reduction in DSPs, whereas a faster clock would yield a lower latency for the same number of DSPs. Since each multiplier unit is mapped directly to a specific multiplication within the network, the only source of DSP inefficiency arises when no allowed reuse parameter matches the desired throughput, so that an individual layer ends up with a significantly lower latency than its neighboring layers.
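The arithmetic behind this efficiency estimate is reproduced below using the figures quoted in the text.

    # Multipliers required for a perfectly efficient, fully pipelined design at 200 MHz.
    total_mults = 4e9                 # multiplications per ResNet-50 inference (approx.)
    target_latency_s = 1.5e-3
    f_clk_hz = 200e6

    clocks_available = target_latency_s * f_clk_hz     # 3e5 clocks
    mults_per_clock = total_mults / clocks_available   # ~13,300 multipliers per cycle
    print(f"{clocks_available:.0f} clocks -> {mults_per_clock:,.0f} multipliers needed")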
Adjusting the reuse parameter effectively modifies the initiation interval of each layer: a reuse factor of 5,000 corresponds to a layer with an initiation interval of 5,000 clocks. To adjust the reuse parameters efficiently with hls4ml, the reuse must split the dense matrix multiply embedded within the layer across DSPs so as to maintain a regular systolic-array architecture. As a consequence, optimal implementations of the reuse can only take certain values, which are determined by the number of input and output features of each layer. Our current implementation is near ideal because the 1.5-ms target admits a consistent set of reuse values that land close to that latency for every layer. To achieve a higher throughput, we need to adjust the reuse factor to the desired throughput and re-implement the whole design. Although this procedure requires a lot of computing, it is fully automated through the AIgean design flow.
When adjusting the reuse factor, we observe a direct correlation with the number of DSPs: halving the reuse factor halves the initiation interval of the matrix multiply within a layer and doubles the number of DSPs. Flip-flops and LUTs do not change as significantly, since they largely exist to store partial images. BlockRAMs are used primarily to store the weights of the neural network on the FPGA, and secondarily as buffers between layers; as a consequence, BlockRAM usage also does not change significantly with the reuse factor. In the current implementation, since DSP sharing of the multiplications is only partially used, the resulting resources are more consistent with a ResNet-50 implementation having a latency of roughly half the observed latency (0.75 ms).
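The scaling described above follows directly from the relationships stated in the text (initiation interval equals the reuse factor, DSP count equals the number of multiplications divided by the reuse factor); the sketch below simply tabulates them, using the 512 by 2048 feature layer discussed in Appendix A.1.7 as an example.

    # Reuse factor trade-off for a single matrix multiply (no DSP sharing assumed).
    def layer_cost(n_mults, reuse):
        dsps = n_mults // reuse   # one DSP performs `reuse` multiplications
        ii_clocks = reuse         # a new input is accepted every `reuse` clocks
        return dsps, ii_clocks

    # Halving the reuse factor halves the initiation interval and doubles the DSPs.
    for reuse in (2048, 1024, 512):
        dsps, ii = layer_cost(512 * 2048, reuse)
        print(f"reuse={reuse:5d}  DSPs={dsps:5d}  II={ii:5d} clocks")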
Faster implementations of ResNet-50 are possible by adjusting the reuse factor. However, for CNNs, a lower bound is present in the current, pixel-by-pixel implementation of the algorithm. The lower bound results from the fact that each pixel streaming through the algorithm costs one clock of latency, and an additional three clocks are needed to prepare the inputs for the matrix multiplication. For layers with many pixels, such as the first layer, the ultimate latency is limited by these operations. Applying this limit to the first layer of ResNet-50, we find that the latency of that single layer is bounded from below at roughly 0.4 ms. Lower single-inference latencies can still be achieved by splitting the image into sub-images and simultaneously streaming these sub-images into separate, cloned implementations of the chosen layer. Although the use of multiple streams scales the single-inference latency down by a factor of (number of streams) \(^{-1}\), it has the added cost of increasing the resources by the number of streams.
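The lower bound can be estimated directly from the per-pixel overhead. The sketch below assumes a 200 MHz clock, the 112 by 112 output resolution of the first ResNet-50 convolution, and the roughly six clocks of per-pixel overhead from Equation (1) in Appendix A.1.7; these assumptions are ours, chosen to reproduce the approximate 0.4-ms figure.

    # Back-of-envelope lower bound for the first ResNet-50 convolution.
    out_pixels = 112 * 112            # assumed output resolution of the first conv layer
    overhead_clocks_per_pixel = 6     # per-pixel overhead approximated as in Eq. (1)
    f_clk_hz = 200e6

    min_layer_time_s = out_pixels * overhead_clocks_per_pixel / f_clk_hz   # ~0.38 ms
    # Splitting the image into N sub-image streams divides this bound by N,
    # at the cost of N cloned copies of the layer.
    for n_streams in (1, 2, 4):
        print(f"{n_streams} stream(s): >= {min_layer_time_s / n_streams * 1e3:.2f} ms")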

    6 Future Work

The work demonstrated in this article is a first prototype of what is possible with an open multi-FPGA ML platform, and it leaves much room for improvement in all areas of the ML stack in Figure 1. Within the hardware layer, there is still considerable room to optimize the IP cores themselves. Furthermore, splitting images into multiple streams can effectively remove any throughput limitation, at the cost of larger resource usage.
Once the IP cores are further optimized, it is our hope that communication once again becomes the bottleneck. When that is the case, we should explore more intelligent partitioning schemes that limit communication across FPGA boundaries. At the moment, the partitioner is a greedy solution that looks solely at resource utilization without taking into account the communication patterns between IP cores within the cluster.
    Finally, hls4ml has the flexibility to run a broad range of other network architectures including transformer networks [55] and binary/ternary networks [56]. This work and new developments with hls4ml can be directly integrated into the AIgean flow. We can now explore a broad range of deep learning architectures with many different sizes across multiple FPGAs.

    7 Conclusion

AIgean is a platform for mapping ML applications onto a cluster of network-connected FPGAs. This approach is much more scalable and delivers higher computing performance than using the FPGA vendor tools, which are principally targeted at a single server with a handful of PCIe-connected FPGAs. Results from our initial implementations of actual networks show the benefits of using FPGAs for low-latency applications. We have also built two implementations of ResNet-50 to show that AIgean can implement very large networks.
The structure of AIgean is a set of abstraction layers spanning the entire computing stack, from the ML development layer at the top down to the physical hardware layer, where the computing and communication are implemented in the FPGA. This gives multiple opportunities to optimize the computing stack and to do research at whichever level matches the area of interest and the design expertise available, from ML algorithms down to low-level hardware design.
The layered approach makes it easier to implement AIgean because it is possible to leverage the Galapagos multi-FPGA platform and only add an application bridge to the Galapagos library. It also makes it possible to quickly automate the translation of the hls4ml protocols and interfaces to the Galapagos protocol.
    By leveraging Galapagos, AIgean is also portable to other FPGA platforms as long as the low-level hypervisor layer in the FPGA is created. AIgean also leverages the ability of Galapagos to deploy computing kernels to either CPUs or FPGAs such that an application can be first debugged and characterized entirely in software before committing all or parts of it to FPGA hardware.
    The experience of developing AIgean has demonstrated the challenges of building a multi-FPGA application development platform that is portable across many FPGA boards, but it proves that it is feasible in a reasonable amount of time.
    AIgean is available as an open source project and can be downloaded at https://github.com/UofT-HPRC/AIgean.

    Acknowledgments

    We sincerely thank the reviewers for their helpful comments that significantly improved the quality of this article.

    Footnotes

    1
    We report only tools publicly available on GitHub and with a high user rating (Star Metric).
    2
    In HLS, the initiation interval specifies the number of clock cycles between the introduction of new inputs in a pipeline.
    3
    In Microsoft terminology, this layer is called the shell [1].
    4
    A flit is the amount of data transferred in one clock cycle in an AXIS stream.
    5
    We could not get access to enough boards.

    A Appendix

    This is an appendix covering details on both hls4ml and Galapagos as well as the details about the models we used in Section 5.

    A.1 Hls4ml

Hls4ml was initially designed to address the need for ultra-low-latency inference. In this context, it was developed to enable deep neural network inference at timescales below \(1 \mu s\), pipelined with initiation intervals of a few tens of nanoseconds. Such low-latency, high-throughput networks are required to process information at the Large Hadron Collider in an all-FPGA system with an approximate throughput of 1 Petabit/s. Algorithms were therefore explicitly designed to achieve the fastest possible latency at the cost of utilizing more hardware. In place of reusing network layers, deep neural networks are unrolled across the FPGA fabric to allow for sequential, batch-1 processing of network inferences with initiation intervals smaller than the total network latency. As a result of this design flow, the focus of hls4ml was on small networks that use a large amount of resources but can run at very low latencies.
Extending this paradigm to larger networks, such as ResNet-50, was not originally within the scope of hls4ml, since such networks require more resources than a single FPGA provides; as a consequence, hls4ml did not contain implementations of large layers. With AIgean, however, distributing a single network across many FPGAs makes large, low-latency networks achievable under the hls4ml design flow. Consequently, hls4ml was extensively adapted to handle these large networks to create AIgean. In particular, we created or heavily adapted implementations of many hls4ml layers, including:
    Dense/Linear Layer
    CNN Layer
    Pooling Layer
    Split Layer
    Merge Layer
These layers make up the core of most deep neural networks in use. Furthermore, the large-layer design flow established through the development of AIgean will enable the fast implementation of other layers following the AIgean implementations. In this appendix, we outline the various adaptations needed in hls4ml to run a wide variety of algorithms at large scales, and we comment on new features added to hls4ml that enhance the core software framework and enable large-model, high-throughput implementations.
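For readers unfamiliar with the hls4ml design flow, the conversion from a trained model to an HLS project is driven from Python. A minimal sketch is shown below; the exact function names and options vary between hls4ml releases, and the toy model, layer names, reuse value, and FPGA part string are placeholders for this example rather than values used in AIgean.

    # Indicative hls4ml flow: convert a trained Keras model into a Vivado HLS project.
    import hls4ml
    from tensorflow import keras
    from tensorflow.keras import layers

    # Toy stand-in model; in practice this would be the trained network to deploy.
    model = keras.Sequential([
        keras.Input(shape=(276,)),
        layers.Dense(92, activation='relu', name='dense1'),
        layers.Dense(276, name='dense2'),
    ])

    config = hls4ml.utils.config_from_keras_model(model, granularity='name')
    config['LayerName']['dense1']['ReuseFactor'] = 92   # trade DSPs for initiation interval

    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir='hls_autoencoder_proj',
        part='xczu19eg-ffvc1760-2-i')   # ZU19EG part string is an assumption
    hls_model.compile()                 # build and run the C model
    # hls_model.build(synth=True)       # launch Vivado HLS synthesis (long-running)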

    A.1.1 Design Flow.

To enable large CNN layers within hls4ml, the hls4ml flow was refactored to work at longer latencies. In the original hls4ml implementation, to run NNs at ultra-fast latencies, layers were connected through large arrays written simultaneously. For AIgean, we added the option of streaming arrays, which allows the outputs to be arrays of streams that are then interfaced with Galapagos. Arrays of streams make it possible to stream partial results between layers. This element is crucial for processing data in a CNN, where the same kernel is applied many times to different parts of a larger image. Arrays of streams differ from other streaming architectures, notably FINN [26], in that substantial throughput between layers is still possible because the array size can be adjusted to the latency requirements.
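A purely functional Python sketch of the streaming idea is given below, using generators in place of hardware streams. It only illustrates how a downstream layer can start consuming pixels before the upstream layer has finished the whole image; the widening of each stream into an array of streams for higher per-clock bandwidth is not modeled here.

    # Functional sketch: generators stand in for hardware streams. The downstream layer
    # consumes pixels as they arrive, so only per-pixel buffering is needed between layers.
    def relu_layer(pixel_stream):
        for pixel in pixel_stream:
            yield [max(0.0, x) for x in pixel]

    def scale_layer(pixel_stream, s=0.5):
        for pixel in pixel_stream:
            yield [s * x for x in pixel]

    image = [[-1.0, 2.0], [3.0, -4.0]]           # two "pixels", two features each
    pipeline = scale_layer(relu_layer(iter(image)))
    print(list(pipeline))                         # [[0.0, 1.0], [1.5, 0.0]]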

    A.1.2 Neural Network Weights.

Within hls4ml, the NN weights are compiled and embedded within the HLS project. Although this feature allows an optimized place and route of neural networks that accounts for unstructured pruning, it adds complexity to the HLS compilation that can substantially slow it down. To avoid this, and to gain the flexibility of weight updates, we added the functionality to treat the weights as external Block RAM or UltraRAM ports. The use of external ports keeps the total HLS compilation time under one hour per layer.

    A.1.3 Dense/Linear Layer.

A single dense (TensorFlow notation) or linear (PyTorch notation) layer is the core of most NN implementations, where the matrix multiplication is performed. In hls4ml, this consists of a systolic array whose size is compiled directly for that layer. In particular, the reuse parameter is built into the dense layer construction and dictates the size of the systolic array through the number of DSPs used. To adapt the dense layer for AIgean, we added the possibility of streaming between layers. To account for partial images arriving as streams, we embedded the flattening of the NN inputs within the dense layer. The dense layer multiplications were also modified to allow the option of merging two 8-bit multiplies into a single DSP operation [57].
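The DSP-merging trick packs two multiplications that share one operand into a single wide multiply. The sketch below demonstrates the unsigned case in plain Python; the signed case used on the FPGA needs additional correction terms, as described in the Xilinx white paper [57].

    # Two unsigned 8-bit multiplies (a*c and b*c) packed into one wide multiplication,
    # mimicking how one 27x18 DSP can serve two int8 weights that share an activation.
    def packed_multiply(a, b, c):
        assert 0 <= a < 256 and 0 <= b < 256 and 0 <= c < 256
        packed = (a << 18) | b      # b*c < 2**18, so it cannot carry into the upper product
        product = packed * c        # one hardware multiplication
        return product >> 18, product & ((1 << 18) - 1)   # (a*c, b*c)

    assert packed_multiply(200, 37, 91) == (200 * 91, 37 * 91)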

    A.1.4 CNN Layer.

To avoid large intermediate outputs and to improve the overall throughput, a new CNN layer was developed that operates on an image pixel by pixel. In this scenario, images are streamed through an array of streams between layers, with each depth element in the stream corresponding to a single pixel of the image. Pixels are streamed one at a time into a layer, and the resulting output pixels are streamed to the next set of layers. The partial image is stored within a layer using a line buffer implementation. The line buffer is implemented as an array of shift registers, and we rely on the specialized HLS shift register objects to ensure the final implementation uses explicit shift register logic elements (SRLs). The line buffer is further optimized to store the minimal number of pixels required by the kernel size (convolution kernel height \(\times\) image row). An intermediate buffer is also used to store the kernel window before the matrix multiply needed for the convolution kernel. The matrix multiplication within the CNN kernel uses the default dense layer within hls4ml. The total latency before the kernel matrix multiplication is three clocks (one for reading the inputs, one for updating the shift registers, and one for filling the kernel window).
The reuse factor for the CNN kernel thus defines the reuse per output pixel (i.e., the reuse is tied to the matrix multiplication of the kernel). For the base implementation, the overall latency of the convolution kernel is five clocks per output pixel plus the reuse factor of the dense layer. This five-clock overhead can be reduced for small networks by utilizing a fully partitioned line buffer at the cost of more resources. Zero padding for the individual layers is built into the layer implementations.
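To make the line buffer and window logic concrete, here is a small functional model in Python for a single-channel image and a K by K kernel with stride 1 and no padding. It mirrors the idea of keeping only K minus 1 rows plus a K by K window, but it is a behavioral sketch, not the HLS implementation.

    # Behavioral sketch of a streaming line-buffer convolution (single channel, stride 1,
    # 'valid' padding). Only K-1 image rows plus a KxK window are buffered at any time.
    def stream_conv2d(pixel_stream, width, K, weights):
        line_buffers = [[0.0] * width for _ in range(K - 1)]   # K-1 previous rows
        window = [[0.0] * K for _ in range(K)]                 # current KxK window
        for idx, pixel in enumerate(pixel_stream):             # one pixel per "clock"
            col, row = idx % width, idx // width
            # Column of K pixels entering the window: oldest rows first, new pixel last.
            column = [lb[col] for lb in line_buffers] + [pixel]
            # Shift the K-1 line buffers down by one row at this column.
            for r in range(K - 2):
                line_buffers[r][col] = line_buffers[r + 1][col]
            line_buffers[K - 2][col] = pixel
            # Slide the window left by one column and insert the new column on the right.
            for r in range(K):
                window[r] = window[r][1:] + [column[r]]
            if row >= K - 1 and col >= K - 1:                  # window holds valid data
                yield sum(window[r][c] * weights[r][c]
                          for r in range(K) for c in range(K))

    # Example: 3x3 averaging kernel over a 4x4 ramp image.
    W, K = 4, 3
    img = [float(i) for i in range(W * W)]
    w = [[1.0 / 9] * K for _ in range(K)]
    print(list(stream_conv2d(iter(img), W, K, w)))   # [5.0, 6.0, 9.0, 10.0]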

    A.1.5 Pooling Layer.

    In addition to a CNN layer implementation, a pooling layer is added following the same data flow as in the CNNs (array of streams). The pooling layer implementation is similar to the CNN layer, except that it performs pooling instead of the convolution kernel matrix multiplication present in the CNN.

    A.1.6 Split/Merge Layers.

To allow for ResNet-50, split and merge layers were added to hls4ml using the array-of-streams data flow. Since splitting arrays of streams can use a large number of resources, the split and merge layers are generated with the ability to time-multiplex and de-multiplex the streams. This leads to a longer latency for the layer operation; however, the cost of these layers is typically heavily subdominant to that of the other layers in a network.

    A.1.7 Reuse Factor.

Hls4ml has one main tuneable parameter, the reuse factor, which defines how many times a DSP is reused within a matrix multiply. For example, a reuse factor of 25 implies that a single DSP is used 25 times to perform a single matrix multiplication; as a consequence, the initiation interval of that layer is 25 clocks, and the number of DSPs used equals the total number of multiplications in the layer divided by the reuse factor. To ensure a regular architecture, the reuse factor needs to divide the total number of multiplications in the layer evenly. As an example, consider the last convolutional layer in ResNet-50. This layer consists of a matrix multiplication (1 \(\times\) 1 pixel kernel) with 512 input features and 2048 output features, or 1.04M multiplications. For a 0.25-ms implementation of ResNet-50, a reuse factor of 1024 is used, leading to an initiation interval of 1024 clocks and 1024 DSPs without DSP merging (512 with DSP merging). Furthermore, this convolution kernel is run on a 7 \(\times\) 7 input image, that is, 49 separate times, leading to a total latency of exactly 0.25 ms.
The reuse factor for the CNN maps directly to the reuse factor of the dense layer implementation. The dense layer has two internal implementations depending on the size of the reuse factor. Where the reuse factor is smaller than the number of input features, the systolic array is split across the inputs so that neighboring multiplications are accumulated into the same or an adjacent output feature. Where the reuse factor is larger than the number of inputs, the systolic array is split across the output features; the input features are multiplexed and multiplied before being accumulated across the output features. These optimizations were chosen to ensure optimal resource usage in the matrix multiply, allowing large matrix multiplications with millions of weights. As a consequence of these choices, certain reuse factors that are multiples of the numbers of inputs and outputs are particularly resource efficient. Within the hls4ml code generation, the reuse factor is automatically adjusted to the nearest optimized value for each layer.
To balance the throughput between the layers, we modified the reuse factor handling for AIgean kernels to account for the per-layer throughput. Instead of defining one DSP reuse factor for the whole network, the reuse factor of each layer is computed dynamically from a desired per-layer throughput for the total network. To compute the optimized reuse factor, we rely on an analytic formula for the total layer throughput as a function of the reuse factor. This formula for the per-layer throughput \(R\), in clocks, is
\begin{equation} R = N_{\it pixel} \left(6 + \frac{N_{\it in}N_{\it out}}{r}\right), \qquad (1) \end{equation}
where \(r\) is the reuse factor of the layer, \(N_{\it pixel}\) is the number of pixels in the CNN layer, \(N_{\it in}\) is the number of input features, and \(N_{\it out}\) is the number of output features. The additional 6 approximates the number of clocks per pixel that a single layer requires to shift, fill, and reuse one matrix multiply; this gives a lower bound on the latency of a single CNN layer of \(6 N_{\it pixel}\). Ultimately, this limitation is soft, since multiple copies of a CNN layer can be used on separate parts of an image. Within the hls4ml configuration, the throughput target \(R\) is defined in clocks before project generation, and the reuse factors are then computed automatically during generation. As a consequence, to change the throughput, one need only adjust the desired latency \(R\) and then regenerate and recompile the project.
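A direct transcription of Equation (1), together with its algebraic inversion for a target throughput, is shown below; the snapping to the allowed reuse values discussed above is omitted, so the inversion is only a starting point.

    # Equation (1): per-layer throughput R (in clocks) as a function of the reuse factor r.
    def layer_clocks(n_pixel, n_in, n_out, r):
        return n_pixel * (6 + n_in * n_out / r)

    def reuse_for_target(n_pixel, n_in, n_out, target_clocks):
        """Invert Equation (1) for r; a real flow then snaps to the nearest valid value."""
        return n_in * n_out / (target_clocks / n_pixel - 6)

    # Last ResNet-50 convolution from the text: 1x1 kernel, 512 -> 2048 features, 7x7 image.
    clocks = layer_clocks(49, 512, 2048, 1024)        # 49 * (6 + 1024) = 50,470 clocks
    print(clocks / 200e6 * 1e3, "ms at 200 MHz")      # ~0.25 ms
    print(reuse_for_target(49, 512, 2048, 50_470))    # recovers r = 1024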

    A.2 Models

    In this appendix, we present a detailed description of each NN model used in this paper. The choice of these architectures is partly motivated by use in high energy physics, where low-latency deep neural network inference is an essential tool for operation. As a consequence, we comment on the model architecture and its application to problems within physics.

    A.2.1 Autoencoder.

Autoencoders are unsupervised networks often used to identify anomalous features. By creating an information bottleneck within the network, autoencoders can compress and classify detector-level information. The autoencoder considered in this example is capable of identifying anomalous collisions at the LHC. In particular, this network can identify top quark pair production, Higgs boson pair production, and other more exotic final states.
At the Large Hadron Collider, this network has a direct application through the integration of data scouting/trigger-level analyses [58]. Data scouting is a process whereby partially reconstructed collisions are read out and processed to investigate collisions that would normally be discarded in the LHC data flow. This technique is particularly powerful in the search for Dark Matter [59]. In this case, events can be analyzed at a rate as high as 40 MHz, and the autoencoder can be used to create an “anomaly stream” at a reduced rate. With a maximum data rate on the order of 50 Terabits per second within the LHC trigger “scouting” stream, throughput and low latency are critical, since any delay in even a single inference would require significantly larger buffers. Distributing this system over multiple FPGAs brings a significant advantage, since it allows for very low latency while preserving the ability to pipeline events with a small initiation interval.
The autoencoder network is trained on events from known and well-understood physics processes; thus, any event that cannot be encoded and decoded accurately is a potential candidate for new physics searches. The inputs and outputs of the network are 276 expert event features. The number of hidden features in the first layer is 276; the second layer goes down by 1/3 to 184, the third by 1/2 to 92, and the width remains at 92 features for the next six layers before the hidden features expand back symmetrically to 276 output features. The compression factor in the bottleneck is thus 3, and in total the network consists of 12 fully connected layers and over 300,000 weights. ReLU activations are used between the fully connected layers.
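One reading of this architecture as a Keras model is sketched below. The exact decoder layout is our interpretation of the symmetric expansion, so the layer widths and the linear output activation should be treated as indicative; with this reading, the model has 12 dense layers and roughly 340K parameters, consistent with the description above.

    # Indicative Keras sketch of the anomaly-detection autoencoder described above.
    from tensorflow import keras
    from tensorflow.keras import layers

    widths = [276, 184, 92] + [92] * 6 + [184, 276, 276]   # 12 dense layers in total
    model = keras.Sequential()
    model.add(keras.Input(shape=(276,)))
    for i, width in enumerate(widths):
        act = 'relu' if i < len(widths) - 1 else None       # linear reconstruction output
        model.add(layers.Dense(width, activation=act))
    model.summary()   # roughly 340K trainable parameters with this reading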

    A.2.2 ResNet-50.

Lastly, we consider the well-known ResNet-50 benchmark. ResNet-50 is a deep neural network used for image processing [60]. The ResNet-50 architecture has proven very versatile; in particular, a quantized ResNet-50 has been retrained for top quark identification within high-energy physics, with results comparable to world-leading algorithms [61]. More recently, ResNet-50 has become a standard benchmark for neural network inference performance, and with the development of quantized neural networks, the 8-bit implementation has largely replaced the floating-point implementation as the standard inference benchmark. The 8-bit ResNet-50 has been shown to yield almost identical performance to the floating-point implementation, with the added advantage of cheaper 8-bit operations.
For the implementation of ResNet-50 considered in this article, we use the 8-bit version. The network consists of 50 convolutional layers, 2 pooling layers, a dense layer, 16 merge layers, 16 split layers, and 50 batch normalization layers. To minimize the total amount of computation, the batch normalization layers are fused with the convolutional layers. ResNet-50 takes an input image of \(224 \times 224\) pixels and iteratively reduces it to a final image of \(7 \times 7\) pixels. The total number of multiplications in ResNet-50 is 4.8 billion. For a target throughput of 650 Hz, this translates to 3.1 trillion multiplications per second, or 15,000 multiplications per clock cycle at 200 MHz. In the AIgean implementation, we perform two 8-bit multiplications per DSP; as a consequence, we would require 7.5K DSPs for the full implementation at 100% efficiency. Our actual usage is 9.9K DSPs, corresponding to an aggregate efficiency of 77%. The efficiency is not higher because the ResNet-50 design was actually synthesized for a per-image latency below the 1.5-ms target to ensure that the target is met: the per-layer latency ranges from 1.0 ms to 1.4 ms, with most layers near 1.3 ms.

    A.3 Galapagos

    Galapagos is a heterogeneous deployment stack that allows users to deploy streaming IP cores on a cluster of FPGAs and CPUs. First we will describe the high-level abstraction model of Galapagos and then delve into each layer of abstraction that we implemented.

    A.3.1 High-Level Abstraction.

The end goal of Galapagos is to map a data flow graph of streaming IP cores onto a cluster of devices without the user having to worry about how to physically connect these IP cores to each other, both within one device and across devices. The IP cores address their destinations by target IP core, independent of how and where each IP core is implemented and placed. The IP cores themselves use AXI-stream with a destination side channel to communicate with each other; this mechanism is typically used within a single FPGA for routing among AXI-stream kernels. Our goal is to provide this seamlessly across many devices, giving the user an abstraction similar to a single device; this can be seen as AXI-stream over the data center. We accomplish this by automating the encapsulation and decapsulation of AXI-stream packets within higher-level network protocols. Figure 7 shows a high-level overview of Galapagos: on the left-hand side is a placement- and implementation-agnostic data flow graph of streaming IP cores, and on the right-hand side is how they are placed and implemented. Prior to this work, the user had to provide configuration parameters defining the mapping of kernels to devices, but even that is now abstracted away with a partitioner. Although the vendor toolkits provide the ability to add networking to a design [62, 63], the application must be explicitly aware of the networking.
    Fig. 7.
    Fig. 7. An overview of Galapagos. The user provides the implementation and network agnostic data flow graph of streaming IP cores, and our tool flow implements the right hand side, with the appropriate bridging to connect the devices together.
    Fig. 8.
Fig. 8. An example of the lowest level of abstraction. With the abstraction provided, all of these streaming devices have a consistent interface and can communicate with one another.
Galapagos is built from the bottom up through several layers of the stack. At the bottom, individual devices are abstracted to appear as streaming devices, as shown in Figure 8. We have implemented this on different FPGAs and CPUs, but it could be extended to other devices such as IoT sensors. Once each device is abstracted, we can connect the devices seamlessly at the protocol level. Furthermore, we can work at the finer granularity of the IP cores that run on these devices, and even migrate implementations of these kernels, as long as they keep a consistent interface. Galapagos provides higher levels of abstraction to place streaming IP cores on these devices and to connect and route among the IP cores on one device as well as across multiple devices. This is done through the implementation of the following layers of the stack: Physical Hardware and Connectivity, Hypervisor, Middleware, and Communication.
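Conceptually, the information Galapagos needs is a kernel graph with numeric destinations plus a mapping of kernels to physical nodes. The structure below is a hypothetical Python illustration of that information only; the field names and layout are ours and do not reflect the actual Galapagos configuration schema.

    # Hypothetical illustration of a Galapagos-style deployment description.
    kernels = {
        1: {"name": "conv_block_0", "impl": "hls"},
        2: {"name": "conv_block_1", "impl": "hls"},
        3: {"name": "result_sink",  "impl": "cpu"},   # same graph, different device type
    }
    connections = [(1, 2), (2, 3)]                     # AXI-stream dest = target kernel ID

    mapping = {
        1: {"node": "fpga0", "network": "10G"},
        2: {"node": "fpga1", "network": "10G"},
        3: {"node": "host0", "network": "tcp"},
    }

    # The middleware derives a routing table per node from the mapping, so kernel 1 only
    # ever addresses "kernel 2" and never needs to know that it lives on another FPGA.
    routing_table = {kid: m["node"] for kid, m in mapping.items()}
    print(routing_table)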

    A.3.2 Physical Hardware and Connectivity.

This layer of the stack refers to the physical devices in the cluster and how they are connected. Currently, we have created clusters of FPGAs (the Fidus Sidewinder, Alphadata 7v3, and Pynq ZC702), x86 CPUs, and connectivity using 1G Ethernet, 10G SFP, and 100G QSFP. The only requirement for these devices is some network connection that can be attached to a network switch, and even this requirement could be further abstracted by the hypervisor above. We have tested the same abstraction layers on these different devices to show the consistency of our higher levels of abstraction. Two examples of Galapagos setups can be seen in Figure 9.
    Fig. 9.
    Fig. 9. Two examples of “data centers” where we deployed Galapagos.

    A.3.3 Hypervisor.

This layer of the stack abstracts the physical device to standardize its interfaces so it appears as in Figure 8. Our standard model assumes a control path and a data path. The control path is used for configuring, programming, and monitoring the devices; in our FPGAs, we typically use PCIe for FPGAs that do not have a tightly coupled ARM processor (Alphadata 7v3) and AXI for FPGAs with a tightly coupled ARM processor (Fidus Sidewinder, Pynq ZC702). The data path is used by the application IP cores to communicate off-chip with other nodes in the cluster. For the hypervisor to comply with the rest of the layers of the stack, it must provide an AXI-stream interface. This standardization makes it simple for a user to add their own board to the stack: all they need to provide is an AXI-stream interface that can connect to a network switch. An example FPGA hypervisor is shown in Figure 10.
    Fig. 10.
    Fig. 10. An example Galapagos FPGA Hypervisor.

    A.3.4 Middleware and Communication Layers.

The middleware is responsible for partitioning the kernels onto different FPGAs. This was previously done with a user-specified configuration that provided a hint to our middleware layer about where to place kernels; we now automate this partitioning, as described in Section 4.1.3. Once the placement of kernels is known, the middleware places bridges to allow kernels to communicate off-chip. The hypervisor provides an AXI-stream interface, but off-chip there is no side channel available for the destination field; with the Galapagos router and bridge, AXI-stream packets destined for off-chip have their destination appended as a header. The Galapagos router has a routing table that specifies the location of every kernel in the cluster by destination. The off-chip communication can be carried over various network protocols, handled by the communication layer: depending on the destination FPGA, our network bridge encapsulates the packet with the correct network header. The network bridge is specific to each off-chip communication protocol the user wishes to support; if users wish to run Galapagos on top of their own network protocol, they need to supply a bridge that translates their network packets into AXI-stream packets with a Galapagos header. The formation of the Galapagos router, routing table, and network bridges is fully automated. The IP cores generated by the middleware are shown in Figure 11.
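To illustrate the bridging step in software terms, the sketch below models appending a Galapagos-style destination header and then a network header chosen from the routing table. The header layout, field widths, and routing-table contents are illustrative only, not the actual wire format.

    # Illustrative model of the router/bridge path: a packet whose TDEST side channel
    # names a remote kernel gets a destination header prepended, then a network header.
    ROUTING_TABLE = {4: ("fpga1", "udp"), 7: ("host0", "tcp")}   # kernel id -> (node, proto)

    def bridge_out(payload_flits, tdest):
        node, proto = ROUTING_TABLE[tdest]
        galapagos_header = tdest.to_bytes(2, "big")        # carry dest kernel id in-band
        body = galapagos_header + b"".join(payload_flits)
        network_header = f"{proto}->{node}:".encode()      # stand-in for UDP/TCP framing
        return network_header + body

    packet = bridge_out([b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"], tdest=4)
    print(packet)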
    Fig. 11.
    Fig. 11. Automated Middleware IP cores in Galapagos.

    References

    [1]
    Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, et al. 2016. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE, Los Alamitos, CA, Article 7, 13 pages.
    [2]
    Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, et al. 2018. Serving DNNs in real time at datacenter scale with project brainwave. IEEE Micro 38, 2 (March 2018), 8–20. https://www.microsoft.com/en-us/research/publication/serving-dnns-real-time-datacenter-scale-project-brainwave/.
    [3]
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, et al. 2018. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 1–14. DOI:http://dx.doi.org/10.1109/ISCA.2018.00012
    [4]
    J. Duarte, S. Han, P. Harris, S. Jindariani, E. Kreinar, B. Kreis, J. Ngadiuba, M. Pierini, R. Rivera, N. Tran, and Z. Wu. 2018. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation 13, 7 (July 2018), 305.
    [5]
    Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. 2017. Enabling flexible network FPGA clusters in a heterogeneous cloud data center. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 237–246.
    [6]
    Naif Tarafdar, Nariman Eskandari, Varun Sharma, Charles Lo, and Paul Chow. 2018. Galapagos: A full stack approach to FPGA integration in the cloud. IEEE Micro 38, 6 (2018), 18–24.
    [7]
    Xilinx. n.d. Xilinx Vitis AI. Retrieved April 10, 2021 from https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html.
    [9]
    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
    [10]
    Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. 2002. Torch: A Modular Machine Learning Software Library. Technical Report. Idiap.
    [11]
    Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, NY, 675–678.
    [12]
    TensorFlow. n.d. Distributed Training with TensorFlow. Retrieved April 8, 2021 from https://www.tensorflow.org/guide/distributed_training/.
    [13]
    Google. n.d. Cloud Tensor Processing Units (TPUs). Retrieved April 8, 2021 from https://cloud.google.com/tpu/docs/tpus/.
    [14]
    NVIDIA. 2020. NVIDIA NVLink Fabric: Advanced Multi-GPU Processing. Retrieved November 3, 2021 from https://www.nvidia.com/en-us/data-center/nvlink
    [15]
    NVIDIA. n.d. NVIDIA NCCL. Retrieved April 8, 2021 from https://developer.nvidia.com/nccl/.
    [16]
    NVIDIA. n.d. NVIDIA Completes Acquisition of Mellanox, Creating Major Force Driving Next-Gen Data Centers. Retrieved April 10, 2021 from https://nvidianews.nvidia.com/news/nvidia-completes-acquisition-of-mellanox-creating-major-force-driving-next-gen-data-centers.
    [17]
    Xilinx. 2019. Machine Learning (ML) Suite. Retrieved November 3, 2021 from https://github.com/Xilinx/ml-suite.
    [18]
    Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell, Nitika Shanker, Joseph Chu, et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’18). IEEE, Los Alamitos, CA, 411–4117.
    [19]
    Xilinx. 2020. SDAccel Development Environment. Retrieved November 3, 2021 from https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html.
    [20]
    Intel. 2020. Intel FPGA SDK for OpenCL Software Technology. Retrieved November 3, 2021 from https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html.
    [21]
    Lester Kalms and Diana Göhringer. 2017. Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’17). IEEE, Los Alamitos, CA, 1–4.
    [22]
    E. Nurvitadhi, A. Boutros, P. Budhkar, A. Jafari, D. Kwon, D. Sheffield, A. Prabhakaran, K. Gururaj, P. Appana, and M. Naik. 2019. Scalable low-latency persistent neural machine translation on CPU server with multiple FPGAs. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT’19). 307–310.
    [23]
    GitHub. n.d. Open Programmable Acceleration Engine. Retrieved November 3, 2021 from https://opae.github.io/.
    [24]
    Xilinx. 2018. CHaiDNN. Retrieved November 3, 2021 from https://github.com/Xilinx/CHaiDNN.
    [25]
    Xilinx. 2019. PYNQ DL. Retrieved November 3, 2021 from https://github.com/Xilinx/PYNQ-DL.
    [26]
    Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 65–74.
    [27]
    Dong Wang. 2019. An OpenCL-based FPGA Accelerator for Convolutional Neural Networks. Retrieved November 3, 2021 from https://github.com/doonny/PipeCNN.
    [28]
    HLSLibs. 2019. Open-Source High-Level Synthesis IP Libraries. Retrieved November 3, 2021 from https://hlslibs.org.
    [29]
Lucian Petrica, Tobias Alonso, Mairin Kroes, Nicholas J. Fraser, Sorin Cotofana, and Michaela Blott. 2020. Memory-efficient dataflow inference for deep CNNs on FPGA. CoRR abs/2011.07317 (2020). arXiv:2011.07317. https://arxiv.org/abs/2011.07317.
    [30]
    Mathew Hall and Vaughn Betz. 2020. From TensorFlow graphs to LUTs and wires: Automated sparse and physically aware CNN hardware generation. In Proceedings of the 2020 International Conference on Field-Programmable Technology (ICFPT’20). 56–65. DOI:https://doi.org/10.1109/ICFPT51103.2020.00017
    [31]
François Chollet. 2020. Keras: The Python Deep Learning Library. Retrieved November 3, 2021 from https://keras.io.
    [32]
    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the Conference on Neural Information Processing Systems (NIPS’17).
    [33]
Junjie Bai, Fang Lu, and Ke Zhang. 2019. ONNX: Open Neural Network Exchange. Retrieved November 3, 2021 from https://github.com/onnx/onnx.
    [34]
    Claudionor N. Coelho, Aki Kuusela, Shan Li, Hao Zhuang, Thea Aarrestad, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Adrian Alan Pol, and Sioni Summers. 2020. Automatic deep heterogeneous quantization of deep neural networks for ultra low-area, low-latency inference on the edge at particle colliders. arXiv:2006.10159 (2020). arxiv:physics.ins-det/2006.10159
    [35]
    Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, et al. 2015. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (2015), 1591–1604.
    [36]
    Xilinx. 2020. Vivado Design Suite. Retrieved November 3, 2021 from https://www.xilinx.com/products/design-tools/vivado.html.
    [37]
    F. Fahim, B. Hawks, C. Herwig, J. Hirschauer, S. Jindariani, N. Tran, M. B. Valentin, et al. 2021. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. In Proceedings of the tinyML Research Symposium 2021. arxiv:cs.LG/2103.05579
    [38]
    T. Aarrestad, V. Loncar, N. Ghielmetti, M. Pierini, S. Summers, J. Ngadiuba, C. Petersson, et al. 2021. Fast convolutional neural networks on FPGAs with hls4ml. arXiv:2101.05108 (1 2021). arxiv:cs.LG/2101.05108
    [39]
    A. Heintz, V. Razavimaleki, J. Duarte, G. DeZoort, I. Ojalvo, S. Thais, M. Atkinson, et al. 2020. Accelerated charged particle tracking with graph neural networks on FPGAs. In Proceedings of the 34th Conference on Neural Information Processing Systems. arxiv:physics.ins-det/2012.01563
    [40]
S. Summers, G. Di Guglielmo, J. M. Duarte, P. Harris, D. Hoang, S. Jindariani, E. Kreinar, et al. 2020. Fast inference of boosted decision trees in FPGAs for particle physics. Journal of Instrumentation 15, 5 (2020), P05026. DOI:https://doi.org/10.1088/1748-0221/15/05/P05026. arXiv:physics.comp-ph/2002.02534
    [41]
Y. Iiyama, G. Cerminara, A. Gupta, J. Kieseler, V. Loncar, M. Pierini, S. R. Qasim, et al. 2020. Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics. Frontiers in Big Data 3 (2020), 598927. DOI:https://doi.org/10.3389/fdata.2020.598927. arXiv:physics.ins-det/2008.03601
    [42]
    Michaela Blott, Thomas Preusser, Nicholas Fraser, Giulio Gambardella, Kenneth O’Brien, and Yaman Umuroglu. 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems 11, 3 (2018), Article 16, 23 pages. arxiv:cs.AR/1809.04570
    [43]
    ARM. 2010. AMBA 4 AXI4-Stream Protocol. Retrieved April 19, 2020 from https://static.docs.arm.com/ihi0051/a/IHI0051A_amba4_axi4_stream_v1_0_protocol_spec.pdf.
    [44]
    Marc Snir. 1998. MPI—The Complete Reference: The MPI Core. Vol. 1. MIT Press, Cambridge, MA.
    [45]
    Naif Tarafdar and Paul Chow. 2019. libGalapagos: A software environment for prototyping and creating heterogeneous FPGA and CPU applications. In Proceedings of the 6th International Workshop on FPGAs for Software Programmers. 1–7.
    [46]
    Qianfeng Shen. 2019. GULF-Stream. Retrieved January 13, 2020 from https://github.com/QianfengClarkShen/GULF-Stream.
    [47]
    David Sidler, Gustavo Alonso, Michaela Blott, Kimon Karras, Kees Vissers, and Raymond Carley. 2015. Scalable 10Gbps TCP/IP stack architecture for reconfigurable hardware. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’15). IEEE, Los Alamitos, CA, 36–43.
    [48]
    Intel. 2019. Intel 82599 10 GbE Controller Datasheet. Intel.
    [49]
    Fidus. 2019. Fidus Sidewinder. Retrieved January 13, 2020 from https://fidus.com/products/sidewinder/
    [50]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
    [51]
    Javier Duarte, Philip Harris, Scott Hauck, Burt Holzman, Shih-Chieh Hsu, Sergo Jindariani, Suffian Kha, et al. 2019. FPGA-accelerated machine learning inference as a service for particle physics computing. arXiv preprint arXiv:1904.08986 (2019).
    [52]
    NVIDIA. 2021. NVIDIA Data Center Deep Learning Product Performance. Retrieved June 17, 2021 from https://developer.nvidia.com/deep-learning-performance-training-inference.
    [53]
    Amazon. 2021. Amazon EC2 F1 Instances. Retrieved January 19, 2021 from https://aws.amazon.com/ec2/instance-types/f1/.
    [54]
    Xilinx. 2020. Alveo U200 and u250 Data Center Accelerator Cards Data Sheet. Retrieved January 19, 2021 from https://www.xilinx.com/support/documentation/data_sheets/ds962-u200-u250.pdf.
    [55]
    Yutaro Iiyama, Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar, Maurizio Pierini, Shah Rukh Qasim, et al. 2021. Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics. Frontiers in Big Data 3 (2021), 598927. arxiv:hep-ex/2008.03601
    [56]
    Jennifer Ngadiuba, Vladimir Loncar, Maurizio Pierini, Sioni Summers, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, et al. 2020. Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml. Machine Learning: Science and Technology 2, 1 (Dec. 2020), 015001. DOI:https://doi.org/10.1088/2632-2153/aba042
    [57]
    Xilinx. 2017. Deep Learning with INT8 on Xilinx Devices. Retrieved November 3, 2021 from https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf.
    [58]
    Javier Duarte. 2018. Fast reconstruction and data scouting. In Proceedings of the 4th International Workshop Connecting the Dots 2018. arxiv:hep-ex/1808.00902
    [59]
A. M. Sirunyan, A. Tumasyan, W. Adam, F. Ambrogi, T. Bergauer, M. Dragicevic, J. Ero, et al. 2020. Search for a narrow resonance lighter than 200 GeV decaying to a pair of muons in proton-proton collisions at \(\sqrt{s} = 13\) TeV. Physical Review Letters 124, 13 (2020), 131802. DOI:https://doi.org/10.1103/PhysRevLett.124.131802. arXiv:hep-ex/1912.04776
    [60]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385. http://arxiv.org/abs/1512.03385
    [61]
    J. Duarte, P. Harris, S. Hauck, B. Holzman, S.-C. Hsu, S. Jindariani, S. Khan, et al. 2019. FPGA-accelerated machine learning inference as a service for particle physics computing. Computing and Software for Big Science 3, 1 (2019), 13. DOI:https://doi.org/10.1007/s41781-019-0027-2arxiv:physics.data-an/1904.08986
    [62]
    GitHub. n.d. Xilinx Vitis Network Example. Retrieved November 4, 2021 from https://github.com/Xilinx/xup_vitis_network_example.
    [63]
    GitHub. n.d. Xilinx Vitis with 100G TCP/IP. Retrieved November 4, 2021 from https://github.com/fpgasystems/Vitis_with_100Gbps_TCP-IP.
