-
Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
Authors:
Marius Meyer,
Tobias Kenter,
Lucian Petrica,
Kenneth O'Brien,
Michaela Blott,
Christian Plessl
Abstract:
Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a tr…
▽ More
Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low latency communication.
△ Less
Submitted 7 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
Authors:
Zhenhao He,
Dario Korolija,
Yu Zhu,
Benjamin Ramhorst,
Tristan Laan,
Lucian Petrica,
Michaela Blott,
Gustavo Alonso
Abstract:
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-sour…
▽ More
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Memory-Efficient Dataflow Inference for Deep CNNs on FPGA
Authors:
Lucian Petrica,
Tobias Alonso,
Mairin Kroes,
Nicholas Fraser,
Sorin Cotofana,
Michaela Blott
Abstract:
Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators the shapes of parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implem…
▽ More
Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators the shapes of parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implementation bottleneck. In this work, we propose an accelerator design methodology - Frequency Compensated Memory Packing (FCMP) - which improves the OCM utilization efficiency of dataflow accelerators with minimal reduction in throughput and no modifications to the physical structure of FPGA OCM. To validate our methodology, we apply it to several realizations of medium-sized CIFAR-10 inference accelerators and demonstrate up to 30% reduction in OCM utilization without loss of inference throughput, allowing us to port the accelerators from Xilinx Zynq 7020 to 7012S, reducing application cost. We also implement a custom dataflow FPGA inference accelerator for a quantized ResNet-50 CNN, utilizing on-chip weights, the largest topology ever implemented with this accelerator architecture. We demonstrate that by applying FCMP to the ResNet accelerator, the OCM bottleneck is alleviated which enables the accelerator to be ported from Alveo U250 to the smaller Alveo U280 board with less throughput loss compared to alternative techniques. By providing a finer-grained trade off between throughput and OCM requirements, FCMP increases the flexibility of custom dataflow CNN inference designs on FPGA.
△ Less
Submitted 14 November, 2020;
originally announced November 2020.
-
Evolutionary Bin Packing for Memory-Efficient Dataflow Inference Acceleration on FPGA
Authors:
Mairin Kroes,
Lucian Petrica,
Sorin Cotofana,
Michaela Blott
Abstract:
Convolutional neural network (CNN) dataflow inference accelerators implemented in Field Programmable Gate Arrays (FPGAs) have demonstrated increased energy efficiency and lower latency compared to CNN execution on CPUs or GPUs. However, the complex shapes of CNN parameter memories do not typically map well to FPGA on-chip memories (OCM), which results in poor OCM utilization and ultimately limits…
▽ More
Convolutional neural network (CNN) dataflow inference accelerators implemented in Field Programmable Gate Arrays (FPGAs) have demonstrated increased energy efficiency and lower latency compared to CNN execution on CPUs or GPUs. However, the complex shapes of CNN parameter memories do not typically map well to FPGA on-chip memories (OCM), which results in poor OCM utilization and ultimately limits the size and types of CNNs which can be effectively accelerated on FPGAs. In this work, we present a design methodology that improves the mapping efficiency of CNN parameters to FPGA OCM. We frame the mapping as a bin packing problem and determine that traditional bin packing algorithms are not well suited to solve the problem within FPGA- and CNN-specific constraints. We hybridize genetic algorithms and simulated annealing with traditional bin packing heuristics to create flexible mappers capable of grouping parameter memories such that each group optimally fits FPGA on-chip memories. We evaluate these algorithms on a variety of FPGA inference accelerators. Our hybrid mappers converge to optimal solutions in a matter of seconds for all CNN use-cases, achieve an increase of up to 65% in OCM utilization efficiency for deep CNNs, and are up to 200$\times$ faster than current state-of-the-art simulated annealing approaches.
△ Less
Submitted 24 March, 2020;
originally announced March 2020.