As many programs were originally designed for CPUs, running them on an FPGA often involves rewriting a large portion of the code in HLS, which can take several person-weeks or even person-months. This effort adds to the cost of FPGA design and hinders its wider application in these fields. We observe that the classical 80-20 rule applies to FPGA HLS design as well: rewriting the code that constitutes less than 20% of the runtime takes 80% of the refactoring effort of the whole application. It is imperative to eliminate, or at least reduce, this effort and to let programmers focus on the logic and the performance bottlenecks.
Among the attempts to reduce this effort, source-to-source transformation is promising because it automates common practice, can be FPGA-vendor independent, and has the potential to further parallelize or pipeline the output C/C++ code for performance optimization. In this section, we discuss the common properties of legacy code that prevent direct porting from a CPU to an FPGA. We present one of the source-to-source solutions, HeteroRefactor [120], which performs dynamic analysis of the program and automates the refactoring of legacy code. We further point out the unsolved challenges and potential future directions.
3.3.1 Challenges of Legacy Code.
Given the differences between a CPU and an FPGA, HLS supports only a subset of the C/C++ language, whereas a program originally targeting the CPU usually relies on unsupported features. In an extreme case demonstrated in Figure 16, 33 LOC can yield up to eight errors. Such errors confuse new HLS users and require experts' manual correction. We list the common challenges of porting CPU legacy code to HLS as follows.
Pointer support. There is no unified address space for on-chip memory on an FPGA. Instead, deeper and wider memories are composed from multiple on-chip memory blocks (BRAMs or URAMs) with customized logic and interconnects. As a result, it is challenging to give each memory location an identifier and to provide universal access. Ramanathan et al. [128] statically analyzed pointer usage to reduce the required global connections, but the challenge remains when the application accesses memory dynamically, so pointer support is limited in most HLS tools. As a rule of thumb, only statically analyzable pointers can be synthesized without errors. Consequently, many commonly used pointer features, including nested pointers (line 3), global pointers (line 32), and many pointer operations, such as comparison (line 9), require refactoring into synthesizable equivalents to be accepted by HLS tools, as sketched below.
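The following hypothetical snippet (not the code of Figure 16; all names are illustrative) shows the kinds of pointer constructs that most HLS tools reject:

```cpp
// Constructs that typically fail HLS synthesis because they are not
// statically analyzable on an FPGA's fragmented on-chip memory.
int *g_cursor;                      // global pointer: no unified address space

struct Node {
    int   value;
    Node **children;                // nested pointer: target array unknown
};

bool comes_first(const int *a, const int *b) {
    return a < b;                   // pointer comparison across FPGA memories
}
```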
Dynamic on-chip memory management. Besides the lack of a unified address space and access interconnect, on-chip memory requires customizations, such as port-width specifications, for efficient access. It is therefore challenging to dynamically reallocate the resources of one data type to another with a different access pattern, and HLS tools do not support on-chip dynamic memory management (line 6). Although Giamblanco and Anderson [129] and Xue and Thomas [130] support memory management through data structure templates, adopting them requires significant manual refactoring. Liang et al. [131] proposed a source-to-source transformation, but it requires statically sizing each data type's heap.
Recursion. Similarly, recursive functions (line 10), which need dynamic storage for the execution states in a stack, are not supported by HLS tools; a manual transformation into loops is necessary. Thomas [132] provides a DSL on top of C++ to simplify this transformation.
Bitwidth. CPU programs usually over-provision variable bitwidths to the standard data types (line 29) supported by the processor. Directly porting such programs wastes resources, which can degrade performance and increase energy consumption. Lee et al. [133] and Stephenson et al. [134] performed static analysis to automatically reduce bitwidths.
Polymorphism. HLS does not support virtual functions or function pointers (line 20), because the implemented function modules lack unique identifiers. In addition, because of the aforementioned memory model, it does not handle complex data type reinterpretation (line 17). A common manual workaround is sketched below.
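As a hedged sketch (this workaround is folklore practice, not part of HeteroRefactor), a function pointer can be replaced with an enum tag and a switch, so every callee is statically known and the dispatch synthesizes to a simple multiplexer:

```cpp
// Statically resolvable dispatch: each case becomes a concrete hardware
// module, avoiding the unsynthesizable indirect call through a pointer.
enum Op { OP_ADD, OP_MUL };

int apply(Op op, int a, int b) {
    switch (op) {
    case OP_ADD: return a + b;
    case OP_MUL: return a * b;
    }
    return 0;
}
```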
Exception handling. As with recursion, the lack of a stack of function execution states makes exception handling (line 24) challenging to implement on an FPGA. Although usually implemented as a trivial stack pop in CPU programs, exception handling on an FPGA requires passing a message to the exception-handling module and terminating all intermediate modules.
Multi-threading. Although HLS supports executing multiple functions concurrently, many existing programs employ CPU thread execution models, such as pthreads, to spawn multiple workers. Choi et al. [135] supported the pthreads model in HLS. However, instead of placing pipeline stages in different threads, programs written in this manner often execute multiple stages in the same thread (line 23). In HLS, a pipeline of task stages is often preferred over multiple homogeneous PEs, since each stage can be optimized for its specific task, improving performance and reducing resource usage; the sketch below shows the preferred structure.
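A minimal sketch of the stage-pipelined structure, in Vitis HLS style (the function names are illustrative):

```cpp
#include <hls_stream.h>

// Each stage is a separate function, and the DATAFLOW pragma lets the
// stages run as a task pipeline instead of one pthread worker executing
// both stages back to back.
static void produce(hls::stream<int> &out) {
    for (int i = 0; i < 64; ++i) out.write(i);
}

static void transform(hls::stream<int> &in, hls::stream<int> &out) {
    for (int i = 0; i < 64; ++i) out.write(in.read() * 2);
}

void top(hls::stream<int> &result) {
#pragma HLS dataflow
    hls::stream<int> mid;
    produce(mid);
    transform(mid, result);
}
```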
3.3.2 Automated Refactoring.
To address these challenges, Lau et al. [120] proposed HeteroRefactor, which automatically refactors legacy code for HLS compatibility. They observed that the major obstacle in supporting legacy code is the dynamic behavior of CPU programs. Traditionally, to refactor the code, one needs to understand the algorithm to determine the bounds; for example, the data width of a color pixel is 24 bits, the maximum size of a tree is the input length, and so forth. Using dynamic invariant analysis techniques, they extended tools commonly used in the software engineering community to collect FPGA-specific properties. With this knowledge, HeteroRefactor transforms the program to be synthesizable and selectively offloads it to the FPGA when the invariants are not violated. Figure 17 shows the overall framework, which consists of two parts: dynamic invariants collection, which detects the program properties, and refactoring, which uses those properties to perform FPGA-specific transformations.
Dynamic invariants collection. The HeteroRefactor framework supports two detection approaches, Daikon-based and refactoring-based, which apply to different scenarios and are extensible to a broader range of invariants. By executing the program on typical input data, likely properties are collected, giving the refactoring step the needed understanding of the algorithm.
The Daikon-based detection approach supports dynamic detection of a wide range of invariants by utilizing the state-of-the-art detector Daikon [136]. Although the Daikon framework already supports many properties, it requires extensions for HLS refactoring. In HeteroRefactor, the authors proposed two types of FPGA-specific invariants: (1) the required bitwidth of a variable and (2) the number and type of elements in an array. Beyond these properties, one can further extend HeteroRefactor through Daikon's standard specifications to address other challenges.
Additionally, to reduce the instrumentation overhead and support non-standard invariants, Lau et al. [120] proposed a refactoring-based approach. Although powerful, the Daikon detector needs tens of minutes to instrument a complex program, and its support is limited to a given static program. Some invariants can be collected at lower cost by embedding the instrumentation in the program and letting the compiler optimize it. For example, HeteroRefactor modifies memory allocation/deallocation calls and adds tracing points at the entry and exit of recursive functions; in this way, it obtains the size of dynamically allocated elements and the required depth of the function call stack during normal execution. Moreover, some invariant detections, such as the necessary precision of a floating-point variable, require program modification, which Daikon does not support. For those, HeteroRefactor performs differential execution of two programs, the original and a modified low-precision version, and empirically assesses the differences to determine the minimum precision that keeps the precision loss acceptable.
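A hedged sketch of such embedded instrumentation follows; the identifiers (traced_malloc, g_max_alloc, g_rec_depth) are illustrative, not HeteroRefactor's actual names:

```cpp
#include <cstdlib>
#include <algorithm>

// Allocation calls and recursive entries/exits are rewritten to record
// the peaks observed while running the program on typical inputs.
static std::size_t g_max_alloc = 0;          // largest allocation observed
static int g_rec_depth = 0, g_max_depth = 0; // current/peak recursion depth

void *traced_malloc(std::size_t n) {         // replaces calls to malloc(n)
    g_max_alloc = std::max(g_max_alloc, n);
    return std::malloc(n);
}

int fib(int n) {
    g_max_depth = std::max(g_max_depth, ++g_rec_depth); // entry tracing point
    int r = (n < 2) ? n : fib(n - 1) + fib(n - 2);
    --g_rec_depth;                                      // exit tracing point
    return r;
}
```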
Refactoring. HeteroRefactor addresses four of the aforementioned challenges: pointer support, dynamic memory management, recursion, and bitwidth. Using the collected invariants, Lau et al. [120] perform the transformations with the ROSE compiler [137].
To support pointers and dynamic memory management, HeteroRefactor (1) rewrites malloc/free and (2) converts pointer accesses to array accesses. With the collected sizes of dynamic allocations, HeteroRefactor creates preallocated arrays with optimized access widths per data type and allots memory with a buddy memory system. It then converts pointers into indexes in these arrays, changes pointer operations into index arithmetic, and finally transforms each pointer access into an access of the corresponding array element, as sketched below.
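In this hedged sketch, MAX_NODES stands in for a collected invariant, and a trivial bump allocator stands in for the buddy memory system that HeteroRefactor actually uses:

```cpp
// Pointer-to-index rewrite: pointers become indexes into a preallocated,
// per-type on-chip array sized by the invariants.
const int MAX_NODES = 1024;

struct Node { int value; int next; };  // 'next' holds an index, not a pointer
static Node node_pool[MAX_NODES];      // preallocated per-type array
static int  node_top = 0;

int alloc_node() { return node_top++; }  // replaces malloc(sizeof(Node))

int sum_list(int head) {                 // head == -1 marks the empty list
    int s = 0;
    for (int i = head; i != -1; i = node_pool[i].next)
        s += node_pool[i].value;         // *p becomes node_pool[i]
    return s;
}
```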
To support recursion, HeteroRefactor (1) creates a stack for the local variable context, (2) rewrites the control flow, and (3) modifies the variable accesses. The invariants collected in the detection step determine the stack depth. HeteroRefactor converts the recursion into a while loop, where a function call pushes a new context onto the stack and continues the loop, and a function return pops the stack and restores the execution state; local variable accesses are redirected to the stack.
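A hedged sketch of this rewrite for a recursive tree-node count, where MAX_DEPTH stands in for the stack-depth invariant collected at detection time:

```cpp
// Recursion-to-loop rewrite with an explicit context stack.
const int MAX_DEPTH = 64;

struct Tree { int left[1024]; int right[1024]; };  // child indexes; -1 = none

int count_nodes(const Tree &t, int root) {
    int stack[MAX_DEPTH];    // explicit stack replaces the call stack
    int sp = 0, total = 0;
    stack[sp++] = root;
    while (sp > 0) {         // the while loop replaces recursive control flow
        int n = stack[--sp]; // a pop restores a saved execution state
        if (n == -1) continue;
        ++total;
        stack[sp++] = t.left[n];   // each "recursive call" pushes a context
        stack[sp++] = t.right[n];
    }
    return total;
}
```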
To optimize variable bitwidths, HeteroRefactor (1) modifies the data width, (2) changes the arithmetic operators, and (3) propagates the type changes. It reduces the width of an integer using the ap_uint<k> or ap_int<k> types provided by Vitis HLS, where k is the collected invariant. Utilizing Thomas's templatized soft floating-point type library [138], it likewise optimizes the precision of floating-point variables. It adjusts the operators of each variable to the reduced bitwidth and finally propagates the type changes to the variable's references to keep the program valid.
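As a minimal sketch, suppose the detector observed that a pixel variable never needs more than 24 bits (the function and its logic are illustrative, not from the paper):

```cpp
#include <ap_int.h>

// The 32-bit int type is narrowed to ap_uint<24>, and the change
// propagates to the operands and callers of the function.
ap_uint<24> blend(ap_uint<24> a, ap_uint<24> b) {
    ap_uint<25> sum = ap_uint<25>(a) + b;  // one extra bit avoids overflow
    return ap_uint<24>(sum >> 1);          // average of the two pixels
}
```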
HeteroRefactor further inserts a guard protection system into the refactored kernel to monitor whether the dynamic invariants hold for unforeseen data. When an invariant is violated, execution falls back to the CPU to ensure correctness.
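A hedged sketch of this guard-and-fallback pattern; the names and the specific check are illustrative, not HeteroRefactor's actual implementation:

```cpp
// The host checks that the input satisfies the invariants the refactored
// kernel was specialized for, and otherwise runs the original CPU version.
int kernel_fpga(const int *data, int n);   // refactored, synthesized kernel
int kernel_cpu(const int *data, int n);    // original legacy implementation

bool invariants_hold(int n) {
    return 0 <= n && n <= 1024;            // e.g., the bound that sized the pool
}

int run(const int *data, int n) {
    return invariants_hold(n) ? kernel_fpga(data, n)  // fast path on FPGA
                              : kernel_cpu(data, n);  // safe fallback on CPU
}
```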
Experimental results. Lau et al. [120] evaluated HeteroRefactor on 10 programs originally designed for a CPU, including complex recursive programs such as Aho-Corasick string matching, merge sort, and Strassen's matrix multiplication, and real-world applications such as face detection and 3D rendering from the Rosetta benchmark suite [139] and color conversion from OpenCV. HeteroRefactor ported these programs to an FPGA for on-board execution with no human intervention, saving the effort of a manual rewrite whose result averages 1.05\(\times\) the lines of the legacy code. Furthermore, by optimizing the bitwidth of integers and the precision of floating-point numbers, it reduced BRAM usage by 41% and DSP usage by 50%, respectively.
3.3.3 Future Work.
Although HeteroRefactor addresses many of the challenges, some issues remain unsolved. To name a few, global pointers remain unsupported because of the complexity of access optimizations, and polymorphism support is not implemented. To the best of the authors' knowledge, no existing work supports C++ exception handling in HLS. LegUp supports a multi-threading model [135], but the support is limited to pthreads and requires explicit dataflow pipelining, which is uncommon in CPU programs. For most existing legacy code, synthesizing it directly into hardware in a push-button fashion remains impractical.
With the increasing complexity of applications, more available resources on FPGAs, and the trend of embedding lightweight processors such as ARM cores or AI Engines [140] in the fabric, it may be a good time to revisit the hybrid hardware-software architecture, like the one proposed by Canis et al. [141], in which a MIPS soft processor is combined with the LegUp HLS compiler. By combining an FPGA's customizability with an instruction processor's compatibility, we can expect future HLS to offer greater programmability while producing higher-performance designs, given that the embedded processor can run at a much higher frequency.