
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler

Published: 14 December 2023

Abstract

The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput.
Hardware Description Languages (HDLs) like Verilog have traditionally been used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register-transfer level. Domain Specific Languages (DSLs) and compilers have recently been used to weave together handwritten HDL templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor; they are synthesized and flashed on the FPGA once but are flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image-processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms.
This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kanade optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and the Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920 × 1080 pixels across the implemented benchmarks.

1 Introduction

The smartphone revolution made cameras ubiquitous, ushering in a new era of computational photography. Computational photography uses many digital image processing algorithms, often pipelined, to accomplish tasks such as image construction, compression, stabilization, and the like. For example, the camera pipeline [35] is a typical image processing pipeline deployed in all digital cameras. These pipelines are computationally intensive, requiring, for example, 120 gigaflops/sec to process 1080p/60fps raw video. Furthermore, since these computational pipelines are deployed in power-constrained devices, achieving energy efficiency while maintaining desired performance in terms of both throughput and latency is crucial. Traditionally, this is accomplished in cameras using specialized hardware for image processing tasks. Prior work [16] has shown that such custom ASICs can increase energy efficiency by 500× compared to general-purpose processors, substantially extending the battery life of smartphones and digital cameras. Among other factors, the energy efficiency of ASICs comes from the lack of instruction fetch-and-decode cycles. Further, intermediate data is buffered using local on-chip storage, thereby avoiding external memory accesses.
A domain-specific accelerator (DSA) is a custom processor or set of computing units optimized to perform a narrow range of computations. They are tailored to meet the needs of the algorithms required for their domain. For example, an AI accelerator might have an array of elements, including multiply-accumulate functionality, to efficiently undertake matrix operations. Google’s Tensor Processing Unit (TPU), the Neural Engine in Apple’s M1 processor, and the Xilinx AI Engine are popular ASIC-based DSAs. ASIC-based DSAs provide significant gains in performance and power efficiency. It is well known, however, that custom ASICs have a long lead time to market, are prohibitively expensive in the absence of economies of scale, and cannot keep up with the rapidly increasing scope of user requirements. We can circumvent these obstacles with practical trade-offs in power and performance using FPGAs. At the same time, FPGAs far outperform CPUs and GPUs on the performance-per-watt metric [31]. Therefore, FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs.
FPGAs’ advantages are often overshadowed by the programmability issues associated with Hardware Description Languages (HDLs). High-level Synthesis (HLS) through C-based languages [6, 15, 29], OpenCL/CUDA-based ones [30], Bluespec System Verilog [27], and Chisel [3] are some prominent efforts to improve FPGA programmability. Other model-based frameworks like NI LabVIEW and MATLAB HDL Coder [14] are also available. A comprehensive survey of all these approaches can be found in [4]. Despite these efforts, hardware accelerator design involves a steep learning curve, causing acute programmer-productivity challenges. Further, hardware compilers still fail to generate optimal designs from source specifications. Recent trends to ameliorate these issues include Domain Specific Languages (DSLs) and Domain Specific Accelerators (DSAs). DSLs provide high-level, domain-specific abstractions to improve programmability while making program analysis and transformation simpler for the corresponding DSL compilers, allowing them to deliver performance. Darkroom [17], Rigel [18], Halide [32], Hipacc [19], and PolyMage [9] are some examples of DSLs for expressing complex image-processing pipelines.
When it comes to DSAs, there are broadly two approaches: (1) per-program accelerators and (2) overlay accelerators for a class of programs. Most of the prior work [9, 17, 18, 19, 32] on translating DSLs to DSAs falls under the per-program category, i.e., these compilation frameworks generate fixed hardware for a specific input source program. This approach requires time-consuming design synthesis and FPGA re-flashing to accelerate different algorithms from the domain, which is expensive and perhaps infeasible in many edge and deeply embedded applications.
In contrast to this fixed-function approach, where the DSA gets tied to a specific function, an overlay architecture can process a broad class of computations. The overlay is synthesized and flashed on the FPGA only once, and computations are orchestrated through soft reconfiguration by a runtime system. This is highly beneficial not only for edge applications but also for the cloud. Many companies offer high-performance domain-specific APIs as services on the cloud. The client API requests are scheduled for processing on a cluster of accelerators. The work in [5] provides a critical analysis of the existing academic and commercial efforts to provide FPGA acceleration in datacenters and the cloud. Generic DSAs or overlays facilitate easier scheduling and better resource usage than task-specific accelerators. Also, it is becoming increasingly desirable to have FPGA-based accelerators that enable the edit-compile-run flow familiar to software developers instead of hardware synthesis and place-and-route [37].
We can broadly classify FPGA overlays into two types: instruction-based and non-instruction-based. Instruction-based overlays such as [2, 7, 36, 37, 40] use specialized soft-core processors and often suffer from inefficiencies arising from the fetch-decode-execute model of processor design. In this paper, we propose a non-instruction-based FPGA overlay for image processing pipelines. More specifically, we make the following contributions.
Firstly, we have created a front-end DSL called FlowPix, which is built in Scala and enables the programmer to express complex image processing pipelines using minimal lines of code. Secondly, our proposed overlay architecture is highly adaptable to map different DAG structures and operations, and this mapping is done at a macro level using DAG nodes, rather than at the micro-level through instructions. This overlay architecture can leverage data and pipelined parallelism commonly found in image processing algorithms. Finally, we demonstrate the effectiveness of our framework by processing 15 image processing benchmarks, including simple filters and complex pyramid systems, on a Virtex7-690t FPGA. Our results are comparable to, and in some cases better than, existing DSL-to-FPGA frameworks that generate fixed-function hardware based on a front-end specification.

2 Related Work

Coming up with high-performance hardware using HLS [12, 24] is like using C to develop high-performance software [32]. This limitation is overcome by using domain-level abstractions on top of the high-level language in the form of DSLs with micro-architectural templates, generating hardware blocks relevant to the domain [15, 21, 26]. Aetherling [13] is a DSL that compiles high-level image processing algorithms to hardware with a focus on exploring resource-vs-throughput trade-offs. DSAGEN [39] annotates algorithms with pragmas in a DSL and automatically searches a large architectural design space for a range of algorithms. FLOWER [1] is a comprehensive compiler infrastructure that provides automatic canonical transformations for high-level synthesis from a domain-specific library, allowing programmers to focus on algorithm implementations rather than low-level optimizations for dataflow architectures. AnyHLS [28] is an approach to synthesize FPGA designs in a modular and abstract way; it raises the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions. Researchers have also created image processing DSLs based on the line-buffered pipeline model. Darkroom [17] converted a pipeline of stencil and pointwise computations into custom hardware using a synchronous dataflow model [22]. The resulting pipelines were statically scheduled, supporting single-rate processing at a throughput of one pixel/cycle. Rigel [18], a follow-up to Darkroom, supported multi-rate pipelines, enabling the programmer to specify pipeline stages running at throughputs other than one pixel/cycle. An integer linear programming formulation was used in both frameworks to calculate the line buffer sizes. Rigel used FIFO buffers to handle the variability between two pipeline stages. FlowPix employs a more intuitive DAG formulation and data-relaying techniques to find and optimize line buffer sizes in the pipeline.
Instead of designing a new DSL, other works have augmented pre-existing DSLs targeting CPU and GPU architectures with an FPGA back-end. The Halide HLS framework [32] extended the Halide compiler [33], allowing the user to accelerate their programs on the FPGA. The system provided high-level semantics to explore different hardware mappings of applications, including the flexibility of changing the throughput rate of individual stages. The PolyMageHLS compiler [10] extended the PolyMage [25] DSL to target FPGAs. It exploited coarse-grain parallelism by creating multiple copies of the pipeline to process image tiles in parallel. PolyMage and Halide emit HLS-C and feed that output to Vivado HLS to generate the hardware. O. Reiche et al. [34, 41] explored image processing with Hipacc, a source-to-source compiler that produces highly efficient hardware accelerators. By utilizing spatial and architectural information, Hipacc can generate tailored implementations that compete with handwritten source code, without requiring hardware-architecture expertise. The DSL’s capabilities enable the definition of algorithms for a wide range of target platforms and also allow for the exploitation of target-specific forms of parallelism.
In contrast to DSL-based frameworks, FPGA-based soft-core processors have also been proposed to accelerate domain computations. For example, the work in [20] proposed GraphSoC, a custom soft processor for accelerating graph algorithms. However, the main challenge in designing such processors is to ease programmability while at the same time providing performance. The work in [38] proposed an Image Processing Processor (IPPro) that is tailored to accelerate image processing operations. The IPPro is a 16-bit signed fixed-point, five-stage balanced pipelined RISC architecture that exploits the DSP48E1 features of a typical FPGA. There is a pressing need for more innovative domain-specific processor designs that break away from the conventional 5-stage RISC design. Overlay processors, which we propose in this paper, are a step in this direction. The essence of overlay designs lies in their capacity to eliminate the need for traditional instructions, allowing direct hardware control through control words. These control words empower the hardware to continuously process new data and execute the same functions repeatedly until reset with fresh control words. This stands in stark contrast to instruction-based processors, which require both new instructions and new data, even for repetitive tasks. Moreover, in certain cases, it proves challenging to map the inherent parallelism of a domain onto a typical processor pipeline. Overlays, however, can be purposefully designed with this parallelism in mind from the ground up. Although overlay processors have been proposed in various domains, including AI, there is a noticeable lack of research on their application in the image processing domain. In this paper, we aim to address this gap.

3 FlowPix Overview

FlowPix comprises a DSL front-end, a compiler, and a back-end hardware overlay. The DSL is embedded in the Scala language. A FlowPix DSL program can be run as a regular Scala program that executes on the underlying CPU. This feature helps the programmer during the development cycle: the CPU execution mode can be used for debugging without incurring expensive hardware execution. The FlowPix compiler constructs a Directed Acyclic Graph (DAG) from the source DSL program. It maps the DAG to the processing engines in the overlay and generates a control word sequence accordingly. The overlay is configured to process the DAG computations through these control words. A C++ driver streams the control words followed by the image frames to the overlay. The overlay continues to perform the same computations until it is reconfigured by the driver using a different control word sequence. Figure 1 summarises the compilation and execution flow of a FlowPix DSL program.
Fig. 1. The figure above portrays the overall flow of program compilation and execution in FlowPix. The compiler takes in DSL code and produces a sequence of control words. The directed acyclic graph (DAG) is divided into smaller clusters, each of which gets mapped to a compute unit (CU). The driver configures the overlay using the generated control words and streams the input image. The image is segmented into smaller strips that are processed simultaneously across the available CUs. The input image is streamed only once and the partial data created from cluster executions are internally buffered by the overlay.
The FlowPix overlay is designed as a collection of Compute units (CUs). Each CU is further made up of an array of Processing Engines (PE) connected in a pipeline via FIFO interfaces. Each PE, in turn, has a fully pipelined design. Thus, both intra-PE and inter-PE pipelined parallelism are exploited within a CU. Figure 2 shows how different types of parallelisms are exploited by the compute hierarchy of the FlowPix overlay. The DAG is broken into smaller compute blocks called clusters before processing. Each DAG cluster is scheduled for execution on a single CU. The topologically sorted DAG clusters are mapped to PEs, facilitating streaming dataflow computation within a PE and across PEs in the same CU. If the available PEs are less than the number of clusters, we time multiplex by reconfiguring the PE. The generated intermediate data is internally buffered with no necessity for off-chip DDR memory accesses. Newer clusters are configured to run on the overlay through soft reconfiguration to process this intermediate data to generate the final processed image. The reconfiguration is inexpensive since we can reset the PE control words in minimal cycles at runtime. We provide a more detailed discussion of this in Section 4.
Fig. 2. The diagram illustrates the various types of parallelism exploited during the execution of an image processing algorithm, denoted as Algorithm A, on the FlowPix overlay. Algorithm A is divided into logical compute blocks, each consisting of multiple computing steps. The structure of Algorithm A aligns with the compute hierarchy of the overlay. This hierarchy comprises CUs composed of Processing Engines (PEs), and PEs in turn consist of Processing Units (PUs). At a higher, macro level, CUs and PEs exploit data and pipelined parallelism, while the PU network within a PE harnesses these forms of parallelism at a finer, micro level.
We use Bluespec System Verilog (BSV) to implement our overlay. The overlay has a parametric design. This is enabled by features of BSV that include parametric modules and interfaces, which permit writing and maintaining a single, compact source code to encompass a wide choice of architectures. The number of CUs and the size of the PE array inside each CU can be configured at hardware synthesis time depending on the FPGA and the workload characteristics. In the presence of more CUs, the same computations are instantiated across all the CUs. Based on the number of CUs, the image is vertically divided into an equal number of strips. The size of each strip is fixed based on the size of the overlay’s internal memory. Thus, the available data parallelism is exploited by processing the strips in parallel over the CUs. A single strip is processed over the PE array inside a CU. The overlay has a dedicated control memory for storing the control words. This memory is distributed across the PEs. A distribution logic routes the incoming control words to their correct location inside the control memory. The processed image from the overlay is streamed back to the driver, from where it is passed upwards to the application logic. The parallel CUs can also be used to process different workloads. In other words, the FlowPix framework can accelerate two or more similar image processing workloads at the same time. It is possible to add new DSL front-ends to the FlowPix framework. The new front end must generate the correct control word sequence, and the rest of the infrastructure remains the same.

4 FlowPix Language

This section provides a brief background on the image processing pipelines and then shows how they can be expressed using our FlowPix DSL.

4.1 Image Processing Pipelines

An image processing pipeline transforms an input image into an output image by passing it through a sequence of stages connected in directed acyclic graph (DAG) form. Each node in the DAG denotes a stage in the pipeline that performs computations on input image pixels and generates output frame pixels. An edge between two DAG nodes represents a producer-consumer relationship between stages. Each node in the DAG can be either a stencil or a point-wise stage computation. A stencil stage has only one predecessor stage. It applies a stencil operation on the intermediate input image streamed by the predecessor stage along the edge. Usually, a stencil operation refers to convolution on a small pixel window. However, generally, it can be any filter operation, such as max filter, used for downsampling images. A point-wise stage can have multiple predecessor stages. Each output pixel is computed using the corresponding input pixels from the intermediate input images streamed along incoming edges. The stages in the DAG can be executed in parallel, subject to dependency constraints and data availability. This leads to a streaming data flow model of computation suitable for realization in FPGAs.
Pointwise stages do not require buffering because the input they need, a single pixel, is exactly what the prior stage produces. In contrast, stencil operations need multiple input pixels from the previous stage. For example, with a 3 × 3 Gaussian operator, the blurring stage in the unsharp mask pipeline requires 3 pixels in both the vertical and horizontal directions of the input image. The input pixels arrive in scan-line order. Therefore, the stage has to buffer at least two rows of the input frame before performing the computation. This is referred to as line buffering, as it buffers lines of the input image. The presence of line buffers between the stages ensures that intermediate values are fed from a producer to a consumer without any external memory intervention, which is a considerable energy saving.
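The behaviour of a line buffer can be sketched in a few lines of software. The following Scala model is a minimal illustration under stated assumptions, not the FlowPix hardware line buffer: it holds k−1 rows of a fixed-width image arriving in scan-line order and emits a k × k window for each new pixel once enough rows are available.

import scala.collection.mutable

// Minimal line-buffer sketch: buffers k-1 previous rows of a width-`width`
// image received in scan-line order and emits a k x k window once enough
// pixels have arrived. Illustrative only.
class LineBufferModel(k: Int, width: Int) {
  private val rows    = mutable.Queue[Array[Int]]()   // last k-1 complete rows
  private var current = mutable.ArrayBuffer[Int]()    // row currently being filled

  // Push one pixel; returns Some(window) when a full k x k window exists.
  def push(pixel: Int): Option[Array[Array[Int]]] = {
    current += pixel
    val col = current.length - 1
    val window =
      if (rows.length == k - 1 && col >= k - 1) {
        val lines = rows.toArray :+ current.toArray
        Some(lines.map(_.slice(col - k + 1, col + 1)))
      } else None
    if (current.length == width) {                    // end of a scan line
      rows.enqueue(current.toArray)
      if (rows.length == k) rows.dequeue()            // keep only k-1 rows
      current = mutable.ArrayBuffer[Int]()
    }
    window
  }
}

For a 3 × 3 stencil, windows start appearing once two full rows and three pixels of the third row have arrived, which matches the two-row buffering requirement described above.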

4.2 Front-end DSL

Our DSL is embedded in the Scala Language. DSL programmers can use all Scala features with embedded DSL constructs for specifying image processing pipelines. All the input and intermediate images in the computational pipeline are stored as Scala objects of class type Stage. Point-wise computations on images are handled through operator overloading. Filter operations on images corresponding to stencil stages are obtained by instantiating the Filter class with suitable parameters. The currently supported filter operations in the DSL are convolution, max, min, sum and average. It is possible to support more filter operations in our DSL compiler infrastructure easily. We relieve the programmer from specifying the image dimensions at each computation step. Instead, our compiler automatically infers the dimensions through DAG analysis.
A standard image processing algorithm, Unsharp Mask (USM), used as a running example in this section, is a commonly used image sharpening technique in digital image processing systems. In a signal processing context, we can visualize it as a filter that amplifies the high-frequency components of an input signal. The algorithm is expressed as a composition of three operations. First, a blur operation performs a Gaussian blur of the original input image along the y and x directions. The second operation, sharpen, computes a weighted combination of the original image and the blurred result and outputs a sharpened image. The final stage, mask, chooses, for each output pixel, between the original and the sharpened image. A sharpened pixel is chosen when the difference between the original and its blurred version crosses a threshold value.
Listing 1. FlowPix code for Unsharp Mask pipeline.
Figure 3 shows the computational DAG associated with the Unsharp Mask (USM) benchmark. Stencil and point-wise stages are denoted by diamond and circle symbols, respectively. The computations associated with each stage are given in Table 1. Listing 1 shows the FlowPix code for the USM benchmark. The input Stage object img (line 1) is instantiated by passing a valid image path to the Stage class. The other stage objects in the program, blur (line 7), sharp (line 9), thresh (line 10), and mask (line 11), correspond to different stages in the DAG. The intermediate image at the blur stage is obtained by applying the filter kernel on the image img (line 7) using the ** operator. kernel is a convolution filter obtained by instantiating the Filter class with a suitable stencil matrix as a parameter (line 3). The first two parameters in a filter instantiation correspond to the filter dimensions and filter strides, respectively. The filter dimension is a tuple of two integers that specifies the filter’s number of rows and columns. The filter stride is a tuple of two integers corresponding to the row and column strides by which the filter moves over the image. The last parameter specifies the weights of the convolution operation. Point-wise operations on lines 9, 10, and 11 are written as expressions over image objects. The output Stage object mask is stored as an image by passing an image path to the store_image method provided by the Stage class.
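Since the body of Listing 1 is not reproduced here, the following is a hedged Scala sketch of the USM pipeline reconstructed from the description above. Stage, Filter, **, and store_image are FlowPix DSL constructs assumed to be in scope, the paths are placeholders, the ternary-select helper is hypothetical, and the exact syntax and line numbering of the original listing may differ.

// Hedged reconstruction of Listing 1 (Unsharp Mask), not the original code.
val img    = new Stage("images/input.png")                       // input image
val kernel = new Filter((3, 3), (1, 1),                          // dims, strides
                 Array(1, 2, 1, 2, 4, 2, 1, 2, 1).map(_ / 16f))  // Gaussian weights
val blur   = img ** kernel                                       // stencil stage
val sharp  = img * 0.8f - blur                                   // point-wise stages
val thresh = img * 0.2f - blur
val mask   = (thresh >= 0f).select(img, sharp)                   // hypothetical ternary select
mask.store_image("images/output.png")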
Stage | Computation
I | \(Input\ Image\)
blur | \(\frac{1}{16}\left[ \begin{smallmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1\end{smallmatrix} \right]\)
sharp | \(0.8\,I\left(i,j\right) - blur\left(i,j\right)\)
thres | \(0.2\,I\left(i,j\right) - blur\left(i,j\right)\)
mask | \(thres\left(i,j\right) \ge 0 \ ?\ I\left(i,j\right) : sharp\left(i,j\right)\)
Table 1. Computations
Fig. 3. DAG of the Unsharp Mask pipeline.
Figure 4 and Table 2 show the DAG and the associated computations for the 2-level Gaussian pyramid benchmark. This benchmark performs a downsampling operation using the max filter. The downsampling stage is represented using an inverted pyramid symbol in the DAG. Listing 2 shows the corresponding FlowPix code.
Stage | Computation
I | \(Input\ Image\)
\(blur_0\) | \(\frac{1}{16}\left[ \begin{smallmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1\end{smallmatrix} \right]\)
\(down_0\) | \(\frac{I}{4}\)
\(blur_1\) | \(\frac{1}{16}\left[ \begin{smallmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1\end{smallmatrix} \right]\)
\(down_1\) | \(\frac{I}{16}\)
Table 2. Computations
Fig. 4. DAG of the 2-level Gaussian pyramid.
Listing 2. Code for 2-level Pyramid.
The image is first convolved using a \(3\times 3\) convolution filter kernel (Line 3) and then passed to a \(2 \times 2\) max filter down (Line 2). The max filter outputs the pixel with the highest intensity within the pixel window. The window moves with a row and column stride of 2, shrinking the image by a factor of 4. In Line 2, the first tuple denotes the filter dimensions, and the second tuple denotes the row and column strides. The convolution and max-pooling stages are repeated to generate the final output image. Iterative computations are expressed using Scala’s loop construct. The above computation sequence of convolving and down-sampling the image is put inside a loop such that the output image from one iteration acts as the input image for the next. With a loop iteration count of two, the image is iteratively down-sampled to \(\frac{1}{16}^{th}\) of its original size.
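A hedged Scala sketch of Listing 2 along the lines described above follows. Stage and Filter are FlowPix DSL constructs assumed to be in scope, the way the max operation is selected on a Filter is an assumption, the paths are placeholders, and the exact syntax of the original listing may differ.

// Hedged reconstruction of Listing 2 (2-level Gaussian pyramid), not the original code.
var image  = new Stage("images/input.png")
val down   = new Filter((2, 2), (2, 2), "max")      // 2x2 max filter, stride 2 (op selection assumed)
val kernel = new Filter((3, 3), (1, 1),
                 Array(1, 2, 1, 2, 4, 2, 1, 2, 1).map(_ / 16f))
for (_ <- 0 until 2) {                              // two pyramid levels
  val blurred = image ** kernel                     // 3x3 Gaussian blur
  image = blurred ** down                           // downsample by 2 in each dimension
}
image.store_image("images/pyramid.png")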

5 FlowPix Overlay Architecture

The structure of the overlay consists of a group of Compute Units (CUs), as shown in Figure 5. To transfer control and data words from the host CPU to the overlay, a shared control and data-bus is utilized. The host input is initially stored in FPGA DDR memory, then streamed to the overlay via an AXI DMA MM2S (Memory Mapped to Stream) channel. The CUs are made up of processing engines (PEs) that are linked in a pipeline, with data exchange occurring through a streaming FIFO interface. The number and size of CUs are pre-determined during synthesis based on FPGA resources. Local control registers that form part of the overlay’s overall control memory dictate the runtime behavior of a CU. A limited number of BRAM memory banks are utilized to buffer the output from a CU locally. The output is then transmitted back to the CPU through an AXI DMA S2MM (Stream to Memory Mapped) channel. Depending on the computation schedule, the stored output can also be looped back to the CU.
Fig. 5. The FlowPix overlay is a collection of compute units (CUs). The CUs operate independently and receive control and data input from a common control and data bus.

5.1 Processing Engine

The overlay’s primary execution unit is the Processing Engine (PE), see Figure 6, which comprises processing units (PUs) that carry out computations. Each PU accepts three inputs and has two scalar units that can perform simple operations such as addition, subtraction, multiplication, and comparison. Both scalar units can be used in a cascaded manner when utilizing the three input values. The PU also contains registers that store control words and constants. Apart from the computing PUs, there are additional relaying PUs positioned between the computing PUs. These relaying PUs are responsible for forwarding data without performing any processing. We will delve into the specific functions of these relay PUs at a later point. The PE has a parametric design, and its shape and size are set during synthesis by two parameters: \(P_x\) and \(P_y\). The PE architecture is divided into four pipelined stages: one input-buffering stage and three compute stages. The compute stages consist of pipelined sequences of PU arrays. The number of PU arrays and their interconnections vary across the stages.
Fig. 6. Architecture of a Processing Engine (PE) inside the FlowPix overlay. The PE control unit is part of the overall control memory of the overlay. The relaying PUs are shown in red inside the figure.
Input to the PE is in the form of a vector. The output from the PE is streamed to the next PE inside the CU. The output from the final PE in the PE sequence is stored in the CU memory banks. The first compute stage performs stencil computations and is designed as a multiply-and-accumulator unit. The PUs in this stage are arranged in a tree structure with the base of the tree having \(4 \times P_x\) PUs. The next stage computes pointwise operations and consists of \(P_y\) arrays with \(P_x\) PUs in each array. The final stage has a single array with \(P_x\) PUs and processes upsample or downsample operations. This stage can perform a maximum of \(P_x\) downsample operations or a single upsample operation. The input to the PE can be new data from the host or partial data stored in the CU memory banks from previous executions. Input buffering is done using an array of line buffers, which are storage structures built using the FPGA BRAM that buffer a minimum number of rows from the input image to produce square windows. The line buffer design used in FlowPix is from [8] and can be dynamically programmed to generate any window size moving with a stride (the default stride is set at 1) over the input. The ordering of the compute stages inside a PE aligns with the compute pattern of the benchmarks analyzed. To illustrate, the majority of these benchmarks apply stencils to an input image, then perform pointwise operations to combine the stencil outputs. This is followed by an up-sampling or down-sampling operation on the intermediate output image. When the stage ordering differs, it results in under-utilization of resources, as certain stages must be skipped during the benchmark mapping process. The three compute stages of the PE are discussed in more detail below.
Stencil Stage:- The stencil operation generally involves a computation that multiplies and accumulates values. The processing units (PUs) within the stencil stage are arranged in a reduction tree-like pattern. This stage comprises \(m+1\) PU arrays, where \(m = \log_2{(4 \times P_x)}\). Each PU within array i reads two input values from the adjacent PUs in array \(i-1\). The first PU array performs the multiplication, while the subsequent arrays handle the accumulation step. A binary reduction tree is efficient for computing stencil windows whose size is a power of 2, but such window sizes are uncommon in image processing algorithms. Typically, window sizes are \(3 \times 3\), \(5 \times 5\), and so on. To address such sizes, the window can be zero-padded to make its size a power of 2, but this approach under-utilizes the PUs. Optimal throughput is achieved when this stage processes the maximum possible number of stencils in parallel. Therefore, we use a unique data layout that rearranges the line buffer windows across the first PU array. This layout is as follows.
The proposed approach is to break each stencil window into smaller partitions whose sizes are powers of 2. For instance, a \(3 \times 3\) window is partitioned into two smaller partitions, with sizes 8 and 1, respectively. In general, a window of size \(k^2\) is broken down, following the binary representation of \(k^2\), into at most w partitions, denoted as \(p_1\) through \(p_{w}\), where \(w=\lfloor {\log _2(k^2)}\rfloor + 1\). In the new data layout, the same-sized partitions from all windows are positioned next to each other. These partition groups are ordered from left to right, based on decreasing size. The mapper module located within the stencil stage has access to all windows created by the line buffers. It is a multiplexer array that maps a value from a line buffer window to an input port of a PU. The multiplexers are configured to implement this data layout.
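The partitioning follows the binary decomposition of the window size. A minimal sketch of the rule in plain Scala (an illustration, not the compiler’s actual code):

// Power-of-2 partition sizes of a stencil window, largest first.
// E.g. a 3x3 window (size 9) -> List(8, 1); a 1x7 window (size 7) -> List(4, 2, 1).
def partitionSizes(windowSize: Int): List[Int] =
  (30 to 0 by -1).collect {
    case bit if (windowSize & (1 << bit)) != 0 => 1 << bit
  }.toList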
In order to illustrate the data layout and operation of the stencil stage, we provide an example in Figure 7. In part (a), the stencil stage processes two \(1 \times 3\) stencil windows. Each window is first partitioned into two smaller partitions of sizes 2 and 1, respectively. The partitions are then rearranged using the data layout as [2,2,1,1] over the first 6 processing units (PUs), with the filter coefficients stored in the PU registers. Similarly, in part (b), the stencil stage processes a single \(1 \times 7\) window which is partitioned into three smaller partitions of sizes 4, 2, and 1, respectively, and arranged over the first 7 PUs. In part (a) of Figure 7, the 6 partial products produced by the first PU array are accumulated by the rest of the PU arrays. This is achieved by first reducing the partitions corresponding to a window to single values inside the tree. The reduced value \(r^{m}_{i}\) corresponds to a partition \(p^{m}_{i}\) of size m, belongs to window i, and is generated at the \(\log {m}^{th}\) array. For example, in part (a) of Figure 7, the size-2 partitions \(p_{1}^2\), \(p_{2}^2\) are reduced to the single values \(r_{1}^2\), \(r_{2}^2\) by the second (\(\log {2}=1\)) PU array. All the reduced values corresponding to a window must be combined into a single value, which is not possible using the level-to-level PU interconnections since the reduced values lie at different PU arrays. Therefore, the reduced values are aligned and forwarded using the shifter and the vector register between the arrays. The vector register at level i stores the output of the \({i-1}^{th}\) PU array. The reduced values \(r_{1}^1\) and \(r_{2}^1\) of size 1 are stored at positions 5 and 6 inside the first vector register, respectively. These values need to be added with \(r_{1}^2\), \(r_{2}^2\), produced by the first and second PUs of the second PU array. Since \(r_{1}^1\) and \(r_{2}^1\) are produced early, they are moved to positions 1 and 2 by shifting the first vector register by 4 units. Subsequently, these two values are propagated. The first and second PUs add the forwarded values to the inputs received from their predecessor PUs in the tree structure to produce the final output. The stencil stage also supports other reduction operators, such as max or min over a window. When using max or min operators over a stencil window, the stencil coefficients are set to 1 and the PUs are configured to perform the reduction using the maximum or minimum function instead of multiplication and accumulation. This is because the max and min operators do not require multiplication with filter coefficients, but rather involve comparing values to find the maximum or minimum. Therefore, the PUs are set up to accumulate the partial results using the max or min function, depending on the desired operation.
Fig. 7. The figure shows two dynamic configurations of the PE. In part (a) the PE executes two 1 × 3 stencils. In part (b) the PE executes a single 1 × 7 stencil.
Pointwise Stage:- The pointwise stage can obtain input from either the stencil stage or directly from the input vector, bypassing the stencil stage. The PU arrays within this stage are all of equal length, and data exchange occurs through an all-to-all routing network between the arrays. This network consists of multiplexers that connect the output of a PU at level i to the inputs of the PUs at level \(i+1\). The inter-level multiplexers are configured by control words generated by the host, which also determine the operation of the scalar units within each PU. In cascaded mode, the PU utilizes both scalar units and all three input values. The first scalar unit processes the first two inputs, while the second scalar unit operates on the output of the first unit and the third input value. A ternary operator of the form \(E_1 \odot E_2? E_3: E_4\) is treated as a pointwise operation. Here \(E_1\) through \(E_4\) are pointwise expressions, and \(\odot\) is a relational operator. This operation is also processed by the pointwise stage as follows. Assume the expressions \(E_1\) through \(E_4\) have already been processed. Two PUs execute the operations \(R_1 = E_1 \odot E_2\) and \(R_2 = \lnot (E_1 \odot E_2)\). The value of \(R_1\) and \(R_2\) is either 1 or 0. Following this, two more PUs at the next level compute \(S_1 = R_1 \times E_3\) and \(S_2 = R_2 \times E_4\). The value of \(S_1\) is either \(E_3\) or 0, and the value of \(S_2\) is either \(E_4\) or 0. Finally, \(S_1\) is added to \(S_2\) to get the final value. In summary, after computing the four pointwise expressions inside the ternary operator, the pointwise stage utilizes 5 PUs spread across 3 arrays to produce the final output.
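The decomposition can be checked with ordinary arithmetic. The sketch below mirrors the five-PU lowering for a sample relational operator; it is plain Scala used as an illustration, not hardware code.

// Pointwise lowering of  (e1 > e2) ? e3 : e4  onto 5 PUs across 3 arrays:
// R1/R2 compute the predicate and its negation as 0/1 values, S1/S2 mask
// the two branches, and a final PU adds the masked values.
def ternarySelect(e1: Int, e2: Int, e3: Int, e4: Int): Int = {
  val r1 = if (e1 > e2) 1 else 0   // PU at level i:   R1 = (E1 > E2)
  val r2 = 1 - r1                  // PU at level i:   R2 = not(E1 > E2)
  val s1 = r1 * e3                 // PU at level i+1: S1 = R1 * E3
  val s2 = r2 * e4                 // PU at level i+1: S2 = R2 * E4
  s1 + s2                          // PU at level i+2: S1 + S2
}
// ternarySelect(5, 3, 10, 20) == 10;  ternarySelect(1, 3, 10, 20) == 20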
Upsample-Downsample Stage:- In this stage, multiple images can be simultaneously downsampled, or a single image can be upsampled by a factor of two. In the case of downsampling, a single Processing Unit (PU) is responsible for the operation. To perform the downsampling, a scalar unit inside the PU is configured, and a PU register is set up with the row length of the input image that needs to be downsampled. As the input data is received, the scalar unit increments a counter to keep track of the row and column index of the image. The scalar unit then outputs data from every other row and column, effectively halving the image size in each dimension.
The upsample operation is processed by multiple PUs along with the upsampler module. If the image to be upsampled is produced earlier, it is forwarded to this stage through intermediate PUs in the pointwise stage. The image is upsampled by generating a \(2\times 2\) matrix for every input pixel received. More precisely, for every pixel \(w_{r,c}\) belonging to the \(r^{th}\) row and \(c^{th}\) column of the input image, a matrix \(W =[0,0,0,w_{r,c}]\) is generated. A total of four PUs generate W. The first 3 PUs are configured to produce a 0 in the output, and the fourth PU forwards the received input \(w_{r,c}\). The matrix W is read by the upsampler module, which stores it across four internal memory banks \(U_{0}\) through \(U_3\) in a striped fashion. Note that these banks are separate from the CU memory banks and are exclusive to the upsampler module. The data in the four banks is collated and stored sequentially as a single upscaled image inside one of the CU memory banks by the upsampler module. This is done by emptying the banks \(U_0\) through \(U_3\) in an interleaved fashion. \(U_0\) contains data from rows \(0, 2, 4, 6,\ldots\) and columns \(0, 2, 4, 6,\ldots\) of the upscaled image. \(U_1\) contains data from the same row indices but odd columns. \(U_2\) and \(U_3\) contain odd rows with even and odd columns, respectively. The drain sequence starts by reading a single value from \(U_0\) followed by one from \(U_1\), alternating between both until row 0 is filled in B; here, B is a CU memory bank. Then the same alternating sequence is repeated with \(U_2\) and \(U_3\) until row 1 is filled in B. At this point, the upsampler module switches back to the first two banks. This interleaved sequence is repeated until all the rows of the upscaled image are buffered in B, marking the end of the upsample operation.
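The interleaved drain order can be expressed compactly. The sketch below follows the bank layout described above; the flat row-major indexing inside each bank is an assumption of this illustration.

// Reassemble the 2x-upscaled image from the four upsampler banks:
// U0 = (even row, even col), U1 = (even row, odd col),
// U2 = (odd row, even col),  U3 = (odd row, odd col).
def drainUpsampler(u0: IndexedSeq[Int], u1: IndexedSeq[Int],
                   u2: IndexedSeq[Int], u3: IndexedSeq[Int],
                   outRows: Int, outCols: Int): Array[Array[Int]] = {
  val out = Array.ofDim[Int](outRows, outCols)
  for (r <- 0 until outRows; c <- 0 until outCols) {
    val idx = (r / 2) * (outCols / 2) + (c / 2)      // position within a bank
    out(r)(c) = (r % 2, c % 2) match {
      case (0, 0) => u0(idx)
      case (0, 1) => u1(idx)
      case (1, 0) => u2(idx)
      case _      => u3(idx)
    }
  }
  out
}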

5.2 Control Memory

The dynamic configuration of the overlay is stored in the control memory as control words, which can be seen in part (b) of Figure 8. The control memory is distributed across all modules of a CU, with each module having its own set of dedicated registers to store the control words for its runtime functioning. This design ensures that no communication bottleneck is created when the modules read their respective control words. The control words are 32 bits wide (a 16-bit index and a 16-bit value) and are transmitted over a hierarchical tree-structured control bus. The 16-bit index is used to route the value through the hierarchical tree structure of the control bus. For example, the top levels of the tree route the control word to the correct CU, the next levels route it to a PE, and the lower levels route it to a module inside the PE.
Fig. 8. The figure displays how the data and control memory are organized within the overlay. The data memory is specific to a CU and is structured as a series of 128 KB memory banks. On the other hand, the control memory is built using registers and is distributed across all the overlay modules.
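A control word can be modelled as a routing index packed with a value. The minimal sketch below uses the 16/16-bit split described above; the exact placement of the fields within the word is an assumption.

// 32-bit control word: the upper 16 bits route the word down the control-bus
// tree (CU -> PE -> module), the lower 16 bits carry the value to be written.
object ControlWord {
  def pack(index: Int, value: Int): Int = {
    require(index >= 0 && index <= 0xFFFF, "index must fit in 16 bits")
    require(value >= 0 && value <= 0xFFFF, "value must fit in 16 bits")
    (index << 16) | value
  }
  def index(word: Int): Int = (word >>> 16) & 0xFFFF
  def value(word: Int): Int = word & 0xFFFF
}
// ControlWord.index(ControlWord.pack(42, 7)) == 42
// ControlWord.value(ControlWord.pack(42, 7)) == 7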
Before every computation phase, a control phase is initiated to set the control words. Minimizing the time spent setting the control words is essential to reduce stall cycles. The host sets a bit-vector register E before the control phase begins. Each overlay module that depends on control words has a corresponding bit representation in the vector. Only modules with a set bit in E are reconfigured during the control phase, while others continue to execute the same functionality as defined by their previous control word setting.

5.3 Data Memory

The output of a CU is stored in the data memory, as shown in part (a) of Figure 8. An input data bus transfers the output data stream from a PE to a set of 128 KB memory banks. Each bank has its own control unit that generates the read-write signals and can be configured to store data from any index of the vector data stream. The input vector \(D[0:n]\) is relayed from bank to bank. When the control unit activates the store signal at a bank, the bank reads the \(D[i]\) data value from the input vector and stores it in memory. The index value i is stored in the control memory of the bank. Banks that do not have their store signal activated forward the input vector to the following bank in the sequence through the data bus. When the write signal is activated, a bank transfers its output onto the output data bus. The output data bus is parallel and can be accessed by all the banks simultaneously. The output from all the banks is concatenated into a vector that can be sent back to the host CPU or looped back to the CU, depending on the behaviour determined by the CU control unit. The looped-back data serves as the input for future CU computations.

6 FlowPix Compiler

The FlowPix compiler generates a control word sequence from the DSL code. Generating these control words is similar to the code generation step in traditional compilers. The compiler can be broadly divided into three phases, as shown in Figure 1. The first phase, the DAG builder, processes the DSL code to create a DAG representation, where nodes represent computations and edges represent producer-consumer relationships. This DAG is stored internally and used by the other two phases. The compiler then performs a range inference pass over the DAG to calculate the image dimensions at each node, which is necessary to generate the proper control words for each node and ensure correct hardware execution. The mapper is the second phase of the FlowPix compiler; it establishes a mapping between the DAG nodes and the compute units. The mapper has two sub-phases: clustering and latency matching. In the clustering phase, the DAG is broken into smaller sub-graphs, called clusters, each representing the maximum number of DAG nodes that a single CU can execute at once, subject to certain constraints. The formed clusters are then adjusted in the latency matching phase to account for variable-latency nodes. In the final phase, the control sequence generator generates a control word sequence for each mapped node in the cluster. At this stage, the compiler has access to enough information about each DAG node, including its type, parent nodes, and CU mapping. For instance, in the case of a stencil node, the compiler knows the stencil size, stride, coefficients, and input and output image dimensions. The three phases of the compiler work together to transform DSL code into a control word sequence, enabling correct and efficient execution of an image processing pipeline.
\begin{align} \mathtt {Ordering\ constraint} \rightarrow \quad &L(x) \lt L(y) \Rightarrow P(x) \lt P(y) &(1)\\ \mathtt {Location\ constraint} \rightarrow \quad &P_{low}(t(x)) \le P(x) \le P_{high}(t(x)) &(2) \end{align}

6.1 DAG Clustering

The clustering phase divides the DAG into clusters, which are smaller compute blocks that can be processed by a CU. The multiplicity of CUs is not considered during clustering because all image strips mapped to CUs are processed through the same cluster. The input DAG S is divided into distinct clusters \(S_1\), \(S_2\), and so forth up to \(S_n\). The DAG nodes inside a cluster are then mapped to the PUs inside the PEs, assuming that the entire CU is accessible to execute the cluster. Initially, S is considered as a single cluster. The compiler then attempts to assign PUs to the nodes in S while adhering to a set of constraints. The assigned nodes form the cluster \(S_1\). This procedure is recursively applied to the remaining DAG \(S - S_1\) to create the next cluster \(S_2\). The process continues until S is empty and no further clusters are formed, thus achieving convergence.
To assign PUs, the DAG nodes are first sorted in topological order and segregated into different levels, denoted as \(L_0\) through \(L_n\). For instance, in Figure 9, the DAG is segmented into four levels, \(L_0 = [A,B,C]\), \(L_1=[D]\), \(L_2=[E]\), and \(L_3=[F]\). The DAG levels are matched against the available PU arrays, represented as \(P_0\) through \(P_m\). Recall that the PU arrays within a PE are distributed across the stencil, pointwise, and upsample/downsample stages. The type of a node x, denoted as \(t(x)\), determines which PE stage or PU arrays can process it. The range of permissible PU arrays to process a given node type is given by \(P_{low}(t(x))\) to \(P_{high}(t(x))\). For stencil and up/downsample nodes, \(P_{low}\) and \(P_{high}\) values are the same, since they are assigned to a single PU array. Only the PU array at the base of the multiply-and-accumulate tree is considered for stencil nodes.
Fig. 9. The illustration depicts the various stages of the compilation process. The FlowPix compiler generates a sequence of control words for the input DAG.
To ensure correct mapping, the PU-to-node assignments must satisfy two constraints: the ordering constraint and the location constraint, represented by Equation (1) and Equation (2), respectively. The ordering constraint guarantees that DAG nodes are mapped to PU arrays in level order. If two nodes, x and y, belong to DAG levels \(L(x)\) and \(L(y)\) respectively, and \(L(x) \lt L(y)\), then the node x must be assigned to a PU array, given by \(P(x)\), with a lower index. This constraint ensures that data flows in the same direction inside the CU as in the DAG. The location constraint restricts the set of PU arrays to which a node of a particular type can be assigned. After determining the correct PU array, the assignment is completed by assigning the next available PU within that array. The compiler maintains an availability map of all the PUs within the PE. If a suitable PU cannot be found for a given node, the compiler considers the next available PE within the CU. If all PEs are exhausted, the compiler consolidates the current cluster, re-initializes the mapping data structures, and creates the next cluster with the remaining DAG nodes.
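The two constraints can be read as simple predicates over a candidate assignment. The following is a hedged sketch of the feasibility check the mapper performs; the node representation and the concrete PU-array ranges per node type are illustrative assumptions.

// Hedged sketch of the mapper's feasibility check (illustrative ranges).
sealed trait NodeType
case object Stencil   extends NodeType
case object Pointwise extends NodeType
case object UpDown    extends NodeType

case class DagNode(id: Int, level: Int, kind: NodeType)

// Permissible PU-array range per node type: pLow(t) .. pHigh(t).
// The concrete indices depend on Px/Py and are assumptions here.
def pLow(t: NodeType): Int  = t match { case Stencil => 0; case Pointwise => 1; case UpDown => 7 }
def pHigh(t: NodeType): Int = t match { case Stencil => 0; case Pointwise => 6; case UpDown => 7 }

// assignment(x) = P(x), the index of the PU array chosen for node x.
def satisfiesConstraints(assignment: Map[DagNode, Int]): Boolean = {
  val location = assignment.forall { case (x, px) =>
    pLow(x.kind) <= px && px <= pHigh(x.kind)          // location constraint (2)
  }
  val ordering = assignment.forall { case (x, px) =>
    assignment.forall { case (y, py) =>
      x.level >= y.level || px < py                    // ordering constraint (1)
    }
  }
  location && ordering
}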

6.2 Latency Matching

Assume we have a cluster mapped to the CU, and inside the cluster we have three nodes: x, v, and w. The nodes v and w send data to node x, and they are mapped to the \(i^{th}\) and \(j^{th}\) PU arrays, respectively, where \(j - i \gt 2\). Node x is mapped to the \({j+1}^{th}\) PU array inside the CU. During execution, data is directly exchanged between the PUs assigned to nodes w and x because they are in adjoining PU arrays. However, there is no direct connection between nodes v and x. To ensure correct execution, a buffer of size \(j-i\) must be placed on the route between v and x. Since the PE has no buffering capacity between PUs, extra PUs relay the data from the \(i^{th}\) to the \(j^{th}\) array. The compiler creates relay nodes in the cluster and distributes them, one PU per array, across the \(j-i\) arrays in between.
Relaying is a method used to handle variable latency across the input edges of a mapped node. To ensure successful relaying, each PU array inside the PE has a reserved set of PUs that act as forwarding units. These forwarding units are lightweight and do not carry out any computation. The number of such PUs is equal to the length of the array. The compiler configures relay nodes based on the difference in edge latencies across DAG nodes. An example of relaying can be seen in Figure 9. The node E, mapped to the fourth PU array, receives input from node B, located at the first PU array. Therefore, two relay nodes are configured between them. However, no relaying is required for the other input of node E since that data is produced at the preceding PU array. The same applies to node D with inputs from nodes A and C.
In certain situations, data transmission across all edges within a cluster can be hindered by a lack of relay nodes. Consider a scenario where a PE consists of four PU arrays, each with a single relay PU. We observe that cluster \(S_1\) (depicted in Figure 9) cannot be mapped to the PE due to an insufficient number of relay nodes. This problem arises because the second level of \(S_1\) requires two relay nodes, which exceeds the available quantity. To tackle this issue, the compiler divides the cluster further, ensuring that the smaller clusters get a sufficient number of relay nodes. In the present example, \(S_1\) is split into two clusters: one encompassing nodes A, C, and D, and the other comprising nodes B and E. The division involves removing nodes from \(S_1\) to create two new clusters, namely \(S^a_1\) (the reduced cluster with adequate relay connectivity) and \(S^b_1\) (containing the deleted nodes). This process is recursively applied to \(S^b_1\). Nodes are eliminated by selecting them in reverse topological order (in our example, starting with node E). All dependent nodes of a deleted node are also removed. Furthermore, any node whose output is no longer utilized by any other node within the cluster is eliminated (such as node B).

6.3 Control Word Generation

In the final phase, the compiler produces 32-bit control words that correspond to each PU-to-node mapping. Each mapping is achieved by a series of control words, the number of which varies based on the type of node being mapped. The compiler pre-stores the sequence of control words for each mapping type in a lookup table as templates. These templates are then given specific values during the compilation process. For instance, if a PU is being mapped to an add operation, three control words are needed. The initial control word configures the scalar unit to perform the add operation, while the other two words establish the input ports of the PU with the corresponding input source indices. These sources could be either other PUs in the case of intra-cluster dependency or the PE input vector in the case of inter-cluster dependency.
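As a concrete illustration of the template idea, the following is a hedged sketch of expanding the add-mapping template into its three control words. All field encodings and register addresses here are assumptions, not the actual FlowPix templates.

// Hypothetical template expansion for mapping an `add` node onto a PU:
// one word sets the scalar-unit opcode, two words wire the PU's input
// ports to their sources (another PU or a slot of the PE input vector).
object AddTemplate {
  private def pack(index: Int, value: Int): Int = (index << 16) | (value & 0xFFFF)

  val OP_ADD = 1                                   // assumed opcode encoding

  def expand(puIndex: Int, src0: Int, src1: Int): Seq[Int] = {
    val opcodeReg = puIndex * 4                    // assumed register addressing scheme
    val port0Reg  = puIndex * 4 + 1
    val port1Reg  = puIndex * 4 + 2
    Seq(pack(opcodeReg, OP_ADD), pack(port0Reg, src0), pack(port1Reg, src1))
  }
}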
When it comes to mapping stencil nodes, the compiler generates a single control word sequence to configure the reduction tree for processing all stencils together. Firstly, the line buffer array is configured with the stencil size, stride, and image dimensions. The lookup table contains the line buffer settings for a pre-defined set of stencil sizes, and all stencils processed together must have the same shape. Next, the mapper module is configured to create the data layout (described in Section 5.1) based on the number and size of the stencils. The stencil coefficients are loaded into the registers of the PUs located at the base of the reduction tree. The PUs in the stencil stage are pre-set to perform the multiply-and-accumulate operation, and the shifter units between the PU arrays are configured with the appropriate shift amounts. The PUs can also be configured with a min/max operation to process min/max filters, respectively. Finally, the output module of the stencil stage is configured to pass the correct output downwards to the pointwise stage.

7 FlowPix Scheduler

The FlowPix scheduler, which uses the driver API, is responsible for scheduling clusters created by the compiler on the overlay during runtime. A cluster is created assuming that the entire CU memory banks are available for its execution, but during runtime, the CU memory banks are shared across all cluster executions. The scheduler generates an optimal/acceptable execution schedule that minimizes the external memory traffic between the host and FPGA by streaming the input image only once and keeping partial data created by clusters inside the CU memory banks. However, if there are not enough memory banks to meet the requirements of a schedule, the scheduler raises an exception, and the user must re-synthesize the overlay with more memory banks. To generate an execution schedule, the scheduler uses the cluster graph and per cluster control word sequence as input. The cluster graph defines the interdependence among clusters, with each cluster having one or more parents. The cluster graph is a DAG that can also be viewed as a tree, with the cluster that generates the final output serving as the root. Every cluster serves as a root of a subtree, and to make a cluster eligible for execution, all the clusters within its corresponding subtree must be processed. The scheduler tracks which compute nodes within a cluster are mapped to which memory bank using a dynamic map. Before discussing the scheduling algorithm in more detail, let’s define a few terms that will be used throughout this discussion.
Definitions: The number of banks necessary to store the output of cluster M is denoted by \(B_o(M)\), while the number of banks from which cluster M reads its inputs is indicated by \(B_i(M)\). Considering all the clusters belonging to the subtree rooted at cluster M, the maximum value of \(B_i\) is referred to as \(B_{imax}(M)\). Essentially, \(B_{imax}\) identifies the highest number of input bank requirements needed to compute all the clusters in the subtree. The bank utilization factor \(U(M)\) for a cluster is given by \(B_o(M) - B_{imax}(M)\). This value is an approximation for the effective number of banks used by a cluster. Clusters with a negative value of \(U(\cdot)\) indicate a situation where the number of banks consumed for computation exceeds the number of banks needed to store the output. In such cases, the number of CU banks currently in use decreases.
Theorem 7.1.
Given a set of clusters eligible for execution, the overall bank utilization is minimized by processing the clusters in the increasing order of their bank utilization factor.
Proof.
Assume a cluster C with two parent clusters, M and N. We can make this assumption without loss of generality. It is necessary to compute M and N before computing C. We prove that if \(U(M) \gt U(N)\), then the bank utilization is lower if we compute N first, followed by M.
We calculate the maximum number of banks utilized when M is computed first followed by N. \(B_{imax}(M)\) is the maximum number of input banks required to compute M. The output of M is stored in \(B_o(M)\) banks. Next, to compute N, \(B_{imax}(N)\) banks are utilized. Therefore the maximum number of banks utilized in this compute sequence is \(\max {\lbrace B_{imax}(M), B_o(M) + B_{imax}(N)\rbrace }\) which is equal to \(B_o(M) + B_{imax}(N)\). This can be proved by simple substitution using our initial assumption that \(U(M) \gt U(N)\).
\begin{align*} B_o(M) - B_{imax}(M)&\gt B_o(N) - B_{imax}(N) && \text{Initial assumption}\\ \Rightarrow B_o(M) + B_{imax}(N)&\gt B_o(N) + B_{imax}(M)\\ \Rightarrow B_o(M) + B_{imax}(N)&\gt B_{imax}(M) && \text{since }B_o(N) \gt 0 \end{align*}
Now, when N is computed first followed by M, the maximum number of banks in use will be \(\max {\lbrace B_{imax}(N), B_o(N) + B_{imax}(M)\rbrace }\). There are two cases to consider here.
(1)
\(\mathbf {B_{imax}(N) \gt B_o(N) + B_{imax}(M)}\): The maximum number of banks utilized is \(B_{imax}(N)\). If executing M first led to a lower bank utilization, then \(B_o(M) + B_{imax}(N)\) would have to be less than \(B_{imax}(N)\), which implies \(B_o(M) \lt 0\) and is impossible.
(2)
\(\mathbf {B_{imax}(N) \lt B_o(N) + B_{imax}(M)}\): The maximum number of banks utilized is \(B_o(N) + B_{imax}(M)\). If executing M first led to a lower bank utilization, then \(B_o(M) + B_{imax}(N)\) would have to be less than \(B_o(N) + B_{imax}(M)\). Rearranging gives \(B_o(M) - B_{imax}(M) \lt B_o(N) - B_{imax}(N)\), that is, \(U(M) \lt U(N)\), which contradicts our initial assumption.
Therefore, we conclude that if \(U(M) \gt U(N)\), then N must be computed first to achieve a lower overall bank utilization. □
Scheduling Algorithm: The FlowPix scheduler generates an acceptable schedule (if one exists) by always selecting the eligible cluster with the lowest bank utilization factor. The algorithm starts at the root of the cluster DAG and recursively visits the subtrees whose root nodes have the lowest utilization factors. The generated cluster execution sequence ensures that the CU memory banks are used judiciously (see Theorem 7.1).
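A compact sketch of this traversal is given below. It is our own illustration rather than the FlowPix source: it restates the \(B_{imax}\) and \(U\) computations from the definitions above and omits the memory-bank bookkeeping and the exception raised when the banks run out.

#include <algorithm>
#include <vector>

// Illustration of the scheduling rule: among the clusters eligible at each
// point, the one with the lowest bank utilization factor U is processed first
// (Theorem 7.1).
struct Cluster {
    int id;
    int Bo;                          // B_o: banks holding this cluster's output
    int Bi;                          // B_i: banks this cluster reads its inputs from
    std::vector<Cluster*> deps;      // clusters it depends on (its subtree)
    bool done = false;
};

// B_imax: the largest B_i over the subtree rooted at m.
static int Bimax(const Cluster* m) {
    int best = m->Bi;
    for (const Cluster* c : m->deps) best = std::max(best, Bimax(c));
    return best;
}

// U(m) = B_o(m) - B_imax(m); negative values mean the subtree frees banks.
static int U(const Cluster* m) { return m->Bo - Bimax(m); }

// Post-order traversal starting at the root cluster: recursively schedule the
// dependency subtrees in increasing order of U, then emit the cluster itself.
void schedule(Cluster* m, std::vector<int>& order) {
    if (m->done) return;             // a cluster may feed several consumers
    std::vector<Cluster*> deps = m->deps;
    std::sort(deps.begin(), deps.end(),
              [](const Cluster* a, const Cluster* b) { return U(a) < U(b); });
    for (Cluster* c : deps) schedule(c, order);
    m->done = true;
    order.push_back(m->id);          // all clusters in its subtree are processed
}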

7.1 Example

We use a dummy benchmark, shown in Figure 10, to demonstrate clustering and execution in FlowPix. We configure the overlay with a single CU containing one PE, with \(P_x = 6\) and \(P_y = 4\). The CU can process a maximum of two \(3 \times 3\) stencil nodes in parallel. The clustering step creates 6 clusters labelled \(S_1\) through \(S_6\). DAG nodes 1, 2, and 7 occupy the same cluster: 1 and 2 are parallel stencil nodes that fit inside one cluster, and the pointwise node 7 depends on their outputs. Clusters \(S_2\) and \(S_3\) are formed similarly. The stencil nodes 10 and 11 do not fit in clusters \(S_1\) or \(S_2\) because they receive input from a pointwise node, and there is no data path inside the PE to transfer data from a pointwise node to a stencil node. The upsample nodes 14 and 15 belong to separate clusters because the PE can process only a single upsample operation. The scheduler-generated cluster execution sequence minimizes the overall bank utilization. The scheduler first considers \(S_6\), the root of the cluster DAG, which depends on \(S_4\) and \(S_5\). Between \(S_4\) and \(S_5\), the scheduler picks \(S_4\) as it has a lower bank utilization factor than \(S_5\) (\(-1 \lt 0\)). \(S_4\) depends on \(S_1\) and \(S_2\), which have the same utilization factor; the source cluster \(S_1\) is executed first, followed by \(S_2\) and then \(S_4\). A similar pattern inside the subtree rooted at \(S_5\) results in the execution of \(S_3\) and then \(S_5\). Finally, \(S_6\) is processed, and the output image is stored in a single memory bank.
Fig. 10.
Fig. 10. The figure depicts a dummy application with 3 × 3 stencil, pointwise, and upsample nodes, as well as its clustering and execution schedule. The DAG clusters are executed over a CU with two memory banks. The number on each bank corresponds to the node whose output it presently stores.

8 Experimental Results

We executed 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks span a range of structures and complexities, from simple pipelines such as Unsharp Mask to relatively complicated pipelines such as optical flow with multiple input images. We also implemented iterative pyramid systems such as Gaussian pyramids and up-down sampling. We compared FlowPix with several existing FPGA-based image processing frameworks, including PolyMage [25], Darkroom [17], HeteroHalide [23], and the Vitis Vision library [11]. These frameworks generate fixed-function hardware from a DSL specification. Our benchmark suite is described in Table 3. Furthermore, we compared FlowPix with IPPro [38], which employs an overlay-based approach.
Table 3.
SL No | Benchmark | Acronym | Description
1 | Harris Corner | HCD | Corner detection using 3 × 3 stencil and point-wise operations.
2 | Canny Edge | CED | Edge detection using 5 × 5 stencil and point-wise operations.
3 | Gaussian Filter | GAF | Application of a 3 × 3 Gaussian filter over an image.
4 | Blur | BLU | Average operation over a 3 × 3 window of pixels.
5 | Linear Blur | LB | Blur operation followed by 2 linear transformation operations.
6 | Pyramid | PYR | Pyramid of up-sampling or down-sampling Gaussian filters.
7 | Down-Up | DUS | Single down-sampling followed by an up-sampling operation.
8 | Erosion | ERS | Minimum operation over a 3 × 3 window of pixels.
9 | Median | MED | Median operation over a 3 × 3 window of pixels.
10 | Dilation | DIL | Maximum operation over a 3 × 3 window of pixels.
11 | Convolution | CONV | An 8 × 8 convolution operation over an image.
12 | Lucas Kanade | LK | Single iteration of the Lucas-Kanade optical flow algorithm.
13 | Stencil Chain | SC | 3 × 3 stencil operation repeated 3 times in a pipeline.
14 | Gaussian Diff | GAD | Difference of a single and a double Gaussian blur.
15 | Pyramid Blend | PYRB | Blending of two images using 2-level image pyramids.
Table 3. A Short Description of the Benchmarks Implemented using FlowPix on a Virtex-7 FPGA
Experimental set-up: Our overlay is synthesized on a Virtex-7 690t FPGA with 3600 DSP blocks and 6.4 MB of on-chip BRAM (2940 18Kb blocks). The FPGA card is connected to an Intel Core-i5 processor via a PCIe 8x link. The host driver, which streams the image frames in row-major order, is written in C++ and runs on the CPU. We use the Xillybus PCIe core (http://xillybus.com/doc/revision-b-xl) to connect the host to our overlay; see Figure 11. The Xillybus core can achieve an ideal data bandwidth of 6.4 GB/s in each direction (reads from and writes to the host) when operating on the Virtex-7 device with 8x Gen3 lanes, utilizing the Gen3 Integrated Block for PCI Express v3.0 (see http://xillybus.com/doc/xillybus-bandwidth for details).
Our hardware is designed using Bluespec System Verilog (BSV) [27]. With BSV, we can precisely specify the desired hardware as a set of rules or transactions executed atomically, expressing system behavior with all-or-none transactional semantics. All the reported hardware characteristics of the design are obtained using the Vivado Design Suite version 2022.1, post place-and-route.
To match the datatypes and filter sizes used in various benchmarks across different frameworks, we created four variants of the PE by altering both the datatype and the PE size. The characteristics of these PE variants are listed in Table 5; other design variants were possible, but we chose these four. The number of CUs and the number of PEs within a CU were determined by the benchmark and the framework being compared against. To minimize the comparative latency and LUT consumption, we synthesized the overlay using one of the PE variants and increased the number of CUs to match the pixel throughput. For example, when comparing FlowPix with a framework that generates 32 pixels per cycle for a given benchmark, we synthesized the overlay with 8 CUs, since a single PE within a CU generates 4 pixels per cycle. After matching the pixel throughput, we increased the PE count within a CU as long as the relative LUT increment stayed within a 50% (1.5×) threshold. This increase in PEs was done to determine whether pipelined parallelism could reduce the achieved latency at a modest increase in LUT consumption.
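For illustration, a minimal host-side streaming loop could look like the sketch below. It assumes the default Xillybus device-file names and 8-bit pixels; the actual FlowPix driver, its control-word protocol, and its error handling are more involved.

#include <cstdint>
#include <cstdio>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

int main() {
    const int W = 1920, H = 1080;                 // HD frame, streamed row-major
    std::vector<uint8_t> in(W * H), out(W * H);   // 8-bit pixels (assumed)

    // Default Xillybus device files; the real names depend on the IP configuration.
    int wr = open("/dev/xillybus_write_32", O_WRONLY);
    int rd = open("/dev/xillybus_read_32", O_RDONLY);
    if (wr < 0 || rd < 0) { perror("open"); return 1; }

    // 1. Send the per-cluster control words first (contents omitted here).
    // 2. Stream the input image once, row by row.
    for (int r = 0; r < H; ++r)
        if (write(wr, in.data() + r * W, W) != W) { perror("write"); return 1; }

    // 3. Read back the processed frame after all clusters have executed.
    size_t got = 0;
    while (got < out.size()) {
        ssize_t n = read(rd, out.data() + got, out.size() - got);
        if (n <= 0) break;
        got += n;
    }

    close(wr);
    close(rd);
    return 0;
}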
Table 4.
FrameworkHCDCEDGAFUSMBLULBSCDILERSMEDCONVPYRLKDUS
FlowPix15813611132221192310
Halide15-157293----16--
Polymage43--16-------717550
Hetero Halide26-81321115222----
Table 4. Lines of Code Comparison of the Different Frameworks for the Benchmarks Implemented
Table 5.
Arch | Datatype | Px | LUT | FF | DSP | BRAM (18Kb) | Frequency (MHz)
A1 | 16 bit | 8 | 9970 | 16984 | 176 | 144 | 200
A2 | 8 bit | 8 | 5062 | 8673 | 88 | 80 | 250
A3 | 16 bit | 16 | 13681 | 22111 | 352 | 272 | 200
A4 | 8 bit | 16 | 5721 | 9971 | 176 | 144 | 250
Table 5. FPGA Resources Consumed by the Different Architecture Variants of our Overlay
\(P_y\) is set to 8 in all the PE designs.
Fig. 11.
Fig. 11. The diagram depicts the integration of the FlowPix overlay with the Xillybus IP through read and write FIFO interfaces. The host streams input to the overlay and receives output from it via simple read and write file interfaces. The input image is streamed only once, and the output image is read after all the clusters are processed.
To measure the processing latency of a benchmark accurately, we subtract the loop-back time from the total execution time. The loop-back time is the time required for the hardware to stream an image in and out without any processing; it covers the PCIe transfer time and other overheads introduced by the Xillybus framework. To place the hardware in loop-back mode, we remove the FlowPix overlay and connect the write FIFO directly to the read FIFO (refer to Figure 11). The loop-back time remains constant at 20 milliseconds regardless of the benchmark. The time required for transferring the control words is not included in the loop-back time, as it varies with the benchmark being processed.
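A minimal sketch of this measurement, assuming the benchmark run is wrapped in a host-side callable (our own illustration, not the FlowPix driver code):

#include <chrono>

// Latency = total wall-clock time minus the constant 20 ms loop-back time
// (PCIe transfer plus Xillybus overhead). Control-word transfer is timed separately.
template <class F>
double processing_latency_ms(F&& run_benchmark, double loopback_ms = 20.0) {
    auto t0 = std::chrono::steady_clock::now();
    run_benchmark();  // stream the frame in and read the result back
    auto t1 = std::chrono::steady_clock::now();
    double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return total_ms - loopback_ms;
}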

8.1 Performance Comparison

In this section, we compare FlowPix with other frameworks in terms of throughput and FPGA resource consumption across the benchmarks. We report the resource consumption of Look-Up Tables (LUTs) and Flip-Flops (FFs); the total BRAM and DSP consumption can be determined from the number of PEs employed. For instance, an overlay with 16 CUs, each containing one A4-type PE, uses \(16 \times 176 = 2816\) DSP blocks and \(16 \times 1 \times 144 = 2304\) 18Kb BRAM blocks (refer to Table 5). Table 4 compares the lines of code of FlowPix with those of the other frameworks. To ensure a fair comparison, we process each benchmark under the same settings as those reported by the framework in question. The configuration of each benchmark includes the image width (W), height (H), data type, and the number of pixels produced per cycle (P/C).
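As a small illustration of this arithmetic, the helper below (our own, not part of the FlowPix toolchain) computes the totals from the per-PE figures in Table 5:

// Total DSP and BRAM consumption scale linearly with the number of PE
// instances; per-PE figures are taken from Table 5 (e.g. A4: 176 DSPs,
// 144 18Kb BRAMs). This helper only restates that arithmetic.
struct PEVariant { int dsp; int bram18k; };

struct OverlayResources { int dsp; int bram18k; };

OverlayResources totalResources(const PEVariant& pe, int cus, int pesPerCU) {
    return { cus * pesPerCU * pe.dsp, cus * pesPerCU * pe.bram18k };
}

// Example from the text: 16 CUs with one A4 PE each ->
//   totalResources({176, 144}, 16, 1) yields {2816, 2304}.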
Table 6 presents a comparison between FlowPix and the HeteroHalide framework across several reported benchmarks, all of which were processed at a pixel throughput greater than 1. The aim of this experiment is to assess FlowPix's ability to scale towards higher throughputs. For the 8-bit cases, FlowPix's latency is comparable to that of HeteroHalide, achieved with an average 1.7× increase in FPGA LUT consumption. In the other benchmarks, the latency increases by 25%, along with an average 1.6× increase in FPGA LUT consumption. Notably, the BLU benchmark shows the highest increase in LUT consumption, as it is computed with 16-bit precision. Conversely, the GAF benchmark achieves better LUT utilization than HeteroHalide, since it is computed using the 8-bit PE version. In essence, achieving the desired latency and throughput entails scaling the PEs to exploit pipelined parallelism and scaling the CUs to exploit data parallelism.
Table 6.
Table 6. Comparing FlowPix with HeteroHalide
In Table 7, FlowPix is compared with the industry-standard Vitis Vision library (https://xilinx.github.io/Vitis_Libraries/vision/2022.1/index.html), implemented using Vitis-HLS, which provides a software API for computer vision functions accelerated on an FPGA device. Like their OpenCV equivalents, the library functions (kernels) are compiled into a bitstream and run on the FPGA. The library is optimized for area and performance. The benchmarks show approximately a 20% degradation in latency for FlowPix compared to Vitis Vision. This difference can be attributed to the fact that almost all the Vitis Vision kernels are synthesized at a frequency greater than 300 MHz, whereas FlowPix is synthesized at a maximum frequency of 250 MHz. For the pyramid benchmarks, the relative LUT consumption of FlowPix is over 4× that of Vitis for a single-level pyramid with \(5 \times 5\) filters. Since this benchmark involves just a single convolution followed by an upsample or downsample operation, the Vitis library does better at creating an area-efficient design. In the case of LK optical flow, pipelined parallelism is employed by setting the PE count to 2, since the LUT increase stays within the threshold; this parallelism benefits the processing of LK over two clusters. For CED edge detection, FlowPix is 4× faster than Vitis when processed using two \(3 \times 3\) filters. A single PE can support four instances of the CED benchmark since the A4-type PE is deployed, and two parallel CU instances are utilized to achieve \(P/C = 1\).
Table 7.
Table 7. Comparing FlowPix with Vitis Vision Library Implemented using Vitis-HLS
Table 8 compares FlowPix with the Darkroom and Rigel frameworks. Darkroom creates line-buffered pipelines with stencil and pointwise computations. Rigel is an extension of Darkroom that can generate multi-rate pipelines containing upsample and downsample operations; unlike Darkroom, Rigel pipelines can produce more than one pixel per cycle. FlowPix has a 1.5× to 2.7× higher latency than Darkroom for the HCD and CED benchmarks at single-pixel throughput. The latency increase arises because these benchmarks are processed using a single PE: more clusters are generated and time-multiplexed over the same PE. The HCD and CED benchmarks are processed using two clusters, executed one after another over the same PE. Increasing the PE count was not an option here since it crosses the LUT area threshold. For CED, however, the LUT consumption of FlowPix is 23% lower than that of Darkroom.
Table 8.
Table 8. Comparing FlowPix with Darkroom and Rigel
FlowPix outperforms Rigel in both latency and LUT consumption. Specifically, when processing the LK benchmark with 2 PEs (permitted since the relative LUT increment stays within the 50% threshold), FlowPix achieves 1.25× lower latency than Rigel. For the CONV and PYR-D benchmarks, which use \(8 \times 8\) filters and produce 4 pixels per cycle, FlowPix uses 4 CUs containing the A4 PE type, resulting in latency improvements of 1.75× and 2.8×, respectively. In terms of LUT usage, FlowPix fares better than Rigel on the PYR-D benchmark, utilizing 50% fewer LUTs, whereas for CONV, Rigel uses 10% fewer LUTs than FlowPix. Despite this, FlowPix still outperforms Rigel in latency on both benchmarks.
Table 9 provides a comparison between FlowPix and the PolyMage framework. All three benchmarks are implemented on a single CU with the A1 PE type, since PolyMage computes these benchmarks in 24-bit fixed-point precision. For the HCD benchmark, the latency with FlowPix is 11 times higher, because HCD is processed as 2 clusters over the single PE. For both USM and DUS, channel parallelism is exploited by using a single \(3 \times 3\) filter in parallel, allowing all 3 channels to be processed on a single PE. DUS is processed over 2 clusters because of its downsample followed by an upsample operation, with the upsample processed in the second cluster. As a result, the latency increase with FlowPix compared to PolyMage is considerably higher for DUS than for USM.
Table 9.
Table 9. Comparing FlowPix with PolyMage
Moving on from frameworks that generate fixed-function hardware, we next compare FlowPix with the IPPro instruction-set-based processor. In this experiment, we selected three functions that produce a single output from a set of inputs and executed them on one CU with the A1 PE type. Our goal was to quantify the control overhead involved in processing a single operation and compare it against the processor's latency. As shown in Table 10, the control cycles were 7×–10× higher than the compute cycles. However, this is a one-time overhead: after the PE is configured, no further control cycles are needed, and only compute cycles are spent until all the input is processed. This differs from a processor, where the pipeline stalls and control cycles are incurred for every instruction executed. With FlowPix, we processed the POLY function and the FIR filter at a considerably lower latency than IPPro.
Table 10.
Table 10. Comparison of FlowPix with IPPro

8.2 Framework Analysis

In this section, we first examine the capability of the FlowPix compiler to handle larger designs consisting of hundreds of nodes. We then focus on the overlay and explore the implications of processing all the benchmarks using a single design, in contrast to the previous section, where we employed different PE designs for different benchmarks.
In the first experiment, we generated DAGs with a varying number of compute nodes, ranging from 32 to 1024. To create these DAGs, we employed a random graph generator that labelled the nodes as stencil, pointwise, upsample, or downsample nodes. The labelling follows a set of constraints: only nodes with a single input may be labelled as stencil, downsample, or upsample nodes, while nodes with two input edges may be labelled as pointwise. These constraints ensure that the generated DAGs are capable of processing an input image to produce an output image. Figure 12 shows that the compilation time increases proportionally with the size of the input DAG; in other words, the complexity of the compiler is \(O(n)\), where n is the number of DAG nodes. The clustering phase accounts for approximately 90% of the compilation time, while the remaining 10% is spent on generating the control words and determining the scheduling order of the clusters. The clustering phase dominates because, as the size of the DAG grows, the number of clusters that need to be evaluated also increases.
Fig. 12.
Fig. 12. Analysis of the FlowPix compiler.
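As an illustration of this set-up, the sketch below shows one way such a labelled random DAG could be generated; the probabilities, the wiring policy, and the seed are our own assumptions and not the generator actually used in the experiment.

#include <random>
#include <vector>

// Randomly labelled compute DAG used to stress the compiler: nodes with one
// input may become stencil, upsample, or downsample nodes; nodes with two
// inputs become pointwise nodes. Probabilities and wiring are illustrative.
enum class NodeType { Stencil, Pointwise, Upsample, Downsample, Input };

struct Node {
    NodeType type;
    std::vector<int> inputs;  // indices of predecessor nodes
};

std::vector<Node> randomDag(int n, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::vector<Node> dag;
    dag.push_back({NodeType::Input, {}});            // node 0: the input image
    for (int i = 1; i < n; ++i) {
        Node node;
        // Pick predecessors only from nodes created so far, so the result
        // stays a DAG with a single image source.
        std::uniform_int_distribution<int> pred(0, i - 1);
        bool twoInputs = (i > 1) && (rng() % 4 == 0);  // roughly 25% pointwise nodes
        node.inputs.push_back(pred(rng));
        if (twoInputs) {
            node.inputs.push_back(pred(rng));          // possibly identical; fine for a stress test
            node.type = NodeType::Pointwise;           // two inputs -> pointwise
        } else {
            switch (rng() % 4) {                       // single input
                case 0:  node.type = NodeType::Upsample;   break;
                case 1:  node.type = NodeType::Downsample; break;
                default: node.type = NodeType::Stencil;    break;
            }
        }
        dag.push_back(node);
    }
    return dag;
}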
In the second experiment, shown in Table 11, we used a single 16-bit overlay design to process all the benchmarks. This overlay comprises a single CU with one PE, where \(P_x\) = 16 and \(P_y\) = 8. The input image size chosen for a benchmark is the maximum of the dimensions specified for that benchmark in Section 8.1. The compilation time varied from 1 to 300 milliseconds. Even adding the maximum control word transfer time of around 10 ms, this is significantly shorter than the time required to flash an FPGA bitstream, which typically takes seconds. An overlay design is therefore advantageous when the user needs to switch between processing pipelines without incurring the overhead of writing a new bitstream each time. Also observe that similar benchmarks have comparable compilation times; for example, BLU and MED both compile in 1 millisecond as each involves a single stencil operation. The run time increases in proportion to the reduction in the number of PEs and CUs. To illustrate this, consider HCD: its latency is 16.45 milliseconds, a 16-fold increase compared to the data reported in Table 6. Similarly, the PYR-U benchmark has a latency of 39.77 milliseconds, nearly identical to the latency reported in Table 7, given the PE and CU counts used in both cases; the slight increase is due to the difference in bit width (16-bit vs. 8-bit), which directly affects the operating frequency of the design.
Table 11.
Table 11. A Single Design to Process all the Benchmarks

9 Conclusion

This paper proposed FlowPix, a DSL-based overlay accelerator for image processing applications, which achieves performance comparable to existing fixed-function DSL-to-FPGA frameworks while retaining the flexibility to map different algorithms. Our architecture simplifies the scheduling of image processing (ImP) workloads on the cloud, which in turn enables ImP-as-a-Service. The architecture is also amenable to running various algorithms concurrently, which we leave as future work.

References

[1]
Puya Amiri, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Roland Leißa, and Sebastian Hack. 2021. FLOWER: A comprehensive dataflow compiler for high-level synthesis. In 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 1–9.
[2]
Kevin Andryc, Murtaza Merchant, and Russell Tessier. 2013. FlexGrip: A soft GPGPU for FPGAs. In 2013 International Conference on Field-Programmable Technology (FPT). 230–237.
[3]
Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing hardware in a Scala embedded language. In Proceedings of the 49th Annual Design Automation Conference (DAC ’12). Association for Computing Machinery, New York, NY, USA, 1216–1225.
[4]
David F. Bacon, Rodric Rabbah, and Sunil Shukla. 2013. FPGA programming for the masses. Commun. ACM 56, 4 (April 2013), 56–63.
[5]
Christophe Bobda, Joel Mandebi Mbongue, Paul Chow, Mohammad Ewais, Naif Tarafdar, Juan Camilo Vega, Ken Eguro, Dirk Koch, Suranga Handagala, Miriam Leeser, Martin Herbordt, Hafsah Shahzad, Peter Hofste, Burkhard Ringlein, Jakub Szefer, Ahmed Sanaullah, and Russell Tessier. 2022. The future of FPGA acceleration in datacenters and the cloud. ACM Trans. Reconfigurable Technol. Syst. 15, 3, Article 34 (Feb. 2022), 42 pages.
[6]
Joo M. P. Cardoso and Pedro C. Diniz. 2008. Compilation Techniques for Reconfigurable Architectures (1st ed.). Springer Publishing Company, Incorporated.
[7]
Hui Yan Cheah, Suhaib A. Fahmy, and Douglas L. Maskell. 2012. iDEA: A DSP block based FPGA soft processor. In 2012 International Conference on Field-Programmable Technology. 151–158.
[8]
Ziaul Choudhury, Shashwat Shrivastava, Lavanya Ramapantulu, and Suresh Purini. 2022. An FPGA overlay for CNN inference with fine-grained flexible parallelism. ACM Trans. Archit. Code Optim. 19, 3, Article 34 (May 2022), 26 pages.
[9]
Nitin Chugh, Vinay Vasista, Suresh Purini, and Uday Bondhugula. 2016. A DSL compiler for accelerating image processing pipelines on FPGAs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT ’16). ACM, New York, NY, USA, 327–338. DOI:
[10]
Nitin Chugh, Vinay Vasista, Suresh Purini, and Uday Bondhugula. 2016. A DSL compiler for accelerating image processing pipelines on FPGAs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT ’16). Association for Computing Machinery, New York, NY, USA, 327–338. DOI:
[11]
Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS today: Successes, challenges, and opportunities. ACM Trans. Reconfigurable Technol. Syst. 15, 4, Article 51 (Aug. 2022), 42 pages. DOI:
[12]
Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE TCAD (2011), 491.
[13]
David Durst, Matthew Feldman, Dillon Huff, David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, and Pat Hanrahan. 2020. Type-directed scheduling of streaming accelerators. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 408–422.
[14]
Chance Elliott, Vipin Vijayakumar, Wesley Zink, and Richard Hansen. 2007. National instruments LabVIEW: A programming environment for laboratory automation and measurement. JALA: Journal of the Association for Laboratory Automation 12, 1 (2007), 17–24. arXiv:
[15]
Zhi Guo, Walid Najjar, and Betul Buyukkurt. 2008. Efficient hardware code generation for FPGAs. ACM Trans. Archit. Code Optim. 5, 1, Article 6 (May 2008), 26 pages. DOI:
[16]
Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA ’10). ACM, New York, NY, USA, 37–47. DOI:
[17]
James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. 2014. Darkroom: Compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33, 4, Article 144 (July 2014), 11 pages. DOI:
[18]
James Hegarty, Ross Daly, Zachary DeVito, Jonathan Ragan-Kelley, Mark Horowitz, and Pat Hanrahan. 2016. Rigel: Flexible multi-rate image processing hardware. ACM Trans. Graph. 35, 4, Article 85 (July 2016), 11 pages. DOI:
[19]
Hipacc. The Heterogeneous Image Processing Acceleration Framework. (n.d.). http://hipacc-lang.org/
[20]
Nachiket Kapre. 2015. Custom FPGA-based soft-processors for sparse graph acceleration. 9–16. DOI:
[21]
D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun. 2016. Automatic generation of efficient accelerators for reconfigurable hardware. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 115–127. DOI:
[22]
Edward Ashford Lee and David G. Messerschmitt. 1987. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Trans. Comput. 36, 1 (Jan. 1987), 24–35. DOI:
[23]
Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From image processing DSL to efficient FPGA acceleration. 51–57. DOI:
[24]
G. Martin and G. Smith. 2009. High-level synthesis: Past, present, and future. IEEE Design Test of Computers 26, 4 (July 2009), 18–25. DOI:
[25]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. SIGARCH Comput. Archit. News 43, 1 (March 2015), 429–443. DOI:
[26]
Walid A. Najjar, A. P. Wim Böhm, Bruce A. Draper, Jeffrey Hammes, Robert Rinker, J. Ross Beveridge, Monica Chawathe, and Charles Ross. 2003. High-level language abstraction for reconfigurable computing. IEEE Computer 36, 8 (2003), 63–69.
[27]
Rishiyur S. Nikhil and Arvind. 2009. What is Bluespec? SIGDA Newsl. 39, 1 (Jan. 2009), 1–1. DOI:
[28]
M. Akif Özkan, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Roland Leißa, Sebastian Hack, Jürgen Teich, and Frank Hannig. 2020. AnyHLS: High-level synthesis with partial evaluation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 3202–3214.
[29]
Preeti Ranjan Panda. 2001. SystemC: A modeling platform supporting multiple design abstractions. In Proceedings of the 14th International Symposium on Systems Synthesis (ISSS ’01). ACM, New York, NY, USA, 75–80. DOI:
[30]
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-mei W. Hwu. 2013. Efficient compilation of CUDA kernels for high-performance computing on FPGAs. ACM Trans. Embed. Comput. Syst. 13, 2 (Sept. 2013). DOI:
[31]
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming heterogeneous systems from an image processing DSL. ACM Trans. Archit. Code Optim. 14, 3, Article 26 (Aug. 2017), 25 pages. DOI:
[32]
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming heterogeneous systems from an image processing DSL. ACM Trans. Archit. Code Optim. 14, 3, Article 26 (Aug. 2017), 25 pages. DOI:
[33]
Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2017. Halide: Decoupling algorithms from schedules for high-performance image processing. Commun. ACM 61, 1 (Dec. 2017), 106–115. DOI:
[34]
Oliver Reiche, M. Akif Özkan, Richard Membarth, Jürgen Teich, and Frank Hannig. 2017. Generating FPGA-based image processing accelerators with Hipacc. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1026–1033.
[35]
Michael Schmidt, Marc Reichenbach, Andreas Loos, and Dietmar Fey. 2011. A smart camera processing pipeline for image applications utilizing marching pixels. Signal & Image Processing: An International Journal 2 (09 2011). DOI:
[36]
Aaron Severance and Guy G. F. Lemieux. 2013. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). 1–10. DOI:
[37]
Fahad Siddiqui, Sam Amiri, Umar Ibrahim Minhas, Tiantai Deng, Roger Woods, Karen Rafferty, and Daniel Crookes. 2019. FPGA-based processor acceleration for image processing applications. Journal of Imaging 5, 1 (2019). DOI:
[38]
Fahad Siddiqui, Matthew Russell, Burak Bardak, Roger Woods, and Karen Rafferty. 2014. IPPro: FPGA based image processing processor. DOI:
[39]
Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, and Tony Nowatzki. 2020. DSAGEN: Synthesizing programmable spatial accelerators. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 268–281.
[40]
Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. 2008. VESPA: Portable, scalable, and flexible FPGA-based vector processors (CASES '08). Association for Computing Machinery, New York, NY, USA. DOI:
[41]
M. A. Özkan, O. Reiche, F. Hannig, and J. Teich. 2016. FPGA-based accelerator design from a domain-specific language. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1–9. DOI:
