5.1 Processing Engine
The overlay’s primary execution unit is the Processing Engine (PE), shown in Figure 6, which comprises processing units (PUs) that carry out computations. Each PU accepts three inputs and has two scalar units that perform simple operations such as addition, subtraction, multiplication, and comparison. Both scalar units can be used in a cascaded manner when all three input values are utilized. The PU also contains registers that store control words and constants. Apart from the computing PUs, there are additional relaying PUs positioned between the computing PUs; these are responsible for forwarding data without performing any processing. The specific functions of these relaying PUs are discussed later. The PE has a parametric design, and its shape and size are set during synthesis by two parameters: \(P_x\) and \(P_y\). The PE architecture is divided into four pipelined stages: one input buffering stage and three compute stages. The compute stages consist of pipelined sequences of PU arrays; the number of PU arrays and their interconnections vary across the stages.
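As a point of reference for the discussion that follows, the Python sketch below models the behaviour of a single PU: three inputs, two scalar units, and an optional cascaded mode. The operation set and the `cascade` flag are illustrative stand-ins, not the overlay's actual control-word encoding.

```python
# Behavioral sketch of one processing unit (PU): three inputs, two scalar
# units, optional cascading. Operation names are illustrative only.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}

def pu(in0, in1, in2, op1="mul", op2="add", cascade=True):
    """First scalar unit combines in0 and in1; in cascaded mode the second
    scalar unit combines that result with in2."""
    r = OPS[op1](in0, in1)
    return OPS[op2](r, in2) if cascade else r

# Example: multiply-accumulate over three values, as used in the stencil stage.
assert pu(3, 4, 5) == 3 * 4 + 5
```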
Input to the PE is in the form of a vector. The output from a PE is streamed to the next PE inside the CU, and the output from the final PE in the sequence is stored in the CU memory banks. The first compute stage performs stencil computations and is designed as a multiply-and-accumulate unit. The PUs in this stage are arranged in a tree structure, with the base of the tree having
\(4 \times P_x\) PUs. The next stage computes pointwise operations and consists of
\(P_y\) arrays with
\(P_x\) PUs in each array. The final stage has a single array with
\(P_x\) PUs and processes upsample or downsample operations. This stage can perform a maximum of
\(P_x\) downsample operations or a single upsample operation. The input to the PE can be new data from the host or partial data stored in the CU memory banks from previous executions. Input buffering is done using an array of line buffers, which are storage structures built using FPGA BRAM that buffer the minimum number of rows of the input image required to produce square windows. The line buffer design used in FlowPix is from [8] and can be dynamically programmed to generate any window size moving with a stride (the default stride is 1) over the input. The ordering of the compute stages inside a PE aligns with the compute pattern of the benchmarks analyzed. To illustrate, the majority of these benchmarks apply stencils to an input image, then perform pointwise operations to combine the stencil outputs. This is followed by an up-sampling or down-sampling operation on the intermediate output image. When a benchmark's compute order differs from this stage ordering, resources are under-utilized, as certain stages must be skipped during the mapping process. The three compute stages of the PE are discussed in more detail below.
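Before turning to the compute stages, the input buffering can be pictured with the short Python sketch below. It is a software model only, not the BRAM-based design from [8]: it buffers enough rows of the input to emit \(k \times k\) windows that move with a configurable stride, defaulting to 1.

```python
def line_buffer_windows(image, k, stride=1):
    """Software model of a line buffer: once k rows are buffered, emit every
    k x k window of the input, moving with the given stride (default 1)."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            yield [[image[r + i][c + j] for j in range(k)] for i in range(k)]

# Example: 3x3 windows over a 4x4 image yields 4 windows with stride 1.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
print(sum(1 for _ in line_buffer_windows(img, 3)))  # 4
```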
Stencil Stage:- The stencil operation generally involves a computation that multiplies and accumulates values. The PUs within the stencil stage are arranged in a reduction-tree pattern. This stage comprises \(m+1\) PU arrays, where \(m = \log_2(4 \times P_x)\). Each PU within array \(i\) reads two input values from the adjacent PUs in array \(i-1\). The first PU array performs the multiplication, while the subsequent arrays handle the accumulation. A binary reduction tree is efficient for computing stencil windows whose size is a power of 2, but such window sizes are uncommon in image processing algorithms; typical window sizes are \(3 \times 3\), \(5 \times 5\), and so on. To address non-power-of-2 sizes, the window can be zero-padded to the next power of 2, but this approach under-utilizes the PUs. Optimal throughput is achieved when this stage processes the maximum possible number of stencils in parallel. Therefore, we use a data layout that rearranges the line buffer windows across the first PU array. This layout is as follows.
The proposed approach is to break each stencil window into smaller partitions whose sizes are powers of 2. For instance, a \(3 \times 3\) window is partitioned into two smaller partitions, of sizes 8 and 1, respectively. In general, a window of size \(k^2\) is broken down into at most \(w\) partitions, denoted \(p_1\) through \(p_{w}\), where \(w = \lfloor \log_2(k^2) \rfloor - 1\). In the new data layout, the same-sized partitions from all windows are positioned next to each other, and these partition groups are ordered from left to right by decreasing size. The mapper module located within the stencil stage has access to all windows created by the line buffers. It is a multiplexer array that maps a value from a line buffer window to an input port of a PU, and the multiplexers are configured to implement this data layout.
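The partitioning and the layout can be made concrete with the sketch below, which splits each window into power-of-2 partitions (the set bits of its size, largest first) and groups same-sized partitions from all windows in decreasing size order. The function names are ours, and the sketch models only the data layout, not the multiplexer configuration.

```python
def partition_sizes(n):
    """Power-of-2 partition sizes of a window of n values, largest first
    (the set bits of n), e.g. 9 -> [8, 1] and 7 -> [4, 2, 1]."""
    return [1 << b for b in range(n.bit_length() - 1, -1, -1) if n & (1 << b)]

def layout(windows):
    """Rearrange flattened windows so that same-sized partitions from all
    windows sit next to each other, ordered by decreasing partition size."""
    parts = []  # (size, values) pieces of every window
    for w in windows:
        i = 0
        for s in partition_sizes(len(w)):
            parts.append((s, w[i:i + s]))
            i += s
    out = []
    for size, vals in sorted(parts, key=lambda p: -p[0]):  # stable sort
        out.extend(vals)
    return out

# Two 1x3 windows -> partitions of sizes [2, 2, 1, 1] over the first 6 PUs.
print(layout([[1, 2, 3], [4, 5, 6]]))  # [1, 2, 4, 5, 3, 6]
```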
In order to illustrate the data layout and operation of the stencil stage, we provide an example in Figure
7. In part (a), the stencil stage processes two
\(1 \times 3\) stencil windows. Each window is first partitioned into two smaller partitions of sizes 2 and 1, respectively. The partitions are then rearranged using the data layout as [2,2,1,1] over the first 6 PUs, with the filter coefficients stored in the PU registers. Similarly, in part (b), the stencil stage processes a single
\(1 \times 7\) window which is partitioned into three smaller partitions of sizes 4, 2, and 1, respectively, and arranged over the first 7 PUs. In part (a) of Figure
7, the 6 partial products produced by the first PU array are accumulated by the rest of the PU arrays. This is achieved by first reducing the partitions corresponding to a window to single values inside the tree. The reduced value
\(r^{m}_{i}\) corresponds to a partition
\(p^{m}_{i}\) of size
\(m\), belongs to window \(i\), and is generated at the \(\log_2 m^{th}\) array. For example, in part (a) of Figure
7, the 2 sized partitions
\(p_{1}^2\),
\(p_{2}^2\) are reduced to the single values
\(r_{1}^2\),
\(r_{2}^2\) by the second (\(\log_2 2 = 1\)) PU array. All the reduced values corresponding to a window must be combined into a single value, which is not possible using the level-to-level PU interconnections since the reduced values lie at different PU arrays. Therefore, the reduced values are aligned and forwarded using the shifter and the vector register between the arrays. The vector register at level
\(\log {2}=1\)) PU array. All the reduced values corresponding to a window must be combined into a single value, which is not possible using the level-to-level PU interconnections since the reduced values lie at different PU arrays. Therefore, the reduced values are aligned and forwarded using the shifter and the vector register between the arrays. The vector register at level
i stores the output of the
\((i-1)^{th}\) PU array. The reduced values
\(r_{1}^1\) and
\(r_{2}^1\) of size 1 are stored at positions 5 and 6 inside the first vector register, respectively. These values need to be added to
\(r_{1}^2\),
\(r_{2}^2\), produced by the first and second PUs of the second PU array. Since
\(r_{1}^1\) and
\(r_{2}^1\) are produced early, they are moved to positions 1 and 2 by shifting the first vector register by 4 units. These two values are then forwarded, and the first and second PUs add the forwarded value to the inputs received from their predecessor PUs in the tree to produce the final outputs. The stencil stage also supports other reduction operators, such as max or min over a window. Since these operators do not require multiplication with filter coefficients, the stencil coefficients are set to 1 and the PUs are configured to accumulate the partial results using the maximum or minimum function instead of multiply-and-accumulate.
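The shift-and-forward mechanism of part (a) of Figure 7 can be checked with the behavioural sketch below. It assumes the cascaded add described earlier and, for this small example, collapses the accumulation into a single cascaded-add array; it reproduces the two \(1 \times 3\) dot products from the [2,2,1,1] layout.

```python
def stencil_1x3_example(w1, w2, coeffs):
    """Behavioural model of Figure 7(a): two 1x3 windows laid out as
    [2, 2, 1, 1] across the first six PUs of the stencil stage."""
    # First PU array: multiply each laid-out value by its coefficient.
    laid_out = [w1[0], w1[1], w2[0], w2[1], w1[2], w2[2]]
    c = [coeffs[0], coeffs[1], coeffs[0], coeffs[1], coeffs[2], coeffs[2]]
    products = [v * k for v, k in zip(laid_out, c)]   # first vector register
    # Size-1 reduced values sit at positions 5 and 6; shift left by 4 units
    # so they can be forwarded to the first two PUs of the next array.
    forwarded = products[4:6]
    # Second PU array (cascaded add): pairwise sum plus the forwarded value.
    out1 = (products[0] + products[1]) + forwarded[0]
    out2 = (products[2] + products[3]) + forwarded[1]
    return out1, out2

# Check against a direct dot product of each window with the coefficients.
w1, w2, coeffs = [1, 2, 3], [4, 5, 6], [7, 8, 9]
assert stencil_1x3_example(w1, w2, coeffs) == (
    sum(a * b for a, b in zip(w1, coeffs)),
    sum(a * b for a, b in zip(w2, coeffs)),
)
```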
Pointwise Stage:- The pointwise stage can obtain input either from the stencil stage or directly from the input vector, bypassing the stencil stage. The PU arrays within this stage are all of equal length, and data exchange occurs through an all-to-all routing network between the arrays. This network consists of multiplexers that connect the output of a PU at level \(i\) to the inputs of the PUs at level \(i+1\). The inter-level multiplexers are configured by control words generated by the host, which also determine the operation of the scalar units within each PU. In cascaded mode, a PU utilizes both scalar units and all three input values: the first scalar unit processes the first two inputs, while the second scalar unit operates on the output of the first unit and the third input value. A ternary operator of the form \(E_1 \odot E_2 \,?\, E_3 : E_4\) is treated as a pointwise operation. Here \(E_1\) through \(E_4\) are pointwise expressions, and \(\odot\) is a relational operator. This operation is processed by the pointwise stage as follows. Assume the expressions \(E_1\) through \(E_4\) have already been computed. Two PUs execute the operations \(R_1 = E_1 \odot E_2\) and \(R_2 = \lnot (E_1 \odot E_2)\); the values of \(R_1\) and \(R_2\) are either 1 or 0. Following this, two more PUs in the next level compute \(S_1 = R_1 \times E_3\) and \(S_2 = R_2 \times E_4\). The value of \(S_1\) is either \(E_3\) or 0, and the value of \(S_2\) is either \(E_4\) or 0. Finally, \(S_1\) is added to \(S_2\) to produce the final value. In summary, after computing the four pointwise expressions inside the ternary operator, the pointwise stage utilizes 5 PUs spread across 3 arrays to produce the final output.
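A minimal sketch of this lowering is given below; the relational operator is a placeholder, and the comments indicate which of the 5 PUs across 3 arrays would perform each step.

```python
def ternary_pointwise(e1, e2, e3, e4, rel=lambda a, b: a > b):
    """Model of mapping E1 (rel) E2 ? E3 : E4 onto five PUs spread across
    three pointwise arrays. The relational operator `rel` is illustrative."""
    r1 = 1 if rel(e1, e2) else 0      # PU 1, array i:   E1 (rel) E2
    r2 = 1 - r1                       # PU 2, array i:   not (E1 (rel) E2)
    s1 = r1 * e3                      # PU 3, array i+1: either E3 or 0
    s2 = r2 * e4                      # PU 4, array i+1: either E4 or 0
    return s1 + s2                    # PU 5, array i+2: final sum / select

# Example: behaves like (5 > 3) ? 10 : 20 -> 10.
assert ternary_pointwise(5, 3, 10, 20) == 10
```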
Upsample-Downsample Stage:- In this stage, multiple images can be downsampled simultaneously, or a single image can be upsampled, by a factor of two. In the case of downsampling, a single PU is responsible for the operation. A scalar unit inside the PU is configured for the downsampling, and a PU register is set up with the row length of the input image to be downsampled. As the input data is received, the scalar unit increments a counter to keep track of the row and column index of the image and outputs data only from every other row and column, effectively halving the image in both dimensions.
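The sketch below models this behaviour in software, under the assumption that the retained pixels are those at even row and column indices; the counter-to-index bookkeeping mirrors the description above.

```python
def downsample_by_two(pixels, row_length):
    """Model of the downsample PU: a counter tracks the row and column of
    each incoming pixel, and only pixels from every other row and column
    (assumed here to be the even indices) are emitted."""
    out, count = [], 0
    for p in pixels:                     # pixels arrive as a stream
        r, c = divmod(count, row_length)
        if r % 2 == 0 and c % 2 == 0:
            out.append(p)
        count += 1
    return out

# A 4x4 image streamed row by row produces a 2x2 image.
stream = list(range(16))
print(downsample_by_two(stream, 4))  # [0, 2, 8, 10]
```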
The upsample operation is processed by multiple PUs along with the upsampler module. If the image to be upsampled was produced earlier, it is forwarded to this stage through intermediate PUs in the pointwise stage. The image is upsampled by generating a \(2\times 2\) matrix for every input value received. More precisely, for every pixel \(w_{r,c}\) belonging to the \(r^{th}\) row and \(c^{th}\) column of the input image, a matrix \(W =[0,0,0,w_{r,c}]\) is generated. A total of four PUs generate \(W\): the first three PUs are configured to produce a 0 in the output, and the fourth PU forwards the received input \(w_{r,c}\). The matrix \(W\) is read by the upsampler module, which stores it across four internal memory banks \(U_{0}\) through \(U_3\) in a striped fashion. Note that these banks are separate from the CU memory banks and are exclusive to the upsampler module. The data in the four banks is collated and stored sequentially as a single upscaled image inside one of the CU memory banks by the upsampler module. This is done by emptying the banks \(U_0\) through \(U_3\) in an interleaved fashion. \(U_0\) contains data from the even rows (\(0, 2, 4, \ldots\)) and even columns (\(0, 2, 4, \ldots\)) of the upscaled image. \(U_1\) contains data from the same rows but odd columns. \(U_2\) and \(U_3\) contain odd rows with even and odd columns, respectively. The drain sequence starts by reading a single value from \(U_0\) followed by a value from \(U_1\), alternating between the two until row 0 is filled in \(B\), where \(B\) is a CU memory bank. The same alternating sequence is then repeated with \(U_2\) and \(U_3\) until row 1 is filled in \(B\), at which point the upsampler module switches back to the first two banks. This interleaved sequence is repeated until all the rows of the upscaled image are buffered in \(B\), marking the end of the upsample operation.
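The sketch below models the upsample path end to end, under the assumption that the striping places the non-zero element \(w_{r,c}\) of \(W\) in \(U_3\) (odd row, odd column of the upscaled image); the bank names follow the text, while the exact element-to-bank assignment is our reading of the striping.

```python
def upsample_by_two(image):
    """Model of the upsample path: every pixel w of the input produces a
    2x2 block [0, 0, 0, w] striped over four banks U0..U3, which are then
    drained in the interleaved order described above to build the
    upscaled image in a single CU memory bank B."""
    rows, cols = len(image), len(image[0])
    flat = [image[r][c] for r in range(rows) for c in range(cols)]
    u0, u1, u2 = [[0] * len(flat) for _ in range(3)]  # zero elements of W
    u3 = flat                                         # w_{r,c} elements of W (assumed placement)
    b = []                                            # the CU memory bank B
    for r in range(rows):
        # Even output row: alternate one value from U0 and one from U1.
        b.append([v for c in range(cols)
                  for v in (u0[r * cols + c], u1[r * cols + c])])
        # Odd output row: alternate one value from U2 and one from U3.
        b.append([v for c in range(cols)
                  for v in (u2[r * cols + c], u3[r * cols + c])])
    return b

# A 2x2 image becomes a 4x4 image with the pixels at odd rows and columns.
print(upsample_by_two([[1, 2], [3, 4]]))
# [[0, 0, 0, 0], [0, 1, 0, 2], [0, 0, 0, 0], [0, 3, 0, 4]]
```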