1 Introduction

In recent years, real-time vision systems on embedded hardware have become ubiquitous, driven by growing demand from applications such as autonomous driving, edge computing and remote monitoring. Field-Programmable Gate Arrays (FPGAs) offer the speed and flexibility to architect tightly integrated designs that are power- and resource-efficient, which has led to FPGAs being integrated into many applications [1]. Often these designs consist of many low- to high-level image processing algorithms that form a pipeline [2]. Increasingly, the race for faster processing encourages hardware application developers to optimise these algorithms.

Traditionally, optimisations are domain-agnostic and developed for general-purpose computing. The majority of these optimisations aim to improve throughput and resource usage by increasing the number of parallel operations [3], memory bandwidth [4] or operations per clock cycle [5]. In contrast, domain-specific optimisations are specialised to a particular domain and can potentially achieve larger gains, both in faster processing and in reduced power consumption. This paper proposes domain-specific optimisation techniques on FPGAs that exploit inherent knowledge of the image processing pipeline.

Optimisations can be divided into two categories: general-purpose and domain-specific. In image processing, domain-specific optimisations enable a significant reduction of computational load while maintaining sufficient accuracy. Example optimisations include down-sampling [6], approximation [7], data-type conversion [8], kernel size reduction [9], bit-width reduction [10] and removing operations entirely. Although optimisation of algorithms on hardware accelerators (CPUs, GPUs and FPGAs) has been extensively researched [11,12,13], these efforts target only the algorithms themselves. In contrast, there has been very little work on domain-specific optimisation of imaging algorithms on FPGAs. Qiao et al. [14] proposed a minimum-cut technique to search fusible kernels recursively to improve data locality. Rawat et al. [15] proposed multiple tiling strategies that improved shared memory and register usage. However, such papers propose constrained domain-specific optimisation strategies that exclusively target CPU and GPU hardware. Reiche et al. [16] applied domain knowledge to optimise image processing accelerators using high-level abstraction tools such as domain-specific languages (DSLs) and reusable IP-Cores. Other optimisation strategies, such as loop unrolling, fission and fusion, do not translate well onto FPGA designs. In demonstrating our proposition, we present a thorough analysis of well-known image processing algorithms, emerging CNN architectures (MobileNetV2 [17] & ResNet50 [18]) and the Scale Invariant Feature Transform (SIFT) [19]. MobileNetV2 was selected for its popular use within embedded systems, ResNet50 because it consistently obtains higher accuracy rates than other available architectures, and SIFT because it is the most popular feature extraction algorithm owing to its performance and accuracy. The algorithmic properties are exploited with the proposed domain-specific optimisation strategies. The optimised designs are evaluated and compared with other generally optimised hardware designs regarding performance, energy consumption and accuracy. The main contributions of this paper are:

  • Proposition of three domain-specific optimisation strategies for image processing and analysis of their impact on performance, power and accuracy; and

  • Validation of the proposed optimisations on widely used representative image processing algorithms and CNN architectures (MobileNetV2 & ResNet50), profiling their various components to identify the common features and properties that have the potential for optimisation.

2 Domain-Specific Optimisations

Image processing algorithms typically form a pipeline with a series of processing blocks. Each processing block consists of a combination of low-, mid- and high-level imaging operations, ranging from colour conversion and filtering to histogram generation, feature extraction, object detection or tracking. Any approximation or alteration to an individual processing block or to the pipeline has an impact on the final outcome, such as overall accuracy or run-time. However, depending on the application, such alterations are acceptable as long as they remain within a certain error range (e.g., \(\sim \pm 10\%\)).

Many image processing algorithms share common functional blocks and features. Such features are useful in forming domain-specific optimisation strategies. Within the scope of this work, we profile and analyse image processing algorithms to identify potential areas for optimisation. However, such optimisations impact algorithmic accuracy, and it is therefore important to identify the trade-off between performance, power, resource usage and accuracy.

We hypothesise that this domain knowledge, e.g., of the processing pipeline, the individual processing blocks or algorithmic performance, can be used to gain significant improvements in run-time and power consumption, especially in FPGA-based resource-limited environments. Based on the common patterns observed in a variety of image processing applications, this section proposes three domain-specific optimisation (DSO) strategies: 1) downsampling, 2) datatype conversion and 3) convolution kernel size reduction. On the flip side, these optimisations can lower accuracy in return for gains in speed and energy consumption. We compare the effectiveness of these optimisations against benchmark FPGA, GPU and CPU implementations and show the impact on accuracy. The three strategies are discussed below:

2.1 Optimisation I: Down Sampling

Down/subsampling optimisation reduces data dimensionality while largely preserving image structure, and hence accelerates run-time by lowering the number of computations across the pipeline. Sampling-rate conversion operations such as downsampling/subsampling are widely used within many application pipelines (e.g., low bit-rate video compression [6] or pooling layers in Convolutional Neural Networks (CNNs) [20]) to reduce computation, memory and transmission bandwidth. Image downsampling reduces the spatial resolution while retaining as much information as possible. Many image processing algorithms use this technique to decrease the number of operations by removing every other row/column of an image to speed up execution. However, the major drawback is the loss of image accuracy due to the removal of pixels. We apply the downsampling optimisation using bilinear interpolation and measure both run-time and accuracy.
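
As an illustration, a minimal Python sketch of this optimisation using OpenCV's bilinear interpolation; the file name and the factor of 2 are placeholders, not values from our experiments:

```python
import cv2

def downsample(image, factor=2):
    """Reduce spatial resolution by `factor` using bilinear interpolation."""
    h, w = image.shape[:2]
    # cv2.resize takes (width, height); INTER_LINEAR selects bilinear filtering.
    return cv2.resize(image, (w // factor, h // factor),
                      interpolation=cv2.INTER_LINEAR)

# Halving a 1920x1080 frame cuts the per-pixel workload of every
# downstream pipeline stage by roughly 4x.
small = downsample(cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE))
```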

2.2 Optimisation II: Datatype

Bit-width reduction through datatype conversion (e.g., floating-point (FP) to integer) significantly reduces the cost of arithmetic operations, resulting in improved run-time at lower algorithmic accuracy. Whilst quantising from FP to integer representations is common in the software domain, one of the advantages of reconfigurable hardware is the capability to reduce bit-widths to arbitrary sizes (e.g., 7, 6, 5, 4 bits) as a trade-off between accuracy and power/performance [21,22,23,24].

In the field of image processing, the majority of algorithms are inherently developed using FP calculations. Although FP offers a higher-accuracy representation, it is more expensive to compute: the large number of arithmetic operations increases resource usage (higher bit-width) and energy consumption. The substitute for floating point is fixed-point arithmetic, in which the point separating the integer and fractional parts has a fixed location. However, a fixed-point representation, while gaining speed, loses accuracy relative to the FP representation. A datatype conversion optimisation is proposed here in which all operation stages are converted from FP to integer, and the impact on performance and accuracy is recorded.
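
The conversion idea can be sketched in software as follows; this is a behavioural model only (not the hardware datapath), and the signed 16-bit format with 8 fractional bits is an assumed example:

```python
import numpy as np

def to_fixed_point(x, frac_bits=8, total_bits=16):
    """Quantise a float array to signed fixed-point with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int32)

def from_fixed_point(q, frac_bits=8):
    """Recover the approximate real value for error analysis."""
    return q.astype(np.float64) / (1 << frac_bits)

coeffs = np.array([0.0625, 0.25, 0.375, 0.25, 0.0625])  # FP filter taps
q = to_fixed_point(coeffs)                              # integer taps for the datapath
max_err = np.abs(coeffs - from_fixed_point(q)).max()    # quantisation error bound
```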

2.3 Optimisation III: Convolution Kernel Size

Convolution kernel size optimisation reduces computational complexity, which is directly proportional to the square of the kernel size, i.e., \(\mathcal {O}(n^2)\) or quadratic complexity. Convolution is a fundamental operation used in most image processing algorithms that modify the spatial frequency characteristics of an image. Given a kernel of size \(n \times n\) and an image of size \(M \times N\), convolving the image requires \(n^2MN\) multiplications and additions. For a given image, the complexity therefore depends on the kernel size, leading to \(\mathcal {O}(n^2)\). Reducing the kernel size significantly lowers the number of computations; e.g., replacing a \(5 \times 5\) kernel with a \(3\times 3\) kernel reduces the computation by a factor of \(\sim \times 2.7\). We therefore propose this as an ideal target for optimisation, i.e., using a smaller kernel size, which may however come at the cost of accuracy.
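
The quoted factor follows directly from the \(n^2MN\) operation count; a one-line check (assuming a \(1920\times 1080\) image):

```python
def conv_madds(n, image=(1920, 1080)):
    """Multiply-accumulate count for an n x n kernel over an M x N image."""
    M, N = image
    return n * n * M * N

print(conv_madds(5) / conv_madds(3))  # 25/9 ~= 2.78, the ~x2.7 reduction above
```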

3 Case Study Algorithms

To apply the optimisations proposed in Section 2, this section gives a brief description of the representative algorithms and architectures to which the selected optimisations will be applied:

Figure 1: SIFT Algorithmic Block Diagram.

3.1 SIFT

SIFT [19] is one of the most widely used feature extraction algorithms. To demonstrate the proposed optimisations, we have implemented various versions of SIFT, which consists of two main components and several sub-components, as shown in Fig. 1 and described below.

Figure 2: (a) Scale-Space Hardware Block Diagram; (b) Extrema Detection in Local Space/Scale Neighbourhood.

3.1.1 Scale-Space Construction

Gaussian Pyramid

The Gaussian pyramid \(L(x,y,\sigma )\) is constructed by taking an input image \(I(x,y)\) and convolving it at different scales with a Gaussian kernel \(G(x,y,\sigma )\):

$$\begin{aligned} G(x,y,\sigma ) = \frac{1}{2 \pi \sigma ^2} e^{- \frac{x^2 + y^2}{2 \sigma ^2}}, \end{aligned}$$
(1)
$$\begin{aligned} L(x,y,\sigma )=G(x,y,\sigma ) * I(x,y), \end{aligned}$$
(2)

where \(\sigma\) is the standard deviation of the Gaussian distribution. The input image is then halved in size to form a new octave, a new set of Gaussian-blurred images. The number of octaves and scales can be changed depending on the requirements of the application.
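
A minimal Python sketch of the scale-space construction; the defaults \(\sigma_0=1.6\) and \(k=\sqrt{2}\) are common choices from the SIFT literature and are assumptions here rather than the exact values of our hardware design:

```python
import cv2

def gaussian_pyramid(image, octaves=2, scales=4, sigma0=1.6, k=2 ** 0.5):
    """Build L(x, y, sigma) per Eqs. 1-2: `scales` blurred images per octave,
    halving the image between octaves."""
    pyramid = []
    for _ in range(octaves):
        # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
        octave = [cv2.GaussianBlur(image, (0, 0), sigma0 * k ** s)
                  for s in range(scales)]
        pyramid.append(octave)
        image = cv2.resize(image, (image.shape[1] // 2, image.shape[0] // 2))
    return pyramid
```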

The implemented block design reads pixel data of the input images into a line buffer, as shown in Fig. 2a. The operations in this stage are processed in parallel for maximum throughput because the large number of matrix multiplication operations greatly impacts the run-time. This stage is the most computationally intensive, making it an ideal candidate for optimisation.

The Difference of Gaussians \({DOG}(x,y,\sigma )\) in Eq. 3 is obtained by subtracting the blurred images at two adjacent scales, separated by the multiplicative factor k:

$$\begin{aligned} \textit{DOG}(x,y,\sigma )=L(x,y,k\sigma )-L(x,y,\sigma ). \end{aligned}$$
(3)

The minima and maxima of the DOG are detected by comparing pixels between scales, as shown in Fig. 2b. This identifies points that best represent a region of the image. The local extrema are detected by comparing each pixel with its 26 neighbours in the scale space (8 neighbouring pixels within the same scale and 9 neighbours in each of the scales above and below). Simultaneously, candidate keypoints with low contrast or located on an edge are removed.
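
The neighbourhood test can be sketched as follows, assuming the DOG images of one octave are stacked into a 3-D array indexed as (scale, y, x):

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """True if pixel (y, x) at scale s is the min or max of its 3x3x3
    neighbourhood: 8 same-scale neighbours plus 9 in each adjacent scale.
    Assumes 1 <= s <= dog.shape[0] - 2 and an interior (y, x)."""
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    centre = dog[s, y, x]
    return centre == cube.max() or centre == cube.min()
```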

3.1.2 Descriptor Generation

Magnitude & Orientation Assignment

Inside the SIFT descriptor process shown in Fig. 3, the magnitude and orientation are computed for every pixel within a window around each keypoint and then assigned to each feature based on the local image gradient. Considering L to be the scale of the feature points, the gradient magnitude \(m(x,y)\) and the orientation \(\theta (x,y)\) are calculated as:

$$\begin{aligned} m(x,y) =\sqrt{\left( L(x+1,y)-L(x-1,y)\right) ^{2}+\left( L(x,y+1)-L(x,y-1)\right) ^{2}}, \end{aligned}$$
(4)
$$\begin{aligned} \theta (x,y) =\tan ^{-1}\left( \frac{L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)} \right) . \end{aligned}$$
(5)

Once the gradient direction is obtained for the pixels in the neighbourhood window, a 36-bin histogram is generated. The magnitudes are Gaussian-weighted and accumulated in each histogram bin. In the implementation, \(m(x,y)\) and \(\theta (x,y)\) are computed using the CORDIC algorithm [25] in vector mode to map efficiently onto an FPGA.
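
A software model of vector-mode CORDIC is sketched below. A hardware version replaces the arctangent terms and the gain with precomputed constants and shift-add logic; the 16-iteration count is an assumption:

```python
import math

def cordic_vector(x, y, iterations=16):
    """Vector-mode CORDIC: rotate (x, y) onto the x-axis, accumulating the
    rotation angle. Returns (magnitude, angle); assumes x > 0."""
    angle = 0.0
    for i in range(iterations):
        d = 1.0 if y < 0 else -1.0            # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)     # table lookup in hardware
    gain = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(iterations))
    return x / gain, angle                    # m(x, y) and theta(x, y)

m, theta = cordic_vector(3.0, 4.0)            # ~ (5.0, 0.927 rad)
```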

Figure 3: Magnitude & Orientation Assignment and Keypoint Descriptor Generation.

3.1.3 Keypoint Descriptor

After calculating the gradient directions around a selected keypoint, a feature descriptor is generated. First, a \(16\times 16\) neighbourhood window is constructed around the keypoint and divided into sixteen \(4\times 4\) blocks. An 8-bin orientation histogram is computed in each block. The generated descriptor consists of all histogram values, resulting in a vector of \(16 \times 8 = 128\) numbers. The 128-dimensional feature vector is normalised to make it robust to rotation and illumination changes.
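
A NumPy sketch of the descriptor construction, assuming precomputed \(16\times 16\) windows of gradient magnitude and orientation (Gaussian weighting of the magnitudes is omitted for brevity):

```python
import numpy as np

def keypoint_descriptor(mag, ori):
    """128-D descriptor from 16x16 gradient magnitudes `mag` and orientations
    `ori` (radians): sixteen 4x4 blocks, an 8-bin histogram each, normalised."""
    desc = []
    for by in range(0, 16, 4):
        for bx in range(0, 16, 4):
            bins = ((ori[by:by + 4, bx:bx + 4] % (2 * np.pi))
                    // (np.pi / 4)).astype(int)            # 8 orientation bins
            hist = np.bincount(bins.ravel(), minlength=8,
                               weights=mag[by:by + 4, bx:bx + 4].ravel())
            desc.extend(hist[:8])
    desc = np.asarray(desc)                                # 16 x 8 = 128 values
    return desc / (np.linalg.norm(desc) + 1e-12)           # illumination robustness
```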

3.2 Digital Filters

Digital filters are a tool in image processing for extracting useful information from noisy signals. They are commonly used for tasks such as smoothing, edge detection and feature extraction. Filters operate by applying a kernel, a small matrix of values, to each pixel of an image. The kernel is convolved with the image, and the resulting output value is placed in the corresponding pixel location of the output image, as shown in Eq. 6, where \(I(x,y)\) is the input image and \(K(k_x,k_y)\) is the kernel. The convolution result \(O(x,y)\) is calculated by:

$$\begin{aligned} O(x, y) = \sum _{k_x} \sum _{k_y} I(x - k_x, y - k_y) \cdot K(k_x, k_y) \end{aligned}$$
(6)

The indices \(k_x\) and \(k_y\) correspond to the coordinates of the kernel K, while x and y correspond to the coordinates of the output image O.
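
A direct (unoptimised) Python reference of Eq. 6, useful as a golden model when verifying a hardware convolution block; border pixels where the kernel overhangs the image are simply skipped:

```python
import numpy as np

def convolve2d_ref(image, kernel):
    """O(x, y) = sum_kx sum_ky I(x - kx, y - ky) * K(kx, ky)."""
    kh, kw = kernel.shape
    oy, ox = kh // 2, kw // 2
    out = np.zeros_like(image, dtype=np.float64)
    flipped = kernel[::-1, ::-1]              # flip = true convolution
    for y in range(oy, image.shape[0] - oy):
        for x in range(ox, image.shape[1] - ox):
            patch = image[y - oy:y + oy + 1, x - ox:x + ox + 1]
            out[y, x] = np.sum(patch * flipped)
    return out
```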

Figure 4: Common image filter kernels.

3.2.1 Box

The box filter is a simple spatial smoothing technique that convolves the image with the kernel shown in Fig. 4a, replacing each pixel value with the average of its neighbouring pixels. This has the effect of reducing high-frequency noise while preserving the edges and important details of the image. The box filter is also computationally efficient and easy to implement, making it a popular choice for many image processing applications. However, it can cause blurring and loss of sharpness if the kernel size is too large.

3.2.2 Gaussian

The Gaussian filter is a widely used linear filter in image processing and computer vision. It is a type of low-pass filter that removes high-frequency noise while preserving the edges in an image. The filter works by convolving the image with a Gaussian kernel in Fig. 4b, which is a normalised two-dimensional Gaussian distribution. The Gaussian kernel has a circularly symmetric shape and can be expressed mathematically as:

$$\begin{aligned} G(x,y) = \frac{1}{2\pi \sigma ^2} e^{-\frac{x^2+y^2}{2\sigma ^2}} \end{aligned}$$
(7)

where \(\sigma\) is the standard deviation of the Gaussian distribution, and x and y are the distances from the centre of the kernel. The size of the kernel and the value of \(\sigma\) determine the amount of smoothing applied to the image.
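
Eq. 7 can be sampled into a discrete kernel as follows; the size and \(\sigma\) defaults are placeholders:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sampled, normalised 2-D Gaussian (Eq. 7), centred on the kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()   # normalise so the filter preserves mean brightness
```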

3.2.3 Sobel

The Sobel filter is an edge-detection filter that uses the two kernels shown in Fig. 4c, one for horizontal changes (the x kernel) and one for vertical changes (the y kernel). The Sobel filter convolves each of these kernels with the image and then computes the gradient magnitude at each pixel using the formula:

$$\begin{aligned} \sqrt{G_x^2 + G_y^2} \end{aligned}$$
(8)

where \(G_x\) and \(G_y\) are the convolved images using the x and y kernels, respectively. The resulting gradient image highlights edges in the original image and the direction of the edge can be determined by calculating the angle of the gradient using:

$$\begin{aligned} \theta = \tan ^{-1}(G_y / G_x) \end{aligned}$$
(9)
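
A compact sketch of the Sobel computation (Eqs. 8 and 9), using SciPy's 2-D convolution as a stand-in for the hardware convolution engine:

```python
import numpy as np
from scipy.signal import convolve2d

Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # horizontal-change kernel
Ky = Kx.T                                            # vertical-change kernel

def sobel(image):
    """Return gradient magnitude (Eq. 8) and direction (Eq. 9)."""
    Gx = convolve2d(image, Kx, mode="same")
    Gy = convolve2d(image, Ky, mode="same")
    return np.sqrt(Gx ** 2 + Gy ** 2), np.arctan2(Gy, Gx)
```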

3.3 Convolutional Neural Network

Convolutional Neural Networks (CNNs) are a class of deep neural networks typically applied to images to recognise and classify particular features. A CNN architecture typically consists of a combination of convolution, pooling and fully connected layers, as shown in Fig. 5.

Figure 5: Typical layers implemented within CNN Architectures.

The convolution layers extract features by applying a convolution operation to the input image using a set of learnable filters (also called kernels or weights) designed to detect specific features. The output of the convolution operation is a feature map, which is then passed through a non-linear activation function, such as ReLU, to introduce non-linearity into the network. The convolutional layers can be stacked to form a deeper architecture, where each layer is designed to detect more complex features than the previous one. In addition, it is the most computationally intensive layer because each output element in the feature map is computed by repeatedly taking a dot product between the filter and a local patch of the input, which results in a large number of multiply-add operations.

The pooling layers are responsible for reducing the spatial size of the feature maps while retaining important information. The most common types of pooling are max pooling and average pooling. These layers typically use a small window that moves across the feature map and selects the maximum or average value within the window. This operation effectively reduces the number of parameters in the network and helps to reduce overfitting.
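
A plain-Python model of max pooling over a single feature map, assuming the common \(2\times 2\) window with stride 2:

```python
import numpy as np

def max_pool2d(fmap, window=2, stride=2):
    """Slide a window over the feature map and keep the maximum in each position."""
    h, w = fmap.shape
    out_h, out_w = (h - window) // stride + 1, (w - window) // stride + 1
    out = np.empty((out_h, out_w), fmap.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + window,
                             j * stride:j * stride + window].max()
    return out   # spatial size reduced, dominant activations retained
```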

The fully connected layers make predictions based on the extracted features. These layers take the output from the convolutional and pooling layers and apply a linear transformation to the input, followed by a non-linear activation function. The fully connected layer usually has the same number of neurons as the number of classes in the dataset, and the output of this layer is passed through a softmax activation function to produce probability scores for each class. A CNN architecture also includes normalisation layers such as batch normalisation, dropout layers that are used to regularise the network and reduce overfitting, and an output layer that produces the final predictions.

4 Experimental Results and Discussion

We verify the proposed optimisations on the SIFT, Box, Gaussian and Sobel algorithms (Fig. 6), as well as on the MobileNetV2 and ResNet50 CNN architectures. This is achieved by creating baseline benchmarks on three target hardware platforms (CPU, GPU and FPGA), followed by realisations of the optimisations individually and combined. The CPU and GPU versions of the Filter and SIFT algorithms are implemented using OpenCV [26]. The PyTorch library is used to implement the CNN architectures and optimisations; both architectures are pre-trained on the ImageNet classification dataset. The FPGA implementations are developed using Verilog (SIFT/Filter) and HLS (CNN). All baseline algorithms and CNN models use 32-bit floating point (FP32). An uncompressed grayscale 8-bit \(1920\times 1080\) input image is used for the SIFT algorithm, and each sub-operation is profiled. Details of the target hardware/software environments and power measurement tools are given in Table 1.

Table 1 Summary Table: Hardware/Software Environment & Measurement Tools.
Figure 6: Filter Algorithms Applied onto Input Image.

Dataset

The input images used in the CNN and Filter experiments are from the LIU4K-v2 dataset [31]. The dataset contains 2000 high-resolution \(3840\times 2160\) images with various backgrounds and objects.

4.1 Performance Metrics

As part of the evaluation process, we use three performance metrics: 1) execution time, 2) energy consumption and 3) accuracy.

4.1.1 Execution Time

The execution time on the CPU and GPU platforms is measured using timing libraries with the smallest available tick period. Each algorithm/operation is run for 1000 iterations and averaged to minimise the effect of competing processes on the measurement, especially on the CPU. The GPU has an initialisation time, which is measured and removed from the results. The timing simulation integrated into the Vivado design suite is used to measure time on the FPGA platform. The experiments exclude the time for image reads and writes to external memory. We compute frames per second (FPS) as the inverse of the execution time:

$$\begin{aligned} \text {FPS}= 1/\text {Execution Time}. \end{aligned}$$
(10)
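
A sketch of the CPU/GPU measurement harness; Python's perf_counter stands in for the platform timing libraries, and the warm-up loop models the removal of one-off initialisation costs described above:

```python
import time

def profile(fn, *args, iterations=1000, warmup=10):
    """Average wall-clock time of fn(*args) over many runs."""
    for _ in range(warmup):          # absorb caches and GPU initialisation
        fn(*args)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(*args)
    exec_time = (time.perf_counter() - start) / iterations
    return exec_time, 1.0 / exec_time   # execution time and FPS (Eq. 10)
```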

4.1.2 Power Consumption

Two common methods for measuring power are software- and hardware-based. Accurately estimating power consumption is a challenge for software-based methods, which make underlying assumptions in their models and may not measure other components within the platform. In addition, taking the instantaneous wattage or theoretical TDP of a device is not accurate, since power consumption varies with the specific workload. Therefore, we obtain the total energy consumed by measuring power over the duration of the algorithm's execution. A script automatically starts and stops the measurements during execution and extracts the power values from the software.

Using the power analyser within the Vivado design suite and the MaxPower tool, we measure the FPGA power consumption in two parts: (1) static power and (2) dynamic power. Static power is consumed when there is no circuit activity and the system remains idle; dynamic power is consumed when the design is actively performing tasks. The power consumption of the CPU and GPU is obtained using the HWMonitor and Nvidia-smi software. The energy consumed by the Filter and CNN algorithms is calculated as:

$$\begin{aligned} \text {Energy} = (\text {Power} * \text {Execution Time}). \end{aligned}$$
(11)

Additionally, to allow a fair comparison across the target hardware for the SIFT algorithm, we normalise the energy as energy per operation (EPO):

$$\begin{aligned} \text {EPO} = (\text {Power} * \text {Execution Time})/\text {Operations}. \end{aligned}$$
(12)
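
Both metrics reduce to simple arithmetic once power, execution time and operation count are known; the numbers in the example below are hypothetical:

```python
def energy_metrics(power_w, exec_time_s, operations):
    """Eq. 11 (energy, J) and Eq. 12 (energy per operation, nJ/Op)."""
    energy = power_w * exec_time_s
    return energy, energy / operations * 1e9

# e.g., 10 W sustained over 20 ms for 2e7 operations -> 0.2 J, 10 nJ/Op
print(energy_metrics(10.0, 0.02, 2e7))
```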

4.1.3 Accuracy

Expecting the optimisations to impact overall algorithmic accuracy, we capture this by measuring the Euclidean distance between the descriptors generated by the CPU (our comparison benchmark) and the descriptor output produced by the FPGA. The Euclidean distance d(x,y) is calculated in Eq. 13, where x and y are descriptor vectors and K is the number of keypoints generated.

$$\begin{aligned} d(x,y)=\sqrt{\sum _{i=1}^{K} (x_{i}-y_{i})^{2}}. \end{aligned}$$
(13)

Subsequently, the accuracy for each Euclidean distance is calculated using Eq. 14:

$$\begin{aligned} \text {Accuracy} = 100 - \left( \left( \frac{\text {Euclidean Distance}}{\text {Max Distance}} \right) \times 100\right) \end{aligned}$$
(14)

The Euclidean Distance denotes the distance between the two descriptor vectors being compared, and Max Distance represents the maximum Euclidean distance found across the compared descriptors. The accuracy is scaled so that 100% indicates identical descriptors, while 0% indicates completely dissimilar descriptors.
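
A sketch of this accuracy computation over a set of matched descriptor pairs; the (K, 128) array shapes are assumptions:

```python
import numpy as np

def descriptor_accuracy(cpu_desc, fpga_desc):
    """Eqs. 13-14: per-keypoint Euclidean distance, rescaled so that
    100% means identical descriptors. Inputs: (K, 128) arrays."""
    dist = np.linalg.norm(cpu_desc - fpga_desc, axis=1)  # Eq. 13 per keypoint
    return 100.0 - (dist / dist.max()) * 100.0           # Eq. 14
```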

We use the root mean square error (RMSE) to compare the input image to the output images produced by each hardware accelerator to determine pixel accuracy. RMSE is defined as:

$$\begin{aligned} RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_{i} - x_{i})^{2}} \end{aligned}$$
(15)

where \(y_{i}\) and \(x_{i}\) are the pixel intensity values of the output and input images, respectively, and n is the total number of pixels in the image.
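
A minimal helper consistent with Eq. 15:

```python
import numpy as np

def rmse(output, reference):
    """Root mean square error over all n pixels of two equal-sized images."""
    diff = output.astype(np.float64) - reference.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))
```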

The accuracy of the CNN architecture is measured by taking the number of correct predictions divided by the total number of predictions:

$$\begin{aligned} \text {Accuracy} = \frac{\text {Number of Correct Predictions}}{\text {Total Number of Predictions}} \times 100 \end{aligned}$$
(16)

A high accuracy indicates that the model is making accurate predictions, while a low accuracy suggests room for improvement in the model’s performance.

4.2 Results and Discussions

This section evaluates the algorithms in three categories: feature extraction (SIFT), filter algorithms (Box, Gaussian, Sobel) and Convolutional Neural Networks (MobileNetV2, ResNet50).

4.2.1 SIFT

We obtain results for FPGA implementations of the SIFT algorithm under the various optimisations and combinations thereof. Two sets of results are captured for (octave, scale) configurations of (2,4) and (4,5), as these are regularly reported in the literature for SIFT implementations on FPGA. The results are primarily obtained at a target frequency of 300 MHz for the various components of SIFT; execution time and accuracy are reported in Table 2, along with FPS numbers in Fig. 7. Finally, for completeness, we report the resource and power usage statistics for the optimised configurations at 300 MHz in Table 3.

Table 2 SIFT: Optimisation Result Summary, 300 MHz Configuration (Octave, Scale).
Table 3 SIFT: Resource Usage Summary of all Optimisations: Downsampling, \(3 \times 3\) Kernel & Integer Arithmetic Configuration.
Figure 7: SIFT: FPS and Accuracy for each optimisation on both configurations (octave, scale).

In terms of individual optimisations on the base FPGA implementation, downsampling and integer optimisations reduced accuracy the most, in exchange for a greater reduction in run-time. On the other hand, the \(3\times 3\) kernel size (down from the default \(5\times 5\)) gave better accuracy but only a small improvement in overall run-time. In the case of combined optimisations, the downsampling-and-integer combinations greatly reduced execution times but at a cost of \(8\sim 10\%\) accuracy loss. In the most optimised case, the (4,5) and (2,4) configurations achieved 17 and 50 FPS at an accuracy of \(90.18\%\) and \(89.45\%\), respectively. The \(10 \sim 11\%\) loss in accuracy in both configurations can be attributed to the loss of precision and pixel information, resulting in imperfect feature detection (Fig. 7).

The comparison with optimised CPU and GPU implementations is shown in Table 4, which includes total execution time as well as energy consumption per operation (nJ/Op). The results indicate that the optimised FPGA implementation achieved run-times comparable to the GPU at 600 MHz, but significantly outperformed it once energy consumption is taken into account. The GPU results exclude the initialisation time, which would add further latency to the overall run-time. In addition, the energy consumption of the GPU, at 12.47 nJ/Op, makes it a difficult choice for real-time embedded systems. The optimised FPGA implementations thus have better performance per watt than both the GPU and the CPU. The comparison with state-of-the-art FPGA implementations is reported in Table 5; the results show major improvements in run-time even with a larger image size and more or similar feature points (\(\sim 10000\)).

Table 4 SIFT: Profiling Summary on each Hardware Platform. Baseline & Optimised (Octave, Scale).
Table 5 SIFT: Performance against state-of-the-art.

4.2.2 Filter Implementations

Figures 8 and 9 plot the run-time and energy consumption of the three image processing filter algorithms (Box, Gaussian and Sobel) with the various optimisations applied to the baseline algorithm. Comparing baseline performance, the CPU architecture suffers the most in execution time and energy consumption, which can be attributed to its lack of many parallel compute cores. In contrast, GPUs and FPGAs exploit data parallelism and stream processing to significantly reduce run-time.

Figure 8: Filter: Runtime comparison for optimisations applied on each architecture.

Figure 9: Filter: Energy consumption comparison for optimisations applied on each architecture.

The figures show that the GPU and FPGA are comparable in both metrics studied. The GPU demonstrated marginally better computation speed than the FPGA, with an average improvement of \(12.59\%\) for the Box and Gaussian algorithms. However, the GPU was observed to consume \(\sim 1.20\times\) more energy than the FPGA. The higher energy cost can be attributed to support/unused logic components consuming static power. In the case of Sobel, the FPGA is \(1.11\sim 1.5\times\) faster than the GPU across all optimisation strategies. The smaller kernel size allows the FPGA to use its DSP slices to compute the algorithm efficiently, whilst the GPU operations do not fully occupy the available compute resources, resulting in load imbalance and communication latency.

All optimisations (Datatype, Kernel and Downsampling) yielded major improvements on each accelerator. Reducing the kernel size to \(3\times 3\) had the most impact, as it lowers the number of operations computed during the convolution. The Downsampling and Datatype optimisations brought an \(11.8\sim 24.5\%\) decrease in run-time across all algorithms. The optimisation run-time results and accuracies of each filter algorithm are reported in Tables 6 and 7, respectively.

Table 6 Image Processing Filters Runtime & Energy Result Summary.
Table 7 RMSE of Linear Filters (Compared to Original Input Image, Lower value indicating greater similarity).

4.2.3 CNN Architecture

Figure 10 displays the run-time performance and classification accuracy of the baseline and optimised CNN algorithms on each hardware architecture. The results show that the CPU, GPU and FPGA exhibit similar levels of performance, with the GPU having an average improvement of \(5.41\sim 12\%\) over the FPGA for the Downsampling optimisation on MobileNetV2 and the baseline on ResNet50, respectively. The FPGA leads the GPU in the Datatype optimisation with a \(6.25-11.1\%\) reduction in time for both CNNs. The Datatype optimisation quantises the model's weights from FP32 to 8-bit to reduce complexity. The FPGA computes the quantised operations faster on both architectures by exploiting its DSP blocks and requiring no additional hardware logic for floating-point arithmetic. However, the quantised model weights are unable to represent the full range of values present in the input, resulting in a \(\sim 10\%\) accuracy loss on all platforms. The Downsampling strategy gives a slight improvement in run-time with minimal impact on accuracy, with a loss of around \(\sim 5\%\).
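
As an illustration of the Datatype optimisation, a sketch of symmetric per-tensor post-training quantisation of FP32 weights to 8-bit integers; the scheme actually used by the evaluated designs may differ (e.g., per-channel scales or asymmetric zero-points):

```python
import numpy as np

def quantise_int8(weights):
    """Map FP32 weights to int8 with one scale per tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale                      # int8 weights plus dequantisation scale

def dequantise(q, scale):
    """Approximate the original FP32 values; the gap drives the accuracy loss."""
    return q.astype(np.float32) * scale
```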

Figure 10: CNN: Architecture Execution Time and Classification Accuracy comparison of Model Datatype & Input Image Downsampling Optimisations on Resnet50 and MobilenetV2.

Figure 11: CNN: Architecture Energy comparison of Model Datatype & Input Image Downsampling Optimisations on Resnet50 and MobilenetV2.

In Fig. 11, the energy consumption graph shows that the CPU consumes on average \(3.14\times\) more energy than the other accelerators for both CNNs. In addition, the ResNet50 architecture has more layers than MobileNetV2 and therefore contains more operations, resulting in higher energy usage. In all cases, the FPGA consumes the least energy, \(1.11\sim 3.55\times\) less than the CPU and GPU, to compute the image classification. The results show the potential for reducing the computation time of CNNs by further applying particular optimisations to each layer, at the cost of a slight accuracy loss. The optimisation results and accuracies of each CNN architecture are reported in Table 8.

Table 8 CNN Optimisation Result Summary: Runtimes and Corresponding Image Classification Accuracy for Baseline and Optimisations Applied on each Hardware.

Consequently, larger images or complex networks with many layers and larger filter sizes require more memory to store the weights and activations. This leads to higher memory requirements, especially within real-time embedded systems where space is limited. Applying optimisations can alleviate the computational load, but careful consideration must be given to the trade-offs between run-time and accuracy, depending on the application.

5 Conclusion and Future Direction

This paper proposes new optimisation techniques, called domain-specific optimisations, for real-time image processing on FPGAs. Common image processing algorithms and their pipelines are considered in proposing the optimisations, which include down/subsampling, datatype conversion and convolution kernel size reduction. These were validated on popular image processing algorithms and convolutional neural network architectures. The optimisations vastly improved the computation time of the CNN and Filter algorithms on all processing architectures. The SIFT implementation significantly outperformed state-of-the-art SIFT implementations on FPGA and achieved run-times on par with GPU performance but with lower power usage. However, the optimisations on all algorithms come at the cost of \(\sim 5-20\%\) accuracy loss.

The results demonstrate that applying domain-specific optimisations to increase computational performance while minimising accuracy loss demands in-depth and thoughtful consideration. One proposal for algorithms comprising multiple operation stages is to employ adaptive techniques instead of fixed downsampling factors, bit-widths and kernel sizes. These adaptive methods analyse the data and dynamically adjust the level of optimisation based on input characteristics. For instance, adjusting the bit-width and downsampling factor according to the specific input data within each stage can yield better results and strike a more suitable trade-off between performance and accuracy. Several strategies can be employed in the CNN domain to address these challenges. Quantisation-Aware Training (QAT) and mixed-precision training enable the model to adapt to lower-precision representations during training, reducing accuracy loss during inference with quantised weights and activations. Additionally, selective downsampling and kernel size reduction of CNN architectures help retain relevant information and preserve accuracy. Channel pruning can further offset accuracy loss by removing redundant or less critical channels. As a result, employing these strategies while considering hardware constraints makes it possible to strike an optimal balance between accuracy and performance, unlocking the full potential of efficient applications.

On the other hand, the drawback of traditional libraries and compilers is that they often struggle to keep pace with the rapid development of deep learning (DL) models, leading to sub-optimal utilisation of specialised accelerators. To address this limitation, adopting optimisation-aware domain-specific languages, frameworks and compilers is a potential solution that caters to the unique characteristics of domain algorithms (e.g., machine learning or image processing). These tool-chains would enable algorithms to be automatically fine-tuned, alleviating the burden of manual domain-specific optimisation.