
Energy-Efficient Approximate Edge Inference Systems

Published: 24 July 2023

Abstract

The rapid proliferation of the Internet of Things and the dramatic resurgence of artificial intelligence based application workloads have led to immense interest in performing inference on energy-constrained edge devices. Approximate computing (a design paradigm that trades off a small degradation in application quality for disproportionate energy savings) is a promising technique to enable energy-efficient inference at the edge. This article introduces the concept of an approximate edge inference system (AxIS) and proposes a systematic methodology to perform joint approximations between different subsystems in a deep neural network (DNN)-based edge inference system, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use a smart camera system that executes various DNN-based image classification and object detection applications to illustrate how the sensor, memory, compute, and communication subsystems can all be approximated synergistically. We demonstrate our proposed methodology using two variants of a smart camera system: (a) CamEdge, where the DNN is executed locally on the edge device, and (b) CamCloud, where the edge device sends the captured image to a remote cloud server that executes the DNN. We have prototyped such an approximate inference system using an Intel Stratix IV GX-based Terasic TR4-230 FPGA development board. Experimental results obtained using six large DNNs and four compact DNNs running image classification applications demonstrate significant energy savings (≈1.6×–4.7× for large DNNs and ≈1.5×–3.6× for small DNNs) for minimal (<1%) loss in application-level quality. Furthermore, results using four object detection DNNs exhibit energy savings of ≈1.5×–5.2× for similar quality loss. Compared to approximating a single subsystem in isolation, AxIS achieves 1.05×–3.25× gains in energy savings for image classification and 1.35×–4.2× gains for object detection on average, for minimal (<1%) application-level quality loss.

1 Introduction

Deep Neural Networks (DNNs) have become the algorithm of choice for performing various machine learning and computer vision applications, such as object detection, recognition, and context-aware inference. Although DNNs frequently match (and sometimes even exceed) the accuracy levels reached by humans in several applications, this inevitably comes at the cost of high computational complexity and, consequently, high energy consumption. Extending the benefits of such DNN-based applications to energy-constrained edge devices (self-flying drones, wearable devices, surveillance cameras, etc., as shown in Figure 1) requires the execution of these workloads in an extremely energy-efficient manner.
Fig. 1. Artificial intelligence (AI)-infused Internet of Things devices and examples of AI smart cameras.
An attractive property of DNN algorithms is that they are highly resistant to approximations in their underlying computations [22, 77, 94]. Many previous research efforts have shown how to use approximations to optimize the performance and energy consumption of computing systems that run DNN-based applications [99, 101]. However, these approximation techniques have focused mainly on reducing the energy consumed in the computation subsystem of the overall computing platform. In most edge devices, the computation subsystem only contributes a modest amount to the total energy consumption of the system (as shown in Section 3.2), which fundamentally limits the benefits of these approximation techniques.
Raha and Raghunathan [72] demonstrated the energy benefits of jointly approximating all subsystems of a system instead of approximating only the computation subsystem. This article explores this concept of end-to-end approximations in the domain of DNN-based smart systems to propose the first approximate edge inference system (AxIS) for energy-efficient inference at the edge. In particular, we focus on smart camera systems that use DNN-based inference as the underlying algorithm for various image recognition/detection applications. Figure 1 shows examples of Commercial-Off-the-Shelf (COTS) smart camera systems that use DNN-based machine learning, such as Google Clips, AWS DeepLens, Ooma Butterfleye, Nest Cam IQ, FLIR Firefly, and Lighthouse AI Cam. Many of these systems are designed to be battery operated or even run on harvested energy and hence require extreme levels of energy efficiency.
We consider two common variants of a smart camera system that we call CamEdge and CamCloud. In CamEdge, image acquisition and inference occur locally on the edge device to produce an inference result. In CamCloud, the locally captured image is transmitted to the cloud, where the DNN inference algorithm is executed. For both of these variants, we show how performing synergistic end-to-end approximations leads to a superior quality-versus-energy (Q-E) tradeoff for the system compared to approximating a single subsystem in isolation. Specifically, this article makes the following key contributions:
We propose AxIS, the first approximate edge inference system, and present two variants, CamEdge and CamCloud, which employ full-system approximations to obtain maximum energy savings while satisfying a user-defined application quality bound. AxIS performs DNN inference using three computer vision applications, namely image classification, object detection, and image segmentation.
We exploit the error resilience in State-of-the-Art (SOTA) DNNs to introduce (i) compute approximation using a quality-driven model compression framework comprising a novel Quality-Driven Structured Pruning (QUSP) methodology; (ii) memory approximation using DRAM refresh rate reduction and significance-driven allocation of DNN weights and feature maps in different DRAM quality bins; (iii) sensor approximation using image subsampling in the camera; and (iv) communication approximation using lossy network data compression.
We provide a detailed discussion of the design decisions behind each selected subsystem-level approximation technique. We offer one of the first investigations into inter-subsystem interactions and analyze the impact of concurrent approximations in multiple subsystems on application quality and system energy consumption. Based on the insights derived, we propose an efficient Design Space Exploration (DSE) framework to dynamically perform system-level Q-E tradeoffs.
We evaluate AxIS by implementing a fully functional prototype on an Intel Stratix IV GX FPGA-based development board. For minimal (<1%) application-level quality loss, our experimental results show that on average, AxIS enables system-level energy savings of (i) \(2.1\times\) and \(1.6\times\) (CamEdge) and \(3.5\times\) and \(3.3\times\) (CamCloud) in six large (server class) and four compact (edge optimized) classification DNNs, respectively, and (ii) \(2.3\times\) (CamEdge) and \(4.5\times\) (CamCloud) in four SOTA detection DNNs, compared to a baseline accurate system. For the same quality specification, AxIS provides (iii) \(1.05\times\)–\(1.8\times\) (CamEdge) and \(1.4\times\)–\(3.3\times\) (CamCloud) additional energy benefits over individual subsystem approximation in image classification, and (iv) \(1.3\times\)–\(2.1\times\) (CamEdge) and \(1.4\times\)–\(4.2\times\) (CamCloud) additional benefits in object detection.
This article is organized as follows. Section 2 presents a holistic view of existing DNN-based approximation techniques targeting the compute, sensor, memory, and communication subsystems. Section 3 covers the necessary background and motivation behind our adoption of full-system approximation strategies. This is followed by the key design decisions and methodologies for individual subsystems, the inter-subsystem interactions, the synergistic approximation methodology, and the corresponding DSE framework in Section 4. The experimental setup and the 14 different DNN baseline architectures used in this work are described in Section 5. Section 6 discusses the results of AxIS in the context of classification and detection. A case study on AxIS running segmentation is also included at the end of this section. Finally, the article concludes in Section 7 with a short summary and guidelines for future research on approximate systems.

2 Related Work

The inherent algorithmic error resiliency of DNNs arises due to their implicit information redundancy, self-healing nature, ability to work with noisy inputs, and resiliency against approximations in their intermediate computations [79, 80], which leads to application resilience. Recent years have witnessed a surge in approximate computing (AxC) [92] that takes advantage of this forgiving nature of DNNs and exploits the optimization potential to increase the overall energy efficiency of the DNN computing system [4, 10, 53, 70, 77, 92, 94, 95, 96, 99, 108]. This section presents an overview of several popular and widely used approximation and non-approximation-based optimization techniques that target DNN workloads and execution systems catering to computer vision applications. We divide approximations into multiple categories based on the system component they target for application performance/energy optimization and present a summary of representative approximation techniques for each component in Figure 2.
Fig. 2. Landscape of DNN-based approximations applied to different subsystems.

2.1 Compute Approximations

Most prior works on approximation techniques for DNNs target the compute subsystem. These can be broadly classified into four categories, namely algorithmic, software, hardware, and hardware-software co-approximations. Algorithmic approximation techniques transform the architecture of computationally intensive DNN models for energy and resource optimization by trading off accuracy. Popular network scaling strategies to optimize DNN models for the low-compute and high-compute regimes [16] include depth scaling [26], width scaling [107], resolution scaling [32], and compound scaling [16, 87]. Branch-based networks with early exit [64, 90] scale the computational effort to different inputs and allow conditional exit of inference for real-time and energy-sensitive Deep Learning (DL) applications. Along the same lines, some works [20, 93] adapt the computational effort of DNNs to the complexity of the input. Neural Architecture Search (NAS) has also led to the discovery of efficient DNN models such as MNASNet [86] and few-shot NAS [110] under different resource and hardware constraints. In addition, convolution optimization techniques, such as pointwise group convolution, depthwise separable convolution, fast Fourier transform (FFT)-based convolution, and Winograd convolution, and parameter/tensor decomposition techniques, such as canonical polyadic decomposition (CPD) and batch normalization decomposition (BMD), have proven effective in compressing models and reducing memory accesses while maintaining accuracy [22, 47, 96].
Software-based approximation strategies include structured and unstructured pruning [24, 27, 28, 51, 54, 83, 100, 111] that introduces sparsity in DNN weights and activations. Subsequently, these methods remove ineffectual computations to generate compressed models without significantly impacting DNN accuracy. Weight-sharing [71] and knowledge distillation [31] techniques have been used for model compression and reduction of computation and memory overhead during inference.
Popular hardware-based approximation techniques for DNNs include a quality-configurable neuromorphic processing engine [94], approximate Multiply-Accumulate (MAC) circuits [25], and strength reduction, where shift and add operators are used in place of MAC operators [5, 105]. Hardware-software co-approximation techniques are probably the most efficient in model optimization. Network quantization, which scales the precision of weights and activations, has evolved over the past decade from a simple clustering-based approach [24] to quantization-aware training and calibrated quantization based on Kullback-Leibler divergence [98]. Many recent works have reduced DNN energy consumption by adopting quantization from the 32-bit Floating Point (FP) format to 16-bit [49], 8-bit [34], and 4-bit [1], as well as mixed-precision numeric formats [35, 102], without a significant impact on accuracy. Finally, a few recent works have applied a combination of multiple compute approximation strategies to accelerate DNN inference and reduce energy, namely pruning and tensor factorization [46], and parallel/one-shot pruning and quantization [33, 91].
Apart from these four categories of approximation-based techniques, previous works have also proposed several strategies for power optimization of the processor/CPU. Commercial processors that support scalable architectures, such as the Texas Instruments Sitara [68], often achieve active and static power reduction by varying the processing speed/frequency. Energy-efficient CPU scheduling algorithms such as GRACE-OS and EScheduler [106] take advantage of reduced clock frequency or integrate dynamic voltage scaling. Many processors offer various sleep modes that can reduce power consumption. For example, a popular microcontroller for Internet of Things (IoT) applications, the ESP32 [18], offers four configurable sleep modes: modem sleep, light sleep, deep sleep, and hibernation. Depending on the application, a system designer can use them to selectively clock-gate or power off different peripherals of the system, ranging from the processor core to the radio/communication module. Prior works [81, 82] also proposed DNN hardware accelerators, multithreaded software workers, heterogeneous schedulers, genetic algorithms, and reinforcement learning based controllers for energy optimization of processors and systems. Our work is orthogonal to these techniques because we investigate the impact of combining approximations in different subsystems on DNN accuracy and system energy.

2.2 Sensor and Data Approximations

The large energy consumption of onboard sensors strains the limited energy budget of IoT/edge devices. In recent years, authors have shown that approximations in the data acquisition path of such smart sensors can be quite effective in reducing energy consumption. Reduction of sampling frequency such as sensor subsampling [23], modulation of spatial/temporal resolution [55], and quantization of sensor data are techniques widely used for sensor approximation. Pagliari and Poncino [63] used energy-efficient approximate bus encoding techniques such as approximate differential encoding (ADE) [61], Serial T0 [62], and Axserbus [39] to implement rounding/quantization and smoothing approximations and evaluated their energy-quality tradeoff in the context of image classification and activity recognition. Warp [85] uses voltage overscaling to modulate the sensing precision and enables accuracy vs. energy efficiency tradeoffs in a multi-modal sensing platform. Another recent work, AdaScale [9], employs adaptive image scaling for video object detection and reduces detector inference latency. Apart from these approximation-based approaches, there is a vast literature on non-approximation-based techniques for energy-efficient image sensor design. The proposed techniques use compressive sensing, predictive coding, and discrete cosine transform (DCT) [38, 44, 59], among others, to optimize the analog-to-digital converters (ADCs), which are usually power and performance bottlenecks in high-resolution image sensors. In addition, image sensors can benefit from power saving mechanisms such as aggressive standby power mode and optimal clock scaling [48]. Most of these approximation and non-approximation-based strategies require hardware modifications, ranging from minimal to extensive, to COTS image sensors and to system applications. Separately, the impact on accuracy of approximations to the DNN model input, such as blurring, Gaussian and impulse noise addition, and pixelation, has also been studied [15, 29]. Similarly, extensive literature has explored the impact of adversarial modification on the robustness and accuracy of neural networks [3, 60]. However, neither of these classes of data-alteration strategies is representative of the typical approximations introduced in smart sensors.

2.3 Memory and Storage Approximations

The memory subsystem is one of the primary components of a DNN-based inference system. In recent years, various types of approximate memory and approximate storage methods have been developed to reduce memory energy consumption and DNN inference latency while trading off DNN accuracy. These techniques can be classified in terms of the target memory type. In approximate SRAM, heterogeneous cell structures and cell sizes [42] are used to allocate data bits according to significance. Voltage overscaling has been applied to both SRAM [52, 103] and DRAM [41] to reduce dynamic and static power. Reductions of the DRAM refresh rate [56, 57, 73] and write recovery time [109] have also shown promise in improving power efficiency and throughput. Storage-based approximations have been proposed in nonvolatile memory (NVM). Reduction in guard band width [12] and selective application of error correction codes (ECCs) [69] have been effective in reducing access latency, leading to higher throughput. Ranjan et al. [75] demonstrated significant memory energy benefits by implementing a generic approximate memory compression technique for various types of memory. In general, all of these approximation techniques lead to better energy efficiency at the cost of memory bit errors and a reduction in accuracy. In most of these works, however, DNNs are subjected to iterative retraining to recover the accuracy lost to these memory errors, making an offline characterization phase of the target memory an essential part of the framework. Non-approximation-based memory energy saving strategies found in the literature include 3D DRAM architectures, partial array self-refresh (PASR) [65], and an out-of-order DRAM access scheduler [43]. In addition, many proposed DRAM architectures selectively activate and precharge cells in a row to lower the energy cost. However, most of these approaches require custom memory architectures and designs, which limits their widespread applicability for consumer devices. We refer the reader to the work of Delaluz et al. [13] and Lee et al. [43] for other related literature on software and hardware methods for increasing the energy efficiency of DRAM.

2.4 Communication Approximations

Approximations have also been explored in the field of data communication using lossy and lossless data compression [8, 15]. For example, Dodge and Karam [15] investigated the performance of image classification under different image quality distortions, namely compression, noise, and blur, and demonstrated the resiliency of DNNs subjected to high degrees of image compression. Poyser et al. [67] also evaluated a variety of DNN models and reported their resilience to lossy compression. However, the authors did not evaluate the latency or energy benefits achieved as a result of the approximations in image quality. Gandor and Nalepa [21] investigated the impact of lossy compression on DNNs for object detection and showed the tradeoff between compression level and detection performance. An orthogonal research direction to these approaches is the offloading of DL computations from edge devices to the cloud. Several non-approximation-based communication optimization strategies [96] have been proposed to reduce transmission costs and inference latency and to satisfy the energy requirements of edge devices. A recent work in the domain of multi-view object detection proposed communication-efficient multi-resolution view pooling strategies [84] to achieve a substantial reduction in data communication. Similarly, Ren et al. [78] showed significant improvement in image transmission efficiency by trading off object detection accuracy under image compression. Other wireless radio optimization strategies [2, 65] targeting various radio components such as WiFi, cellular (4G/5G), Bluetooth, and GPS include location sensing frameworks (e.g., LearnLoc), adaptive interface selection strategies (e.g., Bluesaver), and wireless sampling rate modulation, among others.

2.5 Multi-Subsystem Approximations

As is evident from the preceding discussion, most of these techniques apply approximations to individual components of DNN inference systems. Moreover, most research efforts target only the compute subsystem. Although the energy reduction in the approximated subsystem translates to system-level savings to a certain extent, these approaches cannot exploit the full potential of system-level energy savings because all the other subsystems continue to operate without any approximations. Raha and Raghunathan [72] proposed approximations across multiple subsystems to demonstrate complete-system energy reduction subject to specific quality constraints. However, the benchmarks evaluated in that work comprised only traditional image processing workloads and did not include DNNs.
Compared to the aforementioned literature, our work studies in depth the inter-subsystem interactions in a DNN-based inference system. We extend the paradigm of approximate computing to approximate systems and lay the foundational concepts for full-system approximations for extreme energy efficiency in DNN systems.

3 Background and Motivation

In this section, we first provide a brief background on the paradigm of edge intelligence and its associated challenges in the context of DNN inference. Subsequently, we discuss the primary motivation behind adopting a full-system approach to approximate computing.

3.1 Challenges to Energy-Efficient Edge Inference

The rapid proliferation of ubiquitous sensors and intelligent devices, together with the unprecedented rise of Artificial Intelligence (AI), is shifting the core of computation from server clusters in cloud data centers (the cloud) to smartphones, wearable devices, and other IoT devices (the edge). In recent years, academia and industry have made conscious efforts to advance the field of edge intelligence. Consequently, DNNs have become the algorithm of choice for performing edge visual analytics in applications that demand data-driven decision making, such as smart camera surveillance, advanced driver assistance systems, and autonomous vehicles. However, despite significant research, the limited computing resources, memory footprint, and energy budget of resource-constrained edge computing environments remain significant challenges for energy-efficient DNN inference on edge devices.
SOTA DNNs in computer vision, such as ResNet101, InceptionV3, EfficientNet, and Faster_RCNN, are highly over-parameterized [17], as they contain hundreds of layers with a large number of trainable weights. Importantly, over-parameterization not only contributes to high DNN accuracy but also allows networks to generalize across diverse inputs and to be error resilient to a certain extent [58]. However, these over-parameterized networks routinely exceed the compute and memory capacity of most commercial edge devices available today. The energy consumption of these devices is also severely impacted by the \(\mathcal {O}(10^{9})\) arithmetic operations and memory accesses invoked by a single inference operation in these over-parameterized networks [111]. As discussed in Section 2, many recent works have leveraged the inherent error resilience of DNNs and aimed to solve these challenges by introducing approximations in individual subsystems in isolation in exchange for disproportionate energy benefits.

3.2 From Approximate Computing to Approximate Systems

As seen in Section 2, most existing approximation techniques have been limited to the compute subsystem [40, 53, 108]. Although these approaches usually reduce DNN inference latency on edge devices, they target energy reduction of the compute subsystem alone and leave many energy-saving opportunities on the table. Consider the example of a smart camera system, one of the most popular edge platforms for edge AI applications. For such resource-constrained devices, energy efficiency is a fundamental necessity. Dissecting one of these devices reveals that it comprises the sensor, memory, and communication subsystems in addition to the compute subsystem. Figure 3 shows the constituent subsystems of a representative smart camera system. The figure also shows the measured energy breakdown of a system that runs on-device inference (image classification) and cloud-based inference (CamEdge and CamCloud, respectively, defined in Section 4.2) for an equal duration. Our studies have also revealed a similar energy profile for the camera system that runs object detection as the underlying DL application. As observed in the figure, the energy breakdown of the representative edge system is analogous to the energy profile of a SOTA COTS platform, such as the Raspberry Pi Zero W [76]. The compute subsystem is responsible for only 31% of the system's overall energy budget, and the other subsystems contribute the dominant share (69%). Approximating only the compute subsystem (or any other subsystem) in isolation will not produce tangible system-level energy benefits. Therefore, it is imperative to investigate the energy-quality tradeoffs in each subsystem and exploit them to perform synergistic approximations in all subsystems to maximize the energy efficiency of the whole system. This work advances the field of edge intelligence by shifting the focus from approximate computing to approximate systems and explores end-to-end approximations in such edge inference systems.
Fig. 3. Energy breakdown of the edge inference system and Raspberry Pi Zero W in accurate mode.

4 AxIS Design Methodology

To efficiently optimize the limited energy availability of edge computing systems for DNN inference applications, we propose AxIS. AxIS is the first and only work that performs highly energy-efficient inference tasks by employing synergistic approximations on multiple subsystems subject to a user-defined target application quality bound. AxIS is driven by three main insights. First, DNNs are inherently error resilient. Second, approximations in any subsystem leverage application error tolerance to improve energy efficiency without significant quality degradation. Third, the errors introduced by various approximation strategies, both within and across different subsystems, tend to mask each other, as demonstrated in other works [72, 74], whereas the system-level energy savings are largely additive across the individual subsystems.
We first provide a high-level overview of the architecture of AxIS in the context of edge intelligence and cloud intelligence in Section 4.1, and describe two variants of AxIS, namely CamEdge and CamCloud, in Section 4.2. In Section 4.3, we present a brief overview of the AxIS prototype and DNN benchmarks used to obtain experimental data depicted in this work. Section 4.4 explains a generic technique for evaluating system-level energy consumption of AxIS. We then describe the approximation techniques for each individual subsystem and formulate the design decisions behind each selected approximation technique in Sections 4.5.1, 4.5.2, 4.5.3, and 4.5.4. In Section 4.6, we examine inter-subsystem interactions and analyze their impact on the individual and system Q-E tradeoff. Finally, Section 4.7 discusses the methodology for exploring the complex multi-subsystem design space and providing the most energy-efficient approximate system configuration while satisfying the target quality bound.

4.1 AxIS System Architecture

The unprecedented advances in DL continue to spark great interest in the deployment of SOTA DNN models for cognitive computing workloads in the domain of computer vision, natural language processing, recommendation systems, and so on. Since the introduction of AlexNet, the DL community has designed myriad models with increasing predictive power, leading to their ubiquitous deployment both in large-scale cloud data centers [11] and in resource-constrained edge devices [96]. Nevertheless, these accuracy gains, driven by the ever-increasing model complexity and number of Floating-Point Operations (FLOPs), impose heavy computational and memory burdens on these edge devices and consequently lead to high energy consumption during DNN inference. For example, EfficientNet [87] needs 36 billion operations to perform image classification, whereas YOLOv5 [36] uses 334 billion operations to perform object detection, both on a single image. These overwhelming numbers of computations and associated energy costs arise due to the MAC operations in convolution (CONV) layers that apply 2D convolution between 4D weight matrices (kernels) and input channels (or Input Feature Map (IFM)) to generate activations (or Output Feature Map (OFM)).
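To make the source of these operation counts concrete, the following sketch computes the MAC count of a single CONV layer from its dimensions; the layer sizes used are illustrative, not taken from any specific benchmark.

def conv_macs(c_in, c_out, k_h, k_w, h_out, w_out):
    # MACs for one CONV layer: each OFM element is a dot product over a
    # k_h x k_w x c_in window of the IFM, repeated for every output
    # position and every output channel (filter).
    return c_out * h_out * w_out * (k_h * k_w * c_in)

# Illustrative 3x3 CONV layer mapping 64 channels to 128 at 224x224 output
macs = conv_macs(c_in=64, c_out=128, k_h=3, k_w=3, h_out=224, w_out=224)
print(f"{macs / 1e9:.2f} GMACs (~{2 * macs / 1e9:.2f} GFLOPs)")  # 1 MAC = 2 FLOPs

Summing such per-layer counts over all CONV (and fully connected) layers yields network-level operation counts of the magnitude quoted above.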
A substantial energy overhead is also due to the enormous data movement between the processor and the memory [101]. Consequently, many edge inference systems avoid this high energy burden by offloading the computation requests along with the sensory data to cloud servers for DNN inference, with the edge device being used only for data (image/video) acquisition. We refer to this DNN inference paradigm as cloud intelligence. However, the high transmission cost, strict application latency demands, data privacy and security constraints, and reliability of the wireless network have led to the emergence of the paradigm of edge intelligence, where the end-to-end DNN inference application is performed locally on the mobile/IoT device. As discussed in previous sections, the constraints of memory, compute, and energy pose significant challenges to the execution of accurate SOTA DNN models in these systems. These limitations call for multiple optimizations throughout the system to reduce inference latency and improve system-level energy efficiency. As an example, for a miniature drone (having a small battery capacity) embedded with a DNN inference engine, the underlying DNN model must consume only a small amount of energy to ensure that the drone remains operational for an extended period of time.
Due to the growing popularity of AI-enabled smart cameras, as shown in Figure 1, we assembled a smart camera based edge inference system that executes various energy-intensive computer vision applications, namely image classification, object detection, and instance segmentation, to demonstrate the concept of AxIS. Similarly to the constituent subsystems of a typical smart camera system, as stated in Section 3.2, the proposed AxIS comprises the following four main subsystems, which are also shown in Figure 4:
Fig. 4. Different subsystems of an edge inference system used in two variants of AxIS.
(1)
Sensor subsystem: This subsystem comprises the image sensor integrated into the camera module, which captures the image for subsequent processing. We consider the COTS CMOS image sensor, predominantly used for embedded sensing.
(2)
Memory subsystem: This subsystem stores the DNN-based application program and the acquired image. Depending on the AxIS variant (Section 4.2), it may also store pretrained DNN weights and feature maps (IFMs and OFMs). Without loss of generality, we consider a DRAM-based main memory for its widespread use in modern embedded systems running AI workloads.
(3)
Compute subsystem: This subsystem loads the acquired image from the memory and performs adequate preprocessing steps. Depending on the variant, it can either execute the entire DNN inference application to generate the final classification/detection output or only transmit acquired sensor data to the cloud.
(4)
Communication subsystem: This subsystem comprises a wireless radio frequency communication module that either transmits the application output or offloads the inference computation request along with the acquired image to the cloud. We consider WiFi as the underlying technology used by the communication module.
As is already evident from Section 2, the inherent error resiliency of DNNs [77, 99] presents several optimization opportunities in each of these subsystems. Using these potential avenues, approximations can be introduced into this system to reduce energy without having a substantial impact on the quality of the application. One way is to approximate just the sensor, for example, using image sensor subsampling to reduce the energy consumption of this subsystem. Alternatively, approximations can also be introduced into the computational fabric using parameter quantization or pruning, potentially leading to energy savings in the compute and memory subsystems. Thus, we expect that the concurrent application of these individual approximations may enable smart camera systems to perform highly energy-efficient DNN inference. Fortunately, these neural networks have an intrinsic ability to tolerate errors. Therefore, these systems should still be able to meet the application accuracy constraints, as the errors that arise from each approximation mainly mask each other [72], providing the biggest “bang for the buck.” Although we use a smart camera system to demonstrate AxIS, the foundational concepts can easily be extended to other intelligent edge systems by equivalently considering the mutual interactions among their constituent subsystems.

4.2 Variants of AxIS

In this article, we present two distinct variants of the proposed approximate edge inference system to meet the demands of the AI inference paradigms discussed in Section 4.1, namely edge intelligence and cloud intelligence. Without loss of generality, we assume that AxIS only runs DNN inference. Training of the DNN is assumed to be done a priori in the cloud.

4.2.1 CamEdge.

This computation-intensive design caters to the domain of edge intelligence. Here, the local edge device runs the end-to-end DNN inference application, from data (image) acquisition to final result generation. Sensor, memory, and compute are the active subsystems, as indicated in Figure 4. The sensor subsystem is responsible for capturing the image. The compute subsystem (processor) applies the following preprocessing steps: (i) resize the image preserving its aspect ratio and center-crop it (to fit the DNN input dimension specification), (ii) convert the image to a tensor (FP32 format) and scale the pixel intensity values to the range \([0, 1]\), and (iii) normalize the resulting tensor. Note that DNN models used for classification (InceptionV3, ResNet101) and detection (YOLOv5), among others, could use test-time augmentation techniques, namely multi-resolution, multi-crop, flips, and rotation, among others, and merge the augmented predictions to obtain higher accuracy. However, we do not consider them in light of their longer inference time and higher energy consumption. Following these transformations, the on-device processor executes the DNN inference operation to generate the final object class, in the case of image classification, or the bounding box coordinates with class labels for each detected object, in the case of object detection. The memory subsystem is active throughout the application, as it stores the acquired image, the application program, and DNN weights and feature maps during inference. The communication module (if present) is mostly inactive and may be used to communicate the final results to the cloud or other devices, thereby consuming a negligible amount of energy. Clearly, the first three subsystems contribute to the overall energy consumption of the system. Therefore, synergistic approximations are applied to them to achieve better energy efficiency.
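As a concrete illustration of these preprocessing steps, the following sketch mirrors a standard torchvision inference pipeline; the 256/224 sizes and normalization statistics are the conventional ImageNet values used by models such as ResNet101 (assumed here as an example, not measurements from our prototype), and capture.jpg is a hypothetical file produced by the sensor subsystem.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet101

# Steps (i)-(iii): aspect-preserving resize + center-crop, conversion to an
# FP32 tensor in [0, 1], and normalization of the resulting tensor.
preprocess = transforms.Compose([
    transforms.Resize(256),        # shorter side -> 256, aspect ratio preserved
    transforms.CenterCrop(224),    # fit the DNN input dimension specification
    transforms.ToTensor(),         # FP32 tensor, pixel intensities in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = resnet101(weights="IMAGENET1K_V1").eval()
image = Image.open("capture.jpg")           # image acquired by the sensor
batch = preprocess(image).unsqueeze(0)      # add batch dimension
with torch.no_grad():
    class_id = model(batch).argmax(dim=1).item()  # final object class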

4.2.2 CamCloud.

This communication-intensive design caters to the domain of cloud intelligence. As shown in Figure 4, all four subsystems are active in this design. The edge device captures the image using the sensor subsystem and transmits it to a resource-rich cloud server through the communication subsystem. Here, the cloud server applies all necessary transformations to the input image, as mentioned earlier, and then runs DNN inference before sending the relevant application result to the edge device. Large-scale cloud servers in data centers typically employ clusters of machines with GPUs or Google Tensor Processing Units (TPUs) [37]. The memory subsystem is again active throughout as it participates in data acquisition and transmission. Consequently, synergistic approximations are made in the sensor, memory, and communication subsystems. In particular, although the DNN workload is not executed on the device processor, the compute module works simultaneously with the communication module to transmit the image and retrieve the result. Therefore, all four subsystems have significant contributions to the overall energy consumption of the system, with approximations employed in all except the compute subsystem.

4.3 Short Description of the AxIS Prototype and DNN Benchmarks

We designed a fully functional prototype of an approximate inference system to evaluate the impact of individual and synergistic approximation of different subsystems on application quality and system energy consumption. We built this prototype on an Intel Stratix IV GX FPGA development board. The system was interfaced with a 5-MP CMOS image sensor as the sensor subsystem, 1-GB DRAM as the memory subsystem, and the ESP-WROOM-02 module as the communication subsystem. An Intel Nios II soft processor core was used as the compute subsystem. The graphs and quality/energy numbers presented in all of the following sections were obtained using this system. Note that there are a few other peripheral components in this prototype apart from these primary components. We advise readers to check Section 5 and Figure 23 (presented later) for a detailed description of this prototype along with the energy measurement setup.
Now, we present a brief overview of the 14 DNN benchmarks used in this work. As mentioned in Section 1, we evaluated AxIS performance on three computer vision applications, namely image classification, object detection, and image segmentation. The suite of benchmarks for classification includes four small DNNs, namely SqueezeNet1.1, MobileNetV2, MNASNet1.0, and EfficientNet_Lite, and six large DNNs, namely AlexNet, VGG19_BN, DenseNet121, InceptionV3, ResNet101, and EfficientNet. For detection, we used Faster_RCNN, Mask_RCNN, EfficientDet, and YOLOv5, and for segmentation, Mask_RCNN. Please refer to Section 5 and Tables 1 and 2 for further details on DNN model specifications and the software frameworks used in this work. The rest of the article presents various experimental data based on DNN inference of these networks on the AxIS prototype.
Table 1. DNN Benchmarks Used for Evaluating AxIS on Image Classification
Table 2. DNN Benchmarks Used for Evaluating AxIS on Object Detection and Instance Segmentation

4.4 System-Level Energy Consumption of AxIS

In recent years, reducing the energy consumption of DNNs has received a lot of attention due to the ubiquitous deployment of DNNs in edge systems with limited energy. As a result, estimating the DNN inference energy, which is much more complex than estimating the size of the DNN model and FLOPs, is of paramount importance. A limited number of previous works [101, 104] have proposed model-based energy estimation methodologies to estimate the energy consumption of a single DNN inference based on its architecture, model sparsity, and bit precision. These techniques model layerwise energy consisting of computation energy and data access energy and accumulate the energy of all the layers to derive the network energy. Using the insights derived from these works, we developed a comprehensive energy estimation methodology to calculate the energy consumption of the complete system running the DNN inference application. This simple energy model can be used for both CamEdge and CamCloud by considering all four subsystems simultaneously as mentioned in Section 4.1. We first provide a brief overview of the four types of constituent subsystem energy (for a single inference operation) before delving into full system energy:
(1)
Sensor energy: The sensor consumes a relatively small amount of energy during image acquisition and is otherwise in low-power mode. This subsystem performs the same operation in both AxIS variants. We represent the total sensor energy consumption by \(E_{sens}\). Note that \(E_{sens}\) depends on the sensor configuration, the sampling mode, and the resolution of the generated image (subsampling factor).
(2)
Communication energy: The communication module is one of the most energy intensive subsystems, as shown previously in Figure 3. We represent the total energy consumed by the communication subsystem by \(E_{comm}\). In CamCloud, \(E_{comm}\) contributes significantly to the overall energy of the system as part of the image transmission/offload phase. \(E_{comm}\) is calculated by taking the product of the average communication transmission power and the total data transmission time during which the module is active. Note that the transmission time is primarily driven by the size of the data (image) to be offloaded and the network (WiFi) bandwidth. For simplicity, we assume that this module contributes a negligible amount (\(\approx 0\)) to the system energy in CamEdge due to its inactivity.
(3)
Compute energy: The energy consumption of the compute subsystem (processor), \(E_{comp}\), differs between the two variants. In CamEdge, both image preprocessing and DNN inference contribute to \(E_{comp}\), which is calculated by taking the product of the average processor power and the total active time, including both preprocessing time and inference latency. The preprocessing time is usually small compared to the inference latency. Hence, the energy primarily depends on the inference latency, which in turn depends on the underlying system configuration, the DNN model, and the input image size. A larger DNN model with a high number of compute operations (FLOPs) will lead to high inference latency and consequently high \(E_{comp}\). In contrast, the compute module in CamCloud consumes energy when it assists the sensor subsystem in image capture and the communication module in converting the image to data packets and transmitting them to the cloud. For simplicity, we calculate \(E_{comp}\) by taking the product of the average processor power and the total time, including data packetization and data transmission.
(4)
Memory energy: In both AxIS variants, the memory subsystem (DRAM) consumes a substantial amount of energy as it remains active throughout the lifetime of the DL application, as described in Section 4.2. During on-device DNN inference in CamEdge, the compute subsystem accesses the pretrained weights from the memory subsystem. In addition, IFMs are read from the memory and OFMs are written to the memory for every layer in the DNN model. These data transfers constitute the data access energy component of the total memory energy consumption \(E_{mem}\). Due to the multi-level memory hierarchy in today's DNN hardware chips and embedded consumer devices [30], modeling the data access energy is challenging. From existing literature [7], we know that DRAM access energy is one order of magnitude higher than the access energy of other downstream memories. Therefore, we only consider DRAM access energy for simplicity. In addition to data access energy, refresh energy is an important contributor to the overall energy consumption of DRAM [56, 57, 73], as periodic refresh operations throughout the DRAM are necessary to counteract charge leakage over time and maintain data integrity. We elucidate the contribution of refresh energy by providing a breakdown of \(E_{mem}\) during on-device DNN inference (CamEdge) for a suite of classification and detection benchmarks in Figure 5. As observed in the figure, the refresh energy consumes \(68\%\) of \(E_{mem}\) on average across all the DNN benchmarks considered. In contrast, the average access energy is \(6\%\), whereas the idle energy, which accounts for the background energy arising from the DRAM controller logic and the peripheral circuitry of the DRAM, is \(26\%\). Note that although the absolute memory consumption corresponding to each DNN benchmark is different, the figure shows the breakdown in percentages, each normalized to the total memory energy \(E_{mem}\). The total refresh energy is calculated by taking the product of the average DRAM refresh power, which varies with the refresh interval, and the total memory active time. In CamEdge, this memory active time comprises both the sensor and compute subsystem active times. Consequently, total refresh energy, total data access energy, and idle energy contribute to \(E_{mem}\). In CamCloud, the memory active time consists of both the sensor and communication subsystem active times. Note that the compute active time overlaps with the sensor time during image acquisition. Similarly, the compute and communication subsystems work coherently during offloading, so compute time does not need to be added to the memory active time separately. Consequently, total refresh energy and idle memory energy contribute to \(E_{mem}\).
Fig. 5. DRAM energy breakdown for DNN inference using classification and detection benchmarks.
The total energy consumption of AxIS, represented by \(E_{sys}\), is calculated by summing the energy consumption of the individual subsystems for a single DNN inference, as shown in Equation (1). This equation can be used for energy calculation in both CamEdge and CamCloud. As mentioned previously, \(E_{comm} \approx 0\) for CamEdge. We believe that the proposed energy estimation methodology will play a critical role in the development of energy-efficient DNNs and provide useful insights to intelligent embedded system designers. All energy numbers and graphs in the rest of the article have been obtained using this energy model.
\[ E_{sys} = E_{sens} + E_{mem} + E_{comp} + E_{comm} \qquad (1) \]
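The sketch below expresses this energy model in code. The structure follows the subsystem decomposition above, while the power and timing fields are placeholders to be filled in from measurements; they are not values from our prototype.

from dataclasses import dataclass

@dataclass
class SubsystemProfile:
    # Measured averages for one system configuration (placeholder fields).
    p_sens: float     # average sensor power (W) during acquisition
    t_sens: float     # image acquisition time (s)
    p_comp: float     # average processor power (W) while active
    t_comp: float     # preprocess + inference (CamEdge) or packetize + transmit (CamCloud)
    p_tx: float       # average radio transmission power (W)
    p_refresh: float  # DRAM refresh power (W) at the chosen refresh interval
    p_idle: float     # DRAM background power (W): controller + peripheral circuitry
    e_access: float   # total DRAM access energy (J) per inference (CamEdge only)

def system_energy(prof, image_bytes, bandwidth_bps, cam_cloud):
    # Equation (1): E_sys = E_sens + E_mem + E_comp + E_comm.
    e_sens = prof.p_sens * prof.t_sens
    # Transmission time is driven by the image size and the WiFi bandwidth
    t_tx = (8 * image_bytes) / bandwidth_bps if cam_cloud else 0.0
    e_comm = prof.p_tx * t_tx                     # ~0 in CamEdge
    e_comp = prof.p_comp * prof.t_comp
    # Memory active time: sensor + compute (CamEdge) or sensor + communication (CamCloud)
    t_mem = prof.t_sens + (t_tx if cam_cloud else prof.t_comp)
    e_mem = (prof.p_refresh + prof.p_idle) * t_mem \
            + (0.0 if cam_cloud else prof.e_access)
    return e_sens + e_mem + e_comp + e_comm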
Figure 6 shows the normalized subsystem energy breakdown of the image classification benchmarks, consisting of six large DNNs suited for cloud intelligence and four small DNNs suited for edge intelligence (details in Table 1). In both variants, the sensor consumes less than 1% of the total system energy, as expected. Note that in CamEdge, although the absolute energy values vary greatly depending on the architecture of the DNN, the normalized breakdown allows us to compare DNNs of different complexity on a single scale. As observed, the compute (in red) and memory (in blue) subsystems contribute \(48\%\) and \(51\%\), respectively, to the system energy, on average (geomean). In contrast, the CamCloud energy is independent of the DNN model. As expected, the communication subsystem (in green) consumes the maximum energy (\(65\%\)), and compute and memory contribute to a lesser extent, namely \(13\%\) and \(15\%\), respectively. Similarly, Figure 7 shows the normalized energy breakdown of four SOTA object detection benchmarks (details in Table 2). The energy contributions of the individual subsystems are similar to those of the classification DNNs. Since these detection DNNs are comparatively more computationally intensive, the sensor contribution is reduced by an order of magnitude in CamEdge. However, unlike classification DNNs, these networks can perform inference at multiple resolutions, and the baseline image resolutions vary from one DNN to another (details can be found in Table 2). Therefore, the absolute CamCloud energy depends on the respective model configuration, but the geomean of the respective subsystem energies matches closely that of the classification DNNs.
Fig. 6. Subsystem energy breakdown for image classification DNNs (large and small). CamCloud energy is independent of the DNN, whereas CamEdge energy depends on the underlying DNN.
Fig. 7. Subsystem energy breakdown for object detection DNNs. Unlike classification DNNs, both CamCloud and CamEdge energies depend on the DNN running on the cloud and edge, respectively.

4.5 Subsystem-Level Approximations

In this section, we provide a comprehensive overview of our proposed approximation techniques for the four individual subsystems of AxIS, summarized in Figure 8. As illustrated in the figure, CamEdge consists of the approximations in the sensor, memory, and compute subsystems, whereas CamCloud consists of the approximations in the sensor, memory, and communication subsystems. Furthermore, we also elucidate the design decisions that guide these strategies and their impact on application quality and the energy of the respective subsystem in the context of two DL applications: image classification and object detection.
Fig. 8. Subsystem-level approximations in AxIS: (1) image subsampling in the sensor subsystem, (2) DRAM refresh rate reduction in the memory subsystem, (3A) DNN filter/channel pruning and thinning in the compute subsystem, and (3B) JPEG compression in the communication subsystem. CamEdge constitutes approximations (1), (2), and (3A); CamCloud constitutes approximations (1), (2), and (3B).

4.5.1 Approximate Sensor.

Cameras or image sensors embedded in modern edge devices record very high resolution images to satisfy the growing consumer demand for high-quality photography. In contrast, most computer vision applications work with images of relatively lower resolution. For example, DNN architectures such as ResNet101, VGG19_BN, and MobileNetV2 accept \(224\times 224\) resolution images, whereas some recent models (EfficientNet) can accept multiple image resolutions up to \(600\times 600\). Although they can process images of arbitrary resolution, detection networks such as Faster_RCNN, EfficientDet, and YOLOv5 are generally trained on images with a resolution in the range \([640, 1536]\). Therefore, sensors with more than 10 million pixels provide far more resolution than these applications require. Furthermore, the number of pixels sampled by the sensor is (approximately) proportional to the energy consumption of this subsystem. In fact, capturing a high-resolution image not only increases the sensor energy but also directly impacts the computation required by the DNN. Driven by these insights and the inherent error resilience of DNNs, we captured images at a resolution even lower than the specified input size of these networks, without changing the receptive field of the sensor. To do this, we investigated two distinct approximation strategies that are widely available in mobile image sensors today. For image classification, we applied pixel subsampling or nearest neighbor subsampling, which skips the readout of entire pixel rows/columns, thus reducing the resolution of the DNN input data. As an illustration, Figure 8.1 shows that subsampling skips multiple pixels to reduce the image resolution from 1080p (\(1920\times 1080\)) to 720/480p. For object detection, we adopted binning or bilinear subsampling, which takes a weighted average of four neighboring pixels to reduce image resolution and image noise. The design choice of the sampling strategy is based on our observation of empirical data: pixel subsampling results in superior quality in the classification task, whereas bilinear sampling gives better results for the detection task. We therefore adopted the subsampling factor \(s_r\) as the sensor approximation knob \(f\) for both of these strategies to reduce the image size by \(s_r^2\) and obtain appreciable energy savings in the sensor subsystem with minimal effect on DNN accuracy. Additional benefits arising from this approximation include reduced sensor active time and data storage needs. Note that subsampling is performed in the sensor subsystem itself, as the camera directly tunes its own sampling resolution and does not involve the processor or compute subsystem. Internally, we might need to upsample the acquired image due to the input size restrictions of some classification DNNs; however, this operation incurs minimal overhead on the edge processor. In contrast, the upsampling operation is unnecessary for DNNs that support model scaling and for detection DNNs. In fact, we eliminated the downsampling steps otherwise required in traditional DNN inference pipelines, thus reducing the computation workload. Interestingly, the reduction in XY resolution could also speed up subsequent DNN processing and therefore might lead to system-level energy savings. Using these insights, we modulate \(f\) to generate images of multiple resolutions and demonstrate the Q-E tradeoffs for multiple classification and detection benchmarks in Figure 9.
The x-axis in both of these graphs represents the square of the subsampling factor (i.e., the image size reduction \(s_r^2\)), the left y-axis represents the normalized quality, and the right y-axis represents the normalized sensor subsystem energy. Note that the x-axis range varies between these two graphs due to the difference in image size reduction between classification and detection DNNs. This set of graphs uses top-1 accuracy as the quality metric for classification and Mean Average Precision (mAP) for detection. We derive two insights from these approximation studies: (i) heavyweight (large) DNNs are comparatively more resilient to sensor approximations than lightweight (small) DNNs, and (ii) the normalized quality degradation is much lower in detection DNNs compared to classification DNNs, whereas the amount of energy savings is quite similar.
Fig. 9. Sensor subsystem Q-E vs. image size reduction for classification DNNs (a) and detection DNNs (b). Normalized quality is used to enable performance comparison across DNNs and applications.
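Both sampling strategies can be emulated on a raw frame as in the NumPy sketch below. On the actual sensor, subsampling happens in the readout path, so this emulation reproduces only the output image, not the energy savings; the binning variant shown uses a uniform average, a simplification of the weighted (bilinear) average described above.

import numpy as np

def pixel_subsample(frame, s_r):
    # Nearest-neighbor subsampling: skip s_r - 1 of every s_r rows/columns,
    # reducing the image size by s_r^2 (used for classification).
    return frame[::s_r, ::s_r]

def binning(frame, s_r=2):
    # Binning: average each non-overlapping s_r x s_r pixel neighborhood,
    # reducing both resolution and noise (used for detection).
    h = frame.shape[0] // s_r * s_r
    w = frame.shape[1] // s_r * s_r
    f = frame[:h, :w].astype(np.float32)
    f = f.reshape(h // s_r, s_r, w // s_r, s_r, -1).mean(axis=(1, 3))
    return f.astype(frame.dtype)

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)  # 1080p frame
print(pixel_subsample(frame, 2).shape)  # (540, 960, 3): 4x fewer pixels
print(binning(frame, 2).shape)          # (540, 960, 3): averaged 2x2 blocks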

4.5.2 Approximate Memory.

DRAM ubiquitously serves as the main memory unit in commercially available smart cameras and modern embedded systems. Following along the lines of prior art [72], we use approximate DRAM due to its high density, large capacity, and better energy efficiency. Section 4.4 clearly suggests that refresh energy consumption should be considered one of the most critical parameters in the design of the computing system. Thus, approximations in DRAM are introduced by increasing the background refresh interval beyond the nominal 64 ms, and this interval acts as the memory approximation knob \(f\). Lowering the refresh rate (increasing the interval) drastically reduces the overall DRAM energy consumption but introduces retention bit errors in DRAM pages, resulting in accurate and erroneous pages. Since DRAM is responsible for storing the image and different DNN parameters, including weights and feature maps, page errors ultimately lead to bit errors in these data. For example, Figure 8.2 illustrates how these bit errors result in a noisy image. Note that at higher refresh intervals, the number of erroneous DRAM pages increases, further degrading the image. Adopting the DRAM classification strategy of Raha et al. [73], we split the physical DRAM pages into different quality bins \(\{qbinI~|~I \in 0~\text{to}~3{+}\}\) based on the error characteristics of each page, as illustrated in Figure 10. It is important to specify that quality bins with a higher index (i.e., a higher value of I) indicate a higher bit error rate, and as the refresh interval is increased, the distribution is skewed toward qbinI pages with higher values of I, as shown in Figure 10. In the context of DNN inference, we further categorize DNN data into critical and noncritical data for fine granular control over approximations [73], and assign them to accurate pages (qbin0) and erroneous pages (the remaining qbin0 pages and \(qbin1+\)), respectively. The entire DNN-based application program and the IFMs (OFMs) of the DNN are considered critical data, whereas the input image and pretrained DNN weights are considered noncritical data. IFMs are considered critical due to their lower bit error rate tolerance compared to DNN weights, as shown in previous works [41, 73]. To obtain the maximum energy savings, we select the maximum refresh interval that still allows us to allocate all critical data to qbin0, thus ensuring minimal impact on quality.
Fig. 10.
Fig. 10. Categorization of DNN inference data into critical and noncritical data with the corresponding DRAM allocation strategy. Distribution of qbinI pages at different DRAM refresh intervals, size of DNN weights, and max feature maps for classification DNNs (a) and detection DNNs (b).
Figure 10 shows the amount of critical and noncritical data for a set of popular DNNs used for classification and detection. Note that the y-axis is on a logarithmic scale. As we can see, the portion of noncritical data (weights) is mostly larger than that of critical data (IFMs). The average (geomean) of the maximum IFMs across all individual layers in these detection networks is 154 MB, compared to the average of total weights (370 MB). In classification DNNs, the ratio is even more skewed in favor of weights (62 MB) compared to the maximum IFMs (5 MB). These data clearly support our allocation strategy, since critical data always fit in qbin0, with noncritical data mostly allocated to \(qbin0+\) , thus ensuring minimal quality loss and yielding substantial energy savings at different approximation levels (refresh intervals). Note that approximations affect DNN weights and feature maps only in CamEdge, where DNN inference is executed on the edge processor. In comparison, only the input image is affected by DRAM approximations in CamCloud. Our design decisions and the approximation strategy are validated by the graphs in Figure 11. We clearly see that increasing the refresh interval from 64 ms to 1 second results in a nearly 0% drop in quality while reducing the refresh power by a significant amount ( \(68\%\) ). Looking at the four plots together also reveals interesting observations: (i) CamCloud is more resilient than CamEdge, which matches our expectations; (ii) at high intervals (30 seconds), large DNNs tolerate more errors than small DNNs in classification; (iii) in CamCloud, detection networks suffer more quality loss than classification networks at high intervals; and (iv) high intervals give diminishing returns in terms of power reduction. These results demonstrate the remarkable robustness of DNNs to errors in inputs and weights.
Fig. 11.
Fig. 11. Memory subsystem (DRAM) Q-E vs. refresh intervals for classification DNNs (CamCloud (a) and CamEdge (c)) and detection DNNs (CamCloud (b) and CamEdge (d)).
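The refresh interval selection described above reduces to a simple search over profiled bin capacities. The following sketch illustrates the idea; the profiling data structure, function name, and numbers are hypothetical placeholders rather than the actual AxIS implementation, which manages allocation inside a custom memory allocator (Section 5).

# bin_profile[t] maps a refresh interval t (in ms) to the bytes available
# in each quality bin, as obtained from offline DRAM page profiling.
def select_refresh_interval(bin_profile, critical_bytes):
    # Return the largest refresh interval whose accurate pages (qbin0)
    # can still hold all critical data (application program, IFMs/OFMs).
    feasible = [t for t, bins in bin_profile.items()
                if bins["qbin0"] >= critical_bytes]
    return max(feasible)

# Illustrative numbers only: qbin0 capacity shrinks as the interval grows.
bin_profile = {
    64:    {"qbin0": 1024e6, "qbin1": 0,     "qbin2": 0},
    1000:  {"qbin0": 980e6,  "qbin1": 40e6,  "qbin2": 4e6},
    30000: {"qbin0": 600e6,  "qbin1": 300e6, "qbin2": 124e6},
}
critical = 160e6  # e.g., application program plus the largest IFMs/OFMs
print(select_refresh_interval(bin_profile, critical))  # -> 30000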

4.5.3 Approximate Compute.

We developed a generic quality-driven model compression framework that takes advantage of modern DNN approximation techniques to approximate the compute subsystem of a typical edge inference system running the AI workload. By systematically controlling the quality of the application, namely top-1 accuracy in classification and mAP in detection/segmentation, this framework leverages the inherent error tolerance of DNNs to achieve energy-efficient inference; in other words, it trades off quality for reduced DNN energy consumption. Essentially, the framework achieves this by tuning different compute approximation knobs. It incorporates three distinct functionalities to approximate different types of DNN models, each of which we discuss in this section.
Quality-Driven Structured Pruning. Among the multitude of algorithmic/hardware/software approximate computing techniques mentioned in Section 2.1, we adopted a software approximation flavor for this work due to its applicability to all types of commercially available computing platforms. Specifically, we selected structured pruning and thinning to facilitate energy-efficient inference on resource-constrained edge devices due to their simplicity, as shown in Figure 8.3A. This generic approximation methodology allows the deployment of optimized DNNs on COTS devices without the need for specialized sparse convolution libraries/hardware. In contrast, unstructured (elementwise) pruning cannot translate network sparsity into real inference speedup unless implemented in specialized hardware accelerators. Structured pruning reduces the DNN model size, the number of compute operations (FLOPs/MACs), the dynamic memory footprint, and the number of memory accesses. Ultimately, these benefits lead to faster inference and less energy consumption without sacrificing a significant amount of accuracy. These factors motivated us to introduce a novel and generic pruning framework, QUSP, to approximate the DNN inference operation. First, we give a high-level overview of QUSP, as illustrated in Figure 12. The system designer provides a pretrained DNN model \(M_0\) , a set of end-user-defined target application quality loss bounds \(Q_b\) (e.g., \(\lbrace 0.5, 1.0, 2.5, \ldots \rbrace\) ), a DNN-specific application dataset, and a saliency metric for ranking structures (filters or channels) in a DNN. Using these inputs, QUSP generates a gradual/iterative quality-driven pruning schedule and subsequently runs the structured pruning and thinning algorithm to compress the DNN model. The distinguishing feature of QUSP is that, given a set of quality loss bounds, it generates a family of compressed DNN models \(\mathcal {MF} = \lbrace M_i\rbrace\) , each meeting a distinct quality bound \(q_i \in Q_b\) . Essentially, pruning introduces sparsity in entire filters depending on their importance, and thinning permanently eliminates these sparse filters and the dependent OFMs, thereby reducing computational complexity (FLOPs) and decreasing the energy consumption of the compute subsystem. As one can observe in the figure, for a higher quality loss bound, QUSP prunes a higher number of filters and feature maps, showing a positive correlation between the approximation degree and the quality loss bound. Put another way, as the user relaxes the quality demand, the system designer can apply a higher degree of approximation, resulting in better energy efficiency in the edge inference system. Since most pruning algorithms involve training and/or fine-tuning models, QUSP is run offline on a resourceful cloud server, as the computational requirements of pruning usually exceed the capacity of edge systems. The key novelty lies in generating multiple models along the accuracy-FLOPs tradeoff that can be directly executed on COTS edge devices in an energy-efficient way, without the need for any hardware modifications. Note that structured pruning is one of the most straightforward ways to approximate any DNN, and QUSP can easily be extended to other compression algorithms and saliency metrics to generate Q-E tradeoffs for the compute subsystem.
Fig. 12.
Fig. 12. High-level overview of the QUSP framework illustrating generation of a family of compressed DNN models subject to a set of quality loss bounds. Higher approximation degree increases the pruning percentage in each model.
Let us dive into the inner workings of the QUSP framework to understand how we perform pruning in a systematic manner. The overall quality-driven pruning and thinning scheme is described in Algorithm 2. Among the different pruning granularities found in the literature, we adopted filter (channel) pruning of the CONV layers in a DNN, following popular SOTA compression schemes [27, 47], where the objective is to remove entire filters in multiple layers of the DNN based on their significance. As discussed in Section 3.1, DNN models are over-parameterized, so redundant filters are highly likely to exist in multiple layers. A saliency metric answers the fundamental question in this context: which filters in a layer, or across the network, are redundant, such that their removal results in the least accuracy degradation? Among the multitude of such metrics proposed by researchers [66], QUSP can choose one or more metrics for filter classification, such as the layerwise \(\ell _2\) norm of filters, the \(\ell _1\) norm of activations, and the average gradient. Using the chosen metric, QUSP runs Sensitivity Analysis (SA) [45] on the pretrained model \(M_0\) to measure the sensitivity of all layers to pruning and generate a sparsity table \(\Gamma\) (line 3). Dense or fully connected layers are not evaluated in this analysis, as CONV layers contribute the most FLOPs in a DNN. The pruning schedule is another critical component of any pruning algorithm: it determines the number of iterations to run (duration), the pruning criteria, the number of filters to prune per iteration, and the frequency of pruning. Existing pruning schemes [47] set sparsity levels based on SA results or continue pruning until a specified compute budget is achieved. However, these stand-alone techniques are insufficient for our work, as we use multiple quality loss bounds as the driving constraint. Following the lines of prior work on AGP [111], we designed a new quality-driven iterative pruning scheduler for QUSP that automatically calculates the pruning duration d, as well as the initial and final sparsity values \(s_s\) and \(s_f\) , for each layer l and each quality bound \(q_i \in Q_b\) . These sparsity values are generated using a quality-driven sparsity selector, as described in Algorithm 1. This process uses \(\Gamma\) to find the maximum tolerable sparsity of each DNN layer per \(q_i\) . The thresholds used in this algorithm were determined empirically.
At each iteration, QUSP evaluates the importance of the filters based on the selected saliency metric and prunes the K least significant filters based on the sparsity percentage inferred from the schedule \(s_t\) (lines 11–19). After d iterations, we clone the current pruned model to \(M_1\) for the first quality bound \(q_1\) (line 24) and continue pruning until B models are created, one for each \(q_i\) . It is important to note that pruning alone cannot offer any inference speedup or computation reduction unless there is hardware support for sparse inference. This iterative process is therefore followed by network thinning, which considers data dependencies across layers and then physically removes the pruned filters from the CONV layers, along with the biases and the relevant coefficients of the batch normalization layers that follow each CONV layer. An added advantage of thinning is that the OFMs produced by the pruned filters are also eliminated; furthermore, the weight filters in the subsequent CONV layer that correspond to these pruned OFMs are removed as well. These cumulative efforts result in a substantial reduction in the number of FLOPs and weights, which speeds up inference and reduces the energy of the computation subsystem. To counteract the loss of accuracy induced by thinning, we fine-tune each model \(M_i\) so that its inference accuracy satisfies the corresponding quality bound \(q_i\) (lines 30–33). This process continues until QUSP generates a family of compressed DNN models \(\mathcal {MF}\) for all target quality specifications, with different FLOPs and accuracy. Figure 8.3A illustrates a network approximated using QUSP. Here, two weight filters and two OFMs are pruned in layer \(L_i\) , and in \(L_{i+1}\) the weight filters corresponding to these OFMs are also removed. Similarly, Figure 12 shows increasingly pruned filters, IFMs, and OFMs of a single CONV layer in three compressed models generated by QUSP.
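For intuition, the following PyTorch sketch shows one pruning iteration combining an AGP-style sparsity ramp with \(\ell _2\) -norm filter saliency. It is a simplification under our own naming, not the actual QUSP implementation (which builds on Distiller, as noted in Section 5), and it omits thinning and fine-tuning.

import torch
import torch.nn as nn

def agp_sparsity(t, d, s_s, s_f):
    # AGP-style schedule: sparsity ramps from s_s to s_f over d steps.
    return s_f + (s_s - s_f) * (1 - t / d) ** 3

def prune_filters_l2(conv, sparsity):
    # Structured pruning: zero out the filters with the smallest l2 norm.
    # Thinning (physically removing them and the dependent OFM channels
    # in the next layer) is a separate pass, omitted here.
    w = conv.weight.data                    # shape [out_ch, in_ch, k, k]
    scores = w.flatten(1).norm(p=2, dim=1)  # per-filter l2 norm
    k = int(sparsity * w.shape[0])
    if k > 0:
        idx = torch.argsort(scores)[:k]     # least salient filters
        w[idx] = 0.0

# One pruning iteration at step t = 5 of a d = 10 step schedule:
conv = nn.Conv2d(64, 128, kernel_size=3)
prune_filters_l2(conv, agp_sparsity(t=5, d=10, s_s=0.0, s_f=0.5))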
Model Selection and Backbone Switching. The second functionality that we incorporated into our model compression framework is a simple model selection criterion. Given an architecture family consisting of a series of pretrained DNNs with varying quality and computational cost, the framework finds the most computationally efficient model subject to a predefined quality constraint. A few examples of such families are ResNet [26], MobileNet [32], and EfficientNet [87] among classification networks, and EfficientDet [88] and YOLOv5 [36] among detection networks. Let \(\mathcal {M}\) represent the set of models available in any such family. Essentially, the framework takes a set of user-defined quality loss bounds \(Q_b\) of size B and finds B models \(\mathcal {MF} = \lbrace M_j \in \mathcal {M}\rbrace\) with the minimum number of FLOPs, each satisfying the respective quality bound, by solving the following problem:
\begin{align} \begin{split} \min _{M_j \in \mathcal {M}} & \quad FLOPs(M_j) \\ s.t. & \quad 100 \times (1 - Q_{M_j}/Q_0) \lt Q_b[i], \quad 1 \le i \le B. \end{split} \end{align}
(2)
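A direct implementation of Equation (2) is a filter-and-minimize pass over the family, sketched below. The function name and the model entries are illustrative placeholders (the accuracy and FLOP figures are only ballpark EfficientNet-like numbers), not the framework's actual interface.

def select_models(models, q0, q_bounds):
    # models: list of (name, flops, quality) tuples for one DNN family.
    # For each quality loss bound, pick the model with the fewest FLOPs
    # whose normalized quality loss stays below the bound (Equation (2)).
    # Assumes at least one model satisfies each bound.
    family = []
    for qb in q_bounds:
        feasible = [m for m in models if 100 * (1 - m[2] / q0) < qb]
        family.append(min(feasible, key=lambda m: m[1]))
    return family

# Hypothetical family (FLOPs in billions, top-1 accuracy in percent):
models = [("B0", 0.39, 77.1), ("B2", 1.0, 80.1), ("B4", 4.2, 82.9)]
print(select_models(models, q0=82.9, q_bounds=[0.5, 5.0, 10.0]))
# -> [("B4", ...), ("B2", ...), ("B0", ...)]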
Another prominent feature of our compression framework is backbone switching. Generally, detection DNNs use a backbone network to extract features from the input image, followed by a neck (feature pyramid network or region proposal network) and a head (prediction network) that generate bounding boxes, class scores, and labels. The framework incorporates backbone networks of varying complexity into SOTA detection DNNs and subsequently fine-tunes them to generate B models \(\mathcal {MF}\) , one for each quality specification.
We now present a set of figures that supports the decisions behind the compute approximation techniques in our model compression framework. Figure 13(a) shows top-1 accuracy vs. FLOPs for multiple compressed versions of small and large classification DNNs generated by QUSP. We can clearly observe that each family comprises models that gradually trade off accuracy for a reduction in the number of FLOPs and parameters. This plot clearly validates our motivation for QUSP. As can be seen in Figure 13(b), our compression framework is also able to apply the model selection criterion and the backbone switching technique to find a family of detection models with tradeoffs in mAP-FLOPs and mAP-parameters. Figure 14 plots application quality against the percentage reduction in FLOPs compared to the baseline model for both applications, and the same conclusions can be drawn from these plots. We also found a significant positive correlation between the size of the baseline model and the approximation potential: large models tolerate more pruning-induced approximations with less impact on accuracy than smaller models, which indicates that smaller models such as MobileNetV2 and SqueezeNet1.1 are comparatively less over-parameterized. We also wanted to understand whether the reduction in FLOPs translates into actual hardware speedup. To do this, we executed all versions of the models from the generated families \(\mathcal {MF}\) on our prototype system (discussed in Section 5) and measured the inference latency using a single image (i.e., batch size of 1). As observed in Figure 15, we obtained a significant speedup for all DNNs across different quality specifications. Overall, these results clearly indicate that all of our compression techniques, including QUSP, are quite effective in approximating the compute subsystem, which ultimately leads to less energy consumption in that subsystem. Note that our approach is orthogonal and complementary to many existing DNN compression works. We use the generated library of compressed models along with concurrent approximations in other subsystems, and we evaluate system quality and energy benefits, as well as explore inter-subsystem interactions for multiple compressed versions (discussed later in Section 4.6).
Fig. 13.
Fig. 13. Ball charts representing top-1 accuracy vs. FLOPs for classification DNNs (a) and mAP vs. FLOPs for detection DNNs (b). The size of each ball corresponds to the number of DNN parameters.
Fig. 14.
Fig. 14. (a) Top-1 accuracy vs. percentage reduction in FLOPs in classification DNNs. (b) mAP vs. percentage reduction in FLOPs in detection DNNs.
Fig. 15.
Fig. 15. Inference latency speedup subject to compute approximations for large and small classification DNNs and detection DNNs, compared to the baseline model.

4.5.4 Approximate Communication.

In the past decade, smart camera systems have become an integral part of private homes, office buildings, smart cities, autonomous vehicles, traffic management systems, smart agriculture, and so on, with applications in home security, video surveillance, biometric recognition, real-time viewing, GPS and geofencing, smart parking systems, and suicide prevention, among many others. Since these cameras process large volumes of data, lossy compression is commonly applied to acquired images prior to offloading and/or processing. We are interested in compression because it can reduce data traffic and thus save communication energy. The communication subsystem (WiFi module) in these cameras is used to offload the acquired image to a cloud server, which runs the DNN inference and sends the result back to the device. As discussed in Section 2.4, DNNs have been shown to be resistant to compression-induced distortions in the input image [15]. Without loss of generality, we adopt the JPEG compression scheme to logically approximate the communication subsystem. Lossy compression is a very effective approximation technique, as it drastically lowers the total amount of data to be transmitted while incurring a very small amount of error in the transmitted data. The compression quality parameter \(Q \in \lbrace 100, 95, \ldots , 10\rbrace\) , which controls the amount of JPEG compression, serves as the approximation knob. A smaller Q results in a smaller image size with lower image quality and consequently higher energy savings, thus representing a higher degree of approximation. We vary Q to obtain images of gradually decreasing size, as shown in Figure 8.3B. To assess the robustness of popular DNN benchmarks to input images perturbed by these distortions, we run cloud-based DNN inference on the compressed images and measure the resulting energy savings in the communication subsystem. Figure 16(a) and 16(b) show the normalized Q-E tradeoff for classification and detection, respectively. Although the quality degradation is not significant, compression has a measurable influence on quality throughout the range of Q settings. These results also reveal that, unlike classification, detection networks suffer a marked decrease in application quality below \(Q=30\) , in line with the results of Gandor and Nalepa [21]. Interestingly, setting \(Q=80\) results in a 93% to 95% reduction in communication energy, a fact that has been overlooked in most articles related to DNN image compression, as the authors do not consider the energy aspect. Together, these motivating results suggest the range of practical compression levels (used in Section 4.7) that reduce the communication energy of the wireless module with minimal impact on application quality.
Fig. 16.
Fig. 16. Communication subsystem Q-E vs. JPEG compression Q factor for classification DNNs (a) and detection DNNs (b).
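The size reduction behind these savings is easy to reproduce. The following sketch (using the Pillow library; the file name and helper function are our own placeholders) compresses a frame at each Q setting and reports the payload the WiFi module would have to transmit, which is a first-order proxy for communication energy.

from io import BytesIO
from PIL import Image

def jpeg_sizes(path, q_values=(100, 95, 80, 50, 30, 10)):
    # Compress the image at each JPEG quality factor Q and record the
    # resulting payload size in bytes.
    img = Image.open(path).convert("RGB")
    sizes = {}
    for q in q_values:
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=q)
        sizes[q] = buf.tell()
    return sizes

# Smaller Q -> smaller payload -> shorter WiFi active time and energy.
print(jpeg_sizes("frame.jpg"))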

4.6 Interaction Among Subsystem-Level Approximations and Q-E Tradeoffs

The knowledge of the approximation techniques for individual subsystems and their corresponding Q-E tradeoffs allows us to investigate some interesting interactions between them, as well as to determine the relative impact of individual subsystem approximations on the energy and quality of the system. In this section, we demonstrate that the approximations in one subsystem of AxIS can also influence the operation of the other subsystems, thus affecting their energy consumption. Furthermore, we also investigate the cumulative effects on system-level Q-E tradeoffs when multiple subsystems are approximated simultaneously.

4.6.1 Interaction Between Sensor and Compute Approximations.

We first examine the interactions among different subsystems in CamEdge. As mentioned in Section 2.1, modern DNN models support different strategies of network scaling, including resolution and compound scaling. For two such networks, EfficientNet in the high-compute regime and EfficientNet_Lite in the mobile/edge-compute regime, the family of approximated models \(\mathcal {MF}\) consists of models scaled in both the depth d and width w dimensions. In other words, the network architecture varies among these models due to the difference in the number of layers and the parameters of the CONV layers. This ultimately leads to a lower number of MACs (1 MAC = 1 FLOP) and activations than the most computationally complex network in \(\mathcal {MF}\) , which we consider the baseline model \(M_0\) . For the rest of the article, we follow this convention to select the baseline model for all DNNs (listed in Section 5). As verified earlier in Section 4.5.3, these lower-complexity models demonstrate inference speedup on real hardware. We represent the reduction in the number of MACs due to this compute approximation by the scaling factor \(s_c\) . In comparison, the sensor approximation that reduces the input image size is essentially reminiscent of resolution scaling. Intuitively, subsampling by a factor \(s_r\) reduces MACs by \(s_r^2\) . This generates an excellent accuracy-MAC tradeoff, which ultimately results in a Q-E tradeoff. We consider the maximum input image size corresponding to \(M_0\) as the baseline image resolution. Since approximations in both the sensor and the compute subsystem modify the MACs, their interactions are quite interesting to study. Figure 17 shows 3D plots representing normalized MACs (on the z-axis) vs. image size and compressed model version \(M_i \in \mathcal {MF}\) (on the x and y axes). Note that MACs are normalized w.r.t. MACs ( \(M_0\) ) to allow comparison, and the absolute number of MACs for each \(M_i\) is already shown in Figure 13. This kind of surface plot allows us to gauge the impact of compound scaling. Cumulative approximations essentially reduce MACs in proportion to \(s_r^2 \times s_c\) , which is clearly evident from these graphs. Note that among classification networks, this kind of interaction occurs only for models that support resolution scaling.
Fig. 17.
Fig. 17. Compute complexity (MACs) vs. models \(M_i\) (compute approx.) and image size (sensor approx.) for classification DNNs: EfficientNet (a) and EfficientNet_Lite (b).
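To make the compound effect concrete, consider an illustrative operating point (numbers hypothetical): subsampling with \(s_r = 2\) shrinks the input by \(4\times\) , and selecting a compressed model \(M_i\) with half the MACs of \(M_0\) corresponds to \(s_c = 2\) , so
\begin{align*} \frac{MACs(M_0,\ \text{full resolution})}{MACs(M_i,\ s_r = 2)} \approx s_r^2 \times s_c = 4 \times 2 = 8, \end{align*}
that is, roughly \(8\times\) fewer operations than the baseline model at full resolution.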
Figure 18 shows similar graphs for four SOTA DNNs used for object detection and instance segmentation. Unlike early classification DNNs, object detection networks are inherently capable of resolution scaling, as they can accept input images of varying resolutions. As a consequence, the image size decided by the approximated sensor subsystem has a significant impact on the number of MACs in the network, which is clearly observed in all four graphs. Meanwhile, the compression framework used the backbone switching functionality to generate \(\mathcal {MF}\) for Faster_RCNN and Mask_RCNN with varying numbers of MACs, and the families \(\mathcal {MF}\) corresponding to EfficientDet and YOLOv5, which are scaled in the d and w dimensions, also show a significant reduction in the number of MACs. Figure 18 also shows the maximum reduction at the intersection points. We derive the following observations from these figures. First, to achieve the maximum reduction in MACs and better system-level energy efficiency for a particular quality specification, it is critical to perform sensor and compute approximations in a synergistic way. Second, the gradual slope of the surface w.r.t. the sensor approximation is markedly visible, indicating that sensor subsampling is a fine-grained knob for controlling DNN complexity. Third, a by-product of these approximation methods is a reduction in memory energy, since the number of memory accesses is reduced and the DRAM now needs to store fewer DNN weights and activations.
Fig. 18.
Fig. 18. Compute complexity (MACs) vs. models \(M_i\) (compute approx.) and image size (sensor approx.) for detection DNNs: Faster_RCNN (a), Mask_RCNN (b), EfficientDet (c), and YOLOv5 (d).

4.6.2 Impact of Sensor Approximation on Other Subsystems.

Let us turn our attention to the interactions in CamCloud, and specifically to how the sensor approximation affects the energy of the other subsystems for the object detection task. Figure 19 shows the energy breakdown of the four subsystems corresponding to different degrees of sensor approximation (i.e., different input image sizes). For example, in Figure 19(d), we can observe a drop in communication energy from \(67\%\) to \(43\%\) when the image resolution for inference using YOLOv5 drops from 1280 to 1024. As is evident from these graphs, the communication subsystem benefits the most from the sensor approximation. This decrease is due to the smaller image: the communication module has to transmit fewer data, which reduces its active time. The reduction in transmission time also reduces the active time of the processor. Furthermore, if communication approximation is selected, the processor can compress a smaller image in a shorter time frame. Moreover, memory has to store less data and is active for a shorter duration. This results in lower memory and compute energy, as confirmed by the \(5\%\) energy drop in both subsystems in Figure 19(d). Similar trends can be observed for the other detection DNNs in Figure 19(a) through (c). Thus, the benefit of sensor approximation percolates into the other subsystems, resulting in a cascade of energy reductions that ultimately translates into overall system energy efficiency. We observe similar results for EfficientNet and EfficientNet_Lite among classification DNNs, since they are the only ones (among our benchmarks) that accept images of varying resolutions.
Fig. 19.
Fig. 19. DNN input image size vs. normalized system-level energy breakdown for four detection DNNs: Faster_RCNN (a), Mask_RCNN (b), EfficientDet (c), and YOLOv5 (d). Note that a smaller image size indicates a higher sensor approximation degree.

4.6.3 Q-E Tradeoff for Approximations in Multiple Subsystems for Object Detection.

In the previous two sections, we studied the impact of sensor and compute approximations on the operation and energy consumption of different subsystems. However, it is imperative to account for the degradation of application quality and system-level energy when multiple subsystems are approximated together. Unlike previous studies that investigated the impact of individual approximations on DNN performance, we dive into inter-subsystem interactions and their corresponding impact on the system. To do this, we examine the Q-E tradeoffs in both CamEdge and CamCloud and represent them using the set of six graphs in Figure 20. Here, we consider a single detection inference using the YOLOv5 architecture subject to different pairs of subsystem approximation techniques. We do not present tradeoffs for the concurrent application of approximations in all three subsystems, as visualizing a 4D or 5D graph is impractical. In these graphs, quality Q is plotted on the z-axis, and the color gradient represents the system-level energy E, each normalized to the corresponding accurate system metrics. Red indicates higher energy consumption, whereas green indicates lower energy consumption. We first define a metric, the energy-quality gradient, \(\nabla EQ = \Delta E/ \Delta Q\) . Essentially, this metric measures the gain in energy savings \(\Delta E\) at the system level per quality loss \(\Delta Q\) and helps us evaluate each approximation strategy in the context of the system-level Q-E tradeoff.
Tradeoffs in CamEdge. First, without applying the sensor approximation, we investigate the interaction between the compute and memory subsystems and their impact on the Q-E tradeoff. Accelerated inference achieved using models generated by the compression framework, \(M_i \in \mathcal {MF}\) , yields multiple benefits. Active DRAM time is reduced, leading to lower DRAM energy. In addition, since the number of weights is reduced, we can allocate a larger portion of these weights to qbin0 (accurate DRAM pages), which causes less quality loss. In turn, reducing the refresh rate lowers the memory refresh overhead, leading to further inference acceleration and better energy efficiency. These effects reduce the total system energy, as can be clearly observed in Figure 20(a). As is evident, the energy reduction due to approximation in the compute subsystem is much greater than that due to approximation in memory, whereas \(\Delta Q\) is quite similar; thus, \(\nabla EQ_{comp} \gt \gt \nabla EQ_{mem}\) . In Figure 20(b), the memory subsystem is left untouched while the sensor and the compute subsystems are approximated. Section 4.6.1 already discusses how these two interact with each other and reduce the number of MACs/FLOPs in a DNN. This graph shows that a smaller number of FLOPs ultimately results in lower system energy consumption. Interestingly, the sensor approximation has a greater impact on system energy, despite similar \(\Delta Q\) ; thus, \(\nabla EQ_{sensor} \gt \nabla EQ_{comp}\) . This shows that the sensor subsystem provides more energy savings than the computation subsystem. Finally, we explore the interaction between the sensor and the memory approximations. Figure 20(c) shows that sensor approximations result in higher energy efficiency compared to memory.
One of the striking results that emerges is that DNNs are more resilient to bit errors in the input image than to the image distortions introduced by sensor subsampling. This is indicated by the relatively lower \(\Delta Q\) for memory; thus, \(\nabla EQ_{sensor} \gt \gt \nabla EQ_{mem}\) . Looking at the graphs in Figure 20(b) and (c) together suggests that the \(\Delta Q\) due to sensor approximations is aggravated more by compute approximations than by memory approximations. Comparing Figure 20(a) and (b) shows that the \(\Delta Q\) induced by the compute approximation is affected more by memory than by the sensor. This observation is related to the fact that DNN weights are affected by bit errors in DRAM, since they are assigned to erroneous pages (Section 4.5.2). This is also clear if we compare the \(\Delta Q\) due to the memory approximations in Figure 20(a) and (c): compared to the sensor, the approximation of the compute subsystem affects the memory more.
Fig. 20.
Fig. 20. System-level Q-E tradeoff in CamEdge running YOLOv5 (detection DNN) subject to synergistic approximations in compute and memory (a), sensor and compute (b), and sensor and memory (c) subsystems. Similar results for CamCloud with synergistic approximations in communication and memory (d), sensor and communication (e), and sensor and memory (f) subsystems.
Tradeoffs in CamCloud. Similarly to CamEdge, Figure 20(d) through (f) depicts the interactions in CamCloud. As observed in Figure 20(d), the \(\Delta Q\) values due to communication and memory approximations are quite low. Image compression not only reduces communication energy but also reduces processor energy consumption, as the processor packetizes fewer data; this reduction in compute energy easily makes up for the overhead of performing the JPEG compression itself. Therefore, the communication subsystem provides more system-level energy savings, resulting in \(\nabla EQ_{comm} \gt \nabla EQ_{mem}\) . Figure 20(e) shows the Q-E tradeoff vs. sensor and communication approximations without any change in the memory subsystem. Clearly, DNNs are more resistant to communication approximations, which results in \(\nabla EQ_{comm} \gt \nabla EQ_{sensor}\) . From Figure 20(f), we can infer that \(\nabla EQ_{mem} \gt \nabla EQ_{sensor}\) due to the negligible \(\Delta Q\) from memory approximation, despite the higher energy savings from sensor approximation. Comparing the three graphs reveals that the memory approximation has the least impact on application quality in CamCloud, clearly validating the resiliency of DNNs to bit errors in the input image.
The key takeaways from Figure 20 are as follows. First, the compute and communication subsystems provide more energy saving opportunities than memory when the sensor is not approximated; second, the memory approximations have the least impact on Q in both variants; third, the sensor provides more energy savings than compute and communication when memory is not approximated; and fourth, the sensor also provides better savings than memory when the other two subsystems are kept at their baseline configurations. In summary, these results show that errors arising from individual approximations tend to influence each other, and we found strong evidence that approximating multiple subsystems can lead to more energy savings at the system level than individual approximations.

4.6.4 Q-E Tradeoff for Approximations in Multiple Subsystems for Image Classification.

We also investigated inter-subsystem interactions in the context of image classification using ResNet101. Figure 21 represents the Q-E tradeoff for both variants of AxIS. Similarly to the observations in the previous section, DNNs are quite resilient to memory approximations in the presence of other approximations, as shown by the minimal \(\Delta Q\) in these six plots. In fact, the drop is lower in classification DNNs than in detection DNNs, consistent with our results in Section 4.5.2. Sensor behavior is the main difference between these two applications. We initially expected that the lower-resolution image acquired from the approximate sensor could induce system-level energy reduction. However, bit errors due to DRAM refresh rate reduction and/or compute/communication approximations adversely affect the application quality to a large extent for subsampled images. This accuracy loss was mitigated by upsampling low-resolution images in the edge processor before they were stored in the DRAM, as mentioned in Section 4.5.1. Therefore, the sensor contributed very little to system-level energy savings compared to the other subsystems, as observed in Figure 21(b), (c), (e), and (f); here, computation and communication approximations are the main sources of system energy reduction. It is important to mention that classification DNNs such as EfficientNet and EfficientNet_Lite show Q-E tradeoffs similar to detection networks, as they are inherently designed to support resolution and compound scaling. For these two architectures, sensor approximations do result in system-level energy savings across the family of compressed models.
Fig. 21.
Fig. 21. System-level Q-E tradeoff in CamEdge running ResNet101 (classification DNN) subject to synergistic approximations in compute and memory (a), sensor and compute (b), and sensor and memory (c) subsystems. Similar results for CamCloud with synergistic approximations in communication and memory (d), sensor and communication (e), and sensor and memory (f) subsystems.

4.7 Design-Space Exploration for System-Level Q-E Tradeoffs

Section 4.6 reveals considerable complexity in how the subsystems interact with one another. The impact on application quality and system energy depends not only on the degree of approximation of an individual subsystem but also on the degrees of the other subsystems. Clearly, the best approximation configuration at each quality specification has no deterministic closed-form solution and varies across applications and DNNs. To navigate this complex and non-homogeneous space, we devised a gradient descent based DSE algorithm. Using insights from the individual subsystem Q-E tradeoffs (Section 4.5) and their interactions (Section 4.6), we use this algorithm to synergistically tune all subsystem approximation knobs, as shown in Figure 22. Our synergistic methodology is designed to achieve greater system-level resource and energy efficiency than individual approximations, while still meeting the user-defined quality loss bounds \(Q_b\) .
Fig. 22.
Fig. 22. DSE algorithm to characterize DNN sensitivity in the presence of synergistic approximations in multiple subsystems. Dotted lines are only in effect during the first round.
We start with the baseline configuration, without approximations in any subsystem, and assess the baseline quality \(Q_0\) and system energy \(E_0\) for a specific DNN model and the associated validation dataset. The per-round quality bound \(Q_{B}\) is derived from \(Q_{b}\) only for the first round. For simplicity, we represent the list of subsystems targeted for approximation by s and the corresponding approximation levels by \(\alpha _s\) . The algorithm increments \(\alpha _i\) of each subsystem ( \(i \in s\) ) individually and performs a Q-E SA to measure the quality \(Q_i^{\alpha _i+1}\) and energy \(E_i^{\alpha _i+1}\) of each system configuration. The algorithm identifies all subsystems whose approximations do not meet the quality bound \(Q_B\) and eliminates them from all subsequent SA rounds, thus drastically pruning the design search space. Subsequently, the algorithm selects the system configuration with the maximum \(\nabla EQ\) (defined in Section 4.6.3), which is used in the next round of SA. The quality \(Q_{f}\) , the energy \(E_{f}\) , and the quality bound \(Q_{B}\) corresponding to the selected configuration become the updated baselines. In the next round, we continue to explore, eliminate subsystems that violate the updated \(Q_B\) , and select the best configuration. This elimination-based gradient descent process continues until the user-defined quality specification \(Q_{b}\) is reached, at which point we save the final configuration for all subsystems along with the final quality \(Q_{f}\) and energy \(E_{f}\) . This DSE algorithm is an effective way to synergistically approximate multiple subsystems and obtain the best system-level Q-E tradeoff. The maximum energy-quality gradient provides the best return on quality loss, as it always selects the configuration with the greatest energy savings for the least quality degradation. Note that the DSE is performed offline, prior to the actual deployment of the edge inference system in the real world. We opt for a gradient descent based algorithm, instead of a more complex reinforcement learning optimizer [89], for its simplicity and low computational overhead. Another major benefit is that the proposed DSE is generic and can be applied to any system with opportunities to employ approximations in multiple subsystems; it can also easily be modified to compare different approximation strategies within the same subsystem. Compared to an exhaustive search, we obtain the overall system Q-E relationship \(4\times -17\times\) faster (CamEdge) and \(4\times -12\times\) faster (CamCloud) for \(Q_{b}=0.5\%-10\%\) (on average) across all classification and detection DNN benchmarks.
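The following sketch captures the control flow of this DSE loop. The knob names and the evaluate hook (which stands in for one measured Q-E sensitivity run) are our own placeholders, and the sketch omits the per-round quality bound update for brevity.

def dse(knobs, evaluate, q0, e0, q_bound):
    # Greedy, elimination-based gradient descent: each round, tentatively
    # raise one knob at a time, drop knobs that violate the quality bound,
    # and commit the move with the largest gradient dEQ = dE / dQ.
    config = {k: 0 for k in knobs}   # level 0 = no approximation
    q_f, e_f = q0, e0
    active = set(knobs)
    while active:
        best, best_grad, best_qe = None, 0.0, None
        for k in list(active):
            trial = dict(config, **{k: config[k] + 1})
            q, e = evaluate(trial)   # one Q-E sensitivity analysis run
            if 100 * (1 - q / q0) >= q_bound:
                active.discard(k)    # prune this subsystem's subspace
                continue
            grad = (e_f - e) / max(q_f - q, 1e-9)
            if grad > best_grad:
                best, best_grad, best_qe = k, grad, (q, e)
        if best is None:
            break
        config[best] += 1
        q_f, e_f = best_qe
    return config, q_f, e_f

# Example: dse(["sensor", "memory", "compute"], evaluate, q0, e0, 1.0)
# would return the final knob configuration along with Q_f and E_f.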

5 Experimental Methodology

In this section, we describe the experimental setup and DNN benchmarks used in this work. Figure 23 depicts the fully functional smart camera system used to implement the two variants of AxIS, namely CamEdge and CamCloud. We prototyped this system using an FPGA-based Terasic development board along the lines of Raha and Raghunathan [72]. The board includes a 1-GB DDR3 DRAM module operating at 1.5 V, which implements the frame buffer for the captured image and stores the DNN inference application. The DRAM module was also used during inference execution to store DNN weights (parameters) and feature maps (IFMs/OFMs) in CamEdge. The FPGA was interfaced with a 5-MP CMOS digital image sensor and an ESP-WROOM-02 communication module with full TCP/UDP stack support. An Intel Nios II soft processor core was programmed in the FPGA (running at 133 MHz) together with the Intel UniPHY DDR3 memory controller (with refresh rate control) and the Intel Frame Buffer IP to operate the DRAM and sensor modules, respectively. Finally, a custom software memory allocator was built within \(\mu\)C/OS-II to map the frame buffer and DNN data (weights and other parameters) to the desired DRAM pages. Figure 23 also shows the overall energy measurement setup. The average power consumption of the Nios II processor was measured using the Intel PowerPlay Power Analyzer tool, and the DRAM energy consumption was measured with the help of an ADEXELEC DDR3-SODIMM-01 extender containing a current-sensing resistor interfaced to a Keithley 6430 SourceMeter. The energy consumption of the camera module and the WiFi module was measured using the Monsoon Power Monitor. The total energy consumption of the system for each DNN workload was calculated by aggregating the energy consumed by the different components of the system (the processor, the DRAM module, the camera sensor, the WiFi module, and soft IPs such as the memory controller and frame buffer) during the inference of a single test image, as discussed in Section 4.4. Note that all of our results are measured and generated under an iso-work condition.
Fig. 23.
Fig. 23. Experimental setup for the AxIS prototype along with measurement setup. The table presents details of different constituent hardware components.
DNN Baselines. We describe the suite of DNN benchmarks used for the evaluation of AxIS in the presence of approximations in multiple subsystems (Sections 4.4, 4.5, and 4.6) as well as for the DSE algorithm (Section 4.7). Table 1 lists the 10 modern and widely popular DNN models that we evaluated for the image classification application on the ImageNet dataset (ILSVRC 2012) [14]. We consider large-scale DNNs traditionally used on servers, such as ResNet101, InceptionV3, and EfficientNet, as well as small-scale DNNs such as SqueezeNet1.1, MobileNetV2, and MNASNet1.0, which target real-time applications on embedded devices and mobile platforms. Table 1 also shows the quality metric (top-1 accuracy) for all of these DNNs, along with the number of parameters and the model size. In addition, the number of FLOPs/MACs used to process one input image is shown with the corresponding image size. The FLOPs reported and used throughout this work were calculated using fvcore [19]. Table 2 lists four SOTA DNN models used for object detection and instance segmentation on the MSCOCO dataset [50]. We consider two-stage detectors such as Faster_RCNN and Mask_RCNN and one-stage detectors such as EfficientDet and YOLOv5. The table also shows the backbone network used for the baseline version of these detection models, the quality metric (box mAP or mask mAP), and model specifications similar to Table 1. We built our model compression framework (Section 4.5.3) on top of publicly available DL repositories, namely Distiller [112], PyTorch Image Models [97], and MMDetection [6], to generate the family of compressed DNN models for each benchmark. Note that we used the FP32 variants of all models.
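For reference, FLOP counting with fvcore follows the pattern below; the exact invocation used for Tables 1 and 2 is not shown here, so treat this as a representative sketch (fvcore counts one fused multiply-accumulate as one FLOP, matching the 1 MAC = 1 FLOP convention in Section 4.6.1).

import torch
from torchvision.models import resnet101
from fvcore.nn import FlopCountAnalysis

# Count the FLOPs/MACs of one 224x224 classification inference.
model = resnet101().eval()
dummy_input = torch.randn(1, 3, 224, 224)
flops = FlopCountAnalysis(model, dummy_input)
print(f"{flops.total() / 1e9:.2f} GFLOPs")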

6 Experimental Results and Key Insights

In this section, we present the experimental results to evaluate the performance of AxIS in the presence of synergistic approximations and to understand the Q-E tradeoffs at the system level. The energy and quality reported here are normalized to the corresponding values ( \(E_0, Q_0\) ) of the baseline, which is an accurate system that does not involve any approximations. This enables us to conduct comparisons across multiple DNNs, as well as among subsystems. In each subsequent section, we first show the result for image classification and then for object detection.

6.1 System-Level Energy Improvements

First, we present the overall energy improvements at the system level, compared to an accurate system, at different user-defined quality loss bounds \(Q_b\) , using Figures 24 and 25. The final \(Q_{f}\) and \(E_{f}\) for each \(Q_{b}\) , obtained from the proposed DSE algorithm (Figure 22), were used to generate these graphs. In these figures, the x-axis represents \(Q_{b}\) , whereas the y-axis shows the normalized system energy ( \(\tilde{E} = E_f/E_0\) ). Note that \(Q_{b}\) represents normalized quality loss; for example, a \(1\%\) quality loss for ResNet101 represents a \(0.77\%\) absolute reduction in top-1 accuracy ( \(77.374\%\) to \(76.6\%\) ).
Fig. 24.
Fig. 24. Results for image classification benchmarks. Normalized energy vs. quality degradation in CamEdge for large DNNs (a) and small DNNs (c) and in CamCloud for large DNNs (b) and small DNNs (d).
Fig. 25.
Fig. 25. Results for object detection benchmarks. Normalized energy vs. quality degradation in CamEdge (a) and CamCloud (b).

6.1.1 Image Classification.

Figure 24(a) and (c) show \(\tilde{E}\) vs. \(Q_{b}\) for large and small classification benchmarks, respectively, for CamEdge, and Figure 24(b) and (d) present similar results for CamCloud. As we can observe, the system-level energy savings \(\Delta E_{sys}^{edge}\) of CamEdge amount to \({1.6\times }\) for a negligible normalized quality degradation ( \(\Delta Q\le 0.5\%\) ) on average (geomean), for both large and small networks. In CamCloud, large and small DNNs consume \({2.9\times }\) and \({2.2\times }\) less energy for the same \(Q_b\) , respectively. For higher quality loss targets, \(Q_{b}=1\%-10\%\) , our design results in \(\Delta E_{sys}^{edge} = {2.1\times }-{4.5\times }\) for large DNNs and \(\Delta E_{sys}^{edge} = {1.6\times }-{2.8\times }\) for small DNNs in CamEdge. Similarly, in CamCloud, we obtain \(\Delta E_{sys}^{cloud} = {3.5\times }-{5.5\times }\) and \(\Delta E_{sys}^{cloud} = {3.3\times }-{5.2\times }\) for large and small DNNs, respectively. As we can see, even a small amount of quality degradation allows us to achieve significant energy improvements, with incremental benefits at larger degradation bounds. Notably, even small DNNs (MobileNetV2, MNASNet1.0), which are optimized for edge/mobile deployment by design, offer significant energy savings.

6.1.2 Object Detection.

Figure 25 presents similar results for object detection. As can be inferred from the graphs, CamCloud provides more energy savings than CamEdge on average: \(\Delta E_{sys}^{edge} = 1.6\times\) and \(\Delta E_{sys}^{cloud} = 4.4\times\) at \(Q_b\le 0.5\%\) . Interestingly, this also indicates that AxIS offers better energy efficiency in CamCloud with detection DNNs than with classification DNNs. This holds even for \(Q_{b}=1\%-10\%\) , where \(\Delta E_{sys}^{edge} = {2.3\times }-{12.1\times }\) and \(\Delta E_{sys}^{cloud} = {4.5\times }-{11.8\times }\) . This increase in energy savings is caused by the sensor approximations and their beneficial effects on the other subsystems. These results validate that synergistic end-to-end approximations unlock the maximum energy efficiency for edge inference applications.

6.2 Subsystem-Level Energy Benefits

Now, we take a deeper dive into the energy breakdown of AxIS. The following discussions shed light on the contribution of each subsystem to \(\Delta E_{sys}\) . For classification, we show the energy reduction corresponding to each individual subsystem in Figures 26 and 27 for CamEdge and CamCloud, respectively; Figures 28 and 29 exhibit similar results for detection. In these graphs, the breakdown of the normalized system energy \(\tilde{E}\) is presented on the x-axis and the quality loss bounds \(Q_{b}\) on the y-axis. The bottom bar of each graph represents the baseline energy breakdown, as also shown in Figures 6 and 7. In addition, for the remaining bars, numbered boxes indicate the percentage of energy reduction w.r.t. the immediately lower \(Q_{b}\) .
Fig. 26.
Fig. 26. Normalized system energy breakdown in CamEdge for large classification DNNs (AlexNet (a), VGG19_BN (b), DenseNet121 (c), InceptionV3 (d), ResNet101 (e), and EfficientNet (f)) and small DNNs (SqueezeNet1.1 (g), MobileNetV2 (h), MNASNet1.0 (i), and EfficientNet_Lite (j)) at different quality loss bounds.
Fig. 27.
Fig. 27. Normalized system energy breakdown in CamCloud for large classification DNNs (AlexNet (a), VGG19_BN (b), DenseNet121 (c), InceptionV3 (d), ResNet101 (e), and EfficientNet (f)) and small DNNs (SqueezeNet1.1 (g), MobileNetV2 (h), MNASNet1.0 (i), and EfficientNet_Lite (j)) at different quality loss bounds.
Fig. 28.
Fig. 28. Normalized energy breakdown in CamEdge for detection DNNs (Faster_RCNN (a), Mask_RCNN (b), EfficientDet (c), and YOLOv5 (d)) at different quality loss bounds.
Fig. 29.
Fig. 29. Normalized energy breakdown in CamCloud for detection DNNs (Faster_RCNN (a), Mask_RCNN (b), EfficientDet (c), and YOLOv5 (d)) at different quality loss bounds.

6.2.1 Image Classification.

Let us take the example of the CamEdge system running ResNet101 (Figure 26(e)). As observed, we obtain a \(37\%\) energy reduction at \(Q_{b}=0.5\%\) . Approximating this system further to \(Q_{b}=1\%\) , we achieve an additional \(31\%\) reduction w.r.t. \(Q_{b}=0.5\%\) ; proceeding toward higher levels of approximation yields diminishing returns. We already know that the contributions of the memory and computation subsystems to \(\tilde{E}\) are almost equal in an accurate system ( \(M=51\%, C=48\%\) from Figure 6). Here, we see that the memory approximation is the main component driving the overall energy reduction for \(Q_{b}\le 1\%\) . In contrast, most of the energy savings for higher \(Q_{b}\) are attributed to compute approximations, which also save memory energy, as the DRAM is active for a shorter duration. Other DNNs with fixed input size restrictions (Figure 26(a)–(d) and (g)–(i)) show similar behavior, as sensor approximations have negligible impact on them, as stated in Section 4.6.4. On the contrary, EfficientNet and EfficientNet_Lite (Figure 26(f) and (j)) show a high degree of energy reduction for all \(Q_b\) , driven by both the sensor and the compute subsystems and their interactions, thus validating our previous claims in Section 4.6.
Let us now investigate the contribution of the individual subsystems in CamCloud. Since communication consumes \(65\%\) of the energy of the accurate system (Figure 6), approximations here lead to substantial energy savings for all DNNs. If we look at the graph for ResNet101 (Figure 27(e)), communication energy reduces from \(65\%\) to \(5\%\) , resulting in a \(71\%\) reduction in the overall energy of the system. Unlike in CamEdge, sensor approximations yield considerable energy reductions in CamCloud, albeit only at high bounds ( \(Q_{b}\ge 5\%\) ). These results also indicate diminishing returns at higher \(Q_b\) for most DNNs, probably caused by the saturation in communication energy savings at higher JPEG compression factors, as illustrated in Section 4.5.4. However, sensor approximations play a particularly important role in EfficientNet and EfficientNet_Lite. Smaller resolutions benefit the compute energy, as the compute subsystem has to perform JPEG compression on a smaller image, and they also reduce the size of the data to be offloaded, thus reducing the communication energy. These effects result in huge energy savings for these two networks, as can be seen in Figure 27(f) and (j).

6.2.2 Object Detection.

We observe \(\approx \!37\%\) savings in \(\tilde{E}\) at \(Q_b\le 0.5\%\) for all four DNNs in Figure 28. As is evident, the memory approximation is the primary contributing factor, explained by the drop in \(E_{mem}\) from \(52\%\) to \(\approx \!14.5\%\) on average at this quality specification. For higher \(Q_b\) , all the other subsystems interact and help extract the maximum energy efficiency. Although the sensor's contribution to system energy is much smaller than that of the other subsystems, its impact is profound: the benefits of sensor subsampling reduce the energy of multiple subsystems, as elaborated in Sections 4.6.2 and 4.6.3. In addition, compressed models lead to faster inference on the smaller images. These effects result in substantial energy savings. In contrast, Figure 29 shows that both the communication and memory approximations contribute to the \(\approx \!77\%\) energy reduction (on average) in \(\tilde{E}\) at \(Q_b\le 0.5\%\) . The drop in communication energy from \(\approx \!66\%\) to \(\approx \!3\%\) and the drop in memory energy from \(\approx \!15\%\) to \(\approx \!4\%\) clearly support this deduction. At \(Q_b\ge 1\%\) , the sensor approximation again has the most significant impact, as its benefits percolate through all subsystems.

6.3 Comparative Analysis: Individual vs. Synergistic Approximations

The next set of results highlights the benefits of our proposed synergistic approximation methodology in AxIS compared to the approximation of the individual subsystems in isolation. In Figures 30 and 31, the y-axis shows \(\tilde{E}\) and the x-axis denotes \(Q_b\) . The columns represent \(\tilde{E}\) for CamEdge (in red) and CamCloud (in green) when all subsystems are synergistically approximated using the proposed DSE algorithm, whereas the lines denote \(\tilde{E}\) when an individual subsystem such as sensor, memory, compute, or communication is approximated in isolation.
Fig. 30.
Fig. 30. Comparison of Q-E tradeoffs of the proposed system-level approximations with those of individual subsystem approximations at different quality loss bounds for classification benchmarks. Results are presented for CamEdge using an average over six large (a) and four small (c) DNNs. Similar results are presented for CamCloud using an average over six large (b) and four small (d) DNNs.
Fig. 31.
Fig. 31. Comparison of Q-E tradeoffs of the proposed system-level approximations with those of individual subsystem approximations at different quality loss bounds for detection benchmarks. Results are presented for CamEdge (a) and CamCloud (b) using an average over four detection DNNs.

6.3.1 Image Classification.

Figure 30(a) and (c) show this comparison for large and small DNN benchmarks for CamEdge. For large DNNs, the proposed CamEdge system, on average, yields \({1.6\times }-{3.0\times }\) , \({1.03\times }-{2.2\times }\) , and \({1.6\times }-{1.7\times }\) energy improvements over sensor-only, memory-only, and compute-only approximations, respectively, for \(Q_{b}=0.5\%-10\%\) . In comparison, smaller DNNs show \({1.6\times }-{2.3\times }\) , \({1.0\times }-{1.6\times }\) , and \({1.6\times }-{1.7\times }\) energy improvements. Following common intuition, large DNNs are more resilient, resulting in higher energy gains. Figure 30(b) and (d) show similar trends for CamCloud. For the same \(Q_b\) range, the system-level energy improvements compared to sensor-only, memory-only, and communication-only approximations amount to \({2.8\times }-{4.4\times }\) , \({2.5\times }-{4.5\times }\) , and \({1.4\times }-{1.7\times }\) for large DNNs. For small DNNs, these numbers change to \({2.2\times }-{4.1\times }\) , \({1.9\times }-{4.4\times }\) , and \({1.3\times }-{1.6\times }\) . These results clearly demonstrate the feasibility and effectiveness of synergistic approximations in AxIS.

6.3.2 Object Detection.

Figure 31 shows the effectiveness of AxIS for object detection. CamEdge on average yields \({1.6\times }-{2.5\times }\) , \({1.0\times }-{7.4\times }\) , and \({1.6\times }-{4.4\times }\) energy improvements over sensor-only, memory-only, and compute-only approximations, respectively, for \(Q_{b}=0.5\%-10\%\) . Similarly, CamCloud produces \({3.9\times }-{4.3\times }\) , \({3.9\times }-{10.3\times }\) , and \({1.4\times }-{3.4\times }\) energy improvements for the same quality range. Overall, these results clearly indicate that the effectiveness of AxIS translates well to the object detection application, similar to what was observed for image classification.
These comparative results reveal several interesting facts. First, the energy benefits due to memory approximations usually plateau for \(Q_b\gt 0.5\%\) . Second, sensor approximations benefit detection DNNs for all \(Q_b \gt 1\%\) ; in comparison, the benefits for classification DNNs are limited and only visible in EfficientNet and EfficientNet_Lite (i.e., the models that support resolution scaling). Third, the communication approximations provide energy savings for all \(Q_b\ge 0.5\%\) , although the benefit gradually saturates at higher bounds. Fourth, the benefits of approximations in the compute subsystem usually become visible for \(Q_b\ge 1\%\) and provide a gradual Q-E tradeoff at subsequent quality bounds.

6.4 Inference Latency Speedup in AxIS

Figure 32 shows the inference latency speedup of CamEdge for all DNN benchmarks (in bold), where all subsystems are synergistically approximated. Alongside, we report the speedup when only the compute subsystem is approximated (i.e., the results from Figure 15). As in the previous sections, the approximate system configuration used to execute the DNNs on actual hardware was obtained from the DSE. We make the following observations from the figure. First, although improving energy efficiency has been the main driving objective of this work, AxIS also demonstrates a significant speedup. Second, we see a decrease in speedup at \(Q_b=1\%\) for some of the classification DNNs. This is because the memory approximation is usually preferred at this quality specification, as DNNs show the most resiliency to memory approximations (as explained in Section 4.6.3); since \(\nabla EQ_{mem} \gt \nabla EQ_{comp}\) at this quality bound, the DSE prioritizes energy savings by selecting the memory approximation instead of compute. However, we observe a significant increase in speedup at higher \(Q_b\) , where sensor and compute approximations contribute to the savings in addition to memory. From the figure, we observe an absolute speedup of up to \({21.9\times }\) for classification and \({10.7\times }\) for detection. Compared to the compute-only approximation-induced speedup, AxIS provides up to \(3.5\times\) and \(5.7\times\) more speedup for classification and detection, respectively.
Fig. 32.
Fig. 32. Inference latency speedup of AxIS (CamEdge) using image classification and object detection DNN benchmarks. The numbers in bold are the speedup for AxIS, and the numbers in brackets are for compute-only approximations, shown here for comparison.
We conclude that AxIS not only improves the energy efficiency of DNN inference; it also significantly reduces DNN inference latency while maintaining the quality bounds of the target application, for both classification and detection.
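To make the \(\nabla EQ\)-based prioritization described in this section concrete, the following sketch implements a greedy knob selection that always tightens the approximation offering the largest energy savings per unit of quality loss, until the quality budget is exhausted. The knob names, step granularity, and Q-E values are illustrative assumptions rather than this article's exact DSE implementation.

```python
def greedy_dse(knobs, q_budget):
    """Greedy selection by energy-quality gradient.
    knobs: {name: list of (delta_q, delta_e) steps}, where taking a step
    costs delta_q quality loss and saves delta_e normalized energy.
    Returns the chosen step index per knob and the total energy saved."""
    config = {name: 0 for name in knobs}
    spent, saved = 0.0, 0.0
    while True:
        best = None
        for name, steps in knobs.items():
            i = config[name]
            if i < len(steps):
                dq, de = steps[i]
                if spent + dq <= q_budget:
                    grad = de / dq  # this knob's current "nabla EQ"
                    if best is None or grad > best[0]:
                        best = (grad, name, dq, de)
        if best is None:  # no affordable step remains within the budget
            return config, saved
        _, name, dq, de = best
        config[name] += 1
        spent += dq
        saved += de

# Hypothetical per-knob Q-E steps; the steep first memory step means the
# memory knob is tightened before compute, mirroring the behavior above.
knobs = {
    "memory": [(0.3, 0.25), (0.5, 0.05)],
    "compute": [(0.4, 0.15), (0.6, 0.12)],
    "sensor": [(1.0, 0.20)],
}
print(greedy_dse(knobs, q_budget=1.0))
```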

6.5 Case Study for Image Segmentation

In addition to the image classification and object detection benchmarks discussed in the previous sections, we performed a case study using AxIS (CamEdge and CamCloud) in the context of an image (instance) segmentation application for a smart camera. The underlying architecture is a two-stage network, Mask_RCNN (details in Table 2), that generates bounding boxes and segmentation masks for each instance of an object in the input image. Figure 33 shows that CamEdge and CamCloud result in \({2.5\times}-{12.2\times}\) and \({3.4\times}-{11.0\times}\) system-level energy savings at \(Q_{b}=0.5\%-10\%\) compared to the baseline system. Compared to approximating individual subsystems in isolation, CamEdge provides \({1.8\times}-{5.1\times}\) and CamCloud provides \({2.9\times}-{5.3\times}\) additional energy savings on average (geomean). We also show the breakdown of the normalized system energy at different \(Q_b\) in Figure 34. These results are consistent with those presented earlier for the detection DNNs in Section 6.2.2. Most notably, they show that AxIS is effective at extracting maximum energy efficiency across a broad range of DNN-based computer vision applications.
Fig. 33.
Fig. 33. Comparison of Q-E tradeoffs of proposed system-level approximations with that of individual subsystem approximations at different quality loss bounds for image segmentation DNN (Mask_RCNN). Results are presented for CamEdge (a) and CamCloud (b).
Fig. 34.
Fig. 34. Normalized energy breakdown in CamEdge (a) and CamCloud (b) for image segmentation DNN (Mask_RCNN) at different quality loss bounds.

7 Conclusion

This article introduced AxIS, the first DNN-based approximate edge inference system, which executes highly energy-efficient inference for computer vision applications by employing synergistic approximations across multiple subsystems while strictly meeting a quality bound for the target application. Specifically, AxIS employs a novel DNN model compression framework in addition to sensor subsampling, DRAM refresh rate reduction, and lossy JPEG compression. This article provided new insights into the complex inter-subsystem interactions of an edge inference system and their impact on DNN quality and system energy. AxIS uses a simple and scalable DSE algorithm that systematically characterizes the sensitivity of DNNs to these approximations and finds the configuration that provides the maximum energy savings for the minimum quality loss, along with inference speedup. We demonstrated fully functional prototypes of two variants of AxIS, namely CamEdge and CamCloud, using commercial off-the-shelf (COTS) modules. Our experimental evaluation showed that AxIS enables significant system-level energy savings for image classification (up to \(2.9\times\)), object detection (up to \(4.4\times\)), and instance segmentation (up to \(3.4\times\)) for minimal (<0.5%) quality loss. These results are encouraging and provide a foundation for further research in energy-efficient edge AI. They also make a compelling case for shifting the focus of future research from approximate computing alone to approximate systems that can unlock the true potential of DNNs in the era of pervasive edge intelligence.

Footnotes

1. FLOPs reported are for a \(1280\times 960\) image; the GitHub repository reports 210 billion FLOPs (1 MAC = 2 FLOPs) for a \(640\times 512\) image.
2. We count one MAC operation as one FLOP throughout this article.
3. We derived these statistics using the experimental methodology and operating conditions described in Sections 4.3 and 5.

