1 Introduction
With the advancement of Convolutional Neural Networks (CNNs) [18, 22, 39], visual analytics tasks (herein referred to as vision apps) such as human face recognition [36], pedestrian detection [8], and traffic monitoring [28] have been deployed at the edge of the Internet. Typically, images are captured by the cameras of end devices, e.g., drones in the air or underwater. Due to the computation constraints of these camera-equipped end devices and the prohibitive cost of running CNN models on them, the captured images are encoded, live-streamed to edge servers, and decoded there for analysis, i.e., visual analytics offloading (Figure 1).
To guarantee the performance of vision apps at the edge, the network bandwidth required for visual analytics offloading must be minimized because of the challenging network conditions. For example, capturing and offloading images for object detection in drones requires us to minimize the offloading bandwidth since the network connection between the drone and edge server can be highly dynamic or even intermittent. Moreover, the latency of the whole visual analytics offloading pipeline, from encoding to decoding, must be minimal to support time-sensitive vision apps. For example, during a victim search in a fire incident, images of the firefighting site should be sent to the command center for analysis as soon as possible so that commanders can guide the rescue operation effectively.
The key to achieving low-bandwidth and low-latency visual analytics offloading is to minimize the size of the offloaded images through image compression. Well-known image compression standards such as JPEG [46] and JPEG2000 [43] focus on improving the visual quality of the reconstructed images under limited network bandwidth. However, they do not account for analytics accuracy when applied to image offloading in vision apps.
Machine-centric image compression [14] has been proposed to address this limitation by enhancing the accuracy of vision apps' object detection and/or classification while minimizing the size of the data to be offloaded. CNN-driven compression [2, 3, 38, 48] is one category of such techniques. These methods employ CNN models to encode an image into a vector at the end device for offloading and use generative models to reconstruct the image at the server. They can compress images into smaller sizes than traditional image compression standards while preserving the quality of the reconstructed images. However, these approaches usually require heavy computation power (e.g., a GPU) to encode (on the end device) and/or decode (on the edge server) through sophisticated CNN models [2, 3, 38], which can incur excessive end-to-end latency in the offloading pipeline. The other category of machine-centric compression, server-driven compression [14, 30], compresses images for offloading adaptively based on feedback from the edge server that indicates the importance of image regions. Nevertheless, the server feedback introduces an additional delay before the data can be compressed for offloading. If the delay is significant, the regions of interest (ROIs) sent by the edge server can deviate from the ROIs currently captured, and the compression performance degrades.
In this article, we remedy the aforementioned issues of existing image compression techniques by proposing CICO, Context-aware Image Compression Optimization. CICO is a lightweight framework that contextualizes and optimizes image compression for low-bandwidth and low-latency visual analytics offloading in vision apps. Since low-level image features such as STAR [1] and FAST [40] reflect the high-level image semantics that are of interest to vision apps, CICO learns this relationship and uses it to identify the importance of different image regions for a vision app. Accordingly, CICO optimizes the tradeoff between compression size and analytics accuracy. By putting the compression of each image region under a vision app into a context, CICO minimizes the network bandwidth required for visual analytics offloading while preserving the analytics accuracy. By employing image features that can be computed efficiently at runtime, CICO allows images to be compressed, offloaded, and reconstructed with minimal end-to-end latency. To the best of our knowledge, CICO is the first compression framework that achieves low-bandwidth and low-latency visual analytics offloading while ensuring analytics accuracy.
Realizing CICO requires us to overcome two challenges.
(1) How to make the relationship between image features and image compression learnable? The basic principle of CICO is that an image region with a higher density of important image features should be compressed with higher quality, i.e., less information loss. To achieve this, design choices such as (1) the significance of different features for a particular vision app and (2) the mapping from feature density to compression quality have to be made. We propose the context-aware compression module (CCM) within the CICO framework, which models these design choices as learnable parameters (referred to as the configuration). The CCM is a generic module that can be built on top of existing compression methods so that they better fit a vision app.
(2) How to conduct the learning to compress images? An essential step in CICO is to make the CCM aware of and optimized for the target vision app. To this end, we model the selection of the CCM's configuration as a multi-objective optimization (MOO) problem. The variable is the configuration, and the objectives are (1) maximizing the analytics accuracy of the vision app, e.g., the top-1 accuracy for image classification and the mean average precision (mAP) for object detection [15], and (2) minimizing the size of the data to be offloaded. Solving the MOO problem means deriving its Pareto front, which is non-trivial because of the infinite design space of the configuration and the costly evaluation of each configuration. We address these issues with the compression optimizer (CO) within the CICO framework, which optimizes the choice of configurations and evaluates each configuration efficiently. The CO finds the optimal set of configurations for the CCM offline in a reasonable amount of time.
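To make the notion of a Pareto front concrete, the following Python sketch shows one way non-dominated configurations could be identified from evaluated (accuracy, bandwidth reduction) pairs; the function name and data layout are illustrative and not part of CICO's implementation.

```python
# Minimal sketch: keep only the Pareto-optimal configurations from a list of
# evaluated (configuration, accuracy, bandwidth_reduction) triples.
from typing import List, Tuple

def pareto_front(results: List[Tuple[dict, float, float]]) -> List[Tuple[dict, float, float]]:
    """A configuration is Pareto optimal if no other configuration is at least
    as good in both objectives and strictly better in at least one."""
    front = []
    for cfg, acc, bw in results:
        dominated = any(
            (acc2 >= acc and bw2 >= bw) and (acc2 > acc or bw2 > bw)
            for _, acc2, bw2 in results
        )
        if not dominated:
            front.append((cfg, acc, bw))
    return front
```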
We evaluate CICO by focusing on two vision apps (image classification and object detection) and two end devices (Raspberry Pi 4 Model B and Nvidia Jetson Nano) in two network environments (WiFi and LTE). Comparing CICO with the traditional JPEG standard and a CNN-based compression method [48], our extensive results demonstrate that CICO improves the accuracy-bandwidth tradeoffs of JPEG and CNN-based encoders and achieves lower end-to-end latency and higher processing speed for visual analytics offloading. Specifically, CICO reduces the size of images compressed by existing compression techniques by up to 40% while reaching comparable analytics accuracy. Regarding the support for low-latency vision apps, CICO achieves up to a 2× speedup over state-of-the-art compression techniques.
The contributions of this article are summarized as follows:
— We propose CICO, a novel and lightweight framework that contextualizes and optimizes image compression for low-bandwidth and low-latency offloading in vision apps.
— We model and solve the image compression problem as an MOO offline, allowing online compression to be context-aware with minimal impact on latency.
— We optimize JPEG and a CNN-based encoder with CICO and conduct extensive evaluations to validate CICO's low-bandwidth and low-latency benefits.
For the remainder of this article, we first discuss the motivation and related work in Section 3. Then, we present an overview of the system architecture in Section 5. Two key components of CICO, the context-aware compression module and the compression optimizer, are detailed in Sections 4 and 6, respectively. CICO is evaluated in Section 7, which is followed by the discussion in Section 8 and the conclusion in Section 9.
4 Context-aware Compression
The context-aware compression module (Figure 3) is a key component of our system. It consists of feature extraction, context derivation, base compression, and feature selection.
Feature extraction. Low-level feature extraction distills information from input images efficiently. We start with a set of low-level image features represented by \(\Gamma =\lbrace F^{(1)},\ldots ,F^{(M)} \rbrace\), where \(F^{(j)}\) is the \(j\)th image feature and \(M\) is the number of feature classes. Common feature extractors such as STAR [1], FAST [40], and ORB [41] can be applied to the input image \(I\) to extract feature points.
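As an illustration (not part of CICO itself), the following Python sketch extracts feature points with OpenCV detectors; OpenCV is an assumed implementation choice here, and the STAR detector is only available through the opencv-contrib package.

```python
# A minimal sketch of low-level feature extraction with OpenCV.
import cv2

def extract_feature_points(image_path: str):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    detectors = {
        "FAST": cv2.FastFeatureDetector_create(),
        "ORB": cv2.ORB_create(),
        # "STAR": cv2.xfeatures2d.StarDetector_create(),  # requires opencv-contrib
    }
    # Each detector returns a list of keypoints; their (x, y) locations are
    # what the context-derivation step aggregates into per-tile densities.
    return {name: det.detect(img, None) for name, det in detectors.items()}
```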
Context derivation. Context derivation efficiently translates low-level image features into the context in the following three steps.
(1) Tiling. By spatially dividing a raw image \(I\) into \(N\) equal-sized tiles, where each tile is indexed by \(i\in \lbrace 1,\ldots ,N\rbrace\), we obtain the vector of feature densities \(\boldsymbol{d}^{(j)}=(d^{(j)}_1,\ldots ,d^{(j)}_N)\) for the \(j\)th feature, \(j=1,\ldots ,M\), where \(d^{(j)}_i\) is the density of the \(j\)th feature in the \(i\)th tile. Note that \(\sum _{i=1}^{N}d^{(j)}_i=1\) for \(j=1,\ldots ,M\).
(2) Weighting. We define the vector of weighted densities \(\boldsymbol{\rho } = (\rho _1,\ldots ,\rho _N)\) to represent the weighted density contributed by all features in each tile. The vector of weights \(\boldsymbol{\alpha }=(\alpha _1,\ldots ,\alpha _{M})\) describes the importance of the different features. The weighted density of a tile is the dot product of its feature densities and the weights, i.e., \(\rho _i=\sum _{j=1}^{M} \alpha _j d^{(j)}_i\), \(i\in \lbrace 1,\ldots ,N\rbrace\). Note that \(\rho _i\in [0,1]\) and \(\sum _{i=1}^N \rho _i=1\).
(3) Non-linear transform. We use a nonlinear function \(g(\cdot ;\boldsymbol {\beta })\) defined on \([0,1]\) to map the vector of weighted densities \(\boldsymbol {\rho }\) to the vector of compression qualities \(\boldsymbol {\eta }=(\eta _1,\ldots ,\eta _N)\), where \(\eta _i=g(\rho _i;\boldsymbol {\beta })\in [0,1]\) indicates the compression quality of the \(i\)th tile and \(\boldsymbol {\beta }\) is a hyper-parameter. A higher compression quality implies less information loss after compression.
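The following Python sketch illustrates the three steps above. Since the exact form of \(g(\cdot;\boldsymbol{\beta})\) in Equation (6) is not reproduced in this section, a clipped power function stands in for the nonlinear transform; all function and parameter names are illustrative.

```python
# A sketch of context derivation: tiling, weighting, nonlinear transform.
import numpy as np

def derive_context(keypoints_per_feature, image_shape, grid, alpha, beta):
    """keypoints_per_feature: list of M arrays of (x, y) feature locations.
    grid: (rows, cols) with N = rows * cols tiles.
    alpha: M non-negative weights summing to 1. Returns eta, one quality per tile."""
    h, w = image_shape
    rows, cols = grid
    densities = []
    for pts in keypoints_per_feature:                     # step 1: tiling
        hist = np.zeros((rows, cols))
        for x, y in pts:
            hist[min(int(y / h * rows), rows - 1),
                 min(int(x / w * cols), cols - 1)] += 1
        densities.append(hist.flatten() / max(hist.sum(), 1))
    d = np.stack(densities)                               # shape (M, N)
    rho = np.asarray(alpha) @ d                           # step 2: weighting
    eta = np.clip(rho ** beta, 0.0, 1.0)                  # step 3: placeholder for g(.; beta)
    return eta
```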
Base compression. Base compression utilizes the context to perform adaptive compression with an existing compression method, e.g., JPEG. Specifically, we apply the existing compression method to each tile of the image \(I\) based on the compression quality of that tile. For example, different quantization tables in JPEG can be selected for a tile based on its compression quality. The base compression is denoted by \(I^{\prime }=\mathcal {C}(I;\boldsymbol {\eta })\), where \(\mathcal {C}\) represents the compression operation.
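As a concrete illustration with JPEG as the base codec, the sketch below compresses each tile with a JPEG quality factor derived from its \(\eta_i\); the linear mapping from \(\eta_i\) to a quality factor is an assumption made for this example, not CICO's actual mapping.

```python
# A minimal sketch of tile-wise base compression using Pillow's JPEG encoder.
import io
from PIL import Image

def compress_tiles(image: Image.Image, eta, grid):
    rows, cols = grid
    w, h = image.size
    tw, th = w // cols, h // rows
    compressed = []
    for i in range(rows * cols):
        r, c = divmod(i, cols)
        tile = image.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
        quality = int(1 + eta[i] * 94)          # map eta in [0,1] to JPEG quality 1..95
        buf = io.BytesIO()
        tile.save(buf, format="JPEG", quality=quality)
        compressed.append(buf.getvalue())
    return compressed                            # per-tile byte streams to offload
```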
The compression configuration is \(\boldsymbol {\theta }=(\boldsymbol {\alpha },\boldsymbol {\beta })\). For clarity, the derivation of the context can be treated as a mapping \(\xi\) from the input image \(I\) to the compression quality \(\boldsymbol {\eta }\), i.e., \(\boldsymbol{\eta} =\xi (I;\boldsymbol {\theta })\). Finally, the CCM can be expressed as \(I^{\prime }=\mathcal {C}(I;\xi (I;\boldsymbol {\theta }))\).
Feature selection. Low-level image features are crucial to context-aware compression. However, selecting them is non-trivial because it is unknown to the system designer how each low-level image feature affects the outcome of a specific task and the compression speed. Thus, we design the feature selection step to automatically learn and select low-level image features. Specifically, the system designer initializes feature extraction with a few well-known features [1, 33, 40, 41, 42]. Then, the weight of each feature can be derived during context derivation. We average the weights of each feature and select the features with the highest average weights. The number of features is chosen so that feature extraction and compression meet the real-time requirement, e.g., 30 fps.
5 System Overview
As shown in Figure 4, the architecture of CICO is split into an offline profiling stage that learns the profile needed to perform context-aware compression and an online compression stage that applies context-aware compression based on the profile.
5.1 Offline Profiling Stage
In the offline profiling stage, the CO interacts with the CCM and the vision app to establish the profile for online compression in the following six steps.
(1) Initialization. The CO first samples raw images from the training data and a configuration from unexplored ones to be evaluated.
(2) Encoding. The CCM compresses the sampled images with its encoder, CCM(en), based on the sampled configuration. The size of the compressed images will be recorded for further optimization.
(3) Decoding. The CCM decompresses the encoded images with its decoder, CCM(de), and feeds them to the vision app.
(4) Image Processing. After receiving the decoded images, the vision app performs analysis via CNN models and records the analytics result, e.g., accuracy, for further optimization.
(5) Metrics Collection. The performance analyzer collects the size of compressed images and the analytics result. They are translated into two metrics: bandwidth reduction ratio (the portion of image data reduced for offloading) and accuracy, respectively. These metrics are forwarded to the compression optimizer.
(6) Optimization. The CO takes the sampled configuration and the resulting metrics as input. Then, it analyzes the historical performance of all selected configurations and learns to select the next configuration that achieves high bandwidth reduction and accuracy.
The profile consists of explored configurations that are Pareto optimal regarding the accuracy and bandwidth reduction ratio. In other words, the profile is the approximate Pareto front on the training data.
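The offline profiling loop could be organized as in the following sketch, where the CCM encoder/decoder, the vision app, and the CO are passed in as placeholder callables; this is an assumed structure for illustration rather than CICO's actual implementation.

```python
# A sketch of the six-step offline profiling loop.
def offline_profiling(train_samples, encode, decode, run_app, optimizer, rounds):
    """train_samples: list of (raw_image_bytes, label). encode/decode stand in
    for CCM(en)/CCM(de); run_app for the vision app; optimizer for the CO."""
    history = []
    for _ in range(rounds):
        theta = optimizer.suggest()                          # (1) sample a configuration
        raw_bytes, compressed_bytes, correct = 0, 0, 0
        for img, label in train_samples:
            payload = encode(img, theta)                     # (2) encoding
            decoded = decode(payload)                        # (3) decoding
            correct += int(run_app(decoded) == label)        # (4) image processing
            raw_bytes += len(img)                            # (5) metrics collection
            compressed_bytes += len(payload)
        accuracy = correct / len(train_samples)
        bw_reduction = 1.0 - compressed_bytes / raw_bytes
        optimizer.update(theta, accuracy, bw_reduction)      # (6) optimization
        history.append((theta, accuracy, bw_reduction))
    return history   # the Pareto-optimal subset of this history forms the profile
```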
5.2 Online Compression Stage
The online compression stage consists of four steps: context adaptation, context-aware encoding, decoding, and image processing.
(1) Context adaptation. The context adapter selects the optimal configuration from the profile based on task requirements, e.g., the bandwidth and accuracy, specified by the system administrator.
(2) Context-aware encoding. Next, the configured CCM compresses raw images from the user based on the selected configuration.
(3) Decoding. Then, the compressed images are offloaded to the cloud server and decoded.
(4) Image processing. Finally, the decoded images are processed by CNN models in the cloud server.
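Context adaptation (step 1) can be illustrated by the following sketch, which selects from the profile the configuration with the highest bandwidth reduction that satisfies a given accuracy requirement; the selection policy shown is an assumption made for illustration.

```python
# Pick a configuration from the offline profile given an accuracy requirement.
def select_configuration(profile, min_accuracy):
    """profile: list of (theta, accuracy, bandwidth_reduction) tuples."""
    feasible = [p for p in profile if p[1] >= min_accuracy]
    if not feasible:
        return max(profile, key=lambda p: p[1])[0]   # fall back to the most accurate
    return max(feasible, key=lambda p: p[2])[0]
```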
The offline stage ensures that the online stage can handle different task requirements resulting from varying network conditions and application-specific changes. We present the details of the context-aware compression module and the compression optimizer in Sections 4 and 6, respectively.
8 Discussion
Adaptability of CICO. While CICO demonstrates adaptability to diverse network conditions through configurable bitrates, it currently lacks the ability to adjust to varying computation constraints. Extending its applicability to a wide range of devices, from IoT devices to smartphones, necessitates factoring computation overhead into the multi-objective optimization process. For example, one could optimize the tradeoff between compression rate and accuracy while adhering to a computation latency constraint on the device. Addressing this issue could be a potential avenue for future research, enabling CICO to seamlessly adapt to both computation and network constraints.
Scalability of CICO. At present, CICO necessitates application-specific profiling, which can be resource-intensive, especially when targeting a broad range of applications. This poses a scalability challenge. Nonetheless, certain vision applications, such as car detection and pedestrian tracking, share functional similarities. One approach to enhance scalability is to group similar vision applications into categories. This strategy can keep profiling costs at a reasonable level, even when dealing with a substantial number of target applications. However, addressing the scalability challenge lies beyond the scope of this work.
Choice of the nonlinear function. The nonlinear function models the relationship between the feature density and the compression quality. We selected the one in Equation (6) to strike a tradeoff between training complexity and compression performance. The nonlinear function can take other forms as long as it maps a density value in \([0,1]\) to a compression quality value in \([0,1]\). It is also feasible to perform context derivation with more parameters. However, it should be configured carefully to achieve a speed comparable to that of traditional image codecs like JPEG.
Choice of the base compression module. The choice of the compression method is generally flexible. It can be any traditional method, e.g., JPEG, or machine learning-based method, e.g., DeepCOD. The base compression method must be able to compress different image tiles with different qualities. The other consideration is that an excessively complicated compression method should not be used, because the benefits CICO introduces in bandwidth and network latency reduction might be offset by the additional delay incurred in the encoding and decoding modules.
The vision-based application. In addition to image classification and object detection, our approach is generic and can be applied to other vision-based applications such as car counting [32] and action detection [25]. As long as a vision app explicitly outputs a metric that evaluates its performance on an image dataset, CICO can be used to learn from the dataset and enhance the visual analytics offloading performance.
Comparison to the end-to-end training workflow. Our work provides an approach to optimize an image codec with the codec itself and the vision application treated as black boxes. Treating a component as a black box means we cannot compute derivatives or back-propagate through it. This case is common when an image codec only exposes the encoded and decoded data, e.g., JPEG, and the vision application only provides its predictions for a given input, as in many cloud services. However, when both the image codec and the vision application are differentiable, i.e., implemented with neural networks, it may be preferable to design the context extractor with neural networks [9, 48] and train it in an end-to-end workflow.