1 Introduction
The field of deep learning (DL) has evolved at an impressive pace over the last few years [
68], with new breakthroughs continuously appearing in domains such as computer vision (CV), natural language processing (NLP), digital signal processing (DSP), and wireless networking [
56,
118], among others—we refer to [
110] for a comprehensive survey on DL. For example, today’s state-of-the-art deep neural networks (DNNs) can classify thousands of images with unprecedented accuracy [
51], while bleeding-edge advances in deep reinforcement learning (DRL) have been shown to provide near-human capabilities in a multitude of complex optimization tasks, from playing dozens of Atari video games [
99] to winning games of Go against top-tier players [
127].
As DL-based classifiers improve their predictive accuracy, mobile applications such as speech recognition in smartphones [
20,
45], real-time unmanned navigation [
105], and drone-based surveillance [
129,
170] are increasingly using DNNs to perform complex inference tasks. However,
state-of-the-art DNN models present computational requirements that cannot be satisfied by the majority of the mobile devices available today. In fact, many state-of-the-art DNN models for difficult tasks—such as computer vision and natural language processing—are extremely complex. For instance, the EfficientDet [
139] family offers the best performance for object detection tasks. While EfficientDet-D7 achieves a
mean average precision (mAP) of 52.2%, it involves 52M parameters and takes seconds to execute even on capable embedded devices such as the GPU-equipped NVIDIA Jetson Nano or the Raspberry Pi. Notably, the execution of such complex models significantly increases energy consumption. While lightweight models specifically designed for mobile devices exist [
122,
138], the reduced computational burden usually comes at the expense of model accuracy. For example, compared to ResNet-152 [
43], the networks MnasNet [
138] and MobileNetV2 [
122] present up to 6.4% accuracy loss on the ImageNet dataset. YOLO-Lite [
116] achieves a frame rate of 22 frames per second on some embedded devices but has a mAP of 12.36% on the COCO dataset [
83]. To achieve 33.8% mAP on the COCO dataset, even the simplest model in the EfficientDet family, EfficientDet-D0, requires 3 times more FLOPs (2.5B)
than SSD-MobileNetV2 [
122] (0.8B FLOPs). While SSD-MobileNetV2 is a lower-performance DNN specifically designed for mobile platforms and can process up to 6 fps, its mAP on the COCO dataset is 20%, and keeping the model running on a mobile device significantly increases power consumption. On the other hand, due to excessive end-to-end latency, cloud-based approaches are hardly applicable in most of the latency-constrained applications where mobile devices usually operate. Most of the techniques we overview in the survey can be applied to both mobile-device-to-edge-server and edge-server-to-cloud offloading. For the sake of clarity, we primarily refer to the former to explain the frameworks.
Recently, edge computing (EC) approaches [
10,
88] have attempted to address the “latency vs. computation” conundrum by completely offloading the DNN execution to servers located very close to the mobile device, i.e., at the “edge” of the network. However, canonical EC does not consider that the quality of wireless links—although providing high throughput on average—can suddenly fluctuate due to the presence of erratic noise and interference patterns, which may impair performance in latency-bound applications. For example, mobility and impaired propagation have been shown to decrease throughput even in high-bandwidth wireless links [
89,
169], while many Internet of Things (IoT) systems are based on communication technologies such as
Long Range (LoRa) [
121], which has a maximum data rate of 37.5 Kbps due to duty cycle limitations [
1].
The severe offloading limitations of some mobile devices, coupled with the instability of the wireless channel (e.g., UAV network [
36]), imply that the amount of data offloaded to the edge should be decreased, while at the same time keeping the model accuracy as close as possible to that of the original model. For this reason,
split computing (SC) [
60] and
early exiting (EE) strategies [
140] have been proposed to provide an intermediate option between EC and local computing. The key intuition behind SC and EE is similar to the one behind model pruning [
38,
44,
74,
160] and knowledge distillation [
46,
61,
98]—since modern DNNs are heavily over-parameterized [
165,
166], their accuracy can be preserved even with a substantial reduction in the number of weights and filters, thus representing the input with fewer parameters. Specifically, SC divides a larger DNN into head and tail models, which are respectively executed by the mobile device and edge server. EE, on the other hand, proposes the introduction of “subbranches” into the early layers of DNN models, so that the full computation of the model can be halted—and a prediction result provided—if the classifiers in the current subbranches have high confidence for the specific model input.
Motivation and Novel Contributions. The proliferation of DL-based mobile applications in the IoT and 5G landscapes implies that techniques such as SC and EE are not simply “nice-to-have” features, but will become fundamental computational components in the years to come. Although a significant amount of research work has been done in SC and EE, to the best of our knowledge, a comprehensive survey of the state of the art has not been conducted yet. Moreover, there are still a series of research challenges that need to be addressed to take SC and EE to the next level. For this reason, this article makes the following novel contributions:
•
We summarize SC and EE studies with respect to approaches, tasks, and models. We first provide an overview of local, edge, split computing, and early-exit models in Section
2 by highlighting similarities and differences among them.
•
We then discuss and compare the various approaches to SC and EE in Sections
4 and
5 by highlighting the training strategies and applications. Since code availability is fundamental for replicability/reproducibility [
34],
we provide for each work its corresponding code repository, if available, so that interested readers can reproduce and learn from existing studies.
•
We conclude the article by discussing in Section
6 a compelling agenda of research challenges in SC and EE, hoping to spur further contributions in these exciting and timely fields.
2 Overview of Local, Edge, Split Computing, and Early-exit Models
In this section, we provide an overview of local, edge, split computing, and early-exit models, which are the main computational paradigms that will be discussed in the article. Figure
1 provides a graphical overview of the approaches.
All these techniques operate on a DNN model
\(\mathcal {M}(\cdot)\) whose task is to produce the inference output
\(\mathbf {y}\) from an input
\(\mathbf {x}\). Typically,
\(\mathbf {x}\) is a high-dimensional variable, whereas the output
\(\mathbf {y}\) has significantly lower dimensionality [
143]. Split computing and early-exit approaches are contextualized in a setting where the system is composed of a mobile device and an edge server interconnected via a wireless channel. The overall goal of the system is to produce the inference output
\(\mathbf {y}\) from the input
\(\mathbf {x}\) acquired by the mobile device, by means of the DNN
\(\mathbf {y} = \mathcal {M}(\mathbf {x})\) under—possibly time varying—constraints on:
Resources: (1) the computational capacity (roughly expressed as the number of operations per second) \(C_{\rm md}\) and \(C_{\rm es}\) of the mobile device and edge server, respectively, and (2) the capacity \(\phi\), in bits per second, of the wireless channel connecting the mobile device to the edge server;
Performance: (1) the absolute or average value of the time from the generation of \(\mathbf {x}\) to the availability of \(\mathbf {y}\), and (2) the degradation of the “quality” of the output \(\mathbf {y}\).
Split computing, edge computing, local computing, and early-exiting strategies strive to find suitable operating points with respect to accuracy, end-to-end delay, and energy consumption, which are inevitably influenced by the characteristics of the underlying system. It is generally assumed that the computing and energy capacities of the mobile device are smaller than those of the edge server. As a consequence, if part of the workload is allocated to the mobile device, then the execution time increases, while the battery lifetime decreases. However, as explained later, the workload executed by the mobile device may result in a reduced amount of data to be transferred over the wireless channel, possibly compensating for the larger execution time and leading to smaller end-to-end delays.
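To make this tradeoff concrete, the following minimal sketch (in Python) compares the end-to-end delay of LC, EC, and SC under the constraints defined above. All parameter values are illustrative assumptions rather than measurements, and energy is ignored for brevity.

```python
# Minimal sketch of the delay model implied above: end-to-end delay for
# local computing (LC), edge computing (EC), and split computing (SC).
# All numbers below are illustrative assumptions, not measurements.

def delay_lc(total_ops, c_md):
    """Whole model runs on the mobile device."""
    return total_ops / c_md

def delay_ec(total_ops, input_bits, c_es, phi):
    """Input is sent over the channel; whole model runs on the edge server."""
    return input_bits / phi + total_ops / c_es

def delay_sc(head_ops, tail_ops, z_bits, c_md, c_es, phi):
    """Head on the mobile device, tensor z over the channel, tail on the server."""
    return head_ops / c_md + z_bits / phi + tail_ops / c_es

# Illustrative parameters: 1 GFLOP/s mobile device, 100 GFLOP/s edge server,
# 10 Mbit/s wireless link, a 5-GFLOP model, and a 1.2-Mbit input frame.
C_MD, C_ES, PHI = 1e9, 100e9, 10e6
OPS, X_BITS = 5e9, 1.2e6

print(f"LC: {delay_lc(OPS, C_MD):.2f} s")
print(f"EC: {delay_ec(OPS, X_BITS, C_ES, PHI):.2f} s")
# SC with 10% of the ops on-device and a bottleneck 20x smaller than the input
print(f"SC: {delay_sc(0.1 * OPS, 0.9 * OPS, X_BITS / 20, C_MD, C_ES, PHI):.2f} s")
```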
2.1 Local and Edge Computing
We start with an overview of local and edge computing. In local computing (LC), the function
\(\mathcal {M}(\mathbf {x})\) is entirely executed by the mobile device. This approach eliminates the need to transfer data over the wireless channel. However, the complexity of the best-performing DNNs most likely exceeds the computing capacity and energy budget available at the mobile device. Usually, simpler models
\(\hat{\mathcal {M}}(\mathbf {x})\) are used, such as MobileNet [
122] and MnasNet [
138], which often have a degraded accuracy performance. Besides designing lightweight neural models executable on mobile devices, the widely used techniques to reduce the complexity of models are knowledge distillation [
46] and model pruning/quantization [
55,
73], described in Section
3.2. Some of the techniques are also leveraged in SC studies to introduce bottlenecks without sacrificing model accuracy, as will be described in the following sections.
In EC, the input
\(\mathbf {x}\) is transferred to the edge server, which then executes the original model
\(\mathcal {M}(\mathbf {x})\). In this approach, which preserves full accuracy, the mobile device is not allocated computing workload, but the full input
\(\mathbf {x}\) needs to be transferred to the edge server. This may lead to an excessive end-to-end delay in degraded channel conditions and erasure of the task in extreme conditions. A possible approach to reduce the load imposed to the wireless channel, and thus also transmission delay and erasure probability, is to compress the input
\(\mathbf {x}\). We define, then, the encoder and decoder models
\(\mathbf {z} = F(\mathbf {x})\) and
\(\hat{\mathbf {x}} = G(\mathbf {z})\), which are executed at the mobile device and edge server, respectively. The distance
\(d(\mathbf {x},\hat{\mathbf {x}})\) defines the performance of the encoding-decoding process
\(\hat{\mathbf {x}} = G(F(\mathbf {x}))\), a metric that is separate from, but may influence, the accuracy loss of
\(\mathcal {M}(\hat{\mathbf {x}})\) with respect to
\(\mathcal {M}(\mathbf {x})\), that is, of the model executed with the reconstructed input with respect to the model executed with the original input. Clearly, the encoding/decoding functions increase the computing load at both the mobile device and edge server side. A broad range of different compression approaches exists ranging from low-complexity traditional compression (e.g., JPEG compression for images in EC [
101]) to neural compression models [
4,
5,
162]. We remark that while the compressed input data, e.g., JPEG objects, can reduce the data transfer time in EC, those representations are designed to allow the accurate reconstruction of the input signal. Therefore, these approaches may (1) decrease privacy as a “reconstructable” representation is transferred to the edge server [
147] and (2) result in a larger amount of data to be transmitted over the channel compared to representations specifically designed for the computing task, as in bottleneck-based SC, as explained in the following sections.
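The following short sketch illustrates the data-volume argument above by comparing the size of a raw float32 input tensor, the corresponding 8-bit RGB image, and its JPEG encoding. It assumes NumPy and Pillow are available and uses a synthetic image, so the exact figures are only indicative.

```python
import io
import numpy as np
from PIL import Image

# Synthetic smooth image standing in for a camera frame (natural images compress
# far better than random noise, so a gradient keeps the example honest).
y, x = np.mgrid[0:480, 0:640]
img = np.stack([x % 256, y % 256, (x + y) % 256], axis=-1).astype(np.uint8)

tensor_bytes = img.size * 4   # float32 tensor actually fed to the DNN
raw_bytes = img.size          # 8-bit RGB pixels

buf = io.BytesIO()
Image.fromarray(img).save(buf, format="JPEG", quality=80)   # lossy but reconstructable
jpeg_bytes = buf.getbuffer().nbytes

print(f"float32 input tensor: {tensor_bytes / 1e6:.2f} MB")
print(f"raw 8-bit RGB image : {raw_bytes / 1e6:.2f} MB")
print(f"JPEG (quality 80)   : {jpeg_bytes / 1e6:.3f} MB")
```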
2.2 Split Computing and Early Exiting
SC aims at achieving two goals: (1) distributing the computing load across the mobile device and the edge server and (2) establishing a task-oriented compression to reduce data transfer delays. We consider a neural model
\(\mathcal {M}(\cdot)\) with
L layers, and define
\(\mathbf {z}_{\ell }\) as the output of the
\(\ell\)th layer. Early implementations of SC select a layer
\(\ell\) and divide the model
\(\mathcal {M}(\cdot)\) to define the head and tail submodels
\(\mathbf {z}_{\ell }{=}\mathcal {M}_{H}(\mathbf {x})\) and
\(\mathbf {\hat{y}}{=}\mathcal {M}_{T}(\mathbf {z}_{\ell })\), executed at the mobile device and edge server, respectively. In early instances of SC, the architecture and weights of the head and tail model are exactly the same as the first
\(\ell\) layers and last
\(L-\ell\) layers of
\(\mathcal {M}(\cdot)\). This simple approach preserves accuracy but allocates part of the execution of
\(\mathcal {M}(\cdot)\) to the mobile device, whose computing power is expected to be smaller than that of the edge server, so that the total execution time may be larger. The transmission time of
\(\mathbf {z}_{\ell }\) may be larger or smaller compared to that of transmitting the input
\(\mathbf {x}\), depending on the size of the tensor
\(\mathbf {z}_{\ell }\). However, we note that in most relevant applications the size of
\(\mathbf {z}_{\ell }\) becomes smaller than that of
\(\mathbf {x}\) only in later layers, which would allocate most of the computing load to the mobile device. More recent SC frameworks introduce the notion of
bottleneck to achieve
in-model compression toward the global task [
90]. As formally described in the next section, a bottleneck is a compression point at one layer in the model, which can be realized by reducing the number of nodes of the target layer and/or by quantizing its output. We note that as SC realizes a task-oriented compression, it guarantees a higher degree of privacy compared to EC. In fact, the representation may lack information needed to fully reconstruct the original input data.
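As an illustration of the plain split discussed above (head equal to the first \(\ell\) layers, tail equal to the rest), the following sketch partitions torchvision's ResNet-50 at its top-level blocks; the choice of model, split index, and block granularity are assumptions made only for illustration. Note how, at an early split point, the transmitted tensor \(\mathbf {z}_{\ell }\) is larger than the input \(\mathbf {x}\), consistent with the discussion above.

```python
# Sketch of a plain (no-bottleneck) split: the first `ell` blocks become the head
# run on the mobile device, the remaining blocks the tail run on the edge server.
# Assumes PyTorch and torchvision; the split index is illustrative.
import torch
from torch import nn
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
blocks = list(model.children())          # conv1, bn1, relu, maxpool, layer1..4, avgpool, fc
ell = 5                                  # split right after `layer1`
head = nn.Sequential(*blocks[:ell])      # executed on the mobile device
tail = nn.Sequential(*blocks[ell:-1], nn.Flatten(), blocks[-1])  # executed on the edge server

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    z = head(x)                          # tensor transmitted over the wireless channel
    y = tail(z)

print(f"input: {x.numel() * 4} bytes, split tensor: {z.numel() * 4} bytes")
```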
Another approach to enable mobile computing is referred to as EE. The core idea is to create models with multiple “exits” across the model, where each exit can produce the model output. Then, the first exit providing a target confidence on the output is selected. This approach tunes the computational complexity, determined by the exit point, to the sample or to system conditions. Formally, we can define a sequence of models \(\mathcal {M}_i\) and \(\mathcal {B}_i, i= 1,\ldots ,N\). Model \(\mathcal {M}_i\) takes as input \(\mathbf {z}_{i-1}\) (the output of model \(\mathcal {M}_{i-1}\)) and outputs \(\mathbf {z}_i\), where we set \(\mathbf {z}_{0}=\mathbf {x}\). The branch models \(\mathcal {B}_i\) take as input \(\mathbf {z}_i\) and produce the estimate of the desired output \(\mathbf {y}_i\). Thus, the concatenation of \(\mathcal {M}_1,\ldots ,\mathcal {M}_N\) results in an output analogous to that of the original model. Intuitively, the larger the number of models used to produce the output \(\mathbf {y}_i\), the better the accuracy. Thus, while SC optimizes intermediate representations to preserve information toward the final task (e.g., classification) for the whole dataset, early-exit models take a “per sample” control perspective. Each sample will be sequentially analyzed by concatenations of \(\mathcal {M}_i\) and \(\mathcal {B}_i\) sections until a predefined confidence level is reached. The hope is that a portion of the samples will require a smaller number of sections compared to executing the whole sequence.
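The following toy sketch illustrates the early-exit control loop described above: sections \(\mathcal {M}_i\) are executed in sequence, each branch \(\mathcal {B}_i\) produces a prediction, and inference stops at the first branch whose confidence exceeds a threshold. The architecture and the 0.9 threshold are illustrative assumptions and do not correspond to a specific published design.

```python
import torch
from torch import nn

class ExitBranch(nn.Module):
    """B_i: a small classifier attached to the output of section M_i."""
    def __init__(self, channels, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, z):
        return self.fc(self.pool(z).flatten(1))

# Three toy backbone sections M_1..M_3 and their branches B_1..B_3.
sections = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
])
branches = nn.ModuleList([ExitBranch(16), ExitBranch(32), ExitBranch(64)])

def early_exit_inference(x, threshold=0.9):
    z = x
    for i, (m, b) in enumerate(zip(sections, branches)):
        z = m(z)                              # z_i = M_i(z_{i-1})
        probs = b(z).softmax(dim=-1)          # branch prediction
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold or i == len(sections) - 1:
            return pred.item(), i             # exit index actually used

print(early_exit_inference(torch.randn(1, 3, 32, 32)))
```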
4 Split Computing: A Survey
This section discusses the existing state of the art in SC. Figure
2 illustrates the existing SC approaches. They can be categorized into either (1)
without network modification or (2)
with bottleneck injection. We first present SC approaches without DNN modification in Section
4.1. We then discuss the motivations behind the introduction of SC with bottlenecks in Section
4.2, which are then discussed in detail in Section
4.3. Since the latter require specific training procedures, we devote Section
4.4 to their discussion.
4.1 Split Computing without DNN Modification
In this class of approaches, the architecture and weights of the head
\(\mathcal {M}_{H}(\cdot)\) and tail
\(\mathcal {M}_T(\cdot)\) models are exactly the same as the first
\(\ell\) layers and last
\(L-\ell\) layers of
\(\mathcal {M}(\cdot)\). To the best of our knowledge, Kang et al. [
60] proposed the first SC approach (called “Neurosurgeon”), which searches for the best partitioning layer in a DNN model for minimizing total (end-to-end) latency or energy consumption. Formally, inference time in SC is the sum of processing time on a mobile device, delay of communication between a mobile device and edge server, and processing time on an edge server.
Interestingly, their experimental results show that the best partitioning (splitting) layers in terms of energy consumption and total latency for most of the considered models result in either their input or output layers. In other words, deploying the whole model on either a mobile device or an edge server (i.e., local computing or EC) would be the best option for such DNN models. Following the work by Kang et al. [
60], the research communities explored various SC approaches mainly focused on CV tasks such as image classification. Table
1 summarizes the studies on SC without architectural modifications.
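A Neurosurgeon-style search can be sketched as follows: for each candidate split point, the total latency is the on-device time for the head, the transfer time for the corresponding intermediate tensor, and the server time for the tail, and the split with the lowest total is selected. The per-layer profiles and system parameters below are illustrative placeholders; actual frameworks obtain them by profiling the target device, server, and network.

```python
def best_split(layer_ops, layer_out_bits, input_bits, c_md, c_es, phi):
    """Return (split_index, latency): index 0 = pure edge computing (send x),
    index len(layer_ops) = pure local computing (send only the final output)."""
    best = None
    for ell in range(len(layer_ops) + 1):
        head = sum(layer_ops[:ell]) / c_md                     # on-device compute
        tail = sum(layer_ops[ell:]) / c_es                     # edge-server compute
        bits = input_bits if ell == 0 else layer_out_bits[ell - 1]
        total = head + bits / phi + tail                       # + transfer delay
        if best is None or total < best[1]:
            best = (ell, total)
    return best

# Illustrative per-layer profiles (FLOPs and output sizes) and system parameters.
print(best_split(
    layer_ops=[0.5e9, 1.0e9, 1.5e9, 2.0e9],
    layer_out_bits=[4e6, 2e6, 5e5, 3e4],
    input_bits=1.2e6,
    c_md=1e9, c_es=100e9, phi=10e6,
))
```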
Jeong et al. [
58] used this partial offloading approach as a privacy-preserving form of computation offloading that blinds the edge server to the original data captured by the client. Leveraging neural network quantization techniques, Li et al. [
73] discussed the best splitting points in DNN models to minimize inference latency and showed that quantized DNN models did not degrade accuracy compared to the (pre-quantized) original models. Choi and Bajić [
13] proposed a feature compression strategy for object detection models that introduces a quantization/video-coding-based compressor to the intermediate features in YOLO9000 [
115].
Eshratifar et al. [
25] propose JointDNN for collaborative computation between the mobile device and cloud and demonstrate that using either local computing only or cloud computing only is not an optimal solution in terms of inference time and energy consumption. Different from [
60], they consider not only discriminative deep learning models (e.g., classifiers) but also generative deep learning models and autoencoders as benchmark models in their experimental evaluation. Cohen et al. [
16] introduce a technique to code the output of the head portion in a split DNN to a wide range of bit-rates and demonstrate the performance for image classification and object detection tasks. Pagliari et al. [
106] first discuss collaborative inference for simple recurrent neural networks; their proposed scheme is designed to automatically select the best inference device for each input in terms of total latency or end-device energy. Itahara et al. [
53] use dropout layers [
133] to emulate a packet loss scenario rather than for the sake of compression and discuss the robustness of VGG-based models [
128] for split computing.
While only a few studies in Table
1 heuristically choose splitting points [
13,
16], most of the other studies [
25,
58,
60,
73,
76,
106,
168] in Table
1 analyze various types of cost (e.g., computational load and energy consumption on mobile device, communication cost, and/or privacy risk) to partition DNN models at each of their splitting points. Based on the analysis, performance profiles of the split DNN models are derived to inform selection. Concerning metrics, many of the studies in Table
1 do not discuss task-specific performance metrics such as accuracy. This is in part because the proposed approaches do not modify the input or intermediate representations in the models (i.e., the final prediction will not change). On the other hand, Choi and Bajić [
13], Cohen et al. [
16], and Li et al. [
73] introduce lossy compression techniques to intermediate stages in DNN models, which may affect the final prediction results. Thus, discussing the tradeoff between compression rate and task-specific performance metrics would be essential for such studies. As shown in the table, such a tradeoff is discussed only for CV tasks, and many of the models considered in such studies have weak performance compared with state-of-the-art models, as well as a complexity within reach of modern mobile devices. Specific to image classification tasks, most of the models considered in the studies listed in Table
1 are more complex and/or the accuracy is comparable to or lower than that of lightweight baseline models such as MobileNetV2 [
122] and MnasNet [
138]. Thus, in future work, more accurate models should be considered to discuss the performance tradeoff and further motivate SC approaches.
4.2 The Need for Bottleneck Injection
While Kang et al. [
60] empirically show that executing the whole model on either the mobile device or edge server would be best in terms of total inference time and energy consumption for most of their considered DNN models, their proposed approach finds the best partitioning layers inside some of their considered CV models (convolutional neural networks (CNNs)) to minimize the total inference time. There are a few trends observed from their experimental results: (1) the communication delay to transfer data from the mobile device to the edge server is a key component of the total inference time in SC; (2) all the neural models they considered for NLP tasks are relatively small (consisting of only a few layers), which potentially resulted in finding that the output layer is the best partition point (i.e., local computing) according to their proposed approach; and (3) similarly, not only the DNN models they considered (except VGG [
128]) but also the size of the input data to the models (see Table
2) are relatively small, which gives more advantage to EC (fully offloading computation). In other words, this highlights that complex CV tasks requiring large (high-resolution) images for models to achieve high accuracy, such as those on the ImageNet and COCO datasets, are essential for discussing the tradeoff between accuracy and the execution metrics to be minimized (e.g., total latency, energy consumption) in SC studies. The key issue is that straightforward SC approaches like Kang et al. [
60] rely on the existence of
natural bottlenecks—that is, intermediate layers whose output
\(\mathbf {z}_{\ell }\) tensor size is smaller than the input—inside the model. Without such natural bottlenecks in the model, straightforward splitting approaches would fail to improve performance in most settings [
6,
35].
Some models, such as AlexNet [
64], VGG [
128], and DenseNet [
51], possess such layers [
90]. However, recent DNN models such as ResNet [
43], Inception-v3 [
136], Faster R-CNN [
117], and Mask R-CNN [
42] do not have natural bottlenecks in the early layers; that is, splitting the model would result in compression only when assigning a large portion of the workload to the mobile device. As discussed earlier, reducing the communication delay is key to minimizing the total inference time in SC. For these reasons, introducing
artificial bottlenecks to DNN models by modifying their architecture is a recent trend and has been attracting attention from the research community. Since the main role of such encoders in SC is to compress intermediate features rather than to complete inference, the encoders usually consist of only a few layers. Also, the resulting encoders in SC to be executed on constrained mobile devices are often much smaller (e.g., 10K parameters in the encoder of ResNet-based SC model [
94]) than lightweight models such as MobileNetV2 [
122] (3.5M parameters) and MnasNet [
138] (4.4M parameters). Thus, even if the model accuracy is either degraded or comparable to such small models, SC models are still beneficial in terms of computational burden and energy consumption at the mobile devices.
4.3 Split Computing with Bottleneck Injection
This class of models can be described as composed of three sections: \(\mathcal {M}_{E}\), \(\mathcal {M}_{D}\), and \(\mathcal {M}_{T}\). We define \(\mathbf {z}_{\ell }|\mathbf {x}\) as the output of the \(\ell\)th layer of the original model given the input \(\mathbf {x}\). The concatenation of the \(\mathcal {M}_{E}\) and \(\mathcal {M}_{D}\) models is designed to produce a possibly noisy version \(\hat{\mathbf {z}}_{\ell }|\mathbf {x}\) of \(\mathbf {z}_{\ell }|\mathbf {x}\), which is taken as input by \(\mathcal {M}_{T}\) to produce the output \(\hat{\mathbf {y}}\), on which the accuracy degradation with respect to \(\mathbf {y}\) is measured. The models \(\mathcal {M}_{E}\) and \(\mathcal {M}_{D}\) function as specialized encoders and decoders in the form \(\hat{\mathbf {z}}_{\ell }=\mathcal {M}_{D}(\mathcal {M}_E(\mathbf {x}))\), where \(\mathcal {M}_{E}(\mathbf {x})\) produces the latent variable \(\mathbf {z}\). In words, the first two sections of the modified model transform the input \(\mathbf {x}\) into a version of the output of the \(\ell\)th layer via the intermediate representation \(\mathbf {z}\), thus functioning as encoder/decoder functions. The model is split after the first section; that is, \(\mathcal {M}_{E}\) is the head model, and the concatenation of \(\mathcal {M}_{D}\) and \(\mathcal {M}_{T}\) is the tail model. Then, the tensor \(\mathbf {z}\) is transmitted over the channel. The objective of the architecture is to minimize the size of \(\mathbf {z}\) to reduce the communication time while also minimizing the complexity of \(\mathcal {M}_E\) (that is, the part of the model executed at the—weaker—mobile device) and the discrepancy between \(\mathbf {y}\) and \(\hat{\mathbf {y}}\). The layer between \(\mathcal {M}_E\) and \(\mathcal {M}_D\) is the injected bottleneck.
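The following sketch instantiates this structure on torchvision's ResNet-50: \(\mathcal {M}_E\) is a small convolutional encoder ending in a narrow (bottleneck) layer, and \(\mathcal {M}_D\) restores the shape expected by the unchanged tail \(\mathcal {M}_T\). The channel counts and layer choices are illustrative assumptions and do not reproduce any specific published bottleneck design.

```python
import torch
from torch import nn
from torchvision.models import resnet50

base = resnet50(weights=None).eval()
tail_blocks = list(base.children())[5:-1]          # layer2..avgpool of the original model

encoder = nn.Sequential(                           # M_E: runs on the mobile device
    nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
    nn.Conv2d(32, 8, 3, stride=1, padding=1),      # 8-channel bottleneck
)
decoder = nn.Sequential(                           # M_D: first part of the edge-side model
    nn.Conv2d(8, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, padding=1),             # match layer1's 256-channel output
)
tail = nn.Sequential(decoder, *tail_blocks, nn.Flatten(), list(base.children())[-1])

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    z = encoder(x)                                 # transmitted tensor
    y = tail(z)

print(f"x: {x.numel() * 4} bytes, bottleneck z: {z.numel() * 4} bytes, y: {tuple(y.shape)}")
```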
Table
3 summarizes SC studies with bottleneck-injected strategies. To the best of our knowledge, the papers in [
26] and [
90] were the first to propose altering existing DNN architectures to design relatively small bottlenecks at early layers in DNN models, instead of introducing compression techniques (e.g., quantization, autoencoder) to the models, so that communication delay (cost) and total inference time can be further reduced. Following these studies, Hu and Krishnamachari [
49] introduce bottlenecks to MobileNetV2 [
122] (modified for CIFAR datasets) in a similar way for SC and discuss end-to-end performance evaluation. Choi et al. [
14] combine multiple compression techniques such as quantization and tiling besides convolution/deconvolution layers and design a feature compression approach for object detectors. Similar to the concept of bottleneck injection, Shao and Zhang [
126] find that over-compression of intermediate features and inaccurate communication between computing devices can be tolerated unless the prediction performance of the models is significantly degraded by them. Also, Jankowski et al. [
57] propose introducing a reconstruction-based bottleneck to DNN models, which is similar to the concept of BottleNet [
26]. A comprehensive discussion on the delay/complexity/accuracy tradeoff can be found in [
91,
164].
These studies are all focused on image classification. Other CV tasks present further challenges. For instance, state-of-the-art object detectors such as R-CNN models have a narrower range of layers into which bottlenecks can be introduced, since their architectures contain multiple forward paths that feed outputs from intermediate layers to the
feature pyramid network (FPN) [
81]. The head network distillation training approach—discussed later in this section—was used by Matsubara and Levorato [
94] to address some of these challenges and reduce the amount of data transmitted over the channel by 94% while limiting the mAP degradation to 1 point. Assine et al. [
2] introduce bottlenecks to the EfficientDet-D2 [
139] object detector and apply the training method based on the generalized head network distillation [
94] and mutual learning [
159] to the modified model. Following the studies on SC for resource-constrained edge computing systems [
90,
91,
164], Sbai et al. [
124] introduce autoencoders into small classifiers and train them on a subset of the ImageNet dataset in a similar manner. These studies discuss the tradeoff between accuracy and memory size on mobile devices, considering communication constraints based on 3G and LoRa technologies [
121]. Similar to [
2,
93,
94], Lee et al. [
70] design a lightweight encoder for an object detector on the mobile device followed by both a module to amplify the compressed feature and the object detector to be executed on the edge server. Matsubara et al. [
92] empirically show that bottleneck-injected models can be further improved by elaborating the methods to train the models. The resulting models outperform models with autoencoder-based feature compression (e.g., Figure
5) in terms of the tradeoff between model accuracy and transferred data size.
Matsubara et al. [
97] propose a supervised compression method for resource-constrained edge computing systems, which adapts ideas from knowledge distillation and neural image compression [
4,
5]. Their student model (namely,
Entropic Student) contains a lightweight encoder with a learnable prior, which quantizes and entropy-codes latent representations under a prior probability model to efficiently reduce the size of the data to be offloaded to the edge server. By adjusting a balancing weight in the loss function during training, one can control the tradeoff between data size (rate) and model accuracy (distortion). The performance of the entropic student model was demonstrated for three large-scale downstream supervised tasks: image classification (ImageNet), object detection (COCO), and semantic segmentation (COCO, PASCAL VOC). Notably, the representation produced by a single trained encoder of the entropic student model can serve multiple downstream tasks. Following the study, Matsubara et al. [
96] further investigate this approach and empirically show that it generalizes to other reference models (e.g., ResNet-101 [
43], RegNetY-6.4GF [
113], Hybrid ViT [
134]). Through experiments, the study also points out that simply introducing such bottleneck layers at later layers in a model can improve the conventional
rate-distortion (R-D) tradeoff, but will result in most of the computational load being assigned to a weak mobile device.
In contrast to SC studies without bottlenecks in Table
1, many of the studies on bottleneck injection strategies in Table
3 are published with code that would help the research communities replicate/reproduce the experimental results and build on existing studies.
4.4 SC with Bottlenecks: Training Methodologies
Given that recent SC studies with bottleneck injection strategies result in some degree of accuracy loss compared to the original models (i.e., without injected bottlenecks), various training methodologies are used and/or proposed in such studies. Some of the training methods are designed specifically for architectures with injected bottlenecks. We now summarize the differences between the various training methodologies used in recent SC studies.
We recall that
\(\mathbf {x}\) and
\(\mathbf {y}\) are an input (e.g., an RGB image) and the corresponding label (e.g., one-hot vector), respectively. Given an input
\(\mathbf {x}\), a DNN model
\(\mathcal {M}\) returns its output
\(\mathbf {\hat{y}} = \mathcal {M}(\mathbf {x})\), such as class probabilities in a classification task. Each of the
L layers of model
\(\mathcal {M}\) can be either low-level (e.g., convolution [
69], batch normalization [
52], ReLU [
100]) or high-level layers (e.g., residual block in ResNet [
43] and dense block in DenseNet [
51]), which are composed of multiple low-level layers.
\(\mathcal {M}(\mathbf {x})\) is a sequence of the
L layer functions
\(\mathrm{f}_{j}\)s, and the
jth layer transforms
\(\mathbf {z}_{j-1}\), the output from the previous
\({(j-1)}\)th layer:
\[\mathbf {z}_{j} = \mathrm{f}_{j}(\mathbf {z}_{j-1}; \mathbf {\theta }_{j}), \qquad \mathbf {z}_{0} = \mathbf {x}, \quad \mathbf {z}_{L} = \mathbf {\hat{y}}, \qquad (1)\]
where
\(\mathbf {\theta }_{j}\) denotes the
jth layer’s hyperparameters and parameters to be optimized during training.
Cross-entropy-based Training
To optimize parameters in a DNN model, we first need to define a loss function and update the parameters by minimizing the loss value with an optimizer such as stochastic gradient descent and Adam [
62] during training. In image classification, a standard method is to train a DNN model
\(\mathcal {M}\) in an end-to-end manner using the cross-entropy like many of the studies [
26,
49,
91] in Table
3. For simplicity, here we focus on the categorical cross-entropy and assume
\(c \equiv \mathbf {y}\) is the correct class index given a model input
\(\mathbf {x}\). Given a pair of
\(\mathbf {x}\) and
c, we obtain the model output
\(\mathbf {\hat{y}} = \mathcal {M}(\mathbf {x})\), and then the (categorical) cross-entropy loss is defined as
\[\mathcal {L}_\text{CE}(\mathbf {\hat{y}}, c) = -\sum _{j \in \mathcal {C}} \mathbb {1}_{[j = c]} \log \hat{\mathbf {y}}_{j} = -\log \hat{\mathbf {y}}_{c}, \qquad (2)\]
where
\(\hat{\mathbf {y}}_{j}\) is the class probability for the class index
j, and
\(\mathcal {C}\) is a set of considered classes (
\(c \in \mathcal {C}\)).
As shown in Equation (
2), the loss function used in cross-entropy-based training methods is a function of the final output
\(\mathbf {\hat{y}}\) only, and thus is not designed with SC frameworks in mind. While Eshratifar et al. [
26], Hu and Krishnamachari [
49], Lee et al. [
70], and Shao and Zhang [
126] use cross-entropy to train bottleneck-injected DNN models in an end-to-end manner (Figure
3), Matsubara et al. [
91] empirically show that these methods cause a larger accuracy loss in complex tasks such as classification on the ImageNet dataset [
120] compared to other more advanced techniques, including knowledge distillation.
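For reference, the sketch below shows cross-entropy-based end-to-end training of a toy bottleneck-injected classifier in PyTorch; as Equation (2) indicates, the loss depends only on the final output and is agnostic to the injected bottleneck. The architecture, optimizer settings, and data are illustrative stand-ins.

```python
import torch
from torch import nn

# Toy bottleneck-injected classifier: an early narrow layer stands in for the bottleneck.
model = nn.Sequential(
    nn.Conv2d(3, 8, 5, stride=4, padding=2), nn.ReLU(),     # head ending in a narrow output
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),              # tail
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()                           # Equation (2), batch-averaged
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(8, 3, 32, 32)                               # stand-in mini-batch
c = torch.randint(0, 10, (8,))                              # stand-in class indices

model.train()
optimizer.zero_grad()
loss = criterion(model(x), c)                               # depends only on the final output
loss.backward()
optimizer.step()
print(f"cross-entropy loss: {loss.item():.3f}")
```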
Knowledge Distillation
Complex DNN models are usually trained to learn parameters for discriminating between a large number of classes (e.g., 1,000 in the ImageNet dataset) and are often overparameterized. Knowledge distillation (KD) [
3,
46,
78] is a training scheme to address this problem, and trains a DNN model (called “student”) using additional signals from a pretrained DNN model (called “teacher” and often larger than the student). In standard cross-entropy-based training—that is, using “hard targets” (e.g., one-hot vectors)—a side effect is that the trained model assigns probabilities to all of the incorrect classes. The relative probabilities of these incorrect classes reveal how large models tend to generalize.
As illustrated in Figure
4, by distilling the knowledge from a pretrained complex model (teacher), a student model can be more generalized and avoid overfitting to the training dataset, using the outputs of the teacher model as “soft targets” in addition to the hard targets [
46].
\[\mathcal {L}_\text{KD} = \alpha \, \mathcal {L}_\text{task}\left(\hat{\mathbf {y}}^\text{S}, \mathbf {y}\right) + (1 - \alpha) \, \tau ^{2} \, \mathrm{KL}\left(\mathrm{p}(\hat{\mathbf {y}}^\text{T}), \mathrm{q}(\hat{\mathbf {y}}^\text{S})\right), \qquad (3)\]
where
\(\alpha\) is a balancing factor (hyperparameter) between the
hard target (left term) and
soft target (right term) losses, and
\(\tau\) is another hyperparameter called
temperature to soften the outputs of teacher and student models in Equation (
4).
\(\mathcal {L}_\text{task}\) is a task-specific loss function, and it is a cross-entropy loss in image classification tasks, i.e.,
\(\mathcal {L}_\text{task} = \mathcal {L}_\text{CE}\).
\(\mathrm{KL}\) is the Kullback-Leibler divergence function, where
\(\mathrm{q}(\hat{\mathbf {y}}^\text{S})\) and
\(\mathrm{p}(\hat{\mathbf {y}}^\text{T})\) are probability distributions of student and teacher models for an input
\(\mathbf {x}\), that is,
\(\mathrm{q}(\hat{\mathbf {y}}^\text{S}) = [\mathrm{q}_{1}(\hat{\mathbf {y}}^\text{S}), \ldots , \mathrm{q}_{|\mathcal {C}|}(\hat{\mathbf {y}}^\text{S})]\) and
\(\mathrm{p}(\hat{\mathbf {y}}^\text{T}) = [\mathrm{p}_{1}(\hat{\mathbf {y}}^\text{T}), \ldots , \mathrm{p}_{|\mathcal {C}|}(\hat{\mathbf {y}}^\text{T})]\):
\[\mathrm{q}_{k}(\hat{\mathbf {y}}^\text{S}) = \frac{\exp \left(\hat{y}^\text{S}_{k} / \tau \right)}{\sum _{j \in \mathcal {C}} \exp \left(\hat{y}^\text{S}_{j} / \tau \right)}, \qquad \mathrm{p}_{k}(\hat{\mathbf {y}}^\text{T}) = \frac{\exp \left(\hat{y}^\text{T}_{k} / \tau \right)}{\sum _{j \in \mathcal {C}} \exp \left(\hat{y}^\text{T}_{j} / \tau \right)}, \qquad (4)\]
where \(\hat{y}^\text{S}_{k}\) and \(\hat{y}^\text{T}_{k}\) denote the (pre-softmax) scores of the student and teacher models for class \(k\).
Using the ImageNet dataset, it is empirically shown in Matsubara et al. [
91] that all the considered bottleneck-injected student models trained with their teacher models (original models without injected bottlenecks) consistently outperform those trained without the teacher models. This result matches a widely known trend in knowledge distillation reported in Ba and Caruana [
3]. However, similar to cross-entropy, knowledge distillation is still not aware of the bottlenecks introduced into DNN models and may result in significant accuracy loss, as suggested by Matsubara et al. [
91].
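A common PyTorch formulation of the loss in Equations (3) and (4) is sketched below; the \(\tau^2\) scaling of the soft term follows the standard formulation of Hinton et al. [46], and the values of \(\alpha\) and \(\tau\) are illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=4.0):
    hard = F.cross_entropy(student_logits, targets)              # L_task (hard targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),             # q(y^S), softened
        F.softmax(teacher_logits / tau, dim=-1),                 # p(y^T), softened
        reduction="batchmean",
    ) * (tau ** 2)
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
print(kd_loss(student_logits, teacher_logits, targets).item())
```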
Reconstruction-based Training
As illustrated in Figure
5, Choi et al. [
14], Jankowski et al. [
57], Sbai et al. [
124], and Yao et al. [
164] inject AE models into existing DNN models and train the injected components by minimizing the reconstruction error. First, an intermediate layer in a DNN model (say, its
jth layer) is manually chosen, and the output of the
jth layer
\(\mathbf {z}_{j}\) is fed to the encoder
\(\mathrm{f}_\text{enc}\) whose role is to compress
\(\mathbf {z}_{j}\). The encoder’s output
\(\mathbf {z}_\text{enc}\) is a compressed representation, i.e., bottleneck, to be transferred to the edge server, and the following decoder
\(\mathrm{f}_\text{dec}\) decompresses the compressed representation and returns
\(\mathbf {z}_\text{dec}\). As the decoder is designed to reconstruct
\(\mathbf {z}_{j}\), its output
\(\mathbf {z}_\text{dec}\) should have the same dimensionality as
\(\mathbf {z}_{j}\). Then, the injected AEs are trained by minimizing the following reconstruction loss:
\[\mathcal {L}_\text{rec}(\mathbf {z}_{j}) = \Vert \mathbf {z}_{j} - \mathbf {z}_\text{dec} + \epsilon \Vert _{n}^{m} = \Vert \mathbf {z}_{j} - \mathrm{f}_\text{dec}(\mathrm{f}_\text{enc}(\mathbf {z}_{j})) + \epsilon \Vert _{n}^{m}, \qquad (5)\]
where
\(\Vert \mathbf {z}\Vert _n^m\) denotes the
\(m^\text{th}\) power of the
n-norm of
\(\mathbf {z}\), and
\(\epsilon\) is an optional regularization constant. For example, Choi et al. [
14] set
\(m = 1\),
\(n = 2,\) and
\(\epsilon = 10^{-6}\), and Jankowski et al. [
57] use
\(m = n = 1\) and
\(\epsilon = 0\). Inspired by the idea of knowledge distillation [
46], Yao et al. [
164] also consider additional squared errors between intermediate feature maps from models with and without bottlenecks as additional loss terms like generalized head network distillation [
94], described later. While Yao et al. [
164] show a high compression rate with small accuracy loss by injecting encoder-decoder architectures into existing DNN models, such strategies [
14,
57,
124,
164] increase computational complexity as a result. Suppose the encoder and decoder consist of
\(L_\text{enc}\) and
\(L_\text{dec}\) layers, respectively; then the total number of layers in the altered DNN model is
\(L + L_\text{enc} + L_\text{dec}\).
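The sketch below illustrates reconstruction-based training of an injected autoencoder: the encoder compresses a frozen intermediate feature map \(\mathbf {z}_j\) and the decoder is trained to reconstruct it. The architectures, the choice \(m=1\), \(n=2\), and the omission of \(\epsilon\) are illustrative assumptions in the spirit of the studies above, not a reimplementation of any of them.

```python
import torch
from torch import nn

z_j = torch.randn(2, 256, 28, 28)        # stand-in intermediate feature map from a frozen backbone

f_enc = nn.Sequential(nn.Conv2d(256, 16, 3, stride=2, padding=1), nn.ReLU())    # compress
f_dec = nn.ConvTranspose2d(16, 256, 4, stride=2, padding=1)                     # restore shape

optimizer = torch.optim.Adam(list(f_enc.parameters()) + list(f_dec.parameters()), lr=1e-3)

for _ in range(3):                       # a few illustrative optimization steps
    z_enc = f_enc(z_j)                   # bottleneck representation sent over the channel
    z_dec = f_dec(z_enc)                 # reconstruction computed on the edge server
    loss = torch.linalg.vector_norm(z_j - z_dec, ord=2, dim=(1, 2, 3)).mean()   # m=1, n=2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(tuple(z_j.shape), "->", tuple(z_enc.shape))
```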
Head Network Distillation
The training methods described above are focused on either end-to-end or encoder-decoder training. The first approach often requires hard targets such as one-hot vectors and more training cost, while the latter can focus on the injected components (encoder and decoder) during training, but the additional components (layers) will increase the complexity of the DNN model. To reduce both training cost and model complexity while preserving accuracy, it is proposed in Matsubara et al. [
90] to use head network distillation (HND) to distill the head portion of the DNN—which contains a bottleneck—leveraging pretrained DNN models. Figure
6 illustrates this approach.
The original pretrained DNN (consisting of
L layers) is used as a starting point, and its architecture is simplified in the head part. As only the teacher’s head portion is altered, the tail portion of the student model is identical to that of the teacher model with respect to architecture, and the same pretrained parameters can be maintained. Thus, head network distillation requires only the first layers of the teacher and student models in the training session, as the student head model
\(\mathrm{f}_\text{head}^\text{S}\) will be trained to mimic the behavior of teacher’s head model
\(\mathrm{f}_\text{head}^\text{T}\) given an input
\(\mathbf {x}\):
\[\mathcal {L}_\text{HND}(\mathbf {x}) = \left\Vert \mathrm{f}_\text{head}^\text{S}(\mathbf {x}) - \mathrm{f}_\text{head}^\text{T}(\mathbf {x}) \right\Vert ^{2}, \qquad (6)\]
where
\(\mathrm{f}_\text{head}^\text{S}\) and
\(\mathrm{f}_\text{head}^\text{T}\) are sequences of the first
\(L_\text{head}^\text{S}\) and
\(L_\text{head}^\text{T}\) layers in student and teacher models (
\(L_\text{head}^\text{S} \ll L^\text{S}\), and
\(L_\text{head}^\text{T} \ll L\)), respectively.
Experimental results with the ImageNet (ILSVRC 2012) dataset show that given a bottleneck-introduced model, the head network distillation method consistently outperforms cross-entropy-based training [
26,
49,
126] and knowledge distillation methods in terms of not only training cost but also accuracy of the trained model. This method is extended in Matsubara and Levorato [
94], where the generalized
head network distillation technique (GHND) is proposed for complex object detection tasks and models. We note that these tasks require finer feature maps mimicking those at intermediate layers in the original pretrained object detectors. The loss function in this approach is
\[\mathcal {L}_\text{GHND}(\mathbf {x}) = \sum _{j \in \mathcal {J}} \lambda _{j} \cdot \mathcal {L}_{j}\left(\mathrm{f}_{1-L_j^\text{S}}^\text{S}(\mathbf {x}), \mathrm{f}_{1-L_j^\text{T}}^\text{T}(\mathbf {x})\right), \qquad (7)\]
where
j is the loss index,
\(\lambda _{j}\) is a scale factor (hyperparameter) associated with loss
\(\mathcal {L}_{j}\), and
\(\mathrm{f}_{1-L_j^\text{S}}^\text{S}\) and
\(\mathrm{f}_{1-L_j^\text{T}}^\text{T}\) indicate the corresponding sequences of the first
\(L_j^\text{S}\) and
\(L_j^\text{T}\) layers in the student and teacher models (functions of input data
\(\mathbf {x}\)), respectively. The total loss, then, is a linear combination of
\(|\mathcal {J}|\) weighted losses. Following Equation (
7), the previously proposed head network distillation technique [
90] can be seen as a special case of generalized head network distillation (GHND). GHND significantly improved the object detection performance in bottleneck-injected R-CNN models on the COCO 2017 dataset while achieving a high compression rate.
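The sketch below illustrates (generalized) head network distillation: a bottleneck-injected student head is trained to match the frozen teacher's intermediate feature maps, while the tail reuses the teacher's weights unchanged. The student architecture, the single matching point, and the use of a mean (rather than a sum of) squared error are illustrative simplifications.

```python
import torch
from torch import nn
from torchvision.models import resnet50

teacher = resnet50(weights=None).eval()
t_blocks = list(teacher.children())
teacher_head = nn.Sequential(*t_blocks[:5])            # conv1..layer1 (frozen)
for p in teacher_head.parameters():
    p.requires_grad_(False)

student_head = nn.Sequential(                          # bottleneck-injected student head
    nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
    nn.Conv2d(32, 8, 3, padding=1),                    # bottleneck
    nn.Conv2d(8, 256, 3, padding=1),                   # decoder part, edge-server side
)
optimizer = torch.optim.Adam(student_head.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    target = teacher_head(x)                           # f^T_head(x)
loss = nn.functional.mse_loss(student_head(x), target) # single-term HND; GHND sums
optimizer.zero_grad()                                  # lambda_j-weighted terms at
loss.backward()                                        # several matching layers
optimizer.step()
print(f"HND loss: {loss.item():.3f}")
```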
6 Split Computing and Early Exiting: Research Challenges
In this section, we describe some of the research challenges in the SC and EE domains.
Evaluation of SC and EE in More Practical Settings
Due to the cross-disciplinary nature of this research area, it is essential to design practical and convincing evaluation settings to demonstrate the effectiveness of proposed approaches. As shown in Tables
3 and
4, the techniques proposed in many of the recent related studies are validated only on small-scale datasets such as MNIST and CIFAR datasets, which leads to some concerns on the input data size in relation to compression. Indeed, Table
2 suggests that the input data size in many such datasets is relatively small (e.g., smaller than 2 kilobytes per image with a resolution of
\(32 \times 32\) pixels). The low resolution of the input may favor conventional EC, where the mobile device fully offloads the computing task by transferring the input data to an edge server. In fact, the transmission of such a small amount of data would require a short time even in settings with limited communication capacity. As a consequence, executing even small head models on a resource-limited mobile device could lead to an overall delay increase.
Based on the above discussion, it becomes apparent that the models and datasets, in addition to the wireless and computing environments, are of paramount importance when assessing the performance of SC and EE schemes. Of particular relevance is the evaluation of accuracy, which is not provided in some of the early studies (e.g., [
43,
122,
128]), and the consideration of state-of-the-art models and datasets, which are largely used in the machine learning community. For instance, the use of small models, such as MobileNetV2, ResNet-50, and VGG-16, which are likely overparameterized for simple classification tasks, could lead to wrong conclusions when injecting bottlenecks. Conversely, it was shown in [
90] how challenging it is to inject bottlenecks when considering complex vision tasks such as classification on the ImageNet dataset [
120].
Optimization of Bottleneck Design and Placement in SC
The study of the architecture and placement of the bottleneck in a DNN model is also of considerable importance. As suggested in [
96], important metrics include (1) bottleneck data size (or compression rate), (2) complexity of the head model executed on the mobile device, and (3) resulting model accuracy. As a principle, the smaller the bottleneck representation is, the lower the communication cost between the mobile device and edge server will be. In general, the objective of SC is to generate a bottleneck whose data size is smaller than that of the compressed input data (e.g., the JPEG file size of the input), which is in turn much smaller than the size of the raw input tensor (32-bit floating point), as the communication delay is a key component of the overall inference time [
90,
91,
94,
158]. Secondly, since mobile devices often have limited computing resources and may have other constraints such as energy consumption due to their battery capacities, SC should aim at minimizing their computational load by making head models as lightweight as possible. For instance, designing a small bottleneck at a very early stage of the DNN model enables a reduction in the computational complexity of the head model [
93,
94].
On top of these two criteria, the accuracy of the resulting model should not be compromised by the bottleneck injection, as the introduced bottleneck inevitably removes some information at its placement compared to the original model. A reasonable lower bound for the model accuracy in SC would be that of widely recognized lightweight models, e.g., MobileNetV2 [
122] for the ImageNet dataset, considering a local computing system where such lightweight models can be efficiently executed. In general, it is challenging to optimize bottleneck design and placement with respect to all three metrics, and existing studies empirically design the bottlenecks and determine the placements. Thus, a theoretical treatment of bottleneck design and placement would be an interesting research topic for future work.
Dynamic Control of Exits in EE
In most of the recent studies, early exits are used when one of the introduced early classifiers (exits) is confident enough in its prediction. However, users are required to determine, beforehand, a confidence threshold for each of the early classifiers introduced into the original model. For example, if the first classifier’s prediction score is greater than 0.9 (in the range from 0.0 to 1.0), then the inference for the input is terminated.
To achieve more efficient inference without significantly sacrificing the accuracy of the original model, the system needs to find a balance between (early) classifiers. As recent studies introduce multiple early exits to a model at different stages, such optimizations are challenging. In addition to manually defining such a threshold for each of the classifiers based on empirical results, a possibly interesting direction is the optimization of the decision-making process, that is, at which (early) classifier we should terminate the inference for a given input, without a set of thresholds defined beforehand based on system characteristics.
Expanding the Application Domain of SC and EE
The application domains of SC and (to a lesser extent) EE remain primarily focused on image classification. This focus may be explained by the size of the input, which makes compression a relevant problem in many settings, and by the complexity of the models and tasks. However, there are many other unexplored domains that would benefit from SC. Real-time health condition monitoring via wearable sensors is a notable example of an application where a significant amount of data is transferred from sensors to edge servers such as cellular phones and home hubs. For instance, the detection and monitoring of heart anomalies (e.g., arrhythmia) from ECG [
30] require the processing of high-rate samples (e.g., 100 to
\(1,\!000\) per heart cycle) using high-complexity DNN models [
41]. Health monitoring applications pose different challenges compared to CV-based applications. Indeed, in the former, both the computing capacity and the bandwidth available to the system are often smaller compared to the latter scenario, and conceptual advancements are required.
Toward an Information-theoretic Perspective
The key intuition behind the success of SC and EE is similar to what has led to the success of techniques such as model pruning [
38,
44,
74,
160] and knowledge distillation [
46,
61,
98]: most state-of-the-art DNNs are significantly overparameterized [
165,
166]. A possible approach to justify SC and EE can be found in the study of
information bottlenecks (IBs), which were introduced in [
142] as a compression technique in which a random variable
\(\mathbf {X}\) is compressed while preserving relevant information about another random variable
\(\mathbf {Y}\). The IB method has been applied in [
143] to quantify mutual information between the network layers and derive an information theory limit on DNN efficiency. This has led to attempts at explaining the behavior of deep neural networks with the information bottleneck formalism [
123].
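As a concrete reference point, the IB principle of [142] seeks a compressed representation \(\mathbf {Z}\) of \(\mathbf {X}\) that remains informative about \(\mathbf {Y}\) by minimizing the Lagrangian
\[\mathcal {L}_\text{IB} = I(\mathbf {X}; \mathbf {Z}) - \beta \, I(\mathbf {Z}; \mathbf {Y}),\]
where \(I(\cdot ;\cdot)\) denotes mutual information and \(\beta > 0\) balances compression against the preservation of task-relevant information. Read through the lens of Section 2.2, the bottleneck tensor \(\mathbf {z}\) plays the role of \(\mathbf {Z}\), with the head model defining the mapping from \(\mathbf {x}\) to \(\mathbf {z}\); we stress that this is an interpretation rather than an established result.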
Despite these early attempts, a strong connection between this relatively new perspective and the techniques described in this article is still elusive. Some of the approaches and architectures discussed in this article are meaningful attempts to efficiently extract a compressed representation of the input and provide sufficient information toward a certain task early in the network layers. The emerging IB formalism is a promising approach to enable the first steps in the information-theoretic analysis of neural-network-based transformations. We believe that this interpretation could serve as a foundation for an in-depth study of structural properties for both SC and EE.