2.1 Automatic Volumetric Segmentation
State-of-the-art works employ fully convolutional networks (FCNs) [14] for automatic brain tumor segmentation. Unlike common convolutional neural networks that end with a fully connected layer, FCNs use a convolutional layer as the last layer to produce a pixel-wise prediction. In particular, a fundamental FCN architecture, namely, U-Net [23], consists of a contracting encoder (a.k.a. analysis path) and a successive expanding decoder (a.k.a. synthesis path). The encoding part analyzes the input image and interprets it as a feature map, which is then fed into the decoder. Moreover, high-resolution activations in the analysis path are concatenated with up-sampled outputs in the synthesis path through shortcut connections to achieve better localization performance. Due to the symmetric fully convolutional architecture, the decoding part constructs a label map of the same size as the input image, each of whose channels corresponds to a segmentation label. Within a channel, every pixel indicates the probability of the corresponding label being positive.
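For concreteness, the following PyTorch sketch (with illustrative shapes, channel counts, and label count rather than the configuration of Reference [23]) shows how a final \(1\times 1\) convolution and a channel-wise softmax turn decoder features into such a label-probability map.

```python
import torch
import torch.nn as nn

# Illustrative decoder output: (batch, channels, height, width).
decoder_features = torch.randn(1, 64, 128, 128)
num_labels = 4  # e.g., background plus three tumor sub-regions (illustrative)

# A 1x1 convolution maps feature channels to one channel per label;
# softmax over the channel dimension yields per-pixel label probabilities.
prediction_head = nn.Conv2d(64, num_labels, kernel_size=1)
logits = prediction_head(decoder_features)      # (1, num_labels, 128, 128)
probabilities = torch.softmax(logits, dim=1)    # each pixel sums to 1 over labels
```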
Though U-Nets have achieved accuracy close to human performance in segmenting 2D images, when applied to volumetric medical images, the 3D volumes have to be processed as multiple 2D slices, so the network fails to capture the relationship between adjacent slices. Therefore, some later works propose volumetric extensions of the U-Net to produce smoother volumetric segmentations.
In particular, the authors of U-Net also propose a feasible solution to the volumetric segmentation problem, namely, 3D U-Net [29], by replacing the 2D convolutions in U-Net with their 3D counterparts. An overview of the 3D U-Net is illustrated in Figure 1. As can be seen in Figure 1, like U-Net, 3D U-Net comprises the left analysis path and the right synthesis path. In particular, each stage of the encoder consists of two 3D convolutional layers with a kernel size of \(3\times 3\times 3\) and a 3D max pooling layer to down-sample the feature map. On the decoder side, there are also two \(3\times 3\times 3\) convolutions at each stage, and the up-sampling is performed with an up-convolutional layer (which some later works replace with a nearest-neighbor up-sampling layer). The last layer of the network performs a \(1\times 1\times 1\) convolution that resizes the number of output channels to match the number of labels. However, while a couple of 2D images easily fit into a single GPU, whole 3D volumes can be too big for GPU memory, especially during training, since large intermediate activations have to be stored for back-propagation. As one of the main bottlenecks of 3D U-Net, the whole volume sometimes has to be divided into several patches that are fed sequentially into the network.
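The following is a minimal PyTorch sketch of this structure, reduced to a single encoder/decoder level and with illustrative channel counts rather than the configuration of Reference [29]; it shows the two \(3\times 3\times 3\) convolutions per stage, the max pooling and up-convolution, the shortcut concatenation, and the \(1\times 1\times 1\) prediction layer.

```python
import torch
import torch.nn as nn

def conv_stage_3d(in_ch, out_ch):
    """Two 3x3x3 convolutions, as in one analysis/synthesis stage."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet3D(nn.Module):
    """A one-level 3D U-Net sketch: encode, decode, 1x1x1 prediction head."""
    def __init__(self, in_ch=4, num_labels=4, base_ch=16):
        super().__init__()
        self.enc1 = conv_stage_3d(in_ch, base_ch)
        self.pool = nn.MaxPool3d(kernel_size=2)                   # 3D max pooling
        self.enc2 = conv_stage_3d(base_ch, base_ch * 2)
        self.up = nn.ConvTranspose3d(base_ch * 2, base_ch, kernel_size=2, stride=2)
        self.dec1 = conv_stage_3d(base_ch * 2, base_ch)           # after skip concatenation
        self.head = nn.Conv3d(base_ch, num_labels, kernel_size=1) # 1x1x1 output conv

    def forward(self, x):
        s1 = self.enc1(x)                    # high-resolution encoder activations
        bottom = self.enc2(self.pool(s1))    # down-sampled stage
        up = self.up(bottom)                 # up-convolution back to full resolution
        merged = torch.cat([s1, up], dim=1)  # shortcut connection (concatenation)
        return self.head(self.dec1(merged))  # per-voxel logits, one channel per label
```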
3D U-Net has served as a prototype for automatic volumetric segmentation, and many later approaches have been developed based on its architecture and modules.
For example, Reference [26] proposes multi-level deep supervision based on the 3D U-Net architecture, in which the three stages in the synthesis path are referred to as three different levels: lower layers, middle layers, and upper layers. Besides connecting to the next level, the lower and middle levels (the upper level being the final stage) are also followed by up-convolutional blocks that upscale their reconstructions to match the input resolution. Therefore, each of the three levels separately produces a segmentation output at the same resolution. The authors argue that back-propagation is improved by computing losses for the three different outputs, since direct supervision on the hidden layers is more effective for the gradient computation.
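Such a deep-supervision objective can be sketched as a weighted sum of per-level losses; the weights and the cross-entropy loss below are illustrative assumptions rather than the exact choices of Reference [26].

```python
import torch.nn.functional as F

def deep_supervision_loss(outputs, target, weights=(0.25, 0.25, 0.5)):
    """Combine losses from the lower, middle, and upper decoder outputs.

    `outputs` holds three full-resolution logit volumes (one per level) and
    `target` the voxel-wise label map; the level weights are illustrative.
    """
    return sum(w * F.cross_entropy(out, target)
               for w, out in zip(weights, outputs))
```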
V-Net [16], which is another volumetric derivation of U-Net, replaces the pooling layers of the contracting path with 3D convolutions. The authors note that convolutions can reduce the activation resolution when the kernel size and stride are chosen appropriately, e.g., a kernel size of \(2\times 2\times 2\) and a stride of 2 halve the resolution of the activations. The volumetric convolutional layers increase the receptive field and save memory during training, since they do not need to record the switches that associate the outputs and inputs of pooling layers for back-propagation. In addition, each stage (in both the encoder and the decoder) is a residual block in which the input is, after being processed by the ReLU non-linearity, added directly to the output of the last convolutional layer. Compared with the non-residual U-Net architecture, the residual modules in V-Net help the network converge better and achieve higher performance. The very last convolutional layer is similar to that of 3D U-Net: it has a kernel size of \(1\times 1\times 1\) and produces a probabilistic segmentation map by applying a voxel-wise softmax function to its output.
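The sketch below illustrates one such stage (with illustrative channel counts and a plain ReLU non-linearity): a residual block whose input, after the non-linearity, is added to the output of the last convolution, followed by a strided \(2\times 2\times 2\) convolution that halves the resolution in place of pooling.

```python
import torch.nn as nn

class ResidualStage3D(nn.Module):
    """One V-Net-style stage (sketch): residual block plus strided down-sampling."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        # Down-sampling by convolution: kernel 2x2x2, stride 2 halves D, H, W
        # while doubling the number of feature channels.
        self.down = nn.Conv3d(channels, channels * 2, kernel_size=2, stride=2)

    def forward(self, x):
        residual = self.relu(x)                          # non-linearity on the stage input
        out = self.conv2(self.relu(self.conv1(residual)))
        return self.down(out + residual)                 # residual addition, then strided conv
```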
In addition, Attention U-Net proposes to highlight the more relevant activations with soft attention modules. Specifically, the authors argue that activations in the synthesis path are relatively imprecise, since they are constructed by up-sampling. Standard U-Nets address the issue with the shortcut paths connecting the analysis and synthesis paths, which, nonetheless, introduce heavy redundancy and distract the network. Therefore, Attention U-Net introduces additive soft attention implemented at the shortcut connections on a per-voxel basis, which reduces the computational cost and improves the segmentation performance. Their experiments show that as the number of training epochs increases, Attention U-Net learns to focus more on the foreground areas, and they achieve a clear improvement in Dice score compared with the standard 3D U-Net.
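An additive, per-voxel attention gate of this kind can be sketched as follows; the projection sizes and the use of trilinear interpolation to align the gating signal are illustrative assumptions rather than the exact design of Attention U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate3D(nn.Module):
    """Additive soft attention on a skip connection (sketch).

    Encoder (skip) features and a decoder gating signal are projected with
    1x1x1 convolutions, added, and squashed into a per-voxel weight in [0, 1]
    that re-scales the skip features before concatenation.
    """
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.project_skip = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)
        self.project_gate = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.score = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, skip, gate):
        # Bring the (coarser) gating signal to the skip resolution if needed.
        gate = F.interpolate(gate, size=skip.shape[2:], mode="trilinear",
                             align_corners=False)
        attention = torch.sigmoid(
            self.score(F.relu(self.project_skip(skip) + self.project_gate(gate))))
        return skip * attention   # per-voxel re-weighting of the skip path
```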
According to Reference [9], it is commonly believed that different segmentation tasks require specialized architectures, and a huge number of models have been designed for few or even single datasets in recent years, which makes it difficult for researchers to identify and select the architecture that best fits their scenario. Moreover, such models generally suffer from overfitting and a lack of adaptability. In this context, Reference [9] proposes nnU-Net with adaptive architectures. In particular, three basic U-Net architectures are included in nnU-Net: 2D U-Net, 3D U-Net, and 3D U-Net Cascade, which consists of two 3D U-Nets cascaded in sequence to address the memory constraints of large images. All three architectures are initialized with a specific patch size, batch size, and number of feature maps, which are automatically adjusted according to the median plane size of the training data. A five-fold cross-validation is used to choose the architecture (or ensemble) and topology with the best performance as the final model. Experimental evaluations show that nnU-Net achieves state-of-the-art performance on several distinct datasets and even outperforms specialized models on some tasks.
While many specialized architecture variants have been proposed for different segmentation applications, they are generally based on the standard 3D U-Net. Therefore, in this article, we adopt the basic 3D U-Net as our segmentation model with some small modifications to better fit our problem and approach, which are explained in the evaluation section.
2.2 Network Compression and Acceleration
As discussed in the previous section, the enormous size and computational cost of these models are currently bottlenecks to their practical deployment. Many methods have been proposed to overcome the efficiency challenge, including quantization [1, 4, 6, 7, 10, 11, 22, 25, 27, 28], pruning [7], and other encoding approaches [6, 7]. In particular, these works roughly fall into two categories.
The first type of works focuses on on-device storage optimization but gains no computational efficiency improvement to support real-time applications. Although network parameters are compressed into tiny models, they need to be converted back into full-precision values, and the computation is carried out using floating-point representation. For example, Reference [7] proposes to “prune” network synapses by forcing some of the weights to zero. In addition, the non-zero weights are clustered into groups, and the entries are encoded using Huffman coding to further reduce the storage per weight. The model can be decoded back into full precision with the code book, and a significant compression rate is achieved with negligible accuracy loss. The same authors also propose in Reference [28] to quantize weights into ternary values (2-bit weights), which causes very little accuracy degradation by training the quantization centroids. Reference [22] considers the brain segmentation problem and derives its “3DQ” approach based on Reference [28], which also quantizes the full-precision weights into 2 bits. They further incorporate an additional factor to scale the quantization centroids and achieve near full-precision accuracy on two medical-imaging 3D segmentation datasets. However, the downside of such technologies is that they bring no computational benefits and may even worsen the speed due to the additional decoding phase.
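As a rough illustration of this first category (not the training procedure of References [7, 22, 28]), the sketch below maps a weight tensor to 2-bit ternary codes with two scaling centroids; before any computation, the codes must be decoded back to floating point, which is exactly why no arithmetic speed-up is obtained.

```python
import torch

def ternarize(weights, threshold_ratio=0.05):
    """Map full-precision weights to 2-bit codes {-1, 0, +1} plus two scales.

    `threshold_ratio` is an illustrative hyper-parameter: weights whose
    magnitude is below threshold_ratio * max|w| are pruned to zero. In [28]
    the two scaling centroids are trained; here they are simply the mean
    magnitude of the surviving positive and negative weights.
    """
    threshold = threshold_ratio * weights.abs().max()
    codes = torch.zeros_like(weights)
    codes[weights > threshold] = 1.0
    codes[weights < -threshold] = -1.0
    pos_scale = (weights[codes == 1.0].mean()
                 if (codes == 1.0).any() else weights.new_tensor(0.0))
    neg_scale = (weights[codes == -1.0].abs().mean()
                 if (codes == -1.0).any() else weights.new_tensor(0.0))
    return codes, pos_scale, neg_scale

def decode(codes, pos_scale, neg_scale):
    """Decode the 2-bit codes back to floating point before computing with them."""
    return torch.where(codes > 0, pos_scale * codes, neg_scale * codes)
```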
Alternatively, some works directly train the parameters to be integers. In addition to reducing the storage overhead, such approaches also effectively reduce the number of floating-point operations during inference and improve computational efficiency. For instance, it is proposed in References [1, 25] to operate the neural networks, including training and inference, with 8-bit-integer weights and activations, where the quantization centroids of Reference [25] are uniformly distributed between \(-1\) and 1, while those of Reference [1] are derived from the maximum absolute values of the weights and activations. Further, DoReFa-Net [27] allows the weights and activations to be quantized into an arbitrary number of bits. They choose the quantization centroids such that the value range of the weights is limited to \([-1, 1]\) while the activations are bounded within \([0, 1]\). These works directly approximate the full-precision model with low-bit-width values, so they are able to run with integer arithmetic. However, some approaches use the low-precision integers to index the quantization centroids. In Reference [10], weights and activations are encoded as non-negative integers on a per-layer basis and can be decoded into full-precision approximations with a pair of shifting and scaling operations. The shifting and scaling factors are derived directly from the full-precision model during the training phase such that all real-valued points fall within the range between the smallest and the greatest quantization centroids, i.e., the clustering is simply performed by taking the full-precision range and partitioning it uniformly. Moreover, they propose a “batch-normalization folding” technique that absorbs the parameters of batch normalization into the preceding convolutional or fully connected layer to reduce the computational complexity.
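A minimal sketch of such a uniform affine scheme, with the scale and zero-point derived from the observed value range as described for Reference [10], and of batch-normalization folding is given below; function and variable names are ours.

```python
import torch

def affine_quantize(x, num_bits=8):
    """Uniformly partition [min(x), max(x)] into 2**num_bits levels and encode
    x as non-negative integers plus a (scale, zero_point) pair."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return q.to(torch.int32), scale, zero_point   # non-negative integer codes

def affine_dequantize(q, scale, zero_point):
    """Decode with one shift and one scale: x ≈ scale * (q - zero_point)."""
    return scale * (q.to(torch.float32) - zero_point)

def fold_batch_norm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Absorb batch normalization into the preceding 3D convolution.

    weight has shape (out_ch, in_ch, k, k, k); the per-channel BN affine
    transform is merged into the convolution's weight and bias so that only
    one (quantized) layer remains at inference time.
    """
    std = torch.sqrt(running_var + eps)
    folded_weight = weight * (gamma / std).reshape(-1, 1, 1, 1, 1)
    folded_bias = (bias - running_mean) * gamma / std + beta
    return folded_weight, folded_bias
```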
However, clear accuracy drops are present in the above approaches. The reasons include:
•
In Reference [10], an exponential moving average with a smoothing parameter close to 1 is used to derive the factors for activations. Since the intermediate activations differ from sample to sample, this makes the factors highly dependent on the latest batches and relatively volatile (see the sketch after this list).
•
Since the weights and activations of a well-trained model mostly follow Gaussian and half-wave Gaussian distributions [3], respectively, a significant number of points are concentrated around the mean value and around 0. Therefore, for both weights and activations, it is unnecessary and sub-optimal for References [1, 10] to span a range covering all samples, especially when a large mini-batch size is used or extreme outliers exist. However, References [25, 27] force the weights to lie between \(-1\) and 1, which likewise reduces the performance compared with networks without such constraints.
•
The centroids of the weight approximations are not trained in these approaches but are computed directly from the full-precision distributions, so that the quantization centers span the same ranges as the full-precision weights and activations; this makes the accuracy of the full-precision model an upper bound on the quantized performance. Because representing a continuous range with discrete centroids necessarily introduces error, a drop in performance is inevitable.
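To make the first two points concrete, the sketch below tracks an activation range with an exponential moving average of the per-batch min/max and shows how a single outlier stretches a range that is forced to cover all samples; the clipping percentile at the end is purely illustrative and not taken from the cited works.

```python
import torch

class EmaRangeObserver:
    """Track an activation range with an exponential moving average of the
    per-batch min/max, as described for Reference [10] (sketch)."""
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.min_val = None
        self.max_val = None

    def update(self, activations):
        batch_min, batch_max = activations.min(), activations.max()
        if self.min_val is None:
            self.min_val, self.max_val = batch_min, batch_max
        else:
            m = self.momentum
            self.min_val = m * self.min_val + (1 - m) * batch_min
            self.max_val = m * self.max_val + (1 - m) * batch_max
        return self.min_val, self.max_val

# A single outlier stretches a min/max range, wasting quantization levels on a
# region that holds almost no probability mass; clipping the range (here at an
# illustrative 99.9th percentile) keeps the centroids where the values are.
acts = torch.relu(torch.randn(100_000))   # half-wave Gaussian-like activations
acts[0] = 50.0                            # extreme outlier
full_range = (acts.min(), acts.max())     # stretched up to the outlier
clipped_range = (acts.min(), torch.quantile(acts, 0.999))
```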
The motivation of this work is to improve upon the previous approaches and address the issues discussed above. In comparison with the first type of works, our approach grants an efficiency improvement on volumetric segmentation through integer-arithmetic dot-product operations. Moreover, we allow an arbitrary number of bits for the quantization and aim to reduce the performance degradation by directly training the quantization factors together with the other network parameters to minimize the segmentation loss, rather than deriving them from the full-precision model, which grants the low-precision model the potential to even outperform the full-precision network.
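One generic way to make a quantization factor trainable is sketched below, using a straight-through estimator so that the rounding step does not block gradients; this is a simplified illustration rather than the exact parameterization developed later in this article.

```python
import torch
import torch.nn as nn

def round_ste(x):
    """Round with a straight-through estimator: rounding in the forward pass,
    identity in the backward pass, so gradients can reach the scale factor."""
    return x + (x.round() - x).detach()

class LearnableQuantizer(nn.Module):
    """Quantize a tensor with a scale factor trained by the task loss (sketch)."""
    def __init__(self, num_bits=8, init_scale=0.1):
        super().__init__()
        self.levels = 2 ** (num_bits - 1) - 1       # symmetric signed integer range
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x):
        q = torch.clamp(round_ste(x / self.scale), -self.levels, self.levels)
        return q * self.scale                       # integer code times trained scale
```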