Concurrent Spatial and Channel 'Squeeze & Excitation' in Fully Convolutional Networks
1 Introduction
2 Methods
Let us assume an input feature map X ∈ R^{H×W×C'} that passes through an encoder or decoder block F_tr(·) to generate the output feature map U ∈ R^{H×W×C}, F_tr : X → U. Here H and W are the spatial height and width, with C' and C being the number of input and output channels, respectively. The generated U combines the spatial and channel information of X through a series of convolutional layers and non-linearities defined by F_tr(·). We place the SE blocks F_SE(·) on U to recalibrate it to Û. We propose three different variants of SE blocks, which are detailed next. The SE blocks can be seamlessly integrated within any F-CNN
model by placing them after every encoder and decoder block, as illustrated in
Fig. 1(a). Û is used in the subsequent pooling/upsampling layers.
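To make this integration concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' implementation; the module and variable names are hypothetical) of wrapping an arbitrary encoder/decoder block F_tr(·) with a recalibration module F_SE(·), as in Fig. 1(a):

```python
import torch
import torch.nn as nn

class RecalibratedBlock(nn.Module):
    """Wrap any encoder/decoder block F_tr with a recalibration module F_SE,
    so that U = F_tr(X) is rescaled to U_hat = F_SE(U) before the subsequent
    pooling/upsampling layer."""
    def __init__(self, block: nn.Module, se_module: nn.Module):
        super().__init__()
        self.block = block   # F_tr: X -> U
        self.se = se_module  # F_SE: U -> U_hat

    def forward(self, x):
        u = self.block(x)
        return self.se(u)

if __name__ == "__main__":
    # Toy conv block, recalibrated here by an identity placeholder for F_SE.
    toy_block = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
    wrapped = RecalibratedBlock(toy_block, nn.Identity())
    y = wrapped(torch.randn(2, 1, 32, 32))
    print(y.shape)  # torch.Size([2, 8, 32, 32])
```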
Fig. 1: Illustration of network architecture with squeeze & excitation (SE) blocks. (a)
The proposed integration of SE blocks within F-CNN. (b-d) The architectural design
of cSE, sSE and scSE blocks, respectively, for recalibrating feature map U.
For the channel SE (cSE) block, each channel u_i of U is rescaled by a learned activation σ(ẑ_i), giving

\hat{U}_{cSE} = F_{cSE}(U) = [\sigma(\hat{z}_1) u_1, \sigma(\hat{z}_2) u_2, \ldots, \sigma(\hat{z}_C) u_C].   (2)

The activation σ(ẑ_i) indicates the importance of the i-th channel, by which it is rescaled. As the network learns, these activations are adaptively tuned to ignore
less important channels and emphasize the important ones. The architecture of
the block is illustrated in Fig. 1(b).
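A minimal PyTorch sketch of the cSE recalibration in Eq. (2) is given below; it assumes the standard SE design of a spatial squeeze (global average pooling) followed by a two-layer excitation, and the reduction ratio r = 2 is chosen here only for illustration:

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """Channel squeeze & excitation (cSE), following Eq. (2): squeeze U
    spatially by global average pooling, excite with two fully connected
    layers, and rescale each channel u_i by sigmoid(z_hat_i)."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, u):                            # u: (B, C, H, W)
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                       # squeeze: (B, C)
        z_hat = self.fc2(self.relu(self.fc1(z)))     # excitation: (B, C)
        scale = self.sigmoid(z_hat).view(b, c, 1, 1)
        return u * scale                             # channel-wise recalibration

# Usage: u_hat = ChannelSE(channels=64)(torch.randn(2, 64, 32, 32))  # shape preserved
```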
Similarly, for the spatial SE (sSE) block, each spatial location (i, j) of U is rescaled by a learned activation σ(q_{i,j}), giving

\hat{U}_{sSE} = F_{sSE}(U) = [\sigma(q_{1,1}) u_{1,1}, \ldots, \sigma(q_{i,j}) u_{i,j}, \ldots, \sigma(q_{H,W}) u_{H,W}].   (3)

Each value σ(q_{i,j}) corresponds to the relative importance of the spatial location (i, j) of a given feature map. This recalibration provides more importance to
relevant spatial locations and ignores irrelevant ones. The architectural flow is
shown in Fig. 1(c).
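The corresponding sSE recalibration of Eq. (3) can be sketched as follows, assuming the map q is produced by a 1×1 convolution that squeezes the C channels into a single channel:

```python
import torch
import torch.nn as nn

class SpatialSE(nn.Module):
    """Spatial squeeze & excitation (sSE), following Eq. (3): a 1x1 convolution
    squeezes the channels to a single map q, and each spatial location u_{i,j}
    is rescaled by sigmoid(q_{i,j})."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # channel squeeze
        self.sigmoid = nn.Sigmoid()

    def forward(self, u):              # u: (B, C, H, W)
        q = self.conv(u)               # (B, 1, H, W)
        return u * self.sigmoid(q)     # broadcast over channels
```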
3 Experimental Results
In this section, we conduct extensive experiments to explore the impact of our proposed modules. We chose three state-of-the-art F-CNN architectures: U-Net [3], SD-Net [4] and Fully Convolutional DenseNet [9]. All of the networks
have an encoder/decoder based architecture. The encoding and decoding paths
consist of repeating blocks separated by down-sampling and up-sampling, respec-
tively. We insert (i) channel-wise SE (cSE) blocks, (ii) spatial SE (sSE) blocks
and (iii) concurrent spatial and channel-wise SE (scSE) blocks after every en-
coder and decoder block of the F-CNN architecture and compare against its
vanilla version.
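For reference, a sketch of the scSE variant used in these experiments is shown below; it simply aggregates the outputs of the cSE and sSE sketches above, and the element-wise addition used for the aggregation is an assumption of this sketch:

```python
import torch.nn as nn

class ConcurrentSE(nn.Module):
    """Concurrent spatial and channel SE (scSE), sketched as an element-wise
    aggregation of the two recalibrated maps U_hat_cSE and U_hat_sSE.
    The aggregation rule (addition) is an assumption of this sketch."""
    def __init__(self, cse: nn.Module, sse: nn.Module):
        super().__init__()
        self.cse = cse  # e.g. the ChannelSE sketch above
        self.sse = sse  # e.g. the SpatialSE sketch above

    def forward(self, u):
        return self.cse(u) + self.sse(u)

# Usage: scse = ConcurrentSE(ChannelSE(64), SpatialSE(64))
```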
Datasets: We use two datasets in our experiments. (i) First, we tackle the task of segmenting MRI T1 brain scans into 27 cortical and sub-cortical structures. We use the Multi-Atlas Labelling Challenge (MALC) dataset [11], which is a part of OASIS [12], with 15 scans for training and 15 scans for testing, consistent with the challenge instructions. The main challenges associated with this dataset are the limited training data and the severe class imbalance between the target structures. Manual segmentations for MALC were provided by Neuromorphometrics, Inc. (http://Neuromorphometrics.com/). (ii) Second, we tackle the task of segmenting 10 organs on whole-body
contrast enhanced CT (ceCT) scans. We use data from the Visceral dataset [13].
We train on 65 scans from the silver corpus, and test on 20 scans with manual
annotations from the gold corpus. The silver corpus was automatically labeled
by fusing the results of multiple algorithms, yielding noisy labels. The main challenges associated with whole-body segmentation are the highly variable shapes of the visceral organs and the need to generalize when trained with noisy labels. We use the Dice score for performance evaluation.
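For completeness, the per-structure Dice score used for evaluation can be computed as in the following NumPy sketch (our illustration, not the evaluation code used in the experiments):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, label: int) -> float:
    """Dice overlap for one structure: 2*|P & T| / (|P| + |T|)."""
    p = (pred == label)
    t = (target == label)
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # structure absent in both prediction and ground truth
    return 2.0 * np.logical_and(p, t).sum() / denom
```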
Model Learning: In our experiments, all of the three F-CNN architectures had
4 encoder blocks, one bottleneck layer, 4 decoder blocks and a classification layer
at the end. The logistic loss function was weighted with median frequency bal-
ancing [8] to compensate for the class imbalance. The learning rate was initially
set to 0.01 and decreased by one order of magnitude every 10 epochs. The momentum was set to 0.95, the weight decay constant to 10⁻⁴, and the mini-batch size to 4. Optimization was performed using stochastic gradient descent. Training was continued until the validation loss converged. All experiments were conducted on an NVIDIA Titan Xp GPU with 12 GB RAM.
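The training setup described above can be sketched in PyTorch as follows; the toy model and random label array are placeholders, the weighted cross-entropy stands in for the weighted logistic loss, and treating background as an additional 28th class is an assumption of this sketch:

```python
import numpy as np
import torch
import torch.nn as nn

def median_frequency_weights(labels: np.ndarray, num_classes: int) -> torch.Tensor:
    """Median frequency balancing [8]: w_c = median(f) / f_c,
    where f_c is the voxel frequency of class c."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    freq[freq == 0] = np.inf  # absent classes receive zero weight
    w = np.median(freq[np.isfinite(freq)]) / freq
    return torch.tensor(w, dtype=torch.float32)

# Toy stand-ins for the F-CNN and its training labels; hyperparameters follow the text.
model = nn.Conv2d(1, 28, kernel_size=1)            # 27 structures + background (assumption)
labels = np.random.randint(0, 28, size=(4, 64, 64))
criterion = nn.CrossEntropyLoss(weight=median_frequency_weights(labels, 28))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # lr / 10 every 10 epochs
```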
Quantitative Results: Table 1 lists the mean Dice score on test data for both
datasets. Results of the standard networks together with the addition of cSE,
sSE and scSE blocks are reported. Comparing along the columns, we observe
that inclusion of any SE block consistently provides a statistically significant
(p ≤ 0.001, Wilcoxon signed-rank) increase in Dice score in comparison to the
normal version for all networks, in both applications. We further observe that
the spatial excitation yields a higher increase than the channel-wise excitation, which confirms our hypothesis that spatial excitation is more important than channel excitation for segmentation. Concurrent spatial and channel-wise SE yields the overall highest performance, with an increase of 4-8% Dice for brain segmentation and 2-3% Dice for whole-body segmentation compared to the standard networks. Particularly for brain segmentation, the performance increase is striking, given the limited increase in model complexity.
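The significance test reported above corresponds to a paired Wilcoxon signed-rank test on per-scan Dice scores, as in the following sketch (the Dice values shown here are hypothetical placeholders, not the paper's results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-scan Dice scores for the vanilla network and its SE-augmented
# counterpart, evaluated on the same test scans (values are illustrative).
dice_vanilla = np.array([0.81, 0.79, 0.84, 0.77, 0.80, 0.83, 0.78, 0.82])
dice_scse    = np.array([0.86, 0.85, 0.88, 0.84, 0.85, 0.87, 0.83, 0.88])

stat, p_value = wilcoxon(dice_scse, dice_vanilla)  # paired, non-parametric test
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```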
Table 1: Mean and standard deviation of the global Dice scores for the different F-
CNN models without and with cSE, sSE and scSE blocks on both datasets.
MALC Dataset
Networks No SE Block + cSE Block + sSE Block + scSE Block
DenseNets [9] 0.842 ± 0.058 0.865 ± 0.069 0.876 ± 0.061 0.882 ± 0.063
SD-Net [4] 0.771 ± 0.150 0.790 ± 0.120 0.860 ± 0.062 0.862 ± 0.082
U-Net [3] 0.763 ± 0.110 0.825 ± 0.063 0.837 ± 0.058 0.843 ± 0.062
Visceral Dataset
Networks No SE Block + cSE Block + sSE Block + scSE Block
DenseNets [9] 0.892 ± 0.068 0.903 ± 0.058 0.912 ± 0.056 0.918 ± 0.051
SD-Net [4] 0.871 ± 0.064 0.892 ± 0.065 0.901 ± 0.057 0.907 ± 0.057
U-Net [3] 0.857 ± 0.106 0.865 ± 0.086 0.872 ± 0.080 0.881 ± 0.082
Fig. 2: Boxplot of Dice scores for all brain structures on the left hemisphere (due to
space constraints), using DenseNets on MALC dataset, without and with proposed cSE,
sSE, scSE blocks. Grey and white matter are abbreviated as GM and WM, respectively.
segmentation improves with the inclusion of the scSE block (Fig. 4(d)). For whole-body ceCT, we point to the spleen, which is over-segmented using DenseNet (Fig. 4(g)) and which is rectified by adding the scSE block (Fig. 4(h)).
4 Conclusion
We proposed the integration of squeeze & excitation blocks within F-CNNs for
image segmentation. Further, we introduced the spatial squeeze & excitation,
which outperforms the previously proposed channel-wise squeeze & excitation.
We demonstrated that SE blocks yield a consistent improvement for three differ-
ent F-CNN architectures and for two different segmentation applications. Hence,
recalibration with SE blocks seems to be a fairly generic concept to boost per-
formance in CNNs. Strikingly, the substantial increase in segmentation accuracy
comes with a negligible increase in model complexity. With the seamless integra-
tion, we believe that squeeze & excitation can be a crucial component for neural
networks in many medical applications.
Acknowledgement: We thank the Bavarian State Ministry of Education, Sci-
ence and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B)
for funding and NVIDIA corporation for GPU donation.
References
1. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105.
2. He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In CVPR, pp. 770-778, IEEE.
3. Ronneberger, O., Fischer, P. and Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pp. 234-241, Springer.
4. Roy, A.G., Conjeti, S., Sheet, D., Katouzian, A., Navab, N. and Wachinger, C., 2017. Error corrective boosting for learning fully convolutional networks with limited data. In MICCAI, pp. 231-239, Springer.
5. Roy, A.G., Conjeti, S., Navab, N. and Wachinger, C., 2018. QuickNAT: Segmenting MRI neuroanatomy in 20 seconds. arXiv:1801.04161.
6. Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431-3440, IEEE.