1 Introduction

Colorectal cancer is the third leading cause of cancer-related deaths. It is estimated that the number of new colorectal cancer cases in the US will reach 150,000 in 2019 [1]. Colorectal polyps are believed to be one of the early indicators of colorectal cancer. Therefore, regular screening for colorectal polyps is crucial, and automatic polyp segmentation is considered an indispensable component of such screening: it helps clinicians locate polyp areas for further diagnosis.

In recent years, numerous deep learning based methods have been developed for the polyp segmentation problem [2, 3, 10, 12, 13]. The Fully Convolutional Network (FCN) [5] is a commonly used architecture, which replaces the fully connected layers of traditional Convolutional Neural Networks (CNNs) with convolutional layers, thus preserving the spatial information needed for segmentation. Brandao et al. [3] adopted the FCN with a pre-trained VGG model to identify and segment polyps in colonoscopy images. Akbari et al. [2] adopted a modified version of FCN, i.e., FCN8s, to further improve polyp segmentation accuracy. Inspired by FCN, UNet [10], a more powerful yet concise network, was then proposed; it also adopts an encoder-decoder architecture but additionally adds parallel skip concatenations between the encoder and decoder. Along this line, SegNet [12] passed the pooling indices from the encoder to the decoder, and UNet++ [13] developed a densely connected encoder-decoder network with deep supervision to further enhance polyp segmentation performance.

Although these networks achieve high performance, there still exist some defects in their architectures. One problem is that, unlike UNet++ [13], the FCN [5], UNet [10], and SegNet [12] do not consider multi-scale features. For instance, the skip concatenations in UNet connect encoder layers and decoder layers in a parallel manner, which can only aggregate image features of the same scale. In contrast, UNet++ builds dense connections to integrate more features at different scales, but this significantly decreases training and inference efficiency. Besides, all these networks adopt a fixed kernel size at each layer, which restricts the network from exploring and leveraging representational features obtained from different receptive fields. As a matter of fact, fusing features from different scales and multiple receptive fields can remarkably enlarge the perception dimensions, thus helping the network learn more discriminative representations of the input images. In this work, one of our contributions is to optimize the skip concatenations and enrich the diversity of receptive fields at each layer, namely the selective feature aggregation.

Another problem is that existing methods neglect the area-boundary constraints, a key factor in improving segmentation performance. Murugesan et al. [7, 8] discussed this issue and proposed to utilize area and boundary information simultaneously in polyp segmentation, but the relationship between area and boundary was not explored further. In fact, the dependency between areas and boundaries is crucial and can enhance area prediction. Specifically, if the predicted area is larger than the ground truth, boundary information can constrain the over-extended regions, while if the predicted area is much smaller than the ground truth, boundary information can enlarge the predicted regions. We leverage this relationship to improve polyp segmentation performance via a boundary-sensitive loss function.

In this paper, we propose a novel selective feature aggregation network with a boundary-sensitive loss. Our contributions are threefold. (1) We develop a new encoder-decoder network in which multi-scale features of the encoder are selected and concatenated with two decoders, i.e., the area branch and the boundary branch. Moreover, a Selective Kernel Module (SKM) is embedded into the convolutional layers to dynamically learn features from kernels of different sizes, i.e., \(3\times 3\), \(5\times 5\), and \(7\times 7\). (2) A boundary-sensitive loss is proposed to leverage the area-boundary constraints, with which the two branches can be mutually improved to produce more accurate predictions. (3) We validate the effectiveness of our method on the EndoScene dataset and achieve state-of-the-art results.

2 Method

2.1 Network Architecture

In this section, we first present the network architecture, as shown in Fig. 1, then introduce two strategies to select and aggregate polyp features at different scales and receptive fields. Finally, we propose a new boundary-sensitive loss through which the area and boundary branches can reciprocally influence each other, thus generating more accurate predictions. Our network is composed of a shared encoder, an area branch, and a boundary branch. Each branch contains four convolutional modules, and each module contains three layers integrated with SKMs. On top of the area branch, a light-weight UNet is adopted to help detect the boundaries of the predicted areas. These boundaries and the contours derived from the boundary branch are used to formulate the boundary-sensitive loss.

Fig. 1. Illustration of the selective feature aggregation network with area-boundary constraints. Numbers in each block represent the number of feature channels. (Color figure online)

2.2 Selective Feature Aggregation

Up-Concatenations. Inspired by UNet [10] and BESNet [9], we adopt a shared encoder and two decoder branches for polyp segmentation, as shown in Fig. 1. The shared encoder (middle solid blocks) utilizes five convolutional modules to learn feature representations of the input images. The area branch (left striped blocks) is built for area localization, while the boundary branch (right hollow blocks) is built for boundary recovery. Different from the architectures in [9, 11, 12], where skip concatenations exist only in a parallel manner (black dashed lines), three extra up-concatenations are added to both the area and boundary branches (red arrow lines) in our method. In this way, multi-scale feature maps derived from deep layers of the shared encoder are copied to shallow layers of the decoders, which enriches the feature representations while keeping training and inference more efficient than [13].
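To make the idea concrete, a minimal PyTorch sketch of a single up-concatenation is given below; the channel counts and spatial sizes are illustrative assumptions, not the exact configuration of Fig. 1.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature maps: a deep encoder output and a shallow decoder input.
deep_feat = torch.randn(1, 256, 16, 16)    # from a deep encoder module
shallow_feat = torch.randn(1, 64, 64, 64)  # at a shallow decoder module

# Up-concatenation: upsample the deep features to the shallow resolution, then
# concatenate along the channel axis so the decoder sees both scales at once.
up = F.interpolate(deep_feat, size=shallow_feat.shape[2:],
                   mode='bilinear', align_corners=False)
fused = torch.cat([up, shallow_feat], dim=1)  # shape: (1, 256 + 64, 64, 64)
```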

SKM. Traditionally, most convolutional kernels in CNNs have a fixed size, e.g., \(3\times 3\), which cannot simultaneously capture features from other receptive fields. To tackle this problem, we adopt the SKM [4], which can dynamically aggregate features obtained from kernels of different sizes; to our knowledge, this is the first study to embed SKMs into encoder-decoder networks. As shown in Fig. 2, an input feature map \(\mathbf X \in \mathbb {R}^{C \times W \times H}\) is first filtered by three kernels in parallel, each followed by Batch Normalization and a ReLU activation, yielding three distinct feature maps \(X_{3}\), \(X_{5}\), \(X_{7}\). To regress the weight vectors, we first perform element-wise summation of the three feature maps, \(\widetilde{X} = \sum X_{k}\ (k \in \{3,5,7\})\), then process \(\widetilde{X}\) according to the following procedure.

$$\begin{aligned} f = \mathcal {F}_{fc}\Big (\frac{1}{W \times H}\sum \nolimits _{i=1}^{W}\sum \nolimits _{j=1}^{H} \widetilde{X}(i,j)\Big ) \end{aligned}$$
(1)
$$\begin{aligned} m_{k} = e^{W_{k}f}\Big /\sum \nolimits _{k' \in \{3,5,7\}} e^{W_{k'}f} \quad (k \in \{3,5,7\}) \end{aligned}$$
(2)

where \(\mathcal {F}_{fc}\) is a fully connected operation, \(f \in \mathbb {R}^{C}\) denotes the adaptive features, and \(W_{k}\) are the learnable parameters used to compute the weight vector \(m_{k}\) corresponding to \(X_{k}\). With the weight vectors, we can adaptively aggregate the feature maps: \(\widehat{X} = \sum m_{k}X_{k},\ k \in \{3,5,7\}\).
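For reference, the following is a compact PyTorch sketch of an SKM following Eqs. (1)-(2); the reduction ratio `r` of the fully connected layer is our assumption, as the exact dimensionality is not specified above.

```python
import torch
import torch.nn as nn

class SKM(nn.Module):
    """Selective Kernel Module: fuses 3x3, 5x5, 7x7 branches with learned weights."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ) for k in (3, 5, 7)
        ])
        self.fc = nn.Linear(channels, channels // r)          # F_fc in Eq. (1)
        self.weight = nn.Linear(channels // r, 3 * channels)  # W_k in Eq. (2)

    def forward(self, x):
        xs = [branch(x) for branch in self.branches]   # X_3, X_5, X_7
        s = xs[0] + xs[1] + xs[2]                      # element-wise sum
        f = self.fc(s.mean(dim=(2, 3)))                # global avg pool + fc, Eq. (1)
        m = self.weight(f).view(-1, 3, x.size(1))      # one logit vector per branch
        m = torch.softmax(m, dim=1)                    # weight vectors m_k, Eq. (2)
        return sum(m[:, k, :, None, None] * xs[k] for k in range(3))
```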

Fig. 2. Selective Kernel Module (SKM). sq: squeeze; fc: fully connected; ex: excitation.

2.3 Boundary-Sensitive Loss

To make full use of the area-boundary constraints, we propose a joint loss function that mutually propagates information between the area branch and the boundary branch. As shown in Fig. 1, the loss function is composed of three parts: an area loss \(L_{a}\), a boundary loss \(L_{b}\), and the area-boundary constraint losses, i.e., \(L_{C1}\) and \(L_{C2}\).

Area Loss. \(L_{a}\) consists of a binary cross-entropy loss and a dice loss [6], and is formulated as follows.

$$\begin{aligned} L_{a} = -\sum _{i}z_{i}\log (m_{i}) + \Big (1 - \frac{2\sum _{i}m_{i} z_{i} + \varepsilon }{\sum _{i}m_{i} + \sum _{i}z_{i} + \varepsilon }\Big ) \end{aligned}$$
(3)

where \(m_{i}\) indicates the probability of pixel \(i\) belonging to the polyp class, \(z_{i} \in \{0, 1\}\) is the corresponding area ground truth label, and \(\varepsilon \) is a small positive number used to improve numerical stability. The cross-entropy term penalizes pixel classification errors, while the dice term measures the overlap between the predicted polyp areas and the area ground truth, which helps handle the foreground-background imbalance problem.
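A direct PyTorch sketch of Eq. (3) follows; we write the cross-entropy in its full two-term binary form (Eq. (3) shows only the positive-class term), and `eps` plays the role of \(\varepsilon \).

```python
import torch

def area_loss(m, z, eps=1e-6):
    """Binary cross-entropy plus dice loss, following Eq. (3).

    m: predicted polyp probabilities in [0, 1]; z: binary ground truth labels.
    """
    m, z = m.flatten(), z.flatten()
    bce = -(z * torch.log(m + eps) + (1 - z) * torch.log(1 - m + eps)).sum()
    dice = 1 - (2 * (m * z).sum() + eps) / (m.sum() + z.sum() + eps)
    return bce + dice
```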

Boundary Loss. \(L_{b}\) measures the difference between the outputs of the boundary branch and the boundary ground truth labels, and is represented as

$$\begin{aligned} L_{b} = -\sum _{i}y_{i}\log (p_{i}) \end{aligned}$$
(4)

where \(p_{i}\) denotes the probability of pixel \(i\) lying on the polyp boundary and \(y_{i} \in \{0, 1\}\) is the corresponding boundary ground truth label.

Area-Boundary Constraint Loss. The constraint losses aim to model the dependency between areas and boundaries. To achieve this, we utilize a two-layer UNet to extract the boundaries of the predicted polyp areas; this light-weight UNet acts as a differentiable edge detector. The constraint loss is composed of two parts: \(L_{C1}\) minimizes the difference between the edge detector results and the boundary ground truth, and \(L_{C2}\) minimizes the difference between the edge detector results and the outputs of the boundary branch. The two constraint losses are represented as

$$\begin{aligned} L_{C1} = -\sum _{i}y_{i}\log (q_{i}) \end{aligned}$$
(5)
$$\begin{aligned} L_{C2} = D_{KL}(P||Q) + D_{KL}(Q||P) = -\sum _{i}p_{i}\log \Big (\frac{q_{i}}{p_{i}}\Big ) - \sum _{i}q_{i}\log \Big (\frac{p_{i}}{q_{i}}\Big ) \end{aligned}$$
(6)

where \(q_{i}\) is the result predicted by the edge detector, i.e., the light-weight UNet, \(y_{i}\) denotes the boundary ground truth, and \(p_{i}\) indicates the output of the boundary branch. \(D_{KL}\) denotes the Kullback-Leibler divergence, which measures the distance between two distributions. Minimizing \(D_{KL}\) thus makes the final outputs of the area and boundary branches closer. Intuitively, \(L_{C2}\) keeps the area and boundary branches internally consistent, preventing the two branches from deviating from each other too much. Compared with \(L_{C1}\), which uses boundary supervision explicitly, \(L_{C2}\) imposes the area-boundary constraints implicitly. Owing to the consistent training goal of the two branches, the shared encoder can learn more discriminative features, which helps increase the final segmentation performance. The final loss function is as follows.

$$\begin{aligned} L_{total} = w_{a}L_{a} + w_{b}L_{b} + w_{C1}L_{C1} + w_{C2}L_{C2} \end{aligned}$$
(7)

where \(w_{a}\), \(w_{b}\), and \(w_{C1}\) are set to 1, while \(w_{C2}\) is set to 0.5, based on empirical studies.
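Under the same assumptions as the earlier `area_loss` sketch (which this code reuses), the remaining terms of Eq. (7) can be sketched as follows; treating \(p_{i}\) and \(q_{i}\) as per-pixel probabilities in the KL terms is our reading of Eq. (6).

```python
import torch

def boundary_bce(pred, target, eps=1e-6):
    # Shared form of L_b (Eq. (4)) and L_C1 (Eq. (5)):
    # cross-entropy against the boundary ground truth.
    return -(target * torch.log(pred + eps)).sum()

def symmetric_kl(p, q, eps=1e-6):
    # L_C2 = D_KL(P||Q) + D_KL(Q||P), Eq. (6).
    return (p * torch.log((p + eps) / (q + eps))).sum() + \
           (q * torch.log((q + eps) / (p + eps))).sum()

def total_loss(m, z, p, y, q, w=(1.0, 1.0, 1.0, 0.5)):
    # Eq. (7): m/z are area predictions/labels, p the boundary-branch output,
    # y the boundary ground truth, q the edge-detector output.
    w_a, w_b, w_c1, w_c2 = w
    return (w_a * area_loss(m, z) + w_b * boundary_bce(p, y)
            + w_c1 * boundary_bce(q, y) + w_c2 * symmetric_kl(p, q))
```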

3 Experimental Results

3.1 Dataset and Evaluation Metrics

In this work, we adopt the EndoScene dataset [11], which contains 912 images, each with at least one polyp, acquired from 44 video sequences of 36 subjects. The dataset is divided into a training set, a validation set, and a test set, and each sequence is uniquely included in one of the three sets. The area ground truth is provided, while we derive the boundary ground truth ourselves: the area ground truth is scanned with a \(5\times 5\) sliding window, and whenever the labels covered by the window are not identical, i.e., 0 and 1 exist simultaneously, the corresponding pixels are regarded as boundary points.
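Under our reading of this procedure (a pixel is a boundary point when its \(5\times 5\) neighborhood contains both foreground and background labels), the derivation can be sketched with max-pooling used as morphological dilation and erosion:

```python
import torch
import torch.nn.functional as F

def area_to_boundary(area_gt):
    """Mark pixels whose 5x5 neighborhood contains both 0s and 1s.

    area_gt: float tensor of shape (B, 1, H, W) with values in {0, 1}.
    """
    dilated = F.max_pool2d(area_gt, kernel_size=5, stride=1, padding=2)   # local max
    eroded = -F.max_pool2d(-area_gt, kernel_size=5, stride=1, padding=2)  # local min
    return dilated - eroded  # 1 where the neighborhood is mixed, 0 elsewhere
```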

The evaluation metrics we adopt are Recall, Specificity, Precision, Dice Score, Intersection-over-Union for Polyp (\(IoU_P\)), IoU for Background (\(IoU_B\)), Mean IoU (mIoU), and Accuracy. In all experiments, the batch size is set to 4. The initial learning rate is 0.01 and is decreased by a factor of 10 after 20 and 100 epochs, respectively. The SGD optimizer is used with a weight decay of 0.0005 and a momentum of 0.9. The models are implemented in the PyTorch framework and trained on a workstation with an Intel Core(TM) i7-9700K@3.60 GHz processor and an NVIDIA GeForce RTX 2080 Ti (11 GB) GPU.
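In PyTorch, this training configuration corresponds to roughly the following; `model` is a stand-in placeholder for the actual network.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)  # placeholder for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Decrease the learning rate by a factor of 10 after 20 and 100 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[20, 100], gamma=0.1)
```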

3.2 Comparative Experiments

The detailed experimental results are presented in Table 1, in which we list the performance of four state-of-the-art methods (rows 1–4) and our proposed method (row 5). To further analyze the influence of each component of our method, we conduct several comparative experiments, whose results are reported in rows 6–10. All results are evaluated on the test set, using the checkpoint that achieves the highest Dice Score on the validation set.

Fig. 3. Polyp segmentation results of different methods.

Table 1. Comparison with different baselines and other state-of-the-art methods. ‘UNet’: the typical UNet with area branch; ‘Up’: up-concatenations; ‘SKM’: selective kernel module; ‘bd’: two-branch model with \(L_{a}\) and \(L_{b}\); ‘\(L_{C1}\)’: the first constraint loss; ‘Ours’: the two-branch model with the total loss, i.e., \(L_{a}\), \(L_{b}\), \(L_{C1}\), and \(L_{C2}\).

As shown in Table 1 and Fig. 3, our method outperforms the four state-of-the-art methods on all evaluation metrics by a significant margin. The segmentation performance of SegNet [12] and UNet++ [13] is superior to that of FCN8 [3] and FCN8s [2], which indicates the importance of skip concatenations for accuracy. However, increasing the density of concatenations does not guarantee further improvement in segmentation performance: we have attempted numerous combinations of concatenations and found that models with complicated concatenations are hard to train and usually cannot give competitive results. Compared with our up-concatenation model, UNet++ [13], which adopts even more complicated concatenations, shows a significant drop in performance.

Then, we evaluate the contribution of the up-concatenation and SKM components, denoted as UNet+Up and UNet+Up+SKM in Table 1. The results show that UNet with up-concatenations achieves better performance than UNet alone on all evaluation criteria, indicating that up-concatenations can effectively transmit multi-scale features from the encoder to the decoder. The SKM component is also verified to be effective in improving segmentation performance, especially Precision and \(IoU_P\), which increase by more than 1.5%. This improvement comes from allowing the network to dynamically aggregate features derived from kernels of different sizes, which is a more appropriate strategy than manually specifying the kernel size.

Further, we evaluate the effectiveness of the boundary-sensitive loss. Comparing ‘UNet+Up+SKM+bd’ with ‘UNet+Up+SKM’ in Table 1, it is apparent that integrating the boundary branch considerably improves segmentation accuracy. In particular, Recall, Precision, Dice, and \(IoU_{P}\) show an increment of more than 1% over the model that contains only the area branch. This indicates that boundary information can improve the representation capability of the shared encoder via backpropagation. According to the last two rows and ‘Ours’ in Table 1, the area-boundary constraint losses also play important roles in improving segmentation performance. \(L_{C1}\) minimizes the difference between the edge detector results generated by the light-weight UNet and the boundary ground truth; since the edge detector results are built upon the area branch, \(L_{C1}\) can explicitly backpropagate the boundary supervision, thereby introducing boundary constraints to the area branch. \(L_{C2}\) minimizes the difference between the edge detector results and the outputs of the boundary branch, so that the two are mutually constrained, making the area and boundary branches internally consistent. It is worth noting that introducing either \(L_{C1}\) or \(L_{C2}\) improves segmentation accuracy compared with models without any constraint.

4 Conclusion

We propose a novel selective feature aggregation network with area and boundary constraints for polyp segmentation. Up-concatenations and SKMs are used to select multi-scale and multi-receptive-field representations of polyp images. Furthermore, a new loss is proposed to account for the dependency between the area and boundary branches, so that the two branches can reciprocally influence each other and produce more accurate predictions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods and enables more accurate polyp localization. The source code is available at https://github.com/Yuqi-cuhk/Polyp-Seg.