
1 Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many 2D medical image analysis tasks. The common idea behind these solutions is to use deep convolutional networks with many hidden layers, aiming at learning discriminative feature embeddings from raw data rather than relying on handcrafted feature extraction. In clinical practice, however, a large part of the available medical imaging data is in 3D. This has motivated the development of 3D CNNs that can benefit from more spatial context [1,2,3,4,5,6]. Due to the GPU memory restrictions caused by moving to fully 3D, state-of-the-art methods [1,2,3,4,5,6] depend on subvolume/patch processing, and the input patch is usually small unless specialized hardware with large GPU memory is used, which limits the amount of context information that can be incorporated. In this paper, we present a novel and efficient approach for effective segmentation of volumetric images by leveraging the context information in a large patch. Our contributions can be summarized as follows:

  • First, inspired by [7], we propose a novel 3D Tiled Convolution (3D-TC) for volumetric image processing. 3D-TC is implemented as a periodic down-shuffling operation followed by conventional 3D convolutions. 3D-TC has the advantage of significantly reducing the size of the data for subsequent processing while using all the information available in the input, irrespective of the down-shuffling factors. We apply 3D-TC directly to the input data, and its output serves as the input to subsequent Fully Convolutional Networks (FCNs). A full-resolution dense prediction is then generated at the final output by using the previously introduced Dense Upsampling Convolution (DUC) [8, 9], which consists of low-resolution convolutions followed by a periodic up-shuffling operation to jointly learn the feature extraction and upsampling weights.

  • Second, we extensively validate the proposed approach on two typical yet challenging volumetric image segmentation tasks, i.e., segmentation of hip bony structures from 3D T1 MR images with a limited field of view and segmentation of the pancreas from 3D CT images. We show that 3D-TC and DUC are network agnostic and can be combined with different FCNs for improved performance.

Fig. 1. A schematic view of 3D tiled convolutions, implemented as a periodic down-shuffling operation followed by low-resolution convolutions. Here the down-shuffling factor is (2, 2, 2).

2 Methods

2.1 3D Tiled Convolution

As shown in Fig. 1, 3D-TC consists of a periodic down-shuffling operator and low-resolution (LR) convolutions. 3D-TC is designed to be directly applicable to the input data with two goals in mind. First, unlike conventional convolutions, where the convolutional kernels are shared by all neurons in a particular layer, 3D-TC learns \( n_x \times n_y \times n_z \) separate kernels within the same layer, where \( n_x , n_y , n_z \) are the down-shuffling factors along the three spatial axes, respectively. Thus, irrespective of the down-shuffling factors, all data available in the input are used, in contrast to a simple down-sampling operation. Second, 3D-TC can significantly reduce the size of the data for subsequent processing. More specifically, let us assume that the size of the input data (\(I^{HR}\)) is \(( n_x \times d) \times ( n_y \times h) \times ( n_z \times w) \times C\) and the size of the output from 3D-TC is \(d \times h \times w \times k\), where C is the number of channels in the input data and k is the number of feature maps in the output of 3D-TC. Instead of applying convolutions to the high-resolution (HR) image, we first apply a periodic down-shuffling operator to the input data to get \(C \times ( n_x \times n_y \times n_z )\) channels of LR feature maps, and then apply convolutions with a kernel size of \(3 \times 3 \times 3\) to get the k feature maps of size \(d \times h \times w\). Mathematically, this can be described as:

$$\begin{aligned} TC(I^{HR}; W_1, b_1) = \phi (W_1 * PDS (I^{HR}) + b_1) \end{aligned}$$
(1)

where \(\phi \) is a non-linear activation function applied element-wise; \(W_1\) and \(b_1\) are the trainable weights and bias, respectively; and \( PDS \) is a periodic down-shuffling operator that rearranges a tensor (\(T_{HR}\)) of shape \(( n_x \times d) \times ( n_y \times h) \times ( n_z \times w) \times C\) into a tensor (\(T_{LR}\)) of shape \(d \times h \times w \times (C \times n_x \times n_y \times n_z )\). The operation \(T_{LR} = PDS (T_{HR})\) can be described as:

$$\begin{aligned} T_{LR}(x',y',z',c') = T_{HR}( {x' \times n_x}+\left\lfloor {mod(c', n_x \times C)/C}\right\rfloor , \\ {y' \times n_y}+\left\lfloor {mod(c',n_x \times n_y \times C)/(n_x \times C)}\right\rfloor , \\ {z' \times n_z}+\left\lfloor {c'/(n_x \times n_y \times C)}\right\rfloor , \\ mod(c', C) \ ) \end{aligned}$$
(2)

where \(x',y',z',c'\) are the coordinates of the voxels in the LR space, and \(x' \in [0,d-1], y' \in [0,h-1], z' \in [0, w-1], c' \in [0, C \times n_x \times n_y \times n_z -1] \).
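To make the index arithmetic of Eq. (2) concrete, below is a minimal NumPy sketch of the \( PDS \) operator; the function name and the reshape/transpose strategy are our illustrative choices, not necessarily how the operator is implemented in practice:

```python
import numpy as np

def periodic_down_shuffle(t, nx, ny, nz):
    """PDS of Eq. (2): (nx*d, ny*h, nz*w, C) -> (d, h, w, C*nx*ny*nz)."""
    X, Y, Z, C = t.shape
    d, h, w = X // nx, Y // ny, Z // nz
    # split each spatial axis into a coarse coordinate and a phase offset
    t = t.reshape(d, nx, h, ny, w, nz, C)
    # move the phase offsets next to the channel axis so that the flattened
    # channel index is c' = ((o_z*ny + o_y)*nx + o_x)*C + c, matching Eq. (2)
    t = t.transpose(0, 2, 4, 5, 3, 1, 6)  # (d, h, w, nz, ny, nx, C)
    return t.reshape(d, h, w, nx * ny * nz * C)
```

Note that no voxel is discarded: every element of the HR input reappears exactly once in one of the LR channels, which is what distinguishes PDS from a simple strided down-sampling.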

Fig. 2. A schematic view of how to augment existing FCNs with 3D-TC and DUC for volumetric image analysis.

Fig. 3. Qualitative comparison of the segmentation results of the 3D LP-U-net with different shuffling factors and the 3D U-net with the largest patch size.

2.2 3D-TC and DUC Augmented FCNs for Image Segmentation

Both DUC and 3D-TC are network agnostic and can be combined with existing FCNs such as the 3D U-net [1] for volumetric image analysis, as shown in Fig. 2, as long as the dimensions of the output from 3D-TC satisfy the input requirements of the subsequent FCN. The advantage of such a pipeline is apparent: when a 3D-TC with down-shuffling factors of \(( n_x , n_y , n_z )\) is applied to the input data, both the computational and the storage cost of the underlying network are reduced by a factor of \( n_x \times n_y \times n_z \), allowing one to use a large patch as input with reduced computational time. The full-resolution result is then obtained at the final output by applying a DUC with up-shuffling factors of \(( n_x , n_y , n_z )\). To differentiate from the original 3D U-net, we refer to the 3D U-net augmented with 3D-TC and DUC as the 3D large patch U-net (3D LP-U-net). Similarly, we derive the 3D LP-V-net and the LP-HighRes3DNet by augmenting the original 3D V-net [2] and HighRes3DNet [5] with 3D-TC and DUC. In this study, we take the original 3D U-net, 3D V-net and HighRes3DNet as the base networks to evaluate the performance of the associated networks augmented with 3D-TC and DUC. For all the studies, a combination of cross entropy loss with Dice loss as introduced in [2] is used.
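To make the pipeline in Fig. 2 concrete, the following TensorFlow sketch wraps an arbitrary base FCN with 3D-TC at the input and DUC at the output. The function names, the layer configuration and the DUC channel layout are our assumptions for illustration, not the authors' exact implementation:

```python
import tensorflow as tf

def down_shuffle(x, nx, ny, nz):
    """Periodic down-shuffling (Eq. 2): (B, nx*d, ny*h, nz*w, C) -> (B, d, h, w, nx*ny*nz*C)."""
    _, X, Y, Z, C = x.shape
    d, h, w = X // nx, Y // ny, Z // nz
    x = tf.reshape(x, [-1, d, nx, h, ny, w, nz, C])
    x = tf.transpose(x, [0, 1, 3, 5, 6, 4, 2, 7])  # (B, d, h, w, nz, ny, nx, C)
    return tf.reshape(x, [-1, d, h, w, nx * ny * nz * C])

def up_shuffle(x, nx, ny, nz):
    """Periodic up-shuffling used by DUC; the inverse of down_shuffle."""
    _, d, h, w, C = x.shape
    c = C // (nx * ny * nz)
    x = tf.reshape(x, [-1, d, h, w, nz, ny, nx, c])
    x = tf.transpose(x, [0, 1, 6, 2, 5, 3, 4, 7])  # (B, d, nx, h, ny, w, nz, c)
    return tf.reshape(x, [-1, d * nx, h * ny, w * nz, c])

def build_lp_net(base_fcn, input_shape, n=(4, 4, 2), k=64, num_classes=3):
    """Wrap a base FCN (any Keras model/callable that preserves spatial size)
    with 3D-TC at the input and DUC at the output, as in Fig. 2."""
    nx, ny, nz = n
    inp = tf.keras.Input(shape=input_shape)  # e.g. (400, 400, 80, 1)
    x = down_shuffle(inp, nx, ny, nz)        # PDS
    x = tf.keras.layers.Conv3D(k, 3, padding='same', activation='relu')(x)  # 3D-TC
    x = base_fcn(x)                          # e.g. a 3D U-net trunk
    x = tf.keras.layers.Conv3D(num_classes * nx * ny * nz, 3,
                               padding='same')(x)  # DUC low-resolution conv
    out = up_shuffle(x, nx, ny, nz)          # full-resolution logits
    return tf.keras.Model(inp, out)
```

With shuffling factors of (4, 4, 2), for example, a \(400 \times 400 \times 80\) input patch is reduced to \(100 \times 100 \times 40\) before it reaches the base network, which is where the memory and computation savings come from.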

2.3 Implementation Details

All methods were implemented in Python using the TensorFlow framework and were trained and tested on a desktop with a 3.6 GHz Intel(R) i7 CPU and an NVIDIA GTX 1080 Ti graphics card with 11 GB of GPU memory. We empirically fixed the number of output feature maps from the 3D-TC at \(k=64\).

3 Experiments and Results

In this section, we present experimental results of the proposed pipeline for volumetric image analysis. Two datasets were used in our study: an in-house dataset consisting of 25 T1 hip MR images with limited field of view, and a publicly available dataset from the National Institutes of Health (NIH) containing 82 abdominal contrast-enhanced 3D CT scans [10]. For all the experiments described below, we used the Dice Overlap Coefficient (DOC) and the Average Surface Distance (ASD) as evaluation metrics.
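For reference, a minimal sketch of how these two metrics can be computed with NumPy/SciPy is given below; since the exact surface extraction and ASD variant are not spelled out in the paper, the symmetric definition used here is an assumption:

```python
import numpy as np
from scipy import ndimage

def dice_overlap(pred, gt):
    """Dice Overlap Coefficient (DOC) between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def average_surface_distance(pred, gt, spacing):
    """Symmetric Average Surface Distance (ASD) in mm; `spacing` is the
    voxel size along each axis, e.g. (0.374, 0.363, 1.078)."""
    def surface(mask):
        m = mask.astype(bool)
        return m ^ ndimage.binary_erosion(m)
    sp, sg = surface(pred), surface(gt)
    # distance from every voxel to the nearest surface voxel of the other mask
    d_to_gt = ndimage.distance_transform_edt(~sg, sampling=spacing)
    d_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    return (d_to_gt[sp].sum() + d_to_pred[sg].sum()) / (sp.sum() + sg.sum())
```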

3.1 Ablation Study on Hip MR Images with Limited Field of View

Data and Augmentation. We used 25 3D T1 MR images acquired from patients with hip pain. All images were resampled to a uniform size of \(480 \times 480 \times 160\) voxels with an average voxel spacing of 0.374 mm \(\times \) 0.363 mm \(\times \) 1.078 mm. Slice-by-slice manual segmentation was used to create the reference ground truth. We randomly split the 25 datasets into two groups, with 20 datasets used for training and the remaining 5 for testing.

Table 1. Influence of different patch sizes on the performance of the original 3D U-net. Ace: the acetabulum; Femur: the proximal femur
Table 2. Results when different shuffling factors were used for the 3D LP-U-net. The size of the input patch is fixed at \(400 \times 400 \times 80\).

Ablation Study. We first investigated the influence of the patch size on the performance of the original 3D U-net [1]. The results are presented in Table 1. We observed that better performance was obtained when a larger patch size was used. Due to the GPU memory constraint, \(200 \times 200 \times 40\) was the maximum patch size that we could use.

We then examined the effect of different shuffling factors on the performance of the 3D LP-U-net when a fixed patch size of \(400 \times 400 \times 80\) was used. The results are reported in Table 2. From this table, we can see that (1) in general, the higher the shuffling factor, the less accurate the results, although the best results were achieved with a shuffling factor of (4, 4, 2); (2) even with a shuffling factor as high as (25, 25, 2), we still obtained sub-millimeter segmentation accuracy for both structures; and (3) compared with the results reported in Table 1, the 3D LP-U-net achieved better results than the original 3D U-net with the largest patch size whenever the shuffling factor was smaller than (16, 16, 2).

Figure 3 visually compares the segmentation results obtained by the 3D LP-U-net with a fixed patch size of \(400 \times 400 \times 80\) but different shuffling factors against those of the 3D U-net with the largest patch size. In this figure, we show the overall segmentation with the probability of each structure, as well as a close-up of the results around the hip joint. From this figure, we observe that (1) the 3D LP-U-net produced fewer false positive segmentations than the 3D U-net; and (2) for the 3D LP-U-net, the larger the shuffling factors, the higher the uncertainty around the boundary, with the best results obtained at shuffling factors of (4, 4, 2).

Finally, we show that 3D-TC and DUC are agnostic to the base networks. To demonstrate this, we compared the performance of the original 3D V-net and HighRes3DNet with their maximally allowed patch sizes against that of the 3D LP-V-net with a shuffling factor of (2, 2, 2) and the LP-HighRes3DNet with a shuffling factor of (4, 4, 1), both with a fixed patch size of \(400 \times 400 \times 80\). Note that, due to its high spatial resolution, the maximally allowed patch size for HighRes3DNet is \(200 \times 200 \times 20\), while for the 3D V-net it is \(200 \times 200 \times 40\). The results are presented in Table 3; the 3D LP-V-net and the LP-HighRes3DNet both achieved better results than their associated base networks.

Table 3. Results when different underlying FCNs were used.
Table 4. Segmentation accuracy of 3D LP-U-net and three state-of-the-art methods.

3.2 Validation on Hip MR Images

Using the same 25 T1 hip MR images with limited field of view as in the ablation study, we conducted a standard 5-fold cross-validation experiment, adopting the same data augmentation and training strategies as in the ablation study. For the 3D LP-U-net, we chose a fixed patch size of \(400 \times 400 \times 80\) and a fixed shuffling factor of (4, 4, 2). We compared the performance of the 3D LP-U-net with state-of-the-art methods, i.e., the 3D U-net [1], the 3D V-net [2], and HighRes3DNet [5]; the results are shown in Table 4. For the acetabulum, an average DOC of \(96.76\pm 0.92\)%, \(94.01\pm 2.80\)%, \(93.35\pm 3.21\)% and \(90.51\pm 7.32\)% was found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively; the 3D LP-U-net was significantly more accurate than the other three methods (\(p < 0.01\)), and the same significance was observed for ASD. For the proximal femur, an average DOC of \(98.14\pm 0.47\)%, \(96.89\pm 0.85\)%, \(96.47\pm 1.54\)% and \(89.99\pm 4.91\)%, and an average ASD of \(0.27\pm 0.09\) mm, \(0.96\pm 1.16\) mm, \(0.91\pm 0.87\) mm and \(1.79\pm 0.83\) mm, were found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively; again, the 3D LP-U-net was significantly more accurate than the other three methods (\(p < 0.01\)) when segmenting the proximal femur.

3.3 Validation on NIH Pancreas CT Dataset

We also validated our approach on the NIH pancreas CT dataset [10], which contains 82 contrast-enhanced abdominal CT volumes with manual annotations provided by an experienced radiologist. Following the training protocol of [10], we conducted a 4-fold cross validation, randomly splitting the 82 patients into testing folds of 22, 20, 20, and 20 cases, respectively. We implemented a two-stage pipeline consisting of a coarse stage and a fine stage. In the coarse stage, we first trained a deep segmentation network to locate the rough region of the pancreas in a whole CT volume. The goal of the fine stage is then to train another deep segmentation network to further refine the results. In both stages, we used the 3D LP-U-net as the segmentation network. During the training phase of the coarse stage, we chose a fixed patch size of \(480 \times 480 \times 64\) voxels and a fixed shuffling factor of (4, 4, 1). During the training phase of the fine stage, all training images were cropped by the bounding box computed from the associated ground truth segmentation plus a padding of 20 voxels along all three spatial axes, and all cropped images were then resampled to a fixed size of \(196 \times 128 \times 128\) voxels. For the 3D LP-U-net in the fine stage, we chose a fixed patch size of \(176 \times 112 \times 96\) voxels and a fixed shuffling factor of (2, 2, 1).
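As a small illustration of the fine-stage data preparation, the hypothetical helper below crops a volume to the ground-truth bounding box with the 20-voxel padding described above; the subsequent resampling to \(196 \times 128 \times 128\) voxels (e.g. with scipy.ndimage.zoom) is omitted:

```python
import numpy as np

def crop_to_bbox(image, mask, pad=20):
    """Crop `image` to the bounding box of the binary `mask`, enlarged by
    `pad` voxels along every spatial axis and clipped to the volume bounds."""
    coords = np.argwhere(mask > 0)
    lo = np.maximum(coords.min(axis=0) - pad, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + pad, np.array(mask.shape))
    return image[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
```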

Table 5. Accuracy (DOC, %) comparison between the 3D LP-U-net and state-of-the-art methods on the NIH pancreas segmentation dataset.

Table 5 shows the accuracy comparison between our approach and previous state-of-the-art methods on the NIH pancreas CT dataset. Our approach achieved an average DOC comparable to the state-of-the-art methods. Although the average and maximum DOC achieved by our approach are slightly worse than those achieved by [6], which is based on a complicated recurrent saliency transformation network with careful parameter fine-tuning, our approach achieved a much better result in the worst case (68.39% by our approach vs. 62.81% by the previous state-of-the-art), indicating the reliability of our approach for clinical applications.

4 Conclusion

We proposed a simple yet effective 3D tiled convolution for 3D medical image analysis tasks. 3D-TC consists of a periodic down-shuffling operation followed by low-resolution 3D convolutions. It can be applied directly to the input data and has the advantage of significantly reducing the size of the data for subsequent processing while using all the information available in the input, irrespective of the down-shuffling factors. To achieve volumetric dense prediction at the output, we used the previously introduced dense upsampling convolution. We showed that 3D-TC and DUC are network agnostic and can be combined with different FCNs for improved performance. Experimental results demonstrated the effectiveness of our framework on different semantic segmentation tasks. In future work, we will apply 3D-TC to other volumetric image processing tasks such as image synthesis and image super-resolution.