
1 Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many 2D medical image analysis tasks. The common idea behind these solutions is to use deep convolutional networks with many hidden layers, aiming at learning discriminative feature embeddings from raw data rather than relying on handcrafted feature extraction. In clinical practice, however, a large part of the available medical imaging data is in 3D. This has motivated the development of 3D CNNs that can benefit from more spatial context [1,2,3,4,5,6]. Due to the GPU memory restrictions caused by moving to fully 3D, state-of-the-art methods [1,2,3,4,5,6] depend on subvolume/patch processing, and the input patch is usually small unless specialized hardware with large GPU memory is used, which limits the amount of context information that can be incorporated. In this paper, we present a novel and efficient approach for effective segmentation of volumetric images by leveraging the context information in a large patch. Our contributions can be summarized as follows:

  • First, inspired by [7], we propose a novel 3D Tiled Convolution (3D-TC) for volumetric image processing. 3D-TC is implemented as a periodic down-shuffling operation followed by conventional 3D convolutions. 3D-TC has the advantage of significantly reducing the size of the data for subsequent processing while using all the information available in the input, irrespective of the down-shuffling factors. We apply 3D-TC directly to the input data, and its output serves as the input to subsequent Fully Convolutional Networks (FCNs). A full-resolution dense prediction is then generated at the final output by using the previously introduced Dense Upsampling Convolution (DUC) [8, 9], which consists of low-resolution convolutions followed by a periodic up-shuffling operation to jointly learn the feature extraction and upsampling weights.

  • Second, we extensively validate the proposed approach on two typical yet challenging volumetric image segmentation tasks, i.e., segmentation of hip bony structures from 3D T1 MR images with a limited field of view and segmentation of the pancreas from 3D CT images. We show that 3D-TC and DUC are network agnostic and can be combined with different FCNs for improved performance.

Fig. 1. A schematic view of 3D tiled convolutions, implemented as a periodic down-shuffling operation followed by low-resolution convolutions. Here the down-shuffling factor is (2, 2, 2).

2 Methods

2.1 3D Tiled Convolution

As shown in Fig. 1, 3D-TC consists of a periodic down-shuffling operator and low-resolution (LR) convolutions. 3D-TC is designed to be directly applicable to the input data with two goals in mind. First, unlike conventional convolutions, where the convolutional kernels are shared by all neurons in a particular layer, 3D-TC learns \( n_x \times n_y \times n_z \) separate kernels within the same layer, where \( n_x , n_y , n_z \) are the down-shuffling factors along the three spatial axes, respectively. Thus, irrespective of the down-shuffling factors, all data available in the input are used, in contrast to a simple down-sampling operation. Second, 3D-TC can significantly reduce the size of the data for subsequent processing. More specifically, let us assume that the size of the input data (\(I^{HR}\)) is \(( n_x \times d) \times ( n_y \times h) \times ( n_z \times w) \times C\) and the size of the output from 3D-TC is \(d \times h \times w \times k\), where C is the number of channels in the input data and k is the number of feature maps in the output of 3D-TC. Instead of applying convolutions to the high-resolution (HR) image, we first apply a periodic down-shuffling operator to the input data to get \(C \times ( n_x \times n_y \times n_z )\) channels of LR feature maps, and then apply convolutions with a kernel size of \(3 \times 3 \times 3\) to get the k feature maps of size \(d \times h \times w\). Mathematically, this can be described as:

$$\begin{aligned} TC(I^{HR}; W_1, b_1) = \phi (W_1 * PDS (I^{HR}) + b_1) \end{aligned}$$
(1)

where \(\phi \) is a non-linear activation function applied element-wise; \(W_1\) and \(b_1\) are the trainable weights and bias, respectively; and \( PDS \) is a periodic down-shuffling operator that rearranges a tensor (\(T_{HR}\)) of shape \(( n_x \times d) \times ( n_y \times h) \times ( n_z \times w) \times C\) into a tensor (\(T_{LR}\)) of shape \(d \times h \times w \times (C \times n_x \times n_y \times n_z )\). The operation \(T_{LR} = PDS (T_{HR})\) can be described as:

$$\begin{aligned} T_{LR}(x',y',z',c') = T_{HR}( {x' \times n_x}+\left\lfloor {mod(c', n_x \times C)/C}\right\rfloor , \\ {y' \times n_y}+\left\lfloor {mod(c',n_x \times n_y \times C)/(n_x \times C)}\right\rfloor , \\ {z' \times n_z}+\left\lfloor {c'/(n_x \times n_y \times C)}\right\rfloor , \\ mod(c', C) \ ) \end{aligned}$$
(2)

where \(x',y',z',c'\) are the coordinates of the voxels in the LR space, and \(x' \in [0,d-1], y' \in [0,h-1], z' \in [0, w-1], c' \in [0, C \times n_x \times n_y \times n_z -1] \).
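To make the index arithmetic of Eq. (2) concrete, below is a minimal NumPy sketch of the \( PDS \) operator; the function name and the reshape/transpose strategy are our illustrative choices, not necessarily how the operator is implemented in practice:

```python
import numpy as np

def periodic_down_shuffle(t, nx, ny, nz):
    """PDS of Eq. (2): (nx*d, ny*h, nz*w, C) -> (d, h, w, C*nx*ny*nz)."""
    X, Y, Z, C = t.shape
    d, h, w = X // nx, Y // ny, Z // nz
    # split each spatial axis into a coarse coordinate and a phase offset
    t = t.reshape(d, nx, h, ny, w, nz, C)
    # move the phase offsets next to the channel axis so that the flattened
    # channel index is c' = ((o_z*ny + o_y)*nx + o_x)*C + c, matching Eq. (2)
    t = t.transpose(0, 2, 4, 5, 3, 1, 6)  # (d, h, w, nz, ny, nx, C)
    return t.reshape(d, h, w, nx * ny * nz * C)
```

Note that no voxel is discarded: every element of the HR input reappears exactly once in one of the LR channels, which is what distinguishes PDS from a simple strided down-sampling.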

Fig. 2. A schematic view of how to augment existing FCNs with 3D-TC and DUC for volumetric image analysis.

Fig. 3. Qualitative comparison of the segmentation results of the 3D LP-U-net with different shuffling factors and the 3D U-net with the largest patch size.

2.2 3D-TC and DUC Augmented FCNs for Image Segmentation

Both DUC and 3D-TC are network agnostic and can be combined with existing FCNs such as the 3D U-net [1] for volumetric image analysis, as shown in Fig. 2, as long as the dimensions of the output from 3D-TC satisfy the input requirements of the subsequent FCN. The advantage of such a pipeline is apparent: when a 3D-TC with down-shuffling factors of \(( n_x , n_y , n_z )\) is applied to the input data, both the computational and the storage cost of the underlying network are reduced by a factor of \( n_x \times n_y \times n_z \), allowing one to use a large patch as input with reduced computational time. The full-resolution result is then obtained at the final output by applying a DUC with up-shuffling factors of \(( n_x , n_y , n_z )\). To differentiate from the original 3D U-net, we refer to the 3D U-net augmented with 3D-TC and DUC as the 3D large patch U-net (3D LP-U-net). Similarly, we derive the 3D LP-V-net and the LP-HighRes3DNet by augmenting the original 3D V-net [2] and HighRes3DNet [5] with 3D-TC and DUC. In this study, we take the original 3D U-net, 3D V-net and HighRes3DNet as the base networks to evaluate the performance of the associated networks augmented with 3D-TC and DUC. For all the studies, a combination of cross entropy loss with Dice loss as introduced in [2] is used.
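To make the pipeline in Fig. 2 concrete, the following TensorFlow sketch wraps an arbitrary base FCN with 3D-TC at the input and DUC at the output. The function names, the layer configuration and the DUC channel layout are our assumptions for illustration, not the authors' exact implementation:

```python
import tensorflow as tf

def down_shuffle(x, nx, ny, nz):
    """Periodic down-shuffling (Eq. 2): (B, nx*d, ny*h, nz*w, C) -> (B, d, h, w, nx*ny*nz*C)."""
    _, X, Y, Z, C = x.shape
    d, h, w = X // nx, Y // ny, Z // nz
    x = tf.reshape(x, [-1, d, nx, h, ny, w, nz, C])
    x = tf.transpose(x, [0, 1, 3, 5, 6, 4, 2, 7])  # (B, d, h, w, nz, ny, nx, C)
    return tf.reshape(x, [-1, d, h, w, nx * ny * nz * C])

def up_shuffle(x, nx, ny, nz):
    """Periodic up-shuffling used by DUC; the inverse of down_shuffle."""
    _, d, h, w, C = x.shape
    c = C // (nx * ny * nz)
    x = tf.reshape(x, [-1, d, h, w, nz, ny, nx, c])
    x = tf.transpose(x, [0, 1, 6, 2, 5, 3, 4, 7])  # (B, d, nx, h, ny, w, nz, c)
    return tf.reshape(x, [-1, d * nx, h * ny, w * nz, c])

def build_lp_net(base_fcn, input_shape, n=(4, 4, 2), k=64, num_classes=3):
    """Wrap a base FCN (any Keras model/callable that preserves spatial size)
    with 3D-TC at the input and DUC at the output, as in Fig. 2."""
    nx, ny, nz = n
    inp = tf.keras.Input(shape=input_shape)  # e.g. (400, 400, 80, 1)
    x = down_shuffle(inp, nx, ny, nz)        # PDS
    x = tf.keras.layers.Conv3D(k, 3, padding='same', activation='relu')(x)  # 3D-TC
    x = base_fcn(x)                          # e.g. a 3D U-net trunk
    x = tf.keras.layers.Conv3D(num_classes * nx * ny * nz, 3,
                               padding='same')(x)  # DUC low-resolution conv
    out = up_shuffle(x, nx, ny, nz)          # full-resolution logits
    return tf.keras.Model(inp, out)
```

With shuffling factors of (4, 4, 2), for example, a \(400 \times 400 \times 80\) input patch is reduced to \(100 \times 100 \times 40\) before it reaches the base network, which is where the memory and computation savings come from.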

2.3 Implementation Details

All methods were implemented in Python using the TensorFlow framework and were trained and tested on a desktop with a 3.6 GHz Intel(R) i7 CPU and an NVIDIA GTX 1080 Ti graphics card with 11 GB of GPU memory. We empirically fixed the number of output feature maps from the 3D-TC at \(k=64\).

3 Experiments and Results

In this section, we present experimental results of the proposed pipeline for volumetric image analysis. Two datasets were used in our study: an in-house dataset consisting of 25 T1 hip MR images with limited field of view, and a publicly available dataset from the National Institutes of Health (NIH) containing 82 abdominal contrast-enhanced 3D CT scans [10]. For all the experiments described below, we used the Dice Overlap Coefficient (DOC) and the Average Surface Distance (ASD) as evaluation metrics.
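For reference, a minimal sketch of how these two metrics can be computed with NumPy/SciPy is given below; since the exact surface extraction and ASD variant are not spelled out in the paper, the symmetric definition used here is an assumption:

```python
import numpy as np
from scipy import ndimage

def dice_overlap(pred, gt):
    """Dice Overlap Coefficient (DOC) between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def average_surface_distance(pred, gt, spacing):
    """Symmetric Average Surface Distance (ASD) in mm; `spacing` is the
    voxel size along each axis, e.g. (0.374, 0.363, 1.078)."""
    def surface(mask):
        m = mask.astype(bool)
        return m ^ ndimage.binary_erosion(m)
    sp, sg = surface(pred), surface(gt)
    # distance from every voxel to the nearest surface voxel of the other mask
    d_to_gt = ndimage.distance_transform_edt(~sg, sampling=spacing)
    d_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    return (d_to_gt[sp].sum() + d_to_pred[sg].sum()) / (sp.sum() + sg.sum())
```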

3.1 Ablation Study on Hip MR Images with Limited Field of View

Data and Augmentation. We used 25 3D T1 MR images acquired from patients with hip pain. All images were resampled to a uniform size of \(480 \times 480 \times 160\) voxels with an average voxel spacing of 0.374 mm \(\times \) 0.363 mm \(\times \) 1.078 mm. Slice-by-slice manual segmentation was used to create the reference ground truth. We randomly split the 25 datasets into two groups, with 20 datasets used for training and the remaining 5 for testing.

Table 1. Influence of different patch sizes on the performance of the original 3D U-net. Ace: the acetabulum; Femur: the proximal femur
Table 2. Results when different shuffling factors were used for the 3D LP-U-net. The size of the input patch is fixed at \(400 \times 400 \times 80\).

Ablation Study. We first investigated the influence of the patch size on the performance of the original 3D U-net [1]. The results are presented in Table 1. We observed that better performance was obtained when a larger patch size was used. Due to the GPU memory constraint, \(200 \times 200 \times 40\) was the maximum patch size that we could use.

We then examined the effect of different shuffling factors on the performance of the 3D LP-U-net when a fixed patch size of \(400 \times 400 \times 80\) was used. The results are reported in Table 2. From this table, we can see that (1) in general, the higher the shuffling factor, the less accurate the results, although the best results were achieved with a shuffling factor of (4, 4, 2); (2) even with a shuffling factor as high as (25, 25, 2), we still obtained sub-millimeter segmentation accuracy for both structures; and (3) compared with the results reported in Table 1, the 3D LP-U-net achieved better results than the original 3D U-net with the largest patch size whenever the shuffling factor was smaller than (16, 16, 2).

Figure 3 visually compares the segmentation results obtained by the 3D LP-U-net with a fixed patch size of \(400 \times 400 \times 80\) but different shuffling factors against those of the 3D U-net with the largest patch size. In this figure, we show the overall segmentation with the probability of each structure, as well as a close-up of the results around the hip joint. From this figure, we observe that (1) the 3D LP-U-net produced fewer false positive segmentations than the 3D U-net; and (2) for the 3D LP-U-net, the larger the shuffling factors, the higher the uncertainty around the boundary, with the best results obtained at shuffling factors of (4, 4, 2).

Finally, we show that 3D-TC and DUC are agnostic to the base networks. To demonstrate this, we compared the performance of the original 3D V-net and HighRes3DNet with their maximally allowed patch sizes against that of the 3D LP-V-net with a shuffling factor of (2, 2, 2) and the LP-HighRes3DNet with a shuffling factor of (4, 4, 1), both with a fixed patch size of \(400 \times 400 \times 80\). Note that, due to its high spatial resolution, the maximally allowed patch size for HighRes3DNet is \(200 \times 200 \times 20\), while for the 3D V-net it is \(200 \times 200 \times 40\). The results are presented in Table 3; the 3D LP-V-net and the LP-HighRes3DNet both achieved better results than their associated base networks.

Table 3. Results when different underlying FCNs were used.
Table 4. Segmentation accuracy of 3D LP-U-net and three state-of-the-art methods.

3.2 Validation on Hip MR Images

Using the same 25 T1 hip MR images with limited field of view as in the ablation study, we conducted a standard 5-fold cross-validation experiment, adopting the same data augmentation and training strategies as in the ablation study. For the 3D LP-U-net, we chose a fixed patch size of \(400 \times 400 \times 80\) and a fixed shuffling factor of (4, 4, 2). We compared the performance of the 3D LP-U-net with state-of-the-art methods, i.e., the 3D U-net [1], the 3D V-net [2], and HighRes3DNet [5]; the results are shown in Table 4. For the acetabulum, an average DOC of \(96.76\pm 0.92\)%, \(94.01\pm 2.80\)%, \(93.35\pm 3.21\)% and \(90.51\pm 7.32\)% was found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively; the 3D LP-U-net was significantly more accurate than the other three methods (\(p < 0.01\)), and the same significance was observed for ASD. For the proximal femur, an average DOC of \(98.14\pm 0.47\)%, \(96.89\pm 0.85\)%, \(96.47\pm 1.54\)% and \(89.99\pm 4.91\)%, and an average ASD of \(0.27\pm 0.09\) mm, \(0.96\pm 1.16\) mm, \(0.91\pm 0.87\) mm and \(1.79\pm 0.83\) mm, were found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively; again, the 3D LP-U-net was significantly more accurate than the other three methods (\(p < 0.01\)) when segmenting the proximal femur.

3.3 Validation on NIH Pancreas CT Dataset

We also validated our approach on the NIH pancreas CT dataset [10], which contains 82 contrast-enhanced abdominal CT volumes with manual annotations provided by an experienced radiologist. Following the training protocol of [10], we conducted a 4-fold cross validation, randomly splitting the 82 patients into testing folds of 22, 20, 20, and 20 cases, respectively. We implemented a two-stage pipeline consisting of a coarse stage and a fine stage. In the coarse stage, we first trained a deep segmentation network to locate the rough region of the pancreas in a whole CT volume. The goal of the fine stage is then to train another deep segmentation network to further refine the results. In both stages, we used the 3D LP-U-net as the segmentation network. During the training phase of the coarse stage, we chose a fixed patch size of \(480 \times 480 \times 64\) voxels and a fixed shuffling factor of (4, 4, 1). During the training phase of the fine stage, all training images were cropped by the bounding box computed from the associated ground truth segmentation plus a padding of 20 voxels along all three spatial axes, and all cropped images were then resampled to a fixed size of \(196 \times 128 \times 128\) voxels. For the 3D LP-U-net in the fine stage, we chose a fixed patch size of \(176 \times 112 \times 96\) voxels and a fixed shuffling factor of (2, 2, 1).
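As a small illustration of the fine-stage data preparation, the hypothetical helper below crops a volume to the ground-truth bounding box with the 20-voxel padding described above; the subsequent resampling to \(196 \times 128 \times 128\) voxels (e.g. with scipy.ndimage.zoom) is omitted:

```python
import numpy as np

def crop_to_bbox(image, mask, pad=20):
    """Crop `image` to the bounding box of the binary `mask`, enlarged by
    `pad` voxels along every spatial axis and clipped to the volume bounds."""
    coords = np.argwhere(mask > 0)
    lo = np.maximum(coords.min(axis=0) - pad, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + pad, np.array(mask.shape))
    return image[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
```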

Table 5. Accuracy (DOC, %) comparison between the 3D LP-U-net and state-of-the-art methods on the NIH pancreas segmentation dataset.

Table 5 shows the accuracy comparison between our approach and previous state-of-the-art methods on the NIH pancreas CT dataset. Our approach achieved an average DOC comparable to the state-of-the-art methods. Although the average and maximum DOC achieved by our approach are slightly worse than those achieved by [6], which is based on a complicated recurrent saliency transformation network with careful parameter fine-tuning, our approach achieved a much better result in the worst case (68.39% by our approach vs. 62.81% by the previous state-of-the-art), indicating the reliability of our approach for clinical applications.

4 Conclusion

We proposed a simple yet effective 3D tiled convolution for 3D medical image analysis tasks. 3D-TC consists of a periodic down-shuffling operation followed by low-resolution 3D convolutions. It can be applied directly to the input data and has the advantage of significantly reducing the size of the data for subsequent processing while using all the information available in the input, irrespective of the down-shuffling factors. To achieve volumetric dense prediction at the output, we used the previously introduced dense upsampling convolution. We showed that 3D-TC and DUC are network agnostic and can be combined with different FCNs for improved performance. Experimental results demonstrated the effectiveness of our framework on different semantic segmentation tasks. In future work, we will apply 3D-TC to other volumetric image processing tasks such as image synthesis and image super-resolution.