1 Introduction

Automatic vertebrae recognition (i.e., label classification and bounding box localization) from magnetic resonance imaging (MRI) is an essential tool for spine disease diagnosis, medical and surgical treatment planning, and postoperative response assessment [1]. Performing automatic vertebrae recognition accurately and reproducibly is crucial because incorrect recognition is the main cause of wrong-site surgery, one of the five surgical Never Events in clinical practice [2].

However, as shown in Fig. 1, automatic vertebrae recognition in an arbitrary spine image is challenging because: (1) The field of view (FOV) varies unpredictably across input images, so it is impossible to rely on a specifically-shaped vertebra such as the sacrum to classify the others [1]. (2) The MRI image characteristics (resolution, scale, and intensity distribution) vary widely [3], which makes it difficult to learn representative features that distinguish vertebrae. (3) The appearances of different vertebrae are similar due to their repetitive nature (Fig. 1(d)), while pathological variation and/or image artifacts alter the appearance of vertebrae in an unknown manner (Fig. 1(e)) [4, 5].

Fig. 1. Challenges of automatic vertebrae recognition in arbitrary spine images (classifying the label and tight bounding box (i.e., the white box for L3 in Fig. 1(a)) of each vertebra). Figure 1(a–c) shows the challenges caused by different fields of view (FOV) and image characteristics (resolution, intensity, modality, and scale). Figure 1(d–e) shows that the repetitive nature, pathological variation, and/or image artifacts make the problem more challenging.

While many methods have been proposed for vertebrae detection and nontrivial progress has been achieved, simultaneously recognizing the labels and bounding boxes of all vertebrae from arbitrary spine images remains an open problem. Lootus et al. [4] present an accurate method based on a deformable part model, but it requires the sacrum to be present. Glocker et al. [6] use random forests and probabilistic graphical models to regress vertebrae centroids from appearance-based features. Other approaches such as snakes, multi-task learning, and state-space methods have also been applied to detection and/or segmentation tasks [7,8,9]; these universal methods can be adapted to vertebrae detection. With the development of deep learning, [1, 3, 5] use convolutional neural networks (CNNs) as the base method together with several advanced techniques (deep supervision, message passing, RNNs, pairwise conditional dependency) to enhance performance and accurately obtain pixel-wise probability maps for each vertebra centroid. However, directly recognizing the labels and bounding boxes of the vertebrae (rather than probability maps of centroid points) may be more meaningful for clinical use because it reveals their relative sizes and positions; moreover, this strategy mitigates the problem of false positives [3, 6].

We propose a two-stage hierarchical self-calibration recognition network (Hi-scene) to recognize vertebrae from arbitrary spine MRI. As shown in Fig. 2, we first develop a hierarchical detection network (HDN) to coarsely detect regions containing vertebrae. HDN generates multi-scale anchors and extracts discriminative hierarchical features to match the scales of different vertebrae, which addresses the multi-scale/resolution challenge. Then, we propose a self-calibration recognition network (SRN) to recognize each vertebra from the coarse detections. SRN creatively formulates the message passing algorithm as part of a deep learning network to calibrate the predicted class probabilities and correct wrong recognitions, which addresses the appearance and FOV challenges.

Our contributions are: (1) We propose an accurate clinical tool that directly recognizes vertebrae labels and bounding boxes from arbitrary MRI images. (2) For the first time, we formulate message passing into a deep learning object recognition network for performance enhancement, which also benefits other object recognition problems where the locations of the target objects have certain internal spatial relationships.

Fig. 2. Overview of Hi-scene, a two-stage network containing: (1) a hierarchical detection network (HDN) for coarsely localizing regions containing vertebrae (class-agnostic proposals) of different scales and resolutions; (2) a self-calibration recognition network (SRN) for recognizing the class label and bounding box of each vertebra and automatically calibrating wrong detections using the message passing algorithm.

2 Methodology

Our Hi-scene (Fig. 2) is composed of two stages: (1) The hierarchical detection network (HDN, Sect. 2.1, Fig. 3) generates multi-scale anchors at regular locations all over the input image and extracts hierarchical discriminative image features. This design requires no a priori knowledge of the input image; the network simultaneously predicts the objectness scores and coarse locations of regions containing vertebrae using the generated multi-scale anchors. (2) The self-calibration recognition network (SRN, Sect. 2.2, Fig. 4) embeds message passing into a multi-task recognition network to recognize vertebrae by predicting class probabilities and refined bounding boxes based on the coarse locations obtained by HDN. The embedded message passing leverages the label relationships among different vertebrae to effectively calibrate the class probabilities of wrong recognitions caused by FOV variation and/or pathological deformation.

2.1 Hierarchical Detection Network (HDN)

HDN coarsely localizes regions containing vertebrae and handles the scale/resolution challenge through three carefully designed cascaded modules:

The anchor generator (Fig. 3(a)) equidistantly samples grid points from the original input image and places boxes of different sizes and aspect ratios (namely, anchors) centered on the grid points. The sizes (16, 32, and 64 pixels in our work) and aspect ratios (1:1, 1:1.5, and 1:2 in our work) are chosen to approximate the sizes and aspect ratios of vertebrae in the input images, and the sampling distance (8 pixels) is chosen to ensure that the anchors are dense enough to cover all vertebrae. This configuration guarantees that a vertebra of any scale/aspect ratio at any possible location can be detected by a similar-shaped anchor in its vicinity.
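To make the anchor configuration concrete, the following minimal NumPy sketch (our illustration, not the released implementation) enumerates the anchors described above for a 512 × 512 input; interpreting the aspect ratio as height/width with the stated area preserved is our assumption.

```python
import numpy as np

def generate_anchors(image_size=512, stride=8,
                     sizes=(16, 32, 64), ratios=(1.0, 1.5, 2.0)):
    """Return an (N, 4) array of anchors as (x1, y1, x2, y2) boxes."""
    centers = np.arange(stride // 2, image_size, stride, dtype=np.float32)
    cx, cy = np.meshgrid(centers, centers)          # grid of anchor centers
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size / np.sqrt(ratio)               # width shrinks ...
            h = size * np.sqrt(ratio)               # ... as height grows
            anchors.append(np.stack([cx - w / 2, cy - h / 2,
                                     cx + w / 2, cy + h / 2], axis=-1))
    return np.concatenate([a.reshape(-1, 4) for a in anchors], axis=0)

anchors = generate_anchors()
print(anchors.shape)  # (512/8)^2 grid points x 9 anchors = (36864, 4)
```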

Fig. 3. The hierarchical detection network (HDN). The anchor generator (Fig. 3(a)) places anchors of different scales/aspect ratios at different locations. The feature extraction module (Fig. 3(b)) extracts discriminative features that are robust to resolution change. The sibling detection module (Fig. 3(c)) simultaneously predicts the objectness score (OS) and bounding box correction (\(BBC_{1}\)) of each anchor, and the anchors are then refined into multi-scale proposals by the \(BBC_{1}\)'s. The configuration of anchor scales, feature scales, and anchor intervals is carefully designed to address the multi-scale challenge.

The feature extraction module (Fig. 3(b)) adopts a ResNet-like network as the feature extractor with two careful designs: (1) The output size of each ResNet stage (C1\(\sim \)C3 in Fig. 3(b)) is designed to equal one anchor scale (namely, 64, 32, and 16 for C1, C2, and C3). Redundant layers in the original ResNet [11] are removed to reduce network parameters and computational cost. (2) Top-down layers are introduced to explicitly combine high- and low-level features. The features P2\(\sim \)P3 are up-sampled and merged with their lower-level counterparts through lateral connections [12]. This design preserves the inherent advantages of ResNet (e.g., shortcuts that prevent the degradation problem in deeper networks, robustness to resolution change) while guaranteeing that features of the proper scale are associated with each anchor size.
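The top-down merging can be sketched as follows (our simplified illustration: nearest-neighbor up-sampling and already-matched channel counts; in the network itself the lateral path is a 1 × 1 convolution as in [12]):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def topdown_merge(p_high, c_low):
    """Merge a coarse top-down feature with its lateral lower-level feature;
    channel counts are assumed to match already (handled by a 1x1 conv in
    the real network), so a plain element-wise sum suffices here."""
    return upsample2x(p_high) + c_low

C2 = np.random.rand(32, 32, 256).astype(np.float32)  # lower-level feature
P3 = np.random.rand(16, 16, 256).astype(np.float32)  # higher-level feature
P2 = topdown_merge(P3, C2)                           # fused 32 x 32 feature
```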

The sibling detection module (Fig. 3(c)) feeds the extracted features P1\(\sim \)P3 into a 3 \(\times \) 3 convolutional layer followed by two sibling 1 \(\times \) 1 convolutional layers to predict the objectness scores (OS, the probability of each anchor containing a vertebra) and the first-stage bounding box corrections (\(BBC_{1}\)) of the anchors. Then, as inspired by Faster R-CNN [10], the \(BBC_{1}\)'s are applied to the anchors to obtain coarse detection boxes (called proposals as in [10]). Proposals with high OS are locations where vertebrae probably exist.
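Since applying the \(BBC_{1}\)'s follows Faster R-CNN [10], a sketch of the standard box-delta decoding from [10] is given below; whether HDN uses exactly this encoding is our assumption.

```python
import numpy as np

def apply_bbc(anchors, deltas):
    """Decode (dx, dy, dw, dh) corrections into refined boxes using the
    standard Faster R-CNN parameterization [10]. Boxes are (x1, y1, x2, y2)."""
    w  = anchors[:, 2] - anchors[:, 0]
    h  = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    pcx = cx + deltas[:, 0] * w        # shift the center ...
    pcy = cy + deltas[:, 1] * h
    pw  = w * np.exp(deltas[:, 2])     # ... and rescale the size
    ph  = h * np.exp(deltas[:, 3])
    return np.stack([pcx - 0.5 * pw, pcy - 0.5 * ph,
                     pcx + 0.5 * pw, pcy + 0.5 * ph], axis=1)
```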

2.2 Self-calibration Recognition Network (SRN)

SRN recognizes the label/bounding box of each vertebra and handles the appearance/FOV challenges through two coherently integrated modules:

The pre-recognition module (Fig. 4(a)) adopts ROI-pooling [10] to choose one feature map from P3\(\sim \)P1 according to the size of each proposal, then crops and resizes the feature to 32 \(\times \) 32. The resized features are fed into a two-sibling network similar to that in HDN to pre-recognize class-aware labels and bounding boxes. This is implemented by predicting a class probability vector (CPV) and second-stage bounding box corrections (\(BBC_{2}\)) for all classes. For each pre-recognition, the CPV is a vector whose elements are the probabilities of the pre-recognized vertebra having each label; the predicted label is the index of the CPV's maximum, and the recognition confidence score (RS) is the maximum value (i.e., if the kth element of the CPV is the maximum, this vertebra is predicted to have label k and its RS is the kth value of the CPV). The \(BBC_{2}\) corresponding to the predicted label is chosen and applied to the proposal to obtain the pre-recognition bounding box.
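The selection logic just described can be written as a minimal sketch (ours, for illustration only):

```python
import numpy as np

def pre_recognize(cpv, bbc2):
    """Read off the predicted label, confidence score (RS), and the
    class-specific box correction from one pre-recognition.
    cpv:  (N,) class probability vector
    bbc2: (N, 4) per-class second-stage box corrections"""
    label = int(np.argmax(cpv))    # index of the CPV's maximum
    rs = float(cpv[label])         # recognition confidence score
    return label, rs, bbc2[label]  # BBC2 of the predicted class is applied
```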

Fig. 4. The self-calibration recognition network (SRN). The pre-recognition module (Fig. 4(a)) classifies the proposals into class-aware labels and refines the proposals into bounding boxes. The self-calibration module (Fig. 4(b)) filters out “easy” wrong recognitions and uses message passing to correct “hard” wrong detections. One iteration of message passing is shown in Fig. 4(b2), where the CPV of one pre-recognition (\(\mathbf{b}_5^{t}\) in this figure) receives messages from its neighbors (\(\mathbf{b}_4^{t}\) and \(\mathbf{b}_6^{t}\)) via the \(\mathbf {\Psi }\) matrix. The messages contain the CPV's of all other pre-recognitions and help adjust \(\mathbf{b}_5^{t}\) to reach label compatibility. This self-calibration process corrects the recognition errors caused by appearance deformation in arbitrary FOVs. (Color figure online)

The self-calibration module (Fig. 4(b)) creatively formulates message passing [13] to correct wrong pre-recognitions by calibrating their CPV's. The essence of message passing is that it adjusts the CPV of an arbitrary pre-recognition using the other CPV's by combining them with a label compatibility matrix \(\mathbf {\Psi }\) and absorbing the combination into the target CPV (detailed in Eq. 1); if the index of the maximum of the adjusted CPV changes, the predicted label changes accordingly. Thanks to our HDN and pre-recognition network, the majority of our pre-recognitions have acceptable CPV's with correct labels and high RS values. Thus, the adjustments of the CPV's move in the correct direction, i.e., the undesired CPV's are calibrated and their maxima are pushed toward the correct indices. This calibration is performed in an iterative manner [13]; we reformulate it into Eq. 1, which shows how an arbitrary CPV absorbs messages (i.e., combinations of CPV's) from the others:

$$\begin{aligned} \mathbf{b}_i^{t+1} = \frac{1}{Z}\,\mathbf{b}_i^{t}\prod _{j\in v(i)}\mathbf{m}_{ji}, \quad \quad \mathrm{where}\quad \mathbf{m}_{ji} = \sum _{label_j \in \mathcal {L}} \mathbf{b}_j^{t}\otimes \mathbf {\Psi } \otimes \prod _{k\in v(j)\backslash i}\mathbf{m}_{kj} \end{aligned}$$
(1)

where: (1) The CPV of the ith pre-recognition (\(\mathbf{b}_i^{t+1}\)) is iteratively updated by the product of its previous value (\(\mathbf{b}_i^{t}\)) and the messages it receives from all its neighbors \(\left\{ j \right\} \) (\(\mathbf{m}_{ji}\)); (2) The message from the jth to the ith pre-recognition (\(\mathbf{m}_{ji}\)) combines the CPV's with the label compatibility matrix through the element-wise product (\(\otimes \)) of: (i) the CPV of the jth pre-recognition (\(\mathbf{b}_j^{t}\)), (ii) each row of the label compatibility matrix (\(\mathbf {\Psi }\), an N\(\times \)N trainable parameter), and (iii) the messages that flow into the jth pre-recognition from its neighbors other than the ith (\(\mathbf{m}_{kj}\)); (3) The notation \(\sum _{label_j \in \mathcal {L}}\) means summing over labels, which considers all possible labels of the jth pre-recognition to form a comprehensive message; (4) Z is a normalization constant that forces each CPV to sum to 1.

In our vertebrae recognition work, the ith pre-recognition only neighbors the \(i-1\)th and \(i+1\)th (i.e., the CPV \(\mathbf{b}_i^{t+1}\) in Eq. 1 simplifies to \(\frac{1}{Z}\mathbf{b}_i^{t}\mathbf{m}_{i-1,i}\mathbf{m}_{i+1,i}\)), and \(\mathbf{m}_{i-1,i}\)/\(\mathbf{m}_{i+1,i}\) depends only on the \(i-2\)th/\(i+2\)th pre-recognition (e.g., \(\mathbf{m}_{i-1,i}=\sum _{label_{i-1} \in \mathcal {L}}\mathbf{b}_{i-1}^{t}\otimes \mathbf {\Psi }\otimes \mathbf{m}_{i-2,i-1}\)). Thus, by substituting the expression of the message \(\mathbf{m}\) in turn, the CPV's of all pre-recognized vertebrae are combined by the \(\mathbf {\Psi }\) matrix and absorbed into \(\mathbf{b}_i^{t+1}\), which means all other CPV's are used to calibrate the ith CPV. After all CPV's have exchanged their messages and the iteration reaches stability, the cross entropy loss (\(\mathrm{loss_{mp}} = -\sum _{i=1}^{N_o}\sum _{j=1}^{N}\mathbf{y}_i(j) \ln \mathbf{b}_i(j)\), where \(\mathbf{y}_i\) is the one-hot ground truth label of the ith vertebra) is calculated and minimized together with the classification and localization losses of the pre-recognition network to train SRN.
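To make Eq. 1 concrete for this chain case, the following NumPy sketch (our illustration, not the released implementation) iteratively calibrates a set of CPV's and evaluates \(\mathrm{loss_{mp}}\). The message schedule, the use of \(\mathbf {\Psi }\)/\(\mathbf {\Psi }^{T}\) for the two passing directions, and the hand-built illustrative \(\mathbf {\Psi }\) are all assumptions; in the network, \(\mathbf {\Psi }\) is trainable.

```python
import numpy as np

def calibrate_cpvs(B, Psi, iters=10):
    """Sum-product message passing (Eq. 1) on the chain of pre-recognitions.
    B:   (No, N) array; row i is the CPV b_i of the ith pre-recognition.
    Psi: (N, N) label compatibility matrix."""
    No, N = B.shape
    fwd = np.ones((No, N))  # fwd[i]: message i -> i+1
    bwd = np.ones((No, N))  # bwd[i]: message i -> i-1
    for _ in range(iters):
        for i in range(No - 1):                       # left-to-right pass
            inc = fwd[i - 1] if i > 0 else np.ones(N)
            m = (B[i] * inc) @ Psi                    # sum over sender labels
            fwd[i] = m / m.sum()                      # normalize for stability
        for i in range(No - 1, 0, -1):                # right-to-left pass
            inc = bwd[i + 1] if i < No - 1 else np.ones(N)
            m = (B[i] * inc) @ Psi.T
            bwd[i] = m / m.sum()
        # each node absorbs the messages from its two neighbors (Eq. 1)
        msg_left  = np.vstack([np.ones((1, N)), fwd[:-1]])
        msg_right = np.vstack([bwd[1:], np.ones((1, N))])
        Bt = B * msg_left * msg_right
        B = Bt / Bt.sum(axis=1, keepdims=True)        # 1/Z normalization
    return B

def loss_mp(B, labels):
    """Cross-entropy loss_mp against integer ground truth labels
    (equivalent to the one-hot form; epsilon guards against log(0))."""
    return -float(np.sum(np.log(B[np.arange(len(labels)), labels] + 1e-12)))

# Illustrative Psi: adjacent pre-recognitions should carry consecutive
# labels, so sender label j is most compatible with receiver label j+1.
N = 17
Psi = np.full((N, N), 0.01)
Psi[np.arange(N - 1), np.arange(1, N)] = 1.0
B0 = np.random.dirichlet(np.ones(N), size=6)  # 6 noisy CPVs
B_cal = calibrate_cpvs(B0, Psi)
```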

Our SRN reinforces the mutual benefit of the pre-recognition network and the message passing scheme. During training, the label accuracy is very high (near 100%) once the pre-recognition network (and the HDN) reaches stability, which provides reliable CPV's to train the \(\mathbf {\Psi }\) matrix; there are no false positives or wrong predictions to perturb the training. During testing, we first filter out the “easy” false positives (e.g., those at wrong positions, as demonstrated by the yellow box in Fig. 4(a2)) using the predicted x coordinate: if the predicted x coordinate of a vertebra is inappropriate, i.e., its deviations from the x coordinates of the neighboring vertebrae exceed a threshold (one vertebra width in our work), the pre-recognition is judged to be a false positive. This procedure rejects the “easy” false positives, leaving only the “hard” wrong pre-recognitions with approximately correct positions but wrong labels (e.g., the upper green box with label L3 in Fig. 4(a2)) to be fed into the message passing scheme. The wrong labels are then corrected by calibrating the CPV's using Eq. 1 (Fig. 4(b2)). Moreover, since a \(BBC_{2}\) is predicted for each class, the bounding boxes can be automatically refined by choosing the \(BBC_{2}\) of the corrected class.
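Our reading of this filtering rule can be sketched as follows (using box centers and the mean box width as the "one vertebra width" threshold; both details are assumptions):

```python
import numpy as np

def filter_easy_false_positives(boxes):
    """Reject pre-recognitions whose x coordinate deviates from all
    neighbors by more than one vertebra width.
    boxes: (No, 4) array of (x1, y1, x2, y2), ordered along the spine."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2         # box center x
    width = (boxes[:, 2] - boxes[:, 0]).mean()   # typical vertebra width
    keep = []
    for i in range(len(boxes)):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(boxes)]
        # keep the box only if it stays close to at least one neighbor
        if any(abs(cx[i] - cx[j]) <= width for j in neighbors):
            keep.append(i)
    return boxes[keep]
```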

Fig. 5. The Hi-scene network achieves high vertebrae detection performance on a challenging dataset with different FOVs, image characteristics, and modalities. The dotted boxes are the detection boxes with confidence scores, and the solid boxes are ground truth boxes.

3 Experiments and Results

Hi-scene has been intensively evaluated on a challenging dataset of 450 arbitrary MRI images with different image characteristics (such as vertebrae appearance, image resolution, and intensity distribution) and FOVs (covering S1\(\sim \)T12, S1\(\sim \)T11, S1\(\sim \)T10, L5\(\sim \)T11, L5\(\sim \)T10, and L4\(\sim \)T10; each FOV has \(\sim \)75 images). 2D slices (not necessarily the mid-sagittal slices) of each 3D MRI scan are automatically extracted and resized to 512 \(\times \) 512. Hi-scene is implemented in Python 3.6 on Tensorflow 1.2 and trained with a momentum optimizer and exponential learning rate decay. The initial learning rate is 1e–3, the decay factor is 0.96 per 10 epochs, and the momentum is 0.9. Training is run on an NVIDIA GTX1080 GPU for 20000 steps (\(\sim \)111 epochs). We use standard five-fold cross-validation for evaluation; the number of images of each FOV is kept approximately the same in the training/testing split of each fold, and the five per-fold results are averaged to produce a single result. Four metrics are used to evaluate the detection performance: (1) Image accuracy, the ratio \(\frac{correctly~recognized~images}{all~images}\). This is a rather strict metric because an image is considered correctly recognized only if all vertebrae in the image are correctly recognized. (2) Identification rate (IDR), which measures the accuracy of individual vertebra classification [3]. (3) \(mAP_{75}\), the mean AP (average precision) at IoU (intersection over union) threshold 0.75 over all images; this metric is detailed in [14] and is widely used in the object detection domain [10]. (4) mIoU, the average IoU between the detected and ground truth bounding boxes over all vertebrae in all images.
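For concreteness, the IoU underlying \(mAP_{75}\) and mIoU, and the strict image accuracy, can be computed as in the following sketch (\(mAP_{75}\) itself follows the standard definition in [14]):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def image_accuracy(per_image_correct):
    """Fraction of images in which *every* vertebra is correctly
    recognized (the strict metric defined above); the argument is a
    boolean array with one entry per image."""
    return float(np.mean(per_image_correct))
```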

Fig. 6. Four evaluation metrics indicate that Hi-scene can accurately classify and localize vertebrae in images of different FOVs, image characteristics, and vertebrae appearances. (Color figure online)

Qualitative Demonstration. Figure 5 demonstrates that the Hi-scene network achieves high vertebrae classification accuracy and bounding box localization precision. The detected boxes (dashed) have correct labels and high overlaps with their ground truth boxes (solid) despite the varied FOVs and vertebrae appearances in arbitrary images (for example, Fig. 5(c) and (d) both contain 8 vertebrae, but their FOVs and vertebrae appearances differ).

Quantitative Analysis. As shown in Fig. 6, the average image accuracy reaches 0.933 (the first black bar), which means that all the vertebrae are correctly recognized, without false positives/negatives, in 93.3% of the input images. Image accuracies are high for the individual FOVs (the other six black bars); even the most difficult FOV (L4\(\sim \)T10, which tends to be confused with FOV L5\(\sim \)T11) has high accuracy (0.867). The mean IDR reaches 0.96, showing high classification performance for individual vertebrae. The mean \(mAP_{75}\) (the first blue bar) reaches 0.964, which shows that the recognition network maintains high precision at high recalls [14]. The mean mIoU (the first pink bar) reaches 0.919, which means the recognized bounding boxes have high overlaps with the ground truths. These results indicate that our Hi-scene successfully addresses the challenge of vertebrae recognition from arbitrary spine MRI.

Comparison with the State-of-the-Art. Our approach is compared with three other methods [1, 3, 10]. Since only [10] has published code, we made our best effort to re-implement [3] and a 2D version of [1] and tuned their hyper-parameters for better performance. The re-implementation of [3] reaches the reported IDR on our dataset. However, probably because we lack the implementation details of their strategy (using cropped images to train a CNN and then converting the CNN into an FCN to process the whole image), the re-implementation of [1] does not reach the reported IDR: quite a few false positives occur in the FCN results, which are not easy to distinguish and harm the subsequent training/inference of the LSTM. As shown in Table 1, Hi-scene outperforms both the state-of-the-art methods and our baseline without message passing in all four metrics used in Fig. 6 as well as in localization error (Loc-Err). In particular, the image accuracy of our method is significantly higher because in most wrongly recognized images only a minority of vertebrae are incorrect, and these can be effectively corrected by message passing. Another interesting result is that although [3] and our work both use message passing, our work performs better because our HDN and the pre-recognition network in SRN successfully handle images of arbitrary scales/characteristics/FOVs; the pre-recognized vertebrae are more precise and contain fewer false positives than the predicted pixel-wise probability maps of vertebra centroids, which allows the message passing algorithm to propagate correct probabilities for better CPV calibration.

Table 1. Comparison with the state-of-the-art. MP stands for message passing.

4 Conclusion

In this paper, we develop the hierarchical self-calibration recognition network (Hi-scene) to recognize vertebrae in arbitrary spine MRI images. It consists of two novel networks: a hierarchical detection network for detecting multi-scale/resolution regions containing vertebrae, and a self-calibration recognition network for recognizing the label and bounding box of each vertebra and automatically correcting wrong recognitions caused by FOV variation and pathological deformations. Its performance and effectiveness are demonstrated by extensive experiments.