1 Introduction

Automatic vertebrae recognition (i.e., label classification and bounding box localization) from magnetic resonance imaging (MRI) is an essential tool for spine disease diagnosis, medical and surgical treatment planning, and postoperative response assessment [1]. Performing automatic vertebrae recognition accurately and reproducibly is crucial because incorrect recognition is the main cause of wrong-site surgery, one of the five surgical Never Events in clinical practice [2].

However, as shown in Fig. 1, automatic vertebrae recognition in an arbitrary spine image is challenging because: (1) The field of view (FOV) varies unpredictably across input images, so it is impossible to rely on a specifically-shaped vertebra such as the sacrum to classify the others [1]. (2) The MRI image characteristics (resolution, scale, and intensity distribution) vary widely [3], which makes it difficult to learn representative features that distinguish vertebrae. (3) The appearances of different vertebrae are similar due to their repetitive nature (Fig. 1(d)), while pathological variation and/or image artifacts alter the appearance of vertebrae in an unknown manner (Fig. 1(e)) [4, 5].

Fig. 1. Challenges of automatic vertebrae recognition in arbitrary spine images (classifying the label and tight bounding box (i.e., the white box for L3 in Fig. 1(a)) of each vertebra). Figure 1(a–c) shows the challenges caused by different fields of view (FOV) and image characteristics (resolution, intensity, modality, and scale). Figure 1(d–e) shows that the repetitive nature, pathological variation, and/or image artifacts make the problem more challenging.

While many methods have been proposed for vertebrae detection and nontrivial progress has been achieved, simultaneously recognizing the labels and bounding boxes of all vertebrae from arbitrary spine images remains an open problem. Lootus et al. [4] present an accurate method based on a deformable part model, but it requires the sacrum to be present. Glocker et al. [6] use random forests and probabilistic graphical models to regress vertebrae centroids from appearance-based features. Other approaches such as snakes, multi-task learning, and state-space methods have also been applied to detection and/or segmentation tasks [7,8,9]; these universal methods can be adapted to vertebrae detection. With the development of deep learning, [1, 3, 5] use convolutional neural networks (CNNs) as the base method together with several advanced techniques (deep supervision, message passing, RNNs, pairwise conditional dependency) to enhance performance and accurately obtain pixel-wise probability maps for each vertebra centroid. However, directly recognizing the labels and bounding boxes of the vertebrae (rather than probability maps of centroid points) may be more meaningful for clinical use because it reveals their relative sizes and positions; moreover, this strategy mitigates the problem of false positives [3, 6].

We propose a two-stage hierarchical self-calibration recognition network (Hi-scene) to recognize vertebrae from arbitrary spine MRI. As shown in Fig. 2, we first develop a hierarchical detection network (HDN) to coarsely detect regions containing vertebrae. HDN generates multi-scale anchors and extracts discriminative hierarchical features to match the scales of different vertebrae, which addresses the multi-scale/resolution challenge. Then, we propose a self-calibration recognition network (SRN) to recognize each vertebra from the coarse detections. SRN creatively formulates the message passing algorithm as part of a deep learning network to calibrate the predicted class probabilities and correct wrong recognitions, which addresses the appearance and FOV challenges.

Our contributions are: (1) We propose an accurate clinical tool that directly recognizes vertebrae labels and bounding boxes from arbitrary MRI images. (2) For the first time, we formulate message passing into a deep learning object recognition network for performance enhancement, which also benefits other object recognition problems where the locations of the target objects have certain internal spatial relationships.

Fig. 2. Overview of Hi-scene, a two-stage network containing: (1) a hierarchical detection network (HDN) for coarsely localizing regions containing vertebrae (class-agnostic proposals) of different scales and resolutions; (2) a self-calibration recognition network (SRN) for recognizing the class label and bounding box of each vertebra and automatically calibrating wrong detections using the message passing algorithm.

2 Methodology

Our Hi-scene (Fig. 2) is composed of two stages: (1) The hierarchical detection network (HDN, Sect. 2.1, Fig. 3) generates multi-scale anchors at regular locations all over the input image and extracts hierarchical discriminative image features. This design requires no a priori knowledge of the input image; the network simultaneously predicts the objectness scores and coarse locations of regions containing vertebrae using the generated multi-scale anchors. (2) The self-calibration recognition network (SRN, Sect. 2.2, Fig. 4) embeds message passing into a multi-task recognition network to recognize vertebrae by predicting class probabilities and refined bounding boxes based on the coarse locations obtained by HDN. The embedded message passing leverages the label relationships among different vertebrae to effectively calibrate the class probabilities of wrong recognitions caused by FOV variation and/or pathological deformation.

2.1 Hierarchical Detection Network (HDN)

HDN coarsely localizes regions containing vertebrae and handles the scale/resolution challenge through three carefully designed cascaded modules:

The anchor generator (Fig. 3(a)) equidistantly samples grid points from the original input image and places boxes of different sizes and aspect ratios (namely, anchors) centered on the grid points. The sizes (16, 32, and 64 pixels in our work) and aspect ratios (1:1, 1:1.5, and 1:2 in our work) are chosen to approximate the sizes and aspect ratios of vertebrae in the input images, and the sampling distance (8 pixels) is chosen to ensure that the anchors are dense enough to cover all vertebrae. This configuration guarantees that a vertebra of any scale/aspect ratio at any possible location can be detected by a similar-shaped anchor in its vicinity.
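To make the anchor configuration concrete, the following minimal NumPy sketch (our illustration, not the released implementation) enumerates the anchors described above for a 512 × 512 input; interpreting the aspect ratio as height/width with the stated area preserved is our assumption.

```python
import numpy as np

def generate_anchors(image_size=512, stride=8,
                     sizes=(16, 32, 64), ratios=(1.0, 1.5, 2.0)):
    """Return an (N, 4) array of anchors as (x1, y1, x2, y2) boxes."""
    centers = np.arange(stride // 2, image_size, stride, dtype=np.float32)
    cx, cy = np.meshgrid(centers, centers)          # grid of anchor centers
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size / np.sqrt(ratio)               # width shrinks ...
            h = size * np.sqrt(ratio)               # ... as height grows
            anchors.append(np.stack([cx - w / 2, cy - h / 2,
                                     cx + w / 2, cy + h / 2], axis=-1))
    return np.concatenate([a.reshape(-1, 4) for a in anchors], axis=0)

anchors = generate_anchors()
print(anchors.shape)  # (512/8)^2 grid points x 9 anchors = (36864, 4)
```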

Fig. 3. The hierarchical detection network (HDN). The anchor generator (Fig. 3(a)) places anchors of different scales/aspect ratios at different locations. The feature extraction module (Fig. 3(b)) extracts discriminative features that are robust to resolution change. The sibling detection module (Fig. 3(c)) simultaneously predicts the objectness score (OS) and bounding box correction (\(BBC_{1}\)) of each anchor, and the anchors are then refined into multi-scale proposals by the \(BBC_{1}\)'s. The configuration of anchor scales, feature scales, and anchor intervals is carefully designed to address the multi-scale challenge.

The feature extraction module (Fig. 3(b)) adopts a ResNet-like network as the feature extractor with two careful designs: (1) The output size of each ResNet stage (C1\(\sim \)C3 in Fig. 3(b)) is designed to equal one anchor scale (namely, 64, 32, and 16 for C1, C2, and C3). Redundant layers in the original ResNet [11] are removed to reduce network parameters and computational cost. (2) Top-down layers are introduced to explicitly combine high- and low-level features. The features P2\(\sim \)P3 are up-sampled and merged with their lower-level counterparts through lateral connections [12]. This design preserves the inherent advantages of ResNet (e.g., shortcuts that prevent the degradation problem in deeper networks, robustness to resolution change) while guaranteeing that features of the proper scale are associated with each anchor size.
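The top-down merging can be sketched as follows (our simplified illustration: nearest-neighbor up-sampling and already-matched channel counts; in the network itself the lateral path is a 1 × 1 convolution as in [12]):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def topdown_merge(p_high, c_low):
    """Merge a coarse top-down feature with its lateral lower-level feature;
    channel counts are assumed to match already (handled by a 1x1 conv in
    the real network), so a plain element-wise sum suffices here."""
    return upsample2x(p_high) + c_low

C2 = np.random.rand(32, 32, 256).astype(np.float32)  # lower-level feature
P3 = np.random.rand(16, 16, 256).astype(np.float32)  # higher-level feature
P2 = topdown_merge(P3, C2)                           # fused 32 x 32 feature
```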

The sibling detection module (Fig. 3(c)) feeds the extracted features P1\(\sim \)P3 into a 3 \(\times \) 3 convolutional layer followed by two sibling 1 \(\times \) 1 convolutional layers to predict the objectness scores (OS, the probability of each anchor containing a vertebra) and the first-stage bounding box corrections (\(BBC_{1}\)) of the anchors. Then, as inspired by Faster R-CNN [10], the \(BBC_{1}\)'s are applied to the anchors to obtain coarse detection boxes (called proposals as in [10]). Proposals with high OS are locations where vertebrae probably exist.
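Since applying the \(BBC_{1}\)'s follows Faster R-CNN [10], a sketch of the standard box-delta decoding from [10] is given below; whether HDN uses exactly this encoding is our assumption.

```python
import numpy as np

def apply_bbc(anchors, deltas):
    """Decode (dx, dy, dw, dh) corrections into refined boxes using the
    standard Faster R-CNN parameterization [10]. Boxes are (x1, y1, x2, y2)."""
    w  = anchors[:, 2] - anchors[:, 0]
    h  = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    pcx = cx + deltas[:, 0] * w        # shift the center ...
    pcy = cy + deltas[:, 1] * h
    pw  = w * np.exp(deltas[:, 2])     # ... and rescale the size
    ph  = h * np.exp(deltas[:, 3])
    return np.stack([pcx - 0.5 * pw, pcy - 0.5 * ph,
                     pcx + 0.5 * pw, pcy + 0.5 * ph], axis=1)
```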

2.2 Self-calibration Recognition Network (SRN)

SRN recognizes the label/bounding box of each vertebra and handles the appearance/FOV challenges through two coherently integrated modules:

The pre-recognition module (Fig. 4(a)) adopts ROI-pooling [10] to choose one feature map from P3\(\sim \)P1 according to the size of each proposal, then crops and resizes the feature to 32 \(\times \) 32. The resized features are fed into a two-sibling network similar to that in HDN to pre-recognize class-aware labels and bounding boxes. This is implemented by predicting a class probability vector (CPV) and second-stage bounding box corrections (\(BBC_{2}\)) for all classes. For each pre-recognition, the CPV is a vector whose elements are the probabilities of the pre-recognized vertebra having each label; the predicted label is the index of the CPV's maximum, and the recognition confidence score (RS) is the maximum value (i.e., if the kth element of the CPV is the maximum, this vertebra is predicted to have label k and its RS is the kth value of the CPV). The \(BBC_{2}\) corresponding to the predicted label is chosen and applied to the proposal to obtain the pre-recognition bounding box.
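The selection logic just described can be written as a minimal sketch (ours, for illustration only):

```python
import numpy as np

def pre_recognize(cpv, bbc2):
    """Read off the predicted label, confidence score (RS), and the
    class-specific box correction from one pre-recognition.
    cpv:  (N,) class probability vector
    bbc2: (N, 4) per-class second-stage box corrections"""
    label = int(np.argmax(cpv))    # index of the CPV's maximum
    rs = float(cpv[label])         # recognition confidence score
    return label, rs, bbc2[label]  # BBC2 of the predicted class is applied
```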

Fig. 4. The self-calibration recognition network (SRN). The pre-recognition module (Fig. 4(a)) classifies the proposals into class-aware labels and refines the proposals into bounding boxes. The self-calibration module (Fig. 4(b)) filters out “easy” wrong recognitions and uses message passing to correct “hard” wrong detections. One iteration of message passing is shown in Fig. 4(b2), where the CPV of one pre-recognition (\(\mathbf{b}_5^{t}\) in this figure) receives messages from its neighbors (\(\mathbf{b}_4^{t}\) and \(\mathbf{b}_6^{t}\)) via the \(\mathbf {\Psi }\) matrix. The messages contain the CPV's of all other pre-recognitions and help adjust \(\mathbf{b}_5^{t}\) to reach label compatibility. This self-calibration process corrects the recognition errors caused by appearance deformation in arbitrary FOVs. (Color figure online)

The self-calibration module (Fig. 4(b)) creatively formulates message passing [13] to correct wrong pre-recognitions by calibrating their CPV's. The essence of message passing is that it adjusts the CPV of an arbitrary pre-recognition using the other CPV's by combining them with a label compatibility matrix \(\mathbf {\Psi }\) and absorbing the combination into the target CPV (detailed in Eq. 1); if the index of the maximum of the adjusted CPV changes, the predicted label changes accordingly. Thanks to our HDN and pre-recognition network, the majority of our pre-recognitions have acceptable CPV's with correct labels and high RS values. Thus, the adjustments of the CPV's move in the correct direction, i.e., the undesired CPV's are calibrated and their maxima are pushed toward the correct indices. This calibration is performed in an iterative manner [13]; we reformulate it into Eq. 1, which shows how an arbitrary CPV absorbs messages (i.e., combinations of CPV's) from the others:

$$\begin{aligned} \mathbf{b}_i^{t+1} = \frac{1}{Z}\,\mathbf{b}_i^{t}\prod _{j\in v(i)}\mathbf{m}_{ji}, \quad \quad \mathrm{where}\quad \mathbf{m}_{ji} = \sum _{label_j \in \mathcal {L}} \mathbf{b}_j^{t}\otimes \mathbf {\Psi } \otimes \prod _{k\in v(j)\backslash i}\mathbf{m}_{kj} \end{aligned}$$
(1)

where: (1) The CPV of the ith pre-recognition (\(\mathbf{b}_i^{t+1}\)) is iteratively updated by the product of its previous value (\(\mathbf{b}_i^{t}\)) and the messages it receives from all its neighbors \(\left\{ j \right\} \) (\(\mathbf{m}_{ji}\)); (2) The message from the jth to the ith pre-recognition (\(\mathbf{m}_{ji}\)) combines the CPV's with the label compatibility matrix through the element-wise product (\(\otimes \)) of: (i) the CPV of the jth pre-recognition (\(\mathbf{b}_j^{t}\)), (ii) each row of the label compatibility matrix (\(\mathbf {\Psi }\), an N\(\times \)N trainable parameter), and (iii) the messages that flow into the jth pre-recognition from its neighbors other than the ith (\(\mathbf{m}_{kj}\)); (3) The notation \(\sum _{label_j \in \mathcal {L}}\) means summing over labels, which considers all possible labels of the jth pre-recognition to form a comprehensive message; (4) Z is a normalization constant that forces each CPV to sum to 1.

In our vertebrae recognition work, the ith pre-recognition only neighbors the \(i-1\)th and \(i+1\)th (i.e., the CPV \(\mathbf{b}_i^{t+1}\) in Eq. 1 simplifies to \(\frac{1}{Z}\mathbf{b}_i^{t}\mathbf{m}_{i-1,i}\mathbf{m}_{i+1,i}\)), and \(\mathbf{m}_{i-1,i}\)/\(\mathbf{m}_{i+1,i}\) depends only on the \(i-2\)th/\(i+2\)th pre-recognition (e.g., \(\mathbf{m}_{i-1,i}=\sum _{label_{i-1} \in \mathcal {L}}\mathbf{b}_{i-1}^{t}\otimes \mathbf {\Psi }\otimes \mathbf{m}_{i-2,i-1}\)). Thus, by substituting the expression of the message \(\mathbf{m}\) in turn, the CPV's of all pre-recognized vertebrae are combined by the \(\mathbf {\Psi }\) matrix and absorbed into \(\mathbf{b}_i^{t+1}\), which means all other CPV's are used to calibrate the ith CPV. After all CPV's have exchanged their messages and the iteration reaches stability, the cross entropy loss (\(\mathrm{loss_{mp}} = -\sum _{i=1}^{N_o}\sum _{j=1}^{N}\mathbf{y}_i(j) \ln \mathbf{b}_i(j)\), where \(\mathbf{y}_i\) is the one-hot ground truth label of the ith vertebra) is calculated and minimized together with the classification and localization losses of the pre-recognition network to train SRN.
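To make Eq. 1 concrete for this chain case, the following NumPy sketch (our illustration, not the released implementation) iteratively calibrates a set of CPV's and evaluates \(\mathrm{loss_{mp}}\). The message schedule, the use of \(\mathbf {\Psi }\)/\(\mathbf {\Psi }^{T}\) for the two passing directions, and the hand-built illustrative \(\mathbf {\Psi }\) are all assumptions; in the network, \(\mathbf {\Psi }\) is trainable.

```python
import numpy as np

def calibrate_cpvs(B, Psi, iters=10):
    """Sum-product message passing (Eq. 1) on the chain of pre-recognitions.
    B:   (No, N) array; row i is the CPV b_i of the ith pre-recognition.
    Psi: (N, N) label compatibility matrix."""
    No, N = B.shape
    fwd = np.ones((No, N))  # fwd[i]: message i -> i+1
    bwd = np.ones((No, N))  # bwd[i]: message i -> i-1
    for _ in range(iters):
        for i in range(No - 1):                       # left-to-right pass
            inc = fwd[i - 1] if i > 0 else np.ones(N)
            m = (B[i] * inc) @ Psi                    # sum over sender labels
            fwd[i] = m / m.sum()                      # normalize for stability
        for i in range(No - 1, 0, -1):                # right-to-left pass
            inc = bwd[i + 1] if i < No - 1 else np.ones(N)
            m = (B[i] * inc) @ Psi.T
            bwd[i] = m / m.sum()
        # each node absorbs the messages from its two neighbors (Eq. 1)
        msg_left  = np.vstack([np.ones((1, N)), fwd[:-1]])
        msg_right = np.vstack([bwd[1:], np.ones((1, N))])
        Bt = B * msg_left * msg_right
        B = Bt / Bt.sum(axis=1, keepdims=True)        # 1/Z normalization
    return B

def loss_mp(B, labels):
    """Cross-entropy loss_mp against integer ground truth labels
    (equivalent to the one-hot form; epsilon guards against log(0))."""
    return -float(np.sum(np.log(B[np.arange(len(labels)), labels] + 1e-12)))

# Illustrative Psi: adjacent pre-recognitions should carry consecutive
# labels, so sender label j is most compatible with receiver label j+1.
N = 17
Psi = np.full((N, N), 0.01)
Psi[np.arange(N - 1), np.arange(1, N)] = 1.0
B0 = np.random.dirichlet(np.ones(N), size=6)  # 6 noisy CPVs
B_cal = calibrate_cpvs(B0, Psi)
```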

Our SRN reinforces the mutual benefit of the pre-recognition network and the message passing scheme. During training, the label accuracy is very high (near 100%) once the pre-recognition network (and the HDN) reaches stability, which provides reliable CPV's to train the \(\mathbf {\Psi }\) matrix; there are no false positives or wrong predictions to perturb the training. During testing, we first filter out the “easy” false positives (e.g., those at wrong positions, as demonstrated by the yellow box in Fig. 4(a2)) using the predicted x coordinate: if the predicted x coordinate of a vertebra is inappropriate, i.e., its deviations from the x coordinates of the neighboring vertebrae exceed a threshold (one vertebra width in our work), the pre-recognition is judged to be a false positive. This procedure rejects the “easy” false positives, leaving only the “hard” wrong pre-recognitions with approximately correct positions but wrong labels (e.g., the upper green box with label L3 in Fig. 4(a2)) to be fed into the message passing scheme. The wrong labels are then corrected by calibrating the CPV's using Eq. 1 (Fig. 4(b2)). Moreover, since a \(BBC_{2}\) is predicted for each class, the bounding boxes can be automatically refined by choosing the \(BBC_{2}\) of the corrected class.
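Our reading of this filtering rule can be sketched as follows (using box centers and the mean box width as the "one vertebra width" threshold; both details are assumptions):

```python
import numpy as np

def filter_easy_false_positives(boxes):
    """Reject pre-recognitions whose x coordinate deviates from all
    neighbors by more than one vertebra width.
    boxes: (No, 4) array of (x1, y1, x2, y2), ordered along the spine."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2         # box center x
    width = (boxes[:, 2] - boxes[:, 0]).mean()   # typical vertebra width
    keep = []
    for i in range(len(boxes)):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(boxes)]
        # keep the box only if it stays close to at least one neighbor
        if any(abs(cx[i] - cx[j]) <= width for j in neighbors):
            keep.append(i)
    return boxes[keep]
```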

Fig. 5. The Hi-scene network achieves high vertebrae detection performance on a challenging dataset with different FOVs, image characteristics, and modalities. The dotted boxes are the detection boxes with confidence scores, and the solid boxes are ground truth boxes.

3 Experiments and Results

Hi-scene has been intensively evaluated on a challenging dataset of 450 arbitrary MRI images with different image characteristics (such as vertebrae appearance, image resolution, and intensity distribution) and FOVs (covering S1\(\sim \)T12, S1\(\sim \)T11, S1\(\sim \)T10, L5\(\sim \)T11, L5\(\sim \)T10, and L4\(\sim \)T10; each FOV has \(\sim \)75 images). 2D slices (not necessarily the mid-sagittal slices) of each 3D MRI scan are automatically extracted and resized to 512 \(\times \) 512. Hi-scene is implemented in Python 3.6 on Tensorflow 1.2 and trained with a momentum optimizer and exponential learning rate decay. The initial learning rate is 1e–3, the decay factor is 0.96 per 10 epochs, and the momentum is 0.9. Training is run on an NVIDIA GTX1080 GPU for 20000 steps (\(\sim \)111 epochs). We use standard five-fold cross-validation for evaluation; the number of images of each FOV is kept approximately the same in the training/testing split of each fold, and the five per-fold results are averaged to produce a single result. Four metrics are used to evaluate the detection performance: (1) Image accuracy, the ratio \(\frac{correctly~recognized~images}{all~images}\). This is a rather strict metric because an image is considered correctly recognized only if all vertebrae in the image are correctly recognized. (2) Identification rate (IDR), which measures the accuracy of individual vertebra classification [3]. (3) \(mAP_{75}\), the mean AP (average precision) at IoU (intersection over union) threshold 0.75 over all images; this metric is detailed in [14] and is widely used in the object detection domain [10]. (4) mIoU, the average IoU between the detected and ground truth bounding boxes over all vertebrae in all images.
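For concreteness, the IoU underlying \(mAP_{75}\) and mIoU, and the strict image accuracy, can be computed as in the following sketch (\(mAP_{75}\) itself follows the standard definition in [14]):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def image_accuracy(per_image_correct):
    """Fraction of images in which *every* vertebra is correctly
    recognized (the strict metric defined above); the argument is a
    boolean array with one entry per image."""
    return float(np.mean(per_image_correct))
```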

Fig. 6. Four evaluation metrics indicate that Hi-scene can accurately classify and localize vertebrae in images of different FOVs, image characteristics, and vertebrae appearances. (Color figure online)

Qualitative Demonstration. Figure 5 demonstrates that the Hi-scene network achieves high vertebrae classification accuracy and bounding box localization precision. The detected boxes (dashed) have correct labels and high overlaps with their ground truth boxes (solid) despite the varied FOVs and vertebrae appearances in arbitrary images (for example, Fig. 5(c) and (d) both contain 8 vertebrae, but their FOVs and vertebrae appearances differ).

Quantitative Analysis. As shown in Fig. 6, the average image accuracy reaches 0.933 (the first black bar), which means that all the vertebrae are correctly recognized, without false positives/negatives, in 93.3% of the input images. Image accuracies are high for the individual FOVs (the other six black bars); even the most difficult FOV (L4\(\sim \)T10, which tends to be confused with FOV L5\(\sim \)T11) has high accuracy (0.867). The mean IDR reaches 0.96, showing high classification performance for individual vertebrae. The mean \(mAP_{75}\) (the first blue bar) reaches 0.964, which shows that the recognition network maintains high precision at high recalls [14]. The mean mIoU (the first pink bar) reaches 0.919, which means the recognized bounding boxes have high overlaps with the ground truths. These results indicate that our Hi-scene successfully addresses the challenge of vertebrae recognition from arbitrary spine MRI.

Comparison with the State-of-the-Art. Our approach is compared with three other methods [1, 3, 10]. Since only [10] has published code, we made our best effort to re-implement [3] and a 2D version of [1] and tuned their hyper-parameters for better performance. The re-implementation of [3] reaches the reported IDR on our dataset. However, probably because we lack the implementation details of their strategy (using cropped images to train a CNN and then converting the CNN into an FCN to process the whole image), the re-implementation of [1] does not reach the reported IDR: quite a few false positives occur in the FCN results, which are not easy to distinguish and harm the subsequent training/inference of the LSTM. As shown in Table 1, Hi-scene outperforms both the state-of-the-art methods and our baseline without message passing in all four metrics used in Fig. 6 as well as in localization error (Loc-Err). In particular, the image accuracy of our method is significantly higher because in most wrongly recognized images only a minority of vertebrae are incorrect, and these can be effectively corrected by message passing. Another interesting result is that although [3] and our work both use message passing, our work performs better because our HDN and the pre-recognition network in SRN successfully handle images of arbitrary scales/characteristics/FOVs; the pre-recognized vertebrae are more precise and contain fewer false positives than the predicted pixel-wise probability maps of vertebra centroids, which allows the message passing algorithm to propagate correct probabilities for better CPV calibration.

Table 1. Comparison with the state-of-the-art. MP stands for message passing.

4 Conclusion

In this paper, we develop the hierarchical self-calibration recognition network (Hi-scene) to recognize vertebrae in arbitrary spine MRI images. It consists of two novel networks: a hierarchical detection network for detecting multi-scale/resolution regions containing vertebrae, and a self-calibration recognition network for recognizing the label and bounding box of each vertebra and automatically correcting wrong recognitions caused by FOV variation and pathological deformations. Its performance and effectiveness are demonstrated by extensive experiments.