
1 Introduction

Recently, deep convolutional neural networks (CNNs) have shown promising performance in various computer vision tasks such as classification [6, 9], localization [2], and segmentation [12]. Among these tasks, object localization (lesion localization in medical images) is one of the most challenging. It typically requires many training images with bounding-box (or pixel-level) annotations of regions-of-interest (ROIs), and building a dataset with such location information demands heavy annotation effort. The annotation cost is even greater for medical images because only experts can interpret them. To alleviate this problem, several methods for weakly supervised localization have been proposed that use only weakly labeled (i.e., image-level) datasets [10, 11, 14]. These approaches require models pre-trained on relatively well-localized datasets (e.g., ImageNet [3]) to transfer good initial features for localization. We therefore cannot expect good performance in the medical image domain, where such domain-specific, well-trained features are unavailable.

In this work, we propose a self-transfer learning (STL) framework for weakly supervised lesion localization in medical images. STL co-optimizes classification and localization networks simultaneously in order to guide the localization network with the features that are most discriminative for the classification task (see Fig. 1). The proposed method requires neither location information nor any type of prior for training. We show that previous approaches are not effective on their own without good initial features, since errors are back-propagated through a restricted path or with insufficient information.

Fig. 1. Overall architecture of STL (\(\mathbf{F}_{\mathbf{cls}}\) denotes the fully connected classification layers; \(\mathbf{C}_{\mathbf{loc}}\) and \(\mathbf{P}_{\mathbf{loc}}\) denote a \(1\times 1\) convolutional layer and a global pooling layer, respectively). The final objective function \(\mathbf{Loss}_{\mathbf{total}}\) is a weighted sum of \(\mathbf{Loss}_{\mathbf{cls}}\) and \(\mathbf{Loss}_{\mathbf{loc}}\) with a controllable hyperparameter \(\alpha\). Self-transfer learning is realized by adaptively re-weighting \(\alpha\) during the training phase.

Related Work.  We consider recent CNN-based methods that show promising performance on weakly supervised object localization [10, 11, 14, 15]. Their common strategy is to produce an activation map (in other words, a score map) for each class, and to select or extract a representative activation value from it. The dimensions of those maps are determined automatically by the network architecture. If such a network is trained well, a target object can be localized simply by examining the activation map corresponding to its class.

To select or extract the representative activation for each class, standard pooling methods can be used. In [10], global max pooling is used, and its classification and localization performance is verified in the domain of general images. Another choice is global average pooling. As discussed in [15], it may be better for localization, since global max pooling focuses on the most discriminative part only, while global average pooling discovers all such parts as much as possible. Inferring a segmentation map is more challenging than object localization, since it amounts to pixel-level classification. In [11], a Log-Sum-Exp (LSE) pooling method, a smooth version of max pooling, is adopted to explore the entire feature maps, and smoothing priors are also considered to obtain fine-grained segmentation maps. A sketch of these three pooling choices is given below.
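As a concrete illustration, the following PyTorch sketch contrasts the three global pooling choices on a batch of per-class activation maps of shape (N, K, H, W). The function names and the LSE sharpness parameter r are our own illustrative assumptions, not the exact implementations of [10, 11, 15].

```python
import math
import torch

def global_max_pool(maps):
    # Keep only the most discriminative location of each map [10].
    return maps.amax(dim=(2, 3))

def global_avg_pool(maps):
    # Average over all locations, encouraging full-extent responses [15].
    return maps.mean(dim=(2, 3))

def global_lse_pool(maps, r=5.0):
    # Log-Sum-Exp pooling [11]: smoothly interpolates between max
    # (r -> infinity) and average (r -> 0) pooling.
    n, k, h, w = maps.shape
    flat = maps.view(n, k, h * w)
    return (torch.logsumexp(r * flat, dim=2) - math.log(h * w)) / r

scores = global_avg_pool(torch.randn(2, 2, 15, 15))  # (N, K) class scores
```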

These approaches can be interpreted as variants of multiple instance learning (MIL), which is designed for classification problems where labels are associated with sets of instances, called bags, rather than with individual instances. In image classification, the full-size image and its subsampled patches correspond to a bag and its instances, respectively. For instance, using global max pooling to select a representative value among patch activations is equivalent to using a single well-classified patch to build the decision boundary. Strictly speaking, however, current approaches are not generally applicable since they essentially require features well trained on semantically similar datasets.

2 Self-Transfer Learning for Weakly Supervised Learning

STL consists of three main components: shared convolutional layers, fully connected layers (i.e., the classifier), and localization layers (i.e., the localizer) (see Fig. 1). The key features of STL are twofold. First, it propagates errors backward from both the classifier and the localizer simultaneously, to prevent the localizer from wandering the loss surface in search of a local optimum. Second, an adjustable hyperparameter \(\alpha\) controls the relative importance of the classifier and the localizer. Two losses, \(\mathbf{Loss}_{\mathbf{cls}}\) from the classifier and \(\mathbf{Loss}_{\mathbf{loc}}\) from the localizer, are computed in the forward pass, and their weighted sum is propagated in the backward pass. The errors from the classifier help train the filters with a view of the whole image, while those from the localizer are backpropagated through the subsampled region that is most important for classifying the training set. At the early stage of training, the errors from the classifier should be weighted more heavily than those from the localizer, to keep the localizer from falling into a bad local optimum. By reducing the effect of the localizer's errors, filters with good discriminative power can be trained even when the localizer fails to find the objects associated with the class label. As training proceeds, the weight on the localizer is increased to focus on the subsampled region of the input image; at this stage, the network's filters are fine-tuned for the localization task. A minimal sketch of this branched architecture is given below.
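The following PyTorch sketch shows the branch structure only; the trunk and layer sizes are placeholders for illustration, not the actual network of Sect. 3 (which is a modification of [9]).

```python
import torch
import torch.nn as nn

class STLNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # C: shared convolutional layers (placeholder trunk).
        self.shared = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # F_cls: fully connected classification layers.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes),
        )
        # C_loc: 1x1 convolution giving one activation map per class.
        self.loc_conv = nn.Conv2d(64, num_classes, kernel_size=1)
        # P_loc: global pooling over each activation map.
        self.loc_pool = nn.AdaptiveAvgPool2d(1)  # or nn.AdaptiveMaxPool2d(1)

    def forward(self, x):
        feats = self.shared(x)                  # shared features C
        y_cls = self.classifier(feats)          # classifier logits
        maps = self.loc_conv(feats)             # per-class activation maps
        y_loc = self.loc_pool(maps).flatten(1)  # localizer logits
        return y_cls, y_loc, maps
```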

Consider a dataset of N input-target pairs \(\{\mathbf {x}_i,\mathbf {t}_i\}_{i=1}^N\), where \(\mathbf {x}_i\) and \(\mathbf {t}_i\) denote the i-th image and its corresponding K-dimensional true label vector, respectively, with K the number of classes. Assuming each image has a single class label, our objective function is a weighted sum of the cross-entropy losses of the classifier and the localizer, defined as follows:

$$\begin{aligned} \mathbf{Loss}_{\mathbf{total}}&= (1-\alpha )\,\mathbf{Loss}_{\mathbf{cls}} + \alpha \,\mathbf{Loss}_{\mathbf{loc}} \\&= -(1-\alpha ) \textstyle \sum _{i=1}^{N} \mathbf {t}_{i}^\intercal \log ({\mathbf {y}_i^{cls}}) - \alpha \sum _{i=1}^{N} \mathbf {t}_{i}^\intercal \log ({\mathbf {y}_i^{loc}}) \end{aligned}$$
(1)

where \(\mathbf {y}_i^{cls}\) and \(\mathbf {y}_i^{loc}\) are the K-dimensional class probability vectors from the classifier and the localizer, respectively, for the i-th input, and \(\log (\cdot )\) denotes an element-wise log operation.
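Assuming raw logits from the two branches and integer class labels (rather than the one-hot vectors of Eq. (1), which F.cross_entropy handles internally), the weighted objective can be sketched as follows, reusing the names from the architecture sketch above:

```python
import torch.nn.functional as F

def stl_loss(y_cls, y_loc, targets, alpha):
    # Convex combination of the two cross-entropy terms, as in Eq. (1).
    loss_cls = F.cross_entropy(y_cls, targets)  # classifier branch
    loss_loc = F.cross_entropy(y_loc, targets)  # localizer branch
    return (1 - alpha) * loss_cls + alpha * loss_loc
```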

The effect of the proposed STL can be explained by examining the backpropagation process at the end of the shared convolutional layers \(\mathbf {C}\). Suppose that node i is a particular node in \(\mathbf {C}\) connected to H nodes in \(\mathbf{F}_{\mathbf{cls}}\) and K nodes in \(\mathbf{C}_{\mathbf{loc}}\). Note that \(\mathbf{C}_{\mathbf{loc}}\) is obtained by \(1\times 1\) convolution on \(\mathbf {C}\), as shown in Fig. 1, and K equals the number of activation maps (i.e., the number of classes). If the ReLU activation function is used at node i, the backpropagated error \(\delta _i\) at node i can be written as follows:

$$\begin{aligned} \delta _i = \max (0, \delta ^{cls}_i + \delta ^{loc}_i)~~\text {where}~~\delta ^{cls}_i = \textstyle \sum _{j=1}^{H} {w_{ji} \delta _j},~\delta ^{loc}_i = \sum _{k=1}^{K} {w_{ki} \delta _k} \end{aligned}$$
(2)

It should be noted that the relative importance of the classifier and the localizer is already reflected in the errors \(\delta ^{cls}_i\) and \(\delta ^{loc}_i\) through the weighted loss function \(\mathbf{Loss}_{\mathbf{total}}\). Without \(\delta ^{cls}_i\), the errors \(\delta ^{loc}_i\) are backpropagated undesirably due to the special treatment, global pooling, of the activation maps in \(\mathbf {C}_{\mathbf {loc}}\). For instance, if global max pooling is used to aggregate the activations within each activation map and the location corresponding to node i in \(\mathbf {C}\) is not selected as the maximum, all \(\delta _k\)'s backpropagated from \(\mathbf {C}_{\mathbf {loc}}\) will be zero. Hence the computed errors will be zero at most nodes in \(\mathbf {C}\), except for those whose locations correspond to the maximal response of each activation map. With global average pooling, the zero errors are merely replaced with a mean of errors. This situation is certainly not desirable, especially when we train the network from scratch (i.e., without pre-trained filters). By incorporating the classifier into the network architecture, the shared convolutional layers \(\mathbf {C}\) can be improved consistently even when the backpropagated errors \(\delta ^{loc}_i\) from the localizer do not contribute to learning useful features. The small check below illustrates this gradient behavior.
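The following small experiment, using the pooling sketches above, verifies the described behavior directly: with global max pooling only the argmax location of an activation map receives a nonzero error, while global average pooling spreads a uniform error over all locations.

```python
import torch

maps = torch.randn(1, 1, 4, 4, requires_grad=True)

# Global max pooling: only the argmax location receives a gradient.
maps.amax(dim=(2, 3)).sum().backward()
print(maps.grad.count_nonzero().item())  # 1 of 16 locations

# Global average pooling: every location receives the same small gradient.
maps.grad = None
maps.mean(dim=(2, 3)).sum().backward()
print(maps.grad.unique())  # tensor([0.0625]) = 1/16 everywhere
```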

It should be noted that STL differs from multi-task learning (MTL). The two look similar because of the branched architecture and multiple objectives, but STL solves exactly the same task in both branches and therefore needs no extra supervision, whereas MTL jointly trains several distinct tasks with separate losses. It is thus more appropriate to see the classifier in STL as an auxiliary component that enables successful training of the localizer.

3 Computational Experiments

In this section, we use two medical image datasets, chest X-rays (CXRs) and mammograms, to evaluate the classification and localization performance of STL. All training CXRs and mammograms are resized to 500\(\,\times \,\)500. The network architecture used in this experiment is a slightly modified version of the network from [9] (Footnote 1). For the localizer, a \(15\times 15\) activation map per class is obtained via a \(1\times 1\) convolution. Two global pooling methods, max [10] and average [15] pooling, are applied to the activation maps. The network is trained via stochastic gradient descent with momentum 0.9 and a minibatch size of 64. STL has an additional hyperparameter \(\alpha\) that determines the relative importance of the classifier and the localizer. We set its initial value to 0.1 so that the network focuses on learning representative features at the early stage, and increase it to 0.9 after 60 epochs to fine-tune the localizer. A sketch of this schedule appears below.
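The following sketch shows this training configuration. The learning rate is a placeholder, and the names model, stl_loss, and train_loader (assumed to yield minibatches of 64 image-label pairs) are carried over from the earlier sketches.

```python
import torch

model = STLNet(num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(100):
    # Emphasize the classifier early on, then shift weight to the localizer.
    alpha = 0.1 if epoch < 60 else 0.9
    for images, targets in train_loader:  # minibatches of size 64
        optimizer.zero_grad()
        y_cls, y_loc, _ = model(images)
        loss = stl_loss(y_cls, y_loc, targets, alpha)
        loss.backward()
        optimizer.step()
```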

To compare classification performance, we use the area under the ROC curve (AUC), accuracy, and the average precision (AP) of each class. For STL, the class probabilities obtained from the localizer are used to measure performance. For the localization task, a performance metric similar to that of [10] is used. It is based on AP, but differs in how true positives and false positives are counted. In classification, a test image is a true positive if its class probability exceeds some threshold. Under the localization metric, a test image whose class probability exceeds the threshold (i.e., a true positive in the classification sense) but whose maximal response in the activation map does not fall within the ground-truth annotations, allowing some tolerance, is counted as a false positive. In our experiments, only the positive class is considered for localization AP, since there is no ROI in the negative class. First, the activation map of the positive class is resized to the size of the original image via simple bilinear interpolation; then we examine whether the maximal response falls within the ground-truth annotations with a 16-pixel tolerance, which is half of the global stride (32) of the considered network architecture. If the response is located inside the true annotations, the test image is counted as a true positive; otherwise, it is counted as a false positive. The sketch below makes this rule concrete.
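The following sketch illustrates the counting rule for a single test image; is_inside_gt is a hypothetical helper encapsulating the check against the (tolerance-dilated) ground-truth annotations.

```python
import torch
import torch.nn.functional as F

def localization_hit(act_map, image_size, gt_mask, tolerance=16):
    # Bilinearly upsample the positive-class activation map to image size.
    up = F.interpolate(act_map[None, None], size=image_size,
                       mode='bilinear', align_corners=False)[0, 0]
    # Find the location of the maximal response.
    flat_idx = up.argmax().item()
    y, x = divmod(flat_idx, up.shape[1])
    # Count a hit only if the peak lies inside the ground-truth annotation,
    # allowing the given pixel tolerance (is_inside_gt is hypothetical).
    return is_inside_gt(y, x, gt_mask, tolerance)
```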

Tuberculosis Detection.  We use three CXR datasets in this experiment: the KIT, Shenzhen, and MC sets. All the CXRs used in this work are de-identified by the corresponding image providers. The KIT set contains 10,848 DICOM images from the Korean Institute of Tuberculosis, consisting of 7,020 normal and 3,828 abnormal (TB) cases. The Shenzhen (Footnote 2) and MC (Footnote 3) sets are available for research purposes from the authors of [1, 7, 8]. We train the models on the KIT set, and test classification and localization performance on the Shenzhen and MC sets. Since these testsets do not contain annotations of TB lesions, we obtained detailed annotations from a TB clinician to evaluate localization performance.

Fig. 2. Training curves and first-layer filters at 5,000 iterations in the case of average pooling

Table 1 summarizes the experimental results. For both classification and localization, STL consistently outperforms the other methods, with STL+AvePool the best-performing model. Global average pooling works well for localization, consistent with [15]. Since the localization AP is by definition always less than the classification AP, it is helpful to consider the improvement ratio when comparing performance. Regardless of the pooling method, the localization APs on both the Shenzhen and MC sets improve much more over the baselines (i.e., MaxPool and AvePool) than the classification APs do. This indicates that STL indeed helps the localizer find the ROIs that are most important in defining the class label. Figure 2 clearly shows the advantages of STL: faster training and better feature learning. Localization examples on the testsets are visualized in Fig. 3.

Mammography.  We use two public mammography databases, the Digital Database for Screening Mammography (DDSM) [4, 5] and the Mammographic Image Analysis Society (MIAS) database [13]. DDSM and MIAS are used for training and testing, respectively. We preprocess the DDSM images to have two labels, positive (abnormal) and negative (normal). Abnormal mammographic images originally contain several types of abnormalities, such as masses and microcalcifications. We merge all abnormality types into the positive class in order to distinguish any abnormality from normal, which yields 4,025 positive and 6,338 negative images in the training set (DDSM). The testset (MIAS) contains 112 positive and 210 negative images. Note that although the training set includes boundary information for abnormal ROIs, we use no information other than image-level labels for training. The boundary information of the testset is used to evaluate localization performance.

Table 1. Classification and localization performance for CXRs and mammograms (subscripts + and - denote the positive and negative classes, respectively)

Table 1 reports the classification and localization results (Footnote 4). As the table shows, classifying mammograms is much more difficult than TB detection. First, the mammograms used for training are low-quality images containing some degree of artifact and distortion introduced when the films were scanned to create digital images. Moreover, the task is inherently complicated because the negative class also exhibits quite a few irregular patterns, caused by the various shapes and characteristics of normal tissue. Nevertheless, STL is confirmed to be significantly better than the other methods for both classification and localization. Again, the localization performance improves much more over the baselines than the classification performance does, regardless of the pooling method. Figure 3 shows some localization examples on the testset.

Fig. 3. Localization examples for chest X-rays and mammograms. The top row shows test images with ground-truth annotations; the rows below show the results from MaxPool, AvePool, STL+MaxPool, and STL+AvePool, in that order. The activation map for the positive class is linearly scaled to the range between 0 and the maximum probability.

4 Conclusion

In this work, we propose STL, a novel framework that enables training a CNN for lesion localization without any location information or pre-trained models. Our framework jointly learns a classifier and a localizer using a weighted loss as the objective function, in order to prevent the localizer from falling into a bad local optimum. Self-transfer is realized via a weight that controls the relative importance of the classifier and the localizer. We also discuss the effect of the classifier on the localizer to provide the rationale behind the advantages of the proposed framework. Computational experiments on lesion localization given only image-level labels show that the proposed framework outperforms existing approaches in terms of both classification and localization performance metrics.