M-VAAL: Multimodal Variational Adversarial Active Learning For Downstream Medical Image Analysis Tasks
1 Center for Imaging Science, RIT, Rochester, NY, USA
2 Biomedical Engineering, RIT, Rochester, NY, USA
3 NepAl Applied Mathematics and Informatics Institute for Research (NAAMII)
4 University of Aberdeen, Aberdeen, UK
5 University College London, London, UK
1 Introduction
Automated medical image analysis tasks, such as feature segmentation and disease classification, play an important role in assisting with clinical diagnosis, as well as appropriate planning of non-invasive therapies, including surgical interventions [8,2]. In recent years, supervised deep learning-based methods have explored multimodal medical images in the context of AL [5]; however, none of the existing methods directly use the multimodal image as auxiliary information to learn the sampler.
In this paper, we propose a task-agnostic method that exploits multimodal imaging information to improve the AL sampling process. We modify the existing task-agnostic VAAL framework to enable it to exploit multimodal images and evaluate its performance on two widely used, publicly available datasets: the multimodal BraTS dataset [14] and the COVID-QU-Ex dataset [21], which contains lung segmentation maps as additional information.
The contributions of this work are as follows: 1) We propose a novel multimodal variational adversarial AL method (M-VAAL) that uses additional information from auxiliary modalities to select informative and diverse samples for annotation; 2) Using the BraTS2018 and COVID-QU-Ex datasets, we show the effectiveness of our method in actively selecting samples for segmentation, multi-label classification, and multi-class classification tasks; 3) We show that sample selection in AL can potentially benefit from the use of auxiliary information.
2 Methodology
Fig. 1: M-VAAL Pipeline: Our active learning method uses multimodal information (m1 and m2) to improve VAAL. M-VAAL samples the unlabelled images and selects samples complementary to the already annotated data, which are passed to the oracle for annotation. Incorporating auxiliary information from the second modality (m2) produces a more generalized latent representation for sampling. Our method learns task-agnostic representations; therefore, the latent space can be used for both classification and segmentation tasks to sample the best unlabelled images for annotation. (Refer to Sec. 2.2 for the meaning of each notation.)
one class from the set. Once a task is trained using labeled samples, we move
on to the sampler (B) stage to select the best samples for annotation. After
annotating the selected samples, we add them to the labeled sample pool and
retrain the task learner using the updated sample set. Thus, the task learner
is completely retrained for multiple rounds, with an increased training budget
each time.
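The overall loop can be summarized with the following structural sketch; the training, sampling, and annotation routines are passed in as placeholder callables (these names are illustrative and not taken from the released code):

```python
import random

def active_learning_loop(pool_indices, train_task, train_sampler, select, annotate,
                         initial_budget=200, budget=100, rounds=6, seed=0):
    """Skeleton of the AL loop described above: the task learner (A) is retrained
    from scratch each round on the growing labeled pool, and the sampler (B)
    picks the next batch of samples to send to the oracle. All callables are
    placeholders standing in for the actual training/selection/annotation code."""
    rng = random.Random(seed)
    labeled = set(rng.sample(sorted(pool_indices), initial_budget))
    unlabeled = set(pool_indices) - labeled

    for _ in range(rounds):
        task_model = train_task(sorted(labeled))                     # stage (A)
        sampler = train_sampler(sorted(labeled), sorted(unlabeled))  # stage (B)
        picked = set(select(sampler, sorted(unlabeled), budget))     # top-b query
        annotate(picked)                                             # oracle labels the picks
        labeled |= picked
        unlabeled -= picked
    return task_model
```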
Fig. 2: Histogram of the discriminator scores for unlabeled and labeled data at the third AL round. The discriminator is adversarially trained to push unlabeled samples toward lower values and labeled samples toward higher values. Our method selects unlabelled instances that are far from the peak of the labeled-data distribution. The number of samples to select is dictated by the AL budget.
labeled and unlabelled examples, with the end goal of improving the discriminator at distinguishing between labeled and unlabelled pairs. After the initial round of training, the trained discriminator is used to select the top b samples most confidently identified as belonging to the unlabelled set. These selected samples are sent to the oracle for annotation. Finally, the selected samples are removed from the unlabelled set and added to the labeled pool, along with their respective labels. The next round of AL proceeds in the same fashion, but with the updated labeled and unlabelled sets.
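A minimal sketch of this selection step is given below. It assumes a VAE whose encode method returns the latent mean and log-variance of the main modality and a discriminator that outputs one scalar score per sample, with lower scores indicating "unlabelled-looking" samples; function and variable names are ours.

```python
import torch

@torch.no_grad()
def select_top_b(vae, discriminator, unlabeled_loader, b):
    """Score every unlabelled sample with the trained discriminator and return the
    dataset indices of the b samples most confidently predicted as unlabelled
    (i.e., those with the lowest scores)."""
    vae.eval()
    discriminator.eval()
    all_scores, all_indices = [], []
    for images, indices in unlabeled_loader:      # loader yields (image, dataset index)
        mu, _ = vae.encode(images)                # latent code of the main modality
        all_scores.append(discriminator(mu).view(-1))
        all_indices.append(indices.view(-1))
    scores = torch.cat(all_scores)
    indices = torch.cat(all_indices)
    picked = torch.topk(-scores, k=b).indices     # lowest scores first
    return indices[picked].tolist()
```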
3 Experiments
3.1 Dataset
BraTS2018: We used the BraTS2018 dataset [14], which includes co-registered 3D MR brain volumes acquired using different acquisition protocols, yielding slices of T1, T2, and T2-Flair images. To sample informative examples from unlabelled brain MR images, we employed the M-VAAL algorithm using contrast-enhanced T1 sequences as the main modality and T2-Flair as auxiliary information. Contrast-enhanced T1 images are preferred for tumor detection, as the tumor border is more visible [4]. In addition, T2-Flair, which captures cerebral edema (a fluid-filled region due to swelling), can also be utilized for diagnosis [14]. Our focus was on the provided 210 High-Grade Glioma (HGG) cases, which include manual segmentations verified by experienced board-certified neuroradiologists. There are three foreground classes: Enhancing Tumor (ET), Edematous Tissue (ED), and Non-enhancing Tumor Core (NCR). In practice, the given foreground classes can be merged to create different sub-regions, such as the whole tumor and the tumor core, for evaluation [14]. The whole tumor comprises all foreground classes, while the tumor core comprises only ET and NCR.
Before extracting 2D slices from the provided 3D volumes, we randomly split the 210 volumes into training and test cases with an 80:20 ratio. The training set was further split into training and validation cases using the same 80:20 ratio, resulting in 135 training, 33 validation, and 42 test cases. These splits were created before extracting 2D slices to avoid any patient information leakage into the test split. Each 3D volume had 155 transverse (240 × 240) slices in axial view with a spacing of 1 mm. However, not all transverse slices contained tumor regions, so we extracted only those containing at least one of the foreground classes. Some slices contained only a few pixels of the foreground segmentation classes, so we ensured that each extracted slice had at least 1000 pixels representing the foreground class; any slice not meeting this threshold was discarded. Consequently, the curated dataset comprised 3673 training images, 1009 validation images, and 1164 test images of contrast-enhanced T1, T2-Flair, and the segmentation map.
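The slice-filtering rule can be sketched as follows, assuming each co-registered volume is loaded as a NumPy array with axial slices along the last axis (names are illustrative):

```python
import numpy as np

MIN_FOREGROUND_PIXELS = 1000  # threshold used to discard nearly empty slices

def extract_slices(t1ce_vol, flair_vol, seg_vol):
    """Yield (contrast-enhanced T1, T2-Flair, segmentation) 2D slices that contain
    at least MIN_FOREGROUND_PIXELS pixels of any foreground class (ET, ED, NCR
    encoded as nonzero labels in the segmentation volume)."""
    for z in range(seg_vol.shape[-1]):
        seg_slice = seg_vol[..., z]
        if np.count_nonzero(seg_slice) >= MIN_FOREGROUND_PIXELS:
            yield t1ce_vol[..., z], flair_vol[..., z], seg_slice
```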
We evaluated our method on two downstream tasks: whole tumor segmentation and multi-label classification. In the multi-label classification task, our prediction classes consisted of ET, ED, and NCR, with each image having either one, two, or all three classes. It is worth noting that for the downstream tasks, only contrast-enhanced T1 images were used, while M-VAAL made use of both contrast-enhanced T1 and T2-Flair images.
3.2 Implementation Details
BraTS2018: All input images have a single channel of size 240 × 240 pixels. To
preprocess the images, we removed the top and bottom 1% of intensities and normalized each slice linearly by dividing every pixel by the maximum intensity, bringing the pixel values into the range 0 to 1. We then standardized the images by subtracting the mean and dividing by the standard deviation of the training data. For VAAL and M-VAAL, the images were center-cropped to 210 × 210 pixels and resized to 128 × 128 pixels. For the downstream tasks, we used the original image size. Furthermore, to stabilize the training of VAAL and M-VAAL under hyperparameters similar to the original VAAL [19], we converted the single-channel input image to a three-channel RGB image with the same grayscale value in all channels.
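A sketch of this preprocessing is shown below, assuming the 1% intensity removal is implemented as percentile clipping; the final 128 × 128 resize, typically handled by an image library, is omitted:

```python
import numpy as np

def preprocess(img, train_mean, train_std):
    """Clip the top and bottom 1% of intensities, scale linearly to [0, 1] by the
    maximum intensity, then standardize with training-set statistics."""
    lo, hi = np.percentile(img, [1, 99])
    img = np.clip(img, lo, hi).astype(np.float32)
    img = img / max(float(img.max()), 1e-8)
    return (img - train_mean) / train_std

def to_sampler_input(img):
    """Center-crop a 240 x 240 slice to 210 x 210 and replicate the grayscale
    values across three channels, as required by VAAL / M-VAAL."""
    h, w = img.shape
    top, left = (h - 210) // 2, (w - 210) // 2
    crop = img[top:top + 210, left:left + 210]
    return np.stack([crop, crop, crop], axis=0)   # shape (3, 210, 210)
```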
For M-VAAL, we used the same β-VAE and discriminator as in the original VAAL [19], but added a batch normalization layer after each linear layer in the discriminator. Additionally, instead of using a vanilla GAN with binary cross-entropy loss, we used a WGAN with a gradient penalty to stabilize the adversarial training [7], with λ = 1. The latent dimension of the VAE was set to 64, and the initial learning rate for both the VAE and the discriminator was 1e−4. We tested γ3 over the set M = {0.2, 0.4, 0.8, 1} via an ablation study reported in Sec. 5, while γ1 and γ2 were set to 1. Both VAAL and M-VAAL used a mini-batch size of 16 and were trained for 100 epochs using the same hyperparameters, except for γ3, which is only present in M-VAAL. For consistency, we initialized the model at each stage with the same random seed and repeated each experiment with three different seeds, recording the average score.
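For reference, a standard WGAN gradient penalty term [7], here written over latent codes of labeled and unlabelled samples (which is how we interpret its use in the sampler), looks like the following sketch:

```python
import torch

def gradient_penalty(discriminator, labeled_z, unlabeled_z, lam=1.0):
    """Standard WGAN-GP term: penalize the gradient norm of the discriminator at
    random interpolations between latent codes of labeled and unlabelled samples."""
    eps = torch.rand(labeled_z.size(0), 1, device=labeled_z.device)
    interp = (eps * labeled_z + (1 - eps) * unlabeled_z).requires_grad_(True)
    d_out = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_out, inputs=interp,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```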
For the segmentation task, we used a U-Net [15] architecture with four down-sampling and four up-sampling layers, trained with an initial learning rate of 1e−5 using the RMSprop optimizer for up to 30 epochs with a mini-batch size of 32. The loss function was the sum of the pixel-wise cross-entropy loss and the Dice coefficient loss. The best model was identified by monitoring the best validation Dice score and was used on the test dataset. For the multi-label classification, we used a ResNet18 architecture pre-trained on the ImageNet dataset. Instead of normalizing the input images with our training set's mean and standard deviation, we used the ImageNet mean and standard deviation to make them compatible with the pre-trained ResNet18 architecture. We used an initial learning rate of 1e−5 with the Adam optimizer and trained for up to 50 epochs with a mini-batch size of 32. The best model was identified by monitoring the best validation mAP score and was used on the test set. We used AL to sample the best examples for annotation for up to six rounds for segmentation and seven rounds for classification, starting with 200 samples and adding 100 examples (the budget b) at each round.
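The segmentation loss above, the sum of pixel-wise cross-entropy and Dice loss, can be sketched for binary whole-tumor masks as follows (an illustrative implementation, not the exact code used):

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits, target, eps=1e-6):
    """Sum of pixel-wise cross-entropy and soft Dice loss for whole-tumor
    segmentation. logits: (N, 2, H, W); target: (N, H, W) long tensor in {0, 1}."""
    ce = F.cross_entropy(logits, target)
    fg_prob = torch.softmax(logits, dim=1)[:, 1]          # foreground probability map
    tgt = target.float()
    intersection = (fg_prob * tgt).sum(dim=(1, 2))
    denominator = fg_prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return ce + (1.0 - dice).mean()
```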
COVID-QU-Ex: All input images were loaded as RGB, with all channels having the same gray-scale value. To bring the pixel values into the range 0 to 1, we normalized each image by dividing all pixels by 255. Additionally, we normalized the images by subtracting the mean and dividing by the standard deviation, both set to 0.5. We used the same hyperparameters for VAAL and M-VAAL as in the BraTS experiments, and we also downsampled the original 256 × 256 images to 128 × 128 for both models. For the downstream multi-class classification task, we utilized a ResNet18 architecture pre-trained on the ImageNet dataset. We trained the model with an initial learning rate of 1e−5 using the Adam optimizer and a mini-batch size of 32 for up to 50 epochs. The best model was determined based on the highest validation overall accuracy and evaluated on the test set. Active sampling ran for up to seven rounds with an AL budget of 100 samples per round, starting from an initial labeled set of 100 samples.
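A torchvision-style sketch of this COVID-QU-Ex input pipeline is shown below; the released code may organize these steps differently:

```python
from torchvision import transforms

# COVID-QU-Ex inputs for VAAL / M-VAAL: load as RGB, downsample 256x256 -> 128x128,
# scale pixels to [0, 1], then normalize with mean 0.5 and std 0.5 per channel.
covid_transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),                              # [0, 255] -> [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```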
Our source code is available at https://github.com/Bidur-Khanal/MVAAL-medical-images. We implemented our method using the PyTorch 1.12.1 framework in Python 3.8 and trained on a single A100 GPU (40 GB).
4 Results
Segmentation: Fig. 3 compares the whole tumor segmentation performance (in
terms of Dice score) between our proposed method (M-VAAL), the two baselines
(random sampling and VAAL), and a U-Net trained on the entire fully labeled
dataset, serving as an upper bound with an average Dice score of 0.789.
As shown, with only 800 labeled samples, the segmentation performance
starts to saturate and approaches that of the U-Net trained on the fully labeled dataset. Moreover, M-VAAL performs better than the baselines in the early phase and gradually saturates as the number of training samples increases.
Fig. 4 illustrates a qualitative, visual assessment of the segmentation masks yielded by U-Net models trained with 400 samples selected by M-VAAL against ground truth, VAAL, and random sampling at the second AL round. White denotes regions identified by both segmentation methods. Blue denotes regions missed by the test method (i.e., the method listed first) but identified by the reference method (i.e., the method listed second). Red denotes regions identified by the test method but not by the reference method. As such, an optimal segmentation method will maximize the white regions while minimizing the blue and red regions. Furthermore, Fig. 4 clearly shows that the segmentation masks yielded by M-VAAL are more consistent with the ground truth segmentation masks than those generated by VAAL or random sampling.
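One way to produce such color-coded overlays from two binary masks is sketched below; this is an illustration rather than the exact script used to render Fig. 4:

```python
import numpy as np

def overlap_map(test_mask, ref_mask):
    """Return an RGB image where white marks pixels predicted by both methods,
    blue marks pixels only in the reference mask (missed by the test method),
    and red marks pixels only in the test mask."""
    test = test_mask.astype(bool)
    ref = ref_mask.astype(bool)
    rgb = np.zeros(test.shape + (3,), dtype=np.uint8)
    rgb[test & ref] = (255, 255, 255)    # white: agreement
    rgb[~test & ref] = (0, 0, 255)       # blue: missed by the test method
    rgb[test & ~ref] = (255, 0, 0)       # red: extra region in the test method
    return rgb
```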
Fig. 5: Comparison of the mean average precision (mAP) for multi-label classification of tumor types between the proposed M-VAAL method and two baseline methods (VAAL and random sampling).
Fig. 7: Comparing the performance of (a) whole tumor segmentation, (b) tumor
type multi-label classification, (c) chest X-ray infection multi-class classification
with different training budgets, using the samples selected by M-VAAL trained
with different values of M.
The optimal value of M differs for different tasks, as shown in Sec. 5. The type of dataset also plays an important role in the effectiveness of AL. While AL is usually most effective with large pools of unlabelled data that exhibit high diversity and uncertainty, the small dataset used in this study, coupled with the nature of the labels, led to high variance in performance across random initializations of the task network. For instance, in the whole tumor segmentation task, as tumors do not have a specific shape, the prediction was based on texture and contrast information. Similarly, in the chest X-ray image classification task, the images looked relatively similar, with only subtle features distinguishing between classes. In addition, the test set used for evaluation also plays a crucial role in assessing the AL sampler: if the test distribution is biased and does not contain diverse test cases, the evaluation of AL will be undermined. We conducted several Student's t-tests to evaluate the statistical significance of the performance results. However, as shown by the error bars in Figs. 3, 5, and 6, the downstream tasks exhibit high variance across runs when smaller AL budgets are used; as a result, the differences in scores are not always statistically significant.
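For illustration, such a significance check at a single AL budget can be run with SciPy on per-seed test scores; the values below are placeholders rather than numbers from our tables:

```python
from scipy import stats

# Hypothetical per-seed Dice scores for two samplers at one AL budget (three seeds).
mvaal_scores = [0.756, 0.758, 0.755]
vaal_scores = [0.741, 0.744, 0.745]

# Two-sample Student's t-test; with only three runs per method the power is low,
# so non-significant p-values are expected when variance across runs is high.
t_stat, p_value = stats.ttest_ind(mvaal_scores, vaal_scores, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```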
In the future, we plan to investigate this further by evaluating the benefit of multimodal AL on larger pools of unannotated medical data drawn from diverse multimodal datasets that guarantee a more varied distribution. Additionally, we aim to explore the potential of replacing the discriminator with metric learning [3] to contrast the labeled and unlabelled sets in the latent space. There is also the possibility of extending M-VAAL to other modalities. For instance, depth information can serve as auxiliary multimodal information to improve an AL sampler for image segmentation and label classification on surgical scenes with endoscopic images.
In this work, we proposed a task-agnostic sampling method in AL that can
leverage multimodal image information. Our results on the BraTS2018 and
COVID-QU-Ex datasets show initial promise in the direction of using multi-
modal information in AL. M-VAAL can consistently improve AL performance,
but the hyperparameters need to be properly tuned.
References
1. Al Khalil, Y., et al.: On the usability of synthetic data for improving the robustness
of deep learning-based segmentation of cardiac magnetic resonance images. Medical
Image Analysis 84, 102688 (2023)
2. Ansari, M.Y., et al.: Practical utility of liver segmentation methods in clinical
surgeries and interventions. BMC Medical Imaging 22, 1–17 (2022)
3. Bellet, A., Habrard, A., Sebban, M.: A survey on metric learning for feature vectors
and structured data. arXiv preprint arXiv:1306.6709 (2013)
4. Bouget, D., et al.: Meningioma segmentation in T1-weighted MRI leveraging global
context and attention mechanisms. Frontiers in Radiology 1, 711514 (2021)
5. Budd, S., et al.: A survey on active learning and human-in-the-loop deep learning
for medical image analysis. Medical Image Analysis 71, 102062 (2021)
6. Chen, X., et al.: Semi-supervised semantic segmentation with cross pseudo supervision. In: Proc. IEEE Computer Vision and Pattern Recognition. pp. 2613–2622 (2021)
7. Gulrajani, I., et al.: Improved training of Wasserstein GANs. Advances in neural
information processing systems 30 (2017)
8. Hamamci, A., et al.: Tumor-cut: segmentation of brain tumors on contrast-
enhanced MR images for radiosurgery applications. IEEE transactions on medical
imaging 31, 790–804 (2011)
9. He, K., et al.: Identity mappings in deep residual networks. In: European conference
on computer vision. pp. 630–645. Springer (2016)
10. Kim, D.D., et al.: Active learning in brain tumor segmentation with uncertainty sampling, annotation redundancy restriction, and data initialization. arXiv preprint arXiv:2302.10185 (2023)
11. Laradji, I., et al.: A weakly supervised region-based active learning method for
COVID-19 segmentation in CT images. arXiv:2007.07012 (2020)
12. Lewis, D.D.: A sequential algorithm for training text classifiers: Corrigendum and additional data. In: ACM SIGIR Forum. vol. 29, pp. 13–19. ACM New York, NY, USA (1995)
13. Luo, X., et al.: Semi-supervised medical image segmentation through dual-task
consistency. AAAI Conference on Artificial Intelligence pp. 8801–8809 (2021)
14. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark
(BraTS). IEEE Transactions on Medical Imaging 34, 1993–2024 (2015)
15. Ronneberger, O., et al.: U-Net: Convolutional networks for biomedical image
segmentation. In: International Conference on Medical image computing and
computer-assisted intervention. pp. 234–241. Springer (2015)
16. Shao, W., et al.: Deep active learning for nucleus classification in pathology images.
In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018).
pp. 199–202 (2018)
17. Sharma, D., et al.: Active learning technique for multimodal brain tumor segmentation using limited labeled images. In: Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data: MICCAI Workshop 2019. pp. 148–156 (2019)
18. Singh, N.K., Raza, K.: Medical image generation using generative adversarial networks: A review. Health informatics: A computational perspective in healthcare pp. 77–96 (2021)
19. Sinha, S., et al.: Variational adversarial active learning. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 5972–5981 (2019)
20. Skandarani, Y., et al.: GANs for medical image synthesis: An empirical study.
Journal of Imaging 9(3), 69 (2023)
21. Tahir, A.M., et al.: COVID-QU-Ex Dataset (2022), https://www.kaggle.com/
dsv/3122958
22. Thapa, S.K., et al.: Task-aware active learning for endoscopic image analysis.
arXiv:2204.03440 (2022)
23. Verma, V., et al.: Interpolation consistency training for semi-supervised learning.
In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19. pp. 3635–3641 (7 2019)
24. Yang, L., et al.: Suggestive annotation: A deep active learning framework for
biomedical image segmentation. In: International conference on medical image
computing and computer-assisted intervention. pp. 399–407. Springer (2017)
25. Zhan, X., et al.: A comparative survey of deep active learning. arXiv:2203.13450
(2022)
7 Supplementary Materials
Table 1: Summary of Dice Score (Mean ± Std. Dev.) for brain tumor segmenta-
tion between our proposed M-VAAL and baselines (VAAL and random sampling)
at each active sampling round.
Dice Score
AL budget M-VAAL VAAL Rand
200 0.611 ± 0.028 0.611 ± 0.028 0.611 ± 0.028
300 0.725 ± 0.006 0.702 ± 0.011 0.69 ± 0.006
400 0.745 ± 0.002 0.747 ± 0.009 0.72 ± 0.016
500 0.751 ± 0.007 0.739 ± 0.02 0.738 ± 0.011
600 0.757 ± 0.001 0.728 ± 0.004 0.743 ± 0.007
700 0.756 ± 0.001 0.743 ± 0.013 0.738 ± 0.016
800 0.763 ± 0.006 0.764 ± 0.016 0.763 ± 0.011
Table 2: Summary of mAP (Mean ± Std. Dev.) for multi-label classification of tumor types between our proposed M-VAAL and baselines (VAAL and random sampling) at each active sampling round.
Mean Average Precision (mAP)
AL budget M-VAAL VAAL Rand
200 0.931 ± 0.005 0.931 ± 0.005 0.931 ± 0.005
300 0.956 ± 0.008 0.948 ± 0.004 0.948 ± 0.006
400 0.959 ± 0.005 0.951 ± 0.007 0.951 ± 0.006
500 0.962 ± 0.005 0.958 ± 0.002 0.954 ± 0.009
600 0.967 ± 0.003 0.959 ± 0.003 0.961 ± 0.007
700 0.964 ± 0.002 0.961 ± 0.002 0.962 ± 0.008
800 0.966 ± 0.002 0.961 ± 0.004 0.962 ± 0.008
900 0.968 ± 0.002 0.964 ± 0.001 0.965 ± 0.005
Table 3: Summary of Overall Accuracy (Mean ± Std. Dev.) for chest X-ray infection multi-class classification between our proposed M-VAAL and baselines (VAAL and random sampling) at each active sampling round.
Overall Accuracy
AL budget M-VAAL VAAL Rand
100 0.722 ± 0.049 0.722 ± 0.049 0.722 ± 0.049
200 0.881 ± 0.007 0.876 ± 0.010 0.882 ± 0.007
300 0.902 ± 0.007 0.899 ± 0.008 0.899 ± 0.004
400 0.911 ± 0.010 0.909 ± 0.002 0.905 ± 0.008
500 0.923 ± 0.001 0.918 ± 0.002 0.916 ± 0.005
600 0.928 ± 0.001 0.924 ± 0.006 0.919 ± 0.007
700 0.932 ± 0.002 0.934 ± 0.004 0.925 ± 0.005
800 0.935 ± 0.001 0.933 ± 0.005 0.931 ± 0.005