Deep Learning for Medical Image Processing
Deep learning has received extensive research interest in developing new medical image processing
algorithms, and deep learning based models have been remarkably successful in a variety of medical imaging
tasks to support disease detection and diagnosis. Despite this success, further improvement of deep learning
models in medical image analysis is largely bottlenecked by the lack of large, well-annotated datasets.
In the past five years, many studies have focused on addressing this challenge. In this paper, we review and
summarize these recent studies to provide a comprehensive overview of applying deep learning methods in
various medical image analysis tasks. In particular, we emphasize the latest progress and contributions of
state-of-the-art unsupervised and semi-supervised deep learning in medical image analysis, which are summarized
based on different application scenarios, including classification, segmentation, detection, and image
registration. We also discuss the major technical challenges and suggest possible solutions for future research.
Keywords: Deep learning, unsupervised learning, self-supervised learning, semi-supervised learning,
medical images, classification, segmentation, detection, registration, Transformer, attention
In current clinical practice, accuracy of detection and diagnosis of cancers and/or many other diseases
depends on the expertise of individual clinicians (e.g., radiologists, pathologists) (Kruger et al., 1972), which
results in large inter-reader variability in reading and interpreting medical images. In order to address and
overcome this clinical challenge, many computer-aided detection and diagnosis (CAD) schemes have been
developed and tested, aiming to help clinicians more efficiently read medical images and make the diagnostic
decision in a more accurate and objective manner. The scientific rationale of this approach is that using
computer-aided quantitative image feature analysis can help overcome many negative factors in the clinical
practice, including the wide variations in expertise of the clinicians, potential fatigue of human experts, and
lack of sufficient medical resources.
Although early CAD schemes were developed in the 1970s (Meyers et al., 1964; Kruger et al., 1972;
Sezaki and Ukena, 1973), progress in CAD schemes has accelerated since the mid-1990s (Doi et al., 1999),
due to the development and integration of more advanced machine learning methods or models into CAD
schemes. For conventional CAD schemes, a common development procedure consists of three steps: target
segmentation, feature computation, and disease classification. For example, Shi et al. (2008) developed a CAD
scheme to achieve mass classification on digital mammograms. The ROIs containing the target masses were
first segmented from the background using a modified active contour algorithm (Sahiner et al., 2001). Then a
large number of image features were computed to quantify the lesion characteristics in size, morphology, margin
geometry, texture, etc. Thus the raw pixel data was converted into a vector of representative features.
Finally, an LDA (linear discriminant analysis) based classification model was applied to the feature vector to
identify the mass malignancy.
As a comparison, for deep learning based models, hidden patterns inside ROIs are progressively
identified and learned by the hierarchical architecture of deep neural networks (LeCun et al., 2015). During this
process, important properties of the input image will be gradually identified and amplified for certain tasks (e.g.
classification, detection), while irrelevant features will be attenuated and filtered out. For instance, an MRI
image depicting suspicious liver lesions comes with a pixel array (Hamm et al., 2019), and each entry is used
as one input feature of the deep learning model. The first several layers of the model may initially obtain some
basic lesion information, such as tumor shape, location, and orientation. The next batch of layers may identify
and keep the features consistently related to lesion malignancy (e.g. shape, edge irregularity), while ignoring
irrelevant variations (e.g. location). The relevant features will be further processed and assembled by
subsequent higher layers in a more abstract manner. As the number of layers increases, higher-level
feature representations can be achieved. Throughout the entire procedure, important features hidden inside the raw
image are recognized by a general neural network based model in a self-taught manner, and thus manual
feature development is not needed.
Owing to this advantage, deep learning related methods have become the mainstream technology in
the CAD field and have been widely applied in a variety of tasks, such as disease classification (Li et al., 2020a;
Shorfuzzaman and Hossain, 2021; Zhang et al., 2020a; Frid-Adar et al., 2018a; Kumar et al., 2016; Kumar et
al., 2017), ROI segmentation (Alom et al., 2018; Yu et al., 2019; Fan et al., 2020), medical object detection
(Rijthoven et al., 2018; Mei et al., 2021; Nair et al., 2020; Zheng et al., 2015), and image registration
(Simonovsky et al., 2016; Sokooti et al., 2017; Balakrishnan et al., 2018). Among different kinds of deep
learning techniques, supervised learning was first adopted in medical image analysis. Although it has been
successfully utilized in many applications (Esteva et al., 2017; Long et al., 2017), further deployment of
supervised models in many scenarios is largely hindered by the limited size of most medical datasets.
Compared with regular datasets in computer vision, a medical image dataset usually contains a relatively small
number of images (fewer than 10,000), and in many cases only a small percentage of the images are annotated by
experts. To overcome this limitation, unsupervised and semi-supervised learning methods have received
extensive attention in the past three years. These methods are able to (1) generate more labeled images for model
optimization, (2) learn meaningful hidden patterns from unlabeled image data, and (3) generate pseudo labels
for the unlabeled data.
There already exist a number of excellent review articles that summarized deep learning applications
in medical image analysis. Litjens et al. (2017) and Shen et al. (2017) reviewed relatively early deep learning
techniques, which are mainly based on supervised methods. More recently, Yi et al. (2019) and Kazeminia et
al. (2020) reviewed the applications of generative adversarial networks across different medical imaging tasks.
Cheplygina et al. (2019) surveyed how to use semi-supervised learning and multiple instance learning in
diagnosis or segmentation tasks. Tajbakhsh et al. (2020) investigated a variety of methods to deal with dataset
limitations (e.g., scarce or weak annotations) specifically in medical image segmentation. In contrast, a major
goal of this paper is to shed light on how the medical image analysis field, which is often bottlenecked by
limited annotated data, can potentially benefit from some of the latest trends in deep learning. Our survey
distinguishes itself from recent review papers with two characteristics – being comprehensive and technically
oriented. “Comprehensive” is reflected in three aspects. First, we highlight the applications of a broad range of
promising approaches falling in the “not-so-supervised” category, including self-supervised, unsupervised,
and semi-supervised learning; meanwhile, we do not ignore the continuing importance of supervised methods.
Second, rather than covering only a specific task, we introduce the applications of the above learning
approaches in four classical medical image analysis tasks (classification, segmentation, detection, and
registration). In particular, we discuss deep learning based object detection in detail, which is rarely
mentioned in recent review papers (after 2019). We focus on applications to chest X-ray, mammogram,
CT, and MRI images. These image types share many common characteristics and are interpreted
by physicians in the same department (Radiology). We also mention some general methods that were
applied to other image domains (e.g. histopathology) but have the potential to be used for radiological or MRI
images. Third, state-of-the-art architectures/models for these tasks are explained. For example, we summarize
how to adapt Transformers from natural language processing for medical image segmentation, which has not
been mentioned by existing review papers to the best of our knowledge. In terms of “technically oriented”, we
review the recent advances of not-so-supervised approaches in detail. In particular, self-supervised learning is
quickly emerging as a promising direction but has not yet been systematically reviewed in the context of medical vision. A
wide audience may benefit from this survey, including researchers with deep learning, artificial intelligence
and big data expertise, and clinicians/medical researchers.
This survey is presented as follows (Figure 1): Section 2 provides an in-depth overview of recent
advances in deep learning, with a focus on unsupervised and semi-supervised approaches. In addition, three
important strategies for performance enhancement, including attention mechanisms, domain knowledge, and
uncertainty estimation, are also discussed. Section 3 summarizes the major contributions of applying deep
learning techniques in four main tasks: classification, segmentation, detection, and registration. Section 4
discusses challenges for further improving the model and suggests possible perspectives on future research
directions toward large scale applications of deep learning based medical image analysis models.
Figure 1. The overall structure of this survey.
Depending on whether labels of the training dataset are present, deep learning can be roughly divided
into supervised, unsupervised, and semi-supervised learning. In supervised learning, all training images are
labeled, and the model is optimized using the image-label pairs. For each testing image, the optimized model
will generate a likelihood score to predict its class label (LeCun et al., 2015). For unsupervised learning, the
model will analyze and learn the underlying patterns or hidden data structures without labels. If only a small
portion of training data is labeled, the model learns input-output relationship from the labeled data, and the
model will be strengthened by learning semantic and fine-grained features from the unlabeled data. This type
of learning approach is defined as semi-supervised learning (van Engelen and Hoos, 2020). In this section, we
briefly introduce supervised learning at the beginning, and then mainly review the recent advances in
unsupervised learning and semi-supervised learning, which can facilitate performing medical image tasks with
limited annotated data. Popular frameworks for these two types of learning paradigms will be introduced
accordingly. In the end, we summarize three general strategies that can be combined with different learning
paradigms for better performance in medical image analysis, including attention mechanisms, domain
knowledge, and uncertainty estimation.
2.1. Supervised learning
Convolutional neural networks (CNNs) are a widely used deep learning architecture in medical image
analysis (Anwar et al., 2018). CNNs are mainly composed of convolutional layers and pooling layers. Figure
2 shows a simple CNN in the context of a medical image classification task. The CNN directly takes an image as
input, transforms it via convolutional layers, pooling layers, and fully connected layers, and finally outputs
a class-based likelihood of that image.
At each convolutional layer $l$, a set of kernels $W = \{W_1, \dots, W_k\}$ is used to extract features from
the input image, and biases $b = \{b_1, \dots, b_k\}$ are added, generating new feature maps $W_i^l x_i^l + b_i^l$. Then a
non-linear transform, an activation function $\sigma(\cdot)$, is applied, resulting in $x_k^{l+1} = \sigma(W_i^l x_i^l + b_i^l)$ as the input of the
next layer. After the convolutional layer, a pooling layer is incorporated to reduce the dimension of the feature
maps, thus reducing the number of parameters. Average pooling and maximum pooling are two common
pooling operations. The above process is repeated for the remaining layers. At the end of the network, fully connected
layers are usually employed to produce the probability distribution over classes via a sigmoid or softmax
function. The predicted probability distribution gives a label $\hat{y}$ for each input instance so that a loss function
$L(\hat{y}, y)$ can be calculated, where $y$ is the real label. Parameters of the network are iteratively optimized by
minimizing the loss function.
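The forward pass described above can be illustrated with a minimal sketch in plain Python. It is not the architecture of Figure 2: the single convolution/pooling stage, the ReLU activation, the 2x2 max pooling, the toy 5x5 image, and all weight values are assumptions chosen only to keep the example small.

```python
import math

def conv2d(image, kernel, bias):
    """Valid 2-D cross-correlation of a single-channel image with one kernel, plus a bias."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw)) + bias
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise non-linearity (one possible choice of sigma)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Loss L(y_hat, y) for one example with integer class label y."""
    return -math.log(probs[label])

def forward(image, kernels, biases, fc_weights, fc_bias):
    """Convolution -> activation -> pooling -> fully connected -> softmax."""
    features = []
    for kernel, bias in zip(kernels, biases):
        pooled = max_pool(relu(conv2d(image, kernel, bias)))
        features.extend(v for row in pooled for v in row)
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)

# Toy inputs (hypothetical values): 5x5 image, two 2x2 kernels, two classes.
image = [[float((i * 5 + j) % 7) for j in range(5)] for i in range(5)]
kernels = [[[1.0, 0.0], [0.0, -1.0]], [[0.5, 0.5], [0.5, 0.5]]]
biases = [0.0, -1.0]
fc_weights = [[0.1] * 8, [-0.1] * 8]   # 2 kernels x (2x2 pooled map) = 8 features
fc_bias = [0.0, 0.0]

probs = forward(image, kernels, biases, fc_weights, fc_bias)
loss = cross_entropy(probs, label=1)
```

In practice the kernels, biases, and fully connected weights would be updated by gradient descent on `loss`; the sketch only covers the forward computation.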
Figure 2. A simple CNN for disease classification from MRI images (Anwar et al., 2018).
2.2. Unsupervised learning
2.2.1. Autoencoders
Autoencoders are widely applied in dimensionality reduction and feature learning (Hinton and
Salakhutdinov, 2006). The simplest autoencoder, initially known as auto-associator (Bourlard and Kamp,
1988), is a neural network with only one hidden layer that learns a latent feature representation of the input data
by minimizing a reconstruction loss between the input and its reconstruction from the latent representation. The
shallow structure of simple autoencoders limits their representation power, but deeper autoencoders with more
hidden layers can improve the representation. By stacking multiple autoencoders and optimizing them in a
greedy layer-wise manner, deep autoencoders or Stacked Autoencoders (SAEs) can learn more complicated
non-linear patterns than shallow ones and thus generalize better outside training data (Bengio et al., 2007).
SAEs consist of an encoder network and a decoder network, which are typically symmetrical to each other. To
further force models to learn useful latent representations with desirable characteristics, regularization terms
such as sparsity constraints in Sparse Autoencoders (Ranzato et al., 2007) can be added to the original
reconstruction loss. Other regularized autoencoders include the Denoising Autoencoder (Vincent et al., 2010)
and Contractive Autoencoder (Rifai et al., 2011), both designed to be insensitive to input perturbations.
Unlike the above classic autoencoders, the variational autoencoder (VAE) (Kingma and Welling, 2013)
works in a probabilistic manner to learn mappings between the observed data space $x \in R^m$ and the latent space
$z \in R^n$ ($m \gg n$). As a latent variable model, the VAE formulates this problem as maximizing the log-likelihood
of the observed samples $\log p(x) = \log \int p(x|z)\,p(z)\,dz$, where $p(x|z)$ can be easily modeled using a neural
network, and $p(z)$ is a prior distribution (such as Gaussian) over the latent space. However, the integral is
intractable because it is impossible to sample the full latent space. As a result, the posterior distribution $p(z|x)$
also becomes intractable according to Bayes' rule. To solve the intractability issue, the authors of VAE propose
that in addition to modeling $p(x|z)$ using the decoder, the encoder learns $q(z|x)$ that approximates the
unknown posterior distribution. Ultimately a tractable lower bound, also termed the evidence lower bound
(ELBO), can be derived for $\log p(x)$:
$$\log p(x) \ge \mathrm{ELBO} = E_{q(z|x)}[\log p(x|z)] - KL[q(z|x)\,\|\,p(z)],$$
where KL stands for the Kullback-Leibler divergence. The first term can be understood as a reconstruction loss
measuring the similarity between the input image and its counterpart reconstructed from the latent
representation. The second term computes the divergence between the approximated posterior and the Gaussian
prior.
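As a rough illustration of the two ELBO terms, the sketch below estimates the bound for a single input. It assumes a diagonal Gaussian encoder q(z|x), a standard normal prior (for which the KL term has a closed form), a unit-variance Gaussian decoder, and a single reparameterized sample for the expectation. The `decode` function and all numeric values are hypothetical stand-ins for a trained network.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_log_likelihood(x, x_recon):
    """log N(x; x_recon, I), including the normalizing constant."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)

def elbo(x, mu, log_var, decode, rng):
    """Single-sample Monte Carlo estimate of the ELBO for one input x."""
    z = reparameterize(mu, log_var, rng)
    reconstruction_term = gaussian_log_likelihood(x, decode(z))
    return reconstruction_term - kl_to_standard_normal(mu, log_var)

# Hypothetical toy decoder mapping a 2-D latent code to a 3-D observation.
def decode(z):
    return [z[0], z[1], z[0] + z[1]]

rng = random.Random(0)
x = [0.5, -0.2, 0.3]
mu, log_var = [0.4, -0.1], [-1.0, -1.0]   # would normally be encoder outputs for x
value = elbo(x, mu, log_var, decode, rng)
```

Training a VAE amounts to maximizing this quantity (equivalently, minimizing its negative) with respect to the encoder and decoder parameters.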
Later different VAE extensions were proposed to learn more complicated representations. Although
the probabilistic working mechanism allows its decoder to generate new data, VAE cannot specifically control
the data generation process. Sohn et al. (2015) proposed the so-called conditional VAE (CVAE), where
probabilistic distributions learnt by the encoder and decoder are both conditioned using external information
(e.g. image classes). This enables VAE to generate structured output representations. Another line of research
explores imposing more complex priors on the latent space. For example, Dilokthanakul et al. (2016) presented
Gaussian Mixture VAE (GMVAE) that uses a mixture of Gaussians as the prior to obtain higher modeling
capacity in the latent space. We refer readers to a recent paper (Kingma and Welling, 2019) for more details of
VAE and its extensions.
2.2.2. Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are a class of deep nets for generative modeling first proposed
by Goodfellow et al. (2014). For this architecture, a framework for estimating generative models is designed to
directly draw samples from the desired underlying data distribution without the need to explicitly define a
probability distribution. It consists of two models: a generator G and a discriminator D. The generative model
G takes as input a random noise vector $z$ sampled from a prior distribution $P_z(z)$, often either a Gaussian or a
uniform distribution, and then maps $z$ to data space as $G(z, \theta_g)$, where G is a neural network with parameters
$\theta_g$. The fake samples, denoted as $G(z)$ or $x_g$, are expected to resemble real samples from the training data $P_r(x)$,
and these two types of samples are sent into D. The discriminator, a second neural network parameterized by
$\theta_d$, outputs the probability $D(x, \theta_d)$ that a sample comes from the training data rather than G. The training
procedure is like playing a minimax two-player game. The discriminative network D is optimized to maximize
the log likelihood of assigning correct labels to fake samples and real samples, while the generative model G is
trained to maximize the log likelihood of D making a mistake. Through the adversarial process, G is expected to
gradually estimate the underlying data distribution and generate realistic samples.
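The two competing objectives can be written down compactly. The sketch below assumes the discriminator outputs are already available as probabilities in (0, 1). The generator loss shown is the common non-saturating form, which matches the description of G maximizing the log likelihood of D making a mistake. The function names are illustrative only.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of what D maximizes: mean log D(x) + mean log(1 - D(G(z)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Negative of what G maximizes: mean log D(G(z)) (non-saturating form)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# D(x) on a batch of real samples and D(G(z)) on a batch of generated samples.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

In a training loop the two losses are minimized alternately, each with respect to its own network's parameters.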
Building on the vanilla GAN, performance has been improved in two main directions: (1) different
loss (objective) functions, and (2) conditional settings. For the first direction, Wasserstein GAN (WGAN) is a
typical example. In WGAN, the Earth-Mover (EM) distance or Wasserstein-1, commonly known as the
Wasserstein distance, was proposed to replace the Jensen–Shannon (JS) divergence in the original vanilla GAN
and measure the distance between the real and synthetic data distributions (Arjovsky et al., 2017). The critic of
WGAN has the advantage of providing useful gradient information where the JS divergence saturates and results in
vanishing gradients. WGAN could also improve the stability of learning and alleviate problems like mode
collapse.
An unconditional generative model cannot explicitly control the modes of data being synthesized. To
guide the data generation process, the conditional GAN (cGAN) is constructed by conditioning its generator
and discriminator with additional information (e.g., the class labels) (Mirza and Osindero, 2014). Specifically,
the noise vector z and class label c are jointly provided to G; the real/fake data and class label c are together
presented as the inputs of D. The conditional information can also be images or other attributes, not limited to
class labels. Further, the auxiliary classifier GAN (ACGAN) presents another strategy to employ label
conditioning to improve image synthesis (Odena et al., 2017). Unlike the discriminator of cGAN, D in ACGAN
is no longer provided with the class conditional information. Apart from separating real and fake images, D is
also tasked with reconstructing class labels. By forcing D to perform this additional classification task,
ACGAN is able to generate high-quality images.
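One simple way to realize the cGAN conditioning described above is to append a one-hot encoding of the class label c to the inputs of both networks. The sketch below assumes flat vectors and concatenation; other conditioning schemes exist, and the helper names are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(noise, label, num_classes):
    """Input of G: noise vector z joined with the class label c."""
    return noise + one_hot(label, num_classes)

def discriminator_input(sample, label, num_classes):
    """Input of D: a real or generated sample joined with the same class label c."""
    return sample + one_hot(label, num_classes)

g_in = generator_input([0.1, 0.2], label=1, num_classes=3)
d_in = discriminator_input([0.7, 0.3, 0.9, 0.4], label=1, num_classes=3)
```

In ACGAN, by contrast, `discriminator_input` would receive only the sample, and D would output a class prediction in addition to the real/fake probability.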
2.2.3. Self-supervised learning
In the past few years, unsupervised representation learning has gained huge success in natural language
processing (NLP), where massive unlabeled data is available for pre-training models (e.g. BERT, Kenton and
Toutanova, 2019) and learning useful feature representations. Then the feature representations are fine-tuned
in downstream tasks such as question answering, natural language inference, and text summarization. In
computer vision, researchers have explored a similar pipeline – models are first trained to learn rich and
meaningful feature representations from the raw unlabeled image data in an unsupervised manner, and then the
feature representations are fine-tuned in a wide variety of downstream tasks with labeled data, such as
classification, object detection, instance segmentation, etc. However, this practice was not as successful as in
NLP for quite a long time, and instead supervised pre-training has been the dominant strategy. Interestingly,
we find this situation has been reversing in the past two years, as more and more studies
report a higher performance of self-supervised pre-training than supervised pre-training.
In recent literature, the term self-supervised learning is used interchangeably with unsupervised
learning; more accurately, self-supervised learning actually refers to a form of deep unsupervised learning,
where inputs and labels are created from unlabeled data itself without external supervision. One important
motivation behind this technology is to avoid supervised tasks that are often expensive and time-consuming,
due to the need to establish new labeled datasets or acquire high-quality annotations in certain fields like
medicine. Despite the scarcity and high cost of labeled data, there usually exist large amounts of cheap
unlabeled data remaining unexploited in many fields. The unlabeled data is likely to contain valuable
information that is either weak or not present in labeled data. Self-supervised learning can leverage the power
of unlabeled data to improve both the performance and efficiency in supervised tasks. Since self-supervised
learning draws on far more data than supervised learning, features learnt in a self-supervised manner can
potentially generalize better in the real world. Self-supervision can be created in two ways: pretext task based
methods and contrastive learning based methods. Since the contrastive learning based methods have received
broader attention in very recent years, we will highlight more works in this direction.
A pretext task is designed to learn representative features for downstream tasks, but the pretext task itself is
not of true interest (He et al., 2020). Pretext tasks learn representations by hiding certain information
(e.g., channels, patches, etc.) from each input image, and then predicting the missing information from the image’s
remaining parts. Examples include image inpainting (Pathak et al., 2016), colorization (Zhang et al., 2016),
relative patch prediction (Doersch et al., 2015), jigsaw puzzles (Noroozi and Favaro, 2016), rotation (Gidaris
et al., 2018), etc. However, the learnt representations’ generalizability is heavily dependent on the quality of
hand-crafted pretext tasks (Chen et al., 2020a).
Contrastive learning relies on the so-called contrastive loss, which dates back to at least Hadsell
et al. (2006) and Chopra et al. (2005a). Later, a number of variants of this contrastive loss were used (Oord et al.,
2018; Chen et al., 2020a; Chaitanya et al., 2020). In essence, the original loss and its later versions all enforce
a similarity metric to be maximized for positive (similar) pairs and be minimized for negative (dissimilar) pairs,
so that the model can learn discriminative features. In the following we will introduce two representative
frameworks for contrastive learning, namely Momentum Contrast (MoCo) (He et al., 2020) and SimCLR (Chen
et al., 2020a).
MoCo formulates contrastive learning as a dictionary look-up problem, which requires an encoded
query to be similar to its matching key. As shown in Figure 3 (a), given an image $x$, an encoder encodes the
image, resulting in a feature vector that is used as a query ($q$). Likewise, with another encoder the dictionary
can be built up from the features $\{k_0, k_1, k_2, \dots\}$, also known as keys, of a large set of image samples
$\{x_0, x_1, x_2, \dots\}$. In MoCo, the encoded query $q$ and a key are considered similar if they come from different
crops of the same image. Suppose there exists a single dictionary key ($k_+$) that matches $q$; then these two
items are regarded as a positive pair, whereas the remaining keys in the dictionary are considered negative. The authors
compute the loss function of a positive pair using InfoNCE (Oord et al., 2018) as follows:
$$L_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},$$
where $\tau$ is a temperature hyperparameter and the sum runs over one positive and $K$ negative keys.
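A minimal sketch of the InfoNCE loss for a single query is given below. It treats the loss as a softmax cross-entropy in which the positive key is the correct "class" among the positive and all negative keys, with dot-product similarity scaled by a temperature. The default temperature of 0.07 and the toy vectors are assumptions for illustration.

```python
import math

def dot(u, v):
    """Dot-product similarity between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """-log( exp(q.k+ / t) / sum over all keys of exp(q.k_i / t) )."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]
    m = max(logits)                                   # for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[0]

query = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_matched = info_nce(query, [1.0, 0.0], negatives)       # key agrees with the query
loss_mismatched = info_nce(query, [0.0, 1.0], [[1.0, 0.0]]) # a negative is closer than the positive
```

The loss is small when the query is most similar to its positive key and grows when some negative key is more similar, which is what drives the encoder toward discriminative features.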
Established from a sampled subset of all images, a large dictionary is important for good accuracy. To
make the dictionary large, the authors maintain the feature representations from previous image batches as a
queue: new keys are enqueued with old keys dequeued. Therefore, the dictionary consists of encoded
representations from the current and previous batches. This, however, could lead to a rapidly updated key
encoder, rendering the dictionary keys inconsistent, i.e., their comparisons to the encoded query are not
consistent. The authors thus propose using a momentum update on the key encoder to avoid rapid changes. This
key encoder is referred to as the momentum encoder.
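The queue-based dictionary and the momentum update can be sketched as follows. Parameters are represented as flat lists of floats rather than network weights. The momentum coefficient of 0.999, the queue size, and the class and function names are assumptions for illustration.

```python
from collections import deque

def momentum_update(key_params, query_params, momentum=0.999):
    """Move the key encoder slowly toward the query encoder."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]

class KeyQueue:
    """Fixed-size dictionary of keys; the oldest keys are dropped first."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # Appending beyond maxlen automatically discards the oldest entries.
        self.keys.extend(batch_keys)

queue = KeyQueue(max_size=4)
queue.enqueue([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])   # keys from an earlier batch
queue.enqueue([[0.7, 0.8], [0.9, 1.0]])               # keys from the current batch

key_params = momentum_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

Because only a small fraction of the query encoder's change is passed on at each step, keys produced several batches apart remain comparable.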
SimCLR is another popular framework for contrastive learning. In this framework, two augmented
images are considered a positive pair if they derive from the same example; if not, they are a negative pair.
The agreement between the feature representations of positive image pairs is maximized. As shown in Figure 3 (b),
SimCLR consists of four components: (1) stochastic image augmentation; (2) encoder networks ($f(\cdot)$) extracting
feature representations from augmented images; (3) a small neural network (multilayer perceptron (MLP)
projection head) ($g(\cdot)$) that maps the feature representations to a lower-dimensional space; and (4) contrastive
loss computation. The third component distinguishes SimCLR from its predecessors. Previous frameworks like MoCo
compute the contrastive loss on the feature representations directly rather than first mapping them to a lower-dimensional space. This
component was further shown to be important in achieving satisfactory results, as demonstrated in MoCo v2 (Chen et
al., 2020b).
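The fourth component, the contrastive loss over a batch of augmented views, can be sketched as below. The sketch assumes the projected vectors are ordered so that consecutive rows (0 and 1, 2 and 3, and so on) are the two views of the same example. It uses cosine similarity with a temperature, as in the normalized temperature-scaled cross-entropy loss. The temperature value and the toy vectors are assumptions.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_contrastive_loss(projections, temperature=0.5):
    """Average loss over all views; rows 2k and 2k+1 form the positive pairs."""
    z = [normalize(v) for v in projections]
    n = len(z)
    total = 0.0
    for i in range(n):
        partner = i ^ 1                       # index of the other view of the same example
        positive = dot(z[i], z[partner]) / temperature
        # The denominator covers every other view in the batch, positive included.
        logits = [dot(z[i], z[k]) / temperature for k in range(n) if k != i]
        m = max(logits)
        total += m + math.log(sum(math.exp(v - m) for v in logits)) - positive
    return total / n

aligned = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]     # views of each example agree
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # views of each example disagree
loss_aligned = batch_contrastive_loss(aligned)
loss_misaligned = batch_contrastive_loss(misaligned)
```

Unlike the queue-based dictionary of MoCo, the negatives here come from the other examples in the same batch.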
Note that since self-supervised contrastive learning is very new, wide applications of recent advances
such as MoCo and SimCLR in the medical image analysis field have not yet been established at the time of this
writing. Nonetheless, considering the promising results of self-supervised learning reported in the existing
literature, we anticipate that studies applying this new technology to analyze medical images are likely to grow rapidly
soon. Also, self-supervised pre-training has great potential to be a strong alternative to supervised pre-training.
Figure 3. (a) MoCo (He et al., 2020); (b) SimCLR (Chen et al., 2020a).
2.3. Semi-supervised learning
Different from unsupervised learning that can work just on unlabeled data to learn meaningful
representations, semi-supervised learning (SSL) combines labeled and unlabeled data during model training.
In particular, SSL applies to the scenario where limited labeled data and large-scale but unlabeled data are
available. These two types of data should be relevant, so that the additional information carried by unlabeled
data could be useful in complementing the labeled data. It is reasonable to expect that unlabeled data would lead
to an average performance boost, probably the more the better, for tasks with only limited labeled
data. In fact, this goal has been explored for several decades, and the 1990s already witnessed a rising interest
in applying SSL methods to text classification. The Semi-Supervised Learning book (Chapelle et al., 2009) is
a good source for readers to grasp the connection of SSL to classic machine learning algorithms. Interestingly,
despite its potential positive value, the authors present empirical findings that unlabeled data sometimes
deteriorates the performance. However, this picture seems to be changing in the recent
deep learning literature: an increasing number of works, mostly from the computer vision field, have
reported that deep semi-supervised approaches generally perform better than high-quality supervised baselines
(Ouali et al., 2020). Even when varying the amount of labeled and unlabeled data, a consistent performance
improvement can still be observed. At the same time, deep semi-supervised learning has been successfully
applied in the medical image analysis field to reduce annotation cost and achieve better performance. We divide
popular SSL methods into three groups: (1) consistency regularization based approaches; (2) pseudo labeling
based approaches; (3) generative models based approaches.
Methods in the first category share the idea that the prediction for an unlabeled example should
not change significantly if some perturbations (e.g., adding noise, data augmentation) are applied. The loss
function of an SSL model generally consists of two parts. More concretely, given an unlabeled data example $x$
and its perturbed version $\tilde{x}$, the SSL model outputs logits $f_\theta(x)$ and $f_\theta(\tilde{x})$. On the unlabeled data, the objective
is to give consistent predictions by minimizing the mean squared error $d(f_\theta(x), f_\theta(\tilde{x}))$, and this leads to the
consistency (unsupervised) loss $L_u$ on unlabeled data. On the labeled data, a cross entropy supervised loss $L_s$
is computed. Example SSL models that are regularized by consistency constraints include Ladder Networks
(Rasmus et al., 2015), Π-Model (Laine and Aila, 2017), and Temporal Ensembling (Laine and Aila, 2017). A
more recent example is the Mean Teacher paradigm (Tarvainen and Valpola, 2017), composed of a teacher
model and a student model (Figure 4). The student model is optimized by minimizing $L_u$ on unlabeled data and
$L_s$ on labeled data; as an Exponential Moving Average (EMA) of the student model, the teacher model is used
to guide the student model for consistency training. Most recently, several works such as unsupervised data
augmentation (UDA) (Xie et al., 2020) and MixMatch (Berthelot et al., 2019) have brought the performance of
SSL to a new level.
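The two-part loss and the teacher update can be sketched as follows. The sketch makes several assumptions that the text does not fix: the mean squared error is taken between softmax probabilities rather than raw logits, the two losses are combined with a scalar `weight`, and model parameters are represented as flat lists. All names and values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_loss(logits, label):
    """Cross entropy on one labeled example (the L_s part)."""
    return -math.log(softmax(logits)[label])

def consistency_loss(student_logits, teacher_logits):
    """Mean squared error between the two predictions (the L_u part)."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def total_loss(labeled, unlabeled_pairs, weight=1.0):
    """L_s averaged over labeled data plus weighted L_u averaged over unlabeled data."""
    l_s = sum(supervised_loss(lg, y) for lg, y in labeled) / len(labeled)
    l_u = sum(consistency_loss(s, t) for s, t in unlabeled_pairs) / len(unlabeled_pairs)
    return l_s + weight * l_u

def ema_update(teacher_params, student_params, alpha=0.99):
    """Teacher parameters as an exponential moving average of student parameters."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_params, student_params)]

labeled = [([2.0, 0.5], 0), ([0.1, 1.5], 1)]                  # (student logits, label)
unlabeled_pairs = [([1.0, 0.2], [0.8, 0.4]), ([0.3, 0.9], [0.1, 1.2])]  # (student, teacher) logits
loss = total_loss(labeled, unlabeled_pairs, weight=0.5)
teacher = ema_update([1.0, 1.0], [0.0, 2.0], alpha=0.5)
```

Only the student is updated by gradient descent on `loss`; the teacher is refreshed with `ema_update` after each student step.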
For pseudo labeling (Lee, 2013), an SSL model itself generates pseudo annotations for unlabeled
examples; the pseudo-labeled examples are used jointly with labeled examples to train the SSL model. This
process is iterated several times, during which the quality of pseudo labels and the model’s performance are
enhanced. The naïve pseudo-labeling process can be combined with Mixup augmentation (Zhang et al., 2018a)
to further improve the SSL model’s performance (Arazo et al., 2020). Pseudo labeling also works well with
multi-view co-training (Qiao et al., 2018). For each view of the labeled examples, co-training learns a separate
classifier, and then the classifier is used to generate pseudo labels for the unlabeled data; co-training maximizes
the agreement of assigning pseudo annotations among each view of unlabeled examples.
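One round of pseudo labeling can be sketched as below. The sketch assumes the model's class probabilities for the unlabeled examples are already available, and it keeps only predictions above a confidence threshold. The threshold is a common practical addition rather than something prescribed by the text, and its value is arbitrary here.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (example index, predicted class) for confidently predicted examples."""
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected

# Predicted class probabilities for three unlabeled examples (hypothetical values).
unlabeled_probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
new_labels = pseudo_label(unlabeled_probs)
```

The selected examples would then be added to the labeled set with their pseudo labels, the model retrained, and the procedure repeated.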
For methods in the third category, semi-supervised generative models such as GANs and VAEs put
more focus on solving target tasks (e.g., classification) than just generating high-fidelity samples. Here we
illustrate the mechanism of semi-supervised GAN for brevity. One simple way to adapt GAN to semi-
supervised settings is by modifying the discriminator to perform additional tasks. For example, in the task of
image classification, Salimans et al. (2016) and Odena (2016) changed the discriminator in DCGAN by forcing
it to serve as a classifier. For an unlabeled image, the discriminator functions as in the vanilla GAN, providing
a probability of the input image being real; for a labeled image, the discriminator predicts its class besides
generating a realness probability. However, Li et al. (2017) demonstrated that the optimal performance of the
two tasks may not be achieved at the same time by a single discriminator. Thus, they introduced an additional
classifier that is independent of the generator and discriminator. This new three-component architecture
is called Triple-GAN.
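To make the dual role of the discriminator concrete, the sketch below derives both class probabilities and a realness probability from K class logits. It assumes one common formulation in which an implicit (K+1)-th "fake" logit is fixed at zero, so that p(real) = Z / (Z + 1) with Z the sum of the exponentiated class logits; the function name and this parameterization are assumptions made for illustration.

```python
import math


def discriminator_outputs(class_logits):
    """Split K class logits into class probabilities and a realness probability.

    Assumes an implicit "fake" logit fixed at 0, so that
    p(real) = Z / (Z + 1) with Z = sum_k exp(logit_k).
    """
    m = max(class_logits)
    exps = [math.exp(l - m) for l in class_logits]  # shifted for numerical stability
    z_shifted = sum(exps)                            # equals Z * exp(-m)
    class_probs = [e / z_shifted for e in exps]
    p_real = z_shifted / (z_shifted + math.exp(-m))  # algebraically Z / (Z + 1)
    return class_probs, p_real


probs, p_real = discriminator_outputs([0.0, 0.0])
```

For an unlabeled image only p_real would enter the loss, whereas for a labeled image the class probabilities would be used as well.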
Figure 4. Mean Teacher model application in medical image analysis (Li et al., 2020b). π_i refers to the transformation
operations, including rotation, flipping, and scaling. z_i and z̃_i are network outputs.
3.2. Segmentation
Medical image segmentation, identifying the set of pixels or voxels of lesions, organs, and other
substructures from background regions, is another challenging task in medical image analysis (Litjens et al.,
2017). Among all common image analysis tasks such as classification and detection, segmentation needs the
strongest supervision (large amounts of high-quality annotations) (Tajbakhsh et al., 2020). Since its
introduction in 2015, U-Net (Ronneberger et al., 2015) has become probably the most well-known architecture
for segmenting medical images; afterwards, different variants of U-Net have been proposed to further improve
the segmentation performance. From the very recent literature, we observe that the combination of U-Net and
Transformers from NLP (Chen et al., 2021b) has contributed to state-of-the-art performance. In addition, a
number of semi-supervised and self-supervised learning based approaches have also been proposed to alleviate
the need for large annotated datasets. Accordingly, in this section we will (1) review the original U-Net and its
important variants, and summarize useful performance enhancing strategies; (2) introduce the combination of
U-Net and Transformers, and Mask RCNN (He et al., 2017); (3) cover self-supervised and semi-supervised
approaches for segmentation. Since recent studies focus on applying Transformers to segment medical images
in a supervised manner, we purposely position the introduction of Transformers-based architectures in the
supervised segmentation section. However, it should be noted that such categorization does not mean
Transformers-based architectures cannot be used in semi-supervised or unsupervised settings.
3.2.1. Supervised learning based segmenting models
I. U-Net and its variants
In a convolutional network, the high-level coarse-grained features learned by higher layers capture
semantics beneficial to the whole image classification; in contrast, the low-level fine-grained features learned
by lower layers contain useful details for precise localizations (i.e., assigning a class label to each pixel)
(Hariharan et al., 2015), which are important to image segmentation. U-Net is built on the fully convolutional
network (Long et al., 2015); its key innovation is the so-called skip connections between opposing
convolutional layers and deconvolutional layers, which concatenate features learned at different
levels to improve the segmentation performance. Meanwhile, skip connections are also helpful in recovering the
network’s output to the same spatial resolution as the input. U-Net takes 2D images as input, and it
generates several segmentation maps, each of which corresponds to one pixel class.
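The effect of a skip connection can be sketched on a toy 1D signal. The example below assumes max pooling for down-sampling and nearest-neighbour repetition in place of a learned deconvolution; these choices and the function names are simplifications for illustration, not the actual U-Net layers.

```python
def downsample(channels):
    """Max-pool each channel with window 2 and stride 2."""
    return [[max(c[i], c[i + 1]) for i in range(0, len(c) - 1, 2)] for c in channels]


def upsample(channels):
    """Nearest-neighbour upsampling by a factor of 2 (stand-in for deconvolution)."""
    return [[v for v in c for _ in range(2)] for c in channels]


def concat_skip(decoder_channels, encoder_channels):
    """Skip connection: stack encoder and decoder channels of equal length."""
    assert len(decoder_channels[0]) == len(encoder_channels[0])
    return encoder_channels + decoder_channels


encoder_feat = [[1.0, 3.0, 2.0, 0.0]]    # 1 channel, length 4 (fine details)
bottleneck = downsample(encoder_feat)    # coarse features at half resolution
decoder_feat = upsample(bottleneck)      # back to length 4, but details are lost
fused = concat_skip(decoder_feat, encoder_feat)  # 2 channels, length 4
```

The fused tensor carries both the coarse decoder channel and the original fine-grained encoder channel at the input resolution.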
Based on the basic architectures, Drozdzal et al. (2016) further studied the influence of long and short
skip connections in biomedical image segmentation. They concluded that adding short skip connections is
important to train very deep segmentation networks. In one study, Zhou et al. (2018) claimed that the plain skip
connections between U-Net’s encoder and decoder subnetworks lead to the fusion of semantically dissimilar
feature maps; they proposed to reduce the semantic gap prior to fusing feature maps. In the proposed model
UNet++, the plain skip connections were replaced by nested and dense skip connections. The suggested
architecture outperformed U-Net and wide U-Net across four different medical image segmentation tasks.
Aside from redesigning the skip connections, Çiçek et al. (2016) replaced all 2D operations with their
3D counterparts to extend the 2D U-Net to 3D U-Net for volumetric segmentation with sparsely annotated
images. Further, Milletari et al. (2016) proposed V-Net for segmenting 3D MRI prostate volumes. A major
architectural difference between U-Net and V-Net lies in the change of the forward convolutional units (Figure
5(a)) to residual convolutional units (Figure 5(c)), so V-Net is also referred to as residual U-Net. A new loss
function based on Dice coefficient was proposed to deal with the imbalanced number of foreground and
background voxels. To tackle the scarcity of annotated volumes, the authors augmented their training dataset
with random non-linear transformations and histogram matching. Gibson et al. (2018a) proposed the Dense V-
network that modified V-Net’s loss function of binary segmentation to support multiorgan segmentation of
abdominal CT images. Although the authors followed the V-Net architecture, they replaced its relatively
shallow down-sampling network with a sequence of three dense feature stacks. The combination of densely
linked layers and the shallow V-Net architecture demonstrates its importance in improving segmentation
accuracy, and the proposed model yielded significantly higher Dice scores for all organs compared to multi-
atlas label fusion (MALF) methods.
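A Dice-based loss of the kind introduced with V-Net can be sketched as follows, using the form 1 - 2Σpg / (Σp² + Σg²) over predicted foreground probabilities p and binary ground truth g. The smoothing constant eps and the flattened-list input are assumptions added to keep the example self-contained.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient, with squared terms in the denominator."""
    inter = sum(p * g for p, g in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(g * g for g in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


# Imbalanced toy volume: 1 foreground voxel out of 100.
target = [1.0] + [0.0] * 99
all_background = [0.0] * 100
loss_missed = soft_dice_loss(all_background, target)
loss_perfect = soft_dice_loss(target, target)
```

Predicting only background would be 99% accurate voxel-wise in this toy volume, yet the Dice loss stays close to 1, which is why it is suited to imbalanced foreground and background.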
Figure 6. (a) Transformer layer; (b) the architecture of TransUNet (Chen et al., 2021b).
In another study, Zhang et al. (2021) adopt a different approach to combine CNN and Transformer.
Instead of first using CNN to extract low-level features and then passing features through the Transformer
layers, the proposed model TransFuse combines CNN and Transformer with two branches in a parallel manner.
The Transformer branch consisting of several layers takes as input a sequence of embedded image patches to
capture global context information. The output of the last layer is reshaped into 2D feature maps. To recover
finer local details, these maps are upsampled to higher resolutions at three different scales. Correspondingly,
the CNN branch uses three ResNet-based blocks to extract features from local to global at three different scales.
Features with the same resolution scale from both branches are selectively fused using an independent module.
The fused features can capture both the low-level spatial and high-level global context. In the end, the multi-
level fused features are used to generate a final segmentation mask. TransFuse achieved good performance in
prostate MRI segmentation.
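The two reshaping steps around the Transformer branch, turning an image into a sequence of patches and turning the output sequence back into a 2D map, can be sketched without any learned layers. The example below assumes non-overlapping square patches in row-major order and omits the linear embedding; the function names are hypothetical.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tokens.append([image[r + i][c + j] for i in range(patch) for j in range(patch)])
    return tokens


def tokens_to_grid(tokens, grid_h, grid_w):
    """Reshape a sequence of grid_h * grid_w tokens back into a 2D grid."""
    assert len(tokens) == grid_h * grid_w
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy image
tokens = image_to_patches(image, patch=2)                  # sequence of 4 tokens
grid = tokens_to_grid(tokens, 2, 2)                        # 2 x 2 map of tokens
```

The recovered grid has a lower spatial resolution than the input, which is why the text describes subsequent upsampling to recover finer local details.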
In addition to 2D image segmentation, the hybrid approach is also useful to 3D scenarios. Hatamizadeh
et al. (2022) propose a UNet-based architecture to perform volumetric segmentation of MRI brain tumor and
CT spleen. Similar to 2D cases, 3D images are first split into volumes. Then linear embeddings and positional
embeddings are applied to the sequence of input image volumes before they are fed to the encoder. The encoder,
composed of multiple Transformer layers, extracts multi-scale global feature representations from the
embedded sequence. The extracted features at different scales are all upsampled to higher resolutions and later
merged with multi-scale features from the decoder via skip connections. In another study, Xie et al. (2021b)
investigate how to reduce Transformers’ computational and spatial complexities in the 3D multi-organ segmentation
task. To achieve this goal, they replace the original MSA module in the vanilla Transformer with the deformable
self-attention module (Zhu et al., 2021a). This attention module attends over a small set of key positions instead
of treating all positions equally, thus resulting in much lower complexity. In addition, their proposed architecture,
CoTr, is in the same spirit as TransUNet: a CNN generates feature maps that are used as the inputs of Transformers.
The difference lies in that instead of extracting only single-scale features, the CNN in CoTr extracts feature
maps at multiple scales.
For the Transformer-only paradigm, Cao et al. (2021) present Swin-Unet, the first UNet-like pure
Transformer architecture for medical image segmentation. Swin-Unet has a symmetric encoder-decoder
structure without using any convolutional operations. The major components of the encoder and decoder are
(1) Swin Transformer blocks (Liu et al., 2021) and (2) patch merging or patch expanding layers. Enabled by a
shifted windowing scheme, the Swin Transformer block exhibits better modeling power as well as lower
complexity in computing self-attention. Therefore, the authors use it to extract feature representations for the
input sequence of image patch embeddings. The subsequent patch merging layer down-samples the feature
representations/maps into lower resolutions. These down-sampled maps are further passed through several
other Transformer blocks and patch merging layers. Likewise, the decoder also uses Transformer blocks for
feature extraction, but its patch expanding layers upsample feature maps into higher resolutions. Similar to
U-Net, the upsampled feature maps are fused with the down-sampled feature maps from the encoder via skip
connections. Finally, the decoder outputs pixel-level segmentation predictions. The proposed framework
achieved satisfactory results on multi-organ CT and cardiac MRI segmentation tasks.
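The down-sampling performed between Transformer blocks can be sketched as a rearrangement of the feature map: each 2 x 2 neighbourhood is concatenated along the channel axis, halving the resolution and quadrupling the channel count. The sketch below assumes an H x W x C map stored as nested lists and omits the linear projection that normally follows; the concatenation order and the function name are illustrative assumptions.

```python
def patch_merging(fmap):
    """Merge each 2 x 2 neighbourhood of an H x W x C map into one 4C-channel token."""
    h, w = len(fmap), len(fmap[0])
    assert h % 2 == 0 and w % 2 == 0
    merged = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            # Concatenate channels of the four neighbours (order is a convention).
            row.append(fmap[r][c] + fmap[r + 1][c] + fmap[r][c + 1] + fmap[r + 1][c + 1])
        merged.append(row)
    return merged


fmap = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]  # 4 x 4 x 1
out = patch_merging(fmap)                                          # 2 x 2 x 4
```

A patch expanding layer in the decoder would perform the inverse rearrangement to restore the higher resolution.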
Note that, to ensure good performance and reduce training time, most of the Transformers-based
segmentation models introduced so far are pre-trained on a large external dataset (e.g., ImageNet). Interestingly,
it has been shown that Transformers can also produce good results without pre-training by utilizing
computationally efficient self-attention modules (Wang et al., 2020a) and new training strategies to integrate
high-level information and finer details (Valanarasu et al., 2021). Also, when applying Transformers-based
models to 3D medical image segmentation, Hatamizadeh et al. (2022) and Xie et al. (2021b) find that pre-training
does not improve performance.
III. Mask R-CNN for segmentation
Aside from the above UNet and Transformers-based approaches, another architecture Mask RCNN (He
et al., 2017), which was originally developed for pixelwise instance segmentation, has achieved good results in
medical tasks. Since it is closely related to Faster RCNN (Ren et al., 2015; Ren et al., 2017), which is a region-
based CNN for object detection, details of Mask RCNN and its relations with the detection architectures will
be elaborated later. To sum up in brief, Mask RCNN has (1) a region proposal network (RPN) as in Faster
RCNN to produce high-quality region proposals (i.e., likely to contain objects), (2) the RoIAlign layer to
preserve spatial correspondence between RoIs and their feature maps, and (3) a parallel branch for binary mask
prediction in addition to bounding box prediction as in Faster RCNN. Notably, the Feature Pyramid Network
(FPN) (Lin et al., 2017a) is used as the backbone of the Mask RCNN to extract multi-scale features. FPN has a
bottom-up pathway and a top-down pathway to extract and merge features in a pyramidal hierarchy. The
bottom-up pathway extracts feature maps from high resolution (semantically weak features) to low resolution
(semantically strong), whereas the top-down pathway operates in the opposite direction. At each resolution, features
generated by the top-down pathway are enhanced by features from the bottom-up pathway via skip connections.
This design may make FPN appear similar to U-Net, but the major difference is that FPN makes predictions
independently at all resolution scales instead of one.
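The top-down merging in FPN can be sketched on 1D single-channel maps. The example below assumes nearest-neighbour upsampling and plain element-wise addition, and leaves out the lateral and smoothing convolutions used in practice; the function names are hypothetical.

```python
def upsample2(x):
    """Nearest-neighbour upsampling of a 1D map by a factor of 2."""
    return [v for v in x for _ in range(2)]


def fpn_top_down(bottom_up):
    """Merge bottom-up maps (ordered fine to coarse, each half the previous length).

    Returns one merged map per level, in the same order.
    """
    merged = [None] * len(bottom_up)
    merged[-1] = list(bottom_up[-1])  # the coarsest level starts the top-down pathway
    for level in range(len(bottom_up) - 2, -1, -1):
        up = upsample2(merged[level + 1])
        merged[level] = [a + b for a, b in zip(bottom_up[level], up)]
    return merged


c1 = [1.0, 1.0, 1.0, 1.0]  # high resolution, semantically weak
c2 = [2.0, 4.0]
c3 = [8.0]                 # low resolution, semantically strong
p1, p2, p3 = fpn_top_down([c1, c2, c3])
```

Unlike U-Net, which predicts from the final decoder level only, a detector built on this pyramid would attach a prediction head to each of p1, p2, and p3.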
Wang et al. (2019b) proposed a volumetric attention (VA) module within the Mask RCNN framework
for 3D medical image segmentation. This attention module can utilize the contextual relations along the z
direction of 3D CT volumes. More concretely, feature pyramids are extracted from not only the target image
(3 adjacent slices with the target CT slice in the middle), but also a series of neighboring images (also 3 CT
slices). Then the target and neighboring feature pyramids are concatenated at each level to form an intermediate
pyramid, which carries long-range relations in the z axis. In the end, spatial attention and channel attention are
applied on the intermediate and target pyramids to form the final feature pyramid for mask prediction. With
this VA module incorporated, Mask RCNN could achieve fewer false positives in segmentation. In another
study, Zhou et al. (2019c) combined UNet++ and Mask RCNN, leading to Mask RCNN++. As mentioned
earlier, UNet++ demonstrates better segmentation results using the redesigned nested and dense skip
connections, so the authors use them to replace the plain skip connections of the FPN inside Mask RCNN. A
large performance boost was observed using the proposed model.
3.2.2. Unsupervised learning based segmenting models
For medical image segmentation, to alleviate the need for a large amount of annotated training data,
researchers have adopted generative models for image synthesis to increase the number of training examples
(Zhang et al., 2018b; Zhao et al., 2019a). Meanwhile, exploiting the power of unlabeled medical images seems
to be a much more popular choice. In contrast to high-quality annotated datasets, which are difficult and expensive
to obtain, unlabeled medical images are often available in large numbers. Given a small medical
image dataset with limited ground truth annotations and a related but unlabeled large dataset, researchers have
explored self-supervised and semi-supervised learning approaches to learn useful and transferable feature
representations from the unlabeled dataset, which will be discussed in this and the next section respectively.
Self-supervised pretext tasks: Since self-supervision via pretext tasks and contrastive learning can
learn rich semantic representations from unlabeled datasets, self-supervised learning is often used to pre-train
the model and enable solving downstream tasks (e.g., medical image segmentation) more accurately and
efficiently when limited annotated examples are available (Taleb et al., 2020). The pretext tasks could be either
designed based on application scenarios or chosen from traditional ones used in the computer vision field. For
the former type, Bai et al. (2019) designed a novel pretext task by predicting anatomical positions for cardiac
MR image segmentation. The self-learnt features via the pretext task were transferred to tackle a more
challenging task, accurate ventricles segmentation. The proposed method achieved much higher segmentation
accuracy than the standard U-Net trained from scratch, especially when only limited annotations were available.
For the latter type, Taleb et al. (2020) extended pretext tasks from 2D to 3D scenarios, and
they investigated the effectiveness of several pretext tasks (e.g., rotation prediction, jigsaw puzzles, relative
patch location) in 3D medical image segmentation. For brain tumor segmentation, they adopted the U-Net
architecture, and the pretext tasks were performed on a large unlabeled dataset (about 22,000 MRI scans) to
pre-train the models; then the learned feature representations were fine-tuned on a much smaller labeled dataset
(285 MRI scans). The 3D pretext tasks performed better than their 2D counterparts; more importantly, the
proposed methods sometimes outperformed supervised pre-training, suggesting a good generalization ability
of the self-learnt features.
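A pretext task such as rotation prediction obtains its labels for free from the transformation itself. The sketch below generates the four rotated copies of a 2D image together with their rotation labels; it is a 2D simplification of the 3D setting discussed above, and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pretext_samples(image):
    """Return [(rotated_image, label)], where label k is the number of 90-degree turns."""
    samples = []
    current = [list(row) for row in image]
    for k in range(4):
        samples.append((current, k))
        current = rotate90(current)
    return samples


image = [[1, 2], [3, 4]]
samples = rotation_pretext_samples(image)
```

A network pre-trained to predict the label k from the rotated input needs no manual annotation, and its encoder weights can then be fine-tuned on the smaller labeled segmentation dataset.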
The performance of self-supervised pre-training could also be improved by adding other types of
information. Hu et al. (2020) implemented a context encoder (Pathak et al., 2016) performing semantic
inpainting as the pretext task, and they incorporated DICOM metadata from ultrasound images as weak labels
to boost the quality of pre-trained features toward facilitating two different segmentation tasks.
Self-supervised contrastive learning based approaches: For this method, early studies such as the
work by Jamaludin et al. (2017) adopted the original contrastive loss (Chopra et al., 2005b) to learn useful
feature representations. In the past three years, with a surge of interest in self-supervised contrastive learning,
contrastive loss has evolved from the original version to more powerful ones (Oord et al., 2018) for learning
expressive feature representations from unlabeled datasets. Chaitanya et al. (2020) claimed that although the
contrastive loss in Chen et al. (2020a) is suitable for learning image-level (global) feature representations, it
does not guarantee learning distinctive local representations that are important for per-pixel segmentation. They
proposed a local contrastive loss to capture local features that can provide complementary information to boost
the segmentation performance. Meanwhile, to the best of our knowledge, when computing the global
contrastive loss, these authors are the first to utilize the domain knowledge that there is structural similarity in
volumetric medical images (e.g., CT and MRI). In MR image segmentation with limited annotations, the proposed
method substantially outperformed other semi-supervised and self-supervised methods. In addition, it was
shown that the proposed method could further benefit from data augmentation techniques like Mixup (Zhang
et al., 2018a).
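The more recent contrastive losses share a common form: the similarity between an anchor and its positive is contrasted against similarities to negatives through a softmax. A minimal sketch of such an InfoNCE-style loss is given below, assuming cosine similarity and a temperature parameter; the vectors, the temperature value, and the function names are illustrative and not taken from any cited method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(sim(a, p) / t) / sum over positive and negatives of exp(sim / t) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    return log_sum - logits[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Applied to whole-image embeddings this yields a global loss; a local variant would, under the same assumptions, apply the same computation to embeddings of corresponding regions of the feature maps.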
3.2.3. Semi-supervised learning based segmenting models
Semi-supervised consistency regularization: The mean teacher model is commonly used. Based on
the mean teacher framework, Yu et al. (2019) introduced uncertainty estimation (Kendall and Gal, 2017) for
better segmentation of 3D left atrium from MR images. They argued that on an unlabeled dataset, the output
of the teacher model can be noisy and unreliable; therefore, besides generating target outputs, the teacher model
was modified to estimate these outputs’ uncertainty. The uncertainty-aware teacher model can produce more
reliable guidance for the student model, and the student model could in turn improve the teacher model. The
mean teacher model can also be improved by the transformation-consistent strategy (Li et al., 2020b). In one
study, Wang et al. (2020b) proposed a semi-supervised framework to segment COVID-19 pneumonia lesions
from CT scans with noisy labels. Their framework is also based on the Mean Teacher model; instead of
updating the teacher model with a predefined value, they adaptively updated the teacher model using a
dynamic threshold for the student model’s segmentation loss. Similarly, the student model was also adaptively
updated by the teacher model. To simultaneously deal with noisy labels and the foreground-background
imbalance, the authors developed a generalized version of the Dice loss. The authors designed the segmentation
network in the same spirit of U-Net but made several changes in terms of new skip connections (Pang et al.,
2019), multi-scale feature representation (Chen et al., 2018a), etc. In the end, the segmentation network with
the Dice loss was combined with the mean teacher framework. The proposed method demonstrated high
robustness to label noise and achieved better performance for pneumonia lesion segmentation than other state-
of-the-art methods.
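The idea of letting the teacher's uncertainty gate the consistency term can be sketched as follows. The example assumes that uncertainty is measured by the entropy of the mean foreground probability over several stochastic teacher passes and that voxels above a fixed threshold are excluded; the entropy measure, the threshold, and all names are assumptions for illustration rather than the exact procedure of the cited works.

```python
import math


def predictive_entropy(prob_samples):
    """Entropy of the mean foreground probability over T stochastic passes."""
    p = sum(prob_samples) / len(prob_samples)
    ent = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            ent -= q * math.log(q)
    return ent


def masked_consistency(student, teacher_samples, threshold):
    """MSE between student and mean teacher output over low-uncertainty voxels.

    teacher_samples[v] holds T teacher probabilities for voxel v.
    Returns (loss, number_of_voxels_kept).
    """
    total, kept = 0.0, 0
    for s, samples in zip(student, teacher_samples):
        if predictive_entropy(samples) < threshold:
            mean_t = sum(samples) / len(samples)
            total += (s - mean_t) ** 2
            kept += 1
    return (total / kept if kept else 0.0), kept


student = [0.8, 0.5]
teacher_samples = [[1.0, 1.0], [0.0, 1.0]]  # voxel 0: certain; voxel 1: contradictory
loss, kept = masked_consistency(student, teacher_samples, threshold=0.5)
```

Only the voxel on which the teacher passes agree contributes to the loss, so the student is not pulled toward unreliable targets.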
Semi-supervised pseudo labeling: Fan et al. (2020) presented a semi-supervised framework (Semi-
InfNet) to tackle the lack of high-quality labeled data in COVID-19 lung infection segmentation from CT
images. To generate pseudo labels for the unlabeled images, they first used 50 labeled CT images to train their
model, which produced pseudo labels for a small number of unlabeled images. Then the newly pseudo-labeled
examples were included in the original labeled training dataset to re-train the model to generate pseudo labels
for another batch of unlabeled images. This process was repeated until all 1600 unlabeled CT images were
pseudo-labeled. Both the labeled and pseudo-labeled examples were used to train Semi-InfNet, and its
performance surpassed other cutting-edge segmentation models such as UNet++ by a large margin. Aside from
the semi-supervised learning strategy, there are three critical components in the model responsible for the good
performance: parallel partial decoder (PPD) (Wu et al., 2019a), reverse attention (RA) (Chen et al., 2018b),
and edge attention (EA) (Zhang et al., 2019). PPD can aggregate high-level features of the input image and generate
a global map indicating the rough location of lung infection regions; the EA module uses low-level features to
model boundary details, and the RA module further refines the rough estimation into an accurate segmentation.
Semi-supervised generative models: As one of the earliest works that extended generative models to
semi-supervised segmentation task, Sedai et al. (2017) utilized two VAEs to segment optic cup from retinal
fundus images. The first VAE was employed to learn feature embeddings from a large number of unlabeled
images by performing image reconstruction; the second VAE, trained on a smaller number of labeled images,
mapped input images to segmentation masks. In other words, the authors used the first VAE to perform an
auxiliary task (image reconstruction) on unlabeled data, which can help the second VAE to better achieve the
target objective (image segmentation) using labeled data. To leverage the feature embeddings learned by the
first VAE, the second VAE simultaneously reconstructed segmentation masks and latent representations of the
first VAE. The utilization of additional information from unlabeled images improved segmentation accuracy.
In another study, Chen et al. (2019c) also adopted a similar idea of introducing an auxiliary task on unlabeled
data to facilitate performing image segmentation with limited labeled data. Specifically, the authors proposed
a semi-supervised segmentation framework consisting of a UNet-like network for segmentation (target
objective) and an autoencoder for reconstruction (auxiliary task). Unlike the previous study that trained two
VAEs separately, the segmentation network and reconstruction network in this framework share the same
encoder. Another difference lies in that the foreground and background parts of the input image were
reconstructed/generated separately, and the respective segmentation labels were obtained via an attention
mechanism. This semi-supervised segmentation framework outperformed its counterparts (e.g., fully
supervised CNNs) in different labeled/unlabeled data splits.
In addition to the aforementioned approaches, researchers have also explored incorporating domain-specific
prior knowledge to tailor the semi-supervised frameworks for better segmentation performance. The
prior knowledge varies widely, including the anatomical prior (He et al., 2019), atlas prior (Zheng et al., 2019),
topological prior (Clough et al., 2020), semantic constraint (Ganaye et al., 2018), and shape constraint (Li et
al., 2020c), to name a few.
Table 2. A list of recent papers related to medical image segmentation
Author | Year | Application | Model | Dataset | Contribution highlights
U-Net and its variants
Milletari et al., 2016 | 2016 | MRI prostate volumes segmentation | V-Net (Residual U-Net) | PROMISE 2012 dataset | (1) Incorporation of residual learning; (2) A new loss function based on Dice coefficient to deal with class imbalance; (3) Data augmentation by applying random non-linear transformations and histogram matching.
Zhou et al., 2018 | 2018 | Segmentation of (1) CT lung nodules, (2) microscopic cell nuclei, (3) CT liver, and (4) colon polyps | UNet++ | Data Science Bowl 2018, MICCAI 2018 | (1) Proposing nested and dense skip connections to reduce the semantic gap before fusing feature maps; (2) Using deep supervision to enable accurate and fast segmentation.
Gibson et al., 2018a | 2018 | Segmentation of CT pancreas, gastrointestinal organs, and surrounding organs | Dense V-Net | NIH Pancreas-CT dataset and BTCV challenge dataset | (1) A new loss function extends binary segmentation to multiorgan segmentation; (2) Integrating densely linked layers into the shallow V-Net architecture.
Alom et al., 2018 | 2018 | Segmentation of (1) retina blood vessels, (2) skin cancer lesions, and (3) lung | RU-Net and R2U-Net | (1) DRIVE, STARE, CHASE_DB1; (2) ISIC 2017 Challenge; (3) Data Science Bowl 2017 | (1) Replacing U-Net’s forward convolutional units using RCNN’s recurrent convolutional layers to accumulate useful features; (2) Incorporating residual learning to train very deep networks.
Oktay et al., 2018 | 2018 | Multi-class CT segmentation of pancreas, spleen, and … | Attention U-Net | NIH Pancreas-CT dataset and a private dataset | (1) Incorporating attention gates into the U-Net architecture to learn important salient features and suppress irrelevant features; (2) Image grid-based gating improves attention to local regions.
Xue et al., 2018 | 2018 | Brain tumor segmentation from MRI volumes | adversarial network with a segmentor and a critic | MICCAI BRATS datasets in 2013 and 2015 | (1) Using adversarial learning for segmentation; (2) Proposing a multi-scale L1 loss function to facilitate learning local and global features.
Zhang et al., 2018b | 2018 | Segmentation of multi-modal cardiovascular images (CT and MRI) | GAN and a U-Net based … | Private dataset | (1) Training GAN by adding a cycle-consistency loss and a shape consistency loss, making the segmentor and the generator benefit from each other; (2) Updating the generator in an online manner.
Zhao et al., 2019a | 2019 | Segmentation of brain MRI scans | U-Net based networks and a SD-Net (Roy et al., 2017) based … | 8 publicly available MR datasets (e.g., ADNI, OASIS, etc.) | Novel data augmentation (i.e., learning complex spatial and appearance transformations to synthesize additional labeled images based on limited labeled examples).
Baumgartner et al., 2019 | 2019 | Segmentation of prostate MR and thoracic CT images | a probabilistic U-Net | LIDC-IDRI and a private dataset | (1) Applying conditional VAE for inference in the U-Net architecture; (2) Using a separate latent variable to control segmentation at each resolution level to hierarchically generate final …
Transformers for segmentation
Chen et al., 2021b | 2021 | Segmentation of (1) CT abdominal organs, (2) MRI cardiac structures | TransUNet: a hybrid cascaded CNN-Transformer architecture | Synapse multi-organ segmentation dataset, ACDC | (1) Combining the advantages of CNN features (low-level spatial information) and the Transformer (modeling long-range dependencies/high-level semantics); (2) To enable precise localization, self-attentive features from Transformer layers were combined with high-resolution CNN features via skip connections.
Zhang et al., 2021 | 2021 | MRI prostate segmentation | TransFuse: CNN and Transformer in parallel | Medical segmentation … | Combining CNN and Transformer with two branches in a parallel manner and proposing the BiFusion module to fuse features from the two branches.
Hatamizadeh et al., 2022 | 2022 | MRI brain tumor segmentation and CT spleen segmentation | UNETR: UNet-based architecture | Medical segmentation decathlon | (1) Directly utilizing volumetric data for 3D segmentation; (2) The Transformer was used as the main encoder.
Xie et al., 2021b | 2021 | CT abdominal multi-organ segmentation | CoTr: an encoder-decoder structure | Synapse multi-organ segmentation dataset | (1) Multiple-scale feature maps generated by a CNN were used as the inputs of Transformers; (2) Replacing the original MSA module in the vanilla Transformer with the deformable self-attention module to reduce computational and spatial complexities.
Cao et al., 2021 | 2021 | Segmentation of (1) CT abdominal organs, (2) MRI cardiac structures | Swin-Unet: a symmetric Transformer-based design | Synapse multi-organ segmentation dataset, ACDC | (1) The first Transformer-only architecture for medical image segmentation without any convolutional operations; (2) Using the Swin Transformer blocks (Liu et al., 2021) for better modeling power and lower complexity in computing self-attention.
Valanarasu et al., 2021 | 2021 | Segmentation of ultrasound brain anatomy and microscopic gland | A hybrid CNN-Transformer | Private dataset, MoNuSeg, etc. | (1) Proposing position-sensitive attention gates that enable good segmentation performance even on smaller datasets; (2) Using the entire image and image patches to train a shallow global branch and a deep local branch respectively for better performance.
Mask R-CNN for segmentation
Wang et al., 2019b | 2019 | Segmentation of liver tumor on CT images | … with volumetric … | LiTS challenge | Proposing a volumetric attention module to utilize the contextual relations along the z direction of 3D CT volumes.
Zhou et al., 2019c | 2019 | Segmentation of MRI brain tumor, CT liver, and CT lung nodules | Mask RCNN with UNet++ design | BraTS 2013, LiTS challenge, LIDC-IDRI, etc. | Using the redesigned nested and dense skip connections of UNet++ to replace the plain skip connections of the FPN inside Mask RCNN for better performance.
Semi-supervised segmentation
Yu et al., 2019 | 2019 | Segmentation of left atrium from 3D MR scans | UA-MT: Mean Teacher framework with V-Net as backbone | 2018 Atrial Challenge dataset | Enforcing the teacher model to provide more reliable guidance to the student model via uncertainty estimation, where the estimated uncertainty was used to filter out highly uncertain predictions.
Li et al., 2020b | 2020 | Segmentation of (1) skin lesions, (2) fundus optic disks, and (3) CT liver | TCSM_v2: Mean Teacher framework with U-Net-like network as backbone | Datasets of (1) ISIC 2017, (2) REFUGE, (3) LiTS | Imposing transformation-consistent regularizations to unlabeled images to enhance the network’s generalization capacity.
Wang et al., 2020b | 2020 | Segmentation of COVID-19 pneumonia lesions from CT scans | COPLE-Net: Mean Teacher framework with U-Net-like network as backbone | Private dataset | (1) Adaptively updating the teacher model and the student model; (2) Developing a generalized Dice loss to deal with noisy labels and foreground-background imbalance; (3) Using new skip connections and multi-scale feature representation.
Fan et al., 2020 | 2020 | Segmentation of COVID-19 lung infection from CT images | Semi-InfNet: Inf-Net trained in a semi-supervised manner | 2 publicly available CT datasets of COVID-19 | (1) Iterative generation of pseudo labels for unlabeled images; (2) Using the parallel partial decoder to generate a rough infection map, and reverse and edge attention modules to refine the segmentation map; (3) Multi-scale training strategy (Wu et al., 2019b).
Self-supervised segmentation
Bai et al., 2019 | 2019 | Cardiac MR image segmentation | … supervised U-… | UK Biobank (UKB) | (1) Pre-training the network using a new pretext task (i.e., predicting anatomical positions) where meaningful features were learned via self-supervision; (2) Comparing three different ways for supervised fine-…
Taleb et al., 2020 | 2020 | Segmentation of (1) brain tumor from MRI and (2) pancreas tumor from CT | Self-supervised 3D U-Net | (1) UKB and BraTS 2018, (2) part of medical decathlon benchmarks | (1) Extending traditional 2D pretext tasks to 3D, utilizing the 3D spatial context for better self-supervision; (2) A comprehensive comparison of the performance of five different pretext tasks.
Hu et al., 2020 | 2020 | Segmentation of (1) thyroid nodule and (2) liver/kidney from ultrasound images | … supervised U-Net with VGG16 or ResNet50 as … | (1) DDTI ultrasound dataset and (2) a private dataset | Incorporating DICOM metadata from ultrasound images as weak labels to improve the quality of pre-trained features from the pretext task.
Chaitanya et al., 2020 | 2020 | Segmentation of MRI cardiac structures and prostate regions | U-Net based encoder and decoder architecture | … 2017 ACDC, (2) Medical Segmentation Decathlon, (3) … | (1) Proposing a local contrastive loss; (2) Incorporation of domain knowledge (structural similarity in volumetric images) in contrastive loss calculation; (3) A comprehensive comparison of a variety of pre-training techniques, such as self-supervised contrastive and pretext task pre-training, etc.
Sedai et al., 2017 | 2017 | Optic cup segmentation from retinal fundus images | Two VAE-based models | Private dataset | Forcing the VAE to reconstruct not only segmentation masks but also latent representations so that useful information learned from unlabeled images can be …
Chen et al., 2019c | 2019 | Brain tumor and white matter hyperintensities segmentation from MRI | UNet-like network and autoencoder | BraTS18, WMH17 Challenge | Utilizing an attention mechanism to create separate segmentation labels for foreground and background areas of the input image so that the auxiliary reconstruction task and the target segmentation task can be better linked.
3.3. Detection
A natural image may contain objects belonging to different categories, and each object category may
contain several instances. In the computer vision field, object detection algorithms are applied to detect and
identify if any instance(s) from certain object categories are present in the image (Sermanet et al., 2014;
Girshick et al., 2014; Russakovsky et al., 2015). Previous works (Shen et al., 2017; Litjens et al., 2017) have
reviewed the successful applications of the frameworks before 2015, such as OverFeat (Sermanet et al., 2014;
Ciompi et al., 2015), RCNN (Girshick et al., 2014), and fully convolutional networks (FCN) based models
(Long et al., 2015; Dou et al., 2016; Wolterink et al., 2016). In contrast, we aim to summarize
applications of more recent object detection frameworks (since 2015), such as Faster RCNN (Ren et al., 2015),
YOLO (Redmon et al., 2016), and RetinaNet (Lin et al., 2017b). In this section, we first briefly review
several recent milestone detection frameworks, including one-stage and two-stage detectors. Since these
detection frameworks are often used in supervised and semi-supervised settings, we introduce them under
these learning paradigms. We then cover the applications of these frameworks to specific-type lesion
detection and universal lesion detection. Finally, we introduce unsupervised lesion detection based on
GANs and VAEs.
3.3.1. Supervised and semi-supervised lesion detection
I. Overview of the detection frameworks
The RCNN framework (Girshick et al., 2014) is a multi-stage pipeline. Despite its impressive results in
object detection, RCNN has several drawbacks: the multi-stage pipeline makes training slow and difficult
to optimize; separately extracting features for each region proposal makes training expensive in disk space and
time; and it also slows down testing (Girshick, 2015). These drawbacks have inspired several recent milestone
detectors, and they can be categorized into two groups (Liu et al., 2020b): (1) two-stage detection frameworks
(Girshick, 2015; Ren et al., 2015; Ren et al., 2017; Dai et al., 2016), which include a separate module to generate
region proposals before bounding box recognition (predicting class probabilities and bounding box
coordinates); (2) one-stage detection frameworks (Redmon et al., 2016; Redmon and Farhadi, 2017; Liu et al.,
2016; Lin et al., 2017b; Law and Deng, 2020; Duan et al., 2019) which predict bounding boxes in a unified
manner without separating the process of generating region proposals. In an image, region proposals are a
collection of potential regions or candidate bounding boxes that are likely to contain an object (Liu et al.,
Two-stage detectors: Unlike RCNN, the Fast RCNN framework (Girshick, 2015) is an end-to-end
detection pipeline employing a multi-task loss to jointly classify region proposals and regress bounding boxes.
Region proposals in Fast RCNN are generated on a shared convolutional feature map rather than the original
image to speed up computation. A Region of Interest (RoI) pooling layer is then applied to warp all the region
proposals to the same size. These adjustments resulted in better and faster detection, but the speed
of Fast RCNN is still bottlenecked by the inefficient process of computing region proposals. In the Faster RCNN
framework (Ren et al., 2015; Ren et al., 2017), a Region Proposal Network (RPN) replaced the selective search
method to efficiently produce high-quality region proposals from anchor boxes. Anchor boxes are a set of
pre-determined candidate boxes of different sizes and aspect ratios to capture objects of specific classes (Ren et al.,
2015). Since then, anchor boxes have played a dominant role in top-ranked detection frameworks. Mask
RCNN (He et al., 2017) is closely related to Faster RCNN but was originally designed for pixelwise object
instance segmentation. Mask RCNN also has an RPN to propose candidate object bounding boxes; this new
framework extends Faster RCNN by adding an extra branch that outputs a binary object mask to the existing
branch of predicting classes and bounding box offsets. Mask RCNN uses a Feature Pyramid Network (FPN)
(Lin et al., 2017a) as its backbone to extract features at various resolution scales. Besides instance segmentation,
Mask RCNN can be used for object detection, achieving excellent accuracy and speed.
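To make the notion of anchor boxes concrete, the following is a minimal sketch of how a set of anchors with
different sizes and aspect ratios could be generated at one location of a feature map. The function name
make_anchors, the three scales, and the three ratios are illustrative assumptions, not the exact
configuration of any framework reviewed here.

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) rows centred at `center`.

    Each anchor has area scale**2 and aspect ratio (height / width) `ratio`.
    Scales and ratios are illustrative defaults (assumption).
    """
    cx, cy = center
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # width shrinks as the box gets taller
            h = s * np.sqrt(r)   # height grows so that w * h == s**2
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# Nine candidate boxes (3 scales x 3 ratios) around one image location.
anchors = make_anchors(center=(300.0, 200.0))
```

In an anchor-based detector, such a set is tiled over every feature-map location, and the network predicts
an objectness score and box offsets relative to each anchor.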
One-stage detectors: Redmon et al. (2016) proposed a single-stage framework, YOLO; instead of using
a separate network to generate region proposals, they treated object detection as a simple regression problem.
A single network was used to directly predict object classes and bounding box coordinates. YOLO also differs
from region proposal based frameworks (e.g., Faster RCNN) in that it learns features globally from the entire
image rather than from local regions. Despite being faster and simpler, YOLO has more localization errors and
lower detection accuracy than Faster RCNN. Later the authors proposed YOLOv2 and YOLO9000 (Redmon
and Farhadi, 2017) to improve the performance by integrating different techniques, including batch
normalization, using good anchor boxes, fine-grained features, multi-scale training, etc. Lin et al. (2017b)
identified that the central cause for the lagging performance of one-stage detectors is the imbalance between
foreground and background classes (i.e., the training process was dominated by vast numbers of easy examples
from the background). To deal with the class imbalance problem, they proposed a new focal loss that can
weaken the influence of easy examples and enhance the contribution of hard examples. The proposed
framework (RetinaNet) demonstrated higher detection accuracy than state-of-the-art two-stage detectors at that
time. Law and Deng (2020) proposed CornerNet and pointed out that the prevalent use of anchor boxes in
object detection frameworks, especially one-stage detectors, causes issues such as the extreme imbalance
between positive and negative examples, slow training, and extra hyperparameters. Instead of
designing a set of anchor boxes to detect bounding boxes, the authors formulated bounding box detection as
detecting a pair of key-points (top-left and bottom-right corners) (Newell et al., 2017; Tychsen-Smith and
Petersson, 2017). Nonetheless, CornerNet generates a large number of incorrect bounding boxes since it cannot
fully utilize the recognizable information inside the cropped regions (Duan et al., 2019). Based on CornerNet,
Duan et al. (2019) proposed CenterNet, which detects each object using a triplet of key-points, including a pair
of corners and one center key-point. Unlike CornerNet, CenterNet can extract more recognizable visual patterns
within each proposed region, thus effectively suppressing inaccurate bounding boxes (Duan et al., 2019).
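The effect of the focal loss on easy and hard examples can be illustrated with a short sketch of its binary
form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name focal_loss and the toy inputs are
illustrative; gamma = 2 and alpha = 0.25 are used here only as commonly quoted defaults and should be
treated as assumptions.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-example binary focal loss.

    p: predicted foreground probability, y: 0/1 label.
    gamma down-weights well-classified examples; alpha balances the classes.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)               # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# easy positive, hard positive, easy negative, hard negative
p = np.array([0.9, 0.1, 0.1, 0.9])
y = np.array([1, 1, 0, 0])
losses = focal_loss(p, y)
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the standard cross-entropy, so the modulating
factor (1 - p_t)^gamma is what suppresses the contribution of the many easy background examples.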
II. Specific-type medical object (e.g., lesion) detection
Common computer-aided detection (CADe) tasks include detecting lung nodules (Gu et al., 2018; Xie
et al., 2019b), breast masses (Akselrod-Ballin et al., 2017; Ribli et al., 2018), lymph nodes (Zhu et al., 2020b),
sclerosis lesions (Nair et al., 2020), etc. The general detection frameworks, originally designed for general
object detection in natural images, cannot guarantee satisfactory performance for lesion detection in medical
images for two main reasons: (1) lesions can be extremely small in size compared to natural objects; (2) lesions
and non-lesions often have similar appearances (e.g. texture and intensity) (Tao et al., 2019; Tang et al., 2019).
To deliver good detection performance in the medical domain, these frameworks need to be adjusted through
different methods, such as incorporating domain-specific characteristics, uncertainty estimation, or
semi-supervised learning strategies, which are presented as follows.
Incorporating domain-specific characteristics has been a popular choice in both the radiology and
histopathology domains. In the radiology domain, the intrinsic 3D spatial context information among
volumetric images (e.g. CT scans) has been utilized in many studies (Roth et al., 2016; Dou et al., 2017; Yan
et al., 2018a; Liao et al., 2019). For example, in the task of pulmonary nodule detection, Ding et al. (2017)
argued that the original Faster RCNN (Ren et al., 2015) with the VGG-16 network (Liu and Deng, 2015) as its
backbone cannot capture representative features of small pulmonary nodules; they introduced a deconvolutional
layer at the end of Faster RCNN to recover fine-grained features that are important in detecting small objects.
On the deconvolutional feature map, an FPN was applied to propose candidate regions of nodules from 2D
axial slices. To reduce the false positive rate, the authors proposed to make the classification network see the
full range of contexts of the nodule candidates. Instead of using 2D CNN, they chose a 3D CNN to exploit the
3D context of candidate regions so that more distinctive features can be captured for nodule recognition. The
proposed method ranked first in nodule detection on the LUNA16 benchmark dataset (Setio et al.,
2017). Zhu et al. (2018) also considered the 3D nature of lung CT images and designed a 3D Faster RCNN for
nodule detection. To efficiently learn nodule features, the 3D Faster RCNN had a U-Net-like structure
(Ronneberger et al., 2015) and was built with compact dual path blocks (Chen et al., 2017). It should be noted
that, despite their effectiveness in boosting detection performance, 3D CNNs have downsides compared to 2D
CNNs, including consuming more computational resources and requiring more effort to acquire 3D bounding
box annotations (Yan et al., 2018a; Tao et al., 2019). In a recent study, Mei et al. (2021) established a large
dataset (PN9) with more than 40,000 annotated lung nodules to train 3D CNN-based models. The authors
improved the model’s ability to detect both large and small lung nodules by utilizing correlations that exist
among multiple consecutive CT slices. Given a slice group, a non-local operation based module (Wang et al.,
2018) was employed to capture long-range dependencies of different positions and different channels in the
feature map. Furthermore, since each shallow ResNet block can generate feature maps on the same scale that
carry useful spatial information, the authors reduced false positive nodule candidates by merging multi-scale
features produced by 3 different blocks.
In the histopathology domain, Rijthoven et al. (2018) presented a modified version of YOLOv2
(Redmon and Farhadi, 2017) for lymphocyte detection in whole-slide images (WSI). Based on prior
knowledge of lymphocytes (e.g., average size, no overlaps), the authors simplified the original 23-layer YOLO
network by keeping only a few layers. With the prior knowledge that brown areas without lymphocytes
in the WSI contain many hard negative samples, the authors also designed a sampling strategy to force the
detection model to focus on these hard negative examples during training. The proposed method improved the
F1-score by 3% with a speed-up of 4.3X. In their later work, Swiderska-Chadaj et al. (2019) modified the YOLO
architecture to further detect lymphocytes in a more diversified WSI dataset of breast, prostate, and colon
cancer; however, it did not perform as well as the U-Net based detection architecture, which first classified
each pixel and then produced detection results using post-processing techniques. The modified YOLO
architecture was also shown to be the least robust to different staining.
Recently, semi-supervised methods have been used to improve the performance of medical object
detection (Gao et al., 2020; Qi et al., 2020). For example, Wang et al. (2020c) developed a generalized version
of the original focal loss (Lin et al., 2017b) to handle soft labels when computing the semi-supervised loss.
They modified the semi-supervised approach MixMatch (Berthelot et al., 2019) in two respects to make it
suitable for 3D medical image detection. An FPN was first applied on unlabeled CT images (without lesion
annotations) to generate pseudo-labeled object instances. Then the pseudo-labeled examples were mixed with
examples having ground truth annotations through Mixup augmentation. However, the original Mixup
augmentation (Zhang et al., 2018a) was designed for classification tasks where labels are image classes; the
authors adapted this augmentation technique to the lesion detection task with annotations in the form of
bounding boxes. The semi-supervised approach demonstrated a significant performance gain over supervised
learning baselines in pulmonary nodule detection.
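For reference, the original classification form of Mixup can be sketched as follows. This shows only the
image-level version with class-probability labels; how bounding-box annotations are mixed in the detection
setting is not specified in this review, so that adaptation is deliberately left out. The function name mixup
and the value alpha = 0.4 are illustrative assumptions.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Convex combination of two examples and their (soft) labels.

    lam ~ Beta(alpha, alpha); alpha is a hypothetical hyperparameter value.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x_a, x_b = rng.random((4, 4)), rng.random((4, 4))   # toy "images"
y_a = np.array([1.0, 0.0])    # ground-truth (one-hot) label
y_b = np.array([0.3, 0.7])    # pseudo-label given as class probabilities
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, rng=rng)
```

Because the mixed label is itself a probability vector, the loss applied afterwards has to accept soft
targets, which is consistent with the need for a focal loss generalized to soft labels.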
In addition, uncertainty estimation is another useful technique to facilitate the detection of small objects
(Ozdemir et al., 2017; Nair et al., 2020). For example, in the task of multiple sclerosis lesion detection where
uncertainties mostly result from small lesions and lesion boundaries, Nair et al. (2020) explored using
uncertainty estimates to improve detection performance. Specifically, four uncertainty measures were
computed: a predicted variance from training data (Kendall and Gal, 2017), variance of Monte Carlo (MC)
samples, a predictive entropy, and mutual information. A threshold formed by these measures was used to filter
out the most uncertain lesion candidates and thus improve detection performance.
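Three of the four measures can be computed directly from Monte Carlo samples, as in the sketch below for
a binary (lesion/non-lesion) output. The learned predicted variance is a separate network output and is not
reproduced here. The function names and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def binary_entropy(p, eps=1e-7):
    """Entropy of a Bernoulli distribution with parameter p (in nats)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def mc_uncertainty(mc_probs):
    """mc_probs has shape (T, ...): T stochastic forward passes (e.g., MC dropout)."""
    mean_p = mc_probs.mean(axis=0)
    variance = mc_probs.var(axis=0)                   # variance of the MC samples
    entropy = binary_entropy(mean_p)                  # predictive entropy
    mutual_info = entropy - binary_entropy(mc_probs).mean(axis=0)
    return mean_p, variance, entropy, mutual_info

# Two candidates: the first is predicted consistently, the second is not.
mc_probs = np.array([[0.95, 0.2],
                     [0.97, 0.8],
                     [0.96, 0.5]])
mean_p, variance, entropy, mutual_info = mc_uncertainty(mc_probs)
```

Candidates whose uncertainty exceeds a chosen threshold can then be discarded, which is the filtering step
described above.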
III. Universal lesion detection
Traditional lesion detectors have focused on a specific type of lesion, but there is rising research
interest in identifying and localizing different kinds of lesions across the whole human body all at once (Yan et
al., 2018a; Yan et al., 2019; Tao et al., 2019; Yan et al., 2020; Cai et al., 2020; Li et al., 2020d). DeepLesion is
a large and comprehensive dataset (32K lesions) that contains a variety of lesion types such as lung nodule,
liver tumor, abdominal mass, pelvic mass, etc. (Yan et al., 2018b; Yan et al., 2018c). Tang et al. (2019) proposed
ULDor based on Mask RCNN for universal lesion detection. Training Mask-RCNN requires ground truth
masks for lesions; however, the DeepLesion dataset does not contain such annotated masks. With the RECIST
(Response Evaluation Criteria In Solid Tumors) annotations (Eisenhauer et al., 2009), the authors estimated
real masks via ellipse fitting for each lesion region. In addition, hard negative examples were used to re-train
the model to reduce false positives. Yan et al. (2019) further improved the performance of universal lesion
detection by enforcing a multitask detector (MULAN) to jointly perform lesion detection, tagging, and
segmentation. It was previously shown that combining different tasks may provide complementary information
to each other and thus enhance the performance of a single task (Wu et al., 2018b; Tang et al., 2019). MULAN
is modified from Mask RCNN (He et al., 2017) with three head branches. The detection branch predicts whether
each proposed region is a lesion and regresses bounding boxes; the tagging branch predicts 185 tags (e.g., body
part, lesion type, intensity, shape, etc.) for each lesion proposal; the segmentation branch outputs a binary mask
(lesion/non-lesion) for each proposed region. MULAN significantly surpassed previous lesion detection models
such as ULDor (Tang et al., 2019) and 3DCE (Yan et al., 2018a). Furthermore, Yan et al. (2020) have recently
shown that learning from heterogeneous lesion datasets and partial labels can also boost detection performance.
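The idea of deriving a mask from RECIST annotations by ellipse fitting can be sketched as follows. The
sketch assumes that each annotation provides the endpoints of a long and a short diameter, that the two
diameters are roughly perpendicular, and that the ellipse is centred at the mean of the four endpoints; the
function name and these simplifications are assumptions and not necessarily the procedure used in ULDor.

```python
import numpy as np

def ellipse_mask_from_recist(long_axis, short_axis, shape):
    """Binary mask of an ellipse built from two diameters.

    long_axis, short_axis: ((x1, y1), (x2, y2)) endpoint pairs in pixel coordinates.
    shape: (height, width) of the output mask.
    """
    (x1, y1), (x2, y2) = long_axis
    (x3, y3), (x4, y4) = short_axis
    cx = (x1 + x2 + x3 + x4) / 4.0            # centre: mean of the four endpoints
    cy = (y1 + y2 + y3 + y4) / 4.0
    a = np.hypot(x2 - x1, y2 - y1) / 2.0      # semi-major axis
    b = np.hypot(x4 - x3, y4 - y3) / 2.0      # semi-minor axis
    theta = np.arctan2(y2 - y1, x2 - x1)      # orientation of the long axis
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(theta) + dy * np.sin(theta)    # coordinate along the long axis
    v = -dx * np.sin(theta) + dy * np.cos(theta)   # coordinate along the short axis
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

mask = ellipse_mask_from_recist(((10, 32), (54, 32)), ((32, 22), (32, 42)), (64, 64))
```

The resulting mask is only an approximation of the lesion outline, but it is sufficient to supervise the mask
branch of a Mask RCNN-style detector when no manual segmentation is available.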
In addition to the above strategies, the attention mechanism is another useful way to improve lesion
detection. Tao et al. (2019) trained a universal lesion detector on the DeepLesion dataset, and the attention
mechanism (Wang et al., 2017; Woo et al., 2018) was introduced to incorporate 3D context and spatial
information into an R-FCN based detection architecture (Dai et al., 2016). A contextual attention module outputs
a vector indicating the importance of features learned from different axial CT slices, so the detection framework
can adaptively aggregate features from different slices (i.e., enhancing relevant contextual features); a spatial
attention module outputs a weight matrix so that discriminative regions on feature maps can be amplified,
through which richer and more representative features can be learned for small lesions. The proposed
method demonstrated a significant performance improvement despite using far fewer slices. Li et al.
(2019) presented an FPN based architecture with an attention module that can incorporate clinical knowledge.
In clinical practice, it is common for radiologists to inspect multiple CT windows for an accurate lesion
diagnosis. The authors first employed three FPNs to generate feature maps from three frequently inspected
windows; then the attention module (Woo et al., 2018) was used to reweight feature maps from different
windows. The prior knowledge of lesion positions was also incorporated to further improve the performance.
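The two kinds of reweighting described in this paragraph, a weight vector over slices (or windows) and a
weight matrix over spatial positions, can be sketched generically as below. In the reviewed models the
scores are produced by learned modules; here they are given as inputs, and all names are illustrative
assumptions rather than the authors' implementations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_slices(features, slice_scores):
    """features: (S, C, H, W) per-slice feature maps; slice_scores: (S,) importance logits."""
    w = softmax(slice_scores)                       # one weight per slice, summing to 1
    return np.tensordot(w, features, axes=1), w     # weighted sum over the slice axis

def apply_spatial_attention(feature_map, spatial_logits):
    """feature_map: (C, H, W); spatial_logits: (H, W), shared across channels."""
    weights = 1.0 / (1.0 + np.exp(-spatial_logits))  # sigmoid gives weights in (0, 1)
    return feature_map * weights

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4, 5, 5))                        # 3 slices, 4 channels
fused, w = aggregate_slices(features, np.array([0.0, 2.0, 0.0]))
out = apply_spatial_attention(fused, np.zeros((5, 5)))
```

The same weighted-sum pattern applies when the first axis indexes CT windows instead of axial slices.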
We observe that, whether in the detection of specific-type lesions or universal lesions, two-stage
detectors are still quite prevalent for their high performance and robustness; however, separately generating
region proposals might hinder developing streamlined CADe schemes. Several very recent studies have
demonstrated that good detection performance can also be obtained by one-stage detectors (Pisov et al., 2020;
Lung et al., 2021; Zhu et al., 2021b). We predict that advanced anchor-free one-stage detectors (e.g., CenterNet
(Duan et al., 2019)), if adjusted properly to accommodate the uniqueness of medical images, will attract much
more attention and even become a better choice than two-stage detectors for developing new CADe schemes
in the long run.
3.3.2. Unsupervised lesion detection (non-prespecified type of lesion detection)
As mentioned in the above subsections, whether for specific-type or universal lesion detection,
a certain amount of supervision is necessary to train one-stage or two-stage detectors. To establish the
supervision, types of lesions need to be prespecified before training the detectors. Once trained, the detectors
cannot detect lesions not contained in the training dataset. In contrast, unsupervised lesion detection does
not require ground-truth annotations, so the lesion types do not need to be prespecified.
Unsupervised detection has the potential to detect arbitrary types of lesions (Baur et al., 2021), but its
performance is not comparable to that of fully-supervised/semi-supervised methods. Despite that, it can be used
to establish a rough detection of suspicious areas and provide imaging biomarker candidates.
To avoid potential confusion, we make the following two clarifications. First, the methods introduced in
this subsection originate from “unsupervised anomaly detection”, since it is natural to consider lesions like
brain tumors as one type of anomaly in medical images. The term “anomaly detection” will be used frequently
throughout this subsection. Second, it should be noted that “anomaly detection” often appears with another term
“anomaly segmentation” in the literature (Baur et al., 2021). This is because they are essentially two closely
connected tasks: once anomalous regions are detected in an image, the segmentation map can be obtained by
applying a binarization threshold to the detection map. In other words, approaches applicable to one task
are usually suitable for the other, so readers will also see the term “anomaly segmentation”.
The core assumption of unsupervised anomaly detection is that the underlying distribution of normal
parts (e.g. healthy tissues and anatomy) in the images can be captured by unsupervised models, but abnormal
parts such as tumors deviate from the normative distribution, so these anomalies can be detected. Commonly
used models for estimating the normative distribution mainly stem from the concept of VAE and GAN, and the
success of these unsupervised models has mostly been seen in MRI. Notably, Baur et al. (2021) review a variety
of autoencoder-based anomaly segmentation methods in brain MR images. The authors conduct a thorough
comparison of these models and present many interesting insights into successful applications. One important
conclusion reached by this paper is that restoration-based approaches generally perform better than
reconstruction-based ones when runtime is not a concern. In contrast to this comprehensive review paper, we
will briefly introduce reconstruction-based approaches and narrow our focus to recent works related to
restoration-based detection.
In the reconstruction-based paradigm, an AE- or VAE-based model projects an image into a
low-dimensional latent space and then reconstructs the original image from its latent representation. Only healthy
images are used for training, and the model is optimized to produce low pixel-wise reconstruction error. When
unhealthy images pass through the model, the reconstruction error is expected to be low for normal
regions but high for anomalous areas. Uzunova et al. (2019) employed a CVAE to learn latent representations
of healthy image patches. Besides the reconstruction error, they further assumed a large distance between the
latent representations of healthy and unhealthy patches. Combining these two distances, the CVAE-based
model delivered reasonable segmentation results on MRIs with tumors. It is worth noting that the authors
integrated local context into the CVAE by using the relative positions of patches as the condition. The
location-related condition can provide additional prior information of healthy and unhealthy tissues to improve
performance.
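The reconstruction-based detection step, followed by the binarization that turns a detection map into a
segmentation map, can be illustrated with a toy sketch. The trained AE/VAE is replaced by an idealized
stand-in that returns the healthy background, and the threshold value is arbitrary; both are assumptions
made only for illustration.

```python
import numpy as np

def residual_anomaly_map(image, reconstruction):
    """Pixel-wise reconstruction error used as the detection map."""
    return np.abs(image - reconstruction)

def binarize(anomaly_map, threshold):
    """Segmentation map obtained by thresholding the detection map."""
    return anomaly_map > threshold

healthy = np.full((8, 8), 0.5)
image = healthy.copy()
image[2:4, 5:7] = 1.0          # synthetic bright "lesion"
reconstruction = healthy       # idealised model output: the lesion is not reproduced
seg = binarize(residual_anomaly_map(image, reconstruction), threshold=0.25)
```

In practice the reconstruction is imperfect in healthy regions as well, so the choice of threshold directly
trades off sensitivity against false positives.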
In the restoration-based paradigm, the target to be restored is either (1) an optimal latent representation
or (2) the healthy counterpart of the input anomalous image. Both GAN-based and VAE-based methods have
been applied, but GANs are generally used for the first type, latent representation restoration. Although
the generator of a GAN can easily map latent vectors back to images, it lacks the capability to perform the inverse
mapping (i.e., images to latent space), which is important in calculating the anomaly score. This is a key issue
tackled by many works adapting GANs for anomaly detection. In a pioneering work, Schlegl et al. (2017)
proposed AnoGAN to obtain the inverse mapping. The authors first pre-trained a GAN (a generator
and a discriminator) using healthy images to learn the normative distribution, and kept this model’s weights
fixed. Then, given an input image (either normal or anomalous), gradient descent in the latent space (with respect to
the latent variable) is performed to restore the corresponding optimal latent representation. More concretely, the
optimization is guided by two combined losses, namely the residual loss and the discrimination loss. The residual loss,
just like the previously mentioned reconstruction error, measures the pixel-wise dissimilarity between real input
images and images generated by the generator from the latent variable. Meanwhile, these two types of
images are fed into the discriminator network, and one intermediate layer is used to extract their features.
The difference between the intermediate feature representations is computed, yielding the discrimination loss. Finally,
after optimizing the latent variable, the authors use both losses to calculate an anomaly score, indicating
whether the input image contains anomalous regions. AnoGAN delivers good performance, but the iterative
optimization is time-consuming. In their follow-up work, Schlegl et al. (2019) proposed a more efficient model,
f-AnoGAN, by introducing an additional encoder that can perform fast inverse mapping from image space
to latent space. As with AnoGAN, they first pre-trained a WGAN using healthy images and again
kept the model’s weights fixed. Then the generator with fixed weights was employed as the decoder of an AE
without further training, whereas this AE’s encoder was trained using a combination of two loss functions as
introduced in AnoGAN. Once fully trained, the encoder network can efficiently map images to latent space
with one single forward pass. Slightly earlier than f-AnoGAN, Baur et al. (2018) proposed the so-called
AnoVAEGAN that combines VAE and GAN for fast inverse mapping. In this framework, GAN’s generator
and VAE’s decoder are the same network, and VAE’s encoder is employed to learn the inverse mapping.
Therefore, three components including encoder, decoder and discriminator need to be trained. The loss function
here differs from that of AnoGAN and f-AnoGAN, but it still includes the reconstruction error. Also, in contrast to
these two patch-based models, AnoVAEGAN directly takes entire MR images as input and can thereby
capture and utilize global context that is potentially valuable for anomaly segmentation.
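The iterative latent-space search used by AnoGAN can be sketched with a toy example. Here the frozen
generator and the discriminator feature layer are replaced by fixed random linear maps (W and F), and
squared differences are used so that the gradient has a simple closed form; these are simplifying
assumptions for illustration, not the networks or the exact loss of the original work. The weighting factor
lam = 0.1 is likewise an illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))   # hypothetical frozen "generator": G(z) = W @ z
F = rng.normal(size=(8, 16))   # hypothetical frozen discriminator feature layer: f(x) = F @ x

def anomaly_score(x, z, lam=0.1):
    """Weighted sum of the residual loss and the discrimination loss."""
    residual = x - W @ z                 # image-space difference
    feat_diff = F @ x - F @ (W @ z)      # feature-space difference
    return (1 - lam) * np.sum(residual ** 2) + lam * np.sum(feat_diff ** 2)

def restore_latent(x, steps=2000, lr=1e-3, lam=0.1):
    """Gradient descent on z only; W and F stay fixed."""
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = x - W @ z
        grad = (-2 * (1 - lam) * (W.T @ residual)
                - 2 * lam * (W.T @ (F.T @ (F @ residual))))
        z -= lr * grad
    return z

x_normal = W @ rng.normal(size=4)                     # lies in the generator's range
x_anomalous = x_normal + 3.0 * rng.normal(size=16)    # pushed off that range

score_normal = anomaly_score(x_normal, restore_latent(x_normal))
score_anomalous = anomaly_score(x_anomalous, restore_latent(x_anomalous))
```

An input that the generator can reproduce ends with a score close to zero, whereas an input outside the
learned distribution retains a large score. The loop over steps is the per-image cost that f-AnoGAN and
AnoVAEGAN avoid by learning an encoder.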
For the second type, restoring a healthy counterpart of the input image means that, if the input contains
abnormal regions, they are expected to be removed in the restored version, while the remaining normal areas are
retained. Thus, a pixel-wise dissimilarity map between the input and restored image can be acquired, and
anomalies can be detected. Successful restoration typically relies on maximum a posteriori (MAP) estimation.
Specifically, the posterior being maximized is composed of a normative distribution of healthy images and a
data consistency term (Chen et al., 2020d). The normative distribution can be modeled through a VAE or its
variants, and its training is guided by the evidence lower bound (ELBO), an approximation of the VAE’s
original objective function (Kingma and Welling, 2013). The data consistency term controls to what extent
the restored image should resemble the input. In the task of detecting brain tumors from MR images, You et al.
(2019) first employ a GMVAE to capture the distribution of lesion-free MR images, and adopt the total
variation norm for data consistency regularization. These two elements together steer the optimization in MAP
estimation so that the healthy counterpart of an anomalous input is iteratively restored. In their recent
follow-up work, Chen et al. (2021c) claim that ELBO may not be a good approximation of the VAE’s original
loss function. As a result, this inaccurate loss could lead to learning an inaccurate normative distribution,
making gradient computation in iterative optimization deviate from the true direction. To solve this issue,
the authors propose using the derivatives of local Gaussian distributions to replace the gradients of ELBO.
When detecting glioblastomas and gliomas on MR images, the proposed approach demonstrates higher accuracy
at low false positive rates compared to other methods. Also, unlike most previous works that depend on 2D MR
slices, the authors incorporate 3D information into the VAE’s training to further improve performance.
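The structure of restoration by MAP estimation can be sketched on a 1D toy signal. The learned normative
distribution is replaced by a fixed Gaussian prior N(mu, sigma^2 I), and the data consistency term is taken
to be a smoothed total variation of the difference between the restored signal and the input; both choices,
as well as all names and parameter values, are assumptions for illustration and not the models used in the
cited works.

```python
import numpy as np

def restore_map(y, mu, sigma=1.0, lam=1.0, steps=3000, lr=0.01, eps=1e-3):
    """Gradient descent on the negative log-posterior (up to constants).

    Objective: ||x - mu||^2 / (2 sigma^2) + lam * TV(x - y), with a smoothed |.|.
    """
    x = y.copy()
    for _ in range(steps):
        grad_prior = (x - mu) / sigma ** 2     # gradient of the Gaussian prior term
        d = np.diff(x - y)                     # neighbouring differences of x - y
        g = d / np.sqrt(d ** 2 + eps)          # derivative of the smoothed |d|
        grad_tv = np.zeros_like(x)
        grad_tv[1:] += g
        grad_tv[:-1] -= g
        x -= lr * (grad_prior + lam * grad_tv)
    return x

rng = np.random.default_rng(0)
mu = np.zeros(50)                          # "healthy" mean signal
y = mu + 0.1 * rng.normal(size=50)         # observation with mild noise
y[20:30] += 5.0                            # synthetic anomaly
x_hat = restore_map(y, mu)                 # restored healthy counterpart
anomaly_map = np.abs(y - x_hat)            # dissimilarity between input and restoration
detected = anomaly_map > 2.5
```

The prior pulls the anomalous segment back toward the healthy distribution, while the consistency term
keeps the change piecewise constant, so the dissimilarity map is large only where the anomaly was removed.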
Table 3. A list of recent papers related to medical image detection
Author Year Application Model Dataset Contributions highlights
Specific-type medical objects detection
(1) Using deconvolutional layer to recover
Lung nodules Faster RCNN with
fine-grained features; (2) Using 3D CNN to
Ding et al., 2017 2017 detection from CT changed VGG16 LUNA16
exploit 3D spatial context information for
images as backbone
false positives reduction.
3D Faster RCNN
(1) Using 3D Faster RCNN considering the
Lung nodules with U-Net-like
3D nature of lung CT images; (2) Utilizing the
Zhu et al., 2018 2018 detection from CT structure, built LIDC-IDRIs
compactness (i.e., fewer parameters) of dual
images with dual path
path networks on small dataset.
(1) A semi-supervised learning strategy to
3D variant of FPN leverage unlabeled images in NLST; (2)
Lung nodules
Wang et al., with modified LUNA16 and Mixup augmentation for examples with
2020 detection from CT
2020c residual network as NLST pseudo labels and ground truth annotations;
backbone (3) FPN outputs multi-level features to
enhance small object detection.
(1) Inserting non-local modules in residual
Lung nodules blocks to seize long-range dependencies of
architecture, with
Mei et al., 2021 2021 detection from CT PN9 different positions and different channels. (2)
3D ResNet50 as
images Using multi-scale features for false positives
Breast mass Two-branch Faster DDSM and a Extraction of complementary relation features
Ma et al., 2021a 2020 detection from RCNNs, with private on CC and MLO views of mammograms
mammograms relation modules dataset using relation modules.
(Hu et al., 2018b)
BG-RCNN: (1) Modeling relations (e.g., complementary
Incorporating information and visual correspondences)
Breast mass DDSM and a
Bipartite Graph between CC and MLO views of mammograms
Liu et al., 2020c 2020 detection from private
convolutional using BGN; (2) Defining simple pseudo
mammograms dataset
Network (BGN) landmarks in mammograms to facilitate
into Mask RCNN learning geometric relations.
Lymphocytes (1) Simplifying the original YOLO network
detection in whole- using prior knowledge of lymphocytes (e.g.,
Smaller YOLOv2
Rijthoven et al., slide (WSI) Private average size, no overlaps); (2) Designing a
2018 with much fewer
2018 histology images of dataset new training sampling strategy using the prior
breast, colon, and knowledge (i.e., brown areas without
prostate cancer lymphocytes contain hard negative samples).
Lymph node Modified Fully (1) Utilizing FCN for fast gigapixel-level WSI
metastasis detection convolutional analysis; (2) Proposing anchor layers
Lin et al., 2019 2019 dataset and
from WSI histology network (FCN) for model conversion to ensure dense
ISBI 2016
images based on VGG16 scanning; (3) Hard negative mining.
Multiple sclerosis 3D U-Net based (1) Uncertainty estimation using Monte Carlo
lesion detection segmentation Private (MC) dropout; (2) Using multiple uncertainty
Nair et al., 2020 2020
from MR brain network to obtain dataset measures to filter out uncertain predictions of
images lesions lesion candidates.
Universal lesion detection
Yan et al., 2018a | 2018 | Detection of lung, mediastinum, liver, soft tissue, pelvis, abdomen, kidney, and bone lesions from CT images | 3DCE: Modified R-FCN | DeepLesion | (1) Exploiting 3D context information; (2) Leveraging pre-trained 2D backbones (VGG-16) for transfer learning.
Tang et al., 2019 | 2019 | Detection of various types of lesions in DeepLesion | ULDor: Mask RCNN with ResNet-101 as backbone | DeepLesion | (1) Pseudo mask construction using RECIST annotations; (2) Hard negative mining to learn more discriminative features for false positives reduction.
Yan et al., 2019 | 2019 | Detection of various types of lesions in DeepLesion | MULAN: Modified Mask RCNN with DenseNet-121 as backbone | DeepLesion | (1) Jointly performing three different tasks (detection, tagging, and segmentation) for better performance; (2) A new 3D feature fusion strategy.
Tao et al., 2019 | 2019 | Detection of various types of lesions in DeepLesion | Improved R-FCN | DeepLesion | Contextual attention module aggregates relevant context features, and spatial attention module highlights discriminative features for small objects.
Li et al., 2019 | 2019 | Detection of various types of lesions in DeepLesion | A three pathway architecture with FPN as backbone | DeepLesion | Using an attention module to incorporate clinical knowledge of multi-view window inspection and position information.
Unsupervised lesion detection
Baur et al., 2021 | 2021 | ion of brain MRI | A collection of VAE- and GAN-based models | Private data, MSLUB, MSSEG2015 | A comprehensive and in-depth investigation into the strengths and shortcomings of a variety of methods for anomaly segmentation.
Chen et al., 2021c | 2021 | Detection of MRI brain tumors and stroke lesions | VAE-based model | CamCAN, BRATS17, ATLAS | Proposing a more accurate approximation of VAE’s original loss by replacing the gradients of ELBO with the derivatives of local Gaussian distributions.
Chen et al., 2020d | 2020 | MRI glioma and stroke detection | VAE-based model | CamCAN, BRATS17, ATLAS | Using autoencoding-based methods to learn a prior for healthy images and using MAP estimation for image restoration.
Schlegl et al., 2017 | 2017 | Anomaly detection in optical coherence tomography (OCT) | AnoGAN: DCGAN-based model | Private dataset | (1) The first work using GAN for anomaly detection; (2) Proposing a new approach that iteratively maps input images back to optimal latent representations for anomaly detection.
Schlegl et al., 2019 | 2019 | OCT anomaly detection | WGAN-based model | Private dataset | Based on AnoGAN, an additional encoder was introduced to perform fast inverse mapping from image space to latent space.
Baur et al., 2018 | 2018 | MRI multiple sclerosis detection | AnoVAEGAN: a combination of VAE and GAN | Private dataset | (1) Combining VAE and GAN for fast inverse mapping; (2) The model can operate on an entire MR slice to exploit global context.
Uzunova et al., 2019 | 2019 | MRI brain tumor detection | CVAE-based model | BRATS15 | Utilizing location-related condition to provide additional prior information of healthy and unhealthy tissues for better performance.
3.4. Registration
Registration, the process of aligning two or more images into one coordinate system with matched
contents, is also an important step in many (semi-)automatic medical image analysis tasks. Image registration
can be sorted into two groups: rigid and deformable (non-rigid). In rigid registration, all the image pixels
uniformly experience a simple transform (e.g., rotation), while deformable registration aims to establish a non-
uniform mapping between images. In recent years, there have been more applications of deep learning related
to this research topic, especially for deformable image registration. Similar to the organization of the review
article (Haskins et al., 2020), deep learning-based medical image registration approaches in our survey are
categorized into three groups: (1) deep iterative registration; (2) supervised registration; (3) unsupervised
registration. Interested readers can refer to several other excellent review papers (Fu et al., 2020; Ma et al.,
2021b) for a more comprehensive set of registration methods.
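The distinction between rigid and deformable registration can be illustrated with a small sketch. The example below is not taken from any of the reviewed methods; the function names (`rigid_transform`, `deformable_transform`, `pairwise_distances`) and the toy 2D point set are assumptions made purely for illustration. It applies one shared rotation and translation to every point (rigid) and, separately, a per-point displacement (deformable).

```python
import math

def rigid_transform(points, angle_rad, tx, ty):
    """Apply the same rotation and translation to every point (rigid case)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def deformable_transform(points, displacement):
    """Move each point by its own displacement vector (non-uniform mapping)."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, displacement)]

def pairwise_distances(points):
    """Distances between all point pairs, used to check whether shape is preserved."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Toy "image": the four corners of a unit square (hypothetical data).
grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Rigid: rotate by 90 degrees and translate by 2 along x.
rigid = rigid_transform(grid, math.pi / 2, 2.0, 0.0)

# Deformable: every point receives a different displacement.
warped = deformable_transform(grid, [(0.0, 0.0), (0.3, 0.0), (0.0, -0.2), (0.1, 0.1)])
```

Under the rigid transform all pairwise distances between points are unchanged, whereas the per-point displacement field alters them; this is the sense in which deformable registration establishes a non-uniform mapping.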
3.4.1. Deep iterative registration
In deep iterative registration, deep learning models learn a metric that quantifies the similarity
between a target/moving image and a reference/fixed image; then the learned similarity metric is used in
conjunction with traditional optimizers to iteratively update the registration parameters of classical (i.e., non-
learning-based) transformation frameworks. For example, Simonovsky et al. (2016) used a 5-layer CNN to
learn a metric to evaluate the similarity between aligned 3D brain MRI T1–T2 image pairs, and then
incorporated the learnt metric into a continuous optimization framework to complete deformable registration.
This deep learning based metric outperformed manually defined similarity metrics such as mutual information
for multimodal registration (Simonovsky et al., 2016). In essence, this work is most closely related to the earlier approach
of Cheng et al. (2018), which estimates the similarity of 2D CT–MR patch pairs using an FCN pre-trained with
a stacked denoising autoencoder; the major differences between the two works lie in network architecture
(CNN vs. FCN), application scenario (3D vs. 2D), and training strategy (from scratch vs. pre-training). For T1–T2
weighted MR images and CT–MR images, Haskins et al. (2019) claimed that it is relatively easy to learn a good
similarity metric because these multimodal images share largely similar views or simple intensity mappings. They
extended the deep similarity metric to a more challenging scenario, 3D MR–TRUS prostate image registration,
where a large appearance difference exists between the two imaging modalities.
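The general structure of deep iterative registration, a similarity function plugged into a classical optimizer, can be sketched as follows. This is a minimal illustration under strong assumptions and does not reproduce any of the cited methods: the transformation is restricted to integer 2D translations, the optimizer is a simple greedy hill climber, and the similarity function `neg_ssd` (negative sum of squared differences) is only a stand-in for the place where a trained network would be called. All function names and the synthetic blob image are hypothetical.

```python
def shift_image(image, dy, dx, fill=0.0):
    """Translate a 2D image (list of rows) by integer offsets, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out

def neg_ssd(fixed, moving):
    """Stand-in similarity (higher is better). In deep iterative registration,
    a learned metric network would replace this handcrafted function."""
    return -sum((f - m) ** 2 for fr, mr in zip(fixed, moving) for f, m in zip(fr, mr))

def register_translation(fixed, moving, similarity, max_iters=50):
    """Greedy hill climbing over integer translations, driven only by `similarity`."""
    params = (0, 0)
    best = similarity(fixed, shift_image(moving, *params))
    for _ in range(max_iters):
        improved = False
        for ddy, ddx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cand = (params[0] + ddy, params[1] + ddx)
            score = similarity(fixed, shift_image(moving, *cand))
            if score > best:
                params, best, improved = cand, score, True
        if not improved:
            break  # no neighbouring translation improves the similarity
    return params, best

def make_blob(size=11, center=(5, 5), radius=3):
    """Synthetic pyramid-shaped blob used as a toy fixed image."""
    cy, cx = center
    return [[float(max(0, radius + 1 - abs(y - cy) - abs(x - cx))) for x in range(size)]
            for y in range(size)]

fixed = make_blob()
moving = shift_image(fixed, -2, 1)          # misalign the moving image on purpose
params, score = register_translation(fixed, moving, neg_ssd)
```

Because the optimizer only ever calls `similarity(fixed, warped_moving)`, swapping the handcrafted function for a learned one leaves the rest of the loop unchanged, which is the design that the approaches above rely on.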
In summary, “deep similarity”, which can avoid manually defining similarity metrics, is useful for
establishing pixel-to-pixel and voxel-to-voxel correspondences. Deep similarity remains an important research
track, and it is often mentioned interchangeably with several other terms like “metric learning” and “descriptor
learning” (Ma et al., 2021b). Note that methods related to reinforcement learning can also be used to implicitly
quantify image similarity, but we do not expand on this topic since reinforcement learning is beyond the scope
of this review paper. Instead, more advanced deep similarity based approaches (e.g., adversarial similarity) will
be reviewed in the unsupervised registration subsection.
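For reference, a manually defined similarity metric of the kind that deep similarity aims to replace can be written in a few lines. The sketch below computes mutual information from the joint histogram of two small images with discrete intensity labels; the function name `mutual_information` and the toy images are assumptions introduced for illustration only.

```python
import math
from collections import Counter

def mutual_information(image_a, image_b):
    """Mutual information (in nats) between the intensity labels of two equally sized images."""
    a = [v for row in image_a for v in row]
    b = [v for row in image_b for v in row]
    n = len(a)
    joint = Counter(zip(a, b))      # joint histogram of co-occurring intensities
    pa, pb = Counter(a), Counter(b)  # marginal histograms
    # sum over p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in joint.items())

# Two toy "modalities" of the same structure: intensities differ, layout is identical.
modality_1 = [[0, 0, 1, 1],
              [0, 0, 1, 1]]
modality_2 = [[5, 5, 2, 2],
              [5, 5, 2, 2]]
# A misaligned version in which the layout no longer matches modality_1.
misaligned = [[5, 2, 5, 2],
              [5, 2, 5, 2]]

mi_aligned = mutual_information(modality_1, modality_2)
mi_misaligned = mutual_information(modality_1, misaligned)
```

In this toy case the aligned pair is related by a simple one-to-one intensity mapping, so mutual information is maximal, while the misaligned pair is statistically independent. When no such simple mapping exists between modalities, a handcrafted metric of this kind becomes less informative, which motivates learning the metric instead.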
The authors gratefully acknowledge the following research support: Grant P20GM135009 from the
National Institute of General Medical Sciences, National Institutes of Health; and the Stephenson Cancer Center Team
Grant funded by the National Cancer Institute Cancer Center Support Grant P30CA225520 awarded to the
University of Oklahoma Stephenson Cancer Center.
[1]. Meyers, P.H., Nice Jr, C.M., Becker, H.C., Nettleton Jr, W.J., Sweeney, J.W., Meckstroth, G.R., 1964. Automated
computer analysis of radiographic images. Radiology 83, 1029-1034.
[2]. Kruger, R.P., Townes, J.R., Hall, D.L., Dwyer, S.J., Lodwick, G.S., 1972. Automated Radiographic Diagnosis via
Feature Extraction and Classification of Cardiac Size and Shape Descriptors. IEEE Transactions on Biomedical
Engineering BME-19, 174-186.
[3]. Sezaki, N., Ukena, K., 1973. Automatic Computation of the Cardiothoracic Ratio with Application to Mass Screening.
IEEE Transactions on Biomedical Engineering BME-20, 248-253.
[4]. Doi, K., MacMahon, H., Katsuragawa, S., Nishikawa, R.M., Jiang, Y., 1999. Computer-aided diagnosis in radiology:
potential and pitfalls. European Journal of Radiology 31, 97-109.
[5]. Shi, J., Sahiner, B., Chan, H.-P., Ge, J., Hadjiiski, L., Helvie, M.A., Nees, A., Wu, Y.-T., Wei, J., Zhou, C., Zhang,
Y., Cui, J., 2008. Characterization of mammographic masses based on level set segmentation with new image features and
patient information. Med Phys 35, 280-290.
[6]. Sahiner, B., Petrick, N., Chan, H.-P., Hadjiiski, L.M., Paramagul, C., Helvie, M.A., Gurcan, M.N., 2001.
Computer-aided characterization of mammographic masses: accuracy of mass segmentation and its effects on
characterization. IEEE Transactions on Medical Imaging 20, 1275-1284.
[7]. LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436-444.
[8]. Hamm, C.A., Wang, C.J., Savic, L.J., Ferrante, M., Schobert, I., Schlachter, T., Lin, M., Duncan, J.S., Weinreb, J.C.,
Chapiro, J., 2019. Deep learning for liver tumor diagnosis part I: development of a convolutional neural network classifier
for multi-phasic MRI. European radiology 29, 3338-3347.
[9]. Li, X., Jia, M., Islam, M.T., Yu, L., Xing, L., 2020a. Self-Supervised Feature Learning via Exploiting Multi-Modal
Data for Retinal Disease Diagnosis. IEEE Trans Med Imaging 39, 4023-4033.
[10]. Shorfuzzaman, M., Hossain, M.S., 2021. MetaCOVID: A Siamese neural network framework with contrastive loss
for n-shot diagnosis of COVID-19 patients. Pattern recognition 113, 107700.
[11]. Zhang, Y., Jiang, H., Miura, Y., Manning, C., Langlotz, C., 2020a. Contrastive Learning of Medical Visual
Representations from Paired Images and Text, arXiv preprint arXiv:2010.00747, pp. 1-15.
[12]. Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., Greenspan, H., 2018a. GAN-based synthetic
medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, 321-331.
[13]. Kumar, A., Kim, J., Lyndon, D., Fulham, M., Feng, D., 2016. An ensemble of fine-tuned convolutional neural
networks for medical image classification. IEEE journal of biomedical and health informatics 21, 31-40.
[14]. Kumar, A., Kim, J., Lyndon, D., Fulham, M., Feng, D., 2017. An Ensemble of Fine-Tuned Convolutional Neural
Networks for Medical Image Classification. IEEE Journal of Biomedical and Health Informatics 21, 31-40.
[15]. Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K., 2018. Recurrent residual convolutional neural
network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:1802.06955.
[16]. Yu, L., Wang, S., Li, X., Fu, C.-W., Heng, P.-A., 2019. Uncertainty-Aware Self-ensembling Model for Semi-
supervised 3D Left Atrium Segmentation, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T.,
Khan, A. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer International
Publishing, Cham, pp. 605-613.
[17]. Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L., 2020. Inf-Net: Automatic COVID-19
Lung Infection Segmentation From CT Images. IEEE Transactions on Medical Imaging 39, 2626-2637.
[18]. Rijthoven, M., Swiderska-Chadaj, Z., Seeliger, K., Laak, J.v.d., Ciompi, F., 2018. You Only Look on Lymphocytes
Once, Medical Imaging with Deep Learning, pp. 1-15.
[19]. Mei, J., Cheng, M., Xu, G., Wan, L., Zhang, H., 2021. SANet: A Slice-Aware Network for Pulmonary Nodule
Detection. IEEE Transactions on Pattern Analysis & Machine Intelligence, pre-print.
[20]. Nair, T., Precup, D., Arnold, D.L., Arbel, T., 2020. Exploring uncertainty measures in deep networks for Multiple
sclerosis lesion detection and segmentation. Medical Image Analysis 59, 101557.
[21]. Zheng, Y., Liu, D., Georgescu, B., Nguyen, H., Comaniciu, D., 2015. 3D deep learning for efficient and robust
landmark detection in volumetric data, International Conference on Medical Image Computing and Computer-Assisted
Intervention. Springer, pp. 565-572.
[22]. Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., Komodakis, N., 2016. A deep metric for multimodal
registration, International conference on medical image computing and computer-assisted intervention. Springer, pp. 10-
[23]. Sokooti, H., de Vos, B., Berendsen, F., Lelieveldt, B.P.F., Išgum, I., Staring, M., 2017. Nonrigid Image Registration
Using Multi-scale 3D Convolutional Neural Networks, In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins,
D.L., Duchesne, S. (Eds.), Medical Image Computing and Computer Assisted Intervention − MICCAI 2017. Springer
International Publishing, Cham, pp. 232-239.
[24]. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Dalca, A.V., Guttag, J., 2018. An Unsupervised Learning Model for
Deformable Medical Image Registration, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
[25]. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S., 2017. Dermatologist-level
classification of skin cancer with deep neural networks. Nature 542, 115-118.
[26]. Long, E., Lin, H., Liu, Z., Wu, X., Wang, L., Jiang, J., An, Y., Lin, Z., Li, X., Chen, J., Li, J., Cao, Q., Wang, D.,
Liu, X., Chen, W., Liu, Y., 2017. An artificial intelligence platform for the multihospital collaborative management of
congenital cataracts. Nature Biomedical Engineering 1, 0024.
[27]. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken,
B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical image analysis 42, 60-88.
[28]. Shen, D., Wu, G., Suk, H.-I., 2017. Deep learning in medical image analysis. Annual review of biomedical
engineering 19, 221-248.
[29]. Yi, X., Walia, E., Babyn, P., 2019. Generative adversarial network in medical imaging: A review. Medical Image
Analysis 58, 101552.
[30]. Kazeminia, S., Baur, C., Kuijper, A., van Ginneken, B., Navab, N., Albarqouni, S., Mukhopadhyay, A., 2020. GANs
for medical image analysis. Artificial Intelligence in Medicine 109, 101938.
[31]. Cheplygina, V., de Bruijne, M., Pluim, J.P.W., 2019. Not-so-supervised: A survey of semi-supervised, multi-
instance, and transfer learning in medical image analysis. Medical Image Analysis 54, 280-296.
[32]. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X., 2020. Embracing imperfect datasets: A review
of deep learning solutions for medical image segmentation. Medical Image Analysis 63, 101693.
[33]. van Engelen, J.E., Hoos, H.H., 2020. A survey on semi-supervised learning. Machine Learning 109, 373-440.
[34]. Anwar, S.M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., Khan, M.K., 2018. Medical image analysis using
convolutional neural networks: a review. Journal of medical systems 42, 226.
[35]. Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313,
[36]. Bourlard, H., Kamp, Y., 1988. Auto-association by multilayer perceptrons and singular value decomposition.
Biological cybernetics 59, 291-294.
[37]. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. Advances
in neural information processing systems 19, 153.
[38]. Ranzato, M.A., Poultney, C., Chopra, S., LeCun, Y., 2007. Efficient learning of sparse representations with an
energy-based model, Advances in neural information processing systems, pp. 1137-1144.
[39]. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., 2010. Stacked Denoising Autoencoders: Learning
Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11,
[40]. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y., 2011. Contractive auto-encoders: Explicit invariance during
feature extraction, ICML.
[41]. Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[42]. Sohn, K., Lee, H., Yan, X., 2015. Learning structured output representation using deep conditional generative
models. Advances in neural information processing systems 28, 3483-3491.
[43]. Dilokthanakul, N., Mediano, P.A., Garnelo, M., Lee, M.C., Salimbeni, H., Arulkumaran, K., Shanahan, M., 2016.
Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[44]. Kingma, D.P., Welling, M., 2019. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
[45]. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014.
Generative adversarial nets, Proceedings of the 27th International Conference on Neural Information Processing Systems
- Volume 2. MIT Press, Montreal, Canada, pp. 2672–2680.
[46]. Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks, International conference
on machine learning. PMLR, pp. 214-223.
[47]. Mirza, M., Osindero, S., 2014. Conditional Generative Adversarial Nets, arXiv preprint arXiv:1411.1784, pp. 1-7.
[48]. Odena, A., Olah, C., Shlens, J., 2017. Conditional image synthesis with auxiliary classifier gans, International
conference on machine learning. PMLR, pp. 2642-2651.
[49]. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language
understanding, Proceedings of NAACL-HLT, pp. 4171-4186.
[50]. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum Contrast for Unsupervised Visual Representation
Learning, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726-9735.
[51]. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A., 2016. Context Encoders: Feature Learning by
Inpainting, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536-2544.
[52]. Zhang, R., Isola, P., Efros, A.A., 2016. Colorful image colorization, European conference on computer vision.
Springer, pp. 649-666.
[53]. Doersch, C., Gupta, A., Efros, A.A., 2015. Unsupervised Visual Representation Learning by Context Prediction,
2015 IEEE International Conference on Computer Vision (ICCV), pp. 1422-1430.
[54]. Noroozi, M., Favaro, P., 2016. Unsupervised learning of visual representations by solving jigsaw puzzles, European
conference on computer vision. Springer, pp. 69-84.
[55]. Gidaris, S., Singh, P., Komodakis, N., 2018. Unsupervised Representation Learning by Predicting Image Rotations,
International Conference on Learning Representations, pp. 1-16.
[56]. Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020a. A Simple Framework for Contrastive Learning of Visual
Representations, In: Daumé III, H., Singh, A. (Eds.), Proceedings of the 37th International Conference on Machine Learning.
PMLR, Proceedings of Machine Learning Research, pp. 1597-1607.
[57]. Hadsell, R., Chopra, S., LeCun, Y., 2006. Dimensionality reduction by learning an invariant mapping, 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE, pp. 1735-1742.
[58]. Chopra, S., Hadsell, R., LeCun, Y., 2005a. Learning a similarity metric discriminatively, with application to face
verification, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE,
pp. 539-546.
[59]. Oord, A.v.d., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint
[60]. Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E., 2020. Contrastive learning of global and local features for
medical image segmentation with limited annotations, Advances in Neural Information Processing Systems, pp. 1-13.
[61]. Chen, X., Fan, H., Girshick, R., He, K., 2020b. Improved Baselines with Momentum Contrastive Learning, arXiv
preprint arXiv:2003.04297, pp. 1-3.
[62]. Chapelle, O., Scholkopf, B., Zien, A., 2009. Semi-supervised learning (Chapelle, O. et al., Eds.; 2006) [Book reviews].
IEEE Transactions on Neural Networks 20, 542-542.
[63]. Ouali, Y., Hudelot, C., Tami, M., 2020. An overview of deep semi-supervised learning. arXiv preprint
[64]. Rasmus, A., Valpola, H., Honkala, M., Berglund, M., Raiko, T., 2015. Semi-supervised learning with Ladder
networks, Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pp.
[65]. Laine, S., Aila, T., 2017. Temporal Ensembling for Semi-Supervised Learning.
[66]. Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: Weight-averaged consistency targets
improve semi-supervised deep learning results, Proceedings of the 31st International Conference on Neural Information
Processing Systems, pp. 1195-1204.
[67]. Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q., 2020. Unsupervised Data Augmentation for Consistency Training,
Advances in Neural Information Processing Systems, pp. 1-13.
[68]. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C., 2019. MixMatch: A Holistic Approach
to Semi-Supervised Learning, Advances in Neural Information Processing Systems, pp. 1-11.
[69]. Lee, D.-H., 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,
Workshop on challenges in representation learning, ICML.
[70]. Zhang, H., Cisse, M., Dauphin, Y., Lopez-Paz, D., 2018a. Mixup: Beyond empirical risk minimization, International
Conference on Learning Representations, pp. 1-13.
[71]. Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K., 2020. Pseudo-Labeling and Confirmation Bias
in Deep Semi-Supervised Learning, 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1-8.
[72]. Qiao, S., Shen, W., Zhang, Z., Wang, B., Yuille, A., 2018. Deep Co-Training for Semi-Supervised Image
Recognition, In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer
International Publishing, Cham, pp. 142-159.
[73]. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., 2016. Improved techniques for
training GANs, Proceedings of the 30th International Conference on Neural Information Processing Systems. Curran
Associates Inc., Barcelona, Spain, pp. 2234–2242.
[74]. Odena, A., 2016. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583.
[75]. Li, C., Xu, K., Zhu, J., Zhang, B., 2017. Triple generative adversarial nets, Proceedings of the 31st International
Conference on Neural Information Processing Systems. Curran Associates Inc., Long Beach, California, USA, pp. 4091–
[76]. Li, X., Yu, L., Chen, H., Fu, C.W., Xing, L., Heng, P.A., 2020b. Transformation-Consistent Self-Ensembling Model
for Semisupervised Medical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems 32, 523-
[77]. Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20, 1254-1259.
[78]. Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural Machine Translation by Jointly Learning to Align and Translate.
[79]. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017.
Attention is all you need, Proceedings of the 31st International Conference on Neural Information Processing Systems.
Curran Associates Inc., Long Beach, California, USA, pp. 6000–6010.
[80]. Xu, K., Ba, J.L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.S., Bengio, Y., 2015. Show, attend
and tell: neural image caption generation with visual attention, Proceedings of the 32nd International Conference on
Machine Learning - Volume 37, Lille, France, pp. 2048–2057.
[81]. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J., 2016. Image Captioning with Semantic Attention, 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4651-4659.
[82]. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-Up and Top-Down
Attention for Image Captioning and Visual Question Answering, 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 6077-6086.
[83]. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X., 2017. Residual Attention Network
for Image Classification, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450-6458.
[84]. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S., 2018. CBAM: Convolutional Block Attention Module, In: Ferrari, V.,
Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing, Cham,
pp. 3-19.
[85]. Jetley, S., Lord, N., Lee, N., Torr, P., 2018. Learn to Pay Attention, International Conference on Learning
Representations, pp. 1-14.
[86]. Chen, L., Yang, Y., Wang, J., Xu, W., Yuille, A.L., 2016. Attention to Scale: Scale-Aware Semantic Image
Segmentation, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3640-3649.
[87]. Ren, M., Zemel, R.S., 2017. End-to-End Instance Segmentation with Recurrent Attention, 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 293-301.
[88]. Cho, K., Courville, A., Bengio, Y., 2015. Describing multimedia content using attention-based encoder-decoder
networks. IEEE Transactions on Multimedia 17, 1875-1886.
[89]. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks, Proceedings of
the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press, Montreal, Canada,
pp. 2017–2025.
[90]. Hu, J., Shen, L., Sun, G., 2018a. Squeeze-and-excitation networks, Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 7132-7141.
[91]. Wang, X., Girshick, R., Gupta, A., He, K., 2018. Non-local neural networks, Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 7794-7803.
[92]. Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An attentive survey of attention models. ACM
Transactions on Intelligent Systems and Technology (TIST) 12, 1-32.
[93]. Zhou, Z., Sodha, V., Pang, J., Gotway, M.B., Liang, J., 2021. Models Genesis. Medical Image Analysis 67, 101840.
[94]. Zhou, Z., Sodha, V., Rahman Siddiquee, M.M., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J., 2019a. Models
Genesis: Generic Autodidactic Models for 3D Medical Image Analysis, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H.,
Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted Intervention –
MICCAI 2019. Springer International Publishing, Cham, pp. 384-393.
[95]. Zhang, P., Wang, F., Zheng, Y., 2017. Self supervised deep representation learning for fine-grained body part
recognition, 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 578-582.
[96]. Zhuang, X., Li, Y., Hu, Y., Ma, K., Yang, Y., Zheng, Y., 2019. Self-supervised Feature Learning for 3D Medical
Images by Playing a Rubik’s Cube, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan,
A. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer International
Publishing, Cham, pp. 420-428.
[97]. Zhu, J., Li, Y., Hu, Y., Ma, K., Zhou, S.K., Zheng, Y., 2020a. Rubik’s Cube+: A self-supervised feature learning
framework for 3D medical image analysis. Medical Image Analysis 64, 101746.
[98]. Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh, A., Karthikesalingam, A., Kornblith, S.,
Chen, T., 2021. Big self-supervised models advance medical image classification, Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp. 3478-3488.
[99]. Vu, Y.N.T., Wang, R., Balachandar, N., Liu, C., Ng, A.Y., Rajpurkar, P., 2021. Medaug: Contrastive learning
leveraging patient metadata improves representations for chest x-ray interpretation, Machine Learning for Healthcare
Conference. PMLR, pp. 755-769.
[100]. Xie, X., Niu, J., Liu, X., Chen, Z., Tang, S., Yu, S., 2021a. A survey on incorporating domain knowledge into deep
learning for medical image analysis. Medical Image Analysis 69, 101985.
[101]. Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi,
A., Acharya, U.R., 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges.
Information Fusion.
[102]. Gal, Y., Ghahramani, Z., 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep
learning, International conference on machine learning. PMLR, pp. 1050-1059.
[103]. Lakshminarayanan, B., Pritzel, A., Blundell, C., 2017. Simple and scalable predictive uncertainty estimation using
deep ensembles. Advances in neural information processing systems 30.
[104]. van Ginneken, B., Schaefer-Prokop, C.M., Prokop, M., 2011. Computer-aided diagnosis: how to move from the
laboratory to the clinic. Radiology 261, 719-732.
[105]. Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks.
Advances in neural information processing systems 25, 1097-1105.
[106]. Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556.
[107]. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich,
A., 2015. Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 1-9.
[108]. He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep Residual Learning for Image Recognition, 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
[109]. Huang, G., Liu, Z., Maaten, L.V.D., Weinberger, K.Q., 2017. Densely Connected Convolutional Networks, 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269.
[110]. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J., 2016. Convolutional
neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging 35, 1299-
[111]. Chen, S., Ma, K., Zheng, Y., 2019a. Med3D: Transfer Learning for 3D Medical Image Analysis, arXiv preprint
arXiv:1904.00625, pp. 1-12.
[112]. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T., 2014. Decaf: A deep convolutional
activation feature for generic visual recognition, International conference on machine learning. PMLR, pp. 647-655.
[113]. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database,
2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255.
[114]. de Bruijne, M., 2016. Machine learning approaches in medical image analysis: From detection to diagnosis. Medical
Image Analysis 33, 94-97.
[115]. Shin, H., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M., 2016. Deep
Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer
Learning. IEEE Transactions on Medical Imaging 35, 1285-1298.
[116]. Yuan, Y., Qin, W., Buyyounouski, M., Ibragimov, B., Hancock, S., Han, B., Xing, L., 2019. Prostate cancer
classification with multiparametric MRI transfer learning model. Med Phys 46, 756-765.
[117]. Huynh, B.Q., Li, H., Giger, M.L., 2016. Digital mammographic tumor classification using transfer learning from
deep convolutional neural networks. Journal of medical imaging (Bellingham, Wash.) 3, 034501.
[118]. Minaee, S., Kafieh, R., Sonka, M., Yazdani, S., Soufi, G.J., 2020. Deep-covid: Predicting covid-19 from chest x-
ray images using deep transfer learning. Medical image analysis 65, 101794.
[119]. Zhou, Y., He, X., Huang, L., Liu, L., Zhu, F., Cui, S., Shao, L., 2019b. Collaborative Learning of Semi-Supervised
Segmentation and Classification for Medical Images, 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2074-2083.
[120]. Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y., 2018. Diagnose like a Radiologist: Attention
Guided Convolutional Neural Network for Thorax Disease Classification, arXiv preprint arXiv:1801.09927, pp. 1-10.
[121]. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., 2019. Attention gated
networks: Learning to leverage salient regions in medical images. Medical Image Analysis 53, 197-207.
[122]. Baumgartner, C.F., Kamnitsas, K., Matthew, J., Fletcher, T.P., Smith, S., Koch, L.M., Kainz, B., Rueckert, D.,
2017. SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound. IEEE
Transactions on Medical Imaging 36, 2204-2215.
[123]. Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation,
International Conference on Medical image computing and computer-assisted intervention. Springer, pp. 234-241.
[124]. Frid-Adar, M., Klang, E., Amitai, M., Goldberger, J., Greenspan, H., 2018b. Synthetic data augmentation using
GAN for improved liver lesion classification, 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI
2018), pp. 289-293.
[125]. Wu, E., Wu, K., Cox, D., Lotter, W., 2018a. Conditional infilling GANs for data augmentation in mammogram
classification, Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer, pp. 98-106.
[126]. Bai, W., Chen, C., Tarroni, G., Duan, J., Guitton, F., Petersen, S.E., Guo, Y., Matthews, P.M., Rueckert, D., 2019.
Self-Supervised Learning for Cardiac MR Image Segmentation by Anatomical Position Prediction, In: Shen, D., Liu, T.,
Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer
Assisted Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 541-549.
[127]. Tao, X., Li, Y., Zhou, W., Ma, K., Zheng, Y., 2020. Revisiting Rubik’s Cube: Self-supervised Learning with
Volume-Wise Transformation for 3D Medical Image Segmentation, In: Martel, A.L., Abolmaesumi, P., Stoyanov, D.,
Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Computing and Computer
Assisted Intervention – MICCAI 2020. Springer International Publishing, Cham, pp. 238-248.
[128]. Chen, T., Liu, S., Chang, S., Cheng, Y., Amini, L., Wang, Z., 2020c. Adversarial Robustness: From Self-Supervised
Pre-Training to Fine-Tuning, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 696-
[129]. Misra, I., Maaten, L.v.d., 2020. Self-supervised learning of pretext-invariant representations, Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707-6717.
[130]. Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence.
[131]. Tajbakhsh, N., Hu, Y., Cao, J., Yan, X., Xiao, Y., Lu, Y., Liang, J., Terzopoulos, D., Ding, X., 2019. Surrogate
Supervision for Medical Image Analysis: Effective Deep Learning From Limited Quantities of Labeled Data, 2019 IEEE
16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1251-1255.
[132]. Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019b. Self-supervised learning for
medical image analysis using image context restoration. Medical Image Analysis 58, 101539.
[133]. Larsson, G., Maire, M., Shakhnarovich, G., 2017. Colorization as a Proxy Task for Visual Understanding, 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 840-849.
[134]. Chen, X., Yao, L., Zhou, T., Dong, J., Zhang, Y., 2021a. Momentum contrastive learning for few-shot COVID-19
diagnosis from chest CT images. Pattern Recognition 113, 107826.
[135]. Sowrirajan, H., Yang, J., Ng, A.Y., Rajpurkar, P., 2021. MoCo pretraining improves representation and
transferability of chest X-ray models, Medical Imaging with Deep Learning. PMLR, pp. 728-744.
[136]. Madani, A., Moradi, M., Karargyris, A., Syeda-Mahmood, T., 2018a. Semi-supervised learning with generative
adversarial networks for chest X-ray classification with ability of data domain adaptation, 2018 IEEE 15th International
Symposium on Biomedical Imaging (ISBI 2018), pp. 1038-1042.
[137]. Kingma, D.P., Rezende, D.J., Mohamed, S., Welling, M., 2014. Semi-supervised learning with deep generative
models, Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. MIT
Press, Montreal, Canada, pp. 3581–3589.
[138]. Xie, Y., Zhang, J., Xia, Y., 2019a. Semi-supervised adversarial model for benign–malignant lung nodule
classification on chest CT. Medical Image Analysis 57, 237-248.
[139]. Madani, A., Ong, J.R., Tibrewal, A., Mofrad, M.R.K., 2018b. Deep echocardiography: data-efficient supervised
and semi-supervised deep learning towards automated diagnosis of cardiac disease. npj Digital Medicine 1, 59.
[140]. Shang, H., Sun, Z., Yang, W., Fu, X., Zheng, H., Chang, J., Huang, J., 2019. Leveraging Other Datasets for Medical
Imaging Classification: Evaluation of Transfer, Multi-task and Semi-supervised Learning, In: Shen, D., Liu, T., Peters,
T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted
Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 431-439.
[141]. Liu, Q., Yu, L., Luo, L., Dou, Q., Heng, P.A., 2020a. Semi-Supervised Medical Image Classification With Relation-
Driven Self-Ensembling Model. IEEE Transactions on Medical Imaging 39, 3429-3440.
[142]. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y., 2021b. TransUNet:
Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
[143]. He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN, 2017 IEEE International Conference on
Computer Vision (ICCV), pp. 2980-2988.
[144]. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J., 2015. Hypercolumns for object segmentation and fine-grained
localization, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 447-456.
[145]. Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440.
[146]. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The Importance of Skip Connections in
Biomedical Image Segmentation, In: Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.M.R.S., Belagiannis, V.,
Papa, J.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (Eds.), Deep Learning and Data Labeling for
Medical Applications. Springer International Publishing, Cham, pp. 179-187.
[147]. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J., 2018. UNet++: A Nested U-Net Architecture for
Medical Image Segmentation, In: Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A., Maier-Hein,
L., Tavares, J.M.R.S., Bradley, A., Papa, J.P., Belagiannis, V., Nascimento, J.C., Lu, Z., Conjeti, S., Moradi, M.,
Greenspan, H., Madabhushi, A. (Eds.), Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical
Decision Support. Springer International Publishing, Cham, pp. 3-11.
[148]. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: learning dense volumetric
segmentation from sparse annotation, International Conference on Medical Image Computing and Computer-Assisted
Intervention. Springer, pp. 424-432.
[149]. Milletari, F., Navab, N., Ahmadi, S., 2016. V-Net: Fully Convolutional Neural Networks for Volumetric Medical
Image Segmentation, 2016 Fourth International Conference on 3D Vision (3DV), pp. 565-571.
[150]. Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson,
M.J., Barratt, D.C., 2018a. Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks. IEEE
Transactions on Medical Imaging 37, 1822-1834.
[151]. Liang, M., Hu, X., 2015. Recurrent convolutional neural network for object recognition, 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 3367-3375.
[152]. Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A., 2018. H-DenseUNet: Hybrid Densely Connected UNet
for Liver and Tumor Segmentation From CT Volumes. IEEE Transactions on Medical Imaging 37, 2663-2674.
[153]. Xue, Y., Xu, T., Zhang, H., Long, L.R., Huang, X., 2018. SegAN: Adversarial Network with Multi-scale L1 Loss
for Medical Image Segmentation. Neuroinformatics 16, 383-392.
[154]. Zhang, Y., Miao, S., Mansi, T., Liao, R., 2020b. Unsupervised X-ray image segmentation with task driven
generative adversarial networks. Medical Image Analysis 62, 101664.
[155]. Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla,
N., Kainz, B., Glocker, B., Rueckert, D., 2018. Attention U-Net: Learning Where to Look for the Pancreas, Medical
Imaging with Deep Learning, pp. 1-14.
[156]. Nie, D., Gao, Y., Wang, L., Shen, D., 2018. ASDNet: Attention Based Semi-supervised Deep Networks for Medical
Image Segmentation, In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (Eds.), Medical
Image Computing and Computer Assisted Intervention – MICCAI 2018. Springer International Publishing, Cham, pp.
[157]. Sinha, A., Dolz, J., 2021. Multi-Scale Self-Guided Attention for Medical Image Segmentation. IEEE Journal of
Biomedical and Health Informatics 25, 121-130.
[158]. Wang, G., Li, W., Aertsen, M., Deprest, J., Ourselin, S., Vercauteren, T., 2019a. Aleatoric uncertainty estimation
with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338,
[159]. Baumgartner, C.F., Tezcan, K.C., Chaitanya, K., Hötker, A.M., Muehlematter, U.J., Schawkat, K., Becker, A.S.,
Donati, O., Konukoglu, E., 2019. PHiSeg: Capturing Uncertainty in Medical Image Segmentation, In: Shen, D., Liu, T.,
Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer
Assisted Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 119-127.
[160]. Mehrtash, A., Wells, W.M., Tempany, C.M., Abolmaesumi, P., Kapur, T., 2020. Confidence calibration and
predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging 39, 3868-
[161]. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,
International Conference on Learning Representations.
[162]. Zhang, Y., Liu, H., Hu, Q., 2021. TransFuse: Fusing transformers and CNNs for medical image segmentation,
International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 14-24.
[163]. Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D., 2022. UNETR:
Transformers for 3D medical image segmentation, Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pp. 574-584.
[164]. Xie, Y., Zhang, J., Shen, C., Xia, Y., 2021b. CoTr: Efficiently bridging CNN and transformer for 3D medical image
segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp.
[165]. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021a. Deformable DETR: Deformable transformers for end-to-
end object detection. International Conference on Learning Representations.
[166]. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M., 2021. Swin-Unet: Unet-like Pure
Transformer for Medical Image Segmentation. arXiv preprint arXiv:2105.05537.
[167]. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision
transformer using shifted windows, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
[168]. Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C., 2020a. Axial-DeepLab: Stand-alone axial-attention
for panoptic segmentation, European Conference on Computer Vision. Springer, pp. 108-126.
[169]. Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M., 2021. Medical transformer: Gated axial-attention for
medical image segmentation, International Conference on Medical Image Computing and Computer-Assisted
Intervention. Springer, pp. 36-46.
[170]. Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks, Advances in Neural Information Processing Systems, pp. 1-9.
[171]. Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 1137-1149.
[172]. Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017a. Feature Pyramid Networks for Object
Detection, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944.
[173]. Wang, X., Han, S., Chen, Y., Gao, D., Vasconcelos, N., 2019b. Volumetric attention for 3D medical image
segmentation and detection, International Conference on Medical Image Computing and Computer-Assisted Intervention.
Springer, pp. 175-184.
[174]. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2019c. UNet++: Redesigning skip connections to exploit
multiscale features in image segmentation. IEEE Transactions on Medical Imaging 39, 1856-1867.
[175]. Zhang, Z., Yang, L., Zheng, Y., 2018b. Translating and Segmenting Multimodal Medical Volumes with Cycle-
and Shape-Consistency Generative Adversarial Network, 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 9242-9251.
[176]. Zhao, A., Balakrishnan, G., Durand, F., Guttag, J.V., Dalca, A.V., 2019a. Data Augmentation Using Learned
Transformations for One-Shot Medical Image Segmentation, 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 8535-8545.
[177]. Taleb, A., Loetzsch, W., Danz, N., Severin, J., Gaertner, T., Bergner, B., Lippert, C., 2020. 3D Self-Supervised
Methods for Medical Imaging, Advances in Neural Information Processing Systems, pp. 1-15.
[178]. Hu, S.-Y., Wang, S., Weng, W.-H., Wang, J., Wang, X., Ozturk, A., Li, Q., Kumar, V., Samir, A.E., 2020. Self-
Supervised Pretraining with DICOM metadata in Ultrasound Imaging, In: Doshi-Velez, F., Fackler, J., Jung, K., Kale, D.,
Ranganath, R., Wallace, B., Wiens, J. (Eds.), Proceedings of the 5th Machine Learning for Healthcare Conference. PMLR,
Proceedings of Machine Learning Research, pp. 732-749.
[179]. Jamaludin, A., Kadir, T., Zisserman, A., 2017. Self-supervised Learning for Spinal MRIs, In: Cardoso, M.J., Arbel,
T., Carneiro, G., Syeda-Mahmood, T., Tavares, J.M.R.S., Moradi, M., Bradley, A., Greenspan, H., Papa, J.P., Madabhushi,
A., Nascimento, J.C., Cardoso, J.S., Belagiannis, V., Lu, Z. (Eds.), Deep Learning in Medical Image Analysis and
Multimodal Learning for Clinical Decision Support. Springer International Publishing, Cham, pp. 294-302.
[180]. Chopra, S., Hadsell, R., LeCun, Y., 2005b. Learning a similarity metric discriminatively, with application to face
verification, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp. 539-546.
[181]. Kendall, A., Gal, Y., 2017. What uncertainties do we need in Bayesian deep learning for computer vision?,
Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., Long
Beach, California, USA, pp. 5580–5590.
[182]. Wang, G., Liu, X., Li, C., Xu, Z., Ruan, J., Zhu, H., Meng, T., Li, K., Huang, N., Zhang, S., 2020b. A Noise-Robust
Framework for Automatic Segmentation of COVID-19 Pneumonia Lesions From CT Images. IEEE Transactions on
Medical Imaging 39, 2653-2663.
[183]. Pang, Y., Li, Y., Shen, J., Shao, L., 2019. Towards Bridging Semantic Gap to Improve Semantic Segmentation,
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4229-4238.
[184]. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018a. Encoder-Decoder with Atrous Separable
Convolution for Semantic Image Segmentation, In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer
Vision – ECCV 2018. Springer International Publishing, Cham, pp. 833-851.
[185]. Wu, Z., Su, L., Huang, Q., 2019a. Cascaded Partial Decoder for Fast and Accurate Salient Object Detection, 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3902-3911.
[186]. Chen, S., Tan, X., Wang, B., Hu, X., 2018b. Reverse Attention for Salient Object Detection, In: Ferrari, V., Hebert,
M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp.
[187]. Zhang, Z., Fu, H., Dai, H., Shen, J., Pang, Y., Shao, L., 2019. ET-Net: A Generic Edge-aTtention Guidance Network
for Medical Image Segmentation, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019.
Springer-Verlag, pp. 442–450.
[188]. Sedai, S., Mahapatra, D., Hewavitharanage, S., Maetschke, S., Garnavi, R., 2017. Semi-supervised segmentation
of optic cup in retinal fundus images using variational autoencoder, International Conference on Medical Image
Computing and Computer-Assisted Intervention. Springer, pp. 75-82.
[189]. Chen, S., Bortsova, G., García-Uceda Juárez, A., Tulder, G.v., Bruijne, M.d., 2019c. Multi-task attention-based
semi-supervised learning for medical image segmentation, International Conference on Medical Image Computing and
Computer-Assisted Intervention. Springer, pp. 457-465.
[190]. He, Y., Yang, G., Chen, Y., Kong, Y., Wu, J., Tang, L., Zhu, X., Dillenseger, J.-L., Shao, P., Zhang, S., Shu, H.,
Coatrieux, J.-L., Li, S., 2019. DPA-DenseBiasNet: Semi-supervised 3D Fine Renal Artery Segmentation with Dense
Biased Network and Deep Priori Anatomy, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-
T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer
International Publishing, Cham, pp. 139-147.
[191]. Zheng, H., Lin, L., Hu, H., Zhang, Q., Chen, Q., Iwamoto, Y., Han, X., Chen, Y.-W., Tong, R., Wu, J., 2019. Semi-
supervised Segmentation of Liver Using Adversarial Learning with Deep Atlas Prior, In: Shen, D., Liu, T., Peters, T.M.,
Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted
Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 148-156.
[192]. Clough, J., Byrne, N., Oksuz, I., Zimmer, V.A., Schnabel, J.A., King, A., 2020. A Topological Loss Function for
Deep-Learning based Image Segmentation using Persistent Homology. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1-1.
[193]. Ganaye, P.-A., Sdika, M., Benoit-Cattin, H., 2018. Semi-supervised Learning for Segmentation Under Semantic
Constraint, In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (Eds.), Medical Image
Computing and Computer Assisted Intervention – MICCAI 2018. Springer International Publishing, Cham, pp. 595-602.
[194]. Li, S., Zhang, C., He, X., 2020c. Shape-Aware Semi-supervised 3D Semantic Segmentation for Medical Images,
In: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L.
(Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Springer International
Publishing, Cham, pp. 552-561.
[195]. Roy, A.G., Conjeti, S., Sheet, D., Katouzian, A., Navab, N., Wachinger, C., 2017. Error Corrective Boosting for
Learning Fully Convolutional Networks with Limited Data, In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P.,
Collins, D.L., Duchesne, S. (Eds.), Medical Image Computing and Computer Assisted Intervention − MICCAI 2017.
Springer International Publishing, Cham, pp. 231-239.
[196]. Wu, Z., Su, L., Huang, Q., 2019b. Stacked cross refinement network for edge-aware salient object detection,
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7264-7273.
[197]. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y., 2014. OverFeat: Integrated Recognition,
Localization and Detection using Convolutional Networks.
[198]. Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich Feature Hierarchies for Accurate Object Detection and
Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587.
[199]. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision 115, 211-252.
[200]. Ciompi, F., de Hoop, B., van Riel, S.J., Chung, K., Scholten, E.T., Oudkerk, M., de Jong, P.A., Prokop, M., van
Ginneken, B., 2015. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an
ensemble of 2D views and a convolutional neural network out-of-the-box. Medical Image Analysis 26, 195-202.
[201]. Dou, Q., Chen, H., Yu, L., Zhao, L., Qin, J., Wang, D., Mok, V.C., Shi, L., Heng, P., 2016. Automatic Detection
of Cerebral Microbleeds From MR Images via 3D Convolutional Neural Networks. IEEE Transactions on Medical
Imaging 35, 1182-1195.
[202]. Wolterink, J.M., Leiner, T., de Vos, B.D., van Hamersvelt, R.W., Viergever, M.A., Išgum, I., 2016. Automatic
coronary artery calcium scoring in cardiac CT angiography using paired convolutional neural networks. Medical Image
Analysis 34, 123-136.
[203]. Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You Only Look Once: Unified, Real-Time Object
Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788.
[204]. Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P., 2017b. Focal Loss for Dense Object Detection, 2017 IEEE
International Conference on Computer Vision (ICCV), pp. 2999-3007.
[205]. Girshick, R., 2015. Fast R-CNN, 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440-1448.
[206]. Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., Pietikäinen, M., 2020b. Deep Learning for Generic
Object Detection: A Survey. International Journal of Computer Vision 128, 261-318.
[207]. Dai, J., Li, Y., He, K., Sun, J., 2016. R-FCN: object detection via region-based fully convolutional networks,
Proceedings of the 30th International Conference on Neural Information Processing Systems. Curran Associates Inc.,
Barcelona, Spain, pp. 379–387.
[208]. Redmon, J., Farhadi, A., 2017. YOLO9000: Better, Faster, Stronger, 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 6517-6525.
[209]. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C., 2016. SSD: Single Shot MultiBox
Detector, In: Leibe, B., Matas, J., Sebe, N., Welling, M. (Eds.), Computer Vision – ECCV 2016. Springer International
Publishing, Cham, pp. 21-37.
[210]. Law, H., Deng, J., 2020. CornerNet: Detecting Objects as Paired Keypoints. International Journal of Computer
Vision 128, 642-656.
[211]. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q., 2019. CenterNet: Keypoint Triplets for Object Detection,
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6568-6577.
[212]. Newell, A., Huang, Z., Deng, J., 2017. Associative embedding: end-to-end learning for joint detection and grouping,
Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., Long
Beach, California, USA, pp. 2274–2284.
[213]. Tychsen-Smith, L., Petersson, L., 2017. DeNet: Scalable Real-Time Object Detection with Directed Sparse
Sampling, 2017 IEEE International Conference on Computer Vision (ICCV), pp. 428-436.
[214]. Gu, Y., Lu, X., Yang, L., Zhang, B., Yu, D., Zhao, Y., Gao, L., Wu, L., Zhou, T., 2018. Automatic lung nodule
detection using a 3D deep convolutional neural network combined with a multi-scale prediction strategy in chest CTs.
Computers in Biology and Medicine 103, 220-231.
[215]. Xie, H., Yang, D., Sun, N., Chen, Z., Zhang, Y., 2019b. Automated pulmonary nodule detection in CT images
using deep convolutional neural networks. Pattern Recognition 85, 109-119.
[216]. Akselrod-Ballin, A., Karlinsky, L., Hazan, A., Bakalo, R., Horesh, A.B., Shoshan, Y., Barkan, E., 2017. Deep
Learning for Automatic Detection of Abnormal Findings in Breast Mammography, In: Cardoso, M.J., Arbel, T., Carneiro,
G., Syeda-Mahmood, T., Tavares, J.M.R.S., Moradi, M., Bradley, A., Greenspan, H., Papa, J.P., Madabhushi, A.,
Nascimento, J.C., Cardoso, J.S., Belagiannis, V., Lu, Z. (Eds.), Deep Learning in Medical Image Analysis and Multimodal
Learning for Clinical Decision Support. Springer International Publishing, Cham, pp. 321-329.
[217]. Ribli, D., Horváth, A., Unger, Z., Pollner, P., Csabai, I., 2018. Detecting and classifying lesions in mammograms
with deep learning. Scientific Reports 8, 1-7.
[218]. Zhu, Z., Jin, D., Yan, K., Ho, T.-Y., Ye, X., Guo, D., Chao, C.-H., Xiao, J., Yuille, A., Lu, L., 2020b. Lymph Node
Gross Tumor Volume Detection and Segmentation via Distance-Based Gating Using 3D CT/PET Imaging in
Radiotherapy, In: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D.,
Joskowicz, L. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Springer
International Publishing, Cham, pp. 753-762.
[219]. Tao, Q., Ge, Z., Cai, J., Yin, J., See, S., 2019. Improving Deep Lesion Detection Using 3D Contextual and Spatial
Attention, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image
Computing and Computer Assisted Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 185-193.
[220]. Tang, Y., Yan, K., Tang, Y., Liu, J., Xiao, J., Summers, R.M., 2019. ULDor: A Universal Lesion Detector for CT
Scans with Pseudo Masks and Hard Negative Example Mining, 2019 IEEE 16th International Symposium on Biomedical
Imaging (ISBI 2019), pp. 833-836.
[221]. Roth, H.R., Lu, L., Liu, J., Yao, J., Seff, A., Cherry, K., Kim, L., Summers, R.M., 2016. Improving Computer-
Aided Detection Using Convolutional Neural Networks and Random View Aggregation. IEEE Transactions on Medical
Imaging 35, 1170-1181.
[222]. Dou, Q., Chen, H., Yu, L., Qin, J., Heng, P., 2017. Multilevel Contextual 3-D CNNs for False Positive Reduction
in Pulmonary Nodule Detection. IEEE Transactions on Biomedical Engineering 64, 1558-1567.
[223]. Yan, K., Bagheri, M., Summers, R.M., 2018a. 3D Context Enhanced Region-Based Convolutional Neural Network
for End-to-End Lesion Detection, In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G.
(Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Springer International
Publishing, Cham, pp. 511-519.
[224]. Liao, F., Liang, M., Li, Z., Hu, X., Song, S., 2019. Evaluate the Malignancy of Pulmonary Nodules Using the 3-D
Deep Leaky Noisy-OR Network. IEEE Transactions on Neural Networks and Learning Systems 30, 3484-3495.
[225]. Ding, J., Li, A., Hu, Z., Wang, L., 2017. Accurate Pulmonary Nodule Detection in Computed Tomography Images
Using Deep Convolutional Neural Networks, In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L.,
Duchesne, S. (Eds.), Medical Image Computing and Computer Assisted Intervention − MICCAI 2017. Springer
International Publishing, Cham, pp. 559-567.
[226]. Liu, S., Deng, W., 2015. Very deep convolutional neural network based image classification using small training
sample size, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 730-734.
[227]. Setio, A.A.A., Traverso, A., de Bel, T., Berens, M.S.N., Bogaard, C.V.D., Cerello, P., Chen, H., Dou, Q., Fantacci,
M.E., Geurts, B., Gugten, R.V., Heng, P.A., Jansen, B., de Kaste, M.M.J., Kotov, V., Lin, J.Y., Manders, J., Sóñora-
Mengana, A., García-Naranjo, J.C., Papavasileiou, E., Prokop, M., Saletta, M., Schaefer-Prokop, C.M., Scholten, E.T.,
Scholten, L., Snoeren, M.M., Torres, E.L., Vandemeulebroucke, J., Walasek, N., Zuidhof, G.C.A., Ginneken, B.V., Jacobs,
C., 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in
computed tomography images: The LUNA16 challenge. Medical Image Analysis 42, 1-13.
[228]. Zhu, W., Liu, C., Fan, W., Xie, X., 2018. DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule
Detection and Classification, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 673-681.
[229]. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J., 2017. Dual path networks, Proceedings of the 31st International
Conference on Neural Information Processing Systems. Curran Associates Inc., Long Beach, California, USA, pp. 4470–
[230]. Swiderska-Chadaj, Z., Pinckaers, H., van Rijthoven, M., Balkenhol, M., Melnikova, M., Geessink, O., Manson,
Q., Sherman, M., Polonia, A., Parry, J., Abubakar, M., Litjens, G., van der Laak, J., Ciompi, F., 2019. Learning to detect
lymphocytes in immunohistochemistry with deep learning. Medical Image Analysis 58, 101547.
[231]. Gao, Z., Puttapirat, P., Shi, J., Li, C., 2020. Renal Cell Carcinoma Detection and Subtyping with Minimal Point-
Based Annotation in Whole-Slide Images, In: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A.,
Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Computing and Computer Assisted Intervention –
MICCAI 2020. Springer International Publishing, Cham, pp. 439-448.
[232]. Qi, H., Collins, S., Noble, J.A., 2020. Knowledge-guided Pretext Learning for Utero-placental Interface Detection.
Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image
Computing and Computer-Assisted Intervention 12261, 582-593.
[233]. Wang, D., Zhang, Y., Zhang, K., Wang, L., 2020c. FocalMix: Semi-Supervised Learning for 3D Medical Image
Detection, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3950-3959.
[234]. Ozdemir, O., Woodward, B., Berlin, A.A., 2017. Propagating uncertainty in multi-stage Bayesian convolutional
neural networks with application to pulmonary nodule detection. arXiv preprint arXiv:1712.00497.
[235]. Yan, K., Tang, Y., Peng, Y., Sandfort, V., Bagheri, M., Lu, Z., Summers, R.M., 2019. MULAN: Multitask
Universal Lesion Analysis Network for Joint Lesion Detection, Tagging, and Segmentation, In: Shen, D., Liu, T., Peters,
T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted
Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 194-202.
[236]. Yan, K., Cai, J., Zheng, Y., Harrison, A.P., Jin, D., Tang, Y.b., Tang, Y.X., Huang, L., Xiao, J., Lu, L., 2020.
Learning from Multiple Datasets with Heterogeneous and Partial Labels for Universal Lesion Detection in CT. IEEE
Transactions on Medical Imaging, 1-1.
[237]. Cai, J., Yan, K., Cheng, C.-T., Xiao, J., Liao, C.-H., Lu, L., Harrison, A.P., 2020. Deep Volumetric Universal
Lesion Detection Using Light-Weight Pseudo 3D Convolution and Surface Point Regression, In: Martel, A.L.,
Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical
Image Computing and Computer Assisted Intervention – MICCAI 2020. Springer International Publishing, Cham, pp. 3-
[238]. Li, H., Han, H., Zhou, S.K., 2020d. Bounding Maps for Universal Lesion Detection, In: Martel, A.L., Abolmaesumi,
P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Computing
and Computer Assisted Intervention – MICCAI 2020. Springer International Publishing, Cham, pp. 417-428.
[239]. Yan, K., Wang, X., Lu, L., Zhang, L., Harrison, A.P., Bagheri, M., Summers, R.M., 2018b. Deep Lesion Graphs
in the Wild: Relationship Learning and Organization of Significant Radiology Image Findings in a Diverse Large-Scale
Lesion Database, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9261-9270.
[240]. Yan, K., Wang, X., Lu, L., Summers, R.M., 2018c. DeepLesion: automated mining of large-scale lesion annotations
and universal lesion detection with deep learning. Journal of Medical Imaging 5, 036501.
[241]. Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R., Dancey, J., Arbuck, S., Gwyther,
S., Mooney, M., Rubinstein, L., Shankar, L., Dodd, L., Kaplan, R., Lacombe, D., Verweij, J., 2009. New response
evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1). European Journal of Cancer 45, 228-247.
[242]. Wu, B., Zhou, Z., Wang, J., Wang, Y., 2018b. Joint learning for pulmonary nodule segmentation, attributes and
malignancy prediction, 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1109-1113.
[243]. Li, Z., Zhang, S., Zhang, J., Huang, K., Wang, Y., Yu, Y., 2019. MVP-Net: Multi-view FPN with Position-Aware
Attention for Deep Universal Lesion Detection, In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap,
P.-T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer
International Publishing, Cham, pp. 13-21.
[244]. Pisov, M., Kondratenko, V., Zakharov, A., Petraikin, A., Gombolevskiy, V., Morozov, S., Belyaev, M., 2020.
Keypoints Localization for Joint Vertebra Detection and Fracture Severity Quantification, In: Martel, A.L., Abolmaesumi,
P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Computing
and Computer Assisted Intervention – MICCAI 2020. Springer International Publishing, Cham, pp. 723-732.
[245]. Lung, K.-Y., Chang, C.-R., Weng, S.-E., Lin, H.-S., Shuai, H.-H., Cheng, W.-H., 2021. ROSNet: Robust one-stage
network for CT lesion detection. Pattern Recognition Letters 144, 82-88.
[246]. Zhu, H., Yao, Q., Xiao, L., Zhou, S.K., 2021b. You Only Learn Once: Universal Anatomical Landmark Detection,
International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 85-95.
[247]. Baur, C., Denner, S., Wiestler, B., Navab, N., Albarqouni, S., 2021. Autoencoders for unsupervised anomaly
segmentation in brain MR images: A comparative study. Medical Image Analysis 69, 101952.
[248]. Uzunova, H., Schultz, S., Handels, H., Ehrhardt, J., 2019. Unsupervised pathology detection in medical images
using conditional variational autoencoders. International Journal of Computer Assisted Radiology and Surgery 14, 451-
[249]. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G., 2017. Unsupervised anomaly detection
with generative adversarial networks to guide marker discovery, International conference on information processing in
medical imaging. Springer, pp. 146-157.
[250]. Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U., 2019. f-AnoGAN: Fast unsupervised
anomaly detection with generative adversarial networks. Medical Image Analysis 54, 30-44.
[251]. Baur, C., Wiestler, B., Albarqouni, S., Navab, N., 2018. Deep autoencoding models for unsupervised anomaly
segmentation in brain MR images, International MICCAI Brainlesion Workshop. Springer, pp. 161-169.
[252]. Chen, X., You, S., Tezcan, K.C., Konukoglu, E., 2020d. Unsupervised lesion detection via image restoration with
a normative prior. Medical Image Analysis 64, 101713.
[253]. You, S., Tezcan, K.C., Chen, X., Konukoglu, E., 2019. Unsupervised lesion detection via image restoration with a
normative prior, International Conference on Medical Imaging with Deep Learning. PMLR, pp. 540-556.
[254]. Chen, X., Pawlowski, N., Glocker, B., Konukoglu, E., 2021c. Normative ascent with local Gaussians for
unsupervised lesion detection. Medical Image Analysis 74, 102208.
[255]. Ma, J., Li, X., Li, H., Wang, R., Menze, B., Zheng, W.S., 2021a. Cross-View Relation Networks for Mammogram
Mass Detection, 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8632-8638.
[256]. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y., 2018b. Relation networks for object detection, Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 3588-3597.
[257]. Liu, Y., Zhang, F., Zhang, Q., Wang, S., Wang, Y., Yu, Y., 2020c. Cross-view Correspondence Reasoning based
on Bipartite Graph Convolutional Network for Mammogram Mass Detection, Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 3812-3822.
[258]. Lin, H., Chen, H., Graham, S., Dou, Q., Rajpoot, N., Heng, P.A., 2019. Fast ScanNet: Fast and Dense Analysis of
Multi-Gigapixel Whole-Slide Images for Cancer Metastasis Detection. IEEE Transactions on Medical Imaging 38, 1948-
[259]. Haskins, G., Kruger, U., Yan, P., 2020. Deep learning in medical image registration: a survey. Machine Vision and
Applications 31, 8.
[260]. Fu, Y., Lei, Y., Wang, T., Curran, W.J., Liu, T., Yang, X., 2020. Deep learning in medical image registration: a
review. Physics in Medicine & Biology 65, 20TR01.
[261]. Ma, J., Jiang, X., Fan, A., Jiang, J., Yan, J., 2021b. Image matching from handcrafted to deep features: A survey.
International Journal of Computer Vision 129, 23-79.
[262]. Cheng, X., Zhang, L., Zheng, Y., 2018. Deep similarity learning for multimodal medical images. Computer
Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 6, 248-252.
[263]. Haskins, G., Kruecker, J., Kruger, U., Xu, S., Pinto, P.A., Wood, B.J., Yan, P., 2019. Learning deep similarity
metric for 3D MR–TRUS image registration. International Journal of Computer Assisted Radiology and Surgery 14, 417-
[264]. Uzunova, H., Wilms, M., Handels, H., Ehrhardt, J., 2017. Training CNNs for image registration from few samples
with model-based data augmentation, International Conference on Medical Image Computing and Computer-Assisted
Intervention. Springer, pp. 223-231.
[265]. Fan, J., Cao, X., Yap, P.-T., Shen, D., 2019a. BIRNet: Brain image registration using dual-supervised fully
convolutional networks. Medical Image Analysis 54, 193-206.
[266]. Zhao, S., Lau, T., Luo, J., Chang, E.I.-C., Xu, Y., 2019b. Unsupervised 3D end-to-end medical image
registration with volume tweening network. IEEE Journal of Biomedical and Health Informatics 24, 1394-1404.
[267]. Kim, B., Kim, J., Lee, J.-G., Kim, D.H., Park, S.H., Ye, J.C., 2019. Unsupervised deformable image registration
using cycle-consistent CNN, International Conference on Medical Image Computing and Computer-Assisted Intervention.
Springer, pp. 166-174.
[268]. Wu, G., Kim, M., Wang, Q., Munsell, B.C., Shen, D., 2016. Scalable High-Performance Image Registration
Framework by Unsupervised Deep Feature Representations Learning. IEEE Transactions on Biomedical Engineering 63,
[269]. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y., 2011. Unsupervised learning of hierarchical representations with
convolutional deep belief networks. Communications of the ACM 54, 95-103.
[270]. Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C., 2008. Symmetric diffeomorphic image registration with
cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12, 26-
[271]. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V., 2019. VoxelMorph: A Learning Framework
for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging 38, 1788-1800.
[272]. Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton,
M., Ourselin, S., Noble, J.A., Barratt, D.C., Vercauteren, T., 2018c. Weakly-supervised convolutional neural networks for
multimodal image registration. Medical Image Analysis 49, 1-13.
[273]. Hu, Y., Modat, M., Gibson, E., Ghavami, N., Bonmati, E., Moore, C.M., Emberton, M., Noble, J.A., Barratt, D.C.,
Vercauteren, T., 2018d. Label-driven weakly-supervised learning for multimodal deformable image registration, 2018
IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1070-1074.
[274]. de Vos, B.D., Berendsen, F.F., Viergever, M.A., Sokooti, H., Staring, M., Išgum, I., 2019. A deep learning
framework for unsupervised affine and deformable image registration. Medical Image Analysis 52, 128-143.
[275]. de Vos, B.D., Berendsen, F.F., Viergever, M.A., Staring, M., Išgum, I., 2017. End-to-end unsupervised deformable
image registration with a convolutional neural network, Deep Learning in Medical Image Analysis and Multimodal
Learning for Clinical Decision Support. Springer, pp. 204-212.
[276]. Fan, J., Cao, X., Wang, Q., Yap, P.-T., Shen, D., 2019b. Adversarial learning for mono- or multi-modal registration.
Medical Image Analysis 58, 101545.
[277]. Yang, X., Kwitt, R., Styner, M., Niethammer, M., 2017. Quicksilver: Fast predictive image registration - A deep
learning approach. NeuroImage 158, 378-396.
[278]. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent
adversarial networks, Proceedings of the IEEE international conference on computer vision, pp. 2223-2232.
[279]. Geras, K.J., Mann, R.M., Moy, L., 2019. Artificial intelligence for mammography and digital breast tomosynthesis:
current concepts and future perspectives. Radiology 293, 246-259.
[280]. Yang, Z., Luo, T., Wang, D., Hu, Z., Gao, J., Wang, L., 2018. Learning to navigate for fine-grained classification,
Proceedings of the European Conference on Computer Vision (ECCV), pp. 420-435.
[281]. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J., 2017. Generalised dice overlap as a deep learning
loss function for highly unbalanced segmentations, Deep learning in medical image analysis and multimodal learning for
clinical decision support. Springer, pp. 240-248.
[282]. Abraham, N., Khan, N.M., 2019. A novel focal Tversky loss function with improved attention U-Net for lesion
segmentation, 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). IEEE, pp. 683-687.
[283]. Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., Ayed, I.B., 2019. Boundary loss for highly
unbalanced segmentation, International conference on medical imaging with deep learning. PMLR, pp. 285-296.
[284]. Li, M., Hsu, W., Xie, X., Cong, J., Gao, W., 2020e. SACNN: Self-attention convolutional neural network for low-
dose CT denoising with self-supervised perceptual loss network. IEEE transactions on medical imaging 39, 2289-2301.
[285]. Dai, Z., Yang, Z., Yang, F., Cohen, W.W., Salakhutdinov, R., 2017. Good semi-supervised learning that requires
a bad GAN, Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran
Associates Inc., Long Beach, California, USA, pp. 6513-6523.
[286]. Rubin, M., Stein, O., Turko, N.A., Nygate, Y., Roitshtain, D., Karako, L., Barnea, I., Giryes, R., Shaked, N.T.,
2019. TOP-GAN: Stain-free cancer cell classification using deep learning with a small training set. Medical Image
Analysis 57, 176-185.
[287]. Zhao, S., Liu, Z., Lin, J., Zhu, J., Han, S., 2020. Differentiable Augmentation for Data-Efficient GAN Training,
Advances in Neural Information Processing Systems, pp. 1-23.
[288]. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., Aila, T., 2020. Training Generative Adversarial
Networks with Limited Data, Advances in Neural Information Processing Systems, pp. 1-37.
[289]. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D., 2020.
Supervised Contrastive Learning. Advances in Neural Information Processing Systems 33.
[290]. Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., Khandeparkar, H., 2019. A theoretical analysis of contrastive
unsupervised representation learning, International Conference on Machine Learning. PMLR, pp. 5628-5637.
[291]. Reed, C.J., Yue, X., Nrusimha, A., Ebrahimi, S., Vijaykumar, V., Mao, R., Li, B., Zhang, S., Guillory, D., Metzger,
S., 2022. Self-supervised pretraining improves self-supervised pretraining, Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, pp. 2584-2594.
[292]. Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C., Cubuk, E., Kurakin, A., Li, C., 2020.
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence, Advances in Neural Information
Processing Systems, pp. 1-13.
[293]. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V., 2020. Randaugment: Practical automated data augmentation with a
reduced search space, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops,
pp. 702-703.
[294]. Oliver, A., Odena, A., Raffel, C.A., Cubuk, E.D., Goodfellow, I., 2018. Realistic evaluation of deep semi-
supervised learning algorithms. Advances in neural information processing systems 31.
[295]. Guo, L.-Z., Zhang, Z.-Y., Jiang, Y., Li, Y.-F., Zhou, Z.-H., 2020. Safe deep semi-supervised learning for unseen-
class unlabeled data, International Conference on Machine Learning. PMLR, pp. 3897-3906.
[296]. Yuille, A.L., Liu, C., 2021. Deep Nets: What have They Ever Done for Vision? International Journal of Computer
Vision 129, 781-802.
[297]. Marblestone, A.H., Wayne, G., Kording, K.P., 2016. Toward an integration of deep learning and neuroscience.
Frontiers in Computational Neuroscience 10, 94.
[298]. Zoph, B., Le, Q.V., 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[299]. Elsken, T., Metzen, J.H., Hutter, F., 2019. Neural architecture search: A survey. The Journal of Machine Learning
Research 20, 1997-2017.
[300]. Gibson, E., Li, W., Sudre, C., Fidon, L., Shakir, D.I., Wang, G., Eaton-Rosen, Z., Gray, R., Doel, T., Hu, Y., 2018b.
NiftyNet: a deep-learning platform for medical imaging. Computer Methods and Programs in Biomedicine 158, 113-122.
[301]. Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H., 2021. nnU-Net: a self-configuring method
for deep learning-based biomedical image segmentation. Nature Methods 18, 203-211.
[302]. Wynants, L., Van Calster, B., Collins, G.S., Riley, R.D., Heinze, G., Schuit, E., Bonten, M.M.J., Dahly, D.L.,
Damen, J.A., Debray, T.P.A., de Jong, V.M.T., De Vos, M., Dhiman, P., Haller, M.C., Harhay, M.O., Henckaerts, L.,
Heus, P., Kammer, M., Kreuzberger, N., Lohmann, A., Luijken, K., Ma, J., Martin, G.P., McLernon, D.J., Andaur Navarro,
C.L., Reitsma, J.B., Sergeant, J.C., Shi, C., Skoetz, N., Smits, L.J.M., Snell, K.I.E., Sperrin, M., Spijker, R., Steyerberg,
E.W., Takada, T., Tzoulaki, I., van Kuijk, S.M.J., van Bussel, B.C.T., van der Horst, I.C.C., van Royen, F.S., Verbakel,
J.Y., Wallisch, C., Wilkinson, J., Wolff, R., Hooft, L., Moons, K.G.M., van Smeden, M., 2020. Prediction models for
diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 369, m1328.
[303]. Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., Aviles-Rivero, A.I., Etmann, C.,
McCague, C., Beer, L., 2021. Common pitfalls and recommendations for using machine learning to detect and
prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199-217.
[304]. Nagendran, M., Chen, Y., Lovejoy, C.A., Gordon, A.C., Komorowski, M., Harvey, H., Topol, E.J., Ioannidis,
J.P.A., Collins, G.S., Maruthappu, M., 2020. Artificial intelligence versus clinicians: systematic review of design,
reporting standards, and claims of deep learning studies. BMJ 368, m689.
[305]. Yang, Q., Liu, Y., Chen, T., Tong, Y., 2019. Federated machine learning: Concept and applications. ACM
Transactions on Intelligent Systems and Technology (TIST) 10, 1-19.
[306]. Li, T., Sahu, A.K., Talwalkar, A., Smith, V., 2020f. Federated learning: Challenges, methods, and future directions.
IEEE Signal Processing Magazine 37, 50-60.
[307]. Rieke, N., Hancox, J., Li, W., Milletari, F., Roth, H.R., Albarqouni, S., Bakas, S., Galtier, M.N., Landman, B.A.,
Maier-Hein, K., 2020. The future of digital health with federated learning. NPJ Digital Medicine 3, 1-7.
[308]. Kelly, C.J., Karthikesalingam, A., Suleyman, M., Corrado, G., King, D., 2019. Key challenges for delivering
clinical impact with artificial intelligence. BMC Medicine 17, 1-9.
[309]. McKinney, S.M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M.,
Corrado, G.S., Darzi, A., Etemadi, M., Garcia-Vicente, F., Gilbert, F.J., Halling-Brown, M., Hassabis, D., Jansen, S.,
Karthikesalingam, A., Kelly, C.J., King, D., Ledsam, J.R., Melnick, D., Mostofi, H., Peng, L., Reicher, J.J., Romera-
Paredes, B., Sidebottom, R., Suleyman, M., Tse, D., Young, K.C., De Fauw, J., Shetty, S., 2020. International evaluation
of an AI system for breast cancer screening. Nature 577, 89-94.
[310]. Baltatzis, V., Bintsi, K.-M., Folgoc, L.L., Martinez Manzanera, O.E., Ellis, S., Nair, A., Desai, S., Glocker, B.,
Schnabel, J.A., 2021. The Pitfalls of Sample Selection: A Case Study on Lung Nodule Classification, In: Rekik, I., Adeli,
E., Park, S.H., Schnabel, J. (Eds.), Predictive Intelligence in Medicine. Springer International Publishing, Cham, pp. 201-