
Journal Pre-proof

Deep Learning-based Dot and Globule Segmentation with Pixel and Blob-based
Metrics for Evaluation

Anand K. Nambisan, Norsang Lama, Thanh Phan, Samantha Swinfard, Binita Lama,
Colin Smith, Ahmad Rajeh, Gehana Patel, Jason Hagerty, William V. Stoecker,
Ronald J. Stanley

PII: S2667-3053(22)00063-1
DOI: https://doi.org/10.1016/j.iswa.2022.200126
Reference: ISWA 200126

To appear in: Intelligent Systems with Applications

Received date: 18 June 2022


Revised date: 25 August 2022
Accepted date: 11 September 2022

Please cite this article as: Anand K. Nambisan , Norsang Lama , Thanh Phan , Samantha Swinfard ,
Binita Lama , Colin Smith , Ahmad Rajeh , Gehana Patel , Jason Hagerty , William V. Stoecker ,
Ronald J. Stanley , Deep Learning-based Dot and Globule Segmentation with Pixel and
Blob-based Metrics for Evaluation, Intelligent Systems with Applications (2022), doi:
https://doi.org/10.1016/j.iswa.2022.200126

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.

© 2022 Published by Elsevier Ltd.


This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/)
Deep Learning-based Dot and Globule Segmentation
with Pixel and Blob-based Metrics for Evaluation

Anand K. Nambisan^a, Norsang Lama^a, Thanh Phan^a, Samantha Swinfard^a,
Binita Lama^a, Colin Smith^b, Ahmad Rajeh^c, Gehana Patel^d, Jason Hagerty^e,
William V. Stoecker^e, Ronald J. Stanley^f,∗

a Missouri University of Science & Technology, Rolla, MO, 65209, USA
b A.T. Still University Medical School, Kirksville, MO, 63501, USA
c University of Missouri School of Medicine, Columbia, MO, 65212, USA
d University of Missouri - Columbia, Columbia, MO, 65211, USA
e S&A Technologies, Rolla, MO, 65401, USA
f 127 Emerson Electric Company Hall, Missouri University of Science & Technology,
Rolla, MO, 65209, USA

Abstract

Deep learning (DL) applied to whole dermoscopic images has shown unprecedented
accuracy in differentiating images of melanoma from benign lesions. We
hypothesize that accuracy in whole-image deep learning suffers because whole
lesion analysis lacks an evaluation of dermoscopic structures. DL also suffers a
“black box” characterization because it offers only probabilities to the physician
and no visible structures. We propose the detection of structures called dots
and globules as a means to improve precision in melanoma detection. We compare
two encoder-decoder architectures to detect dots and globules: UNET vs.
UNET++. For each of these architectures, we compare three pipelines: with
test-time augmentation (TTA), without TTA, and without TTA but with
checkpoint ensembles. We use an SE-RESNEXT encoder and a symmetric decoder.
The pixel-based F1-scores for globule and dot detection based on UNET++
and UNET techniques with checkpoint ensembles were found to be 0.632 and
0.628, respectively. The blob-based UNET++ and UNET F1-scores (25 percent
intersection) were 0.696 and 0.685, respectively. This agreement score is over
twice the statistical correlation score measured among specialists. We propose
UNET++ globule and dot detection as a technique that offers two potential
advantages: increased diagnostic accuracy and visible structure detection to
better explain DL results and mitigate deep learning’s black-box problem. We
present a public globule and dot database to aid progress in automatic detection
of these structures.
Keywords: Machine Learning, Deep Learning, Data processing, Melanoma,
Globules, Feature Segmentation

∗ Corresponding author
Email addresses: akn36d@mst.edu (Anand K. Nambisan), nlbft@mst.edu (Norsang
Lama), tupzg6@health.missouri.edu (Thanh Phan), slsbng@mst.edu (Samantha Swinfard),
binny.lama8@gmail.com (Binita Lama), colinsmith@atsu.edu (Colin Smith),
ahmadrjeh@gmail.com (Ahmad Rajeh), gnpv5k@missouri.com (Gehana Patel),
hagerty.jason@gmail.com (Jason Hagerty), wvs@mst.edu (William V. Stoecker),
stanleyj@mst.edu (Ronald J. Stanley)

Preprint submitted to Intelligent Systems with Applications August 24, 2022

1. Introduction

Invasive melanoma is a form of skin cancer with 99,780 new cases estimated
in the USA in 2022 [1]. The chances of survival are high if melanoma is diagnosed
early, as noted by Noone et al. in [2]. Despite this, projections by Rahib et al.
[3] show that the number of melanoma cases will more than double by 2040,
becoming the second most prevalent form of cancer by then. This puts emphasis
on the importance of engaging in research toward raising awareness and early
diagnosis of skin lesions.
In this section, we provide an overview of the state of current research in
relation to our present work. We also briefly discuss developments in related
medical and machine learning fields.
In the domain of computer-aided diagnosis (CAD) for digital dermoscopy,
researchers have focused on a set of tasks to analyze skin lesions for diagnosis.
These tasks can be broadly categorized as image enhancement or pre-processing,
lesion segmentation, feature extraction, and finally lesion diagnosis or classification [4, 5].
The biggest and most used public datasets for machine learning-based skin
lesion research are the ISIC datasets, which were first released at the Inter-
national Symposium on Biomedical Imaging (ISBI) 2016, by the International
Skin Imaging Collaboration (ISIC) [6]. This was accompanied by a challenge
involving a set of tasks to push CAD-based skin research further. Since then,
there have been four more iterations of the challenge, each with a set of skin
lesion-related tasks and a dataset [7, 8, 9, 10]. Many other datasets have also
been used in CAD research. Some, like PH2 [11], are small and have only 200
dermoscopic images with three diagnosis classes: common nevi, atypical nevi,
and melanoma. Other datasets include the interactive atlas of dermoscopy [12],
which has more than 1000 clinical cases with clinical images, dermoscopy
images, histopathology results, and level of difficulty; it was created as a means
to train medical personnel. Kassem et al. [13] provide a succinct review of
current machine learning and deep learning approaches using various datasets
to highlight some of the prevalent challenges in CAD-based skin lesion research.
Adegun et al. [5] in their survey on deep learning techniques in CAD focused on
the ISIC 2018 and ISIC 2019 datasets. They conclude that model ensembling,
image pre-processing and lesion segmentation improve results for skin lesion
classification.
Segmentation of dermatological features has been repeatedly highlighted as
one of the more difficult lesion tasks. Codella et al. have mentioned this in
both the post-challenge reviews of the ISIC 2017 [9] and ISIC 2018 challenges
[8]. Barata et al. [14] provided a comprehensive survey of feature extraction
in skin lesion image analysis, including clinically inspired features like negative
networks, dots, globules, etc. They found that the time consumed for annota-
tion is the prime reason for the lack of exploration of such features for image
analysis. This has also hindered the incorporation of clinical features toward
deep learning-based feature segmentation. Recent work done by Cassidy et al.
[15] used a curated combination of multiple ISIC datasets along with in-depth
data cleaning strategies and provided benchmarks on multiple test sets. This
was done to highlight the biases that occur due to noise and other artifacts in the
assessment of the lesion diagnosis classification task. Work has also been done
to alleviate the difficulties in creating annotations for tasks in the healthcare
domain. Calisto et al. [16] have proposed touch-based interfaces for medical
annotation aimed at radiologists to help during patient follow-ups. Calisto et al.
[17, 18] show that the integration of AI techniques improves workflow efficiency
and reduces work-related cognitive fatigue. Such interfaces can be introduced
into the clinic to streamline the annotation and data collection process across
medical disciplines. This would then make it easier to standardize and collect
data for later downstream machine learning tasks and statistical analysis.

1.1. Dot and Globule Segmentation

In this section, we focus on the clinical features of interest: dots and globules.
We provide information and context to support their importance in skin lesion
diagnosis.
Early diagnosis of melanoma, particularly at the in situ stage, yields the
best prognosis [19]. However, many cases of melanoma, especially early in situ
melanomas, are missed by domain experts [20],[21],[22]. Machine vision tech-
niques incorporating deep learning (DL) have shown higher diagnostic accuracy
than dermatologists can achieve [23], [24]. However, the accuracy of DL methods
has not been proven when applied to small-diameter melanomas. “Black dots” and
“brown globules” were among the earliest dermoscopic melanoma structures
detected and are still considered critical for diagnosis [25]. Because these
structures are found in both benign and malignant lesions, there is a need to
characterize dots and globules precisely in order to use them for melanoma
discrimination.
These structures may be most useful for discriminating tiny melanomas from
benign mimics. Regio Pereira et al. found irregular dots and globules in 76.5%
of small melanomas; these features were among the most discriminatory features
for these melanomas [26]. If dots and globules can be precisely delineated, their
features can be used to predict melanoma. Xu et al. found that large globules
and varying globule sizes and shapes had the highest odds ratio for melanoma
[27]. Maglogiannis and Delibasis reported automatic dot and globule detection
using a multi-resolution inverse non-linear diffusion approach and found that
features from the detected structures increased diagnostic accuracy by 6%,
primarily by increasing specificity (true-negative rate) [28]. The study showed the
potential of globules, but conclusions from this study and other studies such
as [14] are limited because they analyze a limited number of lesions from
non-public databases and lack specific metrics for assessing the detected structures.
The 2017 and 2018 ISIC challenges [9], [8] provide a globule database using
superpixel-based ground truth annotations that include extraneous areas
besides globules. These masks include dermoscopic features but do not delineate
them precisely. Inexact masks do not allow determination of feature-specific
information such as the number of instances of a feature within the lesion or the
variances in shape, structure, and color between instances of the feature across
the lesion. Once extracted, these features can be used for other downstream tasks
(explainability or classification). An example of ISIC globule annotation is
shown in Fig. 1.

Figure 1: Nevus (left) with dots and globules marked (right) in the 2018 ISIC dataset.

This research develops precise dot and globule masks and presents a DL
technique for detecting these masks automatically. We also present a blob-based
metric to best ascertain model detection accuracy. The remaining sections
are 2 Methodology, 3 Training, 4 Dataset Availability, 5 Hardware and
Software Configuration, 6 Results, 7 Discussion, and 8 Conclusion and Future Work.

2. Methodology

In the following subsections, we break down the different steps involved in
the curation of the dataset to prepare for model training, the DL models used,
and the evaluation metrics to assess model performance.

2.1. Data Collection and processing

To create the dataset we selected 539 images, with 501 from the ISIC 2019
dataset [9], [10], [7] and 38 from a multi-clinic study supported by NIH grants
SBIR R43 CA153927-01 and CA101639-02A2. We opted to use the ISIC 2019
dataset as this dataset also has metadata information, and we believe this can
lead to the development of multimodal approaches in future work. A researcher,
under the supervision of a dermatologist (W.V.S.), marked all regions in the
images that contained either a dot or a globule, as defined by a consensus
conference [29]. We used a broad definition of dots and globules so that the
model can extract the features and their mimics across diagnoses. Dots and
globules are dark-brown, black, or gray structures with fairly sharp borders,
often roughly round but sometimes irregular.
We split the dataset into 65 percent training set and 35 percent test set.
This gave us 381 images in the training/validation set, of which 358 were unique
images, and the rest were either the same images with more than one annotation
mask (which occurs for duplicate images in the image sets) or images with the
same lesion but acquired under slightly different conditions. This means that
the hold-out test set had 158 images, and all the images were unique. The
training/validation dataset is then split into five sets of train and validation sets
for 5-fold cross-validation. During the splitting into folds, we ensure that there
are no duplicates (the same image or the same lesion) across a pair of train-
validation sets. All duplicates are moved into the training set; leaving only
unique images in the validation dataset. To show the models’ generalization
capabilities, we also tested them on a set of 160 small melanomas. We did this
to determine whether the models detect similar structures across the classes.
Since dots and globules are important in separating melanoma from benign
nevi, we want our models to be able to detect similar structures in melanomas.
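
The fold construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in this work: the function name `make_folds`, the `lesion_of` mapping from image to lesion identifier, and the round-robin assignment of unique images to validation folds are all hypothetical. The sketch keeps the one property stated in the text, namely that every image whose lesion occurs more than once is placed in the training set, so each validation set contains only unique images.

```python
import random
from collections import defaultdict


def make_folds(image_ids, lesion_of, n_folds=5, seed=0):
    """Build (train, validation) splits without lesion leakage.

    image_ids: list of image identifiers.
    lesion_of: dict mapping image id -> lesion id (hypothetical metadata).
    Images whose lesion occurs more than once always go to training.
    """
    by_lesion = defaultdict(list)
    for image_id in image_ids:
        by_lesion[lesion_of[image_id]].append(image_id)

    unique = [ids[0] for ids in by_lesion.values() if len(ids) == 1]
    duplicated = [i for ids in by_lesion.values() if len(ids) > 1 for i in ids]

    random.Random(seed).shuffle(unique)

    folds = []
    for k in range(n_folds):
        val = unique[k::n_folds]  # every n_folds-th unique image
        held_out = set(val)
        train = [i for i in unique if i not in held_out] + duplicated
        folds.append((train, val))
    return folds


# Toy example: 12 images, two of which show the same lesion.
ids = [f"img{n:02d}" for n in range(12)]
lesions = {i: i for i in ids}
lesions["img11"] = "img10"  # img10 and img11 share a lesion
folds = make_folds(ids, lesions)
```

In the toy example, each of the five validation sets holds two unique images, and both duplicated images appear in every training set.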

2.2. Models

For globule segmentation, we use two variants of the UNET [30] architec-
ture, both with the same encoder network but with different decoder and skip
connection structures. The two architectures are the usual UNET architecture,
as in [30] and the UNET++ architecture [31]. The UNET++ architecture is a
modification of the UNET architecture with each encoder level connected to the
corresponding decoder level via nested dense convolutions. The architecture also
has multiple segmentation branches, each originating from a different level of
the encoder network. Two variants of the UNET++ architecture exist — based
on how the outputs from the segmentation branches are processed — a fast
mode where the final decoder segmentation output is selected and an accurate
mode where all the segmentation masks are averaged to generate a final mask.
We use the fast mode in our work. The encoder is an SE-ResNext50-32×4d
network [32], which incorporates the squeeze-and-excitation module to learn re-
lationships across the convolutional feature channels. The model is based on
the ResNext50 [33] model, which expands upon and modifies the blocks in the
Resnet model [34] along a new “cardinality” dimension, which in this particular
case can be simplified using the notation of grouped convolutions. The decoder
mirrors the encoder; it is built from the encoder, the output mask requirement,
and the number of layers required, using the implementation in [35].

2.3. Evaluation metrics

In related tasks, such as the feature segmentation tasks in the ISIC 2018 and
2017 challenges, the metric used to evaluate the predicted feature masks was
pixel-based IOU (Jaccard) [8, 9]. The Jaccard metric worked well for superpixel-
based ground truth masks but would not be indicative of performance for our
precise dataset. An example of a case where pixel-based IOU is a poor metric
for dots and globule segmentation can be seen in Fig. 2.

Figure 2: Image where the models localize well, but blobs are too generous. The pixel-based
IOU, precision, recall and F1-scores were 0.0947, 0.994, 0.0947, and 0.173 respectively. The
blob-based precision, recall and F1-scores with an intersection threshold of 25% were 0.25,
1.0, and 0.4 respectively, indicative of good localization.

For the PASCAL VOC challenge [36], the intersection over union (IOU)
of the predicted bounding boxes with the true boxes is the criterion used for
deciding whether the predicted boxes are true-positive. A confidence score-based
approach is used to construct an interpolated precision-recall curve to
calculate the average precision.
We developed a blob-based approach for metrics to accommodate the subjective
nature of globule boundaries. Globule boundaries marked by dermoscopy
experts differ significantly between dermatologists and even between globules
marked by the same expert at different times (see Section 7, Discussion). Therefore,
detection of an object should be considered successful if there is significant
overlap (mask intersection) between the detected (predicted) mask and the ground
truth mask.
We define a blob as any disjoint contour in a mask. For the globule segmen-
tation task, a single predicted blob may overlap multiple ground truth globules.
This can occur due to the subjective variation in the degree to which an anno-
tator chose to combine or separate globules that appear close together.
The blob-based Precision and Recall for the whole test set can be calculated
using the number of true-positive, false-positive, and false-negative blobs in an
image and accumulating them for the entire dataset.

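$$\mathrm{TP} = \sum_{i=1}^{I} \sum_{j=1}^{N_i} f\left(|b_{j,i} \cap G_i| > T_1 \times |b_{j,i}|\right), \tag{1}$$

$$\mathrm{FP} = \sum_{i=1}^{I} \sum_{j=1}^{N_i} f\left(|b_{j,i} \cap G_i| \le T_1 \times |b_{j,i}|\right), \tag{2}$$

$$\mathrm{FN} = \sum_{i=1}^{I} \sum_{k=1}^{M_i} f\left(|g_{k,i} \cap P_i| \le T_2 \times |g_{k,i}|\right), \tag{3}$$

and f is the indicator function,

$$f(\text{statement}) = \begin{cases} 1, & \text{if statement true} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$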

In these equations, $|b|$ denotes the number of pixels in the blob $b$, $I$ is the
total number of images in the test set, $N_i$ is the total number of blobs in the
predicted mask ($P_i$) for the $i$th image, and $M_i$ is the total number of blobs in
the ground truth mask ($G_i$) for the $i$th image. $b_{j,i}$ is the $j$th blob in $P_i$
and $g_{k,i}$ is the $k$th blob in $G_i$. Here $T_1$ and $T_2$ specify the intersection
percentage thresholds (intersection thresholds) for considering a blob as
true-positive (or false-positive) and false-negative respectively, and fall in the
open interval (0, 1) on the real line. The thresholds $T_1$ and $T_2$ can also be
interpreted as thresholds on a predicted blob’s precision and ground truth blob’s
recall, respectively. In our calculations, we set $T_1 = T_2 = T$ for simplicity. We
can use TP, FP, and FN to calculate precision, recall, and the F1-score. A
similar approach that relies on centroid distances between blobs instead of
overlap was used by Xu et al. [37]. This was done to account for the fact that the
ground truth was in the form of blob centroid co-ordinates rather than blob masks.
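
The blob counts defined in Eqs. (1)-(3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in this work: the function names, the use of 8-connectivity to separate blobs, and the choice to return 0 for an undefined ratio are assumptions. Precision, recall, and F1 follow the usual definitions, TP/(TP+FP), TP/(TP+FN), and their harmonic mean.

```python
from collections import deque


def find_blobs(mask):
    """Return the blobs of a binary mask as sets of (row, col) pixels.

    mask is a list of rows of 0/1 values; 8-connectivity is an assumption.
    """
    height, width = len(mask), len(mask[0])
    seen, blobs = set(), []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            blob, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                blob.add((y, x))
                for ny in range(max(y - 1, 0), min(y + 2, height)):
                    for nx in range(max(x - 1, 0), min(x + 2, width)):
                        if mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def foreground(mask):
    """Set of all foreground pixel coordinates in a binary mask."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def blob_counts(pred_masks, gt_masks, t1=0.5, t2=0.5):
    """Accumulate TP, FP, FN blobs over a dataset, as in Eqs. (1)-(3)."""
    tp = fp = fn = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred_px, gt_px = foreground(pred), foreground(gt)
        for b in find_blobs(pred):
            if len(b & gt_px) > t1 * len(b):
                tp += 1
            else:
                fp += 1
        for g in find_blobs(gt):
            if len(g & pred_px) <= t2 * len(g):
                fn += 1
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Usual definitions; undefined ratios are returned as 0.0 (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy masks: one generous prediction over the true blob, plus one stray blob.
gt = [[1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
pred = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
counts = blob_counts([pred], [gt], t1=0.5, t2=0.5)
```

With T = 0.5 the generous blob still counts as a true positive because 4 of its 6 pixels overlap the ground truth; with T = 0.75 the same blob becomes a false positive.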
We calculated these blob-based metrics with three different intersection
thresholds: 25%, 50%, and 75% (T = 0.25, 0.5, and 0.75). Our analysis showed
that F1-scores are relatively constant up to an intersection of 25% and then fall
monotonically as the required overlap percentage increases. The results are
shown in Tables 1 and 2.

We also calculated pixel-based metrics on the test dataset to assess seg-
mentation quality. The pixel-based scores are calculated by accumulating the
true-positive, false-positive, true-negative, and false-negative pixels across the
entire dataset, and using the usual definitions of precision, recall, and F1-score
to calculate the metrics as done in the blob case above.
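
The pixel-based accumulation can be sketched as follows. The function names are hypothetical, and returning 0 when a ratio is undefined is an assumption; the definitions of precision, recall, F1, and IOU are the usual ones.

```python
def pixel_counts(pred_masks, gt_masks):
    """Accumulate TP, FP, FN, TN pixels over all images in a dataset."""
    tp = fp = fn = tn = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                if p and g:
                    tp += 1
                elif p:
                    fp += 1
                elif g:
                    fn += 1
                else:
                    tn += 1
    return tp, fp, fn, tn


def pixel_scores(tp, fp, fn):
    """Dataset-level scores; undefined ratios are returned as 0.0 (assumption)."""
    union = tp + fp + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if union else 0.0,
        "iou": tp / union if union else 0.0,
    }


# Toy dataset: one over-segmented image and one empty image.
gt_a = [[1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
pred_a = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
empty = [[0, 0], [0, 0]]
tp, fp, fn, tn = pixel_counts([pred_a, empty], [gt_a, empty])
scores = pixel_scores(tp, fp, fn)
```

Because the counts are accumulated before the ratios are taken, the dataset-level IOU and F1-score are tied by IOU = F1 / (2 - F1). For example, the UNET++ checkpoint-ensemble F1-score of 0.632 in Table 1 corresponds to 0.632 / 1.368 ≈ 0.462, which matches the reported IOU.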

3. Training

In this section, we describe our training procedure, augmentations used, and
the different testing pipelines.
To train both architectures, we trained the full model with a constant learn-
ing rate of 1e-4 using the ADAM [38] optimizer and Dice loss, which has shown
great promise for medical image segmentation [39]. All models are trained
for 200 epochs with a 10-epoch early stopping patience on the validation loss.
All the models are configured for an input with dimensions 448×448 and with
3 channels. During training, we first crop a 448×448 patch from the image
(zero-padded if required), with random cropping done with a probability of
0.30 and cropping over a mask region with a probability of 0.70. Further color
and spatial augmentations are performed before finally feeding it to the model.
We also oversampled our dataset so that the number of training samples was
1.5 times the number of images in the training dataset. The output from the
model is a 448×448 probability mask, which is processed further to get the final
predicted mask. When performing validation during training, we employed a
similar scheme as in training for crop generation, but no further augmentations
were applied. We also saved the best model checkpoint within every 25-epoch
window. Since all models get trained for 200 epochs, we have 8 models from
training on a single fold, giving us 32 saved models in total across five folds.
These saved models are then ensembled with the best model during test-time,
similar to the work done in [40].
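
Two steps of the training procedure above lend themselves to a short sketch: choosing where to cut the 448×448 training crop, and picking one checkpoint per 25-epoch window. Neither function is taken from the code used in this work. The names are hypothetical, centering the crop on a randomly chosen annotated pixel is one possible reading of "cropping over a mask region", and selecting the checkpoint with the lowest validation loss in each window is an assumption about what "best" means.

```python
import random


def sample_crop_origin(height, width, mask_pixels, size=448, p_mask=0.70, rng=random):
    """Pick the top-left corner of a size x size training crop.

    With probability p_mask the crop is centred on a random annotated pixel
    (clamped to the image); otherwise the corner is drawn uniformly.
    Images smaller than the crop get origin 0 and are zero-padded elsewhere.
    """
    max_y, max_x = max(height - size, 0), max(width - size, 0)
    if mask_pixels and rng.random() < p_mask:
        cy, cx = rng.choice(mask_pixels)
        y = min(max(cy - size // 2, 0), max_y)
        x = min(max(cx - size // 2, 0), max_x)
    else:
        y, x = rng.randint(0, max_y), rng.randint(0, max_x)
    return y, x


def best_epoch_per_window(val_losses, window=25):
    """Return the epoch index with the lowest validation loss in each window."""
    best = []
    for start in range(0, len(val_losses), window):
        chunk = val_losses[start:start + window]
        best.append(start + chunk.index(min(chunk)))
    return best


# Toy usage: force a mask-centred crop and select checkpoints from fake losses.
origin = sample_crop_origin(768, 1024, [(700, 900)], p_mask=1.0,
                            rng=random.Random(0))
losses = [1.0 / (epoch + 1) + 0.01 * ((epoch * 7) % 5) for epoch in range(200)]
checkpoints = best_epoch_per_window(losses)
```

For a 200-epoch run, `best_epoch_per_window` returns eight epoch indices, one per window, which is the per-fold checkpoint count stated above.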
Testing a single set of trained weights on the hold-out test set involved
extracting overlapping patches of dimensions 448×448 from the image, with each
patch having an overlap of 50 pixels across both the height and width
dimensions. These crops are then passed to the model, which outputs probability
masks of dimensions 448×448 for each crop. We also implemented another
testing pipeline where Test Time Augmentation (TTA) is performed on each
cropped patch. The augmentations used are all possible combinations of no-flip
and horizontal-flip along with rotations of 0°, 90°, 180°, and 270° giving us 8
augmentations and hence 8 augmented crops. No other spatial or color-based
augmentations were performed during testing. These sets of augmented crops
are then passed to the model giving us 8 probability masks per crop from the
image. The probability masks are de-augmented and averaged together to get
the resulting probability mask. These probability masks are then patched back
together with the appropriate weighting over overlapping regions to get the full
probability mask having the same dimensions as the input image. We then
have five probability masks, one from each model trained on one of the five
folds. These are averaged to get the final probability mask for the image. A
threshold of 0.5 is applied to the probability mask giving the final predicted
mask. Finally, we created a testing pipeline that ensembles the checkpointed
models by averaging the probability masks from each saved model and the best
overall model. In this third pipeline, no TTA was done, and inputs to each
model in the ensemble were processed using the same above-mentioned crop-
based approach. This lets us fuse the generous predictions of the initial windows
with the more precise predictions of the final ones. As in the previous pipelines,
a threshold of 0.5 is applied to get the final predicted mask.
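
The test-time steps above (overlapping tiles, the eight flip and rotation augmentations, de-augmentation, averaging, and thresholding) can be sketched in plain Python on nested lists. This is an illustration under stated assumptions, not the pipeline used in this work: all function names are hypothetical, `model` stands for any callable that maps a 2D patch to a probability mask of the same shape, the last tile is shifted back to the image border (so it may overlap its neighbor by more than 50 pixels), and pixels with probability exactly 0.5 are counted as foreground.

```python
def hflip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def rot90(img, k):
    """Rotate a 2D list counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        img = [list(col) for col in zip(*img)][::-1]
    return img


def augment(img, flip, k):
    """Optional horizontal flip followed by k quarter turns."""
    return rot90(hflip(img) if flip else img, k)


def deaugment(img, flip, k):
    """Undo augment(): rotate back first, then undo the flip."""
    img = rot90(img, -k)
    return hflip(img) if flip else img


def tta_predict(model, patch):
    """Average the de-augmented predictions over all 8 augmentations."""
    height, width = len(patch), len(patch[0])
    total = [[0.0] * width for _ in range(height)]
    for flip in (False, True):
        for k in range(4):
            prob = deaugment(model(augment(patch, flip, k)), flip, k)
            for r in range(height):
                for c in range(width):
                    total[r][c] += prob[r][c]
    return [[v / 8 for v in row] for row in total]


def binarize(prob, threshold=0.5):
    """Threshold a probability mask (>= is an assumption for ties)."""
    return [[1 if v >= threshold else 0 for v in row] for row in prob]


def tile_origins(length, size=448, overlap=50):
    """Start offsets of overlapping tiles along one image dimension."""
    if length <= size:
        return [0]
    stride = size - overlap
    origins = list(range(0, length - size, stride))
    origins.append(length - size)  # last tile is flush with the border
    return origins


# Toy usage with an identity "model" standing in for a trained network.
toy_patch = [[0.9, 0.1], [0.2, 0.8]]
toy_mask = binarize(tta_predict(lambda p: p, toy_patch))
```

Rotating back before un-flipping matters: the two operations do not commute, so the inverse has to apply them in the reverse order of `augment`.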

4. Dataset availability

The globule masks and images for the non-public and public datasets used
are publicly available and can be accessed at
https://scholarsmine.mst.edu/research_data/10/.

5. Hardware and Software configuration

Our models were trained on an Intel(R) Xeon(R) Silver 4110 CPU (2.10 GHz)
with 64 GB of RAM, along with an NVIDIA Quadro P4000 GPU with 8 GB of
memory. The models were constructed using the PyTorch [41] library as
implemented in [35].

6. Results

A discussion of the results obtained after training and testing with the
different pipelines is given below. We compare and tabulate the results from the
different architectures.
For blob-based evaluation, true-positive, false-positive, and false-negative
blobs are extracted using three different intersection thresholds described in
section 2.3. The intersection percentages were 25%, 50%, and 75%, respectively.
Once these have been calculated, we find the blob-based precision, recall, and
F1-scores. Table 1 shows the blob-based and pixel-based scores for nevi for
both architectures with the different testing pipelines mentioned in section 3.
Fig. 3 shows two cases with high pixel-based IOU scores (0.73 and 0.70) on our
best pipeline: UNET++ with checkpoint ensembles, showing correct globule
detection. Fig. 4 shows the worst globule detection results. Fig. 5 shows an
image with no globules present, and none found.
Since the model was not trained on melanoma images, we would not be
able to analyze the differences in the results if we treated the testing datasets
(nevi and melanoma) as a single entity; hence, we present them separately. The
results for testing on small melanomas can be seen in Table 2. Comparing
these scores with those in Table 1, we see that the blob-based recalls are higher
for nevi compared to melanoma and blob-based precisions for melanoma are
higher than for nevi. This can be attributed to the fact that the models have
never been trained on melanoma images, but when they do predict globules,
they are usually correct (higher precision) but fail to get all the blobs marked
in the ground truth (lower recall). An example can be seen in Fig. 2.

Figure 3: Two cases with the highest pixel-based IOU scores. The images on the
right show the lesion overlaid with both the predicted mask and the ground
truth mask. In Figs. 3, 4 and 5, the blue regions are false-positive pixels, the
green regions are false-negative pixels, and the blue-green regions are
true-positive pixels.

Figure 4: Image overlays for the 4 cases with the lowest pixel-based IOU scores. The upper
left image shows faint false-negative blobs (which the DL model failed to detect). The other
3 images show false-positives for images where annotators did not mark faint blobs.

Figure 5: Image with no dots or globules found, which equals ground truth.

Architecture                   UNET                      UNET++
Testing pipeline        P1      P2      P3        P1      P2      P3

Pixel-based
  IOU                   0.443   0.449   0.458     0.453   0.459   0.462
  Precision             0.676   0.667   0.659     0.641   0.634   0.640
  Recall                0.563   0.579   0.601     0.607   0.623   0.624
  F1-score              0.614   0.620   0.628     0.623   0.629   0.632

Blob-based, IT = 25%
  Precision             0.537   0.555   0.577     0.569   0.583   0.595
  Recall                0.856   0.851   0.844     0.849   0.841   0.838
  F1-score              0.660   0.672   0.685     0.682   0.689   0.696

Blob-based, IT = 50%
  Precision             0.502   0.525   0.544     0.538   0.555   0.565
  Recall                0.793   0.783   0.779     0.791   0.781   0.776
  F1-score              0.615   0.629   0.641     0.641   0.649   0.654

Blob-based, IT = 75%
  Precision             0.349   0.381   0.389     0.390   0.421   0.414
  Recall                0.582   0.588   0.576     0.597   0.597   0.585
  F1-score              0.436   0.462   0.465     0.472   0.494   0.485

Table 1: Pixel-based and blob-based scores for UNET and UNET++ with different test-time
approaches. The models were trained on an all-nevi dataset. The testing pipelines were: P1
(without Test Time Augmentation (TTA) or Checkpoint-ensemble approach (CE)), P2 (only
TTA), and P3 (only CE). Results are for testing on nevi images. IT stands for intersection
threshold. The highest value for each architecture is in bold, and the highest values across
both architectures are underlined.

Architecture                   UNET                      UNET++
Testing pipeline        P1      P2      P3        P1      P2      P3

Pixel-based
  IOU                   0.393   0.401   0.397     0.385   0.392   0.392
  Precision             0.551   0.544   0.539     0.504   0.500   0.507
  Recall                0.578   0.605   0.602     0.619   0.645   0.634
  F1-score              0.564   0.574   0.568     0.556   0.563   0.564

Blob-based, IT = 25%
  Precision             0.660   0.690   0.701     0.694   0.713   0.724
  Recall                0.590   0.583   0.566     0.563   0.550   0.554
  F1-score              0.623   0.632   0.626     0.544   0.621   0.628

Blob-based, IT = 50%
  Precision             0.558   0.588   0.600     0.604   0.635   0.633
  Recall                0.510   0.505   0.492     0.496   0.490   0.487
  F1-score              0.533   0.544   0.540     0.544   0.553   0.551

Blob-based, IT = 75%
  Precision             0.366   0.411   0.408     0.425   0.468   0.443
  Recall                0.353   0.367   0.347     0.363   0.370   0.356
  F1-score              0.360   0.388   0.375     0.392   0.413   0.395

Table 2: Pixel-based and blob-based scores for UNET and UNET++ with different test-time
approaches. The models were trained on an all-nevi dataset. The testing pipelines were:
P1 (without Test Time Augmentation (TTA) or Checkpoint-ensemble approach (CE)), P2
(only TTA), and P3 (only CE). Results are for testing on melanoma images. IT stands for
intersection threshold. The highest value for each architecture is in bold, and the highest
values across both architectures are underlined.

7. Discussion

This study demonstrates that precisely annotated ground truth maps en-
able high accuracy for deep learning despite the subjective nature of dot and
globule assessment. This segmentation accuracy can be achieved with a rela-
tively small number (539) of ground truth masks. We demonstrate this accuracy
with two metrics: pixel-based and blob-based. All dermoscopic structures
suffer from disagreement among experts. We propose the blob-based metric as a
model to better assess whether a structure has been correctly identified and
approximately located. Dots and globules have a kappa statistic for interobserver
agreement of 0.33 (95% confidence interval 0.32-0.34) [29]. Our 50% intersection
blob-based dot-globule F1-score of 0.654 is about twice the dot-globule
interobserver agreement score, indicating that our deep learning model for dot and
globule detection shows moderate agreement with the ground truth. By
ignoring pixel counting discrepancies resulting from minor shape differences, we
better model small structures whose detection is critical within a hierarchical
lesion classification pipeline. Our blob-based metric is but one possible biolog-
ically inspired metric [42] and shows how such tasks can inspire new metrics
for performance evaluation of subjective modeling tasks. The construction of
task-specific metrics – like the blob-based metrics presented here – is crucial
to proper model assessment. Dots and globules comprise a continuum with no
precise limit distinguishing the structures. Therefore, they are processed as a
single class. Pigmented lesion classification suffers from high intra-class variabil-
ity and low inter-class variability. Therefore, this research focuses on a single
class, benign nevi, for detecting these structures. This research can be incorpo-
rated into a proposed deep learning pipeline that leverages clinical features used
by dermatologists. This approach can overcome the black box nature of deep
learning and can be used to present a more convincing case for deep learning
models for diagnosis, as well as training, in a clinical setting. The structure-
based model exploits algorithms used by clinicians to improve the accuracy of
melanoma screening. These pipelines can be further analyzed using analytic
methods like Grad-Cam [43] and saliency maps [44]. These clinical features can
also be used within deep learning explainability paradigms such as TCAV [45] or
used by machine learning practitioners for an explainable classification pipeline.

8. Conclusion and Future work

In this section, we discuss some aspects of future work that we intend to
pursue and provide some ideas on the continuation of the current work.
Prior to the advent of deep learning models for medical computer vision, re-
searchers built highly specialized feature detection/extraction algorithms to find
clinically relevant features which would then be leveraged for lesion diagnosis.
These tasks can then be incorporated into a pipeline for fully automated lesion
diagnosis. Existing feature-based methods for skin lesion diagnosis
rely on global features, which include color, shape, texture and other (usually
handcrafted) information either from the entire dermoscopic image or just the
lesion region [46, 47, 48]. Some, like Benyahia et al. [49], have extracted the
activations of intermediate layers in deep learning models (deep learning features)
for classification. Hagerty et al. [50] improved diagnostic accuracy by fusing
deep learning and handcrafted features. Yap et al. [51] used metadata informa-
tion like age, gender, and lesion location and showed improvement when used
with deep learning models. Similar multimodal schemes were used for breast
cancer screening by Calisto et al. [52, 17].
Whole-image deep learning diagnostic models lack explainability. Explainability
is especially desired in a medical application involving a critical
decision—whether to biopsy a lesion. Detection and display of an irregular
globule, a visible structure associated with melanoma, provides explainability
without resorting to an interpretable model [53, 54]. Multiple researchers have
noted the difficulty in using interpretable models in the medical domain. The
current study, in providing automatic globule detection, provides a step toward
improving explainability through structure detection. Our precisely annotated masks
can also be used for training deep learning models.

To address these challenges, we present an annotated database and a deep
learning method that detects dots and globules with high accuracy. Another
objective of the proposed work is to off-load the tedious job of annotating clinical
features to deep learning models. We also intend to continue research into
biologically inspired metrics like the blob-based metric proposed here, which
can improve understanding of model performance on subjective databases. We
will explore fuzzy-logic-based metrics and other metrics, such as those in [55],
that can handle multiple or subjective ground truth databases. Ground
truth development will proceed for melanomas, which along with our current
database can potentially improve melanoma detection accuracy, especially for
small melanomas.
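
As a rough illustration of what a blob-level evaluation can look like, the sketch below labels connected components in a predicted mask and a ground-truth mask, then counts a blob as matched when it overlaps any blob in the other mask by at least one pixel. The function names, the 4-connectivity, and the one-pixel overlap rule are assumptions made for this example only; they are not the definition of the blob-based metric used in this work.

```python
from collections import deque


def label_blobs(mask):
    """Label 4-connected foreground blobs in a binary 2D mask (list of lists).

    Returns (labels, count), where labels has the same shape as mask and
    holds 0 for background or a blob id in 1..count.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                # Breadth-first flood fill over the 4-neighbourhood.
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def blob_precision_recall(pred, truth):
    """Blob-level precision and recall under an assumed one-pixel overlap rule."""
    pred_labels, n_pred = label_blobs(pred)
    truth_labels, n_truth = label_blobs(truth)
    hit_pred, hit_truth = set(), set()
    for r in range(len(pred)):
        for c in range(len(pred[0])):
            if pred_labels[r][c] and truth_labels[r][c]:
                hit_pred.add(pred_labels[r][c])
                hit_truth.add(truth_labels[r][c])
    precision = len(hit_pred) / n_pred if n_pred else 0.0
    recall = len(hit_truth) / n_truth if n_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy masks: two annotated blobs, two predicted blobs, one of which overlaps.
    truth = [[1, 1, 0, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 0, 0, 1, 1],
             [0, 0, 0, 0, 0]]
    pred = [[0, 1, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [1, 0, 0, 0, 0]]
    print(blob_precision_recall(pred, truth))
```

Counting whole blobs rather than pixels makes the score less sensitive to small boundary disagreements between annotators, which is one reason such metrics may suit subjective ground truth. A stricter variant could require a minimum overlap fraction instead of a single pixel.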

Acknowledgment

This work was supported in part by the National Institutes of Health (NIH)
under Grants SBIR R43 CA153927-01 and CA101639-02A2. Its contents are
solely the responsibility of the authors
and do not necessarily represent the official views of the NIH.

CRediT Authorship Statement


Anand K. Nambisan: 0000-0003-4565-4609
Norsang Lama: 0000-0002-3580-5736
Jason Hagerty: NA
Colin Smith: NA
Ahmad Rajeh: NA
Thanh Phan: NA
Samantha Swinfard: NA
Binita Lama: NA
Gehana Patel: NA
William V. Stoecker: 0000-0003-4863-3483
Ronald J. Stanley: 0000-0003-0477-3388

Declaration of interests
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.

References

[1] R. L. Siegel, K. D. Miller, H. E. Fuchs, A. Jemal, Cancer statistics,
2022, CA: A Cancer Journal for Clinicians 72 (1) (2022) 7–33.
arXiv:https://acsjournals.onlinelibrary.wiley.com/doi/pdf/10.
3322/caac.21708, doi:https://doi.org/10.3322/caac.21708.
URL https://acsjournals.onlinelibrary.wiley.com/doi/abs/10.
3322/caac.21708

[2] A. M. Noone, N. Howlader, M. Krapcho, D. Miller, A. Brest, M. Yu,
J. Ruhl, Z. Tatalovich, A. Mariotto, D. R. Lewis, H. S. Chen, E. J. Feuer,
K. A. Cronin (eds), Seer cancer statistics review, 1975-2015, Tech. rep., National
Cancer Institute, Bethesda, MD (April 2018).

[3] L. Rahib, M. R. Wehner, L. M. Matrisian, K. T. Nead, Estimated Projec-
tion of US Cancer Incidence and Death to 2040, JAMA Network Open 4 (4)
(2021) e214708–e214708. arXiv:https://jamanetwork.com/journals/
jamanetworkopen/articlepdf/2778204/rahib_2021_oi_210166_
1617121223.53101.pdf, doi:10.1001/jamanetworkopen.2021.4708.
URL https://doi.org/10.1001/jamanetworkopen.2021.4708

[4] A. Madooei, M. S. Drew, Incorporating colour information for computer-aided
diagnosis of melanoma from dermoscopy images: A retrospective
survey and critical analysis, International journal of biomedical imaging
2016.

[5] A. Adegun, S. Viriri, Deep learning techniques for skin lesion analysis
and melanoma cancer detection: a survey of state-of-the-art, Vol. 54 (2),
Springer Netherlands, 2021. doi:10.1007/s10462-020-09865-y.
URL https://doi.org/10.1007/s10462-020-09865-y

[6] D. Gutman, N. C. F. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra,
A. Halpern, Skin lesion analysis toward melanoma detection: A challenge
at the international symposium on biomedical imaging (ISBI) 2016, hosted
by the international skin imaging collaboration (ISIC), arXiv preprint
arXiv:1605.01397.

[7] P. Tschandl, C. Rosendahl, H. Kittler, The HAM10000 dataset, a large
collection of multi-source dermatoscopic images of common pigmented skin
lesions, Scientific data 5 (2018) 180161.

[8] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman,
B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, H. Kittler,
A. Halpern, Skin Lesion Analysis Toward Melanoma Detection 2018: A
Challenge Hosted by the International Skin Imaging Collaboration (ISIC)
(2019). arXiv:1902.03368.

[9] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W.
Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, A. Halpern, Skin
lesion analysis toward melanoma detection: A challenge at the 2017 Inter-
national symposium on biomedical imaging (ISBI), hosted by the interna-
tional skin imaging collaboration (ISIC), in: Proceedings - International
Symposium on Biomedical Imaging, Vol. 2018-April, IEEE, 2018, pp. 168–
172. arXiv:1710.05006, doi:10.1109/ISBI.2018.8363547.

[10] M. Combalia, N. C. F. Codella, V. Rotemberg, B. Helba, V. Vilaplana,
O. Reiter, C. Carrera, A. Barreiro, A. C. Halpern, S. Puig, J. Malvehy,
BCN20000: Dermoscopic Lesions in the Wild (2019). arXiv:1908.02288.

[11] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. Marcal, J. Rozeira,
PH2 - a dermoscopic image database for research and benchmarking, in: 2013
35th annual international conference of the IEEE engineering in medicine
and biology society (EMBC), IEEE, 2013, pp. 5437–5440.

[12] G. Argenziano, H. Soyer, V. De Giorgi, D. Piccolo, P. Carli, M. Delfino,
et al., Dermoscopy: a tutorial, EDRA, Medical Publishing & New Media
16.

[13] M. A. Kassem, K. M. Hosny, R. Damaševičius, M. M. Eltoukhy, Machine
learning and deep learning methods for skin lesion classification and diag-
nosis: A systematic review, Diagnostics 11 (8) (2021) 1390.

[14] C. Barata, M. E. Celebi, J. S. Marques, A Survey of Feature Extraction
in Dermoscopy Image Analysis of Skin Cancer, IEEE Journal of Biomedi-
cal and Health Informatics 23 (3) (2019) 1096–1109. doi:10.1109/JBHI.
2018.2845939.

[15] B. Cassidy, C. Kendrick, A. Brodzicki, J. Jaworek-Korjakowska,
M. H. Yap, Analysis of the isic image datasets: Usage, benchmarks
and recommendations, Medical Image Analysis 75 (2022) 102305.
doi:https://doi.org/10.1016/j.media.2021.102305.
URL https://www.sciencedirect.com/science/article/pii/
S1361841521003509

[16] F. M. Calisto, A. Ferreira, J. C. Nascimento, D. Gonçalves, Towards touch-based
medical image diagnosis annotation, in: Proceedings of the 2017
ACM International Conference on Interactive Surfaces and Spaces, ISS
’17, Association for Computing Machinery, New York, NY, USA, 2017, p.
390–395. doi:10.1145/3132272.3134111.
URL https://doi.org/10.1145/3132272.3134111

[17] F. M. Calisto, C. Santiago, N. Nunes, J. C. Nascimento, Introduction of
human-centric ai assistant to aid radiologists for multimodal breast image
classification, International Journal of Human-Computer Studies 150
(2021) 102607. doi:https://doi.org/10.1016/j.ijhcs.2021.102607.
URL https://www.sciencedirect.com/science/article/pii/
S1071581921000252

[18] F. M. Calisto, C. Santiago, N. Nunes, J. C. Nascimento, Breastscreening-ai:
Evaluating medical intelligent agents for human-ai interactions,
Artificial Intelligence in Medicine 127 (2022) 102285.
doi:https://doi.org/10.1016/j.artmed.2022.102285.
URL https://www.sciencedirect.com/science/article/pii/
S0933365722000501

[19] S. Mocellin, D. Nitti, Cutaneous melanoma in situ: translational evidence
from a large population-based study, The oncologist 16 (6) (2011) 896–903.

[20] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,
S. Thrun, Dermatologist-level classification of skin cancer with deep neural
networks, Nature 542 (7639) (2017) 115–118. doi:10.1038/nature21056.

[21] L. K. Ferris, J. A. Harkes, B. Gilbert, D. G. Winger, K. Golubets, O. Akilov,
M. Satyanarayanan, Computer-aided classification of melanocytic lesions
using dermoscopic images, Journal of the American Academy of Dermatol-
ogy 73 (5) (2015) 769–776.

[22] M. A. Marchetti, N. C. Codella, S. W. Dusza, D. A. Gutman, B. Helba,
A. Kalloo, N. Mishra, C. Carrera, M. E. Celebi, J. L. DeFazio, N. Jaimes,
A. A. Marghoob, E. Quigley, A. Scope, O. Yélamos, A. C. Halpern, Results
of the 2016 international skin imaging collaboration international symposium
on biomedical imaging challenge: Comparison of the accuracy of
computer algorithms to dermatologists for the diagnosis of melanoma from
dermoscopic images, Journal of the American Academy of Dermatology
78 (2) (2018) 270–277.e1. doi:10.1016/j.jaad.2017.08.016.
URL https://doi.org/10.1016/j.jaad.2017.08.016

[23] H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum,
A. Kalloo, A. B. H. Hassen, L. Thomas, A. Enk, et al., Man against
machine: diagnostic performance of a deep learning convolutional neural
network for dermoscopic melanoma recognition in comparison to 58 der-
matologists, Annals of oncology 29 (8) (2018) 1836–1842.

[24] P. Tschandl, N. Codella, B. N. Akay, G. Argenziano, R. P. Braun, H. Cabo,
D. Gutman, A. Halpern, B. Helba, R. Hofmann-Wellenhof, A. Lallas,
J. Lapins, C. Longo, J. Malvehy, M. A. Marchetti, A. Marghoob, S. Men-
zies, A. Oakley, J. Paoli, S. Puig, C. Rinner, C. Rosendahl, A. Scope,
C. Sinz, H. P. Soyer, L. Thomas, I. Zalaudek, H. Kittler, Comparison
of the accuracy of human readers versus machine-learning algorithms for
pigmented skin lesion classification: an open, web-based, international,
diagnostic study, The lancet oncology 20 (7) (2019) 938–947.

[25] H. Pehamberger, M. Binder, A. Steiner, K. Wolff, In vivo epiluminescence
microscopy: improvement of early diagnosis of melanoma, Journal of Investigative
Dermatology 100 (3) (1993) S356–S362.

[26] A. R. Pereira, M. Corral-Forteza, H. Collgros, M.-A. El Sharouni, P. M.
Ferguson, R. A. Scolyer, P. Guitera, Dermoscopic features and screening
strategies for the detection of small-diameter melanomas, Clinical and Ex-
perimental Dermatology.

[27] J. Xu, K. Gupta, W. V. Stoecker, Y. Krishnamurthy, H. S. Rabinovitz,
A. Bangert, D. Calcara, M. Oliviero, J. M. Malters, R. Drugge, R. J.
Stanley, R. H. Moss, M. E. Celebi, Analysis of globule types in malignant
melanoma, Archives of dermatology 145 (11) (2009) 1245–1251.

[28] I. Maglogiannis, K. K. Delibasis, Enhancing classification accuracy utilizing
globules and dots features in digital dermoscopy, Computer methods and
programs in biomedicine 118 (2) (2015) 124–133.

[29] G. Argenziano, H. P. Soyer, S. Chimenti, R. Talamini, R. Corona, F. Sera,
M. Binder, L. Cerroni, G. D. Rosa, G. Ferrara, R. Hofmann-Wellenhof,
M. Landthaler, S. W. Menzies, H. Pehamberger, D. Piccolo, H. S. Rabi-
novitz, R. Schiffner, S. Staibano, W. Stolz, I. Bartenjev, A. Blum, R. Braun,
H. Cabo, P. Carli, V. D. Giorgi, M. G. Fleming, J. M. Grichnik, C. M.
Grin, A. C. Halpern, R. Johr, B. Katz, R. O. Kenet, H. Kittler, J. Kreusch,
J. Malvehy, G. Mazzocchetti, M. Oliviero, F. Ozdemir, K. Peris, R. Perotti,
A. Perusquia, M. A. Pizzichetta, S. Puig, B. Rao, P. Rubegni, T. Saida,
M. Scalvenzi, S. Seidenari, I. Stanganelli, M. Tanaka, K. Westerhoff, I. H.
Wolf, O. Braun-Falco, H. Kerl, T. Nishikawa, K. Wolff, A. W. Kopf, Der-
moscopy of pigmented skin lesions: results of a consensus meeting via the
Internet, Journal of the American Academy of Dermatology 48 (5) (2003)
679–693.

[30] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for
biomedical image segmentation, in: International Conference on Medical
image computing and computer-assisted intervention, Springer, 2015, pp.
234–241.

[31] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A
nested u-net architecture for medical image segmentation, in: Deep learning
in medical image analysis and multimodal learning for clinical decision
support, Springer, 2018, pp. 3–11.

[32] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings
of the IEEE conference on computer vision and pattern recognition, 2018,
pp. 7132–7141.

[33] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transfor-
mations for deep neural networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017, pp. 1492–1500.

[34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recog-
nition, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 770–778.

[35] P. Yakubovskiy, Segmentation models pytorch, GitHub Repos.

[36] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn,
A. Zisserman, The Pascal Visual Object Classes Challenge: A Retrospec-
tive, International Journal of Computer Vision 111 (1) (2015) 98–136.

[37] Y. Xu, T. Wu, F. Gao, J. R. Charlton, K. M. Bennett, Improved small
blob detection in 3D images using jointly constrained deep learning and
Hessian analysis, Scientific Reports 10 (1) (2020) 1–12. doi:10.1038/
s41598-019-57223-y.
URL http://dx.doi.org/10.1038/s41598-019-57223-y

[38] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv
preprint arXiv:1412.6980.

[39] R. Zhao, B. Qian, X. Zhang, Y. Li, R. Wei, Y. Liu, Y. Pan, Rethinking
Dice Loss for Medical Image Segmentation, in: 2020 IEEE International
Conference on Data Mining (ICDM), 2020, pp. 851–860. doi:10.1109/
ICDM50108.2020.00094.

[40] H. Chen, S. Lundberg, S.-I. Lee, Checkpoint ensembles: Ensemble methods
from a single training process, arXiv preprint arXiv:1710.03282.

[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance
Deep Learning Library, in: H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in
Neural Information Processing Systems, Vol. 32, Curran Associates, Inc.,
2019.
URL https://proceedings.neurips.cc/paper/2019/file/
bdbca288fee7f92f2bfa9f7012727740-Paper.pdf

[42] S. Sabbaghi Mahmouei, M. Aldeen, W. V. Stoecker, R. Garnavi, Biologically
Inspired QuadTree Color Detection in Dermoscopy Images of
Melanoma, IEEE Journal of Biomedical and Health Informatics 23 (2)
(2019) 570–577. doi:10.1109/JBHI.2018.2841428.

[43] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra,
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based
Localization, Proceedings of the IEEE International Conference on Computer
Vision 2017-October (2017) 618–626. doi:10.1109/ICCV.2017.74.

[44] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks:
Visualising image classification models and saliency maps, 2nd International
Conference on Learning Representations, ICLR 2014 - Workshop
Track Proceedings (2014) 1–8. arXiv:1312.6034.

[45] A. Lucieri, M. N. Bajwa, S. Alexander Braun, M. I. Malik, A. Dengel,
S. Ahmed, On Interpretability of Deep Learning based Skin Lesion
Classifiers using Concept Activation Vectors, Proceedings of the International
Joint Conference on Neural Networks. arXiv:2005.02000,
doi:10.1109/IJCNN48605.2020.9206946.

[46] M. A. Sheha, M. S. Mabrouk, A. Sharawy, et al., Automatic detection
of melanoma skin cancer using texture analysis, International Journal of
Computer Applications 42 (20) (2012) 22–26.

[47] T. Saba, M. A. Khan, A. Rehman, S. L. Marie-Sainte, Region extraction
and classification of skin cancer: A heterogeneous framework of deep cnn
features fusion and reduction, Journal of medical systems 43 (9) (2019)
1–19.

[48] M. Nasir, M. Attique Khan, M. Sharif, I. U. Lali, T. Saba, T. Iqbal,
An improved strategy for skin lesion detection and classification us-
ing uniform segmentation and feature selection based approach, Mi-
croscopy Research and Technique 81 (6) (2018) 528–543. arXiv:https:
//analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdf/
10.1002/jemt.23009, doi:https://doi.org/10.1002/jemt.23009.
URL https://analyticalsciencejournals.onlinelibrary.wiley.
com/doi/abs/10.1002/jemt.23009

[49] S. Benyahia, B. Meftah, O. Lézoray, Multi-features extraction based on
deep learning for skin lesion classification, Tissue and Cell 74 (2022)
101701. doi:https://doi.org/10.1016/j.tice.2021.101701.
URL https://www.sciencedirect.com/science/article/pii/
S0040816621002172

[50] J. R. Hagerty, R. J. Stanley, H. A. Almubarak, N. Lama, R. Kasmi, P. Guo,
R. J. Drugge, H. S. Rabinovitz, M. Oliviero, W. V. Stoecker,
Deep learning and handcrafted method fusion: Higher diagnostic accuracy
for melanoma dermoscopy images, IEEE Journal of Biomedical and Health
Informatics 23 (4) (2019) 1385–1391. doi:10.1109/JBHI.2019.2891049.

[51] J. Yap, W. Yolland, P. Tschandl, Multimodal skin lesion classification
using deep learning, Experimental Dermatology 27 (11) (2018) 1261–
1267. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/
exd.13777, doi:https://doi.org/10.1111/exd.13777.
URL https://onlinelibrary.wiley.com/doi/abs/10.1111/exd.13777

[52] F. M. Calisto, N. Nunes, J. C. Nascimento, Breastscreening: On the
use of multi-modality in medical imaging diagnosis, in: Proceedings of
the International Conference on Advanced Visual Interfaces, AVI ’20, As-
sociation for Computing Machinery, New York, NY, USA, 2020. doi:
10.1145/3399715.3399744.
URL https://doi.org/10.1145/3399715.3399744

[53] N. Burkart, M. F. Huber, A survey on the explainability of supervised
machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–
317.

[54] C. Molnar, Interpretable machine learning. A Guide to Making Black Box
Models Explainable, Leanpub, Munich, Germany, 2022.

[55] A. A. Taha, A. Hanbury, Metrics for evaluating 3D medical image segmentation:
analysis, selection, and tool, BMC Medical Imaging 15 (1) (2015)
29. doi:10.1186/s12880-015-0068-x.
URL https://doi.org/10.1186/s12880-015-0068-x
