PII: S2667-3053(22)00063-1
DOI: https://doi.org/10.1016/j.iswa.2022.200126
Reference: ISWA 200126
Please cite this article as: Anand K. Nambisan , Norsang Lama , Thanh Phan , Samantha Swinfard ,
Binita Lama , Colin Smith , Ahmad Rajeh , Gehana Patel , Jason Hagerty , William V. Stoecker ,
Ronald J. Stanley , Deep Learning-based Dot and Globule Segmentation with Pixel and
Blob-based Metrics for Evaluation, Intelligent Systems with Applications (2022), doi:
https://doi.org/10.1016/j.iswa.2022.200126
Abstract
Deep learning (DL) applied to whole dermoscopic images has shown unprecedented accuracy in differentiating images of melanoma from benign lesions. We hypothesize that accuracy in whole-image deep learning suffers because whole-lesion analysis lacks an evaluation of dermoscopic structures. DL also suffers a "black box" characterization because it offers only probabilities to the physician and no visible structures. We propose the detection of structures called dots and globules as a means to improve precision in melanoma detection. We compare two encoder-decoder architectures to detect dots and globules: UNET vs. UNET++. For each of these architectures, we compare three pipelines: with test-time augmentation (TTA), without TTA, and without TTA but with checkpoint ensembles. We use an SE-RESNEXT encoder and a symmetric decoder. The pixel-based F1-scores for globule and dot detection based on the UNET++ and UNET techniques with checkpoint ensembles were found to be 0.632 and 0.628, respectively.
∗Corresponding author
Email addresses: akn36d@mst.edu (Anand K. Nambisan), nlbft@mst.edu (Norsang Lama), tupzg6@health.missouri.edu (Thanh Phan), slsbng@mst.edu (Samantha Swinfard), binny.lama8@gmail.com (Binita Lama), colinsmith@atsu.edu (Colin Smith), ahmadrjeh@gmail.com (Ahmad Rajeh), gnpv5k@missouri.com (Gehana Patel), hagerty.jason@gmail.com (Jason Hagerty), wvs@mst.edu (William V. Stoecker), stanleyj@mst.edu (Ronald J. Stanley)
1. Introduction
Invasive melanoma is a form of skin cancer with 99,780 new cases estimated
in the USA in 2022 [1]. The chances of survival are high if melanoma is diagnosed
early, as noted by Noone et al. in [2]. Despite this, projections by Rahib et al.
[3] show that the number of melanoma cases will more than double by 2040,
becoming the second most prevalent form of cancer by then. This underscores the importance of research into raising awareness of, and enabling the early diagnosis of, skin lesions.
In this section, we provide an overview of the state of current research in
relation to our present work. We also briefly discuss developments in related
medical and machine learning fields.
In the domain of computer-aided diagnosis (CAD) for digital dermoscopy,
researchers have focused on a set of tasks to analyze skin lesions for diagnosis.
These tasks can be broadly categorized as image enhancement or pre-processing,
lesion segmentation, feature extraction, and finally lesion diagnosis or classification [4, 5].
The biggest and most used public datasets for machine learning-based skin
lesion research are the ISIC datasets, which were first released at the International Symposium on Biomedical Imaging (ISBI) 2016 by the International Skin Imaging Collaboration (ISIC) [6]. This was accompanied by a challenge
involving a set of tasks to push CAD-based skin research further. Since then,
there have been four more iterations of the challenge, each with a set of skin
lesion-related tasks and a dataset [7, 8, 9, 10]. Many other datasets have also
been used in CAD research. Some like PH2 [11] are small and have only 200
dermoscopic images with three diagnosis classes: common nevi, atypical nevi,
and melanoma. Other datasets include the interactive atlas of dermoscopy [12], which has more than 1000 clinical cases with clinical images, dermoscopy images, histopathology results, and level of difficulty; it was created as a means to train medical personnel. Kassem et al. [13] provide a succinct review of
current machine learning and deep learning approaches using various datasets
to highlight some of the prevalent challenges in CAD-based skin lesion research.
Adegun et al. [5] in their survey on deep learning techniques in CAD focused on
the ISIC 2018 and ISIC 2019 datasets. They conclude that model ensembling,
image pre-processing and lesion segmentation improve results for skin lesion
classification.
Segmentation of dermatological features has been repeatedly highlighted as
one of the more difficult lesion tasks. Codella et al. have mentioned this in
both the post-challenge reviews of the ISIC 2017 [9] and ISIC 2018 challenges
[8]. Barata et al. [14] provided a comprehensive survey of feature extraction
in skin lesion image analysis, including clinically inspired features like negative
networks, dots, globules, etc. They found that the time consumed for annotation is the prime reason such features remain underexplored for image analysis. This has also hindered the incorporation of clinical features into deep learning-based feature segmentation. Recent work done by Cassidy et al.
[15] used a curated combination of multiple ISIC datasets along with in-depth
data cleaning strategies and provided benchmarks on multiple test sets. This
was done to highlight the biases that occur due to noise and other artifacts in the
assessment of the lesion diagnosis classification task. Work has also been done
to alleviate the difficulties in creating annotations for tasks in the healthcare
domain. Calisto et al. [16] have proposed touch-based interfaces for medical annotation aimed at radiologists to help during patient follow-ups. Calisto et al. [17, 18] show that the integration of AI techniques improves workflow efficiency and reduces work-related cognitive fatigue. Such interfaces can be introduced into the clinic to ease the annotation and data collection process across medical disciplines. This would then make it easier to standardize and collect data for later downstream machine learning tasks and statistical analysis.
In this section, we focus on the clinical features of interest: dots and globules. We provide information and context to support their importance in skin lesion diagnosis.
Early diagnosis of melanoma, particularly at the in situ stage, yields the
best prognosis [19]. However, many cases of melanoma, especially early in situ
melanomas, are missed by domain experts [20, 21, 22]. Machine vision techniques incorporating deep learning (DL) have shown higher diagnostic accuracy than dermatologists can achieve [23, 24]. However, the accuracy of DL methods has not been proven when applied to small-diameter melanomas. "Black dots" and "brown globules" were among the earliest dermoscopic melanoma structures detected and are still considered critical for diagnosis [25]. But these structures are found in both benign and malignant lesions, so there is a need to characterize dots and globules precisely in order to use them for melanoma discrimination.
These structures may be most useful for discriminating tiny melanomas from
benign mimics. Regio Pereira et al. found irregular dots and globules in 76.5% of small melanomas; these were among the most discriminatory features for these melanomas [26]. If dots and globules can be precisely delineated, their features can be used to predict melanoma. Xu et al. found that large globules and varying globule sizes and shapes had the highest odds ratio for melanoma [27]. Maglogiannis and Delibasis reported automatic dot and globule detection using a multi-resolution inverse non-linear diffusion approach and found that features from the detected structures increased diagnostic accuracy by 6%, primarily by increasing specificity (true-negative rate) [28]. The study showed the potential of globules, but conclusions from this study and other studies such as [14] are limited because they analyze a limited number of lesions from non-public databases and lack specific metrics for assessing the detected structures.

Figure 1: Nevus (left) with dots and globules marked (right) in the 2018 ISIC dataset.
The 2017 and 2018 ISIC challenges [9, 8] provide a globule database using superpixel-based ground truth annotations that include extraneous areas besides globules. These masks include dermoscopic features but do not delineate them precisely. Inexact masks do not allow determination of feature-specific information such as the number of instances of a feature within the lesion, or the variances in shape, structure, and color between instances of the feature across the lesion. Once extracted, these features can be used for other downstream tasks (explainability or classification). An example of an ISIC globule annotation is shown in Fig. 1.
This research develops precise dot and globule masks and presents a DL technique for detecting these masks automatically. We also present a blob-based metric to best ascertain model detection accuracy. The remaining sections are: 2 Methodology, 3 Training, 4 Dataset Availability, 5 Hardware and Software Configuration, 6 Results, 7 Discussion, and 8 Conclusion and Future Work.
2. Methodology

In this section, we describe the dataset, the models, and the evaluation metrics used to assess model performance.

2.1. Dataset
To create the dataset we selected 539 images, with 501 from the ISIC 2019
dataset [9], [10], [7] and 38 from a multi-clinic study supported by NIH grants
SBIR R43 CA153927-01 and CA101639-02A2. We opted to use the ISIC 2019
dataset as this dataset also has metadata information, and we believe this can
lead to the development of multimodal approaches in future work. A researcher, under the supervision of a dermatologist (W.V.S.), marked all regions in the images that contained either a dot or a globule, as defined by a consensus conference [29]. We used a broad definition of dots and globules so that the
model can extract the features and their mimics across diagnoses. Dots and
globules are dark-brown, black, or gray structures with fairly sharp borders,
often roughly round but sometimes irregular.
We split the dataset into 65 percent training set and 35 percent test set.
This gave us 381 images in the training/validation set, of which 358 were unique
images, and the rest were either the same images with more than one annotation
mask (which occurs for duplicate images in the image sets) or images with the
same lesion but acquired under slightly different conditions. This means that
the hold-out test set had 158 images, all of which were unique. The training/validation dataset is then split into five sets of train and validation sets for 5-fold cross-validation. During the splitting into folds, we ensure that there are no duplicates (the same image or the same lesion) across any pair of train and validation sets: all duplicates are moved into the training set, leaving only unique images in the validation dataset. To show the models' generalization capabilities, we also tested them on a set of 160 small melanomas. We did this to determine whether the models detect similar structures across the classes. Since dots and globules are important in separating melanoma from benign nevi, we want our models to be able to detect similar structures in melanomas.
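The duplicate-aware fold assignment described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the `lesion_of` mapping, and the round-robin assignment are our own assumptions; the key property is that any image with a duplicate (same image or same lesion) always lands in the training side of every split.

```python
import random
from collections import defaultdict

def duplicate_aware_folds(image_ids, lesion_of, n_folds=5, seed=0):
    """Yield (train, val) splits in which no image or lesion appears on
    both sides: any lesion with more than one image is kept entirely in
    training, so validation folds contain only unique images/lesions."""
    groups = defaultdict(list)
    for img in image_ids:
        groups[lesion_of[img]].append(img)  # group duplicates by lesion

    # Images whose lesion has duplicates are pinned to the training set.
    always_train = [im for ims in groups.values() if len(ims) > 1 for im in ims]
    singles = sorted(im for ims in groups.values() if len(ims) == 1 for im in ims)
    random.Random(seed).shuffle(singles)

    for k in range(n_folds):
        val = [im for i, im in enumerate(singles) if i % n_folds == k]
        train = always_train + [im for i, im in enumerate(singles) if i % n_folds != k]
        yield train, val
```

With this scheme, each validation fold is disjoint from its training set at both the image and lesion level, matching the leakage guarantee described in the text.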
2.2. Models
For globule segmentation, we use two variants of the UNET [30] architecture, both with the same encoder network but with different decoder and skip connection structures. The two architectures are the usual UNET architecture, as in [30], and the UNET++ architecture [31]. The UNET++ architecture is a modification of the UNET architecture in which each encoder level is connected to the corresponding decoder level via nested dense convolutions. The architecture also has multiple segmentation branches, each originating from a different level of the encoder network. Two variants of the UNET++ architecture exist, based on how the outputs from the segmentation branches are processed: a fast mode, where the final decoder segmentation output is selected, and an accurate mode, where all the segmentation masks are averaged to generate a final mask. We use the fast mode in our work. The encoder is an SE-ResNeXt50-32×4d network [32], which incorporates the squeeze-and-excitation module to learn relationships across the convolutional feature channels. The model is based on the ResNeXt50 [33] model, which expands upon and modifies the blocks in the ResNet model [34] along a new "cardinality" dimension, which in this particular case can be simplified using the notation of grouped convolutions. The decoder is a symmetric decoder based on the encoder, the output mask requirement, and the number of layers required, using the implementation in [35].
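The squeeze-and-excitation reweighting used by the SE-ResNeXt encoder can be illustrated in a few lines of numpy. This is a conceptual sketch only: the function name is ours, the weights `w1`/`w2` stand in for learned fully connected parameters, and the real module operates on PyTorch tensors inside each residual block.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation channel reweighting (illustrative).

    x:  feature map of shape (C, H, W)
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights,
        where r is the channel-reduction ratio.
    """
    z = x.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)              # excitation: FC + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # FC + sigmoid -> per-channel gate (C,)
    return x * s[:, None, None]              # rescale each feature channel

# Example with C=8 channels and reduction ratio r=4 (random stand-in weights)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
y = squeeze_excite(x, w1, w2)
assert y.shape == x.shape
```

The gate values lie in (0, 1), so the module learns to suppress or emphasize whole channels based on the global content of the feature map, which is the cross-channel relationship the text refers to.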
2.3. Evaluation metrics

In related tasks, such as the feature segmentation tasks in the ISIC 2018 and 2017 challenges, the metric used to evaluate the predicted feature masks was pixel-based IOU (Jaccard) [8, 9]. The Jaccard metric worked well for superpixel-based ground truth masks but is not indicative of performance on our precise dataset. An example of a case where pixel-based IOU is a poor metric for dot and globule segmentation can be seen in Fig. 2.
For the PASCAL VOC challenge [36], the intersection over union (IOU) of the predicted bounding boxes with the true boxes is the criterion used for scoring detections. We adapt this overlap criterion to blobs, defining the blob-level true positives, false positives, and false negatives as:
Figure 2: Image where the models localize well, but blobs are too generous. The pixel-based IOU, precision, recall, and F1-scores were 0.0947, 0.994, 0.0947, and 0.173, respectively. The blob-based precision, recall, and F1-scores with an intersection threshold of 25% were 0.25, 1.0, and 0.4, respectively, indicative of good localization.
TP = \sum_{i=1}^{I} \sum_{j=1}^{N_i} f\left( |b_{j,i} \cap G_i| > T_1 \times |b_{j,i}| \right), \quad (1)

FP = \sum_{i=1}^{I} \sum_{j=1}^{N_i} f\left( |b_{j,i} \cap G_i| \le T_1 \times |b_{j,i}| \right), \quad (2)

FN = \sum_{i=1}^{I} \sum_{k=1}^{M_i} f\left( |g_{k,i} \cap P_i| \le T_2 \times |g_{k,i}| \right), \quad (3)

f(\text{statement}) = \begin{cases} 1, & \text{if statement is true} \\ 0, & \text{otherwise} \end{cases} \quad (4)
where |b| denotes the number of pixels in the blob b, I is the total number of images in the test set, N_i is the total number of blobs in the predicted mask (P_i) for the i-th image, and M_i is the total number of blobs in the ground truth mask (G_i) for the i-th image. b_{j,i} is the j-th blob in P_i and g_{k,i} is the k-th blob in G_i. Here T_1 and T_2 specify the intersection percentage thresholds (intersection thresholds) for considering a blob as true-positive (or false-positive) and false-negative, respectively, and fall in the open interval (0, 1). The thresholds T_1 and T_2 can also be interpreted as thresholds on a predicted blob's precision and a ground truth blob's recall, respectively. In our calculations, we set T_1 = T_2 = T for simplicity. We can use TP, FP, and FN to calculate precision, recall, and the F1-score. A similar approach that relies on centroid distances between blobs instead of overlap was used by Xu et al. [37]; this accounted for ground truth given as blob centroid coordinates rather than blob masks.
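Equations (1)-(3) can be implemented directly once each mask has been decomposed into its connected blobs. The sketch below is ours, not the authors' code: blobs are represented as sets of pixel coordinates (in practice they would come from a connected-component labeling step, e.g. `scipy.ndimage.label`).

```python
def blob_confusion(pred_blobs, gt_blobs, t1=0.25, t2=0.25):
    """Blob-level TP/FP/FN counts for one image, following Eqs. (1)-(3).

    pred_blobs, gt_blobs: lists of sets of (row, col) pixel coordinates,
    one set per connected blob in the predicted / ground truth mask.
    """
    gt_mask = set().union(*gt_blobs) if gt_blobs else set()      # G_i as pixels
    pred_mask = set().union(*pred_blobs) if pred_blobs else set()  # P_i as pixels

    # A predicted blob b is TP when |b ∩ G| > t1 * |b|, else FP (Eqs. 1-2).
    tp = sum(len(b & gt_mask) > t1 * len(b) for b in pred_blobs)
    fp = len(pred_blobs) - tp
    # A ground truth blob g is FN when |g ∩ P| <= t2 * |g| (Eq. 3).
    fn = sum(len(g & pred_mask) <= t2 * len(g) for g in gt_blobs)
    return tp, fp, fn
```

Accumulating these counts over all test images and applying the usual definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2PR/(P+R)) yields the blob-based scores reported in Tables 1 and 2.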
We calculated these blob-based metrics at three intersection thresholds: 25%, 50%, and 75% (T = 0.25, 0.5, and 0.75). Our analysis showed that F1-scores are relatively constant up to an intersection threshold of 25% and then fall monotonically as the required overlap percentage increases. The results are shown in Tables 1 and 2.
We also calculated pixel-based metrics on the test dataset to assess segmentation quality. The pixel-based scores are calculated by accumulating the true-positive, false-positive, true-negative, and false-negative pixels across the entire dataset and applying the usual definitions of precision, recall, and F1-score, as in the blob-based case above.
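The pixel-level accumulation just described (micro-averaging over the whole test set, rather than averaging per-image scores) can be sketched as below; the function name and mask representation are our own.

```python
import numpy as np

def dataset_pixel_scores(pred_masks, gt_masks):
    """Accumulate pixel TP/FP/FN over a whole test set, then compute
    micro-averaged precision, recall, and F1-score."""
    tp = fp = fn = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = pred.astype(bool)
        gt = gt.astype(bool)
        tp += np.sum(pred & gt)      # correctly predicted globule pixels
        fp += np.sum(pred & ~gt)     # predicted pixels not in ground truth
        fn += np.sum(~pred & gt)     # ground truth pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging weights every pixel equally, so large lesions contribute more to the score than small ones; this matches the accumulation described in the text.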
3. Training

During testing, each image is divided into 448×448 crops, with each patch having an overlap of 50 pixels across both the height and width dimensions. These crops are then passed to the model, which outputs probability masks of dimensions 448×448 for each crop. We also implemented another
testing pipeline where Test Time Augmentation (TTA) is performed on each
cropped patch. The augmentations used are all possible combinations of no-flip and horizontal-flip along with rotations of 0°, 90°, 180°, and 270°, giving us 8 augmentations and hence 8 augmented crops. No other spatial or color-based augmentations were performed during testing. These sets of augmented crops are then passed to the model, giving us 8 probability masks per crop from the image. The probability masks are de-augmented and averaged together to get
the resulting probability mask. These probability masks are then patched back
together with the appropriate weighting over overlapping regions to get the full
probability mask having the same dimensions as the input image. We then
have five probability masks, one from each model trained on one of the five
folds. These are averaged to get the final probability mask for the image. A
threshold of 0.5 is applied to the probability mask giving the final predicted
mask. Finally, we created a testing pipeline that ensembles the checkpointed models by averaging the probability masks from each saved model and the best overall model. In this third pipeline, no TTA was done, and inputs to each model in the ensemble were processed using the same crop-based approach described above. This lets us fuse the more generous predictions of the earlier checkpoints with the more precise predictions of the later ones. As in the previous pipelines, a threshold of 0.5 is applied to get the final predicted mask.
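The 8-augmentation TTA step (flip × four rotations, with de-augmentation before averaging) can be sketched as follows. This is an illustrative numpy version under our own naming; `model` stands for any callable mapping a square patch to a same-shaped probability mask.

```python
import numpy as np

def tta_predict(model, patch):
    """Average model outputs over the 8 dihedral augmentations
    (no-flip / horizontal-flip x rotations of 0/90/180/270 degrees),
    inverting each augmentation before averaging."""
    preds = []
    for flip in (False, True):
        for k in range(4):
            aug = np.rot90(np.fliplr(patch) if flip else patch, k)
            out = model(aug)                               # probability mask for crop
            out = np.rot90(out, -k)                        # undo the rotation
            preds.append(np.fliplr(out) if flip else out)  # undo the flip
    return np.mean(preds, axis=0)
```

Because each prediction is mapped back to the original orientation before averaging, an identity model returns the input patch unchanged, which is a convenient sanity check for the de-augmentation logic.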
4. Dataset availability
The globule masks and images for the non-public and public datasets used are publicly available and can be accessed at https://scholarsmine.mst.edu/research_data/10/.
5. Hardware and Software configuration
Our models were trained on an Intel(R) Xeon(R) Silver 4110 CPU (2.10 GHz) with 64 GB of RAM, along with an NVIDIA Quadro P4000 GPU with 8 GB of RAM. The models were constructed using the PyTorch [41] library, as implemented in [35].
6. Results
A discussion of the results obtained after training and testing with the different pipelines is given below. We compare and tabulate the results from the different architectures.
For blob-based evaluation, true-positive, false-positive, and false-negative blobs are extracted using the three intersection thresholds described in section 2.3: 25%, 50%, and 75%.
Once these have been calculated, we find the blob-based precision, recall, and
F1-scores. Table 1 shows the blob-based and pixel-based scores for nevi for
both architectures with the different testing pipelines mentioned in section 3.
Fig. 3 shows two cases with high pixel-based IOU scores (0.73 and 0.70) on our
best pipeline: UNET++ with checkpoint ensembles, showing correct globule
detection. Fig. 4 shows the worst globule detection results. Fig. 5 shows an
image with no globules present, and none found.
Since the model was not trained on melanoma images, we would not be
able to analyze the differences in the results if we treated the testing datasets
(nevi and melanoma) as a single entity, hence we present them separately. The results for testing on small melanomas can be seen in Table 2. Comparing these scores with those in Table 1, we see that the blob-based recalls are higher for nevi than for melanoma, while the blob-based precisions for melanoma are higher than for nevi. This can be attributed to the fact that the models have never been trained on melanoma images: when they do predict globules, they are usually correct (higher precision) but fail to find all the blobs marked in the ground truth (lower recall). An example can be seen in Fig. 2.
Figure 3: The two cases with the highest pixel-based IOU scores. The images on the right show the lesion overlaid with both the predicted mask and the ground truth mask. Blue regions are false-positive pixels, green regions are false-negative pixels, and blue-green regions are true-positive pixels; this color key applies to Figs. 3, 4, and 5.
Figure 4: Image overlays for the 4 cases with the lowest pixel-based IOU scores. The upper-left image shows faint false-negative blobs (which the DL model failed to detect). The other 3 images show false positives on images where annotators did not mark faint blobs.

Figure 5: Image with no dots or globules found, matching the ground truth.
                          UNET                    UNET++
Testing pipeline     P1     P2     P3       P1     P2     P3
Pixel-based
  IOU              0.443  0.449  0.458    0.453  0.459  0.462
  Precision        0.676  0.667  0.659    0.641  0.634  0.640
  Recall           0.563  0.579  0.601    0.607  0.623  0.624
  F1-score         0.614  0.620  0.628    0.623  0.629  0.632
Blob-based, IT = 25%
  Precision        0.537  0.555  0.577    0.569  0.583  0.595
  Recall           0.856  0.851  0.844    0.849  0.841  0.838
  F1-score         0.660  0.672  0.685    0.682  0.689  0.696
Blob-based, IT = 50%
  Precision        0.502  0.525  0.544    0.538  0.555  0.565
  Recall           0.793  0.783  0.779    0.791  0.781  0.776
  F1-score         0.615  0.629  0.641    0.641  0.649  0.654
Blob-based, IT = 75%
  Precision        0.349  0.381  0.389    0.390  0.421  0.414
  Recall           0.582  0.588  0.576    0.597  0.597  0.585
  F1-score         0.436  0.462  0.465    0.472  0.494  0.485

Table 1: Pixel-based and blob-based scores for UNET and UNET++ with different test-time approaches, for testing on nevi images. The models were trained on an all-nevi dataset. The testing pipelines were: P1 (neither Test Time Augmentation (TTA) nor the checkpoint-ensemble approach (CE)), P2 (TTA only), and P3 (CE only). IT stands for intersection threshold.
                          UNET                    UNET++
Testing pipeline     P1     P2     P3       P1     P2     P3
Pixel-based
  IOU              0.393  0.401  0.397    0.385  0.392  0.392
  Precision        0.551  0.544  0.539    0.504  0.500  0.507
  Recall           0.578  0.605  0.602    0.619  0.645  0.634
  F1-score         0.564  0.574  0.568    0.556  0.563  0.564
Blob-based, IT = 25%
  Precision        0.660  0.690  0.701    0.694  0.713  0.724
  Recall           0.590  0.583  0.566    0.563  0.550  0.554
  F1-score         0.623  0.632  0.626    0.622  0.621  0.628
Blob-based, IT = 50%
  Precision        0.558  0.588  0.600    0.604  0.635  0.633
  Recall           0.510  0.505  0.492    0.496  0.490  0.487
  F1-score         0.533  0.544  0.540    0.544  0.553  0.551
Blob-based, IT = 75%
  Precision        0.366  0.411  0.408    0.425  0.468  0.443
  Recall           0.353  0.367  0.347    0.363  0.370  0.356
  F1-score         0.360  0.388  0.375    0.392  0.413  0.395

Table 2: Pixel-based and blob-based scores for UNET and UNET++ with different test-time approaches, for testing on melanoma images. The models were trained on an all-nevi dataset. The testing pipelines were: P1 (neither Test Time Augmentation (TTA) nor the checkpoint-ensemble approach (CE)), P2 (TTA only), and P3 (CE only). IT stands for intersection threshold.
7. Discussion
This study demonstrates that precisely annotated ground truth maps enable high accuracy for deep learning despite the subjective nature of dot and globule assessment. This segmentation accuracy can be achieved with a relatively small number (539) of ground truth masks. We demonstrate this accuracy with two metrics: pixel-based and blob-based. All dermoscopic structures suffer from disagreement among experts. We propose the blob-based metric as a model to better assess whether a structure has been correctly identified and approximately located. Dots and globules have a kappa statistic for interobserver agreement of 0.33 (95% confidence interval 0.32-0.34) [29]. Our 50% intersection blob-based dot-globule F1-score of 0.654 is over twice the dot-globule interobserver agreement score, indicating that our deep learning model for dot and globule detection shows moderate agreement with the ground truth. By ignoring pixel-counting discrepancies resulting from minor shape differences, we better model small structures whose detection is critical within a hierarchical lesion classification pipeline. Our blob-based metric is but one possible biologically inspired metric [42] and shows how such tasks can inspire new metrics for performance evaluation of subjective modeling tasks. The construction of task-specific metrics, like the blob-based metrics presented here, is crucial to proper model assessment. Dots and globules comprise a continuum with no precise limit distinguishing the structures; therefore, they are processed as a single class. Pigmented lesion classification suffers from high intra-class variability and low inter-class variability. Therefore, this research focuses on a single class, benign nevi, for detecting these structures. This research can be incorporated into a proposed deep learning pipeline that leverages clinical features used by dermatologists. This approach can overcome the black-box nature of deep learning and can be used to present a more convincing case for deep learning models for diagnosis, as well as training, in a clinical setting. The structure-based model exploits algorithms used by clinicians to improve the accuracy of melanoma screening. These pipelines can be further analyzed using analytic methods like Grad-CAM [43] and saliency maps [44]. These clinical features can also be used within deep learning explainability paradigms such as TCAV [45], or used by machine learning practitioners for an explainable classification pipeline.
8. Conclusion and future work

To address these challenges, we present an annotated database and a deep learning method that detects dots and globules with high accuracy. Another objective of the proposed work is to off-load the tedious job of annotating clinical features to deep learning models. We also intend to continue research into biologically inspired metrics, like the blob-based metric proposed here, which can improve understanding of model performance on subjective databases. We will explore fuzzy-logic-based metrics and other metrics, such as [55], which can handle multiple or subjective ground truth databases. Ground truth development will proceed for melanomas, which, along with our current database, can potentially improve melanoma detection accuracy, especially for small melanomas.
Acknowledgment
This work was supported in part by the National Institutes of Health (NIH) under Grants SBIR R43 CA153927-01 and CA101639-02A2. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
Declaration of interests
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
References
[3] L. Rahib, M. R. Wehner, L. M. Matrisian, K. T. Nead, Estimated Projection of US Cancer Incidence and Death to 2040, JAMA Network Open 4 (4) (2021) e214708. doi:10.1001/jamanetworkopen.2021.4708.
URL https://doi.org/10.1001/jamanetworkopen.2021.4708
[5] A. Adegun, S. Viriri, Deep learning techniques for skin lesion analysis
and melanoma cancer detection: a survey of state-of-the-art, Vol. 54 (2),
Springer Netherlands, 2021. doi:10.1007/s10462-020-09865-y.
URL https://doi.org/10.1007/s10462-020-09865-y
lesion analysis toward melanoma detection: A challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC), in: Proceedings - International Symposium on Biomedical Imaging, Vol. 2018-April, IEEE, 2018, pp. 168–172. arXiv:1710.05006, doi:10.1109/ISBI.2018.8363547.
[16] F. M. Calisto, A. Ferreira, J. C. Nascimento, D. Gonçalves, Towards touch-based medical image diagnosis annotation, in: Proceedings of the 2017 ACM International Conference on Interactive Surfaces and Spaces, ISS '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 390–395. doi:10.1145/3132272.3134111.
URL https://doi.org/10.1145/3132272.3134111
A. A. Marghoob, E. Quigley, A. Scope, O. Yélamos, A. C. Halpern, Results of the 2016 international skin imaging collaboration international symposium on biomedical imaging challenge: Comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images, Journal of the American Academy of Dermatology 78 (2) (2018) 270–277.e1. doi:10.1016/j.jaad.2017.08.016.
URL https://doi.org/10.1016/j.jaad.2017.08.016
Stanley, R. H. Moss, M. E. Celebi, Analysis of globule types in malignant
melanoma, Archives of dermatology 145 (11) (2009) 1245–1251.
[33] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
[34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
URL https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
[48] M. Nasir, M. Attique Khan, M. Sharif, I. U. Lali, T. Saba, T. Iqbal, An improved strategy for skin lesion detection and classification using uniform segmentation and feature selection based approach, Microscopy Research and Technique 81 (6) (2018) 528–543. doi:10.1002/jemt.23009.
URL https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/jemt.23009
[53] N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317.