© 2020 The Authors. British Journal of Dermatology British Journal of Dermatology (2020) 183, pp423–430 423
published by John Wiley & Sons Ltd on behalf of British Association of Dermatologists
This is an open access article under the terms of the Creative Commons Attribution License, which permits use,
distribution and reproduction in any medium, provided the original work is properly cited.
424 What is AI? Applications of artificial intelligence to dermatology, X. Du-Harpur et al.
[Figure: schematic of training and testing — an algorithm learns from labelled training data (‘malignant’, ‘benign’) and then predicts labels for unlabelled test data.]
Figure 2 Schematic of a receiver operating characteristic (ROC) curve, a way of visualizing a trained model’s sensitivity and specificity. Typically, machine learning studies will use ROC curves and calculations of the area under the curve (AUC or AUROC) to quantify accuracy. The dashed line represents the desired perfect performance, when sensitivity and specificity are both 100%; in this scenario, the AUC would be 1·0. In reality, there is a trade-off between sensitivity and specificity, which gives rise to a curve.
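For readers who want to see how the curve in Figure 2 is actually constructed, the sketch below computes ROC points and the AUC for a small set of hypothetical predictions. The labels, scores and the helper name `roc_points` are illustrative assumptions, not data from any study cited here; the AUC is computed via its rank interpretation (the probability that a randomly chosen malignant lesion scores higher than a benign one).

```python
import numpy as np

# Hypothetical ground-truth labels (1 = malignant) and model-predicted
# probabilities for ten test lesions -- illustrative values only.
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.4, 0.5, 0.3, 0.2, 0.15, 0.1])

def roc_points(y_true, y_score):
    """Sweep the decision threshold from high to low and record the
    (sensitivity, specificity) pair obtained at each threshold."""
    points = []
    for t in np.sort(np.unique(y_score))[::-1]:
        pred = (y_score >= t)
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        points.append((tp / (tp + fn), tn / (tn + fp)))
    return points

# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen malignant lesion scores higher than a benign one.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc = np.mean([p > n for p in pos for n in neg])
print(f"AUC = {auc:.2f}")  # AUC = 0.96
```

Sweeping the threshold downward traces the curve from high specificity/low sensitivity towards high sensitivity/low specificity, which is exactly the trade-off the caption describes.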
computing power was a major limitation in being able to train them effectively. However, in 2013 it was recognized that graphical processing units (GPUs), originally designed for three-dimensional graphics in computer games, could be repurposed to power the repetitive training required for neural networks.4,5 Of note, convolutional neural networks (CNNs) are a specific form of deep learning architecture that have proven effective for the classification of image data.

CNNs have massively increased in popularity as a method for computer-based image classification after the victory of the GPU-powered CNN AlexNet in 2012, which won the ImageNet competition with a top-5 error rate of 15·3%, a remarkable 10% improvement on the next best competitor.5

In the past few years, use of CNNs in classification tasks has exploded due to demonstrable and consistently superior efficacy and availability. Novel CNN architectures have been developed, improved and made available for public use by institutions with a high level of expertise and computational resources; examples of these include ‘Inception’ by Google and ‘ResNet’ by Microsoft. These architectures can be accessed using software such as TensorFlow (developed by Google) or PyTorch (developed by Facebook) and then trained further for a specific purpose or used in a novel application. A common approach would be to take a pretrained image recognition network architecture such as Inception, and specialize its application by inputting a specific type of image data. This process is referred to as transfer learning.

The application of convolutional deep learning in dermatology

Classifying data using CNNs is now relatively accessible, computationally efficient and inexpensive, hence the explosion in so-called ‘artificial intelligence’. In medicine to date, the main areas of application have been the visual diagnostic specialties of dermatology, radiology and pathology. Automating aspects of dermatology with computer-aided image classification has been attempted for over 30 years;6–8 however, previous efforts have achieved only limited accuracy. Although attempts have been made in recent years to use neural networks to diagnose or monitor inflammatory dermatoses,9–11 these have generally not been as successful or impressive as the networks constructed to diagnose skin lesions, particularly
melanoma. Melanoma is therefore the focus of the remainder of this review, and Table S1 (see Supporting Information) summarizes these head-to-head comparison studies.12–21

In 2017, Esteva et al. published a landmark study in Nature that was notable for being the first to compare a neural network’s performance against dermatologists.14 They used a pretrained GoogLeNet Inception v3 architecture and fine-tuned the network (transfer learning) using a dataset of 127 463 clinical and dermoscopic images of skin lesions (subsequent studies have shown it is possible to train networks on significantly smaller datasets, numbering in the thousands). For testing, they selected a subset of clinical and dermoscopic images confirmed with biopsy and asked over 20 dermatologists for their treatment decisions. Dermatologists were presented with 265 clinical images and 111 dermoscopic images of ‘keratinocytic’ or ‘melanocytic’ nature, and asked whether they would: (i) advise biopsy or further treatment or (ii) reassure the patient. They inferred a ‘malignant’ or ‘benign’ diagnosis from these management decisions, and then plotted the dermatologists’ performance on the network’s ROC curves with regards to classifying the keratinocytic or melanocytic lesions (which were subdivided as dermoscopic or clinical) as ‘benign’ or ‘malignant’ (Figure 4a). In both ‘keratinocytic’ and ‘melanocytic’ categories, the average dermatologist performed at a level below the CNN ROC curves, with only one individual dermatologist performing better than the CNN ROC curve in each category. This suggests that, in the context of this study, the CNN has superior accuracy to dermatologists.

A recently published large study detailed in two papers by Brinker et al.19,20 involved training a ‘ResNet’ model on the publicly available International Skin Imaging Collaboration (ISIC) database,22 which contains in excess of 20 000 labelled dermoscopic images that are required to meet some basic quality standards. This network was trained on over 12 000 images to perform two tasks: the first was to classify dermoscopic images of melanocytic lesions as benign or malignant (Figure 4b), and the second was to classify clinical images of melanocytic lesions as benign or malignant (Figure 4c). The dermatologists were assessed using 200 test images, with the decision requested mirroring that of the study of Esteva et al.: to biopsy/treat or to reassure. Additionally, the dermatologists’ demographic data, such as experience and training level, were requested.

The method used to quantify the relative performance also consisted of drawing a mean ROC curve by calculating the average predicted class probability for each test image (Figure 4b, c). The dermatologists’ performance for the same set of images was then plotted on the ROC curve. Barring a few individual exceptions, the dermatologists’ performance fell below the CNN ROC curves in both the clinical and dermoscopic image classifications. The authors also used a second approach, whereby they set the sensitivity of the CNN at the level of the attending dermatologists, and compared the mean specificity achieved at equivalent sensitivity. In the dermoscopic test, at a sensitivity of 74·1%, the dermatologists’ specificity was 60% whereas the CNN achieved a superior 86·5%.

As part of an international effort to produce technology for early melanoma diagnosis, in 2016 an annual challenge was established to test the performance of machine learning algorithms using the image database from the ISIC.22 A recent paper by Tschandl et al.21 summarizes the performance of the most recent competition in August to September 2018, and
Figure 4 Receiver operating characteristic (ROC) curves from studies by Esteva et al.,14 Brinker et al.19,20 and Tschandl et al.21 Most often, the
dermatologists’ comparative ROC curves are plotted as individual data points. Lying below the curve means that their sensitivity and specificity,
and therefore accuracy, are considered inferior to those of the model in the study. The studies all demonstrate that, on average, dermatologists sit
below the ROC curve of the machine learning algorithm. It is noticeable that the performance of the clinicians in Brinker’s studies (b, c), for
example, is inferior to that of the clinicians in the Esteva study (a). Although there is a greater spread of clinical experience in the Brinker studies,
the discrepancy could also be related to how the clinicians were tested. In both Brinker’s and Tschandl’s studies, some individual data points represent performance that is significantly lower than real-world data would suggest, which could indicate that the assessments may be biased against clinicians. AUC, area under the curve; CNN, convolutional neural network. All figures are reproduced with permission of
the copyright holders.
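One of the comparison methods used in these studies — fixing the model’s sensitivity at the clinicians’ level and reading off the specificity it achieves there — can be illustrated with a short sketch. The data and the helper name `specificity_at_sensitivity` are hypothetical; this is a simplified illustration of the idea, not the exact procedure of any cited study.

```python
import numpy as np

# Hypothetical ground truth (1 = melanoma) and model scores -- illustrative only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.95, 0.7, 0.65, 0.3, 0.8, 0.4, 0.35, 0.2, 0.1, 0.05])

def specificity_at_sensitivity(y_true, y_score, target_sens):
    """Lower the decision threshold until sensitivity reaches the target,
    then report the specificity achieved at that threshold."""
    for t in np.sort(np.unique(y_score))[::-1]:
        pred = y_score >= t
        sens = np.mean(pred[y_true == 1])
        spec = np.mean(~pred[y_true == 0])
        if sens >= target_sens:
            return sens, spec
    # The lowest threshold labels every lesion malignant (sensitivity 1.0),
    # so any target <= 1.0 is always reached inside the loop.
    return 1.0, 0.0

sens, spec = specificity_at_sensitivity(y_true, y_score, target_sens=0.75)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.75, specificity = 0.83
```

Matching sensitivity in this way makes the two specificities directly comparable, which is why it complements the visual ROC comparison.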
also compares the performance of the submitted algorithms against 511 human readers recruited from the World Dermoscopy Congress, who comprised a mixture of board-certified dermatologists, dermatology residents and general practitioners (Figure 4d). Test batches of 30 images were generated to compare the groups, with a choice of seven diagnoses as multiple-choice questions provided. When comparing all 139 algorithms against all dermatologists, dermatologists on average achieved 17 out of 30 on the image multiple-choice questions, whereas the algorithms on average achieved 19. As expected, years of experience improved the probability of making a correct diagnosis. Regardless, the top three algorithms in the challenge outperformed even experts with > 10 years of experience, and the ROC curves of these top three algorithms sit well above the average performance of the human readers.

Key biases, limitations and risks of automated skin lesion classification

Given that, remarkably, all of the published studies indicate superiority of machine learning algorithms over dermatologists, it is worth exploring the biases commonly found in these study designs. These can be categorized into biases that favour the networks and biases that disadvantage clinicians. With regards to the first category, it is first worth noting that in the studies described, the neural networks were generally trained and tested on the same dataset. This closed-loop system of training and testing highlights a common limitation within machine learning called ‘generalizability’. On the occasions that generalizability has been tested, neural networks have often been found lacking. For example, Han et al. released their neural network, which was a Microsoft ResNet-152 architecture trained on nearly 20 000 skin lesion images from a variety of sources, as a web application.15 When Navarrete-Dechent et al. tested the network on data from the ISIC dataset, which the network had not previously been exposed to, its performance dropped from a reported area under the curve of 0·91 to achieving the correct diagnosis in only 29 out of 100 lesions, which would imply a far lower area under the curve.23 As algorithms are fundamentally a reflection of their training data, if the input image dataset is biased in some way, this will have a direct impact on algorithmic performance, which will only be apparent when the algorithms are tested on completely separate datasets.

Another important limitation of the methodology used to compare AI models with dermatologists is that ROC curves, although a useful visual representation of sensitivity and specificity, do not address other important clinical risks. For example, in order to capture more melanomas (increased sensitivity), the algorithm may misclassify more benign naevi as malignant (false positives). This could potentially lead to unnecessary biopsies for patients, which aside from patient harm would create additional demand on an already burdened healthcare system. There is evidence that dermatologists have better ‘number needed to biopsy’ metrics for melanoma in comparison with nondermatologists.24 The reporting of number needed to biopsy would be a useful addition to studies such as that of Esteva et al.,14 as it would aid in the estimation of potential patient and health economic impact.

It is also worth noting that these datasets are retrospectively collated and repurposed for image classification training; this means that the images captured may not be representative in terms of the proportion of diagnoses, or in terms of having typical features. As neural networks are essentially a reflection of their labelled data input, this will undoubtedly have consequences on how they perform. However, given the lack of
‘real-world’ studies, it is difficult to know how significant this is. When it comes to assessing clinicians using images from these datasets, this may also introduce an element of bias that disadvantages clinicians too, as lesions that were deemed worthy of capturing via photograph or being biopsied may not be representative of the lesion type. As a result, the diagnostic sensitivity of clinicians may be lower than in a normal clinic. This hypothesis for the discrepancy in diagnostic accuracy was borne out in a recent Cochrane review, where the diagnostic sensitivity of dermatologists examining melanocytic lesions with dermoscopy was 92%,25 which is significantly higher than typically found in neural network studies. For example, in Tschandl et al.’s web-based study of 511 clinicians, the sensi-

The AI-integrated health service of the future?

There are attempts to deploy ‘AI’ technologies within the healthcare space within two main scenarios: direct to consumer or public, and as a decision aid for clinicians. The direct-to-consumer model already exists in some fashion; there are smartphone apps such as SkinVision, which enable individuals to assess and track their skin lesions. However, currently such apps do not make accountable diagnoses and usually explicitly state in their terms and conditions that they do not provide a diagnostic service, and do not intend to replace or substitute visits to healthcare providers. At present, it is not yet clear what the benefits and risks of such
dermatology, this very much holds true. Technology adoption could improve clinical pathways, and enable our neediest patients to access dermatology services more efficiently. It is unlikely that they will threaten our profession; in reality they represent an opportunity for personal learning, service improvement and leadership that could be transformative for our future healthcare system.

References

1 Turing AM. Computing machinery and intelligence. Mind 1950; LIX:433–60.
2 LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521:436–44.
15 Han SS, Kim MS, Lim W et al. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol 2018; 138:1529–38.
16 Rezvantalab A, Safigholi H, Karimijeshni S. Dermatologist level dermoscopy skin cancer classification using different deep learning convolutional neural networks algorithms. Available at: https://arxiv.org/ftp/arxiv/papers/1810/1810.10348.pdf (last accessed 27 January 2020).
17 Fujisawa Y, Otomo Y, Ogata Y et al. Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis. Br J Dermatol 2019; 180:373–81.
18 Tschandl P, Rosendahl C, Akay BN et al. Expert-level diagnosis of nonpigmented skin cancer by combined convolutional neural networks. JAMA Dermatol 2019; 155:58–65.