AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Iván Matas
Student
University of Seville
Seville, Spain
imatas@us.es

Carmen Serrano
PhD
University of Seville
Seville, Spain
cserrano@us.es

Francisca Silva-Clavería, Amalia Serrano
MD
Servicio de Dermatología
Hospital Universitario Virgen Macarena
Seville, Spain

Tomás Toledo-Pastrana
MD
Servicio de Dermatología
Hospital Quiron Salud Infanta Luisa y Sagrado Corazón
Seville, Spain

Begoña Acha
PhD
University of Seville
Seville, Spain
bacha@us.es

Abstract

An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. Interpretability is provided in two ways. First, the main BCC dermoscopic patterns are detected in the image to justify the BCC/non-BCC classification. Second, based on the widely used visual XAI technique Grad-CAM, a clinically inspired visual explanation is developed in which the features relevant for diagnosis are localized. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnoses of four dermatologists using an Expectation-Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy of 90%. For the clinically inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. For the clinically inspired visual XAI results, the mean normalized Grad-CAM value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model is able to identify the regions of the BCC patterns. These results support the ability of the AI tool to provide a useful explanation.

Keywords Artificial Intelligence $\cdot$ Basal Cell Carcinoma $\cdot$ Deep Learning $\cdot$ Skin Cancer $\cdot$ Telemedicine $\cdot$ XAI

1 Introduction

Skin cancer is the most commonly diagnosed cancer worldwide, with Melanoma, Basal Cell Carcinoma (BCC), and Squamous Cell Carcinoma (SCC) being the most prevalent types. BCC accounts for approximately 75% of all skin cancers, making it the most frequent form. It has well-established clinical criteria for diagnosis, yet the criteria that are actually present vary significantly from case to case [1, 2, 3, 4].

Many research efforts seek to diagnose various skin diseases, and the number of published papers has increased significantly in recent years, driven by the development of public databases [5, 6, 7, 8]. Although these databases are comprehensive and accessible to everyone, the clinical criteria used for diagnosis are not readily available. To develop a tool that is useful from a medical perspective, it is crucial to provide not only classification metrics, but also a detailed diagnosis explaining the detected clinical features, thereby offering a more comprehensive result. In this context, Serrano et al. [9] developed a clinically inspired skin lesion classification tool through the detection of dermoscopic criteria for BCC.

XAI techniques such as Grad-CAM [10] play a crucial role in improving the transparency and interpretability of CNNs in any kind of medical diagnostics. By offering visual explanations of the decision-making process, these methodologies allow medical professionals to gain insight into the predictive mechanisms of AI models, thus fostering trust and acceptance in AI-driven diagnostics. In recent years, XAI has become an active field of research and numerous review papers have been published [11, 12, 13].

In particular, there are some recent papers on XAI in skin image analysis. Barata et al. use attention modules to improve explainability in hierarchical skin lesion diagnosis [14]. Chanda et al. developed a multimodal XAI system to support dermatologists in diagnosing melanoma, which offers text and region-based explanations alongside its diagnostic predictions [15]. Rezk et al. developed an interpretable skin cancer diagnosis model using clinical images to assist general practitioners in early detection and referral. Their approach incorporates visual explanations of the model’s decisions based on Grad-CAM++ [16].

In these previous works, XAI is used to visualize the regions where the models have focused to extract features. The proposed method goes a step further: It proposes a clinical interpretation by localizing regions of clinical interest for diagnosis, specifically BCC dermoscopic features, and providing clinical labels.

2 Methodology

In modern healthcare, primary care physicians use teledermatology systems to receive high-quality diagnostic images remotely, allowing preliminary diagnoses of Basal Cell Carcinoma (BCC) using established patterns [4, 17]. Suspected BCC cases are promptly referred to dermatology specialists, improving healthcare efficiency, reducing waiting times, and facilitating early intervention.

Teledermatology is crucial in areas with limited specialist access, and its integration with AI tools promises to enhance diagnostic accuracy and efficiency. This work aims to develop an AI tool to assist in this process by providing a binary BCC/non-BCC classification with interpretable results. Fig. 1 illustrates the current and proposed diagnostic workflows.

Figure 1: Illustration showing the current workflow of BCC diagnosis by teledermatology in orange and the proposed workflow in green.

2.1 Database

The entire database was provided by the Dermatology Unit of the “Hospital Universitario Virgen Macarena”; the images were sent over 2 years from 60 primary care centers. The dataset comprises 1559 dermoscopic images divided into 3 subsets. Four dermatologists provided different types of annotation depending on the subset. Specifically:

• The first subset consists of 1089 dermoscopic images. The annotations for these images indicate the presence or absence of each of the dermoscopic features involved in the diagnosis of BCC.

• A second subset of 334 images is additionally enriched with dermatologist delineations of the BCC dermoscopic patterns within each image. More than one segmented area may appear in an image if the BCC lesion contains multiple patterns. An example is shown in Fig. 3.

• The third subset is made up of 136 non-BCC images, mostly consisting of nevus lesions, from the ISIC archive [8].

Table 1 summarizes the distribution of labels in the database. As can be seen in this table, the database has a significant class imbalance, with SW and MG underrepresented. Several techniques have been used to address this problem.

Table 1: Sample distribution for binary and multilabel codification.

Binary codification		Multi-label codification
BCC	Non-BCC	Pigment Network	Ulceration	Ovoid Nests	Multiglobules	Maple Leaf-like	Spoke Wheel	Arborizing Telangiectasia
775	784	557	385	333	191	244	178	455

2.1.1 Label codification

Each image may contain multiple dermoscopic patterns. Therefore, a multi-hot binary coding scheme was used to encode the labels during image annotation and subsequently to process the dermatologists’ annotations. Each image label is a binary word in which each BCC dermoscopic pattern is a digit, where $1$ means presence and $0$ means absence. The seven patterns that can appear in a BCC lesion are [4, 18, 17]: Pigment Network (PN) (negative criterion), Ulceration (U), Ovoid Nests (ON), Multiglobules (MG), Maple Leaf-like (ML), Spoke Wheel (SW), and Arborizing Telangiectasia (AT) (Fig. 2). Thus, each label is a vector of dimensions $1\times 7$. Table 2 shows some examples of this encoding.
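The encoding described above can be sketched in a few lines of Python. The digit order (PN first, AT last) is an assumption taken from the order in which the patterns are listed in the text, and the function names `encode_patterns` and `decode_patterns` are hypothetical.

```python
# Assumed digit order, following the list in the text: PN, U, ON, MG, ML, SW, AT.
PATTERNS = ("PN", "U", "ON", "MG", "ML", "SW", "AT")


def encode_patterns(present):
    """Return a 7-element 0/1 list marking which patterns are present."""
    unknown = set(present) - set(PATTERNS)
    if unknown:
        raise ValueError(f"unknown pattern(s): {sorted(unknown)}")
    return [1 if name in present else 0 for name in PATTERNS]


def decode_patterns(label):
    """Inverse mapping: 0/1 vector back to the list of pattern names."""
    if len(label) != len(PATTERNS):
        raise ValueError("label must have 7 entries")
    return [name for name, bit in zip(PATTERNS, label) if bit]


# A lesion annotated with U, MG, ML and AT.
print(encode_patterns({"U", "MG", "ML", "AT"}))  # [0, 1, 0, 1, 1, 0, 1]
```

Under the assumed digit order, the printed vector coincides with Example 1 of Table 2.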

2.1.2 Standard Reference Inferring

The accepted ground truth (GT) for BCC diagnosis is biopsy. However, there is no established GT for BCC dermoscopic patterns, which are subjectively assessed by dermatologists. Several studies have reported a low kappa coefficient when measuring inter-dermatologist agreement in determining the different dermoscopic patterns present in a lesion [19, 17]. Therefore, an adequate standard reference (SR) inferred from the consensus of several dermatologists is required. An Expectation-Maximization (EM) based algorithm [20] was implemented to derive a single SR for model training from multiple specialist labels. This algorithm consolidates the multilabel annotations from different dermatologists and generates a single inferred SR that encapsulates the collective expertise of the raters. Silva et al. [21] used this algorithm to infer an SR from BCC pattern annotations and demonstrated that integrating this diverse expertise mitigates the subjectivity inherent in identifying BCC patterns and improves the reliability and robustness of the classification model.
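To make the idea concrete, the following is a minimal sketch of a Dawid-Skene-style EM [20] for a single dermoscopic pattern, assuming every rater gives a 0/1 vote for every image. It is not the authors' implementation: the function name `infer_reference`, the initialization with the fraction of positive votes, the fixed iteration count, and the clamping constant `eps` are all illustrative assumptions.

```python
def infer_reference(votes, n_iter=50, eps=1e-6):
    """votes[i][r] is rater r's 0/1 label for image i.

    Returns, for each image, the posterior probability that the
    pattern is truly present.
    """
    n_items = len(votes)
    n_raters = len(votes[0])
    # Initialise with the fraction of positive votes (soft majority vote).
    post = [sum(row) / n_raters for row in votes]

    def clamp(value):
        return min(max(value, eps), 1 - eps)

    for _ in range(n_iter):
        # M-step: prevalence and per-rater sensitivity / specificity.
        total_pos = sum(post)
        total_neg = n_items - total_pos
        prevalence = clamp(total_pos / n_items)
        sens, spec = [], []
        for r in range(n_raters):
            tp = sum(q * row[r] for q, row in zip(post, votes))
            tn = sum((1 - q) * (1 - row[r]) for q, row in zip(post, votes))
            sens.append(clamp(tp / max(total_pos, eps)))
            spec.append(clamp(tn / max(total_neg, eps)))

        # E-step: posterior of the true label given all raters' votes.
        new_post = []
        for row in votes:
            like_pos, like_neg = prevalence, 1 - prevalence
            for r, vote in enumerate(row):
                like_pos *= sens[r] if vote else 1 - sens[r]
                like_neg *= 1 - spec[r] if vote else spec[r]
            new_post.append(like_pos / (like_pos + like_neg))
        post = new_post
    return post


# Hypothetical votes of four raters on six images for one pattern.
votes = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
reference = [round(p) for p in infer_reference(votes)]
print(reference)
```

Running the function once per pattern and thresholding the posteriors at 0.5 would yield a seven-digit SR per image. Unlike plain majority voting, the posterior weights each rater by the sensitivity and specificity estimated for that rater.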

Table 2: Example of multilabel and binary encoding for BCC diagnosis

Codification	Multi-label	Binary	Diagnostic
Example 1	[0 1 0 1 1 0 1]	1	Presence of BCC
Example 2	[1 0 0 0 0 0 0]	0	Absence of BCC
Example 3	[0 0 0 0 0 0 0]	0	Absence of BCC

2.2 Clinical inspired XAI

From a clinical point of view, achieving 100% accuracy in detecting BCC dermoscopic patterns is not crucial for a correct diagnosis of BCC. A dermatologist needs to detect just one correct BCC dermoscopic pattern to make an accurate BCC diagnosis. Therefore, a useful clinically inspired XAI should provide one of the following outcomes: if no patterns are detected, it predicts non-BCC; if the PN pattern is detected, it serves as a negative criterion, explaining a non-BCC prediction; and if any other BCC pattern is detected, the prediction is positive for BCC.
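The three outcomes can be written as a small rule over the seven-digit label. The sketch below is illustrative only: the digit order (PN first) is assumed from Sect. 2.1.1, the function name `explain_prediction` is hypothetical, and the rules are checked in the order listed above, so PN takes precedence when it co-occurs with another pattern. The text does not specify that tie-break, so it is an assumption.

```python
# Assumed digit order: PN, U, ON, MG, ML, SW, AT.
PATTERN_NAMES = ("PN", "U", "ON", "MG", "ML", "SW", "AT")


def explain_prediction(label):
    """Map a 7-digit pattern vector to (binary diagnosis, explanation)."""
    detected = [name for name, bit in zip(PATTERN_NAMES, label) if bit]
    if not detected:
        return 0, "non-BCC: no BCC dermoscopic pattern detected"
    if "PN" in detected:
        # Assumption: the negative criterion wins if other patterns co-occur.
        return 0, "non-BCC: pigment network (negative criterion) detected"
    return 1, "BCC: detected pattern(s) " + ", ".join(detected)


for example in ([0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]):
    print(example, explain_prediction(example))
```

The three vectors in the loop are the examples of Table 2, and the rule reproduces their binary codes (1, 0, 0).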

The proposed AI tool for skin lesion diagnosis is based on the MobileNet-V2 model with 3 classifier layers. The model was combined with a three-phase optimization and training strategy. In the first phase, ImageNet transfer learning was applied, and the classifier weights were trained. In the second phase, fine-tuning was applied to the last three blocks of the feature extractor and to the classifier, with a lower learning rate (LR) and a larger number of epochs. In the third phase, transfer learning was applied from the binary model to solve the task of BCC pattern detection, retraining only the classifier with a very low LR.

2.3 Clinical visual XAI

To develop this concept, expert-generated segmentations are used. As detailed in Sect. 2.1, a subset of samples was segmented by a specialist. A dermatologist performed an individual segmentation of each BCC pattern for each image.

Understanding the decision-making process of convolutional neural networks (CNNs) is critical for clinical applications where interpretability is as important as performance metrics. The most common visualization techniques in explainable AI (XAI) are activation maps and gradient-weighted class activation mapping (Grad-CAM). Activation maps, as described by Zhou et al. [22], extract features from different network layers to show which patterns the model focuses on, although they may not always directly correlate with specific outcomes. Grad-CAM, as described in detail by Selvaraju et al. [10], uses the gradients of the predicted class that flow into the final convolutional layer to produce a localization map. The clinically inspired explanation proposed in this paper is based on Grad-CAM.
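The Grad-CAM computation of Selvaraju et al. [10] can be summarized as follows: each channel of the final convolutional layer is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and a ReLU keeps only positive evidence. The sketch below works on plain nested lists so that it runs without a deep learning framework. It assumes the activations and gradients have already been extracted from the network, and it normalizes the map by its maximum, which is one common choice and not necessarily the normalization used in this work.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their gradients.

    activations[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]:   gradient of the class score w.r.t. that activation.
    Returns an H x W map scaled to [0, 1] (assumed max-normalisation).
    """
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam  # no positive evidence anywhere
    return [[value / peak for value in row] for row in cam]


# Toy example with two 2x2 channels: the first supports the class,
# the second has negative gradients and is suppressed by the ReLU.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(activations, gradients))  # [[1.0, 0.0], [0.0, 0.0]]
```

In practice the resulting low-resolution map is upsampled to the input image size before it is overlaid on the lesion.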

However, additional information is provided. For this study, the segmentations of individual BCC patterns (see Sect. 2.1) were combined into a single segmented image, as shown in Fig. 3. These manual segmentations were then overlaid on the Grad-CAM maps to validate the clinical information provided by Grad-CAM. In this way, the explanation conveys not only where the AI tool is paying attention, but also which dermoscopic pattern is present in that region.

3 Results

3.1 Implementation details

The chosen optimizer was the schedule-free AdamW optimizer [23], with a mini-batch size of 32 and a dropout rate of $0.3$ to prevent overfitting. Focal loss [24] was used to address class imbalance.
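For reference, the binary focal loss of Lin et al. [24] scales the cross-entropy of each sample by $(1-p_t)^{\gamma}$, so that well-classified samples contribute little and hard, typically minority-class, samples dominate the gradient. The sketch below evaluates it for a single prediction. The values $\gamma=2$ and $\alpha=0.25$ are the defaults suggested in [24]; the values used in this work are not stated, so they are assumptions here.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample.

    p: predicted probability of the positive class.
    y: true label (0 or 1).
    gamma, alpha: focusing and balancing parameters (assumed defaults from [24]).
    """
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(max(p_t, eps))


# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.6).
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.6, 1))
```

With `gamma=0` the expression reduces to an $\alpha$-weighted cross-entropy, which is a convenient sanity check.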

Data augmentation techniques [25, 26] applied included rotation, perspective transformation, and Gaussian blur.

Due to the limited size of our database, a stratified k-fold cross-validation was implemented to ensure a comprehensive evaluation of the model’s performance. This approach mitigates the risk of biased training and testing distributions, especially crucial in datasets with imbalanced class distributions.
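A stratified split can be sketched with the standard library alone. The example below stratifies on the binary BCC/non-BCC label only and uses $k=5$ folds; the number of folds and the handling of the multilabel case are not stated in the text, so both are assumptions, as is the function name `stratified_folds`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Class counts from Table 1: 775 BCC and 784 non-BCC images.
labels = [1] * 775 + [0] * 784
folds = stratified_folds(labels)
print([(len(fold), sum(labels[i] for i in fold)) for fold in folds])
```

Each fold serves once as the test set while the remaining folds are used for training, and the reported metrics are averaged over the folds.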

3.2 Clinical-inspired XAI: BCC diagnosis with additional label information

This section analyzes the performance of the AI tool for BCC detection in conjunction with the labels provided to explain this classification. Table 3 presents metrics that summarize this performance. The metrics are averaged over all folds. This table has three parts. The first part shows the performance of the AI tool in the binary classification. The second part shows its performance in detecting BCC dermoscopic patterns. Finally, the third part represents the accuracy of the labels that provide the clinical explanation.

Overall, the BCC/non-BCC diagnostic performance is high, around $0.9$ for all metrics. The BCC pattern detection performance, however, requires closer analysis. Minority classes tend to attain low recall because an AI tool trained on an unbalanced database tends to favor majority classes. As shown in Sect. 2.1, SW, MG and ML are underrepresented classes. Strategies such as data augmentation, advanced sampling, and a one-vs-all strategy combined with stratified k-fold cross-validation helped to achieve a more balanced classification across patterns, thereby improving overall model performance. However, these metrics should not be analyzed in the same way as the BCC/non-BCC performance. They should only be evaluated to the extent that they provide a correct explanation for the binary classification. What matters is not whether the AI tool misses a specific BCC pattern, but whether it fails to detect any BCC pattern at all, since clinicians diagnose skin lesions in the same way. This further evaluation is summarized in the third part of Table 3. As shown in this table, 73% of non-BCC lesions without any BCC pattern, 95% of non-BCC lesions with PN, and 99% of BCC lesions with some BCC pattern are correctly labeled as such.
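The following sketch shows how the four metrics of Table 3 follow from a confusion matrix, together with one possible reading of the "any BCC pattern" criterion: among lesions whose reference label contains some BCC pattern, count those for which the prediction reports at least one BCC pattern, not necessarily the same one. The exact evaluation procedure is not detailed in the text, so the function `bcc_explanation_accuracy`, the assumed digit order (PN at index 0) and the toy data are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Recall, specificity, precision and accuracy from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def has_bcc_pattern(label):
    """True if any pattern other than PN (assumed to be index 0) is set."""
    return any(label[1:])


def bcc_explanation_accuracy(reference_labels, predicted_labels):
    """Fraction of lesions with a reference BCC pattern whose prediction
    also contains at least one BCC pattern (not necessarily the same)."""
    relevant = [
        pred
        for ref, pred in zip(reference_labels, predicted_labels)
        if has_bcc_pattern(ref)
    ]
    if not relevant:
        return float("nan")
    return sum(1 for pred in relevant if has_bcc_pattern(pred)) / len(relevant)


# Toy data: the first prediction names a different pattern than the reference
# but still counts as a useful explanation; the second misses every pattern.
refs = [[0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 0]]
preds = [[0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]
print(bcc_explanation_accuracy(refs, preds))  # 0.5
```

This illustrates why the per-pattern recall in the second part of Table 3 can be modest while the clinically relevant figure in the third part remains high.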

Table 3: Evaluation using binary and multilabel classification metrics, fine-tuned binary classifier, and physician-guided analysis.

	Recall	Specificity	Precision	Accuracy
BCC/Non-BCC
	0.89	0.89	0.90	0.90
Pattern detection
Pigment Network	0.94	0.96	0.97	0.95
Ulceration	0.81	0.75	0.52	0.77
Ovoid Nests	0.65	0.84	0.53	0.84
Multiglobules	0.61	0.81	0.32	0.80
Maple Leaf-like	0.50	0.82	0.34	0.77
Spoke Wheel	0.60	0.87	0.37	0.84
Arborizing Telangiectasia	0.89	0.76	0.61	0.80
Clinical-inspired XAI
All 0’s	-	-	-	0.73
Pigment Network	0.94	0.96	0.97	0.95
BCC pattern detection	0.84	0.88	0.71	0.99

3.3 Clinical-inspired Visual XAI

This section aims to quantify the accuracy of the AI tool in focusing on the correct part of the lesion, specifically the BCC dermoscopic patterns identified by clinicians. To this end, BCC pattern areas delineated by dermatologists will be compared with model activated areas. This will provide a quantitative measure of the model’s agreement with human diagnostic criteria and demonstrate its ability to accurately identify critical features of BCC lesions.

To quantify the accuracy of the model activation areas with respect to the areas of clinical interest, the conditional probability density functions of the normalized Grad-CAM values within and outside the area segmented by the dermatologist were estimated. Let $z(x,y)$ be the Grad-CAM value at position $(x,y)$. Let Fg denote the area segmented by the dermatologist and Bg the background. $P\left(z\left(x,y\right)\mid\text{Fg}\right)$ is the probability density function of Grad-CAM values for pixels $(x,y)\in\text{Fg}$, and $P\left(z\left(x,y\right)\mid\text{Bg}\right)$ is the probability density function of Grad-CAM values for pixels $(x,y)\in\text{Bg}$.
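A minimal sketch of this analysis is given below, assuming the Grad-CAM map is already normalized to $[0,1]$ and the dermatologist's segmentation is available as a binary mask of the same size. The densities are approximated with normalized histograms and their overlap with the histogram intersection; the estimator, the number of bins, and the function names are assumptions, since the text does not specify them.

```python
import statistics


def histogram(values, bins=20):
    """Normalised histogram of values in [0, 1]; the bin heights sum to 1."""
    counts = [0] * bins
    for value in values:
        counts[min(int(value * bins), bins - 1)] += 1
    return [count / len(values) for count in counts]


def fg_bg_statistics(cam, mask, bins=20):
    """Statistics of Grad-CAM values inside (Fg) and outside (Bg) the mask.

    cam:  H x W nested list of normalised Grad-CAM values z(x, y).
    mask: H x W nested list, truthy where the dermatologist segmented a pattern.
    Both regions are assumed to be non-empty.
    """
    fg = [z for cam_row, mask_row in zip(cam, mask)
          for z, m in zip(cam_row, mask_row) if m]
    bg = [z for cam_row, mask_row in zip(cam, mask)
          for z, m in zip(cam_row, mask_row) if not m]
    hist_fg, hist_bg = histogram(fg, bins), histogram(bg, bins)
    return {
        "mean_fg": statistics.mean(fg),
        "mean_bg": statistics.mean(bg),
        "std_fg": statistics.pstdev(fg),
        "std_bg": statistics.pstdev(bg),
        # Overlap of the two estimated densities: 0 = disjoint, 1 = identical.
        "intersection": sum(min(a, b) for a, b in zip(hist_fg, hist_bg)),
    }


# Toy 2x2 example: high activation inside the mask, low activation outside.
cam = [[0.9, 0.8], [0.1, 0.0]]
mask = [[1, 1], [0, 0]]
print(fg_bg_statistics(cam, mask))
```

A high foreground mean combined with a low background mean and a small intersection indicates that the activation is concentrated in the clinically relevant region.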

Fig. 4 illustrates this analysis. Fig. 4(a) shows the original BCC lesion. Fig. 4(b) shows the Grad-CAM map. Fig. 4(c) shows the dermatologist’s segmentation overlaid on the Grad-CAM map. Fig. 4(d) shows an example of the two conditional probability density functions. The orange curve represents $P\left(z\left(x,y\right)\mid\text{Bg}\right)$ , and the blue curve represents $P\left(z\left(x,y\right)\mid\text{Fg}\right)$ . The orange curve is centered near 0, indicating low activation outside the mask, while the blue curve shows significant Grad-CAM information within the clinical segmentation, indicating that the model extracts features from the same region as the specialist.

Table 4: Statistics derived from estimation of conditional probability density functions of GradCAM within and outside the region of clinical interest.

Prediction	Intersection	Mean Fg	Mean Bg	Std Fg	Std Bg
Correct	0.24	0.57	0.16	0.14	0.22
Incorrect	0.32	0.33	0.14	0.01	0.21

Table 4 summarizes the information extracted from these probability density functions. Specifically, the mean and standard deviation of $z\left(x,y\right)$ for $(x,y)\in\text{Fg}$ and $(x,y)\in\text{Bg}$, respectively, and the intersection area between $P\left(z\left(x,y\right)\mid\text{Fg}\right)$ and $P\left(z\left(x,y\right)\mid\text{Bg}\right)$ are shown. The table shows that correctly predicted samples have a larger foreground mean and standard deviation than incorrectly predicted samples (0.57 and 0.14 versus 0.33 and 0.01). In addition, the intersection area is larger for the incorrect predictions (0.32 versus 0.24), meaning that the foreground and background distributions are harder to separate. These facts indicate that the model does not pay attention to the areas of clinical interest in the incorrect predictions.

4 Discussion and conclusion

In this paper, an AI tool for BCC diagnosis that provides a clinical explanation has been developed. It makes two main contributions of clinical significance. First, the tool will contribute to changing the teledermatology protocol, reducing the waiting time for diagnosis and intervention in an area served by 60 geographically dispersed primary health centers. Second, its clinically inspired double explanation increases its usefulness.

References

[1] American Academy of Dermatology. Types of Common Skin Cancer. 2024. Accessed: 2024-03-28.
[2] Skin Cancer Foundation. Skin Cancer Information. https://www.skincancer.org/skin-cancer-information/#, 2024. Accessed: 2024-03-28.
[3] Magdalena Ciążyńska, Joanna Narbutt, Anna Woźniacka, and Aleksandra Lesiak. Trends in basal cell carcinoma incidence rates: a 16-year retrospective study of a population in central poland. Advances in Dermatology and Allergology/Postępy Dermatologii i Alergologii, 35(1):47–52, 2018.
[4] Ketty Peris, Maria Concetta Fargnoli, Claus Garbe, Roland Kaufmann, Lars Bastholt, Nicole Basset Seguin, Veronique Bataille, Veronique Del Marmol, Reinhard Dummer, Catherine A Harwood, et al. Diagnosis and treatment of basal cell carcinoma: European consensus–based interdisciplinary guidelines. European Journal of cancer, 118:10–34, 2019.
[5] Noel C. F. Codella, David A. Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin K. Mishra, Harald Kittler, and Allan Halpern. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (ISIC). CoRR, abs/1710.05006, 2017.
[6] Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C. Halpern, Susana Puig, and Josep Malvehy. Bcn20000: Dermoscopic lesions in the wild. 2019.
[7] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), August 2018.
[8] International skin imaging collaboration (isic) archive. 2024. Accessed: 2024-03-28.
[9] Authors’ work.
[10] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359, October 2019.
[11] Bas H.M. van der Velden, Hugo J. Kuijf, Kenneth G.A. Gilhuijs, and Max A. Viergever. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Medical Image Analysis, 79:102470, July 2022.
[12] Ruey-Kai Sheu and Mayuresh Sunil Pardeshi. A Survey on Medical Explainable AI (XAI): Recent Progress, Explainability Approach, Human Interaction and Scoring System. Sensors, 22(20), 2022.
[13] Shahab S Band, Atefeh Yarahmadi, Chung-Chian Hsu, Meghdad Biyari, Mehdi Sookhak, Rasoul Ameri, Iman Dehzangi, Anthony Theodore Chronopoulos, and Huey-Wen Liang. Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Informatics in Medicine Unlocked, 40:101286, 2023.
[14] Catarina Barata, M. Emre Celebi, and Jorge S. Marques. Explainable skin lesion diagnosis using taxonomies. Pattern Recognition, 110:107413, 2021.
[15] Tirtha Chanda, Katja Hauser, Sarah Hobelsberger, Tabea-Clara Bucher, Carina Nogueira Garcia, Christoph Wies, et al. Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma. Nature Communications, 2024.
[16] Eman Rezk, Mohamed Eltorki, and Wael El-Dakhakhni. Interpretable skin cancer classification based on incremental domain knowledge learning. Journal of Healthcare Informatics Research, 7(1):59–83, 2023.
[17] Ketty Peris, Emma Altobelli, Angela Ferrari, Maria Concetta Fargnoli, Domenico Piccolo, Maria Esposito, and Sergio Chimenti. Interobserver agreement on dermoscopic features of pigmented basal cell carcinoma. Dermatologic surgery, 28(7):643–645, 2002.
[18] Scott W Menzies, Karin Westerhoff, Harold Rabinovitz, Alfred W Kopf, William H McCarthy, and Brian Katz. Surface microscopy of pigmented basal cell carcinoma. Archives of dermatology, 136(8):1012–1016, 2000.
[19] Sam Polesie, Lisa Sundback, Martin Gillstedt, Hannah Ceder, Johan Dahlén Gyllencreutz, Julia Fougelberg, Eva Johansson Backman, Jenna Pakka, Oscar Zaar, and John Paoli. Interobserver agreement on dermoscopic features and their associations with in situ and invasive cutaneous melanomas. Acta Dermato-Venereologica, 101(10), 2021.
[20] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.
[21] Authors’ work.
[22] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, Los Alamitos, CA, USA, jun 2016. IEEE Computer Society.
[23] Aaron Defazio, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky, et al. The road less scheduled. arXiv preprint arXiv:2405.15682, 2024.
[24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Los Alamitos, CA, USA, oct 2017. IEEE Computer Society.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[26] P.Y. Simard, D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., pages 958–963, 2003.