Article

MSMP-Net: A Multi-Scale Neural Network for End-to-End Monkeypox Virus Skin Lesion Classification

1 School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
2 School of Electronics and Information, Zhengzhou University of Light Industry, Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9390; https://doi.org/10.3390/app14209390
Submission received: 13 September 2024 / Revised: 10 October 2024 / Accepted: 11 October 2024 / Published: 15 October 2024

Abstract

Monkeypox is a zoonotic disease caused by monkeypox virus infection. It is easily transmitted among people and poses a major threat to human health, making it of great significance in public health. Therefore, this paper proposes MSMP-Net, a multi-scale neural network for end-to-end monkeypox virus skin lesion classification. ConvNeXt is used as the backbone network, and designs such as inverted bottleneck layers and large convolution kernels enhance its feature extraction capabilities. To effectively utilize the multi-level feature maps generated by the backbone, a multi-scale feature fusion structure is designed: fusing the deepest feature maps of the multi-scale features enhances the model's ability to represent monkeypox image features. Experimental results show that the accuracy, precision, recall, and F1-score of this method on the MSLD v2.0 dataset are 87.03 ± 3.43%, 87.59 ± 3.37%, 87.03 ± 3.43%, and 86.58 ± 3.66%, respectively.

1. Introduction

Monkeypox is a viral disease caused by monkeypox virus infection. Since the first case of the 2022 outbreak was reported in the UK on 7 May 2022, cases of monkeypox virus infection have occurred worldwide [1]. On 23 June 2022, the World Health Organization announced that monkeypox poses a “moderate risk” to global public health.
Monkeypox is a zoonotic infectious disease caused by monkeypox virus. Its clinical symptoms are similar to those of smallpox, including high fever, fatigue, headache, and rash. Monkeypox is easily transmitted among people, posing a major threat to human health. In addition, monkeypox infection resembles other poxvirus infections and is difficult to diagnose based on clinical symptoms alone [2], as shown in Figure 1. Currently, a variety of nucleic acid detection methods have been studied for detecting monkeypox virus, such as fluorescent PCR [3]; this method must be carried out in a PCR laboratory and strictly follow the principle of zoned operation. Rapid detection of monkeypox virus is crucial for the timely diagnosis of monkeypox cases and the control of monkeypox epidemics, so a rapid detection method for monkeypox virus is needed.
In recent years, with the continuous development of deep learning technology, a series of breakthroughs have been made in image classification [4], object detection [5], and semantic segmentation [6]. At the same time, computer-aided diagnosis, an emerging technology that uses deep learning to analyze medical images or non-image patient data, can both evaluate a patient’s condition and help clinicians make decisions. Ali et al. [7] constructed a monkeypox skin disease dataset and compared the classification performance of different deep network models. Sitaula et al. [8] compared 13 different pre-trained deep learning models for monkeypox virus detection and used an ensemble of models to obtain the best diagnostic effect. Sahin et al. [9] used transfer learning to build a mobile monkeypox skin disease diagnosis method. Hu et al. [10] diagnosed monkeypox skin disease with a residual network model that fuses attention and depthwise separable convolution modules. Biswas et al. [11] used transfer learning to construct the lightweight BinaryDNet53 for monkeypox virus detection, an algorithm suitable for mobile devices with limited resources. Unlike references [7,8,9,10], reference [11] addresses six categories, but its accuracy remains modest. Monkeypox image classification also faces the following challenges: (1) the size, shape, color, and distribution of monkeypox lesions may vary greatly across patients; (2) annotated medical images of monkeypox are scarce, which hampers the training of robust deep learning models; and (3) the dataset contains an unequal number of images across classes, leading to potential biases in model training.
In summary, this paper introduces MSMP-Net, a novel monkeypox virus classification method based on ConvNeXt and multi-scale feature fusion, designed for rapid detection of monkeypox virus skin diseases. The main contributions of this study are as follows:
(1) Enhanced Feature Extraction with ConvNeXt: by utilizing ConvNeXt [12] as the backbone network, we not only improve the network’s feature extraction capabilities but also reduce computational complexity.
(2) Proposed Multi-Scale Feature Fusion Structure: we introduce a multi-scale feature fusion architecture to achieve complementary fusion of features at different scales. This enhances the feature representation of monkeypox virus images and improves the diagnostic performance for monkeypox virus detection.
(3) End-to-End Classification: MSMP-Net is designed for end-to-end classification, allowing seamless integration of the feature extraction and classification processes. This reduces the complexity of the pipeline and enhances the overall efficiency and accuracy of the model.
The remainder of this paper is organized as follows. Section 2 reviews monkeypox virus skin lesion classification, and Section 3 introduces the proposed MSMP-Net algorithm in detail. The validation of the methodology is presented in Section 4. Finally, the main conclusions of this work are drawn in Section 5.

2. Related Work

Computer-aided diagnosis (CAD) is an emerging technology that uses machine learning methods to analyze imaging or non-imaging patient data, evaluate patient conditions, and help clinicians make decisions. The emergence of deep learning has also successfully promoted the research and development of CAD technology [6]. Using deep learning for CAD is very common in the medical field; applications include distinguishing diseased from normal patterns, classifying malignant versus benign lesions, and separating high-risk from low-risk patterns for predicting future cancer. Ali et al. [7] constructed a monkeypox skin disease dataset and compared the classification of monkeypox skin disease using different deep network models. Sitaula et al. [8] compared 13 different pre-trained deep learning models for monkeypox virus detection and used an ensemble of models to obtain the best diagnostic effect. Sahin et al. [9] used transfer learning to build a mobile monkeypox skin disease diagnosis method. Hu et al. [10] used a residual network model that fuses attention and depthwise separable convolution modules to diagnose monkeypox skin disease. Biswas et al. [11] used transfer learning to construct the lightweight BinaryDNet53 for monkeypox virus detection, an algorithm suitable for mobile devices with limited resources. Abdelhamid et al. [13] used transfer learning to extract features, metaheuristic optimization to select them, and multi-layer neural network parameter optimization to detect monkeypox virus, under evaluation criteria different from those used here. Bala et al. [14] constructed a new monkeypox dataset, compared different network models, and built a MonkeyNet model based on transfer learning. Alakus et al. [15] applied a deep learning algorithm to monkeypox virus detection on a dataset of DNA sequences. Nayak et al. [16] compared different deep network models for monkeypox virus detection, using augmented data as the test dataset. The above studies either use different datasets or augmented test sets, or adopt different evaluation criteria. Ahsan et al. [17] constructed a monkeypox virus dataset and used a VGG16 model for binary monkeypox detection. Jaradat et al. [18] used transfer learning based on the MobileNetV2 model, but the accuracy of this algorithm still needs improvement. Kundu et al. [19] used a Vision Transformer to distinguish varicella from monkeypox, with an accuracy of 93%.

3. Methodology

3.1. MSMP-Net

The structure of MSMP-Net is shown in Figure 2. The model consists of an enhanced feature extraction network (ConvNeXt), a multi-scale feature fusion branch, a residual module, and a multi-layer convolution module. The enhanced feature extraction network is an improved ConvNeXt model that builds on ResNet [20] and Swin-Transformer [21]; it is pre-trained on ImageNet-22k and fine-tuned on ImageNet-1k at a resolution of 224 × 224. Because ConvNeXt uses large convolution kernels, local information in the feature maps produced by the shallow layers can be lost. To improve the feature representation ability, the feature maps with richer local information extracted by the backbone are fed into the multi-scale feature fusion branch. The fused features are then input into the residual module and the multi-layer convolution module to further enhance the representation of image features.

3.2. ConvNeXt

The ConvNeXt network is an improved model based on ResNet. During its design, ideas from Swin-Transformer were borrowed to optimize five key aspects: (1) following the scaling strategy of Swin-Transformer, the four stages of ResNet-50 are adjusted from the original (3, 4, 6, 3) blocks to (3, 3, 9, 3); (2) drawing on the grouped convolution of ResNeXt, the 3 × 3 convolution in the bottleneck layer is replaced with depthwise convolution and the network width is increased, improving both computational efficiency and accuracy; (3) adopting the inverted bottleneck design of MobileNet v2 [22], a structure that is wide in the middle and narrow at both ends is used to reduce information loss, with the difference that ConvNeXt moves the depthwise convolution up one layer; (4) after experiments with convolutions of four sizes (5 × 5, 7 × 7, 9 × 9, and 11 × 11), the 7 × 7 kernel was found to work best; and (5) ReLU is replaced by GELU, fewer activation functions and normalization layers are used, batch normalization (BN) is replaced by layer normalization (LN), and separate downsampling layers are introduced. ConvNeXt has four versions (T, S, B, L), which differ in the number of input channels at each stage and the number of stacked blocks per stage. This paper uses ConvNeXt-large; the network structure is shown in Figure 3.
The initial layer of ConvNeXt uses a 4 × 4 convolution with a stride of 4 as a patchify stem and applies layer normalization (LN) to improve the training stability and convergence speed of the network. The following four stages extract features by downsampling and stacking different numbers of ConvNeXt blocks. The downsampling module consists of LN and a 2 × 2 convolution with a stride of 2. The ConvNeXt block is constructed in the order DepthwiseConv2d (7 × 7, stride = 1) → LN → Conv2d (1 × 1, stride = 1) → GELU → Conv2d (1 × 1, stride = 1). The four feature maps output by ConvNeXt are expressed by Equations (1) and (2).
$M_i = f^{j}\left(M_{i-1}\right)$  (1)
$M = \left\{ M_{\frac{H}{4}\times\frac{W}{4}\times C},\; M_{\frac{H}{8}\times\frac{W}{8}\times C},\; M_{\frac{H}{16}\times\frac{W}{16}\times C},\; M_{\frac{H}{32}\times\frac{W}{32}\times C} \right\}$  (2)
Here, for an image passing through the ConvNeXt backbone, $M$ denotes the set of feature maps generated at the different stages, $M_i$ denotes the feature map produced by the $(i-1)$-th stage with $i \in [2, 5]$, and $j$ denotes the number of ConvNeXt blocks stacked in each of the four stages.
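For concreteness, the block ordering described above can be written as a minimal PyTorch sketch. This is an illustrative re-implementation rather than the authors' released code; the expansion factor of 4 in the 1 × 1 layers is the standard ConvNeXt inverted-bottleneck choice and is assumed here.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """One ConvNeXt block: 7x7 depthwise conv -> LN -> 1x1 conv -> GELU -> 1x1 conv,
    wrapped in a residual connection."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                   # applied over the channel dim
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # 1x1 conv as Linear in (N,H,W,C)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)   # (N,C,H,W) -> (N,H,W,C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)   # back to (N,C,H,W)
        return shortcut + x
```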

3.3. The Multi-Scale Feature Fusion Module

Feature pyramid networks (FPNs) [23] are built from top-down lateral connections. Low-level features contain rich localization information, but this information can be lost as it propagates to higher levels, making accurate localization difficult. The FPN implementation is shown in Figure 4. Inspired by FPN, low-level semantic features are propagated to the high-level semantic hierarchy through bottom-up path enhancement. The network structure is shown in Figure 5.
As can be seen from the network structure in Figure 5, the four feature maps output by ConvNeXt are denoted $C_1$, $C_2$, $C_3$, and $C_4$. The dimension of $C_1$ is (batchsize, 192, 56, 56), that of $C_2$ is (batchsize, 384, 28, 28), that of $C_3$ is (batchsize, 768, 14, 14), and that of $C_4$ is (batchsize, 1536, 7, 7). First, $C_1$, $C_2$, and $C_3$ each pass through a 1 × 1 convolution and an average pooling layer, producing outputs $P_1$, $P_2$, and $P_3$, each of dimension (batchsize, 1536, 7, 7). The downsampled feature maps are then fused with $C_4$ to integrate multi-scale information. The multi-scale feature fusion formulas are shown in Equations (3)–(6).
$P_1 = g_8^{8\times 8}\left(f_1^{1\times 1}(C_1)\right)$  (3)
$P_2 = g_4^{4\times 4}\left(f_1^{1\times 1}(C_2)\right)$  (4)
$P_3 = g_2^{2\times 2}\left(f_1^{1\times 1}(C_3)\right)$  (5)
$out = P_1 \oplus P_2 \oplus P_3 \oplus C_4$  (6)
where $f_1^{1\times 1}$ denotes a 1 × 1 convolution layer with a stride of 1; $g_8^{8\times 8}$, $g_4^{4\times 4}$, and $g_2^{2\times 2}$ denote an 8 × 8 average pooling layer with a stride of 8, a 4 × 4 average pooling layer with a stride of 4, and a 2 × 2 average pooling layer with a stride of 2, respectively; and $\oplus$ denotes the element-wise sum used to fuse the feature maps.
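Read literally, Equations (3)–(6) project each shallower map to 1536 channels, pool it down to 7 × 7, and sum it with $C_4$. A minimal PyTorch sketch of this fusion branch follows, using the ConvNeXt-large channel widths quoted above and taking $\oplus$ as element-wise addition; it is an illustrative reading of the equations, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Fusion of Eqs. (3)-(6): project C1..C3 to 1536 channels with 1x1 convs,
    average-pool each to 7x7, and sum element-wise with the deepest map C4."""
    def __init__(self):
        super().__init__()
        self.proj1 = nn.Conv2d(192, 1536, kernel_size=1)
        self.proj2 = nn.Conv2d(384, 1536, kernel_size=1)
        self.proj3 = nn.Conv2d(768, 1536, kernel_size=1)
        self.pool1 = nn.AvgPool2d(kernel_size=8, stride=8)  # 56x56 -> 7x7
        self.pool2 = nn.AvgPool2d(kernel_size=4, stride=4)  # 28x28 -> 7x7
        self.pool3 = nn.AvgPool2d(kernel_size=2, stride=2)  # 14x14 -> 7x7

    def forward(self, c1, c2, c3, c4):
        p1 = self.pool1(self.proj1(c1))
        p2 = self.pool2(self.proj2(c2))
        p3 = self.pool3(self.proj3(c3))
        return p1 + p2 + p3 + c4  # Eq. (6): element-wise fusion

# Shape check with the dimensions listed above (batch size 2):
fuse = MultiScaleFusion()
out = fuse(torch.randn(2, 192, 56, 56), torch.randn(2, 384, 28, 28),
           torch.randn(2, 768, 14, 14), torch.randn(2, 1536, 7, 7))
assert out.shape == (2, 1536, 7, 7)
```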

3.4. The Identity Block and Multi-Layer Convolution Module

ResNet is a deep convolutional neural network. Compared with traditional convolutional neural networks, it uses a shortcut connection mechanism to mitigate vanishing and exploding gradients when training deep networks, enabling deeper and stronger models. ResNet contains two modules: the Conv Block and the Identity Block. The Conv Block performs a nonlinear mapping of learned features to extract high-level image features, while the Identity Block passes the input directly to the output via a shortcut so that it can be added to the convolution result, which is used to deepen the network. To increase the depth of the network, this paper adopts the Identity Block; its structure is shown in Figure 6. VGG16 is a deep convolutional neural network architecture with a simple structure that gradually extracts higher-level features through multiple convolutional layers and is strong at complex image classification tasks. Inspired by VGG16, this paper constructs a multi-layer convolution module, as shown in Figure 7.
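The exact layer widths of the Identity Block are given only in Figure 6, so the sketch below shows a standard ResNet-style identity block of the kind described: a bottleneck of 1 × 1 → 3 × 3 → 1 × 1 convolutions whose output is added to the unmodified input. The channel counts are placeholders, not values taken from the paper.

```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """ResNet-style identity block: the input skips past a 1x1 -> 3x3 -> 1x1
    convolutional branch and is added to its output (the shortcut path)."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1), nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1), nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: the unmodified input is added to the conv branch.
        return self.relu(x + self.body(x))
```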

3.5. Loss Function

For image classification, the commonly used loss function is cross entropy, whose expression is shown in Equation (7):
$L_{CE} = -\sum_{i=1}^{m} p(x_i) \log q(x_i)$  (7)
where $m$ denotes the batch size in each round, $i$ indexes the $i$-th image in the batch, $p(x_i)$ is the true distribution over the six types of skin lesions, and $q(x_i)$ is the predicted distribution over the six types.
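In PyTorch, this loss is available as nn.CrossEntropyLoss, which combines log-softmax with the negative log-likelihood, so the model outputs raw logits of shape (batch, 6); a short usage example for the six-class setting follows.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(16, 6)            # batch of 16, six lesion classes
targets = torch.randint(0, 6, (16,))   # ground-truth class indices
loss = criterion(logits, targets)      # scalar cross-entropy, Eq. (7)
```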

4. Experiments

4.1. Experimental Setup

The experiments were conducted on an Ubuntu 18.04 system with an RTX 3090 graphics card with 24 GB of memory, using the PyTorch framework. The model uses the cross entropy loss function and AdamW [24] as the optimizer, with a batch size of 16, a learning rate of 4 × 10−5, and 6 training epochs. The specific hyper-parameter settings are shown in Table 1.
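The optimizer configuration in Table 1 maps directly onto PyTorch; in the snippet below, the simple linear module is only a placeholder standing in for the MSMP-Net model.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 6)  # placeholder module standing in for MSMP-Net
# AdamW with the learning rate and weight decay from Table 1
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, weight_decay=1e-4)
```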

4.2. Dataset

This paper uses the latest Monkeypox Skin Lesion Dataset (MSLD v2.0) [25], an extension of MSLD v1.0 [7]. The MSLD v1.0 dataset has only two categories, monkeypox and others. The MSLD v2.0 dataset has a total of 755 images in six categories, namely chickenpox, cowpox, healthy, HFMD (hand-foot-mouth disease), measles, and monkeypox, with 75, 66, 114, 161, 55, and 284 images, respectively. The specific distribution is shown in Table 2. The MSLD v2.0 dataset has five folders, each containing three subsets: a training set, a validation set, and a test set. In addition, data augmentation operations such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, and scaling are used to enlarge the training set.
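Most of the listed augmentations are available in torchvision. The pipeline below is an illustration only, since the exact parameter values used to build MSLD v2.0 are not stated; the specific degrees, ranges, and crop scale are assumptions, and noise injection is omitted because torchvision has no single standard transform for it.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline covering rotation, translation/shear,
# reflection, color jitter, and scaling; parameter values are assumptions.
train_transforms = T.Compose([
    T.RandomRotation(degrees=30),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=10),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ToTensor(),
])
```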

4.3. Evaluation Metrics

The evaluation metrics used in this paper are confusion matrix, accuracy, precision, recall, and F1-score. The confusion matrix is shown in Table 3. True positive (TP) represents the true positive class, false negative (FN) represents the false negative class, false positive (FP) represents the false positive class, and true negative (TN) represents the true negative class.
The calculation formulas for Accuracy, Precision, Recall, and $F_1$ are given in Equations (8)–(11), respectively.
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (8)
$Precision = \frac{TP}{TP + FP}$  (9)
$Recall = \frac{TP}{TP + FN}$  (10)
$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$  (11)
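These metrics can be computed with scikit-learn. The averaging scheme for the multi-class case is not stated in the paper; weighted averaging is assumed below, which is at least consistent with the reported recall always equaling the accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels for the six lesion classes (0..5); weighted averaging assumed.
y_true = [0, 1, 2, 5, 3, 4]
y_pred = [0, 1, 2, 5, 3, 3]
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec  = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1   = f1_score(y_true, y_pred, average="weighted", zero_division=0)
```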

4.4. Experimental Results and Analysis

First, cross-validation is used to evaluate the test accuracy of the model. The entire dataset is divided into k equal-sized subsets; k − 1 subsets are used for training and the remaining subset is used to test the model, so that each subset serves once as the test set. After all subsets have been used, the cumulative average validation score is calculated. The MSLD v2.0 dataset provides five folds, each containing a training set, a validation set, and a test set.
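The protocol can be sketched with scikit-learn's KFold as below. MSLD v2.0 actually ships with predefined folds, so the random split and the evaluate_fold function here are stand-ins for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate_fold(train_idx, test_idx):
    """Hypothetical helper: train on train_idx, return test accuracy."""
    return 0.87  # placeholder score for illustration

indices = np.arange(755)  # 755 images in MSLD v2.0
scores = [evaluate_fold(tr, te)
          for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(indices)]
print(f"mean = {np.mean(scores):.4f}, std = {np.std(scores):.4f}")
```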
Addressing class imbalance is crucial to prevent bias in the model and to ensure reliable performance across all classes; data augmentation is used here for this purpose. The MSLD v2.0 dataset consists of two subsets, the original dataset and the augmented dataset, as shown in Table 2. Several data augmentation methods, including rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, and scaling, were applied to the original dataset. This augmentation helped mitigate the class imbalance by increasing the sample sizes of minority classes (e.g., cowpox and measles) to be more comparable with the majority classes. We conducted experiments comparing the model’s performance on the original dataset versus the augmented dataset; the cross-validation results are shown in Table 4 and Table 5. The accuracy, precision, recall, and F1-score on the original dataset are 79.45 ± 5.66%, 81.03 ± 4.83%, 79.45 ± 5.66%, and 78.85 ± 5.54%, respectively; the corresponding indicators on the augmented dataset are 87.03 ± 3.43%, 87.59 ± 3.37%, 87.03 ± 3.43%, and 86.58 ± 3.66%. From Table 4 and Table 5, we can see that (1) the model trained on the augmented dataset outperformed the one trained on the original dataset across all evaluation metrics (accuracy, precision, recall, and F1-score); and (2) data augmentation helped the model generalize better to unseen data by providing a more diverse training set.
To further verify the classification performance of the proposed algorithm, different models were compared on the same dataset: ResNet50 [20], ViT [26], DenseNet121 [27], MobileNetV3 [22], CLIP [28], Swin-Transformer [21], and ConvNeXt [12], evaluated by accuracy, precision, recall, and F1-score. The experimental results are shown in Table 6. As can be seen from Table 6, the proposed algorithm outperforms the other models and achieves the highest value for each indicator, further showing that the method can effectively diagnose monkeypox skin images.
To further validate the proposed model, it was also trained and tested with various learning rates and batch sizes. The learning rate was set to 1 × 10−5, 2 × 10−5, 3 × 10−5, 4 × 10−5, 5 × 10−5, and 6 × 10−5, and the batch sizes were 4, 8, 16, and 32. The results obtained after training the proposed model with these parameters are shown in Table 7 and Table 8. From Table 7, it can be seen that a learning rate of 4 × 10−5 outperforms the other learning rates, producing the best results with the highest accuracy. With the learning rate fixed at 4 × 10−5, each batch size was then evaluated by comparing F1-score, precision, recall, and accuracy. From Table 8, it can be seen that a batch size of 16 performs best, with an accuracy of 87.03 ± 3.43%; a batch size of 32 is second best, but still below 16.
Many methods exist for monkeypox virus detection, although they differ in the datasets and methods used. The algorithm proposed in this paper was compared with existing methods, and the results are shown in Table 9. As can be seen from Table 9, the accuracy, precision, recall, and F1-score of this method are 87.03 ± 3.43%, 87.59 ± 3.37%, 87.03 ± 3.43%, and 86.58 ± 3.66%, respectively, higher than those of the other six-class methods. This shows that the proposed method is effective in diagnosing monkeypox skin diseases, classifies monkeypox skin images well, and provides an objective diagnostic basis for doctors in clinical practice.
To verify the effectiveness of each strategy and module, we conducted an ablation experiment. The main component under study is the multi-scale feature fusion module, and the experiment verifies the effect of fusing features at different scales, as shown in Table 10. As can be seen from Table 10, fusing all four scales outperforms every other combination.

5. Conclusions

This paper proposes a multi-scale neural network for end-to-end classification of monkeypox virus skin lesions. Using ConvNeXt as the enhanced feature extraction network improves the recognition accuracy for monkeypox virus while reducing model complexity. The proposed multi-scale fusion module then effectively captures the information in multi-scale feature maps and enhances the image representation ability. On the MSLD v2.0 dataset, the proposed network is compared with common existing network models and research methods. MSMP-Net can serve as an efficient diagnostic tool in clinical settings, assisting healthcare professionals in accurately identifying monkeypox infections from skin lesion images, and can be employed in large-scale screening programs to monitor and detect monkeypox cases in real time, aiding public health authorities in implementing timely interventions to prevent the spread of the virus. However, limitations remain, such as dataset size and diversity and room for methodological improvement. In the future, the detection capabilities of MSMP-Net could be further enhanced by integrating a self-attention mechanism.

Author Contributions

E.H. contributed to the conception of the study, performed the experiments and data analyses, and wrote the manuscript. H.D. contributed significantly to the analysis and manuscript preparation and provided constructive discussions. All authors have read and agreed to the published version of the manuscript.

Funding

The corresponding author received support from The Key Technologies R&D Program of Henan Province (242102210112) for the submitted work.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset can be accessed from Kaggle (https://www.kaggle.com/datasets/joydippaul/mpox-skin-lesion-dataset-version-20-msld-v20, accessed on 7 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Letafati, A.; Sakhavarz, T. Monkeypox virus: A review. Microb. Pathog. 2023, 176, 106027. [Google Scholar] [CrossRef] [PubMed]
  2. Anwar, F.; Haider, F.; Khan, S.; Ahmad, I.; Ahmed, N.; Imran, M.; Rashid, S.; Ren, Z.-G.; Khattak, S.; Ji, X.-Y. Clinical manifestation, transmission, pathogenesis, and diagnosis of monkeypox virus: A comprehensive review. Life 2023, 13, 522. [Google Scholar] [CrossRef] [PubMed]
  3. Zhou, J.; Xiao, F.; Huang, X.; Fu, J.; Jia, N.; Sun, C.; Chen, M.; Xu, Z.; Huang, H.; Wang, Y. Rapid detection of monkeypox virus and differentiation of West African and Congo Basin strains using endonuclease restriction-mediated real-time PCR-based testing. Anal. Methods 2024, 16, 2693–2701. [Google Scholar] [CrossRef] [PubMed]
  4. Srivastava, S.; Sharma, G. Omnivec: Learning robust representations with cross modal sharing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1236–1248. [Google Scholar]
  5. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  6. Sung, C.; Kim, W.; An, J.; Lee, W.; Lim, H.; Myung, H. Contextrast: Contextual Contrastive Learning for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 3732–3742. [Google Scholar]
  7. Ali, S.N.; Ahmed, M.T.; Paul, J.; Jahan, T.; Sani, S.; Noor, N.; Hasan, T. Monkeypox skin lesion detection using deep learning models: A feasibility study. arXiv 2022, arXiv:2207.03342. [Google Scholar]
  8. Sitaula, C.; Shahi, T.B. Monkeypox virus detection using pre-trained deep learning-based approaches. J. Med. Syst. 2022, 46, 78. [Google Scholar] [CrossRef] [PubMed]
  9. Sahin, V.H.; Oztel, I.; Yolcu Oztel, G. Human monkeypox classification from skin lesion images with deep pre-trained network using mobile application. J. Med. Syst. 2022, 46, 79. [Google Scholar] [CrossRef] [PubMed]
  10. Hu, Y.; Du, B.; Hu, C. Classification of Monkeypox Virus Skin Lesions Based on Improved ResNet. Comput. Syst. Appl. 2023, 32, 197–203. [Google Scholar]
  11. Biswas, D.; Tešić, J. Binarydnet53: A lightweight binarized CNN for monkeypox virus image classification. Signal Image Video Process. 2024, 18, 7107–7118. [Google Scholar] [CrossRef]
  12. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  13. Abdelhamid, A.A.; El-Kenawy, E.-S.M.; Khodadadi, N.; Mirjalili, S.; Khafaga, D.S.; Alharbi, A.H.; Ibrahim, A.; Eid, M.M.; Saber, M. Classification of monkeypox images based on transfer learning and the Al-Biruni Earth Radius Optimization algorithm. Mathematics 2022, 10, 3614. [Google Scholar] [CrossRef]
  14. Bala, D.; Hossain, M.S.; Hossain, M.A.; Abdullah, M.I.; Rahman, M.M.; Manavalan, B.; Gu, N.; Islam, M.S.; Huang, Z. MonkeyNet: A robust deep convolutional neural network for monkeypox disease detection and classification. Neural Netw. 2023, 161, 757–775. [Google Scholar] [CrossRef] [PubMed]
  15. Alakus, T.B.; Baykara, M. Comparison of monkeypox and wart DNA sequences with deep learning model. Appl. Sci. 2022, 12, 10216. [Google Scholar] [CrossRef]
  16. Nayak, T.; Chadaga, K.; Sampathila, N.; Mayrose, H.; Gokulkrishnan, N.; Prabhu, S.; Umakanth, S. Deep learning-based detection of monkeypox virus using skin lesion images. Med. Nov. Technol. Devices 2023, 18, 100243. [Google Scholar] [CrossRef] [PubMed]
  17. Ahsan, M.M.; Uddin, M.R.; Farjana, M.; Sakib, A.N.; Momin, K.A.; Luna, S.A. Image Data collection and implementation of deep learning-based model in detecting Monkeypox disease using modified VGG16. arXiv 2022, arXiv:2206.01862. [Google Scholar]
  18. Jaradat, A.S.; Al Mamlook, R.E.; Almakayeel, N.; Alharbe, N.; Almuflih, A.S.; Nasayreh, A.; Gharaibeh, H.; Gharaibeh, M.; Gharaibeh, A.; Bzizi, H. Automated monkeypox skin lesion detection using deep learning and transfer learning techniques. Int. J. Environ. Res. Public Health 2023, 20, 4422. [Google Scholar] [CrossRef] [PubMed]
  19. Kundu, D.; Siddiqi, U.R.; Rahman, M.M. Vision transformer based deep learning model for monkeypox detection. In Proceedings of the 2022 25th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 17–19 December 2022; pp. 1021–1026. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  22. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  23. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  24. Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  25. Ali, S.N.; Ahmed, M.T.; Jahan, T.; Paul, J.; Sani, S.S.; Noor, N.; Asma, A.N.; Hasan, T. A web-based mpox skin lesion detection system using state-of-the-art deep learning models considering racial diversity. Biomed. Signal Process. Control 2024, 98, 106742. [Google Scholar] [CrossRef]
  26. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  29. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
Figure 1. Diagram of different types of skin diseases. (a) Chickenpox; (b) measles; (c) HFMD; (d) monkeypox.
Figure 2. The network structure diagram of MSMP-Net.
Figure 3. The structure of the ConvNeXt-large network.
Figure 4. Top-down module structure.
Figure 5. The structure of the multi-scale feature fusion module.
Figure 6. Identity block.
Figure 7. Multi-layer convolution module.
Table 1. Experimental hyper-parameters.

Parameter | Value
Loss function | Cross entropy
Optimizer | AdamW
Batch_size | 16
Epochs | 6
Weight_decay | 1 × 10−4
Learning_rate | 4 × 10−5
Table 2. Instance distribution statistics of the presented Monkeypox Skin Lesion Dataset (MSLD) v2.0.

Class Label | No. of Original Images | No. of Augmented Images | No. of Unique Patients
Chickenpox | 75 | 3598 | 62
Cowpox | 66 | 3220 | 41
Healthy | 114 | 5656 | 104
HFMD | 161 | 7882 | 144
Measles | 55 | 2618 | 46
Monkeypox | 284 | 14,070 | 143
Total | 755 | 37,044 | 540
Table 3. The confusion matrix.

Actual \ Predicted | Positive | Negative
Positive | TP | FN
Negative | FP | TN
Table 4. Five-fold cross-validation results for the original dataset.

Number of Fold | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
Proposed method:
Fold 1 | 88.89 | 89.94 | 88.89 | 88.65
Fold 2 | 76.54 | 78.84 | 76.54 | 75.90
Fold 3 | 79.69 | 82.00 | 79.69 | 79.97
Fold 4 | 80.49 | 78.11 | 80.49 | 77.68
Fold 5 | 71.62 | 76.25 | 71.62 | 72.05
Mean | 79.45 | 81.03 | 79.45 | 78.85
Standard deviation | 5.66 | 4.83 | 5.66 | 5.54
Table 5. Five-fold cross-validation results for the augmented dataset.

Number of Fold | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
Proposed method:
Fold 5 | 93.65 | 94.15 | 93.65 | 93.52
Fold 4 | 86.42 | 87.29 | 86.42 | 86.10
Fold 3 | 85.94 | 85.91 | 85.94 | 85.78
Fold 2 | 85.37 | 84.90 | 85.37 | 84.72
Fold 1 | 83.78 | 85.70 | 83.78 | 82.78
Mean | 87.03 | 87.59 | 87.03 | 86.58
Standard deviation | 3.43 | 3.37 | 3.43 | 3.66
Table 6. Classification results for different deep learning models.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
ResNet50 [20] | 77.93 ± 6.18 | 79.41 ± 7.87 | 77.93 ± 6.18 | 77.09 ± 6.54
DenseNet121 [27] | 58.51 ± 3.34 | 59.63 ± 1.58 | 58.51 ± 3.34 | 57.05 ± 2.17
MobileNetV3-large [22] | 80.47 ± 5.65 | 82.00 ± 5.62 | 80.47 ± 5.65 | 79.74 ± 5.77
CLIP-vit-base [28] | 72.06 ± 3.13 | 73.03 ± 3.96 | 72.06 ± 3.13 | 69.56 ± 3.64
CLIP-vit-large [28] | 83.34 ± 2.30 | 84.83 ± 2.35 | 83.34 ± 2.30 | 82.84 ± 2.50
Swin-base [21] | 85.52 ± 2.78 | 86.67 ± 2.55 | 85.52 ± 2.78 | 84.88 ± 2.68
Swin-small [21] | 84.75 ± 3.81 | 85.60 ± 3.89 | 84.75 ± 3.81 | 84.34 ± 3.68
Swin-large [21] | 84.58 ± 3.98 | 84.89 ± 4.09 | 84.58 ± 3.98 | 83.74 ± 4.27
ConvNeXt-tiny [12] | 82.85 ± 3.87 | 83.17 ± 4.28 | 82.85 ± 3.87 | 82.07 ± 3.93
ConvNeXt-small [12] | 82.52 ± 5.57 | 83.11 ± 5.69 | 82.52 ± 5.57 | 81.89 ± 5.98
ConvNeXtV2-tiny [29] | 82.69 ± 3.32 | 83.28 ± 3.66 | 82.69 ± 3.32 | 81.84 ± 3.55
ViT-base [26] | 81.62 ± 1.38 | 82.85 ± 1.57 | 81.62 ± 1.38 | 80.57 ± 1.39
ViT-large [26] | 83.91 ± 3.77 | 84.67 ± 3.42 | 83.83 ± 3.66 | 83.20 ± 3.82
ViT-huge [26] | 81.62 ± 5.46 | 82.40 ± 5.35 | 81.62 ± 5.46 | 80.78 ± 5.84
Proposed method | 87.03 ± 3.43 | 87.59 ± 3.37 | 87.03 ± 3.43 | 86.58 ± 3.66
Table 7. Classification results for different learning rates.

Learning Rate | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
1 × 10−5 | 83.81 ± 4.40 | 84.07 ± 4.69 | 83.81 ± 4.40 | 83.34 ± 4.57
2 × 10−5 | 84.18 ± 3.89 | 85.54 ± 3.25 | 87.29 ± 2.79 | 83.54 ± 4.30
3 × 10−5 | 84.86 ± 3.08 | 85.45 ± 2.97 | 84.86 ± 3.08 | 84.48 ± 3.34
4 × 10−5 | 87.03 ± 3.43 | 87.59 ± 3.37 | 87.03 ± 3.43 | 86.58 ± 3.66
5 × 10−5 | 85.65 ± 2.98 | 86.94 ± 2.54 | 85.65 ± 2.98 | 84.84 ± 2.68
6 × 10−5 | 82.67 ± 1.40 | 83.17 ± 1.81 | 82.67 ± 1.40 | 82.07 ± 1.88
Table 8. Classification results for different batch sizes.

Batch Size | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
4 | 84.61 ± 1.90 | 85.51 ± 2.02 | 84.61 ± 1.90 | 84.35 ± 2.19
8 | 85.55 ± 3.79 | 86.14 ± 3.17 | 85.55 ± 3.79 | 84.74 ± 4.27
16 | 87.03 ± 3.43 | 87.59 ± 3.37 | 87.03 ± 3.43 | 86.58 ± 3.66
32 | 86.40 ± 4.03 | 86.32 ± 4.39 | 86.40 ± 4.03 | 86.04 ± 4.33
Table 9. Comparison with previous works (# indicates no value).

Authors | Number of Classes | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
Ali et al. [7] | 2 | 82.96 ± 4.57 | 87.00 ± 0.07 | 83.00 ± 0.02 | 84.00 ± 0.03
Sahin et al. [9] | 2 | 91.11 | 90 | 90 | 90
Hu et al. [10] | 2 | 97.3 | 97 | 97 | 97
Abdelhamid et al. [13] | 2 | 98.8 | # | # | #
Alakus et al. [15] | 2 | 96.08 | # | # | 99.83
Nayak et al. [16] | 2 | 99.49 | 100 | 99.43 | 99.49
Kundu et al. [19] | 2 | 93 | 93 | 91 | 92
Ahsan et al. [17] | 2 | 83 ± 0.09 | 88 ± 0.07 | 83 ± 0.09 | 83 ± 0.85
Jaradat et al. [18] | 2 | 98.16 | 99 | 96 | 98
Sitaula et al. [8] | 4 | 87.13 | 85.44 | 85.47 | 85.40
Bala et al. [14] | 4 | 97.61 ± 0.04 | 97.60 ± 0.04 | 97.61 ± 0.04 | 97.60 ± 0.04
Biswas et al. [11] | 6 | 85.78 | 86.92 | 82.46 | 84.20
Ali et al. [25] | 6 | 81.70 ± 5.39 | 83.00 ± 0.04 | 79.00 ± 0.06 | 80.00 ± 0.06
Proposed method | 6 | 87.03 ± 3.43 | 87.59 ± 3.37 | 87.03 ± 3.43 | 86.58 ± 3.66
Table 10. Ablation experiment comparison.

Level | Accuracy (%) | Precision (%) | Recall (%) | F1 (%)
MSMP-Net:
one | 75.59 ± 4.38 | 77.72 ± 4.52 | 75.59 ± 4.38 | 73.77 ± 4.97
two | 77.94 ± 4.64 | 80.74 ± 3.73 | 77.94 ± 4.64 | 78.00 ± 4.74
three | 85.03 ± 4.21 | 85.42 ± 4.37 | 85.03 ± 4.21 | 84.45 ± 4.49
four | 83.97 ± 2.54 | 84.87 ± 3.29 | 83.97 ± 2.54 | 83.30 ± 2.84
two, one | 81.14 ± 4.72 | 82.15 ± 4.68 | 81.14 ± 4.72 | 80.24 ± 5.60
three, one | 84.36 ± 2.91 | 84.76 ± 3.43 | 84.36 ± 2.91 | 84.00 ± 2.93
three, two | 82.44 ± 5.17 | 83.44 ± 5.06 | 82.44 ± 5.17 | 81.76 ± 5.82
three, four | 86.14 ± 5.04 | 86.61 ± 4.95 | 86.14 ± 5.04 | 85.56 ± 5.37
one, two, three | 84.91 ± 3.34 | 86.17 ± 3.10 | 84.91 ± 3.34 | 84.04 ± 3.73
two, three, four | 86.82 ± 3.73 | 85.55 ± 4.78 | 86.82 ± 3.73 | 86.47 ± 3.87
one, two, three, four | 87.03 ± 3.43 | 87.59 ± 3.37 | 87.03 ± 3.43 | 86.58 ± 3.66
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
