Multi-Attention Integrated Deep Learning Frameworks for Enhanced Breast Cancer Segmentation and Identification

Pandiyaraju V
School of Computer Science and Engineering
Vellore Institute of Technology, Chennai
Tamil Nadu, India
pandiyaraju.v@vit.ac.in
Shravan Venkatraman
School of Computer Science and Engineering
Vellore Institute of Technology, Chennai
Tamil Nadu, India
shravan.venkatraman18@gmail.com
Pavan Kumar S
School of Computer Science and Engineering
Vellore Institute of Technology, Chennai
Tamil Nadu, India
s.pavankumar2003@gmail.com
Santhosh Malarvannan
School of Computer Science and Engineering
Vellore Institute of Technology, Chennai
Tamil Nadu, India
sandy5501m@gmail.com
Kannan A
Department of Information Science and Technology
College of Engineering, Guindy, Anna University, Chennai
Tamil Nadu, India
akannan123@gmail.com
Abstract

Breast cancer claims numerous lives worldwide each year, so timely detection is crucial for early intervention and improved chances of survival. Accurately diagnosing and classifying breast tumors from ultrasound images remains a persistent challenge in medicine and demands better tools to support treatment planning. This research introduces multi-attention-enhanced deep learning (DL) frameworks for the segmentation and classification of breast cancer tumors from ultrasound images. A spatial-channel attention mechanism is proposed for segmenting tumors from ultrasound images, utilizing a novel LinkNet DL framework with an InceptionResNet backbone. Following this, the paper proposes a deep convolutional neural network with an integrated multi-attention framework (DCNNIMAF) to classify the segmented tumor as benign, malignant, or normal. In experiments, the segmentation model recorded an accuracy of 98.1% with a loss of 0.6%, and achieved Intersection over Union (IoU) and Dice coefficient scores of 96.9% and 97.2%, respectively. The classification model attained an accuracy of 99.2% with a loss of 0.31%, along with F1-score, precision, and recall values of 99.1%, 99.3%, and 99.1%, respectively. By offering a robust framework for early detection and accurate classification of breast cancer, the proposed work advances medical image analysis and could improve diagnostic precision and patient outcomes.

Keywords: Breast Cancer · Deep Learning · Attention Mechanisms · Medical Imaging

1 Introduction

Breast cancer is one of the most common cancers among women worldwide, resulting in approximately 570,000 deaths in 2015 alone. Annually, over 1.5 million women, accounting for 25% of all female cancer diagnoses, are diagnosed with breast cancer globally [1][2]. Breast tumors often originate as ductal hyperproliferation and can progress to benign tumors or metastatic carcinomas when stimulated by various carcinogenic agents. The tumor microenvironment, including stromal effects and macrophages, plays a crucial role in the development and progression of breast cancer [3].

Early detection of breast carcinoma significantly increases the chances of successful treatment. Therefore, implementing effective procedures for identifying early signs of breast cancer is crucial [4]. Mammography, ultrasound, and thermography are the primary imaging techniques used for screening and diagnosing breast cancer [5][6]. With over 75% of tumors responding to hormones, breast cancer is primarily a postmenopausal illness. Incidence rates are at their highest between the ages of 35 and 39 and then plateau after 80 years, with age and female sex being significant risk factors. This hormone dependency interacts with environmental and genetic factors to determine the incidence and progression of the disease [7].

Precise segmentation and classification of breast cancer are essential for effective treatment planning and positive patient outcomes. Traditional methods heavily depend on manual interpretation, which is both time-consuming and prone to errors. Advancements in technology have transformed the provision of healthcare. High processing power, primarily from GPUs, enables the creation of deep neural networks with multiple layers, allowing for the extraction of formerly unachievable features. Convolutional Neural Networks (CNNs) have made a profound impact on image processing and understanding, especially in the areas of segmentation, classification, and analysis [8][9].

Deep learning models can process vast amounts of medical imaging data and detect subtle abnormalities that might elude human observers. Accurate tumor segmentation and classification enhance oncologists' capacity to decide whether a tumor is malignant. Typically, this assessment requires professional annotation and pathology reports [10], which consumes considerable human effort. DL provides an efficient and promising way to automate these procedures. DL models can learn complicated patterns and features from ultrasounds and mammograms, which has the potential to improve classification accuracy and efficiency.

This paper proposes the Spatial-Channel Attention LinkNet Framework with InceptionResNet Backbone for breast cancer segmentation, and the DCNNIMAF framework for breast cancer classification. The segmentation framework is an attention-enhanced architecture that uses a pre-trained CNN as the encoder backbone to strengthen feature extraction, and a coupled spatial and channel attention mechanism in the decoder to refine the segmentation. The proposed classification framework, Deep CNN with an Integrated Multi-Attention Framework (DCNNIMAF), is a novel architecture that integrates self-attention and spatial attention mechanisms. Segmentation was evaluated using the Dice coefficient, the IoU score, and a combined focal and Jaccard loss, while classification was evaluated using recall, F1-score, precision, and accuracy.
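
For readers unfamiliar with the two overlap metrics named above, the following is a minimal NumPy sketch of how the Dice coefficient and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou`, the smoothing constant `eps`, and the toy masks are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def dice_and_iou(pred_mask, true_mask, eps=1e-7):
    """Return (Dice, IoU) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    dice = (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)

# Toy masks for illustration only.
pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
dice, iou = dice_and_iou(pred, true)
```

Here the masks share 2 pixels, each contains 3, and their union covers 4, so Dice is about 2/3 and IoU is about 1/2.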

The organization of this paper is as follows: Section 2 reviews the literature on breast cancer segmentation and classification; Section 3 describes the proposed approach; Section 4 presents experimental results; Section 5 concludes and outlines future research directions.

2 Related Works

Osareh et al. [11] used K-nearest neighbors (KNN), Support Vector Machine (SVM), and Probabilistic Neural Network (PNN) models to classify tumor regions. The methodology was applied to two publicly available datasets: one composed of Fine Needle Aspirates of the Breast Lumps (FNAB) with 457 negative and 235 positive samples, and the other composed of 295 gene microarrays with 115 good-prognosis and 180 poor-prognosis samples. To support the classifiers, feature extraction and selection methods were used. Feature extraction techniques such as Principal Component Analysis (PCA), optimized with auto-covariance coefficients of feature vectors, reduced high-dimensional features to low-dimensional ones. Feature selection followed two approaches. The filter approach used the Relief algorithm, which selects features in a pre-processing step without considering the bias of the induction algorithm. The wrapper approach used the proposed Sequential Forward Selection (SFS) technique, which yielded a set of 15 sonographic features. Features were ranked by Signal-to-Noise Ratio (SNR) to identify the most important ones. The wrapper estimates were assessed through leave-one-out cross-validation, focusing on overall accuracy, sensitivity, specificity, and Matthews Correlation Coefficient (MCC).

Li et al. [12] introduced a patch screening method that extracts multi-size, discriminative patches from histology images to capture tissue-level and cell-level features. First, patches of 512x512 and 128x128 pixels are generated from the input data. Two ResNet50 models are then used: one is fed the 512x512 patches to extract tissue-level features, and the other is fed the 128x128 patches to extract cell-level features. Both ResNet50 models are trained by fine-tuning. The patches are then screened by grouping them into clusters based on their phenotype. To speed up the process, the patch size is reduced to obtain 1024 features, and PCA reduces these to 200 features before k-means clustering is applied. A ResNet50 fine-tuned with 128x128 patches is employed to select the clusters. Subsequently, P-norm pooling is applied to extract the final image features, and a Support Vector Machine classifies the input images into four classes: Normal, Benign, In situ carcinoma, or Invasive carcinoma.

Zheng et al. [13] introduced a DL-assisted Efficient Adaboost Algorithm (DLA-EABA) in which a Convolutional Neural Network is trained on extensive data to achieve high precision. A stacked autoencoder is used to build the deep convolutional neural network; its encoder and decoder sections contain multiple non-linear transformations applied to combined representations of the input data. An efficient Adaboost algorithm trains the classifiers, which estimate the positive value for threshold and parity by reviewing all potential combinations of both values. The deep CNN contains Long Short-Term Memory (LSTM) units with a logistic activation function as conventional artificial neurons. Softmax regression then classifies the images using the extracted features.

Lotter et al. [14] introduced a robust breast tumor classification model for mammography images that uses bounding box annotations and is extended to digital breast tomosynthesis images so that it can identify the tumor region in the image. The CNN is first trained to classify whether lesions are present in cropped image patches. Subsequently, with the entire image as input, this CNN initializes the backbone of the detection-based model, which outputs the entire image with a bounding box and a classification score. The model's performance is then evaluated on index and pre-index cancer exams by comparing its ability to identify the tumor region, with Breast Imaging Reporting and Data System (BI-RADS) scores of 1 and 2 considered negative interpretations.

Saber et al. [15] applied transfer learning to five models: ResNet50, VGG19, Inception V3, Inception V2, and VGG16. Feature extraction involved freezing the trained parameters from the source task, except for the last three layers, and transferring them to the target task. The images were preprocessed using a Median Filter, Histogram Equalization, Morphological Analysis, Segmentation, and Image Resizing. The dataset was split in an 80-20 ratio, and augmentation by rotation and flipping was applied to the training set. The newly trained layers were combined with the existing pre-trained layers, and features were extracted using these models. Classification was done by feeding the extracted features into Support Vector Machine and Softmax classifiers fine-tuned using Stochastic Gradient Descent with momentum (SGDM). Momentum damps the jitter of the gradient along high-velocity dimensions, and the accumulated past gradients help the optimizer move past saddle points.

Cho et al. [16] proposed a Breast Tumor Ensemble Classification Network (BTEC-Net) which uses improved DenseNet121 and ResNet101 models as base classifiers, where each of the four blocks is connected to a Squeeze-and-Excitation block and a Global Average Pooling layer. Next, the feature map sizes are aligned using a fully connected layer and integrated along the channel dimension. The combined feature map is then fed into a feature-level fusion module to perform binary classification. Once classification is done, segmentation is carried out with the proposed Residual Feature Selection UNet (RFS-UNet), an encoder-decoder network in which encoder and decoder layers with the same feature map size are connected by skip connections. The encoder part consists of five encoders, each comprising a convolutional layer, an RFS module, a residual convolutional block, and a max-pooling layer. Similarly, the decoder part consists of five decoders, each comprising a convolutional layer, an RFS module, a transposed convolutional layer, and a residual block. Each skip connection contains a spatial attention module that takes the output of the transposed convolution and the output of the encoder's RFS module as input; its output is concatenated with the output of the same transposed convolution layer. The segmentation process ends with a sigmoid activation function, which returns the segmented tumor region.

Dayong Wang et al. [17] introduced a novel method for automatically detecting metastatic breast cancer in whole slide images of sentinel lymph node biopsies, achieving first place in the International Symposium on Biomedical Imaging (ISBI) grand challenge. Their system delivered impressive results with an AUC of 0.925 for whole slide image classification and a 0.7051 tumor localization score, surpassing an independent pathologist’s review. By integrating the DL system’s predictions with pathologist diagnoses, a notable reduction in the error rate was achieved, showcasing the profound impact of DL on enhancing the accuracy of pathological diagnoses for breast cancer metastases.

Abdelrahman Sayed Sayed et al. [18] developed a new, economical design for a 3-RRR Planar Parallel Manipulator (PPM), aiming to overcome the challenge of deriving kinematic constraint equations for manipulators with complex nonlinear behavior. Utilizing screw theory, they computed direct and inverse kinematics and then developed a Neuro-Fuzzy Inference System (NFIS) model that was optimized with Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) to predict the position of the end-effector. The proposed PPM structure underwent investigation, with the development of its kinematic model and subsequent testing of a prototype in ADAMS, followed by fabrication for validation. Results showed that PSO outperformed GA in tuning the NFIS model, aligning closely with actual PPM data, indicating promise for enhanced robot capabilities and performance through further optimization and control strategies.

Luuk Balkenende et al. [19] presented a comprehensive review of the integration of deep learning techniques in breast cancer imaging. Their review highlights the wide-ranging applications of DL across modalities such as digital mammography, ultrasound, and magnetic resonance imaging (MRI), with a focus on tasks including lesion classification, segmentation, and predicting therapy response. Additionally, they discuss research on diagnosing breast cancer metastasis using CNNs on whole-body scintigraphy scans, and work on aiding clinicians in diagnosing axillary lymph node metastasis with a 3D CNN model on PET/CT images. They emphasize the necessity of conducting large-scale trials and addressing ethical considerations to fully harness the potential of deep learning in clinical breast cancer imaging.

Shen et al. [20] proposed a pioneering DL-based approach for detecting breast cancer on screening mammograms. Their innovative "end-to-end" algorithm efficiently utilizes training datasets with varying levels of annotation, achieving exceptional performance compared to previous methods. On independent test sets from diverse mammography platforms, the proposed method achieves per-image AUCs ranging from 0.88 to 0.98, with sensitivities between 86.1% and 86.7%. Notably, the algorithm’s transferability across different mammography platforms is demonstrated, requiring minimal additional data for fine-tuning. These results emphasize the potential of deep learning to revolutionize breast cancer screening, offering more accurate and efficient diagnostic tools for clinical applications.

Han et al. [21] introduced a novel method for breast cancer diagnosis and prognosis. Their Class Structure-based Deep CNN (CSDCNN) achieves impressive accuracy (average 93.2%) by addressing challenges in automated multi-class classification from histopathological images. Combining hierarchical feature representation and distance constraints in feature space, their methodology offers a unique solution to subtle differences among breast cancer classes. Comparative experiments highlight the superior performance of the CSDCNN compared to existing methods, positioning it as a valuable tool for clinical decision-making in breast cancer management. Their work represents a significant advancement in automated breast cancer classification, providing clinicians with a reliable diagnostic aid.

Wang et al. [22] introduced DeepGrade, a deep learning-based histological grading model aimed at improving prognostic stratification for NHG 2 tumors. Developed and validated on large-scale datasets of digital whole-slide histopathology images, DeepGrade offers a novel approach to classify NHG 1 and NHG 3 morphological patterns. By re-stratifying NHG 2 tumors into DG2-high and DG2-low groups, DeepGrade provides independent prognostic information beyond traditional risk factors. Its performance was validated internally and externally, showcasing its ability to predict recurrence risk accurately. The ensemble approach, employing 20 deep convolutional neural network models, ensures robustness and reliability in classification tasks. DeepGrade shows promise as a cost-effective alternative to molecular profiling, supported by high area under the receiver operating characteristic curve values. This innovative methodology heralds a significant advancement in histological grading for breast cancer, promising improved clinical decision-making and personalized treatment strategies. Further research should focus on validating DeepGrade across diverse patient populations and integrating it into routine clinical practice.

Sizilio et al. [23] introduced a fuzzy logic-based approach for pre-diagnosing breast cancer from Fine Needle Aspirate (FNA) analysis. Addressing the global burden of breast cancer and the variability in FNA diagnostic accuracy (65% to 98%), this method enhances reliability through computational intelligence. The research employed the Wisconsin Diagnostic Breast Cancer Data (WDBC) and proceeded through four stages: fuzzification, rule base establishment, inference processing, and defuzzification. Validation included cross-validation and expert reviews. The method achieved a sensitivity of 98.59% and a specificity of 85.43%, demonstrating high reliability in detecting malignancies but highlighting the need for improvement in identifying benign cases. This approach shows significant potential for enhancing breast cancer diagnostic accuracy.

Sarkar et al. [24] explored the use of the K-Nearest Neighbors (KNN) algorithm for diagnosing breast cancer with the Wisconsin-Madison Breast Cancer dataset. Recognized for its straightforward and efficient implementation, KNN served as a non-parametric classifier in this study. The research showed that KNN improved classification performance by 1.17% over the best-known result for the dataset. Advantages of KNN include its simplicity, effectiveness with small training sets, and no need for retraining when new data is incorporated. However, the algorithm also has significant limitations, such as substantial storage requirements for large datasets and extensive computational demands for distance calculations between test and training data. The study noted the existence of faster KNN variants, such as those using k-d trees, which have been successful in tasks like script and speech recognition. The findings highlight KNN’s potential for various diagnostic applications, even though no single algorithm is optimal for all diagnostic problems. This research emphasizes KNN’s promise in enhancing diagnostic accuracy while acknowledging its challenges with storage and computational efficiency.

Song et al. [25] introduced an ML technique aimed at accurately annotating noncoding RNAs (ncRNAs) by searching genomes to find ncRNA genes characterized by known secondary structures. Their method involves aligning sequences optimally with a structure model, a critical step for identifying ncRNAs within genomes. Acknowledging the limitations of using a single structure model, they developed an approach that processes genome sequence segments to extract feature vectors. These vectors are then classified to differentiate between ncRNA family members and other sequences. The results showed that this method captures essential features of ncRNA families more effectively and enhances the accuracy of genome annotation compared to traditional tools. This work underscores the significant role of ML in bioinformatics, particularly in improving the precision of ncRNA gene identification.

Foster et al. [26] offered a critical commentary on the integration of ML in biomedical engineering, particularly focusing on the application of support vector machines (SVMs) beyond mere statistical tools. Their analysis highlighted the inherent challenges in developing clinically validated diagnostic techniques using SVMs, emphasizing concerns such as overfitting and the imperative for robust validation procedures. Unlike studies focused on specific diseases, their research aimed to evaluate and enhance existing ML models for broader biomedical applications. The commentary serves as a cautionary perspective for researchers, reviewers, and readers, stressing the complexities and potential pitfalls in classifier development. It advocates for an integrated approach where classifier validation forms an integral part of the experimental process. This work underscores the critical need to establish the clinical validity of diagnostic tools developed through ML in biomedical research.

Wei et al. [27] proposed a method for improving microcalcification classification in breast cancer diagnosis using content-based image retrieval (CBIR) combined with ML. Their approach uses CBIR to retrieve similar mammogram cases, enhancing the performance of a support vector machine (SVM) classifier. By incorporating local proximity information from retrieved cases, the adaptive SVM achieved a notable increase in classification accuracy from 78% to 82%, as measured by the area under the ROC curve. This method aims to provide radiologists with enhanced diagnostic support, serving as a valuable "second opinion" tool. Despite these advancements, the study acknowledges limitations in dataset size, which may affect generalizability. These findings underscore the potential of CBIR-assisted classification for improving the precision of breast cancer diagnostics, and the need for larger clinical datasets to confirm its efficacy and applicability in real-world clinical settings.

3 Proposed Work

3.1 Methodology

Figure 1: Samples of Breast Ultrasound Images and Masks (overlap) from the Dataset

The ultrasound images are first augmented to handle class imbalance. Following augmentation, the images are preprocessed through a sequence of steps: gamma correction, Gaussian filtering, image resizing, and normalization. Pixel values in ultrasound images can reflect non-linearities, especially in high- or low-intensity regions, and gamma correction helps compensate for them, leading to more accurate and visually clearer images. Gaussian filtering is then applied to reduce noise while retaining edge detail. To preserve consistency throughout the dataset and facilitate batch processing, the images are resized so that every input to the proposed DL model has the same dimensions. Normalizing the pixel values to a fixed range helps the activation functions capture the non-linearities in the data; here, the images are scaled to the range [0, 1]. The preprocessed images are then fed to the proposed Spatial-Channel Attention LinkNet Framework with InceptionResNet backbone to segment the tumor region. The segmented tumor maps are in turn fed to the proposed DCNNIMAF classifier, which classifies the segmented mass as benign, normal, or malignant. The overall workflow of the proposed work is presented in Figure 2.

3.2 Dataset Exploration

Figure 2: Overall Workflow Diagram of Proposed Work

The data used in this work was obtained from the Breast Ultrasound Images Dataset [28], made available by Arya Shah on Kaggle. It contains a total of 780 ultrasound images along with their corresponding segmented ground truth masks, split into three categories: benign, malignant, and normal. Figure 1 shows a sample of ultrasound images from the dataset overlaid with their corresponding segmentation masks.

The dataset exhibits a significant class imbalance: benign samples make up 56.5% of the data, while malignant and normal samples account for only 26.7% and 16.9%, respectively. This distribution is shown graphically in Figure 3. To mitigate the imbalance and avoid bias during the training of the segmentation and classification models, augmentation techniques are used. Specifically, random crop, random rotation, random zoom, random shear, and random exposure methods were applied to augment the images belonging to the 'normal' and 'malignant' classes.

Figure 3: Training Data Distribution of Breast Ultrasound Images Before Augmentation

The rationale behind this augmentation approach is to raise the data count of the 'normal' and 'malignant' classes so that they align more closely with the larger 'benign' class. Increasing the training data for these two classes mitigates the effects of class imbalance and enables the models to learn effectively from all classes. The segmentation and classification models are thus trained on a more balanced dataset, which improves their ability to accurately segment, identify, and characterize breast tumors across different classes. The resulting well-balanced distribution of each category is shown in Figure 4.

Figure 4: Training Data Distribution of Breast Ultrasound Images After Augmentation
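
The balancing idea described in this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the per-class counts are hypothetical placeholders, the helper names `augmentation_targets` and `random_crop_pair` are invented for this example, and only the random-crop transform is shown. Because the same images feed a segmentation model, the sketch applies the identical crop window to the image and its mask.

```python
import numpy as np

def augmentation_targets(class_counts):
    """Extra images each class needs to reach the size of the largest class."""
    target = max(class_counts.values())
    return {name: target - n for name, n in class_counts.items()}

def random_crop_pair(image, mask, crop_h, crop_w, rng):
    """Crop the same random window from an image and its ground-truth mask."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return image[window], mask[window]

# Hypothetical counts for illustration only, not the dataset's actual split.
counts = {"benign": 400, "malignant": 200, "normal": 120}
extra = augmentation_targets(counts)

# Synthetic image and mask standing in for an ultrasound sample.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
mask = (image > 0.5).astype(np.uint8)
img_crop, mask_crop = random_crop_pair(image, mask, 48, 48, rng)
```

With these placeholder counts, the majority class needs no extra samples, while the two minority classes need 200 and 280 augmented images respectively.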

3.3 Preprocessing

Following augmentation, the images pass through a preprocessing pipeline consisting of four stages: gamma correction, Gaussian filtering, resizing, and image normalization. The output images obtained after each preprocessing step are shown in Figure 5.

Figure 5: Breast Ultrasound Images Observed After Each Preprocessing Step

3.3.1 Gamma Correction

Gamma correction serves as the initial preprocessing step tailored specifically for breast ultrasound images. It plays a pivotal role in enhancing the visibility of crucial anatomical structures and subtle details within the images. By adjusting the image’s brightness and contrast, gamma correction improves the delineation of tumor boundaries and enhances the visibility of tumor features. This step is particularly critical in breast cancer tumor segmentation, where accurate visualization of tumor margins is essential for precise delineation and subsequent analysis.

Gamma correction can be represented mathematically as follows:

$$\underbrace{I_{out}}_{\text{Output Pixel Intensity}} = \left(\underbrace{I_{in}}_{\text{Input Pixel Intensity}}\right)^{\gamma} \tag{1}$$
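
As a minimal sketch of Eq. (1), assuming intensities have already been scaled to [0, 1], gamma correction reduces to an element-wise power. The function name `gamma_correct` and the value gamma = 0.5 are illustrative assumptions; the paper does not state the gamma value it used.

```python
import numpy as np

def gamma_correct(image, gamma):
    """Apply I_out = I_in ** gamma to an image with intensities in [0, 1]."""
    image = np.asarray(image, dtype=np.float64)
    if image.min() < 0.0 or image.max() > 1.0:
        raise ValueError("expected intensities in [0, 1]")
    return np.power(image, gamma)

pixels = np.array([0.0, 0.25, 1.0])
# gamma < 1 brightens dark regions; gamma > 1 darkens them (0.5 is assumed here).
brightened = gamma_correct(pixels, 0.5)
```

The endpoints 0 and 1 are left unchanged, while the mid-tone 0.25 is lifted to 0.5.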

3.3.2 Gaussian Filtering

Following gamma correction, Gaussian filtering is employed to mitigate speckle noise, a common artifact in ultrasound images that can obscure tumor boundaries and hinder accurate segmentation. By selectively smoothing out noise while preserving essential details, Gaussian filtering improves the clarity of tumor features and enhances the accuracy of segmentation algorithms. This step is crucial in breast cancer tumor segmentation and classification, as it reduces noise artifacts and improves the fidelity of tumor delineation, leading to more accurate and reliable segmentation results. Gaussian filtering ensures that the images are cleaner and more conducive to subsequent segmentation and classification tasks, facilitating the accurate identification and characterization of breast tumors.

Gaussian filtering is given by:

$$\underbrace{I_{out}(x,y)}_{\text{Output Image}} = \sum_{i=-\frac{N}{2}}^{\frac{N}{2}} \sum_{j=-\frac{N}{2}}^{\frac{N}{2}} \underbrace{I_{in}(x+i,\,y+j)}_{\text{Input Image}} \cdot \underbrace{G(i,j)}_{\text{Gaussian Kernel}} \tag{2}$$
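
The double sum in Eq. (2) can be evaluated directly with NumPy, as in the sketch below. It assumes an odd kernel width (so the offsets run from -N/2 to N/2 with integer division), a normalized kernel, and reflect padding at the image borders. The kernel size, sigma, padding mode, and function names are assumptions for illustration rather than settings reported in the paper.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    offsets = np.arange(-half, half + 1)
    g1d = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g1d, g1d)          # separable 2-D Gaussian
    return kernel / kernel.sum()         # weights sum to 1

def gaussian_filter(image, size=5, sigma=1.0):
    """Evaluate Eq. (2) at every pixel of a 2-D image, reflecting at borders."""
    image = np.asarray(image, dtype=np.float64)
    half = size // 2
    kernel = gaussian_kernel(size, sigma)
    padded = np.pad(image, half, mode="reflect")
    out = np.zeros_like(image)
    h, w = image.shape
    # Accumulate one shifted, weighted copy of the image per kernel entry.
    for i in range(size):
        for j in range(size):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

rng = np.random.default_rng(0)
noisy = rng.random((32, 32))             # synthetic stand-in for a noisy image
smoothed = gaussian_filter(noisy, size=5, sigma=1.0)
```

Because the weights sum to one, flat regions keep their intensity while pixel-to-pixel fluctuations are averaged down.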

3.3.3 Ultrasound Image Resizing

Once the images have undergone gamma correction and Gaussian filtering, resizing is performed to standardize image dimensions, facilitating compatibility with segmentation and classification algorithms. Standardized image dimensions are essential for ensuring consistency and comparability across different datasets and analysis pipelines. Resizing enables researchers to create a uniform framework for analysis, simplifying the processing pipeline and reducing computational complexity.

Resizing of breast cancer images can be represented mathematically as:

$$\underbrace{I_{out}(x^{\prime},y^{\prime})}_{\text{Output Image}} = \underbrace{I_{in}\left(\frac{x^{\prime}}{r_{x}},\frac{y^{\prime}}{r_{y}}\right)}_{\text{Input Image}} \tag{3}$$

where $r_{x}$ and $r_{y}$ are the horizontal and vertical scale factors.
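The coordinate mapping in Eq. (3) can be sketched with nearest-neighbor sampling, where each output pixel reads the input pixel found by dividing its coordinates by the scale factors. The interpolation method is an assumption here, since the paper does not specify one, and `resize_nearest` is a hypothetical helper name.

```python
import numpy as np

def resize_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an (H, W) or (H, W, C) image by nearest-neighbor sampling."""
    in_h, in_w = image.shape[:2]
    r_y, r_x = out_h / in_h, out_w / in_w  # scale factors
    # Map every output coordinate back to a source coordinate.
    src_rows = np.minimum((np.arange(out_h) / r_y).astype(int), in_h - 1)
    src_cols = np.minimum((np.arange(out_w) / r_x).astype(int), in_w - 1)
    return image[src_rows[:, None], src_cols[None, :]]
```

For example, `resize_nearest(img, 256, 256)` would bring an image to the (256, 256) shape listed in Table 1.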

3.3.4 Pixel Normalization

The last preprocessing stage, normalization, scales the pixel values of images to a standardized range, typically from 0 to 1. This normalization process is crucial for ensuring consistency in pixel intensity across different images, which is essential for training machine learning models and neural networks. Normalization enhances the comparability of images and improves the convergence speed of machine learning algorithms during training. By eliminating variations in intensity that may arise due to differences in acquisition parameters or imaging conditions, normalization ensures that segmentation and classification algorithms can learn effectively from the data, leading to more accurate and reliable analysis results.

The normalization process is denoted as:

$$\underbrace{I_{out}}_{\text{Output Image}} = \frac{\underbrace{I_{in}}_{\text{Input Image}} - \underbrace{\min(I_{in})}_{\text{Minimum Of Input Image}}}{\underbrace{\max(I_{in})}_{\text{Maximum Of Input Image}} - \underbrace{\min(I_{in})}_{\text{Minimum Of Input Image}}} \tag{4}$$
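A minimal sketch of the min-max scaling in Eq. (4) is given below. The handling of a constant image, where the denominator would be zero, is an assumption added for robustness and is not discussed in the paper.

```python
import numpy as np

def min_max_normalize(image: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale pixel values to [0, 1] as in Eq. (4)."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi - lo < eps:
        # Constant image: return zeros instead of dividing by zero (assumption).
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)
```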

3.4 Dual Attention and CNN Backbone Enhanced LinkNet Framework for Breast Cancer Segmentation

This section presents the proposed framework for breast cancer tumor segmentation, which uses a LinkNet framework with an InceptionResNet backbone and a dual spatial-channel attention mechanism. The segmentation model takes preprocessed breast ultrasound images and their corresponding ground truth masks as input and produces the predicted segmentation map as output.

The LinkNet architecture [29] is a deep learning model designed for semantic segmentation tasks, particularly in the context of biomedical imaging. The encoder of the proposed framework is built using the InceptionResNet CNN model [30], which is designed to capture contextual information from the entire input image. The decoder is a series of transpose convolution layers with dual spatial-channel attention mechanisms incorporated within the decoder blocks.

3.4.1 Encoder Section

The encoder section of the segmentation architecture is designed using an InceptionResNet CNN backbone and thus consists of a stem block, three types of InceptionResNet blocks, and two types of reduction blocks.

The stem block begins with three convolution layers, followed by a max pooling layer and a convolution layer that operate in parallel. Their outputs are merged by a filter concatenation layer and then split into two parallel paths, one containing two convolutional layers and the other four. The two paths are merged by filter concatenation, followed by another parallel convolution and max pooling pair whose outputs are again merged by filter concatenation.

$$\underbrace{\text{Conv}(I_{(i,j)},F)}_{\text{Convolutional Operation}} = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} \underbrace{I_{(i+m,\,j+n)}}_{\text{Input feature map}} \cdot \underbrace{F_{(m,n)}}_{\text{Filter}} + \underbrace{b}_{\text{Bias}} \tag{5}$$
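Eq. (5) can be made concrete with a small single-channel sketch. Stride 1 and no padding are assumptions chosen to keep the example short; the layers in the actual backbone operate on multi-channel feature maps with their own strides and padding.

```python
import numpy as np

def conv2d_valid(feature_map: np.ndarray, kernel: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """Eq. (5) for one channel: sum of I(i+m, j+n) * F(m, n), plus bias."""
    M, N = kernel.shape
    out_h = feature_map.shape[0] - M + 1
    out_w = feature_map.shape[1] - N + 1
    out = np.full((out_h, out_w), bias, dtype=np.float64)
    # Each kernel entry contributes one shifted copy of the input.
    for m in range(M):
        for n in range(N):
            out += kernel[m, n] * feature_map[m:m + out_h, n:n + out_w]
    return out
```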

The InceptionResNet blocks are of three types, named A, B, and C. Block A is composed of three parallel paths and a residual connection. The first path consists of a single convolution operation, while the second and third paths consist of three and two convolution operations, respectively. The outputs of the three paths are combined by another convolution operation and then concatenated with the residual connection. Blocks B and C share the same structure and differ mainly in the size of their feature maps, since an average pooling operation downsamples the data between block B and block C. Each is composed of two parallel paths, one with three convolution operations and the other with one, whose outputs are combined by another convolution operation. A residual connection is then combined with this result through a further convolution operation.

$$\underbrace{\text{Concatenation}(A,B)_{(i,j,k)}}_{\text{Concatenation Operation}} = \begin{cases} \underbrace{A_{(i,j,k)}}_{\text{Input feature map}} & \text{if } 1 \leq k \leq \text{depth}(A) \\ \underbrace{B_{(i,j,k-\text{depth}(A))}}_{\text{Input feature map}} & \text{if } \text{depth}(A) < k \leq \text{depth}(A)+\text{depth}(B) \end{cases} \tag{6}$$
$$\underbrace{\text{ReLU}(x)}_{\text{Rectified Linear Unit Activation}} = \begin{cases} \underbrace{x}_{\text{Feature map}} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
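The channel-wise concatenation of Eq. (6) and the ReLU of Eq. (7) can be sketched as follows. The (H, W, C) layout with channels on the last axis is an assumption for illustration.

```python
import numpy as np

def concatenate_channels(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Eq. (6): stack two (H, W, C) feature maps along the channel axis."""
    if a.shape[:2] != b.shape[:2]:
        raise ValueError("spatial dimensions must match")
    return np.concatenate([a, b], axis=-1)

def relu(x: np.ndarray) -> np.ndarray:
    """Eq. (7): keep positive values, set the rest to zero."""
    return np.maximum(x, 0.0)
```

The output depth is depth(A) + depth(B), with the channels of A first and those of B after them.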

The reduction blocks come in two variants, A and B. Block A begins with a filter concatenation operation whose output is split into three paths. The first path is a max pooling operation, the third is a single convolution operation, and the second is composed of three convolution layers. The three paths are then combined by a filter concatenation operation. Block B is composed of four parallel paths: a max pooling path and a path of three convolution operations, as in block A, plus two further paths of two convolution operations each. All four paths are combined by a filter concatenation operation.

$$\underbrace{\text{MaxPooling}(I)_{(i,j)}}_{\text{Max Pooling Operation}} = \max_{p=0}^{k-1} \max_{q=0}^{k-1} \underbrace{I_{(i \cdot s+p,\, j \cdot s+q)}}_{\text{Input feature map}} \tag{8}$$
$$\underbrace{\text{FilterConcat}(F_{1},F_{2})(X)}_{\text{Filter Concatenation}} = \underbrace{\text{Concatenation}(\text{Conv}(X,F_{1}),\text{Conv}(X,F_{2}))}_{\text{Concatenated Convolutions}} \tag{9}$$
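The max pooling operation of Eq. (8), with window size k and stride s, can be sketched for a single-channel map as below. The default values k = 2 and s = 2 are assumptions for the example, not settings taken from the paper.

```python
import numpy as np

def max_pool2d(x: np.ndarray, k: int = 2, s: int = 2) -> np.ndarray:
    """Eq. (8): maximum over each k x k window, moved with stride s."""
    out_h = (x.shape[0] - k) // s + 1
    out_w = (x.shape[1] - k) // s + 1
    out = np.full((out_h, out_w), -np.inf)
    # For each offset (p, q) inside the window, compare a strided view of x.
    for p in range(k):
        for q in range(k):
            window = x[p:p + s * out_h:s, q:q + s * out_w:s]
            out = np.maximum(out, window)
    return out
```

With s greater than 1 the spatial dimensions shrink, which is the downsampling role the reduction blocks play in the encoder.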

3.4.2 Decoder Section

The decoder section is composed of decoder blocks, spatial-channel attention blocks, convolution and transpose convolution layers, and a softmax activation function. Each decoder block applies a convolution followed by batch normalization, then a transpose convolution followed by batch normalization, and finally another convolution followed by batch normalization.

$$\underbrace{\text{BN}(x)}_{\text{Batch Normalization}} = \underbrace{\gamma}_{\text{Learnable parameter}} \left( \frac{\underbrace{x}_{\text{Input}} - \underbrace{\mu}_{\text{Mean}}}{\sqrt{\underbrace{\sigma^{2}}_{\text{Variance}} + \underbrace{\epsilon}_{\text{Small constant}}}} \right) + \underbrace{\beta}_{\text{Learnable parameter}} \tag{10}$$
$$\underbrace{\text{TransposeConv}(X,K)_{(i,j,d)}}_{\text{Transpose Convolution}} = \sum_{p=0}^{F-1}\sum_{q=0}^{F-1}\sum_{c=0}^{C-1} \underbrace{X_{(i+s \cdot p,\, j+s \cdot q,\, c)}}_{\text{Input feature map}} \cdot \underbrace{K_{(p,q,c,d)}}_{\text{Filter kernel}} \tag{11}$$
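The two operations used inside each decoder block can be sketched as follows. Batch normalization follows Eq. (10) with per-channel batch statistics. The transpose convolution is written in the common scatter form for a single channel, in which every input pixel adds a scaled copy of the kernel to the output; this indexing is an assumption chosen to show the upsampling effect and is not a literal transcription of Eq. (11).

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Eq. (10) for x of shape (B, H, W, C), using per-channel batch statistics."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def transpose_conv2d(x: np.ndarray, kernel: np.ndarray, stride: int = 2) -> np.ndarray:
    """Single-channel transpose convolution (scatter form, no padding)."""
    H, W = x.shape
    F = kernel.shape[0]
    out = np.zeros(((H - 1) * stride + F, (W - 1) * stride + F))
    for i in range(H):
        for j in range(W):
            # Each input pixel stamps a scaled kernel onto the output grid.
            out[i * stride:i * stride + F, j * stride:j * stride + F] += x[i, j] * kernel
    return out
```

With a 2 x 2 kernel and stride 2, a 2 x 2 input becomes a 4 x 4 output, which is how the decoder recovers spatial resolution.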

The spatial-channel attention block begins with two double convolution operations applied in parallel, whose output feature maps are added together. The sum is passed through a ReLU activation to introduce non-linearity, followed by another convolution operation and a sigmoid activation function that restricts the values to the range 0 to 1.

$$\underbrace{\text{AveragePooling}(X)_{(i,j)}}_{\text{Average Pooling}} = \frac{1}{k^{2}} \sum_{p=0}^{k-1}\sum_{q=0}^{k-1} \underbrace{X_{(i \cdot s+p,\, j \cdot s+q)}}_{\text{Input feature map}} \tag{12}$$
$$\underbrace{\text{Sigmoid}(x)}_{\text{Sigmoid Activation}} = \frac{1}{1+e^{-\underbrace{x}_{\text{Input}}}} \tag{13}$$
$$\underbrace{\text{Addition}(A,B)_{(i,j)}}_{\text{Addition Operation}} = \underbrace{A_{(i,j)}}_{\text{Input feature map}} + \underbrace{B_{(i,j)}}_{\text{Input feature map}} \tag{14}$$
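The spatial attention path described above (parallel convolutions, addition, ReLU, convolution, sigmoid) can be sketched as below. Several simplifications are assumptions of this example: each double convolution is reduced to a single 1 x 1 convolution, written as a matrix product over the channel axis, and the resulting mask is applied to the first input by elementwise multiplication. The function and weight names are hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Eq. (13)."""
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat_a, feat_b, w_a, w_b, w_psi):
    """feat_a, feat_b: (H, W, C); w_a, w_b: (C, C_mid); w_psi: (C_mid, 1)."""
    # A 1x1 convolution is a per-pixel matrix product over channels.
    branch_a = feat_a @ w_a
    branch_b = feat_b @ w_b
    fused = np.maximum(branch_a + branch_b, 0.0)  # Eq. (14), then ReLU
    mask = sigmoid(fused @ w_psi)                 # one weight in (0, 1) per pixel
    return feat_a * mask, mask
```

Pixels whose mask value is close to 1 pass through almost unchanged, while those close to 0 are suppressed.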

This is followed by channel attention. The channel attention block applies max pooling and average pooling in parallel and feeds the resulting feature maps to a shared multi-layer perceptron (MLP). The shared MLP is composed of a flatten layer, a Gaussian error linear unit (GELU) activation function, and a dropout layer, after which the flatten and dropout operations are applied once more. The decoder ends with a transpose convolution operation, two convolution operations, and a softmax activation function, which produces the segmented output.

$$y = \underbrace{\text{GELU}(W \cdot x + b)}_{\text{Gaussian Error Linear Unit}} \tag{15}$$
$$\underbrace{\text{GELU}(x)}_{\text{Gaussian Error Linear Unit}} = x \cdot \underbrace{\Phi(x)}_{\text{Standard normal CDF}} \tag{16}$$
$$\underbrace{O_{(i,j)}}_{\text{Output feature map}} = \frac{\underbrace{I_{(i,j)}}_{\text{Input feature map}}}{1 - \underbrace{\text{rate}}_{\text{Dropout rate}}} \tag{17}$$
$$\underbrace{\text{softmax}(z)_{i}}_{\text{Softmax Activation}} = \frac{e^{\underbrace{z_{i}}_{\text{Output score}}}}{\sum_{j} e^{z_{j}}} \tag{18}$$
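Eqs. (15) to (18) and the channel attention path can be sketched as follows. The way the two pooled descriptors are merged (summing the shared-MLP outputs and passing them through a sigmoid) is an assumption of this example; the paper describes the pooling and the shared MLP but not the merge. The weight matrices `w1` and `w2` are hypothetical, and dropout is shown as a separate function rather than inside the MLP.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    """Eq. (16): x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def dropout(x, rate, rng, training=True):
    """Zero units with probability `rate`; rescale survivors as in Eq. (17)."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

def softmax(z, axis=-1):
    """Eq. (18), shifted by the maximum for numerical stability."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat, w1, w2):
    """feat: (H, W, C); w1: (C, C_hidden); w2: (C_hidden, C)."""
    avg = feat.mean(axis=(0, 1))  # global average pooling, one value per channel
    mx = feat.max(axis=(0, 1))    # global max pooling, one value per channel

    def shared_mlp(v):
        return gelu(v @ w1) @ w2  # Eq. (15) followed by a linear projection

    logits = shared_mlp(avg) + shared_mlp(mx)
    weights = 1.0 / (1.0 + np.exp(-logits))  # per-channel weight in (0, 1)
    return feat * weights
```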
Figure 6: Spatial-Channel Attention LinkNet Framework with InceptionResNet Backbone Layer Architecture for Breast Cancer Segmentation

3.4.3 Workflow and Execution

Initially, the image is processed through a stem block, which captures low-level features such as edges and textures. Edges outline the boundaries of potential tumors, while textures reveal the internal structure of these masses, which often differs significantly between healthy tissue and malignancies. Following the stem block, the image progresses through five InceptionResNet-A blocks. Their parallel paths capture a broad spectrum of features at various scales, while the residual connections facilitate deeper networks by mitigating the vanishing gradient problem. This multi-scale feature capture is crucial for analyzing ultrasound images of breast tissue, where abnormalities can manifest at different scales: microcalcifications require finer resolution, whereas architectural distortions span larger areas.

Subsequently, the image encounters a reduction block, which reduces the spatial dimensions of the feature maps. This reduction allows the model to focus on higher-level features and significantly reduces computational complexity, facilitating more efficient processing. This is particularly useful for identifying broader patterns indicative of cancer, such as the overall shape and orientation of a mass.

The image then passes through two InceptionResNet-B blocks and, after them, through another stem block. The B blocks refine the detection of mid-level features, such as more nuanced textural patterns and subtle edge variations. The repeated stem block extracts additional low-level features that complement the more complex features identified in the intermediate stages, ensuring the model has a comprehensive grasp of the ultrasound image’s content.

Following this, the image passes through a reduction block, which reduces the spatial dimensions of the feature maps. This reduction allows the model to focus on low-level features, essential for the precise delineation of tumor boundaries. The image then enters five InceptionResNet Block C layers, and finally into an average pooling layer. These operations are optimized for extracting high-level semantic features for differentiating between various types of tissues present in the image.

This compressed feature map is then fed to the LinkNet decoder which transforms the abstracted feature map into a spatially coherent segmentation map. In the decoder, upsampling refines the segmentation map generated by the encoder, and the attention mechanism focuses specifically on the tumor region, enhancing its emphasis. By integrating spatial and channel attention mechanisms, the model can enhance feature maps by emphasizing spatial locations and informative channels. This comprehensive approach improves the model’s capability to understand intricate tumor patterns and structures, thereby enhancing segmentation performance.

Initially, the feature map is fed to two convolutional blocks, followed by a spatial-channel attention block; this is repeated three times. These layers perform a preliminary enhancement of the map, sharpening the details and adjusting the contrast to make the underlying structures more prominent. This ensures that the feature map contains clear and distinguishable elements that correspond to the anatomical structures within the breast ultrasound images. It is then passed to the first decoder block.

The initial decoder block is designed to capture high-level semantic features essential for segmenting larger structures within breast ultrasound images. It facilitates the reconstruction of the spatial relationships and contextual information abstracted away during the encoding process. The spatial-channel attention block that follows this decoder block scrutinizes the feature map to identify and accentuate the regions that are most likely to contain tumor structures. This is achieved by assigning higher weights to the spatial locations that exhibit characteristics typical of tumors, such as irregular shapes and unusual textural patterns. The channel attention mechanism analyses the feature map across different channels to determine the ones that carry the most relevant information for segmentation. By amplifying the signals from these informative channels, the model can better discern the unique features that differentiate tumor tissue from the surrounding healthy tissue.

Finer textures and structures within the tumors are captured as the feature map moves up to the second decoder block. The spatial-channel attention block adjusts the feature map’s weights to emphasize the spatial locations where these detailed features are most prominent, resulting in more precise segmentation of smaller tumor components. Channel attention further identifies the most relevant feature map channels for the task, focusing on the texture and shape of the tumors.

In the third decoder block, the feature map captures even more detailed features, including intricate patterns and structures within the tumors. The attention mechanism in this block focuses on the boundaries of the tumor region, which helps the model improve the quality of the produced segmentation map. This makes the output more accurate and minimizes extraneous markings.

The final decoder block is responsible for capturing the most detailed features, including the specific patterns and structures unique to each tumor. The attention mechanism allows the model to distinguish between benign and malignant types of tumors and identify subtle variations within a single tumor type. The output from this decoder block is transpose-convolved to ensure a consistent output shape of the segmentation map, followed by convolutions to correct the output channels. This finally transforms the abstracted feature map into a detailed and accurate segmentation map.

The model was trained by backpropagating a custom loss function (21), defined as the sum of the focal loss (19) and the Jaccard (IoU) loss (20).

$$\underbrace{\text{loss}_{\text{focal}}(p_{t})}_{\text{Focal loss}} = -(1-\underbrace{p_{t}}_{\text{True class probability}})^{\underbrace{\gamma}_{\text{Focal loss focusing parameter}}} \log(\underbrace{p_{t}}_{\text{True class probability}}) \tag{19}$$
$$\underbrace{\text{loss}_{\text{Jaccard}}}_{\text{Jaccard loss}} = 1 - \frac{\underbrace{|V_{p} \cap V_{g}|}_{\text{Intersection of predicted and ground truth}}}{\underbrace{|V_{p} \cup V_{g}|}_{\text{Union of predicted and ground truth}}} \tag{20}$$
$$\underbrace{\text{loss}_{\text{total}}}_{\text{Total loss}} = \underbrace{\text{loss}_{\text{focal}}}_{\text{Focal loss}} + \underbrace{\text{loss}_{\text{Jaccard}}}_{\text{Jaccard loss}} \tag{21}$$
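The combined objective of Eqs. (19) to (21) can be sketched for a binary tumor mask as below. The focusing parameter gamma = 2.0 is an assumed default, since the paper does not report the value used, and the Jaccard term uses the soft (product and sum) form of intersection and union so that it stays differentiable for probabilistic predictions.

```python
import numpy as np

def focal_loss(p_true: np.ndarray, gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Eq. (19), averaged over pixels; p_true is the true-class probability."""
    p = np.clip(p_true, eps, 1.0 - eps)
    return float(np.mean(-((1.0 - p) ** gamma) * np.log(p)))

def jaccard_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Eq. (20) with soft intersection and union."""
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return float(1.0 - (intersection + eps) / (union + eps))

def total_loss(pred: np.ndarray, target: np.ndarray, gamma: float = 2.0) -> float:
    """Eq. (21): focal term plus Jaccard term for a binary mask."""
    # Probability the model assigns to the correct class at each pixel.
    p_true = np.where(target == 1, pred, 1.0 - pred)
    return focal_loss(p_true, gamma) + jaccard_loss(pred, target)
```

The focal term down-weights pixels that are already classified confidently, while the Jaccard term rewards overlap between the predicted and ground truth regions as a whole.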

The model specifications and parameters of the proposed Spatial-Channel Attention LinkNet Framework with InceptionResNet Backbone are shown in Table 1.

Table 1: Dual Attention and CNN Backbone Enhanced LinkNet Segmentation Framework
| Parameter | Value |
| --- | --- |
| Total Trainable Parameters | 57,881,011 |
| Learning Rate | 0.001 |
| Epochs | 100 |
| Image Shape | (256, 256) |
| Batch Size | 16 |

3.5 Multi-Attention Integrated Deep CNN Framework for Breast Cancer Classification

This section presents the proposed deep learning model for breast cancer classification, coined Deep CNN with an Integrated Multi-Attention Framework (DCNNIMAF). With multiple attention modules integrated within its architecture, the proposed approach is designed to classify breast ultrasound images into malignant, benign, or normal categories. The model takes preprocessed breast ultrasound images as input and outputs the predicted class of each image.

The model architecture of DCNNIMAF integrates several pivotal blocks designed to extract pertinent features from the input breast ultrasound images. These blocks include convolutional blocks, double convolutional blocks, self-attention blocks, and fully connected layers. Each block plays a crucial role in feature extraction and classification. The layer architecture diagram of the proposed DCNNIMAF model is shown in Figure 7.

Figure 7: DCNNIMAF Classification Layer Architecture for Breast Cancer Classification

3.5.1 Convolutional Block

The convolutional block within DCNNIMAF consists of a convolutional layer, followed by a batch normalization layer, and finally an activation layer. The activation function used varies between Leaky ReLU and SiLU in different convolutional blocks. The operations performed by the block on the input feature map are mathematically represented as follows:

\[
\underbrace{O_{i,j}}_{\text{Output feature map value}} = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} \underbrace{I_{i+m,j+n}}_{\text{Input pixel value}} \cdot \underbrace{F_{m,n}}_{\text{Filter weight}} + \underbrace{b}_{\text{Bias}} \tag{22}
\]
\[
\underbrace{\text{BN}(x)}_{\text{Batch Normalization}} = \underbrace{\gamma}_{\text{Scale parameter}} \frac{\underbrace{x}_{\text{Input value}} - \underbrace{\mu}_{\text{Mean}}}{\sqrt{\underbrace{\sigma^{2}}_{\text{Variance}} + \underbrace{\epsilon}_{\text{Small constant}}}} + \underbrace{\beta}_{\text{Shift parameter}} \tag{23}
\]
\[
\text{LeakyReLU}(x) = \begin{cases} \underbrace{x}_{\text{Input value}} & \text{if } x > 0 \\ \underbrace{\alpha x}_{\text{Leaky slope}} & \text{otherwise} \end{cases} \tag{24}
\]
\[
\underbrace{\text{SiLU}(x)}_{\text{Sigmoid Linear Unit}} = \frac{x}{1 + e^{-x}} \tag{25}
\]
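The four operations in Eqs. (22) to (25) can be sketched in NumPy as follows. This is a simplified single-channel illustration, not the DCNNIMAF implementation: it uses no padding, computes the batch-normalization statistics from the feature map itself, and assumes a leaky slope of 0.01, which the text does not specify.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Unpadded sliding-window sum of a 2-D image with an M x N kernel, Eq. (22)."""
    M, N = kernel.shape
    H, W = image.shape
    out = np.empty((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + M, j:j + N] * kernel) + bias
    return out

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. (23), with mean and variance taken from x itself."""
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def leaky_relu(x, alpha=0.01):
    """Eq. (24); alpha = 0.01 is an assumed slope."""
    return np.where(x > 0, x, alpha * x)

def silu(x):
    """Eq. (25)."""
    return x / (1.0 + np.exp(-x))

def conv_block(image, kernel, activation=leaky_relu):
    """Convolution, then batch normalization, then activation."""
    return activation(batch_norm(conv2d(image, kernel)))

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
kernel = np.ones((3, 3)) / 9.0                     # toy averaging filter
features = conv_block(image, kernel, activation=silu)
print(features.shape)
```

Passing `activation=leaky_relu` or `activation=silu` mirrors the way the activation varies between convolutional blocks.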

3.5.2 Double Convolutional Block

The double convolutional block comprises two consecutive convolutional layers with 256 filters, a kernel size of 3, and a padding of 1. Mathematically, the operation of this block can be represented as

\[
\underbrace{O}_{\text{Output}} = \underbrace{\text{conv}_{2}}_{\text{Second convolution}}\left(\underbrace{\text{conv}_{1}}_{\text{First convolution}}\left(\underbrace{I}_{\text{Input}}\right)\right) \tag{26}
\]
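A short sketch of how the kernel size and padding affect the spatial size of the feature map may help here. It uses the standard output-size formula for a convolution and assumes a stride of 1, which the text does not state; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 256
after_first = conv_output_size(size, kernel=3, padding=1)
after_second = conv_output_size(after_first, kernel=3, padding=1)
print(after_first, after_second)  # both convolutions keep the 256-pixel extent
```

Under the same stride assumption, a kernel size of 3 with a padding of 2 would enlarge each spatial dimension by two pixels, and a kernel size of 4 with a padding of 2 by one pixel.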

3.5.3 Self-Attention Block

The self-attention block in DCNNIMAF computes the attention weights $\alpha_{ij}$ for each pair of positions $(i, j)$ within the feature map.

\[
\underbrace{\alpha_{ij}}_{\text{Attention weight for position }(i,j)} = \underbrace{\text{softmax}\left(\frac{\underbrace{QK^{T}}_{\text{Dot product of query and key}}}{\sqrt{\underbrace{d_{k}}_{\text{Dimensionality of key vectors}}}}\right)}_{\text{Softmax normalization}} \underbrace{V}_{\text{Value matrix}} \tag{27}
\]
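The scaled dot-product operation in Eq. (27) can be sketched as below. This is a generic single-head illustration under assumed shapes: the feature map is flattened to a (positions, channels) matrix, and the projection matrices `Wq`, `Wk`, and `Wv` are random placeholders rather than learned DCNNIMAF weights. In this sketch the softmax term holds the pairwise weights, and multiplying it by $V$ gives the attended features.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X has shape (positions, channels). Returns (attended features, weights)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # shape (positions, positions)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))                       # e.g. a 4x4 map with 8 channels, flattened
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every output position is a weighted average of the value vectors at all positions.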

3.5.4 Workflow and Execution

The flow of information through DCNNIMAF begins with an input layer of shape (256, 256, 3). First, the segmentation map passes through a convolutional block with 512 filters, a kernel size of 3, and a padding of 2, which extracts low-level features such as textures and edges from the input. The output of this block is then passed through a second convolutional block with 256 filters and the same kernel size and padding, but with SiLU activation. The SiLU activation adds non-linearity for higher-level feature extraction, which helps to distinguish between breast tissue characteristics indicative of cancerous growth.

Subsequently, a double convolutional block is applied to further refine feature extraction. By employing consecutive convolutional layers with 256 filters each, this block extracts deeper and more abstract features from the input. Following this, a convolutional block with 128 filters, a kernel size of 4, and a padding of 2 is employed, with leaky ReLU activation. This operation distills the extracted features into more compact and discriminative representations, helping the model detect and interpret complex patterns within the tumor's structure, ranging from textural anomalies to irregular shapes. Continuing the feature refinement process, another convolutional block with 128 filters, a kernel size of 3, and a padding of 1 is applied, this time with SiLU activation.

Subsequently, two convolutional blocks are applied: the first with 128 filters, a kernel size of 4, and a padding of 2, and the second with 64 filters, a kernel size of 3, and a padding of 1. The resulting features are then fed to a spatial attention mechanism, which enhances the model's capacity to respond to subtle differences between the tissue characteristics associated with malignant and benign tumors.

The feature map obtained from the preceding operations is then concatenated with the output of a convolution and batch normalization layer with 64 filters, a kernel size of 3, and a padding of 2. Through this concatenation, the model integrates both high-level and low-level features from different layers, giving a more comprehensive representation of the input image. This allows the model to learn about the presence of microcalcifications and the density of the tumor tissue, which are the features most indicative of malignancy.

This concatenated output undergoes further processing through convolutional and activation layers before being upsampled and concatenated again with the intricate feature attention results. This iterative refinement ensures that the model can leverage both global and local contextual details present in the input segmentation map. The result is then fed through additional convolutional blocks and pooling layers before being passed through a self-attention block. The self-attention mechanism allows the model to give more weight to features such as the distribution of cells or the presence of necrosis, while filtering out less relevant information and potential artifacts that could obscure diagnosis.

Ultimately, the result from the self-attention block is flattened and subjected to dropout regularization to mitigate overfitting. Dropout prevents the model from relying on specific features or patterns within the training data that may not generalize well to unseen samples, thereby improving its robustness and generalization performance.
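For illustration, dropout on a flattened feature vector can be sketched as follows. This shows the commonly used inverted-dropout formulation; the rate of 0.5 is an assumed value, since the text does not report the rate used in DCNNIMAF.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x                      # dropout is disabled at inference time
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)    # rescaling keeps the expected activation unchanged

rng = np.random.default_rng(0)
features = np.ones(10_000)            # stand-in for the flattened self-attention output
dropped = dropout(features, rate=0.5, rng=rng)
print(dropped.mean())
```

Because a different random subset of units is zeroed at every training step, no single feature can be relied on exclusively, which is the regularizing effect described above.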

The feature map is then passed to a fully connected layer containing 128 neurons, followed by an output layer with three neurons and softmax activation for classification into malignant, benign, or normal categories. This final step consolidates the extracted features into a compact representation suitable for classification, enabling the model to predict the existence and severity of breast tumors from the input ultrasound image. The model's trainable parameters were updated after each epoch via backpropagation using the categorical cross-entropy loss criterion (28).

\[
\underbrace{\text{loss}_{\text{CE}}}_{\text{Categorical Cross Entropy Loss}} = -\log\left(\frac{\underbrace{e^{x_{p}}}_{\text{Exponential of the true class score}}}{\underbrace{\sum_{j=1}^{N} e^{x_{j}}}_{\text{Sum of exponentials of all class scores}}}\right) \tag{28}
\]
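The loss in Eq. (28) can be evaluated for a single sample as in the sketch below. The logits are hypothetical class scores, and the log-sum-exp rearrangement is a standard numerical-stability choice rather than something stated in the paper.

```python
import numpy as np

def categorical_cross_entropy(logits, true_class):
    """-log(exp(x_p) / sum_j exp(x_j)), computed via a shifted log-sum-exp."""
    shifted = logits - np.max(logits)              # the shift does not change the result
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return float(log_sum_exp - shifted[true_class])

logits = np.array([2.0, 0.5, -1.0])                # hypothetical scores for the three classes
loss_correct = categorical_cross_entropy(logits, true_class=0)
loss_wrong = categorical_cross_entropy(logits, true_class=2)
print(loss_correct, loss_wrong)
```

The loss is small when the true class has the largest score and grows as the score of the true class falls behind the others.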

The procedure for classifying breast cancer as benign, malignant, or normal is given in Algorithm 3.

The model specifications and parameters of the proposed DCNNIMAF classifier are shown in Table 2.

Table 2: Architecture Specifications of DCNNIMAF Classification Model

| Parameter | Value |
| --- | --- |
| Total Trainable Parameters | 52,427,081 |
| Learning Rate | 0.001 |
| Epochs | 100 |
| Image Shape | (256, 256) |
| Batch Size | 16 |

4 Experimental Setup and Results

This section presents and discusses the results obtained from training the proposed models. The experiments were conducted on a system with the following specifications: CPU - AMD Ryzen 7 4800H with Radeon Graphics, x86_64 architecture, with a base speed of 3 GHz and 8 cores; GPU - NVIDIA GeForce RTX 3050-PCI Bus 1; and 32 GB of RAM. These details are summarized in Table 3.

Table 3: System Specifications for Experimental Setup

| Component | Specification |
| --- | --- |
| CPU | AMD Ryzen 7 4800H with Radeon Graphics |
| Architecture | x86_64 |
| Base Speed | 3 GHz |
| Cores | 8 |
| GPU | NVIDIA GeForce RTX 3050-PCI Bus 1 |
| RAM | 32 GB |

4.1 Segmentation Evaluation Metrics

The proposed segmentation framework’s performance was evaluated during the training and validation phase using the following segmentation metrics:

4.1.1 Accuracy

Accuracy measures the proportion of pixels that were classified correctly in the segmentation map compared to the ground truth.

\[
\underbrace{\text{accuracy}}_{\text{Segmentation Accuracy}} = \frac{\underbrace{\text{correctly\_classified\_pixels}}_{\text{Number of correctly classified pixels}}}{\underbrace{\text{total\_pixel\_count}}_{\text{Total number of pixels in the image}}} \tag{29}
\]
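A minimal sketch of Eq. (29) on binary masks is given below; the two 4x4 masks are hypothetical examples, not data from the study.

```python
import numpy as np

def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float(np.mean(pred_mask == true_mask))

true_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 1],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
acc = pixel_accuracy(pred_mask, true_mask)
print(acc)  # 15 of 16 pixels agree
```

Because background pixels count as correct classifications, pixel accuracy can remain high even when a small tumor is partly missed, which is one reason overlap-based metrics are reported alongside it.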

4.1.2 IoU Score

The IoU score, also known as the Jaccard index, is the area of intersection between the ground truth mask and the predicted segmentation mask divided by the area of their union. It represents the fraction of the combined predicted and ground-truth tumor region on which the two masks agree.

\[
\underbrace{\text{IoU}_{\text{Score}}}_{\text{IoU}} = \frac{\underbrace{\text{Area}_{\text{segmentation}} \cap \text{Area}_{\text{groundTruth}}}_{\text{Intersection}}}{\underbrace{\text{Area}_{\text{segmentation}} \cup \text{Area}_{\text{groundTruth}}}_{\text{Union}}} \tag{30}
\]
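Eq. (30) can be computed on binary masks as sketched below. The masks are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not one stated in the paper.

```python
import numpy as np

def iou_score(pred_mask, true_mask):
    """Intersection over union of two binary masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0                    # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, true).sum() / union)

true_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 1],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
iou = iou_score(pred_mask, true_mask)
print(iou)  # intersection of 5 pixels over a union of 6
```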

4.1.3 Dice Coefficient

The Dice coefficient, also known as the Dice similarity index, measures the overlap between the ground truth mask and the predicted segmentation mask as twice the area of their intersection divided by the sum of their areas.

\[
\underbrace{\text{Dice}_{\text{Coefficient}}}_{\text{Dice}} = \frac{2 \times |\underbrace{\text{Area}_{\text{segmentation}} \cap \text{Area}_{\text{groundTruth}}}_{\text{Intersection}}|}{|\underbrace{\text{Area}_{\text{segmentation}}}_{\text{Segmentation}}| + |\underbrace{\text{Area}_{\text{groundTruth}}}_{\text{Ground Truth}}|} \tag{31}
\]
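The sketch below evaluates Eq. (31) on a pair of hypothetical binary masks and checks the identity Dice = 2 IoU / (1 + IoU), which holds for any single pair of masks. It need not hold for scores averaged over many images.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Twice the intersection divided by the sum of the two mask areas."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0                    # both masks empty: assumed convention
    return float(2.0 * np.logical_and(pred, true).sum() / total)

true_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 1],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 0]])
dice = dice_coefficient(pred_mask, true_mask)

# IoU of the same pair, computed inline to check the identity
inter = np.logical_and(pred_mask, true_mask).sum()
union = np.logical_or(pred_mask, true_mask).sum()
iou = inter / union
print(dice, 2 * iou / (1 + iou))
```

For the same pair of masks the Dice coefficient is never smaller than the IoU, since the intersection is counted twice in the numerator.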

Figures 8 and 9 depict the training and validation curves for accuracy and total loss, respectively, obtained while training the proposed segmentation framework. The graphs show that the model achieved an accuracy of 98.1% with a loss of 0.06 at the end of 100 epochs. The model also achieved an IoU score of 96.9% and a Dice coefficient of 97.2%. The training and validation curves for the IoU score and the Dice coefficient are shown in Figures 10 and 11, respectively.

Figure 8: Training and Validation Accuracy Curves of Proposed Segmentation Framework
Figure 9: Training and Validation Loss Curves of Proposed Segmentation Framework
Figure 10: Training and Validation IoU Score Curves of Proposed Segmentation Framework
Figure 11: Training and Validation Dice Coefficient Curves of Proposed Segmentation Framework

4.1.4 Performance Evaluation and Discussion

The segmentation results indicate strong performance. The high IoU, Dice coefficient, and accuracy scores, together with the low total loss, suggest that the InceptionResNet backbone successfully extracted important characteristics from the preprocessed input images, and that the dual-attention mechanism in the decoder blocks helped refine the segmentation maps.

Grad-CAM (Gradient-weighted Class Activation Mapping) is a deep learning method for visualizing the regions of an input image that guide the model's decision-making process [31]. It is particularly useful for understanding how Convolutional Neural Networks (CNNs) make their predictions, especially in tasks like medical image segmentation, where it is necessary to check whether the attention mechanism operates as intended.

The Grad-CAM maps of the attention block at the topmost decoder block, shown in Figure 12, illustrate how the attention mechanism focuses on specific regions of the feature map and highlight the importance of these regions for the segmentation task. This visualization helps in understanding how the attention mechanism contributes to segmentation performance by emphasizing the most relevant features and their spatial locations. The maps show that the attention mechanism progressively shifts its focus towards the tumor region, with localization improving as the number of training epochs increases.
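The core Grad-CAM computation can be sketched as follows, assuming the activations of the chosen layer and the gradients of a target score with respect to those activations are already available as arrays. How the target score is defined for the segmentation output (for example, the sum of predicted tumor-pixel scores) is an assumption left outside this sketch, and the random arrays are placeholders.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (H, W, K) arrays for one layer and one target score."""
    weights = gradients.mean(axis=(0, 1))          # one importance weight per channel
    cam = np.tensordot(activations, weights, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)                     # keep only positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam         # scale to [0, 1] for display

rng = np.random.default_rng(0)
activations = rng.random((8, 8, 4))                # placeholder feature maps
gradients = rng.normal(size=(8, 8, 4))             # placeholder gradients
heatmap = grad_cam(activations, gradients)
print(heatmap.shape)
```

The resulting heatmap has the spatial size of the chosen layer and is usually upsampled to the input resolution before being overlaid on the ultrasound image.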

Figure 12: Segmentation Outputs with Attention GradCAMs at epochs 16, 32, 64, and 96

Table 4: Performance Metrics Comparison of Proposed Segmentation Model with Other Models

| Segmentation Model | Dice Coefficient (%) | IoU Score (%) |
| --- | --- | --- |
| U-Net [32] | 82.52 | 69.76 |
| Res-U-Net [33] | 88.01 | 80.21 |
| U-Net with DenseNet backbone [34] | 89.86 | 79.12 |
| Multi-scale Fusion U-Net [35] | 95.35 | 91.12 |
| Proposed Spatial-Channel Attention LinkNet Framework with InceptionResNet Backbone | 97.20 | 96.91 |

The U-Net [32] model attains a Dice coefficient of 82.52% and an IoU score of 69.76%. These scores reflect a foundational capability in segmenting tumors from breast ultrasound images, but they also expose the model's limitations in capturing the full extent of tumor boundaries and internal structures, particularly in the nuanced textures and densities often found in breast tissue. Res-U-Net [33] improves on the original U-Net through the incorporation of residual connections, reaching a Dice coefficient of 88.01% and an IoU score of 80.21%. However, further refinements in its network architecture and feature extraction are necessary to achieve optimal segmentation accuracy, especially when dealing with the variable echo intensities and shadowing effects commonly encountered in breast ultrasound imaging. By integrating a DenseNet backbone, the U-Net with DenseNet backbone [34] reaches a Dice coefficient of 89.86% and an IoU score of 79.12%, showcasing the benefits of dense connectivity in improving segmentation outcomes. However, additional strategies may be required to fully capture the complex patterns inherent in breast ultrasound images, such as the differentiation between cystic and solid components of tumors, which is critical for accurate diagnosis. The Multi-scale Fusion U-Net [35] achieves a Dice coefficient of 95.35% and an IoU score of 91.12%, a significant improvement over the earlier models, but it performs suboptimally when handling the heterogeneity of breast tissue and the dynamic nature of tumor growth observed in ultrasound sequences. The proposed Spatial-Channel Attention LinkNet Framework with InceptionResNet Backbone stands out with a Dice coefficient of 97.20% and an IoU score of 96.91%. This performance is attributed to the integration of spatial-channel attention mechanisms and the robust InceptionResNet backbone, which together enable precise localization and delineation of tumors, including the ability to distinguish between different types of breast lesions based on their texture, shape, and boundary characteristics.

4.2 Classification Evaluation Metrics

The outcomes of the proposed DCNNIMAF model for breast cancer classification are evaluated during the training and validation phase using the following classification metrics:

4.2.1 Accuracy

Accuracy is a fundamental metric that evaluates the overall performance of a model across all classes. It measures the proportion of true classifications (both true positives and true negatives) in the total images classified, providing a comprehensive view of the model’s effectiveness in correctly classifying instances.

\[
\text{Accuracy} = \frac{\sum_{k=1}^{n}(TP_{k} + TN_{k})}{\sum_{k=1}^{n}(TP_{k} + TN_{k} + FP_{k} + FN_{k})} \tag{32}
\]

Where:

  • $TP$ denotes the number of true positives.

  • $TN$ denotes the number of true negatives.

  • $FP$ denotes the number of false positives.

  • $FN$ denotes the number of false negatives.

  • $n$ denotes the total number of classes.
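The quantities in this list can be derived per class from a confusion matrix, as in the sketch below, which then evaluates Eq. (32). The 3x3 matrix is a hypothetical example rather than the study's data, and rows are assumed to hold the actual classes.

```python
import numpy as np

def per_class_counts(cm):
    """cm[i, j] = number of samples of actual class i predicted as class j."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class k but actually another class
    fn = cm.sum(axis=1) - tp          # actually class k but predicted as another class
    tn = cm.sum() - tp - fp - fn
    return tp, tn, fp, fn

def pooled_accuracy(cm):
    """Eq. (32): counts summed over all classes before dividing."""
    tp, tn, fp, fn = per_class_counts(cm)
    return float((tp + tn).sum() / (tp + tn + fp + fn).sum())

cm = np.array([[50, 2, 1],
               [3, 40, 0],
               [1, 1, 30]])           # hypothetical counts for three classes
tp, tn, fp, fn = per_class_counts(cm)
print(pooled_accuracy(cm), np.trace(cm) / cm.sum())
```

Because each class's true negatives enter the sum, the pooled value from Eq. (32) is at least as large as the plain fraction of correctly classified images, which is printed second for comparison.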

4.2.2 Precision

Precision focuses on the proportion of true positive predictions among all positive predictions made by the classifier. It is particularly important in situations where false positives are costly, as it helps in minimizing the impact of false positives on the overall performance of the model.

\[
\text{Precision} = \frac{\sum_{k=1}^{n} TP_{k}}{\sum_{k=1}^{n}(TP_{k} + FP_{k})} \tag{33}
\]

4.2.3 Recall

Recall, also known as sensitivity, measures the ability of the classifier to identify all relevant instances within a specific class. It is crucial in situations where missing a positive instance (false negative) is more detrimental than identifying a negative instance as positive (false positive). Recall helps in ensuring that the model does not overlook any relevant instances.

\[
\text{Recall} = \frac{\sum_{k=1}^{n} TP_{k}}{\sum_{k=1}^{n}(TP_{k} + FN_{k})} \tag{34}
\]

4.2.4 F1-Score

F1-Score combines precision and recall into a single measure, providing a balanced view of the model’s performance. It is useful in scenarios where both false positives and false negatives are equally important, and a balance between these two metrics is desired.

\[
\text{F1 Score} = \frac{2\sum_{k=1}^{n} TP_{k}}{\sum_{k=1}^{n}(2TP_{k} + FP_{k} + FN_{k})} \tag{35}
\]
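The pooled precision and recall of Eqs. (33) and (34), and an F1 score taken as their harmonic mean, can be sketched as below. The confusion matrix is hypothetical, with rows assumed to be the actual classes.

```python
import numpy as np

def pooled_precision_recall_f1(cm):
    """Precision, recall, and F1 with counts summed over classes before dividing."""
    diag = np.diag(cm)
    tp = diag.sum()
    fp = (cm.sum(axis=0) - diag).sum()
    fn = (cm.sum(axis=1) - diag).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return float(precision), float(recall), float(f1)

cm = np.array([[50, 2, 1],
               [3, 40, 0],
               [1, 1, 30]])           # hypothetical counts for three classes
precision, recall, f1 = pooled_precision_recall_f1(cm)
print(precision, recall, f1)
```

When every image receives exactly one predicted label, each misclassification is a false positive for one class and a false negative for another, so the pooled precision, recall, and F1 coincide. Values that differ between the three would therefore come from a different averaging or thresholding scheme, which the text does not specify.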

The proposed DCNNIMAF classifier was trained for 100 epochs, and the evaluation metrics were recorded after each epoch. The training and validation curves for accuracy and categorical cross-entropy loss are depicted in Figures 13 and 14, respectively. The plots show that the classification model obtained an accuracy of 99.2% with a loss of 0.03. Figures 15, 16, and 17 display the training and validation precision, recall, and F1-score curves, respectively. The graphs indicate that the proposed model has kept false positives and false negatives low, achieving a precision of 99.3% and a recall of 99.1%. These high precision and recall values account for the high F1-score of 99.1%.

Figure 13: Training and Validation Accuracy Curves of Proposed DCNNIMAF Classifier
Figure 14: Training and Validation Loss Curves of Proposed DCNNIMAF Classifier
Figure 15: Training and Validation Precision Curves of Proposed DCNNIMAF Classifier
Figure 16: Training and Validation Recall Curves of Proposed DCNNIMAF Classifier
Figure 17: Training and Validation F1-Score Curves of Proposed DCNNIMAF Classifier

4.2.5 Performance Evaluation and Discussion

The normalized confusion matrix obtained on the validation data using the trained DCNNIMAF classification model is presented in Figure 18. A normalized confusion matrix is a confusion matrix whose values are scaled to show proportions rather than raw counts. It is useful for comparing classification performance across classes, since all values lie between 0 and 1 and are therefore easy to interpret.

Figure 18: Confusion Matrix Obtained from Proposed DCNNIMAF Classifier

In Figure 18, the normalized confusion matrix depicts the proposed model’s classification performance across the three breast cancer classes: "benign," "normal," and "malignant." Each row corresponds to the actual class, with each column representing the predicted class. The matrix’s values show the proportion of true-class cases that were successfully classified (along the diagonal) or misclassified (off-diagonal).
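Row normalization of a confusion matrix, matching the description above in which rows are actual classes, can be sketched as follows. The counts and the class order are hypothetical.

```python
import numpy as np

def normalize_rows(cm):
    """Divide each row by its total so entries are proportions of the actual class."""
    row_totals = cm.sum(axis=1, keepdims=True)
    return cm / np.where(row_totals == 0, 1, row_totals)  # guard against empty classes

cm = np.array([[50, 2, 1],
               [3, 40, 0],
               [1, 1, 30]])           # hypothetical counts, rows = actual classes
normalized = normalize_rows(cm)
print(np.round(normalized, 3))
```

Each row of the result sums to one, and each diagonal entry is the per-class recall (true positive rate) for that class.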

The matrix shows that the model classifies accurately: most values along the diagonal are close to one, indicating that the majority of samples were categorized correctly. For the "benign" class, the model had a true positive rate of 0.99, indicating that 99% of benign tumors were correctly categorized. In the "normal" class, the true positive rate was 0.98, implying that 98% of normal cases were correctly identified. Similarly, in the "malignant" class, the true positive rate was 0.99, indicating that 99% of malignant tumors were correctly identified. Misclassification errors were minor, with very low false positive and false negative rates.

The proposed DCNNIMAF model is compared with other pretrained CNNs, including EfficientNetV2 [36], MobileNetV2 [37], DenseNet121 [38], NASNetMobile [39], Xception [40], InceptionV3 [41], InceptionResNetV2 [30], MobileNet [42], VGG16 [43], and ResNet50 [44]. This comparison provides an overall assessment of the proposed model relative to existing baseline CNNs widely used for breast cancer classification. All models, including the proposed one, are trained on the same dataset, and the outcomes are presented in Table 5. The performance of these models is evaluated using the following metrics: Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 Score (F1).

Table 5: Performance Metrics Comparison of Proposed Classification Model with Other Baseline CNN Models

| Model | Training Acc | Training Prec | Training Rec | Training F1 | Validation Acc | Validation Prec | Validation Rec | Validation F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EfficientNetV2 | 0.926 | 0.931 | 0.920 | 0.925 | 0.871 | 0.871 | 0.871 | 0.871 |
| MobileNetV2 | 0.935 | 0.948 | 0.925 | 0.936 | 0.858 | 0.857 | 0.852 | 0.854 |
| DenseNet121 | 0.928 | 0.938 | 0.925 | 0.931 | 0.906 | 0.906 | 0.906 | 0.906 |
| NASNetMobile | 0.942 | 0.947 | 0.941 | 0.944 | 0.911 | 0.913 | 0.904 | 0.908 |
| Xception | 0.925 | 0.926 | 0.920 | 0.923 | 0.917 | 0.917 | 0.917 | 0.917 |
| InceptionV3 | 0.878 | 0.897 | 0.862 | 0.879 | 0.774 | 0.774 | 0.771 | 0.772 |
| InceptionResNetV2 | 0.958 | 0.961 | 0.958 | 0.959 | 0.947 | 0.957 | 0.901 | 0.928 |
| MobileNet | 0.956 | 0.957 | 0.948 | 0.952 | 0.872 | 0.874 | 0.871 | 0.873 |
| VGG16 | 0.86 | 0.889 | 0.841 | 0.864 | 0.861 | 0.877 | 0.803 | 0.838 |
| ResNet50 | 0.837 | 0.866 | 0.805 | 0.834 | 0.761 | 0.761 | 0.761 | 0.761 |
| DCNNIMAF (Proposed) | 0.989 | 0.994 | 0.992 | 0.993 | 0.992 | 0.993 | 0.991 | 0.991 |

From Table 5, it is evident that the proposed DCNNIMAF model has outperformed all baseline CNN models in terms of performance evaluation metrics. EfficientNetV2 overfits on the data due to difficulty in generalizing the nuanced features of breast cancer like irregular margins of malignant lesions or varying degrees of echogenicity observed in ultrasound images. MobileNetV2’s lightweight architecture struggles with the detailed analysis required to detect early signs of breast cancer, such as subtle changes in echotexture or the presence of microcalcifications within lesions. While DenseNet121 benefits from dense connectivity for feature reuse, its performance in identifying specific breast cancer markers like the orientation and distribution of calcifications or the assessment of lesion vascularity is compromised. NASNetMobile, designed for mobile applications, lacks the precision needed to capture the complex interplay of features indicative of breast cancer, such as the irregular shapes of masses or variations in posterior acoustic shadowing. Xception does not fully exploit the spatial dependencies crucial for identifying specific indicators of breast cancer, such as the pattern of calcifications or the echogenicity of surrounding tissue.

InceptionV3’s design trades capacity for computational efficiency, which limits its ability to analyze the multidimensional data characteristic of breast cancer ultrasound images, particularly when detecting subtle architectural distortions or changes in tissue echotexture. Despite its sophisticated architecture, InceptionResNetV2 is not optimally suited to identifying specific, disease-related features such as the texture and margin irregularities of masses or the presence of ductal abnormalities. MobileNet’s focus on efficiency limits the depth necessary for detailed feature extraction from breast cancer ultrasound images. VGG16’s simple and relatively shallow design struggles with the detailed analysis required to detect and classify features such as posterior acoustic enhancement, leading to lower validation accuracy. ResNet50 may not adequately learn features such as lesion margins, owing to limitations in its depth and focus. The proposed DCNNIMAF model distinguishes itself by effectively integrating multiple spatial and self-attention mechanisms, enabling precise identification of critical features such as calcifications, architectural distortions, and mass margins. These enhancements allow the model to capture the complex, heterogeneous pathology of breast cancer evident in ultrasound imagery.
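
The overfitting observations above can be checked against Table 5 directly. The following Python sketch is illustrative and not part of the original experiments: it computes the gap between training and validation accuracy for each model. The helper name generalization_gap and the dictionary layout are assumptions introduced here; the accuracy values are transcribed from Table 5.

```python
# Accuracy values transcribed from Table 5 as (training, validation) pairs.
table5_accuracy = {
    "EfficientNetV2": (0.926, 0.871),
    "MobileNetV2": (0.935, 0.858),
    "DenseNet121": (0.928, 0.906),
    "NASNetMobile": (0.942, 0.911),
    "Xception": (0.925, 0.917),
    "InceptionV3": (0.878, 0.774),
    "InceptionResNetV2": (0.958, 0.947),
    "MobileNet": (0.956, 0.872),
    "VGG16": (0.860, 0.861),
    "ResNet50": (0.837, 0.761),
    "DCNNIMAF (Proposed)": (0.989, 0.992),
}


def generalization_gap(train_acc, val_acc):
    """Hypothetical helper: training accuracy minus validation accuracy."""
    return round(train_acc - val_acc, 3)


gaps = {
    name: generalization_gap(train, val)
    for name, (train, val) in table5_accuracy.items()
}

# List models from the largest gap (most overfitting) to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {gap:+.3f}")
```

A large positive gap (for example InceptionV3 or MobileNet) is consistent with overfitting, whereas a gap near zero suggests that training performance carried over to the validation set. A small gap alone does not imply high accuracy, as the VGG16 row shows.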

Table 6: Performance Metrics Comparison of Proposed Classification Model with Other Models
Classification Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Fine Tuned VGG16 and Fine Tuned VGG19 ensemble model [45] | 95.29 | 95.46 | 95.20 | 95.29
CNN-based Ensemble Learner with MLP meta classifier [46] | 98.08 | 98.41 | 98.82 | 98.81
BCCNN [47] | 98.31 | 98.39 | 98.30 | 98.28
ResNet50 hybrid with SVM [48] | 97.98 | 96.51 | 97.63 | 95.97
Deep CNN with Fuzzy merging [49] | 98.62 | 92.31 | 94.70 | 93.53
Xception + SVM R [50] | 96.25 | 96.12 | 96.02 | 96.01
Grid-based deep feature generator + DNN classifier [51] | 97.18 | 97.45 | 96.18 | 96.79
InceptionV3 with residual connections [52] | 91.03 | 85.05 | 96.01 | 92.02
EDLCDS-BCDC [53] | 95.15 | 97.35 | 94.74 | 96.92
AlexNet, ResNet50 and MobileNetV2 Hybrid feature extractor + mRMR + SVM [54] | 95.60 | 95.69 | 95.61 | 95.65
DCNNIMAF (Proposed) | 99.20 | 99.32 | 99.14 | 99.1

The results in Table 6 show that the proposed DCNNIMAF model outperforms the other models reported in existing research. The ensemble of fine-tuned VGG16 and VGG19 [45] achieves moderate performance, with accuracy and F1-score around 95%, which indicates potential limitations in capturing the full complexity of breast cancer pathology. The CNN-based ensemble learner with an MLP meta-classifier [46] reaches an accuracy of about 98% but struggles to identify subtle changes in the irregular shapes of masses. BCCNN [47] shows promising results, with all metrics around 98%. However, its F1-score is slightly below that of the highest performers, suggesting that it has difficulty balancing precision and recall, which is essential for minimizing errors in breast cancer diagnosis.

ResNet50 Hybrid with SVM [48] presents strong recall (97.63%) but a lower precision score (96.51%). This discrepancy indicates that while the model identifies many positive cases, it struggles to distinguish accurately between benign and malignant lesions, leading to potential false positives. The precision of Deep CNN with Fuzzy Merging [49] drops to 92.31% despite an accuracy of 98.62%, highlighting a critical weakness in its ability to classify breast cancer cases precisely. This suggests that while the model captures broad patterns effectively, it overlooks finer details necessary for accurate diagnosis. Xception combined with SVM R [50] shows balanced performance of around 96% on all metrics, but its feature extraction is less effective than that of the stronger models, which limits its usefulness in real-world settings. Grid-based Deep Feature Generator with DNN Classifier [51] achieves a high precision score (97.45%), but its lower recall (96.18%) and F1-score (96.79%) indicate that it may not capture all relevant pathological features, which affects its overall efficacy.

InceptionV3 with Residual Connections [52] achieves a high recall (96.01%) but a much lower precision (85.05%), indicating a marked imbalance in its diagnostic capabilities. This suggests challenges in discriminating accurately between similar-looking benign and malignant cases, which is crucial for reducing false positives. EDLCDS-BCDC [53] presents moderate performance across metrics, around 95% to 97%, highlighting potential shortcomings in identifying subtle differences. The AlexNet, ResNet50, and MobileNetV2 hybrid feature extractor with mRMR and SVM [54] shows solid performance, with accuracy and F1-score around 95%. However, these scores suggest that it does not fully adapt to the complex and varied nature of breast cancer pathology, indicating room for enhancement.

The proposed DCNNIMAF model demonstrates remarkable performance across all metrics evaluated, surpassing all other models in this comparison. This can be attributed to its meticulously designed architecture that incorporates advanced feature extraction techniques and multiple attention mechanisms, allowing for the precise and effective identification of the nuanced pathological features associated with breast cancer. This specialized approach ensures not only high accuracy but also maintains excellent precision and recall, showcasing its robustness and reliability in clinical applications for breast cancer classification.
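
To make the precision and recall trade-offs discussed above concrete, the following Python sketch shows one common way to derive accuracy and macro-averaged precision, recall, and F1-score from a three-class confusion matrix. It is a minimal illustration under stated assumptions: the paper does not specify which averaging scheme was used for Table 6, the function name macro_metrics is introduced here, and the confusion matrix is hypothetical and does not reproduce any reported result.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, and F1-score.

    cm is a square confusion matrix given as a list of rows, where
    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Hypothetical counts (not from the paper); class order: benign, malignant, normal.
example_cm = [
    [90, 8, 2],
    [15, 80, 5],
    [1, 1, 98],
]
metrics = macro_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With macro averaging, the F1-score is averaged per class, so it need not equal the harmonic mean of the averaged precision and recall. This is one reason a reported F1-score can differ from what the precision and recall columns alone would suggest.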

5 Conclusion and Future Direction

The primary objective of this research is to develop an accurate and efficient system that segments tumor regions in breast ultrasound images and then categorizes them as benign, malignant, or normal, with the aim of improving diagnosis and treatment outcomes for patients. The proposed segmentation model uses an InceptionResNet-based LinkNet framework with a dual-attention mechanism to segment the tumor region precisely. By leveraging spatial and self-attention mechanisms across multiple layers, the DCNNIMAF classification framework accurately classifies breast cancer types or the absence of cancerous conditions. The proposed models outperform the existing works considered in this study. The segmentation model achieves an accuracy of 98.1%, an IoU of 96.9%, and a Dice coefficient of 97.2%. The classification model achieves an accuracy of 99.2%, with precision, recall, and F1-score of 99.3%, 99.1%, and 99.1%, respectively. Future work could extend the framework to other medical imaging modalities, facilitating the detection and classification of abnormalities beyond breast ultrasound images.

References

  • [1] B. Stewart and C. Wild, World Cancer Report 2014. Geneva, Switzerland: WHO Press, 2014.
  • [2] World Health Organization, “Breast cancer,” http://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/, accessed: 2024-07-03.
  • [3] Y. Sun, Z. Zhao, Z. Yang, F. Xu, H. Lu, Z. Zhu, W. Shi, J. Jiang, P. Yao, and H. Zhu, “Risk factors and preventions of breast cancer,” Int J Biol Sci, Nov 2017.
  • [4] M. Tarique, F. Elzahra, A. Hateem, and M. Mohammad, “Fourier transform based early detection of breast cancer by mammogram image processing,” J Biomed Eng Med Imaging, vol. 24, p. 17, 2015.
  • [5] American Cancer Society, “How is breast cancer diagnosed?” http://www.cancer.org/cancer/breastcancer/detailedguide/breast-cancer-diagnosis, 2014, accessed: September 20, 2017.
  • [6] F. Sadoughi, Z. Kazemy, F. Hamedan, L. Owji, M. Rahmanikatigari, and T. Azadboni, “Artificial intelligence methods for the diagnosis of breast cancer by image processing: a review,” Breast Cancer (Dove Med Press), vol. 10, pp. 219–230, Nov 2018.
  • [7] J. Benson, I. Jatoi, M. Keisch, F. Esteva, A. Makris, and V. Jordan, “Early breast cancer,” Lancet, vol. 373, no. 9673, pp. 1463–79, Apr 2009.
  • [8] G. Litjens, T. Kooi, B. Bejnordi, A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van Ginneken, and C. Sánchez, “A survey on deep learning in medical image analysis,” Med Image Anal, vol. 42, pp. 60–88, Dec 2017.
  • [9] E. Rashed and M. El Seoud, “Deep learning approach for breast cancer diagnosis,” in Proceedings of the 8th International Conference on Software and Information Engineering, Apr 2019, pp. 243–247.
  • [10] S. Ramesh, S. Sasikala, S. Gomathi et al., “Segmentation and classification of breast cancer using novel deep learning architecture,” Neural Comput & Applic, vol. 34, pp. 16533–16545, 2022.
  • [11] A. Osareh and B. Shadgar, “Machine learning techniques to diagnose breast cancer,” in 2010 5th International Symposium on Health Informatics and Bioinformatics, Ankara, Turkey, 2010, pp. 114–120.
  • [12] Y. Li, J. Wu, and Q. Wu, “Classification of breast cancer histology images using multi-size and discriminative patches based on deep learning,” IEEE Access, vol. 7, pp. 21400–21408, 2019.
  • [13] J. Zheng, D. Lin, Z. Gao, S. Wang, M. He, and J. Fan, “Deep learning assisted efficient adaboost algorithm for breast cancer detection and early diagnosis,” IEEE Access, 2020.
  • [14] W. Lotter, A. Diab, B. Haslam et al., “Robust breast cancer detection in mammography and digital breast tomosynthesis using an annotation-efficient deep learning approach,” Nat Med, vol. 27, pp. 244–249, 2021.
  • [15] A. Saber, M. Sakr, O. Abo-Seida, A. Keshk, and H. Chen, “A novel deep-learning model for automatic detection and classification of breast cancer using the transfer-learning technique,” IEEE Access, vol. 9, pp. 71194–71209, 2021.
  • [16] S. Cho, N. Baek, and K. Park, “Deep learning-based multi-stage segmentation method using ultrasound images for breast cancer diagnosis,” J King Saud Univ - Comput Inf Sci, vol. 34, no. 10, pp. 10273–10292, 2022.
  • [17] D. Wang, A. Khosravi, R. Gargeya, H. Irshad, and A. Beck, “Deep learning for identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
  • [18] G. Hamed, T. Helmy, H. Badawi, and M. Shawky, “Deep learning in breast cancer detection and classification,” in Advances in Intelligent Systems and Computing, 2020.
  • [19] L. Balkenende, J. Teuwen, and R. Mann, “Application of deep learning in breast cancer imaging,” Semin Nucl Med, vol. 52, no. 5, 2022.
  • [20] L. Shen, L. Margolies, J. Rothstein et al., “Deep learning to improve breast cancer detection on screening mammography,” Sci Rep, vol. 9, p. 12495, 2019.
  • [21] Z. Han, B. Wei, Y. Zheng et al., “Breast cancer multi-classification from histopathological images with structured deep learning model,” Sci Rep, vol. 7, p. 4172, 2017.
  • [22] Y. Wang, B. Acs, S. Robertson et al., “Improved breast cancer histological grading using deep learning,” Ann Oncol, vol. 33, no. 1, 2022.
  • [23] G. Sizilio, C. Leite, A. Guerreiro, and A. Neto, “Fuzzy method for pre-diagnosis of breast cancer from the fine needle aspirate analysis,” Biomed Eng Online, vol. 11, no. 1, p. 83, 2012.
  • [24] M. Sarkar and T. Leong, “Application of k-nearest neighbours algorithm on breast cancer diagnosis problem,” in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2000, pp. 793–797.
  • [25] Y. Song, C. Liu, and Z. Wang, “A machine learning approach for accurate annotation of noncoding RNAs,” IEEE/ACM Trans Comput Biol Bioinform, vol. 12, no. 3, pp. 551–559, 2014.
  • [26] K. Foster, R. Koprowski, and J. Skufca, “Machine learning, medical diagnosis, and biomedical engineering research-commentary,” Biomed Eng Online, vol. 13, no. 1, p. 94, 2014.
  • [27] L. Wei, Y. Yang, and R. Nishikawa, “Microcalcification classification assisted by content-based image retrieval for breast cancer diagnosis,” Pattern Recognition, vol. 42, no. 6, pp. 1126–1132, 2009.
  • [28] A. Shah, “Breast ultrasound images dataset,” 2020, accessed: 2024-07-03. [Online]. Available: https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset
  • [29] A. Chaurasia and E. Culurciello, “Linknet: Exploiting encoder representations for efficient semantic segmentation,” in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
  • [30] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning,” CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
  • [31] R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
  • [32] R. Almajalid, J. Shan, Y. Du, and M. Zhang, “Development of a deep-learning-based method for breast ultrasound image segmentation,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 1103–1108.
  • [33] W. Yue, H. Zhang, J. Zhou et al., “Deep learning-based automatic segmentation for size and volumetric measurement of breast cancer on magnetic resonance imaging,” Front Oncol, vol. 12, p. 984626, 2022.
  • [34] S. Zhang, M. Liao, J. Wang et al., “Fully automatic tumor segmentation of breast ultrasound images with deep learning,” J Appl Clin Med Phys, vol. 24, p. e13863, 2023.
  • [35] J. Li, L. Cheng, T. Xia, H. Ni, and J. Li, “Multi-scale fusion u-net for the segmentation of breast lesions,” IEEE Access, vol. 9, pp. 137125–137139, 2021.
  • [36] M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” CoRR, vol. abs/2104.00298, 2021. [Online]. Available: https://arxiv.org/abs/2104.00298
  • [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018. [Online]. Available: http://arxiv.org/abs/1801.04381
  • [38] G. Huang, Z. Liu, and K. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
  • [39] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, “Learning transferable architectures for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017. [Online]. Available: http://arxiv.org/abs/1707.07012
  • [40] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” CoRR, vol. abs/1610.02357, 2016. [Online]. Available: http://arxiv.org/abs/1610.02357
  • [41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
  • [42] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
  • [43] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
  • [44] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
  • [45] Z. Hameed, S. Zahia, B. Garcia-Zapirain, J. Javier Aguirre, and A. María Vanegas, “Breast cancer histopathology image classification using an ensemble of deep learning models,” Sensors, vol. 20, no. 16, p. 4373, 2020.
  • [46] A. Das, M. Mohanty, P. Mallick, P. Tiwari, K. Muhammad, and H. Zhu, “Breast cancer detection using an ensemble deep learning method,” Biomed Signal Process Control, vol. 70, p. 103009, 2021.
  • [47] B. Abunasser, M. Al-Hiealy, I. Zaqout, and S. Abu-Naser, “Convolution neural network for breast cancer detection and classification using deep learning,” Asian Pac J Cancer Prev, vol. 24, no. 2, pp. 531–544, 2023.
  • [48] W. Salama, A. Elbagoury, and M. Aly, “Novel breast cancer classification framework based on deep learning,” IET Image Process, vol. 14, no. 13, pp. 3254–3259, 2020.
  • [49] R. Krithiga and P. Geetha, “Deep learning based breast cancer detection and classification using fuzzy merging techniques,” Mach Vis Appl, vol. 31, p. 63, 2020.
  • [50] S. Sharma and S. Kumar, “The xception model: A potential feature extractor in breast cancer histology images classification,” ICT Express, vol. 8, no. 1, pp. 101–108, 2022.
  • [51] H. Liu, G. Cui, Y. Luo, Y. Guo, L. Zhao, Y. Wang et al., “Artificial intelligence-based breast cancer diagnosis using ultrasound images and grid-based deep feature generator,” Int J Gen Med, vol. 15, pp. 2271–2282, 2022.
  • [52] N. Sirjani, M. Ghelich Oghli, M. Tarzamni, M. Gity, A. Shabanzadeh, P. Ghaderi et al., “A novel deep learning model for breast lesion classification using ultrasound images: A multicenter data evaluation,” Phys Medica, vol. 107, p. 102560, 2023.
  • [53] M. Ragab, A. Albukhari, J. Alyami, and R. Mansour, “Ensemble deep-learning-enabled clinical decision support system for breast cancer diagnosis and classification on ultrasound images,” Biology, vol. 11, p. 439, 2022.
  • [54] Y. Eroğlu, M. Yildirim, and A. Çinar, “Convolutional neural networks based classification of breast ultrasonography images by hybrid method with respect to benign, malignant, and normal using mrmr,” Comput Biol Med, vol. 133, p. 104407, 2021.