

Phys. Med. Biol. 69 (2024) 025009 https://doi.org/10.1088/1361-6560/ad1548

PAPER

MM-SFENet: multi-scale multi-task localization and classification of bladder cancer in MRI with spatial feature encoder network

Yu Ren1,2,3,5, Guoli Wang2,3,5, Pingping Wang2,3, Kunmeng Liu2,3, Quanjin Liu1,∗, Hongfu Sun4, Xiang Li2,3 and Benzheng Wei2,3,∗

RECEIVED 4 April 2023
REVISED 25 November 2023
ACCEPTED FOR PUBLICATION 13 December 2023
PUBLISHED 10 January 2024

1 College of Electronic Engineering and Intelligent Manufacturing, Anqing Normal University, Anqing 246133, People's Republic of China
2 Center for Medical Artificial Intelligence, Shandong University of Traditional Chinese Medicine, Qingdao 266112, People's Republic of China
3 Qingdao Academy of Chinese Medical Sciences, Shandong University of Traditional Chinese Medicine, Qingdao 266112, People's Republic of China
4 Urological Department, Affiliated Hospital of Shandong University of Traditional Chinese Medicine, Jinan 250011, People's Republic of China
5 Yu Ren and Guoli Wang contributed equally to this work and should be considered co-first authors.
∗ Authors to whom any correspondence should be addressed.

E-mail: liuquanjin@aqnu.edu.cn and wbz99@sina.com

Keywords: bladder cancer, MRI, tumor detection, multi-scale, multi-task, deep learning

Abstract
Objective. Bladder cancer is a common malignant urinary carcinoma, with muscle-invasive and non-
muscle-invasive as its two major subtypes. This paper aims to achieve automated bladder cancer
invasiveness localization and classification based on MRI. Approach. Different from previous efforts
that segment bladder wall and tumor, we propose a novel end-to-end multi-scale multi-task spatial
feature encoder network (MM-SFENet) for locating and classifying bladder cancer, according to the
classification criteria of the spatial relationship between the tumor and bladder wall. First, we built a
backbone with residual blocks to distinguish bladder wall and tumor; then, a spatial feature encoder is
designed to encode the multi-level features of the backbone to learn the criteria. Main Results. We
substitute Smooth-L1 Loss with IoU Loss for multi-task learning, to improve the accuracy of the
classification task. The model is trained and evaluated on two datasets collected from bladder cancer patients at the hospital, with mAP, IoU, Acc, Sen and Spec used as the evaluation metrics. The experimental results reach 93.34%, 83.16%, 85.65%, 81.51% and 89.23% on test set1, and 80.21%, 75.43%, 79.52%, 71.87% and 77.86% on test set2. Significance. The experimental results demonstrate the effectiveness of the proposed MM-
SFENet on the localization and classification of bladder cancer. It may provide an effective
supplementary diagnosis method for bladder cancer staging.

1. Introduction

Bladder cancer is considered a critical malignancy with a high recurrence rate. According to the 2022 Global
Cancer Statistics report, bladder cancer has become the sixth most common cancer worldwide and caused the
ninth highest number of deaths (Siegel et al 2022). Based on the pathological depth of the tumor invasion,
bladder cancer is characterized as a heterogeneous disease with two major subtypes: muscle-invasive bladder
cancer (MIBC) and non-muscle-invasive bladder cancer (NMIBC) (Wong et al 2021). Earlier and accurate
classification of bladder cancer is critical for diagnosis, treatment, and follow-up of oncologic patients (Benson
et al 2009). Therefore, to improve the efficacy of bladder cancer screening, accurate and reliable diagnostic methods are required.
The current diagnostic methods of bladder cancer pose great challenges for clinicians. The available tools for
diagnosis and staging include: (a) optical cystoscopy, an invasive and costly method; (b) computed tomography
(CT); (c) magnetic resonance imaging (MRI). Optical cystoscopy is regarded as the gold standard method for
bladder cancer diagnosis, but this procedure is painful for patients, and it may fail to visualize certain areas


within the bladder. Compared with optical cystoscopy, imaging techniques have been developed to detect
tumors non-invasively. Given its high soft-tissue contrast and non-invasive feature, MRI-based image texture
analysis technique has made methods using radiomics that predict tumor stage and grade a potential alternative
for bladder cancer evaluation (Caglic et al 2020). The popularity of medical imaging equipment has enriched
medical image data (Guiot et al 2022). After training on massive annotated data, data-driven deep learning
models are becoming increasingly powerful methods for solving medical imaging problems, such as image
reconstruction (Lv et al 2021), lesion detection (Yu et al 2021), image segmentation (Pan et al 2023) and image
registration (Wu et al 2022). In recent years, several studies have shown that deep learning-based tissue
segmentation methods using UNet are comparable to segmentation tasks in MRI examinations by radiologists,
especially for bladder wall segmentation (Kushnure and Talbar 2022).
UNet adopts a convolutional neural network (CNN) to downsample and encode the global information of the image, and a short-cut structure is added to fuse the downsampling and upsampling layers by channel concatenation
(Dolz et al 2018). In 2016, Cha et al conducted a pilot study on bladder CT image segmentation, generating a
lesion likelihood map and refining boundaries with level sets for bladder cancer segmentation (Cha et al 2016).
With superior soft-tissue contrast over CT, MRI has seen the introduction of various UNet-based segmentation
methods for bladder wall extraction. Dolz et al (2018) first applied UNet to
segment bladder wall and tumor in MRI, and they introduced progressive dilation convolutional layers to
expand the receptive fields and decrease the sparsity of the dilated kernel. Liu et al further enhanced UNet by
embedding a pyramidal atrous convolution block, capturing multi-scale contextual information for accurate
bladder wall segmentation (Liu et al 2019). Some studies have also been made in 3D CNN. Hammouda et al
introduced a 3D framework in T2W MRI, incorporating contextual information for each voxel and refining the
network with a conditional random field (Hammouda et al 2019). These studies highlight CNN’s ability,
particularly in MRI, to distinguish bladder cancer tumor and wall tissue based on their textures. Zhang et al used
a two-stage method, manually segmenting bladder tumors on CT images, and then training a classification
model for invasiveness classification (Zhang et al 2021).
Unexpectedly, we found that existing research focuses solely on segmenting the bladder wall using a deep
learning model, which has some limitations. So far, there is no end-to-end deep neural network for directly
localizing and classifying bladder cancer. In the domain of bladder cancer localization and classification in MRI,
a thorough review indicates a lack of prior studies in this area (Bandyk et al 2021). This research gap underscores
the need for urgent attention. Moreover, bladder cancer localization and classification in MRI still present some
problems: (a) delineating the bladder wall and tumor is challenging due to very low contrast. (b) As a tumor
invasiveness classification criterion, the spatial relationship between bladder wall and tumor is difficult to learn
by a deep-learning model. (c) Tumors have varied shapes.
Accordingly, we proposed a systematical model to solve the bladder cancer localization and classification
problem, namely, multi-scale multi-task spatial feature encoder network (MM-SFENet). Specifically, the
anterior half of the proposed network consists of a backbone with residual connection, and a pyramidal spatial
feature encoder (SFE) based on feature pyramid networks (FPN) (Lin et al 2017) to encode different semantic
features. The posterior half of the model generates four multi-scale predicted results from different decoders to
enhance the classification capability. In terms of the localization task, we substitute the four-variable-
independent-regression Smooth-L1 Loss with IoU Loss to improve the performance of the localization.
Consequently, this paper makes three main contributions.

(i) We propose a novel end-to-end detector, namely MM-SFENet, to localize and classify bladder cancer in
MRI, which is the first work of its kind.
(ii) We design an encoder SFE considering multi-scale spatial features and embed it into the detector to learn
the bladder cancer classification criteria.
(iii) Extensive experiments on the datasets indicate the importance of SFE and IoU Loss. Moreover, we conduct
comparisons with the latest detectors, and MM-SFENet outperforms state-of-the-art methods.

2. Related works

2.1. Deep object detectors


The classification of bladder invasive carcinoma subtypes consists of a two-step process: the tumor is first
recognized and localized by a bounding box (b-box), and then further classified by the positional relationship between the bladder wall and the tumor. An end-to-end two-stage approach for this purpose can be implemented directly with the prevalent deep learning object detectors.


Figure 1. Schematic diagram of different detection architectures.

Contemporary state-of-the-art object detection methods almost follow two major paradigms: two-stage
detectors and one-stage detectors. As a standard paradigm of the two-stage detection methods (Uijlings et al
2013, Girshick et al 2014, Girshick 2015, He et al 2015, Ren et al 2017, Cai and Vasconcelos 2021, Sun et al 2021),
Faster R-CNN combines a proposal detector and a region-wise classifier. In the first stage, region proposal
network generates a coarse set of region proposals, and in the second stage, region classifiers provide confidence
scores and b-boxes for the proposed regions. Since Faster R-CNN was introduced, this two-stage architecture has been established and upgraded by its descendants Cascade R-CNN (Cai and Vasconcelos 2021) and Sparse R-CNN (Sun et al 2021), through multi-stage refinement and modeling of relationships between targets, respectively. To better detect objects at multiple scales, an embedded feature pyramid is built to
perform feature fusion in FPN that has now become a basic module of the latest detectors due to its excellent
performance.
Compared with two-stage detectors, one-stage detectors (Sermanet et al 2013, Liu et al 2016, Redmon and
Farhadi 2017, 2018, Bochkovskiy et al 2020, Lin et al 2020) are better suited for real-time object detection,
but they are less accurate. These works make significant progress in different areas. The end-to-end bladder
cancer localization and classification method can be well implemented by a deep learning-based two-stage
detector.

2.2. Feature pyramid network


CNN is a structure superimposed by multi-layers of convolutional kernels, and the receptive field size of the
network is linearly related to the number of convolutional layers (Luo et al 2017). Considering a convolutional
kernel in layer L with size k_L, and strides s_l of the preceding layers, its receptive field RF can be defined as

$$\mathrm{RF}_L = \mathrm{RF}_{L-1} + (k_L - 1) \times \prod_{l=0}^{L-1} s_l \qquad (1)$$

From equation (1), it is clear that the receptive field size increases with the number of layers, and each convolutional layer 'sees' a different extent of the object.
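As an illustration (not part of the original paper), the following Python sketch applies the recursion in equation (1) layer by layer; the kernel sizes and strides in the example are hypothetical.

```python
# A minimal sketch of the receptive-field recursion in equation (1):
# RF_L = RF_{L-1} + (k_L - 1) * product of the strides of the earlier layers.
def receptive_field(kernel_sizes, strides):
    """Return the receptive field after each convolutional layer."""
    rf = 1       # a single pixel before any convolution
    jump = 1     # product of strides of all preceding layers
    fields = []
    for k, s in zip(kernel_sizes, strides):
        rf = rf + (k - 1) * jump
        jump *= s
        fields.append(rf)
    return fields

# Hypothetical example: a 7x7 stem followed by three 3x3 layers, all with stride 2.
print(receptive_field([7, 3, 3, 3], [2, 2, 2, 2]))  # [7, 11, 19, 35]
```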
A deep CNN exhibits a pyramidal shape with inherent multi-scale characteristics, where higher layers
convey more abstract and semantically meaningful information about objects, while lower layers represent low-
level details like edges, contours, and positions. Detectors like SSD (Liu et al 2016) and MS-CNN (Cai et al 2016)
operate independently on feature maps from various backbone hierarchies, avoiding the mixing of high- and
low-level information. This approach, termed prediction pyramid network (PPN), is illustrated in figure 1(b).
However, PPN introduces semantic gaps due to distinct feature hierarchies, particularly affecting the accurate
prediction of small objects. To address this, Lin et al proposed the feature pyramid network (FPN) as a solution,
depicted in figure 1(c). FPN, constructed upon the characteristics of CNN, integrates multiple shallow and
abstract features in a top-down architecture, serving as an encoder that encodes multi-scale features from the
backbone and provides multi-level feature representations for the decoder (detection heads) (Redmon and
Farhadi 2017, Chen et al 2021).
Faster R-CNN and its descendants have been utilized in medical image analysis, where the demand for real-time inference is low. The first step in R-CNNs involves downsampling for extracting high-level semantic features, achieved through the backbone. For tasks like differentiating the bladder wall from tumors in medical images, a deeper CNN-based backbone is necessary, which underscores the importance of the CNN's discriminative capacity in tissue differentiation. Due to
difficulties in converging to a minimum and the occurrence of vanishing gradients in standard deep CNNs like
VGG (Simonyan and Zisserman 2014), the literature introduces a solution: residual connections (He et al 2016),
employed in the backbone to effectively shorten the gradient flow path. However, the residual connection
backbone exhibits a limitation: when a single last feature map is input into the region proposal network during
inference, it is sensitive only to a specific size range. To overcome this, a common strategy is to add a feature


pyramid network (FPN) as a neck after the backbone. FPN enables the backbone to extract multi-scale features
and fuse semantic information from different layers, including spatial information in lower layers, enhancing
both localization accuracy and classification performance.

2.3. Smooth-L1 loss


In the bladder tumor invasiveness classification task, positional relationships between bladder wall and tumor
are learned by a classifier to judge the invasiveness, so accurate localization is particularly important. Most
detection algorithms use four-variable-independent-regression Smooth-L1 Loss for b-box regression, which is
calculated as follows:

$$\mathrm{SmoothL1}(d) = \begin{cases} 0.5\, d^2, & |d| < 1 \\ |d| - 0.5, & \text{otherwise} \end{cases} \qquad (2)$$

As stated in equation (2), only the Euclidean distance d between corresponding points is considered by Smooth-L1 Loss, which loses the scale information of the prediction box. This allows the network to converge rapidly during training, but deviations between the prediction box and the ground-truth box appear during testing.
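For reference, a minimal PyTorch sketch of equation (2) (an assumption, not the authors' implementation) makes explicit that the four box regression variables are penalized independently:

```python
import torch

# Smooth-L1 as in equation (2), applied element-wise to the box offsets.
def smooth_l1_loss(pred, target):
    d = torch.abs(pred - target)
    return torch.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

pred = torch.tensor([0.2, 0.4, 1.8, 2.5])   # hypothetical box regression targets
target = torch.tensor([0.0, 0.5, 1.5, 2.0])
print(smooth_l1_loss(pred, target))         # each coordinate penalized separately
```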

3. Materials and methods

3.1. Data acquisition


All datasets utilized for this research are sourced from the Affiliated Hospital of Shandong University of Traditional Chinese Medicine. For this investigation, MRI data are acquired using the GE Discovery MR750 3.0T scanner for all patients. Scans cover the region from the upper part of the bladder down
to the ischial tuberosity. Since the two datasets were collected at different times and saved at different sizes, they
were learned independently when the model was trained.

(i) Dataset1: The dataset1 comprises a total of 1287 MRI images obtained from 98 patients diagnosed with
bladder cancer, of which 130 images are used as an independent test set and the rest as a training set. The
sequence parameters are configured as follows: 80–124 slices per scan, each slice measuring
512 × 512 pixels with a pixel resolution of 0.5 mm × 0.5 mm. The slice thickness and inter-slice spacing are
both set at 1 mm. The 3D scanning process had acquisition times ranging from 160.456 to 165.135 s. The
repetition and echo times are established at 2500 ms and 135 ms, respectively.
(ii) Dataset2: The dataset2 comprises a total of 2000 MRI images obtained from 121 patients diagnosed with
bladder cancer. The sequence parameters are configured as follows: 80–124 slices per scan, each slice
measuring 256 × 256 pixels with a pixel resolution of 0.5 mm × 0.5 mm. All other parameters are the same as for dataset1. The dataset, covering 121 cases, consists of MRI images and tumor invasiveness type annotation files. We strictly divide dataset2 into a train set, a validation set and a test set in the ratio of 7:2:1: 1400 images for the train set, 400 images for the validation set, and 200 images for the test set.

3.2. Data pre-processing


We use the SimpleITK toolkit for data preprocessing. MRI transverse slices are extracted from different time series, and slices located at the temporal extremes of a sequence lack pathological lesion imagery. Consequently, it is necessary to discard these superfluous slices and retain those containing relatively complete pathological manifestations. In this study, we employ the N4 algorithm to rectify image grayscale non-
uniformities, thereby enhancing image quality. We employ truncation normalization to standardize the
grayscale values of all slices to a consistent range, thereby expediting convergence and enhancing generalization
performance during model training.
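A minimal SimpleITK sketch of this preprocessing step is given below; the file name and truncation percentiles are illustrative assumptions rather than values taken from the paper.

```python
import SimpleITK as sitk
import numpy as np

# Hypothetical input volume; the actual data paths are not public.
image = sitk.Cast(sitk.ReadImage("slice_volume.nii.gz"), sitk.sitkFloat32)

# N4 bias field correction to reduce grayscale non-uniformity.
mask = sitk.OtsuThreshold(image, 0, 1, 200)
corrector = sitk.N4BiasFieldCorrectionImageFilter()
corrected = corrector.Execute(image, mask)

# Truncation normalization: clip extreme intensities, then rescale to [0, 1].
arr = sitk.GetArrayFromImage(corrected)
lo, hi = np.percentile(arr, (0.5, 99.5))       # assumed truncation bounds
arr = np.clip(arr, lo, hi)
arr = (arr - lo) / (hi - lo + 1e-8)
```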
MRI images are annotated by professional radiologists with image-level annotations, and the labeled data are divided into two types, MIBC and NMIBC. We use the minimum bounding box of the tumor area as the labeled
box based on the Excel file provided by the doctor. Bladder cancer MRI dataset is constructed by b-box level
annotation under the guidance of professional physicians and combined with related articles (Pinto and
Tavares 2017, Xu et al 2017, Dolz et al 2018, Zhu et al 2018, Hammouda et al 2019, Liu et al 2019, Caglic et al
2020, Wong et al 2021, Babjuk et al 2022, Kushnure and Talbar 2022). The annotation software used is the
Labelme toolkit in Python. Some images are shown in figure 2.
The green lines are bladder wall tissues; the red lines are tumor areas; the yellow b-boxes are labeled boxes,
namely the ground-truth boxes. According to whether the tumor invades the bladder wall tissue, the images in figure 2 can be divided into two types: the MIBC tumor in (a) and the NMIBC tumor in (b).


Figure 2. Manual tumor and bladder wall contouring.

Figure 3. The creation process of the BCMnist.

3.3. BCMnist creation


In addition to creating datasets for bladder cancer tumor detection, we extend our efforts to construct Bladder
Cancer MNIST (BCMnist) for invasiveness classification using the MedMNIST method (Yang et al 2023). The
creation process of the dataset is shown in figure 3. We initiate the dataset creation process by extracting the
minimum bounding rectangles of the tumors from the prepared detection dataset. These regions are accurately
identified using the labeled boxes, which are commonly referred to as ground truth annotations. To ensure
uniformity and facilitate model training, we resize these tumor regions to a standardized size of 28 × 28 pixels.
Leveraging insights from the image distribution within the training and testing subsets of the detection dataset,
we crafted a distinct dataset tailored for tumor invasiveness classification. To enable the supervised learning
process, we assign binary labels to each image. All of these datasets, including both the tumor detection and
invasive classification datasets, have been meticulously organized and are stored in the widely-used .npz file
format.
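The following sketch outlines this construction step under assumptions: the `samples` iterable, the output file name and the use of PIL are illustrative, not the authors' actual pipeline.

```python
import numpy as np
from PIL import Image

# Each sample pairs a slice (PIL image) with its ground-truth b-box
# (x1, y1, x2, y2) and a binary invasiveness label (0 = NMIBC, 1 = MIBC).
def build_bcmnist(samples, out_path="bcmnist_train.npz"):
    images, labels = [], []
    for img, (x1, y1, x2, y2), label in samples:
        crop = img.crop((x1, y1, x2, y2))              # minimum bounding rectangle
        crop = crop.resize((28, 28), Image.BILINEAR)   # standardized 28 x 28 input
        images.append(np.asarray(crop, dtype=np.uint8))
        labels.append(label)
    np.savez(out_path, images=np.stack(images), labels=np.array(labels))
```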

3.4. MM-SFENet
Based on the above discussion, we propose a MM-SFENet, and its detailed architecture is shown in figure 4. In
this section, the two key technical components of MM-SFENet will be introduced in detail.


Figure 4. Model architecture for detecting and localizing bladder cancer. SFE, based on FPN, enables the network to detect at multiple scales while fusing multi-level information. We assign anchors of a single corresponding scale to each pyramid level, with anchor areas of {32^2, 64^2, 128^2, 256^2} pixels. Accordingly, RoIs of different scales are assigned to the pyramid levels, the outputs are then gathered and passed to the decoder, and duplicate predictions are eliminated by an NMS operation.

Figure 5. The architecture of SFE. SFE takes an arbitrary bladder cancer MR image as input; short and long connections inside the network increase the feature representation capability of the model.

3.4.1. The spatial feature encoder


According to (Chen et al 2021), SFE is designed to enhance the output feature maps by effectively fusing spatial
high-resolution features with low-resolution semantic features. The structure diagram, as shown in figure 5,
illustrates the key components of the SFE.
SFE construction involves three main components: backbone, encoder pathway, and feature fusion.
The backbone is divided into a stem block and four residual blocks, denoted as {C1, C2, C3, C4}, operating under downsampling factors of {2^2, 2^3, 2^4, 2^5}, respectively. The backbone's role is to capture
hierarchical features at different scales. Meanwhile, the encoder pathway is responsible for generating feature
maps with higher spatial resolution through the use of nearest neighbor upsampling. This process aims to
restore the fine-grained spatial information that might have been lost during the downsampling phase of the
backbone. Subsequently, the feature fusion step combines the feature maps under the same spatial size from
both the backbone and the encoder. SFE recodes the feature maps output from different layers in the following
way:
$$P_{i-1} = T(M_{i-1}) \oplus U(P_i) \qquad (3)$$
where T represents channel transformation; U denotes the nearest neighbor upsampling; Pi−1 represents the
fused feature map at the i − 1th layer, obtained through the feature fusion of the feature maps Mi−1 from the
backbone and Pi from the encoder.


Figure 6. Schematic diagram of b-box parameters.

The feature fusion process is depicted in the right half of figure 5, which requires the feature map to be
consistent in size and channel.

Step 1. To achieve this alignment, we perform a feature map channel transformation, changing the number of
channels to 256 for feature map Mi−1. This transformation could involve employing 1 × 1 convolutional
layers, which are commonly used for this purpose in deep learning architectures.
Step 2. Next, we proceed with upsampling and size enlargement to match the resolution of the feature map Mi−1.
Techniques like bilinear interpolation or transposed convolution can be employed for this upsampling
process.
Step 3. Subsequently, the fused feature map Pi−1 is obtained by combining the upsampled feature map Pi with the
transformed shallow feature map Mi−1. This fusion process typically involves an element-wise operation,
such as addition or 1 × 1 convolutional layers, which allows for the integration of information from both
feature maps.

By following this process of feature map transformation, upsampling, and element-wise combination, we align the feature maps' sizes and channels and fuse their information.
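A minimal PyTorch sketch of one such fusion step, assuming a 1 × 1 lateral convolution, nearest-neighbor upsampling and element-wise addition (the channel counts and feature-map sizes are hypothetical), is shown below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One SFE fusion step from equation (3): P_{i-1} = T(M_{i-1}) + upsample(P_i).
class FusionStep(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # Step 1: 1x1 convolution changes the backbone feature map to 256 channels.
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, m_prev, p_i):
        # Step 2: nearest-neighbor upsampling matches the spatial size of M_{i-1}.
        p_up = F.interpolate(p_i, size=m_prev.shape[-2:], mode="nearest")
        # Step 3: element-wise addition fuses shallow and deep information.
        return self.lateral(m_prev) + p_up

# Hypothetical feature maps: M_{i-1} (512 ch, 50x50) and P_i (256 ch, 25x25).
fuse = FusionStep(512)
p_prev = fuse(torch.randn(1, 512, 50, 50), torch.randn(1, 256, 25, 25))
print(p_prev.shape)  # torch.Size([1, 256, 50, 50])
```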

3.4.2. IoU loss


IoU Loss (Yu et al 2016) solves this problem by introducing the intersection-over-union (IoU) ratio between the prediction box and the ground-truth box. As shown in figure 6, the 4-dimensional vector $(\tilde{x}_t, \tilde{x}_b, \tilde{x}_l, \tilde{x}_r)$ represents the distances from any pixel in the prediction box to its four sides, and the ground-truth box is represented as $(x_t, x_b, x_l, x_r)$. The ground-truth box area X, the prediction box area $\tilde{X}$ and the overlap area I between the ground-truth box and the prediction box are calculated as follows:

$$\begin{cases} \tilde{X} = (\tilde{x}_t + \tilde{x}_b)(\tilde{x}_l + \tilde{x}_r) \\ X = (x_t + x_b)(x_l + x_r) \\ I_h = \min(x_t, \tilde{x}_t) + \min(x_b, \tilde{x}_b) \\ I_w = \min(x_l, \tilde{x}_l) + \min(x_r, \tilde{x}_r) \\ I = I_w I_h, \quad U = X + \tilde{X} - I \end{cases} \qquad (4)$$

According to the formula of the IoU ratio, the forward propagation of IoU Loss is calculated as:

$$L_{loc} = -\ln \frac{I}{U} \qquad (5)$$

The back-propagation formula for IoU Loss with respect to $\tilde{x}$ is obtained as the partial derivative:

$$\frac{\partial L_{loc}}{\partial \tilde{x}_{t,b,l,r}} = \frac{1}{U}\frac{\partial \tilde{X}}{\partial \tilde{x}_{t,b,l,r}} - \frac{U+I}{UI}\frac{\partial I}{\partial \tilde{x}_{t,b,l,r}} \qquad (6)$$


Figure 7. Analysis of Dataset1.

The loss function decreases as the intersection area grows: the larger the intersection area, the smaller the loss. The four coordinates of the b-box are regarded as a whole in IoU Loss, which ensures that the prediction-box scale is similar to that of the ground-truth box during b-box regression.
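A minimal PyTorch sketch of equations (4) and (5) is given below (an assumption, not the released code); automatic differentiation supplies the gradients of equation (6):

```python
import torch

# Boxes are encoded as per-pixel distances to the four sides (t, b, l, r).
def iou_loss(pred, target, eps=1e-8):
    pt, pb, pl, pr = pred.unbind(-1)
    gt, gb, gl, gr = target.unbind(-1)
    pred_area = (pt + pb) * (pl + pr)
    gt_area = (gt + gb) * (gl + gr)
    ih = torch.min(pt, gt) + torch.min(pb, gb)       # overlap height, equation (4)
    iw = torch.min(pl, gl) + torch.min(pr, gr)       # overlap width
    inter = ih.clamp(min=0) * iw.clamp(min=0)
    union = pred_area + gt_area - inter
    return -torch.log((inter + eps) / (union + eps)).mean()   # equation (5)

pred = torch.tensor([[4.0, 4.0, 4.0, 4.0]], requires_grad=True)   # hypothetical box
target = torch.tensor([[5.0, 3.0, 4.0, 4.0]])
iou_loss(pred, target).backward()   # gradients follow equation (6)
```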

3.5. k-means++ clustering analysis on the dataset1


In anchor-based detectors, matching the anchor box size and shape with the ground-truth box is crucial to the
final detection result. The settings of the anchor box in Faster R-CNN are based on the Pascal VOC dataset,
which is specific to natural images and not applicable to the bladder cancer MRI dataset. k-means++ is used to cluster the dataset and regenerate a new anchor box setting for bladder cancer localization. The data clustering results are shown in figure 7(a), where the black points represent the cluster centroids, the other points represent the distribution of the ground-truth box aspect values (width and height), and the colors represent the categories to which the ground-truth boxes belong. The distribution of the ground-truth box aspect ratio is
shown in figure 7(b). Figure 7 shows that the aspect ratio of the ground-truth box is mostly near 0.6, 1.0 and 1.1.
Therefore, for the bladder cancer MRI dataset, the anchor aspect ratio can be set to [0.6, 1, 1.1], which makes it
more consistent with the shape of the true tumor area, and it can make the model converge more quickly during
training.
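A short sketch of this clustering step, assuming scikit-learn's k-means++ initialization and a hypothetical file of ground-truth box widths and heights, could look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (N, 2) array of ground-truth box widths and heights in pixels.
boxes_wh = np.loadtxt("gt_boxes_wh.txt")
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(boxes_wh)

centroids = kmeans.cluster_centers_
aspect_ratios = centroids[:, 0] / centroids[:, 1]  # w / h of each centroid
print(np.round(sorted(aspect_ratios), 1))          # e.g. close to [0.6, 1.0, 1.1]
```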

3.6. Evaluation metrics and loss functions


AP50 (Average Precision) and mAP (mean Average Precision) are used as the detection evaluation indexes and
IoU as the localization evaluation index, and AP50 is the AP score when IoU = 0.5. AP is the area under the PR
(Precision–Recall) curve. The formula is shown in equation (7). The mAP is the mean value of AP for each
category, and the formula is shown in equation (8); it measures the detection performance of a detector and can also reflect the accuracy of classification to a certain extent. IoU is the ratio of the intersection to the union of the ground-truth box and the prediction box, and the formula is shown in equation (9).
$$AP = \sum_{k=1}^{N} P(k)\, \Delta r(k) \qquad (7)$$

where P(k) is the height of the kth bar under the PR curve, and Δr(k) is its width. The formula for calculating mAP is as follows,

$$mAP = \frac{\sum AP}{m} \qquad (8)$$

where m is the total number of categories. The formula for calculating IoU is as follows,

$$IoU(Box_p, Box_T) = \frac{|Box_p \cap Box_T|}{|Box_p \cup Box_T|} \qquad (9)$$

where Boxp is the predicted box area, and BoxT is the ground-truth box area. The formulas of Accuracy,
Sensitivity and Specificity are as follows,
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (10)$$


Table 1. Mean and standard deviation of detection results with different localization loss functions and weights λ in Dataset1.

Loss function    λ = 1           λ = 2           λ = 5
L1               91.23 ± 0.97    91.45 ± 1.10    90.52 ± 1.18
Smooth L1        91.88 ± 1.02    92.16 ± 0.98    91.26 ± 1.28
IoU              92.36 ± 0.87    93.17 ± 0.96    nan

$$Sensitivity = \frac{TP}{TP + FN} \qquad (11)$$

$$Specificity = \frac{TN}{TN + FP} \qquad (12)$$
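For clarity, a small Python sketch of equations (10)-(12) is given below; treating MIBC as the positive class is an assumption about the labeling convention.

```python
# Confusion-matrix based metrics; MIBC is taken as the positive class (assumption).
def classification_metrics(pred, truth):
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    accuracy = (tp + tn) / (tp + fp + tn + fn)     # equation (10)
    sensitivity = tp / (tp + fn)                   # equation (11)
    specificity = tn / (tn + fp)                   # equation (12)
    return accuracy, sensitivity, specificity

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # toy labels
```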
Bladder cancer detection process can be divided into two parts: localization and classification, based on
which we design loss functions for multi-task learning. We use binary cross-entropy for classification and IoU
Loss for b-box regression, and the calculation is shown in equation (13).

$$\begin{cases} L_{cls} = \sum_i -\log\left(p_i^{*} p_i + (1 - p_i^{*})(1 - p_i)\right) \\ L_{loc} = -\ln(\mathrm{IoU}) \\ Loss = \frac{1}{N_{pred}} L_{cls} + \lambda \frac{1}{N_{pred}} L_{loc} \end{cases} \qquad (13)$$

where p_i^* is the probability that the classifier judges sample i as MIBC, and p_i is the probability that it judges the sample as NMIBC. In the total loss function, N_pred is the total number of prediction boxes output by the final
detector, and λ is the localization loss function weight.
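A minimal sketch of equation (13) in PyTorch (an assumption, not the authors' code), with the probabilities defined as in the text, is:

```python
import torch

# Binary cross-entropy for classification plus IoU Loss for localization,
# weighted by lambda and normalized by the number of prediction boxes.
def multitask_loss(p, p_star, iou, lam=2.0):
    cls = -torch.log(p_star * p + (1.0 - p_star) * (1.0 - p)).sum()
    loc = -torch.log(iou).sum()
    n_pred = p.numel()
    return cls / n_pred + lam * loc / n_pred

p = torch.tensor([0.9, 0.2])         # hypothetical NMIBC probabilities
p_star = torch.tensor([0.8, 0.1])    # hypothetical MIBC probabilities
iou = torch.tensor([0.7, 0.6])       # hypothetical per-box IoU values
print(multitask_loss(p, p_star, iou))
```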

3.7. Implementation details


We perform experiments on a deep learning server; the operating system is Ubuntu 18.04, the CPU is an Intel Xeon Platinum 8268, the memory size is 32 GB, and the GPU is an NVIDIA Tesla V100 32 GB. All experiments are based on the mmdetection framework: https://github.com/open-mmlab/mmdetection. The PyTorch version is
1.7.1; mmcv version is 1.3.7, and CUDA version is 11.2. During the training process, four processes are started
for each graphics card, with 8 images per process. The SGD optimizer is used to update the parameters, with the initial learning rate set to 0.02, the weight decay to 0.0001, and the momentum to 0.9. The input image size is uniformly 1000 × 600, and 24 epochs are trained using a multi-scale training strategy with horizontal and vertical flips added as data augmentation, following Gsaxner et al (2018).
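The stated hyper-parameters correspond roughly to an mmdetection-style configuration such as the sketch below; the key names follow the mmdetection 2.x convention, and the pipeline entries and per-GPU worker settings are illustrative assumptions rather than the authors' actual config file.

```python
# Hedged mmdetection-style configuration sketch (assumed, not the released config).
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=24)
data = dict(samples_per_gpu=8, workers_per_gpu=4)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5, direction=['horizontal', 'vertical']),
]
```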

4. Results

4.1. Detection with different weights λ


We conduct 10-fold cross-validation experiments to examine the effect of different localization loss functions and λ values on the results, and the experimental results are shown in table 1. The metric in the table is mAP, which comprehensively measures the detection performance of the detectors.
Complementing table 1, the vertical axis in figure 8(a) depicts the value of the detector's classification loss
function, while the horizontal axis corresponds to the number of iterations. Owing to the inherent interplay
between the localization loss function and the classification loss function within the detector, there are
observable variations in the classification loss. Specifically, when λ is set to 2, the classification loss function
attains its minimum value, concurrently yielding the highest mAP score for the detector. This outcome signifies
the optimal detection performance. In all subsequent experiments, we set λ to 2. When the weight of IoU Loss is 5, the value of mAP is NaN.

4.2. Ablation experiments


To verify the effectiveness of the improved method proposed in this paper, ablation experiments are conducted
on the network model, and the experimental results are shown in table 2. ResNet101 and ResNet50 are used as
the backbone in the ablation experiments to test the performance of the proposed model, respectively. As can be
seen from table 2, the SFE increases by 20.53% and 23.69% in the networks with ResNet101 and ResNet50 as a
backbone, respectively. Using IoU Loss as the loss function, the networks with ResNet101 and ResNet50 as the
backbone increase by 2.03% and 2.89%, respectively. The experiments prove that the model proposed can still
effectively improve the classification and localization accuracy of tumors under different backbones, which also
proves the effectiveness of SFE and IoU Loss.


Figure 8. The localization loss function used in the curves is Smooth-L1 Loss. Modulating the weight values in the localization loss
function of the detector can also have an impact on classification loss. This interconnection arises from their coupling, and during
back propagation, the gradient values concurrently optimize both aspects.

Table 2. Ablation studies on the importance of each component. The results in bold are the best results obtained on the test set.

Backbone     IoU Loss   SFE   mAP (%)   IoU (%)
ResNet101                     71.62     63.87
ResNet101    √                73.65     65.98
ResNet101               √     92.15     81.89
ResNet101    √          √     92.50     79.84
ResNet50                      69.23     61.89
ResNet50     √                72.12     64.13
ResNet50                √     92.29     81.95
ResNet50     √          √     93.34     83.16
ResNeXt101   √          √     89.05     77.45
ResNeSt50    √          √     80.74     75.90
ResNet101    √          √     92.50     79.84

Table 3. The results in bold are the best results obtained on the test set in Dataset1. Classification accuracy is specific to the NMIBC subtype.

Detectors       mAP (%)   IoU (%)   Spec (%)   Sen (%)   Acc (%)
Cascade R-CNN   92.30     82.71     88.37      79.22     84.04
Sparse R-CNN    88.00     64.94     87.80      72.15     80.12
SSD             92.51     81.32     92.59      79.48     84.56
RetinaNet       86.32     79.40     88.75      68.29     78.39
YOLOv4          92.00     78.52     88.09      79.48     83.95
MM-SFENet       93.34     83.16     89.23      81.51     85.65

4.3. Comparison with the mainstream object detection algorithms


We propose SFE and substitute the four-variable-independent-regression Smooth-L1 Loss with IoU Loss under
Faster R-CNN architecture to localize and classify the invasiveness of bladder cancer tumors. The current
mainstream detection algorithms are trained on the dataset of this paper, and the obtained results are shown in
table 3, where the highest-performing backbone and SFE are used for each detector.
Table 3 shows that the proposed model achieves 93.34% mAP, 85.65% Acc, 81.51% Sen, 89.23% Spec and
83.16% IoU when using ResNet50 as the backbone. MM-SFENet has better results when compared with the
current mainstream detectors, which proves the applicability of the proposed method to the bladder tumor
detection problem. To further verify the generalization performance of MM-SFENet, the model is evaluated on the dataset2 test set, achieving a detection accuracy (mAP) of 80.21%, a localization accuracy (IoU) of 75.43%, and a classification accuracy of 77.86%, as shown in table 4.


Table 4. The results in bold are the best results obtained on the test set in Dataset2. Classification accuracy is specific to the NMIBC subtype.

Detectors       mAP (%)   IoU (%)   Spec (%)   Sen (%)   Acc (%)
Cascade R-CNN   79.21     74.38     78.84      60.39     76.47
Sparse R-CNN    72.57     68.49     75.35      70.30     70.38
SSD             74.36     71.40     77.95      71.21     72.92
RetinaNet       74.57     72.41     78.62      71.76     72.32
YOLOv4          76.00     71.52     75.37      72.36     73.95
MM-SFENet       80.21     75.43     79.52      71.87     77.86

Table 5. MM-SFENet detection results of 10-fold cross validation and classification results of ResNet18 on the BCMnist dataset.

Fold   mAP (%)   IoU (%)   Spec (%)   Sen (%)   Acc (%)   BCMnist Acc (%)
1      91.61     81.16     88.37      79.22     82.12     90.56
2      90.12     80.45     90.86      79.15     81.65     89.94
3      91.60     82.67     89.64      79.48     83.56     91.69
4      91.48     82.34     89.37      78.34     83.89     91.97
5      92.00     78.52     88.04      79.48     84.95     92.37
6      90.42     81.37     87.92      78.91     81.64     90.42
7      93.63     82.71     88.53      81.92     85.28     93.65
8      94.71     83.21     89.62      82.11     85.81     94.34
9      92.30     81.71     88.61      78.34     83.04     92.52
10     93.82     83.12     89.37      81.36     85.04     93.14
Mean   92.17     81.72     89.03      79.83     83.70     92.06
Std    1.48      1.44      0.90       1.42      1.55      1.44

4.4. 10-fold cross validation on MM-SFENet and BCMnist


We conduct a 10-fold cross-validation on the MM-SFENet model and employ the approach described in
(Redmon and Farhadi 2017, Yang et al 2023) to extract images from the ground truth, thus creating a dataset for
tumor invasive classification tasks. Subsequently, we train and test on this dataset, utilizing ResNet-18 as the
backbone architecture. The detailed creation of BCMnist can be found in section 3.3. As shown in table 5, the
classification accuracy of BCMnist is always higher than that of MM-SFENet.

4.5. Visualization with smooth Grad-CAM++


Since the learning mechanism of CNN cannot be fully understood, it is necessary to obtain pathological
diagnosis reports based on diagnostic criteria in clinical treatments. Therefore, the class activation heat map is
obtained using Smooth Grad-CAM++ (Omeiza et al 2019). The heat map is enlarged to the original image size by bilinear interpolation and overlaid onto the original image to obtain the classification visualization.
The classification heat map can reveal the basis of the model classification. Figure 9 illustrates that, for the
determination of NMIBC, the model focuses on the continuity of bladder wall. For the determination of MIBC,
the model pays attention to the relationship between the bladder wall and the tumor position, which has
significant implications on the interpretability and improvement of the model. Figure 10 shows that our model
exhibits satisfactory results in terms of generalizability.

5. Discussion and conclusion

In this study, when attempting to adjust the weight of the localization loss within our model, we observe that excessively high weight values can lead to the loss function exhibiting NaN (Not-a-Number) values during training, rendering the model unable to converge effectively. Figure 8(b) shows that increasing the weight value generates additional loss, which increases the overall gradient during training. Choosing the weight carefully proved crucial in enabling the model to converge properly. The results underscore the delicate
balance required when tuning the weight values associated with the localization loss function. While assigning
appropriate weights is essential for effective training guidance, setting them too high can be counterproductive.
The results in table 2 reveal an interesting phenomenon where the depth of the backbone network is not
directly proportional to detection accuracy. ResNet50 excels compared to enhanced ResNet backbones (e.g.
ResNeXt101 and ResNeSt50), despite their inclusion of attention mechanism modules. This indicates that, in


Figure 9. Samples and corresponding heat maps.

Figure 10. Detection results on test set in Dataset2.

bladder cancer tumor detection, the anticipated need for high feature extraction capability may be less
significant. Our results emphasize the crucial role of feature fusion in the SFE architecture for effective bladder
cancer tumor detection. SFE’s design excels in extracting and fusing multi-scale features, underscoring its
performance. In summary, our findings emphasize the nuanced interplay between backbone architecture and
detection accuracy in bladder cancer tumor detection.
Compared to MM-SFENet’s intricate tumor detection task, BCMnist involves a singular classification
objective. Unlike the detection task in MM-SFENet, BCMnist’s classification task excludes SFE’s feature fusion
due to its small input image size. Shallow neural networks prove effective in distinguishing bladder wall and
tumors, as evidenced by ablation experiments, which suggest that deep neural networks are not always necessary
for detecting tumor invasiveness. Figure 9 shows the model's impressive ability to discern subtle textural
differences between the bladder wall and tumors in MRI. This nuanced distinction facilitates the accurate
classification of both MIBC and NMIBC subtypes. The visualization results serve as compelling evidence that
our proposed detection model possesses the capability to differentiate between bladder wall and tumors at a
pixel-level granularity.
However, our study faces two notable limitations. Firstly, the limited availability and labeling quality of
existing MRI datasets for bladder cancer constrained model development and validation, potentially affecting
generalizability due to variations in diagnostic standards and imaging machines among doctors and hospitals.
Secondly, ensuring interpretability in diagnostic criteria, a common concern when deploying deep learning
models for computer-aided diagnosis, is also challenging.


Notably, compared with other machine learning methods, deep learning is a complex black box. Optimizing
this model in the future requires incorporating doctors’ ideas and experiences in disease diagnosis and treatment
to enhance interpretability. Only when the doctor can understand the reason why the model makes the
assessment can the model better assist the doctor in decision-making. VI-RADS is a newly developed scoring
system aimed at standardization of MRI acquisition, interpretation, and reporting for bladder cancer. It has been
proven to be a reliable tool in differentiating NMIBC from MIBC. Our next step is to integrate the industry-
recognized VI-RADS staging standards and multi-parametric MRI with bladder cancer staging (Panebianco et al
2018). This standardized approach aims to significantly enhance the accuracy of the model for bladder cancer
staging and subsequently improve the prognosis of patients.

Acknowledgments

We sincerely appreciate the valuable contributions of reviewers to our article. Their professional knowledge and
rigorous evaluations have helped refine the paper. The constructive comments improved article quality and
readability. We appreciate the hard work and dedication of the reviewers.

Data availability statement

The data cannot be made publicly available upon publication because they contain sensitive personal
information. The data that support the findings of this study are available upon reasonable request from the
authors.

Declaration of competing interest

The authors declare no conflicts of interest.

Funding

This work is partly supported by the National Nature Science Foundation of China (No.61872225), the Natural
Science Foundation of Shandong Province (No.ZR2020KF013, No.ZR2020ZD44, No.ZR2019ZD04, No.
ZR2020QF043) and Introduction and Cultivation Program for Young Creative Talents in Colleges and
Universities of Shandong Province (No.2019-173), the Special fund of Qilu Health and Health Leading Talents
Training Project.

Ethical statement

This study was approved by the Ethics Committee of Shandong University of Traditional Chinese Medicine.
The ethical approval number is 2020-079. All procedures contributing to this work comply with the ethical
standards of the relevant national and institutional committees on human experimentation and the Helsinki
Declaration of 1975, as revised in 2008. All participants signed an informed consent form before the study.

References
Babjuk M et al 2022 European Association of Urology Guidelines on Non–muscle-invasive Bladder Cancer (Ta, T1, and Carcinoma in Situ)
Eur. Urol. 81 75–94
Bandyk M G, Gopireddy D R, Lall C, Balaji K and Dolz J 2021 MRI and CT bladder segmentation from classical to deep learning based
approaches: Current limitations and lessons Comput. Biol. Med. 134 104472
Benson A B et al 2009 NCCN clinical practice guidelines in oncology: hepatobiliary cancers J. Natl Comprehensive Cancer Netw. : JNCCN 7
350–91
Bochkovskiy A, Wang C Y and Liao H Y M 2020 Yolov4: optimal speed and accuracy of object detection arXiv:2004.10934
Caglic I, Panebianco V, Vargas H A, Bura V, Woo S, Pecoraro M, Cipollari S, Sala E and Barrett T 2020 MRI of Bladder Cancer: Local and
Nodal Staging J. Magn. Reson. Imaging 52 649–67
Cai Z, Fan Q, Feris R S and Vasconcelos N 2016 A unified multi-scale deep convolutional neural network for fast object detection Computer
Vision—ECCV 2016 ed B Leibe et al (Netherlands: Springer International Publishing) 354–70
Cai Z and Vasconcelos N 2021 Cascade R-CNN: High Quality Object Detection and Instance Segmentation IEEE Trans. Pattern Anal. Mach.
Intell. 43 1483–98
Cha K H, Hadjiiski L M, Samala R K, Chan H P, Cohan R H, Caoili E M, Paramagul C, Alva A and Weizer A Z 2016 Bladder Cancer
Segmentation in CT for Treatment Response Assessment: Application of Deep-Learning Convolution Neural Network—A Pilot
Study Tomography 2 421–29


Chen Q, Wang Y, Yang T, Zhang X, Cheng J and Sun J 2021 You only look one-level feature 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) pp 13034–43
Dolz J, Xu X, Rony J, Yuan J, Liu Y, Granger E, Desrosiers C, Zhang X, Ben Ayed I and Lu H 2018 Multiregion segmentation of bladder cancer structures in MRI with progressive dilated convolutional networks Med. Phys. 45 5482–93
Girshick R 2015 Fast R-CNN IEEE Int. Conf. on Computer Vision (ICCV) pp 1440–48
Girshick R, Donahue J, Darrell T and Malik J 2014 Rich feature hierarchies for accurate object detection and semantic segmentation IEEE Conf. on Computer Vision and Pattern Recognition pp 580–87
Gsaxner C, Pfarrkirchner B, Lindner L, Pepe A, Roth P M, Egger J and Wallner J 2018 PET-Train: automatic ground truth generation from PET acquisitions for urinary bladder segmentation in CT images using deep learning Biomedical Engineering Int. Conf. (BMEiCON) (Chiang Mai, Thailand, 21-24 November 2018) (IEEE) pp 1–5
Guiot J et al 2022 A review in radiomics: making personalized medicine a reality via routine imaging Medicinal Res. Rev. 42 426–40
Hammouda K et al 2019 A CNN-based framework for bladder wall segmentation using MRI Int. Conf. on Advances in Biomedical Engineering (ICABME) pp 1–4
He K, Zhang X, Ren S and Sun J 2015 Spatial pyramid pooling in deep convolutional networks for visual recognition IEEE Trans. Pattern Anal. Mach. Intell. 37 1904–16
He K, Zhang X, Ren S and Sun J 2016 Deep residual learning for image recognition IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) pp 770–8
Kushnure D T and Talbar S N 2022 HFRU-Net: high-level feature fusion and recalibration UNet for automatic liver and tumor segmentation in CT images Comput. Methods Programs Biomed. 213 106501
Lin T Y, Dollár P, Girshick R, He K, Hariharan B and Belongie S 2017 Feature pyramid networks for object detection IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) pp 936–44
Lin T Y, Goyal P, Girshick R, He K and Dollár P 2020 Focal loss for dense object detection IEEE Trans. Pattern Anal. Mach. Intell. 42 318–27
Liu J, Liu L, Xu B, Hou X, Liu B, Chen X, Shen L and Qiu G 2019 Bladder cancer multi-class segmentation in MRI with pyramid-in-pyramid network IEEE 16th Int. Symp. on Biomedical Imaging (ISBI 2019) pp 28–31
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C 2016 SSD: single shot multibox detector Computer Vision—ECCV 2016 (Netherlands: Springer International Publishing) pp 21–37
Luo W, Li Y, Urtasun R and Zemel R 2017 Understanding the effective receptive field in deep convolutional neural networks arXiv:1701.04128
Lv J, Wang C and Yang G 2021 PIC-GAN: A Parallel Imaging Coupled Generative Adversarial Network for Accelerated Multi-Channel MRI
Reconstruction Diagnostics 11 1
Omeiza D, Speakman S, Cintas C and Weldermariam K 2019 Smooth Grad-CAM++: an enhanced inference level visualization technique for deep convolutional neural network models arXiv:1908.01224
Pan S et al 2023 2D medical image synthesis using transformer-based denoising diffusion probabilistic model Phys. Med. Biol. 68 105004
Panebianco V et al 2018 Multiparametric Magnetic Resonance Imaging for Bladder Cancer: Development of VI-RADS (Vesical Imaging-
Reporting And Data System Eur. Urol. 74 294–306
Pinto J R and Tavares J M R 2017 A versatile method for bladder segmentation in computed tomography two-dimensional images under
adverse conditions Proc. Inst. Mech. Eng. 231 871–80
Redmon J and Farhadi A 2017 YOLO9000: better, faster, stronger IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) pp 6517–25
Redmon J and Farhadi A 2018 YOLOv3: An incremental improvement arXiv:1804.02767
Ren S, He K, Girshick R, Sun J et al 2017 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks IEEE Trans.
Pattern Anal. Mach. Intell. 39 1137–49
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R and LeCun Y 2013 OverFeat: integrated recognition, localization and detection using
convolutional networks arXiv:1312.6229
Siegel R L, Miller K D, Fuchs H E, Jemal A et al 2022 Cancer statistics, 2022 CA: A Cancer Journal for Clinicians 72 7–33
Simonyan K and Zisserman A 2014 Very deep convolutional networks for large-scale image recognition arXiv:1409.1556
Sun P et al 2021 Sparse R-CNN: end-to-end object detection with learnable proposals IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) pp 14449–58
Uijlings J, vandeSande K, Gevers T, Smeulders A et al 2013 Selective Search for Object Recognition Int. J. Comput. Vision 104 154–71
Wong V K, Ganeshan D, Jensen C T, Devine C E et al 2021 Imaging and Management of Bladder Cancer Cancers 13 1396
Wu C, Fu T, Wang Y, Lin Y, Wang Y, Ai D, Fan J, Song H and Yang J 2022 Fusion Siamese network with drift correction for target tracking in
ultrasound sequences Phys. Med. Biol. 67 4
Xu X P, Zhang X, Liu Y, Tian Q, Zhang G P, Yang Z Y, Lu H B and Yuan J 2017 Image and Graphics ed Y Zhao, X Kong and D Taubman
(Shanghai: Springer International Publishing) 528–42
Yang J, Shi R, Wei D, Liu Z, Zhao L, Ke B, Pfister H, Ni B et al 2023 MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D
biomedical image classification Sci. Data 10 41
Yu C J et al 2021 Lightweight deep neural networks for cholelithiasis and cholecystitis detection by point-of-care ultrasound Comput.
Methods Programs Biomed. 211 106382
Yu J, Jiang Y, Wang Z, Cao Z and Huang T 2016 UnitBox: An Advanced Object Detection Network Proceedings of the XXIV ACM
International Conference on Multimedia, MM ’16 (Netherlands: Association for Computing Machinery) 516–20
Zhang G et al 2021 Deep Learning on Enhanced CT Images Can Predict the Muscular Invasiveness of Bladder Cancer Front. Oncol. 11 1
Zhu Q, Du B, Yan P, Lu H and Zhang L 2018 Shape prior constrained PSO model for bladder wall MRI segmentation Neurocomputing 294
19–28
