1 Introduction

Lung cancer is one of the malignant tumors with the highest mortality rate in the world [30]. Research on mutations of the epidermal growth factor receptor (EGFR), a cancer driver gene, has made targeted therapy a relatively effective treatment [33]. EGFR mutation status is involved in the occurrence, development, invasion and metastasis of lung cancer [22]. The detection of EGFR mutation status is crucial for first-line therapies [3], because EGFR tyrosine kinase inhibitors can target specific mutations within the EGFR gene and improve the prognosis of lung cancer patients with EGFR mutations [53]. Biopsy sequencing is the gold standard for gene mutation detection. Because of the widespread heterogeneity of lung tumors, biopsy sequencing needs to locate tissue regions for measuring EGFR mutation status. Its applicability is limited by the difficulty of obtaining tissue samples and of sampling tumors repeatedly, its relatively high cost and poor DNA quality [31]. In addition, biopsy increases the potential risk of cancer metastasis [25]. In these cases, a non-invasive and easy-to-use method for identifying EGFR mutation status is necessary.

Computed tomography (CT), as a non-invasive routine diagnostic technique, can be used in the analysis of lung cancer [20, 52]. Recent studies have shown that features extracted from CT images of lung cancer are related to gene expression patterns [1, 5, 17, 55] and can help identify EGFR mutation status [23, 28, 35, 48, 54]. Although image-based evaluation cannot replace biopsy, it can serve as a complement to it [15, 31]. For example, CT imaging can provide information about tumor heterogeneity, such as tumor density, activity and microenvironment, which helps identify the EGFR mutation status [41, 47]. In addition, CT imaging is low-cost and easy to obtain throughout the treatment process. Therefore, CT imaging is a promising alternative method for detecting EGFR mutation status.

In recent years, researchers have predicted gene mutations from CT images mainly with traditional radiomics, machine learning or statistical methods. Liu Y et al. [24] adopted a radiomic method to extract features such as size, edge, transparency and uniformity from CT images for identifying EGFR mutation status. Velazquez et al. [31] developed a radiomic model based on CT image features and clinical data to distinguish between EGFR- and EGFR+, and between KRAS+ and KRAS-. Zhang et al. [50] also developed a radiogenomic model based on CT image features to predict EGFR mutation status in patients with lung adenocarcinoma. Jia T Y et al. [16] extracted radiomic features and adopted a random forest model to identify EGFR mutation status in lung adenocarcinoma from non-invasive imaging. Morgado J et al. [27] utilized a variety of linear, nonlinear and ensemble classification models, along with several feature selection methods, to classify EGFR as wild type or mutant. To further improve model performance for disease prediction, radiomics methods have also been gradually refined in various aspects, such as feature selection, data processing and classification algorithms. For example, Mandal M et al. [26] proposed a feature selection framework based on a three-stage wrapper filter for disease detection, such as arrhythmia, leukemia, DLBCL and prostate cancer. Ijaz M F et al. [14] proposed a cervical cancer prediction model (CCPM) that uses risk factors as input, removes abnormal data with an outlier detection method, increases the number of cases to balance the data, and finally adopts a random forest classifier to achieve good accuracy. Srinivasu P N et al. [39] proposed a computationally efficient anisotropic weighted heuristic algorithm for real-time image segmentation (AW-HARIS) to automatically segment CT images for identifying abnormalities of the human liver. However, radiomic methods rely on manual annotations of accurate tumor boundaries [6, 10]. Since radiological features are calculated only within the tumor area [49], the tumor microenvironment and adjacent tissues are easily overlooked, resulting in poor specificity of the predictions.

To solve these problems, a large number of end-to-end deep learning models have been proposed and successfully applied to image classification, object detection and image segmentation, such as CNN [21], AlexNet [18], VGGNet [36], ResNet [8] and DenseNet [11]. These models can alleviate the above problems through self-learned features without accurate tumor boundary annotation [19, 34], and can automatically learn features from image data for specific clinical analyses [44]. VGGNet is a deep multi-layer network model proposed by Simonyan K et al. [36], which achieves high accuracy on multiple image recognition datasets, especially in its VGG-16 and VGG-19 configurations. Based on the VGG-16 model, Chen K et al. [2] studied the accuracy-efficiency trade-off of a variety of structured model pruning methods on the CIFAR-10 and ImageNet datasets, improving the memory usage and speed of the model on TPUs. The ResNet models, including ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152, were proposed by He K et al. [8] to address the degradation problem; these networks are easier to optimize and can gain accuracy from considerably increased depth. DenseNet, proposed by Huang G et al. [11], is a densely connected convolutional network that improves performance and reduces the number of parameters by reusing extracted features through bypass connections. Because these deep learning models possess high accuracy, efficiency and reliability, they are widely used in many kinds of medical image research, such as skin disease classification [4, 40], eye disease diagnosis [42] and non-invasive liver disease prediction [45]. Srinivasu P N et al. [40] proposed a computerized process for classifying skin diseases based on MobileNet V2 and Long Short-Term Memory (LSTM), which proved efficient at maintaining stateful information for precise predictions.

In addition, many deep learning models perform well in assisting lung cancer analysis [43, 46] and have gradually been applied to the study of image-based gene mutation prediction. Wang S et al. [47] first proposed an end-to-end deep learning model that uses CT images to predict the EGFR mutation status in lung adenocarcinoma. Song K et al. [38] proposed a joint network named the segmentation-based multi-scale attention model (SMSAM) to predict the mutation status of the KRAS gene in rectal cancer. Qin R et al. [29] proposed a hybrid network combining a 3D CNN and an RNN to design multi-type features and analyze their dependencies for the prediction of EGFR mutation status. However, studies that identify the EGFR mutation status of lung cancer from images with deep learning methods are still scarce, and extracting effective discriminative features for the non-invasive prediction of EGFR mutation status remains a great challenge.

In this work, we developed the ResNet with mixed loss based on the batch training technique (ResNet-MLB) to extract CT image features and identify EGFR mutation status. The proposed models trained on a public dataset can be effectively transferred to another dataset from a different hospital, which shows their good applicability and effectiveness in identifying EGFR mutation status. The proposed models automatically learn EGFR-mutation-related features from CT images; they only require the manual selection of image blocks containing tumor regions and do not require precise tumor boundary segmentation or human-defined features. The proposed approach is a non-invasive auxiliary detection method that avoids invasive injury when surgery or biopsy is inconvenient. Meanwhile, it can help clinicians make treatment decisions for patients, reduce the burden on doctors and promote the development of medicine.

The main contributions of this paper are as follows:

  1. The ResNet-MLB is proposed to identify the EGFR mutation status by extracting more relevant features from CT images, providing a non-invasive and easy-to-implement method for detecting gene mutation status.

  2. A novel mixed loss based on the batch similarity and cross entropy is introduced, which can be easily integrated into some existing CNN models, such as VGGNet, DenseNet and ResNet.

  3. The combination of the mixed loss and the batch training strategy is applied for the first time to the VGGNet, DenseNet and ResNet models to recognize gene mutation status from images.

The rest of this paper is organized as follows. Section 2 introduces the overall architecture of the models based on the batch training strategy, the details of the designed mixed loss, and the computational complexity of the mixed loss with the batch training strategy. Section 3 describes the experiments that demonstrate the effectiveness of the batch training technique and the mixed loss of the proposed model. The conclusion is given in Section 4.

2 Methods

2.1 Overall architecture

ResNet-MLB is proposed to identify the EGFR mutation status by extracting more discriminative features from CT images. The overall architecture mainly includes two parts: a feature extractor and a classifier. This paper mainly uses ResNet as the baseline feature extractor, which is composed of residual blocks and skip connections between blocks. Each residual block consists of a series of convolutional layers, batch normalization and ReLU activation layers. The skip connection improves gradient back-propagation by shortening the path between non-adjacent layers. In addition, it enables the network to automatically learn the flow of features without affecting the performance of the network, thereby enhancing its generalization ability. A fully connected layer is used in the classifier, and the classification is achieved through softmax. The input dimension of the classifier is fixed to 512, and the output dimension equals the number of classes. For example, the EGFR status can be classified as wild type or mutant, so the output dimension of the classifier is set to 2. The overall architecture is shown in Fig. 1.

Fig. 1 The overall architecture of the proposed model

In our framework, the feature extractor F(∗), the classifier C(∗), and the class probability P(∗) are defined as follows,

$$ {F}_i=F\left({x}_i\right),{C}_i=C\left({F}_i\right),{P}_i=P\left({C}_i\right),i=1,2,\cdots, N, $$
(1)

where xi represents the i-th image sample and N is the number of samples.

The feature Fi ∈ R1 × l of the image xi is extracted by the feature extractor F(xi), in which l represents the length of a feature vector. Then, the classification result Ci ∈ R1 × C of the feature Fi is given by the classifier, in which C is the number of classes. The class probability Pi ∈ R1 × C is obtained through the softmax layer; that is, the class probability is the final prediction of the EGFR mutation status of the lung cancer.
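
For illustration, the following minimal numpy sketch traces one 64 × 64 image block through the pipeline of Eq. (1): a placeholder feature extractor, a linear classifier and a softmax layer. The extractor stub, the weight names and the random values are assumptions for demonstration only, not the authors' implementation; in ResNet-MLB the extractor is a ResNet backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
l, C = 512, 2                              # feature length and number of classes (wild type / mutant)

def feature_extractor(x):
    """Stand-in for F(*): in the paper this is a CNN backbone such as ResNet34."""
    return rng.standard_normal((1, l))

def classifier(F, W, b):
    """C(*): a single fully connected layer mapping a 1 x l feature to 1 x C scores."""
    return F @ W + b

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x_i = rng.standard_normal((64, 64))        # one 64 x 64 CT image block
W, b = rng.standard_normal((l, C)), np.zeros(C)

F_i = feature_extractor(x_i)               # F_i in R^(1 x l)
C_i = classifier(F_i, W, b)                # C_i in R^(1 x C)
P_i = softmax(C_i)                         # P_i: predicted probabilities of wild type / mutant
```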

2.2 Mixed loss

In the research field of medical image classification, the cross-entropy loss (CL) is widely used to train CNN models. However, it cannot measure the intra-class and inter-class similarity of samples [12], which prevents the CL from learning discriminative features. Therefore, several other loss functions have been proposed for deep learning models, such as the contrastive loss, the triplet loss, the triplet lifted structure loss and the triplet hard loss, which are able to learn discriminative features, suppress intra-class variation [51] and maximize the gap between different classes [13]. However, they also have drawbacks. For example, the contrastive loss [7] needs to construct sample pairs to train the model. Although there is a considerable number of potential sample pairs in the training set, only a small number of sample pairs are usually sampled during the training phase, which results in a substantial loss of useful information. The triplet loss [32], the triplet lifted structure loss and the triplet hard loss all need to construct triplets from training samples. Although the triplet lifted structure loss [37] considers all possible pairs, it is not smooth, and its smooth upper bound needs to be optimized. The triplet hard loss [9] selects only the hardest pair, which tends to pick out outliers and prevents the network from learning normal relationships.

In each iteration of the error back-propagation algorithm, a batch input X with nb samples is fed into a CNN model for training. For any sample xi ∈ X, i = 1, 2, ⋯, nb, we can obtain the feature vectors Fi = F(xi) ∈ R1 × l, \( {F}_{ij}^{+},j=1,2,\cdots, {n}_i^{+} \) and \( {F}_{ik}^{-},k=1,2,\cdots, {n}_i^{-} \), where \( {F}_{ij}^{+} \) denotes a feature vector of a sample of the same class as xi, and \( {F}_{ik}^{-} \) denotes the opposite. Here, nb, \( {n}_i^{+} \) and \( {n}_i^{-} \) represent the number of batch samples, the number of samples of the same class as xi and the number of samples of a different class from xi, respectively, and they satisfy \( {n}_b={n}_i^{+}+{n}_i^{-}+1 \). As mentioned above, the ideal triplet loss function is effective for clustering images of the same class and separating images of different classes. Therefore, the regular triplet loss is used and can be expressed as,

$$ {\displaystyle \begin{array}{c}{L}_t=\max \left(0,\alpha +d\left({F}_i,{F}_{ij}^{+}\right)-d\left({F}_i,{F}_{ik}^{-}\right)\right),\\ {}i=1,\cdots, {n}_b;j=1,\cdots, {n}_i^{+};k=1,\cdots, {n}_i^{-},\end{array}} $$
(2)

where d(∗, ∗) represents the distance measure between two vectors, and α is a margin that specifies the minimum gap between the distances of positive sample pairs and negative sample pairs. The cosine similarity is used to measure the similarity between samples in this work. Hence, the corresponding similarity-based triplet loss for the feature vectors \( \left({F}_i,{F}_{ij}^{+},{F}_{ik}^{-}\right) \), \( i=1,\cdots, {n}_b;j=1,\cdots, {n}_i^{+};k=1,\cdots, {n}_i^{-}, \) is redefined as

$$ {L}_{trip}\left({s}_{ij}^{+},{s}_{ik}^{-}\right)=\max \left(0,q-{s}_{ij}^{+}+{s}_{ik}^{-}\right), $$
(3)

where q ∈ [0, 1] is a threshold on the cosine similarities of positive and negative sample pairs. \( {s}_{ij}^{+} \) and \( {s}_{ik}^{-} \) are the cosine similarity between the feature vectors Fi and \( {F}_{ij}^{+} \), and that between the feature vectors Fi and \( {F}_{ik}^{-} \), respectively. They can be calculated by

$$ {s}_{ij}^{+}=\frac{F_i^T{F}_{ij}^{+}}{{\left\Vert {F}_i\right\Vert}_2{\left\Vert {F}_{ij}^{+}\right\Vert}_2},\mathrm{and}\ {s}_{ik}^{-}=\frac{F_i^T{F}_{ik}^{-}}{{\left\Vert {F}_i\right\Vert}_2{\left\Vert {F}_{ik}^{-}\right\Vert}_2}, $$
(4)

in which \( {F}_i^T \) denotes the transpose of Fi and ‖Fi‖2 represents the 2-norm of Fi. By reducing the loss Ltrip, \( {s}_{ij}^{+} \) is pushed towards 1 and \( {s}_{ik}^{-} \) towards 0. Note that the diversity and complexity of the triplet inputs, that is, the fact that the triplet loss of the feature vector Fi requires the reference vectors \( {F}_{ij}^{+} \) and \( {F}_{ik}^{-} \), is not conducive to training.
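
As a concrete illustration of Eqs. (3) and (4), the sketch below computes the similarity-based triplet loss for a single triplet of feature vectors; the helper names and the threshold value q = 0.5 are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Eq. (4): cosine similarity between two 1 x l feature vectors."""
    return (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_similarity_loss(F_i, F_pos, F_neg, q=0.5):
    """Eq. (3): hinge on the gap between positive and negative cosine similarities."""
    s_pos = cosine_similarity(F_i, F_pos)   # s_ij^+
    s_neg = cosine_similarity(F_i, F_neg)   # s_ik^-
    return max(0.0, q - s_pos + s_neg)

rng = np.random.default_rng(1)
F_i, F_pos, F_neg = (rng.standard_normal((1, 512)) for _ in range(3))
print(triplet_similarity_loss(F_i, F_pos, F_neg))
```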

In order to address these problems, a new mixed loss based on batch similarity and cross entropy is proposed to guide the network to learn better model parameters. The mixed loss function is defined on the input batch samples used to train the network. The similarities of all possible sample pairs in the batch are stored in the batch similarity matrix \( S\in {R}^{n_b\times {n}_b} \), in which the element Sij can be calculated by

$$ {S}_{ij}={\overset{\sim }{F}}_i{\overset{\sim }{F}}_j^T,i,j=1,2,\cdots, {n}_b, $$
(5)

where Sij represents the similarity between the i-th feature vector \( {\overset{\sim }{F}}_i \) and the j-th feature vector \( {\overset{\sim }{F}}_j \), in which \( {\overset{\sim }{F}}_i={F}_i/{\left\Vert {F}_i\right\Vert}_2,i=1,2,\cdots, {n}_b \) has size 1 × l, and nb is the number of input batch samples.

To analyze the similarity matrix S, a binary matrix \( B\in {R}^{n_b\times {n}_b} \) corresponding to the ground truth is constructed to distinguish the similarities of positive and negative pairs. Its element Bij can be calculated by

$$ {B}_{ij}={y}_i{y}_j^T,i,j=1,2,\cdots, {n}_b, $$
(6)

where yi ∈ R1 × C is a row vector of all zeros except that its ci-th element is one, corresponding to the i-th sample with ground truth label ci, in which ci ∈ {1, 2, ⋯, C} and C is the number of classes.

From Eqs. (5) and (6), the discriminative similarity matrix D can be constructed as follows,

$$ D=S\odot \left(2B-1\right), $$
(7)

where \( \mathbf{1}\in {R}^{n_b\times {n}_b} \) is a matrix whose elements are all 1 and the symbol ⊙ denotes the Hadamard product. Note that in the matrix D the similarities of positive pairs are greater than 0 and the similarities of negative pairs are smaller than 0. Moreover, the diagonal elements of D and S are set to zero, i.e. dii = 0 and sii = 0, because they represent the similarity between the vector Fi and itself and therefore carry no information about the data distribution. The process of constructing the required discriminative matrix to evaluate the similarities among all samples is shown in Fig. 2.

Fig. 2 Construction process of discriminative similarity matrix
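
For clarity, a minimal numpy sketch of Eqs. (5)–(7) is given below. It builds the batch similarity matrix S, the binary label matrix B and the discriminative similarity matrix D with a zeroed diagonal; the function name is hypothetical and the labels are assumed to be 0-indexed class indices.

```python
import numpy as np

def discriminative_similarity(features, labels, num_classes):
    """Build D = S * (2B - 1) with a zeroed diagonal, following Eqs. (5)-(7).

    features: n_b x l array of feature vectors; labels: length-n_b array of class indices.
    """
    F = features / np.linalg.norm(features, axis=1, keepdims=True)  # row-normalized features
    S = F @ F.T                                                     # Eq. (5): pairwise cosine similarities
    Y = np.eye(num_classes)[labels]                                 # one-hot label vectors y_i
    B = Y @ Y.T                                                     # Eq. (6): 1 for same-class pairs, 0 otherwise
    D = S * (2 * B - 1)                                             # Eq. (7): flip the sign of negative pairs
    np.fill_diagonal(D, 0.0)                                        # self-similarities carry no information
    return D
```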

The loss of each sample xi ∈ X in the input batch can be evaluated from the i-th row of the discriminative similarity matrix D defined above. The value of dij (dij ∈ D) represents the similarity between the feature vector Fi and the other vectors in the batch (except for the diagonal element dii = 0). Obviously, the i-th row may contain several positive-pair and negative-pair similarities. For convenience, for the i-th row, we re-express the similarities of positive pairs as \( {d}_{ij}^{+},j=1,2,\cdots, {n}_i^{+} \), and the similarities of negative pairs as \( {d}_{ik}^{-},k=1,2,\cdots, {n}_i^{-} \), which satisfy \( {n}_b={n}_i^{+}+{n}_i^{-}+1 \). The triplet loss of xi based on the batch similarity is defined as

$$ {L}_{s- trip}\left(D,{x}_i\right)=\max \left(0,q-\frac{1}{n_i^{+}}\sum \limits_{j=1}^{n_i^{+}}{d_{ij}^{+}}^2+\frac{1}{n_i^{-}}\sum \limits_{k=1}^{n_i^{-}}{d_{ik}^{-}}^2\right) $$
(8)

where the similarities of the positive pairs and the negative pairs are replaced by their average squared similarities. Since |dij| ≤ 1, the square of dij is used instead of a linear function, so that the loss has a smoother gradient and converges more easily to the optimal solution. Therefore, the average batch similarity loss of a batch is expressed as

$$ \overline{L_{s- trip}}=\frac{1}{n_b}\sum \limits_{i=1}^{n_b}{L}_{s- trip}\left(D,{x}_i\right) $$
(9)

It is worth noting that the new batch similarity loss based on the triplet loss can further guide the model learning, so that samples of the same class become closer and the differences between samples of different classes become more obvious, as shown in Fig. 3.

Fig. 3 Clustering process based on batch similarity loss
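
Continuing the illustrative sketch above, Eqs. (8) and (9) can be evaluated row by row from D as follows; the hinge threshold q = 0.5 and the guard for batches missing a positive or negative pair are assumptions.

```python
import numpy as np

def batch_similarity_loss(features, labels, num_classes, q=0.5):
    """Average batch similarity loss over a batch, following Eqs. (8)-(9)."""
    labels = np.asarray(labels)
    D = discriminative_similarity(features, labels, num_classes)  # from the previous sketch
    n_b = len(labels)
    losses = []
    for i in range(n_b):
        pos = labels == labels[i]
        pos[i] = False                       # exclude the sample itself (d_ii = 0)
        neg = labels != labels[i]
        if not pos.any() or not neg.any():   # the BT strategy keeps both kinds of pairs in a batch
            continue
        d_pos, d_neg = D[i, pos], D[i, neg]
        # Eq. (8): hinge on the averaged squared similarities of positive and negative pairs
        losses.append(max(0.0, q - np.mean(d_pos ** 2) + np.mean(d_neg ** 2)))
    return float(np.mean(losses)) if losses else 0.0   # Eq. (9): average over the batch
```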

In order to better classify the EGFR mutation status and better evaluate the classification ability, the softmax classifier and the CL function Lce are also used in the model. The CL based on the softmax probability is defined as,

$$ {L}_{ce}=-\frac{1}{n_b}\sum \limits_{i=1}^{n_b}\sum \limits_{j=1}^C{1}_{c_i=j}\log \frac{e^{w_j^T{F}_i^T+{b}_j}}{\sum_{j=1}^C{e}^{w_j^T{F}_i^T+{b}_j}}, $$
(10)

where Fi represents the extracted feature vector corresponding to the i-th image sample; wj and bj are the parameters of the classifier (a fully connected layer) corresponding to the j-th class; and ci denotes the class of the i-th image. The value of \( {1}_{c_i=j} \) is 1 when the ground truth ci is equal to j, and 0 otherwise.

Therefore, based on the average batch similarity loss in Eq. (9) and the CL in Eq. (10), the new mixed loss (ML) can be defined as,

$$ L=\beta {L}_{ce}+\gamma \overline{L_{s- trip}}, $$
(11)

where β and γ are the weight parameters of the cross-entropy loss Lce and the average batch similarity loss \( \overline{L_{s- trip}} \), respectively. The ML enables the model to classify samples better and to obtain their discriminative features; Algorithm 1 shows the implementation procedure of the mixed loss.

Algorithm 1 The implementation procedure of the mixed loss
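
Because the algorithm figure is not reproduced here, the following numpy sketch outlines the procedure that Algorithm 1 describes: the softmax cross entropy of Eq. (10) is combined with the average batch similarity loss of Eq. (9) as in Eq. (11). The logits-based formulation, the function signatures and the default weights are illustrative assumptions; the sketch reuses the batch_similarity_loss helper given earlier.

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Eq. (10): softmax cross entropy averaged over the batch (logits = classifier outputs)."""
    labels = np.asarray(labels)
    z = logits - logits.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def mixed_loss(features, logits, labels, num_classes=2, beta=1.0, gamma=0.5, q=0.5):
    """Eq. (11): L = beta * L_ce + gamma * average batch similarity loss."""
    return beta * cross_entropy_loss(logits, labels) + \
           gamma * batch_similarity_loss(features, labels, num_classes, q)
```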

2.3 Batch training strategy

The batch training (BT) strategy is adopted in the model. It applies gradient descent to each batch, and because the number of samples in each batch is small, models can be trained within limited memory. Meanwhile, BT can also be used for distributed training to speed up convergence. However, the batch size affects the stability of the training process, so it is important to choose an appropriate batch size to improve running efficiency and memory utilization. In order to obtain a better data distribution of the training set, a new construction scheme for batch samples is used, as shown in Table 1, which gives the specific steps of constructing the batch samples. Then, based on the batch samples, the ML is calculated and the ResNet-MLB model is trained. It is worth noting that this strategy is used only during the training phase. In the testing phase, the batch similarity loss and the CL are ignored, and the softmax function is used to output the prediction results.

Table 1 The construction procedure of the batch samples

Taking the EGFR status of lung cancer as an example, the set of samples is denoted X = {x1, x2, ⋯, xN}, where N is the total number of samples. Using the construction scheme of batch samples in Table 1, the samples are first divided into the EGFR-wild type, denoted as \( \left\{{x}_1^{-},{x}_2^{-},\cdots, {x}_{N_w}^{-}\right\} \), and the EGFR-mutant type, denoted as \( \left\{{x}_1^{+},{x}_2^{+},\cdots, {x}_{N_m}^{+}\right\} \), based on the sample labels. Note that N = Nw + Nm, where Nw and Nm denote the number of EGFR-wild type and EGFR-mutant samples, respectively. Then the EGFR-wild type samples are divided into three groups by clustering: \( \left\{{x}_1^{-},{x}_2^{-},\cdots, {x}_i^{-}\right\} \), \( \left\{{x}_{i+1}^{-},{x}_{i+2}^{-},\cdots, {x}_j^{-}\right\} \), \( \left\{{x}_{j+1}^{-},{x}_{j+2}^{-},\cdots, {x}_{N_w}^{-}\right\} \). Similarly, the EGFR-mutant type is also divided into three groups: \( \left\{{x}_1^{+},{x}_2^{+},\cdots, {x}_m^{+}\right\} \), \( \left\{{x}_{m+1}^{+},{x}_{m+2}^{+},\cdots, {x}_r^{+}\right\} \), \( \left\{{x}_{r+1}^{+},{x}_{r+2}^{+},\cdots, {x}_{N_m}^{+}\right\} \). Finally, CT images are randomly selected from each group in proportion to form an input batch; a sketch of this procedure is given below.
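
The following minimal sketch of the batch construction scheme assumes K-means clustering (as in Section 3.2) and scikit-learn; the helper name and the proportional-rounding details are illustrative rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_batch(features, labels, batch_size=36, groups_per_class=3, seed=0):
    """Construct one input batch by proportional sampling from per-class K-means groups."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    batch_idx = []
    for cls in np.unique(labels):                          # EGFR-wild type and EGFR-mutant
        cls_idx = np.flatnonzero(labels == cls)
        groups = KMeans(n_clusters=groups_per_class,
                        random_state=seed).fit_predict(features[cls_idx])
        for g in range(groups_per_class):
            g_idx = cls_idx[groups == g]
            # draw from each group in proportion to its share of the training set
            n_draw = max(1, round(batch_size * len(g_idx) / len(labels)))
            n_draw = min(n_draw, len(g_idx))
            batch_idx.extend(rng.choice(g_idx, size=n_draw, replace=False))
    return np.asarray(batch_idx)                           # indices of the sampled image blocks
```

In practice, the class split and the clustering are performed only once before training, and only the proportional sampling step is repeated for each batch, which matches the complexity analysis in Section 2.4.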

2.4 Computational complexity

In this subsection, the computational complexity (Ccpl) of the mixed loss with the batch training strategy in each iteration is measured in terms of the required number of floating-point operations (FLOPs). Assuming that the training set contains Ntrain samples, the computational complexity can be calculated by

$$ {C}_{cpl}={C}_{BT}+{C}_{ML}, $$
(12)

where CBT and CML are the computational complexity of executing the batch training strategy and the mixed loss in each iteration, respectively.

The computational complexity CBT of executing the batch training strategy can be obtained from the complexity of the four steps in Table 1,

$$ {C}_{BT}={N}_{train}+O(clustering)+{N}_{group}+{n}_b, $$
(13)

where Ntrain denotes the complexity of the first step of the BT strategy, O(clustering) is the required number of FLOPs of the clustering method used in the second step, and Ngroup is the number of groups of training images, so that Ngroup FLOPs are taken to obtain the proportion of images in each group in the third step. The fourth step requires nb FLOPs to build an input batch. Note that the first three steps of the proposed BT strategy are executed only once before the training process, while the fourth step is executed once per iteration during training.

The computational complexity CML of executing the mixed loss can be obtained from Eqs. (5)–(9),

$$ {C}_{ML}=\left({l}^2+{C}^2+l+C+3\right){n}_b^2+\left(2l+5\right){n}_b, $$
(14)

where Eq. (5) requires lnb multiplications and (l − 1)nb additions for the normalization of the embedded vectors, i.e. (2l − 1)nb FLOPs, and the calculation of the similarity matrix S in Eq. (5) needs \( \left({l}^2+l-1\right){n}_b^2 \) FLOPs. In addition, encoding the ground truth as the one-hot vectors yi needs nb indexing and nb assignment operations, which are counted as 2nb FLOPs for convenience in this paper. Meanwhile, the calculation of the binary matrix B in Eq. (6) takes \( \left({C}^2+C-1\right){n}_b^2 \) FLOPs. For the discriminative similarity matrix D, the cost is \( 3{n}_b^2+{n}_b \) FLOPs, which is computed once per batch and shared by every sample in it; this includes setting the diagonal of the matrix on the right-hand side of Eq. (7) to 0. Eq. (8) is evaluated nb times, with a total cost of \( 2{n}_b^2+2{n}_b \) FLOPs. In Eq. (9), nb FLOPs are performed for averaging the batch similarity loss.
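
As a worked example of Eq. (14), the short computation below evaluates CML for the settings used later in the experiments (feature length l = 512, C = 2 classes, batch size nb = 36), giving roughly 3.4 × 10^8 FLOPs per iteration.

```python
def mixed_loss_flops(l, C, n_b):
    """Eq. (14): FLOPs of the mixed loss in one training iteration."""
    return (l**2 + C**2 + l + C + 3) * n_b**2 + (2 * l + 5) * n_b

print(mixed_loss_flops(l=512, C=2, n_b=36))   # 340450884, i.e. about 3.4e8 FLOPs
```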

3 Results and discussion

3.1 Datasets and details

3.1.1 Clinical characteristics of patients

The experiments are conducted on the public NSCLC Radiogenomics dataset as the training set and on the cooperative hospital dataset from Shanxi Province as the validation set. The NSCLC Radiogenomics dataset is downloaded from the TCIA website (https://wiki.cancerimagingarchive.net). The institutional review board of Shanxi cancer hospital approved this retrospective study and waived the requirement for patient informed consent. Meanwhile, patients from the public dataset and the cooperative hospital must meet the following inclusion criteria:

  (1) Primary lung cancer confirmed by histology;

  (2) Pathological examination of tumor specimens to confirm EGFR mutation status;

  (3) Preoperative enhanced CT data.

Besides, in the training and validation datasets, patients are excluded in the following situations: (1) lack of clinical data (age, gender, stage); (2) receipt of preoperative treatment; or (3) an interval of more than 1 month between the CT examination and surgery.

The lesion areas in all CT images from 155 patients of the public dataset and 56 patients of the cooperative hospital are marked by experienced radiologists (12 years of lung imaging practice) at the partner hospital. Based on these marked lesion areas, the experimental dataset is constructed; it contains a total of 16,040 image blocks of size 64 × 64 that cover all marked tumor lesion areas, and each image block is labeled as EGFR-mutant or EGFR-wild type according to the patient's clinical information. Figure 4 shows some CT images including EGFR-mutant and EGFR-wild type image samples.

Fig. 4 Lung cancer CT images including the EGFR mutant and EGFR wild image samples

Table 2 lists the detailed construction of the lung cancer dataset. The training set contains 12,835 images from the public dataset (3310 EGFR-mutant and 9525 EGFR-wild type), and the validation set contains 3205 images from the partner hospital (825 EGFR-mutant and 2380 EGFR-wild type).

Table 2 The construction of the lung cancer dataset

Table 3 presents the clinical characteristics of the patients, including the number of patients, average age, sex, smoking status, histology and EGFR mutation status in the training set and the validation set, together with the corresponding p value between the two datasets. Table 3 shows that the p values of age, sex, smoking status, histology and EGFR mutation status are greater than 0.05, which implies that there are no significant differences in these characteristics between the training set and the validation set. Note that a p value less than 0.05 would indicate a statistically significant difference in the corresponding characteristic between the training set and the validation set.

Table 3 The clinical characteristics of patients

3.1.2 Experimental details

Because the amount of medical image data is small, some simple data augmentation methods, such as horizontal flipping, vertical flipping and random rotation, are used to expand the training set in order to prevent overfitting and improve the classification ability of the model.

In the experiment, the Adam gradient optimization algorithm is used to optimize the parameters of the model, the weight decay rate is set to 1e-8 and the learning rate is set to 1e-4. The input batch size is set to 36. Moreover, in order to calculate the batch similarity triplet loss, the dimensionality of the extracted features is reduced to 512. In the mixed loss, the weight parameters β and γ are set to 1 and 0.5, respectively. The performance of the model is evaluated on the validation set in each epoch.
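
For reference, the snippet below collects the hyperparameters listed above and shows how the loss weights enter the objective; it reuses the illustrative mixed_loss sketch from Section 2.2 rather than the authors' Keras/TensorFlow code, and the threshold q = 0.5 is an assumption since its value is not reported here.

```python
# Hyperparameters reported above (optimizer: Adam); q is an assumed value
config = {
    "learning_rate": 1e-4,
    "weight_decay": 1e-8,
    "batch_size": 36,
    "feature_dim": 512,   # length l of the reduced feature vectors
    "beta": 1.0,          # weight of the cross-entropy term in Eq. (11)
    "gamma": 0.5,         # weight of the batch similarity term in Eq. (11)
}

def training_objective(features, logits, labels):
    """Per-batch objective assembled from the reported loss weights (illustrative wiring only)."""
    return mixed_loss(features, logits, labels, num_classes=2,
                      beta=config["beta"], gamma=config["gamma"], q=0.5)
```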

To test the effectiveness of the proposed mixed loss, we applied it to the VGG16Net, ResNet18, ResNet34, ResNet50 and DenseNet networks. The number of parameters and the computational complexity of these networks are listed in Table 4, where the FLOPs are computed for an input image of size 64 × 64 × 1. As can be seen from Table 4, DenseNet has the fewest parameters and ResNet18 has the smallest number of FLOPs.

Table 4 The parameters and computational complexity of the compared networks

The experiments in this work are carried out on a workstation with Ubuntu 18.04 LTS; the CPU of the server is a 2.90 GHz Intel(R) Xeon(R) W-2102, and the GPU is an NVIDIA TITAN XP with CUDA 10.1 for acceleration. Besides, all deep learning models are implemented in Python 3.7.9 with Keras 2.3.1 and TensorFlow 1.15.0.

3.2 Influence of the size of batch samples

In this subsection, we consider the influence of the batch size nb on the accuracy of the proposed models, in order to choose an appropriate batch size for improving running efficiency and memory utilization. Using the construction procedure of the batch samples, the lung cancer training dataset of Section 3.1 is divided into two classes and three groups in each class by the K-means algorithm. Then, using eight different proportions, batch samples of different sizes are constructed, i.e. nb = 6, 12, 18, 24, 30, 36, 42, 48, where each batch is drawn from the six groups in the same proportion.

In this experiment, the different batch sizes are used for training the ResNet-MLB models, and the other parameters are fixed as in Section 3.1.2. The accuracy (ACC) is used as an index to evaluate the classification ability of the models, which is calculated by:

$$ ACC=\frac{TP+ TN}{N^{\prime }}, $$
(15)

where TP and TN, respectively, represent the number of correct predictions among all samples labeled EGFR-mutant (true positives) and the number of correct predictions among all samples labeled EGFR-wild type (true negatives), and N′ is the total number of images in the validation set.

Figure 5 shows the ACC values of the models at different batch sizes nb. It can be seen from Fig. 5 that the highest accuracy of 81.58% and the lowest accuracy of 78.52% are obtained at nb = 36 and nb = 6, respectively. In our results, the accuracy is positively correlated with the batch size nb, which is in line with the hypothesis that a larger nb makes the data distribution of each batch closer to the overall distribution of the training set. However, considering that a larger nb results in an insufficient number of training iterations within an epoch, nb = 36 is used as the default value in the rest of the experiments.

Fig. 5 Accuracy analysis based on different input batch sizes nb on the lung cancer dataset

3.3 Influence of batch training strategy

In this subsection, the effect of the BT strategy on the accuracy of the ResNet34 models using the CL (ResNet34-CL) or the ML (ResNet34-ML) is examined through a comparison with the random selection (RS) strategy. Figure 6a shows the training and validation loss curves of the ResNet34-CL models using BT and RS, and Fig. 6b shows those of the ResNet34-ML models using BT and RS. It can be seen from these figures that the loss curves of the BT strategy are smoother than those of the RS strategy for all models, which shows that it is easier to train the network with the BT strategy. In addition, we also find that the gap between the training loss and the validation loss is reduced by the BT strategy, which implies that it can alleviate the overfitting problem during model training to some extent.

Fig. 6 Loss curves of different training strategies for ResNet34

Table 5 lists the accuracy of ResNet34-CL and ResNet34-ML using the BT and RS strategies on the validation set. As listed in Table 5, the accuracy of the ResNet34-CL model using the BT strategy is 1.14% higher than that using the RS strategy, and the accuracy of ResNet34-ML using the BT strategy is improved by 0.93% compared with ResNet34-ML using the RS strategy. The results indicate that the batch training strategy is beneficial for training models. It can also be seen that the BT strategy is more effective for the model based on the CL, meaning that it compensates more significantly for the CL with respect to the data distribution. The reason is that the ML, which includes the batch similarity, can evaluate the quality of the training data distribution to a certain extent, whereas the CL function does not have this ability.

Table 5 Accuracy of different training strategies for ResNet34

3.4 Comparison and verification of results

In this subsection, the applicability and effectiveness of the models using the ML are studied. In this experiment, the CL, the CL combined with the triplet loss (CTL) in ref. [32], the CL combined with the improved lifted structure loss (CIL) in ref. [37] and the proposed ML are applied to the VGG16Net, ResNet18, ResNet34, ResNet50 and DenseNet models for comparison.

The effectiveness of the new ML is first examined by comparing it with the CL. Figure 7 shows the training and validation loss curves of the different models with the CL and the ML. From these figures, we find that:

  (1) For the models using the CL, the validation loss of VGG16Net increases as the iterations increase in Fig. 7b, which implies that the overfitting problem of VGG16Net is more serious than that of the other models.

  (2) The loss curves of the models using the ML are smoother, which implies that the overfitting problem can be suppressed.

  (3) The gap between the training and validation losses with the ML is smaller than that with the CL in all models, which demonstrates that the ML can have a regularizing effect on the CL.

Fig. 7 Loss curves for the compared networks

In summary, the mixed loss can suppress overfitting, indicating that it has a regularizing effect on the CL.

Then, the performance of the models with the new ML is further studied by comparing them with the models using the CL, CTL and CIL. The identification of the EGFR mutation status is a binary classification task, and the sensitivity SE and specificity SP are used to evaluate the performance of these models in this experiment. They can be calculated by

$$ SE=\frac{TP}{TP+ FN}\mathrm{and}\ SP=\frac{TN}{TN+ FP}, $$
(16)

where FP and FN, respectively, represent the number of EGFR-wild type samples incorrectly predicted as mutant (false positives) and the number of EGFR-mutant samples incorrectly predicted as wild type (false negatives). The sensitivity SE and specificity SP measure the ability of the models to correctly identify the EGFR-mutant and EGFR-wild type in CT images of lung cancer. In addition, the accuracy (ACC) and the area under the receiver operating characteristic (ROC) curve (AUC) are also used to evaluate the classification ability of the models. The results (including the sensitivity SE, the specificity SP, the ACC and the AUC) of the VGG16Net, ResNet18, ResNet34, ResNet50 and DenseNet models with the different losses (CL, CTL, CIL and ML) are listed in Table 6. From the table, we find that:

  (1) The accuracy of all models with the CIL and the ML is higher than that of the models with the CL, which means the improved lifted structure loss and the batch similarity loss can improve the optimization ability of these models.

  (2) The highest sensitivity, specificity, accuracy and AUC are obtained by the VGG16Net, ResNet18, ResNet34 and ResNet50 models with the ML, which demonstrates that the ML is more effective than the CL, CTL and CIL for the VGGNet and ResNet models. In addition, the DenseNet with the ML and the DenseNet with the CIL both achieve high performance, which means that both the ML and the CIL are robust for the DenseNet model.

  (3) Among all models, ResNet34 provides the highest accuracy (81.58%) across the four losses, which is 2.37% higher than the model with the CL, 0.73% higher than the model with the CTL and 0.53% higher than the model with the CIL. This shows that ResNet34-ML can better learn the discriminative characteristics of the samples and achieve better classification ability.

Table 6 Metrics for the compared networks based on each loss

In general, the ML has better adaptability and effectiveness in model training, and ResNet34-ML achieves the best performance.

Finally, our results for identifying EGFR mutations in lung cancer are compared with the latest studies, as listed in Table 7. Table 7 shows that ResNet34-ML achieves the highest sensitivity, specificity, accuracy and AUC among the compared studies. This further illustrates that the proposed architecture offers a certain degree of improvement and that ResNet34-ML with the BT strategy is effective for identifying the EGFR mutation status.

Table 7 Comparison of the studies

4 Conclusions

In this work, ResNet-MLB models are proposed that use the mixed loss and the batch training technique for the identification of EGFR mutation status in lung cancer. In these models, the mixed loss based on batch similarity and cross entropy is proposed and the batch training technique is applied, which guides the network to learn better parameters. Experiments on the batch size, the batch training strategy and various models with different losses are conducted on the lung cancer CT image dataset, and the following conclusions are obtained: (1) The performance of the BT strategy is 0.93% higher than that of the RS strategy for the ResNet34-ML model, and the performance of ResNet34-CL using the BT strategy is improved by 1.14% compared with ResNet34-CL using the RS strategy. Hence, the BT strategy is beneficial for training models, especially ResNet34-CL. This is because the BT strategy compensates more significantly for the CL with respect to the data distribution, while the ML, which includes the batch similarity, can evaluate the quality of the training data distribution to a certain extent. (2) For the common models considered, the proposed mixed loss is superior in sensitivity, specificity, accuracy and AUC (sensitivity = 80.02%, specificity = 82.90%, accuracy = 81.58%, AUC = 0.8861) compared with the other losses, which means the ML has better adaptability and effectiveness in model training. (3) ResNet34-ML with the batch training technique can better learn the discriminative characteristics of the samples and achieves the best classification performance among all models.

In short, the proposed mixed loss is applicable and effective, and ResNet34-ML with the batch training technique shows better identification ability on the lung cancer CT image dataset. The advantage of our method is that it provides a non-invasive alternative for identifying the EGFR mutation status when a patient is not suitable for biopsy, and helps clinicians make treatment decisions quickly.

Although the performance of the ResNet-MLB models is encouraging, this study has some limitations. First, our research focuses only on the EGFR mutation status of lung cancer, and the relationship between EGFR mutations and other gene mutations (such as KRAS and ALK) is not considered. Second, we only identify the EGFR mutation status of lung cancer from CT images; the combination of CT with other imaging modalities (such as PET) remains unexplored. Therefore, the correlation between EGFR mutations and other gene mutations will be explored in the future by introducing attention mechanisms and multi-task learning. Besides, more CT images and other images of lung cancer will be collected to design a fusion strategy, which may improve the identification performance.