Detection of COVID-19 From Chest X-Ray Images Using Convolutional Neural Networks
SLAS Technology, Original Research. DOI: 10.1177/2472630320958376
Abstract
The detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which is responsible for coronavirus disease 2019 (COVID-19), using chest X-ray images has life-saving importance for both patients and doctors. In countries that are unable to purchase laboratory test kits, this becomes even more vital. In this study, we aimed to present the use of deep learning for the high-accuracy detection of COVID-19 using chest X-ray images. Publicly available X-ray images (1583 healthy, 4292 pneumonia, and 225 confirmed COVID-19) were used in the experiments, which involved the training of deep learning and machine learning classifiers. Thirty-eight experiments were performed using convolutional neural networks, 10 experiments were performed using five machine learning models, and 14 experiments were performed using the state-of-the-art pre-trained networks for transfer learning. Images and statistical data were considered separately in the experiments to evaluate the performances of the models, and eightfold cross-validation was used. A mean sensitivity of 93.84%, mean specificity of 99.18%, mean accuracy of 98.50%, and mean receiver operating characteristics-area under the curve (ROC AUC) score of 96.51% were achieved. A convolutional neural network without pre-processing and with minimized layers is capable of detecting COVID-19 from a limited and imbalanced set of chest X-ray images.
Keywords
COVID-19, pneumonia, X-ray, convolutional neural networks, coronavirus
1 Department of Information Systems Engineering, Near East University, Nicosia/TRNC, Mersin-10, Turkey
2 Department of Biomedical Engineering, Faculty of Engineering & DESAM Institute, Near East University, Nicosia/TRNC, Mersin-10, Turkey

Received May 4, 2020, and in revised form July 30, 2020. Accepted for publication Aug 23, 2020.

Corresponding Author: Ilker Ozsahin, Department of Biomedical Engineering, Faculty of Engineering, Near East University, Nicosia/TRNC, Mersin-10, 99138, Turkey. Email: ilkerozsahin@windowslive.com

Introduction

At the end of 2019, humankind was faced with an epidemic, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related pneumonia, referred to as coronavirus disease 2019 (COVID-19), that people did not expect to encounter in the current era of technology. While the COVID-19 outbreak started in Wuhan, China, the significant spread of the epidemic around the world has meant that the amount of equipment available to doctors fighting the disease is insufficient. At the time of writing (September 8, 2020), there have been more than 27,000,000 confirmed cases and more than 875,000 confirmed deaths worldwide.1 Considering the time required for diagnosis and the financial costs of the laboratory kits used for diagnosis, artificial intelligence (AI) and deep learning research and applications have been initiated to support doctors who aim to treat patients and fight the illness.2

Although rapid point-of-care COVID-19 tests are expected to be used in clinical settings at some point, for now, turnaround times for COVID-19 test results range from 3 to more than 48 hours, and probably not all countries will have access to test kits that give results rapidly. According to a recently published multinational consensus statement by the Fleischner Society, one of the main recommendations is to use chest radiography for patients with COVID-19 in a resource-constrained environment when access to computed tomography (CT) is limited.3 The financial costs of the laboratory kits used for diagnosis, especially for developing and underdeveloped countries, are a significant issue when fighting the illness.
Using X-ray images for the automated detection of COVID-19 might be helpful in particular for countries and hospitals that are unable to purchase laboratory test kits or that do not have a CT scanner. This is significant because, currently, no effective treatment option has been found, and therefore effective diagnosis is critical.

AI tools have produced stable and accurate results in applications that use either image-based or other types of data.2,4-6 Apostolopoulos and Mpesiana2 performed one of the first studies on COVID-19 detection using X-ray images. In their study, they considered transfer learning using pre-trained networks such as VGG19, MobileNet V2, Inception, Xception, and Inception ResNet V2, which are the most frequently used. Several evaluation metrics were used to evaluate the results obtained from two different datasets. MobileNet V2 and VGG19 achieved 97.40% and 98.75% accuracy, respectively, for two-class experiments (COVID-19/Normal and COVID-19/Pneumonia), and 92.85% and 93.48% for three-class experiments (COVID-19/Pneumonia/Normal). The final conclusion was drawn by the authors using the obtained confusion matrices rather than the accuracy results, because of the imbalanced data.

Ozsahin et al.4 used the average pixel per node (APPN) approach, which is also considered in this study, together with image pre-processing techniques to detect Alzheimer's disease in positron emission tomography (PET) images. Dai et al.5 modeled vehicle interactions using long short-term memory neural networks and predicted vehicle trajectories. Yilmaz and Sekeroglu6 applied several machine learning classification models to classify student performance using a numerical dataset; the logistic regression (LR) and decision tree (DT) models they implemented are also considered in this study. All these and similar studies obtained high-accuracy results using AI techniques. For that reason, AI has been widely used in the past two decades. AI, which aims to imitate human nature, can learn and make decisions from data and images.

Deep learning, which takes its name from the number of its hidden layers, has gained a special place in the field of AI by providing successful results for both image-based classification applications and regression problems during the past 10 years.7,8 The frequent use of deep convolutional neural networks (ConvNets, or CNNs)9 has enabled image-based applications to reach their peak in the past 5 years. Generally, CNNs, which try to simulate biological aspects of human beings on computers, required pre-processing of images or data before feeding them to the network. When the ConvNet was first invented, however, it was described as a neural network that requires minimal pre-processing of images before feeding them to the network, and as a system that is capable of extracting the features from images to optimize the learning performance of the neural network.9 The ConvNet comprises both feature extraction and classification phases in a single network. A traditional ConvNet consists of three layer types: convolutional, pooling, and fully connected layers. Feature extraction is performed in the convolutional layer by applying masks, which is the process of dividing images into segments of a predefined dimension and using filters to extract features from the image. Then a feature map, which is the projection of the features onto a 2D map, is created by applying an activation function to the values obtained by the masks. The activation function activates the most knowledgeable neurons in a nonlinear way and reduces the computational cost of the neural network. Several activation functions are available in CNNs, and the rectified linear unit (ReLU) is the most commonly used; it does not activate all the neurons at the same time and therefore provides faster convergence as the weights find the optimal values to produce the trained response during training. A pooling operation is performed on the produced feature map to reduce the dimensions of the images. Finally, the feature map is flattened into a vector and sent to the fully connected layer. The convergence of the neural network and the classification of the input patterns are performed in the fully connected layer, whose principles are based on error backpropagation to update the weights within this layer.

Deep ConvNets have been applied in several image recognition applications with high accuracy, and this has increased their reliability for future research.10-12 Roy et al.10 explored CNNs for hyperspectral image classification, and Hartenstein et al.11 used deep learning to determine prostate cancer positivity from CT imaging. Yoon et al.12 used a CNN for tumor identification in colorectal histology images. These and similar studies motivated researchers to investigate whether AI and ConvNets can be used effectively in COVID-19 research, particularly in diagnostic applications. Recently, Apostolopoulos and Mpesiana2 performed a study on the classification of novel COVID-19. They considered two different, publicly available chest X-ray image datasets. The training process was performed using ConvNets with transfer learning from pre-trained networks. They concluded that VGG19 and MobileNet-V2 outperformed the other pre-trained ConvNets.

Each trained neural network gains knowledge for the particular task that is considered. While the main principle of artificial neural networks is to simulate human behavior and intelligence, transfer learning in artificial neural networks is used to apply the stored knowledge of a particular task to another related task. Deep learning for image recognition applications is capable of learning from millions of images, and several huge models have been trained with different architectures.13-17 These pre-trained models have been publicly shared so that all researchers can make use of the stored knowledge. The state-of-the-art pre-trained publicly available networks, namely, VGG16,13 VGG19,13 ResNet50,14 InceptionV3,15 MobileNet-V2,16 and DenseNet121,17 were considered in the comparison.

When we consider the incidence rates of COVID-19, it is obvious that the data we can encounter in real life will be imbalanced.
Table 2. Properties of the ConvNet Experiments.

| Experiment | Architecture | Input Dimension | Pre-Processing | Dense Layer #1 | Dense Layer #2 | Dense Layer #3 |
| --- | --- | --- | --- | --- | --- | --- |
| Exp.1 | ConvNet#1 | 160×120 | Sharpening | 128 | 8 | — |
| Exp.2 | ConvNet#2 | 160×120 | Sharpening | 128 | 8 | — |
| Exp.3 | ConvNet#3 | 160×120 | Sharpening | 128 | 8 | — |
| Exp.4 | ConvNet#4 | 160×120 | Sharpening | 128 | 64 | 8 |
| Exp.5 | ConvNet#1 | 30×20 | Sharpening | 128 | 8 | — |
| Exp.6 | ConvNet#2 | 30×20 | Sharpening | 128 | 8 | — |
| Exp.7 | ConvNet#3 | 30×20 | Sharpening | 128 | 8 | — |
| Exp.8 | ConvNet#1 | 30×20 | APPN | 128 | 8 | — |
| Exp.9 | ConvNet#2 | 30×20 | APPN | 128 | 8 | — |
| Exp.10 | ConvNet#3 | 30×20 | APPN | 128 | 8 | — |
| Exp.11 | ConvNet#1 | 160×120 | — | 128 | 8 | — |
| Exp.12 | ConvNet#2 | 160×120 | — | 128 | 8 | — |
| Exp.13 | ConvNet#3 | 160×120 | — | 128 | 8 | — |
| Exp.14 | ConvNet#4 | 160×120 | — | 128 | 64 | 8 |
| Exp.15 | ConvNet#1 | 30×20 | — | 128 | 8 | — |
| Exp.16 | ConvNet#2 | 30×20 | — | 128 | 8 | — |
| Exp.17 | ConvNet#3 | 30×20 | — | 128 | 8 | — |
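As a sketch of how such an architecture can be assembled, the following snippet mirrors the ConvNet#1 configuration of Table 2 (two convolutional layers, as in Figure 2, and dense layers of 128 and 8 units for 160×120 inputs). Keras is assumed as the framework, and the filter counts, kernel sizes, optimizer, and loss are illustrative assumptions that the paper does not report.

```python
from tensorflow.keras import layers, models

def build_convnet1(input_shape=(120, 160, 1), n_classes=2):
    """Two convolutional and two dense layers, in the spirit of ConvNet#1."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      input_shape=input_shape),        # feature extraction
        layers.MaxPooling2D((2, 2)),                   # reduce feature-map size
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                              # feature map -> vector
        layers.Dense(128, activation="relu"),          # Dense Layer #1 (Table 2)
        layers.Dense(8, activation="relu"),            # Dense Layer #2 (Table 2)
        layers.Dense(n_classes, activation="softmax"), # classification output
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

For the 30×20 experiments, only the input shape would change; per Table 2, ConvNet#4 would add a third dense layer of 64 units between the 128- and 8-unit layers.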
The APPN approach, which is widely and efficiently used in image pre-processing for classification tasks,4 was applied to the X-ray images to obtain statistically resized images. APPN is based on dividing the image into segments of predetermined sizes and taking the mean of the pixels within the corresponding segment. Thus, statistically reduced dimensions of the images are obtained. In this research, 640×480 X-ray images were resized to 600×400, and then APPN with 20×20 segment sizes was applied. As a result, 30×20 X-ray images were produced. Figure 1 presents the original, sharpened, and APPN-applied X-ray images.

Figure 1. Pre-processing of X-ray images: (a) original chest X-ray image, (b) sharpened image using a Laplacian filter, and (c) average pixel per node (APPN)-applied image (10× enlarged).

In addition to these, the original images were sent to the ConvNets without any pre-processing. All experiments were performed on four different ConvNet architectures with two different image dimensions. Table 2 shows the properties of the ConvNet experiments.

Statistical Measurement Experiments. Each image hides basic statistical information that is useful for machine learning models. Considering a limited number of values instead of entire images decreases the computational time while achieving reasonable results. In this research, basic statistical information and pre-processed characteristics were obtained from the images.
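A minimal NumPy sketch of the APPN reduction described above, assuming Pillow for the initial resize: 640×480 X-rays are resized to 600×400, and each 20×20 block is replaced by its mean, yielding a 30×20 image.

```python
import numpy as np
from PIL import Image

def appn(image_path, seg=20):
    """Average pixel per node: block means over seg x seg segments."""
    img = Image.open(image_path).convert("L").resize((600, 400))
    arr = np.asarray(img, dtype=np.float64)      # shape (400, 600)
    h, w = arr.shape                             # both divisible by seg here
    # Group pixels into seg x seg blocks and take the mean of each block.
    blocks = arr.reshape(h // seg, seg, w // seg, seg)
    return blocks.mean(axis=(1, 3))              # shape (20, 30): a 30x20 image
```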
A threshold value was determined as half of the maximum pixel value within the image, and the numbers of pixels greater and smaller than this value were counted. Then, the image was divided into three segments vertically, with the center segment the widest so as not to divide the region of interest. The mean value of each segment was calculated separately. This process was performed to eliminate the corners and borders within the image. The mean values of the Laplacian-filtered, sharpened, and histogram-equalized images were calculated separately to provide different information to the machine learning models for the same image at the same time. Besides these measurements, the minimum and maximum pixel values within the image, the image entropy, the standard deviation, the variance, and the mode were calculated. Table 3 shows the created statistical and fundamental properties of the images in detail. A feature vector with the 14 attributes described above was created and fed to five machine learning classifiers: SVM, LR, nB, DT, and kNN (a small extraction sketch is given at the end of this section).

Table 3. Description of Feature Vectors Created from X-Ray Images.

| Attribute | Description |
| --- | --- |
| Lower | Total number of pixel values smaller than [max(p)/2] |
| Higher | Total number of pixel values greater than [max(p)/2] |
| LMean | Mean of the left segment of the image |
| CMean | Mean of the center segment of the image |
| RMean | Mean of the right segment of the image |
| MeanLP | Mean of the Laplacian filter |
| MeanSh | Mean of the sharpened image |
| MeanHE | Mean of the histogram equalization applied image |
| Min | Minimum pixel value within the image |
| Max | Maximum pixel value within the image |
| Entropy | Entropy of the image |
| StdDev | Standard deviation of the image |
| Var | Variance of the image |
| Mode | Pixel value that is the most frequent within the image |

Transfer Learning Experiments. The images that gave the best results in the ConvNet experiments and statistical measurement experiments, which were the unprocessed images, were compared using the pre-trained networks mentioned in the previous section.

VGG1613 is a CNN architecture that has 16 layers with weights and uses 3×3 filters. After the convolutional layers, it has two fully connected layers, followed by a softmax output. It has approximately 138 million parameters. VGG1913 is similar to VGG16, but it has 19 layers with weights, which gives the network approximately 143 million parameters.

ResNet5014 has 50 residual layers, which aim to solve problems such as time consumption when the network becomes deeper. Its principle is based on skip connections between layers, called the identity function; this increases the accuracy of the model and decreases the training time. It has more than 23 million trainable parameters.

Inception V315 has 42 layers and 24 million parameters. It factorizes convolutions to reduce the number of parameters without decreasing the network efficiency. In addition, a novel downsizing approach was proposed in Inception V3 to reduce the number of features.

MobileNet-V216 has 53 layers and more than 3.4 million trainable parameters. It consists of residual connections and expansion, depthwise, and projection convolutions. The expansion convolutions convert the input tensor into a higher-channel tensor; the depthwise convolutions apply filters to the converted tensors; and, finally, the projection convolutions project the higher channels to a smaller number of tensors.

DenseNet12117 connects each layer to every other layer in a feedforward fashion. The initial convolutional layer is followed by a fully connected layer, and the rest of the convolutional layers are followed by pooling and a fully connected layer. It has 121 layers and more than 8 million trainable parameters.
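The following sketch illustrates how the 14 attributes of Table 3 can be extracted with NumPy and SciPy. The thresholding, segment means, and summary statistics follow the description above; the equal three-way vertical split, the Laplacian sharpening step, and the histogram-equalization details are simplifying assumptions, since the paper uses a wider center segment and does not specify its kernels.

```python
import numpy as np
from scipy import ndimage

def extract_features(arr):
    """arr: 2D uint8 grayscale X-ray array -> list of the 14 Table 3 attributes."""
    t = arr.max() / 2.0                              # threshold = max(p)/2
    lower, higher = int(np.sum(arr < t)), int(np.sum(arr > t))
    # Three vertical segments (equal split here; the paper widens the center).
    thirds = np.array_split(arr, 3, axis=1)
    lmean, cmean, rmean = (s.mean() for s in thirds)
    lap = ndimage.laplace(arr.astype(float))         # Laplacian filter
    sharp = arr - lap                                # simple Laplacian sharpening
    hist, _ = np.histogram(arr, bins=256, range=(0, 256))
    cdf = hist.cumsum() / arr.size
    heq = np.interp(arr.ravel(), np.arange(256), 255 * cdf)  # hist. equalization
    p = hist / arr.size
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # image entropy
    mode = int(hist.argmax())                        # most frequent pixel value
    return [lower, higher, lmean, cmean, rmean,
            lap.mean(), sharp.mean(), heq.mean(),
            int(arr.min()), int(arr.max()), entropy,
            arr.std(), arr.var(), mode]
```

In the study's terms, these vectors feed the five classifiers (SVM, LR, nB, DT, and kNN); in scikit-learn, these would correspond to SVC, LogisticRegression, GaussianNB, DecisionTreeClassifier, and KNeighborsClassifier.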
Each X-ray image was sent to the considered networks with the minimum dimensions required. The pre-processing was performed using each model's own pre-processing steps to provide appropriately prepared images to the models. After training each model with pre-trained weights, maximum pooling was applied, and the features were sent to the fully connected layer (128 units). Similar to the previous experiments, the eightfold cross-validation method was used for all experiments.

Model Evaluation Criteria. Models can be evaluated using different criteria, such as classification accuracy, sensitivity (true positive rate), specificity, and ROC AUC. Using only an accuracy or a sensitivity/specificity criterion is not enough, however, especially for imbalanced data; while higher scores can be obtained in one metric, lower scores can be produced by other metrics. Therefore, considering all the above-mentioned criteria, ROC AUC was used to evaluate the model performance for the statistical measurement, COVID-19/Normal, and COVID-19/Pneumonia experiments, which had two output classes (labels). ROC AUC is used to measure the performance of a model; in medical applications, the model with the higher ROC AUC score is more capable of distinguishing between patients with and without COVID-19.22 "Positive" and "negative" results are the responses of the outputs (classification predictions) obtained from the model; "true" and "false" refer to the actual data. The accuracy, sensitivity, and specificity are calculated as given in Equations (1), (2), and (3), respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

Sensitivity = TP / (TP + FN)   (2)

Specificity = TN / (TN + FP)   (3)

where TP and TN denote the true-positive and true-negative values, respectively, and FP and FN represent the false-positive and false-negative values, respectively.

The macro-averaged F1 score is a measure of model performance for multiclass (multilabel) problems that have more than two output classes, where the data are imbalanced and the accuracy is not reliable.23 It considers the harmonic mean of the recall and precision scores of each class separately, and measures the capacity of the model for the correct detection of samples.

All experiments were performed with k-fold cross-validation,24 which is based on dividing all the data into a predefined number of folds, k, using one fold for testing and the remaining folds for training. The training step is repeated k times until all folds have been used as the test set. In this study, eightfold cross-validation was used for testing.25 Therefore, 12.5% and 87.5% of the data were used for testing and training, respectively. Four randomly selected images for both healthy and coronavirus-infected patients were assigned as the validation set. The number of images within the validation set was limited so as not to reduce the number of images in the infected class.

At the end of the statistical measurement, COVID-19/Normal, and COVID-19/Pneumonia experiments, the mean accuracy, mean specificity, mean sensitivity, and mean ROC AUC scores were calculated, and all the evaluations were performed on the mean scores. The mean ROC AUC scores were, however, used as the primary evaluation criterion. For the COVID-19/Pneumonia/Normal experiments, the macro-averaged F1 score was used for model evaluation. All experiments were performed on an Ubuntu 18.04.4 LTS 64-bit operating system with an Intel Core i7-8700 CPU @ 3.20 GHz × 12, 32 GB RAM, and an NVidia GeForce RTX 2060 GPU.
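To make the protocol concrete, the sketch below combines the transfer-learning setup described above (a pre-trained base, maximum pooling, and a 128-unit fully connected layer) with stratified eightfold cross-validation and the metrics of Equations (1)-(3) plus ROC AUC. The choice of DenseNet121, the input size, the frozen base, and all training hyperparameters are illustrative assumptions; Keras and scikit-learn are assumed as the libraries.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

def build_transfer_model(input_shape=(224, 224, 3)):
    base = DenseNet121(weights="imagenet", include_top=False,
                       input_shape=input_shape)
    base.trainable = False                        # reuse the stored knowledge
    return models.Sequential([
        base,
        layers.GlobalMaxPooling2D(),              # maximum pooling over features
        layers.Dense(128, activation="relu"),     # fully connected layer (128)
        layers.Dense(1, activation="sigmoid"),    # two-class output
    ])

def evaluate_eightfold(X, y):
    """X: image array, y: binary labels. Returns mean scores over 8 folds."""
    folds = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model = build_transfer_model()
        model.compile(optimizer="adam", loss="binary_crossentropy")
        model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
        prob = model.predict(X[test_idx]).ravel()
        pred = (prob > 0.5).astype(int)
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        scores.append({
            "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Equation (1)
            "sensitivity": tp / (tp + fn),                # Equation (2)
            "specificity": tn / (tn + fp),                # Equation (3)
            "roc_auc": roc_auc_score(y[test_idx], prob),
        })
    # Mean scores across the eight folds, as reported in the paper.
    return {k: np.mean([s[k] for s in scores]) for k in scores[0]}
```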
Results

This section presents the results obtained from the ConvNet experiments, statistical measurement experiments, and transfer learning experiments.

Results of ConvNet Experiments

As mentioned above, 38 experiments were performed for the ConvNet experiments, in three groups considered separately.

Results of COVID-19/Normal Experiments. In this group, a total of 1808 images (225 COVID-19 and 1583 Normal) were trained in each experiment without the data augmentation procedure, which artificially increases the training samples.

Sharpened images with different image sizes and different architectures produced consistent results across all experiments (Exp.1 through Exp.7). The highest mean accuracy of Experiments 1-7 was obtained in Exp.3 (98.34%). The highest mean sensitivity, highest mean specificity, and highest mean ROC AUC score, which is the primary indicator for an imbalanced dataset, however, were obtained in Exp.1 (91.05, 99.61, and 95.33%, respectively). Exp.2 and Experiments 3-7 could not achieve higher rates than Exp.1 and Exp.3 in all evaluation metrics.

In the APPN-applied experiments (Exp.8, Exp.9, and Exp.10), while the highest mean accuracy, highest mean sensitivity, and highest ROC AUC score were obtained in Exp.8 (98.23, 91.84, and 95.41%, respectively), the highest mean specificity was achieved in Exp.10 (99.29%). Exp.9, which was implemented using the deepest ConvNet architecture for APPN, produced the lowest results within these three experiments.

In Exp.11 through Exp.17, in which the original images were used with different dimensions in different ConvNet architectures, consistent rates were obtained for mean accuracy and mean specificity. Changes in the rates of mean sensitivity and mean ROC AUC scores (between 3 and 6%, respectively) were, however, obtained using the different architectures. The highest mean accuracy and highest mean specificity were obtained in Exp.14 (99.11 and 99.78%), and these were the highest scores obtained in the ConvNet experiments for the COVID-19/Normal group. The highest mean sensitivity and highest mean ROC AUC scores for the COVID-19/Normal group were achieved in Exp.11, with 93.84 and 96.51%, respectively. Table 4 shows the results obtained in the experiments for COVID-19/Normal classification.

Table 4. Results Obtained in the ConvNet Experiments (columns 2-5: COVID-19/Normal; columns 6-9: COVID-19/Pneumonia).

| Experiment | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Exp.1 | 91.05 | 99.61 | 98.33 | 95.33 | 89.77 | 99.58 | 99.09 | 94.67 |
| Exp.2 | 87.55 | 99.32 | 97.05 | 92.70 | 88.00 | 99.60 | 99.02 | 93.80 |
| Exp.3 | 90.98 | 99.37 | 98.34 | 95.17 | 89.33 | 99.51 | 99.00 | 94.42 |
| Exp.4 | 86.63 | 99.60 | 98.05 | 92.00 | 85.33 | 99.67 | 98.95 | 92.50 |
| Exp.5 | 90.12 | 98.42 | 97.40 | 94.27 | 87.55 | 99.51 | 98.91 | 93.53 |
| Exp.6 | 86.88 | 98.05 | 96.78 | 93.66 | 85.33 | 99.44 | 98.73 | 92.38 |
| Exp.7 | 89.19 | 99.23 | 98.00 | 94.21 | 84.88 | 99.32 | 98.60 | 92.00 |
| Exp.8 | 91.84 | 98.98 | 98.23 | 95.41 | 84.44 | 99.62 | 98.87 | 92.03 |
| Exp.9 | 87.33 | 98.97 | 97.13 | 93.69 | 85.33 | 99.37 | 98.67 | 92.35 |
| Exp.10 | 88.67 | 99.29 | 97.95 | 93.98 | 88.00 | 99.51 | 98.93 | 93.75 |
| Exp.11 | 93.84 | 99.18 | 98.50 | 96.51 | 92.88 | 99.79 | 99.44 | 96.33 |
| Exp.12 | 88.37 | 99.57 | 98.91 | 93.89 | 87.11 | 99.62 | 99.00 | 93.36 |
| Exp.13 | 87.88 | 98.98 | 97.73 | 93.43 | 87.11 | 99.62 | 99.00 | 93.36 |
| Exp.14 | 89.12 | 99.78 | 99.11 | 94.57 | 85.77 | 99.18 | 98.51 | 92.48 |
| Exp.15 | 90.10 | 99.50 | 98.34 | 94.80 | 90.22 | 99.67 | 99.20 | 94.94 |
| Exp.16 | 84.11 | 98.80 | 97.64 | 91.01 | 86.22 | 99.60 | 98.93 | 92.91 |
| Exp.17 | 87.71 | 99.11 | 97.73 | 93.41 | 86.22 | 99.48 | 98.82 | 92.85 |

COVID-19: coronavirus disease 2019; ROC AUC: receiver operating characteristics-area under the curve.

Results of COVID-19/Pneumonia Experiments. … was 99.60% in Exp.4. In the APPN-applied experiments (Experiments 8-10), similar results were obtained; however, the lightest architecture achieved the highest mean ROC AUC score.

When the images were fed to the ConvNets directly (Experiments 11-17), we observed that incrementing the number of convolutional layers reduces the scores obtained by the neural network by up to 4%, similar to the COVID-19/Normal results. The highest mean accuracy, mean sensitivity, mean specificity, and mean ROC AUC scores were obtained in Exp.11: 99.44, 92.88, 99.79, and 96.33%, respectively. Table 4 shows the results obtained in the experiments for the COVID-19/Pneumonia classification.
Results of COVID-19/Pneumonia/Normal Experiments. … could not produce the highest results in the three-class experiments. The results obtained by ConvNet#1 were similar to the ConvNet#3 results, and its macro-averaged F1 score was 92.84%.

ConvNet#2 achieved the highest overall results in this group in terms of precision, recall, and F1 score, with a macro-averaged F1 score of 94.10%. ConvNet#4, with the deepest structure, produced similar results to ConvNet#2 but could not outperform it; it achieved a macro-averaged F1 score of 94.04%. Table 5 presents the results obtained in the COVID-19/Pneumonia/Normal experiments.

Table 5. Results Obtained in the COVID-19/Pneumonia/Normal Experiments (columns 2-4: per-class precision (%); columns 5-7: per-class recall (%)).

| Model | Corona | Normal | Pneumonia | Corona | Normal | Pneumonia | Mean Accuracy (%) | Macro-Averaged F1 Score (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DenseNet121 | 98.87 | 90.90 | 88.52 | 95.66 | 97.20 | 92.03 | 95.99 | 93.85 |
| Inception V3 | 97.76 | 90.99 | 86.54 | 95.99 | 96.54 | 91.16 | 94.90 | 93.14 |
| ConvNet#1 | 96.20 | 93.72 | 92.98 | 97.45 | 86.22 | 90.63 | 95.26 | 92.84 |
| ConvNet#2 | 96.77 | 95.27 | 93.05 | 97.41 | 90.04 | 92.12 | 95.75 | 94.10 |
| ConvNet#3 | 96.26 | 93.15 | 92.26 | 96.98 | 86.79 | 90.88 | 95.04 | 92.70 |
| ConvNet#4 | 97.51 | 91.42 | 92.32 | 96.90 | 92.69 | 93.49 | 95.88 | 94.04 |

Results of Statistical Measurement Experiments

Five experiments were performed for COVID-19/Normal classification by considering the 14 features obtained from the images and using the five machine learning classifiers: SVM, LR, nB, DT, and kNN. Inconsistent results were obtained for kNN and nB. kNN achieved the highest mean specificity rate (99.55%), but it also produced the lowest mean sensitivity and lowest mean ROC AUC score (63.10 and 81.33%, respectively). Similarly, nB produced the highest mean sensitivity rate and mean ROC AUC score (82.95 and 92.75%, respectively), but it produced the lowest mean accuracy and mean specificity rates (93.97 and 94.05%, respectively). SVM achieved the highest mean accuracy result (96.57%). None of these models, however, was capable of outperforming the ConvNet in any of the evaluation metrics using the obtained statistical data. Table 6 presents the results obtained in the statistical measurement experiments.

The same machine learning classifiers and features were considered for the classification of COVID-19/Pneumonia. Similar results were obtained in the experiments, and nB produced the highest mean ROC AUC, mean sensitivity, and mean accuracy scores (88.92, 80.00, and 96.96%, respectively) for the statistical measurement experiments of COVID-19/Pneumonia classification. The highest mean specificity was obtained by nB and SVM (97.85% each). The lowest scores of the statistical measurements for COVID-19/Pneumonia classification were obtained by LR. Even though similar results were obtained in the COVID-19/Normal and COVID-19/Pneumonia experiments, a decrement in the classification levels was observed for all machine learning algorithms. This might be caused by both image classes containing disease and by the increment in the number of training images.

Transfer Learning Experiments

Comparisons were performed for all groups of experiments. Pre-processing methods were not applied to the images, because the original images achieved the highest results in the ConvNet experiments. Similar to the ConvNet experiments, the transfer learning experiments were also performed in three groups: COVID-19/Normal, COVID-19/Pneumonia, and COVID-19/Pneumonia/Normal. The two models that produced superior results in the COVID-19/Normal and COVID-19/Pneumonia groups were considered in the COVID-19/Pneumonia/Normal experiments.

In the COVID-19/Normal group, VGG19 and MobileNet-V2 produced the worst results. They were only able to learn one class and could not classify the COVID-19 X-ray images. ResNet-50 and VGG16 produced comparatively better results than VGG19 and MobileNet-V2; their mean ROC AUC scores were calculated as 65.78 and 72.64%, respectively. Inception-V3 produced higher results than the other pre-trained networks; however, the highest mean ROC AUC score in the transfer learning experiments was obtained by DenseNet121 (96.48%). Table 7 presents the results obtained using transfer learning for the COVID-19/Normal group.

In the COVID-19/Pneumonia group, similar results were obtained. Even though VGG19, MobileNet-V2, and ResNet50 increased their scores, they were not able to reach the scores of DenseNet121 and Inception V3.
The highest mean ROC AUC score for COVID-19/Pneumonia classification in the transfer learning experiments was achieved by DenseNet121 (95.95%), followed by Inception V3 (94.71%). Table 7 presents the results obtained using transfer learning for the COVID-19/Pneumonia group.

Table 6. Results Obtained in the Statistical Measurement Experiments (columns 2-5: COVID-19/Normal; columns 6-9: COVID-19/Pneumonia).

| Model | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVM | 81.30 | 98.80 | 96.57 | 90.05 | 75.55 | 97.85 | 96.74 | 86.70 |
| Logistic Reg. | 68.36 | 98.12 | 94.41 | 83.24 | 66.66 | 96.45 | 94.97 | 81.56 |
| Decision Tree | 75.91 | 96.53 | 93.97 | 87.10 | 69.77 | 96.50 | 95.17 | 83.14 |
| Naive Bayes | 82.95 | 94.05 | 93.97 | 92.75 | 80.00 | 97.85 | 96.96 | 88.92 |
| kNN | 63.10 | 99.55 | 95.02 | 81.33 | 64.44 | 96.22 | 94.64 | 80.33 |

COVID-19: coronavirus disease 2019; kNN: k-nearest neighbor; ROC AUC: receiver operating characteristics-area under the curve; SVM: support vector machine.

Table 7. Results Obtained in Transfer Learning Experiments (columns 2-5: COVID-19/Normal; columns 6-9: COVID-19/Pneumonia).

| Model | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) | Mean Sensitivity (%) | Mean Specificity (%) | Mean Accuracy (%) | Mean ROC AUC (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VGG16 | 46.04 | 99.24 | 92.64 | 72.64 | 77.33 | 99.65 | 98.53 | 88.49 |
| VGG19 | 08.03 | 100.0 | 88.55 | 54.01 | 70.66 | 99.48 | 98.05 | 85.07 |
| InceptionV3 | 90.14 | 99.17 | 98.17 | 94.66 | 89.77 | 99.65 | 99.15 | 94.71 |
| MobileNet-V2 | 08.40 | 100.0 | 87.61 | 54.20 | 68.88 | 99.39 | 97.87 | 84.14 |
| ResNet50 | 31.57 | 100.0 | 91.15 | 65.78 | 59.55 | 100.0 | 97.98 | 79.77 |
| DenseNet121 | 93.92 | 99.04 | 98.39 | 96.48 | 92.44 | 99.46 | 99.11 | 95.95 |

COVID-19: coronavirus disease 2019; ROC AUC: receiver operating characteristics-area under the curve.

After considering the results obtained in the first two groups, we implemented DenseNet121 and Inception V3 for the classification of COVID-19/Pneumonia/Normal. Even though fluctuating results were observed for the precision and recall scores of the COVID-19, Pneumonia, and Normal classes, DenseNet121 outperformed Inception V3 in the transfer learning experiments by obtaining a macro-averaged F1 score of 93.85%, while Inception V3 achieved 93.14%. Table 5 shows the results obtained in the COVID-19/Pneumonia/Normal experiments together with the results obtained in the ConvNet experiments of the same group.

Comparisons of Experiments

In COVID-19/Normal classification, the highest mean specificity (when the 100.0% scores of the pre-trained networks are not considered, because they failed to learn the other class) and the highest mean accuracy were obtained in Exp.14 (99.78 and 99.11%, respectively), which used the deepest architecture in the ConvNet experiments (Table 4). This architecture failed, however, to produce higher results in terms of mean sensitivity, and this reduced the performance of the considered ConvNet in the primary performance indicator for both classes, the mean ROC AUC score. The highest mean sensitivity was achieved by DenseNet121 (93.92%) (Table 7), but its other scores were not high enough to outperform the other models in the other metrics; DenseNet121's mean ROC AUC score was 96.48%. Even though ConvNet#1 could not produce the optimal sensitivity, specificity, and accuracy results, its stability produced consistent results, and the highest mean ROC AUC score was achieved by ConvNet#1, with 96.51% (Table 4). The machine learning classifiers could not produce satisfactory results using the extracted statistical information to classify COVID-19 in this experimental group.

In COVID-19/Pneumonia classification, similarly to the previous experiments, the highest mean ROC AUC score was obtained in Exp.11 (96.33%) with ConvNet#1 (Table 4), followed by DenseNet121 (95.95%) (Table 7). In addition, the highest mean sensitivity and mean accuracy results were also obtained in Exp.11 (92.88 and 99.44%, respectively).
The highest mean specificity was achieved in the transfer learning experiments by ResNet50 (100%); however, its other results reduced the success of the model in classifying the two-class experiments correctly at the same time. Similarly to the previous experiments, the machine learning experiments could not produce results similar to those of the ConvNet and transfer learning experiments. Table 8 shows the total TP, TN, FP, and FN results obtained for Exp.11 and DenseNet121 over all folds in the COVID-19/Normal and COVID-19/Pneumonia classifications. Figure 2 demonstrates the architecture of ConvNet#1, which obtained the highest classification results, and Figure 3 shows some of the highest ROC AUC scores obtained in the ConvNet, statistical measurement, and transfer learning experiments.

Table 8. TP, FP, TN, and FN Results for Exp.11 and DenseNet121 for All Test Folds.

COVID-19/Normal

| Experiment | TP | FP | TN | FN |
| --- | --- | --- | --- | --- |
| Exp.11 | 211 | 15 | 1568 | 14 |
| DenseNet121 | 209 | 13 | 1572 | 14 |

COVID-19/Pneumonia

| Experiment | TP | FP | TN | FN |
| --- | --- | --- | --- | --- |
| Exp.11 | 209 | 9 | 4283 | 16 |
| DenseNet121 | 208 | 23 | 4269 | 17 |

COVID-19: coronavirus disease 2019; FN: false negative; FP: false positive; TN: true negative; TP: true positive.

Figure 2. Convolutional neural network 1 (ConvNet#1) architecture, with two convolutional and two fully connected layers.

For the three-class experiments (COVID-19/Pneumonia/Normal), the macro-averaged F1 scores were between 92.70 and 94.10% (Table 5). DenseNet121 achieved higher results than ConvNet#1, ConvNet#3, ConvNet#4, and Inception V3, but the optimal results were obtained by ConvNet#2, which had a macro-averaged F1 score of 94.10%, followed by DenseNet121 with 93.85%, as shown in Table 5. Figure 4 shows the macro-averaged F1 scores obtained in the COVID-19/Normal/Pneumonia experiments.

Discussion

The performed experiments should be analyzed separately to evaluate the performance of the applied techniques and the considered models. As mentioned above, the final evaluation process was based on the eightfold cross-validation method and the ROC AUC score because of the imbalanced database.

In the two-class experiments, a variety of image pre-processing methods were applied with different image sizes and four ConvNet architectures to provide the highest detection accuracy of COVID-19 in chest X-ray images. In the COVID-19/Normal classification experiments, it was relatively easier to classify COVID-19 because the normal X-ray images do not contain any abnormalities. The performed experiments showed that the considered image pre-processing steps produced results similar to those of the ConvNets fed with the original images; however, none of the considered techniques was able to increase the performance of the ConvNets in terms of mean ROC AUC score. The maximum mean ROC AUC score using an image pre-processing technique was 95.41%, which was obtained in Exp.8 with ConvNet#1 and APPN. The use of images with reduced dimensions caused the mean ROC AUC scores of the experiments to decrease by approximately 5.5% (max. 96.51% and min. 91.01%) compared with the experiments with higher dimensions. A possible solution is feeding the ConvNet with images of increased dimensions.

Four architectures were also considered for all experiments to evaluate the model performance with different numbers of layers.
Figure 3. Highest ROC AUC scores obtained in the COVID-19/Normal and COVID-19/Pneumonia experiments. COVID-19: coronavirus disease 2019; ROC AUC: receiver operating characteristics-area under the curve.

Figure 4. Macro-averaged F1 scores of the COVID-19/Normal/Pneumonia experiments. COVID-19: coronavirus disease 2019.

Experimental results showed that the use of more convolutional and fully connected layers could not improve the model performance for the image database considered, because the differences between the mean ROC AUC scores of the ConvNet with minimized layers and the ConvNets with more layers ranged from 1.7 to 5%, depending on the pre-processing technique. The minimum mean ROC AUC score of the ConvNets with more layers on APPN-applied images was 93.69%, while ConvNet#1 achieved 95.41%. The number of images used in the experiments has a direct effect on the required number of layers and the architecture of the ConvNet, but the obtained results suggest that the use of a minimized number of layers can enhance the detection of COVID-19 within the normal images. The highest result was obtained by using two convolutional layers and two dense layers with 160×120 image dimensions.

Then, statistical measurements and COVID-19 detection using several machine learning models were considered. The determination of the specific statistical measurements to be used is vital for this kind of classification approach; however, there are basic measurements that can be obtained from the images. In addition to the above-mentioned statistical measurements, the image pre-processing techniques were applied, and additional measurements were obtained from the images to make the knowledge available to the machine learning models as similar as possible to that available to the ConvNets. The machine learning models, however, could not achieve mean ROC AUC scores as high as those of the ConvNets, and there was a 4% difference between the highest mean ROC AUC score in the ConvNet experiments and that of nB, which produced the highest result in the statistical measurement experiments.

The use of transfer learning with the state-of-the-art pre-trained ConvNets was also considered in the COVID-19/Normal classification experiments. Six pre-trained networks were considered, and the results showed that two of them, InceptionV3 and DenseNet121, were able to correctly detect the X-ray images.
DenseNet121 produced results similar to the highest results obtained in Exp.11; however, it could not outperform Exp.11 in terms of mean specificity, mean accuracy, and mean ROC AUC scores.

The other classification type in this study was the detection of COVID-19 within the pneumonia images (COVID-19/Pneumonia). The same experiments were performed as for the COVID-19/Normal experiments, and similar results were obtained. The lightest ConvNet outperformed the other considered ConvNet structures and the pre-trained models, even though the number of training samples increased because of the number of images in the dataset. Similarly, the machine learning classifiers were not able to produce higher results than the ConvNets, and a general reduction was observed in the classification performance of the machine learning models. This was caused by the complexity of the images, the difficulty of differentiating COVID-19 from pneumonia images, and the increased number of training samples. It should be noted, however, that additional measured characteristics of the images or significant statistical measurements, such as contrast level, brightness level, kurtosis, and so on, may help to improve the scores obtained by the machine learning models.

In the three-class experiments (COVID-19/Normal/Pneumonia), the increment in the class number and the training samples caused ConvNet#1 not to produce optimal results. Even the deepest structure (ConvNet#4) could not achieve superior results, although it was observed that the deeper structure was more effective than ConvNet#1 at detecting COVID-19 between pneumonia and normal images.

Although the success of the recognition ability of the models strongly depends on the image or dataset characteristics, we can conclude that the use of lighter ConvNets with a smaller number of output classes and a limited number of images provides better convergence. The increment of the number of output classes and training samples, however, requires a deeper structure for effective learning. It should also be noted that the characteristics of the images have a direct effect on convergence; therefore, different architectures should be analyzed for each application to improve the recognition capacity of the model.

Pre-trained networks have very deep architectures; they have been trained by using millions of different kinds of images, and the saved final weights are intended to be transferred to similar or different applications. Recent research,26-28 however, has aimed to develop light ConvNets to reduce the computational cost of pre-trained networks; and, as mentioned above, networks with less deep architectures become preferable for classification problems, even with a huge number of images and a high number of output classes. The obtained results also demonstrate that architectures may need to deepen in connection with an increased number of images and output classes. For this reason, some pre-trained neural networks were found to have difficulties in learning one class successfully while learning another class with high accuracy. Similar results were obtained by Apostolopoulos and Mpesiana.2

The COVID-19 data used in this study were collected by pulling images from publications and websites. Therefore, they have come from different institutions and different scanners. The X-ray imaging parameters might be different for some of the scans, which might result in different image quality; this is common when multisite studies are mixed, or when one database has multiple characteristic flaws such as different imaging protocols. Therefore, pre-processing of the data to make the radiographic images more similar and uniform is important in terms of providing more efficient analysis and consistency. This is a complex procedure, however, including co-registration, standardization, and so on, to obtain the same image size and pixel size along the same spatial orientation and to make the images' resolution uniform and isotropic. We believe that, as more pre-processed datasets on COVID-19 become publicly available, more accurate studies will be conducted. Nevertheless, the current limited dataset has led researchers around the globe to develop methods to aid in facilitating the diagnosis of COVID-19. Although this study shows that CNNs can be used for the automated detection of COVID-19 and for distinguishing it from pneumonia, we believe that applying artificial neural networks to COVID-19 detection more accurately requires clinical trials.

Another limitation of this study is the small sample size of COVID-19 images, which restricts appropriate cohort selection and might result in a biased conclusion. At the time of writing, there is no other reliable publicly available dataset. To obtain a more accurate and robust model, a larger COVID-19 dataset is needed. Furthermore, because of the use of a relatively small number of COVID-19 images, clinical information about the patients, such as risk factors and medical history, is not available at this time.

Conclusions

Detection of COVID-19 from chest X-ray images is of vital importance for both doctors and patients, to decrease the diagnostic time and reduce financial costs. Artificial intelligence and deep learning are capable of recognizing images for the tasks they have been taught. In this study, several experiments were performed for the high-accuracy detection of COVID-19 in chest X-ray images using ConvNets. Various groups (COVID-19/Normal, COVID-19/Pneumonia, and COVID-19/Pneumonia/Normal) were considered for the classification. Different image dimensions, different network architectures, state-of-the-art pre-trained networks, and machine learning models were implemented and evaluated using images and statistical data. When the number of images in the database and the detection time of COVID-19 (average testing time = 0.03 s/image) are considered, it can be suggested that the considered ConvNet architectures reduce the computational cost while maintaining high performance.
The results showed that a convolutional neural network with minimized convolutional and fully connected layers is capable of detecting COVID-19 images within the two-class COVID-19/Normal and COVID-19/Pneumonia classifications, with mean ROC AUC scores of 96.51 and 96.33%, respectively. In addition, the second proposed architecture, which was the second lightest, is capable of detecting COVID-19 in the three-class COVID-19/Pneumonia/Normal images, with a macro-averaged F1 score of 94.10%. Therefore, the use of AI-based automated high-accuracy technologies may provide valuable assistance to doctors in diagnosing COVID-19.

Further studies, based on the results obtained in this study, would provide more information about the use of CNN architectures with COVID-19 chest X-ray images and improve on the results of this study.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. World Health Organization. WHO Coronavirus Disease (COVID-19) Dashboard. https://covid19.who.int.
2. Apostolopoulos, I. D.; Mpesiana, T. Covid-19: Automatic Detection from X-Ray Images Utilizing Transfer Learning with Convolutional Neural Networks. Phys. Eng. Sci. Med. 2020, 43, 635–640.
3. Rubin, G. D.; Ryerson, C. J.; Haramati, L. B.; et al. The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society [published online ahead of print, April 7, 2020]. Chest. 2020. doi:10.1016/j.chest.2020.04.003.
4. Ozsahin, I.; Sekeroglu, B.; Mok, G. S. P. The Use of Back Propagation Neural Networks and 18F-Florbetapir PET for Early Detection of Alzheimer's Disease Using Alzheimer's Disease Neuroimaging Initiative Database. PLoS One. 2019, 14, e0226577.
5. Dai, S.; Li, L.; Li, Z. Modeling Vehicle Interactions via Modified LSTM Models for Trajectory Prediction. IEEE Access. 2019, 7, 38287–38296.
6. Yılmaz, N.; Sekeroglu, B. Student Performance Classification Using Artificial Intelligence Techniques. In: 10th International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions (ICSCCW 2019). Adv. Intell. Syst. Comput. 2019, 1095, 596–603.
7. Meng, C.; Zhao, X. Webcam-Based Eye Movement Analysis Using CNN. IEEE Access. 2017, 5, 19581–19587. doi:10.1109/ACCESS.2017.2754299.
8. Deng, X.; Zhang, Y.; Yang, S.; et al. Joint Hand Detection and Rotation Estimation Using CNN. IEEE Trans. Image Process. 2018, 27, 1888–1900. doi:10.1109/TIP.2017.2779600.
9. LeCun, Y.; Haffner, P.; Bottou, L.; et al. Object Recognition with Gradient-Based Learning. In: Shape, Contour and Grouping in Computer Vision. Lect. Notes Comput. Sci. 1999, 1681, 319–345.
10. Roy, S. K.; Krishna, G.; Dubey, S. R.; et al. HybridSN: Exploring 3-D–2-D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281.
11. Hartenstein, A.; Lübbe, F.; Baur, A. D. J.; et al. Prostate Cancer Nodal Staging: Using Deep Learning to Predict 68Ga-PSMA-Positivity from CT Imaging Alone. Sci. Rep. 2020, 10, 3398.
12. Yoon, H.; Lee, J.; Oh, J. E.; et al. Tumor Identification in Colorectal Histology Images Using a Convolutional Neural Network. J. Digit. Imaging. 2018, 32, 131–140.
13. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv. 2015, arXiv:1409.1556.
14. He, K.; Zhang, X.; Ren, S.; et al. Deep Residual Learning for Image Recognition. ArXiv. 2015, arXiv:1512.03385.
15. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; et al. Rethinking the Inception Architecture for Computer Vision. ArXiv. 2015, arXiv:1512.00567v3.
16. Howard, A. G.; Zhu, M.; Chen, B.; et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ArXiv. 2017, arXiv:1704.04861.
17. Huang, G.; Liu, Z.; van der Maaten, L.; et al. Densely Connected Convolutional Networks. ArXiv. 2018, arXiv:1608.06993v5.
18. Cohen, J. P. COVID-19 Image Data Collection. ArXiv. 2020, arXiv:2003.11597.
19. COVID-19 chest X-ray dataset. https://github.com/ieee8023/covid-chestxray-dataset.
20. Kermany, D. S.; Goldbaum, M.; Cai, W.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell. 2018, 172, 1122–1131.
21. Haralick, R.; Shapiro, L. Computer and Robot Vision, Vol. 1. Addison-Wesley: Reading, MA, 1992; pp 346–351.
22. Melo, F. Area under the ROC Curve. In: Dubitzky, W.; Wolkenhauer, O.; Cho, K. H.; Yokota, H. (eds.), Encyclopedia of Systems Biology. Springer: New York, 2013.
23. Sun, Y.; Wang, B.; Jin, J.; et al. Deep Convolutional Network Method for Automatic Sleep Stage Classification Based on Neurophysiological Signals. In: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 2018; pp 1–5. doi:10.1109/CISP-BMEI.2018.8633058.
24. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation. In: Liu, L.; Özsu, M. T. (eds.), Encyclopedia of Database Systems. Springer: Boston, 2009.
25. Yang, T.; Kan, P.; Lin, C.; et al. Using Polar Expression Features and Nonlinear Machine Learning Classifier for Automated Parkinson's Disease Screening. IEEE Sensors J. 2020, 20, 501–514. doi:10.1109/JSEN.2019.2940694.
26. Wu, X.; He, R.; Sun, Z.; et al. A Light CNN for Deep Face Representation with Noisy Labels. IEEE Trans. Inf. Forensics Secur. 2018, 13. doi:10.1109/TIFS.2018.2833032.
27. Hong, Z.; Yun, Z. A Normalized Light CNN for Face Recognition. J. Phys. Conf. Ser. 2018, 1087, 062015. doi:10.1088/1742-6596/1087/6/062015.
28. Yang, Z.; Li, D. WasNet: A Neural Network-Based Garbage Collection Management System. IEEE Access. 2020, 8, 103984–103993. doi:10.1109/ACCESS.2020.2999678.