1. Introduction
On 30 December 2019, the first case of COVID-19 was reported at Wuhan Jinyintan Hospital in a patient with pneumonia of unknown etiology. Analysis showed that the virus belonged to the coronavirus family, in the lineage Betacoronavirus 2B [1]. The COVID-19 virus exhibited a close link to bat SARS-like coronaviruses. The World Health Organization (WHO) identified the novel coronavirus as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and referred to the disease it causes as coronavirus disease 2019 (COVID-19) on 30 January 2020 [2]. Breathlessness, fever, headache, chills, myalgia or arthralgia, nasal congestion, diarrhea, hemoptysis, and conjunctival congestion are typical symptoms of the disease [3]. In severe cases, the disease can result in severe acute respiratory syndrome, kidney failure, and death [
4]. The present spread of coronavirus (COVID-19) threatens national health systems in all nations [
5]. The United States has become one of the countries most severely affected by the surge in COVID-19, which has strained its public health, emergency health care, and hospital systems [6]. Unfortunately, the rate of infection is expected to increase exponentially in many countries regardless of their health systems. Emergency steps are needed to provide adequate medical equipment and high-quality information in health care systems and hospitals. Translational science can provide stakeholders and clinicians with appropriate evidence-based medicine concepts [7]. As of 30 January 2021, the world had recorded 252,469,591 confirmed cases of COVID-19 and 5,093,058 deaths across 222 countries [8].
Management measures to mitigate transmission include the wearing of masks, hand hygiene, avoiding public contact, case identification, contact control, and quarantines [
9]. Since no successful antiviral therapy has yet been discovered, the treatment of COVID-19-infected subjects remains symptomatic [10]. The scoping analysis suggested in [11] was adopted in [12] to investigate COVID-19 and better understand the cause, prevention, diagnosis, and control of this coronavirus. However, early research papers focused primarily on causes, while work on prevention and control has improved over time. Diagnostic tests detect active viral infections in people who may be contagious and can infect others. Antibody tests, conversely, examine whether a person has previously been infected with the virus. There are three reasons for testing for COVID-19:
Surveillance allows the government and health officials to monitor the rate of infection in a particular community. It seeks to observe the effectiveness of COVID-19 prevention measures, such as wearing a mask and maintaining social distancing. This could involve random testing of people in a particular location to determine whether there is community transmission of the disease [13];
Screening involves testing people regardless of whether they show symptoms or are aware of exposure to someone who has been infected. It provides an effective means of recognizing those who are likely to have been infected with the virus so as to stop further transmission [14];
Diagnostic testing involves testing a person who is suspected of having been infected with COVID-19. The person may show symptoms of COVID-19, may know they have been in contact with confirmed COVID-19 cases, or may have already been infected and be undergoing further tests to verify that they are now negative [15].
Machine learning (ML) algorithms solve problems in the medical sector by analyzing and interpreting large volumes of data [16,17,18,19,20]. Several researchers have used machine learning algorithms to solve medical problems in this area. Since the beginning of the pandemic, a wide range of studies have been conducted to provide a better understanding of the cause, prevention, diagnosis, and control of COVID-19. In this paper, ML algorithms are applied to determine the epidemiology of the COVID-19 pandemic. ML algorithms have been shown to be effective and robust and can handle large datasets successfully; therefore, they can be used to analyze the epidemiology of COVID-19 [21,22,23,24,25,26,27,28,29].
The major contributions of this work include:
- i. The exploration of dataset noise filtering techniques (all k-edited nearest neighbors, blame-based noise reduction, and condensed nearest neighbors) on the dataset of COVID-19 infection cases in South Korea, which has not been conducted before;
- ii. The combination of noise filtering with machine learning techniques on epidemiological data for the prediction of COVID-19 cases;
- iii. The performance evaluation of the all k-edited nearest neighbors noise filter combined with machine learning algorithms using different performance metrics.
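To make the first of these filters concrete, the following is a minimal NumPy sketch of the all k-edited nearest neighbors (AENN) idea: run edited nearest neighbors for increasing k and discard any sample whose label disagrees with the majority of its neighborhood. This is our own illustration, not the implementation used in this study; library implementations of similar filters exist (e.g., `AllKNN` and `CondensedNearestNeighbour` in imbalanced-learn).

```python
import numpy as np

def enn_flags(X, y, k=3):
    """Edited nearest neighbors: return a boolean mask that is False for
    samples whose label disagrees with the majority label of their k
    nearest neighbors (i.e., likely noise)."""
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(d)[:k]              # indices of k nearest neighbors
        labels, counts = np.unique(y[nn], return_counts=True)
        if labels[np.argmax(counts)] != y[i]:
            keep[i] = False
    return keep

def all_knn_filter(X, y, k_max=3):
    """All k-edited nearest neighbors (AENN): run ENN for k = 1..k_max,
    discarding a sample as soon as any pass flags it."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for k in range(1, k_max + 1):
        idx = np.where(keep)[0]
        keep[idx[~enn_flags(X[idx], y[idx], k=k)]] = False
    return X[keep], y[keep]

# Two tight clusters plus one mislabeled point that the filter removes
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
              [0.2, 0.2]])                  # last point sits near cluster 0
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])  # ...but carries label 1
X_f, y_f = all_knn_filter(X, y)
```

In this toy run, the mislabeled point is removed because its neighborhood disagrees with its label in the very first (k = 1) pass, while all eight clean points survive every pass.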
The rest of this paper is organized as follows: Section 2 reviews the related literature; Section 3 discusses the materials and methods used in this work, as well as the performance measures; Section 4 presents the results and discussion; and Section 5 presents the conclusions.
2. Review of Related Works
In this section, we succinctly discuss recent research on the application of machine learning to the COVID-19 pandemic. This section follows from Section 1, in which it was pointed out that machine learning algorithms have gained wide acceptance among data scientists and researchers as a viable tool for tackling the COVID-19 crisis, owing to their effectiveness in the detection and diagnosis of health-related problems. For example, Nemati et al. [21] proposed a combination of statistical methods, support vector machines (SVM), and ensemble techniques that uses patients' COVID-19 data to predict the date on which they are likely to be discharged from the isolation center, and evaluates clinical information to estimate the duration of the patient's hospital stay. The downside of this work is that it is only a framework; there is no practical implementation of any machine learning or statistical algorithm, and the effectiveness of the proposed method was not evaluated.
Lalmuanawma et al. [
22] presented a review on the role of artificial intelligence (AI) and machine learning (ML) in investigating and predicting the transmission rate of COVID-19. They also examined how these techniques can be used to recognize, evaluate, and manage people who have been exposed to COVID-19 in order to prevent further transmission. Furthermore, the authors examined how AI and ML can help in bringing new pharmaceutical drugs for SARS-CoV-2 and its associated epidemic into clinical practice. The findings of this study indicated that AI and ML have significantly improved the treatment, testing, prediction, and cure/immunization steps needed to take COVID-19 drugs from concept to market availability. Malik et al. [
23] used multiple machine learning models to obtain the correlation between different characteristics and the rate of transmission of COVID-19. The ML models were used to evaluate the effect of climatic factors on the spread of COVID-19 by mining the connection between the number of confirmed cases and atmospheric condition variables in some counties. The authors opined that atmospheric characteristics are of great significance in forecasting the number of deaths due to COVID-19 compared to the other factors mentioned in the paper. Kavadi et al. [
24] developed a partial derivative regression and nonlinear machine learning (PDR-NML) method for predicting COVID-19. The PDR was used to search the dataset for optimal parameters with minimal computational resource usage. Subsequently, the machine learning model was used to normalize the attributes used to make predictions with high accuracy. In a more specific study, Amar et al. [
25] used various machine learning and statistical techniques to predict the transmission of the COVID-19 pandemic in Egypt. The authors aimed to assist the Egyptian government in managing the pandemic in the subsequent months. The experimental results showed that the exponential model outperforms other models compared in the paper. The authors deduced from their results that the COVID-19 pandemic in Egypt is not likely to end soon.
Goodman-Meza et al. [
26] applied ensemble machine learning to diagnose COVID-19 in hospitalized patients receiving treatment in settings where PCR testing is insufficient or inaccessible. The performance is good, though there is still room for improvement. The authors did not propose any new machine learning algorithm; rather, they used an ensemble of existing machine learning models. Ozturk et al. [
27] used deep neural network models to automatically detect COVID-19 from chest radiographs of patients. Their model can classify images into either two or multiple classes and can serve as a secondary or assisting diagnostic tool, especially in places where medical experts are unavailable. The classification accuracy of the model is high for binary classification; however, its multiclass accuracy is poor.
Khan et al. [
28] suggested employing parallel fusion and optimization of deep learning models, with contrast enhancement using a combination of top-hat and Wiener filters. Two pre-trained deep learning models (AlexNet and VGG16) are used and fine-tuned according to the target classes (COVID-19 and healthy). A parallel fusion approach, parallel positive correlation, is used to extract and fuse features, and the entropy-controlled firefly optimization approach is used to select optimal features. Machine learning classifiers, such as the multiclass SVM, are then used for classification.
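A top-hat plus Wiener contrast-enhancement step of the kind described above can be sketched with SciPy as follows. This is our illustrative approximation, not the authors' code; the kernel sizes and the additive way of combining the top-hat result with the image are assumptions.

```python
import numpy as np
from scipy.signal import wiener
from scipy.ndimage import white_tophat

def enhance_contrast(image, tophat_size=15, wiener_size=5):
    """Illustrative contrast enhancement: a white top-hat transform
    highlights bright structures smaller than the structuring element,
    and a Wiener filter suppresses local noise. The kernel sizes here
    are assumed values, not taken from the original paper."""
    img = np.asarray(image, dtype=float)
    tophat = white_tophat(img, size=tophat_size)         # bright-detail extraction
    denoised = wiener(img + tophat, mysize=wiener_size)  # adaptive denoising
    return denoised

# Example on a synthetic 64x64 "radiograph"
rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.1, (64, 64))
out = enhance_contrast(img)
```

The design intuition is that the top-hat pass boosts small bright details (such as lesions) before the Wiener pass removes background noise, so enhancement and denoising do not cancel each other out.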
Rehman et al. [
29] presented a framework for the diagnosis of 15 different forms of chest disease, including COVID-19, using the chest radiograph modality. They used a convolutional neural network (CNN) with a softmax classifier and a fully connected layer to extract deep features, which are fed into traditional machine learning (ML) classification algorithms. The suggested architecture improves the accuracy of COVID-19 detection and increases the prediction rates for other chest disorders.
Rustam et al. [
30] investigated the ability of four machine learning models to predict the number of people who will be infected with COVID-19. Each of the models was used to predict the number of new confirmed cases, death toll, and the number of recovered cases in a period of 10 days. The results show that the predictive capability of the models under investigation was not very good. Therefore, there is a need to try other machine learning models.
Wieczorek et al. [31,32] used artificial neural networks (ANN) to estimate future COVID-19 cases using geolocation and past case data. The results of the proposed model show high accuracy, which in some cases reaches above 99%. Ahouz and Golabpour [
33] developed a least-squares-boosting classification model to predict the incidence rate two weeks in advance. The proposed model predicted the number of globally confirmed cases of COVID-19 with an accuracy of 98.45%. Zivkovic et al. [
34] proposed a hybridized method combining machine learning, an adaptive neuro-fuzzy inference system (ANFIS), and enhanced beetle antennae search metaheuristics. The proposed model achieved a correlation of 0.9763 on China's COVID-19 outbreak data. For more related works, we refer readers to the review papers [35,36].
In summary, current machine learning methods have not been very successful in predicting confirmed cases due to challenges such as the lack of historical data and the differing approaches of governments toward testing, which make results hardly comparable [
37]. The prediction of COVID-19 cases using deep learning methods has recently gained more attention due to the growing availability of data. Deep learning methods can handle nonlinear problems particularly effectively. However, they still face the same problem of governmental actions that influence the data [
38].
4. Results and Discussion
This section presents the experimental results of machine learning techniques, namely bagging (BAG), stochastic gradient boosting (BST), bi-directional long short-term memory (BLSTM), support vector machine (SVM), naïve Bayes (NB), random forest (RF), k-nearest neighbors (KNN), decision tree (DT), and multinomial logistic regression (LR), for the diagnosis of COVID-19 infection cases.
For our experiments, we used MATLAB 2021a (MathWorks Inc., Natick, MA, USA) on a laptop computer running 64-bit Windows 10 with an Intel Core i5-8265U CPU at 1.80 GHz and 8 GB RAM.
We compared the performance of the algorithms under consideration using sensitivity, specificity, balanced accuracy, kappa, accuracy, and p-value to discern which is more accurate in the diagnosis of COVID-19 cases, i.e., the number of released, deceased, and isolated cases. We used data from the Kaggle database for COVID-19 infection cases in South Korea. The data were split into training (60%) and testing (40%) sets; the training set was used to train the models, while the test set was used to evaluate them. The dataset consists of 5165 samples across three classes: released, deceased, and isolated.
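As a sketch of how the per-class metrics reported in the tables below can be computed (our own Python illustration, not the MATLAB code used in the study), each class is evaluated one-vs-rest for sensitivity, specificity, and balanced accuracy, and Cohen's kappa is computed over all classes:

```python
import numpy as np

def one_vs_rest_metrics(y_true, y_pred, cls):
    """Per-class sensitivity, specificity, and balanced accuracy,
    computed one-vs-rest for a multi-class problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == cls) & (y_pred == cls))
    fn = np.sum((y_true == cls) & (y_pred != cls))
    tn = np.sum((y_true != cls) & (y_pred != cls))
    fp = np.sum((y_true != cls) & (y_pred == cls))
    sens = tp / (tp + fn)                   # recall on the target class
    spec = tn / (tn + fp)                   # recall on the remaining classes
    return sens, spec, (sens + spec) / 2    # balanced accuracy

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    po = np.mean(y_true == y_pred)          # observed agreement
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (po - pe) / (1 - pe)

# Toy example with the three classes used in this study
y_true = ['isolated', 'isolated', 'released', 'released', 'deceased', 'deceased']
y_pred = ['isolated', 'isolated', 'released', 'deceased', 'deceased', 'released']
sens, spec, bal = one_vs_rest_metrics(y_true, y_pred, 'isolated')
kappa = cohen_kappa(y_true, y_pred)
```

Balanced accuracy averages the recall of the target class and the rest, so it is not inflated by the large isolated class the way plain accuracy can be.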
Table 2 shows the comparison of the performance metrics used in this research: sensitivity, specificity, and balanced accuracy. Most of the machine learning algorithms (BAG, BST, BLSTM, SVM, NB, RF, KNN, DT, and LR) can classify the isolated and released classes but fail to classify the deceased class in terms of sensitivity. The specificity for the three classes (released, deceased, and isolated) is within the range of 78–100%, except for the isolated class under BLSTM.
Table 3 shows the comparison of accuracy, kappa, and p-value for BAG, BST, BLSTM, SVM, NB, RF, KNN, DT, and LR. The best overall accuracy, 82.77%, was obtained by LR, while the lowest, 65.96%, was obtained by BLSTM. This result is not encouraging when compared with other state-of-the-art techniques, so we applied the proposed noise filtering methods to the COVID-19 dataset.
Table 4 presents the performance comparison of all the ML models on the AENN-filtered dataset using sensitivity, specificity, and balanced accuracy. LR attained 82.77% accuracy, while BLSTM produced the worst accuracy, at 65.96%.
Table 5 presents the performance comparison of all the ML models on the AENN-filtered dataset using accuracy, kappa, and p-value. Both BAG and RF attained 100% accuracy, while LR produced the worst accuracy, at 98.81%.
Table 6 depicts the performance comparison of all the ML algorithms on the BBNR-filtered dataset using sensitivity, specificity, and balanced accuracy. Both SVM and NB achieved 100% sensitivity and specificity, while LR produced the worst results.
Table 7 presents the performance comparison of accuracy, kappa, and p-value of all the ML models on the BBNR-filtered dataset. BAG produced the best performance, closely followed by RF, with 74.12% and 74.01% accuracy, respectively, while NB produced the worst accuracy, at 55.69%.
Table 8 depicts the performance comparison of all the ML algorithms on the CNN-filtered dataset using sensitivity, specificity, and balanced accuracy as performance metrics. NB achieved a sensitivity of 99.32% and a specificity of 100%.
Table 9 shows the performance comparison of all the ML models on the CNN-filtered dataset using accuracy, kappa, and p-value. RF produced the best performance, closely followed by BAG, with 87.76% and 87.72% accuracy, respectively, while SVM produced the worst accuracy, at 79.16%. The main result of Table 9 is that the BAG and RF methods achieve the best performance in terms of accuracy and kappa.
The accuracy results from Table 5, Table 7, and Table 9 are visualized in Figure 1. We summarize the results of the experiments in Figure 2, which shows that the AENN method achieves a statistically significant improvement (p < 0.001, using the t-test) in classification performance in terms of the accuracy metric. On average, AENN improved the accuracy by 19.7833 ± 4.9896% and the CNN filtering method increased the accuracy by 4.6600 ± 6.9520%, while BBNR was ineffective, decreasing performance by 9.9500 ± 9.3480%.
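A significance test of this kind can be reproduced along the following lines. The sketch below computes the paired t-statistic over per-classifier accuracy differences directly in NumPy; the accuracy values in the example are placeholders, not the study's numbers, and the p-value would then follow from the t-distribution with n − 1 degrees of freedom (e.g., via `scipy.stats`).

```python
import numpy as np

def paired_t_stat(acc_filtered, acc_baseline):
    """Paired t-statistic over per-classifier accuracy differences,
    used to test whether a noise filter significantly changes accuracy.
    Returns the mean difference, its sample standard deviation, and t."""
    d = np.asarray(acc_filtered, dtype=float) - np.asarray(acc_baseline, dtype=float)
    n = len(d)
    return d.mean(), d.std(ddof=1), d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Placeholder accuracies (%) for two classifiers, before/after filtering
baseline = [80.0, 70.0]
filtered = [90.0, 85.0]
mean_gain, sd_gain, t = paired_t_stat(filtered, baseline)
```

Pairing by classifier is what makes the test appropriate here: each difference compares the same model on the same data, with and without filtering.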
The results of our study underscore the need for data filtering to improve the performance of machine learning classifiers. This study demonstrated the superiority of the AENN filtering method, which outperformed the BBNR and CNN filtering methods. This finding is in line with other recent studies [51,52,53]. However, more research is needed to confirm our results.
The limitation of the current study is that only a limited dataset from a single country was used. More research with larger datasets is still needed to validate the proposed methods.