
Machine Learning based COVID-19 Diagnosis from Blood Tests with Robustness to Domain Shifts

Theresa Roland1,*, Carl Böck2, Thomas Tschoellitsch2, Alexander Maletzky3, Sepp Hochreiter1, Jens Meier2, and Günter Klambauer1

1 ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
2 Department of Anesthesiology and Critical Care Medicine, Kepler University Hospital GmbH, Johannes Kepler University Linz, Austria
3 RISC Software GmbH, Hagenberg i.M., Austria
* roland@ml.jku.at

Abstract

We investigate machine learning models that identify COVID-19 positive patients and estimate the mortality risk based on routinely acquired blood tests in a hospital setting. However, during pandemics or new outbreaks, disease and testing characteristics change, and thus we face domain shifts. Domain shifts can be caused, e.g., by changes in the disease prevalence (spreading or tested population), by refined RT-PCR testing procedures (taking samples, laboratory), or by virus mutations. Therefore, machine learning models for diagnosing COVID-19 or other diseases may not be reliable and may degrade in performance over time. To countermand this effect, we propose methods that first identify domain shifts and then reverse their negative effects on the model performance. Frequent re-training and re-assessment, as well as stronger weighting of more recent samples, keep model performance and credibility at a high level over time. Our diagnosis models are constructed and tested on large-scale data sets, steadily adapt to observed domain shifts, and maintain high ROC AUC values along pandemics.

1 Introduction

Reverse transcription polymerase chain reaction (RT-PCR) tests1 are still the gold standard for diagnosing coronavirus disease 2019 (COVID-19)2. However, RT-PCR tests are expensive, time-consuming, and not suited for high-throughput or large-scale testing efforts. In contrast, antigen tests are cheap and fast, but they come with considerably lower sensitivity than RT-PCR tests3. Blood tests for COVID-19 are a promising technique, since they unify the best of RT-PCR and antigen tests: they are cheap, fast, efficient, and have sufficiently high sensitivity when combined with machine learning (ML) methods. Furthermore, automatically checking all routinely taken blood tests for COVID-19 allows frequent, fast and broad scanning at low cost, and thus provides a powerful tool to ban new outbreaks4,5. Therefore, we assess ML methods for diagnosing COVID-19 from blood tests. ML can enhance the sensitivity of cheap and fast tests such as antigen6 or blood tests, thereby enabling a cost-efficient alternative to RT-PCR tests. ML-enhanced tests could be particularly useful for asymptomatic patients with a routine blood test, who would not otherwise be tested for COVID-19. In this scenario, COVID-19 positive patients could be identified and isolated, and a further spread of the virus might be prevented. Especially in developing countries with limited testing capacities, ML-enhanced tests can evolve into an efficient tool for combating a pandemic.

To confine the spread of infectious diseases, and especially the COVID-19 pandemic, ML approaches can be applied in very different ways7. ML algorithms help in developing vaccines and drugs for the treatment of COVID-198–10. COVID-19 and the patient's prognosis can be predicted from chest CT-scans, X-rays11–14 or sound recordings of coughs or breathing15–17.
Furthermore, it has been shown that ML models based on blood tests are capable of detecting COVID-19 infection18–32 and predicting other outcomes, such as survival or admission to an intensive care unit33–41.

An ML model is constructed via learning on a data set with the goal that the model generalizes well, that is, performs well on new, unseen data, e.g., correctly predicts the label or class for a new data item. The quality, size and characteristics of the training data set strongly determine the predictive quality of the resulting model on new data. The central ML paradigm is that training data and future (test) data have the same distribution. This paradigm guarantees that the constructed or learned model generalizes well to future data and has high predictive performance on new data. However, this paradigm is violated during pandemics. Data sets collected during the progression of the COVID-19 pandemic are characterized by strong changes in distribution, called domain shifts. These domain shifts violate the central ML paradigm; nevertheless, they were insufficiently considered or even neglected during the evaluation of ML models. Unexpected behavior of models in real-world hospital settings often stems from neglected domain shifts42. Such unexpected behavior could even lead to unfavorable consequences, like a major disease outbreak in a hospital. Most of the previous ML studies evaluated the predictive performance of the learned models by cross-validation, bootstrapping or fixed splits on randomly drawn samples18–22,26–32. However, the theoretical justification of these evaluation methods is heavily founded on the central ML paradigm: that the distributions remain constant over time. To disregard domain shifts is a culpable negligence, since they may lead to an overoptimistic performance estimate on which medical practitioners base their decisions. These decisions are then misguided. Yang et al.25 and Plante et al.24 addressed domain shifts via evaluation on an external data set. Yang et al.25 trained and evaluated their models on data from the same period, and therefore temporal domain shifts were not sufficiently considered. The training and external evaluation sets in Plante et al.24 only include pre-pandemic negatives; they missed out on using pandemic negatives. Soltan et al.23 considered the temporal domain shift by conducting a prospective evaluation. However, analogous to Plante et al.24, the negatives are all pre-pandemic; therefore, the domain shift is artificially generated and can deviate from domain shifts during the pandemic.

In the following, we describe the categories of domain shifts that can occur in COVID-19 data sets. For the categorization, we have to consider two random variables, which both are obtained by testing a patient:

• x: Outcome of a fast and cheap test. The measurement values for a patient, which serve as input data (input features) for an ML model. We assume that the COVID-19 status (positive/negative) can, to some extent, be inferred from these tests. The measurements can arise from a fast and cheap test such as a blood test or vital sign measurement. To illustrate this value, we assume that x is the fibrinogen level, since it tends to rise during a systemic inflammation43.

• y: Outcome of the slow and expensive COVID-19 RT-PCR test, which is assumed to be binary, y ∈ {0, 1}, indicating the COVID-19 status. The test result y is assumed to be the ground truth and should give the actual COVID-19 status.

Our goal is to use ML methods to predict y from x, in order to replace the slow and expensive COVID-19 RT-PCR test by a fast and cheap test. Examples of temporal domain shifts are shown in Figure 1 a; they affect the model performance and the trustworthiness of performance estimates, see Figure 1 b and c.
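As a concrete, minimal illustration of this prediction task (our sketch, not the study's code: the data are synthetic and the three features are merely examples), a classifier g(x; w) can be fitted to approximate p(y = 1 | x):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = rng.binomial(1, 0.1, size=n)                  # RT-PCR outcome (ground truth)
x = np.column_stack([
    rng.normal(300 + 80 * y, 60),                 # fibrinogen (mg/dL), rises if positive
    rng.gamma(2.0, 2.0 + 6.0 * y),                # CRP (mg/dL)
    rng.integers(18, 95, size=n),                 # age (years)
])

x_tr, x_te, y_tr, y_te = train_test_split(x, y, stratify=y, random_state=0)
g = GradientBoostingClassifier().fit(x_tr, y_tr)  # model g with parameters w
print("ROC AUC:", roc_auc_score(y_te, g.predict_proba(x_te)[:, 1]))
```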
We identify and define the following categories of domain shifts44,45:

• Prior shift: p(y). The probability of observing a certain RT-PCR test result, e.g., y = 1, strongly changes during the pandemic. If the overall prevalence of the disease in the population is high, the probability to observe a positive test usually increases.

• Covariate shift: p(x). The distribution of the patient features is also affected by the overall pandemic course. E.g., if the prevalence of the disease is high, more persons suffer from disease symptoms, with potentially high fibrinogen, and go to the hospital. Nevertheless, fibrinogen levels could also change without connection to the pandemic, for example with the time of year46. Or, in case there is an obligation for testing, the tested person group changes and so do the measurements.

• General domain shift: p(y, x). The joint distribution of patient features and labels also changes during the pandemic, for example with new virus mutations. A new mutation could lead to more severe disease progression47 and to even higher fibrinogen.

• Concept shift: p(y|x). The probability to observe a certain RT-PCR test result given a patient characterized by their measurements such as blood tests. We model this by p(y = 1|x) ≈ g(x; w), with the model g and the model parameters w. The RT-PCR test result y changes even if the patient features x are the same, which can occur with changing test technologies, changing test procedures, changing thresholds, and so on.

Neglecting and insufficiently countering the above-mentioned domain shifts can lead to undesired consequences and failures of the models:

Unreliable performance estimates. Performance estimates without consideration of domain shifts might be overoptimistic, and the actual performance of the model can deviate significantly from the estimate42 (see Figure 1 and Section 2.5).

Degrading of predictive performance over time. Standard ML approaches are unable to cope with domain shifts over time and during the progression of a pandemic, which can result in a decrease of predictive performance44,48,49.
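The impact of a pure prior shift on threshold-dependent metrics can be made concrete with Bayes' rule: holding the classifier fixed, the positive predictive value changes with prevalence alone. A small worked example (our illustration; sensitivity and specificity are in the range reported for the RF in Table 1 c):

```python
def ppv(sensitivity, specificity, prevalence):
    # Bayes' rule: P(y = 1 | test positive)
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp)

# With sensitivity 0.94 and specificity 0.83, PPV moves from about 0.05 at
# 1 % prevalence to about 0.58 at 20 % prevalence, with the model unchanged.
for prev in (0.01, 0.05, 0.20):
    print(f"prevalence {prev:.2f} -> PPV {ppv(0.94, 0.83, prev):.2f}")
```

This is also why prevalence-sensitive metrics such as PR AUC drift even when the classifier itself is stable.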
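Covariate shift in p(x) can additionally be monitored in deployment by comparing recent feature distributions against a reference window, for instance with a two-sample Kolmogorov-Smirnov test. The following is our sketch under an assumed data layout, not part of the paper's pipeline:

```python
import numpy as np
from scipy.stats import ks_2samp

def covariate_shift_alerts(reference, recent, feature_names, alpha=0.01):
    """Flag features whose marginal distribution changed (NaN = missing)."""
    alerts = []
    for j, name in enumerate(feature_names):
        a, b = reference[:, j], recent[:, j]
        a, b = a[~np.isnan(a)], b[~np.isnan(b)]
        stat, p = ks_2samp(a, b)
        if p < alpha:
            alerts.append((name, round(stat, 3), p))
    return alerts

# Hypothetical demo: fibrinogen drifts upward in the most recent month.
rng = np.random.default_rng(1)
reference = rng.normal(300, 60, size=(5000, 1))
recent = rng.normal(340, 60, size=(400, 1))
print(covariate_shift_alerts(reference, recent, ["fibrinogen"]))
```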
Figure 1: Domain shifts in COVID-19 data sets. a, COVID-19 numbers in Austria over time, illustrating factors causing a temporal domain shift (lockdowns, mandatory masks in public buildings, pharmacy antigen tests, population screening, vaccination launch, onset of B.1.1.7 and B.1.351, newly confirmed infections, hospitalizations, PCR tests, mean age of newly infected). The numbers are sketched according to data from the Austrian BMSGPK (https://www.data.gv.at/COVID-19/). b, The actual model performance is calculated for each month from June to December 2020, and the estimated model performance is calculated on the respective previous month. c, Estimated and actual performance with 95 % confidence intervals. The estimated and actual ROC AUC are significantly different in December, and the PR AUC differs significantly in November and December, showing the effect of the domain shifts. Note that the PR AUC is sensitive to changes of prevalence.

In light of the domain shifts, we suggest lifelong learning and assessment50–52, thereby maximizing the clinical utility of the models. Concretely, we propose a) frequent temporal validation to identify domain shifts and b) re-training the models with higher weights for recently acquired samples. To this end, a continuous stream of COVID-19 samples is required, which can be achieved by routinely testing a subset of samples with an RT-PCR test.

We evaluate and compare our proposed approach of lifelong learning and assessment against standard ML approaches on a large-scale data set. This data set comprises 127,115 samples after pre-processing and merging, which exceeds the data set size of many small-scale studies18–22,32 by far. Our data set comprises pre-pandemic negative samples and pandemic negative and positive samples spanning multiple different departments of the Kepler University Hospital, Linz. As opposed to studies that require additional expensive features19,22,23, our models use no other features than blood test, age, gender and hospital admission type. This way, the blood tests can be automatically scanned for COVID-19 in a cost-effective way without any additional temporal effort for the hospital staff.

We additionally report the predictive ability for the mortality risk of the COVID-19 positive samples on the basis of the blood tests only, again with no additional expensive features33–35,38,40,41,53,54. Compared to previous studies33,36,37,39, our mortality models are trained on a large number of COVID-19 positive patients. We again take domain shifts and other potential biases into account for mortality prediction.

2 Results

2.1 Study Cohort

Our data set comprises 125,542 negative and 1,573 positive samples for training and evaluation of the ML models for COVID-19 diagnosis. Of the negatives, 116,067 have been acquired before the pandemic and 9,475 during the pandemic. The RT-PCR test sample has been collected after the blood test, with a window of 48 hours between the two tests. In the COVID-19 diagnosis data set, 919 cases survived and 118 cases died with COVID-19. For the mortality prediction, the features and samples are selected on the basis of the COVID-19 positive patients, rather than the 2019 cohort as for the COVID-19 diagnosis data set. The data sets are imbalanced in both tasks, the COVID-19 diagnosis and the mortality prediction. The pre-selection of the samples and the merging are described in more detail in Section 4 and in Figure 2. In the 2019 cohort and in the 2020 cohort, women and men occur about equally often (2019 cohort: 48 % men, 2020 cohort: 48 % men). However, in the positives cohort, there are more men (56 % men). The death rate relative to the COVID-19 positive samples is 20 % in patients 80 years or older and 8 % in patients younger than 80 years. In our data set, men died more than twice as often as women (68 % men). In the age group below 80 years, men died even three times as often as women with COVID-19 (75 % men).

2.2 Machine learning methods and model selection

We show the capability of the ML models to classify COVID-19 and to predict the mortality risk. We compare the performance of a self-normalizing neural network (SNN)55, K-nearest neighbors (KNN), logistic regression (LR), a support vector machine (SVM), a random forest (RF) and extreme gradient boosting (XGB). XGB and RF outperform the other model classes in the COVID-19 diagnosis and also in the mortality prediction. The domain shifts are exposed when comparing the evaluations on different cohorts.

The hyperparameters are selected on a validation set or via nested cross-validation to avoid a hyperparameter selection bias. Performance is estimated either via standard cross-validation or by temporal cross-validation (for details see Section 4.3).
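A schematic of this validation-based model selection might look as follows (our simplified sketch; XGB and the SNN are omitted to keep dependencies minimal, and the hyperparameter grids are left out for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}

def select_model(x_tr, y_tr, x_val, y_val):
    """Fit each candidate and pick the one with the best validation ROC AUC."""
    scores = {name: roc_auc_score(
                  y_val, m.fit(x_tr, y_tr).predict_proba(x_val)[:, 1])
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical demo on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)
print(select_model(X[:700], y[:700], X[700:], y[700:]))
```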
Figure 2: Large-scale COVID-19 data set. a, Block diagram of the structure of the data set (blood tests 2019, COVID-19 negative: 112,163 cases, 9,241,110 entries; RT-PCR results 2020, COVID-19 tested: 79,886 cases, 85,235 tests; blood tests 2020, COVID-19 tested: 19,441 cases, 3,396,853 entries; after pre-processing and merging: 2019 cohort with 70,871 cases and 116,067 samples, 2020 cohort with 9,013 cases and 11,048 samples; negatives: 79,053 cases, 125,542 samples; positives: 1,037 cases, 1,573 samples; survivors: 919 cases, 1,398 samples; deceased: 118 cases, 175 samples). The blood tests from 2019 (blood tests 2019) are all negatives and are pre-processed to the 2019 cohort. The COVID-19 RT-PCR test results and the blood tests are merged to the 2020 cohort. The negatives data set results from the 2019 cohort and the negative samples of the 2020 cohort. The positive tested cases (positives) are further divided into the cohorts with the survived and the deceased cases. Note that one case can be in both the negatives and positives cohort due to a change of the COVID-19 status. Multiple samples are obtained from one case if RT-PCR and blood tests are measured repeatedly. b, Aggregation of the blood tests for the COVID-19 tested patients: the blood tests of the last 48 hours before the COVID-19 test are merged into one sample. In case a feature is measured multiple times, the most recent one is inserted into the sample. Patient-specific data, namely age, gender and hospital admission type, are added to the sample.

2.3 Comparison of estimated and actual performance

In this experiment, we investigate the effects of a standard ML approach, in which a model is trained on data collected in a particular time period, then assessed on a hold-out set and then deployed. Concretely, we train an XGB model on data from July 2019 until October 2020, and assess the model performance on data from November 2020. We then simulate that the model is deployed and used in December 2020. Without domain shifts, the predictive performance would remain similar, but in the presence of domain shifts, the performance would be significantly different. Thus, domain shifts are exposed by comparing the actual performance with the estimated performance determined on the respective previous month, see Figure 1 b. The area under the receiver operating characteristic curve (ROC AUC) estimate is higher than the actual performance in most months (Figure 1 c). The ROC AUC performance estimate for December was significantly lower than the actual performance in December. The estimated and actual area under the precision-recall curve (PR AUC) differ significantly in November and December. These results show that there is a domain shift and thus a necessity for up-to-date assessments; otherwise the performance estimate is not trustworthy.

2.4 Model performance under domain shifts

In this section, we set up five modeling experiments with two prediction tasks and different assessment strategies:

(i) COVID-19 diagnosis prediction assessed by random cross-validation with pre-pandemic negatives,

(ii) COVID-19 diagnosis prediction assessed by random cross-validation with recent negatives,

(iii) COVID-19 diagnosis prediction assessed by temporal cross-validation,

(iv) mortality prediction assessed by random cross-validation,

(v) mortality prediction assessed by temporal cross-validation.

We then compare the performance estimates obtained by the assessment strategies. If the performance estimates by random cross-validation and temporal cross-validation are similar, then the underlying distribution of the data is likely to be similar over time. If the performance estimates of (ii) are different from (i), then former and current negatives follow different distributions. If performance estimates from (iii) are lower than those of (i) and (ii), the distribution of the data changes over time, hence indicating the presence of domain shifts. Equally, changing performance estimates from (iv) to (v) indicate a domain shift over time. The results in terms of threshold-independent performance metrics for the comparison of the models are shown in Table 1 a and b and in Figure 3. More information about the discriminating capability of individual features is shown in Figure 5 and in Table 5.

(i) COVID-19 diagnosis prediction & random cross-validation with pre-pandemic negatives. In this experiment, all cases from 2019 and 2020 are randomly shuffled; see Section 4.3 for more details. In experiment (i), the highest performance is achieved; however, domain shifts are not considered in the performance estimate. The model with the highest ROC AUC of 0.97±0.00 and PR AUC of 0.52±0.01 is the RF. Note that the baseline of a random estimator (RE) is at 0.50±0.00 for ROC AUC and 0.01±0.00 for PR AUC, the latter due to the high imbalance of positive and negative samples. For in-hospital application, a threshold is required to classify the probabilities of the models into the positive or negative class. This threshold is a trade-off between identifying all positive cases and a low number of false positives. Therefore, we report the threshold-dependent metrics for multiple thresholds, which are determined by defining negative predictive values on the validation set. The results with these determined thresholds are shown in Table 1 c for the RF.

(ii) COVID-19 diagnosis prediction & random cross-validation with recent negatives. The test set of experiment (ii) only comprises cases which have been tested for COVID-19 with an RT-PCR test. The 2020 cohort comprises patients who are suspected of COVID-19; some might even have characteristic symptoms. Therefore, a classification of the samples in the 2020 cohort is more difficult, and potential biases between the 2019 and 2020 cohorts cannot be exploited. XGB outperforms the other models with a ROC AUC of 0.92±0.00 and a PR AUC of 0.62±0.00.

(iii) COVID-19 diagnosis prediction & temporal cross-validation. In experiment (iii), the model is trained with samples until October and evaluated on samples from November and December. XGB achieves the highest ROC AUC of 0.81±0.00 and a PR AUC of 0.71±0.00. We face an additional performance drop in comparison to experiments (i) and (ii), which points to a domain shift over time. Among other causes, this domain shift over time occurs due to potential changes in the lab infrastructure, testing strategy, prevalence of COVID-19 in different patient groups, or maybe even due to mutations of the COVID-19 virus; see Figure 1 and Section 1 for more details. These results again emphasize the necessity of countering the domain shifts with lifelong learning and assessment.

(iv) Mortality prediction & random cross-validation. We predict the mortality risk of COVID-19 positive patients, who only occur in the 2020 cohort. The samples are randomly shuffled and a five-fold nested cross-validation is performed. RF outperforms the other models for the mortality prediction with a ROC AUC of 0.88±0.02 and a PR AUC of 0.63±0.11. We report the threshold-dependent metrics in Table 1 d, although the prediction scores for survival or death provided by our models are more informative for clinicians in practice than a hard separation by a threshold into the two classes.

(v) Mortality prediction & temporal cross-validation. In experiment (v), the model is trained with samples until October and evaluated on samples from November and December for mortality prediction of COVID-19 positive patients. Again, RF outperforms the other models with a ROC AUC of 0.86±0.01 and a PR AUC of 0.56±0.01. The performance drops from experiment (iv) to (v), revealing a domain shift over time for mortality prediction.

Figure 3: Comparison of model classes for COVID-19 diagnosis and mortality prediction. a, ROC and b, PR curves for the test set of COVID-19 diagnosis prediction in experiment (i). c, ROC and d, PR curves for mortality prediction in experiment (iv). a-d, Curves plotted for the different model classes at one random seed. RF and XGB outperform the other model classes as well as the random estimator (RE) baseline and the best feature (BF) as an estimator.

2.5 Lifelong learning and assessment

We propose re-training and re-assessment at high frequency to tackle the domain shifts, exploiting the new samples to achieve high performance and model credibility in the real-world hospital setting. Therefore, we suggest continuously determining the COVID-19 status of some patients with an RT-PCR test to acquire frequent samples, which is indispensable to avoid the model drifting into unexpected and poor performance. These measures are essential to enable trustworthy ML models of clinical utility. The effect of the re-training frequency of the model is shown in Figure 4 b. The performance of the ML models increases with the re-training frequency, thereby reducing the domain shift from the training to the test samples. The evaluation procedure is shown in Figure 4 a and in Section 4.4.

To counter the domain shifts, we additionally propose to weight current samples more strongly during training of the COVID-19 diagnosis model, see Figure 4 c. On the validation set (May-October), we determine the best weighting in dependence of the sample recency. The highest performance gain on the validation set is achieved by setting the weight of the 2019 cohort samples to 0.01 and the weights of the samples of the most recent month to 3 and the second-to-last month to 2 ([1, 1, 2, 3]). Compared to weighting all samples equally, this increases the ROC AUC on the validation set from 0.8118 (95 % CI: 0.7849-0.8386) to 0.8502 (95 % CI: 0.8271-0.8734) (p-value = 9e-6), which is statistically significant. The selected weighting is tested on November and December, leading to a statistically significant increase of the ROC AUC from 0.7996 (95 % CI: 0.7831-0.8162) to 0.8120 (95 % CI: 0.796-0.828) (p-value = 0.0045). The method to determine the weighting is described in more detail in Section 4.4.
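The recency weighting can be sketched as follows; the helper below is our reconstruction, with the paper's best-performing setting (2019-cohort weight 0.01, month weights [1, 1, 2, 3]) as defaults:

```python
import numpy as np

def recency_weights(months_before_cutoff, is_2019_cohort,
                    month_weights=(3.0, 2.0, 1.0, 1.0), w_2019=0.01):
    """Per-sample training weights.

    months_before_cutoff: 0 = most recent training month, 1 = the month
    before, ...; month_weights[k] is the weight at lag k, older 2020 samples
    keep the default weight 1, and 2019-cohort samples receive w_2019.
    Weights are rescaled so that only the relative weighting changes.
    """
    w = np.ones(len(months_before_cutoff), dtype=float)
    for lag, weight in enumerate(month_weights):
        w[months_before_cutoff == lag] = weight
    w[is_2019_cohort] = w_2019
    return w * (len(w) / w.sum())   # normalize to the number of samples

# Hypothetical usage with the paper's own library (XGBoost):
# from xgboost import XGBClassifier
# model = XGBClassifier(scale_pos_weight=n_negative / n_positive)
# model.fit(X_train, y_train,
#           sample_weight=recency_weights(lags, from_2019_cohort))

lags = np.array([0, 0, 1, 2, 3, 7])
from_2019 = np.array([False, False, False, False, False, True])
print(recency_weights(lags, from_2019))
```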
2.6 Features with discriminating capability

For clinical insight, violin plots show the discriminating capability of selected features for the three different cohorts (2019 and 2020 cohort, 2020 cohort, COVID-19 positive cohort) in Figure 5. The plotted features are selected based on their ROC AUC in the five experiments; the top-10 features as predictors for all five tasks are listed in Table 5 in the supplementary material.

Table 1: Performance metrics. a, Experiments (i)-(iii) are the results of the COVID-19 diagnosis prediction. In experiment (i), the test set is randomly selected from the shuffled 2019 and 2020 cohorts. In experiment (ii), the test set is a random subset of the 2020 cohort, and experiment (iii) shows the results of a prospective evaluation on November and December 2020. b, Threshold-independent metrics for mortality prediction with random shuffling of the positives set (experiment (iv)) and with prospective evaluation on November and December (experiment (v)). The ML models are trained, validated and tested with five random seeds; the mean and the standard deviation (±) of the ROC AUC and PR AUC are listed. c and d, Performance metrics on the test set of the RF for different thresholds selected on the basis of the negative predictive value on the validation set (NPV val) for c, COVID-19 diagnosis prediction in experiment (i) and d, mortality prediction in experiment (iv).

a, Threshold-independent metrics for COVID-19 diagnosis prediction (ROC AUC / PR AUC)

Model  Experiment (i)                 Experiment (ii)                Experiment (iii)
RE     0.5000±0.0000 / 0.0124±0.0000  0.5000±0.0000 / 0.0822±0.0000  0.5000±0.0000 / 0.3162±0.0000
BF     0.6745±0.0000 / 0.0221±0.0000  0.6774±0.0000 / 0.3141±0.0000  0.6623±0.0000 / 0.5716±0.0000
SNN    0.9567±0.0025 / 0.4349±0.0306  0.8998±0.0044 / 0.5577±0.0074  0.7836±0.0053 / 0.6620±0.0082
KNN    0.9071±0.0000 / 0.3137±0.0000  0.8432±0.0000 / 0.4486±0.0000  0.7209±0.0000 / 0.5712±0.0000
LR     0.9600±0.0008 / 0.4126±0.0145  0.8878±0.0022 / 0.4770±0.0086  0.7732±0.0008 / 0.6467±0.0059
SVM    0.9611±0.0000 / 0.4268±0.0000  0.9045±0.0000 / 0.5573±0.0000  0.7759±0.0000 / 0.6387±0.0000
RF     0.9654±0.0005 / 0.5231±0.0106  0.9138±0.0025 / 0.5761±0.0100  0.7957±0.0025 / 0.6626±0.0049
XGB    0.9629±0.0000 / 0.5558±0.0000  0.9169±0.0000 / 0.6216±0.0000  0.8142±0.0000 / 0.7077±0.0000

b, Threshold-independent metrics for mortality prediction (ROC AUC / PR AUC)

Model  Experiment (iv)                Experiment (v)
RE     0.5000±0.0000 / 0.1592±0.0351  0.5000±0.0000 / 0.1320±0.0000
BF     0.7599±0.0748 / 0.4320±0.1021  0.7483±0.0000 / 0.3938±0.0000
SNN    0.8656±0.0356 / 0.5866±0.1196  0.8478±0.0053 / 0.4917±0.0110
KNN    0.8207±0.0550 / 0.5527±0.1137  0.8272±0.0000 / 0.4669±0.0000
LR     0.8613±0.0351 / 0.5555±0.1281  0.8388±0.0088 / 0.4784±0.0173
SVM    0.8587±0.0306 / 0.5679±0.1010  0.8271±0.0000 / 0.4185±0.0001
RF     0.8813±0.0214 / 0.6267±0.1065  0.8572±0.0071 / 0.5556±0.0127
XGB    0.8501±0.0210 / 0.5196±0.1005  0.8038±0.0000 / 0.4334±0.0013

c, Threshold-dependent metrics of the RF in experiment (i)

NPV val      0.999        0.995        0.990        0.980
NPV          0.999±0.000  0.995±0.000  0.990±0.000  0.988±0.000
PPV          0.066±0.002  0.414±0.015  0.823±0.014  1.000±0.000
BACC         0.887±0.002  0.812±0.005  0.588±0.003  0.501±0.001
ACC          0.834±0.007  0.984±0.000  0.989±0.000  0.988±0.000
Sensitivity  0.941±0.004  0.635±0.010  0.176±0.006  0.002±0.003
Specificity  0.832±0.007  0.989±0.000  1.000±0.000  1.000±0.000
F1           0.124±0.004  0.501±0.009  0.290±0.008  0.004±0.006
Threshold    0.081±0.040  0.444±0.098  0.931±0.020  0.995±0.001

d, Threshold-dependent metrics of the RF in experiment (iv)

NPV val      0.990        0.980        0.975        0.950        0.900        0.850
NPV          0.973±0.021  0.979±0.021  0.971±0.022  0.929±0.034  0.867±0.041  0.849±0.031
PPV          0.318±0.096  0.299±0.109  0.369±0.156  0.523±0.161  0.789±0.173  1.000±0.000
BACC         0.748±0.023  0.746±0.030  0.775±0.032  0.748±0.041  0.596±0.063  0.527±0.014
ACC          0.629±0.086  0.609±0.101  0.681±0.105  0.822±0.085  0.859±0.033  0.850±0.030
Sensitivity  0.921±0.062  0.937±0.064  0.905±0.077  0.634±0.181  0.206±0.141  0.055±0.029
Specificity  0.575±0.107  0.554±0.121  0.644±0.136  0.862±0.127  0.985±0.016  1.000±0.000
F1           0.460±0.091  0.439±0.107  0.498±0.125  0.536±0.097  0.290±0.155  0.103±0.053
Threshold    0.146±0.070  0.151±0.062  0.169±0.067  0.332±0.106  0.592±0.072  0.793±0.089

Figure 4: Lifelong learning. a, Evaluation for a model training frequency of two months. The model is evaluated with an increment of one month, but on the two subsequent months after training. b, Effect of the model training frequency on performance: the mean and the 95 % confidence intervals (error bars) of the ROC AUCs. The ROC AUC performance decreases with lower model training frequency. c, Current samples are weighted higher in training to counter the domain shifts. The weighting is selected on the validation months from May until October. The selected weighting is evaluated on the test months November and December.
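The NPV-targeted thresholds underlying Table 1 c and d can be obtained with a procedure like the following (our reconstruction of the described method; the search over unique validation scores is an assumption):

```python
import numpy as np

def threshold_for_npv(y_val, p_val, target_npv=0.99):
    """Largest threshold whose validation NPV still reaches the target.

    y_val: binary labels; p_val: predicted probabilities of the positive
    class; samples with p < threshold are classified negative.
    """
    best = None
    for t in np.unique(p_val):
        negatives = p_val < t
        if negatives.sum() == 0:
            continue
        npv = (y_val[negatives] == 0).mean()
        if npv >= target_npv:
            best = t        # thresholds are visited in ascending order
    return best

# Hypothetical demo with synthetic validation scores:
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.05, size=5000)
p = np.clip(rng.normal(0.2 + 0.5 * y, 0.2), 0.0, 1.0)
print(threshold_for_npv(y, p, target_npv=0.995))
```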
Figure 5: Features with discriminating capability. a, Measured features in the 2019 and 2020 cohorts for COVID-19 diagnosis prediction for the negative and positive class. b, Features with discriminating capability for COVID-19 diagnosis prediction in the 2020 cohort, which contains the RT-PCR tested patients. c, Measured features of the positives cohort for mortality prediction for survivors and deceased. [The panels show violin plots of, among others, ferritin, LDH, lymphocytes, IG, CRP, fibrinogen, calcium, phosphor, eosinophils, basophils, monocytes, neutrophils, leukocytes, hemoglobin, BUN, RDW, FO2Hb, CHE, PCT, pH, age and admission type.] Abbreviations: C-reactive protein (CRP), immature granulocytes (IG), type of hospital admission (Adm. type), inpatient (i), outpatient (o), lactate dehydrogenase (LDH), pH-value (pH), blood urea nitrogen (BUN), red cell distribution width (RDW), oxyhemoglobin fraction (FO2Hb), cholinesterase (CHE), procalcitonin (PCT).

3 Discussion

Through multiple experiments, we expose domain shifts and their detrimental effect on ML models for COVID-19 diagnosis. We suggest carefully assessing the model performance frequently to avoid unexpected behavior with potentially adverse consequences, such as an even greater spread of the disease due to trusting a wrongly classifying model. The model should be re-trained after particular time periods to exploit newly acquired samples and thus to countermand the domain shift effect. To this end, we propose to assign a higher weight to recent samples, which, as we show, increases the predictive performance.

In this large-scale study, we train and evaluate our models with more samples than most studies18–22. Besides our large number of tested subjects, we also exploit pre-pandemic negative samples, which vastly increases our data set size. In comparison to Soltan et al.23 and Plante et al.24, we use the pre-pandemic as well as the pandemic negatives in our data set. We achieve high predictive performance with our models, comparable to previous studies18,19,21,25,35, although the results cannot be directly compared, as our assessment procedure is more rigorous. Different assessment procedures within our study also yield highly variable performance estimates. Some studies suggested logistic regression models for COVID-19 and mortality prediction39,53; however, most identified (X)GB or RF as the best model classes18,20,25,31,38. We confirm these findings and suggest using XGB or RF for COVID-19 diagnosis and RF for mortality prediction, as these exhibit the highest performance in our experiments.

Our models only require a small set of features of a patient, concretely a minimum of twelve blood test parameters plus age, gender and hospital admission type; in total at least 15 features. In case many blood test parameters are available, the model exploits up to 100 pre-selected features. Missing features are imputed, thereby also allowing model application on samples with a small number of features. This enables automatically scanning the blood tests without additional effort by the hospital staff, as opposed to published models that require more expensive features, e.g., vital signs, which might not be as easily available19,22,23.

One limitation of our work could be that we did not evaluate the generalization of our model to other hospitals. A transfer of a COVID-19 diagnostic model should only be done with thorough re-assessments, as a domain shift between hospitals might be present. Among other causes, such domain shifts from one institution to another could result from different testing strategies, laboratory equipment or demographics of the population in the hospital catchment area. Re-training of models rather than transferring them to another hospital should be considered to obtain a skilled and trustworthy model. However, this is not part of our investigation. Our findings and suggestions about domain shifts should be accounted for in all hospitals when applying a COVID-19 model.

We evaluate our models on different cohorts to show the high performance as well as to reveal the domain shifts. However, the 2020 cohort only contains subjects that were tested for COVID-19 and for whom a blood test was taken. Hence, the 2020 cohort is only a subset of the total patient cohort on which the model will be applied. To counteract missing samples from a particular group, we also use the pre-pandemic negatives, which should cover a wide variety of negatives due to the large data set. An evaluation on all blood tests of 2020 is simply not possible due to the lack of RT-PCR tests, which serve as labels in our ML approach. Non-tested subjects of 2020 cannot be assumed to be negatives; therefore, we discard them. This could only be circumvented by explicitly testing a large number of patients for this study who would not be tested otherwise.

For lifelong learning and assessment, testing a subset of the patients with an RT-PCR test is still necessary to identify and counter the temporal domain shift. However, this does not diminish the benefit of the model, as by automatically scanning all blood tests a large number of patients can be checked for COVID-19, which would not be feasible with expensive and slow RT-PCR tests. The benefit of the model also transfers to hospitals or areas with limited testing capacity. Rather than replacing RT-PCR tests, the model can be applied as a complement to or replacement for antigen tests. The model can be re-trained with the already implemented pipeline. The computational effort is relatively low, as the model only requires tabular data and no time series (sound recordings) or images (CT, X-ray)11–17. Other studies do not consider the domain shifts and the associated necessity for re-training, although it is indispensable for clinical utility. Lifelong learning and assessment does not only provide a performance gain for diagnostic models in pandemics like COVID-19, but also for other medical tasks or, in general, other applications of ML where we face a continuous stream of data.

We demonstrate the high capability of ML models in detecting COVID-19 infections and COVID-19-associated mortality risk on the basis of blood tests on a large-scale data set. With our findings concerning domain shifts and lifelong learning and assessment, we want to advance ML models to be accurate and trustworthy in real-world hospital settings. Lifelong learning and assessment is an important tool to allow the transition of research results to actual application in hospitals. By advancing this field of research, we want to increase patient safety, protect clinical staff, and make a contribution to banning the pandemic.
4 Methods

4.1 Ethics approval

Ethics approval for this study was obtained from the ethics committee of the Johannes Kepler University, Linz (approval number: 1104/2020). The study is conducted on the blood tests (including age, gender and hospital admission type) from July 2019 until December 2020 and the COVID-19 RT-PCR tests from 2020 of the Kepler University Hospital, Med Campus III, Linz, Austria. In our study, we analyze anonymized data only.

4.2 Data set preparation

We predict the result of the RT-PCR COVID-19 test on the basis of routinely acquired blood tests. A block diagram of the data set is sketched in Figure 2. We only use the blood tests taken before the RT-PCR test to avoid bias caused by the test result. We limit the time deviation of the blood test from the COVID-19 RT-PCR test to 48 hours to ensure that the blood test matches the determined COVID-19 status. Additionally, we incorporate pre-pandemic blood tests from the year 2019 as negatives into our data set to cover a wide variety of COVID-19 negative blood tests.

For the data from the year 2020, we aggregate the blood tests of the last 48 hours before the test. If parameters are measured more than once, we take the most recent one, see Figure 2 b. In case no COVID-19 test follows the blood test within 48 hours, the blood test samples are discarded. Additionally, we discard all samples with a deviating RT-PCR test result within the next 48 hours, as the label might be incorrect. The data from 2019 do not contain COVID-19 tests; therefore, blood tests with a temporal distance of less than 48 hours are aggregated. The features age, gender and admission type (inpatient or outpatient) are added to the samples.

For the prediction of the COVID-19 diagnosis, we select the 100 most frequent features in the 2019 cohort as the feature set. For the mortality task, the 100 most frequent features are selected based on the positives cohort, as the model is only applied to COVID-19 positive samples. Each sample requires a minimum of 15 features (any twelve blood test features plus age, gender and hospital admission type). All other features and samples are discarded. The fact that the samples only require a minimum of 15 features can lead to many missing entries, as the feature vector has a length of 100. For each sample, we create 100 additional binary entries which indicate whether each of the features is missing or measured. The missing values are filled by median imputation. Hence, the models can be applied to blood tests with few measured values.
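A sketch of this sample construction (assumed column names 'time', 'parameter', 'value'; not the hospital's actual schema):

```python
import pandas as pd

def build_sample(blood_df, pcr_time, feature_set):
    """One feature vector from the 48 hours of blood values before a PCR test.

    blood_df: long-format frame with columns ['time', 'parameter', 'value']
    for a single patient; feature_set: the 100 pre-selected parameters.
    """
    window = blood_df[(blood_df["time"] <= pcr_time) &
                      (blood_df["time"] > pcr_time - pd.Timedelta(hours=48))]
    latest = (window.sort_values("time")
                    .groupby("parameter")["value"].last())  # most recent value
    sample = latest.reindex(feature_set)                    # NaN if unmeasured
    indicators = sample.isna().astype(int).add_prefix("missing_")
    return pd.concat([sample, indicators])

# Median imputation would then be fitted on the training split only, e.g.:
# medians = train_matrix[feature_set].median()
# train_matrix[feature_set] = train_matrix[feature_set].fillna(medians)
```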
The models are selected and evaluated based on the ROC AUC56 , which is a measure of the model’s discriminating power between the two classes. Further, we report the PR AUC56 and we calculate threshold-dependent metrics, where the classes are separated into positives and negatives, instead of probability estimates. These metrics are negative predictive value (NPV), positive predictive value (PPV), balanced accuracy (BACC), accuracy (ACC), sensitivity, specificity and the F1-score (F1)57 . We additionally report the thresholds, which are determined on the validation set to achieve the intended NPV. We perform a grid search over hyperparameters of the models, see Table 2 in the supplementary material. The best hyperparameters are selected based on the ROC AUC on the validation set. In the COVID-19 diagnosis prediction tasks (experiment (i)-(iii)) we use one fixed validation fold due to the high number of samples. The models are trained and evaluated with five random seeds. For the mortality prediction tasks (experiment (iv) and (v)) the mean ROC AUC over five validation folds is calculated to select the hyperparameters. Further, the selected models are evaluated on the test set to estimate the performance. Experiment (iv) is evaluated with five-fold nested cross validation and all other experiments use a fixed test set. The mean and standard deviation of the models, which are trained, validated and tested with five random seeds, are reported. In experiment (iv) and (v) we train the models to predict the mortality risk of COVID-19 positive patients. 4.4 4.3 Experiments for model mance under domain shift perfor- Given the presence of domain shifts, we define five experimental designs to estimate the performance. The experiments differ at the data split into training, validation and test set. These splits are conducted on patient level, such that one patient only occurs in one of the sets. In the first three experiments we train models for COVID-19 diagnosis prediction. We train and evaluate the COVID-19 diagnosis models with five random seeds with a fixed data split. (i) In our first experiment we randomly shuffle all patients and we split regardless of the patient cohorts (60 % training, 20 % validation, 20 % testing). (ii) The training and validation sets include the 2019 cohort and 80 % (60 % training, 20 % validation) of the 2020 cohort. The test set comprises the remaining samples (20 %) of the 2020 cohort. Therefore, the performance is estimated on patients who actually were tested for COVID-19. (iv) The training (60 %), validation (20 %) and test (20 %) sets comprise the positive cases from the 2020 cohort. Due to the limited number of samples, we predict performance with five-fold nested cross validation. Experiments for lifelong learning and assessment We conduct three experiments to show the necessity of lifelong learning and assessment for trustworthy and accurate models. The first experiment investigates the deviation of the estimated to the actual performance. (v) The training and validation sets include the posi- Therefore, we test the models on the months June until tive cases from 2020 before November (80 % train- December. The performance estimate is calculated on ing, 20 % validation). The test set comprises the the respective preceding month (May until November), cases from November and December. The test see Figure 1 b. The 95 % confidence intervals are deterset is fixed, but again, we train and evaluate the mined via bootstrapping by sampling 1,000 times with replacement. 
4.3 Experiments for model performance under domain shift

Given the presence of domain shifts, we define five experimental designs to estimate the performance. The experiments differ in how the data are split into training, validation and test sets. These splits are conducted on patient level, such that each patient occurs in only one of the sets. In the first three experiments we train models for COVID-19 diagnosis prediction; these models are trained and evaluated with five random seeds on a fixed data split.

(i) In our first experiment, we randomly shuffle all patients and split regardless of the patient cohorts (60 % training, 20 % validation, 20 % testing).

(ii) The training and validation sets include the 2019 cohort and 80 % (60 % training, 20 % validation) of the 2020 cohort. The test set comprises the remaining samples (20 %) of the 2020 cohort. Therefore, the performance is estimated on patients who actually were tested for COVID-19.

(iv) The training (60 %), validation (20 %) and test (20 %) sets comprise the positive cases from the 2020 cohort. Due to the limited number of samples, we estimate the performance with five-fold nested cross-validation.

(v) The training and validation sets include the positive cases from 2020 before November (80 % training, 20 % validation). The test set comprises the cases from November and December. The test set is fixed, but again, we train and evaluate the models with five random seeds.

4.4 Experiments for lifelong learning and assessment

We conduct three experiments to show the necessity of lifelong learning and assessment for trustworthy and accurate models. The first experiment investigates the deviation of the estimated from the actual performance. Therefore, we test the models on the months June until December. The performance estimate is calculated on the respective preceding month (May until November), see Figure 1 b. The 95 % confidence intervals are determined via bootstrapping by sampling 1,000 times with replacement. The deviations of estimated and actual performance are checked for significance. For this purpose, XGB is trained with the selected hyperparameters of experiment (iii).

Second, we check the effect of the model training frequency on the performance. We evaluate the trained model on different numbers of subsequent months without re-training; we refer to this number of subsequent months as the model training frequency. A model training frequency of two months is sketched in Figure 4 a. We evaluate the different model training frequencies with an increment of one month, concatenate the predictions as well as the targets, and calculate the ROC AUC and its 95 % confidence interval by bootstrapping 1,000 times with replacement. We do not report the PR AUC, as the prevalences in the test sets of the different model training frequencies are not comparable.

In our third experiment for lifelong learning and assessment, we investigate the effect of higher weights for current samples during training, as shown in Figure 4 c. Therefore, we define May until October as our validation months to select the optimal weighting and evaluate the selection on November and December. We train the models with all available data before the respective validation month, using the best hyperparameters determined in experiment (iii). The predictions and targets are concatenated for all validation months. With a one-sided, paired DeLong test58, we test our hypothesis that the ROC AUC increases when current samples are weighted higher than older samples, in comparison to the ROC AUC when all samples are equally weighted. We pass the concatenated prediction and target vectors to the DeLong test, which returns the p-value and ROC AUC, calculated with the pROC package 1.17.0.1 in R. We identify the best weighting on the validation set by combining all listed options of weights for the 2019 cohort and for the most recent, previous months. The default weight of the samples is 1. We restrict the 2019 cohort weights to the set {1, 0.1, 0.01, 0.001}, and the weights of the previous months to {[1, 1, 1, 1], [1, 1, 1, 2], [1, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]}, with the last entry in each square bracket being the weight of the last month, the second last entry that of the second last month, and so forth. Afterwards, we normalize the weights to the number of training samples, such that only the relative weighting changes. As determined by the hyperparameter search, we also pass the scaling factor scale_pos_weight to the model to balance positive and negative samples. The best weighting parameters are selected on the validation set and tested on November and December.
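As an illustration of the weighting scheme above, the following sketch builds per-sample weights from one 2019-cohort weight and a four-month recency vector, then normalizes them to sum to the number of training samples. This is our own illustration under assumed array names, not the authors' implementation.

    import numpy as np

    def build_sample_weights(is_2019, month_idx, w_2019=0.1, w_recent=(1, 2, 3, 4)):
        """is_2019: boolean mask for 2019-cohort samples; month_idx: integer
        month index per sample. w_recent[-1] weights the most recent month,
        w_recent[-2] the month before, and so forth; all remaining samples
        keep the default weight 1."""
        w = np.ones(len(month_idx), dtype=float)
        w[is_2019] = w_2019
        newest = month_idx[~is_2019].max()
        for back, weight in enumerate(reversed(w_recent)):  # back = 0 -> last month
            w[(~is_2019) & (month_idx == newest - back)] = weight
        # Normalize to the number of samples: only the relative weighting changes.
        return w * len(w) / w.sum()

The resulting vector can then be passed, for example, as sample_weight to the fit method of an XGBoost classifier.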
4.5 Features with discriminating capability

Besides the ML models, we additionally report statistical evaluations to allow clinical insight: we calculate the ROC AUCs of individual features analogously to the five experiments in Section 4.3. For these evaluations, the features themselves are considered as predictors; the ROC AUC is equivalent to the concordance statistic (c-statistic) for binary outcomes59. Note that we do not train a model for this purpose: we simply use the feature value, with positive or negative sign, as a predictor on the test set. Thereby, we identify important features for the COVID-19 diagnosis and the mortality task (Table 5 in the supplementary material). This way, we can identify features with discriminating capability and compare them with the ML models. Additionally, we visualize the most important features selected by the evaluation described above. Figure 5 shows these features for the COVID-19 diagnosis task, plotted for the full data set (2019 and 2020 cohort) and for the 2020 cohort alone, and for the mortality prediction task, plotted for the 2020 cohort. The violin plots only contain measured features; imputed feature values are not displayed for better visual clarity.

5 Data availability

The data set is not available for public use due to data privacy reasons.

6 Code availability

Code is provided at https://github.com/ml-jku/covid.

References

1. Corman, V. M. et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Euro Surveill. 25, 2000045 (2020).
2. Zhu, N. et al. A Novel Coronavirus from Patients with Pneumonia in China, 2019. N. Engl. J. Med. 382, 727–733 (2020).
3. Mina, M. J., Parker, R. & Larremore, D. B. Rethinking Covid-19 Test Sensitivity — A Strategy for Containment. N. Engl. J. Med. 383, e120 (2020).
4. Chin, E. T. et al. Frequency of Routine Testing for Coronavirus Disease 2019 (COVID-19) in High-risk Healthcare Environments to Reduce Outbreaks. Clin. Infect. Dis., ciaa1383 (2020).
5. Larremore, D. B. et al. Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance. medRxiv, 2020.06.22.20136309 (2020).
6. Mak, G. et al. Evaluation of rapid antigen test for detection of SARS-CoV-2 virus. J. Clin. Virol. 129, 104500 (2020).
7. Alafif, T., Tehame, A. M., Bajaba, S., Barnawi, A. & Zia, S. Machine and Deep Learning towards COVID-19 Diagnosis and Treatment: Survey, Challenges, and Future Directions. Int. J. Environ. Res. Public Health 18, 1–24 (2021).
8. Keshavarzi Arshadi, A. et al. Artificial Intelligence for COVID-19 Drug Discovery and Vaccine Development. Front. Artif. Intell. Appl. 3, 65 (2020).
9. Ong, E., Wong, M. U., Huffman, A. & He, Y. COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning. Front. Immunol. 11, 1581 (2020).
10. Hofmarcher, M. et al. Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks. arXiv, 2004.00979 (2020).
11. Ozsahin, I., Sekeroglu, B., Musa, M. S., Mustapha, M. T. & Ozsahin, D. U. Review on Diagnosis of COVID-19 from Chest CT Images Using Artificial Intelligence. Comput. Math. Method. M. 2020, 1–10 (2020).
12. Borkowski, A. A. et al. Using Artificial Intelligence for COVID-19 Chest X-ray Diagnosis. Fed. Pract. 37, 398–404 (2020).
13. Saha, P., Sadi, M. S. & Islam, M. M. EMCNet: Automated COVID-19 diagnosis from X-ray images using convolutional neural network and ensemble of machine learning classifiers. Inform. Med. Unlocked 22, 100505 (2021).
14. Pham, T. D. Classification of COVID-19 chest X-rays with deep learning: new models or fine tuning? Health Inf. Sci. Syst. 9, 1–11 (2021).
15. Mouawad, P., Dubnov, T. & Dubnov, S. Robust Detection of COVID-19 in Cough Sounds. SN Computer Science 2, 34 (2021).
16. Schuller, B. W. et al. COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis. arXiv, 2003.11117 (2020).
17. Laguarta, J., Hueto, F. & Subirana, B. COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings. IEEE Open J. Eng. Med. Biol. 1, 275–281 (2020).
18. Brinati, D. et al. Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: A Feasibility Study. J. Med. Syst. 44, 135 (2020).
19. Cabitza, F. et al. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. Clin. Chem. Lab. Med. 59, 421–431 (2021).
20. Tschoellitsch, T., Dünser, M., Böck, C., Schwarzbauer, K. & Meier, J. Machine Learning Prediction of SARS-CoV-2 Polymerase Chain Reaction Results with Routine Blood Tests. Lab. Med. 52, 146–149 (2020).
21. Goodman-Meza, D. et al. A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity. PLoS ONE 15, e0239474 (2020).
22. Langer, T. et al. Development of machine learning models to predict RT-PCR results for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in patients with influenza-like symptoms using only basic clinical data. Scand. J. Trauma Resusc. 28, 1–14 (2020).
23. Soltan, A. A. S. et al. Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test. Lancet Digit. Health 3, 78–87 (2021).
24. Plante, T. B. et al. Development and External Validation of a Machine Learning Tool to Rule Out COVID-19 Among Adults in the Emergency Department Using Routine Blood Tests: A Large, Multicenter, Real-World Study. J. Med. Internet Res. 22, 1–19 (2020).
25. Yang, H. S. et al. Routine Laboratory Blood Tests Predict SARS-CoV-2 Infection Using Machine Learning. Clin. Chem. 66, 1396–1404 (2020).
26. Almansoor, M. & Hewahi, N. M. Exploring the Relation between Blood Tests and Covid-19 Using Machine Learning. ICDABI, 1–6 (2020).
27. AlJame, M., Ahmad, I., Imtiaz, A. & Mohammed, A. Ensemble learning model for diagnosing COVID-19 from routine blood tests. Inform. Med. Unlocked 21, 100449 (2020).
28. Formica, V. et al. Complete blood count might help to identify subjects with high probability of testing positive to SARS-CoV-2. Clin. Med. 20, e114–e119 (2020).
29. De Freitas Barbosa, V. A. et al. Heg.IA: an intelligent system to support diagnosis of Covid-19 based on blood tests. Res. Biomed. Eng. (2021).
30. Banerjee, A. et al. Use of Machine Learning and Artificial Intelligence to predict SARS-CoV-2 infection from Full Blood Counts in a population. Int. Immunopharmacol. 86 (2020).
31. Silveira, E. C. Prediction of COVID-19 From Hemogram Results and Age Using Machine Learning. Front. Health Inform. 9, 39 (2020).
32. Avila, E., Kahmann, A., Alho, C. & Dorn, M. Hemogram data as a tool for decision-making in COVID-19 management: applications to resource scarcity scenarios. PeerJ 8, e9482 (2020).
33. Sun, H. et al. CoVA: An Acuity Score for Outpatient Screening that Predicts Coronavirus Disease 2019 Prognosis. J. Infect. Dis. 223, 38–46 (2020).
34. Vaid, A. et al. Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation. J. Med. Internet Res. 22, 1–19 (2020).
35. Li, X. et al. Deep learning prediction of likelihood of ICU admission and mortality in COVID-19 patients using clinical variables. PeerJ 8, e10337 (2020).
36. Booth, A. L., Abels, E. & McCaffrey, P. Development of a prognostic model for mortality in COVID-19 infection using machine learning. Mod. Pathol. 34, 522–531 (2020).
37. Ko, H. et al. An Artificial Intelligence Model to Predict the Mortality of COVID-19 Patients at Hospital Admission Time Using Routine Blood Samples: Development and Validation of an Ensemble Model. J. Med. Internet Res. 22, e25442 (2020).
38. Heldt, F. S. et al. Early risk assessment for COVID-19 patients from emergency department data using machine learning. Sci. Rep. 11, 4200 (2021).
39. Zhou, Y., Li, B., Liu, J. & Chen, D. The Predictive Effectiveness of Blood Biochemical Indexes for the Severity of COVID-19. Can. J. Infect. Dis. Med. Microbiol. 2020, 732081 (2020).
40. Fernandes, F. T. et al. A multipurpose machine learning approach to predict COVID-19 negative prognosis in São Paulo, Brazil. Sci. Rep. 11, 3343 (2021).
41. Yao, H. et al. Severity Detection for the Coronavirus Disease 2019 (COVID-19) Patients Using a Machine Learning Model Based on the Blood and Urine Tests. Front. Cell Dev. Biol. 8, 683 (2020).
42. Elsahar, H. & Gallé, M. To Annotate or Not? Predicting Performance Drop under Domain Shift. EMNLP-IJCNLP 9, 2163–2173 (2019).
43. Davalos, D. & Akassoglou, K. Fibrinogen as a key regulator of inflammation in disease. Semin. Immunopathol. 34, 43–62 (2012).
44. Kouw, W. M. & Loog, M. An introduction to domain adaptation and transfer learning. arXiv, 1812.11806 (2018).
45. Adler, T. et al. Cross-Domain Few-Shot Learning by Representation Fusion. arXiv, 2010.06498v2 (2021).
46. Crawford, V., Sweeney, O., Coyle, P., Halliday, I. & Stout, R. The relationship between elevated fibrinogen and markers of infection: a comparison of seasonal cycles. QJM - Int. J. Med. 93, 745–750 (2000).
47. Davies, N. G. et al. Increased mortality in community-tested cases of SARS-CoV-2 lineage B.1.1.7. Nature (2021).
48. Koh, P. W. et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts. arXiv, 2012.07421 (2021).
49. Wulfmeier, M., Bewley, A. & Posner, I. Incremental Adversarial Domain Adaptation for Continually Changing Environments. ICRA, 1–9 (2018).
50. Chen, Z., Liu, B., Brachman, R., Stone, P. & Rossi, F. Lifelong Machine Learning: Second Edition (Morgan & Claypool, San Rafael, California (USA), 2018).
51. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 113, 54–71 (2019).
52. Zhang, Y., Jordon, J., Alaa, A. M. & van der Schaar, M. Lifelong Bayesian Optimization. arXiv, 1905.12280 (2019).
53. Heber, S. et al. Development and external validation of a logistic regression derived formula based on repeated routine hematological measurements predicting survival of hospitalized Covid-19 patients. medRxiv, 2020.12.20.20248563 (2020).
54. Gao, Y. et al. Machine learning based early warning system enables accurate mortality risk prediction for COVID-19. Nat. Commun. 11, 5033 (2020).
55. Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-normalizing neural networks. NIPS, 971–980 (2017).
56. Davis, J. & Goadrich, M. The relationship between Precision-Recall and ROC Curves. ICML 23, 233–240 (2006).
57. Branco, P., Torgo, L. & Ribeiro, R. P. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49, 1–50 (2016).
58. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
59. Harrell Jr., F. E., Lee, K. L. & Mark, D. B. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Stat. Med. 15, 361–387 (1996).

7 Funding

This project was funded by the Medical Cognitive Computing Center (MC3) and AI-MOTION (LIT-2018-6-YOU-212).

8 Acknowledgements

We thank the projects Medical Cognitive Computing Center (MC3), AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3 ID: 951847), and AIDD (MSCA-ITN-2020 ID: 956832). We thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google, ZF Friedrichshafen AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA Corporation. We thank Franz Grandits, Innosol, for the daily download of the age distribution data of the newly infected COVID-19 patients from the BMSGPK.

9 Author contributions

T.R., T.T., J.M., S.H. and G.K. designed the study. C.B. exported and anonymized the data from the hospital information system. T.R., A.M. and T.T. pre-processed the blood tests. T.R. pre-processed the RT-PCR tests and mortality data. T.R. implemented the ML models and conducted the experiments. T.R., S.H. and G.K. wrote the manuscript. T.T. wrote the application for the ethics approval. S.H., J.M. and G.K. supervised the project. All authors critically revised the draft and approved the final manuscript.

10 Competing interests

The authors declare no competing interests.

A Supplementary material
Table 2: Hyperparameters for grid search.

Model            Hyperparameters
SNN              lr: {1e-3, 2e-4, 1e-4}, n_val_stops: {20}, weight_decay: {1e-5}, intermediate_size: {4, 16, 64}, n_layers: {1, 3, 6}, alpha_dropout: {0, 0.9}, optimizer: {Adam}
KNN              n_neighbors: {3, 11, 25, 51, 101, 201, 301}, weights: {uniform, distance}
LR               lr: {1e-2, 1e-3, 5e-4, 1e-4}, n_val_stops: {20}, weight_decay: {1e-5}, optimizer: {Adam}
SVM (COVID-19)   class: {LinearSVC}, dual: {False}, class_weight: {None, balanced}
SVM (Mortality)  class: {SVC}, kernel: {linear, poly, rbf, sigmoid, precomputed}, probability: {True}, class_weight: {None, balanced}
RF               n_estimators: {501}, criterion: {gini, entropy}, max_depth: {2, 8, 32, None}, min_samples_split: {2}, min_samples_leaf: {1, 8, 32}, max_features: {auto, log2, None}, max_leaf_nodes: {None}, class_weight: {balanced, None}
XGB              objective: {binary:logistic}, booster: {gbtree, gblinear, dart}, eta: {0.1, 0.3, 0.6}, gamma: {0}, max_depth: {2, 6, 32}, scale_pos_weight: {True, False}, grow_policy: {depthwise, lossguide}
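To illustrate how such a grid is evaluated against the fixed validation fold, the following sketch enumerates a subset of the XGB row of Table 2 (tree boosters only, since grow_policy and max_depth do not apply to gblinear) and selects by validation ROC AUC. The arrays X_train, y_train, X_val and y_val are assumed to be preloaded; this is our own illustration, not the authors' code.

    from itertools import product
    from xgboost import XGBClassifier
    from sklearn.metrics import roc_auc_score

    # Subset of the XGB grid from Table 2; learning_rate corresponds to eta.
    grid = {
        "booster": ["gbtree", "dart"],
        "learning_rate": [0.1, 0.3, 0.6],
        "max_depth": [2, 6, 32],
        "grow_policy": ["depthwise", "lossguide"],
    }

    best_auc, best_params = -1.0, None
    for combo in product(*grid.values()):
        params = dict(zip(grid, combo))
        model = XGBClassifier(objective="binary:logistic", gamma=0,
                              tree_method="hist", **params)
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if auc > best_auc:   # selection criterion: validation ROC AUC
            best_auc, best_params = auc, params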
Table 3: Comparison of estimated and actual performance. These metrics are calculated on the basis of the COVID-19 diagnosis prediction task with XGB. At significantly different deviations, the confidence intervals (CI) are colored in red. a, The actual performance is calculated on the listed month and the estimate is determined on the respective previous month. b, The estimate is determined by random samples from the 2020 cohort. c, The estimate is determined by random samples from the 2019 and 2020 cohort.

a                               Jun     Jul     Aug     Sep     Oct     Nov     Dec     Sum     Mean
ROC AUC actual                  0.7994  0.7474  0.9225  0.8076  0.7766  0.7975  0.7419          0.7990
ROC AUC estimate                0.8182  0.7767  0.7880  0.9202  0.8302  0.7934  0.8219          0.8212
∆ ROC AUC actual - estimate     0.0188  0.0293  0.1345  0.1126  0.0536  0.0042  0.0800  0.4329  0.0618
CI95 lower ROC AUC actual       0.6973  0.6687  0.8405  0.7014  0.7383  0.7789  0.7068
CI95 upper ROC AUC actual       0.8851  0.8262  0.9808  0.9014  0.8142  0.8179  0.7782
CI95 lower ROC AUC estimate     0.6983  0.6717  0.6959  0.8308  0.7276  0.7541  0.8040
CI95 upper ROC AUC estimate     0.9205  0.8684  0.8747  0.9878  0.9188  0.8297  0.8389
PR AUC actual                   0.2413  0.1180  0.5155  0.3913  0.4540  0.7069  0.5878          0.4307
PR AUC estimate                 0.3123  0.2487  0.3948  0.6699  0.4678  0.5274  0.7256          0.4781
∆ PR AUC actual - estimate      0.0711  0.1307  0.1207  0.2786  0.0138  0.1796  0.1378  0.9322  0.1332
CI95 lower PR AUC actual        0.1058  0.0704  0.2501  0.1967  0.3784  0.6714  0.5288
CI95 upper PR AUC actual        0.3941  0.1870  0.7190  0.5821  0.5325  0.7396  0.6428
CI95 lower PR AUC estimate      0.1442  0.1118  0.2397  0.4430  0.2768  0.4481  0.6916
CI95 upper PR AUC estimate      0.5236  0.4003  0.5448  0.8476  0.6356  0.6034  0.7617

b                               Jun     Jul     Aug     Sep     Oct     Nov     Dec     Sum     Mean
ROC AUC actual                  0.8116  0.7913  0.9531  0.8242  0.7870  0.8227  0.7515          0.8202
ROC AUC estimate                0.8959  0.8238  0.9075  0.8742  0.8922  0.8641  0.8960          0.8791
∆ ROC AUC actual - estimate     0.0843  0.0325  0.0456  0.0501  0.1052  0.0414  0.1444  0.5034  0.0719
CI95 lower ROC AUC actual       0.7096  0.6982  0.8955  0.7214  0.7500  0.8051  0.7156
CI95 upper ROC AUC actual       0.9074  0.8717  0.9889  0.9145  0.8237  0.8409  0.7888
CI95 lower ROC AUC estimate     0.8304  0.7229  0.8405  0.8072  0.8382  0.8151  0.8729
CI95 upper ROC AUC estimate     0.9598  0.9061  0.9589  0.9313  0.9394  0.9129  0.9185
PR AUC actual                   0.2209  0.3593  0.6543  0.4538  0.5141  0.7191  0.6087          0.5043
PR AUC estimate                 0.4047  0.4106  0.5428  0.3366  0.5122  0.5237  0.7129          0.4919
∆ PR AUC actual - estimate      0.1838  0.0513  0.1115  0.1172  0.0020  0.1954  0.1042  0.7654  0.1093
CI95 lower PR AUC actual        0.1177  0.2050  0.4516  0.2654  0.4364  0.6856  0.5450
CI95 upper PR AUC actual        0.4167  0.5227  0.8221  0.6239  0.5921  0.7522  0.6695
CI95 lower PR AUC estimate      0.2259  0.2431  0.3912  0.2057  0.3461  0.4121  0.6592
CI95 upper PR AUC estimate      0.6419  0.5858  0.7075  0.4794  0.6604  0.6353  0.7626

c                               Jun     Jul     Aug     Sep     Oct     Nov     Dec     Sum     Mean
ROC AUC actual                  0.7819  0.7638  0.9060  0.8483  0.7939  0.8226  0.7507          0.8096
ROC AUC estimate                0.9900  0.9286  0.9820  0.9814  0.9623  0.9517  0.9734          0.9671
∆ ROC AUC actual - estimate     0.2082  0.1647  0.0760  0.1332  0.1684  0.1291  0.2227  1.1024  0.1575
CI95 lower ROC AUC actual       0.6805  0.6603  0.7956  0.7448  0.7553  0.8048  0.7118
CI95 upper ROC AUC actual       0.8737  0.8567  0.9855  0.9338  0.8319  0.8412  0.7868
CI95 lower ROC AUC estimate     0.9800  0.8596  0.9656  0.9662  0.9197  0.9249  0.9648
CI95 upper ROC AUC estimate     0.9974  0.9807  0.9937  0.9929  0.9908  0.9746  0.9812
PR AUC actual                   0.2274  0.3821  0.6260  0.4627  0.5326  0.7227  0.6073          0.5087
PR AUC estimate                 0.4554  0.4508  0.3643  0.4093  0.2814  0.2938  0.5585          0.4019
∆ PR AUC actual - estimate      0.2280  0.0688  0.2617  0.0535  0.2512  0.4289  0.0487  1.3408  0.1915
CI95 lower PR AUC actual        0.0973  0.2407  0.4050  0.2670  0.4548  0.6902  0.5489
CI95 upper PR AUC actual        0.3881  0.5292  0.8022  0.6256  0.6064  0.7565  0.6673
CI95 lower PR AUC estimate      0.2223  0.2635  0.2093  0.2563  0.1170  0.1982  0.4848
CI95 upper PR AUC estimate      0.6704  0.6217  0.5347  0.5634  0.4402  0.4298  0.6293
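The confidence intervals reported in Table 3 (and Table 4) follow the bootstrap procedure described in Section 4.4. A minimal sketch of this computation, assuming NumPy arrays y_true (binary targets) and y_score (predicted probabilities); this is our illustration, not the authors' implementation:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
        """95 % CI of the ROC AUC from n_boot resamples drawn with replacement."""
        rng = np.random.default_rng(seed)
        n, aucs = len(y_true), []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)
            if len(np.unique(y_true[idx])) < 2:   # skip single-class resamples
                continue
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
        return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])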
[Figure: six panels of ROC curves (true positive rate vs. false positive rate) and PR curves (precision vs. recall). Legend AUC values: a: SNN 0.903, KNN 0.843, LR 0.890, SVM 0.904, RF 0.915, XGB 0.917, BF 0.677, RE 0.500; b: SNN 0.560, KNN 0.449, LR 0.475, SVM 0.558, RF 0.567, XGB 0.622, BF 0.314, RE 0.082; c: SNN 0.779, KNN 0.720, LR 0.775, SVM 0.776, RF 0.797, XGB 0.813, BF 0.662, RE 0.500; d: SNN 0.469, KNN 0.477, LR 0.482, SVM 0.423, RF 0.563, XGB 0.437, BF 0.394, RE 0.132; e: SNN 0.844, KNN 0.831, LR 0.843, SVM 0.828, RF 0.864, XGB 0.803, BF 0.748, RE 0.500; f: SNN 0.663, KNN 0.569, LR 0.646, SVM 0.639, RF 0.671, XGB 0.707, BF 0.572, RE 0.316.]

Figure 6: Comparison of ML models for different experiments. ROC and PR curves on the test set of a and b, experiment (ii); c and d, experiment (iii); e and f, experiment (v). The curves are plotted for the different models (one random seed) and compared with the best feature (BF) and a random estimator (RE).

[Figure: four panels a-d; x-axes: minimum days until death (0-25); y-axes: mean ROC AUC (a, c) and mean PR AUC (b, d); curves for SNN, KNN, LR, SVM, RF, XGB, BF and RE.]

Figure 7: Performance of the models in dependence of the minimum number of days until death. This figure addresses how early before death the risk of dying can be predicted. Samples for which death occurred within the next days (minimum days until death) are excluded from the test set (but not from the training and validation sets). a, Mean ROC AUC and b, mean PR AUC over five test folds. c, Mean ROC AUC and d, mean PR AUC over five random seeds in the prospective evaluation. For visual clarity, the standard deviations (error bars) are only plotted for the RF. The mortality risk can be estimated early before death, as the discriminating capability of the models remains high with an increasing minimum number of days until death. The mean PR AUC in b and d decreases with an increasing minimum number of days until death, as does the random estimator baseline, due to the decreasing ratio of deceased to survivors.
[Figure: panels a-c each show ROC AUC and PR AUC per month (Jun-Dec), actual vs. estimate; panel d sketches the three options a, b and c for selecting the samples of the performance estimate from the 2019 and 2020 cohorts, together with the training data, the performance estimate and the actual performance.]

Figure 8: Deviation of estimated from actual performance with three options to determine the performance estimate. a, The actual ROC AUC differs significantly from the estimate in December, the PR AUC in November and December. b, Significant difference in October and November for ROC AUC and in November for PR AUC. c, Estimated and actual ROC AUC are significantly different in all months but August, due to heavy domain shifts, and PR AUC in October and November. The mean deviation of estimated and actual ROC AUC and PR AUC is higher in c compared to a and b. d, Three options to determine the performance estimate. In a, the performance estimate is calculated from the preceding month. In b, the samples for the estimate are randomly selected from the 2020 cohort (20 %), and in c they are randomly sampled from the 2019 and the 2020 cohort (20 % of the 2020 cohort and an equal proportion from the 2019 cohort).

[Figure: panels a and b show actual vs. estimated ROC AUC and PR AUC per month (Sep-Dec).]

Figure 9: Deviation of estimated from actual performance for mortality risk with two options to determine the performance estimate. a, The estimate is determined on the respective previous month. Note that the confidence interval at an early stage of the pandemic is high due to a low number of samples. b, The estimate is determined on randomly sampled 20 % of the COVID-19 positives who occurred before the actual performance month. There is a significant difference in October for ROC AUC and PR AUC, which means that the performance estimate is over-optimistic in October.
Table 4: Comparison of estimated and actual performance for mortality risk prediction. These metrics are calculated from the predictions of a RF, trained with the hyperparameters as determined in experiment (v). At significantly different deviations, the confidence intervals (CI) are colored in red. a, The actual performance is calculated on the listed month and the estimate is determined from the respective previous month. b, The estimate is determined by random samples from the positives cohort, occurring before the month on which the actual performance is calculated.

a                               Sep     Oct     Nov     Dec     Sum     Mean
ROC AUC actual                  0.8167  0.7427  0.7894  0.8403          0.7973
ROC AUC estimate                0.7279  0.9067  0.7716  0.8524          0.8146
∆ ROC AUC actual - estimate     0.0887  0.1640  0.0179  0.0121  0.2826  0.0707
CI95 lower ROC AUC actual       0.6462  0.6591  0.7525  0.7679
CI95 upper ROC AUC actual       0.9538  0.8146  0.8282  0.8980
CI95 lower ROC AUC estimate     0.3519  0.7619  0.6843  0.8156
CI95 upper ROC AUC estimate     1.0000  1.0000  0.8485  0.8888
PR AUC actual                   0.3705  0.2645  0.2967  0.5548          0.3716
PR AUC estimate                 0.5233  0.6694  0.4188  0.5172          0.5322
∆ PR AUC actual - estimate      0.1528  0.4049  0.1221  0.0376  0.7174  0.1793
CI95 lower PR AUC actual        0.1329  0.1791  0.2321  0.3846
CI95 upper PR AUC actual        0.8302  0.3642  0.3724  0.7012
CI95 lower PR AUC estimate      0.0556  0.2417  0.2591  0.4242
CI95 upper PR AUC estimate      1.0000  1.0000  0.6097  0.6256

b                               Sep     Oct     Nov     Dec     Sum     Mean
ROC AUC actual                  0.8500  0.7714  0.8341  0.8686          0.8310
ROC AUC estimate                0.9063  0.9647  0.8564  0.8962          0.9059
∆ ROC AUC actual - estimate     0.0563  0.1933  0.0223  0.0276  0.2995  0.0749
CI95 lower ROC AUC actual       0.7071  0.6757  0.7915  0.8079
CI95 upper ROC AUC actual       0.9692  0.8463  0.8722  0.9155
CI95 lower ROC AUC estimate     0.8031  0.9050  0.7514  0.8448
CI95 upper ROC AUC estimate     0.9824  1.0000  0.9477  0.9428
PR AUC actual                   0.5203  0.3519  0.4746  0.5772          0.4810
PR AUC estimate                 0.7729  0.9367  0.7020  0.7164          0.7820
∆ PR AUC actual - estimate      0.2526  0.5848  0.2274  0.1393  1.2041  0.3010
CI95 lower PR AUC actual        0.1694  0.2241  0.3763  0.4277
CI95 upper PR AUC actual        0.8972  0.5302  0.5750  0.7167
CI95 lower PR AUC estimate      0.5043  0.7980  0.4771  0.5783
CI95 upper PR AUC estimate      0.9455  1.0000  0.8756  0.8256

Table 5: Features with discriminating capability. Top-10 features as predictors for COVID-19 diagnosis (experiments (i)-(iii)) and mortality prediction (experiments (iv) and (v)). The sign in brackets indicates whether the target is associated with the positive (+) or the negative (−) sign of the feature value, i.e., patients with high ferritin and low calcium have a higher probability for the class COVID-19 positive. The standard deviation (±) is listed for experiment (iv); for the other experiments the test set is fixed. Abbreviations: absolute eosinophil count (AEC), immature granulocytes (IG), absolute basophil count (ABC), lactate dehydrogenase (LDH), C-reactive protein (CRP), absolute lymphocyte count (ALC), red cell distribution width (RDW).
Experiment (i)                Experiment (ii)               Experiment (iii)
Feature          ROC   PR     Feature          ROC   PR     Feature          ROC   PR
                 AUC   AUC                     AUC   AUC                     AUC   AUC
Calcium (−)      0.67  0.02   Ferritin (+)     0.68  0.31   Ferritin (+)     0.66  0.57
AEC (−)          0.66  0.11   AEC (−)          0.66  0.25   AEC (−)          0.64  0.53
Ferritin (+)     0.66  0.08   Fibrinogen (+)   0.66  0.21   ABC (−)          0.63  0.50
Age (+)          0.66  0.03   Eosinophils (−)  0.64  0.25   Fibrinogen (+)   0.62  0.50
CRP (+)          0.66  0.02   Calcium (−)      0.64  0.12   IG (+)           0.62  0.45
Eosinophils (−)  0.65  0.12   IG (+)           0.64  0.16   Calcium (−)      0.62  0.41
IG (+)           0.65  0.03   LDH (+)          0.63  0.13   Eosinophils (−)  0.62  0.51
ALC (−)          0.65  0.02   Phosphor (−)     0.62  0.17   Leukocytes (−)   0.61  0.40
Fibrinogen (+)   0.65  0.05   pH (+)           0.61  0.14   ALC (−)          0.61  0.43
Phosphor (−)     0.64  0.04   ABC (−)          0.60  0.16   pH (+)           0.61  0.45

Experiment (iv)                                       Experiment (v)
Feature                     ROC AUC    PR AUC         Feature                     ROC AUC  PR AUC
Neutrophils (+)             0.76±0.03  0.42±0.10      Neutrophils (+)             0.75     0.39
Lymphocytes (−)             0.76±0.04  0.41±0.11      Lymphocytes (−)             0.74     0.35
Blood Urea Nitrogen (+)     0.75±0.04  0.38±0.12      CRP (+)                     0.71     0.36
RDW (+)                     0.73±0.04  0.35±0.10      Oxyhemoglobin Fraction (+)  0.71     0.42
CRP (+)                     0.72±0.04  0.39±0.09      Monocytes (−)               0.70     0.35
Oxyhemoglobin Fraction (+)  0.70±0.05  0.46±0.10      Blood Urea Nitrogen (+)     0.70     0.33
Cholinesterase (−)          0.70±0.03  0.37±0.11      Neutrophils abs. (+)        0.69     0.26
ALC (−)                     0.69±0.05  0.33±0.08      RDW (+)                     0.68     0.21
Monocytes (−)               0.68±0.04  0.32±0.07      Procalcitonin (+)           0.68     0.36
Hemoglobin (−)              0.68±0.02  0.30±0.08      Cholinesterase (−)          0.68     0.30
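The single-feature AUCs in Table 5 use the feature value itself, with the tabulated sign, as the ranking score. A minimal sketch of this evaluation, assuming NumPy arrays in which NaNs mark unmeasured (imputed) values; average precision is used here as a stand-in for the PR AUC, and this is our illustration rather than the authors' code:

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def single_feature_performance(x, y):
        """ROC/PR AUC of one blood-test feature used directly as predictor.
        Only measured values enter the evaluation. If low values indicate
        the positive class, the feature is scored with a negative sign,
        matching the (−) annotation in Table 5."""
        mask = ~np.isnan(x)                  # keep measured values only
        x, y = x[mask], y[mask]
        sign = 1 if roc_auc_score(y, x) >= 0.5 else -1
        return (roc_auc_score(y, sign * x),
                average_precision_score(y, sign * x),
                "+" if sign == 1 else "-")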