Search | arXiv e-print repository

missForestPredict -- Missing data imputation for prediction settings

Authors: Elena Albu, Shan Gao, Laure Wynants, Ben Van Calster

Abstract: Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.01986 [pdf]

A comparison of regression models for static and dynamic prediction of a prognostic outcome during admission in electronic health care records

Authors: Shan Gao, Elena Albu, Hein Putter, Pieter Stijnen, Frank Rademakers, Veerle Cossey, Yves Debaveye, Christel Janssens, Ben Van Calster, Laure Wynants

Abstract: Objective: Hospitals register information in the electronic health records (EHR) continuously until discharge or death. As such, there is no censoring for in-hospital outcomes. We aimed to compare different dynamic regression modeling approaches to predict central line-associated bloodstream infections (CLABSI) in EHR while accounting for competing events precluding CLABSI. Materials and Methods: We analyzed data from 30,862 catheter episodes at University Hospitals Leuven from 2012 and 2013 to predict 7-day risk of CLABSI. Competing events are discharge and death. Static models at catheter onset included logistic, multinomial logistic, Cox, cause-specific hazard, and Fine-Gray regression. Dynamic models updated predictions daily up to 30 days after catheter onset (i.e. landmarks 0 to 30 days), and included landmark supermodel extensions of the static models, separate Fine-Gray models per landmark time, and regularized multi-task learning (RMTL). Model performance was assessed using 100 random 2:1 train-test splits. Results: The Cox model performed worst of all static models in terms of area under the receiver operating characteristic curve (AUC) and calibration. Dynamic landmark supermodels reached peak AUCs between 0.741-0.747 at landmark 5. The Cox landmark supermodel had the worst AUCs (<=0.731) and calibration up to landmark 7. Separate Fine-Gray models per landmark performed worst for later landmarks, when the number of patients at risk was low. Discussion and Conclusion: Categorical and time-to-event approaches had similar performance in the static and dynamic settings, except Cox models. Ignoring competing risks caused problems for risk prediction in the time-to-event framework (Cox), but not in the categorical framework (logistic regression).
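
The landmark setup described in this abstract (daily landmarks 0 to 30, a 7-day prediction horizon, discharge and death as competing events) can be illustrated with a small data-preparation sketch. This is only an assumed layout, not the authors' code: the `Episode` class, its field names, and the `landmark_rows` helper are hypothetical, and the rule that an episode is at risk at landmark s when its event falls strictly after day s is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One catheter episode (hypothetical fields, not taken from the paper)."""
    episode_id: int
    event_day: float  # days since catheter onset at which the episode ended
    event_type: str   # "CLABSI", "discharge", or "death"


def landmark_rows(episodes, landmarks=range(0, 31), horizon=7):
    """Stack one row per (episode, landmark) for episodes still at risk.

    The outcome is the event type if the event occurs within `horizon` days
    after the landmark, otherwise "no event" (assumed operationalization).
    """
    rows = []
    for s in landmarks:
        for ep in episodes:
            if ep.event_day <= s:
                continue  # episode already ended, so it is not at risk at s
            within_horizon = ep.event_day <= s + horizon
            outcome = ep.event_type if within_horizon else "no event"
            rows.append({
                "episode_id": ep.episode_id,
                "landmark": s,
                "outcome": outcome,                  # multinomial label
                "clabsi": int(outcome == "CLABSI"),  # binary label
            })
    return rows
```

A stacked table like this is what a landmark supermodel would be fitted on, with the landmark day as an additional covariate; the binary column ignores which competing event occurred, while the multinomial column keeps it.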

Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: 3388 words; 3 figures; 4 tables

arXiv:2404.16127 [pdf, other]

Comparison of static and dynamic random forests models for EHR data in the presence of competing risks: predicting central line-associated bloodstream infection

Authors: Elena Albu, Shan Gao, Pieter Stijnen, Frank Rademakers, Christel Janssens, Veerle Cossey, Yves Debaveye, Laure Wynants, Ben Van Calster

Abstract: Prognostic outcomes related to hospital admissions typically do not suffer from censoring, and can be modeled either categorically or as time-to-event. Competing events are common but often ignored. We compared the performance of random forest (RF) models to predict the risk of central line-associated bloodstream infections (CLABSI) using different outcome operationalizations. We included data from 27478 admissions to the University Hospitals Leuven, covering 30862 catheter episodes (970 CLABSI, 1466 deaths and 28426 discharges) to build static and dynamic RF models for binary (CLABSI vs no CLABSI), multinomial (CLABSI, discharge, death or no event), survival (time to CLABSI) and competing risks (time to CLABSI, discharge or death) outcomes to predict the 7-day CLABSI risk. We evaluated model performance across 100 train/test splits. Performance of binary, multinomial and competing risks models was similar: AUROC was 0.74 for baseline predictions, rose to 0.78 for predictions at day 5 in the catheter episode, and decreased thereafter. Survival models overestimated the risk of CLABSI (E:O ratios between 1.2 and 1.6), and had AUROCs about 0.01 lower than other models. Binary and multinomial models had lowest computation times. Models including multiple outcome events (multinomial and competing risks) display a different internal structure compared to binary and survival models. In the absence of censoring, complex modelling choices do not considerably improve the predictive performance compared to a binary model for CLABSI prediction in our studied settings. Survival models censoring the competing events at their time of occurrence should be avoided.
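
The two performance measures quoted in this abstract, the E:O ratio (calibration) and the AUROC (discrimination), can be sketched for a binary outcome as follows. This is a minimal illustration using the standard definitions, assuming predicted risks and 0/1 outcomes; the function names are hypothetical and this is not the authors' evaluation code.

```python
def eo_ratio(predicted_risks, outcomes):
    """Expected-to-observed ratio: sum of predicted risks over observed events.

    Values above 1 indicate that risk is overestimated on average.
    """
    observed = sum(outcomes)
    if observed == 0:
        raise ValueError("E:O ratio is undefined without observed events")
    return sum(predicted_risks) / observed


def auroc(predicted_risks, outcomes):
    """Probability that a random event is ranked above a random non-event.

    Ties count as one half. Pairwise O(n^2) version, fine for small samples.
    """
    events = [p for p, y in zip(predicted_risks, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted_risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("AUROC needs at least one event and one non-event")
    score = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in events
        for q in non_events
    )
    return score / (len(events) * len(non_events))
```

Under this definition, an E:O ratio between 1.2 and 1.6 means the summed predicted risk was 20% to 60% higher than the number of CLABSI events actually observed.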

Submitted 24 May, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

arXiv:2401.01849 [pdf, other]

The expected value of sample information calculations for external validation of risk prediction models

Authors: Mohsen Sadatsafavi, Andrew J Vickers, Tae Yoon Lee, Paul Gustafson, Laure Wynants

Abstract: In designing external validation studies of clinical prediction models, contemporary sample size calculation methods are based on the frequentist inferential paradigm. One of the widely reported metrics of model performance is net benefit (NB), and the relevance of conventional inference around NB as a measure of clinical utility is doubtful. Value of Information methodology quantifies the consequences of uncertainty in terms of its impact on clinical utility of decisions. We introduce the expected value of sample information (EVSI) for validation as the expected gain in NB from conducting an external validation study of a given size. We propose algorithms for EVSI computation, and in a case study demonstrate how EVSI changes as a function of the amount of current information and the future study's sample size. Value of Information methodology provides a decision-theoretic lens to the process of planning a validation study of a risk prediction model and can complement conventional methods when designing such studies.

Submitted 6 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: 14 pages, 4 figures, 0 tables

arXiv:2208.03343 [pdf]

doi 10.1177/0272989X231178317

Value of Information Analysis for External Validation of Risk Prediction Models

Authors: Mohsen Sadatsafavi, Tae Yoon Lee, Laure Wynants, Andrew Vickers, Paul Gustafson

Abstract: Background: Before being used to inform patient care, a risk prediction model needs to be validated in a representative sample from the target population. The finite size of the validation sample entails that there is uncertainty with respect to estimates of model performance. We apply value-of-information methodology as a framework to quantify the consequence of such uncertainty in terms of NB. Methods: We define the Expected Value of Perfect Information (EVPI) for model validation as the expected loss in NB due to not confidently knowing which of the alternative decisions confers the highest NB at a given risk threshold. We propose methods for EVPI calculations based on Bayesian or ordinary bootstrapping of NBs, as well as an asymptotic approach supported by the central limit theorem. We conducted brief simulation studies to compare the performance of these methods, and used subsets of data from an international clinical trial for predicting mortality after myocardial infarction as a case study. Results: The three computation methods generated similar EVPI values in simulation studies. In the case study, at the pre-specified threshold of 0.02, the best decision with current information would be to use the model, with an expected incremental NB of 0.0020 over treating all. At this threshold, EVPI was 0.0005 (a relative EVPI of 25%). When scaled to the annual number of heart attacks in the US, this corresponds to a loss of 400 true positives, or an extra 19,600 false positives (unnecessary treatments) per year, indicating the value of further model validation. As expected, the validation EVPI generally declined with larger samples. Conclusion: Value-of-information methods can be applied to the NB calculated during external validation of clinical prediction models to provide a decision-theoretic perspective to the consequences of uncertainty.
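
The ordinary-bootstrap route to EVPI mentioned in this abstract can be sketched roughly as below. This is an illustrative reading, not the authors' implementation: it assumes the alternative decisions are "use the model", "treat all" and "treat none", uses the standard net benefit formula, and estimates EVPI as the mean over bootstrap samples of the best NB minus the best of the mean NBs. The function names and defaults are hypothetical.

```python
import random


def net_benefits(risks, outcomes, threshold):
    """NB of (use model, treat all, treat none) at a given risk threshold."""
    n = len(outcomes)
    weight = threshold / (1 - threshold)  # harm of one false positive in TP units
    tp = sum(1 for p, y in zip(risks, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(risks, outcomes) if p >= threshold and y == 0)
    events = sum(outcomes)
    nb_model = tp / n - weight * fp / n
    nb_all = events / n - weight * (n - events) / n
    return nb_model, nb_all, 0.0  # treating none has NB 0 by definition


def bootstrap_evpi(risks, outcomes, threshold, n_boot=1000, seed=0):
    """EVPI estimate: mean of per-sample best NB minus best of the mean NBs."""
    rng = random.Random(seed)
    n = len(outcomes)
    nb_sums = [0.0, 0.0, 0.0]
    sum_of_best = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        nbs = net_benefits(
            [risks[i] for i in idx], [outcomes[i] for i in idx], threshold
        )
        for k, nb in enumerate(nbs):
            nb_sums[k] += nb
        sum_of_best += max(nbs)
    mean_nbs = [s / n_boot for s in nb_sums]
    return sum_of_best / n_boot - max(mean_nbs)
```

The relative EVPI reported in the abstract is consistent with dividing EVPI by the expected incremental NB under current information: 0.0005 / 0.0020 = 0.25, i.e. 25%.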

Submitted 5 August, 2022; originally announced August 2022.

Comments: 24 pages, 4,484 words, 1 table, 2 boxes, 5 figures

Showing 1–5 of 5 results for author: Wynants, L