Abstract
We introduce Random Forest (RF), a machine learning method, in an accessible way for health services researchers and highlight the considerations that are unique to applying it to health administrative data. We use physician claims data from the universal public insurer in the Canadian province of Quebec, linked with the Canadian Community Health Survey. We describe in detail how RF can be useful in health services research, provide guidance on data setup and modeling decisions, and demonstrate how to interpret results. We also highlight specific considerations for applying RF to health administrative data. In a working example, we compare RF with logistic regression, Ridge regression and LASSO in their ability to predict whether a person has a regular medical doctor. We use survey responses to “Do you have a regular medical doctor?” from three cycles of the Canadian Community Health Survey (2007, 2009, 2011), linked with physician claims data from 2002 to 2012, and limit our cohort to persons 40 years and older at the time of responding to the survey. We discuss the strengths and weaknesses of using RF in a health services research setting compared with more conventional modeling techniques. Applying a RF model in a health services research setting can have advantages over conventional modeling approaches, and we encourage health services researchers to add RF to their toolbox of predictive modeling methods.
Code availability
All the packages used to do this analysis are available in R or Stata and are cited in the paper.
Data availability
This project uses linked microdata from the Canadian Community Health Survey and health administrative data (physicians’ claims data). The sensitive nature of these data precludes us from sharing them.
References
Bou-Hamad, I., Larocque, D., Ben-Ameur, H.: A review of survival trees. Stat. Surv. 5, 44–71 (2011). https://doi.org/10.1214/09-SS047
Boulesteix, A.-L., et al. : Making complex prediction rules applicable for readers: current practice in random forest literature and recommendations. Biom. J. 61(5), 1314–1328 (2019). https://doi.org/10.1002/bimj.201700243
Boulesteix, A., Schmid, M.: Machine learning versus statistical modeling. Biom. J. 56(4), 588–593 (2014). https://doi.org/10.1002/bimj.201300226
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001a). https://doi.org/10.1023/A:1010933404324
Breiman, L.: Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat. Sci. 16(3), 199–231 (2001b). https://doi.org/10.1214/SS/1009213726
Breslau, N., Reeb, K.G.: Continuity of care in a university-based practice. J. Med. Educ. 50(10), 965–969 (1975). https://doi.org/10.1097/00001888-197510000-00006
Bylander, T.: Estimating generalization error on two-class datasets using out-of-bag estimates. Mach. Learn. 48(1–3), 287–297 (2002). https://doi.org/10.1023/A:1013964023376
Bzdok, D., Altman, N., Krzywinski, M.: Points of significance: statistics versus machine learning. Nat. Methods 15(4), 233–234 (2018)
Chen, C., & Liaw, A. (2004). Using random forest to learn imbalanced data. Technical report, University of California, Berkeley
Clair, M. (2000). Emerging Solutions. Report and Recommendations of the Commission d’étude sur les services de santé et les services sociaux
Couronné, R., Probst, P., Boulesteix, A.-L.L.: Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinform. 19(1), 1–14 (2018). https://doi.org/10.1186/s12859-018-2264-5
Degenhardt, F., Seifert, S., Szymczak, S.: Evaluation of variable selection methods for random forests and omics data sets. Brief. Bioinform. 20(2), 492–503 (2019). https://doi.org/10.1093/bib/bbx124
DeVoe, J.E., Fryer, G.E., Phillips, R., Green, L.: Receipt of preventive care among adults: insurance status and usual source of care. Am. J. Publ. Health 93(5), 786–791 (2003)
Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7(1), 3 (2006). https://doi.org/10.1186/1471-2105-7-3
Dietrich, S., Floegel, A., Troll, M., Kühn, T., Rathmann, W., Peters, A., Sookthai, D., Von Bergen, M., Kaaks, R., Adamski, J., Prehn, C., Boeing, H., Schulze, M.B., Illig, T., Pischon, T., Knüppel, S., Wang-Sattler, R., Drogan, D.: Random survival forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int. J. Epidemiol. 45(5), 1406–1420 (2016). https://doi.org/10.1093/ije/dyw145
Domingos, P.: A few useful things to know about machine learning. Commun. ACM (2012). https://doi.org/10.1145/2347736.2347755
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
Glowicz.: Variable importance-weighted Random Forests. Quant. Biol. 176(5), 139–148 (2017)
Goldstein, B.A., Polley, E.C., Briggs, F.B.S.: Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol. (2011). https://doi.org/10.2202/1544-6115.1691
Gordon, L., Olshen, R.A.: Tree-Structured survival analysis. Cancer Trea. Rep. 69(10), 1065–1068 (1985)
Granitto, P.M., et al.: Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemom. Intell. Lab. Syst. 82(2), 83–90 (2006)
Greenland, S.: Principles of multilevel modelling. Int. J. Epidemiol. 29(1), 158–167 (2000). https://doi.org/10.1093/ije/29.1.158
Gregorutti, B., Michel, B., Saint-Pierre, P.: Correlation and variable importance in random forests. Stat. Comput. 27(3), 659–678 (2017). https://doi.org/10.1007/s11222-016-9646-1
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical learning with sparsity: the Lasso and generalizations. CRC Press (2015)
Hay, C., Pacey, M., Bains, N., Ardal, S.: Understanding the unattached population in Ontario: evidence from the primary care access survey (PCAS). Healthcare Policy 6(2), 33–47 (2010)
Heaton, J. (2016). An empirical analysis of feature engineering for predictive modeling. Conference proceedings - IEEE Southeastcon, 2016-July. https://doi.org/10.1109/SECON.2016.7506650
Hebiri, M., Lederer, J.: How correlations influence lasso prediction. IEEE Trans. Inf. Theory 59(3), 1846–1854 (2013). https://doi.org/10.1109/TIT.2012.2227680
Heinze, G., Wallisch, C., Dunkler, D.: Variable selection – a review and recommendations for the practicing statistician. Biom. J. 60(3), 431–449 (2018). https://doi.org/10.1002/bimj.201700067
Wang, H., Li, G.: A selective review on random survival forests for high dimensional data. Quant. Bio-Sci. 36(2), 85–96 (2017)
Huang, B.F.F., Boutros, P.C.: The parameter sensitivity of random forests. BMC Bioinform. (2016). https://doi.org/10.1186/s12859-016-1228-x
Ishwaran, H, & Kogalur, U. (2020). Fast unified random forests for survival, regression, and classification (RF-SRC) (R package version 2.9.3). https://cran.r-project.org/web/packages/randomForestSRC/citation.html
Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. Ann. Appl. Stat. 2(3), 841–860 (2008). https://doi.org/10.1214/08-AOAS169
Ishwaran, H., Lu, M.: Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Stat. Med. 38(4), 558–582 (2019). https://doi.org/10.1002/sim.7803
Janitza, S., Hornung, R.: On the overestimation of random forest’s out-of-bag error. PLoS ONE 13(8), e0201904 (2018). https://doi.org/10.1371/journal.pone.0201904
Kirasich, K., Smith, T., & Sadler, B. (2018). Random forest vs logistic regression: binary classification for heterogeneous datasets. SMU Data Science Review 1(3). Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/9
Kursa, M.B., Rudnicki, W.R.: Feature selection with the Boruta package. J. Stat. Softw. 36(11), 1–13 (2010)
Lambrew, J.M., DeFriese, G.H., Carey, T.S., Ricketts, T.C., Biddle, A.K.: The effects of having a regular doctor on access to primary care. Med. Care 34(2), 138–151 (1996)
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. https://cran.r-project.org/web/packages/randomForest/citation.html
Lin, Y., Jeon, Y.: Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 101(474), 578–590 (2006). https://doi.org/10.1198/016214505000001230
Luchman, J. N. (2015). DOMIN: Stata module to conduct dominance analysis. Statistical Software Components. https://ideas.repec.org/c/boc/bocode/s457629.html
McIsaac, W.J., Fuller-Thomson, E., Talbot, Y.: Does having regular care by a family physician improve preventive care? Can. Family Phys. Med. De Famille Can. 47, 70–76 (2001)
Mihaylova, B., Briggs, A., O’Hagan, A., Thompson, S.G.: Review of statistical methods for analysing healthcare resources and costs. Health Econ. 20(8), 897–916 (2011)
Nathans, L.L., Oswald, F.L., Nimon, K.: Interpreting multiple linear regression: a guidebook of variable importance - practical assessment, research & evaluation. Prac. Assess. Res. Eval. 17(9), 1–19 (2012)
O’Brien, R., Ishwaran, H.: A random forests quantile classifier for class imbalanced data. Pattern Recogn. 90, 232–249 (2019). https://doi.org/10.1016/j.patcog.2019.01.036
Probst, P., Boulesteix, A.-L.: To tune or not to tune the number of trees in random forest. J. Mach. Learn. Res. 18, 1–18 (2018)
Probst, P., Wright, M. N., & Boulesteix, A. (2019). Hyperparameters and tuning strategies for random forest. Wiley interdisciplinary reviews: data mining and knowledge discovery, 9(3). https://doi.org/10.1002/widm.1301
Provost, S., Perez, J., Pineault, R., Borges Da Silva, R., Tousignant, P.: An algorithm using administrative data to identify patient attachment to a family physician. Int. J. Family Med. (2015). https://doi.org/10.1155/2015/967230
Rokach, L., & Maimon, O. (2010). Data mining and knowledge discovery handbook. https://doi.org/10.1007/978-0-387-09823-4_9
Scornet, E.: Tuning parameters in random forests. ESAIM Proc. Surv. 60, 144–162 (2018)
Segal, M. R. (2004). Machine learning benchmarks and random forest regression
Seifert, S., Gundlach, S., Szymczak, S.: Surrogate minimal depth as an importance measure for variables in random forests. Bioinformatics. 35(19), 3663–3671 (2019). https://doi.org/10.1093/bioinformatics/btz149
Shmueli, G.: To explain or to predict? Stat. Sci. 25(3), 289–310 (2010). https://doi.org/10.1214/10-STS330
Smyth, D., Deverall, E., Balm, M., Nesdale, A., Rosemergy, I.: Out-of-bag estimation. N. Z. Med. J. 128(1425), 97–100 (2015). https://doi.org/10.1007/s13398-014-0173-7.2
Speiser, J.L., Miller, M.E., Tooze, J., Ip, E.: A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst. Appl. 134, 93–101 (2019). https://doi.org/10.1016/j.eswa.2019.05.028
Starfield, B., Shi, L., Macinko, J.: Contribution of primary care to health systems and health. Milbank Q. 83(3), 457–502 (2005). https://doi.org/10.1111/j.1468-0009.2005.00409.x
Statistics Canada. (2016). Surveys and statistical programs - Canadian Community Health Survey - annual component (CCHS). https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3226
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9(1), 307 (2008). https://doi.org/10.1186/1471-2105-9-307
Sturmberg, J.P., Schattner, P.: Personal doctoring. Its impact on continuity of care as measured by the comprehensiveness of care score. Aust. Fam. Physician 30(5), 513–518 (2001)
Svetnik, V. et al. (2004) Application of Breiman’s Random Forest to modeling structure-activity relationships of pharmaceutical molecules. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3077, pp. 334–343. doi: 10.1007/978-3-540-25966-4_33.
Thomas, D.R., Zumbo, B.D., Kwan, E., Schweitzer, L.: On Johnson’s (2000) relative weights method for assessing variable importance: a reanalysis. Multivar. Behav. Res. 49(4), 329–338 (2014). https://doi.org/10.1080/00273171.2014.905766
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996). http://www.math.yorku.ca/~hkj/Teaching/6621Winter2013/Coverage/lasso.pdf
Tolosi, L., Lengauer, T.: Data and text mining classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27(14), 1986–1994 (2011). https://doi.org/10.1093/bioinformatics/btr300
Tousignant, P., Diop, M., Fournier, M., Roy, Y., Haggerty, J., Hogg, W., Beaulieu, M.-D.: Validation of 2 new measures of continuity of care based on year-to-year follow-up with known providers of health care. Ann. Fam. Med. 12(6), 559–567 (2014). https://doi.org/10.1370/afm.1692
Wager, S., Hastie, T., & Efron, B. (2014). Confidence intervals for random forests: the Jackknife and the infinitesimal Jackknife. In Journal of Machine Learning Research (Vol. 15)
Xu, K.T.: Usual source of care in preventive service use: a regular doctor versus a regular site. Health Serv. Res. 37(6), 1509–1529 (2002). https://doi.org/10.1111/1475-6773.10524
Acknowledgements
The authors would like to thank Dr. Aman Verma for his valuable input on the methods as well as Dr. Ruth Lavergne and Dr. Kim McGrail for their manuscript suggestions. Linkage of the data was carried out by members of the TorSaDE project. The members of the TorSaDE Cohort Working Group are as follows: Alain Vanasse (leader), Gillian Bartlett, Lucie Blais, David Buckeridge, Manon Choinière, Catherine Hudon, Anaïs Lacasse, Benoit Lamarche, Alexandre Lebel, Amélie Quesnel-Vallée, Pasquale Roberge, Valérie Émond, Marie-Pascale Pomey, Mike Benigeri, Anne-Marie Cloutier, Marc Dorais, Josiane Courteau, Mireille Courteau, Stéphanie Plante, Pierre Cambon, Annie Giguère, Isabelle Leroux, Danielle St-Laurent, Denis Roy, Jaime Borja, André Néron, Geneviève Landry, Jean-François Ethier, Roxanne Dault, Marc-Antoine Côté-Marcil, Pier Tremblay, Sonia Quirion.
Funding
This study was funded by the Canadian Institutes for Health Research (CIHR) Strategy for Patient Oriented Research (SPOR) Network in Primary and Integrated Health Care Innovations (PIHCI) (CIHR HCI 150578), the Michael Smith Foundation for Health Research (MSFHR 17268), McGill University, Réseau-1 Québec, Québec Ministère de la Santé et des Services Sociaux and Université de Sherbrooke: Centre Recherche – Hôpital Charles Le Moyne. In-kind support was provided by the Institut de recherche en santé publique de l’Université de Montréal (IRSPUM). This work was carried out using data from the © Quebec Government, 2019. The Quebec Government is not responsible for this work or the interpretation of the results produced.
Author information
Authors and Affiliations
Contributions
Caroline King conceived and carried out the analyses. Erin Strumpf provided guidance throughout the project on both the concepts and analyses. Caroline King and Erin Strumpf wrote the paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethical approval
Ethics approval was obtained through McGill University’s Faculty of Medicine Institutional Review Board (IRB).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Variable importance measures
1.1 Mean decrease impurity
MDI is a generic formula to which different impurity functions can be applied. An impurity function simply measures how much more homogeneous the data are after splitting on a feature compared to before the split; the Gini index and entropy are the impurity functions most commonly used in RF. MDI is calculated by summing the weighted impurity decreases for all nodes where the feature was used, averaged over all the trees in the forest. In other words, the importance of X for predicting Y is assessed by looking at X in every tree it was included in, summing how much it decreased the impurity of the data after each split, and then averaging that over all the trees. By summing the impurity decreases within the trees that include the feature and then averaging over all of the trees, the measure is effectively weighted by the number of trees the feature was included in. MDI is fast to calculate, but it is known to be biased towards selecting variables with many possible cut points (i.e., continuous variables or categorical variables with many levels) (Xu 2002).
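As an illustration, a minimal sketch of extracting MDI values with the randomForest package (Liaw and Wiener 2002); the data frame dat and its variables are hypothetical placeholders, not the study data:

```r
library(randomForest)

set.seed(2021)
# Toy analytic file standing in for the linked claims/survey data
dat <- data.frame(
  has_md    = factor(rbinom(1000, 1, 0.85)),  # stand-in for "has a regular medical doctor"
  age       = round(runif(1000, 40, 90)),
  er_visits = rpois(1000, 1),
  fidelity  = runif(1000)
)

rf <- randomForest(has_md ~ ., data = dat, ntree = 500, importance = TRUE)

# Mean decrease impurity (Gini importance): type = 2 in randomForest::importance()
importance(rf, type = 2)
```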
1.2 Mean decrease accuracy
MDA assesses how much the prediction accuracy of the RF changes when the values of a feature are randomly shuffled (permuted) in the OOB samples while all other features are held constant. This process keeps the variable in the model but effectively removes its association with the outcome. If randomly permuting the values of the feature produces a large change in prediction accuracy, the feature is important; if the prediction accuracy changes very little, the feature is less important (Boulesteix et al. 2019; Wager et al. 2014). To be clear, this does not measure the effect on prediction of removing the variable from the model, because if the model were refitted without the variable, other variables could be used as surrogates.
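The same fitted forest also provides the permutation-based MDA values; continuing the sketch above:

```r
# Mean decrease accuracy (permutation importance): type = 1
importance(rf, type = 1)

# Plot both importance measures side by side
varImpPlot(rf)
```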
Appendix 2: Variable selection
2.1 Recursive feature elimination (RFE)
RFE aims to find the minimal set of variables that leads to a good prediction. The user begins with a RF built using all the features; based on their importance rankings, a set percentage of the lowest-ranked features is eliminated, and only the remaining, informative features are carried into the next round of analysis. This is repeated until a single feature remains. The prediction performance is evaluated at each step and the model with the smallest error (or close to the lowest error) is selected. This works relatively well (Díaz-Uriarte and Alvarez de Andrés 2006; Svetnik et al. 2004; Granitto et al. 2006) and is widely used.
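A sketch of RF-based RFE using the rfe() wrapper in the caret package (caret is not used in the paper; the toy data frame dat is a placeholder for the analytic file):

```r
library(caret)
library(randomForest)

set.seed(2021)
# Toy analytic file standing in for the linked claims/survey data
dat <- data.frame(
  has_md    = factor(rbinom(1000, 1, 0.85)),
  age       = round(runif(1000, 40, 90)),
  er_visits = rpois(1000, 1),
  fidelity  = runif(1000)
)
X <- dat[, setdiff(names(dat), "has_md")]
y <- dat$has_md

# RF-based recursive feature elimination with 5-fold cross-validation
ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = X, y = y, sizes = c(1, 2), rfeControl = ctrl)

rfe_fit               # performance for each candidate subset size
predictors(rfe_fit)   # features retained by RFE
```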
2.2 Boruta
The main idea of this approach is to compare the importance of the real features with that of random ‘shadow’ features using statistical testing and several runs of RF. In each run, the number of features is doubled by adding a copy of each feature, referred to as its shadow feature. The values of the shadow features are generated by shuffling (permuting) the original values, effectively removing each feature’s relationship with the outcome. A RF is trained on the original and shadow features and the VIMP values are collected. For each real feature, a statistical test compares its importance with the maximum value found among all the shadow variables. Variables with significantly smaller importance values than the shadow variables are considered unimportant and are removed. Because the shadow features are essentially noise in the model, the intuition is that any feature with a VIMP lower than that of the shadow features cannot be very important to the model. These steps are repeated until all variables are classified as important or unimportant, or until a pre-specified number of runs has been performed (Degenhardt, Seifert, and Szymczak 2019; Kursa and Rudnicki 2010).
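A sketch using the Boruta package (Kursa and Rudnicki 2010), applied to the same toy data:

```r
library(Boruta)

set.seed(2021)
# dat is the toy data frame defined in the RFE sketch above
bor <- Boruta(has_md ~ ., data = dat, doTrace = 0)

print(bor)
# Features confirmed as important; any attributes still marked Tentative
# can be resolved first with TentativeRoughFix(bor)
getSelectedAttributes(bor, withTentative = FALSE)
```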
2.3 Minimal depth
Minimal depth assumes that the most important variables are those that most frequently split nodes nearest to the root of the trees, where they partition large samples of the population (Ishwaran et al. 2010). Within a tree, node levels are numbered based on their relative distance to the root of the tree (i.e., the first split is equal to 1). Minimal depth measures the importance of a feature by averaging the depth of the first split on that feature over all trees within the forest. Lower values of this measure indicate that a variable is important in splitting large groups of observations. Trees need to be adequately deep to obtain reliable results, so this method is not appropriate for datasets with many predictors but few observations. Furthermore, the correlation structure of the variables is not considered in this approach (Seifert, Gundlach, and Szymczak 2019).
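A sketch of minimal depth selection using var.select() in the randomForestSRC package (Ishwaran and Kogalur 2020), again on the toy data; the selection threshold is the package default:

```r
library(randomForestSRC)

# dat is the toy data frame defined in the RFE sketch above
rf_src <- rfsrc(has_md ~ ., data = dat, ntree = 1000)

md <- var.select(object = rf_src, method = "md")
md$topvars      # variables whose minimal depth passes the selection threshold
md$varselect    # per-variable minimal depth (lower values = more important)
```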
2.4 Variable Importance Weighted RF
Variable Importance Weighted RF aims to keep all features in the model while encouraging important features to be selected more often. Unlike RFE, no features are eliminated; instead, the importance scores from a first RF are used as weights for each feature in a second-stage RF model. The weight determines the probability that a feature will be considered at each node (normally all features have an equal probability of being selected), so features with a high importance ranking are more likely to be chosen at each split. With this ‘weighted sampling strategy’, the final model emphasizes the more informative features without completely disregarding the contributions of the others (as is the case with RFE) (Glowicz 2017).
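A minimal sketch of the two-stage idea using the split.select.weights argument of the ranger package (ranger is not cited in the paper, and this is an illustrative assumption rather than the implementation in Glowicz 2017); first-stage permutation importances are rescaled into selection probabilities:

```r
library(ranger)

# X and y are the toy predictors and outcome defined in the RFE sketch above

# Stage 1: ordinary RF to obtain permutation importance scores
rf1 <- ranger(x = X, y = y, num.trees = 500, importance = "permutation")

# Rescale the importances to (0, 1] so they can serve as per-variable
# selection probabilities, keeping the same order as the columns of X
w <- rf1$variable.importance[colnames(X)]
w <- pmax(w, 0) + 1e-06
w <- w / max(w)

# Stage 2: weighted RF in which informative features are sampled more often at each split
rf2 <- ranger(x = X, y = y, num.trees = 500, split.select.weights = w)
rf2$prediction.error   # OOB prediction error of the weighted forest
```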
2.5 Considerations when choosing selection methods
RFE is the most commonly used variable selection method (Degenhardt, Seifert, and Szymczak 2019), but it is inferior to many of the other options available (Speiser et al. 2019). In general, Boruta performs consistently across different types of data, has a low computation time, fairly low error rates, and moderate to good parsimony. Its computational efficiency makes it preferable for data with a high number of features (Speiser et al. 2019). Minimal depth is available through the randomForestSRC package and performs similarly to, but not quite as well as, Boruta. The advantages of using minimal depth include the intuitive nature of the method and its ability to run successfully on different types of data (not all selection methods will work on all datasets).
Recursive feature elimination and minimal depth are more appropriate for finding a minimal set of predictors because they eliminate variables that are redundant. Boruta is better suited for selecting all relevant variables because it keeps every feature whose association with the outcome is significantly stronger than that of the randomly generated shadow variables (Kursa and Rudnicki 2010). As with regression models, substantive knowledge can be used to keep or remove features as the researcher sees fit.
Appendix 3: Feature descriptions
 | Feature | Description/formula |
---|---|---|
Demographics | Age | Continuous indicator: age of person at time of survey |
Sex | Binary indicator of sex | |
SES | Ordinal indicator (5 levels): area code-based socioeconomic measure of neighbourhood income per person equivalent, adjusted for household size, which is released by Statistics Canada (QIAPPE) | |
Rurality | Categorical indicator (7 levels): area code-based measure of rurality developed by Statistics Canada (CSIZEMIZ) | |
Healthcare Utilization | Ambulatory visits | Continuous indicator: Count of all outpatient visits not including ER visits |
Outpatient Visits | Continuous indicator: Count of all outpatient visits | |
ER visits | Continuous indicator: Count of all ER visits | |
Weekend Visits | Continuous indicator: Count of visits that occurred on a weekend | |
Usual Provider Continuity | Binary indicator: 1 if the fraction of visits to the most frequently visited provider is > 0.75; otherwise 0 | |
Fidelity | Continuous indicator (0–1): The fraction of visits to the most frequently visited provider | |
Usual Provider Continuity limited to GP visits | See Usual Provider Continuity | |
Fidelity limited to GP visits | See fidelity | |
Wolinsky | Binary indicator: 1 if there was at least 1 visit to the same provider every 8 months over the previous 2-year period | |
Modified Continuity Index | Continuous indicator: Equals 1 minus the number of providers divided by the number of visits. This index is adjusted for utilization by ascribing a higher value to those who have more frequent visits to the same provider | |
Personal Provider Continuity | Binary indicator: A dichotomous version of the Modified Continuity index | |
Modified, Modified Continuity Index | Continuous indicator: Modified Continuity Index divided by 1 minus the inverse number of visits | |
Ejlertsson’s Index K | Continuous indicator: (total # visits - # providers)/(total # visits - 1) | |
Number of providers seen | Continuous indicator: Total number of providers seen in year | |
Number of specialists seen | Continuous indicator: Total number of specialists seen in year | |
Known provider of care (Multiple providers) | Continuous indicator: Number of visits with physicians seen in the previous year /Total # of visits | |
Health Status | Charlson Index | Comorbidity index |
Diabetes ICD-9: 250 excl: 650–659 (pregnancy) ICD-10: E10-E14 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Cancer ICD-9: 140–172; 174–208 ICD-10: C00-C43; C45-C97 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
COPD ICD-9: 491–492; 496 ICD-10: J41-J44 | Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0 | |
Asthma ICD-9: 493 ICD-10: J45-J46 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years AND an oral steroid prescription; otherwise 0 | |
Chronic Inflammation ICD-9: 555–556; 558; 714; 695.4; 696.0 ICD-10: K50-K52; M05-M06; L93; L40.50 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Central Nervous System disease (CNS) ICD-9: 333.4; 138; 332.0; 333.4; 340 ICD-10: G10; G14; G20; G35 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Occupational Lung disease ICD-9: 117.3; 495; 500–508; 511.0 ICD-10: J60-J70; J92.0 | Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 2 years; otherwise 0 | |
Coronary Artery Disease (CAD) ICD-9: 410–414 ICD-10: I20-I25 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Heart Failure (HF) ICD-9: 428 ICD-10: I50 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 1 rolling year; otherwise 0 | |
Atrial Fibrillation (AF) ICD-9: 427; 785.0 ICD-10: I48 | Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0 | |
Mental Health ICD-9: 295- 302; 306–319 ICD-10: F20-F54; F56-F99 | Binary indicator: 1 if 1 hospitalization or 2 ambulatory visits within 1 rolling year; otherwise 0 | |
HIV ICD-9: 042 ICD-10: B20-B24 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Renal Failure ICD-9: 582–587; 589 ICD-10: N01-N07; N17-N19; N26-N27 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Thrombolytic Event ICD-9: 415; 434.01; 434.11; 434.91 ICD-10: I26; I63 | Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0 | |
Substance Abuse ICD-9: 291–292; 303–305; 980 ICD-10: F10-F16; F18-F19; T51 | Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0 | |
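To make the claims-based continuity measures above concrete, a brief sketch computing a few of them from a hypothetical visit-level claims table; the column names (person_id, provider_id) and the toy data are illustrative assumptions, not the study's actual variables:

```r
library(dplyr)

# Hypothetical claims extract: one row per ambulatory visit
visits <- data.frame(
  person_id   = c(1, 1, 1, 1, 2, 2, 2),
  provider_id = c("A", "A", "A", "B", "C", "D", "E")
)

continuity <- visits %>%
  group_by(person_id) %>%
  summarise(
    n_visits    = n(),
    n_providers = n_distinct(provider_id),
    fidelity    = max(table(provider_id)) / n_visits,        # share of visits to the usual provider
    upc         = as.integer(fidelity > 0.75),               # Usual Provider Continuity indicator
    mci         = 1 - n_providers / n_visits,                # Modified Continuity Index as described above
    ejlertsson  = (n_visits - n_providers) / (n_visits - 1)  # Ejlertsson's Index K
  )
continuity
```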
Appendix 4: Model performance
Model (N = 37,508) | | Number of features | AUC |
---|---|---|---|
RF | Full Model (Imbalanced) | 104 | 0.869 |
 | Full Model (Balanced) | 104 | 0.869 |
 | Minimal depth (Imbalanced) | 69 | 0.869 |
 | Minimal depth (Balanced) | 69 | 0.869 |
 | 5 years (Balanced) | 38 | 0.860 |
 | 3 years (Balanced) | 38 | 0.861 |
 | 2 years (Balanced) | 38 | 0.853 |
 | 1 year (Balanced) | 38 | 0.837 |
LR | Full Model | 91* | 0.855 |
 | LASSO | 55 | 0.860 |
 | Ridge | 104 | 0.863 |
 | Backward Selection (pr 0.10) | 36 | 0.855 |
 | 5 years | 35 | 0.847 |
 | 3 years | 34 | 0.852 |
 | 2 years | 31 | 0.849 |
 | 1 year | 29 | 0.832 |
Appendix 5: Random forest code in R
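A minimal illustrative sketch of fitting a classification RF with the randomForestSRC package cited in the paper (Ishwaran and Kogalur 2020); the data frame, variable names, and tuning values are hypothetical placeholders rather than the authors' actual code:

```r
library(randomForestSRC)

set.seed(2021)
# Hypothetical analytic file: binary outcome plus claims-derived predictors
dat <- data.frame(
  has_md    = factor(rbinom(1000, 1, 0.85)),
  age       = round(runif(1000, 40, 90)),
  er_visits = rpois(1000, 1),
  fidelity  = runif(1000)
)

# Grow the forest; importance = "permute" requests permutation (MDA) importance
rf_src <- rfsrc(has_md ~ ., data = dat,
                ntree = 1000, nodesize = 15,
                importance = "permute")

print(rf_src)              # forest summary and OOB error rate
rf_src$importance          # permutation importance, overall and by class

# OOB predicted class probabilities
head(rf_src$predicted.oob)
```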
About this article
Cite this article
King, C., Strumpf, E. Applying random forest in a health administrative data context: a conceptual guide. Health Serv Outcomes Res Method 22, 96–117 (2022). https://doi.org/10.1007/s10742-021-00255-7