Applying random forest in a health administrative data context: a conceptual guide


Abstract

This paper introduces Random Forest (RF), a machine learning method, in a way that is accessible to health services researchers, and highlights considerations unique to its application to health administrative data. We use physician claims data from Quebec's universal public insurer, linked with the Canadian Community Health Survey (CCHS). We describe in detail how RF can be useful in health services research, provide guidance on data setup and modeling decisions, and demonstrate how to interpret results. We also highlight specific considerations for applying RF to health administrative data. In a working example, we compare RF with logistic regression, Ridge regression, and LASSO in their ability to predict whether a person has a regular medical doctor, using responses to "Do you have a regular medical doctor?" from three cycles of the CCHS (2007, 2009, 2011) linked with physician claims data from 2002 to 2012. We limit our cohort to persons 40 years and older at the time of responding to the survey. We discuss the strengths and weaknesses of using RF in a health services research setting relative to more conventional modeling techniques. Applying an RF model in this setting can have advantages over conventional approaches, and we encourage health services researchers to add RF to their toolbox of predictive modeling methods.


Code availability

All the packages used to do this analysis are available in R or Stata and are cited in the paper.

Data availability

This project uses linked microdata from the Canadian Community Health Survey and health administrative data (physicians' claims data). The sensitive nature of these data precludes us from sharing them.


Acknowledgements

The authors would like to thank Dr. Aman Verma for his valuable input on the methods as well as Dr. Ruth Lavergne and Dr. Kim McGrail for their manuscript suggestions. Linkage of the data was carried out by members of the TorSaDE project. The members of the TorSaDE Cohort Working Group are as follows: Alain Vanasse (leader), Gillian Bartlett, Lucie Blais, David Buckeridge, Manon Choinière, Catherine Hudon, Anaïs Lacasse, Benoit Lamarche, Alexandre Lebel, Amélie Quesnel-Vallée, Pasquale Roberge, Valérie Émond, Marie-Pascale Pomey, Mike Benigeri, Anne-Marie Cloutier, Marc Dorais, Josiane Courteau, Mireille Courteau, Stéphanie Plante, Pierre Cambon, Annie Giguère, Isabelle Leroux, Danielle St-Laurent, Denis Roy, Jaime Borja, André Néron, Geneviève Landry, Jean-François Ethier, Roxanne Dault, Marc-Antoine Côté-Marcil, Pier Tremblay, Sonia Quirion.

Funding

This study was funded by the Canadian Institutes of Health Research (CIHR) Strategy for Patient-Oriented Research (SPOR) Network in Primary and Integrated Health Care Innovations (PIHCI) (CIHR HCI 150578), the Michael Smith Foundation for Health Research (MSFHR 17268), McGill University, Réseau-1 Québec, the Québec Ministère de la Santé et des Services Sociaux, and Université de Sherbrooke: Centre Recherche – Hôpital Charles Le Moyne. In-kind support was provided by the Institut de recherche en santé publique de l'Université de Montréal (IRSPUM). This work was carried out using data © Quebec Government, 2019. The Quebec Government is not responsible for this work or the interpretation of the results produced.

Author information


Contributions

Caroline King conceived and carried out the analyses. Erin Strumpf provided guidance throughout the project on both the concepts and analyses. Caroline King and Erin Strumpf wrote the paper.

Corresponding author

Correspondence to Caroline King.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

Ethics approval was obtained through McGill University’s Faculty of Medicine Institutional Review Board (IRB).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Variable importance measures

1.1 Mean decrease impurity

MDI is a generic formula to which different impurity functions can be applied. Impurity functions simply measure how much more homogeneous the data are after splitting on a feature compared to before the split. The Gini index and entropy are the most common impurity functions used in RF. MDI is calculated by summing the weighted impurity decreases for all nodes where the feature was used to split, averaged over all the trees in the forest. In other words, the importance of X for predicting Y is assessed by looking at X in every tree it was included in, summing how much it decreased the impurity of the data at each split, and then averaging over all the trees. Because the sum is taken over only the trees that used the feature while the average is taken over all trees, a feature's impurity decrease is effectively weighted by the number of trees it was included in. MDI is fast to calculate, but it is known to be biased towards selecting variables with many possible cut points (i.e., continuous variables or categorical variables with many levels) (Xu 2002).
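For concreteness, here is a minimal sketch (not the paper's code) of extracting MDI values in R with the randomForest package; the data frame `dat` and factor outcome `has_md` are hypothetical stand-ins for the analysis data:

```r
# A sketch, not the paper's code: fit a classification forest and extract the
# mean decrease in Gini impurity (MDI). `dat` is a hypothetical data frame whose
# factor column `has_md` indicates having a regular medical doctor.
library(randomForest)

set.seed(42)
rf <- randomForest(has_md ~ ., data = dat, ntree = 500, importance = TRUE)

# type = 2 returns the Gini-based mean decrease in impurity for each feature
mdi <- importance(rf, type = 2)
mdi[order(mdi[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```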

1.2 Mean decrease accuracy

MDA assesses how much the prediction accuracy of the RF changes when the values of a feature are randomly shuffled (permuted) in the OOB samples, keeping all other features constant. This process keeps the variable in the model but effectively removes its correlation with the outcome. If randomly changing the values of the feature results in a large drop in prediction accuracy, the feature is important; if the prediction accuracy changes very little, the feature is less important (Boulesteix et al. 2019; Wager et al. 2014). To be clear, this does not measure the effect on prediction if the variable were removed from the model, because if the model were refitted without the variable, other variables could serve as surrogates for it.
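A corresponding sketch for MDA, again with the hypothetical `dat` and `has_md`: the randomForest package computes permutation importance on the OOB samples when `importance = TRUE` is set at fit time:

```r
# A sketch using the same hypothetical data: permutation importance (MDA),
# computed on the OOB samples.
library(randomForest)

set.seed(42)
rf <- randomForest(has_md ~ ., data = dat, ntree = 500, importance = TRUE)

# type = 1 returns the mean decrease in OOB accuracy after permuting each
# feature; scale = FALSE reports raw (unstandardized) decreases
mda <- importance(rf, type = 1, scale = FALSE)
mda[order(mda[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
```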

Appendix 2: Variable selection

2.1 Recursive feature elimination (RFE)

RFE aims to find the minimal set of variables that yields good predictive performance. The user begins with an RF built using all the features; based on their importance, a certain percentage of the lowest-ranked features is eliminated and only informative features are kept for the next round of analysis. This is repeated until a single feature remains in the input. Prediction performance is evaluated at each step, and the model with the lowest error (or close to the lowest error) is selected. This works relatively well (Díaz-Uriarte and Alvarez de Andrés 2006; Svetnik et al. 2004; Granitto et al. 2006) and is widely used.
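As an illustration (not the authors' implementation), RF-based RFE can be run in R through the caret package; `x`, `y`, and the candidate subset sizes below are hypothetical:

```r
# A sketch of RF-based RFE via the caret package; `x` (a data frame of
# features), `y` (a factor outcome), and the candidate sizes are hypothetical.
library(caret)

ctrl <- rfeControl(functions = rfFuncs,  # RF fit/rank/eliminate helpers
                   method = "cv",        # evaluate each subset by cross-validation
                   number = 5)

set.seed(42)
rfe_fit <- rfe(x, y, sizes = c(5, 10, 20, 40), rfeControl = ctrl)

rfe_fit$optVariables  # features in the best-performing subset
```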

2.2 Boruta

The main idea of this approach is to compare the importance of the real features with that of random 'shadow' features, using statistical testing and several runs of RF. In each run, the number of features is doubled by adding a copy of each feature, referred to as its shadow feature. The values of the shadow features are generated by shuffling (permuting) the original values, effectively removing each feature's relationship with the outcome. An RF is trained on the original and shadow features, and the VIMP values are collected. For each real feature, a statistical test is performed comparing its importance with the maximum importance found among all the shadow features. Features with significantly smaller importance values than the shadow features are considered unimportant and are removed. The shadow features are essentially noise in the model, and the intuition is that any feature with a VIMP lower than that of the shadow features cannot be very important to the model. These steps are repeated until all variables are classified as important or unimportant, or until a pre-specified number of runs has been performed (Degenhardt, Seifert, and Szymczak 2019; Kursa and Rudnicki 2010).
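A minimal sketch using the Boruta R package, with the same hypothetical `dat` and `has_md`:

```r
# A sketch using the Boruta package; `dat` and `has_md` are hypothetical names.
library(Boruta)

set.seed(42)
bor <- Boruta(has_md ~ ., data = dat, maxRuns = 100)

print(bor)  # how many features were confirmed, rejected, or left tentative
getSelectedAttributes(bor, withTentative = FALSE)  # confirmed-important features
```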

2.3 Minimal depth

Minimal depth assumes that the most important variables are those that most frequently split nodes nearest to the root of the trees, where they partition the largest samples of the population (Ishwaran et al. 2010). Within a tree, node levels are numbered by their distance from the root (i.e., the first split is level 1). Minimal depth measures the importance of a feature by averaging the depth of its first split over all trees in the forest. Lower values of this measure indicate that a variable is important in splitting large groups of observations. Trees need to be adequately deep to obtain reliable results, so this method is not appropriate for datasets with many predictors but few observations. Furthermore, the correlation structure of the variables is not considered in this approach (Seifert, Gundlach, and Szymczak 2019).
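A sketch of minimal-depth selection using the randomForestSRC package mentioned in Appendix 2.5; variable names are hypothetical:

```r
# A sketch of minimal-depth selection via randomForestSRC; names hypothetical.
library(randomForestSRC)

set.seed(42)
rf_src <- rfsrc(has_md ~ ., data = dat, ntree = 500)

# method = "md" ranks features by average minimal depth; features below the
# data-driven depth threshold are returned as important
md_sel <- var.select(object = rf_src, method = "md")
md_sel$topvars
```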

2.4 Variable Importance Weighted RF

Variable Importance Weighted RF keeps all features in the model but encourages important features to be selected more often. Unlike in RFE, no features are discarded; instead, the importance scores from a first-stage RF are used as sampling weights for each feature in a second-stage RF model. The weight determines the probability that a feature will be considered for splitting at each node (normally all features have an equal probability of being selected). Features with a high importance ranking are therefore more likely to be selected at each node in the tree. With this 'weighted sampling strategy', the final model emphasizes the more informative features without completely disregarding the contributions of the others, as RFE does (Glowicz 2017).
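One way to sketch this two-stage scheme in R is with the ranger package, whose split.select.weights argument sets per-feature selection probabilities; this illustrates the idea under stated assumptions and is not the specific implementation cited above:

```r
# A two-stage sketch of importance-weighted splitting using ranger's
# split.select.weights argument; `dat`/`has_md` are hypothetical.
library(ranger)

set.seed(42)
rf1 <- ranger(has_md ~ ., data = dat, num.trees = 500,
              importance = "permutation")

# Turn first-stage importances into selection probabilities; the small floor
# keeps every feature selectable, so nothing is dropped outright
w_raw <- pmax(rf1$variable.importance, 0) + 1e-6
w <- w_raw / sum(w_raw)

# Weights must line up with the predictors in the order ranger sees them;
# reusing the same formula and data keeps that ordering consistent
rf2 <- ranger(has_md ~ ., data = dat, num.trees = 500,
              split.select.weights = as.numeric(w))
```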

2.5 Considerations when choosing selection methods

RFE is the most commonly used variable selection method (Degenhardt, Seifert, and Szymczak 2019), but it is inferior to many of the other options available (Speiser et al. 2019). In general, Boruta performs consistently across different types of data, has a low computation time, fairly low error rates, and moderate to good parsimony. Its computational efficiency makes it preferable for data with a large number of features (Speiser et al. 2019). Minimal depth is available through the randomForestSRC package and performs similarly to, but not quite as well as, Boruta. Its advantages include the intuitive nature of the method and its ability to run successfully on different types of data (not all selection methods will work on all datasets).

Recursive feature elimination and minimal depth are more appropriate for finding a minimal set of predictors because they eliminate redundant variables. Boruta is better suited for selecting all relevant variables because it keeps every feature whose importance is significantly higher than that of the randomly generated shadow features (Kursa and Rudnicki 2010). As with regression models, substantive knowledge can be used to keep or remove features as the researcher sees fit.

Appendix 3: Feature descriptions

Feature: description/formula

Demographics

  • Age: Continuous indicator: age of person at time of survey
  • Sex: Binary indicator of sex
  • SES: Ordinal indicator (5 levels): area code-based socioeconomic measure of neighbourhood income per person equivalent, adjusted for household size, released by Statistics Canada (QIAPPE)
  • Rurality: Categorical indicator (7 levels): area code-based measure of rurality developed by Statistics Canada (CSIZEMIZ)

Healthcare Utilization

  • Ambulatory visits: Continuous indicator: count of all outpatient visits, not including ER visits
  • Outpatient visits: Continuous indicator: count of all outpatient visits
  • ER visits: Continuous indicator: count of all ER visits
  • Weekend visits: Continuous indicator: count of visits that occurred on a weekend
  • Usual Provider Continuity: Binary indicator: 1 if the fraction of visits to the most frequently visited provider is > 0.75; otherwise 0
  • Fidelity: Continuous indicator (0–1): the fraction of visits to the most frequently visited provider
  • Usual Provider Continuity limited to GP visits: see Usual Provider Continuity
  • Fidelity limited to GP visits: see Fidelity
  • Wolinsky: Binary indicator: 1 if there was at least 1 visit to the same provider every 8 months over the previous 2-year period
  • Modified Continuity Index: Continuous indicator: 1 minus the number of providers divided by the number of visits; this index is adjusted for utilization, ascribing a higher value to those with more frequent visits to the same provider
  • Personal Provider Continuity: Binary indicator: a dichotomous version of the Modified Continuity Index
  • Modified, Modified Continuity Index: Continuous indicator: Modified Continuity Index divided by (1 minus the inverse of the number of visits)
  • Ejlertsson's Index K: Continuous indicator: (total # visits - # providers) / (total # visits - 1)
  • Number of providers seen: Continuous indicator: total number of providers seen in the year
  • Number of specialists seen: Continuous indicator: total number of specialists seen in the year
  • Known provider of care (multiple providers): Continuous indicator: number of visits with physicians seen in the previous year / total # of visits

Health Status

  • Charlson Index: comorbidity index
  • Diabetes (ICD-9: 250, excluding 650–659 (pregnancy); ICD-10: E10–E14): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Cancer (ICD-9: 140–172, 174–208; ICD-10: C00–C43, C45–C97): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • COPD (ICD-9: 491–492, 496; ICD-10: J41–J44): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
  • Asthma (ICD-9: 493; ICD-10: J45–J46): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years AND an oral steroid prescription; otherwise 0
  • Chronic Inflammation (ICD-9: 555–556, 558, 714, 695.4, 696.0; ICD-10: K50–K52, M05–M06, L93, L40.50): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Central Nervous System disease (CNS) (ICD-9: 138, 332.0, 333.4, 340; ICD-10: G10, G14, G20, G35): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Occupational Lung disease (ICD-9: 117.3, 495, 500–508, 511.0; ICD-10: J60–J70, J92.0): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 2 years; otherwise 0
  • Coronary Artery Disease (CAD) (ICD-9: 410–414; ICD-10: I20–I25): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Heart Failure (HF) (ICD-9: 428; ICD-10: I50): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 1 rolling year; otherwise 0
  • Atrial Fibrillation (AF) (ICD-9: 427, 785.0; ICD-10: I48): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
  • Mental Health (ICD-9: 295–302, 306–319; ICD-10: F20–F54, F56–F99): Binary indicator: 1 if 1 hospitalization or 2 ambulatory visits within 1 rolling year; otherwise 0
  • HIV (ICD-9: 042; ICD-10: B20–B24): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Renal Failure (ICD-9: 582–587, 589; ICD-10: N01–N07, N17–N19, N26–N27): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Thrombolytic Event (ICD-9: 415, 434.01, 434.11, 434.91; ICD-10: I26, I63): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Substance Abuse (ICD-9: 291–292, 303–305, 980; ICD-10: F10–F16, F18–F19, T51): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
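To make the continuity definitions above concrete, here is a small R sketch (a hypothetical helper, not from the paper) computing several of the indices from per-person visit summaries, following the formulas in the table:

```r
# A hypothetical helper computing continuity indices from per-person summaries.
# v = total visits, p = distinct providers seen, vmax = visits to the most
# frequently seen provider; assumes v > 1 and vmax <= v.
continuity_features <- function(v, p, vmax) {
  fidelity <- vmax / v                    # fraction of visits to usual provider
  upc      <- as.integer(fidelity > 0.75) # Usual Provider Continuity
  mci      <- 1 - p / v                   # Modified Continuity Index
  mmci     <- mci / (1 - 1 / v)           # Modified, Modified Continuity Index
  k        <- (v - p) / (v - 1)           # Ejlertsson's Index K
  data.frame(fidelity, upc, mci, mmci, k)
}

continuity_features(v = 10, p = 3, vmax = 8)
```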

Appendix 4: Model performance

Model (N = 37,508)                    Number of features    AUC

RF
  Full Model (Imbalanced)                     104          0.869
  Full Model (Balanced)                       104          0.869
  Minimal depth (Imbalanced)                   69          0.869
  Minimal depth (Balanced)                     69          0.869
  5 years (Balanced)                           38          0.860
  3 years (Balanced)                           38          0.861
  2 years (Balanced)                           38          0.853
  1 year (Balanced)                            38          0.837

LR
  Full Model                                   91*         0.855
  LASSO                                        55          0.860
  Ridge                                       104          0.863
  Backward Selection (pr 0.10)                 36          0.855
  5 years                                      35          0.847
  3 years                                      34          0.852
  2 years                                      31          0.849
  1 year                                       29          0.832

*17 features omitted for collinearity or perfectly predicting the outcome.

Appendix 5: Random forest code in R

(Figure a: the article's annotated R code for the random forest analysis; not reproduced in this extraction.)
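Since the code figure is not reproduced here, the following is a minimal stand-in sketch, not the authors' code: it grows a classification forest on hypothetical analysis data `dat` (factor outcome `has_md`) and evaluates discrimination with the AUC, as in Appendix 4:

```r
# A stand-in sketch, not the code from figure a: fit RF on a train split and
# compute the test-set AUC with pROC. `dat` and `has_md` are hypothetical.
library(randomForest)
library(pROC)

set.seed(42)
n     <- nrow(dat)
train <- sample(n, size = floor(0.7 * n))  # 70/30 train/test split

rf <- randomForest(has_md ~ ., data = dat[train, ],
                   ntree = 500, importance = TRUE)

# Predicted probability of the second factor level ("has a regular doctor")
p_hat <- predict(rf, newdata = dat[-train, ], type = "prob")[, 2]

auc(roc(response = dat$has_md[-train], predictor = p_hat))
```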


About this article


Cite this article

King, C., Strumpf, E. Applying random forest in a health administrative data context: a conceptual guide. Health Serv Outcomes Res Method 22, 96–117 (2022). https://doi.org/10.1007/s10742-021-00255-7
