Applying random forest in a health administrative data context: a conceptual guide


Abstract

This paper introduces Random Forest (RF), a machine learning method, in a way that is accessible to health services researchers, and highlights considerations unique to its application to health administrative data. We use physician claims data from Quebec's universal public insurer, linked with the Canadian Community Health Survey (CCHS). We describe in detail how RF can be useful in health services research, provide guidance on data setup and modeling decisions, and demonstrate how to interpret results. We also highlight specific considerations for applying RF to health administrative data. In a working example, we compare RF with logistic regression, Ridge regression, and LASSO in their ability to predict whether a person has a regular medical doctor, using responses to "Do you have a regular medical doctor?" from three cycles of the CCHS (2007, 2009, 2011) linked with physician claims data from 2002 to 2012. We limit our cohort to persons 40 years and older at the time of responding to the survey. We discuss the strengths and weaknesses of using RF in a health services research setting relative to more conventional modeling techniques. Applying an RF model in this setting can have advantages over conventional approaches, and we encourage health services researchers to add RF to their toolbox of predictive modeling methods.


Code availability

All the packages used to do this analysis are available in R or Stata and are cited in the paper.

Data availability

This project uses linked microdata from the Canadian Community Health Survey and health administrative data (physicians' claims data). The sensitive nature of these data precludes us from sharing them.


Acknowledgements

The authors would like to thank Dr. Aman Verma for his valuable input on the methods as well as Dr. Ruth Lavergne and Dr. Kim McGrail for their manuscript suggestions. Linkage of the data was carried out by members of the TorSaDE project. The members of the TorSaDE Cohort Working Group are as follows: Alain Vanasse (leader), Gillian Bartlett, Lucie Blais, David Buckeridge, Manon Choinière, Catherine Hudon, Anaïs Lacasse, Benoit Lamarche, Alexandre Lebel, Amélie Quesnel-Vallée, Pasquale Roberge, Valérie Émond, Marie-Pascale Pomey, Mike Benigeri, Anne-Marie Cloutier, Marc Dorais, Josiane Courteau, Mireille Courteau, Stéphanie Plante, Pierre Cambon, Annie Giguère, Isabelle Leroux, Danielle St-Laurent, Denis Roy, Jaime Borja, André Néron, Geneviève Landry, Jean-François Ethier, Roxanne Dault, Marc-Antoine Côté-Marcil, Pier Tremblay, Sonia Quirion.

Funding

This study was funded by the Canadian Institutes of Health Research (CIHR) Strategy for Patient-Oriented Research (SPOR) Network in Primary and Integrated Health Care Innovations (PIHCI) (CIHR HCI 150578), the Michael Smith Foundation for Health Research (MSFHR 17268), McGill University, Réseau-1 Québec, the Québec Ministère de la Santé et des Services Sociaux, and Université de Sherbrooke: Centre Recherche – Hôpital Charles Le Moyne. In-kind support was provided by the Institut de recherche en santé publique de l'Université de Montréal (IRSPUM). This work was carried out using data © Quebec Government, 2019. The Quebec Government is not responsible for this work or the interpretation of the results produced.

Author information


Contributions

Caroline King conceived and carried out the analyses. Erin Strumpf provided guidance throughout the project on both the concepts and analyses. Caroline King and Erin Strumpf wrote the paper.

Corresponding author

Correspondence to Caroline King.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

Ethics approval was obtained through McGill University’s Faculty of Medicine Institutional Review Board (IRB).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Variable importance measures

1.1 Mean decrease impurity

MDI is a generic formula to which different impurity functions can be applied. Impurity functions simply measure how much more homogeneous the data are after splitting on a feature compared to before the split. The Gini index and entropy are the most common impurity functions used in RF. MDI is calculated by summing the weighted impurity decreases for all nodes where the feature was used to split, averaged over all the trees in the forest. In other words, the importance of X for predicting Y is assessed by looking at X in every tree it was included in, summing how much it decreased the impurity of the data at each split, and then averaging over all the trees. Because the sum is taken over only the trees that used the feature while the average is taken over all trees, a feature's impurity decrease is effectively weighted by the number of trees it was included in. MDI is fast to calculate, but it is known to be biased towards selecting variables with many possible cut points (i.e., continuous variables or categorical variables with many levels) (Xu 2002).
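For concreteness, here is a minimal sketch (not the paper's code) of extracting MDI values in R with the randomForest package; the data frame `dat` and factor outcome `has_md` are hypothetical stand-ins for the analysis data:

```r
# A sketch, not the paper's code: fit a classification forest and extract the
# mean decrease in Gini impurity (MDI). `dat` is a hypothetical data frame whose
# factor column `has_md` indicates having a regular medical doctor.
library(randomForest)

set.seed(42)
rf <- randomForest(has_md ~ ., data = dat, ntree = 500, importance = TRUE)

# type = 2 returns the Gini-based mean decrease in impurity for each feature
mdi <- importance(rf, type = 2)
mdi[order(mdi[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```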

1.2 Mean decrease accuracy

MDA assesses how much the prediction accuracy of the RF changes when the values of a feature are randomly shuffled (permuted) in the OOB samples, keeping all other features constant. This process keeps the variable in the model but effectively removes its correlation with the outcome. If randomly changing the values of the feature results in a large drop in prediction accuracy, the feature is important; if the prediction accuracy changes very little, the feature is less important (Boulesteix et al. 2019; Wager et al. 2014). To be clear, this does not measure the effect on prediction if the variable were removed from the model, because if the model were refitted without the variable, other variables could serve as surrogates for it.
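A corresponding sketch for MDA, again with the hypothetical `dat` and `has_md`: the randomForest package computes permutation importance on the OOB samples when `importance = TRUE` is set at fit time:

```r
# A sketch using the same hypothetical data: permutation importance (MDA),
# computed on the OOB samples.
library(randomForest)

set.seed(42)
rf <- randomForest(has_md ~ ., data = dat, ntree = 500, importance = TRUE)

# type = 1 returns the mean decrease in OOB accuracy after permuting each
# feature; scale = FALSE reports raw (unstandardized) decreases
mda <- importance(rf, type = 1, scale = FALSE)
mda[order(mda[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
```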

Appendix 2: Variable selection

2.1 Recursive feature elimination (RFE)

RFE aims to find the minimal set of variables that yields good predictive performance. The user begins with an RF built using all the features; based on their importance, a certain percentage of the lowest-ranked features is eliminated and only informative features are kept for the next round of analysis. This is repeated until a single feature remains in the input. Prediction performance is evaluated at each step, and the model with the lowest error (or close to the lowest error) is selected. This works relatively well (Díaz-Uriarte and Alvarez de Andrés 2006; Svetnik et al. 2004; Granitto et al. 2006) and is widely used.
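As an illustration (not the authors' implementation), RF-based RFE can be run in R through the caret package; `x`, `y`, and the candidate subset sizes below are hypothetical:

```r
# A sketch of RF-based RFE via the caret package; `x` (a data frame of
# features), `y` (a factor outcome), and the candidate sizes are hypothetical.
library(caret)

ctrl <- rfeControl(functions = rfFuncs,  # RF fit/rank/eliminate helpers
                   method = "cv",        # evaluate each subset by cross-validation
                   number = 5)

set.seed(42)
rfe_fit <- rfe(x, y, sizes = c(5, 10, 20, 40), rfeControl = ctrl)

rfe_fit$optVariables  # features in the best-performing subset
```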

2.2 Boruta

The main idea of this approach is to compare the importance of the real features with that of random 'shadow' features, using statistical testing and several runs of RF. In each run, the number of features is doubled by adding a copy of each feature, referred to as its shadow feature. The values of the shadow features are generated by shuffling (permuting) the original values, effectively removing each feature's relationship with the outcome. An RF is trained on the original and shadow features, and the VIMP values are collected. For each real feature, a statistical test is performed comparing its importance with the maximum importance found among all the shadow features. Features with significantly smaller importance values than the shadow features are considered unimportant and are removed. The shadow features are essentially noise in the model, and the intuition is that any feature with a VIMP lower than that of the shadow features cannot be very important to the model. These steps are repeated until all variables are classified as important or unimportant, or until a pre-specified number of runs has been performed (Degenhardt, Seifert, and Szymczak 2019; Kursa and Rudnicki 2010).
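A minimal sketch using the Boruta R package, with the same hypothetical `dat` and `has_md`:

```r
# A sketch using the Boruta package; `dat` and `has_md` are hypothetical names.
library(Boruta)

set.seed(42)
bor <- Boruta(has_md ~ ., data = dat, maxRuns = 100)

print(bor)  # how many features were confirmed, rejected, or left tentative
getSelectedAttributes(bor, withTentative = FALSE)  # confirmed-important features
```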

2.3 Minimal depth

Minimal depth assumes that the most important variables are those that most frequently split nodes nearest to the root of the trees, where they partition the largest samples of the population (Ishwaran et al. 2010). Within a tree, node levels are numbered by their distance from the root (i.e., the first split is level 1). Minimal depth measures the importance of a feature by averaging the depth of its first split over all trees in the forest. Lower values of this measure indicate that a variable is important in splitting large groups of observations. Trees need to be adequately deep to obtain reliable results, so this method is not appropriate for datasets with many predictors but few observations. Furthermore, the correlation structure of the variables is not considered in this approach (Seifert, Gundlach, and Szymczak 2019).
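A sketch of minimal-depth selection using the randomForestSRC package mentioned in Appendix 2.5; variable names are hypothetical:

```r
# A sketch of minimal-depth selection via randomForestSRC; names hypothetical.
library(randomForestSRC)

set.seed(42)
rf_src <- rfsrc(has_md ~ ., data = dat, ntree = 500)

# method = "md" ranks features by average minimal depth; features below the
# data-driven depth threshold are returned as important
md_sel <- var.select(object = rf_src, method = "md")
md_sel$topvars
```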

2.4 Variable Importance Weighted RF

Variable Importance Weighted RF keeps all features in the model but encourages important features to be selected more often. Unlike in RFE, no features are discarded; instead, the importance scores from a first-stage RF are used as sampling weights for each feature in a second-stage RF model. The weight determines the probability that a feature will be considered for splitting at each node (normally all features have an equal probability of being selected). Features with a high importance ranking are therefore more likely to be selected at each node in the tree. With this 'weighted sampling strategy', the final model emphasizes the more informative features without completely disregarding the contributions of the others, as RFE does (Glowicz 2017).
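One way to sketch this two-stage scheme in R is with the ranger package, whose split.select.weights argument sets per-feature selection probabilities; this illustrates the idea under stated assumptions and is not the specific implementation cited above:

```r
# A two-stage sketch of importance-weighted splitting using ranger's
# split.select.weights argument; `dat`/`has_md` are hypothetical.
library(ranger)

set.seed(42)
rf1 <- ranger(has_md ~ ., data = dat, num.trees = 500,
              importance = "permutation")

# Turn first-stage importances into selection probabilities; the small floor
# keeps every feature selectable, so nothing is dropped outright
w_raw <- pmax(rf1$variable.importance, 0) + 1e-6
w <- w_raw / sum(w_raw)

# Weights must line up with the predictors in the order ranger sees them;
# reusing the same formula and data keeps that ordering consistent
rf2 <- ranger(has_md ~ ., data = dat, num.trees = 500,
              split.select.weights = as.numeric(w))
```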

2.5 Considerations when choosing selection methods

RFE is the most commonly used variable selection method (Degenhardt, Seifert, and Szymczak 2019), but it is inferior to many of the other options available (Speiser et al. 2019). In general, Boruta performs consistently across different types of data, has a low computation time, fairly low error rates, and moderate to good parsimony. Its computational efficiency makes it preferable for data with a large number of features (Speiser et al. 2019). Minimal depth is available through the randomForestSRC package and performs similarly to, but not quite as well as, Boruta. Its advantages include the intuitive nature of the method and its ability to run successfully on different types of data (not all selection methods will work on all datasets).

Recursive feature elimination and minimal depth are more appropriate for finding a minimal set of predictors because they eliminate redundant variables. Boruta is better suited for selecting all relevant variables because it keeps every feature whose importance is significantly higher than that of the randomly generated shadow features (Kursa and Rudnicki 2010). As with regression models, substantive knowledge can be used to keep or remove features as the researcher sees fit.

Appendix 3: Feature descriptions

Feature: description/formula

Demographics

  • Age: Continuous indicator: age of person at time of survey
  • Sex: Binary indicator of sex
  • SES: Ordinal indicator (5 levels): area code-based socioeconomic measure of neighbourhood income per person equivalent, adjusted for household size, released by Statistics Canada (QIAPPE)
  • Rurality: Categorical indicator (7 levels): area code-based measure of rurality developed by Statistics Canada (CSIZEMIZ)

Healthcare Utilization

  • Ambulatory visits: Continuous indicator: count of all outpatient visits, not including ER visits
  • Outpatient visits: Continuous indicator: count of all outpatient visits
  • ER visits: Continuous indicator: count of all ER visits
  • Weekend visits: Continuous indicator: count of visits that occurred on a weekend
  • Usual Provider Continuity: Binary indicator: 1 if the fraction of visits to the most frequently visited provider is > 0.75; otherwise 0
  • Fidelity: Continuous indicator (0–1): the fraction of visits to the most frequently visited provider
  • Usual Provider Continuity limited to GP visits: see Usual Provider Continuity
  • Fidelity limited to GP visits: see Fidelity
  • Wolinsky: Binary indicator: 1 if there was at least 1 visit to the same provider every 8 months over the previous 2-year period
  • Modified Continuity Index: Continuous indicator: 1 minus the number of providers divided by the number of visits; this index is adjusted for utilization, ascribing a higher value to those with more frequent visits to the same provider
  • Personal Provider Continuity: Binary indicator: a dichotomous version of the Modified Continuity Index
  • Modified, Modified Continuity Index: Continuous indicator: Modified Continuity Index divided by (1 minus the inverse of the number of visits)
  • Ejlertsson's Index K: Continuous indicator: (total # visits - # providers) / (total # visits - 1)
  • Number of providers seen: Continuous indicator: total number of providers seen in the year
  • Number of specialists seen: Continuous indicator: total number of specialists seen in the year
  • Known provider of care (multiple providers): Continuous indicator: number of visits with physicians seen in the previous year / total # of visits

Health Status

  • Charlson Index: comorbidity index
  • Diabetes (ICD-9: 250, excluding 650–659 (pregnancy); ICD-10: E10–E14): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Cancer (ICD-9: 140–172, 174–208; ICD-10: C00–C43, C45–C97): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • COPD (ICD-9: 491–492, 496; ICD-10: J41–J44): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
  • Asthma (ICD-9: 493; ICD-10: J45–J46): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years AND an oral steroid prescription; otherwise 0
  • Chronic Inflammation (ICD-9: 555–556, 558, 714, 695.4, 696.0; ICD-10: K50–K52, M05–M06, L93, L40.50): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Central Nervous System disease (CNS) (ICD-9: 138, 332.0, 333.4, 340; ICD-10: G10, G14, G20, G35): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Occupational Lung disease (ICD-9: 117.3, 495, 500–508, 511.0; ICD-10: J60–J70, J92.0): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 2 years; otherwise 0
  • Coronary Artery Disease (CAD) (ICD-9: 410–414; ICD-10: I20–I25): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Heart Failure (HF) (ICD-9: 428; ICD-10: I50): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 1 rolling year; otherwise 0
  • Atrial Fibrillation (AF) (ICD-9: 427, 785.0; ICD-10: I48): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
  • Mental Health (ICD-9: 295–302, 306–319; ICD-10: F20–F54, F56–F99): Binary indicator: 1 if 1 hospitalization or 2 ambulatory visits within 1 rolling year; otherwise 0
  • HIV (ICD-9: 042; ICD-10: B20–B24): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Renal Failure (ICD-9: 582–587, 589; ICD-10: N01–N07, N17–N19, N26–N27): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Thrombolytic Event (ICD-9: 415, 434.01, 434.11, 434.91; ICD-10: I26, I63): Binary indicator: 1 if 1 hospitalization or 2 outpatient visits within 2 years; otherwise 0
  • Substance Abuse (ICD-9: 291–292, 303–305, 980; ICD-10: F10–F16, F18–F19, T51): Binary indicator: 1 if 1 hospitalization or 1 outpatient visit within 1 year; otherwise 0
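To make the continuity definitions above concrete, here is a small R sketch (a hypothetical helper, not from the paper) computing several of the indices from per-person visit summaries, following the formulas in the table:

```r
# A hypothetical helper computing continuity indices from per-person summaries.
# v = total visits, p = distinct providers seen, vmax = visits to the most
# frequently seen provider; assumes v > 1 and vmax <= v.
continuity_features <- function(v, p, vmax) {
  fidelity <- vmax / v                    # fraction of visits to usual provider
  upc      <- as.integer(fidelity > 0.75) # Usual Provider Continuity
  mci      <- 1 - p / v                   # Modified Continuity Index
  mmci     <- mci / (1 - 1 / v)           # Modified, Modified Continuity Index
  k        <- (v - p) / (v - 1)           # Ejlertsson's Index K
  data.frame(fidelity, upc, mci, mmci, k)
}

continuity_features(v = 10, p = 3, vmax = 8)
```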

Appendix 4: Model performance

Model (N = 37,508)                    Number of features    AUC

RF
  Full Model (Imbalanced)                     104          0.869
  Full Model (Balanced)                       104          0.869
  Minimal depth (Imbalanced)                   69          0.869
  Minimal depth (Balanced)                     69          0.869
  5 years (Balanced)                           38          0.860
  3 years (Balanced)                           38          0.861
  2 years (Balanced)                           38          0.853
  1 year (Balanced)                            38          0.837

LR
  Full Model                                   91*         0.855
  LASSO                                        55          0.860
  Ridge                                       104          0.863
  Backward Selection (pr 0.10)                 36          0.855
  5 years                                      35          0.847
  3 years                                      34          0.852
  2 years                                      31          0.849
  1 year                                       29          0.832

*17 features omitted for collinearity or perfectly predicting the outcome.

Appendix 5: Random forest code in R

(Figure a: the article's annotated R code for the random forest analysis; not reproduced in this extraction.)
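Since the code figure is not reproduced here, the following is a minimal stand-in sketch, not the authors' code: it grows a classification forest on hypothetical analysis data `dat` (factor outcome `has_md`) and evaluates discrimination with the AUC, as in Appendix 4:

```r
# A stand-in sketch, not the code from figure a: fit RF on a train split and
# compute the test-set AUC with pROC. `dat` and `has_md` are hypothetical.
library(randomForest)
library(pROC)

set.seed(42)
n     <- nrow(dat)
train <- sample(n, size = floor(0.7 * n))  # 70/30 train/test split

rf <- randomForest(has_md ~ ., data = dat[train, ],
                   ntree = 500, importance = TRUE)

# Predicted probability of the second factor level ("has a regular doctor")
p_hat <- predict(rf, newdata = dat[-train, ], type = "prob")[, 2]

auc(roc(response = dat$has_md[-train], predictor = p_hat))
```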


About this article


Cite this article

King, C., Strumpf, E. Applying random forest in a health administrative data context: a conceptual guide. Health Serv Outcomes Res Method 22, 96–117 (2022). https://doi.org/10.1007/s10742-021-00255-7
