DOI: 10.5555/3408207.3408274

Handling the missing data problem in electronic health records for cancer prediction

Published: 19 May 2020

Abstract

Electronic health records (EHRs) contain patients' clinical information. Because they hold a large volume of valuable medical information, EHRs are widely used in disease diagnosis and therapy. However, missing data in EHRs hinders their use. One imputation approach replaces the missing data with mean values, but that method weakens feature importance. In this study, we use the expectation-maximization (EM) algorithm to impute the missing data in EHRs. Several machine learning models, including an artificial neural network, logistic regression, a support vector machine, and random forests, are used to evaluate the effectiveness of the imputation. The experimental results show that these models predict cancers more accurately on EHRs imputed by the EM algorithm than on EHRs imputed with mean values, which indicates that the EM algorithm is able to provide accurate estimates for data imputation in EHRs.
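
The contrast between the two imputation strategies can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not state which probability model the EM algorithm is fitted to, so the sketch assumes the features are jointly Gaussian. The function names `mean_impute` and `em_impute`, the `ridge` stabilizer, and the stopping rule are all hypothetical choices made for this example.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def em_impute(X, n_iter=100, tol=1e-6, ridge=1e-6):
    """EM imputation under an ASSUMED multivariate normal model.

    E-step: fill each missing entry with its conditional mean given the
            observed entries of the same row and the current (mu, sigma).
    M-step: re-estimate mu and sigma from the completed data, adding the
            conditional covariance of the imputed entries.
    """
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    n, d = X.shape

    # Start from mean imputation and the moments it implies.
    filled = mean_impute(X)
    mu = filled.mean(axis=0)
    centered = filled - mu
    sigma = centered.T @ centered / n + ridge * np.eye(d)

    for _ in range(n_iter):
        previous = filled.copy()
        correction = np.zeros((d, d))

        # E-step, one row at a time.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the mean.
                filled[i] = mu
                correction += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            # coef = s_mo @ inv(s_oo), computed without an explicit inverse.
            coef = np.linalg.solve(s_oo, s_mo.T).T
            filled[i, m] = mu[m] + coef @ (filled[i, o] - mu[o])
            correction[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T

        # M-step: update the Gaussian parameters.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + correction) / n + ridge * np.eye(d)

        # Stop once the imputed values no longer change appreciably.
        if np.max(np.abs(filled - previous)) < tol:
            break

    return filled
```

When features are correlated, `em_impute` can use the observed entries of a row to estimate the missing ones, whereas `mean_impute` assigns the same value to every gap in a column regardless of the rest of the row.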

Published In

SpringSim '20: Proceedings of the 2020 Spring Simulation Conference
May 2020
791 pages
ISBN: 9781713812883

Publisher

Society for Computer Simulation International

San Diego, CA, United States

Author Tags

  1. data imputation
  2. electronic health records (EHRs)
  3. machine learning
  4. missing data
  5. neural network

Qualifiers

  • Research-article
