Concept acquisition and improved in-database similarity analysis for medical data

Authors:

Araek Tashkandi,

Ulrich Sax

Distributed and Parallel Databases, Volume 37, Issue 2

Pages 297 - 321

https://doi.org/10.1007/s10619-018-7249-x

Published: 01 June 2019

Abstract

Efficient identification of cohorts of similar patients is a major precondition for personalized medicine. To train prediction models on a given medical data set, similarities have to be calculated for every pair of patients, which results in a roughly quadratic data blowup. In this paper we discuss in-database patient similarity analysis, ranging from data extraction to implementing and optimizing the similarity calculations in SQL. In particular, we introduce the notion of chunking, which uniformly distributes the workload among the individual similarity calculations. Our benchmark applies one similarity measure (cosine similarity) and one distance metric (Euclidean distance) to two real-world data sets; it compares the performance of a column store (MonetDB) and a row store (PostgreSQL) with two external data mining tools (ELKI and Apache Mahout).
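
To make the pairwise workload concrete, the following is a minimal Python sketch of the two measures named in the abstract, with the patient pairs split into equally sized batches. It is not the paper's SQL implementation: the function names, the toy `patients` vectors, and the batching scheme in `pair_chunks` are assumptions made for illustration, and the abstract does not specify how the authors' chunking is actually defined.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_chunks(n, chunk_size):
    """Yield lists of at most chunk_size index pairs (i, j) with i < j.

    Hypothetical batching scheme: every chunk except possibly the last
    holds the same number of pairs, so each batch is equal work.
    """
    chunk = []
    for pair in combinations(range(n), 2):
        chunk.append(pair)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Toy data (assumed): four patients with two numeric features each.
patients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

# n patients produce n * (n - 1) / 2 unordered pairs: the quadratic blowup.
results = {}
for chunk in pair_chunks(len(patients), chunk_size=4):
    for i, j in chunk:
        results[(i, j)] = (
            cosine_similarity(patients[i], patients[j]),
            euclidean_distance(patients[i], patients[j]),
        )
```

With four patients the loop evaluates six pairs in two batches (four pairs, then two). Enumerating unordered pairs with `i < j` is valid here because both measures are symmetric; it halves the work but does not change the quadratic growth.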

ISSN: 0926-8782

Copyright © 2019 Springer Science+Business Media, LLC, part of Springer Nature.

Publisher

Kluwer Academic Publishers

United States
