DOI: 10.1145/3071178.3071212
research-article
Open access

Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming

Published: 01 July 2017

Abstract

Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove invaluable for bioinformaticians. To this end, we have developed an open source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies. In TPOT-MDR, we implement Multifactor Dimensionality Reduction (MDR) as a feature construction method for modeling higher-order feature interactions, and combine it with a new expert knowledge-guided feature selector for large biomedical data sets. We demonstrate TPOT-MDR's capabilities using a combination of simulated and real-world data sets from human genetics and find that TPOT-MDR significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline discovered by TPOT-MDR for a real-world problem and highlight TPOT-MDR's ability to produce a high-accuracy solution that is also easily interpretable.
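To make the idea of MDR as a feature construction method more concrete, the following is a minimal sketch in Python. It is not the TPOT-MDR implementation. It assumes discrete features (for example genotypes coded 0/1/2), a binary outcome coded 0 (control) and 1 (case), and the overall case:control ratio as the risk threshold. The function name `mdr_construct` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def mdr_construct(X, y):
    """Collapse the discrete columns of X into one binary risk feature.

    Illustrative sketch only: each observed combination of feature values
    (a "cell") is labeled high risk (1) when its case:control ratio exceeds
    the case:control ratio of the whole data set, and low risk (0) otherwise.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_cases = int(np.sum(y == 1))
    n_controls = int(np.sum(y == 0))

    # Count [controls, cases] for every observed combination of values.
    counts = {}
    for row, label in zip(X, y):
        key = tuple(int(v) for v in row)
        counts.setdefault(key, [0, 0])[int(label)] += 1

    # Compare ratios by cross-multiplication to avoid dividing by zero.
    risk = {
        key: int(cases * n_controls > controls * n_cases)
        for key, (controls, cases) in counts.items()
    }

    feature = np.array([risk[tuple(int(v) for v in row)] for row in X])
    return feature, risk


# Toy XOR-style interaction: neither feature is informative on its own,
# but the pair of features determines the outcome.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)
y = np.array([0, 1, 1, 0] * 2)
feature, risk = mdr_construct(X, y)
```

In this toy case the constructed feature reproduces the outcome, which illustrates why a single MDR-derived feature can capture an interaction that no individual input feature reveals. A downstream classifier in a pipeline would then receive `feature` in place of, or alongside, the original columns.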


Published In

GECCO '17: Proceedings of the Genetic and Evolutionary Computation Conference
July 2017
1427 pages
ISBN:9781450349208
DOI:10.1145/3071178
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automated machine learning
  2. bioinformatics
  3. genetic programming
  4. genetics
  5. multifactor dimensionality reduction
  6. python

Conference

GECCO '17

Acceptance Rates

GECCO '17 Paper Acceptance Rate 178 of 462 submissions, 39%;
Overall Acceptance Rate 1,669 of 4,410 submissions, 38%

Cited By

  • (2024) Groundwater contaminant source identification considering unknown boundary condition based on an automated machine learning surrogate. Geoscience Frontiers 15:1 (101732). DOI: 10.1016/j.gsf.2023.101732. Online publication date: Jan-2024
  • (2024) The Effectiveness of Using AutoML in Electricity Theft Detection: The Impact of Data Preprocessing and Balancing Techniques. Computational Science and Its Applications – ICCSA 2024 (68-82). DOI: 10.1007/978-3-031-64608-9_5. Online publication date: 2-Jul-2024
  • (2023) Exploring SLUG: Feature Selection Using Genetic Algorithms and Genetic Programming. SN Computer Science 5:1. DOI: 10.1007/s42979-023-02106-3. Online publication date: 8-Dec-2023
  • (2023) Feature Selection on Epistatic Problems Using Genetic Algorithms with Nested Classifiers. Applications of Evolutionary Computation (656-671). DOI: 10.1007/978-3-031-30229-9_42. Online publication date: 9-Apr-2023
  • (2022) Determining the Capability of the Tree-Based Pipeline Optimization Tool (TPOT) in Mapping Parthenium Weed Using Multi-Date Sentinel-2 Image Data. Remote Sensing 14:7 (1687). DOI: 10.3390/rs14071687. Online publication date: 31-Mar-2022
  • (2022) KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis. Briefings in Bioinformatics 23:4. DOI: 10.1093/bib/bbac202. Online publication date: 2-Jun-2022
  • (2022) SLUG: Feature Selection Using Genetic Algorithms and Genetic Programming. Genetic Programming (68-84). DOI: 10.1007/978-3-031-02056-8_5. Online publication date: 13-Apr-2022
  • (2021) Comparision of diabetic prediction AutoML model with customized model. 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS) (842-847). DOI: 10.1109/ICAIS50930.2021.9395775. Online publication date: 25-Mar-2021
  • (2021) The promise of automated machine learning for the genetic analysis of complex traits. Human Genetics 141:9 (1529-1544). DOI: 10.1007/s00439-021-02393-x. Online publication date: 28-Oct-2021
  • (2021) Automated machine learning to predict the co-occurrence of isocitrate dehydrogenase mutations and O6-methylguanine-DNA methyltransferase promoter methylation in patients with gliomas. Journal of Magnetic Resonance Imaging 54:1 (197-205). DOI: 10.1002/jmri.27498. Online publication date: 3-Jan-2021
