DOI: 10.1145/3535508.3545536
Research article, public access

A comparison of dimensionality reduction methods for large biological data

Published: 07 August 2022

    Abstract

    Large-scale data often suffer from the curse of dimensionality and its associated constraints; dimensionality reduction is therefore commonly applied before a machine learning pipeline. In this paper, we directly compare the performance of autoencoders as a dimensionality reduction technique (via the latent space) with that of other established methods: PCA, LASSO, and t-SNE. To make the comparison robust, we use four distinct datasets that vary in feature types, metadata, labels, and size. We test prediction capability using both Support Vector Machines (SVM) and Random Forests (RF). Notably, we conclude that autoencoders are an equivalent dimensionality reduction architecture to the previously established methods, and often outperform them in both prediction accuracy and time performance when condensing large, sparse datasets.
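
The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn is available, uses a single-hidden-layer `MLPRegressor` trained to reconstruct its input as a stand-in autoencoder, and takes the reduced dimension `k`, the LASSO `alpha`, and the synthetic data as arbitrary placeholder choices. The function name `reduce_and_score` is hypothetical. t-SNE is left out because scikit-learn's `TSNE` offers only `fit_transform` and cannot project held-out samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def reduce_and_score(X, y, k=8, seed=0):
    """Reduce X to k dimensions three ways, then score SVM and RF on each.

    Hypothetical helper: returns {(reducer, classifier): test accuracy}.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Fit the scaler on training data only to avoid leaking test statistics.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # PCA: project onto the top-k principal components.
    pca = PCA(n_components=k, random_state=seed).fit(X_tr)

    # LASSO: keep the k features with the largest absolute coefficients.
    lasso = SelectFromModel(
        Lasso(alpha=0.01), threshold=-np.inf, max_features=k
    ).fit(X_tr, y_tr)

    # Stand-in autoencoder: an MLP trained to reconstruct its own input,
    # with a k-unit tanh hidden layer acting as the latent space.
    ae = MLPRegressor(
        hidden_layer_sizes=(k,), activation="tanh",
        max_iter=500, random_state=seed,
    ).fit(X_tr, X_tr)

    def encode(A):
        # Forward pass through the first (encoder) layer only.
        return np.tanh(A @ ae.coefs_[0] + ae.intercepts_[0])

    reducers = {
        "pca": pca.transform,
        "lasso": lasso.transform,
        "autoencoder": encode,
    }
    classifiers = {
        "svm": lambda: SVC(),
        "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=seed),
    }

    scores = {}
    for r_name, reduce in reducers.items():
        Z_tr, Z_te = reduce(X_tr), reduce(X_te)
        for c_name, make_clf in classifiers.items():
            clf = make_clf().fit(Z_tr, y_tr)
            scores[(r_name, c_name)] = clf.score(Z_te, y_te)
    return scores


if __name__ == "__main__":
    # Synthetic placeholder data; the paper's four datasets are not reproduced here.
    X, y = make_classification(
        n_samples=300, n_features=50, n_informative=10, random_state=0
    )
    for key, acc in reduce_and_score(X, y).items():
        print(key, round(acc, 3))
```

A real comparison along the lines of the abstract would also time each reducer and tune hyperparameters per dataset; both are omitted here to keep the sketch short.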

    Published In

    BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
    August 2022
    549 pages
    ISBN:9781450393867
    DOI:10.1145/3535508

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. autoencoders
    2. classification
    3. dimensionality reduction

    Qualifiers

    • Research-article

    Acceptance Rates

    Overall Acceptance Rate 254 of 885 submissions, 29%
