DOI: 10.5555/3433701.3433800
research-article

Distributed many-to-many protein sequence alignment using sparse matrices

Published: 09 November 2020

    Abstract

    Identifying similar protein sequences is a core step in many computational biology pipelines, such as detection of homologous protein sequences, generation of protein similarity graphs for downstream analysis, functional annotation, and gene location. The performance and scalability of protein similarity search have proven to be a bottleneck in many bioinformatics pipelines because of the growth of cheap and abundant sequencing data. This work presents PASTIS, a new distributed-memory software package. PASTIS relies on sparse matrix computations to efficiently identify possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity search when coupled with a fully distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution into the search without altering the basic sparse matrix model and, in turn, achieves ideal scaling up to millions of protein sequences.
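
The idea of finding possibly similar proteins with sparse matrix computations can be illustrated with a small, single-process sketch. It is not PASTIS's implementation: it assumes a sequences-by-k-mers indicator matrix A, stored column-wise in plain dictionaries, and computes the upper triangle of A·Aᵀ so that each nonzero entry counts the k-mers two sequences share. The function names, the k-mer length `k=3`, and the `min_shared` threshold are hypothetical choices made for illustration, and the sketch covers neither the substitution-aware matching nor the distributed sequence dictionary described in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_matrix(seqs, k):
    """Build a sparse sequences-by-k-mers indicator matrix A, column-wise.

    columns[kmer] lists, in increasing order, the sequence indices i
    for which A[i, kmer] = 1.
    """
    columns = defaultdict(list)
    for i, seq in enumerate(seqs):
        for kmer in kmers(seq, k):
            columns[kmer].append(i)
    return columns


def candidate_pairs(seqs, k=3, min_shared=2):
    """Compute the strict upper triangle of B = A * A^T.

    B[i, j] counts the distinct k-mers shared by sequences i and j.
    Pairs with at least min_shared common k-mers are returned as
    candidates for a later alignment step (not performed here).
    """
    shared = defaultdict(int)
    for rows in build_kmer_matrix(seqs, k).values():
        # Each column adds 1 to B[i, j] for every pair of rows it contains.
        for i, j in combinations(rows, 2):
            shared[(i, j)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

For example, `candidate_pairs(["MKVLAAGIV", "MKVLSAGIV", "QWERTYPHD"])` returns `{(0, 1): 4}`: the first two sequences share four 3-mers, while the third shares none with either and is never paired. Only pairs that co-occur in some column are ever touched, which is why a sparse product avoids comparing all pairs of sequences.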


    Cited By

    • (2024) High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism. Proceedings of the 53rd International Conference on Parallel Processing, pp. 919-928. DOI: 10.1145/3673038.3673072. Online publication date: 12-Aug-2024
    • (2023) Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1145/3581784.3607094. Online publication date: 12-Nov-2023
    • (2021) Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 465-466. DOI: 10.1145/3437801.3441580. Online publication date: 17-Feb-2021

      Published In

      SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2020
      1454 pages
      ISBN:9781728199986

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      IEEE Press

      Qualifiers

      • Research-article

      Conference

      SC '20
      Acceptance Rates

      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
