research-article

Distributed many-to-many protein sequence alignment using sparse matrices

Authors:

Oguz Selvitopi,

Saliya Ekanayake,

Georgios A. Pavlopoulos,

Aydın Buluç

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 75, Pages 1 - 14

Published: 09 November 2020

Abstract

Identifying similar protein sequences is a core step in many computational biology pipelines, such as detection of homologous protein sequences, generation of protein similarity graphs for downstream analysis, functional annotation, and gene location. The performance and scalability of protein similarity search have proven to be a bottleneck in many bioinformatics pipelines due to the increase in cheap and abundant sequencing data. This work presents PASTIS, a new distributed-memory software package. PASTIS relies on sparse matrix computations to efficiently identify possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity search when coupled with a fully distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution into the search without altering the basic sparse matrix model and, in turn, achieves ideal scaling up to millions of protein sequences.
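
The abstract does not spell out the matrix formulation, so the following is only a minimal single-process sketch of one common way to cast candidate detection as a sparse matrix product. It assumes a 0/1 sequences-by-k-mers matrix A, in which case the nonzeros of A·Aᵀ are exactly the sequence pairs that share at least one k-mer. The function names, the k-mer length, and the toy sequences are hypothetical; the sketch ignores substitution bias and the distributed dictionary, and it is not the PASTIS implementation.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs, k=3, min_shared=1):
    """Count shared k-mers per sequence pair (upper triangle of A * A^T).

    A is an implicit (sequences x k-mers) 0/1 sparse matrix, stored
    column-wise as a dict: k-mer -> list of sequence ids containing it.
    """
    columns = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for km in kmers(seq, k):
            columns[km].append(sid)

    # Each column adds 1 to every pair of rows it touches, so pairs
    # with no k-mer in common are never materialized.
    overlap = defaultdict(int)
    for ids in columns.values():
        for i, j in combinations(sorted(ids), 2):
            overlap[(i, j)] += 1

    return {pair: n for pair, n in overlap.items() if n >= min_shared}


# Toy input (hypothetical sequences, not taken from the paper).
seqs = ["MKVLAAGIV", "MKVLSAGIV", "PPQRSTWYY"]
pairs = candidate_pairs(seqs, k=3)
```

Only the pairs returned here would go on to a full alignment step, which is where the saving over comparing all pairs comes from.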

Cited By

Li Y, Guidi G (2024). High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism. Proceedings of the 53rd International Conference on Parallel Processing, pp. 919-928. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673072

Burchard L, Zhao M, Langguth J, Buluç A, Guidi G, Mohror K, Arnold D, Badia R (2023). Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607094

Ellis M, Buluç A, Yelick K, Lee J, Petrank E (2021). Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 465-466. Online publication date: 17-Feb-2021.
https://dl.acm.org/doi/10.1145/3437801.3441580

Index Terms

Applied computing: Life and medical sciences


Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2020

1454 pages

ISBN:9781728199986

General Chair:
Christine Cuicchi,
Program Chairs:
Irene Qualters,
William Kramer

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Conference

SC '20

Sponsor:

SIGHPC

SC '20: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 9 - 19, 2020

Atlanta, Georgia

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
