DOI: 10.1145/3337821.3337919
research-article
Open access

diBELLA: Distributed Long Read to Long Read Alignment

Published: 05 August 2019
    Abstract

    We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed-memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high-level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.

    References

    [1]
    Mohammed Alser, Hasan Hassan, Akash Kumar, Onur Mutlu, and Can Alkan. 2019. Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment. Bioinformatics (2019).
    [2]
    Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 3 (1990), 403--410.
    [3]
    Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology 33, 6 (2015), 623--630.
    [4]
    Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors. Commun. ACM 13, 7 (1970), 422--426.
    [5]
    Mark J. Chaisson and Glenn Tesler. 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 1 (19 Sep 2012), 238.
    [6]
    Chen Shan Chin, David H. Alexander, Patrick Marks, Aaron A. Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E. Eichler, Stephen W. Turner, and Jonas Korlach. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. PLoS Medicine 10, 6 (4 2013), 563--569.
    [7]
    Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, et al. 2016. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods 13, 12 (2016), 1050.
    [8]
    Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R Ecker, Dario Cantu, David R Rank, and Michael C Schatz. 2016. Phased diploid genome assembly with single-molecule real-time sequencing. Nature Methods 13 (17 Oct 2016), 1050.
    [9]
    Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. 2008. SeqAn an efficient, generic C++ library for sequence analysis. BMC bioinformatics 9, 1 (2008), 11.
    [10]
    Marquita Ellis, Evangelos Georganas, Rob Egan, Steven Hofmeyr, Aydın Buluç, Brandon Cook, Leonid Oliker, and Katherine Yelick. 2017. 23rd International European Conference on Parallel and Distributed Computing (Euro-Par 2017).
    [11]
    Evangelos Georganas. 2016. Scalable Parallel Algorithms for Genome Analysis. Ph.D. Dissertation. EECS Department, University of California, Berkeley.
    [12]
    Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. 2015. MerAligner: A fully parallel sequence aligner. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, India, 561--570.
    [13]
    Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. 2015. HipMer: An Extreme-Scale De Novo Genome Assembler. 27th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2015) (November 2015).
    [14]
    Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, and Aydın Buluç. 2018. BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper. bioRxiv (2018), 464420.
    [15]
    Runxin Guo, Yi Zhao, Quan Zou, Xiaodong Fang, and Shaoliang Peng. 2018. Bioinformatics applications on apache spark. GigaScience 7, 8 (2018), giy098.
    [16]
    Vasanthan Jayakumar and Yasubumi Sakakibara. 2017. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics (2017), bbx147.
    [17]
    Govinda M Kamath, Ilan Shomorony, Fei Xia, Thomas A Courtade, and David N Tse. 2017. HINGE: long-read assembly achieves optimal repeat resolution. Genome research 27, 5 (2017), 747--756.
    [18]
    Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A Pevzner. 2019. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology (2019), 1.
    [19]
    Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and Adam M Phillippy. 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research 27, 5 (05 2017), 722--736.
    [20]
    Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and Adam M Phillippy. 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research 27, 5 (2017), 722--736.
    [21]
    Heng Li. 2016. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 14 (2016), 2103--2110.
    [22]
    Heng Li. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 18 (2018), 3094--3100.
    [23]
    Yu Lin, Jeffrey Yuan, Mikhail Kolmogorov, Max W Shen, Mark Chaisson, and Pavel A Pevzner. 2016. Assembly of long error-prone reads using de Bruijn graphs. Proceedings of the National Academy of Sciences 113, 52 (2016), E8396--E8405.
    [24]
    Nicholas James Loman, Joshua Quick, and Jared T Simpson. 2015. A complete bacterial genome assembled de novo using only nanopore sequencing data. bioRxiv (2015). https://www.biorxiv.org/content/early/2015/02/20/015552.full.pdf
    [25]
    Guillaume Marçais and Carl Kingsford. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 6 (2011), 764--770.
    [26]
    David W Mount. 2004. Sequence and genome analysis. Bioinformatics: Cold Spring Harbour Laboratory Press: Cold Spring Harbour 2 (2004).
    [27]
    Gene Myers. 2014. Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics. Springer, 52--67.
    [28]
    Niranjan Nagarajan and Mihai Pop. 2009. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. Journal of computational biology 16, 7 (2009), 897--908.
    [29]
    Thomas P. Niedringhaus, Denitsa Milanova, Matthew B. Kerby, Michael P. Snyder, and Annelise E. Barron. 2011. Landscape of Next-Generation Sequencing Technologies. Analytical Chemistry 83, 12 (2011), 4327--4341. https://doi.org/10.1021/ac2010857
    [30]
    Tony C Pan, Sanchit Misra, and Srinivas Aluru. 2018. Optimizing high performance distributed memory parallel hash tables for DNA k-mer counting. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 135--147.
    [31]
    Adam M Phillippy, Michael C Schatz, and Mihai Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome biology 9, 3 (2008), R55.
    [32]
    Temple F. Smith and Michael S. Waterman. n.d. Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 1 (n.d.), 195--197.
    [33]
    Yatish Turakhia, Gill Bejerano, and William J Dally. 2018. Darwin: A genomics co-processor provides up to 15,000 x acceleration on long read assembly. In ACM SIGPLAN Notices, Vol. 53. ACM, 199--213.
    [34]
    Leslie G. Valiant. 1990. A Bridging Model for Parallel Computation. Commun. ACM 33, 8 (Aug. 1990), 103--111.
    [35]
    Chuan-Le Xiao, Ying Chen, Shang-Qian Xie, Kai-Ning Chen, Yan Wang, Yue Han, Feng Luo, and Zhi Xie. 2017. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nature Methods 14, 11 (2017), 1072.
    [36]
    Wenyu Zhang, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, and Bairong Shen. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PloS one 6, 3 (2011), e17915.
    [37]
    Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. 2000. A greedy algorithm for aligning DNA sequences. Journal of Computational biology 7, 1-2 (2000), 203--214.


    Published In

    ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
    August 2019
    1107 pages
    ISBN:9781450362955
    DOI:10.1145/3337821
    © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    In-Cooperation

    • University of Tsukuba

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bioinformatics
    2. cloud computing
    3. distributed data structures
    4. genomics
    5. high performance computing
    6. performance analysis

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • DOE/Exascale Computing Project

    Conference

    ICPP 2019

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 70
    • Downloads (Last 6 weeks): 1

    Citations

    Cited By

    • (2023) A survey of mapping algorithms in the long-reads era. Genome Biology 24:1. DOI: 10.1186/s13059-023-02972-3. Online publication date: 1-Jun-2023.
    • (2023) Designing Efficient SIMD Kernels for High Performance Sequence Alignment. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 167-176. DOI: 10.1109/IPDPSW59300.2023.00038. Online publication date: May-2023.
    • (2023) Invited: Accelerating Genome Analysis via Algorithm-Architecture Co-Design. 2023 60th ACM/IEEE Design Automation Conference (DAC), 1-4. DOI: 10.1109/DAC56929.2023.10247887. Online publication date: 9-Jul-2023.
    • (2021) Scaling Generalized N-Body Problems, A Case Study from Genomics. Proceedings of the 50th International Conference on Parallel Processing, 1-9. DOI: 10.1145/3472456.3472517. Online publication date: 9-Aug-2021.
    • (2021) MetaCache-GPU: Ultra-Fast Metagenomic Classification. Proceedings of the 50th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3472456.3472460. Online publication date: 9-Aug-2021.
    • (2021) 10 Years Later: Cloud Computing is Closing the Performance Gap. Companion of the ACM/SPEC International Conference on Performance Engineering, 41-48. DOI: 10.1145/3447545.3451183. Online publication date: 19-Apr-2021.
    • (2021) Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 465-466. DOI: 10.1145/3437801.3441580. Online publication date: 17-Feb-2021.
    • (2021) Distributed-Memory k-mer Counting on GPUs. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 527-536. DOI: 10.1109/IPDPS49936.2021.00061. Online publication date: May-2021.
    • (2021) Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 517-526. DOI: 10.1109/IPDPS49936.2021.00060. Online publication date: May-2021.
    • (2020) ADEPT: a domain independent sequence alignment strategy for gpu architectures. BMC Bioinformatics 21:1. DOI: 10.1186/s12859-020-03720-1. Online publication date: 15-Sep-2020.
