DOI: 10.1145/3545008.3545050
research-article
Open access

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

Published: 13 January 2023

    Abstract

    De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches.
    In this work, we present a novel distributed-memory algorithm that takes a string graph representation of the genome and, using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome.
    Using a matrix abstraction, we mask branches in the string graph and compute connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize load imbalance in local assembly, i.e., the concatenation of sequences from a given contig. Based on the assignment obtained from partitioning, we compute the induced subgraph function to redistribute sequences between processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences.
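    The four steps above can be illustrated with a small serial sketch. It is not the authors' implementation: it uses plain Python edge lists instead of distributed sparse matrices, treats any vertex of degree greater than two as a branch, swaps in union-find for the connected-components step, and uses the greedy longest-processing-time heuristic as one possible multiway number partitioning method. All function names are hypothetical, and the final step returns only the order of reads in each chain rather than concatenated sequences.

```python
from collections import defaultdict
import heapq


def mask_branches(edges):
    """Drop every edge incident to a branch vertex (assumed: degree > 2)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [(u, v) for u, v in edges if degree[u] <= 2 and degree[v] <= 2]


def connected_components(vertices, edges):
    """Group vertices into components with union-find (a serial stand-in)."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    comps = defaultdict(list)
    for v in vertices:
        comps[find(v)].append(v)
    return list(comps.values())


def lpt_partition(sizes, nprocs):
    """Greedy heuristic: give the largest remaining item to the least loaded process."""
    heap = [(0, p) for p in range(nprocs)]  # (current load, process id)
    assignment = {}
    for cid in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, p = heapq.heappop(heap)
        assignment[cid] = p
        heapq.heappush(heap, (load + sizes[cid], p))
    return assignment


def walk_chain(component, adj):
    """Depth-first walk of a linear chain, starting from an endpoint if one exists."""
    start = next((v for v in component if len(adj[v]) <= 1), component[0])
    order, seen, stack = [], set(), [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(n for n in adj[v] if n not in seen)
    return order


def generate_contigs(vertices, edges, nprocs):
    """Mask branches, find chains, balance them across processes, order each chain."""
    kept = mask_branches(edges)
    adj = defaultdict(list)
    for u, v in kept:
        adj[u].append(v)
        adj[v].append(u)
    # Isolated vertices (e.g. masked branch points) are not treated as contigs here.
    chains = [c for c in connected_components(vertices, kept) if len(c) > 1]
    owner = lpt_partition([len(c) for c in chains], nprocs)
    per_proc = defaultdict(list)
    for cid, chain in enumerate(chains):
        per_proc[owner[cid]].append(walk_chain(chain, adj))
    return dict(per_proc)


# Toy string graph: vertex 3 has three neighbours, so it is a branch.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 7)]
contigs = generate_contigs(range(8), toy_edges, nprocs=2)
print(contigs)
```

    In this toy graph, masking the branch at vertex 3 leaves two linear chains, and the partitioning step places one on each of the two hypothetical processes. In a distributed setting, the per-process assignment would drive the redistribution of sequences rather than a local dictionary.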
    Our algorithm scales well, with parallel efficiency of up to 80% on 128 nodes, while producing uniform genome coverage and promising assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long-read assembly of large genomes in distributed memory.

    Published In

    ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
    August 2022
    976 pages
    ISBN: 9781450397339
    DOI: 10.1145/3545008
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Office of Science of the DOE

    Conference

    ICPP '22: 51st International Conference on Parallel Processing
    August 29 - September 1, 2022
    Bordeaux, France

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%
