DOI: 10.1109/SC.2014.41
research-article

Parallel de Bruijn graph construction and traversal for de novo genome assembly

Published: 16 November 2014

Abstract

De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous fragments called reads. We study optimized parallelization of the most time-consuming phases of Meraculous, a state-of-the-art production assembler. First, we present a new parallel algorithm for k-mer analysis, a phase characterized by intensive communication and I/O requirements, and reduce its memory requirements by 6.93×. Second, we efficiently parallelize de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We provide a novel algorithm that leverages the one-sided communication capabilities of Unified Parallel C (UPC) to facilitate the requisite fine-grained parallelism and avoidance of data hazards, while analytically proving its scalability properties. Overall results show unprecedented performance and efficient scaling on up to 15,360 cores of a Cray XC30, on the human genome as well as the challenging wheat genome, with run times improving from days to seconds.
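
To make the pipeline in the abstract concrete, the following is a minimal serial sketch in Python of its three ingredients: k-mer counting with a count threshold, a de Bruijn graph held in a hash table, and traversal of unambiguous paths into contigs. It is not the paper's UPC algorithm. The function names, the `min_count` threshold, the CRC32-based `owner` mapping (a stand-in for how a distributed hash table might assign k-mers to processes), and the choice to ignore reverse complements are all assumptions made for illustration.

```python
import zlib
from collections import Counter

BASES = "ACGT"

def retained_kmers(reads, k, min_count=2):
    """Count every k-mer in the reads; keep those seen at least min_count times.
    Rare k-mers are treated as likely sequencing errors (an assumed heuristic)."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}

def owner(kmer, nprocs):
    """Hypothetical owner rank of a k-mer in a distributed hash table.
    A deterministic hash is used so every process computes the same owner."""
    return zlib.crc32(kmer.encode()) % nprocs

def successors(kmers, kmer):
    # k-mers that overlap this one by k-1 bases on the right
    return [kmer[1:] + b for b in BASES if kmer[1:] + b in kmers]

def predecessors(kmers, kmer):
    # k-mers that overlap this one by k-1 bases on the left
    return [b + kmer[:-1] for b in BASES if b + kmer[:-1] in kmers]

def unique_next(kmers, kmer):
    """Return the next k-mer only if the edge is unambiguous in both directions."""
    succ = successors(kmers, kmer)
    if len(succ) == 1 and len(predecessors(kmers, succ[0])) == 1:
        return succ[0]
    return None

def unique_prev(kmers, kmer):
    pred = predecessors(kmers, kmer)
    if len(pred) == 1 and len(successors(kmers, pred[0])) == 1:
        return pred[0]
    return None

def traverse(kmers):
    """Walk unambiguous paths from each unvisited seed to form contigs."""
    visited, contigs = set(), []
    for seed in sorted(kmers):
        if seed in visited:
            continue
        visited.add(seed)
        contig, cur = seed, seed
        # extend to the right
        while (nxt := unique_next(kmers, cur)) is not None and nxt not in visited:
            visited.add(nxt)
            contig += nxt[-1]
            cur = nxt
        cur = seed
        # extend to the left
        while (prv := unique_prev(kmers, cur)) is not None and prv not in visited:
            visited.add(prv)
            contig = prv[0] + contig
            cur = prv
        contigs.append(contig)
    return sorted(contigs)

if __name__ == "__main__":
    # Toy reads covering ATGGCGTGCA, plus one read with an error near its end.
    reads = ["ATGGCGT", "GGCGTGCA", "ATGGCGTG", "GCGTGCA", "ATGGCTT"]
    print(traverse(retained_kmers(reads, k=4)))
```

In this toy input the two k-mers introduced by the erroneous read occur only once, so they fall below the threshold and the remaining k-mers form a single unambiguous path. In a distributed setting, each lookup in `kmers` would instead be directed to the process returned by something like `owner`, which is where fine-grained remote access becomes the dominant cost.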



Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN:9781479955008
  • General Chair: Trish Damkroger
  • Program Chair: Jack Dongarra

Publisher

IEEE Press

Qualifiers

  • Research-article

Conference

SC '14
Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2024) High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism. Proceedings of the 53rd International Conference on Parallel Processing, pp. 919-928. DOI: 10.1145/3673038.3673072. Online publication date: 12-Aug-2024.
  • (2021) Ultra Efficient Acceleration for De Novo Genome Assembly via Near-Memory Computing. Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, pp. 199-212. DOI: 10.1109/PACT52795.2021.00022. Online publication date: 26-Sep-2021.
  • (2019) BCL. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337912. Online publication date: 5-Aug-2019.
  • (2019) A pattern based algorithmic autotuner for graph processing on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 201-213. DOI: 10.1145/3293883.3295716. Online publication date: 16-Feb-2019.
  • (2019) FastEtch. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, no. 4, pp. 1091-1106. DOI: 10.1109/TCBB.2017.2737999. Online publication date: 1-Jul-2019.
  • (2018) Extreme scale de novo metagenome assembly. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. DOI: 10.5555/3291656.3291670. Online publication date: 11-Nov-2018.
  • (2018) Efficient Runtime Support for a Partitioned Global Logical Address Space. Proceedings of the 47th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3225058.3225092. Online publication date: 13-Aug-2018.
  • (2018) Extreme scale de novo metagenome assembly. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. DOI: 10.1109/SC.2018.00013. Online publication date: 11-Nov-2018.
  • (2017) Accelerating the HyperLogLog Cardinality Estimation Algorithm. Scientific Programming, vol. 2017. DOI: 10.1155/2017/2040865. Online publication date: 1-Jan-2017.
  • (2017) MerBench. Proceedings of the Second Annual PGAS Applications Workshop, pp. 1-4. DOI: 10.1145/3144779.3169109. Online publication date: 12-Nov-2017.
