Abstract
Whole genome sequencing technologies cannot always read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods that improve de novo genome assembly from error-prone long reads, implemented in a tool called Raven. Raven maintains similar performance across various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.
Data availability
The ONT dataset for A. thaliana is available under accession no. ERR2173373, for D. melanogaster under SRR6702603, for H. sapiens NA12878 at https://github.com/nanopore-wgs-consortium/NA12878 (release 6), for H. sapiens CHM13 at https://github.com/marbl/CHM13 (release 6), for H. sapiens HG002 at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.4.5/ and for H. sapiens HG00733 at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG00733/nanopore/. The PacBio CLR dataset for A. thaliana is available at https://downloads.pacbcloud.com/public/SequelData/ArabidopsisDemoData/, for D. melanogaster under accession no. SRR5439404, for H. sapiens CHM13 at https://github.com/marbl/CHM13 (extracted from draft v1.0 bam), for H. sapiens HG002 at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/PacBio_fasta/ and for H. sapiens HG00733 under SRR7615963. The PacBio HiFi dataset for H. sapiens CHM13 is available under accession nos. SRR11292120–SRR11292123, for H. sapiens HG002 under SRR10382244, SRR10382245, SRR10382248 and SRR10382249, and for H. sapiens HG00733 under ERX3831682. Illumina reads for yak evaluation are available under accession nos. SRX1049768–SRX1049782 for H. sapiens NA12878, from https://github.com/marbl/CHM13 (extracted from draft v1.0 bam) for H. sapiens CHM13, from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/ (extracted from 60x bam) for H. sapiens HG002 and under accession no. SRR7782677 for H. sapiens HG00733. ONT plant datasets are available under accession nos. ERR2564160–ERR2564170 for B. rapa, ERR2564373–ERR2564382 for B. oleracea, ERR2571286–ERR2571303 for M. schizocarpa, ERR3476478–ERR3476482 for O. sativa basmati 334 and ERR3476463–ERR3476466 for O. sativa dom sufid. All assemblies generated in this research are available at Zenodo (ref. 26).
Code availability
The Raven source code is available under an MIT license on GitHub at https://github.com/lbcb-sci/raven. Source code for version 1.3.0 used in this manuscript is also available at Zenodo (ref. 27).
References
1. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
2. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
3. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
4. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
5. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
6. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
7. Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. & Tse, D. N. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756 (2017).
8. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
10. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
11. Broder, A. Z. On the resemblance and containment of documents. In Proc. Compression and Complexity of SEQUENCES 1997 (cat. no. 97TB100171) (eds. Carpentieri, B. et al.) 21–29 (IEEE, 1997); https://doi.org/10.1109/SEQUEN.1997.666900
12. Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A fast approximate algorithm for mapping long reads to large reference databases. In Research in Computational Molecular Biology (ed. Sahinalp, S. C.) 66–81 (Springer, 2017).
13. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).
14. Fruchterman, T. M. J. & Reingold, E. M. Graph drawing by force-directed placement. Softw. Pract. Exp. 21, 1129–1164 (1991).
15. Barnes, J. & Hut, P. A hierarchical O(N log N) force-calculation algorithm. Nature 324, 446–449 (1986).
16. Wick, R. R. & Holt, K. E. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res. 8, 2138 (2020).
17. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
18. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
19. Belser, C. et al. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants 4, 879–887 (2018).
20. Choi, J. Y. et al. Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice. Genome Biol. 21, 21 (2020).
21. Vaser, R. & Šikić, M. Yet another de novo genome assembler. In Proc. 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA) (eds. Lončarić, S. et al.) 147–151 (IEEE, 2019); https://doi.org/10.1109/ISPA.2019.8868909
22. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
23. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
24. Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
25. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
26. Vaser, R. & Sikic, M. Assemblies generated in the manuscript ‘Time and memory efficient genome assembly with Raven’. Zenodo https://doi.org/10.5281/zenodo.4443062 (2021).
27. Vaser, R. & Sikic, M. Raven source code used in the manuscript ‘Time and memory efficient genome assembly with Raven’. Zenodo https://doi.org/10.5281/zenodo.4672196 (2021).
Acknowledgements
This work has been supported in part by the Croatian Science Foundation under the project ‘Single genome and metagenome assembly’ (IP-2018-01-5886, to M.Š.), the European Regional Development Fund under grant no. KK.01.1.1.01.0009 (DATACROSS, to M.Š.) and the A*STAR Computational Resource Centre through the use of their high-performance computing facilities. R.V. and M.Š. have been partially supported by funding from A*STAR, Singapore. We acknowledge Intel Corporation for allowing us to test with the Intel Optane persistent memory server and providing us with high-quality technical support. Finally, we thank G. Žužić from Carnegie Mellon University for valuable discussions about graph drawings.
Author information
Contributions
M.Š. devised the project. R.V. designed and implemented Raven, and benchmarked it with other assemblers. Both authors drafted and revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–4 and Tables 1–5.
About this article
Cite this article
Vaser, R., Šikić, M. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021). https://doi.org/10.1038/s43588-021-00073-4
DOI: https://doi.org/10.1038/s43588-021-00073-4