Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

10.1101/2021.12.08.471868 ◽

2021 ◽

Author(s):

Alaina Shumate ◽

Brandon Wong ◽

Geo Pertea ◽

Mihaela Pertea

Keyword(s):

Rna Sequencing ◽

Open Source Software ◽

Transcriptome Assembly ◽

Simulated Data ◽

Real Data ◽

Short Read ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

Improved Accuracy

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are unable to span multiple exons. Long-read technology can capture full-length transcripts, but its high error rate often leads to mis-identified splice sites, and its low throughput makes quantification difficult. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. We demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

BMC Genomics ◽

10.1186/s12864-021-07702-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Padmini Ramachandran ◽

Niranjan Nagarajan ◽

Denis Bertrand ◽

...

Keyword(s):

Public Health ◽

Public Health Response ◽

High Quality ◽

Short Read ◽

Short Reads ◽

The Core ◽

Long Reads ◽

Health Response ◽

Long Read ◽

Core Genes

Abstract Background: Whole genome sequencing of cultured pathogens is the state-of-the-art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads. Results: We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies. Conclusion: The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.

Assembling reads improves taxonomic classification of species

10.21203/rs.3.rs-22309/v1 ◽

2020 ◽

Author(s):

Quang Tran ◽

Vinhthuy Phan

Keyword(s):

Classification Performance ◽

Performance Characteristics ◽

Metagenomic Data ◽

Species Classification ◽

Short Read ◽

Short Reads ◽

Sequencing Errors ◽

Trade Offs ◽

Long Reads ◽

Long Read

Abstract Background: Most current metagenomic classifiers and profilers employ short reads to classify, bin and profile microbial genomes that are present in metagenomic samples. Many of these methods adopt techniques that aim to identify unique genomic regions of genomes so as to differentiate them. Because of this, short-read lengths might be suboptimal. Longer read lengths might improve the performance of classification and profiling. However, longer reads produced by current technology tend to have a higher rate of sequencing errors, compared to short reads. It is not clear if the trade-off between longer length versus higher sequencing errors will increase or decrease classification and profiling performance. Results: We compared performance of popular metagenomic classifiers on short reads and longer reads, which are assembled from the same short reads. When using a number of popular assemblers to assemble long reads from the short reads, we discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Specifically, across most classifiers, we observed a significant increase in precision, while recall remained the same, resulting in higher overall classification performance. On real metagenomic data, we observed a similar trend that classifiers made fewer predictions. This suggested that they might have the same performance characteristics of having higher precision while maintaining the same recall with longer reads. Conclusions: This finding has two main implications. First, it suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Second, this finding suggests that it might be a good idea to consider utilizing long-read technologies in species classification for metagenomic applications. Current long-read technologies tend to have higher sequencing errors and are more expensive compared to short-read technologies. The trade-offs between the pros and cons should be investigated.
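The precision and recall pattern reported in the results can be illustrated with a small calculation. The sketch below is not the study's evaluation code: the species sets are made-up toy data and the function name `precision_recall` is hypothetical. It only shows how dropping false-positive predictions raises precision while recall stays unchanged.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for a set of predicted species labels."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
truth = {"E. coli", "B. subtilis", "S. aureus", "L. monocytogenes"}
short_read_calls = {"E. coli", "B. subtilis", "S. aureus",
                    "S. enterica", "K. pneumoniae", "P. aeruginosa"}
assembled_calls = {"E. coli", "B. subtilis", "S. aureus", "S. enterica"}

p_short, r_short = precision_recall(short_read_calls, truth)
p_long, r_long = precision_recall(assembled_calls, truth)
print(p_short, r_short)  # 0.5 0.75
print(p_long, r_long)    # 0.75 0.75
```

In this toy case the classifier makes fewer predictions on the assembled input (four instead of six), the same three true species are still found, and precision rises from 0.5 to 0.75 with recall fixed at 0.75.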

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

BMC Genomics ◽

10.1186/s12864-019-6286-9 ◽

2019 ◽

Vol 20 (S11) ◽

Author(s):

Arghya Kusum Das ◽

Sayan Goswami ◽

Kisung Lee ◽

Seung-Jong Park

Keyword(s):

Error Correction ◽

Error Rates ◽

De Bruijn Graph ◽

Correction Algorithm ◽

Short Read ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Error Correction Algorithm

Abstract Background: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, processing long reads is challenging because of their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared with short reads. Methods: In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy. Conclusion: ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
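The widest path (maximum min-coverage path) idea mentioned in the methods can be sketched on a toy de Bruijn graph. This is a single-machine illustration under stated assumptions, not ParLECH's distributed implementation: coverage is assumed to sit on k-mer nodes, the widest path is found with a bottleneck variant of Dijkstra's algorithm, and the function names (`build_debruijn`, `widest_path`, `spell`) and the toy reads are hypothetical.

```python
import heapq
from collections import Counter


def build_debruijn(reads, k):
    """Count k-mers in short reads and link k-mers that overlap by k-1 bases."""
    coverage = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            coverage[read[i:i + k]] += 1
    graph = {kmer: [] for kmer in coverage}
    for kmer in coverage:
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in coverage:
                graph[kmer].append(nxt)
    return graph, coverage


def widest_path(graph, coverage, source, target):
    """Return (path, width) maximizing the minimum k-mer coverage on the path."""
    if source not in graph or target not in graph:
        return None, 0
    best = {source: coverage[source]}
    parent = {source: None}
    heap = [(-coverage[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt in graph[node]:
            new_width = min(width, coverage[nxt])
            if new_width > best.get(nxt, 0):
                best[nxt] = new_width
                parent[nxt] = node
                heapq.heappush(heap, (-new_width, nxt))
    if target not in best:
        return None, 0
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best[target]


def spell(path):
    """Turn a k-mer path back into the sequence it spells."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy short reads: five error-free copies and one copy with a substitution.
reads = ["ATGGCGTGCA"] * 5 + ["ATGGCTTGCA"]
graph, coverage = build_debruijn(reads, k=4)
path, width = widest_path(graph, coverage, "ATGG", "TGCA")
print(spell(path), width)  # ATGGCGTGCA 5
```

Between the two anchor k-mers, the branch supported by five reads wins over the branch supported by one, so the well-covered sequence is the one that would replace the corresponding error region of a long read in this simplified picture.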

Long read metagenomics, the next step?

10.1101/2020.11.11.378109 ◽

2020 ◽

Author(s):

Jose M. Haro-Moreno ◽

Mario López-Pérez ◽

Francisco Rodríguez-Valera

Keyword(s):

Error Rate ◽

Population Genomics ◽

Short Read ◽

Short Reads ◽

Third Generation Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

Flexible Genome ◽

Long Read ◽

Generation Sequencing

Abstract Background: Third-generation sequencing has seen little uptake in metagenomics because of its high error rate and the dependence of assembly on bioinformatics tools designed for short reads. However, 2nd generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in allowing assembly of microbes with high microdiversity or retrieving the flexible (adaptive) compartment of prokaryotic genomes. Results: Here we have used different 3rd generation techniques to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared the latest-generation Oxford Nanopore and PacBio technologies with the classical approach using Illumina short reads followed by assembly. PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. Long reads allow efficient direct retrieval of complete genes (473M/Tb) and operons before assembly, facilitating annotation and compensating for the limitations of short reads or short-read assemblies. MetaSPAdes was the most appropriate assembly program when used in combination with short reads. The assemblies of the long reads also allow the reconstruction of much more complete metagenome-assembled genomes (MAGs), even from microbes with high microdiversity. The flexible genome of reconstructed MAGs is much more complete and allows rescuing more adaptive genes. Conclusions: For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Particularly for in-silico screening of biotechnologically useful genes, or population genomics, long-read metagenomics appears presently as a very fruitful approach and can be used from raw reads, before a computing-demanding (and potentially artefactual) assembly step.

The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies

10.1101/2019.12.17.864991 ◽

2019 ◽

Cited By ~ 3

Author(s):

Aleksey V. Zimin ◽

Steven L. Salzberg

Keyword(s):

Error Rate ◽

Low Cost ◽

Simulated Data ◽

Real Data ◽

Hybrid Strategy ◽

Short Read ◽

Sequencing Technologies ◽

Long Reads ◽

Genome Assemblies ◽

Polishing Tool

Abstract The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8–15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to “polish” the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Genome Biology ◽

10.1186/s13059-019-1910-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 54

Author(s):

Sam Kovaka ◽

Aleksey V. Zimin ◽

Geo M. Pertea ◽

Roham Razaghi ◽

Steven L. Salzberg ◽

...

Keyword(s):

Single Molecule ◽

Transcriptome Assembly ◽

Rna Seq ◽

Ability To Work ◽

Single Molecule Sequencing ◽

Short Read ◽

New Methods ◽

Long Reads ◽

Long Read

Abstract RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes

10.1101/530824 ◽

2019 ◽

Cited By ~ 3

Author(s):

Nicola De Maio ◽

Liam P. Shaw ◽

Alasdair Hubbard ◽

Sophie George ◽

Nick Sanderson ◽

...

Keyword(s):

Bacterial Genome ◽

Hybrid Assembly ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Genome Reconstruction ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Sequencing Platforms

Abstract Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.

Impact Statement: Illumina short-read sequencing is frequently used for tasks in bacterial genomics, such as assessing which species are present within samples, checking if specific genes of interest are present within individual isolates, and reconstructing the evolutionary relationships between strains. However, while short-read sequencing can reveal significant detail about the genomic content of bacterial isolates, it is often insufficient for assessing genomic structure: how different genes are arranged within genomes, and particularly which genes are on plasmids – potentially highly mobile components of the genome frequently carrying antimicrobial resistance elements. This is because Illumina short reads are typically too short to span repetitive structures in the genome, making it impossible to accurately reconstruct these repetitive regions. One solution is to complement Illumina short reads with long reads generated with SMRT Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) sequencing platforms. Using this approach, called ‘hybrid assembly’, we show that we can automatically fully reconstruct complex bacterial genomes of Enterobacteriaceae isolates in the majority of cases (best-performing method: 17/20 isolates). In particular, by comparing different methods we find that using the assembler Unicycler with Illumina and ONT reads represents a low-cost, high-quality approach for reconstructing bacterial genomes using publicly available software.

Data Summary: Raw sequencing data and assemblies have been deposited in NCBI under BioProject Accession PRJNA422511 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422511). We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.

Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

10.1101/320085 ◽

2018 ◽

Author(s):

Weiwen Wang ◽

Miriam Schalamun ◽

Alejandro Morales Suarez ◽

David Kainer ◽

Benjamin Schwessinger ◽

...

Keyword(s):

Chloroplast Genome ◽

Inverted Repeat ◽

Single Copy ◽

Test Case ◽

Short Read ◽

Short Reads ◽

Chloroplast Genomes ◽

Long Reads ◽

Long Read ◽

Hybrid Assemblies

Abstract Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. Chloroplast genomes code for around 130 genes, and the information they contain is widely used in agriculture and studies of evolution and ecology. Correctly assembling complete chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb). The advent of long-read sequencing technologies should alleviate this problem by providing sufficient information to completely span the inverted repeat regions. Yet, long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long (Oxford nanopore) and short (Illumina) reads, different long-read lengths, different assembly pipelines, and different genome polishing steps, with a view to determining the most accurate and efficient approach to chloroplast genome assembly. Results: Hybrid assemblies combining at least 20x coverage of both long-reads and short-reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs representing the long single copy, short single copy and inverted repeat regions of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate, even after polishing. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp, contains 131 genes of known function, and confirms the phylogenetic position of Eucalyptus pauciflora as a close relative of Eucalyptus regnans. Conclusions: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using a combination of at least 20x coverage of long- and short-reads respectively, provided that the long-reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
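The coverage thresholds in the conclusions translate into simple arithmetic on read lengths. The sketch below is only a back-of-the-envelope aid, not part of the study's pipeline: the genome length comes from the abstract, while the 150 bp short-read length, the 26,000 bp inverted-repeat threshold, the toy long-read set, and the function names are all assumptions made for illustration.

```python
import math

GENOME_LENGTH = 159_942  # bp, E. pauciflora chloroplast genome (from the abstract)


def coverage(read_lengths, genome_length=GENOME_LENGTH, min_length=0):
    """Mean depth contributed by reads of at least min_length bases."""
    total = sum(n for n in read_lengths if n >= min_length)
    return total / genome_length


def reads_needed(target_coverage, read_length, genome_length=GENOME_LENGTH):
    """Number of fixed-length reads needed to reach target_coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Assumed 150 bp short reads: how many are needed for 20x coverage?
print(reads_needed(20, 150))  # 21326

# Hypothetical long-read set: 100 reads of 30 kb and 200 reads of 5 kb.
long_reads = [30_000] * 100 + [5_000] * 200
total_depth = coverage(long_reads)
# Depth from reads longer than an assumed 26 kb inverted repeat.
spanning_depth = coverage(long_reads, min_length=26_000)
print(total_depth >= 20, spanning_depth >= 5)  # True True
```

With these assumed numbers the toy long-read set would meet both conditions named in the conclusions: at least 20x overall and at least about 5x from reads long enough to span the inverted repeat.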

DEBKS: A Tool to Detect Differentially Expressed Circular RNA

10.1101/2020.10.14.336982 ◽

2020 ◽

Author(s):

Zelin Liu ◽

Huiru Ding ◽

Jianqi She ◽

Chunhua Chen ◽

Weiguang Zhang ◽

...

Keyword(s):

Open Source ◽

Rna Sequencing ◽

Open Source Software ◽

Simulated Data ◽

Circular Rna ◽

Host Gene ◽

Circular Rnas ◽

Biological Processes ◽

Rna Seq ◽

Disease Pathogenesis

Abstract Circular RNAs (circRNAs) are involved in various biological processes and in disease pathogenesis. However, only a small number of functional circRNAs have been identified among hundreds of thousands of circRNA species, partly because most current methods are based on circular junction counts and overlook the fact that circRNA is formed from the host gene by back-splicing (BS). To distinguish between expression originating from BS and that from the host gene, we present DEBKS, a software program to streamline the discovery of differential BS between two rRNA-depleted RNA sequencing (RNA-seq) sample groups. By applying real and simulated data and employing RT-qPCR for validation, we demonstrate that DEBKS is efficient and accurate in detecting circRNAs with differential BS events between paired and unpaired sample groups. DEBKS is available at https://github.com/yangence/DEBKS as open-source software.

Complete Genome Sequences of 12 Quinolone-Resistant Escherichia coli Strains Containing qnrS1 Based on Hybrid Assemblies

Microbiology Resource Announcements ◽

10.1128/mra.01190-20 ◽

2021 ◽

Vol 10 (4) ◽

Author(s):

Håkon Kaspersen ◽

Thomas H. A. Haverkamp ◽

Hanna Karin Ilag ◽

Øivind Øines ◽

Camilla Sekse ◽

...

Keyword(s):

Escherichia Coli ◽

Complete Genome ◽

Flow Cell ◽

Hybrid Assembly ◽

Genome Sequences ◽

Content Type ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

Hybrid Assemblies

ABSTRACT In total, 12 quinolone-resistant Escherichia coli (QREC) strains containing qnrS1 were submitted to long-read sequencing using a FLO-MIN106 flow cell on a MinION device. The long reads were assembled with short reads (Illumina) and analyzed using the MOB-suite pipeline. Six of these QREC genome sequences were closed after hybrid assembly.
