Chromosomal-Level Assembly of Yellow Catfish Genome Using Third-Generation DNA Sequencing and Hi-C Analysis
1
College of Fisheries, Key Laboratory of Freshwater Animal Breeding,
Ministry of Agriculture, Huazhong Agricultural University, Wuhan, China.
2
Wuhan Frasergen Bioinformatics, East Lake High-Tech Zone, Wuhan,
China.
3
State Key Laboratory of Freshwater Ecology and Biotechnology, Institute
of Hydrobiology, Chinese Academy of Sciences, University of the Chinese
Academy of Sciences, Wuhan, China.
4
Institute of Oceanology, Chinese Academy of Sciences, Qingdao,
Shandong, China
5
Department of Molecular Biology and Biochemistry, Simon Fraser
University, Burnaby, Canada
#
These authors contributed equally to this work.
Abstract
Data description
Introduction
An XX genotype female yellow catfish (Figure 1), reared in the breeding center of
Huazhong Agricultural University in Wuhan City, Hubei Province, was used for
The extracted DNA was sequenced on both the Illumina HiSeq X Ten platform
(Illumina Inc., San Diego, CA, USA) and the PacBio Sequel platform. Short reads
from the Illumina platform were used to estimate the genome size, the level of
heterozygosity and the repeat content of the genome, and long reads from the
PacBio platform were used for genome assembly. To this end, one library with an
insert size of 250 bp was generated for the HiSeq X Ten platform and three 20 kb
libraries were constructed for the PacBio platform according to the
manufacturers' protocols, yielding ~51.2 Gb of short reads and ~38.9 Gb of long
reads, respectively (Table 1). The N50 lengths of the polymerase reads and
subreads reached 21.3 kb and 16.2 kb, respectively, providing long genomic
sequences for the subsequent assembly.
The short reads from the Illumina platform were quality filtered with HTQC
v1.92.3 [10] in four steps. First, adaptors were removed from the sequencing
reads. Second, read pairs were excluded if either end had an average quality
lower than 20. Third, the ends of reads were trimmed if the average quality
within a 5 bp sliding window was lower than 20. Finally, read pairs in which
either end was shorter than 75 bp were removed.
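The filtering steps that follow adaptor removal can be sketched in Python as shown here. This is an illustrative sketch rather than HTQC's implementation: the function names are hypothetical, base qualities are assumed to be lists of Phred scores, and trimming is assumed to proceed from the 3' end only, which the text does not specify.

```python
from statistics import mean

MIN_QUALITY = 20   # average-quality threshold from the text
WINDOW = 5         # sliding-window size in bp
MIN_LENGTH = 75    # minimum read length after trimming


def trimmed_length(quals, window=WINDOW, min_q=MIN_QUALITY):
    """Return how many bases remain after trimming the 3' end.

    Assumption: bases are dropped one at a time from the 3' end while the
    mean quality of the last `window` bases is below `min_q`.
    """
    end = len(quals)
    while end >= window and mean(quals[end - window:end]) < min_q:
        end -= 1
    return end


def filter_pair(quals1, quals2):
    """Return the kept lengths of both reads, or None if the pair is dropped."""
    # Step 2: drop the pair if either end has a mean quality below 20.
    if any(not q or mean(q) < MIN_QUALITY for q in (quals1, quals2)):
        return None
    # Step 3: trim low-quality 3' ends.
    kept = (trimmed_length(quals1), trimmed_length(quals2))
    # Step 4: drop the pair if either trimmed read is shorter than 75 bp.
    if min(kept) < MIN_LENGTH:
        return None
    return kept
```

A pair is kept only when both reads pass every step, which mirrors the pair-wise wording of the filtering rules.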
The quality-filtered reads were used for genome size estimation. Using the k-mer
method described previously [11], we calculated and plotted the 17-mer depth
distribution (Supplementary Figure 1). The genome size of yellow catfish was
estimated with the formula G = N_{17-mer} / D_{17-mer}, where N_{17-mer} is the
total number of 17-mers and D_{17-mer} is the depth at the peak of the 17-mer
distribution. As a result, we estimated a genome size of 714 Mb, a
heterozygosity rate of 0.45% and a repeat ratio of 43.31%. To
confirm the robustness of the genome size estimation, we performed additional
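The k-mer formula above, G = N / D, can be illustrated with a short Python sketch. The function name, the histogram layout (depth mapped to the number of distinct k-mers at that depth) and the `min_depth` cut-off used to skip low-depth error k-mers when locating the peak are all assumptions made for illustration; the toy histogram is invented and is not the yellow catfish data.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as G = N / D from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: assumed cut-off so that error k-mers at very low depth are
               not mistaken for the main peak.
    Returns (estimated_size, peak_depth).
    """
    # N: total number of k-mer occurrences (depth times distinct k-mers).
    total_kmers = sum(depth * count for depth, count in histogram.items())
    # D: depth with the most distinct k-mers, ignoring the error region.
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mer depths at or above min_depth")
    peak_depth = max(candidates, key=candidates.get)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values, for illustration only).
toy = {1: 1000, 2: 100, 29: 50, 30: 100, 31: 50}
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```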
Although the sizes of the genome assemblies from both Falcon and Canu were
comparable with the estimate based on the k-mer method, the continuity of the
assemblies needed further improvement. Genome Puzzle Master (GPM) is a tool that
guides genome assembly from fragmented sequences using overlap information among
contigs from genome assemblies [14]. Based on the complementarity of the two
assemblies, contigs can be merged and gaps filled by sequences bridging two
contigs [15]. Taking advantage of the sequence complementarity of the Falcon and
Canu assemblies, we therefore applied GPM [14] to merge long contigs using
reliable overlaps between sequences. Finally, a ~730 Mb genome assembly of
yellow catfish with 3,564 contigs and a contig N50/L50 of 1.1 Mb/126 was
constructed. The final genome sequences were then polished with Arrow [16] using
PacBio long reads and with Pilon release 1.12 [17] using Illumina short reads to
correct base-level errors. The length distribution of contigs in the final
assembly is presented in Supplementary Figure 2.
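For readers unfamiliar with the N50/L50 statistics quoted above, the following Python sketch shows the standard definition: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly, and Lx is the number of sequences in that set. The function name and the toy lengths are hypothetical.

```python
def nx_lx(lengths, x=50):
    """Return (Nx, Lx) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length, rank


# Toy example with made-up lengths (total 300 bp).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(toy_lengths, 50))
print(nx_lx(toy_lengths, 90))
```

Calling the same function with x set to 60, 70, 80 and 90 gives the remaining rows of an assembly statistics table.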
In situ Hi-C library construction and chromosome assembly using Hi-C data
A total of 487 million raw reads were generated from the Hi-C library and mapped
to the polished yellow catfish genome using Bowtie 1.2.2 (RRID:SCR_005476) [21]
with the default parameters. An iterative mapping method was used to increase
the ratio of interactive Hi-C reads [22]. The two ends of each read pair were
mapped to the genome independently, and only pairs in which both ends mapped
uniquely were retained. Self-ligation, non-ligation and other invalid reads,
such as StartNearRsite, PCR amplification, random break, LargeSmallFragments
and ExtremeFragments, were filtered out using hiclib as described in previous
reports [23]. The contact counts between each pair of contigs were calculated
and normalized by the number of restriction sites in the sequences (Figure 2).
We then successfully clustered 2,965 contigs into 26
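The contact counting and restriction-site normalization described above can be sketched as follows. The text does not give the exact normalization, so this sketch assumes that the raw count between two contigs is divided by the product of their restriction-site counts; the function name and the input layout are hypothetical.

```python
from collections import Counter


def normalized_contacts(read_pairs, site_counts):
    """Count inter-contig Hi-C contacts and normalize by restriction sites.

    read_pairs:  iterable of (contig_a, contig_b), one tuple per valid pair.
    site_counts: dict mapping contig name -> number of restriction sites.
    Assumption: normalized = raw count / (sites_a * sites_b).
    """
    raw = Counter()
    for a, b in read_pairs:
        if a != b:  # only contacts between different contigs are informative
            raw[tuple(sorted((a, b)))] += 1
    return {
        pair: count / (site_counts[pair[0]] * site_counts[pair[1]])
        for pair, count in raw.items()
    }


# Toy input (hypothetical contigs and counts).
pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c3"), ("c1", "c1")]
sites = {"c1": 10, "c2": 5, "c3": 2}
print(normalized_contacts(pairs, sites))
```

Dividing by the number of restriction sites keeps long contigs, which naturally collect more contacts, from dominating the clustering signal.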
First, we compared the assembly continuity of the yellow catfish genome with
that of other teleost species. We found that both the contig and scaffold N50
lengths of the yellow catfish assembly reached considerable continuity
(Figure 3), providing a high-quality genome sequence for subsequent functional
investigations. The assembled genome was also assessed with BUSCO v3.0 [27]
(RRID:SCR_015008) using the actinopterygii_odb9 database to evaluate its
completeness. Among the 4,584 BUSCO groups searched, 4,179 were complete and 92
were partially identified; the complete genes correspond to 91.2% (4,179/4,584)
of BUSCO genes in the yellow catfish genome. After aligning short reads from
the Illumina platform to the genome, the insert size distribution of the 250 bp
sequencing library exhibited a single peak around the designed library size
(Supplementary Figure 5). Paired-end read data were not used during contig
assembly; thus, the high alignment ratio and the single-peak insert size
distribution demonstrate the high quality of the contig assembly for yellow
catfish. Using the
Illumina short read alignment to the reference genome of the yellow catfish by BWA
We first used Tandem Repeats Finder [29] to identify repetitive elements in the
yellow catfish genome. RepeatModeler
(http://www.repeatmasker.org/RepeatModeler.html, RRID:SCR_015027) was used to
detect transposable elements (TEs) in the genome de novo. The de novo library
and the known repeat library from Repbase [30] were then combined, and TEs were
detected by mapping the yellow catfish genome sequences to the combined library
using RepeatMasker 4.0.7 (RRID:SCR_012954) [31].
To reveal the phylogenetic relationships among yellow catfish and other fish
species, the protein sequences of single-copy orthologous gene families were
aligned with MUSCLE 3.8.31 (RRID:SCR_011812) [43], and the corresponding coding
DNA sequence (CDS) alignments were generated and concatenated under the guidance
of the protein alignments. PhyML v3.3 (RRID:SCR_014629) [44] was used to
construct the phylogenetic tree from the super-alignment of nucleotide sequences
using the JTT+G+F model. Using divergence times from the TimeTree database [45]
as molecular clock calibrations, the PAML v4.8 MCMCtree program [46] was
employed to determine divergence times with the approximate likelihood
calculation method. The phylogenetic relationships of the other fish species
were consistent with previous studies [47]. The phylogenetic analysis based on
single-copy orthologs of yellow catfish and the other teleosts studied in this
work estimated that the yellow catfish diverged from its common ancestor with
the channel catfish around 81.9 million years ago (Figure 5). Given that yellow
catfish and channel catfish belong to the families Bagridae and Ictaluridae,
respectively, the phylogenetic analysis indicates that Bagridae and Ictaluridae
separated on a comparable time scale; however, a more precise time estimate
requires more Siluriformes genomes.
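The step in which CDS alignments are generated under the guidance of protein alignments and then concatenated can be illustrated with a minimal Python sketch. It is not the authors' script: the function names are hypothetical, each CDS is assumed to be in frame and to start at the first aligned residue, and any codons beyond the aligned protein (such as a terminal stop codon) are ignored.

```python
def thread_cds(aligned_protein, cds, gap="-"):
    """Place the codons of `cds` onto a gapped protein alignment row.

    Each residue is replaced by its codon and each protein gap by three
    nucleotide gaps, so the nucleotide alignment follows the protein one.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) < residues:
        raise ValueError("CDS is too short for the aligned protein")
    codon_iter = iter(codons)
    return "".join(
        gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein
    )


def concatenate(gene_alignments):
    """Join per-gene codon alignments into one super-alignment per species.

    gene_alignments: list of dicts mapping species -> aligned CDS; every
    gene is assumed to contain exactly the same species (single-copy set).
    """
    species = sorted(gene_alignments[0])
    return {sp: "".join(gene[sp] for gene in gene_alignments) for sp in species}


# Toy example with made-up sequences.
gene1 = {
    "sp_a": thread_cds("M-K", "ATGAAATAA"),
    "sp_b": thread_cds("MTK", "ATGACCAAG"),
}
gene2 = {"sp_a": thread_cds("P", "CCC"), "sp_b": thread_cds("P", "CCA")}
print(concatenate([gene1, gene2]))
```

Threading codons onto the protein alignment keeps codon boundaries intact, which a direct nucleotide alignment would not guarantee.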
Conclusion
The raw sequencing and physical mapping data are available from NCBI under the
accession numbers SRR7817079, SRR7817060 and SRR7818403 via the project
PRJNA489116, as well as from the National Omics Data Encyclopedia
(NODE) (http://www.biosino.org/node/index) via the project ID OEP000129
(http://www.biosino.org/node/project/detail/OEP000129). The genome, annotation
and intermediate files and results are also available via the GigaScience GigaDB
repository [49]. All supplementary figures and tables are provided in
Supplementary Tables 1-3 and Supplementary Figures 1-5.
Software URLs
HTQC https://sourceforge.net/projects/htqc/
Falcon https://github.com/PacificBiosciences/FALCON/wiki/Manual
Canu https://github.com/marbl/canu
GPM https://github.com/Jianwei-Zhang/LIMS
Pilon https://github.com/broadinstitute/pilon/
Bowtie http://bowtie-bio.sourceforge.net/index.shtml
Hiclib https://bitbucket.org/mirnylab/hiclib/src
Lachesis https://github.com/shendurelab/LACHESIS
Juicebox https://www.aidenlab.org/juicebox/
BUSCO https://busco.ezlab.org/
BWA http://bio-bwa.sourceforge.net/
GATK https://software.broadinstitute.org/gatk/
RepeatModeler http://www.repeatmasker.org/RepeatModeler.html
RepeatMasker http://repeatmasker.org/
Augustus https://ngs.csr.uky.edu/Augustus
BLAST https://blast.ncbi.nlm.nih.gov/Blast.cgi
TopHat https://ccb.jhu.edu/software/tophat/index.shtml
Cufflinks http://cole-trapnell-lab.github.io/cufflinks/
MAKER http://www.yandell-lab.org/software/maker.html
Blast2GO https://www.blast2go.com/
OrthoMCL https://github.com/apetkau/orthomcl-pipeline
MUSCLE http://www.drive5.com/muscle/
PhyML https://github.com/stephaneguindon/phyml
TimeTree http://timetree.org/
PAML http://abacus.gene.ucl.ac.uk/software/paml.html
Abbreviations
Competing interests
Funding
This work was supported by the China Agriculture Research System (CARS-46) and the
Fundamental Research Funds for the Central Universities (2662017PY013).
Author Contributions
Jie Mei, Jian-Fang Gui and Nansheng Chen conceived the study; Dan Chen, Jicheng
Zhang, Wenjie Guo and Peipei Huang collected the samples and performed
sequencing and Hi-C experiments; Shijun Xiao, Gaorui Gong and Yan He estimated
the genome size and assembled the genome; Shijun Xiao, Gaorui Gong and Xiaohui
Li assessed the assembly quality; Gaorui Gong, Shijun Xiao, Yang Xiong and Junjie
Wu carried out the genome annotation and functional genomic analysis; Jie Mei,
Nansheng Chen, Shijun Xiao, Gaorui Gong and Jian-Fang Gui wrote the manuscript.
All authors read, edited, and approved the final manuscript.
References
1 Liu, H. et al. Genetic manipulation of sex ratio for the large-scale breeding of YY super-male
and XY all-male yellow catfish (Pelteobagrus fulvidraco (Richardson)). Marine Biotechnology
15, 321-328 (2013).
2 Zhang, J. et al. Characterization and development of EST-SSR markers derived from
transcriptome of yellow catfish. Molecules 19, 16402-16415 (2014).
3 Liu, F. et al. Effects of astaxanthin and emodin on the growth, stress resistance and disease
resistance of yellow catfish (Pelteobagrus fulvidraco). Fish & Shellfish Immunology 51, 125
(2016).
4 Mei, J. & Gui, J. F. Genetic basis and biotechnological manipulation of sexual dimorphism and
sex determination in fish. Science China Life Sciences 58, 124 (2015).
5 Chen, X. et al. A comprehensive transcriptome provides candidate genes for sex
determination/differentiation and SSR/SNP markers in yellow catfish. Marine Biotechnology
17, 190-198 (2015).
6 Dan, C., Mei, J., Wang, D. & Gui, J. F. Genetic Differentiation and Efficient Sex-specific Marker
Development of a Pair of Y- and X-linked Markers in Yellow Catfish. International Journal of
Biological Sciences 9, 1043-1049 (2013).
7 Yang, T.-Y. et al. Production of XX male yellow catfish by sex-reversal technology. Acta
Hydrobiologica Sinica 42, 871-878 (2018).
Figure 2. Yellow catfish genome contig contact matrix using Hi-C data. The color bar
indicates the logarithm of the contact density, from red (high) to white (low). Note
that only sequences anchored on chromosomes are shown in the plot.
Downloaded from https://academic.oup.com/gigascience/advance-article-abstract/doi/10.1093/gigascience/giy120/5106933 by guest on 27 September 2018
Figure 3. Genome assembly comparison of yellow catfish with other public teleost
genomes. The X and Y axes represent the contig and scaffold N50 lengths, respectively.
The genomes sequenced with third-generation sequencing are highlighted in red.
Figure 4. Length distribution comparison of total gene, CDS, exon and intron of
annotated gene models of the yellow catfish with other closely related teleost fish
species. Length distributions of total gene (A), CDS (B), exon (C) and intron (D) were
compared among P. fulvidraco, D. rerio, G. aculeatus, O. latipes, I. punctatus and
T. rubripes.
Figure 5. Phylogenetic analysis of the yellow catfish with other teleost species. The
estimated species divergence times (MYA) and their 95% confidence intervals are labeled
at each branch site. The divergences used for time calibration are indicated by red dots
in the tree. The fish from the order Siluriformes (I. punctatus and P. fulvidraco) are
highlighted by pink shading.
Table 1. Sequencing data generated for yellow catfish genome assembly and
annotation. Note that paired-end 150 bp reads were generated from the Illumina HiSeq X
Ten platform.
Table 2. Statistics for genome assembly of yellow catfish. Note that contigs were
analyzed after the scaffolding based on Hi-C data.
Sample ID   Length                          Number
            Contig** (bp)   Scaffold (bp)   Contig**   Scaffold
Total       731,603,425     732,815,925     3,652      1,227
Max         11,531,338      55,095,979      -          -
N50         1,111,198       25,785,924      126        11
N60         643,552         24,806,204      212        14
N70         333,994         22,397,207      373        17
N80         128,419         21,591,549      742        21
N90         59,682          16,750,011      1,634      25
Table 3. Statistics for genome annotation of yellow catfish. Note that the e-value
threshold of the 1e-5 was applied during the homolog searching for the functional annotation.
Database   Number    Percent (%)
GO         14,936    60.83
NR         24,308    99.01
Total      24,552