sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks

Plant Molecular Biology

Abstract

Key message

We proposed an ensemble convolutional neural network model that identifies sgRNAs with high on-target activity in four crops, using one-hot encoding and k-mers to encode the sequences.

Abstract

As a key component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has contributed to the improvement of agronomic species, but effective bioinformatics tools for identifying sgRNA activity in these species are lacking. A machine learning method for identifying sgRNAs with high on-target activity is therefore needed. In this work, we proposed a simple convolutional neural network method for this task. We used one-hot encoding and k-mers to convert the sequence data, and a voting algorithm to combine convolutional neural networks into the ensemble model sgRNACNN, which predicts sgRNA activity. We applied sgRNACNN to four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracies for these four crops were 82.43%, 80.33%, 78.25% and 87.49%, respectively. These results show that sgRNACNN can identify sgRNAs with high on-target activity in agronomic data and can, to a certain extent, meet the demand for sgRNA activity prediction in agronomy. They may also help guide crop gene editing and academic research. The source code and the relevant dataset are available at https://github.com/nmt315320/sgRNACNN.git.
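The encoding and voting steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four-letter alphabet, the choice of k = 2, the handling of unknown bases, the tie rule in the vote, and the function names `one_hot`, `kmer_counts` and `majority_vote` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; the paper's exact encoding may differ


def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector (columns follow ALPHABET)."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:  # assumption: unknown symbols such as N stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_counts(seq, k=2):
    """Count overlapping k-mers, returned in fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts[kmer] for kmer in vocab]


def majority_vote(labels):
    """Hard voting over binary labels (1 = high activity); ties resolve to 0."""
    return 1 if sum(labels) * 2 > len(labels) else 0


example = "ACGTAC"  # toy sequence, not taken from the paper's dataset
print(one_hot(example)[:2])       # first two bases as one-hot rows
print(kmer_counts(example, k=2))  # 16 dinucleotide counts
print(majority_vote([1, 0, 1]))   # three hypothetical model votes
```

In an ensemble of this kind, each convolutional network would contribute one label per sgRNA, and `majority_vote` would combine those labels into the final prediction.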


Code availability

https://github.com/nmt315320/sgRNACNN.git.

Abbreviations

ACC: accuracy
AUC: area under curve
BN: batch normalization
CNN: convolutional neural networks
LSTM: long short-term memory
MCC: Matthews correlation coefficient
NB: Naïve Bayes
PAM: protospacer adjacent motif
RF: random forest
RNN: recurrent neural networks
SE: sensitivity
SP: specificity
sgRNA: single-guide RNA
SVM: support vector machine
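The evaluation abbreviations above (SE, SP, ACC and MCC) have standard definitions in terms of confusion-matrix counts. The sketch below assumes the paper uses these usual forms; the function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are illustrative choices, not taken from the source.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return SE, SP, ACC and MCC from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    se = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on negatives)
    acc = (tp + tn) / total if total else 0.0   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed 0.0 when undefined
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, not results from the paper.
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```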


Acknowledgements

The authors are very much indebted to the anonymous reviewers, whose constructive comments are very helpful for strengthening the presentation of this paper. The work was supported by the National Natural Science Foundation of China (Nos. 91935302, 61922020, 61771331).

Author information

Contributions

M.N. conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. Y.L. coordinated the study and handled project administration. Q.Z. supervised the work and acquired funding. All authors have read and approved the manuscript for submission.

Corresponding authors

Correspondence to Yuan Lin or Quan Zou.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Niu, M., Lin, Y. & Zou, Q. sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. Plant Mol Biol 105, 483–495 (2021). https://doi.org/10.1007/s11103-020-01102-y

  • DOI: https://doi.org/10.1007/s11103-020-01102-y

Keywords