Automatic Gene Function Prediction in the 2020’s
Abstract
1. Introduction
- How to extend AFP beyond homology transfer.
- How to define protein function in a standardized way.
- How to properly evaluate AFP methods.
- How can we deal with biological function being tissue-, cell-type-, or condition-specific?
- How do we predict functions in non-model species?
- What data sources should be used for predicting function?
- How does missingness or bias in GO annotations affect the training of AFP models?
- How can we better exploit the Gene Ontology structure to improve functional annotation?
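As a small illustration of what "exploiting the Gene Ontology structure" can mean in practice, the sketch below applies the GO true-path rule: a protein annotated with a term is implicitly annotated with all of that term's ancestors. This is a minimal sketch under stated assumptions, not a method taken from this article. The toy ontology, the term identifiers, and the function names `ancestors` and `propagate` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy ontology: child term -> set of parent terms.
# The identifiers are placeholders, not real GO accessions.
TOY_PARENTS = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:catalysis": {"T:root"},
    "T:dna_binding": {"T:binding"},
    "T:kinase": {"T:catalysis", "T:binding"},
}


def ancestors(term, parents):
    """Return every ancestor of `term` in the DAG, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(annotations, parents):
    """Close each protein's term set under the ancestor relation."""
    closed = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        closed[protein] = full
    return closed


if __name__ == "__main__":
    # Assumed example input: one protein with one specific annotation.
    print(sorted(propagate({"P1": {"T:kinase"}}, TOY_PARENTS)["P1"]))
```

Propagating labels this way before training or evaluation keeps predictions consistent with the hierarchy, since a specific term can never be considered present while one of its ancestors is absent.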
2. Tissue/Condition-Specificity
2.1. Protein Representation
2.2. Function Representation
2.3. Prediction Methods
3. Going beyond Model Species
3.1. Protein Representation
3.2. Function Representation
3.3. Prediction Methods
4. Overlooked Data Sources
4.1. Protein Representation
4.2. Function Representation
4.3. Prediction Methods
5. Biased and Missing Annotations
5.1. Protein Representation
5.2. Function Representation
5.3. Prediction Methods
6. Gene Ontology
6.1. Protein Representation
6.2. Function Representation
6.3. Prediction Methods
7. Evaluation of AFP Algorithms
8. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Bateman, A. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019, 47, D506–D515.
- González-Castro, T.B.; Tovilla-Zárate, C.A.; Genis-Mendoza, A.D.; Juárez-Rojop, I.E.; Nicolini, H.; López-Narváez, M.L.; Martínez-Magaña, J.J. Identification of gene ontology and pathways implicated in suicide behavior: Systematic review and enrichment analysis of GWAS studies. Am. J. Med. Genet. Part B Neuropsychiatr. Genet. 2019, 180, 320–329.
- You, R.; Zhang, Z.; Xiong, Y.; Sun, F.; Mamitsuka, H.; Zhu, S. GOLabeler: Improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 2018, 34, 2465–2473.
- Das, S.; Lee, D.; Sillitoe, I.; Dawson, N.L.; Lees, J.G.; Orengo, C.A. Functional classification of CATH superfamilies: A domain-based approach for protein function annotation. Bioinformatics 2015, 31, 3460–3467.
- Piovesan, D.; Tosatto, S.C.E. INGA 2.0: Improving protein function prediction for the dark proteome. Nucleic Acids Res. 2019, 47, W373–W378.
- Jain, A.; Kihara, D. Phylo-PFP: Improved automated protein function prediction using phylogenetic distance of distantly related sequences. Bioinformatics 2018, 35, 753–759.
- Zhang, C.; Freddolino, P.L.; Zhang, Y. COFACTOR: Improved protein function prediction by combining structure, sequence and protein–protein interaction information. Nucleic Acids Res. 2017, 45, W291–W299.
- You, R.; Yao, S.; Xiong, Y.; Huang, X.; Sun, F.; Mamitsuka, H.; Zhu, S. NetGO: Improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 2019, 47, W379–W387.
- Kulmanov, M.; Hoehndorf, R. DeepGOPlus: Improved protein function prediction from sequence. Bioinformatics 2019, 36, 422–429.
- Radivojac, P.; Clark, W.T.; Oron, T.R.; Schnoes, A.M.; Wittkop, T.; Sokolov, A.; Graim, K.; Funk, C.; Verspoor, K.; Ben-Hur, A.; et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 2013, 10, 221–227.
- Jiang, Y.; Oron, T.R.; Clark, W.T.; Bankapur, A.R.; D’Andrea, D.; Lepore, R.; Funk, C.S.; Kahanda, I.; Verspoor, K.M.; Ben-Hur, A.; et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016, 17, 184.
- Zhou, N.; Jiang, Y.; Bergquist, T.R.; Lee, A.J.; Kacsoh, B.Z.; Crocker, A.W.; Lewis, K.A.; Georghiou, G.; Nguyen, H.N.; Hamid, M.N.; et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019, 20, 1–23.
- Godzik, A.; Jambon, M.; Friedberg, I. Computational protein function prediction: Are we making progress? Cell. Mol. Life Sci. 2007, 64, 2505.
- Cozzetto, D.; Buchan, D.W.; Bryson, K.; Jones, D.T. Protein function prediction by massive integration of evolutionary analyses and multiple data sources. BMC Bioinform. 2013, 14, S1.
- Lan, L.; Djuric, N.; Guo, Y.; Vucetic, S. MS-kNN: Protein function prediction by integrating multiple data sources. BMC Bioinform. 2013, 14 (Suppl. 3), S8.
- Farahbod, M.; Pavlidis, P. Differential coexpression in human tissues and the confounding effect of mean expression levels. Bioinformatics 2019, 35, 55–61.
- Sonawane, A.R.; Platig, J.; Fagny, M.; Chen, C.Y.; Paulson, J.N.; Lopes-Ramos, C.M.; DeMeo, D.L.; Quackenbush, J.; Glass, K.; Kuijjer, M.L. Understanding Tissue-Specific Gene Regulation. Cell Rep. 2017, 21, 1077–1088.
- Jiang, Z.; Dong, X.; Li, Z.G.; He, F.; Zhang, Z. Differential coexpression analysis reveals extensive rewiring of Arabidopsis gene coexpression in response to Pseudomonas syringae infection. Sci. Rep. 2016, 6, 35064.
- Singh, A.J.; Ramsey, S.A.; Filtz, T.M.; Kioussi, C. Differential gene regulatory networks in development and disease. Cell. Mol. Life Sci. 2018, 75, 1013–1025.
- Basha, O.; Shpringer, R.; Argov, C.M.; Yeger-Lotem, E. The DifferentialNet database of differential protein-protein interactions in human tissues. Nucleic Acids Res. 2018, 46, D522–D526.
- Greene, C.S.; Krishnan, A.; Wong, A.K.; Ricciotti, E.; Zelaya, R.A.; Himmelstein, D.S.; Zhang, R.; Hartmann, B.M.; Zaslavsky, E.; Sealfon, S.C.; et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 2015, 47, 569–576.
- Diehl, A.D.; Meehan, T.F.; Bradford, Y.M.; Brush, M.H.; Dahdul, W.M.; Dougall, D.S.; He, Y.; Osumi-Sutherland, D.; Ruttenberg, A.; Sarntivijai, S.; et al. The Cell Ontology 2016: Enhanced content, modularization, and ontology interoperability. J. Biomed. Semant. 2016, 7, 44.
- Zitnik, M.; Leskovec, J. Predicting multicellular function through multi-layer tissue networks. Bioinformatics 2017, 33, i190–i198.
- Mahdavi, S.; Khoshraftar, S.; An, A. Dynnode2vec: Scalable Dynamic Network Embedding. In Proceedings of the 2018 IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, 10–13 December 2018; pp. 3762–3765.
- Jaitin, D.A.; Kenigsberg, E.; Keren-Shaul, H.; Elefant, N.; Paul, F.; Zaretsky, I.; Mildner, A.; Cohen, N.; Jung, S.; Tanay, A.; et al. Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types. Science 2014, 343, 776–779.
- Papatheodorou, I.; Moreno, P.; Manning, J.; Fuentes, A.M.P.; George, N.; Fexova, S.; Fonseca, N.A.; Füllgrabe, A.; Green, M.; Huang, N.; et al. Expression Atlas update: From tissues to single cells. Nucleic Acids Res. 2019, 48, D77–D83.
- Thul, P.J.; Lindskog, C. The human protein atlas: A spatial map of the human proteome. Protein Sci. 2018, 27, 233–244.
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
- GO Consortium. Guide to GO Evidence Codes. 2016. Available online: http://geneontology.org/page/guide-go-evidence-codes (accessed on 30 July 2020).
- Annotation Extension. Available online: http://wiki.geneontology.org/index.php/Annotation_Extension (accessed on 12 September 2020).
- Thomas, P.D.; Hill, D.P.; Mi, H.; Osumi-Sutherland, D.; Van Auken, K.; Carbon, S.; Balhoff, J.P.; Albou, L.P.; Good, B.; Gaudet, P.; et al. Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nat. Genet. 2019, 51, 1429–1433.
- Lock, E.F.; Hoadley, K.A.; Marron, J.S.; Nobel, A.B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 2013, 7, 523–542.
- Way, G.P.; Greene, C.S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 2018, 23, 80–91.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680.
- Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784.
- Perez, L.; Wang, J. The Effectiveness of Data Augmentation in Image Classification using Deep Learning. arXiv 2017, arXiv:1712.04621.
- Wan, C.; Jones, D.T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat. Mach. Intell. 2020, 2, 540–550.
- Appels, R.; Eversole, K.; Stein, N.; Feuillet, C.; Keller, B.; Rogers, J.; Pozniak, C.J.; Choulet, F.; Distelfeld, A.; Poland, J.; et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 2018, 361, 661.
- Richoux, F.; Servantie, C.; Borès, C.; Téletchéa, S. Comparing two deep learning sequence-based models for protein-protein interaction prediction. arXiv 2019, arXiv:1901.06268.
- Sigalova, O.M.; Shaeiri, A.; Forneris, M.; Furlong, E.E.; Zaugg, J.B. Predictive features of gene expression variation reveal a mechanistic link between expression variation and differential expression. bioRxiv 2020.
- Wang, S.; Cho, H.; Zhai, C.; Berger, B.; Peng, J. Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics 2015, 31, i357–i364.
- Duong, D.; Uppunda, A.; Gai, L.; Ju, C.; Zhang, J.; Chen, M.; Eskin, E.; Li, J.J.; Chang, K.W. Evaluating Representations for Gene Ontology Terms. bioRxiv 2020.
- Chamberlain, B.P.; Clough, J.; Deisenroth, M.P. Neural Embeddings of Graphs in Hyperbolic Space. arXiv 2017, arXiv:1705.10359.
- Li, X.; Sun, Z.; Xue, J.H.; Ma, Z. A Concise Review of Recent Few-shot Meta-learning Methods. arXiv 2020, arXiv:2005.10953.
- Xian, Y.; Schiele, B.; Akata, Z. Zero-Shot Learning—The Good, the Bad and the Ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Huynh, D.; Elhamifar, E. Fine-Grained Generalized Zero-Shot Learning via Dense Attribute-Based Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
- Wang, S.; Pisco, A.O.; McGeever, A.; Brbic, M.; Zitnik, M.; Darmanis, S.; Leskovec, J.; Karkanias, J.; Altman, R.B. Unifying single-cell annotations based on the Cell Ontology. bioRxiv 2020.
- Kouw, W.M.; Loog, M. An introduction to domain adaptation and transfer learning. arXiv 2018, arXiv:1812.11806.
- Kumar, V.; Sharma, A.; Kaur, R.; Thukral, A.K.; Bhardwaj, R.; Ahmad, P. Differential distribution of amino acids in plants. Amino Acids 2017, 49, 821–869.
- Munro, J.; Damen, D. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 3723–3726.
- Wang, J.; Ma, Z.; Carr, S.A.; Mertins, P.; Zhang, H.; Zhang, Z.; Chan, D.W.; Ellis, M.J.; Townsend, R.R.; Smith, R.D.; et al. Proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. Mol. Cell. Proteom. 2017, 16, 121–134.
- Griffin, T.J.; Gygi, S.P.; Ideker, T.; Rist, B.; Eng, J.; Hood, L.; Aebersold, R. Complementary Profiling of Gene Expression at the Transcriptome and Proteome Levels in Saccharomyces cerevisiae. Mol. Cell. Proteom. 2002, 1, 323–333.
- Wang, X.; Liu, Q.; Zhang, B. Leveraging the complementary nature of RNA-Seq and shotgun proteomics data. Proteomics 2014, 14, 2676–2687.
- Grabowski, P.; Kustatscher, G.; Rappsilber, J. Epigenetic Variability Confounds Transcriptome but Not Proteome Profiling for Coexpression-based Gene Function Prediction. Mol. Cell. Proteom. 2018, 17, 2082–2090.
- Wang, D.; Zou, X.; Fai Au, K. A network-based computational framework to predict and differentiate functions for gene isoforms using exon-level expression data. Methods 2020.
- Perchey, R.T.; Tonini, L.; Tosolini, M.; Fournié, J.J.; Lopez, F.; Besson, A.; Pont, F. PTMselect: Optimization of protein modifications discovery by mass spectrometry. Sci. Rep. 2019, 9, 4181.
- Csizmok, V.; Forman-Kay, J.D. Complex regulatory mechanisms mediated by the interplay of multiple post-translational modifications. Curr. Opin. Struct. Biol. 2018, 48, 58–67.
- Müller, J.B.; Geyer, P.E.; Colaço, A.R.; Treit, P.V.; Strauss, M.T.; Oroshi, M.; Doll, S.; Virreira Winter, S.; Bader, J.M.; Köhler, N.; et al. The proteome landscape of the kingdoms of life. Nature 2020, 582, 592–596.
- Huynen, M.; Snel, B.; Lathe, W.; Bork, P. Predicting protein function by genomic context: Quantitative evaluation and qualitative inferences. Genome Res. 2000, 10, 1204–1210.
- Foflonker, F.; Blaby-Haas, C.E. Co-locality to co-functionality: Eukaryotic gene neighborhoods as a resource for function discovery. Mol. Biol. Evol. 2020, msaa221.
- Schoenfelder, S.; Sexton, T.; Chakalova, L.; Cope, N.F.; Horton, A.; Andrews, S.; Kurukuti, S.; Mitchell, J.A.; Umlauf, D.; Dimitrova, D.S.; et al. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat. Genet. 2010, 42, 53–61.
- Zhao, Z.; Tavoosidana, G.; Sjölinder, M.; Göndör, A.; Mariano, P.; Wang, S.; Kanduri, C.; Lezcano, M.; Singh Sandhu, K.; Singh, U.; et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet. 2006, 38, 1341–1347.
- van Berkum, N.L.; Lieberman-Aiden, E.; Williams, L.; Imakaev, M.; Gnirke, A.; Mirny, L.A.; Dekker, J.; Lander, E.S. Hi-C: A method to study the three-dimensional architecture of genomes. J. Vis. Exp. JoVE 2010, 1869.
- Cao, R.; Cheng, J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016, 93, 84–91.
- Moore, J.E.; Purcaro, M.J.; Pratt, H.E.; Epstein, C.B.; Shoresh, N.; Adrian, J.; Kawli, T.; Davis, C.A.; Dobin, A.; Kaul, R.; et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 2020, 583, 699–710.
- You, R.; Huang, X.; Zhu, S. DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation. Methods 2018, 145, 82–90.
- Chen, Q.; Lee, K.; Yan, S.; Kim, S.; Wei, C.H.; Lu, Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput. Biol. 2020, 16, e1007617.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240.
- Rifaioglu, A.S.; Doğan, T.; Martin, M.J.; Cetin-Atalay, R.; Atalay, M.V. Multi-task Deep Neural Networks in Automated Protein Function Prediction. arXiv 2017, arXiv:1705.04802.
- Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16; Association for Computing Machinery: New York, NY, USA, 2016; pp. 855–864.
- Du, J.; Jia, P.; Dai, Y.; Tao, C.; Zhao, Z.; Zhi, D. Gene2vec: Distributed representation of genes based on co-expression. BMC Genom. 2019, 20, 82.
- Jiang, Y.; Clark, W.T.; Friedberg, I.; Radivojac, P. The impact of incomplete knowledge on the evaluation of protein function prediction: A structured-output learning perspective. Bioinformatics 2014, 30, 609–616.
- Hales, K.G.; Korey, C.A.; Larracuente, A.M.; Roberts, D.M. Genetics on the Fly: A Primer on the Drosophila Model System. Genetics 2015, 201, 815–842.
- Kuwabara, P.E.; O’Neil, N. The use of functional genomics in C. elegans for studying human development and disease. J. Inherit. Metab. Dis. 2001, 24, 127–138.
- Schnoes, A.M.; Ream, D.C.; Thorman, A.W.; Babbitt, P.C.; Friedberg, I. Biases in the Experimental Annotations of Protein Function and Their Effect on Our Understanding of Protein Function Space. PLoS Comput. Biol. 2013, 9, e1003063.
- Škunca, N.; Altenhoff, A.; Dessimoz, C. Quality of computationally inferred gene ontology annotations. PLoS Comput. Biol. 2012, 8, e1002533.
- Youngs, N.; Penfold-Brown, D.; Bonneau, R.; Shasha, D. Negative Example Selection for Protein Function Prediction: The NoGO Database. PLoS Comput. Biol. 2014, 10, e1003644.
- Fu, G.; Wang, J.; Yang, B.; Yu, G. NegGOA: Negative GO annotations selection using ontology structure. Bioinformatics 2016, 32, 2996–3004.
- Warwick Vesztrocy, A.; Dessimoz, C. Benchmarking gene ontology function predictions using negative annotations. Bioinformatics 2020, 36, i210–i218.
- Kiryo, R.; Niu, G.; du Plessis, M.C.; Sugiyama, M. Positive-Unlabeled Learning with Non-Negative Risk Estimator. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 1675–1685.
- Yang, P.; Li, X.L.; Mei, J.P.; Kwoh, C.K.; Ng, S.K. Positive-unlabeled learning for disease gene identification. Bioinformatics 2012, 28, 2640–2647.
- Akbarnejad, A.; Baghshah, M.S. A probabilistic multi-label classifier with missing and noisy labels handling capability. Pattern Recognit. Lett. 2017, 89, 18–24.
- Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; Song, Y. Evaluating Protein Transfer Learning with TAPE. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 9689–9701.
- Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 2019, 723.
- Alley, E.C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; Church, G.M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 2019, 16, 1315–1322.
- Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 2020, 622803.
- Villegas-Morcillo, A.; Makrodimitris, S.; van Ham, R.C.H.J.; Gomez, A.M.; Sanchez, V.; Reinders, M.J.T. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2020, btaa701.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Wei, Q.; Khan, I.K.; Ding, Z.; Yerneni, S.; Kihara, D. NaviGO: Interactive tool for visualization and functional similarity and coherence analysis with gene ontology. BMC Bioinform. 2017, 18, 177.
- Makrodimitris, S.; Van Ham, R.C.H.J.; Reinders, M.J.T. Improving Protein Function Prediction in Arabidopsis Using Protein Sequence and GO-term Similarities. Bioinformatics 2019. under review.
- Bi, W.; Kwok, J. Multi-Label Classification on Tree- and DAG-Structured Hierarchies; ICML: New York, NY, USA, 2011; pp. 17–24.
- Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
- Longo, L.; Goebel, R.; Lecue, F.; Kieseberg, P.; Holzinger, A. Explainable Artificial Intelligence: Concepts, Applications, Research Challenges and Visions. In Machine Learning and Knowledge Extraction; Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 1–16.
- Escalante, H.J.; Escalera, S.; Guyon, I.; Baró, X.; Güçlütürk, Y.; Güçlü, U.; van Gerven, M. (Eds.) Explainable and Interpretable Models in Computer Vision and Machine Learning; Springer International Publishing: Cham, Switzerland, 2018.
- Smaili, F.Z.; Gao, X.; Hoehndorf, R. Onto2Vec: Joint vector-based representation of biological entities and their ontology-based annotations. Bioinformatics 2018, 34, i52–i60.
- Venkatesan, R.; Er, M.J.; Dave, M.; Pratama, M.; Wu, S. A novel online multi-label classifier for high-speed streaming data applications. Evol. Syst. 2017, 8, 303–315.
- Ahmadi, Z.; Kramer, S. Online Multi-Label Classification: A Label Compression Method. arXiv 2018, arXiv:1804.01491.
- Kahanda, I.; Funk, C.S.; Ullah, F.; Verspoor, K.M.; Ben-Hur, A. A close look at protein function prediction evaluation protocols. GigaScience 2015, 4, 41.
- Szklarczyk, D.; Gable, A.L.; Lyon, D.; Junge, A.; Wyder, S.; Huerta-Cepas, J.; Simonovic, M.; Doncheva, N.T.; Morris, J.H.; Bork, P.; et al. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019, 47, D607–D613.
- Clark, W.T.; Radivojac, P. Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 2013, 29, i53–i61.
- Plyusnin, I.; Holm, L.; Törönen, P. Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences. PLoS Comput. Biol. 2019, 15, e1007419.
Name | Reference | Input Data | Method |
---|---|---|---|
GOLabeler | [3] | Amino acid sequence, GO term frequencies | Learning to rank |
FunFams | [4] | Amino acid sequence | Hidden Markov Model |
INGA | [5] | Amino acid sequence | Homology search, enrichment analysis |
PFP | [6] | Amino acid sequence | Phylogenetics |
COFACTOR | [7] | Amino acid sequence, protein structure, protein interactions | Homology search, structural similarity |
NetGO | [8] | Amino acid sequence, GO term frequencies, protein interactions | Learning to rank |
DeepGOPlus | [9] | Amino acid sequence | Convolutional neural network, homology search |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Makrodimitris, S.; van Ham, R.C.H.J.; Reinders, M.J.T. Automatic Gene Function Prediction in the 2020’s. Genes 2020, 11, 1264. https://doi.org/10.3390/genes11111264