Abstract
Human Single Amino Acid Polymorphisms (SAPs) or Single Amino Acid Variants (SAVs), usually referred to as nonsynonymous Single Nucleotide Variants (nsSNVs), represent the most frequent type of genetic variation in the population. They originate from non-synonymous single nucleotide variations (missense variants), in which a single base pair substitution alters the genetic code so that a different amino acid is produced at a given position. Since mutations are commonly associated with the development of various genetic diseases, it is of utmost importance to understand and predict which variations are deleterious and which are neutral. Computational tools based on machine learning are becoming promising alternatives to tedious and costly mutagenesis experiments. However, the varying quality, incompleteness and inconsistencies of nsSNV datasets generally degrade the usefulness of machine learning approaches, so more robust and accurate methods are needed. In this paper, we present the application of a consensus classifier based on holdout sampling. We generated 100 holdouts to sample different classifier architectures and different classification variables during the training stage. The best performing holdouts were selected to construct a consensus classifier, which was then tested blindly using a k-fold (1 ≤ k ≤ 5) cross-validation approach. We also analyzed which protein attributes are best for predicting the effects of nsSNVs by calculating their discriminatory power. Our results show that our method outperforms other currently available tools and provides robust results, with small standard deviations among folds and high accuracy.
We attribute the superior performance of our algorithm to the use of a tree of holdouts, in which different machine learning algorithms are sampled with different boundary conditions or different predictive attributes.
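The holdout-sampling idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the nearest-centroid base classifier, the random feature subsets, the 75/25 split, and the choice of keeping the 15 best holdouts are all assumptions made for illustration, and a separate blind set stands in for the k-fold blind validation described in the abstract. Only the number of holdouts (100) comes from the text.

```python
import random


def make_data(n, n_features, rng):
    """Synthetic two-class data (hypothetical): only features 0 and 1 are informative."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 2.0 * label
        row[1] += 1.5 * label
        X.append(row)
        y.append(label)
    return X, y


def fit_centroids(X, y, feats):
    """Nearest-centroid model restricted to the feature subset `feats`."""
    centroids = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return centroids


def predict(centroids, feats, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((x[f] - m) ** 2 for f, m in zip(feats, centroids[c]))
    return 0 if dist(0) <= dist(1) else 1


def holdout_sampler(X, y, n_holdouts=100, train_frac=0.75, seed=0):
    """Each holdout draws a random train/validation split and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    results = []
    for _ in range(n_holdouts):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(train_frac * n)
        train, valid = idx[:cut], idx[cut:]
        feats = sorted(rng.sample(range(p), rng.randint(1, p)))
        model = fit_centroids([X[i] for i in train], [y[i] for i in train], feats)
        hits = sum(predict(model, feats, X[i]) == y[i] for i in valid)
        results.append((hits / len(valid), feats, model))
    return results


def consensus_predict(selected, x):
    """Majority vote over the selected holdout models (ties go to class 0)."""
    votes = sum(predict(model, feats, x) for _, feats, model in selected)
    return 1 if 2 * votes > len(selected) else 0


rng = random.Random(42)
X, y = make_data(400, 6, rng)
X_blind, y_blind = make_data(200, 6, rng)  # never seen during holdout sampling

results = holdout_sampler(X, y)
results.sort(key=lambda r: r[0], reverse=True)
selected = results[:15]  # keep the best-performing holdouts (cutoff is an assumption)

blind_hits = sum(consensus_predict(selected, x) == t for x, t in zip(X_blind, y_blind))
blind_accuracy = blind_hits / len(y_blind)
print(f"consensus accuracy on blind set: {blind_accuracy:.3f}")
```

In this sketch the diversity among holdouts comes only from the data split and the feature subset; sampling different classifier types or hyperparameters per holdout, as the abstract describes, would slot into the same loop.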
Acknowledgment
AK acknowledges financial support from NSF grant DBI 1661391 and NIH grants R01GM127701 and R01HG012117.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Álvarez-Machancoses, Ó., Faraggi, E., de Andrés-Galiana, E.J., Fernández-Martínez, J.L., Kloczkowski, A. (2023). Prediction of Functional Effects of Protein Amino Acid Mutations. In: Rojas, I., Valenzuela, O., Rojas Ruiz, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2023. Lecture Notes in Computer Science, vol 13920. Springer, Cham. https://doi.org/10.1007/978-3-031-34960-7_5
DOI: https://doi.org/10.1007/978-3-031-34960-7_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34959-1
Online ISBN: 978-3-031-34960-7
eBook Packages: Computer Science (R0)