Abstract
The protein family classification problem, which consists of determining the family membership of unknown protein sequences, is important to biologists for practical applications such as drug discovery, prediction of molecular function, and medical diagnosis. Neural networks and Bayesian methods have performed well on this problem, achieving accuracies ranging from 90% to 98%, but they run relatively slowly in the learning stage. In this paper, we apply a principal component null space analysis (PCNSA) linear classifier to the problem and report excellent results compared to those of neural networks and support vector machines. The two main parameters of PCNSA are linked to the high dimensionality of the dataset used and were optimized by exhaustive search to maximize accuracy.
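To make the two-parameter search concrete, the following is a minimal sketch of a PCNSA-style classifier with an exhaustive grid search, written against the general description of PCNSA in the literature rather than the implementation used in this paper. It assumes the two parameters are the number of retained principal components and the dimension of each class's approximate null space. All function names, the toy Gaussian data, and the grid ranges are hypothetical, and protein sequence encoding is not covered.

```python
import itertools
import numpy as np


def fit_pcnsa(X, y, n_pca, n_null):
    """Fit a PCNSA-style model: global PCA, then one approximate null space per class."""
    mean = X.mean(axis=0)
    # Global PCA: keep the n_pca directions of largest variance.
    evals, evecs = np.linalg.eigh(np.atleast_2d(np.cov(X - mean, rowvar=False)))
    W = evecs[:, np.argsort(evals)[::-1][:n_pca]]
    Z = (X - mean) @ W
    classes = {}
    for c in np.unique(y):
        Zc = Z[y == c]
        mu = Zc.mean(axis=0)
        # eigh returns eigenvalues in ascending order, so the first n_null
        # eigenvectors span the directions of smallest within-class variance.
        _, vecs = np.linalg.eigh(np.atleast_2d(np.cov(Zc, rowvar=False)))
        classes[c] = (mu, vecs[:, :n_null])
    return mean, W, classes


def predict_pcnsa(model, X):
    """Assign each sample to the class with the smallest null-space distance."""
    mean, W, classes = model
    Z = (X - mean) @ W
    labels = np.array(list(classes))
    dists = np.stack(
        [np.linalg.norm((Z - mu) @ N, axis=1) for mu, N in classes.values()],
        axis=1,
    )
    return labels[dists.argmin(axis=1)]


def grid_search(X_tr, y_tr, X_val, y_val, pca_dims, null_dims):
    """Try every (n_pca, n_null) pair and return (best accuracy, n_pca, n_null)."""
    best = (-1.0, None, None)
    for n_pca, n_null in itertools.product(pca_dims, null_dims):
        if n_null > n_pca:
            continue  # the null space cannot exceed the PCA space
        model = fit_pcnsa(X_tr, y_tr, n_pca, n_null)
        acc = float(np.mean(predict_pcnsa(model, X_val) == y_val))
        if acc > best[0]:
            best = (acc, n_pca, n_null)
    return best


def make_toy_data(rng, n):
    """Hypothetical two-class Gaussian data with unequal covariances (not protein data)."""
    scales0 = np.array([3.0, 0.1, 1.0, 1.0, 1.0])
    scales1 = np.array([0.1, 3.0, 1.0, 1.0, 1.0])
    X0 = rng.normal(size=(n, 5)) * scales0
    X1 = rng.normal(size=(n, 5)) * scales1 + np.array([4.0, 0.0, 0.0, 0.0, 0.0])
    return np.vstack([X0, X1]), np.repeat([0, 1], n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_tr, y_tr = make_toy_data(rng, 200)
    X_val, y_val = make_toy_data(rng, 200)
    acc, n_pca, n_null = grid_search(X_tr, y_tr, X_val, y_val, range(1, 6), range(1, 6))
    print(f"best accuracy={acc:.3f} with n_pca={n_pca}, n_null={n_null}")
```

The search selects parameters on a held-out validation split; a real study would report final accuracy on a separate test set so that the exhaustive tuning does not inflate the result.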
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
French, L., Ngom, A., Rueda, L. (2005). Fast Protein Superfamily Classification Using Principal Component Null Space Analysis. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_17
DOI: https://doi.org/10.1007/11424918_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science, Computer Science (R0)