Abstract
Support Vector Machines (SVMs) are a new generation of machine learning techniques and have shown strong generalization capability for many data mining tasks. SVMs can handle nonlinear classification by implicitly mapping input samples from the input feature space into another, high-dimensional feature space with a nonlinear kernel function. However, SVMs are not well suited to huge datasets with millions of samples. Granular computing decomposes information into aggregates, called granules, and solves the targeted problem within each granule. Therefore, we propose a novel computational model called Clustering Support Vector Machines (CSVMs) to deal with complex classification problems for huge datasets. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. This feature makes the learning task for each CSVM more specific and simpler. Moreover, because a CSVM is built separately for each granule, training can be easily parallelized, so that CSVMs can handle huge datasets efficiently. The CSVMs model is used for predicting local protein tertiary structure. Compared with the conventional clustering method, the prediction accuracy for local protein tertiary structure improves noticeably when the new CSVMs model is used. The encouraging experimental results indicate that our new computational model opens a new way to solve complex classification problems for huge datasets.
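The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: plain k-means stands in for the clustering algorithm, a linear soft-margin SVM trained by subgradient descent on the hinge loss stands in for the kernel SVMs, and the function names (kmeans, train_linear_svm, fit_csvm, predict_csvm), the hyperparameters, and the toy data are all hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    # Index of the centroid closest to point p.
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's k-means (assumed stand-in for the paper's clustering step).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200, seed=0):
    # Linear soft-margin SVM: stochastic subgradient descent on the
    # L2-regularised hinge loss (assumed stand-in for a kernel SVM).
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # regularisation shrink
            if margin < 1.0:                          # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def fit_csvm(points, labels, k):
    # One local classifier per granule; each could be trained in parallel.
    centroids = kmeans(points, k)
    assign = [nearest(p, centroids) for p in points]
    models = []
    for c, centre in enumerate(centroids):
        idx = [i for i, a in enumerate(assign) if a == c]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            # Granule is empty or holds a single class: constant predictor.
            models.append(("const", ys[0] if ys else 1))
        else:
            # Train on coordinates centred at the granule's centroid.
            Xc = [tuple(v - m for v, m in zip(points[i], centre)) for i in idx]
            models.append(("svm", train_linear_svm(Xc, ys)))
    return centroids, models

def predict_csvm(model, p):
    # Route the sample to its nearest granule, then apply that granule's model.
    centroids, models = model
    c = nearest(p, centroids)
    kind, params = models[c]
    if kind == "const":
        return params
    w, b = params
    score = sum(wj * (v - m) for wj, v, m in zip(w, p, centroids[c])) + b
    return 1 if score >= 0 else -1

# Toy data: two well-separated granules whose local decision rules point in
# opposite directions, so no single linear boundary fits both.
points = [(-1.0, 0.5), (-1.0, -0.5), (1.0, 0.5), (1.0, -0.5),
          (9.0, 10.5), (9.0, 9.5), (11.0, 10.5), (11.0, 9.5)]
labels = [-1, -1, 1, 1, 1, 1, -1, -1]
model = fit_csvm(points, labels, k=2)
preds = [predict_csvm(model, p) for p in points]
```

Because each granule's classifier depends only on that granule's samples, the loop in fit_csvm could in principle be distributed across workers, which is the property the abstract relies on for scaling. The toy data is chosen so that each local problem is linearly separable while the global one is not, which is one way to read the claim that per-granule learning tasks are simpler.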
Keywords
- Support Vector Machine
- Cluster Group
- Sequence Segment
- Information Granule
- Sequential Minimal Optimization
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
He, J., Zhong, W., Harrison, R., Tai, P.C., Pan, Y. (2006). Clustering Support Vector Machines and Its Application to Local Protein Tertiary Structure Prediction. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds) Computational Science – ICCS 2006. ICCS 2006. Lecture Notes in Computer Science, vol 3992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11758525_96
DOI: https://doi.org/10.1007/11758525_96
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34381-3
Online ISBN: 978-3-540-34382-0
eBook Packages: Computer Science, Computer Science (R0)