Abstract
Clustering algorithms are employed in many bioinformatics tasks, including the categorization of protein sequences and the analysis of gene-expression data. Although these algorithms are routinely applied, many of them suffer from two limitations: (i) they rely on predetermined parameter settings, such as a priori knowledge of the number of clusters; (ii) they involve nondeterministic procedures that yield inconsistent outcomes. A framework that addresses these shortcomings is therefore desirable. We provide a data-driven framework that consists of two interrelated steps: SVD-based dimension reduction, followed by automated tuning of the algorithm’s parameter(s). The dimension reduction step is adapted to run efficiently on very large datasets. The optimal parameter setting is identified according to an internal evaluation criterion, the Bayesian Information Criterion (BIC). The framework can incorporate most clustering algorithms and improve their performance. In this study we illustrate its effectiveness by incorporating the standard K-Means and Quantum Clustering algorithms. The implementations are applied to several gene-expression benchmarks with significant success.
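The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact BIC formulation, the rule for choosing the reduced dimension, or the large-dataset adjustment, so everything below is an assumption made for illustration. The sketch fixes the reduced dimension `r` by hand, tunes only the number of clusters `k` for K-Means (Quantum Clustering is not shown), uses a deterministic farthest-point initialization to avoid run-to-run variation, and scores each `k` with a BIC derived from a hard-assignment spherical Gaussian model with one shared variance. All function names (`svd_reduce`, `kmeans`, `bic`, `select_k`, `demo_data`) are hypothetical.

```python
import numpy as np

def svd_reduce(X, r):
    """Center the columns of X and keep the top-r SVD coordinates of each row."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]

def kmeans(X, k, n_iter=100):
    """Lloyd's K-means with a deterministic farthest-point initialization."""
    centers = [X[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def bic(X, labels, centers):
    """BIC (lower is better) under an ASSUMED model: hard assignments,
    spherical Gaussian clusters, one shared variance, free mixing weights."""
    n, d = X.shape
    k = centers.shape[0]
    rss = ((X - centers[labels]) ** 2).sum()
    var = max(rss / (n * d), 1e-12)              # shared variance estimate
    sizes = np.bincount(labels, minlength=k)
    sizes = sizes[sizes > 0]
    log_lik = (sizes * np.log(sizes / n)).sum()  # mixing-weight term
    log_lik += -0.5 * n * d * (np.log(2 * np.pi * var) + 1)
    n_params = k * d + (k - 1) + 1               # means, weights, variance
    return -2 * log_lik + n_params * np.log(n)

def select_k(X, r, k_range):
    """Reduce X to r dimensions, then pick the k with the lowest BIC."""
    Z = svd_reduce(X, r)
    scores = {k: bic(Z, *kmeans(Z, k)) for k in k_range}
    return min(scores, key=scores.get), scores

def demo_data(seed=0):
    """Synthetic stand-in for an expression matrix: 3 groups of 50 samples, 50 features."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=10.0, size=(3, 50))
    return np.vstack([c + rng.normal(size=(50, 50)) for c in centers])

if __name__ == "__main__":
    best_k, scores = select_k(demo_data(), r=2, k_range=range(1, 7))
    print("selected k:", best_k)
```

Because both the initialization and the scoring are deterministic, repeated runs on the same data select the same `k`, which is the consistency property the abstract asks for. Any other clustering routine that returns labels and centers could replace `kmeans` in `select_k`, with its own tunable parameter taking the place of `k`.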
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Varshavsky, R., Horn, D., Linial, M. (2007). Clustering Algorithms Optimizer: A Framework for Large Datasets. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_8
DOI: https://doi.org/10.1007/978-3-540-72031-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)