Abstract
Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
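The abstract's fixed-weight clustering step can be illustrated concretely. Below is a minimal sketch, not the authors' implementation: each data object is a tuple of feature vectors (two spaces in this toy example), a squared-Euclidean distortion is assumed in each space purely for illustration, and the per-space distortions are combined convexly with fixed weights before the usual assign/update iteration. The function names (distortion, assign_step, update_step), the toy data, and the initialization are hypothetical.

```python
# Sketch of one fixed-weight "convex k-means" pass over objects that are
# tuples of feature vectors. Assumption: squared-Euclidean distortion in
# every feature space; the paper allows a different distortion per space.
import numpy as np

def distortion(x, c):
    """Squared-Euclidean distortion between a feature vector and a centroid."""
    return float(np.sum((x - c) ** 2))

def assign_step(objects, centroids, weights):
    """Assign each object (tuple of feature vectors) to the centroid tuple
    minimizing the convex combination of per-space distortions."""
    labels = []
    for obj in objects:
        costs = [
            sum(w * distortion(x, c)                 # weight * distortion per space
                for w, x, c in zip(weights, obj, centroid_tuple))
            for centroid_tuple in centroids
        ]
        labels.append(int(np.argmin(costs)))
    return labels

def update_step(objects, labels, prev_centroids):
    """Recompute, per cluster and per feature space, the mean feature vector;
    an empty cluster keeps its previous centroid (a simplification)."""
    new_centroids = []
    for j, prev in enumerate(prev_centroids):
        members = [obj for obj, lab in zip(objects, labels) if lab == j]
        if not members:
            new_centroids.append(prev)
            continue
        new_centroids.append(tuple(
            np.mean([m[s] for m in members], axis=0)
            for s in range(len(prev))
        ))
    return new_centroids

# Toy usage: 20 objects with a 3-dim and a 2-dim feature space, k = 2,
# and fixed convex weights (0.7, 0.3) summing to one.
rng = np.random.default_rng(0)
objects = [(rng.normal(size=3), rng.normal(size=2)) for _ in range(20)]
centroids = [objects[0], objects[1]]     # naive initialization from the data
weights = (0.7, 0.3)
for _ in range(10):
    labels = assign_step(objects, centroids, weights)
    centroids = update_step(objects, labels, centroids)
```

In the paper's full procedure this inner loop runs for each candidate weighting, and the weighting retained is the one whose clustering best trades off within-cluster dispersion against between-cluster dispersion across all feature spaces.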
Cite this article
Modha, D.S., Spangler, W.S. Feature Weighting in k-Means Clustering. Machine Learning 52, 217–237 (2003). https://doi.org/10.1023/A:1024016609528