Abstract
Among existing clustering algorithms, k-Means is one of the most commonly used. As an extension of k-Means, the k-Modes algorithm replaces means with modes and has been widely applied to categorical data clustering. However, many data sets are of mixed type, containing categorical, ordinal and numerical attributes. The mixed-type data clustering problem has recently attracted much attention from the data mining research community, but most existing methods overlook ordinal attributes and do not establish an explicit similarity metric for them. In this paper, the limitations of some existing dissimilarity measures for the k-Modes algorithm on mixed ordinal and nominal data are analyzed using illustrative examples. Based on the idea of mining the ordinal information carried by ordinal attributes, a new dissimilarity measure is proposed that allows the k-Modes algorithm to cluster this type of data. The distinguishing characteristic of the new measure is that it takes the ordinal information of ordinal attributes into account. A convergence study and a time complexity analysis of the k-Modes algorithm based on this new dissimilarity measure indicate that it can be used effectively on large data sets. Comparative experiments on nine real data sets from the UCI Machine Learning Repository show the effectiveness of the new dissimilarity measure.
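To make the distinction concrete, the following is a minimal sketch and not the measure proposed in the paper, whose formula the abstract does not give. It contrasts the classic simple-matching dissimilarity of k-Modes with a hypothetical mixed measure that scores ordinal attributes by normalized rank distance. The function names, the `ordinal_levels` mapping and the normalization are all assumptions made for illustration.

```python
from collections import Counter


def simple_matching(x, y):
    """Classic k-Modes dissimilarity: number of attributes whose values differ."""
    return sum(a != b for a, b in zip(x, y))


def mixed_dissimilarity(x, y, ordinal_levels):
    """Illustrative mixed nominal/ordinal measure (an assumption, not the paper's).

    ordinal_levels maps an attribute index to its ordered list of values.
    Attributes missing from the map are treated as nominal (0/1 mismatch).
    Ordinal attributes contribute their rank difference scaled to [0, 1].
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        levels = ordinal_levels.get(j)
        if levels is None:
            total += 0.0 if a == b else 1.0
        else:
            span = max(len(levels) - 1, 1)  # guard against a single-level attribute
            total += abs(levels.index(a) - levels.index(b)) / span
    return total


def mode_of(cluster):
    """Column-wise most frequent value: the 'mode' that replaces the mean."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


# Hypothetical records: (colour [nominal], risk [ordinal])
levels = {1: ["low", "medium", "high"]}
a, b, c = ("red", "low"), ("red", "medium"), ("red", "high")

d_simple_ab = simple_matching(a, b)
d_simple_ac = simple_matching(a, c)
d_mixed_ab = mixed_dissimilarity(a, b, levels)
d_mixed_ac = mixed_dissimilarity(a, c, levels)
centre = mode_of([("red", "low"), ("red", "high"), ("blue", "low")])
```

Under simple matching, "low" is as far from "medium" as it is from "high", whereas the rank-aware sketch separates the two cases. This is the kind of ordinal information that a purely nominal measure discards.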


Acknowledgments
We would like to thank the anonymous reviewers for their helpful suggestions. This work was supported by the National Natural Science Foundation of China (61573266).
Cite this article
Yuan, F., Yang, Y. & Yuan, T. A dissimilarity measure for mixed nominal and ordinal attribute data in k-Modes algorithm. Appl Intell 50, 1498–1509 (2020). https://doi.org/10.1007/s10489-019-01583-5
DOI: https://doi.org/10.1007/s10489-019-01583-5