Abstract
Clustering is a useful technique for creating groups of objects on the basis of their nature: objects in the same group are similar to one another and differ from the objects of other groups. Clustering has proved its importance in various fields such as information retrieval, bioinformatics and image processing. In this paper, the particle swarm optimization (PSO) technique is combined with K-harmonic means (KHM) for clustering. PSO overcomes limitations of KHM such as the local optimum problem. Fuzzy logic is also employed to make PSO adaptive in nature by controlling its various parameters. The performance of the proposed approach is validated on five benchmark datasets in terms of inter-cluster distance, intra-cluster distance, F-measure and fitness value. The results of the proposed approach are compared with well-known conventional clustering techniques such as K-means, KHM and fuzzy C-means, along with different state-of-the-art clustering approaches. Two text-based benchmark datasets, CACM and CISI, are also used to test the performance of all clustering approaches. As both the experimental and statistical analyses make clear, the proposed clustering approach gives better results than the other clustering approaches.
References
Abraham A, Das S, Konar A (2006) Document clustering using differential evolution. In: Proceedings of the 2006 IEEE congress on evolutionary computation (CEC 2006), Vancouver, pp 1784–1791
Alguliev R, Aliguliyev R (2005) Fast genetic algorithm for clustering of text documents. Artif Intell 3:698–707
Aliguliyev R (2006) A clustering method for document collections and algorithm for estimation the optimal number of clusters. Artif Intell 4:651–659
Aupetit S, Monmarché N, Slimane M (2007) Hidden Markov models training by a particle swarm optimization algorithm. J Math Model Algorithms 6:175–193
Azzag H, Venturini G, Oliver A, Guinot C (2007) A hierarchical ant based clustering algorithm and its use in three real-world applications. Eur J Oper Res 179:906–922
Bergh F, Engelbrecht A (2001) Effect of swarm size on cooperative particle swarm optimizers. In: Proceedings of genetic evolutionary computation conference (GECCO-2001), San Francisco, pp 892–899
Bezdek J (1974) Fuzzy mathematics in pattern classification. PhD thesis, Cornell University, Ithaca
Chang P, Liu C, Fan C (2009) Data clustering and fuzzy neural network for sales forecasting: a case study in printed circuit board industry. Knowl-Based Syst 22(5):344–355
Cui X, Potok T, Palathingal P (2005) Document clustering using particle swarm optimization. In: Proceedings of the 2005 IEEE swarm intelligence symposium, Pasadena, pp 186–191
Das S, Abraham A, Konar A (2008a) Automatic clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29:688–699
Das S, Abraham A, Konar A (2008b) Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst Man Cybern Part A Syst Hum 38:218–237
ElAlami M (2011) Supporting image retrieval framework with rule base system. Knowl-Based Syst 24(2):331–340
Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
Garai G, Chaudhuri B (2004) A novel genetic algorithm for automatic clustering. Pattern Recogn Lett 25:173–187
Gath I, Geva G (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11:773–781
Güngör Z, Ünler A (2008) K-harmonic means data clustering with tabu search method. Appl Math Model 32:1115–1125
Gupta Y, Saini A (2015) An efficient clustering approach based on hybridization of PSO, fuzzy logic and K-harmonic means. In: IEEE workshop on computational intelligence: theories, applications and future directions (WCI). IIT Kanpur
Hadavandi E, Shavandi H, Ghanbari A (2010) Integration of genetic fuzzy systems and artificial neural networks for stock price forecasting. Knowl-Based Syst 23(8):800–808
Hammerly G, Elkan C (2002) Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the 11th international conference on information and knowledge management, pp 600–607
Han J, Kamber M, Pei P (2006) Data mining: concepts and techniques. Morgan Kaufmann, Los Altos
Hartmann V (2005) Ant colony optimization and swarm intelligence: evolving agent swarms for clustering and sorting. In: Proceedings of the 2005 conference on genetic and evolutionary computation (GECCO’05), Washington, DC, pp 217–224
Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31:264–323
Kalyani S, Swarup K (2011) Particle swarm optimization based K-means clustering approach for security assessment in power systems. Expert Syst Appl 38(9):10839–10846
Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. J Comput 32(8):68–75
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis, vol 39. Wiley, London
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the 1995 IEEE international conference on neural networks, Englewood Cliffs, pp 1942–1948
Khan M, Khor S (2004) Web document clustering using a hybrid neural network. Appl Soft Comput 4:423–432
Khy S, Ishikawa Y, Kitagawa H (2008) A novelty-based clustering method for on-line documents. World Wide Web 11:1–37
Laszlo M, Mukherjee S (2006) A genetic algorithm using hyper-quadtrees for low-dimensional k-means clustering. IEEE Trans Pattern Anal Mach Intell 28:533–543
Laszlo M, Mukherjee S (2007) A genetic algorithm that exchanges neighboring centers for k-means clustering. Pattern Recogn Lett 28:2359–2366
Li Y, Chung S, Holt J (2008) Text document clustering based on frequent word meaning sequences. Data Knowl Eng 64(1):381–404
Liao C, Tseng C, Luarn P (2007) A discrete version of particle swarm optimization for flowshop scheduling problems. Comput Oper Res 34:3099–3111
Lin H, Yang F, Kao Y (2005) An efficient GA-based clustering technique. Tamkang J Sci Eng 8(2):113–122
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: The 5th Berkeley symposium mathematical, statistic and probability, Berkeley
Martin-Guerrero J, Palomares A, Balaguer-Ballester E, Soria-Olivas E, Gomez-Sanchis J, Soriano-Asensi A (2006) Studying the feasibility of a recommender in a citizen web portal based on user modeling and clustering algorithms. Expert Syst Appl 30(2):299–312
Nock R, Nielsen F (2006) On weighting clustering. IEEE Trans Pattern Anal Mach Intell 28:1223–1235
Ponomarenko J, Merkulova T, Orlova G, Fokin O, Gorshkov E, Ponomarenko M (2002) Mining DNA sequences to predict sites which mutations cause genetic diseases. Knowl-Based Syst 15(4):225–233
Sander J, Ester M, Kriegel M, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Disc 2(2):169–194
Sebiskveradze D, Vrabie V, Gobinet C, Durlach A, Bernard P, Ly E, Manfait M, Jeannesson P, Piot O (2011) Automation of an algorithm based on fuzzy clustering for analyzing tumoral heterogeneity in human skin carcinoma tissue sections. Lab Invest 91(5):799–811
Shi J, Luo Z (2010) Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Comput Biol Med 40(8):723–732
Subramanyam V, Sett S (2008) Knowledge-based image retrieval system. Knowl-Based Syst 21(2):89–100
Suganthan P (1999) Particle swarm optimizer with neighborhood operator. In: Proceedings of IEEE international conference on evolutionary computation, vol 3, pp 1958–1962
Thakare A, Hanchate R (2014) Introducing hybrid model for data clustering using K-harmonic means and Gravitational search algorithms. Int J Comput Appl 88(17):18–22
Verma N, Roy A (2014) Self-optimal clustering technique using optimized threshold function. IEEE Syst J 8(4):1213–1226
Vesanto W, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Netw 11(3):586–600
Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of 23rd international conference on very large databases, Greece, pp 186–195
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Yang F, Sun T, Zhang C (2009) Efficient hybrid data clustering method based on K-harmonic means and particle swarm optimization. Expert Syst Appl 36(6):9847–9852
Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD conference of management of data, Canada, pp 103–114
Zhang B, Hsu M, Dayal U (1999) K-harmonic means—a data clustering algorithm. Technical Report HPL-1999-124, Hewlett-Packard Laboratories
Zhang B, Hsu M, Dayal U (2000) K-harmonic means. In: International workshop on temporal, spatial and spatio-temporal data mining. TSDM 2000, Lyon
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Appendices
Appendix A: K-harmonic means clustering algorithm
This algorithm was proposed by Zhang et al. (1999, 2000), and variants of KHM were later proposed by Hammerly and Elkan (2002). KHM gives a dynamic weight to each data point by averaging the harmonic means of the distances from each data point to all centers. The harmonic average assigns a large weight to a data point that is not close to any center and a small weight to a data point that is close to one or more centers. Therefore, KHM is less sensitive to initialization than K-means. Before discussing the algorithm, the notations used in KHM are described in Table 12.
The detail of the KHM clustering algorithm is given as follows:
1.
Initially, select the k centers randomly.
2.
Determine the objective function value by (A.1) as defined below
$$ \text{KHM}\,\left( {X, C} \right) = \sum\limits_{i = 1}^{n} {\frac{k}{{\sum\nolimits_{j = 1}^{k} {\frac{1}{{\left\| {x_{i} - c_{j} } \right\|^{p} }}} }}} $$(A.1)where p is an input parameter, typically p ≥ 2.
3.
Compute the membership value m(cj/xi) of each data point xi in each center cj by (A.2) as defined below
$$ m\,\left( {c_{j} /x_{i} } \right) = \frac{{\left\| {x_{i} - c_{j} } \right\|^{ - p - 2} }}{{\sum\nolimits_{j = 1}^{k} {\left\| {x_{i} - c_{j} } \right\|^{ - p - 2} } }}. $$(A.2)
4.
Compute the weight w(xi) of each data point xi by (A.3) as follows
$$ w\,\left( {x_{i} } \right) = \frac{{\sum\nolimits_{j = 1}^{k} {\left\| {x_{i} - c_{j} } \right\|^{ - p - 2} } }}{{\left( {\sum\nolimits_{j = 1}^{k} {\left\| {x_{i} - c_{j} } \right\|^{ - p} } } \right)^{2} }}. $$(A.3)
5.
Re-compute the location of each center cj from all the data points xi according to their memberships and weights as defined by (A.4).
$$ c_{j} = \frac{{\sum\nolimits_{i = 1}^{n} {m\left( {c_{j} /x_{i} } \right)\, w\left( {x_{i} } \right)\, x_{i} } }}{{\sum\nolimits_{i = 1}^{n} {m\left( {c_{j} /x_{i} } \right)\, w\left( {x_{i} } \right) } }} $$(A.4)
6.
Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.
7.
Assign each data point xi to the cluster j with the largest m(cj/xi).
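The steps above can be sketched in NumPy as follows. This is a minimal illustration of the KHM iteration, not the authors' implementation; the function name, defaults and the small epsilon guard against zero distances are assumptions added for the sketch.

```python
import numpy as np

def khm(X, k, p=3.5, iters=100, seed=0):
    """K-harmonic means sketch. X: (n, d) data; returns (centers, labels, objective)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as initial centers.
    C = X[rng.choice(len(X), k, replace=False)].astype(float)
    eps = 1e-12  # guard against a zero distance when a point coincides with a center
    for _ in range(iters):
        # Pairwise distances ||x_i - c_j||, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + eps
        dm = d ** (-p - 2)
        # Step 3: memberships m(c_j / x_i), Eq. (A.2).
        m = dm / dm.sum(axis=1, keepdims=True)
        # Step 4: weights w(x_i), Eq. (A.3).
        w = dm.sum(axis=1) / (d ** -p).sum(axis=1) ** 2
        # Step 5: re-compute centers as membership- and weight-scaled means, Eq. (A.4).
        mw = m * w[:, None]                      # (n, k)
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]
    # Step 2/6: objective KHM(X, C), Eq. (A.1).
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + eps
    obj = (k / (d ** -p).sum(axis=1)).sum()
    # Step 7: assign each point to the cluster with the largest membership.
    dm = d ** (-p - 2)
    labels = (dm / dm.sum(axis=1, keepdims=True)).argmax(axis=1)
    return C, labels, obj
```

The loop runs for a fixed number of iterations for simplicity; a convergence check on the change in the objective would match step 6 more literally.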
It is demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al. 1999), while it tends to converge to local optima (Güngör and Ünler 2008).
Appendix B: Particle swarm optimization
PSO is a population-based, sociologically inspired evolutionary approach which has been successfully applied in science and many practical fields (Aupetit et al. 2007; Liao et al. 2007). Each particle in PSO represents an individual, and all the particles together form a swarm. The solution space of a problem is formulated as the search space of PSO; each position in the search space represents a candidate solution of the problem. Each particle moves according to its velocity, and the movement of a particle is computed by (B.1) and (B.2)
$$ v_{i} \left( {t + 1} \right) = \omega \,v_{i} \left( t \right) + c_{1} \,rand_{1} \left( {pbest_{i} \left( t \right) - x_{i} \left( t \right)} \right) + c_{2} \,rand_{2} \left( {gbest\left( t \right) - x_{i} \left( t \right)} \right) $$(B.1)
$$ x_{i} \left( {t + 1} \right) = x_{i} \left( t \right) + v_{i} \left( {t + 1} \right) $$(B.2)
where xi(t) is the position of particle i at time t, vi(t) is the velocity of particle i at time t, ω is an inertia weight scaling the previous velocity, rand1 and rand2 are random variables between 0 and 1, pbesti(t) is the best position found by particle i so far, gbest(t) is the best position of whole swarm so far, c1 and c2 are two acceleration coefficients that scale the influence of pbesti(t) and gbest(t), respectively.
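A minimal NumPy sketch of these velocity and position updates, applied to box-constrained minimization, is given below. The function name, swarm size, iteration count and the parameter values ω = 0.72, c1 = c2 = 1.49 are illustrative defaults assumed for the sketch, not values taken from the paper.

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=200, w=0.72, c1=1.49, c2=1.49,
        bounds=(-5.0, 5.0), seed=0):
    """Minimize f over a box using the PSO updates (B.1) and (B.2)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions x_i(t)
    v = np.zeros((n_particles, dim))              # particle velocities v_i(t)
    pbest = x.copy()                              # best position found by each particle
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()      # best position of the whole swarm
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))       # rand1, rand2 in [0, 1)
        r2 = rng.random((n_particles, dim))
        # (B.1): inertia term + cognitive (pbest) pull + social (gbest) pull.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        # (B.2): move each particle by its new velocity, clipped to the box.
        x = np.clip(x + v, lo, hi)
        vals = np.apply_along_axis(f, 1, x)
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()
```

In the hybrid clustering setting of the paper, each particle would encode a set of k centers and f would be the KHM objective; here a plain sphere function suffices to exercise the updates.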
Gupta, Y., Saini, A. A new swarm-based efficient data clustering approach using KHM and fuzzy logic. Soft Comput 23, 145–162 (2019). https://doi.org/10.1007/s00500-018-3514-1