
Performance evaluation of density-based clustering methods

Published: 01 September 2009

Abstract

With the growth of the World Wide Web, document clustering is receiving increasing attention as a fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can automatically organize a document corpus into a meaningful cluster hierarchy for efficient browsing and navigation, which helps compensate for the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, each document was assigned a weight that defines its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify its robustness, experiments were conducted on datasets that vary widely in the number of clusters, documents and terms. To evaluate the criterion functions, we used the WebKB, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices was used: three internal validity indices and seven external validity indices. The internal validity indices evaluate within-cluster scatter and between-cluster separation. The external validity indices compare the clustering solutions produced by the proposed criterion functions with the "ground truth" results. Experiments showed that our approach significantly improves clustering quality. To optimize the criterion functions, we developed a modified differential evolution (DE) algorithm. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the resulting solution is feasible.
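
The abstract does not spell out the criterion functions or the DE modification, so the following is only a minimal sketch of the general idea of keeping DE candidates feasible in partitional clustering. It assumes that "feasible" means every label lies in the range 0..k-1 and no cluster is empty, and it uses a plain unweighted within-cluster sum of squared errors as a stand-in objective. The names `repair`, `sse` and `de_cluster` are hypothetical and are not taken from the paper.

```python
import random


def repair(labels, k, rng):
    """Assumed feasibility rule: labels are integers in [0, k) and no cluster is empty.

    Requires len(labels) >= k so that a donor cluster always exists.
    """
    # Round and clamp real-valued DE output back to valid cluster labels.
    labels = [min(k - 1, max(0, int(round(v)))) for v in labels]
    counts = [labels.count(c) for c in range(k)]
    for c in range(k):
        if counts[c] == 0:
            # Move one random point out of a cluster that has at least two members.
            donors = [i for i, lab in enumerate(labels) if counts[lab] > 1]
            i = rng.choice(donors)
            counts[labels[i]] -= 1
            labels[i] = c
            counts[c] = 1
    return labels


def sse(points, labels, k):
    """Stand-in objective: unweighted within-cluster sum of squared errors."""
    dim = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


def de_cluster(points, k, pop_size=20, generations=60, f=0.8, cr=0.9, seed=0):
    """Basic DE/rand/1/bin over label vectors, with a repair step after mutation.

    pop_size must be at least 4 so that three distinct partners can be drawn.
    """
    rng = random.Random(seed)
    n = len(points)
    pop = [repair([rng.randrange(k) for _ in range(n)], k, rng) for _ in range(pop_size)]
    cost = [sse(points, ind, k) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(n)  # guarantees at least one mutated gene
            trial = [
                pop[a][j] + f * (pop[b][j] - pop[c][j])
                if (rng.random() < cr or j == j_rand) else pop[i][j]
                for j in range(n)
            ]
            trial = repair(trial, k, rng)  # every evaluated candidate is feasible
            trial_cost = sse(points, trial, k)
            if trial_cost <= cost[i]:  # greedy one-to-one selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    labels, cost = de_cluster(pts, k=2)
    print(labels, cost)
```

Because the repair step runs before every evaluation, the population never contains an empty cluster or an out-of-range label, which is one plausible reading of the feasibility guarantee described above. The paper's weighted, density-based criterion functions would replace `sse` in such a scheme.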


Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Density-based clustering methods
  2. Modified DE algorithm
  3. Partitional clustering
  4. Text mining
  5. Validity indices

Qualifiers

  • Article
