Abstract
In this paper we propose a methodology for generating a set of seeds to be used as the starting point for K-Means-type unsupervised classification algorithms in text mining. The proposal uses the eigenvectors obtained from principal component analysis, suitably processed to extract initial seeds that lead to lightly overlapping clusters, each clearly identified by keywords. The work is motivated by the authors' interest in identifying previously unknown topics and themes in short texts. To validate the method, it was applied to a sample of labeled e-mails (NG20), a gold standard in the field of text mining; specifically, several corpora referenced in the literature were used, configured according to the mix of topics contained in the sample. The proposed method improves on the results of the state-of-the-art methods against which it is compared.
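To make the idea concrete, the following Python sketch shows one way principal directions can be turned into K-Means seeds for a TF-IDF document matrix. It is a minimal illustration under stated assumptions (a toy corpus, scikit-learn's TruncatedSVD as the PCA-like step, and a simple largest-projection seed rule), not the procedure developed in the paper.

    # Hypothetical sketch (not the paper's exact procedure): it only illustrates
    # deriving K-Means starting centroids from the leading principal directions
    # of a TF-IDF document matrix. The toy corpus, the number of clusters and
    # the seed-selection rule are illustrative assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = [
        "space shuttle launch orbit nasa",
        "orbit satellite nasa mission launch",
        "hockey game team season player",
        "player team score hockey league",
    ]
    k = 2  # assumed number of topics / seeds

    # 1. Represent documents as TF-IDF vectors, the usual weighting in text mining.
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)

    # 2. PCA-like step for sparse text data: project documents onto the k leading
    #    principal directions via truncated SVD.
    svd = TruncatedSVD(n_components=k, random_state=0)
    scores = svd.fit_transform(X)  # shape (n_docs, k)

    # 3. Illustrative seed rule (an assumption, not the method of the paper): for
    #    each component, pick the not-yet-chosen document with the largest
    #    absolute projection and use its TF-IDF vector as an initial centroid.
    seed_idx = []
    for j in range(k):
        for i in np.argsort(-np.abs(scores[:, j])):
            if i not in seed_idx:
                seed_idx.append(int(i))
                break
    seeds = X[seed_idx].toarray()

    # 4. Run K-Means from these seeds instead of a random initialization.
    km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0)
    labels = km.fit_predict(X)
    print(labels)  # e.g. [0 0 1 1], grouping the two topics

In this sketch the seeds are existing documents chosen along the principal directions; the paper instead works on the eigenvectors themselves with additional processing, so the snippet should be read only as an orientation to the general family of PCA-based initializations.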
Acknowledgments
This work has been supported by the MINECO ES-TIN2014-57458-R project. Special thanks to the Computer Vision and Image Processing (CVIP) group at URJC.
Cite this article
Velez, D., Sueiras, J., Ortega, A. et al. A method for K-Means seeds generation applied to text mining. Stat Methods Appl 25, 477–499 (2016). https://doi.org/10.1007/s10260-015-0345-4