Abstract
In this paper we propose a methodology for generating a set of seeds to be used as the starting point for K-Means-type unsupervised classification algorithms in text mining. The proposal uses the eigenvectors obtained from principal component analysis, suitably processed to extract initial seeds that lead to lightly overlapping clusters, each clearly identified by keywords. The work is motivated by the authors' interest in identifying previously unknown topics and themes in short texts. To validate the method, it was applied to a sample of labeled e-mails (NG20), a gold standard in the field of text mining; specifically, several corpora referenced in the literature were used, configured according to the mix of topics contained in the sample. The proposed method improves on the results of the state-of-the-art methods against which it is compared.
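To make the idea concrete, the following Python sketch shows one way principal directions can be turned into K-Means seeds for a TF-IDF document matrix. It is a minimal illustration under stated assumptions (a toy corpus, scikit-learn's TruncatedSVD as the PCA-like step, and a simple largest-projection seed rule), not the procedure developed in the paper.

    # Hypothetical sketch (not the paper's exact procedure): it only illustrates
    # deriving K-Means starting centroids from the leading principal directions
    # of a TF-IDF document matrix. The toy corpus, the number of clusters and
    # the seed-selection rule are illustrative assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = [
        "space shuttle launch orbit nasa",
        "orbit satellite nasa mission launch",
        "hockey game team season player",
        "player team score hockey league",
    ]
    k = 2  # assumed number of topics / seeds

    # 1. Represent documents as TF-IDF vectors, the usual weighting in text mining.
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)

    # 2. PCA-like step for sparse text data: project documents onto the k leading
    #    principal directions via truncated SVD.
    svd = TruncatedSVD(n_components=k, random_state=0)
    scores = svd.fit_transform(X)  # shape (n_docs, k)

    # 3. Illustrative seed rule (an assumption, not the method of the paper): for
    #    each component, pick the not-yet-chosen document with the largest
    #    absolute projection and use its TF-IDF vector as an initial centroid.
    seed_idx = []
    for j in range(k):
        for i in np.argsort(-np.abs(scores[:, j])):
            if i not in seed_idx:
                seed_idx.append(int(i))
                break
    seeds = X[seed_idx].toarray()

    # 4. Run K-Means from these seeds instead of a random initialization.
    km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0)
    labels = km.fit_predict(X)
    print(labels)  # e.g. [0 0 1 1], grouping the two topics

In this sketch the seeds are existing documents chosen along the principal directions; the paper instead works on the eigenvectors themselves with additional processing, so the snippet should be read only as an orientation to the general family of PCA-based initializations.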
Acknowledgments
This work has been supported by the MINECO ES-TIN2014-57458-R project. Special thanks to the Computer Vision and Image Processing (CVIP) group at URJC.
Cite this article
Velez, D., Sueiras, J., Ortega, A. et al. A method for K-Means seeds generation applied to text mining. Stat Methods Appl 25, 477–499 (2016). https://doi.org/10.1007/s10260-015-0345-4