
A method for K-Means seeds generation applied to text mining

  • Original Paper
  • Published:
Statistical Methods & Applications

Abstract

In this paper, a methodology is proposed for generating a set of seeds that serve as the starting point for K-Means-type unsupervised classification algorithms in text mining. Our proposal uses the eigenvectors obtained from principal component analysis to extract initial seeds, after an appropriate treatment aimed at finding lightly overlapping clusters that are also clearly identified by keywords. This work is motivated by the authors' interest in identifying previously unknown topics and themes in short texts. To validate the method, it was applied to a sample of labeled e-mails (NG20), a gold standard within the field of text mining. Specifically, several corpora referenced in the literature were used, configured according to the mix of topics contained in the sample. The proposed method improves on the results of the state-of-the-art methods to which it is compared.
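
To make the general idea concrete, the Python sketch below (using scikit-learn) derives K-Means seeds from a PCA-like decomposition of a TF-IDF document-term matrix. It is only a minimal illustration under stated assumptions: the toy corpus, the seed-selection rule (the document with the largest absolute projection on each leading component) and all parameter values are hypothetical, and the sketch omits the paper's additional treatment aimed at lightly overlapping, keyword-identifiable clusters.

    # A minimal sketch of PCA-style seeding for K-Means on a TF-IDF matrix.
    # NOT the authors' exact procedure: the toy corpus, the seed rule (document
    # with the largest absolute score on each component) and all parameter
    # values are illustrative assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = [                                   # hypothetical short texts
        "the team won the football match",
        "a late goal decided the league game",
        "the striker scored twice tonight",
        "the new gpu renders graphics faster",
        "benchmarks show the cpu is faster",
        "the laptop ships with a faster chip",
    ]
    k = 2                                      # assumed number of clusters

    # Document-term matrix with TF-IDF weighting
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # PCA-like decomposition (TruncatedSVD works directly on the sparse,
    # uncentered matrix); rows of `scores` are document projections on the
    # k leading directions.
    scores = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)

    # Seed rule (assumption): the document with the largest absolute
    # projection on each leading component becomes that cluster's initial
    # centroid.
    seed_rows = np.abs(scores).argmax(axis=0)
    seeds = X[seed_rows].toarray()

    # K-Means started from the PCA-derived seeds (n_init=1 since the
    # initialization is explicit)
    labels = KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X)
    print(labels)

Starting K-Means from data-driven seeds of this kind, rather than from random points, makes the resulting partition deterministic and less sensitive to initialization, which is the motivation behind seeding schemes in general.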



Acknowledgments

This work has been supported by the MINECO ES-TIN2014-57458-R project. Special thanks to the Computer Vision and Image Processing (CVIP) group at URJC.

Author information


Corresponding author

Correspondence to Jose F. Velez.


About this article


Cite this article

Velez, D., Sueiras, J., Ortega, A. et al. A method for K-Means seeds generation applied to text mining. Stat Methods Appl 25, 477–499 (2016). https://doi.org/10.1007/s10260-015-0345-4


  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10260-015-0345-4
