Abstract
Document representation is the basis of any clustering system. Non-contiguous phrases appear increasingly often in text in the Web 2.0 age, and these phrases can affect the results of text clustering. To improve the quality of text clustering, this paper proposes a feature cluster-based vector space model (FC-VSM), which represents documents using a co-occurrence matrix of text feature clusters, and proposes identifying non-contiguous phrases in the text preprocessing stage. Compared with the traditional VSM-based model, our method reduces the dimensionality of the feature space. It identifies non-contiguous phrases, uses distributed representations of features, and builds feature clusters. Despite their simplicity, our methods are surprisingly effective, and experimental results show that they significantly improve clustering accuracy.
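
The idea of representing documents over feature clusters rather than individual terms can be illustrated with a minimal sketch. It is not the paper's implementation: the 2-D word vectors are invented for illustration (a real system would learn distributed representations from a corpus), the names `WORD_VECTORS`, `kmeans`, and `document_vector` are hypothetical, and a plain per-document cluster count stands in for the feature cluster co-occurrence matrix, whose exact construction the abstract does not specify.

```python
from collections import Counter
from math import dist

# Hypothetical 2-D word vectors, invented for this sketch.
# A real system would learn distributed representations from a corpus.
WORD_VECTORS = {
    "cat": (0.9, 0.1), "stock": (0.1, 0.9),
    "dog": (1.0, 0.2), "market": (0.0, 1.0),
    "pet": (0.8, 0.0), "trade": (0.2, 0.8),
}


def kmeans(vectors, k, iterations=10):
    """Cluster words with plain k-means.

    Centroids start at the first k vectors, so the result is deterministic.
    Returns a dict mapping each word to its cluster index.
    """
    words = list(vectors)
    centroids = [vectors[w] for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach every word to its nearest centroid.
        for w in words:
            assignment[w] = min(
                range(k), key=lambda c: dist(vectors[w], centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment


def document_vector(tokens, assignment, k):
    """Represent a document as counts over k feature clusters.

    Tokens without a cluster assignment are ignored.
    """
    counts = Counter(assignment[t] for t in tokens if t in assignment)
    return [counts.get(c, 0) for c in range(k)]


if __name__ == "__main__":
    clusters = kmeans(WORD_VECTORS, k=2)
    tokens = "the cat and the dog trade".split()
    print(document_vector(tokens, clusters, k=2))
```

In this toy setting the vocabulary has six terms but each document vector has only two dimensions, one per feature cluster, which is the kind of dimensionality reduction the abstract refers to.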

Cite this article
Qimin, C., Qiao, G., Yongliang, W. et al. Text clustering using VSM with feature clusters. Neural Comput & Applic 26, 995–1003 (2015). https://doi.org/10.1007/s00521-014-1792-9
DOI: https://doi.org/10.1007/s00521-014-1792-9