Text Document Clustering with Hybrid Feature Selection

Authors:

Asmaa Benghabrit,

El Moukhtar Zemmouri, and

Hicham Behja

IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services

December 2013

Pages 600 - 604

https://doi.org/10.1145/2539150.2539225

Published: 02 December 2013

Abstract

Finding relevant information and making sense of it is a delicate task given the enormous number of unstructured texts created daily. This is the purpose of clustering algorithms, which are among the most powerful text mining tools. In this paper, we propose a novel text document clustering approach based on a new hybrid feature selection method that we call HFSM. This technique extracts statistically and semantically relevant terms to guide the clustering mechanism. Experiments conducted on the Reuters corpus demonstrate the practical aspects of our algorithm and show that it produces more accurate clusterings than those obtained by other existing algorithms.
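
The abstract does not specify how HFSM scores or combines terms, so the following is only a minimal sketch of the general idea (rank terms by a mix of a statistical and a semantic criterion, then cluster on the selected terms), not the authors' method. Everything in it is an assumption made for illustration: the TF-IDF-style statistical score, the hand-made `related_terms` lexicon standing in for a semantic resource, the mixing weight `alpha`, the toy documents, and the plain k-means step.

```python
import math
from collections import Counter


def statistical_scores(docs):
    """Corpus-level TF-IDF mass per term (one possible statistical criterion)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / doc_freq[term])
    return scores


def semantic_scores(vocabulary, related_terms):
    """Count how many related terms of each word also occur in the corpus.

    `related_terms` is a hypothetical hand-made lexicon, not from the paper.
    """
    return {t: len(related_terms.get(t, set()) & vocabulary) for t in vocabulary}


def normalise(scores):
    """Scale scores to [0, 1] by the largest value (guarding against all zeros)."""
    top = max(scores.values()) or 1.0
    return {t: s / top for t, s in scores.items()}


def select_features(docs, related_terms, k, alpha=0.5):
    """Rank terms by a weighted mix of both scores and keep the top k."""
    stat = normalise(statistical_scores(docs))
    sem = normalise(semantic_scores(set(stat), related_terms))
    combined = {t: alpha * stat[t] + (1 - alpha) * sem[t] for t in stat}
    return sorted(combined, key=lambda t: (-combined[t], t))[:k]


def vectorise(docs, features):
    """Represent each document by its raw counts of the selected terms."""
    counts = [Counter(doc) for doc in docs]
    return [[c[f] for f in features] for c in counts]


def kmeans(vectors, k, iterations=10):
    """Plain k-means (squared Euclidean); the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, already-tokenised documents (illustrative only, not Reuters data).
docs = [
    ["oil", "price", "barrel", "oil"],
    ["oil", "crude", "barrel", "price"],
    ["bank", "loan", "interest", "bank"],
    ["bank", "interest", "rate", "loan"],
]
related = {"oil": {"crude", "barrel"}, "bank": {"loan", "interest"},
           "crude": {"oil"}, "loan": {"bank"}}

features = select_features(docs, related, k=4)
labels = kmeans(vectorise(docs, features), k=2)
print(features, labels)
```

In this sketch `alpha` controls the balance between the two criteria: `alpha=1.0` reduces to purely statistical selection, while `alpha=0.0` relies on the lexicon alone. Seeding the centroids with the first k documents keeps the example deterministic; a real experiment would likely use a more careful initialisation.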

Index Terms

Text Document Clustering with Hybrid Feature Selection
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services

December 2013

753 pages

ISBN: 9781450321136

DOI: 10.1145/2539150

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

In-Cooperation

@WAS: International Organization of Information Integration and Web-based Applications and Services

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

IIWAS '13

IIWAS '13: The 15th International Conference on Information Integration and Web-based Applications & Services

December 2 - 4, 2013

Vienna, Austria

Article Metrics

Total Citations: 6
Total Downloads: 182
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Cited By

Chabih O, Sbai S, Behja H, Louhdi M, Zemmouri E, Trousse B (2020). New approach to determine the optimal number of clusters K in unsupervised classification. 2020 6th IEEE Congress on Information Science and Technology (CiSt), 348-352. Online publication date: 5-Jun-2020.
https://doi.org/10.1109/CiSt49399.2021.9357249

Eddamiri S, Zemmouri E, Benghabrit A (2019). An improved RDF data Clustering Algorithm. Procedia Computer Science, 148, 208-217. Online publication date: 2019.
https://doi.org/10.1016/j.procs.2019.01.038

Ferdaous H, Bouchra F, Brahim O, Imad-Eddine M, Asmaa B (2018). Recommendation using a clustering algorithm based on a hybrid features selection method. Journal of Intelligent Information Systems, 51:1, 183-205. Online publication date: 28-Dec-2018.
https://dl.acm.org/doi/10.1007/s10844-017-0493-0

Ouhbi B, Kamoune M, Frikh B, Zemmouri E, Behja H, Anderst-Kotsis G (2016). A hybrid feature selection rule measure and its application to systematic review. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, 106-114. Online publication date: 28-Nov-2016.
https://dl.acm.org/doi/10.1145/3011141.3011177

Hdioud F, Frikh B, Benghabrit A, Ouhbi B (2015). Collaborative Filtering with Hybrid Clustering Integrated Method to Address New-Item Cold-Start Problem. Intelligent Distributed Computing IX, 285-296. Online publication date: 18-Oct-2015.
https://doi.org/10.1007/978-3-319-25017-5_27

Benghabrit A, Ouhbi B, Zemmouri E, Frikh B, Behja H (2014). Exploiting statistical and semantic information for document clustering: An evaluation on feature selection. 2014 Third IEEE International Colloquium in Information Science and Technology (CIST), 96-101. Online publication date: Oct-2014.
https://doi.org/10.1109/CIST.2014.7016601
