DOI: 10.1145/2539150.2539225
research-article

Text Document Clustering with Hybrid Feature Selection

Published: 02 December 2013

    Abstract

    Finding relevant information and making it understandable for human research is a delicate task given the enormous number of unstructured texts created daily. This is the purpose of clustering algorithms, which are among the most powerful text mining tools. In this paper, we propose a novel text document clustering approach based on a new hybrid feature selection method that we call HFSM. This technique extracts statistically and semantically relevant terms to guide the clustering mechanism. Experiments conducted on the Reuters corpus demonstrate the practical aspects of our algorithm and show that it generates more accurate clusterings than those obtained by other existing algorithms.

    Published In

    IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services
    December 2013
    753 pages
    ISBN:9781450321136
    DOI:10.1145/2539150
    In-Cooperation

    • @WAS: International Organization of Information Integration and Web-based Applications and Services

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Text clustering
    2. feature selection
    3. performance analysis
    4. statistical and semantic data
    5. text mining

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    IIWAS '13

    Cited By

    • (2020) New approach to determine the optimal number of clusters K in unsupervised classification. 2020 6th IEEE Congress on Information Science and Technology (CiSt), 348-352. DOI: 10.1109/CiSt49399.2021.9357249. Online publication date: 5-Jun-2020
    • (2019) An improved RDF data Clustering Algorithm. Procedia Computer Science 148, 208-217. DOI: 10.1016/j.procs.2019.01.038. Online publication date: 2019
    • (2018) Recommendation using a clustering algorithm based on a hybrid features selection method. Journal of Intelligent Information Systems 51:1, 183-205. DOI: 10.1007/s10844-017-0493-0. Online publication date: 28-Dec-2018
    • (2016) A hybrid feature selection rule measure and its application to systematic review. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, 106-114. DOI: 10.1145/3011141.3011177. Online publication date: 28-Nov-2016
    • (2015) Collaborative Filtering with Hybrid Clustering Integrated Method to Address New-Item Cold-Start Problem. Intelligent Distributed Computing IX, 285-296. DOI: 10.1007/978-3-319-25017-5_27. Online publication date: 18-Oct-2015
    • (2014) Exploiting statistical and semantic information for document clustering: An evaluation on feature selection. 2014 Third IEEE International Colloquium in Information Science and Technology (CIST), 96-101. DOI: 10.1109/CIST.2014.7016601. Online publication date: Oct-2014
