DOI: 10.1145/1645953.1646071
research-article

Exploiting internal and external semantics for the clustering of short texts using world knowledge

Published: 02 November 2009
    Abstract

    Clustering short texts, such as snippets, poses great challenges for existing aggregated search techniques because of data sparseness and the complex semantics of natural language. Because short texts do not provide sufficient term occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when applied directly to short-text tasks. In this paper, we propose a novel framework that improves the performance of short-text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of the original short texts and reconstructs the corresponding feature space by integrating multiple semantic knowledge bases, namely Wikipedia and WordNet. Empirical evaluation on Reuters and a real web dataset demonstrates that our approach achieves significant improvement over state-of-the-art methods.
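
To make the sparsity problem concrete, the following is a minimal sketch and not the authors' three-level framework. It shows two short texts that share no terms, so their bag-of-words cosine similarity is zero, and then adds concept features from an external lookup so that the two texts overlap. The `TOY_CONCEPTS` table, the `concept:` feature prefix, the `expand` helper, and the 0.5 weight are all assumptions made for illustration; in the paper the external knowledge comes from Wikipedia and WordNet.

```python
import math
from collections import Counter

# Hypothetical stand-in for an external knowledge base (an assumption for
# illustration; the paper draws its concepts from Wikipedia and WordNet).
TOY_CONCEPTS = {
    "jaguar": ["animal", "cat"],
    "leopard": ["animal", "cat"],
    "python": ["programming", "language"],
    "java": ["programming", "language"],
}


def bag_of_words(text):
    """Count lowercase whitespace-separated terms."""
    return Counter(text.lower().split())


def expand(bow, concepts=TOY_CONCEPTS, weight=0.5):
    """Copy the original term counts and add down-weighted concept features."""
    expanded = Counter()
    for term, count in bow.items():
        expanded[term] += count
        for concept in concepts.get(term, []):
            # The prefix keeps concept features apart from surface terms.
            expanded["concept:" + concept] += weight * count
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "jaguar sighting"
s2 = "leopard habitat"

# No shared surface terms, so the plain representation gives no similarity.
plain = cosine(bag_of_words(s1), bag_of_words(s2))

# Shared concepts ("animal", "cat") give the enriched vectors some overlap.
enriched = cosine(expand(bag_of_words(s1)), expand(bag_of_words(s2)))

print(plain, enriched)
```

In this toy setting the plain similarity is zero while the enriched similarity is positive, which is the effect that reconstructing the feature space with external concepts is meant to achieve. How concepts are selected and weighted is the substance of the paper and is not modelled here.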

        Published In

        CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
        November 2009
        2162 pages
        ISBN: 9781605585123
        DOI: 10.1145/1645953
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. clustering
        2. semantic knowledge bases
        3. short texts
        4. syntactic structure

        Qualifiers

        • Research-article

        Conference

        CIKM '09
        Acceptance Rates

        Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

        Article Metrics

        • Downloads (Last 12 months): 16
        • Downloads (Last 6 weeks): 2
        Reflects downloads up to 11 Aug 2024

        Citations

        Cited By

        • (2024) Pipelining Semantic Expansion and Noise Filtering for Sentiment Analysis of Short Documents – CluSent Method. Journal on Interactive Systems 15:1 (561-575). DOI: 10.5753/jis.2024.4117. Online publication date: 11-Jun-2024
        • (2023) CluSent – Combining Semantic Expansion and De-Noising for Dataset-Oriented Sentiment Analysis of Short Texts. Proceedings of the 29th Brazilian Symposium on Multimedia and the Web (110-118). DOI: 10.1145/3617023.3617039. Online publication date: 23-Oct-2023
        • (2023) DEC-transformer: deep embedded clustering with transformer on Chinese long text. Pattern Analysis and Applications 26:3 (1349-1362). DOI: 10.1007/s10044-023-01161-z. Online publication date: 10-May-2023
        • (2022) Short Text Clustering Algorithms, Application and Challenges: A Survey. Applied Sciences 13:1 (342). DOI: 10.3390/app13010342. Online publication date: 27-Dec-2022
        • (2022) Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing. ACM Transactions on Asian and Low-Resource Language Information Processing 21:4 (1-18). DOI: 10.1145/3503156. Online publication date: 4-Mar-2022
        • (2022) A Short Text Topic Model Based on Semantics and Word Expansion. 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI) (60-64). DOI: 10.1109/CCAI55564.2022.9807822. Online publication date: 6-May-2022
        • (2022) Comparison of Estimation Algorithms for Latent Dirichlet Allocation. Quantitative Psychology (27-37). DOI: 10.1007/978-3-031-04572-1_3. Online publication date: 13-Jul-2022
        • (2021) A Novel Fuzzy Logic-Based Text Classification Method for Tracking Rare Events on Twitter. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51:7 (4324-4333). DOI: 10.1109/TSMC.2019.2932436. Online publication date: Jul-2021
        • (2021) Exploring Trending Topics of Social Media Text with VoronoiTopicCloud Provide Useful and Intuitive Insights into Social Media Texts. 2021 3rd International Conference on Natural Language Processing (ICNLP) (1-8). DOI: 10.1109/ICNLP52887.2021.00006. Online publication date: Mar-2021
        • (2021) Short Text Clustering Using Joint Optimization of Feature Representations and Cluster Assignments. PRICAI 2021: Trends in Artificial Intelligence (217-231). DOI: 10.1007/978-3-030-89363-7_17. Online publication date: 1-Nov-2021
