Exploiting internal and external semantics for the clustering of short texts using world knowledge

Authors:

Tat-Seng Chua

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

Pages 919 - 928

https://doi.org/10.1145/1645953.1646071

Published: 02 November 2009

Abstract

Clustering short texts, such as snippets, poses great challenges for existing aggregated search techniques because of data sparseness and the complex semantics of natural language. Because short texts do not provide sufficient term occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when applied directly to short-text tasks. In this paper, we propose a novel framework that improves the performance of short-text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity of the original short texts and reconstructs the corresponding feature space by integrating multiple semantic knowledge bases, namely Wikipedia and WordNet. Empirical evaluation on Reuters and a real web dataset demonstrates that our approach achieves significant improvement over state-of-the-art methods.
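
The sparseness problem described in the abstract can be illustrated with a minimal sketch. This is not the paper's three-level framework, whose details are not given here. It only shows, under stated assumptions, why two related snippets with no shared terms have zero bag-of-words similarity, and how adding features from an external resource can make them comparable. The lookup table TOY_CONCEPTS, the "concept:" feature prefix, the weight of 0.5, and the function names are all hypothetical stand-ins for a real knowledge base such as Wikipedia or WordNet.

```python
import math
from collections import Counter

# Hypothetical toy lookup standing in for an external knowledge base.
# The mapping below is invented purely for illustration.
TOY_CONCEPTS = {
    "jaguar": ["car", "animal"],
    "bmw": ["car"],
    "sedan": ["car"],
    "leopard": ["animal"],
}


def bag_of_words(text):
    """Count lowercased whitespace tokens into a sparse term vector."""
    return Counter(text.lower().split())


def expand(vector, concepts=TOY_CONCEPTS, weight=0.5):
    """Return a copy of the vector with weighted concept features added."""
    expanded = Counter(vector)
    for term, count in vector.items():
        for concept in concepts.get(term, []):
            # Prefix concept features so they never collide with raw terms.
            expanded["concept:" + concept] += weight * count
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = bag_of_words("new jaguar sedan")
s2 = bag_of_words("bmw dealer prices")

# The raw snippets share no terms, so their similarity is zero.
print("bag of words:", cosine(s1, s2))
# After expansion both snippets carry the feature "concept:car".
print("with concepts:", cosine(expand(s1), expand(s2)))
```

The expanded vectors overlap only through the shared concept feature, so the second similarity is positive while the first is zero. A real system would also have to handle ambiguity (here "jaguar" maps to both "car" and "animal"), which this sketch leaves untouched.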

Cited By

Viegas F, Canuto S, Cunha W, França C, Valiense C, Fonseca G, Machado A, Rocha L, Gonçalves M (2024) Pipelining Semantic Expansion and Noise Filtering for Sentiment Analysis of Short Documents – CluSent Method. Journal on Interactive Systems 15:1 (561-575). Online publication date: 11-Jun-2024
https://doi.org/10.5753/jis.2024.4117

Viegas F, Canuto S, Cunha W, França C, Valiense C, Rocha L, Gonçalves M (2023) CluSent – Combining Semantic Expansion and De-Noising for Dataset-Oriented Sentiment Analysis of Short Texts. Proceedings of the 29th Brazilian Symposium on Multimedia and the Web (110-118). Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3617023.3617039

Zou A, Hao W, Chen G, Jin D (2023) DEC-transformer: deep embedded clustering with transformer on Chinese long text. Pattern Analysis and Applications 26:3 (1349-1362). Online publication date: 10-May-2023
https://doi.org/10.1007/s10044-023-01161-z

Index Terms

Exploiting internal and external semantics for the clustering of short texts using world knowledge
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Information & Contributors

Information

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

November 2009

2162 pages

ISBN:9781605585123

DOI:10.1145/1645953

General Chairs:
David Cheung, University of Hong Kong, Hong Kong
Il-Yeol Song, Drexel University, USA

Program Chairs:
Wesley Chu, UCLA, USA
Xiaohua Hu, Drexel University, USA
Jimmy Lin, University of Maryland, USA

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '09

CIKM '09: Conference on Information and Knowledge Management

November 2 - 6, 2009

Hong Kong, China

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Article Metrics

Total Citations: 165
Total Downloads: 1,667
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 2

Reflects downloads up to 11 Aug 2024

Citations

Ahmed M, Tiun S, Omar N, Sani N (2022) Short Text Clustering Algorithms, Application and Challenges: A Survey. Applied Sciences 13:1 (342). Online publication date: 27-Dec-2022
https://doi.org/10.3390/app13010342

Manerkar S, Asnani K, Khorjuvenkar P, Desai S, Pawar J (2022) Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing. ACM Transactions on Asian and Low-Resource Language Information Processing 21:4 (1-18). Online publication date: 4-Mar-2022
https://dl.acm.org/doi/10.1145/3503156

Zhen L, Yabin S, Ning Y (2022) A Short Text Topic Model Based on Semantics and Word Expansion. 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI) (60-64). Online publication date: 6-May-2022
https://doi.org/10.1109/CCAI55564.2022.9807822

Mardones-Segovia C, Choi H, Hong M, Wheeler J, Cohen A (2022) Comparison of Estimation Algorithms for Latent Dirichlet Allocation. Quantitative Psychology (27-37). Online publication date: 13-Jul-2022
https://doi.org/10.1007/978-3-031-04572-1_3

Lu X, Zhou M, Wu K (2021) A Novel Fuzzy Logic-Based Text Classification Method for Tracking Rare Events on Twitter. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51:7 (4324-4333). Online publication date: Jul-2021
https://doi.org/10.1109/TSMC.2019.2932436

Wang Y, Zhou F, Li X (2021) Exploring Trending Topics of Social Media Text with VoronoiTopicCloud Provide Useful and Intuitive Insights into Social Media Texts. 2021 3rd International Conference on Natural Language Processing (ICNLP) (1-8). Online publication date: Mar-2021
https://doi.org/10.1109/ICNLP52887.2021.00006

Sun L, Du T, Duan X, Luo Y (2021) Short Text Clustering Using Joint Optimization of Feature Representations and Cluster Assignments. PRICAI 2021: Trends in Artificial Intelligence (217-231). Online publication date: 1-Nov-2021
https://doi.org/10.1007/978-3-030-89363-7_17
