DOI: 10.1145/1571941.1571967

Enhancing cluster labeling using wikipedia

Published: 19 July 2009

Abstract

This work investigates how Wikipedia, the free online encyclopedia, can be used to enhance cluster labeling. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms extracted directly from the text. The "labeling quality" of each candidate is then evaluated by several independent judges, and the top-scoring candidates are recommended for labeling.
Our experimental results reveal that the Wikipedia labels agree with the manual labels that humans assign to a cluster much more than the significant terms extracted directly from the text do. We show that in most cases, even when the human-assigned label appears in the text, purely statistical methods have difficulty identifying it as a good descriptor. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection or synonym of it) appears in the top five labels recommended by our system.
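
The "top five" figure above can be read as a match-at-k measure: the fraction of clusters whose manual label shows up among the k highest-ranked recommendations. The following is only a minimal sketch of such a measure, not the evaluation code used in this work. The names `normalize` and `match_at_k` and the toy data are hypothetical, and the sketch compares labels by exact match after lowercasing, so it does not credit inflections or synonyms the way the reported result does.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(label.lower().split())


def match_at_k(recommended, manual, k=5):
    """Fraction of clusters whose manual label is among the top-k recommended labels.

    recommended: dict mapping cluster id -> ranked list of candidate labels (best first)
    manual:      dict mapping cluster id -> the human-assigned label
    Assumption: exact match after normalization; inflections and synonyms are NOT handled.
    """
    if not manual:
        return 0.0
    hits = 0
    for cluster_id, gold in manual.items():
        top_k = recommended.get(cluster_id, [])[:k]
        if normalize(gold) in {normalize(c) for c in top_k}:
            hits += 1
    return hits / len(manual)


# Toy data, invented for illustration only (not taken from the test collection).
recommended = {
    "c1": ["Baseball", "Major League Baseball", "pitcher", "team", "season"],
    "c2": ["Graphics", "image", "file format", "JPEG", "software"],
    "c3": ["engine", "car", "dealer", "Automobile", "price", "Motorcycle"],
}
manual = {"c1": "baseball", "c2": "Computer graphics", "c3": "motorcycle"}

print(match_at_k(recommended, manual, k=5))
```

Raising `k` can only keep or increase the score, since a label found in the top k is also in the top k+1. In the toy data, cluster `c3` becomes a hit only once `k` reaches 6.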

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. cluster labeling
2. wikipedia

Qualifiers

• Research-article

Conference

SIGIR '09

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
