research-article

Enhancing cluster labeling using wikipedia

Authors:

Haggai Roitman,

Naama Zwerdling

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 139 - 146

https://doi.org/10.1145/1571941.1571967

Published: 19 July 2009

Abstract

This work investigates how Wikipedia, the free online encyclopedia, can be used to enhance cluster labeling. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms extracted directly from the text. The "labeling quality" of each candidate is then evaluated by several independent judges, and the top-scoring candidates are recommended as labels.
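The framework outlined above (pool candidates from the text and from Wikipedia, have several judges score each candidate, recommend the best) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `important_terms`, `coverage_judge`, and `recommend_labels` are hypothetical, the term extractor and judge are toy stand-ins, the Wikipedia-derived candidates are simply passed in as a list, and averaging the judges' scores is an assumption of the sketch.

```python
from collections import Counter
from typing import Callable, Iterable, List

# A judge scores one candidate label against the cluster's documents (higher is better).
Judge = Callable[[str, List[str]], float]


def important_terms(docs: List[str], k: int = 5) -> List[str]:
    """Toy term extractor: the k most frequent tokens longer than three characters."""
    counts = Counter(
        tok for doc in docs for tok in doc.lower().split() if len(tok) > 3
    )
    return [term for term, _ in counts.most_common(k)]


def coverage_judge(label: str, docs: List[str]) -> float:
    """Toy judge: fraction of documents sharing at least one word with the label."""
    words = set(label.lower().split())
    return sum(bool(words & set(doc.lower().split())) for doc in docs) / len(docs)


def recommend_labels(
    docs: List[str],
    wikipedia_candidates: Iterable[str],
    judges: List[Judge],
    top_n: int = 5,
) -> List[str]:
    """Pool text terms with Wikipedia-derived candidates, score them, and rank them."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    pool = list(dict.fromkeys(important_terms(docs) + list(wikipedia_candidates)))
    # Averaging the judges' scores is an assumption of this sketch.
    score = {c: sum(j(c, docs) for j in judges) / len(judges) for c in pool}
    return sorted(pool, key=lambda c: score[c], reverse=True)[:top_n]


docs = ["python code review", "python unit tests", "java code style"]
labels = recommend_labels(
    docs, ["Software testing", "Code review"], [coverage_judge], top_n=3
)
```

Because each judge is just a callable, more judges can be added to the list without changing the ranking code.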

Our experimental results reveal that Wikipedia labels agree with the manual labels that humans assign to a cluster much more than significant terms extracted directly from the text do. We show that in most cases, even when the human-assigned label appears in the text, purely statistical methods have difficulty identifying it as a good descriptor. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection or synonym of it) appears among the top five labels recommended by our system.
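The top-five figure reported above is a "match within the top k" measurement. A minimal sketch of such a check is given below, assuming the caller supplies, for each cluster, the set of accepted strings (the manual label plus any inflections or synonyms counted as correct). The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def match_at_k(recommended: Sequence[str], accepted: Iterable[str], k: int = 5) -> bool:
    """True if any of the top-k recommended labels equals an accepted label."""
    # Comparison is case-insensitive; variants must be listed explicitly in `accepted`.
    accepted_norm = {a.strip().lower() for a in accepted}
    return any(r.strip().lower() in accepted_norm for r in recommended[:k])


def fraction_matched(
    clusters: List[Tuple[Sequence[str], Set[str]]], k: int = 5
) -> float:
    """Share of clusters whose accepted label appears in the top-k recommendations."""
    hits = sum(match_at_k(rec, acc, k) for rec, acc in clusters)
    return hits / len(clusters)


# Toy data: (ranked recommendations, accepted manual label variants) per cluster.
clusters = [
    (["Ice hockey", "Hockey", "puck"], {"hockey"}),
    (["engine", "car"], {"automobiles", "autos"}),
]
share = fraction_matched(clusters, k=5)
```

Here the first toy cluster counts as a match and the second does not, so a result above 0.85 on a real collection would correspond to the kind of claim made in the abstract.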

Index Terms

Enhancing cluster labeling using wikipedia
1. Information systems
  1. Information retrieval

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN: 9781605584836

DOI: 10.1145/1571941

General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)

Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 103

Total Downloads: 2,252

Downloads (Last 12 months): 22

Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Aug 2024
