Article

CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering

Authors:

Christina Yip Chung,

Bin Chen

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 469 - 474

https://doi.org/10.1145/775047.775115

Published: 23 July 2002

Abstract

As the volume of information in enterprise systems and on the Web grows rapidly, retrieving information accurately has become an important research area. Several corpus-based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms.

We propose CVS, a Correlation-Verification based Smoothing method that considers co-occurrence information in smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms that have low co-occurrence frequencies but are specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance.

A common approach to applying corpus-based smoothing techniques to information retrieval is to refine the vector representations of documents. This paper instead investigates the effects of corpus-based smoothing on information retrieval through query expansion, using term clusters generated by a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering.

Empirical studies show that our approach outperforms previous corpus-based smoothing techniques. It improves retrieval effectiveness by 14.6%. The results demonstrate that corpus-based smoothing can be used for query expansion by term clustering.
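The two-stage idea in the abstract (accept term pairs that co-occur often in a document, then verify low-frequency pairs against the corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the significance test, so the log-likelihood ratio (G^2) statistic, the sentence-level co-occurrence window, and the names `pair_counts`, `g2`, `verified_pairs`, `min_count`, and `threshold` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(sentences):
    """Count, per unordered term pair, the sentences in which both terms occur."""
    counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))
        counts.update(combinations(terms, 2))
    return counts


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, so boundary rates do not raise errors.
    return 0.0 if x == 0 else x * math.log(y)


def _loglik(k, n, p):
    # Binomial log-likelihood without the constant binomial coefficient.
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def g2(k_doc, n_doc, k_corpus, n_corpus):
    """Log-likelihood ratio statistic (assumed test, not taken from the paper)."""
    p_doc = k_doc / n_doc if n_doc else 0.0
    p_corpus = k_corpus / n_corpus if n_corpus else 0.0
    p_all = (k_doc + k_corpus) / (n_doc + n_corpus)
    return 2.0 * (
        _loglik(k_doc, n_doc, p_doc)
        + _loglik(k_corpus, n_corpus, p_corpus)
        - _loglik(k_doc, n_doc, p_all)
        - _loglik(k_corpus, n_corpus, p_all)
    )


def verified_pairs(doc_sentences, corpus_sentences, min_count=2, threshold=3.84):
    """Return {pair: reason} for pairs accepted by frequency or by significance.

    min_count and threshold are hypothetical parameters; 3.84 is the
    chi-square critical value for one degree of freedom at the 0.05 level.
    """
    doc = pair_counts(doc_sentences)
    corpus = pair_counts(corpus_sentences)
    n_doc, n_corpus = len(doc_sentences), len(corpus_sentences)
    accepted = {}
    for pair, k_doc in doc.items():
        if k_doc >= min_count:
            accepted[pair] = "frequency"
            continue
        k_corpus = corpus.get(pair, 0)
        # Keep only pairs that are more frequent in the document than in the corpus.
        more_frequent = k_doc * n_corpus > k_corpus * n_doc
        if more_frequent and g2(k_doc, n_doc, k_corpus, n_corpus) > threshold:
            accepted[pair] = "significance"
    return accepted


doc_sentences = [
    ["query", "expansion", "term"],
    ["query", "expansion"],
    ["smoothing", "sparsity"],
    ["market", "price"],
]
corpus_sentences = [["market", "price"]] * 200
result = verified_pairs(doc_sentences, corpus_sentences)
```

In this toy input, a pair that appears once in the document but is common across the corpus is rejected, while a pair that appears once in the document and never in the rest of the corpus passes the significance check. For simplicity the sketch treats the corpus as a sample separate from the document.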

Cited By

Dupret G, Piwowarski B (2006). Principal components for automatic term hierarchy building. Proceedings of the 13th international conference on String Processing and Information Retrieval, pages 37-48. Online publication date: 11-Oct-2006.
https://dl.acm.org/doi/10.1007/11880561_4

Index Terms

CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering
1. Information systems
  1. Information retrieval
    1. Document representation
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Interpolation

Published In

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

July 2002

719 pages

ISBN:158113567X

DOI:10.1145/775047

Conference Chair: Osmar R. Zaïane (University of Alberta, Canada)

General Chair: Randy Goebel (University of Alberta, Canada)

Program Chairs: David Hand (Imperial College, UK), Daniel Keim (AT&T), Raymond Ng (University of British Columbia, Canada)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD02: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 23 - 26, 2002

Edmonton, Alberta, Canada

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 1

Total Downloads: 531

Downloads (Last 12 months): 4

Downloads (Last 6 weeks): 2

Reflects downloads up to 09 Nov 2024
