DOI: 10.1145/775047.775115
Article

CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering

Published: 23 July 2002

Abstract

As the volume of information in enterprise systems and on the Web grows rapidly, retrieving information accurately has become an important research area. Several corpus-based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms.

We propose CVS, a Correlation-Verification based Smoothing method that considers co-occurrence information in smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms that have low co-occurrence frequencies but are specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance.

A common way to apply corpus-based smoothing techniques to information retrieval is to refine the vector representations of documents. This paper instead investigates the effects of corpus-based smoothing on information retrieval through query expansion, using term clusters generated by a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering.

Empirical studies show that our approach outperforms previous corpus-based smoothing techniques, improving retrieval effectiveness by 14.6%. The results demonstrate that corpus-based smoothing can be used for query expansion by term clustering.
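The two-step idea described in the abstract (select term pairs by co-occurrence frequency, then verify low-frequency pairs against the corpus) can be illustrated with a minimal sketch. The abstract does not state which co-occurrence window or which significance test CVS uses, so everything below is an assumption made for illustration: co-occurrence is counted per sentence, significance is measured with a log-likelihood ratio (G) statistic on a 2x2 table, and the names `pair_counts`, `g_statistic`, `correlated_pairs`, `min_count`, and `g_threshold` are hypothetical. The default threshold 3.84 is the chi-square critical value for one degree of freedom at the 0.05 level.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(sentences):
    """For each unordered term pair, count the sentences that contain both terms."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts


def g_statistic(k1, n1, k2, n2):
    """Log-likelihood ratio comparing a pair's rate k1/n1 (document) with k2/n2 (corpus).

    Assumption: the abstract names no specific test; G is a stand-in here.
    """
    pooled = (k1 + k2) / (n1 + n2)
    observed = [k1, n1 - k1, k2, n2 - k2]
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def correlated_pairs(doc_sentences, corpus_sentences, min_count=2, g_threshold=3.84):
    """Return term pairs judged correlated in the document (both inputs non-empty)."""
    doc = pair_counts(doc_sentences)
    corpus = pair_counts(corpus_sentences)
    n_doc, n_corpus = len(doc_sentences), len(corpus_sentences)
    selected = set()
    for pair, k_doc in doc.items():
        # Step 1: keep pairs that co-occur often enough within the document.
        if k_doc >= min_count:
            selected.add(pair)
            continue
        # Step 2: keep rare pairs only if they are significantly
        # over-represented in the document relative to the corpus.
        k_corpus = corpus[pair]
        over_represented = k_doc / n_doc > k_corpus / n_corpus
        if over_represented and g_statistic(k_doc, n_doc, k_corpus, n_corpus) >= g_threshold:
            selected.add(pair)
    return selected


# Toy data (invented for illustration, not from the paper).
doc = [["bank", "loan", "rate"], ["bank", "loan"], ["merger", "antitrust"]]
corpus = [["bank", "rate"]] * 100 + [["loan", "rate"]] * 100
result = correlated_pairs(doc, corpus)
```

In this toy input, the pair of "bank" and "loan" is kept by the frequency step, while "merger" and "antitrust" co-occur only once in the document but never in the corpus, so they pass the verification step. Pairs such as "bank" and "rate" are common in the corpus and are therefore not treated as specific to the document.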

Cited By

  • (2006) Principal components for automatic term hierarchy building. Proceedings of the 13th international conference on String Processing and Information Retrieval, pages 37-48. DOI: 10.1007/11880561_4. Online publication date: 11-Oct-2006

      Published In

      KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
      July 2002
      719 pages
      ISBN:158113567X
      DOI:10.1145/775047

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. information retrieval
      2. query expansion
      3. smoothing
      4. term clustering
      5. text mining

      Acceptance Rates

      KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
