research-article

Effective and Robust Query-Based Stemming

Published: 01 November 2013

Abstract

Stemming is a widely used technique in information retrieval systems to address the vocabulary mismatch problem arising from morphological phenomena. The major shortcoming of commonly used stemmers is that they accept the morphological variants of the query words without considering their thematic coherence with the given query, which leads to poor performance. Moreover, for many queries, such approaches produce retrieval performance that is poorer than with no stemming, thereby degrading robustness. The main goal of this article is to present corpus-based, fully automatic stemming algorithms that address these issues. A set of experiments on six TREC collections and three other non-English collections containing news and web documents shows that the proposed query-based stemming algorithms consistently and significantly outperform four strong state-of-the-art stemmers based on completely different principles. Our experiments also confirm that the robustness of the proposed query-based stemming algorithms is remarkably better than that of the existing strong baselines.
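
The abstract does not spell out the algorithms themselves, so the following is only an illustrative sketch of the general idea of query-based stemming: a morphological variant of a query word is accepted only if it is thematically coherent with the rest of the query. Everything in it is an assumption made for illustration and not the authors' method: the shared-prefix rule for finding candidate variants, the document-level co-occurrence score, the threshold, and the hypothetical names `candidate_variants`, `cooccurrence_score`, and `query_based_variants`.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def candidate_variants(term, vocabulary, min_prefix=4):
    """Crude stand-in for a stemmer's equivalence class (assumption):
    vocabulary words that share a prefix of at least min_prefix characters."""
    return {w for w in vocabulary
            if w != term and common_prefix_len(term, w) >= min_prefix}


def cooccurrence_score(variant, context_terms, docs):
    """Fraction of documents containing `variant` that also contain
    at least one of the other query terms."""
    with_variant = [d for d in docs if variant in d]
    if not with_variant:
        return 0.0
    hits = sum(1 for d in with_variant if any(t in d for t in context_terms))
    return hits / len(with_variant)


def query_based_variants(query, docs, threshold=0.5, min_prefix=4):
    """For each query term, keep only the candidate variants that
    co-occur often enough with the remaining query terms."""
    vocabulary = set().union(*docs)
    expanded = {}
    for term in query:
        context = [t for t in query if t != term]
        expanded[term] = sorted(
            v for v in candidate_variants(term, vocabulary, min_prefix)
            if cooccurrence_score(v, context, docs) >= threshold
        )
    return expanded


# Toy corpus: each document is a set of tokens.
docs = [
    {"organ", "transplant", "donor"},
    {"organs", "transplant", "surgery"},
    {"organic", "farming", "soil"},
]
result = query_based_variants(["organ", "transplant"], docs)
print(result)  # {'organ': ['organs'], 'transplant': []}
```

In this toy corpus a query-independent rule would conflate "organ" with both "organs" and "organic". The query-aware filter keeps "organs", which appears alongside "transplant", and rejects "organic", which never does.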


Published In

ACM Transactions on Information Systems  Volume 31, Issue 4
November 2013
192 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2536736
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 2013
Accepted: 01 April 2013
Revised: 01 March 2013
Received: 01 February 2012
Published in TOIS Volume 31, Issue 4

Author Tags

  1. Corpus
  2. stemming
  3. suffix

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) SUSTEM: An Improved Rule-based Sundanese Stemmer. ACM Transactions on Asian and Low-Resource Language Information Processing 23:6 (1-29). DOI: 10.1145/3656342. Online publication date: 5-Apr-2024
  • (2023) A selective approach to stemming for minimizing the risk of failure in information retrieval systems. PeerJ Computer Science 9 (e1175). DOI: 10.7717/peerj-cs.1175. Online publication date: 10-Jan-2023
  • (2023) PWMStem: A Corpus-Based Suffix Identification and Stripping Algorithm for Multi-lingual Stemming. Journal of Advances in Information Technology 14:4 (863-875). DOI: 10.12720/jait.14.4.863-875. Online publication date: 2023
  • (2023) An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems. IEEE Access 11 (133681-133702). DOI: 10.1109/ACCESS.2023.3332710. Online publication date: 2023
  • (2022) A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing. ACM Transactions on Asian and Low-Resource Language Information Processing 22:2 (1-34). DOI: 10.1145/3548457. Online publication date: 13-Jul-2022
  • (2022) Neural Network Guided Fast and Efficient Query-Based Stemming by Predicting Term Co-occurrence Statistics. SN Computer Science 3:3. DOI: 10.1007/s42979-022-01081-5. Online publication date: 1-May-2022
  • (2019) A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics. Knowledge-Based Systems 180:C (147-162). DOI: 10.1016/j.knosys.2019.05.025. Online publication date: 15-Sep-2019
  • (2019) Lessons Learnt from Experiments on the Ad Hoc Multilingual Test Collections at CLEF. Information Retrieval Evaluation in a Changing World (177-200). DOI: 10.1007/978-3-030-22948-1_7. Online publication date: 14-Aug-2019
  • (2018) A cognitive inspired unsupervised language-independent text stemmer for Information retrieval. Cognitive Systems Research 52:C (291-300). DOI: 10.1016/j.cogsys.2018.07.003. Online publication date: 1-Dec-2018
  • (2018) A systematic review of text stemming techniques. Artificial Intelligence Review 48:2 (157-217). DOI: 10.1007/s10462-016-9498-2. Online publication date: 28-Dec-2018
