Abstract
For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, Information Retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation, Blind Relevance Feedback (BRF), and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves Mean Average Precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments, indexing 5-prefixes outperforms our corpus-based stemmer; in Hindi, the corpus-based stemming approach achieves a higher MAP. For Bengali, the LM retrieval model with the rule-based stemmer achieves a higher (but not significantly higher) MAP than BM25 with a corpus-based stemmer (0.4583 vs. 0.4526). In all experiments, BRF yields considerably higher MAP than the corresponding experiments without it. Bilingual IR experiments (English to Bengali and English to Hindi) are based on query translations obtained from native speakers and from the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower than in experiments with manual query translations. The bilingual English to Bengali (English to Hindi) experiments achieve 81.7%–83.3% (78.0%–80.6%) of the MAP of the best corresponding monolingual experiments.
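To make two of the terms in the abstract concrete, the following is a minimal sketch of n-prefix term conflation (with n = 5, as in the Marathi experiments) and of how MAP is computed from ranked result lists. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the example tokens and document IDs are invented, and a "character" is taken to be a Unicode code point, which may differ from how the chapter handles Indic scripts.

```python
from typing import Dict, List, Set


def n_prefix(token: str, n: int = 5) -> str:
    """Conflate a word form to its first n characters.

    Assumption: one character = one Unicode code point. Tokens shorter
    than n are returned unchanged.
    """
    return token[:n]


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Invented toy data, for illustration only.
tokens = ["retrieval", "retrieving", "term"]
conflated = [n_prefix(t) for t in tokens]  # both "retri..." forms collapse

runs = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d1"}}
map_score = mean_average_precision(runs, qrels)
```

In this toy run, the two inflected forms of "retrieve" map to the same index term, which is the effect term conflation is meant to achieve; the MAP value is the mean of the two per-query average precisions.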
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Leveling, J., Ganguly, D., Jones, G.J.F. (2013). Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_28
DOI: https://doi.org/10.1007/978-3-642-40087-2_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)