DOI: 10.1145/2072221.2072240
research-article

Indexing and weighting of multilingual and mixed documents

Published: 03 October 2011

Abstract

Non-English-speaking users, such as Arabic speakers, are not always able to express terminology in their native languages, especially in scientific domains. This difficulty leads many Arabic authors and scholars to use English terms to explain precise concepts, particularly when they address technical topics, resulting in mixed/multilingual queries that contain both English and Arabic terms. Cross-Language Information Retrieval (CLIR) allows users to search documents written in a language different from that of the query. However, current algorithms are optimized for monolingual queries, even when those queries are translated. This paper attempts to address the problem of multilingual querying in CLIR and proposes new indexing and weighting techniques that are better suited to the unique characteristics of this problem. A new multilingual and mixed test collection containing mixed-language (Arabic and English) computer science documents and mixed-language queries has been created. Experimental results show that current CLIR techniques, which were not designed for these types of multilingual queries and documents, perform poorly, whereas the proposed techniques are promising.
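
To make the notion of a mixed-language query concrete, the sketch below separates a single query into its Arabic and English terms by Unicode script. It is only an illustration and is not the indexing or weighting technique proposed in the paper: the function name `split_by_script`, the sample query, and the choice to treat the basic Arabic block (U+0600 to U+06FF) as "Arabic" are all assumptions made for this example.

```python
import re

# Assumption: the basic Arabic Unicode block is enough to recognise Arabic terms.
ARABIC_TERM = re.compile(r"[\u0600-\u06FF]+")
# Assumption: English terms consist of unaccented Latin letters only.
ENGLISH_TERM = re.compile(r"[A-Za-z]+")


def split_by_script(query):
    """Return (arabic_terms, english_terms) found in a mixed-language query."""
    arabic_terms = ARABIC_TERM.findall(query)
    english_terms = [term.lower() for term in ENGLISH_TERM.findall(query)]
    return arabic_terms, english_terms


# Hypothetical mixed query: an Arabic sentence that keeps one English technical term.
query = "مفهوم deadlock في نظم التشغيل"
arabic_terms, english_terms = split_by_script(query)
print(arabic_terms)
print(english_terms)
```

A retrieval system could, under these assumptions, route each group of terms to language-specific processing such as stemming before matching; how the terms are then indexed and weighted is the subject of the paper and is not shown here.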

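Cited By

  • (2019) A Novel Method to Improve the Retrieved Multilingual Documents Score. 2019 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1-5. DOI: 10.1109/ICCCEEE46830.2019.9070949. Online publication date: Sep-2019.
  • (2015) Arabic Text Mining a Systematic Review of the Published Literature 2002-2014. 2015 International Conference on Cloud Computing (ICCC), pp. 1-7. DOI: 10.1109/CLOUDCOMP.2015.7149632. Online publication date: Apr-2015.
  • (2015) Mixed Language Arabic-English Information Retrieval. Computational Linguistics and Intelligent Text Processing, pp. 427-447. DOI: 10.1007/978-3-319-18117-2_32. Online publication date: 2015.
  • (2014) Design and Development Considerations for a Multilingual Digital Library. Software Design and Development, pp. 1222-1233. DOI: 10.4018/978-1-4666-4301-7.ch060. Online publication date: 2014.
  • (2013) Design and Development Considerations for a Multilingual Digital Library. Design, Development, and Management of Resources for Digital Library Services, pp. 1-12. DOI: 10.4018/978-1-4666-2500-6.ch001. Online publication date: 2013.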

    Published In

    SAICSIT '11: Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment
    October 2011
    352 pages
    ISBN: 9781450308786
    DOI: 10.1145/2072221
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Sponsors

    • University of Cape Town
    • SAICSIT: South African Institute of Computer Scientists and Information Technologists

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. indexing
    2. mixed document
    3. multilingual query
    4. weighting

    Qualifiers

    • Research-article

    Conference

    SAICSIT '11
    Sponsor:
    • SAICSIT

    Acceptance Rates

    Overall Acceptance Rate 187 of 439 submissions, 43%
