Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels

  • Original Paper
  • Language Resources and Evaluation

Abstract

In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the attention of the research community because large digital repositories and efficient Machine Translation systems are readily and freely available, which makes it easy to reuse text across languages and very difficult to detect such reuse. Previous studies have explored CLTRD for the English-Urdu language pair at the sentence/passage and document levels, and have developed benchmark corpora and methods for those levels. However, benchmark corpora and methods are lacking for the English-Urdu language pair at the lexical, syntactical, and phrasal levels. To fill this research gap, this study presents three large benchmark corpora for detecting Cross-Lingual Text Reuse (CLTR), each annotated at three levels of rewrite: Wholly Derived (WD), Partially Derived (PD), and Non Derived (ND). The CLEU-Lex, CLEU-Syn, and CLEU-Phr corpora contain 66,485 (WD = 22,236, PD = 20,315, ND = 23,934), 60,267 (WD = 20,007, PD = 16,979, ND = 23,281), and 60,106 (WD = 23,862, PD = 15,878, ND = 20,366) CLTR pairs, respectively. As a second major contribution, we applied Cross-Lingual Word Embedding (CLWE), Cross-Lingual Semantic Tagger (CLST), and Cross-Lingual Sentence Transformer (CLSTR) based methods to the three proposed corpora. Our extensive experimentation showed that, for the binary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer (\(F_{1}\) = 0.80). For the CLEU-Syn and CLEU-Phr corpora, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (\(F_{1}\) = 0.92 on CLEU-Syn and \(F_{1}\) = 0.94 on CLEU-Phr). For the ternary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer (\(F_{1}\) = 0.69). For the CLEU-Syn corpus, the best results were obtained using a combination of the CLWE, CLST, and CLSTR methods (\(F_{1}\) = 0.82). For the CLEU-Phr corpus, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (\(F_{1}\) = 0.78). To foster research on Urdu, a low-resource language, all three proposed corpora are freely and publicly available for research purposes.
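The corpus sizes reported above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not code from the paper. The counts are copied from the abstract; the helper names (`total_pairs`, `class_shares`, `to_binary`) are hypothetical, and merging WD and PD into a single derived class for the binary task is an assumption, since the abstract does not state how the binary labels are formed.

```python
# Class counts copied from the abstract: Wholly Derived (WD),
# Partially Derived (PD) and Non Derived (ND) pairs per corpus.
CORPORA = {
    "CLEU-Lex": {"WD": 22236, "PD": 20315, "ND": 23934},
    "CLEU-Syn": {"WD": 20007, "PD": 16979, "ND": 23281},
    "CLEU-Phr": {"WD": 23862, "PD": 15878, "ND": 20366},
}


def total_pairs(counts):
    """Return the total number of CLTR pairs in one corpus."""
    return sum(counts.values())


def class_shares(counts):
    """Return each class's share of the corpus as a fraction of the total."""
    total = total_pairs(counts)
    return {label: n / total for label, n in counts.items()}


def to_binary(counts):
    """Collapse the three rewrite levels into two classes.

    ASSUMPTION: WD and PD are merged into "Derived"; the abstract
    does not say how the binary task is actually defined.
    """
    return {
        "Derived": counts["WD"] + counts["PD"],
        "Non Derived": counts["ND"],
    }


if __name__ == "__main__":
    for name, counts in CORPORA.items():
        shares = class_shares(counts)
        summary = ", ".join(f"{label} {share:.1%}" for label, share in shares.items())
        print(f"{name}: {total_pairs(counts)} pairs ({summary})")
        print(f"  binary view (assumed): {to_binary(counts)}")
```

The per-class counts sum to the totals given in the abstract (66,485, 60,267, and 60,106), and the shares show that no class dominates any of the three corpora.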

Notes

  1. https://www.wikipedia.org: Last visited: 10-02-2021.

  2. https://translate.google.com: Last visited: 10-02-2021.

  3. https://www.bing.com/translator: Last visited: 10-02-2021.

  4. https://pan.webis.de Last visited: 10-2-2021.

  5. The PAN-PC corpora are freely available to download: https://www.uni-weimar.de/en/media/structure/. Last visited: 10-2-2021.

  6. FIRE 2013 competition https://dl.acm.org/doi/proceedings/10.1145/2701336 Last visited: 10-2-2021.

  7. https://www.uni-weimar.de/medien/webis/events/panfire-11/panfire11-web/#corpus Last visited: 10-2-2021.

  8. The Paraphrase Database (PPDB): http://paraphrase.org/. Last visited: 26-12-2020.

  9. The extracted text pairs in the English language can be downloaded from the following link: https://drive.google.com/drive/folders/1RXF6kXytdkH0Zs-yGpVJfncXjkV18hI-?usp=sharing.

  10. Annotator A is the first author of this paper.

  11. https://creativecommons.org/licenses/by-nc-sa/3.0/ Last visited: 10/1/2021

  12. https://docs.google.com/forms/d/e/1FAIpQLSdT6Oe90ePKwbkx_qbr0Dn-V9K0oFz9OIk9DxRejFaDMNelPA/viewform?usp=sf_link Password: fa18pcs002

  13. The pre-trained Google word embedding model for English was trained on 100 billion words from a Google News dataset.

  14. The Urdu word embedding model was trained on the MK-PUCIT corpus, which contains 28,006,880 tokens.

  15. http://ucrel.lancs.ac.uk/usas/, Last visited: 20-12-2020.

  16. The full tagset is available at the following link: http://ucrel.lancaster.ac.uk/usas/semtags.txt Last visited: 13-08-2020

  17. English semantic tagger can be used online from the following link: http://ucrel-api.lancaster.ac.uk/usas/tagger.html Last visited: 10-02-2021

  18. https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Urdu/Urdu_Semantic_Lexicon.txt Last visited: 10-02-2021.

  19. https://www.sbert.net/, Last visited: 10-2-2021.

  20. https://nlp.stanford.edu/projects/snli/, Last visited: 10-2-2021.

  21. The training corpus contains 17B monolingual sentences and 6B bilingual translation pairs; the model produces 768-dimensional averaged sentence vectors.

  22. The scikit-learn implementations of these machine learning algorithms were used.

  23. For detailed results, see the following link: https://drive.google.com/drive/folders/1-Gqr05n-nBz74I9ZuLGpmdzeTz9gd8_r?usp=sharing
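Notes 13, 14, and 21 refer to word and sentence embedding models, and note 22 to scikit-learn classifiers. As a rough illustration of how embedding-based similarity and the \(F_{1}\) measure fit together, the sketch below averages word vectors into a text vector, compares two texts with cosine similarity, and computes \(F_{1}\) for one class. It is a toy under stated assumptions, not the authors' pipeline: the vectors are made up, the function names and the threshold are hypothetical, and real experiments would load the pre-trained models and would typically rely on scikit-learn's `f1_score`.

```python
import math


def average_vectors(vectors):
    """Average a non-empty list of equal-length word vectors into one text vector."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def f1_for_class(gold, predicted, positive):
    """F1 (harmonic mean of precision and recall) for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up 3-dimensional word vectors standing in for an English text and an
# Urdu text mapped into a shared cross-lingual space (ASSUMPTION).
english_words = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]]
urdu_words = [[0.8, 0.2, 0.0], [0.6, 0.2, 0.2]]

similarity = cosine_similarity(
    average_vectors(english_words), average_vectors(urdu_words)
)

THRESHOLD = 0.5  # hypothetical cut-off, not a value from the paper
label = "Derived" if similarity >= THRESHOLD else "Non Derived"

if __name__ == "__main__":
    print(f"similarity = {similarity:.3f}, predicted label = {label}")
```

A fixed threshold is used here only to keep the example short; feeding the similarity scores to a trained classifier, as note 22 suggests the study did, would replace that step.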

Author information

Corresponding author

Correspondence to Iqra Muneer.

About this article

Cite this article

Muneer, I., Nawab, R.M.A. Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels. Lang Resources & Evaluation 56, 1103–1130 (2022). https://doi.org/10.1007/s10579-022-09613-4

  • DOI: https://doi.org/10.1007/s10579-022-09613-4
