Algorithms and Corpora for Persian Plagiarism Detection

Asghari, Habibollah; Mohtaj, Salar; Fatemi, Omid; Faili, Heshaam; Rosso, Paolo; Potthast, Martin

doi:10.1007/978-3-319-73606-8_5

Omid Fatemi ORCID: orcid.org/0000-0001-9654-0607

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Included in the following conference series:

Forum for Information Retrieval Evaluation

Abstract

The task of plagiarism detection is to find passages of reused text in a suspicious document. This task is of increasing relevance, since scholars around the world take advantage of the wealth of information available on the World Wide Web by reusing existing text instead of writing their own. We organized the Persian PlagDet shared task at PAN 2016 to promote the comparative assessment of NLP techniques for plagiarism detection, with a special focus on plagiarism in a Persian text corpus. The goal of this shared task is to bring together researchers and practitioners working on plagiarism detection and text-reuse detection. We report on the outcome of the shared task, which is divided into two subtasks: text alignment and corpus construction. Nine teams participated in the first subtask, and the best result achieved was a PlagDet score of 0.92. For the second subtask, corpus construction, five teams submitted corpora, which were evaluated using the systems submitted for the first subtask. The results show that significant challenges remain in evaluating newly constructed corpora.
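
The abstract reports a PlagDet score without defining it. As a rough illustration only, the sketch below assumes the definition commonly used in the PAN evaluations: the harmonic mean of precision and recall, divided by log2(1 + granularity). The function name `plagdet` and the input values are hypothetical and are not taken from the chapter.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall, and granularity into one score.

    Assumed definition (not stated in the abstract):
        plagdet = F1 / log2(1 + granularity)
    Granularity is at least 1; a value of 1 means each plagiarism case
    was reported as exactly one detection.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    if precision + recall == 0:
        return 0.0  # no correct detections at all
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Illustrative inputs only; these are not results from the shared task.
perfect = plagdet(1.0, 1.0, 1.0)      # ideal detector
fragmented = plagdet(1.0, 1.0, 3.0)   # same cases split into three detections each
```

Under this assumed definition, a detector that reports each case in several fragments is penalized even when its precision and recall are perfect.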

Notes

1.
http://ictrc.ac.ir/plagdet

Acknowledgments

This work has been funded by the ICT Research Institute, ACECR, with partial support from the Vice Presidency for Science and Technology of Iran (Grant No. 1164331). The work of Paolo Rosso has been partially funded by the SomEMBED MINECO TIN2015-71147-C2-1-P research project and by the Generalitat Valenciana under the grant ALMAMATER (PrometeoII/2014/030). We would like to thank the participants of the competition for their dedicated work. Our special thanks go to the renowned experts who served on the organizing committee for their contributions and devoted work in making this shared task possible. We would like to thank Javad Rafiei and Khadijeh Khoshnava for their help in constructing the evaluation corpus. We are also immensely grateful to Vahid Zarrabi for his comments and valuable help along the way, which greatly assisted this challenging shared task.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Habibollah Asghari, Omid Fatemi & Heshaam Faili
ICT Research Institute, Academic Center for Education, Culture and Research (ACECR), Tehran, Iran
Salar Mohtaj
PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso
Bauhaus-Universität Weimar, Weimar, Germany
Martin Potthast

Corresponding author

Correspondence to Omid Fatemi.

Editor information

Editors and Affiliations

DAIICT, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
DAIICT, Gujarat, India
Parth Mehta
DAIICT, Gujarat, India
Jainisha Sankhavara

About this paper

Cite this paper

Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., Potthast, M. (2018). Algorithms and Corpora for Persian Plagiarism Detection. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_5

DOI: https://doi.org/10.1007/978-3-319-73606-8_5
Published: 04 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)
