Multi-level text document similarity estimation and its application for plagiarism detection

Veisi, Hadi; Golchinpour, Mahboobeh; Salehi, Mostafa; Gharavi, Erfaneh

doi:10.1007/s42044-022-00098-6


Original Article
Published: 08 February 2022

Volume 5, pages 143–155, (2022)
Iran Journal of Computer Science

Hadi Veisi ORCID: orcid.org/0000-0003-2372-7969¹,
Mahboobeh Golchinpour¹,
Mostafa Salehi¹ &
…
Erfaneh Gharavi²

459 Accesses
7 Citations

Abstract

Rapid access to text data on the internet, and the ease of modifying it, have made plagiarism a serious problem. Similarity detection is one approach to identifying plagiarism between text documents. In this paper, we present a three-step method based on the vector representation of words for similarity detection in Persian text documents. Words are represented as vectors in an N-dimensional space, and the similarity between the source and suspicious documents is described as the cosine distance between these vectors. Results on the PAN2016 corpus show that the proposed method, with a run-time of 00:01:27 (h:m:s) for each pair of documents, achieves a plagdet of 94.37%. It outperforms the support vector machine method and the deep learning method by 4.33% and 3.6%, respectively. On PAN2015, the proposed method has a run-time of 00:01:21 (h:m:s) and a plagdet of 96.94%, outperforming the graph-based method for plagiarism detection of Persian text documents by 9.94%.
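The abstract describes comparing source and suspicious text through the cosine of the angle between word vectors, but it does not spell out the three steps. The following is a minimal sketch of that single idea only, not the authors' pipeline. The toy vectors, the averaging of word vectors into a fragment vector, and the names `TOY_VECTORS`, `average_vector`, and `fragment_similarity` are all assumptions made for illustration. Cosine distance, the term used in the abstract, is commonly taken as one minus the cosine similarity computed here.

```python
import math

# Hypothetical 3-dimensional word vectors. Real embeddings would be learned
# from a corpus (for example with word2vec) and have many more dimensions.
TOY_VECTORS = {
    "student": [0.9, 0.1, 0.0],
    "pupil": [0.8, 0.2, 0.1],
    "writes": [0.1, 0.9, 0.2],
    "essay": [0.2, 0.3, 0.9],
    "banana": [0.0, 0.1, -0.8],
}


def average_vector(words, vectors):
    """Average the vectors of the known words (an assumed composition rule)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def fragment_similarity(source_words, suspicious_words, vectors=TOY_VECTORS):
    """Cosine similarity between the averaged vectors of two word lists."""
    a = average_vector(source_words, vectors)
    b = average_vector(suspicious_words, vectors)
    if a is None or b is None:
        return 0.0  # no known words on one side
    return cosine_similarity(a, b)


if __name__ == "__main__":
    source = ["student", "writes", "essay"]
    paraphrase = ["pupil", "writes", "essay"]
    print(fragment_similarity(source, paraphrase))
    print(fragment_similarity(source, ["banana"]))
```

With these toy vectors, the paraphrased fragment scores close to 1 and the unrelated word scores much lower. A real detector would still need a decision threshold and a way to align fragments, neither of which the abstract specifies.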



Acknowledgements

The authors would like to thank the University of Tehran Science & Technology Park for sponsoring this research.

Author information

Authors and Affiliations

Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran
Hadi Veisi, Mahboobeh Golchinpour & Mostafa Salehi
Data Science School, University of Virginia, Charlottesville, USA
Erfaneh Gharavi


Corresponding author

Correspondence to Hadi Veisi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Veisi, H., Golchinpour, M., Salehi, M. et al. Multi-level text document similarity estimation and its application for plagiarism detection. Iran J Comput Sci 5, 143–155 (2022). https://doi.org/10.1007/s42044-022-00098-6


Received: 12 August 2021
Accepted: 18 January 2022
Published: 08 February 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s42044-022-00098-6
