Abstract
Nowadays, rapid access to text data on the internet and the ease of modifying it have made plagiarism a serious problem. Similarity detection is one approach to identifying plagiarism between text documents. In this paper, we present a three-step method based on the vector representation of words for similarity detection in Persian text documents. Words are represented as vectors in an N-dimensional space, and the similarity between the source and suspicious documents is described as the cosine distance between these vectors. Results on the PAN2016 corpus show that the proposed method, with a run-time of 00:01:27 (h:m:s) for each pair of documents, achieves a plagdet of 94.37%. It outperforms the support vector machine method and the deep learning method by 4.33% and 3.6%, respectively. On PAN2015, the proposed method has a run-time of 00:01:21 (h:m:s) and a plagdet of 96.94%, outperforming the graph-based method for plagiarism detection in Persian text documents by 9.94%.
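The idea of comparing text through word vectors and a cosine measure can be illustrated with a minimal sketch. It is not the three-step method of the paper: the toy three-dimensional embeddings, the averaging of word vectors into one fragment vector, and the names `TOY_EMBEDDINGS`, `mean_vector`, and `fragment_similarity` are all assumptions made for illustration. The sketch computes cosine similarity; the corresponding cosine distance is one minus that value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(words, embeddings):
    """Average the vectors of the known words; return None if no word is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 3-dimensional embeddings, invented for this example only.
# A real system would load vectors trained on a large corpus.
TOY_EMBEDDINGS = {
    "copy": [0.9, 0.1, 0.0],
    "duplicate": [0.8, 0.2, 0.1],
    "text": [0.1, 0.9, 0.2],
    "document": [0.2, 0.8, 0.3],
    "weather": [0.0, 0.1, 0.9],
}


def fragment_similarity(source_words, suspicious_words, embeddings=TOY_EMBEDDINGS):
    """Cosine similarity between the averaged word vectors of two fragments."""
    src = mean_vector(source_words, embeddings)
    sus = mean_vector(suspicious_words, embeddings)
    if src is None or sus is None:
        return 0.0
    return cosine_similarity(src, sus)


if __name__ == "__main__":
    related = fragment_similarity(["copy", "text"], ["duplicate", "document"])
    unrelated = fragment_similarity(["copy", "text"], ["weather"])
    print(f"related fragments:   {related:.3f}")
    print(f"unrelated fragments: {unrelated:.3f}")
```

With these toy vectors, a fragment that uses different but related words scores much higher than an unrelated one, which is the property that lets a vector-based comparison catch paraphrased passages that exact string matching would miss.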
Acknowledgements
The authors would like to thank the University of Tehran Science & Technology Park for sponsoring this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Veisi, H., Golchinpour, M., Salehi, M. et al. Multi-level text document similarity estimation and its application for plagiarism detection. Iran J Comput Sci 5, 143–155 (2022). https://doi.org/10.1007/s42044-022-00098-6
DOI: https://doi.org/10.1007/s42044-022-00098-6