Multi-level text document similarity estimation and its application for plagiarism detection

  • Original Article
  • Iran Journal of Computer Science

Abstract

Rapid access to text data on the internet, and the ease of modifying it, have made plagiarism a serious problem. Similarity detection is one approach to identifying plagiarism between text documents. In this paper, we present a three-step method based on the vector representation of words for similarity detection of Persian text documents. Words are represented as vectors in an N-dimensional space, and the similarity between the source and suspicious documents is described as the cosine distance between these vectors. Results on the PAN2016 corpus show that the proposed method, with a run-time of 00:01:27 (h:m:s) for each pair of documents, achieves a plagdet of 94.37%. It outperforms the support vector machine method and the deep learning method by 4.33% and 3.6%, respectively. On PAN2015, the proposed method has a run-time of 00:01:21 (h:m:s) and a plagdet of 96.94%, outperforming the graph-based method for plagiarism detection of Persian text documents by 9.94%.
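
The two quantities named in the abstract, cosine similarity between word vectors and the plagdet score, can be illustrated with a minimal sketch. This is not the authors' three-step method. The toy embeddings, the averaging of word vectors into a fragment vector, and the function names are all assumptions made for illustration. The plagdet formula follows the standard definition from the PAN evaluation framework (F1 divided by log2 of one plus granularity). Cosine distance is taken here as one minus cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens found in `embeddings`; None if none are found.

    Averaging is an illustrative assumption, not necessarily what the paper does.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def fragment_similarity(tokens_a, tokens_b, embeddings):
    """Cosine similarity between the averaged vectors of two token lists."""
    vec_a = mean_vector(tokens_a, embeddings)
    vec_b = mean_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine_similarity(vec_a, vec_b)


def plagdet(precision, recall, granularity):
    """Plagdet score: F1 / log2(1 + granularity), with granularity >= 1."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Hypothetical 2-dimensional embeddings for transliterated Persian tokens.
toy_embeddings = {
    "ketab": [1.0, 0.0],
    "daftar": [0.9, 0.1],
    "aseman": [0.0, 1.0],
}

similar = fragment_similarity(["ketab"], ["daftar"], toy_embeddings)
dissimilar = fragment_similarity(["ketab"], ["aseman"], toy_embeddings)
print(f"similar pair:    {similar:.3f} (distance {1 - similar:.3f})")
print(f"dissimilar pair: {dissimilar:.3f} (distance {1 - dissimilar:.3f})")
print(f"plagdet example: {plagdet(0.9, 0.9, 1.0):.3f}")
```

In practice the embeddings would come from a trained model such as Word2vec rather than a hand-written dictionary, and a granularity above 1 lowers plagdet because it penalises reporting one plagiarised passage as several fragments.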

Acknowledgements

The authors would like to thank the University of Tehran Science & Technology Park for sponsoring this research.

Author information

Corresponding author

Correspondence to Hadi Veisi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Veisi, H., Golchinpour, M., Salehi, M. et al. Multi-level text document similarity estimation and its application for plagiarism detection. Iran J Comput Sci 5, 143–155 (2022). https://doi.org/10.1007/s42044-022-00098-6

  • DOI: https://doi.org/10.1007/s42044-022-00098-6
