
Comparative Study of Different Document Similarity Measures and Models

  • Conference paper
Micro-Electronics and Telecommunication Engineering (ICMETE 2023)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 894))


Abstract

Document similarity refers to measuring how alike two or more documents are in terms of their content or structure. Document similarity algorithms determine the degree of resemblance or relatedness between documents. Document similarity plays a pivotal role in a wide range of tasks involving natural language processing, information retrieval, recommender systems, and duplicate detection. In this paper, we study and compare the similarity scores of documents using different document similarity measures and models such as cosine similarity, Euclidean distance, Jaccard similarity, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Bidirectional Encoder Representations from Transformers (BERT).
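
As a rough illustration of the three vector-based and set-based measures named in the abstract, the sketch below computes cosine similarity, Euclidean distance, and Jaccard similarity for two short texts. It is not the authors' implementation: the whitespace tokenizer, the use of raw term counts as document vectors, the function names, and the two sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine of the angle between the two term-count vectors.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    # Straight-line distance between the term-count vectors (0 means identical counts).
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = va.keys() | vb.keys()
    return math.sqrt(sum((va[t] - vb[t]) ** 2 for t in vocab))


def jaccard_similarity(a, b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample documents, used only for illustration.
doc1 = "the cat sat on the mat"
doc2 = "the cat lay on the rug"

print("cosine:   ", round(cosine_similarity(doc1, doc2), 3))
print("euclidean:", round(euclidean_distance(doc1, doc2), 3))
print("jaccard:  ", round(jaccard_similarity(doc1, doc2), 3))
```

Cosine and Jaccard are similarities (higher means more alike), whereas Euclidean distance is a dissimilarity (lower means more alike), so the scores are not directly comparable without a conversion. LSA, LDA, and BERT would replace the raw term-count vectors with learned representations and are not sketched here.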



Author information

Corresponding author

Correspondence to Anshika Singh.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Singh, A., Garg, S.K. (2024). Comparative Study of Different Document Similarity Measures and Models. In: Sharma, D.K., Peng, SL., Sharma, R., Jeon, G. (eds) Micro-Electronics and Telecommunication Engineering. ICMETE 2023. Lecture Notes in Networks and Systems, vol 894. Springer, Singapore. https://doi.org/10.1007/978-981-99-9562-2_61

  • DOI: https://doi.org/10.1007/978-981-99-9562-2_61

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-9561-5

  • Online ISBN: 978-981-99-9562-2

  • eBook Packages: Engineering, Engineering (R0)
