Abstract
Document similarity measures how closely two or more documents resemble one another in content or structure. Document similarity algorithms quantify the degree of resemblance or relatedness between documents. Document similarity plays a pivotal role in a wide range of tasks, including natural language processing, information retrieval, recommender systems, and duplicate detection. In this paper, we study and compare the similarity scores of documents obtained with different document similarity measures and models, such as cosine similarity, Euclidean distance, Jaccard similarity, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Bidirectional Encoder Representations from Transformers (BERT).
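As a rough illustration of the three simplest measures named above, the following sketch computes cosine similarity, Euclidean distance, and Jaccard similarity on raw term counts. The whitespace tokenizer, the bag-of-words representation, and the function names are assumptions made for this example; they are not taken from the paper, which may use other preprocessing or weighting (for instance TF-IDF).

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine of the angle between the two term-count vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    # Straight-line distance between the two term-count vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = ca.keys() | cb.keys()
    return math.sqrt(sum((ca[t] - cb[t]) ** 2 for t in vocab))


def jaccard_similarity(a, b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(sa & sb) / len(sa | sb)


doc1 = "the cat sat on the mat"
doc2 = "the cat lay on the rug"
print(cosine_similarity(doc1, doc2))
print(euclidean_distance(doc1, doc2))
print(jaccard_similarity(doc1, doc2))
```

For these two sentences, the cosine similarity is 0.75, the Euclidean distance is 2.0, and the Jaccard similarity is 3/7 (about 0.43). Cosine and Jaccard are similarities (higher means more alike), whereas Euclidean distance is a dissimilarity (lower means more alike), so the scores are not directly comparable without rescaling.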
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Singh, A., Garg, S.K. (2024). Comparative Study of Different Document Similarity Measures and Models. In: Sharma, D.K., Peng, SL., Sharma, R., Jeon, G. (eds) Micro-Electronics and Telecommunication Engineering. ICMETE 2023. Lecture Notes in Networks and Systems, vol 894. Springer, Singapore. https://doi.org/10.1007/978-981-99-9562-2_61
DOI: https://doi.org/10.1007/978-981-99-9562-2_61
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9561-5
Online ISBN: 978-981-99-9562-2
eBook Packages: Engineering, Engineering (R0)