A Semantic Textual Similarity Calculation Model Based on Pre-training Model

Ding, Zhaoyun; Liu, Kai; Wang, Wenhao; Liu, Bin

doi:10.1007/978-3-030-82147-0_1

Zhaoyun Ding, Kai Liu (ORCID: orcid.org/0000-0001-7252-5939), Wenhao Wang & Bin Liu

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12816)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Text similarity calculation is a basic research topic in natural language processing and is widely used in plagiarism checking and sentence search. Traditional approaches construct text vectors based only on TF-IDF and measure the similarity between two texts by the cosine of the angle between their vectors. However, this method cannot detect similar texts whose wording differs while their meaning is similar. To address this problem, we pre-train text representations with the ERNIE semantic model from PaddleHub and formulate similar text detection as a classification problem. Because most of the similar texts in the data set lead to category imbalance in the training set, we also propose OSConfusion, an oversampling method based on confusion sampling. The experimental results show that the proposed method handles paper comparison well and can identify repeated paragraphs with different text representations. In predicting similar document pairs, ERNIE-SIM with OSConfusion outperforms ERNIE-SIM without OSConfusion in terms of precision and recall.
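
The TF-IDF baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the whitespace tokenizer, the smoothed IDF formula, and the names `tfidf_vectors` and `cosine` are assumptions chosen for illustration. It shows why the baseline struggles with paraphrases: two texts that share no terms score exactly zero, however close their meaning.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption, not taken from the paper): keeps weights positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * idf[term] for term, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example sentences; lowercase whitespace tokenization is an assumption.
texts = [
    "the model detects copied paragraphs",
    "the model detects copied paragraphs in papers",
    "our system finds plagiarized passages",  # paraphrase of the first, no shared words
]
vec_a, vec_b, vec_c = tfidf_vectors([t.lower().split() for t in texts])

print("near-duplicate:", round(cosine(vec_a, vec_b), 3))
print("paraphrase:    ", round(cosine(vec_a, vec_c), 3))
```

The near-duplicate pair scores high because most terms overlap, while the paraphrase pair scores zero because the vectors have no term in common. That gap is the limitation a semantic model such as ERNIE is meant to close.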

Supported by the Research and Innovation Project for Postgraduate of Hunan Province (Grant Nos. CX2018B023 and CX20190038).

Author information

Authors and Affiliations

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, China
Zhaoyun Ding, Kai Liu, Wenhao Wang & Bin Liu

Corresponding author

Correspondence to Kai Liu.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Han Qiu
Ibaraki University, Hitachi, Japan
Cheng Zhang
University of Kentucky, Lexington, KY, USA
Zongming Fei
Texas A&M University – Commerce, Commerce, TX, USA
Meikang Qiu
Princeton University, Princeton, NJ, USA
Sun-Yuan Kung

About this paper

Cite this paper

Ding, Z., Liu, K., Wang, W., Liu, B. (2021). A Semantic Textual Similarity Calculation Model Based on Pre-training Model. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12816. Springer, Cham. https://doi.org/10.1007/978-3-030-82147-0_1

DOI: https://doi.org/10.1007/978-3-030-82147-0_1
Published: 07 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82146-3
Online ISBN: 978-3-030-82147-0
eBook Packages: Computer Science, Computer Science (R0)
