Abstract
Text similarity calculation is a basic research topic in natural language processing and is widely used in plagiarism checking and sentence search. Traditional approaches build text vectors from TF-IDF alone and measure the similarity of two texts by the cosine of the angle between their vectors. However, this approach cannot detect texts that are worded differently but are semantically similar. To address this problem, we propose pre-training text representations based on the ERNIE semantic model of PaddleHub and formulate similar-text detection as a classification problem. Because the similar texts in the data set leave the classes in the training set imbalanced, we also propose OSConfusion, an oversampling method based on confusion sampling. The experimental results show that the proposed method handles the paper comparison task well and can identify repeated paragraphs that are expressed with different wording. In predicting similar document pairs, ERNIE-SIM with OSConfusion achieves better precision and recall than ERNIE-SIM without OSConfusion.
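The TF-IDF and cosine baseline described in the abstract, and its blind spot for paraphrases, can be illustrated with a minimal sketch. The tokenization, the smoothed IDF formula, the function names, and the toy sentences below are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: an exact copy and a paraphrase with no shared words.
docs = [
    "the car is fast".split(),
    "the car is fast".split(),
    "this automobile moves quickly".split(),
]
vecs = tfidf_vectors(docs)
same = cosine(vecs[0], vecs[1])        # identical wording
paraphrase = cosine(vecs[0], vecs[2])  # similar meaning, different wording
print(same, paraphrase)
```

In this sketch the identical pair scores close to 1, while the paraphrase scores 0 because the two sentences share no tokens. That is the kind of case the abstract says a purely lexical representation cannot handle.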
Supported by the Research and Innovation Project for Postgraduate of Hunan Province (Grant No. CX2018B023, No. CX20190038).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ding, Z., Liu, K., Wang, W., Liu, B. (2021). A Semantic Textual Similarity Calculation Model Based on Pre-training Model. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12816. Springer, Cham. https://doi.org/10.1007/978-3-030-82147-0_1
DOI: https://doi.org/10.1007/978-3-030-82147-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82146-3
Online ISBN: 978-3-030-82147-0
eBook Packages: Computer Science, Computer Science (R0)