A Semantic Textual Similarity Calculation Model Based on Pre-training Model

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12816)

Abstract

As a basic research topic in natural language processing, text similarity calculation is widely used in plagiarism checking and sentence search. Traditional text similarity calculation builds text vectors from TF-IDF alone and uses the cosine of the angle between the vectors to measure the similarity between two texts. However, this method cannot detect similar texts whose wording differs but whose meaning is similar. To address this problem, we propose pre-training on text with the ERNIE semantic model from PaddleHub and formulate similar text detection as a classification problem. To handle the category imbalance in the training set caused by the similar texts in the data set, we propose OSConfusion, an oversampling method for confusion sampling. The experimental results show that the proposed method solves the paper comparison problem well and can identify repeated paragraphs with different text representations. In addition, ERNIE-SIM with OSConfusion outperforms ERNIE-SIM without OSConfusion in precision and recall when predicting similar document pairs.
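
The TF-IDF baseline described in the abstract, and the limitation it runs into, can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names `tf_idf_vectors` and `cosine`, and the three example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times smoothed IDF)."""
    # Assumption: lowercase whitespace tokenization is enough for this toy example.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF; this is one common variant, chosen here as an assumption.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the model detects copied paragraphs",
    "the model detects copied paragraphs",     # verbatim copy of the first text
    "this system finds plagiarized passages",  # paraphrase sharing no words with it
]
vecs = tf_idf_vectors(docs)
identical = cosine(vecs[0], vecs[1])
paraphrase = cosine(vecs[0], vecs[2])
print(f"verbatim copy: {identical:.2f}, paraphrase: {paraphrase:.2f}")
```

In this toy case the verbatim copy scores about 1 while the paraphrase scores 0, because the two sentences share no terms. That gap is the kind of case the abstract says a purely TF-IDF-based measure cannot handle.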

Supported by the Research and Innovation Project for Postgraduate of Hunan Province (Grant No. CX2018B023, No. CX20190038).

Corresponding author

Correspondence to Kai Liu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ding, Z., Liu, K., Wang, W., Liu, B. (2021). A Semantic Textual Similarity Calculation Model Based on Pre-training Model. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12816. Springer, Cham. https://doi.org/10.1007/978-3-030-82147-0_1

  • DOI: https://doi.org/10.1007/978-3-030-82147-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82146-3

  • Online ISBN: 978-3-030-82147-0

  • eBook Packages: Computer Science, Computer Science (R0)
