Needle in a Haystack: Finding Suitable Idioms Based on Text Descriptions

Zhernokleev, Dmitrii; Braslavski, Pavel

doi:10.1007/978-3-031-54534-4_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14486)

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts

Abstract

Idioms are an important part of natural languages and are often used to enhance expressiveness and fluency of speech. However, it can be difficult to find a contextually appropriate idiom when writing an essay or crafting a headline for a news article, especially for non-native speakers. This gives rise to the idea of an automated system that is able to recommend an idiom for an input sentence. The goal of this study is to develop and compare methods that would make such a system possible. We used an existing collection of idioms and employed several configurations based on models from the Sentence-BERT family. We also automatically expanded the initial dataset and fine-tuned a pre-trained Sentence-BERT model on the idiom/context matching task. This approach achieved the highest MRR score of 0.507. The data and the trained model are publicly available.
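
The evaluation metric named in the abstract, mean reciprocal rank (MRR), can be illustrated with a minimal sketch. It assumes that idioms and input sentences are already embedded as vectors and that candidates are ranked by cosine similarity; the two-dimensional toy vectors, the idiom list, and the function names below are hypothetical and stand in for real Sentence-BERT embeddings. This is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_idioms(query_vec, idiom_vecs):
    """Return idiom names sorted by descending similarity to the query."""
    return sorted(
        idiom_vecs,
        key=lambda name: cosine(query_vec, idiom_vecs[name]),
        reverse=True,
    )

def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct idiom over all queries.

    A query whose correct idiom is missing from its ranking contributes 0.
    """
    total = 0.0
    for ranked, target in zip(rankings, gold):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(gold)

# Hypothetical toy embeddings (assumption: real vectors would come from a
# sentence-embedding model, not from hand-written numbers).
idiom_vecs = {
    "needle in a haystack": [1.0, 0.0],
    "piece of cake": [0.0, 1.0],
    "break the ice": [0.7, 0.7],
}

# Each query pairs a context vector with the idiom considered correct for it.
queries = [
    ([0.9, 0.1], "needle in a haystack"),
    ([0.6, 0.8], "piece of cake"),
]

rankings = [rank_idioms(vec, idiom_vecs) for vec, _ in queries]
gold = [target for _, target in queries]
mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

In this toy run the first query ranks its correct idiom first (reciprocal rank 1) and the second ranks it second (reciprocal rank 0.5), so the script prints 0.75. An MRR near 0.5, as reported in the abstract, corresponds to a mean reciprocal rank of one half; this does not imply that the correct idiom typically sits at position two, since the average can mix top-ranked hits with much lower ones.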

Acknowledgments

The study develops ideas partially derived from Anna Vysheslavova’s 2020 summer internship under Pavel Braslavski’s supervision. We would like to express gratitude to Yulia Badryzlova for fruitful discussion of the paper draft.

Author information

Authors and Affiliations

HSE University, Moscow, Russia
Dmitrii Zhernokleev & Pavel Braslavski
Nazarbayev University, Astana, Kazakhstan
Pavel Braslavski

Corresponding author

Correspondence to Pavel Braslavski.

Editor information

Editors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Krasovskii Institute of Mathematics and Mechanics of Russian Academy of Sciences, Yekaterinburg, Russia
Michael Khachay
University of Oslo, Oslo, Norway
Andrey Kutuzov
American University of Armenia, Yerevan, Armenia
Habet Madoyan
Artificial Intelligence Research Institute, Moscow, Russia
Ilya Makarov
University of Hamburg, Hamburg, Germany
Irina Nikishina
Skolkovo Institute of Science and Technology, Moscow, Russia
Alexander Panchenko
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Maxim Panov
University of Florida, Gainesville, FL, USA
Panos M. Pardalos
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Apptek, Aachen, Germany
Evgenii Tsymbalov
Kazan Federal University, Kazan, Russia
Elena Tutubalina
MTS AI, Moscow, Russia
Sergey Zagoruyko

About this paper

Cite this paper

Zhernokleev, D., Braslavski, P. (2024). Needle in a Haystack: Finding Suitable Idioms Based on Text Descriptions. In: Ignatov, D.I., et al. Analysis of Images, Social Networks and Texts. AIST 2023. Lecture Notes in Computer Science, vol 14486. Springer, Cham. https://doi.org/10.1007/978-3-031-54534-4_13

DOI: https://doi.org/10.1007/978-3-031-54534-4_13
Published: 12 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54533-7
Online ISBN: 978-3-031-54534-4
eBook Packages: Computer Science, Computer Science (R0)
