DOI: 10.1145/3574318.3574346
Research Article

“If you can’t beat them, join them”: A Word Transformation based Generalized Skip-gram for Embedding Compound Words

Published: 12 January 2023

Abstract

While data-driven approaches have proven effective for embedding words of languages with relatively simple inflection and compounding characteristics (e.g. English), how to integrate language-specific characteristics into the framework of an embedding model remains an open area of investigation. Standard word embedding approaches, such as word2vec and GloVe, embed each word into a high-dimensional dense vector, but they may not adequately capture the inherent linguistic phenomenon of word compounding. We propose a stochastic word-transformation-based generalization of the skip-gram algorithm that seeks to improve the representation of compositional compound words by leveraging information from the contexts of their constituents. Our experiments show that addressing the compounding effect of a language as part of the word embedding objective outperforms existing compounding-specific post-transformation approaches on word semantics prediction and word polarity prediction tasks.
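To make the idea concrete, here is a minimal sketch (not the authors' implementation) of skip-gram training with negative sampling in which a stochastic word transformation occasionally maps a constituent occurrence to the compound it belongs to, so the compound's vector is also trained on the contexts of its constituents. The toy corpus, the decomposition lexicon, and all hyperparameters below are illustrative assumptions.

```python
# Hedged sketch of a stochastic-transformation skip-gram: with
# probability p_split, a constituent token is treated as the compound
# it belongs to, so the compound's embedding is also trained on the
# contexts of its constituents. Corpus, lexicon, and hyperparameters
# are toy assumptions, not the paper's actual setup.
import numpy as np

rng = np.random.default_rng(0)

corpus = [
    "the football match ended late".split(),
    "his foot hurt after the match".split(),
    "a ball rolled onto the field".split(),
]
# Hypothetical decomposition lexicon: compound -> constituents.
compounds = {"football": ["foot", "ball"]}
# Inverse map used by the stochastic transformation.
to_compound = {p: c for c, parts in compounds.items() for p in parts}

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
dim, window, p_split, lr, epochs, n_neg = 16, 2, 0.5, 0.05, 200, 3

W_in = rng.normal(0, 0.1, (len(vocab), dim))   # target vectors
W_out = rng.normal(0, 0.1, (len(vocab), dim))  # context vectors

def sgd_step(t, c, label):
    """One negative-sampling update on a (target, context) pair."""
    v, u = W_in[t].copy(), W_out[c].copy()
    g = 1.0 / (1.0 + np.exp(-v @ u)) - label
    W_in[t] -= lr * g * u
    W_out[c] -= lr * g * v

for _ in range(epochs):
    for sent in corpus:
        for i, w in enumerate(sent):
            # Stochastic word transformation: occasionally train the
            # compound on its constituent's context.
            t = w
            if w in to_compound and rng.random() < p_split:
                t = to_compound[w]
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            for c in ctx:
                sgd_step(idx[t], idx[c], 1.0)      # positive pair
                # Uniform negatives for brevity (collisions with true
                # contexts are ignored in this sketch).
                for n in rng.integers(0, len(vocab), n_neg):
                    sgd_step(idx[t], int(n), 0.0)

# After training, the vector for "football" has absorbed some of the
# contexts of "foot" and "ball".
```

Whether the transformation maps constituents to compounds, compounds to constituents, or mixes both directions with some probability is part of the paper's generalized objective; the sketch above shows only one plausible direction under the stated assumptions.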


Cited By

  • (2024) A case study on decompounding in Indian language IR. Natural Language Processing, 1–31. https://doi.org/10.1017/nlp.2024.16. Online publication date: 3 June 2024.



    Published In

    FIRE '22: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation
    December 2022
    101 pages
    ISBN:9798400700231
    DOI:10.1145/3574318

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 January 2023


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    FIRE '22
    FIRE '22: Forum for Information Retrieval Evaluation
    December 9 - 13, 2022
    Kolkata, India

    Acceptance Rates

Overall acceptance rate: 19 of 64 submissions (30%)


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 30
  • Downloads (last 6 weeks): 1

Reflects downloads up to 10 November 2024.

