DOI: 10.1145/3574318.3574346
Research Article

“If you can’t beat them, join them”: A Word Transformation based Generalized Skip-gram for Embedding Compound Words

Published: 12 January 2023

Abstract

While data-driven approaches have proven effective for embedding words of languages with relatively simple inflection and compounding characteristics (e.g. English), how to integrate language-specific characteristics into the framework of an embedding model remains an open area of investigation. Standard word embedding approaches, such as word2vec and GloVe, embed each word into a high-dimensional dense vector, but they may not adequately capture the inherent linguistic phenomenon of word compounding. We propose a stochastic word-transformation-based generalization of the skip-gram algorithm that seeks to improve the representation of compositional compound words by leveraging information from the contexts of their constituents. Our experiments show that addressing the compounding effect of a language as part of the word embedding objective outperforms existing compounding-specific post-transformation approaches on word semantics prediction and word polarity prediction tasks.
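To make the idea concrete, here is a minimal sketch (not the authors' implementation) of skip-gram training with negative sampling in which a stochastic word transformation occasionally maps a constituent occurrence to the compound it belongs to, so the compound's vector is also trained on the contexts of its constituents. The toy corpus, the decomposition lexicon, and all hyperparameters below are illustrative assumptions.

```python
# Hedged sketch of a stochastic-transformation skip-gram: with
# probability p_split, a constituent token is treated as the compound
# it belongs to, so the compound's embedding is also trained on the
# contexts of its constituents. Corpus, lexicon, and hyperparameters
# are toy assumptions, not the paper's actual setup.
import numpy as np

rng = np.random.default_rng(0)

corpus = [
    "the football match ended late".split(),
    "his foot hurt after the match".split(),
    "a ball rolled onto the field".split(),
]
# Hypothetical decomposition lexicon: compound -> constituents.
compounds = {"football": ["foot", "ball"]}
# Inverse map used by the stochastic transformation.
to_compound = {p: c for c, parts in compounds.items() for p in parts}

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
dim, window, p_split, lr, epochs, n_neg = 16, 2, 0.5, 0.05, 200, 3

W_in = rng.normal(0, 0.1, (len(vocab), dim))   # target vectors
W_out = rng.normal(0, 0.1, (len(vocab), dim))  # context vectors

def sgd_step(t, c, label):
    """One negative-sampling update on a (target, context) pair."""
    v, u = W_in[t].copy(), W_out[c].copy()
    g = 1.0 / (1.0 + np.exp(-v @ u)) - label
    W_in[t] -= lr * g * u
    W_out[c] -= lr * g * v

for _ in range(epochs):
    for sent in corpus:
        for i, w in enumerate(sent):
            # Stochastic word transformation: occasionally train the
            # compound on its constituent's context.
            t = w
            if w in to_compound and rng.random() < p_split:
                t = to_compound[w]
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            for c in ctx:
                sgd_step(idx[t], idx[c], 1.0)      # positive pair
                # Uniform negatives for brevity (collisions with true
                # contexts are ignored in this sketch).
                for n in rng.integers(0, len(vocab), n_neg):
                    sgd_step(idx[t], int(n), 0.0)

# After training, the vector for "football" has absorbed some of the
# contexts of "foot" and "ball".
```

Whether the transformation maps constituents to compounds, compounds to constituents, or mixes both directions with some probability is part of the paper's generalized objective; the sketch above shows only one plausible direction under the stated assumptions.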


Cited By

  • (2024) A case study on decompounding in Indian language IR. Natural Language Processing, 1–31. https://doi.org/10.1017/nlp.2024.16. Online publication date: 3 June 2024.



    Published In

    FIRE '22: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation
    December 2022
    101 pages
    ISBN:9798400700231
    DOI:10.1145/3574318

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 January 2023


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    FIRE '22
    FIRE '22: Forum for Information Retrieval Evaluation
    December 9 - 13, 2022
    Kolkata, India

    Acceptance Rates

Overall acceptance rate: 19 of 64 submissions (30%)


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 30
  • Downloads (last 6 weeks): 1

Reflects downloads up to 10 November 2024.

