Composing Word Embeddings for Compound Words Using Linguistic Knowledge

Published: 30 March 2023

Abstract

In recent years, distributed representations have become a fundamental technology for natural language processing. However, Japanese has many compound words, and we often must compare the meaning of a word with that of a compound word. Moreover, word boundaries in Japanese are ambiguous because the language has no delimiters between words; for example, “ぶどう狩り” (grape picking) is one word according to one dictionary, whereas “ぶどう” (grape) and “狩り” (picking) are separate words according to another. This study attempts to compose the word embedding of a Japanese compound word from the embeddings of its constituent words. We used the “short unit” and “long unit,” both of which are units of terms in UniDic, a Japanese dictionary compiled by the National Institute for Japanese Language and Linguistics, for constituent and compound words, respectively. We then composed the word embedding of a compound word from the word embeddings of its two constituent words using a neural network. The training data for the compound-word embeddings were created from a corpus generated by concatenating corpora segmented by constituent words and by compound words. We propose using linguistic knowledge in composing word embeddings and demonstrate how it improves composition performance. To assess models with and without linguistic knowledge, we compared the cosine similarity between the composed and reference embeddings of compound words; we also evaluated our methods by ranking synonyms using a thesaurus. We compared several frameworks and algorithms that use three types of linguistic knowledge (semantic patterns, part-of-speech patterns, and compositionality scores) and investigated which types of knowledge improve composition performance. The experiments demonstrated that multitask models combining the classification of part-of-speech patterns with the estimation of compositionality scores achieved high performance.
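To make the setup concrete, below is a minimal PyTorch sketch of the kind of multitask composition model the abstract describes: a feed-forward network maps two constituent-word embeddings to a composed vector trained toward the reference compound-word embedding via cosine similarity, with auxiliary heads for part-of-speech-pattern classification and compositionality-score regression. All dimensions, layer choices, loss weights, and names here are illustrative assumptions, not the authors’ exact architecture.

```python
# Minimal sketch (NOT the authors' exact model) of composing a compound-word
# embedding from two constituent embeddings with auxiliary multitask heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompoundComposer(nn.Module):
    def __init__(self, dim: int = 200, n_pos_patterns: int = 50):  # sizes are assumptions
        super().__init__()
        # Compose one vector from the concatenated constituent vectors.
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        # Auxiliary task 1: classify the part-of-speech pattern of the
        # constituent pair (e.g., noun+noun vs. noun+verb).
        self.pos_head = nn.Linear(dim, n_pos_patterns)
        # Auxiliary task 2: regress the compositionality score, i.e., how
        # predictable the compound's meaning is from its parts.
        self.comp_head = nn.Linear(dim, 1)

    def forward(self, v1: torch.Tensor, v2: torch.Tensor):
        h = self.compose(torch.cat([v1, v2], dim=-1))
        return h, self.pos_head(h), self.comp_head(h).squeeze(-1)

def multitask_loss(h, pos_logits, comp_pred, target_vec, pos_label, comp_score,
                   alpha: float = 0.1, beta: float = 0.1):  # weights are assumptions
    # Main objective: align the composed vector with the reference
    # compound-word embedding (1 - cosine similarity).
    main = 1.0 - F.cosine_similarity(h, target_vec, dim=-1).mean()
    aux_pos = F.cross_entropy(pos_logits, pos_label)
    aux_comp = F.mse_loss(comp_pred, comp_score)
    return main + alpha * aux_pos + beta * aux_comp

# Toy usage with random vectors standing in for pretrained word embeddings.
model = CompoundComposer()
v1, v2 = torch.randn(8, 200), torch.randn(8, 200)   # constituent embeddings
target = torch.randn(8, 200)                        # reference compound embeddings
pos = torch.randint(0, 50, (8,))                    # POS-pattern labels
score = torch.rand(8)                               # compositionality scores
h, pos_logits, comp_pred = model(v1, v2)
multitask_loss(h, pos_logits, comp_pred, target, pos, score).backward()
```

At evaluation time, the same cosine similarity between the composed vector and the reference embedding (or the rank of the reference compound among thesaurus synonyms) would serve as the quality measure described in the abstract.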


      Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
      February 2023
      624 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3572719

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 March 2023
      Online AM: 07 September 2022
      Accepted: 29 August 2022
      Revised: 22 August 2022
      Received: 07 October 2021
      Published in TALLIP Volume 22, Issue 2


      Author Tags

1. word embedding
2. compound word
3. multitask learning
4. linguistic knowledge
5. Japanese
6. parts of speech
7. constituent word

      Qualifiers

      • Research-article

      Funding Sources

      • JSPS KAKENHI
      • Younger Researchers Grants from Ibaraki University

