Abstract
We present a method for incorporating BERT embeddings into neural morpheme segmentation. We show that our method significantly improves over the baseline on six typologically diverse languages (English, Finnish, Turkish, Estonian, Georgian, and Zulu). Moreover, it establishes a new state of the art on the four languages for which language-specific BERT models are available. We demonstrate that the key to this performance is not only the BPE vocabulary of BERT but also the embeddings themselves. Additionally, we show that a simpler pretraining task, which optimizes a subword word2vec-like objective, also reaches state-of-the-art performance on four of the six languages considered.
Notes
- 1. See http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml as an example.
- 2. The hyperparameters are given in the Appendix.
- 3. It is freely available at https://github.com/AlexeySorokin/MorphemeBert.
- 4.
- 5. https://github.com/AlexeySorokin/NeuralMorphemeSegmentation. The models are retrained for 50 epochs with the parameters provided in the repository.
- 6.
- 7. Training parameters are in the Appendix.
- 8.
- 9. Obviously, when the BERT weights are available, training is rather cheap. However, the pretraining cost of BERT is several orders of magnitude higher than that of word2vec-like embeddings.
- 10. https://radimrehurek.com/gensim/
- 11. https://github.com/google/sentencepiece
References
Andreev, N.D. (ed.): Statistical and combinatorial language modelling (Statistiko-kombinatornoe modelirovanie iazykov, in Russian). Nauka (1965)
Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. (TSLP) 4(1), 3 (2007)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Eskander, R., Callejas, F., Nichols, E., Klavans, J.L., Muresan, S.: MorphAGram, evaluation and framework for unsupervised morphological segmentation. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7112–7122 (2020)
Grönroos, S.A., Virpioja, S., Kurimo, M.: North Sámi morphological segmentation with low-resource semi-supervised sequence labeling. In: Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, pp. 15–26 (2019)
Grönroos, S.A., Virpioja, S., Kurimo, M.: Morfessor EM+Prune: improved subword segmentation with expectation maximization and pruning. arXiv preprint arXiv:2003.03131 (2020)
Grönroos, S.A., Virpioja, S., Kurimo, M.: Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation. arXiv preprint arXiv:2004.04002 (2020)
Harris, Z.S.: Morpheme boundaries within words: report on a computer test. In: Papers in Structural and Transformational Linguistics. Formal Linguistics Series, pp. 68–77. Springer, Dordrecht (1970). https://doi.org/10.1007/978-94-017-6059-1_3
Heinzerling, B., Strube, M.: BPEmb: tokenization-free pre-trained subword embeddings in 275 languages. In: Calzolari, N., et al. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 7–12 May 2018. European Language Resources Association (ELRA), Miyazaki, Japan (2018)
Hofmann, V., Pierrehumbert, J.B., Schütze, H.: Superbizarre is not superb: improving BERT’s interpretations of complex words with derivational morphology. arXiv preprint arXiv:2101.00403 (2021)
Huang, K., Huang, D., Liu, Z., Mo, F.: A joint multiple criteria model in transfer learning for cross-domain Chinese word segmentation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3873–3882 (2020)
Johnson, M., Griffiths, T.L., Goldwater, S.: Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models. In: Advances in Neural Information Processing Systems, pp. 641–648 (2007)
Kann, K., Mager, M., Meza-Ruiz, I., Schütze, H.: Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. arXiv preprint arXiv:1804.06024 (2018)
Ke, Z., Shi, L., Meng, E., Wang, B., Qiu, X., Huang, X.: Unified multi-criteria Chinese word segmentation with BERT. arXiv preprint arXiv:2004.05808 (2020)
Kudo, T.: Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75 (2018)
Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71 (2018)
Kurimo, M., Virpioja, S., Turunen, V., Lagus, K.: Morpho challenge competition 2005–2010: evaluations and results. In: Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp. 87–95. Association for Computational Linguistics (2010)
Mager, M., Maier, E., Medina-Urrea, A., Meza-Ruiz, I., Kann, K.: Lost in translation: analysis of information loss during machine translation between polysynthetic and fusional languages. In: Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pp. 73–83 (2018)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Provilkov, I., Emelianenko, D., Voita, E.: BPE-dropout: simple and effective subword regularization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1882–1892 (2020)
Ruokolainen, T., Kohonen, O., Virpioja, S., Kurimo, M.: Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 29–37 (2013)
Ruokolainen, T., Kohonen, O., Virpioja, S., et al.: Painless semi-supervised morphological segmentation using conditional random fields. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pp. 84–89 (2014)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016)
Shafey, L.E., Soltau, H., Shafran, I.: Joint speech recognition and speaker diarization via sequence transduction. arXiv preprint arXiv:1907.05337 (2019)
Sorokin, A.: Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art? In: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 154–159 (2019)
Sorokin, A., Kravtsova, A.: Deep convolutional networks for supervised morpheme segmentation of Russian language. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2018. CCIS, vol. 930, pp. 3–10. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01204-5_1
Tian, Y., Song, Y., Xia, F., Zhang, T., Wang, Y.: Improving Chinese word segmentation with wordhood memory networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8274–8285 (2020)
Ulčar, M., Robnik-Šikonja, M.: FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv preprint arXiv:2006.07890 (2020)
Virpioja, S., Smit, P., Grönroos, S.A., Kurimo, M., et al.: Morfessor 2.0: Python implementation and extensions for Morfessor Baseline (2013)
Yao, Y., Huang, Z.: Bi-directional LSTM recurrent neural network for Chinese word segmentation. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) ICONIP 2016. LNCS, vol. 9950, pp. 345–353. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46681-1_42
Acknowledgements
The author thanks Alexander Panin for helpful discussions and ideas. He is also grateful to Natalia Loukachevitch for communication while this paper was being prepared.
Appendix
8.1 Preprocessing Details
We remove accents in the Finnish training data, as was done when training the Finnish BERT. In all word lists we keep only words that consist solely of alphabetic characters and hyphens and contain at least 5 letters.
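A minimal sketch of this word-list filter (the exact letter-counting and hyphen rules are our reading of the description, and the sample words are illustrative):

```python
import re

# Tokens made only of letters, possibly joined by single hyphens.
# [^\W\d_] matches any Unicode letter.
TOKEN = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)*$")

def keep_word(word: str, min_letters: int = 5) -> bool:
    """Keep words consisting of alphabetic characters and hyphens
    with at least `min_letters` letters (hyphens are not counted)."""
    if not TOKEN.match(word):
        return False
    return sum(ch.isalpha() for ch in word) >= min_letters

words = ["cat", "water-proof", "hello", "abc123", "käsityö"]
filtered = [w for w in words if keep_word(w)]
print(filtered)  # → ['water-proof', 'hello', 'käsityö']
```

The pattern deliberately uses Unicode letter classes rather than `[a-z]` so that accented Finnish or Estonian characters are retained.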
8.2 BERT Models
All the BERT models used in the paper are either language-specific or multilingual BERT-base models; in particular, the embedding dimension is 768 (Fig. 8).
8.3 Convolutional Network Parameters
Network parameters are obtained by manual search using word accuracy on Finnish data and are transferred to other languages without change.
8.4 Ablation Studies Details
We train the word2vec model with the default parameters of Gensim (see footnote 10) for 10 epochs on small datasets (50,000 words) and for 5 epochs on larger ones. In particular, the algorithm is CBOW, the embedding dimension is 300, and the window size is 2. We learn subword vocabularies with the BPE algorithm [23] using the SentencePiece library (see footnote 11). We use the command below to train the model:
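A representative SentencePiece BPE training call consistent with this setup might look as follows; the input file name, model prefix, and vocabulary size are illustrative assumptions, not the exact values used in the paper:

```python
import sentencepiece as spm

# Train a BPE subword model on the word list; one word per line is expected.
spm.SentencePieceTrainer.train(
    input="wordlist.txt",       # assumed input file name
    model_prefix="bpe_model",   # assumed output prefix (writes .model and .vocab)
    model_type="bpe",
    vocab_size=5000,            # illustrative; the actual vocabulary size may differ
    character_coverage=1.0,     # keep all characters, important for accented letters
)
```

The resulting `bpe_model.model` file can then be loaded with `spm.SentencePieceProcessor` to segment words into subwords for the word2vec-like pretraining.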
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sorokin, A. (2022). Improving Morpheme Segmentation Using BERT Embeddings. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_13
DOI: https://doi.org/10.1007/978-3-031-16500-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16499-6
Online ISBN: 978-3-031-16500-9