
SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Published: 24 August 2023

Abstract

Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient, as they require parallel corpora, days to train, and hours to decode. This article introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word-frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle-, and high-resource scenarios, comparing the performance of different segmentation methods. The experimental results demonstrate that, on the low-resource ALT dataset, our method achieves an average improvement of more than 1.2 BLEU over BPE and SentencePiece, and of 1.1 BLEU over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT). The regularization method achieves an improvement of approximately 4.3 BLEU over BPE and of 1.2 BLEU over BPE-dropout, the regularized version of BPE. We also observe significant improvements on the IWSLT15 Vi→En, WMT16 Ro→En, and WMT15 Fi→En datasets and competitive results on the WMT14 De→En and WMT14 Fr→En datasets. Furthermore, our method is 17.8× faster during training and up to 36.8× faster during decoding than DPE in a high-resource scenario. We provide extensive analysis, including why monolingual word-level data is enough to train SelfSeg.
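The dynamic-programming decoding step described above can be illustrated with a small sketch. It is not the authors' implementation: the function segment_dp, the max_len limit, and the toy score table are hypothetical, and the logprob callable merely stands in for the sub-word probabilities that SelfSeg's masked character model would supply.

```python
import math

def segment_dp(word, logprob, max_len=8):
    """Return the segmentation of `word` that maximizes the summed sub-word scores.

    `logprob(piece)` scores a candidate sub-word; in SelfSeg this score would come
    from the trained masked character model, but here it is a plain callable so the
    sketch stays self-contained.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)           # back[i]: start of the last piece in that best split
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + logprob(word[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    pieces, i = [], n              # follow back-pointers to recover the pieces
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy usage with a hypothetical score table; unseen pieces get a length penalty.
scores = {"un": -1.0, "touch": -1.5, "able": -1.2}
print(segment_dp("untouchable", lambda s: scores.get(s, -4.0 * len(s))))
# -> ['un', 'touch', 'able']
```

Sampling a split among high-scoring candidates instead of always taking the argmax would yield varied segmentations of the same word, which is the kind of variability the regularization mechanism mentioned in the abstract is meant to provide.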


Cited By

  • (2024) DiverSeg: Leveraging Diverse Segmentations with Cross-granularity Alignment for Neural Machine Translation. Journal of Natural Language Processing 31, 1 (2024), 155–188. DOI: 10.5715/jnlp.31.155
  • (2024) A Benchmark for Morphological Segmentation in Uyghur and Kazakh. Applied Sciences 14, 13 (2024), 5369. DOI: 10.3390/app14135369. Online publication date: 21 June 2024.



    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8
    August 2023
    373 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3615980

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2023
    Online AM: 26 July 2023
    Accepted: 13 July 2023
    Published in TALLIP Volume 22, Issue 8


    Author Tags

    1. Subword segmentation
    2. self-supervised learning
    3. machine translation
    4. efficient NLP
    5. subword regularization

    Qualifiers

    • Research-article

    Funding Sources

    • JSPS KAKENHI
    • Young Scientists
    • JSPS Research Fellow for Young Scientists
