Ancient–Modern Chinese Translation with a New Large Training Dataset

Published: 31 May 2019

Abstract

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps preserve and carry forward the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits the study of ancient–modern Chinese machine translation. In this article, we propose an ancient–modern Chinese clause alignment approach based on the characteristics of these two languages. The method combines lexical and statistical information and achieves an F1-score of 94.2 on our manually annotated test set. We use this method to create a new large-scale ancient–modern Chinese parallel corpus that contains 1.24M bilingual pairs. To the best of our knowledge, this is the first large, high-quality ancient–modern Chinese dataset. We also analyze and compare the performance of SMT and various NMT models on this dataset, providing a strong baseline for the task.
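The abstract does not spell out the alignment algorithm, so the following is only a rough illustration of how a lexical signal and a statistical signal could be combined to align clauses, and of how alignment F1 could be computed. It is not the authors' method. The character-overlap measure, the Gaussian length-ratio score, the values of `ratio`, `sigma`, and `weight`, and all function names are assumptions made for this sketch.

```python
import math

def lexical_score(ancient: str, modern: str) -> float:
    """Fraction of distinct ancient characters that also occur in the modern clause."""
    chars = set(ancient)
    if not chars:
        return 0.0
    return len(chars & set(modern)) / len(chars)

def length_score(ancient: str, modern: str, ratio: float = 1.5, sigma: float = 0.5) -> float:
    """Gaussian-shaped score on the modern/ancient length ratio (ratio and sigma are assumed)."""
    if not ancient or not modern:
        return 0.0
    observed = len(modern) / len(ancient)
    return math.exp(-((observed - ratio) ** 2) / (2 * sigma ** 2))

def pair_score(ancient: str, modern: str, weight: float = 0.7) -> float:
    """Weighted mix of the lexical and the statistical signal (weight is assumed)."""
    return weight * lexical_score(ancient, modern) + (1 - weight) * length_score(ancient, modern)

def align(ancient_clauses, modern_clauses):
    """Monotonic dynamic-programming alignment allowing 1-1, 1-2, and 2-1 groupings."""
    n, m = len(ancient_clauses), len(modern_clauses)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                a = "".join(ancient_clauses[i:ni])
                b = "".join(modern_clauses[j:nj])
                score = best[i][j] + pair_score(a, b)
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    if best[n][m] == neg:
        return []  # no alignment exists under the allowed groupings
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append(("".join(ancient_clauses[pi:i]), "".join(modern_clauses[pj:j])))
        i, j = pi, pj
    pairs.reverse()
    return pairs

def f1_score(predicted, gold) -> float:
    """F1 over aligned pairs: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For instance, `align(["学而时习之", "不亦说乎"], ["学习并且时常温习", "不也是很愉快吗"])` pairs the two ancient clauses with the two modern ones in order, since that is the only monotonic path for a 2-by-2 input. Because path scores are summed, this sketch tends to favor paths with more groups; a practical system would need to normalize or tune that behavior.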

      Information

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 1
      January 2020
      345 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3338846
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 31 May 2019
      Accepted: 01 April 2019
      Revised: 01 February 2019
      Received: 01 August 2018
      Published in TALLIP Volume 19, Issue 1

      Author Tags

      1. Ancient–Modern Chinese parallel corpus
      2. bilingual text alignment
      3. neural machine translation

      Qualifiers

      • Note
      • Research
      • Refereed

      Funding Sources

      • State Key Program of National Science Foundation of China
      • National Natural Science Fund for Distinguished Young Scholar

      Cited By

      • (2023) Evaluating the Use of Generative LLMs for Intralingual Diachronic Translation of Middle-Polish Texts into Contemporary Polish. Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration, 18-27. DOI: 10.1007/978-981-99-8085-7_2. Online publication date: 4-Dec-2023
      • (2023) PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry. Data Mining and Big Data, 369-384. DOI: 10.1007/978-981-19-8991-9_26. Online publication date: 19-Jan-2023
      • (2023) Syntax-Aware Transformer for Sentence Classification. Information Retrieval, 40-50. DOI: 10.1007/978-3-031-24755-2_4. Online publication date: 3-Feb-2023
      • (2021) Neural Joint Model for Part-of-Speech Tagging and Entity Extraction. 2021 13th International Conference on Machine Learning and Computing, 239-245. DOI: 10.1145/3457682.3457718. Online publication date: 26-Feb-2021
      • (2020) An automatic evaluation metric for Ancient-Modern Chinese translation. Neural Computing and Applications 33:8, 3855-3867. DOI: 10.1007/s00521-020-05216-8. Online publication date: 4-Aug-2020
