note

Ancient–Modern Chinese Translation with a New Large Training Dataset

Authors:

Jiancheng Lv

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 19, Issue 1

Article No.: 6, Pages 1 - 13

https://doi.org/10.1145/3325887

Published: 31 May 2019

Abstract

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to preserve and pass on the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits the study of ancient–modern Chinese machine translation. In this article, we propose an ancient–modern Chinese clause alignment approach based on the characteristics of these two languages. The method combines lexical and statistical information and achieves an F1-score of 94.2 on our manually annotated test set. We use this method to create a new large-scale ancient–modern Chinese parallel corpus that contains 1.24M bilingual pairs. To the best of our knowledge, this is the first large, high-quality ancient–modern Chinese dataset. Furthermore, we analyze and compare the performance of SMT and various NMT models on this dataset and provide a strong baseline for this task.
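The abstract does not spell out how lexical and statistical information are combined, so the following is only a minimal illustrative sketch, not the authors' method. It assumes a lexical score based on shared characters, a statistical score based on a Gaussian-shaped length-ratio prior, and a monotonic dynamic program over 1-1, 1-2, and 2-1 clause groupings. All function names, the weight, and the ratio parameters (`mean_ratio`, `sigma`) are hypothetical choices made for illustration.

```python
import math


def lexical_score(ancient: str, modern: str) -> float:
    """Fraction of distinct ancient characters that also occur in the modern clause."""
    src = set(ancient)
    if not src:
        return 0.0
    return len(src & set(modern)) / len(src)


def length_score(ancient: str, modern: str,
                 mean_ratio: float = 1.5, sigma: float = 0.5) -> float:
    """Gaussian-shaped score on the modern/ancient length ratio.

    mean_ratio and sigma are assumed values, not taken from the article.
    """
    if not ancient or not modern:
        return 0.0
    ratio = len(modern) / len(ancient)
    return math.exp(-((ratio - mean_ratio) ** 2) / (2 * sigma ** 2))


def pair_score(ancient: str, modern: str, weight: float = 0.7) -> float:
    """Weighted mix of the lexical and statistical scores (weight is assumed)."""
    return (weight * lexical_score(ancient, modern)
            + (1 - weight) * length_score(ancient, modern))


def align(ancient_clauses, modern_clauses):
    """Monotonic alignment allowing 1-1, 1-2 and 2-1 clause groupings.

    Returns a list of (ancient_text, modern_text) pairs, or [] if the
    two clause sequences cannot be fully covered by those groupings.
    """
    n, m = len(ancient_clauses), len(modern_clauses)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                a = "".join(ancient_clauses[i:ni])
                b = "".join(modern_clauses[j:nj])
                score = best[i][j] + pair_score(a, b)
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    if best[n][m] == neg_inf:
        return []
    # Walk the back-pointers from the end to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append(("".join(ancient_clauses[pi:i]),
                      "".join(modern_clauses[pj:j])))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Character overlap is a plausible lexical signal here because ancient and modern Chinese share a large part of their character inventory, but the scoring actually used in the article may differ. Note also that summing per-pair scores favors paths with more pairs, so this sketch leans toward 1-1 alignments.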

Index Terms

Ancient–Modern Chinese Translation with a New Large Training Dataset
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 19, Issue 1

January 2020

345 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3338846

Editor:
Imed Zitouni
Microsoft, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2019

Accepted: 01 April 2019

Revised: 01 February 2019

Received: 01 August 2018

Published in TALLIP Volume 19, Issue 1

Qualifiers

Note
Research
Refereed

Funding Sources

State Key Program of National Science Foundation of China
National Natural Science Fund for Distinguished Young Scholar

Cited By

Klamra C, Kryńska K, Ogrodniczuk M (2023). Evaluating the Use of Generative LLMs for Intralingual Diachronic Translation of Middle-Polish Texts into Contemporary Polish. Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration, pp. 18-27. DOI: 10.1007/978-981-99-8085-7_2. Online publication date: 4-Dec-2023
https://dl.acm.org/doi/10.1007/978-981-99-8085-7_2
Zhao J, Bai T, Wei Y, Wu B (2023). PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry. Data Mining and Big Data, pp. 369-384. DOI: 10.1007/978-981-19-8991-9_26. Online publication date: 19-Jan-2023
https://doi.org/10.1007/978-981-19-8991-9_26
Shan J, Zhang Z, Zeng Y, Ying Y, Wu H, Song H, Chen Y, Deng S (2023). Syntax-Aware Transformer for Sentence Classification. Information Retrieval, pp. 40-50. DOI: 10.1007/978-3-031-24755-2_4. Online publication date: 3-Feb-2023
https://doi.org/10.1007/978-3-031-24755-2_4
Ali W, Kumar R, Dai Y, Kumar J, Tumrani S (2021). Neural Joint Model for Part-of-Speech Tagging and Entity Extraction. 2021 13th International Conference on Machine Learning and Computing, pp. 239-245. DOI: 10.1145/3457682.3457718. Online publication date: 26-Feb-2021
https://dl.acm.org/doi/10.1145/3457682.3457718
Yang K, Liu D, Qu Q, Sang Y, Lv J (2020). An automatic evaluation metric for Ancient-Modern Chinese translation. Neural Computing and Applications 33:8, pp. 3855-3867. DOI: 10.1007/s00521-020-05216-8. Online publication date: 4-Aug-2020
https://doi.org/10.1007/s00521-020-05216-8
