From Image to Translation: Processing the Endangered Nyushu Script

Published: 16 May 2016

Abstract

The lack of computational support has significantly slowed the automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally “women’s writing”) as a case study and present the first computational approach that combines Computer Vision and Natural Language Processing techniques to deeply understand an endangered language. We developed an end-to-end system that reads a scanned handwritten Nyushu article, segments it into characters, links them to standard characters, and then translates the article into Mandarin Chinese. We propose several novel methods to address the new challenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and machine translation based on a character linking lattice. The end-to-end performance indicates that the system is a promising approach and can serve as a standard benchmark.
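
To make the idea of translating from a character linking lattice more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each segmented glyph comes with several candidate standard characters and a linking probability, and that a bigram language model scores adjacent candidates. The function `decode_lattice`, the placeholder symbols `A` to `D`, and all probabilities are hypothetical and chosen only for illustration.

```python
import math


def decode_lattice(lattice, bigram_logprob, start="<s>"):
    """Viterbi search over a (hypothetical) character linking lattice.

    lattice: list of positions; each position is a list of
             (candidate, link_prob) pairs for one segmented glyph.
    bigram_logprob: function (prev, cur) -> log probability.
    Returns (best_sequence, best_log_score).
    """
    # best[c] = (log score, path) of the best path ending in candidate c
    best = {start: (0.0, [])}
    for candidates in lattice:
        nxt = {}
        for cand, prob in candidates:
            for prev, (score, path) in best.items():
                s = score + math.log(prob) + bigram_logprob(prev, cand)
                if cand not in nxt or s > nxt[cand][0]:
                    nxt[cand] = (s, path + [cand])
        best = nxt
    score, path = max(best.values(), key=lambda t: t[0])
    return path, score


# Made-up lattice: two glyphs, two candidate links each.
lattice = [
    [("A", 0.6), ("B", 0.4)],
    [("C", 0.5), ("D", 0.5)],
]

# Made-up bigram model: the pair (B, D) is strongly preferred.
FAVORED = {("B", "D"): 0.9}


def bigram_logprob(prev, cur):
    return math.log(FAVORED.get((prev, cur), 0.1))


best_path, best_score = decode_lattice(lattice, bigram_logprob)
print(best_path, best_score)
```

In this toy lattice the top-ranked link for the first glyph is `A`, but the language model context makes the path `B`, `D` score higher overall. That is the general motivation for keeping several linking candidates rather than committing to the single best link before translation.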

Cited By

  • (2019) RETRACTED ARTICLE: Translation analysis of English address image recognition based on image recognition. Journal on Image and Video Processing 2019:1. DOI: 10.1186/s13640-019-0408-9. Online publication date: 12-Feb-2019

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 15, Issue 4
      June 2016
      173 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/2915955
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 May 2016
      Accepted: 01 December 2015
      Revised: 01 October 2015
      Received: 01 May 2015
      Published in TALLIP Volume 15, Issue 4

      Author Tags

      1. Endangered languages
      2. nyushu
      3. recognition
      4. translation

      Qualifiers

      • Research-article
      • Research
      • Refereed
