DOI: 10.1145/3293353.3293386

Learning to Read by Spelling: Towards Unsupervised Text Recognition

Published: 03 May 2020

Abstract

This work presents a method for visual text recognition that uses no paired supervisory data. We formulate text recognition as the task of aligning the conditional distribution of strings predicted from given text images with lexically valid strings sampled from target corpora. This enables fully automated, unsupervised learning from just line-level text images and unpaired text-string samples, obviating the need for large aligned datasets. We present a detailed analysis of several aspects of the proposed method: (1) the impact of the length of training sequences on convergence, (2) the relation between character frequencies and the order in which characters are learnt, (3) the ability of our recognition network to generalise to inputs of arbitrary length, and (4) the impact of varying the text corpus on recognition accuracy. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images and scanned images of real printed books, using no labelled training examples.
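The core idea, learning to read from unpaired images and text, can be illustrated with a deliberately tiny sketch. It is not the method of the paper, which trains a recognition network adversarially. Here the "images" are sequences of glyph ids, the recogniser is a plain glyph-to-character table, and the adversarial objective is swapped for an exhaustive search that maximises how many decoded strings are lexically valid. All names (`CORPUS`, `HIDDEN_MAP`, `fit_mapping`, and so on) are hypothetical and chosen only for illustration.

```python
from itertools import permutations

ALPHABET = "aet"
# Unpaired text strings: the only linguistic knowledge available to the learner.
CORPUS = ["tea", "eat", "ate", "tat", "tee"]
# Ground-truth glyph-to-character table. It is used only to synthesise the toy
# inputs below and is never shown to the learner.
HIDDEN_MAP = {0: "t", 1: "a", 2: "e"}


def render(word):
    """Turn a word into a sequence of glyph ids (a stand-in for a text image)."""
    inverse = {char: glyph for glyph, char in HIDDEN_MAP.items()}
    return [inverse[char] for char in word]


# Unlabelled inputs: the words they were rendered from are discarded.
IMAGES = [render(word) for word in ["eat", "tea", "tat", "ate"]]


def decode(glyphs, mapping):
    """Read a glyph sequence using a candidate glyph-to-character table."""
    return "".join(mapping[glyph] for glyph in glyphs)


def validity_score(mapping):
    """Count how many decoded inputs are valid strings from the corpus."""
    lexicon = set(CORPUS)
    return sum(decode(image, mapping) in lexicon for image in IMAGES)


def fit_mapping():
    """Brute-force stand-in for adversarial training: pick the table whose
    decoded outputs look most like corpus strings."""
    best = max(
        permutations(ALPHABET),
        key=lambda chars: validity_score(dict(enumerate(chars))),
    )
    return dict(enumerate(best))


learned = fit_mapping()
# The learned table also reads an input that was not in IMAGES.
unseen = decode([0, 2, 2], learned)  # -> "tee"
```

In this toy setting the correct table is the only one under which every decoded input is a valid corpus string, so the search recovers it without ever seeing an image paired with its transcription. Exhaustive search is feasible only because the alphabet has three symbols; a realistic alphabet and real images call for a learned recogniser and a learned measure of lexical validity.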

Published In

ICVGIP '18: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing
December 2018
659 pages
ISBN: 9781450366151
DOI: 10.1145/3293353
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial training
  2. text recognition
  3. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP 2018

Acceptance Rates

Overall Acceptance Rate 95 of 286 submissions, 33%

Article Metrics

  • Downloads (last 12 months): 21
  • Downloads (last 6 weeks): 5
Reflects downloads up to 04 Oct 2024.

Cited By

  • (2024) The Learnable Typewriter: A Generative Approach to Text Analysis. Document Analysis and Recognition - ICDAR 2024, pp. 297-314. DOI: 10.1007/978-3-031-70536-6_18. Online publication date: 3-Sep-2024
  • (2023) A Review of Recent Advances and Challenges in Grocery Label Detection and Recognition. Applied Sciences 13:5 (2871). DOI: 10.3390/app13052871. Online publication date: 23-Feb-2023
  • (2022) Scene Text Recognition with Self-supervised Contrastive Predictive Coding. 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1514-1521. DOI: 10.1109/ICPR56361.2022.9956631. Online publication date: 21-Aug-2022
  • (2021) Recognizing Multiple Text Sequences from an Image by Pure End-to-End Learning. 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7058-7065. DOI: 10.1109/ICPR48806.2021.9412079. Online publication date: 10-Jan-2021
  • (2021) Sequence-to-Sequence Contrastive Learning for Text Recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15297-15307. DOI: 10.1109/CVPR46437.2021.01505. Online publication date: Jun-2021
  • (2020) Deep learning of cuneiform sign detection with weak supervision using transliteration alignment. PLOS ONE 15:12 (e0243039). DOI: 10.1371/journal.pone.0243039. Online publication date: 16-Dec-2020
  • (2020) docExtractor: An off-the-shelf historical document element extraction. 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 91-96. DOI: 10.1109/ICFHR2020.2020.00027. Online publication date: Sep-2020
  • (2020) Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8784-8794. DOI: 10.1109/CVPR42600.2020.00881. Online publication date: Jun-2020
  • (2019) MODERN ADVANCED ARTIFICIAL INTELLIGENCE FOR SMART MEDICINE. Remedium, pp. 36-43. DOI: 10.21518/1561-5936-2019-04-36-43. Online publication date: 13-May-2019
