Learning to Read by Spelling: Towards Unsupervised Text Recognition

Authors:

Andrea Vedaldi,

Andrew Zisserman

ICVGIP '18: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing

Article No.: 33, Pages 1 - 10

https://doi.org/10.1145/3293353.3293386

Published: 03 May 2020

Abstract

This work presents a method for visual text recognition that uses no paired supervisory data. We formulate text recognition as the task of aligning the conditional distribution of strings predicted from given text images with lexically valid strings sampled from target corpora. This enables fully automated, unsupervised learning from just line-level text images and unpaired text-string samples, obviating the need for large aligned datasets. We present a detailed analysis of several aspects of the proposed method, namely: (1) the impact of the length of training sequences on convergence, (2) the relation between character frequencies and the order in which they are learnt, (3) the generalisation ability of our recognition network to inputs of arbitrary lengths, and (4) the impact of varying the text corpus on recognition accuracy. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images and scanned images of real printed books, using no labelled training examples.
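
To make the idea of aligning predicted strings with an unpaired text corpus concrete, here is a minimal sketch. It is not the method of the paper: the abstract does not say how the alignment is performed, so the sketch swaps in a simple stand-in, comparing character-bigram statistics with a total-variation distance. The function names and the toy strings are all hypothetical. It only shows that unpaired text gives a usable signal: plausible transcriptions sit closer to the corpus statistics than implausible ones, without any image-label pairs.

```python
from collections import Counter


def bigram_distribution(strings):
    """Normalised frequencies of character bigrams over a list of strings."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy data: an unpaired corpus and two sets of "predictions".
corpus = ["the cat sat on the mat", "the dog ate the bone"]
good_predictions = ["the cat ate the bone", "the dog sat on the mat"]
bad_predictions = ["xqz kvj wpl zz qxk jvw", "qqk zxv jjw pkx vzq"]

target = bigram_distribution(corpus)
good_gap = total_variation(bigram_distribution(good_predictions), target)
bad_gap = total_variation(bigram_distribution(bad_predictions), target)

# A smaller gap means the predictions look more like valid text.
print(f"good gap: {good_gap:.3f}, bad gap: {bad_gap:.3f}")
```

In this toy setting the gap plays the role of a training signal: a recogniser whose outputs lower it is producing strings that look more like the corpus, even though no image was ever paired with its transcription.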

Index Terms

Learning to Read by Spelling: Towards Unsupervised Text Recognition
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition
      2. Computer vision representations
        Image representations
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Published In

ICVGIP '18: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing

December 2018

659 pages

ISBN:9781450366151

DOI:10.1145/3293353

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICVGIP 2018

ICVGIP 2018: 11th Indian Conference on Computer Vision, Graphics and Image Processing

December 18 - 22, 2018

Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 95 of 286 submissions, 33%

Article Metrics

Total Citations: 8
Total Downloads: 108
Downloads (Last 12 months): 21
Downloads (Last 6 weeks): 5

Reflects downloads up to 04 Oct 2024

Citations

Cited By

Siglidis I, Gonthier N, Gaubil J, Monnier T, Aubry M (2024). The Learnable Typewriter: A Generative Approach to Text Analysis. Document Analysis and Recognition - ICDAR 2024, pages 297-314. Online publication date: 3-Sep-2024.
https://doi.org/10.1007/978-3-031-70536-6_18
Guimarães V, Nascimento J, Viana P, Carvalho P (2023). A Review of Recent Advances and Challenges in Grocery Label Detection and Recognition. Applied Sciences, 13:5 (2871). Online publication date: 23-Feb-2023.
https://doi.org/10.3390/app13052871
Jiang X, Zhang J, Du J, Zhang Z, Wu J (2022). Scene Text Recognition with Self-supervised Contrastive Predictive Coding. 2022 26th International Conference on Pattern Recognition (ICPR), pages 1514-1521. Online publication date: 21-Aug-2022.
https://doi.org/10.1109/ICPR56361.2022.9956631
Xu Z, Zhou S, Bai F, Cheng Z, Niu Y, Pu S (2021). Recognizing Multiple Text Sequences from an Image by Pure End-to-End Learning. 2020 25th International Conference on Pattern Recognition (ICPR), pages 7058-7065. Online publication date: 10-Jan-2021.
https://doi.org/10.1109/ICPR48806.2021.9412079
Aberdam A, Litman R, Tsiper S, Anschel O, Slossberg R, Mazor S, Manmatha R, Perona P (2021). Sequence-to-Sequence Contrastive Learning for Text Recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15297-15307. Online publication date: Jun-2021.
https://doi.org/10.1109/CVPR46437.2021.01505
Dencker T, Klinkisch P, Maul S, Ommer B (2020). Deep learning of cuneiform sign detection with weak supervision using transliteration alignment. PLOS ONE, 15:12 (e0243039). Online publication date: 16-Dec-2020.
https://doi.org/10.1371/journal.pone.0243039
Monnier T, Aubry M (2020). docExtractor: An off-the-shelf historical document element extraction. 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 91-96. Online publication date: Sep-2020.
https://doi.org/10.1109/ICFHR2020.2020.00027
Jakab T, Gupta A, Bilen H, Vedaldi A (2020). Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8784-8794. Online publication date: Jun-2020.
https://doi.org/10.1109/CVPR42600.2020.00881
Kolesnichenko O, Martynov A, Pulit V, Kolesnichenko Y, Shakirov V, Mazelis L, Varlamov O, Minushkina L, Sotnik A, Zhilina T, Dorofeev V, Smorodin G, Zhaparov M (2019). MODERN ADVANCED ARTIFICIAL INTELLIGENCE FOR SMART MEDICINE. Remedium, pages 36-43. Online publication date: 13-May-2019.
https://doi.org/10.21518/1561-5936-2019-04-36-43
