Convolutional Attention Networks for Scene Text Recognition

Authors:

Shancheng Fang,

Yongdong Zhang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 15, Issue 1s

Article No.: 3, Pages 1 - 17

https://doi.org/10.1145/3231737

Published: 24 January 2019

Abstract

In this article, we present Convolutional Attention Networks (CAN) for unconstrained scene text recognition. Recent dominant approaches to scene text recognition are mainly based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), where the CNN encodes images and the RNN generates character sequences. In contrast to these methods, CAN is built entirely on CNNs and includes an attention mechanism. The distinctive characteristics of our method are as follows: (i) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN; (ii) the attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling; and (iii) position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior work, even without the use of RNNs.
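To make points (ii) and (iii) concrete, the following is a minimal NumPy sketch of one attention step from a single decoder state over a two-dimensional encoder feature map. It is not the authors' implementation. The abstract does not say how average pooling enters the attention, so the choice here (attending over a 3x3 average-pooled copy of the feature map) is an assumption, as are the scaled dot-product scoring, the additive position embeddings, and every name and shape in the code.

```python
import numpy as np

def avg_pool_3x3(feat):
    """3x3 average pooling, stride 1, edge-padded, over an (H, W, C) map."""
    h, w, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    # Sum the nine shifted views of the padded map, then divide by 9.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / 9.0

def spatial_attention(query, feat, pos_emb):
    """Attend from one decoder state (C,) over an encoder map (H, W, C).

    Assumption: keys are the position-aware features and values are the
    average-pooled features; the paper may combine them differently.
    """
    h, w, c = feat.shape
    keys = (feat + pos_emb).reshape(h * w, c)
    values = avg_pool_3x3(feat).reshape(h * w, c)
    scores = keys @ query / np.sqrt(c)   # scaled dot-product scores
    scores -= scores.max()               # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum()             # softmax over all H*W locations
    context = weights @ values           # weighted sum of pooled features
    return context, weights.reshape(h, w)

# Hypothetical sizes: a 4x16 feature map with 8 channels.
rng = np.random.default_rng(0)
H, W, C = 4, 16, 8
feat = rng.standard_normal((H, W, C))
pos_emb = 0.1 * rng.standard_normal((H, W, C))  # stand-in for learned embeddings
query = rng.standard_normal(C)                  # stand-in for a decoder state

context, attn = spatial_attention(query, feat, pos_emb)
print(context.shape, attn.shape)  # (8,) (4, 16)
```

In a decoder of the kind the abstract describes, a step like this would run in every convolutional layer, with `query` taken from that layer's output at each character position.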

Index Terms

Convolutional Attention Networks for Scene Text Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 1s

Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data

January 2019

265 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3309769

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2019

Accepted: 01 June 2018

Revised: 01 April 2018

Received: 01 October 2017

Published in TOMM Volume 15, Issue 1s

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Nature Science Foundation of China
Fundamental Research Funds for the Central Universities
National Key Research and Development Program of China
Youth Innovation Promotion Association Chinese Academy of Sciences
