Abstract
In this work we present an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval. The system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, followed by a fast filtering stage to improve precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at once, departing from the character-classifier-based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine and require no human-labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system that makes thousands of hours of news footage instantly searchable via a text query.
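The stages named in the abstract (complementary proposal generators, a filtering stage, whole-word recognition, and ranking) can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function `spot_text`, its parameters, the `keep_threshold` value, and the callable interfaces for proposers, filter, and recogniser are all hypothetical placeholders for the trained components the paper describes.

```python
from typing import Any, Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); an assumed convention


def spot_text(
    image: Any,
    proposers: Iterable[Callable[[Any], Iterable[Box]]],
    text_score: Callable[[Any, Box], float],
    recognise: Callable[[Any, Box], Tuple[str, float]],
    keep_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> List[Tuple[Box, str, float]]:
    """Return (box, word, score) triples, best-scoring first."""
    # Stage 1: pool proposals from every generator to favour recall,
    # dropping exact duplicates while preserving first-seen order.
    candidates: List[Box] = []
    seen = set()
    for propose in proposers:
        for box in propose(image):
            if box not in seen:
                seen.add(box)
                candidates.append(box)

    # Stage 2: a cheap text/no-text filter removes unlikely boxes
    # to improve precision before the expensive recogniser runs.
    kept = [box for box in candidates if text_score(image, box) >= keep_threshold]

    # Stage 3: recognise the whole proposal region as a single word
    # (the recogniser stands in for a word-level network).
    results = [(box, *recognise(image, box)) for box in kept]

    # Stage 4: rank detections by recognition score.
    results.sort(key=lambda item: item[2], reverse=True)
    return results
```

In this sketch the recogniser's score doubles as the ranking signal; a retrieval system along the same lines could rank images by the best score any detection assigns to the query word.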
Acknowledgments
This work was supported by the EPSRC and ERC Grant VisRec No. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. We thank the BBC and in particular Rob Cooper for access to data and video processing resources.
Additional information
Communicated by Cordelia Schmid.
About this article
Cite this article
Jaderberg, M., Simonyan, K., Vedaldi, A. et al. Reading Text in the Wild with Convolutional Neural Networks. Int J Comput Vis 116, 1–20 (2016). https://doi.org/10.1007/s11263-015-0823-z
DOI: https://doi.org/10.1007/s11263-015-0823-z