An end-to-end textspotter with explicit alignment and attention

T He, Z Tian, W Huang, C Shen… - Proceedings of the …, 2018 - openaccess.thecvf.com
Abstract
Text detection and recognition in natural images have long been treated as two separate tasks that are processed sequentially. Jointly training the two tasks is non-trivial due to significant differences in learning difficulty and convergence rate. In this work, we present a conceptually simple yet efficient framework that processes the two tasks simultaneously in a unified model. Our main contributions are three-fold: (1) we propose a novel text-alignment layer that precisely computes convolutional features of a text instance in arbitrary orientation, which is key to boosting performance; (2) a character attention mechanism is introduced by using character spatial information as explicit supervision, leading to large improvements in recognition; (3) the two techniques, together with a new RNN branch for word recognition, are integrated seamlessly into a single model that is end-to-end trainable. This allows the two tasks to work collaboratively by sharing convolutional features, which is critical for identifying challenging text instances. Our model obtains impressive results in end-to-end recognition on ICDAR 2015, significantly advancing the most recent results, with F-measure improving from (0.54, 0.51, 0.47) to (0.82, 0.77, 0.63) when using a strong, weak and generic lexicon, respectively. Thanks to joint training, our method can also serve as a good detector, achieving new state-of-the-art detection performance on related benchmarks. Code is available at https://github.com/tonghe90/textspotter.
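The abstract does not specify how the text-alignment layer is implemented, so the following is only a minimal sketch of the general idea it names: reading features for a text box in arbitrary orientation at exact, non-integer positions. It assumes a single-channel feature map stored as a list of rows, a box given by centre, size and rotation angle, and bilinear interpolation as the sampling rule. The function names `bilinear` and `sample_rotated_box` and all parameters are hypothetical and are not taken from the paper or its released code.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2-D grid (list of rows) at real-valued (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that samples outside the map reuse the border values.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def sample_rotated_box(feature, cx, cy, box_w, box_h, angle, out_w, out_h):
    """Sample an out_h x out_w grid from a box centred at (cx, cy).

    The box has size box_w x box_h and is rotated by `angle` radians.
    Each output cell is read at the centre of its bin inside the box.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for i in range(out_h):
        # Offset along the box's own vertical axis, at the bin centre.
        v = ((i + 0.5) / out_h - 0.5) * box_h
        row = []
        for j in range(out_w):
            # Offset along the box's own horizontal axis, at the bin centre.
            u = ((j + 0.5) / out_w - 0.5) * box_w
            # Rotate the local offset into feature-map coordinates.
            x = cx + u * cos_a - v * sin_a
            y = cy + u * sin_a + v * cos_a
            row.append(bilinear(feature, x, y))
        out.append(row)
    return out


# Toy feature map whose value equals its column index.
ramp = [[float(col) for col in range(5)] for _ in range(5)]
aligned = sample_rotated_box(ramp, 2.0, 2.0, 4.0, 2.0, 0.0, 4, 2)
rotated = sample_rotated_box(ramp, 2.0, 2.0, 4.0, 2.0, math.pi / 2, 4, 2)
```

Because the output grid has a fixed size regardless of the box's size or angle, a sampler of this kind can feed a recognition branch that expects fixed-height input. Whether the paper's layer uses this exact sampling scheme is not stated in the abstract.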