Abstract
In this paper, we propose a deep neural network model with an encoder–decoder architecture that translates images of math formulas into their LaTeX markup sequences. The encoder is a convolutional neural network that transforms images into a group of feature maps. To better capture the spatial relationships of math symbols, the feature maps are augmented with 2D positional encoding before being unfolded into a vector. The decoder is a stacked bidirectional long short-term memory model integrated with a soft attention mechanism, which works as a language model to translate the encoder output into a sequence of LaTeX tokens. The neural network is trained in two steps. The first step is token-level training using maximum likelihood estimation as the objective function. Upon completion of token-level training, a sequence-level training objective is employed to optimize the overall model based on the policy gradient algorithm from reinforcement learning. Our design also overcomes the exposure bias problem by closing the feedback loop in the decoder during sequence-level training, i.e., feeding in the predicted token instead of the ground-truth token at every time step. The model is trained and evaluated on the IM2LATEX-100K dataset and shows state-of-the-art performance on both sequence-based and image-based evaluation metrics.
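The abstract states only that the CNN feature maps are augmented with 2D positional encoding before being unfolded. The following is a minimal PyTorch sketch, assuming sinusoidal encodings in the style of Vaswani et al., with half of the channels encoding the column index and half the row index; the function name and channel split are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: 2D sinusoidal positional encoding added to CNN feature maps.
# Assumption: half the channels encode the x (column) index, half the y (row) index.
import math
import torch

def positional_encoding_2d(d_model: int, height: int, width: int) -> torch.Tensor:
    """Return a (d_model, height, width) tensor of 2D sinusoidal encodings."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    pe = torch.zeros(d_model, height, width)
    d_half = d_model // 2
    div = torch.exp(torch.arange(0, d_half, 2).float() * -(math.log(10000.0) / d_half))
    pos_w = torch.arange(width).float().unsqueeze(1)   # column positions
    pos_h = torch.arange(height).float().unsqueeze(1)  # row positions
    # First half of channels: encode the column (x) index.
    pe[0:d_half:2] = torch.sin(pos_w * div).t().unsqueeze(1).repeat(1, height, 1)
    pe[1:d_half:2] = torch.cos(pos_w * div).t().unsqueeze(1).repeat(1, height, 1)
    # Second half of channels: encode the row (y) index.
    pe[d_half::2] = torch.sin(pos_h * div).t().unsqueeze(2).repeat(1, 1, width)
    pe[d_half + 1::2] = torch.cos(pos_h * div).t().unsqueeze(2).repeat(1, 1, width)
    return pe

# Usage: augment the CNN feature maps, then unfold the H x W grid into a
# sequence of H*W feature vectors for the attention-based decoder.
feats = torch.randn(8, 512, 6, 40)                  # (batch, C, H, W) from the CNN
feats = feats + positional_encoding_2d(512, 6, 40)  # broadcast over the batch
memory = feats.flatten(2).transpose(1, 2)           # (batch, H*W, C)
```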
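The sequence-level step can be pictured as a sampling loop whose gradient is the REINFORCE estimator. Below is a minimal sketch, assuming a hypothetical decoder interface (`init_state`, `step`) and a sequence-level `reward_fn` such as BLEU against the ground-truth token sequence; the abstract specifies neither the reward nor any variance-reduction baseline, so both are assumptions here.

```python
# Sketch: policy-gradient (REINFORCE) update with the feedback loop closed.
# The token fed at each step is sampled from the model itself, not taken
# from the ground truth, which is what removes exposure bias.
import torch

def sequence_level_step(decoder, memory, sos_id, eos_id, max_len, reward_fn, reference):
    state = decoder.init_state(memory)       # hypothetical: attention context over encoder output
    token = torch.tensor([sos_id])
    log_probs, sampled = [], []
    for _ in range(max_len):
        logits, state = decoder.step(token, state)  # hypothetical: one decoding time step
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()                       # feed back the *predicted* token
        log_probs.append(dist.log_prob(token))
        sampled.append(token.item())
        if token.item() == eos_id:
            break
    reward = reward_fn(sampled, reference)          # sequence-level metric, e.g., BLEU
    # REINFORCE: scale the sampled sequence's log-likelihood by its reward.
    return -reward * torch.stack(log_probs).sum()
```

In practice a baseline is typically subtracted from the reward to reduce gradient variance; the sketch omits it for brevity.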
Notes
LaTeX (version 3.1415926-2.5-1.40.14).
Different sizes of width–height buckets (in pixels): (320, 40), (360, 60), (360, 50), (200, 50), (280, 50), (240, 40), (360, 100), (500, 100), (320, 50), (280, 40), (200, 40), (400, 160), (600, 100), (400, 50), (160, 40), (800, 100), (240, 50), (120, 50), (360, 40), (500, 200).
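Bucketing groups images of similar size so each batch can be padded to a common shape. A minimal sketch of one plausible assignment rule, assuming each image is padded to the smallest-area bucket that contains it (the note above lists only the bucket sizes, not the assignment policy):

```python
# Sketch: assign a rendered formula image to the smallest-area bucket that fits it.
# Images fitting no bucket would be discarded or downscaled (an assumption).
BUCKETS = [(320, 40), (360, 60), (360, 50), (200, 50), (280, 50), (240, 40),
           (360, 100), (500, 100), (320, 50), (280, 40), (200, 40), (400, 160),
           (600, 100), (400, 50), (160, 40), (800, 100), (240, 50), (120, 50),
           (360, 40), (500, 200)]

def assign_bucket(width: int, height: int):
    """Return the smallest-area bucket that fits the image, or None."""
    fitting = [(w, h) for (w, h) in BUCKETS if w >= width and h >= height]
    return min(fitting, key=lambda wh: wh[0] * wh[1]) if fitting else None

assert assign_bucket(300, 45) == (320, 50)
```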
Cite this article
Wang, Z., Liu, JC. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. IJDAR 24, 63–75 (2021). https://doi.org/10.1007/s10032-020-00360-2