Abstract
Handwritten text recognition has been extensively researched for decades and has achieved extraordinary success in recent years. However, handwritten diagram recognition remains a challenging task because of the complex 2D structure and writing style variation. This paper presents a general framework for online handwritten diagram recognition based on graph attention networks (GAT). We model each diagram as a graph in which nodes represent strokes and edges represent the relationships between strokes. We then train GAT models to classify graph nodes, taking both stroke features and the relationships between strokes into consideration. To better exploit the spatial and temporal relationships, we enhance the original GAT model with a novel attention mechanism. Experiments on two online handwritten flowchart datasets and a finite automata dataset show that our method consistently outperforms previous methods and achieves state-of-the-art performance.
1 Introduction
Handwriting is one of the most natural and efficient ways for humans to record information. With the widespread use of smartphones, tablet computers and electronic whiteboards, recording information on smart devices has become a common choice because of its convenience. As a result, handwritten text recognition has been intensively studied over the last decades and widely applied in many fields. However, the recognition and analysis of 2D diagrams, such as flowcharts, circuits and music scores, remain challenging because of the complex 2D structures and great writing style variation.
Existing methods for online handwritten diagram recognition and interpretation can be roughly divided into two categories: bottom-up [4, 5, 6, 15, 27] and top-down [1, 11, 14, 21, 22]. Bottom-up approaches sequentially perform a symbol segmentation step and a recognition step. However, because errors accumulate across the two steps, these methods often yield low recognition accuracy. Top-down approaches, on the other hand, integrate the two steps in one framework, such as probabilistic graphical models (PGM), and perform segmentation and recognition simultaneously. Typically, top-down methods achieve higher accuracy but suffer from high computational cost because of their complicated learning and inference algorithms. We review these methods in more detail in the next section.
In this work, we propose an efficient and high-accuracy method for online handwritten diagram recognition. In particular, we treat diagram stroke classification as a graph node classification problem and solve it with attention-based graph neural networks (GNN) [13, 20]. Compared with PGM, such as conditional random fields (CRF) and Markov random fields (MRF), GNN is more powerful and flexible in learning the stroke representation and exploiting the contextual information. Unlike PGM, the learning and inference algorithms of GNN are very simple and efficient, which makes it very suitable for large-scale applications.
We highlight the main contributions of this work as follows. First, we propose a general online handwritten diagram recognition method based on GNN. Second, to better exploit the relationships between strokes, we enhance the original GAT [20] by introducing a novel attention mechanism. Third, on three popular benchmark datasets, our method consistently outperforms the existing methods and achieves the state-of-the-art results.
In the rest of this paper, we first provide a general review of existing online handwritten diagram recognition works and a brief review of GNN in Sect. 2. In Sect. 3, we give a detailed introduction to the proposed method. The experimental setting and comparison results are described in Sect. 4. Finally, Sect. 5 draws our concluding remarks.
2 Related Works
2.1 Diagram Recognition
Since it is difficult to classify text and non-text strokes or to segment symbols precisely at an early stage of diagram recognition, some works only considered graphic symbols, and others imposed constraints on users. Qi et al. [16] presented a system for flowchart recognition using Bayesian conditional random fields, but their dataset only included very simple graphic symbols and no text. Yuan et al. [27] proposed a hybrid model combining support vector machines (SVM) and hidden Markov models (HMM) for programming teaching. Miyao et al. [15] presented a flowchart recognition and editing system that segmented the symbols based on loop structure and recognized them using SVMs. Although [15, 27] allowed the flowcharts to contain both symbols and texts, they placed many constraints on users: [27] required users to draw each symbol with only one stroke, and [15] required users to differentiate texts from graphic symbols.
Awal et al. [1] proposed two methods for flowchart recognition, one bottom-up and one top-down. In the former, texts and graphic symbols were classified based on the entropy of strokes, and then a time-delay neural network (TDNN) or an SVM was applied to recognize graphical symbols. Moreover, they introduced a global recognition architecture based on the TDNN and a dynamic programming (DP) algorithm.
Flowchart diagrams are documents with complex 2D structures, so previous statistical approaches [1, 15, 27] have reported limited performance because they ignore structural information. Lemaitre et al. [14] proposed a method that handles segmentation and recognition simultaneously. Their model integrated structural and syntactic priors of flowcharts using the Enhanced Position Formalism (EPF) language [8], and then used the Description and MOdification of the Segmentation (DMOS) method [8] to segment and recognize the flowchart in one step. Their method achieved great progress in stroke labeling and symbol recognition compared with [1], but it is too restricted and rigid to adapt to other domains, and it cannot describe every symbol with all of its variations.
To exploit structural information, Carton et al. [7] presented an approach based on a human-like perceptive mechanism that incorporated both structural and statistical information of a flowchart. As in [14], the work used DMOS to express circular and quadrilateral symbols, and then proposed a deformation measure to quantify what makes a good quadrilateral.
In handwritten diagrams, arrows vary in appearance and are more difficult to recognize than other symbols when the same classifier is used. Bresler et al. [4, 5, 6] proposed a new framework in which strokes are first classified as text or non-text, then non-text strokes are clustered and uniform symbols are classified with an SVM, and finally the arrows are detected. For structure analysis, they modeled the whole flowchart, excluding texts, as a max-sum problem and applied integer linear programming to solve it [3]. This approach achieved state-of-the-art results on three handwritten datasets. However, the recognition system has some severe flaws; for example, each arrow must consist of a shaft and a head, which may lead to recognition failures if one of them is absent.
Wang et al. [21, 22] proposed a general model, max-margin MRF, which combines MRF and structural SVM to perform stroke segmentation and recognition simultaneously. By exploiting temporal and spatial relationships between strokes, their model greatly improved the stroke labeling accuracy. To lower the complexity of evaluating stroke grouping candidates in a diagram, Julca-Aguilar et al. [11] applied the Faster R-CNN model [17] to the detection of online handwritten graphics by converting the original online data to offline images. Despite the overall high performance of flowchart symbol detection, the arrow detection accuracy is not satisfactory, and the conversion into images loses the temporal information of strokes.
2.2 Graph Neural Networks
In recent years, graph neural networks (GNN) have received extensive attention and become one of the most popular research topics in deep learning. With their capability of capturing the dependency between objects and operating on non-Euclidean domains [28], GNNs have obtained great success in many tasks, such as relational reasoning [2] and text classification [24]. Kipf et al. [13] proposed a simple and efficient layer-wise propagation rule for graph convolutional networks (GCN) based on spectral graph convolution, and their model achieved significant improvements on several graph-structured datasets. Veličković et al. [20] put forward a novel GNN architecture, graph attention networks (GAT), which introduced a masked self-attention mechanism to tackle some key challenges of GNN. Recently, Ye et al. [26] proposed a new GAT framework for stroke classification, which demonstrated the great potential of GNN for online handwritten document recognition.
3 Method
We are given N labeled online handwritten diagrams \({D}=\left\{ \left( X^{i}, Y^{i}\right) | i \in [1, N]\right\} \), where each diagram \(X^{i}\) is composed of a sequence of strokes \(X^{i}=\left\{ X_{s}^{i} | s \in \left[ 1, M_{i}\right] \right\} \) (\(M_{i}\) is the number of strokes in \( X^{i}\)) and \(Y_{s}^{i}\) is the label of \(X_{s}^{i}\), which takes a discrete semantic annotation such as process, decision or arrow. Our goal is to learn a model from the training set D that predicts the labels of strokes in test diagrams as accurately as possible.
Roughly speaking, our method models each diagram with a graph in which nodes represent strokes and edges represent the relationships between strokes. Then, we treat diagram stroke classification as a graph node classification problem and solve it with attention-based GNN. The proposed method is composed of three modules, including the construction of the diagram graph, extraction of node and edge features from raw signals and graph attention networks, which will be introduced separately as follows.
3.1 Graph Building
Here we introduce a new approach to abstracting the structural information of a diagram: each handwritten diagram is formulated as a space-time relationship graph (STRG). Every stroke \(s_{i}\) is represented as a vertex \({v}_{i} \in V\), and the relevance in space and time between strokes \(s_{i}\) and \(s_{j}\) is denoted by an edge \({e}_{ij} \in E\) in the graph G(V, E), where V is the vertex set and E is the edge set of G.
Specifically, from the temporal perspective, we build the edges \(E_T=\{(t,t+1) | t \in [1, n - 1]\}\) between every pair of temporally adjacent strokes in the diagram, where n is the number of strokes.
From the spatial perspective, for the stroke \({s}_{s} \), the edge set \({e}_{s,N(s)} \) is added to E(G), where N(s) denotes all spatial neighbors of stroke \({s}_{s} \). Two strokes are regarded as neighbors of each other if the minimal Euclidean distance between them is less than the spatial neighbor threshold (SNT). The hyperparameter SNT is tuned on the validation set. We also tried building more complex STRGs of a document, but this had little effect on the experimental results. Figure 1 shows an example of a flowchart rendered from the original data and its corresponding STRG.
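The graph construction described in this section can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names `min_distance` and `build_strg`, the representation of a stroke as a list of (x, y) points, the brute-force point-to-point distance, and the 0-based indexing are all assumptions made for the example; `snt` stands for the spatial neighbor threshold.

```python
import math
from itertools import combinations


def min_distance(stroke_a, stroke_b):
    """Smallest Euclidean distance between any point of stroke_a and any point of stroke_b."""
    return min(math.dist(p, q) for p in stroke_a for q in stroke_b)


def build_strg(strokes, snt):
    """Return the undirected edge set of a space-time relationship graph.

    strokes: list of strokes in writing order, each a list of (x, y) points.
    snt: spatial neighbor threshold (a tuned hyperparameter).
    Edges are stored as (i, j) with i < j, using 0-based stroke indices.
    """
    edges = set()
    # Temporal edges: connect every pair of consecutively written strokes.
    for t in range(len(strokes) - 1):
        edges.add((t, t + 1))
    # Spatial edges: connect strokes whose minimal distance is below the threshold.
    for i, j in combinations(range(len(strokes)), 2):
        if min_distance(strokes[i], strokes[j]) < snt:
            edges.add((i, j))
    return edges
```

Because the edges are kept in a set, a pair of strokes that is adjacent both in time and in space contributes a single edge.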
3.2 Feature Extraction
For each stroke in an online document, 10 local features and 13 context features [25] are extracted as node features in the STRG. These features have proven very effective in previous works [19, 25]. In addition, 19 edge features [25] are extracted from stroke pairs to model the relations between strokes. In feature pre-processing, we apply a power transformation with coefficient 0.5, followed by normalization with mean \(\mu \) and standard deviation \(\sigma \). The original feature h therefore becomes:
where sign(\(\cdot \)) is the sign function.
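One plausible reading of this pre-processing step is sketched below. The sketch assumes that the power transformation is applied to the magnitude of the feature while the sign is preserved, and that \(\mu \) and \(\sigma \) are computed on the transformed training values; both points, as well as the function name `power_normalize`, are assumptions rather than details confirmed by the text.

```python
import math


def power_normalize(values, power=0.5):
    """Signed power transform followed by z-score normalization.

    values: raw values of one feature across the training strokes.
    Returns the normalized values plus the (mu, sigma) used, so the same
    statistics can be reused on validation and test data.
    """
    # Signed power transform: keep the sign, compress the magnitude.
    transformed = [math.copysign(abs(v) ** power, v) for v in values]
    mu = sum(transformed) / len(transformed)
    var = sum((t - mu) ** 2 for t in transformed) / len(transformed)
    # ASSUMPTION: fall back to 1.0 for constant features to avoid division by zero.
    sigma = math.sqrt(var) or 1.0
    return [(t - mu) / sigma for t in transformed], mu, sigma
```

The square root compresses large feature values, which reduces the influence of outliers before the features are standardized.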
3.3 Graph Attention Networks
In this section, we introduce the enhanced GAT model, which is constructed by stacking multiple graph attention layers.
The inputs to each graph attention layer are a set of node features, \(\mathbf{H}=\left\{ \overrightarrow{h}_{1}, \overrightarrow{h}_{2}, \ldots , \overrightarrow{h}_{|V|}\right\} , \overrightarrow{h}_{i} \in \mathbb {R}^{C}\), and a set of edge features \(\mathbf{F}=\left\{ \overrightarrow{f}_{i j} |(i, j) \in {E}\right\} \), \(\overrightarrow{f}_{i j}\in \mathbb {R}^{D}\), where |V| is the number of nodes, and C and D are the dimensionalities of node features and edge features, respectively. The layer generates a new set of node features, \(\mathbf{H}^{\prime }=\left\{ \overrightarrow{h}^{\prime }_{1}, \overrightarrow{h}_{2}^{\prime }, \ldots , \overrightarrow{h}_{|V|}^{\prime }\right\} , \overrightarrow{h}_{i}^{\prime }\in \mathbb {R}^{C^{\prime }}\), where \(C^{\prime }\) is the dimension of output features.
In each layer, the first step is to apply a shared linear transformation to every node; a shared attention mechanism is then used to compute attention coefficients via self-attention on the nodes:
where \(\mathbf{W}_{h}\) is a shared learnable weight matrix for the node-wise feature transformation. The node attention mechanism \(a: \mathbb {R}^{C^{\prime }} \times \mathbb {R}^{C^{\prime }} \rightarrow \mathbb {R}\) used in this work is the additive attention parameterized by a learnable weight \(\overrightarrow{a}_{h} \in \mathbb {R}^{C^{\prime }}\) with an activation function \(\sigma \), which is formulated as:
In addition to computing attention coefficients with the self-attention mechanism, we also incorporate edge features to measure the importance of edges by applying a one-layer feedforward neural network:
where \(\mathbf{W}_{f} \in \mathbb {R}^{C^{\prime } \times D}, \overrightarrow{b}_{f} \in \mathbb {R}^{C^{\prime }}, \overrightarrow{a}_{f} \in \mathbb {R}^{C^{\prime }}\) are all learnable parameters. In this work, we use Leaky ReLU as the activation function.
Note that the coefficients mentioned above are not comparable across different nodes. Consequently, they are normalized across all neighbors using the softmax function:
where N(i) is the neighborhood of node i.
The final output features for every node are computed by aggregating weighted node features of neighbors with attention coefficients:
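To make the data flow of one attention head concrete, the steps described so far (shared linear transformation, node attention, edge attention, softmax normalization over N(i), and weighted aggregation) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' exact formulation: the additive node attention is assumed to act on the sum of the two transformed node vectors, the node and edge scores are assumed to be combined by summation before the softmax, no nonlinearity is applied after aggregation, and every node is assumed to have at least one neighbor. All function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [dot(row, x) for row in W]


def attention_layer(H, F, neighbors, W_h, a_h, W_f, b_f, a_f):
    """One single-head graph attention layer with edge-feature attention.

    H: list of node feature vectors (length C each).
    F: dict mapping a directed pair (i, j) to an edge feature vector (length D).
    neighbors: dict mapping node i to the non-empty list N(i) of neighbor indices.
    W_h: C' x C matrix; a_h: length C'; W_f: C' x D matrix; b_f, a_f: length C'.
    Returns the new node features (length C' each).
    """
    Z = [matvec(W_h, h) for h in H]  # shared linear transformation
    H_new = []
    for i in range(len(H)):
        scores = []
        for j in neighbors[i]:
            # ASSUMPTION: additive node attention on z_i + z_j.
            node_score = leaky_relu(dot(a_h, [p + q for p, q in zip(Z[i], Z[j])]))
            # Edge attention: one-layer feedforward network on the edge features.
            hidden = [w + b for w, b in zip(matvec(W_f, F[(i, j)]), b_f)]
            edge_score = leaky_relu(dot(a_f, hidden))
            # ASSUMPTION: node and edge scores are combined by summation.
            scores.append(node_score + edge_score)
        # Softmax over the neighborhood N(i), shifted for numerical stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Weighted aggregation of the transformed neighbor features.
        H_new.append([sum(a * Z[j][c] for a, j in zip(alphas, neighbors[i]))
                      for c in range(len(Z[i]))])
    return H_new
```

With all attention parameters set to zero, the scores are equal and the layer reduces to a plain average over the neighborhood, which is a convenient sanity check.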
Following Veličković et al. [20], we also adopt multi-head attention in our model. Specifically, K independent attention mechanisms execute the transformation of Eq. 7, and then their features are concatenated:
in which || denotes concatenation operation, and \({\alpha _{i j}^{k}}\) are normalized attention coefficients calculated by the k-th attention mechanism \( {a}^{k}\).
In the final layer, we average the outputs of the attention heads instead of concatenating them, and then apply the softmax function to output the predicted values:
where \(\mathbf{W}_{o} \in \mathbb {R}^{C^{\prime } \times L}\) is a learnable weight matrix that transforms features to outputs.
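The difference between hidden layers and the final layer in how the K heads are combined can be illustrated for a single node as follows. This is a minimal sketch: the helper names are hypothetical, and applying \(\mathbf{W}_{o}\) after the averaging step is an assumption (for a shared linear map the two orders give the same logits).

```python
import math


def concat_heads(head_outputs):
    """Hidden layers: concatenate the K per-head feature vectors of one node."""
    return [x for head in head_outputs for x in head]


def average_heads(head_outputs):
    """Final layer: average the K per-head feature vectors of one node."""
    k = len(head_outputs)
    return [sum(vals) / k for vals in zip(*head_outputs)]


def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(head_outputs, W_o):
    """Average the heads, project with W_o (C' x L, stored as C' rows), apply softmax."""
    avg = average_heads(head_outputs)
    logits = [sum(avg[c] * W_o[c][l] for c in range(len(avg)))
              for l in range(len(W_o[0]))]
    return softmax(logits)
```

Concatenation makes the feature dimension grow with the number of heads, whereas averaging keeps the final feature dimension at \(C^{\prime }\) regardless of K.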
The standard cross-entropy loss on the training set is used to train the GAT model, which is formulated as:
in which W encompasses all learnable parameters. W is initialized with Glorot initialization [9] and learned using mini-batch gradient descent. In practice, parameter optimization is performed with the Adam optimizer [12].
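For reference, the textbook form of the cross-entropy objective over all training strokes, written in the notation of this section, is given below. It is a sketch of the standard loss under the assumption that the stroke losses are summed without a normalization constant; \(p\left( Y_{s}^{i} \mid X^{i}; \mathbf{W}\right) \) denotes the softmax output of the final layer for the ground-truth label of stroke s in diagram i.

\[ \mathcal {L}(\mathbf{W}) = -\sum _{i=1}^{N} \sum _{s=1}^{M_{i}} \log p\left( Y_{s}^{i} \mid X^{i}; \mathbf{W}\right) \]

Minimizing this quantity is equivalent to maximizing the likelihood of the annotated stroke labels in the training set.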
4 Experiments
4.1 Dataset
In this work, we evaluate our method on three publicly accessible online handwritten diagram databases: FC_A [1], FC_B [5] and FA [6]. FC_A and FC_B are two flowchart databases which include text and six graphical symbols: terminator, connection, decision, data, process and arrow. FA is a finite automata database which encompasses state (circle), final state (pairwise concentric circles), arrow and label. Table 1 shows the details of the three databases.
4.2 Experimental Setup
For all experiments, we adopt a seven-layer GAT model with 32 neurons in each hidden layer and residual connections [10]. Following [20], we employ the Leaky ReLU activation function with a negative slope of 0.2 in the attention functions of Eqs. 4 and 5. We also apply dropout [18] with a dropout probability of 0.1 to all layers. In addition, every hidden layer consists of 8 attention heads for flowchart datasets (FC_A and FC_B) and 6 attention heads for FA. We use 2 output attention heads for all networks.
We train all models for 200 epochs by minimizing the standard cross-entropy loss on the training set, using an early stopping strategy. The optimization is performed using the Adam optimizer [12] with an initial learning rate of 0.005 for flowchart datasets and 0.003 for FA. The decay rate is set to 0.1, and the number of patience rounds is \(r=15\) for flowchart datasets and \(r=17\) for FA. To train networks more efficiently, we use mini-batches of 8 graphs for flowchart datasets and 6 graphs for FA.
The feature extraction module is implemented in C++, and both the training and inference algorithms are implemented with PyTorch and the Deep Graph Library (DGL). Training of GAT is conducted on a server with an NVIDIA GeForce GTX 980 GPU, while testing is performed on a PC with four Intel (R) Core (TM) i5-7400 CPU @ 3.00 GHz. Unless otherwise specified, we repeat each experiment 10 times using the same configuration and report the average results.
4.3 Results and Discussion
In this work, we use the stroke classification accuracy to evaluate our method. Each stroke in a diagram is assigned a predefined symbol-level label by the model, which is then compared against the ground truth. Stroke classification accuracies on FC_A, FC_B and FA are reported in Tables 2, 3 and 4, respectively. Numbers in boldface show the best results. Results of comparison methods are directly cited from the original papers. GAT denotes the method that uses the original GAT model, while GAT with EFA denotes the method proposed in this work, which enhances GAT by introducing edge feature attention.
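The evaluation metric can be computed as in the following sketch. It assumes that accuracy is pooled over all strokes of all test diagrams rather than averaged per diagram; this pooling choice and the function name `stroke_accuracy` are assumptions made for illustration.

```python
def stroke_accuracy(predicted, ground_truth):
    """Fraction of strokes whose predicted label equals the ground-truth label.

    Both arguments are lists of diagrams, each diagram a list of stroke labels.
    """
    correct = total = 0
    for pred_labels, true_labels in zip(predicted, ground_truth):
        if len(pred_labels) != len(true_labels):
            raise ValueError("each diagram needs one prediction per stroke")
        correct += sum(p == t for p, t in zip(pred_labels, true_labels))
        total += len(true_labels)
    return correct / total
```

Pooling over strokes gives diagrams with many strokes more weight than small diagrams, which should be kept in mind when comparing numbers across datasets.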
As we can see, on all datasets, GAT with EFA outperforms previous methods and achieves the best overall accuracy. One notable phenomenon is that the performance of previous methods varies dramatically across different labels, while GAT with EFA consistently delivers accurate predictions for all labels. In addition, GAT with EFA improves over GAT by a large margin in all experiments, which demonstrates that the proposed edge feature attention plays an important role in capturing the complicated temporal and spatial relationships between strokes. Furthermore, our method is very efficient in both training and testing. For example, on FC_A, it takes about 38 min to train our model and 70 ms to classify all strokes of one flowchart under the settings described in Sect. 4.2.
Fig. 2. Examples of misrecognized flowcharts from the FC_A dataset. Every recognized stroke is colored and explained in the legend. (a) A data stroke is misclassified as arrow, an arrow stroke is misclassified as decision, and a text stroke is misclassified as data. (b) Two strokes in the decision symbol are misclassified as data and process, respectively. (c) A text stroke enclosed by a process is misclassified as process, and another text stroke beside an arrow is misclassified as arrow. (Color figure online)
4.4 Error Analysis
In Fig. 2, we show three examples of recognized flowcharts from FC_A with typical errors. If a stroke is far from the symbol to which it belongs but close to a neighboring symbol, it is more likely to be misclassified, as in Fig. 2(a) and (b). Isolated text strokes next to an arrow are more likely to be mistakenly predicted as arrow. Some recognition errors could be eliminated through post-processing, such as a process stroke surrounded by texts, as shown in (c). The confusion matrix of the stroke classification results on the FC_A test set is presented in Fig. 3. Since some symbols are ambiguous in appearance, such as process and data, they are likely to be misclassified with high confidence. Another very important factor for misclassification is that the numbers of training samples of different classes are severely imbalanced by nature, which has a serious side effect on recognition performance: the classifier is more likely to predict stroke classes with fewer samples as classes that have more samples. However, in contrast to previous work, this effect is moderate in our proposed framework with the edge attention mechanism.
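A confusion matrix and per-class recall of the kind used in this analysis can be derived from flat lists of true and predicted stroke labels, as in the sketch below. The function names and the dictionary-based representation are assumptions made for illustration; the label strings in the usage example are made up and are not results from the paper.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs over a flat list of strokes."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_recall(true_labels, predicted_labels):
    """Recall per class: correctly labeled strokes divided by strokes of that class."""
    counts = confusion_counts(true_labels, predicted_labels)
    support = Counter(true_labels)
    return {label: counts[(label, label)] / n for label, n in support.items()}


# Illustrative usage with made-up labels.
truth = ["text", "text", "text", "data", "process"]
pred = ["text", "text", "arrow", "process", "process"]
recall = per_class_recall(truth, pred)
```

Per-class recall makes the effect of class imbalance visible, since a high overall accuracy can hide poor recall on classes with few training strokes.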
5 Conclusions
In this work, we have introduced a novel and general framework based on GAT for online handwritten diagram recognition. We formulate diagram stroke classification as a node classification task in a graph. Experiments on two flowchart benchmark datasets and one finite automata dataset demonstrate that the proposed framework with the edge feature attention mechanism is capable of encoding complex spatial and temporal relationships efficiently for stroke classification. Our method outperforms several recently proposed approaches by a prominent margin. Our model is computationally efficient, which makes it suitable for large-scale applications on mobile devices. Moreover, our analysis of the failure cases suggests that the classification performance can be further improved. In future work, we will investigate how to extend our framework to perform stroke grouping and symbol recognition of handwritten diagrams, as well as structure analysis of diagrams.
References
1. Awal, A.M., Feng, G., Mouchère, H., Viard-Gaudin, C.: First experiments on a new online handwritten flowchart database. In: Document Recognition and Retrieval XVIII (2011)
2. Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J., et al.: Interaction networks for learning about objects, relations and physics. In: Advances in Neural Information Processing Systems (2016)
3. Bresler, M., Průša, D., Hlaváč, V.: Modeling flowchart structure recognition as a max-sum problem. In: International Conference on Document Analysis and Recognition (2013)
4. Bresler, M., Průša, D., Hlaváč, V.: Detection of arrows in on-line sketched diagrams using relative stroke positioning. In: IEEE Winter Conference on Applications of Computer Vision (2015)
5. Bresler, M., Průša, D., Hlaváč, V.: Online recognition of sketched arrow-connected diagrams. Int. J. Doc. Anal. Recogn. 19(3), 253–267 (2016)
6. Bresler, M., Van Phan, T., Průša, D., Nakagawa, M., Hlaváč, V.: Recognition system for on-line sketched diagrams. In: International Conference on Frontiers in Handwriting Recognition (2014)
7. Carton, C., Lemaitre, A., Coüasnon, B.: Fusion of statistical and structural information for flowchart recognition. In: International Conference on Document Analysis and Recognition (2013)
8. Coüasnon, B.: DMOS, a generic document recognition method: application to table structure analysis in a general and in a specific way. Int. J. Doc. Anal. Recogn. 8(2–3), 111–122 (2006)
9. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics (2010)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
11. Julca-Aguilar, F.D., Hirata, N.S.: Symbol detection in online handwritten graphics using Faster R-CNN. In: International Workshop on Document Analysis Systems (2018)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (2017)
14. Lemaitre, A., Mouchère, H., Camillerapp, J., Coüasnon, B.: Interest of syntactic knowledge for on-line flowchart recognition. In: Kwon, Y.-B., Ogier, J.-M. (eds.) GREC 2011. LNCS, vol. 7423, pp. 89–98. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36824-0_9
15. Miyao, H., Maruyama, R.: On-line handwritten flowchart recognition, beautification and editing system. In: International Conference on Frontiers in Handwriting Recognition (2012)
16. Qi, Y., Szummer, M., Minka, T.P.: Diagram structure recognition by Bayesian conditional random fields. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)
17. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (2015)
18. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
19. Van Phan, T., Nakagawa, M.: Combination of global and local contexts for text/non-text classification in heterogeneous online handwritten documents. Pattern Recogn. 51, 112–124 (2016)
20. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. In: International Conference on Learning Representations (2018)
21. Wang, C., Mouchère, H., Lemaitre, A., Viard-Gaudin, C.: Online flowchart understanding by combining max-margin Markov random field with grammatical analysis. Int. J. Doc. Anal. Recogn. 20(2), 123–136 (2017)
22. Wang, C., Mouchère, H., Viard-Gaudin, C., Jin, L.: Combined segmentation and recognition of online handwritten diagrams with high order Markov random field. In: International Conference on Frontiers in Handwriting Recognition (2016)
23. Wu, J., Wang, C., Zhang, L., Rui, Y.: Offline sketch parsing via shapeness estimation. In: International Joint Conference on Artificial Intelligence (2015)
24. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. arXiv preprint arXiv:1809.05679 (2018)
25. Ye, J.Y., Zhang, Y.M., Liu, C.L.: Joint training of conditional random fields and neural networks for stroke classification in online handwritten documents. In: International Conference on Pattern Recognition (2016)
26. Ye, J.Y., Zhang, Y.M., Yang, Q., Liu, C.L.: Contextual stroke classification in online handwritten documents with graph attention networks. In: International Conference on Document Analysis and Recognition (2019)
27. Yuan, Z., Pan, H., Zhang, L.: A novel pen-based flowchart recognition system for programming teaching. In: Leung, E.W.C., Wang, F.L., Miao, L., Zhao, J., He, J. (eds.) WBL 2008. LNCS, vol. 5328, pp. 55–64. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89962-4_6
28. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Sun, M.: Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434 (2018)
Acknowledgements
This work is supported by the National Key Research and Development Program Grant 2018YFB1005000, the National Natural Science Foundation of China (NSFC) Grants 61773376, 61721004.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Yun, XL., Zhang, YM., Ye, JY., Liu, CL. (2019). Online Handwritten Diagram Recognition with Graph Attention Networks. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science(), vol 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_19
DOI: https://doi.org/10.1007/978-3-030-34120-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34119-0
Online ISBN: 978-3-030-34120-6
eBook Packages: Computer Science, Computer Science (R0)