1 Introduction

Handwriting is one of the most natural and efficient ways for humans to record information. With the widespread use of smartphones, tablet computers and electronic whiteboards, recording information on smart devices has become a major choice for its convenience. As a result, handwritten text recognition has been intensively studied over the last decades and widely applied in many fields. However, the recognition and analysis of 2D diagrams, such as flowcharts, circuits and music scores, remain challenging because of their complex 2D structures and large variation in writing style.

Existing methods for online handwritten diagram recognition and interpretation can be roughly divided into two categories: bottom-up [4, 5, 6, 15, 27] and top-down [1, 11, 14, 21, 22]. Bottom-up approaches perform a symbol segmentation step followed by a recognition step. However, errors accumulate across the two steps, so these methods often yield low recognition accuracy. Top-down approaches, on the other hand, integrate the two steps in one framework, such as probabilistic graphical models (PGM), and perform segmentation and recognition simultaneously. Top-down methods typically achieve higher accuracy but suffer from high computational cost because of their complicated learning and inference algorithms. We review these methods in more detail in the next section.

In this work, we propose an efficient and high-accuracy method for online handwritten diagram recognition. In particular, we treat diagram stroke classification as a graph node classification problem and solve it with attention-based graph neural networks (GNN) [13, 20]. Compared with PGM, such as conditional random fields (CRF) and Markov random fields (MRF), GNN is more powerful and flexible in learning the stroke representation and exploiting the contextual information. Unlike PGM, the learning and inference algorithms of GNN are very simple and efficient, which makes it very suitable for large-scale applications.

We highlight the main contributions of this work as follows. First, we propose a general online handwritten diagram recognition method based on GNN. Second, to better exploit the relationships between strokes, we enhance the original graph attention network (GAT) [20] by introducing a novel attention mechanism. Third, on three popular benchmark datasets, our method consistently outperforms existing methods and achieves state-of-the-art results.

In the rest of this paper, we first provide a general review of existing work on online handwritten diagram recognition and a brief review of GNN in Sect. 2. In Sect. 3, we give a detailed introduction to the proposed method. The experimental settings and comparison results are described in Sect. 4. Finally, Sect. 5 presents our concluding remarks.

2 Related Works

2.1 Diagram Recognition

Because it is difficult to classify text and non-text strokes or to segment symbols precisely at an early stage of diagram recognition, some works considered only graphic symbols, and others imposed constraints on users. Qi et al. [16] presented a flowchart recognition system using Bayesian conditional random fields, but their dataset included only very simple graphic symbols and no text. Yuan et al. [27] proposed a hybrid model combining support vector machines (SVM) and hidden Markov models (HMM) for programming teaching. Miyao et al. [15] presented a flowchart recognition and editing system that segmented the symbols based on loop structure and recognized them using SVMs. Although [15, 27] allowed the flowcharts to contain both symbols and text, they placed many constraints on users: [27] required users to draw each symbol with only one stroke, and [15] required users to differentiate text from graphic symbols.

Awal et al. [1] proposed two methods for flowchart recognition from different viewpoints, one bottom-up and one top-down. For the former, text and graphic symbols were classified based on the entropy of strokes, and then a time delay neural network (TDNN) or an SVM was applied to recognize the graphical symbols. Moreover, they introduced a global recognition architecture based on the TDNN and a dynamic programming (DP) algorithm.

Flowchart diagrams are documents with complex 2D structures, so previous statistical approaches [1, 15, 27] reported limited performance because they ignored structural information. Lemaitre et al. [14] proposed a method that tried to handle segmentation and recognition simultaneously. Their model integrated structural and syntactic priors of flowcharts using the Enhanced Position Formalism (EPF) language [8], and then used the Description and MOdification of the Segmentation (DMOS) method [8] to segment and recognize the flowchart in one step. Their method made considerable progress in stroke labeling and symbol recognition compared with [1], but it is too restricted and rigid to adapt to other domains, and it cannot describe every symbol with all its variations.

To exploit structural information, Carton et al. [7] presented an approach based on a human-like perceptive mechanism that incorporated both structural and statistical information of a flowchart. As in [14], the work used DMOS to express circular and quadrilateral symbols, and then proposed a deformation measure to quantify how good a quadrilateral was.

In handwritten diagrams, arrows vary in appearance and are harder to recognize than other symbols when the same classifier is used. Bresler et al. [4, 5, 6] proposed a framework in which strokes were first classified as text or non-text, then non-text strokes were clustered and uniform symbols were classified with an SVM, and finally the arrows were detected. For structure analysis, they modeled the whole flowchart excluding text as a max-sum problem and solved it with integer linear programming [3]. This approach achieved state-of-the-art results on three handwritten datasets. However, the recognition system has some severe flaws; for example, each arrow must consist of a shaft and a head, which may lead to recognition failures if one of them is absent.

Wang et al. [21, 22] proposed a general model, max-margin MRF, which combines MRF and structural SVM to perform stroke segmentation and recognition simultaneously. By exploiting temporal and spatial relationships between strokes, their model greatly improved the stroke labeling accuracy. To lower the complexity of evaluating stroke grouping candidates in a diagram, Julca-Aguilar et al. [11] applied the Faster R-CNN model [17] to the detection of online handwritten graphics by converting the original online data to offline images. Despite the overall high performance of flowchart symbol detection, the arrow detection accuracy is not satisfactory, and the conversion into images loses the temporal information of strokes.

2.2 Graph Neural Networks

In recent years, graph neural networks (GNN) have received extensive attention and become one of the most popular research topics in deep learning. With their capability of capturing dependencies between objects and operating on non-Euclidean domains [28], GNNs have achieved great success in many tasks, such as relational reasoning [2] and text classification [24]. Kipf et al. [13] proposed a simple and efficient layer-wise propagation rule for graph convolutional networks (GCN) based on spectral graph convolution, and their model achieved significant improvements on several graph-structured datasets. Veličković et al. [20] put forward a novel GNN architecture, graph attention networks (GAT), which introduced a masked self-attention mechanism to tackle some key challenges of GNNs. Recently, Ye et al. [26] proposed a new GAT framework for stroke classification, which demonstrated the great potential of GNNs for online handwritten document recognition.

3 Method

We are given N labeled online handwritten diagrams \({D}=\left\{ \left( X^{i}, Y^{i}\right) | i \in [1, N]\right\} \), where each diagram \(X^{i}\) is composed of a sequence of strokes \(X^{i}=\left\{ X_{s}^{i} | s \in \left[ 1, M_{i}\right] \right\} \) (\(M_{i}\) is the number of strokes in \( X^{i}\)) and \(Y_{s}^{i}\) is the label of \(X_{s}^{i}\), which takes a discrete semantic annotation such as process, decision or arrow. Our goal is to learn a model from the training set D that predicts the labels of strokes in test diagrams as accurately as possible.

Roughly speaking, our method models each diagram as a graph in which nodes represent strokes and edges represent the relationships between strokes. We then treat diagram stroke classification as a graph node classification problem and solve it with an attention-based GNN. The proposed method is composed of three modules: construction of the diagram graph, extraction of node and edge features from raw signals, and graph attention networks. They are introduced in turn below.

3.1 Graph Building

Here we introduce a new approach to abstracting the structural information in a diagram: each handwritten diagram is formulated as a space-time relationship graph (STRG). Every stroke \(s_{i}\) is represented as a vertex \({v}_{i} \in V\), and the relevance in space and time between strokes \(s_{i}\) and \(s_{j}\) is denoted as an edge \({e}_{ij} \in E\) in the graph G(V, E), where V is the vertex set and E is the edge set of G.

Specifically, from the time perspective, we build the edges \(E_T=\{(t,t+1) | t \in [1, n - 1]\}\) between every pair of temporally adjacent strokes in the diagram, where n is the number of strokes.

From the spatial perspective, for the stroke \({s}_{s} \), the edge set \({e}_{s,N(s)} \) is added to E(G), where N(s) is the set of all spatial neighbors of stroke \({s}_{s} \). If the minimal Euclidean distance between two strokes is less than the spatial neighbor threshold (SNT), they are regarded as neighbors of each other. The hyperparameter SNT is tuned on the validation set. We also tried building more complex STRGs, but this had little effect on the experimental results. Figure 1 shows an example of a flowchart rendered from the original data and its corresponding STRG.

Fig. 1. An example of a handwritten diagram and its corresponding space-time relationship graph (STRG). The numbers in the figure indicate the temporal order of strokes. (a) An example handwritten diagram and (b) its STRG.
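
The graph construction described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `min_stroke_distance` and `build_strg` are hypothetical, strokes are assumed to be lists of sampled (x, y) points, the minimal distance is approximated over sampled points only, and strokes are indexed from 0 rather than from 1.

```python
from itertools import combinations
from math import hypot


def min_stroke_distance(a, b):
    """Smallest Euclidean distance between any sampled point of a and any of b."""
    return min(hypot(xa - xb, ya - yb) for xa, ya in a for xb, yb in b)


def build_strg(strokes, snt):
    """Return the undirected edge set of a space-time relationship graph.

    strokes: strokes in writing order, each a list of (x, y) points.
    snt: spatial neighbor threshold (a tuned hyperparameter in the text).
    """
    n = len(strokes)
    # Temporal edges: every pair of consecutive strokes in writing order.
    edges = {(t, t + 1) for t in range(n - 1)}
    # Spatial edges: pairs whose minimal point distance is below the threshold.
    for i, j in combinations(range(n), 2):
        if min_stroke_distance(strokes[i], strokes[j]) < snt:
            edges.add((i, j))
    return edges


strokes = [
    [(0.0, 0.0), (1.0, 0.0)],    # stroke 0
    [(10.0, 0.0), (11.0, 0.0)],  # stroke 1: far from stroke 0, but next in time
    [(0.5, 0.5), (0.5, 1.5)],    # stroke 2: close to stroke 0 in space
]
edges = build_strg(strokes, snt=1.0)
print(sorted(edges))
```

In this toy input, strokes 0 and 2 are linked only by spatial proximity, while the pairs (0, 1) and (1, 2) are linked by temporal adjacency. The pairwise loop is quadratic in the number of strokes, which is a simplification chosen for readability.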

3.2 Feature Extraction

For each stroke in an online document, 10 local features and 13 context features [25] are extracted as node features in the STRG. These features have proven very effective in previous works [19, 25]. In addition, 19 edge features [25] are extracted from stroke pairs to model the relations between strokes. In feature pre-processing, we apply a power transformation with coefficient 0.5, followed by normalization with mean \(\mu \) and standard deviation \(\sigma \). The original feature h therefore becomes:

$$\begin{aligned}&h^{\prime }=\text {sign}(h) \sqrt{|h|} \end{aligned}$$
(1)
$$\begin{aligned}&h^{\prime \prime }=\left( h^{\prime }-\mu \right) /\sigma \end{aligned}$$
(2)

where sign(\(\cdot \)) is the sign function.
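
As an illustration of Eqs. (1) and (2), the sketch below applies the signed square root and then standardizes a single feature. It assumes that \(\mu \) and \(\sigma \) are estimated from the transformed training values and that the population standard deviation is used; neither detail is stated in the text, and the function names are hypothetical.

```python
from math import copysign, sqrt


def power_transform(h):
    """Eq. (1): signed square root, sign(h) * sqrt(|h|)."""
    return copysign(sqrt(abs(h)), h)


def fit_standardizer(values):
    """Mean and population standard deviation of already transformed values.

    Assumption: the statistics are computed after the power transform.
    """
    mu = sum(values) / len(values)
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma


def preprocess(h, mu, sigma):
    """Eq. (2) applied to the output of Eq. (1)."""
    return (power_transform(h) - mu) / sigma


train = [-4.0, 0.0, 4.0, 16.0]                     # raw values of one feature
transformed = [power_transform(h) for h in train]  # signed square roots
mu, sigma = fit_standardizer(transformed)
print([round(preprocess(h, mu, sigma), 3) for h in train])
```

The signed square root compresses large magnitudes while preserving the sign, which reduces the influence of heavy-tailed feature values before standardization.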

3.3 Graph Attention Networks

In this section, we introduce the enhanced GAT model, which is constructed by stacking multiple graph attention layers.

The input to each graph attention layer is a set of node features, \(\mathbf{H}=\left\{ \overrightarrow{h}_{1}, \overrightarrow{h}_{2}, \ldots , \overrightarrow{h}_{|V|}\right\} , \overrightarrow{h}_{i} \in \mathbb {R}^{C}\), and a set of edge features \(\mathbf{F}=\left\{ \overrightarrow{f}_{i j} |(i, j) \in {E}\right\} \), \(\overrightarrow{f}_{i j}\in \mathbb {R}^{D}\), where |V| is the number of nodes, and C and D are the dimensionalities of the node features and edge features, respectively. The layer generates a new set of node features, \(\mathbf{H}^{\prime }=\left\{ \overrightarrow{h}^{\prime }_{1}, \overrightarrow{h}_{2}^{\prime }, \ldots , \overrightarrow{h}_{|V|}^{\prime }\right\} , \overrightarrow{h}_{i}^{\prime }\in \mathbb {R}^{C^{\prime }}\), where \(C^{\prime }\) is the dimension of the output features.

In each layer, a shared linear transformation is first applied to every node, and then a shared self-attention mechanism computes the attention coefficients on the nodes:

$$\begin{aligned} c_{i j}=a\left( \mathbf{W}_{h} \overrightarrow{h}_{i}, \mathbf{W}_{h} \overrightarrow{h}_{j}\right) \end{aligned}$$
(3)

where \(\mathbf{W}_{h}\) is a shared learnable weight matrix for the node-wise feature transformation. The node attention mechanism \(a: \mathbb {R}^{C^{\prime }} \times \mathbb {R}^{C^{\prime }} \rightarrow \mathbb {R}\) used in this work is the additive attention parameterized by a learnable weight \(\overrightarrow{a}_{h} \in \mathbb {R}^{C^{\prime }}\) with an activation function \(\sigma \), which is formulated as:

$$\begin{aligned} c_{i j}=\sigma \left( \overrightarrow{a}_{h}^{T}\left( \mathbf{W}_{h} \overrightarrow{h}_{i}+\mathbf{W}_{h} \overrightarrow{h}_{j}\right) \right) . \end{aligned}$$
(4)

In addition to computing attention coefficients with the self-attention mechanism, we also incorporate edge features to measure the importance of edges by applying a one-layer feedforward neural network:

$$\begin{aligned} c_{i j}^{\prime }=\sigma \left( \overrightarrow{a}_{f}^{T} \sigma \left( \mathbf{W}_{f} \overrightarrow{f}_{ij}+\overrightarrow{b}_{f}\right) \right) \end{aligned}$$
(5)

where \(\mathbf{W}_{f} \in \mathbb {R}^{C^{\prime } \times D}, \overrightarrow{b}_{f} \in \mathbb {R}^{C^{\prime }}, \overrightarrow{a}_{f} \in \mathbb {R}^{C^{\prime }}\) are all learnable parameters. In this work, we use Leaky ReLU as the activation function.

Note that the coefficients above are not comparable across different nodes. Consequently, they are normalized across all neighbors using the softmax function:

$$\begin{aligned} \alpha _{i j}=\text {softmax}_{j}\left( c_{i j}+c_{i j}^{\prime }\right) =\frac{\exp \left( c_{i j}+c_{i j}^{\prime }\right) }{\sum _{k \in N(i)} \exp \left( c_{i k}+c^{\prime }_{i k}\right) } \end{aligned}$$
(6)

where N(i) is the neighborhood of node i.

The final output features for every node are computed by aggregating weighted node features of neighbors with attention coefficients:

$$\begin{aligned} \overrightarrow{h}_{i}^{\prime }=\sigma \left( \sum _{j \in N(i)} \alpha _{i j} \mathbf{W}_{h} \overrightarrow{h}_{j}\right) . \end{aligned}$$
(7)
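
To make Eqs. (3) to (7) concrete, the following is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions rather than the authors' PyTorch/DGL implementation: the function `attention_layer` and its argument layout are hypothetical, Leaky ReLU is used for every \(\sigma \), weight matrices are stored as lists of rows, and whether N(i) contains node i itself is left to the caller.

```python
from math import exp


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_layer(H, F, neighbors, W_h, a_h, W_f, b_f, a_f):
    """Single-head layer: returns (new node features, attention weights).

    H: list of node feature vectors.
    F: dict mapping (i, j) to the edge feature vector of that edge.
    neighbors: dict mapping node i to the list N(i).
    """
    Z = [matvec(W_h, h) for h in H]  # shared linear transform W_h h_i
    out, alphas = [], []
    for i in range(len(H)):
        scores = []
        for j in neighbors[i]:
            # Eq. (4): additive node attention.
            c = leaky_relu(dot(a_h, [zi + zj for zi, zj in zip(Z[i], Z[j])]))
            # Eq. (5): edge feature attention.
            hidden = [leaky_relu(u + b)
                      for u, b in zip(matvec(W_f, F[(i, j)]), b_f)]
            scores.append(c + leaky_relu(dot(a_f, hidden)))
        # Eq. (6): softmax over the neighborhood (shifted for stability).
        m = max(scores)
        weights = [exp(s - m) for s in scores]
        total = sum(weights)
        alpha = [w / total for w in weights]
        # Eq. (7): attention-weighted aggregation followed by activation.
        agg = [sum(a * Z[j][d] for a, j in zip(alpha, neighbors[i]))
               for d in range(len(Z[i]))]
        out.append([leaky_relu(x) for x in agg])
        alphas.append(alpha)
    return out, alphas


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
F = {(i, j): [float(abs(i - j))] for i in neighbors for j in neighbors[i]}
identity = [[1.0, 0.0], [0.0, 1.0]]
W_f, b_f = [[1.0], [0.5]], [0.0, 0.0]
out, alphas = attention_layer(H, F, neighbors, identity,
                              [0.3, -0.1], W_f, b_f, [0.2, 0.4])
print(alphas)
```

When both attention vectors are zero, every score is zero and the layer reduces to a plain mean over the neighborhood, which gives a convenient sanity check for an implementation.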

Following Veličković et al. [20], we also adopt multi-head attention in our model. Specifically, K independent attention mechanisms execute the transformation of Eq. 7, and then their features are concatenated:

$$\begin{aligned} \overrightarrow{h}_{i}^{\prime }= \Vert _{k=1}^{K}\sigma \left( \sum _{j \in N(i)} \alpha _{i j}^{k} \mathbf{W}_{h}^{k} \overrightarrow{h}_{j}\right) \end{aligned}$$
(8)

in which || denotes the concatenation operation, and \({\alpha _{i j}^{k}}\) are the normalized attention coefficients calculated by the k-th attention mechanism \( {a}^{k}\).

In the final layer, we average the heads instead of concatenating them, and then apply the softmax function to output the predicted values:

$$\begin{aligned}&\overrightarrow{h}_{i}^{\prime }=\sigma \left( \frac{1}{K} \sum _{k=1}^{K} \sum _{j \in N(i)} \alpha _{i j}^{k} \mathbf{W}_{h}^{k} \overrightarrow{h}_{j}\right) \end{aligned}$$
(9)
$$\begin{aligned}&\overrightarrow{p}_{i}=\text {softmax}\left( \mathbf{W}_{o}\overrightarrow{h}_{i}^{\prime }\right) \end{aligned}$$
(10)

where \(\mathbf{W}_{o} \in \mathbb {R}^{L \times C^{\prime }}\) is a learnable weight matrix that transforms features to outputs, and L is the number of stroke classes.
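
The difference between the hidden-layer combination in Eq. (8) and the output-layer combination in Eqs. (9) and (10) can be sketched as below. The inputs are assumed to be the per-head aggregated vectors \(\sum _{j} \alpha _{i j}^{k} \mathbf{W}_{h}^{k} \overrightarrow{h}_{j}\) for one node; the function names are hypothetical, Leaky ReLU stands in for \(\sigma \), and the output matrix is stored with one row per class, which is an assumption about layout.

```python
from math import exp


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def concat_heads(head_sums):
    """Eq. (8): activate each head's aggregated vector, then concatenate."""
    return [leaky_relu(x) for head in head_sums for x in head]


def average_heads(head_sums):
    """Eq. (9): average the K aggregated vectors, then activate."""
    k = len(head_sums)
    return [leaky_relu(sum(col) / k) for col in zip(*head_sums)]


def softmax(z):
    m = max(z)  # shift by the maximum for numerical stability
    e = [exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(h, W_o):
    """Eq. (10): class probabilities from the final node features."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W_o])


head_sums = [[1.0, -2.0], [3.0, 4.0]]       # K = 2 heads, C' = 2
hidden = concat_heads(head_sums)            # length K * C'
final = average_heads(head_sums)            # length C'
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 classes, one row per class
probs = predict(final, W_o)
print(hidden, final, probs)
```

Concatenation multiplies the feature width by K in hidden layers, whereas averaging keeps the final representation at width \(C^{\prime }\) so that a single output projection can be applied.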

The standard cross-entropy loss on the training set is used to train the GAT model, which is formulated as:

$$\begin{aligned} L(\mathbf W )=-\sum _{i=1}^{N}\sum _{s=1}^{M_i}\text {log }\overrightarrow{p}_{s}\left( Y_{s}^{i}\right) , \end{aligned}$$
(11)

in which W encompasses all learnable parameters. W is initialized with Glorot initialization [9] and learned using mini-batch gradient descent. In practice, parameter optimization is performed with the Adam optimizer [12].
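
Equation (11) sums the negative log-probability of the true label over every stroke of every training diagram. A minimal sketch, assuming predictions are stored as nested lists (diagram, stroke, class) and labels as integer class indices, with a hypothetical function name:

```python
from math import log


def cross_entropy(predictions, labels):
    """Eq. (11): summed negative log-probability of the true labels.

    predictions[i][s] is the class distribution predicted for stroke s of
    diagram i; labels[i][s] is the index of its ground-truth class.
    """
    return -sum(
        log(probs[y])
        for diagram_probs, diagram_labels in zip(predictions, labels)
        for probs, y in zip(diagram_probs, diagram_labels)
    )


predictions = [
    [[0.5, 0.5], [0.25, 0.75]],  # diagram 1 with two strokes
    [[1.0, 0.0]],                # diagram 2 with one stroke
]
labels = [[0, 1], [0]]
loss = cross_entropy(predictions, labels)
print(loss)
```

A practical implementation would typically compute this from logits with a log-softmax to avoid taking the logarithm of a probability that has underflowed to zero.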

4 Experiments

4.1 Dataset

In this work, we evaluate our method on three publicly accessible online handwritten diagram databases: FC_A [1], FC_B [5] and FA [6]. FC_A and FC_B are two flowchart databases which include text and six graphical symbols: terminator, connection, decision, data, process and arrow. FA is a finite automata database which encompasses state (circle), final state (pairwise concentric circles), arrow and label. Table 1 shows the details of the three databases.

Table 1. Online handwritten diagram datasets overview.

4.2 Experimental Setup

For all experiments, we adopt a seven-layer GAT model with 32 neurons in each hidden layer and residual connections [10]. Following [20], we use the Leaky ReLU activation function with a negative slope of 0.2 as the activation in Eqs. 4 and 5. We also apply dropout [18] with a dropout probability of 0.1 to all layers. In addition, every hidden layer has 8 attention heads for the flowchart datasets (FC_A and FC_B) and 6 attention heads for FA. We use 2 output attention heads for all networks.

We train all models for up to 200 epochs by minimizing the standard cross-entropy loss on the training set with early stopping. The optimization is performed using the Adam optimizer [12] with an initial learning rate of 0.005 for the flowchart datasets and 0.003 for FA. The decay rate is set to 0.1, and the number of patience rounds is \(r=15\) for the flowchart datasets and \(r=17\) for FA. To train the networks more efficiently, we use mini-batches of 8 graphs for the flowchart datasets and 6 graphs for FA.
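
The text does not spell out how the decay rate and the patience interact. One plausible reading, shown here purely as an assumption, is a reduce-on-plateau schedule: the learning rate is multiplied by the decay rate once the monitored loss has failed to improve for more than the patience number of epochs. The class name `PlateauDecay` and the choice of monitored quantity are hypothetical.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after a plateau.

    Assumption: a plateau means more than `patience` consecutive epochs
    without a new best value of the monitored loss.
    """

    def __init__(self, lr, factor=0.1, patience=15):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, monitored_loss):
        if monitored_loss < self.best:
            self.best = monitored_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Flowchart setting from the text: initial rate 0.005, decay 0.1, patience 15.
scheduler = PlateauDecay(lr=0.005, factor=0.1, patience=15)

# Short demonstration with a small patience so the decay is visible.
demo = PlateauDecay(lr=0.005, factor=0.1, patience=2)
for loss_value in [1.0, 0.9, 0.95, 0.95, 0.95]:
    current_lr = demo.step(loss_value)
print(current_lr)
```

If the patience instead governs early stopping, the same counter can be reused to terminate training rather than to scale the learning rate.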

The feature extraction module is implemented in C++, and both the training and inference algorithms are implemented with PyTorch and the Deep Graph Library (DGL). Training of the GAT is conducted on a server with an NVIDIA GeForce GTX 980 GPU, while testing is performed on a PC with a four-core Intel(R) Core(TM) i5-7400 CPU @ 3.00 GHz. Unless otherwise specified, we repeat each experiment 10 times using the same configuration and report the average results.

4.3 Results and Discussion

In this work, we use the stroke classification accuracy to evaluate our method. Each stroke in a diagram is assigned a predefined symbol-level label by the model, which is then compared against the ground truth. Stroke classification accuracies on FC_A, FC_B and FA are reported in Tables 2, 3 and 4, respectively. Numbers in boldface show the best results. Results of the comparison methods are cited directly from the original papers. GAT denotes the method that uses the original GAT model, while GAT with EFA denotes the method proposed in this work, which enhances GAT by introducing edge feature attention.
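
As a small illustration of the metric, the sketch below computes the overall and per-class stroke classification accuracy from ground-truth and predicted labels. The function name and the toy labels are hypothetical and do not reproduce any reported result.

```python
from collections import Counter


def stroke_accuracy(y_true, y_pred):
    """Return (overall accuracy, accuracy per ground-truth class)."""
    correct, total = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_class


y_true = ["arrow", "arrow", "text", "text", "text", "process"]
y_pred = ["arrow", "text", "text", "text", "text", "data"]
overall, per_class = stroke_accuracy(y_true, y_pred)
print(overall, per_class)
```

Reporting the per-class values next to the overall figure matters for diagram data, because frequent classes such as text can dominate the overall accuracy.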

As we can see, on all datasets, GAT with EFA outperforms previous methods and achieves the best overall accuracy. One notable phenomenon is that the performance of previous methods varies dramatically across labels, while GAT with EFA consistently delivers accurate predictions for all labels. In addition, GAT with EFA improves on GAT by a large margin in all experiments, which demonstrates that the proposed edge feature attention plays an important role in capturing the complicated temporal and spatial relationships between strokes. Furthermore, our method is very efficient in both training and testing. For example, on FC_A, it takes about 38 min to train our model and 70 ms to classify all strokes of one flowchart under the settings described in Sect. 4.2.

Table 2. Stroke classification accuracy on FC_A (%).
Table 3. Stroke classification accuracy on FC_B (%).
Table 4. Stroke classification accuracy on FA (%).
Fig. 2. Examples of misrecognized flowcharts from the FC_A dataset. Every recognized stroke is colored and explained in the legend. (a) A data stroke is misclassified as arrow, an arrow stroke is misclassified as decision, and a text stroke is misclassified as data. (b) Two strokes in the decision symbol are misclassified as data and process, respectively. (c) A text stroke enclosed by a process is misclassified as process, and another text stroke beside an arrow is misclassified as arrow (Color figure online).

Fig. 3. Confusion matrix for the stroke classification results on the test set of FC_A (overall precision is 97.27%). Each row in the matrix encompasses all strokes of the ground-truth class, and each column encompasses all strokes of the predicted class.
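
A confusion matrix with the same orientation (rows for ground-truth classes, columns for predicted classes) can be built as sketched below. The function name and the toy labels are hypothetical and unrelated to the reported figures.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count matrix M where M[r][c] is the number of strokes whose
    ground-truth class is classes[r] and predicted class is classes[c]."""
    index = {label: k for k, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


classes = ["arrow", "data", "process"]
y_true = ["arrow", "data", "data", "process"]
y_pred = ["arrow", "process", "data", "process"]
matrix = confusion_matrix(y_true, y_pred, classes)
for label, row in zip(classes, matrix):
    print(label, row)

# The diagonal holds the correctly classified strokes.
accuracy = sum(matrix[k][k] for k in range(len(classes))) / len(y_true)
```

Off-diagonal entries in a row show which classes a given ground-truth class is confused with, for example data strokes predicted as process in this toy input.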

4.4 Error Analysis

In Fig. 2, we show three examples of recognized flowcharts from FC_A with typical errors. If a stroke is far from the symbol it belongs to but close to a neighboring one, it is more likely to be misclassified, as in Fig. 2(a) and (b). Isolated text strokes next to an arrow are more likely to be mistakenly predicted as arrow. Some recognition errors could be eliminated through post-processing, such as a process stroke surrounded by text, as shown in (c). The confusion matrix for the stroke classification results on the test set of FC_A is presented in Fig. 3. Since some symbols are ambiguous in appearance, such as process and data, they are likely to be misclassified with high confidence. Another very important factor for misclassification is that the numbers of training samples of different classes are severely imbalanced by nature, which has a serious side effect on recognition performance: the classifier is more likely to predict stroke classes with fewer samples as classes that have more samples. However, in contrast to previous work, this effect is moderate in our proposed framework with the edge attention mechanism.

5 Conclusions

In this work, we have introduced a novel and general framework based on GAT for online handwritten diagram recognition. We formulate diagram stroke classification as a node classification task in a graph. Experiments on two flowchart benchmark datasets and one finite automata dataset demonstrate that the proposed framework with the edge feature attention mechanism can efficiently encode complex spatial and temporal relationships for stroke classification. Our method outperforms several recently proposed approaches by a large margin. Our model is computationally efficient, which makes it suitable for large-scale applications on mobile devices. Moreover, our analysis of the failure cases suggests that the classification performance can be improved further. In future work, we will investigate how to extend our framework to perform stroke grouping and symbol recognition of handwritten diagrams, as well as structure analysis of diagrams.