
A Derivative-based Parser Generator for Visibly Pushdown Grammars

Published: 15 May 2023

Abstract

In this article, we present a derivative-based, functional recognizer and parser generator for visibly pushdown grammars. The generated parser accepts ambiguous grammars and produces a parse forest containing all valid parse trees for an input string in linear time. Each parse tree in the forest can then also be extracted in linear time. Besides the parser generator, to allow more flexible forms of visibly pushdown grammars, we also present a translator that converts a tagged CFG to a visibly pushdown grammar in a sound way; the parse trees of the tagged CFG are then produced by running the semantic actions embedded in the parse trees of the translated visibly pushdown grammar. The performance of the parser is compared with popular parsing tools, including ANTLR and GNU Bison, as well as popular hand-crafted parsers. The correctness and the time complexity of the core parsing algorithm are formally verified in the proof assistant Coq.

1 Introduction

Parsing is a fundamental component in computer systems. Modern parsers used in high-performance settings such as web browsers and network routers need to be efficient, as their performance is critical to the performance of the whole system. Furthermore, high-assurance parsers are becoming increasingly important for security in settings such as web applications, where parsers directly process potentially adversarial inputs from the network. In these settings, formally verified parsers are highly desirable.
Most parsing libraries are based on Context-Free Grammars (CFGs) or their variants. Although very flexible, CFGs have limitations in terms of efficiency. Not all CFGs can be converted to deterministic pushdown automata (PDA); the inherent nondeterminism in some CFGs causes the worst-case running time of general CFG-based parsing algorithms to be \(O(n^3)\) .
To achieve efficient parsing, many parsing frameworks place restrictions on what CFGs can be accepted, at the expense of placing the burden on users to refactor their grammars to satisfy those restrictions. See the related-work section for a discussion of common kinds of restrictions, which lead to parsing frameworks such as LL(k), LR(k) [11], and so on.
This article explores an alternative angle of building parsers based on Visibly Pushdown Grammars (VPGs) [2]. In VPGs, users explicitly partition all terminals into three kinds: plain, call, and return symbols. This partitioning makes the conversion of a VPG to a deterministic PDA always possible, which provides the foundation for efficient algorithms. Compared to requiring users to refactor their grammars to resolve sophisticated conflicts during parser generation, such as shift-reduce conflicts in LR(k), asking users to specify what terminals are call and return symbols is less of a burden.
VPGs have been used in program analysis, XML processing, and other applications [1], but their potential in parsing has not been fully exploited. In this article, we show that VPGs bring many benefits in parsing. First, we show an efficient, linear-time parsing algorithm for VPGs. Second, our algorithm is amenable to formal verification. Overall, this article makes the following contributions:
We present a derivative-based algorithm for VPG recognition and parsing. The generated parser accepts ambiguous grammars and produces a parse forest for the input string, where each parse tree in the forest can be extracted in linear time.
We mechanize the correctness proofs of the parsing algorithm in Coq. We have also formalized and proved the linear-time performance guarantee of the parsing algorithm in Coq.
We present a surface grammar called tagged CFGs to allow a more convenient use of our parsing framework. Users can use their familiar CFGs for developing grammars and provide additional tagging information on terminals. A sound translator then converts a tagged CFG to a VPG. We note that this translator is conservative and places some restrictions on acceptable tagged CFGs; in particular, left recursion is not allowed, and users are required to perform standard refactoring to remove left recursion.
In a performance evaluation of VPG-based parsers on popular formats including JSON, XML, and HTML, we find that their performance is comparable with that of parsers produced by popular parser generators such as ANTLR, as well as hand-crafted parsers. In addition, we find that VPG-based parsers often require a special lexer that groups multiple tokens into one single call/return symbol.
The remainder of this article is organized as follows: We first introduce VPGs in Section 2 and discuss related work in Section 3. In Section 4, we provide an overview of our parsing library and explore various usage scenarios. Section 5 presents a derivative-based VPG recognizer, which sheds light on the parsing algorithm discussed in Section 6. The translator and tagged CFGs are discussed in Section 7. We then evaluate the VPG parser in Section 8.
A conference version of this article was published in the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) [22]. We describe the differences between the conference version and this article as follows: First, we have redesigned the VPG parsing algorithm, including changing how the parser PDA is generated and replacing the pruner in the conference version with a set of parse-tree extraction functions. The new algorithm enables a simpler correctness proof and, more importantly, enables us to verify the performance guarantee: The parser is verified to have time complexity of \(O(kn)\) , where k is the number of parse trees and n is the length of the input string. Second, we provide a new method that optimizes the parse-tree extraction functions to a more efficient extraction PDA. Third, we rerun all experiments to evaluate our new parsing algorithm. Finally, we have revised our translation algorithm from tagged CFGs to VPGs and added the proofs of its soundness and termination.

2 Background: VPG

As a class of grammars, VPGs [2] have been used in program analysis, XML processing, and other fields. Compared with CFGs, VPGs enjoy many good properties. Since languages of VPGs are a subset of deterministic context-free languages, it is always possible to build a deterministic PDA from a VPG. The terminals in a VPG are partitioned into three kinds, and the stack action associated with an input symbol is fully determined by the kind of the symbol: An action of pushing to the stack is always performed for a call symbol, an action of popping from the stack is always performed for a return symbol, and no stack action is performed for a plain symbol. Furthermore, languages of VPGs enjoy appealing theoretical closure properties; e.g., the set of visibly pushdown languages is closed under intersection, concatenation, and complement [2]. As will be shown in this article, VPGs also enable the building of linear-time parsers, and VPG parsers are amenable to formal verification. The expressive power of VPGs lies between that of regular grammars and CFGs and is sufficient for describing the syntax of many practical languages, such as JSON, XML, and HTML, with appropriately defined call/return symbols.
We next give a formal account of VPGs. A grammar G is represented as a tuple \((V, \Sigma , P, L_0)\) , where V is the set of nonterminals, \(\Sigma\) is the set of terminals, P is the set of production rules, and \(L_0\in V\) is the start symbol. The alphabet \(\Sigma\) is partitioned into three sets: \(\Sigma _{\text{plain}}\) , \(\Sigma _{\text{call}}\) , \(\Sigma _{\text{ret}}\) , which contain plain, call, and return symbols, respectively. Notation-wise, a terminal in \(\Sigma _{\text{call}}\) is tagged with ‹ on the left, and a terminal in \(\Sigma _{\text{ret}}\) is tagged with › on the right. For example, \({‹ {a}}\) is a call symbol in \(\Sigma _{\text{call}}\) , and \({b› }\) is a return symbol in \(\Sigma _{\text{ret}}\) .
We first formally define well-matched VPGs. Intuitively, a well-matched VPG generates only well-matched strings in which a call symbol is always matched with a return symbol in a derived string.
Definition 2.1 (Well-matched VPGs).
A grammar \(G = (V, \Sigma , P, L_0)\) is a well-matched VPG with respect to the partitioning \(\Sigma =\Sigma _{\text{plain}}\cup \Sigma _{\text{call}}\cup \Sigma _{\text{ret}}\) if every production rule in P is in one of the following forms:
(1)
\(L\rightarrow \epsilon\) , where \(\epsilon\) stands for the empty string;
(2)
\(L\rightarrow c L_1\) , where \(c\in \Sigma _{\text{plain}}\) ;
(3)
\(L\rightarrow {‹ {a}}L_1 {b› }L_2\) , where \({‹ {a}}\in \Sigma _{\text{call}}\) and \({b› }\in \Sigma _{\text{ret}}\) .
Note that in \(L\rightarrow c L_1\) terminal c must be a plain symbol, and in \(L\rightarrow {‹ {a}}L_1 {b› }L_2\) a call symbol must be matched with a return symbol; these requirements ensure that any derived string must be well-matched.
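To make the rule forms concrete, the following Python sketch (our own illustration, not part of the paper's Coq library) encodes a grammar and checks that every rule has one of the three shapes above; by convention here, call symbols are written "<a" and return symbols "b>".

```python
# A minimal encoding of a well-matched VPG; all names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class VPG:
    nonterminals: frozenset
    plain: frozenset   # Sigma_plain
    call: frozenset    # Sigma_call
    ret: frozenset     # Sigma_ret
    rules: dict        # maps a nonterminal L to a set of right-hand sides
    start: str

def is_well_matched(g: VPG) -> bool:
    """Every rule must be L -> eps, L -> c L1, or L -> <a L1 b> L2."""
    for rhss in g.rules.values():
        for rhs in rhss:
            ok = (rhs == ()                                 # L -> eps
                  or (len(rhs) == 2 and rhs[0] in g.plain
                      and rhs[1] in g.nonterminals)         # L -> c L1
                  or (len(rhs) == 4 and rhs[0] in g.call
                      and rhs[1] in g.nonterminals and rhs[2] in g.ret
                      and rhs[3] in g.nonterminals))        # L -> <a L1 b> L2
            if not ok:
                return False
    return True
```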
The following is an example of a well-matched VPG, which is refactored from a grammar for XML:
\begin{equation*} \text{element} \rightarrow \text{OpenTag content CloseTag Empty} \mid \text{SingleTag Empty.} \end{equation*}
In this example, nonterminals start with a lowercase character, such as “element,” and terminals start with an uppercase character, such as “OpenTag.” The special nonterminal “Empty” has a single rule that produces the empty string. The grammar shows a typical usage of VPGs to model a hierarchically nested matching structure of XML texts: “OpenTag” is matched with “CloseTag,” and “content” nested in between can be “element” itself (not shown in the above snippet) and forms an inner hierarchy.
In the main text of this article, we consider only well-matched VPGs and use the term VPGs for well-matched VPGs. We also call rules in the form of \(L\rightarrow {‹ {a}}L_1 {b› } L_2\) matching rules. Our Coq library actually supports a broader class of VPGs, which we call general VPGs; algorithms for general VPGs are given in the appendices. Intuitively, general VPGs can specify substrings of well-matched strings; e.g., \(\langle /p \rangle \langle p \rangle \langle /p \rangle \langle p \rangle\) is a substring of a well-matched string. This ability enables users to parse streaming data such as network traffic. As an extension to well-matched VPGs, general VPGs allow the case of pending calls and returns, which means that a call/return symbol may not have its corresponding return/call symbol in the input string. To accommodate pending symbols, general VPGs additionally allow rules of the forms \(L\rightarrow {‹ {a}}L^{\prime }\) and \(L\rightarrow {b› }L^{\prime }\) , which we call pending rules. Call/return symbols in pending rules are called pending call/return symbols. Further, the set of nonterminals V is partitioned into \(V^0\) and \(V^1\) : nonterminals in \(V^0\) only generate well-matched strings, while nonterminals in \(V^1\) can generate strings with pending symbols.
Definition 2.2 (General VPGs).
A grammar \(G = (V, \Sigma , P, L_0)\) is a general VPG with respect to the partitioning \(\Sigma =\Sigma _{\text{plain}}\cup \Sigma _{\text{call}}\cup \Sigma _{\text{ret}}\) and \(V=V^0\cup V^1\) if every rule in P is in one of the following forms:
(1)
\(L\rightarrow \epsilon\) ;
(2)
\(L\rightarrow i L_1\) , where \(i \in \Sigma\) , and if \(L\in V^0\) then (1) \(i\in \Sigma _{\text{plain}}\) and (2) \(L_1\in V^0\) ;
(3)
\(L\rightarrow {‹ {a}}L_1 {b› }L_2\) , where \({‹ {a}}\in \Sigma _{\text{call}}\) , \({b› }\in \Sigma _{\text{ret}}\) , \(L_1\in V^0\) , and if \(L\in V^0\) , then \(L_2\in V^0\) .
The above definition imposes constraints on how \(V^0\) and \(V^1\) nonterminals can be used in a rule. For example, in \(L\rightarrow {‹ {a}}L_1 {b› }L_2\) , nonterminal \(L_1\) must be a well-matched nonterminal; so P cannot include rules such as \(L_1 \rightarrow {‹ {a}}L_3\) , since \(L_1\) is supposed to generate only well-matched strings.
The above constraints guarantee that we can identify paired call and return symbols simply by their locations. For example, consider the following string from a general VPG:
\begin{equation*} {‹ {a}}\ c\ {‹ {a}}\ c\ {b› }. \end{equation*}
Even without knowing the rules, we can determine the pairing as follows: We begin with an empty stack T and read symbols in order, ignoring the plain symbols. When we read a call symbol, we push it to T; when we read a return symbol, we match it with the top of T, and if T is empty, then the return symbol is pending. After we read all symbols, call symbols remaining in T are also pending. For the above string, the stack T has the following transitions:
\begin{equation*} \bot \;\longrightarrow \; {‹ {a}}\cdot \bot \;\longrightarrow \; {‹ {a}}\cdot {‹ {a}}\cdot \bot \;\longrightarrow \; {‹ {a}}\cdot \bot . \end{equation*}
Therefore, the first call symbol \({‹ {a}}\) is pending, and the second call symbol \({‹ {a}}\) is paired with the return symbol \({b› }\) . Note that, in general VPGs, paired \({‹ {a}}\) and \({b› }\) are not necessarily generated from the same matching rule; they could also be derived from pending rules.
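The location-based pairing procedure just described is easy to state operationally. The following Python sketch (our own illustration) classifies every call and return occurrence in a string as paired or pending:

```python
def pair_symbols(symbols, calls, rets):
    """Pair calls/returns by position, ignoring plain symbols.

    Returns (pairs, pending): `pairs` maps the index of each matched
    return symbol to the index of its call; `pending` lists unmatched
    call/return positions."""
    stack, pairs, pending = [], {}, []
    for i, s in enumerate(symbols):
        if s in calls:
            stack.append(i)                # push the call position
        elif s in rets:
            if stack:
                pairs[i] = stack.pop()     # match the most recent open call
            else:
                pending.append(i)          # a pending return symbol
    pending.extend(stack)                  # leftover calls are pending
    return pairs, pending

# For ["<a", "c", "<a", "c", "b>"] with calls={"<a"} and rets={"b>"},
# pairs == {4: 2} and pending == [0]: the first <a is pending, and the
# second <a is paired with b>.
```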
The notion of a derivation in VPGs is the same as the one in CFGs. We write \(w \rightarrow w^{\prime }\) to mean a single derivation step according to a grammar, where w and \(w^{\prime }\) are strings of terminals or nonterminals. We write \({L} \rightarrow ^* {w}\) to mean that w can be derived from L via a sequence of derivation steps.

3 Related Work

Most parser libraries rely on the formalism of Context-Free Grammars (CFGs) and user-defined semantic actions for generating parse trees. Many CFG-based parsing algorithms have been proposed in the past, including LL(k), LR(k) [11], Earley [14], CYK [7, 24, 44], among many others. LL(k) and LR(k) algorithms are commonly used, but their input grammars must be unambiguous. Users often have to change/refactor their grammars to avoid conflicts in LL(k) and LR(k) parsing tables, a nontrivial task. Earley, CYK, and GLR parsing can handle any CFG, but their worst-case running time is \(O(n^3)\) . In contrast, our VPG parsing accepts ambiguous grammars and is linear time.
Our VPG parsing algorithm relies on derivatives. One major benefit of working with derivatives is that they are amenable to formal verification: proofs related to derivatives involve algebraic transformations on symbolic expressions, which are easier to develop in proof assistants than reasoning about graphs (as required when formalizing LL and LR algorithms). Brzozowski [6] first presented the concept of derivatives and used it to build a recognizer for regular expressions. The idea was revived by Owens et al. [34] and generalized to generate parsers for CFGs [30], with an exponential worst-case time complexity. More recently, a symbolic approach [21] for parsing CFGs with derivatives was presented with cubic time complexity. Derivative-based CFG parsing is further improved with generalized zippers [9]. Danielsson [8] described a monadic parser combinator library, which used derivatives to run parsers; the library had a machine-checked proof in Agda [33], a dependently typed functional programming language. Finally, Edelmann et al. [15] presented a formally verified, derivative-based, linear-time parsing algorithm for LL(1) context-free expressions.
Owl is an open-source project [20] that provides a parser generator for VPGs. It has the same goal as our work, but differs in the following critical aspects: (1) Owl supports only well-matched VPGs, while our parsing library supports general VPGs; (2) Owl adopts a different algorithm and is implemented in an imperative way, while our parsing library is derivative-based and functional; (3) Owl does not provide formal assurance, while our parsing library is formally verified in Coq; (4) Owl rejects ambiguous VPGs, while our parsing library accepts ambiguous grammars and generates parse forests; and (5) Owl does not support semantic actions embedded in grammars, while our parsing library accepts semantic actions.
Due to parsers’ importance to security, many efforts have been made to build secure and correct parsers. One obvious approach is testing, through fuzz testing or differential testing (e.g., Petsios et al. [37]). However, testing generally cannot show the absence of bugs. Formal verification has also been applied to the building of high-assurance parsers. Jourdan et al. [23] applied the methodology of translation validation and implemented a verified parser validator for LR(1) grammars. RockSalt [31] included a verified parser for a regular-expression-based DSL. Barthwal and Norrish [4] presented a verified SLR parser. Lasser et al. [26] and Edelmann et al. [15] presented verified LL(1) parsers, but we are not aware of fully verified LL(k) parsers. Koprowski and Binsztok [25] implemented a formally verified parser generator for Parsing Expression Grammars (PEG). Ramananandro et al. [38] presented a verified parser generator for tag-length-value binary message format descriptions. There are also verified parsers for general CFGs [5, 17, 39]. These parsers are limited in their performance on real-world grammars, with a worst-case complexity of \(O(n^3)\) . The reason is that they have to handle highly ambiguous CFGs, whose nondeterminism hinders the parsers’ performance. Recently, Lasser et al. [27] implemented a verified ALL(*) parser, which is the algorithm behind ANTLR4; however, it did not provide a linear-time worst-case guarantee. We formalize our derivative-based, VPG parsing algorithm and its correctness and linear-time performance proofs in Coq.

4 VPG Frontend Overview

The focus of this article is on VPG-based parsing. However, since users are already familiar with CFGs, our system takes as input a CFG-based surface grammar, which is called a tagged CFG. In this section, we will give a brief overview of tagged CFGs and some application scenarios; we also discuss tagged CFGs’ expressiveness and limitations. In Section 7, we will discuss tagged CFGs in detail and how to convert a tagged CFG to a VPG if the tagged CFG passes a validation process. Once a VPG is obtained, we can use the algorithm described in Section 6 to produce a parser.
Tagged CFG overview. Roughly speaking, a tagged CFG is a CFG enhanced with tagging information about what terminals are call and return symbols; in addition, we allow regular operators for the convenience of specification. As examples, the two subfigures in Figure 1 present tagged CFGs for JSON and XML documents, respectively; more examples of tagged CFGs can be found in Appendix F.
Fig. 1. Tagged CFGs for JSON and XML. Nonterminals such as json are in lowercase and terminals such as STRING are in uppercase. Regular operators including ?, +, and the Kleene star * are allowed. Two additional operators, the left bracket < and the right bracket >, specify the call and return symbols. String literals such as ’true’ are enclosed in single quotation marks. The lexer rules for the terminals are omitted.
By placing a left bracket < before a terminal, the user can mark it as a call symbol; similarly, a right bracket > after a terminal marks it as a return symbol. We mentioned in Section 2 that the locations of call and return symbols in a VPG determine how they are matched. Similarly for a tagged CFG, matched call and return symbols should appear in the same rule.
Expressiveness and limitations. In practice, we can make tagged CFGs more expressive by utilizing a lexer with an extended notion of call and return symbols. Take XML as an example. If individual characters were considered as terminals, then the XML grammar could not be expressed as a tagged CFG; this is because an XML tag is complex and includes components such as the tag name and the tag attributes. However, if a whole XML tag is treated as a token in an extended lexer, then the XML grammar can be expressed as a tagged CFG and is convertible to a VPG, as our evaluation shows. For instance, the call symbol for XML is the open tag, which comprises the tag name and also possible attributes.
An example open tag is as follows:
\begin{equation*} \texttt {<p id="a">}. \end{equation*}
Since the syntax of such tags can be specified by regular expressions, we can treat a whole tag as a single token and extend the lexer to store semantic values such as the tag name p and the attribute id="a"; this can be performed efficiently [13].
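For illustration, such an extended lexer can recognize a whole open tag with a single regular expression and keep its parts as semantic values. The sketch below is our own and not the paper's implementation; the token kind name OpenTag is hypothetical.

```python
import re

# One token per whole open tag: the parser then sees a single call
# symbol ("OpenTag"), while the tag name and attributes are kept as
# semantic values attached to the token.
OPEN_TAG = re.compile(r'<([A-Za-z][\w.-]*)((?:\s+[\w.-]+\s*=\s*"[^"]*")*)\s*>')

def lex_open_tag(text, pos=0):
    """Return (kind, name, attrs, end_pos), or None if no open tag here."""
    m = OPEN_TAG.match(text, pos)
    if m is None:
        return None
    attrs = dict(re.findall(r'([\w.-]+)\s*=\s*"([^"]*)"', m.group(2)))
    return ("OpenTag", m.group(1), attrs, m.end())

# lex_open_tag('<p id="a">hello</p>') == ("OpenTag", "p", {"id": "a"}, 10)
```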
As will be discussed in Section 7, the primary limitation of a tagged CFG is that it must pass validation so we can convert it to a VPG. Roughly, this means that each dependency loop in the grammar must be guarded by at least one pair of matched call and return symbols. A dependency loop exists when a nonterminal L can be used to generate itself via a set of rules; this notion will be discussed formally in Definition 7.3. For example, the following grammar includes a dependency loop, as the right-hand side of the first rule uses L itself:
\begin{equation*} L\rightarrow a L L b \mid c. \end{equation*}
The previous grammar would not pass the validation; however, the user can use the following tags to make it a valid tagged CFG, by ensuring that the dependency stays within \({‹ {a}}\) and \({b› }\) :
\begin{equation*} L\rightarrow {‹ {a}}\ L\ L\ {b› }\mid c. \end{equation*}
In this case, we describe the dependency loop as well founded. For the JSON grammar in Figure 1, the call-return pairs (’[’, ’]’) and (’{’, ’}’) ensure that all dependency loops are well founded.
Because of the restriction of dependency loops, left recursion is not supported by tagged CFGs. However, in practice, we can easily remove direct left recursion by performing grammar refactoring [36].
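As an illustration of that standard refactoring (a textbook transformation, not specific to our translator), direct left recursion such as \(L\rightarrow L\,a \mid b\) can be removed by introducing a fresh nonterminal:
\begin{equation*} L\rightarrow b\,L^{\prime }, \qquad L^{\prime }\rightarrow a\,L^{\prime } \mid \epsilon . \end{equation*}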

5 VPG-based Recognition

Before presenting a parsing algorithm for VPGs, we first present a derivative-based algorithm for converting a well-matched VPG into a deterministic PDA for recognition. The resulting PDA recognizes the same set of strings as the input VPG.
To compute a recognizer, we utilize the notion of derivatives [6, 30, 34]. The derivative of a language \(\mathcal {L}\) with respect to an input symbol c is the set of residual strings obtained by removing the leading c from those strings in \(\mathcal {L}\) that start with c:
\begin{equation*} \delta _c(\mathcal {L}) = \lbrace w\,\mid \,cw \in \mathcal {L}\rbrace . \end{equation*}
Before giving formal derivative definitions, we first review derivatives for regular grammars [6, 34] in Example 5.1. We then extend them to VPGs in Example 5.2, where we also discuss informally the intuition of the states, the stack, and the transition function of the PDA that is converted from an input VPG. Throughout this section and Section 6, we use the VPG G shown in Figure 2(c), where the start nonterminal is L, and the right-regular grammar \(G^{\prime }\) , which is obtained from the VPG G by deleting the matching rules.
Fig. 2. Figures for Examples 5.1 and 5.2.
Example 5.1 (From Regular Grammars to DFAs).
For regular grammars, we can build a deterministic finite automaton (DFA) for recognition, where a state S is a set of nonterminals. When reading a symbol c, we transfer from the current state S to the next state \(\delta _c(S)\) , a derivative of S with respect to the symbol c:
\begin{equation*} \delta _{c}\left(S\right)=\left\lbrace L^{\prime }\mid L\in S,(L\rightarrow cL^{\prime })\in P\right\rbrace . \end{equation*}
We define the initial state \(S_0\) as the singleton set containing the start nonterminal. For instance, suppose the input string is \(s=ccc\) . The derivatives for the grammar \(G^{\prime }\) in Figure 2(a) are as follows:
\begin{align*} S_0 &= \lbrace L \rbrace , \\ S_1 &=\delta _c (S_0)= \lbrace A, B \rbrace , \\ S_2 &=\delta _c (S_1) = \lbrace D \rbrace , \\ S_3 &=\delta _c (S_2) = \lbrace L \rbrace . \end{align*}
We accept the string if and only if the final state includes a nonterminal that can derive the empty string. For the example input string, L is in the final state and can derive the empty string; thus, the string \(ccc\) is accepted.
Using the notion of derivatives, we can compute derivatives for all possible input symbols and for all states; the process terminates, since there are only finitely many sets of nonterminals. This way, we can produce a DFA. Figure 2(b) shows the DFA for the regular grammar \(G^{\prime }\) .
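As a small illustration of Example 5.1 in code, the Python sketch below transcribes the rules of \(G^{\prime }\) implied by the derivatives above (our reconstruction; the actual grammar is in Figure 2(a)) and runs the derivative-based DFA on the input ccc.

```python
# Right-regular grammar G' (transcribed from Example 5.1):
#   L -> c A | c B ;  A -> c D ;  B -> d D ;  D -> c L ;  L -> eps
RULES = {("L", "c"): {"A", "B"}, ("A", "c"): {"D"},
         ("B", "d"): {"D"}, ("D", "c"): {"L"}}
NULLABLE = {"L"}  # nonterminals that can derive the empty string

def delta(state, c):
    """Derivative of a DFA state (a set of nonterminals) w.r.t. c."""
    return frozenset(l2 for l in state for l2 in RULES.get((l, c), ()))

def accepts(word):
    state = frozenset({"L"})      # the initial state S0
    for c in word:
        state = delta(state, c)
    return any(l in NULLABLE for l in state)

assert accepts("ccc")             # final state {L}, and L derives eps
```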
Example 5.2 (From VPGs to Recognizer PDAs).
Based on the derivatives for regular grammars, our first attempt would be to retain the same kind of states when producing a PDA for VPGs. Considering the grammar G in Figure 2(c) and the string \(s=c{‹ {a}}cc {b› }\) , the attempt is summarized in Figure 2(d) and detailed as follows: Our first two states would be as follows: \(S_0 = \lbrace L \rbrace ;\ S_1 =\delta _c (S_0) = \lbrace A,B \rbrace\) .
Next, we would process the symbol \({‹ {a}}\) and try to compute the following derivative: \(\delta _{{‹ {a}}} (S_1)=\delta _{{‹ {a}}} (\lbrace A,B \rbrace)\) . Here, we would use the nonterminals A and B to derive \({‹ {a}}\) . At first glance, one might assume that the derivative of \(\lbrace A,B\rbrace\) with respect to \({‹ {a}}\) would simply be \(\lbrace A {b› }L, B {b› }L \rbrace\) ; however, this definition can lead to an infinite number of states, e.g., the derivative of \(\lbrace A {b› }L, B {b› }L \rbrace\) with respect to \({‹ {a}}\) would be \(\lbrace A {b› }L {b› }L,B {b› }L {b› }L \rbrace\) . Therefore, we would instead apply the two matching rules in the grammar and produce the next state \(S_2=\lbrace A, B \rbrace\) , which only contains the nonterminals to be parsed next. In other words, we would define the derivative function \(\delta _{{‹ {a}}}\) as follows:
\begin{equation*} \delta _{{‹ {a}}}(S) = \lbrace L_1 \mid L\in S,\ (L\rightarrow {‹ {a}}L_1 {b› }L_2)\in P\rbrace . \end{equation*}
For the next two plain symbols c, the derivatives would be identical to those of regular grammars, and thus before reading \({b› }\) , we would have the following states: \(S_0 = \lbrace L \rbrace ;\ S_1 =\delta _c (S_0) = \lbrace A,B \rbrace ; \ S_2 = \delta _{{‹ {a}}} (S_1)= \lbrace A, B \rbrace ; \ S_3 =\delta _c (S_2) = \lbrace D \rbrace ; \ S_4 =\delta _c (S_3) = \lbrace L \rbrace\) .
When processing the next symbol \({b› }\) , however, we would discover that the state \(S_4\) alone is insufficient to determine which rule to apply for deriving \({b› }\) , as shown in Figure 2(d). First, we know that the state \(S_4\) includes a nonterminal L that derives the empty string \(\epsilon\) . Therefore, we should use the matching rule used earlier that derives both the nonterminal L in the state \(S_4\) and the symbol \({b› }\) . However, based on only the state \(S_4\) , we would not know which matching rule should be used to produce \({b› }\) .
To solve this issue, we extend states to be sets of pairs of nonterminals and also use a stack. A nonterminal pair \((L_1, L_2)\) in a state tells that the rest of the input should match \(L_2\) and the current context is \(L_1\) . The context is the nonterminal that is used to derive \(L_2\) without consuming an unmatched call symbol before getting to \(L_2\) . Formally, it means that there exists a derivation sequence \({{L}_1} \rightarrow ^* {\omega _1 {L}_2 \omega _2}\) , where \(\omega _1\) is a sequence of terminals and does not contain an unmatched call symbol (meaning that if \(\omega _1\) contains a call symbol, then it also contains a matching return symbol), and \(\omega _2\) is a sequence of terminals or nonterminals. We define the initial state \(S_0\) as \(\lbrace (L,L) \rbrace\) , where L is the start nonterminal. It means that the input string should match L and the context is also L, since L can be derived from itself (in zero steps) without generating an unmatched call symbol.
For the example, we now have the following states:
\begin{align*} S_0 &= \lbrace (L,L) \rbrace , \\ S_1 &=\delta _c (S_0) = \lbrace (L,A),(L,B) \rbrace , \\ S_2 &=\delta _{{‹ {a}}} (S_1) = \lbrace (A,A),(B,B) \rbrace , \\ S_3 &=\delta _c (S_2) = \lbrace (A,D) \rbrace , \\ S_4 &=\delta _c (S_3) = \lbrace (A,L) \rbrace . \end{align*}
Notice that when transitioning from \(S_1\) to \(S_2\) , there is a context switch from L to A or B, as there is an unmatched call symbol \({‹ {a}}\) that is encountered using the rules \(A \rightarrow {‹ {a}} A {b› } L\) and \(A \rightarrow {‹ {a}} B {b› } L\) . For this transition, we will also need to use a stack to store the old context information; in particular, our PDA will push \(S_1\) and \({‹ {a}}\) to the stack, so when the return symbol \({b› }\) is encountered, we can use that stack information to look up the old context and transition the PDA to state \(\lbrace (L,L)\rbrace\) .
As shown in Figure 2(f), for the grammar G, we have the following state and stack when reading the symbol \({b› }\) :
\begin{equation*} S_4 = \lbrace (A,L) \rbrace , \qquad T_4 = [S_1, {‹ {a}}]\cdot \bot . \end{equation*}
To determine which rule to apply, we first find that the nonterminal L in the current state \(S_4\) derives the empty string \(\epsilon\) and has context A. Based on the context and the call symbol \({‹ {a}}\) stored at the top of \(T_4\) , we can deduce that a rule of the form \(\hat{L} \rightarrow {‹ {a}} A {b› } \hat{L}_2\) was used to derive \({‹ {a}}\) , where \(\hat{L}\) and \(\hat{L}_2\) are unknown nonterminals. To identify them, we use the state \(S_1\) at the top of \(T_4\) and consider possible candidate matching rules. These rules could include:
\begin{equation*} A \rightarrow {‹ {a}} A {b› } L, \qquad A \rightarrow {‹ {a}} B {b› } L. \end{equation*}
The above is enough for us to identify that \(A\rightarrow {‹ {a}} A {b› } L\) is the matching rule; note that in general there may be multiple possible matching rules.
For the example input string \(s=c{‹ {a}}cc {b› }\) , the final state is \(S_5=\lbrace (L,L) \rbrace\) . We finish the recognition by accepting the string s, since L, the next nonterminal to parse, can derive the empty string \(\epsilon\) .
Following the above process for all input symbols and all states, we can build the recognizer PDA for the VPG G shown in Figure 2(e); the detailed algorithm will be presented later in this section.
Given the above discussion, we define PDA states and stacks as follows:
Definition 5.1 (PDA States and Stacks).
Given a VPG \(G=(\Sigma , V, P, L_0)\) , we introduce a PDA whose states are subsets of \(V \times V\) and whose stack contains stack symbols of the form \([S, {‹ {a}}]\) , where S is a PDA state and \({‹ {a}}\in \Sigma _{\text{call}}\) is a call symbol. We write \(\bot\) for the empty stack and \([S,{‹ {a}}]\cdot T\) for a stack whose top is \([S,{‹ {a}}]\) and the rest is T. Intuitively, the stack remembers a series of past contexts, which are used for matching future return symbols. We call a pair \((S,T)\) a configuration, with S being the state and T being the stack.
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , we define three kinds of derivative functions: (1) \(\delta _c\) is for when the next input symbol is a plain symbol c; (2) \(\delta _{{‹ {a}}}\) for when the next input symbol is a call symbol \({‹ {a}}\) ; and (3) \(\delta _{ {b› }}\) for when the next input symbol is a return symbol \({b› }\) . Each function takes the current state S and the top stack symbol and returns a new state as well as an action on the stack (expressed as a lambda function). Note that \(\delta _c\) and \(\delta _{{‹ {a}}}\) do not need information from the stack; therefore, we omit the top stack symbol from their parameters.
We note that previous work on derivatives [6, 30, 34] uses a direct representation of a language (for instance, a grammar) to represent the derivative result. In this work, however, what we compute over derivatives is configurations, which include states and stacks. A later definition (Definition 5.3) shows that a configuration of a recognizer PDA represents a language. We also note that the derivative functions defined below take only necessary information from a configuration to make clear what information is needed for a particular derivative function.
Definition 5.2 (Derivative Functions).
(1)
\(\delta _c(S) = (S^{\prime },\lambda T.T)\) , where
\begin{equation*} S^{\prime }=\lbrace (L_1,L_3)\mid (L_1,L_2)\in S \wedge (L_2\rightarrow cL_3)\in P\rbrace . \end{equation*}
For a plain symbol \(c\in \Sigma _{\text{plain}}\) , it checks each pair \((L_1,L_2)\) in the current state S, and if there is a rule \(L_2\rightarrow c\ L_3\) , then the pair \((L_1,L_3)\) becomes part of the new state. In addition, the stack is left unchanged.
(2)
\(\delta _{{‹ {a}}}(S) = (S^{\prime },\lambda T.[S,{‹ {a}}]\cdot T)\) , where
\begin{equation*} S^{\prime }=\lbrace (L_3,L_3)\mid (L_1,L_2)\in S \wedge (L_2\rightarrow {‹ {a}}L_3 {b› }L_4)\in P\rbrace . \end{equation*}
For a call symbol \({‹ {a}}\in \Sigma _{\text{call}}\) , it checks each pair \((L_1,L_2)\) in the current state S; if there is a rule \(L_2 \rightarrow {‹ {a}}L_3 {b› } L_4\) , then the pair \((L_3,L_3)\) becomes part of the new state; note there is a context change, since a call symbol is encountered. In addition, the old state together with \({‹ {a}}\) is pushed to the stack.
(3)
\(\delta _{ {b› }}(S,[S_1,{‹ {a}}]) = (S^{\prime },\text{tail})\) , where \(\text{tail}\) is a function that returns the tail of the stack, and
\begin{equation*} S^{\prime }=\lbrace (L_1,L_5)\mid (L_1,L_2)\in S_1 \wedge (L_2\rightarrow {‹ {a}}L_3 {b› }L_5)\in P \wedge \exists L_4,\ (L_3,L_4)\in S \wedge (L_4\rightarrow \epsilon)\in P\rbrace . \end{equation*}
For a return symbol \({b› }\in \Sigma _{\text{ret}}\) and a stack top \([S_1,{‹ {a}}]\) , it checks each pair \((L_1,L_2)\) in the state \(S_1\) of the stack top symbol; if there is a pair \((L_3,L_4)\) in the current state S, \(L_4\) can derive the empty string, and there is a rule \(L_2\rightarrow {‹ {a}}L_3 {b› }\ L_5\) , then the pair \((L_1,L_5)\) becomes part of the new state; note that it checks \(L_4\rightarrow \epsilon\) to ensure that the current level is finished before returning to the upper level. In addition, the stack top is popped from the stack. Figure 3 presents a drawing that depicts the situation when a return symbol is encountered.
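For concreteness, Definition 5.2 has a direct functional reading; the Python sketch below is our own rendering (the verified version is the paper's Coq development). A state is a set of nonterminal pairs; PLAIN maps (L2, c) to the nonterminals L3 with L2 -> c L3, MATCH lists matching rules as tuples (L, a, L1, b, L2), and NULLABLE is the set of nonterminals deriving the empty string.

```python
def delta_plain(state, c, PLAIN):
    """(L1, L2) --c--> (L1, L3) for each rule L2 -> c L3; stack unchanged."""
    return frozenset((l1, l3) for (l1, l2) in state
                     for l3 in PLAIN.get((l2, c), ()))

def delta_call(state, a, MATCH):
    """(L1, L2) --<a--> (L3, L3) for each rule L2 -> <a L3 b> L4.
    The caller pushes [state, a] onto the stack (context switch)."""
    return frozenset((l3, l3) for (l1, l2) in state
                     for (l, a2, l3, b, l4) in MATCH
                     if l == l2 and a2 == a)

def delta_ret(state, top, b, MATCH, NULLABLE):
    """Pop top = (state1, a): each (L1, L2) in state1 yields (L1, L5)
    if L2 -> <a L3 b> L5 is a rule and some (L3, L4) in the current
    state has a nullable L4 (the inner level is finished)."""
    state1, a = top
    return frozenset((l1, l5) for (l1, l2) in state1
                     for (l, a2, l3, b2, l5) in MATCH
                     if l == l2 and a2 == a and b2 == b
                     and any(p == l3 and l4 in NULLABLE
                             for (p, l4) in state))
```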
We formalize the semantics of PDA configurations as sets of accepted strings:
Fig. 3. An example of state transitions: Assuming there is a rule \(L_2 \rightarrow {‹ {a}}L_3 {b› }L_5\) , \((L_1,L_2)\) transfers to \((L_3,L_3)\) with symbol \({‹ {a}}\) , and finally transfers to \((L_1,L_5)\) with symbol \({b› }\) .
Definition 5.3 (Semantics of PDA Configurations).
We write \((S, T)\leadsto w\) to mean that a terminal string w can be accepted by the configuration \((S, T)\) . It is defined as follows:
(1)
\((S, \bot)\leadsto w \mbox{ if } \exists (L_1,L_2)\in S,\ \text{s.t. }L_2\rightarrow ^{*}w\) ,
(2)
\((S, [S^{\prime },{‹ {a}}]\cdot T^{\prime })\leadsto w_1 {b› }w_2\) if \(\exists (L_3,L_4)\in S\) s.t.
(a)
\(L_4\rightarrow ^{*}w_1\) , and
(b)
\(\exists (L_1,L_2)\in S^{\prime },\exists L_5, (L_2\rightarrow {‹ {a}}L_3 {b› }L_5) \in P \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
The correctness of derivative functions is stated in the following theorem, whose correctness proof is detailed in Appendix A. Take the case of \(\delta _c\) as an example: The theorem states that \((S,T)\) matches \(cw\) iff the configuration after running \(\delta _c\) matches w (the string after consuming c).
Theorem 5.1 (Derivative Function Correctness).
Assume \(\delta _c(S)=(S^{\prime },\lambda T.T)\) for a plain symbol c. Then \((S,T)\leadsto cw\) iff \((S^{\prime },T) \leadsto w\) .
Assume \(\delta _{{‹ {a}}}(S)=(S^{\prime },\lambda T.[S,{‹ {a}}]\cdot T)\) for a call symbol \({‹ {a}}\) . Then \((S,T)\leadsto {‹ {a}}w\) iff \((S^{\prime }, [S,{‹ {a}}]\cdot T)\leadsto w\) .
Assume \(\delta _{ {b› }}(S, [S_1,{‹ {a}}])=(S^{\prime },\text{tail})\) for a return symbol \({b› }\) . Then \((S, [S_1,{‹ {a}}]\cdot T)\leadsto {b› }w\) iff \((S^{\prime },T)\leadsto w\) .
With those derivative functions, we can convert a VPG to a PDA, whose set of states is the least solution of the following equation; it makes sure that states are closed under derivatives:
\begin{align*} A = \lbrace \lbrace (L_0,L_0)\rbrace \rbrace &\cup \lbrace S^{\prime } \mid S\in A,\ i\in \Sigma _{\text{call}}\cup \Sigma _{\text{plain}},\ \delta _i(S)=(S^{\prime },f)\rbrace \\ &\cup \lbrace S^{\prime \prime } \mid S\in A,\ S^{\prime }\in A,\ {‹ {a}}\in \Sigma _{\text{call}},\ {b› }\in \Sigma _{\text{ret}},\ \delta _{ {b› }}(S, [S^{\prime },{‹ {a}}])=(S^{\prime \prime },f)\rbrace . \end{align*}
We note that the least solution to the previous equation may include unreachable states, since the last line of the equation considers all \((S, [S^{\prime }, {‹ {a}}])\) without regard for whether such a configuration is possible. This may make the resulting PDA contain more states and occupy more space for the PDA representation than necessary. However, unreachable states do not affect the linear-time parsing guarantee of VPG parsing, as during parsing those unreachable states are not traversed; further, during experiments, we did not experience space issues when representing PDA states and transitions.
Algorithm 1 is an iteration-based method to solve the equation for the least solution, where the returned \(S_0\) is the initial state, A is the set of all states, and \(\mathcal {T}\) is the set of edges between states. For an iteration, N is the set of states that the algorithm should perform derivatives on. Line 7 then performs derivatives using call and plain symbols, and line 10 performs derivatives using return symbols.
At the end of each iteration, the following invariants are maintained: (1) \(N \subseteq A\) ; (2) for state \(S \in A-N\) and \(i \in \Sigma _{\text{call}}\cup \Sigma _{\text{plain}}\) , if \(\delta _i(S)=(S^{\prime },f)\) , then \(S^{\prime } \in A\) ; (3) for states \(S, S^{\prime } \in A-N\) , \({‹ {a}}\in \Sigma _{\text{call}}\) , and \({b› }\in \Sigma _{\text{ret}}\) , if \(\delta _{ {b› }}(S, [S^{\prime },{‹ {a}}])=(S^{\prime \prime },f)\) , then \(S^{\prime \prime } \in A\) . With these invariants, when N becomes empty, A is closed under derivatives.
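A naive version of the closure computation can be sketched as follows (our own code, not Algorithm 1 verbatim): it keeps a worklist N of unprocessed states and, for simplicity, recomputes the return-symbol derivatives over all state pairs in each round, whereas Algorithm 1 restricts that work to pairs involving a fresh state.

```python
def build_pda(S0, plains, calls, rets, step, step_ret):
    """Least set of states closed under derivatives.

    step(S, i) computes delta_i(S) for a plain or call symbol i;
    step_ret(S, S1, a, b) computes delta_b(S, [S1, <a])."""
    A, N, edges = {S0}, {S0}, {}
    while N:
        new = set()
        for S in N:                              # fresh states only
            for i in plains | calls:
                S2 = step(S, i)
                edges[(S, i)] = S2
                new.add(S2)
        for S in A | new:                        # all (state, stack-top) pairs
            for S1 in A | new:
                for a in calls:
                    for b in rets:
                        S2 = step_ret(S, S1, a, b)
                        edges[(S, S1, a, b)] = S2
                        new.add(S2)
        N = new - A                              # keep genuinely new states
        A |= new
    return A, edges
```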
Once the PDA is constructed from a VPG, it can be run on an input string in a standard way. For completeness, we include the definition here so we can state the correctness theorem formally. Recall that a runtime configuration of a recognizer PDA is a pair \((S,T)\) , where S is a state and T is a stack, denoted as \(T=t_1\cdot t_2\cdots t_k\cdot \bot\) , where \(t_i=[S_i, {‹ {a}}_i]\) for \(i=1..k\) , \({‹ {a}}_i \in \Sigma _{\text{call}}\) is a call symbol, \(t_1\) the top of T, and \(\bot\) the empty stack. The PDA’s initial configuration is \((\lbrace (L_0,L_0)\rbrace ,\bot)\) and its acceptance configurations are defined as follows:
Definition 5.4 (PDA Acceptance Configurations).
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , a pair \((S,T)\) is an acceptance configuration if \(T=\bot\) , and \(\exists (L,L^{\prime })\in S\ \text{s.t. }(L^{\prime }\rightarrow \epsilon)\in P\) .
Definition 5.5 (Recognizer PDA Execution).
The runtime execution \(\mathcal {F}\) of a PDA \((S_0,A,\mathcal {T})\) is defined as follows, where \(S_0\) is the start state, A is the set of states, and \(\mathcal {T}\) is the set of configuration transitions:
\begin{equation*} \mathcal {F}:(i,S,T)\mapsto (S^{\prime },T^{\prime }), \end{equation*}
where
(1)
if \(i\in \Sigma _{\text{call}}\cup \Sigma _{\text{plain}}\) , then \((S,S^{\prime })\in \mathcal {T}\) and is marked with \((i,f)\) , and \(T^{\prime }=f(T)\) ;
(2)
if \(i\in \Sigma _{\text{ret}}\) and \(T=t\cdot T^{\prime }\) , then \((S,S^{\prime })\in \mathcal {T}\) and is marked with \((i,t,f)\) , and \(T^{\prime }=f(T)\) .
Given an input string \(w = w_1\ldots w_n\) , we say the PDA accepts w if there exists a sequence of configurations \((S_0, T_0), \ldots , (S_n, T_n)\) , so
\begin{equation*} \begin{array}{l} (S_0,T_0) =(\lbrace (L_0,L_0)\rbrace ,\bot), \\ (S_{i},T_{i}) =\mathcal {F}(w_i,S_{i-1},T_{i-1}), \mbox{for } i\in [1,n], \mbox{and} \\ (S_n,T_n)\ \mbox{is an acceptance configuration}. \end{array} \end{equation*}
Otherwise, w is rejected.
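Running the constructed PDA (Definition 5.5) is a single left-to-right pass over the input. The sketch below (ours) uses grammar-specialized versions of the derivative functions from the earlier snippet and checks acceptance per Definition 5.4.

```python
def run_pda(word, L0, calls, rets, d_plain, d_call, d_ret, NULLABLE):
    """Return True iff the recognizer PDA accepts `word`.

    d_plain(S, c), d_call(S, a), and d_ret(S, top, b) are the
    derivative functions, specialized to a fixed grammar."""
    state, stack = frozenset({(L0, L0)}), []          # initial configuration
    for i in word:
        if i in calls:
            stack.append((state, i))                  # push [S, <a]
            state = d_call(state, i)
        elif i in rets:
            if not stack:
                return False                          # no matching call symbol
            state = d_ret(state, stack.pop(), i)      # pop the stack top
        else:
            state = d_plain(state, i)
        if not state:
            return False                              # no derivative: reject early
    # Acceptance: empty stack and some (L, L') in the state with L' -> eps.
    return not stack and any(l2 in NULLABLE for (_, l2) in state)
```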
Then PDA correctness can be stated as follows. We detail the correctness proof in Appendix A.
Theorem 5.2 (PDA Correctness).
For VPG G and its start symbol \(L_0\) , a string \(w\in \Sigma ^*\) can be derived from \(L_0\) (i.e., \(L_0\rightarrow ^{*}w\) ) iff w is accepted by the corresponding PDA.
For converting general VPGs (i.e., with pending rules) to PDAs, a couple of changes need to be made to the derivative-based approach: (1) The derivative functions need to consider also pending rules; (2) the acceptance stack may be nonempty because of pending call symbols. The construction is discussed in Appendix B.
The time complexity of Algorithm 1. Algorithm 1 terminates when no new state can be produced. Since a PDA state contains pairs of nonterminals, the number of states produced at the end \(|A|\) is bounded by \(O(2^{|V|^2})\) . For each produced state, the derivative functions are called at line 6 and line 9. The number of times they are called across all iterations is no more than \(|A|\times |\Sigma _{\text{call}}\cup \Sigma _{\text{plain}}|+|A|\times |\Sigma _{\text{ret}}|\times |A| \times |\Sigma _{\text{call}}|\) , which is bounded by \(O(|A|^2|\Sigma |^2)\) . Among the three derivative functions, the one for return symbols has the largest time complexity. Recall that in the definition of \(\delta _{ {b› }}(S, [S_1, {‹ {a}}])\) , we consider each pair \((L_1,L_2)\) in \(S_1\) , find a matching rule “ \(L_2\rightarrow {‹ {a}} L_3 {b› }L_5\) ,” and check if \((L_3,L_4)\) is in S for some \(L_4\) that can derive the empty string. Therefore, the cost is bounded by \(|S_1|\times |P|\times |S|\) , which is further bounded by \(O(|V|^4|P|)\) . At lines 7 and 11, at most \(O(|A|^2)\) edges are added across all iterations. At lines 12 and 13, in one iteration collecting the states takes \(O(|N^{\prime }|+|N_R|)=O(|A|)\) , and computing the difference and union of the sets takes \(O(|N|+|A|)\) ; so the time complexity is \(O(|A|)\) . Therefore, the total time complexity is \(O(|A|^2|\Sigma |^2|V|^4|P|) + O(|A|^2)\) = \(O(2^{2|V|^2}|V|^4|\Sigma |^2|P|)\) .

6 VPG-based Parsing

Parsing is a process to build the parse trees of a given string. It is equivalent to finding the rule sequences that generate the string. To achieve this, our VPG-based parsing framework includes two steps: We first run a parser PDA on the string, which leaves a trace of PDA states that can be treated as a parse forest; we then execute extraction functions on the forest to extract all valid parse trees. We further propose an optimization, which optimizes the second step to an extraction PDA: Instead of extracting all parse trees, the extraction PDA extracts a single parse tree. In the following discussion, we first describe parse trees in the context of VPGs, then discuss the parser PDA in Section 6.1, the extraction functions in Section 6.2, and the optimization of extraction as an extraction PDA in Section 6.3. We then discuss the correctness of our parsing framework in Section 6.4 and its time complexity in Section 6.5. Both the correctness proof and the time-complexity proof have been mechanized in the proof assistant Coq; we provide more details about the Coq mechanization in Section 6.6.
Before proceeding, we note that the parsing approach in this section is different from the approach described in the conference version. The conference version [22] presented a parsing method that includes a parser PDA, a pruner PDA, and an extractor. The correctness of parsing was verified, but we were not able to formalize and verify its linear-time performance guarantee when extracting a parse tree; the structure of the PDA states of that method made it hard to reason about the performance (and also correctness). Our new method improves the parsing function and the parser PDA so we can build a verified extraction function and verify the performance: For an input string w that has k valid parse trees, we verify that the time complexity of parsing with the new approach is \(O(k|w|)\) . Note that when only one parse tree is needed, we will later describe an extraction PDA that extracts a single parse tree and prove that it takes linear time ( \(O(|w|)\) ).
Given a grammar with a rule set P, we define a notion of dotted rules. A dotted rule is a rule in P with an extra dot on the right-hand side. Intuitively, the dot tells the parsing position: To the right of the dot there is a nonterminal, which tells what should be parsed next.
Definition 6.1 (Dotted Rules \(\dot{P}\) )
Given a well-matched VPG \(G=(\Sigma ,V,P,L_0)\) , we define the set of dotted rules as
\begin{equation*} \dot{P} = \lbrace L\rightarrow c.L_1 \mid (L\rightarrow cL_1)\in P\rbrace \cup \lbrace L\rightarrow {‹ {a}}.L_1 {b› }L_2 \mid (L\rightarrow {‹ {a}}L_1 {b› }L_2)\in P\rbrace \cup \lbrace L\rightarrow {‹ {a}}L_1 {b› }.L_2 \mid (L\rightarrow {‹ {a}}L_1 {b› }L_2)\in P\rbrace . \end{equation*}
Dotted rules are similar to items in LR parsing [11], but dots in our dotted rules appear only after a terminal, while dots in items do not have this restriction.
Parse trees. Given a grammar \(G=(\Sigma ,V,P,L_0)\) and its dotted rules \(\dot{P}\) , we define a parse tree v as a rule sequence; i.e., \(v=[r_1, \ldots , r_n]\) , where \(r_i\in \dot{P},1\le i\le n\) ; when \(n=0\) , it is an empty parse tree. For two parse trees \(v_1=[r_1, \ldots , r_{n_1}]\) and \(v_2=[r_1^{\prime }, \ldots , r_{n_2}^{\prime }]\) , we use the notation \(v_1+v_2\) to represent their concatenation \([r_1, \ldots , r_{n_1},r_{1}^{\prime }, \ldots , r_{n_2}^{\prime }]\) . We use \({L} \Downarrow \left({w},{v}\right)\) for the relation that v can derive a string w starting from nonterminal \(L\in V\) ; its rules are in Figure 4. We call v a valid parse tree of w if \({L_0} \Downarrow \left({w},{v}\right)\) . Note that a rule in Figure 4 produces a sequence of dotted rules. Also, the last rule produces a sequence with \(L\rightarrow {‹ {a}}{L_1} {b› }L_2\) appearing twice, but with different positions for the dot: one after the call symbol \({‹ {a}}\) and one after the return symbol \({b› }\) ; this is because the call and the return symbol are parsed independently, resulting in the need to differentiate how \(L\rightarrow {‹ {a}}{L_1} {b› }L_2\) is used. During parsing, when we have used a dotted rule to parse a call symbol, but have not read the matched return symbol, we call the dotted rule unclosed.
Fig. 4. The big-step parse-tree derivation, assuming a well-matched VPG \(G=(\Sigma ,V,P,L_0)\) .
We introduce a set of definitions to extract parts of a dotted rule. We define \(\mathrm{headNT}\left(r\right)\) to be the head nonterminal of r and \(\mathrm{nextNT}\left(r\right)\) to be the next nonterminal after the dot in r:
\begin{align*} \mathrm{headNT}\left(L\rightarrow c.L_1\right) &= L, & \mathrm{nextNT}\left(L\rightarrow c.L_1\right) &= L_1, \\ \mathrm{headNT}\left(L\rightarrow {‹ {a}}.L_1 {b› }L_2\right) &= L, & \mathrm{nextNT}\left(L\rightarrow {‹ {a}}.L_1 {b› }L_2\right) &= L_1, \\ \mathrm{headNT}\left(L\rightarrow {‹ {a}}L_1 {b› }.L_2\right) &= L, & \mathrm{nextNT}\left(L\rightarrow {‹ {a}}L_1 {b› }.L_2\right) &= L_2. \end{align*}
We abuse the notation and for parse tree \(v=[r_1, \ldots , r_n]\) write \(\mathrm{headNT}\left(v\right)\) to be \(\mathrm{headNT}\left(r_1\right)\) and \(\mathrm{nextNT}\left(v\right)\) to be \(\mathrm{nextNT}\left(r_{n}\right)\) . We also define \(\operatorname{firstRule}(v)\) to be \(r_1\) and \(\operatorname{lastRule}(v)\) to be \(r_{n}\) .
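In code, a dotted rule can simply record a rule together with a dot position, and headNT/nextNT read off the relevant nonterminals (an illustrative encoding of ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DottedRule:
    head: str     # the head nonterminal L
    rhs: tuple    # right-hand-side symbols, e.g. ("<a", "L1", "b>", "L2")
    dot: int      # the dot sits right before rhs[dot]

def headNT(r: DottedRule) -> str:
    return r.head

def nextNT(r: DottedRule) -> str:
    # Dots appear only after a terminal, so rhs[dot] is a nonterminal.
    return r.rhs[r.dot]

# A parse tree is a list of dotted rules, e.g. for the string ccc:
#   [DottedRule("L", ("c", "A"), 1),
#    DottedRule("A", ("c", "D"), 1),
#    DottedRule("D", ("c", "L"), 1)]
```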

6.1 The Parser PDA

Recall that in the recognizer PDA, a transition between two states for an input symbol is computed based on possible rules that can derive the input symbol. As a result, a state-transition trace that accepts an input string essentially represents the rule sequences that can possibly derive the string. We illustrate this through the following examples, before presenting a formal algorithm for constructing a parser PDA for a VPG:
Example 6.1 (From Recognizer DFAs to Parser DFAs).
We begin with regular grammars. Recall the DFA in Figure 2(b). Instead of taking a state as a set of nonterminals, let us replace each state with a set of rules that can produce the last consumed symbol; we call the new automaton the parser DFA, as shown in Figure 5. The parser DFA has more states than the recognizer DFA. First, we introduce a pseudo-rule \(L\rightarrow .L\) to construct an initial state with that rule. In the pseudo-rule, we use a dot “.” to indicate the parsing position. Second, the recognizer state \(\lbrace D \rbrace\) is split into two states \(\lbrace B\rightarrow d.D \rbrace\) and \(\lbrace A\rightarrow c.D \rbrace\) to precisely capture the parsing rule used when transitioning from \(\lbrace L \rightarrow c.A,\ L \rightarrow c.B\rbrace\) .
Fig. 5. The parser DFA for the regular grammar in Figure 2(b). The pseudo-rule \(L\rightarrow .L\) appears only in the initial state, representing a rule in which only its next nonterminal (after the dot) matters.
Compared with the recognizer DFA, a trace of states in the parser DFA represents a parse tree. For example, consider the input string \(s=ccc\) . Its trace in the parser DFA is as follows:
\begin{equation*} \lbrace L\rightarrow .L\rbrace \longrightarrow \left\lbrace \begin{aligned}L\rightarrow c.A,\\ L\rightarrow c.B \end{aligned}\right\rbrace \longrightarrow \lbrace A \rightarrow c.D \rbrace \longrightarrow \lbrace D \rightarrow c.L \rbrace . \end{equation*}
This trace essentially represents the following parse tree \(v_s\) of the string \(ccc\) :
\begin{equation*} v_s = [L\rightarrow c.A; A \rightarrow c.D; D \rightarrow c.L]. \end{equation*}
In fact, we can extract a parse tree from the trace as follows: We start from the last state and take a rule r whose nonterminal next to the dot derives the empty string \(\epsilon\) and define the initial parse tree \(v_0\) as \([r]\) . In this example, we have the initial parse tree being
\begin{equation*} v_0=[D\rightarrow c.L]. \end{equation*}
Then, we expand the parse tree by processing states in the trace backward. In particular, for each state S, we add a rule \(r \in S\) that is connected to the head rule of the current parse tree v, i.e.,
\begin{equation*} \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right), \text{where } r\in S, \end{equation*}
and prepend r to the current parse tree v. For example, for the first step, we add \(A \rightarrow c.D\) to \(v_0\) to have \(v_1\) as follows; similarly, \(v_2\) is built by adding \(L \rightarrow c.A\) to \(v_1\) :
\begin{align*} v_1 &= [A \rightarrow c.D; D \rightarrow c.L], \\ v_2 &= [L\rightarrow c.A; A \rightarrow c.D; D \rightarrow c.L]. \end{align*}
Note that \(v_2\) is the final parse tree we desire. In general, there may be multiple parse trees for an input string, and we need to maintain a set of parse trees during the extraction process.
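The backward extraction just described is a fold over the reversed trace. The sketch below (ours, reusing headNT/nextNT from the dotted-rule sketch above) covers the regular-grammar case of this example and maintains a set of partial parse trees, as needed for ambiguous grammars; the VPG version of Example 6.2 additionally threads the context stack.

```python
def extract(trace, NULLABLE):
    """Backward extraction of all parse trees from a parser-DFA trace.

    `trace` is the list of states [m1, ..., mn], each a set of dotted
    rules; returns every rule sequence consistent with the trace."""
    # Seed: rules in the last state whose next nonterminal derives eps.
    trees = [[r] for r in trace[-1] if nextNT(r) in NULLABLE]
    for m in reversed(trace[:-1]):
        # Prepend each rule r in m that connects to a tree's head rule.
        trees = [[r] + v for v in trees for r in m
                 if nextNT(r) == headNT(v[0])]
    return trees
```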
The above process is essentially the same as how a parser PDA is constructed for a VPG and how a parse tree is extracted from a trace of the parser PDA. We discuss an example next.
Example 6.2 (From Recognizer PDAs to Parser PDAs).
Recall the recognizer PDA shown in Figure 2(e); we first discuss how to extend it to the parser PDA shown in Figure 6 and then illustrate it using a concrete example.
Fig. 6. The parser PDA for the VPG shown in Figure 2(c). Each state is marked with its index; e.g., the initial state is State 0. The stack operation “Push i” means pushing State i to the top of the stack, and “Pop i” means removing the top of the stack when the top is State i.
Similar to the case of regular grammars, for each pair \((L_1,L_2)\) in a recognizer PDA state, nonterminal \(L_2\) is replaced by the rules that can produce the last consumed symbol. The context \(L_1\) , however, is replaced with the most recent unclosed dotted rule; when there is no such rule, we use a special context called \(\text{None}\) . The stack T could still store the parser PDA states paired with call symbols; however, since the call symbols have already been recorded in the rules in the states, the stack needs to store only the PDA states.
Now let us reexamine the VPG shown in Figure 2(c) and the input string \(s=c{‹ {a}}cc {b› }\) . After running the parser PDA in Figure 6 on the string s, we have the following trace, where the initial state is omitted:
\begin{align*} m_1 &= \lbrace (\text{None},\ L\rightarrow c.A),\ (\text{None},\ L\rightarrow c.B)\rbrace , \\ m_2 &= \lbrace (A\rightarrow {‹ {a}}.A {b› }L,\ A\rightarrow {‹ {a}}.A {b› }L),\ (A\rightarrow {‹ {a}}.B {b› }L,\ A\rightarrow {‹ {a}}.B {b› }L)\rbrace , \\ m_3 &= \lbrace (A\rightarrow {‹ {a}}.A {b› }L,\ A\rightarrow c.D)\rbrace , \\ m_4 &= \lbrace (A\rightarrow {‹ {a}}.A {b› }L,\ D\rightarrow c.L)\rbrace , \\ m_5 &= \lbrace (\text{None},\ A\rightarrow {‹ {a}}A {b› }.L)\rbrace . \end{align*}
Similar to the trace in Example 6.1, the preceding trace includes the following parse tree of the string s:
\begin{equation*} [L\rightarrow c.A;\ A\rightarrow {‹ {a}}.A {b› }L;\ A \rightarrow c.D;\ D \rightarrow c.L;\ A\rightarrow {‹ {a}}A {b› }.L]. \end{equation*}
Let us extract the above parse tree from the trace in a way similar to the case for regular grammars. Starting from the last state, we pick a rule r that has the context \(\text{None}\) and can derive the empty string \(\epsilon\) from the next nonterminal to parse. We initialize the parse tree as \(v_0=[r]\) . In this example, we have
\begin{equation*} v_0=[A\rightarrow {‹ {a}}A {b› }.L]. \end{equation*}
Now, we trace back to the state \(\lbrace (A\rightarrow {‹ {a}}.A {b› }L,D\rightarrow c.L)\rbrace\) . Since the last symbol is a return symbol, we begin by identifying rules with the next nonterminal that can produce the empty string. In this case, rule \(D\rightarrow c.L\) is a candidate. In addition, recall that the context rule is the most recent unclosed dotted rule; so the following return symbol should be derived from the context rule. The context rule is \(A\rightarrow {‹ {a}}.A {b› }L\) , which corresponds to the current parse tree’s head rule \(A\rightarrow {‹ {a}}A {b› }.L\) . Therefore, we can use the rule \(D\rightarrow c.L\) to extend our parse tree.
Then, we continue to the next state in the trace: \(\lbrace (A\rightarrow {‹ {a}}.A {b› }L,A\rightarrow c.D) \rbrace\) . We first observe that the rule \(A\rightarrow c.D\) is connected to the head rule \(D\rightarrow c.L\) . In addition, since the symbol to match next is a plain symbol, the context rule should not change and should be \(A\rightarrow {‹ {a}}.A {b› }L\) . To determine the context of the head rule efficiently, we introduce a stack T to store the current context, as shown below.
\begin{equation*} v = [A\rightarrow c.D;\ D\rightarrow c.L;\ A\rightarrow {‹ {a}}A {b› }.L], \qquad T_0 = [A\rightarrow {‹ {a}}A {b› }.L]\cdot \bot . \end{equation*}
Note that the stack \(T_0\) could alternatively store the rule \(A\rightarrow {‹ {a}}.A {b› }L\) ; the difference is just about the location of the dot. With the stack, we know that the context is the same and prepend rule \(A\rightarrow c.D\) to the parse tree.
We continue to the next state, \(\lbrace (A\rightarrow {‹ {a}}.A {b› }L,\ A\rightarrow {‹ {a}}.A {b› }L),\ (A\rightarrow {‹ {a}}.B {b› }L,\ A\rightarrow {‹ {a}}.B {b› }L)\rbrace\) .
To extract a matching rule for the call symbol is fairly simple at this point: We just pick the rule that corresponds to the top of the stack T; in this case, it should be \(A\rightarrow {‹ {a}}.A {b› }L\) . Our final parse tree is as follows:
\begin{equation*} [L\rightarrow c.A;\ A\rightarrow {‹ {a}}.A {b› }L;\ A \rightarrow c.D;\ D \rightarrow c.L;\ A\rightarrow {‹ {a}}A {b› }.L]. \end{equation*}
Again, in general, there may be multiple parse trees for the input string, and we need to maintain a set of parse trees during the extraction.
Now, we formally define parser PDA states and stacks and the derivative functions for well-matched VPGs as follows (the definitions for general VPGs can be found in Appendix C):
Definition 6.2 ( \(\mathcal {P}_{\text{pln}},\mathcal {P}_{\text{call}},\mathcal {P}_{\text{ret}}\) )
Given a VPG \(G=(\Sigma ,V,P,L_0)\) , based on the dotted rules in \(\dot{P}\) ,
(1)
we define a set of plain rules \(\mathcal {P}_{\text{pln}}\) as \(\lbrace L\rightarrow c.L_1 \in \dot{P}\,\mid \,c \in \Sigma _{\text{plain}}\rbrace\) ;
(2)
we define a set of call rules \(\mathcal {P}_{\text{call}}\) as \(\lbrace L\rightarrow {‹ {a}}.L_1 \in \dot{P} \rbrace \cup \lbrace L \rightarrow {‹ {a}}. L_1 {b› }L_2 \in \dot{P} \rbrace\) ;
(3)
we define a set of return rules \(\mathcal {P}_{\text{ret}}\) as \(\lbrace L\rightarrow {b› }.L_1 \in \dot{P} \rbrace \cup \lbrace L \rightarrow {‹ {a}}L_1 {b› }. L_2 \in \dot{P} \rbrace\) .
Definition 6.3 (Parser PDA States and Stacks).
Given a VPG, we introduce a PDA whose states are subsets of \((\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace) \times (\dot{P} \cup \lbrace L_0\rightarrow .L_0\rbrace)\) and whose stack contains PDA states. Intuitively, the stack remembers a series of past contexts, which are used for matching future return symbols. We call a pair \((m,T)\) a configuration, with m being the state and T being the stack. A rule pair \((r^{\prime },r)\) in state m tells that the current rule is r and the context rule is \(r^{\prime }\) . The context rule \(r^{\prime }\) is the rule that generates the last unmatched call symbol; when all call symbols are matched with return symbols, \(r^{\prime }\) is \(\text{None}\) . Initially, since the parser PDA has not read any symbol, there is no current rule; so we define the initial PDA state as \(m_0=\lbrace (\text{None}, L_0\rightarrow .L_0)\rbrace\) , where \(L_0\rightarrow .L_0\notin \dot{P}\) is a special rule that appears only in \(m_0\) .
The initial stack \(T_0\) is empty. When reading a call symbol, the parser PDA pushes the current state to its stack; when reading a return symbol, it removes the top of the stack. For a PDA stack T, the function \(\operatorname{head }(T)\) returns the top of the stack and is defined as
\begin{equation*} \operatorname{head }(T)= {\left\lbrace \begin{array}{ll} \text{None},&#x0026;\text{if }T=\bot ;\\ t,&#x0026;\text{if }\exists t\ T^{\prime },\ T=t\cdot T^{\prime }.\\ \end{array}\right.} \end{equation*}
For parser PDA construction, we define three kinds of derivative functions \(p_c\) , \(p_{{&#x2039; {a}}}\) , and \(p_{ {b&#x203A; }}\) , similar to the case of recognizer construction.
Definition 6.4 (The Derivative Function p for the Parser PDA)
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , suppose the current state of the parser PDA is m and the current stack is T; the transition functions \(p_c\) , \(p_{{‹ {a}}}\) , and \(p_{ {b› }}\) are defined as follows:
(1)
\(p_c(m)=(m^{\prime },\lambda T.T)\) , where
\begin{equation*} m^{\prime }=\lbrace (r^{\prime },\mathrm{nextNT}\left(r\right)\rightarrow c. L_1) \mid (r^{\prime },r)\in m \wedge (\mathrm{nextNT}\left(r\right)\rightarrow c. L_1)\in \dot{P}\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state keeps the context rule \(r^{\prime }\) and updates the current rule to a rule with head \(\mathrm{nextNT}\left(r\right)\) and that derives c.
(2)
\(p_{‹a}(m)=(m^{\prime },\lambda T.\ m\cdot T)\), where
\begin{equation*} m^{\prime }=\lbrace (r_1,r_1) \mid (r^{\prime },r)\in m \wedge r_1\in \mathcal {P}_{\text{call}} \wedge r_1 \text{ has head } \mathrm{nextNT}\left(r\right)\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state changes the context rule and the current rule to a rule with head \(\mathrm{nextNT}\left(r\right)\) that derives ‹a. We further push the current state m to the stack.
(3)
\(p_{b›}(m,\overline{m_{\text{call}}})=(m^{\prime },\text{tail})\), where \(\text{tail}\) is a function that returns the tail of the stack, \(\overline{m_{\text{call}}}=\operatorname{head }(T)\), and
\begin{equation*} m^{\prime }=\lbrace (r^{\prime }, L\rightarrow ‹a\, L_1\, b›. L_2) \mid (r_1,r_2)\in m \wedge r_1=(L\rightarrow ‹a. L_1\, b›\, L_2) \wedge (\mathrm{nextNT}\left(r_2\right)\rightarrow \epsilon)\in P \wedge (r^{\prime },r)\in \overline{m_{\text{call}}} \wedge \mathrm{nextNT}\left(r\right)=L\rbrace . \end{equation*}
We remove the top \(\overline{m_{\text{call}}}\) of the stack T and, for each pair \((r_1,r_2)\) in m where the context rule is \(r_1=L\rightarrow ‹a. L_1\, b›\, L_2\), we match b› with ‹a and update the current rule to \(L\rightarrow ‹a\, L_1\, b›. L_2\). We also require that \((\mathrm{nextNT}\left(r_2\right)\rightarrow \epsilon)\in P\). Note that we must have \(\overline{m_{\text{call}}}\ne \emptyset\), since there is an unmatched call symbol generated by \(r_1\). For the new state pair’s context rule, we select a pair \((r^{\prime },r)\) in \(\overline{m_{\text{call}}}\), where \(\mathrm{nextNT}\left(r\right)=L\), and restore the context to be \(r^{\prime }\).
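As an illustration of the derivative functions, below is an OCaml sketch of the plain-symbol case \(p_c\); the rule representation and all names are our own simplifications, and the call and return cases would follow the same pattern with the stack transformers shown above:

```ocaml
(* A sketch of the plain-symbol derivative p_c (Definition 6.4(1)),
   under a simplified rule representation of our own choosing. *)
type nt = string
type plain_rule = { lhs : nt; c : string; next : nt }  (* L -> c . L1 *)
type rule = Pln of plain_rule                          (* other rule kinds elided *)
type state = (rule option * rule) list                 (* (context, current) pairs *)

(* nextNT of a plain dotted rule L -> c . L1 is L1. *)
let next_nt (r : rule) : nt = match r with Pln p -> p.next

(* p_c keeps every context rule and advances the current rule to any
   plain rule whose head is nextNT(r) and that derives c; the stack
   transformer is the identity and is not shown. *)
let p_c (p_pln : plain_rule list) (c : string) (m : state) : state =
  List.concat_map
    (fun (ctx, r) ->
      p_pln
      |> List.filter (fun p -> p.lhs = next_nt r && p.c = c)
      |> List.map (fun p -> (ctx, Pln p)))
    m
```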
With the derivative functions defined, Algorithm 2 constructs the parser PDA by computing the least solution of the following equation; it makes sure that states are closed under derivatives:
\begin{equation*} \begin{array}{lll} A & = & A \cup ~ \lbrace m^{\prime }\,\mid \,m \in A,\ p_c(m)=(m^{\prime },f) \rbrace \\ & & \cup ~ \lbrace m^{\prime }\,\mid \,m \in A,\ p_{‹a}(m)=(m^{\prime },f) \rbrace \\ & & \cup ~ \lbrace m^{\prime }\,\mid \,m \in A,\ \overline{m_{\text{call}}}\in A,\ p_{b›}(m,\overline{m_{\text{call}}})=(m^{\prime },f) \rbrace , \end{array} \end{equation*}
where A is the set of parser PDA states, initialized to \(\lbrace m_0\rbrace\), and f denotes the associated stack action.
Definition 6.5 (Traces of Running the Parser PDA).
Given a string w, we write \(w_i\) for its ith symbol. Suppose the sequence of transitions in a parser PDA for input w is \((m_0,\bot) \stackrel{w_1}{\longrightarrow } (m_1, T_1) \cdots \stackrel{w_{|w|}}{\longrightarrow } (m_{|w|}, T_{|w|})\) . We define the trace of running the parser PDA for w, denoted as \(\operatorname{forest}(w)\) , as the list \([m_1, m_2, \ldots , m_{|w|}]\) .
Since each \(m_i\) in \(\operatorname{forest}(w)\) is a set of possible rules at that step of parsing, \(\operatorname{forest}(w)\) can be viewed as the representation of all parse trees of input w; that is, \(\operatorname{forest}(w)\) is the parse forest of w.
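The trace can be computed with a single left fold over the input, threading the configuration and recording each state; the following OCaml sketch assumes a step function that dispatches on the symbol kind, and all names are illustrative:

```ocaml
(* A sketch of computing forest(w) (Definition 6.5): fold over the input,
   threading the configuration and recording each intermediate state.
   The step function (one of p_c, p_call, p_ret, chosen by the kind of
   the symbol) is passed in as a parameter. *)
let run_pda (step : 'sym -> 'state -> 'state list -> 'state * 'state list)
    (m0 : 'state) (w : 'sym list) : 'state list =
  let _, _, rev_forest =
    List.fold_left
      (fun (m, stack, acc) s ->
        let m', stack' = step s m stack in
        (m', stack', m' :: acc))
      (m0, [], []) w
  in
  List.rev rev_forest  (* [m1; m2; ...; m|w|] *)
```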
Finally, we define the acceptance configuration for the parser PDA as follows:
Definition 6.6 (Parser PDA Acceptance Configurations).
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , a pair \((m,E)\) is an acceptance configuration for the parser PDA if the state m includes a pair \((\text{None},r)\) , where \(\mathrm{nextNT}\left(r\right)\) derives the empty string \(\epsilon\) , i.e., \(\exists r,\ (\text{None},r)\in m\ \wedge \ (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P\) .
Note that the context \(\text{None}\) indicates that the stack E is empty. For general VPGs, however, the contexts in the state of an acceptance configuration could also be pending rules, indicating that the stack E is not empty, as mentioned in Definition C.2.

6.2 Extracting Parse Trees

In this section, we discuss how to extract parse trees for well-matched VPGs and put the discussion for general VPGs in Appendix C. First, we explain our design choice to accomplish efficient extraction, and then we define the extraction functions formally.
After running the parser PDA on input w, we have \(\operatorname{forest}(w)=[m_1, m_2, \ldots , m_{|w|}]\) . Intuitively, for each parse tree \(v=[r_1,\ldots ,r_{|w|}]\) of w, we have \(r_i\in m_i\) , \(i=1..|w|\) . Our next step is to extract the parse trees from \(\operatorname{forest}(w)\) . In Example 6.2, we showed the backward way for extraction, where we first extract valid rules from \(m_{|w|}\) then extend them backward with the rules from \(m_{|w|-1}\) to \(m_1\) . There is also a forward way, where we first extract valid rules from \(m_1\) and then extend them with the rules extracted from \(m_2\) to \(m_{|w|}\) . Although both ways would give the same set of parse trees, we choose the backward way for better performance. The forward extraction may maintain invalid parse trees during extraction; it may in the worst case maintain an exponential number of invalid parse trees and remove them in later steps, which is inefficient. The backward way, however, maintains only valid parse trees. For an unambiguous grammar in particular, the backward way always maintains one parse tree and guarantees linear-time extraction. We give a concrete example as follows:
Example 6.3.
Consider the unambiguous grammar
\begin{equation*} L \rightarrow ‹a\, L\, b›\, E \mid ‹a\, L\, d›\, E \mid \epsilon , \qquad E \rightarrow \epsilon , \end{equation*}
and the input string \(w={‹a}^n {b›}^n\). The parse forest of w is “\(m_1^n + m_2^{n-1} + [m_3]\),” where the notation \(m_1^n\) stands for the list that contains n copies of \(m_1\), and
\begin{align*} m_1 &= \lbrace (L \rightarrow ‹a. L\, b›\, E,\ L \rightarrow ‹a. L\, b›\, E),\ (L \rightarrow ‹a. L\, d›\, E,\ L \rightarrow ‹a. L\, d›\, E)\rbrace ;\\ m_2 &= \lbrace (L \rightarrow ‹a. L\, b›\, E,\ L \rightarrow ‹a\, L\, b›. E),\ (L \rightarrow ‹a. L\, d›\, E,\ L \rightarrow ‹a\, L\, b›. E)\rbrace ;\\ m_3 &= \lbrace (\text{None},\ L \rightarrow ‹a\, L\, b›. E)\rbrace . \end{align*}
With the forward extraction, we first extract partial parse trees from \(m_1\). There are two possible partial parse trees \([L \rightarrow ‹a. L\,b›\,E]\) and \([L \rightarrow ‹a. L\,d›\,E]\) (their context rules can be ignored for now); we have to extract both of them, although the second one will become invalid. We then extend each partial parse tree with rules from the next \(m_1\); e.g., a new parse tree is \([L \rightarrow ‹a. L\,b›\,E,\ L \rightarrow ‹a. L\,b›\,E]\). This process is repeated, and we get \(2^n\) partial parse trees for \({‹a}^n\), since there are \(2^n\) ways to parse \({‹a}^n\). However, only one of them is valid considering the following input \({b›}^n\); those invalid ones will be removed when considering \(m_2^{n-1} + [m_3]\). The time complexity of this process is clearly non-linear.
In contrast, with the backward extraction, we first extract the only partial parse tree \([L \rightarrow ‹a\,L\,b›. E]\) from \(m_3\), then extend it with the rule \(L \rightarrow ‹a\,L\,b›. E\) from \(m_2\). We also introduce a stack for each partial parse tree to store the rules of unmatched return symbols, so later we know how to extend the partial parse tree with matching rules. After extracting rules from \({m_2}^{n-1} + [m_3]\), we have only one partial parse tree, whose stack is \(\left(L \rightarrow ‹a\,L\,b›. E \right)^n\cdot \bot\). Based on this stack and \(m_1\), only rule \(L\rightarrow ‹a. L\,b›\,E\) can be used for valid backward extension. In this way, only one valid parse tree is constructed in the end. We will prove that the backward extraction always generates valid parse trees in Lemma D.16.
Now, we define a partial parse tree as a parse tree paired with a stack as follows:
Definition 6.7 (Partial Parse Trees).
A partial parse tree \((v,E)\) is a parse tree \(v=[r_1,\ldots ,r_n]\) paired with a stack \(E=r_{n_1}\cdots r_{n_k}\cdot \bot\) , where \({n_1}\lt \cdots \lt {n_k}\) and \(r_{n_1},\ldots ,r_{n_k}\) are the rules that generate unmatched return symbols.
Formally, we use function \({\operatorname{extract}_\mathrm{init}}\) to extract a set of partial parse trees V from \(m_{|w|}\) of \(\operatorname{forest}(w)\) and function \({\operatorname{extract}_{\mathrm{oneStep}}}\) to extend the current set V of partial parse trees.
\begin{align*} V_{|w|} &\leftarrow \operatorname{extract}_\mathrm{init}(m_{|w|});\\ V_i &\leftarrow \operatorname{extract}_{\mathrm{oneStep}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1, \end{align*}
where \(V_1\) is the final parsing result that includes all valid parse trees of w.
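Operationally, this recurrence is a right-to-left fold over the forest; the following OCaml sketch (with init and one_step passed in as parameters, names ours) makes that explicit:

```ocaml
(* A sketch of the backward pass over a nonempty forest [m1; ...; mn]:
   V_n comes from extract_init on the last state, and extract_oneStep
   is then applied right to left. *)
let extract (init : 'm -> 'v) (one_step : 'm -> 'v -> 'v)
    (forest : 'm list) : 'v =
  match List.rev forest with
  | [] -> invalid_arg "extract: empty forest"
  | m_last :: earlier ->  (* earlier = [m(n-1); ...; m1] *)
      List.fold_left (fun v m -> one_step m v) (init m_last) earlier
```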
Next, we define different kinds of states and define \({\operatorname{extract}_\mathrm{init}}\) and \({\operatorname{extract}_{\mathrm{oneStep}}}\) .
Definition 6.8 ( \(\mathcal {M}_{\text{pln}},\mathcal {M}_{\text{call}},\mathcal {M}_{\text{ret}}\) )
Given a VPG \(G=(\Sigma ,V,P,L_0)\) , based on the dotted rules in \(\dot{P}\) ,
(1)
we define a set of plain states \(\mathcal {M}_{\text{pln}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{pln}}\) ;
(2)
we define a set of call states \(\mathcal {M}_{\text{call}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{call}}\) ;
(3)
we define a set of return states \(\mathcal {M}_{\text{ret}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{ret}}\) .
Definition 6.9 (The Extraction Function \(\operatorname{extract}_\mathrm{init}\) )
We first define \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) , which extracts all valid rules from m, then extend \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) to \(\operatorname{extract}_\mathrm{init}\) , which pairs each extracted rule with a stack.
\begin{equation*} \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m)=\lbrace r \mid (r^{\prime },r)\in m \wedge (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P \wedge r^{\prime } \text{ is None} \rbrace . \end{equation*}
A pair \((r^{\prime },r)\) in m corresponds to a parse tree that ends with r and whose last unmatched call symbol is generated by \(r^{\prime }\) . This parse tree is valid if \(\mathrm{nextNT}\left(r\right)\) can match the empty string and \(r^{\prime }\) is None. The context \(r^{\prime }\) cannot be a matching rule, since a return symbol would be further needed.
\begin{equation*} \operatorname{extract}_\mathrm{init}(m)= {\left\lbrace \begin{array}{ll} \left\lbrace ([r], \bot) \mid r\in \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m) \right\rbrace , &\text{if }m\subseteq \mathcal {M}_{\text{pln}}\mbox{ or } m \subseteq \mathcal {M}_{\text{call}};\\ \left\lbrace ([r], r\cdot \bot) \mid r\in \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m) \right\rbrace , &\text{if }m\subseteq \mathcal {M}_{\text{ret}}.\\ \end{array}\right.} \end{equation*}
If \(m\subseteq \mathcal {M}_{\text{ret}}\), then we push r to the stack so that, in a later step during backward extraction, we can match the return symbol in this step using a matching rule. Note that from the definition of the derivative functions (Definition 6.4), we can easily show that any state in the parser PDA is a subset of \(\mathcal {M}_{\text{pln}}\), \(\mathcal {M}_{\text{call}}\), or \(\mathcal {M}_{\text{ret}}\); it cannot be the case that a parser PDA state has mixed pairs (e.g., some from \(\mathcal {M}_{\text{call}}\) and others from \(\mathcal {M}_{\text{ret}}\)).
Definition 6.10 (The One-step Extraction Function).
We first define \({\operatorname{extendable}}\) , which decides whether \((r^{\prime },r)\) can be used to extend v. Based on \(\operatorname{extendable}\) , we define \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}\) , which extracts rules from m and uses them to extend the partial parse trees in V. Finally, we define \(\operatorname{extract}_{\mathrm{oneStep}}\) , which modifies the stacks of the extended trees:
\begin{align*} \operatorname{extendable}((r^{\prime },r),{(v,E)})= {\left\lbrace \begin{array}{ll} \text{true}, & \text{if } (r^{\prime } \text{ is None}\ \wedge \ \operatorname{head }(E) \text{ is None}\ \wedge \\ & \quad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right)) \ \vee \\ & (r^{\prime } \text{ and } \operatorname{head }(E)\text{ are the same matching rule} \\ &\quad \text{(but with different dot positions)}\ \wedge \\ &\qquad ((\operatorname{firstRule}(v) \in \mathcal {P}_{\text{ret}}\ \wedge \ (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P) \ \vee \\ &\qquad \ \ (\operatorname{firstRule}(v) \in \mathcal {P}_{\text{pln}}\cup \mathcal {P}_{\text{call}}\ \wedge \ \\ &\qquad \qquad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right)))),\\ \text{false}, & \text{otherwise}. \\ \end{array}\right.} \end{align*}
\begin{align*} \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}(m,V)= &\left\lbrace (r::v,E) \mid (v,E)\in V \ \wedge \ (r^{\prime },r)\in m \ \wedge \ \operatorname{extendable}((r^{\prime },r),{(v,E)})\right\rbrace . \end{align*}
Intuitively, each valid partial parse tree v can be split into \(v_1\) and \(v_2\) so that \(v=v_1+v_2\), \((r^{\prime },\operatorname{lastRule}(v_1))\) is included in m for some \(r^{\prime }\), and \((v_2,E)\) is included in V for some E. Therefore, if \(r^{\prime }\) is None, then E is empty; if \(r^{\prime }\) is a matching rule, then the top of E should be the same matching rule. In the first case, we require that r is connected to v, i.e., \(\mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right)\). In the second case, if \(\operatorname{firstRule}(v)\) is a rule that derives a return symbol, then we require that \(\mathrm{nextNT}\left(r\right)\) can match the empty string; otherwise, we still require that r is connected to v.
\begin{equation*} \operatorname{extract}_{\mathrm{oneStep}}(m,V)= {\left\lbrace \begin{array}{ll} \lbrace (v, E) \mid (v,E)\in \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}(m,V) \rbrace , &\text{if }m\subseteq \mathcal {M}_{\text{pln}},\\ \lbrace (v, \operatorname{tail }(E)) \mid (v,E)\in \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}(m,V) \rbrace , &\text{if }m\subseteq \mathcal {M}_{\text{call}},\\ \lbrace (v, \operatorname{firstRule}(v)::E) \mid (v,E)\in \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}(m,V) \rbrace , &\text{if }m\subseteq \mathcal {M}_{\text{ret}}.\\ \end{array}\right.} \end{equation*}
\(\operatorname{extract}_{\mathrm{oneStep}}\) modifies E based on the type of m. If a rule in m derives a call symbol, then the call symbol matches the return symbol (if it exists) specified on the top of the stack and, therefore, we remove the top of the stack. If a rule in m derives a return symbol, then we push the rule for the return symbol to the stack so it can match a future call symbol during backward extraction.
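The three cases of the stack update can be captured by a small helper; in this OCaml sketch, the classification of m as plain, call, or return is assumed to be given, and all names are ours:

```ocaml
(* A sketch of the stack adjustment in extract_oneStep (Definition 6.10).
   A partial parse tree is a rule list v (first rule at the head) paired
   with a stack e of return rules. *)
type kind = MPln | MCall | MRet

let adjust_stack (kind : kind) ((v, e) : 'r list * 'r list) : 'r list * 'r list =
  match kind with
  | MPln -> (v, e)               (* plain: stack untouched *)
  | MCall -> (v, List.tl e)      (* call: pop the matched return rule *)
  | MRet -> (v, List.hd v :: e)  (* return: push firstRule(v) for a future match *)
```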
Now, we define the \(\operatorname{extract}\) function that extracts the set of parse trees of w from a forest.
Definition 6.11 (The Extraction Function).
Given a nonempty \(\operatorname{forest}(w)=[m_1,\ldots , m_{|w|}]\) , define the sequence of partial parse trees as
\begin{align*} V_{|w|} &= \operatorname{extract}_\mathrm{init}(m_{|w|});\\ V_i &= \operatorname{extract}_{\mathrm{oneStep}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
Then, we define \(\operatorname{extract}(\operatorname{forest}(w)) = V_1\) as the set of parse trees extracted from \(\operatorname{forest}(w)\) .

6.3 Optimizing the Extraction as a PDA

For an unambiguous grammar, there is only one parse tree for any accepted input string. Even for an ambiguous grammar, there may be a preferred parse tree among the possible parse trees for an input string. In this section, we optimize the parse-tree extraction process into an extraction PDA; instead of running the extraction functions, we can run the extraction PDA on the parse forest and get a trace that is a valid parse tree. Running a PDA is much faster, and our evaluation in Section 8 is based on the extraction PDA.
In Lemma D.17, Appendix D, we will show that those extraction functions always extract valid parse trees. Therefore, to extract a single parse tree, we need to pick one partial parse tree \((v,E)\) from \(\operatorname{extract}_\mathrm{init}(m_{|w|})\) according to some criterion; similarly, if \(\operatorname{extract}_{\mathrm{oneStep}}(m,V)\) returns multiple parse trees, then one needs to be picked. To see why this process can be represented as a PDA, note that the extraction function depends on only the first rule of the current partial parse tree v and the top of E; so we can view \((\operatorname{firstRule}(v),E)\) as a PDA configuration. We define the PDA states and stacks as follows:
Definition 6.12 (Extraction PDA States and Stacks).
Given a VPG, we introduce an extraction PDA whose states are dotted rules in \(\dot{P}\) and whose stack contains rules in \(\mathcal {P}_{\text{ret}}\) . Intuitively, the stack remembers a series of contexts in terms of return rules, which are used to match call symbols during backward extraction. We call a pair \((r,E)\) a configuration, with r being the state and E being the stack.
We define extraction PDA transition functions \(q_\text{init}\) and q based on extraction functions \(\operatorname{extract}_\mathrm{init}\) and \(\operatorname{extract}_{\mathrm{oneStep}}\) . In the following definition, we write \(\operatorname{pickOne}(V)\) for a function that returns a rule from a list of partial parse trees V; if V is empty, then we define \(\operatorname{pickOne}(V)\) to be \(\text{None}\) . In practice, \(\operatorname{pickOne}\) can be defined, e.g., based on priority of rules; we leave it abstract in this section.
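As one possible instantiation, pickOne can choose the candidate with the highest user-assigned priority; in this OCaml sketch, the priority function and all names are assumptions of ours, not part of the library:

```ocaml
(* One possible realization of pickOne: return the candidate of highest
   priority, or None when there is no candidate. *)
let pick_one (priority : 'r -> int) (candidates : 'r list) : 'r option =
  List.fold_left
    (fun best r ->
      match best with
      | None -> Some r
      | Some b -> if priority r > priority b then Some r else best)
    None candidates
```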
Definition 6.13 (The Single-parse-tree Extraction Function \(q_\text{init}\) )
Let \(r=\operatorname{pickOne}(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m))\) . If \(r=\text{None}\) , then we define \(q_\text{init}(m)=\text{None}\) . Otherwise, we define
\begin{equation*} q_\text{init}(m)= {\left\lbrace \begin{array}{ll} (r,\bot), &\text{if }m \subseteq \mathcal {M}_{\text{pln}}\mbox{ or } m \subseteq \mathcal {M}_{\text{call}};\\ (r,r\cdot \bot), &\text{if }m \subseteq \mathcal {M}_{\text{ret}}. \end{array}\right.} \end{equation*}
Definition 6.14 (The Transition Function q for the Extraction PDA)
Given a current configuration \((r,E)\), let \(t=\operatorname{head }(E)\) and \(r^{\prime }=\operatorname{pickOne}(\operatorname{extract}_{\mathrm{oneStep}}(m, \lbrace ([r], {t}\cdot \bot) \rbrace))\). If \(r^{\prime }=\text{None}\), then we define \(q(m,r,t)=\text{None}\). Otherwise, we define
\begin{equation*} q(m,r,t)= {\left\lbrace \begin{array}{ll} (r^{\prime },\lambda E.E), &\text{if }m\subseteq \mathcal {M}_{\text{pln}};\\ (r^{\prime },\operatorname{tail }), &\text{if }m\subseteq \mathcal {M}_{\text{call}};\\ (r^{\prime },\lambda E.{r^{\prime }}\cdot E), &\text{if }m\subseteq \mathcal {M}_{\text{ret}}. \end{array}\right.} \end{equation*}
Algorithm 3 constructs the PDA by computing the least solution of the following equation; it makes sure that states are closed under derivatives.
\begin{equation*} \begin{array}{lll} R & = & R \cup ~ \lbrace r^{\prime }\,\mid \,m \in A \wedge {(r^{\prime },t)=q_\text{init}(m)} \rbrace \\ & & \cup ~ \lbrace r^{\prime }\,\mid \, {m\in A,}\ r\in R\cap (\mathcal {P}_{\text{pln}}\cup \mathcal {P}_{\text{call}}), t\in (R\cap \mathcal {P}_{\text{ret}})\cup \lbrace \text{None}\rbrace , \ q(m,r,t)=(r^{\prime },f)\rbrace \\ & & \cup ~ \lbrace r^{\prime }\,\mid \, {m\in A,}\ r\in R\cap \mathcal {P}_{\text{ret}},\ q(m,r,r)=(r^{\prime },f)\rbrace , \end{array} \end{equation*}
where A is the set of all parser PDA states. Note that we do not need to compute the acceptance state for an extraction PDA: For a forest with the last state m, as long as the initial state \(\operatorname{extract}_\mathrm{init}(m)\) is not empty, the extraction is guaranteed to be successful; if the initial state \(\operatorname{extract}_\mathrm{init}(m)\) is empty, then the input string is invalid.
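Both Algorithm 2 and Algorithm 3 are instances of the same saturation pattern; the following OCaml sketch shows the generic worklist loop over abstract states (encoded here as strings; all names are ours):

```ocaml
(* A sketch of the least-fixed-point computation behind Algorithms 2
   and 3: saturate a state set under a successor function until no new
   state appears. *)
module S = Set.Make (String)

let closure (succs : string -> string list) (init : S.t) : S.t =
  let rec go seen = function
    | [] -> seen
    | x :: todo ->
        (* keep only successors not seen before *)
        let fresh = List.filter (fun y -> not (S.mem y seen)) (succs x) in
        let seen' = List.fold_left (fun s y -> S.add y s) seen fresh in
        go seen' (fresh @ todo)
  in
  go init (S.elements init)
```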

6.4 Correctness Proof

In this subsection, we prove the correctness of the parse trees constructed according to Sections 6.1 and 6.2. The correctness of the optimization in Section 6.3 using an extraction PDA follows as a corollary, since the extraction PDA extracts one parse tree out of possibly many parse trees returned by the extraction functions in Section 6.2.
Given a VPG \(G=(V,\Sigma ,P,L_0)\) and an input string w, let \(V=\operatorname{extract}(\operatorname{forest}(w))\) . We call V correct if it includes exactly the set of valid parse trees of w according to G, which we state formally as the following theorem:
Theorem 6.1 (Correctness of the Parser Generator).
\begin{align*} \forall w\ V,\ V=\operatorname{extract}(\operatorname{forest}(w)) \Rightarrow (\forall v,\ {L_0} \Downarrow \left({w},{v}\right) \iff \exists E, (v,E)\in V). \end{align*}
We provide detailed lemmas and proofs in Appendix D and describe the main steps next. Remember that the parser PDA builds a set of partial parse trees in a forward way and stores the trees in a forest, from which the extraction functions extract a set of partial parse trees in a backward way. Therefore, for the correctness proof, we formalize two small-step relations: one for building forward parse trees and one for building backward parse trees. We then show that both small-step relations are equivalent to the big-step parse-tree derivation relation under certain conditions. For the backward direction of Theorem 6.1 (i.e., soundness), we prove that during backward extraction for an input string w, each extracted backward partial parse tree \(v_2\) has a counterpart forward partial parse tree \(v_1\) in the rest of the forest, and \(v_1+v_2\) is a valid parse tree of w. At the end of extraction, we have \(v_1=[]\) and \(v_2\) is a valid parse tree of w. For the forward direction of Theorem 6.1 (i.e., completeness), we show that each valid parse tree v can be split as \(v=v_1+v_2\) , and there is an extraction step where \(v_1\) is in the rest of the forest and \(v_2\) is an extracted partial parse tree. Therefore, at the end of extraction, we have \(v_1=[]\) and \(v=v_2\) is in the extracted parse trees.

6.5 Verification of Time Complexity

Next, we discuss how we verify the performance of our VPG parser. For an input string w with a total number of k valid parse trees, we show that the VPG parser takes \(O(k|w|)\) time. When \(k=1\), the time complexity becomes linear (i.e., \(O(|w|)\)). The verification is performed in two steps: (1) it takes \(O(|w|)\) time to run the parser PDA and generate a parse forest; and (2) it takes \(O(k|w|)\) time to run the extraction function on the forest when extracting all of the k parse trees.
We first introduce our method of counting time cost in Coq. Our verification utilizes a Coq monadic library [29], which provides a monad for tracking the time cost of computation. To pair a value a of type A with a time cost c, we utilize a dependent type \(C(A,P)\), defined to be
\begin{equation*} C(A,P) := \lbrace a : A \mid \exists c,\ P\ a\ c \rbrace , \end{equation*}
where C is a type constructor and \(P\ a\ c\) is a property of a and c. In this article, the above dependent type is written using the following notation:
\begin{equation*} \lbrace a : A \mid c \mid P\ a\ c \rbrace . \end{equation*}
For example, given \(k\in \mathbb {N}\) , a value \(a^{\prime }\) of type \(\lbrace a : A \mid c \mid c=k \rbrace\) means it takes k time units to compute a of type A.
To correctly count the time cost, programs must be written as a combination of the return and bind monadic operators provided by the library [29]. We briefly introduce them as follows, with a small executable model sketched after the list:
(1)
The return operator, denoted as “ \(\Leftarrow {a}\) ”, returns a value a bound with 0 time cost.
(2)
The insertion operator, denoted as “ \(\mathrel {{+}{=}}{k};\ {a^{\prime }}\) ”, adds k time units to the monadic value \(a^{\prime }\) .
(3)
The bind operator, denoted as “ \({a}\leftarrow {e_1};\ {e_2}\) ”, first evaluates the computation \(e_1\) and binds its value to a, which is used to evaluate \(e_2\) ; it further pairs the result with the sum of the time cost of \(e_1\) and \(e_2\) .
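The following OCaml model mirrors these three operators by carrying the cost as plain data; it is a sketch of the idea only (the Coq library tracks the cost inside a dependent type instead), and the names are ours rather than the library's:

```ocaml
(* A model of the cost monad: a value paired with its accumulated cost. *)
type 'a cost = 'a * int

let ret (a : 'a) : 'a cost = (a, 0)           (* "<= a": value with zero cost *)

let tick (k : int) (m : 'a cost) : 'a cost =  (* "+= k; a'": add k time units *)
  let a, c = m in (a, c + k)

let bind (m : 'a cost) (f : 'a -> 'b cost) : 'b cost =  (* "a <- e1; e2" *)
  let a, c = m in
  let b, c' = f a in
  (b, c + c')  (* the costs of e1 and e2 are summed *)
```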
In our verification, the time cost of a function is counted by accumulating the time cost of its components, except for three kinds of primitive functions with unit cost:
(1)
eq_sym_ID, which compares the IDs of two symbols (nonterminals or terminals).
(2)
andb and orb, which compute the conjunction and disjunction of two Booleans, respectively.
(3)
cons a l, which constructs a list from head a and tail l, and pair a b, which makes a pair.
The insertion operator introduced earlier is used in only two ways: (1) to add the time cost cost_eq_sym_ID, cost_andb, cost_orb, cost_cons, or cost_pair to a corresponding function. E.g., the program
\begin{equation*} \mathrel {{+}{=}}{{cost\_eq\_sym\_ID}};\ {\Leftarrow {{eq\_sym\_ID }s_1\ s_2}}; \end{equation*}
compares two IDs \(s_1\) and \(s_2\) and returns a Boolean paired with cost_eq_sym_ID; and (2) to add cost_branch, which is the number of variables and functions used in a branch, similar to the counting method of McCarthy et al. [29]. Consider the following example:
\begin{equation*} \begin{aligned}\lambda x.\ &\text{match } x \text{ with } \\ &|\ []\Rightarrow \ \mathrel {{+}{=}}{1};\ {\Leftarrow {\text{None}}} \\ &|\ a::l\Rightarrow \ \mathrel {{+}{=}}{3};\ {\Leftarrow {\text{Some }a.}} \end{aligned} \end{equation*}
In the above example, only one unit of cost is added to the first branch, because only one name (x) is used; three units of cost are added to the second branch, because three names (x, a, and l) are used. These branch costs are inserted manually.
With the aforementioned setup, we implemented two functions \({\operatorname{runPDA}^{\prime }}\) and \({\operatorname{extract}^{\prime }}\) , where \({\operatorname{runPDA}^{\prime }}\) runs the parser PDA to generate a forest and \({\operatorname{extract}^{\prime }}\) extracts the set of parse trees from the forest. In particular, \({\operatorname{runPDA}^{\prime }}\) has the following type:
\begin{equation*} \operatorname{runPDA}^{\prime }: \forall \mathcal {T}\ w, \lbrace M : \mathrm{Forest} \mid c \mid c=\operatorname{cost}_{\operatorname{runPDA}}(\mathcal {T},w) \rbrace \end{equation*}
where w is a string, \(\mathrm{Forest}\) is the type of parse forests, and \(\mathcal {T}\) , representing the PDA, is a list of mappings in the form of \(((i,S,t),S^{\prime })\) . A mapping \(((i,S,t),S^{\prime })\) represents the state transition from configuration \((S,T)\) to \((S^{\prime },T^{\prime })\) , where \(t=\operatorname{head }(T)\) ; the mapping from T to \(T^{\prime }\) is further computed in \({\operatorname{runPDA}^{\prime }}\) . We show the upper bound of the cost function \(\operatorname{cost}_{\operatorname{runPDA}}\) in Theorem 6.2.
Theorem 6.2 (The Running-time Upper Bound of the Parser PDA).
\begin{align*} \forall \mathcal {T},\ \exists b_1\ b_2,\ \forall w, &#x0026;\ \operatorname{cost}_{\operatorname{runPDA}}(\mathcal {T},w)\le b_1|w|+b_2. \end{align*}
Theorem 6.2 shows that it takes \(O(|w|)\) time to run the parser PDA and generate a parse forest. The proof is straightforward, and we omit its discussion.
Now, we turn our attention to the extract function. We previously introduced its definition in Definition 6.11, which maintains a set of partial parse trees. However, it is not efficient to maintain a set-based data structure. Inserting a partial parse tree v to a set, for example, requires us to compare v with trees already in the set to avoid duplicates; comparing each of the \(|w|\) rules in the parse trees would take at least \(|w|\) steps. Even if we hash parse trees before comparison, we would still have to perform hashing for each of the \(|w|\) rules. Therefore, if a partial parse tree is inserted at each of the \(|w|\) steps, then the total time cost is at least quadratic ( \(\Omega ({|w|}^2)\) ). Getting a linear-time algorithm is a challenge.
We next describe a linear-time algorithm that takes advantage of the locality of VPG parsing. We replace the set structure with a list structure. But we also have to avoid duplicate partial parse trees in the list. To see why this is a problem, notice that two different elements \((r_1,r)\) and \((r_2,r)\) in a state m can be used to extend the same parse tree v from a list V of partial parse trees, both generating the same tree \(r::v\) .
To solve this problem, our observation is that if V includes no duplicates, then for two distinct trees \(v_1,v_2\in V\), we have not only \(v_1\ne v_2\), but also \(r_1::v_1\ne r_2::v_2\) for any \(r_1,r_2\in \dot{P}\). Therefore, for each \(v\in V\), we can first compute \(m_v\), the rules in m that can extend v, and remove the duplicates in \(m_v\) to get \(m_v^{\prime }\); we then extend v with \(m_v^{\prime }\) to get \(V_v\) and finally concatenate \(V_v\) for all \(v\in V\). To avoid duplicates in \(m_v\), we compare elements in \(m_v\), which takes \(O(|m|^2)\) time. Since m is a subset of \((P\cup \lbrace \text{None}\rbrace) \times P\), its size \(|m|\) is at most \((|P|+1)|P|\); therefore, \(O(|m|^2)\) is independent of the input string size and can be viewed as a constant.
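The following OCaml sketch shows this deduplicate-then-extend step, with extendable and extend passed in as parameters; all names are ours:

```ocaml
(* For each tree v, collect the rules of m that can extend v, remove
   duplicates among them (O(|m|^2), a constant in the input size), and
   only then build the new trees; the results are concatenated. *)
let rm_dup (rs : 'r list) : 'r list =
  List.fold_left (fun acc r -> if List.mem r acc then acc else acc @ [r]) [] rs

let one_step (extendable : 'r option * 'r -> 'v -> bool)
    (extend : 'r -> 'v -> 'v) (m : ('r option * 'r) list) (vs : 'v list)
    : 'v list =
  List.concat_map
    (fun v ->
      m
      |> List.filter (fun pr -> extendable pr v)
      |> List.map snd
      |> rm_dup
      |> List.map (fun r -> extend r v))
    vs
```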
Definitions 6.15, 6.16, and 6.17 implement this optimized version of the extract function. It utilizes function \({\operatorname{rmDup}}\) , which removes the duplicates in a list, function \({\operatorname{list}}\) , which converts a set to a list, and standard functions \(\operatorname{map}\) , \(\operatorname{filter}\) , and \(\operatorname{concat}\) . We omit their standard definitions.
Definition 6.15 (The Optimized Implementation \({\operatorname{extract}_\mathrm{init}^{\mathrm{op}}}\) )
\begin{equation*} \operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m) = \operatorname{rmDup}(\operatorname{list}(\operatorname{extract}_\mathrm{init}(m))). \end{equation*}
Definition 6.16 (The Optimized Implementation \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\) )
Given a VPG \(G=(\Sigma ,V,P,L_0)\) , a set of partial parse trees V and a state m, we first define functions \(\operatorname{extract}^{\mathrm{filt}}\) , \(\operatorname{extract}^{\mathrm{map}}\) , and \(\operatorname{extract}^{\mathrm{concat}}(m,V)\) , and then define \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\) based on them.
\begin{align*} \operatorname{extract}^{\mathrm{filt}}(m,{(v,E)})&=\operatorname{rmDup}\ ({\operatorname{map}\ (\lambda (x,y).y,}\ (\operatorname{list}\ \left(\operatorname{filter}\ (\lambda x. \operatorname{extendable}(x,{(v,E)}), m)\right))))\\ \operatorname{extract}^{\mathrm{map}}(m,{(v,E)})&=\operatorname{map}\ (\lambda x.(x::v,{E}),\ \operatorname{extract}^{\mathrm{filt}}(m,{(v,E)}))\\ \operatorname{extract}^{\mathrm{concat}}(m,V)&=\operatorname{concat}\ (\operatorname{map}\ (\lambda {(v,E)}.\operatorname{extract}^{\mathrm{map}}(m,{(v,E)}), V)) \\ \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}(m,V)&= {\left\lbrace \begin{array}{ll} \left[ (v, E) \mid (v,E)\in \operatorname{extract}^{\mathrm{concat}}(m,V) \right], &\text{if }m\subseteq \mathcal {M}_{\text{pln}},\\ \left[ (v, \operatorname{tail }(E)) \mid (v,E)\in \operatorname{extract}^{\mathrm{concat}}(m,V) \right], &\text{if }m\subseteq \mathcal {M}_{\text{call}},\\ \left[ (v, \operatorname{firstRule}(v)::E) \mid (v,E)\in \operatorname{extract}^{\mathrm{concat}}(m,V) \right], &\text{if }m\subseteq \mathcal {M}_{\text{ret}}. \end{array}\right.} \end{align*}
Definition 6.17 (The Optimized Implementation \({\operatorname{extract}^{\mathrm{op}}}\) )
Given a \(\operatorname{forest}(w)=[m_1,\ldots , m_{|w|}]\) , where \(w\ne \epsilon\) , define the sequence of partial parse trees as
\begin{align*} V_{|w|} &= \operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m_{|w|});\\ V_i &= \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
Then, we define \(\operatorname{extract}^{\mathrm{op}}(\operatorname{forest}(w)) = V_1\) as the list of parse trees extracted from \(\operatorname{forest}(w)\) .
Based on \({\operatorname{extract}^{\mathrm{op}}}\), we implemented a version of the extract function that performs cost counting:
\begin{equation*} \operatorname{extract}^{\prime }: \forall w,\ \lbrace V : \text{ParseTreeList} \mid c \mid V=\operatorname{extract}^{\mathrm{op}}(\operatorname{forest}(w))\ \wedge \ c=\operatorname{cost}_{\operatorname{extract}}(\operatorname{forest}(w)) \rbrace \end{equation*}
We show the upper bound of the cost function of \(\operatorname{extract}^{\prime }\) in Theorem 6.3, which also includes the correctness discussed in Section 6.4.
Theorem 6.3 (The Running-time Upper Bound of Extraction).
Given a VPG \(G=(\Sigma ,V,P,L_0)\) , we have
\begin{align*} &\exists k\ b_1\ b_2,\ \forall w\ V_1,\\ &\qquad {w\ne \epsilon } \ \wedge \ V_1=\operatorname{extract}^{\mathrm{op}}(\operatorname{forest}(w)) \Rightarrow \\ &\qquad \qquad \operatorname{NoDup}(V_1) \ \wedge \\ &\qquad \qquad \operatorname{cost}_{\operatorname{extract}}(\operatorname{forest}(w)) \le (k * |V_1| + b_1) * |w| + b_2\ \wedge \\ &\qquad \qquad \forall v,\ \left(\exists {E}, (v, {E})\in V_1 \right) \iff {L_0} \Downarrow \left({w},{v}\right). \end{align*}
There are three properties in Theorem 6.3. The first property “NoDup( \(V_1\) )” means there is no duplicate in \(V_1\) ; therefore, \(|V_1|\) is the number of valid parse trees. The next property shows that the time complexity of \({\operatorname{extract}^{\prime }}\) is \(O(|V_1||w|)\) . The third property shows the correctness, i.e., \(V_1\) is the set of parse trees of w.
Proof.
During the extraction, we maintain lists \(V_i,i=1..|w|\) of partial parse trees:
\begin{align*} V_{|w|} &\leftarrow \operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m_{|w|});\\ V_i &\leftarrow \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
We will prove in Theorem 6.4 that \(V_i\) has no duplicates for \(i=1..|w|\) . When \(i=1\) , we get the first property. To prove the second property, we show that the size of \(V_i\) increases during backward extraction in Theorem 6.5, i.e.,
\begin{equation*} |V_{|w|}|\le |V_{|w|-1}|\le \cdots \le |V_{1}|. \end{equation*}
Then, we prove that during extraction, for each \((m,V)\), it takes \(O(|m|^2|V|)\) time to evaluate \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}(m,V)\). To do this, we implement the monadic function
\begin{equation*} \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}:\forall m\ V,\ \lbrace V^{\prime } : \text{ParseTreeList} \mid c \mid c=\operatorname{cost}_{\mathrm{oneStep}}^{\mathrm{Extract}}(m,V) \rbrace \end{equation*}
and prove in Theorem 6.6 that \(\operatorname{cost}_{\mathrm{oneStep}}^{\mathrm{Extract}}(m,V)\sim O(|m|^2|V|)\) . We then rewrite this bound to \(O(|V|)\) , since \(|m| \sim O(|P|^2)\) . Finally, since \(\operatorname{extract}^{\mathrm{op}}\) can be viewed as executing \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}\) for \(|w|\) times, we have the time complexity of \(\sum _{i=1}^{|w|}O(|V_i|)\) , which is further bounded by \(O(|V_{1}||w|)\) using Theorem 6.5.□
Theorem 6.4 (No Duplicates in Vi)
\begin{align*} &\forall w\ i\ m\ M,\ \operatorname{forest}{(wi)}=M+[m] \Rightarrow \\ &\forall M_1\ M_2,\ M = M_2 + M_1 \Rightarrow \\ &\quad \operatorname{NoDup}(\operatorname{extract}^{\mathrm{op}}(M_1+[m])). \end{align*}
Proof.
Let \(V_{\mathrm{init}}=\operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m)\) . Prove by induction on \(M_1\) . There are two cases.
When \(M_1=[]\) , we have \(\operatorname{extract}^{\mathrm{op}}(M_1{+[m]})=V_{\mathrm{init}}\) . Since we apply \({\operatorname{rmDup}}\) to get \(V_{\mathrm{init}}\) , clearly, we have \(\operatorname{NoDup}(V_{\mathrm{init}})\) .
When \(M_1=[m_1]+M_1^{\prime }\) for some \(m_1\) , we have \(\operatorname{extract}^{\mathrm{op}}(M_1+[m])=\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}(m_1,V^{\prime })\) , where \(V^{\prime }=\operatorname{extract}^{\mathrm{op}}(M_1^{\prime }+[m])\) . By the induction hypothesis, we have \(\operatorname{NoDup}(V^{\prime })\) . Therefore, from the definition of \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}\) , we have \(\operatorname{NoDup}(\operatorname{extract}^{\mathrm{op}}(M_1+[m]))\) .□
Theorem 6.5 (Increasing \(|V_i|\) )
\begin{align*} &\forall w\ i\ m\ M,\ \operatorname{forest}(wi)=M+[m] \Rightarrow \\ &\qquad \forall M_1\ m^{\prime }\ M_2,\ M = M_2+ [m^{\prime }] + M_1 \Rightarrow \\ &\qquad \qquad |\operatorname{extract}^{\mathrm{op}}(M_1+[m])|\le |\operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])|. \end{align*}
Proof.
From the invariant in Lemma D.17 we know that each \({v\in } \operatorname{extract}^{\mathrm{op}}(M_1+[m])\) is a valid partial parse tree; i.e., \(\exists v_1, v_1+v\) is a parse tree of w. Let \(v_1=v_1^{\prime }+[r]\) for some r; since \([r]+v\) is also a valid partial parse tree, we have \([r]+v\in \operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])\) . This shows an injection from \(\operatorname{extract}^{\mathrm{op}}(M_1+[m])\) to \(\operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])\) . Therefore, we have \(|\operatorname{extract}^{\mathrm{op}}(M_1+[m])|\le |\operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])|\) .□
Theorem 6.6 (The Running-Time Upper Bound of \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\) )
\(\exists k_1\ k_2\ b_1\ b_2,\ \forall V\ m,\)
\begin{equation*} \operatorname{cost}_{\mathrm{oneStep}}^{\mathrm{Extract}}(m,V) \le (k_1 |m|^2 + k_2|m| + b_1)\cdot |V| + b_2. \end{equation*}
Proof.
For each v in V, it takes \(O(|m|^2)\) time to remove the duplicates in a list of rules; therefore, the total time is \(O(|m|^2|V|)\).□
Note that one benefit of the monadic library is that, since the running times are embedded in propositions, we can extract OCaml code from the Coq implementation; after extraction, the propositions carrying the running times are erased.

6.6 Coq Mechanization

All definitions and theorems presented in Sections 6.4 and 6.5 are mechanized in the proof assistant Coq. Our proof artifact consists of 19 Coq files and \(\sim\) 43k lines of code. The complete verification took \(\sim\) 5 minutes on an Intel® Core™ i7-9700 Processor and 16 GB of memory.
Our Coq implementation includes two hypotheses, namely, A_VPG_Linear and A_VPG_Match, corresponding to the two constraints for general VPGs we discussed in Section 2: A_VPG_Linear requires that for each rule \(L\rightarrow i L^{\prime }\) where \(i\in \Sigma\), if the nonterminal L belongs to \(V^0\), then the nonterminal \(L^{\prime }\) must also belong to \(V^0\); and A_VPG_Match requires that for each matching rule \(L\rightarrow ‹a\,L_1\,b›\,L_2\), the nonterminal \(L_1\) belongs to \(V^0\), and if the nonterminal L belongs to \(V^0\), then so does the nonterminal \(L_2\). These two hypotheses guarantee that our Coq implementation accepts valid general VPGs. In Appendix E, we summarize the correspondence between theorems discussed in this section and our open-sourced Coq formalization and proofs.
We discuss some key differences between the definitions as well as theorems presented in Sections 6.4 and 6.5 and their Coq implementation as follows: First, the big-step parse-tree derivation in Figure 4 includes three rules; its Coq mechanization includes two additional rules for deriving pending call and return symbols, respectively. Second, we formalized the semantics of running the parser PDA in Coq as the relation “Forest \(M\ T\ w\) ,” where M is the trace for the string w, and T is the terminal stack. Recall that in Definition 6.5, we denote the trace as \(\operatorname{forest}(w)\) . In Coq, the function \(\operatorname{forest}\) is verified to generate a trace that satisfies the relation Forest. Further, in the theorems for extraction PDAs, such as Theorems 6.3 and 6.4, the trace \(\operatorname{forest}(w)\) is replaced with a trace M that satisfies the Forest relation.

7 Designing A Surface Grammar

The format of rules allowed in VPGs is designed to ease the study of their meta-theory but is inconvenient for expressing practical grammars. First, no user-defined semantic actions are allowed. Second, each VPG rule allows at most four terminals/nonterminals on the right-hand side. In this section, we present a surface grammar that is more user-friendly for writing grammars. We first discuss embedding semantic actions. Then, we introduce tagged CFGs, which are CFGs paired with information about how to separate terminals into plain, call, and return symbols. We then describe a translator from tagged CFGs to VPGs. During the conversion, the translator also generates semantic actions that convert the parse trees of VPGs back to the ones of tagged CFGs.

7.1 Embedding Semantic Actions

Semantic actions transform parsing results to user-preferred formats. In a rule \(L\rightarrow s_1\cdots s_k\), where \(s_i\in \Sigma \cup V\) for \(i=1..k\), we treat L as a default action that takes k arguments, which are the semantic values returned by \(s_1\) to \(s_k\), and returns a tree with a root node and \(s_1\) to \(s_k\) as children. The prefix notation of a parse tree gives
\begin{equation*} [L,v_{s_1},\ldots , v_{s_k}], \end{equation*}
where \(v_{s_i}\) is the semantic value for \(s_i\) . The above notation can be naturally viewed as a stack machine, where L is an action and \(v_{s_i}\) are the values that get pushed to the stack before the action. The VPG parse tree can be converted to the prefix notation in a straightforward way. If we then replace each nonterminal in the tree with its semantic action, then the parse tree becomes a stack machine.
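A minimal OCaml sketch of such a stack machine is given below; the tree is flattened into a list of instructions, values are pushed, and an action of arity k pops k arguments. Reading the prefix notation right to left corresponds to a right fold. The instruction type and all names are ours:

```ocaml
(* Evaluating a parse tree in prefix notation as a stack machine. *)
type 'v instr =
  | Val of 'v                     (* a semantic value, e.g., for a terminal *)
  | Act of int * ('v list -> 'v)  (* an action with its arity *)

let eval (tree : 'v instr list) : 'v =
  let step ins stack =
    match ins with
    | Val v -> v :: stack
    | Act (k, f) ->
        (* pop k arguments (in left-to-right order) and push the result *)
        let rec take n s acc =
          if n = 0 then (List.rev acc, s)
          else match s with
            | x :: s' -> take (n - 1) s' (x :: acc)
            | [] -> invalid_arg "eval: stack underflow"
        in
        let args, rest = take k stack [] in
        f args :: rest
  in
  (* fold_right processes the rightmost values first, so each action sees
     its arguments already on the stack *)
  match List.fold_right step tree [] with
  | [ v ] -> v
  | _ -> invalid_arg "eval: ill-formed tree"
```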
The default action for a nonterminal can be replaced by a user-defined action appended to each rule in the grammar. For example, consider the grammar \(L = c L\mid ‹a\,L\,b›\,L\mid \epsilon\). Suppose we want to count the number of the symbol c in an input string; we can specify semantic actions in the grammar as follows:
\begin{align*} L = \ & c\, L && @\lbrace \text{let } f_1\ v_1\ v_2 = 1+v_2 \rbrace \\ \mid \ & ‹a\, L\, b›\, L && @\lbrace \text{let } f_2\ v_1\ v_2\ v_3\ v_4 = v_2+v_4 \rbrace \\ \mid \ & \epsilon && @\lbrace \text{let } f_3\ () = 0 \rbrace \end{align*}
In the above example, a semantic action is specified after each rule, e.g., “@{let \(f_1\ v_1\ v_2 = 1+v_2\)}.” In the actions, \(v_1\), \(v_2\), \(v_3\), and \(v_4\) represent the semantic values returned by the right-hand side symbols of the rule. For example, the first semantic action \(f_1\ v_1\ v_2\ =\ 1+v_2\) accepts two semantic values \(v_1\) and \(v_2\), where \(v_1\) is returned by c and \(v_2\) is returned by L. As a special case, the definition “@{let \(f_3\ () = 0\)}” defines a function that takes no arguments.
As an application, the next subsection shows how to use semantic actions to convert the parse trees of a VPG to the parse trees of its original tagged CFG.

7.2 Translating from Tagged CFGs to VPGs

Grammar writers are already familiar with CFGs, the basis of many parsing libraries. We define tagged CFGs to be CFGs paired with information about how to partition terminals into plain, call, and return symbols (\(\Sigma =\Sigma _{\text{plain}}\cup \Sigma _{\text{call}}\cup \Sigma _{\text{ret}}\)); that is, in a tagged CFG, a terminal is tagged with information about what kind of symbol it is. Compared to a regular CFG, the only additional information in a tagged CFG is the tagging information; therefore, tagged CFGs provide a convenient mechanism for reusing existing CFGs and developing new grammars in a formalism that grammar writers are familiar with.
Not all tagged CFGs can be converted to VPGs. We use a conservative validator to determine if a tagged CFG can be converted to a VPG and, if the validator passes, translate the tagged CFG to a VPG. Without loss of generality, we assume that in each rule of the input tagged CFG every call symbol is matched with a return symbol. In the general case, we need two additional steps: (1) we first ensure that pending call and return symbols are not nested within well-matched call/return symbols; (2) we then change the tags of the pending call and return symbols to plain, making the resulting grammar ready for the validator.
The translation steps are summarized as follows:
\begin{equation*} {A\; tagged\; CFG}\rightarrow {Simple\; form}\xrightarrow {{If\; valid}} {Linear\; form}\rightarrow {VPG}. \end{equation*}
At a high level, a tagged CFG is first translated to a simple form, upon which validation is performed. If validation passes, then the simple-form CFG is translated to a linear-form CFG, which is finally translated to a VPG. We next detail these steps. These steps will be illustrated with a running example in Figure 7. In the example, the left-most column lists the original tagged CFG; note that in the grammar semantic actions are after the \(@\) symbol (e.g., \(L^6\) ) and they will be discussed later in the section.
Fig. 7. An example of converting a tagged CFG to a VPG.
Definition 7.1 (Simple Forms).
A rule is in the simple form if it is of the form \(L\rightarrow \epsilon\) or of the form \(L\rightarrow q_1\cdots q_k\) (\(k\ge 1\)), where for each \(i=1..k\), either \(q_i\in \Sigma _{\text{plain}}\cup V\) or \(q_i=‹a\,L_i\,b›\) for some call symbol ‹a, return symbol b›, and nonterminal \(L_i\). A tagged CFG \(G=(V,\Sigma ,P,L_0)\) is in the simple form if every rule in P is in the simple form.
Compared to a tagged CFG, a simple-form CFG requires that there must be a nonterminal between a call symbol and its matching return symbol. The conversion from a tagged CFG to a simple-form CFG is straightforward: For each rule, we replace every string \(‹a\,s\,b›\), where ‹a is matched with b› and \(s\in (\Sigma \cup V)^*\), with \(‹a\,L_s\,b›\) and generate a new nonterminal \(L_s\) and a new rule \(L_s\rightarrow s\). After this conversion, a string in the form of \(‹a\,L\,b›\) can be viewed as a “plain symbol”; this is a key intuition for the following steps. We call \(‹a\,L\,b›\) a matched token in the following discussion. For the running example, this step of translation extracts \(AE\) from the first rule and assigns it to a new nonterminal \(L_{AE}\).
The validation can then be performed on the simple form, using its dependency graph.
Definition 7.2 (Dependency Graph).
The dependency graph of a grammar \(G=(V,\Sigma ,P,L_0)\) is \((V,E_G)\) , where
\begin{equation*} E_G=\lbrace (L, L^{\prime }, (s_1,s_2))\,\mid \,s_1,s_2\in (\Sigma \cup V)^*, (L\rightarrow s_1 L^{\prime } s_2)\in P\rbrace . \end{equation*}
Note that an edge from L to \(L^{\prime }\) is labeled with a pair of strings.
Definition 7.3 (Dependency Loop).
Let \(G=(V,\Sigma ,P,L_0)\) be a grammar with its associated dependency graph \((V,E_G)\) . A dependency loop is defined as a loop present in the dependency graph such that at least one edge in the loop is labeled with \((s_1,s_2)\) , where \(s_2\) can derive a nonempty string (i.e., \(s_2\rightarrow ^* w\) for some nonempty string w).
The validator’s main task is to make sure each dependency loop is well founded. Additionally, for non-dependency loops, the validator ensures they are not solely labeled with \((\epsilon ,\epsilon)\) to prevent dead loops. Specifically, the validator verifies that for each loop in the dependency graph, either (1) in the loop there is an edge \((L,L^{\prime })\) that is labeled with \((s_1‹a, b›s_2)\), where ‹a is matched with b› in a rule \(L\rightarrow s_1‹a\,L^{\prime }\,b›\,s_2\); or (2) every edge \((L,L^{\prime })\) in the loop is labeled with \((s,\epsilon)\) for some string \(s\in (\Sigma \cup V)^*\), and at least one such s satisfies \(s\not\rightarrow ^*\epsilon\). For the running example, the dependency graph of the simple form of the grammar does not have any loops and thus validation passes trivially.
Once the validation passes, the translation converts a simple-form CFG to a linear-form CFG.
Definition 7.4 (Linear Forms).
A rule is in the linear form if it is in one of the following forms: (1) \(L\rightarrow \epsilon\); (2) \(L\rightarrow t_1\cdots t_k\); (3) \(L\rightarrow t_1\cdots t_kL^{\prime }\); where \(t_i\in \Sigma _{\text{plain}}\) or \(t_i=‹a\,L_i\,b›\) for some ‹a, b›, and \(L_i\), with \(i=1..k\) and \(k\ge 1\). A tagged CFG \(G=(V,\Sigma ,P,L_0)\) is in the linear form if every rule in P is in the linear form.
Note that in a linear-form rule, \(t_i\) cannot be a nonterminal, while in a simple-form rule \(q_i\) can be a nonterminal. Further, the linear form allows rules of the form \(L\rightarrow t_1\cdots t_kL^{\prime }\) , where \(t_i\) is a terminal or a matched token. The main job of the translator is to convert simple-form rules to linear-form rules by strategically replacing nonterminals with the right-hand sides of their rules. Appendix G shows the translation algorithm and its termination proof. Compared to the conference version where no termination proof was provided, this version of the translation algorithm has a new reformulation, which enables the termination proof to go through. Figure 7 shows the conversion result of the running example. For this example, the conversion is simple: Nonterminal A is replaced by \(cE\) and eventually replaced by c in the first two rules.
The translation from a linear-form CFG to a VPG is simple. E.g., a rule of the form \(L\rightarrow t_1\cdots t_k\) is translated to \(L \rightarrow t_1L_1;\ L_1 \rightarrow t_2 L_2;\ \ldots ;\ L_{k-1} \rightarrow t_kL_k;\ L_k \rightarrow \epsilon\), where \(L_1\) to \(L_k\) are new nonterminals. For the running example, the first rule in the linear form of the grammar is split into two rules, as shown in the last column in Figure 7.
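The following OCaml sketch shows this splitting step, modeling rules as (left-hand side, right-hand side) pairs and generating fresh nonterminals by suffixing; the naming scheme is illustrative:

```ocaml
(* Split a linear-form rule L -> t1 ... tk into chained VPG rules
   L -> t1 L_1; L_1 -> t2 L_2; ...; L_k -> epsilon. *)
let split_rule (l : string) (ts : string list) : (string * string list) list =
  let rec go lhs i = function
    | [] -> [ (lhs, []) ]  (* the final rule L_k -> epsilon *)
    | t :: rest ->
        let fresh = Printf.sprintf "%s_%d" l i in
        (lhs, [ t; fresh ]) :: go fresh (i + 1) rest
  in
  go l 1 ts

(* For example, split_rule "L" ["t1"; "t2"] yields
   [("L", ["t1"; "L_1"]); ("L_1", ["t2"; "L_2"]); ("L_2", [])]. *)
```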
All transformations are local rewriting of rules, and as a result it is easy to show that each transformation step preserves the set of strings the grammar accepts; further, for an input string there should be a one-to-one correspondence between parse trees produced by the grammar before transformation and parse trees produced by the grammar after transformation. We further note that not all tagged CFGs can be converted to VPGs. For example, grammar “ \(L\rightarrow cLc|\epsilon\) ” cannot be converted, since its terminals cannot be suitably tagged: Intuitively, c has to be both a call and a return symbol. Further, since our validation algorithm is conservative, it rejects some tagged CFGs that have VPG counterparts. For example, grammar “ \(L\rightarrow Lc|\epsilon\) ” is rejected by the validator, since it is left recursive. However, it can be first refactored to “ \(L\rightarrow cL|\epsilon\) ”, which is accepted by our validator.
Generating semantic actions. During the conversion, each time the translator rewrites a rule, a corresponding semantic action is attached to the rule. Initially, every rule is attached with one default semantic action. For example, the rule \(L\rightarrow AbCd\) is attached with \(L^4\) , written as \(L\rightarrow AbCd \ @L^4.\) As mentioned in Section 7.1, \(L^4\) is the default semantic action for constructing a tree with a root node and children nodes that are constructed from semantic values from the right-hand side of the rule. The superscript 4 is its arity. During conversion, every time we rewrite a nonterminal L in a rule R with the right-hand side of rule \(L \rightarrow s\) , the semantic values for s are first combined to produce a semantic value for L, which is then used to produce the semantic value for the left-hand nonterminal of R. If a helper nonterminal \(L_s\) is introduced during conversion and a rule \(L_s \rightarrow s\) is generated, then we do not generate a semantic value for \(L_s\) but leave the semantic values for s on the stack so any rule that uses \(L_s\) can use those semantic values directly. In this way, we can convert a parse tree of a VPG to the parse tree of its corresponding tagged CFG.
Back to the example in Figure 7, during the conversion to the simple form, the semantic action of the first rule does not change, since the rule for \(L_{AE}\) has no semantic actions; the semantic values of A and E are left on the stack. Thus, \(L^6\) still expects six values on the stack. The translation step to the linear form expands A in the first rule with \(A\rightarrow cE\ @A^2\). Then, \(cE‹a L_{AE} b›E\) is simplified to \(c‹a L_{AE} b›E\), and that is why \(A^1\) is applied instead of \(A^2\): \(A^1\) accepts the value for c. The same transformation is applied to the second rule. The translation step to the VPG is straightforward.
As a concrete example of parsing using the VPG in Figure 7, for the input string \(c‹ac b›\), the parser will generate the following parse tree:
Each rule in the above parse tree is then replaced with its attached action and the semantic values of the terminals.
And the evaluation result of the above stack machine is the following parse tree of the tagged CFG:
The above tree can be visualized as follows:

8 Evaluation

As discussed earlier, we implemented in Coq our VPG parsing library and mechanized its proofs; we also introduced two optimizations so parsing is performed by running two PDAs (the parser and the extraction PDAs). To evaluate performance, we extracted OCaml code from the Coq library, used it to construct the parser and extraction PDAs by Algorithms 2 and 3, and stored the PDA transitions in hashtables for constant-time look-ups. We evaluated our implementation around the following questions: (1) How applicable is VPG parsing in practice? (2) What is the performance of VPG parsing compared with other parsing approaches?
We performed a preliminary analysis for a set of ANTLR4 grammars in a grammar repository [3]. Among all 239 grammars, 136 (56.9%) could be converted to VPGs by our tagged-CFG-to-VPG translation after we manually marked the call and return symbols for those grammars. Note that this does not mean the rest cannot be converted; e.g., 34 grammars cannot be converted because they have left recursion, and the conversion may become possible if the left recursion is removed. We leave a more thorough analysis to future work.
For performance evaluation, we compared our VPG parsers with two representative LL and LR parsers, namely, ANTLR 4 [35] and GNU Bison [18], using their Java backends. ANTLR 4 is a popular parser generator that implements an efficient parsing algorithm called ALL(*) [36]. The ALL(*) algorithm can perform an unlimited number of lookaheads to resolve ambiguity, and it has a worst-case complexity of \(O(n^4)\) ; however, it exhibits linear behavior on many practical grammars. GNU Bison is a well-known parser generator that implements a class of linear-time LR parsing algorithms [11]. We utilized GNU Bison’s default configuration to build LALR(1) parsers [10] in conjunction with JFlex [41], a Java lexical analyzer generator. We also compared the VPG parsers with a few hand-crafted parsers specialized for parsing JSON and XML documents, including four mainstream JavaScript engines and four popular XML parsers. Before presenting the performance evaluation, we list some general setups:
(1)
During evaluation, we adapted the grammars for JSON, XML, and HTML from ANTLR4 [3] to tagged CFGs, from which we generated VPG parsers. Appendix F shows the tagged CFGs for JSON, XML, and HTML. We also adapted these grammars to LR grammars and generated LR parsers using Bison. We then compared VPG parsers with ANTLR and Bison parsers in terms of performance.
(2)
When comparing our VPG parsers with ANTLR parsers, we omitted the lexing time. This is because we used ANTLR’s lexers to generate the tokens for both VPG parsers and ANTLR parsers.
(3)
Since ANTLR generates CFG parse trees, for an end-to-end comparison, we used semantic actions to convert the parsing results of VPG parsers and Bison parsers to CFG parse trees and measured the conversion time.
We note that the performance of the parsing algorithms presented in this article is very similar to the performance of the algorithms in the conference version, since both utilize two PDAs and the difference in lookup time between PDA transitions is negligible. We adopted the new algorithms primarily to simplify the verification of both correctness and performance.

8.1 Parsing JSON Files

The JSON format allows objects to be nested within objects and arrays; therefore, a JSON object has a hierarchically nesting structure, which can be naturally captured by a VPG. In particular, since in JSON an object is enclosed within “{” and “}” and arrays within “[” and “]”, its VPG grammar treats “{” and “[” as call symbols and “}” and “]” as return symbols.
When building a VPG parser for JSON, we reused ANTLR’s lexer. Therefore, the evaluation steps are as follows:
\begin{equation*} {Input \ file}\xrightarrow {{ANTLR \ Lexer}} {ANTLR \ tokens} \xrightarrow {{ANTLR/VPG \ Parser}} {Results}. \end{equation*}
A downstream application that uses the ANTLR’s JSON parser may wish to keep working on the same parsing result produced by ANTLR’s parser. Therefore, we implemented a converter to convert the parse forest produced by our VPG parser to ANTLR’s parse tree for the input files. When the grammar is unambiguous, which is the case for the JSON grammar and the XML grammar, the parse forest is really the encoding of a single parse tree. The algorithm of converting a VPG parse tree to a stack machine and evaluating the stack machine has been discussed in Section 7. The result of the evaluation is a structure that can be directly printed out and compared with; the same applies to the parse trees produced by ANTLR and Bison parsers. The conversion steps are summarized as follows:
\begin{equation*} {VPG \ parse \ tree}\xrightarrow {{Embed \ actions}} {Stack \ machine}\xrightarrow {{Evaluate}} {ANTLR \ parse \ tree}. \end{equation*}
Note that in practice this conversion may not be necessary. A downstream application can directly work on the VPG parse tree or associate semantic actions with the VPG parse tree to convert the tree to desired semantic values. We include the conversion time for our VPG parser so the parsing result can be compatible with legacy downstream applications.
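For concreteness, the following OCaml sketch shows one way such a stack machine could be evaluated into a tree; the `action` and `tree` types here are simplified assumptions for illustration and do not reflect the converter's actual API.

```ocaml
(* A minimal sketch, under simplified assumptions, of evaluating a stack
   machine of embedded actions into a tree: Shift pushes a leaf; Reduce pops
   the children of a nonterminal and pushes the combined node. *)
type tree = Node of string * tree list

type action =
  | Shift of string           (* push a leaf for a terminal *)
  | Reduce of string * int    (* pop [arity] children, push a new node *)

let evaluate (machine : action list) : tree =
  let step stack = function
    | Shift tok -> Node (tok, []) :: stack
    | Reduce (name, arity) ->
        let rec pop n acc st =
          if n = 0 then (acc, st)
          else
            match st with
            | t :: rest -> pop (n - 1) (t :: acc) rest
            | [] -> failwith "malformed stack machine"
        in
        let children, rest = pop arity [] stack in
        Node (name, children) :: rest
  in
  match List.fold_left step [] machine with
  | [ tree ] -> tree
  | _ -> failwith "evaluation did not produce a single tree"
```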
For evaluation, we collected 23 real-world JSON files from the Awesome JSON Datasets [12], the Native JSON Benchmark [43], and the JSON.parse benchmark [19], where the sizes of the files range from 14 KB to 7 MB.4 Figure 8 shows the parsing time of the ANTLR and Bison parsers, as well as the VPG parser’s parsing plus extraction time, conversion time, and the total (parsing plus extraction plus conversion) time; note that both the x-axis and the y-axis of the figure (and other figures in this section) are in log scale for better visualization. Table 1 shows the parsing time for the five largest files in our test set, as well as the lexing time; Appendix H shows the results for the full test set. All parsers exhibit linear-time performance when parsing JSON files. As can be seen, the VPG parsing time (“VPG Parse + Extract”) is less than that of the ANTLR and Bison parsers; the total time is in general also less than that of the ANTLR parser and comparable to that of the Bison parser. For larger files, we observe that the conversion time dominates; we believe the time is largely influenced by the OCaml garbage collector and defer optimizing the conversion to future work.
Fig. 8. Parsing times of JSON files (in log scale). “Guide Line” indicates the slope for linear time complexity.
| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| catalog | 135,991 | 54 ms | 20 ms | 7 ms | 17 ms | 24 ms | 40 ms | 39 ms |
| canada | 334,374 | 104 ms | 50 ms | 17 ms | 52 ms | 69 ms | 47 ms | 39 ms |
| educativos | 426,398 | 92 ms | 41 ms | 20 ms | 48 ms | 69 ms | 109 ms | 75 ms |
| airlines | 555,410 | 107 ms | 58 ms | 30 ms | 68 ms | 98 ms | 88 ms | 91 ms |
| JSON.parse | 1,288,350 | 225 ms | 128 ms | 72 ms | 162 ms | 234 ms | 142 ms | 141 ms |
Table 1. Parsing and Lexing Times of JSON Files

8.2 Parsing XML Files

XML also has a well-matched nesting structure with explicit start-tags such as <p> and matching end-tags such as </p>. However, compared to JSON, there is an additional complexity for the XML grammar, which makes it necessary to adapt the XML grammar provided by ANTLR. In particular, the XML lexer in ANTLR treats an XML tag as separate tokens; e.g., <p> is converted into three tokens: <, p, and >. Those tokens then appear in the ANTLR XML grammar. Part of the reason for this design is that the XML format allows additional attributes within a tag; e.g., <p id=1> is a start-tag with an attribute with name id and value 1. Below is a snippet of the related XML grammar in ANTLR.
Note that the attribute information can be specified by regular expressions and therefore be provided by lexers. To compare VPG parsers with ANTLR and Bison parsers and to expose the nesting structure within XML, we modified the grammar by grouping tag components. The following shows a snippet of our adapted XML grammar:
The above snippet introduces three new tokens, as declared below.
Note that the above grammar is used for ANTLR, Bison, and our VPG parser generator, and we do not need the VPG lexer used in the conference version of this article.
For evaluation, we used the real-world XML files provided by the VTD-XML benchmarks [42], which consist of a wide selection of 23 files ranging from 1 KB to 73 MB. The parsing times are presented in Figure 9; the times for the five largest files are presented in Table 2; the conversion times are shown in the “VPG Conv” column. Appendix H shows the results for the full test set. Similar to JSON, VPG parsing (“VPG Parse + Extract”) on XML files is in general faster than ANTLR and Bison parsing; the total time is also less for smaller files.
Fig. 9. Parsing times of XML files (in log scale). “Guide Line” indicates the slope for linear time complexity.
| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| ORTCA | 39,072 | 30 ms | 8 ms | 3 ms | 6 ms | 10 ms | 54 ms | 71 ms |
| SUAS | 118,446 | 54 ms | 31 ms | 6 ms | 21 ms | 27 ms | 88 ms | 131 ms |
| address | 1,437,142 | 254 ms | 232 ms | 74 ms | 283 ms | 357 ms | 201 ms | 250 ms |
| cd | 4,198,694 | 721 ms | 735 ms | 204 ms | 704 ms | 908 ms | 416 ms | 509 ms |
| po | 9,266,526 | 1,360 ms | 1,835 ms | 461 ms | 1,528 ms | 1,989 ms | 854 ms | 1,617 ms |
Table 2. Parsing and Lexing Times of XML Files

8.3 Parsing HTML Files

A snippet of the HTML grammar in ANTLR is listed below:
Similar to the XML grammar, the HTML grammar allows self-closing tags such as <br/>. However, the HTML grammar additionally allows optional end tags, which XML does not allow. For example, the HTML tag <input type="submit" value="Ok"> cannot have a matching end tag according to the HTML standard. Although this kind of tag is also “self-closing,” we will use the terminology of optional end tags, since that is how the official HTML5 standard describes it.
In the conference version of this article, we used the HTML grammar included in ANTLR. To adapt the grammar for VPGs, we combined the components of an HTML tag into a single token and tagged it as a call or return symbol for our VPG parser. The performance comparison between the ANTLR parser and the VPG parser is presented in Figure 10. As can be seen, our VPG parser outperformed the ANTLR parser by more than four orders of magnitude.
Fig. 10. Parsing times of HTML files (in log scale).
We performed an investigation and identified the reason for the huge performance gap. The HTML grammar included in ANTLR does not distinguish different kinds of tags at the grammar level; all tags are modeled by a single nonterminal. The grammar by itself is not practical and causes nondeterminism at the grammar level. As a result, the ANTLR parser cannot determine how tags are matched until all tags have been read, which significantly slows down its parsing. For the same reason, this HTML grammar cannot be refactored to an LR grammar, which is required by Bison.
To address this issue, we changed the grammar to reduce nondeterminism by explicitly modeling the 10 most common tags at the grammar level and requiring that they be correctly matched. The most essential part of the updated HTML grammar is as follows:
For example, we require that table and paragraph tags be correctly matched; other tags are captured by OpenTag and CloseTag. The new tokens are declared as follows:
Similar to XML, the above grammar is used for all parser generators.
Another point is that the HTML grammar is ambiguous. For example, the first rule
can parse a string of scriptletOrSeaWs in different ways. ANTLR’s HTML parser returns the parse tree that prioritizes earlier rules in the grammar; our VPG parser returns the same parse tree as ANTLR’s with a special pickOne function that always picks the first element in the list of partial parse trees.
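A minimal OCaml sketch of such a “pickOne” function is shown below; the representation of partial parse trees as a list is an assumption for illustration.

```ocaml
(* pickOne resolves ambiguity by always committing to the first partial
   parse tree, mirroring ANTLR's earlier-rule priority. *)
let pick_one (candidates : 'tree list) : 'tree option =
  match candidates with
  | [] -> None       (* no candidate: a parse error *)
  | t :: _ -> Some t (* otherwise, keep the first alternative *)
```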
For evaluation, we used the 19 real-world HTML files provided in ANTLR’s repository [3]; the sizes of these files are smaller than those of the JSON and XML documents we used. The parsing times are presented in Figure 11 and Table 3. The conversion times of the parse trees are shown in the “VPG Conv” column. As can be seen, our VPG parser outperforms ANTLR and Bison, which is particularly evident for smaller file sizes. Furthermore, the performance gap between the ANTLR parser and the VPG parser is much narrower after the reduction of grammar nondeterminism, confirming our hypothesis that ANTLR’s parser struggled with the nondeterminism in the original HTML grammar.
Fig. 11. Parsing times of HTML files (in log scale). “Guide Line” indicates the slope for linear time complexity.
| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| cnn1 | 4,974 | 90 ms | 7 ms | 0.29 ms | 1.02 ms | 1.31 ms | 42 ms | 34 ms |
| reddit2 | 4,976 | 32 ms | 7 ms | 0.28 ms | 1.11 ms | 1.39 ms | 54 ms | 42 ms |
| reddit | 4,989 | 31 ms | 7 ms | 0.27 ms | 1.10 ms | 1.37 ms | 53 ms | 38 ms |
| digg | 6,250 | 127 ms | 9 ms | 0.51 ms | 1.34 ms | 1.85 ms | 52 ms | 38 ms |
| youtube | 16,316 | 49 ms | 12 ms | 1.05 ms | 5.92 ms | 6.97 ms | 55 ms | 59 ms |
Table 3. Parsing and Lexing Times of HTML Files
Summary of comparison with ANTLR and Bison. Our performance evaluation shows that our VPG parsing library generates parsers that in general run faster than those generated by ANTLR and have performance comparable to the parsers generated by Bison, on grammars that can be converted to VPGs, such as JSON, XML, and HTML.

8.4 Comparison with Hand-crafted Parsers

We also compared VPG parsers with hand-crafted parsers for JSON and XML documents. For JSON, we compared with four mainstream JavaScript engines (V8, Chakra, JavaScriptCore, and SpiderMonkey) and evaluated them on the JSON files discussed in Section 8.1. The four JavaScript engines invoke the JavaScript built-in method JSON.parse to convert a JSON string to a JSON object. For XML, we compared with four popular XML parsers (fast-xml-parser [32], libxmljs [28], sax-js [40], and htmlparser2 [16]) and evaluated them with the XML files discussed in Section 8.2. One caveat of this comparison is that these parsers generate different parsing results. Fast-xml-parser transforms an XML document into a JSON object; libxmljs transforms an XML document into a custom object that represents the tree structure of XML documents. Sax-js and htmlparser2 parse an XML document and execute user-supplied semantic actions; no semantic actions were used in our evaluation. Additionally, for XML documents, the time required to analyze tag components is not included in the VPG parsing time but is included in the ANTLR lexing time; for better comparison, we reported the combined lexing and parsing times.
The evaluation results for the largest files are shown in Tables 4 and 5; the full results are in Appendix H. Note that the hand-crafted parsers can process raw texts directly, while our VPG parsers process the tokens generated by ANTLR’s lexers. Therefore, we show separately the lexing time of ANTLR (column “ANTLR Lex”), the parsing time of VPG parsing (column “VPG Parse + Extract”), the total time (column “VPG Total”), and the combined time (column “ANTLR Lex + VPG Total”). From the results, we can see that although the total time of VPG parsing is not the shortest among all parsers, the parsing time (“VPG Parse + Extract”) alone is. Thus, VPG parsers show promising potential in performance, in addition to verified correctness over hand-crafted parsers. The total parsing time can be reduced by replacing ANTLR’s lexer with a faster, custom lexer, since the parsing time of VPG parsers is shorter than the lexing time.
| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | SpiderM | JSCore | V8 | Chakra |
| catalog | 135,991 | 40 ms | 7 ms | 24 ms | 64 ms | 34 ms | 71 ms | 28 ms | 25 ms |
| canada | 334,374 | 47 ms | 17 ms | 69 ms | 115 ms | 57 ms | 68 ms | 34 ms | 44 ms |
| educativos | 426,398 | 109 ms | 20 ms | 69 ms | 177 ms | 71 ms | 421 ms | 45 ms | 49 ms |
| airlines | 555,410 | 88 ms | 30 ms | 98 ms | 186 ms | 74 ms | 95 ms | 42 ms | 56 ms |
| JSON.parse | 1,288,350 | 142 ms | 72 ms | 234 ms | 376 ms | 118 ms | 139 ms | 76 ms | 88 ms |
Table 4. Parsing Times of Five Largest JSON Files
| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | Fast-XML | Libxmljs | SAX-JS | HTMLP2 |
| ORTCA | 39,072 | 54 ms | 3 ms | 10 ms | 64 ms | 138 ms | 91 ms | 665 ms | 89 ms |
| SUAS | 118,446 | 88 ms | 6 ms | 27 ms | 115 ms | 254 ms | 182 ms | 1,214 ms | 169 ms |
| address | 1,437,142 | 201 ms | 74 ms | 357 ms | 558 ms | 584 ms | 196 ms | 1,012 ms | 331 ms |
| cd | 4,198,694 | 416 ms | 204 ms | 908 ms | 1,324 ms | 1,298 ms | 419 ms | 2,103 ms | 735 ms |
| po | 9,266,526 | 854 ms | 461 ms | 1,989 ms | 2,843 ms | 3,278 ms | 897 ms | 6,618 ms | 1,827 ms |
Table 5. Parsing Times of Five Largest XML Files

9 Future Work

As noted earlier, the correctness of our VPG-based parser generator is verified in Coq. Correctness means that if the generated parser constructs a parse tree, then it must be a valid parse tree according to the input VPG, and vice versa. However, there are gaps between the Coq formalization of our VPG parser generator and its implementation in OCaml. First, the implementation takes tagged CFGs as input and translates them to VPGs; the translation algorithm’s correctness and termination have not been formally modeled and verified in Coq. The verification would be challenging, as the algorithm depends on a global dependency graph; we leave this substantial effort to future work. Second, the implementation uses efficient data structures for performance, while the Coq models use equivalent data structures that are slower but easier to reason about. For example, the OCaml implementation uses hash tables for storing the transition tables of the two PDAs to have efficient search (with \(O(1)\) search complexity), while the Coq counterpart uses a balanced tree (with \(O(\log (n))\) search complexity) provided as a Coq library. Therefore, we extracted OCaml code from the Coq implementation of the parser generator and used it to generate the parser and extraction PDAs offline with Algorithms 2 and 3.
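As an illustration of this trade-off, the following OCaml sketch contrasts the two table representations; the key and value types are assumptions for illustration only.

```ocaml
(* The Coq model uses a balanced tree (O(log n) search), while the OCaml
   implementation can use a hash table (O(1) expected search). Keys here
   pair a PDA state id with an input symbol. *)
module Key = struct
  type t = int * string
  let compare = compare
end

module TreeTable = Map.Make (Key)

(* Balanced-tree lookup: easier to reason about in Coq. *)
let tree_lookup (tbl : int TreeTable.t) (k : Key.t) : int option =
  TreeTable.find_opt k tbl

(* Hash-table lookup: chosen in the implementation for speed. *)
let hash_lookup (tbl : (Key.t, int) Hashtbl.t) (k : Key.t) : int option =
  Hashtbl.find_opt tbl k
```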
Our parsing algorithm requires a VPG as the input grammar. Compared to a CFG, a VPG requires partitioning terminals into plain, call, and return symbols. Some CFGs may not admit such a partitioning; the same terminal may require different stack actions for different input strings. In particular, all languages recognized by VPGs belong to the set of deterministic context-free languages, which is a strict subset of context-free languages (the classic example that separates CFL from DCFL is \(\lbrace a^ib^jc^k\,\mid \,i \ne j \vee j \ne k\rbrace\) ). We plan to extend our preliminary study on ANTLR grammars to understand how much of the syntax of practical computer languages (e.g., programming languages and file formats) can be described by VPGs.
Our framework requires two kinds of refactoring. First, as discussed in Section 8, we refactored the XML and HTML grammars by grouping the components of a tag into a single token. In general, given a grammar, we can combine tokens in the grammar as long as the combined token follows a regular-expression pattern. This refactoring can be easily achieved by a lexer generator such as JFlex [41]. To reduce the user’s burden, we intend to provide an additional operator in tagged CFGs for grouping tokens; the user can further mark a combined token as a call or return symbol. Second, we must eliminate left recursion, because it is disallowed by our validator for tagged CFGs. Eliminating the commonly used direct left recursion is simple [36] (e.g., \(A\rightarrow A\alpha \mid \beta\) becomes \(A\rightarrow \beta A^{\prime }\) with \(A^{\prime }\rightarrow \alpha A^{\prime }\mid \epsilon\)) and should be supported by tagged CFGs. In general, we believe that the refactoring effort required by our framework is less than the LR(k) refactoring effort required to resolve shift-reduce and reduce-reduce conflicts. We plan to compare refactoring efforts in future work.
Error information is essential for assisting users in identifying errors in grammars and input strings. When a parsing error occurs, our VPG parser outputs the current configuration, which includes the parser PDA’s state and stack. The user can then view the most recently used rules as well as their context rules, which we frequently found helpful in locating the issue. Furthermore, with the default “pickOne” function, we verified that the extraction PDA runs without error. For custom “pickOne” functions, users can provide their own error information. In future work, we plan to evaluate the error reporting of VPG parsers and compare it with that of LL and LR parsers.
The translation algorithm from tagged CFGs to VPGs is sound but not complete. In general, it is an open problem to determine whether a CFG can be translated to a VPG, and to infer the call and return symbols automatically.

10 Conclusions

In this article, we present a recognizer and a parser generator for visibly pushdown grammars with formally verified correctness and time complexity, where the parsing algorithm is largely inspired by the recognizer. We also provide a surface grammar called tagged CFGs and a translator from tagged CFGs to VPGs. We show that when a format can be modeled by a VPG and its call and return symbols can be identified, VPG parsing provides competitive performance and sometimes a significant speedup.

Acknowledgments

The authors would like to thank anonymous reviewers for their detailed, constructive comments.

Footnotes

1
For instance, the XML grammar is a VPG if a whole XML tag is treated as a terminal symbol; this requires a lexer that returns XML tags as tokens.
3
We note that our implementation of tagged CFGs additionally supports regular operators in the rules; these regular operators can be easily desugared and we omit their discussion.
4
Our JSON benchmarks and benchmarks for testing XML and HTML parsers are hosted at https://bitbucket.org/psu_soslab/verifiedvpgparser/.
5
Note that in VPGs, a pending rule cannot be used inside a matching rule.

A Correctness Proofs of the Recognizer

Lemma A.1.
If \((S_1,T)\leadsto w \wedge S_1\subseteq S_2\) , then \((S_2,T)\leadsto w\) .
Proof.
By definition, if \(T=\bot\), then \(\exists (L_1,L_2)\in S_1\) s.t. \(L_2\rightarrow ^{*}w\). Since \(S_1\subseteq S_2\), we have \((L_1,L_2)\in S_2\), so \((S_2,T)\leadsto w\).
Otherwise, \(T=[S^{\prime },‹a]\cdot T^{\prime }\); then \(w=w_1 b› w_2\), and \(\exists (L_3,L_4)\in S_1\) s.t.
(1)
\(L_4\rightarrow ^{*}w_1\) and
(2)
\(\exists (L_1,L_2)\in S^{\prime },\exists L_5, L_2\rightarrow ‹a L_3 b› L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
Again, since \(S_1\subseteq S_2\) , we have \((L_3,L_4)\in S_2\) , so \((S_2,T)\leadsto w\) .□
Lemma A.2.
If \(L_2\rightarrow ^{*}w_1L_3\) and \((\lbrace (L_1,L_3)\rbrace ,T)\leadsto w\) , then \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto w_1w\) .
Proof.
By definition, if \(T=\bot\), then \(L_3\rightarrow ^{*}w\). Thus, \(L_2\rightarrow ^{*}w_1L_3\) implies \(L_2\rightarrow ^{*}w_1w\). By definition, \((\lbrace (L_1,L_2)\rbrace ,\bot)\leadsto w_1w\).
Otherwise, \(T=[S^{\prime },‹a]\cdot T^{\prime }\); then \(w=w^{\prime } b› w^{\prime \prime }\),
(1)
\(L_3\rightarrow ^{*}w^{\prime }\) and
(2)
\(\exists (L^{\prime },L^{\prime \prime })\in S^{\prime },\exists L_5, L^{\prime \prime }\rightarrow ‹a L_1 b› L_5 \wedge (\lbrace (L^{\prime },L_5)\rbrace ,\ T^{\prime })\leadsto w^{\prime \prime }\) .
Thus, \(L_2\rightarrow ^{*}w_1L_3\) implies \(L_2\rightarrow ^{*}w_1w^{\prime }\) . By definition, \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto w_1w\) .□
Lemma A.3.
If \((S,T)\leadsto w\) , then \(\exists (L_1,L_2)\in S\) , s.t. \((\lbrace (L_1,L_2)\rbrace , T)\leadsto w\) .
Proof.
By definition, if \(T=\bot\), then \(\exists (L_1,L_2)\in S\) s.t. \(L_2\rightarrow ^{*}w\). Then, by definition again, we have the lemma.
Otherwise, \(T=[S^{\prime },‹a]\cdot T^{\prime }\) and \(w=w_1 b› w_2\). By definition again, we have the lemma.□
Theorem A.4.
Assume \(\delta _c(S)=(S^{\prime },\lambda T.T)\) for a plain symbol c. Then \((S,T)\leadsto cw\) iff \((S^{\prime },T) \leadsto w\) .
Proof.
\(\Rightarrow\) . By case over \((S,T)\leadsto cw\) .
(1)
\(T=\bot\) and \((S,\bot)\leadsto cw\) . By definition, \(\exists (L_1,L_2)\in S \wedge L_2\rightarrow ^{*}cw\) . By the forms of VPG rules, we must have
\begin{equation*} \exists L_3, L_2\rightarrow cL_3\ \wedge \ L_3\rightarrow ^{*}w. \end{equation*}
Since \((L_1,L_2)\in S \wedge L_2\rightarrow cL_3\), by the definition of \(\delta _c\) we have \((L_1,L_3)\in S^{\prime }\); thus, since \(L_3\rightarrow ^{*}w\), we have \((S^{\prime },\bot)\leadsto w\).
(2)
\(T=[S_1,‹a]\cdot T^{\prime }\) and \((S, [S_1,‹a]\cdot T^{\prime })\leadsto cw\). By definition, \(w=w_1 b› w_2\), and \(\exists (L_3,L_4)\in S\) s.t.
(a)
\(L_4 \rightarrow ^{*}cw_1\) and
(b)
\(\exists (L_1,L_2)\in S_1,\exists L_5, L_2\rightarrow ‹a L_3 b› L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
Since \(L_4\rightarrow ^{*}cw_1\) , we have \(\exists L_4^{\prime }\) , \(L_4\rightarrow cL_4^{\prime }\ \wedge \ L_4^{\prime }\rightarrow ^{*}w_1\) , thus \((L_3,L_4^{\prime })\in S^{\prime }\) , and \((S^{\prime },T)\leadsto w_1 b› w_2=w\) .
\(\Leftarrow\) .
By Lemma A.3, \(\exists (L_1,L_3)\in S^{\prime }\) s.t. \((\lbrace (L_1,L_3)\rbrace ,T)\leadsto w\).
Thus, by the definition of derivatives, \(\exists L_2, (L_1,L_2)\in S \wedge L_2\rightarrow cL_3\). By Lemma A.2, \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto cw\). By Lemma A.1, \((S,T)\leadsto cw\).□
Theorem A.5.
Assume \(\delta _{‹a}S=(S^{\prime },\lambda T.[S,‹a]\cdot T)\) for a call symbol \(‹a\). Then \((S,T)\leadsto ‹a w\) iff \((S^{\prime }, [S,‹a]\cdot T)\leadsto w\).
Proof.
\(\Rightarrow\) . By case over \((S,T)\leadsto ‹a w\) .
(1)
\(T=\bot\) and \((S,\bot)\leadsto ‹a w\). By definition, \(\exists (L_1,L_2)\in S \wedge L_2\rightarrow ^{*}‹a w\). Thus, from the forms of VPG rules, we have \(\exists L_3\ L_4\) s.t. \(L_2\rightarrow ‹a L_3 b› L_4\).
Thus, \(\exists w_1,w_2\), s.t. \(L_3\rightarrow ^{*}w_1\wedge L_4\rightarrow ^{*}w_2\wedge w=w_1 b› w_2\).
By definition, since \(L_3\rightarrow ^{*}w_1\), \((L_1,L_2)\in S\) with \(L_2\rightarrow ‹a L_3 b› L_4\), and \((\lbrace (L_1,L_4)\rbrace ,\bot)\leadsto w_2\), we have \((\lbrace (L_3,L_3)\rbrace ,[S,‹a]\cdot \bot)\leadsto w_1 b› w_2\). Since \((L_3,L_3)\in \delta _{‹a}S\), we have \((\delta _{‹a}S, [S,‹a]\cdot \bot)\leadsto w_1 b› w_2=w\).
(2)
\(T=[S_1,‹c]\cdot T^{\prime }\) and \((S,[S_1,‹c]\cdot T^{\prime })\leadsto ‹a w\) . Then, by definition, we have \(w=w_1 d› w_2\) , and
\(\exists (L_3,L_4)\in S\) s.t.
(a)
\(L_4\rightarrow ^{*}‹a w_1\) and
(b)
\(\exists (L_1,L_2)\in S_1,\exists L_5, L_2\rightarrow ‹c L_3 d› L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
Thus, \(\exists L_6,L_7\) s.t. \(L_4\rightarrow ‹a L_6 b› L_7\), \(L_6\rightarrow ^{*}w_{11}\), \(L_7\rightarrow ^{*}w_{12}\), and \(w_1=w_{11} b› w_{12}\). Thus, by definition, \((L_6,L_6)\in \delta _{‹a}S\), so
\begin{equation*} (\delta _{‹a}S, [S,‹a]\cdot T)\leadsto w_{11} b› w^{\prime }, \end{equation*}
where \(w^{\prime }\) satisfies \((\lbrace (L_3,L_7)\rbrace ,[S_1, ‹c] \cdot T^{\prime })\leadsto w^{\prime }\). From \((\lbrace (L_1,L_5)\rbrace ,T^{\prime })\leadsto w_2\), we know \(w^{\prime }\) can be \(w_{12} d› w_2\), so we have
\begin{equation*} (\delta _{‹a}S, [S,‹a]\cdot T)\leadsto w_{11} b› w_{12} d› w_2 = w. \end{equation*}
\(\Leftarrow\) . By definition, \(\exists (L_3,L_3)\in S^{\prime }\) , s.t. \(w=w_1 b› w_2\) , and
(1)
\(\exists (L_1,L_2)\in S\ \text{s.t. }L_3\rightarrow ^{*}w_1 \wedge L_2\rightarrow ‹a L_3 b› L_5\) .
(2)
\((\lbrace (L_1,L_5)\rbrace ,T)\leadsto w_2\) .
Since \(L_2\rightarrow ‹a L_3 b› L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,T)\leadsto w_2\), by Lemma A.2, we have \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto ‹a w_1 b› w_2=‹a w\), thus \((S,T)\leadsto ‹a w\).□
Theorem A.6.
Assume \(\delta _{b›}(S, [S_1,‹a])=(S^{\prime },\text{tail})\) for a return symbol \(b›\). Then \((S, [S_1,‹a]\cdot T)\leadsto b› w\) iff \((S^{\prime },T)\leadsto w\).
Proof.
\(\Rightarrow\) . By definition, \(\exists (L_3,L_4)\in S\) and \((L_1,L_2)\in S_1\), and we must have \(L_4\rightarrow ^{*}\epsilon\), \(L_2\rightarrow ‹a L_3 b› L_5\) and \((\lbrace (L_1,L_5)\rbrace ,T)\leadsto w\). From \(L_4\rightarrow ^{*}\epsilon\) and the forms of allowed VPG rules, we must have \(L_4\rightarrow \epsilon\); therefore, \((L_1,L_5)\in S^{\prime }\). By Lemma A.1, we have \((S^{\prime },T)\leadsto w\).
\(\Leftarrow\) . By Lemma A.3, we have \((L_1,L_5)\in S^{\prime } \wedge (\lbrace (L_1,L_5)\rbrace ,T)\leadsto w\), thus \(\exists (L_1,L_2)\in S_1 \wedge (L_3,L_4)\in S \wedge L_4\rightarrow \epsilon \wedge L_2\rightarrow ‹a L_3 b› L_5\); by definition, we have \((S,[S_1,‹a]\cdot T)\leadsto b› w\).□
Lemma A.7.
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , suppose a PDA is generated according to Algorithm 1. Then, for a string \(w\in \Sigma ^*\) , \((\lbrace (L_0,L_0)\rbrace ,\bot)\leadsto w\) iff w is accepted by the PDA.
Proof.
\(\Leftarrow .\) If w of length k is accepted by the PDA, then there exists a sequence of configurations \((S_i,T_i), i\in [0..k]\) , s.t. (1) \(S_0=\lbrace (L_0,L_0)\rbrace\) , \(T_0=\bot\) ; (2) \((S_{i},T_{i})=\mathcal {F}(w_{i},S_{i-1},T_{i-1}),\ i \in [1..k]\) , where \(w_i\) is the ith symbol in w and \(\mathcal {F}\) is the PDA transition function; and (3) \((S_k,T_k)\) is an acceptance configuration.
For each i, perform case analysis over \(w_i\). Suppose \(w_i\) is a plain symbol, denoted as c. By \((S_{i},T_{i})=\mathcal {F}(w_{i},S_{i-1},T_{i-1})\) and the PDA construction, we must have \(\delta _c(S_{i-1}) = (S_i, \lambda T. T)\). By Theorem A.4, we get \((S_{i-1},T_{i-1})\leadsto w_i \ldots w_k\) iff \((S_i,T_i) \leadsto w_{i+1} \ldots w_k\). The cases for when \(w_i\) is \(‹a\) or \(b›\) are similar, with the help of Theorems A.5 and A.6.
Combining all steps, we have \((S_0,T_0)\leadsto w\) iff \((S_k,T_k) \leadsto \epsilon\) . Since \((S_k,T_k)\) is an acceptance configuration, we have \((S_k,T_k) \leadsto \epsilon\) . Therefore, we get \((S_0,T_0)\leadsto w\) .
\(\Rightarrow .\) We prove a more general lemma: If \((S_0,T_0)\) is a PDA runtime configuration and \((S_0,T_0) \leadsto w\) , then w is accepted by the PDA. Prove it by induction over the length of w.
When the length is zero, we must have \(T_0=\bot\), and there exists \((L_1,L_2) \in S_0\) such that \(L_2 \rightarrow \epsilon\). Therefore, \((S_0,T_0)\) is an acceptance configuration of the PDA.
For the inductive case, suppose \(w=w_1 \ldots w_{k+1}\) . Perform case analysis over \(w_1\) , and first show that there exists \((S_1,T_1)\) s.t. \((S_1,T_1)=\mathcal {F}(w_1,S_0,T_0)\) .
(1)
Suppose \(w_1\) is a plain symbol c and \(\delta _c(S_0)=(S_1,\lambda T. T)\). Since Algorithm 1 is closed under derivatives, we have that \(S_1\) is a PDA state. Let \(T_1=T_0\). Thus, \(\mathcal {F}(w_1,S_0,T_0) = (S_1,T_1)\) by the definition of \(\mathcal {F}\).
(2)
The case of \(w_1\) being a call symbol is similar to the previous case.
(3)
Suppose \(w_1\) is a return symbol \(b›\). By \((S_0,T_0) \leadsto b› w_2 \ldots w_{k+1}\), we have that \(T_0\) is not the empty stack and has a top symbol \([S,‹a]\). Suppose \(\delta _{b›}(S_0, [S,‹a])=(S_1,\text{tail})\). Since Algorithm 1 is closed under derivatives, we have that \(S_1\) is a PDA state. Let \(T_1=\text{tail}(T_0)\). Thus, \(\mathcal {F}(w_1,S_0,T_0) = (S_1,T_1)\) by the definition of \(\mathcal {F}\).
By Theorems A.4, A.5, and A.6, we get \((S_1,T_1) \leadsto w_2 \ldots w_{k+1}\) . By the induction hypothesis, \(w_2 \ldots w_{k+1}\) is accepted by the PDA. Therefore, the original string \(w_1 \ldots w_{k+1}\) is also accepted.□
Theorem A.8.
For VPG G and its start symbol \(L_0\) , a string \(w\in \Sigma ^*\) is derived from \(L_0\) , i.e., \(L_0\rightarrow ^{*}w\) , iff w is accepted by the corresponding PDA.
Proof.
By Lemma A.7, w is accepted by the PDA iff \((\lbrace (L_0,L_0)\rbrace , \bot)\leadsto w\) , and by definition, we have \((\lbrace (L_0,L_0)\rbrace , \bot)\leadsto w\) iff \(L_0\rightarrow ^{*}w\) .□

B Recognizing Strings with Pending Calls/returns

In this section, we extend the work in Section 5 to build PDAs for recognizing VPGs with pending call or return symbols. In general VPGs, nonterminals are classified into two categories: \(V^0\) for matching well-matched strings and \(V^1\) for strings with pending calls/returns. We write \(V=V^0\cup V^1\) for the set of all nonterminals; \(V^0\) must be disjoint from \(V^1\). The definition also imposes constraints on how \(V^0\) and \(V^1\) nonterminals can be used. E.g., in \(L\rightarrow ‹a L_1 b› L_2\), \(L_1\) must be a well-matched nonterminal (i.e., in \(V^0\)). This constraint excludes a grammar like \(L_1 \rightarrow ‹a L_2 b› L_3; L_2 \rightarrow ‹c L_4\).
Another major difference is that in \(L\rightarrow a L_1\), the symbol a can be a call/return symbol in addition to being a plain symbol. This makes matching calls and returns more complicated. For example, suppose we have rules: \(L_1 \rightarrow ‹a L_2\); \(L_2 \rightarrow b› L_3 | \epsilon\); \(L_3 \rightarrow \epsilon\). Then the string \(‹a b›\) is accepted, in which case \(b›\) from \(L_2 \rightarrow b› L_3\) matches \(‹a\) from \(L_1 \rightarrow ‹a L_2\). String \(‹a\) is also accepted, in which case \(‹a\) is a pending call. So, depending on the input string, \(‹a\) from \(L_1 \rightarrow ‹a L_2\) may be a matching call or a pending call.
Here is an example grammar:
(1)
\({L_1} \rightarrow {‹a L_2 b› L_3}\)
(2)
\({L_2} \rightarrow {‹a L_2 b› L_4\mid \epsilon }\)
(3)
\({L_3} \rightarrow {‹a L_1\mid \epsilon }\)
(4)
\({L_4} \rightarrow {\epsilon .}\)
And \(L_1, L_3 \in V^1, L_2, L_4 \in V^0\). For example, \(‹a b› ‹a ‹a ‹a b› b›\) is in the language recognized by the grammar, via the derivation \(L_1 \Rightarrow ‹a L_2 b› L_3 \Rightarrow ‹a b› L_3 \Rightarrow ‹a b› ‹a L_1 \Rightarrow ‹a b› ‹a ‹a L_2 b› L_3 \Rightarrow ‹a b› ‹a ‹a ‹a L_2 b› L_4 b› L_3 \Rightarrow ^{*}‹a b› ‹a ‹a ‹a b› b›\).
General VPGs to PDAs. The PDA states and stack symbols are the same as before. We generalize the notion of the top of the stack to return the top stack symbol when the stack is non-empty and return None when the stack is empty.
A derivative-based transition function takes the current state and the top of the stack (which can be None) and returns a new state and a stack action. As before, since \(\delta _c\) and \(\delta _{{&#x2039; {a}}}\) do not use the top of the stack, we omit it from their parameters.
Definition B.1 (Derivative Functions for General VPGs).
Given a general VPG \(G=(V,\Sigma ,P,L_0)\), the transition functions \(\delta\) are defined as follows: For \(c\in \Sigma _{\text{plain}}\), \(‹a\in \Sigma _{\text{call}}\), and \(b›\in \Sigma _{\text{ret}}\),
(1)
\(\delta _c\) is the same as the well-matched case.
\(\delta _c(S) = (S^{\prime },\lambda T.T)\) , where
\begin{equation*} S^{\prime }=\lbrace (L_1,L_3)\mid (L_1,L_2)\in S \wedge (L_2\rightarrow cL_3)\in P\rbrace ; \end{equation*}
(2)
For call symbols, we have \(\delta _{‹a}(S)= (S^{\prime }\cup S_p,\lambda T.[S,‹a]\cdot T)\), where
\begin{align*} S^{\prime }&=\lbrace (L_3,L_3)\mid (L_1,L_2)\in S \wedge (L_2\rightarrow ‹a L_3 b› L_4)\in P\rbrace ,\\ S_p&=\lbrace (L_3,L_3)\mid (L_1,L_2)\in S \wedge (L_2\rightarrow ‹a L_3)\in P\rbrace . \end{align*}
Compared to the well-matched case, an additional \(S_p\) is introduced for the case when \(‹a\) appears in a rule like \(L_2\rightarrow ‹a {L_3}\).
(3)
For a return symbol \(b›\), if t is the top of the stack, then
\begin{equation*} \delta _{b›}(S,t)= {\left\lbrace \begin{array}{ll} (S^{\prime }\cup S_{p1},\ \text{tail}), & \text{if } t=[S_1,‹a], \\ (S_{p2},\ \lambda T.T), & \text{if } t=\text{None}, \end{array}\right.} \end{equation*}
where
\begin{align*} S^{\prime }&=\lbrace (L_1,L_5)\mid (L_3,L_4)\in S \wedge (L_4\rightarrow \epsilon)\in P \wedge (L_1,L_2)\in S_1 \wedge (L_2\rightarrow ‹a L_3 b› L_5)\in P\rbrace ,\\ S_{p1}&=\lbrace (L_1,L_5)\mid (L_3,L_4)\in S \wedge (L_4\rightarrow b› L_5)\in P \wedge (L_1,L_2)\in S_1 \wedge (L_2\rightarrow ‹a L_3)\in P\rbrace ,\\ S_{p2}&=\lbrace (L_1,L_5)\mid (L_1,L_4)\in S \wedge (L_4\rightarrow b› L_5)\in P\rbrace . \end{align*}
\(S^{\prime }\) is as before and deals with the case when there is a rule \(L_2\rightarrow ‹a L_3 b› L_5\) with a proper top stack symbol. \(S_{p1}\) deals with the case when there are rules \(L_2\rightarrow ‹a L_3\) and \(L_4\rightarrow b› L_5\); in this case, we match \(b›\) with \(‹a\). Finally, \(S_{p2}\) deals with the case when the stack is empty; then \(b›\) is treated as a pending return symbol (not matched with a call symbol).
For the well-matched case, the stack should be empty after all input symbols are consumed; in the case with pending calls/returns, however, the stack is not necessarily empty at the end. For example, with the grammar \(L\rightarrow ‹a L\mid \epsilon\) and the valid input string \(‹a\), the terminal stack is \([\lbrace (L,L)\rbrace ,‹a]\cdot \bot\).
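As a small executable illustration of Definition B.1, the following OCaml sketch implements the derivative function for a plain symbol over a naive list-based representation of rules and states; this representation is an assumption for illustration and differs from the extracted code.

```ocaml
type nt = string

(* Only the rule form needed here: a plain rule L2 -> c L3. *)
type rule = PlainRule of nt * char * nt

(* Advance every pair (L1, L2) in the state along each rule L2 -> c L3,
   yielding (L1, L3); the stack is left unchanged, as in Definition B.1. *)
let delta_plain (rules : rule list) (c : char) (s : (nt * nt) list) :
    (nt * nt) list =
  List.concat_map
    (fun (l1, l2) ->
      List.filter_map
        (fun (PlainRule (head, c', l3)) ->
          if head = l2 && c' = c then Some (l1, l3) else None)
        rules)
    s
```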
Definition B.2 (The Acceptance Configuration for Words with Pending Calls/returns).
Given a general VPG \(G=(V,\Sigma ,P,L_0)\) , the pair \((S,T)\) is called an acceptance configuration if the following are satisfied:
(1)
\(\exists (L_1,L_2) \in S \text{ s.t. } (L_2 \rightarrow \epsilon) \in P\) ,
(2)
either (i) \(T=\bot\) or (ii) \(T= [S^{\prime }, ‹a] \cdot T^{\prime }\) and \(\exists (L_3,L_4) \in S^{\prime } \wedge (L_4 \rightarrow ‹a L_1) \in P\) for some \(L_1\) .
In the following correctness proof, we use predicate \(\operatorname{well-matched}(w)\) to mean that w, a string of terminals, is a well-matched string; that is, every call/return symbol is matched with a corresponding return/call symbol. We use predicate \(\operatorname{matched-rets}(w)\) to mean that any return symbol in w is matched with a call symbol; however, a call symbol may not be matched with a return symbol. E.g., we have \(\operatorname{matched-rets}(‹a ‹a b›)\) , but not \(\operatorname{well-matched}(‹a ‹a b›)\) .
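Both predicates can be decided in a single left-to-right pass that tracks the number of unmatched calls; the following OCaml sketch, over an assumed tagged-token representation, illustrates the idea.

```ocaml
(* Track the number of currently unmatched calls; a return at depth zero
   violates matched-rets, and a nonzero final depth violates well-matched. *)
let classify (tokens : [ `Call | `Plain | `Ret ] list) =
  let final_depth =
    List.fold_left
      (fun depth tok ->
        match depth, tok with
        | None, _ -> None                (* an unmatched return was seen *)
        | Some d, `Call -> Some (d + 1)
        | Some 0, `Ret -> None           (* return with no pending call *)
        | Some d, `Ret -> Some (d - 1)
        | Some d, `Plain -> Some d)
      (Some 0) tokens
  in
  let matched_rets = final_depth <> None in
  let well_matched = (final_depth = Some 0) in
  (matched_rets, well_matched)

(* For example, the string of one call, one call, one return satisfies
   matched-rets but not well-matched: *)
let _ = assert (classify [ `Call; `Call; `Ret ] = (true, false))
```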
Definition B.3 (Semantics of PDA Configurations).
We will write \((S, T)\leadsto w\) to mean that w can be accepted by the configuration \((S, T)\) . It is defined as follows:
(1)
\((S, \bot)\leadsto w \mbox{ if } \exists (L_1,L_2)\in S,\ \text{s.t. }L_2\rightarrow ^{*}w\) ,
(2)
\((S, [S^{\prime },‹a]\cdot T^{\prime })\leadsto w_1 b› w_2\) if \(\exists (L_3,L_4)\in S\) s.t.
(a)
\(L_4\rightarrow ^{*}w_1\) and \(\operatorname{well-matched}(w_1)\) and
(b)
\(\exists (L_1,L_2)\in S^{\prime },\exists L_5, L_2\rightarrow ‹a L_3 b› L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
(3)
\((S, [S^{\prime },‹a]\cdot T^{\prime })\leadsto w_1 b› w_2\) if \(\exists (L_3,L_4)\in S\) s.t. \(\exists L_5\)
(a)
\(L_4\rightarrow ^{*}w_1 b› L_5\) and \(\operatorname{well-matched}(w_1)\) and
(b)
\(\exists (L_1,L_2)\in S^{\prime },L_2\rightarrow ‹a L_3 \wedge (\lbrace (L_1,L_5)\rbrace ,\ T^{\prime })\leadsto w_2\) .
(4)
\((S, [S^{\prime },‹a]\cdot T^{\prime })\leadsto w_1\) if \(\exists (L_3,L_4)\in S\) s.t.
(a)
\(L_4\rightarrow ^{*}w_1\) and \(\operatorname{matched-rets}(w_1)\)
(b)
\(\exists (L_1,L_2)\in S^{\prime },L_2\rightarrow ‹a L_3\) .
In the above definition, the third case handles when the call symbol \(‹a\) in rule \(L_2\rightarrow ‹a L_3\) matches \(b›\) in \(w_1 b› L_5\) produced by \(L_4\). The last case handles when \(‹a\) in rule \(L_2\rightarrow ‹a L_3\) does not have a matched return; that is, it is a pending call.
The following three lemmas and their proofs are the same as before (except that Lemma B.2 requires well-matched strings):
Lemma B.1.
If \((S_1,T)\leadsto w \wedge S_1\subseteq S_2\) , then \((S_2,T)\leadsto w\) .
Lemma B.2.
If \(L_2\rightarrow ^{*}w_1L_3\) , \(\operatorname{well-matched}(w_1)\) , and \((\lbrace (L_1,L_3)\rbrace ,T)\leadsto w\) , then \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto w_1w\) .
Lemma B.3.
If \((S,T)\leadsto w\) , then \(\exists (L_1,L_2)\in S\) , s.t. \((\lbrace (L_1,L_2)\rbrace , T)\leadsto w\) .
In addition, we need the following lemma:
Lemma B.4.
If \(L \rightarrow ^{*}w \delta\), where \(\delta\) is a string of terminals or nonterminals, then we have either (1) \(\operatorname{matched-rets}(w)\), or (2) there exist \(w_1, b›, w_2\), s.t. \(w=w_1 b› w_2\) and \(\operatorname{well-matched}(w_1)\), and there exists \(L_1\) s.t. \(L \rightarrow ^{*}w_1 b› L_1\) and \(L_1 \rightarrow ^{*}w_2 \delta\).
Proof.
Sketch: If \(w=\epsilon\) , then \(\operatorname{matched-rets}(\epsilon)\) . Otherwise, prove it by induction over the length of the derivation of \(L \rightarrow ^{*}w \delta\) and then perform case analysis over the first derivation step.□
Theorem B.5.
For a plain symbol c, \((S,T)\leadsto cw\) iff \(\delta _c(S)=(S^{\prime },f)\) , and \((S^{\prime },f(T)) \leadsto w\) .
The proof is similar to the proof before, except the \(\Rightarrow\) direction has more cases to consider.
Theorem B.6.
For \(‹a\in \Sigma _{\text{call}}\) , \((S,T)\leadsto ‹a w\) iff \(\delta _{‹a}S=(S^{\prime },f)\) , and \((S^{\prime }, f(T))\leadsto w\) .
The proof is similar to the proof before, except with more cases to consider. The \(\Rightarrow\) direction requires the use of Lemma B.4.
Theorem B.7.
(1)
If \(\delta _{b›}(S, [S_1,‹a])=(S^{\prime },\text{tail})\) , then \((S, [S_1,‹a]\cdot T)\leadsto b› w\) iff \((S^{\prime },T)\leadsto w\) .
(2)
If \(\delta _{b›}(S,\text{None})=(S^{\prime },\lambda T. T)\) , then \((S, \bot)\leadsto b› w\) iff \((S^{\prime },\bot)\leadsto w\) .
Part (1)’s proof is similar to before, except with more cases and the occasional use of Lemma B.4. Part (2)’s proof is straightforward.
Lemma B.8.
Given a VPG \(G=(V,\Sigma ,P,L_0)\) , suppose a PDA is generated according to Algorithm 1. Then, for a string \(w\in \Sigma ^*\) , \((\lbrace (L_0,L_0)\rbrace ,\bot)\leadsto w\) iff w is accepted by the PDA.
The lemma can be proved as before, except with more cases.
Theorem B.9.
For VPG G and its start nonterminal \(L_0\) , a string \(w\in \Sigma ^*\) is derived from \(L_0\) , i.e., \(L_0\rightarrow ^{*}w\) , iff w is accepted by the corresponding PDA.
The proof is as before.

C Parser and Extraction PDAs for General VPGs

In this section, we discuss the parser PDA and the extraction PDA for general VPGs. For general VPGs, the definitions of \(\mathrm{headNT}\left(\right)\), \(\mathrm{nextNT}\left(\right)\), and dotted rules of well-matched VPGs also apply; the definitions for plain symbols generalize naturally to pending call and return symbols. We define the derivative functions of the parser PDA as follows:
Definition C.1 (The Derivative Function p for the Parser PDA) Given a VPG \(G=(V,\Sigma ,P,L_0)\) , suppose the current state of the parser PDA is m and the current stack is T, the transition functions \(p_c\) , \(p_{{&#x2039; {a}}}\) , and \(p_{ {b&#x203A; }}\) are defined as follows:
(1)
For plain symbols, the derivative function of general VPGs is the same as that for well-matched VPGs; we restate the function here: \(p_c(m)=(m^{\prime },\lambda T.T)\) , where
\begin{equation*} m^{\prime }=\lbrace (r^{\prime },\mathrm{nextNT}\left(r\right)\rightarrow c. L_1) \mid (r^{\prime },r)\in m \wedge (\mathrm{nextNT}\left(r\right)\rightarrow c. L_1)\in \dot{P}\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state keeps the context rule \(r^{\prime }\) and updates the current rule to a rule with head \(\mathrm{nextNT}\left(r\right)\) and that derives c.
(2)
For call symbols, the derivative function of general VPGs is similar to that for well-matched VPGs; we only need to introduce a new condition for pending rules: \(p_{‹a}(m)=(m^{\prime },\lambda T.\ m\cdot T)\) , where
(3)
For return symbols, we have two more cases to consider: \(p_{b›}(m,\overline{m_{\text{call}}})=(m^{\prime },\text{tail})\), where \(\text{tail}\) is the function that removes the top of a stack and returns the rest of the stack (for an empty stack, \(\text{tail}\) returns the empty stack), \(\overline{m_{\text{call}}}=\operatorname{head }(T)\) if \(T\ne \bot\) , and \(\overline{m_{\text{call}}}=\emptyset\) if \(T=\bot\) , and
We remove the top of the stack and construct the new state in three cases, depending on the context rule \(r_1\) for a pair \((r_1,r_2)\) in m. We have already discussed the case where \(r_1\) is a matching rule in Definition 6.4. Below are the two new cases for pending rules.
If the context \(r_1\) is a pending rule, then the new state updates the current rule to a rule that rewrites \(\mathrm{nextNT}\left(r_2\right)\) and generates \(b›\). Similar to the first case, we have \(\overline{m_{\text{call}}}\ne \emptyset\); the update of the context rule in the new state is similar.
Otherwise, if there is no context rule, then no call symbol is pending, and the \({b&#x203A; }\) symbol is an unmatched return symbol. In this case, we must have \(\overline{m_{\text{call}}}=\emptyset\) . We construct the new state in a way similar to the second case, except that the context rule is None.
The acceptance configurations for general VPGs also allow the contexts to be pending rules, as shown below.
Definition C.2 (Parser PDA Acceptance Configurations).
Given a VPG \(G=(V,\Sigma ,P,L_0)\), a pair \((m,E)\) is an acceptance configuration for the parser PDA if the state m includes a pair \((r^{\prime },r)\), where \(r^{\prime }\) is either \(\text{None}\) or a pending rule and \(\mathrm{nextNT}\left(r\right)\) derives the empty string \(\epsilon\); i.e., \(\exists (r^{\prime },r)\in m\) s.t. \(r^{\prime }\) is \(\text{None}\) or a pending rule \(\wedge \ (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P\).
Definition C.3 (The Extraction Function \(\operatorname{extract}_\mathrm{init}\) ) We only need to add one more condition to the function \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) to allow a pending rule as the context; the rest are the same as those of well-matched VPGs.
\begin{equation*} \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m)=\lbrace r \mid (r^{\prime },r)\in m, (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P \wedge r^{\prime } \text{ is None or a pending rule} \rbrace . \end{equation*}
Note that this also implies that the stack of an acceptance configuration for general VPGs is not necessarily empty.
Definition C.4 (The One-step Extraction Function).
Similar to the case of the function \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) , we only need to add some additional conditions to the function \(\operatorname{extendable}\) ; the rest are the same as those of well-matched VPGs.
\begin{align*} \operatorname{extendable}((r^{\prime },r),{(v,E)})= {\left\lbrace \begin{array}{ll} \text{true}, & \text{if }r^{\prime } \text{ is None or a pending rule}\ \wedge \\ &\qquad \operatorname{head }(E) \text{ is None or a pending rule}\ \wedge \\ & \qquad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right) \ \vee \\ &r^{\prime } \text{ and } \operatorname{head }(E)\text{ are the same matching rule} \\ & \quad \text{(but with different dot positions)}\ \wedge \\ &\qquad ((\operatorname{firstRule}(v) \in \mathcal {P}_{\text{ret}}\ \wedge \ \\ & \qquad \quad (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P) \ \vee \\ &\qquad \ \ (\operatorname{firstRule}(v) \in \mathcal {P}_{\text{pln}}\cup \mathcal {P}_{\text{call}}\ \wedge \\ & \qquad \qquad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right))),\\ \text{false}, & \text{otherwise}. \\ \end{array}\right.} \end{align*}
Compared with the case of well-matched VPGs, the only difference is that the context \(r^{\prime }\) can not only be None, but also be a pending rule, and, in accordance, the stack should be either empty or have a pending rule as the stack top.

D Correctness Proof of VPG-based Parsing

Given a VPG \(G=(V,\Sigma ,P,L_0)\) and an input string w, let \(V=\operatorname{extract}(\operatorname{forest}(w))\) . We call V correct if it includes exactly the set of valid parse trees of w according to G; i.e.,
\begin{equation*} \forall v,\ {L_0} \Downarrow \left({w},{v}\right) \iff \exists E,\ (v,E)\in V. \end{equation*}
To prove the above, we first state and prove the invariants of p and \(\operatorname{extract}_{\mathrm{oneStep}}\) ; the correctness follows as a corollary.
We first discuss the invariants of the parser derivative function p. Intuitively, a rule sequence \([r_1,\ldots ,r_{|w|}]\) that derives w should be included in \(\operatorname{forest}(w)= [m_1, m_2, \ldots , m_{|w|}]\) ; i.e., there exists \(r_i^{\prime }\) such that \((r_i^{\prime },r_i)\in m_i\) for \(i=1..|w|\) . To formalize a rule sequence, we introduce a forward small-step relation in Figure 12. Judgment \(({v},{E}) \stackrel{i}{\rightarrow } ({v+[r]},{E^{\prime }})\) means that the rule sequence v can be extended with a rule r, which generates the input symbol i. Here, “forward” means that the relation generates the input from left to right and “small-step” means that the relation generates one symbol at a time, as opposed to the big-step relation in Figure 4. Figure 13 further defines the transitive closure of the small-step relation; consequently, \(({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v},{E})\) means that v generates w in multiple steps.
Fig. 12. The forward small-step parse-tree derivation relation, given a VPG \(G=(\Sigma ,V,P,L_0)\) .
Fig. 13. The transitive closure of the forward small-step parse-tree derivation relation.
The forward small-step relation is flexible in that it allows combining and splitting parse trees. We introduce a series of lemmas to formalize those operations: Lemma D.1 shows we can concatenate two forward small-step relations, assuming the first one has an empty stack; the cases of nonempty stacks are covered in Lemmas D.2 and D.3. Lemma D.4 shows that we can split a forward small-step relation into two parts and replace the first part with another small-step relation. Lemma D.5 shows we can split a forward small-step relation into two parts, and the first part is also a small-step relation. The proofs of these lemmas are directly by induction over the length of w; we omit them here.
Lemma D.1 (Concatenating Forward Small-step Relations).
\begin{align*} &\forall v_1\ v_2\ w_1\ w_2\ E,\ ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{\bot }) \ \wedge \ ({[]},{\bot }) \stackrel{w_2}{\longrightarrow ^*} ({v_2},{E}) \ \wedge \ \mathrm{nextNT}\left(v_1\right)=\mathrm{headNT}\left(v_2\right) \Rightarrow \\ &\qquad ({[]},{\bot }) \stackrel{w_1w_2}{\longrightarrow ^*} ({v_1+v_2},{E}). \end{align*}
Lemma D.2 (Concatenation with a Pending Call Rule).
Lemma D.3 (Concatenation with a Matching Call Rule).
Lemma D.4 (Replacing Forward Small-steps).
\begin{align*} & \forall v\ v_1\ r\ E\ w_1,\ ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v+[r]},{E}) \ \wedge \ ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1+[r]},{E}) \Rightarrow \\ & \qquad \forall v_2\ E_2\ w_2,\ ({[]},{\bot }) \stackrel{w_1w_2}{\longrightarrow ^*} ({v+[r]+v_2},{E_2}) \Rightarrow ({[]},{\bot }) \stackrel{w_1w_2}{\longrightarrow ^*} ({v_1+[r]+v_2},{E_2}). \end{align*}
Lemma D.5 (Splitting Forward Small-steps).
\[\begin{multline*} \forall v_1\ v_2\ E\ w, \ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v_1+v_2},{E}) \ \wedge \ v_1 \ne [] \Rightarrow \\ \exists E_1\ w_1, ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{E_1}) \ \wedge \ \exists w_2, w=w_1w_2. \end{multline*}\]
The forward small-step relation is equivalent to the big-step relation under certain conditions; we state the equivalence in Lemma D.7. To prove the equivalence, we first introduce the invariant of the forward small-step relation in Lemma D.6, which splits a forward small-step relation into two parts based on the last unmatched rule and considers them separately.
Lemma D.6 (Invariant of the Forward Small-step Relation).
Proof.
Prove it by induction over the length of w. There are two cases of w: If \(w=i\in \Sigma\) , then the lemma is straightforward to show; so assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\) . By the definition of the small-step relation, we have \(\exists v^{\prime }\ r,\ v=v^{\prime }+[r]\) and \(\exists E^{\prime },({[]},{\bot }) \stackrel{w^{\prime }}{\longrightarrow ^*} ({v^{\prime }},{E^{\prime }})\) ; we apply the induction hypothesis to the small-step relation of \(w^{\prime }\) and have two cases of \(E^{\prime }\) .
In the first case, \(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. There are two cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\cup \Sigma _{\text{ret}}\) . We have \(E=E^{\prime }\) when \(i\in \Sigma _{\text{plain}}\) , and \(E=\operatorname{tail }(E^{\prime })\) when \(i\in \Sigma _{\text{ret}}\) . In both cases, either \(E=\bot\) or \(\operatorname{head }(E)\) is a pending rule.5 For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\) , we extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, and by the induction hypothesis, we have \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w^{\prime }i\hat{w}},{v^{\prime }+[r]+\hat{v}}\right)\) .
(2)
\(i\in \Sigma _{\text{call}}\) . Then, we have \(E=r\cdot E^{\prime }\) . There are two cases of r.
(a)
If r is a pending rule, then for any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\) , we first extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, then we have \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w^{\prime }i\hat{w}},{v^{\prime }+[r]+\hat{v}}\right)\) by the induction hypothesis.
(b)
If r is a matching rule, then we construct \(v_1=v^{\prime }\), \(r_{‹a}=r\), \(v_2=[]\) and \(w_1=w^{\prime }\), \(‹a=i\), \(w_2=\epsilon\). We already have \(({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{E^{\prime }})\). Since both \(v_2\) and \(w_2\) are empty, \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right) \Rightarrow {\mathrm{nextNT}\left(r_{‹a}\right)} \Downarrow \left({w_2\hat{w}},{v_2+\hat{v}}\right)\) is trivial to show.
In the second case, \(\operatorname{head }(E^{\prime })\) is a matching rule, and by the induction hypothesis we have the decomposition \(v^{\prime }=v_1^{\prime }+[r_{‹a^{\prime }}]+v_2^{\prime }\) and \(w^{\prime }=w_1^{\prime } ‹a^{\prime } w_2^{\prime }\).
There are three cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\). Then, we have \(v=v^{\prime }+[r]\ \wedge \ w=w^{\prime }i\).
We construct \(v_1=v_1^{\prime }\), \(r_{‹a}=r_{‹a^{\prime }}\), \(v_2=v_2^{\prime }+[r]\), and \(w_1=w_1^{\prime }\), \(‹a=‹a^{\prime }\), \(w_2=w_2^{\prime }i\). We only need to prove the last clause. For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, and by the induction hypothesis, we have \({\mathrm{nextNT}\left(r_{‹a^{\prime }}\right)} \Downarrow \left({w_2^{\prime }i\hat{w}},{v_2^{\prime }+[r]+\hat{v}}\right)\).
(2)
\(i\in \Sigma _{\text{call}}\) . Then, we have
\begin{equation*} v=v^{\prime }+[r] \ \wedge \ w=w^{\prime }i. \end{equation*}
We construct \(v_1=v^{\prime }\), \(r_{‹a}=r\), \(v_2=[]\), and \(w_1=w^{\prime }\), \(‹a=i\), \(w_2=\epsilon\). The rest of the proof is straightforward.
(3)
\(i\in \Sigma _{\text{ret}}\). Then, we have \(E^{\prime }=r_{‹a^{\prime }}\cdot E\). For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we first construct the big-step \({\mathrm{headNT}\left(r_{‹a^{\prime }}\right)} \Downarrow \left({‹a^{\prime }w_2^{\prime }i\hat{w}},{[r_{‹a^{\prime }}]+v_2^{\prime }+[r]+\hat{v}}\right)\) by definition. Now, there are two cases of E.
(a)
\(E=\bot\) or \(\operatorname{head }(E)\) is a pending rule. If \(w_1^{\prime }=\epsilon\), then we are done. Otherwise, we apply the induction hypothesis to the small-step relation of \(w_1^{\prime }\) and use it to extend the big-step relation to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w\hat{w}},{v+\hat{v}}\right)\).
(b)
\(\operatorname{head }(E)\) is a matching rule. We apply the induction hypothesis to the small-step relation of \(w_1^{\prime }\) , and have
So, we have
By the induction hypothesis, we extend the big-step relation to
Lemma D.7 (Relating Small-step and Big-step Relations).
Proof.
\(\Rightarrow\) . Prove it by induction over the big-step relation \({L} \Downarrow \left({w},{v}\right)\) .
(1)
\(w=i w^{\prime }\) , \(v=[r]+v^{\prime }\) , and \({\mathrm{nextNT}\left(r\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\) . The case of \(w^{\prime }=\epsilon\) is trivial, so assume \(w^{\prime }\ne \epsilon\) . Apply the induction hypothesis to \(w^{\prime }\) and we have
\begin{align*} &\exists E^{\prime },\ ({[]},{\bot }) \stackrel{w^{\prime }}{\longrightarrow ^*} ({v^{\prime }},{E^{\prime }}) \ \wedge \ (\mathrm{nextNT}\left(v^{\prime }\right)\rightarrow \epsilon) \in P \ \wedge \\ &\qquad (\mathrm{nextNT}\left(r\right) \in V^0 \Rightarrow E^{\prime }=\bot) \ \wedge \ \operatorname{head }(E^{\prime }) \text{ is None or a pending rule}. \end{align*}
There are two cases of i.
(a)
\(i\in \Sigma _{\text{call}}\) . We construct the small-step relation \(({[]},{\bot }) \stackrel{i}{\longrightarrow ^*} ({[r]},{r\cdot \bot })\) by definition, and by Lemma D.2, we append \(v^{\prime }\) to \([r]\) and have \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }\cdot [r]})\) or \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }})\) .
(b)
\(i\in \Sigma _{\text{plain}}\cup \Sigma _{\text{ret}}\) . We construct the small-step relation \(({[]},{\bot }) \stackrel{i}{\longrightarrow ^*} ({[r]},{\bot })\) by definition, and by Lemma D.1, we append \(v^{\prime }\) to \([r]\) and have \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }})\) .
(2)
\(w=‹a w_1 b› w_2\), \(v=[r_1]+v_1+[r_2]+v_2\), and \({\mathrm{nextNT}\left(r_i\right)} \Downarrow \left({w_i},{v_i}\right)\), \(i=1,2\). We first construct the small-step relation \(({[]},{\bot }) \stackrel{‹a}{\longrightarrow ^*} ({[r_1]},{r_1\cdot \bot })\). If \(w_1\ne \epsilon\), then we apply the induction hypothesis to \(w_1\) and have the small-step relation \(({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{\bot })\) (note that \(\mathrm{headNT}\left(v_1\right)\in V^0\)). Then, by Lemma D.3, we have \(({[]},{\bot }) \stackrel{‹a w_1}{\longrightarrow ^*} ({[r_1]+v_1},{r_1\cdot \bot })\); we extend it to \(({[]},{\bot }) \stackrel{‹a w_1 b›}{\longrightarrow ^*} ({[r_1]+v_1+[r_2]},{\bot })\) by definition. Finally, if \(w_2\ne \epsilon\), then we apply the induction hypothesis to \(w_2\) and have the small-step relation of \(w_2\). By Lemma D.4, we have \(({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({[r_1]+v_1+[r_2]+v_2},{\bot })\).
\(\impliedby\) . This is a corollary of Lemma D.6.□
We describe the specification of the parser derivative functions in Lemma D.8.
Lemma D.8 (Specification of Parser Derivative Functions).
Now, we give the invariants of the parser PDA. Lemma D.9 states that the forward small-step relation for input \(w_{1} \ldots w_{i}\) is represented in \(m_i\), for all \(i \in [1..|w|]\), where \(m_i\) is the parser PDA state after processing \(w_i\). Lemma D.10 states that each pair in \(m_i\) represents a forward small-step relation. In these lemmas, we write \(\mathcal {P}\) for the transitive closure of the parser transition function (Definition D.1) and use a helper function \(\operatorname{split}(v,E,w)=(v_1,v_2,w_1,w_2)\), which splits v into \(v=v_1+[r_{‹a}]+v_2\) and w into \(w=w_1‹a w_2\), where \(r_{‹a}\) is the rule that generates the last unmatched call symbol \(‹a\). In the special case that w does not include unmatched call symbols, \(\operatorname{split}(v,E,w)=([],v,[],w)\).
Definition D.1 (Transition Closure of the Parser PDA).
Given a string w, we define the relation \(\mathcal {P}({m_0},{\bot },{w}) = ({m},{T})\) , meaning that starting from \((m_0,\bot)\) , running the parser PDA on w and the last configuration is \((m,T)\) .
Lemma D.9 (Invariants of the Parser PDA, Part 1).
\begin{align*} &\forall w\ m\ T,\ \mathcal {P}({m_0},{\bot },{w}) = ({m},{T}) \ \wedge \ {({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v},{E})} \Rightarrow \\ &\qquad \forall v\ E\ v_1\ v_2\ w_1\ w_2,\ (v_1,v_2,w_1,w_2)=\operatorname{split}({v},{E},{w}) \Rightarrow \\ &\qquad \qquad (\operatorname{head }(E),\operatorname{lastRule}(v))\in m \ \wedge \\ &\qquad \qquad \qquad (\operatorname{head }(E)=\text{None}\wedge T=\bot \ \vee \\ &\qquad \qquad \qquad \qquad \operatorname{head }(E)\ne \text{None}\wedge \exists t\ T^{\prime }, T=t::T^{\prime }\wedge \mathcal {P}({m_0},{\bot },{w_1}) = ({t},{T^{\prime }})). \end{align*}
Proof.
Prove it by induction over the length of w.
There are two cases of w: If \(w=i\in \Sigma\) , then we can check the lemma straightforwardly. Next, assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\) . Then, we have \(v=v^{\prime }+[r]\) . We invert \(({[]},{\bot }) \stackrel{w}{\rightarrow } ({v},{E})\) and have \(({[]},{\bot }) \stackrel{w^{\prime }}{\rightarrow } ({v^{\prime }},{E^{\prime }})\) . There are three cases of i.
(1)
\(i \in \Sigma _{\text{plain}}\) . Then \(E=E^{\prime }\) . By the induction hypothesis, we have \((\operatorname{head }(E^{\prime }),\operatorname{lastRule}(v^{\prime }))\in m^{\prime }\) , where \((m^{\prime },T)=p_{w^{\prime }}(m_0,\bot)\) . By Lemma D.8, we have \((\operatorname{head }(E),\operatorname{lastRule}(v))\in m\) . If \(\operatorname{head }(E)=\text{None}\) , i.e., \(E=\bot\) , we are done. Otherwise, let \(\operatorname{split}({v^{\prime }},{E^{\prime }},{w^{\prime }})=(v_1^{\prime },v_2^{\prime },w_1^{\prime },w_2^{\prime })\) . Clearly, we have \(v_1=v_1^{\prime }\) and \(w_1=w_1^{\prime }\) , so we have \(\exists t\ T^{\prime },\ T=t::T^{\prime }\wedge \mathcal {P}({m_0},{\bot },{w_1}) = ({t},{T^{\prime }})\) .
(2)
\(i \in \Sigma _{\text{call}}\) . Then \(E=r\cdot E^{\prime }\) , \(w_1=w^{\prime }\) and \(v_1=v^{\prime }\) . By Lemma D.8, \((r,r)\in m\) and \(\exists T^{\prime },\ T=m^{\prime }::T^{\prime }\wedge \mathcal {P}({m_0},{\bot },{w_1}) = ({m^{\prime }},{T^{\prime }})\) .
(3)
\(i \in \Sigma _{\text{ret}}\) . Let \(\operatorname{split}({v^{\prime }},{E^{\prime }},{w^{\prime }})=(v_1^{\prime },v_2^{\prime },w_1^{\prime },w_2^{\prime })\) . There are two cases of \(E^{\prime }\) .
(a)
\(E^{\prime }=\bot\) . Then, we have \(E=\bot\) , and by Lemma D.8 and the induction hypothesis, we have \((\text{None}, \operatorname{lastRule}(v))\in m\) .
(b)
\(\exists r_{‹a},\ E^{\prime }=r_{‹a}\cdot E\). Then, we have \(r_{‹a}\in E^{\prime }\), and by Lemma D.8 and the induction hypothesis, we have \((\operatorname{head }(E), \operatorname{lastRule}(v))\in m\). If \(\operatorname{head }(E)=\text{None}\), then we are done. Otherwise, we have \(w_1^{\prime }\ne \epsilon\). We apply the induction hypothesis to \(w_1^{\prime }\) and finish the proof.□
Lemma D.10 (Invariants of the Parser PDA, Part 2).
\begin{align*}
&\forall m\ T\ w,\ \mathcal {P}({m_0},{\bot },{w}) = ({m},{T})\ \wedge \ w\ne \epsilon \Rightarrow \\
&\qquad \forall r_1\ r,\ (r_1,r)\in m\ \Rightarrow \\
&\qquad \qquad (r_1=\text{None}\wedge \exists v,\ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v+[r]},{\bot }) \ \vee \\
&\qquad \qquad \qquad r_1\ne \text{None}\wedge \exists v\ H,\ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v+[r]},{r_1::H})).
\end{align*}
Proof.
We prove the lemma by induction on the length of w.
There are two cases of w: If \(w=i\in \Sigma\) , then we can check the lemma straightforwardly. Next, assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\) . Let \(\mathcal {P}({m_0},{\bot },{w^{\prime }}) = ({m^{\prime }},{T^{\prime }})\) .
There are three cases of i.
(1)
\(i \in \Sigma _{\text{plain}}\). For each \((r_1,r)\) in m, we have \(\exists r^{\prime },\ ({r_1},r^{\prime })\in m^{\prime }\) and \(\mathrm{nextNT}\left(r^{\prime }\right)=\mathrm{headNT}\left(r\right)\). The lemma then follows directly from the definition and the induction hypothesis applied to \(w^{\prime }\).
(2)
\(i \in \Sigma _{\text{call}}\). For each \((r_1,r)\) in m, we have \(\exists r_2\ r^{\prime },\ (r_2,r^{\prime })\in m^{\prime }\) and \(\mathrm{nextNT}\left(r^{\prime }\right)=\mathrm{headNT}\left(r\right)\). By Lemma D.8, \(r_1=r\) and \((r_1,r_1)\in m\); by the induction hypothesis applied to \(w^{\prime }\), \(\exists v^{\prime }\ H,\ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v^{\prime }+[r_1]},{r_1::H})\).
(3)
\(i \in \Sigma _{\text{ret}}\) . There are two cases of \(r_1\) .
(a)
\(r_1=\text{None}\). Then \(T=\bot\), and by Lemma D.8 and the induction hypothesis, we have \(\exists v,\ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v+[r]},{\bot })\).
(b)
\(r_1\ne \text{None}\). Then \(\exists t,\ T=t::T^{\prime }\), \(\exists r_3,\ (r_1,r_3)\in t\), \(\exists r_4\ r_5,\ (r_4,r_5)\in m^{\prime }\), and \(\mathrm{nextNT}\left(r_3\right)=\mathrm{headNT}\left(r_4\right)\). By Lemma D.8, we have \(\exists v^{\prime }\ H,\ ({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v^{\prime }+[r]},{r_1::H})\).□
Our next step is to introduce the invariants of extraction functions, which build parse trees in a backward way. To formalize the parse trees built by the backward extraction process, we introduce the backward small-step relation in Figure 14. This relation is similar to the forward one, only in the reverse direction. The corresponding transitive closure is defined in Figure 15.
Fig. 14.
Fig. 14. The backward small-step parse-tree derivation, given a VPG \(G=(\Sigma ,V,P,L_0)\) .
Fig. 15.
Fig. 15. The transitive closure of the backward small-step relation.
Similar to the forward small-step relation, the backward small-step relation is equivalent to the big-step relation under some conditions. We state the equivalence in Lemma D.13. To prove the lemma, we first give the invariants of the backward small-step relation in Lemma D.11.
Lemma D.11 (Invariants of the Backward Small-step Relation).
Proof.
We prove the lemma by induction on the length of w.
There are two cases of w: If \(w=i\in \Sigma\), then we can prove the lemma straightforwardly; so assume \(w=iw^{\prime }\) with \(w^{\prime }\ne \epsilon\), so that \(v=[e]+v^{\prime }\). From \(({[]},{\bot }) \stackrel{w}{\rightsquigarrow ^*} ({v},{E})\), we have \(\exists E^{\prime },\ ({[]},{\bot }) \stackrel{w^{\prime }}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\). There are three cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\). Then, we have \(E=E^{\prime }\). If \(\operatorname{head }(E)\) is None or a pending rule, then we prove the lemma by extending \({\mathrm{headNT}\left(v^{\prime }\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw^{\prime }},{[e]+v^{\prime }}\right)\). So, assume \(\operatorname{head }(E)\) is a matching rule. Let \(v^{\prime }=v_1+[r]+v_2\); we prove the lemma by extending \({\mathrm{headNT}\left(v_1\right)} \Downarrow \left({w_1},{v_1}\right)\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw_1},{[r]+v_1}\right)\).
(2)
\(i\in \Sigma _{\text{call}}\). If \(\operatorname{head }(E)\) is None or a pending rule, then we prove the lemma by extending the big-step relation of \(v^{\prime }\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw^{\prime }},{[e]+v^{\prime }}\right)\). So, assume \(\operatorname{head }(E)\) is a matching rule. Let \(v^{\prime }=v_1^{\prime }+[r^{\prime }]+v_2^{\prime }\) and \(w^{\prime }=w_1^{\prime }\,b^{\prime }›\,w_2^{\prime }\). By induction, we have \(\exists E_2,\ ({[]},{\bot }) \stackrel{w_2^{\prime }}{\rightsquigarrow ^*} ({v_2^{\prime }},{E_2})\). If \(\operatorname{head }(E_2)\) is None or a pending rule, then we can directly construct \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w},{[e]+v_1^{\prime }+[r^{\prime }]+v_2^{\prime }}\right)\). If \(\operatorname{head }(E_2)\) is a matching rule, then we have \(v_2^{\prime }=v_1^{\prime \prime }+[r^{\prime \prime }]+v_2^{\prime \prime }\) and \(w_2^{\prime }=w_1^{\prime \prime }\,b^{\prime \prime }›\,w_2^{\prime \prime }\). We prove the lemma by combining the big-steps of \(v_1^{\prime }\) and \(v_1^{\prime \prime }\) into \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw_1^{\prime }\,b^{\prime }›\,w_1^{\prime \prime }},{[e]+v_1^{\prime }+[r^{\prime }]+v_1^{\prime \prime }}\right)\).
(3)
\(i\in \Sigma _{\text{ret}}\) . This case is trivial.□
Just as for the forward small-step relation, two backward small-step relations can be combined. We state this in Lemma D.12.
Lemma D.12 (Combining Backward Small-step Relations).
Proof.
We prove the lemma by induction on the big-step relation \({L} \Downarrow \left({w},{v}\right)\). Since \(L\in V^0\), there are two cases.
(1)
\(w=cw^{\prime }\) and \(v=[r]+v^{\prime }\). If \(w^{\prime }=\epsilon\), then we can prove the lemma straightforwardly; so, assume \(w^{\prime }\ne \epsilon\) and apply the induction hypothesis to \({\mathrm{headNT}\left(v^{\prime }\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\). We can directly extend the backward small-step relation of \(v^{\prime }\) with \([r]\). For any \(({[]},{\bot }) \stackrel{w_2}{\rightsquigarrow ^*} ({[L_1\rightarrow ‹a\,L_2\,b› . L_3]+v_2},{E_2})\), we first extend it with the small-step relation of \(v^{\prime }\) by the induction hypothesis, then further extend the small-step relation with \([r]\) by definition.
(2)
\(w=‹a^{\prime }\,w_1^{\prime }\,b^{\prime }›\,w_2^{\prime }\) and \(v=[r_{‹a^{\prime }}]+v_1^{\prime }+[r_{b^{\prime }›}]+v_2^{\prime }\). Note that, since \(L\in V^0\), we have \(\mathrm{headNT}\left({v_2^{\prime }}\right)\in V^0\). For any \(({[]},{\bot }) \stackrel{w_2}{\rightsquigarrow ^*} ({[L_1\rightarrow ‹a\,L_2\,b› . L_3]+v_2},{E_2})\), by the induction hypothesis of \(v_2^{\prime }\) and the definition of the backward small-step, we extend it with \(v_2^{\prime }\), \(r_{b^{\prime }›}\), \(v_1^{\prime }\), and \(r_{‹a^{\prime }}\) in order.□
Lemma D.13 (Correctness of the Backward Small-step Relation).
Proof.
\(\Rightarrow\). We prove this direction by induction on the big-step relation \({{L}} \Downarrow \left({w},{v}\right)\).
(1)
\(w=i w^{\prime }\) , \(v=[r]+v^{\prime }\) , and \({\mathrm{nextNT}\left(r\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\) , where r is not a matching rule. The case of \(w^{\prime }=\epsilon\) is trivial; so, assume \(w^{\prime }\ne \epsilon\) . Apply the induction hypothesis to \(w^{\prime }\) and we have \(\exists E^{\prime },\ ({[]},{\bot }) \stackrel{w^{\prime }}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) . We can directly extend this backward small-step relation with \([r]\) .
(2)
\(w=‹a\,w_1\,b›\,w_2\), \(v=[r_1]+v_1+[r_2]+v_2\), and \({\mathrm{nextNT}\left(r_i\right)} \Downarrow \left({w_i},{v_i}\right)\), \(i=1,2\). If \(w_2\ne \epsilon\), then we apply the induction hypothesis to \({\mathrm{headNT}\left(v_2\right)} \Downarrow \left({w_2},{v_2}\right)\) and have the small-step relation of \(w_2\); we extend it to \(({[]},{\bot }) \stackrel{b›\,w_2}{\rightsquigarrow ^*} ({[r_2]+v_2},{r_2\cdot \bot })\) by definition. If \(w_2=\epsilon\), then we have \(v_2=[]\) and this small-step also holds. Next, if \(w_1\ne \epsilon\), then we apply the induction hypothesis to \({\mathrm{headNT}\left(v_1\right)} \Downarrow \left({w_1},{v_1}\right)\) and have the small-step relation of \(w_1\); by Lemma D.12, we construct \(({[]},{\bot }) \stackrel{w_1\,b›\,w_2}{\rightsquigarrow ^*} ({v_1+[r_2]+v_2},{\bot })\). If \(w_1=\epsilon\), then this small-step also holds. Finally, we extend the small-step relation to \(({[]},{\bot }) \stackrel{w}{\rightsquigarrow ^*} ({[r_1]+v_1+[r_2]+v_2},{\bot })\) by definition.
\(\impliedby\) . This is a corollary of Lemma D.11.□
We give the specification of \(\operatorname{extract}_\mathrm{init}\) and \(\operatorname{extract}_{\mathrm{oneStep}}\) in Lemmas D.14, D.15, and D.16. The proofs are directly based on the definitions, and we omit them here.
Lemma D.14 (The Specification of \(\operatorname{extract}_\mathrm{init}\) )
Lemma D.15 shows that if m is a PDA state and V includes partial parse trees related by backward small-step relations, then \(\operatorname{extract}_{\mathrm{oneStep}}\) extends those relations; Lemma D.16 further shows that every extended relation arises from a rule in m combined with a tree in V.
Lemma D.15 (The Specification of \(\operatorname{extract}_{\mathrm{oneStep}}\) , Part I)
Lemma D.16 (The Specification of \(\operatorname{extract}_{\mathrm{oneStep}}\) , Part II)
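Schematically, extraction is a fold from the last PDA state backward: \(\operatorname{extract}_\mathrm{init}\) seeds single-rule candidates from the final state, and \(\operatorname{extract}_{\mathrm{oneStep}}\) extends every candidate by one token. The following OCaml sketch records only this control structure, with both functions passed in and all types left abstract and illustrative.

    (* extract init one_step forest: seed candidates from the last state of
       the forest, then extend them one state at a time, right to left. *)
    let extract
        (init : 'state -> 'cand list)
        (one_step : 'state -> 'cand list -> 'cand list)
        (forest : 'state list) : 'cand list =
      match List.rev forest with
      | [] -> []
      | last :: earlier ->
        List.fold_left (fun cands m -> one_step m cands) (init last) earlier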
Now, we introduce the invariants of the extraction function.
Lemma D.17 (Invariants of the Extraction Function).
Given a VPG \(G=(\Sigma ,V,P,L_0)\) , we have
Proof.
We prove the lemma by induction on \(w_2\).
When \(w_2=\epsilon\) , we have \(V=\operatorname{extract}_\mathrm{init}(m)\) .
\(\Rightarrow\) . For each \((v,E)\in \operatorname{extract}_\mathrm{init}(m)\) , by Lemma D.14, we have \(({[]},{\bot }) \stackrel{i}{\rightsquigarrow } ({v},{E})\) and \(\exists r, v=[r]\) . By Lemma D.10, we have
\begin{equation*} \exists v_1\ E_1, ({[]},{\bot }) \stackrel{w_1w_2i}{\longrightarrow ^*} ({v_1+[r]},{E_1}). \end{equation*}
By Lemmas D.7 and D.14, the above small-step relation can be converted to a big-step relation; i.e., \({L_0} \Downarrow \left({w_1w_2i},{v_1+v}\right)\) . If \(w_1=\epsilon\) , then \(v_1=[]\) and \(w_1w_2i=i\) , so \({L_0} \Downarrow \left({i},{v}\right)\) . If \(w_1\ne \epsilon\) , then we can invert the above small-step relation to get \(\exists E_1,\ ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{E_1})\) .
\(\impliedby\). For each \((v,E)\) satisfying \(({[]},{\bot }) \stackrel{w_1w_2i}{\rightsquigarrow ^*} ({v},{E})\), if \(w_1=\epsilon\) and \({L_0} \Downarrow \left({w_2i},{v}\right)\), then we first convert the big-step relation to a small-step relation by Lemma D.7; then, by Lemma D.9, we have \((\operatorname{head }(E),v)\in m\), and by Lemma D.14, we have \((v,E)\in V\). If \(w_1\ne \epsilon\) and \({L_0} \Downarrow \left({w_1i},{v_1+v}\right)\), then we can similarly show that \((\operatorname{head }(E),v)\in m\) and \((v,E)\in V\).
When \(w_2\ne \epsilon\), write \(w_2=i^{\prime }w_3\) and \(M_2=[m_2^{\prime }] + M_2^{\prime }\); then we have \(V=\operatorname{extract}_{\mathrm{oneStep}}(m_2^{\prime },\operatorname{extract}(M_2^{\prime } + [m]))\).
\(\Rightarrow\) . For each \((v,E)\in V\) , by Lemma D.16, we have
\begin{equation*} \exists r\ v^{\prime }\ r_1\ E^{\prime }, v=[r]+v^{\prime } \wedge (r_1,r)\in {m_2^{\prime }} \wedge (v^{\prime },E^{\prime })\in \operatorname{extract}(M_2^{\prime } + [m]). \end{equation*}
By Lemma D.10, we have
\begin{equation*} \exists v_r\ E_r,\ ({[]},{\bot }) \stackrel{w_1{i^{\prime }}}{\longrightarrow ^*} ({v_r+[r]},{E_r}). \end{equation*}
By the induction hypothesis, we have
As an overview, we use \(v_r\) and \(v_1\) to construct the counterpart of v. In the simplest case, \(r_1=\text{None}\) and \(v_r\) can be used directly; in the more complex case, \(r_1\) is a matching rule, and we need to combine \(v_r\) and \(v_1\) to construct the counterpart. There are two cases of \(r_1\).
(1)
\(r_1=\text{None}\) or a pending rule. We will prove \({L_0} \Downarrow \left({w},{v_r+v}\right)\) . We first convert the backward small-step \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) to a forward small-step. There are two cases of \(E^{\prime }\) .
(a)
\(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. We convert \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) to a big-step by Lemma D.13 and then convert the big-step to a forward small-step by Lemma D.7.
(b)
\(\operatorname{head }(E^{\prime })\) is a matching rule. This is not possible by Lemma D.16.
Then, we combine \(v_r+[r]\) and \(v^{\prime }\) to \(v_r+[r]+v^{\prime }\) by Lemma D.1. Finally, we convert \(v_r+[r]+v^{\prime }\) to a big-step by Lemma D.7 and finish the proof.
(2)
\(r_1\) is a matching rule. There are two cases of \(E^{\prime }\) .
(a)
\(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. This is not possible by Lemma D.16.
(b)
\(\operatorname{head }(E^{\prime })\) is a matching rule. We use the split function to split \(v_1\) into \(v_{1,1}+[r_{1,‹a}]+v_{1,2}\) and to split \(v_r+[r]\) into \(v_{r,1}+[r_{r,‹a}]+v_{r,2}\). Then, we construct a new forward small-step \(({[]},{\bot }) \stackrel{w_1w_2i}{\longrightarrow ^*} ({v_{1,1}+[r_{1,‹a}]+v_{r,2}},{E^{\prime \prime }})\) for some \(E^{\prime \prime }\). Then, we extend this small-step with \(\operatorname{firstRule}(v^{\prime })\) and use Lemma D.4 to further extend the small-step relation to \(v_{1,1}+[r_{1,‹a}]+v_{r,2}+v^{\prime }\). Finally, we convert the small-step relation to a big-step by Lemma D.7.
\(\impliedby\) . For each \((v,E)\) that satisfies \(({[]},{\bot }) \stackrel{w_1w_2i}{\rightsquigarrow ^*} ({v},{E})\) , let \(v=[r]+v^{\prime }\) for some \(v^{\prime }\) and r, and \(w_2=i^{\prime }w_3\) for some \(i^{\prime }\) and \(w_3\) . We invert the backward small-step of \(w_2i\) and have \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) for some \(E^{\prime }\) . Then, we convert \({L_0} \Downarrow \left({w_1w_2i},{v_1+v}\right)\) to a forward small-step \(({[]},{\bot }) \stackrel{w_1w_2i}{\longrightarrow ^*} ({v_1+v},{E})\) for some E, and by Lemma D.5, we have the forward small-step \(({[]},{\bot }) \stackrel{w_1i^{\prime }}{\longrightarrow ^*} ({v_1+[r]},{E_r})\) for some \(E_r\) . With these small-steps, we apply the induction hypothesis and have \((v^{\prime },E^{\prime })\in \operatorname{extract}(M_2^{\prime } + [m])\) , where \(M_2=[m_2^{\prime }]+ M_2^{\prime }\) for some \(m_2^{\prime }\) . Then, by the definition of \(\operatorname{extract}_{\mathrm{oneStep}}\) , we have \((v,E)\in V\) .□
By Lemma D.17, when \(w_1=\epsilon\) , we have the correctness of the parser generator.
Theorem D.18 (Correctness of the Parser Generator).
\begin{align*} \forall w\ V,\ V=\operatorname{extract}(\operatorname{forest}(w)) \Rightarrow (\forall v,\ {L_0} \Downarrow \left({w},{v}\right) \iff \exists E, (v,E)\in V). \end{align*}
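Theorem D.18 is the contract of the top-level pipeline: run the parser PDA over w to obtain the forest, then extract. The following is a hypothetical sketch, with both components passed in so that it stays independent of any concrete grammar; names are illustrative.

    (* parse_all: all parse trees of w; w is accepted iff the result is
       nonempty. *)
    let parse_all
        ~(forest : 'tok list -> 'state list)
        ~(extract : 'state list -> 'tree list)
        (w : 'tok list) : 'tree list =
      extract (forest w)

    let accepts ~forest ~extract w = parse_all ~forest ~extract w <> []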

E Correspondence Between the Theorems and the Coq Proofs

Table 6 shows the theorems discussed in this article, the files that include the corresponding Coq theorems and their proofs, and the names of the Coq theorems.
Table 6.
Theorem | File | Name
Theorem D.1 | ForwardSmallStep | L4_1
Theorem D.2 | ForwardSmallStep | L4_3
Theorem D.3 | ForwardSmallStep | L4_4
Theorem D.4 | ForwardSmallStep | L_DF_Local
Theorem D.5 | ForwardSmallStep | DF_SPLIT2
Theorem D.6 | ForwardSmallStep | L4_2
Theorem D.7 | ForwardSmallStep | SoundV
Theorem D.8 | GenForest | L_m2PlainM, L_m2CallM, L_m2RetM
Theorem D.9 | GenForest | PForest1
Theorem D.10 | GenForest | PForest2
Theorem D.11 | BackwardSmallStep | BreakDV
Theorem D.12 | BackwardSmallStep | L4_3
Theorem D.13 | BackwardSmallStep | CompleteM, SoundV
Theorem D.14 | Transducer | L_f_init
Theorem D.15 | Transducer | L_f_b
Theorem D.16 | Transducer | pg2
Theorem D.17 | Transducer | L_extract
Theorems 6.1 and 6.3 | TimedExtraction | Property_VPG_Parser_Generator
Theorem 6.2 | TimedRunPDA | bound_cost_run_PDA
Table 6. The Correspondence between the Theorems and the Mechanized Proofs

F Examples of Tagged CFGs

The following is a tagged CFG for JSON [3], where nonterminals start with lowercase characters, such as json, and terminals start with uppercase characters, such as “STRING.” Also, call and return symbols are tagged with “<” or “>”, respectively. The declarations of terminals are omitted.
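The original listing does not survive in this version of the text; the following fragment is a reconstruction in the same notation, based on the usual shape of a JSON grammar, so the exact token and rule names are illustrative:

    json  : value ;
    value : obj | arr | STRING | NUMBER | TRUE | FALSE | NULL ;
    obj   : <LCURLY pair (COMMA pair)* RCURLY> | <LCURLY RCURLY> ;
    pair  : STRING COLON value ;
    arr   : <LBRACKET value (COMMA value)* RBRACKET> | <LBRACKET RBRACKET> ;

Here LCURLY and LBRACKET are tagged as call symbols and RCURLY and RBRACKET as return symbols.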
The following is a tagged CFG for HTML, which is adapted from the HTML grammar from the repository of ANTLR [3]:
The following is a tagged CFG for XML adapted from ANTLR [3]:

G The Translation Algorithm

Translating simple forms to linear forms. The translation algorithm is based on the dependency graph G. We start by removing from G any dependency edge \((L,L^{\prime })\) that is labeled with \((s_1, s_2)\); intuitively, a matched token \(‹a\,L^{\prime }\,b›\) can be conceptually viewed as a “plain symbol,” and its presence does not affect the following translation. We use the symbol t for either a plain symbol or a matched token. Further, a rule whose head is L is called a rule of L.
We find all strongly connected components (SCCs) in the dependency graph G and collapse each SCC into a single node; the resulting condensation graph is written as \(G^{\mathit {SCC}}\), which is a directed acyclic graph (DAG). For a nonterminal L, we write \(G^{\mathit {SCC}}[L]\) for the nonterminals in its SCC. Because G has passed the validator, we know that any edge between nonterminals \(L_1\) and \(L_2\) in \(G^{\mathit {SCC}}[L]\) must be labeled with \((s,\epsilon)\).
We write \(\mathrm{allDep}(G^{\mathit {SCC}},L)\) for all nonterminals that L may (transitively) depend on according to \(G^{\mathit {SCC}}\) ; that is,
\begin{equation*} \mathrm{allDep}(G^{\mathit {SCC}},L) = \lbrace L^{\prime }\,\mid \,G^{\mathit {SCC}}[L]\ \mbox{transitively depends on}\ G^{\mathit {SCC}}[L^{\prime }]\rbrace . \end{equation*}
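Computed directly, \(\mathrm{allDep}\) is plain reachability in the condensation DAG. A small OCaml sketch, assuming the DAG is given as a duplicate-free edge list over SCC identifiers (an illustrative representation):

    (* all_dep edges l: every SCC reachable from l by one or more edges. *)
    let all_dep (edges : (int * int) list) (l : int) : int list =
      let rec reach visited = function
        | [] -> visited
        | x :: rest ->
          (* Successors of x not yet visited. *)
          let fresh =
            List.filter_map
              (fun (u, v) ->
                 if u = x && not (List.mem v visited) then Some v else None)
              edges
          in
          reach (fresh @ visited) (fresh @ rest)
      in
      reach [] [l]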
The translation algorithm maintains a set \(V_\mathrm{done}\) of processed nonterminals, a map T, and a set P containing the current rules. Set \(V_\mathrm{done}\) tracks the nonterminals whose rules have already been processed, meaning those rules are in the linear form; it is initialized to the empty set. The algorithm uses T to keep track of new nonterminals created during translation; it is initialized to the empty map. During translation, a new nonterminal \(L_1\) may be created for a string of the form \(Ls\), where L is a nonterminal and \(s \in (\Sigma \cup V)^+\); the algorithm then adds the mapping \((Ls, L_1)\) to T. Finally, P is initialized to the set of rules of the input grammar; during translation, new rules may be added to P and some rules may be removed.
At each iteration, if \(V_\mathrm{done}\) contains all nonterminals in G, then the algorithm terminates (which implies that all rules are already in linear forms). Otherwise, pick a nonterminal \(L \notin V_\mathrm{done}\) such that \(\mathrm{allDep}(G^{\mathit {SCC}},L) \subseteq V_\mathrm{done}\); that is, the nonterminals that \(G^{\mathit {SCC}}[L]\) (transitively) depends on have only linear-form rules. Since \(G^{\mathit {SCC}}\) is a DAG, we can always find such an L.
If all rules of L are in the linear form, then add L to \(V_\mathrm{done}\) and move to the next iteration. Otherwise, pick a non-linear-form rule \(L \rightarrow t_1 \ldots t_k L_1 s\), where each \(t_i\) is either a plain symbol or a matched token, \(L_1\) is the left-most nonterminal (outside any matched token) in the rule, and \(s \in (\Sigma \cup V)^+\). We must have \(L_1 \notin G^{\mathit {SCC}}[L]\), since otherwise the rule would violate the conditions imposed by the validator. So the rules of \(L_1\) (as well as the rules of those nonterminals that \(L_1\) transitively depends on) must already be in the linear form.
With the non-linear form rule \(L \rightarrow t_1 \ldots t_k L_1 s\) , the algorithm performs multiple rounds to transform it to a set of linear-form rules.
The goal of the first round is to invoke \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\), defined in Algorithm 5, to remove the dependency from L to \(L_1\). When \(k\gt 0\), the procedure introduces a new nonterminal \(L_1^{\prime }\) (if \(L_1s\) is not in the translation table already), changes the rule to \(L \rightarrow t_1 \ldots t_k L_1^{\prime }\), and adds the mapping \((L_1s, L_1^{\prime })\) to the translation table T; it further recursively invokes \(\mathrm{linearizeFirstNT(L_1^{\prime } \rightarrow L_1s)}\) to remove the dependency from \(L_1^{\prime }\) to \(L_1\). When \(k=0\), the rule becomes \(L \rightarrow L_1 s\), and the procedure in Algorithm 5 uses the rules of \(L_1\) to rewrite \(L_1s\); note that in this case \(L_1\) can have only linear-form rules. After \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) is finished, this round removes rule \(L \rightarrow t_1 \ldots t_k L_1 s\) from P and updates the dependency graph G (as new nonterminals are added and rules are changed).
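Since Algorithm 5 itself is not reproduced here, the following OCaml sketch records our reading of the procedure just described, over a toy grammar representation; the \(k=0\) branch follows the prose only, and all names are illustrative. Termination relies on the memo table T and the validator's guarantees, as established in Lemma G.1.

    type symb = T of string                 (* plain symbol or matched token *)
              | N of string                 (* nonterminal *)
    type rule = string * symb list

    let table : (symb list, string) Hashtbl.t = Hashtbl.create 16   (* map T *)
    let rules : rule list ref = ref []                              (* set P *)
    let fresh = let n = ref 0 in fun () -> incr n; Printf.sprintf "L%d'" !n

    (* linearize_first_nt (l, rhs) rules_of: remove the dependency of l on
       the left-most nonterminal of rhs, where rhs = t1 ... tk L1 s. *)
    let rec linearize_first_nt ((l, rhs) : rule) (rules_of : string -> rule list) =
      let rec leading_ts acc = function
        | T t :: rest -> leading_ts (T t :: acc) rest
        | rest -> (List.rev acc, rest)
      in
      match leading_ts [] rhs with
      | _, [] | _, [ N _ ] ->
        rules := (l, rhs) :: !rules          (* already in linear form *)
      | ts, (N l1 :: s) when ts <> [] ->
        (* k > 0: abbreviate the suffix L1 s by a nonterminal from T,
           creating a fresh one (and recursing on its rule) if necessary. *)
        (match Hashtbl.find_opt table (N l1 :: s) with
         | Some l1' -> rules := (l, ts @ [ N l1' ]) :: !rules
         | None ->
           let l1' = fresh () in
           Hashtbl.add table (N l1 :: s) l1';
           rules := (l, ts @ [ N l1' ]) :: !rules;
           linearize_first_nt (l1', N l1 :: s) rules_of)
      | _, (N l1 :: s) ->
        (* k = 0: rewrite L1 s using the (already linear) rules of L1. *)
        List.iter
          (fun (_, alpha) -> linearize_first_nt (l, alpha @ s) rules_of)
          (rules_of l1)
      | _, (T _ :: _) -> assert false        (* leading_ts consumed all T's *)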
It is easy to see from Algorithm 5 that all non-linear form rules added in the first round have the form of \(L_u \rightarrow \vec{t} s\) , where \(L_u\) can be L or a newly generated nonterminal, \(\vec{t}\) is a sequence of plain symbols or matched tokens, and s is the string in the original rule \(L \rightarrow t_1 \ldots t_k L_1 s\) . Compared to the original rule, progress has been made, since the size of s is less than the size of \(L_1s\) .
In the next round, the algorithm then takes rules of form \(L_u \rightarrow \vec{t} s\) and feeds those non-linear ones to \(\mathrm{linearizeFirstNT(-)}\) to remove the dependency from \(L_u\) to the first nonterminal in s. We continue these rounds until the original rule \(L \rightarrow t_1 \ldots t_k L_1 s\) is completely linearized.
After \(L \rightarrow t_1 \ldots t_k L_1 s\) is completely linearized, the algorithm continues to the next iteration.
We have now finished the discussion of the translation algorithm. For its correctness, we can prove that the algorithm produces an equivalent grammar as the input grammar by showing that each rewriting step creates an equivalent grammar.
For termination, the following lemma shows that \(\mathrm{linearizeFirstNT(-)}\) terminates:
Lemma G.1.
\(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) terminates if \(L_1\) and the nonterminals in \(\mathrm{allDep} (G^{\mathit {SCC}},L_1)\) have only linear-form rules.
Proof.
Inspecting Algorithm 5, we know that a recursive call to \(\mathrm{linearizeFirstNT(L_u^{\prime } \rightarrow L_u s)}\) is made only when \(L_u \in \lbrace L_1\rbrace \cup \mathrm{allDep}(G^{\mathit {SCC}},L_1)\) and \(L_us\) is not in T. Since the set \(\lbrace L_1\rbrace \cup \mathrm{allDep}(G^{\mathit {SCC}},L_1)\) is finite and \(L_us\) is added to T when a recursive call \(\mathrm{linearizeFirstNT(L_u^{\prime } \rightarrow L_u s)}\) is made, only a finite number of recursive calls can be made. Therefore, the procedure terminates.□
Further, as mentioned earlier, all rules generated when calling \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) have the form \(L_u \rightarrow \vec{t} s\), from which we know that the distance from the first nonterminal in \(\vec{t} s\) to the end is shorter than the distance from \(L_1\) to the end in \(t_1 \ldots t_k L_1 s\). Moreover, it is easy to show that only a finite number of rules of the form \(L_u \rightarrow \vec{t} s\) are generated. In summary, at each round of linearizing the rule \(L \rightarrow t_1 \ldots t_k L_1 s\), the number of invocations of \(\mathrm{linearizeFirstNT(-)}\) is finite, and round \(j+1\) is simpler than round j in the sense that a rule in round \(j+1\) has a shorter distance from the first nonterminal to the end of the rule. Therefore, the process of linearizing rule \(L \rightarrow t_1 \ldots t_k L_1 s\) terminates.
Finally, since the number of rules in the input grammar is finite, there is only a finite number of iterations. Therefore, the translation algorithm terminates.

H The Full Evaluation

The full evaluation results are shown in Tables 7–11.
Table 7.
Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex
blog_entries | 1,402 | 11.2 ms | 0.8 ms | 0.08 ms | 0.1 ms | 0.2 ms | 11.0 ms | 5.5 ms
gists | 6,302 | 15.1 ms | 1.9 ms | 0.36 ms | 0.5 ms | 0.9 ms | 12.3 ms | 3.9 ms
emojis | 7,198 | 14.7 ms | 2.0 ms | 0.40 ms | 0.5 ms | 0.9 ms | 12.8 ms | 4.6 ms
github | 11,680 | 17.2 ms | 2.7 ms | 0.77 ms | 1.4 ms | 2.2 ms | 15.9 ms | 6.8 ms
poked | 12,916 | 18.4 ms | 3.4 ms | 0.88 ms | 1.5 ms | 2.4 ms | 14.6 ms | 5.6 ms
Members | 13,226 | 17.8 ms | 4.3 ms | 0.87 ms | 1.3 ms | 2.2 ms | 13.0 ms | 3.9 ms
senator | 17,932 | 19.4 ms | 4.2 ms | 1.19 ms | 1.8 ms | 3.0 ms | 17.4 ms | 6.4 ms
AskReddit | 18,975 | 20.1 ms | 3.7 ms | 1.33 ms | 2.0 ms | 3.4 ms | 16.8 ms | 8.1 ms
parliament | 27,182 | 21.6 ms | 4.8 ms | 1.98 ms | 3.3 ms | 5.3 ms | 16.5 ms | 12.0 ms
prize | 30,210 | 22.1 ms | 5.8 ms | 2.24 ms | 3.2 ms | 5.4 ms | 17.2 ms | 10.6 ms
municipis | 39,292 | 27.2 ms | 5.6 ms | 3.06 ms | 3.8 ms | 6.8 ms | 23.8 ms | 11.6 ms
reddit_all | 52,941 | 26.5 ms | 8.0 ms | 4.14 ms | 7.0 ms | 11.1 ms | 26.1 ms | 15.1 ms
y77d_th95 | 54,666 | 27.2 ms | 8.3 ms | 4.04 ms | 5.6 ms | 9.6 ms | 23.8 ms | 12.5 ms
twitter | 55,264 | 23.9 ms | 6.8 ms | 4.37 ms | 5.5 ms | 9.9 ms | 104.9 ms | 24.3 ms
representative | 70,750 | 28.7 ms | 7.7 ms | 4.92 ms | 7.6 ms | 12.5 ms | 29.2 ms | 24.8 ms
transactions | 71,744 | 30.5 ms | 9.5 ms | 5.54 ms | 8.1 ms | 13.6 ms | 24.2 ms | 19.5 ms
laureate | 79,953 | 31.6 ms | 11.0 ms | 4.90 ms | 11.3 ms | 16.2 ms | 25.2 ms | 12.9 ms
catalog | 135,991 | 53.5 ms | 20.2 ms | 7.33 ms | 17.0 ms | 24.3 ms | 39.8 ms | 39.1 ms
canada | 334,374 | 103.7 ms | 49.9 ms | 17.02 ms | 51.5 ms | 68.6 ms | 46.7 ms | 39.5 ms
educativos | 426,398 | 92.3 ms | 40.5 ms | 20.47 ms | 48.2 ms | 68.7 ms | 108.5 ms | 75.3 ms
airlines | 555,410 | 106.5 ms | 58.0 ms | 30.01 ms | 68.2 ms | 98.2 ms | 87.8 ms | 91.3 ms
movies | 751,917 | 171.2 ms | 81.9 ms | 39.40 ms | 125.1 ms | 164.5 ms | 87.9 ms | 97.9 ms
js | 1,288,350 | 224.5 ms | 127.5 ms | 72.14 ms | 161.9 ms | 234.0 ms | 141.6 ms | 141.4 ms
Table 7. The Parsing Times of JSON
Table 8.
Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex
nav_63_03 | 51 | 10.18 | 1.40 | 0.02 | 0.01 | 0.03 | 10.56 | 0.76
nav_78_03 | 54 | 8.90 | 1.32 | 0.02 | 0.02 | 0.04 | 10.44 | 0.74
nav_50_05 | 23 | 9.87 | 1.74 | 0.03 | 0.02 | 0.05 | 11.06 | 0.98
cd_catalog | 733 | 9.71 | 1.58 | 0.04 | 0.03 | 0.07 | 9.41 | 0.67
form | 740 | 10.50 | 1.68 | 0.04 | 0.03 | 0.07 | 14.47 | 1.35
OfficeOrder | 898 | 10.58 | 1.89 | 0.05 | 0.04 | 0.09 | 13.40 | 1.17
book-order | 2,723 | 12.73 | 2.73 | 0.14 | 0.28 | 0.42 | 12.69 | 1.80
book | 2,723 | 13.38 | 2.96 | 0.14 | 0.28 | 0.41 | 12.69 | 1.76
bioinfo | 3,165 | 13.51 | 2.80 | 0.17 | 0.27 | 0.44 | 15.93 | 2.36
soap_small | 3,415 | 13.24 | 3.27 | 0.17 | 0.37 | 0.54 | 14.40 | 2.04
cd_big | 4,829 | 14.69 | 3.24 | 0.26 | 0.60 | 0.87 | 13.36 | 2.49
soap_mid | 17,015 | 20.26 | 7.34 | 1.17 | 2.92 | 4.09 | 18.89 | 10.89
ORTCA | 39,072 | 30.02 | 8.30 | 3.22 | 6.31 | 9.53 | 54.09 | 71.00
blog | 43,000 | 24.52 | 11.35 | 3.38 | 6.92 | 10.30 | 55.60 | 35.01
SUAS | 118,446 | 54.32 | 30.51 | 6.37 | 20.56 | 26.93 | 88.16 | 130.68
po1m | 128,742 | 65.65 | 28.18 | 6.54 | 20.74 | 27.28 | 48.21 | 30.79
soap | 340,015 | 125.32 | 58.69 | 16.15 | 55.71 | 71.86 | 67.39 | 52.75
bioinfo_big | 406,493 | 107.50 | 67.30 | 20.85 | 72.30 | 93.15 | 89.64 | 74.09
address | 1,437,142 | 253.83 | 231.90 | 74.04 | 283.35 | 357.40 | 200.98 | 250.01
cd | 4,198,694 | 720.94 | 734.56 | 204.18 | 704.03 | 908.21 | 415.73 | 508.57
po | 9,266,526 | 1,359.55 | 1,834.59 | 461.16 | 1,527.52 | 1,988.69 | 854.08 | 1,617.16
Table 8. The Parsing Time (ms) of XML
Table 9.
Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex
uglylink | 13 | 14.28 | 2.03 | 0.00 | 0.00 | 0.00 | 9.77 | 0.14
script | 126 | 14.88 | 2.20 | 0.00 | 0.00 | 0.00 | 8.18 | 0.20
style | 126 | 15.42 | 2.20 | 0.00 | 0.00 | 0.00 | 7.89 | 0.22
attvalues | 273 | 5.70 | 2.14 | 0.00 | 0.00 | 0.01 | 11.50 | 0.21
html | 438 | 16.42 | 2.21 | 0.00 | 0.00 | 0.01 | 10.85 | 0.31
antlr | 545 | 37.14 | 4.75 | 0.04 | 0.23 | 0.26 | 22.49 | 2.63
google | 659 | 24.08 | 3.77 | 0.04 | 0.27 | 0.30 | 31.78 | 33.82
gnu | 1,013 | 38.58 | 5.40 | 0.06 | 0.10 | 0.16 | 19.86 | 2.95
freebsd | 1,521 | 49.51 | 5.74 | 0.09 | 0.21 | 0.29 | 19.29 | 2.54
abc.com | 1,524 | 43.20 | 4.83 | 0.09 | 0.21 | 0.30 | 30.64 | 15.06
github | 2,547 | 46.82 | 6.83 | 0.21 | 0.26 | 0.47 | 26.12 | 4.71
metafilter | 2,615 | 50.94 | 5.49 | 0.21 | 0.45 | 0.66 | 33.13 | 30.18
wikipedia | 2,750 | 37.49 | 6.46 | 0.24 | 0.37 | 0.61 | 33.93 | 10.71
nbc.com | 3,821 | 94.82 | 7.37 | 0.32 | 0.59 | 0.92 | 46.21 | 14.79
bbc | 3,883 | 88.48 | 5.62 | 0.32 | 0.80 | 1.12 | 95.48 | 51.55
cnn1 | 4,974 | 89.63 | 6.84 | 0.29 | 1.02 | 1.31 | 42.38 | 34.00
reddit2 | 4,976 | 31.97 | 6.85 | 0.28 | 1.11 | 1.39 | 53.65 | 41.84
reddit | 4,989 | 31.37 | 7.41 | 0.27 | 1.10 | 1.37 | 53.29 | 38.30
digg | 6,250 | 126.95 | 8.53 | 0.51 | 1.34 | 1.85 | 52.07 | 38.00
youtube | 16,316 | 49.35 | 11.56 | 1.05 | 5.92 | 6.97 | 54.83 | 58.53
Table 9. The Parsing Time (ms) of HTML
Table 10.
Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | SpiderM | JSCore | V8 | Chakra
Members | 1,402 | 13.03 | 0.87 | 2.24 | 15.27 | 15.06 | 45.33 | 8.80 | 13.05
poked | 6,302 | 14.61 | 0.88 | 2.40 | 17.00 | 14.82 | 50.83 | 8.72 | 12.68
gists | 7,198 | 12.38 | 0.36 | 0.90 | 13.28 | 16.29 | 85.23 | 9.21 | 13.72
senator | 11,680 | 17.49 | 1.19 | 3.07 | 20.57 | 16.01 | 55.43 | 9.98 | 13.87
AskReddit | 12,916 | 16.81 | 1.33 | 3.42 | 20.24 | 14.67 | 51.59 | 9.28 | 13.42
blog_entries | 13,226 | 11.03 | 0.08 | 0.26 | 11.28 | 14.51 | 51.90 | 8.85 | 13.21
github | 17,932 | 15.99 | 0.77 | 2.21 | 18.20 | 15.78 | 66.17 | 9.99 | 12.43
emojis | 18,975 | 12.83 | 0.40 | 0.98 | 13.82 | 15.40 | 46.53 | 9.19 | 14.14
parliament | 27,182 | 16.53 | 1.98 | 5.35 | 21.88 | 16.36 | 70.35 | 10.38 | 13.69
prize | 30,210 | 17.25 | 2.24 | 5.45 | 22.71 | 16.76 | 134.30 | 9.98 | 13.15
y77d_th95 | 39,292 | 23.87 | 4.04 | 9.63 | 33.50 | 20.72 | 178.03 | 12.53 | 16.35
municipis | 52,941 | 23.89 | 3.06 | 6.89 | 30.78 | 20.02 | 60.20 | 11.48 | 16.54
laureate | 54,666 | 25.26 | 4.90 | 16.22 | 41.47 | 17.41 | 105.22 | 11.96 | 15.37
reddit_all | 55,264 | 26.15 | 4.14 | 11.15 | 37.30 | 22.00 | 92.54 | 15.61 | 17.44
transactions | 70,750 | 24.23 | 5.54 | 13.65 | 37.87 | 22.27 | 132.86 | 15.21 | 16.19
twitter | 71,744 | 104.96 | 4.37 | 9.91 | 114.88 | 21.09 | 78.90 | 11.78 | 18.35
representative | 79,953 | 29.22 | 4.92 | 12.56 | 41.78 | 20.72 | 178.03 | 12.53 | 16.35
catalog | 135,991 | 39.83 | 7.33 | 24.36 | 64.19 | 33.81 | 70.64 | 27.59 | 24.77
canada | 334,374 | 46.74 | 17.02 | 68.61 | 115.35 | 57.14 | 67.51 | 33.95 | 43.83
educativos | 426,398 | 108.58 | 20.47 | 68.75 | 177.34 | 70.78 | 420.99 | 44.53 | 48.64
airlines | 555,410 | 87.82 | 30.01 | 98.26 | 186.08 | 73.86 | 95.12 | 41.93 | 55.52
movies | 751,917 | 87.92 | 39.40 | 164.58 | 252.51 | 70.93 | 106.41 | 68.00 | 50.85
JSON.parse | 1,288,350 | 141.64 | 72.14 | 234.07 | 375.71 | 118.11 | 139.21 | 76.36 | 87.81
Table 10. Comparison with Hand-crafted Parsers (ms) of Parsing JSON
Table 11.
Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | Fast-XML | Libxmljs | SAX-JS | HTMLP2
nav_63_03 | 51 | 10.56 | 0.02 | 0.03 | 10.59 | 0.14 | 0.13 | 0.67 | 0.20
nav_78_03 | 54 | 10.44 | 0.02 | 0.04 | 10.47 | 0.12 | 0.12 | 0.68 | 0.20
nav_50_05 | 23 | 11.06 | 0.03 | 0.05 | 11.11 | 0.21 | 0.17 | 0.99 | 0.30
cd_catalog | 733 | 9.41 | 0.04 | 0.07 | 9.48 | 0.22 | 0.09 | 0.36 | 0.16
form | 740 | 14.47 | 0.04 | 0.07 | 14.54 | 0.38 | 0.17 | 0.93 | 0.27
OfficeOrder | 898 | 13.40 | 0.05 | 0.09 | 13.49 | 0.37 | 0.14 | 0.88 | 0.35
book-order | 2,723 | 12.69 | 0.14 | 0.42 | 13.11 | 0.69 | 0.26 | 1.41 | 0.54
book | 2,723 | 12.69 | 0.14 | 0.41 | 13.10 | 0.64 | 0.26 | 1.41 | 0.54
bioinfo | 3,165 | 15.93 | 0.17 | 0.44 | 16.37 | 0.99 | 0.38 | 2.16 | 0.90
soap_small | 3,415 | 14.40 | 0.17 | 0.54 | 14.94 | 0.85 | 0.40 | 2.42 | 0.78
cd_big | 4,829 | 13.36 | 0.26 | 0.87 | 14.23 | 1.17 | 0.44 | 2.20 | 0.88
soap_mid | 17,015 | 18.89 | 1.17 | 4.09 | 22.98 | 4.11 | 1.95 | 11.74 | 3.81
ORTCA | 39,072 | 54.09 | 3.22 | 9.53 | 63.62 | 138.49 | 91.25 | 664.62 | 88.79
blog | 43,000 | 55.60 | 3.38 | 10.30 | 65.91 | 29.57 | 6.56 | 48.07 | 17.83
SUAS | 118,446 | 88.16 | 6.37 | 26.93 | 115.09 | 254.39 | 182.43 | 1,213.67 | 168.74
po1m | 128,742 | 48.21 | 6.54 | 27.28 | 75.49 | 33.51 | 11.56 | 95.22 | 27.11
soap | 340,015 | 67.39 | 16.15 | 71.86 | 139.25 | 110.14 | 38.13 | 260.88 | 76.24
bioinfo_big | 406,493 | 89.64 | 20.85 | 93.15 | 182.79 | 116.93 | 42.74 | 276.96 | 82.29
address | 1,437,142 | 200.98 | 74.04 | 357.40 | 558.37 | 584.40 | 196.12 | 1,012.01 | 330.51
cd | 4,198,694 | 415.73 | 204.18 | 908.21 | 1,323.94 | 1,298.48 | 419.07 | 2,102.58 | 734.97
po | 9,266,526 | 854.08 | 461.16 | 1,988.69 | 2,842.77 | 3,278.30 | 897.30 | 6,617.60 | 1,826.90
Table 11. Comparison with Hand-crafted Parsers (ms) of Parsing XML

References

[1]
Rajeev Alur. 2016. Nested Words. Retrieved from https://www.cis.upenn.edu/alur/nw.html.
[2]
Rajeev Alur and P. Madhusudan. 2009. Adding nesting structure to words. J. Assoc. Comput. Machin. 56, 3 (May 2009), 16:1–16:43.
[3]
ANTLR. 2014. Grammars-v4. Retrieved from https://github.com/antlr/grammars-v4.
[4]
Aditi Barthwal and Michael Norrish. 2009. Verified, executable parsing. In Programming Languages and Systems, Giuseppe Castagna (Ed.). Springer Berlin, 160–174.
[5]
Jean-Philippe Bernardy and Patrik Jansson. 2016. Certified context-free parsing: A formalisation of Valiant’s algorithm in Agda. Logic. Meth. Comput. Sci. 12 (2016).
[6]
Janusz A. Brzozowski. 1964. Derivatives of regular expressions. J. ACM 11 (1964), 481–494.
[7]
John Cocke. 1969. Programming Languages and Their Compilers: Preliminary Notes. New York University, New York, NY.
[8]
Nils Anders Danielsson. 2010. Total parser combinators. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming (ICFP’10). Association for Computing Machinery, New York, NY, 285–296.
[9]
Pierce Darragh and Michael D. Adams. 2020. Parsing with zippers (functional pearl). Proc. ACM Program. Lang. 4, ICFP (2020), 1–28.
[10]
Franklin Lewis DeRemer. 1969. Practical translators for LR(k) languages. Ph.D. Dissertation. Massachusetts Institute of Technology.
[11]
F. L. DeRemer. 1969. Practical Translators for LR(k) Languages. Massachusetts Institute of Technology.
[12]
Justin Dorfman. 2015. awesome-json-datasets. Retrieved from https://github.com/jdorfman/awesome-json-datasets.
[13]
Danny Dubé and Marc Feeley. 2000. Efficiently building a parse tree from a regular expression. Acta Inform. 37, 2 (2000), 121–144.
[14]
Jay Earley. 1970. An efficient context-free parsing algorithm. Commun. ACM 13, 2 (Feb. 1970), 94–102.
[15]
Romain Edelmann, Jad Hamza, and Viktor Kunčak. 2020. Zippy LL(1) parsing with derivatives. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’20). Association for Computing Machinery, New York, NY, 1036–1051.
[16]
Felix. 2010. Htmlparser2. Retrieved from https://github.com/fb55/htmlparser2.
[17]
Denis Firsov and Tarmo Uustalu. 2014. Certified CYK parsing of context-free languages. J. Logic. Algeb. Meth. Program. 83, 5 (2014), 459–468.
[18]
Free Software Foundation. 2021. GNU Bison. Retrieved from https://www.gnu.org/software/bison/.
[19]
GoogleChromeLabs. 2019. json-parse-benchmark. Retrieved from https://github.com/GoogleChromeLabs/json-parse-benchmark.
[20]
Ian Henderson. 2017. Owl. Retrieved from https://github.com/ianh/owl.
[21]
Ian Henriksen, Gianfranco Bilardi, and Keshav Pingali. 2019. Derivative grammars: A symbolic approach to parsing with derivatives. Proc. ACM Program. Lang. 3, OOPSLA (2019), 1–28.
[22]
Xiaodong Jia, Ashish Kumar, and Gang Tan. 2021. A derivative-based parser generator for visibly Pushdown grammars. Proc. ACM Program. Lang. 5, OOPSLA (2021), 1–24.
[23]
Jacques-Henri Jourdan, François Pottier, and Xavier Leroy. 2012. Validating LR(1) parsers. In Programming Languages and Systems, Helmut Seidl (Ed.). Springer Berlin, 397–416.
[24]
Tadao Kasami. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Technical Report. Air Force Cambridge Research Laboratory.
[25]
Adam Koprowski and Henri Binsztok. 2010. TRX: A formally verified parser interpreter. In Programming Languages and Systems, Andrew D. Gordon (Ed.). Springer Berlin, 345–365.
[26]
Sam Lasser, Chris Casinghino, Kathleen Fisher, and Cody Roux. 2019. A verified LL(1) parser generator. In 10th International Conference on Interactive Theorem Proving (ITP 2019) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 141), John Harrison, John O’Leary, and Andrew Tolmach (Eds.). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 24:1–24:18.
[27]
Sam Lasser, Chris Casinghino, Kathleen Fisher, and Cody Roux. 2021. CoStar: A verified ALL(*) parser. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI’21). Association for Computing Machinery, New York, NY, 420–434.
[28]
libxmljs. 2009. Libxmljs. Retrieved from https://github.com/libxmljs/libxmljs.
[29]
Jay McCarthy, Burke Fetscher, Max New, Daniel Feltey, and Robert Bruce Findler. 2016. A Coq library for internal verification of running-times. In Functional and Logic Programming, Oleg Kiselyov and Andy King (Eds.). Springer International Publishing, Cham, 144–162.
[30]
Matthew Might, David Darais, and Daniel Spiewak. 2011. Parsing with derivatives: A functional pearl. In Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming (ICFP’11). Association for Computing Machinery, New York, NY, 189–195.
[31]
Greg Morrisett, Gang Tan, Joseph Tassarotti, Jean-Baptiste Tristan, and Edward Gan. 2012. RockSalt: Better, faster, stronger SFI for the X86. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). Association for Computing Machinery, New York, NY, 395–404.
[32]
NaturalIntelligence. 2017. fast-xml-parser. Retrieved from https://github.com/NaturalIntelligence/fast-xml-parser.
[33]
Ulf Norell. 2007. Towards a Practical Programming Language Based on Dependent Type Theory. Vol. 32. Chalmers University of Technology.
[34]
Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression derivatives re-examined. J. Funct. Program. 19, 2 (Mar. 2009), 173–190.
[35]
Terence Parr. 2022. ANTLR (ANother Tool for Language Recognition). Retrieved from https://www.antlr.org/.
[36]
Terence Parr, Sam Harwell, and Kathleen Fisher. 2014. Adaptive LL(*) parsing: The power of dynamic analysis. SIGPLAN Not. 49, 10 (Oct. 2014), 579–598.
[37]
T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana. 2017. NEZHA: Efficient domain-independent differential testing. In Proceedings of the IEEE Symposium on Security and Privacy (SP’17). IEEE Computer Society, 615–632.
[38]
Tahina Ramananandro, Antoine Delignat-Lavaud, Cedric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified secure zero-copy parsers for authenticated message formats. In Proceedings of the 28th USENIX Security Symposium (USENIX Security’19). USENIX Association, 1465–1482. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/delignat-lavaud.
[39]
Tom Ridge. 2011. Simple, functional, sound and complete parsing for all context-free grammars. In Certified Programs and Proofs, Jean-Pierre Jouannaud and Zhong Shao (Eds.). Springer Berlin, 103–118.
[40]
Isaac Z. Schlueter. 2010. sax-js. Retrieved from https://github.com/isaacs/sax-js.
[41]
JFlex Team. 2023. JFlex. Retrieved from https://jflex.de/.
[42]
XimpleWare. 2004. VTD-XML Benchmark Report for Version 2.3. Retrieved from https://vtd-xml.sourceforge.io/2.3/benchmark_2.3_parsing_only.html.
[43]
Milo Yip. 2014. nativejson-benchmark. Retrieved from https://github.com/miloyip/nativejson-benchmark.
[44]
Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time \(n^3\) . Inf. Contr. 10, 2 (1967), 189–208.
