Residual Finite Tree Automata
Julien Carme, Rémi Gilleron, Marc Tommasi, Alain Terlutte, Aurélien Lemay
To cite this version:
Julien Carme, Rémi Gilleron, Marc Tommasi, Alain Terlutte, Aurélien Lemay. Residual Finite Tree Automata. 7th International Conference on Developments in Language Theory, Jul
2003, Szeged, Hungary, Hungary. Springer Verlag, 2710 (2710), pp.171 – 182, 2003. <inria00091272v2>
HAL Id: inria-00091272
https://hal.inria.fr/inria-00091272v2
Submitted on 5 Sep 2006
Residual Finite Tree Automata⋆
J. Carme, R. Gilleron, A. Lemay, A. Terlutte, and M. Tommasi
Grappa – EA 3588 – Lille 3 University
http://www.grappa.univ-lille3.fr
Abstract. Tree automata based algorithms are essential in many fields
of computer science such as verification, specification, and program analysis. They have also become essential for databases with the development of
standards such as XML. In this paper, we define new classes of non-deterministic tree automata, namely residual finite tree automata (RFTA).
In the bottom-up case, we obtain a new characterization of regular tree
languages. In the top-down case, we obtain a subclass of regular tree languages which contains the class of languages recognized by deterministic
top-down tree automata. RFTA also come with the property of existence
of canonical non-deterministic tree automata.
1 Introduction
The study of tree automata has a long history in computer science; see the survey
of Thatcher [Tha73], and the texts of F. Gécseg and M. Steinby [GS84,GS96],
and of the TATA group [CDG+ 97]. With the advent of tree-based metalanguages
(SGML and XML) for document grammars, new tree automata formalisms and
tree automata based algorithms have been developed [MLM01,Nev02].
Also, because of the tree structure of documents, learning algorithms for tree
languages have been defined for the tasks of information extraction and information retrieval [Fer02,GK02,LPH00]. We are currently involved in a research
project dealing with information extraction systems from semi-structured data.
One objective is the definition of classes of tree automata satisfying two properties: there are efficient algorithms for membership and matching, and there are
efficient learning algorithms for the corresponding classes of tree languages.
In the present paper, we only consider finite ranked trees. There are bottom-up (also known as frontier to root) tree automata and top-down (also known as
root to frontier) tree automata. The top-down version is particularly relevant for
some implementations because important properties such as membership¹ can
be decided without loading the whole input tree into memory. There are also
deterministic tree automata and non-deterministic tree automata. Determinism
is important to reach efficiency for membership and other decision properties.
It is known that non-deterministic top-down, non-deterministic bottom-up, and
deterministic bottom-up tree automata are equally expressive and define regular tree languages. But there is a tradeoff between efficiency and expressiveness because some regular (and even finite) tree languages are not recognized
by deterministic top-down tree automata. Moreover, the size of a deterministic bottom-up tree automaton can be exponentially larger than the size of a
non-deterministic one recognizing the same tree language. This drawback can
be dramatic when the purpose is to build tree automata. This is for instance the
case in the problem of tree pattern matching and in machine learning problems
like grammatical inference.
⋆ This research was partially supported by “TACT-TIC” région Nord-Pas-de-Calais - FEDER and the MOSTRARE INRIA project.
¹ Given a tree automaton A, decide whether an input tree is accepted by A.
The process of learning finite state machines from data is referred to as grammatical inference. The first theoretical foundations were given by Gold [Gol67]
and the first applications were designed in the field of pattern recognition. Grammatical inference has mostly focused on learning string languages, but recent works are
concerned with learning tree languages [Sak90,Fer02,GK02]. In most works, the
target tree language is represented by a deterministic bottom-up tree automaton. This is problematic because the time complexity of the learning algorithm
depends on the size of the target automaton. Therefore, it is again crucial to define learning algorithms for non-deterministic tree automata. The reader should
note that tree patterns [GK02] satisfy this property.
Therefore the aim of this article is to define non-deterministic tree automata
corresponding to sufficiently expressive classes of tree languages and having nice
properties from the algorithmic viewpoint and from the grammatical inference
viewpoint. To this end, we extend previous works from the string case [DLT02a]
to the tree case and we define residual finite tree automata (RFTA). The reader
should note that learning algorithms for residual finite string automata have been
defined [DLT01,DLT02b].
In Section 3, we study the bottom-up case. We define the residual language
of a language L w.r.t a ground term t as the set of contexts c such that c[t] is
a term in L. We define bottom-up residual tree automata as automata whose
states correspond to residual languages. Bottom-up residual tree automata are
non-deterministic and recognize regular tree languages. We prove that every
regular tree language is recognized by a unique canonical bottom-up residual
tree automaton, minimal according to the number of states. We give an example
of regular tree languages for which the size of the deterministic bottom-up tree
automata grows exponentially with respect to the size of the canonical bottom-up residual tree automata.
In Section 4, we study the top-down case. We define the residual language
of a language L w.r.t a context c as the set of ground terms t such that c[t]
is a term in L. We define top-down residual tree automata as automata whose
states correspond to residual languages. Top-down residual tree automata are
non-deterministic tree automata. Interestingly, the class of languages recognized
by top-down residual tree automata is strictly included in the class of regular
tree languages and strictly contains the class of languages recognized by deterministic top-down tree automata. We also prove that every tree language in this
family is recognized by a unique canonical top-down residual tree automaton;
this automaton is minimal according to the number of states.
The definition of residual finite tree automata comes with new decision
problems. All of them rely on properties of residual languages. It is proved that
all residual languages of a given tree language L can be built in both top-down
and bottom-up cases. From these constructions we obtain positive answers to
decision problems like ’decide whether an automaton is a (canonical) RFTA’.
The exact complexity bounds are not given but we conjecture that they are identical
to those of the string case.
The present work is connected with the paper by Nivat and Podelski [NP97].
They consider a monoid framework, whose elements are called pointed trees
(contexts in our terminology, special trees in [Tho84]), to define tree automata.
They define a Nerode congruence in the bottom-up case and in the top-down
case. Their work leads to the generalization of the notion of deterministic to
l-r-deterministic (context-deterministic in our terminology) for top-down tree
automata. They have a minimization procedure for this class of automata. It
should be noted that the class of languages recognized by context-deterministic
tree automata (also called homogeneous tree languages) is strictly included in
the class of languages recognized by residual top-down tree automata.
2 Preliminaries
We assume that the reader is familiar with basic knowledge about tree automata.
We follow the notations defined in TATA [CDG+ 97].
A ranked alphabet is a pair (F, Arity) where F is a finite set and Arity
is a mapping from F into N. The set of symbols of arity p is denoted by F_p.
Elements of arity 0, 1, ..., p are respectively called constants, unary, ..., p-ary
symbols. We assume that F contains at least one constant. In the examples, we
use parentheses and commas for a short declaration of symbols with arity. For
instance, a is a constant and f(,) is a short declaration for a binary symbol f.
The set of terms over F is denoted by T(F). Let ⋄ be a special constant which is
not in F. The set of contexts (also known as pointed trees in [NP97] and special
trees in [Tho84]), denoted by C(F), is the set of terms over F ∪ {⋄} which contain exactly
one occurrence of ⋄. The expression c[⋄] denotes a context; we only write c when
there is no ambiguity. We denote by c[t] the term obtained from c[⋄] by replacing
⋄ by a term t.
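As an illustration only, terms and contexts can be encoded in a few lines of Python. The nested-tuple encoding and the names `HOLE`, `is_context` and `plug` are assumptions of this sketch, not notation from TATA or from the present paper.

```python
# Terms are nested tuples: ("f", t1, t2) stands for f(t1, t2), ("a",) for a constant a.
HOLE = ("⋄",)  # the special constant, assumed not to belong to F


def count_holes(term):
    """Number of occurrences of the special constant in a term."""
    if term == HOLE:
        return 1
    return sum(count_holes(sub) for sub in term[1:])


def is_context(term):
    """A context contains exactly one occurrence of the special constant."""
    return count_holes(term) == 1


def plug(context, term):
    """Return c[t]: the context with its hole replaced by the term."""
    if context == HOLE:
        return term
    return (context[0],) + tuple(plug(sub, term) for sub in context[1:])


a = ("a",)
c = ("f", HOLE, a)                 # the context f(⋄, a)
print(is_context(c))               # True
print(plug(c, ("f", a, a)))        # ('f', ('f', ('a',), ('a',)), ('a',))
```

Tuples are used so that terms are hashable and can later be stored in sets, which is convenient when a language is represented as a finite set of terms.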
A bottom-up Finite Tree Automaton (↑-FTA) over F is a tuple A = (Q, F , Qf , ∆)
where Q is a finite set of states, Qf ⊆ Q is a set of final states, and ∆ is a
set of transition rules of the form f (q1 , . . . , qn ) → q where n ≥ 0, f ∈ Fn ,
q, q1 , . . . , qn ∈ Q. In this paper, the size of an automaton refers to its
number of states, so two automata which have the same number of states but
different numbers of rules are considered as having the same size. When n = 0 a
rule is written a → q, where a is a constant. The move relation is written →A
and →∗A is the reflexive and transitive closure of →A . A term t reaches a state
q if and only if t →∗A q. A state q accepts a context c if and only if there exists a
qf ∈ Qf such that c[q] →∗A qf . The automaton A recognizes a term t if and only
if there exists a qf ∈ Qf such that t →∗A qf . The language recognized by A is the
set of all terms recognized by A, and is denoted by L(A).
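The move relation of a ↑-FTA can be sketched as a recursive computation of the set of states reached by a term. The dictionary encoding of ∆ and the toy automaton below (terms over {f(,), a, b} containing at least one b) are hypothetical and serve only as an illustration.

```python
from itertools import product


def reachable_states(term, rules):
    """Set of states q such that term ->* q.

    `rules` maps (symbol, tuple of child states) to a set of target states,
    so several rules may share a left-hand side (non-determinism).
    """
    child_sets = [reachable_states(sub, rules) for sub in term[1:]]
    states = set()
    for children in product(*child_sets):
        states |= rules.get((term[0], children), set())
    return states


def recognizes(term, rules, final_states):
    """A term is recognized iff it reaches at least one final state."""
    return bool(reachable_states(term, rules) & final_states)


# Hypothetical automaton: q1 means "contains a b", q0 means "contains no b".
RULES = {
    ("a", ()): {"q0"},
    ("b", ()): {"q1"},
    ("f", ("q0", "q0")): {"q0"},
    ("f", ("q0", "q1")): {"q1"},
    ("f", ("q1", "q0")): {"q1"},
    ("f", ("q1", "q1")): {"q1"},
}
FINAL = {"q1"}

a, b = ("a",), ("b",)
print(recognizes(("f", a, ("f", b, a)), RULES, FINAL))  # True
print(recognizes(("f", a, a), RULES, FINAL))            # False
```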
Two ↑-FTA are equivalent if they recognize the same tree language. A ↑-FTA A = (Q, F, Qf, ∆) is trimmed if and only if each of its states can be reached
by at least one term and accepts at least one context. A ↑-FTA is deterministic
(↑-DFTA) if and only if there are no two rules with the same left-hand side in
its set of rules. A tree language is regular if and only if it is recognized by a
bottom-up tree automaton. As any ↑-FTA can be changed into an equivalent
trimmed ↑-DFTA, any regular tree language can be recognized by a trimmed
↑-DFTA.
Let L be a tree language over a ranked alphabet F and t a term. The bottom-up residual language of L relative to a term t, denoted by $t^{-1}L$, is the set of all
contexts c in C(F) such that c[t] ∈ L:
$$t^{-1}L = \{c \in C(F) \mid c[t] \in L\}.$$
Note that a bottom-up residual language is a set of contexts, and not a tree
language. The Myhill-Nerode congruence for tree languages can be defined as follows:
two terms t and t′ are equivalent if they define the same residual language.
From the Myhill-Nerode theorem for tree languages, we get the following result:
a tree language is recognizable if and only if the number of its residual languages is
finite.
A top-down finite tree automaton (↓-FTA) over F is a tuple A = (Q, F , I, ∆)
where Q is a set of states, I ⊆ Q is a set of initial states, and ∆ is a set of rewrite
rules of the form q(f ) → f (q1 , . . . , qn ) where n ≥ 0, f ∈ Fn , q, q1 , . . . , qn ∈ Q.
Again, if n = 0 the rule is written q(a) → a. The move relation is written →A
and →∗A is the reflexive and transitive closure of →A . A state q accepts a term
t if and only if q(t) →∗A t. A recognizes a term t if and only if at least one of its
initial states accepts it. The language recognized by A is the set of all ground
terms recognized by A and is denoted by L(A).
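A minimal sketch of top-down acceptance follows, assuming rules are stored as a dictionary from (state, symbol) to the set of possible tuples of child states. The encoding and the small automaton, which recognizes {f(a, b), f(b, a)}, are illustrative assumptions.

```python
def accepts(state, term, rules):
    """True iff state(term) ->* term, trying every applicable rule."""
    for children in rules.get((state, term[0]), ()):
        if len(children) == len(term) - 1 and all(
            accepts(q, sub, rules) for q, sub in zip(children, term[1:])
        ):
            return True
    return False


def recognizes(term, rules, initial_states):
    """A term is recognized iff at least one initial state accepts it."""
    return any(accepts(q, term, rules) for q in initial_states)


# Hypothetical non-deterministic automaton for {f(a, b), f(b, a)}.
# A rule q(a) -> a for a constant is encoded with the empty tuple of children.
RULES = {
    ("q0", "f"): {("qa", "qb"), ("qb", "qa")},
    ("qa", "a"): {()},
    ("qb", "b"): {()},
}
INITIAL = {"q0"}

a, b = ("a",), ("b",)
print(recognizes(("f", a, b), RULES, INITIAL))  # True
print(recognizes(("f", a, a), RULES, INITIAL))  # False
```

The state q0 has two rules with the same left-hand side q0(f), so this automaton is not a ↓-DFTA.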
Any regular tree language can be recognized by a ↓-FTA. This means that
↓-FTA and ↑-FTA have the same expressive power. A ↓-FTA is deterministic
(↓-DFTA) if and only if its set of rules does not contain two rules with the same
left-hand side. Unlike ↑-DFTA, ↓-DFTA are not able to recognize all regular tree
languages.
Let L be a tree language over a ranked alphabet F , and c a context of C(F ).
The top-down residual language of L relative to c, denoted by c−1 L, is the set
of ground terms t such that c[t] ∈ L:
c−1 L = {t ∈ T (F ) | c[t] ∈ L}.
The definition of top-down residual languages comes with an equivalence
relation on contexts. It is worth noting that it does not define a congruence over
terms. Nonetheless, based on [NP97], it can be shown that a tree language L is
regular if and only if the number of top-down residual languages associated with
L is finite. The proof uses the fact that the number of top-down residual languages
is lower than the number of bottom-up residual languages.
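For a finite language given as an explicit set of terms, the top-down residual relative to a context can be computed by matching the context against each term. This is only a sketch: the tuple encoding, the helper `match` and the sample language `LANG` are assumptions, and the approach does not apply to infinite languages.

```python
HOLE = ("⋄",)  # the special constant, assumed not to belong to F


def match(context, term):
    """Return the term t such that context[t] == term, or None if none exists."""
    if context == HOLE:
        return term
    if context[0] != term[0] or len(context) != len(term):
        return None
    found = None
    for c_sub, t_sub in zip(context[1:], term[1:]):
        if c_sub == t_sub:          # identical hole-free subtrees
            continue
        found = match(c_sub, t_sub)
        if found is None:           # mismatch outside the hole
            return None
    return found


def top_down_residual(language, context):
    """c^{-1}L = {t | c[t] in L}, for a finite language L."""
    candidates = (match(context, term) for term in language)
    return {t for t in candidates if t is not None}


a, b, c = ("a",), ("b",), ("c",)
LANG = {("f", a, b), ("f", a, c), ("f", b, c)}  # hypothetical sample language
print(top_down_residual(LANG, ("f", a, HOLE)) == {b, c})  # True
print(top_down_residual(LANG, HOLE) == LANG)              # True
```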
3 Bottom-up residual finite tree automata
In this section, we introduce a new class of bottom-up finite tree automata,
called bottom-up residual finite tree automata (↑-RFTA). This class of automata
shares some interesting properties with both bottom-up deterministic and non-deterministic finite tree automata, which both recognize the class of regular tree
languages.
On the one hand, like ↑-DFTA, ↑-RFTA admit a unique canonical form, based
on a correspondence between states and residual languages, whereas ↑-FTA do
not. On the other hand, ↑-RFTA are non-deterministic and can be much smaller
in their canonical form than their deterministic counterparts.
3.1 Definition and expressive power of bottom-up residual finite tree automata
First, let us make precise the nature of this correspondence, then let us give the formal
definition of ↑-residual tree automata and describe their properties.
In order to establish the nature of this correspondence between states and
residual languages, let us introduce the notion of state languages. The state
language Cq of a state q is the set of contexts accepted by the state q:
Cq = {c ∈ C(F ) | ∃qf ∈ Qf , c[q] →∗A qf }.
As shown by the following example, state languages are generally not residual
languages:
Example 1. Consider the tree language $L = \{f(a_1, b_1), f(a_1, b_2), f(a_2, b_2)\}$ over
$F = \{f(,), a_1, b_1, a_2, b_2\}$. This language L is recognized by the tree automaton $A = (\{q_1, q_2, q_3, q_4, q_5\}, F, \{q_5\}, \Delta)$ where $\Delta = \{a_1 \to q_1,\ b_1 \to q_2,\ b_2 \to q_3,\ a_2 \to q_4,\ a_1 \to q_4,\ f(q_1, q_2) \to q_5,\ f(q_4, q_3) \to q_5\}$. Residual languages of L
are $a_1^{-1}L = \{f(\diamond, b_1), f(\diamond, b_2)\}$, $b_1^{-1}L = \{f(a_1, \diamond)\}$, $b_2^{-1}L = \{f(a_1, \diamond), f(a_2, \diamond)\}$,
$a_2^{-1}L = \{f(\diamond, b_2)\}$, $f(a_1, b_1)^{-1}L = \{\diamond\}$. The state language of $q_1$ is $\{f(\diamond, b_1)\}$,
which is not a residual language. The tree $a_1$ reaches $q_1$, so each context accepted by $q_1$ is an element of the residual language $a_1^{-1}L$, which means that
$C_{q_1} \subset a_1^{-1}L$. But the reverse inclusion is not true because $f(\diamond, b_2)$ is not an element of $C_{q_1}$. The reader should note that this situation is possible because A is
non-deterministic.
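The residual languages and the state language of Example 1 can be reproduced mechanically, since L is finite. The following is a sketch under the assumption that terms are nested tuples; the function names are illustrative. Because a1 reaches q1, the state language of q1 is included in the residual of a1, so it is enough to test the contexts of that residual.

```python
from itertools import product

HOLE = ("⋄",)


def contexts_around(term, target):
    """Yield every context c such that c[target] == term."""
    if term == target:
        yield HOLE
    for i in range(1, len(term)):
        for sub in contexts_around(term[i], target):
            yield term[:i] + (sub,) + term[i + 1:]


def bottom_up_residual(language, t):
    """t^{-1}L = {c | c[t] in L}, for a finite language L."""
    return {c for term in language for c in contexts_around(term, t)}


a1, a2, b1, b2 = ("a1",), ("a2",), ("b1",), ("b2",)
L = {("f", a1, b1), ("f", a1, b2), ("f", a2, b2)}

# Transition rules of the automaton A of Example 1.
RULES = {
    ("a1", ()): {"q1", "q4"}, ("a2", ()): {"q4"},
    ("b1", ()): {"q2"}, ("b2", ()): {"q3"},
    ("f", ("q1", "q2")): {"q5"}, ("f", ("q4", "q3")): {"q5"},
}


def reach(term, hole_state):
    """States reached by a context whose hole is replaced by hole_state."""
    if term == HOLE:
        return {hole_state}
    kids = [reach(sub, hole_state) for sub in term[1:]]
    out = set()
    for ks in product(*kids):
        out |= RULES.get((term[0], ks), set())
    return out


residual_a1 = bottom_up_residual(L, a1)
C_q1 = {c for c in residual_a1 if "q5" in reach(c, "q1")}
print(C_q1 < residual_a1)  # True: the inclusion is strict
```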
In fact, it can be proved (the proof is omitted) that residual languages are
unions of state languages. For any L recognized by a tree automaton A, we have
$$\forall t \in T(F),\quad t^{-1}L = \bigcup_{q \in Q,\ t \to^*_A q} C_q. \tag{1}$$
As a consequence, if A is deterministic and trimmed, each residual language
is a state language and conversely.
We can define a new class of non-deterministic automata stating that each
state language must correspond to a residual tree language. We have seen that
residual tree languages are related to the Myhill-Nerode congruence and we will
show that minimization of tree automata can be extended in the definition of a
canonical form for this class of non-deterministic tree automata.
Definition 1. A bottom-up residual tree automaton (↑-RFTA) is a ↑-FTA A =
(Q, F , Qf , ∆) such that ∀q ∈ Q, ∃t ∈ T (F ), Cq = t−1 L(A).
According to the above definition and previous remarks, it can be shown that
every trimmed ↑-DFTA is a ↑-RFTA. As a consequence, ↑-RFTA have the same
expressive power as finite tree automata:
Theorem 1. The class of tree languages recognized by ↑-RFTA is the class of
regular tree languages.
As an advantage of ↑-RFTA, the number of states of an ↑-RFTA can be much
smaller than the number of states of any equivalent ↑-DFTA:
Proposition 1. There exists a sequence (Ln ) of regular tree languages such
that for each Ln , the size of the smallest ↑-DFTA which recognizes Ln is an
exponential function of n, and the size of the smallest ↑-RFTA which recognizes
Ln is a linear function of n.
Sketch of proof We give an example of regular tree languages for which the size
of the ↑-DFTA grows exponentially with respect to the size of the equivalent
canonical ↑-RFTA. A path is a sequence of symbols from the root to a leaf of
a tree. The length of a path is the number of symbols on the path, except the
root. Let F = {f(,), a} and let us consider the tree language $L_n$ which contains
exactly the trees with at least one path of length n. Let $A_n = (Q, F, Q_f, \Delta)$ be
a ↑-FTA defined by: $Q = \{q_*, q_0, \dots, q_n\}$, $Q_f = \{q_0\}$ and
$$\Delta = \{a \to q_*,\ a \to q_n,\ f(q_*, q_*) \to q_*\} \cup \bigcup_{k \in [1,\dots,n],\ q \in Q \setminus \{q_0\}} \{f(q_k, q) \to q_{k-1},\ f(q, q_k) \to q_{k-1},\ f(q_k, q) \to q_*,\ f(q, q_k) \to q_*\}$$
Let $C_*$ be the set of contexts which contain at least one path of length n.
Let $C_i$ be the set of contexts whose path from the root to ⋄ is of length i. Let $t_*$
be a term such that all its paths are of length greater than n. Note that the set
of contexts c such that $c[t_*]$ belongs to $L_n$ is exactly the set of contexts $C_*$. Let
$t_0, \dots, t_n$ be terms such that for all i ≤ n, $t_i$ contains exactly one path of length
smaller than n, and the length of this path is n − i. Therefore, $t_i^{-1}L_n$ is the set
of contexts $C_* \cup C_i$.
One can verify that $C_{q_*}$ is exactly $t_*^{-1}L_n = C_*$, and for all i ≤ n, $C_{q_i}$ is exactly
$t_i^{-1}L_n = C_* \cup C_i$. The reader should note that rules of the form $f(q_k, q) \to q_*$
and $f(q, q_k) \to q_*$ are not useful to recognize $L_n$ but they are required to obtain
a ↑-RFTA (because $C_i$ is not a residual language of $L_n$). So $A_n$ is a ↑-RFTA
and recognizes $L_n$. The size of $A_n$ is n + 2.
The construction of the smallest ↑-DFTA which recognizes $L(A_n)$ is left to
the reader. But it can easily be shown that the number of states is in $O(2^n)$
because states must store lengths of all paths smaller than n. □
Unfortunately, the size of a ↑-RFTA can be exponentially larger than the
size of an equivalent ↑-FTA.
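The automaton A_n from the proof of Proposition 1 can be built and checked by brute force on small trees. This is a sanity check rather than a proof: it assumes the tuple encoding of terms and the reading of "path of length n" as a leaf at depth n. The names `build_An`, `reach`, `trees` and `check` are hypothetical.

```python
from itertools import product

A = ("a",)


def build_An(n):
    """States, rules and final states of A_n; rules map (symbol, children) to targets."""
    states = ["q*"] + [f"q{i}" for i in range(n + 1)]
    rules = {}

    def add(symbol, children, target):
        rules.setdefault((symbol, children), set()).add(target)

    add("a", (), "q*")
    add("a", (), f"q{n}")
    add("f", ("q*", "q*"), "q*")
    for k in range(1, n + 1):
        for q in states:
            if q == "q0":           # q ranges over Q \ {q0}
                continue
            for children in ((f"q{k}", q), (q, f"q{k}")):
                add("f", children, f"q{k - 1}")
                add("f", children, "q*")
    return states, rules, {"q0"}


def reach(term, rules):
    """Set of states reached by a term (bottom-up run)."""
    kid_sets = [reach(sub, rules) for sub in term[1:]]
    out = set()
    for kids in product(*kid_sets):
        out |= rules.get((term[0], kids), set())
    return out


def leaf_depths(term, depth=0):
    """Set of path lengths of a term, i.e. depths of its leaves."""
    if len(term) == 1:
        return {depth}
    return set().union(*(leaf_depths(sub, depth + 1) for sub in term[1:]))


def trees(height):
    """All terms over {f(,), a} of height at most `height`."""
    if height == 0:
        return [A]
    smaller = trees(height - 1)
    return [A] + [("f", left, right) for left in smaller for right in smaller]


def check(n, height):
    """Compare A_n with the definition of L_n on all trees up to `height`."""
    _, rules, final = build_An(n)
    return all(
        bool(reach(t, rules) & final) == (n in leaf_depths(t)) for t in trees(height)
    )


print(check(2, 4))  # True
```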
3.2 The canonical form of bottom-up residual tree automata
Like ↑-DFTA, ↑-RFTA have the interesting property of admitting a canonical form. In
the case of ↑-DFTA, there is a one-to-one correspondence between residual languages and state languages. This is a consequence of the Myhill-Nerode theorem
for trees.
A similar result holds for ↑-RFTA. In a canonical ↑-RFTA, the set of states
is in one-to-one correspondence with a subset of residual languages called prime
residual languages.
Definition 2. Let L be a tree language. A bottom-up residual language $t^{-1}L$ of L is
composite if and only if it is the union of the bottom-up residual languages that
it strictly contains:
$$t^{-1}L = \bigcup_{t'^{-1}L \subsetneq t^{-1}L} t'^{-1}L.$$
A residual language is prime if and only if it is not composite.
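The prime/composite distinction can be sketched on a finite abstraction in which each residual language is a finite set of opaque labels. Actual residual languages are generally infinite sets of contexts, so this only illustrates the set-theoretic condition. One assumption of the sketch is that the empty residual counts as composite, being the union of an empty family.

```python
def is_composite(residual, all_residuals):
    """True iff `residual` equals the union of the residuals it strictly contains."""
    smaller = [r for r in all_residuals if r < residual]  # strict inclusion
    return frozenset().union(*smaller) == residual


def prime_residuals(all_residuals):
    """Residuals that are not composite."""
    return {r for r in all_residuals if not is_composite(r, all_residuals)}


# Hypothetical family of residuals over the labels x, y, z.
RESIDUALS = {
    frozenset({"x"}),
    frozenset({"y"}),
    frozenset({"x", "y"}),       # union of {x} and {y}: composite
    frozenset({"x", "y", "z"}),  # z is not covered by smaller residuals: prime
}

print(is_composite(frozenset({"x", "y"}), RESIDUALS))       # True
print(is_composite(frozenset({"x", "y", "z"}), RESIDUALS))  # False
```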
Example 2. Let us consider again the tree languages in the proof of Proposition 1. Let $Q_n$ be the set of states of $A_n$. All the n + 2 states $q_*, q_0, \dots, q_n$ of $Q_n$
have state languages which are prime residual languages. The subset construction applied to $A_n$ to build a ↑-DFTA $D_n$ leads to consider states which are
subsets of $Q_n$. The state language of a state $\{q_{k_1}, \dots, q_{k_n}\}$ is a composite residual
language. It is the union of $t_{q_{k_1}}^{-1}L, \dots, t_{q_{k_n}}^{-1}L$.
In canonical ↑-RFTAs, all state languages are prime residual languages.
Theorem 2. Let L be a regular tree language and let us consider the ↑-FTA
$A_{can} = (Q, F, Q_f, \Delta)$ defined by:
– Q is in bijection with the set of all prime bottom-up residual languages of L.
We denote by $t_q$ a ground term such that q is associated with $t_q^{-1}L$ in this
bijection,
– $Q_f$ is the set of all elements q of Q such that $t_q^{-1}L$ contains the empty context ⋄,
– ∆ contains all the rules $f(q_1, \dots, q_n) \to q$ such that $t_q^{-1}L \subseteq (f(t_{q_1}, \dots, t_{q_n}))^{-1}L$
and all the rules $a \to q$ such that $a \in F_0$ and $t_q^{-1}L \subseteq a^{-1}L$.
$A_{can}$ is a ↑-RFTA, it is the smallest ↑-RFTA in number of states which recognizes
L, and it is unique up to a renaming of its states.
Sketch of proof. There are three things to prove in this theorem: the canonical
↑-RFTA $A_{can} = (Q, F, Q_f, \Delta)$ of a regular tree language L recognizes L, it is a
↑-RFTA, and there cannot be any strictly smaller ↑-RFTA which recognizes L.
The three points are proved in this order.
We first have to prove the equality $L(A_{can}) = L$. It follows from the identity
$$(\circledast)\quad \forall t,\ t^{-1}L = \bigcup_{q \in Q,\ t \to^*_{A_{can}} q} t_q^{-1}L$$
which can be proved inductively on the height of t. Using this property, we have:
$$t \in L \Leftrightarrow \diamond \in t^{-1}L \overset{\circledast}{\Leftrightarrow} \diamond \in \bigcup_{q \in Q,\ t \to^*_{A_{can}} q} t_q^{-1}L \Leftrightarrow \exists q_f \in Q_f,\ t \to^*_{A_{can}} q_f \Leftrightarrow t \in L(A_{can})$$
The equality between L and $L(A_{can})$ helps us to prove the characterization
of ↑-RFTA: $t_q^{-1}L = C_q^{A_{can}}$, where $C_q^{A_{can}}$ is the state language of q in $A_{can}$.
The last point can be proved as follows. In a ↑-RFTA, any residual
language is a union of state languages, and any state language is a residual
language. So any prime residual language is a state language, so there are at
least as many states in a ↑-RFTA as prime residual languages admitted by its
corresponding tree language. □
The canonical automaton is uniquely determined by the tree language
under consideration, but there may be other automata which have the same number of states. The canonical ↑-RFTA is unique because it has the maximum number of rules. Even though all its states are associated with prime residual languages,
the automaton considered in the proof of Proposition 1 is not the canonical one
because some rules are missing: $\bigcup_{k=1}^{n} \{f(q_k, q_0) \to q_{k-1},\ f(q_0, q_k) \to q_{k-1}\}$ and
$\bigcup_{q \in Q} \{f(q, q_0) \to q_*,\ f(q_0, q) \to q_*\}$.
4 Top-down residual finite tree automata
The definition of top-down residual finite tree automata (↓-RFTA) is tightly
correlated with the definition of ↑-RFTA. Similarly to ↑-RFTA, ↓-RFTA are defined as non-deterministic tree automata where each state language is a residual
language. Any ↓-RFTA can be transformed into an equivalent canonical ↓-RFTA,
minimal in the number of states and unique up to state renaming.
The main difference between the bottom-up and the top-down case is in the
problem of the expressive power of tree automata. The three classes of bottom-up
tree automata, ↑-DFTA, ↑-RFTA or ↑-FTA, have the same expressive power. In
the top-down case, deterministic, residual and non-deterministic tree automata
have different expressive power. This makes the canonical form of ↓-RFTA more
interesting. Compared to the minimal form of ↓-DFTA, it can be smaller when
both exist, and it exists for a wider class of tree languages.
Let us introduce ↓-RFTA through their similarity with ↑-RFTA, then study
this specific problem of expressiveness.
4.1 Analogy with bottom-up residual tree automata
Let us formally define state languages in the top-down case:
Definition 3. Let L be a regular tree language over a ranked alphabet F , let A
be a top-down tree automaton which recognizes L, and let q be a state of this
automaton. The state language of L relative to q, written Lq , is the set of terms
which are accepted by q:
Lq = {t ∈ T (F ) | q(t) →∗A t}.
Some properties similar to those already studied in the previous section follow from this definition. Firstly, state languages are generally not residual
languages. Secondly, residual languages are unions of state languages. Let us
define Qc:
Qc = {q | q ∈ Q, ∃qi ∈ I, qi (c[⋄]) →∗A c[q(⋄)]}.
We have the following relation between state languages and residual languages.
Lemma 1. Let L be a tree language and let A = (Q, F, I, ∆) be a top-down tree
automaton which recognizes L. Then
$$\forall c \in C(F),\quad \bigcup_{q \in Q_c} L_q = c^{-1}L.$$
These similarities lead us to this definition of top-down residual tree automata:
Definition 4. A top-down Residual Finite Tree Automaton (↓-RFTA) recognizing a tree language L is a ↓-FTA A = (Q, F , I, ∆) such that: ∀q ∈ Q, ∃c ∈ C(F ),
Lq = c−1 L.
Languages defined in the proof of Proposition 1 are still interesting here to
define examples of top-down residual tree automata:
Example 3. Let us consider again the family of tree languages $L_n$, and the family
of corresponding ↑-RFTA $A_n$. For every n, let $A'_n$ be the ↓-RFTA defined by:
$Q = \{q_*, q_0, \dots, q_n\}$, $I = \{q_0\}$ and
$$\Delta = \{q_*(a) \to a,\ q_n(a) \to a,\ q_*(f) \to f(q_*, q_*)\} \cup \bigcup_{k=1}^{n} \{q_{k-1}(f) \to f(q_k, q_*),\ q_{k-1}(f) \to f(q_*, q_k)\}.$$
For every k ≤ n, the state language of $q_k$ is equal to $L_{n-k}$. And $L_{n-k}$ is
the top-down residual language of $L_n$ relative to $c_k$, where $c_k$ is a context whose height from
the root to the special constant ⋄ is k and $c_k$ does not contain any path whose
length is smaller than or equal to n. The state language of $q_*$ is T(F). And T(F) is
the top-down residual language of $L_n$ relative to $c_*$, where $c_*$ is a context which
contains a path whose length is n. So $A'_n$ is a ↓-RFTA. Moreover, it is easy to
verify that $A'_n$ recognizes $L_n$.
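The claims of Example 3 can be checked by brute force on small trees. The sketch below assumes the tuple encoding of terms and reads "path of length n" as a leaf at depth n. It tests that the top-down automaton recognizes L_n and that the state language of q_k coincides with L_{n-k} on all trees up to a given height. All function names are hypothetical.

```python
A = ("a",)


def build_top_down(n):
    """Rules of the top-down automaton, as (state, symbol) -> set of child tuples."""
    rules = {
        ("q*", "a"): {()},
        (f"q{n}", "a"): {()},
        ("q*", "f"): {("q*", "q*")},
    }
    for k in range(1, n + 1):
        rules[(f"q{k - 1}", "f")] = {(f"q{k}", "q*"), ("q*", f"q{k}")}
    return rules, {"q0"}


def accepts(state, term, rules):
    """True iff state(term) ->* term."""
    for children in rules.get((state, term[0]), ()):
        if len(children) == len(term) - 1 and all(
            accepts(q, sub, rules) for q, sub in zip(children, term[1:])
        ):
            return True
    return False


def leaf_depths(term, depth=0):
    """Set of path lengths of a term, i.e. depths of its leaves."""
    if len(term) == 1:
        return {depth}
    return set().union(*(leaf_depths(sub, depth + 1) for sub in term[1:]))


def trees(height):
    """All terms over {f(,), a} of height at most `height`."""
    if height == 0:
        return [A]
    smaller = trees(height - 1)
    return [A] + [("f", left, right) for left in smaller for right in smaller]


def check(n, height):
    """Check recognition of L_n and the state languages of q_0, ..., q_n."""
    rules, initial = build_top_down(n)
    for t in trees(height):
        depths = leaf_depths(t)
        if any(accepts(q, t, rules) for q in initial) != (n in depths):
            return False
        for k in range(n + 1):
            if accepts(f"q{k}", t, rules) != ((n - k) in depths):
                return False
    return True


print(check(2, 4))  # True
```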
4.2 The expressive power of top-down tree automata
Top-down deterministic automata and path-closed languages. A tree language L
is path-closed if:
$$\forall c \in C(F),\quad c[f(t_1, t_2)] \in L \wedge c[f(t'_1, t'_2)] \in L \Rightarrow c[f(t_1, t'_2)] \in L.$$
The reader should note that the definition only considers binary symbols;
it can easily be extended to n-ary symbols. The class of languages
that ↓-DFTA can recognize is the class of path-closed languages [Vir81].
Context-deterministic automata and homogeneous languages. Nivat and Podelski [NP97] have defined l-r-deterministic top-down tree automata. In the
present paper, let us call them top-down context-deterministic tree automata.
Definition 5. A top-down context-deterministic tree automaton (↓-CFTA) A is
a ↓-FTA such that for every context c ∈ C(F ), Qc is either the empty set or a
singleton set.
A homogeneous language is a tree language L satisfying:
$$\forall c \in C(F),\quad c[f(t_1, t_2)] \in L \wedge c[f(t_1, t'_2)] \in L \wedge c[f(t'_1, t_2)] \in L \Rightarrow c[f(t'_1, t'_2)] \in L.$$
Again, the definition can easily be extended from the binary case to n-ary
symbols. Nivat and Podelski have shown that the class of languages recognized by ↓-CFTA is
the class of homogeneous languages.
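For a finite language over constants and binary symbols, the path-closed and homogeneous conditions can both be tested by grouping, for each context c and binary symbol, the pairs (t1, t2) such that c[f(t1, t2)] belongs to the language. The sketch below assumes the tuple encoding of terms; the helper names are illustrative.

```python
HOLE = ("⋄",)


def binary_splits(term):
    """Yield (c, g, t1, t2) for every way of writing term as c[g(t1, t2)]."""
    if len(term) == 3:
        yield HOLE, term[0], term[1], term[2]
    for i in range(1, len(term)):
        for c, g, t1, t2 in binary_splits(term[i]):
            yield term[:i] + (c,) + term[i + 1:], g, t1, t2


def pairs_by_context(language):
    """Map each (context, binary symbol) to the set of argument pairs seen in L."""
    table = {}
    for term in language:
        for c, g, t1, t2 in binary_splits(term):
            table.setdefault((c, g), set()).add((t1, t2))
    return table


def is_path_closed(language):
    for pairs in pairs_by_context(language).values():
        for t1, _ in pairs:
            for _, u2 in pairs:
                if (t1, u2) not in pairs:
                    return False
    return True


def is_homogeneous(language):
    for pairs in pairs_by_context(language).values():
        for t1, t2 in pairs:
            for s1, s2 in pairs:        # plays the role of (t1, t2')
                for u1, u2 in pairs:    # plays the role of (t1', t2)
                    if s1 == t1 and u2 == t2 and (u1, s2) not in pairs:
                        return False
    return True


a, b, c = ("a",), ("b",), ("c",)
L1 = {("f", a, b), ("f", b, a)}
print(is_path_closed(L1), is_homogeneous(L1))  # False True
```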
The hierarchy. A ↓-DFTA is a ↓-CFTA. For ↓-CFTA and ↓-RFTA, we have the
following result:
Lemma 2. Any trimmed ↓-CFTA is a ↓-RFTA.
Proof. Let A = (Q, F , I, ∆) be a trimmed ↓-CFTA recognizing a tree language
L. As A is trimmed, all states are reachable, so for every q, there exists a c such
that q ∈ Qc . Then, by definition of a ↓-CFTA, for every q, there exists a c such
that {q} = Qc . Using Lemma 1, we have:
$$\forall q \in Q,\ \exists c \in C(F),\quad L_q = c^{-1}L,$$
stating that A is a ↓-RFTA. □
Therefore, if we denote by $L_C$ the class of tree languages recognized by a
class of automata C, we obtain the following hierarchy:
$$L_{\downarrow\text{-DFTA}} \subseteq L_{\downarrow\text{-CFTA}} \subseteq L_{\downarrow\text{-RFTA}} \subseteq L_{\downarrow\text{-FTA}}$$
The hierarchy is strict.
– Let L = {f(a, b), f(b, a)}. L is homogeneous but not path-closed. Therefore
L can be recognized by a ↓-CFTA, but cannot be recognized by a ↓-DFTA.
– The tree languages $L_n$ in the proof of Proposition 1 are not recognized by
↓-CFTA. We can easily verify that $L_n$ is not homogeneous. Indeed, if t is a
term which has a path whose length is equal to n − 1, and t′ a term which
does not have any path whose length is smaller than n, then f(t, t), f(t, t′), f(t′, t)
belong to $L_n$, but f(t′, t′) does not. And we have already shown that $L_n$ is
recognized by a ↓-RFTA.
– Let L′ = {f(a, b), f(a, c), f(b, a), f(b, c), f(c, a), f(c, b)}. L′ is a finite language, therefore it is a regular tree language which can be recognized by a
↓-FTA. L′ cannot be recognized by a ↓-RFTA. To prove that, let us consider
a ↓-FTA A′ which recognizes L′. The top-down residual languages of L′ are
{a, b}, {a, c}, {b, c} and L′. As A′ recognizes L′, it recognizes f(a, b). This implies the existence of three states q1, q2, q3 and three rules q1(f) → f(q2, q3),
q2(a) → a, and q3(b) → b. If A′ were a ↓-RFTA, then the state language of q2 would be a
residual language. As q2 accepts a, it would be either {a, b} or {a, c}.
Similarly, the state language of q3 would be either {a, b} or {b, c}. In these conditions, and
thanks to the rule q1(f) → f(q2, q3), A′ would recognize f(a, a), f(b, b) or
f(c, c). So A′ cannot be a ↓-RFTA.
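The residual languages of L′ used in the last item, and the case analysis on q2 and q3, can be reproduced with a short computation. This is a sketch under the assumption that terms are nested tuples. Only non-empty residuals are enumerated, which is possible because L′ is finite. The helper names are hypothetical.

```python
HOLE = ("⋄",)
a, b, c = ("a",), ("b",), ("c",)
L_PRIME = {("f", x, y) for x in (a, b, c) for y in (a, b, c) if x != y}


def splits(term):
    """Yield (context, subterm) for every occurrence of a subterm in term."""
    yield HOLE, term
    for i in range(1, len(term)):
        for ctx, sub in splits(term[i]):
            yield term[:i] + (ctx,) + term[i + 1:], sub


def nonempty_residuals(language):
    """All non-empty top-down residual languages of a finite language."""
    by_context = {}
    for term in language:
        for ctx, sub in splits(term):
            by_context.setdefault(ctx, set()).add(sub)
    return {frozenset(residual) for residual in by_context.values()}


residuals = nonempty_residuals(L_PRIME)

# Candidate state languages: residuals containing a (for q2) or b (for q3).
options_q2 = [r for r in residuals if a in r]
options_q3 = [r for r in residuals if b in r]

# Every combination shares a constant x, so f(x, x) would be recognized.
clash = all(
    any(("f", x, x) not in L_PRIME for x in r2 & r3)
    for r2 in options_q2
    for r3 in options_q3
)
print(len(residuals), clash)  # 4 True
```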
Therefore, we obtain the following result:
Theorem 3. $L_{\downarrow\text{-DFTA}} \subsetneq L_{\downarrow\text{-CFTA}} \subsetneq L_{\downarrow\text{-RFTA}} \subsetneq L_{\downarrow\text{-FTA}}$
So top-down residual tree automata are strictly more expressive than context-deterministic tree automata. But as far as we know, there is no straightforward
characterization of the tree languages recognized by ↓-RFTA.
4.3 The canonical form of top-down residual tree automata
The problem of the canonical form of top-down tree automata is similar to the
bottom-up case. Whereas there is no way to reduce a non-deterministic top-down
tree automaton to a unique canonical form, a top-down residual tree automaton
can take such a form. Its definition is similar to the definition of the canonical
bottom-up tree automaton.
In the same way that we have defined composite bottom-up residual language,
a top-down residual language of L is composite if and only if it is the union of
the top-down residual languages that it strictly contains and a residual language
is prime if and only if it is not composite.
Theorem 4. Let L be a tree language in the class $L_{\downarrow\text{-RFTA}}$. Let us consider
the ↓-RFTA $A_{can} = (Q, F, I, \Delta)$ defined by:
– Q is a set of states in bijection with the prime residual languages of L. For
each of these residual languages, there exists a context $c_q$ such that q is associated
with $c_q^{-1}L$ in this bijection.
– I is the set of prime residuals which are subsets of L.
– ∆ contains all the rules $q(a) \to a$ such that a is a constant and $c_q[a] \in L$, and
all the rules $q(f) \to f(q_1, \dots, q_n)$ such that for all $t_1, \dots, t_n$ where $t_i \in c_{q_i}^{-1}L$,
$c_q[f(t_1, \dots, t_n)] \in L$.
$A_{can}$ is a ↓-RFTA, it is the smallest ↓-RFTA in number of states which
recognizes L, and it is unique up to a renaming of its states.
Sketch of proof. The proof is mainly based on this lemma: $t \in c_q^{-1}L \Leftrightarrow t \in L_q^{A_{can}}$,
where $L_q^{A_{can}}$ is the state language of q in the automaton $A_{can}$.
This lemma is proved by induction on the height of t. This is not a straightforward induction. It involves the rules of a ↓-RFTA A′ which recognizes
L. Its existence is granted by the hypothesis of the theorem.
Once this is proved, it can be easily deduced that $A_{can}$ recognizes L and is a
↓-RFTA. As there is one state per prime residual in $A_{can}$, it is minimal in number
of states. □
5 Decidability issues
Some decision problems naturally arise with the definition of RFTA. Most of these problems are solved by noting that one can build all residual languages of a given regular language L defined by a non-deterministic tree automaton. In the bottom-up case, the state languages of the minimal ↑-RFTA which recognizes L are exactly the residual languages of L, and this automaton can be built with the subset construction. In the top-down case, the subset construction does not necessarily give us an automaton which recognizes exactly L, but it gives us the set of all residual languages. Therefore, it is decidable whether a tree automaton is a RFTA, whether a residual language is prime or composite, and whether a tree automaton is a canonical RFTA. The complexity of these problems has not been studied in depth, but they are at least as hard as the corresponding problems on strings, that is, they are PSPACE-hard ([DLT02a]).
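The bottom-up subset construction mentioned above can be sketched as follows. This is an illustrative implementation, not the authors' code: the rule encoding (a dictionary from `(symbol, tuple_of_states)` to a set of states) and the name `subset_construction` are assumptions. Each reachable subset S is the set of states reached by some tree t, so by Equation (1) it stands for the residual $t^{-1}L = \bigcup_{q \in S} C_q$.

```python
from itertools import product


def subset_construction(rules, arities):
    """Determinise a bottom-up tree automaton.

    rules:   dict mapping (symbol, (q1, ..., qn)) to a set of target states
    arities: dict mapping each symbol to its arity
    Returns the reachable subset states and the deterministic rule table.
    """
    det_states, det_rules = set(), {}
    changed = True
    while changed:
        changed = False
        known = list(det_states)  # snapshot: new subsets are used on the next pass
        for symbol, arity in arities.items():
            for args in product(known, repeat=arity):
                if (symbol, args) in det_rules:
                    continue
                # Union of the targets of every rule applicable to the argument subsets.
                target = frozenset(
                    q
                    for states in product(*args)
                    for q in rules.get((symbol, states), ())
                )
                det_rules[(symbol, args)] = target
                if target not in det_states:
                    det_states.add(target)
                    changed = True
    return det_states, det_rules


# Hypothetical automaton for the trees f(...f(a)...) with at least one f.
rules = {
    ("a", ()): {"q0"},
    ("f", ("q0",)): {"q0", "q1"},
    ("f", ("q1",)): {"q1"},
}
states, table = subset_construction(rules, {"a": 0, "f": 1})
```

The number of subsets can be exponential in the number of states of the input automaton, which is consistent with the hardness remark above.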
6 Conclusion
We have defined new classes of non-deterministic tree automata. In the bottom-up case, we get another characterization of regular tree languages. More interestingly, in the top-down case, we obtain a subclass of the regular tree languages. For both cases, we have a canonical form, and residual tree automata can be much smaller than equivalent deterministic ones (when these exist).
We are currently extending these results to the case of unranked trees because our application domain is concerned with HTML and XML documents. Also, we are designing learning algorithms for residual finite tree automata extending previous algorithms for residual finite string automata [DLT01,DLT02b].
References
CDG+97. H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata, 1997.
DLT01. F. Denis, A. Lemay, and A. Terlutte. Learning regular languages using RFSA. In ALT 2001, number 2225 in Lecture Notes in Artificial Intelligence, pages 348–363. Springer Verlag, 2001.
DLT02a. F. Denis, A. Lemay, and A. Terlutte. Residual finite state automata. Fundamenta Informaticae, 51(4):339–368, 2002.
DLT02b. F. Denis, A. Lemay, and A. Terlutte. Some language classes identifiable in the limit from positive data. In ICGI 2002, number 2484 in Lecture Notes in Artificial Intelligence, pages 63–76. Springer Verlag, 2002.
Fer02. Henning Fernau. Learning tree languages from text. In Proc. 15th Annual Conference on Computational Learning Theory, COLT 2002, pages 153–168, 2002.
GK02. Sally A. Goldman and Stephen S. Kwek. On learning unions of pattern languages and tree patterns in the mistake bound model. Theoretical Computer Science, 288(2):237–254, 2002.
Gol67. E.M. Gold. Language identification in the limit. Inform. Control, 10:447–474, 1967.
GS84. F. Gécseg and M. Steinby. Tree Automata. Akademiai Kiado, 1984.
GS96. F. Gécseg and M. Steinby. Tree languages. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, volume 3, pages 1–68. Springer Verlag, 1996.
LPH00. Ling Liu, Calton Pu, and Wei Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, pages 611–621, 2000.
MLM01. M. Murata, D. Lee, and M. Mani. Taxonomy of XML Schema Languages using Formal Language Theory. In Extreme Markup Languages, Montreal, Canada, 2001.
Nev02. F. Neven. Automata theory for XML researchers. SIGMOD Rec., 31(3):39–46, 2002.
NP97. M. Nivat and A. Podelski. Minimal ascending and descending tree automata. SIAM Journal on Computing, 26(1):39–58, February 1997.
Sak90. Y. Sakakibara. Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76:223–242, 1990.
Tha73. J.W. Thatcher. Tree automata: an informal survey. In A.V. Aho, editor, Currents in the theory of computing, pages 143–178. Prentice Hall, 1973.
Tho84. Wolfgang Thomas. Logical aspects in the study of tree languages. In Proceedings of the 9th International Colloquium on Trees in Algebra and Programming, CAAP ’84, pages 31–50, 1984.
Vir81. J. Viragh. Deterministic ascending tree automata. Acta Cybernetica, 5:33–42, 1981.
A Appendix
A.1 Proof of Equation (1)
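Let L be a tree language and $A = (Q, \mathcal{F}, Q_f, \Delta)$ a ↑-FTA which recognizes it. We show that
$$\forall t \in T(\mathcal{F}),\quad t^{-1}L = \bigcup_{t \to^*_A q} C_q.$$
Let $t \in T(\mathcal{F})$ and $c \in t^{-1}L$. Then $c[t] \in L$, so there exist $q_f \in Q_f$ and $q \in Q$ such that $c[t] \to^*_A c[q] \to^*_A q_f$, where $t \to^*_A q$ and $c \in C_q$. So $c \in \bigcup_{t \to^*_A q} C_q$, and therefore
$$t^{-1}L \subseteq \bigcup_{t \to^*_A q} C_q.$$
Conversely, let $t \in T(\mathcal{F})$ and $c \in \bigcup_{t \to^*_A q} C_q$. There exists a $q \in Q$ such that $t \to^*_A q$ and $c \in C_q$. So there exists $q_f \in Q_f$ such that $c[t] \to^*_A c[q] \to^*_A q_f$. So $c \in t^{-1}L$, and therefore
$$\bigcup_{t \to^*_A q} C_q \subseteq t^{-1}L.$$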
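Equation (1) can be checked on concrete trees by computing both sides for a small automaton. The sketch below is only an illustration under assumed encodings: trees are nested tuples, the hole is the reserved constant `"<>"`, and the helper names (`reach`, `plug`, `in_residual`, `in_state_contexts`) are hypothetical.

```python
from itertools import product

HOLE = "<>"  # assumed name for the hole; must not clash with a symbol of F


def reach(tree, rules):
    """Return {q | tree ->*_A q} for a bottom-up automaton given by `rules`."""
    symbol, *children = tree
    child_states = [reach(child, rules) for child in children]
    return {
        q
        for states in product(*child_states)
        for q in rules.get((symbol, states), ())
    }


def plug(context, tree):
    """Return context[tree]: replace the hole of `context` by `tree`."""
    symbol, *children = context
    if symbol == HOLE:
        return tree
    return (symbol, *(plug(child, tree) for child in children))


def in_residual(context, tree, rules, finals):
    """Left-hand side of Equation (1): is c in t^{-1}L, i.e. is c[t] in L?"""
    return bool(reach(plug(context, tree), rules) & finals)


def in_state_contexts(context, q, rules, finals):
    """Is c in C_q, i.e. does c[q] reduce to a final state?"""
    with_hole = {**rules, (HOLE, ()): {q}}  # the hole acts as a constant reaching q
    return bool(reach(context, with_hole) & finals)


# Hypothetical automaton for the trees f(...f(a)...) with at least one f.
rules = {("a", ()): {"q0"}, ("f", ("q0",)): {"q0", "q1"}, ("f", ("q1",)): {"q1"}}
finals = {"q1"}
t = ("a",)
contexts = [(HOLE,), ("f", (HOLE,)), ("f", ("f", (HOLE,)))]
lhs = [in_residual(c, t, rules, finals) for c in contexts]
rhs = [any(in_state_contexts(c, q, rules, finals) for q in reach(t, rules))
       for c in contexts]
```

On these three contexts both sides agree: the empty context is rejected because a alone is not in the language, while the two contexts that add at least one f are accepted.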
A.2 Proof of Theorem 2
Theorem 5. The canonical ↑-RFTA recognizing a regular tree language is the smallest ↑-RFTA which recognizes it. Therefore, ↑-RFTA admit a unique and minimal representation.
The first point we have to demonstrate in this theorem is that the canonical
↑-RFTA that we have defined recognizes L.
Before this demonstration, we need to establish two properties of residual
languages:
Lemma 3. Let L be a regular language. Then
$$\forall i,\ 1 \le i \le n,\ t_i^{-1}L \subseteq {t'_i}^{-1}L \;\Rightarrow\; f(t_1, \dots, t_n)^{-1}L \subseteq f(t'_1, \dots, t'_n)^{-1}L.$$
Proof. This lemma can be proven inductively on i. Let $t_1, \dots, t_n$ and $t'_1, \dots, t'_n$ be such that for all i, $t_i^{-1}L$ is a subset of ${t'_i}^{-1}L$. Let c be in $f(t_1, \dots, t_n)^{-1}L$. Let us assume that $c[f(t'_1, \dots, t'_{i-1}, t_i, \dots, t_n)] \in L$. This implies that $c[f(t'_1, \dots, t'_{i-1}, \diamond, t_{i+1}, \dots, t_n)] \in t_i^{-1}L$, and therefore $c[f(t'_1, \dots, t'_{i-1}, \diamond, t_{i+1}, \dots, t_n)] \in {t'_i}^{-1}L$.
So $c[f(t'_1, \dots, t'_i, t_{i+1}, \dots, t_n)] \in L$. Inductively, $c[f(t'_1, \dots, t'_n)] \in L$.
So $f(t_1, \dots, t_n)^{-1}L \subseteq f(t'_1, \dots, t'_n)^{-1}L$.
□
Lemma 4.
$$\forall i,\ 1 \le i \le n,\ t_i^{-1}L = \bigcup_{j_i} t_{i,j_i}^{-1}L \;\Rightarrow\; f(t_1, \dots, t_n)^{-1}L = \bigcup_{j_1 \dots j_n} f(t_{1,j_1}, \dots, t_{n,j_n})^{-1}L$$
Here, $\bigcup_{j_1 \dots j_n}$ has to be understood as the union over all possible combinations of $j_1, \dots, j_n$.
Proof. Let $t_1, \dots, t_n$ and, for all $i \le n$, $t_{i,1}, \dots, t_{i,m_i}$ be such that $t_i^{-1}L = \bigcup_{1 \le j_i \le m_i} t_{i,j_i}^{-1}L$. By lemma 3,
$$\forall t_{1,j_1}, \dots, t_{n,j_n},\ \forall i \le n,\ t_{i,j_i}^{-1}L \subseteq t_i^{-1}L \;\Rightarrow\; \forall t_{1,j_1}, \dots, t_{n,j_n},\ f(t_{1,j_1}, \dots, t_{n,j_n})^{-1}L \subseteq f(t_1, \dots, t_n)^{-1}L$$
$$\Rightarrow\; \bigcup_{j_1 \dots j_n} f(t_{1,j_1}, \dots, t_{n,j_n})^{-1}L \subseteq f(t_1, \dots, t_n)^{-1}L.$$
Now, let c be in $f(t_1, \dots, t_n)^{-1}L$.
$$c[f(t_1, \dots, t_n)] \in L \;\Rightarrow\; c[f(\diamond, t_2, \dots, t_n)] \in t_1^{-1}L$$
As $t_1^{-1}L = \bigcup_{j} t_{1,j}^{-1}L$, there exists $t_{1,k_1}$ such that $c[f(\diamond, t_2, \dots, t_n)] \in t_{1,k_1}^{-1}L$. So $c[f(t_{1,k_1}, t_2, \dots, t_n)] \in L$.
It can be proven inductively on i that there exist $t_{1,k_1}, \dots, t_{n,k_n}$ such that $c[f(t_{1,k_1}, \dots, t_{n,k_n})] \in L$. So $c \in f(t_{1,k_1}, \dots, t_{n,k_n})^{-1}L$. So:
$$f(t_1, \dots, t_n)^{-1}L \subseteq \bigcup_{j_1 \dots j_n} f(t_{1,j_1}, \dots, t_{n,j_n})^{-1}L$$
□
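As a small worked instance of Lemma 4, consider a binary symbol f and a decomposition chosen purely for illustration (the indices $m_1 = 2$ and $m_2 = 1$ are assumptions, not taken from the paper). The union on the right then ranges over the two combinations $(j_1, j_2) \in \{(1,1), (2,1)\}$:

```latex
% Assumed decomposition: m_1 = 2, m_2 = 1
t_1^{-1}L = t_{1,1}^{-1}L \cup t_{1,2}^{-1}L,
\qquad
t_2^{-1}L = t_{2,1}^{-1}L
\;\Longrightarrow\;
f(t_1, t_2)^{-1}L = f(t_{1,1}, t_{2,1})^{-1}L \,\cup\, f(t_{1,2}, t_{2,1})^{-1}L
```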
Now, we can prove inductively this lemma, which is the main step to prove
the equality between L and L(Acan )
Lemma 5.
$$\forall t,\quad t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$$
Proof. Let us prove this lemma by induction on the height h(t) of t.
Let us assume that h(t) = 1, so t = a where a is a constant. A residual is a union of prime residuals, so:
$$a^{-1}L = \bigcup_{t_q^{-1}L \subseteq a^{-1}L} t_q^{-1}L$$
As $t_q^{-1}L \subseteq a^{-1}L$ if and only if $A_{can}$ contains the rule $a \to q$:
$$a^{-1}L = \bigcup_{a \to^*_{A_{can}} q} t_q^{-1}L$$
Now let us assume that for any term t such that $h(t) \le k$, lemma 5 is true. Let $t = f(t_1, \dots, t_n)$ such that h(t) = k + 1. Then for all $i \le n$, $h(t_i) \le k$, so by the induction hypothesis
$$\forall i,\quad t_i^{-1}L = \bigcup_{t_i \to^*_{A_{can}} q_{i,j_i}} t_{q_{i,j_i}}^{-1}L,$$
and lemma 4 gives
$$t^{-1}L = \bigcup_{t_i \to^*_{A_{can}} q_{i,j_i}} f(t_{q_{1,j_1}}, \dots, t_{q_{n,j_n}})^{-1}L.$$
Any residual is a union of prime residuals, so for all $j_1, \dots, j_n$:
$$f(t_{q_{1,j_1}}, \dots, t_{q_{n,j_n}})^{-1}L = \bigcup_{t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \dots, t_{q_{n,j_n}})^{-1}L} t_q^{-1}L$$
So:
$$t^{-1}L = \bigcup_{t_i \to^*_{A_{can}} q_{i,j_i}} \Bigl( \bigcup_{t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \dots, t_{q_{n,j_n}})^{-1}L} t_q^{-1}L \Bigr)$$
As $t_q^{-1}L \subseteq f(t_{q_{1,j_1}}, \dots, t_{q_{n,j_n}})^{-1}L$ if and only if $A_{can}$ contains the rule $f(q_{1,j_1}, \dots, q_{n,j_n}) \to q$,
$$t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$$
□
The equality between L and L(Acan ) is formalized as such:
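Lemma 6. The canonical ↑-RFTA $A_{can}$ of a language L recognizes L, that is:
$$\exists q_f \in Q_f,\ t \to^*_{A_{can}} q_f \;\Leftrightarrow\; t \in L$$
Proof. Let $t \in T(\mathcal{F})$ and $q_f \in Q_f$ such that $t \to^*_{A_{can}} q_f$.
$$t \to^*_{A_{can}} q_f \;\Rightarrow\; t_{q_f}^{-1}L \subseteq \bigcup_{t \to^*_{A_{can}} q_j} t_{q_j}^{-1}L \;\overset{\text{lemma 5}}{\Rightarrow}\; t_{q_f}^{-1}L \subseteq t^{-1}L$$
As $\diamond \in t_{q_f}^{-1}L$, $\diamond \in t^{-1}L$, so $t \in L$.
Let $t \in L$.
$$\diamond \in t^{-1}L \;\Rightarrow\; \exists q_j \mid t \to^*_{A_{can}} q_j \wedge \diamond \in t_{q_j}^{-1}L \;\Rightarrow\; \exists q_j \mid t \to^*_{A_{can}} q_j \wedge q_j \in Q_f$$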
Now, we have to prove that the canonical ↑-RFTA is a ↑-RFTA. In order to
do this, we need to establish this lemma:
Lemma 7. Let $t_q^{-1}L$ and $t_{q'}^{-1}L$ be prime bottom-up residual languages of L. Let $C_q^{A_{can}}$ and $C_{q'}^{A_{can}}$ be the sets of contexts accepted by q and q′ in the canonical automaton $A_{can}$ of L. Then:
$$t_{q'}^{-1}L \subseteq t_q^{-1}L \;\Rightarrow\; C_{q'}^{A_{can}} \subseteq C_q^{A_{can}}$$
Proof. Let $t_q$ and $t_{q'}$ be such that $t_{q'}^{-1}L \subseteq t_q^{-1}L$. For all $t_{q_1}, \dots, t_{q_n}$, $f(t_{q_1}, \dots, t_{q'}, \dots, t_{q_n})^{-1}L \subseteq f(t_{q_1}, \dots, t_q, \dots, t_{q_n})^{-1}L$ (lemma 3).
The construction of the set of rules of the canonical automaton implies that:
$$f(q_1, \dots, q_n) \to q' \in \Delta \;\Leftrightarrow\; t_{q'}^{-1}L \subseteq f(t_{q_1}, \dots, t_{q_n})^{-1}L$$
So:
$$f(q_1, \dots, q', \dots, q_n) \to q'' \in \Delta \;\Rightarrow\; t_{q''}^{-1}L \subseteq f(t_{q_1}, \dots, t_{q'}, \dots, t_{q_n})^{-1}L$$
$$\Rightarrow\; t_{q''}^{-1}L \subseteq f(t_{q_1}, \dots, t_q, \dots, t_{q_n})^{-1}L \;\Rightarrow\; f(q_1, \dots, q, \dots, q_n) \to q'' \in \Delta$$
So each context accepted by q′ is accepted by q.
So $C_{q'}^{A_{can}} \subseteq C_q^{A_{can}}$.
□
Lemma 8. The canonical RFTA $A_{can}$ of a language L is a residual finite tree automaton.
Proof. Let $t_q^{-1}L$ be a prime residual language of L. Thanks to lemma 5:
$$t_q^{-1}L = \bigcup_{t_q \to^*_{A_{can}} q'} t_{q'}^{-1}L$$
If $t_q^{-1}L$ strictly contained all the $t_{q'}^{-1}L$ of the union, it would be composite. As it is prime, $t_q^{-1}L$ is itself an element of this union, so $t_q \to^*_{A_{can}} q$.
Equation (1) tells us that:
$$t_q^{-1}L = \bigcup_{t_q \to^*_{A_{can}} q'} C_{q'}$$
So $C_q \subseteq t_q^{-1}L$.
For all q′ such that $t_q \to^*_{A_{can}} q'$, $t_{q'}^{-1}L \subseteq t_q^{-1}L$, so $C_{q'} \subseteq C_q$ (lemma 7). As the union of all $C_{q'}$ is equal to $t_q^{-1}L$, $t_q^{-1}L \subseteq C_q$.
So $t_q^{-1}L = C_q$, so every prime residual language is accepted by its corresponding state.
So $A_{can}$ is a RFTA.
□
Lemma 9. The canonical RFTA $A_{can}$ of a language L is the smallest RFTA which recognizes L.
Proof. Let $A_{can}$ be the canonical RFTA of a language L, and t such that $\nexists q \in Q,\ t^{-1}L = C_q$.
Thanks to lemma 5, $t^{-1}L = \bigcup_{t \to^*_{A_{can}} q} t_q^{-1}L$. As $C_q = t_q^{-1}L$ for every state q (lemma 8), $\nexists q \in Q,\ t^{-1}L = t_q^{-1}L$, so $t^{-1}L$ is a union of residuals that it strictly contains. So $t^{-1}L$ is a composite residual.
So for all prime residuals $t^{-1}L$, there is a q such that $t^{-1}L = C_q$. $A_{can}$ contains as many states as there are prime residuals of L, so it is the smallest RFTA which recognizes L.
□
A.3 Proof of Theorem 4
Theorem 6. Let L be a language recognized by a ↓-RFTA. The canonical top-down residual tree automaton of L is the smallest ↓-RFTA which recognizes L.
In order to prove this theorem, let us first prove the following lemmas:
Lemma 10. Let A = (Q, F , I, ∆) be a ↓-RFTA which recognizes L. For any
prime residual c−1 L, there exists a state q ∈ Q such that Lq = c−1 L.
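Proof. Let c be a context of L such that $\nexists q \in Q,\ c^{-1}L = L_q$.
Lemma 1 implies that $c^{-1}L = \bigcup_{q \in Q_c} L_q$ and none of these $L_q$ are equal to $c^{-1}L$. As $\forall q \in Q,\ L_q = c_q^{-1}L$, we have $c^{-1}L = \bigcup_{q \in Q_c} c_q^{-1}L$ where none of the $c_q^{-1}L$ are equal to $c^{-1}L$. So $c^{-1}L$ is composite. Hence every prime residual is the language of some state of A.
□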
Now, let us make the main part of the demonstration: let us prove that each
prime residual language is exactly accepted by a state of the canonical ↓-RFTA.
Lemma 11. Let L be a language recognized by a ↓-RFTA. Let $A_{can} = (Q, \mathcal{F}, I, \Delta)$ be its canonical automaton. For all q in Q, $c_q^{-1}L = L_q$.
Proof. As seen in the definition, Q is in bijection with the set of all prime residual languages, so for all q there exists a corresponding $c_q^{-1}L$. Let us prove inductively on the height of t that $t \in c_q^{-1}L \Leftrightarrow t \in L_{A_{can},q}$. Let us call H(n) this hypothesis when $h(t) \le n$.
Firstly, let us prove H(1).
Let t be such that h(t) = 1 and $t \in c_q^{-1}L$. As h(t) = 1, t = a where a is a constant. As $t \in c_q^{-1}L$, $c_q[a] \in L$. So ∆ contains the rule $q(a) \to a$, so $t \in L_{A_{can},q}$.
Reciprocally, $t \in L_{A_{can},q}$ where t = a implies that ∆ contains the rule $q(a) \to a$. This rule exists in the canonical automaton if and only if a is a constant and $c_q[a] \in L$. So $c_q[a] \in L$, so $t \in c_q^{-1}L$.
Now, let us assume that H(l) is true when l < k. Let us prove that H(k) is true.
Let $t = f(t_1, \dots, t_n) \in c_q^{-1}L$ such that h(t) = k. For all $t_i$ where $1 \le i \le n$, $t_i \in c_q[f(t_1, \dots, t_{i-1}, \diamond, t_{i+1}, \dots, t_n)]^{-1}L$.
Now, let us consider $A' = (Q', \mathcal{F}, I', \Delta')$ a ↓-RFTA which recognizes L. As L is recognized by a ↓-RFTA, A′ exists. We will use this automaton to prove the existence of a rule $q(f) \to f(q_1, \dots, q_n)$ such that for all i, $q_i(t_i) \to^*_{A_{can}} t_i$ in $A_{can}$.
As $c_q^{-1}L$ is prime, there exists a $q' \in Q'$ such that $L_{A',q'} = c_q^{-1}L$ (lemma 10). As $t \in c_q^{-1}L$, there exists in ∆′ a rule $q'(f) \to f(q'_1, \dots, q'_n)$ such that for all i, $1 \le i \le n$, we have $t_i \in L_{A',q'_i}$.
For all $t'_1, \dots, t'_n$ such that $t'_i \in L_{A',q'_i}$, $f(t'_1, \dots, t'_n) \in c_q^{-1}L$.
As $L_{A',q'_i}$ is a residual, it is either a prime residual or a composite residual. If it is a prime residual, there exists a $q_i \in Q$ such that $L_{A',q'_i} = c_{q_i}^{-1}L$ and $t_i \in c_{q_i}^{-1}L$. If it is a composite residual, there exists a $q_i \in Q$ such that $c_{q_i}^{-1}L \subset L_{A',q'_i}$ and $t_i \in c_{q_i}^{-1}L$.
So there exist $q_1, \dots, q_n$ such that $t_i \in c_{q_i}^{-1}L \subseteq L_{A',q'_i}$. So for all $t'_1, \dots, t'_n$ in $c_{q_1}^{-1}L, \dots, c_{q_n}^{-1}L$, $f(t'_1, \dots, t'_n) \in L_{A',q'} = c_q^{-1}L$. So the rule $q(f) \to f(q_1, \dots, q_n)$ exists in ∆.
For all $t_i$, $h(t_i) < k$, so as we have assumed that H(l) is true when l < k, $H(h(t_i))$ is true. So for all i, $t_i \in L_{A_{can},q_i}$. As $q(f) \to f(q_1, \dots, q_n)$ is a rule of ∆, $t \in L_{A_{can},q}$.
We have proven that $t \in c_q^{-1}L \Rightarrow t \in L_{A_{can},q}$. Now let us prove that $t \in L_{A_{can},q} \Rightarrow t \in c_q^{-1}L$.
Let $t = f(t_1, \dots, t_n) \in L_{A_{can},q}$ such that h(t) = k.
There exist $q_1, \dots, q_n$ such that $q(f(t_1, \dots, t_n)) \to^*_{A_{can}} f(q_1(t_1), \dots, q_n(t_n)) \to^*_{A_{can}} f(t_1, \dots, t_n)$.
For all i, $t_i \in L_{A_{can},q_i}$ and $h(t_i) < k$, so $H(h(t_i))$ is assumed to be true, so $t_i \in c_{q_i}^{-1}L$. The existence of the rule $q(f) \to f(q_1, \dots, q_n)$ in ∆ implies that for all $t'_1, \dots, t'_n$ such that $t'_i \in c_{q_i}^{-1}L$, $c_q[f(t'_1, \dots, t'_n)] \in L$. So $t \in c_q^{-1}L$.
So H(k) is true. We have proven inductively that for any t, $t \in L_{A_{can},q} \Leftrightarrow t \in c_q^{-1}L$.
□
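The state languages $L_{A,q}$ used throughout this proof can be made concrete with a short membership test for top-down automata. This is a minimal sketch under assumed encodings: trees are nested tuples, and the dictionary `rules` maps a pair (state, symbol) to the set of tuples $(q_1, \dots, q_n)$ of the rules $q(f) \to f(q_1, \dots, q_n)$. The function names and the example automaton are hypothetical.

```python
def in_state_language(q, tree, rules):
    """Is `tree` in L_{A,q}?  Constants are handled by rules with targets ()."""
    symbol, *children = tree
    return any(
        len(targets) == len(children)
        and all(in_state_language(qi, ti, rules) for qi, ti in zip(targets, children))
        for targets in rules.get((q, symbol), ())
    )


def accepts(tree, rules, initial):
    """A top-down automaton accepts `tree` if some initial state does."""
    return any(in_state_language(q, tree, rules) for q in initial)


# Hypothetical automaton for L = {f(a, b), f(b, a)}.
rules = {
    ("q", "f"): {("qa", "qb"), ("qb", "qa")},
    ("qa", "a"): {()},
    ("qb", "b"): {()},
}
a, b = ("a",), ("b",)
ok = accepts(("f", a, b), rules, {"q"})
ko = accepts(("f", a, a), rules, {"q"})
```

Here the state language of `qa` is {a}, that of `qb` is {b}, and that of `q` is L itself, so each state language is a top-down residual of L.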
Lemma 12. $A_{can} = (Q, \mathcal{F}, I, \Delta)$ is a ↓-RFTA, recognizes L, and is minimal in number of states.
Proof. Let us prove that lemma 11 implies that $L(A_{can}) = L$. Let $t \in L$. $\diamond^{-1}L = L$ is a residual, so it is a union of prime residuals. So there exists $q_i \in Q$ such that $t \in c_{q_i}^{-1}L$ and $c_{q_i}^{-1}L \subseteq L$. As $c_{q_i}^{-1}L = L_{A_{can},q_i}$, we have $t \in L_{A_{can},q_i}$. As $c_{q_i}^{-1}L \subseteq L$, $q_i$ is initial, so $t \in L(A_{can})$.
Reciprocally, let $t \in L(A_{can})$. There exists a $q_i \in I$ such that $t \in L_{A_{can},q_i}$. $c_{q_i}^{-1}L = L_{q_i}$, so $t \in c_{q_i}^{-1}L$. As $q_i$ is initial, $c_{q_i}^{-1}L$ is a subset of L. So $t \in L$. So $L = L(A_{can})$.
So $A_{can}$ recognizes L. For any q, $L_q = c_q^{-1}L$, so $A_{can}$ is a RFTA. For any prime residual of L, any ↓-RFTA which recognizes L has a state whose language is this residual (lemma 10). As there is one state per prime residual in $A_{can}$, $A_{can}$ is minimal in number of states.
□