RAIRO-Theor. Inf. Appl. 45 (2011) 77–97
DOI: 10.1051/ita/2011012
Available online at:
www.rairo-ita.org
CONSENSUAL LANGUAGES AND MATCHING
FINITE-STATE COMPUTATIONS ∗, ∗∗
Stefano Crespi Reghizzi 1 and Pierluigi San Pietro 1
Abstract. An ever present, common sense idea in language modelling
research is that, for a word to be a valid phrase, it should comply with
multiple constraints at once. A new language definition model is studied, based on agreement or consensus between similar strings. Considering a regular set of strings over a bipartite alphabet made by pairs
of unmarked/marked symbols, a match relation is introduced, in order
to specify when such strings agree. Then a regular set over the bipartite alphabet can be interpreted as specifying another language over
the unmarked alphabet, called the consensual language. A word is in
the consensual language if a set of corresponding matching strings is in
the original language. The family thus defined includes the regular languages and also interesting non-semilinear ones. The word problem can
be solved in NLOGSPACE, hence in P time. The emptiness problem
is undecidable. Closure properties are proved for intersection with regular sets and inverse alphabetical homomorphism. Several conditions
for a consensual definition to yield a regular language are presented,
and it is shown that the size of a consensual specification of regular
languages can be in a logarithmic ratio with respect to a DFA. The
family is incomparable with context-free and tree-adjoining grammar
families.
Mathematics Subject Classification. 68Q45, 68Q42, 68Q19.
Keywords and phrases. Formal languages, finite automata, consensual languages, counter
machines, polynomial time parsing, non-semilinear languages, Parikh mapping, descriptive
complexity of regular languages, degree of grammaticality.
∗ With partial support from PRIN 2005015419, FIRB “Applicazioni della Teoria degli Automi
all’Analisi, Compilazione e Verifica di Software Critico e in Tempo Reale”, and CNR-IEIIT.
∗∗ Preliminary, partial versions were presented at LATA 2008 [2] and ICTCS 2009 [3] conferences.
1 Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da
Vinci, 32, 20133 Milano, Italy; {crespi;sanpietro}@elet.polimi.it
Article published by EDP Sciences
© EDP Sciences 2011
Introduction
An ever present, common sense idea in language modelling research is that,
for a word to be a valid phrase, it should comply with multiple constraints at
once. Theories of grammar have taken various approaches for expressing the constraints by different mechanisms, such as by superimposing semantic constraints
on syntactic ones, or by using intersections of, say, context-free languages. Of
course, the motivation for language definitions based on agreement or reinforcement
between separate processes comes from the overwhelming complexity of monolithic definitions and, in the case of natural language, is supported by the findings
of neurolinguistic research.
Here we propose a very simple novel mechanism, where the constraints are
expressed by an elementary letter by letter agreement between strings belonging to
a regular language. The alphabet is bipartite, made by pairs of unmarked/marked
characters. The agreement is formalized by a k-ary relation, called match, that is
satisfied by a set of k equally long strings if, in each position, exactly one word has
an unmarked letter and the other strings have the same letter but marked. In our
metaphor we view such strings as providing mutual consensus on the validity of
the corresponding unmarked string. This justifies the name “consensual” proposed
for the new family, which strictly includes the regular one.
Here some reader may prefer to jump to the definition (Defs. 1.1, 1.3 and 1.4)
of consensual language, before reading the next discussion of the position of the
new model from the perspective of language theory.
With respect to their storage, abstract language recognition devices can be
classified as using tapes (Turing machines, push-down machines, nested pushdown machines) or counters. The latter case includes various models of counter
machines and also Petri Nets. Consensual languages are recognized by real-time
non-deterministic multi counter machines with a linear bound on the counter values.
Considering the complexity of the word recognition problem, consensual languages belong to the polynomial time class.
With respect to generative capacity, the new family shares little ground with the
families of context-free and mildly context-sensitive [8] languages. For instance, the
Dyck language over two letters can be defined but not the language of palindromes.
On the other hand interesting non-semilinear languages (in the Parikh sense [7])
can be easily defined.
Next we compare and contrast the computation performed by a consensual
recognizer versus an alternating finite automaton [1]. Although both machines
perform simultaneous computations for recognizing a given string, they apply entirely different acceptance criteria. All possible computations must be successful for a word to be recognized by an alternating machine when using universal
non-determinism, and their number may be exponential with respect to the word
length. On a consensual device, the computations performed on the finite automaton, which can be assumed to be deterministic, are not labelled by the input word
(except in the trivial case when the language is regular) but by matching strings
over the marked/unmarked alphabet. The number of computations is bounded by
the input length.
Recalling that certain Petri net language families [4] include non-semilinear
languages and that their recognizers use counters, a vague resemblance between
the two models may be mentioned. In fact C.A. Petri introduced his nets as a
formal model of synchronization between computations performed by finite automata and our model too specifies a matching rule between the labels of separate
computations.
Notwithstanding the fact that the proposed approach has little to do with any
classical formal language model we know, we hope its simplicity, expressivity and
motivation may attract some attention.
The paper is organized as follows. Section 1 lists the basic definitions, and
provides an example giving evidence of the strict inclusion of regular languages.
Section 2 shows that the Parikh image may be not linear, and proves several
closure properties. Section 3 defines a transition relation between multisets of
states, corresponding to a multi-counter machine. Then it shows that the word
recognition problem is in NLOGSPACE. Section 4 focuses on consensually defined
regular languages, and gives sufficient conditions for a consensual language to be
regular. It shows that consensual definitions can be exponentially more concise
than definitions by deterministic finite automata. Section 5 proves the emptiness
problem to be undecidable. Section 6 shows that the languages of palindromes
and replicas exceed the power of consensual languages. The conclusion mentions
directions for continuation.
1. First definitions
Let Σ be the terminal alphabet of the languages to be considered. The empty
word is denoted by ε. Given a word x, its length is denoted by |x| and the
i-th letter is x(i), 1 ≤ i ≤ |x|. A deterministic finite automaton (DFA for short) is
specified as A = (∆, Q, δ, q0, F) where: ∆ is a finite alphabet; Q is a finite set of
states; δ : Q × ∆ → Q is the state-transition function, always assumed to be total;
q0 is the initial state, and F ⊆ Q is the set of final states. The transition function
δ can be extended as usual to Q × ∆∗ → Q, which is also total, i.e., δ(q0, y) is
defined for every word y over ∆. A nondeterministic finite automaton (NFA)
is specified as N = (∆, Q, ⇒N, q0, F) where the only difference with a DFA above
is that the transition relation is ⇒N ⊆ Q × ∆ × Q. Acceptance is defined as
usual for a DFA and for an NFA.
Let $\mathring{\Sigma}$ be the disjoint alphabet obtained by marking each symbol $a \in \Sigma$ as $\mathring{a}$,
referred to as the marked copy of $a$.
The set $\Sigma \cup \mathring{\Sigma}$ is denoted as $\tilde{\Sigma}$ and qualified as internal alphabet, because its
use is restricted to the technical device of consensual definitions.
The notion of agreement between strings over the internal alphabet is formalized
by means of a function called match.
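Definition 1.1. Match
The partial, symmetrical, and associative binary operator, called match,
\[ @ : \tilde{\Sigma} \times \tilde{\Sigma} \to \tilde{\Sigma} \]
is defined as follows, for all $a \in \Sigma$:
\[
\begin{cases}
a @ \mathring{a} = \mathring{a} @ a = a; \\
\mathring{a} @ \mathring{a} = \mathring{a}; \\
\text{undefined, in every other case.}
\end{cases}
\]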
The operator can be naturally extended to strings of equal length, by assuming
$\varepsilon @ \varepsilon = \varepsilon$. For all $w, w' \in \tilde{\Sigma}^*$, with $|w| = |w'|$, and for all $a, b \in \tilde{\Sigma}$:
\[ aw \,@\, bw' = (a @ b)(w @ w') \]
where we assume that match yields precedence to concatenation.
Hence, the match is undefined on strings $w, w'$ of unequal lengths, or else if
there exists a position $i$ such that $w(i) @ w'(i)$ is undefined. The latter condition
occurs in three cases: when both characters are in $\Sigma$, when both are in $\mathring{\Sigma}$ and
differ, and when either one is marked but is not the marked copy of the other.
Given $m > 0$ strings $w_1, \ldots, w_m \in \tilde{\Sigma}^*$, consider $w_1 @ w_2 @ \ldots @ w_m$ (which can
be written without parentheses and in any order because the match operation is
associative and commutative). If $w = w_1 @ w_2 @ \ldots @ w_m$ is defined then $w$ is called
the match of $w_1, w_2, \ldots, w_m$. The match is strong if $w \in \Sigma^*$, weak otherwise. The
cardinality $m$ is called the degree of the match. Match $w$ and every argument $w_j$
have the same length $n = |w_j| = |w|$. Also, by Definition 1.1, if $w$ is a strong
match then, for each position $1 \le i \le n$, exactly one string, say $w_k$, is unmarked, i.e.,
$w_k(i) \in \Sigma$ and $w_j(i) \in \mathring{\Sigma}$ for all $j \ne k$. We say that word $w_k$ places the letter into
position $i$ and the other strings consent to it.
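The match operator can be sketched in a few lines of Python. The encoding is an assumption made only for this illustration: an unmarked letter is written in lowercase and its marked copy in uppercase, and `None` stands for "undefined". The function names are hypothetical.

```python
from functools import reduce
from typing import Iterable, Optional

# Assumed encoding: lowercase = unmarked letter, uppercase = its marked copy.

def match_symbol(x: str, y: str) -> Optional[str]:
    """Match of two internal symbols; None stands for 'undefined'."""
    if x.lower() != y.lower():
        return None                      # copies of different letters never match
    if x.islower() and y.islower():
        return None                      # two unmarked letters never match
    return x if x.islower() else y       # unmarked wins; marked @ marked stays marked

def match(w1: Optional[str], w2: Optional[str]) -> Optional[str]:
    """Letter-by-letter extension to strings of equal length."""
    if w1 is None or w2 is None or len(w1) != len(w2):
        return None
    symbols = [match_symbol(x, y) for x, y in zip(w1, w2)]
    return None if None in symbols else "".join(symbols)

def match_all(words: Iterable[str]) -> Optional[str]:
    """Match of any number of strings (the operator is associative)."""
    return reduce(match, words)

def is_strong(w: Optional[str]) -> bool:
    """A defined match is strong when no marked letter survives."""
    return w is not None and all(c.islower() for c in w)
```

For instance, under this encoding `match_all(["aAA", "AaA", "AAa"])` yields the strong match `"aaa"` of degree 3, while `match("a", "a")` is undefined.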
Next we extend the match operator to two languages $L', L'' \subseteq \tilde{\Sigma}^*$ over the
internal alphabet:
\[ L' @ L'' = \{ w' @ w'' \mid w' \in L',\ w'' \in L'' \}. \]
Clearly the operation may be applied to any number of languages.
If the arguments are regular languages, the match operator produces a regular
language.
Proposition 1.2. If $L', L'' \subseteq \tilde{\Sigma}^*$ are regular languages then $L' @ L''$ is also regular.
Proof. Let $A' = (\tilde{\Sigma}, Q', \delta', q'_1, F')$ and $A'' = (\tilde{\Sigma}, Q'', \delta'', q''_1, F'')$ be the DFAs recognizing $L', L''$, respectively. Let $A' @ A''$ be the (possibly nondeterministic) finite
automaton $(\tilde{\Sigma}, Q' \times Q'', \delta, (q'_1, q''_1), F' \times F'')$, with $\delta : (Q' \times Q'') \times \tilde{\Sigma} \to 2^{Q' \times Q''}$,
such that for every $q', p' \in Q'$, $q'', p'' \in Q''$, for every $a \in \Sigma$:
\[
\begin{array}{lll}
(p', p'') \in \delta((q', q''), a) & \text{if} & p' = \delta'(q', a),\ p'' = \delta''(q'', \mathring{a}) \\
(p', p'') \in \delta((q', q''), a) & \text{if} & p' = \delta'(q', \mathring{a}),\ p'' = \delta''(q'', a) \\
(p', p'') \in \delta((q', q''), \mathring{a}) & \text{if} & p' = \delta'(q', \mathring{a}),\ p'' = \delta''(q'', \mathring{a}).
\end{array}
\]
The construction is similar to the usual Cartesian product of two DFAs. But
instead of the intersection, the product machine $A' @ A''$ recognizes $L' @ L''$, because
the construction has been modified to match $a$ with $\mathring{a}$ and $\mathring{a}$ with $\mathring{a}$, but not to
match $a$ with $a$.
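The product construction in the proof can be sketched as follows. This is an illustrative sketch, not the authors' code: lowercase letters stand for unmarked symbols and uppercase letters for their marked copies (an assumed encoding), DFAs are given as total dictionaries `(state, symbol) -> state`, and `dfa_R` is a hypothetical helper building a DFA for the language written `A*aA*B*bB*` in this encoding.

```python
from itertools import product

def match_product(d1, d2, letters):
    """NFA transitions of A' @ A'': ((p, q), symbol) -> set of (p', q')."""
    states1 = {s for s, _ in d1}
    states2 = {s for s, _ in d2}
    delta = {}
    for p, q, a in product(states1, states2, letters):
        marked = a.upper()
        # an unmarked a is read by one component while the other reads the marked copy
        delta.setdefault(((p, q), a), set()).update({
            (d1[p, a], d2[q, marked]),
            (d1[p, marked], d2[q, a]),
        })
        # a marked symbol is read as marked by both components
        delta.setdefault(((p, q), marked), set()).add((d1[p, marked], d2[q, marked]))
    return delta

def accepts(delta, start, finals, word):
    """Standard on-the-fly subset simulation of an NFA."""
    current = {start}
    for sym in word:
        current = {t for s in current for t in delta.get((s, sym), set())}
    return any(s in finals for s in current)

def dfa_R():
    """Total DFA (state 0 is a sink) for A*aA*B*bB*; initial state 1, final state 4."""
    live = {(1, "A"): 1, (1, "a"): 2, (2, "A"): 2, (2, "B"): 3,
            (2, "b"): 4, (3, "B"): 3, (3, "b"): 4, (4, "B"): 4}
    return {(s, x): live.get((s, x), 0) for s in range(5) for x in "aAbB"}

delta_RR = match_product(dfa_R(), dfa_R(), "ab")
```

With start state `(1, 1)` and final states `{(4, 4)}`, the product accepts exactly the matches of two strings of the base, for example `"aabb"` but not `"ab"`.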
The repeated application of the match operation to a language is formalized
next. Let $L^{1@} = L$, $L^{i@} = L @ L^{(i-1)@}$, $i \ge 2$. Notice that in general $L^{(i-1)@} \not\subseteq L^{i@}$.
Definition 1.3. Match closure
The closure under match, or @-closure, of a language $L \subseteq \tilde{\Sigma}^*$ is:
\[ L^{@} = \bigcup_{i \ge 1} L^{i@}. \]
Focusing on languages over the terminal alphabet $\Sigma$, the main definition comes
next.
Definition 1.4. Consensual language
Let $B$ be in a language family $\mathcal{F}$. The consensual language with base $B \subseteq \tilde{\Sigma}^+$ is
the set
\[ C(B) = B^{@} \cap \Sigma^*. \]
Language $C(B)$ is also called a consensual language based on family $\mathcal{F}$, and the
corresponding family is written $C_{\mathcal{F}}$.
Therefore, a consensual language with base $B$ includes all and only the strong
matches of the match closure. In this paper we study the family of consensual
languages based on the family of regular languages, $C_{REG}$.
Example 1.5. Consider the regular language $R$ defined by the regular expression
\[ \mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*. \]
Then $R^{@}$ is the set of strings of the form:
\[ \mathring{a}^* a_1 \mathring{a}^* a_2 \mathring{a}^* \ldots a_m \mathring{a}^* \; \mathring{b}^* b_1 \mathring{b}^* b_2 \ldots b_m \mathring{b}^* \]
where $m \ge 1$, and each $a_i$ is $a$ and each $b_i$ is $b$.
The consensual language with base $R$ is $C(R) = \{a^n b^n \mid n > 0\}$. Figure 1
shows that sentence $aaabbb$ can be obtained in different ways, matching together
the strings of $R$ in column $v_i$, or matching those in column $w_i$. Notice that every
sentence $w$ is obtained by means of a match of degree $|w|/2$.
Figure 1. In Example 1.5 word $aaabbb$ results from the strong
matches in columns $v_i$ and $w_i$, $i = 1, 2, 3$.
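A strong match of degree 3 for $aaabbb$ can be checked mechanically. The sketch below uses the characterization stated after Definition 1.1 (in each position exactly one string is unmarked and all strings carry the same letter). The encoding (lowercase = unmarked, uppercase = marked copy) and the two triples `v` and `w` are assumptions chosen for illustration; they are valid choices of strings of $R$ but not necessarily the ones printed in Figure 1.

```python
import re

# Base language of Example 1.5 in the assumed encoding (uppercase = marked copy).
R = re.compile(r"A*aA*B*bB*")

def strong_match(words):
    """Return the strong match of `words`, or None if it is undefined or weak."""
    if len({len(w) for w in words}) != 1:
        return None                                  # unequal lengths
    out = []
    for column in zip(*words):
        letters = {c.lower() for c in column}        # underlying letters in this position
        unmarked = [c for c in column if c.islower()]
        if len(letters) != 1 or len(unmarked) != 1:  # need one placer, all others consenting
            return None
        out.append(unmarked[0])
    return "".join(out)

# Two hypothetical ways of obtaining aaabbb from three strings of R.
v = ["aAAbBB", "AaABbB", "AAaBBb"]
w = ["aAABBb", "AaABbB", "AAabBB"]
```

Both triples consist of strings of $R$ and both yield `"aaabbb"`, whereas dropping one string leaves some position without a placer and no strong match exists.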
The example has shown that, although the base $B$ is regular, languages $B^{@}$ and
$C(B)$, obtained by a match closure, may be non-regular.
However, from Proposition 1.2, for any finite $i$, $B^{i@}$ is regular if $B$ is regular.
This corresponds, in Definition 1.3, to the case where at most $i$ strings $w_1, \ldots, w_i$
are matched.
2. First properties
We introduce further useful terminology and make intuitive comments about
previous definitions and concepts.
First, we notice that we may remove or add to a base language $B$ a subset of $\mathring{\Sigma}^+$
without affecting the corresponding consensual language $C(B)$. In fact, if $w @ w'$ is
defined and $w' \in \mathring{\Sigma}^+$, then $w @ w' = w$: strings purely made of marked characters
are both useless and “harmless”.
Proposition 2.1. For every base language $B \subseteq \tilde{\Sigma}^*$, and for every language $U \subseteq \mathring{\Sigma}^+$,
\[ C(B) = C(B \cup U) = C(B \setminus U). \]
As a consequence of the fact that the match of identical strings is undefined
(if they contain at least one unmarked character) or useless (if the strings are
completely marked), any phrase w of a consensual language can be obtained as
the result of a strong match having degree not exceeding its length |w|.
Proposition 2.2.
\[ C(B) = \{ w \in \Sigma^* \mid \exists k,\ 1 \le k \le |w|,\ w \in B^{k@} \}. \qquad (1) \]
A straightforward language family inclusion result comes next. Consider a deterministic finite automaton of the base language $B$. A word $w$ is in the consensual
language $C(B)$ if, and only if, the automaton performs $k$ successful computations, with $1 \le k \le |w|$, accepting a set of strings over $\tilde{\Sigma}$ that strongly match to $w$. We also say
that such computations strongly (or weakly) match.
The case k = 1 clearly corresponds to the usual recognition condition of a DFA.
As the consensual language of Example 1.5 is not regular, we have:
Proposition 2.3. The family $C_{REG}$ of consensual languages on a regular language
base strictly includes the family of regular languages.
The next example shows languages having a non-semilinear commutative image
(in Parikh’s sense [7]).
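Example 2.4. Series of unary integers
(1) Series of identical unary integers.
Choose as base the language:
\[ R_1 = (\mathring{a}^* a \mathring{a}^* \mathring{b})^+ \cup (\mathring{a}^+ b)^+. \]
Then the consensual language is:
\[ L_1 = C(R_1) = \{ a^n b\, a^n b\, a^n b \ldots a^n b \mid n > 0 \}. \]
(2) Enumeration of unary integers.
The language $L_2 = \{ b\,a\,b\,a^2 b \ldots b\,a^n b \mid n \ge 0 \}$ is consensually defined by
the regular base
\[ R_2 = (\mathring{b}\, \mathring{a}^+)^* \, b \, (\mathring{a}^* a \mathring{a}^* \mathring{b})^*. \]
For example, $babaab$ is the match of the following words in $R_2$: $b a \mathring{b} a \mathring{a} \mathring{b}$,
$\mathring{b} \mathring{a} b \mathring{a} a \mathring{b}$ and $\mathring{b} \mathring{a} \mathring{b} \mathring{a} \mathring{a} b$.
(3) Series of exponential unary numbers.
For $\Sigma = \{a, b, c\}$ let
\[ R_3 = \mathring{\Sigma}^* a (\mathring{a} \cup \mathring{c})^* \mathring{b} (\mathring{a} \cup \mathring{c})^* c \mathring{a} c \mathring{\Sigma}^* \;\cup\; \mathring{a} c b \left( (\mathring{a}\mathring{c})^+ b \right)^* (a \mathring{c})^+ b \;\cup\; acb. \]
The consensual language $C(R_3)$ is
\[ L_3 = \{ acb (ac)^2 b (ac)^4 b (ac)^8 b \ldots (ac)^{2^m} b \mid m \ge 0 \}. \]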
Proof. Let us express the base language as the union of three clauses,
numbered 1, 2, and 3:
\[ R_3 = \underbrace{\mathring{\Sigma}^* a (\mathring{a} \cup \mathring{c})^* \mathring{b} (\mathring{a} \cup \mathring{c})^* c \mathring{a} c \mathring{\Sigma}^*}_{1} \;\cup\; \underbrace{\mathring{a} c b \left( (\mathring{a}\mathring{c})^+ b \right)^* (a \mathring{c})^+ b}_{2} \;\cup\; \underbrace{acb}_{3}. \]
First we show that $L_3 \subseteq C(R_3)$. Any word in $C(R_3)$, apart from $acb$, must
be obtained by matching a word in clause 2 with strings in clause 1, since
neither regular expression can generate alone a word in $\Sigma^+$. We now show,
by induction on the number $m \ge 0$, that if $w_m$ is a string of the form
\[ w_m = acb (ac)^2 b (ac)^4 b (ac)^8 b \ldots (ac)^{2^m} b \]
then $w_m \in C(R_3)$ (and clearly $L_3$ is the set of all $w_m$ above). The base
step is $m = 0$, corresponding to the word $acb$, which is both in $C(R_3)$ and
in $L_3$. Assume now that the induction hypothesis holds for $m - 1$. Hence,
string
\[ w_{m-1} = acb (ac)^2 b (ac)^4 b (ac)^8 b \ldots (ac)^{2^{m-1}} b \in C(R_3). \]
But $w_{m-1}$ must be obtained as a match of $h > 0$ strings $x_1, x_2, \ldots, x_h$ of
clause 1, with one string
\[ y_{m-1} = \mathring{a} c b (\mathring{a}\mathring{c})^2 b \ldots (\mathring{a}\mathring{c})^{2^{m-2}} b (a \mathring{c})^{2^{m-1}} b \]
of clause 2. But also
\[ y_m = \mathring{a} c b (\mathring{a}\mathring{c})^2 b \ldots b (\mathring{a}\mathring{c})^{2^{m-1}} b (a \mathring{c})^{2^m} b \]
is in clause 2, and if $x_i$ is in clause 1 then also $x'_i = x_i (\mathring{a}\mathring{c})^{2^m} \mathring{b}$ is in clause 1,
since clause 1 ends with $\mathring{\Sigma}^*$. Therefore, by matching $y_m$ with $x'_1, \ldots, x'_h$,
one obtains:
\[ w'_m = acb (ac)^2 b (ac)^4 b (ac)^8 b \ldots b (ac)^{2^{m-2}} b (\mathring{a} c)^{2^{m-1}} b (a \mathring{c})^{2^m} b \in R_3^{@}. \]
Also, all strings
\[ z_i = \mathring{a}\mathring{c}\mathring{b} (\mathring{a}\mathring{c})^2 \mathring{b} \ldots \mathring{b} (\mathring{a}\mathring{c})^{2^{m-2}} \mathring{b} (\mathring{a}\mathring{c})^i a \mathring{c} (\mathring{a}\mathring{c})^{2^{m-1}-i-1} \mathring{b} (\mathring{a}\mathring{c})^{2i} \mathring{a} c \mathring{a} c (\mathring{a}\mathring{c})^{2^m-2i-2} \mathring{b} \]
are in clause 1, for every $i$, $0 \le i < 2^{m-1}$. Hence, for every $a$ placed in
group $m - 1$ (the group with $2^{m-1}$ occurrences of $ac$), there must be two
occurrences of $c$ in group $m$. Hence, the number of $c$'s (and therefore also
of $a$'s) in group $m$ must be twice the number of $c$'s in group $m - 1$:
\[ w_m = acb (ac)^2 b (ac)^4 b (ac)^8 b \ldots (ac)^{2^{m-1}} b (ac)^{2^m} b \in C(R_3). \]
For the converse case, notice that, since any $w$ in $C(R_3)$ (except for $acb$)
must match a word in clause 2, $C(R_3) \subseteq acb \left( (ac)^+ b \right)^*$. An induction on
the number $m \ge 0$ of groups $(ac)^+ b$ in strings of $C(R_3)$ completes the
proof. The base case $m = 0$ corresponds to the word $acb$. By induction
hypothesis, for all $0 \le j < m$, the strings of $C(R_3)$ of the form $acb \left( (ac)^+ b \right)^j$ are in $L_3$. Assume
that a word $x_m \in C(R_3)$ with $m$ groups is not in $L_3$. Hence, there exists $i > 0$
such that the group in position $i$ has a number of $c$'s which is not the double
of the number of $a$'s of the group in position $i - 1$. But $i = m$, otherwise
one could define also a word which is not in $L_3$ while having less than
$m$ groups, contradicting the induction hypothesis. However, the only way
to place a $c$ in group $m$ is by using strings in clause 1, which place two
occurrences of $c$ in group $m$ for an occurrence of $a$ in group $m - 1$. Since
no other strings can place $a$ in group $m - 1$, the number of $c$'s in
group $m$ must be exactly the double of the number of $a$'s in group $m - 1$
($2^{m-1}$ by induction hypothesis), that is, there are $2^m$ occurrences of $c$ in
group $m$.
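Item (2) of Example 2.4 lends itself to a brute-force check on short words. The sketch below is only an illustration under assumptions: uppercase letters encode marked copies, the base $R_2$ is written as a Python regular expression in that encoding, and `in_consensual` is a hypothetical helper. It decides membership in $C(R_2)$ by enumerating all markings of a word that belong to the base and searching for a family of them in which every position is placed exactly once.

```python
import re
from itertools import product

# R2 of Example 2.4 in the assumed encoding (uppercase = marked copy).
R2 = re.compile(r"(BA+)*b(A*aA*B)*")

def in_consensual(word, base):
    """Exponential-time membership test for C(base), meant for short words only."""
    n = len(word)
    # Base strings spelling `word`, each kept as its set of unmarked positions.
    candidates = []
    for marking in product(*[(c, c.upper()) for c in word]):
        x = "".join(marking)
        placed = frozenset(i for i, c in enumerate(x) if c.islower())
        if placed and base.fullmatch(x):      # fully marked strings are useless
            candidates.append(placed)

    def cover(covered):
        # A strong match exists iff the positions split into disjoint `placed` sets.
        if len(covered) == n:
            return True
        first = min(set(range(n)) - covered)
        return any(cover(covered | p) for p in candidates
                   if first in p and not (p & covered))

    return n > 0 and cover(frozenset())

short_words = ["".join(w) for n in range(1, 7) for w in product("ab", repeat=n)]
found = sorted(w for w in short_words if in_consensual(w, R2))
```

Among all words over {a, b} of length at most 6, the search finds exactly `b`, `bab` and `babaab`, in agreement with the definition of $L_2$.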
We state the basic closure properties of consensual languages in the next proposition. For two languages $L', L'' \subseteq \Sigma^*$ and a letter $s \notin \Sigma$, the marked concatenation [7] is the language $L' s L''$.
Proposition 2.5. The family $C_{REG}$ is closed under:
(1) intersection with regular languages;
(2) inverse alphabetic homomorphism;
(3) reversal (or mirror reflection) operation;
(4) marked concatenation of consensual languages;
(5) union of consensual languages over disjoint alphabets.
Proof. We separately argue for each statement.
(1) Let $R \subseteq \tilde{\Sigma}^*$, $S \subseteq \Sigma^*$ be two regular languages, and let $h : \tilde{\Sigma} \to \Sigma$ be the
alphabetic homomorphism defined by $h(a) = h(\mathring{a}) = a$ for every $a \in \Sigma$.
We claim that
\[ C(R) \cap S = C\left( R \cap h^{-1}(S) \right) \]
thus proving the statement.
Let $x \in C(R) \cap S$. Therefore, $\exists k$, $1 \le k \le |x|$, $\exists x_1, \ldots, x_k \in R$ such that
$x_1 @ x_2 \ldots @ x_k = x$ and for every $i$, $1 \le i \le k$, $h(x_i) = x$. Hence, every
$x_i \in h^{-1}(x) \subseteq h^{-1}(S)$ since $x \in S$. Hence, for every $i$, $1 \le i \le k$,
$x_i \in R \wedge x_i \in h^{-1}(S)$, and it follows that $x \in C\left( R \cap h^{-1}(S) \right)$.
Assume now $x \in C\left( R \cap h^{-1}(S) \right)$. Hence, $\exists k$, $1 \le k \le |x|$, $\exists x_1, \ldots, x_k$
such that $x_1 @ x_2 \ldots @ x_k = x$ and for every $i$, $1 \le i \le k$, $x_i \in R \cap h^{-1}(S)$,
with $h(x_i) = x$. Then $x \in C(R)$ (since each $x_i \in R$). Also, $x \in h^{-1}(S)$
(since each $x_i \in h^{-1}(S)$). Therefore, $x \in S$, since $S = h^{-1}(S) \cap \Sigma^*$ and
$x \in \Sigma^*$.
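(2) Let $R \subseteq \tilde{\Sigma}^*$ be a regular language, and let $\Delta$ be another finite alphabet.
Let $h : \Delta \to \Sigma$ be a homomorphism. We need to prove that $h^{-1}(C(R))$ is
a consensual language with regular base.
Extend first $h$ to the internal alphabet as follows:
\[ \tilde{h} : \Delta \cup \mathring{\Delta} \to \Sigma \cup \mathring{\Sigma} \]
is defined as $\tilde{h}(A) = h(A)$, while $\tilde{h}(\mathring{A})$ is the marked copy of $h(A)$, for every $A \in \Delta$.
We notice that $\tilde{h}^{-1}(a @ \mathring{a}) = \tilde{h}^{-1}(a) = \tilde{h}^{-1}(a) @ \tilde{h}^{-1}(\mathring{a})$, and that
$\tilde{h}^{-1}(\mathring{a} @ \mathring{a}) = \tilde{h}^{-1}(\mathring{a}) = \tilde{h}^{-1}(\mathring{a}) @ \tilde{h}^{-1}(\mathring{a})$, and similarly for the case $\mathring{a} @ a$.
On the other hand both $\tilde{h}^{-1}(a) @ \tilde{h}^{-1}(a)$ and $\tilde{h}^{-1}(a @ a)$ are undefined.
Hence,
\[ \tilde{h}^{-1}(X @ Y) = \tilde{h}^{-1}(X) @ \tilde{h}^{-1}(Y) \quad \text{for every } X, Y \in \tilde{\Sigma}. \]
Therefore, if $u, u' \in (\Sigma \cup \mathring{\Sigma})^*$ then
\[ \tilde{h}^{-1}(u @ u') = \tilde{h}^{-1}(u) @ \tilde{h}^{-1}(u'). \]
We now claim that $\tilde{h}^{-1}(R^{@}) = \left( \tilde{h}^{-1}(R) \right)^{@}$. From here the thesis follows, since $\tilde{h}^{-1}(R)$ is regular and
\[ h^{-1}(C(R)) = \tilde{h}^{-1}(R^{@}) \cap \Delta^* = \left( \tilde{h}^{-1}(R) \right)^{@} \cap \Delta^*. \]
Let $x \in \tilde{h}^{-1}(R^{@})$. Hence, there is $w \in R^{@}$ such that $x \in \tilde{h}^{-1}(w)$. By
Definition 1.4, there exist $k > 0$ strings $w_1, \ldots, w_k \in R$, with $1 \le k \le |x|$,
such that $w_1 @ \ldots @ w_k = w$. Hence,
\[ x \in \tilde{h}^{-1}(w) = \tilde{h}^{-1}(w_1) @ \ldots @ \tilde{h}^{-1}(w_k) \subseteq \left( \tilde{h}^{-1}(R) \right)^{@}. \]
Let $x \in \left( \tilde{h}^{-1}(R) \right)^{@}$. Hence, there exist $k > 0$ strings
$x_1, \ldots, x_k \in \tilde{h}^{-1}(R)$ such that $x_1 @ \ldots @ x_k = x$.
It follows that there exist $k > 0$ strings $w_1, \ldots, w_k \in R$ such that
$x_1 \in \tilde{h}^{-1}(w_1), \ldots, x_k \in \tilde{h}^{-1}(w_k)$, and therefore
\[ x = x_1 @ \ldots @ x_k \in \tilde{h}^{-1}(w_1) @ \ldots @ \tilde{h}^{-1}(w_k) = \tilde{h}^{-1}(w_1 @ \ldots @ w_k) \subseteq \tilde{h}^{-1}(R^{@}). \]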
(3)–(5) The obvious proofs are based on simple transformations of the DFA recognizing the base language.
3. Consensual Languages are in NLOGSPACE
In this section, we formalize the simultaneous computations in the consensual
definition by means of multisets of states of a DFA accepting the base language;
the multiplicity of a state in the multiset encodes the number of computations of
the DFA that have reached that state. The consensual transition relation can be
computed by a nondeterministic Turing machine. A multiset can be represented by
multiplicity counters and only one counter for each state of the DFA is needed,
whose value is linearly limited by the length of the input string. Using a binary encoding of each counter, word membership can be computed by a nondeterministic
counter machine operating in logarithmic space.
3.1. Consensual transition relation
3.1.1. Preliminaries on multisets
Given a finite set $Q$, which in this paper is the set of states of a DFA, a multiset over
$Q$ is a total mapping $Z : Q \to \mathbb{N}$. The cardinality of multiset $Z$ is
$|Z| = \sum_{q \in Q} Z(q)$. For $q \in Q$, if $Z(q) > 0$ then we say that $q \in Z$ with multiplicity $Z(q)$. To illustrate, consider the multiset $Z$ over $Q = \{p, q, r\}$ characterized by $Z(p) = 3$, $Z(q) = 0$, $Z(r) = 5$. We also use the alternative notations
$\{p^3, r^5\}$ or $\{p, p, p, r, r, r, r, r\}$.
Given two multisets $Z, Z'$ over $Q$, the sum $Z \uplus Z'$ and the difference $Z - Z'$ are
the multisets specified by the following characteristic functions, for every $q \in Q$:
\[ (Z \uplus Z')(q) = Z(q) + Z'(q), \qquad (Z - Z')(q) = \max\left(0,\ Z(q) - Z'(q)\right). \]
If $f : Q \to \mathbb{N}^Q$ is a total mapping, associating each element $q \in Q$ with a multiset
$f(q)$, and $Z : Q \to \mathbb{N}$ is a multiset $\{q_1, \ldots, q_m\}$, where $m = |Z|$ and the $q_i$'s are not
necessarily distinct, then let the generalized sum $\biguplus_{q \in Z} f(q)$ be $f(q_1) \uplus \cdots \uplus f(q_m)$.
Finally, define, for every multiset $Z$ over $Q$, the underlying set
\[ \widehat{Z} = \{ q \in Q \mid Z(q) > 0 \}. \]
Clearly, $\widehat{Z \uplus Z'} = \widehat{Z} \cup \widehat{Z'}$, and the underlying set of $\biguplus_{q \in Z} f(q)$ is $\bigcup_{q \in \widehat{Z}} \widehat{f(q)}$.
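These multiset operations map directly onto Python's standard `collections.Counter`, as the following sketch suggests. The helper names `cardinality`, `underlying` and `generalized_sum` are hypothetical, and the sample multisets are chosen only for illustration.

```python
from collections import Counter

Z = Counter({"p": 3, "r": 5})            # the multiset {p^3, r^5} over Q = {p, q, r}
Z2 = Counter({"p": 1, "q": 2})

def cardinality(m: Counter) -> int:
    return sum(m.values())

def underlying(m: Counter) -> set:
    """Underlying set: the elements with positive multiplicity."""
    return {q for q, n in m.items() if n > 0}

def generalized_sum(f, m: Counter) -> Counter:
    """Sum of f(q) over all occurrences q of the multiset m."""
    total = Counter()
    for q in m.elements():               # elements() repeats q according to multiplicity
        total += f(q)
    return total

total = Z + Z2                            # multiset sum: multiplicities are added
diff = Z - Z2                             # difference, truncated at zero as in the text
```

`Counter` subtraction discards non-positive counts, which is the same truncation as $\max(0, Z(q) - Z'(q))$.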
3.1.2. A consensual transition relation
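Let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA and assume the transition function $\delta$ to be
total. By the above notation, the function is naturally extended to a multiset $Z$
over $Q$, positing, for every symbol $x \in \tilde{\Sigma}$, $\delta(Z, x) = \biguplus_{q \in Z} \{\delta(q, x)\}$.
From this we define a transition relation on multisets of states.
Definition 3.1. The consensual transition relation of $A$, namely $\to_A \subseteq \mathbb{N}^Q \times \Sigma \times \mathbb{N}^Q$, is defined, for every $a \in \Sigma$ and for all multisets $Z, Z'$ over $Q$, as:
\[ Z \xrightarrow{a}_A Z' \quad \text{if} \quad \exists q \in Z : Z' = \{\delta(q, a)\} \uplus \delta(Z - \{q\}, \mathring{a}). \]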
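Relation $\xrightarrow{a}_A$ can be extended as usual from a letter $a$ to a word $w \in \Sigma^*$ via
the inductive definition:
\[ Z \xrightarrow{\varepsilon}_A Z; \qquad Z \xrightarrow{wa}_A Z'' \ \text{ if } \ \exists Z' \text{ such that } Z \xrightarrow{w}_A Z' \xrightarrow{a}_A Z''. \]
It is evident that if $Z \xrightarrow{a}_A Z'$ then $|Z| = |Z'|$, i.e., the cardinality does not change.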
Two types of multisets have a special role: the initial multisets $\{(q_0)^k\}$, for
every $k > 0$, and the final multisets $Z$ such that $\widehat{Z} \subseteq F$.
A crisp definition of consensual languages is obtained by means of the transition
relation.
Proposition 3.2. Let $R \subseteq \tilde{\Sigma}^*$ and let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA accepting $R$.
Then
\[ C(R) = \{ w \mid \exists k > 0 \text{ and a final multiset } Z \text{ such that } \{(q_0)^k\} \xrightarrow{w}_A Z \}. \]
The proposition follows immediately by combining the following Lemmata 3.4,
3.5. First, an example may help to clarify the construction.
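Base automaton $A$, with initial state $q_1$ and final state $q_4$:
\[ q_1 \xrightarrow{\mathring{a}} q_1, \quad q_1 \xrightarrow{a} q_2, \quad q_2 \xrightarrow{\mathring{a}} q_2, \quad q_2 \xrightarrow{\mathring{b}} q_3, \quad q_2 \xrightarrow{b} q_4, \quad q_3 \xrightarrow{\mathring{b}} q_3, \quad q_3 \xrightarrow{b} q_4, \quad q_4 \xrightarrow{\mathring{b}} q_4. \]
Consensual transition relation accepting $aaabbb$:
\[ \{q_1, q_1, q_1\} \xrightarrow{a}_A \{q_1, q_1, q_2\} \xrightarrow{a}_A \{q_1, q_2, q_2\} \xrightarrow{a}_A \{q_2, q_2, q_2\} \xrightarrow{b}_A \{q_3, q_3, q_4\} \xrightarrow{b}_A \{q_3, q_4, q_4\} \xrightarrow{b}_A \{q_4, q_4, q_4\} \]
Figure 2. Recognizer of base language $R = \mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*$.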
Example 3.3. The finite automaton accepting base language $R = \mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*$
(see Ex. 1.5) is shown in the top part of Figure 2, while the consensual transition
relation accepting $aaabbb$ is shown in the bottom part.
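The consensual transition relation of Definition 3.1 can be sketched on multisets represented as `collections.Counter` objects. The sketch rests on assumptions: uppercase letters encode marked copies, states are the integers 1 to 4 (for $q_1, \ldots, q_4$) plus a sink 0 that makes the transition function total, and the table `LIVE` is derived here from the regular expression of the base rather than copied from the figure.

```python
from collections import Counter

# Assumed DFA for A*aA*B*bB* (uppercase = marked copy); state 0 is a sink.
LIVE = {(1, "A"): 1, (1, "a"): 2, (2, "A"): 2, (2, "B"): 3,
        (2, "b"): 4, (3, "B"): 3, (3, "b"): 4, (4, "B"): 4}

def delta(q, x):
    return LIVE.get((q, x), 0)

def steps(Z, a):
    """All multisets Z' reachable from Z by one consensual move on the letter a."""
    result = []
    for q in Z:                                   # choose the computation that places a
        Zp = Counter([delta(q, a)])
        for p, n in (Z - Counter([q])).items():   # all the others consent with the marked copy
            Zp[delta(p, a.upper())] += n
        if Zp not in result:
            result.append(Zp)
    return result

# The run of Figure 2 on aaabbb, as a list of multisets.
run = [Counter({1: 3}), Counter({1: 2, 2: 1}), Counter({1: 1, 2: 2}), Counter({2: 3}),
       Counter({3: 2, 4: 1}), Counter({3: 1, 4: 2}), Counter({4: 3})]
```

Each multiset of `run` is obtained from the previous one by a move of `steps` on the corresponding letter of `aaabbb`, and the cardinality stays equal to 3 throughout.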
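Lemma 3.4. If $\exists k > 0$, $\exists Z : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \xrightarrow{w}_A Z$, then there exist $k$
words $w_1, \ldots, w_k \in \tilde{\Sigma}^*$ such that $w_1 @ w_2 @ \ldots @ w_k = w$ and $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$.
Proof. The proof is by induction on $|w|$. If $|w| = 0$, then let $w_1 = \cdots = w_k = w = \varepsilon$,
and $Z = \{(q_0)^k\}$. If $|w| > 0$ let $w = w'a$ for $w' \in \Sigma^*$, $a \in \Sigma$. Hence, if
$\{(q_0)^k\} \xrightarrow{w}_A Z$ then there exists $Z' : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \xrightarrow{w'}_A Z' \xrightarrow{a}_A Z$. By
induction hypothesis, there exist $k$ words $w'_1, \ldots, w'_k \in \tilde{\Sigma}^*$ with $w'_1 @ \ldots @ w'_k = w'$ and
$Z' = \biguplus_{1 \le j \le k} \{\delta(q_0, w'_j)\}$.
Since $Z' \xrightarrow{a}_A Z$, $\exists q \in Z'$ such that $Z = \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a})$. Since $q \in Z'$,
$\exists h$, $1 \le h \le k$, such that $\delta(q_0, w'_h) = q$. Hence, $\delta(q, a) = \delta(q_0, w'_h a)$. Let $w_h = w'_h a$
and, for every $j \ne h$, $1 \le j \le k$, let $w_j = w'_j \mathring{a}$. Hence, $w_1 @ \ldots @ w_k = w$. Also:
\[
\begin{aligned}
Z &= \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a}) \\
  &= \{\delta(q_0, w'_h a)\} \uplus \delta\Big( \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j)\},\ \mathring{a} \Big) \\
  &= \{\delta(\delta(q_0, w'_h), a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(\delta(q_0, w'_j), \mathring{a})\} \\
  &= \{\delta(q_0, w'_h a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j \mathring{a})\} = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}.
\end{aligned}
\]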
Lemma 3.5. For every $w \in \Sigma^*$, for every $k > 0$, let $w_1, \ldots, w_k \in \tilde{\Sigma}^*$ be such that
$w_1 @ w_2 @ \ldots @ w_k = w$, and let $Z : Q \to \mathbb{N}$ be the multiset $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$.
Then, $\{(q_0)^k\} \xrightarrow{w}_A Z$.
Proof. The proof is by induction on $|w|$. If $|w| = 0$, then $w_1 = \cdots = w_k = w = \varepsilon$
and $Z = \{(q_0)^k\}$: by definition, $\{(q_0)^k\} \xrightarrow{\varepsilon}_A Z$. If $|w| > 0$ let $w = w'a$
for $w' \in \Sigma^*$, $a \in \Sigma$. Hence, since $w_1 @ w_2 @ \ldots @ w_k = w$, there exists $h$, $1 \le h \le k$,
and there exist $w'_1, \ldots, w'_k \in \tilde{\Sigma}^*$ such that $w_h = w'_h a$ and $w_j = w'_j \mathring{a}$ for
$j \ne h$. Let $Z' = \biguplus_{1 \le j \le k} \{\delta(q_0, w'_j)\}$. Since $w'_1 @ w'_2 @ \ldots @ w'_k = w'$, by induction
hypothesis $\{(q_0)^k\} \xrightarrow{w'}_A Z'$. Let $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$. Hence,
\[ Z = \{\delta(q_0, w'_h a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j \mathring{a})\} = \{\delta(\delta(q_0, w'_h), a)\} \uplus \delta\Big( \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j)\},\ \mathring{a} \Big). \]
Let $q = \delta(q_0, w'_h) \in Z'$. Then $Z = \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a})$. By Definition 3.1,
$Z' \xrightarrow{a}_A Z$, hence $\{(q_0)^k\} \xrightarrow{w}_A Z$.
Given Proposition 3.2, the NLOGSPACE complexity of consensual languages
follows almost immediately:
Theorem 3.6. The word membership problem for the family $C_{REG}$ is in the complexity class NLOGSPACE (hence in P).
Proof. Clearly, the transition relation $\to_A$ can be computed by a nondeterministic
Turing machine that, operating on an input word $w \in \Sigma^*$ of length $n \ge 0$, guesses
$k$ and first stores $\{(q_0)^k\}$ and then makes a nondeterministic move computing $\xrightarrow{a}_A$
when reading each input symbol $a \in \Sigma$.
Notice that one can always assume that k ≤ n, since a match with at most
n words is enough to determine w. In a computation with input length n, the
multiset to be stored at each step has cardinality exactly k ≤ n. Hence, the space
for storing a multiset Z over Q is logarithmic in n: a multiset can be represented
by its characteristic function, which in this case requires a counter for each state
in Q, and hence, it is enough to use |Q|⌈log2 n⌉ bits, with |Q| a constant.
Also, a move of the machine to simulate $\xrightarrow{a}_A$ only requires modifying the counter
values. The possible operations on a counter to implement the multiset operations
are: add or subtract 1, reset to zero and store the result of adding two or more
counters. These operations may require additional constant or, at most, logarithmic space. Hence, for each regular base language $R$, membership in $C(R)$ is in
NLOGSPACE.
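The membership procedure can also be sketched deterministically. Instead of guessing moves as the nondeterministic machine of the proof does, the illustrative function below (hypothetical name `member`) explores every multiset reachable from $\{(q_0)^k\}$, for each $k \le n$; since a multiset of cardinality $k \le n$ over a fixed state set is a vector of $|Q|$ counters bounded by $n$, only polynomially many multisets can occur. The DFA and the uppercase encoding of marked letters are assumptions made for the example.

```python
from collections import Counter

def member(word, delta, q0, finals):
    """Decide membership of `word` in C(L(A)) by exploring reachable multisets."""
    for k in range(1, len(word) + 1):            # a degree k <= |word| suffices
        current = {((q0, k),)}                   # multisets as sorted (state, count) tuples
        for a in word:
            successors = set()
            for key in current:
                Z = Counter(dict(key))
                for q in Z:                      # the computation placing the letter a
                    Zp = Counter([delta(q, a)])
                    for p, n in (Z - Counter([q])).items():
                        Zp[delta(p, a.upper())] += n
                    successors.add(tuple(sorted(Zp.items())))
            current = successors
        if any(all(q in finals for q, _ in key) for key in current):
            return True                          # some final multiset is reachable
    return False

# Assumed DFA for the base of Example 1.5 (uppercase = marked copy, 0 = sink).
LIVE = {(1, "A"): 1, (1, "a"): 2, (2, "A"): 2, (2, "B"): 3,
        (2, "b"): 4, (3, "B"): 3, (3, "b"): 4, (4, "B"): 4}

def delta_R(q, x):
    return LIVE.get((q, x), 0)
```

On this base, `member("aaabbb", delta_R, 1, {4})` holds while `member("aabbb", delta_R, 1, {4})` does not, as expected for $\{a^n b^n \mid n > 0\}$.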
4. Regular consensual languages
We have seen that consensual languages on a regular base may or may not be
regular. In this section we provide conditions ensuring regularity and we investigate the descriptive complexity of consensual specifications of regular languages.
4.1. Conditions for regularity
Bounded degree. The first result is obvious: when there is a bound $i$ such that
$L^{i@} = L^{(i+1)@}$, i.e., $L^{@}$ has bounded degree, then $L^{@}$ is regular. However, in
general the converse is not true: $L^{i@}$ may be not equal to $L^{(i+1)@}$ even when $L^{@}$
is regular: the degree of a regular consensual language can be unbounded. For
instance, consider the regular expression $R = \mathring{a}^* a \mathring{a}^*$; then $C(R) = a^+$ is regular
and $R^{@}$ is $(\mathring{a}^* a \mathring{a}^*)^+$ while $R^{i@}$ is $(\mathring{a}^* a \mathring{a}^*)^i$. Hence $R^{i@} \not\subseteq R^{(i+1)@}$.
Maximal degree. We say that a base language $R$ has maximal degree if for every
word $w \in C(R)$ there exist $|w|$ distinct strings $w_1, \ldots, w_{|w|} \in R - \mathring{\Sigma}^*$ such that
$w = w_1 @ \ldots @ w_{|w|}$.
We are going to prove that a consensual language on a regular base, having
maximal degree, is regular. The result is non-obvious because in this case the
multisets describing the simultaneous matching computations have unbounded
cardinality. To simplify the proof, first notice that if $R$ has maximal degree, then
we have $C(R) = C(R \cap \mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*)$, since, if $w = w_1 @ \ldots @ w_{|w|}$, then each $w_i$ is in
$\mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*$. Hence, it is enough to show that consensual languages with $R \subseteq \mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*$
are regular.
Theorem 4.1. Let L = C(R), with R regular and of maximal degree. Then L is
regular.
The proof of Theorem 4.1 requires a few additional considerations and definitions.
In the following, let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA, with $\delta$ being total, and such
that $L(A) = R \cup \mathring{\Sigma}^+$. Since the words in $\mathring{\Sigma}^+$ do not give any contribution to the
strong match, we have $C(L(A)) = C(R)$.
Therefore, set $Q$ of states can be partitioned into two disjoint sets $S, T$, such
that $Q = S \cup T$, $q_0 \in S$, and, for every $x, y \in \mathring{\Sigma}^*$ and $a \in \Sigma$, we have $\delta(q_0, x) \in S$
and $\delta(q_0, xay) \in T$.
The following lemma is immediate.
Lemma 4.2. For every $w \in \Sigma^*$, $k > |w|$, and for every multiset $Z$ over $Q$, if
$\{(q_0)^k\} \xrightarrow{w}_A Z$ then:
(1) $|\widehat{Z} \cap S| = 1$ and
(2) $\sum_{q \in S} Z(q) = k - |w|$;
that is, exactly one of the states in $S$ occurs in $Z$ and its multiplicity is $k - |w|$.
We construct from A a nondeterministic finite automaton (NFA) that recognizes
C(R).
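Definition 4.3. For a DFA $A$ as above, let $N$ be the NFA $(\Sigma, 2^Q, \Rightarrow_N, \{q_0\}, 2^F)$
where the states are all subsets of $Q$, $\{q_0\}$ is the initial state, the final states are all
subsets of $F$, and $\Rightarrow_N \subseteq 2^Q \times \Sigma \times 2^Q$ is the transition relation, defined as follows,
for every $a \in \Sigma$, and for all sets $Q', Q'' \subseteq Q$:
\[ Q' \overset{a}{\Rightarrow}_N Q'' \quad \text{if, and only if,} \quad \exists q \in Q' : Q'' = \{\delta(q, a)\} \cup \bigcup_{p \in Q'} \{\delta(p, \mathring{a})\}. \]
Relation $\Rightarrow_N$ can be extended as usual to $2^Q \times \Sigma^* \times 2^Q$.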
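A small sketch may help to see Definition 4.3 at work on the base $R = \mathring{a}^* a \mathring{a}^*$ mentioned under "Bounded degree", for which $C(R) = a^+$. Everything below is an illustrative assumption: uppercase `A` encodes the marked copy of `a`, and the four-state DFA is one possible automaton for $R \cup \mathring{\Sigma}^+$ (state 0 initial, states 1 and 2 final, state 3 a sink). The function simulates $N$ on the fly instead of building all $2^{|Q|}$ subsets.

```python
# Assumed DFA for R ∪ (marked strings), with R written A*aA* in the uppercase encoding.
TABLE = {(0, "A"): 1, (0, "a"): 2, (1, "A"): 1, (1, "a"): 2,
         (2, "A"): 2, (2, "a"): 3, (3, "A"): 3, (3, "a"): 3}
FINALS = frozenset({1, 2})

def delta(q, x):
    return TABLE[(q, x)]

def nfa_accepts(word, q0=0, finals=FINALS):
    """Simulate the NFA N: its states are sets of DFA states."""
    current = {frozenset([q0])}
    for a in word:
        current = {
            # one state q reads the unmarked letter, every state reads the marked copy
            frozenset({delta(q, a)} | {delta(p, a.upper()) for p in Qp})
            for Qp in current for q in Qp
        }
    return any(Qp <= finals for Qp in current)   # final states of N are subsets of F
```

Under these assumptions `nfa_accepts` returns true exactly on the nonempty words over {a}, which agrees with $C(R) = a^+$.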
Proof of Theorem 4.1. The proof shows that when $R$ has maximal degree, then
$C(R) = L(N)$.
First, we show that $C(R) \subseteq L(N)$.
Let $w \in \Sigma^*$ and $k > |w|$, and assume that there exists $Z' : Q \to \mathbb{N}$ such that
$\{(q_0)^k\} \xrightarrow{w}_A Z'$. We show by induction on $|w|$ that in this case $\{q_0\} \overset{w}{\Rightarrow}_N \widehat{Z'}$.
The thesis then follows considering the case when $\widehat{Z'} \subseteq F$, also noticing that the
assumption $k > |w|$ does not change the language (since $\mathring{\Sigma}^+ \subseteq L(A)$).
The base step $w = \varepsilon$ is immediate. For the inductive step, let $w = za$, for $z \in \Sigma^*$, $a \in \Sigma$.
Since $\{(q_0)^k\} \xrightarrow{w}_A Z'$, there exists $Z'' : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \xrightarrow{z}_A Z'' \xrightarrow{a}_A Z'$.
By induction hypothesis, $\{q_0\} \overset{z}{\Rightarrow}_N \widehat{Z''}$. Since $Z'' \xrightarrow{a}_A Z'$, there
exists $q \in Z''$ such that $Z' = \{\delta(q, a)\} \uplus \delta(Z'' - \{q\}, \mathring{a})$. Since $q \in \widehat{Z''}$, then
define $Q' = \{\delta(q, a)\} \cup \delta(\widehat{Z''}, \mathring{a})$. Clearly, $\widehat{Z''} \overset{a}{\Rightarrow}_N Q'$. However, in general it
might be that $\widehat{Z'} \subset Q'$, since $Q'$ also includes $\delta(q, \mathring{a})$. However, $q \in S$, hence
$Z''(q) = k - |z| = k - |w| + 1$, which is greater than 1 since $k > |w|$ by hypothesis.
Hence, $Z'' - \{q\}$ has still at least one occurrence of $q$. Therefore, $\delta(q, \mathring{a})$ is in $Z'$.
Hence, $\widehat{Z'} = Q'$.
Conversely, we show that $L(N) \subseteq C(R)$.
Assume $\{q_0\} \overset{w}{\Rightarrow}_N Q'$. We prove by induction on $|w|$ that for all $k > |w|$ there
exists a multiset $Z'$ over $Q$ such that $\widehat{Z'} = Q'$ and $\{(q_0)^k\} \xrightarrow{w}_A Z'$. The thesis
follows immediately by considering the case $Q' \subseteq F$. The base step is trivial.
For the inductive step, let $w = za$, for $z \in \Sigma^*$, $a \in \Sigma$. Then $\exists Q'' \subseteq Q$ such
that $\{q_0\} \overset{z}{\Rightarrow}_N Q'' \overset{a}{\Rightarrow}_N Q'$. By induction hypothesis, $\exists Z'' : Q \to \mathbb{N}$ such that
$\widehat{Z''} = Q''$ and $\{(q_0)^k\} \xrightarrow{z}_A Z''$. Since $Q'' \overset{a}{\Rightarrow}_N Q'$, there exists $q \in Q''$ such that
$Q' = \{\delta(q, a)\} \cup \delta(Q'', \mathring{a})$. Let $Z' = \{\delta(q, a)\} \uplus \delta(Z'' - \{q\}, \mathring{a})$, which by definition
means that $Z'' \xrightarrow{a}_A Z'$. We are left to prove that $\widehat{Z'} = Q'$. The proof is now
identical to the converse case above. $Q'$ also includes $\delta(q, \mathring{a})$, while $Z'$ might not
include $\delta(q, \mathring{a})$. However, $q \in S$, hence $Z''(q) = k - |z| = k - |w| + 1$, which is
greater than 1 since $k > |w|$ by hypothesis. Hence, $Z'' - \{q\}$ has still at least one
occurrence of $q$. Therefore, $\delta(q, \mathring{a})$ is in $Z'$. Hence, $\widehat{Z'} = Q'$.
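We observe that the above result can be generalized to the case of $R$ being included in $\mathring\Sigma^{*} \Sigma^{h} \mathring\Sigma^{*}$, for some finite $h \ge 1$, where $\mathring\Sigma$ is the marked copy of $\Sigma$; in other words, each matching word may place up to $h$ letters within a window of width $h$.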
4.2. Descriptive complexity of regular consensual languages
We show that some regular languages can be described much more succinctly
by using consensual languages. In what follows, the size $\|R\|$ of a regular language $R$ is defined to be the number of states of the smallest nondeterministic automaton accepting $R$.
Theorem 4.4. There exist a family of regular languages $\{L_m \mid m \ge 2\}$ over $\Sigma$ and another family of regular languages $\{R_m \mid m \ge 2\}$ over $\tilde\Sigma$ such that $L_m = C(R_m)$ and the size of each $R_m$ is asymptotically at least exponentially smaller than the size of $L_m$.
Proof. Let $\Sigma = \{a\}$, with marked copy $\mathring{a}$, and denote with $p_i$, $i \ge 1$, the $i$-th prime number (e.g., $p_1 = 2, \ldots, p_5 = 11, \ldots$). For every $m > 0$, the product of the first $m$ primes, sometimes called the primorial of $p_m$, is
$$p_m\# = \prod_{1 \le i \le m} p_i.$$
For every $m \ge 1$, let
$$L_m = \{a^{m + h \cdot p_m\#} \mid h \ge 1\}$$
and let
$$R_m = \mathring{a}^{m} a^{+} \;\cup \bigcup_{1 \le i \le m} \mathring{a}^{i-1}\, a\, \mathring{a}^{m-i}\, (\mathring{a}^{p_i})^{+}.$$
We claim that $C(R_m) = L_m$. If $u \in C(R_m)$ then $u$ is the match of $m + 1$ words $u_0 @ u_1 @ \ldots @ u_m$, with $u_0 \in \mathring{a}^{m} a^{+}$ and each $u_i$ in $\mathring{a}^{i-1} a\, \mathring{a}^{m-i} (\mathring{a}^{p_i})^{+}$, since $u$ has a prefix $a^m$ to be strongly matched, and each $u_i$ places one $a$ as the $i$-th character. Since the suffix of $u_i$ after its first $m$ letters has a length that is a multiple of $p_i$, there exist $k_1, \ldots, k_m \ge 1$ such that $|u_i| = m + k_i p_i$. All the matched words have the same length; hence, $k_1 p_1 = k_2 p_2 = \cdots = k_m p_m$. Since the $p_i$'s are distinct primes, there exists $h \ge 1$ such that $k_1 p_1 = h p_1 p_2 \cdots p_m$. Therefore, $|u| = m + h p_1 p_2 \cdots p_m$ and $u \in L_m$.
Conversely, if $u = a^{m + h \cdot p_m\#} \in L_m$ then $u$ is the match of $\mathring{a}^{m} a^{h \cdot p_m\#}$ with $m$ words $u_1, \ldots, u_m \in R_m$, with each $u_i = \mathring{a}^{i-1} a\, \mathring{a}^{m-i}\, \mathring{a}^{h p_1 \cdots p_m}$, because obviously $\mathring{a}^{h p_1 \cdots p_m}$ is in $(\mathring{a}^{p_i})^{+}$. Hence, $u \in C(R_m)$.
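The length argument in the claim can be checked numerically. The sketch below is only an illustration of the counting step, under the assumption that a word a^n is matchable exactly when n - m is a positive multiple of every p_i; the function names and the fixed list `PRIMES` are hypothetical helpers, not part of the paper.

```python
from math import prod

PRIMES = [2, 3, 5, 7, 11, 13]  # p_1 .. p_6, enough for small m


def primorial(m):
    """Product of the first m primes, written p_m# in the text."""
    return prod(PRIMES[:m])


def lengths_in_Lm(m, bound):
    """Lengths m + h * p_m# (h >= 1) not exceeding bound."""
    step = primorial(m)
    return list(range(m + step, bound + 1, step))


def lengths_matchable(m, bound):
    """Lengths n such that each component of R_m contains a word of
    length n, i.e. n - m is a positive multiple of every p_i."""
    return [
        n
        for n in range(m + 1, bound + 1)
        if all((n - m) % p == 0 for p in PRIMES[:m])
    ]
```

For m = 3 and lengths up to 100, both functions return 33, 63 and 93, in agreement with |u| = m + h p_1 p_2 p_3.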
Since the length of the shortest word in $L_m$ is $m + p_m\#$, every (deterministic or nondeterministic) finite automaton needs at least $m + p_m\#$ states to accept $L_m$, i.e., the size of $L_m$ satisfies $\|L_m\| \ge m + p_m\#$. Since $p_i > i$, for every $i \ge 1$, it is obvious that $p_i\# > i!$, i.e., primorials are greater than factorials. Hence, it follows that:
$$\|L_m\| \ge m + p_m\# \ge p_m\# = p_m \cdot p_{m-1}\# \ge p_m\,(m-1)!$$
On the other hand, the number of states of a minimal DFA for $R_m$ is
$$\frac{(m+1)(m+2)}{2} + 1 + \sum_{1 \le i \le m} p_i.$$
Hence,
$$\|R_m\| \le 3m + 1 + \sum_{1 \le i \le m} p_i \le 3m + 1 + m p_m.$$
Therefore, $\frac{\|L_m\|}{\|R_m\|}$ is $\Omega\!\left(\frac{(m-1)!\, p_m}{m\, p_m}\right)$, which is $\Omega((m-2)!)$, which is also $\Omega(2^m)$.
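To get a feeling for how quickly the two bounds diverge, they can be tabulated for small m. This is a small arithmetic sketch; the function names and the list `PRIMES` are assumptions introduced here, and the two functions simply evaluate the lower bound m + p_m# and the upper bound 3m + 1 + (p_1 + ... + p_m) stated above.

```python
from math import factorial, prod

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # p_1 .. p_8


def lower_bound_L(m):
    """Lower bound m + p_m# on the size of L_m."""
    return m + prod(PRIMES[:m])


def upper_bound_R(m):
    """Upper bound 3m + 1 + sum of the first m primes on the size of R_m."""
    return 3 * m + 1 + sum(PRIMES[:m])


def primorial_exceeds_factorial(m):
    """Check p_m# > m! for a given m."""
    return prod(PRIMES[:m]) > factorial(m)
```

For m = 5 the bounds are 2315 states against 44, and the gap widens rapidly as m grows.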
5. Undecidability of emptiness
In this section, the undecidability of emptiness checking is proved, by means of a reduction from the halting problem of a 2-counter Minsky machine [6]. (The idea of the proof was suggested by an anonymous referee.)
Theorem 5.1. Let $R \subseteq \tilde\Sigma^{*}$. It is undecidable whether $C(R) = \emptyset$.
Proof. Define a Minsky machine $C = (S, M, i, f)$, where $S$ is a finite set of states, $i, f \in S$, with $i \ne f$, are the initial and final states, respectively, and $M \subseteq (S - \{f\}) \times S \times \{\mathit{true}, \mathit{test}\}^2 \times \{\mathit{INC}, \mathit{DEC}, \mathit{STAY}\}^2$ is a finite set of moves. Without loss of generality, assume that for every $p \in S - \{f\}$, $q \in S$, there exists at most one tuple $(t_1, t_2, i_1, i_2)$, with $t_1, t_2 \in \{\mathit{true}, \mathit{test}\}$ and $i_1, i_2 \in \{\mathit{INC}, \mathit{DEC}, \mathit{STAY}\}$, such that $(p, q, t_1, t_2, i_1, i_2) \in M$.
A configuration of a counter machine is an element of $S \times \mathbb{N} \times \mathbb{N}$. The initial configuration is $(i, 0, 0)$, and a final configuration is $(f, n, m)$, for every $n, m \in \mathbb{N}$. A transition relation $\longrightarrow_C \subseteq (S \times \mathbb{N} \times \mathbb{N}) \times (S \times \mathbb{N} \times \mathbb{N})$ can easily be defined as usual. For instance, if $(p, q, \mathit{true}, \mathit{test}, \mathit{INC}, \mathit{STAY}) \in M$ then, e.g., $(p, 3, 0) \longrightarrow_C (q, 4, 0)$: the first counter is not tested ("true"), the second counter is tested for 0, and then the first counter is incremented and the second counter stays.
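The transition relation on configurations can be sketched as follows. This is a minimal illustration, not a definition taken from the paper: the function name `successors`, the string encoding of the tags, and in particular the choice that a DEC move is disabled when the counter is 0 are assumptions of this sketch.

```python
DELTA_BY_ACTION = {"INC": 1, "DEC": -1, "STAY": 0}


def successors(moves, config):
    """Configurations reachable in one move from config = (state, c1, c2).

    A move is a tuple (p, q, t1, t2, i1, i2); "test" requires the counter
    to be 0, "true" imposes no condition. Assumption of this sketch:
    a DEC on a counter equal to 0 makes the move inapplicable.
    """
    state, *counters = config
    result = []
    for src, dst, t1, t2, i1, i2 in moves:
        if src != state:
            continue
        updated = []
        for value, test, action in zip(counters, (t1, t2), (i1, i2)):
            if test == "test" and value != 0:
                break
            if action == "DEC" and value == 0:
                break
            updated.append(value + DELTA_BY_ACTION[action])
        else:  # no break: both counters passed their checks
            result.append((dst, *updated))
    return result
```

With the single move (p, q, true, test, INC, STAY), the configuration (p, 3, 0) has the successor (q, 4, 0), as in the example above, while (p, 3, 1) has none because the second counter fails its zero test.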
Let $\Sigma = S \cup \{X, Y, 1, 2\}$. For every $p \in S - \{f\}$, $q \in S$, define the following regular languages on the bipartite alphabet $\tilde\Sigma$, the union of $\Sigma$ and its marked copy.
INIT          =  1212iΣ∗
NEXT(p, q)    =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qΣ∗
COPY1(p, q)   =  Σ∗XY(XY)∗p(XY)∗1212(XY)∗XY(XY)∗qΣ∗
COPY2(p, q)   =  Σ∗p(XY)∗XY(XY)∗1212(XY)∗q(XY)∗XYΣ∗
test1(p, q)   =  Σ∗1212p(XY)∗1212(XY)∗qΣ∗
true1(p, q)   =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qΣ∗
test2(p, q)   =  Σ∗1212(XY)∗p1212(XY)∗qΣ∗
true2(p, q)   =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qΣ∗
INC1(p, q)    =  Σ∗1212(XY)∗p(XY)∗1212XY(XY)∗qΣ∗
DEC1(p, q)    =  Σ∗1212XY(XY)∗p(XY)∗1212(XY)∗qΣ∗
STAY1(p, q)   =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qΣ∗
INC2(p, q)    =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qXYΣ∗
DEC2(p, q)    =  Σ∗1212(XY)∗pXY(XY)∗1212(XY)∗qΣ∗
STAY2(p, q)   =  Σ∗1212(XY)∗p(XY)∗1212(XY)∗qΣ∗
HALT          =  Σ∗1212(XY)∗f(XY)∗.
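Then, associate a regular language $[m]$ with every move $m$ of a 2-counter machine $C = (S, M, i, f)$. For every $m = (p, q, t_1, t_2, i_1, i_2)$, with $t_1, t_2 \in \{\mathit{test}, \mathit{true}\}$, $i_1, i_2 \in \{\mathit{INC}, \mathit{DEC}, \mathit{STAY}\}$, let:
$$[m] = \mathit{NEXT}(p,q) \cup \bigcup_{j=1,2} \left( \mathit{COPY}_j(p,q) \cup \begin{cases} \mathit{test}_j(p,q) & \text{if } t_j \text{ is } \mathit{test} \\ \mathit{true}_j(p,q) & \text{otherwise} \end{cases} \cup \begin{cases} \mathit{INC}_j(p,q) & \text{if } i_j \text{ is } \mathit{INC} \\ \mathit{DEC}_j(p,q) & \text{if } i_j \text{ is } \mathit{DEC} \\ \mathit{STAY}_j(p,q) & \text{otherwise} \end{cases} \right).$$
For instance, if $m$ is $(p, q, \mathit{test}, \mathit{test}, \mathit{DEC}, \mathit{INC})$ then $[m]$ is:
$$\mathit{NEXT}(p,q) \cup \mathit{COPY}_1(p,q) \cup \mathit{COPY}_2(p,q) \cup \mathit{test}_1(p,q) \cup \mathit{test}_2(p,q) \cup \mathit{DEC}_1(p,q) \cup \mathit{INC}_2(p,q).$$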
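The selection of the component languages of [m] can be sketched as a small function that returns the names of the languages entering the union. This is only an illustration: the function name `move_components` and the string representation of the languages are hypothetical, and the COPY components are included in accordance with the worked example for (p, q, test, test, DEC, INC).

```python
def move_components(move):
    """Names of the regular languages whose union forms [m]."""
    p, q, t1, t2, i1, i2 = move
    names = ["NEXT", "COPY1", "COPY2"]
    # One test/true component per counter.
    for j, t in enumerate((t1, t2), start=1):
        names.append(("test" if t == "test" else "true") + str(j))
    # One INC/DEC/STAY component per counter.
    for j, action in enumerate((i1, i2), start=1):
        names.append(action + str(j))
    return [f"{name}({p},{q})" for name in names]
```

Applied to the move (p, q, test, test, DEC, INC), the function lists the same seven components as the example in the text.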
Define $R$ as the union $\mathit{HALT} \cup \mathit{INIT} \cup \bigcup_{m \in M} [m]$. Then $C(R)$ is non-empty if, and only if, $C$ has a halting computation. The main idea is that $C(R)$ represents the set of all halting runs of $C$. Every word of $C(R)$ has the form
$$1212\, i\, \bigl(1212 (XY)^{*} S (XY)^{*}\bigr)^{*}\, 1212 (XY)^{*} f (XY)^{*}$$
where the subwords 1212 separate the representations of two configurations of $C$. For instance, the prefix $1212\, i\, 1212$ represents the initial configuration $(i, 0, 0)$ of $C$, while a subword $1212 (XY)^{n} p (XY)^{m} 1212$ represents the configuration $(p, n, m)$, and a suffix $1212 (XY)^{n} f (XY)^{m}$ represents a final configuration $(f, n, m)$. Also, the definition of $R$ is such that if there is a subword of the form
$$1212 (XY)^{n} p (XY)^{m}\, 1212 (XY)^{n'} q (XY)^{m'}\, 1212$$
then $(p, n, m) \longrightarrow_C (q, n', m')$.
Let $F = 1212 (XY)^{*} S (XY)^{*}$. We show that if $(i, 0, 0) \longrightarrow^{+}_C (f, n, m)$ then there is a word $w \in 1212\, i\, F^{*}\, 1212 (XY)^{n} f (XY)^{m}$ such that $w \in C(R)$, i.e., every halting run of $C$ corresponds to a word of $C(R)$.
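The encoding of runs as words can be made concrete with a short sketch. The helper names below are hypothetical, and the sketch assumes that every state is a single character different from X, Y, 1 and 2. It only checks the shape of the unmarked word, not membership in C(R).

```python
import re


def encode_config(state, n, m):
    """Encode the configuration (state, n, m) as 1212 (XY)^n state (XY)^m."""
    return "1212" + "XY" * n + state + "XY" * m


def encode_run(configs):
    """Concatenate the encodings of the configurations of a run."""
    return "".join(encode_config(*c) for c in configs)


def run_pattern(states, initial, final):
    """Regular expression for 1212 i (1212 (XY)* S (XY)*)* 1212 (XY)* f (XY)*."""
    any_state = "[" + "".join(re.escape(s) for s in states) + "]"
    return re.compile(
        "1212" + re.escape(initial)
        + "(?:1212(?:XY)*" + any_state + "(?:XY)*)*"
        + "1212(?:XY)*" + re.escape(final) + "(?:XY)*"
    )
```

For example, the run (i, 0, 0), (q, 1, 0), (f, 1, 1) is encoded as 1212i1212XYq1212XYfXY, which has the required form.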
By induction on $h \ge 1$, we first prove the following claim (1):
(1) if $(i, 0, 0) \longrightarrow^{h-1}_C (p, n, m) \longrightarrow_C (q, n', m')$ then there is a word in $R^{@}$ of the form:
$$1212\, i\, F^{h-1}\, 1212 (XY)^{n} p (XY)^{m}\, 1212 (XY)^{n'} q (XY)^{m'} F^{*}.$$
The base case is: if $(i, 0, 0) \longrightarrow_C (q, n', m')$ with $0 \le n', m' \le 1$, then
$$1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^{*} \in R^{@}.$$
By the set $\mathit{INIT}$,
$$W_0 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^{*} \subseteq R^{@}.$$
The move $(i, q, \mathit{test}, \mathit{test}, i_1, i_2)$ is a move of $C$, with $i_1, i_2 \in \{\mathit{INC}, \mathit{STAY}\}$ (and, e.g., $i_1 = \mathit{INC}$ if $n' = 1$, $i_1 = \mathit{STAY}$ if $n' = 0$, etc.). Hence, $\mathit{NEXT}(i, q)$ is in $[(i, q, \mathit{test}, \mathit{test}, i_1, i_2)]$, and $W_1 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^{*} \subseteq R^{@}$. Notice that $\mathit{COPY}_1(i, q)$, $\mathit{COPY}_2(i, q)$ cannot match with $W_1$. Both $\mathit{test}_1(i, q)$ and $\mathit{test}_2(i, q)$ may match with $W_1$ and hence $W_2 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^{*} \subseteq R^{@}$. Assume that $n' = 1$, $m' = 0$, hence, $i_1 = \mathit{INC}$, $i_2 = \mathit{STAY}$. The other cases may be dealt with analogously. Hence, both $\mathit{INC}_1(i, q)$ and $\mathit{STAY}_2(i, q)$ may match with $W_2$: $W_3 = 1212\, i\, 1212\, XY\, q\, F^{*} \subseteq R^{@}$.
The inductive case is dealt with analogously to the base case: by induction hypothesis, there exists a word $w_0$ in $1212\, i\, F^{h-1}\, 1212 (XY)^{n} p (XY)^{m} F^{*} \cap R^{@}$. Hence, there is $w_1 \in R^{@}$ of the form
$$1212\, i\, F^{h-1}\, 1212 (XY)^{n} p (XY)^{m}\, 1212 (XY)^{n'} q (XY)^{m'} F^{*}.$$
Assume that $x = (p, q, \mathit{true}, \mathit{true}, \mathit{DEC}, \mathit{INC}) \in M$ (by the uniqueness assumption on $M$, there is no other move from $p$ to $q$). Hence, $n' = n - 1$, $m' = m + 1$. Then, $\mathit{COPY}_1(p, q), \mathit{COPY}_2(p, q) \subseteq [x]$: there is $w_2 \in R^{@}$ of the form
$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY\, p\, (XY)^{m}\, 1212 (XY)^{n-1}\, q\, (XY)^{m} XY\, F^{*}.$$
Since also $\mathit{true}_1(p, q), \mathit{true}_2(p, q) \subseteq [x]$, there is $w_3 \in R^{@}$ of the form
$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY\, p\, (XY)^{m}\, 1212 (XY)^{n-1}\, q\, (XY)^{m} XY\, F^{*}.$$
Since also $\mathit{DEC}_1(p, q), \mathit{INC}_2(p, q) \subseteq [x]$, there is $w_4 \in R^{@}$ of the form
$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY\, p\, (XY)^{m}\, 1212 (XY)^{n-1}\, q\, (XY)^{m} XY\, F^{*},$$
thus ending the induction for the case of a move $x$ as above. The other cases of moves are analogous and are omitted for brevity.
Finally, by the claim (1), when $q = f$, if $(i, 0, 0) \longrightarrow^{+}_C (f, n', m')$ then there exists $w \in R^{@}$ of the form $1212\, i\, F^{*}\, 1212 (XY)^{n'} q (XY)^{m'}$. By a match with $\mathit{HALT}$, the result is a word $w' \in R^{@}$ of the form $1212\, i\, F^{*}\, 1212 (XY)^{n'} q (XY)^{m'}$, which is in $C(R)$ since it has no marking.
The proof of the converse claim that every word of $C(R)$ is a halting run of $C$ is similar and can be safely omitted.
6. Comparisons with other families
In order to compare consensual languages with some classical language families,
we show that certain languages exceed the capacity of consensual languages based
on regular sets.
Proposition 6.1. The languages $\{u c u^{R} \mid u \in \{a, b\}^{*}\}$ and $\{u c u \mid u \in \{a, b\}^{*}\}$ are not in the family CREG.
Proof. Let $L = \{u c u \mid u \in \{a, b\}^{*}\}$. The proof for $\{u c u^{R} \mid u \in \{a, b\}^{*}\}$ is completely analogous. Assume by contradiction that there is a DFA $A = (\tilde\Sigma, Q, \delta, q_0, F)$, over the bipartite alphabet of $\Sigma = \{a, b, c\}$, such that $C(L(A)) = L$. For a given input word of length $n > 0$, the nondeterministic Turing machine simulating the consensual transition relation $\xrightarrow{*}_{A}$ only stores a multiset over $Q$, of cardinality $m \le n$; hence, there are at most $(n + 1)^{|Q|}$ different configurations. On the other hand, there are $2^{n}$ different strings in $\{a, b\}^{n}$, and for $n$ large enough, the number of possible strings is much larger than the number $(n + 1)^{|Q|}$ of different multisets: there exist $u, w \in \{a, b\}^{n}$, $u \ne w$, such that there exist $m \le n$, a multiset $Z$ over $Q$ and two multisets $Z', Z''$ over $F$, such that $\{(q_0)^{m}\} \xrightarrow{u}_{A} Z \xrightarrow{cu}_{A} Z'$ and $\{(q_0)^{m}\} \xrightarrow{w}_{A} Z \xrightarrow{cw}_{A} Z''$. But then also $\{(q_0)^{m}\} \xrightarrow{u}_{A} Z \xrightarrow{cw}_{A} Z''$, a contradiction since $ucw \notin L$.
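The counting step of the proof compares the bound (n + 1)^|Q| on the number of multisets with the number 2^n of strings. A small sketch can locate where the second quantity first overtakes the first; the function name is a hypothetical helper, and the sketch only illustrates the arithmetic, not the pigeonhole argument itself.

```python
def first_n_with_more_strings(num_states):
    """Smallest n >= 1 with 2**n > (n + 1)**num_states."""
    n = 1
    while 2 ** n <= (n + 1) ** num_states:
        n += 1
    return n
```

For a DFA with 3 states the threshold is already reached at n = 11, where 2048 strings face at most 1728 multisets.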
From Proposition 6.1, and from Example 2.4 it follows:
Corollary 6.2. The family CREG is not comparable with the families of context-free languages and of tree adjoining languages [5].
Among the typical context-free languages, the Dyck sets with two or more pairs of parentheses fall outside the family CREG. To see this, it suffices to observe that the proof of Proposition 6.1 also applies to the language
$$L = \{u\, h(u^{R}) \mid u \in \{a, b\}^{*}\}$$
where $h$ is the morphism $h(a) = a'$, $h(b) = b'$.
Let D2 be the Dyck language with opening parentheses a, b and closing parentheses a′ , b′ respectively. Let R be the regular language composed of all strings
on {a, b, a′ , b′ } where there is no occurrence of the factors a′ a, b′ b, a′ b, b′ a. Hence,
D2 ∩ R = L, and, if D2 were in CREG , then by closure of CREG under intersection
with regular languages, also L would be in CREG .
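The identity between the intersection of D2 with R and the language L can be checked by brute force on short words. In the sketch below, which is only an illustration, the closing parentheses a' and b' are written as the characters A and B, and all function names are hypothetical.

```python
from itertools import product

CLOSE = {"a": "A", "b": "B"}  # A and B stand for a' and b'


def is_dyck(word):
    """Membership in the Dyck language with pairs (a, A) and (b, B)."""
    stack = []
    for ch in word:
        if ch in CLOSE:
            stack.append(ch)
        elif not stack or CLOSE[stack.pop()] != ch:
            return False
    return not stack


def in_R(word):
    """No closing parenthesis is immediately followed by an opening one."""
    return all(not (x in "AB" and y in "ab") for x, y in zip(word, word[1:]))


def in_L(word):
    """Membership in { u h(u^R) | u in {a, b}* }."""
    half, rem = divmod(len(word), 2)
    u = word[:half]
    return (
        rem == 0
        and set(u) <= set("ab")
        and word[half:] == "".join(CLOSE[c] for c in reversed(u))
    )


def agree_up_to(max_len):
    """Check that the Dyck words in R are exactly the words of L, up to max_len."""
    return all(
        (is_dyck(w) and in_R(w)) == in_L(w)
        for n in range(max_len + 1)
        for w in map("".join, product("abAB", repeat=n))
    )
```

For instance, the word abBA is in L, whereas aAbB is a Dyck word that is excluded by R because a closing parenthesis is followed by an opening one.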
7. Conclusion
The simple notion of consensus between simultaneous computations, formulated
by means of strong matching, though surely not the only possible or sensible one,
permits a rather remarkable selectivity in language definition. As we see it,
the interest of the family of consensual languages, based on matching finite-state
computations, comes from a combination of properties; it includes the regular
family, and actually offers a very concise representation of some regular sets; it
includes non-semilinear languages; and it has a time-polynomial word problem.
Altogether this research proposes a new way of looking at finite-state devices as
language recognizers.
Since the model is new and research is in its early stages, many questions are
open for investigation, for instance concerning minimality, decidability of equivalence, determinism of the counter (or multi-set) machine, as well as the study of
closure properties beyond the basic ones considered.
Concerning variations on the theme of consensual computations, we mention
some possibilities. Two variations would be to allow (or to oblige) a finite number
k > 1 of component words to place each letter in each position of the match. Such
devices would then model systems where stronger consensus between independent
computations is possible (or is requested), in order for a word to be accepted.
We believe our definitions, though possibly the simplest, already capture a rather
rich range of language paradigms. But of course actual experimentation would be
needed.
Moreover, we hope that the consensual approach could be fruitfully investigated
for other families of base languages, both weaker and stronger than the regular
ones.
Finally, we think the idea of consensual computation could provide some formal
support to the linguistic requirement of assigning different degrees of grammaticality to sentences that satisfy some, but not all, semantic constraints.
Acknowledgements. The first author is indebted to Valentino Braitenberg for inspiring
discussions on brain theory and formal language models.
References
[1] A.K. Chandra, D. Kozen and L.J. Stockmeyer, Alternation. J. ACM 28 (1981) 114–133.
[2] S. Crespi Reghizzi and P. San Pietro, Consensual definition of languages by regular sets, in
LATA. Lecture Notes in Computer Science 5196 (2008) 196–208.
[3] S. Crespi Reghizzi and P. San Pietro, Languages defined by consensual computations, in
ICTCS09 (2009).
[4] M. Jantzen, On the hierarchy of Petri net languages. ITA 13 (1979).
[5] A. Joshi and Y. Schabes, Tree-adjoining grammars, in Handbook of Formal Languages, Vol. 3,
G. Rozenberg and A. Salomaa, Eds. Springer, Berlin, New York (1997), 69–124.
[6] M. Minsky, Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs
(1976).
[7] A. Salomaa, Theory of Automata. Pergamon Press, Oxford (1969).
[8] K. Vijay-Shanker and D.J. Weir, The equivalence of four extensions of context-free grammars.
Math. Syst. Theor. 27 (1994) 511–546.
Communicated by A. Cherubini.
Received December 24, 2009. Accepted November 18, 2010.