Consensual languages and matching finite-state computations

Stefano Crespi Reghizzi

RAIRO-Theor. Inf. Appl. 45 (2011) 77–97. DOI: 10.1051/ita/2011012. Available online at: www.rairo-ita.org

CONSENSUAL LANGUAGES AND MATCHING FINITE-STATE COMPUTATIONS ∗, ∗∗

Stefano Crespi Reghizzi 1 and Pierluigi San Pietro 1

Abstract. An ever present, common sense idea in language modelling research is that, for a word to be a valid phrase, it should comply with multiple constraints at once. A new language definition model is studied, based on agreement or consensus between similar strings. Considering a regular set of strings over a bipartite alphabet made by pairs of unmarked/marked symbols, a match relation is introduced, in order to specify when such strings agree. Then a regular set over the bipartite alphabet can be interpreted as specifying another language over the unmarked alphabet, called the consensual language. A word is in the consensual language if a set of corresponding matching strings is in the original language. The family thus defined includes the regular languages and also interesting non-semilinear ones. The word problem can be solved in NLOGSPACE, hence in P time. The emptiness problem is undecidable. Closure properties are proved for intersection with regular sets and inverse alphabetical homomorphism. Several conditions for a consensual definition to yield a regular language are presented, and it is shown that the size of a consensual specification of regular languages can be in a logarithmic ratio with respect to a DFA. The family is incomparable with context-free and tree-adjoining grammar families.

Mathematics Subject Classification. 68Q45, 68Q42, 68Q19.

Keywords and phrases. Formal languages, finite automata, consensual languages, counter machines, polynomial time parsing, non-semilinear languages, Parikh mapping, descriptive complexity of regular languages, degree of grammaticality.
∗ With partial support from PRIN 2005015419, FIRB “Applicazioni della Teoria degli Automi all’Analisi, Compilazione e Verifica di Software Critico e in Tempo Reale”, and CNR-IEIIT.
∗∗ Preliminary, partial versions were presented at the LATA 2008 [2] and ICTCS 2009 [3] conferences.
1 Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milano, Italy; {crespi;sanpietro}@elet.polimi.it
Article published by EDP Sciences. © EDP Sciences 2011

Introduction

An ever present, common sense idea in language modelling research is that, for a word to be a valid phrase, it should comply with multiple constraints at once. Theories of grammar have taken various approaches for expressing the constraints by different mechanisms, such as by superimposing semantic constraints on syntactic ones, or by using intersections of, say, context-free languages. Of course, motivation for language definitions based on agreement or reinforcement between separate processes comes from the overwhelming complexity of monolithic definitions and, in the case of natural language, is supported by the findings of neurolinguistic research.

Here we propose a very simple novel mechanism, where the constraints are expressed by an elementary letter-by-letter agreement between strings belonging to a regular language. The alphabet is bipartite, made of pairs of unmarked/marked characters. The agreement is formalized by a $k$-ary relation, called match, that is satisfied by a set of $k$ equally long strings if, in each position, exactly one string has an unmarked letter and the other strings have the same letter but marked. In our metaphor we view such strings as providing mutual consensus on the validity of the corresponding unmarked string. This justifies the name “consensual” proposed for the new family, which strictly includes the regular one.

Here some readers may prefer to jump to the definition (Defs. 1.1, 1.3 and 1.4) of consensual language, before reading the next discussion of the position of the new model from the perspective of language theory.

With respect to their storage, abstract language recognition devices can be classified as using tapes (Turing machines, push-down machines, nested push-down machines) or counters. The latter case includes various models of counter machines and also Petri nets. Consensual languages are recognized by real-time nondeterministic multi-counter machines with a linear bound on the counter values. Considering the complexity of the word recognition problem, consensual languages belong to the polynomial time class.

With respect to generative capacity, the new family shares little ground with the families of context-free and mildly context-sensitive [8] languages. For instance, the Dyck language over two letters can be defined but not the language of palindromes. On the other hand, interesting non-semilinear languages (in the Parikh sense [7]) can be easily defined.

Next we compare and contrast the computation performed by a consensual recognizer versus an alternating finite automaton [1]. Although both machines perform simultaneous computations for recognizing a given string, they apply entirely different acceptance criteria. All possible computations must be successful for a word to be recognized by an alternating machine when using universal nondeterminism, and their number may be exponential with respect to the word length. On a consensual device, the computations performed on the finite automaton, which can be assumed to be deterministic, are not labelled by the input word (except in the trivial case when the language is regular) but by matching strings over the marked/unmarked alphabet. The number of computations is bounded by the input length.
Recalling that certain Petri net language families [4] include non-semilinear languages and that their recognizers use counters, a vague resemblance between the two models may be mentioned. In fact C.A. Petri introduced his nets as a formal model of synchronization between computations performed by finite automata, and our model too specifies a matching rule between the labels of separate computations.

Although the proposed approach has little to do with any classical formal language model we know, we hope its simplicity, expressivity and motivation may attract some attention.

The paper is organized as follows. Section 1 lists the basic definitions, and provides an example giving evidence of the strict inclusion of regular languages. Section 2 shows that the Parikh image may be non-semilinear, and proves several closure properties. Section 3 defines a transition relation between multisets of states, corresponding to a multi-counter machine. Then it shows that the word recognition problem is in NLOGSPACE. Section 4 focuses on consensually defined regular languages, and gives sufficient conditions for a consensual language to be regular. It shows that consensual definitions can be exponentially more concise than definitions by deterministic finite automata. Section 5 proves the emptiness problem to be undecidable. Section 6 shows that the languages of palindromes and replicas exceed the power of consensual languages. The conclusion mentions directions for continuation.

1. First definitions

Let $\Sigma$ be the terminal alphabet of the languages to be considered. The empty word is denoted by $\epsilon$. Given a word $x$, its length is denoted by $|x|$ and the $i$-th letter is $x(i)$, $1 \le i \le |x|$.
A deterministic finite automaton (DFA for short) is specified as $A = (\Delta, Q, \delta, q_0, F)$ where: $\Delta$ is a finite alphabet; $Q$ is a finite set of states; $\delta : Q \times \Delta \to Q$ is the state-transition function, always assumed to be total; $q_0$ is the initial state, and $F \subseteq Q$ is the set of final states. The transition function $\delta$ can be extended as usual to $Q \times \Delta^* \to Q$, which is also total, i.e., $\delta(q_0, y)$ is defined for every word $y$ over $\Delta$. A nondeterministic finite automaton (NFA) is specified as $N = (\Delta, Q, \Rightarrow_N, q_0, F)$ where the only difference from a DFA is that the transition relation is $\Rightarrow_N \subseteq Q \times \Delta \times Q$. Acceptance is defined as usual for a DFA and for an NFA.

Let $\mathring{\Sigma}$ be the disjoint alphabet obtained by marking each symbol $a \in \Sigma$ as $\mathring{a}$, referred to as the marked copy of $a$. The set $\Sigma \cup \mathring{\Sigma}$ is denoted as $\tilde{\Sigma}$ and qualified as internal alphabet, because its use is restricted to the technical device of consensual definitions.

The notion of agreement between strings over the internal alphabet is formalized by means of a function called match.

Definition 1.1. Match
The partial, symmetrical, and associative binary operator, called match,
$$@ : \tilde{\Sigma} \times \tilde{\Sigma} \to \tilde{\Sigma}$$
is defined as follows, for all $a \in \Sigma$:
$$a @ \mathring{a} = \mathring{a} @ a = a; \qquad \mathring{a} @ \mathring{a} = \mathring{a}; \qquad \text{undefined, in every other case.}$$
The operator can be naturally extended to strings of equal length, by assuming $\epsilon @ \epsilon = \epsilon$. For all $w, w' \in \tilde{\Sigma}^*$, with $|w| = |w'|$, and for all $a, b \in \tilde{\Sigma}$:
$$aw \,@\, bw' = (a @ b)(w @ w')$$
where concatenation is assumed to take precedence over match.

Hence, the match is undefined on strings $w, w'$ of unequal lengths, or else if there exists a position $i$ such that $w(i) @ w'(i)$ is undefined. The latter condition occurs in three cases: when both characters are in $\Sigma$, when both are in $\mathring{\Sigma}$ and differ, and when either one is marked but is not the marked copy of the other.

Given $m > 0$ strings $w_1, \dots, w_m \in \tilde{\Sigma}^*$, consider $w_1 @ w_2 @ \dots @ w_m$ (which can be written without parentheses and in any order because the match operation is associative and commutative). If $w = w_1 @ w_2 @ \dots @ w_m$ is defined then $w$ is called the match of $w_1, w_2, \dots, w_m$. The match is strong if $w \in \Sigma^*$, weak otherwise. The cardinality $m$ is called the degree of the match.

Match $w$ and every argument $w_j$ have the same length $n = |w_j| = |w|$. Also, by Definition 1.1, if $w$ is a strong match then, for each position $1 \le i \le n$, exactly one string, say $w_k$, is unmarked, i.e., $w_k(i) \in \Sigma$ and $w_j(i) \in \mathring{\Sigma}$ for all $j \ne k$. We say that word $w_k$ places the letter into position $i$ and the other strings consent to it.

Next we extend the match operator to two languages $L', L'' \subseteq \tilde{\Sigma}^*$ over the internal alphabet:
$$L' @ L'' = \{ w' @ w'' \mid w' \in L',\ w'' \in L'' \}.$$
Clearly the operation may be applied to any number of languages. If the arguments are regular languages, the match operator produces a regular language.

Proposition 1.2. If $L', L'' \subseteq \tilde{\Sigma}^*$ are regular languages then $L' @ L''$ is also regular.

Proof. Let $A' = (\tilde{\Sigma}, Q', \delta', q'_1, F')$ and $A'' = (\tilde{\Sigma}, Q'', \delta'', q''_1, F'')$ be the DFAs recognizing $L', L''$, respectively. Let $A' @ A''$ be the (possibly nondeterministic) finite automaton $(\tilde{\Sigma}, Q' \times Q'', \delta, (q'_1, q''_1), F' \times F'')$, with $\delta : (Q' \times Q'') \times \tilde{\Sigma} \to 2^{Q' \times Q''}$, such that for every $q', p' \in Q'$, $q'', p'' \in Q''$, and for every $a \in \Sigma$:
$$(p', p'') \in \delta((q', q''), a) \quad \text{if } p' = \delta'(q', a),\ p'' = \delta''(q'', \mathring{a});$$
$$(p', p'') \in \delta((q', q''), a) \quad \text{if } p' = \delta'(q', \mathring{a}),\ p'' = \delta''(q'', a);$$
$$(p', p'') \in \delta((q', q''), \mathring{a}) \quad \text{if } p' = \delta'(q', \mathring{a}),\ p'' = \delta''(q'', \mathring{a}).$$
The construction is similar to the usual Cartesian product of two DFAs. But instead of the intersection, the product machine $A' @ A''$ recognizes $L' @ L''$, because the construction has been modified to match $a$ with $\mathring{a}$ and $\mathring{a}$ with $\mathring{a}$, but not to match $a$ with $a$.
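As an illustration, the match operator of Definition 1.1 and its extension to strings can be sketched in a few lines of Python. The encoding is an assumption of this sketch and not part of the paper: letters are ASCII, a lowercase letter stands for an unmarked symbol, the corresponding uppercase letter stands for its marked copy, and `None` encodes "undefined". The function names are hypothetical.

```python
from functools import reduce
from typing import Iterable, Optional


def match_letters(x: str, y: str) -> Optional[str]:
    """Match of two internal letters; None means the match is undefined."""
    if x.lower() != y.lower():
        return None  # different underlying letters never match
    if x.islower() and y.islower():
        return None  # two unmarked copies: nobody consents
    # At least one letter is marked: the result is unmarked
    # exactly when one of the two arguments is unmarked.
    return x if x.islower() else y


def match(w1: Optional[str], w2: Optional[str]) -> Optional[str]:
    """Letter-by-letter match of two equally long strings."""
    if w1 is None or w2 is None or len(w1) != len(w2):
        return None
    out = []
    for x, y in zip(w1, w2):
        z = match_letters(x, y)
        if z is None:
            return None
        out.append(z)
    return "".join(out)


def match_all(words: Iterable[str]) -> Optional[str]:
    """Match of one or more strings (the operator is associative)."""
    return reduce(match, words)


def is_strong(w: Optional[str]) -> bool:
    """A defined match is strong when no marked letter survives."""
    return w is not None and all(c.islower() for c in w)
```

For instance, under this encoding `match_all(["aAAbBB", "AaABbB", "AAaBBb"])` yields the strong match `"aaabbb"`, whereas `match("aAA", "AaA")` yields the weak match `"aaA"` and `match("aA", "aA")` is undefined.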
The repeated application of the match operation to a language is formalized next. Let $L^{1@} = L$, $L^{i@} = L @ L^{(i-1)@}$, $i \ge 2$. Notice that in general $L^{(i-1)@} \not\subseteq L^{i@}$.

Definition 1.3. Match closure
The closure under match, or @-closure, of a language $L \subseteq \tilde{\Sigma}^*$ is:
$$L^{@} = \bigcup_{i \ge 1} L^{i@}.$$

Focusing on languages over the terminal alphabet $\Sigma$, the main definition comes next.

Definition 1.4. Consensual language
Let $B$ be in a language family $\mathcal{F}$. The consensual language with base $B \subseteq \tilde{\Sigma}^+$ is the set
$$C(B) = B^{@} \cap \Sigma^*.$$
Language $C(B)$ is also called a consensual language based on family $\mathcal{F}$, and the corresponding family is written $C_{\mathcal{F}}$.

Therefore, a consensual language with base $B$ includes all and only the strong matches of the match closure. In this paper we study the family of consensual languages based on the family of regular languages, $C_{REG}$.

Example 1.5. Consider the regular language $R$ defined by the regular expression $\mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*$. Then $R^{@}$ is the set of strings of the form:
$$\mathring{a}^* a_1 \mathring{a}^* a_2 \mathring{a}^* \dots a_m \mathring{a}^* \, \mathring{b}^* b_1 \mathring{b}^* b_2 \dots b_m \mathring{b}^*$$
where $m \ge 1$, and each $a_i$ is $a$ and each $b_i$ is $b$. The consensual language with base $R$ is $C(R) = \{a^n b^n \mid n > 0\}$. Figure 1 shows that sentence $aaabbb$ can be obtained in different ways, matching together the strings of $R$ in column $v_i$, or matching those in column $w_i$. Notice that every sentence $w$ is obtained by means of a match of degree $|w|/2$.

[Figure 1: table with rows $i = 1, 2, 3$ and Match, and columns $v_i$ and $w_i$; each column lists three strings of $R$ whose match is $aaabbb$.]
Figure 1. In Example 1.5 word $aaabbb$ results from the strong matches in column $v_i$ and $w_i$.

The example has shown that, although the base $B$ is regular, languages $B^{@}$ and $C(B)$, obtained by a match closure, may be non-regular. However, from Proposition 1.2, for any finite $i$, $B^{i@}$ is regular if $B$ is regular. This corresponds, in Definition 1.3, to the case where at most $i$ strings $w_1, \dots, w_i$ are matched.

2. First properties

We introduce further useful terminology and make intuitive comments about previous definitions and concepts.

First, we notice that we may remove from or add to a base language $B$ a subset of $\mathring{\Sigma}^+$ without affecting the corresponding consensual language $C(B)$. In fact, if $w @ w'$ is defined and $w' \in \mathring{\Sigma}^+$, then $w @ w' = w$: strings purely made of marked characters are both useless and “harmless”.

Proposition 2.1. For every base language $B \subseteq \tilde{\Sigma}^*$, and for every language $U \subseteq \mathring{\Sigma}^+$, $C(B) = C(B \cup U) = C(B \setminus U)$.

As a consequence of the fact that the match of identical strings is undefined (if they contain at least one unmarked character) or useless (if the strings are completely marked), any phrase $w$ of a consensual language can be obtained as the result of a strong match having degree not exceeding its length $|w|$.

Proposition 2.2.
$$C(B) = \{ w \in \Sigma^* \mid \exists k,\ 1 \le k \le |w|,\ w \in B^{k@} \}. \qquad (1)$$

A straightforward language family inclusion result comes next. Consider a deterministic finite automaton of the base language $B$. A word $w$ is in the consensual language $C(B)$ if, and only if, the automaton performs $1 \le k \le |w|$ successful computations, accepting a set of strings over $\tilde{\Sigma}$ that strongly match to $w$. We also say that such computations strongly (or weakly) match. The case $k = 1$ clearly corresponds to the usual recognition condition of a DFA. As the consensual language of Example 1.5 is not regular, we have:

Proposition 2.3. The family $C_{REG}$ of consensual languages on a regular language base strictly includes the family of regular languages.

The next example shows languages having a non-semilinear commutative image (in Parikh’s sense [7]).

Example 2.4. Series of unary integers

(1) Series of identical unary integers. Choose as base the language:
$$R_1 = (\mathring{a}^* a \mathring{a}^* \mathring{b})^+ \cup (\mathring{a}^+ b)^+.$$
Then the consensual language is:
$$L_1 = C(R_1) = \{ a^n b\, a^n b\, a^n b \dots a^n b \mid n > 0 \}.$$

(2) Enumeration of unary integers. The language
$$L_2 = \{ b\, a\, b\, a^2 b \dots b\, a^n b \mid n \ge 0 \}$$
is consensually defined by the regular base
$$R_2 = (\mathring{b} \mathring{a}^+)^* \, b \, (\mathring{a}^* a \mathring{a}^* \mathring{b})^*.$$
For example, $babaab$ is the match of three words in $R_2$, such as $b a \mathring{b} a \mathring{a} \mathring{b}$, $\mathring{b} \mathring{a} b \mathring{a} a \mathring{b}$ and $\mathring{b} \mathring{a} \mathring{b} \mathring{a} \mathring{a} b$.

(3) Series of exponential unary numbers. For $\Sigma = \{a, b, c\}$ let
$$R_3 = \mathring{\Sigma}^* a (\mathring{a} \cup \mathring{c})^* \mathring{b} (\mathring{a} \cup \mathring{c})^* c \mathring{a} c \mathring{\Sigma}^* \ \cup\ \mathring{a} c b ((\mathring{a}\mathring{c})^+ b)^* (a \mathring{c})^+ b \ \cup\ acb.$$
The consensual language $C(R_3)$ is
$$L_3 = \{ acb\,(ac)^2 b\,(ac)^4 b\,(ac)^8 b \dots (ac)^{2^m} b \mid m \ge 0 \}.$$

Proof. Let us express the base language as the union of three clauses, numbered 1, 2, and 3:
$$R_3 = \underbrace{\mathring{\Sigma}^* a (\mathring{a} \cup \mathring{c})^* \mathring{b} (\mathring{a} \cup \mathring{c})^* c \mathring{a} c \mathring{\Sigma}^*}_{1} \ \cup\ \underbrace{\mathring{a} c b ((\mathring{a}\mathring{c})^+ b)^* (a \mathring{c})^+ b}_{2} \ \cup\ \underbrace{acb}_{3}.$$
First we show that $L_3 \subseteq C(R_3)$. Any word in $C(R_3)$, apart from $acb$, must be obtained by matching a word in clause 2 with strings in clause 1, since neither regular expression can generate alone a word in $\Sigma^+$. We now show, by induction on the number $m \ge 0$, that if $w_m$ is a string of the form
$$w_m = acb\,(ac)^2 b\,(ac)^4 b\,(ac)^8 b \dots (ac)^{2^m} b$$
then $w_m \in C(R_3)$ (and clearly $L_3$ is the set of all $w_m$ above). The base step is $m = 0$, corresponding to the word $acb$, which is both in $C(R_3)$ and in $L_3$. Assume now that the induction hypothesis holds for $m - 1$. Hence, string
$$w_{m-1} = acb\,(ac)^2 b\,(ac)^4 b\,(ac)^8 b \dots (ac)^{2^{m-1}} b \in C(R_3).$$
But $w_{m-1}$ must be obtained as a match of $h > 0$ strings $x_1, x_2, \dots, x_h$ of clause 1, with one string
$$y_{m-1} = \mathring{a} c b\,(\mathring{a}\mathring{c})^2 b \dots (\mathring{a}\mathring{c})^{2^{m-2}} b\,(a\mathring{c})^{2^{m-1}} b$$
of clause 2. But also
$$y_m = \mathring{a} c b\,(\mathring{a}\mathring{c})^2 b \dots b\,(\mathring{a}\mathring{c})^{2^{m-1}} b\,(a\mathring{c})^{2^m} b$$
is in clause 2, and if $x_i$ is in clause 1 then also $x'_i = x_i (\mathring{a}\mathring{c})^{2^m} \mathring{b}$ is in clause 1, since clause 1 ends with $\mathring{\Sigma}^*$. Therefore, by matching $y_m$ with $x'_1, \dots, x'_h$, one obtains:
$$w'_m = acb\,(ac)^2 b\,(ac)^4 b\,(ac)^8 b \dots b\,(ac)^{2^{m-2}} b\,(\mathring{a} c)^{2^{m-1}} b\,(a \mathring{c})^{2^m} b \in R_3^{@}.$$
Also, all strings
$$z_i = \mathring{a}\mathring{c}\mathring{b}\,(\mathring{a}\mathring{c})^2 \mathring{b} \dots \mathring{b}\,(\mathring{a}\mathring{c})^{2^{m-2}} \mathring{b}\,(\mathring{a}\mathring{c})^{i}\, a\mathring{c}\, (\mathring{a}\mathring{c})^{2^{m-1}-i-1} \mathring{b}\,(\mathring{a}\mathring{c})^{2i}\, \mathring{a} c \mathring{a} c\, (\mathring{a}\mathring{c})^{2^m - 2i - 2} \mathring{b}$$
are in clause 1, for every $i$, $0 \le i < 2^{m-1}$. Hence, for every $a$ placed in group $m - 1$ (the group with $2^{m-1}$ occurrences of $ac$), there must be two occurrences of $c$ in group $m$.
Hence, the number of $c$’s (and therefore also of $a$’s) in group $m$ must be twice the number of $c$’s in group $m - 1$:
$$w_m = acb\,(ac)^2 b\,(ac)^4 b\,(ac)^8 b \dots (ac)^{2^{m-1}} b\,(ac)^{2^m} b \in C(R_3).$$
For the converse case, notice that, since any $w$ in $C(R_3)$ (except for $acb$) must match a word in clause 2, $C(R_3) \subseteq acb\,((ac)^+ b)^*$. An induction on the number $m \ge 0$ of groups $(ac)^+ b$ in strings of $C(R_3)$ completes the proof. The base case $m = 0$ corresponds to the word $acb$. By induction hypothesis, for all $0 \le j < m$, the strings of $C(R_3)$ of the form $acb\,((ac)^+ b)^j$ are in $L_3$. Assume that the word $x_m$ with $m$ groups is not in $L_3$. Hence, there exists $i > 0$ such that the group in position $i$ has a number of $c$’s which is not the double of the number of $a$’s of the group in position $i - 1$. But $i = m$, otherwise one could also define a word which is not in $L_3$ while having fewer than $m$ groups, contradicting the induction hypothesis. However, the only way to place a $c$ in group $m$ is by using strings in clause 1, which place two occurrences of $c$ in group $m$ for an occurrence of $a$ in group $m - 1$. Since no other strings can place $a$ in group $m - 1$, the number of $c$’s in group $m$ must be exactly the double of the number of $a$’s in group $m - 1$ ($2^{m-1}$ by induction hypothesis), that is, there are $2^m$ occurrences of $c$ in group $m$.

We state the basic closure properties of consensual languages in the next proposition. For two languages $L', L'' \subseteq \Sigma^*$ and a letter $s \notin \Sigma$, the marked concatenation [7] is the language $L' s L''$.

Proposition 2.5. The family $C_{REG}$ is closed under:
(1) intersection with regular languages;
(2) inverse alphabetic homomorphism;
(3) reversal (or mirror reflection) operation;
(4) marked concatenation of consensual languages;
(5) union of consensual languages over disjoint alphabets.

Proof. We separately argue for each statement.

(1) Let $R \subseteq \tilde{\Sigma}^*$, $S \subseteq \Sigma^*$ be two regular languages, and let $h : \tilde{\Sigma} \to \Sigma$ be the alphabetic homomorphism defined by $h(a) = h(\mathring{a}) = a$ for every $a \in \Sigma$.
We claim that
$$C(R) \cap S = C\left(R \cap h^{-1}(S)\right)$$
thus proving the statement. Let $x \in C(R) \cap S$. Therefore, $\exists k$, $1 \le k \le |x|$, $\exists x_1, \dots, x_k \in R$ such that $x_1 @ x_2 \dots @ x_k = x$ and for every $i$, $1 \le i \le k$, $h(x_i) = x$. Hence, every $x_i \in h^{-1}(x) \subseteq h^{-1}(S)$ since $x \in S$. Hence, for every $i$, $1 \le i \le k$, $x_i \in R \wedge x_i \in h^{-1}(S)$, and it follows that $x \in C\left(R \cap h^{-1}(S)\right)$.
Assume now $x \in C\left(R \cap h^{-1}(S)\right)$. Hence, $\exists k$, $1 \le k \le |x|$, $\exists x_1, \dots, x_k$ such that $x_1 @ x_2 \dots @ x_k = x$ and for every $i$, $1 \le i \le k$, $x_i \in R \cap h^{-1}(S)$, with $h(x_i) = x$. Then $x \in C(R)$ (since each $x_i \in R$). Also, $x \in h^{-1}(S)$ (since each $x_i \in h^{-1}(S)$). Therefore, $x \in S$, since $S = h^{-1}(S) \cap \Sigma^*$ and $x \in \Sigma^*$.

(2) Let $R \subseteq \tilde{\Sigma}^*$ be a regular language, and let $\Delta$ be another finite alphabet. Let $h : \Delta \to \Sigma$ be a homomorphism. We need to prove that $h^{-1}(C(R))$ is a consensual language with regular base. Extend first $h$ to the internal alphabet as follows: $\tilde{h} : \Delta \cup \mathring{\Delta} \to \Sigma \cup \mathring{\Sigma}$ is defined as $\tilde{h}(A) = h(A)$, $\tilde{h}(\mathring{A}) = \mathring{h(A)}$ for every $A \in \Delta$. We notice that $\tilde{h}^{-1}(a @ \mathring{a}) = \tilde{h}^{-1}(a) = \tilde{h}^{-1}(a) @ \tilde{h}^{-1}(\mathring{a})$, and that $\tilde{h}^{-1}(\mathring{a} @ \mathring{a}) = \tilde{h}^{-1}(\mathring{a}) = \tilde{h}^{-1}(\mathring{a}) @ \tilde{h}^{-1}(\mathring{a})$, and similarly for the case $\mathring{a} @ a$. On the other hand both $\tilde{h}^{-1}(a) @ \tilde{h}^{-1}(a)$ and $\tilde{h}^{-1}(a @ a)$ are undefined. Hence,
$$\tilde{h}^{-1}(X @ Y) = \tilde{h}^{-1}(X) @ \tilde{h}^{-1}(Y) \quad \text{for every } X, Y \in \tilde{\Sigma}.$$
Therefore, if $u, u' \in (\Sigma \cup \mathring{\Sigma})^*$ then $\tilde{h}^{-1}(u @ u') = \tilde{h}^{-1}(u) @ \tilde{h}^{-1}(u')$.
We now claim that $\tilde{h}^{-1}(R^{@}) = \left(\tilde{h}^{-1}(R)\right)^{@}$. From here the thesis follows, since $\tilde{h}^{-1}(R)$ is regular and $h^{-1}(C(R)) = \tilde{h}^{-1}(R^{@}) \cap \Delta^* = \left(\tilde{h}^{-1}(R)\right)^{@} \cap \Delta^*$.
Let $x \in \tilde{h}^{-1}(R^{@})$. Hence, there is $w \in R^{@}$ such that $x \in \tilde{h}^{-1}(w)$. By Definition 1.3, there exist $k > 0$ strings $w_1, \dots, w_k \in R$, with $1 \le k \le |x|$, such that $w_1 @ \dots @ w_k = w$. Hence, $x \in \tilde{h}^{-1}(w) = \tilde{h}^{-1}(w_1) @ \dots @ \tilde{h}^{-1}(w_k) \subseteq \left(\tilde{h}^{-1}(R)\right)^{@}$.
Let $x \in \left(\tilde{h}^{-1}(R)\right)^{@}$. Hence, there exist $k > 0$ strings $x_1, \dots, x_k \in \tilde{h}^{-1}(R)$ such that $x_1 @ \dots @ x_k = x$. It follows that there exist $k > 0$ strings $w_1, \dots, w_k \in R$ such that $x_1 \in \tilde{h}^{-1}(w_1), \dots, x_k \in \tilde{h}^{-1}(w_k)$, and therefore $x = x_1 @ \dots @ x_k \in \tilde{h}^{-1}(w_1) @ \dots @ \tilde{h}^{-1}(w_k) = \tilde{h}^{-1}(w_1 @ \dots @ w_k) \subseteq \tilde{h}^{-1}(R^{@})$.

(3)–(5) The obvious proofs are based on simple transformations of the DFA recognizing the base language.

3. Consensual Languages are in NLOGSPACE

In this section, we formalize the simultaneous computations in the consensual definition by means of multisets of states of a DFA accepting the base language; the multiplicity of a state in the multiset encodes the number of computations of the DFA that have reached that state. The consensual transition relation can be computed by a nondeterministic Turing machine. A multiset can be represented by multiplicity counters, and only one counter for each state of the DFA is needed, whose value is linearly limited by the length of the input string. Using a binary encoding of each counter, word membership can be computed by a nondeterministic counter machine operating in logarithmic space.

3.1. Consensual transition relation

3.1.1. Preliminaries on multisets

Given a finite set $Q$, which in this paper is the set of states of a DFA, a multiset over $Q$ is a total mapping $Z : Q \to \mathbb{N}$. The cardinality of multiset $Z$ is $|Z| = \sum_{q \in Q} Z(q)$. For $q \in Q$, if $Z(q) > 0$ then we say that $q \in Z$ with multiplicity $Z(q)$.

To illustrate, consider the multiset $Z$ over $Q = \{p, q, r\}$ characterized by $Z(p) = 3$, $Z(q) = 0$, $Z(r) = 5$. We also use the alternative notations $\{p^3, r^5\}$ or $\{p, p, p, r, r, r, r, r\}$.

Given two multisets $Z, Z'$ over $Q$, the sum $Z \uplus Z'$ and the difference $Z - Z'$ are the multisets specified by the following characteristic functions, for every $q \in Q$:
$$(Z \uplus Z')(q) = Z(q) + Z'(q), \qquad (Z - Z')(q) = \max\left(0, Z(q) - Z'(q)\right).$$
If $f : Q \to \mathbb{N}^Q$ is a total mapping, associating each element $q \in Q$ with a multiset $f(q)$, and $Z : Q \to \mathbb{N}$ is a multiset $\{q_1, \dots, q_m\}$, where $m = |Z|$ and the $q_i$’s are not necessarily distinct, then let the generalized sum $\biguplus_{q \in Z} f(q)$ be $f(q_1) \uplus \dots \uplus f(q_m)$.
Finally, define, for every multiset $Z$ over $Q$, the underlying set
$$\widehat{Z} = \{ q \in Q \mid Z(q) > 0 \}.$$
Clearly,
$$\widehat{Z \uplus Z'} = \widehat{Z} \cup \widehat{Z'}, \qquad \widehat{\biguplus_{q \in Z} f(q)} = \bigcup_{q \in \widehat{Z}} \widehat{f(q)}.$$

3.1.2. A consensual transition relation

Let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA and assume the transition function $\delta$ to be total. By the above notation, the function is naturally extended to a multiset $Z$ over $Q$, positing, for every letter $c \in \tilde{\Sigma}$, $\delta(Z, c) = \biguplus_{q \in Z} \{\delta(q, c)\}$.

From this we define a transition relation on multisets of states.

Definition 3.1. The consensual transition relation of $A$, namely $\vdash_A \subseteq \mathbb{N}^Q \times \Sigma \times \mathbb{N}^Q$, is defined, for every $a \in \Sigma$ and for all multisets $Z, Z'$ over $Q$, as:
$$Z \vdash_A^{a} Z' \quad \text{if} \quad \exists q \in Z : Z' = \{\delta(q, a)\} \uplus \delta(Z - \{q\}, \mathring{a}).$$
Relation $\vdash_A^{a}$ can be extended as usual from a letter $a$ to a word $w \in \Sigma^*$ via the inductive definition:
$$Z \vdash_A^{\epsilon} Z; \qquad Z \vdash_A^{wa} Z'' \ \text{ if } \exists Z' \text{ such that } Z \vdash_A^{w} Z' \vdash_A^{a} Z''.$$
It is evident that if $Z \vdash_A^{a} Z'$ then $|Z| = |Z'|$, i.e., the cardinality does not change. Two types of multisets have a special role: the initial multisets $\{(q_0)^k\}$, for every $k > 0$, and the final multisets $Z$ such that $\widehat{Z} \subseteq F$. A crisp definition of consensual languages is obtained by means of the transition relation.

Proposition 3.2. Let $R \subseteq \tilde{\Sigma}^*$ and let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA accepting $R$. Then
$$C(R) = \{ w \mid \exists k > 0 \text{ and a final multiset } Z \text{ such that } \{(q_0)^k\} \vdash_A^{w} Z \}.$$
The proposition follows immediately by combining the following Lemmata 3.4, 3.5. First, an example may help to clarify the construction.

[Figure 2, top: state diagram of the base automaton $A$, with states $q_1$ (initial), $q_2$, $q_3$ and $q_4$ (final).]
Consensual transition relation accepting $aaabbb$:
$$\{q_1, q_1, q_1\} \vdash_A^{a} \{q_1, q_1, q_2\} \vdash_A^{a} \{q_1, q_2, q_2\} \vdash_A^{a} \{q_2, q_2, q_2\} \vdash_A^{b} \{q_3, q_3, q_4\} \vdash_A^{b} \{q_3, q_4, q_4\} \vdash_A^{b} \{q_4, q_4, q_4\}$$
Figure 2. Recognizer of base language $R = \mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*$.

Example 3.3. The finite automaton accepting base language $R = \mathring{a}^* a \mathring{a}^* \mathring{b}^* b \mathring{b}^*$ (see Ex. 1.5) is shown in the top part of Figure 2, while the consensual transition relation accepting $aaabbb$ is shown in the bottom part.

Lemma 3.4.
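If $\exists k > 0$, $\exists Z : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \vdash_A^{w} Z$, then there exist $k$ words $w_1, \dots, w_k \in \tilde{\Sigma}^*$ such that $w_1 @ w_2 @ \dots @ w_k = w$ and $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$.

Proof. The proof is by induction on $|w|$. If $|w| = 0$, then let $w_1 = \dots = w_k = w = \epsilon$ and let $Z = \{(q_0)^k\}$. If $|w| > 0$ let $w = w'a$ for $w' \in \Sigma^*$, $a \in \Sigma$. Hence, if $\{(q_0)^k\} \vdash_A^{w} Z$ then there exists $Z' : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \vdash_A^{w'} Z' \vdash_A^{a} Z$. By induction hypothesis, there exist $k$ words $w'_1, \dots, w'_k \in \tilde{\Sigma}^*$ with $w'_1 @ \dots @ w'_k = w'$ and $Z' = \biguplus_{1 \le j \le k} \{\delta(q_0, w'_j)\}$.
Since $Z' \vdash_A^{a} Z$, $\exists q \in Z'$ such that $Z = \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a})$. Since $q \in Z'$, $\exists h$, $1 \le h \le k$, such that $\delta(q_0, w'_h) = q$. Hence, $\delta(q, a) = \delta(q_0, w'_h a)$. Let $w_h = w'_h a$ and, for every $j \ne h$, $1 \le j \le k$, let $w_j = w'_j \mathring{a}$. Hence, $w_1 @ \dots @ w_k = w$. Also:
$$Z = \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a})$$
$$= \{\delta(q_0, w'_h a)\} \uplus \delta\Big(\biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j)\},\ \mathring{a}\Big)$$
$$= \{\delta(\delta(q_0, w'_h), a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(\delta(q_0, w'_j), \mathring{a})\}$$
$$= \{\delta(q_0, w'_h a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j \mathring{a})\} = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}.$$

Lemma 3.5. For every $w \in \Sigma^*$, for every $k > 0$, let $w_1, \dots, w_k \in \tilde{\Sigma}^*$ be such that $w_1 @ w_2 @ \dots @ w_k = w$, and let $Z : Q \to \mathbb{N}$ be the multiset $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$. Then, $\{(q_0)^k\} \vdash_A^{w} Z$.

Proof. The proof is by induction on $|w|$. If $|w| = 0$, then let $w_1 = \dots = w_k = w = \epsilon$ and let $Z = \{(q_0)^k\}$: by definition, $\{(q_0)^k\} \vdash_A^{\epsilon} Z$. If $|w| > 0$ let $w = w'a$ for $w' \in \Sigma^*$, $a \in \Sigma$. Hence, since $w_1 @ w_2 @ \dots @ w_k = w$, there exists $h$, $1 \le h \le k$, and there exist $w'_1, \dots, w'_k \in \tilde{\Sigma}^*$ such that $w_h = w'_h a$, $w_j = w'_j \mathring{a}$ for $j \ne h$. Let $Z' = \biguplus_{1 \le j \le k} \{\delta(q_0, w'_j)\}$. Since $w'_1 @ w'_2 @ \dots @ w'_k = w'$, by induction hypothesis $\{(q_0)^k\} \vdash_A^{w'} Z'$. Let $Z = \biguplus_{1 \le j \le k} \{\delta(q_0, w_j)\}$. Hence,
$$Z = \{\delta(q_0, w'_h a)\} \uplus \biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j \mathring{a})\} = \{\delta(\delta(q_0, w'_h), a)\} \uplus \delta\Big(\biguplus_{1 \le j \le k,\, j \ne h} \{\delta(q_0, w'_j)\},\ \mathring{a}\Big).$$
Let $q = \delta(q_0, w'_h) \in Z'$. Then $Z = \{\delta(q, a)\} \uplus \delta(Z' - \{q\}, \mathring{a})$. By Definition 3.1, $Z' \vdash_A^{a} Z$, hence $\{(q_0)^k\} \vdash_A^{w} Z$.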
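The consensual transition relation of Definition 3.1 can be simulated directly, which may help in following Example 3.3. The sketch below is only an illustration under stated assumptions, not the paper's construction: marked letters are encoded as uppercase ASCII letters, the transition table `DELTA` is our reading of the base automaton of Example 1.5 (completed with a hypothetical non-final sink state `dead` to make it total), and the nondeterminism is resolved by exploring all reachable multisets rather than by guessing.

```python
from collections import Counter

# Assumed DFA for the base of Example 1.5; uppercase = marked letter.
DELTA = {
    ("q1", "A"): "q1", ("q1", "a"): "q2",
    ("q2", "A"): "q2", ("q2", "B"): "q3", ("q2", "b"): "q4",
    ("q3", "B"): "q3", ("q3", "b"): "q4",
    ("q4", "B"): "q4",
}
SINK = "dead"  # hypothetical sink state, makes delta total


def delta(q, x):
    return DELTA.get((q, x), SINK)


def step(Z, a):
    """All multisets Z2 such that Z |-a Z2, as frozensets of (state, count)."""
    successors = set()
    for q in Z:  # q is the computation that places the unmarked letter a
        rest = Z - Counter([q])  # the other computations consent
        Z2 = Counter([delta(q, a)])
        for p, n in rest.items():
            Z2[delta(p, a.upper())] += n  # they read the marked copy
        successors.add(frozenset(Z2.items()))
    return successors


def accepts(word, k, q0="q1", final=frozenset({"q4"})):
    """True if k matching computations of the DFA consent on word."""
    frontier = {frozenset({(q0, k)})}
    for a in word:
        frontier = {Z2 for Z in frontier for Z2 in step(Counter(dict(Z)), a)}
    return any(all(q in final for q, _ in Z) for Z in frontier)


def in_consensual_language(word):
    """Membership test, trying every degree k from 1 to |word|."""
    return any(accepts(word, k) for k in range(1, len(word) + 1))
```

With this table, `accepts("aaabbb", 3)` follows the same sequence of multisets as the bottom part of Figure 2. Each multiset is stored as one counter per state, which is the representation used in the space analysis of the membership problem.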
Given Proposition 3.2, the NLOGSPACE complexity of consensual languages follows almost immediately:

Theorem 3.6. The word membership problem for the family $C_{REG}$ is in the complexity class NLOGSPACE (hence in P).

Proof. Clearly, the transition relation $\vdash_A$ can be computed by a nondeterministic Turing machine that, operating on an input word $w \in \Sigma^*$ of length $n \ge 0$, guesses a $k$, first stores $\{(q_0)^k\}$, and then makes a nondeterministic move computing $\vdash_A$ when reading each input symbol $a \in \Sigma$. Notice that one can always assume that $k \le n$, since a match with at most $n$ words is enough to determine $w$. In a computation with input length $n$, the multiset to be stored at each step has cardinality exactly $k \le n$. Hence, the space for storing a multiset $Z$ over $Q$ is logarithmic in $n$: a multiset can be represented by its characteristic function, which in this case requires a counter for each state in $Q$, and hence it is enough to use $|Q| \lceil \log_2 n \rceil$ bits, with $|Q|$ a constant. Also, a move of the machine to simulate $\vdash_A^{a}$ only requires modifying the counter values. The possible operations on a counter to implement the multiset operations are: add or subtract 1, reset to zero, and store the result of adding two or more counters. These operations may require additional constant or, at most, logarithmic space. Hence, for each regular base language $R$, membership in $C(R)$ is in NLOGSPACE.

4. Regular consensual languages

We have seen that consensual languages on a regular base may or may not be regular. In this section we provide conditions ensuring regularity and we investigate the descriptive complexity of consensual specifications of regular languages.

4.1. Conditions for regularity

Bounded degree. The first result is obvious: when there is a bound $i$ such that $L^{i@} = L^{(i+1)@}$, i.e., $L^{@}$ has bounded degree, then $L^{@}$ is regular. However, in general the converse is not true: $L^{i@}$ may differ from $L^{(i+1)@}$ even when $L^{@}$ is regular, so the degree of a regular consensual language can be unbounded.
For instance, consider the regular expression $R = \mathring{a}^* a \mathring{a}^*$; then $C(R) = a^+$ is regular and $R^{@}$ is $(\mathring{a}^* a \mathring{a}^*)^+$ while $R^{i@}$ is $(\mathring{a}^* a \mathring{a}^*)^i$. Hence $R^{i@} \not\subseteq R^{(i+1)@}$.

Maximal degree. We say that a base language $R$ has maximal degree if for every word $w \in C(R)$ there exist $|w|$ distinct strings $w_1, \dots, w_{|w|} \in R - \mathring{\Sigma}^*$ such that $w = w_1 @ \dots @ w_{|w|}$. We are going to prove that a consensual language on a regular base, having maximal degree, is regular. The result is non-obvious because in this case the multisets describing the simultaneous matching computations have unbounded cardinality.

To simplify the proof, first notice that if $R$ has maximal degree, then we have $C(R) = C(R \cap \mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*)$, since, if $w = w_1 @ \dots @ w_{|w|}$, then each $w_i$ is in $\mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*$. Hence, it is enough to show that consensual languages with $R \subseteq \mathring{\Sigma}^* \Sigma \mathring{\Sigma}^*$ are regular.

Theorem 4.1. Let $L = C(R)$, with $R$ regular and of maximal degree. Then $L$ is regular.

The proof of Theorem 4.1 requires a few additional considerations and definitions. In the following, let $A = (\tilde{\Sigma}, Q, \delta, q_0, F)$ be a DFA, with $\delta$ being total, and such that $L(A) = R \cup \mathring{\Sigma}^+$. Since the words in $\mathring{\Sigma}^+$ do not give any contribution to the strong match, we have $C(L(A)) = C(R)$. Therefore, set $Q$ of states can be partitioned into two disjoint sets $S, T$, such that $Q = S \cup T$, $q_0 \in S$, and, for every $x, y \in \mathring{\Sigma}^*$ and $a \in \Sigma$, we have $\delta(q_0, x) \in S$ and $\delta(q_0, xay) \in T$. The following lemma is immediate.

Lemma 4.2. For every $w \in \Sigma^*$, $k > |w|$, and for every multiset $Z$ over $Q$, if $\{(q_0)^k\} \vdash_A^{w} Z$ then: (1) $|\widehat{Z} \cap S| = 1$ and (2) $\sum_{q \in S} Z(q) = k - |w|$; that is, exactly one of the states in $S$ occurs in $Z$ and its multiplicity is $k - |w|$.

We construct from $A$ a nondeterministic finite automaton (NFA) that recognizes $C(R)$.

Definition 4.3.
For a DFA $A$ as above, let $N$ be the NFA $(\Sigma, 2^Q, \Rightarrow_N, \{q_0\}, 2^F)$ where the states are all subsets of $Q$, $\{q_0\}$ is the initial state, the final states are all subsets of $F$, and $\Rightarrow_N \subseteq 2^Q \times \Sigma \times 2^Q$ is the transition relation, defined as follows, for every $a \in \Sigma$, and for all sets $Q', Q'' \subseteq Q$:
$$Q' \Rightarrow_N^{a} Q'' \ \text{ if, and only if, } \ \exists q \in Q' : Q'' = \{\delta(q, a)\} \cup \bigcup_{p \in Q'} \{\delta(p, \mathring{a})\}.$$
Relation $\Rightarrow_N$ can be extended as usual to $2^Q \times \Sigma^* \times 2^Q$.

Proof of Theorem 4.1. The proof shows that when $R$ has maximal degree, then $C(R) = L(N)$.

First, we show that $C(R) \subseteq L(N)$. Let $w \in \Sigma^*$, $k > |w|$ and $Z' : Q \to \mathbb{N}$ be such that $\{(q_0)^k\} \vdash_A^{w} Z'$. We show by induction on $|w|$ that in this case $\{q_0\} \Rightarrow_N^{w} \widehat{Z'}$. The thesis then follows considering the case when $\widehat{Z'} \subseteq F$, also noticing that the assumption $k > |w|$ does not change the language (since $\mathring{\Sigma}^{+} \subseteq L(A)$).

The base step $w = \epsilon$ is immediate. For the inductive step, let $w = za$, for $z \in \Sigma^*$, $a \in \Sigma$. Since $\{(q_0)^k\} \vdash_A^{w} Z'$, there exists $Z'' : Q \to \mathbb{N}$ such that $\{(q_0)^k\} \vdash_A^{z} Z'' \vdash_A^{a} Z'$. By induction hypothesis, $\{q_0\} \Rightarrow_N^{z} \widehat{Z''}$. Since $Z'' \vdash_A^{a} Z'$, there exists $q \in Z''$ such that $Z' = \{\delta(q, a)\} \uplus \delta(Z'' - \{q\}, \mathring{a})$. Since $q \in \widehat{Z''}$, define $Q' = \{\delta(q, a)\} \cup \delta(\widehat{Z''}, \mathring{a})$. Clearly, $\widehat{Z''} \Rightarrow_N^{a} Q'$. In general it might be that $\widehat{Z'} \ne Q'$, since $Q'$ also includes $\delta(q, \mathring{a})$. However, $q \in S$, hence $Z''(q) = k - |z| = k - |w| + 1$, which is greater than 1 since $k > |w|$ by hypothesis. Hence, $Z'' - \{q\}$ still has at least one occurrence of $q$. Therefore, $\delta(q, \mathring{a})$ is in $Z'$. Hence, $\widehat{Z'} = Q'$.

Conversely, we show that $L(N) \subseteq C(R)$. Assume $\{q_0\} \Rightarrow_N^{w} Q'$. We prove by induction on $|w|$ that for all $k > |w|$ there exists a multiset $Z'$ over $Q$ such that $\widehat{Z'} = Q'$ and $\{(q_0)^k\} \vdash_A^{w} Z'$. The thesis follows immediately by considering the case $Q' \subseteq F$.

The base step is trivial. For the inductive step, let $w = za$, for $z \in \Sigma^*$, $a \in \Sigma$. Then $\exists Q'' \subseteq Q$ such that $\{q_0\} \Rightarrow_N^{z} Q'' \Rightarrow_N^{a} Q'$.
By induction hypothesis, $\exists Z'' : Q \to \mathbb{N}$ such that $Z'' = Q''$ and $\{(q_0)^k\} \xrightarrow{z}_A Z''$. Since $Q'' \stackrel{a}{\Rightarrow}_N Q'$, there exists $q \in Q''$ such that $Q' = \{\delta(q, a)\} \cup \delta(Q'', a)$. Let $Z' = \{\delta(q, a)\} \uplus \delta(Z'' - \{q\}, a)$, which by definition means that $Z'' \xrightarrow{a}_A Z'$. We are left to prove that $Z' = Q'$. The proof is now identical to the converse case above. $Q'$ also includes $\delta(q, a)$, while $Z'$ might not include $\delta(q, a)$. However, $q \in S$, hence $Z''(q) = k - |z| = k - |w| + 1$, which is greater than 1 since $k > |w|$ by hypothesis. Hence, $Z'' - \{q\}$ has still at least one occurrence of $q$. Therefore, $\delta(q, a)$ is in $Z'$. Hence, $Z' = Q'$.

We observe that the above result can be generalized to the case of $R$ being included in $\Sigma^* \Sigma^h \Sigma^*$, for some finite $h \geq 1$; in other words, each matching word may place up to $h$ letters within a window of width $h$.

4.2. Descriptive complexity of regular consensual languages

We show that some regular languages can be described much more succinctly by using consensual languages. In what follows, the size $\|R\|$ of a regular language $R$ is defined to be the number of states of the smallest nondeterministic automaton accepting $R$.

Theorem 4.4. There exist a family of regular languages $\{L_m \mid m \geq 2\}$ over $\Sigma$ and another family of regular languages $\{R_m \mid m \geq 2\}$ over $\widetilde{\Sigma}$ such that $L_m = C(R_m)$ and the size of each $R_m$ is asymptotically at least exponentially smaller than the size of $L_m$.

Proof. Let $\Sigma = \{a\}$ and denote with $p_i$, $i \geq 1$, the $i$-th prime number (e.g., $p_1 = 2, \dots, p_5 = 11, \dots$). For every $m > 0$, the product of the first $m$ primes, sometimes called the primorial of $p_m$, is

$$p_m\# = \prod_{1 \leq i \leq m} p_i.$$

For every $m \geq 1$, let $L_m = \{a^{m + h \cdot p_m\#} \mid h \geq 1\}$ and let

$$R_m = a^m a^+ \cup \bigcup_{1 \leq i \leq m} a^{i-1} a a^{m-i} (a^{p_i})^+.$$

We claim that $C(R_m) = L_m$. If $u \in C(R_m)$ then $u$ is the match of $m + 1$ words $u_0 @ u_1 @ \dots$
$@ u_m$, with $u_0 \in a^m a^+$ and each $u_i$ in $a^{i-1} a a^{m-i} (a^{p_i})^+$, since $u$ has a prefix $a^m$ to be strongly matched, and each $u_i$ places one $a$ as the $i$-th character. Since the suffix of $u_i$ after $a^m$ must be of a length multiple of each $p_i$, then there exist $k_1, \dots, k_m \geq 1$ such that $|u_i| = m + k_i p_i$. Hence, $k_1 p_1 = k_2 p_2 = \dots = k_m p_m$. Since the $p_i$'s are distinct primes, there exists $h \geq 1$ such that $k_1 p_1 = h p_1 p_2 \dots p_m$. Therefore, $|u| = m + h p_1 p_2 \dots p_m$ and $u \in L_m$.

Conversely, if $u = a^{m + h p_m\#} \in L_m$ then $u$ is the match of $a^m a^{h p_m\#}$ with $m$ words $u_1, \dots, u_m \in R_m$, with each $u_i = a^{i-1} a a^{m-i} a^{h p_1 \dots p_m}$, because obviously $a^{h p_1 \dots p_m}$ is in $(a^{p_i})^+$. Hence, $u \in C(R_m)$.

Since the length of the shortest word in $L_m$ is $m + p_m\#$, every (deterministic or nondeterministic) finite automaton needs at least $m + p_m\#$ states to accept $L_m$; that is, the size of $L_m$ satisfies $\|L_m\| \geq m + p_m\#$. Since $p_i > i$, for every $i \geq 1$, it is obvious that $p_i\# > i!$, i.e., primorials are greater than factorials. Hence, it follows that:

$$\|L_m\| \geq m + p_m\# \geq p_m\# = p_m \, p_{m-1}\# \geq p_m (m - 1)!$$

On the other hand, the number of states of a minimal DFA for $R_m$ is

$$\frac{(m + 1)(m + 2)}{2} + 1 + \sum_{1 \leq i \leq m} p_i.$$

Hence,

$$\|R_m\| \leq 3m + 1 + \sum_{1 \leq i \leq m} p_i \leq 3m + 1 + m p_m.$$

Therefore, $\dfrac{\|L_m\|}{\|R_m\|}$ is $\Omega\!\left(\dfrac{(m-1)! \, p_m}{m p_m}\right)$, which is $\Omega((m - 2)!)$, which is also $\Omega(2^m)$.

5. Undecidability of emptiness

In this section, the undecidability of emptiness checking is proved, by means of a reduction from the halting problem of a 2-counter Minsky machine [6].¹

¹ The idea of the proof has been suggested by an anonymous referee.

Theorem 5.1. Let $R \subseteq \widetilde{\Sigma}^*$. It is undecidable whether $C(R) = \emptyset$.

Proof. Define a Minsky machine $C = (S, M, i, f)$, where $S$ is a finite set of states, $i \neq f \in S$ are the initial and final states, respectively, and $M \subseteq (S - \{f\}) \times S \times \{true, test\}^2 \times \{INC, DEC, STAY\}^2$ is a finite set of moves.
Without loss of generality, assume that for every $p \in S - \{f\}$, $q \in S$, there exists at most one tuple $(t_1, t_2, i_1, i_2)$, with $t_1, t_2 \in \{true, test\}$ and $i_1, i_2 \in \{INC, DEC, STAY\}$ such that $(p, q, t_1, t_2, i_1, i_2) \in M$.

A configuration of a counter machine is an element of $S \times \mathbb{N} \times \mathbb{N}$. The initial configuration is $(i, 0, 0)$, and a final configuration is $(f, n, m)$, for every $n, m \in \mathbb{N}$. A transition relation $\longrightarrow_C \subseteq (S \times \mathbb{N} \times \mathbb{N}) \times (S \times \mathbb{N} \times \mathbb{N})$ can easily be defined as usual. For instance, if $(p, q, true, test, INC, STAY) \in M$ then e.g., $(p, 3, 0) \longrightarrow_C (q, 4, 0)$: the first counter is not tested ("true"), the second counter is tested for 0, and then the first counter is incremented and the second counter stays.

Let $\Sigma = S \cup \{X, Y, 1, 2\}$. For every $p \in S - \{f\}$, $q \in S$, define the following regular languages, on the bipartite alphabet $\widetilde{\Sigma}$.

$$\begin{array}{lcl}
INIT & = & 1212\, i\, \Sigma^* \\
NEXT(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
COPY_1(p, q) & = & \Sigma^*\, XY (XY)^* p (XY)^*\, 1212 (XY)^* XY (XY)^* q\, \Sigma^* \\
COPY_2(p, q) & = & \Sigma^*\, p (XY)^* XY (XY)^*\, 1212 (XY)^* q (XY)^* XY\, \Sigma^* \\
test_1(p, q) & = & \Sigma^*\, 1212\, p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
true_1(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
test_2(p, q) & = & \Sigma^*\, 1212 (XY)^* p\, 1212 (XY)^* q\, \Sigma^* \\
true_2(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
INC_1(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212\, XY (XY)^* q\, \Sigma^* \\
DEC_1(p, q) & = & \Sigma^*\, 1212\, XY (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
STAY_1(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
INC_2(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, XY\, \Sigma^* \\
DEC_2(p, q) & = & \Sigma^*\, 1212 (XY)^* p\, XY (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
STAY_2(p, q) & = & \Sigma^*\, 1212 (XY)^* p (XY)^*\, 1212 (XY)^* q\, \Sigma^* \\
HALT & = & \Sigma^*\, 1212 (XY)^* f (XY)^*
\end{array}$$

Then, associate a regular language $[m]$ with every move $m$ of a 2-counter machine $C = (S, M, i, f)$.
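The transition relation $\longrightarrow_C$, which the text above leaves to be defined "as usual", can be made concrete by a minimal Python sketch. The names `Move`, `step` and `halts_within` are illustrative and do not come from the paper. The sketch assumes that `test` requires the counter to be zero, that `true` imposes no condition, and that `DEC` is enabled only on a positive counter; these are the usual conventions for Minsky machines, not a definition taken from the source.

```python
from typing import List, NamedTuple, Optional, Tuple

Config = Tuple[str, int, int]  # (state, first counter, second counter)


class Move(NamedTuple):
    """A move (p, q, t1, t2, i1, i2); the class name is hypothetical."""
    p: str
    q: str
    t1: str  # "true" or "test"
    t2: str
    i1: str  # "INC", "DEC" or "STAY"
    i2: str


def _update(counter: int, test: str, instr: str) -> Optional[int]:
    """New counter value, or None if the move is not enabled (assumed semantics)."""
    if test == "test" and counter != 0:
        return None  # assumption: "test" means the counter must be zero
    if instr == "INC":
        return counter + 1
    if instr == "DEC":
        return counter - 1 if counter > 0 else None  # assumption: no DEC below 0
    return counter  # STAY


def step(moves: List[Move], config: Config) -> List[Config]:
    """All configurations reachable from config in one transition."""
    state, n, m = config
    successors = []
    for mv in moves:
        if mv.p != state:
            continue
        n2 = _update(n, mv.t1, mv.i1)
        m2 = _update(m, mv.t2, mv.i2)
        if n2 is not None and m2 is not None:
            successors.append((mv.q, n2, m2))
    return successors


def halts_within(moves: List[Move], initial: str, final: str, max_steps: int) -> bool:
    """Bounded search for a run from (initial, 0, 0) to some (final, n, m)."""
    frontier = {(initial, 0, 0)}
    for _ in range(max_steps + 1):
        if any(cfg[0] == final for cfg in frontier):
            return True
        frontier = {nxt for cfg in frontier for nxt in step(moves, cfg)}
    return False
```

With the move of the example in the text, `step([Move("p", "q", "true", "test", "INC", "STAY")], ("p", 3, 0))` yields the single configuration `("q", 4, 0)`. The bounded search only semi-decides halting, in line with the undecidability being proved: no finite bound on `max_steps` works for all machines.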
For every $m = (p, q, t_1, t_2, i_1, i_2)$, with $t_1, t_2 \in \{test, true\}$, $i_1, i_2 \in \{INC, DEC, STAY\}$ let:

$$[m] = NEXT(p, q) \cup \bigcup_{j = 1, 2} \left( \begin{array}{l}
COPY_j(p, q) \; \cup \\
(\textbf{if } t_j \text{ is } test \textbf{ then } test_j(p, q) \textbf{ else } true_j(p, q) \textbf{ fi}) \; \cup \\
(\textbf{if } i_j \text{ is } INC \textbf{ then } INC_j(p, q) \textbf{ elsif } i_j \text{ is } DEC \textbf{ then } DEC_j(p, q) \textbf{ else } STAY_j(p, q) \textbf{ fi})
\end{array} \right).$$

For instance, if $m$ is $(p, q, test, test, DEC, INC)$ then $[m]$ is:

$$NEXT(p, q) \cup COPY_1(p, q) \cup COPY_2(p, q) \cup test_1(p, q) \cup test_2(p, q) \cup DEC_1(p, q) \cup INC_2(p, q).$$

Define $R$ as the union $HALT \cup INIT \cup \bigcup_{m \in M} [m]$. Then $C(R)$ is non-empty if, and only if, $C$ has a halting computation.

The main idea is that $C(R)$ represents the set of all halting runs of $C$. Every word of $C(R)$ has the form

$$1212\, i\, (1212 (XY)^* S (XY)^*)^*\, 1212 (XY)^* f (XY)^*$$

where the subwords $1212$ separate the representation of two configurations of $C$. For instance, the prefix $1212\, i\, 1212$ represents the initial configuration $(i, 0, 0)$ of $C$, while a subword $1212 (XY)^n p (XY)^m 1212$ represents the configuration $(p, n, m)$, and a suffix $1212 (XY)^n f (XY)^m$ represents a final configuration $(f, n, m)$. Also, the definition of $R$ is such that if there is a subword of the form

$$1212 (XY)^n p (XY)^m\, 1212 (XY)^{n'} q (XY)^{m'}\, 1212$$

then $(p, n, m) \longrightarrow_C (q, n', m')$.

Let $F = 1212 (XY)^* S (XY)^*$. We show that if $(i, 0, 0) \longrightarrow^+_C (f, n, m)$ then there is a word $w \in 1212\, i\, F^*\, 1212 (XY)^n f (XY)^m$ such that $w \in C(R)$, i.e., every halting run of $C$ corresponds to a word of $C(R)$. By induction on $h \geq 1$, we first prove the following claim (1):

(1) if $(i, 0, 0) \longrightarrow^{h-1}_C (p, n, m) \longrightarrow_C (q, n', m')$ then there is a word in $R^{@}$ of the form:

$$1212\, i\, F^{h-1}\, 1212 (XY)^n p (XY)^m\, 1212 (XY)^{n'} q (XY)^{m'} F^*.$$

The base case is: if $(i, 0, 0) \longrightarrow_C (q, n', m')$ with $0 \leq n', m' \leq 1$, then $1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^* \in R^{@}$. By the set $INIT$, $W_0 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^* \subseteq R^{@}$. The move $(i, q, test, test, i_1, i_2)$ is in $C$, with $i_1, i_2 \in \{INC, STAY\}$ (and e.g., $i_1 = INC$ if $n' = 1$, $i_1 = STAY$ if $n' = 0$, etc.).
Hence, $NEXT(i, q)$ is in $[(i, q, test, test, i_1, i_2)]$, and $W_1 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^* \subseteq R^{@}$. Notice that $COPY_1(i, q)$, $COPY_2(i, q)$ cannot match with $W_1$. Both $test_1(i, q)$ and $test_2(i, q)$ may match with $W_1$ and hence $W_2 = 1212\, i\, 1212 (XY)^{n'} q (XY)^{m'} F^* \subseteq R^{@}$. Assume that $n' = 1$, $m' = 0$, hence, $i_1 = INC$, $i_2 = STAY$. The other cases may be dealt with analogously. Hence, both $INC_1(i, q)$ and $STAY_2(i, q)$ may match with $W_2$: $W_3 = 1212\, i\, 1212\, XY q F^* \subseteq R^{@}$.

The inductive case is dealt with analogously to the base case: by induction hypothesis, there exists a word $w_0$ in $1212\, i\, F^{h-1}\, 1212 (XY)^n p (XY)^m F^* \cap R^{@}$. Hence, there is $w_1 \in R^{@}$ of the form

$$1212\, i\, F^{h-1}\, 1212 (XY)^n p (XY)^m\, 1212 (XY)^{n'} q (XY)^{m'} F^*.$$

Assume that $x = (p, q, true, true, DEC, INC) \in M$ (by the unicity assumption on $M$, there is no other move from $p$ to $q$). Hence, $n' = n - 1$, $m' = m + 1$. Then, $COPY_1(p, q), COPY_2(p, q) \in [x]$: there is $w_2 \in R^{@}$ of the form

$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY p (XY)^m\, 1212 (XY)^{n-1} q (XY)^m XY F^*.$$

Since also $true_1(p, q), true_2(p, q) \in [x]$, there is $w_3 \in R^{@}$ of the form

$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY p (XY)^m\, 1212 (XY)^{n-1} q (XY)^m XY F^*.$$

Since also $DEC_1(p, q), INC_2(p, q) \in [x]$, there is $w_4 \in R^{@}$ of the form

$$1212\, i\, F^{h-1}\, 1212 (XY)^{n-1} XY p (XY)^m\, 1212 (XY)^{n-1} q (XY)^m XY F^*,$$

thus ending the induction for the case of a move $x$ as above. The other cases of moves are analogous and are omitted for brevity.

Finally, by the claim (1), when $q = f$, if $(i, 0, 0) \longrightarrow (f, n', m')$ then there exists $w \in R^{@}$ of the form $1212\, i\, F^*\, 1212 (XY)^{n'} q (XY)^{m'}$. By a match with $HALT$, the result is a word $w' \in R^{@}$ of the form $1212\, i\, F^*\, 1212 (XY)^{n'} q (XY)^{m'}$, which is in $C(R)$ since it has no marking.

The proof of the converse claim that every word of $C(R)$ is a halting run of $C$ is similar and can be safely omitted.

6.
Comparisons with other families

In order to compare consensual languages with some classical language families, we show that certain languages exceed the capacity of consensual languages based on regular sets.

Proposition 6.1. The languages $\{u c u^R \mid u \in \{a, b\}^*\}$ and $\{u c u \mid u \in \{a, b\}^*\}$ are not in the family $C_{REG}$.

Proof. Let $L = \{u c u \mid u \in \{a, b\}^*\}$. The proof for $\{u c u^R \mid u \in \{a, b\}^*\}$ is completely analogous. Assume by contradiction there is a DFA $A = (\widetilde{\Sigma}, Q, \delta, q_0, F)$ such that $C(L(A)) = L$. For a given input word of length $n > 0$, the nondeterministic Turing machine simulating the consensual transition relation only stores a multiset over $Q$, of cardinality $m \leq n$; hence, there are at most $(n + 1)^{|Q|}$ different configurations. On the other hand, there are $2^n$ different strings in $\{a, b\}^n$, and for $n$ large enough, the number of possible strings is much larger than the number $(n + 1)^{|Q|}$ of different multisets: there exist $u, w \in \{a, b\}^n$, $u \neq w$, such that there exist $m \leq n$, a multiset $Z$ over $Q$ and two multisets $Z', Z''$ over $F$, such that $\{(q_0)^m\} \xrightarrow{u}_A Z \xrightarrow{cu}_A Z'$ and $\{(q_0)^m\} \xrightarrow{w}_A Z \xrightarrow{cw}_A Z''$. But then also $\{(q_0)^m\} \xrightarrow{u}_A Z \xrightarrow{cw}_A Z''$, a contradiction since $u c w \notin L$.

From Proposition 6.1, and from Example 2.4 it follows:

Corollary 6.2. The family $C_{REG}$ is not comparable with the families of context-free languages and of tree adjoining languages [5].

Among the typical context-free languages, the Dyck sets with two or more pairs of parentheses fall outside the family $C_{REG}$. To see it, it suffices to observe that the proof of Proposition 6.1 also applies to the case of language $L = \{u\, h(u^R) \mid u \in \{a, b\}^*\}$ where $h$ is the morphism $h(a) = a'$, $h(b) = b'$. Let $D_2$ be the Dyck language with opening parentheses $a, b$ and closing parentheses $a', b'$ respectively. Let $R$ be the regular language composed of all strings on $\{a, b, a', b'\}$ where there is no occurrence of the factors $a'a$, $b'b$, $a'b$, $b'a$.
Hence, $D_2 \cap R = L$, and, if $D_2$ were in $C_{REG}$, then by closure of $C_{REG}$ under intersection with regular languages, also $L$ would be in $C_{REG}$.

7. Conclusion

The simple notion of consensus between simultaneous computations, formulated by means of strong matching, though surely not the only one possible and sensible, yet permits rather remarkable selectivity in language definition. As we see it, the interest of the family of consensual languages, based on matching finite-state computations, comes from a combination of properties; it includes the regular family, and actually offers a very concise representation of some regular sets; it includes non-semilinear languages; and it has a time-polynomial word problem. Altogether this research proposes a new way of looking at finite-state devices as language recognizers.

Since the model is new and research is in its early stages, many questions are open for investigation, for instance concerning minimality, decidability of equivalence, determinism of the counter (or multi-set) machine, as well as the study of closure properties beyond the basic ones considered.

Concerning variations on the theme of consensual computations, we mention some possibilities. Two variations would be to allow (or to oblige) a finite number $k > 1$ of component words to place each letter in each position of the match. Such devices would then model systems where stronger consensus between independent computations is possible (or is requested), in order for a word to be accepted. We believe our definitions, though possibly the simplest, already capture a rather rich range of language paradigms. But of course actual experimentation would be needed.

Moreover, we hope that the consensual approach could be fruitfully investigated for other families of base languages, both weaker and stronger than the regular ones.
Finally, we think the idea of consensual computation could provide some formal support to the linguistic requirement of assigning different degrees of grammaticality to sentences that satisfy some, but not all, semantic constraints.

Acknowledgements. The first author is indebted to Valentino Braitenberg for inspiring discussions on brain theory and formal language models.

References

[1] A.K. Chandra, D. Kozen and L.J. Stockmeyer, Alternation. J. ACM 28 (1981) 114–133.
[2] S. Crespi Reghizzi and P. San Pietro, Consensual definition of languages by regular sets, in LATA. Lecture Notes in Computer Science 5196 (2008) 196–208.
[3] S. Crespi Reghizzi and P. San Pietro, Languages defined by consensual computations, in ICTCS09 (2009).
[4] M. Jantzen, On the hierarchy of Petri net languages. ITA 13 (1979).
[5] A. Joshi and Y. Schabes, Tree-adjoining grammars, in Handbook of Formal Languages, Vol. 3, G. Rozenberg and A. Salomaa, Eds. Springer, Berlin, New York (1997), 69–124.
[6] M. Minsky, Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs (1976).
[7] A. Salomaa, Theory of Automata. Pergamon Press, Oxford (1969).
[8] K. Vijay-Shanker and D.J. Weir, The equivalence of four extensions of context-free grammars. Math. Syst. Theor. 27 (1994) 511–546.

Communicated by A. Cherubini. Received December 24, 2009. Accepted November 18, 2010.
