
Notes on Computation Theory

Konrad Slind
slind@cs.utah.edu

September 21, 2010


Foreword
These notes are intended to support cs3100, an introduction to the theory
of computation given at the University of Utah. Very little of these notes
is original with me. Most of the examples and definitions may be found
elsewhere, in the many good books on this topic. These notes can be taken
to be an informal commentary and supplement to those texts. In fact, these
notes are basically a direct transcription of my lectures. Any lacunae and
mistakes are mine, and I would be glad to be informed of any that you
might find.
The course is taught as a mathematics course aimed at computer sci-
ence students. We assume that the students already have had a discrete
mathematics course. However, that material is re-covered briefly at the
start. We strive for a high degree of precision in definitions. The reason
for this is that much of the difficulty students have in theory courses is in
not ‘getting’ concepts. However, in our view, this should never happen
in a theory course: the intent of providing mathematically precise defini-
tions is to banish confusion. To that end, we provide formal definitions,
and use subsequent examples to sort out pathological cases and provide
motivation.
A minor novelty, not original with us, is that we proceed in the reverse
of the standard sequence of topics. Thus we start with Turing machines
and computability before going on to context-free languages and finally
regular languages. The motivation for this is that not many topics are
harmed in this approach (the pumping lemmas and non-determinism do
become somewhat awkward) while the benefit is twofold: (1) the intel-
lectually stimulating material on computability and undecidability can be
treated early in the course, while (2) the immensely practical material deal-
ing with finite state machines can be used to finish the course. So rather
than getting more abstract, as is usual, the course actually gets more con-
crete and practical, which is often to the liking of the students.
The transition diagrams have been drawn with the wonderful Vaucanson
LaTeX package developed by Sylvain Lombardy and Jacques Sakarovitch.

Contents

1 Introduction 5
1.1 Why Study Theory? . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Computability . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Context-Free Grammars . . . . . . . . . . . . . . . . . 8
1.2.3 Automata . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Background Mathematics 10
2.1 Some Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Some Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Alphabets and Strings . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.1 Review of proof terminology . . . . . . . . . . . . . . 31
2.5.2 Review of methods of proof . . . . . . . . . . . . . . . 32
2.5.3 Some simple proofs . . . . . . . . . . . . . . . . . . . . 34
2.5.4 Induction . . . . . . . . . . . . . . . . . . . . . . . . . 37

3 Models of Computation 45
3.1 Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.1 Example Turing Machines . . . . . . . . . . . . . . . . 52
3.1.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.3 Coding and Decoding . . . . . . . . . . . . . . . . . . 64
3.1.4 Universal Turing machines . . . . . . . . . . . . . . . 67
3.2 Register Machines . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 The Church-Turing Thesis . . . . . . . . . . . . . . . . . . . . 72

3.3.1 Equivalence of Turing and Register machines . . . . . 74
3.4 Recognizabilty and Decidability . . . . . . . . . . . . . . . . . 78
3.4.1 Decidable problems about Turing machines . . . . . . 80
3.4.2 Recognizable problems about Turing Machines . . . 81
3.4.3 Closure Properties . . . . . . . . . . . . . . . . . . . . 83
3.5 Undecidability . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.5.1 Diagonalization . . . . . . . . . . . . . . . . . . . . . . 85
3.5.2 Existence of Undecidable Problems . . . . . . . . . . 87
3.5.3 Other undecidable problems . . . . . . . . . . . . . . 89
3.5.4 Unrecognizable languages . . . . . . . . . . . . . . . . 95

4 Context-Free Grammars 97
4.1 Aspects of grammar design . . . . . . . . . . . . . . . . . . . 105
4.1.1 Proving properties of grammars . . . . . . . . . . . . 112
4.2 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.3 Algorithms on CFGs . . . . . . . . . . . . . . . . . . . . . . . 117
4.3.1 Chomsky Normal Form . . . . . . . . . . . . . . . . . 118
4.4 Context-Free Parsing . . . . . . . . . . . . . . . . . . . . . . . 123
4.5 Grammar Decision Problems . . . . . . . . . . . . . . . . . . 129
4.6 Push Down Automata . . . . . . . . . . . . . . . . . . . . . . 130
4.7 Equivalence of PDAs and CFGs . . . . . . . . . . . . . . . . . 139
4.7.1 Converting a CFG to a PDA . . . . . . . . . . . . . . . 139
4.7.2 Converting a PDA to a CFG . . . . . . . . . . . . . . . 142
4.8 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

5 Automata 145
5.1 Deterministic Finite State Automata . . . . . . . . . . . . . . 146
5.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.2 The regular languages . . . . . . . . . . . . . . . . . . 152
5.1.3 More examples . . . . . . . . . . . . . . . . . . . . . . 153
5.2 Nondeterministic finite-state automata . . . . . . . . . . . . . 156
5.3 Constructions . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3.1 The product construction . . . . . . . . . . . . . . . . 161
5.3.2 Closure under union . . . . . . . . . . . . . . . . . . . 164
5.3.3 Closure under intersection . . . . . . . . . . . . . . . 164
5.3.4 Closure under complement . . . . . . . . . . . . . . . 164
5.3.5 Closure under concatenation . . . . . . . . . . . . . . 165
5.3.6 Closure under Kleene star . . . . . . . . . . . . . . . . 166

5.3.7 The subset construction . . . . . . . . . . . . . . . . . 167
5.4 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . 172
5.4.1 Equalities for regular expressions . . . . . . . . . . . . 175
5.4.2 From regular expressions to NFAs . . . . . . . . . . . 177
5.4.3 From DFA to regular expression . . . . . . . . . . . . 180
5.5 Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.6 Decision Problems for Regular Languages . . . . . . . . . . . 194
5.6.1 Is a string accepted/generated? . . . . . . . . . . . . . 195
5.6.2 L(M) = ∅? . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.6.3 L(M) = Σ∗ ? . . . . . . . . . . . . . . . . . . . . . . . . 198
5.6.4 L(M1 ) ∩ L(M2 ) = ∅? . . . . . . . . . . . . . . . . . . . 198
5.6.5 L(M1 ) ⊆ L(M2 )? . . . . . . . . . . . . . . . . . . . . . 198
5.6.6 L(M1 ) = L(M2 )? . . . . . . . . . . . . . . . . . . . . . 199
5.6.7 Is L(M) finite? . . . . . . . . . . . . . . . . . . . . . . . 199
5.6.8 Does M have as few states as possible? . . . . . . . . 199

6 The Chomsky Hierarchy 200


6.1 The Pumping Lemma for Regular Languages . . . . . . . . . 201
6.1.1 Applying the pumping lemma . . . . . . . . . . . . . 202
6.1.2 Is L(M) finite? . . . . . . . . . . . . . . . . . . . . . . . 206
6.2 The Pumping Lemma for Context-Free Languages . . . . . . 207

7 Further Topics 215


7.1 Regular Languages . . . . . . . . . . . . . . . . . . . . . . . . 215
7.1.1 Extended Regular Expressions . . . . . . . . . . . . . 215
7.1.2 How to Learn a DFA . . . . . . . . . . . . . . . . . . . 222
7.1.3 From DFAs to regular expressions (Again) . . . . . . 222
7.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 230

Chapter 1

Introduction

This course is an introduction to the Theory of Computation. Computation


is, of course, a vast subject and we will need to take a gradual approach
to it in order to avoid being overwhelmed. First, we have to understand
what we mean by the title of the course.
The word Theory implies that we study abstractions of computing sys-
tems. In an abstraction, irrelevant complications are dropped, in order
to isolate the important concepts. Thus, studying the theory of subject x
means that simplified versions of x are analyzed from various perspec-
tives.
This brings us to Computation. The general approach of the course, as
we will see, is to deal with very simple, deliberately restricted models of
computers. We will study Turing machines and register machines, gram-
mars, and automata. We devote some time to working with each model, in
order to see what can be done with it. Also, we will prove more general
results, which relate different models.

1.1 Why Study Theory?


A question commonly posed by practically minded students is

Why study theory?

Here are some answers. Some I like better than others!

1. It’s a required course.

2. Theory gives exposure to ideas that permeate Computer Science:
logic, sets, automata, grammars, recursion. Familiarity with these
concepts will make you a better computer scientist.

3. Theory gives us mathematical (hence precise) descriptions of com-


putational phenomena. This allows us to use mathematics, an intel-
lectual inheritance with tools and techniques thousands of years old,
to solve problems arising from computers.

4. It gives training in argumentation, which is a generally useful thing.


As Lewis Carroll, author of Alice in Wonderland, dirty old man, and
logician wrote:

Once master the machinery of Symbolic Logic, and you have


a mental occupation always at hand, of absorbing interest, and
one that will be of real use to you in any subject you may take
up. It will give you clearness of thought - the ability to see your
way through a puzzle - the habit of arranging your ideas in an
orderly and get-at-able form - and, more valuable than all, the
power to detect fallacies, and to tear to pieces the flimsy illogical
arguments, which you will so continually encounter in books,
in newspapers, in speeches, and even in sermons, and which so
easily delude those who have never taken the trouble to master
this fascinating Art.

5. It is required if you are interested in a research career in Computer


Science.

6. A theory course distinguishes you from someone who has picked


up programming at a ‘job factory’ technical school. (This is the snob
argument, one which I don’t personally believe in.)

7. Theory gives exposure to some of the absolute highpoints of human


thought. For example, we will study the proof of the undecidability
of the halting problem, a result due to Alan Turing in the 1930s. This
theorem is one of the most profound intellectual developments—in
any field—of the 20th Century. Pioneers like Turing have blazed
trails deep into terra incognita and courses like cs3100 allow us mere
mortals to follow their footsteps.

8. Theory gives a nice setting for honing your problem solving skills.
You probably haven’t gotten smarter since you entered university
but you have learned many subjects and—more importantly—you
have been trained to solve problems. The belief is that improving
your problem-solving ability through practice will help you in your
career. Theory courses in general, and this one in particular, provide
good exposure to a wide variety of problems, and the techniques you
learn are widely applicable.

1.2 Overview
Although the subject matter of this course is models of computation, we need
a framework—some support infrastructure—in which to work:

FRAMEWORK is basic discrete mathematics, i.e., some set theory, some


logic, and some proof techniques.

SUBJECT MATTER is Automata, Grammars, and Computability

We will spend a little time recapitulating the framework, which you


should have mastered in cs2100. You will not be tested on framework
material, but you will get exactly nowhere if you don’t know it.
Once the framework has been recalled, we will discuss the following
subjects: Turing machines and computability; grammars and context-free
languages; and finally, finite state automata and regular languages. The
progression of topics will move from the abstract (computability) to the
concrete (constructions on automata), and from fully expressive models of
computation to more constrained models.

1.2.1 Computability
In this section, we will start by considering a classic model of computation—
that of Turing machines (TMs). Unlike the other models we will study, a TM
can do everything a modern computer can do (and more). The study of
’fully fledged’, or unrestricted, models of computation, such as TMs, is
known as computability.
We will see how to program TMs and, through experience, convince
ourselves of their power, i.e., that every algorithm can be programmed on

a TM. We will also have a quick look at Register Machines, which are quite
different from TMs, but of equivalent power. This leads to a discussion of
‘what is an algorithm’ and the Church-Turing thesis. Then we will see a
limitative result: the undecidability of the halting problem. This states that
it is not possible to mechanically determine whether or not an arbitrary
program will halt on all inputs. At the time, this was a very surprising
result. It has a profound influence on Computer Science since it can be
leveraged to show that all manner of useful functionality that one might
wish to have computers provide is, in fact, theoretically impossible.

1.2.2 Context-Free Grammars


Context-Free Grammars (CFGs) are a much more limited model of compu-
tation than Turing machines. Their prime application is that much, if not
all, parsing of normal programming languages can be accomplished effi-
ciently by parsers automatically generated from a CFG. Parsing is a stage
in program compilation that maps from the linear strings of the program
text into tree structures more easily dealt with by the later stages of com-
pilation. This is the first—and probably most successful—application of
generating programs from high-level specifications. CFGs are also useful
in parsing human languages, although that is a far harder task to perform
automatically. We will get a lot of experience with writing and analyzing
grammars. The languages generated by context-free grammars are known
as the context-free languages, and there is a class of machines used to pro-
cess strings specified by CFGs, known as push-down automata (PDAs). We
will also get some experience with constructing and analyzing PDAs.
Pedantic Note. The word is grammar, not grammer.
Non-Pedantic Note. The word language used here is special terminology
and has little to do with the standard usage. Languages are set-theoretic
entities and admit operations like union (∪), intersection (∩), concatena-
tion, and replication (Kleene’s ‘star’). An important theme in the course
is showing how simple operations on machines are reflected in these set-
theoretic operations on languages.

1.2.3 Automata
Automata (singular: automaton) are a simple but very important class of
computing devices. They are heavily used in compilers, text editors, VLSI
circuits, Artificial Intelligence, databases, and embedded systems.
We will introduce and give a precise definition of finite state automata
(FSAs) before investigating their extension to non-deterministic FSAs (NFAs).
It turns out that FSAs are equivalent to NFAs, and we will prove this. We
will discuss the languages recognized by FSAs, the so-called regular lan-
guages.
Automata are used to recognize, or accept, strings in a language. An
alternative viewpoint is that of regular expressions, which generate strings.
Regular expressions are equivalent to FSAs, and we will prove this.
Finally, we will prove the pumping lemma for regular languages. This,
along with the undecidability of the halting problem, is another of what
might be called negative, or limitative theorems, which show that there
are some aspects of computation that are not captured by the model be-
ing considered. In other words, they show that the model is too weak to
capture important notions.
Historical Remark. The history of the development of models of com-
putation is a little bit odd, because the most powerful models were in-
vestigated first. The work of Turing (Turing machines), Church (lambda
calculus), Post (Production Systems), and Goedel (recursive functions) on
computability happened largely in the 1930’s. These mathematicians were
trying to nail down the notion of algorithm, and came up with quite differ-
ent explanations. They were all right! Or at least that is the claim of the
Church-Turing Thesis, an important philosophical statement, which we will
discuss.
In the 1940’s restricted notions of computability were studied, in or-
der to give mathematical models of biological behaviour, such as the fir-
ing of neurons. These led to the development of automata theory. In the
1950’s, formal grammars and the notion of context-free grammars (and
much more) were invented by Noam Chomsky in his study of natural lan-
guage.

Chapter 2

Background Mathematics

This should be review from cs2100, but we may be rusty after the summer
layoff. We need some basic amounts of logic, set theory, and proof, as well
as a smattering of other material.

2.1 Some Logic


Here is the syntax of the formulas of predicate logic. In this course we
mainly use logical formulas to precisely express theorems.

A∧B conjunction
A∨B disjunction
A⇒B (material) implication
A iff B equivalence
¬A negation
∀x. A universal quantification
∃x. A existential quantification
After syntax we have semantics. The meaning of a formula is expressed
in terms of truth.

• A ∧ B is true iff A is true and B is true.

• A ∨ B is true iff A is true or B is true (or both are true).

• A ⇒ B is true iff it is not the case that A is true and B is false.

• A iff B is true iff A and B have the same truth value.
• ¬A is true iff A is false
• ∀x. A is true iff A is true for all possible values of x.
• ∃x. A is true iff A is true for at least one value of x.
Note the recursion: the truth value of a formula depends on the truth
values of its sub-formulas. This prevents the above definition from being
circular. Also, note that the apparent circularity in defining iff by using ‘iff’
is only apparent—it would be avoided in a completely formal definition.
Remark. The definition of implication can be a little confusing. Implication
is not ‘if-then-else’. Instead, you should think of A ⇒ B as meaning ‘if A
is true, then B must also be true. If A is false, then it doesn’t matter what
B is; the value of A ⇒ B is true’.
Thus a statement such as 0 < x ⇒ x2 ≥ 1 is true no matter what the
value of x is taken to be (supposing x is an integer). This works well with
universal quantification, allowing the statement ∀x. 0 < x ⇒ x2 ≥ 1 to be
true. However, the price is that some plausibly false statements turn out
to be true; for example: 0 < 0 ⇒ 1 < 0. Basically, in an absurd setting,
everything is held to be true.
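
To make the truth-table reading of implication concrete, here is a small Python sketch (the function name implies is mine, purely for illustration) that evaluates A ⇒ B as ‘not A, or B’ over all four combinations:

    def implies(a: bool, b: bool) -> bool:
        # A => B is false only when A is true and B is false.
        return (not a) or b

    for a in (True, False):
        for b in (True, False):
            print(a, b, implies(a, b))
    # implies(False, b) is True for either b: the rows with a false
    # antecedent all come out true, as described above.
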
Example 1. Suppose we want to write a logical formula that captures the
following well-known saying:
You can fool all of the people some of the time, and you can fool
some of the people all of the time, but you can’t fool all of the people
all of the time.
We start by letting the atomic proposition F (x, t) mean ‘you can fool x at
time t’. Then the following formula
(∀x.∃t. F (x, t)) ∧
(∃x.∀t. F (x, t)) ∧
¬(∀x.∀t. F (x, t))
precisely captures the statement. Notice that the first line asserts that each
person could be fooled at a different time. If one wanted to express that
there is a specific time at which everyone gets fooled, it would be
∃t. ∀x. F (x, t) .
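
Over a finite domain, ∀ behaves like Python’s all and ∃ like any, so formulas such as the one above can be evaluated mechanically. The sketch below is my own illustration, with a made-up finite ‘fooling’ relation F chosen so that all three conjuncts come out true:

    people = {"alice", "bob"}
    times = {1, 2}
    # Made-up relation: F(x, t) holds iff person x can be fooled at time t.
    fooled = {("alice", 1), ("alice", 2), ("bob", 1)}

    def F(x, t):
        return (x, t) in fooled

    everyone_sometime   = all(any(F(x, t) for t in times) for x in people)
    someone_always      = any(all(F(x, t) for t in times) for x in people)
    not_everyone_always = not all(F(x, t) for x in people for t in times)

    print(everyone_sometime, someone_always, not_everyone_always)  # True True True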

Example 2. What about
Everybody loves my baby, but my baby don’t love nobody but me.
Let the atomic proposition L(x, y) mean ‘x loves y’ and let b mean ‘my
baby’ and let me stand for me. Then the following formula
(∀x. L(x, b)) ∧ L(b, me) ∧ (∀x. L(b, x) ⇒ (x = me))
precisely captures the statement. It is interesting to pursue what this means,
since if everybody loves b, then b loves b. So I am my baby, which may be
troubling for some.
Example 3 (Lewis Carroll). From the following assertions
1. There are no pencils of mine in this box.
2. No sugar-plums of mine are cigars.
3. The whole of my property, that is not in the box, consists of cigars.
we can conclude that no pencils of mine are sugar-plums. Transcribed to
logic, the assertions are
∀x. inBox (x) ⇒ ¬Pencil (x)
∀x. sugarPlum(x) ∧ Mine(x) ⇒ ¬Cigar (x)
∀x. Mine(x) ∧ ¬inBox (x) ⇒ Cigar(x)
From (1) and (3) we can conclude All my pencils are cigars. Now we can use
this together with (2) to reach the conclusion
∀x. Pencil(x) ∧ Mine(x) ⇒ ¬sugarPlum(x).
These examples feature somewhat whimsical subject matter. In the
course we will be using symbolic logic when a high level of precision is
needed.

2.2 Some Sets


A set is a collection of entities, often written with the syntax {e1 , e2 , . . . , en }
when the set is finite. Making a set amounts to a decision to regard a col-
lection of possibly disparate things as a single object. Here are some well-
known mathematical sets:

• B = {true, false}. The booleans, also known as the bit values. In
situations where no confusion with numbers is possible, one could
have B = {0, 1}.

• N = {0, 1, 2, . . .}. The natural numbers.

• Z = {. . . , −2, −1, 0, 1, 2, . . .}. The integers.

• Q = the rational (fractional) numbers.

• R = the real numbers.

• C = the complex numbers.

Note. Z, Q, R, and C will not be much used in the course, although Q and
R will feature in one lecture.
Note. Some mathematicians think that N starts with 1. We will not adopt
that approach in this course!
There is a rich collection of operations on sets. Interestingly, all these
operations are ultimately built from membership.

Membership of an element in a set The notation a ∈ S means that a is a


member, or element, of S. Similarly, a ∉ S means that a is not an element
of S.

Equality of sets Equality of sets R and S is defined R = S iff (∀x. x ∈


R iff x ∈ S). Thus two sets are equal just when they have the same ele-
ments. Note that sets have no intrinsic order. Thus {1, 2} = {2, 1}. Also,
sets have no duplicates. Thus {1, 2, 1, 1} = {2, 1}.

Subset R is a subset of S if every element of R is in S, but S may have


extras. Formally, we write R ⊆ S iff (∀x. x ∈ R ⇒ x ∈ S). Having ⊆ avail-
able allows an (equivalent) reformulation of set equality: R = S iff R ⊆
S ∧ S ⊆ R.
A few more useful facts about ⊆:

• S ⊆ S, for every set S.

• P ⊆ Q ∧ Q ⊆ R ⇒ P ⊆ R

There is also a useful notion of proper subset: R ⊂ S means that all ele-
ments of R are in S, but S has one or more extras. Formally, R ⊂ S iff R ⊆
S ∧ R ≠ S.
It is a common error to confuse ∈ and ⊆. For example, x ∈ {x, y, z}, but
that doesn’t allow one to conclude x ⊆ {x, y, z}. However, it is true that
{x} ⊆ {x, y, z}

Union The union of R and S, R ∪ S, is the set of elements occurring in R


or S (or both). Formally, union is defined in terms of ∨: x ∈ R ∪ S iff (x ∈
R ∨ x ∈ S).

{1, 2} ∪ {4, 3, 2} = {1, 2, 3, 4}

Intersection The intersection of R and S, R ∩ S, is the set of elements


occurring in both R and S. Formally, intersection is defined in terms of ∧:
x ∈ R ∩ S iff (x ∈ R ∧ x ∈ S).

{1, 2} ∩ {4, 3, 2} = {2}

Singleton sets A set with one element is called a singleton. Note well that
a singleton set is not the same as its element: ∀x. x ≠ {x}, even though
x ∈ {x}, for any x.

Set difference R − S is the set of elements that occur in R but not in S.


Thus, x ∈ R − S iff x ∈ R ∧ x ∉ S. Note that S may have elements not in
R. These are ignored. Thus

{1, 2, 3} − {2, 4} = {1, 3}.

Universe and complement Often we work in a setting where all sets are
subsets of some fixed set U (sometimes called the universe). In that case we
can write S̅ to mean U − S. For example, if our universe is N, and Even is
the set of even numbers, then E̅v̅e̅n̅ is the set of odd numbers.

Example 4. Let us take the Flintstone characters as our universe.

F = {Fred, Wilma, Pebbles, Dino}


R = {Barney, Betty, BamBam}
U = F ∪ R ∪ {Mr. Slate}

Then we know

∅ = F ∩ R
because the two families are disjoint. Also, we can see that

F − {Fred, Mr. Slate} = {Wilma, Pebbles, Dino}.


What about Fred ⊆ F ? It makes no sense because Fred is not a set. The
subset operation requires two sets. However, {Fred} ⊆ F is true; indeed
{Fred} ⊂ F .
We also know

{Mr. Slate, Fred} ⊈ F


since Mr. Slate is not an element of F . Finally, we know that

F̅∪̅R̅ = {Mr. Slate}

Remark. Set difference can be defined in terms of intersection and complement:

A − B = A ∩ B̅

Empty set The symbol ∅ stands for the empty set: the set with no ele-
ments. The notation {} may also be used. The empty set acts as an alge-
braic identity for several operations:

∅ ∪ S = S
∅ ∩ S = ∅
∅ ⊆ S
∅ − S = ∅
S − ∅ = S
∅̅ = U

Set comprehension This is also known as set builder notation. The notation is

{ template | condition }

This denotes the set of all items matching the template, which also meet
the condition. This, combined with logic, gives a natural way to concisely
describe sets:

{x | x < 1} = {0}
{x | x > 1} = {2, 3, 4, 5, . . .}
{x | x ∈ R ∧ x ∈ S} = R ∩ S
{x | ∃y. x = 2y} = {0, 2, 4, 6, 8, . . .}
{x | x ∈ U ∧ x is male} = {Fred, Barney, BamBam, Mr. Slate}

The template can be a more complex expression, as we will see.

Indexed union and intersection It sometimes happens that one has a set
of sets
{ {. . .}, . . . , {. . .} }   (call the member sets S1 , . . . , Sn )

and wants to ‘union (or intersect) them all together’, as in

S1 ∪ . . . ∪ Sn
S1 ∩ . . . ∩ Sn

These operations, known to some as bigunion and bigintersection, can be
formally defined in terms of index sets:

⋃_{i∈I} Si = {x | ∃i. i ∈ I ∧ x ∈ Si }
⋂_{i∈I} Si = {x | ∀i. i ∈ I ⇒ x ∈ Si }
The generality obtained from using index sets allows one to take the
bigunion of an infinite set of sets.

Power set The set of all subsets of a set S is known as the powerset of S,
written variously as P(S), Pow (S), or 2S .

Pow (S) = {s | s ⊆ S}
For example,

Pow {1, 2, 3} = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}

If a finite set is of size n, the size of its powerset is 2n . A powerset is always
larger than the set it is derived from, even if the original set is infinite.
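
For finite sets, the powerset can be computed directly; here is a small Python sketch of my own (using itertools), which agrees with the example above:

    from itertools import combinations

    def powerset(s):
        # All subsets of s, each represented as a frozenset.
        elems = list(s)
        return {frozenset(c) for r in range(len(elems) + 1)
                             for c in combinations(elems, r)}

    print(len(powerset({1, 2, 3})))   # 8, i.e., 2**3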

Product of sets R×S, the product of two sets R and S, is made by pairing
each element of R with each element of S. Using set-builder notation, this
can be concisely expressed:

R × S = {(x, y) | x ∈ R ∧ y ∈ S}.

Example 5.
 

F × R = { (Fred, Barney), (Fred, Betty), (Fred, BamBam),
          (Wilma, Barney), (Wilma, Betty), (Wilma, BamBam),
          (Pebbles, Barney), (Pebbles, Betty), (Pebbles, BamBam),
          (Dino, Barney), (Dino, Betty), (Dino, BamBam) }

In general, the size of the product of two sets will be the product of the
sizes of the two sets.
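
In Python, the product of two finite sets can be written as a set comprehension that mirrors the definition (an illustrative sketch of my own; itertools.product does the same job):

    F = {"Fred", "Wilma", "Pebbles", "Dino"}
    R = {"Barney", "Betty", "BamBam"}

    product = {(x, y) for x in F for y in R}

    print(len(product))                   # 12 = |F| * |R|
    print(("Fred", "Barney") in product)  # True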

Iterated product The iterated product of sets S1 , S2 , . . . , Sn is written S1 ×


S2 × . . . × Sn . It stands for the set of all n-tuples (a1 , . . . , an ) such that
a1 ∈ S1 , . . . an ∈ Sn . Slightly more formally, we could write

S1 × S2 × . . . × Sn = {(a1 , . . . an ) | a1 ∈ S1 ∧ . . . ∧ an ∈ Sn }
An n-tuple (a1 , . . . an ) is formally written as (a1 , (a2 , . . . , (an−1 , an ) . . .)), but,
by convention, the parentheses are dropped. For example, (a, b, c, d) is the
conventional way of writing the 4-tuple (a, (b, (c, d))). Unlike sets, tuples
are ordered. Thus a is the first element of the tuple, b is the second, c is the

third, and d is the fourth. Equality on n-tuples is captured by the following
property:

(a1 , . . . an ) = (b1 , . . . bn ) iff a1 = b1 ∧ . . . ∧ an = bn .


It is important to remember that sets and tuples are different. For example,

(a, b, c) ≠ {a, b, c}.

Size of a set The size of a set, also known as its cardinality, is just the
number of elements in the set. It is common to write |A| to denote the
cardinality of set A.

| {foo, bar, baz} | = 3


Cardinality for finite sets is straightforward; however it is worth noting
that there is a definition of cardinality that applies to both finite and in-
finite sets: under that definition it can be proved that not all infinite sets
have the same size! We will discuss this later in the course.

Summary of useful properties of sets


Now we supply a few identities which are useful for manipulating expres-
sions involving sets. The equalities can all be proved by expanding defini-
tions. To begin with, we give a few simple facts about union, intersection,
and the empty set.

A∪B = B∪A
A∩B = B∩A
A∪A = A
A∩A = A
A∪∅ = A
A∩∅ = ∅
The following identities are associative, distributive, and absorptive
properties:

A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∩ (B ∩ C) = (A ∩ B) ∩ C

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

A ∪ (A ∩ B) = A
A ∩ (A ∪ B) = A
The following identities are the so-called De Morgan laws, plus a few
others.
A̅∪̅B̅ = A̅ ∩ B̅
A̅∩̅B̅ = A̅ ∪ B̅

A̿ = A
A ∩ A̅ = ∅

2.2.1 Functions
Informally, a function is a mechanism that takes an input and gives an
output. One can also think of a function as a table, with the arguments
down one column, and the results down another. In fact, if a function is
finite, a table can be a good way to present it. Formally however, a function
f is a set of ordered pairs with the property

(a, b) ∈ f ∧ (a, c) ∈ f ⇒ b = c
This just says that a function is, in a sense, univocal, or deterministic:
there is only one possible output for an input. Of course, the notation
f (a) = b is preferred over (a, b) ∈ f . The domain and range of a function f
are defined as follows:

Dom(f ) = {x | ∃y. f (x) = y}


Rng(f ) = {y | ∃x. f (x) = y}

Example 6. The notation {(n, n2 ) | n ∈ N} specifies the function f (n) = n2 .


Furthermore, Dom(f ) = N and Rng(f ) = {0, 1, 4, 9, 16, . . .}.

A common notation for specifying that function f has domain A and
range B is the following:

f :A→B
Another common usage is to say ‘a function over (or on) a set’. This just
means that the function takes its inputs from the specified set. As a trivial
example, consider f , a function over N, described by f (x) = x + 2.

Partial functions A function can be total or partial. Every element in the


domain of a total function has a corresponding element in its range. In
contrast, a partial function may not have any element in the range corre-
sponding to some element of the domain.
Example 7. f (n) = n2 is a total function on N. On the other hand, suppose
we want a function from Flintstones to the ‘most-similar’ Rubble:
Flintstone ↦ Rubble
Fred Barney
Wilma Betty
Pebbles BamBam
Dino ??

This is best represented by a partial function that doesn’t map Dino to


anything.
Subtle Point. The notation f : A → B is usually taken to mean that the
domain of f is A, and the range is B. This will indeed be true when f
is a total function. However, if f is partial, then the domain of f can be
a proper subset of A. For example, the specification Flintstone → Rubble
could be used for the function presented above.
Sometimes functions are specified (or implemented) by algorithms. We
will study in detail how to define the general notion of what an algorithm
is in the course, but for now, let’s use our existing understanding. Using
algorithms to define functions can lead to three kinds of partiality:
1. the algorithm hits an exception state, e.g., attempting to divide by
zero;

2. the algorithm goes into an infinite loop;

3. the algorithm runs for a very very very long time before returning
an answer.

The second and third kind of partiality are similar but essentially dif-
ferent. Pragmatically, there is no difference between a program that will
never return and one that will return after a trillion years. However, the-
oretically there is a huge difference: instances of the second kind are truly
partial functions, while instances of the third are still total functions. A
course in computational complexity explores the similarities and differ-
ences between the options.
If a partial function f is defined at an argument a, then we write f (a) ↓.
Otherwise, f (a) is undefined and we write f (a) ↑.
Believe it or not. ∅ is a function. It’s the nowhere defined function.

Injective and Surjective functions An injective, or one-to-one function


sends different elements of the domain to different elements of the range:

Injective(f ) iff ∀x y. x ∈ Dom(f ) ∧ y ∈ Dom(f ) ∧ x ≠ y ⇒ f (x) ≠ f (y).

Pictorially, an injective function avoids the situation in which two distinct
elements of the domain are mapped to the same element of the range.

An important consequence: if there’s an injection from A to B, then B


is at least the size of A. For example, there is no injection from Flintstones
to Rubbles, but there are injections in the other direction.
A surjective, or onto function is one in which every element of the range
is produced by some application of the function:

Surjective(f : A → B) iff ∀y. y ∈ B ⇒ ∃x. x ∈ A ∧ f (x) = y


A bijection is a function that is both injective and surjective.


Example 8 (Square root in N). Let √n denote the number x ∈ N such that
x2 ≤ n and (x + 1)2 > n.

n                              √n
0                              0
1, 2, 3                        1
4, 5, 6, 7, 8                  2
9, 10, 11, 12, 13, 14, 15      3
16                             4
..                             ..

This function is surjective, because all elements of N appear in the range;
it is, however, not injective, for multiple elements of the domain map to a
single element of the range.
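
For finite domains, injectivity and surjectivity can be tested by brute force. The sketch below (the helper names are mine) checks the integer square root of Example 8, viewed as a function from {0, . . . , 16} to {0, . . . , 4}:

    from math import isqrt

    def is_injective(f, dom):
        outputs = [f(x) for x in dom]
        return len(outputs) == len(set(outputs))

    def is_surjective(f, dom, codom):
        return {f(x) for x in dom} == set(codom)

    dom, codom = range(17), range(5)
    print(is_injective(isqrt, dom))          # False: isqrt(1) == isqrt(2)
    print(is_surjective(isqrt, dom, codom))  # True: every value 0..4 is hit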

Closure Closure is a powerful idea, and it is used repeatedly in this


course. Suppose S is a set. If, for any x ∈ S, f (x) ∈ S, then S is said
to be closed under f .
For example, N is closed under squaring:

∀n. n ∈ N ⇒ n2 ∈ N.
A counter-example: N is not closed under subtraction: 2 − 3 ∉ N (unless
subtraction is somehow re-defined so that p − q = 0 when p < q).
The ‘closure’ terminology can be used for functions taking more than
one argument; thus, for example, N is closed under +.

2.3 Alphabets and Strings


An alphabet is a finite set of symbols, usually defined at the start of a prob-
lem statement. Commonly, Σ is used to denote an alphabet.

Examples
Σ = {0, 1}
Σ = {a, b, c, d}
Σ = {foo, bar}

Non-examples

• N (or any infinite set)

• sets having symbols with shared substructure, e.g., {foo, foobar}, since
this can lead to nasty, horrible ambiguity.

2.3.1 Strings
A string over an alphabet Σ is a finite sequence of symbols from Σ. For
example, if Σ = {0, 1}, then 000 and 0100001 are strings over Σ. The strings
provided in most programming languages are over the alphabet provided
by the ASCII characters (and more extensive alphabets, such as Unicode,
are common).
NB. Authors are sometimes casual about representing operations on strings:
for example, string construction and string concatenation are both written
by adjoining blocks of text. This is usually OK, but can be ambiguous: if
Σ = {o, f, a, b, r} we could write the string foobar, or f · o · o · b · a · r (to be
really precise). Similarly, if Σ = {foo, bar}, then we could also write foobar,
or foo · bar .

The empty string There is a unique string ε which is the empty string.
There is an analogy between ε for strings and 0 for N. For example, both
are very useful as identity elements.
NB. Some authors use Λ to denote the empty string.
NB. The empty string is not a symbol, it’s a string with no symbols in it.
Therefore ε can’t appear in an alphabet.

Length The length of a string s, written len(s), is obtained by counting


each symbol from the alphabet in it. Thus, if Σ = {f, o, b, a, r}, then

len(ε) = 0
len(foobar) = 6
but len(foobar) = 2, if Σ = {foo, bar }.
NB. Unlike some programming languages, strings are not terminated with
an invisible ε symbol.

Concatenation The concatenation of two strings x and y just places them
next to each other, giving the new string xy. If we needed to be precise, we
could write x · y. Some properties of concatenation:

x(yz) = (xy)z associativity


xε = εx = x identity
len(xy) = len(x) + len(y)
The iterated concatenation xn of a string x is the n-fold concatenation of
x with itself.

Example 9. Let Σ = {a, b}. Then

(aab)3 = aabaabaab
(aab)1 = aab
(aab)0 = ε

The formal definition of xn is by recursion:

x0 = ε
xn+1 = xn · x

Notation. Repeated elements in a string can be superscripted, for conve-


nience: for example, aab = a2 b.
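
In Python, string concatenation is + and the iterated concatenation xn is x * n, so Example 9 can be checked directly (a small sketch of my own):

    x = "aab"
    print(x * 3)                 # aabaabaab
    print(x * 1)                 # aab
    print(x * 0 == "")           # True: the 0-fold concatenation is the empty string
    print(len("foo" + "bar"))    # 6 = len("foo") + len("bar")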

Counting If x is a string over Σ and a ∈ Σ, then count(a, x) gives the


number of occurrences of a in x:

count(0, 0010) = 3
count(1, 000) = 0
count(0, ε) = 0
The formal definition of count is by recursion:

count(a, ε) = 0
count(a, b · t) = if a = b then count(a, t) + 1 else count(a, t)

In the second clause of this definition, the expression (b · t) should be un-


derstood to mean that b is a symbol concatenated to string t.
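
The recursive definition of count transcribes almost literally into Python (my own transcription; for single-character symbols the built-in str.count computes the same thing):

    def count(a: str, x: str) -> int:
        # Number of occurrences of the symbol a in the string x.
        if x == "":                        # count(a, epsilon) = 0
            return 0
        b, t = x[0], x[1:]                 # view x as b . t
        return count(a, t) + (1 if a == b else 0)

    print(count("0", "0010"))   # 3
    print(count("1", "000"))    # 0
    print(count("0", ""))       # 0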

Prefix A string x is a prefix of string y iff there exists w such that y = x · w.
For example, abaab is a prefix of abaababa. Some properties of prefix:

• ε is a prefix to every string

• x is a prefix to x, for any string x.

A string x is a proper prefix of string y if x is a prefix of y and x ≠ y.

Reversal The reversal xR of a string x = x1 · . . .· xn is the string xn · . . .· x1 .
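
Both prefix testing and reversal are one-liners in Python (an illustrative sketch; the function names are mine):

    def is_prefix(x: str, y: str) -> bool:
        # x is a prefix of y iff y = x . w for some string w.
        return y.startswith(x)

    def rev(x: str) -> str:
        return x[::-1]

    print(is_prefix("abaab", "abaababa"))  # True
    print(is_prefix("", "abc"))            # True: epsilon is a prefix of every string
    print(rev("abaab"))                    # baaba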

Pitfalls Here are some common mistakes people make when first con-
fronted with sets and strings. All the following are true, but surprise some
students.

• sets: {a, b} = {b, a}, but strings: ab ≠ ba

• sets: {a, a, b} = {a, b}, but strings: aab ≠ ab

• ∅ (the empty set) ≠ ε (the empty string) ≠ {ε} (the singleton set holding the empty string)

Also, sometimes people seem to reason as follows:

The empty set has no elements in it. The empty string has no
characters in it. So . . . the empty set is the same as the empty string.

The first two assertions are true; however, the conclusion is false. Al-
though the length of ε is 0, and the size of ∅ is also 0, they are two quite
different things.

2.4 Languages
So much for strings. Now we discuss sets of strings, also called languages.
Languages are one of the important themes of the course.
We will start our discussion with Σ∗ , the set of all strings over alpha-
bet Σ. The set Σ∗ contains all strings that can be generated by iteratively
concatenating symbols from Σ, any number of times.

Example 10. If Σ = {a, b, c},

Σ∗ = {ε, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, aaa, aab, aac, . . .}   (NB: ε is a member)

Question: What if Σ = ∅? What is Σ∗ then?


Answer: ∅∗ = {ε}. It may seem odd that you can proceed from the empty
set to a non-empty set by iterated concatenation. There is a reason for this,
but for the moment, please accept this as convention.
If Σ is a non-empty set, then Σ∗ is an infinite set, where each element is
a finite string.
Convention. Lower case letters at the end of the alphabet, e.g., u, v, w, x, y, z,
are used to represent strings. Capital letters from the beginning of the al-
phabet e.g., A, B, C, L are used to represent languages.

Set operations on languages


Now we apply set operations to languages; this will therefore be noth-
ing new, since we’ve seen set operations already. However, it’s worth the
repetition.

Union
{a, b, ab} ∪ {a, c, ba} = {a, b, ab, c, ba}

Intersection
{a, b, ab} ∩ {a, c, ba} = {a}

Complement Usually, Σ∗ is the universe that a complement is taken with
respect to. Thus
A̅ = {x ∈ Σ∗ | x ∉ A}
For example, the complement of {x | len(x) is even} is {x ∈ Σ∗ | len(x) is odd}.


Now we lift operations on strings to work over sets of strings.

Language reversal The reversal of language A is written AR and is de-


fined (note the overloading)

AR = {xR | x ∈ A}

Language concatenation The concatenation of languages A and B is de-


fined:

AB = {xy | x ∈ A ∧ y ∈ B}
or using the ‘dot’ notation to emphasize that we are concatenating (note
the overloading of ·):

A · B = {x · y | x ∈ A ∧ y ∈ B}
Example 11. {a, ab} {b, ba} = {ab, abba, aba, abb}
Example 12. Two languages L1 and L2 such that L1 · L2 = L2 · L1 and L1 is
not a subset of L2 and L2 is not a subset of L1 and neither language is {ε}
are the following:

L1 = {aa} L2 = {aaa}
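
Since languages are just sets of strings, language concatenation is a set comprehension in Python. This sketch (mine) reproduces Example 11 and the commuting pair of Example 12:

    def cat(A, B):
        # Concatenation of languages A and B.
        return {x + y for x in A for y in B}

    print(cat({"a", "ab"}, {"b", "ba"}))   # {'ab', 'aba', 'abb', 'abba'}

    L1, L2 = {"aa"}, {"aaa"}
    print(cat(L1, L2) == cat(L2, L1))      # True: both are {'aaaaa'}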

Notes
• In general AB ≠ BA. Example: {a}{b} ≠ {b}{a}.
• A · ∅ = ∅ = ∅ · A.
• A · {ε} = A = {ε} · A.
• A · ε is nonsense—it’s syntactically malformed.

Iterated language concatenation Well, if we can concatenate two lan-
guages, then we can certainly repeat this to concatenate any number of
languages. Or concatenate a language with itself any number of times.
The operation An denotes the concatenation of A with itself n times. The
formal definition is
A0 = {ε}
An+1 = A · An
Another way to characterize this is that a string is in An if it can be split
into n pieces, each of which is in A:

x ∈ An iff ∃w1 . . . wn . w1 ∈ A ∧ . . . ∧ wn ∈ A ∧ (x = w1 · · · wn ).

Example 13. Let A = {a, ab}. Thus A3 = A · A · A · {ε}, by unrolling the


formal definition. To expand further:

A · A · A · {ε} = A · A · A
= A · {aa, aab, aba, abab}
= {a, ab} · {aa, aab, aba, abab}
= {aaa, aaba, abaa, ababa, aaab, aabab, abaab, ababab}
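
The definition A0 = {ε}, An+1 = A · An can be transcribed directly; the sketch below (my own helper names) recomputes A3 for A = {a, ab} from Example 13:

    def cat(A, B):
        return {x + y for x in A for y in B}

    def power(A, n):
        # The n-fold concatenation of language A with itself.
        result = {""}                  # A**0 = {epsilon}
        for _ in range(n):
            result = cat(A, result)    # A**(k+1) = A . A**k
        return result

    A = {"a", "ab"}
    print(sorted(power(A, 3)))
    # ['aaa', 'aaab', 'aaba', 'aabab', 'abaa', 'abaab', 'ababa', 'ababab']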

Kleene’s Star It happens that An is sometimes too limited, because each string
in it has been built by exactly n concatenations of strings from A. A more
general operation, which addresses this shortcoming, is the so-called Kleene
Star operation, named for its inventor, the famous American logician
Stephen Kleene (pronounced ‘Klee-knee’).

A∗ = ⋃_{n∈N} An
   = A0 ∪ A1 ∪ A2 ∪ . . .
   = {x | ∃n. x ∈ An }
   = {x | x is the concatenation of zero or more strings from A}

Thus A∗ is the set of all strings derivable by any number of concate-


nations of strings in A. The notion of all strings obtainable by one or more
concatenations of strings in A is often used, and is defined A+ = A · A∗ ,
i.e.,

A+ = ⋃_{n>0} An = A1 ∪ A2 ∪ A3 ∪ . . .

Example 14.

A = {a, ab}
A∗ = A0 ∪ A1 ∪ A2 ∪ . . .
= {ε} ∪ {a, ab} ∪ {aa, aab, aba, abab} ∪ . . .
A+ = {a, ab} ∪ {aa, aab, aba, abab} ∪ . . .
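
A∗ is infinite whenever A contains a nonempty string, so code can only enumerate a finite slice of it. The sketch below (mine) collects A0 ∪ A1 ∪ . . . ∪ Ak, which is enough to inspect small members of A∗ and A+:

    def cat(A, B):
        return {x + y for x in A for y in B}

    def star_upto(A, k):
        # A**0 union A**1 union ... union A**k: a finite slice of A*.
        total, layer = {""}, {""}
        for _ in range(k):
            layer = cat(A, layer)
            total |= layer
        return total

    A = {"a", "ab"}
    print(sorted(star_upto(A, 2)))
    # ['', 'a', 'aa', 'aab', 'ab', 'aba', 'abab']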

Some facts about Kleene star:

• The previously introduced definition of Σ∗ is an instance of Kleene


star.

• ε is in A∗ , for every language A, including ∅∗ = {ε}.

• L ⊆ L∗ .

Example 15. An infinite language L over {a, b} for which L ≠ L∗ is the


following:
L = {an | n is odd} .

A common situation when doing proofs about Kleene star is reason-


ing with formulas of the form x ∈ L∗ , where L is a perhaps complicated
expression. A useful approach is to replace x ∈ L∗ by ∃n.x ∈ Ln before
proceeding.

Example 16. Prove A ⊆ B ⇒ A∗ ⊆ B ∗ .


Proof. Assume A ⊆ B. Now suppose that w ∈ A∗ . Therefore, there is an n
such that w ∈ An . That means w = x1 · . . . · xn where each xi ∈ A. By the
assumption, each xi ∈ B, so w ∈ B n , so w ∈ B ∗ .

That concludes the presentation of the basic mathematical objects we


will be dealing with: strings and their operations; languages and their
operations.

Summary of useful properties of languages
Since languages are just sets of strings, the identities from Section 2.2 may
freely be applied to language expressions. Beyond those, there are a few
others:
A · (B ∪ C) = (A · B) ∪ (A · C)
(B ∪ C) · A = (B · A) ∪ (C · A)
A · (B0 ∪ B1 ∪ B2 ∪ . . .) = (A · B0 ) ∪ (A · B1 ) ∪ (A · B2 ) . . .
(B0 ∪ B1 ∪ B2 ∪ . . .) · A = (B0 · A) ∪ (B1 · A) ∪ (B2 · A) ∪ . . .

A∗∗ = (A∗ )∗ = A∗
A∗ · A∗ = A∗
A∗ = {ε} ∪ A+
∅∗ = {ε}

2.5 Proof
Now we discuss proof. In this course we will go through many proofs;
indeed, in order to pass this course, you will have to write correct proofs
of your own. This raises the weighty question

What is a proof?

which has attracted much philosophical discussion over the centuries. Here
are some (only a few) answers:

• A proof is a convincing argument (that an assertion is true). This is on


the right track, but such a broad encapsulation is too vague. For ex-
ample, suppose the argument convinces some people and not others
(some people are more gullible than others, for instance).

• A proof is an argument that convinces everybody. This is too strong, since


some people aren’t rational: such a person might never accept a per-
fectly good argument.

• A proof is an argument that convinces every “sane” person. This just


pushes the problem off to defining sanity, which may be even harder!

• A proof is an argument that convinces a machine. If humans cause so
much trouble, let’s banish them in favour of machines! After all,
machines have the advantage of being faster and more reliable than
humans. In the late 19th and early 20th Centuries, philosophers and
mathematicians developed the notion of a formal proof, one which
is a chain of extremely simple reasoning steps expressed in a rig-
orously circumscribed language. After computers were invented,
people realized that such proofs could be automatically processed:
a computer program could analyze a purported proof and render a
yes/no verdict, simply by checking all the reasoning steps.
This approach is quite fruitful (it’s my research area) but the proofs
are far too detailed for humans to deal with: they can take megabytes
for even very simple proofs. In this course, we are after proofs that
are readable but still precise enough that mistakes can easily be caught.

• A proof is an argument that convinces a skeptical but rational person who


has knowledge of the subject matter. Such as me and the TAs. This is
the notion of proof adopted by professional mathematicians, and we
will adopt it. One consequence of this definition is that there may
be grounds for you to believe a proof and for us not to. In that case,
dialogue is needed, and we encourage you to come to us when our
proofs don’t convince you (and when yours don’t convince us).

2.5.1 Review of proof terminology


Following is some of the specialized vocabulary surrounding proofs:

Definition The introduction of a new concept, in terms of existing con-


cepts. An example: a prime number is defined to be a number greater
than 1 whose factors are just 1 and itself. Formally, this is expressed
as a predicate on elements of N, and it relies on a definition of when
a number evenly divides another:

divides(x, y) = ∃z.x ∗ z = y
prime(n) = 1 < n ∧ ∀k. divides(k, n) ⇒ k = 1 ∨ k = n

Proposition A statement thought to be true.

Conjecture An unproved proposition. A conjecture has the connotation
that the author has attempted—but failed—to prove it.

Theorem A proved proposition/conjecture.


Lemma A theorem. A lemma is usually a stepping-stone to a more im-
portant theorem. However, some lemmas are quite famous on their
own, e.g., König’s Lemma, since they are so often used.
Corollary A theorem; generally a simple consequence of another theorem.

2.5.2 Review of methods of proof


Often students are perplexed at how to write a proof down, even when
they have a pretty good idea of why the assertion is true. Some of the
following comments and suggestions could help; however, we warn you
that knowing how to do proofs is a skill that is learned by practice. We
will proceed syntactically.

To prove A ⇒ B: The standard way to prove this is to assume A, and


use that extra ammunition to prove B. Another (equivalent) way that is
sometimes convenient is the contrapositive: assume ¬B and prove ¬A.

To prove A iff B: There are three ways to deal with this (the first one is
most common):
• prove A ⇒ B and also B ⇒ A
• prove A ⇒ B and also ¬A ⇒ ¬B

• find an intermediate formula A′ and prove


– A iff A′ and
– A′ iff B.
These can be strung together in an ‘iff chain‘ of the form:

A iff A′ iff A′′ iff . . . iff B.

To prove A ∧ B: Separately prove A and B.

To prove A ∨ B: Rarely happens. Select which-ever of A, B seems to be
true and prove it.

To prove ¬A: Assume A and prove a contradiction. This will be dis-


cussed in more depth later.

To prove ∀x. A: In order to prove a universally quantified statement, we


have to prove it taking an arbitrary but fixed element to stand for the quan-
tified variable. As an example statement, let’s take “for all n, n is less than
6 implies n is less than 5”. (Of course this isn’t true, but nevermind.) More
precisely, we’d write ∀n. n < 6 ⇒ n < 5. In other words, we’d have to
show

u<6⇒u<5
for an arbitrary u.
Not all universal statements are proved in this way. In particular, when
the quantification is over numbers (or other structured data, such as strings),
one often uses induction or case analysis. We will discuss these in more
depth shortly.

To prove ∃x. A: Supply a witness for x that will make A true. For ex-
ample, if we needed to show ∃x. even(x) ∧ prime(x), we would give the
witness 2 and continue on to prove even(2) ∧ prime(2).

Proof by contradiction In a proof of proposition P by contradiction, we


begin by assuming that P is false, i.e., that ¬P is true. Then we use this as-
sumption to derive a contradiction, usually by proving that some already
established fact Q is false. But that can’t be, since Q has a proof. We have
a contradiction. Then we reason that we must have been mistaken to as-
sume ¬P , so therefore ¬P is false. Hence P is true after all.
It must be admitted that this is a more convoluted method of proof
than the others, but it often allows very nice arguments to be given.
It’s an amazing fact that proof by contradiction can be understood in
terms of programming: the erasing of all the reasoning between the ini-
tial assumption of ¬P and the discovery of the contradiction is similar to
what happens if an exception is raised and caught when a program in Java

or ML executes. This correspondence was recognized and made mathe-
matically precise in the late 1980’s by Tim Griffin, then a PhD student at
Cornell.

2.5.3 Some simple proofs


We now consider some proofs about sets and languages. Proving equality
between sets P = Q can be reduced to proving P ⊆ Q ∧ Q ⊆ P , or,
equivalently, by showing

∀x. x ∈ P iff x ∈ Q

We will use the latter in the following proof, which will exercise some basic
definitions.

Example 17 (A̅∩̅B̅ = A̅ ∪ B̅). This proposition is equivalent to ∀x. x ∈ A̅∩̅B̅ iff
x ∈ (A̅ ∪ B̅). We will transform the lhs (left-hand side) into the rhs (right-hand
side) by a sequence of iff steps, most of which involve expansion of definitions.
Proof.

x ∈ A̅∩̅B̅ iff x ∈ (U − (A ∩ B))
         iff x ∈ U ∧ x ∉ (A ∩ B)
         iff x ∈ U ∧ (x ∉ A ∨ x ∉ B)
         iff (x ∈ U ∧ x ∉ A) ∨ (x ∈ U ∧ x ∉ B)
         iff (x ∈ U − A) ∨ (x ∈ U − B)
         iff (x ∈ A̅) ∨ (x ∈ B̅)
         iff x ∈ (A̅ ∪ B̅)

Such ‘iff’ chains can be quite a pleasant way to present a proof.

Example 18 (ε ∈ A iff A+ = A∗ ). Recall that A+ = A · A∗ .


Proof. We’ll proceed by cases on whether or not ε ∈ A.

ε ∈ A. Then A = {ε} ∪ A, so
A+ = A1 ∪ A2 ∪ . . .
= ({ε} ∪ A) ∪ A2 ∪ . . .
= A0 ∪ A1 ∪ A2 ∪ . . .
= A∗

ε∈
/ A. Then every string in A has length greater than 0, so every string
in A+ has length greater than 0. But ε, which has length 0, is in A∗ ,
so A∗ ≠ A+ . [Merely noting that A ≠ {ε} ∪ A and concluding that
A∗ ≠ A+ isn’t sufficient, because you have to make the argument
that ε doesn’t somehow get added in the A2 ∪ A3 ∪ . . ..]

Example 19 ((A ∪ B)R = AR ∪ B R ). The proof will be an iff chain.


Proof.
x ∈ (A ∪ B)R iff x ∈ {y R | y ∈ A ∪ B}
iff x ∈ {y R | y ∈ A ∨ y ∈ B}
iff x ∈ ({y R | y ∈ A} ∪ {y R | y ∈ B})
iff x ∈ (AR ∪ B R )

Example 20. Let A = {w ∈ {0, 1}∗ | w has an unequal number of 0s and 1s}.
Prove that A∗ = {0, 1}∗ .
Proof. We show that A∗ ⊆ {0, 1}∗ and {0, 1}∗ ⊆ A∗ . The first assertion is
easy to see, since any set of binary strings is a subset of {0, 1}∗ . For the
second assertion, the theorem in Example 16 lets us reduce the problem to
showing that {0, 1} ⊆ A, which is true, since 0 ∈ A and 1 ∈ A.
Example 21. Prove that L∗ = L∗ · L∗ .
Proof. Assume x ∈ L∗ . We need to show that x ∈ L∗ · L∗ , i.e., that there
exists u, v such that x = u · v and u ∈ L∗ and v ∈ L∗ . By taking u = x and
v = ε we satisfy the requirements and so x ∈ L∗ · L∗ , as required.
Conversely, assume x ∈ L∗ · L∗ . Thus there exist u, v such that x = uv
and u ∈ L∗ and v ∈ L∗ . Now, if u ∈ L∗ , then there exists i such that u ∈ Li ;
similarly, there exists j such that v ∈ Lj . Hence uv ∈ Li+j . So there exists
an n (namely i + j) such that x ∈ Ln . So x ∈ L∗ .

Now we will move on to an example that uses proof by contradiction.
Example 22 (Euclid). The following famous theorem has an elegant proof
that illustrates some of our techniques, proof by contradiction in particu-
lar. The English statement of the theorem is
The prime numbers are an infinite set.
Re-phrasing this as For every prime, there is a larger one, we obtain, in math-
ematical notation:

∀m. prime(m) ⇒ ∃n. m < n ∧ prime(n)


Before we start, we will need the notion of the factorial of a number. The
factorial of n will be written n!. Informally n! = 1 ∗ 2 ∗ 3 ∗ · · · ∗ (n − 1) ∗ n.
Formally, we can define factorial by recursion:

0! = 1
(n + 1)! = (n + 1) ∗ n!
Proof. Towards a contradiction, assume the contrary, i.e., that there are
only finitely many primes. That means there’s a largest one, call it p.
Consider the number k = p! + 1. Now, k > p so k is not prime, by our
assumption. Since k is not equal to 1, it has a prime factor. Formally,

∃q. divides(q, k) ∧ prime(q)


In a complete proof, we’d prove this fact as a separate lemma, but here we
will take it as given. Now, q ≤ p, since q is prime and p is the largest prime. Then q divides p!, since
p! = 1 ∗ . . . ∗ q ∗ . . . ∗ p. Thus we have established that q divides both p!
and p! + 1. However, the only number that can evenly divide n and n + 1
is 1. But 1 is not prime. Contradiction. All the intermediate steps after
making our assumption were immaculate, so our assumption must have
been faulty. Therefore, there are infinitely many primes.
This is a beautiful example of a rigorous proof of the sort that we’ll be
reading and—we hope—writing. Euclid’s proof is undoubtedly slick, and
probably passed by in a blur, but that’s OK: proofs are not things that can
be skimmed; instead they must be painstakingly followed.
Note. The careful reader will notice that this theorem does not in fact show
that there is even one prime, let alone an infinity of them. We must display
a prime to start the sequence (the number 2 will do).

2.5.4 Induction
The previous methods we’ve seen are generally applicable. Induction, on
the other hand, is a specialized proof technique that only applies to struc-
tured data such as numbers and strings. Induction is used to prove uni-
versal properties.
Example 23. Consider the statement ∀n. 0 < n!. This statement is easy to
check, by calculation, for any particular number:

0 < 0!
0 < 1!
0 < 2!
...
but not for all of them (that would require an infinite number of cases to
be calculated, and proofs can’t be infinitely long). This is where induction
comes in: induction “bridges the gap with infinity”. How? In 2 steps:
Base Case Prove the property holds for 0: 0 < 0!, i.e., 0 < 1.

Step Case Assume the proposition for an arbitrary number, say k, and
then show the proposition holds for k + 1: thus we assume the in-
duction hypothesis (IH) 0 < k!. Now we need to show 0 < (k + 1)!. By
the definition of factorial, we need to show
0 < (k + 1) ∗ k!   (by the definition of factorial),
i.e., 0 < k ∗ k! + k!   (by distributing).

By the IH, 0 < k!, and since k ∗ k! ≥ 0, the sum is positive. And the proof is complete.

In your work, we will require that the base cases and step cases be
clearly labelled as such, and we will also need you to identify the IH in the
step case. Finally, you will also need to show when you use the IH in the
proof of the step case.
Example 24. Iterated sums, via the Σ operator, yield many problems which
can be tackled by induction. Informally, Σ_{i=0}^{n} i = 0 + 1 + . . . + (n − 1) + n.
Let's prove

∀n. Σ_{i=0}^{n} (2i + 1) = (n + 1)^2

Proof. By induction on n.

Base Case. We substitute 0 for n everywhere in the statement to be proved.

Σ_{i=0}^{0} (2i + 1) = (0 + 1)^2
iff 2 ∗ 0 + 1 = 1^2
iff 1 = 1

Step Case. Assume the IH: Σ_{i=0}^{n} (2i + 1) = (n + 1)^2. Now we need to
show the statement with n + 1 in place of n. Thus we want to show
Σ_{i=0}^{n+1} (2i + 1) = ((n + 1) + 1)^2, and proceed as follows:

Σ_{i=0}^{n+1} (2i + 1) = ((n + 1) + 1)^2
                       = (n + 2)^2
                       = n^2 + 4n + 4
Σ_{i=0}^{n} (2i + 1) + 2(n + 1) + 1 = n^2 + 4n + 4     (splitting off the last summand)
(n + 1)^2 + 2(n + 1) + 1 = n^2 + 4n + 4                (use of IH on the sum)
n^2 + 4n + 4 = n^2 + 4n + 4
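Here is a two-line empirical check of the identity just proved (a sanity check of our
own, not a substitute for the induction):

for n in range(20):
    assert sum(2 * i + 1 for i in range(n + 1)) == (n + 1) ** 2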

Example 25 (A^{m+n} = A^m · A^n). The proof is by induction on m.

Proof. Base case. m = 0, so we need to show A^{0+n} = A^0 · A^n, i.e., that
A^n = {ε} · A^n, which is true.

Step case. Assume the IH: A^{m+n} = A^m · A^n.
We show A^{(m+1)+n} = A^{m+1} · A^n as follows:

    A^{(m+1)+n} = A^{m+1} · A^n
iff A^{1+(m+n)} = A · A^m · A^n
iff A^{1+(m+n)} = A · A^{m+n}        (use of IH)
iff A · A^{m+n} = A · A^{m+n}

Example 26. Show that (L∗)∗ = L∗.
Proof. The 'right-to-left' direction is easy since A ⊆ A∗, for all A. Thus it
remains to show (L∗)∗ ⊆ L∗. Assume x ∈ (L∗)∗. We wish to show x ∈ L∗. By
the assumption there is an n such that x ∈ (L∗)^n. We now induct on n.
Base case. n = 0, so x ∈ (L∗)^0, i.e., x ∈ {ε}, i.e., x = ε. This completes the
base case, as ε is certainly in L∗.
Step case. Let the IH be ∀x. x ∈ (L∗)^n ⇒ x ∈ L∗. We want to show x ∈
(L∗)^{n+1} ⇒ x ∈ L∗. Thus, assume x ∈ (L∗)^{n+1}, i.e., x ∈ L∗ · (L∗)^n. This
implies that there exist u, v such that x = uv, u ∈ L∗ and v ∈ (L∗)^n. By the IH, we
have v ∈ L∗. But then we have x = uv ∈ L∗ because A∗ · A∗ = A∗, for all A, as
was shown in Example 21.

Here’s a fun one. Suppose we have n straight lines (infinite in both


directions) drawn in the plane. Then it is always possible to assign 2 colors
(say black and white) to the resulting regions so that adjacent regions have
different colors. For example, if our lines have a grid shape, this is clearly
possible:

Is it also possible in general? Yes.

Example 27 (Two-coloring regions in the plane). Given n lines drawn in


the plane, it is possible to color the resulting regions black or white such
that adjacent regions (those bordering the same line) have different colors.
Let’s look at a less regular example:

This can be 2-colored as follows:

There is a systematic way to achieve this, illustrated by what happens


if we add a new line into the picture. Suppose we add a new (dashed) line
to our example:

Now pick one side of the line (the left, say), and ‘flip’ the colors of the
regions on that side. Leave the coloring on the right side alone. This gives us

which is again 2-colored. Now let’s see how to prove that this works in
general.
Proof. By induction on the number of lines on the plane.

Base case. If there are no lines on the plane, then pick a color and color
the plane with it. Since there are no adjacent regions, the property
holds.

Step case. Suppose the plane has n lines on it. The IH says that the regions
they determine can be 2-colored so that adjacent regions have different
colors. Now we add a line ℓ to the plane, and recolor regions on the left
of ℓ as stipulated above. Now consider any
two adjacent regions. There are three possible cases:

1. Both regions are on the non-flipped part of the plane. In that


case, the IH says that the two regions have different colors.
2. Both regions are in the part of the plane that had its colors flipped.
In that case, the IH says that the two regions had different col-
ors. Flipping them results in the two regions again having dif-
ferent colors.
3. The two regions are separated by the newly drawn line, i.e., ℓ
divided a region into two. Now, the new sub-region on the right
of ℓ stays the same color, while the new sub-region on the left is
flipped. So the property is preserved.
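The flipping construction also suggests a closed form: a region's final color is
determined by the parity of the number of lines it lies on the 'flip' side of. The
following sketch is our own reading of the proof, and the sample lines are arbitrary;
it colors a point that way, so that crossing a single line always changes the color.

def color(point, lines):
    # Each line is given as (a, b, c), standing for the line a*x + b*y + c = 0.
    x, y = point
    flips = sum(1 for (a, b, c) in lines if a * x + b * y + c > 0)
    return 'black' if flips % 2 == 0 else 'white'

lines = [(1, 0, 0), (0, 1, -2)]                      # the lines x = 0 and y = 2
print(color((-1, 0), lines), color((1, 0), lines))   # neighbours across x = 0 get different colors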

Example 28 (Incorrect use of induction). Let’s say that a set is monochrome
if all elements in it are the same color. The following argument is flawed.
Why?

1. Buy a pig; paint it yellow.

2. We prove that all finite sets of pigs are monochrome by induction on


the size of the set:

Base case. The only set of size zero is the empty set, and clearly the
empty set of pigs is monochrome.
Step case. The inductive hypothesis is that any set with n pigs is
monochrome. Now we show that any set {p1 , . . . , pn+1 } consist-
ing of n + 1 pigs is also monochrome. By the IH, we know that
{p1 , . . . , pn } is monochrome. Similarly, we know that {p2 , . . . , pn+1 }
is also monochrome. So pn+1 is the same color as the pigs in
{p1 , . . . , pn }. Therefore {p1 , . . . , pn+1 } is monochrome.

3. Since all finite sets of pigs are monochrome, the set of all pigs is
monochrome. Since we just painted our pig yellow, it follows that
all pigs are painted yellow.

[Flaw: We make two uses of the IH in the proof, and implicitly take two
pigs out of {p1 , . . . , pn+1 }. That means that {p1 , . . . , pn+1 } has to be of size
at least two. Suppose it is of size 2, i.e., consider some two-element set of
pigs {p1 , p2 }. Now {p1 } is monochrome and so is {p2 }. But the argument
in the proof doesn’t force every pig in {p1 , p2 } to be the same color.]

Strong Induction
Occasionally, one needs to use a special kind of induction called strong,
or complete, induction to make a proof work. The difference between this
kind of induction and ordinary induction is the following: in ordinary in-
duction, the induction step is just that we assume the property P holds for
n and use that as a tool to show that P holds for n + 1; in strong induction,
the induction hypothesis is that P holds for all m strictly smaller than n
and the goal is to show that P holds for n.
Specified formally, we have

Mathematical induction

∀P. P (0) ∧ (∀m. P (m) ⇒ P (m + 1)) ⇒ ∀n. P (n).

Strong induction

∀P. (∀n. (∀m. m < n ⇒ P (m)) ⇒ P (n)) ⇒ ∀k. P (k).

Some remarks:

• Sometimes students are puzzled because strong induction doesn’t


have a base case. That is true, but often proofs using strong induction
require a case split on whether a number is zero or not, and this is
effectively considering a base case.

• Since strong induction allows the IH to be assumed for any smaller


number, it may seem that more things can be proved with strong
induction, i.e., that strong induction is more powerful than mathe-
matical induction. This is not so: mathematical induction and strong
induction are inter-derivable (you can prove each from the other),
and consequently, any proof using strong induction can be changed
to use mathematical induction. However, this fact is mainly of the-
oretical interest: you should feel free to use whatever kind of induc-
tion works.
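To see the shape of a strong-induction argument in executable form, consider the fact
that every integer n ≥ 2 has a prime factorization: either n is prime, or n = a · b with
both factors strictly smaller, so the 'IH' applies to each factor. A small sketch of that
recursion (ours, for illustration only):

def factorize(n):
    assert n >= 2
    for d in range(2, n):
        if n % d == 0:
            # n = d * (n // d); both pieces are strictly smaller, so recurse on them
            return factorize(d) + factorize(n // d)
    return [n]                    # no proper divisor: n is prime (the base-like case)

print(factorize(360))             # [2, 2, 2, 3, 3, 5]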

The following theorem about languages is interesting and useful; more-


over, its proof uses both strong induction and ordinary induction. Warn-
ing: the statement and proof of this theorem are significantly more difficult
than the material we have encountered so far!

Example 29 (Arden's Lemma). Assume that A and B are two languages
with ε ∉ A. Also assume that X is a language having the property X =
(A · X) ∪ B. Then X = A∗ · B.
Proof. Showing X ⊆ A∗ · B and A∗ · B ⊆ X will prove the theorem.

1. Suppose w ∈ X; we want to show w ∈ A∗ · B. We proceed by com-


plete induction on the length of w. Thus the IH is that y ∈ A∗ · B, for
any y ∈ X strictly shorter than w. We now consider cases on w:

• w = ε, i.e., ε ∈ X, or ε ∈ (A · X) ∪ B. But note that ε ∉ (A · X), by
the assumption ε ∉ A. Thus we have ε ∈ B, and so ε ∈ A∗ · B,
as desired.
• w ≠ ε. Since w ∈ (A · X) ∪ B, we consider the following cases:
(a) w ∈ (A · X). Since ε ∉ A, there exist u, v such that w = uv,
u ∈ A, v ∈ X, and len(v) < len(w). By the IH, we have
v ∈ A∗ · B; hence, by the semantics of Kleene star, we have
uv = w ∈ A∗ · B, as required.
(b) w ∈ B. Then w ∈ A∗ · B, since ε ∈ A∗.

2. We wish to prove A∗ · B ⊆ X, which can be reduced to the task of
showing ∀n. A^n · B ⊆ X. The proof proceeds by ordinary induction:

(a) Base case. A^0 · B = B ⊆ (A · X) ∪ B = X.

(b) Step case. Assume the IH: A^n · B ⊆ X. From this we obtain
A · A^n · B ⊆ A · X, i.e., A^{n+1} · B ⊆ A · X. Hence A^{n+1} · B ⊆ X,
since A · X ⊆ (A · X) ∪ B = X.
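Arden's Lemma can also be sanity-checked on finite approximations: cut every language
off at strings of length at most N and verify X = (A · X) ∪ B for X = A∗ · B. The sketch
below is our own; the bound N and the sample languages A and B are arbitrary choices
(note ε ∉ A, as the lemma requires).

N = 6

def cat(X, Y):
    return {x + y for x in X for y in Y if len(x + y) <= N}

def star(A):
    S = {''}
    while True:                   # accumulate A^0, A^1, ... until no new short strings appear
        T = S | cat(A, S)
        if T == S:
            return S
        S = T

A = {'a', 'ab'}
B = {'', 'b'}
X = cat(star(A), B)               # A* . B, restricted to strings of length <= N
assert X == cat(A, X) | B         # the equation of the lemma, on strings of length <= N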

Chapter 3

Models of Computation

Now we start the course for real. The questions we address in this part
of the course have to deal with models for sequential computation1 in a
setting where there are no resource limits (time and space). Here are a few
of the questions that arise:

• Is ‘C‘ as powerful as Java? More powerful? What about Lisp, or ML,


or Perl?

• What does it mean to be more powerful anyway?

• What about assembly language (say for the x86). How does it com-
pare? Are low-level languages more powerful than high-level lan-
guages?

• Can programming languages be compared by expressiveness, some-


how?

• What is an algorithm?

• What can computers do in principle, i.e., without restriction on time,


space, and money?

• What can’t computers do? For example, are there some optimizations
that a compiler can’t make? Are there purported programming tasks
that can’t be implemented, no matter how clever the programmer(s)?
1
We won’t consider models for concurrency, for example.

This is undoubtedly a collection of serious questions, and we should
say how we go about investigating them. First, we are not going to use any
particular real-world programming language: they tend to be too big.2 In-
stead, we will deal with relatively simple machines. Our approach will be
to convince ourselves that the machines are powerful enough to compute
whatever general-purpose computers can, and then to go on to consider
the other questions.

3.1 Turing Machines


Turing created his machine3 to answer a specific question of interest in the
1930’s: what is an algorithm? His analysis of computation—well before
computers were invented—was based on considering the essence of what
a human doing a mechanical calculation would need, provided the job
needed no creativity. He reasoned as follows:

• No creativity implies that each step in the calculation must be fully


spelled out.

• The list of instructions followed must be finite, i.e., programs must


be finite objects.

• Each individual step in the calculation must take a finite amount of


time to complete.

• Intermediate results may need to be calculated, so a scratch-pad area


is needed.

• There has to be a way to keep track of the current step of the calcula-
tion.

• There has to be a way to view the complete current state of the cal-
culation.

Turing’s idea was machine-based. A Turing machine (TM) is a machine


with a finite number of control states and an infinite tape, bounded at the
2
However, some models-of-computation courses have used Scheme.
3
For Turing’s original 1936 paper, see “On Computable Numbers, with an Application
to the Entscheidungsproblem”, on the class webpage.

left and stretching off to the right. The tape is divided into cells, each of
which can hold one symbol. The input of the machine is a string w =
a1 · a2 · . . . · an initially written on the leftmost portion of the tape, followed
by an infinite sequence of blanks ( ):

a1 a2 · · · an−1 an ···
The machine is able to move a read/write head left and right over the
tape as it performs its computation. It can read and write symbols on the
tape as it pleases. These considerations led Turing to the following formal
definition.
Definition 1 (Turing Machine). A Turing machine is a 7-tuple

(Q, Σ, Γ, δ, q0 , qA , qR )

where
• Q is a finite set of states.

• Σ is the input alphabet, which never includes blanks.

• Γ is the tape alphabet, which always includes blanks. Moreover, ev-


ery input symbol is in Γ.

• δ : Q × Γ → Q × Γ × {L, R} is the transition function, where L and


R are directions, telling the machine head which direction to go in a
step.

• q0 ∈ Q is the start state

• qA ∈ Q is the accept state

• qR ∈ Q is the reject state. qA ≠ qR .


The program that a Turing machine executes is embodied by the transition
function δ. Conceptually, the following happens when a transition

δ(qi , a) = (qj , b, d)

is made by a Turing machine M:


• M writes b to the current tape cell, overwriting a.

• The current state changes from qi to qj .
• The tape head moves to the left or right by one cell, depending on
whether d is L or R.
Example 30. We’ll build a TM that merely moves all the way to the end of
its input and stops. The states of the machine will just be {q0 , qA , qR }. (We
have to include qR as a state, even though it will never be entered.) The
input alphabet Σ = {0, 1}, for simplicity. The tape alphabet Γ = Σ ∪ {␣}
includes the blank symbol ␣, but is otherwise the same as the input alphabet. All that is
left to specify is the transition function. The machine simply moves right
along the tape until it hits a blank, then halts. Thus, at each step, it just
writes back the current symbol, remains in q0 , and moves right one cell:
δ(q0 , 0) = (q0 , 0, R)
δ(q0 , 1) = (q0 , 1, R)
Once the machine hits a blank (written ␣), it moves one cell to the left and stops:
δ(q0 , ␣) = (qA , ␣, L)
Notice that if the input string is ε, the first step the machine makes is mov-
ing left from the leftmost cell: it can’t do that, so the tape head just stays
in the leftmost cell.
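Since δ is just a finite table, a Turing machine is easy to represent as data. Here is
the machine above written down as a Python dictionary (a sketch of ours; the names and
the use of a space character for the blank are arbitrary conventions):

BLANK = ' '                      # stand-in for the blank tape symbol

M = {
    'states':  {'q0', 'qA', 'qR'},
    'input':   {'0', '1'},
    'tape':    {'0', '1', BLANK},
    'delta':   {('q0', '0'):   ('q0', '0', 'R'),
                ('q0', '1'):   ('q0', '1', 'R'),
                ('q0', BLANK): ('qA', BLANK, 'L')},
    'start':   'q0',
    'accept':  'qA',
    'reject':  'qR',
}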
Turing machines can also be represented by transition diagrams. A tran-
sition δ(qi , a) = (qj , b, d) between state qi and qj can be drawn as
a/b, d
qi qj

and means that if the machine is in state qi and the current cell has an a
symbol, then the current cell is updated to have a b symbol, the tape head
moves one cell to the left or right (according to whether d = L or d = R),
and the current state becomes qj .
For the current example, the state diagram is quite simple:

0/0, R
1/1, R
/ ,L
q0 qA

Example 31 (Unary addition). (Worked out in class.) Although Turing
machines manipulate symbols and not numbers, they are quite often used
to compute numerical functions such as addition, subtraction, multipli-
cation, etc. To take a very simple example, suppose we want to add two
numbers given in unary, i.e., as strings over Σ = {1}. In this representa-
tion, for example, 3 is represented by 111 and 0 is represented by ε. The
two strings to be added will be separated by a marker symbol X. Thus, if
we wanted to add 3 and 2, the input would be

1 1 1 X 1 1 ···

and the output should be

1 1 1 1 1 ···
Here is the desired machine. It traverses the first number, then replaces
the X with 1, then copies the second number, then erases the last 1 before
accepting.

1/1, R 1/1, R
X/1, R / ,L 1/ /L
q0 q1 q2 qA

Note. Normally, Turing machines are used to accept or reject strings. In


this example, the machine computes the unary addition of the two inputs
and then always halts in the accept state. This convention is typically used
when Turing machines are used to compute functions rather than just say-
ing “yes” or “no” to input strings.

Example 32. We now give a transition diagram for a Turing machine M


that recognizes
{w · w^R | w ∈ {0, 1}∗ }, where w^R denotes the reverse of w.
Example strings in this language are ε, 00, 11, 1001, 0110, 10100101, . . ..

0/0, R
1/1, R 1/1, R
/ ,R qR
/ ,L
q2 q3
0/ , R 0/ , L

/ ,R 0/0, L
q1 q4
1/1, L

1/ , R 1/ , L
/ ,L / ,L
q5 q6
qA 0/0, R qR
0/0, R / ,R
1/1, R

The general idea is to go from the ‘outside-in’ on the input string, can-
celling off equal symbols at each end. The loop q1 → q2 → q3 → q4 → q1
replaces the leading symbol (a 0) with a blank, then moves to the right-
most uncancelled symbol, checks that it is a 0, overwrites it with a blank,
then moves to the leftmost uncancelled symbol. If there isn’t one, then the
machine accepts. The lower loop q1 → q5 → q6 → q4 → q1 is essentially
the same as the upper loop, except that it cancels off a matching pair of
1s from each end. If the sought-for 0 (or 1) is not found at the rightmost
uncancelled symbol, then the machine rejects (from q3 or q6 ).
Now back to some more definitions. A configuration is a snapshot of the
complete state of the machine.
Definition 2 (Configuration). A Turing machine configuration is a triple
⟨ℓ, q, r⟩, where ℓ is a string denoting the tape contents to the left of the tape
head and r is a string representing the tape to the right of the tape head.
Since the tape is infinite, there is a point past which the tape is nothing but
blanks. By convention, these are not included in r.4 The leftmost symbol
of r is the current tape cell. The state q is the current state of the machine.
A Turing machine starts in the configuration ⟨ε, q0 , w⟩ and repeatedly
makes transitions until it ends up in qA or qR . Note that a machine may
4
However, this is not completely correct, since the machine may, for example, be given
the empty string as input, in which case r must have at least one blank.

never end up in qA or qR , in which case it is said to be looping or diverging.
After all, we would certainly want to model programs that never stop: in
many cases such programs are useless, but they are undeniably part of
what we understand by computation.

Transitions and configurations How do the transitions of a machine af-


fect the configuration? There are two cases.
1. If we are moving right, there is always room to keep going right. On
transition δ(qi , a) = (qj , b, R) the configuration change is

⟨u, qi , a · w⟩ −→ ⟨u · b, qj , w⟩.

If the rightward portion w of the tape is ε, blanks are added as needed.

2. If we are moving left by δ(qi , a) = (qj , b, L) then the configuration
changes as follows:
• When there is room to move left: ⟨u · c, qi , a · w⟩ −→ ⟨u, qj , c · b · w⟩.
• Moving left but up against the left end of the tape: ⟨ε, qi , a · w⟩ −→
⟨ε, qj , b · w⟩.
(A small executable sketch of these rules is given below.)
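Here is a minimal executable sketch of those rules (ours; it represents a configuration
as a triple of Python strings and the blank as a space character):

BLANK = ' '

def step(config, delta):
    left, state, right = config
    if right == '':
        right = BLANK                             # blanks are added as needed on the right
    a, rest = right[0], right[1:]
    state2, b, d = delta[(state, a)]
    if d == 'R':
        return (left + b, state2, rest)
    if left == '':                                # up against the left end of the tape
        return ('', state2, b + rest)
    return (left[:-1], state2, left[-1] + b + rest)

def run(delta, start, accept, reject, w, max_steps=1000):
    config = ('', start, w)
    while config[1] not in (accept, reject) and max_steps > 0:
        config = step(config, delta)
        max_steps -= 1
    return config

# The machine of Example 30, driven by the functions above:
delta = {('q0', '0'): ('q0', '0', 'R'),
         ('q0', '1'): ('q0', '1', 'R'),
         ('q0', BLANK): ('qA', BLANK, 'L')}
print(run(delta, 'q0', 'qA', 'qR', '0110'))       # ('011', 'qA', '0 ')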
A sequence of linked transitions starting from the initial configuration is
said to be an execution.
Definition 3 (Execution). An execution of Turing machine M on input w is
a possibly infinite sequence

⟨ε, q0 , w⟩ −→ . . . −→ ⟨u, qi , w⟩ −→ ⟨u′ , qj , w ′ ⟩ −→ . . .

of configurations, starting with the configuration ⟨ε, q0 , w⟩, where the con-
figuration at step i + 1 is derived by making a transition from the configu-
ration at step i.
A terminating execution is one which ends in an accepting configura-
tion ⟨u, qA , w⟩ or a rejecting configuration ⟨u, qR , w⟩, for some u, w.
Remark. The following distinctions are important:
• M accepts w iff the execution of M on w is terminating and ends in
the accept state:

⟨ε, q0 , w⟩ −→ · · · −→ ⟨ℓ, qA , r⟩

• M rejects w iff the execution of M on w is terminating and ends in the
reject state:

⟨ε, q0 , w⟩ −→ · · · −→ ⟨ℓ, qR , r⟩

• M does not accept w iff M rejects w or M loops on w.


Now we make a definition that relates Turing machines to sets of strings.
Definition 4 (Language of a Turing machine). The language of a Turing
machine M, L(M), is the set of strings accepted by M.

L(M) = {x | M halts and accepts x}.


Definition 5 (Computation of function by Turing machine). Turing ma-
chines are defined so that they can only accept or reject input (or loop).
But most programs compute functions, i.e., deliver answers beyond just
‘yes’ or ‘no’. To achieve this is simple. A TM M is said to compute a func-
tion f if, when given input w in the domain of f , the machine halts in its
accept state with f (w) written (leftmost) on the tape.

3.1.1 Example Turing Machines


Now we look at a few more examples.
Example 33 (Left edge detection). Let’s revisit Example 31, which imple-
mented addition of numbers in unary representation. There, we made a
pass over the input strings, then scrubbed off a trailing 1, then halted, leav-
ing the tape head at the end of the unary sum. Instead, we want a machine
that moves the head all the way back to the leftmost cell after performing
the addition. And this reveals a problem. Here is a diagram of an incorrect
machine, formed by adapting that of Example 31.

1/1, R 1/1, R 1/1, L


X/1, R / ,L 1/ /L ?/?/?
q0 q1 q2 q3 qA

Once the machine enters state q3 , it has performed the addition and
now uses a loop to move the tape head leftmost. But when the machine is
moving left in a loop, such as in state q3 , there is a difficulty: the machine
should leave the loop once it bumps into the left edge of the tape. But once

the tape head reaches the leftmost cell, the machine will repeatedly try to
move left on a ‘1’, unaware that it is overwriting the same ‘1’ eternally.
There are two ways to deal with this problem:

• Before starting the computation, attach a special marker (e.g., ⋆) to


the front of the input string. Thus, if the input to the machine is w,
change it so that the machine starts up with ⋆w as the initial contents
of the tape. Provided that ⋆ is never written anywhere else on the
tape, the machine can easily detect ⋆ and break out of its ‘move-left’
behaviour. One can either require that the input come already prefixed
with ⋆, or take the input, shift it all right by one cell, and put ⋆ in the
first cell. Assuming the former, the
following machine results.

1/1, R 1/1, R 1/1, L


⋆/⋆, R X/1, R / ,L 1/ /L ⋆/⋆/L
q0 q1 q2 q3 q4 qA

• When making a looping scan to the leftmost cell, add some special-
purpose code to detect the left edge. We know that when the ma-
chine bumps into the left edge, it writes the new character on top of
the old and then can’t move the tape head. The idea is to write a
‘marked’ version of the symbol on the tape and attempt to move left.
In the next step, if the marked symbol is seen, then the machine must
be at the left edge and the loop can be exited. If the marked symbol is
not seen, then the machine has been able to move left, and we go back
and ‘erase’ the mark from the symbol before continuing. For the cur-
rent example this yields the following machine, where the leftward
loop in state q3 has been replaced by the loop q3 → q4 → q5 → q3 .

1/1, R 1/1, R
X/1, R / ,L 1/ , L 1/1̇, L 1̇/1, L
q0 q1 q2 q3 q4 qA

1/1/R
1̇/1, L
q5

Here is an execution of the second machine on the input 111X11:

⟨ε, q0 , 111X11⟩
⟨1, q0 , 11X11⟩
⟨11, q0 , 1X11⟩
⟨111, q0 , X11⟩
⟨1111, q1 , 11⟩
⟨11111, q1 , 1⟩
⟨111111, q1 , ⟩
⟨11111, q2 , 1 ⟩
⟨1111, q3 , 1 ⟩
⟨111, q4 , 11̇ ⟩
⟨1111, q5 , 1̇ ⟩
⟨111, q3 , 11 ⟩
⟨11, q4 , 11̇1 ⟩
⟨111, q5 , 1̇1 ⟩
⟨11, q3 , 111 ⟩
⟨1, q4 , 11̇11 ⟩
⟨11, q5 , 1̇11 ⟩
⟨1, q3 , 1111 ⟩
⟨ε, q4 , 11̇111 ⟩
⟨1, q5 , 1̇111 ⟩
⟨ε, q3 , 11111 ⟩
⟨ε, q4 , 1̇1111 ⟩
⟨ε, qA , 11111 ⟩

In the next example, we will deal with strings of balanced parentheses.


We give the name BAL to the set of all such strings. The following strings

ε, (), ((())), (())()()

are members of BAL.


Example 34 (BAL—first try). Give a transition diagram for a Turing ma-
chine that accepts only the strings in BAL. First, of course, we need to
have a high-level algorithm in mind: the following is a reasonable start:

1. Search right for a ‘)’.


2. If one is not found, scan left for a ‘(’. If found, reject, else accept.

3. Otherwise, overwrite the ‘)’ with an ‘X’ and scan left for a ‘(’.
4. If one is found, overwrite it with ‘X’ and go to 1. Otherwise, reject.

The following diagram captures this algorithm. It is not quite right, be-
cause of left-edge detection problems, but it is close.

X/X, R
(/(, R X/X, L
)/X, L
q0 q1
(/X, R
/ ,L

X/X, L q2 qR
(/(, R
???

qA

In state q0 , we scan right, skipping open parens and X’s, looking for a
closing parenthesis, and transition to state q1 when one is found. If one is
not found, we must hit blanks, in which case we transition to state q2 .
q1 If we find ourselves in q1 , we’ve found the first ‘)’ and replaced it with
an ‘X’. Now we have to scan left and find the matching open paren-
thesis, skipping over any ‘X’ symbols. (Caution: left edge detection
needed!) Once the first open paren to the left is found, we over-write
it with an ‘X’ and go to state q0 . Thus we have successfully cancelled
off one pair of matching parens, and can go to the beginning of the
loop, i.e., q0 , to look for another pair.
q2 If we find ourselves in q2 , we have unsuccessfully searched to the right
looking for a closing paren. That means that every closing paren has
been paired up with an open paren. However, we must still deal with
the possibility that there are more open parens than closing parens
in the input, in which case we should reject. So we search back left

looking for a remaining open paren. If none exist, we accept; other-
wise we reject.
Thus, in state q2 we scan to the left, skipping over ‘X’ symbols. If we
encounter an open paren, we transition to state qR and reject. If we
don’t, then we ought to accept.

Now we will re-do the BAL example properly, using both ways of de-
tecting the left edge.

Example 35 (BAL done right (1)). We expect a ⋆ in the first cell, followed
by the real input.

X/X, R
(/(, R X/X, L
)/X, L
⋆/⋆, R
s q0 q1
(/X, R
/ ,L ⋆/⋆, L

X/X, L q2 qR
(/(, R
⋆/⋆, L

qA

The language recognized by this machine is {⋆} · BAL.

Example 36 (BAL done right (2)). Each loop implementing a leftward scan
is augmented with extra states. The naive (incorrect) loop implementing
the left scan at q1 is replaced by a loop q1 → q4 → q5 → q1 , which is exited
either by encountering an open paren (transition to q0 ) or by bumping
against the left edge (no corresponding open paren to a close paren, so
transition to reject state qR ).
Similarly, the incorrect loop implementing the left scan at q2 is replaced
by a loop q2 → q6 → q7 → q2 , which is exited either by encountering an
open paren (open paren with no corresponding close paren, so transition

to qR ) or by bumping against the left edge (no unclosed open parens, so
transition to accept state qA ).

X/X, R
(/(, R
)/X, L
q0 q1
(/X, R
/ ,L X/Ẋ, L
Ẋ/X, L
q2 q5 q4
(/(, R
)/), R
Ẋ/X, L
X/Ẋ, L Ẋ/X, L

q7 q6
(/(, R (/(, R
)/), R qR
Ẋ/X, L

qA

Example 37. Let’s try to build a TM that accepts the language

{w · w | w ∈ {0, 1}∗ }

We first need a high-level outline of our algorithm. The following steps


are needed:

1. Locate the middle of the string. With an ordinary programming lan-


guage, one would use some kind of arithmetic calculation to find the
middle index. However, TMs don’t have arithmetic operations built
in, and it can be somewhat arduous to provide them.
Instead, we adopt the following approach: mark the leftmost sym-
bol, then go to the end of the string and mark the last symbol. Then

go all the way to the left and mark the leftmost unmarked symbol.
Then go all the way to the right and mark the rightmost unmarked
symbol. Repeat until there are no unmarked symbols. Because we
have worked ‘outside-in’ this phase of processing should end up
with the tape head on the first symbol of the second half of the string.
If the string is not of even length then, at some step, the leftmost
symbol will get marked, but there will be no corresponding right-
most unmarked symbol.

2. Now we check that the two halves are equal. Starting from the first
character of the right half of the string, call it ċ, we remove the mark
and move left until the leftmost marked symbol is detected. We will
have to detect the left edge in this step! If the leftmost marked sym-
bol is indeed ċ, then we unmark it (otherwise we reject). Then we
scan right over (a) remaining marked symbols in the left half of the
string and then (b) unmarked symbols in the first part of the right
half of the string. We then either find a marked symbol, or we hit the
blanks.

3. Repeat for the second, third, etc characters. Finally, the rightward
scan for a marked symbol on the rhs doesn’t find anything and ends
in the blanks. And then we can accept.

Now that we have a good idea of how the algorithm should work, we
will go ahead and design the TM in detail. (But note that often this higher
level of description suffices to convince people that a proposed algorithm
is implementable on a TM, and actually providing the full TM description
is not necessary.)
In the transition diagram, we use several shorthand notations:

• Σ/Σ, L says that the transition replaces any symbol (so either 0 or
1) by itself and moves left. Thus Σ is being used to represent any
particular symbol in Σ, saving us from writing out two transitions.

• Σ̇ represents any marked symbol in the alphabet.

• Σ⋆ = Σ ∪ {⋆}.

• Σ␣ = Σ ∪ {␣}, i.e., the input alphabet together with the blank.

Σ/Σ, R
Σ̇ /Σ̇ , L Σ̇/Σ̇, L
2 3 qR
Σ/Σ̇, R
Σ/Σ̇, L
⋆/⋆, R Σ̇/Σ̇, R
0 1 4 Σ/Σ, L
1̇/1, L 0̇/0, L
/ ,L
Σ/Σ, L 5 qA 10 Σ/Σ, L
/ ,L
Σ̇/Σ̇, L Σ̇/Σ̇, L
1̇/1, L 0̇/0, L
Σ̇/Σ̇, L 6 9 11 Σ̇/Σ̇, L
Σ/Σ, R
Σ⋆/Σ⋆, R Σ/Σ, R Σ⋆/Σ⋆, R

7 8 12
1̇/1, R 0̇/0, R

0̇/0̇, R Σ̇/Σ̇, R 1̇/1̇, R

qR

We assume that the input is prefixed with ⋆, thus the transition from
state 0 to state 1 just hops over the ⋆. If the input is not prefixed with ⋆, there
is a transition to qR (not included in the diagram). Having got to state 1,
the first pass of processing proceeds in the loop of states 1 → 2 → 3 →
4 → 1. In state 1 the leftmost unmarked character is marked and then
there is a sweep (state 2) over unmarked characters until either a marked character
or the blanks are encountered. Then the tape head is moved one
cell to the left. (Note that Σ = {0, 1} in this example.) In state 3, we
should be at the rightmost unmarked symbol on the tape. If, however, it is
a marked symbol, that means that the leftmost unmarked symbol has no
corresponding rightmost unmarked symbol, so we reject. Otherwise, we
loop left over the unmarked symbols until we hit a marked symbol, then

move right.
We are then either at an unmarked symbol, in which case we go through
the 1 → 2 → 3 → 4 → 1 loop again, or else we are at a marked sym-
bol. In fact, this will be the first symbol in the second half of the string,
and we move to the second phase of processing. This phase features two
nearly identical loops. If the marked symbol is a 1̇, then the left loop
5 → 6 → 7 → 8 → 9 is taken; otherwise, if the marked symbol is 0̇, the
right loop 10 → 11 → 12 → 8 → 9 is taken.
We now describe the left loop. In state 1 the leftmost marked cell in
the second half of the string is a 1̇; now we traverse over the prefix of
unmarked cells in the second half (state 5); then we traverse over the suffix
of marked cells in the first half of the string (state 6). Thus we arrive either
at ⋆ or at the rightmost unmarked cell in the first half of the string, and
move right into state 7. This leaves us looking at a marked cell. We expect
the matching 1̇ to the one seen in state 1, which takes us to state 8 (after
unmarking it); if we see a 0̇, then we reject. So if we are in state 8, we have
located the matching symbol in the first half of the string, and unmarked
it. Now we move right to the next element to consider in the second half
of the string. This involves skipping over the remaining marked symbols
in the first half of the string (state 9), then the prefix of unmarked symbols
in the second half of the string (state 10).
Then we are either looking at a marked symbol, in which case we go
around the loop again (either to state 5 if it is a 1̇ or to state 10 if it is a 0̇).
Or else we are looking at the blanks, which means that there are no more
symbols to unmark, and we can accept.
We now trace the execution of M on the string 010010, by giving a
sequence of machine configurations. In several steps (17, 20, 24, 31, and
34) we use ellipsis to abbreviate a sequence of steps. Hopefully, these will
be easy to fill in!

Step Config Step Config
1 (ε, q0 , ⋆010010) 26 (⋆0̇1̇, q10 , 0̇01̇0̇)
2 (ε, q1 , 010010) 27 (⋆0̇, q11 , 1̇0̇01̇0̇)
3 (⋆, q1 , 010010) 28 (⋆, q11 , 0̇1̇0̇01̇0̇)
4 (⋆0̇, q2 , 10010) 29 (ε, q11 , ⋆0̇1̇0̇01̇0̇)
5 (⋆0̇1, q2 , 0010) 30 (⋆, q12 , 0̇1̇0̇01̇0̇)
6 (⋆0̇10, q2 , 010) 31 (⋆0, q8 , 1̇0̇01̇0̇) . . .
7 (⋆0̇100, q2 , 10) 32 (⋆01̇0̇, q8 , 01̇0̇)
8 (⋆0̇1001, q2, 0) 33 (⋆01̇0̇0, q9 , 1̇0̇)
9 (⋆0̇10010, q2, ) 34 (⋆01̇0̇, q5 , 010̇) . . .
10 (⋆0̇1001, q3, 0) 35 (⋆, q6 , 01̇0̇010̇)
11 (⋆0̇100, q4 , 10̇) 36 (⋆0, q7 , 1̇0̇010̇)
12 (⋆0̇10, q4 , 010̇) 37 (⋆01, q8 , 0̇010̇)
13 (⋆0̇1, q4 , 0010̇) 38 (⋆010̇, q8 , 010̇)
14 (⋆0̇, q4 , 10010̇) 39 (⋆010̇0, q9 , 10̇)
15 (⋆, q4 , 0̇10010̇) 40 (⋆010̇01, q9 , 0̇)
16 (⋆0̇, q1 , 10010̇) 41 (⋆010̇0, q10 , 10)
17 (⋆0̇1̇, q2 , 0010̇) . . . 42 (⋆010̇, q10 , 010)
18 (⋆0̇1̇001, q2, 0̇) 43 (⋆01, q10 , 0̇010)
19 (⋆0̇1̇00, q3 , 10̇) 44 (⋆0, q11 , 10̇010)
20 (⋆0̇1̇00, q4 , 1̇0̇) . . . 45 (⋆01, q12 , 0̇010)
21 (⋆0̇1̇, q4 , 001̇0̇) 46 (⋆010, q8, 010)
22 (⋆0̇, q1 , 1̇001̇0̇) 47 (⋆0100, q9, 10)
23 (⋆0̇1̇, q1 , 001̇0̇) 48 (⋆01001, q9, 0)
24 (⋆0̇1̇, q1 , 001̇0̇) . . . 49 (⋆010010, q9, )
25 (⋆0̇1̇0̇, q1 , 0̇1̇0̇) 50 (⋆01001, qA, 0)

End of example

How TMs work should be quickly absorbed by computer scientists.


After only a few detailed examples, such as the above, it becomes clear
that Turing machine programs are very much like assembly-language pro-
grams, probably even worse in the level of detail. However, you should
also become convinced that any program that could be coded in a high-
level language like Java or ML could also be coded in a Turing machine,
given enough effort. For example, it is tedious but not conceptually dif-

ficult to model the essential aspects of a microprocessor as a Turing ma-
chine: the ALU operations (addition, multiplication, etc.) can be imple-
mented by the standard grade-school algorithms, the registers of the ma-
chine can be placed at certain specified sections of the tape, and the random-
access memory can also be modelled by the tape. And so on.

Ways of specifying Turing machines There are three ways to specify a


Turing machine. Each is appropriate at different times.

• Low level: the transition diagram is given explicitly. This level is


only for true pedants! We sometimes ask for this, but it is often too
detailed.

• Medium level: the operations of the algorithm are phrased in higher-


level terms, but still in terms of the Turing machine model. Thus
algorithmic steps are expressed in terms of how the tape head moves
back and forth, e.g., move the tape head all the way to the right, marking
each character until the blanks are hit. Also data layout conventions are
discussed.

• High level: pseudo-code for the algorithm is given, in the standard


manner that algorithms are discussed in Computer Science. The
Church-Turing Thesis (to be discussed) will give us confidence that
such a high-level description is implementable on a Turing machine.

3.1.2 Extensions
On top of the basic Turing machine model, more convenient models can
be built. These new models still recognize the same set of languages, how-
ever.

Multiple Tapes
It can be very convenient to use Turing machines with multiple (unbounded)
tapes. For example, if asked to implement addition of binary numbers on
a Turing machine, it would be quite useful to have five tapes: a tape for
each number being added, one to hold the sum, one for the carry being

propagated, and one for holding the two original inputs. Such require-
ments can be easily implemented by an ordinary Turing machine: for n
tapes,
(picture tape 1 through tape n stacked one above the other)
we simply divide the single tape into n distinct regions, separated by spe-
cial markers, e.g., X in the following:

tape 1 X · · · X tape n−1 X tape n X ···

Although it is easy to see—conceptually, at least—how to organize the


n tapes, including dealing with the situation when a tape ‘outgrows its
boundaries’ and has to be resized, it is more difficult to see how to achieve
the control of the machine. The transition function for n tapes conceptu-
ally has to simultaneously look at n current cells, write n new contents,
and make n different moves of the n independent tape heads. Thus the
transition function will have the following specification:

δ : Q × Γ^n → Q × Γ^n × {L, R}^n

Each of the n sub-tapes will have one cell deemed to be the current cell.
We will use a system of markings to implement this idea. Since cells can’t
be marked, we will mark the contents of the current cell. This means that
the tape alphabet Γ will double in size: each symbol ai ∈ Γ will have a
marked analogue ȧi , and by convention, there will be exactly one marked
symbol per sub-tape. With this support, a move in the n-tape machine
will consist of (1) a left-to-right sweep wherein the steps prescribed by δ
are taken at each marked symbol, followed by (2) a right-to-left sweep to
reset the tape head to the left-most cell on the underlying tape.
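To make the layout concrete, here is a tiny sketch (ours) of how n sub-tape contents,
each with its own head position, could be packed into a single string, with X as the
separator and a dot marking the current cell of each sub-tape:

def pack(tapes, heads):
    chunks = []
    for tape, h in zip(tapes, heads):
        cells = [c + ('.' if i == h else '') for i, c in enumerate(tape)]
        chunks.append(''.join(cells))
    return 'X'.join(chunks)

print(pack(['101', '0110'], [0, 2]))   # 1.01X011.0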

Two-way infinite tape


A different extension would be to support a machine that has its tape ex-
tending infinitely in both directions.

··· w1 w2 · · · wn−1 wn ···

By convention, a computation would start with the tape head on the left-
most symbol of the input string. This machine model is relatively easy
(although detailed) to implement using an ordinary Turing machine. The
main technique is to ‘fold’ the doubly infinite tape in two and merge the
cells so that alternating cells on the resulting singly infinite tape belong
to the two halves of the doubly-infinite tape. It helps to think of the tape
elements as being labelled with integers. Thus if the original tape was
labelled as
· · · −3 −2 −1 0 1 2 3 · · ·
the single-tape version would be laid out as

⋆ 0 1 −1 2 −2 3 −3 · · ·

Again, the details of how the control of the machine is achieved with an
ordinary Turing machine are relatively detailed.
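A one-line sketch (ours) of the folding: cell k of the doubly infinite tape, for any
integer k, lands at the following position of the singly infinite tape, with position 0
reserved for the ⋆ marker.

def fold(k):
    return 2 * k if k >= 1 else 1 - 2 * k

print([fold(k) for k in (0, 1, -1, 2, -2, 3, -3)])   # [1, 2, 3, 4, 5, 6, 7]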
A further extension—non-determinism—can be built on top of ordi-
nary Turing machines, and is discussed later.

3.1.3 Coding and Decoding


As we have seen, TMs are very basic machines, but you should be con-
vinced that they allow all kinds of computation. They are similar, in a
way, to machine-level instructions on an x86 or ARM chip. By using (in
effect) goto statements, loops can be implemented, and the basic word
and arithmetic operations can be modelled.
A separate aspect of the modelling of more conventional computation
is how to deal with high-level data? For example, the only data accepted by
a TM are strings built from Σ. How to deal with computations involving
numbers, pairs, lists, finite sets, trees, graphs, and even Turing machines
themselves? The answer is coding: complex datastructures will be repre-
sented by using a convention for reading and writing them as strings.
Bits can be represented by 0 and 1, of course, and we all know how
numbers in N and Z are encoded in binary. A pair of elements (a, b) may
be encoded as follows. Suppose that object a encodes to a string wa and
object b encodes to wb . Then one simple way to encode (a, b) is as

◭ wa ‡ wb ◮

where ◭, ◮, and ‡ are symbols not occurring in the encoding of a and
b . The encoding of a list of objects [a1 , · · · , an ] can be implemented by
iterating the encoding of pairs, i.e., as

◭ a1 ‡ ◭ a2 ‡ ◭ · · · ‡ ◭ an−1 ‡ an ◮ · · · ◮◮◮
Finite sets can be encoded in the same way as lists. Arbitrary trees can
be represented by binary trees, and binary trees can be encoded, again, as
nested pairs. In effect, we are re-creating Lisp s-expressions. A graph is
usually represented as (V, E) where V is a finite set of nodes and E is a
finite set of edges, i.e., a set of pairs of nodes. Again, this format is easy to
encode and decode.
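A brief sketch (ours) of the nested-pair idea, using the ordinary characters '<', '#',
'>' in place of the special markers above purely for readability:

def enc_pair(a, b):
    return '<' + a + '#' + b + '>'

def enc_list(items):
    if len(items) == 1:
        return items[0]                       # a one-element list is just the element
    return enc_pair(items[0], enc_list(items[1:]))

print(enc_pair('10', '11'))                   # <10#11>
print(enc_list(['a1', 'a2', 'a3', 'a4']))     # <a1#<a2#<a3#a4>>>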
The art of coding and decoding data pervades Computer Science. There
is even a related research area known as Coding Theory, but the subject of
efficient algorithms for coding and decoding is really orthogonal to our
purposes in this course.
We shall henceforth assume that encoding and decoding of any desired
high-level data can be accomplished. We will assume that, for each par-
ticular problem intended to be solved by a TM, the input is given in an
encoding that the machine can decode correctly. Similarly, if a machine
produces output, it will likewise be decodable. For this, we will adopt the
notation hAi, to represent an object A which has been encoded to a string,
and which will be decodable by the TM to a correct representation of A. A
tuple of objects in the input will be written as hA1 , . . . , Ak i.

Encoding Turing machines


The encoding approach outlined above is quite general, but just to rein-
force the ideas we will discuss a specific encoding for Turing machines.
This format will be expected by Turing machines that take encoded Tur-
ing machines as input and calculate facts about those machines. This abil-
ity will let us formulate questions about the theoretical properties of pro-
grams that manipulate and analyze programs.
Recall that a Turing machine is a 7-tuple (Q, Σ, Γ, δ, q0 , qA , qR ). The sim-
plest possible way to represent this tuple on a Turing machine tape is to
write out the elements of the tuple from left to right on the tape, using a
delimiter to separate the components. Something like the following

wQ X wΣ X wΓ X wδ X wq 0 X wq A X wq R X ···

where wQ , wΣ , wΓ , wδ , wq0 , wqA and wqR are strings representing the compo-
nents of the machine. In detail
• Q is a finite set of states. We could explicitly list out the state ele-
ments, but will instead just write out the number of states in unary
notation.
• Σ is the input alphabet, a finite set of symbols. Our format will list
the symbols out, in no particular order, separated by blanks. Recall
that the blank is not itself an input symbol.
• Γ is the tape alphabet, a finite set of symbols with the property that
Σ ⊂ Γ. In our format, we will just list the extra symbols not already
in Σ. Blank is a tape symbol.
• δ is a function and one might think that there would be a problem
with representing it, especially since functions can have infinite do-
mains. Fortunately, δ has a finite domain (since Q is finite and Γ
is finite, Q × Γ is finite). Therefore δ can be listed out. Each indi-
vidual transition δ(p, a) = (q, b, d) can be represented as a 5-tuple
(p, a, q, b, d). On the tape each of these tuples will look like p a q b d.
(If a or b happen to be the blank symbol no ambiguity should re-
sult.) Each 5-tuple will be separated from the others by a XX.
• q0 , qA , qR will be represented by numbers in unary notation.
Example 38. The following simple machine

0/0, R
1/1, R will be represented on tape by
/ ,L
q0 qA

111 X 0 1 X X δ X 1 X 11 X 111 X ···


where the δ portion is (only the last transition is written one symbol per
cell and q0 , qA are encoded by 1, 11):
1 0 1 0 11 XX 1 1 1 1 11 XX 1 11 1

Note that the direction L is encoded by 1 and R is encoded by 11.
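To see the convention in code, here is a rough sketch (ours) that reconstructs the δ
portion of the encoding: states and directions written in unary, each transition as a
5-tuple, and tuples separated by XX. The exact spacing is our own reading of the example.

def unary(n):
    return '1' * n

def encode_delta(delta, state_num, dir_num):
    tuples = []
    for (p, a), (q, b, d) in delta.items():
        fields = [unary(state_num[p]), a, unary(state_num[q]), b, unary(dir_num[d])]
        tuples.append(' '.join(fields))
    return ' XX '.join(tuples)

delta = {('q0', '0'): ('q0', '0', 'R'),
         ('q0', '1'): ('q0', '1', 'R'),
         ('q0', ' '): ('qA', ' ', 'L')}       # the blank is written here as a space
print(encode_delta(delta, {'q0': 1, 'qA': 2}, {'L': 1, 'R': 2}))
# 1 0 1 0 11 XX 1 1 1 1 11 XX 1   11   1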
Example 39. A TM M that takes an arbitrary TM description and tests
whether it has an even number of states can be programmed as follows:
on input hQ, Σ, Γ, δ, q0 , qA , qR i, M checks that the input is in fact a represen-
tation of a TM and then checks the number of states in Q. If that number
is even, then M clears the tape and transitions into its accept state (which
is different than the qA of the input machine) and halts; otherwise, it clears
the tape and rejects.

3.1.4 Universal Turing machines


The following question naturally arises:

Can a single Turing machine be written to simulate the execution


of arbitrary Turing machines?

The answer is yes. A so-called universal Turing machine U can be writ-


ten that expects as input hM, wi, where M is a Turing machine description,
encoded as outlined above, for example, and w an input string for M. U
simulates the execution of M on w. The simulation is completely unintel-
ligent: it simply transcribes the way that M would behave when started
with w on its input tape. Thus given input in the following format
TM description X w X ···
an execution of U sequentially lists out the configurations of M as it eval-
uates on w. At each ‘step’ of M, U goes to the end of the tape, looks at the
current configuration of M, extracts (p, a), the current state and cell value
from it, then looks up the corresponding (q, b, d) (next state, cell value, and
direction) from the description of M, and uses that information to go to
the end of the tape and append the new configuration for the execution of
M on w. As it runs, the execution of U will generate a tape that evolves as
follows:
TM description X w X ···
TM description X w X config 0 X ···
TM description X w X config 0 X config 1 X ···
TM description X w X config 0 X config 1 X config 2 X ···
.. .. .. .. .. .. .. .. .. .. ..
. . . . . . . . . . .

If M eventually halts on w, then U will detect this (since the last config-
uration on the tape will be in state qA or qR ) and will then transition into
the corresponding terminal state of U. Thus, if M halts when run on w,
then the simulation of M on w by U will also halt. If M loops on w, the
simulation of M’s execution of w will also diverge, endlessly appending
new configurations to the end of the tape.

Self application
A Turing machine M that takes as input an (encoded) TM and performs
some calculation using the description of the machine can, of course, be
applied to all manner of Turing machines. Is there a problem with ap-
plying M to its own description? After all, the notion of self-application
can be extremely confusing, since infinite regress is a lurking possibility.
But consider the ‘even-number-of-states-tester’ given above in Example
39. When applied to itself, i.e., to its own description, it performs in a
sensible manner since it just treats its input machine description as data.
The following ‘real-world’ example similarly shows that the treatment of
programs as (encoded) data allows some instances of self-reference to be
straightforward.
Example 40. The Unix wc utility counts the number of characters in a file.
It can be applied to itself
bash-2.05b$ /usr/bin/wc /usr/bin/wc
58 480 22420 /usr/bin/wc
with no fear of infinite regress. Turing machines that treat their input TM
descriptions simply as data typically don’t raise any difficult conceptual
issues.

3.2 Register Machines


Now that we have seen Turing machines, it is worth having a look at an-
other model of computation. A Register Machine (RM) has a fixed number
of registers, each of which can hold a natural number of unbounded size
(no petty and humiliating 32- or 64-bit restrictions for theoreticians!).
A register machine program is a list of simple instructions. There are
only 2 kinds of instruction:

• Inc r i. Add 1 to register r and move to instruction i

• Test r i j. Test if register r is equal to zero: if so, go to instruction i;


else subtract 1 from register r and go to instruction j.

Like TMs there is a notion of the current ‘state’ of a Register machine;


this is just the current instruction that is to be executed. By convention, if
the machine is asked to execute the 0-th instruction it will stop. An execu-
tion of a RM starts in state 1, with input (n1 , . . . , nk ) loaded into registers
R1 , . . . , Rk and executes instructions from the program until instruction 0
is entered.
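Register machines are so simple that an interpreter for them fits in a dozen lines. The
sketch below is ours (the program format is an arbitrary Python rendering of the
instructions), and it can be used to replay the examples that follow:

def run_rm(prog, regs, max_steps=10000):
    # prog maps instruction numbers to ('Inc', r, i) or ('Test', r, i, j); 0 means HALT.
    pc = 1
    while pc != 0 and max_steps > 0:
        instr = prog[pc]
        if instr[0] == 'Inc':
            _, r, i = instr
            regs[r] += 1
            pc = i
        else:
            _, r, i, j = instr
            if regs[r] == 0:
                pc = i
            else:
                regs[r] -= 1
                pc = j
        max_steps -= 1
    return regs

# The addition program of Example 43 below, run on R0 = 3 and R1 = 19:
prog = {1: ('Test', 'R0', 0, 2), 2: ('Inc', 'R1', 1)}
print(run_rm(prog, {'R0': 3, 'R1': 19}))   # {'R0': 0, 'R1': 22}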

Example 41 (Just stop). The following RM program immediately halts, no


matter what data is in the registers.

0 HALT
1 Test R0 I0 I0

Execution starts at instruction 1. The Test instruction checks if R0 = 0


and, no matter what the result is, the next instruction is I0 , i.e., the HALT
instruction.

Example 42 (Infinite loop). The following RM program immediately goes


into an infinite loop, no matter what data is in the registers.

0 HALT
1 Test R0 I1 I1

Execution starts at instruction 1. The Test instruction checks if R0 = 0


and, no matter what the result is, the next instruction is I1 , i.e., the same
instruction. Thus an execution will step-by-step decrement R0 down to 0
but will then keep going.

Example 43. The following RM program adds the contents of R0 to R1 ,


destroying the contents of R0 .

0 HALT
1 Test R0 I0 I2
2 Inc R1 I1

Execution starts at instruction 1. The Test instruction checks R0 , exiting if
it holds 0. Otherwise, it decrements R0 and transfers control to instruction
2. This then adds 1 to R1 and transfers control back to instruction 1.
The following table shows how the execution evolves, step by step
when R0 has been loaded with 3 and R1 with 19.

Step R0 R1 Instr
0 3 19 1
1 2 19 2
2 2 20 1
3 1 20 2
4 1 21 1
5 0 21 2
6 0 22 1
7 0 22 HALT
In the beginning (step 0), the machine is loaded with its input numbers,
and is at instruction 1. At step 1 R0 is decremented and the machine moves
to instruction 2. And so on.
Notice that we could also represent the execution by the following se-
quence of triples (R0 , R1 , Instr):

(3, 19, 1), (2, 19, 2), (2, 20, 1), (1, 20, 2), (1, 21, 1), (0, 21, 2), (0, 22, 1), (0, 22, 0).

OK, one more example. How about adding R0 and R1 , putting the
result in R2 , leaving R0 and R1 unchanged?

Example 44. As always, it is worth thinking about this at a high level be-
fore diving in and writing out the exact instructions. The best approach I
could think of uses five registers R0 , R1 , R2 , R3 , R4 . We use R3 and R4 to
store the original values of R0 and R1 . The program first (instructions 1–3)
repeatedly decrements R0 and adds 1 to both R2 and R3 . At the end of this
phase, R0 is 0 and both R2 and R3 will hold the original contents of R0 .
Next (instructions 4–6) the program repeatedly decrements R1 and adds
1 to both R2 and R4 . At the end of this phase, R1 is 0, R2 holds the sum of
the original values of R0 and R1 , and R4 holds the original contents of R1 .
Finally, a couple of loops (instructions 7–8 and 9–10) move the contents
of R3 and R4 back to R0 and R1 . Here is the program:

0 HALT
1 Test R0 I4 I2
2 Inc R2 I3
3 Inc R3 I1
4 Test R1 I7 I5
5 Inc R2 I6
6 Inc R4 I4
7 Test R3 I9 I8
8 Inc R0 I7
9 Test R4 HALT I10
10 Inc R1 I9
For the intrepid, here’s the execution of the machine when R0 = 2 and
R1 = 3.

Step R0 R1 R2 R3 R4 Instr Step R0 R1 R2 R3 R4 Instr


0 2 3 0 0 0 1 15 0 0 5 2 2 6
1 1 3 0 0 0 2 16 0 0 5 2 3 4
2 1 3 1 0 0 3 17 0 0 5 2 3 7
3 1 3 1 1 0 1 18 0 0 5 1 3 8
4 0 3 1 1 0 2 19 1 0 5 1 3 7
5 0 3 2 1 0 3 20 1 0 5 0 3 8
6 0 3 2 2 0 1 21 2 0 5 0 3 7
7 0 3 2 2 0 4 22 2 0 5 0 3 9
8 0 2 2 2 0 5 23 2 0 5 0 2 10
9 0 2 3 2 0 6 24 2 1 5 0 2 9
10 0 2 3 2 1 4 25 2 1 5 0 1 10
11 0 1 3 2 1 5 26 2 2 5 0 1 9
12 0 1 4 2 1 6 27 2 2 5 0 0 10
13 0 1 4 2 2 4 28 2 3 5 0 0 9
14 0 0 4 2 2 5 29 2 3 5 0 0 HALT

Now, as we have seen, the state of a Register machine computation that


uses n registers is (R0 , . . . , Rn−1 , I), where I holds the index to the current
instruction to be executed.

3.3 The Church-Turing Thesis
Computability theory developed in the 1930’s in an amazing burst of cre-
ativity by logicians. We have seen two fully-featured models of computation—
Turing machines and Register machines—but there are many more, for
example λ-calculus (due to Alonzo Church), combinators (due to Haskell
Curry), Post systems (due to Emil Post), Term Rewriting systems, unre-
stricted grammars, cellular automata, FRACTRAN, etc.
These models have all turned out to be equivalent, in that each allows
the same set of functions to be computed. Before we give an indication of
what such an equivalence proof looks like in the case of Turing machines
and Register machines, we can make some general remarks.
A full model of computation can be seen as a setting forth of a general
way to do sequential computation, i.e., to deterministically compute the
values of functions, over some kind of data (often strings or numbers).
The requirement on an author of such a model is to show how all tasks
we regard as being ‘computable’ by a real mechanical device, or solvable
by an algorithm, may be realized in the model. Typically, this splits into
showing

• How all manner of data, e.g., trees, graphs, arrays, etc, can be en-
coded to, and decoded from, the data representation used by the
model.

• How all manner of algorithmic steps and data manipulation may be


reduced to steps in the proposed model.

Put this way, it seems possible that a chaotic situation could have de-
veloped, where multiple competing notions of computability struggled for
supremacy. But that hasn’t happened. Instead, the proofs of equivalence
among all the different models mean that people can use whatever model
they prefer, secure in the knowledge that, were it necessary or convenient,
they could have worked in any of the other models. This is the Church-
Turing Thesis: that any model of computation is as powerful as any other,
and that any fresh one proposed is anticipated to be equivalent to all the
others. This is what gives people confidence that any algorithm coded as a
‘C’ program can also be coded up in Java, or Perl, or any other general pur-
pose programming language. Note well, however, that all considerations

[Figure 3.1: A portion of the Turing tarpit. The abstract notion of Algorithm is
surrounded by equivalent formalisms (Turing Machines, Register Machines, λ-calculus,
Combinators, Post Systems, Term Rewriting Systems, Unrestricted Grammars, Cellular
Automata, FRACTRAN) and by general-purpose programming languages such as C, ML, and Java.]

of efficiency have been discarded; we are in this course only concerned


with what can be done in principle.
This is a thesis and not a theorem because it is not provable, only refutable.
A genuinely new model of computation, not equivalent to Turing machine
computation, may be proposed tomorrow and then the CT-Thesis will go
the way of the dodo. All that would be necessary would be to demonstrate
a program in the new computational model that was not computable on a
sufficiently programmed Turing machine. However, given the wide array
of apparently very different models, which are all nonetheless equivalent,
most experts feel that the CT-Thesis is secure.
The CT-Thesis says that no model of computation exists having greater
power than Turing machines, or Register machines, or λ-calculus, or any
of the others we have mentioned. As a consequence, no model of com-
putation can lay claim to being the unique embodiment of algorithms. An
algorithm, therefore, is an abstract idea, which can have concrete realiza-
tions in particular models of computation, but also of course on real com-
puters and in real programming languages. However, we (as yet) have

no abstract definition of the term algorithm, of which the models of com-
putation are instances. Thus the search for an algorithm implementing a
requirement has to be met by supplying a program. If such a program ex-
ists in one model of general computation or programming language, then
it can be translated to any other model of general computation or pro-
gramming language. Contrarily, if a requirement cannot be implemented
in a particular model of computation, then it also cannot be implemented
in any other.
As a result of adopting the CT-Thesis, we can use abstract methods,
e.g., notation from high-level programming languages or even mathemat-
ics, to describe algorithmic behaviour, and know that the algorithm can be
implemented on a Turing Machine. Thus we may shift our attention from
painstakingly implementing algorithms in horrific detail on simple mod-
els of computation. We will now assume programs exist to implement the
desired algorithms. Of course, we may be challenged to show that a pur-
ported algorithm is indeed implementable; then we may choose whatever
Turing-equivalent model we wish in order to write the program.
Finally, the scope of the CT-Thesis must be understood. The models of
computation are intended to capture computation of mathematical func-
tions. In other words, whenever a TM M is applied to an input string w,
it will always return the same answer. Dealing with interactive, random-
ized, or distributed computation requires extensions which have been the
source of much debate. They are however, beyond the scope of this course.

3.3.1 Equivalence of Turing and Register machines


The usual way that equivalence between models of computation A and B
is demonstrated is to show two things:

• How an arbitrary A program pA can be translated to a correspond-


ing B program pB such that running pA on an input iA will yield an
identical answer to running pB on input iB (which is iA translated
into the format required by B programs).

• And vice versa.

Thus, if A-programs can be simulated by B-programs and B-programs


can be simulated by A-programs, every algorithm that can be written in A

can also be written in B, and vice versa. The simulation of A programs by
B programs is captured in the following diagram:
runA
(pA , iA ) result A

toB toA

(pB , iB ) result B
runB
which expresses the following equation:

runA (pA , iA ) = toA(runB (toBprog (pA ), toBinputs (iA ))) .

Simulating a Register machine on a Turing machine


OK, let’s consider how we could mimic the execution of a Register ma-
chine on a Turing machine. Well, the first task is to use the data represen-
tation of Turing machines (tape cells) to represent the data representation
of Register machines (a tuple of numbers). This is quite easy, since we
have seen it already in mimicking multiple tapes with a single tape. If the
given RM has n registers containing n values, the tape is divided into n
separate sections, each of which holds a string (say in unary representation)
representing the corresponding number.

⋆ hR0 i X hR1 i X ... X hRn−1 i X ···

Now how do we represent a RM program (list of instructions) as a


TM program? The rough idea is to represent each RM instruction by a
sequence of TM transitions. Fortunately there are only two instructions to
consider (a small interpreter sketch of this two-instruction language appears
after the two cases below):

Inc Ri Ik would get translated to a sequence of TM operations that would


(assuming tape head in leftmost cell):

• Move left-to-right until it found the portion of tape correspond-


ing to Ri .

• Increment that register (which is of course represented by a bit-
string). Also note that this operation could require resizing the
portion of tape for Ri .
• Move tape head all the way left.
• Move to state that represents the beginning of the TM instruc-
tion sequence for instruction k.

Test Ri Ij Ik again translates to a sequence of TM operations. Again, as-


sume tape head in leftmost cell:

• Find Ri portion of the tape.


• Check if that portion is empty (ε corresponds to the number 0).
• If it is empty move to the TM state representing the start of the
sequence of TM operations corresponding to the instruction Ij .
• If not, decrement the Ri portion of the tape (again via bitstring
operations).
• And then go to the TM state corresponding to the RM instruc-
tion k.
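To make the Register machine model itself concrete, here is a small interpreter sketch in Python (illustrative only; the tuple encoding of instructions and the convention that jumping outside the program halts execution are assumptions of the sketch, not part of the formal definition). It implements exactly the two instructions described above.

def run_rm(program, registers):
    """Interpret a Register machine program.
    program   -- list of instructions, indexed from 0, each of which is either
                   ('inc', i, k)     : increment Ri, then go to instruction k
                   ('test', i, j, k) : if Ri = 0 go to instruction j,
                                       otherwise decrement Ri and go to instruction k
    registers -- dict mapping register index to a natural number.
    Execution halts when control jumps outside the program."""
    pc = 0
    while 0 <= pc < len(program):
        instr = program[pc]
        if instr[0] == 'inc':
            _, i, k = instr
            registers[i] = registers.get(i, 0) + 1
            pc = k
        else:  # 'test'
            _, i, j, k = instr
            if registers.get(i, 0) == 0:
                pc = j
            else:
                registers[i] = registers[i] - 1
                pc = k
    return registers

# Example: move R1 into R0 by repeated decrement/increment.
# Instruction 0 tests R1 (jumping to 2, i.e. halting, when it is empty);
# instruction 1 increments R0 and returns to instruction 0.
print(run_rm([('test', 1, 2, 1), ('inc', 0, 0)], {0: 3, 1: 4}))   # {0: 7, 1: 0}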

Simulating a Turing machine on a Register machine


What about modelling TMs by RMs? That is somewhat more convoluted.
One might think that each cell of the TM tape could be represented by a
corresponding register. But this has a problem, namely that the number of
registers in a register machine must be fixed in advance, while a TM tape
can grow and shrink as the computation progresses. The correct modelling
of the tape by registers depends on an interesting technique known as
Goedel numbering.5
Definition 6 (Goedel Numbering). Goedel needed to encode a string w =
a1 a2 a3 . . . an over alphabet Σ as a single natural number. Recall that the
primes are infinite, i.e., form an infinite sequence p1 , p2 , p3 , . . .. Also recall—
from high school math—that every number has a unique prime factoriza-
tion.6 Let C : Σ → N be a bijective coding of alphabet symbols by natural
numbers. Goedel’s idea was to basically treat w as a representation of the
prime factorization of some number:

GoedelNum(a1 a2 a3 . . . an ) = p1^C(a1) × p2^C(a2) × p3^C(a3) × . . . × pn^C(an) .

5 Named after the logician Kurt Goedel, who used it in his famous proof of the incompleteness of Peano Arithmetic.
6 This fact is known as the Fundamental Theorem of Arithmetic.

Example 45. Let’s use the string foobar as an example. Taking ASCII as
the coding system, we have: C(f) = 102, C(o) = 111, C(b) = 98, C(a) = 97,
and C(r) = 114. The Goedel number of foobar is calculated as follows:

GoedelNum(foobar) = 2^102 × 3^111 × 5^111 × 7^98 × 11^97 × 13^114


= 1189679170019703840793287149012377883426615618297637319510394
7574598871098617271030519293961835685919348532420791233827072
8538012810574164152676463935146039040414969145473843310178957
3364837167118857151641747100976606825914043115103029097422785
5974811557927500973868941709639836429302128569769744552203051
4615853924974445020975274707809005698958254207677547656309707
0312500000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000

The importance of Goedel numbers is that, given a number that we


know to be the result of a call to GoedelNum, we can take it apart to get
the original sequence, or parts of the original sequence. Thus we can use
GoedelNum to encode the contents of a Turing machine tape as a single
(very large) number, which can be held in a register of the Register ma-
chine.
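To see why a Goedel number can be taken apart again, here is a small Python sketch (an illustration, not part of the formal development) that encodes a string as a product of prime powers and decodes it by stripping off the exponent of each prime in turn. The coding C is taken to be ASCII, as in the example above; the sketch assumes C assigns every symbol a code of at least 1, so that no exponent is zero.

def primes():
    """Generate 2, 3, 5, 7, ... by trial division (slow, but fine for a demo)."""
    candidate, found = 2, []
    while True:
        if all(candidate % p for p in found):
            found.append(candidate)
            yield candidate
        candidate += 1

def goedel_num(w, C):
    """GoedelNum(a1 ... an) = p1^C(a1) * p2^C(a2) * ... * pn^C(an)."""
    n, gen = 1, primes()
    for a in w:
        n *= next(gen) ** C(a)
    return n

def goedel_decode(n, C_inv):
    """Invert goedel_num by reading off each prime's exponent."""
    w, gen = [], primes()
    while n > 1:
        p, e = next(gen), 0
        while n % p == 0:
            n, e = n // p, e + 1
        if e == 0:          # cannot happen when every code C(a) is at least 1
            break
        w.append(C_inv(e))
    return ''.join(w)

g = goedel_num("foobar", ord)      # the (enormous) number shown in Example 45
print(goedel_decode(g, chr))       # prints: foobar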
In fact, we won’t hold the entire tape in one register, but will keep the
current configuration hℓ, q, ri in registers R1 , R2 , R3 . In R0 we will put the
Goedel number of the transition function δ of the Turing machine, i.e., the
Goedel number of its tape format (see Section 3.1.3 for details of the for-
mat). The RM program simulating the execution of the TM will examine
R2 and R3 to find (p, a), the current state and tape cell contents. Then R0
can be manipulated to give a number representing (q, b, d), which are then
used to update R1 , R2 , R3 . That finishes the current step of the TM; the RM
program then checks to see if a final state of the TM has been entered. If so,
the RM stops executing; if not, the next TM step is simulated, continuing
until the simulated TM enters a final state.

3.4 Recognizability and Decidability
Now that we have become familiar with Turing machines and Register
machines, we should ask what they are good for. For example, Turing
machines definitely aren’t good for programming, so why do we study
and use them?

• People agree that they capture the notion of a sequential algorithm.


Although there are many other formalisms that also capture this no-
tion, Turing machines were historically the first convincing model
that researchers of the time agreed upon.

• Robustness. The class of functions that TMs capture doesn’t change


when the Turing machine model is tinkered with, e.g., by adding
multiple tapes, non-determinism, etc.
• Turing machines are easy to analyze with respect to their time and
space usage, unlike many other models of computation. This sup-
ports the field of computational complexity, in which computations are
classified according to their resource usage.

• They can be used to prove decidability results and, more interest-


ingly, undecidability results. In fact, Turing’s invention of Turing ma-
chines went hand-in-hand with his proof of the undecidability of the
halting problem.

We know that a Turing machine takes an input and either accepts it,
rejects it, or loops. In order to tell if a Turing machine works properly, one
needs to specify what answers it should compute. The usual way to do
this is to specify the set of strings the TM must accept, i.e., its language.
We now make a crucial, as it turns out, distinction between TMs that reject
strings outside their language and TMs that either reject or loop when
given a string outside their language.
Definition 7 (Turing-recognizable, Recursively enumerable). A language
L is Turing recognizable, or recursively enumerable, if there is some Turing
machine M (called a recognizer) that accepts each string in L. For each
string not in L, M may either reject the string or diverge.

Recognizable(L) = ∃M. ∀w. w ∈ L iff M halts and accepts w .

The following definition captures the subset of recognizers that never
loop.

Definition 8 (Decidable, Recursive). A language L is decidable, or recursive,


if there exists a Turing machine M, called a decider, that recognizes L and
always halts.

Thus a decider always says ‘yes’ given a string in the specified lan-
guage and always says ‘no’ given a string outside the specified language.
Obviously, it is better to have a decider than a recognizer, since a decider
always gives a verdict in a finite time. Moreover, a decider is automatically
a recognizer.
A restatement of these definitions is the following.
Definition 9 (Decision problem). The decision problem for a language L is
just the question: is there a decider for L? Equivalently, one asks if L is de-
cidable. Similarly, the recognition problem for a language L is the question:
is L recognizable (recursively enumerable)?

Remark. There is, unfortunately, a lot of overlapping terminology. In par-


ticular, the notions of language and problem are essentially the same thing:
a set of strings over an alphabet. A problem often has the implication that
an encoding has been used to translate higher-level data structures into
strings.
In general, to show that a problem is decidable, it suffices to (a) exhibit
an algorithm for solving the problem and (b) to show that the algorithm
terminates on all inputs. To show that a language is recognizable only the
first requirement has to be met.
Some examples of decidable languages:

• binary strings

• natural numbers

• twos complement numbers

• sorted lists of numbers

• balanced binary trees

• Well-formed Turing machines

• Well-formed C programs
• {hi, j, ki | i + j = k}
• {hℓ, ℓ′i | ℓ′ is a sorted permutation of ℓ}
In a better world than ours, one would expect that more informative
properties and questions, such as the following, would also be decidable:

• Turing machines that halt no matter what input is given them.


• Is p a C program that never crashes?
• Given a program in a favourite programming language, and a pro-
gram location, is the location reachable?
• Given programs p1 and p2 does p1 behave just like p2 ? In other words
is it the case that p1 (x) = p2 (x), for any input x?
• Is a program the most efficient possible?

Unfortunately, such is not the case. We can prove that none of the
above questions are decidable, as we will see. Notice that, in order to make
such claims, a clever proof is necessary since we are asserting that var-
ious problems are algorithmically unsolvable, i.e., that no program can be
constructed to solve the problem.
But first we are going to look at a set of decision problems about TMs
that can be solved. If Turing machines seem too unworldly for you, the
problems are easily restated to apply to your favorite programming lan-
guage or microprocessor.

3.4.1 Decidable problems about Turing machines


Here is a list of decidable problems about TMs. We are given TM M. The
number 3100 is not important in the following.
1. Does M have at least 3100 states?
This is obviously decidable.
2. Does M take more than 3100 steps on input x?
The decider will use U to run M on x for 3100 steps. If M has not
entered qA or qR by then, accept, else reject.

3. Does M take more than 3100 steps on some input?
The decider will simulate M on all strings of length ≤ 3100, for 3100
steps. If M has not entered qA or qR by then, for at least one string,
accept, else reject. We need only consider strings of length ≤ 3100: within
3100 steps the tape head can read at most the first 3100 input symbols, so
M’s first 3100 steps on a longer string are identical to its first 3100 steps
on that string’s length-3100 prefix.
4. Does M take more than 3100 steps on all inputs?
Similar to the previous, except that we require that M take more than 3100
steps on each string of length ≤ 3100 (by the prefix argument above, this
covers longer inputs as well).
5. Does M ever move the tape head more than 3100 cells away from the
starting position?
M will either loop infinitely within the 3100 cells, stop within the
3100 cells, or break out of the 3100 cells. We can detect the infinite
loop by keeping track of the configurations that M can get into: for
a fixed tape size (3100 in this problem), this is a finite number of
configurations. In particular, if M has n states and k tape symbols,
then the number of configurations it can get into without leaving those
cells is at most 3100 · n · k^3100 (a head position, a state, and the tape
contents). If M has neither halted nor repeated a configuration within
that number of moves, the tape head must have moved more than 3100 cells
away from its initial position. (A sketch of the bounded simulation
underlying all of these deciders follows this list.)
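All five deciders above rest on the same primitive: simulate a machine for a bounded number of steps and observe what has happened so far. A minimal sketch of that primitive, assuming a machine object with initial_config, is_halting and step operations (hypothetical names, standing in for the universal machine U):

def takes_more_than(tm, w, bound=3100):
    """Return True iff tm, run on w, has not entered its accept or reject
    state within `bound` steps.  The loop always terminates, so this can
    safely be used as a subroutine of a decider."""
    config = tm.initial_config(w)
    for _ in range(bound):
        if tm.is_halting(config):   # reached qA or qR
            return False
        config = tm.step(config)    # one move of the simulated machine
    return True

Problem 2, for instance, is settled by a single call takes_more_than(M, x); problems 3 and 4 call it on every string of length ≤ 3100.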

3.4.2 Recognizable problems about Turing Machines


Example 46. Suppose our task is to take an arbitrary Turing machine de-
scription M and an input w and figure out whether M accepts w. This
decision problem can be formally stated as
AP = {hM, wi | M accepts w}
This problem is recognizable, since we can use the universal TM U in
the following way: U is invoked on hM, wi and keeps track of the state of
M as the execution of M on w unfolds. If M halts, then we look to see what
final state it was in: if it was its accept state, then the result is to accept.
Otherwise the result is to reject the input. Now, the other possibility is that
M doesn’t halt on w. What then? Since U simply follows the execution of
M step by step, then if M loops on w, the simulation will never stop. The
problem of getting a decider would require being able to calculate from the
given description hM, wi whether the ensuing computation would stop or
not.

The following example illustrates an important technique for building
deciders and recognizers.
Example 47 (Dovetailing). Suppose our task is to take an arbitrary Turing
machine description M and tell whether there is any string that it accepts.
This decision problem can be formally stated as
ASome = {hMi | ∃w. M accepts w}
A naive stab at an answer would say that a recognizer for ASome is
readily implemented by generating strings in Σ∗ in increasing order, one
at a time, and running M on them. If a w is generated so that a simulated
execution of M accepts w, then our recognizer halts and accepts. However,
it may be the case that M accepts nothing, in which case this program will
loop forever. Thus, on the face of it, this is a recognizer but not a decider.
However, this is not even a recognizer! What if M loops on some w,
but would accept some (longer) string w ′ ? Blind simulation will loop on w
and M will never get invoked on w ′ .
This problem can be solved, in roughly the same way as the same prob-
lem is solved in time-shared operating systems running processes: some
form of fair scheduling. This can be implemented by interleaving the gen-
eration of strings with applying M to each of them for a limited number
of steps. The algorithm goes round-by-round. In the first round of the
algorithm, M is simulated for one step on ε. In the second round of the
algorithm, M is simulated for one more step on ε, and M is simulated for
one step on 0. In the third round of the algorithm, M is simulated for
one more step on ε and 0, and M is simulated for one step on 1. In the
fourth round, M is simulated for one more step on ε, 0, 1, and is simulated
for one step on 00. Computation proceeds, where in each round all exist-
ing sub-computations advance by one step, and a new sub-computation
on the next string in increasing order is started. This proceeds until in
some sub-computation M enters the accept state. If it enters a reject state,
that sub-computation is dropped. The process just outlined is often called
dovetailing, because of the fan shape that the computation takes.
Clearly, if some string w is accepted by M, it will start being processed
in some round, and eventually accepted, possibly much later. So the lan-
guage ASome is recognizable.
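The round-by-round schedule just described can be written out directly. The sketch below is illustrative only: all_strings is a helper defined here, and M.start(w) is an assumed operation that creates a one-step-at-a-time simulation of M on w whose step() method reports 'accept', 'reject', or None (still running).

from itertools import count, product

def all_strings(alphabet="01"):
    """Yield ε, 0, 1, 00, 01, ... in order of increasing length."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def dovetail(M):
    """Recognizer for ASome: return a string that M accepts, if any exists;
    otherwise run forever (it is a recognizer, not a decider)."""
    strings = all_strings()
    live = []                            # simulations still running
    while True:
        w = next(strings)
        live.append((w, M.start(w)))     # each round starts one new input ...
        survivors = []
        for v, sim in live:              # ... and advances every live simulation one step
            verdict = sim.step()
            if verdict == 'accept':
                return v
            if verdict is None:
                survivors.append((v, sim))
        live = survivors                 # simulations that rejected are dropped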
One way of showing a language L is decidable is to show that there
are recognizers for both L and its complement Σ∗ − L.

Theorem 1. If L is recognizable and Σ∗ − L is recognizable, then L is decidable.
Proof. Let M1 be a recognizer for L and M2 a recognizer for Σ∗ − L. For any
x ∈ Σ∗ , x ∈ L or x ∈ Σ∗ − L, so at least one of M1 , M2 accepts x. A decider
for L can thus be implemented that, on input x, dovetails execution of M1 and
M2 on x. Eventually, one of the machines must halt, with an accept or reject
verdict. If M1 halts first, the decider returns the verdict of M1 ; if M2 halts
first, the decider returns the opposite of the verdict of M2 .

3.4.3 Closure Properties


A closure property has the format

If L1 and L2 have property P then the language resulting


from applying some operation to L1 and L2 also has property
P.

The decidable languages are closed under union, intersection, concate-


nation, complement, and Kleene star. The recognizable languages are
closed under all the above but complement and set difference.

Example 48. The decidable languages are closed under union. Formally,
this is expressed as

Decidable L1 ∧ Decidable L2 ⇒ Decidable (L1 ∪ L2 )

Proof. Suppose L1 and L2 are decidable. Then there exists a decider M1 for
L1 and a decider M2 for L2 . Now we claim there is a decider M for L1 ∪ L2 .
Let M be the following machine:

“On input w, invoke M1 on w. If it accepts, then accept.


Otherwise, M1 halts and rejects w (because it is a decider), so
we invoke M2 on w. If it accepts, M accepts, otherwise it re-
jects.”

That M is a decider is clear since both M1 and M2 halt on all inputs. That
M accepts the union of L1 and L2 is also clear, since M accepts x if and
only if one or both of M1 , M2 accept x.

When seeking to establish a closure property for recognizable languages,
we have to guard against the fact that the recognizers may not terminate
on objects outside of their language.

Example 49. The recognizable languages are closed under union. For-
mally, this is expressed as

Recognizable L1 ∧ Recognizable L2 ⇒ Recognizable (L1 ∪ L2 )

Suppose L1 and L2 are recognizable. Then there exist a recognizer M1


for L1 and a recognizer M2 for L2 . Now we claim there is a recognizer M
for L1 ∪ L2 .
Proof. The following machine seems natural, but doesn’t solve the prob-
lem.

“On input w, invoke M1 on w. If it accepts, then accept.


Otherwise, invoke M2 on w. If it accepts, M accepts, otherwise
it rejects.”

The problem with this purported solution is that M1 may loop on w


while M2 would accept it. In that case, M won’t accept w when it should,
because M2 never gets a chance to run on w. So we have to use dovetailing
again. A suitable solution works as follows:

“On input w, invoke M1 and M2 on w in the following fash-


ion: execute M1 for one step on w, then execute M2 for one step
on w. Repeat in this step-wise manner until either M1 or M2
halts. (It may be the case that neither halts, of course, since
both are recognizers.) If the halting machine is in the accept
state, then accept. Otherwise, run the other machine until it
halts. If it accepts, then accept. If it rejects, then reject. Other-
wise, the second machine is in a loop, so M loops.”

This machine accepts w whenever M1 or M2 will, rejects when both halt and
reject, and loops otherwise; hence it is a recognizer for L1 ∪ L2 .
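The interleaving in that machine description is easy to phrase as code. A sketch, using the same assumed stepwise-simulation interface as in the dovetailing example (M.start(w) and step() returning 'accept', 'reject', or None):

def union_recognizer(M1, M2, w):
    """Recognizer for L1 ∪ L2 built from recognizers M1 and M2.
    Returns True if either machine accepts w, False if both reject,
    and loops forever otherwise -- acceptable behaviour for a recognizer."""
    sims = [M1.start(w), M2.start(w)]
    verdicts = [None, None]
    while True:
        for i, sim in enumerate(sims):
            if verdicts[i] is None:          # only step machines that haven't halted
                verdicts[i] = sim.step()
                if verdicts[i] == 'accept':
                    return True              # one accepting machine suffices
        if all(v == 'reject' for v in verdicts):
            return False                     # both halted and rejected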

3.5 Undecidability
Many problems are not decidable. This is easy to see by a cardinality ar-
gument: there are simply far more languages (uncountable) than there are
algorithms (countable). So some languages—almost all in fact—have no
decider, or recognizer. However, that is an abstract argument; what we
want to know is whether specific problems of interest to computer science
are decidable or recognizable.
In this section we will show the undecidability of the Halting Problem, a
result due to Alan Turing. The importance of this theorem is twofold: first,
it is the earliest and arguably most fundamental result about the limits of
computation (namely, some problems can not be algorithmically solved);
second, many other undecidability results stem from it. The proof uses a
cool technique, which we pause to introduce here.

3.5.1 Diagonalization
The diagonalization proof technique works as follows: assume that you
have a complete listing of objects; then construct a new object that should
be in the list but can’t be since it differs, in at least one place, with every
object in the list. A contradiction thereby ensues. This technique was in-
vented by Georg Cantor to show that R is not countable, i.e., that there
are far more real numbers than natural numbers. This shows that there is
more than one size of infinity.
The existence of a bijection between two sets is used to implement the
notion of the sets ‘being the same size’, or equinumerous. When the sets are
finite, we just count their elements and compare the resulting numbers.
However, equinumerosity can behave counterintuitively when the sets are infinite. For
example, the set of even numbers is equinumerous with N even though it
is a proper subset of N!

Theorem 2. R can not be put into one-to-one correspondence with N. More


formally, there is no bijection between R and N.
Proof. We are in fact going to show that the set of real numbers between 0
and 1 is not countable, which implies that R is not countable. In actuality,
we prove that there is no surjection from N to {r ∈ R | 0 ≤ r < 1}, i.e., for
every mapping from N to R, some real numbers will be left out.

Towards a contradiction, suppose that there is such a surjection, i.e.,
the real numbers between 0 and 1 can be arranged in a complete listing in-
dexed by natural numbers. This gives us a table, infinite in both directions.
Each row of the table represents one real number, and all real numbers are
in the table.

0 . 5 3 1 1 7 8 2 ···
0 . 4 3 0 0 1 2 9 ···
0 . 7 7 6 5 1 0 2 ···
0 . 0 1 0 0 0 0 0 ···
0 . 9 0 3 2 6 8 4 ···
0 . 0 0 0 1 1 1 0 ···
⋮
The arrangement of the numbers in the table doesn’t matter; what is
important is that the listing is complete and that each row is indexed by a
natural number. Now we build an infinite sequence of digits D by travers-
ing the diagonal in the listing and changing each digit of the diagonal. For
example, we could build

D = 0.647172 . . .

(There are of course many other possible choices of D, such as 425858 . . .:


what is important is that D differs from the diagonal at each digit.) Because
D is an infinite sequence of digits, 0.D is a real number between 0 and 1, and
the listing was assumed to contain every such real, so D must occur as some
row of the table. On the other
hand, we have intentionally constructed D so that it differs with the first
number in the table at the first place, with the second number at the sec-
ond place, with the third number at the third place, etc. Thus D cannot be
the first number, the second number, the third number, etc. So D is not in
the table. Contradiction. Conclusion: R is too big to be put in one-to-one
correspondence with N.

This result is surprising because the integers Z and the rationals Q


(fractions with numerator and denominator in N) do have a one-to-one
correspondence with N. It is certainly challenging to accept that between
any two elements of N there are an infinite number of elements of Q—and
yet the two sets are the same size! Having been bludgeoned by that math-
ematical fact, one might be willing to think that R is the same size as N.
But the proof above shows that this cannot be so.

Note. Pedants may enjoy pointing out that there is a difficulty with in-
finitely repeated digits since, e.g., 0.19999 . . . = 0.2000 . . .. If D ended
with an infinite repetition of 0s or 9s the argument wouldn’t work (be-
cause, e.g., 0.199999 . . . differs from 0.2000 . . . at each digit, but they are
equal numbers). We therefore exclude 0 and 9 from being used in the con-
struction of D.
Note. Diagonalization can also be used to show that there is no surjection
from a set to its power set; hence there is no bijection, hence the sets are not
equinumerous. The proof is almost identical to the one we’ve just seen.
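The diagonal construction itself is mechanical enough to write down. In the sketch below (a toy illustration), the purported complete listing is represented by a function giving the j-th digit of the i-th listed real; the digits 4 and 5 are used so that, as the note above requires, the constructed number avoids 0s and 9s.

def diagonal_digits(table, n):
    """table(i, j) -> j-th decimal digit of the i-th listed real.
    Return the first n digits of a number D whose i-th digit
    differs from table(i, i), so D can equal no listed real."""
    return [4 if table(i, i) != 4 else 5 for i in range(n)]

# Toy listing in which the i-th "real" has digit (i + j) mod 10 at position j:
print(diagonal_digits(lambda i, j: (i + j) % 10, 8))   # [4, 4, 5, 4, 4, 4, 4, 5]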

3.5.2 Existence of Undecidable Problems


At least one problem is undecidable. As we will see later, this result can
be used as a fulcrum with which to show that many other problems are
undecidable.
Theorem 3 (Halting Problem is undecidable).
Let the alphabet be {0, 1}. The language
HP = {hM, wi | M is a Turing machine and M halts when run on w}
is not decidable.
Proof. First we consider a two-dimensional table, infinite in both direc-
tions. Down one side we list all possible machines, along the other axis
we list all inputs that a machine could take.

ε 0 1 00 01 10 11 000 · · ·

M0
M1
M00
M01
M10
M11
M000
⋮
There is a tight connection between binary string w and machine Mw . If
w is a valid encoding of a machine, we use Mw to stand for that machine.

If w is not a valid encoding of a machine, it will denote a dummy machine
that simply halts in the accept state as soon as it starts to run. Thus each
Mw is a valid Turing machine, and every TM may be found at least once
in the list. This bears repeating: this list of TMs is complete, every possible
TM is in it.
Therefore an entry of the table indexed by (i, j) represents machine Mi
with input being the binary string corresponding to j: the table represents
all possible TM computations.
Now let’s get started on the argument. Towards a contradiction, we
suppose that there is a decider H for the halting problem. Thus, let H be
a TM having the property that, when given input hw, ui, for any w and u,
it correctly determines if Mw halts when given input u. Being a decider, H
must itself always finish its computation in a finite time.
Therefore, we could use H to fill in any cell in the table with T (signi-
fying that the program halted on the input) or F (the program goes into
an infinite loop when given the particular input). Notice that H can not
simply blindly execute Mw on u and return T when Mw on u terminates:
what if Mw on u didn’t terminate?
Now consider the following TM N which calls H as a sub-routine.

N = “On input x, perform the following steps :


1. Write hx, xi to the input tape
2. Invoke H
3. If Hhx, xi = T then loop() else halt with accept.”

So N goes into an infinite loop on input x just when Mx halts on x. Perverse!


Conceptually, what N does is go down the diagonal of the table, reversing
the verdict of H at each point on the diagonal. This finally leads us to our
contradiction:

• Given that we have assumed that H is a TM, N is a valid TM, and is
therefore somewhere in the (complete) listing of machines.

• N behaves differently from every machine in the list, for at least one
argument (it loops on x iff Mx halts on x). So N can’t possibly be on
the list.

The resolution of this contradiction is to conclude that the assumption


that H exists is false. There is no halting detector.

Alternate proof A slightly different proof—one which explicitly makes
use of self-reference—proceeds as follows: we construct N as before, but
then ask the question: how does N behave when applied to itself, i.e., what
is the value of N(N)? By instantiating the definition of N, we have

N(N) = if H(N, N) = T then loop() else accept

Now we do a case analysis on whether N halts on N: if it does, then N(N)


goes into an infinite loop; contrarily, if N loops on N, then N(N) halts.
Conclusion: N halts on N iff N does not halt on N. Contradiction.

This result lets us conclude that there cannot be a procedure that tells,
for an arbitrary program and input, whether the program halts on that input.
In other words, the
halting problem is algorithmically unsolvable. (Note the use of the CT-Thesis
here: we have moved from a result about Turing machines to a claim about
all algorithms.)
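In programming terms, the machine N of the proof is the short self-referential program below; halts is the hypothetical halting detector whose existence the theorem rules out, so the sketch can never actually be completed.

def halts(program, argument):
    """Hypothetical decider for HP: True iff program(argument) terminates.
    By the theorem above, no such function can be written."""
    raise NotImplementedError

def N(x):
    # Reverse the verdict of the supposed halting detector on (x, x).
    if halts(x, x):
        while True:            # loop forever exactly when x halts on x
            pass
    else:
        return "accept"        # halt exactly when x loops on x

# Asking N about itself reproduces the contradiction:
# N(N) halts  iff  halts(N, N) is False  iff  N(N) does not halt.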
To be clear: certain syntactically recognizable classes of programs, e.g.,
those in which the only looping construct is bounded for-loops, always
halt. Hence, although the general problem is undecidable, there are impor-
tant subsets that are decidable. However, it’s impossible to write a halting
detector that will work correctly for all programs.

3.5.3 Other undecidable problems


Many problems are undecidable. For example, the following languages
are all undecidable.
AP = {hM, wi | M accepts w}
HP 42 = {hMi | M halts when run on 42}
HP ∃ = {hMi | ∃x. M halts when run on x}
HP ∀ = {hMi | ∀x. M halts when run on x}
Const 42 = {hMi | every computation of M halts with 42 on the tape}
Equiv = {hM, Ni | L(M) = L(N)}
Finite = {hMi | L(M) is finite}

The language AP asks whether the given machine accepts the given
input. AP is closely related to HP , but is not the same. The language
HP 42 is a specialization of HP: it essentially asks the question ‘Once the

input is known, does the Halting problem become decidable?’. The language
HP ∃ asks whether a machine halts on at least one input, while HP ∀ asks if
a machine halts on all its inputs. The language Const 42 asks whether the
given machine returns 42 as an answer in all possible computations. The
language Equiv asks whether two machines compute the same answers in
all cases, i.e., whether they are equivalent. Finally, the language Finite asks
whether the given machine accepts only a finite number of inputs.
All of these undecidable problems can be shown undecidable by a tech-
nique that amounts to employing the contrapositive, which embodies a ba-
sic proof strategy.
Definition 10 (Contrapositive). The contrapositive is a way of reasoning
that says: in order to prove P ⇒ Q, we can instead prove ¬Q ⇒ ¬P . Put
formally:
(P ⇒ Q) iff (¬Q ⇒ ¬P )
We are going to use the following instance of the contrapositive for
undecidability arguments:

(Decidable(A) ⇒ Decidable(B)) ⇒ (¬Decidable(B) ⇒ ¬Decidable(A))

In particular, we will take B to be HP , and we know, by Theorem 3, that


¬Decidable(HP). So, to show that some new problem L, such as one of the
ones listed above, is undecidable, we need merely show

Decidable(L) ⇒ Decidable(HP )

i.e., that if L was decidable, then we could decide the halting problem. But
since HP is not decidable, then neither is L.
This approach to undecidability proofs is called reduction. In particular,
the above approach is said to be a reduction from HP to L.7 We can also
reduce from other undecidable problems, if that is convenient.
Remark. It may seem intuitive to reason as follows:
If HP is a subset of L, then L is undecidable. Why? Because if
all occurrences of HP are found in L, deciding L has to be at least as
hard as solving HP .
This view is, however, deeply and horrifically wrong. The problem
with this argument is that HP ⊆ Σ∗ but Σ∗ is decidable.

7 You will sometimes hear people saying things like “we prove L undecidable by reducing L to HP ”. This is backwards, and you will just have to make the mental translation.
We now go through a few examples of undecidability proofs. They all
share a distinctive pattern of reasoning.

Example 50 (AP is undecidable).


Proof. Suppose AP is decidable. We therefore have a decider D that, given
input hM, wi, accepts if M accepts w and rejects if M does not accept w,
either by looping or rejecting. Now we construct a TM Q for deciding HP .
Q performs the following steps:

1. The input of Q is hM, wi.

2. The following machine N is constructed and written to tape. It is


important to recognize that this machine is not executed by Q, only
created and then analyzed.

(a) The input of N is a string x.
(b) Run M on x, via the universal TM U.
(c) Accept.

3. Run D on hN, wi and return its verdict.

Let’s reason about the behaviour of Q. D decides AP , so if it accepts, then


N accepts w. But that can only happen if M halts on w. On the other hand,
if D rejects, then N does not accept w, i.e., N either loops on or rejects w.
But N can never reject, so it must loop. But it can only do that if M loops
on w.
Upshot: if D accepts then M halts on w. If D rejects then M does not
halt on w. So Q decides HP . Contradiction. AP is undecidable.
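The shape of this reduction can also be displayed as (hypothetical) code. In the sketch, D is the assumed decider for AP and run_on stands for simulation by the universal machine U; the inner function plays the role of the constructed machine N.

def decide_HP(M, w, D, run_on):
    """Would decide HP, given a decider D for AP.
    D(N, w)      -- assumed: True iff machine N accepts the string w
    run_on(M, x) -- assumed: simulate M on x (may never return)"""
    def N(x):
        run_on(M, x)        # if M loops on x, N never gets past this line
        return "accept"     # if M halted on x, N accepts x
    # N accepts w iff M halts on w, so D's verdict on (N, w) answers HP for (M, w).
    return D(N, w)

The later reductions (Examples 51-54) differ only in how the inner machine N is wired up.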

The essential technique in this proof is to construct TM N, the be-


haviour of which depends on M halting on w. N is then analyzed by the
decider for the language under consideration; if N has been constructed
properly, the decider can be used to solve the halting problem.

Example 51 (HP 42 is undecidable).

Proof. Suppose HP 42 is decidable. We therefore have a decider D that,
given input hMi, accepts if M halts when run on 42 and rejects otherwise. Now we
construct a TM Q for deciding HP . Q performs the following steps:

1. The input of Q is hM, wi.

2. The following machine N is written to tape.

(a) The input of N is hxi. Delete x.


(b) Run M on w, via the universal TM U.
(c) Accept.

3. Run D on hNi and return its verdict.

If D accepts hNi, then N halts on 42, which can only happen if M halts on w.
On the other hand, if D rejects hNi, then N does not halt on 42, i.e., N loops
on 42, and it can only do that if M loops on w. So Q decides HP . Contradiction.
HP 42 is undecidable.

Example 52 (HP ∃ is undecidable).


Proof. Suppose HP ∃ is decidable by D. We can then construct a TM Q for
deciding HP . Q performs the following steps:

1. The input of Q is hM, wi.

2. A TM N implementing the following specification is constructed and


written to tape.

N(x) = if x = w then run M on w else loop()

3. Run D on hNi and return its verdict.

If D accepts hNi, there is an x that N halts on, and that x has to be w,


implying that M halts on w. If D rejects hNi, there is no x that N halts on,
including w, implying that M loops on w. So Q decides HP . Contradiction.
HP ∃ is undecidable.

Reduction can be from any undecidable problem, as in the following ex-


ample, which takes HP 42 as the language to reduce from.

Example 53 (HP ∀ is undecidable).
Proof. Suppose D decides HP ∀ . We can then decide HP 42 with the follow-
ing TM Q:

1. The input of Q is hMi.

2. A TM N implementing the following specification is constructed and


written to tape.
N(x) = run M on 42

3. Run D on hNi and return its verdict.

If D accepts hNi, then M halts on 42. If D rejects hNi, then N loops on at


least one input. But N loops if and only if M loops on 42. So Q decides
HP 42 . Contradiction. HP ∀ is undecidable.

Example 54 (Equiv is undecidable). Intuitively, Equiv is a much harder


problem than a problem like HP : in order to tell if two programs agree on
all inputs, the two programs have to display the same halting behaviour
on each input. Therefore we might expect that a decider for HP ∀ should
be easy to obtain from a decider for Equiv . That may be possible, but it
seems simpler to instead reduce from HP ∃ .
Proof. Let D be a decider for Equiv . We can then decide HP ∃ with the
following TM Q:

1. The input of Q is hMi.

2. Let N be a TM that takes input x and runs M on x, then accepts. Let


Loop be a TM that diverges on all inputs.

3. Write hN, Loopi to the tape.

4. Run D on hN, Loopi and reverse its verdict, i.e., switch Accept with
Reject, and Reject with Accept.

If D accepts hN, Loopi, we know L(N) = L(Loop) = ∅, i.e., M does not


halt on any input. In this case Q rejects. On the other hand, if D rejects
hN, Loopi, we know that there exists an x that M halts on. In this case Q
accepts. Therefore Q decides HP ∃ . Contradiction. Equiv is undecidable.

We now prove a theorem that provides a sweeping generalization of
many undecidability proofs. But first we need the following concept:
Definition 11 (Index set). A set of TMs S is an index set if and only if it has
the following property:

∀M N. (L(M) = L(N)) ⇒ (M ∈ S iff N ∈ S) .

This formula expresses, in a subtle way, that an index set contains all (and
only) those TMs having the same language. Thus index sets let us focus on
properties of a TM that are about the language of the TM, or the function
computed by the TM, and not the actual components making up the TM,
or how the TM executes.
Example 55. The following are index sets: HP 42 , HP ∃ , HP ∀ , Const 42 , and
Finite. HP and AP can’t be index sets, since they are sets of hM, wi strings
rather than the required hMi strings. Similarly, Equiv is not an index set
since it is a set of hM1 , M2 i strings. The following languages are also not
index sets:
• {hMi | M has exactly 4 control states}. This is not an index set since
it is easy to provide TMs M1 and M2 such that L(M1 ) = L(M2 ) = ∅,
but M1 has 4 control states, while M2 has 3.

• {hMi | ∀x. M halts in at most 2 × len(x) steps, when run on x}. This is
not an index set since there exist TMs M1 , M2 and input x such that
L(M1 ) = L(M2 ) but M1 halts in exactly 2 × len(x) steps, while M2
takes a few more steps to halt.

• {hMi | M accepts more inputs than it has states}. This is not an index
set since, for example, there exists a TM M1 with 6 states that accepts
all binary strings of length 3 (so the number of strings in the language
of M1 is 8) and a TM M2 with 8 states that accepts the same language.
Theorem 4 (Rice). Let S be an index set such that there is at least one TM in S,
and not all TMs are in S. S is undecidable.
Proof. Towards a contradiction, suppose index set S is decided by TM D.
Consider Loops, a TM that loops on every input: L(Loops) = ∅. Either
Loops ∈ S or Loops ∉ S. Let’s do the case where Loops ∉ S. Since S ≠ ∅,
there is some TM K ∈ S. We can decide HP by the following machine Q:

1. The input to Q is hM, wi.

2. Store M and w away.

3. Build the machine description for the following machine N and write
it to tape:

(a) Input is hxi.


(b) Copy x somewhere.
(c) Retrieve M and w, write hM, wi to the tape, and simulate the
execution of M on w by using U.
(d) Retrieve x and run K on it, again by using U.

4. Run D on the description of N.

Now we reason about the behaviour of M on w. If M halts on w, then


L(N) = L(K). Since S is an index set, N ∈ S iff K ∈ S. Since K ∈ S we
have N ∈ S, so D hNi accepts.
Contrarily, if M does not halt on w, then L(N) = ∅ = L(Loops). Since S
is an index set and Loops ∉ S, we have N ∉ S, so D hNi rejects.
Thus Q is a decider for HP , in the case that Loops ∉ S. In the case that
Loops ∈ S, the proof is similar. In both cases we obtain a decider for HP
and hence a contradiction. Thus S is not decidable.

Example 56. Since Finite is an index set, and (1) the language of at least
one TM is finite, and (2) the language of at least one TM is not finite, Rice’s
theorem allows us to immediately conclude that Finite is undecidable.

An application of Rice’s theorem is essentially a short-cut of a reduc-


tion proof. In other words, it is never necessary to invoke Rice’s theorem,
just convenient. Moreover, there are undecidability results, e.g., the unde-
cidability of Equiv , that cannot be applications of Rice’s theorem. When
confronted by such cases, try a reduction proof.

3.5.4 Unrecognizable languages


The following example shows that some languages are not even recogniz-
able. It is essentially a recapitulation of Theorem 1.

Example 57. The recognizable languages are not closed under comple-
ment.
Consider the complement of the halting problem, Σ∗ − HP : the set
of all hM, wi pairs where M does not halt on w. If this problem were
recognizable, then we could get a decision procedure for the halting problem.
How? Since HP is recognizable and (by assumption) its complement is
recognizable, a decision procedure can be built that works as follows: on
input hM, wi, incrementally execute both recognizers for the two languages.
Since HP and its complement together cover Σ∗ , one of the recognizers will
eventually accept hM, wi. So in finite time, the halting (or not) of M on w
will be detected, for any M and w. But this can’t be, because we have already
shown that HP is undecidable. Thus the complement of HP can’t be
recognizable, and so it must be a member of a class of languages properly
outside the recognizable languages.

Thus the set of all halting programs is recognizable, but the set of all
programs that do not halt can’t be recognizable for otherwise the set of
halting programs would be decidable, and we already know that such is
not the case.
There are many more non-recognizable languages. To prove that such
languages are indeed non-recognizable, one can use Theorem 1 or the no-
tion of reducibility. The latter is again embodied in an application of the
contrapositive:

(Recognizable(A) ⇒ Recognizable(B)) ⇒
(¬Recognizable(B) ⇒ ¬Recognizable(A))

Since we have shown that ¬Recognizable(Σ∗ − HP ), we can prove A is not
recognizable by showing

Recognizable(A) ⇒ Recognizable(Σ∗ − HP ) .

Example 58.

Chapter 4

Context-Free Grammars

Context Free Grammars (CFGs) first arose in the late 1950s as part of Noam
Chomsky’s work on the formal analysis of natural language. CFGs can
capture some of the syntax of natural languages, such as English, and also
of computer programming languages. Thus CFGs are of major importance
in Artificial Intelligence and the study of compilers.
Compilers use both automata and CFGs. Usually the lexical structure
of a programming language is given by a collection of regular expressions
which define the identifiers, keywords, literals, and comments of the lan-
guage. These regular expressions can be translated into an automaton,
usually called the lexer, which recognizes the basic lexical elements (lex-
emes) of programs. A parser for the programming language will take a
stream of lexemes coming from a lexer and build a parse tree (also known
as an abstract syntax tree or AST) by using a CFG. Thus parsing takes the
linear string of symbols given by a program text and produces a tree struc-
ture which is more suitable for later phases of compilation such as seman-
tic analysis, optimization, and code generation. This is illustrated in Fig-
ure 4.1. This is a naive picture; many compilers use more than one kind of
abstract syntax tree in their work. The main point is that tree structures
are far easier to work with than linear strings.

Example 59. A fragment of English can be captured with the following


grammar which is presented in a style very much like Backus-Naur Form
(BNF) in which grammar variables are in upper case and enclosed by h−i,

while terminals, or literals, are in lower case.

hSENTENCEi −→ hNPi hVPi


hNPi −→ hCNOUNi | hCNOUNi hPPi
hVPi −→ hCVERBi | hCVERBi hPPi
hPPi −→ hPREPi hCNOUNi
hCNOUNi −→ hARTICLEi hNOUNi
hCVERBi −→ hVERBi | hVERBi hNPi
hARTICLEi −→ a | the
hNOUNi −→ boy | girl | flower
hVERBi −→ touches | likes | sees
hPREPi −→ with

A sentence that can be generated from the grammar is

the girl with the boy   touches a flower
|-------- NP --------|  |------ VP -----|

This can be pictured with a so-called parse tree, which summarizes the
ways in which the sentence may be produced.

program text −→ [lexing] −→ lexeme stream −→ [parsing] −→ AST
    −→ [semantic analysis] −→ AST −→ [optimization] −→ AST
    −→ [code generation] −→ executable

Figure 4.1: Stages in compilation

SENTENCE
├─ NP
│  ├─ CNOUN
│  │  ├─ ARTICLE: the
│  │  └─ NOUN: girl
│  └─ PP
│     ├─ PREP: with
│     └─ CNOUN
│        ├─ ARTICLE: the
│        └─ NOUN: boy
└─ VP
   └─ CVERB
      ├─ VERB: touches
      └─ NP
         └─ CNOUN
            ├─ ARTICLE: a
            └─ NOUN: flower

Reading the leaves of the parse tree from left to right yields the original
string. The parse tree represents the possible derivations of the sentence.

Definition 12 (Context-free grammar). A context-free grammar is a 4-tuple


(V, Σ, R, S), where

• V is a finite set of variables

• Σ is a finite set of terminals

• R is a finite set of rules, each of which has the form

A −→ w

where A ∈ V and w ∈ (V ∪ Σ)∗ .

• S is the start variable.

Note. V ∩ Σ = ∅. This helps us keep our sanity, because variables and ter-
minals can’t be confused. In general, our convention will be that variables
are upper-case while terminals are in lower case.
A CFG is a device for generating strings. The way a string is gener-
ated is by starting with the start variable S and performing replacements for
variables, according to the rules.
A sentential form is a string in (V ∪ Σ)∗ . A sentence is a string in Σ∗ . Thus
every sentence is a sentential form, but in general a sentential form might
not be a sentence, in particular when it has variables occurring in it.
Example 60. If Σ = {0, 1} and V = {U, W }, then 00101 is a sentence and
therefore a sentential form. On the other hand, W W and W 01U are sen-
tential forms that are not sentences.
To rephrase our earlier point: a CFG is a device for generating, ulti-
mately, sentences. However, at intermediate points, the generation pro-
cess will produce sentential forms.
Definition 13 (One step replacement). Let u, v, w ∈ (V ∪ Σ)∗ . Let A ∈ V .
We write
uAv ⇒G uwv

to stand for the replacement of the indicated occurrence of the variable A by w.
This replacement is only allowed if there is a rule A −→ w in R. When it is
clear which grammar is being referred to, the G in ⇒G will be omitted.
Thus we can replace any variable A in a sentential form by its ‘right
hand side’ w. Note that there may be more than one occurrence of A in
the sentential form; in that case, only one occurrence may be replaced in a
step. Also, there may be more than one variable possible to replace in the
sentential form. In that case, it is arbitrary which variable gets replaced.
Example 61. Suppose that we have the grammar (V, Σ, R, S) where V =
{S, U} and Σ = {a, b} and R is given by
S −→ UaUbS
U −→ a
U −→ b
Then we can write S ⇒ UaUbS. Now consider UaUbS. There are 3 loca-
tions of variables that could be replaced (two Us and one S). In one step
we can get to the following sentential forms:

• UaUbS ⇒ UaUbUaUbS (Replacing S)

• UaUbS ⇒ aaUbS (Applying U −→ a at the first location of U)

• UaUbS ⇒ baUbS (Applying U −→ b at the first location of U)

• UaUbS ⇒ UaabS (Applying U −→ a at the second location of U)

• UaUbS ⇒ UabbS (Applying U −→ b at the second location of U)

Definition 14 (Multiple steps of replacement). Let u, w ∈ (V ∪ Σ)∗ . The


notation

u ⇒G w
asserts that there exists a finite sequence
u ⇒G u1 . . . ⇒G un ⇒G w

of one-step replacements, using the rules in G, leading from u to w.

This definition is a stepping stone to a more important one:

Definition 15 (Derivation). Suppose we are given a grammar G with start



variable S. If S ⇒G w and w ∈ Σ∗ , then we say G generates w. Similarly,
S ⇒G u1 . . . ⇒G un ⇒G w

is said to be a derivation of w.

Now we can define the set of strings derivable from a grammar, i.e.,
the language of the grammar: it is the set of sentences, i.e., strings lacking
variables, generated by G.

Definition 16 (Language of a grammar). The language L(G) of a grammar


G is defined by

L(G) = {x ∈ Σ∗ | S ⇒G x}

Definition 17 (Context-free language). L is a context-free language if there


is a CFG G such that L(G) = L.

One question that is often asked is Why Context-Free?; in other words,
what aspect of CFGs is ‘free of context’ (whatever that means)? The an-
swer comes from examining the allowed structure of a rule. A rule in a
context-free grammar may only have the form V −→ w. When making a
replacement for V in a derivation, the symbols surrounding V in the sen-
tential form do not affect whether the replacement can take place or not.
Hence context-free. In contrast, there is a class of grammars called context-
sensitive grammars, in which the left hand side of a rule can be an arbitrary
sentential form; such a rule could look like abV c −→ abwc, and a replace-
ment for V would only be allowed in a sentential form where V occurred
in the ‘context’ abV c. Context-sensitive grammars, and phrase-structure
grammars are more powerful formalisms than CFGs, and we won’t be
discussing them in the course.

Example 62. Let G be given by the following grammar:

( {S} , {0, 1} , {S −→ ε, S −→ 0S1} , S )

where the four components are, in order, the variables, Σ, the rules, and the start variable.

The following are some derivations using G:

• S⇒ε

• S ⇒ 0S1 ⇒ 0ε1 ⇒ 01

• S ⇒ 0S1 ⇒ 00S11 ⇒ 00ε11 ⇒ 0011

We “see” that L(G) = {0n 1n | n ≥ 0}. A rigorous proof of this would re-
quire proving the statement

∀w ∈ Σ∗ . w ∈ L(G) iff ∃n. w = 0n 1n

and the proof would proceed by induction on the length of the derivation.
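Derivation is a mechanical process, and it is sometimes helpful to see it run. Below is a small Python sketch (illustrative only) that represents a grammar as a dictionary from variables to lists of right-hand sides and performs a random leftmost derivation, printing each sentential form; it is shown on the grammar of Example 62.

import random

def derive(rules, start="S", max_steps=50):
    """Random leftmost derivation.  rules maps each variable to a list of
    right-hand sides, each right-hand side being a tuple of symbols.
    Returns the derived sentence, or raises if max_steps is exceeded."""
    form = [start]
    for _ in range(max_steps):
        variables = [i for i, s in enumerate(form) if s in rules]
        if not variables:                      # no variables left: a sentence
            return "".join(form)
        i = variables[0]                       # leftmost variable
        rhs = random.choice(rules[form[i]])    # pick one of its rules
        form[i:i + 1] = list(rhs)              # one-step replacement
        print("".join(form) if form else "ε")  # show the new sentential form
    raise RuntimeError("gave up after max_steps replacements")

# Example 62's grammar S -> ε | 0S1, whose language is {0^n 1^n | n >= 0}:
print(derive({"S": [(), ("0", "S", "1")]}))    # e.g. prints 0S1, 00S11, 0011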

Example 63. Give a CFG for the language L = {0n 12n | n ≥ 0}.
The answer to this can be obtained by a simple adaptation of the gram-
mar in the previous example:

S −→ ε
S −→ 0S11

Convention. We will usually be satisfied to give a CFG by giving its rules.
Usually, the start state will be named S, and the variables will be written
in upper case, while members of Σ will be written in lower case. Fur-
thermore, multiple rules with the same left-hand side will be collapsed
into a single rule, where the right-hand sides are separated by a |. Thus,
the previous grammar could be completely and unambiguously given as
S −→ ε | 0S11.
Example 64 (Palindromes). Give a grammar for generating palindromes
over {0, 1}∗ . Recall that the palindromes over alphabet Σ can be defined
as PAL = {w ∈ Σ∗ | w = w R }. Some examples are 101 and 0110 for binary
strings. For ASCII, there are some famous palindromes:1
• madamImadam, the first thing said in the Garden of Eden.

• ablewasIereIsawelba, attributed to Napoleon.

• amanaplanacanalpanama

• Wonder if Sununu’s fired now


Now, considerable ingenuity is sometimes needed when constructing
a grammar for a language. One way to start is to enumerate the first few
strings in the language and see if any regularities are apparent. For PAL,
we know that ε ∈ PAL, 0 ∈ PAL, and 1 ∈ PAL. Then we might ask the
question Suppose w is in PAL. How can I then make other strings in PAL? In
our example, there are two ways:
• 0w0 ∈ PAL

• 1w1 ∈ PAL
A little thought doesn’t reveal any other ways of building elements of PAL,
so our final grammar is

S −→ ε | 0 | 1 | 0S0 | 1S1.

Example 65. Give a grammar for generating the language of balanced


parentheses. The following strings are in this language: ε, (), (()), ()(),
(()(())), etc. Well, clearly, we will need to generate ε, so we will have a rule
1
These and many more can be found at http://www.palindromes.org.

103
S −→ ε. Now we assume that we have a string w with balanced parenthe-
ses, and want to generate a new string in the language from w. There are
two ways of doing this:
• (w)
• ww
So the grammar can be given by

S −→ ε | (S) | SS.
Example 66. Give a grammar that generates

L = {x ∈ {0, 1}∗ | count(x, 0) = count(x, 1)}

the set of binary strings with an equal numbers of 0s and 1s.


Now, clearly ε ∈ L. Were we to proceed as in the previous example,
we’d suppose that w ∈ L and try to see what strings in L we could build
from w. You might think that 0w1 and 1w0 would do it, but not so! Why?
Because we need to be able to generate strings like 0110 where the end-
points are the same. So the attempt

S −→ ε | 0S1 | 1S0

doesn’t work. Also, we couldn’t just add 0w0 and 1w1 in an effort to repair
this shortcoming, because then we could generate strings not in L, such as
00.
We want to think of S as denoting all strings with an equal number
of 0s and 1s. The previous attempts have the right idea—take a balanced
string w and make another balanced string from it—but only add the 0s
and 1s at the outer edges of the string. Instead, we want to add them at
internal locations as well. The following grammar supports this:

S −→ ε | S0S1S | S1S0S (4.1)


Another grammar that works:

S −→ ε | S0S1 | S1S0

And another (this is perhaps the most elegant):

S −→ ε | 0S1 | 1S0 | SS

Here’s a derivation of the string 0^3 1^6 0^3 using grammar (4.1).

S ⇒ S1S0S
⇒ S1S1S0S0S
⇒ S1S1S1S0S0S0S

⇒ S1ε1ε1ε0ε0ε0ε
⇒ S0S1S1ε1ε1ε0ε0ε0ε
⇒ S0S0S1S1S1ε1ε1ε0ε0ε0ε
⇒ S0S0S0S1S1S1S1ε1ε1ε0ε0ε0ε

⇒ ε0ε0ε0ε1ε1ε1ε1ε1ε1ε0ε0ε0ε
= 000111111000
= 0^3 1^6 0^3 .

Note that we used ⇒ to abbreviate multiple steps.

4.1 Aspects of grammar design


There are several strategies commonly used to build grammars. The main
one we want to focus on now is the use of closure properties.
The context-free languages enjoy several important closure properties.
Recall that a closure property asserts something of the form

If L has property P then f (L) also has property P .

The proofs of the closure properties involve constructions. The con-


structions for decidable and recognizable languages were phrased in terms
of machines; the constructions for CFLs are on grammars, although they
may be done on machines as well.

Theorem 5 (Closure properties of CFLs). If L, L1 , L2 are context-free lan-


guages, then so are L1 ∪ L2 , L1 · L2 , and L∗ .
Proof. Let L, L1 and L2 be context-free languages. Then there are gram-
mars G, G1 , G2 such that L(G) = L, L(G1 ) = L1 , and L(G2 ) = L2 . Let

G = (V, Σ, R, S)
G1 = (V1 , Σ1 , R1 , S1 )
G2 = (V2 , Σ2 , R2 , S2 )

Assume V1 ∩ V2 = ∅. Let S0 , S3 and S4 be variables not occurring in
V ∪ V1 ∪ V2 . These assumptions are intended to avoid confusion when
making the constructions.

• L1 ∪ L2 is generated by the grammar (V1 ∪ V2 ∪ {S3 }, Σ1 ∪ Σ2 , R3 , S3 )


where
R3 = R1 ∪ R2 ∪ {S3 −→ S1 | S2 } .
In other words, to get a grammar that recognizes the union of L1 and
L2 , we build a combined grammar and add a new rule saying that a
string is in L1 ∪ L2 if it can be derived by either G1 or by G2 .

• L1 · L2 is generated by the grammar (V1 ∪ V2 ∪ {S4 }, Σ1 ∪ Σ2 , R4 , S4 )


where
R4 = R1 ∪ R2 ∪ {S4 −→ S1 S2 } .
In other words, to get a grammar that recognizes the concatenation
of L1 and L2 , we build a combined grammar and add a new rule
saying that a string w is in L1 · L2 if there is a first part x and a second
part y such that w = xy and G1 derives x and G2 derives y.

• L∗ is generated by the grammar (V ∪ {S0 }, Σ, R5 , S0 ) where

R5 = R ∪ {S0 −→ S0 S | ε} .

In other words, to get a grammar that recognizes the Kleene star of L,


we add a new rule saying that a string is in L∗ if it can can be derived
by generating some number of strings by G and then concatentating
them together. The empty string is explicitly tossed in via the rule
S0 −→ ε.

Remark. One might ask: what about closure under intersection and comple-
ment? It happens that the CFLs are not closed under intersection and we
can see this by the following counterexample.
Example 67. Let grammar G1 be given by the following rules:

A −→ P Q
P −→ aP b | ε
Q −→ cQ | ε

Then L(G1 ) = {ai bi cj | i, j ≥ 0}. Let grammar G2 be given by

B −→ RT
R −→ aR | ε
T −→ bT c | ε

Then L(G2 ) = {ai bj cj | i, j ≥ 0}. Thus L(G1 ) ∩ L(G2 ) = {ai bi ci | i ≥ 0}.
But this is not a context-free language, as we shall see after discussing the
pumping lemma for CFLs.
Example 68. Construct a CFG for

L = {0m 1n | m 6= n} .

A solution to this problem comes from realizing that L = L1 ∪ L2 , where

L1 = {0m 1n | m < n}
L2 = {0m 1n | m > n}

L1 can be generated with the CFG given by the following rules R1 :

S1 −→ 1 | S1 1 | 0S1 1

and L2 can be generated by the following rules R2

S2 −→ 0 | 0S2 | 0S2 1 .

We obtain L by adding the rule S −→ S1 | S2 to R1 ∪ R2 .
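The union construction of Theorem 5 is what is being used here, and it is easy to mechanize with the same dictionary representation of grammars as in the earlier derivation sketch (the representation, and the requirement that the two variable sets be disjoint, are assumptions of the sketch).

def union_grammar(R1, S1, R2, S2, S="S"):
    """Grammar for L(G1) ∪ L(G2): keep all existing rules and
    add the single new rule S -> S1 | S2.
    R1 and R2 map variables to lists of right-hand sides (tuples);
    their variable sets must be disjoint and must not contain S."""
    rules = {**R1, **R2}
    rules[S] = [(S1,), (S2,)]          # the only new rule
    return rules

# Example 68: {0^m 1^n | m != n} as the union of the m < n and m > n pieces.
R1 = {"S1": [("1",), ("S1", "1"), ("0", "S1", "1")]}
R2 = {"S2": [("0",), ("0", "S2"), ("0", "S2", "1")]}
G = union_grammar(R1, "S1", R2, "S2")   # feeding G to the derive sketch above
                                        # produces strings such as 0 or 011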


Example 69. Give a CFG for L = {x#y | xR is a substring of y}. Note that
x and y are elements of {0, 1}∗ and that # is another symbol, not equal to
0 or 1.
The key point is that the notion is a substring of can be expressed by con-
catentation.

L = {x#y | xR is a substring of y}
= {x#uxR v | x, u, v ∈ {0, 1}∗ }
= {x#uxR | x, u ∈ {0, 1}∗ } · {v | v ∈ {0, 1}∗ }
| {z } | {z }
L1 L2

A grammar for L2 is then

S2 −→ ε | 0S2 | 1S2

A grammar for L1 :
S1 −→ 0S1 0 | 1S1 1 | #S2
Thus, the final grammar is

S −→ S1 S2
S1 −→ 0S1 0 | 1S1 1 | #S2
S2 −→ ε | 0S2 | 1S2

Example 70. Give a CFG for L = {0i1j | i < 2j}.


This problem is difficult to solve directly, but one can split it into the
union of two easier languages:

L = {0i 1j | i < j} ∪ {0i 1j | j ≤ i < 2j}

The grammar for the first language is

S1 −→ AB
A −→ 0A1 | ε
B −→ 1B | 1

The second language can be rephrased as {0^(j+k) 1^j | 0 ≤ k < j} and that can be
rephrased in terms of k and ℓ (letting j = k + ℓ + 1, for some ℓ ≥ 0):

{0^(k+ℓ+1+k) 1^(k+ℓ+1) | k, ℓ ≥ 0} = {0 0^(2k+ℓ) 1^(k+ℓ) 1 | k, ℓ ≥ 0}

and from this we have the grammar for the second language

S2 −→ 0X1
X −→ 00X1 | Y
Y −→ 0Y 1 | ε
Putting it all together gives

S −→ S1 | S2

S1 −→ AB
A −→ 0A1 | ε
B −→ 1B | 1

S2 −→ 0X1
X −→ 00X1 | A

Example 71. Give a CFG for L = {ai bj ck | i = j + k}.
If we note that
L = {aj ak bj ck | j, k ≥ 0}
= {ak aj bj ck | j, k ≥ 0}
we quickly get the grammar
S −→ aSc | A
A −→ aAb | ε
Example 72. Give a CFG for L = {ai bj ck | i 6= j + k}.
The solution begins by splitting the language into two pieces:
L = {ai bj ck | i 6= j + k}
= {ai bj ck | i < j + k} ∪ {ai bj ck | j + k < i}
| {z } | {z }
L1 L2

In L1 , there are more bs and cs, in total, than as. We again start by attempt-
ing to scrub off equal numbers of as and cs. At the end of that phase, there
may be more as left, in which case the cs are gone, or, there may be more
cs left, in which case the as are gone.
S1 −→ aS1 c | A | B
A −→ aAb | C
B −→ bD | Dc
The rule for A scrubs off any remaining as, leaving a non-empty string of
bs. The rule for B deals with a (non-empty) string bi cj . Thus we add the
rules
C −→ b | bC
D −→ EF
E −→ ε | bE
F −→ ε | cF
To obtain a grammar for L2 is easier:
S2 −→ aS2 c | B2
B2 −→ aB2 b | C2
C2 −→ aC2 | a
And finally we complete the grammar with

S −→ S1 | S2

Example 73. Give a CFG for

L = {am bn cp dq | m + n = p + q} .

This example takes some thought. At its core, the problem is (essen-
tially) a perverse elaboration of the language {0n 1n | n ≥ 0} (which is gen-
erated by the rules S −→ ε | 0S1). Now, strings in L have the form

am bn ‖ cp dq

where m + n = p + q and the double line ‖ marks the midpoint of the string.
We will build the grammar in stages. We first construct a rule that will
‘cancel off’ a and d symbols from the outside-in :

S −→ aSd

In fact, min(m, q) symbols get cancelled. After this step, there are two cases
to consider:

1. m ≤ q, i.e., all the leading a symbols have been removed, leaving the
remaining string bn cp di , where i = q − m.

2. q ≤ m, i.e., all the trailing d symbols have been removed, leaving the
remaining string aj bn cp , where j = m − q.

In the first case, the situation looks like

bn cp d i

We now cancel off b and d symbols from the outside-in (if possible—it
could be that i = 0) using the following rule:

A −→ bAd

After this rule finishes, all trailing d symbols have been trimmed and the
situation looks like
bn−i cp
Now we can use the rule
C −→ bCc | ε

to trim all the matching b and c symbols that remain (there must be an
equal number of them). Thus, for this case, we have constructed the gram-
mar
S −→ aSd | A
A −→ bAd | C
C −→ bCc | ε
The second case, q ≤ m, is completely similar: the situation looks like
a^j b^n c^p
We now cancel off a and c symbols from the outside-in (if possible—it
could be that j = 0) using the following rule:
B −→ aBc
After this rule finishes, all trailing c symbols have been trimmed and the
situation looks like
b^n c^{p−j}
Now we can re-use the rule
C −→ bCc | ε
to trim all the matching b and c symbols that remain. Thus, to handle the
case q ≤ m we have to add the rule B −→ aBc to the grammar, resulting
in the final grammar
S −→ aSd | A | B
A −→ bAd | C
B −→ aBc | C
C −→ bCc | ε
Now we examine a few problems about the language generated by a gram-
mar.
Example 74. What is the language generated by the grammar given by the
following rules?
S −→ ABA
A −→ a | bb
B −→ bB | ε
The answer is easy: (a + bb)b∗ (a + bb). The reason why it is easy is that
an A leads in one step to terminals (either a or bb); also, B expands to an
arbitrary number of bs.

Now for a similar grammar which is harder to understand:
Example 75. What is the language generated by the grammar given by the
following rules?
S −→ ABA
A −→ a | bb
B −→ bS | ε
We see that the grammar is nearly identical to the previous, except for
recursion on the start variable: a B can expand to bS, which means that
another trip through the grammar will be required. Let’s generate some
sentential forms to get a feel for the language (it will be useful to refrain
from substituting for A):

S ⇒ ABA ⇒ AbSA ⇒ AbABAA ⇒ AbAbSAA ⇒ AbAbABAAA ⇒ . . .


What’s the pattern here? It helps to focus on B alone. Let’s expand B
out and try to not include S in the sentential forms:
B ⇒ ε
B ⇒ bABA ⇒ bAεA = bAA
B ⇒ bABA ⇒ bAbABAA ⇒ (bA)^2 εA^2 = (bA)^2 A^2
B ⇒ bABA ⇒ bAbABAA ⇒ (bA)^2 bABAA^2 ⇒ (bA)^3 εA^3 = (bA)^3 A^3
..
.

By scrutiny, we can see that if B ⇒∗ w and w contains no occurrence of B
(or S), then w = (bA)^n A^n for some n ≥ 0. Since S ⇒ ABA, we have

L(G) = (a + bb) (b(a + bb))^n (a + bb)^n (a + bb)

where the four factors correspond to A, (bA)^n , A^n , and A respectively.

Simplified a bit, we obtain:


L(G) = {(a + bb)(ba + b^3 )^n (a + bb)^{n+1} | n ≥ 0}

4.1.1 Proving properties of grammars


In order to prove properties of grammars, we typically use induction.
Sometimes, this is induction on the length of derivations, and sometimes
on the length of strings. It is a matter of experience which kind of induc-
tion to do!

Example 76. Prove that every string produced by

G = S −→ 0 | S0 | 0S | 1SS | SS1 | S1S

has more 0’s than 1’s.


Proof. Let P (x) mean count(1, x) < count(0, x). We wish to show that

S ⇒∗ w implies P (w).

Consider a derivation of an arbitrary string w. Since we don’t know any-


thing about w, we don’t know anything about the length of its derivation.
Let’s say that the derivation takes n steps. We are going to proceed by in-
duction on n. We can assume that 0 < n, because derivations need to have
at least one step. Now let’s take for our inductive hypothesis the following
statement: P (x) holds for any string x derived in fewer than n steps. The first
step in the derivation must be an application of one of the six rules in the
grammar:
S −→ 0. Then the length of the derivation is 1, so w = 0, so P (w) holds.
S −→ S0. In this case, there must be a derivation S ⇒ u, taking n − 1 steps, such that w = u0.
By the IH, P (u) holds, so P (w) holds, since we’ve added another 0.
S −→ 0S. Similar to previous.
S −→ 1SS. In this case, there must be derivations S ⇒ u (taking k steps) and S ⇒ v (taking ℓ
steps) such that w = 1uv. Now k < n and ℓ < n so, by the IH, P (u) and P (v)
both hold, so P (w) holds, since

count(1, 1uv) = 1 + count(1, u) + count(1, v)
             ≤ count(0, u) + count(0, v) − 1
             < count(0, u) + count(0, v) = count(0, 1uv).

S −→ SS1. Similar to previous.


S −→ S1S. Similar to previous.

Note. We are using a version of induction called strong induction; when


trying to prove P holds for derivations of length n, strong induction allows
us to assume P holds for all derivations of length m, provided m < n.
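Although the proof above is complete, it is reassuring to test the claim
mechanically. The following Python sketch (our own encoding of the grammar of
Example 76) enumerates every terminal string derivable within a given length
bound and checks that each one has more 0's than 1's. It is a sanity check,
not a proof.

    RULES = {"S": ["0", "S0", "0S", "1SS", "SS1", "S1S"]}

    def derivable(max_len):
        # Worklist enumeration of sentential forms; no rule shrinks a form,
        # so forms longer than max_len can be pruned safely.
        seen, frontier, strings = set(), {"S"}, set()
        while frontier:
            form = frontier.pop()
            if form in seen or len(form) > max_len:
                continue
            seen.add(form)
            if "S" not in form:              # purely terminal string
                strings.add(form)
                continue
            i = form.index("S")              # expand the leftmost variable
            for rhs in RULES["S"]:
                frontier.add(form[:i] + rhs + form[i+1:])
        return strings

    for w in sorted(derivable(7), key=len):
        assert w.count("0") > w.count("1"), w
    print("property holds for all derivable strings of length at most 7")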

4.2 Ambiguity
It is well known that natural languages such as English allow ambiguous
sentences: ones that can be understood in more than one way. At times
ambiguity arises from differences in the semantics of words, e.g., a word
may have more than one meaning. One favourite example is the word
livid, which can mean ‘ashen’ or ‘pallid’ but could also mean ‘black-and-
blue’. So when one is livid with rage, is their face white or purple?
Ambiguity of a different sort is found in the following sentences: com-
pare Fruit flies like a banana with Time flies like an arrow. The structures of
the parse trees for the two sentences are completely different. In natural
languages, ambiguity is a good thing, allowing much richness of expres-
sion, including puns. On the other hand, ambiguity is a terrible thing for
computer languages. If a grammar for a programming language allowed
some inputs to be parsed in two different ways, then different compilers
could compile a source file differently, which leads to much unhappiness.
In order to deal with ambiguity formally, we have to make a few def-
initions. To assert that a grammar is ambiguous, we really mean to say
that some string has more than one parse tree. But we want to avoid for-
malizing what parse trees are. Instead, we’d like to formalize the notion
in terms of derivations. However, we can’t simply say that a grammar is
ambiguous if there is some string having more than one derivation. That
doesn’t work, since there can be many ‘essentially similar’ derivations of
a string. (In fact, this is exactly what a parse tree captures.) The following
notion forces some amount of determinism on all derivations of a string.

Definition 18 (Leftmost derivation). A leftmost derivation is one in which,


at each step, the leftmost variable in the sentential form is replaced.

But that can’t take care of it all. The choice of variable to be replaced
in a leftmost derivation might be fixed, but there could be multiple right
hand sides for that variable. This is what leads to different parse trees.

Definition 19 (Ambiguity). A grammar G is ambiguous if there is a string


w ∈ L(G) that has more than one leftmost derivation.

Now let’s look at an ambiguous grammar for arithmetical expressions.

Example 77. Let G be
E −→ E+E
| E−E
| E∗E
| E/E
| −E
| C
| V
| (E)
C −→ 0|1
V −→ x|y|z
That G is ambiguous is easy to see: consider the expression x + y ∗ z. By
expanding the ‘+’ rule first, we have a derivation that starts E ⇒ E + E ⇒
· · · and the expression would be parsed as x + (y ∗ z). By expanding the
‘∗’ rule first, we have a derivation that starts E ⇒ E ∗ E ⇒ · · · and the
expression would be parsed as (x + y) ∗ z.

Now some hard facts about ambiguity:

• Some CFLs can only be generated by ambiguous grammars. These


are called inherently ambiguous languages.

Example 78. The language {a^i b^j c^k | (i = j) ∨ (j = k)} is inherently


ambiguous.

• The decision problem of checking whether a grammar is ambiguous


or not is not solvable.

However, let’s not be depressed by this. There’s a common technique


that often allows us to change an ambiguous grammar into an equiva-
lent unambiguous grammar, based on precedence. In the standard way of
reading arithmetic expressions, ∗ binds more tightly than +, so we would
tend to read—courtesy of the indoctrination we received in grade school—
x + y ∗ z as standing for x + (y ∗ z). Happily, we can transform our grammar
to reflect this, and get rid of ambiguity. In setting up precedences, we will
use a notion of level. All operators at the same level have the same binding
strength relative to operators at other levels, but will have ‘internal’ prece-
dences among themselves as well. Thus, both + and − bind less tightly

than ∗, but also − binds tighter than +. We can summarize this for arith-
metic operations as follows:

{−, +} bind less tightly than


{/, ∗} bind less tightly than
{−(unary negation)} bind less tightly than
{V, C, (−)}
From this classification we can write the grammar directly out. We
have to split E into new variables which reflect the precedence levels.

E −→ E +T |E −T |T
T −→ T ∗ U | T /U | U
U −→ −U | F
F −→ C | V | (E)
C −→ 0|1
V −→ x|y|z

Now if we want to generate a leftmost derivation, there are no choices.


Let’s try it on the input x − y ∗ z + x:

E ⇒ E + T ⇒ (T − T ) + T ⇒ (U − T ) + T
⇒ (F − T ) + T ⇒ (V − T ) + T ⇒ (x − T ) + T
⇒ (x − (T ∗ U)) + T ⇒ (x − (U ∗ U)) + T ⇒ (x − (F ∗ U)) + T

⇒ (x − (y ∗ U)) + T ⇒ (x − (y ∗ z)) + T

⇒ (x − (y ∗ z)) + x

Note. We are forced to expand the rule E −→ E + T first; otherwise, had


we started with E −→ E − T then we would have to generate y ∗ z + x
from T . This can only be accomplished by expanding T to T ∗ U, but then
there’s no way to derive z + x from U.
Note. This becomes more complex when right and left associativity are to
be supported.
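To see how the precedence layering plays out in a parser, here is a minimal
recursive-descent recognizer in Python for the E/T/U/F grammar above. The class
and function names are our own, and the left recursion (E −→ E + T , T −→ T ∗ U)
is handled by iteration, a standard reworking; so this is an illustrative
sketch rather than a literal transcription of the grammar.

    class Parser:
        def __init__(self, s):
            self.toks = list(s.replace(" ", ""))
            self.i = 0

        def peek(self):
            return self.toks[self.i] if self.i < len(self.toks) else None

        def eat(self, t):
            if self.peek() != t:
                raise SyntaxError(f"expected {t!r} at position {self.i}")
            self.i += 1

        def parse_E(self):                  # level of + and - (binds least tightly)
            self.parse_T()
            while self.peek() in ("+", "-"):
                self.eat(self.peek())
                self.parse_T()

        def parse_T(self):                  # level of * and /
            self.parse_U()
            while self.peek() in ("*", "/"):
                self.eat(self.peek())
                self.parse_U()

        def parse_U(self):                  # unary negation
            if self.peek() == "-":
                self.eat("-")
                self.parse_U()
            else:
                self.parse_F()

        def parse_F(self):                  # constants, variables, parenthesised expressions
            t = self.peek()
            if t in ("0", "1", "x", "y", "z"):
                self.eat(t)
            elif t == "(":
                self.eat("(")
                self.parse_E()
                self.eat(")")
            else:
                raise SyntaxError(f"unexpected symbol {t!r}")

    def recognizes(s):
        p = Parser(s)
        try:
            p.parse_E()
            return p.i == len(p.toks)
        except SyntaxError:
            return False

    print(recognizes("x - y * z + x"))   # True
    print(recognizes("x + + y"))         # False

Because each operator lives on its own level, there is never a choice between
expanding '+' or '∗' first: the shape of the parse is forced, which is exactly
why the layered grammar is unambiguous.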
Example 79 (Dangling else). Suppose we wish to support not only

if test then action else action

statements in a programming language, but also one-armed if statements


of the form if test then action. This leads to a form of ambiguity known

as the dangling else. A skeletal grammar including both forms is

S −→ if B then A | A
A −→ if B then A else S | C
B −→ b1 | b2 | b3
C −→ a1 | a2 | a3
Then the sentence

if b1 then if b2 then a1 else if b3 then a2 else a3

can be parsed as

if b1 then (if b2 then a1 ) else if b3 then a2 else a3

or as
if b1 then (if b2 then a1 else if b3 then a2 ) else a3
How can this be repaired?

4.3 Algorithms on CFGs


We now consider a few algorithms that can be applied to context free
grammars. In some cases, these are intended to compute various inter-
esting properties of the grammars, or of variables in the grammars, and
in some cases, they can be used to simplify a grammar into a form more
suitable for machine processing.
Definition 20 (Live variables). A variable A in a context-free grammar G =

(V, Σ, R, S) is said to be live if A ⇒∗ x for some x ∈ Σ∗ .
To compute the live variables, we proceed in a bottom-up fashion.
Rules are processed right to left. Thus, we begin by marking every variable
V where there is a rule V −→ rhs such that rhs ∈ Σ∗ . Then we repeatedly
do the following: mark variable U when there is a rule U −→ rhs such
that every variable in rhs is marked. This continues until no new
variables become marked. The live variables are those that are marked.
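The marking procedure translates directly into code. Here is a short Python
sketch (the rule representation — a list of (lhs, rhs) pairs with rhs a list of
symbols — is our own choice), applied to a small made-up grammar for
illustration.

    def live_variables(variables, rules):
        # rules: list of (lhs, rhs) pairs, rhs a list of terminals and variables
        live, changed = set(), True
        while changed:
            changed = False
            for lhs, rhs in rules:
                # mark lhs when every variable occurring in rhs is already marked
                if lhs not in live and all(s in live for s in rhs if s in variables):
                    live.add(lhs)
                    changed = True
        return live

    # A small illustrative grammar: S -> AB | a, A -> aA, B -> b
    V = {"S", "A", "B"}
    R = [("S", ["A", "B"]), ("S", ["a"]), ("A", ["a", "A"]), ("B", ["b"])]
    print(live_variables(V, R))   # {'S', 'B'}  (A is not live)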
Definition 21 (Reachable variables). A variable A in a context-free gram-

mar G = (V, Σ, R, S) is said to be reachable if S ⇒∗ αAβ for some α, β in
(Σ ∪ V )∗ .

The previous algorithm propagates markings from right to left in rules.
To compute the reachable variables, we do the opposite: processing pro-
ceeds top down and from left to right. Thus, we begin by marking the start
variable. Then we look at the rhs of every production of the form S −→ rhs
and mark every unmarked variable in rhs. We continue in this way, now
processing the rules of every newly marked variable, until no new variables
become marked. The reachable variables are those that are marked.
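A sketch of this direction, in the same representation as before (again our
own encoding):

    def reachable_variables(variables, start, rules):
        reachable, changed = {start}, True
        while changed:
            changed = False
            for lhs, rhs in rules:
                if lhs in reachable:
                    for s in rhs:
                        if s in variables and s not in reachable:
                            reachable.add(s)   # s occurs on the rhs of a reachable variable
                            changed = True
        return reachable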
Definition 22 (Useful variables). A variable A in a context-free grammar
G = (V, Σ, R, S) is said to be useful if for some string x ∈ Σ∗ there is a
derivation of x that takes the form S ⇒∗ αAβ ⇒∗ x. A variable that is not
useful is said to be useless. If a variable is not live or is not reachable then
it is clearly useless.
Example 80. Find a grammar having no useless variables which is equiv-
alent to the following grammar
S −→ ABC | BaB
A −→ aA | BaC | aaa
B −→ bBb | a
C −→ CA | AC
The reachable variables of this grammar are {S, A, B, C} and the live
variables are {A, B, S}. Since C is not live, L(C) = ∅, hence L(ABC) = ∅
and also L(BaC) = ∅, so we can delete the rules S −→ ABC and A −→
BaC to obtain the new, equivalent, grammar
S −→ BaB
A −→ aA | aaa
B −→ bBb | a
In this grammar, A is not reachable, so any rules with A on the lhs can be
dropped. This leaves
S −→ BaB
B −→ bBb | a

4.3.1 Chomsky Normal Form


It is sometimes helpful to eliminate various forms of redundancy in a
grammar. For example, a grammar with a rule such as V −→ ε occur-
ring in it might be thought to be simpler if all occurrences of V in the right

hand side of a rule were eliminated. Similarly, a rule such as P −→ Q is
an indirection that can seemingly be eliminated. A grammar in Chomsky
Normal Form2 is one in which these redundancies do not occur. However,
the simplification steps are somewhat technical, so we will have to take
some care in their application.
Definition 23 (Chomsky Normal Form). A grammar is in Chomsky Nor-
mal Form if every rule has one of the following forms:
• A −→ BC

• A −→ a
where A, B, C are variables, and a is a terminal. Furthermore, in all rules
A −→ BC, we require that neither B nor C are the start variable for the
grammar. Notice that the above restrictions do not allow a rule of the form
A −→ ε; however, this will disallow some grammars. Therefore, we allow
the rule S −→ ε, where S is the start variable.
The following algorithm translates grammar G = (V, Σ, R, S) to Chom-
sky Normal Form:

1. Create a new start variable S0 and add the rule S0 −→ S. Now the
start variable is not on the right hand side of any rule.

2. Eliminate all rules of the form A −→ ε. For each rule of the form
V −→ uAw, where u, w ∈ (V ∪ Σ)∗ , we add the rule V −→ uw. It is
important to notice that we must do this for every occurrence of A in
the right hand side of the rule. Thus the rule

V −→ uAwAv

yields the new rules


V −→ uwAv
V −→ uAwv
V −→ uwv

If we had the rule V −→ A, we add V −→ ε. This will get eliminated


in later steps.
2
In theoretical computer science, a normal form of an expression x is an equivalent
expression y which is in reduced form, i.e., y cannot be further simplified.

3. Eliminate all rules which merely replace one variable by another, e.g.,
V −→ W . These are sometimes called unit rules. Thus, for each rule
W −→ u where u ∈ (V ∪ Σ)∗ , we add V −→ u.

4. Map rules into binary. A rule A −→ u1 u2 . . . un where n ≥ 3 and each


ui is either a symbol from Σ or a variable in V , is replaced by the
collection of rules
A −→ u1 A1
A1 −→ u2 A2
A2 −→ u3 A3
..
.
An−2 −→ un−1 un
where A1 , . . . , An−2 are new variables. Each of the ui must be a vari-
able. If it is not, then add a rule Ui −→ ui , and replace ui everywhere
in the rule set with Ui .
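As an illustration of step 4, the splitting of long right-hand sides can be
coded as follows; the Python representation and the fresh-variable naming
scheme are our own, and the sketch covers only the splitting itself (replacing
terminals by fresh variables Ui is a separate, simpler pass).

    def binarize(rules):
        # rules: list of (lhs, rhs) with rhs a list of symbols
        out, counter = [], 0
        for lhs, rhs in rules:
            if len(rhs) <= 2:
                out.append((lhs, rhs))
                continue
            # A -> u1 u2 ... un  becomes  A -> u1 A1, A1 -> u2 A2, ..., A(n-2) -> u(n-1) un
            current = lhs
            for u in rhs[:-2]:
                counter += 1
                fresh = f"{lhs}_{counter}"
                out.append((current, [u, fresh]))
                current = fresh
            out.append((current, rhs[-2:]))
        return out

    print(binarize([("S", ["A", "S", "A"])]))
    # [('S', ['A', 'S_1']), ('S_1', ['S', 'A'])]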

Theorem 6. Every grammar G has a Chomsky Normal Form G′ , and L(G) =


L(G′ ).

Example 81. Let G be given by the following grammar:

S −→ ASA | aB
A −→ B | S
B −→ b | ε

We will convert this to Chomsky Normal Form by following the steps in


the algorithm.

1. Add new start variable. This is accomplished by adding the new rule
S0 −→ S.

2. Now we eliminate the rule B −→ ε. We must make a copy of each


rule where B occurs on the right hand side (underlined below). There-
fore the grammar
S0 −→ S
S −→ ASA | aB
A −→ B | S
B −→ b | ε

is transformed to
S0 −→ S
S −→ ASA | aB | a
A −→ B|S|ε
B −→ b
Notice that, for example, we don’t drop A −→ B; instead we keep it
and add A −→ ε. So we’ve dropped one ε-rule and added another.
3. Eliminate A −→ ε. This yields the following grammar:
S0 −→ S
S −→ ASA | AS | SA | S | aB | a
A −→ B|S
B −→ b
We have now finished eliminating ε-rules and can move to eliminat-
ing unit rules.
4. Eliminate S −→ S. This illustrates a special case: when asked to
eliminate a rule V −→ V , the rule may simply be dropped without
any more thought. Thus we have the grammar
S0 −→ S
S −→ ASA | AS | SA | aB | a
A −→ B|S
B −→ b

5. Eliminate S0 −→ S. In this case, that means that wherever there is a


rule S −→ w, we will add S0 −→ w. Thus we have
S0 −→ ASA | AS | SA | aB | a
S −→ ASA | AS | SA | aB | a
A −→ B|S
B −→ b

6. Eliminate A −→ B. In this case, that means that wherever there is a


rule B −→ w, we will add A −→ w. Thus we have
S0 −→ ASA | AS | SA | aB | a
S −→ ASA | AS | SA | aB | a
A −→ S|b
B −→ b

7. Eliminate A −→ S. In this case, that means that wherever there is a
rule S −→ w, we will add A −→ w. Thus we have

S0 −→ ASA | AS | SA | aB | a
S −→ ASA | AS | SA | aB | a
A −→ ASA | AS | SA | aB | a | b
B −→ b

That finishes the elimination of unit rules. Now we map the gram-
mar to binary form.

8. The rule S −→ ASA needs to be split, which is accomplished by


adding a new rule A1 −→ SA, and replacing all occurrences of ASA
by AA1 :
S0 −→ AA1 | AS | SA | aB | a
S −→ AA1 | AS | SA | aB | a
A −→ AA1 | AS | SA | aB | a | b
A1 −→ SA
B −→ b

9. The grammar is still not in final form: right-hand sides such as aB


are not in the correct format. This is taken care of by adding a new
rule U −→ a and propagating its definition to all binary rules with
the terminal a on the right hand side. This gives us the final grammar
in Chomsky Normal Form:

S0 −→ AA1 | AS | SA | UB | a
S −→ AA1 | AS | SA | UB | a
A −→ AA1 | AS | SA | UB | a | b
A1 −→ SA
B −→ b
U −→ a

As we can see, conversion to Chomsky Normal Form (CNF) can lead


to bulky and awkward grammars. However a grammar G in CNF has
various advantages. One of them is that every step in a derivation using
G makes demonstrable progress towards the final string because either

• the sentential form gets strictly longer (by 1); or

• a new terminal symbol appears.

Theorem 7. If G is in Chomsky Normal Form, then for any string w ∈ L(G) of


length n ≥ 1, exactly 2n − 1 steps are required in a derivation of w.
Proof. Let w ∈ L(G) have length n ≥ 1, and consider a derivation of w
using G, which is in CNF. (The only one-step derivations are S ⇒ ε and
S ⇒ a; the former is ruled out since n ≥ 1, and the latter takes 2 · 1 − 1 = 1
step, as claimed.)
Since w ∈ Σ∗ , we know that w is made up of n terminals. In order to
produce those n terminals with a CNF grammar, we would need n appli-
cations of rules of the form V −→ a. In order for there to have been n such
applications, there must have been n variables ‘added’ in the derivation.
Notice that the only way to add a variable into the sentential form is to
apply a rule of the form A −→ BC, which replaces one variable by two.
Thus we require at least n − 1 applications of rules of the form A −→ BC to
produce enough variables to replace by our n terminals. Thus we require
at least 2n − 1 steps in the derivation of w.
Could the derivation of w be longer than 2n − 1 steps? No. If that was
so, there would have to be more than n − 1 steps of the form A −→ BC in
the derivation, and then we would have some uneliminated variables.
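For instance, in the Chomsky Normal Form grammar obtained in Example 81, the
string aab (n = 3) has the derivation

S0 ⇒ AS ⇒ aS ⇒ aUB ⇒ aaB ⇒ aab

which takes exactly 2 · 3 − 1 = 5 steps: two applications of rules of the form
V −→ BC (namely S0 −→ AS and S −→ UB) and three applications of rules of the
form V −→ a (namely A −→ a, U −→ a and B −→ b).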

4.4 Context-Free Parsing


A natural question to ask is the following: given context-free grammar
G, and a string w, is w in the language generated by G, or equivalently,

G ⇒ w? This can be phrased as the decision problem inCFL:

inCFL = {⟨G, w⟩ | G ⇒ w}.

Note well that the decision problem is to decide, given an arbitrary


grammar and an arbitrary string, whether that string can be generated by
that grammar.
One possible approach would be to enumerate derivations in increas-
ing length, attempting to see if w eventually gets derived, but of course
this is hopelessly inefficient, and moreover won’t terminate if w ∉ L(G).
A better approach would be to translate the grammar to Chomsky Normal
Form and then enumerate all derivations of length 2n − 1, where n is the length of w (there are a finite

number of these) checking each to see if w is derived. If it is, then accep-
tance, otherwise no derivation of w of length 2n−1 exists, so no derivation
of w exists at all, so rejection. Again, this is quite inefficient.
Fortunately, there are general algorithms for context-free parsing that
run relatively efficiently. We are going to look at one, known as the CKY al-
gorithm3 , which is directly based on grammars in Chomsky Normal Form.
If G is in CNF then it has only rules of the form

S −→ V1 V2 | . . .

(For the moment, we’ll ignore the fact that a rule S −→ ε may be allowed.
Also we will ignore rules of the form V −→ a.) Suppose that we want to
parse the string
w = w1 w2 . . . wn

Now, S ⇒∗ w if

S ⇒ V1 V2 and
V1 ⇒∗ w1 . . . wi and
V2 ⇒∗ wi+1 . . . wn
for some splitting of w at index i. This recursive splitting process proceeds
until the problem size becomes 1, i.e., the problem becomes one of finding
a rule V −→ wi that generates a single terminal.
Now, of course, the problem is that there are n − 1 ways to split a string
of length n in two pieces having at least one symbol each. The algorithm
considers all of the splits, but in a clever way. The processing goes bottom-
up, dealing with shorter strings before longer ones. In this way, solutions
to smaller problems can be re-used when dealing with larger problems.
Thus this algorithm is an instance of the technique known as dynamic pro-
gramming.
The main notion in the algorithm is

N[i, i + j]

which denotes the set of variables in G that can derive the substring wi . . . wi+j−1 .
Thus N[i, i + 1] refers to the variables that can derive the single symbol wi .
If we can properly implement this abstraction, then all we have to do to de-

cide if S ⇒ w, roughly speaking, is compute N[1, n + 1] and check whether
3
After the co-inventors Cocke, Kasami, and Younger.

S is in the resulting set. (Note: we will index strings starting at 1 in this
section.)
Thus we will systematically compute the following, moving from a
step-size of 1, to one of n, where n is the length of w:

Step size

1 N[1, 2], N[2, 3], . . . N[n, n + 1]


2 N[1, 3], N[2, 4], . . . N[n − 1, n + 1]
.. ..
. .
n N[1, n + 1]

In the algorithm, N[i, j] is represented by a two-dimensional array N,


where the contents of location N[i, j] is the set of variables that generate
wi . . . wj . We will only need to consider a triangular sub-array.
The ideas are best introduced by example.
Example 82. Consider the language BAL of balanced parentheses, gener-
ated by the grammar
S −→ ε | (S) | SS
This grammar is not in Chomsky Normal Form, but the following steps
will achieve that:
• New start symbol
S0 −→ S
S −→ ε | (S) | SS

• Eliminate S −→ ε
S0 −→ S | ε
S −→ (S) | () | S | SS

• Drop S −→ S
S0 −→ S | ε
S −→ (S) | () | SS

• Eliminate S0 −→ S
S0 −→ ε | (S) | () | SS
S −→ (S) | () | SS

• Put in binary rule format. We add two rules for deriving the opening
and closing parentheses:
L −→ (
R −→)
and then the final grammar is

S0 −→ ε | LA | LR | SS
S −→ LA | LR | SS
A −→ SR
L −→ (
R −→ )

Now, let’s try the algorithm on parsing the string (()(())) with this
grammar. The length n of this string is 8. We start by constructing an
array N with n + 1 = 9 rows and n columns. Then we write the string to
be parsed along the diagonal:

1 (
2 (
3 )
4 )
5 (
6 )
7 )
8 )
9
1 2 3 4 5 6 7 8
Now we consider, for each substring of length 1 in the string, the vari-
ables that could derive it. For example, the element at N[2, 3] will be L,
since the rule L −→ ( can be used to generate a ‘(’ symbol. In this way,
each N[i, i + 1], i.e., just below the diagonal is filled in:

1 (
2 L (
3 L )
4 R (
5 L (
6 L )
7 R )
8 R )
9 R
1 2 3 4 5 6 7 8
Now we consider, for each substring of length 2 in the string, the vari-
ables that could derive it. Now here’s where the cleverness of the algo-
rithm manifests itself. All the information for N[i, i + 2] can be found by
looking at N[i, i + 1] and N[i + 1, i + 2]. So we can re-use information al-
ready calculated and stored in N. For strings of length 2, it’s particularly
easy, since the relevant information is directly above and directly to the
right. For example, the element at N[1, 3] is calculated by asking “is there
a rule of the form V −→ LL?” There is none, so N[1, 3] = ∅. Similarly,
the entry at N[2, 4] = S0 , S because of the rules S0 −→ LR and S −→ LR.
Proceeding in this way, the next diagonal of the array is filled in as follows:

1 (
2 L (
3 ∅ L )
4 S0 , S R (
5 ∅ L (
6 ∅ L )
7 S0 , S R )
8 ∅ R )
9 ∅ R
1 2 3 4 5 6 7 8
Now substrings of length 3 are addressed. It’s important to note that
all ways of dividing a string of length 3 into non-empty substrings has to
be considered. Thus N[i, i+3] is computed from N[i, i+1] and N[i+1, i+3]
as well as N[i, i + 2] and N[i + 2, i + 3]. For example, let’s calculate N[1, 4]

• N[1, 2] = L and N[2, 4] = S, but there is no rule of the form V −→ LS,
so this split produces no variables
• N[1, 3] = ∅ and N[3, 4] = R, so this split produces no variables also
Hence N[1, 4] = ∅. By similar calculations, N[2, 5], N[3, 6], N[4, 7] are all ∅.
In N[5, 8] though, we can use the rule A −→ SR to derive N[5, 7] followed
by N[7, 8]. Thus the next diagonal is filled in:

1 (
2 L (
3 ∅ L )
4 ∅ S0 , S R (
5 ∅ ∅ L (
6 ∅ ∅ L )
7 ∅ S0 , S R )
8 A ∅ R )
9 ∅ ∅ R
1 2 3 4 5 6 7 8
Filling in the rest of the diagonals yields

1 (
2 L (
3 ∅ L )
4 ∅ S0 , S R (
5 ∅ ∅ ∅ L (
6 ∅ ∅ ∅ ∅ L )
7 ∅ ∅ ∅ ∅ S0 , S R )
8 ∅ S0 , S ∅ S0 , S A ∅ R )
9 S0 , S A A ∅ ∅ ∅ R
1 2 3 4 5 6 7 8
Since S0 ∈ N[1, 9], we have shown the existence of a parse tree for the
string (()(())).
An implementation of this algorithm can be coded in a concise triply-
nested loop of the form:
For each substring length, from shortest to longest
    For each substring u of that length
        For each split of u into two non-empty pieces
            ....
As a result, the running time of the algorithm is O(n^3 ) in the length of
the input string.
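The algorithm itself is short; here is a Python sketch of the CKY recognizer
for grammars already in Chomsky Normal Form (the rule representation and names
are our own), instantiated with the CNF grammar for balanced parentheses
constructed above. The ε-rule S0 −→ ε is not handled here, so the sketch only
decides membership for non-empty strings.

    def cky(w, unit_rules, binary_rules, start):
        # unit_rules:   list of (A, a)     for rules A -> a
        # binary_rules: list of (A, B, C)  for rules A -> B C
        # N[i][j] is the set of variables deriving w_i ... w_{j-1} (1-based, as in the text)
        n = len(w)
        N = [[set() for _ in range(n + 2)] for _ in range(n + 2)]
        for i in range(1, n + 1):                       # substrings of length 1
            for (A, a) in unit_rules:
                if a == w[i - 1]:
                    N[i][i + 1].add(A)
        for length in range(2, n + 1):                  # longer substrings, shortest first
            for i in range(1, n - length + 2):
                j = i + length
                for k in range(i + 1, j):               # every split into two non-empty pieces
                    for (A, B, C) in binary_rules:
                        if B in N[i][k] and C in N[k][j]:
                            N[i][j].add(A)
        return start in N[1][n + 1]

    # CNF grammar for balanced parentheses from the example above:
    units = [("L", "("), ("R", ")")]
    bins  = [("S0", "L", "A"), ("S0", "L", "R"), ("S0", "S", "S"),
             ("S",  "L", "A"), ("S",  "L", "R"), ("S",  "S", "S"),
             ("A",  "S", "R")]
    print(cky("(()(()))", units, bins, "S0"))   # True
    print(cky("(()",      units, bins, "S0"))   # False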
Other algorithms for context-free parsing are more popular than the
CKY algorithm. In particular, a top-down CFL parser due to Earley is
more efficient in many cases.

4.5 Grammar Decision Problems


We have seen that the decision problem inCFL

inCFL = {⟨G, w⟩ | G ⇒ w}.

is decidable. Here we list a number of other decision problems about


grammars along with their decidability status:
emptyCFL. Does a CFG generate any strings at all?

emptyCFL = {⟨G⟩ | L(G) = ∅}

Decidable. How? (Hint: L(G) ≠ ∅ exactly when the start variable of G is live, and liveness can be computed with the marking algorithm of Section 4.3.)
fullCFL. Does a CFG generate all strings over the alphabet?

fullCFL = {⟨G⟩ | L(G) = Σ∗ }

Undecidable.
subCFL. Does one CFG generate a subset of the strings generated by an-
other?
subCFL = {⟨G1 , G2 ⟩ | L(G1 ) ⊆ L(G2 )}
Undecidable.
sameCFL. Do two CFGs generate the same language?

sameCFL = {⟨G1 , G2 ⟩ | L(G1 ) = L(G2 )}

Undecidable.
ambigCFG. Is a CFG ambiguous, i.e., is there some string w ∈ L(G) with
more than one leftmost derivation using G? Undecidable.

4.6 Push Down Automata
Push Down Automata (PDAs) are a machine counterpart to context-free
grammars. PDAs consume, or process, strings, while CFGs generate strings.
A PDA can be roughly characterized as follows:

PDA = TM − tape + stack

In other words, a PDA is a machine with a finite number of control states,


like a Turing machine, but it can only access its data in a stack-like fashion
as it operates.
Remark. Recall that a stack is a ‘last-in-first-out’ (LIFO) queue, with the
following operations:

Push Add an element x to the top of the stack.

Pop Remove the top element from the stack.

Empty Test the stack to see if it is empty. We won’t use this feature in our
work.

Only the top of the stack may be accessed in any one step; multiple pushes
and pops can be used to access other elements of the stack.
Use of the stack puts an explicit memory at our disposal. Moreover, a
stack can hold an unbounded amount of information. However, the con-
straint to access the stack in LIFO style means that use of memory is also
constrained.
Here’s the formal definition.

Definition 24 (Push-Down Automaton). A Push-Down Automaton (PDA)


is a 6-tuple (Q, Σ, Γ, δ, q0 , F ), where

• Q is a finite set of control states.

• q0 is the start state.

• F is a finite set of accepting states.

• Σ is the input alphabet (finite set of symbols).

• Γ is the stack alphabet (finite set of symbols). Σ ⊆ Γ. As for Turing
machines, the need for Γ being an extension of Σ comes from the
fact that it is sometimes convenient to use symbols other than those
found in the input alphabet as special markers in the stack.
• δ : Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) −→ 2^(Q×(Γ∪{ε})) is the transition function.
Although δ seems daunting, it merely incorporates the use of the
stack. When making a transition step, the machine uses the current
input symbol and the top of the stack in order to decide what state to
move to. However, that’s not all, since the machine must also update
the top of the stack at each step.
It is obviously complex to make a step of computation in a PDA. We
have to deal with non-determinism and ε-transitions, but the stack must
also be taken account of. Suppose q ∈ Q, a ∈ Σ, and u ∈ Γ. Then a
computation step
δ(q, a, u) = {(q1 , v1 ), . . . , (qn , vn )}
means that if the PDA is in state q, reading tape symbol a, and symbol u
is at the top of the stack, then there are n possible outcomes. In outcome
(qi , vi ), the machine has moved to state qi , and u at the top of the stack has
been replaced by vi .
Descriptions of how the top of stack changes in a computation step
seem peculiar at first glance. We summarize the possibilities in the follow-
ing table.

operation step result


push x δ(q, a, ε) (qi , x)
pop x δ(q, a, x) (qi , ε)
skip δ(q, a, ε) (qi , ε)
replace(c, d) δ(q, a, c) (qi , d)
Note that we have to designate the element in a pop operation. This is
unlike conventional stacks.
Let’s try to explain this curious notation. It helps to explicitly include
the input string and the stack. Thus a configuration of the machine is a triple
(q, string, stack ). We use the notation c · t to represent the stack (a string)
where the top of the stack is c and the rest of the stack is t. A confusing
aspect of the notation is that a stack t is sometimes regarded as a stack
with the ε-symbol as the top element, so t is the same as ε · t.

• When pushing symbol x, the configuration changes from (q, a·w, ε·t)
to (qi , w, x · t).

• When popping symbol x, the configuration changes from (q, a·w, x·t)
to (qi , w, t).

• When we don’t wish the stack to change at all in a computation step,


the machine moves from a configuration (q, a · w, ε · t) to (qi , w, ε · t).

• Finally, on the occasion that we actually do wish to change the sym-


bol c at the top of stack with symbol d, the configuration (q, a · w, c · t)
changes to (qi , w, d · t).

Now that steps of computation are better understood, the notion of an


execution is easy. An execution starts with the configuration (q0 , s, ε), i.e.,
the machine is in the start state, the input string s is on the tape, and the
stack is empty. A successful execution is one which finishes in a configu-
ration where s has been completely read, the final state of the machine is
an accept state, and the stack is empty.
Remark. Notice that the notion of acceptance means that, even if the ma-
chine ends up in a final state after processing the input string, the string
may still not be accepted; the stack must also be empty in order for the
string to be accepted.
Acceptance can be formally defined as follows:
Definition 25 (PDA execution and acceptance). A PDA M = (Q, Σ, Γ, δ, q0 , F )
accepts w = w1 · w2 · . . . · wn , where each wi ∈ Σ, just in case there exists
• a sequence of states r0 , r1 , . . . , rm ∈ Q;

• a sequence of stacks (strings in Γ∗ ) s0 , s1 , . . . sm


such that the following three conditions hold:
1. Initial condition r0 = q0 and s0 = ε.

2. Computation steps

∀i. 0 ≤ i ≤ m − 1 ⇒ (ri+1 , b) ∈ δ(ri , wi+1 , a) ∧ si = a · t ∧ si+1 = b · t

where a, b ∈ Σ ∪ {ε} and t ∈ Γ∗ .

3. Final condition rm ∈ F and sm = ε.

As usual, the language L(M) of a PDA M is the set of strings accepted


by M.
As for Turing machines, PDAs can be represented by state transition
diagrams.

Example 83. Let M = (Q, Σ, Γ, δ, q0 , F ) be a PDA where Q = {p, q}, Σ =


{a, b, c}, Γ = {a, b}, q0 = q, F = {p}, and δ is given as follows

δ(q, a, ε) = {(q, a)}


δ(q, b, ε) = {(q, b)}
δ(q, c, ε) = {(p, ε)}
δ(p, a, a) = {(p, ε)}
δ(p, b, b) = {(p, ε)}

It is very hard to visualize what M is supposed to do! A diagram helps.


We’ll use the notation a, b → c on a transition between states qi and qj to
mean (qj , c) ∈ δ(qi , a, b).
a, ε → a a, a → ε
b, ε → b b, b → ε

q p
c, ε → ε

In words, M performs as follows. It stays in state q pushing input


symbols on to the stack until it encounters a c. Then it moves to state p, in
which it repeatedly pops a and b symbols off the stack provided the input
symbol is identical to that on top of the stack. If at the end of the string,
the stack is empty, then the machine accepts.
However, notice that failure in processing a string is not explicitly rep-
resented. For example, what if the input string has no c in it? In that case,
M will never leave state q. Once the input is finished, we find that we
are not in a final state and so can’t accept the string. For another exam-
ple, what if we try the string ababcbab? Then M will enter state p with
the stack baba, i.e., with the configuration (p, bab, baba). Then the following
configurations happen:

(p, bab, baba) ⇒ (p, ab, aba) ⇒ (p, b, ba) ⇒ (p, ε, a)

At this point the input string is exhausted and the computation stops. We
cannot accept the original string—although we are in an accept state—
because the stack is not empty. Thus we see that
L(M) = {wcw^R | w ∈ {a, b}∗ }.
This example used c as a marker telling the machine when to change states.
It turns out that such an expedient is not needed because we have non-
determinism at our disposal.
Example 84. Find a PDA for L = {ww^R | w ∈ {a, b}∗ }.
The PDA is just that of the previous example, with the seemingly innocu-
ous alteration that the transition from q to p becomes an ε-transition.

a, ε → a a, a → ε
b, ε → b b, b → ε

q p
ε, ε → ε

The machine may non-deterministically move from q to p at any point


in the execution. The only thing that matters for acceptance is that there is
some point at which the move works, i.e., results in an accepting compu-
tation.
For example, given input abba, how does the machine work? The fol-
lowing sequence of configurations shows an accepting computation path:
(q, abba, ε) → (q, bba, a)
→ (q, ba, ba)
→ (p, ba, ba)
→ (p, a, a)
→ (p, ε, ε)
The following is an unsuccessful execution:
(q, abba, ε) → (q, bba, a)
→ (q, ba, ba)
→ (q, a, bba)
→ (p, a, bba)
→ blocked!
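Because of non-determinism and ε-moves, tracing a PDA by hand is error-prone,
so it is handy to have a small simulator. The following Python sketch is our
own encoding (acceptance requires an accept state and an empty stack, as in
Definition 25); it explores reachable configurations breadth-first, with a step
bound as a crude guard against machines that can push forever on ε-moves.

    from collections import deque

    def pda_accepts(delta, start, finals, w, max_steps=100000):
        # delta maps (state, input symbol or "", popped symbol or "") to a set of
        # (next state, pushed symbol or "") pairs -- the a, b -> c notation of the diagrams.
        # Configurations are (state, position in w, stack); the stack is a string, top at the left.
        frontier = deque([(start, 0, "")])
        seen, steps = set(), 0
        while frontier and steps < max_steps:
            steps += 1
            q, i, st = frontier.popleft()
            if (q, i, st) in seen:
                continue
            seen.add((q, i, st))
            if i == len(w) and st == "" and q in finals:
                return True                  # input consumed, stack empty, accept state
            reads = ([w[i]] if i < len(w) else []) + [""]   # read the next symbol, or make an ε-move
            for a in reads:
                pops = ([st[0]] if st else []) + [""]       # pop the top symbol, or leave the stack alone
                for (q2, push) in delta.get((q, a, ""), ()) if False else ():
                    pass
                for t in pops:
                    for (q2, push) in delta.get((q, a, t), ()):
                        frontier.append((q2, i + (1 if a else 0), push + st[len(t):]))
        return False

    # The PDA of Example 84, which recognizes { w w^R | w in {a, b}* }:
    delta = {
        ("q", "a", ""):  {("q", "a")},   ("q", "b", ""):  {("q", "b")},
        ("q", "",  ""):  {("p", "")},
        ("p", "a", "a"): {("p", "")},    ("p", "b", "b"): {("p", "")},
    }
    print(pda_accepts(delta, "q", {"p"}, "abba"))   # True
    print(pda_accepts(delta, "q", {"p"}, "abb"))    # False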

Example 85. Build a PDA to recognize
L = {a^i b^j c^k | i + k = j}
The basic idea in finding a solution is to use states to enforce the order
of occurrences of letters, and to use the stack to enforce the requirement
that i + k = j.
b, ε → b
a, ε → a b, a → ε c, b → ε

q0 q1 q2
ε, ε → ε ε, ε → ε

In the first state, q0 , we simply push a symbols. Then we move to state


q1 where we either
• pop an a when a b is seen on the input, or
• push a b for every b seen.
If i > j then there will be more a symbols on the stack than consecutive b
symbols remaining in the input. In this case, some a symbols will be left
on the stack when leaving q1 . This situation is not catered for in state q2 ,
so the machine will block and the input will not be accepted. This is the
correct behaviour.
On the other hand, if i ≤ j, there are more consecutive b symbols than
there are a symbols on the stack, so in state q1 the stack of i a symbols will
become empty and then be filled with j − i b symbols. Entering state q2
with such a stack will result in acceptance only if there are j − i c symbols
left in the input.
Example 86. Give a PDA for L = {x ∈ {a, b}∗ | count(a, x) = count(b, x)}.

a, b → ε
b, a → ε

a, ε → a
b, ε → b

Example 87. Give a PDA for L = {x ∈ {a, b}∗ | count(a, x) < 2 ∗ count(b, x)}.
This problem can be rephrased as: x is in L if, after doubling the number
of b’s in x, we have more b’s than a’s. We can build a machine to do this
explicitly: every time it sees a b in the input string, it will treat it as 2
consecutive b’s.

ε, ε → b
b, ε → b b, a → ε

a, ε → a ε, a → ε
a, b → ε ε, ε → b
ε, b → ε
ε, b → ε

Example 88. Give a PDA for L = {a^i b^j | i ≤ j ≤ 2i}. L is equivalent to
{a^i b^i b^k | k ≤ i}, which is equivalent to {a^k a^ℓ b^ℓ b^{2k} | k, ℓ ≥ 0}. A machine deal-
ing with this language is easy to build, by putting different functionality
in different states. Non-deterministic transitions take care of guessing the
right time to make a transition. In the first state, we push k + ℓ a sym-
bols; in the second we cancel off b^ℓ ; and in the last we consume b^{2k} while
popping k a symbols.

a, ε → a b, a → ε b, ε → ε
b, a → ε

ε, ε → ε ε, ε → ε

Note that we could shrink this machine, by merging the last two states:

a, ε → a b, a → ε
b, a → ε
ε, ε → ε
b, ε → ε

This machine non-deterministically chooses to cancel one or two b symbols
for each a seen in the input. Note that we could also write an equivalent
machine that non-deterministically chooses to push one or two a symbols
to the stack: b, a → ε
a, ε → a
ε, ε → ε

a, ε → a
ε, ε → a

Example 89. Build a PDA to recognize

L = {a^i b^j | 2i = 3j}

Thus L = {ε, a^3 b^2 , a^6 b^4 , a^9 b^6 , . . .}. The idea behind a solution to this


problem is that we wish to push and pop multiple a symbols. On seeing
an a in the input, we want to push two a symbols on the stack. When we
see a b, we want to pop three a symbols. The technical problem we face
is that pushing and popping only deal with one symbol at a time. Thus
in order to deal with multiple symbols, we will need to employ multiple
states.

ε, ε → ε
q0 q2
a, ε → a b, a → ε ε, a → ε
ε, ε → a
q1 q3 q4
ε, a → ε

Example 90. Build a PDA to recognize

L = {a^i b^j | 2i ≠ 3j}

This is a more difficult problem. We will be able to re-use the basic idea
of the previous example, but must now take extra cases into account. The
success state q2 of the previous example will now change into a reject state.
But there is much more going on. We will build the solution incrementally.
The basic skeleton of our answer is

ε, ε → ε
q0 q2
a, ε → a b, a → ε ε, a → ε
ε, ε → a
q1 q3 q4
ε, a → ε
If we arrive in q2 where the input has been exhausted and the stack is
empty, we should reject, and that is what the above machine does. The
other cases in q2 are
• There is remaining input and the stack is not empty. This case is
already covered: go to q3 .
• There is remaining input and the stack is empty. We can assume that
the head of the remaining input is a b. (All the leading a symbols
have already been dealt with in the q0 , q1 pair.) We need to transition
to an accept state where we ensure that the rest of the input is all b
symbols. Thus we invent a new accept state q5 where we discard the
remaining b symbols in the input.

ε, ε → ε
q0 q2
b, ε → ε
a, ε → a b, a → ε
ε, ε → a ε, a → ε b, ε → ε
q1 q3 q4 q5
ε, a → ε
We further notice that this situation can happen in q3 and q4 , so we
add ε-transitions from them to q5 :

ε, ε → ε
q0 q2
b, ε → ε
a, ε → a b, a → ε
ε, ε → a ε, a → ε b, ε → ε
q1 q3 q4 q5
ε, a → ε ε, ε → ε

ε, ε → ε

• The input is exhausted, but the stack is not empty. Thus we have
excess a symbols on the stack and we need to jettison them before
accepting. This is handled in a new final state q6 :

ε, a → ε
ε, ε → ε ε, a → ε
q0 q2 q6
a, ε → a b, a → ε b, ε → ε
ε, ε → a ε, a → ε b, ε → ε
q1 q3 q4 q5
ε, a → ε ε, ε → ε

ε, ε → ε
This is the final PDA.

4.7 Equivalence of PDAs and CFGs


The relationship between PDAs and CFGs is similar to that between DFAs
and regular expressions; namely, the languages accepted by PDAs are just
those that can be generated by CFGs.
Theorem 8. Suppose L is a context-free language. Then there is a PDA M such
that L(M) = L.
Theorem 9. Suppose M is a PDA. Then there is a grammar G such that L(G) =
L(M), i.e., L(M) is context-free.
The proofs of these theorems take a familiar approach: given an arbi-
trary grammar, we construct the corresponding PDA; and given an arbi-
trary PDA, we construct the corresponding grammar.

4.7.1 Converting a CFG to a PDA


The basic idea in the construction is to build M so that it simulates the
leftmost derivation of strings using G. The machine we construct uses
the terminals and non-terminals of the grammar as stack symbols. What
we conceptually want to do is to use the stack to hold the sentential form
that evolves during a derivation. At each step, the topmost variable in the

stack will get replaced by the rhs of some grammar rule. Of course, there
are several problems with implementing this concept. For one, the PDA
can only access the top of its stack: it can’t find a variable below the top.
For another, even if the PDA could find such a variable, it couldn’t fit the
rhs into a single stack slot. But these are not insurmountable. We simply
have to arrange things so that the PDA always has the leftmost variable
of the sentential form on top of the stack. If that can be set up, the PDA
can use the technique of using extra states to push multiple symbols ‘all at
once’.
The other consideration is that we are constructing a PDA after all,
so it needs to consume the input string and give a verdict. This fits in
nicely with our other requirements. In brief, the PDA will use ε-transitions
to push the rhs of rules into the stack, and will use ‘normal’ transitions
to consume input. In consuming input, we will be able to remove non-
variables from the top of the stack, always guaranteeing that a variable is
at the top of the stack.
Here are the details. Let G = (V, Σ, R, S). We will construct M =
(Q, Σ, Γ, δ, q0 , {q}), where the stack alphabet is Γ = V ∪ Σ and

• Q = {q0 , q} ∪ RuleStates;

• q0 is the start state;

• q is a dispatch state at the center of a loop. It is also the sole accept


state for the PDA

• δ has rules for getting started, for consuming symbols from the input,
and for pushing the rhs of rules onto the stack.

getting started. δ(q0 , ε, ε) = {(q, S)}. The start symbol is pushed on


the stack and a transition is made to the loop state.
consuming symbols. δ(q, a, a) = {(q, ε)}, for every terminal a ∈ Σ.
pushing rhs of rules For each rule Ri = V −→ w1 · w2 · · · · · wn in
R, where each wi may be a terminal or non-terminal, add n − 1
states to RuleStates. Also add the loop (from q to q)

q −(ε, V → wn)→ · −(ε, ε → wn−1)→ · · · −(ε, ε → w2)→ · −(ε, ε → w1)→ q

which pushes the rhs of the rule, using the n − 1 states. Note
that the symbols in the rhs of the rule are pushed on the stack
in right-to-left order.

Example 91. Let G be given by the grammar

S −→ aS | aSbS | ε
The corresponding PDA is

ε, S → ε
a, a → ε
b, b → ε
ε, ε → S
A B ε, ε → a F
ε, ε → a ε, S → S ε, ε → S

C ε, S → S D E
ε, ε → b

Consider the input aab. A derivation using G is

S ⇒ aS ⇒ aaSbS ⇒ aaεbS ⇒ aaεbε = aab

As a sequence of machine configurations, this looks like

(A, aab, ε) −→ (B, aab, S)


−→ (C, aab, S)
−→ (B, aab, aS)
−→ (B, ab, S)
−→ (D, ab, S)
−→ (E, ab, bS)
−→ (F, ab, SbS)
−→ (B, ab, aSbS)
−→ (B, b, SbS)
−→ (B, b, bS)
−→ (B, ε, S)
−→ (B, ε, ε)

And so the machine would accept aab.

4.7.2 Converting a PDA to a CFG


The previous construction, spelled out in full would look messy, but is in
fact quite simple. Going in the reverse direction, i.e., converting a PDA to
a CFG, is more difficult. The basic idea is to consider any two states p, q of
PDA M and think about what strings could be consumed in executing M
from p to q. Those strings will be represented by a variable Vpq in G, the
grammar we are building. By design, the strings generated by Vpq would
be just those substrings consumed by M in going from p to q. Thus S,
the start variable, will stand for all strings consumed in going from q0 to
an accept state. This is clear enough, but as always for PDAs, we must
consider the stack, hence the story will be more involved; for example, we
will use funky variables of the form VpAq , where A represents the top of
the stack.
The construction goes as follows: given PDA M = (Q, Σ, Γ, δ, q0 , F ),
we will construct a grammar G such that L(G) = L(M). Two main steps
achieve this goal:

• M will be modified to an equivalent M ′ , with a more desirable form.


An important aspect of M ′ is that it always looks at the top symbol
on the stack in each move. Thus, no transitions of the form a, ε → b

(pushing b on the stack), ε, ε → ε, or a, ε → ε (ignoring stack) are
allowed. How can these be eliminated from δ without changing the
behaviour? First we need to make sure that the stack is never empty,
for if M ′ is going to look at the top element of the stack, the stack had
better never be empty. This can be ensured by starting the compu-
tation with a special token ($) in the stack and then maintaining an
invariant that the stack never thereafter becomes empty. It will also
be necessary to allow M ′ to push two stack symbols in one move:
since M ′ always looks at the top stack symbol, we need to push two
symbols in order to get the effect of a push operation on a stack. This
can be implemented by using extra states, but we will simply assume
that M ′ has this extra convenience.
Furthermore, we are going to add a new start state s and the tran-
sition ε, ε → $, which pushes $ on the stack when moving from the
new start state s to the original start state q0 . We also add a new final
state qf , with ε, $ → ε transitions from all members of F to qf . Thus
the machine M ′ has a single start state and a single end state, always
examines the top of its stack, and behaves the same as the machine
M.
• Construct G so that it simulates the working of M ′ . We first construct
the set of variables of G.

V = {VpAq | p, q ∈ Q ∪ {qf } ∧ A ∈ Γ ∪ {$}}

Thus we create a lot of new variables: one for each combination of


states and possible stack elements. The intent is that each variable
VpAq will generate the following strings:

{x ∈ Σ∗ | (p, x, A) −→∗ (q, ε, ε)}

Now, because of the way we constructed M ′ , there are three kinds of


transitions to deal with:

1. A transition from p to q labelled a, A → B.

Add the rule VpAr −→ aVqBr for all r ∈ Q ∪ {qf }

2. A transition from p to q labelled a, A → BA.

Add the rule VpAr −→ aVqBr′ Vr′ Ar for all r, r ′ ∈ Q ∪ {qf }

3. A transition from p to q labelled a, A → ε.

Add the rule VpAq −→ a.

The above construction works because the theorem


VpAq ⇒∗ w iff (p, w, A) −→∗ (q, ε, ε)

can be proved (it’s a little complicated though). From this we can


immediately get
Vq0 $qf ⇒∗ w iff (q0 , w, $) −→∗ (qf , ε, ε)

Thus by making Vq0 $qf the start symbol of the grammar, we have
achieved our goal.

4.8 Parsing
To be added ...

Chapter 5

Automata

Automata are a particularly simple, but useful, model of computation.


They were initially proposed1 as a simple model for the behaviour of neu-
rons.

The concept of a finite automaton appears to have arisen in the


1943 paper “A logical calculus of the ideas immanent in nervous ac-
tivity”, by Warren McCulloch and Walter Pitts. These neurobiolo-
gists set out to model the behaviour of neural nets, having noticed a
relationship between neural nets and logic:
“The ‘all-or-none’ law of nervous activity is sufficient
to ensure that the activity of any neuron may be repre-
sented as a proposition. ... To each reaction of any neuron
there is a corresponding assertion of a simple proposition.”

In 1951 Kleene introduced regular expressions to describe the behaviour


of finite automata. He also proved the important theorem saying that reg-
ular expressions exactly capture the behaviours of finite automata. In 1959,
Dana Scott and Michael Rabin introduced non-deterministic automata and
showed the surprising theorem that they are equivalent to deterministic
automata. We will study these fundamental results. Since those early
years, the study of automata has continued to grow, showing that they
are indeed a fundamental idea in computing.
1
This historical material is taken from an article by Bob Constable at The Kleene Sym-
posium, an event held in 1980 to honour Stephen Kleene’s contribution to logic and com-
puter science.

We said that automata are a model of computation. That means that
they are a simplified abstraction of ‘the real thing’. So what gets abstracted
away? One thing that disappears is any notion of hardware or software.
We merely deal with states and transitions between states.
We keep                    We drop
some notion of state       notion of memory
stepping between states    variables, commands, expressions
start state                syntax
end states
The distinction between program and machine executing it disappears.
One could say that an automaton is the machine and the program. This
makes automata relatively easy to implement in either hardware or soft-
ware.
From the point of view of resource consumption, the essence of a finite
automaton is that it is a strictly finite model of computation. Everything
in it is of a fixed, finite size and cannot be extended in the course of the
computation.

5.1 Deterministic Finite State Automata


More precisely, a DFA (Deterministic Finite State Automaton) is a simple ma-
chine that reads an input string—one symbol at a time—and then, after
the string has been completely read, decides whether to accept or reject the
whole string. As the symbols are read, the automaton can change its state,
to reflect how it reacts to what it has seen so far.
Thus, a DFA conceptually consists of 3 parts:
• A tape to hold the input string. The tape is divided into a finite num-
ber of cells. Each cell holds a symbol from Σ.

• A tape head for reading symbols from the tape

• A control, which itself consists of 3 things:

– a finite number of states that the machine is allowed to be in


– a current state, initially set to a start state
– a state transition function for changing the current state

An automaton processes a string on the tape by repeating the following
actions until the tape head has traversed the entire string:

• The tape head reads the current tape cell and sends the symbol s
found there to the control. Then the tape head moves to the next cell.
The tape head can only move forward.

• The control takes s and the current state and consults the state tran-
sition function to get the next state, which becomes the new current
state.

Once the entire string has been traversed, the final state is examined.
If it is an accept state, the input string is accepted; otherwise, the string is
rejected. All the above can be summarized in the following formal defini-
tion:

Definition 26 (Deterministic Finite State Automaton). A Deterministic Fi-


nite State Automaton DFA is a 5-tuple:

M = (Q, Σ, δ, q0 , F )

where

• Q is a finite set of states

• Σ is a finite alphabet

• δ : Q × Σ → Q is the transition function (which is total).

• q0 ∈ Q is the start state

• F ⊆ Q is the set of accept states

A single computation step in a DFA is just the application of the transi-


tion function to the current state and the current symbol. Then an execution
consists of a linked sequence of computation steps, stopping once all the
symbols in the string have been processed.

Definition 27 (Execution). If δ is the transition function for machine M,


then a step of computation is defined as

step(M, q, a) = δ(q, a)

A sequence of steps ∆ is defined as

∆(M, q, ε) = q
∆(M, q, a · x) = ∆(M, step(M, q, a), x)
Finally, an execution of M on string x is a sequence of computation steps
beginning in the start state q0 of M:

execute(M, x) = ∆(M, q0 , x)

The language recognized by a DFA M is the set of all strings accepted by M,


and is denoted by L(M). We will make these ideas precise in the next few
pages.
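The functions step, ∆ and execute translate almost verbatim into Python. The
following sketch (our own encoding, with a machine given as a transition table,
start state and accept set) will be reused for the examples that follow.

    def step(delta, q, a):
        return delta[(q, a)]          # the transition function is total, so this lookup always succeeds

    def run(delta, q, x):             # the function ∆: fold step over the symbols of x
        for a in x:
            q = step(delta, q, a)
        return q

    def accepts(M, x):                # execute M on x and test for an accept state
        delta, q0, F = M
        return run(delta, q0, x) in F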

5.1.1 Examples
Now we shall review a collection of examples of DFAs.

Example 92. DFA M = (Q, Σ, δ, q0 , F ) where

• Q = {q0 , q1 , q2 , q3 }

• Σ = {0, 1}

• The start state is q0 (this will be our convention)

• F = {q1 , q2 }

• δ is defined by the following table:

0 1
q0 q1 q3
q1 q2 q3
q2 q2 q2
q3 q3 q3

This presentation is nicely formal, but very hard to comprehend. The


following state transition diagram is far easier on the brain:

0
q0 0 q1 0 q2
1
1 1
q3

0 1
Notice that the start state is designated by an arrow with no source.
Final states are marked by double circles. The strings accepted by M are:

{0, 00, 000, 001, 0000, 0001, 0010, 0011, . . .}


NB. The transition function is total, so every possible combination of states
and input symbols must be dealt with. Also, for every (q, a) ∈ Q × Σ, there
is exactly one next state (which is why these are deterministic automata).
Thus, given any string x over Σ, there is only one path starting from q0 , the
labels of which form x.
A state q in which every next state is q is a black hole state since the
current state will never change until the string is completely processed. If
q is an accept state, then we call q a success state. If not, it’s called a failure
state. In our example, q3 is a failure state and q2 is a success state.
Question : What is the language accepted by M, i.e., what is L(M)?
Answer : L(M) consists of 0 and all binary strings starting with 00.
Formally we could write
L(M) = {0} ∪ {00x | x ∈ Σ∗ }
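With the simulator sketched earlier, this DFA can be encoded and tested
directly (the state names are strings; the table is the one given above).

    delta = {("q0", "0"): "q1", ("q0", "1"): "q3",
             ("q1", "0"): "q2", ("q1", "1"): "q3",
             ("q2", "0"): "q2", ("q2", "1"): "q2",
             ("q3", "0"): "q3", ("q3", "1"): "q3"}
    M = (delta, "q0", {"q1", "q2"})
    for w in ["0", "00", "001", "1", "010"]:
        print(w, accepts(M, w))   # 0, 00 and 001 are accepted; 1 and 010 are rejected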
Example 93. Now we will show how to design an FSA for an automatic
door controller. The controller has to open the door for incoming cus-
tomers, and not misbehave. A rough specification of it would be
If a person is on pad 1 (the front pad) and there’s no person on
pad 2 (the rear pad), then open the door and stay open until there’s no
person on either pad 1 or pad 2.
This can be modelled with two states: closed and open. So in our au-
tomaton, Q = {closed, open}. Now we need to capture all the combina-
tions of people on pads: these will be the inputs to the system.

• (both) pad 1 and pad 2

• (front) pad 1 and not pad 2

• (rear) not pad 1 and pad 2

• (neither) not pad 1 and not pad 2

We will need 2 sensors, one for each pad, and some external mecha-
nism to convert these two inputs into one of the possibilities. So our al-
phabet will be {b, f, r, n}, standing for {both, front, rear , neither}. Now the
task is to define the transition function. This is most easily expressed as a
diagram:

n, r, b f f, r, b

closed open
n

Finally, to complete the formal definition, we’d need to specify a start


state. The set of final states would be empty, since one doesn’t usually
want a door controller to freeze the door in any particular position.
Food for thought. Should door controllers handle only finite inputs, or
should they run forever? Is that even possible, or desired?
The formal definition of the door controller would be

M = ({closed, open}, {f, r, n, b}, δ, closed, ∅)

where δ is defined as
δ(open, x) = if x = n then closed else open
δ(closed , x) = if x = f then open else closed

That completes the door controller example.

In the course, there are two main questions asked about automata:

• Given a DFA M, what is L(M)?

• Given a language L, what is a DFA M such that L(M) = L?

Example 94. Give a DFA for recognizing the set of all strings over {0, 1},
i.e., {0, 1}∗ . This is also known as the set of all binary strings. There is a
very simple automaton for this:
0
q0

1
Example 95. Give a DFA for recognizing the set of all binary strings be-
ginning with 01. Here’s a first attempt (which doesn’t quite work):

0, 1

q0 0 q1 1 q2

The problem is that this diagram does not describe a DFA: δ is not total.
Here is a fixed version:
0, 1

q0 0 q1 1 q2
1
0
q3

0, 1

Example 96. Let Σ = {0, 1} and L = {w | w contains at least 3 1s}. Show


that L is regular, i.e., give a DFA that recognizes L.

0, 1
0 0 0
q0 1 q1 1 q2 1 q3

Example 97. Let Σ = {0, 1} and L = {w | len(w) is at most 5}. Show that L
is regular.

0, 1 0, 1 0, 1 0, 1 0, 1
q0 q1 q2 q3 q4 q5
0, 1
q6
0, 1

5.1.2 The regular languages


We’ve now exercised our intuitions on the definition of a DFA. We should
pause to formally define what it means for a machine M to accept a string
w, the language L(M) recognized by M, and the regular languages. We
start by defining the notion of a computation path, which is the trace of an
execution of M.
Definition 28 (Computation path). Let M = (Q, Σ, δ, q0 , F ) be a DFA and
let w = w0 w1 . . . wn−1 be a string over Σ, where each wi is an element of Σ.
A computation path
q0 −w0→ q1 −w1→ · · · −wn−1→ qn
is a sequence of states of M labelled with symbols, fully describing the
sequence of transitions made by M in processing w. Moreover,
• q0 is the start state of M
• each qi ∈ Q
• δ(qi , wi ) = qi+1 for 0 ≤ i < n
It is important to notice that, for a DFA M, there is only one computation
path for any string. We will soon see other kinds of machines where this
isn’t true.
Definition 29 (Language of a DFA). The language defined by a DFA M,
written L(M), is the set of strings accepted by M. Formally, let M =
(Q, Σ, δ, q0 , F ) be a DFA and let w be a string over Σ.
L(M) = {w | execute(M, w) ∈ F }
A DFA M is said to recognize language A if A = L(M).

Now we give a name to the set of all languages that can be recognized by
a DFA.

Definition 30 (Regular languages). A language is regular if there is a DFA


that recognizes it:

Regular(L) = ∃M. M is a DFA and L(M) = L

The regular languages give a uniform way to relate the languages rec-
ognized by DFAs and NFAs, along with the languages generated by regular
expressions.

5.1.3 More examples


Now we turn to a few slightly harder examples.

Example 98. The set of all binary strings having a substring 00 is regular.
To show this, we need to construct a DFA that recognizes all and only
those strings having 00 as a substring. Here’s a natural first try:

q0 0 q1 0 q2
0, 1 0, 1

However, this is not a DFA (it is an NFA, which we will discuss in the
next lecture). A second try can be constructed by trying to implement the
following idea: we try to find 00 in the input string by ‘shifting along’ until
a 00 is seen, whereupon we go to a success state. We start with a preliminary
DFA, that expresses the part of the machine that detects successful input:

q0 0 q1 0 q2

0, 1

And now we consider, for each state, the moves needed to make the
transition function total, i.e., we need to consider all the missing cases.

• If we are in q0 and we get a 1, then we should try again, i.e., stay in q0 .
So q0 is the machine state where it is looking for a 0. So the machine
should look like

1
q0 0 q1 0 q2

0, 1

• If we are in q1 and we get a 1, then we should start again, for we have


seen a 01. This corresponds to shifting over by 2 in the input string.
So the final machine looks like

1
0 0
q0 q1 q2
1
0, 1

Example 99. Give a DFA that recognizes the set of all binary strings having
a substring 00101. A straightforward—but incorrect—first attempt is the
following:

1 0, 1
q0 0 q1 0 q2 1 q3 0 q4 1 q5
1 0 1 0

This doesn’t work! Consider what happens at q2 :

1
q0 0 q1 0 q2 1 q3
1

If the machine is in q2, it has seen a 00. If we then get another 0, we could be seeing 000101. In other words, if the next 3 symbols after 2 or more consecutive 0s are 101, we should accept. Therefore, once we see 00, we should stay in q2 as long as we see more 0s. Thus we can refine our
diagram to

1 0
q0 0 q1 0 q2 1 q3
1

Next, what about q3? When in q3, we've seen something of the form . . . 001. If we now see a 1, we have to restart, as in the original, naive diagram. If we see a 0, we proceed to q4, as in the original.

1 0
q0 0 q1 0 q2 1 q3 0 q4
1
1

Now q4 . We’ve seen . . . 0010. If we now see a 1, then we’ve found our
substring, and can accept. Otherwise, we’ve seen . . . 00100, i.e., have seen
a 00, therefore should go to q2 . This gives the final solution (somewhat
rearranged):
q3
1 1 0 0, 1
q0 q4 1 q5
0 1
q1 0
1 0
q2

5.2 Nondeterministic finite-state automata
A nondeterministic finite-state automaton (NFA) N = (Q, Σ, δ, q0 , F ) is de-
fined in the same way as a DFA except that the following liberalizations
are allowed:

• multiple next states

• ε-transitions

Multiple next states


This means that—in a state q and with symbol a—there could be more
than one next state to go to, i.e., the value of δ(q, a) is a subset of Q. Thus
δ(q, a) = {q1 , . . . , qk }, which means that any one of q1 , . . . , qk could be the
next state.
There is a special case: δ(q, a) = ∅. This means that there is no next state
when the machine is in state q and reading an a. How to understand this
state of affairs? One way to think of it is that the machine hangs and the
input will be rejected. This is equivalent to going into a failure state in a
DFA.

ε-Transitions
In an ε-transition, the tape head doesn’t do anything—it doesn’t read and
it doesn’t move. However, the state of the machine can be changed. For-
mally, the transition function δ is given the empty string. Thus

δ(q, ε) = {q1 , . . . , qk }
means that the next state could be one of q1 , . . . , qk without consuming
the next input symbol. When an NFA executes, it makes transitions as
a DFA does. However, after making a transition, it can make as many
ε-transitions as are possible.
Formally, all that has changed in the definition of an automaton is δ:

DFA: δ : Q × Σ → Q
NFA: δ : Q × (Σ ∪ {ε}) → 2^Q

Note. Some authors write Σε instead of Σ ∪ {ε}.
Don’t let any of this formalism confuse you: it’s just a way of saying
that δ delivers a set of next states, each of which is a member of Q.

Example 100. Let δ, the transition function, be given by the following table

0 1 ε
q0 ∅ {q0 , q1 } {q1 }
q1 {q2 } {q1 , q2 } ∅
q2 {q2} ∅ {q1 }

Also, let F = {q2 }. Note that we must take account of the possibility of ε
transitions in every state. Also note that each step can lead to one of a set
of next states. The state transition diagram for this automaton is

1 1 0
0
q0 1 q1 q2
ε
ε 1

Note. In a transition diagram for an NFA, we draw arrows for all tran-
sitions except those landing in the empty set (can one land in an empty
set?).
Note. δ is still a total function, i.e., we have to specify its behaviour in ε,
for each state.
Question : Besides δ, what changes when moving from DFA to NFA?
Answer : The notion that there is a single computation path for a string,
and therefore, the definitions of acceptance and rejection of strings. Con-
sequently, the definition of L(N), where N is an NFA.

Example 101. Giving the input 01 to our example NFA allows 3 computa-
tion paths:
• q0 --ε--> q1 --0--> q2 --ε--> q1 --1--> q1
• q0 --ε--> q1 --0--> q2 --ε--> q1 --1--> q2
• q0 --ε--> q1 --0--> q2 --ε--> q1 --1--> q2 --ε--> q1

Notice that, in the last path, we can see that even after the input string
has been consumed, the machine can still make ε-transitions. Also note
that two paths, the first and third, do not end in a final state. The second
path is the only one that ends in a final state
In general, the computation paths for input x form a computation tree:
the root is the start state and the paths branch out to (possibly) different
states. For our example, with the input 01, we have the tree

(Computation tree for input 01: from the root q0 an ε-move leads to q1; reading 0 leads to q2; an ε-move leads back to q1; reading 1 branches to q1 and to q2, and from that q2 a further ε-move leads to q1.)

Note that marking q2 as a final state is just a marker to show that a path
(the second) ends at that point; of course, the third path continues from
that state.

An NFA accepts an input x if at least one path in the computation tree for
x leads to a final state. In our example, 01 is accepted because q2 is a final
state.
Definition 31 (Acceptance by an NFA). Let N = (Q, Σ, δ, q0, F) be an NFA. N accepts w if we can write w as w1 · w2 · . . . · wn, where each wi is a member of Σ ∪ {ε}, and a sequence of states q0, . . . , qn exists, with each qi ∈ Q, such that the following conditions hold:
• q0 is the start state of N
• qn is a final state of N (qn ∈ F)
• qi+1 ∈ δ(qi, wi+1), for 0 ≤ i < n
As for DFAs, the language recognized by an NFA N is the set of strings
accepted by N.
Definition 32 (Language of an NFA). The language of an NFA N is written
L(N) and defined
L(N) = {x | N accepts x}

Example 102. A diagram for an NFA that accepts all binary string having
a substring 010 is the following:

0, 1 0, 1
q0 0 q1 1 q2 0 q3

This machine accepts the string 1001010 because there exists at least one accepting path (in fact, there are 2). (Computation tree: on each 0 read while in q0 the tree branches, either staying in q0 or moving to q1; two of the branches manage to read a 010 and reach the final state q3, where they remain.)

Example 103. Design an NFA that accepts the set of binary strings begin-
ning with 010 or ending with 110. The solution to this uses a decompo-
sition strategy: do the two cases separately then join the automata with
ε-links. An automaton for the first case is the following

0, 1
q0 0 q1 1 q2 0 q3

An automaton for the second case is the following

0, 1
q0 1 q1 1 q2 0 q3

The joint automaton is

0, 1
q1 0 q2 1 q3 0 q4
ε
q0
ε
q5 1 q6 1 q7 0 q8
0, 1

Example 104. Give an NFA for

L = {x ∈ {0, 1}∗ | the fifth symbol from the right is 1}


The diagram for the requested automaton is

0, 1
1 0, 1 0, 1 0, 1 0, 1
q0 q1 q2 q3 q4 q5

Note. There is much non-determinism in this example. Consider state q0 ,


where there are multiple transitions on a 1. Also consider state q5 , where
there are no outgoing transitions.
This automaton accepts L because
• For any string whose 5th symbol from the right is 1, there exists a
sequence of legal transitions leading from q0 to q5 .
• For any string whose 5th symbol from the right is 0 (or any string
of length up to 4), there is no possible sequence of legal transitions
leading from q0 to q5 .

Example 105. Find an NFA that accepts the set of binary strings with at
least 2 occurrences of 01, and which end in 11.
The solution uses ε-moves to connect 3 NFAs together:

0, 1
q0 0 q1 1 q2

0, 1 ε
q3 q4 1 q5
0
0, 1 ε ε
q6 q7 1 q8
1

Note. There is a special case to take care of the input ending in 011; whence
the ε-transition from q5 to q7 .

5.3 Constructions
OK, now we have been introduced to DFAs and NFAs and seen how they
accept/reject strings. Now we are going to examine various constructions
that operate on automata, yielding other automata. It’s a way of building
automata from components.

5.3.1 The product construction


We’ll start with the product automaton. This creature takes 2 DFAs and
delivers the product automaton, also a DFA. The product automaton is
a single machine that, conceptually, runs its two component machines in
parallel on the same input string. At each step of the computation, both
machines access the (same) current input symbol, but they make transi-
tions according to their respective δ functions.
This is easy to understand at a high level, but how do we make this pre-
cise? In particular, how can the resulting machine be a DFA? The key idea

in solving this requirement is to make the states of the product automaton be pairs of states from the component automata.
Definition 33 (Product construction). Let M1 = (Q1 , Σ, δ1 , q1 , F1 ) and M2 =
(Q2 , Σ, δ2 , q2 , F2 ) be DFAs. Notice that they share the same alphabet. The
product of M1 and M2 —sometimes written as M1 × M2 —is (Q, Σ, δ, q0 , F ),
where
• Q = Q1 × Q2. Recall that this is {(p, q) | p ∈ Q1 ∧ q ∈ Q2}, or, informally, all possible pairings of states in Q1 with states in Q2. The size of Q is the product of the sizes of Q1 and Q2.

• Σ is unchanged. We require that M1 and M2 have the same input alphabet. Question: If they don't, what could we do?

• δ is defined by its behaviour on pairs of states: the transition is expressed by δ((p, q), a) = (δ1(p, a), δ2(q, a)).

• q0 = (q1 , q2 ), where q1 is the start state for M1 and q2 is the start state
for M2 .

• F can be built in 2 ways. When the product automaton is run on an input string w, it eventually ends up in a state (p, q), meaning that p is the state M1 would end up in on w, and similarly q is the state M2 would end up in on w. The choices are:

Union. (p, q) is an accept state if p is an accept state for M1, or if q is an accept state for M2.
Intersection. (p, q) is an accept state if p is an accept state for M1 , and
if q is an accept state for M2 .

We will take these up later.
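As a sanity check on the definition, here is a small Python sketch of the product construction. The dictionary representation of a DFA (a transition map, a start state, and a set of final states) is our own choice, not part of the definition, and the mode flag selects between the union and intersection flavours of F.

from itertools import product

def product_dfa(m1, m2, sigma, mode='union'):
    # Each machine is (delta, start, finals) with delta a dict on (state, symbol) pairs.
    (d1, s1, F1), (d2, s2, F2) = m1, m2
    Q1 = {q for (q, _) in d1} | set(d1.values())
    Q2 = {q for (q, _) in d2} | set(d2.values())
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for p, q in product(Q1, Q2) for a in sigma}
    if mode == 'union':        # accept when either component machine accepts
        F = {(p, q) for p, q in product(Q1, Q2) if p in F1 or q in F2}
    else:                      # 'intersection': accept when both machines accept
        F = {(p, q) for p, q in product(Q1, Q2) if p in F1 and q in F2}
    return delta, (s1, s2), F

Running the resulting machine on an input simulates M1 and M2 in parallel, exactly as described above.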


Example 106. Give a DFA that recognizes the set of all binary strings hav-
ing a substring 00 or ending in 01. To answer this challenge, we notice that
this can be regarded as the union of two languages:

{x | x has a substring 00} ∪ {y | y ends in 01}

Let’s build 2 automata separately and then use the product construc-
tion to join them. The first DFA, call it M1 , is

1 0, 1
q0 0 q1 0 q2
1
The second, call it M2 , is

1 0
q0 0 q1 1 q2
0
1
The next thing to do is to construct the state space of the product ma-
chine, and use that to figure out δ. The following table gives the details:

Q1 × Q2 0 1
(q0 q0 ) (q1 , q1 ) (q0 , q0 )
(q0 q1 ) (q1 , q1 ) (q0 , q2 )
(q0 q2 ) (q1 , q1 ) (q0 , q0 )
(q1 q0 ) (q2 , q1 ) (q0 , q0 )
(q1 q1 ) (q2 , q1 ) (q0 , q2 )
(q1 q2 ) (q2 , q1 ) (q0 , q0 )
(q2 q0 ) (q2 , q1 ) (q2 , q0 )
(q2 q1 ) (q2 , q1 ) (q2 , q2 )
(q2 q2 ) (q2 , q1 ) (q2 , q0 )

This is of course easy to write out, once you get used to it (a bit mind-
less though). Now there are several of the combined states that aren’t
reachable from the start state and can be pruned. The following is a dia-
gram of the result.
1 0

q0 , q0 0 q1 , q1 0 q2 , q1
0 0 1
1 0
1 1
q0 , q2 q2 , q0 q2 , q2
1

The final states of the automaton are {(q0 , q2 ), (q2 , q1 ), (q2 , q0 ), (q2 , q2 )}
(underlined states are the final states in the component automata).

5.3.2 Closure under union


Theorem 10. If A and B are regular sets, then A ∪ B is a regular set.
Proof. Let M1 be a DFA recognizing A and M2 be a DFA recognizing B.
Then M1 × M2 is a DFA recognizing L(M1 ) ∪ L(M2 ) = A ∪ B, provided that
the final states of M1 × M2 are those where one or the other components is
a final state, i.e., we require that the final states of M1 × M2 are a subset of
(F1 × Q2 ) ∪ (Q1 × F2 ).

It must be admitted that this ‘proof’ is really just a construction: it tells


how to build the automaton that recognizes A ∪ B. The intuitive reason
why the construction works is that we run M1 and M2 in parallel on the
input string, accepting when either would accept.

5.3.3 Closure under intersection


Theorem 11. If A and B are regular sets, then A ∩ B is a regular set.
Proof. Use the product construction, as for union, but require that the final
states of M1 × M2 are {(p, q) | p ∈ F1 ∧ q ∈ F2 }, where F1 and F2 are the
final states of M1 and M2 , respectively. In other words, accept an input to
M1 × M2 just when both M1 and M2 would accept it separately.

Remark. If you see a specification of the form
show {x | . . . ∧ . . .} is regular, or
show {x | . . . ∨ . . .} is regular,
you should consider building a product automaton, in the intersection flavour (first spec.) or the union flavour (second spec.).

5.3.4 Closure under complement


Theorem 12. If A is a regular set, then its complement Ā = Σ∗ − A is a regular set.

Proof. Let M = (Q, Σ, δ, q0, F) be a DFA recognizing A. So M accepts all strings in A and rejects all others. Thus a DFA recognizing Ā is obtained by switching the final and non-final states of M, i.e., the desired machine is M′ = (Q, Σ, δ, q0, Q − F). Note that M′ recognizes Σ∗ − L(M).
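In code the construction is a one-liner. This sketch keeps the same dictionary representation as before (an assumption of ours); the state set Q is recoverable from δ because δ is total.

def complement_dfa(m):
    # Swap final and non-final states; the result recognizes the complement of L(M).
    delta, q0, F = m
    Q = {q for (q, _) in delta} | set(delta.values())
    return delta, q0, Q - F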

5.3.5 Closure under concatenation


If we can build a machine M1 to recognize A and a machine M2 to recog-
nize B, then we should be able to recognize language A · B by running M1
on a string w ∈ A · B until it hits an accept state, and then running M2 on
the remainder of w. In other words, we somehow want to ‘wire’ the two
machines together in series. To achieve this, we need to do several things:

• Connect the final states of M1 to the start state of M2 . We will have to


use ε-transitions to implement this, because reading a symbol off the
input to make the jump from M1 to M2 will wreck things. (Why?)

• Make the start state for the combined machine be q01 .

• Make the final states for the combined machine be F2 .

The following makes this precise.

Theorem 13. If A and B are regular sets, then A · B is a regular set.


Proof. Let M1 = (Q1 , Σ, δ1 , q01 , F1 ) and M2 = (Q2 , Σ, δ2 , q02 , F2 ) be DFAs
that recognize A and B, respectively. An automaton2 for recognizing A · B
is given by M = (Q, Σ, δ, q0 , F ), where

• Q = Q1 ∪ Q2

• Σ is unchanged (assumed to be the same for both machines). This


could be liberalized so that Σ = Σ1 ∪ Σ2 , but it would mean that δ
would need extra modifications.

• q0 = q01

• F = F2
(Footnote 2: An NFA, in fact.)
• δ(q, a) is defined by cases, as to whether it is operating ‘in’ M1 , tran-
sitioning between M1 and M2 , or operating ‘in’ M2 :

– δ(q, a) = {δ1(q, a)} when q ∈ Q1 and δ1(q, a) ∉ F1.
– δ(q, a) = {q02 } ∪ {δ1 (q, a)} when q ∈ Q1 and δ1 (q, a) ∈ F1 . Thus,
when M1 would enter one of its final states, it can stay in that
state or make an ε-transition to the start state of M2 .
– δ(q, a) = δ2 (q, a) if q ∈ Q2 .

5.3.6 Closure under Kleene star


If we have a machine that recognizes A, then we should be able to build
a machine that recognizes A∗ by making a loop of some sort. The details
of this are a little bit tricky since the obvious way of doing this—simply
making ε-transitions from the final states to the start state—doesn’t work.

Theorem 14. If A is a regular set, then A∗ is a regular set.


Proof. Let M = (Q, Σ, δ, q0 , F ) be a DFA recognizing A. An NFA N recog-
nizing A∗ is obtained by

• Adding a new start state qs , with an ε-move to q0 . Since ε ∈ A∗ , qs


must be an accept state.

• Adding ε-moves from each final state to q0 .

The result is that N accepts ε; it accepts x if M accepts x, and it accepts w


if w = w1 · . . . · wn such that M accepts each wi . So N recognizes A∗ .
Formally, N = (Q ∪ {qs }, Σ, δ ′, qs , F ∪ {qs }), where δ ′ is defined by cases:

• Transitions from qs. Thus δ′(qs, ε) = {q0} and δ′(qs, a) = ∅, for a ≠ ε.

• Old transitions. δ′(q, a) = δ(q, a), provided δ(q, a) ∉ F.

• ε-transitions from F to q0. δ′(q, a) = δ(q, a) ∪ {q0}, when δ(q, a) ∈ F.

Example 107. Let M be given by the following DFA:

a a, b

b a, b
q0 q1 q2

A bit of thought reveals that L(M) = {a^n b | n ≥ 0}, and that

(L(M))∗ = {ε} ∪ {all strings ending in b}.

If we apply the construction, we obtain the following NFA for (L(M))∗ .

a a, b

ε b a, b
qs q0 q1 q2
ε

5.3.7 The subset construction


Now we discuss the subset construction, which was invented by Dana
Scott and Michael Rabin3 in order to show that the expressive power of
NFAs and DFAs is the same, i.e., they both recognize the regular lan-
guages. The essential idea is that the subset construction can be used to
map any NFA N to a DFA M such that L(N) = L(M).
The underlying insight of the subset construction is to have the transi-
tion function of the corresponding DFA M work over a set of states, rather
than the single state used by the transition function of the NFA N:

NFA: δ : (Σ ∪ {ε}) × Q → 2^Q
DFA: δ′ : Σ × 2^Q → 2^Q

(Footnote 3: They shared a Turing award for this; however, both researchers are famous for much other work as well.)
In other words, the NFA N is always in a single state and can have multiple successor states for symbol a via δ. In contrast, the DFA M is always in a set (possibly empty) of states and moves into a set of successor states via δ′, which is defined in terms of δ. This is formalized as follows:
δ ′ ({q1 , . . . , qk }, a) = δ(q1 , a) ∪ . . . ∪ δ(qk , a).
Example 108. Let’s consider the NFA N given by the diagram

0, 1
q0 1 q1 0 q2

N evidently accepts the language {x10 | x ∈ {0, 1}∗ }. The subset con-
struction for N proceeds by constructing a transition function over all sub-
sets of the states of N. Thus we need to consider
∅, {q0 }, {q1 }, {q2 }, {q0 , q1 }, {q0 , q2 }, {q1 , q2 }, {q0 , q1 , q2 }
as possible states of the DFA M to be constructed. The following table
describes the transition function for M.

Q 0 1
∅ ∅ ∅ (unreachable)
{q0 } {q0 } {q0 , q1 } (reachable)
{q1 } {q2 } ∅ (unreachable)
{q2 } ∅ ∅ (unreachable)
{q0 , q1 } {q0 , q2 } {q0 , q1 } (reachable)
{q0 , q2 } {q0 } {q0 , q1 } (reachable)
{q1 , q2 } {q2 } ∅ (unreachable)
{q0 , q1 , q2 } {q0 , q2 } {q0 , q1 } (unreachable)
And here’s the diagram. Unreachable states have been deleted. State A =
{q0 } and B = {q0 , q1 } and C = {q0 , q2 }.
0 1 1
1
A B C
0
0
Note that the final states of the DFA will be those that contain at least
one final state of the NFA.

So by making the states of the DFA be sets of states of the NFA, we
seem to get what we want: the DFA will accept just in case the NFA would
accept. This apparently gives us the best of both characterizations: the ex-
pressive power of NFAs, coupled with the straightforward executability
of DFAs. However, there is a flaw: some NFAs map to DFAs with expo-
nentially more states. A class of examples with this property are those
expressed as
Construct a DFA accepting the set of all binary strings in which
the nth symbol from the right is 0.
Also, we have not given a complete treatment: we still have to account
for ε-transitions, via ε-closure.

ε-Closure
Don't be confused by the terminology: ε-closure has nothing to do with closure of regular languages under ∩, ∪, etc.
The idea of ε-closure is the following: when moving from a set of states Si to a set of states Si+1, we have to take account of all ε-moves that could be made after the transition. Why do we have to do that? Because the DFA is over the alphabet Σ, instead of Σ ∪ {ε}, so we have to squeeze out all the ε-moves. Thus we define, for a set of states Q,

E(Q) = {q | q can be reached from a state in Q by 0 or more ε-moves}

and then a step in the DFA, originally


δ ′ ({q1 , . . . , qk }, a) = δ(q1 , a) ∪ . . . ∪ δ(qk , a)
where δ is the transition function of the original NFA, instead becomes
δ ′ ({q1 , . . . , qk }, a) = E(δ(q1 , a) ∪ . . . ∪ δ(qk , a))
Note. We make a transition and then chase any ε-steps. But what about
the start state? We need to chase any ε-steps from q0 before we start making
any transitions. So q0′ = E{q0 }. Putting all this together gives the formal
definition of the subset construction.
Definition 34 (Subset construction). Let N = (Q, Σ, δ, q0 , F ) be an NFA.
The DFA M = (Q′ , Σ, δ ′ , q0′ , F ′ ) given by the subset construction is specified
by

• Q′ = 2^Q
• q0′ = E{q0}
• δ′({q1, . . . , qk}, a) = E(δ(q1, a) ∪ . . . ∪ δ(qk, a)) = E(δ(q1, a)) ∪ . . . ∪ E(δ(qk, a))
• F′ = {S ∈ 2^Q | ∃q. q ∈ S ∧ q ∈ F}
The essence of the argument for correctness of the subset construction
amounts to noticing that the generated DFA mimics the transition be-
haviour of the NFA and accepts and rejects strings exactly as the NFA
does.
Theorem 15 (Correctness of subset construction). If DFA M is derived by
applying the subset construction to NFA N, then L(M) = L(N).
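Here is a Python sketch of the construction, which builds only the reachable subsets (exactly as in the example below). The NFA is assumed to be represented as a dictionary from (state, symbol) pairs to sets of states, with the empty string '' standing for ε; this representation and the helper names are our own choices.

def eclose(delta, S):
    # E(S): everything reachable from S by zero or more epsilon-moves.
    todo, closure = list(S), set(S)
    while todo:
        q = todo.pop()
        for r in delta.get((q, ''), set()):
            if r not in closure:
                closure.add(r)
                todo.append(r)
    return frozenset(closure)

def subset_construction(delta, q0, F, sigma):
    # Returns (delta', start', finals') of an equivalent DFA whose states are frozensets.
    start = eclose(delta, {q0})
    dfa_delta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in sigma:
            T = eclose(delta, set().union(*(delta.get((q, a), set()) for q in S)))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    finals = {S for S in seen if S & F}       # final iff it contains a final state of N
    return dfa_delta, start, finals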
Example 109. Convert the following NFA to an equivalent DFA.

0, 1
q0 ε q1 0 q5

1 1 1
ε
q2 0 q3 q4
0

The very slow way to do this would be to mechanically construct a table for the transition function, with the left column having all 2^6 = 64 subsets of {q0, q1, q2, q3, q4, q5}:
states 0 1

{q0 }
..
.
{q5 }
{q0 , q1 }
..
.
{q0 , q1 , q2 , q3 , q4 , q5 }

But that would lead to madness. Instead we should build the table in
an on-the-fly manner, wherein we only write down the transitions for the
reachable states, the ones we could actually get to by following transitions
from the start state. First, we need to decide on the start state: it is not
{q0 }! We have to take the ε-closure of {q0 }:
E{q0 } = {q0 , q1 }
In the following, it also helps to name the reached state sets, for concision.
states 0 1
A = {q0 , q1 } {q0 , q1 , q5 } {q0 , q1 , q2 }
B = {q0 , q1 , q5 } B D
C = {q0 , q1 , q2 } E C
D = {q0 , q1 , q2 , q4 } E C
E = {q0 , q1 , q3 , q4 , q5 } E D
So for example, if we are in state C, the set of states we could be in after
a 0 on the input are:
E(δ(q0 , 0)) ∪ E(δ(q1 , 0)) ∪ E(δ(q2 , 0)) = {q0 , q1 } ∪ {q5 } ∪ {q3 , q4 }
= {q0 , q1 , q3 , q4 , q5 }
= E
Similarly, if we are in state C and see a 1, the set of states we could be
in are:
E(δ(q0 , 1)) ∪ E(δ(q1 , 1)) ∪ E(δ(q2 , 1)) = {q0 , q1 } ∪ {q2 } ∪ ∅
= {q0 , q1 , q2 }
= C
A diagram of this DFA is

0
0 1
A B D

1 0
1 1
0
C E
1 0

Summary
For every DFA, there’s a corresponding (trivial) NFA. For every NFA,
there’s an equivalent DFA, via the subset construction. So the 2 models,
apparently quite different, have the same power (in terms of the languages
they accept). But notice the cost of ‘compiling away’ the non-determinism:
the number of states in a DFA derived from the subset construction can be
exponentially larger than in the original. Implementability has its price!

5.4 Regular Expressions


The regular expressions are another formal model of regular languages. Un-
like automata, these are essentially given by bestowing a syntax on the
regular languages and the operations they are closed under.

Definition 35 (Syntax of regular expressions). The set of regular expres-


sions R formed from alphabet Σ is the following:

• a ∈ R, if a ∈ Σ

• ε∈R

• ∅∈R

• r1 + r2 ∈ R, if r1 ∈ R ∧ r2 ∈ R

• r1 · r2 ∈ R, if r1 ∈ R ∧ r2 ∈ R

• r ∗ ∈ R, if r ∈ R

• Nothing else is in R

Remark. This is an inductive definition of a set R—the set is ‘initialized’ to


have ε and ∅ and all elements of Σ. Then we use the closure operations
to build the rest of the (infinite, usually) set R. The final clause disallows
other random things being in the set.
Note. Regular expressions are syntax trees used to denote languages. The
semantics, or meaning, of a regular expression is thus a set of strings, i.e.,
a language.

Definition 36 (Semantics of regular expressions). The meaning of a regular
expression r, written L(r) is defined as follows:

L(a) = {a}, for a ∈ Σ


L(ε) = {ε}
L(∅) = ∅
L(r1 + r2 ) = L(r1 ) ∪ L(r2 )
L(r1 · r2 ) = L(r1 ) · L(r2 )
L(r ∗ ) = L(r)∗

Note the overloading. The occurrence of · and ∗ on the right hand side
of the equations are operations on languages, while on the left hand side,
they are nodes in a tree structure.
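The semantics can be exercised directly in code. Since L(r) is usually infinite, the sketch below (our own illustration, not a standard library) computes only the strings of L(r) of length at most n, which is already enough to experiment with the identities given later in this section. Regular expressions are represented as nested tuples.

def lang(r, n):
    # Strings of length <= n in L(r).  Regexes are tuples:
    # ('sym', a), ('eps',), ('empty',), ('plus', r1, r2), ('dot', r1, r2), ('star', r1).
    tag = r[0]
    if tag == 'sym':   return {r[1]} if n >= 1 else set()
    if tag == 'eps':   return {''}
    if tag == 'empty': return set()
    if tag == 'plus':  return lang(r[1], n) | lang(r[2], n)
    if tag == 'dot':
        A, B = lang(r[1], n), lang(r[2], n)
        return {x + y for x in A for y in B if len(x + y) <= n}
    if tag == 'star':                     # L(r*) = union of L(r)^k, truncated at length n
        A, result, frontier = lang(r[1], n), {''}, {''}
        while frontier:
            frontier = {x + y for x in frontier for y in A if len(x + y) <= n} - result
            result |= frontier
        return result

# Example 111 below: 1(00)* denotes the binary representations of powers of 4.
powers_of_4 = ('dot', ('sym', '1'), ('star', ('dot', ('sym', '0'), ('sym', '0'))))
print(sorted(lang(powers_of_4, 5)))       # ['1', '100', '10000']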

Convention. For better readability, different precedences are given to the infix and postfix operators. Also, we will generally omit the concatenation operator, just treating adjacent regular expressions as being concatenated.
Thus we let ∗ bind more strongly than ·, and that binds more strongly than
the infix + operator. Parentheses can be used to say exactly what you
want. Thus
r1 + r2 r3 ∗ = r1 + r2 · r3 ∗ = r1 + (r2 · (r3 ∗ ))
Since the operations of + and · are associative, bracketing doesn’t matter
in expressions like

a+b+c+d and abcd.

Yet more notation:

• We can use Σ to stand for any member of Σ. That is, an occurrence of


Σ in a regular expression is an abbreviation for the regular expression
a1 + . . . + an , where Σ = {a1 , . . . , an }.

• r + = rr ∗.

Example 110 (Floating point constants). Suppose

Σ = {'+', '−'} ∪ D ∪ {.}

where D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. (We quote the literal sign symbols '+' and '−' to distinguish them from the + used to build regular expressions.) Then

('+' + '−' + ε) · (D⁺ + D⁺.D∗ + D∗.D⁺)
is a concise description of a simple class of floating point constants for a
programming language. Examples of such constants are: +3, −3.2, −.235.
Example 111. Give a regular expression for the binary representation of
the numbers which are powers of 4:
{4^0, 4^1, 4^2, . . .} = {1, 4, 16, 64, 256, . . .}
Merely transcribing to binary gives us the important clue we need:
{1, 100, 10000, 1000000, . . .}
The regular expression generating this language is 1(00)∗ .
Example 112. Give a regular expression for the set of binary strings which
have at least one occurrence of 001. One answer is
(0 + 1)∗ 001(0 + 1)∗ or Σ∗ 001Σ∗
Example 113. Give a regular expression for the set of binary strings which
have no occurrence of 001. This example is much harder, since the problem
is phrased negatively. In fact, this is an instance where it is easier to build
an automaton for recognizing the given set:
• Build an NFA for recognizing any string where 001 occurs. This is
easy.
• Convert to a DFA. We know how to do this (subset construction).
• Complement the resulting automaton.5
However, we are required to come up with a regular expression. How to start? First, note that if a string w in the set contains 00, then every symbol after that occurrence must also be 0 (otherwise a 001 would appear); so 00 can only occur in a trailing block of 0s. The set of binary strings
having no occurrence of 00 and ending in 1 is
(01 + 1)∗
And now we can append any number of 0s to this and get the specified
set:
(01 + 1)∗ 0∗
(Footnote 5: Note that directly complementing the NFA won't work in general.)
5.4.1 Equalities for regular expressions
The following equalities are useful when manipulating regular expres-
sions. They should mostly be familiar and can be proved simply by reduc-
ing to the meaning in languages and using the techniques and theorems
we have already seen.

r1 + (r2 + r3 ) = (r1 + r2 ) + r3
r1 + r2 = r2 + r1
r+r =r
r+∅=r
εr = r = rε
∅r = ∅ = r∅
∅∗ = ε
r1 (r2 r3 ) = (r1 r2 )r3
r1 (r2 + r3 ) = r1 r2 + r1 r3
(r1 + r2 )r3 = r1 r3 + r2 r3
ε + rr ∗ = r ∗
(ε + r)∗ = r ∗
rr ∗ = r ∗ r
r∗r∗ = r∗
r∗∗ = r∗
(r1 r2 )∗ r1 = r1 (r2 r1 )∗
(r1 ∗ r2 )∗ r1 ∗ = (r1 + r2 )∗

Example 114. In the description of regular expressions, ε is superfluous. The reason is that ε = ∅∗, since

L(∅∗) = L(∅)∗
      = ∅∗
      = ∅^0 ∪ ∅^1 ∪ ∅^2 ∪ . . .
      = {ε} ∪ ∅ ∪ ∅ ∪ . . .
      = {ε}
      = L(ε)

Example 115. Simplify the following regular expression: (00)∗ 0 + (00)∗ .


This is perhaps most easily seen by unrolling the subexpressions a few
times:
(00)∗0 = {0^1, 0^3, 0^5, . . .} = {0^n | n is odd}

and

(00)∗ = {ε, 0^2, 0^4, 0^6, . . .} = {0^n | n is even}

Thus (00)∗0 + (00)∗ = 0∗.

Example 116. Simplify the following regular expression: (0+1)(ε + 00)+ +


(0 + 1). By distributing · over +, we have

0(ε + 00)+ + 1(ε + 00)+ + (0 + 1)

We can use the lemma (ε + r)+ = r ∗ to get

0(00)∗ + 1(00)∗ + 0 + 1

but 0 is already in 0(00)∗ , and 1 is already in 1(00)∗ , so we are left with

0(00)∗ + 1(00)∗ or (0 + 1)(00)∗ .

Example 117. Simplify the following regular expression: (0 + ε)0∗1.

(0 + ε)0∗1 = (0 + ε)(0∗1)
           = 00∗1 + 0∗1
           = 0∗1.
Example 118. Show that (0^2 + 0^3)∗ = (0^2 0∗)∗. Examining the left hand side, we have

(0^2 + 0^3)∗ = {ε, 0^2, 0^3, 0^4, . . .}
             = 0∗ − {0}.

On the right hand side, we have

(0^2 0∗)∗ = (000∗)∗
          = {ε} ∪ {00, 000, 0^4, 0^5, . . .}
          = {ε} ∪ {0^(k+2) | 0 ≤ k}
          = 0∗ − {0}.

Example 119. Prove the identity (0 + 1)∗ = (1∗(0 + ε)1∗)∗ using the algebraic identities. We work on the right hand side, noting at each step which identity is applied:

(1∗(0 + ε)1∗)∗
  = (1∗01∗ + 1∗1∗)∗       [distributing · over +]
  = (1∗01∗ + 1∗)∗         [a∗a∗ = a∗]
  = (1∗ + 1∗01∗)∗         [a + b = b + a]
  = (1∗∗ 1∗01∗)∗ 1∗∗       [(a + b)∗ = (a∗b)∗a∗, with a = 1∗ and b = 1∗01∗]
  = (1∗01∗)∗ 1∗           [a∗∗ = a∗ and a∗a∗ = a∗]
  = 1∗(01∗1∗)∗            [(ab)∗a = a(ba)∗, with a = 1∗ and b = 01∗]
  = 1∗(01∗)∗              [a∗a∗ = a∗]
  = (1∗0)∗1∗              [a(ba)∗ = (ab)∗a, with a = 1∗ and b = 0]
  = (0 + 1)∗              [(a∗b)∗a∗ = (a + b)∗, with a = 1 and b = 0]

5.4.2 From regular expressions to NFAs


We have now seen 3 basic models of computation: DFA, NFA, and regular
expressions. These are all equivalent, in that they all recognize the regular
languages. We have seen the equivalence of DFAs and NFAs, which is
proved by showing how a DFA can be mapped into an NFA (trivial), and
vice versa (the subset construction). We are now going to fill in the rest of
the picture by showing how to translate

• regular expressions into equivalent NFAs; and

• DFAs into equivalent regular expressions

The translation of a regular expression to an NFA proceeds by exhaustive application of the following rules, each of which replaces an edge labelled with a regular expression by a small fragment of automaton:

(init) Label a single edge from the start state A to the accept state B with the whole regular expression r.

(plus) An edge from A to B labelled r1 + r2 is replaced by two parallel edges from A to B, one labelled r1 and one labelled r2.

(concat) An edge from A to B labelled r1 · r2 is replaced by an edge labelled r1 from A to a new intermediate state, followed by an edge labelled r2 from that state to B.

(star) An edge from A to B labelled r∗ is replaced by an ε-edge from A to a new state and an ε-edge from that state to B, together with a loop on the new state labelled r.

The init rule is used to start the process off: it sets the regular expression
as a label between the start and accept states. The idea behind the rule
applications is to iteratively replace each regular expression by a fragment
of automaton that implements it.
Application of these rules, the star rule especially, can result in many
useless ε-transitions. There is a complicated rule for eliminating these,
which can be applied only after all the other rules can no longer be applied.

Definition 37 (Redundant ε-edge elimination). An edge qi --ε--> qj can be shrunk to a single node by the following rule:

• If the edge labelled with ε is the only edge leaving qi then qi can be
replaced by qj . If qi is the start node, then qj becomes the new start
state.

• If the edge labelled ε is the only edge entering qj then qj can be re-
placed by qi . If qj is a final state, then qi becomes a final state.

Example 120. Find an equivalent NFA for (11 + 0)∗ (00 + 1)∗ .

(The staged diagrams are summarized step by step.)

init: a single edge from the start state to the accept state, labelled (11 + 0)∗(00 + 1)∗.

concat: two edges in sequence through a new intermediate state, labelled (11 + 0)∗ and (00 + 1)∗.

star (twice): each starred edge is replaced by an ε-edge into a new state and an ε-edge out of it, with a loop on the new state labelled 11 + 0, respectively 00 + 1.

plus (twice): each loop labelled with a sum splits into two parallel loops, labelled 11 and 0 on the first new state, and 00 and 1 on the second.

concat (twice): the loops labelled 11 and 00 are each split into two single-symbol edges through a further new state.

That ends the elaboration of the regular expression into the correspond-
ing NFA. All that remains is to eliminate redundant ε-transitions. The ε
transition from the start state can be dispensed with, since it is a unique
out-edge from a non-final node; similarly, the ε transition into the final
state can be eliminated because it is a unique in-edge to a non-initial node.
This yields

0 1

ε ε

1 1 0 0

We are not yet done. One of the two middle ε-transitions can be eliminated—
in fact the middle node has a unique in-edge and a unique out-edge—so
the middle state can be dropped.

0 1

1 1 0 0

However, the remaining ε-edge can not be deleted; doing so would change the language to (0 + 1)∗. Thus application of the ε-edge elimination rule must occur one edge at a time.
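The rule-based elaboration above is convenient by hand; in code it is usually easier to build the NFA bottom-up over the structure of the regular expression. The following sketch is a standard recursive construction in that spirit (often attributed to Thompson), not a literal implementation of the rewriting rules above; it reuses the tuple representation of regular expressions from the earlier sketch, and '' again stands for ε.

from itertools import count

def regex_to_nfa(r, fresh=None):
    # Returns (delta, start, accept): an NFA with one accept state recognizing L(r).
    # delta maps (state, symbol) to a set of states; the symbol '' is an epsilon-move.
    fresh = fresh if fresh is not None else count()
    delta = {}
    def add(p, a, q):
        delta.setdefault((p, a), set()).add(q)
    s, f = next(fresh), next(fresh)
    tag = r[0]
    if tag == 'sym':
        add(s, r[1], f)
    elif tag == 'eps':
        add(s, '', f)
    elif tag == 'empty':
        pass                                   # accept state unreachable: the empty language
    elif tag == 'plus':                        # choose either sub-NFA
        d1, s1, f1 = regex_to_nfa(r[1], fresh); delta.update(d1)
        d2, s2, f2 = regex_to_nfa(r[2], fresh); delta.update(d2)
        add(s, '', s1); add(s, '', s2); add(f1, '', f); add(f2, '', f)
    elif tag == 'dot':                         # run the first sub-NFA, then the second
        d1, s1, f1 = regex_to_nfa(r[1], fresh); delta.update(d1)
        d2, s2, f2 = regex_to_nfa(r[2], fresh); delta.update(d2)
        add(s, '', s1); add(f1, '', s2); add(f2, '', f)
    elif tag == 'star':                        # zero or more passes through the sub-NFA
        d1, s1, f1 = regex_to_nfa(r[1], fresh); delta.update(d1)
        add(s, '', s1); add(s, '', f); add(f1, '', s1); add(f1, '', f)
    return delta, s, f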

5.4.3 From DFA to regular expression


It is possible to convert a DFA to an equivalent regular expression. There
are several approaches; the one we will take is based on representing the
automaton as a system of equations and then using Arden’s lemma to
solve the equations.

Representing an automaton as a system of equations


The basis of this representation is to think of a state in the machine as
a regular expression representing the strings that would be accepted by
running the machine on them from that state. Thus from state A in the
following fragment of a machine

(Diagram: a fragment of a machine with states A, B, C; A has a self-loop labelled c, an edge labelled b to B, and an edge labelled a to C.)

any string that will eventually be accepted will be of one of the forms

• b · t1 , where t1 is a string accepted from state B.

• a · t2 , where t2 is a string accepted from state C.

• c · t3 , where t3 is a string accepted from state A.

This can be captured in the equation

A = cA + bB + aC

the right hand size of which looks very much like a regular expression, ex-
cept for the occurrences of the variables A, B, and C. Indeed, the equation
solving process eliminates these variables so that the final expression is a
bona fide regular expression. The goal, of course, is to solve for the variable
representing the start state.
Accept states are somewhat special since the machine, if run from them,
would accept the empty string. This has to be reflected in the equation.
Thus
(Diagram: the same fragment, but with A now an accept state.)

is represented by the equation

A = cA + bB + aC + ε

Using Arden’s Lemma to solve a system of equations
An important theorem about languages, proved earlier in these notes, is
the following:

Theorem 16 (Arden's Lemma). Assume that A and B are two languages with ε ∉ A. Also assume that X is a language having the property X = (A · X) ∪ B. Then X = A∗ · B.

What this theorem allows is the finding of closed form solutions to equa-
tions where the variable (X in the theorem) appears on both sides. We can
apply this theorem to the equations read off from DFAs quite easily: the
side condition that ε ∉ A always holds, since DFAs have no ε-transitions.
Thus, from our example, the equation characterizing the strings accepted
from state A
A = cA + bB + aC + ε
is equivalent, by application of Arden’s Lemma to

A = c∗ (bB + aC + ε)

Once the closed form Q = rhs for a state Q is found, rhs can be sub-
stituted for Q throughout the remainder of the equations. This is repeated
until finally the start state has a regular expression representing its lan-
guage.

Example 121. Give an equivalent regular expression for the following DFA:

(Diagram: DFA with start state A and accept states B and C; transitions: A --a--> B, A --b--> C, B --b--> B, B --a--> A, C --a--> B, C --b--> A.)
We now make an equational presentation of the DFA:

A = aB + bC
B = bB + aA + ε
C = aB + bA + ε

We eventually need to solve for A, but can start with B or C. Let’s start
with B. By application of Arden’s lemma, we get

B = b∗ (aA + ε)

and then we can substitute this in all the remaining equations:

A = a(b∗ (aA + ε)) + bC


C = a(b∗ (aA + ε)) + bA + ε

Now the equation for C is not phrased in terms of C, so Arden’s lemma


doesn’t apply and the rhs may be substituted directly for C:

A = a(b∗ (aA + ε)) + b(a(b∗ (aA + ε)) + bA + ε)

And now we do some regular expression algebra to prepare for the final
application of the lemma:

A = a(b∗(aA + ε)) + b(a(b∗(aA + ε)) + bA + ε)
  = ab∗aA + ab∗ + b(ab∗aA + ab∗ + bA + ε)
  = ab∗aA + ab∗ + bab∗aA + bab∗ + bbA + b
  = (ab∗a + bab∗a + bb)A + ab∗ + bab∗ + b

And then the lemma applies, and we obtain

A = (ab∗ a + bab∗ a + bb)∗ (ab∗ + bab∗ + b)

Notice how this quite elegantly summarizes all the ways to loop back
to A when starting from A, followed by all the non-looping paths from A
to an accept state.
End of example

The DFA-to-regexp construction, together with the regexp-to-NFA con-


struction, plus the subset construction, yield the following theorem.

Theorem 17 (Kleene). Every regular language can be represented by a regular


expression and, conversely, every regular expression describes a regular language.

∀L. Regular(L) iff ∃r. r is a regular expression ∧ L(r) = L

To summarize, we have seen methods for translating between DFAs,
NFAs, and regular expressions:

• Every DFA is an NFA.

• Every NFA can be converted to an equivalent DFA, by the subset construction.

• Every regular expression can be translated to an equivalent NFA, by the method in Section 5.4.2.

• Every DFA can be translated to a regular expression by the method in Section 5.4.3.

Notice that, in order to say that these translations work, i.e., are correct, we need to use the concept of formal language.

5.5 Minimization
Now we turn to examining how to reduce the size of a DFA such that it
still recognizes the same language. This is useful because some transfor-
mations and tools will generate DFAs with a large amount of redundancy.

Example 122. Suppose we are given the following NFA:

0, 1 0, 1

q0 0 q1 1 q2 0 q3

The subset construction yields the following (equivalent) DFA:

1 0 0 0 1
p0 0 p1 1 p2 0 p3 p4 1 p5
1
1 0

which has 6 reachable states, out of a possible 2^4 = 16. But notice
that p3 , p4 , and p5 are all accept states, and it’s impossible to ‘escape’ from
them. So you could collapse them to one big success state. Thus the DFA
is equivalent to the following DFA with 4 states:

1 0 0, 1

p0 0 p1 1 p2 0 p3

There are methods for systematically reducing DFAs to equivalent ones


which are minimal in the number of states. Here’s a rough outline of a
minimization procedure:

1. Eliminate inaccessible, or unreachable, states. These are states for


which there is no string in Σ∗ that will take the machine to that state.
How is this done? We have already been doing it, somewhat infor-
mally, when performing subset constructions. The idea is to start in
q0 and mark all states accessible in one step from it. Now repeat this
from all the newly marked states until no new marked state is pro-
duced. Any unmarked states at the end of this are inaccessible and
can be deleted.

2. Collapse equivalent states. We will gradually see what this means in


the following examples.

Remark. We will only be discussing minimization of DFAs. If asked to


minimize an NFA, first convert it to a DFA.
Example 123. The 4 state automaton

a q1 a, b a, b
q0 q3
b q2 a, b

is clearly equivalent to the following 3 state machine:

a, b
a, b a, b
q0 q12 q2

Example 124. The DFA

q1 0 q3
0 0, 1
1 0, 1
q0 q5
1 1
q2 q4 0, 1
0

recognizes the language

{0, 1} ∪ {x ∈ {0, 1}∗ | len(x) ≥ 3}

Now we observe that q3 and q4 are equivalent, since both go to q5 on any-


thing. Thus they can be collapsed to give the following equivalent DFA:

q1 0, 1
0 0, 1
0, 1
q0 q34 q5
1 q2 0, 1

By the same reasoning, q1 and q2 both go to q34 on anything, so we can


collapse them to state q12 to get the equivalent DFA

0, 1
0, 1 0, 1 0, 1
q0 q12 q34 q5

Example 125. The DFA

q5 0 q4
0 0
q0 q3

0 0
q1 q2
0
recognizes the language

{0^n | ∃k. n = 3k + 1}

This DFA minimizes to

q0 0 q2

0 0
q1

How is this done, you may ask. The main idea is a process that takes a DFA and combines states of it in a step-by-step fashion, where each step yields an equivalent automaton. There are a couple of criteria that must be observed:

• We never combine a final state and a non-final state. Otherwise the


language recognized by the automaton would change.

• If we merge states p and q, then we have to combine δ(p, a) and δ(q, a), for each a ∈ Σ. Conversely, if δ(p, a) and δ(q, a) are not equivalent states, then p and q can not be equivalent.

Thus if there is a string x = x1 · . . . · xn such that running the automa-


ton M from state p on x leaves M in an accept state and running M from
state q on x leaves M in a non-accept state, then p and q cannot be equiva-
lent. However, if, for all strings x in Σ∗ , running M on x from p yields the
same acceptance verdict (accept/reject) as M on x from q, then p and q are
equivalent. Formally we define equivalence ≈ as

Definition 38 (DFA state equivalence).
p ≈ q iff ∀x ∈ Σ∗ . ∆(p, x) ∈ F iff ∆(q, x) ∈ F
where F is the set of final states of the automaton.
Question: What is ∆?
Answer ∆ is the extension of δ from symbols (single step) to strings
(multiple steps). Its formal definition is as follows:
∆(q, ε) = q
∆(q, a · x) = ∆(δ(q, a), x)
Thus ∆(q, x) gives the state after the machine has made a sequence of tran-
sitions while processing x. In other words, it’s the state at the end of the
computation path for x, where we treat q as the start state.
Remark. ≈ is an equivalence relation, i.e., it is reflexive, symmetric, and
transitive:
• p≈p
• p≈q⇒q≈p
• p≈q∧q ≈r ⇒p≈r
An equivalence relation partitions the underlying set (for us, the set of
states Q of an automaton) into disjoint equivalence classes. This is denoted
by Q/ ≈. Each element of Q is in one and only one partition of Q/ ≈.
Example 126. Suppose we have a set of states Q = {q0 , q1 , q2 , q3 , q4 , q5 } and
we define qi ≈ qj iff i mod 2 = j mod 2, i.e., qi and qj are equivalent if i and
j are both even or both odd. Then Q/ ≈ = {{q0 , q2 , q4 }, {q1 , q3 , q5 }}.
The equivalence class of q ∈ Q is written [q], and defined
[q] = {p | p ≈ q} .
We have the equality: p ≈ q (equivalence of states) iff [p] = [q] (equality of sets of states).

The quotient construction builds equivalence classes of states and then


treats each equivalence class as a single state in the new automaton.

Definition 39 (Quotient automaton). Let M = (Q, Σ, δ, q0 , F ) be a DFA.
The quotient automaton is M/ ≈ = (Q′ , Σ, δ ′ , q0′ , F ′) where

• Q′ = {[p] | p ∈ Q}, i.e., Q/ ≈

• Σ is unchanged

• δ′([p], a) = [δ(p, a)], i.e., transitioning from an equivalence class (where p is an element) on a symbol a is implemented by making a transition δ(p, a) in the original automaton and then returning the equivalence class of the state reached.

• q0′ = [q0], i.e., the start state in the new machine is the equivalence class of the start state in the original.

• F′ = {[p] | p ∈ F}, i.e., the set of equivalence classes of the final states of the original machine.

Theorem 18. If M is a DFA that recognizes L, then M/≈ is a DFA that recognizes L. There is no DFA that both recognizes L and has fewer states than M/≈.

OK, OK, enough formalism! We still haven't addressed the crucial question, namely how do we calculate the equivalence classes?
There are several ways; we will use a table-filling approach. The gen-
eral idea is to assume initially that all states are equivalent. But then we
use our criteria to determine when states are not equivalent. Once all the
non-equivalent states are marked as such, the remaining states must be
equivalent.
Consider all pairs of states p, q in Q. A pair p, q is marked once we know
p and q are not equivalent. This leads to the following algorithm:

1. Write down a table for the pairs of states

2. Mark (p, q) in the table if p ∈ F and q ∉ F, or if p ∉ F and q ∈ F.

3. Repeat until no change can be made to the table:

• if there exists an unmarked pair (p, q) in the table such that the pair (δ(p, a), δ(q, a)) is marked, for some a ∈ Σ, then mark (p, q).

4. Done. Read off the equivalence classes: if (p, q) is not marked, then
p ≈ q.

Remark. We may have to revisit the same (p, q) pair several times, since
combining two states can suddenly allow hitherto equivalent states to be
markable.
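Written out in code (again over our concrete dictionary representation, and assuming unreachable states have already been removed), the marking algorithm is quite short. This sketch returns the unordered pairs of states that survive unmarked, i.e. the equivalent pairs.

from itertools import combinations

def equivalent_pairs(delta, states, F, sigma):
    # Table-filling: mark distinguishable pairs, iterating to a fixpoint.
    pair = lambda p, q: frozenset((p, q))
    marked = {pair(p, q) for p, q in combinations(states, 2)
              if (p in F) != (q in F)}                    # step 2: final vs non-final
    changed = True
    while changed:                                        # step 3
        changed = False
        for p, q in combinations(states, 2):
            if pair(p, q) in marked:
                continue
            if any(pair(delta[(p, a)], delta[(q, a)]) in marked for a in sigma):
                marked.add(pair(p, q))
                changed = True
    # step 4: whatever remains unmarked is equivalent
    return {pair(p, q) for p, q in combinations(states, 2)} - marked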

Example 127. Minimize the following DFA

0
1

0 1 0
A B C D
0 0 1 1
1
1 1
E F G H
0
1
0
0
We start by setting up our table. We will be able to restrict our attention
to the lower left triangle, since equivalence is symmetric. Also, each box
on the diagonal will be marked with ≈, since every state is equivalent to
itself. We also notice that state D is not reachable, so we will ignore it.

A B C D E F G H
A ≈ − − − − − − −
B ≈ − − − − − −
C ≈ − − − − −
D − − − − − − − −
E − ≈ − − −
F − ≈ − −
G − ≈ −
H − ≈
Now we split the states into final and non-final. Thus, a box indexed by
p, q will be labelled with an X if p is a final state and q is not, or vice versa.

Thus we obtain
A B C D E F G H
A ≈ − − − − − − −
B ≈ − − − − − −
C X0 X0 ≈ − − − − −
D − − − − − − − −
E X0 − ≈ − − −
F X0 − ≈ − −
G X0 − ≈ −
H X0 − ≈
State C is inequivalent to all other states. Thus the row and column la-
belled by C get filled in with X0 . (We will subscript each X with the step
at which it is inserted into the table.) However, note that C, C is not filled
in, since C ≈ C. Now we have the following pairs of states to consider:
{AB, AE, AF, AG, AH, BE, BF, BG, BH, EF, EG, EH, F G, F H, GH}
Now we introduce some notation which compactly captures how the ma-
chine transitions from a pair of states to another pair of states. The notation
p1 p2 <--0-- q1 q2 --1--> r1 r2

means q1 --0--> p1 and q2 --0--> p2 and q1 --1--> r1 and q2 --1--> r2. If one of the pairs p1 p2 or r1 r2 is already marked in the table, then there is a way to distinguish q1 and q2: they transition to inequivalent states. Therefore q1 and q2 are not equivalent, and the box labelled by q1 q2 will become marked. For example, if we take the state pair AB, we have

BG <--0-- AB --1--> FC
and since F C is marked, AB becomes marked as well.
A B C D E F G H
A ≈ − − − − − − −
B X1 ≈ − − − − − −
C X0 X0 ≈ − − − − −
D − − − − − − − −
E X0 − ≈ − − −
F X0 − ≈ − −
G X0 − ≈ −
H X0 − ≈

In a similar fashion, we examine the remaining unassigned pairs:
• BH <--0-- AE --1--> FF. Unable to mark.

• BC <--0-- AF --1--> FG. Mark, since BC is marked.

• BG <--0-- AG --1--> FE. Unable to mark.

• BG <--0-- AH --1--> FC. Mark, since FC is marked.

• GH <--0-- BE --1--> CF. Mark, since CF is marked.

• GC <--0-- BF --1--> CG. Mark, since CG is marked.

• GG <--0-- BG --1--> CE. Mark, since CE is marked.

• GG <--0-- BH --1--> CC. Unable to mark.

• HC <--0-- EF --1--> FG. Mark, since CH is marked.

• HG <--0-- EG --1--> FE. Unable to mark.

• HG <--0-- EH --1--> FC. Mark, since CF is marked.

• CG <--0-- FG --1--> GE. Mark, since CG is marked.

• CG <--0-- FH --1--> GC. Mark, since CG is marked.

• GG <--0-- GH --1--> EC. Mark, since EC is marked.

The resulting table is

A B C D E F G H
A ≈ − − − − − − −
B X1 ≈ − − − − − −
C X0 X0 ≈ − − − − −
D − − − − − − − −
E X1 X0 − ≈ − − −
F X1 X1 X0 − X1 ≈ − −
G X1 X0 − X1 ≈ −
H X1 X0 − X1 X1 X1 ≈

Next round. The following pairs need to be considered:

{AE, AG, BH, EG}

The previously calculated transitions can be re-used; all that will have
changed is whether the ‘transitioned-to’ states have been subsequently
marked with an X1 :

AE: unable to mark

AG: mark because BG is now marked.

BH: unable to mark

EG: mark because HG is now marked

The resulting table is

A B C D E F G H
A ≈ − − − − − − −
B X1 ≈ − − − − − −
C X0 X0 ≈ − − − − −
D − − − − − − − −
E X1 X0 − ≈ − − −
F X1 X1 X0 − X1 ≈ − −
G X2 X1 X0 − X2 X1 ≈ −
H X1 X0 − X1 X1 X1 ≈

Next round. The following pairs remain: {AE, BH}. However, neither
makes a transition to a marked pair, so the round adds no new markings
to the table. We are therefore done. The quotiented state set is

{{A, E}, {B, H}, {F }, {C}, {G}}

In other words, we have been able to merge states A and E, and B and H.
The final automaton is given by the following diagram.

1
0
AE 0 BH 0
G
1 0 1
1
0
F C

5.6 Decision Problems for Regular Languages


Now we will discuss some questions that can be asked about automata
and regular expressions. These will tend to be from a general point of
view, i.e., involve arbitrary automata. A question that takes any automa-
ton (or collection of automata) as input and asks for a terminating algo-
rithm yielding a boolean (true or false) answer is called a decision problem,
and a program that correctly solves such a problem is called a decision al-
gorithm. Note well that a decision problem is typically a question about
the (often infinite) set of strings that a machine must deal with; answers
that involve running the machine on every string in the set are not useful,
since they will take forever. That is not allowed: in every case, a decision
algorithm must return a correct answer in finite time.
Here is a list of decision problems for automata and regular expres-
sions:

1. Given a string x and a DFA M, x ∈ L(M)?

2. Given a string x and an NFA N, x ∈ L(N)?

3. Given a string x and a regular expression r, x ∈ L(r)?

4. Given DFA M, L(M) = ∅?

5. Given DFA M, L(M) = Σ∗ ?

6. Given DFAs M1 and M2 , L(M1 ) ∩ L(M2 ) = ∅?

7. Given DFAs M1 and M2 , L(M1 ) ⊆ L(M2 )?

8. Given DFAs M1 and M2 , L(M1 ) = L(M2 )?

9. Given DFA M, is L(M) finite?

10. Given DFA M, is M the DFA having the fewest states that recognizes
L(M)?

It turns out that all these problems do have algorithms that correctly
answer the question. Some of the algorithms differ in how efficient they
are; however, we will not delve very deeply into that issue, since this class
is mainly oriented towards qualitative aspects of computation, i.e., can the
problems be solved at all? (For some decision problems, as we shall see
later in the course, the answer is, surprisingly, no.)

5.6.1 Is a string accepted/generated?


The problems

Given a string x, DFA M, x ∈ L(M)?


Given a string x and an NFA N, x ∈ L(N)?
Given a string x and a regular expression r, x ∈ L(r)?

are easily solved: for the first, merely run the string x through the DFA
and check whether the machine is in an accept state at the end of the run.
For the second, first translate the NFA to an equivalent DFA by the subset
construction and then run the DFA on the string. For the third, one must
translate the regular expression to an NFA and then translate the NFA to
a DFA before running the DFA on x.
However, we would like to avoid the step mapping from NFAs to
DFAs, since the subset construction can create a DFA with exponentially
more states than the NFA. Happily, it turns out that an algorithm that
maintains the set of possible current states in an on-the-fly manner works
relatively efficiently. The algorithm will be illustrated by example.

Example 128. Does the following NFA accept the string aaaba?

a a, b
b, ε a
q0 q1 q2

b b
a a, b
q3 q4
ε
The initial set of states that the machine could be in is {q0 , q1 }. We then
have the following table, showing how the set of possible current states
changes with each new transition:
input symbol possible current states
{q0 , q1 }
a ↓
{q0 , q1 , q2 }
a ↓
{q0 , q1 , q2 }
a ↓
{q0 , q1 , q2 }
b ↓
{q1 , q2 , q3 , q4 }
a ↓
{q2 , q3 , q4 }
After the string has been processed, we examine the set of possible states
{q2 , q3 , q4 } and find q4 , so the answer returned is true.
In an implementation, the set of possible current states would be kept in a data structure, and each transition would cause states to be added or deleted from the set. Once the string has been fully processed, all that needs to be done is to take the intersection between the accept states of the machine and the set of possible current states. If it is non-empty, then answer true; otherwise, answer false.

5.6.2 L(M) = ∅?
There are a couple of possible approaches to checking language emptiness.
The first idea is to minimize M to an equivalent minimum state machine

M ′ and check whether M ′ is equal to the following DFA, which is a mini-
mum (having only 1) state DFA that recognizes ∅, i.e., accepts no strings:
(Diagram: a single non-accepting state with a self-loop on every symbol of Σ.)

This is a good idea; however, recall that the first step in minimizing a
DFA is to first remove all unreachable states. A reachable state is one that
some string will put the machine into. In other words, the reachable states
are just those you can get to from the start state by making a finite number
of transitions.

Definition 40 (Reachable states). Reachability is inductively defined by the following rules:

• q0 is reachable.

• qj is reachable if there is a qi such that qi is reachable and δ(qi, a) = qj, for some a ∈ Σ.

• no other states are reachable.

The following recursive algorithm computes the reachable states of a machine. It maintains a set of reachable states R, which is initialized to {q0}:

reachable R =
  let new = {q′ | ∃q a. q ∈ R ∧ q′ ∉ R ∧ a ∈ Σ ∧ δ(q, a) = q′}
  in
  if new = ∅ then R else reachable(new ∪ R)

That leads us to the second idea: L(M) = ∅ iff F ∩ R = ∅, where F is the set of accept states of M and R is the set of reachable states of M. Thus, in order to decide if L(M) = ∅, we compute the reachable states of M and check to see if any of them is an accept state. If one is, L(M) ≠ ∅; otherwise, the machine accepts no string.
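In code, this is a graph search followed by an intersection test; a minimal sketch under the usual dictionary representation of a DFA:

def reachable(delta, q0):
    # All states reachable from the start state by following transitions.
    R, todo = {q0}, [q0]
    while todo:
        q = todo.pop()
        for (p, a), r in delta.items():
            if p == q and r not in R:
                R.add(r)
                todo.append(r)
    return R

def is_empty(delta, q0, F):
    # L(M) is empty iff no accept state is reachable.
    return not (reachable(delta, q0) & F)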

5.6.3 L(M) = Σ∗?
To decide whether a machine M accepts all strings over its alphabet, we
can use one of the following two algorithms:
1. Check if a minimized version of M is equal to the following DFA:
(Diagram: a single accepting state with a self-loop on every symbol of Σ.)

2. Use closure under complementation: let M ′ be the DFA obtained


by swapping the accept and non-accept states of M; then apply the
decision algorithm for language emptiness to M ′ . If the algorithm
returns true then M ′ accepts no string; thus M must accept all strings,
and we return true. Otherwise, M ′ accepts some string w and so M
must reject w, so we return false.

5.6.4 L(M1 ) ∩ L(M2 ) = ∅?


Given DFAs M1 and M2 , we wish to see if there is some string that both
machines accept. The following algorithm performs this task:
1. Build the product machine M1 × M2 , making the accept states be just
those in which both machines accept: {(qi , qj ) | qi ∈ F1 ∧ qj ∈ F2 }. Thus
L(M1 × M2 ) = L(M1 ) ∩ L(M2 ). This machine only accepts strings ac-
cepted by both M1 and M2 .
2. Run the emptiness checker on M1 × M2 , and return its answer.

5.6.5 L(M1 ) ⊆ L(M2 )?


Given DFAs M1 and M2 , we wish to see if M2 accepts all strings that M1
accepts, and possibly more. Once we recall that A − B = A ∩ (Σ∗ − B) and that A ⊆ B iff A − B = ∅, we can re-use our existing decision algorithms:
1. Build M2′ by complementing M2 (switch accept and non-accept states)
2. Run the decision algorithm for emptiness of language intersection
on M1 and M2′ , returning its answer.

5.6.6 L(M1 ) = L(M2 )?
• One algorithm directly uses the fact S1 = S2 iff S1 ⊆ S2 ∧ S2 ⊆ S1 .

• Another decision algorithm would be to minimize M1 and M2 and


test to see if the minimized machines are equal. Notice that we
haven’t yet said how this should be done. It is not quite trivial: we
have to compare M1 = (Q, Σ, δ, q0 , F ) with M2 = (Q′ , Σ′ , δ ′ , q0′ , F ′ ).
The main problem here is that the states may have been named dif-
ferently, e.g., q0 in M1 may be A in M2 . Therefore, we can’t just test if
the sets Q and Q′ are identical. Instead, we have to check if there is a
way of renaming the states of one machine into the other so that the
machines become identical. We won’t go into the details, which are
straightforward but tedious.
Therefore, we would be checking that the minimized machines are
equal ‘up to renaming of states’ (another equivalence relation).

5.6.7 Is L(M) finite?


Does DFA M accept only a finite number of strings? This decision problem
seems more difficult than the others. We obviously can’t blindly generate
all strings, say in length-increasing order, and feed them to M: how could
we stop if M indeed did accept an infinite number of strings? We might see
arbitrarily long stretches of strings being rejected, but couldn’t be sure that
eventually a longer string might come along that got accepted. Decision
algorithms are not allowed to run for an infinitely long time before giving
an answer.
However there is a direct approach to this decision problem. Intu-
itively, the algorithm would directly check to see if M had a loop on a
path from q0 to any (reachable) accept state.
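One way to make that intuition precise: L(M) is infinite exactly when some state q is reachable from q0, lies on a cycle, and can still reach an accept state, because such a q lets us go around the loop as often as we like. A sketch of that check (our own formulation, under the usual dictionary representation):

def successors(delta, q):
    return {r for (p, _), r in delta.items() if p == q}

def reach_from(delta, start):
    # States reachable from any state in 'start' (including those states themselves).
    R, todo = set(start), list(start)
    while todo:
        q = todo.pop()
        for r in successors(delta, q):
            if r not in R:
                R.add(r)
                todo.append(r)
    return R

def is_finite(delta, q0, F):
    # 'live' states: reachable from q0 and able to reach an accept state.
    live = {q for q in reach_from(delta, {q0}) if reach_from(delta, {q}) & F}
    for q in live:
        # q lies on a cycle iff q is reachable from one of its own successors.
        if any(q in reach_from(delta, {r}) for r in successors(delta, q)):
            return False           # a pumpable loop exists, so L(M) is infinite
    return True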

5.6.8 Does M have as few states as possible?


Given DFA M, is there a machine M ′ such that L(M) = L(M ′ ) and there is
no machine recognizing L(M) in fewer states than M ′ ? This is solved by
running the state minimization algorithm on M.

Chapter 6

The Chomsky Hierarchy

So far, we have not yet tied together the 3 different components of the
course. What are the relationships between Regular, Context-Free, Decid-
able, Recognizable, and not-even-Recognizable languages?
It turns out that there is a (proper) inclusion hierarchy, known as the
Chomsky hierarchy:

Regular ⊂ CFL ⊂ Decidable ⊂ Recognizable ⊂ . . .

In other words, a language recognized by a DFA can always be gen-


erated by a CFG, but there are CFLs that no DFA can recognize. Every
language generated by a CFG can be decided by a Turing machine, or a
Register machine, but there are languages decided by TMs that cannot be
generated by any CFG. Moreover, the Halting Problem is a problem that
is not decidable, but is recognizable; and the complement of the Halting
Problem is not even recognizable.
In order to show that a particular language L is context-free but not
regular, one would write down a CFG for L and also show that L could
not possibly be regular. However, proving negative statements such as
this can be difficult: in order to show that a language is regular, we need
merely display an automaton or regular expression; conversely, to show
that a language is not regular would (naively) seem to require examining
all automata to check that each one fails to recognize at least one string in
the language. But there is a better way!

6.1 The Pumping Lemma for Regular Languages
The pumping lemma provides one way out of this problem. It exposes a
property, pumpability, that all regular sets have.
Theorem 19 (Pumping lemma for regular languages). Suppose that M = (Q, Σ, δ, q0, F) is a DFA recognizing L. Let p be the number of states in Q, and s ∈ L be a string w0 · · · w_{n−1} of length n ≥ p. Then there exist x, y, and z such that s = xyz and
(a) xy^i z ∈ L, for all i ∈ N
(b) y ≠ ε (i.e., len(y) > 0)
(c) len(xy) ≤ p
Proof. Suppose M is a DFA with p states which recognizes L. Also suppose there is an s = w0 · · · w_{n−1} ∈ L where n ≥ p. Then the computation path for s,

    q0 −w0→ q1 −w1→ · · · −w_{n−1}→ qn,

visits n + 1 states (counted with repetition). Now n + 1 > p, so, by the Pigeon Hole Principle¹, there is a state, call it q, which occurs at least twice in the computation path. Let qj and qk be the first and second occurrences of q in the computation path. So we have

    q0 −w0→ · · · −w_{j−1}→ qj −wj→ · · · −w_{k−1}→ qk −wk→ · · · −w_{n−1}→ qn

Now we partition the path into three pieces, taking

    x = w0 · · · w_{j−1},   y = wj · · · w_{k−1},   z = wk · · · w_{n−1},

so that x drives M from q0 to qj, y loops from qj back to qk (the same state q), and z drives M from qk to qn.
We have thus used our assumptions to construct a partition of s into x, y, z.
Note that this works for any string in L with length not less than p. Now
we simply have to show that the remaining conditions hold:
(a) The sub-path from qj to qk moves from q to q, and thus constitutes a
loop. We may go around the loop 0, 1, or more times to generate
ever-larger strings, each of which is accepted by M and is thus in L.
¹ The Pigeon Hole Principle is informally stated as: given n + 1 pigeons and n boxes, any assignment of pigeons to boxes must result in at least one box having at least 2 pigeons.

(b) This is clear, since qj and qk are separated by at least one label (note that j < k).

(c) If len(xy) > p, we could re-apply the pigeon-hole principle to obtain a state that repeats earlier than q, but that was ruled out by how we chose q.

Criterion (a) allows one to pump sufficiently long strings arbitrarily often, and thus gives us insight into the nature of regular languages. However, it is the application of the pumping lemma to proofs of non-regularity of languages that is of most interest to us.

6.1.1 Applying the pumping lemma


The use of the pumping lemma to prove non-regularity can be schema-
tized as follows. Suppose you are to prove a statement of the following
form: Language L is not regular. The standard proof, in outline, is as fol-
lows:

1. Towards a contradiction, assume that L is regular. That means that there exists a DFA M that recognizes L. This is boilerplate.

2. Let p be the number of states in M. Note that p > 0. This is boilerplate.

3. Supply a string s ∈ L of length n ≥ p. Creativity Required! Typically, s is phrased in terms of p.

4. Show that pumping s leads to a contradiction no matter how s is partitioned into x, y, z. In other words, find some i such that xy^i z ∉ L. This would contradict (a). One typically uses constraints (b) and (c) in the proof as well. Creativity is of course also required in this phase of the proof.

5. Shout ‘Contradiction’ and claim victory.

Example 129. The following language L is not regular:

{0^n 1^n | n ≥ 0}.

Proof. Suppose the contrary, i.e., that L is regular. Then there’s a DFA M
that recognizes L. Let p be the number of states in M.
Crucial Creative Step: Let s = 0^p 1^p.
Now, s ∈ L and len(s) ≥ p. Thus, the hypotheses of the pumping lemma hold, and we are given a partition of s into x, y, and z such that s = xyz and

(a) xy^i z ∈ L, for all i ∈ N

(b) y ≠ ε (i.e., len(y) > 0)

(c) len(xy) ≤ p

all hold. Consider the string xz. By (a), xz = xy^0 z ∈ L. By (c), xy is composed only of zeros, and so, by (b), y is a non-empty string of zeros. So xz has fewer than p zeros, but still has p ones. Thus there is no way to express xz as 0^k 1^k for any k. So xz ∉ L. Contradiction.
Here’s a picture of the situation:

    s = 000···0 | 00···0 | 0···0 111···11
          x         y           z

Notice that x, y, and z are abstract; we really don’t know anything about them other than what we can infer by application of constraints (a)–(c). We have x = 0^u and y = 0^v (v ≠ 0) and z = 0^w 1^p. We know that u + v + w = p, but we also know that u + w < p, so we know xz = 0^{u+w} 1^p ∉ L.
There’s always huge confusion with the pumping lemma. Here’s a
slightly alternative view—the pumping lemma protocol—on how to use it to
prove a language is not regular. Suppose there’s an office O to support
pumping lemma proofs.
1. To start the protocol, you inform O that L is regular.

2. After some time, O responds with a p, which you know is greater than zero.

3. You then think about L and invent a witness s. You send s off to O,
along with some evidence (proofs) that s ∈ L and len(s) ≥ p. Often
this is very easy to see.

4. O checks your proofs. Then it divides s into 3 pieces x, y, and z, but
it doesn’t send them to you. Instead O gives you permission to use (a),
(b), and (c).
5. You don’t know what x, y, and z are, but you can use (a), (b) and
(c), plus your knowledge of s to deduce facts. After some ingenious
steps, you find a contradiction, and send the proof of it off to O.
6. O checks the proof and, if it is OK, sends you a final message con-
firming that L is not regular after all.
Example 130. The following language L is not regular:
{w | w has an equal number of 0s and 1s} .
Proof. Towards a contradiction, suppose L is regular. Then there’s a DFA M that recognizes L. Let p be the number of states in M. Let s = 0^p 1^p. Now, s ∈ L and len(s) ≥ p, so we know that s = xyz, for some x, y, and z. We also know (a), (b), and (c) from the statement of the pumping lemma. By (c), xy is composed only of 0s. By (b) and (c), xz = 0^k 1^p with k < p; thus xz ∉ L. However, by (a), xz = xy^0 z ∈ L. Contradiction.
So why did we choose 0p 1p for s? Why not (01)p , for example? The
answer comes from recognizing that, when s is split into x, y, and z, we
have no control over how the split is made. Thus y can be any non-empty
string of length ≤ p. So if s = 0101 . . . 0101, then y could be 01. In that case,
repeated pumping will only ever lead to strings still in L and we will not
be able to obtain our desired contradiction.
Upshot. s has to be chosen such that pumping it (adding in copies of y)
will lead to a string not in L. Note that we can pump down, by adding in 0
copies of y, as we have done in the last two proofs.

Example 131. The following language L is not regular:


{0^i 1^j | i > j}.

Proof. Towards a contradiction, suppose L is regular. Then there’s a DFA M that recognizes L. Let p > 0 be the number of states in M. Let s = 0^{p+1} 1^p. Now, s ∈ L and len(s) ≥ p, so we know that s = xyz, for some x, y, and z. We also know (a), (b), and (c) from the statement of the pumping lemma. By (c), xy is composed only of 0s. By (b) and (c), xz has k ≤ p 0s and has p 1s, so xz ∉ L. However, by (a), xz = xy^0 z ∈ L. Contradiction.

Example 132. The following language L is not regular:
{ww | w ∈ {0, 1}∗ } .
Proof. Towards a contradiction, suppose L is regular. Then there’s a DFA M that recognizes L. Let p > 0 be the number of states in M. Let s = 0^p 1 0^p 1. Now, s ∈ L and len(s) ≥ p, so we know that s = xyz, for some x, y, and z. We also know (a), (b), and (c) from the statement of the pumping lemma. By (c), xy is composed only of 0s. By (b), xz = 0^k 1 0^p 1 where k < p, so xz ∉ L (if xz were of the form vv, each copy of v would contain exactly one 1, forcing the two blocks of 0s to have equal length, i.e., k = p). However, by (a), xz = xy^0 z ∈ L. Contradiction.
Here’s an example where pumping up is used.
Example 133. The following language L is not regular:
    {1^{n²} | n ≥ 0}.

Proof. This language is the set of all strings of 1s with length a square number. Towards a contradiction, suppose L is regular. Then there’s a DFA M that recognizes L. Let p > 0 be the number of states in M. Let s = 1^{p²}. This is the only natural choice; now let’s see if it works! Now, s ∈ L and len(s) ≥ p, so we have s = xyz, for some x, y, and z. We also know (a), (b), and (c) from the statement of the pumping lemma. Now we know that 1^{p²} = xyz. Let i = len(x), j = len(y) and k = len(z). Then i + j + k = p². Also, len(xyyz) = i + 2j + k = p² + j. However, (b) and (c) imply that 0 < j ≤ p. Now the next element of L larger than 1^{p²} must be 1^{(p+1)²} = 1^{p²+2p+1}, but p² < p² + j < p² + 2p + 1, so xyyz ∉ L. Contradiction.
And another.
Example 134. Show that L = {0^i 1^j 0^k | k > i + j} is not regular.
Proof. Towards a contradiction, suppose L is regular. Then there’s a DFA M that recognizes L. Let p > 0 be the number of states in M. Let s = 0^p 1^p 0^{2p+1}. Now, s ∈ L and len(s) ≥ p, so we know that s = xyz, for some x, y, and z. We also know (a), (b), and (c) from the statement of the pumping lemma. By (c) we know

    x = 0^a,   y = 0^b,   z = 0^c 1^p 0^{2p+1},   where a + b + c = p and, by (b), b > 0.

If we pump up, using (a) with i = p + 1, we obtain the string 0^a 0^{b(p+1)} 0^c 1^p 0^{2p+1}, which is an element of L, by (a). However, a + b(p + 1) + c + p ≥ 2p + 1, so this string cannot be in L. Contradiction.

6.1.2 Is L(M) finite?
Recall the problem of deciding whether a regular language is finite. The
ideas in the pumping lemma provide another way to provide an algorithm
solving this problem. The idea is, given DFA M, to try M on a finite set
of strings and then render a verdict. Recall that the pumping lemma says,
loosely, that every ‘sufficiently long string in L can be pumped’: if we
could find a sufficiently long string w that M accepts, then L(M) would be
infinite.
All we have to do is figure out what ‘sufficiently long’ should mean.
Two facts are important:

• For the pumping lemma to apply to a string w, it must be the case that len(w) ≥ p, where p is the number of states of M. This means that, in order to be sure that M has a path from the start state to an accept state and the path has a loop, we need to have M accept a string of length at least p.

• Now we need to figure out when to stop. We want to set an upper bound h on the length of the strings to be generated, so that we can reason as follows: if M accepts no string w where p ≤ len(w) < h then M accepts no string that can be pumped; if no strings can be pumped, then L(M) is finite, comprised solely of strings accepted by traversing loopless paths.
Now, the longest simple loop in a machine goes from a state back to itself, passing through all the other states. In the worst case for a machine with p states, it will take a string of length p − 1 to reach an accept state, plus another p symbols in order to see whether that state gets revisited. Thus our upper bound is h = 2p.

The decision algorithm generates the (finite) set of strings having length
at least p and at most 2p − 1 and tests to see if M accepts any of them. If it
does, then L(M) is infinite; otherwise, it is finite.
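A brute-force sketch of this procedure follows (same assumed DFA representation as before). It is far less efficient than the graph search of Section 5.6.7, since it tries exponentially many strings, but it follows the bound p ≤ len(w) ≤ 2p − 1 directly.

    from itertools import product

    def accepts(m, w):
        states, sigma, delta, start, accepting = m
        q = start
        for a in w:
            q = delta[(q, a)]
        return q in accepting

    def is_finite_by_pumping(m):
        """L(M) is infinite iff M accepts some w with p <= len(w) <= 2p - 1."""
        states, sigma, _, _, _ = m
        p = len(states)
        for n in range(p, 2 * p):
            for w in product(sorted(sigma), repeat=n):
                if accepts(m, w):
                    return False      # w can be pumped, so L(M) is infinite
        return True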

6.2 The Pumping Lemma for Context-Free Languages
As for the regular languages, the context-free languages admit a pump-
ing lemma which illustrates an interesting way in which every context-free
language has a precise notion of repetition in its elements. For regular lan-
guages, the important idea in the proof was an application of the Pigeon
Hole Principle in order to show that once an automaton M made n + 1
transitions (where n was the number of states of M) it would have to visit
some state twice. If it could visit twice, it could visit any number of times.
Thus we could pump any sufficiently long string in order to get longer and
longer strings, all in the language.
The same sort of argument, suitably adapted, can be applied to context-
free languages. If a sufficiently long string is generated by a grammar G,
then some rule in G has to be applied at least twice, by appeal to the PHP.
Therefore the rule can be repeatedly applied in order to pump the string.

Theorem 20 (Pumping lemma for context-free languages). Let L be a context-free language. Then there exists p > 0 such that, if s ∈ L and len(s) ≥ p, then there exist u, v, x, y, z such that s = uvxyz and

• len(vy) > 0

• len(vxy) ≤ p

• ∀i ≥ 0. uv^i xy^i z ∈ L

Proof. (The following is the beginning of a sketch, to be properly filled in later with more detail.)
Suppose that L is context-free. Then there’s a grammar G that gener-
ates L. From the size of the right-hand sides of rules in G, we can compute
the size of the smallest parse tree T that will require some rule V −→ rhs
to be used at least twice in the derivation. This size is used to compute the
minimum length p of string s ∈ L that will create a tree of that size. The
so-called pumping length is p. Consider the longest path through T : by
the Pigeon Hole Principle, it will have at least two (perhaps more) internal
nodes labelled with V . Considering the ”bottom-most” pair of such V s
leads to a decomposition of T as follows:

    [Parse tree figure: the root S, with two occurrences of the variable V nested on the longest path; the subtree at the upper V derives vxy, the subtree at the lower V derives x, and the rest of the tree derives u and z.]

This implies a decomposition of s into u, v, x, y, z. Now, the consequences can be established as follows:

• len(vy) > 0. This holds since T was chosen to be the smallest parse
tree satisfying the other constraints: if both v and y were ε, then the
resulting tree would be smaller than T.

• len(vxy) ≤ p. This holds since we chose the lowest two occurrences of V.

• ∀i ≥ 0. uv^i xy^i z ∈ L. This holds, since the tree rooted at the bottom-most occurrence of V can be replaced by the tree rooted at the next-higher-up occurrence of V. And so on, repeatedly.

Recall that the main application of the pumping lemma for regular lan-
guages was to show that various languages were not regular, by contradic-
tion. The same is true for the context-free languages. However, the details
of the proofs are more complex, as we shall see. We will go through one
proof in full detail and then see how—sometimes—much of the complex-
ity can be avoided.

Example 135. The language L = {a^n b^n c^n | n ≥ 0} is not context-free.


Proof. (As for regular languages, there is a certain amount of boilerplate at
the beginning of the proofs.) Assume L is a context-free language. Then there exists a pumping length p > 0. Consider s = a^p b^p c^p. (This is the first creative bit.) Evidently, s ∈ L, and len(s) ≥ p. Therefore, there exist u, v, x, y, z such that s = uvxyz and the following hold:

1. len(vy) > 0
2. len(vxy) ≤ p
3. ∀i ≥ 0. uv^i xy^i z ∈ L
Now we consider where vxy can occur in s. Pumping lemma proofs for
context-free languages are all about case analysis. Here we have a number
of cases (some of which have sub-cases): vxy can occur
• completely within the leading a^p symbols.
• completely within the middle b^p symbols.
• completely within the trailing c^p symbols.
• partly in the a^p and partly in the b^p.
• partly in the b^p and partly in the c^p.
What cannot happen is for vxy to start with some a symbols, span all p
b symbols, and finish with some c symbols: clause (2) above prohibits this.
Now, if vxy occurs completely within the leading a^p symbols, then pumping up once yields the string s′ = uv²xy²z = a^q b^p c^p, where p < q, by (1). Thus s′ ∉ L, contradicting (3).
Similarly, if vxy occurs completely within the middle b^p symbols, pumping up once yields the string s′ = a^p b^q c^p, where p < q. Contradiction. Now,
of course, it can easily be seen that a very similar proof handles the case
where vxy occurs completely within the trailing cp symbols. We are now
left with the hybrid cases, where vxy spans two kinds of symbol. These
need further examination.
Suppose vxy occurs partly in the a^p and partly in the b^p. Thus, at some
point, vxy changes from a symbols to b symbols. The change-over can
happen in v, x, or in y:
• in v. Then we’ve deduced that the split of s looks like

    u = a^i,   v = a^j b^k,   x = b^ℓ,   y = b^m,   z = b^n c^p

  (so s = a^i · a^j b^k · b^ℓ · b^m · b^n c^p). If we now pump up, we obtain

    s′ = a^i · (a^j b^k)(a^j b^k) · b^ℓ · (b^m)(b^m) · b^n c^p

  which can’t be an element of L, since we have a MIXED-UP JUMBLE a^j b^k a^j b^k where we pumped v. On the other hand, if we pump down, there is no jumble; we obtain

    s′ = a^i b^ℓ b^n c^p

  which, however, is also not a member of L, since i < p or ℓ + n < p, by (1).
• in x. Thus the split of s looks like

    u = a^i,   v = a^j,   x = a^k b^ℓ,   y = b^m,   z = b^n c^p.

  In this situation, neither v nor y features a change in symbol, so pumping up will not result in a JUMBLE. But, by pumping up we obtain

    s′ = a^i a^{2j} a^k b^ℓ b^{2m} b^n c^p

  Now, we know i + j + k = p and ℓ + m + n = p, therefore

    i + 2j + k > p ∨ ℓ + 2m + n > p

  hence s′ ∉ L. Contradiction. (Pumping down also leads to a contradiction.)

• in y. This case is very similar to the case where the change-over happens in v. We have

    u = a^i,   v = a^j,   x = a^k,   y = a^ℓ b^m,   z = b^n c^p

  and pumping up leads to a JUMBLE, while pumping down leads to s′ = a^{i+k} b^n c^p, where i + k < p or n < p, thus contradiction.
Are we done yet? No! We still have to consider what happens when
vxy occurs partly in the bp and partly in the cp . Yuck! Let’s review the
skeleton of the proof: vxy can occur

• completely within the leading a^p symbols, or completely within the middle b^p symbols, or completely within the trailing c^p symbols. These are all complete, and were easy.

• partly in the a^p and partly in the b^p. This has just been completed. A subsidiary case analysis on where the change-over from a to b happens was needed: in v, in x, or in y.

• partly in the b^p and partly in the c^p. Not done, but requires case analysis on where the change-over from b to c happens: in v, in x, or in y. With minor changes, the arguments we gave for the previous case will establish this case, so we won’t go through them.

Now that we have seen a fully detailed case analysis of the problem, it
is worth considering whether there is a shorter proof. All that case analysis
was pretty tedious! A different approach, which is sometimes a bit simpler
for some (not all) pumping lemma proofs, is to use zones. Let’s try the
example again.
Example 136 (Repeated). The language L = {a^n b^n c^n | n ≥ 0} is not context-free.
Proof. Let the same boilerplate and witness be given. Thus we have the
same facts at our disposal, but will make a different case analysis in the
proof. Notice that vxy can occur either in

• zone A — the prefix a^p b^p of s:

  In this case, if we pump up (to get s′), we will add a non-zero number of a and/or b symbols to zone A. Thus count(s′, a) + count(s′, b) > 2 ∗ count(s′, c), which implies that s′ ∉ L. Contradiction.
• or zone B — the suffix b^p c^p of s:

  If we pump up (to get s′), we will add a non-zero number of b and/or c symbols to zone B. Thus 2 ∗ count(s′, a) < count(s′, b) + count(s′, c), which implies that s′ ∉ L. Contradiction.

Remark. The argument for zone A uses the following obvious lemma, which
we will spell out for completeness.
    count(w, a) + count(w, b) > 2 ∗ count(w, c) ⇒ w ∉ {a^n b^n c^n | n ≥ 0}

Proof. Assume count(w, a) + count(w, b) > 2 ∗ count(w, c). Towards a contradiction, assume w = a^n b^n c^n, for some n ≥ 0. Thus count(w, a) = n = count(w, b) = count(w, c), so count(w, a) + count(w, b) = 2 ∗ count(w, c). Contradiction.
The corresponding lemma for zone B is similar.
The new proof using zones is quite a bit shorter. The reason for this
is that it condensed all the similar case analyses of the first proof into just
two cases. Notice that the JUMBLE cases still arise, but don’t need to be
explicitly addressed, since we rely on the lemmas about counting, which
hold whether or not the strings are jumbled.
Example 137. L = {ww | w ∈ {0, 1}∗ } is not a context-free language.
Proof. Assume L is a context-free language. Then there exists a pumping
length p > 0. Consider s = 0^p 1^p 0^p 1^p; thus s ∈ L, and len(s) ≥ p. Therefore, there exist u, v, x, y, z such that s = uvxyz and the following hold: (1) len(vy) > 0, (2) len(vxy) ≤ p, and (3) ∀i ≥ 0. uv^i xy^i z ∈ L. Notice that vxy can occur in zones A, B, or C:
• zone A — the leading 0^p 1^p of s (zone C being the trailing 0^p 1^p):

  In this case, if we pump down (to get s′), we will remove a non-zero number of 0 and/or 1 symbols from zone A, while zone C does not change. Thus s′ = 0^i 1^j 0^p 1^p, where i < p or j < p. Thus zone A in s′ becomes shorter than zone C. We still have to argue that s′ cannot be divided into two identical parts. Suppose it could be. (If len(s′) is odd this is immediate, so assume it is even.) In that case, the middle of s′ will be at position (i + j + 2p)/2 > i + j, giving

    s′ = 0^i 1^j 0^k | 0^ℓ 1^p

  where i + j + k = ℓ + p and 0 < k. But then the first half ends in 0 while the second half ends in 1, so the halves differ and s′ ∉ L. Contradiction.
  The argument for zone C is similar.

• zone B — the middle 1^p 0^p of s:

  Pumping down yields s′ = 0^p 1^i 0^j 1^p where i + j < 2p. Now s′ can only be of the form ww if i = p and j = p; but i + j < 2p rules this out, so s′ ∉ L. Contradiction.

• zone C: this is quite similar to the situation in zone A.

Example 138. L = {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not context-free.


Proof. Assume L is a context-free language. Then there exists a pumping length p > 0. Consider s = a^p b^p c^p. Thus s ∈ L, and len(s) ≥ p. Therefore, there exist u, v, x, y, z such that s = uvxyz and the following hold: (1) len(vy) > 0, (2) len(vxy) ≤ p, and (3) ∀i ≥ 0. uv^i xy^i z ∈ L. Notice that vxy can occur in zones A or B:

• zone A — the prefix a^p b^p of s:

  Here we have to pump up: pumping down could preserve the inequality. Pumping up yields s′ = a^i b^j c^p, where i > p ∨ j > p. In either case, s′ ∉ L. Contradiction.

• zone B — the suffix b^p c^p of s:

  Here we pump down (pumping up could preserve the inequality). Thus s′ = a^p b^i c^j, where i < p ∨ j < p. In either case, s′ ∉ L. Contradiction.

Now here’s an application of the pumping lemma featuring a language with only a single symbol in its alphabet. In such a situation, doing a case analysis on where vxy occurs doesn’t help; instead, we have to reason about the relative lengths of strings.

Example 139. L = {a^{n²} | n ≥ 0} is not context-free.

Proof. Assume L is a context-free language. Then there exists a pumping length p > 0. Consider s = a^{p²}. Thus s ∈ L, and len(s) ≥ p. Therefore, there exist u, v, x, y, z such that s = uvxyz and the following hold: (1) len(vy) > 0, (2) len(vxy) ≤ p, and (3) ∀i ≥ 0. uv^i xy^i z ∈ L.
By (1) and (2), we know 0 < len(vy) ≤ p. Thus if we pump up once, we obtain a string s′ = a^n, where p² < n ≤ p² + p. Now consider L. The next² element of L after s must be of length (p + 1)², i.e., of length p² + 2p + 1. Since

    p² < n ≤ p² + p < p² + 2p + 1

we conclude s′ ∉ L. Contradiction.

² L is a set, of course, and so has no notion of ‘next’; however, for every element x of L, there’s an element y ∈ L such that len(y) > len(x) and y is the shortest element of L longer than x. Thus y would be the next element of L after x.

Chapter 7

Further Topics

7.1 Regular Languages


The subject of regular languages and related concepts, such as finite state
machines, although established a long time ago, is still vibrant and influ-
ential. We have really only touched on the tip of the iceberg! In the follow-
ing sections, we explore a few other related notions and applications.

7.1.1 Extended Regular Expressions


We have emphasized that regular languages are generated by regular ex-
pressions and accepted by machines. However, there was an asymmetry
in our presentation: there are, seemingly, fewer operations on regular ex-
pressions than on machines. For example, to prove that regular languages
are closed under complement, one usually thinks of a language as being
represented by a DFA M, and the complement of the language is the lan-
guage of a machine obtained by swapping accept and non-accept states of
M. Is there something analogous from the perspective of regular expres-
sions, i.e., given a regular expression that generates a language, is there
a regular expression that generates the complement of the language? We
know that the answer is affirmative, but the typical route uses machines:
the regular expression is translated to an NFA, the subset construction
takes the NFA to an equivalent DFA, the DFA has its accept/non-accept
states swapped, and then Kleene’s construction is used to map the com-
plemented DFA to an equivalent regular expression. How messy!

What happens if we avoid machines and stipulate a direct mapping
from regular expressions to regular expressions? Is it possible? It turns
out that the answer is ”yes”. In the first few pages of Derivatives of Regular
Expressions,1 Brzozowski defines an augmented set of regular expressions,
and then introduces the idea of the derivative2 of a regular expression with
respect to a symbol of the alphabet. He goes on to give a recursive function
to compute the derivative and shows how to use it in regular expression
matching.
An extended regular expression adds intersection (∩) and complementation (written ¬r) to the set of regular expression operations. This allows any boolean operation on languages to be expressed.

Definition 41 (Syntax of extended regular expressions). The set of expressions R formed from alphabet Σ is the following:

• a ∈ R, if a ∈ Σ

• ε∈R

• ∅∈R

• ¬r ∈ R, if r ∈ R (new)

• r1 ∩ r2 ∈ R, if r1 ∈ R ∧ r2 ∈ R (new)

• r1 + r2 ∈ R, if r1 ∈ R ∧ r2 ∈ R

• r1 · r2 ∈ R, if r1 ∈ R ∧ r2 ∈ R

• r ∗ ∈ R, if r ∈ R

• Nothing else is in R
¹ Journal of the ACM, October 1964, pages 481 to 494.
² This is not the familiar notion from calculus, although it was so named because the algebraic equations are similar.

Definition 42 (Semantics of extended regular expressions). The meaning of an extended regular expression r, written L(r), is defined as follows:

    L(a) = {a}, for a ∈ Σ
    L(ε) = {ε}
    L(∅) = ∅
    L(¬r) = Σ∗ − L(r)    (new)
    L(r1 ∩ r2) = L(r1) ∩ L(r2)    (new)
    L(r1 + r2) = L(r1) ∪ L(r2)
    L(r1 · r2) = L(r1) · L(r2)
    L(r∗) = L(r)∗

Example 140. If we wanted to specify the language of all binary strings with no occurrences of 00 or 11, the usual regular expressions could express this as follows:

(ε + 1)(01)∗ (ε + 0)

but it requires a few moments thought to make sure that this is a correct
regular expression for the language. However, the following extended
regular expression
¬(Σ∗ (00 + 11)Σ∗ )
for the language is immediately understandable.
Example 141. The following extended regular expression generates the
language of all binary strings with at least two consecutive zeros and not
ending in 01.
(Σ∗ 00Σ∗ ) ∩ ¬(Σ∗ 01)
One might think that this can be expressed just as simply with ordinary
regular expressions: something like

Σ∗ 00Σ∗ (10 + 11 + 00 + 0)

seems verbose but promising. However, it doesn’t work, since it doesn’t generate the string 00. Adding an ε at the end

Σ∗ 00Σ∗ (10 + 11 + 00 + 0 + ε)

also doesn’t work: it allows 001, for example.

Example 142. The following extended regular expression generates the
language of all strings with at least three consecutive ones and not ending
in 01 or consisting of all ones.

(Σ∗ 111Σ∗ ) ∩ ¬(Σ∗ 01 + 11∗ )

Derivatives of extended regular expressions


In order to compute the derivative, it is necessary to be able to compute
whether a regular expression r generates the empty string, i.e., whether
ε ∈ L(r). If r has this property, then r is said to be nullable. This is easy to
compute recursively:

Definition 43 (Nullable).

nullable(a) = false if a ∈ Σ
nullable(ε) = true
nullable(∅) = false
nullable(¬r) = ¬(nullable(r))
nullable(r1 ∩ r2 ) = nullable(r1 ) ∧ nullable(r2 )
nullable(r1 + r2 ) = nullable(r1 ) ∨ nullable(r2 )
nullable(r1 · r2 ) = nullable(r1 ) ∧ nullable(r2 )
nullable(r ∗ ) = true
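The definition of nullable transcribes directly into code. The sketch below fixes one possible abstract syntax for extended regular expressions — the constructor names are ours, since the notes fix no concrete representation — and implements Definition 43 over it.

    from collections import namedtuple

    Sym   = namedtuple('Sym', 'a')       # a single alphabet symbol
    Eps   = namedtuple('Eps', '')        # epsilon
    Empty = namedtuple('Empty', '')      # the empty-set regular expression
    Not   = namedtuple('Not', 'r')       # complement
    And   = namedtuple('And', 'r1 r2')   # intersection
    Or    = namedtuple('Or', 'r1 r2')    # union (+)
    Cat   = namedtuple('Cat', 'r1 r2')   # concatenation
    Star  = namedtuple('Star', 'r')      # Kleene star

    def nullable(r):
        """Does r generate the empty string?  (Definition 43.)"""
        if isinstance(r, Sym):   return False
        if isinstance(r, Eps):   return True
        if isinstance(r, Empty): return False
        if isinstance(r, Not):   return not nullable(r.r)
        if isinstance(r, And):   return nullable(r.r1) and nullable(r.r2)
        if isinstance(r, Or):    return nullable(r.r1) or nullable(r.r2)
        if isinstance(r, Cat):   return nullable(r.r1) and nullable(r.r2)
        if isinstance(r, Star):  return True
        raise TypeError(r)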

Definition 44 (Derivative). The derivative of r with respect to a string u is defined to be the set of strings w such that u · w is an element of L(r):

Derivative(u, r) = {w | u · w ∈ L(r)}

Thus the derivative is a language, i.e., a set of strings. An immediate consequence is the following:

Theorem 21.
w ∈ L(r) iff ε ∈ Derivative(w, r)

However, what we are after is an operation mapping regular expressions to regular expressions. An algorithm for computing the derivative for a single symbol is specified as follows.

Definition 45 (Derivative of a symbol). The derivative D(a, r) of a regular
expression r with respect to a symbol a ∈ Σ is defined by

D(a, ε) = ∅
D(a, ∅) = ∅
D(a, a) = ε
D(a, b) = ∅ if a 6= b
D(a, ¬r) = ¬(D(a, r))
D(a, r1 + r2 ) = D(a, r1 ) + D(a, r2 )
D(a, r1 ∩ r2 ) = D(a, r1 ) ∩ D(a, r2 )
D(a, r1 · r2 ) = (D(a, r1 ) · r2 ) + D(a, r2 ) if nullable(r1 )
D(a, r1 · r2 ) = D(a, r1 ) · r2 if ¬nullable(r1 )
D(a, r ∗ ) = D(a, r) · r ∗
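Continuing the sketch above (same constructors and the same nullable function), the single-symbol derivative is again a direct transcription of the defining equations.

    def D(a, r):
        """Derivative of r with respect to the symbol a (Definition 45)."""
        if isinstance(r, Eps):   return Empty()
        if isinstance(r, Empty): return Empty()
        if isinstance(r, Sym):   return Eps() if r.a == a else Empty()
        if isinstance(r, Not):   return Not(D(a, r.r))
        if isinstance(r, Or):    return Or(D(a, r.r1), D(a, r.r2))
        if isinstance(r, And):   return And(D(a, r.r1), D(a, r.r2))
        if isinstance(r, Cat):
            left = Cat(D(a, r.r1), r.r2)
            return Or(left, D(a, r.r2)) if nullable(r.r1) else left
        if isinstance(r, Star):  return Cat(D(a, r.r), r)
        raise TypeError(r)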

Consider r ′ = D(a, r). Intuitively, L(r ′) is the set of strings in L(r) from
which a leading a has been dropped. Formally,

Theorem 22.
w ∈ L(D(a, r)) iff a · w ∈ L(r)

The function D can be used to compute Derivative(u, r) by iteration on the symbols in u.

Definition 46 (Derivative of a string).

Der(ε, r) = r
Der(a · w, r) = Der(w, D(a, r))

The following theorem is then easy to show:

Theorem 23.
L(Der(w, r)) = Derivative(w, r)

Recall that the standard way to check if w ∈ L(r) requires the transla-
tion of r to a state machine, followed by running the state machine on w. In
contrast, the use of derivatives allows one to merely evaluate nullable(Der(w, r)),
i.e., to stay in the realm of regular expressions. However, this can be ineffi-
cient, since taking the derivative can substantially increase the size of the
regular expression.
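Building on the two fragments above, iterating D over the symbols of a string and then testing nullability gives a matcher that never leaves the realm of regular expressions.

    def Der(w, r):
        """Derivative with respect to a whole string (Definition 46)."""
        for a in w:
            r = D(a, r)
        return r

    def matches(w, r):
        """w is in L(r) iff Der(w, r) is nullable (Theorems 21-23)."""
        return nullable(Der(w, r))

    # For example, with r built from the constructors above as (0 + 1)*1:
    #   r = Cat(Star(Or(Sym('0'), Sym('1'))), Sym('1'))
    #   matches("01", r)   -->  True
    #   matches("10", r)   -->  False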

Generating automata from extended regular expressions
Instead, Brzozowski’s primary purpose in introducing derivatives was to
use them as a way of directly producing minimal DFAs from extended regular expressions. The process works as follows. Suppose Σ = {a1 , . . . , an }
and r is a regular expression. We think of r as representing the start state of
the desired DFA. Since the transition function δ of the DFA is total, the suc-
cessor states may be obtained by taking the derivatives D(a1 , r), . . . D(an , r).
This is repeated until no new states can be produced. Final states are just
those that are nullable. The resulting state machine accepts the language
generated by r. This is an amazingly elegant procedure, especially in com-
parison to the translation to automata. However, it depends on being able
to decide when two regular expressions have the same language (so that
seemingly different states can be equated, which is necessary for the pro-
cess to terminate).

Example 143. We will translate (0 + 1)∗1 to a DFA. To start, we assign state q0 to (0 + 1)∗1. Then we take the derivatives to get the successor states, and
build up the transition function δ along the way.
    D(0, (0 + 1)∗1) = D(0, (0 + 1)∗)·1 + D(0, 1)
                    = D(0, (0 + 1)∗)·1
                    = D(0, 0 + 1)·(0 + 1)∗1
                    = (D(0, 0) + D(0, 1))·(0 + 1)∗1
                    = (ε + ∅)·(0 + 1)∗1
                    = (0 + 1)∗1

So δ(q0 , 0) = q0 . What about δ(q0 , 1)?


    D(1, (0 + 1)∗1) = D(1, (0 + 1)∗)·1 + D(1, 1)
                    = D(1, (0 + 1)∗)·1 + ε
                    = D(1, 0 + 1)·(0 + 1)∗1 + ε
                    = (D(1, 0) + D(1, 1))·(0 + 1)∗1 + ε
                    = (∅ + ε)·(0 + 1)∗1 + ε
                    = (0 + 1)∗1 + ε

Since this regular expression is not equal to that associated with any
other state, we allocate a new state q1 = (0 + 1)∗ 1 + ε. Note that q1 is a
final state because its associated regular expression is nullable. We now compute the successors to q1:
    D(0, (0 + 1)∗1 + ε) = D(0, (0 + 1)∗1) + D(0, ε)
                        = (0 + 1)∗1 + ∅

So δ(q1 , 0) = q0 . Also
    D(1, (0 + 1)∗1 + ε) = D(1, (0 + 1)∗1) + D(1, ε)
                        = ((0 + 1)∗1 + ε) + ∅

So δ(q1, 1) = q1. There are no more states to consider, so the final, minimal, equivalent DFA is

    [DFA: start state q0 and accept state q1, with δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 1) = q1, δ(q1, 0) = q0.]

In the previous discussion we have assumed a ‘full’ equality test, i.e., one with the property r1 = r2 iff L(r1) = L(r2). If the algorithm uses
this test, the resulting DFA is guaranteed to be minimal. However, such
a test is computationally expensive. It is an interesting fact that we can
approximate the equality test and still obtain an equivalent DFA, which
may, however, not be minimal.
Let r1 ≈ r2 iff r1 and r2 are syntactically equal modulo the use of the
equalities
r + r = r
r1 + r2 = r2 + r1
(r1 + r2 ) + r3 = r1 + (r2 + r3 )
The state-generation procedure outlined above will still terminate with
≈ being used to implement the test for regular expression equality rather
than full equality. Also note that implementations take advantage of stan-
dard simplifications for regular expressions in order to keep the regular
expressions in a reduced form.
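The whole state-generation procedure can now be sketched, reusing D and nullable from the earlier fragments. The canon function below is one crude way to implement ≈: it flattens sums, removes duplicate summands and sorts them, which is what Brzozowski's result requires for the loop to terminate. Further standard simplifications (e.g., ∅ · r = ∅, ε · r = r) could be added to keep the states smaller; without them the resulting DFA is equivalent but, as noted above, not necessarily minimal.

    def canon(r):
        """Normalize + up to idempotence, commutativity and associativity."""
        if isinstance(r, Or):
            terms = set()
            def flatten(t):
                if isinstance(t, Or):
                    flatten(t.r1); flatten(t.r2)
                else:
                    terms.add(canon(t))
            flatten(r)
            ordered = sorted(terms, key=repr)
            out = ordered[0]
            for t in ordered[1:]:
                out = Or(out, t)
            return out
        if isinstance(r, And):  return And(canon(r.r1), canon(r.r2))
        if isinstance(r, Cat):  return Cat(canon(r.r1), canon(r.r2))
        if isinstance(r, Not):  return Not(canon(r.r))
        if isinstance(r, Star): return Star(canon(r.r))
        return r

    def brzozowski_dfa(r, sigma):
        """States are (canonicalized) derivatives of r; accept states are the
        nullable ones."""
        start = canon(r)
        states, delta, work = {start}, {}, [start]
        while work:
            q = work.pop()
            for a in sigma:
                q2 = canon(D(a, q))
                delta[(q, a)] = q2
                if q2 not in states:
                    states.add(q2)
                    work.append(q2)
        accepting = {q for q in states if nullable(q)}
        return states, sigma, delta, start, accepting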

7.1.2 How to Learn a DFA
7.1.3 From DFAs to regular expressions (Again)
[ The following subsection takes a traditional approach to the translation of DFAs
to regexps. In the body of the notes, I have instead used the friendlier (to the
instructor and the student) approach based on representing the automaton by
systems of equations and then iteratively using Arden’s lemma to solve for the
starting state. ]
The basic idea in translating an automaton M into an equivalent regu-
lar expression is to translate M into a regular expression through a series
of steps. Each step will preserve L(M). At each step we will drop a state
from the automaton, and in order to still recognize L(M), we will have
to ‘patch up’ the labels on the edges between the remaining states. The
technical device for accomplishing this is the so-called GNFA, which is
an NFA with arbitrary regular expressions labelling transitions. (You can
think of the intermediate automata in the just-seen regular expression-to-
NFA translation as being GNFA.)
We will look at a very simple example of the translation to aid our
intuition when thinking about the general case.

Example 144. Let the example automaton be given by the following dia-
gram

    [DFA: start state q0, accept state q1; q0 has an a-labelled self-loop and a b-labelled edge to q1; q1 has a self-loop labelled a, b.]

The first step is to add a new start state and a single new final state,
connected to the initial automaton by ε-transitions. Also, multiple edges
from a source to a target are agglomerated into one, by joining the labels
via a + operation.
    [GNFA: s −ε→ q0; q0 has an a-labelled self-loop and a b-labelled edge to q1; q1 has a self-loop labelled a + b; q1 −ε→ f.]
Now we iteratively delete nodes. It doesn’t matter in which order we
delete them—the language will remain the same—although pragmatically,

the right choice of node to delete can make the work much simpler.3 Let’s
delete q1 . Now we have to patch the hole left. In order to still accept the
same set of strings, we have to account for the b label, the a + b label on
the self-loop of q1 , and the ε label leading from q1 to f . Thus the following
automaton:
    [GNFA: s −ε→ q0; q0 has an a-labelled self-loop; q0 −b(a + b)∗ ε→ f.]
Similarly, deleting q0 yields the final automaton:

    s −ε a∗ b(a + b)∗ ε→ f

which by standard identities is

    s −a∗ b(a + b)∗→ f

Constructing a GNFA
To make an initial GNFA (call it GNFA0 ) from an NFA N = (Q, Σ, δ, q0 , F )
requires the following steps:

1. Make a new start state with an ε-transition to q0 .

2. Make a new final state with ε-transitions from all the states in F . The
states in F are no longer considered to be final states in GNFA0 .

3. Add edges to N to ensure that every state qj in Q has the shape

    qi −r1→ qj −r3→ qk,   with a self-loop labelled r2 on qj and a direct edge qi −r4→ qk.

³ The advice of the experts is to delete the node which ‘disconnects’ the automaton as much as possible.

To achieve this may require adding in lots of weird new edges. In par-
ticular, a GNFA must have the following special form:

• The new start state must have arrows going to every other state (but
no arrows coming in to it).

• The new final state must have arrows coming into it from every other
state (but no arrows going out of it).

• For all other states (namely all those in Q) there must be a single ar-
row to every other state, plus a self loop. In order to agglomerate
multiple edges from the same source to a target, we make a ‘sum’ of
all the labels.

Note that if a transition didn’t exist between two states in N, one would
have to be created. For this purpose, such an edge would be labelled with
∅, which fulfills the syntactic requirement without actually enabling any
new behaviour by the machines (since transitions labelled with ∅ can never
be followed). Thus, our simple example (the two-state automaton from Example 144) has the following form as a GNFA:

    [GNFA: s −ε→ q0, s −∅→ q1, s −∅→ f; q0 with an a-labelled self-loop, q0 −b→ q1, q0 −∅→ f; q1 with an (a + b)-labelled self-loop, q1 −∅→ q0, q1 −ε→ f.]

(In order to avoid too many superfluous ∅-transitions, we will often omit them from our GNFAs, with the understanding that they are still
there, lurking just out of sight.) Now we can describe the step that is taken
each time a state is eliminated when passing from GNFAi to GNFAi+1 . To
eliminate state qj in

    qi −r1→ qj −r3→ qk,   with self-loop r2 on qj and direct edge qi −r4→ qk,

we replace it by

    qi −(r1 r2∗ r3 + r4)→ qk

A clue why this ‘works’ is obtained by considering an arbitrary string w accepted by GNFAi and showing it is still accepted by GNFAi+1. Con-
sider the sequence of states traversed in an accepting run of the automaton
GNFAi . Either qj appears in it or it doesn’t. If it appears, then the portion
of w processed while passing through qj is evidently matched by the regu-
lar expression r1 r2 ∗ r3 . On the other hand, if qj does not appear in the state
sequence, that means that the ‘bypass’ from qi to qk has been taken (since
all states have transitions among themselves). In that case r4 will match.
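Here is a small sketch of one elimination step in code. The representation — a dictionary from (source, target) pairs to label strings, with None standing for an ∅ label and "eps" for ε — is an assumption made for the example. Applying rip repeatedly until only the start and final states remain leaves the answer as the label on the (s, f) edge.

    def plus(r1, r2):              # r1 + r2, where None plays the role of the empty-set label
        if r1 is None: return r2
        if r2 is None: return r1
        return f"({r1}+{r2})"

    def cat(r1, r2):               # r1 . r2 ; the empty set absorbs concatenation
        if r1 is None or r2 is None: return None
        if r1 == "eps": return r2
        if r2 == "eps": return r1
        return f"{r1}{r2}"

    def star(r):                   # the star of the empty set is epsilon
        return "eps" if r is None else f"({r})*"

    def rip(edges, states, qj):
        """Remove state qj, patching every remaining pair qi, qk with the
        label r1 r2* r3 + r4, exactly as in the picture above."""
        r2 = edges.get((qj, qj))
        new_edges = {}
        for qi in states:
            for qk in states:
                if qj in (qi, qk):
                    continue
                r1, r3, r4 = edges.get((qi, qj)), edges.get((qj, qk)), edges.get((qi, qk))
                new_edges[(qi, qk)] = plus(cat(cat(r1, star(r2)), r3), r4)
        return new_edges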

Example 145. Give an equivalent regular expression for the following DFA:

    [DFA: states q0 (start), q1, q2, with q1 and q2 accepting; δ(q0, a) = q1, δ(q0, b) = q2, δ(q1, a) = q0, δ(q1, b) = q1, δ(q2, a) = q1, δ(q2, b) = q0.]

The initial GNFA is (∅-transitions have not been drawn):

    [GNFA0: s −ε→ q0; q0 −a→ q1, q0 −b→ q2; q1 −b→ q1 (self-loop), q1 −a→ q0, q1 −ε→ f; q2 −a→ q1, q2 −b→ q0, q2 −ε→ f.]

The ‘set-up’ of the initial GNFA means that, for any state qj , except s
and f , the following pattern holds:

    qi −r1→ qj −r3→ qk,   with self-loop r2 on qj and direct edge qi −r4→ qk

In other words, for any such qj there is guaranteed to be at least one qi and qk such that one step, labelled r1, moves from qi to qj, and one step, labelled r3, moves from qj to qk. In our example, the required pattern holds for q0, in the following form:

    s −ε→ q0 −a→ q1

However, we have to consider all such patterns, i.e., all pairs of states
that q0 lies between. There are surprisingly many (eight more, in all):

1. s −ε→ q0 −b→ q2

2. s −ε→ q0 −∅→ f

3. q1 −a→ q0 −b→ q2

4. q1 −a→ q0 −a→ q1

5. q1 −a→ q0 −∅→ f

6. q2 −b→ q0 −a→ q1

7. q2 −b→ q0 −b→ q2

8. q2 −b→ q0 −∅→ f

Now we apply our rule to get the following new transitions, which
replace any old ones:

1. s −(ε∅∗a + ∅)→ q1, i.e., s −a→ q1

2. s −(ε∅∗b + ∅)→ q2, i.e., s −b→ q2

3. s −(ε∅∗∅ + ∅)→ f, i.e., s −∅→ f

4. q1 −(a∅∗b + ∅)→ q2, i.e., q1 −ab→ q2

5. q1 −(a∅∗a + b)→ q1, i.e., q1 −(aa + b)→ q1

6. q1 −(a∅∗∅ + ε)→ f, i.e., q1 −ε→ f

7. q2 −(b∅∗a + a)→ q1, i.e., q2 −(ba + a)→ q1

8. q2 −(b∅∗b + ∅)→ q2, i.e., q2 −bb→ q2

9. q2 −(b∅∗∅ + ε)→ f, i.e., q2 −ε→ f

Thus GNFA1 is:

    [GNFA1: s −a→ q1, s −b→ q2, s −∅→ f; q1 −(aa + b)→ q1 (self-loop), q1 −ab→ q2, q1 −ε→ f; q2 −(ba + a)→ q1, q2 −bb→ q2 (self-loop), q2 −ε→ f.]

Now let’s toss out q1 . We therefore have to consider the following cases:

• Eliminating q1 on the path s −a→ q1 −ab→ q2, with self-loop (aa + b) on q1 and existing edge s −b→ q2, gives

    s −(a(aa + b)∗ab + b)→ q2

• Eliminating q1 on the path s −a→ q1 −ε→ f, with existing edge s −∅→ f, gives

    s −(a(aa + b)∗ε + ∅)→ f,   i.e.,   s −a(aa + b)∗→ f

• Eliminating q1 on the path q2 −(ba + a)→ q1 −ab→ q2, with existing self-loop bb on q2, gives

    q2 −((ba + a)(aa + b)∗ab + bb)→ q2

• Eliminating q1 on the path q2 −(ba + a)→ q1 −ε→ f, with existing edge q2 −ε→ f, gives

    q2 −((ba + a)(aa + b)∗ + ε)→ f

Thus GNFA2 is:

    [GNFA2: s −a(aa + b)∗→ f; s −(a(aa + b)∗ab + b)→ q2; q2 −((ba + a)(aa + b)∗ab + bb)→ q2 (self-loop); q2 −((ba + a)(aa + b)∗ + ε)→ f.]

And finally GNFA3 is immediate: a single edge s −(r1 r2∗ r3 + r4)→ f, where

    r1 = a(aa + b)∗ab + b
    r2 = (ba + a)(aa + b)∗ab + bb
    r3 = (ba + a)(aa + b)∗ + ε
    r4 = a(aa + b)∗

so the equivalent regular expression is

    (a(aa + b)∗ab + b) ((ba + a)(aa + b)∗ab + bb)∗ ((ba + a)(aa + b)∗ + ε) + a(aa + b)∗

7.1.4 Summary
We have now seen a detailed example of translating a DFA to a regular
expression, the denotation of which is just the language accepted by the
DFA. The translations used to convert back and forth can be used to prove
the following important theorem.
